WCET analysis and optimization for multi-core real-time systems by Kelter, Timon
WCET Analysis and Optimization for Multi-Core
Real-Time Systems
Dissertation
submitted for the degree of
Doctor of Engineering Sciences
at Technische Universität Dortmund,
Faculty of Computer Science
by
Timon Kelter
Dortmund
2015
Date of the oral examination: March 12, 2015
Dean: Prof. Dr. Gernot Fink
Reviewers: Prof. Dr. Peter Marwedel
Prof. Dr. Isabelle Puaut


Acknowledgments
First and foremost I want to thank my advisor Prof. Dr. Peter Marwedel for pointing
my research in a rewarding direction from the start and for providing me
with the opportunity to work in this fascinating field of research in an inspiring,
international team. Without his continued support this thesis would not exist. I
would also like to thank Prof. Dr. Isabelle Puaut for her time and commitment in
reviewing this thesis.
The implementation of the WCET analyzer and the WCET optimizations which
are presented in this thesis would not have been possible without the previous work
of numerous colleagues at our chair over the course of more than one decade. I am
especially grateful for the provision of the WCC framework, on which the majority of
my practical work was built. In this context, I owe special thanks to Prof. Dr. Heiko
Falk for being one of the most thorough reviewers and advisors I have ever met, to
Dr. Paul Lokuciejewski for introducing me to this interesting ﬁeld of research, to Dr.
Sascha Plazar for being a fantastic office neighbor, to Jan Christopher Kleinsorge for
our mutual motivation to finish the PhD project and to all of them for the enjoyable
time in countless on- and off-topic discussions.
For proof-reading my drafts and papers, for helpful discussions and for being re-
ally good colleagues I would also like to thank Björn Bönninghoﬀ, Olaf Neugebauer,
Pascal Libuschewski, Chen-Wei Huang, Dr. Michael Engel, Helena Kotthaus, An-
dreas Heinig, Florian Schmoll and Dr. Daniel Cordes. On the implementation side,
my work was supported by Jan Körtner, Hendrik Borghorst, Tim Harde, Chris-
tian Günter and Henning Garus. Without their help the whole WCET analyzer
implementation would be in a diﬀerent shape now.
Furthermore, I am deeply thankful to the former WCET analysis team
at the National University of Singapore, most of all to Dr. Sudipta Chattopadhyay
and Prof. Dr. Abhik Roychoudhury, for giving me the opportunity to work on the
Chronos WCET analyzer. Without these first steps I might not have
found my way into the topic of WCET analysis.
What has kept me going in the last years was of course not only scientiﬁc progress
but also the support that I received from friends and family, who kept me grounded
when my thoughts were spinning around work issues. Many thanks to all of you –
you know who you are. In particular I owe my father a big debt of gratitude for
motivating me to pick up computer science as a profession and to ﬁnally strive for
the PhD.

Abstract
During the design of safety-critical real-time systems, developers must be able to
verify that a system shows a timely reaction to external events. To achieve this,
the Worst-Case Execution Time (WCET) of each task in such a system must be
determined. The WCET is used in the schedulability analysis in order to verify that
all tasks will meet their deadlines and to verify the overall timing of the system.
Unfortunately, the execution time of a task depends on the task’s input values,
the initial system state, the preemptions due to tasks executing on the same core
and on the interference due to tasks executing in parallel on other cores. These
dependencies render it close to impossible to cover every feasible timing behavior
in measurements. It is preferable to create a static analysis which determines the
WCET based on a safe mathematical model.
The static WCET analysis tools which are currently available are restricted to a
single task running uninterruptedly on a single-core system. There are also exten-
sions of these tools which can capture the eﬀects of multi-tasking, i.e., preemptions
by higher-priority tasks, on the WCET for certain well-deﬁned scenarios. These
tools are nowadays already used to verify industrial real-time software, e.g., in the
automotive and avionics domain. Up to now, there are no mature tools which can
handle the case of parallel tasks on a multi-core platform, where the tasks potentially
interfere with each other.
This dissertation presents multiple approaches towards a WCET analysis for
diﬀerent types of multi-core systems. They are based upon previous work on the
modeling of hardware and program behavior but extend it to the treatment of shared
resources like shared caches and shared buses. We present multiple methods of inte-
grating shared bus analysis into the classical WCET analysis framework and show
that time-triggered bus arbitration policies can be eﬃciently analyzed with high pre-
cision. In order to get precise WCET estimations for the case of shared caches, we
present an eﬃcient analysis of interactions in parallel systems which utilizes timing
information to cut down the search space. All of the analyses were implemented in
a research C compiler. Extensive evaluations on real-time benchmarks show that
they are up to 11.96 times more precise than previous approaches.
Finally, we present two compiler optimizations which are tailored towards the
optimization of the WCET of tasks in multi-core systems, namely an evolutionary
optimization of shared resource schedules and an instruction scheduling which uses
WCET analysis results to optimally place shared resource requests of individual
tasks. Experiments show that the two combined optimizations are able to achieve
an average WCET reduction of 33%.
During the course of this thesis, a complete WCET analysis framework was
developed which can be used for further work like the integration of multi-task and
multi-core-aware techniques into a single analyzer.

Publications
Parts of this thesis have been published in journals and the proceedings of the
following conferences and workshops (in chronological order):
Timon Kelter, Heiko Falk, Peter Marwedel, Sudipta Chattopadhyay, and Abhik Roychoud-
hury. “Bus-Aware Multicore WCET Analysis through TDMA Oﬀset Bounds”. In: Pro-
ceedings of the 23rd Euromicro Conference on Real-Time Systems (ECRTS). Porto,
Portugal, 07/2011, pp. 3–12.
Sudipta Chattopadhyay, Chong Lee Kee, Abhik Roychoudhury, Timon Kelter, Heiko Falk,
and Peter Marwedel. “A Uniﬁed WCET Analysis Framework for Multi-Core Platforms”.
In: IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS).
Beijing, China, 04/2012, pp. 99–108.
Timon Kelter, Tim Harde, Peter Marwedel, and Heiko Falk. “Evaluation of Resource Ar-
bitration Methods for Multi-Core Real-Time Systems”. In: Proceedings of the 13th In-
ternational Workshop on Worst-Case Execution Time Analysis (WCET). Ed. by Claire
Maiza. Paris, France, 07/2013.
Timon Kelter, Heiko Falk, Peter Marwedel, Sudipta Chattopadhyay, and Abhik Roychoudhury.
“Static Analysis of Multi-Core TDMA Resource Arbitration Delays”.
In: Real-Time Systems 50.2 (03/2014), pp. 185–229. issn: 0922-6443.
doi: 10.1007/s11241-013-9189-x. url: http://dx.doi.org/10.1007/s11241-013-9189-x.
Sudipta Chattopadhyay, Lee Kee Chong, Abhik Roychoudhury, Timon Kelter, Peter Marwedel,
and Heiko Falk. “A Unified WCET Analysis Framework for Multicore Platforms”.
In: ACM Transactions on Embedded Computing Systems 13.4s (04/2014), 124:1–124:29.
issn: 1539-9087. doi: 10.1145/2584654. url: http://doi.acm.org/10.1145/2584654.
Timon Kelter, Peter Marwedel, and Hendrik Borghorst. “WCET-aware Scheduling Opti-
mizations for Multi-Core Real-Time Systems”. In: International Conference on Em-
bedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS). Samos,
Greece, 07/2014.
Chen-Wei Huang, Timon Kelter, Bjoern Boenninghoﬀ, Jan Kleinsorge, Michael Engel, Peter
Marwedel, and Shiao-Li Tsao. “Static WCET Analysis of the H.264/AVC Decoder Ex-
ploiting Coding Information”. In: International Conference on Embedded and Real-Time
Computing Systems and Applications (RTCSA). IEEE. Chongqing, China, 08/2014.
Timon Kelter and Peter Marwedel. “Parallelism Analysis: Precise WCET Values for Com-
plex Multi-Core Systems”. In: Third International Workshop on Formal Techniques for
Safety-Critical Systems (FTSCS). Ed. by Cyrille Artho and Peter Ölveczky. Luxem-
bourg: Springer, 11/2014.

Contents
1 Introduction
1.1 Motivation
1.2 Contributions of this Work
1.3 Organization of the Thesis
1.4 Author’s Contribution to this Dissertation
2 Timing Analysis Concepts
2.1 Abstract Interpretation
2.2 WCET Analysis for Uninterrupted Single Tasks
2.2.1 Static WCET Analysis
2.2.2 Parametric WCET analysis
2.2.3 Hybrid WCET analysis
2.2.4 Early-Stage WCET analysis
2.2.5 Statistical WCET analysis
2.2.6 WCET-friendly Hardware Design
2.2.7 Experiences with Practical Application of WCET Analysis
2.2.8 Timing Anomalies
2.2.9 Compositionality in WCET Analysis
2.3 Timing Analysis of Sequential Multi-Task Systems
2.3.1 Accounting for the Timing Behavior of System Calls
2.3.2 Accounting for Task Interaction Impacts on the WCET
2.3.3 Schedulability of Multi-Task Systems with Given WCETs
2.4 Timing Analysis of Parallel Multi-Task Systems
2.4.1 Multi-Core Systems
2.4.2 Distributed Systems
3 WCC Framework
3.1 Related Work
3.2 Compiler Phases
3.3 Flow Fact Management
3.4 System Model
3.5 Extensions for Binary Input Files
4 Single-Core WCET-Analysis
4.1 IPCFG Construction
4.1.1 Analysis Graph
4.1.2 Context Graph
4.2 Value Analysis
4.2.1 Abstract Value Domain
4.2.2 Challenges of Predicated Execution
4.3 Microarchitectural Analysis
4.3.1 ARM7TDMI Pipeline Model
4.3.2 Cache Analysis
4.4 Path Analysis
4.5 Evaluation
5 Multi-Core WCET Analysis
5.1 Introduction
5.2 Multi-Core Challenges
5.2.1 Shared Caches
5.2.2 Shared Interconnection Structures
5.3 Related Work
5.3.1 WCET Analysis Approaches for Multi-Cores
5.3.2 WCET-friendly Multi-Core Architecture Design
5.4 Partitioned Multi-Core WCET Analysis
5.4.1 Shared Cache Handling
5.4.2 Shared Bus Analysis Preliminaries
5.4.3 Basic Bus Domains
5.4.4 Loop Unrolling
5.4.5 Offset Contexts
5.4.6 Offset Relocation
5.4.7 Timing-Anomaly-Free Analysis
5.4.8 Evaluation
5.5 Unified WCET Analysis for Complex Multi-Cores
5.5.1 Related Work
5.5.2 Task Model
5.5.3 Motivating Example
5.5.4 Prerequisites
5.5.5 Parallel Execution Graph Construction
5.5.6 Parallel System States
5.5.7 Correctness
5.5.8 Extensions
5.5.9 Evaluation
5.6 Summary
6 Multi-Core WCET Optimization
6.1 Multi-Objective Evolutionary Schedule Optimization
6.1.1 Related Work
6.1.2 Evolutionary Algorithm
6.1.3 Evaluation
6.2 WCET-driven Multi-Core Instruction Scheduling
6.2.1 Related Work
6.2.2 Scheduling Heuristics
6.2.3 Evaluation
6.3 Summary
7 Conclusion and Future Work
7.1 Summary
7.2 Future Work
List of Figures
List of Tables
List of Algorithms
Glossary
Bibliography
A Employed Benchmarks

Chapter 1
Introduction
This dissertation deals with an aspect of computer science that has been a second-class
citizen since the emergence of the discipline: time. Or, to put it more precisely, safe
bounds on the timing behavior of programs running on a computer system.
Since the days of mainframes, attempts have been made to increase the productivity
of programmers by supplying them with ever more powerful programming languages
and matching compilers [Myc07], by teaching useful idioms and patterns [GHJ+95],
and by making individual computers faster to allow software complexity to increase
steadily [Sch97]. The runtime behavior of algorithms has traditionally been classified
asymptotically in big O notation, which eases or even enables reasoning about the
runtime behavior of complex algorithms [Weg03]. For many computer applications the
asymptotic classification is sufficient, even though its limitations are already
stressed by examples like the Simplex algorithm, which in spite of having exponential
asymptotic runtime performs better than its polynomial-time counterparts on most
real-world examples [Cor10, Choosing an optimizer for your LP problem].
The major shortcoming of this type of runtime classification is that it is not usable
in the important area of real-time systems, i.e., computer systems in which the
executed tasks must always complete their work within a bounded time interval or
before a given deadline. An asymptotic model which ignores constant factors in the
runtime formula is not applicable here, since it makes an important difference whether
a task takes the allowed time or twice as long, the latter possibly rendering the
system dysfunctional. Most real-time systems are also embedded systems, i.e.,
“information processing systems embedded into enclosing products” [Mar11], which are
integrated into many devices of daily life. Application domains for embedded systems
are numerous and cover areas like automotive electronics, avionics, railways,
telecommunication, the health sector, security, consumer electronics, fabrication
equipment, smart buildings, logistics, robotics, military applications, and many
more [Mar11].
Real-time systems are generally classified as soft or hard real-time systems, where
“soft” means that deadlines may be violated for a few executions of a task, but not
regularly, and “hard” means that no deadline may ever be violated. Multimedia
applications like audio and video decoders are prime examples of soft real-time
systems, whereas industrial Electronic Control Units (ECUs) in robotics, power stations,
cars and planes are typical hard real-time systems. Modern cars, for example, have
more than 70 ECUs [ES08] for engine control, safety features like anti-lock braking
and electronic stability program, and multimedia functions. At least for the first
two categories, hard real-time implementations clearly must be provided to ensure
a timely reaction of the system. According to a recent market study, about 60%
of all embedded development projects [BW13] require real-time capabilities, though
in some of these cases soft real-time may be sufficient. Still, the rising number
of complications with real-world safety-critical embedded systems of everyday life,
mostly cars¹, demonstrates that (semi-)automated safety certification of embedded
systems is highly desirable. Here, WCET analysis comes into play, as a means to
semi-automatically verify the timing behavior of the task set under analysis. Of
course, this has to be complemented by other analyses as sketched in Figure 1.1.

[Figure 1.1: Real-time system verification tools. Central box: Real-Time System Design.
Upper half: Model-driven Design (ASCET, SCADE, ...), Coding Conventions (MISRA-C, ...),
Standardization and Re-use (OSEK, AUTOSAR, ...). Lower half: Formal Verification
(Astrée, model checking, abstract interpretation, ...), Testing (unit tests,
integration tests, coverage criteria, ...), WCET Analysis (aiT, static, dynamic,
hybrid, ...), Schedulability Analysis (Real-Time Calculus, static schedulability
theorems, ...).]
The upper half of the ﬁgure shows methodologies, coding conventions and stan-
dard components which are used to avoid programming errors and to increase the
productivity. They have a direct inﬂuence on the shape of the ﬁnal real-time system
code. As an example, code generation from models and conventions like MISRA-
C [MIS13] can ease the WCET analysis by limiting the variability of the code and
prohibiting hard-to-analyze software constructions.
The lower half of Figure 1.1 shows tools which are used to verify a system that
has already been partly or fully designed. Formal verification is needed to detect
run-time errors such as null-pointer, overflow and out-of-bounds bugs. One of the
best-known tools in this area is Astrée² [Abs14b], which, similar to WCET analysis,
relies on abstract interpretation to derive static information about the program.
Model checking has also proven to be useful, especially when a system implementation
is already generated from a high-level model. Model checkers focus on proving the
absence of deadlocks, reachability conditions and program termination, rather than
on run-time errors in real code. Testing is mandatory for any type of development,
with a wealth of testing frameworks and methodologies being available and widely
used.

¹ As an example, there were 24 recalls of vehicle classes in the year 2011 [Ele12].
² Here and throughout the rest of this thesis, example tools are set in small caps.
WCET and schedulability analysis then provide what formal error checking
and testing alone cannot offer, namely safe and precise bounds on the runtime
of individual tasks (WCET analysis) and statements on whether the given task
set will always meet its deadlines on the given platform (schedulability analysis).
aiT [Abs14a] is the de-facto standard for industrially used WCET analysis. In
addition to delivering highly precise WCET values for single tasks, it can also compute
the worst-case memory consumption of tasks.
Therefore, WCET analysis is one of the key elements in timing verification.
With imprecise WCET values, all statements derived in the schedulability analysis
are overly pessimistic; with unsafe ones, they are void.
1.1 Motivation
Unfortunately, a precise WCET analysis is undecidable in general. This is easy to
see, since the WCET analysis problem is just an extension of the halting problem,
where we not only ask whether a program will terminate, but also when it will
terminate. Since the halting problem is undecidable [Weg99], so is the WCET
analysis problem³. In addition, the timing behavior of modern hardware shows
enough variance to render a simple enumeration of all possible execution paths
infeasible.
Therefore, any practically feasible WCET analysis has to use approximation as
a means to make the problem decidable at all. We will see in Chapter 2 that even
with algorithmic approximation, WCET analysis potentially still requires some user
interaction for complex tasks.
Since WCET analysis in general requires approximation, we distinguish between
the WCETreal, which is the true worst-case runtime of the task under analysis,
and estimations WCETest ≥ WCETreal. Figure 1.2 gives an example distribution of
runtimes, in which the WCETreal is marked along with one possible
example of a WCETest for this task. It is worth noting that for any real task a
runtime distribution as shown in Figure 1.2 can hardly be constructed, since it needs
to cover all input combinations and all possible initial system states. Similar to the
WCET, we can also define the Best-Case Execution Time (BCET) of a task, which
is also indicated in Figure 1.2, where we again distinguish between the BCETreal
and the BCETest ≤ BCETreal. The BCET is also important in timing analysis,
e.g., to derive minimum task inter-arrival times in the schedulability analysis, but
since most WCET analysis concepts are also directly applicable to BCET analysis,
we focus on the WCET side in the following. Nevertheless, most analyses from
Chapter 4 and Chapter 5 yield both BCET and WCET values. Since the WCETreal
is unknown in general, we use the term “WCET” as a synonym for “WCETest”
throughout this thesis (and likewise for BCET).

³ Even though Kirner, Zimmermann and Richter argue that the halting problem is in fact
not undecidable for existing bounded-memory platforms [KZR09], it still is orders of
magnitude too complex to decide in reasonable time.

[Figure 1.2: A sample distribution of runtimes of a program, along with sample BCET
and WCET estimates. Horizontal axis: program runtime; vertical axis: frequency.
Marked values: BCETest, BCETreal, WCETreal, WCETest.]
Any WCET analysis is required to be safe, i.e., the relation WCETest ≥ WCETreal
must always hold. In addition, it is desirable for the analysis to be precise, which
means that the difference WCETest − WCETreal should be minimized. To the best
of the author’s knowledge, theoretical bounds on the precision of WCET analyses
are not available, so the precision is usually determined empirically by comparing
the WCETest with measured execution times.
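
The empirical comparison described above can be illustrated with a minimal sketch.
The function name and the numbers are hypothetical and not taken from the thesis.
Measurements only give a lower bound on the WCETreal, so such a check can refute the
safety of an estimate but never prove it, and the computed overestimation is only an
upper bound on the true overestimation.

```python
def check_estimate(wcet_est, measured_times):
    """Compare a WCET estimate against measured execution times.

    Hypothetical helper: the largest measurement is only a lower bound on
    the real WCET, so 'refuted' can show an estimate to be unsafe, but a
    value of False does not prove that the estimate is safe.
    """
    observed_max = max(measured_times)
    refuted = wcet_est < observed_max
    # Relative distance to the largest observed time. Since
    # WCET_real >= observed_max, this bounds the true overestimation from above.
    overestimation = (wcet_est - observed_max) / observed_max
    return refuted, overestimation


# Made-up cycle counts for illustration only.
refuted, overestimation = check_estimate(1200, [850, 990, 1000, 910])
print(refuted, overestimation)
```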
For a task which runs uninterruptedly on a single core, eﬀective WCET analysis
methodologies were developed and compared throughout the last two decades, which
deliver WCET estimates which are 3% to 25% above the WCETreal [Abs14a]. One
key problem for these analyses is that preemptions by other tasks are usually not
accounted for in the WCET analysis itself, but only during schedulability analysis
(cf. Figure 1.1). This decomposition of WCET and schedulability eases the anal-
ysis of both, but also promotes overestimation, since the communication between
WCET and schedulability analysis is unidirectional and on a very coarse level. Most
importantly, the schedulability analysis intentionally has no access to the detailed
worst-case hardware states that are generated by the WCET analysis. Instead,
the schedulability tests are solely based on numeric WCET values and platform
assumptions to lower their algorithmic complexity.
WCET analysis faces an equally important problem due to the latest hardware
development trends, namely the shift towards multi-core architectures. While until
2005 the performance improvement of new chips was mainly generated by frequency
increases, we are now faced with stagnating frequencies and rising numbers of cores
per chip [Sut12]. This can be seen as a general trend towards parallel computing
which does not stop at homogeneous multi-cores, but continues with heterogeneous
multi-cores (since 2009) and “elastic cloud compute cores” (since 2010), e.g., the
outsourcing of computations to commercial or private clouds. While the latter is
not expected to have a big impact on hard real-time systems, multi-cores have
already arrived in the embedded market.

[Figure 1.3: Multi-core implementation power consumption for the FreeScale
MPC8641 [Fre09, Section 1.1], depending on the core frequency and voltage.
Vertical axis: power in W (ticks at 10, 20, 30); configurations: 1.5 GHz at 1.1 V
and 1.0 GHz at 0.95 V; series: single core and dual core.]
As Equation 1.1 shows, the power consumption of a core is directly proportional
to the frequency, but according to [Fre09] increased voltages are needed to reach
higher frequencies for electrical reasons. Therefore, as a rule of thumb, a
doubling of frequencies leads to a fourfold increase in power consumption [Fre09].

    Power ∝ Capacitance × Voltage² × Frequency    (1.1)

This power consumption increase in turn generates more heat, which leads to
decreased component lifetime and higher cooling needs. For battery-driven devices
like many embedded systems, all of these points are highly important and motivate
the now widespread use of multi-cores in the embedded domain as well. As an
example, Figure 1.3 shows the power consumption of the FreeScale MPC8641 in single-
and dual-core configurations as found in [Fre09]. It is visible that the dual-core
configuration only needs about 30% more power than the single-core one, whereas a
single-core implementation with doubled frequency would have caused a fourfold
increase of the power demand, as mentioned above. A more extreme example is the
IBM SyNAPSE chip, according to IBM Research the biggest chip that was ever built
by IBM, which integrates 4096 cores on a single chip [Joh14]. Each core runs
at only 1 kHz, leading to a marginal power consumption of 70 mW. Though this chip
is not yet in volume production, it shows the capabilities of multi-cores. Finally,
Figure 1.4 presents recent ARM architectures which can be found in abundance
in modern smartphones and embedded systems. It is visible that high performance
(with the exception of the Cortex-A8) is only achieved with multi-core designs,
especially since these chips are often used in passively cooled systems where the
heating problem mentioned above is a severe issue. For the ARM designs and similar
chips which are intended to offer more than some kilohertz of per-core frequency, it is
already predicted that multi-core scaling will eventually end due to thermal issues
which arise when all cores on the same chip are powered on [EBS+11]. This effect
is also called “dark silicon”, since it implies that not all cores can be powered on
at the same time without causing the system to overheat. Therefore, even though
the integration density may still increase, the number of concurrently usable cores
does not, because some of them must be powered off, i.e., they are “dark silicon”.
Nevertheless, multi-cores will continue to scale for some time and will remain the
predominant type of computing system for the foreseeable future.

[Figure 1.4: ARM Processor Families [ARM14b]. Axes: Efficiency and
Features/Performance. Shown families: ARM7, ARM9, ARM11, Cortex-M0, Cortex-M3,
Cortex-R4, Cortex-A5, Cortex-A8, Cortex-A9, Cortex-A15. Multi-core architectures are
marked in gray.]
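
Returning to Equation 1.1, the rule of thumb about frequency doubling can be made
concrete with a small worked sketch. The function name and the scaling factors are
illustrative assumptions. Only the proportionality itself and the two operating points
labeled in Figure 1.3 (1.0 GHz at 0.95 V and 1.5 GHz at 1.1 V) come from the text.
Under Equation 1.1, doubling the frequency at constant voltage doubles the power, so a
fourfold increase corresponds to a voltage increase by a factor of about √2.

```python
def relative_power(freq_scale, volt_scale, cap_scale=1.0):
    """Relative power after scaling frequency, voltage and capacitance,
    following Power ∝ Capacitance × Voltage² × Frequency (Equation 1.1)."""
    return cap_scale * volt_scale ** 2 * freq_scale


# Doubling the frequency alone doubles the power.
print(relative_power(freq_scale=2.0, volt_scale=1.0))

# If the voltage has to grow by sqrt(2) to sustain the doubled frequency
# (an assumed factor, chosen to match the fourfold rule of thumb):
print(relative_power(freq_scale=2.0, volt_scale=2 ** 0.5))

# Ratio between the two operating points labeled in Figure 1.3,
# 1.5 GHz at 1.1 V versus 1.0 GHz at 0.95 V; this comes out at roughly 2.
print(relative_power(freq_scale=1.5 / 1.0, volt_scale=1.1 / 0.95))
```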
From the WCET perspective, the problem with this inevitable trend is that the
cores in a multi-core system usually share some hardware components for efficiency
reasons. Typical examples of this are shared I/O devices, shared main memory and
shared cache levels. Since these resources can only be accessed by one task at a
time, concurrent requests to the resource need to be serialized by some kind
of arbiter. The resulting arbitration delay has to be bounded as precisely as possible
by the static WCET analysis, which may be hard to achieve, depending on the
employed arbitration strategy. In addition, some shared components have a shared
state which determines the timing behavior of the component. A prime example of
this is a shared cache, where it makes a big difference in terms of timing whether
the requested block is found in the cache or not. Therefore, WCET analysis now
faces the problem of finding static estimates of
1. the arbitration delay and
2. the shared state
of any shared resource, or to put it in other words, the interference on this resource.
Both may depend on the order in which concurrent accesses are issued to the
resource. As an example, a shared cache will have a different state if two conflicting
cache blocks A and B are requested in the order A,B,A,B or B,B,A,A. Mature
WCET analysis tools for multi-cores are not yet available; the alarming
reaction of industrial real-time system designers is therefore to switch off all but
one core [WHK+13]. This removes all interference-related problems, but of course also
effectively degrades the multi-core system to a single-core one. Since the most recent
hardware generation is often only available as multi-core chips, this is sometimes
still the only viable option.
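
The access-order example above can be replayed with a minimal sketch. It assumes the
simplest possible setting, a single cache line onto which both conflicting blocks A and
B map. The function name is hypothetical, and real shared caches have more sets, ways
and a replacement policy.

```python
def simulate_single_line(accesses):
    """Replay accesses on one cache line shared by all conflicting blocks.

    Returns the hit/miss outcome of every access and the block that is
    left in the line afterwards.
    """
    line = None
    outcomes = []
    for block in accesses:
        if line == block:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            line = block  # the conflicting block evicts the previous one
    return outcomes, line


order_1 = simulate_single_line(["A", "B", "A", "B"])
order_2 = simulate_single_line(["B", "B", "A", "A"])
print(order_1)  # (['miss', 'miss', 'miss', 'miss'], 'B')
print(order_2)  # (['miss', 'hit', 'miss', 'hit'], 'A')
```

Both the number of misses and the final cache state differ between the two orders,
which is why a static analysis has to account for the possible access interleavings.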
1.2 Contributions of this Work
This thesis presents multiple approaches towards a precise WCET analysis for multi-
cores, which may help to alleviate the aforementioned problems with the adoption of
multi-core hardware in real-time system design. Also, the concrete implementation
of these approaches inside the WCET-aware C Compiler (WCC) is demonstrated
and used to evaluate the presented techniques. The WCC also provides unique
opportunities to couple the analysis of a task’s WCET with the optimization of
the latter. Therefore, this thesis also presents two optimizations which utilize this
unique capability to demonstrate that an optimization of task WCETs is feasible
and useful in practice.
We build our approach on the branch of WCET analysis methods which has
proven to be most eﬃcient in the past, namely on a decomposition of the WCET
analysis into microarchitectural analysis and path analysis. This approach is also
employed in the de-facto standard WCET analyzer aiT [Abs14a]. It consists of an
abstract interpretation-based microarchitectural analysis phase, during which the
best- and worst-case runtime of each basic block in the task’s control-ﬂow graph is
determined. With these values, a subsequent path analysis can compute the shortest
and longest path through the control-ﬂow graph, whose length corresponds to the
BCET and WCET, respectively.
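
The path analysis step can be sketched as follows, under the strong simplifying
assumption that the control-flow graph is acyclic. The graph, the block names and the
per-block times are made up for illustration. A real path analysis also has to handle
loops and infeasible paths, which this sketch ignores.

```python
from functools import lru_cache


def path_bounds(cfg, bcet, wcet, entry):
    """Return (shortest, longest) path length from 'entry' to any exit block.

    cfg maps each basic block to its successors; bcet and wcet hold the
    per-block best- and worst-case times that a microarchitectural
    analysis would deliver. Only valid for acyclic graphs.
    """
    @lru_cache(maxsize=None)
    def bounds(block):
        successors = cfg[block]
        if not successors:
            return bcet[block], wcet[block]
        best = min(bounds(s)[0] for s in successors)
        worst = max(bounds(s)[1] for s in successors)
        return bcet[block] + best, wcet[block] + worst

    return bounds(entry)


# Hypothetical if-then-else shaped graph with made-up cycle counts.
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
bcet = {"entry": 2, "then": 5, "else": 3, "exit": 1}
wcet = {"entry": 4, "then": 9, "else": 12, "exit": 2}
print(path_bounds(cfg, bcet, wcet, "entry"))  # (6, 18)
```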
For the single-core case, the WCET analysis methodology is reviewed. Focus is
laid on how to achieve a value analysis with suﬃcient precision and on the modeling
of microarchitectural features, since these form the basis for a precise WCET analysis
of both single- and multi-core systems.
The extension of the microarchitectural modeling to include shared buses and
shared caches is discussed, and the possibility to model the behavior of time-
triggered arbiters by means of TDMA oﬀsets is presented. These can be used to
statically capture the duration of shared resource access requests. Their true poten-
tial only unfolds when they are not applied naively, but combined with intelligent
loop unrolling techniques. Since the loop unrolling incurs a signiﬁcant analysis time
overhead, the concepts of oﬀset relocation and oﬀset contexts are established to allow
for a fast but also precise WCET analysis.
The timing behavior of stateful resources and non-time-triggered arbiters relies
on the ordering of accesses, therefore any analysis which tries to bound this timing
must safely consider all possible interleavings of potentially parallel actions. This
thesis for the ﬁrst time presents a structured way to explore such interleavings on the
single-machine-cycle level. To avoid some part of the combinatorial explosion that
is inevitable in such analyses, a timing-based pruning criterion is developed which
uses the generated timing information to rule out invalid parallel system states.
The presented analysis methods are compared with respect to the achievable
precision and to the overhead which is incurred by platform conﬁgurations which
are routinely advertised for being more predictable and thus easier to analyze.
Finally, two compiler optimizations are given, which can be used to decrease
the WCET of tasks in a multi-core system. The ﬁrst is a WCET-aware multi-
criteria evolutionary optimization of the schedule of shared resources. By eﬃciently
exploring diﬀerent system conﬁgurations a suitable schedule can be found for a given
task set. It is also shown that this schedule is highly dependent on the given tasks,
i.e., an optimal default schedule cannot be speciﬁed in general, necessitating an
optimization like the one presented here.
The second optimization is a WCET-driven instruction scheduling which utilizes
the close coupling of analyses and optimizations inside the WCC by using microar-
chitectural analysis results to optimally place single instructions inside the tasks’
source code. The placement is done in such a way that the single shared resource
accesses incur minimum access overhead, which is only possible with the help of
detailed microarchitectural information.
1.3 Organization of the Thesis
The structure of the remaining parts of this thesis is as follows:
• Chapter 2 presents the existing concepts used in WCET analysis. It starts
with a general review of abstract interpretation, the basic method onto which
the presented WCET analysis is built. Subsequently, diﬀerent approaches to
WCET analysis are shown, and necessary deﬁnitions of WCET-related phe-
nomena like timing compositionality and timing anomalies are given. The
chapter closes with a presentation of preexisting approaches to WCET analy-
sis in multi-task and multi-core systems.
• The WCC framework is introduced in Chapter 3. Starting with an overview
of the compiler infrastructure and related work on similar projects, it covers
the assumed system model and the compiler phases. Finally, extensions of
the WCC framework to incorporate binary code into WCET analysis and
optimization are presented.
• Chapter 4 picks up the WCET analysis concepts presented in Chapter 2 and
demonstrates how these are used in the context of the WCC framework to
create a WCET (and BCET) analyzer for a single-core ARM platform.
• The main contributions of this thesis are found in Chapter 5, where the
methods for WCET analysis in multi-core systems are discussed. It covers
TDMA oﬀset analysis for time-triggered arbiters as well as precise analysis for
less predictable architectures and an extensive evaluation.
• Chapter 6 contains the presentation of the aforementioned two compiler op-
timizations tailored towards WCET optimization. It starts with the evolu-
tionary shared resource schedule optimizations and then proceeds with the
WCET-aware multi-core instruction scheduler. For both optimizations, eval-
uation results on a large number of real-time benchmarks are given.
• Finally, Chapter 7 closes the thesis with a summary and an outlook on future
work.
1.4 Author’s Contribution to this Dissertation
According to §10(2) of the “Promotionsordnung der Fakultät für Informatik der
Technischen Universität Dortmund vom 29. August 2011”, a dissertation within the
context of doctoral studies has to contain a separate list that highlights the
author’s contributions to research and results obtained in cooperation with other
researchers. Therefore, the following overview lists the contributions of the author
to the presented results for each chapter:
• Chapter 2: This chapter summarizes related work only, therefore there is no
contribution to account for.
• Chapter 3: The WCC framework [FL10] was created by a multitude of
people, among others Heiko Falk, Paul Lokuciejewski, Sascha Plazar and Jan
Kleinsorge. The author has also worked on this framework to some extent
by using it as a basis for the multi-core WCET analysis and optimizations.
The extension of the WCC framework to multi-core architectures was done
by the author only. An initial version of the virtual platform that was used
to evaluate the proposed architecture was implemented by Tim Harde [Har13]
and later largely re-structured by the author. The employed cycle-true virtual
platform simulator CoMET was donated by Synopsys Inc. [Syn14].
The concepts for the extension of the WCC towards the handling of binary
input ﬁles were developed by the author and implemented by Christian Gün-
ter [Gün13].
• Chapter 4: The WCC value analysis was developed by the author in collab-
oration with Jan Körtner, who carried out the majority of the implementation
work. The analysis and context graph handling as well as the microarchitec-
tural modeling as presented in this chapter are the work of the author.
• Chapter 5: The WCET analysis approaches for shared buses were entirely
designed and developed by the author. They are based on previous work from
Chattopadhyay et al. [CRM10], which was later extended in cooperation with
the same authors in the publications [KFM+11; KFM+14]. The co-authors of
these publications assisted the author in technical discussions, proof-reading
and structuring of the publications. Furthermore, the analyses as presented in
this thesis are far more advanced than those in [KFM+11; KFM+14] and integrate
better with the classical microarchitectural analysis. These advances are also
original work of the author.
The comparison of arbitration strategies as already published in [KHM+13]
was developed by the author of this thesis, based on the platform implemen-
tation by Tim Harde.
The WCET analysis for stateful resources considering all possible interleavings
of task executions was designed and implemented exclusively by the author
and published in [KM14].
• Chapter 6: The optimization opportunities and concepts were developed and
formalized by the author. A ﬁrst version of the implementation was done by
Hendrik Borghorst [Bor13] which was later reworked by the author, leading
to the publication [KMB14].
Chapter 2
Timing Analysis Concepts
Contents
2.1 Abstract Interpretation
2.2 WCET Analysis for Uninterrupted Single Tasks
2.2.1 Static WCET Analysis
2.2.2 Parametric WCET analysis
2.2.3 Hybrid WCET analysis
2.2.4 Early-Stage WCET analysis
2.2.5 Statistical WCET analysis
2.2.6 WCET-friendly Hardware Design
2.2.7 Experiences with Practical Application of WCET Analysis
2.2.8 Timing Anomalies
2.2.9 Compositionality in WCET Analysis
2.3 Timing Analysis of Sequential Multi-Task Systems
2.3.1 Accounting for the Timing Behavior of System Calls
2.3.2 Accounting for Task Interaction Impacts on the WCET
2.3.3 Schedulability of Multi-Task Systems with Given WCETs
2.4 Timing Analysis of Parallel Multi-Task Systems
2.4.1 Multi-Core Systems
2.4.2 Distributed Systems
In this chapter we will review the existing literature on WCET analysis and
investigate which concepts have already been established in this domain. We start
in Section 2.1 with a thorough treatment of abstract interpretation, since this is one
very fundamental technique that forms the basis for WCET analysis. In Section 2.2
we examine how this technique and others can be applied in the context of classical
WCET analysis. Section 2.3 proceeds with concepts for the timing analysis of
multiple tasks on a single processor. Finally, we present existing approaches for the
analysis of the WCET of tasks in parallel systems in Section 2.4.
2.1 Abstract Interpretation
Abstract Interpretation (AI) is one of the most well-developed theories for the ap-
proximation of states in discrete transition systems. The basic ideas date back to
a 1973 publication from Kildall [Kil73] which were later formalized and general-
ized by Cousot and Cousot [CC77]. The presentation in this thesis is based on
the more modern introduction in [ALS+07]. Though AI is applicable to various
discrete transition systems like, e.g., source-code programs, Petri nets and Kahn
process networks, we base our examples on the approximation of low-level computer
system states since these will be the target of WCET analysis.
In general, we can associate a concrete semantics with any computer system.
This semantics reﬂects the transformation of all memory cells in the system by
operations carried out by the computational circuits. Since all currently relevant
computer systems are working in clocked operation, we can deﬁne these transfor-
mations on discrete time steps. Therefore, if L˜ is the set of all possible memory
cell assignments including all registers, the program counter, current instruction
and otherwise stored values, a single cycle of a computer system’s operation can be
deterministically described by a concrete semantics function ⟦⟧conc ∶ L˜ ↦ L˜. Since
all relevant computer systems are also working on a well-deﬁned instruction set,
concrete states are always generated by programs.
Deﬁnition 1. A program X = (i0, i1, . . . , in) is a sequence of instructions i ∈ I from
a global instruction set I with start instruction i0 and a set of terminal instructions
It. An execution of X is a trace of concrete system states (l˜0, l˜1, . . . , l˜m) such that
• l˜0 encodes X in its memory content and executes i0, and
• ∀i ∈ {1, . . . ,m} ∶ l˜i = ⟦l˜i−1⟧conc, and
• pc(l˜m) ∈ It.
where pc ∶ L˜→ I extracts the value of the program counter from a concrete state.
An ideal analysis would determine the execution trace for every possible initial
system state l˜0. The length of the longest of these traces would be the WCET of
the program under analysis. Obviously, a naive attempt to collect all of these traces
and return them as the analysis result will fail, since
a) there may be traces of inﬁnite length corresponding to non-terminating pro-
grams and
b) modeling the whole state of the system under analysis and maintaining huge
numbers of these states during the analysis is practically infeasible.
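Definition 1 and the naive "collect all traces" idea can be illustrated with a minimal sketch. The toy machine below (one register, a four-instruction countdown loop) and the names `State`, `step`, `trace` and `exhaustive_wcet` are hypothetical and not part of the analyzer described in this thesis. The sketch counts one cycle per state transition, which is an assumption made for the example.

```python
# Toy illustration of Definition 1: concrete states, a one-cycle step
# function and execution traces. Machine and program are hypothetical.
from typing import NamedTuple


class State(NamedTuple):
    pc: int  # program counter
    r: int   # single data register


# Program: 0: if r == 0 goto 3;  1: r = r - 1;  2: goto 0;  3: halt
TERMINAL = {3}


def step(s: State) -> State:
    """Concrete semantics: the effect of one machine cycle."""
    if s.pc == 0:
        return State(3 if s.r == 0 else 1, s.r)
    if s.pc == 1:
        return State(2, s.r - 1)
    if s.pc == 2:
        return State(0, s.r)
    raise ValueError("terminal states have no successor")


def trace(initial: State) -> list[State]:
    """Execution of the program from one initial state."""
    states = [initial]
    while states[-1].pc not in TERMINAL:
        states.append(step(states[-1]))
    return states


def exhaustive_wcet(inputs) -> int:
    """Longest trace over all given initial register values, in cycles."""
    return max(len(trace(State(0, r))) - 1 for r in inputs)


wcet_4bit = exhaustive_wcet(range(16))  # enumerate a 4-bit input register
```

Already for this tiny machine the number of traces grows with the input range, and a negative initial register value would yield a non-terminating trace, which matches the two issues a) and b) named above.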
To be able to reason about program executions in a structured way, the notion of a
control-ﬂow graph is used.
Deﬁnition 2. A basic block v = (iv0, . . . , ivk) of a program X is a maximal sub-
sequence of X, such that for all j ∈ {1, . . . , k} and l˜, l˜′ ∈ L˜
(pc(l˜) = ivj ∧ ⟦l˜′⟧conc = l˜) ⇒ (pc(l˜′) = ivj ∨ pc(l˜′) = ivj−1) . (2.1)
A Control Flow Graph (CFG) (V,E, v0) of a program X consists of the set V of
basic blocks of X and a set E ⊂ V × V of directed edges which model every
possible transfer of control within the program. The entry point of the program is
given by node v0.
A path P in a CFG from vs ∈ V to ve ∈ V is a non-empty sequence of nodes
P = (vs, . . . , ve) such that for any two adjacent vi and vi+1 in P , (vi, vi+1) ∈ E holds.
If a path from vs ∈ V to ve ∈ V exists, we call ve reachable from vs which is expressed
as vs ↝ ve, else ve is unreachable from vs written as vs ↝̸ ve. The set of all paths
from vs to ve is called P[vs,ve]. For any node v ∈ V , δ+(v) = {w ∣ (v,w) ∈ E} and
δ−(v) = {w ∣ (w, v) ∈ E} are the successors and predecessors of v.
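The graph notions introduced above can be sketched in a few lines. The class `CFG` and the example graph are hypothetical and only mirror the definitions of successors, predecessors and reachability; they do not reflect the data structures used inside the WCC.

```python
# Minimal CFG sketch: successors, predecessors and reachability.
from collections import deque


class CFG:
    def __init__(self, nodes, edges, entry):
        self.nodes = set(nodes)
        self.edges = set(edges)   # set of (source, target) pairs
        self.entry = entry

    def succ(self, v):
        """Successors of v, i.e. all w with (v, w) in E."""
        return {w for (u, w) in self.edges if u == v}

    def pred(self, v):
        """Predecessors of v, i.e. all u with (u, v) in E."""
        return {u for (u, w) in self.edges if w == v}

    def reachable(self, vs, ve):
        """True iff a path from vs to ve exists.

        A path is a non-empty node sequence, so the one-element
        sequence (v) makes every node reachable from itself.
        """
        seen, work = {vs}, deque([vs])
        while work:
            v = work.popleft()
            if v == ve:
                return True
            for w in self.succ(v) - seen:
                seen.add(w)
                work.append(w)
        return False


# Diamond with a back edge D -> B; node E is isolated.
cfg = CFG("ABCDE",
          [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "B")],
          entry="A")
```

In this example graph, D is reachable from the entry A, whereas A is not reachable from D and the isolated node E is not reachable from A.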
The connection between traces of concrete states and the CFG is easily made,
since each l˜ ∈ L˜ holds a concrete value of the Program Counter (PC) register of the
underlying architecture, which points to an instruction il˜ ∈ I. Thus, if each l˜j of a
trace is mapped to the node v ∈ V with il˜j ∈ v the trace is transformed to a CFG
path.
A path is called feasible, if there is a concrete trace which is mapped to this
path. Infeasible paths may exist in a CFG due to dependencies between control
ﬂow branches, which are not reﬂected in the graph structure. As an example, if two
branches b1 and b2 have the same branch condition and the value of the condition
remains unmodiﬁed between b1 and b2, then either both branches are taken or none.
The CFG, in contrast, may also contain an infeasible path in which b1 is taken
whereas b2 is not. A cycle at v ∈ V is a path (v, . . . , v). A path which contains
no cycle is called acyclic. The set of all paths $P_{[v_s,v_e]}$ can be restricted to feasible
($P^{\mathrm{feasible}}_{[v_s,v_e]}$) or acyclic paths ($P^{\mathrm{acyclic}}_{[v_s,v_e]}$).
To deal with issue a) from above, the collecting semantics ⟦⟧coll ∶ 2L˜ × V ↦ 2L˜ is
deﬁned as taking a set of concrete states, executing the instructions from the given
CFG node in those states for all possible inputs and returning the resulting end
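states. We can trivially extend the collecting semantics to paths $p = (v_1, v_2, \ldots, v_n)$
by setting $\llbracket s \rrbracket_{coll}(p) = \llbracket \ldots \llbracket \llbracket s \rrbracket_{coll}(v_1) \rrbracket_{coll}(v_2) \ldots \rrbracket_{coll}(v_n)$. We are then searching for the state
set $s^{coll}_v = \bigcup_{p \in P_{[v_0,v]},\, \tilde{l}_0 \in \tilde{L}} \llbracket \tilde{l}_0 \rrbracket_{coll}(p)$ for every node v ∈ V. If L˜ has infinite cardinality, this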
result may still be inﬁnitely big. To overcome this and the issue b) mentioned above,
AI further changes from the collecting semantics ⟦⟧coll to an abstract semantics
⟦⟧abs ∶ L × V ↦ L which is deﬁned on abstract system states L. To be useful, the
abstract semantics must overapproximate the collecting one, which is expressed by
the notion of a Galois connection.
Deﬁnition 3. A Galois connection (α, γ) is a pair of functions α ∶ P ↦ Q and
γ ∶ Q↦ P for two sets with partial orders (P,≤) and (Q,⊑) such that
∀x ∈ P, y ∈ Q ∶ α (x) ⊑ y⇔ x ≤ γ (y) (2.2)
Figure 2.1: The Galois connection between concrete and abstract semantics.
(Diagram: the collecting semantics ⟦⟧coll on 2^L˜ and the abstract semantics
⟦⟧abs on L, connected by α and γ.)
For our concrete example, this means (P,≤) = (2^L˜,⊆) and (Q,⊑) = (L,⊑), where
x ⊑ y if and only if γ(x) ⊆ γ(y). I.e., an abstract state x is “bigger” than y if and
only if it “contains” a superset of the concrete states which are contained in y. This
is a general necessity in a Galois connection, where we also have the property that
x ≤ γ ○ α (x), i.e., the re-concretization of an abstraction of x always contains x.
The general idea is visualized in Figure 2.1. With the abstraction function α, we
can map the initial system states into the abstract domain L, where we conduct our
analyses to compute abstract results loutv . After the analysis is ﬁnished the Galois
condition ensures that γ (loutv ) ⊇ scollv , i.e., every reachable concrete state is covered
in the result. Since we are computing an overapproximation, potentially also other
states which are not reachable in any concrete execution are covered.
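The Galois condition from Equation 2.2 can be checked exhaustively on a small example. The sketch below assumes a hypothetical concrete universe of seven integers and uses intervals as abstract states; the names `alpha`, `gamma`, `leq` and `is_galois_connection` are chosen for this illustration only.

```python
# Interval abstraction over a tiny integer universe, checked against
# the Galois condition: alpha(x) below y  <=>  x subset of gamma(y).
from itertools import combinations

UNIVERSE = range(-3, 4)   # concrete values -3 .. 3 (assumption)


def alpha(xs):
    """Abstraction: smallest interval covering a set, None for the empty set."""
    return (min(xs), max(xs)) if xs else None


def gamma(iv):
    """Concretization: all integers inside the interval."""
    if iv is None:
        return frozenset()
    lo, hi = iv
    return frozenset(range(lo, hi + 1))


def leq(a, b):
    """Abstract partial order: interval inclusion, None is the bottom element."""
    if a is None:
        return True
    if b is None:
        return False
    return b[0] <= a[0] and a[1] <= b[1]


def all_subsets():
    items = list(UNIVERSE)
    for k in range(len(items) + 1):
        for c in combinations(items, k):
            yield frozenset(c)


def all_intervals():
    yield None
    for lo in UNIVERSE:
        for hi in range(lo, UNIVERSE.stop):
            yield (lo, hi)


def is_galois_connection():
    return all(leq(alpha(x), y) == (x <= gamma(y))
               for x in all_subsets() for y in all_intervals())


covered = gamma(alpha(frozenset({-2, 1})))
```

Here `covered` also contains the values -1 and 0, which were not part of the original set. This is the overapproximation mentioned above: the abstraction is safe, but it may cover concrete states that never occur.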
For the analysis to be eﬃciently possible, the abstract domain L must be a
semi-lattice.
Deﬁnition 4. A semi-lattice (L,⊔) is a set L with a meet operator ⊔ ∶ L × L ↦ L,
which is required to be
• idempotent: ∀l ∈ L ∶ l⊔ l = l,
• commutative: ∀l,m ∈ L ∶ l⊔m =m⊔ l and
• associative: ∀l,m,n ∈ L ∶ l⊔(m⊔n) = (l⊔m)⊔n.
The meet operator induces a partial order ⊑ on L, which is deﬁned by
∀l,m ∈ L ∶ l ⊑m⇔ l⊔m =m (2.3)
In a semi-lattice (L,⊔), also a biggest element ⊺ = ⊔L exists, with ∀l ∈ L ∶ l⊔⊺ = ⊺
or equivalently ∀l ∈ L ∶ l ⊑ ⊺. This element is usually named “top”. Additionally, a
smallest element ⊥ ∈ L (“bottom”) may exist, with ∀l ∈ L ∶ ⊥⊔ l = l.
The height of a lattice is one less than the length of the longest sequence (l1, l2, . . . , ln)
of pairwise distinct li ∈ L such that ∀i ∶ li ⊑ li+1.
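As a small sanity check of Definition 4, the following sketch verifies the three laws on a hypothetical example, the subsets of a three-element set with set union as the operator ⊔. It also derives the partial order via Equation 2.3 and computes the height, reading the chain in the height definition as strictly ascending. All names are illustrative.

```python
# Semi-lattice laws, induced order and height for (subsets of {a,b,c}, union).
from functools import reduce
from itertools import combinations, product

BASE = ("a", "b", "c")
L = [frozenset(c) for k in range(len(BASE) + 1) for c in combinations(BASE, k)]


def join(l, m):
    return l | m


def leq(l, m):
    """Partial order induced by the operator (Equation 2.3)."""
    return join(l, m) == m


def laws_hold():
    idempotent = all(join(l, l) == l for l in L)
    commutative = all(join(l, m) == join(m, l) for l, m in product(L, repeat=2))
    associative = all(join(l, join(m, n)) == join(join(l, m), n)
                      for l, m, n in product(L, repeat=3))
    return idempotent and commutative and associative


TOP = reduce(join, L)   # biggest element: join over the whole set


def height():
    """Length of the longest strictly ascending chain, minus one."""
    def longest_from(l):
        above = [m for m in L if leq(l, m) and m != l]
        return 1 + max((longest_from(m) for m in above), default=0)
    return max(longest_from(l) for l in L) - 1
```

For this example the top element is the full set {a, b, c} and the height is 3, since the longest chain adds one element per step starting from the empty set.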
Deﬁnition 5. A monotonic Data-Flow Analysis (DFA) framework ((L,⊔), F ) con-
sists of a
• semi-lattice (L,⊔) and
• a set F containing functions f ∶ L→ L, where
1. every f is monotonic in ⊑, i.e., ∀l,m ∈ L ∶ l ⊑m ⇒ f(l) ⊑ f(m),
2. the identity function id with ∀l ∈ L ∶ id(l) = l is contained in F and
3. F is closed under function composition, i.e., ∀f, g ∈ F ∶ f ○ g ∈ F .
A DFA framework is called distributive iﬀ
∀f ∈ F, l,m ∈ L ∶ f (l⊔m) = f (l)⊔f (m) (2.4)
The weaker condition f (l)⊔f (m) ⊑ f (l⊔m) follows from monotonicity alone and
thus also holds for non-distributive DFA frameworks.
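A standard example of a framework that is monotonic but not distributive is constant propagation over several variables. The sketch below is an illustration under assumptions: the statement `z = x + y`, the variable names and the string encodings of the top and bottom elements are hypothetical.

```python
# Constant propagation is not distributive: merging before or after the
# transfer function of "z = x + y" gives different results.
TOP, BOT = "TOP", "BOT"   # "not a constant" / "not initialized"


def join_val(a, b):
    if a == BOT:
        return b
    if b == BOT:
        return a
    return a if a == b else TOP


def join(l, m):
    """Pointwise merge of two abstract variable environments."""
    return {var: join_val(l[var], m[var]) for var in l}


def transfer(l):
    """Abstract transfer function of the hypothetical statement z = x + y."""
    x, y = l["x"], l["y"]
    if BOT in (x, y):
        z = BOT
    elif TOP in (x, y):
        z = TOP
    else:
        z = x + y
    return {**l, "z": z}


l = {"x": 1, "y": 2, "z": BOT}
m = {"x": 2, "y": 1, "z": BOT}

merged_first = transfer(join(l, m))           # f(l ⊔ m)
merged_last = join(transfer(l), transfer(m))  # f(l) ⊔ f(m)
```

Merging after the transfer function keeps the fact that z equals 3 on both incoming states, whereas merging before it loses the values of x and y and therefore also the value of z.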
Deﬁnition 6. An instance of a DFA framework ((L,⊔), F ) is a tuple ((V,E), v0, l0),
where (V,E) is a control-ﬂow graph of the program to analyze, v0 is the node where
the control ﬂow enters the program and l0 ∈ L is the initial data-ﬂow information
for the start node and every node v ∈ V has an associated transfer function fv ∈ F .
A solution for an instance is a set of data-ﬂow items loutv for all v ∈ V such that the
reachable concrete states scollv are covered, i.e., γ (loutv ) ⊇ scollv .
The transfer functions are just invocations of the abstract semantics, i.e. fv =
⟦⟧abs (v) and similar to it, we can extend the transfer function to CFG paths. By
following all feasible paths through the program’s control ﬂow graph given by a DFA
framework instance, we can deﬁne an ideal solution
$$ l^{out,IDEAL}_v = \bigsqcup_{p \in P^{\mathrm{feasible}}_{[v_0,v]}} f_p(l_0) \qquad (2.5) $$
Since the identification of feasible paths is impossible in general, we can relax
this to the Meet Over All Paths (MOP) solution, which considers all paths in the
CFG
$$ l^{out,MOP}_v = \bigsqcup_{p \in P_{[v_0,v]}} f_p(l_0) \qquad (2.6) $$
The MOP solution is still not computable, since there may be inﬁnitely many
paths due to the existence of loops in the CFG as already discussed. Therefore as
the last coarsening step, we revert to the Minimum Fixed Point (MFP) solution
which is the ﬁxed point of Equation 2.7.
$$ l^{out,MFP}_v = \begin{cases} f_v(l_0) & \text{if } v = v_0 \\ f_v\bigl(\bigsqcup_{(u,v) \in E} l^{out,MFP}_u\bigr) & \text{else} \end{cases} \qquad (2.7) $$
This solution avoids the computation of all paths by only propagating the states via
the edges of the CFG until a ﬁxed point is reached. During this procedure all ﬁnite
paths are visited implicitly. From the ﬁxed point deﬁnition we also know, that for
all inﬁnite paths to v the generated data-ﬂow information is covered by lout,MFPv .
Further details on why the precision relation lout,MFPv ⊒ lout,MOPv ⊒ lout,IDEALv is true
for all v ∈ V can be found in [ALS+07, Chapter 9.3].
For monotonic frameworks, an MFP solution always exists according to the
Knaster-Tarski-Fixed-Point Theorem [Tar55]. In general, the MOP solution is more
precise than the MFP one, since in the MFP case we apply the meet operator
earlier than in the MOP case, namely after each node instead of only after each
path. For distributive frameworks this does not make a difference (see Equation 2.4),
therefore the MFP solution is equal to the MOP solution for distributive frameworks
(“Interprocedural Coincidence Theorem” [KS92]). For computing the MFP solution,
the work-list algorithm is a standard approach as shown in Algorithm 1.
Algorithm 1 The generic data-flow analysis work-list algorithm.
1: function WorkListAlgorithm(((L,⊑), F ), ((V,E), v0, l0))
2:   worklist ← v0
3:   for v ∈ V do ▷ Initialization of the l^in_v / l^out_v ...
4:     if v = v0 then
5:       l^in_v ← l0, l^out_v ← ⊥ ▷ ... for the start node ...
6:     else
7:       l^in_v ← ⊥, l^out_v ← ⊥ ▷ ... and for all other nodes.
8:   while worklist ≠ ∅ do ▷ Loop until a fixed point was found
9:     v ← pop(worklist)
10:     l^in_v ← ⊔_{(u,v)∈E} l^out_u
11:     l^tmp_v ← f_v(l^in_v) ▷ Apply transfer function of node v
12:     if l^out_v ≠ l^tmp_v then ▷ If there were changes at node v ...
13:       l^out_v ← l^tmp_v
14:       for (v,w) ∈ E do
15:         push(worklist,w) ▷ ... propagate them to all successors
16:   return {v → l^in_v ∣ v ∈ V }
The data-ﬂow information is initialized in lines 3-7 and then propagated through
the graph in lines 8-15. The nodes at which the data-ﬂow information has not yet
converged are kept in a work-list which is processed until all state information has
converged. This convergence relies on the monotonicity of the transfer functions
and on the height of the underlying lattice. For finite-height lattices, the convergence is guaranteed,
since any data-ﬂow item either reaches a ﬁxed point other than ⊺, or reaches ⊺ in a
ﬁnite number of steps which is a forced ﬁxed point due to the monotonicity of the
transfer functions. To speed up the convergence of the data-ﬂow items we can sort
the work list in topological order, ignoring back-edges in the CFG.
As an example for a lattice, consider the problem of determining whether a single
register has a constant value at the individual CFG nodes. The Hasse diagram of the
lattice for this problem is shown in Figure 2.2. It has a ﬁnite height of 2, though it
has an infinite number of elements. The bottom element ⊥ denotes the state “register
was not initialized”, the top state ⊺ means “register does not hold a constant value”
and the middle row of elements represents actual constants. Whenever a block v
loads a constant value into the register, loutv is set to this value. If multiple different
constants are merged in the meet operation, linv is set to ⊺ in Algorithm 1. Further examples
and an overview of possible lattices for program analysis can be found in [Cou01].
                      ⊺
−∞  . . .  −2  −1  0  1  2  . . .  ∞
                      ⊥
Figure 2.2: The lattice of integer constants.
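The work-list scheme of Algorithm 1 can be sketched for the single-register constant lattice as follows. This is an illustration, not the WCC implementation: the CFG, the block names and the helpers `load` and `identity` are hypothetical. The sketch deviates from Algorithm 1 in two labeled ways: all nodes are enqueued initially, so that blocks which load a constant are processed even if their input never changes, and the initial information l0 is fed into the entry node as in Equation 2.7.

```python
# Work-list fixed-point iteration over the flat lattice of integer constants.
from collections import deque

TOP, BOT = "TOP", "BOT"   # "not a constant" / "register not initialized"


def join(a, b):
    if a == BOT:
        return b
    if b == BOT:
        return a
    return a if a == b else TOP


def load(c):
    """Transfer function of a block that loads constant c into the register."""
    return lambda _l: c


def identity(l):
    """Transfer function of a block that does not write the register."""
    return l


def worklist(nodes, edges, entry, l0, transfer):
    preds = {v: [u for (u, w) in edges if w == v] for v in nodes}
    succs = {v: [w for (u, w) in edges if u == v] for v in nodes}
    out = {v: BOT for v in nodes}
    work = deque(nodes)                # deviation: start with all nodes
    while work:
        v = work.popleft()
        l_in = l0 if v == entry else BOT
        for u in preds[v]:
            l_in = join(l_in, out[u])  # merge predecessor results
        tmp = transfer[v](l_in)
        if tmp != out[v]:              # changes are propagated to successors
            out[v] = tmp
            work.extend(succs[v])
    return out


nodes = ["entry", "a", "b", "m", "c", "e", "f", "exit"]
edges = [("entry", "a"), ("entry", "b"), ("a", "m"), ("b", "m"),
         ("m", "c"), ("c", "m"),        # loop m -> c -> m
         ("m", "e"), ("m", "f"), ("e", "exit"), ("f", "exit")]
transfer = {"entry": identity, "a": load(7), "b": load(7), "m": identity,
            "c": identity, "e": load(9), "f": identity, "exit": identity}

result = worklist(nodes, edges, "entry", BOT, transfer)
```

Both branches a and b load the same constant, so the merge at m and the loop through c keep the value 7. At the exit node the constants 9 and 7 meet, which yields the top element.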
Since the convergence of Algorithm 1 is not guaranteed for lattices with inﬁ-
nite height, such as intervals on integer numbers, widening is used to enforce the
termination of the abstract interpretation in such lattices.
Deﬁnition 7. A widening operator Δ is a unary function on L, such that ∀l ∈ L ∶
l ⊑ Δ (l) and for any sequence l1, l2, . . . with ∀i > 1 ∶ li ⊐ Δ (li−1) the top element is
reached in a ﬁnite number of steps, i.e., ln = ⊺ for some n ∈ N.
With a widening installed, line 11 in Algorithm 1 becomes
ltmpv ←Δ (fv (linv ))
After the ﬁxed point was found, it is possible to reﬁne the results again by letting
the main loop of Algorithm 1 iterate again for a user-deﬁned number of times with
the original deﬁnition of ltmpv , i.e., without the widening. This process may remove
some of the overestimation induced by the widening and is called narrowing. In
Figure 2.3 an example is given of how widening and narrowing affect the generated
results. The figures show the Hasse diagram of the lattice and the evolution of a
single data-ﬂow value (one loutv for a ﬁxed node v) during the runtime of the ﬁxed
point determination. Every arc corresponds to one iteration of the main loop from
Algorithm 1 for v. The widening skips over many intermediate values by artiﬁcially
coarsening the results of the individual steps. This speeds up the convergence (reduced
number of arcs) but also leads to a less precise result (higher in the Hasse
diagram). The narrowing is able to re-gain some precision by applying the unmodified
transfer functions to the fixed point values. It can lead to results which are at
best as good as in the case without widening, but usually they are worse.
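The effect of widening and narrowing can be reproduced on a small interval example. The sketch assumes a hypothetical loop `i = 0; while i < 100: i = i + 1` and tracks the interval of i at the loop head. The unary widening used here rounds interval bounds outwards to a fixed list of thresholds, which is one possible choice and not necessarily the operator used in the WCC.

```python
# Interval analysis of a counting loop with threshold widening and narrowing.
import math

INF = math.inf
THRESHOLDS = [-INF, -1000, -10, -1, 0, 1, 10, 1000, INF]   # assumption


def join(a, b):
    if a is None:          # None encodes the bottom element
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def loop_head(x):
    """Abstract value of i at the loop head after one more round."""
    init = (0, 0)                       # i = 0 on loop entry
    if x is None:
        return init
    lo, hi = x[0], min(x[1], 99)        # loop guard i < 100
    if lo > hi:
        return init                     # loop body not reachable
    return join(init, (lo + 1, hi + 1))  # i = i + 1 in the body


def widen(x):
    """Unary widening: round both bounds outwards to the next threshold."""
    if x is None:
        return None
    lo = max(t for t in THRESHOLDS if t <= x[0])
    hi = min(t for t in THRESHOLDS if t >= x[1])
    return (lo, hi)


def iterate(f, start=None):
    """Apply f until the value is stable; return the value and step count."""
    x, steps = start, 0
    while True:
        nxt = f(x)
        steps += 1
        if nxt == x:
            return x, steps
        x = nxt


plain, plain_steps = iterate(loop_head)
widened, widened_steps = iterate(lambda x: widen(loop_head(x)))
narrowed = loop_head(widened)   # one narrowing round without widening
```

Without widening the iteration climbs one value per step until it reaches the interval from 0 to 100. With widening it stabilizes after a handful of steps at the coarser interval from 0 to 1000, and a single narrowing round restores the upper bound 100. In this example the narrowing reaches the best case, which is not guaranteed in general.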
Figure 2.3: Convergence behavior in a lattice, shown (a) without widening, (b) with
widening and (c) with widening and narrowing. Solid and dotted arcs represent one
or indefinitely many applications of the transfer function, respectively.
To further refine the results of the data-flow analysis, path-awareness can be
introduced as shown in [HT98], which inflates the analysis lattice by multiplying
it with a lattice that represents some of the edges that were visited during the
generation of a data-flow information item. If that is done, data-flow items with
different path expressions are not merged, which potentially increases the precision
if the original lattice was not distributive. In addition, path-awareness can be used
to rule out infeasible paths [NKJ10] if the edges in the path have contradictory
guard conditions, thus moving the fixed-point result a step closer to the IDEAL
solution sketched above. All of this comes at the price of increased analysis duration,
since the size of the underlying lattice is multiplied by the number of possible edge
strings. Therefore, [NKJ10] try to limit the edge strings to relevant edges only,
heuristically. Another approach to control the complexity of the path-awareness is
to gradually increase the allowed length of the path expressions during the analysis
as shown in [DDY06], where the authors use this method to analyze large real-world
benchmarks. As a last option, [BSI+08] separate the detection of infeasible paths
from the data-ﬂow analysis itself and re-structure the program such that infeasible
paths are avoided also in a non-path-aware analysis.
With the systematic description of a DFA framework given above, it should
not be surprising that DFA implementations for concrete problems can easily be
generated from compact lattice and transfer function descriptions [SSB09], even for
path-aware DFAs [HMM12].
2.2 WCET Analysis for Uninterrupted Single Tasks
In this section we will review the existing approaches to the determination of WCET
estimates for a single program running without interruption on a single-core com-
puter with no other active hardware components which could interfere with the
program under analysis. This is the most basic of all WCET analysis scenarios, but
it already oﬀers a plethora of pitfalls and problems to solve [WEE+08]. In general,
the challenges that WCET analysis is facing in this scenario fall into one of the
following categories:
• Unpredictable hardware components: The WCET analysis has to follow
all execution paths of the hardware to find the path with maximum execution
time. Abstract interpretation can alleviate this problem to some limited extent,
but strong abstractions on the hardware state also lose much precision.
Therefore, hardware features which increase the state space of the pipeline
or memory hierarchy such as caches, superscalar and out-of-order execution,
fetch and commit queues and speculative execution complicate the WCET
analysis [WGR+09].
• Unpredictable software structure: Usage of runtime-dynamic software
constructs like function pointers, virtual inheritance and even dynamic memory
allocation is hard to analyze statically. These can be circumvented to
some extent by coding conventions such as MISRA-C [MIS13] which prohibit
the usage of these features. Nevertheless even predictably-designed software
exhibits an iteration structure of loops and recursions which may not be
statically analyzable. Here, the user has to intervene in the worst case and needs
to give hints to the WCET analyzer [KKP+11].
Figure 2.4: Structure of most static WCET analyzers. (Program code and user
annotations feed the stages CFG Reconstruction, Value Analysis, Microarchitectural
Analysis and Path Analysis, which yields the WCET.)
The following subsections will introduce the important branches of single-task
WCET analysis, review real-life examples and case studies of their application and
ﬁnally ﬁnish with an introduction to timing anomalies and compositionality, two
concepts which are central to WCET analysis and frequently used in the following
sections.
2.2.1 Static WCET Analysis
Static WCET analysis analyzes a program without executing any part of it, solely
based on abstract models of the hardware and program semantics. Therefore, it
provides stronger safety and coverage guarantees than measurement-based methods.
The usual approach is to separate the analysis into multiple steps as shown in
Figure 2.4.
A program, which, depending on the analyzer, may be given in binary or source-
code form, is ﬁrst fed into the control-ﬂow graph reconstruction which extracts an
interprocedural control-ﬂow graph that reﬂects all possible transitions of control ﬂow
among instructions from the given program. In case of non-analyzable control-ﬂow
transitions, such as function pointers, the user might be required to specify their
targets manually. After this step, the value analysis computes safe approximations
of the register or variable values by means of AI on the generated CFG. The user can
manually reﬁne these results if he is not satisﬁed with their precision. The microar-
chitectural analysis uses the CFG and the value results to provide overestimates of
the possible hardware states at each CFG node, again usually by means of AI. The
user might have to specify machine details like the used clock frequency and the
memory hierarchy speciﬁcation for this step. Finally, the path analysis computes
the WCET result. To achieve this, it must know upper bounds on the execution
frequency of all cyclic paths, i.e., loops and recursions. For simple cases, the value
analysis will deliver these bounds, whereas for complex loops the user has to provide
a bound.
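The role of the path analysis and of loop bounds can be illustrated with a strongly simplified sketch. It assumes that every loop has already been collapsed into a single node whose cost is the loop bound times the worst-case cost of one iteration, so that the remaining graph is acyclic. Real analyzers typically solve this step differently, for example with ILP-based formulations. The block names and cycle counts below are hypothetical.

```python
# Simplified path analysis: longest path through an acyclic block graph.
from graphlib import TopologicalSorter


def longest_path(cost, edges, entry, exit_):
    """Worst-case cost of any path from entry to exit_ in a DAG."""
    preds = {v: set() for v in cost}
    for u, w in edges:
        preds[w].add(u)
    dist = {}
    # static_order() yields every node after all of its predecessors
    for v in TopologicalSorter(preds).static_order():
        if v == entry:
            dist[v] = cost[v]
            continue
        incoming = [dist[u] for u in preds[v] if u in dist]
        if incoming:                    # skip nodes unreachable from entry
            dist[v] = max(incoming) + cost[v]
    return dist[exit_]


LOOP_BOUND = 8        # user- or analysis-provided bound (assumption)
ITERATION_COST = 6    # worst-case cycles of one loop iteration (assumption)

cost = {"entry": 5, "cond": 2, "then": 10, "else": 4,
        "loop": LOOP_BOUND * ITERATION_COST, "exit": 3}
edges = [("entry", "cond"), ("cond", "then"), ("cond", "else"),
         ("then", "loop"), ("else", "loop"), ("loop", "exit")]

wcet = longest_path(cost, edges, "entry", "exit")
```

Without the loop bound the cost of the node `loop` would be unbounded, which is why the path analysis depends on bounds from the value analysis or from the user.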
The prime example of an implementation of this scheme is the aiT WCET an-
alyzer from AbsInt GmbH [FH04], which is the most well-known and industrially-
used WCET analyzer available. The analysis that was implemented for this thesis
inside the WCC [FL10] also follows this line of research, as well as other industrial
and academic WCET projects like Bound-T [Tid04], the Open Timing Analysis
Platform (OTAP) [HPP11], the OTAWA framework [Tea14], the CalcWCET167
tool [Kir12], SWEET [MRT14], Chronos [LLM+07] and TuBound [PSK08].
Apart from aiT, all of these tools were written mainly for research purposes and
have experimental or simpliﬁed versions of individual analysis stages. Most impor-
tantly, only aiT includes an abstract processor model also for non-trivial target
processors in the microarchitectural analysis. In Chapter 4 we will look at the in-
dividual analysis stages in more detail, where we also show their realization inside
the WCC.
The analysis structure sketched in Figure 2.4 can be regarded as the most promis-
ing approach to WCET analysis. Nevertheless, it is not the only one. Model-
checking has proven to be eﬀective in the analysis of distributed real-time sys-
tems [AD90; BY04; FKP+07]. There also exist proposals to determine the WCET
using model-checkers, which requires the whole architecture and the program under
analysis to be modeled as a transition system amenable to classical model-checking
theory. The main drawback of early approaches was the fact that a dedicated model
has to be created which reflects the timing of the program under analysis [CCM97].
Any error made in the manual modeling invalidates the results. More recent approaches
incorporate program and hardware state into the model, e.g., using a
generic model-checker like UPPAAL to solve the WCET problem [BC11]. The us-
age of a generic model-checker opens up the possibility to model arbitrarily complex
systems. Unfortunately, due to the same reason, these approaches also lack the ca-
pabilities to form controlled abstractions of the resulting state space which leads to
poor analysis times when compared to abstract interpretation-based designs [Wil04].
Experimental results from Gustavsson [GEL+10] report runtimes of more than 30 hours
even for drastically simplified architectures. Similarly, path analysis
is far slower with model checking than with classical methods based on
Integer Linear Programs (ILPs) [HS09].
A third, now outdated, approach is the integration of microarchitectural
analysis and path analysis into one single ILP. This was only attempted for simplified
architectures, where only a cache was modeled, and even there major scalability
problems were found [LMW95; LMW96].
2.2. WCET Analysis for Uninterrupted Single Tasks 21
2.2.2 Parametric WCET analysis
One problem of the “classical” approach to WCET analysis as shown in the last
subsection is that the result is a single numeric value. For software which shows
signiﬁcant variation in its runtime, it would be better if the WCET analysis yielded
a formula which describes the WCET depending on the program’s input parame-
ter values. This concept is called parametric WCET analysis. If the variables in
the WCET formula are input values, we speciﬁcally call it input-parametric. Ac-
cording to an industrial case study from Gustafsson and Ermedahl [GE07] the lack
of parametric WCET analysis tools is one major shortcoming of present WCET
analyzers.
The main problem for input-parametric WCET analysis is the path analy-
sis stage, which now must produce a WCET formula describing the longest path
through the CFG depending on input values which may trigger diﬀerent behaviors of
loop and condition structures in the program under analysis. Methods for construct-
ing such formulas are stepwise CFG reductions [AAN11], algebraic simpliﬁcation
of weighted path expressions [HPP12], solving symbolic ILP problems [AHL+08;
BL08], custom graph-ﬂow algorithms [BEL11] and symbolic evaluation [Bli02].
A less complex but also more restricted approach is the identiﬁcation of exe-
cution scenarios prior to WCET analysis. These are often present in the software
structure if the code is able to run in diﬀerent input-deﬁned modes. For each iden-
tiﬁed scenario a separate WCET analysis can be conducted with user annotations
that force the analyzer to limit the examined program paths to those of the partic-
ular scenario [LPW09; MA11; HKB+14]. The generated WCET formula thus only
assigns a single numeric WCET value to each scenario, which is identiﬁed by the
inputs that trigger this scenario.
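
The scenario-based variant described above can be pictured as a lookup from input-defined modes to per-scenario WCET bounds. The following is a minimal sketch under stated assumptions: the scenario names, the classifier `select_scenario` and all cycle counts are hypothetical and only stand in for the results of separate, annotation-restricted analysis runs.

```python
# Hypothetical per-scenario WCET bounds in cycles. In practice each value
# would come from one WCET analysis run restricted to that scenario.
SCENARIO_WCET = {
    "idle": 1200,
    "normal": 4800,
    "emergency": 9500,
}


def select_scenario(mode_flag: int) -> str:
    """Map an input value to the scenario it triggers (assumed encoding)."""
    if mode_flag == 0:
        return "idle"
    if mode_flag == 1:
        return "normal"
    return "emergency"


def parametric_wcet(mode_flag: int) -> int:
    """Scenario-based 'WCET formula': one numeric bound per scenario."""
    return SCENARIO_WCET[select_scenario(mode_flag)]


def global_wcet() -> int:
    """A non-parametric analysis has to report the maximum over all scenarios."""
    return max(SCENARIO_WCET.values())
```

In this toy setting, a caller that knows the task runs in the idle mode can budget 1200 cycles instead of the global bound of 9500 cycles.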
Apart from input-parametric WCET analysis, there have recently also been ef-
forts to establish architecture-parametric WCET analysis [RD14]. In the latter case,
the WCET formula is a function of architectural properties of the system under
analysis. This mainly aﬀects the microarchitectural analysis, which must now gen-
erate results for all possible target hardware conﬁgurations. Although architecture-
parametric WCETs would be useful, input-parametric WCETs are a more urgent
problem in practice [GE07].
2.2.3 Hybrid WCET analysis
As a counterpart to static WCET analysis, dynamic WCET analysis refers to end-
to-end measurements of the program runtime on the actual hardware. These will
not be safe unless all input parameter and initial system conﬁguration combinations
were exercised, which is infeasible for any real-world system.
Thus, to regain some confidence in the results supplied by dynamic WCET
analysis, hybrid WCET analysis was proposed. It replaces the value and microar-
chitectural analysis from Figure 2.4 with a measurement of the runtime of each CFG
node and then continues with the known path analysis methods from static WCET
22 Chapter 2. Timing Analysis Concepts
analysis to derive the WCET estimate. Since the measured CFG node runtimes
are not guaranteed to be safe, this method is only suited for soft real-time tasks,
where precision of the estimate is the main concern. An industrial analyzer which
uses this technique is the RapiTime tool from Rapita Ltd. [Ltd14]. To increase the
conﬁdence in this method, test input generation heuristics like the Balanced Path
Generation [BZT+11] have been proposed.
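
The hybrid idea can be illustrated with a small sketch: per-node runtimes are taken from measurements, and a path search then combines them into an estimate. This is a simplification under stated assumptions: the CFG is assumed to be acyclic (loops already unrolled up to their bounds), a plain longest-path search replaces the ILP-based path analysis, and the graph and all sample values are hypothetical.

```python
from functools import lru_cache


def hybrid_wcet_estimate(cfg, measured, entry):
    """Combine measured per-node runtimes with a longest-path search.

    cfg:      dict mapping each CFG node to its list of successors (acyclic)
    measured: dict mapping each CFG node to a list of observed runtimes
    """
    # Use the largest observed runtime per node. This is NOT a safe bound,
    # because the true worst case may never have been observed.
    node_cost = {node: max(samples) for node, samples in measured.items()}

    @lru_cache(maxsize=None)
    def longest_from(node):
        successors = cfg.get(node, [])
        return node_cost[node] + max(
            (longest_from(succ) for succ in successors), default=0
        )

    return longest_from(entry)


# Hypothetical if-then-else CFG with made-up measurement samples (cycles).
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
measured = {"entry": [10, 12], "then": [30, 41], "else": [25, 27], "exit": [5, 5]}
estimate = hybrid_wcet_estimate(cfg, measured, "entry")
```

The estimate combines the maximum observed node times along the longest path even if no single measurement run exhibited all of them together, which is the main advantage over pure end-to-end measurement.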
2.2.4 Early-Stage WCET analysis
Typically, WCET analysis can only be applied after the complete binary code of the
program to be analyzed is available and the hardware platform on which it should
run is known. This is needed for the WCET analysis to be safe: since all program
execution paths on the platform must be considered, both the program and
the platform must be known. Early-stage WCET analysis abandons safeness in
favor of getting early results, even for models of systems which are not yet
available in binary form. For these models, a safe estimation is hardly possible;
instead, efficient, early feedback on the achievable WCET is desired. For this
purpose, mostly regression with linear models is used [GAE+09], which are trained
with WCET values of high-level programs [AEL+11]. The results are not safe, nor
is there any known bound on the maximum deviation from the true WCET, but
empirical results indicate that early-stage WCETs have a deviation of less than 20%
compared to the ﬁnal, safe WCET values.
2.2.5 Statistical WCET analysis
Since any computing hardware has some deﬁned mean time before failure, statistical
WCET analysis is an attempt to compute a probability distribution of execution
times of a task on a given platform. If the probability for a violation of the task’s
deadline is at least as small as the probability of a hardware failure, the remaining
failure probability can be acceptable [Höf12; KVA+13]. Still, many assumptions
have to be veriﬁed before applying this concept, like the statistical independence of
the probability distributions of software and hardware events [SLL+11]. In practice,
proving this independence will be challenging and possibly lead to unsafe results if
invalid assumptions are made.
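
To make the role of the independence assumption concrete, the following sketch combines the execution time distributions of two program parts by convolution and compares the resulting deadline-miss probability with a hardware failure probability. Everything in it is an illustrative assumption: the distributions, the deadline and the failure probability are made up, and the convolution step is only valid if the two distributions really are statistically independent.

```python
from collections import defaultdict
from fractions import Fraction


def convolve(dist_a, dist_b):
    """Distribution of the sum of two INDEPENDENT discrete execution times."""
    result = defaultdict(Fraction)
    for time_a, prob_a in dist_a.items():
        for time_b, prob_b in dist_b.items():
            result[time_a + time_b] += prob_a * prob_b
    return dict(result)


def exceedance(dist, deadline):
    """Probability that the execution time exceeds the deadline."""
    return sum(prob for time, prob in dist.items() if time > deadline)


# Hypothetical execution time distributions (cycles -> probability).
block1 = {10: Fraction(9, 10), 50: Fraction(1, 10)}
block2 = {20: Fraction(99, 100), 100: Fraction(1, 100)}
task = convolve(block1, block2)

miss_probability = exceedance(task, deadline=120)
hardware_failure_probability = Fraction(1, 1_000_000)  # assumed value
acceptable = miss_probability <= hardware_failure_probability
```

If the two blocks share, e.g., cache state, the product `prob_a * prob_b` no longer describes the joint behavior and the computed miss probability may be too optimistic.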
A relatively new approach to this problem is Probabilistic Timing Analysis
(PTA), which tries to justify these independence claims by requiring hardware
in which everything is randomized as far as possible, from the memory layout and
microarchitectural aspects to the cache placement and replacement policies [CQV+13].
However, a recent publication [Rei14] shows that the handling of randomized caches
in current PTA approaches is unsafe, and that deterministic replacement policies
like LRU perform consistently better, even in the context of PTA. It is also shown
that statistical independence of microarchitectural event distributions, which is the
basis of PTA, was often wrongly assumed [Rei14]. Therefore, this analysis branch
can only be considered highly experimental in its current state.
2.2.6 WCET-friendly Hardware Design
Even mainstream processor development has acknowledged the need for more pre-
dictable hardware, though only to a limited extent. As an example, the ARM
Cortex-R series of processors [ARM14a] features local scratchpads for each core
and a time-predictable interrupt handling. Nevertheless, it still contains a deep su-
perscalar pipeline with speculative execution and instruction pre-fetching. Though
these are not necessarily features which hamper the “observed predictability”, i.e.,
the timing variation between multiple independent measurement runs, they still
complicate a static WCET analysis by increasing the hardware state space consid-
erably. The Inﬁneon AURIX series [Inf14] shows similar properties from the analysis
point of view, though with a shallower pipeline and a higher focus on safety and
security. Lastly, the popular Kalray MPPA 256 manycore [KAL14] is an example
of how power-awareness can also increase predictability. To achieve power
efficiency while maintaining high performance, it implements a VLIW architecture,
which is not only more power-efficient than dynamic out-of-order processors but also
fundamentally easier to analyze.
In contrast to that, there exist multiple academic proposals for more predictable
hardware. The most extreme of these are proposals to make the hardware capable
of running in a worst-case mode [ORM+09] or to design hardware with constant-
execution-time instructions such as the Java-Optimized Processor [SPP+10] or the
PRET machine [LRB+12].
Other ideas include the instantiation of a dedicated DRAM-refresh software task
to replace the unpredictable hardware DRAM refresh [BM11] and the adaptation of
the cache hierarchy [HPS12] or the load-store buffer [MR12] to the needs of WCET
analysis.
2.2.7 Experiences with Practical Application of WCET Analysis
Multiple case studies have been conducted on the usability of static WCET analysis
in practice. These studies indicate that the tools are usable by non-experts [TSH+03]
and that a maximum overestimation between 4% and 33% is achievable [GE07;
CSB+10]. Compared to “traditional” methods which measure the runtime and add a
safety margin, improvements of approximately 10% were observed [TSH+03]. Since
static analyses must be able to safely bound all program loops, the user may have
to provide such bounds if they cannot be found automatically. This is an issue
especially for loops in operating system kernels, which are often data-dependent.
The manual bounding of such loops is time-consuming and often requires assistance
from the original developers of the respective binary code [GE07].
For tools which operate on the source level, like SWEET, the development
toolchain also poses problems, since the analyzer needs to integrate with it to find all
source files, which might be written in differing, incompatible source-language dialects
[LES+13]. Even worse, the source code might be missing altogether because a
[Figure 2.5: The execution paths of a program on a timing-anomalous system depicted
by transitions between hardware states. Axes: execution event over time. Concrete
states l˜0, l˜1, l˜2, l˜i, l˜i+1, l˜lwc, l˜n; abstract states l0, li, llwc, ln.]
particular component is supplied by a subcontractor [LES+13]. This provides strong
arguments in favor of binary-level analysis, where these problems do not exist.
Finally, the analyzed software often operates in modes; therefore, a WCET result
which is parametric in the mode and input would be highly desirable [GE07]. This
is supported by the finding that the separation of multiple scenarios which use a
common code base can reduce the resulting WCETs by up to 70% [HKB+14].
The WCET tool challenge was established to compare the results of diﬀerent
analysis tools [HGB+08]. Since the market is still very fragmented and it is already
hard to ﬁnd multiple tools which target the same hardware conﬁguration, there are
still no quantitative results comparing diﬀerent analyzers.
The author is not aware of industrial case studies on hybrid, parametric, early-
stage or statistical WCET analysis. The lack of such studies may also reﬂect the
fact that static WCET analysis is the oldest variant and the tools produced in this
area are more mature and ready for industrial use.
2.2.8 Timing Anomalies
Timing anomalies are a phenomenon which is observed on some architectures and
which complicates static WCET analysis. During WCET analysis, we want to de-
termine all possible concrete hardware states (CHS) for every node in the CFG
to find the worst-case path through the program. An example of a worst-case
execution sequence is given in Figure 2.5, where each dot is a CHS. The figure
shows the worst-case execution of the program under analysis in the CHS sequence
$(\tilde{l}_0, \tilde{l}_1, \tilde{l}_2, \ldots, \tilde{l}_i, \tilde{l}_{i+1}, \ldots, \tilde{l}_n)$. Of course, the machine works deterministically; therefore,
each CHS has only a single successor, which can be determined using the $\llbracket\cdot\rrbracket_{conc}$
cycle-step semantics from Section 2.1. As we have already seen there, we need to introduce
abstract semantics $\llbracket\cdot\rrbracket_{abs}$ which work on the abstract hardware states (AHS)
to make the analysis decidable. Each AHS contains one or more CHS, or to be more
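[Figure 2.6: Examples of timing anomalies [RWT+06]. (a) Scheduling anomaly: two
schedules of instructions A to E over time; Resource 1 executes A, D, E in both,
Resource 2 executes C, B in the upper and B, C in the lower schedule.
(b) Speculation anomaly: execution over time for a cache hit (A, B, C) and for a
cache miss (A, C).]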
precise, the AHS and the CHS form a Galois connection as shown in Figure 2.1.
In Figure 2.5, the ellipses represent AHS; thus the initial abstract state $l_0$ contains
at least the concrete states $\tilde{l}_0$, $\tilde{l}_1$ and $\tilde{l}_2$. Since an AHS may contain multiple CHS,
the cycle step for an AHS is no longer deterministic. Depending on the contained
concrete states we might end in one of multiple successor AHS, as visible at state
$l_i$. Concrete examples of sources of this kind of non-determinism are unknown addresses
of memory accesses (all possible memory access targets must be considered
in separate successor AHS) or unknown cache access behavior (cache hit and cache
miss must be tracked independently).
At this point, it would make WCET analysis easier if it were guaranteed that
only the local worst-case (the cache miss) may lead to the WCET of the program.
If this were true, we could restrict our search to the local worst-case successor AHS
and ignore all other possible successor states (e.g. the case of a cache hit). A Timing
Anomaly (TA) occurs when this assumption is violated, e.g. when a successor state
that does not represent a local worst-case behavior nevertheless exclusively leads
to the global worst-case behavior. To deﬁne the notion “local worst-case”, we need
to deﬁne a local scope, i.e., the processing of some microarchitectural event such
as a single cache miss or maybe a single instruction. In Figure 2.5, the rectangle
shows such an event. For simplicity assume that a cache access is modeled. This
scope contains all CHS which contain the processing of this cache access. The local
worst-case is given by the state $\tilde{l}_{lwc}$ (AHS $l_{lwc}$), but it does not lead to the global
WCET. Therefore, in such a case, the WCET analysis cannot assume that the only
relevant successor of $l_i$ is $l_{lwc}$, but instead has to explore every possible successor.
This definition was first given by Reineke et al. [RWT+06]. The authors also give
concrete examples of timing anomalies; the two best-known ones are depicted
in Figure 2.6. In Figure 2.6a two possible execution paths are shown assuming an
out-of-order processor with two execution units (resources), which are utilized by
instructions A, B, C, D, and E. The arcs visualize the data or control-ﬂow depen-
dencies between the instructions. The execution time of instruction A is assumed
to be unknown and variable, all other execution times are ﬁxed. Contrary to naive
assumption, a short runtime of A leads to the WCET for this sample program (as
shown in the lower half of Figure 2.6a), not a longer one (as shown in the upper
half). This is caused by the dynamic scheduling of the resources in the out-of-order
core, i.e., since B becomes ready for execution earlier than C in the lower sce-
nario, the execution order of B and C is reversed. Other known sources of timing
anomalies are found in caches when they are combined with speculative execution
and prefetching, a combination which is also often present in modern processors.
Figure 2.6b shows an example for such a timing anomaly, where node A performs
a cache access which might be a hit or a miss. In the case of a hit, speculative
prefetching for block B begins and pollutes the cache until the hardware recognizes
that the speculation was wrong and execution of B is aborted. Due to the cache
pollution, the runtime of C is increased, such that this case is the global worst-case
even though the cache hit was the local best-case. In the event of a cache-miss, the
miss takes so long that the branch condition is fully evaluated once the miss was
processed, therefore no speculation and cache pollution takes place here.
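
The scheduling anomaly of Figure 2.6a can be reproduced with a small simulation. The sketch below is not taken from [RWT+06]; it assumes a hypothetical core that issues one instruction per cycle in program order and dispatches, on each free resource, the oldest instruction that is issued and whose dependencies have finished. The resource assignment, the dependencies (B depends on A, D on C, E on D) and all durations are illustrative assumptions.

```python
def makespan(duration_a):
    """Total execution time of the sample program for a given runtime of A."""
    order = ["A", "B", "C", "D", "E"]                 # program order
    duration = {"A": duration_a, "B": 3, "C": 3, "D": 3, "E": 3}
    resource = {"A": 1, "B": 2, "C": 2, "D": 1, "E": 1}
    deps = {"A": [], "B": ["A"], "C": [], "D": ["C"], "E": ["D"]}
    issue = {name: cycle for cycle, name in enumerate(order)}

    finish = {}                  # finish time of every dispatched instruction
    free_at = {1: 0, 2: 0}       # cycle at which each resource becomes free
    pending = list(order)
    t = 0
    while pending:
        for res in free_at:
            if free_at[res] > t:
                continue         # resource is still busy
            for name in pending:  # oldest instruction first
                ready = (
                    resource[name] == res
                    and issue[name] <= t
                    and all(d in finish and finish[d] <= t for d in deps[name])
                )
                if ready:
                    finish[name] = t + duration[name]
                    free_at[res] = finish[name]
                    pending.remove(name)
                    break
        t += 1
    return max(finish.values())


short_a = makespan(2)   # local best case for A
long_a = makespan(3)    # local worst case for A
```

With these assumed parameters, the short variant of A lets B occupy Resource 2 before C, which delays C, D and E. The run with the shorter A therefore finishes later (14 cycles) than the run with the longer A (11 cycles), so an analysis that follows only the local worst case would underestimate the WCET.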
Timing anomalies can be further subdivided into constant-bounded anomalies
and domino effects [RS09]. A system suffers constant-bounded anomalies if for any
two concrete hardware states $l_i, l_j \in L$
\[ t_{term}(l_i) - t_{term}(l_j) \leq \Delta(l_i, l_j) \tag{2.8} \]
holds, where $\Delta : L \times L \to \mathbb{N}$ and $t_{term} : L \to \mathbb{N}$ is a function which returns the number
of cycles until the executed program is completed, counted from the given input
state on. A system suffers from domino effects if such a constant bound cannot
be found, i.e., if there are instruction sequences for which the runtime starting in
state $l_i$ is a multiple of the runtime starting in state $l_j$. In theory, both concepts
can be used to prune parts of the analysis search space even in systems with timing
anomalies [RS09], but this is only possible when microarchitectural analysis and
path analysis are combined, which is intentionally not the case in current static
analyzers in order to limit the analysis complexity (compare Figure 2.4).
Up to now, there has been only one publication on an approach to prove the
absence of TAs in a given architecture speciﬁcation [EPB+06]. Unfortunately, this
approach has never been tested for real architectures. In contrast, there exist a
number of existence proofs by example. The following combinations of hardware
features are known to exhibit timing anomalies:
• out-of-order, superscalar processors with execution units with non-overlapping
functionality [RWT+06],
• systems with caches, prefetching and speculative execution [RWT+06],
• pseudo-LRU caches [RWT+06],
• multi-cores with a round-robin arbitrated bus and private L1 caches [SHK14],
• MRU caches [Geb10] and
• partial ﬁlling of cache lines (cache streaming) [Geb10].
Related to timing anomalies are parallel timing anomalies which are defined in
[KKP09] as analysis errors that are introduced by a decomposition of the microarchitectural
analysis into multiple phases which analyze subsets of the hardware in
isolation. In practice, this is less relevant since the state of hardware components
is highly interdependent. A separation of cache, bus and pipeline analysis is usually
only possible at the expense of a decreased analysis precision.
2.2.9 Compositionality in WCET Analysis
To account for “extra delays” due to interference by other tasks, analysis approaches
often compute a single-task WCET and then add the delay that occurs due to phe-
nomena which were not considered in the single-task WCET computation. Many
analyses in Section 2.3 and Section 2.4 will follow this scheme. All of these ap-
proaches require timing compositionality, i.e., the notion that the WCET of a task
running on a system can be safely derived from separate contributions of individual
hardware components to the WCET.
In systems with timing anomalies, we have already seen that the occurrence
of a timing anomaly depends on an interaction between multiple hardware compo-
nents. If each component is analyzed in separation to achieve a timing-compositional
WCET, the occurrence of TAs can never be excluded and thus must be assumed
wherever possible. This will make the timing-compositional WCET for TA-prone
systems highly pessimistic. Due to their more limited impact on the WCET, the
case of constant-bounded TAs according to Equation 2.8 introduces less pessimism
than the case of domino eﬀects. On the other hand, WCETs for timing-anomaly-free
systems are always timing-compositional. Here we know that when we externally
increase the duration of a hardware event e by a duration d, e either already was
the local worst-case and thus the WCET increases by d, or e has not been the local
worst-case and the WCET increases by at most d, depending on whether e now
becomes local worst-case or not. This classification was first introduced by
Wilhelm et al. in [WGR+09] and has led to the categories
• fully timing-compositional (no timing anomalies),
• constant-bounded timing-compositional (only constant-bounded anomalies) and
• not timing-compositional (domino eﬀects).
Unfortunately, this well-established nomenclature is a bit misleading, since
timing-compositionality is also achievable for systems with domino effects, at the
expense of increased overestimation as mentioned above. Therefore, calling these systems “not
timing-compositional” is not fully appropriate. Recently, Hahn, Reineke and Wilhelm
further elaborated the definition to the following form [HRW13]:
Definition 8. Let $C$ be a system and $(C_i)_{i \in \{1,\ldots,n\}}$ its components with associated
state spaces $S$ and $(S_i)_{i \in \{1,\ldots,n\}}$. Furthermore, let the timing contributions
$(tc_i : S_i \to \mathbb{N}_0)_{i \in \{1,\ldots,n\}}$ together with state abstraction functions
$(a_i : S \to S_i)_{i \in \{1,\ldots,n\}}$ and combination operator $\oplus : \mathbb{N}_0^n \to \mathbb{N}$ be a
decomposition of the system’s timing $t_{term} : S \to \mathbb{N}_0$. We call the decomposition
$(\mu,\alpha)$-timing compositional where $\mu \in \mathbb{R}_{\geq 1}$, $\alpha \in \mathbb{R}^+_0$ if and only if
\[ \forall s \in S : \; t_{term}(s) \leq \bigoplus_{i=1}^{n} tc_i(a_i(s)) \leq \mu \cdot t_{term}(s) + \alpha \tag{2.9} \]
This deﬁnition is more akin to the original, intuitive understanding of timing
compositionality than the deﬁnitions from [WGR+09]. The leftmost inequality in
Equation 2.9 states that the timing decomposition must be safe, i.e., greater than
the most precise timing value tterm(s) whereas the rightmost inequality imposes a
bound on the precision loss due to the decomposition. The combination function ⊕
will be the addition operator in almost all cases.
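
For a finite toy state space, the two inequalities of Equation 2.9 can be checked by enumeration. The sketch below assumes addition as the combination operator and a made-up system with two components (a pipeline and a cache) in which up to two cycles of cache miss latency are hidden by the pipeline; the state encoding and all numbers are hypothetical.

```python
from itertools import product


def is_timing_compositional(states, t_term, contributions, mu, alpha):
    """Check t_term(s) <= sum_i tc_i(a_i(s)) <= mu * t_term(s) + alpha."""
    for s in states:
        composed = sum(tc(a(s)) for a, tc in contributions)
        if not (t_term(s) <= composed <= mu * t_term(s) + alpha):
            return False
    return True


# A state is a pair (cache_misses, pipeline_cycles).
states = list(product(range(4), [5, 8]))


def t_term(state):
    """Exact timing of the toy system: up to two miss cycles are overlapped."""
    misses, pipeline_cycles = state
    return pipeline_cycles + 10 * misses - min(misses, 2)


# Each entry pairs a state abstraction a_i with a timing contribution tc_i.
contributions = [
    (lambda s: s[1], lambda pipeline_cycles: pipeline_cycles),  # pipeline
    (lambda s: s[0], lambda misses: 10 * misses),               # cache
]

holds_1_2 = is_timing_compositional(states, t_term, contributions, 1, 2)
holds_1_0 = is_timing_compositional(states, t_term, contributions, 1, 0)
```

In this toy system the additive decomposition is safe but loses up to two cycles of precision, so it is (1, 2)-timing compositional but not (1, 0)-timing compositional.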
From the point of view of an analysis which requires timing-compositional WCET
results, “constant-bounded timing-compositional” now corresponds to a (1, α)-de-
composition and “fully timing-compositional” denotes a (1,0)-decomposition. Still,
an implicit connection to the deﬁnitions from [WGR+09] exists, since a (1,0)-de-
composition will be easy to ﬁnd for a timing-anomaly-free architecture, whereas for
architectures with TAs, a (1,0)-decomposition for more than one component (n > 1)
will only be achievable at the cost of degraded analysis precision. To the best of
the author’s knowledge, no publications exist which actually quantify the achievable
precision of timing-compositional WCETs for systems with and without TAs.
In the following, we will not use the nomenclature from [WGR+09], but ex-
plicitly state to which TAs a system is susceptible. When analyses require timing
compositionality in the (intuitive) sense of [HRW13], we will use (μ,α)-timing com-
positionality to denote this.
A diﬀerent problem which is also termed “timing-compositionality” in some pub-
lications is the construction of software from multiple, independent and possibly
precompiled software components. For WCET analysis this is a problem, since the
intention of component-based design is that the developer can use the components
as black boxes fulfilling a particular task. However, during WCET analysis the
microarchitectural behavior and the worst-case iteration count of loops in the com-
ponent must be known. This leads to the problem that the person conducting the
WCET analysis has to manually bound the loops in the code of the component as
already mentioned in Section 2.2.7, which is often highly time-consuming. To resolve
this, a standardized summary format for binaries was proposed which sums up the
eﬀects that the binary code has for given data-ﬂow analyses [BCM09]. This would
enable the inclusion of the semantics of black-box components into a bigger data-ﬂow
analysis, e.g. the microarchitectural analysis from Figure 2.4. Still, the approach
requires a major standardization eﬀort, since every data-ﬂow analysis uses diﬀerent
domains and formats. For the path analysis (cf. Figure 2.4), a constraint-logic-
programming-based approach was presented in [Mar10] which requires parametric
WCETs for the components.
2.3 Timing Analysis of Sequential Multi-Task Systems
When multiple real-time tasks are executed on one core, usually a Real-Time Oper-
ating System (RTOS) is employed to schedule and decouple the tasks. For WCET
analysis this poses the problems that
• single-task execution may be preempted by interrupts and successively running
higher-priority tasks and
• tasks will be using system calls to communicate with the RTOS, thus the
timing of these system calls must be analyzable.
Traditional system calls in addition are a separate kind of transfer of control that
is beyond the capabilities of conventional WCET analyzers as introduced in Sec-
tion 2.2. This problem is alleviated by the fact that many RTOSes give up the
traditional strong boundary between the OS and user code for performance reasons.
In these cases, the RTOS and the user code are compiled into a single binary and
system calls collapse to normal function calls which are again amenable to standard
WCET analysis. Nevertheless, the two main issues as listed above remain. In the
following subsections we will review the state of research concerning these points.
2.3.1 Accounting for the Timing Behavior of System Calls
One of the most important standards for RTOS design in the automotive domain
is AUTOSAR [AUT09]. Even though this standard is already tailored towards
real-time automotive systems, it is still not trivial to achieve a time-predictable
and eﬃcient implementation of inter-task communication system calls as shown
in [FDG+09]. The authors of this work do not consider static bounds on the duration
of communication calls, but instead they examine the time predictability of the calls
from a measurement point of view. In fact, a time-predictable implementation of
system calls not only helps the WCET analysis, but also aids additional measurements
of the timing on an example system.
For a real-time operating system which is tailored towards WCET analysis, it is
often possible to determine static bounds on the runtime of the individual system
calls as shown in the MERASA project [WGK+10]. Nevertheless, for existing OSes,
which were not designed with analyzability in mind, this is a considerable eﬀort
whose success has proven to be limited [LGZ+09]. The main problems which are
discussed by Lv et al. are lack of
• parametric WCET tools since system calls show highly variable behavior,
• integration between the WCET analysis of the RTOS and the analysis of the
user tasks to achieve better precision,
• integration of schedulability analysis and WCET analysis to reach more precise
results and
• automation of the WCET analysis, since many user interactions are needed
for complex code.
In contrast to timing compositionality which denotes that a WCET value can
be broken down into multiple contributions from diﬀerent hardware components (cf.
Section 2.2.9), timing composability describes that the timing behavior of a piece of
software is independent of other software running in parallel or in sequential alternation
on the same core. Timing composability is a highly desirable feature for RTOS
implementations, since only then can different tasks be analyzed in isolation from
each other before they are integrated into the system. A possible implementation
of a timing-composable operating system based on the ARINC standard is given
by Baldovin, Mezzetti and Vardanega in [BMV12], where necessary criteria to
achieve timing composability are also defined.
In summary, there are a few approaches which construct dedicated,
WCET-analyzable operating systems, but for standard operating systems, even for
standard RTOSes, WCET analysis is highly complicated due to the points identified
by Lv et al. [LGZ+09] as mentioned above.
2.3.2 Accounting for Task Interaction Impacts on the WCET
If the scheduler can decide to preempt a real-time task, its WCET alone is no
longer suﬃcient when reasoning about whether it will meet its deadline or not.
In this case, the Worst-Case Response Time (WCRT) must be computed, which
is the WCET plus the delay introduced by preemptions of all other tasks. For
the case of ﬁxed-priority scheduling it has been shown that WCRT values can be
computed by solving a set of recurrence equations [JP86]. The original formulation
assumed static WCET values, which is not true for modern hardware, due to the
fact that pipelines and caches use part of the execution history to speed up future
calculations. To be precise, a WCET analysis has to consider these history eﬀects in
the microarchitectural analysis. Each preemption invalidates the results of a single-
task WCET analysis, since when the preempted task (the preemptee) is resumed
• the pipeline must be reloaded with instructions from this task, and
• the caches and all cache-like structures, such as branch prediction buffers, may
have been modified by the preempter.
The former point is a small, local eﬀect which can be bounded by a constant number
of cycles, whereas the latter is a major issue since a modiﬁed cache state can lead
to additional delays throughout all following program parts. This non-local eﬀect is
called the Cache-Related Preemption Delay (CRPD). Since the CRPD is dependent
on the number of preemptions, which in turn is dependent on the WCRT, there is
a cyclic dependency between CRPD and WCRT determination.
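
The cyclic dependency can be resolved by a fixed-point iteration. A common textbook form of the recurrence for fixed-priority scheduling is $R_i = C_i + \sum_{j \in hp(i)} \lceil R_i / T_j \rceil \cdot (C_j + \gamma)$, where $hp(i)$ is the set of higher-priority tasks. The sketch below iterates this equation; the task periods $T_j$, the constant per-preemption delay $\gamma$ (parameter `crpd`) and all numbers are assumptions for illustration and are not taken from [JP86].

```python
def wcrt(wcet, higher_priority, crpd=0, deadline=None, max_iterations=1000):
    """Iterate R = C + sum(ceil(R / T_j) * (C_j + crpd)) to a fixed point.

    wcet:            WCET C_i of the task under analysis
    higher_priority: list of (C_j, T_j) pairs of all higher-priority tasks
    crpd:            assumed constant delay charged per preemption
    Returns the response time, or None if no fixed point is found in time.
    """
    response = wcet
    for _ in range(max_iterations):
        interference = sum(
            -(-response // period) * (cost + crpd)   # integer ceil division
            for cost, period in higher_priority
        )
        next_response = wcet + interference
        if next_response == response:
            return response                          # fixed point reached
        if deadline is not None and next_response > deadline:
            return None                              # deadline already missed
        response = next_response
    return None


# Hypothetical task set: two higher-priority tasks given as (WCET, period).
tasks = [(1, 5), (2, 10)]
r_without_crpd = wcrt(3, tasks)
r_with_crpd = wcrt(3, tasks, crpd=1)
```

Each additional preemption enlarges the response time, which in turn may admit further preemptions; the iteration stops once the response time no longer grows.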
Therefore, Lee et al. combined the WCRT computation from [JP86] with an
integrated determination of the CRPD [LHS+98]. They introduced the notion of
Useful Cache Blocks (UCB) which are those memory blocks of the preemptee that
may be re-used. Only if one of these blocks is evicted from the cache by a preempter,
the timing of the preemptee will be inﬂuenced. By providing an estimate of the
number of UCBs, they are able to bound the CRPD and therefore the WCRT of the
tasks. To make this estimate more precise, the cache behavior of the preempter can
also be taken into account in the form of an estimation on its number of Evicting
Cache Blocks (ECB) [TD00], i.e. those blocks that are touched by the preempter.
The intersection of UCBs and ECBs provides a more precise bound on the number
of cache misses due to a preemption than obtainable with the UCBs alone.
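
The gain from intersecting UCBs and ECBs can be sketched for the simplest case. The example assumes a direct-mapped cache in which a memory block maps to cache set `block % num_sets`, a constant miss penalty, and UCB and ECB sets that are already known; all block numbers and parameters are hypothetical.

```python
def cache_sets(blocks, num_sets):
    """Cache sets occupied by the given memory blocks (direct-mapped cache)."""
    return {block % num_sets for block in blocks}


def crpd_bound_ucb_only(ucbs, num_sets, miss_penalty):
    """Pessimistic bound: every useful cache block is assumed to be evicted."""
    return len(cache_sets(ucbs, num_sets)) * miss_penalty


def crpd_bound_ucb_ecb(ucbs, ecbs, num_sets, miss_penalty):
    """Only useful blocks in sets that the preempter touches can be evicted."""
    conflicting = cache_sets(ucbs, num_sets) & cache_sets(ecbs, num_sets)
    return len(conflicting) * miss_penalty


# Hypothetical memory block numbers for a cache with 4 sets.
ucbs = {0, 1, 6}      # useful blocks of the preemptee -> sets {0, 1, 2}
ecbs = {9, 11}        # blocks touched by the preempter -> sets {1, 3}

bound_ucb = crpd_bound_ucb_only(ucbs, num_sets=4, miss_penalty=10)
bound_both = crpd_bound_ucb_ecb(ucbs, ecbs, num_sets=4, miss_penalty=10)
```

Here only one of the three useful blocks shares a cache set with the preempter, so the bound drops from 30 to 10 cycles.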
Whereas Lee et al. [LHS+98] use the UCBs as a single numeric value specifying
the number of such blocks, Negi, Mitra and Roychoudhury propose to work on
the level of abstract cache states of preempter and preemptee directly, instead of
abstracting both to a simpliﬁed numeric value [NMR03]. This allows for a higher
precision but also incurs an exponentially increased analysis eﬀort.
Altmeyer and Burguière [AB09] develop the UCB concept further to also account
for Deﬁnitely Cached Useful Cache Blocks (DC-UCB). The DC-UCBs are a subset
of the UCB which contains all blocks that must be cached. With this additional
concept they can avoid some overestimation in the CRPD analysis by identifying
where DC-UCBs can be used to eliminate the double accounting of a single block
eviction.
Altmeyer, Maiza and Reineke further reﬁned the CRPD for set-associative caches
by noting that even if some UCBs and ECBs collide, the UCBs are only evicted if
the cache set in which they are placed cannot hold all ECBs. Based on the abstract
cache block age information, they determine a resilience value [AMR10] per cache
block at each program point which determines how many ECBs this block can
take without being evicted. Finally, the same authors show that FIFO and Pseudo-LRU
(PLRU) replacement strategies cannot be analyzed with the known UCB/ECB
procedure, but they establish the concept of relative competitiveness [BRA09], which
denotes that a CRPD and WCET estimate for FIFO and PLRU can be obtained
by analyzing an appropriately shrunk LRU cache. It turns out that, to make this
work, the shrinking needs to be more aggressive for FIFO than for PLRU.
Kleinsorge, Falk and Marwedel [KFM11] proposed a method to compute UCB,
ECB and resilience information in a single analysis pass with a full cache state representation,
similar to [NMR03], which further increases the precision while removing
the analysis overhead for resilience computation.
Apart from implicit influence via the CRPD, tasks also have explicit influence
on each other's timing via the invocation of potentially blocking communication
system calls. The bounding of end-to-end message delivery delays has not been
solved in general, but for the special case of tasks which only send messages when
they terminate, solutions exist [TBW95].
Finally, interactions on global variables may also have timing eﬀects. Research on
this topic is still very limited, but data-ﬂow analysis on multi-task systems could po-
tentially be a solution. A recent publication from Mittermayr and Blieberger [MB11]
demonstrates how the potential interleavings of the threads’ instructions can be ef-
ﬁciently explored when exploiting the synchronization structure of programs. As a
side eﬀect, their framework can also be used to bound the synchronization delay in
a multi-task system with lock-protected shared resources.
In summary, the determination of implicit interactions like the CRPD is well-
understood for at least the case of ﬁxed-priority scheduling and LRU caches. The
overestimation for these eﬀects has been considerably lowered by the reﬁnements
mentioned above. Still, it is worthwhile to mention that all of the above approaches
require (μ,α)-timing-compositional architectures to work, since an extra delay is
added to the single-task WCET. This means that potentially they are only applicable
to timing-anomaly-free architectures without loss of precision. Bounds on the timing
behavior of explicit interactions between the tasks are still an active ﬁeld of research,
where no standard solution has been established.
2.3.3 Schedulability of Multi-Task Systems with Given WCETs
Schedulability is a synonym for the existence of a schedule under which each task
from a given task set meets its deadline. If the WCRTs of the tasks have been
computed for some type of schedule, testing for schedulability is trivial, since only
the WCRTs and the deadlines need to be compared.
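The comparison described above amounts to a few lines of code. The following sketch assumes that WCRTs and relative deadlines are given in the same time unit; the structure and function names are chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Timing data of one task; both values use the same unit (e.g. cycles). */
typedef struct {
    unsigned long wcrt;     /* worst-case response time */
    unsigned long deadline; /* relative deadline */
} task_timing_t;

/* A task set is schedulable under the analyzed schedule if no task's
 * WCRT exceeds its deadline. */
bool all_deadlines_met(const task_timing_t *tasks, size_t n)
{
    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (tasks[i].wcrt > tasks[i].deadline)
            return false;
    }
    return true;
}
```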
If this is not possible, e.g., because no WCRT algorithm is known for the consid-
ered schedule, the Real-Time Calculus (RTC) oﬀers possibilities to model a system
as a network of communicating components whose interaction is described by event
arrival curves and service curves. Finally, these can be used to analyze the
schedulability of the described task set [TCN00; PC07]. This line of research has led to
a number of software tools which support the analysis of a real-time system with
the help of the RTC, such as MAST [HGG+01] and SymTA/S [HHJ+05]. Similar
to the WCRT/CRPD analysis mentioned above, the RTC-based modeling methods
require (μ,α)-timing-compositionality with the mentioned drawbacks.
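To give an impression of what an event arrival curve looks like, the following sketch evaluates the commonly used upper and lower arrival curves of a periodic event stream with jitter, i.e., the maximum and minimum number of events in any time window of length delta. The function names and the restriction to a period/jitter model are assumptions made for this illustration; the cited tools support richer event models.

```c
#include <assert.h>

/* Upper arrival curve: ceil((delta + jitter) / period) events can arrive
 * at most in any half-open window of length delta. */
unsigned long arrival_upper(unsigned long delta, unsigned long period,
                            unsigned long jitter)
{
    assert(period > 0);
    if (delta == 0)
        return 0;
    return (delta + jitter + period - 1) / period;
}

/* Lower arrival curve: at least floor((delta - jitter) / period) events
 * arrive in any window of length delta, and never fewer than zero. */
unsigned long arrival_lower(unsigned long delta, unsigned long period,
                            unsigned long jitter)
{
    assert(period > 0);
    if (delta <= jitter)
        return 0;
    return (delta - jitter) / period;
}
```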
As an alternative to RTC-based methods, real-time schedulability theory can
be used. It provides necessary and sufficient conditions which can show that a
schedule must exist or cannot exist [Mar11, Chapter 6.2]. Unfortunately, these
static conditions almost always assume a fixed WCET per task, which makes it
impossible to precisely account for the CRPD.
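One classical condition of this kind is the utilization test for preemptive earliest-deadline-first scheduling of independent periodic tasks on a single core whose deadlines equal their periods: such a task set is schedulable if and only if its utilization does not exceed 1. The sketch below is given only as an example of such a static condition. Note that each task enters the test with exactly one fixed WCET value, so a CRPD could only be included by pessimistically inflating that value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A periodic task with a fixed WCET; both fields use the same time unit. */
typedef struct {
    double wcet;
    double period;
} periodic_task_t;

/* Processor utilization: sum of WCET / period over all tasks. */
double utilization(const periodic_task_t *tasks, size_t n)
{
    double u = 0.0;
    for (size_t i = 0; i < n; ++i) {
        assert(tasks[i].period > 0.0);
        u += tasks[i].wcet / tasks[i].period;
    }
    return u;
}

/* Utilization test for single-core EDF with implicit deadlines. */
bool edf_schedulable(const periodic_task_t *tasks, size_t n)
{
    return utilization(tasks, n) <= 1.0;
}
```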
2.4 Timing Analysis of Parallel Multi-Task Systems
Since approximately 2005 the hardware market has inevitably shifted towards multi-
core systems [Sut12], also in the embedded domain [Fre09]. Therefore, interactions
of tasks which execute in sequential alternation, possibly preempting each other and
possibly in interaction with an RTOS are no longer the only complication for WCET
analysis. On modern hardware, it also has to account for possible timing eﬀects of
tasks that execute concurrently.
Since this is the main topic of this thesis, we will only provide some general
ideas on how parallel timing interaction can be handled. In Chapter 5 we will then
examine the speciﬁc problems of WCET analysis for parallel tasks in more detail.
We will only analyze the behavior of a given parallel system – the creation of such
a system, whether by explicitly parallel programming, by generic parallelization
techniques [Mid12] or by multi-criteria-aware embedded parallelization [CEN+13]
is a topic of its own.
In the following discussion, we distinguish between closely-coupled and
loosely-coupled parallel systems, e.g., multi-cores and distributed systems, respectively,
because the granularity of the timing effects which are to be analyzed differs in
these two cases.
2.4.1 Multi-Core Systems
An example of the acuteness of the shift to multi-cores in industrial practice is the
growing number of publications on how to transfer the existing AUTOSAR standard
for automotive software to multi-cores [AUT09]. New problems that arise in this
field are
• how to map tasks to cores while controlling the communication overhead and
fulﬁlling scheduling constraints [GHK+11],
• how to extend the priority ceiling protocol [Mar11], which guarantees the ab-
sence of priority inversion, to multi-cores [KYM+09], building on previous
work on multi-processor priority ceiling [RSL88; CT94],
• how to support migration of legacy code to multi-cores [JMR10].
Schneider, Bohn and Rößger [JMR10] classify dependencies between tasks into
the categories “none”, “mutual exclusion”, “precedence” and “temporal distance”,
where the latter denotes a precedence constraint with the additional requirement
that a minimum amount of time passes between the termination of the ﬁrst task
and the starting of the second. They emphasize the importance of tools to visualize
execution and scheduling scenarios by which the developer can verify that these
dependencies are met. WCET analysis can be one of those tools, since it also de-
livers a value analysis which [JMR10] names as an important factor to determine
(possibly implicit) dependencies between legacy AUTOSAR tasks. It is also shown
that real-time software realizes the named dependencies not only via classical syn-
chronization management but also via interrupt enabling/disabling and task priority
levels. From the perspective of WCET analysis this is favorable, since the timing of
explicit synchronization primitives is often hard if not impossible to determine (cf.
Section 2.3.2).
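The four dependency categories can be made concrete with a small checker that tests one recorded pair of execution intervals against a dependency. This is a simplified sketch: the type names are hypothetical, intervals are half-open, and for mutual exclusion the intervals are assumed to denote the critical sections rather than the whole task executions.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DEP_NONE,
    DEP_MUTUAL_EXCLUSION,
    DEP_PRECEDENCE,
    DEP_TEMPORAL_DISTANCE
} dep_kind_t;

/* Half-open execution interval [start, end). */
typedef struct {
    unsigned long start;
    unsigned long end;
} exec_interval_t;

/* Checks whether one observed pair of intervals respects the dependency.
 * min_distance is only used for DEP_TEMPORAL_DISTANCE. */
bool dependency_met(dep_kind_t kind, exec_interval_t first,
                    exec_interval_t second, unsigned long min_distance)
{
    assert(first.start <= first.end && second.start <= second.end);
    switch (kind) {
    case DEP_NONE:
        return true;
    case DEP_MUTUAL_EXCLUSION: /* the intervals must not overlap */
        return first.end <= second.start || second.end <= first.start;
    case DEP_PRECEDENCE:       /* second starts after first has ended */
        return first.end <= second.start;
    case DEP_TEMPORAL_DISTANCE: /* precedence plus a minimum gap */
        return first.end <= second.start &&
               second.start - first.end >= min_distance;
    }
    return false;
}
```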
Since the timing of the components in a multi-core system is tightly coupled,
these systems are more amenable to an integrated analysis, which tries to ana-
lyze the state of all hardware components in an integrated way. Such approaches
have been made, based on the classical static WCET analysis framework from Fig-
ure 2.4 [CKR+12; KFM+14], based on model-checking [GEL+10] and on a com-
bination of both [LYG+10]. We will discuss these approaches in more detail in
Chapter 5.
Although real-time calculus-based methods have also been proposed for the anal-
ysis of multi-core architectural eﬀects [SE09], these methods are often too coarse-
grained for such an application, since they require (μ,α)-timing-compositional tim-
ing analyses for the pipeline and the caches. In a recent publication, Shah, Huang
and Knoll show that such analyses will be highly pessimistic for multi-cores, since
timing anomalies already occur in multi-cores with L1 caches and a round-robin
bus [SHK14].
Diﬃculties exist not only for WCET analysis but also for schedulability tests.
Extensions of classical schedulability theory towards multi-cores were attempted by
Li et al., who developed a schedulability test for global earliest deadline first (GEDF)
scheduling on periodic tasks [LAL+13]. Nemati and Nolte derived a schedulability
test for clustered ﬁxed-priority-scheduled parallel task sets [NN13] with sporadic
tasks. A constructive schedulability proof for periodic tasks was given by Moir and
Ramamurthy [MR99], who show that a valid schedule can be found in polynomial
time iﬀ such a schedule exists.
Some scheduling approaches also try to explicitly take hardware capabilities into
account as, e.g., the conﬂict behavior of the tasks in the shared L2-cache [ACD06] or
the distribution of memory accesses among local and remote memories [CCK+13].
These approaches are limited by the fact that they consider complex task activation
schemes where a precise tracking of hardware states as done in a WCET analysis
is infeasible due to the number of possible execution paths and task preemptions.
Therefore, they require a rather simple parametric WCET model, in which the
inﬂuence of the scheduling decisions on the WCET can be quickly determined.
2.4.2 Distributed Systems
In contrast to multi-cores, a network of processing nodes, also called a distributed
system, is much more loosely coupled. Events at single nodes may trigger messages
to other nodes which may start some kind of processing chain. Timing analysis of
distributed systems is mostly concerned with bounding the frequency and duration
of the messages and events in the system. Since the frequencies are generally much
lower than, e.g., the frequency of shared memory accesses in a multi-core, a higher
overestimation is acceptable. In this context, the Real-Time Calculus [Thi05] is
usually used to analyze the system behavior. An advantage of the high level of
abstraction that it provides is that complex task activation schemes can also be
modeled easily [HT07]. In addition, there exist attempts to couple the RTC with
timed automata [LPT09] to reduce the overestimation in the analysis.
A seminal example of how to make the behavior of individual network compo-
nents predictable enough such that methods like the RTC can reasonably well ap-
proximate them is the time-triggered architecture developed over multiple decades
by Kopetz et al. [KB03]. It is based on the idea that time-triggered operation of
systems provides the most predictable timing behavior and shows constructively
that such an architecture is feasible in practice. We will build upon this work in
Section 5.4 when building a WCET analysis for time-triggered multi-core resources.
Apart from the RTC-based methods, there are also analyses which integrate
the message runtime determination with classical WCRT computation for the case
of CAN [DBB+07] and FlexRay buses [PPE+08]. These analyses were tested in
practice to determine, e.g., the end-to-end delay of messages in an AUTOSAR en-
vironment [LBR10].

Chapter 3
WCC Framework
Contents
3.1 Related Work
3.2 Compiler Phases
3.3 Flow Fact Management
3.4 System Model
3.5 Extensions for Binary Input Files
The WCET-aware C Compiler (WCC) is the basis for the implementations of the
WCET analyses developed in this thesis. The rationale behind its design [FLT06;
FL10] is the integration of a WCET analyzer with a C compiler to give source and
machine code optimizations the opportunity to evaluate the eﬀect of their transfor-
mations on the WCET. Also, the WCET is broken down into the contributions of
the single basic blocks of the program to make it possible to focus speciﬁcally on
the optimization of the Worst-Case Execution Path (WCEP).
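The breakdown into basic block contributions can be sketched as follows, assuming that the analysis delivers a WCET and a Worst-Case Execution Count (WCEC) per block and that blocks off the WCEP have a WCEC of zero. The data layout and function names are illustrative and do not reflect the WCC's internal interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Per-block analysis result (hypothetical layout). */
typedef struct {
    unsigned long wcet; /* WCET of one execution of the block, in cycles */
    unsigned long wcec; /* worst-case execution count on the WCEP */
} block_timing_t;

/* Contribution of a single block to the task's WCET. */
unsigned long block_contribution(const block_timing_t *b)
{
    return b->wcet * b->wcec;
}

/* Task WCET as the sum of all block contributions. */
unsigned long task_wcet(const block_timing_t *blocks, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += block_contribution(&blocks[i]);
    return sum;
}

/* Index of the block with the largest contribution, i.e., the most
 * promising candidate for a WCET-aware optimization. */
size_t dominant_block(const block_timing_t *blocks, size_t n)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; ++i) {
        if (block_contribution(&blocks[i]) > block_contribution(&blocks[best]))
            best = i;
    }
    return best;
}
```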
It has been shown that applying an optimization aggressively wherever pos-
sible may as well be adverse to the WCET of the program [ZCS03]. Similarly,
WCET, Average-Case Execution Time (ACET) and code size can be conﬂicting
goals [LPF+10; PKF+11]. Therefore, a prediction of the optimization eﬀect in
terms of the considered target function has to be achieved to guide the compiler
optimization.
Since the WCC was originally intended as a drop-in replacement for a classical
compiler, there was little to no multi-task and multi-core support. This support is
needed at least in the WCET analysis to be able to account for interference by other
tasks and cores. Originally, the WCC exclusively used the analyzer aiT [Abs14a],
since this is a stable and industrially-proven WCET analysis solution. Unfortu-
nately, it is only applicable to single-task WCET analysis, and since it is a com-
mercial product, modiﬁcations are also not possible. Therefore, to have a ﬂexible
experimentation platform and since many components required for a WCET analysis
were already present, a WCC-internal WCET analysis was devised in the course of
this thesis. Chapters 4 and 5 will introduce this analyzer and its multi-core-speciﬁc
extensions. Finally, the WCC was extended to also handle binary input ﬁles, which
allows for a limited degree of modularity in the WCET analysis.
3.1 Related Work
Apart from WCC there are several other projects which have made attempts to
integrate a WCET analyzer with a compiler. The ﬁrst of these approaches dates
back to 1996 and consisted of a combination of the commercial IAR compiler with
multiple, now outdated WCET computation methods [Bor96]. The emphasis was
put on the representation and management of ﬂow facts and the project was not
pursued long enough to achieve a working implementation.
Kirner and Puschner were the first to establish the idea that flow facts which are
needed for WCET analysis should automatically be updated when an optimizing
compiler transforms the control flow of the program [KP01]. They built a
prototype based on the well-known gcc, which was later incorporated into the
TuBound [PSK08] WCET analysis framework. For the actual WCET analysis,
TuBound uses a previously developed WCET analyzer for the Infineon C167
processor (CalcWCET167 [Kir12]). Still, the generated WCET information is not
made available to optimizations here; the purpose of the modified compiler is to
transform the user-specified high-level flow facts to valid machine-level flow facts.
The VISTA framework [ZKW+04] was designed for the optimization of WCETs.
Unlike WCC, however, it is targeted at interactive optimization guided by the user
and its timing model is restricted to (μ,α)-timing-compositional architectures.
A recent project that bears many similarities with WCC is the Open Timing
Analysis Platform (OTAP) [HPP11], which also tries to integrate aiT with a com-
piler, in this case the LLVM compiler framework, for the purpose of WCET-guided
optimization and general experiments with WCET analysis. The project also uses
the SWEET [MRT14] tool to obtain highly precise ﬂow facts. Lately, the OTAP
compiler has seemingly merged into the T-CREST project infrastructure [PKH+12;
PHP14] which studies predictable multi-core hardware. T-CREST in addition pur-
sues the idea of single-path code transformations which generate a branch-free (not
loop-free) sequence of instructions from arbitrary code with the help of predicated
execution.
The fact that all of the mentioned projects including the WCC focus on C
as the prime source language can be explained by C's popularity among real-time
system developers. C++, in contrast, holds more sources of implicit unpredictability
like virtual methods, virtual inheritance and dynamic casts. Though there exist
methods like, e.g., a real-time capable dynamic cast implementation [DMS08], these
are rather limited and make the code generation and WCET analysis even more
complex. Therefore, the restriction to C encourages the development of predictable
code to some extent.
Finally, another big concern in the development of safety-critical real-time sys-
tems is that tools which are used for the development must usually undergo some
certification process or must even better be formally verified to work correctly.
However, formal verification of a compiler even for subsets of the C language consumes
multiple man-years of work as shown by the work of Leroy [Ler09]. Therefore, this
certification effort is simply infeasible for WCC at its current stage of development.
Figure 3.1: Previous structure of the WCC compiler [FL10].
3.2 Compiler Phases
The original structure of the WCC is shown in Figure 3.1. It follows the phase
model of a general-purpose compiler [ALS+07], whose phases are shown by the solid
edges in Figure 3.1. The dashed edges connect components which are only needed in
the context of the analysis and optimization of the WCET. First, the C source
ﬁles which are annotated with ﬂow fact pragmas are read and transformed to a
High-Level Intermediate Representation (HLIR) called ICD-C. A code selector then
lowers the HLIR to the Low-Level Intermediate Representation (LLIR). The LLIR is
a direct representation of assembly code or relocatable object code. Together with
the memory layout and hardware speciﬁcations it can be used by aiT to determine
the WCET of the single task, and the WCETs and Worst-Case Execution Counts
(WCECs) of the individual basic blocks and of functions inside the tasks. Through
the usage of a back-annotation [FL10] which transforms the low-level basic block
WCET contributions back to the high-level basic blocks, WCET-aware optimiza-
tions are enabled on both the HLIR and the LLIR. An overview of the developed
WCET-aware single-task optimizations can be found in [LM10]. Unfortunately, the
mapping of low-level basic blocks to high-level ones is an n:m mapping in general.
The back-annotation is still possible, because each component which modifies the
basic block structure of either representation, e.g., the code selector and many
optimizations, announces its changes to the back-annotation, which then updates its
internal data structures accordingly. A similar problem is the required updating of
flow facts after optimizations [KPP10], which is also handled in WCC through the
communication of all relevant changes to the flow fact management module, which
then updates the affected flow facts.
The original compiler was designed to translate the source ﬁles of a single task
into a separate binary which can be loaded by a surrounding operating system.
However, in the case of real-time operating systems, the OS binary is often linked
together with the tasks into a single binary program, or even compiled together
with the task code and based on parameters depending on the tasks. Most OSEK-
compliant RTOSes like, e.g., ERIKA Enterprise [Evi14] and FreeOSEK [Ope14]
follow this approach. Therefore, the WCC was extended to support the analysis of
multiple tasks inside the same binary (so-called “entrypoints”, see Section 3.3) and
the compilation of multiple binaries in parallel, where each binary is loaded onto
a particular core of a multi-core system.
The resulting structure is shown in Figure 3.2. The vertical ﬂow corresponds to
the original single-task phases, where all core-binaries are ﬁnally packed together
with a boot loader into a multi-core system ROM. Also, since parts of the task
code may only be given in assembly or precompiled form, parsers for these two
types of input ﬁles were developed. The result is fed directly into the LLIR of the
task. Most importantly, a new WCET analysis was developed which can exploit
the knowledge about all tasks and cores in the system to achieve precise
multi-core WCET estimations. The elements in Figure 3.2 which were developed in the
course of this thesis are drawn with black font, whereas those which preexisted
are drawn in gray. Since a system description now comprises multiple source ﬁles
per core and possibly the speciﬁcation of the task entrypoints in these sources, a
task system description ﬁle can be used as an input for the compiler. This task
system description ﬁle simply contains references to those input ﬁles drawn with
dotted border in Figure 3.2 and in addition it may specify new entry points and/or
properties of entry points.
WCC is mainly written in C++ and comprises approximately 340,000 Lines
Of Code (LOC), counted without comments and blanks plus some fraction of as-
sembly startup code ﬁles and bash scripts. The analyses presented in Chapter 4
and Chapter 5 account for 32,000 LOC, whereas the optimizations from Chapter 6
are formulated in 10,000 LOC. Finally, 10,000 LOC were needed to implement the
parsing of binary inputs (Section 3.5) and 39,000 LOC to implement the conﬁg-
urable simulator platform (Section 3.4). In total, the extensions described in this
thesis contribute approximately 27% of WCC’s current code size.
Figure 3.2: Structure of the WCC for multi-core compilation and analysis.
The next three sections will briefly introduce the handling of flow facts inside
the WCC, the multi-core target platform and the extension to assembly and binary
input files as shown in Figure 3.2.
3.3 Flow Fact Management
As already mentioned in Section 2.2.1, static WCET analysis as done inside of
the WCC requires user inputs for loops which cannot be bounded by the internal
analyzers. Both the WCC [LCF+09] and aiT [Abs14a] feature a loop analysis which
detects constant loop bounds and bounds with simple, mostly aﬃne behavior. For
more complex loop structures and recursions the user has to provide ﬂow facts to
make a WCET analysis possible [KKP+11]. This can be done at the source code
level in the WCC, as indicated in the following code fragments. These flow facts
are gathered in the HLIR and LLIR (see Figure 3.2) and are automatically kept
consistent by the flow fact management. The simplest and most widely used type of
annotation is a loop bound as shown in Code Example 1, which specifies how often
the loop body may execute once the loop head is entered from the outside of the
loop. For a loop $l$, this is always specified through a minimum and maximum iteration
count $B^l_{\min}$ and $B^l_{\max}$, i.e., in Code Example 1 we have $B^l_{\min} = 8$
and $B^l_{\max} = 12$.
_Pragma( "loopbound min 8 max 12" );
while ( input > 0 ) {
  input = read( p, q, r );
}
Code Example 1: A simple loop bound for an input-dependent loop.
For non-reducible loops (compare Section 4.1.2) loop bounds are not applicable,
and for loops with varying iteration count, as shown in Code Example 2, they
may be imprecise. Therefore, it is also possible to specify the loop behavior with a
flow restriction, which is a linear relation between execution frequencies of program
points. As an example, the flow restriction from Code Example 2 states that for
every visit of the “outer” point, the “inner” one may be executed at most 55 times.
In the same manner, we can also bound recursions from their initial call site as
shown in Code Example 3.
_Pragma( "marker outer" );
for ( int i = 0; i < 10; ++i ) {
  for ( int j = i; j < 10; ++j ) {
    _Pragma( "marker inner" );
    act();
  }
}
_Pragma( "flowrestriction 1*inner <= 55*outer" );
Code Example 2: A precise bound for a triangular loop.
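The factor 55 in Code Example 2 can be checked by simply counting the executions of the inner loop body. The helper below is written only for this illustration: it mirrors the loop nest for an arbitrary size n and compares the count with the closed form n(n+1)/2. A plain loop bound of at most 10 iterations on the inner loop would instead yield 10 * 10 = 100 executions, which shows the imprecision that the flow restriction avoids.

```c
#include <assert.h>

/* Counts how often the inner loop body of a triangular loop nest of
 * size n executes for a single visit of the outer marker. */
unsigned long count_inner_executions(unsigned n)
{
    unsigned long inner = 0;

    for (unsigned i = 0; i < n; ++i) {
        for (unsigned j = i; j < n; ++j)
            ++inner;
    }
    /* Cross-check against the Gauss sum n + (n-1) + ... + 1. */
    assert(inner == (unsigned long)n * (n + 1) / 2);
    return inner;
}
```

For n = 10 the function returns 55, which matches the coefficient used in the flow restriction.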
_Pragma( "marker call" );
recurse( 10 );
_Pragma( "flowrestriction 1*recurse <= 10*call" );
...
int recurse( int p ) {
  if ( p > 0 )
    return recurse( p - 1 );
  else
    return 0;
}
Code Example 3: A recursion bound.
Finally, the entry points at which a single task may start can also be specified
in the same way, as presented in Code Example 4. A number of properties like the
execution period or task priority can be specified for each entry point. In addition,
the entry points can also be set in the task system description file, in case
the same code shall be used with different entry point configurations.
_Pragma( "entrypoint period=20ms" );
void task1() {
  ...
}
Code Example 4: An entry point speciﬁcation.
3.4 System Model
Usually, a compiler or static analyzer only needs to know about the external view
of the target architecture and the speciﬁcation of the semantics of the high-level
language. The internal structure and behavior of the architecture are not relevant
as long as they correctly implement the speciﬁed Instruction Set Architecture (ISA).
However, ISAs usually do not contain any information about the time which will
be needed to execute a given instruction. Therefore, a static WCET analyzer also
needs precise information about the behavior and structure of the hardware
implementation, at least as far as it affects the timing of an instruction execution.
Considering the classical Gajski-Kuhn Y-chart for levels of hardware design as shown
in Figure 3.3, WCET analysis will require knowledge of at least some aspects of
the Register-Transfer Level to be able to reason about what may happen in a single
hardware cycle. Since cycles are the lowest granularity of time that is considered in
WCET analysis, the Logic and Circuit levels do not need to be known.
Figure 3.3: Levels of hardware design in a Gajski-Kuhn Y-chart [Gro08].
The initial design of the WCC included back-ends for the ISAs Infineon TriCore
V1.3 and V1.3.1 [Inf08] and for the ARM v4T [ARM05]. The target platforms were
the TriCore TC1796/TC1797 [Inf09] and a generic ARM7TDMI platform [ARM04].
The virtual prototyping IDE CoMET [Syn14], which was donated to the chair by
Synopsys, contains cycle-true simulation models for all of these platforms. At
the time of the first studies for this thesis, multiple multi-core architectures were
considered. Desired properties included
• a high degree of configurability to be able to examine multiple hardware
options,
• closeness to a real-world architecture to achieve realistic results,
• cores which operate as time-predictably as possible to ease the pipeline
analysis,
• inclusion of the architecture in CoMET, or at least the possibility to model it
easily with CoMET, to ease ACET measurements.
Finally, the decision was made in favor of an ARM7TDMI-based multi-core, since
this provided the best compromise among the mentioned points. The ARM7TDMI
is a widely-used real-world core with highly predictable execution behavior. It is
currently being replaced [Ele12] by its successor, the ARM Cortex M0, but both
are almost identical when it comes to their functionality. They share a 3-stage in-
order pipeline, the absence of on-chip caches and a low-power, highly predictable
operation. Therefore, results gathered for the ARM7TDMI will also be valid for the
Cortex M0. CoMET already includes a model of the ARM7TDMI and conﬁgurable
platforms can easily be built. In addition, the WCC already contains a suitable
back-end, which eases the implementation work.
The resulting architecture is shown in Figure 3.4. Each ARM7TDMI core is
connected to a local core bus to which instruction and data scratchpads and caches
are connected. The separation of data and instruction caches is a general
recommendation to increase the predictability [WGR+09]. Each core also has a timer and
an ARM PL190 interrupt controller to be able to start time-triggered tasks, and a
bus bridge to access the shared memory. The bridge is connected to the shared bus
whose arbiter can be configured freely among multiple variants mentioned below.
Shared, cached and uncached RAM is attached to the shared bus, again
split between instruction and data in the cached case. In addition, a slower shared
flash memory is assumed and a BootROM, which contains the system ROM
generated by WCC (cf. Figure 3.2). All RAM is modeled as SRAM, since
standard DRAM refresh cycles make a precise WCET analysis
impossible [BM11]. This organization of memory is modeled in accordance with real-world
memory hierarchies such as the Infineon TriCore TC1797 [Inf09]. Especially for
the multi-core case, such a distributed memory, partitioned into core-local, private
modules and shared modules, is realistic [Fre09]. Also, with the choice of rather
simple cores and a multi-level memory hierarchy, this design follows the PROMPT
principles [CFG+10] for predictable multi-core design.
Figure 3.4: The multi-core system model.
The cache and shared bus simulation modules were developed in the course of
this thesis, since the CoMET-supplied cache is restricted to random replacement
which is inherently not analyzable, not even with probabilistic techniques [Rei14].
Similarly, the CoMET-supplied bus does not offer time-triggered arbitration
strategies, whose timing predictability was shown to be superior [PS10].

Component  Property                          Default Value
---------  --------------------------------  ------------------------------------
Cores      Scheduler                         Time-triggered non-preemptive dispatcher
           Clock rate                        200 MHz

Caches                                       L1                L2
           Hit delay                         1 cycle           1 cycle
           Miss delay¹                       1 cycle           2 cycles
           Size                              8 kB              32 kB
           Associativity                     2                 4
           Line size                         32 B              64 B
           Replacement strategy              LRU               LRU
           Write through                     true              true
           Write allocate                    true              true

Buses                                        Core Bus          Shared Bus
           Width                             4 B               4 B
           Clock rate                        200 MHz           200 MHz
           Arbitration strategy              FAIR (1 master)   FAIR (n masters)
           Arbiter delay                     1 cycle           1 cycle

Memories                                     Size              Access Delay
           Boot ROM                          4 MB              3 cycles
           Instruction scratchpad (per core) 32 kB             1 cycle
           Data scratchpad (per core)        32 kB             1 cycle
           Shared I-RAM                      512 kB            3 cycles
           Shared D-RAM                      512 kB            3 cycles
           Uncached shared RAM               1 MB              3 cycles
           Shared flash                      8 MB              5 cycles
           Memory-mapped device registers    -                 1 cycle

¹ The miss also triggers the reloading of the cache line from the next higher hierarchy level, which
dominates the total miss penalty and leads to a total miss delay far higher than 1 or 2 cycles.
Table 3.1: Default system parameters.

Since the experiments are done on a simulator, system parameters can be
varied, depending on which scenarios should be modeled. The parameters and their
default values are given in Table 3.1. The memory parameters are modeled in
accordance with the real-world parameters of the Infineon TriCore TC1797 [Inf09]. The
scheduler for the cores is generated by WCC on demand and simply invokes the
annotated periodic tasks (cf. Code Example 4) according to their period. However,
this is completely optional and can be exchanged for a user-defined RTOS. Finally,
the default memory layout is such that instructions (.text section) and the stack
are placed in the scratchpads, whereas global data (.data and .bss sections) is
placed in the non-cached shared RAM.
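The default cache parameters from Table 3.1 determine the cache geometry that a cache analysis has to model. The sketch below derives the number of sets and the set index of an address. It assumes that 1 kB denotes 1024 bytes and that sets are selected by the usual modulo scheme on the line address; both points, as well as all names, are assumptions of this illustration and not a description of the simulator's implementation.

```c
#include <assert.h>

/* Static cache configuration (hypothetical layout). */
typedef struct {
    unsigned size_bytes;
    unsigned associativity;
    unsigned line_bytes;
} cache_cfg_t;

/* Number of sets = capacity / (ways * line size). */
unsigned cache_num_sets(const cache_cfg_t *c)
{
    assert(c->associativity > 0 && c->line_bytes > 0);
    return c->size_bytes / (c->associativity * c->line_bytes);
}

/* Set index of an address, assuming modulo placement of cache lines. */
unsigned cache_set_index(const cache_cfg_t *c, unsigned long addr)
{
    return (unsigned)((addr / c->line_bytes) % cache_num_sets(c));
}
```

With the default values, both the L1 cache (8 kB, 2-way, 32 B lines) and the L2 cache (32 kB, 4-way, 64 B lines) consist of 128 sets.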
Before the simulation starts, the hardware parameters are set and the boot ROM
memory is ﬁlled with the contents of a WCC-compiled system ROM. The boot loader
(cf. Figure 3.2) which runs on core 1 then unpacks this system ROM, which contains
an ELF binary for each core. After the unpacking, each core decodes its ELF ﬁle and
places the content according to the memory layout speciﬁcation described above.
Then the execution begins and either runs to completion (in case of non-periodic
core tasks) or runs for a specified number of hyperperiods (in case of periodic core
tasks), where the hyperperiod is the least common multiple of all tasks' periods.
Probes in the CoMET simulation record the runtime of each task non-intrusively
as well as numerous other microarchitectural events, such as cache hits and misses
and bus arbitration events.
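As a small illustration of the hyperperiod used to bound the simulation length, the following sketch computes the least common multiple of a set of task periods. The function names and the example periods are assumptions made for illustration; they are not part of WCC or CoMET.

```python
from functools import reduce
from math import gcd


def hyperperiod(periods):
    """Least common multiple of all task periods (any common time unit)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)


def simulated_time(periods, num_hyperperiods):
    """Total simulated time when running for a number of hyperperiods."""
    return num_hyperperiods * hyperperiod(periods)


# Hypothetical task set with periods of 10, 15 and 20 time units.
print(hyperperiod([10, 15, 20]))
```

After one hyperperiod the release pattern of all periodic tasks repeats, which is why the simulation length is expressed in multiples of it.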
3.5 Extensions for Binary Input Files
An inherent problem with the workflow of the WCC has been the handling of binary
input files. The original structure of the WCC as shown in Figure 3.1 only had access
to the compiled translation units at the LLIR level. Unfortunately, external
libraries are often linked into the program, and these also contribute to the WCET,
since they are called from within the compiled code. Even worse, programs may also
use compiler-internal libraries implicitly, since floating point arithmetic and integer
division are not implemented in hardware on platforms like the ARMv4T. Therefore,
the compiler will insert calls to precompiled software routines where these operations
are used in the source code.
To provide both the internal WCET analyzer and the optimizations
with a complete view of the analyzed binary, WCC modules were developed to
• read in library files, e.g. relocatable object files,
• reconstruct the control flow graphs for the functions in these files, and
• provide a possibility to read and store flow facts inside the binary files.
These additional modules also provide the WCC with the capability to compile each
C file separately, producing separate object files which contain the compiled code
as well as the flow facts needed to analyze this code, as shown in Figure 3.5a. Another
WCC run can then read in all of these object files, perform the WCET analysis and
possibly the low-level optimization, as shown in Figure 3.5c. Of course, high-level
WCET optimizations are impossible in this case. Flow facts can also be added to
external libraries which were not compiled with WCC with the help of a command-line
flow fact editor (cf. Figure 3.5b). The combination of C input files and binary
input files is also feasible, as shown in the multi-core compilation structure from
Figure 3.2. In the following, we will briefly summarize how different challenges in
the design of these modules were solved. A more detailed description can be found
in [Gün13].

[Figure 3.5: Interaction of binary input file modules with the rest of WCC.
(a) Compilation of a single translation unit. Boxes: C file with flow facts; ICD-C
parser; high-level IR (ICD-C); code selector; low-level IR (ICD-LLIR); assembler;
flow fact writer; object file with flow facts.
(b) Annotation of a library file. Boxes: library object file; object file parser;
flow fact editor; flow fact writer; object file with flow facts.
(c) Compilation with binary input files. Boxes: object files with flow facts; object
file parser; low-level IR (ICD-LLIR); WCET-aware optimizations & flow fact
management; WCET analysis; assembler; WCET-optimized object files; linker script;
linker; single-task binary.]
Container and instruction parser The ﬁrst step during the read process is
the reading of the object ﬁle format. Since multiple container formats like ELF
and COFF have been established, we use the existing GNU libbfd to parse the
container format.
The object file is partitioned into sections in which code and data are stored.
Initialized data can be taken over into the LLIR in the form of a byte sequence,
and instructions are parsed by a linear sweep over the machine instructions.
Each instruction is read according to the ARM binary format definitions [ARM05].
During this process, the symbol table, relocation table and branch instructions of the
program are used to add labels to the code and data. This is needed, since the LLIR
is an assembler-level representation which requires labels for all data accesses and
branch instructions. Another complication arises from the fact that, in the ARM
case, data objects are sometimes embedded into the code. These are identified via
debug information or via the load instructions that try to access the data.
CFG reconstruction The inserted labels partition the instruction sequence into
disjoint basic blocks. To re-build the CFG the edges of the graph now need to
be added. For branch instructions with a ﬁxed target, usually a program-counter-
relative byte oﬀset, the control ﬂow is easy to reconstruct. In these cases a CFG edge
towards the target is inserted. In the case that the branch is executed conditionally,
a second edge is inserted, which points towards the fall-through successor, i.e. the
basic block following the current one in the instruction sequence.
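The edge insertion rule for branches with a fixed target can be sketched as follows. This is only an illustrative model: the tuple layout of a block, the branch kinds "direct", "dynamic" and "none", and the function name are assumptions and do not reflect the data structures of the WCC modules.

```python
def reconstruct_edges(blocks):
    """Derive CFG edges for a list of basic blocks.

    blocks: list of (name, branch_kind, target, conditional) tuples in
    instruction-sequence order. branch_kind is "direct" (fixed target),
    "dynamic" (register target, not resolved here) or "none" (no branch).
    """
    edges = []
    for idx, (name, kind, target, conditional) in enumerate(blocks):
        fall_through = blocks[idx + 1][0] if idx + 1 < len(blocks) else None
        if kind == "direct":
            # Fixed target: one edge towards the branch target.
            edges.append((name, target))
        if fall_through is not None and (kind == "none" or conditional):
            # Conditional branches and blocks without a branch continue
            # with the next block in the instruction sequence.
            edges.append((name, fall_through))
    return edges


# Hypothetical function: a conditional branch over one block, then a
# return through a register (dynamic branch, left unresolved here).
example = [
    ("entry", "direct", "L2", True),
    ("body", "none", None, False),
    ("L2", "dynamic", None, False),
]
print(reconstruct_edges(example))
```

Dynamic branches produce no target edge in this sketch; they are exactly the case that requires pattern matching, data-flow analysis or user annotations.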
For dynamic branch instructions, like a branch to a target given by the contents
of a register, the situation is more complex. In these cases, one can either rely on
pattern matching [Tid10] to detect patterns that are used to return from a function
or to jump into a switch statement, or data-ﬂow analysis has to be carried out con-
currently to the CFG reconstruction to bound the possible branch targets [BHV11;
KV08; KZV09; The00]. Both methods may fail and require the user to explicitly
specify the targets of dynamic branches. To limit the implementation eﬀort, the
implemented modules use the pattern-matching approach, which works well for all
compiler-generated code but is more likely to fail for hand-written assembly code.
Flow fact speciﬁcation Flow facts as presented in Section 3.3 are either loop
bounds or ﬂow restrictions, which are needed to compute the WCET. In C ﬁles they
are represented as source code pragmas which is obviously not possible for binary
input ﬁles. Instead, the capability of the object ﬁle format to host an undeﬁned
number of sections is exploited. A new section is inserted which stores
• Loop bounds,
• Flow restrictions and
• Branch targets
in ASCII text form. Since the annotations are stored as text, the basic blocks are
referenced using unique names that are generated by the ﬂow fact reader. The
implementation ensures that the generated names are the same for every invocation
of the ﬂow fact reader, since otherwise the reference points for the ﬂow facts would
be lost.
To fully exploit the possibilities of modular development and WCET analysis,
parametric ﬂow facts would be needed for an increased precision. This in turn
would require a parametric WCET analysis as discussed in Section 2.2.2, which
is not available in aiT and WCC. Therefore, parametric ﬂow facts are not yet
supported.

Chapter 4
Single-Core WCET-Analysis
Contents
4.1 IPCFG Construction
4.1.1 Analysis Graph
4.1.2 Context Graph
4.2 Value Analysis
4.2.1 Abstract Value Domain
4.2.2 Challenges of Predicated Execution
4.3 Microarchitectural Analysis
4.3.1 ARM7TDMI Pipeline Model
4.3.2 Cache Analysis
4.4 Path Analysis
4.5 Evaluation
The single-core analysis implementation is the basis for the extension to the
multi-core case. We will therefore review its structure in this chapter and point out
diﬀerent challenges that arise during the analysis of predicated instructions in the
ARM instruction set [ARM05]. Since we assume a non-preemptive scheduler, this
analysis is both single-core and single-task, i.e., the code for each core and each
task within this code (identiﬁed by the entry points, cf. Section 3.3) is analyzed in
separation.
[Figure 4.1: Structure of the WCC-internal, single-core, single-task WCET analysis.
Boxes: LLIR of current core; IPCFG construction; value analysis; microarchitectural
analysis; machine parameters; path analysis; WCET, BCET. Part of the chain is
marked as task-specific analyses.]
The diﬀerent phases of the WCC-internal analyzer are shown in Figure 4.1.
Compared to the generic static WCET analysis structure from Figure 2.4, we have
the noticeable diﬀerence that the input is an annotated LLIR code stemming either
from the code selector (cf. Section 3.2) or the binary parser (cf. Section 3.5). The
LLIR code contains both the ﬂow facts and the intra-function CFGs which removes
the necessity of a full CFG reconstruction, but still we will create an Interprocedural
Control Flow Graph (IPCFG) for each task in the system. The IPCFG contains
all ﬂow fact information as later needed by the path analysis. Therefore, there is
no further ﬂow of user annotations into the path analysis. Based on the IPCFG,
the values of the CPU registers are statically approximated for each program point
by means of abstract interpretation. These values are then used in the microar-
chitectural analysis, again an abstract interpretation on a specialized domain, to
determine memory access targets and machine instruction operand values. Since
our architecture is conﬁgurable (cf. Section 3.4), the machine parameters must be
speciﬁed at this point. Finally, a path analysis determines the WCET. The BCET
is also determined if
• we are not exploiting the fact that the architecture is timing-anomaly-free (cf.
Section 2.2.8), i.e., we are not tracking the local worst-case state only and
• the speciﬁed ﬂow fact set includes minimum iteration bounds for all loops,
too, as opposed to the maximum bounds required for WCET analysis.
During all of the following analyses, we face the problem that we are doing
computations on a machine with limited-precision integral types, i.e., overflows and
underflows may occur. We handle this by performing all computations on the
abstract wrapper type SafeInt (footnote 1), which makes integer overflows in the
analysis explicit by throwing C++ exceptions. If a register content in the value
analysis is subject to an overflow, these exceptions are handled by setting the
register content to the maximum possible interval (the "top" value of the lattice).
If other fields, like the execution time of a basic block, overflow, the analysis is
aborted. In our experiments the latter case never occurred, since the respective
variables are appropriately sized and their variance is more limited.
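The overflow policy described above can be illustrated with a small Python sketch. It does not use the SafeInt library; `checked_add`, `AnalysisOverflow` and the 32-bit bounds used for both register values and block times are assumptions chosen for the illustration.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
TOP = (INT32_MIN, INT32_MAX)  # "top" of the signed interval lattice


class AnalysisOverflow(Exception):
    """Raised when a checked computation leaves the 32-bit range."""


def checked_add(a, b):
    """Addition that makes overflows explicit instead of wrapping around."""
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        raise AnalysisOverflow(f"{a} + {b} does not fit into 32 bits")
    return result


def add_intervals(x, y):
    """Abstract addition of two register intervals; an overflow yields top."""
    try:
        return (checked_add(x[0], y[0]), checked_add(x[1], y[1]))
    except AnalysisOverflow:
        return TOP


def add_block_times(t1, t2):
    """Accumulate execution times; an overflow propagates and aborts."""
    return checked_add(t1, t2)


print(add_intervals((1, 2), (3, 4)))
print(add_intervals((0, INT32_MAX), (0, 1)) == TOP)
```

Falling back to top keeps the value analysis safe at the cost of precision, whereas a silently wrapped execution time could produce an unsafe bound, which is why that case aborts instead.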
4.1 IPCFG Construction
All of the following analysis stages require an IPCFG to work on. We therefore briefly
introduce this graph and how it is constructed. The IPCFG construction
from Figure 4.1 can be partitioned into multiple sub-phases, which finally
produce a context graph on which all of the following stages will work. Figure 4.2
shows the preceding construction steps, which we will explain in the following.
1 https://safeint.codeplex.com

[Figure 4.2: Stages of the IPCFG construction. Boxes: LLIR of core c; IPCFG
construction, consisting of loop nesting analysis, analysis graph $G^A_c$ construction,
call resolver, inter-procedural DFS and context graph $G^C_\tau$ construction; value
analysis. The stages are grouped into generic analyses and task-specific analyses.]

For every core $c$ and any task $\tau$ in the set $T_c$ of tasks mapped to core $c$, the LLIR
already contains the CFGs $G^f_c = (V^f_c, E^f_c)$ of every function $f$ in the LLIR of $c$. In
addition, the entry functions $f^\perp_\tau$ for each task $\tau$ which runs on $c$ are given. A node
$v \in V^f_c$ is a basic block, i.e., a sequence of instructions $(i_1, \dots, i_n)$ which can only
be entered at $i_1$ and only be exited at $i_n$ in non-preempted execution. The edges
$e \in E^f_c$ represent transfers of control, but function calls are not resolved. When a
function $f_2$ is called by an instruction $i \in v \in V^{f_1}_c$, the block $v$ ends but it only has
one outgoing edge which points to its successor block in $f_1$. The main objective of
the IPCFG is to overcome this limitation and to connect the $G^f_c$ to each other via
call and return edges. In addition, the following is desired:
• Implicit control flow through the use of predicated execution, as it is present
on many architectures including the ARM architecture [ARM05], should be
explicitly visible in the IPCFG. The analysis graph $G^A_c$ from Figure 4.2
incorporates this, as well as the call/return edges between functions.
• The loop structure of the program should be easily visible from the IPCFG.
This is optional, but a proper loop detection will raise the precision of the
following analyses. This is addressed by the Loop Nesting Analysis shown in
Figure 4.2.
• Invocations of the same function or loop body from different call sites or in
different iterations will possibly show different runtime behavior. To be able
to distinguish these analysis contexts, a context graph $G^C_\tau$ is created based
on $G^A_c$ which virtually inlines functions and unrolls loops according to user
specifications.
We did not use standard data-ﬂow analysis generators like [SSB09; HMM12] to
construct the graphs and the analyses, since we will need to introduce a new type
of context in the multi-core WCET analysis which requires full control over the
underlying graph and the DFA algorithm.
[Figure 4.3: Analysis block creation example.
(a) A basic block b1 with predicated instructions:
    cmp r3, #8
    ldreq r2, [fp, #-16]
    addne r3, r3, r2
    movcs r3, #0
(b) The resulting analysis blocks. All edges are predication edges.
    a1: cmp r3, #8             (AL | true)
    a2: ldreq r2, [fp, #-16]   (EQ | true)
    a3: ldreq r2, [fp, #-16]   (EQ | false)
    a4: addne r3, r3, r2       (NE | true)
    a5: addne r3, r3, r2       (NE | false)
    a6: movcs r3, #0           (CS | true)
    a7: movcs r3, #0           (CS | false)]
4.1.1 Analysis Graph
Definition 9. The analysis graph $G^A_c = (V^A_c, E^A_c)$ for the LLIR code of core $c$ is
a directed graph where each analysis block $v = ((i_1, \dots, i_n), p, q) \in V^A_c$ is a tuple
containing an instruction sequence $(i_1, \dots, i_n)$, a predicate $p$ from a set of
machine-defined predicates $P$ and a truth assignment $q \in B$ that indicates whether $p$ is true
during the execution of $v$. Each edge $e \in E^A_c$ has a type $\mathrm{type}(e)$ which is either BRANCH,
CALL, RETURN or PREDICATION.
The ﬁrst step in the construction of the analysis graph is to split up each basic
block into one or more analysis blocks. The compiler or assembly programmer may
decide to predicate some instructions. In the case of the ARM architecture, this is
possible for almost every instruction. Predication means that the instruction is only
executed if certain CPU ﬂag registers are in a state that is described by the predicate.
As an example, the cmp instruction in Figure 4.3a sets the flag values according to
the comparison of r3 and the constant 8. It is executed unconditionally, which is
expressed by the implicit predicate al (always), i.e., cmp is the same as cmpal. The
ldr with predicate eq is only executed if the flags indicate equality (zero flag set);
similarly, ne is only executed if they indicate inequality (zero flag clear) and cs
only if the carry flag is set. In Figure 4.3b
the blocks with the "true" value represent the case that the predicate is true, i.e.,
the instruction is executed, whereas the "false" blocks represent the skipping of the
instruction by the CPU. Since some predicates contradict each other, e.g., eq and
ne, edges between the contradicting analysis blocks can be removed. Of
course, as soon as an instruction is executed which may alter the CPU flag registers,
the contradiction is potentially invalidated. We can therefore easily define a split
function $\mathrm{split}^A : V^f_c \to 2^{V^A_c} \times 2^{V^A_c \times V^A_c}$, which maps each basic block to a subgraph
of analysis blocks and edges which model its predication-aware execution paths in
analogy to Figure 4.3. This function can be implemented by a single scan over the
basic block, which splits it into contiguous chunks with the same predicate, which are
then connected by predication edges according to their mutual exclusion behavior.
$\delta^A_\perp(v) \subseteq V^A_c$ denotes the analysis blocks that model the entry into basic block $v$,
whereas $\delta^A_\top(v) \subseteq V^A_c$ is the set of analysis blocks which are an exit of $v$. The set
$\delta^A_\perp(v)$ has two elements if the first instruction of the block has a predicate other than
al, else it has only one element. The same applies to $\delta^A_\top(v)$. The reverse mapping of
an analysis block $v^A$ to its basic block $v$ is given by $\mu^A(v^A)$. For any type of graph,
the successors and predecessors of node $v$ are given by $\delta^+(v)$ and $\delta^-(v)$, respectively.
For a function $f$, $\nu_\perp(f) \subseteq V^f_c$ gives the entry block and $\nu_\top(f) \subseteq V^f_c$ returns the set
of return blocks for the function.
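A possible shape of such a split function is sketched below for the basic block of Figure 4.3. The representation of instructions as (text, predicate, sets_flags) tuples, the restriction of the contradiction table to the eq/ne pair, and the rule of starting a new chunk after every flag-setting instruction are assumptions of this sketch, not a description of the WCC implementation.

```python
OPPOSITE = {"EQ": "NE", "NE": "EQ"}  # only the contradicting pair named in the text


def contradictory(src, dst):
    """True if blocks src and dst cannot execute back to back, assuming
    the flags are unchanged between them. A block is (texts, pred, truth)."""
    (_, p1, q1), (_, p2, q2) = src, dst
    if p1 == p2:
        return q1 != q2
    if OPPOSITE.get(p1) == p2:
        return q1 == q2
    return False


def split_block(instructions):
    """Split a basic block into analysis blocks and predication edges.

    instructions: list of (text, predicate, sets_flags) tuples.
    """
    # Single scan: group contiguous instructions with the same predicate,
    # starting a new chunk after every instruction that sets the flags.
    chunks = []
    for text, pred, sets_flags in instructions:
        if chunks and chunks[-1][1] == pred and not chunks[-1][2]:
            chunks[-1][0].append(text)
            chunks[-1][2] = sets_flags
        else:
            chunks.append([[text], pred, sets_flags])

    blocks, edges, prev, prev_sets_flags = [], [], [], False
    for texts, pred, sets_flags in chunks:
        truths = [True] if pred == "AL" else [True, False]
        current = [(tuple(texts), pred, q) for q in truths]
        for src in prev:
            # Executed flag-setting code invalidates known contradictions.
            flags_changed = prev_sets_flags and src[2]
            for dst in current:
                if flags_changed or not contradictory(src, dst):
                    edges.append((src, dst))
        blocks.extend(current)
        prev, prev_sets_flags = current, sets_flags
    return blocks, edges


b1 = [
    ("cmp r3, #8", "AL", True),
    ("ldreq r2, [fp, #-16]", "EQ", False),
    ("addne r3, r3, r2", "NE", False),
    ("movcs r3, #0", "CS", False),
]
blocks, edges = split_block(b1)
print(len(blocks), len(edges))
```

The sketch only compares directly adjacent chunks; a fuller implementation would track the known flag state across several chunks.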
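Algorithm 2 Analysis Graph Construction.
1: $G^A_c = \bigcup_{f \in c,\, v \in V^f_c} \mathrm{split}^A(v)$  ▷ Initialize as union of basic block subgraphs
2: for $v^A \in V^A_c$ do
3:     if $\mathrm{call}(v^A)$ then  ▷ Add call and return edges for calls
4:         for $f \in \mathrm{targets}(v^A)$ do
5:             $E_{call} = \{v^A\} \times \delta^A_\perp(\nu_\perp(f))$
6:             $\forall e \in E_{call} : \mathrm{type}(e) \leftarrow$ CALL
7:             $E_{return} = \delta^A_\top(\nu_\top(f)) \times \delta^A_\perp(\delta^+(\mu^A(v^A)))$
8:             $\forall e \in E_{return} : \mathrm{type}(e) \leftarrow$ RETURN
9:             $E^A_c \leftarrow E^A_c \cup E_{call} \cup E_{return}$
10:    else  ▷ Add local CFG edges for all other blocks
11:        $E_{branch} = \{v^A\} \times \delta^A_\perp(\delta^+(\mu^A(v^A)))$
12:        $\forall e \in E_{branch} : \mathrm{type}(e) \leftarrow$ BRANCH
13:        $E^A_c \leftarrow E^A_c \cup E_{branch}$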
Algorithm 2 summarizes the analysis graph construction. The functions
$\mathrm{call} : V^A_c \to B$ and $\mathrm{targets} : V^A_c \to 2^{V^A_c}$ compute whether the given block ends with a call
and which functions this call may be targeted at. In the WCC, these functions are
implemented by a call resolver as sketched in Figure 4.2, which tries to look up the
function name if the call is carried out via a symbol name. In case of indirect calls
via function pointers, the user must be queried for the call target.
4.1.2 Context Graph
The analysis graph already provides the possibility to perform an interprocedural
data-ﬂow analysis. Its main drawback is that it does not distinguish between diﬀer-
ent contexts of the code. A function can show diﬀerent behavior when called from
two sites with varying parameter sets, and the same is true for the iterations of code
in a loop. As an example consider the analysis graph shown in Figure 4.4. When
analyzing the recursive function bar, the information stemming from foo1 and foo2
will be merged. To overcome this, virtual inlining of functions and virtual unrolling
of loops (VIVU) [LM97] is used to create call contexts and iteration contexts. Note
that none of these graph transformations actually modifies any source or machine
code. We are only concerned with the disambiguation of different execution contexts
here, not with code transformations.

[Figure 4.4: Example for an analysis graph with a directly recursive function. Solid
edges are branches, black (gray) dashed edges are calls (returns) and
dotted edges are predication edges. Blocks shown: foo1 and foo2, each ending in
"bl bar" (AL | true) and continuing with "cmp r0, #1 ..." (AL | true); bar, starting
with "str lr, [sp, #-4]!; ldr r3, X; cmp r3, #0" (AL | true), followed by "beq .L2"
(EQ | false and EQ | true), a block "... bl bar" (AL | true), a return block
"ldr pc, [sp], #4" (AL | true), and the block ".L2: ... ldr pc, [sp], #4" (AL | true).]
Definition 10. For a task $\tau \in T_c$ with entry function $f^\perp_\tau$, a call string is a sequence
$S = ((v_0, f_0), (v_1, f_1), (v_2, f_2), \dots, (v_n, f_n))$ such that $f_0 = f^\perp_\tau$ and for all $i$,
$v_i \in \bigcup_{v \in V^{f_i}_c} \delta^A_\top(v)$ has an outgoing call edge to function $f_{i+1}$. A call context for $(v_i, f_i) \in S$
is a graph $G$ which models the execution of $f_i$ when called via the prefix of $S$ with
length $i$.
The creation of call contexts is shown in Algorithm 3. As long as the call string
$S$ has length $|S| > 1$, we inline it by cloning each function $f_i$ with $i > 1$ in $S$ and by
redirecting the call and return edges from $f_{i-1}$ to the new copy of $f_i$. In Algorithm 3
this is formulated through a recursion. Every $(V_{dup}, E_{dup})$ constructed during the
virtual inlining is a call context of the respective function $f_n$.
As an example, assume that in Figure 4.4 foo1 and foo2 are called from the
main function of the task. The resulting context graph for the virtual inlining of
1. ( main, foo1, bar, bar ) and
2. ( main, foo2, bar, bar )
(call nodes are omitted here for brevity) is shown in Figure 4.5a. To limit the graph
size, all blocks of each function were collapsed to a single node in Figure 4.5a. Both
of the above call strings have length 3 and it should be clear that the result graph
grows linearly with the length of the call string. Also note that by inlining the call
strings of length 3, all call strings of length less than 3 were implicitly inlined as well,
due to the recursive nature of the definition.

Algorithm 3 The virtual inlining algorithm.
1: function VirtualInline($S = ((v_0, f_0), \dots, (v_{n-1}, f_{n-1}), (v_n, f_n))$, $G^A_{in}$)
2:     if $|S| = 1$ then
3:         return $G^A_{in}$  ▷ S does not contain further calls.
4:     else
5:         $G^A_{n-1} \leftarrow (V^A_{n-1}, E^A_{n-1}) =$ VirtualInline($((v_0, f_0), \dots, (v_{n-1}, f_{n-1}))$, $G^A_{in}$)  ▷ Inline all but the last call
6:         $V_{dup} = \{\delta^A(v) \mid v \in V^{f_n}_c\}$  ▷ Clone function $f_n$ from $G^A_{n-1}$
7:         $E_{dup} = E^A_{n-1} \cap (V_{dup} \times V_{dup})$  ▷ And $f_n$'s edges
8:         return $G^A_{n-1} \cup \mathrm{copy}(V_{dup}, E_{dup})$ with all call edges from $v_{n-1}$ to $f_n$
           and all return edges from $f_n$ to successors of $v_{n-1}$ redirected to
           $\mathrm{copy}(V_{dup}, E_{dup})$.

[Figure 4.5: Possible virtual inlining results for Figure 4.4. Call contexts are shown
with gray background, default contexts in white.
(a) Maximum call string length 3. Nodes: main; foo1, foo2; bar, bar; bar.
(b) Maximum call string length 1. Nodes: main; foo1, foo2; bar.]
If we do not have recursive functions in the task, the number of distinct call
strings is finite. However, if we have recursive functions like bar, an unlimited
number of call strings results. Therefore, the user has to specify limits for the inlining
process. To this end, we use the limited-length call-string approach from [SP78].
The inlining is done for all call strings up to a length specified as the
maximum call string length. If this length is reached, all further calls are directed
to a default context of the callee instead of to a new call context. Figure 4.5a is an
example for this procedure with a maximum call string length of 3. After the first
visit to bar, all further calls to bar are modeled by its default context shown in
white. If the maximum call string length is set to 1 instead, all calls are directed to
default contexts as shown in Figure 4.5b.
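The effect of the maximum call string length can be illustrated at the level of whole functions, i.e., with every function collapsed to one node as in Figure 4.5. The sketch below is an assumption-laden simplification: it measures the length of a call string as the number of functions it contains, represents a call context by the tuple of function names on the call string, and uses a ("default", f) tuple for default contexts.

```python
def build_contexts(call_graph, entry, max_len):
    """Enumerate contexts for a limited-length call-string approach.

    call_graph: dict mapping a function name to the functions it calls.
    Returns (contexts, edges); a call context is a tuple of function names,
    a default context is the tuple ("default", function).
    """
    contexts, edges = set(), set()
    worklist = [(entry,)]
    while worklist:
        ctx = worklist.pop()
        if ctx in contexts:
            continue
        contexts.add(ctx)
        caller = ctx[-1]
        for callee in call_graph.get(caller, []):
            if ctx[0] != "default" and len(ctx) < max_len:
                succ = ctx + (callee,)       # new call context
            else:
                succ = ("default", callee)   # limit reached: default context
            edges.add((ctx, succ))
            worklist.append(succ)
    return contexts, edges


# Function-level call structure taken from Figure 4.4, with main as caller.
calls = {"main": ["foo1", "foo2"], "foo1": ["bar"], "foo2": ["bar"], "bar": ["bar"]}
contexts3, _ = build_contexts(calls, "main", 3)
contexts1, _ = build_contexts(calls, "main", 1)
print(len(contexts3), len(contexts1))
```

Under these assumptions the node counts agree with the two panels of Figure 4.5: two separate call contexts for bar plus one default context in the first case, and a single context per function in the second.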
Since the detection of call instructions at the machine code level can be
non-trivial on some architectures, another option for context construction is to identify
every movement of the stack pointer [LBS+10] and to build "call" contexts from
these stack modification points. This method can also deal with obfuscated code
which does not adhere to any calling convention, but is more costly in terms of
analysis time. If the call string is not made explicit in the graph but carried along
with the data items as an extension of the analysis lattice, it can be stored as a
single numeric value in the best case [SZW+10]. This allows each data-flow analysis
on the graph to use different contexts, but also complicates and delays the individual
analyses.
Since we need to set a bound on the call string length in case of recursive pro-
grams, we ﬁrst need to ﬁnd out whether recursion exists in the given task.
Definition 11. A spanning tree $G_{span} = (V, E_{span}, v_0)$ for a CFG $G = (V, E, v_0)$
is a subgraph of $G$ with directed edges $E_{span} \subseteq E$ such that
$\forall w \in V : v_0 \leadsto w \Rightarrow |P_{[v_0,w]}| = 1$, i.e., a single unique path exists to each reachable node $w$.
The length of this path is given by $\mathrm{dfsdepth}(w)$.
A Depth-First Search (DFS) on a CFG $G = (V, E, v_0)$ returns a classification
of edges $\mathrm{dfstype}(e) \in \{\mathrm{TREE}, \mathrm{BACK}, \mathrm{CROSS}, \mathrm{FORWARD}\}$ for each edge $e \in E$.
The tree edges form a spanning tree $G_{span}$ of $G$ in which back edges $(v, w)$ have
$\mathrm{dfsdepth}(v) > \mathrm{dfsdepth}(w)$, forward edges have $\mathrm{dfsdepth}(v) < \mathrm{dfsdepth}(w)$, and for
cross edges $v \not\leadsto w$ in $G_{span}$.
To detect recursions, we extend the analysis graph $G^A_c$ by a virtual entry node $v_0$
and edges $\{v_0\} \times \bigcup_{c \in \{1,\dots,n\}} \bigcup_{\tau \in T_c} \delta^A_\perp(\nu_\perp(f^\perp_\tau))$, where the tasks on core $c$ are again given
by $T_c$ and for any $\tau \in T_c$ the entry point function is $f^\perp_\tau$. Recursive calls are contained
if and only if an edge $e = (v, w)$ with $\mathrm{type}(e) = \mathrm{CALL}$ and $\mathrm{dfstype}(e) = \mathrm{BACK}$
is found in a DFS on this extended graph. If the user did not specify an explicit
maximum call string length, we only create a single default context for each recursive
function.
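The back-edge criterion can be sketched on a simplified, function-level call graph, in which every edge is a call edge and the loop over the entry functions plays the role of the virtual entry node $v_0$. This simplification, the graph representation and the function name are assumptions of the sketch; the analysis described above works on the extended analysis graph instead.

```python
def recursive_call_edges(call_graph, entry_functions):
    """Return all call edges that a DFS classifies as back edges.

    call_graph: dict mapping a function name to the functions it calls.
    A back edge points to a function that is still on the DFS stack.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {}
    back_edges = []

    def dfs(f):
        color[f] = GRAY
        for callee in call_graph.get(f, []):
            state = color.get(callee, WHITE)
            if state == GRAY:
                back_edges.append((f, callee))  # recursion detected
            elif state == WHITE:
                dfs(callee)
        color[f] = BLACK

    for entry in entry_functions:  # stands in for the virtual entry node
        if color.get(entry, WHITE) == WHITE:
            dfs(entry)
    return back_edges


calls = {"main": ["foo1", "foo2"], "foo1": ["bar"], "foo2": ["bar"], "bar": ["bar"]}
print(recursive_call_edges(calls, ["main"]))
```

For the call structure of Figure 4.4 only the self-call of bar is reported; the second call to bar from foo2 reaches an already finished node and is therefore not a back edge.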
The last aspect from Figure 4.2 that is still missing is the handling of loops.
To handle them, we first need a proper definition of a loop.
Definition 12. A node $v \in V$ dominates a node $w \in V$ in a CFG $G = (V, E, v_0)$ iff
$v = w$ or $\forall p \in P^{acyclic}_{[v_0,w]} : p = (v_0, \dots, v, \dots, w)$. We denote this as $v \;\mathrm{dom}\; w$.
If for a node $v_{head} \in V$, back-edges $e = (w, v_{head}) \in E$ with $v_{head} \;\mathrm{dom}\; w$ exist, a
natural loop $l$ with head $v^l_{head}$ is induced by these edges. The set of all natural loops
in task $\tau$ is denoted by $\circlearrowleft_\tau$. For natural loops, a unique nesting relation $<_\circlearrowleft$ exists.
Iff all back-edges of a graph $G$ end in natural loops' heads, $G$ is called reducible and
has only a single DFS spanning tree $G_{span}$, else $G$ is irreducible.
Both the loop detection and the inter-procedural DFS can be performed in time
$O(|V^A_c|)$ [ALS+07]. All natural loops have a single entry node but may have multiple
back and exit edges as shown in Figure 4.6a.
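To make the definitions of dominance and natural loops concrete, the following sketch computes both for a small graph. It uses the simple iterative dominator computation rather than a linear-time algorithm, assumes that all nodes are reachable from the entry, and the example graph is only loosely modeled on Figure 4.6a; its exact edges are an assumption.

```python
def predecessors(succ):
    """Predecessor sets for a graph given as node -> list of successors."""
    nodes = set(succ) | {w for ws in succ.values() for w in ws}
    pred = {n: set() for n in nodes}
    for v, ws in succ.items():
        for w in ws:
            pred[w].add(v)
    return pred


def dominators(succ, entry):
    """Iterative fixed-point computation of the dominator sets."""
    pred = predecessors(succ)
    nodes = set(pred)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if not pred[n]:
                continue  # unreachable node, ignored in this sketch
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def natural_loops(succ, entry):
    """Map each loop head to the node set of its natural loop."""
    pred, dom = predecessors(succ), dominators(succ, entry)
    loops = {}
    for v, ws in succ.items():
        for head in ws:
            if head in dom[v]:  # back edge (v, head) with head dom v
                body = loops.setdefault(head, {head})
                stack = [v]
                while stack:  # collect all nodes that reach v without passing head
                    n = stack.pop()
                    if n not in body:
                        body.add(n)
                        stack.extend(pred[n])
    return loops


cfg = {"B1": ["B2"], "B2": ["B3", "B4", "B6"], "B3": ["B5"],
       "B4": ["B5"], "B5": ["B2", "B6"], "B6": []}
print(natural_loops(cfg, "B1"))
```

In this example the loop headed by B2 has one back edge (B5, B2) and two exit edges, (B2, B6) and (B5, B6).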
Definition 13. A loop context for loop $l \in \circlearrowleft_\tau$, iteration $i$ is a graph $G^i_l$ which
models the $i$-th iteration (and only the $i$-th iteration) of $l$ counted from 1 on.

Algorithm 4 The virtual unrolling algorithm.
1: function VirtualUnroll($G^A_{in}$, $l \in \circlearrowleft_\tau$, $n \in \mathbb{N}_0$)
2:     $V_l = \{v \in V^A_{in} \mid (v^l_{head} \;\mathrm{dom}\; v) \wedge (v \leadsto v^l_{head})\}$  ▷ Get loop body
3:     $E_l = \{(v, w) \in E^A_{in} \mid v \in V_l \vee w \in V_l\}$  ▷ And all loop edges
4:     for $i \in \{1, \dots, n\}$ do  ▷ Create loop context i
5:         $G^i_l \leftarrow \mathrm{copy}(V_l, E_l)$  ▷ Copy the loop
6:         if $i > 1$ then  ▷ If this is not the first iteration
7:             $E^i_l \leftarrow E^i_l \setminus \{(v, w) \mid v \notin V^i_l \wedge w \in V^i_l\}$  ▷ Remove the external entry edges
8:             $E^i_l \leftarrow E^i_l \cup \{(v, v^i_{head}) \mid (v, w) \in E^{i-1}_l \wedge \mathrm{dfstype}((v, w)) = \mathrm{BACK}\}$  ▷ And replace them by back edges ...
9:             $E^{i-1}_l \leftarrow E^{i-1}_l \setminus \{e \mid e \in E^{i-1}_l \wedge \mathrm{dfstype}(e) = \mathrm{BACK}\}$  ▷ ... from the previous iteration
10:        if $i \leq B^l_{min}$ then  ▷ If a loop exit is impossible in iteration i
11:            $E^i_l \leftarrow E^i_l \setminus \{(v, w) \mid v \in V^i_l \wedge w \notin V^i_l\}$  ▷ Remove the exit edges
12:    return $(G^A_{in} \setminus (V_l, E_l)) \cup \bigcup_{i \in \{1,\dots,n\}} G^i_l$
Algorithm 4 presents how loop contexts for the first $n-1$ iterations of loop $l$ are
constructed. In each of the algorithm's iterations $i$, we unroll the $i$-th iteration of $l$
by duplicating $l$'s nodes and edges. From the outside, $l$ can only be entered in the
first iteration, which is why we remove the entry edges from all but the first iteration
in line 7. The iterations $i > 1$ are only entered via the back-edges of preceding
iterations (lines 8 and 9).
The result of unrolling the first iteration of the loop from Figure 4.6a is shown
in Figure 4.6b. If the minimum loop iteration count $B^l_{min}$ is greater than or equal to the
unrolled iteration's index $i$, we can remove the exit edges from the unrolled iteration
(line 11 in Algorithm 4). For the example of Figure 4.6b, this implies that the edges
(B2a,B6) and (B5a,B6) can be removed if $B^l_{min} \geq 2$.
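The unrolling steps can be sketched on a plain edge set as follows. This is not a transcription of Algorithm 4: node copies are named with a "#i" suffix, the last copy is assumed to stand for all remaining iterations (so its back edges return to its own head and its exit edges are always kept), and the example graph is an assumed approximation of Figure 4.6a.

```python
def virtual_unroll(edges, body, head, n, b_min=0):
    """Create n copies of a natural loop (n >= 1) in a graph given as edge set.

    edges: set of (v, w) pairs; body: node set of the loop; head: loop head;
    b_min: minimum iteration count, used to drop impossible exit edges.
    """
    def name(v, i):
        return f"{v}#{i}" if v in body else v

    result = {(v, w) for v, w in edges if v not in body and w not in body}
    for i in range(1, n + 1):
        for v, w in edges:
            if v in body and w in body:
                if w == head:
                    # Back edge: continue in the next copy; the last copy
                    # loops back to itself (assumption of this sketch).
                    result.add((name(v, i), name(head, min(i + 1, n))))
                else:
                    result.add((name(v, i), name(w, i)))
            elif v not in body and w in body:
                if i == 1:  # the loop is only entered in its first iteration
                    result.add((v, name(w, 1)))
            elif v in body and w not in body:
                # Exit edge: dropped if an exit is impossible in iteration i.
                if not (i <= b_min and i < n):
                    result.add((name(v, i), w))
    return result


loop_edges = {("B1", "B2"), ("B2", "B3"), ("B2", "B4"), ("B2", "B6"),
              ("B3", "B5"), ("B4", "B5"), ("B5", "B2"), ("B5", "B6")}
loop_body = {"B2", "B3", "B4", "B5"}
unrolled = virtual_unroll(loop_edges, loop_body, "B2", n=2, b_min=2)
print(sorted(unrolled))
```

With a minimum iteration count of 2, the copy of the first iteration has no exit edges towards B6, while the second copy, which represents all later iterations, keeps them.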
Definition 14. A context graph $G^C_\tau = (V^C_\tau, E^C_\tau, v_{0,\tau})$ for a task $\tau \in T_c$ is a CFG
which is a copy of an analysis graph $G^A_c$ that was subject to virtual inlining and
unrolling, where loops in $G^A_c$ were classified starting from root node $v_{0,\tau} = \delta^A_\perp(\nu_\perp(f^\perp_\tau))$.
A context graph is based on a set of contexts $C$. Each context $c_i$ has an associated
subgraph $G^{c_i}_\tau = (V^{c_i}_\tau, E^{c_i}_\tau, v^{c_i}_0)$ and there exists a set of transition edges $E_{trans}$ between
the contexts such that $V^C_\tau = \bigcup_{c_i \in C} V^{c_i}_\tau$ and $E^C_\tau = \bigcup_{c_i \in C} E^{c_i}_\tau \cup E_{trans}$. Each $c_i$ is
classified as either a call, iteration or default context as introduced above.
The context graphs of the tasks will be the basis for all of the following analyses.
For bigger maximum allowed call string lengths and maximum virtual unrolling fac-
tors, the size of C increases and so does the analysis precision, but also its duration.
[Figure 4.6: Example for virtual unrolling. Call contexts are shown with gray
background, iteration contexts are dotted.
(a) Original function with natural loop. Nodes: main; B1; B2; B3, B4; B5; B6.
(b) Result of virtually unrolling the first iteration. Nodes: main; B1; B2a; B3a, B4a;
B5a; B2b; B3b, B4b; B5b; B6.]
4.2 Value Analysis
Later analysis stages will require knowledge about the addresses which may be af-
fected by a memory access or the value of a register which determines the runtime
of an arithmetic operation, often seen in integer multiplication and division instruc-
tions. Therefore, the purpose of the value analysis is to determine safe approxima-
tions of the possible contents of memory cells in the system. To this end, we use
abstract interpretation as introduced in Section 2.1. But even with an abstracted
value semantics it is computationally infeasible to approximate the whole memory
content of the system. Therefore, value analyses are usually restricted to the CPU
registers, since these are most important for the timing behavior. To some extent,
also the stack contents can be modeled, but we will restrict the analysis to the most
relevant case of CPU registers here.
Thus, the value analysis must determine an approximation $val^{in}_v \in V$ on an
abstract value domain $V$ for every node $v$ in a context graph $G^C_\tau$ such that $val^{in}_v(r)$
safely approximates the content of any register $r$ from a set of CPU registers
$R = \{r_1, \dots, r_n\}$. We achieve this by means of a work-list DFA as shown in Algorithm 1,
whose domain and transfer functions will be detailed in the following.
4.2.1 Abstract Value Domain
Some CPU operations treat register contents as bit strings, e.g., shifting and
logical operations, and some perform signed integer, unsigned integer and floating
point arithmetic on them. Only the latter case can be excluded for the ARM7, since
it has no floating point registers. Instead, it has sixteen 32-bit general purpose registers,
which are used for bit operations, signed two's complement arithmetic and unsigned
arithmetic. Therefore, we define the value domain $V$ as
$$V = R \to (B_1^{32} \times I_s \times I_u) \quad (4.1)$$
$$I_s = I_{[-2^{31},\, 2^{31}-1]} \quad (4.2)$$
$$I_u = I_{[0,\, 2^{32}-1]} \quad (4.3)$$
which models the interpretation of the registers' contents in these three domains.
$B_1$ is the set-based lattice of bit contents as shown in Figure 4.7a, whereas $I_{[l,u]}$ is
the lattice of sub-intervals of $[l, u]$ ordered by inclusion as shown in Figure 4.7b.
The meet operator $\sqcup : V \times V \to V$ and the partial order $\sqsubseteq : V \times V \to B$ are formed by
piecewise application of the meet and partial order operators of the sub-lattices.

[Figure 4.7: Hasse diagrams of value analysis domain components.
(a) Single bit value lattice $B_1$: $\top = \{0,1\}$ above $\{0\}$ and $\{1\}$, with $\bot = \emptyset$ at the bottom.
(b) Interval lattice $I_{[l,u]}$: $\top = \{l, u\}$ above $\{l, u-1\}$ and $\{l+1, u\}$, down to the
single-element intervals $\{l\}$, $\{l+1\}$, ..., $\{u-1\}$, $\{u\}$, with $\bot = \emptyset$ at the bottom.]
The transfer function $f_v : V \to V$ for a context block $v \in V^C_\tau$ iterates over the
instructions in the block and applies their effect to the value state. Each instruction's
effect on a register is defined to work on a single sub-lattice, e.g., a shift operation
will work on the bit-vector sub-lattice and a signed addition will work on the signed
interval sub-lattice. Once the respective sub-lattice value was updated, the results
are transferred to the other domains by using conversion functions $\alpha^T_S : S \to T$, where
$S, T \in \{B_1^{32}, I_s, I_u\}$ are the source and target sub-lattice. For all $S$ and $T$, $\alpha^T_S$ and
$\alpha^S_T$ form a Galois connection (see Definition 3), i.e., $\forall s \in S : \alpha^S_T(\alpha^T_S(s)) \sqsupseteq s$. We do
not detail the individual transfer functions here, but the ARMv4 architecture has
91 distinct instructions with up to 11 addressing modes that must be handled.
To speed up the convergence of the signed integer intervals we employ a widening
with a the set I of all signed integer constants found in the program code and the
maximum and minimum signed integer values −231 and 231 − 1. The widening
operator Δ ∶ Is → Is is then deﬁned as
Δ([a, b]) = [max({i ∣ i ∈ I, i ≤ a}),min({i ∣ i ∈ I, i ≥ b})] (4.4)
An equivalent widening is defined for the unsigned case. For the bit vectors, widening
is not needed, since the height of $B_1$ is two, such that the height of $B_1^{32}$ is 64,
therefore any element from this lattice can only undergo 63 changes until it reaches the
fixed point $\top$. In contrast, the interval lattices have height $2^{32}$, which is still
finite, but makes the convergence very slow.[2] In addition, we perform narrowing until
each block has been visited twice to regain precision that was lost due to the widening.
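The threshold widening of Equation 4.4 can be sketched as follows. This is a minimal
illustration under assumed names (`make_widening`, `widen`), with intervals encoded as
(lower, upper) tuples and an invented constant set; it only mirrors the formula and is
not taken from the analyzer.

```python
from bisect import bisect_left, bisect_right

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def make_widening(program_constants):
    """Build a widening operator from the constants found in the program."""
    # The thresholds are the program constants plus the extreme signed values,
    # so a lower and an upper threshold always exist.
    thresholds = sorted(set(program_constants) | {INT_MIN, INT_MAX})

    def widen(interval):
        a, b = interval
        lower = thresholds[bisect_right(thresholds, a) - 1]  # max{i in I | i <= a}
        upper = thresholds[bisect_left(thresholds, b)]       # min{i in I | i >= b}
        return (lower, upper)

    return widen


# Hypothetical constant set {0, 8, 100}, e.g. taken from loop bounds.
widen = make_widening([0, 8, 100])
```

With these thresholds, `widen((1, 5))` jumps directly to `(0, 8)` instead of growing
by one value per iteration, and an interval that exceeds all constants is widened to
the full signed range.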
4.2.2 Challenges of Predicated Execution
The predicated execution as already mentioned in the analysis graph definition in
Section 4.1.1 works by first setting a flag and then performing actions based on
it. As an example, in block $a_1$ from Figure 4.3, the instruction cmp r3, #8 sets
the EQ-flag to true iff r3 is equal to 8. The value analysis might be unable to
infer a concrete value of r3 at this point, i.e., $val^{in}_{a_1}(r3)$ might be $\top$ and a
naive implementation will therefore not be able to infer anything about the value of r3
at $a_2$ either. However, we know at $a_2$ that $val^{in}_{a_2}(r3) = [8,8]$ due to the
implication set up by the preceding comparison. We can enable the value analysis to
infer this by adding flag implications to the value domain, i.e.,
\[ V = (R \to (B_1^{32} \times I_s \times I_u)) \times (F \times \{0,1\} \times R \times I_s) \tag{4.5} \]
where $F$ is the set of flag bits of the machine. Each implication entry $(f, b, r, w)$ at
a node $v$ denotes that
\[ f = b \Rightarrow val^{in}_v(r) \sqsubseteq w \tag{4.6} \]
This means that in each transfer function invocation, we may restrict the incoming
value set $val^{in}_v(r)$ to the lattice infimum $val^{in}_v(r) \sqcap w$ iff there is a flag
implication for $r$.
The transfer functions are also responsible for registering new ﬂag implications that
are set up by instructions like cmp and for removing the implication when the base
register r is overwritten with a new value. The basic idea of exploiting conditions
to reﬁne the variable values has already been used before, but here we adapted it
to conditions which are only given as predication entries.
The meet function is applied to each $f \in F$ and $b \in \{0,1\}$ separately as
\[ \sqcup((f, b, r_1, w_1), (f, b, r_2, w_2)) =
   \begin{cases}
     (f, b, r_1, w_1 \sqcup w_2) & \text{if } r_1 = r_2 \\
     (f, b, r_1, \top) & \text{else}
   \end{cases} \tag{4.7} \]
With this extension, the implications in $val^{out}_{a_1}$ will contain the entry
$(EQ, 1, r3, [8,8])$ such that at the beginning of $f_{a_2}$ we can set
\[ val^{out}_{a_2}(r3) = val^{in}_{a_2}(r3) \sqcap [8,8] = \top \sqcap [8,8] = [8,8] \]
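The mechanism can be sketched in a few lines. The state layout (a register map plus a
dictionary of implications keyed by flag and flag value) and all function names are
assumptions made for this illustration; only the signed interval component of the value
domain is modeled.

```python
TOP = (-2**31, 2**31 - 1)   # top element of the signed interval lattice


def glb(x, y):
    """Lattice infimum of two intervals; None encodes the empty interval."""
    lo, hi = max(x[0], y[0]), min(x[1], y[1])
    return (lo, hi) if lo <= hi else None


def transfer_cmp_eq(state, reg, const):
    """Effect of 'cmp reg, #const': register the entry (EQ, 1, reg, [const, const])."""
    values, implications = state
    new_implications = dict(implications)
    new_implications[("EQ", 1)] = (reg, (const, const))
    return values, new_implications


def refine_on_flag(state, flag, bit):
    """On entry to a block that is only executed if flag == bit, narrow the register."""
    values, implications = state
    if (flag, bit) not in implications:
        return state
    reg, w = implications[(flag, bit)]
    new_values = dict(values)
    new_values[reg] = glb(values[reg], w)
    return new_values, implications


def kill_on_write(state, reg):
    """Drop all implications whose base register is overwritten."""
    values, implications = state
    kept = {key: entry for key, entry in implications.items() if entry[0] != reg}
    return values, kept
```

Starting from an unknown value of r3, applying `transfer_cmp_eq` for `cmp r3, #8` and
then `refine_on_flag` for the EQ-guarded block yields the interval `(8, 8)` for r3.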
[2] In Section 2.1 we have stated that the integer value lattice has infinite height.
This is only true for high-level languages where integers of arbitrary length are
supported. On the machine code level, every analysis domain is naturally limited by the
register or memory size specification.
In general, the value analysis with ﬂag implications is able to recognize bounds on
the direct and derived induction variables in counting loops which are very frequent
in embedded system code. A yet higher precision would be achieved by relational
congruences [Cou01] but usually, convex abstractions like intervals are suﬃcient for
the microarchitectural analysis.
4.3 Microarchitectural Analysis
The purpose of the microarchitectural analysis is to determine the possible execution
duration for every block in a context graph. To be able to determine such dura-
tions, ﬁrst the operation of the processor must be correctly formalized. We base
our following introduction to the formal pipeline description on [Wil12], whereas a
thorough treatment of concrete pipeline implementation techniques can be found
in [GLM11].
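Definition 15. A processor pipeline is a finite-state machine
$\tilde{P} = (\tilde{Q}_P, \tilde{I}, \tilde{O}, \tilde{\delta}_P, \tilde{\lambda}_P)$
with state set $\tilde{Q}_P$, input set $\tilde{I}$, output set $\tilde{O}$, state transfer
function $\tilde{\delta}_P : \tilde{Q}_P \times \tilde{I} \to \tilde{Q}_P$ and output function
$\tilde{\lambda}_P : \tilde{Q}_P \times \tilde{I} \to \tilde{O}$. Each state transition of
$\tilde{P}$ models a single clock cycle of the pipeline. $\tilde{P}$ interacts with an
environment $E$ which can itself be represented as a state machine
$\tilde{E} = (\tilde{Q}_E, \tilde{O}, \tilde{I}, \tilde{\delta}_E, \tilde{\lambda}_E)$.
A concrete microarchitectural state is an element of
$\tilde{Q}_M = \tilde{Q}_P \times \tilde{Q}_E$, which forms the concrete microarchitectural
lattice $(\tilde{M} = 2^{\tilde{Q}_M}, \cup)$.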
The environment models the memory hierarchy components like caches and
memories which contain the program P amongst others. We assume a deterministic
execution without preemptions, therefore E is a deterministic FSM, too.
Definition 16. The concrete execution of a program $L = (i_0, \dots, i_n)$ with start
instruction $i_0$ and a set of terminal instructions $I_t$ starting at an initial state
$\tilde{q}^p_0 \in \tilde{Q}_P$, $\tilde{q}^e_0 \in \tilde{Q}_E$ is modeled by
$exec(L, \tilde{q}^p_0, \tilde{q}^e_0) = ((\tilde{q}^p_1, \tilde{q}^e_1), \dots, (\tilde{q}^p_k, \tilde{q}^e_k))$ with
\[ \forall i > 0 : (\tilde{q}^p_i, \tilde{q}^e_i) =
   (\tilde{\delta}_P(\tilde{\lambda}_E(\tilde{q}^e_{i-1})),\ \tilde{\delta}_E(\tilde{\lambda}_P(\tilde{q}^p_{i-1}))) \tag{4.8} \]
where state $\tilde{q}^p_0$ is required to fetch $i_0$ from the environment state
$\tilde{q}^e_0$ and the trace
$((\tilde{q}^p_1, \tilde{q}^e_1), \dots, (\tilde{q}^p_k, \tilde{q}^e_k))$ models the execution
of instructions until a terminal instruction $i_t \in I_t$ has been retired in
$(\tilde{q}^p_k, \tilde{q}^e_k)$. Here, retirement means that the instruction has finally
left the pipeline and its effect was made permanent.
When we are not interested in distinguishing pipeline and environment state we
abbreviate the microarchitectural state to $\tilde{q}^m_i = (\tilde{q}^p_i, \tilde{q}^e_i)$
for all $i$. The exec function then becomes
\[ exec(L, \tilde{q}^m_0) = (\tilde{q}^m_1, \dots, \tilde{q}^m_k) \tag{4.9} \]
Since any basic block $b = (i_0, \dots, i_n)$ is also a valid program with a single terminal
instruction $i_n$, we can determine its concrete execution duration from a given initial
state $(\tilde{q}^p_0, \tilde{q}^e_0)$ as
$|exec(b, \tilde{q}^p_0, \tilde{q}^e_0)| \in \mathbb{N}_0$. The duration may actually be
zero, since in a superscalar processor state, all instructions of the block may have
already been retired in parallel to the execution of the previous block. Obviously, we
need to know about the possible valid initial states $\tilde{q}^p_0$ and $\tilde{q}^e_0$ at
this point to derive the minimum and maximum execution durations.
These states can in principle be computed using abstract interpretation on
$(\tilde{M}, \cup)$. To achieve this we need to set the transfer function
$f^{\tilde{M}}_v$ for a context graph node $v = ((i_0, \dots, i_n), \cdot, \cdot)$ to
\[ \forall m \in \tilde{M} : f^{\tilde{M}}_v(m) = \bigcup_{\tilde{q}^m \in m}
   \{\tilde{q}^m_k \mid exec((i_0, \dots, i_n), \tilde{q}^m) = (\tilde{q}^m_1, \dots, \tilde{q}^m_k)\} \tag{4.10} \]
where we use the notation from Equation 4.9 and $\tilde{q}^m_k$ is the state in which the
last instruction of block $v$ was retired. With these transfer functions,
$((\tilde{M}, \cup), \bigcup_{v \in V^C_\tau} \{f^{\tilde{M}}_v\})$ is a monotone DFA framework
(cf. Definition 5) which can be solved by fixed-point iteration.
Still, the concrete FSMs $\tilde{P}$ and $\tilde{E}$ have far too many states. To make the
DFA solution computable we abstract $\tilde{P}$ and $\tilde{E}$ to abstract state machines.
Definition 17. An abstract pipeline model $P = (Q_P, I, O, \delta_P, \lambda_P)$ and abstract
environment model $E = (Q_E, O, I, \delta_E, \lambda_E)$ for a pipeline $\tilde{P}$ and
environment $\tilde{E}$ are non-deterministic FSMs, i.e.,
$\delta_P : Q_P \times I \to 2^{Q_P}$, $\lambda_P : Q_P \times I \to 2^O$ and
$\delta_E : Q_E \times O \to 2^{Q_E}$, $\lambda_E : Q_E \times O \to 2^I$. In analogy to
Definition 15, each state transition in $P$ and $E$ corresponds to one cycle step of the
modeled pipeline.
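An abstract microarchitectural state is an element of $Q_M = Q_P \times Q_E$, which
forms the abstract microarchitectural lattice $(M = 2^{Q_M}, \cup)$. Since each abstract
state $q^m \in M$ represents a number of concrete states $\tilde{q}^m \in \tilde{M}$,
$(M, \cup)$ and $(\tilde{M}, \cup)$ must be connected through a Galois connection
$\alpha_M : \tilde{M} \to M$ and $\gamma_M : M \to \tilde{M}$. The input (output) sets
$\tilde{I}$ and $I$ ($\tilde{O}$ and $O$) must be covered by a Galois connection in the
same way.
Finally, $P$ and $E$ must be valid with respect to the Galois connections, i.e.,
\[ \forall \tilde{q}^p \in \tilde{Q}_P, \tilde{i} \in \tilde{I} :
   \tilde{\delta}_P(\tilde{q}^p, \tilde{i}) \in \gamma_M(\delta_P(\alpha_M(\tilde{q}^p), \alpha_M(\tilde{i}))) \tag{4.11} \]
\[ \forall \tilde{q}^p \in \tilde{Q}_P, \tilde{i} \in \tilde{I} :
   \tilde{\lambda}_P(\tilde{q}^p, \tilde{i}) \in \gamma_M(\lambda_P(\alpha_M(\tilde{q}^p), \alpha_M(\tilde{i}))) \tag{4.12} \]
\[ \forall \tilde{q}^e \in \tilde{Q}_E, \tilde{o} \in \tilde{O} :
   \tilde{\delta}_E(\tilde{q}^e, \tilde{o}) \in \gamma_M(\delta_E(\alpha_M(\tilde{q}^e), \alpha_M(\tilde{o}))) \tag{4.13} \]
\[ \forall \tilde{q}^e \in \tilde{Q}_E, \tilde{o} \in \tilde{O} :
   \tilde{\lambda}_E(\tilde{q}^e, \tilde{o}) \in \gamma_M(\lambda_E(\alpha_M(\tilde{q}^e), \alpha_M(\tilde{o}))) \tag{4.14} \]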
In the transition from P˜ to P , we remove all modeling of value changes of
registers and main memory cells and all component-internal state like arithmetic
unit states. The value changes are partly covered by the value analysis and the rest
is abandoned to limit the complexity. In the environment, we remove the modeling
of memory cell values and memory hierarchy component state which does not aﬀect
the timing. As a consequence of this reduction, the state transitions in P and E are
non-deterministic. This non-determinism is needed to resolve situations in which
the successor or output in the concrete FSM depends on a state component which
is not modeled in the abstract FSM like, e.g., the program input values. Therefore,
the abstract execution must gather all reachable abstract states.
Definition 18. The abstract execution of a program $L = (i_0, \dots, i_n)$ with start
instruction $i_0$ and a set of terminal instructions $I_t$ starting at an abstract state
$q^m_0 \in Q_M$ is modeled by $exec(L, q^m_0) = (m_1, \dots, m_k)$ with $m_i \subseteq Q_M$ and
\[ \forall i > 0 : m_i = \bigcup_{\substack{(q^p_{i-1}, q^e_{i-1}) \in m_{i-1} \text{ with} \\ \neg retired((q^p_{i-1}, q^e_{i-1}))}}
   \{(q^p_i, q^e_i) = (\delta_P(\lambda_E(q^e_{i-1})),\ \delta_E(\lambda_P(q^p_{i-1})))\} \tag{4.15} \]
where $retired : Q_M \to \{true, false\}$ determines whether an instruction from $I_t$ was
retired in a given abstract state. Therefore, each set $m_i$ holds those states which
are reachable in the $i$-th cycle of an execution starting at any concrete initial state
$\tilde{q}^m_0 \in \gamma_M(q^m_0)$. The states in which the program may have been completed
in the abstract execution are given by
\[ compstates(L, q^m_0) = \{q^m_c \mid \exists m_i \in exec(L, q^m_0) : q^m_c \in m_i \wedge retired(q^m_c)\} \tag{4.16} \]
The transfer functions on $M$ are then defined as
\[ \forall m \in M : f^M_v(m) = \bigcup_{q^m \in m} compstates(L, q^m) \tag{4.17} \]
In this way, we can compute the possible initial abstract microarchitectural states
for each block in a context graph by data-flow analysis on the DFA framework
$((M, \cup), \bigcup_{v \in V^C_\tau} \{f^M_v\})$.
Definition 19. Given a context block $v = ((i_1, \dots, i_n), \cdot, \cdot)$ and a set of
possible initial hardware states $q^{in}_v \in M$, the block duration
$\omega(v) \in I_{[0,\infty]}$ is
\[ \omega(v) = [\min(T), \max(T)] \tag{4.18} \]
\[ T = \{i \mid \exists m_i \in exec(L, q^m_0) : q^m_c \in m_i \wedge retired(q^m_c)\} \tag{4.19} \]
The lower (upper) bound of $\omega(v)$ is denoted as $\omega_{min}(v)$ ($\omega_{max}(v)$).
An example for the analysis of a block which yields a value of $\omega(v) = [2,5]$ is given
in Figure 4.8. The figure illustrates the computation of $q^{out}_v = f^M_v(q^{in}_v)$, where
each edge represents one cycle step of the abstract microarchitectural states. To compute
$q^{out}_v$, the cycle step must be invoked for all initial states and their successors
until all instructions from $v$ are retired at every sink of the resulting transition
graph. The set of sinks is then called $q^{out}_v$. As a side product of this state
computation, we can derive the execution time bound $\omega(v)$ which can be used in the
path analysis to compute the longest (shortest) path through the program and by that the
WCET (BCET).
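The cycle-wise state exploration of Definitions 18 and 19 can be sketched as follows.
The toy machine used here (a single memory access that either hits or misses) and all
names are assumptions for illustration; a real abstract state would be a pair of
pipeline and environment state rather than a small tuple.

```python
def abstract_exec(q0, step, retired, max_cycles=1000):
    """Return [m_1, ..., m_k], where m_i holds the states reachable after i cycles."""
    trace, current = [], {q0}
    for _ in range(max_cycles):
        pending = {q for q in current if not retired(q)}
        if not pending:                    # every state has retired the block
            return trace
        current = set().union(*(step(q) for q in pending))
        trace.append(current)
    raise RuntimeError("block did not complete within max_cycles")


def block_duration(initial_states, step, retired):
    """Compute omega(v) = [min(T), max(T)] over all initial states."""
    times = set()
    for q0 in initial_states:
        for i, m_i in enumerate(abstract_exec(q0, step, retired), start=1):
            if any(retired(q) for q in m_i):
                times.add(i)
    return (min(times), max(times))


# Toy non-deterministic machine: an access is issued in the first cycle and then
# needs either 1 more cycle (hit) or 4 more cycles (miss) before retirement.
def step(q):
    if q == ("issue",):
        return {("wait", 0), ("wait", 3)}  # unknown classification: explore both
    _, remaining = q
    return {("done",)} if remaining == 0 else {("wait", remaining - 1)}


def retired(q):
    return q == ("done",)
```

For this toy machine, `block_duration({("issue",)}, step, retired)` evaluates to
`(2, 5)`: the hit branch retires after two cycles and the miss branch after five.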
Figure 4.8: Example abstract microarchitectural states $q^m \in Q_M$ during the analysis
of a context block $v \in V^C_\tau$ with $\omega(v) = [2,5]$. Each state transition
corresponds to one cycle step of the modeled pipeline. (The horizontal axis shows cycles
0 to 5; the initial states belong to $q^{in}_v \in M$, the final states to
$q^{out}_v \in M$.)

In a timing-anomaly-free architecture as defined in Section 2.2.8, we may restrict
the output of the transfer function to those states which correspond to the local
worst-case successors. In that case, we are no longer building an approximation of
every possible microarchitectural state with which a block may be entered but of
those states with which it may be entered in an execution that exclusively consists
of local worst-case transitions. In a timing-anomaly-free system this is suﬃcient
for ﬁnding the WCET, since then we know that the global worst-case execution
does only consist of local worst-case transitions. Formally, the transfer function for
timing-anomaly-free systems becomes
\[ \forall m \in M : f^M_v(m) = \bigcup_{q^m \in m}
   \{q^m_c \mid m_k \in exec(L, q^m) : q^m_c \in m_k \wedge retired(q^m_c)\} \tag{4.20} \]
i.e., we only draw the completion states from the longest traces, whose length is
given by $k$ as in Definition 18. For the example in Figure 4.8, this implies that only
the gray states at times 4 and 5 would be part of $q^{out}_v$, since for the topmost
initial state the longest execution path ends at time 4 and for the other two initial
states the longest execution path ends at time 5. Therefore, we could derive a tighter
execution time window of $\omega(v) = [4,5]$ in a timing-anomaly-free system. Obviously,
once we
start to exploit the timing-anomaly-freedom to cut down the microarchitectural
search space in this way, the resulting ω values are only guaranteed to be valid for
the worst-case path. Therefore, we can no longer use them to determine a BCET
of the task under analysis.
This FSM-based modeling approach has been applied successfully in many static
WCET analyzers including the commercial product aiT. The microarchitectural
state can then be represented as an element of $M$ as sketched above [SF99; LTH02;
The04]. For complex microarchitectures, the state space that must be explored for
a single block (as shown in Figure 4.8) can become large, which increases the analysis
time. To counter this, symbolic representations of the state transfer function [Wil12]
can be used in place of the explicit state enumeration from Figure 4.8. A less common
direction of microarchitectural modeling is the representation of the pipeline structure
by a directed graph [LRM06], but this approach integrates far less easily with an
analysis of the environment (e.g., memory hierarchy elements) and with out-of-order
processors where basic block executions may overlap.
4.3.1 ARM7TDMI Pipeline Model
As stated, the construction of the abstract pipeline model P = (QP , I,O, δP , λP )
consists of a slicing of the concrete state space into timing-relevant and non-relevant
components. There are approaches which create P semi-automatically from formal
hardware descriptions [SP10], but most pipeline models are generated manually by
careful inspection of formal hardware descriptions, data sheets or by conducting
extensive measurements on real hardware.
In the case of the ARM7TDMI, the timing is precisely speciﬁed in its technical
manual [ARM04], which drastically eases the derivation of P . Figure 4.9 shows
the basic structure of the timing model of the ARM7TDMI. We use an extended
FSM model here, where the FSM state holds a ﬁxed amount of ﬁnite-sized variables.
Transitions have guards (marked in square brackets) and actions on the FSM vari-
ables to make the model more compact. Nevertheless, we can expand this modeling
to a classical FSM due to the ﬁxed size of the FSM variables.
The core has an in-order pipeline which is reﬁlled each time a branch is taken.
The initial ﬁlling of the pipeline as well as branch-induced reﬁlling is modeled by
the chain of “Fetch + X” states which increment the address from which the next
instruction is fetched by the fetch width b. Since the ARM7TDMI supports both
32-bit ARM as well as 16-bit THUMB instructions, b is equal to either 2 or 4 bytes.
Once the pipeline is ﬁlled, most instructions are executed in one cycle and the
pipeline keeps on fetching new instructions and executing one-cycle instructions in
parallel which is modeled by the self-loop at “Arithmetic / Fetch + 2b”. However, the
duration of multiplications may vary by up to 3 cycles as shown by the multiplication
states on the right hand side. The duration depends on the value of the second
operand of the multiplication. If the value analysis can precisely determine the
operand, then we are able to provide a precise multt(c) interval. For the example
of a MLAL (multiply-accumulate on long word) instruction, this means that in the
best case we get an estimation of multt(c) = [3,3], whereas in the worst case we fall
back to multt(c) = [3,6]. Similarly, the accessed address of a load, store or swap
instruction must be determined in the helper function acc(c) by reading the value
analysis results. Some of the memory-accessing instructions store or load multiple
registers at once for an eﬃcient stack handling. The number of aﬀected registers is
encoded in the parameter r = regs(c) in P .
Finally, the non-determinism in the operation of P can be observed at states
which have transitions of which multiple may be ready at the same point in time.
As an example, when the duration d = multt(c) is equal to [0,1] at the “Multiply”
state due to insuﬃcient value analysis precision, the guard conditions [0 ∈ d] and
[d∖{0} ≠ ∅] are both true at the same time. In this case, the analysis must explore
all possible result states as shown in Figure 4.8.
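The following minimal sketch shows how the two guards at the "Multiply" state lead to
several successor states when the remaining duration is imprecise. The interval
encoding of d as a (lower, upper) tuple and the placeholder name "NEXT" for the state
that is entered once the multiplication has finished are assumptions made for this
example.

```python
def multiply_successors(d):
    """Enumerate the successors of the 'Multiply' state for a duration interval d."""
    lo, hi = d
    successors = []
    if lo == 0:
        # Guard [0 in d]: the multiplication may be finished, leave the state.
        successors.append(("NEXT", None))
    if hi > 0:
        # Guard [d \ {0} != {}]: cycles may remain. Continue with the non-zero
        # part of d and apply the action d-- to both interval bounds.
        successors.append(("Multiply", (max(lo, 1) - 1, hi - 1)))
    return successors
```

For `d = (0, 1)` both guards hold, so the call returns two successors that the analysis
has to explore, whereas `d = (2, 3)` yields the single successor `("Multiply", (1, 2))`.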
Figure 4.9: The abstract pipeline model for the ARM7TDMI. (State diagram not
reproduced. Its states are: Fetch + 0, Fetch + b, Arithmetic / Fetch + 2b (updates b
and c), Fetch + 3b, Write, MAC & MultLong, MAC / MultLong, Multiply, Swap Read, Swap
Write, Swap Delay, Read, WriteToReg and RegisterShift. Transitions are labeled in the
form input[guard]/action, e.g., cmpl(a)[c = MUL]/d = multt(c).)

Field   Type         Description
a       O (Output)   Current access
c       I (Input)    Current instruction
cmpl    I (Input)    Access completions
f       I_u          Addresses to fetch from
b       I_u          Possible fetch width
d       I_u          Remaining cycles in state
r       I_u          Number of registers to be processed

Function     Description
mem(c)       Issue memory access (Output)
acc(c)       Possible access range of c
width(c)     Possible fetch widths at c
multt(c)     Multiplication duration at c
regs(c)      Number of registers processed by c
succr(c)     Address range of successors of the current basic block
writePC(c)   Whether c writes to the PC

The mem(c) function can only determine the duration of a memory access if
it interacts with the abstract environment $E = (Q_E, O, I, \delta_E, \lambda_E)$ as defined
in the beginning of Section 4.3. It therefore has to issue an abstract memory request
$o \in O$ to the environment which the environment must respond to by signaling cmpl(a)
in its future output and by altering its state $q^e$, e.g., if the access touches a cache.
As mentioned, the domain of the microarchitectural analysis is
\[ M = 2^{Q_M} = 2^{Q_P \times Q_E} \tag{4.21} \]
i.e., to determine the runtime of a memory access, each pipeline state qp ∈ QP
interacts with its assigned environment state qe ∈ QE .
To summarize the pipeline aspects, if $Q^\#_P$ is the set of FSM states from Figure 4.9,
we can now formally define $Q_P$ for the ARM7TDMI pipeline as
\[ Q_P = Q^\#_P \times (I_u)^4 \tag{4.22} \]
where the $I_u$-components represent the FSM variables $f$, $b$, $d$ and $r$ from Figure 4.9.
For this extended FSM model, we also must adapt the meet function of the
microarchitectural lattice. Previously, the meet function was the set union of the
contained states, but here we may have elements with identical FSM state but different
FSM variable values. Therefore, the meet operator on two elements
$q^M_1, q^M_2 \in M$ is defined as
\[ q^M_1 \sqcup q^M_2 = \bigcup_{q^\# \in Q^\#_P}
   \bigsqcup(\{q^m \mid q^m = ((q^\#, \dots), q^e) \in (q^M_1 \cup q^M_2)\}) \tag{4.23} \]
where $(q^p_1, q^e_1) \sqcup (q^p_2, q^e_2) = (q^p_1 \sqcup q^p_2, q^e_1 \sqcup q^e_2)$ and for
all $q^p_1 = (q^\#, f_1, b_1, d_1, r_1) \in Q_P$ and $q^p_2 = (q^\#, f_2, b_2, d_2, r_2) \in Q_P$
\[ q^p_1 \sqcup q^p_2 = \{(q^\#, f_1 \sqcup f_2, b_1 \sqcup b_2, d_1 \sqcup d_2, r_1 \sqcup r_2)\} \tag{4.24} \]
Thus, we exploit the fact that the FSM variables are all intervals which can be easily
merged to reduce the state space size. Note that we could even go one step further
in this direction by setting $Q_P = 2^{Q^\#_P} \times (I_u)^4$. In this case we would lose
every connection between FSM states $q^\# \in Q^\#_P$ and the FSM variables, which would
allow us to represent even more concrete states with a single abstract one at the cost of
reduced analysis precision. To keep up a high precision, we did not pursue this approach.
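The merging of states with identical FSM state can be sketched as follows. For brevity
the environment component is omitted and each pipeline state is encoded as a pair of an
FSM state name and a tuple of four intervals; both choices as well as the function
names are assumptions made for this illustration.

```python
def join_interval(x, y):
    """Smallest interval that contains both x and y."""
    return (min(x[0], y[0]), max(x[1], y[1]))


def merge_states(states):
    """Merge all pipeline states that share the same FSM state.

    Each element of `states` is (fsm_state, variables), where `variables` is a
    tuple of four (lower, upper) intervals for the FSM variables.
    """
    merged = {}
    for fsm_state, variables in states:
        if fsm_state in merged:
            merged[fsm_state] = tuple(
                join_interval(a, b) for a, b in zip(merged[fsm_state], variables)
            )
        else:
            merged[fsm_state] = tuple(variables)
    return merged
```

Two states at "Multiply" with different variable intervals collapse into a single
state whose intervals cover both, while states at other FSM states are left untouched.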
The environment model must include all microarchitectural components like
buses, caches, memories and peripherals. In practice, usually neither SRAM memory
modules nor peripherals are modeled in detail. For SRAM, a fixed access time (or
a very small time window) is given, which makes a more precise analysis of its
behavior unnecessary, since we are only concerned with time here. Accesses to
peripherals are rare enough that we can assume the full time window (best-case to
worst-case) on each access.
For buses at which the current core is definitely the sole master and thus will be
granted the bus immediately, little modeling is needed as well. The only thing that will
regularly happen here is that the target address of the access cannot be determined
precisely, e.g., due to coarse results in the value analysis. In this situation, the
transition in the bus module is non-deterministic, and all possible access targets
have to be explored separately as shown in Figure 4.8.
One important component that is modeled in the environment state, however, is the
cache, since caches show both high execution time variation and high access frequency.
This makes a precise analysis of the cache behavior indispensable if the generated
$\omega(v)$ values shall not be overly conservative.
4.3.2 Cache Analysis
The ﬁrst cache models incorporated the analysis of direct-mapped caches into the
path analysis [LBJ+95; LMW96]. Since these approaches were neither scalable nor
could they be integrated with the rest of the microarchitectural modeling as sketched
above, an eﬃcient abstract model for set-associative caches with Least Recently Used
(LRU) replacement was ﬁrst devised in [FW99]. This approach can classify a cache
access in a context block as “always hit (AH)”, “always miss (AM)” or “unknown (U)”.
It was later reﬁned to include the classiﬁcation “ﬁrst-access-miss-all-others-hit”, also
known as “persistent” [BC08].
The First-In First-Out (FIFO) replacement policy has proven to be much harder
to analyze [GR10] and to consistently perform worse than LRU in terms of the
necessary number of replacements [CN98]. FIFO and Pseudo-LRU (PLRU) caches
can be analyzed using the concept of relative competitiveness [BRA09], i.e., their
analysis can be reduced to the analysis of an appropriately shrunk LRU cache. This
emphasizes the usefulness of LRU analysis.
A basic abstract model of the cache operation can be seen in Figure 4.10. Accesses
to the cache modeled by the input values $r$ and $n$ are classified with respect to
the abstract cache state $q^C \in C$ via a classification function
$cls(q^C, r) \subseteq \{HIT, MISS\}$ that determines whether the access must be a hit or a
miss or may be both. Upon a miss, the content is fetched by issuing memory accesses via
$fetch(r)$ to the component from which the line is loaded. Finally, the completion of the
original request $r$ is signaled to the source (i.e., the pipeline) by $cmpl(r)$. Note that
the cache content is not part of our abstract cache model; we therefore only communicate
the completion without knowing what data is actually transferred by the cache.
The cache analysis domain $\mathcal{C}$, which is part of the environment state $Q_E$, is
therefore defined as
\[ \mathcal{C} = 2^{Q_C} \times C \tag{4.25} \]
where $Q_C$ is the set of states in the abstract FSM from Figure 4.10 and $C$ is the
domain of abstract cache states as detailed in the following.
For a set-associative cache, the state of the cache $q^C \in C$ is usually maintained
for each cache set $s \in S$ independently [FW99] as a set state $q^C_s \in q^C$.
Alternatively, the cache state can be represented as a tuple of explicit set states
[CR09]. This results in higher precision but also far higher analysis duration, which is
why we use the former approach from [FW99]. An abstract memory request $r$ is mapped to
the set of cache sets which it may affect by an abstract assignment function
$sets(r) \subseteq S$.
Figure 4.10: A simplified abstract cache model. One idle cycle is enforced between
successive accesses to reduce the graph size. (State diagram not reproduced. Its states
are: Idle, Cache Hit, Cache Miss, Write Back, Fetch Line and Write Through. Transitions
are labeled in the form input[guard]/action, e.g., r[HIT ∈ cls(qC, r)]/qC = touch(qC, r),
d = Dhit.)

Field   Type            Description
n       Input           No access
r       Input           Memory access r
cmpl    Input           Request completions
d       N_0             Processing delay
qC      C               Abstract cache state
wb      {true, false}   Write policy

Function   Description
cls        Classify access
touch      Update cache state on hit
replace    Update cache state on eviction
dirty      Determine if line may be dirty
write      Issue write request (Output)
fetch      Issue fetch request (Output)
cmpl       Signal access completion (Output)

Since the address of an access as determined by the value analysis may be unknown,
$sets(r) = S$ is possible [3], but for a fixed address, we have $|sets(r)| = 1$.
Analogously, the possible tags of the memory block that is targeted by access $r$ are
given by $tags(r) \subseteq T$ where $T$ is the set of all tags. Therefore, we can define
the cache state functions by delegating the work to the set states as
\[ cls(q^C, r) = \bigcup_{s \in sets(r),\, t \in tags(r)} cls(q^C_s, t) \tag{4.26} \]
\[ touch(q^C, r) = \bigsqcup_{s \in sets(r),\, t \in tags(r)} touch(q^C, s, t) \tag{4.27} \]
\[ replace(q^C, r) = \bigsqcup_{s \in sets(r),\, t \in tags(r)} replace(q^C, s, t) \tag{4.28} \]
[3] In general, this is one crucial dependency of the cache analysis, which is also why
dynamic memory allocation is usually prohibited: the resulting addresses are
unpredictable. Approaches to circumvent this are cache-aware allocators [HBH+11] and
relational cache analysis [HG12; Weg12], which works on an access position relation
instead of on absolute address values.

Every $q^C_s$ must respect the replacement policy of the cache which we assume to
be LRU as mentioned above. A concrete LRU set state $\tilde{q}^C_s$ for an associativity
$\hat{a}$ is an age function $T_s \to A$ that maps a subset $T_s \subseteq T$ to an age set
$A = \{1, \dots, \hat{a}, \infty\}$ with the concrete semantics
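\[ cls(\tilde{q}^C_s, t) =
   \begin{cases}
     HIT & \text{if } \tilde{q}^C_s(t) \ne \infty \\
     MISS & \text{else}
   \end{cases} \tag{4.29} \]
\[ touch(\tilde{q}^C, s, t) = \bigcup_{s' \in S}
   \begin{cases}
     \tilde{q}^{C\prime}_s(t') =
       \begin{cases}
         1 & \text{if } t' = t \\
         \tilde{q}^C_s(t') + 1 & \text{if } \tilde{q}^C_s(t') < \tilde{q}^C_s(t) \\
         \tilde{q}^C_s(t') & \text{else}
       \end{cases} & \text{if } s = s' \\
     \tilde{q}^C_s & \text{else}
   \end{cases} \tag{4.30} \]
\[ replace(\tilde{q}^C, s, t) = \bigcup_{s' \in S}
   \begin{cases}
     \tilde{q}^{C\prime}_s(t') =
       \begin{cases}
         1 & \text{if } t' = t \\
         \tilde{q}^C_s(t') + 1 & \text{else}
       \end{cases} & \text{if } s = s' \\
     \tilde{q}^C_s & \text{else}
   \end{cases} \tag{4.31} \]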
This concrete semantics is abstracted to an age range for each tag, i.e.,
$\forall t \in T : q^C_s(t) = [a^t_{min}, a^t_{max}]$ such that
$a^t_{min}, a^t_{max} \in \{1, \dots, \hat{a}, \infty\}$ and $a^t_{min}$ ($a^t_{max}$) is
the minimum (maximum) age of tag $t$ at the current block. $a^t_{min}$ ($a^t_{max}$) is
also called may-information (must-information) since it can be used to determine what may
(must) be in the cache as shown in the following definition of $cls(q^C_s, t)$
\[ cls(q^C_s, t) =
   \begin{cases}
     \{HIT\} & \text{if } a^t_{max} \ne \infty \\
     \{MISS\} & \text{if } a^t_{min} = \infty \\
     \{HIT, MISS\} & \text{else}
   \end{cases} \tag{4.32} \]
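The touch and replace functions coincide in this domain (replace = touch) with
\[ touch(q^C, s, t) = \bigcup_{s' \in S}
   \begin{cases}
     q^{C\prime}_s(t') = [a^{t' \to t}_{min}, a^{t' \to t}_{max}] & \text{if } s = s' \\
     q^C_s & \text{else}
   \end{cases} \tag{4.33} \]
\[ a^{t' \to t}_{min} =
   \begin{cases}
     1 & \text{if } t = t' \\
     a^{t'}_{min} + 1 & \text{if } a^{t'}_{min} \le a^t_{min} \\
     a^{t'}_{min} & \text{else}
   \end{cases} \tag{4.34} \]
\[ a^{t' \to t}_{max} =
   \begin{cases}
     1 & \text{if } t = t' \\
     a^{t'}_{max} + 1 & \text{if } a^{t'}_{max} < a^t_{max} \\
     a^{t'}_{max} & \text{else}
   \end{cases} \tag{4.35} \]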
where the addition on ages is again only defined on $A$, i.e., $\hat{a} + 1 = \infty$. The
meet function on $C$ is then defined as the interval union, i.e.,
\[ (q^C_1 \sqcup q^C_2)(s)(t) = [\min(q^C_1(s)(t), q^C_2(s)(t)),\ \max(q^C_1(s)(t), q^C_2(s)(t))] \tag{4.36} \]
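The abstract LRU semantics for a single cache set can be sketched as follows. The
dictionary encoding (tag to a pair of minimum and maximum age), the use of `math.inf`
for the age ∞ and the function names are assumptions made for this illustration; the
interval maps of the actual implementation are not modeled.

```python
import math

INF = math.inf


def inc(age, assoc):
    """Age increment on A = {1, ..., assoc, inf}, i.e., assoc + 1 = inf."""
    return INF if age >= assoc else age + 1


def classify(set_state, tag):
    """Classify an access to `tag` from its age interval."""
    a_min, a_max = set_state[tag]
    if a_max != INF:
        return {"HIT"}             # must be cached
    if a_min == INF:
        return {"MISS"}            # cannot be cached
    return {"HIT", "MISS"}


def touch(set_state, tag, assoc):
    """Update the age intervals of all tags of one set on an access to `tag`."""
    t_min, t_max = set_state[tag]
    new_state = {}
    for other, (o_min, o_max) in set_state.items():
        if other == tag:
            new_state[other] = (1, 1)
        else:
            new_min = inc(o_min, assoc) if o_min <= t_min else o_min
            new_max = inc(o_max, assoc) if o_max < t_max else o_max
            new_state[other] = (new_min, new_max)
    return new_state


def join(state1, state2):
    """Interval union of two set states over the same tags."""
    return {
        tag: (min(state1[tag][0], state2[tag][0]), max(state1[tag][1], state2[tag][1]))
        for tag in state1
    }
```

Starting from an empty two-way set, an access to tag "a" followed by an access to tag
"b" leaves "a" at age 2 and "b" at age 1. Joining the states before and after the second
access gives "b" the interval [1, ∞], so a subsequent access to "b" is classified as
{HIT, MISS}, whereas "a" remains a guaranteed hit.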
The initial DFA information for the start node is the top element $\top_C$ of $C$,
i.e., $(Q_C, q^C_{init})$ with $\forall s \in S, t \in T : q^C_{init}(s)(t) = [0, \infty]$.
In the implementation, we obviously do not want to represent all age intervals
explicitly, since many entries will be identical. As an example, all entries in
$q^C_{init}$ are identical. We therefore use interval maps to compress ranges of sets and
tags with identical abstract information to a single value.
Multiple cache levels can be handled by directing the output of the abstract L1
cache to the abstract L2 cache which may then in turn perform non-deterministic
state transitions based on the forwarded request. If the cache hierarchy is analyzed
separately from the rest of the system (i.e., pipeline, buses), multiple cache levels
can be analyzed more efficiently by computing cache access classifications (CAC)
per hierarchy level [HP08; LHP09], but this requires a compositional timing model
as discussed in Section 2.2.9.
4.4 Path Analysis
For a program $L$ and a set of initial system states $\tilde{Q}_0 \subseteq \tilde{Q}_M$,
we have
\[ WCET_{real} = \max_{q^m_0 \in \tilde{Q}_0} \{|exec(L, q^m_0)|\} \tag{4.37} \]
Usually, $\tilde{Q}_0$ contains all possible program inputs and one or more initial
hardware states. The goal of the path analysis is to derive a
$WCET_{est} \ge WCET_{real}$ with the help of the block durations
$\omega(v) \in I_{[0,\infty]}$ generated by the microarchitectural analysis. The first step
towards this is to define an alternative notation for a program execution as a sequence
of context blocks and concrete block durations, i.e.,
\[ bbexec(L, q^m_0) = ((v^C_1, w_1), \dots, (v^C_k, w_k)) \tag{4.38} \]
as opposed to $exec(L, q^m)$ from Definition 16 which defines an execution as a
sequence of system states $q^m_i$. This can easily be achieved by mapping each contiguous
sequence of system states $q^m_i$ with length $w$ which "belong" to the same block $v^C$
along the lines of Definition 19 to an element $(v^C, w)$.
From the correctness of the abstract microarchitectural model from Definition 17
and the prerequisite that the initial DFA information $m_0 \in M$ covers $\tilde{Q}_0$,
i.e., $\gamma_M(m_0) \supseteq \tilde{Q}_0$, we know that
\[ \forall (v^C_i, w_i) \in bbexec(L, \hat{q}^m) : w_i \in \omega(v^C_i) \tag{4.39} \]
where $\hat{q}^m$ is the initial worst-case state that maximizes Equation 4.37. Thus, it
follows that
\[ WCET_{real} = \sum_{v^C_i \in bbexec(L, \hat{q}^m)} w_i
   \le \sum_{v^C_i \in bbexec(L, \hat{q}^m)} \omega_{max}(v^C_i) = WCET^{opt}_{est} \tag{4.40} \]
Of course, since bbexec returns a sequence, those $v^C_i$ which occur multiple times
in the sequence must also be counted multiple times in the sums above. The path
analysis is now responsible for finding a $WCET_{est} \ge WCET^{opt}_{est}$.
Since the node sequence returned by bbexec is a path, we call a sum of weight
values for each node a path weight. In Equation 4.40, the weight of the worst-case
path described by bbexec is computed with respect to $w_i$ and $\omega_{max}$.
Unfortunately, this worst-case path is unknown in general. Therefore $WCET^{opt}_{est}$
can again only be approximated. We define $WCET^{feasible}_{est}$ to be the weight of the
feasible path from the context graph source to its sink which has maximum weight with
respect to $\omega_{max}$. Since the worst-case block sequence must be a feasible path as
defined in Section 2.1 we know that $WCET^{feasible}_{est} \ge WCET^{opt}_{est}$.
Unfortunately, it is undecidable whether a path is feasible in general. Therefore
user annotations in the form of flow facts as introduced in Section 3.3 are needed
to specify which paths are feasible. Since the user may choose not to exclude every
infeasible path but only a subset $P^{ex}$ which must be excluded to achieve a reasonably
tight WCET, this procedure yields a $WCET^{ann}_{est} \ge WCET^{feasible}_{est}$. Those
paths which are always in $P^{ex}$ are infinite paths through loops which would otherwise
lead to an infinite $WCET^{ann}_{est}$. For simple loops, these bounds can also be detected
automatically to remove some annotation effort from the user [CM07]. Also, mutual
exclusion of special types of context blocks can be automatically assessed in some
cases [CBR13], but not in general.
The ﬁrst approach to path analysis was tree-based analysis, which operated on
a high-level syntax tree of the program [PK89; PS91; CP01; AGP03]. It allows for
an eﬃcient evaluation but is limited since for a precise microarchitectural analysis,
the mapping of high-level statements and low-level instructions needs to be known,
which is highly non-trivial. A high-level CFG and its corresponding low-level CFGs
may be substantially diﬀerent from each other. For hybrid WCET analyses, the tree-
based method can be attractive since no CFG reconstruction is needed then [BB06].
Other approaches include the integration of microarchitectural analysis with the
path analysis [EGL11] and explicit path enumeration [CMR+05] which both have
proven to not scale well. The latter can be combined with abstract interpretation
on a path domain to answer more complex path problems [KFM13].
However, for the classical path problem, the Implicit Path Enumeration Tech-
nique (IPET) [LM97] is still the de-facto standard. It creates an Integer Linear
Program (ILP) that models the feasible paths and ﬁnds one with maximal weight.
Though ILP solving is NP-complete in general, the IPET ILP for reducible pro-
grams is totally unimodular [LM97], and totally unimodular ILPs can be solved in
polynomial time [KV12].
Since Equation 4.40 already indicates that the $WCET^{ann}_{est}$ can be formed as a
summation over block frequencies times the block weight, the IPET ILP for a context
graph $G^C_\tau = (V^C_\tau, E^C_\tau, v_{0,\tau})$ uses the objective function

\[ \max \sum_{v \in V^C_\tau} \omega_{max}(v) \cdot x_v \tag{4.41} \]

where $x_v$ is the WCEC of the block $v$, i.e., the number of times $v$ is visited on
the worst-case execution path. The maximization is subject to the following flow
conservation constraints, which force the generated solution to be a path according
to the definition in Section 2.1:

\[ \forall v \in V^C_\tau : \sum_{v_{pred} \in \delta^-(v)} x_{(v_{pred},v)} = x_v = \sum_{v_{succ} \in \delta^+(v)} x_{(v,v_{succ})} \tag{4.42} \]

where $x_{(v,w)}$ for an edge $(v,w) \in E^C_\tau$ is the WCEC of edge $(v,w)$.
For the IPET, the context graph is extended by a virtual source v+ with edges
(v+, v) for all v ∈ δA⊥ (ν⊥(f⊥τ)) and a virtual sink v− with edges (v, v−) for all v ∈
δA⊤ (ν⊤(f⊥τ)). These edges are also counted in Equation 4.42, but v+, v− ∉ V Cτ .
Instead, we have the following initializing and terminating conditions for v+ and v−:

\[ 1 = \sum_{v_{succ} \in \delta^+(v^+)} x_{(v^+,v_{succ})} \tag{4.43} \]

\[ \sum_{v_{pred} \in \delta^-(v^-)} x_{(v_{pred},v^-)} = 1 \tag{4.44} \]
Equation 4.43 enforces that one of the edges from v+ towards an entry node of τ
must be executed once, which models the program start. Due to Equation 4.42 the
execution frequencies xv and xe form a ﬂow through the graph which according to
Equation 4.44 must end by supplying one ﬂow unit, i.e., one execution, to any of the
edges towards the supersink v−. Still, we will add more constraints in the following,
to exclude some infeasible paths from this ﬂow and to bound the amount of ﬂow
per node, i.e., the length of the modeled path.
For every call context $c$ with subgraph $G^c_\tau = (V^c_\tau, E^c_\tau)$, we define the set of
incoming call edges as $E^c_{callers} = \{(v,w) \in E^C_\tau \mid v \notin E^c_\tau \wedge w \in E^c_\tau\}$. For each
$(v,w) \in E^c_{callers}$ we define the associated return edge set
$E^c_{(v,w)-returns} = \{(x,y) \in E^C_\tau \mid x \in E^c_\tau \wedge context(v) = context(y)\}$,
i.e., the return edges which lead back into the context
of $v$. These are used to enforce that a path which enters a call context must also
use a return edge towards the caller, as specified in the constraint

\[ \forall \text{ call contexts } c : \forall e_{call} \in E^c_{callers} : x_{e_{call}} = \sum_{e_{ret} \in E^c_{e_{call}-returns}} x_{e_{ret}} \tag{4.45} \]
Each loop bound of the form min $B^l_{min}$ max $B^l_{max}$ as introduced in Section 3.3 is
attached to a natural loop $l \in \circlearrowleft_\tau$ according to Definition 12. The loop by definition
has a unique entry/head node $v^l_{head}$ and a set of back-edges $E^l_{back}$, such that we can
bound the loop iterations in the IPET by limiting the back-edge usage based on the
frequency of the entry-edge usages as given in the following constraints:

\[ \sum_{e_{back} \in E^l_{back}} x_{e_{back}} \leq B^l_{max} \cdot \left( \sum_{e_{entry} \in \{(v, v^l_{head}) \in E^C_\tau\} \setminus E^l_{back}} x_{e_{entry}} \right) \tag{4.46} \]

\[ \sum_{e_{back} \in E^l_{back}} x_{e_{back}} \geq B^l_{min} \cdot \left( \sum_{e_{entry} \in \{(v, v^l_{head}) \in E^C_\tau\} \setminus E^l_{back}} x_{e_{entry}} \right) \tag{4.47} \]
Equation 4.46 is needed to make the WCET ﬁnite, whereas Equation 4.47 is optional
and will improve the precision of the BCET of the application.
Each flow restriction $c^l_1 b^l_1 + \ldots + c^l_n b^l_n \leq c^r_1 b^r_1 + \ldots + c^r_m b^r_m$ is itself already an
inequality which relates block execution frequencies $b_i$ to each other, and it
can therefore be directly integrated into the ILP by replacing each $b_i$ with its respective block
variable as

\[ c^l_1 x_{b^l_1} + \ldots + c^l_n x_{b^l_n} \leq c^r_1 x_{b^r_1} + \ldots + c^r_m x_{b^r_m} \tag{4.48} \]
The ﬂow restrictions can be used ﬂexibly to model path infeasibility, to express
dependencies between non-adjacent blocks and to precisely bound complex loop
behavior [KKP+11].
The value of the objective function from Equation 4.41 under the constraints
from Equation 4.42 to Equation 4.48 is a valid $WCET^{ann}_{est} \geq WCET_{real}$. When
the objective in Equation 4.41 is minimized and $\omega_{min}$ is used instead of $\omega_{max}$, a
$BCET^{ann}_{est} \leq BCET_{real}$ is produced analogously.
Whenever we refer to a WCET (BCET) of a task in the following, this always
denotes the $WCET^{ann}_{est}$ ($BCET^{ann}_{est}$) unless stated otherwise.
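
To make the IPET formulation more tangible, the following minimal sketch evaluates the objective of Equation 4.41 under the constraints of Equations 4.42 to 4.44 and 4.46 for a tiny, purely hypothetical context graph with one loop. The graph, the node weights and the loop bound are invented for illustration. Instead of an ILP solver, the sketch simply enumerates all small integer edge counts, which is only viable for toy examples and is not how the analyzer described here works.

```python
from itertools import product

# Hypothetical context graph (an assumption for illustration):
#   v+ -> A -> H,  H -> B -> H (back edge),  H -> X -> v-
weights = {"A": 5, "H": 2, "B": 10, "X": 3}   # omega_max per node, made-up cycles
edges = [("v+", "A"), ("A", "H"), ("H", "B"),
         ("B", "H"), ("H", "X"), ("X", "v-")]
back_edges = [("B", "H")]      # E^l_back of the single loop with head H
entry_edges = [("A", "H")]     # edges entering the loop head from outside
LOOP_BOUND_MAX = 4             # B^l_max, a made-up flow fact


def node_count(x, v):
    """Return x_v if inflow equals outflow at v (Eq. 4.42), else None."""
    inflow = sum(c for (_, t), c in x.items() if t == v)
    outflow = sum(c for (s, _), c in x.items() if s == v)
    return inflow if inflow == outflow else None


def ipet_bruteforce(max_count=5):
    """Enumerate edge counts and return (best objective value, node counts)."""
    best = None
    for counts in product(range(max_count + 1), repeat=len(edges)):
        x = dict(zip(edges, counts))
        # Eq. 4.43 and 4.44: exactly one unit of flow leaves v+ and enters v-.
        if x[("v+", "A")] != 1 or x[("X", "v-")] != 1:
            continue
        xs = {v: node_count(x, v) for v in weights}
        if any(c is None for c in xs.values()):
            continue
        # Eq. 4.46: back-edge usage is bounded by the loop entries.
        if sum(x[e] for e in back_edges) > LOOP_BOUND_MAX * sum(x[e] for e in entry_edges):
            continue
        value = sum(weights[v] * xs[v] for v in weights)   # Eq. 4.41
        if best is None or value > best[0]:
            best = (value, xs)
    return best


if __name__ == "__main__":
    print(ipet_bruteforce())
```

In this toy instance the loop body B is taken as often as the bound permits, so the loop head H is visited once more than B. Without the loop bound constraint the objective would grow with `max_count`, which mirrors the remark that Equation 4.46 is needed to keep the WCET finite.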
4.5 Evaluation
We evaluated the analysis framework described in the previous sections on a subset
of the following benchmark suites:
• the MRTC benchmarks [Mäl05] (various embedded benchmarks, collected
speciﬁcally for WCET analysis)
• the UTDSP benchmark suite [LCS92] (DSP kernels and applications)
• the DSPStone Benchmarks [ZVS+94] (DSP-oriented benchmarks)
• the MediaBench collection [LPM97] (multimedia and communications)
• the MiBench suite [GRE+01] (commercially representative embedded bench-
marks)
• the NetBench suite [MMH01] (network processor benchmarks)
• the PolyBench/C suite [Pou12] (static control calculations)
• the StreamIt benchmarks [Str14] (streaming applications)
Not all of the benchmarks from the individual suites could be integrated, since
they must be prepared for WCET analysis through determination and annotation of
ﬂow facts, replacement of dynamic memory allocation (if any) with a static allocation
and possibly by ﬁxing compiler incompatibilities, which can be quite time-consuming
as already mentioned in the case studies from Section 2.2.7. A detailed list of
all benchmarks and their respective properties can be found in Appendix A. All
benchmarks were compiled with optimization level O0, which triggers a compilation
without further machine code optimization.
[Figure: bar chart of $WCET^{UC}/ACET^{UC}$ (y-axis, 0% to 200%) per benchmark (x-axis), with one bar each for WCC and aiT.]
Figure 4.11: WCET performance of the WCC-internal single-core analysis framework
for a system without cache usage (superscript UC).
Column  Ratio or value                                               Average
1       $WCET^{UC}_{WCC} / WCET^{UC}_{aiT}$                           104.46%
2       $AnalysisDuration^{UC}_{WCC} / AnalysisDuration^{UC}_{aiT}$   37.22%
3       $AnalysisDuration^{UC}_{WCC}$                                 1.59s
4       $WCET^{UC}_{WCC} / ACET^{UC}$                                 163.13%
5       $WCET^{UC}_{aiT} / ACET^{UC}$                                 156.17%
Table 4.1: Average results for the analysis of uncached execution of a single task.
We cannot determine the $WCET_{real}$ to compare our WCET results against it.
Instead, we measure the ACET, and with $ACET \leq WCET_{real}$ the true overestimation
$O_{real}$ of our analysis can be bounded as follows:

\[ O_{real} = \frac{WCET^{ann}_{est}}{WCET_{real}} \leq \frac{WCET^{ann}_{est}}{ACET} = O_{max} \tag{4.49} \]
Therefore, we always show $O_{max}$, i.e., the maximum overestimation, when discussing
WCET results. This is a dimensionless number by definition, which we may present
either as a percentage or as an absolute value where 1 represents 100%.
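
As a small worked instance of Equation 4.49, assume purely hypothetical values of $WCET^{ann}_{est} = 1500$ cycles, a measured $ACET = 1000$ cycles and an (in practice unknown) $WCET_{real} = 1200$ cycles. These numbers are not taken from the evaluation; they only illustrate why the measurable ratio is a safe upper bound on the true overestimation:

\[ O_{max} = \frac{1500}{1000} = 1.5 \; (150\%), \qquad O_{real} = \frac{1500}{1200} = 1.25 \; (125\%) \leq O_{max} \]

The closer the measured ACET comes to the real WCET, the tighter this bound becomes.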
The maximum overestimation of our WCET analysis framework for the case
of a system with deactivated caches is shown in Figure 4.11 for a representative
subset of benchmarks. To disambiguate the results from the WCET and ACET for
a system with activated caches we use the superscript “UC”. For most benchmarks,
the WCC-internal analyzer is almost as precise as the commercial analyzer
aiT; on average⁴ its WCET estimates are only 4.46% higher than those of aiT, as shown in the first column
of Table 4.1. Both analyzers exploit the fact that the single-core ARM7TDMI
platform is free of timing anomalies (cf. Section 2.2.8) to speed up the analysis
by following the local worst-case only during the microarchitectural analysis. The
compilation and analysis with the WCC-internal analyzer takes 1.59s on average
which is only 37.22% of the time that is needed for an analysis with aiT. Since the
communication of the WCC and aiT is done via ﬁles (cf. Figure 3.1) whereas the
WCC-internal analyzer uses only in-memory structures, this comparison is biased
against aiT. Still, it shows that the analysis speed is competitive, although we do not
claim superiority over aiT here. The overestimation compared to the measured
execution time is 63.13% for the WCC and 56.17% for aiT, as shown in the last two
columns of Table 4.1. This overestimation mainly stems from
• Imprecise results of the value analysis. Especially when the target of an indi-
rect memory access cannot be determined precisely, the analysis must assume
an access to the memory module with the highest access duration (Flash with
5 cycles compared to 1 cycle for the scratchpads). If this happens in a loop,
it can drastically impair the analysis precision.
• Imprecise ﬂow facts for loops with variable iteration counts. For those loops
we only speciﬁed minimum and maximum bounds. More precise results could
potentially be obtained with the usage of custom ﬂow restrictions.
The slight advantage of aiT can be explained with the ﬁrst point since aiT’s value
analyses are more precise when pointer computation or array accesses are involved.
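
The averages above are geometric means of relative numbers. The following sketch, using two invented benchmarks and two hypothetical analyzers A and B, illustrates the kind of contradiction that arithmetic means of ratios can produce depending on the comparison base. The data is made up and only serves to show the effect.

```python
import math


def arithmetic_mean(values):
    """Plain arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean, computed via logarithms for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up WCET estimates (in cycles) of two hypothetical analyzers.
wcet_a = [10.0, 20.0]
wcet_b = [20.0, 10.0]

a_over_b = [a / b for a, b in zip(wcet_a, wcet_b)]   # A relative to B
b_over_a = [b / a for a, b in zip(wcet_a, wcet_b)]   # B relative to A

if __name__ == "__main__":
    # Arithmetic means: each analyzer appears worse than the other one.
    print(arithmetic_mean(a_over_b), arithmetic_mean(b_over_a))
    # Geometric means: both directions agree that the analyzers are on par.
    print(geometric_mean(a_over_b), geometric_mean(b_over_a))
```

With the arithmetic mean, both relative averages exceed 1, so each analyzer seems to be worse than the other. The geometric mean of one direction is always the reciprocal of the other direction, so the conclusion does not depend on the chosen base.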
As a second scenario, we also analyzed the benchmarks for a system with acti-
vated instruction cache, i.e., the benchmark code is cached but not the data objects.
Since the benchmarks vary in size, the instruction cache size was adjusted to match
50% of the benchmark’s code size, rounded to the closest power of two. In this
scenario, we compare the analysis precision of the WCC with activated cache analysis
($WCET^C_{WCC}$) with the results of an analysis that assumes all cache accesses
to be hits ($WCET^{C,AH}_{WCC}$) or misses ($WCET^{C,AM}_{WCC}$). In Figure 4.12 we can see that
the cache analysis outperforms the pessimistic analysis $WCET^{C,AM}_{WCC}$ by far in most
cases. As expected, the optimistic analysis $WCET^{C,AH}_{WCC}$ does not provide safe WCET
estimates.
The overestimation however is still considerably higher than in the uncached
case. Since we are analyzing an instruction cache here and the ARM7TDMI has a
very predictable fetch behavior, the analysis of fetch operations is not the problem.
What is a problem for the cache analysis is the fact that
• the platform is a Von-Neumann architecture, i.e., data and instruction accesses
may target the same memory modules,
⁴ Here and in the following, we exclusively use the geometric mean for relative numbers (like
relative WCETs), since arithmetic means of relative numbers can lead to invalid and even
contradictory conclusions depending on the selected comparison base [FW86].
[Figure: plot of $WCET^C/ACET^C$ (y-axis, logarithmic scale) per benchmark (x-axis), with the series $WCET^{C,AH}_{WCC}$ (Always-Hit), $WCET^C_{WCC}$ and $WCET^{C,AM}_{WCC}$ (Always-Miss).]
Figure 4.12: Single-core WCET results for a system with activated cache (superscript C).
• the ARM compiler actually uses this and embeds data items into the instruc-
tion stream for a better performance,
• the value analysis in its current form often cannot determine the targets of
indirect loads and stores especially for pointer arithmetic and array accesses.
Therefore, the cache analysis has to update the instruction cache state also in the
case of loads or stores to unknown (data) targets. Depending on the number of
these accesses, this degrades the cache state precision. Since every
miss triggers a cache line refill (cf. Figure 4.10), the effective penalty for a
miss is 1 cycle + (32 B / 4 B) ⋅ 3 cycles = 25 cycles (see the parameter list in Table 3.1);
thus, a decrease in cache state precision directly translates into a decrease in WCET
precision.
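
The miss penalty arithmetic above can be written as a small helper. This is only a sketch: the parameter names are our own, and reading the formula as a fixed 1-cycle cost plus 3 cycles per 4-byte transfer of a 32-byte line is an assumption about how the terms map to the parameters of Table 3.1.

```python
def miss_penalty_cycles(line_size_bytes=32, transfer_bytes=4,
                        cycles_per_transfer=3, fixed_cycles=1):
    """Cycles needed to refill one cache line after a miss.

    All parameter names are hypothetical; the defaults reproduce the
    numbers used in the text: 1 + (32 / 4) * 3 = 25 cycles.
    """
    transfers = line_size_bytes // transfer_bytes
    return fixed_cycles + transfers * cycles_per_transfer


def extra_cycles_for_unclassified(accesses, **params):
    """Additional cycles if `accesses` fetches must be assumed to miss."""
    return accesses * miss_penalty_cycles(**params)


if __name__ == "__main__":
    print(miss_penalty_cycles())
    print(extra_cycles_for_unclassified(100))
```

The second helper indicates why imprecision inside loops is costly: every fetch that can no longer be classified as a hit contributes the full refill penalty to the estimate.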
The average results for the cached case are shown in Table 4.2. aiT for the
ARM7TDMI also features a cache analysis, which can be conﬁgured to some extent.
However, it does not model precisely the same cache settings as our analyzer. Only
the line size, cache size and associativity are adjustable in aiT, but none of the
other parameters from Table 3.1. Still, the WCET results are roughly compara-
ble, therefore we also included the average result for aiT in Table 4.2. Since aiT
employs a superior value analysis it is clearly better here, leading to a maximum
overestimation which is 2.26 times as good as the one of the WCC and 8.16 times
better than the pessimistic analysis, respectively. However, Table 4.2 also indicates
that $WCET^C_{WCC}$ is 3.56 times better than $WCET^{C,AM}_{WCC}$ and that the analysis takes
an acceptable 17.52% more time than for the uncached architecture.

Column  Ratio                                                       Average
1       $WCET^{C,AH}_{WCC} / ACET^C$                                 88.02%
2       $WCET^C_{WCC} / ACET^C$                                      645.17%
3       $WCET^{C,AM}_{WCC} / ACET^C$                                 2,318.00%
4       $WCET^C_{aiT} / ACET^C$                                      284.58%
5       $AnalysisDuration^C_{WCC} / AnalysisDuration^{UC}_{WCC}$     117.52%
Table 4.2: Average results for the analysis of cached execution of a single task.
By manual annotation of register value ranges and ﬂow restrictions the precision
of the WCETs can be drastically improved in both scenarios. Since this is a quite
time-consuming job, which requires a detailed review of each benchmark, this was
not pursued here. The precision of the single-core analysis is sufficient to demonstrate
the multi-core analysis techniques that will be discussed in the next chapter,
even without such further annotations.
Chapter 5
Multi-Core WCET Analysis
Contents
5.1 Introduction
5.2 Multi-Core Challenges
    5.2.1 Shared Caches
    5.2.2 Shared Interconnection Structures
5.3 Related Work
    5.3.1 WCET Analysis Approaches for Multi-Cores
    5.3.2 WCET-friendly Multi-Core Architecture Design
5.4 Partitioned Multi-Core WCET Analysis
    5.4.1 Shared Cache Handling
    5.4.2 Shared Bus Analysis Preliminaries
    5.4.3 Basic Bus Domains
    5.4.4 Loop Unrolling
    5.4.5 Offset Contexts
    5.4.6 Offset Relocation
    5.4.7 Timing-Anomaly-Free Analysis
    5.4.8 Evaluation
5.5 Unified WCET Analysis for Complex Multi-Cores
    5.5.1 Related Work
    5.5.2 Task Model
    5.5.3 Motivating Example
    5.5.4 Prerequisites
    5.5.5 Parallel Execution Graph Construction
    5.5.6 Parallel System States
    5.5.7 Correctness
    5.5.8 Extensions
    5.5.9 Evaluation
5.6 Summary
5.1 Introduction
In this chapter, we will ﬁrst investigate which challenges are posed to WCET anal-
ysis due to the advent of multi-core architectures. We will then review existing
approaches to these challenges and ﬁnally present two diﬀerent multi-core WCET
analysis algorithms which build upon the single-core analysis as presented in the
previous chapter.
5.2 Multi-Core Challenges
The memory hierarchy has a signiﬁcant impact on the average-case performance of a
system. Measurements show that modern CPUs spend 90% of their time waiting for
memory in real-life benchmarks [Jac09, Chapter 2]. Since also the speed diﬀerence
between the memory levels can amount to orders of magnitude [PH11, Chapter 2.1],
each access may have a considerable impact on the resulting ACET. As already seen
in the evaluation in Section 4.5, the impact on the WCET is even higher since it
needs to account for every reachable hardware state. If infeasible accesses cannot
be excluded, they will thus degrade the WCET precision.
In a multi-core system, each core’s pipeline can be analyzed in isolation, but
the memory hierarchy state cannot, since the cores typically share some part of
the memory hierarchy, e.g., in the form of shared caches, shared buses and shared
memory. Thus, a task τ1 running on core ci may be able to impair the timing of
task τ2 running on core cj ≠ ci. This complicates the WCET analysis, since it has
to account for all possible interference behaviors and still has to be as precise as
possible. A good overview on how the predictability of a system suﬀers from shared
resources is given in [ABD+13]. The authors distinguish between
• bandwidth resources, mainly shared buses and
• storage resources, mainly shared caches.
We will review the problems that arise due to these two resource types in the
following. Shared memory in the form of S-RAM is usually not a problem, since it
has an approximately constant access time. For D-RAM memory the challenges are
the same as in the single-core case, namely that the periodic refresh can hardly be
modeled in the system state [BM11]. Therefore we limit ourselves to S-RAM in this
work.
5.2.1 Shared Caches
The “memory wall” problem [Mar11, Chapter 3.4] that the speed of the CPU is
increasing faster than the speed of RAM is even more pressing in the multi-core
case than in the single-core one, since the requests from more than one CPU must
be served by the slow memory. Cache hierarchies therefore are a critical factor for
average-case multi-core performance [BJM11].
As we have seen in Section 4.3.2 the static analysis of cache behavior must ap-
proximate all possible access sequences with which the current context block position
may be reached. In the case of a shared cache these sequences are no longer limited
to accesses from the current task but they can also contain accesses from other tasks.
The classical single-CFG-based data-ﬂow analysis cannot infer anything about their
location in the sequence, thus we only have the option to either lose precision and
provide rough over-approximations, or to develop new analysis approaches. We will
discuss both of these approaches in the following.
As an additional complication, private caches are sometimes also kept coherent
[SHW11] by a hardware protocol. In this case, changes to one private cache
will propagate to other private caches. Eﬀectively, this turns every coherent cache
into a shared cache, which can be handled by similar methods as shared caches.
Still, the mechanism used to implement the coherency is more complex in both its
microarchitectural implementation and its timing behavior. Since the analysis of
shared caches alone is diﬃcult already, this additional complication makes it very
close to impossible to provide precise WCET estimates for cache-coherent systems.
We will therefore assume absence of cache coherency mechanisms in the following.
5.2.2 Shared Interconnection Structures
Before any shared resource, like a shared cache, can be accessed by multiple cores,
there must be a shared interconnection structure which allows this type of access. An
example for this is the shared bus in Figure 3.4. The target module, be it a shared
cache or a shared memory, can only be accessed by one core at a time, therefore the
interconnection structure must provide an arbitration to resolve potential conﬂicts.
Compared to the shared cache analysis, this is a more elementary problem, since
the timing analysis of the employed arbitration method aﬀects every access, also
for systems without shared caches. We will therefore focus on the analysis of the
arbitration timing behavior in the following discussions.
Interconnection structures are generally subdivided into interconnection networks
and buses. In accordance with established terminology, devices which can
initiate communication requests are called masters (e.g., core pipelines, caches)
whereas those which cannot are called slaves (e.g., memory).
Interconnection Networks
Interconnection Networks (ICNs), also called switched-media on-chip networks [PH11,
Appendix F], are composed of a number of point-to-point links and switches which
route packets through the network. The most prominent and widely-used example
for such an on-chip network is PCI-Express, which is installed in virtually every
modern PC.
In general, the timing behavior of a transmission through an ICN depends on
the collisions that may occur in the network due to other transmissions which want
to access the same or a diﬀerent slave node. For cases where n masters each issue a
communication request to one of n diﬀerent slaves and no slave is the target of more
than one request, there exist ICN topologies which are strictly collision-free [SJ96],
i.e., all of these requests can be satisﬁed in constant time. Unfortunately, in WCET
analysis we often cannot exclude the case that multiple masters (cores) may access
the same slave (memory module). In such a case, only one access at a time can be
granted which leads to an inevitable amount of blocking.
These conﬂicts can be elided for read accesses which want to read the same
memory location as for example in the logarithmic Hypercore network [BHQ+07;
Bay08; Plu10], but again we may be unable to prove that deﬁnitely the same address
is accessed.
From a timing perspective, ICNs potentially lead to higher average-case perfor-
mance, but their complex state space complicates WCET analysis. The methods
that we are going to use for the analysis of buses, i.e., abstract state machines and
abstract arbitration functions, are also applicable to ICNs but the modeling eﬀort
increases whereas the WCET precision will drop. In the important case of time-
triggered ICNs which oﬀer deterministic transmission times for real-time systems,
the methods used for buses are even immediately applicable without changes or
precision loss. Therefore, we focus on shared buses in the following.
Buses
As shown in Figure 3.4, a bus is a shared medium which connects multiple masters
and slaves, but only a single physical signal can be transmitted over the bus at each
point in time. Therefore, if multiple masters want to use the same bus, they must
share it by either
• Frequency-Division Multiplexing (FDM)
• Code-Division Multiplexing (CDM)
• Time-Division Multiplexing (TDM)
Frequency-Division Multiplexing (FDM) lets multiple masters transmit their data
in parallel, but physically modulated to diﬀerent frequency bands. This limits the
bandwidth and thus the transmission speed that is available to each master. The
author is not aware of any implementation of a bus using FDM. This technique is
successfully used to multiplex signals over telephone lines carrying analog telephony,
ISDN and DSL signals in diﬀerent frequency bands, but apparently the miniatur-
ization to chip-level buses poses electro-technical problems.
Code-Division Multiplexing (CDM) is similar to FDM in that multiple masters
transmit their signal at the same time. The diﬀerence is, that the signals are not
modulated purely in the frequency domain. They undergo a convolution with a
predeﬁned code signal. These code signals must be deﬁned in such a way that the
original signals can be extracted even after the signals from multiple masters have
been mixed with each other. Though some approaches to CDM on the chip-level
exist [BKJ+01], they have the same conceptual disadvantage as FDM, namely that
each CDM channel has a limited bandwidth and transmission speed. Overall, this
is still a niche topic, which does not seem relevant at the moment.
In Time-Division Multiplexing (TDM) only a single core may use the bus at any
time, i.e., if multiple masters want to use the bus the order in which their accesses
are granted must be determined by an arbiter. This method is widely employed in
digital communication technology as, e.g., in GSM and ISDN and in most computer
buses like USB, PCI and ISA.
TDM is usually implemented with non-preemptable accesses [SJ96, Section 3.1.3],
since preemptions would require buffering or abortion of accesses, which complicates
the hardware or decreases the performance. Therefore, we also assume
non-preemptable accesses in the following. We also assume a maximum access time $T^B_{max}$
that specifies the maximum duration for accesses to a device behind the shared bus
$B$. For PCI, for example, this value can be computed from the device configuration
registers [Tec14].
Three major categories of arbiters are employed:
Fair arbitration. Also called round-robin, this arbiter rotates the bus access
among all masters. It maintains an active master $m_a \in \{0, \ldots, n_c - 1\}$. When an
access finishes, $m_a$ is cyclically advanced to the next master which requests the bus.
Thus, each master can acquire the bus after at most $n_c - 1$ others have performed
their accesses.
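
The rotation rule can be sketched in a few lines. The function names are our own, and the delay bound additionally assumes that every access takes at most the maximum access time introduced above; it is an illustration, not the abstract arbitration model developed later in this chapter.

```python
def next_master(active, requests, n_masters):
    """Cyclically advance from `active` to the next requesting master.

    `requests` is the set of master indices that currently request the bus.
    Returns None if nobody requests the bus.
    """
    for step in range(1, n_masters + 1):
        candidate = (active + step) % n_masters
        if candidate in requests:
            return candidate
    return None


def worst_case_wait(n_masters, t_max):
    """Upper bound on the waiting time of one request in cycles.

    Assumes at most n_masters - 1 other accesses are served first and
    each of them lasts at most t_max cycles.
    """
    return (n_masters - 1) * t_max


if __name__ == "__main__":
    print(next_master(0, {0, 2}, 4))
    print(worst_case_wait(4, 5))
```

Note that the active master itself is only considered last in the rotation, which is what makes the scheme fair.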
In the case of our reference architecture from Figure 3.4, each bus connection
may be a master, i.e., from the point of view of the shared bus, a core and its L1
caches are the same master, though they may issue requests independently. Access
collisions on the local core bus cannot occur though, since the ARM7 blocks upon
a cache miss.
Practical applications of this scheme include ARM AMBA, the PCI [Alt00]
bus and PROFIBUS [WB05, Sec. 4.4.2]. Under AMBA and PCI, the arbitration
protocol is conﬁgurable but includes fair arbitration. In the case of PCI, a timeout
is imposed to enforce a predeﬁned maximum access time [Buc00, Section 4.3].
(Static) priority-based arbitration. Here, a unique priority $p_i \in \{1, \ldots, n_c\}$ is
assigned to each master $m_i$. If there are multiple requests, only the request from the
master with the highest priority is granted. Nevertheless, since accesses are
non-interruptible, even the highest-priority master may have to wait until an ongoing
transaction is completed.
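
A minimal sketch of this arbitration rule follows. It assumes that a numerically larger value means a higher priority, which the text does not fix, and all names are hypothetical.

```python
def grant_by_priority(requests, priority):
    """Pick the requesting master with the highest priority.

    `requests` is a set of master indices, `priority` maps each master to
    its unique priority (assumption: larger value wins).
    """
    if not requests:
        return None
    return max(requests, key=lambda master: priority[master])


def blocking_of_highest_priority(t_max):
    """Worst-case blocking of the highest-priority master in cycles.

    It can only be delayed by one ongoing, non-interruptible access,
    which lasts at most t_max cycles.
    """
    return t_max


if __name__ == "__main__":
    print(grant_by_priority({0, 2}, {0: 1, 1: 3, 2: 2}))
```

For all other masters no such simple bound exists without knowledge about the request behavior of the higher-priority masters.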
Again, AMBA and PCI [Tec14] can be conﬁgured to use priority-driven arbi-
tration. Another well-known application is the CAN bus [WB05, Sec. 4.4.3] though
this is typically not used as an on-chip bus.
Time-triggered arbitration. The basic notion that time-triggered operation contributes
to timing-predictability is evident through the large body of work invested
into the Time-Triggered Architecture [KB03] and into time-triggered scheduling
[Liu00, Chapter 6]. In a time-triggered arbitration scheme, a centralized arbiter
must assign the bus based on the current time, i.e., the number of passed bus cycles.
Alternatively, a distributed implementation is possible where synchronized,
distributed clocks [Lam78] are used to control local bus guards.
Time-Division Multiple Access (TDMA) creates a schedule consisting of $n_l$ slots
of size $l_s$ and assigns an owner master $o_i \in \{1, \ldots, n_c\}$ to each slot. The current
position in the schedule is determined by taking the current clock tick modulo $n_l l_s$.
In each slot $i$, only the owner is granted access to the bus and only in the interval
$[i l_s, \ldots, (i + 1) l_s - T^B_{max}]$. The subtraction of $T^B_{max}$ is necessary to make sure that
accesses complete before the next slot begins.
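
The TDMA grant rule can be illustrated with the following sketch. The function and parameter names are our own, masters are simply identified by the values stored in the owner list, and the waiting time is found by naive cycle-wise search rather than by the offset-based analysis developed later.

```python
def tdma_granted(t, master, owners, slot_len, t_max):
    """True if `master` may start an access at clock tick `t`.

    `owners[i]` is the owner of slot i, `slot_len` is l_s and `t_max`
    is the maximum access time T^B_max.
    """
    period = len(owners) * slot_len          # n_l * l_s
    pos = t % period                         # position in the schedule
    slot = pos // slot_len
    offset = pos - slot * slot_len           # position inside the slot
    return owners[slot] == master and offset <= slot_len - t_max


def tdma_wait(t, master, owners, slot_len, t_max):
    """Cycles a request issued at tick `t` waits until it is granted.

    Returns None if the master can never be granted, e.g. if it owns no slot.
    """
    period = len(owners) * slot_len
    for delay in range(period + 1):
        if tdma_granted(t + delay, master, owners, slot_len, t_max):
            return delay
    return None


if __name__ == "__main__":
    owners = [0, 1]                          # two slots, hypothetical schedule
    for tick in (0, 6, 7):
        print(tick, tdma_wait(tick, 0, owners, slot_len=10, t_max=4))
```

The example shows the effect of subtracting the maximum access time: a request that arrives shortly before the end of its own slot can no longer be started and has to wait for the owner's next slot, even though the slot is still active.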
Priority Division (PD) [SRK11] is a generalization of TDMA. Instead of assigning
an owner $o_i$, it assigns unique priorities $p_{ij} \in \{0, 1, \ldots, n_c\}$ for each slot $i$ and
each master $j$. For each slot $i$, the bus is granted to the requesting master with the
highest positive priority, but again only in the time frame $[i l_s, \ldots, (i + 1) l_s - T^B_{max}]$.
Only those masters with priority 0 are excluded from arbitration, which can be used
to emulate TDMA behavior.
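
As a rough sketch of the PD rule, the following function selects the master to be granted at a given clock tick. The data layout (a list of per-slot priority lists) and all names are assumptions made for this illustration.

```python
def pd_grant(t, requests, priorities, slot_len, t_max):
    """Return the master granted at tick `t` under Priority Division.

    `priorities[i][j]` is the priority of master j in slot i, where 0
    excludes master j from arbitration in that slot. `requests` is the
    set of masters currently requesting the bus.
    """
    period = len(priorities) * slot_len
    pos = t % period
    slot = pos // slot_len
    if pos - slot * slot_len > slot_len - t_max:
        return None                          # too late to start an access
    eligible = [m for m in requests if priorities[slot][m] > 0]
    if not eligible:
        return None
    return max(eligible, key=lambda m: priorities[slot][m])


if __name__ == "__main__":
    # Slot 0 prefers master 0 but lets master 1 use it when idle;
    # slot 1 is reserved for master 1 only (TDMA-like).
    priorities = [[2, 1], [0, 1]]
    print(pd_grant(0, {0, 1}, priorities, slot_len=10, t_max=4))
    print(pd_grant(0, {1}, priorities, slot_len=10, t_max=4))
```

In contrast to plain TDMA, a slot that is not used by its preferred master can be taken over by a lower-priority master, unless that master has priority 0 in the slot.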
Real-life implementations of TDMA include the Time-Triggered Proto-
col [PK08], Aethereal [GH10; HG11], FlexRay [Mar11, Section 3.5.4] and
PROFInet/IRT [WB05, Section 4.4.11]. Technically, Aethereal is not a bus
but an ICN which is based on a design-time allocation of time-triggered schedules
and communication paths for each master. Therefore, from the analysis perspective
it can be seen as a generalization of a TDMA-scheduled bus, with the diﬀerence
that each core is presented with an individual access schedule.
5.3 Related Work
In the following we review existing approaches to the analysis of shared resources in
multi-cores. The related work can be partitioned into works which try to analyze ex-
isting hardware and those which propose new hardware structures to accommodate
the needs of WCET analysis.
5.3.1 WCET Analysis Approaches for Multi-Cores
As mentioned in Section 2.2.9, the behavior of each component can be analyzed
separately in a (γ,α)-compositional timing analysis. Unfortunately, as we have
seen in Section 4.3, the microarchitectural analysis in general needs to maintain
a product state of all hardware component states to achieve reasonable precision.
Therefore, a separate analysis of individual hardware components, i.e., a (1,0)-
compositional analysis, incurs a loss of precision in general.
Approaches which can exclusively analyze shared caches or buses and which do
not integrate into a general WCET analysis framework as shown in Chapter 4
are therefore of limited use. They often assume that some constant penalty can be
added or subtracted for each additional miss or hit or bus event, which is only true
for hard-to-obtain (1,0)-compositional WCET analysis values.
Separate Analysis of Shared Caches
The ﬁrst attempts that focused on exclusively analyzing the shared cache behavior
are [YZ08; ZY09] where an “ad-hoc” method for analyzing the possible interference
is given. The method can be used to bound the cache interference but it does not
integrate well with any of the known WCET analysis concepts.
An integration of sharing eﬀects into the single-task cache analysis as described in
Section 4.3.2 was given by Li et al. [LSL+09]. There, a summary of the possible cache
eﬀects that all other concurrent tasks in the system may have is computed. Based
on this summary, the block age map is updated and the classiﬁcation of accesses is
altered to “unknown” if the respective access may suﬀer from interference with other
cores. Eﬀectively, this is still the only option that was found up to now, namely
to precompute a worst-case summary of all possibly concurrently ongoing cache
actions and apply this worst-case summary to the local cache results. This concept
was also taken over into the integrated multi-core analysis frameworks in [KFM+14;
CCR+14] and is a baseline for this thesis.
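
The idea of applying a precomputed worst-case interference summary to the local cache results can be sketched as follows. This is a strongly simplified illustration under our own assumptions (an LRU cache, upper age bounds starting at 0, and a summary that is just the set of memory blocks other cores may load); it is not the exact procedure of [LSL+09], and all names are hypothetical.

```python
def apply_interference(must_ages, cache_set_of, interfering_blocks, associativity):
    """Shift local age bounds by the worst-case inter-core interference.

    must_ages:          block -> upper age bound from the single-task analysis
    cache_set_of:       function mapping a block to its cache set index
    interfering_blocks: distinct blocks that other cores may bring into the cache
    Returns block -> (shifted age bound, classification).
    """
    conflicts = {}
    for block in interfering_blocks:
        index = cache_set_of(block)
        conflicts[index] = conflicts.get(index, 0) + 1

    result = {}
    for block, age in must_ages.items():
        shifted = age + conflicts.get(cache_set_of(block), 0)
        # A block whose shifted age may exceed the associativity can no
        # longer be guaranteed to reside in the cache.
        verdict = "always-hit" if shifted < associativity else "unknown"
        result[block] = (shifted, verdict)
    return result


if __name__ == "__main__":
    set_index = lambda block: block % 4      # 4 cache sets, made-up mapping
    local = {0: 0, 1: 0, 2: 1}               # made-up local age bounds
    print(apply_interference(local, set_index, {4, 8, 5}, associativity=2))
```

Because the summary ignores when the interfering accesses actually happen, every potentially conflicting block is charged to every local access, which is the source of the pessimism discussed above.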
To account for systems where tasks may migrate between cores, Hardy [HP09]
generalized the CRPD concept to a cache-related migration delay which can also
be statically computed. In addition, Lesage, Hardy and Puaut propose a cache-
bypassing heuristic in [LHP10] to reduce the amount of shared cache interference.
Separate Analysis of Shared Buses
For shared buses, some approaches compute the maximum number and duration of
bus accesses independently of the task WCET analysis and add the extra delay to
the task WCETs later [AEL10; DAN+11].
A very similar family of approaches is the usage of the Real-Time Calculus (RTC)
for deriving the worst-case duration of the shared memory accesses which is later
added to the total task WCET or WCRT. A prominent example of this approach
is given by Schliecker et al. for plain shared memory [SNN+08] and for resources
locked under the global priority ceiling protocol [NSE09]. Pellizzoni et al. detail
how to apply the RTC-based approach to COTS hardware [PSC+10] and how to
isolate aperiodic events by usage of a dedicated hardware server device [PC10].
Schranzhofer, Chen and Thiele [SCT09] employ the RTC-based approaches to
assess various access models which restrict when, inside the task, shared resource
accesses can be made. They find that tasks which fit into a read-execute-write phase
model produce especially low WCETs. This approach was later reﬁned towards
more precise WCRT results for TDMA arbitration [SCT10] and towards integration
of schedulability results [SPC+10].
All of these approaches require (1,0)-compositional WCET analyses to keep up
the validity of the generated WCET as mentioned above. Therefore, they are more
suitable for soft real-time applications on commercial off-the-shelf (COTS) hardware, where
precision and analysis speed are more important than WCET safeness.
Integrated Analysis of Shared Resources
The ﬁrst attempt to combine the cache analysis of [LSL+09] with a handling of the
bus accesses was made in [CRM10]. A loop alignment is used to analyze accesses to
the TDMA bus from inside of loops in the tasks. This concept was carried further
in [KFM+11] where a static analysis of TDMA bus hardware states was proposed.
Finally, [CKR+12] extended this approach by a graph-based pipeline handling taken
from [LRM06] and the correctness of the loop alignment was proven in [KFM+14].
These publications are the basis for the content of Section 5.4.
A diﬀerent approach, which combines an abstract interpretation-based cache
analysis with model checking-based bus analysis was presented by Lv [LYG+10].
The reported analysis times are better than for a purely model-checking-based ap-
proach [GEL+10], but for more complex pipelines the same problems will be per-
ceived as in the purely model-checking case [Wil04].
A topic that is not taken into account by any of the above approaches is the
synchronization structure of the program given by “lock”, “release” and “barrier”
operations. Gustavsson employed the model-checker UPPAAL to generate synchro-
nization-aware multi-core WCET values [GEL+10], but the analysis takes multiple
hours for a system with a simple pipeline, no bus modeling and tasks which corre-
spond to 10 lines of C code. Therefore, this approach was continued on a high-level
code representation in [GGL12]. The latter approach is a purely theoretical one
without any evaluation or implementation.
Ozaktas, Rochange and Sainrat [ORS14] approach the analysis of synchronization
behavior from a different perspective. They devised a method of estimating
the worst-case stall time due to synchronization. Potop and Puaut proposed an
integrated source-level framework for incorporating synchronization inside of tasks
into the WCRT computation [PP13] which has the potential to scale far better
than [GEL+10].
Practical Experiences
Nowotsch and Paulitsch [NP13] demonstrate that predictability is also desirable
from the industry point of view, not only to ease WCET analysis but also to in-
crease the determinism in a multi-core system and with it the repeatability of exper-
iments on those systems. They also propose a method for subdividing the WCET
into a core-local part and an additive component caused by interference of other
cores [NPH+14]. As discussed above, this requires (1,0)-compositional WCET val-
ues.
Experiences on applying a WCET analysis on a benchmark executed on a pre-
dictable MERASA multicore are given by Rochange et al. [RBS+10]. The results
are formed manually by addition of single-core WCETs for the individual cores.
5.3.2 WCET-friendly Multi-Core Architecture Design
Many publications try to design and promote hardware which is more predictable
and more WCET-friendly than existing multi-cores. All of these approaches of
course require the respective custom hardware to be available and most of them
have only been implemented on an FPGA up to now.
Mische et al. [MUK+08; MGU+10] propose the CarCore architecture which
is built upon the Infineon TriCore. It contains exactly one hard real-time task
which is treated as if it was the only task in the system, i.e., all hardware components
must immediately service its requests and abort those of lower-priority tasks. In
this way, the WCET analysis from the single-core case can be re-used and multiple
non-real-time tasks can coexist with the hard real-time task without the ability to
interfere with its timing. Multiple hard real-time tasks can be aggregated into one
meta task which can itself host multiple sub-tasks by time-sharing [MUK+08].
In a follow-up work, Mische et al. [MMU11] discuss a many-core architecture
with predictable timing, which is achieved by highly predictable in-order cores and
a TDMA-arbitrated mesh network, such that network communication runtime and
task WCET can be analyzed independently and added up afterwards.
Paolieri et al. [PQC+09b] propose the integration of a worst-case computation
mode into the bus arbiter and later also into the memory controller [PQC+09a],
such that the WCET of individual basic blocks can be measured. Though the idea
seems appealing, it also comes at the price of a drastic WCET overestimation. A
later publication of the same author tries to alleviate this problem by not enforcing
the worst-case in hardware but by adding a watchdog module that can assert that
blocks of instructions are ﬁnished within a user-deﬁnable time frame [PM11]. This
does not yield safe WCET estimates but eases the veriﬁcation of the timing at least
for the test benchmarks.
Wilhelm et al. have given advice on the design of predictable multi-core systems
in terms of both hardware and software [WFC+09; CFG+10], which has
become known as the PROMPT principles (PRedictability Of MultiProcessor Timing).
They also give advice on how to configure a commodity system to be as predictable
as possible [KSP+12].
The PRET architecture [LRL10; LRB+12] oﬀers ﬁxed durations of every single
instruction and is thus highly predictable. Still, the attempt is made to also de-
liver an acceptable average-case performance by implementing a thread-interleaved
pipeline which executes multiple threads in ﬁne-grained alternation.
Pitter and Schoeberl [PS10] introduce the Java-Optimized Processor (JOP) and
analyze the WCET of programs running on it for diﬀerent bus arbitration strategies.
They also examine the performance loss that is incurred by tailoring the architecture
towards predictability instead of towards average-case performance.
Bui et al. present extensions of an instruction set that might be useful for WCET
analysis, including an instruction to set a deadline for certain code blocks [BLL+11].
XMOS is the ﬁrst industrial chip producer that has produced a timing-predictable
multi-core – the xCORE – whose timing behavior can easily be analyzed, and which
therefore is even delivered with a custom-built WCET analyzer tool [XMO13].
A new, more predictable cache coherence mechanism, the On-Demand Coherent
Cache is proposed by Pyka, Rohde and Uhrig [PRU13]. The basic idea is that
cache coherence is deactivated most of the time and only switched on selectively
for very small program regions. In this case, only these regions are aﬀected by the
overestimation related to cache coherence timing estimation.
To make multicores more predictable for a single high-priority thread, Gudide-
vuni and Zhang [GZ10] propose a priority-driven cache replacement policy, but this
is more measurement-oriented and limited to a single high-priority thread. From
the WCET analysis perspective, it does not yield much beneﬁt.
Finally, Münch, Paulitsch and Herkersdorf show that temporal separation, which
is advocated for in many of the WCET papers referenced above, can be achieved
also on commodity interconnects like PCI-Express [MPH14].
5.4 Partitioned Multi-Core WCET Analysis
In this section, we investigate the achievable precision for a multi-core WCET anal-
ysis which analyzes each core and each task in isolation. For each task, the analysis
structure shown in Figure 4.1 stays valid but the individual stages are modiﬁed such
that interferences by tasks on other cores are taken into account. This procedure is
motivated by the fact that
• the complexity of the resulting analysis is lower than in the case of an explo-
ration of the whole system’s state (state space cross product) and
• it allows a certiﬁcation of a single task without knowledge of the complete
system. This is often a necessity in modern development teams where diﬀerent
suppliers may independently produce software components.
The resulting structure is shown in Figure 5.1, where each τi,j denotes task number
j allocated to core i. Ti is again the set of tasks allocated to core i. For the analysis
of shared caches we will require some (static) information about the other tasks in
the system. Therefore, this introduces a (loose) coupling between the individual
core analyses. Since we only require the range of memory addresses that tasks on
other cores may access, this can also be resolved by a “contract” between the tasks
concerning the memory space usage.
We do not analyze the content of memory modules during the value analysis.
Therefore sharing is not a problem here. But even if we did, setting the value of
shared variables to ⊺ suﬃces to account for parallel modiﬁcations.
[Figure 5.1, diagram not reproduced. For each core 1, ..., nc, the LLIR of the core is
passed to the construction of the graphs G^C of its tasks and to the value analysis.
The results of all cores enter a common “Shared Cache Interference Analysis” stage.
Afterwards, the microarchitectural analysis and the path analysis are carried out
per task and yield the WCET and BCET of that task.]
Figure 5.1: Structure of the core- and task-partitioned WCET analysis with optional
cache interference analysis.
For the path analysis, explicit synchronization is a new control-flow element.
One possibility to account for it is to use manually determined bounds on the iteration
count of spin-locks, i.e., a fall-back to loop bounds. A different option is to incorporate
synchronization or communication edges into a parallel context graph, as done
in [PP13]. Unfortunately, this technique is limited to synchronization statements
outside of loops up to now since synchronization in a loop breaks the network-ﬂow-
based semantics of the IPET (cf. Section 4.4). We therefore do not handle explicit
synchronization but require the tasks to either
• bound it through traditional loop bounds or
• use implicit synchronization. In a system where tasks are time-triggered and
WCETs are known, mutual exclusion can often be veriﬁed through the timing
itself, thereby removing the need for runtime-intensive explicit synchroniza-
tion.
In the next subsections, we therefore focus on the handling of microarchitectural
interference in the shared cache and bus.
5.4.1 Shared Cache Handling
We adopt the technique from [LSL+09] by ﬁrst deriving the cache sets sets(τ) ⊆ S
that may be accessed by a task τ . For each set s ∈ sets(τ), the tags that may
be accessed by τ in this set are given by tags(τ, s). Once the value analysis has
been completed, these functions can easily be derived by inspecting the task code,
gathering all addresses that may be accessed by the instructions of the task and
computing their sets and tags.
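The derivation of sets(τ) and tags(τ, s) can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the function name `sets_and_tags`, the cache geometry (`line_size`, `num_sets`) and the example addresses are assumptions made for this sketch, and a simple modulo placement of memory blocks is assumed.

```python
from collections import defaultdict

def sets_and_tags(addresses, line_size, num_sets):
    """Map each possibly accessed address to its cache set and tag.

    line_size and num_sets describe a hypothetical cache geometry;
    both are assumptions of this sketch, not values from the text.
    """
    tags = defaultdict(set)          # set index -> tags possibly accessed there
    for addr in addresses:
        line = addr // line_size     # memory block (cache line) number
        s = line % num_sets          # set index under modulo placement
        t = line // num_sets         # tag of the block within that set
        tags[s].add(t)
    return dict(tags)

# Addresses that a value analysis might report for one task (made-up values).
task_tags = sets_and_tags([0x1000, 0x1004, 0x1010, 0x2000],
                          line_size=16, num_sets=4)
task_sets = set(task_tags)           # corresponds to sets(tau)
```

Here `task_tags[s]` plays the role of tags(τ, s) and `task_sets` the role of sets(τ).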
In addition, we need some notion of which tasks may run in parallel to a task
τ from core c, called par(τ, c). Conservatively, we ﬁrst assume par(τ, c) = ⋃x≠c Tx,
i.e., τ may run in parallel to all tasks from other cores. The computation of sets(τ),
tags(τ, s) and par(τ, c) constitutes the “Shared Cache Interference Analysis” stage
from Figure 5.1.
The “Microarchitectural Analysis” stage then uses the interference data to update
the single-core cache results. It must alter Equation 4.32 for classifying an access
to set s and tag t from the current task τ and core c to
\[
cls(q^C_s, t) =
\begin{cases}
\{HIT\} & \text{if } a^t_{max} < \hat{a} - \left| \bigcup_{\tau_p \in par(\tau, c)} tags(\tau_p, s) \right| \\
\{MISS\} & \text{if } a^t_{min} = \infty \wedge \left| \bigcup_{\tau_p \in par(\tau, c)} tags(\tau_p, s) \right| = 0 \\
\{HIT, MISS\} & \text{else}
\end{cases}
\tag{5.1}
\]
where $\hat{a}$ is again the associativity of the cache and $q^C_s(t) = [a^t_{min}, a^t_{max}]$. The rationale
behind this is that cache hits can still be guaranteed if there are not enough
potential accesses by other cores ($\bigcup_{\tau_p \in par(\tau, c)} tags(\tau_p, s)$) to evict the element from
the cache. Misses are only kept guaranteed if there is no other core which might
have loaded the element in parallel. Otherwise, the classification is degraded to
“unknown” ({HIT,MISS}) and the microarchitectural analysis has to explore both
possibilities.
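The case distinction of Equation 5.1 can be written down directly. The following is a minimal sketch; the function name `classify` and the encoding of an absent block by `math.inf` are assumptions of this sketch.

```python
import math

def classify(a_min, a_max, assoc, interfering_tags):
    """Classify a shared-cache access following the structure of Equation 5.1.

    a_min, a_max:      age bounds of the accessed tag from the single-core
                       analysis (math.inf encodes "not in the cache")
    assoc:             cache associativity
    interfering_tags:  union of tags(tau_p, s) over all tau_p in par(tau, c)
    """
    k = len(interfering_tags)
    if a_max < assoc - k:
        return {"HIT"}           # too few foreign tags to evict the block
    if a_min == math.inf and k == 0:
        return {"MISS"}          # nobody else can have loaded the block
    return {"HIT", "MISS"}       # "unknown": both cases must be explored
```

For a 4-way cache, an access with age bounds [0, 1] stays a guaranteed hit with one interfering tag, but degrades to “unknown” once three interfering tags map to the same set.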
Obviously, this is a rather coarse-grained estimation which will classify all shared
cache accesses as “unknown” once a certain number of parallel tasks is reached. In
Section 5.5, we will present a computationally intensive technique to generate more
precise results.
Algorithm 5 Refinement of the shared cache results for timing-anomaly-free architectures.
 1: Set ∀τ, c : par^0(τ, c) ← ⋃_{x≠c} T_x and i ← 0
 2: Compute hyperperiod p_H ← lcm(p_1, p_2, ..., p_n) with n = |⋃_{c∈{1,...,nc}} T_c|
 3: repeat
 4:   i ← i + 1
 5:   for τ ∈ ⋃_c T_c do
 6:     Compute WCET^i(τ) as shown in Figure 5.1
 7:   for c ∈ {1, ..., nc} do
 8:     Job set J_c ← {(t_j, τ_j) | τ_j ∈ T_c ∧ ∃k ∈ N : t_j = k p_j ∧ t_j < p_H}
 9:     Job sequence S_c ← sort(J_c) in ascending order of “spawn times” t_j
10:     for τ ∈ T_c do
11:       δ^i(τ) ← ∅
12:     for (t_j, τ_j) ∈ S_c do
13:       δ^i(τ_j) ← δ^i(τ_j) ∪ [t_j, max(t_{j−1} + WCET^i(τ_{j−1}), t_j) + WCET^i(τ_j)]
14:   Update ∀τ, c : par^i(τ, c) ← {τ_o | τ_o ∉ T_c ∧ δ^i(τ_o) ∩ δ^i(τ) ≠ ∅}
15: until ∀τ, c : par^i(τ, c) = par^{i−1}(τ, c)
16: return WCET^i
However, if the system under analysis is free of timing anomalies and we know
the period p_j for each task τ_j ∈ ⋃_c T_c, then we can refine the results through the
steps shown in Algorithm 5. In line 1 we initialize par with our conservative assumption
mentioned above. Line 2 computes the hyperperiod of the task set as the
least common multiple of the periods. The main loop then recomputes the task
WCETs (line 6) and determines the lifetime windows of the tasks from the periods
and the WCETs (line 13). Since a task τ_j may be blocked by a preceding task executing
on the same core, the job sequence S_c is first computed for each core c in line 9.
The maximization operator in line 13 accounts for possible blocking. Those tasks
whose lifetime windows do not overlap cannot be executed in parallel and are thus
excluded from par^i in line 14. These steps are then repeated until par, and thus
also the WCET, reaches a fixed point. The resulting WCETs are then returned in
line 16.
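The iteration described above can be sketched as follows. This is an illustration under stated assumptions rather than the analyzer's code: `wcet_of` is a hypothetical callback that stands in for the complete per-task analysis of Figure 5.1, the task set and its toy WCET model are made up, and termination of the loop relies on `wcet_of` never growing when par shrinks, which mirrors the timing-anomaly-freedom requirement.

```python
from math import lcm

def refine_par(tasks, wcet_of):
    """Fixed-point refinement of par in the spirit of Algorithm 5.

    tasks:   dict task name -> (core, period)
    wcet_of: hypothetical callback (name, par set) -> WCET bound
    """
    names = sorted(tasks)
    hyper = lcm(*(period for _, period in tasks.values()))
    # Conservative start: every task on another core may run in parallel.
    par = {n: {o for o in names if tasks[o][0] != tasks[n][0]} for n in names}
    while True:
        wcet = {n: wcet_of(n, par[n]) for n in names}
        windows = {n: [] for n in names}
        for core in {c for c, _ in tasks.values()}:
            # Spawn times k * period with k >= 1, below the hyperperiod.
            jobs = sorted((t, n) for n in names if tasks[n][0] == core
                          for t in range(tasks[n][1], hyper, tasks[n][1]))
            prev = None
            for t, n in jobs:
                # A job may be blocked by its predecessor on the same core.
                start = t if prev is None else max(prev[0] + wcet[prev[1]], t)
                windows[n].append((t, start + wcet[n]))
                prev = (t, n)

        def overlap(a, b):
            return any(lo1 <= hi2 and lo2 <= hi1
                       for lo1, hi1 in windows[a] for lo2, hi2 in windows[b])

        new_par = {n: {o for o in names
                       if tasks[o][0] != tasks[n][0] and overlap(n, o)}
                   for n in names}
        if new_par == par:          # fixed point reached
            return wcet, par
        par = new_par

# Toy task set: names, cores, periods and the WCET model are made up.
tasks = {"a": (0, 10), "b": (1, 15)}
wcet, par = refine_par(tasks, lambda name, p: 2 + 2 * len(p))
```

In this toy configuration the lifetime windows of the two tasks never overlap, so both par sets become empty after the first iteration and the WCET bounds drop from 4 to 2.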
Theorem 1. For timing-anomaly-free systems, Algorithm 5 terminates and provides
a valid overapproximation of par(τ, c) for all tasks τ and cores c.
Proof. To show the termination, it is sufficient to prove that $par^i$ is monotonically
decreasing in i, i.e.,
\[ \forall i : \forall \tau, c : par^i(\tau, c) \subseteq par^{i-1}(\tau, c) \tag{5.2} \]
We prove this by induction over i. In the first iteration with i = 1 we have
$par^1(\tau, c) \subseteq par^0(\tau, c) = \bigcup_{x \neq c} T_x$, which is trivially true according to the
definition of $par^1(\tau, c)$ in line 14 of Algorithm 5.
For the induction step, we assume the existence of a task $\tau_e$ on a core c with
$par^{i+1}(\tau_e, c) \not\subseteq par^i(\tau_e, c)$ and show that this leads to a contradiction. If we assume
that it existed, there must be at least one task $\tau^o_e \in par^{i+1}(\tau_e, c) \setminus par^i(\tau_e, c)$.
According to the definition of $par^{i+1}(\tau_e, c)$ this means that $\delta^{i+1}(\tau_e) \cap \delta^{i+1}(\tau^o_e) \neq \emptyset$
whereas $\delta^i(\tau_e) \cap \delta^i(\tau^o_e) = \emptyset$. The lower bounds of the windows that constitute
any $\delta(\tau)$ are the “spawn times” of the jobs, which never change (cf. line 13). Therefore,
$\delta^{i+1}(\tau_e) \cap \delta^{i+1}(\tau^o_e) \neq \emptyset$ is only possible if the upper bound of any such window has
grown, i.e., if $WCET^{i+1}(\tau) > WCET^i(\tau)$ for $\tau = \tau_e$ or $\tau = \tau^o_e$ or $\tau$ being a predecessor
of $\tau_e$ or $\tau^o_e$ in the respective job sequence $S_c$. To complete the contradiction we show
that such a WCET growth is not possible for any task $\tau$, i.e.,
\[ par^i(\tau, c) \subseteq par^{i-1}(\tau, c) \Rightarrow WCET^{i+1}(\tau) \le WCET^i(\tau) \tag{5.3} \]
Any $WCET^{i+1}$ is computed based upon $par^i$ in Algorithm 5. A smaller par induces
fewer or equally many “unknown” classifications in Equation 5.1. This means that for any
shared cache access r with $cls(q^C, r) = \{HIT, MISS\}$, either the classification stays
unchanged or
• r is now classified as {HIT}. In a timing-anomaly-free system, the local worst-case
is always also the global worst-case (cf. Figure 2.5), i.e., for the context
block $v_r$ that issued r we have $\omega^{i+1}_{max}(v_r) < \omega^i_{max}(v_r)$. Thus, $WCET^{i+1}(\tau) \le WCET^i(\tau)$,
where strictness depends on whether $v_r$ is part of the WCEP.
Shared component     Bounded Access Delay     State-Permeable
Shared cache         Yes                      Yes
Shared bus (PRIO)    No                       Yes
Shared bus (FAIR)    Yes                      Yes
Shared bus (TDMA)    Yes                      No
Shared bus (PD)      Configuration-dependent
Table 5.1: Properties of shared resources in multi-cores.
• r is now classified as {MISS}, i.e., the local worst-case has stayed the same,
which leads to $\omega^{i+1}_{max}(v_r) = \omega^i_{max}(v_r)$ and $WCET^{i+1}(\tau) = WCET^i(\tau)$, respectively.
With the monotonicity of $par^i$, the correctness can easily be shown by induction over
i and the prerequisite that the WCET computation for a given $par^{i-1}$ is correct.
Algorithm 5 and Theorem 1 were ﬁrst given in [LSL+09]. In this thesis, they
are generalized to periodic tasks.
5.4.2 Shared Bus Analysis Preliminaries
As pointed out in Section 5.2.2, we focus on the analysis of shared buses, since
shared interconnection networks can be reduced to this case for the important class
of time-triggered arbitration. This analysis stage is inevitable in multi-cores if any
shared memory is going to be used, whereas shared caches are not required in
all cases. In fact, real-world architectures actually avoid even non-shared caches in
areas where time-predictability and power-efficiency matter more than simplicity of
programming, such as smartphone MPSoCs and gaming consoles [WEE+08, Section
11.2]. To cite from the latter article:
“The complexity of analysis moves from the behavior of the individual
cores to the interplay between them as they access memory.”
[WEE+08]
In Section 5.2.2, we have already presented the four main types of arbitration
that we will consider. They are listed in Table 5.1 together with a classiﬁcation of
analysis properties.
Deﬁnition 20. A shared resource R has bounded access delay iﬀ for every access
request r at time t there exists a number of time steps Dmax ∈ N0 such that r is
guaranteed to have been granted access at time t+Dmax. R is called state-permeable
iﬀ an access by a core ci can lead to a change of the timing-behavior of the resource
as observed by a core cj ≠ ci.
[Figure 5.2, state diagram not reproduced. States: “Arbitrate”, “Blocked” and
“Forward to Slave 1”, ..., “Forward to Slave n”; each state performs q^B = update(q^B).
Transition labels (input[guard]/action) as far as recoverable; source and target
states are inferred from the guards and the surrounding text:
  Arbitrate → Blocked:             r[delay(q^B, r) ∖ {0} ≠ ∅] / d = delay(q^B, r), r_c = r
  Arbitrate → Forward to Slave i:  r[0 ∈ delay(q^B, r) ∧ acc(r) ∩ A_i ≠ ∅] / r_c = r, f = fwd(r)
  Blocked → Blocked:               −, r[d ∖ {0} ≠ ∅] / d−−
  Blocked → Forward to Slave i:    −, r[0 ∈ d ∧ acc(r_c) ∩ A_i ≠ ∅] / f = fwd(r_c)
  Forward to Slave i → Arbitrate:  cmpl(f) / cmpl(r_c)
  Further labels in the figure:    −/− and −, r/−]

Field   Type    Description
r       Input   An incoming request
cmpl    Input   Request completions
r_c     R       The current request
q^B     B       Abstract arbiter state
d       I_u     Blocking duration
A_i     I_u     Address range of slave i

Function   Description
cmpl       Signal access completion (Output)
fwd        Forward a request (Output)
delay      Possible access delay computation
update     Update bus state
acc        Determine request address range

Figure 5.2: The abstract bus timing model.
Resources without bounded access delay cannot be analyzed by a task- or core-
partitioned WCET analysis, since the access opponents must be known to make the
possible delay ﬁnite. Therefore, the PRIO arbitration cannot be analyzed with the
techniques presented in this section.
The task- or core-partitioned analysis of a resource which is state-permeable
but has bounded access delay is possible, but it will incur inevitable overestimation
since the other cores’ timing-relevant modiﬁcations of the resource state cannot be
captured precisely. Therefore, shared caches, FAIR and PD arbitration can only be
handled by worst-case assumptions. Section 5.5 presents a computationally more
expensive methodology for precise analyses of resources with unbounded delay or
state-permeability.
The ideal case, from the point of view of the analysis, is a bounded-delay,
non-state-permeable resource. The only known examples of this are TDMA-arbitrated
interconnects and buses. For such a resource, a precise analysis is possible, even
without knowing the concurrently executing tasks. Therefore, the following sections
will have an emphasis on the precise analysis of TDMA.
In Figure 5.2, the abstract state machine for a shared bus is shown in analogy
to the abstract pipeline and cache FSMs from Section 4.3. Since we are conduct-
ing a task-partitioned analysis here, the input is a bus access request r from the
current task or no access denoted as “−”. No information is available about ac-
cesses which are possibly issued in parallel by other cores. If the bus is free (state
“Arbitrate”) the function delay(qB, r) is used to determine the maximum blocking
duration caused by other tasks or the arbitration scheme. Depending on whether
this duration may be zero, the transitions to “Blocked” and to slaves which are reg-
istered for addresses from the possibly accessed address range acc(r) are enabled.
As mentioned in Section 4.3, the microarchitectural analysis must explore all non-
deterministic transitions if multiple ones are enabled. Once a request is completed
at the slave side, which may be a shared cache or a shared memory in our architec-
ture from Figure 3.4, the completion is signaled through cmpl(f) and is forwarded
to the master as cmpl(rc). Depending on the arbitration policy, the bus arbiter also
keeps an internal state qB which may be used in the arbitration decision and may
be updated in each cycle through update(qB).
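The non-deterministic choice in the “Arbitrate” state can be illustrated by a small helper that lists the enabled successor states for a request. This is a sketch only; the function name, the tuple encoding of address ranges and the slave names are assumptions, while the two guards follow the transition labels of Figure 5.2.

```python
def enabled_from_arbitrate(delays, acc, slave_ranges):
    """Return the successor states enabled for a request in state "Arbitrate".

    delays:       set of possible blocking durations, i.e. delay(q_B, r)
    acc:          (lo, hi) address range possibly accessed by the request
    slave_ranges: dict slave name -> (lo, hi) address range A_i
    """
    targets = []
    if any(d != 0 for d in delays):            # delay(q_B, r) \ {0} is non-empty
        targets.append("Blocked")
    if 0 in delays:                            # an immediate grant is possible
        for name, (lo, hi) in slave_ranges.items():
            if acc[0] <= hi and lo <= acc[1]:  # acc(r) intersects A_i
                targets.append("Forward to " + name)
    return targets

slaves = {"Slave 1": (0x000, 0x1FF), "Slave 2": (0x200, 0x3FF)}
successors = enabled_from_arbitrate({0, 3}, (0x100, 0x10F), slaves)
```

If more than one successor is returned, the microarchitectural analysis has to follow all of them.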
Similar to the cache analysis, with $Q^B$ being the set of FSM states from Figure 5.2
and $\mathcal{B}$ being the set of abstract arbiter states, we can define the abstract bus
domain $\mathbb{B}$, which is part of the microarchitectural environment domain $Q^E$, as
\[ \mathbb{B} = 2^{Q^B} \times \mathcal{B} \tag{5.4} \]
In the following we will explore diﬀerent possibilities of implementing the do-
main B and its delay and update operations for the arbitration types from Table 5.1.
These results were developed exclusively by the author of this thesis and a prelimi-
nary version of them was published in [KFM+11] and [KFM+14]. The generaliza-
tion to the Priority Division policy was published by the author in [KHM+13].
5.4.3 Basic Bus Domains
As discussed in Section 5.2.2, for each shared bus a maximum slave access duration
TBmax is known, i.e., any transaction that was granted the bus must be completed
within TBmax cycles.
In the discussion of Deﬁnition 20 and Table 5.1, we have already seen that the
PRIO arbitration cannot be analyzed by a partitioned analysis, since in general
we cannot prove that access r, issued by the current task τ , does not suﬀer from
starvation on the bus. Thus, for τ running on core c we set
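\[ \mathcal{B}_{PRIO} = \{-\} \tag{5.5} \]
\[
\forall r \in R : delay_{PRIO}(-, r) =
\begin{cases}
[0, T^B_{max} - 1] & \text{if } \forall i \in \{0, \dots, n_c\} : p_c \ge p_i \\
\infty & \text{else}
\end{cases}
\tag{5.6}
\]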
Even for the maximum priority core, a non-zero delay is possible since running
accesses are not preemptable. Still, this is only of very limited use. We will therefore
ignore PRIO for the rest of this section. A more precise analysis is presented in
Section 5.5.
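[Figure 5.3, timeline not reproduced. The time axis is divided into slots of length ls
that alternate between Core 1 and Core 2, with slot boundaries at 0, 1ls, 2ls and 3ls.
A request r of duration TBmax arrives at time ls − TBmax + 1 and is served at the
start of the next slot of its core.]
Figure 5.3: An example for a TDMA bus access which is maximally delayed.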
Similarly to PRIO, the FAIR arbitration cannot be analyzed precisely here either.
Unlike PRIO, however, it has a bounded access delay which prevents starvation of
any access. We can therefore analyze it with
\[ \mathcal{B}_{FAIR} = \{-\} \tag{5.7} \]
\[ \forall r \in R : delay_{FAIR}(-, r) = [0, (n_c - 1) T^B_{max}] \tag{5.8} \]
where $n_c$ is the number of cores. For an access from $\tau \in T_c$, the bus is free in the
best case (delay 0), whereas in the worst case the active master $m_a$ is equal to
$(c + 1) \bmod n_c$ and all cores $c_o \neq c$ have issued requests at the current cycle, too (delay
$(n_c - 1) T^B_{max}$). Since we have no information about other concurrent accesses, we
must account for the best case, the worst case and all cases in between, as shown above.
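The upper delay bounds of Equations 5.6 and 5.8 can be sketched as two small functions. The function names and the dictionary encoding of the core priorities are assumptions of this sketch; as in Equation 5.6, a larger value is taken to mean a higher priority.

```python
import math

def max_delay_prio(core, priorities, t_max):
    """Worst-case arbitration delay under PRIO, following Equation 5.6.

    priorities maps each core to its bus priority.
    """
    if all(priorities[core] >= p for p in priorities.values()):
        return t_max - 1   # only a running, non-preemptable access can block
    return math.inf        # starvation cannot be excluded

def max_delay_fair(n_cores, t_max):
    """Worst-case arbitration delay under FAIR, following Equation 5.8."""
    return (n_cores - 1) * t_max
```

With four cores and a maximum slave access duration of 10 cycles, for example, the FAIR bound is 30 cycles, while PRIO yields 9 cycles for the highest-priority core and no finite bound for all others.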
For TDMA, the worst-case happens when the access hits the bus at the ﬁrst
cycle where it cannot be granted any more, which is shown in Figure 5.3 for n = 2.
The generalized worst-case delay amounts to (nl − 1)ls +(TBmax − 1) cycles. Since we
require the slots to be big enough to accommodate any access to a target behind
the bus, i.e., ls ≥ TBmax, this worst-case is even worse than for FAIR. Fortunately
though, if we assume that the access in Figure 5.3 is issued at time 0 instead of
at time ls − TBmax + 1, then we can infer that the delay must be equal to zero. So,
we can determine the arbitration delay without knowing anything about possible
concurrently occurring accesses under TDMA.
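To make the comparison concrete, the two worst-case bounds can be evaluated for a small configuration. The values $n_l = n_c = 2$, $l_s = 5$ and $T^B_{max} = 2$ are assumptions chosen for illustration:

```latex
\begin{align*}
\text{TDMA:}\quad & (n_l - 1)\, l_s + (T^B_{max} - 1) = 1 \cdot 5 + (2 - 1) = 6 \\
\text{FAIR:}\quad & (n_c - 1)\, T^B_{max} = 1 \cdot 2 = 2
\end{align*}
```

Under these assumptions, a request arriving at offset $l_s - T^B_{max} + 1 = 4$ misses its slot and has to wait until offset 10, which accounts for the 6 cycles, whereas a request arriving at offset 0 is granted without delay.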
Definition 21. The absolute time in the analyzed system is measured in CPU clock
cycles¹. An absolute point in time in an execution is given as $t \in \mathbb{N}_0$, which means
the t-th clock cycle after the start of the system. The offset o of a point in time t is
computed as $o = t \bmod (n_l l_s)$. An offset set $O \subseteq \mathbb{N}_0/(n_l l_s)$ is a set of offsets, i.e., a
subset of the modulo ring $\mathbb{N}_0/(n_l l_s)$.
We will represent the position in the cyclic TDMA schedule through oﬀset sets
in the following. Therefore, we set
\[ \mathcal{B}_{TDMA} = 2^{\mathbb{N}_0/(n_l l_s)} \tag{5.9} \]
¹ We assume a constant clock frequency.
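[Figure 5.4, diagram not reproduced. Offsets 0 to 10 are shown on a time axis; Core 1
owns the first slot and Core 2 the second one, and the members of O are marked.]
Figure 5.4: The TDMA offset set O = {1,2,7} in a 2-core schedule with ls = 5.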
Eﬀectively, this introduces the notion of “points in time” into the microarchitectural
analysis itself, though only in the compressed form of oﬀsets. In the classical single-
core analysis frameworks, the step towards points in time as opposed to points in
a state space is done only after the microarchitectural analysis has terminated as
sketched in Section 4.3 and Section 4.4.
The TDMA delay function is dependent on the core c on which the currently
analyzed task is executed. For all qB ∈ BTDMA and r ∈ R it is deﬁned as
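\[
delay_{TDMA}(q^B, r) = \bigcup_{o \in (q^B + T_a)}
\begin{cases}
0 & \text{if } o \in \gamma(c) \\
\min_{o_g \in \gamma(c)} \{ (o_g - o) \bmod (n_l l_s) \} & \text{else}
\end{cases}
\tag{5.10}
\]
with a grant window of offsets $\gamma(c)$ for core c defined as
\[ \gamma(c) = \bigcup_{i \text{ with } o_i = c} [i l_s, (i + 1) l_s - T^B_{max}] \tag{5.11} \]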
The ﬁrst case corresponds to an access inside of one of c’s slots summarized through
γ(c), whereas the second case is an access outside of γ(c). The equation uses (qB +
Ta) as the eﬀective oﬀset set where Ta is the time that is needed for the arbitration
itself. We generally assume Ta = 1, which was already reﬂected in Table 3.1 and by
the fact that there is only one “Arbitrate” state in Figure 5.2 which has no self-loop.
The shift by Ta is needed since o is the oﬀset at the time of the arbitration whereas
o + Ta is the oﬀset at the time of the bus access.
As an example for a delay computation consider the oﬀset set given in Figure 5.4.
Assuming TBmax = 2 and an access from core 1, i.e., c = 0, oﬀsets 1 and 2 fall into
case one and contribute delay 0, whereas oﬀset 7 lies in case two and produces a
delay of 3. Therefore, in this case, delayTDMA({1,2,7}, r) = {0,3}.
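Equations 5.10 and 5.11 can be sketched on explicit offset sets as follows. The function names and the list `owners` (the core owning each slot) are assumptions of this sketch, and the arbitration time is kept as a parameter `t_a`. The result {0, 3} of the example above is obtained when {1, 2, 7} is taken as the already shifted, effective offset set (`t_a=0`); with `t_a=1` the offsets are shifted by one cycle first.

```python
def grant_window(core, owners, slot_len, t_max):
    """Offsets at which an access of `core` is granted at once (Equation 5.11).

    owners[i] is the core that owns slot i of the TDMA schedule.
    """
    window = set()
    for i, owner in enumerate(owners):
        if owner == core:
            # Closed interval [i * ls, (i + 1) * ls - t_max].
            window.update(range(i * slot_len, (i + 1) * slot_len - t_max + 1))
    return window

def delay_tdma(offsets, core, owners, slot_len, t_max, t_a=1):
    """Set of possible TDMA arbitration delays (Equation 5.10).

    Assumes that `core` owns at least one slot.
    """
    period = len(owners) * slot_len
    gamma = grant_window(core, owners, slot_len, t_max)
    delays = set()
    for o in offsets:
        o = (o + t_a) % period            # offset at the time of the bus access
        if o in gamma:
            delays.add(0)
        else:                             # wait for the next grantable offset
            delays.add(min((g - o) % period for g in gamma))
    return delays

# Schedule of Figure 5.4: two slots of length 5, core 0 owns the first one.
example = delay_tdma({1, 2, 7}, core=0, owners=[0, 1], slot_len=5, t_max=2, t_a=0)
```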
In the “Arbitrate” and “Blocked” states from Figure 5.2 we can only register the
passing of a CPU cycle, therefore the update for these states is
\[ \forall q^B \in \mathcal{B} : update^{Default}_{TDMA}(q^B) = \{ (o + 1) \bmod (n_l l_s) \mid o \in q^B \} \tag{5.12} \]
In the transition to the “Forward to Slave X” states, however, we know that the
bus has been granted at the current cycle, i.e., we must be at an offset within the
current core’s slot. For the “Forward” states we define update on all $q^B \in \mathcal{B}$ as
\[ update^{Forward}_{TDMA}(q^B) = \{ (o + 1) \bmod (n_l l_s) \mid o \in q^B \} \cap \gamma(c) \tag{5.13} \]
To make $\mathcal{B}$ usable as the bus domain in the microarchitectural analysis, we finally
need to provide a meet operator for $\mathcal{B}$, which is simply given by the set union:
\[ \forall q^B_1, q^B_2 \in \mathcal{B} : q^B_1 \sqcup q^B_2 = q^B_1 \cup q^B_2 \tag{5.14} \]
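The two update functions and the meet operator translate directly into set operations. The following sketch uses plain Python sets; the function names and the example values (a schedule length of 10 and the grant window {0, 1, 2, 3}) are assumptions made for illustration.

```python
def update_default(offsets, period):
    """Equation 5.12: one CPU cycle passes in "Arbitrate" or "Blocked"."""
    return {(o + 1) % period for o in offsets}

def update_forward(offsets, period, gamma):
    """Equation 5.13: the bus was granted, so only offsets in gamma remain."""
    return update_default(offsets, period) & gamma

def join(q1, q2):
    """Equation 5.14: the meet operator of the offset domain is the set union."""
    return q1 | q2

gamma = {0, 1, 2, 3}
after_cycle = update_default({9, 4}, 10)
after_grant = update_forward({9, 4}, 10, gamma)
```

The intersection in `update_forward` is the only operation that can shrink an offset set; `join` can only enlarge it.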
In the implementation, we represent the oﬀset sets eﬃciently as sets of oﬀset
intervals, which speeds up most operations on them since instead of examining
every single oﬀset only each interval’s lower and upper bound must be inspected or
updated.
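One possible shape of such an interval representation is sketched below. It is not the data structure of the actual implementation; the function names and the representation as sorted lists of closed `(lo, hi)` pairs are assumptions of this sketch.

```python
def to_intervals(offsets):
    """Compress a set of offsets into sorted, maximal (lo, hi) intervals."""
    intervals = []
    for o in sorted(offsets):
        if intervals and o == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], o)   # extend the last interval
        else:
            intervals.append((o, o))
    return intervals

def merge(intervals):
    """Merge overlapping or adjacent intervals."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

def shift_intervals(intervals, period):
    """Advance all offsets by one cycle, touching only the interval bounds."""
    shifted = []
    for lo, hi in intervals:
        if hi + 1 < period:
            shifted.append((lo + 1, hi + 1))
        else:                       # the upper bound wraps around to offset 0
            shifted.append((0, 0))
            if lo < hi:
                shifted.append((lo + 1, hi))
    return merge(shifted)
```

The cost of `shift_intervals` depends on the number of intervals, not on the number of offsets they contain.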
For the time-triggered priority division (PD) arbiter, we distinguish between
an oﬀset window γMUST(c) in which accesses from c must be granted and a win-
dow γMAY(c) in which accesses may be granted. The possible delay due to a non-
preemptable lower-priority access must be accounted for in the determination of
the grant window γMUST(c), which is reﬂected by the subtraction of (TBmax − 1) in
Equation 5.15.
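\[ \gamma_{MUST}(c) = \bigcup_{i \text{ with } \forall j : p_{i,c} \ge p_{i,j}} [i l_s, (i + 1) l_s - T^B_{max} - (T^B_{max} - 1)] \tag{5.15} \]
\[ \gamma_{MAY}(c) = \bigcup_{i \text{ with } p_{i,c} > 0} [i l_s, (i + 1) l_s - T^B_{max}] \tag{5.16} \]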
For the delay function, the core which has maximum priority in a slot is subject
to the same delay as in the PRIO case (see Equation 5.6). If a core has nonzero
priority, it may be able to access, but it is not guaranteed to do so, whereas if it has
zero priority, it remains blocked. This leads to the delay function
\[
delay_{PD}(q^B, r) = \bigcup_{o \in (q^B + T_a)}
\begin{cases}
[0, T^B_{max} - 1] & \text{if } o \in \gamma_{MUST}(c) \\
[0, \min_{o_g \in \gamma_{MUST}(c)} \{ (o_g - o) \bmod (n_l l_s) \}] & \text{if } o \in \gamma_{MAY}(c) \\
\min_{o_g \in \gamma_{MUST}(c)} \{ (o_g - o) \bmod (n_l l_s) \} & \text{else}
\end{cases}
\tag{5.17}
\]
With $\gamma(c) = \gamma_{MUST}(c) \cup \gamma_{MAY}(c)$, the meet and update functions are the same as
for TDMA.
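The PD delay function can be sketched along the lines of Equations 5.15 to 5.17. The function names, the priority table `prio` (with `prio[i][k]` as the priority of core k in slot i) and the example schedule are assumptions of this sketch; the intervals of Equation 5.17 are expanded into explicit sets of delays.

```python
def pd_windows(core, prio, slot_len, t_max):
    """Compute the MUST and MAY grant windows of `core` (Equations 5.15, 5.16)."""
    must, may = set(), set()
    for i, slot in enumerate(prio):
        start = i * slot_len
        if slot[core] > 0:
            may.update(range(start, start + slot_len - t_max + 1))
        if all(slot[core] >= p for p in slot):
            # A running lower-priority access may still occupy the bus.
            must.update(range(start, start + slot_len - t_max - (t_max - 1) + 1))
    return must, may

def delay_pd(offsets, core, prio, slot_len, t_max, t_a=1):
    """Set of possible PD arbitration delays (Equation 5.17).

    Assumes that `core` has maximum priority in at least one slot.
    """
    period = len(prio) * slot_len
    must, may = pd_windows(core, prio, slot_len, t_max)
    delays = set()
    for o in offsets:
        o = (o + t_a) % period
        wait = min((g - o) % period for g in must)   # next guaranteed grant
        if o in must:
            delays.update(range(0, t_max))           # [0, t_max - 1]
        elif o in may:
            delays.update(range(0, wait + 1))        # [0, wait]
        else:
            delays.add(wait)
    return delays

# Two slots of length 8; core 0 has the highest priority in the first slot.
prio = [[2, 1], [1, 2]]
must, may = pd_windows(0, prio, slot_len=8, t_max=2)
```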
For both the TDMA and PD analysis, we also still need the initial data-flow
information $q^B_0$ (labeled $l_0$ in Definition 6). We could assume $q^B_0 = \{0, \dots, n_l l_s\}$, but
for the purpose of higher result precision, we assume that the tasks start synchronized
with the TDMA schedule at offset 0, i.e., $q^B_0 = \{0\}$. In our platform, this is
achieved by a memory-mapped delay register of the bus arbiter which delays the access
until the offset $n_l l_s - 1$ is reached, such that the first instruction of a task starts
at offset 0. However, as noted above, this is completely optional and the impact of
$q^B_0 = \{0, \dots, n_l l_s\}$ on the precision is limited, as discussed in the following.
With this framework, we can analyze time-triggered arbitration to some extent.
One problematic aspect is that once we have lost precision due to meet operations as
specified in Equation 5.14, we can hardly regain it. The only point where we actually
gain precision is in Equation 5.13 by intersecting with $\gamma(c)$. Thus, once we have
reached a bus state $q^B \supseteq \gamma(c)$, we can no longer reach any $q^B_{next}$ with $|q^B_{next}| < |\gamma(c)|$.
This is undesirable, since any increase of the cardinality of qB may introduce a new
value into the union in Equation 5.10. To be able to reach smaller oﬀset sets in the
analysis again, we therefore have to extend it as detailed in the following sections.
5.4.4 Loop Unrolling
Loops are the main problem for most abstract interpretation-based analyses, because
the data-flow information is repeatedly joined together with the meet operator at
the loop head until it stabilizes, as sketched in Algorithm 1. As an example, consider
the code fragment shown in Figure 5.5a. For simplicity of presentation, we assume
a constant block execution time ω for each block and no oﬀset reﬁnement after
Equation 5.13. With the given parameters nl = 2 and ls = 5, we have a schedule
length of 10 cycles. The path A → B → D has a total length of 20, i.e., any oﬀset
that enters at A is mirrored back to A from D. This is diﬀerent for path A→ C →D
with length 21. When entering with oﬀset 0 at A, we reach C with oﬀset 5 and D
and A are then reached with oﬀset 1. The meet operator at A yields {0}⊔{1} = [0,1]
which is again propagated through the loop. Therefore, in the next analysis iteration
all oﬀsets in [0,9] are added to qBout of D in ascending order. Since [0,9] is the “top”
element ⊺B of the oﬀset lattice, convergence is reached then.
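The divergence described above can be reproduced with a few lines of Python. This is only a minimal sketch under simplifying assumptions: offsets are tracked as plain sets instead of intervals, every block has the constant duration from the example, and the names PERIOD, DURATION, PREDS and propagate_offsets are hypothetical and not part of the analyzer.

```python
# Hypothetical model of the loop from Figure 5.5a (nl = 2, ls = 5).
PERIOD = 10  # schedule length nl * ls
DURATION = {"A": 5, "B": 5, "C": 6, "D": 10}           # constant omega per block
PREDS = {"A": ["D"], "B": ["A"], "C": ["A"], "D": ["B", "C"]}


def propagate_offsets(entry_offsets):
    """Iterate the offset sets to a fixed point; returns the outgoing sets."""
    out = {block: set() for block in DURATION}
    changed = True
    while changed:
        changed = False
        for block in ("A", "B", "C", "D"):
            # Meet: union of all predecessor results.
            incoming = set().union(*(out[p] for p in PREDS[block]))
            if block == "A":
                incoming |= set(entry_offsets)  # loop entry edge
            # Transfer: advance every offset by the block duration.
            new_out = {(o + DURATION[block]) % PERIOD for o in incoming}
            if new_out != out[block]:
                out[block] = new_out
                changed = True
    return out


if __name__ == "__main__":
    result = propagate_offsets({0})
    print(sorted(result["D"]))
```

Starting from the single entry offset 0, the outgoing set of D grows by one offset per round until it covers the whole schedule, which is the loss of precision that the following sections address.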
Figure 5.5: An example for the divergence of TDMA offsets due to loops in the
program with parameters nl = 2 and ls = 5. (a) The original loop with converged
offsets. (b) The unrolled loop with converged offsets. The blocks have the
execution times ω(A) = [5,5], ω(B) = [5,5], ω(C) = [6,6] and ω(D) = [10,10].
The easiest way to avoid this behavior is to unroll the loop. Figure 5.5b shows an
unrolling of the ﬁrst iteration and the respective oﬀset results as discussed above. As
visible, only the unrolled iterations proﬁt from this transformation, i.e., to obtain
really precise results, we need to fully unroll the loop, which was first proposed
in [AEP+08] and extended in [KFM+11, “Global Convergence Analysis”]. However,
this approach has a number of significant drawbacks:
• A full unrolling is possible only for those loops for which we have an explicit
maximum loop iteration count (compare Section 3.3). For loops which are
implicitly bounded by ﬂow restrictions, this technique is not applicable.
• The unrolling may severely impair the analysis duration. Since the TDMA
analysis only works as a part of the integrated microarchitectural analysis,
the duration is scaled linearly with the loop bound for all modules including
pipeline and cache analysis.
5.4.5 Oﬀset Contexts
As already visible from Figure 5.5, the TDMA behavior is cyclic. Even though the
loop from Figure 5.5a may have thousands of iterations, every iteration i will start
with oﬀset i mod nlls. Thus, to be precise with respect to the TDMA oﬀsets, we
only need to distinguish nlls diﬀerent execution scenarios, i.e., one scenario for each
oﬀset with which the loop may start. Classical DFA unrolling contexts as presented
in Section 4.1.2 form sequential chains of contexts. In the special case of TDMA
analysis, we however need cyclic contexts whose interdependencies are statically
unknown and only become clear during the analysis itself. We therefore introduce
a new type of contexts, named oﬀset contexts, into the analysis:
Definition 22. An offset context $c^l_o$ for an offset $o \in B_{TDMA}$ and a natural
loop $l \in \circlearrowleft_\tau$ is a context
$G^{c^l_o}_\tau = (V^{c^l_o}_\tau, E^{c^l_o}_\tau, v^{c^l_o}_0)$ within the context graph
$G^C_\tau$ as defined in Definition 14 with $v^{c^l_o}_0$ being the head of loop $l$. The
sibling set $C^l = \{c^l_0, \ldots, c^l_{n_l l_s - 1}\}$ contains the offset contexts of a
given loop for all possible offsets. Each sibling set is accompanied by an unrolled
iteration context $c^l_\perp$ for the same loop which models the first loop iteration.
The set of nodes $v \in V^{c^l_o}_\tau$ which are sources of back-edges is called
$V^{c^l}_{back} \subseteq V^{c^l_o}_\tau$.
This basic idea was first published by the author in [KFM+11], but based on
a syntax-directed WCET analysis which worked by solving a separate graph-flow
problem. We build upon this idea here, but we seamlessly integrate the offset
analysis into the context graph itself.
An illustration of how offset contexts for a loop are built, or unfolded, is shown
in Figure 5.6. It shows the offset contexts that are unfolded when processing the
loop from Figure 5.5a. To make the graph reasonably small, each node in Figure 5.6
represents one copy of the loop body, i.e., the nodes A, B, C, and D from Figure 5.5a.
Every edge from one copy of the body to another one is a copy of the back edge
(D,A). The first iteration of the loop is unrolled as presented previously. This is
done to capture cache effects, since the first iteration of a loop will very often “warm
up” the cache and we want to separate this behavior from the one of the successive
iterations, modeled by the sibling set of offset contexts cl0 to cl9. If the loop may
exit after the first iteration (lower loop bound of 1), we also add a dashed exit edge
from cl⊥. If it must exit after or before the first iteration (upper loop bound of 1 or
0), the unfolding does not alter the loop at all.

Figure 5.6: The unfolded offset contexts for the example shown in Figure 5.5a.
The loop is assumed to iterate at least twice and each loop body has
ω(l) = [20,21]. The context cl⊥ models iteration 1 and the sibling contexts cl0 to
cl9 model iterations 2 to ∞; they are placed between the code before the loop and
the code after the loop.
Unrolling and unfolding operations always work on the whole loop body. If a loop
l contains a nested loop l′, i.e., l >↺ l′, and l′ has been either unrolled or unfolded
previously, then all contexts of l′ will be duplicated in the unrolling or unfolding of
l. This leads to a considerable increase in the context graph size, but in the case
of unfolding of offset contexts, this increase is bounded by the analysis parameters.
In the case of unrolling, the increase is only limited by the loop bound values,
which may be arbitrarily large. Both techniques also affect the path analysis, since
they transform the representation of the loop within the context graph. Therefore,
Equation 4.46 and Equation 4.47 must sum over all duplicates of the back-edges.
In the case of offset contexts, only the edges that lead into cl⊥ are counted as
entry-edges whereas all edges between cl⊥ and the clo are back-edges.
Obviously, the graph structure as shown in Figure 5.6 is already part of the result
of the offset analysis, because we initially have no idea which transitions between cl⊥
and the clo are possible. Therefore, the analysis will start with no edge between cl⊥
and any clo, i.e., none of the solid edges in Figure 5.6 is added when the offset
contexts are unfolded. They are added during the microarchitectural analysis, which
requires some changes to the generic DFA algorithm. The resulting analysis algorithm
is shown in Algorithm 6.
The main changes in comparison to the generic DFA work list algorithm depicted
in Algorithm 1 are the lines 3–4, 11–12 and 16–17. In lines 3–4, the offset contexts
are expanded for each natural loop as shown in Figure 5.6. No offset contexts are
generated for non-reducible loops. As in the generic algorithm, the start node’s
incoming microarchitectural state is set to qm0 ∈ M whereas all other nodes are
initialized with ⊥ ∈ M (lines 5–7). The main analysis loop in line 8 iterates until
all data-flow values (microarchitectural states in this case) have converged. To
achieve this, the meet and transfer operators of the microarchitectural domain as
defined in Equation 4.23 and Equation 4.17 are invoked in lines 10 and 13 and new
results are propagated through the graph in lines 18–19.

Algorithm 6 The offset context specific microarchitectural analysis.
 1: function OffsetAwareDFA(((M, ⊑), F), (G^C_τ, q^m_0))
 2:   worklist ← v^C_{0,τ}
 3:   for l ∈ ↺_τ in ascending order of <↺ do        ▷ (Inner loops first)
 4:     unfoldOffsetContexts(l)
 5:   for v ∈ V^C_τ do                               ▷ Initialization of the microarch. state ...
 6:     if v = v^C_{0,τ} then q^in_v ← q^m_0, q^out_v ← ⊥   ▷ ... for the start node ...
 7:     else q^in_v ← ⊥, q^out_v ← ⊥                 ▷ ... and for all other nodes.
 8:   while worklist ≠ ∅ do                          ▷ Loop until a fixed point was found
 9:     v ← pop(worklist)
10:     q^in_v ← ⊔_{(u,v) ∈ E^C_τ} q^out_u
11:     if v = v^{c^l_o}_0 then                      ▷ When offset context c^l_o is entered ...
12:       q^{B,in}_v ← {o}                           ▷ ... set the incoming offset to o.
13:     q^tmp_v ← f^M_v(q^in_v)                      ▷ Apply microarch. transfer function of node v
14:     if q^out_v ≠ q^tmp_v then
15:       q^out_v ← q^tmp_v                          ▷ Take over new outgoing value
16:       if ∃ c^l : v ∈ V^{c^l}_back then           ▷ For offset context back-edge sources ...
17:         E^C_τ ← E^C_τ ∪ {(v, v^{c^l_o}_0) | o ∈ q^{B,out}_v}   ▷ ... add edges to target siblings.
18:       for (v, w) ∈ E^C_τ do
19:         push(worklist, w)                        ▷ Propagate changes to all successors
20:   return (G^C_τ, {v → q^in_v | v ∈ V^C_τ})
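The edge-adding part of Algorithm 6 can be illustrated on the running example. The following Python sketch is a strong simplification and not the analyzer's implementation: each context is collapsed into a single node representing one copy of the loop body with ω(l) = [20,21], only the TDMA offsets are tracked, and all names (PERIOD, BODY_DURATIONS, FIRST, unfold_and_analyze) are hypothetical.

```python
PERIOD = 10                 # nl * ls of the example
BODY_DURATIONS = (20, 21)   # one loop body takes 20 or 21 cycles
FIRST = "first"             # stands for the unrolled first iteration


def body_out(in_offsets):
    """Offset transfer function of one copy of the loop body."""
    return {(o + d) % PERIOD for o in in_offsets for d in BODY_DURATIONS}


def unfold_and_analyze(entry_offsets):
    """Return the outgoing offsets per context and the discovered back-edge copies."""
    q_out = {FIRST: body_out(entry_offsets)}
    edges = set()
    worklist = [FIRST]
    while worklist:
        v = worklist.pop()
        # Counterpart of lines 16-17: one edge per offset leaving the back-edge source.
        for o in sorted(q_out[v]):
            edges.add((v, o))
            if o not in q_out:
                # Counterpart of lines 11-12: context o is always entered with {o}.
                q_out[o] = body_out({o})
                worklist.append(o)
    return q_out, edges


if __name__ == "__main__":
    q_out, edges = unfold_and_analyze({0})
    for context in range(PERIOD):
        print(context, sorted(q_out[context]))
```

Because the incoming offset of every offset context is fixed, its outgoing set is computed only once in this sketch; each context keeps a two-element offset set instead of degrading to the full range [0,9].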
In comparison to the analysis from Section 5.4.3, the additional precision is
gained by setting the incoming oﬀsets to o if we reach the head node of an oﬀset
context clo in line 12. This is valid, since context clo exclusively models the situation
that loop l is entered with oﬀset o.
Of course, if we use offset contexts in this way, we must ensure that each node
v which has an outgoing edge to the head node of l has edges to each offset context
clo for all o ∈ qB,outv. Since offset contexts are only formed for reducible loops, and
the first loop iteration is unrolled², the transition into and between offset contexts
is only possible via l’s back-edges. We add copies of the back-edges leading to the
respective offset contexts in lines 16–17. Edges into the first iteration, modeled by
cl⊥, are already created during the unfolding. Therefore, the only edges that we add
in lines 16–17 are edges between cl⊥ and the clo and between different offset contexts
clo1 and clo2, i.e., the solid edges from Figure 5.6.
² The unrolling is done to distinguish the cache behavior of the first iteration from that of the
successive ones. Nevertheless, the offset contexts in principle also work without the unrolling or
with a different unrolling.
In this way, each oﬀset context clo which becomes reachable through the edge
additions is analyzed using the existing microarchitectural analysis leading to ω(v)
for each v ∈ V cloτ . The determination of the longest path through the context sibling
set is done by the path analysis which uses the generated ω(v) values and the
resulting context graph. The original formulation in [KFM+11; KFM+14] solved a
dedicated dynamic ﬂow problem for this purpose, but this is actually not needed as
shown here.
Theorem 2. The DFA with oﬀset contexts as shown in Algorithm 6 yields correct
microarchitectural states qinv for each v ∈ V Cτ and terminates in ﬁnite time.
Proof. If task τ contains no loops, Algorithm 6 coincides with Algorithm 1. In
this case, the correctness follows from the Galois connection between M and the
concrete microarchitectural states M (see Deﬁnition 17) and from the correctness
of the transfer functions fMv (see Equation 4.17). The termination follows from the
fact that the fMv are monotonic (cf. Section 2.1).
If τ contains loops, any data-flow item $q^{in}_v$ that is generated by Algorithm 1 for
a node $v \in V^{c^l_{[0,\infty]}}_\tau$ in the default loop iteration context
$c^l_{[0,\infty]}$ is “distributed” among $c^l_\perp$ and the contexts $c^l_0$ to
$c^l_{n_l l_s - 1}$. Therefore,
\[ \hat{q}_l = \bigsqcup_{c \in \{c^l_\perp, c^l_0, \ldots, c^l_{n_l l_s - 1}\}} q^{in}_{v_c} \;\wedge\; \hat{q}_l \sqsupseteq q^{in}_v \tag{5.18} \]
where vc is the copy of v in context c. Since all oﬀset contexts have exit edges
towards the loop successors, any loop successor obtains the value qˆl ⊒ qinv from the
loop in its meet operation, i.e., the safety of the approximation is maintained for
this node and all of its successors.
Finally, the termination of the analysis follows from the fact that we only add
edges to the context graph in Algorithm 6 but we never remove them again. Thus, in
the worst case each sibling set becomes a complete graph in the course of the analysis.
No more graph changes are possible after this point. After the graph structure has
converged, the termination follows from the same monotonicity argument as in the
sequential or non-unfolded case.
Note that the analysis using offset contexts is also feasible for tasks with
irreducible CFGs, where some loops are not natural loops according to Definition 12.
In this case, offset contexts are only constructed for the natural loops on the CFG,
whereas non-natural loops are analyzed with the basic method from Section 5.4.3
without any unrolling or unfolding.
5.4.6 Oﬀset Relocation
The previous approaches focused on tracking the oﬀset development in more detail
through the use of classical iteration contexts in the case of loop unrolling or cyclic
oﬀset contexts in the case of loop unfolding. The orthogonal approach of relocating
oﬀsets was ﬁrst mentioned in [CRM10].
Deﬁnition 23. An oﬀset relocation to oﬀset o ∈ B at a block v ∈ V Cτ means
1. setting qB,inv = {o} and
2. artiﬁcially increasing ωmax(v) by nlls cycles.
Thus, the relocation allows us to recover from imprecise offset information by
setting the incoming offsets of a block to a fixed value. This comes at the expense
of analysis precision, since we have to add nlls to the block duration. It can only
be beneficial if shared bus accesses follow in v or its successor blocks, which may
be classified more precisely with the narrow offset value. If we have an imprecise
qB,inv = {0, . . . , nlls − 1} and multiple accesses to the bus, each of them may incur a
delay of up to (nl − 1)ls + TBmax − 1. If the delay for each of these can be reduced,
this can easily compensate for the additional delay of nlls cycles for the offset
relocation. A formal proof of correctness of this procedure and a discussion of it
was first published by the author in [KFM+14].
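The trade-off can be made concrete with a small calculation. The sketch below only encodes the two quantities named in the text, the per-access delay bound (nl − 1)ls + TBmax − 1 and the relocation penalty nlls; the function names and the notion of a fixed number of "saved cycles per access" are illustrative assumptions, not part of the analysis.

```python
def max_arbitration_delay(n_l, l_s, t_b_max):
    """Upper bound on the delay of one access when the offset is unknown."""
    return (n_l - 1) * l_s + t_b_max - 1


def relocation_penalty(n_l, l_s):
    """Cycles added to omega_max(v) by one offset relocation."""
    return n_l * l_s


def relocation_pays_off(n_l, l_s, saved_cycles_per_access, num_accesses):
    """True if the assumed per-access savings outweigh the one-time penalty."""
    return saved_cycles_per_access * num_accesses > relocation_penalty(n_l, l_s)


if __name__ == "__main__":
    # Two cores, slot length 5, maximum access duration 2 (assumed values).
    print(max_arbitration_delay(2, 5, 2))
    print(relocation_pays_off(2, 5, saved_cycles_per_access=2, num_accesses=6))
```

With these assumed parameters the penalty is 10 cycles, so saving 2 cycles on each of six following accesses would already outweigh it.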
To be able to reason about the validity of the relocation, we need to make some
assumptions about the analyzed machine:
Definition 24. Given a multi-core hardware platform with a TDMA-arbitrated
shared resource $R$, a program $L$, an initial state $\tilde{q}^m_0 \in \tilde{Q}^M$, an
execution $exec(L, \tilde{q}^m_0) = (\tilde{q}^m_1, \ldots, \tilde{q}^m_k)$ (see Definition 16)
and a state $\tilde{q}^m_i \in exec(L, \tilde{q}^m_0)$ that issues an access request to the
shared resource $R$, the platform is TDMA-compositional if and only if
• an alternative execution
$exec'(L, \tilde{q}^m_0) = (\tilde{q}^m_1, \ldots, \tilde{q}^m_i, \ldots, \tilde{q}^m_j)$ can be
constructed such that the number of cycles for which $R$ continuously stays in state
“Blocked” (cf. Figure 5.2) after $\tilde{q}^m_i$ is longer in $exec'(L, \tilde{q}^m_0)$ than in
$exec(L, \tilde{q}^m_0)$ and
• the change of length of the suffix
$(\tilde{q}^m_i, \ldots, \tilde{q}^m_j) \subseteq exec'(L, \tilde{q}^m_0)$ compared to the suffix
$(\tilde{q}^m_i, \ldots, \tilde{q}^m_k) \subseteq exec(L, \tilde{q}^m_0)$ is exclusively caused by
$R$, i.e., the number of cycles that are spent in the “Blocked” state. In particular,
all states from $(\tilde{q}^m_i, \ldots, \tilde{q}^m_k)$ in which no resource access is performed
must be found in the suffix $(\tilde{q}^m_i, \ldots, \tilde{q}^m_j)$, too.
More intuitively, TDMA-compositionality means that we can artificially delay
the granting of an access and the only changes to the system timing that follow from
this extra delay are changes of the arbitration delays of following accesses. Cycles
which were devoted to computation or other pipeline actions are not affected. Unlike
the concept of timing compositionality mentioned in Section 2.2.9,
TDMA-compositionality does not imply that the bus is analyzed in isolation from the
other parts of the system. TDMA-compositionality merely eases the bus analysis as we
will see in the following, but is otherwise unrelated to timing compositionality.
A typical example of a TDMA-compositional platform is a platform with strict
in-order pipelines where the pipeline is stalled until an instruction is completed. In
this case, the following actions of the pipeline are not affected by the extra delay
introduced into the arbitration, except for those which try to access the resource
again. Our ARM7TDMI-based system from Section 3.4 is an example of such a
platform.
Counterexamples are superscalar out-of-order processors in which resource accesses
may overlap in time with other computations. In such a case, the following
instructions may be executed in parallel to the delayed resource access, thus
contradicting the last sentence of Definition 24.
The above example for TDMA-compositionality was also a timing-anomaly-free
system. An example of a system which is TDMA-compositional but not free from
timing anomalies is a platform with in-order pipelines, caches and branch predic-
tion. Reineke et al. [RWT+06] have shown this combination to suﬀer from timing
anomalies. Still, the system is TDMA-compositional as long as strict in-order cores
are used.
In the following, we will show how TDMA-compositionality can be used to en-
hance the precision of the oﬀset analyses for TDMA-compositional systems.
Lemma 1. (Oﬀset Relocation Lemma) Given a context graph GCτ = (V Cτ ,ECτ )
and a path P through GCτ , two executions of the path P , one starting at TDMA
oﬀset o1 and the other starting at TDMA oﬀset o2 with the initial state of all other
hardware components being identical between the two executions, will lead to a dif-
ference in execution time of at most nlls cycles between the two execution scenarios
if a TDMA-compositional platform is used.
Proof. The execution of the path P is deﬁned as a sequence of states (q˜m0 , . . . , q˜mk ).
We transform this into a sequence of events S = (s0, . . . , sk) such that all contiguous
groups of states which model the processing of an access to the shared resource are
represented by a single si and all contiguous states which do not model a shared
resource access are likewise represented by a sj . Thus, each s ∈ S is either an access
to the shared resource or a block of local computations. We therefore deﬁne a set
of accesses A and a set of “processing” blocks B such that ∀s ∈ S ∶ s ∈ A ⊕ s ∈ B,
where ⊕ is the logical exclusive-or. For each access si ∈ A, we deﬁne the time from
the access request to the access grant as δi and the time that it takes to perform
the access as γi. Similarly, we deﬁne the runtime of each si ∈ B as αi. Due to
the TDMA-compositionality we know that αi and γi are constant among the two
execution scenarios. Thus, what changes between the scenarios are the δi values but
not the execution times of local computations or the resource accesses themselves.
To simplify our computations below, we allow si ∈ B to have length 0, that is
αi = 0. In this way, we can ensure that each pair of accesses is separated by a block
of computation, though this block may have length 0
(∀j ∈ {0, . . . , k − 1} ∶ sj ∈ A ⇔ sj+1 ∈ B). Without loss of generality, we
assume s0 ∈ A.
The arbitration delay δj that an access sj ∈ A incurs will change among the
two execution scenarios. The arbitration delay that is incurred by access sj in the
execution scenario starting at offset o1 (o2) will be denoted by $\delta^1_j$
($\delta^2_j$), respectively. In the same way, we will refer to the absolute point in
time at which the access request is issued in the two scenarios as $\beta^1_j$ and
$\beta^2_j$. What we would like to prove now is:
\[ \forall j \in \{0, 2, 4, \ldots\} : |\beta^1_j - \beta^2_j| \le n_l l_s \tag{5.19} \]
Note that the Lemma is equivalent to ∣β1k−β2k ∣ ≤ nlls where k is the last execution
event. For sk ∈ A, Equation 5.19 with j = k is equivalent to the Lemma, and for
sk ∈ B, the Lemma follows from Equation 5.19 with j = k − 1 and the constant
runtime of sk. Therefore, it is suﬃcient to prove that Equation 5.19 holds.
To do this, we ﬁrst deﬁne the dependencies among the β and δ values, which are
as follows:
∀j ∈ {0,2,4, . . .} ∶ βj = βj−2 + δj−2 + γj−2 + αj−1 (5.20)
∀j ∈ {0,2,4, . . .} ∶ δj = delayTDMA (βj , sj) + γj mod nlls (5.21)
In Equation 5.20, the access request time for sj is computed as the sum of
• βj−2 – the request time of the last access (in sj−2)
• δj−2 – the arbitration delay that this request incurs
• γj−2 – the access duration itself
• and αj−1, the duration of the consecutive block of computation sj−1
The value for δj is then easily derived in Equation 5.21 from βj with the usual TDMA
arbitration computation via the known delayTDMA function from Equation 5.10.
These dependencies are valid for both execution scenarios since they model the
execution of the path P . The only variability comes from the accesses to the shared
resource, for which the waiting time δj is computed using the known delayTDMA
function. To initialize the computation, we must fulﬁll the conditions
β10 = o1 mod nlls (5.22)
β20 = o2 mod nlls (5.23)
Since we are ﬁnally only interested in the diﬀerence Δj = ∣β1j − β2j ∣, we simply set
β10 ∶= o1 and β20 ∶= o2. We prove Equation 5.19 by induction over the sequence S.
Base case (j = 0): In this case, we have Δ0 = ∣β10 − β20 ∣ = ∣o1 − o2∣ which is less than
nlls by deﬁnition of the TDMA oﬀsets.
Induction step (j − 2 → j): By inserting the definitions of βj into Δj we obtain:
\begin{align}
\Delta_j &= |(\beta^1_{j-2} + \delta^1_{j-2} + \gamma_{j-2} + \alpha_{j-1}) - (\beta^2_{j-2} + \delta^2_{j-2} + \gamma_{j-2} + \alpha_{j-1})| \tag{5.24} \\
&= |(\beta^1_{j-2} + \delta^1_{j-2}) - (\beta^2_{j-2} + \delta^2_{j-2})| \tag{5.25} \\
&= |\xi^1_{j-2} - \xi^2_{j-2}| \tag{5.26}
\end{align}
where ξj is a shorthand for βj + δj, i.e., the time when the access is granted.
Obviously, Δj only depends on the access request times and waiting times of the
preceding resource access. From the induction hypothesis we know that Δj−2 ≤ nlls,
that is, the preceding access is requested in a window of size nlls cycles in both
execution scenarios. What we then have to show is that both accesses are granted in
a new window of size nlls cycles.

Figure 5.7: Illustration of proof scenario for the Relocation Lemma. The time axis
is marked from 0ls to 9ls and carries the case markers 1), 2) and 3).
Figure 5.7 illustrates a scenario from the perspective of the second core in a
conﬁguration with three cores in total. Three TDMA periods are shown in the
Figure. The gray areas are those where an access request from core two will have
to wait, and the white ones are the areas where an access request from core two
will be granted immediately. The hatched area represents the cycles from TBmax − 1
cycles before the slot end until the slot end in which accesses are not granted as
explained in Figure 5.3. Due to the cyclicity of the TDMA schedule, it is suﬃcient
to assume that β1j−2 ∈ [3ls,6ls) and to place β2j−2 at positions which are at most nlls
cycles away from β1j−2. We then have to prove that ∣ (ξ1j−2 − ξ2j−2) ∣ ≤ nlls holds. We
will examine the individual cases with the help of the example, since this is more
intuitive. The notation could nevertheless be generalized to describe the cases in a
fully abstract way – in this case, only the position and size of the white box will
change in all TDMA periods. All three following cases are also marked with their
case number in Figure 5.7:
1) $\beta^1_{j-2} \in [3l_s, 4l_s - 1]$: Here, we have to wait for the core’s slot, thus
$\xi^1_{j-2} = 4l_s$. If $\beta^2_{j-2}$ is in another gray area, then $\xi^2_{j-2}$ will be
either $1l_s$, $4l_s$ or $7l_s$. If $\beta^2_{j-2}$ is in a white area, then it follows that
$\xi^2_{j-2} \in [1l_s, 2l_s - (T^B_{max} - 1)] \cup [4l_s, 5l_s - (T^B_{max} - 1)]$. In all
cases, $|\xi^1_{j-2} - \xi^2_{j-2}| \le n_l l_s$ holds.
2) $\beta^1_{j-2} \in [4l_s, 5l_s - (T^B_{max} - 1)]$: Since the access can be granted
immediately here, we have $\xi^1_{j-2} \in [4l_s, 5l_s - (T^B_{max} - 1)]$. If
$\beta^2_{j-2}$ is in a gray area, then $\xi^2_{j-2}$ will be either $4l_s$ or $7l_s$. In
this case, $|\xi^1_{j-2} - \xi^2_{j-2}| \le n_l l_s$ directly holds. For the case that
$\beta^2_{j-2}$ is in a white area, the Lemma follows from the induction hypothesis
that $|\beta^1_{j-2} - \beta^2_{j-2}| \le n_l l_s$, since in this case
$\xi_{j-2} = \beta_{j-2}$.
3) $\beta^1_{j-2} \in [5l_s + (T^B_{max} - 1), 6l_s - 1]$: This case is symmetrical to the
first one.
Since in all these cases $|\xi^1_{j-2} - \xi^2_{j-2}| \le n_l l_s$ holds and we know from
Equation 5.26 that $\Delta_j = |\xi^1_{j-2} - \xi^2_{j-2}|$, the induction step is complete
and the lemma is proven.
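The statement of Lemma 1 can also be checked empirically on small examples. The Python sketch below is not the proof and not the platform's arbiter: it assumes a simple TDMA model in which a request of a core is granted immediately if the access of up to t_b_max cycles still fits into the core's own slot, and otherwise waits for the next own slot. The function names and the sample path are hypothetical.

```python
from itertools import product


def grant_time(t, core, n_l, l_s, t_b_max):
    """Earliest time >= t at which a request of `core` is granted (assumed model)."""
    period = n_l * l_s
    slot_start = core * l_s
    # Grantable offsets: the access must still fit into the core's own slot.
    while not (slot_start <= t % period <= slot_start + l_s - t_b_max):
        t += 1
    return t


def end_time(path, start, core, n_l, l_s, t_b_max):
    """Absolute completion time of a path of accesses and computation blocks."""
    t = start
    for kind, length in path:
        if kind == "access":
            t = grant_time(t, core, n_l, l_s, t_b_max)  # arbitration delay
        t += length  # access duration or computation time
    return t


def max_runtime_gap(path, core, n_l, l_s, t_b_max):
    """Largest execution time difference over all pairs of start offsets."""
    period = n_l * l_s
    runtimes = [end_time(path, o, core, n_l, l_s, t_b_max) - o for o in range(period)]
    return max(abs(a - b) for a, b in product(runtimes, repeat=2))


if __name__ == "__main__":
    sample = [("access", 2), ("compute", 3), ("access", 2),
              ("access", 2), ("compute", 7)]
    print(max_runtime_gap(sample, core=1, n_l=3, l_s=4, t_b_max=2))
```

For the three-core configuration used here, the reported gap never exceeds the TDMA period of 12 cycles, in line with the lemma; the check covers only the assumed model and the given path.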
Lemma 1 provides the background of why we may safely use the offset relocation
from Definition 23 in TDMA-compositional systems. During the analysis, we may
judiciously apply the relocation before line 13 in Algorithm 6. We do not determine
in advance what happens “inside” a context block, i.e., whether it contains shared bus
accesses first or computation cycles first and therefore, we always relocate to offset
0. Also, we must apply the relocation sparingly since every application may incur a
loss of precision. Therefore, we heuristically only apply the relocation to 0 at loop
heads of loops which actually contain shared bus accesses. Whether a loop may
potentially access the bus or not can be verified by inspecting the possible memory
access targets determined by the value analysis. Note that this is only a heuristic;
we could also apply the Lemma at arbitrary other program points.

Figure 5.8: Different scenarios for applying the offset relocation heuristic.
(a) Relocation applied on WCEP. (b) Relocation applied outside of WCEP.
(c) Relocation applied outside of and on WCEP. Each panel shows the nodes v1 to v5.
With the oﬀset relocation, we are no longer over-approximating the set of possi-
ble oﬀsets. Nevertheless, we are still generating safe WCET overestimations and the
analysis is still guaranteed to terminate, which we will both show in the following.
Theorem 3. For a WCET est which is determined by the path analysis using the
ω(v)-values determined by the microarchitectural analysis with oﬀset relocations,
WCET est ≥WCET real holds.
Proof. For cases where no oﬀset relocations were applied, WCET est ≥ WCET real
follows from the correctness arguments presented in Section 4.4.
For cases where relocations were done, Figure 5.8 shows the possible relocation
scenarios. In each sub-ﬁgure, the oﬀset relocation was applied at the gray blocks, all
white blocks have their oﬀsets computed without oﬀset relocation. The solid arrows
mark the WCEP, whereas the dashed arrows represent control-ﬂow edges which are
not part of the WCEP P = (vP1 , vP2 , . . . , vPn ). Note that the WCEP is unveiled by
the path analysis, which runs after the microarchitectural analysis, including the
bus analysis. We consider a single application of the heuristic at a node vA and
distinguish three cases:
• The WCEP contains vA: This case is illustrated in Figure 5.8a with vA = v1.
Since we applied the heuristic, we added a penalty of nlls cycles to vA’s WCET.
Thus, the correctness of the computed WCET follows from Lemma 1.
• The WCEP does not contain vA: This case is shown in Figure 5.8b and
Figure 5.8c with vA = v2. Then, the application of the heuristic may only affect
the WCET results due to merged-in offset information, when there is a path
in the CFG from vA to any node vPw on the WCEP P = (vP1 , vP2 , . . . , vPn ), e.g.,
path (v2, v3) in Figure 5.8b and Figure 5.8c. This can only lead to additional
offsets in the offset set of vPw. Again we have two cases:
– If there is no further node vy ∈ Ptail with Ptail = {vPi ∣w ≤ i ≤ n} ⊆ P
where the heuristic was applied too (e.g. as in Figure 5.8b), then these
additional oﬀsets will only make the WCET results for the blocks in Ptail
less precise, but they do not endanger their safeness. This is guaranteed
since the microarchitectural transfer function, which is then applied for
all blocks in Ptail, is monotonic and thus all original oﬀsets, coming from
blocks on the WCEP, are conserved.
– If such a node vy does exist (e.g. v5 in Figure 5.8c), then the application
of the heuristic renders the oﬀset set with which vy is reached irrelevant,
since the oﬀset result for vy will be relocated to 0 anyways. Therefore,
the correctness of the WCET also follows in this case.
Thus, the safeness of the WCET is retained in all cases.
An example for how the relocation can be used to generate more precise WCET
values can be found in Figure 5.9. The code shown between the dotted horizontal
lines is a loop body that was unrolled once. In this case, the upper half of the figure
shows the analysis of the first iteration and the lower half of the figure shows the
analysis of the following iterations of the loop. The figure shows the outgoing offsets
qB,outv for each block v on the left and right hand side of each arc. In addition, using
the notation of Lemma 1, it shows the arbitration delay δi for each shared resource
access si and the runtime αi for each block of computation si. The analysis run A,
shown in black on the left side, starts with the information that the top block may be
entered with offsets {0, . . . ,9} and obtains a total arbitration delay of 14 + 10(X − 1)
if the relocation heuristic is applied at node s1 and X is the loop bound of the
analyzed loop. In contrast, the analysis run B, shown in gray on the right side,
starts with the more precise offset information {2} and obtains a total arbitration
delay of 12X if the relocation heuristic is not applied. Obviously, analysis run A will
produce a more precise WCET result than analysis run B for all X ≥ 3, even though
it started with less precise offset information.
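The break-even point quoted above follows from comparing the two totals. As a short worked derivation, using only the values 14 + 10(X − 1) for run A and 12X for run B stated in the text, with X being the loop bound:

```latex
\begin{align*}
14 + 10(X - 1) &< 12X \\
4 + 10X &< 12X \\
4 &< 2X \\
X &> 2
\end{align*}
```

For X = 2 both totals equal 24 cycles, so run A is strictly better from X = 3 on, and its advantage grows by 2 cycles with every additional iteration.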
The existing termination guarantee of the analysis was given through the
monotonicity of the microarchitectural transfer functions. Setting the outgoing offset
set to qB,outv = 0 + ω(v) mod nlls as done by the relocation is obviously monotonic in
qB,inv since it does not even depend on qB,inv. Therefore, the termination is also
guaranteed with the offset relocation.
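The monotonicity argument can be illustrated with a toy transfer function. This is a sketch under a strong simplification: the block durations are assumed to be given as fixed numbers, whereas in the real analysis ω(v) is itself the result of the microarchitectural analysis for the relocated offset. The function name and parameters are hypothetical.

```python
def out_offsets(in_offsets, block_durations, period, relocate):
    """Outgoing TDMA offsets of a block, with or without relocation to offset 0."""
    if relocate:
        # The incoming offsets are ignored: the result is constant in in_offsets.
        return {(0 + d) % period for d in block_durations}
    return {(o + d) % period for o in in_offsets for d in block_durations}


if __name__ == "__main__":
    imprecise = set(range(10))
    print(out_offsets(imprecise, [7], 10, relocate=True))
    print(sorted(out_offsets(imprecise, [7], 10, relocate=False)))
```

A function that returns the same value for every input is trivially monotonic, which is the property the termination argument relies on.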
Figure 5.9: An example of the application of the offset relocation in a 2-core system
with slot length ls = 5 and access duration γi = 2. The events s1, s2, s4, s5, s6 and
s8 are accesses (∈ A), s3 and s7 are computation blocks (∈ B) with α3 = α7 = 2.
Arbitration delays shown in the figure:
  Access   Scenario A (with relocation to 0 at s1)   Scenario B (without relocation)
  s1       δ1 = 10                                   δ1 = 0
  s2       δ2 = 0                                    δ2 = 6
  s4       δ4 = 4                                    δ4 = 6
  s5       δ5 = 6                                    δ5 = 0
  s6       δ6 = 0                                    δ6 = 6
  s8       δ8 = 4                                    δ8 = 6

5.4.7 Timing-Anomaly-Free Analysis
Similar to the single-core case, we can use the fact that a platform is free of timing
anomalies to reduce the outgoing microarchitectural states of a context block to
those which represent local worst-case behavior. Following the notation of
Section 4.3 and the example from Figure 4.8, the analysis that uses the
timing-anomaly-freedom (called TA-free in the following) can safely discard all
outgoing states qM ∈ qoutv of a context node v ∈ V Cτ which are not maximizing ω(v)
as defined in Definition 19. In Figure 4.8, this implies that only the lower-right, gray
out-state needs to be retained in qoutv, whereas all others can be dropped. Since every
qM ∈ qoutv may contribute different result offsets, this obviously increases the analysis
precision.
However, this breaks one important prerequisite that was required for any DFA
framework in Definition 5, namely that the transfer functions of the framework must
be monotonic. If we are consistently over-approximating the offset sets as done in
Section 5.4.3, the qB,inv and qB,outv for all blocks v will monotonically grow during
the analysis. If we prune away the local non-worst-case successor states, we no longer
have this guarantee and, therefore, the data-flow results are no longer guaranteed
to converge.
An example for such a behavior is given in Figure 5.10a. We assume that the
block v is reached with a single microarchitectural state qM,in ∈ qM,inv . As shown
in Section 4.3, the microarchitectural analysis, which includes the TDMA oﬀset
112 Chapter 5. Multi-Core WCET Analysis
v
{0}
qB,outv
qB,outv
(a) An example task
executing on core 1.
Core 1 Core 2
0 3 4 7
(b) The TDMA schedule for
the example.
Figure 5.10: An example of non-converging TDMA oﬀset results when naively
exploiting the absence of timing-anomalies.
analysis, then proceeds by applying the transfer function from Equation 4.17 to
qM,in to generate the outgoing data-ﬂow value qM,outv . According to Deﬁnition 17
and Deﬁnition 16 this transfer is done by performing cycle steps on the abstract state
and gathering all reachable result states, as sketched in Figure 4.8. For the example,
further assume that the ﬁrst cycle step in qM,in issues a shared bus request, and that
once the request is granted, it takes 3 cycles to complete. Then, additional 5 cycles
are spent on pipeline computations in v. We further assume that the abstract bus
state, i.e., the TDMA oﬀset set, qB,inv ∈ qM,in is equal to {0} and that the system
has two cores and slot length 4 as shown in Figure 5.10b. The microarchitectural
analysis will repeatedly visit the node v, merge the incoming oﬀset information at
the head of v using Equation 4.23 and recompute the outgoing oﬀsets using the
microarchitectural transfer function as mentioned above. For simplicity, we assume
that during the whole process the pipeline does not show non-deterministic behavior,
i.e., that qM,in = {qM,in} and qM,out = {qM,out} stay valid and only the value of qM,in
and qM,out changes. With q^{B,in}_v and q^{B,out}_v we denote the TDMA bus states which
are part of qM,in and qM,out.
1. In the first iteration, q^{B,in}_v = {0} ∪ ∅ since initially, q^{B,out}_v = ∅. The access
in v is then issued at offset 0, granted at offset 1 and completed at offset 4. This
leads to a runtime of 9 cycles for v, and thus to q^{B,out}_v = {(0 + 9) mod 8} = {1}.
2. Now we have q^{B,in}_v = {0} ∪ {1} = {0,1}. The execution scenario for offset 0
stays the same, but the execution with start offset 1 incurs a 7-cycle delay
before the access can be granted, leading to a runtime of 15 cycles. Thus,
the scenario with offset 1 is the local worst-case and therefore
q^{B,out}_v = {(1 + 15) mod 8} = {0}.
3. In this analysis iteration, we again have q^{B,in}_v = {0} as in step 1. Therefore,
steps 1 and 2 will repeat infinitely often from here on.
Since the data-ﬂow information qB,outv never stabilizes in this example, an anal-
ysis of TDMA oﬀsets which naively exploits timing-anomaly-freedom to cut down
its search space will not terminate in general.
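The oscillation can be reproduced with a small simulation. The following is only a sketch under the assumptions of the example (schedule length 8, core 1 owning offsets 0 to 3, one cycle to issue the request, 3 cycles of bus access, 5 cycles of computation). The names run_block and fixpoint are illustrative and do not stem from the analyzer. The variant without pruning accumulates all result offsets and is included only for contrast.

```python
# Sketch of the example from Figure 5.10; all names are illustrative.
SCHEDULE_LEN = 8     # two cores, slot length 4
SLOT_LEN = 4         # core 1 owns offsets 0..3
ACCESS_CYCLES = 3    # duration of a granted bus access
COMPUTE_CYCLES = 5   # pipeline computations after the access


def run_block(start_offset):
    """Return (runtime, result offset) of block v for one start offset."""
    t = start_offset + 1  # the request is issued in the first cycle step
    # Wait until the whole access fits into the remainder of core 1's slot.
    while (t % SCHEDULE_LEN) + ACCESS_CYCLES > SLOT_LEN:
        t += 1
    end = t + ACCESS_CYCLES + COMPUTE_CYCLES
    return end - start_offset, end % SCHEDULE_LEN


def fixpoint(prune, max_iterations=20):
    """Iterate q_in = {0} | q_out; return (converged, trace of q_out)."""
    q_out = frozenset()
    trace = []
    for _ in range(max_iterations):
        q_in = frozenset({0}) | q_out
        results = [run_block(offset) for offset in sorted(q_in)]
        if prune:
            # Keep only the result offsets of the local worst case.
            worst = max(runtime for runtime, _ in results)
            new_out = frozenset(o for runtime, o in results if runtime == worst)
        else:
            # Monotone variant: result offsets are only ever added.
            new_out = q_out | frozenset(o for _, o in results)
        trace.append(set(new_out))
        if new_out == q_out:
            return True, trace
        q_out = new_out
    return False, trace
```

With pruning, the recorded outgoing offset sets alternate between {1} and {0} until the iteration limit is hit, whereas the accumulating variant stabilizes at {0,1} after three iterations.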
Obviously, these convergence problems only occur due to progressions of oﬀset
information at loop heads as we have seen in the example. All of the approaches
presented in Section 5.4.4, Section 5.4.5 and Section 5.4.6 eliminate this problem.
The unrolling removes all back-edges and the oﬀset contexts and oﬀset relocation
lead to a constant incoming oﬀset at the loop head. Therefore, a partitioned WCET
analysis using TDMA oﬀset analysis and the pruning of local non-worst-case states
is only feasible if either the full unrolling, oﬀset context or oﬀset relocation approach
is used.
5.4.8 Evaluation
In the following, we present an evaluation of the performance of the presented analy-
sis techniques and of the diﬀerent arbitration strategies in comparison to each other.
Parts of this evaluation were published in [KHM+13] and [KFM+11; KFM+14]. In
the two latter publications, the experiments were done for a platform based on
SimpleScalar cores instead of ARM ones.
Due to the lack of standard multicore real-time benchmarks, we chose to exe-
cute independent tasks from the benchmark suites already mentioned in Section 4.5
on the single cores, amounting to 154 ﬂow-fact-annotated, independent benchmark
tasks in total. In the experiments, we grouped together benchmarks with similar
runtime and executed packages with one benchmark per core. The packages were
formed by sorting the benchmarks in the order of their single-core ACET and then
having a window of size 2/4/8 slide over this list, collecting all 153/151/147 possi-
ble combinations. All cores start their assigned task synchronously and execution
ﬁnishes when all tasks have been completed. Thus, since the benchmarks have dif-
ferent runtimes, there will be some amount of inevitable completion time jitter. All
the benchmarks read their inputs and store their outputs in the shared non-cached
RAM, whereas all program code as well as the stack was allocated in the scratchpad
memory of the individual cores. This emulates the (reasonable) scenario that I/O
is done via a shared device, whereas code and local data are kept in local memories
for performance reasons. As in Section 4.5, all benchmarks were compiled without
further compiler optimization (optimization level O0). This is frequently done for
safety-critical code, due to the fear of errors introduced by potentially incorrect com-
piler optimizations. All other parameters were set to their default values according
to Table 3.1.
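The package formation described above can be sketched as follows. This is an illustration only: the function name build_packages and the dictionary of ACET values are assumptions, not part of the original tool flow.

```python
def build_packages(acet_by_benchmark, cores):
    """Slide a window of size `cores` over the benchmarks sorted by ACET."""
    ordered = sorted(acet_by_benchmark, key=acet_by_benchmark.get)
    return [tuple(ordered[i:i + cores])
            for i in range(len(ordered) - cores + 1)]


# Hypothetical ACET values for 154 benchmarks (placeholder data).
acets = {f"bench{i}": 1000 + 7 * i for i in range(154)}
package_counts = [len(build_packages(acets, n)) for n in (2, 4, 8)]
```

For 154 benchmarks, a window of size w yields 154 - w + 1 packages, which matches the 153/151/147 combinations stated above.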
Figure 5.11: Average total bus utilization. [Bar chart; x-axis: number of cores nc /
slot length ls, from (1/3) to (8/6); y-axis: average total utilization, 0% to 60%;
series: PRIO, FAIR, PD, TDMA.]
Concerning the parametrization of the arbitration methods we have selected
simple heuristics to demonstrate some key impacts. For PRIO, the priorities were
assigned such that ti > tj ⇔ pi > pj where ti is the single-core runtime of the task
mapped to core i. We use this strategy, also known as largest job first, here to speed
up long-running tasks and thus to decrease the completion time jitter. For TDMA
(and also for PD) we set the slot size ls = TBmax to keep delay times as small as
possible. Our experiments have shown that higher slot lengths most often impose
both higher WCET and ACET values. Also, for TDMA we set oi = i such that
each core owns a single slot. For PD, each slot i is “owned” by core i by setting
pii = n. Priorities for all other cores are distributed in the same way as for PRIO,
i.e., in the order of single-core task runtime. In the experiments, these values proved
to be good default values. The impact of different configuration options and their
optimization will be further explored in Section 6.1.
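The priority heuristics can be illustrated with a short sketch. It assumes that a larger number means a higher priority, that all single-core runtimes are distinct, and that under PD the non-owning cores receive the ranks 1 to n-1 in the order of their runtimes. The function names are hypothetical and the exact encoding used in the experiments may differ.

```python
def prio_priorities(runtimes):
    """Largest job first: a longer single-core runtime gets a larger priority."""
    order = sorted(range(len(runtimes)), key=lambda core: runtimes[core])
    priorities = [0] * len(runtimes)
    for rank, core in enumerate(order, start=1):
        priorities[core] = rank
    return priorities


def pd_priorities(runtimes):
    """Per-slot priorities for PD: row s holds the priorities in slot s."""
    n = len(runtimes)
    table = []
    for owner in range(n):
        others = sorted((c for c in range(n) if c != owner),
                        key=lambda core: runtimes[core])
        row = [0] * n
        row[owner] = n  # the slot owner gets the highest priority n
        for rank, core in enumerate(others, start=1):
            row[core] = rank  # remaining cores ordered by runtime
        table.append(row)
    return table
```

For the runtimes [300, 100, 200], prio_priorities returns [3, 1, 2], and pd_priorities returns one row per slot in which the slot owner holds priority 3.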
Runtime properties
Before delving into the WCET results, we will examine some runtime aspects of the
benchmarks which are needed to judge the WCET results.
Figure 5.11 shows the geometric mean utilization resulting for different values
of nc and ls. Here, and in the following, the results for FAIR and PRIO are only given
for ls = 3 since they do not use the concept of a slot length at all. As expected, FAIR
and PRIO show superior utilization, since these are work-conserving arbitration
methods, i.e., as long as there are active requests, they do not insert wait cycles.
TDMA shows some increase in utilization with rising nc, but it stagnates at
around 20% due to slots which remain unused by their owners. For nc = 8 the
utilization actually decreases again to below 20%. PD is also not work-conserving,
since it must delay requests when they cannot be served in the current slot, which
may happen frequently for our setting of ls = TBmax. Still, PD shows a linear increase
in utilization, which is twice as high as for TDMA, which is also reﬂected in the
average ACETs of the benchmarks as shown in Figure 5.12.
In general, the ACET per task is inversely proportional to the achieved utiliza-
tion values. The dotted areas in Figure 5.12 show the portion of the ACET which
is used for computation and local memory accesses (stack and program code, see
Section 3.4), the crosshatched areas show the portion in which the task is using the
shared bus, and the areas with vertical bars show the percentage of the ACET in
which the task is waiting for the shared bus. For TDMA it becomes visible that,
e.g., for the conﬁguration with 8 cores, the tasks are on average using more cycles
for waiting than for performing computation and actual memory accesses.
Figure 5.12: Average relative measured execution time (ACET) for different
platforms (Baseline = execution time on single-core platform). [Bar chart; x-axis:
number of cores nc / slot length ls, from (1/3) to (8/6); y-axis: average relative
ACET, 0% to 200%; series: PRIO, FAIR, PD, TDMA.]
Figure 5.13: Average benchmark execution time jitter. [Bar chart; x-axis: number
of cores nc / slot length ls, from (2/3) to (8/6); y-axis: average total jitter, 0% to
80%; series: PRIO, FAIR, PD, TDMA.]
The ACET and utilization values are inﬂuenced by the completion time jitter
of the benchmark packages, that is the length of the time interval between the ﬁrst
termination of a task on any core and the termination of the last task. Especially
for TDMA, the jitter is problematic since it leaves slots of already terminated tasks
unused. Figure 5.13 shows the jitter as a percentage of the total runtime of the
benchmark package (i.e., the runtime of the longest task). It is visible that the low
utilization values for TDMA are to some extent related to the rising jitter, but since
this increase is itself triggered by the usage of TDMA, this is an inherent drawback
of the policy. As an example, in the case of 8 cores, the first task terminates
after 21% of the total benchmark runtime, which leaves the slots of this task
unused for the remaining 79% of the runtime.
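The jitter metric used here can be written down in a few lines. This is a minimal sketch assuming that all tasks of a package start at time 0 and that the completion times are given in cycles. The function name and the numbers are illustrative.

```python
def completion_time_jitter(finish_times):
    """Interval between first and last task completion, relative to the
    total package runtime (the completion time of the longest task)."""
    first, last = min(finish_times), max(finish_times)
    return (last - first) / last


# Hypothetical package: the first task ends after 21% of the total runtime.
jitter = completion_time_jitter([2100, 6000, 10000])
```

In this made-up package the jitter is 79%, i.e., the TDMA slots of the first task remain unused for 79% of the package runtime.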
Figure 5.14: Average relative WCET when all bus accesses show the worst-case
bus behavior. [Bar chart; x-axis: number of cores nc / slot length ls, from (2/3) to
(8/6); y-axis: average relative WCET, 0% to 400%; series: FAIR, PD, TDMA.]
WCET results
In all of the following experiments, we had to restrict the maximum analysis duration
to two hours for a single benchmark, and the maximum memory consumption was
implicitly limited by the fact that the analyzer was compiled in 32-bit mode, i.e.,
only 4 GB of main memory were available. The full loop unrolling approach from
Section 5.4.4 was only able to analyze 30.9% of the benchmarks. In all other cases,
it ran out of either time or memory. Therefore, we restricted our experiments to
those 30.9% of the benchmarks which could also be analyzed with full unrolling.
Due to this, the average WCETs obtained in this chapter are not directly comparable
to those from Section 4.5, since we use a different benchmark base.
We will ﬁrst examine the worst-case behavior of the arbitration policies, since
this will give us one possible baseline to compare against. As in Section 4.5, we
present only relative WCET values, i.e., the WCET divided by the ACET of the
benchmark, since these relative WCETs are a bound on the overestimation of the
analysis. The results for an analysis in which the worst-case bus behavior is assumed
for every single access are shown in Figure 5.14. As expected, the WCETs increase
linearly with the length of the schedule in the TDMA case, and linearly with the
number of cores in the system in the FAIR case. WCET results for PRIO were
not generated, since the task-partitioned analysis can only provide a WCET for the
tasks running on the highest-priority core as detailed in Section 5.4.2. For all other
tasks, the WCET using PRIO is inﬁnite. Therefore, we excluded PRIO from the
WCET comparison.
The results for a WCET analysis using the basic TDMA analysis as shown
in Section 5.4.3 are presented in Figure 5.15. Here, we have not used the fact
that our platform is free of timing anomalies, i.e., this is the TA-prone scenario as
sketched in Section 5.4.7. The results show that the FAIR results are almost equal
to the worst-case results from Figure 5.14. We could not expect better results here,
since the analysis itself is forced to make worst-case assumptions (cf. Equation 5.7).
Figure 5.15: Average relative WCET for different arbitration types with the basic,
TA-prone TDMA analysis. [Bar chart; x-axis: number of cores nc / slot length ls,
from (2/3) to (8/6); y-axis: average relative WCET, 0% to 400%; series: FAIR, PD,
TDMA.]
Figure 5.16: Average relative WCET for advanced TDMA analysis techniques.
[Bar chart; x-axis: number of cores nc / slot length ls, from (2/3) to (8/6); y-axis:
average relative WCET, 0% to 300%; series: PD and TDMA, each with Full
Unrolling, Offset Contexts and Offset Relocation.]
For TDMA, between 10 and 20% of the total overestimation could be avoided in
comparison to the worst-case behavior, but the WCET reduction is not very strong
up to here. This provided the initial motivation to devise the improved TDMA
analysis techniques presented in Section 5.4.4, Section 5.4.5 and Section 5.4.6.
To see how much we can improve on this baseline, we have repeated the PD and
TDMA experiments with the full loop unrolling approach (Section 5.4.4), the offset
context approach (Section 5.4.5) and the offset relocation approach (Section 5.4.6).
In addition, the new experiments also exploit the fact that our architecture is
TA-free. The results are presented in Figure 5.16.
As visible, all three approaches are better than the basic TDMA offset analysis
shown in Figure 5.15, except for the offset relocation, which is worse than the basic
analysis for nc = 8, ls = 6. The best approach in terms of WCET precision is the
full unrolling approach. However, contrary to what the average numbers in
Figure 5.16 suggest, the offset context approach is also better than the unrolling
on some instances. If a loop only contains straight-line code, the full unrolling is
the most precise method to analyze it, since it will exactly capture the behavior of
each iteration. If, on the other hand, a loop contains many if-branches which all
contribute different result offsets, then the offset context approach may be superior.
The reason is that each offset context provides a reset point for the offset information
(cf. lines 11 and 12 in Algorithm 6), whereas the unrolling cannot provide such reset
effects. Table 5.2 shows the results for a 4-core benchmark with the tasks select,
mult-4-4, lms-float and fir for which the offset contexts perform best. The full
unrolling and the transition towards the TA-free analysis yield only a small WCET
reduction. The relocation in this case leads to a WCET which is even higher than
the result with the basic analysis (first line of Table 5.2), which shows that the
penalty-based relocation must be applied very cautiously. In contrast, the offset
context approach reduces the WCET significantly.
Table 5.2: Example for a case where offset contexts yield more precise results than
full unrolling.
Arbiter  TA-free  Relocation  Offset Contexts  Unrolling  WCET    Duration
TDMA     no       no          no               0          19,099  2s
TDMA     no       no          no               ∞          19,099  10s
TDMA     yes      no          no               ∞          18,995  13s
TDMA     yes      no          yes              0          16,226  7s
TDMA     yes      yes         no               0          19,483  2s
Finally, as also visible from Figure 5.16, the offset relocation approach still
performs better than the basic one in all cases except (8/6), but compared to the
other two approaches it shows inferior precision. One reason for this behavior is that
we perform the offset relocation only for those loops which may access the bus (cf.
Section 5.4.6). Unfortunately, we frequently encounter situations, in which the value
analysis cannot determine the target of a memory access, which is then classiﬁed
as potentially accessing any memory address, including those covered by the shared
bus. Therefore, these imprecise value results introduce false positives into our may-
access-the-bus classiﬁcation which leads to costly applications of the oﬀset relocation
in loops where we have no beneﬁt from the relocation at all. Unfortunately, we have
seen in Section 5.4.7 that the oﬀsets at a loop head of any loop which does perform
bus accesses must be relocated to ensure the termination of the analysis. Therefore,
the only way to increase the precision of the oﬀset relocation-based analysis is to
further increase the precision of the value analysis or to ﬁnd some other, better
approximation of the may-access-the-bus classiﬁcation.
In all these experiments, the PD results were virtually equal to those for TDMA
which may be due to our choice of the PD configuration and the benchmarks used.
In [KHM+13], more varied results for PD were found, and also one additional PD
configuration was evaluated.
Table 5.3: Average analysis time per benchmark for a timing-anomaly-free analysis
run.
Platform  Arbiter  Relocation  Offset Contexts  Unrolling  ∅ Duration
(2/3)     FAIR     no          no               0          2s
(2/3)     TDMA     no          no               ∞          514s
(2/3)     TDMA     no          no               0          3s
(2/3)     TDMA     no          yes              0          30s
(2/3)     TDMA     yes         no               0          7s
(4/3)     FAIR     no          no               0          13s
(4/3)     TDMA     no          no               ∞          569s
(4/3)     TDMA     no          no               0          20s
(4/3)     TDMA     no          yes              0          238s
(4/3)     TDMA     yes         no               0          12s
(8/3)     FAIR     no          no               0          21s
(8/3)     TDMA     no          no               ∞          695s
(8/3)     TDMA     no          no               0          37s
(8/3)     TDMA     no          yes              0          398s
(8/3)     TDMA     yes         no               0          27s
From the average analysis durations, shown in Table 5.3, we can see that the
relocation approach is not dominated by the others since it has an analysis runtime
which is comparable to the basic analysis’ runtime. Time results for PD are not
shown, since they are virtually identical to those of TDMA. The second largest
runtime is observed with the oﬀset context approach. The number of oﬀset contexts
of course directly inﬂuences the analysis duration. For each two-fold increase in
schedule length, this number grows by a factor of 2^D where D is the deepest loop
nesting depth in the program. Therefore, the super-linear analysis time growth
observed in Table 5.3 should not be surprising.
As mentioned, the full unrolling approach ran out of time or memory in 69.1%
of the experiments, whereas the oﬀset context approach exceeded the maximum
memory capacity in only 38.9% of the cases. The naive approach and the relocation
had this problem in only 3.6% of the experiments. These cases where the time
bound was reached are included in Table 5.3. Therefore, the runtime values for
both the unrolling and the oﬀset contexts can only be seen as a lower bound on the
real runtime. Especially in the case of the unrolling, it can be expected that the
real runtime would be far higher if the process were given unlimited time.
The results shown in Table 5.3 were collected for a slot size of 3 cycles.
Surprisingly, the runtime of the offset relocation analysis was only 2.7% higher for
a slot size of 6, whereas the runtime for the naive analysis and the offset context
analysis increased by 17% and 169% on average for slot size 6. In all cases, the full
unrolling approach was still the slowest one. The analysis time approximately doubles
on average in the TA-prone case.
5.5 Uniﬁed WCET Analysis for Complex Multi-Cores
Section 5.4 introduced the concept of analyzing each task in isolation for the purpose
of higher analysis speed and independent timing certiﬁcation of tasks. Though these
are strong arguments, we have also seen that an isolated per-task analysis can only
be precise in the case of time-triggered arbitration. In particular, the important class
of state-permeable arbitration policies like FAIR and PRIO (cf. Deﬁnition 20 and
Table 5.1) cannot be analyzed by a partitioned analysis as presented in Section 5.4.
Almost the same applies to shared caches, where only coarse-grained approximations
are possible (cf. Section 5.4.1).
Therefore, this section will present a diﬀerent approach to the microarchitectural
analysis of multi-core systems which explicitly uses and maintains information about
concurrently ongoing events. This will allow us to achieve more precise WCET es-
timations for FAIR arbitration and to analyze PRIO arbitration for the ﬁrst time.
Obviously, it also implies that the analysis eﬀort will rise, because we need to track
every relevant parallel execution scenario. The resulting analysis framework is de-
picted in Figure 5.17. In contrast to the partitioned framework from Figure 5.1,
the microarchitectural analysis explicitly knows all tasks in the system and analyzes
them together as opposed to each task in isolation. In the current state, this com-
bined analysis only extends to the microarchitectural stage, after which the context
block durations ω(v) are extracted from the combined analysis. The path analysis
then works on each task CFG in isolation, as shown in Section 4.4. However, if the
tasks have synchronization statements, the timing of these can only be captured by
a combined path analysis. Since we focus on the microarchitectural analysis, we
ignore the synchronization aspect here, but integrating a combined,
synchronization-aware path analysis would be a good starting point for future work.
The content of this section was previously published in [KM14].
5.5.1 Related Work
The author is not aware of any previous work that deals with a non-partitioned
multi-core WCET analysis. Instead, previous work on multi-core WCET analysis,
as summarized in Section 5.3.1, exclusively uses the per-task analysis approach with
the mentioned drawbacks.
However, the static analysis of parallel software has a long tradition in the area
of formal veriﬁcation. In the following we present a short extract from the existing
body of literature on this topic.
Figure 5.17: Structure of the unified multi-core WCET analysis. [Flow diagram: for
each core 1 to nc, the LLIR of the core passes through context graph construction
and value analysis, yielding the context graphs of the tasks mapped to that core;
all context graphs enter one common microarchitectural analysis, which is followed
by a per-task path analysis that produces the WCET and BCET of each task.]
Analysis of synchronization properties
A ﬁxed-point analysis of Communicating Sequential Processes (CSP) was ﬁrst es-
tablished by Cousot and Cousot in [CC80]. It can prove properties like the absence
of deadlocks or program termination for a subset of CSP programs. These concepts
were later generalized to discrete state transition systems in [CC84] by the same
authors.
Static analysis of the synchronization structure of concurrent programs was ﬁrst
considered by Taylor [Tay83a]. He presents an algorithm which can approximate
which parts of a synchronized program may run in parallel to each other, simi-
lar to what we have done in Algorithm 5. The underlying problem is called the
May-Happen-In-Parallel (MHP) problem. The difference is that he uses
synchronization to bound the possible parallel execution scenarios, whereas we will
use timing information to do the same. Taylor also introduced the notion of a
Parallel Execution Graph (PEG), which is also used in the analysis presented in this
section, though at a far more ﬁne-grained level. The MHP problem has been the
topic of many works, including an approximation of the MHP relation for Java
programs [NAC99] which runs in cubic time, and MHP computations using Java
programs with barriers [KY06].
The exact solution to the MHP problem has been shown to be NP-hard [Tay83b]
even when timing is not taken into account. Even worse, Ramalingam has shown
in [Ram00] that any static analysis problem which requires exactness with respect
to both the call behavior and the synchronization behavior of the tasks is undecidable.
In the seminal work of Valmari [Val89], an eﬃcient algorithm is developed to
explore all relevant interleavings of transitions in a generalized Petri net. It is shown
that many interleavings can be discarded in the analysis, since they lead to the same
terminal states. Therefore, it is suﬃcient to consider a more restricted stubborn set
of transitions. Unfortunately, this method is by construction only applicable in
analyses which examine properties based on the reachability of termination states
like, e.g., detection of deadlocks.
Chow et al. [CI92] combine stubborn sets, virtual coarsening (i.e., program slic-
ing) and abstract interpretation to form a framework for the analysis of C-like shared
memory parallel programs. Unfortunately, only concepts are given, without imple-
mentation or feasibility studies.
Analysis of parallel program semantics
Whereas the previous analyses approximate synchronization-related questions like
MHP or deadlock-freedom, there are also approaches which try to create classical
data-ﬂow analyses for parallel systems, e.g., parallel liveness analysis or parallel
value analysis. Again, we can only present fragments of this ﬁeld of research due to
the amount of literature published on it. To precisely capture the eﬀects of parallel
programs, like in the work of Taylor [Tay83a], a Parallel Execution Graph (PEG)
is needed. As an example, this is demonstrated for the case of Message Passing
Interface (MPI)-Analysis in [GKS+11].
Since the construction of a PEG which captures all relevant concurrent interleav-
ings is computationally expensive, there are also attempts to use a summary-based
technique similar to what we have seen in the handling of shared caches in Sec-
tion 5.4.1. The basic idea is always to ﬁrst compute a single-core result and then
“patch” the results to account for possible parallel modiﬁcations.
For the reaching deﬁnitions bit-vector problem on parallel programs, this idea
was ﬁrst applied by Grunwald and Srinivasan [GS93]. Bit-vector problems are a
subclass of distributive data-ﬂow frameworks (cf. Deﬁnition 5), which are particu-
larly easy to solve. The extension to arbitrary bit-vector problems was completed
by Knoop, Steffen and Vollmer [Vol95; KSV96]. They show that the MOP and
MFP solutions (cf. Section 2.1) are identical for these types of problems when using
the summary-based approach, i.e., that this approach produces precise solutions for
bit-vector data-ﬂow problems on parallel programs. Unfortunately, the microarchi-
tectural analysis is not a bit-vector problem.
A similar approach is the use of a traditional sequential analysis whose results are
ﬁltered or widened by a race detection engine as presented by Chugh et al. [CVJ+08].
Unlike in [KSV96], no guarantees on the quality of the results are given due to
the usage of non-bit-vector domains.
The task scheduler was neglected in all of the works presented up to here. To
overcome this, Miné [Min12] provides a scheduler-aware, path-aware abstract
interpretation semantics of real-time C (without recursion and dynamic memory
allocation). The analysis can detect arithmetic overflows, null pointer errors and
similar run-time errors and is tested on a large avionics benchmark. Like Miné, we
also take the scheduler into account and have an explicitly parallel semantics. In
addition, we use the generated WCET values to further prune the search space.
Figure 5.18: The basic task scheduling model for the unified multi-core analysis.
[Timeline diagram: tasks τ1 to τ4 over a time axis from 0 to tT, with the release
times rτ2 and rτ4 marked.]
Finally, a recent publication from Mittermayr and Blieberger [MB12] examines
the computation of feasible synchronization-aware parallel interleavings. Their ap-
proach focuses on path analysis and is thus complementary to the one presented in
this section.
5.5.2 Task Model
To be able to bound the number of possible parallel execution scenarios, each task τ ∈
Tc (i.e., mapped to core c) must be strictly periodic with a period tτ as exempliﬁed
in Figure 5.18. Periods are given as a time length according to Deﬁnition 21. This
is not a major restriction, since tasks in hard real-time systems are often strictly
periodic as, e.g., in the OSEK Time-Triggered Operating System [OSE01].
In the following sections, we will need a common reference point in time for
all running tasks. Therefore, we ﬁrst require that all tasks τ have the same period
tτ = tT and that each task is executed non-preemptively on a separate core. However,
each task τ may have a diﬀerent release time rτ within the common period. We will
discuss how to lift the restriction to a common period in Section 5.5.8.
5.5.3 Motivating Example
Before starting with the formal speciﬁcations, we brieﬂy sketch the intuition behind
the analysis procedure. Our goal will be to eﬃciently explore all feasible interleavings
of multiple tasks running in parallel. As an example, consider the execution of the
tasks τ1 and τ2 as given by the context graphs in Figure 5.19a and Figure 5.19b under
the assumption that both tasks start concurrently at time 0. For this assumption,
we can ﬁnd all valid parallel execution scenarios from the Parallel Execution Graph
(PEG) shown in Figure 5.19c.
The construction of this graph starts with node ⊥ which represents the state of
the system before the execution begins. Then, we add edges from ⊥ to the nodes
corresponding to the possible starting points for a parallel execution, in this case
only the node AE (the δ-values will be explained below). From these start nodes,
we iteratively simulate cycle steps of the system. To keep our example PEG from
Figure 5.19c sufficiently small, we assume that every context block will take one cycle
to complete. We also assume that blocks B and G hold a shared resource access, i.e.,
if both blocks try to execute concurrently, one of them has to wait for one cycle until
the shared resource is free again. The aim of our analysis is to determine the block
durations ω(v) for each block (cf. Definition 19). In our simplified example, we
already know that ω(v) = 1 for v ∈ {A,C,D,E,F}, whereas ω(v) ∈ {[1,1], [1,2], [2,2]}
for v ∈ {B,G}.
Figure 5.19: Parallel execution graph creation example for two tasks τ1 and τ2,
starting synchronously at time 0. (a) Task τ1 [context blocks A, B, C, D; loop bounds
LB [2,3] and LB [10,10]]. (b) Task τ2 [context blocks E, F, G; loop bounds LB [2,2]
and LB [3,3]]. (c) The final parallel execution graph [legend: scheduling edge], with
the following entry intervals per node:
Node  δ(1)     δ(2)
⊥     [0,0]    [0,0]
AE    [0,2]    [0,1]
BE    [2,3]    [0,1]
AF    [0,2]    [2,4]
BF    [2,3]    [2,4]
AG    [0,2]    [5,5]
BG    [2,3]    [5,5]
CF    [3,13]   [2,4]
CG    [3,13]   [5,5]
DF    [13,14]  [2,4]
DG    [13,14]  [5,5]
C⊤    [3,13]   [0,∞]
D⊤    [13,14]  [0,∞]
⊤⊤    [0,∞]    [0,∞]
Therefore, our initial block AE is terminated after the ﬁrst cycle and the execution
must continue in one of the nodes AE, BE, BF and AF. To generate these successors,
we simply follow all combinations of successor blocks in the tasks’ context graphs.
The loop bounds are not used here. If we continued the graph construction in this
manner, we would end up with a full product graph of the context graphs. However,
we will see in the following that a full product graph is not always needed. When
every core has reached the end of its task, indicated by the “⊤” sign in Figure 5.19c,
we add a back-edge from ⊤⊤ to AE to account for the repeated execution of the tasks
in the cyclic schedule. The ﬁnal PEG contains every possible parallel execution
scenario for each context block in each task. Thus, we can derive ω(v) of a context
block v from the PEG by measuring the length of all traces in the PEG which model
the execution of v. The length of each PEG block is given by the number of cycle
steps that were performed to analyze this block.
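The successor generation for the simplified example can be sketched as a product of the per-task successor sets. The edge lists below are an assumption: they are inferred from the loop bounds and δ-values of the example (self-loops at A, C, E and F), not copied from the figure. The sketch also relies on the simplification that all blocks terminate concurrently after one cycle, and it does not model the back-edge from ⊤⊤ to AE.

```python
from itertools import product

# Assumed context graph edges of the two example tasks ("⊤" = task finished).
SUCC_TAU1 = {"A": ["A", "B"], "B": ["C"], "C": ["C", "D"], "D": ["⊤"]}
SUCC_TAU2 = {"E": ["E", "F"], "F": ["F", "G"], "G": ["⊤"]}


def peg_successors(node):
    """All combinations of successor blocks for a PEG node (u, v)."""
    u, v = node
    return list(product(SUCC_TAU1.get(u, ["⊤"]), SUCC_TAU2.get(v, ["⊤"])))


successors_of_AE = peg_successors(("A", "E"))
```

For the start node AE this yields the four candidates AE, AF, BE and BF named in the text. Pruning of infeasible candidates is a separate step.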
As visible, the PEG in Figure 5.19c is not a full product graph of the graphs from
Figure 5.19a and Figure 5.19b. The construction of the graph has been stopped at
nodes BE, AG, BG, DF and DG. To explain why this was done, and why it is correct,
we need the δ-values and the loop bounds. We deﬁne δ(i) as an interval containing
all points in time, measured from the beginning of the common period tT, at which
a node may be entered on core i. Initially, we set δ(1) = δ(2) = [0,0] for node AE,
since core 1 (2) enters node A (E) at time 0. From here on, every time we visit a
PEG node v in the analysis, we recompute its δ intervals with the help of the path
analysis which computes the length of the shortest and longest paths to the context
blocks in v.
As an example, when we visit node AE the second time, we have already seen that
both blocks A and E complete within one cycle. Therefore, since A can be executed
at most three times and E at most two times (see Figure 5.19a and Figure 5.19b), the
path analysis can infer that any execution of block A must begin in the time frame
δ(1) = [0,2] and similarly any execution of block E must begin within δ(2) = [0,1].
Thus, the path analysis always operates only on the context graphs of the tasks, not
on the PEG. The PEG is only used to compute the ω(v) values that are used by
the path analysis.
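For the simplified example, the entry intervals can be recomputed by hand. The sketch below is not the path analysis from Section 4.4. It assumes that each task is a simple chain of blocks, that every block takes one cycle, and that the loop bounds attach to the blocks A, C, E and F as self-loops (the latter two attachments are inferred from the δ-values). The function name entry_windows is illustrative.

```python
def entry_windows(chain, cycles_per_block=1):
    """Entry interval [earliest, latest] per block of a chain of blocks.

    `chain` lists (name, min_iterations, max_iterations) in execution order.
    """
    windows = {}
    earliest = latest = 0
    for name, lo, hi in chain:
        # The last possible entry is the start of the final iteration.
        windows[name] = (earliest, latest + (hi - 1) * cycles_per_block)
        earliest += lo * cycles_per_block
        latest += hi * cycles_per_block
    return windows


tau1 = entry_windows([("A", 2, 3), ("B", 1, 1), ("C", 10, 10), ("D", 1, 1)])
tau2 = entry_windows([("E", 2, 2), ("F", 3, 3), ("G", 1, 1)])
```

Under these assumptions the computed intervals agree with the δ-values discussed in the text, e.g., [0,2] for A, [2,3] for B and [0,1] for E.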
The path analysis for node BE yields δ(1) = [2,3] (due to the loop at A which
must complete before B) and δ(2) = [0,1]. Here, we can see the application of
the computed δ-values: We can exclude this node from the PEG and thus from
the analysis. Using the δ-values we know that blocks B and E cannot be executed
concurrently because their execution time windows do not overlap. This condition
is called the Block Exclusion Criterion (BEC) and all blocks for which it holds can
be removed from the PEG. In Figure 5.19c, these blocks are marked by a dotted
border. Therefore, we can infer from the ﬁnal PEG that BG is an invalid node, i.e.,
that ω(B) = ω(G) = 1.
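The block exclusion test from this example can be sketched in a few lines of Python. The representation of δ-windows as (lo, hi) tuples, the use of None for a window that is not yet known, and the function name are assumptions made for illustration; they are not taken from the analyzer described here.

```python
from typing import Optional, Sequence, Tuple

Window = Tuple[int, int]  # closed interval [lo, hi], measured in cycles


def block_exclusion_holds(windows: Sequence[Optional[Window]]) -> bool:
    """True if all per-core windows are known and share no common point in time."""
    if any(w is None for w in windows):
        return False  # at least one window is unknown: the test is not applicable
    latest_begin = max(w[0] for w in windows)
    earliest_end = min(w[1] for w in windows)
    return latest_begin > earliest_end  # empty intersection


# Node BE from the example: delta(1) = [2,3], delta(2) = [0,1]
be_excluded = block_exclusion_holds([(2, 3), (0, 1)])
# Node AE: delta(1) = [0,2], delta(2) = [0,1]
ae_excluded = block_exclusion_holds([(0, 2), (0, 1)])
```

With the windows from the text, BE is excluded because [2,3] and [0,1] do not overlap, whereas AE is kept.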
Following this basic idea, the next sections will show how to generate a PEG and
how to use timing information to prune parts of it for increased WCET precision
and analysis speed. When lifting the simpliﬁcations made in this example, there
are many points that we will have to clarify, some of the most important being
how to deal with the case when blocks do not terminate concurrently and how to
incorporate system state information into this analysis scheme. We will cover these
aspects in the next sections.
5.5.4 Prerequisites
To precisely deﬁne our analysis procedure, we will need some terminology which is
introduced in the following.
Definition 25. Given a task τ ∈ Tc, a Task Execution Position (TEP) ψτ ∈ Ψτ is a
tuple (v, i, c, d), where v ∈ V^C_τ is a context block, i ∈ v is an instruction within that
basic block and c is the number of cycles that were already spent on the processing
of this instruction. Finally, d is the number of cycles that the task must wait until
its execution begins. The set of all TEPs of task τ is called Ψτ.
Definition 26. A System Execution Position (SEP) ψ on n cores is an n-tuple
ψ ∈ Ψ = ⨉_{i=1}^{n} (Ψτi ∪ {⊤}), with τi being the task mapped to core i. The special token ⊤
indicates that the respective core is currently idle. The set of all SEPs is
called Ψ.
The motivation for this definition is that, unlike in our introductory example
from Figure 5.19, real basic blocks will contain more than one instruction³, each of
which may take multiple cycles to complete. Still, we need to be able to split the
execution of each basic block into chunks which may be as small as a single CPU
cycle, as we will see in the following. We will use SEPs to specify the point at which
the execution is resumed in a PEG block. Therefore SEPs correspond to the block
labels from Figure 5.19c (e.g., AE, BE, AF, etc).
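As a minimal sketch of Definitions 25 and 26, a TEP can be modeled as a four-field record and a SEP as a tuple with one entry per core. The type names, the string stand-in for the idle token ⊤ and the helper `active_cores` are hypothetical and only serve to illustrate the data layout.

```python
from typing import NamedTuple, Set, Tuple, Union


class TEP(NamedTuple):
    block: str   # context block v
    instr: int   # index of instruction i within the block
    cycles: int  # cycles c already spent on this instruction
    delay: int   # cycles d until the task is released


IDLE = "T"  # stand-in for the idle token ⊤
SEP = Tuple[Union[TEP, str], ...]  # one entry per core


def active_cores(sep: SEP) -> Set[int]:
    """Cores (numbered from 1) that are not idle and have no remaining delay."""
    return {k for k, pos in enumerate(sep, start=1)
            if pos != IDLE and pos.delay == 0}


# Core 1 starts block A immediately, core 2 starts block E after 3 cycles.
start: SEP = (TEP("A", 0, 0, 0), TEP("E", 0, 0, 3))
```

In this sketch only core 1 would take part in the first cycle step, since core 2 still has a release delay of three cycles.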
Definition 27. Abstract Parallel System States (APSSs) Σ ∈ X = 2^((QP)^nc × QE) are
a generalization of abstract microarchitectural states m ∈ M = 2^(QP × QE) (cf. Definition 17)
which may contain more than one pipeline state.
The environment state QE contains both the state of the memory hierarchy
elements that are private to the cores as well as the state of shared memory hierarchy
elements. We give more detail on how to form proper APSSs in Section 5.5.6.
Similarly to Definition 16, we can define the parallel execution of a set of programs
based on concrete parallel system states and generalize this execution to
abstract states along the lines of Equation 4.17. We require the existence of a
monotonic cycle step function ξX ∶ X × Ψ × 2^{1,...,nc} → ({0,1}^nc × X). The invocation
of ξX(Σ, ψ, α) must simulate all possible state transfers that may happen when a
single clock cycle is executed at position ψ in system state Σ. However, only the
cores in the set α ⊆ {1, . . . , ∣T∣} may perform a cycle step, to be able to account for
different release times. For any instruction completion vector c ∈ {0,1}^n which may
occur in this cycle, it must specify the result state, where c defines for each core
whether it has retired its current instruction (1) or not (0), in analogy to Definition 16.
The “current instruction” is always given by the “program counter” register
value.
³ In the example, we have not even differentiated between basic blocks and instructions.
Definition 28. A Parallel Execution Graph GP = (VP, EP) is a directed graph with
node set VP ⊆ Ψ ∪ {⊥} and edge set EP ⊆ VP × VP. ⊥ is a special PEG node which is
exclusively used to model the situation that the execution of the parallel system has
not yet started. For any PEG, we define
• a block time window mapping δ ∶ VP → (Itime)^nc,
• an edge state mapping λ ∶ EP → X, and
• a block length mapping ωP ∶ VP → N.
Itime = {[x, y] ⊂ 2^N0 ∣ x ≤ y} is the set of all discrete execution time intervals, measured
in cycles from the last point where all cores were synchronized.
The edge set EP = E^cf_P ∪ E^sched_P is partitioned into a set of control-flow edges E^cf_P
and a set of scheduling edges E^sched_P.
The time window function will be used to rule out infeasible SEPs as indicated
in Figure 5.19c. The edge state function is employed for the propagation of the
possible hardware states from one PEG node to the other and the block length
function speciﬁes how many cycles were spent on the execution of a PEG node. The
three mappings are not deﬁned a priori. They will be computed by the algorithms
presented in the following. The partitioning of the edges is needed, since also in
our example from Figure 5.19c we have two types of edges. The edge (⊤⊤,AE) is a
scheduling edge which means that an unspeciﬁed amount of time may pass until the
transition from ⊤⊤ to AE is completed. All other edges are control-ﬂow edges which
have the semantics of an immediate transition, i.e., no time passes when taking a
control-ﬂow edge.
Note that the PEG block runtime ωP must not be confused with the context
block runtime ω as defined in Definition 19. To disambiguate the two symbols, we
refer to the latter as ωC in the following.
In Section 5.5.5, we will ﬁrst examine how to construct a PEG for a given task
set and with a given abstract cycle step function ξX. We also show how to extract
the ωC-values for each context block from the ﬁnal PEG (cf. Deﬁnition 19). In
Section 5.5.6, we will then take a brief look at how to properly model the abstract
parallel system states X and their cycle step ξX.
5.5.5 Parallel Execution Graph Construction
The outline of the main analysis is shown in Algorithm 7. Here and in the following,
we use ()(i) to access the i-th element of a tuple.
The analysis starts by initializing the context block runtimes ωC in line 2
and the work-list Q in line 3. According to the system schedule, the initial SEP
consists of the beginning of the start block of each task (v^start_τi) with a delay of ri cycles.
Algorithm 7 PEG-driven parallelism analysis.
 1: function ParallelismAnalysis(Σstart, G^C_τ1, ..., G^C_τn)
 2:   ∀τ ∶ ∀v ∈ V^C_τ ∶ ωC(v) = ∅ ▷ Initialize all context block runtimes to ∅
 3:   Q ← (v^start_τ1, 0, 0, r1) × ⋅⋅⋅ × (v^start_τn, 0, 0, rn) ▷ Initialize work queue
 4:   GP ← (Q ∪ {⊥}, ∅)
 5:   δ(⊥) ← [0,0]^nc, ωP(⊥) ← ∞
 6:   ∀ψ ∈ Q ∶ δ(ψ) ← {∅}^nc, λ((⊥, ψ)) ← Σstart, ωP(ψ) ← ∞
 7:   while Q ≠ ∅ do ▷ Main loop
 8:     ψ = PopFront(Q) ▷ Analyze next block
 9:     ωC ← GatherNewBBTraces(ψ, GP, ωP, ωC) ▷ Update ωC
10:     for i ∈ {1, . . . , nc} do ▷ Update δ-window for all cores
11:       δ(ψ)(i) ← ⋃_{(ψ′,ψ) ∈ E^cf_P} δ(ψ′)(i) + ωP(ψ′)
12:       if IsLoopHeadOrExit(ψ(i)) then
13:         δ(ψ)(i) = ri + PathAnalysis(ψ(i), G^C_τi, ωC)
14:     if ∀i ∈ {1,...,nc} ∶ δ(ψ)(i) ≠ ∅ ∧ ⋂_{i=1}^{nc} δ(ψ)(i) = ∅ then ▷ If BEC holds ...
15:       continue ▷ ... skip the current block ψ ...
16:     else ▷ ... else analyze it.
17:       λprev ← λ, GP,prev ← GP
18:       (GP, λ, ωP) ← AnalyzeBlock(ψ, GP, λ, ωP)
19:       if λprev ≠ λ ∨ GP,prev ≠ GP then ▷ If graph or states were altered ...
20:         ∀(ψ, ψ′) ∈ EP ∶ PushBack(Q, ψ′) ▷ ... propagate the changes.
21:       if EP,prev ≠ EP then ▷ If edges were added ...
22:         ∀ψ ↝ ψ′ ∶ PushBack(Q, ψ′) ▷ ... propagate δ-changes
23:   return ωC
Here and in the following, every new SEP gets an initially empty execution time window
δ and an infinite runtime ωP. We also create a virtual edge (⊥, ψ) pointing to the
initial SEP, which is assigned the initial APSS Σstart ∈ X. The start block ⊥ itself
has a runtime of zero cycles and executes in the start window [0,0] to mark that
the schedule starts here. Then, we process items from the queue Q until it is
empty (line 7). In the main loop, we extract the first block ψ from the queue. After
the new SEP ψ is extracted from Q, we ﬁrst check whether ψ models the end of a
context block v on any core in the call to GatherNewBBTraces in line 9. For
any such context block v, its runtime ωC(v) is updated in GatherNewBBTraces
as shown in Algorithm 8.
In line 11 of Algorithm 7, we infer the block time window for all task positions
ψ(i) ∈ ψ from the windows and runtimes of its control-flow predecessors. If ψ is part
of a sequential block chain, the δ-update in line 11 is sufficient. On the other hand, if
ψ is a loop head (like A in Figure 5.19a) or a loop exit⁴ (like B in Figure 5.19b), then
we have to take the loop bounds into account to determine the block time window,
like we have done in the computation of δ(1) in, e.g., AE and BE in Figure 5.19c.
This is done in line 13, where the existing path analysis of our framework is used
to compute the shortest and the longest path from v^start_τi to ψ(i). The path analysis
may fail, because not all loop blocks have been analyzed yet. In this case, an empty set
is returned, which also sets δ(ψ)(i) = ∅.
⁴ A successor of a loop head which is not part of the loop.
Algorithm 8 Update of basic block runtimes.
 1: function GatherNewBBTraces(ψ, GP, ωP, ωC)
 2:   for i ∈ {1, . . . , nc}, (ψ′, ψ) ∈ EP do
 3:     if ψ(i)(1) ≠ ψ′(i)(1) then ▷ If ψ is a context block start on core i, ...
 4:       vpred = ψ′(i)(1) ▷ ... collect the length of all paths to starts of vpred.
 5:       ωC(vpred) ← ωC(vpred) ∪ TraceToStarts(vpred, ψ′, i, GP, ωP)
 6:   return ωC
 7: function TraceToStarts(vτ, ψ, i, GP, ωP)
 8:   if ψ(i) = (vτ, i0, 0, 0) then ▷ If ψ(i) is a begin of vτ, ...
 9:     return ωP(ψ) ▷ ... finish this trace.
10:   else ▷ Else continue with the recursion.
11:     return ⋃_{(ψ′,ψ) ∈ EP} {ωP(ψ) + TraceToStarts(vτ, ψ′, i, GP, ωP)}
The δ values are used in line 14, where we try to apply the block exclusion
criterion by intersecting all block time windows. However, this test can only be
applied if the time windows for each task could already be determined, i.e., if they
are not empty. If the intersection is empty, ψ cannot be reached from its current
predecessors and we skip its analysis in line 15. This is exactly what we have done
with BE in Figure 5.19c. Still, we may need to analyze ψ in the future when it
becomes accessible via new edges. Then, we will re-check whether our exclusion
criterion still holds. Thus, this skipping is eﬀectively either postponing or avoiding
the graph growth at ψ.
If the exclusion criterion does not hold (line 16), we analyze the parallel execution
block (PEB) beginning at node ψ (line 18). This analysis will determine a block
runtime ωP(ψ), an output APSS for all out-edges of ψ and will possibly alter GP.
If the output states or the graph are changed, we push the successors of ψ into the
work-list at line 20. By doing this, all changes to the block time windows δ, edge
states λ and block runtimes ωP will be propagated through the graph. Finally, if we
have added edges to the PEG, we also push all blocks ψ′ which are reachable from
ψ into Q (line 22), to ensure that a new attempt to compute δ(ψ′) is started, if
ψ′ is a loop head or exit. The algorithm terminates when no more edges are added
and all edge states have converged.
All in all, Algorithm 7 is a standard data-ﬂow analysis work-list algorithm, with
the diﬀerence that we are dynamically expanding (line 18) the underlying graph
GP . However, we will show in the proof of correctness, that the convergence of the
PEG is still guaranteed. When ParallelismAnalysis has ﬁnished, all reachable
blocks of all tasks will have been visited in one or more parallel execution blocks,
i.e., ωC will contain the possible execution time for each reachable context block of
each task.
For the path analysis PathAnalysis(ψ(i),GCτi ,GP , ωC) we are using an adapted
IPET, as presented in Section 4.4, with some modiﬁcations to account for the facts
that
1. we do not yet know the ﬁnal context block durations ωC(v), and
2. we are not searching for the longest (shortest) path through the whole program
but only for the longest (shortest) path to the current context block ψ(i).
To deal with problem 1, PathAnalysis simply uses the preliminary values of
ωC as determined by Algorithm 8. If any block v ∈ G^C_τi with v ↝ ψ(i) is not yet
covered in GP, then the initial value ωC(v) = ∅ is still present. The path analysis
will then also return ∅ for the path length to ψ(i).
Problem 2 requires some minor changes to the IPET, too. In the original IPET,
we added edges from all terminal blocks v ∈ δA⊤ (ν⊤(f⊥τ )) to the virtual sink v−. Since
we want to determine the longest (shortest) path to ψ(i) now, we no longer add
these edges. Instead we add edges to v− from all v ∈ δ−(ψ(i)), i.e., all predecessors
of the current task execution position. We call all of these predecessors explicit
sinks of the resulting IPET. Each of the explicit sinks can be used to terminate the
ﬂow through the IPET by directing its ﬂow towards v−. Thus, each of the explicit
sinks can also be used to exit from a function call sequence or one or multiple
loops in which ψ(i) is nested. Therefore, Equation 4.45 must be adapted to account
for the possibility that aﬀected functions are left via the explicit sink edge and
not via the return edge. Similarly, the lower loop bound in Equation 4.47 must
be dropped for all loops which are not post-dominated by all explicit sinks, since
the minimum iteration count only applies if the loop is completely executed before
ψ(i) is reached. The same problem arises for flow restrictions, but unfortunately
we cannot determine whether a given ﬂow restriction stays valid in the longest-
path-to-ψ(i) problem. Therefore, our modiﬁed path analysis rejects tasks with ﬂow
restrictions. These are not needed in most real-time benchmarks, and it is only a
restriction of our path analysis, not of the parallelism analysis itself.
The repeated building and solving of the modiﬁed IPETs is not an ideal solution,
since basically we are just re-computing shortest and longest paths to all nodes of
the context graph under changing node weights. Advanced single-source all-sinks
analyses [KFM13] would be much better suited to solve this problem more eﬃciently.
Since the IPET was already available, we nevertheless used it as a preliminary
solution to avoid the implementation of a new type of path analysis.
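To illustrate what the path analysis has to deliver, the following Python sketch computes the earliest and latest start time of a target block by a plain traversal of an acyclic context graph. This is deliberately not the modified IPET described above and also not the analysis from [KFM13]; it ignores loops and loop bounds altogether, and the function name and the dictionary-based graph encoding are assumptions for illustration only.

```python
from functools import lru_cache


def path_window(target, preds, omega_c):
    """(earliest, latest) start time of `target`, or None if some block
    on a path to it has no runtime yet (its omega_c entry is empty)."""

    @lru_cache(maxsize=None)
    def start_window(v):
        if not preds.get(v):
            return (0, 0)                      # entry block starts at time 0
        lo, hi = None, None
        for p in preds[v]:
            w = start_window(p)
            if w is None or not omega_c.get(p):
                return None                    # predecessor not analyzed yet
            a = w[0] + min(omega_c[p])         # shortest way through p
            b = w[1] + max(omega_c[p])         # longest way through p
            lo = a if lo is None else min(lo, a)
            hi = b if hi is None else max(hi, b)
        return (lo, hi)

    return start_window(target)


# Diamond-shaped context graph: A -> B -> D and A -> C -> D.
preds = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
window_d = path_window("D", preds, {"A": {1}, "B": {2, 3}, "C": {5}})
unknown_d = path_window("D", preds, {"A": {1}, "B": {2, 3}, "C": set()})
```

The second call mirrors the behavior described for problem 1: as long as block C has no runtime, no window can be reported for D.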
Single-Block Analysis
To complete the view on the analysis, Algorithm 9 shows the function AnalyzeBlock,
which is invoked in Algorithm 7. First, the incoming APSSs are joined in
line 2. The current system execution position ψrun is initialized to ψin (remember
that VP ⊆ Ψ) and the block duration ωP (ψin) is set to zero. Then, we simulate the
eﬀect of successive system cycle steps on ψrun and Σrun, until on any core, either
a) the end of a basic block is reached or
b) the successor SEP is ambiguous.
The latter happens, when it is uncertain in APSS Σrun whether the current instruc-
tion of at least one core will complete or not. In this case, we track all completion
combinations in separate successor blocks.
The ﬁrst step in each cycle is to invoke the APSS cycle step function ξX, which is
done in line 5, but only for those cores with zero delay cycles (set α). The APSS cycle
step function ξX returns a mapping κ ⊆ {0,1}nc × X, i.e., it associates instruction
completion vectors to successor APSSs. Line 6 checks the two block termination
conditions a) and b) mentioned above. The helper function φαc ∶ Ψ → Ψ generates
the successor SEP for a given SEP ψ, instruction completion vector c and active
core set α. We deﬁne it as
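\[
\phi^{\alpha}_{c}(\psi) = \bigtimes_{\substack{k \in \{1,\dots,n_c\}:\\ \psi^{(k)} = (v,i,c,d)}}
\begin{cases}
(v, i, c, d-1) & \text{if } k \notin \alpha \\
(v, i, c+1, 0) & \text{if } k \in \alpha,\ c^{(k)} = 0 \\
\phi'(v, i) & \text{if } k \in \alpha,\ c^{(k)} = 1
\end{cases}
\tag{5.27}
\]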
where φ′(v, i) is the set of all task execution positions “after” instruction i, i.e.,
either
• the next instruction in v, if there is any, or
• the ﬁrst instruction of all blocks w with (v,w) ∈ ECτk , if such edges do exist, or
• the terminal symbol ⊤, if ψ(k) ≠⊤, or
• all initial PEG nodes if ψ(k) =⊤.
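The successor computation of Equation 5.27 can be sketched as follows. The sketch makes several simplifying assumptions that are not part of the definition above: cores are indexed from 0, positions are plain (v, i, c, d) tuples, the block lengths and block successors are passed in as dictionaries, and an idle core simply stays idle instead of restarting at the initial PEG nodes.

```python
from itertools import product

IDLE = "T"  # stand-in for the idle token ⊤


def successor_seps(sep, completed, active, block_len, succ_blocks):
    """All successor SEPs of `sep` for one instruction completion vector."""
    per_core = []
    for k, pos in enumerate(sep):
        if pos == IDLE:
            options = [IDLE]                    # simplification: stay idle
        else:
            v, i, c, d = pos
            if k not in active:
                options = [(v, i, c, d - 1)]    # still waiting for its release
            elif completed[k] == 0:
                options = [(v, i, c + 1, 0)]    # instruction needs another cycle
            elif i + 1 < block_len[v]:
                options = [(v, i + 1, 0, 0)]    # next instruction in v
            elif succ_blocks.get(v):
                options = [(w, 0, 0, 0) for w in succ_blocks[v]]
            else:
                options = [IDLE]                # no successor block: task ends
        per_core.append(options)
    return set(product(*per_core))              # cross product over all cores


# Core 0 retires the only instruction of loop block A, core 1 is still delayed.
sep = (("A", 0, 1, 0), ("E", 0, 0, 2))
succs = successor_seps(sep, completed=(1, 0), active={0},
                       block_len={"A": 1, "E": 1},
                       succ_blocks={"A": ["A", "B"]})
```

Because A has two successor blocks, the cross product yields two successor SEPs, which is one way in which a PEG node can obtain several out-edges.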
If neither a basic block end is reached, nor the successor SEP is ambiguous, we
take over the results of the cycle step as our new working SEP ψrun and APSS Σrun
in line 7 and increment the cycle counter for this block in line 8. Here, ψ(i)(1)run is the
basic block executed by core i, κ(1)(1) is the ﬁrst instruction completion vector and
κ(1)(2) is its associated successor APSS.
If the block end is detected, we terminate the current block as shown from line 9
on. It will be one invariant of our analysis that the length of a block can only stay
the same or be reduced in successive analyses of the same block. Therefore, we only
check in line 10, whether the block has been shortened. This may happen due to
a newly joined-in APSS, that triggers an earlier ambiguous successor SEP. In this
case, we remove all previous out-edges of the current block ψin (line 11). In any case,
we add for each instruction completion vector c an out-edge to φαc(ψrun) which gets
annotated with the respective out-state Σc (lines 13–16). In the end, the modified
graph, edge states and block lengths are returned in line 18.
Algorithm 9 PEG block analysis.
 1: function AnalyzeBlock(ψin, GP, λ, ωP)
 2:   Σrun ← ⊔_{∀e=(ψ′,ψin) ∈ EP} λ(e) ▷ Join incoming states
 3:   ψrun ← ψin, ωP,prev ← ωP, ωP(ψin) ← 0
 4:   while true do
 5:     κ ← ξX(Σrun, ψrun, α = {i ∣ ψ(i)run = (⋅, ⋅, ⋅, 0)}) ▷ Simulate next cycle
 6:     if ∣κ∣ = 1 ∧ ∄i ∶ (φ^α_{κ(1)(1)}(ψrun))(i)(1) ≠ ψ(i)(1)run then ▷ Split/Block end?
 7:       Σrun ← κ(1)(2), ψrun ← φ^α_{κ(1)(1)}(ψrun) ▷ If not, prepare next cycle
 8:       ωP(ψin) ← ωP(ψin) + 1
 9:     else ▷ Else terminate the current block
10:       if ωP(ψin) < ωP,prev(ψin) then ▷ If the block shrank, ...
11:         EP ← EP ∖ {(ψin, ψ′) ∈ EP} ▷ ... remove old edges
12:       for (c → Σc) ∈ κ do ▷ Add new successors and out-states
13:         VP ← VP ∪ {ψnew = φ^α_c(ψrun)}
14:         δ(ψnew) ← {∅}^nc, ωP(ψnew) ← ∞
15:         EP ← EP ∪ {enew = (ψin, ψnew)}
16:         λ(enew) ← Σc
17:       break
18:   return (GP, λ, ωP) ▷ Return all modifications
With Algorithm 9 we have completed the macroscopic side of the analysis. In the
next section, we will examine the microscopic perspective, namely how to eﬃciently
represent abstract parallel system states.
5.5.6 Parallel System States
As stated in Deﬁnition 27, an APSS is an abstraction of the microarchitectural
state of a parallel machine. Therefore, we can use all of the models that were in-
troduced for modeling pipelines (Section 4.3.1), caches (Section 4.3.2) and buses
(Section 5.4.3). However, unlike in the task-partitioned case (Section 5.4), we
do not need to account for possible parallel interference by either using cache
interference summaries or time-triggered bus schedules. Instead, we can now explicitly
analyze concurrently ongoing actions, since the PEG encodes which program parts
are currently executed in parallel. This allows us to also analyze the PRIO arbi-
tration, which was not possible in the task-partitioned case, and to derive tighter
WCET estimates for FAIR arbitration and shared caches.
The main difference compared to the single-core case is that an APSS Σ contains
an abstract state for nc cores instead of for only one core. Therefore, each element
σ ∈ Σ is a tuple σ ∈ (QP)^nc × QE. The rationale behind Σ being a set of tuples is, as
in the single-core case, that we need this mechanism to track different, alternative
microarchitectural behaviors, e.g., in the case of an access to an unknown memory
address.
In every cycle step, i.e., every invocation of ξX, we perform the cycle step on
each contained sub-state tuple σ ∈ Σ with ξσ ∶ σ × ψ × 2^{1,...,n} → ({0,1}^n × X). ξσ
then simply applies a cycle step to each contained pipeline and memory hierarchy
element state. As we have already seen in the single-core case, the abstract finite
state machines behind these states are non-deterministic, therefore multiple successors
may be generated. This is reflected in the choice of X as the target domain
of ξσ. Finally, each successor state σ′ is labeled with the instruction completion
information that was emitted by the pipeline models during the transition to σ′.
These sub-results are collected for all σ ∈ Σ, and those successor states with the
same completion vector are grouped together. This merged result is then returned
by ξX.
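The grouping of successor states by completion vector can be sketched as follows. The toy sub-state model (a tuple of remaining cycles per core, where None stands for an access whose latency is either 1 or 3 cycles) and all function names are assumptions made for this illustration; real pipeline and memory hierarchy states are far richer.

```python
from collections import defaultdict
from itertools import product


def cycle_step_apss(apss, step_substate):
    """Apply one cycle to every sub-state and group successors by completion vector."""
    grouped = defaultdict(set)
    for sigma in apss:
        for completion, sigma_next in step_substate(sigma):
            grouped[completion].add(sigma_next)
    return {c: frozenset(s) for c, s in grouped.items()}


def toy_step(sigma):
    """Toy sub-state step: decrement the remaining cycles of every core."""
    choices = [[1, 3] if r is None else [r] for r in sigma]  # None: unknown latency
    for remaining in product(*choices):
        nxt = tuple(r - 1 for r in remaining)
        done = tuple(1 if r == 0 else 0 for r in nxt)         # retired this cycle?
        yield done, nxt


apss = {(1, None), (1, 2)}
kappa = cycle_step_apss(apss, toy_step)
```

Here the unknown latency on core 2 produces two different completion vectors, so the result has more than one entry; in Algorithm 9 this is exactly the situation in which the current PEG block is terminated and split.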
Up to here, we have not yet exploited the explicit encoding of parallelism in
the PEG. This is only done if one or more cores access a shared resource. In this
case, the requests arrive at the shared bus which is responsible for arbitrating them.
The parallelism-aware abstract bus model is almost the same as the one shown in
Figure 5.2, with the important diﬀerence that
• the input is not a single access r but a set of accesses R, and
• the transitions from “Arbitrate” to the “Blocked” and “Forward” states are only
enabled for every r ∈ arb(qB,R).
The function arb(qB,R) must decide which of the access requests in R may be
granted, based on the current bus state qB. If the information in qB is imprecise,
arb(qB,R) = R is possible, and in the case of TDMA also arb(qB,R) = ∅ is possible
if all requests will deﬁnitely not be granted in the current cycle. If ∣arb(qB,R)∣ > 1,
then every grant possibility must be explored in a separate successor state.
Arbitration functions
For a TDMA bus, the state information qB is a TDMA oﬀset set as presented in
Section 5.4.3. For TDMA, the arb-function is formed based on the delay as
arbTDMA(qB,R) = {r ∈ R ∣ 0 ∈ delayTDMA(qB, r)} (5.28)
A real advantage can be gained only for FAIR and PRIO arbitration. In contrast
to TDMA, both are work-conserving arbitration methods, i.e., they do not generate
idle cycles as long as there are pending requests. Therefore, the delayPRIO and
delayFAIR are always equal to zero in the parallelism-aware analysis. The delay for
those accesses which are blocked by another one is implicitly generated, since
• there will only be a single winner in an arbitration cycle, and
• contending requests which were not granted the bus must be re-issued until
they are ﬁnally granted.
[Figure 5.20: An example PEG block ψ with attached APSS Σ and details on the
contained state tuple σ2. The figure shows block ψ with the edges e1, e2 and e3 and
its state Σrun = {σ1, σ2}; σ2 contains Shared Bus: “Arbitrate”, rc: none, qBFAIR: {2},
Core 1: “Read”, Core 2: “Write”.]
For FAIR, qB over-approximates the cores which last accessed the bus, i.e.,
BFAIR = 2^{1,...,nc} (5.29)
arbFAIR(qB, R) = {r ∈ R ∣ ∃cp ∈ qB ∶ ∀r2 ∈ R ∶ c(r) − cp ≤_(mod nc) c(r2) − cp} (5.30)
In Equation 5.30, c(r) yields the number of the core which issued request r.
Thus, arbFAIR(qB,R) returns all those requests which belong to a core that may be
the next one in the round-robin schedule. The “last-access” information in qB is
only updated in the “Forward” states from Figure 5.2. For these states, we deﬁne
updateFAIR(qB) = {c(rc)} (5.31)
where rc is the currently granted access as deﬁned in Figure 5.2.
The arbitration function for PRIO is even simpler, since we do not need to
maintain any state information here. Instead, we can always perform the arbitration
based on the priorities p(r) of the requests r.
BPRIO = {−} (5.32)
arbPRIO(−,R) = {r ∈ R ∣ ∀r2 ∈ R ∶ p(r) ≥ p(r2)} (5.33)
An example for these states is illustrated in Figure 5.20, where a PEG block
ψ is shown with incoming and outgoing edges e1, e2 and e3. The state Σrun for
this block (see Algorithm 9) holds two sub-states, of which σ2 is presented in more
detail. In this sub-state, the two cores in this example are currently performing a
memory read and a memory write operation. The bus state qBFAIR shows that the last
access has deﬁnitely been carried out by core 2. Assuming that both accesses hit
the shared bus in this cycle, we know based on qB that arbFAIR({2},{r1, r2}) = {r1}
with c(r1) = 1 and c(r2) = 2.
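The FAIR and PRIO arbitration functions can be sketched in Python as shown below. The Request record and the function names are hypothetical. The round-robin distance is computed as (c(r) − cp − 1) mod nc, which is an assumption chosen so that the core that accessed the bus last is served last, in agreement with the worked example above; Equation 5.30 itself only states the comparison modulo nc.

```python
from typing import NamedTuple


class Request(NamedTuple):
    core: int      # c(r), cores are numbered 1..nc
    priority: int  # p(r), a larger value wins


def arb_fair(q_b, requests, nc):
    """Requests that may be granted for at least one possible last owner in q_b."""
    if not requests:
        return set()

    def distance(core, last):
        # Cores following `last` come first, `last` itself comes last.
        return (core - last - 1) % nc

    winners = set()
    for last in q_b:
        best = min(distance(r.core, last) for r in requests)
        winners |= {r for r in requests if distance(r.core, last) == best}
    return winners


def arb_prio(requests):
    """Requests with the highest priority among all pending requests."""
    if not requests:
        return set()
    top = max(r.priority for r in requests)
    return {r for r in requests if r.priority == top}


r1, r2 = Request(core=1, priority=0), Request(core=2, priority=5)
fair_precise = arb_fair({2}, {r1, r2}, nc=2)       # last owner known: core 2
fair_imprecise = arb_fair({1, 2}, {r1, r2}, nc=2)  # last owner unknown
prio = arb_prio({r1, r2})
```

With an imprecise bus state both requests may win, so both grant possibilities would have to be explored in separate successor states.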
Since the PEG already carries the burden of constructing all possible interleaving
scenarios, we can formulate the arbitration analysis in a rather simple manner here.
By construction, this has not been possible for the standard per-core WCET analysis
approach.
Shared caches are immediately analyzable with this framework and the origi-
nal cache domain from Section 4.3.2. Every possible order of accesses is implic-
itly explored by the PEG construction and the arbitration analysis. Therefore, no
summary-based approach as in Section 5.4.1 is needed.
5.5.7 Correctness
In the following, we use G^i_P, λ^i, ω^i_P and δ^i to denote the PEG and the values of
the three functions after the i-th iteration of the main loop of Algorithm 7. Also, we
denote the PEG node ψ that is analyzed in iteration i as ψi. The special iteration
number 0 is used to denote the state before the ﬁrst iteration of the main loop. We
ﬁrst show, that with rising analysis iteration count, for each PEG node the block
runtime will only shrink, the incoming APSS will only get more imprecise and the
execution time intervals for each task execution position will only become wider.
Lemma 2. For any iteration j of the main loop of the parallelism analysis (line 7
in Algorithm 7), any iteration i ≤ j and any SEP ψ ∈ G^i_P, the following invariants
hold:
1. ∀ψ′ ∈ V^i_P ∶ ψ′ ↝_{G^i_P} ψ ⇒ ψ′ ↝_{G^j_P} ψ,
2. ∀ψ′ ∈ V^i_P ∶ λ^i((ψ′, ψ)) ⊑ λ^j((ψ′, ψ)),
3. ω^i_P(ψ) ≥ ω^j_P(ψ), and
4. ∀k ∈ {1, . . . , n} ∶ δ^i(ψ)(k) ⊆ δ^j(ψ)(k).
Condition 1 codiﬁes that any existing path in the PEG of iteration i is retained in
the future versions of the PEG. Therefore, we call this condition the structural con-
dition in the following. The conditions 2, 3, and 4 claim that the development of the
three adjunct properties is monotonic and are therefore called property conditions.
Proof. We show the lemma by induction over j. For brevity of notation we denote
the incoming APSS at block ψ as λin(ψ), i.e.,
\[
\lambda^{j}_{in}(\psi) = \bigsqcup_{(\psi',\psi) \in E^{j-1}_P} \lambda^{j-1}((\psi',\psi)) \tag{5.34}
\]
Induction Base (j = 1): In this case, we know the initial values ω^0_P(ψ1) = ∞,
λ^0_in(ψ1) = Σstart and δ^0(ψ1) = {∅}^nc from line 6 of Algorithm 7. Thus, condition 3
trivially holds, and λ^0(ψ1) will always contain the value Σstart since it is contributed
via the static edge (⊥, ψ1), which satisfies condition 2. Since EP is initially empty,
it can only grow in the first iteration, which implies condition 1. The δ^1(ψ1)-value
is either ∅ or [0,0] depending on whether ψ1 is a loop head or not. Therefore, also
condition 4 holds.
[Figure 5.21: Example for block shortening during the PEG construction.
Left: ψ1 ∶ Σrun = {σ1}, ψ2 ∶ Σrun = {σ2}, ψ3 ∶ Σrun = {σ3}.
Right: ψ1 ∶ Σrun = {σ1, σ4}, ψ2 ∶ Σrun = {σ2}, ψ3 ∶ Σrun = {σ3}, ψ4 ∶ Σrun = {σ5}.]
Induction Step (j − 1 → j) : The structural condition and the property condi-
tions depend on each other. We will need the induction hypothesis of the property
conditions to prove the induction step of the structural condition and vice versa.
Concerning condition 1, we know from the induction hypothesis that ψ′ ↝_{G^{j−1}_P} ψ
holds. We must then show that it also holds in G^j_P. Iteration j can only alter GP in
a limited way, namely by removing or adding edges in lines 11 and 13 of Algorithm 9.
The incoming APSS λ^j_in(ψ) are given through Σrun in line 2 of Algorithm 9. With
the induction hypothesis of condition 2, it follows that
\[
\lambda^{j}_{in}(\psi) = \bigsqcup_{(\psi',\psi) \in E^{j-1}_P} \lambda^{j-1}((\psi',\psi))
\sqsupseteq \bigsqcup_{(\psi',\psi) \in E^{j-1}_P} \lambda^{i}((\psi',\psi))
\sqsupseteq \lambda^{i}_{in}(\psi) \tag{5.35}
\]
Due to the monotonicity of the cycle step function ξX, λ^j_in(ψ) ⊒ λ^i_in(ψ) implies that
\begin{align}
\xi_X(\lambda^{j}_{in}(\psi)) &\sqsupseteq \xi_X(\lambda^{i}_{in}(\psi)) \tag{5.36} \\
\xi_X(\xi_X(\lambda^{j}_{in}(\psi))) &\sqsupseteq \xi_X(\xi_X(\lambda^{i}_{in}(\psi))) \tag{5.37} \\
&\;\;\vdots \tag{5.38}
\end{align}
Therefore, every successor of ψ which was reachable in iteration i must also be
reachable in iteration j. An example is given in Figure 5.21, where the result of
iteration j − 1 is shown on the left side and the result of iteration j is shown on
the right side, with λ^{j−1}_in(ψ) = {σ1} and λ^j_in(ψ) = {σ1, σ4}, respectively. Thus, even
if the block length shrinks, as shown in the example for ψ1, the application of the
transfer function ξX will still generate at least one transition to the original successor
SEPs (ψ2 in the example). Therefore, the edge removal in line 11 of Algorithm 9
cannot destroy any previously existing path, which completes the induction step for
condition 1.
For the induction step of condition 2 we must show that
∀e ∈ E^i_P ∶ λ^i(e) ⊑ λ^j(e) (5.39)
under the hypothesis that
∀e ∈ E^i_P ∶ λ^i(e) ⊑ λ^{j−1}(e) (5.40)
For an edge e = (ψ2, ψ3), λ^j(e) is computed in Algorithm 9 through one or more
applications of ξX on the incoming microarchitectural state λ^j_in(ψ2) as shown in
Equation 5.41. We use ξ^+_X to denote one or more applications of ξX. Since ξX is
monotonic, Equation 5.42 follows with the help of Equation 5.40. Equation 5.43
then follows from the induction hypothesis of condition 1, which completes the
induction step for condition 2.
\begin{align}
\lambda^{j}((\psi_2,\psi_3)) &= \xi^{+}_{X}\Bigl(\bigsqcup_{(\psi_1,\psi_2) \in E^{j-1}_P} \lambda^{j-1}((\psi_1,\psi_2))\Bigr) \tag{5.41} \\
&\sqsupseteq \xi^{+}_{X}\Bigl(\bigsqcup_{(\psi_1,\psi_2) \in E^{j-1}_P} \lambda^{i-1}((\psi_1,\psi_2))\Bigr) \tag{5.42} \\
&\sqsupseteq \xi^{+}_{X}\Bigl(\bigsqcup_{(\psi_1,\psi_2) \in E^{i}_P} \lambda^{i-1}((\psi_1,\psi_2))\Bigr) \tag{5.43} \\
&= \lambda^{i}((\psi_2,\psi_3)) \tag{5.44}
\end{align}
The monotonic growth of λ(e) and thus also of λin(ψ) directly causes the mono-
tonic decrease of ωP (ψ). To see why this is the case, it is important to bear in
mind, that λin(ψ) encodes the possible initial hardware states when executing SEP
ψ. If iteration i led to a runtime of ω^i_P(ψ) cycles, then there must be a σi ∈ λ^i_in(ψ)
which caused this runtime through a basic block end or ambiguous instruction
completion. With the induction hypothesis of condition 2, we know that there must be
a σj ∈ λ^j_in(ψ) with σi ⊑ σj. Therefore, σj will also cover the execution paths derived
from σi, which implies that the basic block end or ambiguous instruction completion
which caused the runtime ω^i_P(ψ) is still reachable in σj. Due to σi ⊑ σj, there may
also be new execution paths for σj which were not possible with σi. These may
trigger an even earlier occurrence of a basic block end or ambiguous instruction
completion, therefore ω^j_P(ψ) ≤ ω^i_P(ψ).
Finally, we have to prove the induction step of condition 4. The δ-values are
computed by the path analysis based on the context block runtimes as determined
by Algorithm 8. By construction, ω^i_C(vτ) ⊆ ω^j_C(vτ), since we never remove elements
from ωC at all. Additional traces for vτ may exist in G^j_P, which leads to the possibility
of ω^i_C(vτ) being not equal to ω^j_C(vτ). With the monotonic growth of the context
block durations ωC(vτ), it is obvious that also the path lengths in δ can only grow,
which proves the last condition.
With the help of Lemma 2 we can then show that Algorithm 7 terminates. The
main loop only continues as long as either the PEG or the data-ﬂow information in
λ changes. The addition of new blocks and edges is limited, since there are only
ﬁnitely many diﬀerent SEPs. According to Lemma 2, edges are also never really
removed but only replaced. Therefore, in the worst-case, the PEG becomes a full
cycle-level product graph of the diﬀerent context graphs. For the data-ﬂow values
λ(ψ) ∈ X we can then apply the usual argument that the lattice X has finite height;
thus, in the worst case, the λ-values converge in a finite number of steps towards
⊺ ∈ X.
Theorem 4. The context block runtimes ωC as returned by Algorithm 7 are safe
over-approximations of the concrete block runtimes in any possible parallel execution
scenario.
Proof. Any possible concrete, periodic task set execution can be modeled as an
infinite sequence S = (ψ0, ψ1, ψ2, . . . ) of SEPs where each ψi models the effect of
a new cycle step. For each ψi ∈ S there is a corresponding concrete hardware state Σˆi.
Due to the construction of the PEG, ψ0 must be one of the initial PEG blocks and by
prerequisite Σˆ0 must be contained in γ(Σstart), i.e., the concretization of the abstract
start state. Since the PEG is constructed according to the SEP transitions dictated
by the cycle step function ξX, and this cycle step function is a safe abstraction of
the concrete cycle step, the sequence S must be an infinite path through the PEG
if and only if there is no block ψi ∈ S for which the block exclusion criterion was
applied in every analysis iteration.
If we assume that ψi ∈ S is the ﬁrst SEP in S where the BEC was applied in every
analysis iteration of ψi, then we can easily show that this leads to a contradiction.
Since the BEC was not applied in all analysis iterations for each of the blocks in
Si−1 = (ψ0, ψ1, . . . , ψi−1), these SEPs must be contained in the PEG. Through the
correctness of ξX and Σˆ0 ∈ γ(Σstart) we also know that Σˆj ∈ γ(λin(ψj)) for all
j ∈ {0, . . . , i − 1}, i.e., there must be an edge from ψi−1 to ψi in the PEG. Likewise,
after Si−1 is covered in the PEG, the block runtimes ωP will cover the concrete
runtimes of this path and therefore also ωC will cover the runtimes of the context
graph nodes in this path. Therefore, since ψi is actually reachable in the concrete
execution S, the δ-intersection then cannot be empty at ψi, which implies that the
BEC cannot be applied at ψi.
Therefore, any concrete parallel execution, and thus any concrete context block
runtime, is covered in the PEG.
5.5.8 Extensions
Topologically sorted work queue. To increase the eﬀectiveness of the BEC we
have to make sure that each PEG block fulﬁlls the condition ∀k∈{1,...,nc}δi(ψ)(k) ≠ ∅
as early as possible. Therefore, we perform a topological sort of the tasks’ context
graphs which induces an order ≤topo∶ Ψτ ×Ψτ → {true,false} on the task execution
positions ψτ ∈ Ψτ of the respective core. From this, we can derive a topological order
of system execution positions as
$\psi_1 \leq_{topo} \psi_2 \Leftrightarrow \forall i \in \{1, \dots, n_c\} : \psi_1^{(i)} \leq_{topo} \psi_2^{(i)}$    (5.45)
This order is used to sort the work-list before extracting a new SEP in line 8 of
Algorithm 7.
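The ordering from Equation 5.45 can be illustrated with a minimal sketch. It assumes that every task execution position has already been mapped to its integer index in the topological sort of the respective context graph; the names `sep_leq_topo` and `sort_work_list` are hypothetical and not part of the analyzer. Since the component-wise order is only partial, the sketch sorts by one possible linear extension of it.

```python
from typing import List, Sequence, Tuple

# Assumption: a SEP is a tuple that holds, for each core, the topological
# index of that core's current task execution position.
SEP = Tuple[int, ...]


def sep_leq_topo(psi1: SEP, psi2: SEP) -> bool:
    """Equation 5.45: psi1 <=_topo psi2 iff the relation holds on every core."""
    return all(a <= b for a, b in zip(psi1, psi2))


def sort_work_list(work_list: Sequence[SEP]) -> List[SEP]:
    """Sort the work list consistently with <=_topo.

    If psi1 <=_topo psi2 and psi1 != psi2, then the index sum of psi1 is
    strictly smaller, so sorting by the sum is a linear extension of the
    partial order. Ties are broken lexicographically to get a stable result.
    """
    return sorted(work_list, key=lambda psi: (sum(psi), psi))


ordered = sort_work_list([(2, 1), (0, 0), (1, 2), (1, 0)])
```

In this ordering, no SEP is extracted before another SEP that precedes it on all cores, which is the property the work queue needs.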
Exploiting timing-anomaly-free architectures. If the underlying architecture
is guaranteed to be free of timing anomalies (cf. Section 2.2.8), then in each block
analysis (Algorithm 9, line 5) we can skip all instruction completion vectors c1 ∈ κ
which are dominated by another vector c2 ∈ κ, i.e., $c_1 \prec_c c_2 \Leftrightarrow \forall i \in \{1, \dots, n\} : c_2^{(i)} \Rightarrow c_1^{(i)}$.
The dominated vectors correspond to an earlier termination of an instruction.
Since every local worst-case action is always also the global worst-case action in a
timing-anomaly-free architecture, we can assume that they are never part of the
worst-case path. This can drastically reduce the state space and the PEG size.
Unfortunately, it also invalidates Lemma 2, since some paths in the PEG will
actually be removed now. This may lead to non-termination of the analysis, similar
to what we have shown in Section 5.4.7.
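The pruning of dominated completion vectors can be sketched as follows. This is only an illustration under the assumption that a completion vector is a tuple of booleans (one entry per instruction in flight); `is_dominated` and `prune_dominated` are hypothetical names. The relation is taken from the definition above, with the additional convention that a vector does not dominate itself.

```python
from typing import Iterable, List, Tuple

CompletionVector = Tuple[bool, ...]


def is_dominated(c1: CompletionVector, c2: CompletionVector) -> bool:
    """c1 is dominated by c2 iff every instruction that completes in c2
    also completes in c1 (c2[i] implies c1[i]). Equal vectors are excluded
    here so that a vector cannot remove itself."""
    return c1 != c2 and all((not b2) or b1 for b1, b2 in zip(c1, c2))


def prune_dominated(kappa: Iterable[CompletionVector]) -> List[CompletionVector]:
    """Keep only the vectors that are not dominated by any other vector."""
    vectors = list(dict.fromkeys(kappa))  # drop duplicates, keep order
    return [c for c in vectors if not any(is_dominated(c, d) for d in vectors)]


kept = prune_dominated([(True, False), (False, True), (True, True)])
```

In the example, the vector in which both instructions complete is dropped, because each of the other two vectors represents a later completion of one instruction.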
Explicit synchronization. In task sets with explicit synchronization points, we
have to consider these points in the path analysis as shown in [PP13]. In addition,
we can also use them to prune the PEG as we have done in Section 5.5.5, since a task
which is waiting for synchronization cannot progress until a partner has arrived to
complete the rendez-vous. This idea has already been used in [Tay83a] and, as
there, it can be applied on top of the timing information to further prune the PEG.
Non-uniform periods. The extension of our framework to task sets with non-uniform
periods is also possible. With non-uniform task periods, we can still compute
the global hyperperiod, i.e., the least common multiple of all task periods,
and build a PEG for this hyperperiod. In this case, the task execution position must
also contain the current position in the core’s cyclic schedule. The biggest problem
is then to perform the cycle step for a SEP ψ with ∃i ∶ ψ(i) =⊤ but ψ ≠⊤nc . For any
such core i, we need to determine whether the next task execution position for i may
be still ⊤ or the begin of the next task instance Jnext on that core. To determine
this, we need information about the current position in the system schedule.
For this purpose we can employ the δ-values. However, since we compute these
values based on the local CFG structure we will have δ(ψ)(i) = [0,∞] if ψ(i) = ⊤.
Due to ψ ≠⊤nc , we have at least one core j on which a task is still running. Therefore,
we can determine the current time window as δnow = ⋂j∈{1,...,nc} δ(ψ)(j) just as in
the block exclusion criterion. If the next task instance starts at time tnext, we know
that ⊤ is a possible successor if δnow ∖ {t ∣ t > tnext} ≠ ∅ and Jnext is a possible
successor if tnext ∈ δnow.
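The two successor conditions can be made concrete with a small sketch. It assumes that every δ(ψ)(j) is a single closed interval [lo, hi] of cycle counts (with hi possibly infinite); the function names and the string labels "idle" (for ⊤) and "J_next" are purely illustrative.

```python
import math
from typing import Optional, Sequence, Set, Tuple

Interval = Tuple[float, float]  # closed interval [lo, hi]; hi may be math.inf


def intersect(windows: Sequence[Interval]) -> Optional[Interval]:
    """delta_now: intersection of the per-core time windows, or None if empty."""
    lo = max(w[0] for w in windows)
    hi = min(w[1] for w in windows)
    return (lo, hi) if lo <= hi else None


def possible_successors(delta: Sequence[Interval], t_next: float) -> Set[str]:
    """Successor candidates for an idle core, following the two conditions."""
    now = intersect(delta)
    if now is None:
        return set()  # the SEP itself is infeasible
    lo, hi = now
    successors = set()
    if lo <= t_next:        # delta_now without {t | t > t_next} is non-empty
        successors.add("idle")
    if lo <= t_next <= hi:  # t_next lies inside delta_now
        successors.add("J_next")
    return successors


both = possible_successors([(0, math.inf), (90, 120)], t_next=100)
```

The idle core contributes the window [0, ∞], so the intersection is determined by the cores that are still running, as described above.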
Still, this yields a lot of possible spawn points for the next task instance. The
problem can be further curtailed if synchronization structures are taken into account
or better approximations for δ(ψ)(i) with ψ(i) =⊤ can be found.
In contrast, SEPs with ψ =⊤nc are not problematic. All successors of these nodes
are reached via scheduling edges, not normal control-ﬂow edges. Therefore, we can
simply create successors for these SEPs corresponding to the task instances which
spawn next according to the global schedule.
[Scatter plot: x-axis "PEG Block Count Reduction" (0% to 100%); y-axis "Analysis Time Reduction" (0% to 100%); series 2|FAIR, 2|PRIO, 4|FAIR, 4|PRIO]
Figure 5.22: Eﬃciency of the block exclusion criterion on example benchmarks for
varying number of cores and arbitration policies. The solid line is a
linear regression of the data points.
Multiple-issue processors. We can easily account for multiple-issue cores which
can complete multiple instructions per cycle. In this case, the instruction com-
pletion vectors c ∈ {0,1}n are replaced by instruction completion count vectors
c ∈ {0, . . . , cmax}n, where cmax is the maximum number of instructions that can be
completed in one cycle. All algorithms must be adapted to this change, which results
in a possibly enlarged PEG.
5.5.9 Evaluation
We ran our evaluations on single-core tasks from the MRTC and DSPStone real-
time benchmark suites. Due to the restrictions of our IPET-based path analysis (cf.
page 130) we excluded benchmarks with an irreducible loop structure which require
ﬂow restrictions to bound the WCET. Out of these single-core tasks, we formed 35
task sets of size 2 and 6 task sets of size 4. All tasks were assigned a release time
of 0. As we will see in the following, the parallelism analysis proved to be very
time-consuming, therefore bigger task sets could not be analyzed in the given time
limit of two hours per analysis run. Like before, we compiled the benchmarks with
optimization level O0, i.e., without optimizations. We analyzed the system topology
from Section 3.4 with 2 or 4 cores, depending on the task set. In the evaluation, we
ﬁrst focus on analyzing state-permeable bus arbitration methods (PRIO and FAIR)
which were not analyzable (PRIO) or not precisely analyzable (FAIR) without the
presented parallelism analysis. The bus which is arbitrated by these methods is the
shared memory bus introduced in Section 3.4.
In Figure 5.22, the results of our block exclusion criterion (BEC) from Algo-
rithm 7, line 14 are shown. Each mark represents one analysis run on one task set.
Table 5.4: Average analysis time and PEG sizes.
Schedule   Analysis Type   ∅ Duration   ∅ PEBs
FAIR       C|N             1s           0
FAIR       P|O|N           389s         963
FAIR       P|B|N           161s         728
FAIR       C|T             10s          0
FAIR       P|O|T           693s         5,455
FAIR       P|B|T           509s         4,046
PRIO       P|O|N           503s         983
PRIO       P|B|N           71s          728
PRIO       P|O|T           776s         5,298
PRIO       P|B|T           322s         3,961
The circle marks indicate runs where the shared bus was conﬁgured for FAIR arbi-
tration, the triangles correspond to ﬁxed priority-based arbitration and the squares
correspond to TDMA. Non-ﬁlled (ﬁlled) marks are analysis runs with the 2-core
(4-core) system. The x-axis value is the number of PEG blocks that are generated
during the analysis, when the BEC is used compared to the case when it is not
used (100%). On the y-axis, the required analysis time is shown, also compared to
the case that the BEC was not used (100%). From the data points and the solid
regression curve, it is visible that the analysis time scales roughly linearly with the
number of PEG blocks, which was expected, since the runtime of the main loop in
Algorithm 7 depends on the total number of blocks. The variations stem from the
convergence behavior of the individual benchmarks, i.e., how often loops have to be
visited until the attached APSSs converge. More importantly, we can see from
Figure 5.22 that the BEC is effective: averaged over all data points, it reduces
the PEG block count by 35.6% and the analysis time by 49.7%.
To limit the evaluation runtime we set up a deadline of two hours for each
individual analysis run. Our current implementation completed the analysis of the
35 smallest dual-core and the 6 smallest four-core benchmarks in that time frame.
It could not complete bigger task sets (8 cores) or sets with bigger task CFGs.
Therefore, we limited our evaluation to the aforementioned benchmarks only.
The average resulting analysis time for the benchmarks is presented in Table 5.4.
The column “Analysis Type” shows which type of WCET analysis was tested. We
compare the classical multi-core WCET analysis as presented in Section 5.4 (abbr.
“C”) to our new parallelism analysis with (abbr. “P|B”) and without (abbr. “P|O”)
usage of the block exclusion criterion. The last element of the “Analysis Type” col-
umn shows whether the absence of timing-anomalies on our platform was exploited
by the analysis (abbr. “N”) or not (abbr. “T”).
[Bar chart: x-axis "Analysis Type" (P|O|T, P|B|T, P|O|N, P|B|N, C|T, C|N); y-axis "Avg. Relative WCET" (100% to 140%); series PRIO and FAIR]
Figure 5.23: Relative WCET results.
As already seen in Figure 5.22, “P|B”
is always superior to “P|O” but both are slower than the classical approach “C” by a
factor of 106 or 229, respectively. This is a result of the more complex system state
and of the thousands of parallel interleavings that have to be explored, whereas
the classical analysis only operates on the CFG of a single task and the state of
a single core. For a WCET analysis these runtimes are still acceptable, though.
Even mature tools like aiT need 12 hours per task in a 256-task system [SPH+07].
As presented in Section 5.5.8, the exploitation of timing-anomaly-freedom can be
used to drastically reduce the PEG size, which is visible in Table 5.4 in column “∅
PEBs”, which holds the average number of PEG blocks for this analysis scenario.
The conﬁgurations where absence of timing anomalies was assumed (“N”) produce
far lower PEG sizes and analysis times than their counterparts (“T”). As already
pointed out in Section 5.5.8, the “N”-type analysis is not guaranteed to terminate in
all cases. However, in the analyzed set of benchmarks we did not observe a case of
non-termination.
The beneﬁts we get from the parallelism analysis (“P”-conﬁgurations) at the price
of increased analysis times are that we can analyze the PRIO arbitration for the ﬁrst
time and that we can reduce the arbitration delay estimations for FAIR arbitration.
Details on both aspects are presented in Figure 5.23. In analogy to Section 5.4.8,
it shows the average relative WCET, i.e., the average quotient of WCET and ACET,
for diﬀerent analysis conﬁgurations from Table 5.4. The “C”-conﬁgurations show
the results for the classical WCET analysis framework, which can only assume the
maximum possible delay for every access in state-permeable arbitration policies.
The plain parallelism-based analysis (“P|O”) is able to outperform this approach
by 2.3% in the TA-prone analysis (“T”). Only if the timing-based block exclusion
criterion is used (“B”) do we observe average reductions of up to 10.0% in the case
of the TA-free analysis (“P|B|N” compared to “C|N”). At this point it is important
to note that the majority of the experiments was done for dual-core benchmarks
and the worst-case delay for FAIR arbitration grows with the number of cores (cf.
Equation 5.7). Therefore, the parallelism analysis can be expected to yield higher
WCET decreases on four-core and eight-core systems. Finally, Figure 5.23 shows
that the PEG-based WCET analysis for a system with PRIO arbitration yields
results that are comparable to those for FAIR arbitration.
(a) WCET results (bar chart, logarithmic axis from 10^-1 to 10^2).
Analysis Type   Avg. Relative WCET
C-AH            30%
P|B|N           120%
P|O|N           121%
C|N             1,435%
C-AM            7,224%
(b) Analysis duration results.
Analysis Type   ∅ Duration
C-AH            0.8s
P|B|N           1,489.9s
P|O|N           2,090.2s
C|N             2.0s
C-AM            1.9s
Figure 5.24: Results for different analysis methods of shared caches averaged over
36 dual-core task sets.
We have also evaluated the performance of the parallelism analysis with respect
to shared caches. Obviously, shared caches can only be analyzed together with a
shared bus which connects them to the cores. Since we want to exclude the eﬀect
of shared bus analysis in the following experiments, we have set it into an “always
assume best-case” mode. In this mode, the analysis assumes that the bus is imme-
diately granted on each request of the analyzed task. This allows us to deﬁnitely
attribute the changes in WCET results to the shared cache analysis. To make
the cache analysis reasonably precise, we only activated the L1 and L2 instruction
caches, moved the task code into the cached RAM (cf. Figure 3.4), and unrolled
the ﬁrst iteration of each loop. Following the methodology from Section 4.5, the L1
cache size was set to 50% of the core’s tasks code size and the L2 cache size was set
to 50% of the overall code size.
The average results on 36 dual-core task sets are shown in Figure 5.24. Similar
to Section 4.5, we compare the results against the “always-hit” (C-AH) and “always-
miss” (C-AM) scenario, which provide lower and upper bounds on the possible
results of any shared cache analysis. First of all, we see from Figure 5.24a that the
always-miss conﬁguration produces a far higher pessimism than in the single-core
case, where it led to an average relative WCET of 2318%. This is plausible, because
in Section 4.5 we examined an L1 cache only. Of course, every cache read miss
includes a cache line reﬁll. Since the bus that connects the L1 cache to the higher
memory hierarchy levels has limited width, we need multiple bus requests to ﬁll the
cache line. In our system conﬁguration from Table 3.1 we need 32B/4B = 8 shared
bus and shared cache accesses to ﬁll the L1 cache line. The C-AM conﬁguration
must now assume that every L1 access is a miss and that all of the 8 line reﬁll
accesses are again L2 misses which trigger L2 cache line reﬁlls. Since each L2 cache
line reﬁll takes 2 + 16 ∗ 3 = 50 cycles, the total duration of a memory access rises
from 1 cycle (L1 hit) to 1 + 8 ∗ 50 = 401 cycles. This shows the large potential for
pessimism in any shared cache analysis.
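The arithmetic behind the always-miss scenario can be replayed in a few lines. The numeric values are those quoted in the text; the constant names, in particular the interpretation of the three factors in "2 + 16 ∗ 3", are assumptions made only for readability.

```python
# Values quoted in the text; the names are illustrative assumptions.
L1_LINE_BYTES = 32        # L1 cache line size
BUS_WIDTH_BYTES = 4       # width of the bus between L1 and the next level
L1_HIT_CYCLES = 1         # duration of an L1 hit

# "2 + 16 * 3": named here as a base cost plus 16 transfers of 3 cycles each.
L2_REFILL_BASE_CYCLES = 2
L2_REFILL_TRANSFERS = 16
CYCLES_PER_TRANSFER = 3


def l1_refill_accesses() -> int:
    """Number of shared bus / shared cache accesses needed per L1 line refill."""
    return L1_LINE_BYTES // BUS_WIDTH_BYTES


def l2_refill_cycles() -> int:
    """Duration of one L2 cache line refill."""
    return L2_REFILL_BASE_CYCLES + L2_REFILL_TRANSFERS * CYCLES_PER_TRANSFER


def always_miss_access_cycles() -> int:
    """Access duration if every L1 access and every refill access misses."""
    return L1_HIT_CYCLES + l1_refill_accesses() * l2_refill_cycles()


print(l1_refill_accesses(), l2_refill_cycles(), always_miss_access_cycles())
```

The three printed values correspond to the 8 accesses, 50 cycles and 401 cycles derived above.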
The summary-based shared cache analysis from Section 5.4.1, which is shown in
Figure 5.24 as “C|N”, yields a high average relative WCET of 1435%. This is
significantly worse than the single-core results, which showed an average relative WCET
of 645% in Section 4.5. The main drawback for the summary-based approach is that
there are always memory accesses for which the value analysis fails to determine an
access range bound, i.e., these accesses may go anywhere including the task’s code
section. A single access of this type in a task τ is suﬃcient to raise the cardinality
of the interference map tags(τ, s) for every cache set s far above the associativity aˆ
since τ may then access all possible tags. Therefore, a single access of this type in
task τ also degrades the classiﬁcation of every shared cache access in τ ′ ∈ T ∖ {τ}
to {HIT,MISS} according to Equation 5.1. Obviously, every additionally introduced
consideration of a MISS case impairs the WCET as mentioned in the discussion of
the C-AM conﬁguration.
The parallelism-aware analyses (“P|B|N” and “P|O|N” in Figure 5.24) are not
making any such summary-based approximations, but simply use the original ab-
stract LRU cache states as described in Section 4.3.2. The parallelism analysis
framework already takes care of exploring every possible access order. In addition,
the shared bus modeling resolves truly concurrent accesses by either determining
a guaranteed winner or by splitting the microarchitectural state. This increased
eﬀort translates into an average relative WCET of only 120% if the block exclusion
criterion is used (conﬁguration “P|B|N”). The WCET diﬀerence to the case in which
the BEC is not used (“P|O|N”) is marginal here, but as visible from Figure 5.24b the
analysis duration is reduced by 28.7% when the BEC is used. However, similar to
our previous experiments, there is a vast increase in analysis duration compared to
the task-partitioned approach. On the other hand, in this scenario the parallelism-
aware analysis can really make use of its superior knowledge about concurrent events
and outperform the summary-based approach by a factor of 11.96 on average.
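The two headline numbers of this comparison follow directly from the values in Figure 5.24. The short check below only restates that arithmetic; the dictionary names are illustrative.

```python
# Average relative WCET in percent (Figure 5.24a).
relative_wcet = {"C-AH": 30, "P|B|N": 120, "P|O|N": 121, "C|N": 1435, "C-AM": 7224}

# Average analysis duration in seconds (Figure 5.24b).
duration_s = {"C-AH": 0.8, "P|B|N": 1489.9, "P|O|N": 2090.2, "C|N": 2.0, "C-AM": 1.9}

# Precision gain of the parallelism-aware analysis over the summary-based one.
improvement_factor = relative_wcet["C|N"] / relative_wcet["P|B|N"]

# Relative reduction in analysis duration achieved by the BEC.
bec_time_reduction = 1 - duration_s["P|B|N"] / duration_s["P|O|N"]

print(f"{improvement_factor:.2f}x, {bec_time_reduction:.1%}")
```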
5.6 Summary
This chapter has introduced two distinct approaches towards the modeling of shared
resources in the WCET analysis of multi-core systems. Both are based on non-
deterministic ﬁnite-state-machines which are also used in the single-core analysis to
model abstract hardware states (cf. Section 4.3).
For a task-partitioned WCET analysis which analyzes the tasks in isolation, only
time-triggered arbitration schemes could be analyzed with good precision as shown
in Section 5.4. For these schemes, we have presented cyclic data-ﬂow contexts,
called oﬀset contexts, and the oﬀset relocation technique to speed up the analysis
and still gain close-to-optimal results. We have also seen that shared caches and
state-permeable arbitration policies can only be analyzed with the help of coarse-
grained worst-case assumptions in a task-partitioned analysis.
To overcome this problem, we also investigated a uniﬁed multi-core WCET anal-
ysis in Section 5.5, which explores the possible interleavings of a parallel, periodic
task set. It was shown that this approach can handle state-permeable arbitration
policies with very good precision, being up to 10% more precise on average than a
task-partitioned analysis of fair round-robin arbitration and 11.96 times more pre-
cise than the best preexisting shared cache analysis. In addition, this is the ﬁrst
approach which enables the analysis of ﬁxed-priority arbitration. The combinatorial
explosion which is caused by the enormous search space of parallel conﬁgurations
could be limited to some extent by applying a new, timing-based block-exclusion
criterion to exclude infeasible interleavings. Experiments have shown that the BEC
can halve the analysis duration and decrease the resulting WCET by up to 10%.
We also pointed out that synchronization statements in the program can additionally
be used to further reduce the search space of parallel configurations.

Chapter 6
Multi-Core WCET Optimization
Contents
6.1 Multi-Objective Evolutionary Schedule Optimization
6.1.1 Related Work
6.1.2 Evolutionary Algorithm
6.1.3 Evaluation
6.2 WCET-driven Multi-Core Instruction Scheduling
6.2.1 Related Work
6.2.2 Scheduling Heuristics
6.2.3 Evaluation
6.3 Summary
In the previous chapter, we have discussed the implications of diﬀerent types of
arbitration methods on the achievable analysis precision. In particular, the analysis
of time-triggered arbitration methods was found to be comparatively fast and precise
in Section 5.4. However, the WCET and ACET performance of systems running
time-triggered arbitration methods is highly dependent on
• the parameterization of the schedules and
• the structure of the examined tasks.
The analyses work best for tasks where the distribution of accesses in the time
domain matches the time-triggered schedule of the resource. Therefore, two novel
optimizations were developed that address the two points mentioned above. These
were ﬁrst published by the author of this thesis in [KMB14]. The ﬁrst is an evolu-
tionary optimization of the shared bus schedule parameters, whereas the second is
a multi-core WCET-aware instruction scheduling which re-structures the tasks to
increase their performance on a given time-predictable multi-core platform. Both
optimizations aim for a decrease of the WCET and ACET of the given tasks, which
in turn leads to improved schedulability and increased resource utilization.
6.1 Multi-Objective Evolutionary Schedule Optimization
In this section, we present a multi-objective evolutionary search algorithm which
automatically determines a range of well-suited schedules for a task set. It enables
users to choose a solution which balances WCET, ACET and utilization according
to their needs, and thus avoids a manual search for an optimal schedule, since this
is a hard and error-prone task.
[Flow chart with the stages Initialize, Evaluate, Recombine, Mutate, Select and Termination Criterion]
Figure 6.1: The structure of the evolutionary bus schedule optimization.
6.1.1 Related Work
The optimization of bus schedules has been the topic of a range of previous publica-
tions, but the vast majority either is restricted to TDMA schedules or uses ad-hoc
WCET computations instead of an analyzer following the design principles presented
in Section 2.2.
The optimization in [RNE+11] and [AEP+08] by the same authors is based on
the evolutionary optimization technique simulated annealing. It integrates system-
wide task scheduling with optimization, but on the other hand, it is restricted to
TDMA schedules, whereas we also consider more ﬂexible schedule variants. TDMA
slot length allocation is also done in [WT06], but the employed WCET analy-
sis framework is less precise and it is again restricted to TDMA. Concerning the
employed evolutionary variation operators, we use an approach similar to [HE05],
but [HE05] is restricted to TDMA and considers the optimization at a far more
coarse-grained level, i.e., the scheduling of tasks as a whole. Finally, [YKS11]
also examines bus schedule optimization, but only for the special case of Harmonic
Round-Robin schedules and for additive WCET models.
To the best of our knowledge, previous work has not addressed the optimization
of real-time bus schedules including TDMA and more ﬂexible methods. We will
see in the following that especially the consideration of schedule types other than
TDMA is important to achieve the highest WCET and ACET gains.
6.1.2 Evolutionary Algorithm
The structure of the optimization is depicted in Figure 6.1. It starts with a set
of initial schedules, which are called individuals in the context of evolutionary op-
timization. For all individuals, the WCET, ACET and bus utilization values are
determined. Then, promising individuals are recombined and mutated with a certain
probability. After these steps, the optimizer selects those individuals that form
the next generation and the optimization continues with them. The steps are repeated
until a user-definable termination criterion is met, e.g., until a predefined
result quality is achieved or until no further increase was observed over a predefined
time frame.
[Genome layout: t | nl | p⃗ = (p0, . . . , p63) | l⃗ = (l0, . . . , l63) | o⃗ = (o0, . . . , o63)]
Figure 6.2: The evolutionary algorithm’s genome.
In our schedule optimization, individuals are represented by the genome shown
in Figure 6.2. It contains the scheduling policy t (one of FAIR, PRIO, TDMA or
PD), the number of slots nl and the vectors of priorities, slot lengths and slot owners
(p⃗, l⃗ and o⃗). For an eﬃcient recombination and mutation, the genome needs a ﬁxed
length. Therefore, each vector is limited to 64 entries, which limits the solution space
to 64 slots. We will examine systems with up to 8 cores, and our initial experiments
have shown that good solutions almost exclusively use a minimum number of slots.
This is also supported by the fact that the last TBmax − 1 cycles of each slot are
“wasted” since no access can be scheduled here (cf. Figure 5.3). Therefore, it can be
expected that the limitation to 64 slots does not degrade the solution quality.
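A fixed-length genome of this shape could be represented as sketched below. The class and field names are hypothetical, and the validation rules (for example that a FAIR individual may carry zero slots) are assumptions of this sketch rather than properties stated for the actual implementation.

```python
from dataclasses import dataclass, field
from typing import List

MAX_SLOTS = 64
POLICIES = ("FAIR", "PRIO", "TDMA", "PD")


def _zeros() -> List[int]:
    return [0] * MAX_SLOTS


@dataclass
class Genome:
    policy: str                                             # t
    num_slots: int                                          # nl
    priorities: List[int] = field(default_factory=_zeros)   # p vector
    slot_lengths: List[int] = field(default_factory=_zeros) # l vector
    slot_owners: List[int] = field(default_factory=_zeros)  # o vector

    def __post_init__(self) -> None:
        if self.policy not in POLICIES:
            raise ValueError(f"unknown policy {self.policy!r}")
        if not 0 <= self.num_slots <= MAX_SLOTS:
            raise ValueError("number of slots must lie in 0..64")
        for vec in (self.priorities, self.slot_lengths, self.slot_owners):
            if len(vec) != MAX_SLOTS:
                raise ValueError("every vector must have exactly 64 entries")

    def effective(self, vector: List[int]) -> List[int]:
        """Return the part of a vector that is actually used by the schedule."""
        return vector[: self.num_slots]


g = Genome("TDMA", 2)
g.slot_lengths[:2] = [3, 3]
```

Keeping all vectors at 64 entries and tracking the used prefix separately is what allows recombination and mutation to operate on genomes of identical shape.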
Initialization
The optimization process starts with a set of schedule candidates also called the
population. It contains a FAIR schedule, a uniform PRIO schedule, a uniform TDMA
schedule and a uniform PD schedule. Here, “uniform” means that all cores get one
slot (nl = nc) and all slot lengths are set to the minimum allowed size (∀i ∶ li = TBmax),
since from our experience this reduces the bus arbitration delay on average. Slot
priorities are distributed such that the cores get a priority equal to their core ID
(∀i ∶ pi = i).
The rest of the population is ﬁlled up with candidates for which the parameters
are randomly chosen according to a uniform distribution. The randomness is needed
to appropriately cover the search space and is a standard approach in evolutionary
optimization [Wei07].
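The initialization can be sketched as follows. For brevity, the sketch stores only the effective part of each vector in a plain dictionary instead of the padded 64-entry genome. The value ranges for the random candidates (`max_slot_length`, at least one slot per core, priorities stored per core) are assumptions, and random candidates may still need the repair step described later.

```python
import random
from typing import Dict, List

POLICIES = ("FAIR", "PRIO", "TDMA", "PD")
MAX_SLOTS = 64


def uniform_individual(policy: str, num_cores: int, tb_max: int) -> Dict:
    """One slot per core, minimum slot length, priority equal to the core ID."""
    return {
        "t": policy,
        "nl": num_cores,
        "p": list(range(num_cores)),
        "l": [tb_max] * num_cores,
        "o": list(range(num_cores)),
    }


def random_individual(rng: random.Random, num_cores: int, tb_max: int,
                      max_slot_length: int) -> Dict:
    """Candidate with uniformly drawn parameters (ranges are assumptions)."""
    nl = rng.randint(num_cores, MAX_SLOTS)
    priorities = list(range(num_cores))
    rng.shuffle(priorities)
    return {
        "t": rng.choice(POLICIES),
        "nl": nl,
        "p": priorities,
        "l": [rng.randint(tb_max, max_slot_length) for _ in range(nl)],
        "o": [rng.randrange(num_cores) for _ in range(nl)],
    }


def initial_population(size: int, num_cores: int, tb_max: int,
                       max_slot_length: int, seed: int = 0) -> List[Dict]:
    rng = random.Random(seed)
    population = [uniform_individual(p, num_cores, tb_max) for p in POLICIES]
    while len(population) < size:
        population.append(random_individual(rng, num_cores, tb_max, max_slot_length))
    return population


pop = initial_population(20, num_cores=4, tb_max=3, max_slot_length=30, seed=1)
```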
Recombination and Mutation
Similar to [HE05], we use arithmetic operators which do not treat the genome as a
bit string and ﬂip individual bits, but which perform arithmetic operations on the
contained parameter values. This is done to limit the degree of randomness in the
optimization, since otherwise ﬂipping a high-order bit of a parameter might cause
the optimization to unguidedly “jump around” in the parameter space.
The recombination works piecewise on two genomes, with a multi-point crossover.
That is, during the recombination of A and B, for each segment σ ∈ {t, nl, p⃗, l⃗, o⃗}
from Figure 6.2 a recombination point r ∈ {0, . . . , lσ} is determined randomly with
uniform distribution, where lσ is the length of σ. The new segment σC for the re-
sulting individual C is then given as the concatenation of the substrings σA[0∶r) and
σB[r∶lσ). lσ always denotes the eﬀective length of the segment, e.g., we may have up
to 64 slots, but if A and B only use 7 slots at maximum, then lo⃗ = 7.
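A minimal sketch of this segment-wise crossover is given below, assuming genomes stored as dictionaries with 64-entry vectors. The function names are hypothetical. Entries beyond the effective length are unused padding, so the sketch simply copies them from B.

```python
import random
from typing import Dict, List

SCALAR_SEGMENTS = ("t", "nl")
VECTOR_SEGMENTS = ("p", "l", "o")


def crossover(seg_a: List, seg_b: List, effective_len: int, rng: random.Random) -> List:
    """Take seg_a[0:r) and seg_b[r:...) for a random point r in {0, ..., effective_len}."""
    r = rng.randint(0, effective_len)  # randint includes both end points
    return seg_a[:r] + seg_b[r:]


def recombine(a: Dict, b: Dict, rng: random.Random) -> Dict:
    """Recombine every genome segment independently with its own crossover point."""
    effective_len = max(a["nl"], b["nl"])
    child = {}
    for name in SCALAR_SEGMENTS:  # scalars are treated as segments of length 1
        child[name] = crossover([a[name]], [b[name]], 1, rng)[0]
    for name in VECTOR_SEGMENTS:
        child[name] = crossover(a[name], b[name], effective_len, rng)
    return child


parent_a = {"t": "TDMA", "nl": 7, "p": [1] * 64, "l": [1] * 64, "o": [1] * 64}
parent_b = {"t": "PD", "nl": 5, "p": [2] * 64, "l": [2] * 64, "o": [2] * 64}
child = recombine(parent_a, parent_b, random.Random(0))
```

Restricting the crossover point to the effective length ensures that the cut always falls inside the part of the vector that influences the schedule.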
The mutation is also only applied for parameters within the eﬀective lengths and
mutates each segment’s values with probability 0.3. To restrict the step size, we use
δ-mutation, where a new value vnew is randomly chosen from [vold − δ, vold + δ]. For
the number of slots s the value of δ is 5, for the slot lengths l ∈ l⃗ we chose δ = 30.
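The δ-mutation can be sketched as follows. The probability and the δ values are taken from the text. Applying the probability to each value individually and clamping the result to a valid range are assumptions of this sketch.

```python
import random
from typing import List

MUTATION_PROBABILITY = 0.3
DELTA_NUM_SLOTS = 5
DELTA_SLOT_LENGTH = 30


def delta_mutate(value: int, delta: int, lower: int, upper: int,
                 rng: random.Random) -> int:
    """Draw a new value from [value - delta, value + delta], clamped to [lower, upper]."""
    candidate = rng.randint(value - delta, value + delta)
    return min(max(candidate, lower), upper)


def mutate_slot_lengths(lengths: List[int], effective_len: int, tb_max: int,
                        max_len: int, rng: random.Random) -> List[int]:
    """Mutate only the entries inside the effective length of the vector."""
    mutated = list(lengths)
    for i in range(effective_len):
        if rng.random() < MUTATION_PROBABILITY:
            mutated[i] = delta_mutate(mutated[i], DELTA_SLOT_LENGTH, tb_max, max_len, rng)
    return mutated


out = mutate_slot_lengths([10] * 7 + [0] * 57, effective_len=7, tb_max=3,
                          max_len=100, rng=random.Random(42))
```

Bounding the step size keeps a mutated schedule close to its parent, in line with the arithmetic operators motivated above.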
Finally, we perform a randomized genome repair step, which mutates the indi-
vidual until each core has a unique priority and each core is the owner of at least
one slot. The ﬁrst is a requirement of our platform, whereas the latter is needed to
avoid core starvation and thus inﬁnite WCET values. The intention behind all of
these design decisions is to increase the chances that we ﬁnd good solutions early,
since the objective evaluation and thus each new generation is costly in our scenario
as we have seen in Section 5.4.8.
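One way to realize such a repair step is sketched below. The text only states that the individual is mutated until both conditions hold; the directed strategy used here (reassigning duplicate priorities and handing surplus slots to cores without a slot) is a substitute chosen for illustration, and storing one priority per core is likewise an assumption.

```python
import random
from collections import Counter
from typing import List, Tuple


def repair(priorities: List[int], owners: List[int], num_cores: int,
           rng: random.Random) -> Tuple[List[int], List[int]]:
    """Return repaired copies with unique priorities and at least one slot per core."""
    if len(priorities) != num_cores or len(owners) < num_cores:
        raise ValueError("need one priority per core and at least one slot per core")
    priorities, owners = list(priorities), list(owners)

    # Unique priorities: keep the first use of a value, replace later duplicates
    # (and out-of-range values) by a randomly chosen unused value.
    unused = [p for p in range(num_cores) if p not in priorities]
    rng.shuffle(unused)
    seen = set()
    for core, prio in enumerate(priorities):
        if prio in seen or not 0 <= prio < num_cores:
            prio = unused.pop()
            priorities[core] = prio
        seen.add(prio)

    # Slot ownership: give each core without a slot a random slot taken from
    # a core that owns more than one slot (or from an invalid owner).
    counts = Counter(owners)
    for core in range(num_cores):
        if counts[core] == 0:
            donors = [slot for slot, owner in enumerate(owners)
                      if counts[owner] > 1 or not 0 <= owner < num_cores]
            slot = rng.choice(donors)
            counts[owners[slot]] -= 1
            owners[slot] = core
            counts[core] += 1
    return priorities, owners


p, o = repair([0, 0, 2, 2], [1, 1, 1, 1, 3], num_cores=4, rng=random.Random(3))
```

Because there are at least as many slots as cores, a donor slot always exists whenever some core owns no slot, so the loop cannot fail.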
6.1.3 Evaluation
We implemented the evolutionary optimization with the PISA framework [BLT+03],
using the SPEA2 selector [ZLT02], which is used to determine individuals for muta-
tion, recombination and generation survival. SPEA2 tries to keep individuals in the
population that are not Pareto-dominated by others, i.e., for which no other individual
exists which is better in all objective values. In addition, SPEA2 increases the
diversity of the generated solutions by maintaining a solution density value for each
individual which indicates how diﬀerent it is with respect to all other individuals.
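The dominance notion used above can be written down directly. The sketch follows the wording of the text (an individual is dominated if another one is strictly better in all objectives) and assumes that all objectives are to be minimized; it is not the SPEA2 selector itself, which additionally uses the density information. The function names and the sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]  # assumption: every objective is minimized


def dominates(q: Objectives, p: Objectives) -> bool:
    """q dominates p if q is strictly better (smaller) in every objective."""
    return all(x < y for x, y in zip(q, p))


def non_dominated(points: Sequence[Objectives]) -> List[Objectives]:
    """Individuals for which no other individual is better in all objectives."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (WCET, ACET) pairs, not measured values.
front = non_dominated([(10, 5), (8, 7), (12, 9), (8, 9)])
```

Note that the more common definition of Pareto dominance (no worse in all objectives and strictly better in at least one) is stricter: under it, the point (8, 9) in the example would also be removed by (8, 7).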
For the WCET analysis, we used the task-partitioned approach with the oﬀset
relocation technique as presented in Section 5.4.6 due to its superior analysis du-
ration. The ACET and utilization values were again determined with the CoMET
simulator [Syn14].
In total, we used 110 tasks from the UTDSP, MRTC, MiBench and MediaBench
suites [LCS92; Mäl05; GRE+01; LPM97]. The tasks were compiled with opti-
mization level O1 which includes only the most basic compiler optimizations. As
described in Section 5.4.8 and Section 5.5.9, the tasks were grouped by their ACET
and only their input and output is read from and written to the shared memory.
Thus, only the I/O operations issued by the tasks are subject to bus arbitration,
the tasks’ code and local data are stored in the scratchpads of the cores.
We used a generation size of 20 individuals and a minimum number of 20 genera-
tions. After the 20th generation, optimization is continued if the current generation
is at least 0.05% better in any objective than the previous one. This was added
to provide conﬁdence that we do not abort the optimization prematurely, but we
encountered no cases where the 21st generation was actually reached.
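The termination rule can be sketched as follows. The generation count and the 0.05% threshold come from the text. Comparing the best value per objective between consecutive generations, and treating all objectives as minimized, are assumptions about how "better" is measured.

```python
from typing import Sequence

MIN_GENERATIONS = 20
IMPROVEMENT_THRESHOLD = 0.0005  # 0.05%


def should_continue(generation: int, previous_best: Sequence[float],
                    current_best: Sequence[float]) -> bool:
    """Decide whether another generation is computed after `generation`.

    previous_best / current_best hold the best (smallest) value per objective
    of the previous and the current generation.
    """
    if generation < MIN_GENERATIONS:
        return True
    return any(
        prev > 0 and (prev - cur) / prev >= IMPROVEMENT_THRESHOLD
        for prev, cur in zip(previous_best, current_best)
    )


go_on = should_continue(20, (100.0, 50.0), (99.9, 50.0))
```

In the example, the first objective improved by 0.1%, which is above the threshold, so the optimization would continue with generation 21.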
[Two bar charts: y-axis "Number of cores" (2, 4, 8); x-axis relative value in percent; series ACET and WCET. (a) Baseline (100%): FAIR individual. (b) Baseline (100%): Uniform TDMA individual.]
Figure 6.3: Average results for the best-WCET individuals.
We present relative results in the following, where the first meaningful baseline
is the FAIR individual as this represents the current practice in many real-world
systems. Figure 6.3a shows the geometrical mean of the relative WCET (ACET) of
the ﬁnal-generation individuals with the best WCET, relative to the WCET (ACET)
value of the FAIR individual from the ﬁrst generation. Since the FAIR individual
has no parameters and thus never evolves, it does not matter from which generation
it is taken. It can be seen that the reduction in WCET of up to 39% in the case with
4 cores is accompanied by an increase in ACET. This is plausible, because most of
the best-WCET individuals are using TDMA or PD (see Figure 6.5), which have
better WCET, but worse ACET performance (cf. Section 5.4.8).
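The aggregation used for these plots can be illustrated with a short sketch: each benchmark contributes the ratio between its optimized value and the baseline value, and the ratios are combined with the geometric mean, which is the usual choice for averaging ratios. The sample values below are hypothetical and not taken from the evaluation.

```python
import math
from typing import Sequence


def relative(value: float, baseline: float) -> float:
    """Relative WCET (or ACET) of an individual with respect to a baseline."""
    return value / baseline


def geometric_mean(ratios: Sequence[float]) -> float:
    """Geometric mean, computed via logarithms for numerical stability."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical (best-WCET individual, FAIR baseline) pairs for three benchmarks.
samples = [(50.0, 100.0), (80.0, 100.0), (200.0, 250.0)]
mean_relative_wcet = geometric_mean([relative(v, b) for v, b in samples])
```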
As the second baseline, we chose the uniform TDMA schedule with minimum slot
length, which usually produces good WCET values. Here, the question is whether
the optimization can still improve upon this baseline. As can be seen in Figure 6.3b,
we can still observe WCET improvements of 31% (2 cores) to 25% (8 cores) without
signiﬁcant loss of ACET performance. Also note that Figure 6.3 contains the results
for the individual with the best WCET. Thus, if we want to balance ACET and
WCET, the evolutionary approach also delivers matching solutions, some of which
are presented in the following.
To indicate the distribution of the results among the benchmarks, Figure 6.4
shows the detailed WCET results for all benchmarks in the 2-core conﬁguration.
Each segment in the ﬁgure represents the best-WCET individual for one benchmark,
which is identiﬁed by its benchmark ID, shown on the x axis. All WCETs are
relative to the WCET of the uniform-TDMA individual, which was also taken as
the comparison base in Figure 6.3b. It is visible that the average WCET reduction
results from reductions that are spread fairly uniformly over the benchmarks. The few benchmarks which
experience WCET reductions larger than 50% are unbalanced examples, where one
task with the largest WCET needs much more bus bandwidth than the others, and
thus its runtime can be drastically decreased by assigning more or longer slots to it.
[Plot: x-axis "Benchmark ID" (0 to 90); y-axis "Relative WCET" (0% to 100%)]
Figure 6.4: Detailed results for the best-WCET individuals on the 2-core platform
(Baseline: Uniform TDMA individual).
[Stacked bar chart: y-axis "Number of cores" (2, 4, 8); x-axis share of individuals (0% to 100%); series FAIR, TDMA, PD]
Figure 6.5: Distribution of different schedule types among best-WCET individuals.
The best-WCET individuals consist of TDMA schedules with adapted slot
lengths, of even more customized PD schedules and of some FAIR schedules. The
distribution of the schedule types is depicted in Figure 6.5. Note that in the worst
case, a FAIR access may have to wait for at most one access from each other core. A
TDMA access that is issued too late in the issuer’s slot to ﬁnish inside that slot may
have to wait for the rest of the slot plus the slots of all other cores. Due to this, for
tasks on which the TDMA WCET analysis fails to produce precise results, FAIR
can be better than TDMA. Apparently, this mostly happens for the platform with
8 cores. This platform requires a longer TDMA schedule to still provide at least one
slot per core and it seems that the WCET analysis gets more imprecise with growing
schedule length. Nevertheless, it works well for most examples, making TDMA the
predominant schedule type among the best-WCET individuals.
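The two worst-case waiting scenarios described above can be compared with a
small calculation. The following is a minimal sketch under simplifying
assumptions that are not taken from the analysis itself: a uniform TDMA schedule
with one slot per core, a fixed access duration that fits into a slot, and no
knowledge about the offset at which the access is issued. The function names are
hypothetical.

```python
def fair_worst_case_wait(num_cores: int, access_cycles: int) -> int:
    """Worst-case wait under FAIR: one pending access from each other core."""
    return (num_cores - 1) * access_cycles


def tdma_worst_case_wait(num_cores: int, slot_cycles: int, access_cycles: int) -> int:
    """Worst-case wait under uniform TDMA with one slot per core (assumption).

    The worst case is an access issued one cycle after the last offset from
    which it could still finish inside the issuer's own slot.
    """
    if access_cycles > slot_cycles:
        raise ValueError("an access must fit into a slot")
    rest_of_own_slot = access_cycles - 1            # cycles left, too few to finish
    other_slots = (num_cores - 1) * slot_cycles     # slots of all other cores
    return rest_of_own_slot + other_slots


# Example: 8 cores, slots of 3 cycles, accesses of 3 cycles.
fair_wait = fair_worst_case_wait(8, 3)
tdma_wait = tdma_worst_case_wait(8, 3, 3)
print(fair_wait, tdma_wait)
```

In this simplified model the per-access bound of FAIR is never larger than that
of TDMA. TDMA only wins when the analysis can prove that an access starts at an
offset from which it is granted immediately, which matches the observation that
FAIR becomes competitive where the TDMA analysis is imprecise.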
The fact that the optimization performs better for systems with fewer cores
can also be explained when examining the baseline utilization of the shared bus as
depicted in Figure 5.11. For 2 cores, the current benchmark set has an average bus
load of 21%, which rises to 41% for 4 cores and 64% for 8 cores, measured under FAIR
scheduling. Thus, all attempts to increase the utilization are ultimately limited by
the amount of unused bus time, which is decreasing as the number of cores increases.

Mode   s   WCET       ACET       Utilization   l⃗ (slot lengths)     o⃗ (owners)          p⃗ (priorities)
FAIR   -   95076900    5752270   0.6596        -                    -                   -
PD     8   85379400    7085960   0.5496        (3,3,3,3,3,3,3,3)    (2,1,7,5,4,0,6,3)   (7,3,1,4,5,6,8,2)
TDMA   8   61227900   10725600   0.3870        (10,3,3,3,3,3,3,3)   (2,3,1,0,4,5,6,7)   -
TDMA   8   85379400    9383380   0.4421        (3,3,3,3,3,3,3,3)    (0,1,2,3,4,5,6,7)   -

Table 6.1: Details on the pareto-optimal individuals from Figure 6.6.

The development of the individuals during a single optimization run is illustrated
in Figure 6.6 which shows the WCET, ACET and utilization for all individuals that
were evaluated in the course of the optimization of an 8-core benchmark containing
mixed multimedia and control tasks (codecs-dcodhuff, lmsfir-32-64, fft-256,
selection-sort, edge-detect, latnrm-32-64, adpcm-decoder and adpcm-encoder
from Appendix A). WCET and ACET are shown on the x and y axes, whereas the
color of the marks indicates their utilization, as shown in the color bar under the
Figure. The axes are scaled logarithmically to accommodate the spread of the re-
sults.
The PD individuals all show a good ACET performance, but vary in their WCET
by more than one order of magnitude depending on the configuration. In contrast,
the TDMA individuals consistently have a worse ACET performance, which is
compensated by a wider span of WCET values: they provide both the best and the
worst WCET values. The utilization is inversely related to the ACET, which
confirms our expectation that higher utilization implies lower average bus access
delays.
The pareto-optimal points are represented by the blank symbols on the left side
of the ﬁgure. They are also listed in detail in Table 6.1 together with their slot
length, owner and priority vectors l⃗, o⃗ and p⃗. As can be seen, FAIR produces the
best utilization and ACET values (the triangle in Figure 6.6) and TDMA has the
best WCET value (the blank squares in Figure 6.6). In between we ﬁnd TDMA
and PD conﬁgurations, where, in this case, PD provides a signiﬁcantly enhanced
ACET without loss of precision at the WCET side (blank circle in Figure 6.6). This
distribution of results is typical and could be observed for most benchmarks.
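The notion of Pareto-optimality used for the individuals can be made concrete
with a small dominance filter. The sketch below assumes the three objectives
named in this chapter (minimize WCET, minimize ACET, maximize utilization) and
the standard dominance relation; the individuals and their values are purely
hypothetical and are not taken from our measurements.

```python
from typing import NamedTuple


class Individual(NamedTuple):
    name: str
    wcet: int           # to be minimized
    acet: int           # to be minimized
    utilization: float  # to be maximized


def dominates(a: Individual, b: Individual) -> bool:
    """a dominates b if it is no worse in all objectives and better in one."""
    no_worse = (a.wcet <= b.wcet and a.acet <= b.acet
                and a.utilization >= b.utilization)
    better = (a.wcet < b.wcet or a.acet < b.acet
              or a.utilization > b.utilization)
    return no_worse and better


def pareto_front(population):
    """Keep every individual that is not dominated by any other one."""
    return [p for p in population
            if not any(dominates(q, p) for q in population)]


# Hypothetical population, for illustration only.
population = [
    Individual("fair", 900, 50, 0.65),
    Individual("pd", 800, 70, 0.55),
    Individual("tdma_tuned", 600, 100, 0.40),
    Individual("tdma_bad", 950, 120, 0.35),
]
front = pareto_front(population)
print([p.name for p in front])
```

The quadratic filter is sufficient for the population sizes of a single
optimization run; larger populations would call for a sorting-based approach.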
The evolutionary optimization of a single task set takes 3 to 4 hours on average,
depending on the number of analyzed cores. Taking into account that even the
evaluation of a single configuration takes minutes on average, and that almost all of
the time is spent on the WCET analysis (59%) and the CoMET simulation (36%),
this is still reasonable for, e.g., nightly software builds. Also, the optimization
itself is trivially parallelizable, which we have not done here. The WCET analysis
and simulation runtime will scale linearly with the number of cores, but the total
runtime of the optimization until “good” solutions are found might grow faster than
linearly since more parameters will have to be explored.

Figure 6.6: Exemplary population with marked Pareto-Front for a benchmark
with 8 cores. [Plot: ACET over WCET, both axes logarithmic; markers distinguish
TDMA, PD and FAIR individuals as Pareto-optimal or dominated; color bar:
utilization from 0.1 to 0.65.]

All in all, we have seen that FAIR arbitration is strong at producing good
ACET values, and that it can even outperform more predictable arbitration schemes,
especially when the minimum schedule length increases as, e.g., for systems with
rising core numbers. TDMA has proven to be the best choice for WCET. Still,
for TDMA as well as for PD, an optimization of the schedule parameters is highly
desirable and may lead to WCET improvements of more than 30%. PD can be used
to balance ACET and WCET, which again is easier to do in an automated way.
6.2 WCET-driven Multi-Core Instruction Scheduling
In Section 6.1, we have examined the possibilities of adjusting the bus schedule
parameters to the given task set. But there is another degree of freedom, namely
to reorder the instructions inside the tasks to match the bus schedule. Since both
optimizations are interdependent, we perform the instruction reordering for every
single individual that is generated by the algorithm from Section 6.1. Thus, we
may also ﬁnd solutions in which the bus schedule only excels when combined with
a custom instruction schedule. Alternatively, the instruction reordering can also be
invoked separately for a given, user-deﬁned schedule.
We build upon a classical list scheduler [ALS+07] here, which partitions the
task into non-overlapping, sequential regions and schedules each of those separately.
A region (v1, . . . , vn) is therefore deﬁned as a path through the CFG of the task
(cf. Deﬁnition 2), which is a sequence of adjacent basic blocks vi. The instructions
of the regions are re-ordered by our scheduler. The scheduler maintains a list of
dependencies of the instructions on the path, and a set of instructions which are
ready for execution, i.e., whose dependencies have been fulﬁlled. Our task is to
assign a priority to them, and the scheduler will then select one of the instructions
with highest priority and append it to the result order. This process continues in
the same manner until all instructions of the region have been scheduled. If a region
contained more than one basic block, instructions may have moved into a diﬀerent
block after the optimization. To restore the original semantics of the task, it is
necessary to add compensation code in this case [Fis81] after the scheduling of the
region. Since this only occurs if basic block boundaries are crossed, compensation
code is never needed if the regions only contain a single basic block. The whole
procedure is a compile-time optimization; there is no runtime scheduling involved.
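The list scheduling loop described above can be sketched as follows. This is a
minimal illustration rather than the implementation used in our compiler: it
handles a single region without compensation code, and the instruction names,
the dependency encoding and the priority callback are hypothetical.

```python
def list_schedule(instructions, deps, priority):
    """Reorder one region with a classical list scheduler.

    instructions: names in original program order
    deps:         name -> set of names that must be scheduled before it
    priority:     callable(name, scheduled_so_far) -> int, higher is better
    """
    remaining = list(instructions)
    scheduled = []
    done = set()
    while remaining:
        # Ready set: all dependencies have already been scheduled.
        ready = [i for i in remaining if deps.get(i, set()) <= done]
        if not ready:
            raise ValueError("cyclic dependencies in region")
        # max() returns the first maximal item, so ties keep program order.
        chosen = max(ready, key=lambda i: priority(i, scheduled))
        scheduled.append(chosen)
        done.add(chosen)
        remaining.remove(chosen)
    return scheduled


# Hypothetical region: a load/modify/store chain plus an independent counter.
region = ["load", "add1", "store", "add2", "cmp"]
dependencies = {"add1": {"load"}, "store": {"add1"}, "cmp": {"add2"}}
memory_ops = {"load", "store"}

# Toy priority: prefer instructions that do not access memory.
order = list_schedule(region, dependencies,
                      lambda i, _scheduled: 0 if i in memory_ops else 1)
print(order)
```

The priority callback is the only part that the heuristics of this section
replace; the surrounding loop stays the same.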
In contrast to the evolutionary optimization of the schedule, the instruction
scheduling is not a multi-objective optimization. It exclusively focuses on the opti-
mization of the WCET. To achieve this, it will use the results of the WCET analysis
to be able to assess the eﬀect of code changes on the WCET. This is a crucial step,
since unconditionally applying an optimization without such an assessment may
even lead to degraded results [ZCS03].
6.2.1 Related Work
The majority of previous publications on WCET-aware instruction scheduling is
focused on optimizing the WCET of a single-core system [ZKW+05; HZX12]. As an
exception, [SPC+10] discusses several access models for time-predictable multi-cores
on an abstract level, but requires manual restructuring of the tasks. In contrast, the
instruction scheduler presented in this thesis can be used to automatically implement
these models on a microarchitectural scale.
A higher-level approach to WCET-aware scheduling is taken in [DZ10], where
a cache-aware task scheduling algorithm is presented together with a greedy min-
imization of the shared cache interference. This work is orthogonal to ours, since
we are concerned with low-level instruction scheduling, whereas [DZ10] focuses on
operating-system-level task scheduling.
To the best of the author’s knowledge, previous work has not addressed the
scheduling of instructions according to the requirements of a time-predictable multi-
core platform.
6.2.2 Scheduling Heuristics
In the following, we present two novel priority assignment heuristics that are tai-
lored towards the optimization of the WCET of tasks running on time-predictable
multi-cores. Since we want to exploit the information generated during the WCET
analysis, we ﬁrst need to establish a connection between the region to schedule and
the WCET analysis results.
For any context block $v^C \in V^C_\tau$, we have gathered the set of initial
microarchitectural states $q^M \in q^{in}_{v^C}$ through the data-flow analysis framework
presented in Section 4.3. Each of these microarchitectural states is a tuple which,
among others, contains a bus state $q^B$. According to Definition 21, $q^B$ is an offset set for
time-triggered schedules. Additionally, each $q^M$ can be used to determine a runtime
interval for the execution of $v^C$ with initial state $q^M$ using the abstract execution
function given in Equation 4.17. The union of the runtime intervals for all $q^M$ is
called $\omega(v^C)$ according to Definition 19.
To make these results usable for the optimization, we ﬁrst need to map the
WCET results, which were obtained based on the context graph, back to the original
CFG of the task. Context graph nodes were generated from the CFG with the help
of three distinct construction steps:
• Context copies. According to Definition 14, a context graph $G^C_\tau$ holds one
or more copies of every node $v^A \in G^A_c$, i.e., to represent the execution of $v^A$ in
different contexts. We require a function $\mathit{ctxts} : V^A_c \to 2^{V^C_\tau}$ which maps each
analysis graph node to its context copies.
• Sequential splits. These are produced by the analysis graph construction in
Algorithm 2. As visible in Figure 4.3, every basic block $v$ is partitioned into
a sequence of analysis blocks. The analysis blocks which model the entry into
$v$ are given by $\delta^A_\bot(v)$.
• Alternative copies. Like the sequential splits, these are generated by
Algorithm 2, since every block with a predicate other than “AL”/“always” is
represented by two analysis blocks (cf. Figure 4.3b). The potential alternative
copies of the start block are contained in $\delta^A_\bot(v)$. Actually, they are the reason
why $|\delta^A_\bot(v)| = 2$ is possible.
Therefore, we can map the results of any data-flow analysis, performed on the
context graph, back to the CFG nodes $v \in V^f_c$ as follows:

$$q^{in}_v = \bigsqcup_{v^A \in \delta^A_\bot(v)} \; \bigsqcup_{v^C \in \mathit{ctxts}(v^A)} q^{in}_{v^C} \qquad (6.1)$$

Then, $q^{in}_v$ represents the data-flow information at the start of CFG node $v$. In the
following, we will use the mapped-back results of the value analysis ($q^{V,in}_v$) and the
microarchitectural analysis ($q^{M,in}_v$).
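The mapping of Equation 6.1 amounts to two nested joins. The following sketch
illustrates this for a domain whose join is plain set union, which is an
assumption made for the example (e.g., sets of offsets); the dictionaries
`entry_blocks` and `ctxts` as well as all node names are hypothetical stand-ins
for the entry analysis blocks and the context copies.

```python
def map_back(v, entry_blocks, ctxts, q_in,
             join=frozenset.union, bottom=frozenset()):
    """Join the incoming states of all context copies of all entry blocks of v."""
    result = bottom
    for v_a in entry_blocks[v]:        # analysis blocks modeling the entry into v
        for v_c in ctxts[v_a]:         # context copies of that analysis block
            result = join(result, q_in[v_c])
    return result


# Hypothetical example: block L4 has two alternative entry blocks,
# one of which was analyzed in two contexts.
entry_blocks = {"L4": ["L4.a", "L4.b"]}
ctxts = {"L4.a": ["L4.a#0", "L4.a#1"], "L4.b": ["L4.b#0"]}
q_in = {
    "L4.a#0": frozenset({0, 1}),
    "L4.a#1": frozenset({4}),
    "L4.b#0": frozenset({1, 2}),
}
q_in_L4 = map_back("L4", entry_blocks, ctxts, q_in)
print(sorted(q_in_L4))
```

Passing a different `join` and `bottom` adapts the same loop to other domains,
such as the one of the value analysis.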
For the optimization of task $\tau$ running on core $c$, the slot length heuristic (SL)
first determines the length $l^{max}_c$ of the longest slot which is assigned to $c$. During
the scheduling of a region, it maintains a counter $l^{cur}_c$ which is set to 0 at the start of
the region. With the help of the maximum bus access duration $T^{max}_B$, the priority
of instruction $i \in I$ (cf. footnote 1) is given by the priority function $p_{SL}$ as

$$p_{SL}(i) = \begin{cases} 1 & \text{if } \mathit{bac}(i) \wedge (l^{cur}_c + T^{max}_B) \le l^{max}_c \\ 0 & \text{else} \end{cases} \qquad (6.2)$$

where $\mathit{bac} : I \to \{true, false\}$ determines whether an instruction will possibly access
the shared bus. After an instruction $i$ with $\mathit{bac}(i) = true$ was scheduled, $l^{cur}_c$ is
incremented by $t^{max}$, otherwise $l^{cur}_c = 0$. The intention is to bundle bus accesses into
packages which fit into the slots of the core. The $\mathit{bac}$ function can be implemented
by inspecting the mapped-back value analysis results for the current region. If the
address range of any memory-accessing instruction $i$ covers the addresses that are
served by the shared bus, then $\mathit{bac}(i) = true$, else $\mathit{bac}(i) = false$.
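The slot length heuristic can be sketched as a small stateful priority function.
The class and method names are hypothetical, and the sketch assumes that the
counter is advanced by the maximum bus access duration after each scheduled bus
access; the result of bac(i) is passed in as a boolean.

```python
class SlotLengthHeuristic:
    """Sketch of the SL priority from Equation 6.2 for one region."""

    def __init__(self, l_max: int, t_max_bus: int):
        self.l_max = l_max          # longest slot assigned to the core
        self.t_max_bus = t_max_bus  # maximum bus access duration
        self.l_cur = 0              # reset at the start of every region

    def priority(self, accesses_bus: bool) -> int:
        # Priority 1 only if one more access still fits into the longest slot.
        if accesses_bus and self.l_cur + self.t_max_bus <= self.l_max:
            return 1
        return 0

    def scheduled(self, accesses_bus: bool) -> None:
        # Assumption: the increment equals the maximum bus access duration.
        if accesses_bus:
            self.l_cur += self.t_max_bus
        else:
            self.l_cur = 0


# Slot of 12 cycles, accesses of at most 3 cycles: four accesses are bundled.
sl = SlotLengthHeuristic(l_max=12, t_max_bus=3)
bundle = []
for _ in range(5):
    bundle.append(sl.priority(True))
    sl.scheduled(True)
print(bundle)
```

With a slot length equal to the access duration, only one access per bundle
receives the high priority, so the heuristic degenerates to keeping the bus
accesses isolated.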
The second heuristic, called offset heuristic (OF), uses the results of the WCET
analysis more directly. For each region $(v_1, \ldots, v_k)$, it first determines the incoming
microarchitectural state $q^{M,in}_{v_1}$ as presented in Equation 6.1. After a new instruction
$i$ was scheduled, the transfer function of $i$ is invoked, i.e., cycle steps of the abstract
states are performed until $i$ is “committed”. This results in a new incoming
microarchitectural state for the next instruction to schedule. Therefore, we can maintain a
$q^M_{cur}$, which is the current microarchitectural state before the execution of the next
instruction. From $q^M_{cur}$ we can derive the union of the possible current TDMA states
$q^B_{cur} = \bigcup_{q^B_i \in q^M_i,\; q^M_i \in q^M_{cur}} q^B_i$. With this information, we can determine whether $i$ is
guaranteed to be granted the bus or not, and define the priority function $p_{OF}$ as

$$p_{OF}(i) = \begin{cases} 2 & \text{if } \mathit{bac}(i) \wedge q^B_{cur} \subseteq \gamma(c) \\ 1 & \text{if } \neg \mathit{bac}(i) \\ 0 & \text{else} \end{cases} \qquad (6.3)$$

where $\gamma(c)$ is the grant window of core $c$ as defined in Equation 5.11. The idea is
to force the immediate scheduling of a bus-accessing instruction when we know for
sure that the access will be granted (case 1), and to delay it if possible when the
access will not be granted (case 3). All instructions which do not require the bus
use a default priority (case 2).
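The priority function of the offset heuristic can be sketched with plain sets of
offsets. The helper `grant_window` is a hypothetical stand-in for the grant
window: it assumes that the window consists of all offsets inside the core's own
slots from which an access of the given duration still finishes within the slot.
The slot lengths in the example are likewise invented for illustration.

```python
def grant_window(slot_lengths, slot_owners, core, access_cycles):
    """Offsets at which an access of `core` starts and ends in its own slot.

    Assumption for this sketch; not a transcription of the formal definition.
    """
    window = set()
    start = 0
    for length, owner in zip(slot_lengths, slot_owners):
        if owner == core:
            window.update(range(start, start + length - access_cycles + 1))
        start += length
    return window


def offset_priority(accesses_bus, current_offsets, window):
    """Sketch of the OF priority from Equation 6.3."""
    if not accesses_bus:
        return 1                      # case 2: default priority
    if current_offsets <= window:
        return 2                      # case 1: access is guaranteed to be granted
    return 0                          # case 3: access may have to wait


# Hypothetical schedule: 4 slots of 4 cycles, owned alternately by cores 0 and 1.
window0 = grant_window([4, 4, 4, 4], [0, 1, 0, 1], core=0, access_cycles=2)
print(sorted(window0))
print(offset_priority(True, {8, 9}, window0))
print(offset_priority(True, {10, 11}, window0))
print(offset_priority(False, {11}, window0))
```

The subset test is what makes the decision safe: a single offset outside the
window is enough to withdraw the high priority.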
As an example, consider Figure 6.7 which shows a task's control-flow graph in
the upper half. For this example, we assume a system with 2 cores, where the
presented task is executed on core 0. The TDMA bus schedule consists of 4 slots,
whose length and owner cores are depicted in the bottom of the figure. Below the
graph, the set $q^B_{cur}$ for the first instruction of block L4 is shown, which is a subset of
the full TDMA offset span, marked in gray. Thus, the analysis has determined for
this example that the load instruction ldr at the head of block L4, which accesses
the bus, will always start its execution from one of the offsets contained in the
white rectangle marked in the schedule. Considering the schedule, we know that
this area is contained in slot 2 which is owned by core 0, and thus the access will
be granted immediately. Therefore, $p_{OF}(\mathit{ldr}) = 2$. After the ldr instruction was
scheduled, the analysis will compute new offsets which reflect the positions in the
schedule at which the execution of the next instruction will start. The striped areas
in Figure 6.7 represent $q^B_{cur}$ for the second and third instruction of L4. The dotted
arrows indicate which offset information belongs to which instruction. In this case,
the optimization can decide that after the execution of the first add, the following
str has to wait for the bus in slot 3 and thus can prefer to schedule the second add
and cmp first.

    main:
        mov ip, #0
        mov r2, ip
        cmp r0, #0
        ble .L2
    .L4:
        ldr r3, [r2, r1]
        add r3, r3, #23
        str r3, [r2, r1]
        add ip, ip, #1
        cmp ip, r0
        bne .L4
    .L2:
        mov r0, #0
        bx lr

Figure 6.7: An example for offset results as computed during the multi-core
WCET analysis. [Diagram: control-flow graph of the task listed above; underneath,
the block offsets $q^B_{cur}$ per instruction, connected by transfer arrows, and the TDMA
schedule with slot owners $o_0 = 0$, $o_1 = 1$, $o_2 = 0$, $o_3 = 1$ over the offset span
from 0 to $(\sum_{i=0}^{n_l} l_i) - 1$.]

Footnote 1: Higher values indicate higher priority.

6.2.3 Evaluation
To evaluate the eﬀectiveness of the heuristics, we tested the scheduler on the same
benchmarks that were used in Section 6.1.3. We ﬁrst evaluated scheduling at the
basic block level, thus “regions” are “basic blocks” in the following. The basic blocks
consisted of 1 to 661 instructions (average: 5.25), and 11.42% of those were accessing
the bus on average. In total, the benchmarks contained 82,133 instructions.
In Figure 6.8a, the average results for the scheduling are shown for a uniform
TDMA schedule, where each core has one slot of length 3 cycles. Both heuristics
perform equally well in this setting. This may be due to the fact that in this setting,
only one bus access ﬁts into each TDMA slot since the maximum bus access duration
is also 3 cycles. Therefore, the schedule is short and it is suﬃcient to keep the bus
accesses isolated, which both heuristics are capable of.

Figure 6.8: Average results per platform for scheduling with the slot length
heuristic (pSL) and offset heuristic (pOF). (a) Average WCET reductions for uniform
TDMA with slot length of 3 cycles. (b) Average WCET reductions for uniform
TDMA with slot length of 12 cycles. [Plots: average WCET reduction per number
of cores (2, 4, 8) for pOF and pSL; panel (a) spans 0% to 1.5%, panel (b) spans
0% to 2.5%.]

Figure 6.9: Relative WCET results for the best 20 benchmarks from Figure 6.8a
per scheduling method. [Plot: WCET reduction (axis ticks at 5%, 10% and 15%)
for pSL and pOF over the benchmark rank (0 to 20).]

To test the sensitivity of the optimization w.r.t. different schedule configurations,
we also tested it on a uniform TDMA schedule with a slot length of 12 cycles
for which the results are shown in Figure 6.8b. Here, the WCET reduction achieved
by pOF is up to 4 times higher than the reduction for pSL. The drawback is that the
absolute WCET values for the 12-cycle configuration are 25% (2 cores) to 158% (8
cores) worse than those for the 3-cycle one, as already pointed out in Section 5.4.8.
It is a general observation in our experiments that bigger TDMA slots lead to worse
WCET and utilization values, which negates the value of the increased optimization
potential in these configurations.
Since the scheduler works on the microarchitectural level, it cannot be expected
to have as much impact as the macroscopic schedule parameter optimization pre-
sented in Section 6.1. To illustrate the results for the individual benchmarks, Fig-
ure 6.9 lists the 20 highest WCET reductions from the results shown in Figure 6.8a.
In this range, we observe an increased average bus utilization of 14% and 4.4
instructions per region. The WCET reductions for the individual benchmarks range from
13.2% to 2.5% (pOF) or 15.8% to 7.2% (pSL), respectively. Therefore, though the
results are lower on average, we still have many benchmarks for which the scheduler
achieves signiﬁcant gains with both heuristics.
The compilation times have tripled compared to the compilation without the
WCET-aware scheduling, but again, this is mostly due to the runtime of the WCET
analyses. Also, to be precise, we would need to recompute the WCET and thus the
microarchitectural states after each scheduled region. However, this would lead to
a vast increase of the optimization runtime. Therefore, we performed the whole
scheduling with the microarchitectural results for the original task. This leads to
some degree of imprecision, but otherwise the compilation time can be expected to
scale linearly with the number of regions (basic blocks, in this case) which is not
acceptable.
We also extended the optimization to work on trace regions [Fis81] and su-
perblock regions [HMC+93]. Both are well-known methods to increase the schedul-
ing ﬂexibility, but they come at the cost of inserting compensation code which
can adversely aﬀect the WCET. In our experiments, the average WCET obtained
with both trace and superblock scheduling varies between 99.7% and 100.3% of
the WCET that was achieved using pure basic block scheduling. In addition, the
minimum achieved WCET was consistently larger than the basic-block scheduling
WCET by up to 15%. Therefore, the additional overhead of creating traces and su-
perblocks does not really pay oﬀ in this scheduling scenario. This also follows from
the observation that the I/O operations which access the shared bus often have data
dependencies to their neighbors and thus can rarely make use of the increased trace
and superblock region size.
Due to the relatively small impact of the instruction scheduling compared to
the evolutionary schedule optimization (cf. Section 6.1) and the high computational
demand of the latter, we tested the combination of both on selected benchmarks
only. The instruction scheduler was invoked for every generated individual to also
ﬁnd solutions which are only accessible through a combination of bus and instruction
scheduling. Unfortunately, this optimization combination did not yield any further
WCET decrease beyond the decrease that is caused by the evolutionary schedule
optimization alone. Therefore, the averages given in Figure 6.8a are most likely
also an upper bound on the additional WCET decrease that is achievable after the
schedule itself was optimized. This suggests that optimizing the bus schedule ﬁrst
and performing the instruction scheduling afterwards is suﬃcient in practice, leading
to a combined average WCET reduction of more than 30%.
6.3 Summary
This chapter has presented the ﬁrst WCET-aware multi-core bus schedule opti-
mization which takes into account fair, TDMA and priority-division schedules. The
results show that it can reduce the WCET of real-world benchmarks by more than
30% on average, and how ACET, WCET and bus utilization evolve under diﬀer-
ent parameterizations of the three schedule types. In addition, we have seen that
TDMA is not always the best choice for minimizing the WCET, which was a basic
assumption in previous work. This macroscopic approach was complemented with
a new type of instruction scheduling heuristic tailored towards multi-core WCET
reduction, which can further reduce the WCET by up to 15.8%. In summary, both
optimizations signiﬁcantly increase the precision of the estimated WCETs and thus
the usability of multi-core WCET analysis.

Chapter 7
Conclusion and Future Work
7.1 Summary
This thesis has presented two self-contained, holistic approaches towards the WCET
analysis of tasks running on embedded multi-core systems which strongly improve
upon previously published work. For both state-permeable and non-state-permeable
resources, we have developed new approaches which deliver precise WCET results
and an analysis duration which is lower than the one achieved by previously known
approaches. Last but not least, we have presented two optimizations which are
able to decrease the WCET of tasks running on embedded multi-core systems with
shared resources.
We have started with a comparison of diﬀerent approaches to WCET analysis in
Chapter 2. There, we have shown that only the mathematically sound static WCET
analysis, based on abstract interpretation, has the potential to safely capture every
possible execution behavior of a program. Therefore, the rest of the thesis builds
upon this branch of WCET analysis. The need for an analysis which covers all
interacting components of a hardware system was motivated with the deﬁnition of
compositionality, which is hardly ever present in today’s computing systems. We
also introduced the concept of timing anomalies and how it can be used to ease the
static WCET analysis. In Chapter 3, we have brieﬂy sketched the infrastructure
which was used to implement the analyses in the research compiler WCC.
The most well-known form of a static, single-core WCET analysis pipeline was
presented in Chapter 4. We have discussed how call and iteration contexts are
disambiguated in WCET analysis, how a value analysis that supports predicated
execution can be performed, how to properly model hardware states in WCET
analysis using non-deterministic ﬁnite-state-machines, and we ﬁnally have seen how
to perform a path analysis that determines the WCET from runtimes of individual
context blocks. The results indicate that the implementation is as precise as the
commercial analyzer aiT for non-cached code. For cached code, the WCC imple-
mentation is 2.26 times worse than aiT but still 3.56 times better than pessimistic
worst-case assumptions, which suﬃces for our following multi-core comparisons.
In Chapter 5, we have initially identiﬁed shared caches and shared buses as the
main challenge in the static timing analysis of multi-core systems. Depending on
their parameterization we have classiﬁed these according to their state-permeability
and bounded access delay. This classiﬁcation determines whether the analysis of
such systems is possible through a task-partitioned WCET analysis, or whether it
requires a more time-consuming uniﬁed WCET analysis which considers all parallel
tasks together.
For the case of task-partitioned analyses, we have introduced a new abstract
microarchitectural domain, called TDMA oﬀset sets, for shared, time-triggered buses
in Section 5.4, which allows us to employ the single-core analysis framework from
Chapter 4 for the WCET analysis of multi-cores. We have shown that the achievable
precision for TDMA oﬀset sets strongly depends on the context management. The
trivial approach of fully unrolling all loops was identiﬁed as being infeasible due to
the resulting analysis duration. To overcome this, cyclic data-ﬂow contexts, called
oﬀset contexts, were introduced into the microarchitectural analysis. These are able
to outperform the fully unrolling approach in some cases while on average, they are
only 18.4% worse and require less than 35.0% of the analysis runtime. Finally, oﬀset
relocation was proposed as the fastest option, which only needs less than 2.4% of
the full unrolling’s analysis time but also generates WCET values which are only
3.3% better than a “naive” oﬀset analysis. The full unrolling, being the most precise
approach, outperforms the pessimistic worst-case bound by up to 56% on average.
State-permeable and unbounded-delay resources like shared caches and priority-
driven arbitration could not be analyzed with suﬃcient precision by the task-
partitioned framework. Currently, there is little reason to believe that the precision
of this feature combination can be improved at all, since any task-partitioned anal-
ysis is forced to use summary-based or worst-case-assumption-based techniques to
analyze state-permeable resources. Therefore, the task-partitioned analysis, as pre-
sented in Section 5.4, is most suitable for systems with time-triggered bus arbitration
and partitioned or locked caches or scratchpads.
Precise WCET results for shared, state-permeable resources can be determined
with the help of a uniﬁed, parallelism-aware WCET analysis as presented in Sec-
tion 5.5. It explores the possible parallel interleavings of a concurrent, strictly
periodic task set at the cycle level. To limit the combinatorial explosion which is
incurred by any such exploration, we have introduced a new, timing-based block ex-
clusion criterion which can be used to identify invalid execution scenarios and thus to
restrict the search space. Our experiments have shown that the parallelism-aware
analysis reduces the WCET overestimation by a factor of 11.96 when analyzing
shared caches compared to the results from Section 5.4. Likewise, the overestima-
tion in the case of fair round-robin arbitration could be decreased by up to 10% and
a microarchitectural analysis of ﬁxed-priority-driven arbitration could be performed
for the ﬁrst time. The increase in precision and scope comes at the price of an
analysis time which is increased by a factor of 229 compared to the task-partitioned
analysis. The block exclusion criterion can be used to limit this increase, since it
eliminates 35.6% of the analyzed parallel interleavings on average which translates
to a reduction in analysis time by 49.7% and a decrease of the WCET overestimation
by up to 10%.
Finally, we have examined to what extent the determined multi-core WCETs can
be optimized in Chapter 6. To this end, an evolutionary optimization of both the
schedule type and parameters was devised which searches for conﬁgurations which
are optimal with respect to ACET, WCET and utilization. Since the schedule must
be adapted to the shared resource usage of the benchmark, there can be no generally
ideal schedule. Compared with the default conﬁgurations, the optimization could
achieve an average WCET decrease of more than 30%. Moreover, we devised an
instruction scheduling which exploits the ﬁne-grained microarchitectural results from
the WCET analysis. In the experiments, this scheduling could additionally decrease
the WCET by up to 15.8%.
7.2 Future Work
Task-partitioned WCET Analysis Our treatment of oﬀset contexts in Sec-
tion 5.4.5 is potentially not optimal if the iterations of a loop can be partitioned
into sequential groups with similar loop body behavior inside the group and dif-
fering behavior among the groups. In this case, it may be worthwhile to consider
using one layer of oﬀset contexts for each such group, i.e., one horizontal layer is
introduced for each sequential group in Figure 5.6. However, the number of oﬀset
contexts will then quickly reach the number of contexts in the full unrolling case.
Therefore, sophisticated algorithms would be needed to ensure that this layering is
only used when proﬁt can actually be gained. Making the microarchitectural anal-
ysis path-aware by adding path history information to its domain (cf. Section 2.1)
is also one interesting possibility, which could not be explored in this thesis.
Uniﬁed WCET Analysis There are multiple opportunities to further improve
the scalability and precision of the uniﬁed WCET analysis as presented in Sec-
tion 5.5. The most promising idea is to provide synchronization data to the uniﬁed
WCET analysis to be able to further prune the search space by exploiting rendez-
vous synchronization behavior. With this extension, the analysis is expected to scale
up as long as the tasks to analyze are tightly synchronized. Another important ex-
tension would be to deal with non-uniform periods in a more advanced way than
proposed in Section 5.5.8.
The uniﬁed WCET analysis is also one of the ﬁrst approaches towards the in-
tegration of schedulability and WCET analysis, since it implicitly adheres to the
periodic, non-preemptive schedule. Therefore, it can be used to determine whether
a task set is schedulable. The synchronization-aware version will additionally make
it possible to determine bounds on the time that a task may wait for a given lock.
Synchronization-aware Path Analysis This work has focused on the diﬃcul-
ties that arise during the microarchitectural analysis of multi-core systems. As
already pointed out in the beginning of Section 5.4, another major challenge is the
integration of synchronization semantics into the path analysis. For synchronization
statements inside of loops, the IPET approach can no longer be used. Therefore,
new approaches are needed to deal with this important case. Data-ﬂow-based path
analyses [EGL11; KFM13] are one very promising candidate here.
Multi-core Value Analysis The value analysis is also aﬀected by multi-core in-
terference, if communication between concurrently executing tasks is allowed. Since
the value analysis is also a data-ﬂow analysis, the approaches that were developed
in this thesis are applicable. It is possible to use a summary-based technique similar
to the one presented in Section 5.4.1, or the value analysis can be integrated with
the microarchitectural analysis. In the latter case, it could also proﬁt from the new,
uniﬁed WCET analysis and its associated timing-based block exclusion criterion
as presented in Section 5.5. A last option is a dedicated parallelism-aware value
analysis that works without timing information and possibly on sliced task CFGs.
Extended Benchmark and Hardware Feature Coverage. In general, it would
be interesting to apply the analyses presented in this thesis to a broader set of bench-
marks, possibly including synchronization among the tasks. In particular, some of
the average-case drawbacks that were attributed to TDMA in Section 5.4.8 will vanish
on benchmarks where the system is fully utilized all the time. Unfortunately,
there are still no standard real-time multi-task or multi-core benchmarks available,
and most general purpose benchmark suites use features like dynamic memory al-
location, deep software libraries and operating system support which drastically
complicate their analysis even if carried out with commercial WCET tools.
To limit the implementation effort, we have not yet explored even half of the hardware
features that can be modeled using the presented framework. Out-of-order
processors, speculative execution, fetch and store buﬀers and split transactions can
all be modeled in both the task-partitioned and the unified WCET analysis. Split
transactions are especially interesting, since they would allow for smaller slot sizes,
which would increase the TDMA performance. In addition, they would render the last
TBmax − 1 cycles of each slot usable, which was not the case in the analysis setting used
in Chapter 5. The task-partitioned modeling of TDMA offsets could also be applied
to statically-allocated time-triggered interconnection networks like Aethereal and
TTP and to the FlexRay bus.
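The cycles lost at the end of each slot can be made concrete with a small model. The following Python sketch is an illustration only and not the analysis from Chapter 5: it assumes equally sized slots that are assigned to the cores in round-robin order, a fixed access duration `dur` that stands in for the maximum bus access time, and no split transactions, so that an access may only start if it completes within the requesting core's own slot. All function and parameter names are hypothetical.

```python
def tdma_wait(t, core, n_cores, slot_len, dur):
    """Cycles a bus access issued at time t must wait before it may start.

    Assumed model (not taken from the thesis): core c owns the slot
    [c * slot_len, (c + 1) * slot_len) in every TDMA period, and an access
    of dur cycles must finish inside the slot in which it starts.
    """
    if not 0 < dur <= slot_len:
        raise ValueError("access must fit into a single slot")
    period = n_cores * slot_len
    # Position of t relative to the start of the core's own slot.
    pos = (t - core * slot_len) % period
    if pos <= slot_len - dur:
        return 0  # the access still fits into the current own slot
    return period - pos  # wait for the start of the next own slot


def unusable_offsets(slot_len, dur):
    """Offsets inside an own slot at which an access can no longer start."""
    return [pos for pos in range(slot_len) if pos > slot_len - dur]


if __name__ == "__main__":
    n_cores, slot_len, dur = 2, 5, 3
    period = n_cores * slot_len
    waits = [tdma_wait(t, 0, n_cores, slot_len, dur) for t in range(period)]
    print("wait per issue time:", waits)
    print("maximum wait:", max(waits))
    print("unusable offsets per slot:", unusable_offsets(slot_len, dur))
```

Under these assumptions, exactly `dur - 1` start offsets per slot are unusable, which corresponds to the cycles that split transactions would recover, and the maximum wait is `(n_cores - 1) * slot_len + dur - 1` cycles.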
Compiler Optimization Opportunities. The developed multi-core WCET analyses
also open up new possibilities in the domain of compiler optimizations. As
already demonstrated for the example of instruction scheduling in Section 6.2, we
are now able to assess the eﬀect of code modiﬁcations in very high detail. The
microarchitectural states determined in the WCET analysis can also be used to
guide compiler optimizations like scratchpad and register allocation. The detailed
analysis of shared caches now allows a comparison with preexisting cache locking
and partitioning methods. Finally, the optimization of the task-to-core mapping
becomes feasible, since we can quantify each mapping’s impact on the tasks’ WCETs
in detail.
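As a rough illustration of how such a mapping optimization could be driven by WCET results, the following Python sketch enumerates all task-to-core mappings and keeps the one with the smallest worst-case core load. It is a minimal sketch under stated assumptions: the callback `wcet_of` stands in for a full multi-core WCET analysis run, `toy_wcet` is an invented interference model used only to make the example executable, and tasks on the same core are assumed to run back to back. None of these names come from the thesis or from WCC.

```python
from itertools import product


def best_mapping(tasks, n_cores, wcet_of):
    """Exhaustively search for the mapping with the smallest maximum core load.

    wcet_of(mapping) is a hypothetical callback that returns a dict with the
    WCET of every task under the given task-to-core mapping.
    """
    best = None
    for assignment in product(range(n_cores), repeat=len(tasks)):
        mapping = dict(zip(tasks, assignment))
        wcets = wcet_of(mapping)
        load = [0] * n_cores
        for task, core in mapping.items():
            load[core] += wcets[task]  # tasks on one core run sequentially
        cost = max(load)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best


# Invented base WCETs (in cycles) for three hypothetical tasks.
BASE_WCET = {"a": 10, "b": 10, "c": 4}


def toy_wcet(mapping, penalty=2):
    """Toy model: each task on another core adds `penalty` cycles of interference."""
    return {
        task: BASE_WCET[task]
        + penalty * sum(1 for other in mapping if mapping[other] != mapping[task])
        for task in mapping
    }


if __name__ == "__main__":
    cost, mapping = best_mapping(["a", "b", "c"], 2, toy_wcet)
    print("worst-case core load:", cost)
    print("mapping:", mapping)
```

The exhaustive search visits `n_cores ** len(tasks)` mappings, so it is only practical for small task sets; for larger ones, a heuristic or evolutionary search would have to take its place.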
List of Figures
1.1 Real-time system verification tools. 2
1.2 A sample distribution of runtimes of a program, along with sample BCET and WCET estimates. 4
1.3 Multi-core implementation power consumption. 5
1.4 ARM Processor Families [ARM14b]. 6
2.1 The Galois connection between concrete and abstract semantics. 14
2.2 The lattice of integer constants. 17
2.3 Convergence behavior in a lattice. 18
2.4 Structure of most static WCET analyzers. 19
2.5 The execution paths of a program on a timing-anomalous system depicted by transitions between hardware states. 24
2.6 Examples of timing anomalies [RWT+06]. 25
3.1 Previous structure of the WCC compiler [FL10]. 39
3.2 Structure of the WCC for multi-core compilation and analysis. 41
3.3 Levels of hardware design in a Gajski-Kuhn Y-chart [Gro08]. 44
3.4 The multi-core system model. 45
3.5 Interaction of binary input file modules with the rest of WCC. 48
4.1 Structure of the WCC-internal, single-core, single-task WCET analysis. 51
4.2 Stages of the IPCFG construction. 53
4.3 Analysis block creation example. 54
4.4 Example for an analysis graph with a directly recursive function. 56
4.5 Possible virtual inlining results for Figure 4.4. 57
4.6 Example for virtual unrolling. 60
4.7 Hasse diagrams of value analysis domain components. 61
4.8 Example abstract microarchitectural states. 66
4.9 The abstract pipeline model for the ARM7TDMI. 68
4.10 A simplified abstract cache model. 71
4.11 WCET performance of the WCC-internal single-core analysis framework for a system without cache usage (superscript UC). 77
4.12 Single-core WCET results for a system with activated cache (superscript C). 79
5.1 Structure of the core- and task-partitioned WCET analysis with optional cache interference analysis. 91
5.2 The abstract bus timing model. 95
5.3 An example for a TDMA bus access which is maximally delayed. 97
5.4 The TDMA offset set O = {1,2,7} in a 2-core schedule with ls = 5. 98
5.5 An example for the divergence of TDMA offsets. 100
5.6 The unfolded offset contexts for the example shown in Figure 5.5a. 102
5.7 Illustration of proof scenario for the Relocation Lemma. 108
5.8 Different scenarios for applying the offset relocation heuristic. 109
5.9 An example of the application of the offset relocation. 111
5.10 An example of non-converging TDMA offset results when naively exploiting the absence of timing-anomalies. 112
5.11 Average total bus utilization. 114
5.12 Average relative measured execution time (ACET) for different platforms. 115
5.13 Average benchmark execution time jitter. 115
5.14 Average relative WCET when all bus accesses show the worst-case bus behavior. 116
5.15 Average relative WCET for different arbitration types with the basic, TA-prone TDMA analysis. 117
5.16 Average relative WCET for advanced TDMA analysis techniques. 117
5.17 Structure of the unified multi-core WCET analysis. 121
5.18 The basic task scheduling model for the unified multi-core analysis. 123
5.19 Parallel execution graph creation example. 124
5.20 An example for parallelism-aware bus states. 134
5.21 Example for block shortening during the PEG construction. 136
5.22 Efficiency of the block exclusion criterion for varying number of cores and arbitration policies. 140
5.23 Relative WCET results. 142
5.24 Results for different analysis methods of shared caches averaged over 36 dual-core task sets. 143
6.1 The structure of the evolutionary bus schedule optimization. 148
6.2 The evolutionary algorithm’s genome. 149
6.3 Average results for the best-WCET individuals. 151
6.4 Detailed results for the best-WCET individuals on the 2-core platform. 152
6.5 Distribution of different schedule types among best-WCET individuals. 152
6.6 Exemplary population with marked Pareto-Front for a benchmark with 8 cores. 154
6.7 An example for offset results as computed during the multi-core WCET analysis. 158
6.8 Average results per platform for scheduling with the slot length heuristic (pSL) and offset heuristic (pOF). 159
6.9 Relative WCET results for the best 20 benchmarks from Figure 6.8a per scheduling method. 159
List of Tables
3.1 Default system parameters. 46
4.1 Average results for the analysis of uncached execution of a single task. 77
4.2 Average results for the analysis of cached execution of a single task. 80
5.1 Properties of shared resources in multi-cores. 94
5.2 Example for a case where offset contexts yield more precise results than full unrolling. 118
5.3 Average analysis time per benchmark for a timing-anomaly-free analysis run. 119
5.4 Average analysis time and PEG sizes. 141
6.1 Details on the pareto-optimal individuals from Figure 6.6. 153

List of Algorithms
1 The generic data-flow analysis work-list algorithm. 16
2 Analysis Graph Construction. 55
3 The virtual inlining algorithm. 57
4 The virtual unrolling algorithm. 59
5 Refinement of the shared cache results for timing-anomaly-free architectures. 92
6 The offset context specific microarchitectural analysis. 103
7 PEG-driven parallelism analysis. 128
8 Update of basic block runtimes. 129
9 PEG block analysis. 132

Glossary
ACET Average-Case Execution Time. 37
AI Abstract Interpretation. 12
APSS Abstract Parallel System State. 126
BCET Best-Case Execution Time. 3
BEC Block Exclusion Criterion. 125
CDM Code-Division Multiplexing. 84
CFG Control Flow Graph. 13, 49
CRPD Cache-Related Preemption Delay. 30, 87
CSP Communicating Sequential Processes. 121
DC-UCB Deﬁnitely Cached Useful Cache Blocks. 31
DFA Data-Flow Analysis. 14
DFS Depth-First Search. 58
ECB Evicting Cache Blocks. 31
ECU Electronic Control Units. 1
FDM Frequency-Division Multiplexing. 84
FIFO First-In First-Out. 70
HLIR High-Level Intermediate Representation. 39
ICN Interconnection Network. 83
ILP Integer Linear Program. 20, 74
IPCFG Interprocedural Control Flow Graph. 52
IPET Implicit Path Enumeration Technique. 74
ISA Instruction Set Architecture. 43
LLIR Low-Level Intermediate Representation. 39
LOC Lines Of Code. 40, 209
LRU Least Recently Used. 70, 71
MFP Minimum Fixed Point. 15
MHP May-Happen-In-Parallel. 121
MOP Meet Over All Paths. 15
MPI Message Passing Interface. 122
PC Program Counter. 13
PD Priority Division. 86
PEG Parallel Execution Graph. 121–123
PLRU Pseudo-LRU. 70
RTC Real-Time Calculus. 32, 87
RTOS Real-Time Operating System. 29
SEP System Execution Position. 126
TA Timing Anomaly. 25
TDM Time-Division Multiplexing. 84, 85
TDMA Time-Division Multiple Access. 86
TEP Task Execution Position. 126
UCB Useful Cache Blocks. 30
WCC WCET-aware C Compiler. 7, 20, 37
WCEC Worst-Case Execution Count. 39, 74
WCEP Worst-Case Execution Path. 37
WCET Worst-Case Execution Time. v
WCRT Worst-Case Response Time. 30
Bibliography
[AAN11] Ernst Althaus, Sebastian Altmeyer, and Rouven Naujoks. “Precise
and Eﬃcient Parametric Path Analysis”. In: Proceedings of the 2011
SIGPLAN/SIGBED Conference on Languages, Compilers and Tools
for Embedded Systems. LCTES ’11. Chicago, IL, USA: ACM, 2011,
pp. 141–150. isbn: 978-1-4503-0555-6. doi: 10.1145/1967677.1967697.
url: http://doi.acm.org/10.1145/1967677.1967697 (Cited on page 21).
[AB09] Sebastian Altmeyer and Claire Burguiere. “A New Notion of Useful Cache
Block to Improve the Bounds of Cache-Related Preemption Delay”. In:
Proceedings of the 2009 21st Euromicro Conference on Real-Time Systems.
ECRTS ’09. Washington, DC, USA: IEEE Computer Society, 2009,
pp. 109–118. isbn: 978-0-7695-3724-5. doi: 10.1109/ECRTS.2009.21. url:
http://dx.doi.org/10.1109/ECRTS.2009.21 (Cited on page 31).
[ABD+13] Andreas Abel, Florian Benz, Johannes Doerfert, Barbara Dörr, Sebastian
Hahn, Florian Haupenthal, Michael Jacobs, Amir H. Moin, Jan Reineke,
Bernhard Schommer, and Reinhard Wilhelm. “Impact of Resource Sharing on
Performance and Performance Prediction: A Survey”. In: CONCUR. 08/2013
(Cited on page 82).
[Abs14a] AbsInt GmbH. aiT Worst-Case Execution Time Analyzers. http://www.
absint.com/ait. 2014 (Cited on pages 3 sq., 7, 37, 41).
[Abs14b] AbsInt GmbH. Astrée Run Time Error Analyzer. http://www.absint.com/
astree/index_de.htm. 2014 (Cited on page 2).
[ACD06] James H. Anderson, John M. Calandrino, and Umamaheswari C. Devi. “Real-time
scheduling on multicore platforms”. In: Proc. of the 12th IEEE Real-Time and
Embedded Technology and Applications Symp. Chapman Hall/CRC, Boca,
2006, pp. 179–190 (Cited on page 34).
[AD90] Rajeev Alur and D. L. Dill. “Automata for Modeling Real-time Systems”.
In: Proceedings of the Seventeenth International Colloquium on Automata,
Languages and Programming. Warwick University, England: Springer-Verlag
New York, Inc., 1990, pp. 322–335. isbn: 0-387-52826-1. url: http://dl.
acm.org/citation.cfm?id=90397.90438 (Cited on page 20).
[AEL+11] Peter Altenbernd, Andreas Ermedahl, Björn Lisper, and Jan Gustafsson.
“Automatic Generation of Timing Models for Timing Analysis of High-Level
Code”. In: Proc. 19th International Conference on Real-Time and Network
Systems (RTNS2011). Ed. by Sébastien Faucou. The IRCCyN lab., 09/2011.
url: http://www.es.mdh.se/publications/2134- (Cited on page 22).
[AEL10] Björn Andersson, Arvind Easwaran, and Jinkyu Lee. “Finding an Upper
Bound on the Increase in Execution Time Due to Contention on the Memory
Bus in COTS-based Multicore Systems”. In: SIGBED Rev. 7.1 (01/2010),
4:1–4:4. issn: 1551-3688. doi: 10.1145/1851166.1851172. url: http://
doi.acm.org/10.1145/1851166.1851172 (Cited on page 87).
[AEP+08] Alexandru Andrei, Petru Eles, Zebo Peng, and Jakob Rosen. “Predictable
Implementation of Real-Time Applications on Multiprocessor Systems-on-
Chip”. In: Proceedings of the 21st International Conference on VLSI Design.
VLSID ’08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 103–
110. isbn: 0-7695-3083-4. doi: 10.1109/VLSI.2008.33. url: http://dx.
doi.org/10.1109/VLSI.2008.33 (Cited on pages 101, 148).
[AGP03] Mathieu Avila, Maxime Glaizot, and Isabelle Puaut. “Impact of Automatic
Gain Time Identiﬁcation on Tree-Based Static WCET Analysis”. In: WCET.
2003, pp. 71–74 (Cited on page 74).
[AHL+08] Sebastian Altmeyer, Christian Hümbert, Björn Lisper, and Reinhard Wil-
helm. “Parametric Timing Analysis for Complex Architectures”. In: Proceed-
ings of the 2008 14th IEEE International Conference on Embedded and Real-
Time Computing Systems and Applications. RTCSA ’08. Washington, DC,
USA: IEEE Computer Society, 2008, pp. 367–376. isbn: 978-0-7695-3349-0.
doi: 10.1109/RTCSA.2008.7. url: http://dx.doi.org/10.1109/RTCSA.
2008.7 (Cited on page 21).
[ALS+07] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeﬀrey D. Ullman. Compil-
ers: Principles, Techniques, and Tools. 2nd edition. Pearson Education, 2007.
isbn: 0321486811 (Cited on pages 12, 15, 39, 58, 155).
[Alt00] Altera / Eureka Technology, Inc. PCI Bus Arbiter. http://www.altera.
com/products/ip/iup/pci/m-eur-pci-bus-arb.html. 2000 (Cited on
page 85).
[AMR10] Sebastian Altmeyer, Claire Maiza, and Jan Reineke. “Resilience Analysis:
Tightening the CRPD Bound for Set-associative Caches”. In: Proceedings of
the ACM SIGPLAN/SIGBED 2010 Conference on Languages, Compilers,
and Tools for Embedded Systems. LCTES ’10. Stockholm, Sweden: ACM,
2010, pp. 153–162. isbn: 978-1-60558-953-4. doi: 10.1145/1755888.1755911.
url: http://doi.acm.org/10.1145/1755888.1755911 (Cited on page 31).
[ARM04] ARM Ltd. ARM7TDMI Technical Reference Manual. Revision: r4p1. ARM
DDI 0210C. 11/2004 (Cited on pages 44, 67).
[ARM05] ARM Ltd. ARM Architecture Reference Manual. I. ARM DDI 0100I. 110
Fulbourn Road Cambridge, England CB1 9NJ, 07/2005 (Cited on pages 43,
48, 51, 53).
[ARM14a] ARM Ltd. ARM Cortex-R Series. http://www.arm.com/products/
processors/cortex-r/index.php. 2014 (Cited on page 23).
[ARM14b] ARM Ltd. ARM Processor Families. http://www.arm.com/products/
processors/classic/arm7/index.php. 2014 (Cited on page 6).
[AUT09] AUTOSAR Administration. Speciﬁcation of Multi-Core OS Architecture
V1.0.0. R4.0 Rev 1. 2009 (Cited on pages 29, 33).
[Bay08] Nimrod Bayer. US Patent 60986659: Shared Memory System for a Tightly-
Coupled Multiprocessor. 11/2008 (Cited on page 84).
[BB06] Adam Betts and Guillem Bernat. “Tree-Based WCET Analysis on Instru-
mentation Point Graphs”. In: Proceedings of the Ninth IEEE International
Symposium on Object and Component-Oriented Real-Time Distributed Com-
puting. ISORC ’06. Washington, DC, USA: IEEE Computer Society, 2006,
pp. 558–565. isbn: 0-7695-2561-X. doi: 10.1109/ISORC.2006.75. url: http:
//dx.doi.org/10.1109/ISORC.2006.75 (Cited on page 74).
[BC08] Clément Ballabriga and Hugues Cassé. “Improving the First-Miss Compu-
tation in Set-Associative Instruction Caches”. In: Proceedings of the 2008
Euromicro Conference on Real-Time Systems. ECRTS ’08. Washington, DC,
USA: IEEE Computer Society, 2008, pp. 341–350. isbn: 978-0-7695-3298-1.
doi: 10.1109/ECRTS.2008.34. url: http://dx.doi.org/10.1109/ECRTS.
2008.34 (Cited on page 70).
[BC11] Jean-Luc Béchennec and Franck Cassez. Computation of WCET using Pro-
gram Slicing and Real-Time Model-Checking. Research Report. IRCCyN/C-
NRS, 05/2011 (Cited on page 20).
[BCM09] Clément Ballabriga, Hugues Cassé, and Marianne De Michiel. “A Generic
Framework for Blackbox Components in WCET Computation”. In: 9th In-
ternational Workshop on Worst-Case Execution Time Analysis (WCET’09).
Ed. by Niklas Holsti. Vol. 10. OpenAccess Series in Informatics (OASIcs). also
published in print by Austrian Computer Society (OCG) with ISBN 978-3-
85403-252-6. Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum fuer
Informatik, 2009, pp. 1–12. isbn: 978-3-939897-14-9. doi: http://dx.doi.
org/10.4230/OASIcs.WCET.2009.2290. url: http://drops.dagstuhl.de/
opus/volltexte/2009/2290 (Cited on page 28).
[BEL11] Stefan Bygde, Andreas Ermedahl, and Björn Lisper. “An Eﬃcient Algorithm
for Parametric WCET Calculation”. In: Journal of Systems Architecture 57.6
(06/2011), pp. 614–624. issn: 1383-7621. doi: 10.1016/j.sysarc.2010.06.
009. url: http://dx.doi.org/10.1016/j.sysarc.2010.06.009 (Cited on
page 21).
[BHQ+07] A.O. Balkan, M.N. Horak, Gang Qu, and Uzi Vishkin. “Layout-Accurate
Design and Implementation of a High-Throughput Interconnection Network
for Single-Chip Parallel Processing”. In: 15th Annual IEEE Symposium on
High-Performance Interconnects, 2007. HOTI 2007. 08/2007, pp. 21–28.
doi: 10.1109/HOTI.2007.11 (Cited on page 84).
[BHV11] Sébastien Bardin, Philippe Herrmann, and Franck Védrine. “Reﬁnement-
based CFG Reconstruction from Unstructured Programs”. In: Proceedings of
the 12th International Conference on Veriﬁcation, Model Checking, and Ab-
stract Interpretation. VMCAI’11. Austin, TX, USA: Springer-Verlag, 2011,
pp. 54–69. isbn: 978-3-642-18274-7. url: http://dl.acm.org/citation.
cfm?id=1946284.1946290 (Cited on page 49).
[BJM11] Rajeev Balasubramonian, Norman P. Jouppi, and Naveen Muralimanohar.
“Multi-Core Cache Hierarchies”. In: Synthesis Lectures on Computer
Architecture. Ed. by University of Wisconsin Mark D. Hill.
Vol. Lecture #17. Morgan & Claypool, 2011. isbn: 9781598297546. doi:
10.2200/S00365ED1V01Y201105CAC017 (Cited on page 82).
[BKJ+01] R.H. Bell Jr., Chang Yong Kang, L. John, and E.E. Swartzlander. “CDMA as
a multiprocessor interconnect strategy”. In: Conference Record of the Thirty-
Fifth Asilomar Conference on Signals, Systems and Computers, 2001. Vol. 2.
11/2001, 1246–1250 vol.2. doi: 10.1109/ACSSC.2001.987690 (Cited on
page 84).
[BL08] Stefan Bygde and Björn Lisper. “Towards an Automatic Parametric WCET
Analysis”. In: 8th International Workshop on Worst-Case Execution Time
Analysis (WCET’08). Ed. by Raimund Kirner. Vol. 8. OpenAccess Series in
Informatics (OASIcs). also published in print by Austrian Computer Society
(OCG) with ISBN 978-3-85403-237-3. Dagstuhl, Germany: Schloss Dagstuhl
– Leibniz-Zentrum fuer Informatik, 2008. isbn: 978-3-939897-10-1. doi: http:
//dx.doi.org/10.4230/OASIcs.WCET.2008.1659. url: http://drops.
dagstuhl.de/opus/volltexte/2008/1659 (Cited on page 21).
[Bli02] Johann Blieberger. “Data-Flow Frameworks for Worst-Case Execution Time
Analysis”. In: Real-Time Systems 22.3 (05/2002), pp. 183–227. issn: 0922-
6443. doi: 10.1023/A:1014535317056. url: http://dx.doi.org/10.1023/
A:1014535317056 (Cited on page 21).
[BLL+11] Dai Bui, Edward Lee, Isaac Liu, Hiren Patel, and Jan Reineke. “Temporal
Isolation on Multiprocessing Architectures”. In: Proceedings of the 48th De-
sign Automation Conference. DAC ’11. San Diego, California: ACM, 2011,
pp. 274–279. isbn: 978-1-4503-0636-2. doi: 10.1145/2024724.2024787. url:
http://doi.acm.org/10.1145/2024724.2024787 (Cited on page 89).
[BLT+03] Stefan Bleuler, Marco Laumanns, Lothar Thiele, and Eckart Zitzler. “PISA: A
Platform and Programming Language Independent Interface for Search Algo-
rithms”. In: Proceedings of the 2nd International Conference on Evolutionary
Multi-criterion Optimization. EMO’03. Faro, Portugal: Springer-Verlag, 2003,
pp. 494–508. isbn: 3-540-01869-7. url: http://dl.acm.org/citation.cfm?
id=1760102.1760144 (Cited on page 150).
[BM11] Balasubramanya Bhat and Frank Mueller. “Making DRAM Refresh Pre-
dictable”. In: Real-Time Systems 47.5 (09/2011), pp. 430–453. issn: 0922-
6443. doi: 10.1007/s11241-011-9129-6. url: http://dx.doi.org/10.
1007/s11241-011-9129-6 (Cited on pages 23, 45, 82).
[BMV12] Andrea Baldovin, Enrico Mezzetti, and Tullio Vardanega. “A
Time-composable Operating System”. In: WCET. 2012, pp. 69–80 (Cited on
page 30).
[Bor13] Hendrik Borghorst. “Schedulingverfahren zur WCET-Reduktion in eingebet-
teten Multicore-Systemen”. Master’s Thesis. TU Dortmund, 2013 (Cited on
page 10).
[Bor96] Hans Borjesson. “Incorporating Worst Case Execution Time in a Commer-
cial C-compiler”. Undergraduate Thesis. Department of Computer Systems,
Uppsala University, 1996 (Cited on page 38).
[BRA09] Claire Burguière, Jan Reineke, and Sebastian Altmeyer. “Cache-Related Pre-
emption Delay Computation for Set-Associative Caches - Pitfalls and Solu-
tions.” In: WCET. 2009 (Cited on pages 31, 70).
[BSI+08] Gogul Balakrishnan, Sriram Sankaranarayanan, Franjo Ivančić, Ou Wei, and
Aarti Gupta. “SLR: Path-Sensitive Analysis through Infeasible-Path Detec-
tion and Syntactic Language Reﬁnement”. English. In: Static Analysis. Ed. by
María Alpuente and Germán Vidal. Vol. 5079. Lecture Notes in Computer Sci-
ence. Springer Berlin Heidelberg, 2008, pp. 238–254. isbn: 978-3-540-69163-1.
url: http://dx.doi.org/10.1007/978-3-540-69166-2_16 (Cited on
page 18).
[Buc00] William Buchanan. Computer Busses. Newton, MA, USA:
Butterworth-Heinemann, 2000. isbn: 0340740760 (Cited on page 85).
[BW13] David Blaza and Alex Wolfe. UBM Tech 2013 Embedded Market Study. De-
sign West. San Jose, CA, 2013 (Cited on page 2).
[BY04] Johan Bengtsson and Wang Yi. “Timed Automata: Semantics, Algorithms
and Tools”. English. In: Lectures on Concurrency and Petri Nets. Ed. by
Jörg Desel, Wolfgang Reisig, and Grzegorz Rozenberg. Vol. 3098. Lecture
Notes in Computer Science. Springer Berlin Heidelberg, 2004, pp. 87–124.
isbn: 978-3-540-22261-3. doi: 10.1007/978-3-540-27755-2_3. url:
http://dx.doi.org/10.1007/978-3-540-27755-2_3 (Cited on
page 20).
[BZT+11] Sven Bünte, Michael Zolda, Michael Tautschnig, and Raimund Kirner. “Im-
proving the Conﬁdence in Measurement-Based Timing Analysis”. In: ISORC.
2011, pp. 144–151 (Cited on page 22).
[CBR13] Sudipta Chattopadhyay, Abhijeet Banerjee, and Abhik Roychoudhury. “Pre-
cise Micro-architectural Modeling for WCET Analysis via AI+SAT”. In: Pro-
ceedings of the 2013 IEEE 19th Real-Time and Embedded Technology and Ap-
plications Symposium (RTAS). RTAS ’13. Washington, DC, USA: IEEE Com-
puter Society, 2013, pp. 87–96. isbn: 978-1-4799-0186-9. doi: 10.1109/RTAS.
2013.6531082. url: http://dx.doi.org/10.1109/RTAS.2013.6531082
(Cited on page 74).
[CC77] Patrick Cousot and Radhia Cousot. “Abstract Interpretation: A Uniﬁed Lat-
tice Model for Static Analysis of Programs by Construction or Approxima-
tion of Fixpoints”. In: Proceedings of the 4th ACM SIGACT-SIGPLAN Sym-
posium on Principles of Programming Languages. POPL ’77. Los Angeles,
California: ACM, 1977, pp. 238–252. doi: 10.1145/512950.512973. url:
http://doi.acm.org/10.1145/512950.512973 (Cited on page 12).
[CC80] Patrick Cousot and Radhia Cousot. “Semantic analysis of communicating
sequential processes”. English. In: Automata, Languages and Programming.
Ed. by Jaco de Bakker and Jan van Leeuwen. Vol. 85. Lecture Notes in
Computer Science. Springer Berlin Heidelberg, 1980, pp. 119–133. isbn: 978-
3-540-10003-4. url: http://dx.doi.org/10.1007/3-540-10003-2_65
(Cited on page 121).
[CC84] Patrick Cousot and Radhia Cousot. “Invariance Proof Methods and Analy-
sis Techniques For Parallel Programs”. In: Automatic Program Construction
Techniques. Ed. by A.W. Biermann, G. Guiho, and Y. Kodratoﬀ. Macmillan,
New York, United States, 1984. Chap. 12, pp. 243–271 (Cited on page 121).
[CCK+13] Che-Wei Chang, Jian-Jia Chen, Tei-Wei Kuo, and H. Falk. “Real-time par-
titioned scheduling on multi-core systems with local and global memories”.
In: Design Automation Conference (ASP-DAC), 2013 18th Asia and South
Paciﬁc. 01/2013, pp. 467–472. doi: 10.1109/ASPDAC.2013.6509640 (Cited
on page 34).
[CCM97] Sérgio Campos, Edmund Clarke, and Marius Minea. “The verus tool: A quan-
titative approach to the formal veriﬁcation of real-time systems”. English. In:
Computer Aided Veriﬁcation. Ed. by Orna Grumberg. Vol. 1254. Lecture
Notes in Computer Science. Springer Berlin Heidelberg, 1997, pp. 452–455.
isbn: 978-3-540-63166-8. doi: 10.1007/3-540-63166-6_46. url: http:
//dx.doi.org/10.1007/3-540-63166-6_46 (Cited on page 20).
[CCR+14] Sudipta Chattopadhyay, Lee Kee Chong, Abhik Roychoudhury, Timon Kel-
ter, Peter Marwedel, and Heiko Falk. “A Uniﬁed WCET Analysis Frame-
work for Multicore Platforms”. In: ACM Transactions on Embedded Comput-
ing Systems 13.4s (04/2014), 124:1–124:29. issn: 1539-9087. doi: 10.1145/
2584654. url: http://doi.acm.org/10.1145/2584654 (Cited on page 87).
[CEN+13] Daniel Cordes, Michael Engel, Olaf Neugebauer, and Peter Marwedel. “Au-
tomatic Extraction of Multi-Objective Aware Parallelism for Heterogeneous
MPSoCs”. In: Proceedings of the Sixth International Workshop on Multi-
/Many-core Computing Systems (MuCoCoS 2013). MuCoCoS 2013. Edin-
burgh, Scotland, UK, 09/2013 (Cited on page 33).
[CFG+10] Christoph Cullmann, Christian Ferdinand, Gernot Gebhard, Daniel Grund,
Claire Maiza, Jan Reineke, Benoît Triquet, Simon Wegener, and Reinhard
Wilhelm. “Predictability Considerations in the Design of Multi-Core Embed-
ded Systems”. In: Ingénieurs de l’Automobile 807 (09/2010), pp. 36–42. issn:
0020-1200 (Cited on pages 45, 89).
[CI92] Jyh-Herng Chow and Williams Ludwell Harrison III. “A General Frame-
work for Analyzing Shared-Memory Parallel Programs.” In: ICPP (2). 1992,
pp. 192–199 (Cited on page 122).
[CKR+12] Sudipta Chattopadhyay, Chong Lee Kee, Abhik Roychoudhury, Timon Kel-
ter, Heiko Falk, and Peter Marwedel. “A Uniﬁed WCET Analysis Framework
for Multi-Core Platforms”. In: IEEE Real-Time and Embedded Technology
and Applications Symposium (RTAS). Beijing, China, 04/2012, pp. 99–108
(Cited on pages 33, 88).
[CM07] Christoph Cullmann and Florian Martin. “Data-Flow Based Detection of
Loop Bounds”. In: WCET. 2007 (Cited on page 74).
[CMR+05] Ting Chen, Tulika Mitra, Abhik Roychoudhury, and Vivy Suhendra. “Exploiting
branch constraints without exhaustive path enumeration”. In: 5th
International Workshop on Worst-Case Execution Time Analysis (WCET).
2005 (Cited on page 74).
[CN98] Marek Chrobak and John Noga. “LRU is Better Than FIFO”. In: Proceedings
of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA
’98. San Francisco, California, USA: Society for Industrial and Applied Math-
ematics, 1998, pp. 78–81. isbn: 0-89871-410-9. url: http://dl.acm.org/
citation.cfm?id=314613.314655 (Cited on page 70).
[Cor10] IBM Corporation. User’s Manual for CPLEX 12.2. 2010 (Cited on page 1).
[Cou01] Patrick Cousot. “Abstract Interpretation Based Formal Methods and Future
Challenges”. In: Informatics - 10 Years Back. 10 Years Ahead. London, UK:
Springer-Verlag, 2001, pp. 138–156. isbn: 3-540-41635-8. url: http:
//dl.acm.org/citation.cfm?id=647348.724445 (Cited on pages 16, 63).
[CP01] Antoine Colin and Isabelle Puaut. “A Modular & Retargetable Framework
for Tree-Based WCET Analysis”. In: Proceedings of the 13th Euromicro Con-
ference on Real-Time Systems. ECRTS ’01. Washington, DC, USA: IEEE
Computer Society, 2001, pp. 37–. url: http://dl.acm.org/citation.cfm?
id=871910.871918 (Cited on page 74).
[CQV+13] Francisco J. Cazorla, Eduardo Quiñones, Tullio Vardanega, Liliana Cucu,
Benoit Triquet, Guillem Bernat, Emery Berger, Jaume Abella, Franck Wartel,
Michael Houston, Luca Santinelli, Leonidas Kosmidis, Code Lo, and Dorin
Maxim. “PROARTIS: Probabilistically Analyzable Real-Time Systems”. In:
ACM Trans. Embed. Comput. Syst. 12.2s (05/2013), 94:1–94:26. issn: 1539-
9087. doi: 10.1145/2465787.2465796. url: http://doi.acm.org/10.
1145/2465787.2465796 (Cited on page 22).
[CR09] Sudipta Chattopadhyay and Abhik Roychoudhury. “Uniﬁed Cache Modeling
for WCET Analysis and Layout Optimizations”. In: Proceedings of the 2009
30th IEEE Real-Time Systems Symposium. RTSS ’09. Washington, DC, USA:
IEEE Computer Society, 2009, pp. 47–56. isbn: 978-0-7695-3875-4. doi: 10.
1109/RTSS.2009.20. url: http://dx.doi.org/10.1109/RTSS.2009.20
(Cited on page 70).
[CRM10] Sudipta Chattopadhyay, Abhik Roychoudhury, and Tulika Mitra. “Modeling
Shared Cache and Bus in Multi-cores for Timing Analysis”. In: Proceedings of
the 13th International Workshop on Software & Compilers for Embedded
Systems. SCOPES ’10. St. Goar, Germany: ACM, 2010, 6:1–6:10. isbn: 978-
1-4503-0084-1. doi: 10.1145/1811212.1811220. url: http://doi.acm.
org/10.1145/1811212.1811220 (Cited on pages 10, 88, 105).
[CSB+10] H. Cassé, P. Sainrat, C. Ballabriga, and M. de Michiel. “Experimentation
of WCET Computation on Both Ends of Automotive Processor Range”. In:
Proceedings of the 1st Workshop on Critical Automotive Applications: Robustness
& Safety. CARS ’10. Valencia, Spain: ACM, 2010, pp. 67–
70. isbn: 978-1-60558-915-2. doi: 10.1145/1772643.1772663. url: http:
//doi.acm.org/10.1145/1772643.1772663 (Cited on page 23).
[CT94] Chia-Mei Chen and Satish K. Tripathi. Multiprocessor Priority Ceiling Based
Protocols. Tech. rep. College Park, MD, USA, 1994 (Cited on page 33).
[CVJ+08] Ravi Chugh, Jan W. Voung, Ranjit Jhala, and Sorin Lerner. “Dataﬂow Anal-
ysis for Concurrent Programs Using Datarace Detection”. In: Proceedings of
the 2008 ACM SIGPLAN Conference on Programming Language Design and
Implementation. PLDI ’08. Tucson, AZ, USA: ACM, 2008, pp. 316–326. isbn:
978-1-59593-860-2. doi: 10.1145/1375581.1375620. url: http://doi.acm.
org/10.1145/1375581.1375620 (Cited on page 122).
[DAN+11] Dakshina Dasari, Bjorn Andersson, Vincent Nelis, Stefan M. Petters, Arvind
Easwaran, and Jinkyu Lee. “Response Time Analysis of COTS-Based Multicores
Considering the Contention on the Shared Memory Bus”. In: Proceedings
of the 2011 IEEE 10th International Conference on Trust, Security and
Privacy in Computing and Communications. TRUSTCOM ’11. Washington,
DC, USA: IEEE Computer Society, 2011, pp. 1068–1075. isbn: 978-0-7695-
4600-1. doi: 10.1109/TrustCom.2011.146. url: http://dx.doi.org/10.
1109/TrustCom.2011.146 (Cited on page 87).
[DBB+07] Robert I. Davis, Alan Burns, Reinder J. Bril, and Johan J. Lukkien. “Con-
troller Area Network (CAN) Schedulability Analysis: Refuted, Revisited and
Revised”. In: Real-Time Systems 35.3 (04/2007), pp. 239–272. issn: 0922-
6443. doi: 10.1007/s11241-007-9012-7. url: http://dx.doi.org/10.
1007/s11241-007-9012-7 (Cited on page 35).
[DDY06] Dinakar Dhurjati, Manuvir Das, and Yue Yang. “Path-Sensitive Dataﬂow
Analysis with Iterative Reﬁnement”. In: Proceedings of the 13th Interna-
tional Conference on Static Analysis. SAS’06. Seoul, Korea: Springer-Verlag,
2006, pp. 425–442. isbn: 3-540-37756-5, 978-3-540-37756-6. doi: 10.1007/
11823230_27. url: http://dx.doi.org/10.1007/11823230_27 (Cited
on page 18).
[DMS08] Damian Dechev, Rabi N. Mahapatra, and Bjarne Stroustrup. “Practical and
Veriﬁable C++ Dynamic Cast for Hard Real-Time Systems.” In: 2008,
pp. 375–393 (Cited on page 38).
[DZ10] Yiqiang Ding and Wei Zhang. “Improving the Static Real-time Scheduling
on Multicore Processors by Reducing Worst-case Inter-thread Cache Inter-
ferences”. In: Proceedings of the 48th Annual Southeast Regional Conference.
ACM SE ’10. Oxford, Mississippi: ACM, 2010, 108:1–108:4. isbn: 978-1-4503-
0064-3. doi: 10.1145/1900008.1900148. url: http://doi.acm.org/10.
1145/1900008.1900148 (Cited on page 155).
[EBS+11] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankar-
alingam, and Doug Burger. “Dark Silicon and the End of Multicore Scaling”.
In: Proceedings of the 38th Annual International Symposium on Computer
Architecture. ISCA ’11. San Jose, California, USA: ACM, 2011, pp. 365–
376. isbn: 978-1-4503-0472-6. doi: 10.1145/2000064.2000108. url: http:
//doi.acm.org/10.1145/2000064.2000108 (Cited on page 6).
[EGL11] Andreas Ermedahl, Jan Gustafsson, and Björn Lisper. “Deriving WCET
bounds by abstract execution”. In: Proc. 11th International Workshop on
Worst-Case Execution Time (WCET) Analysis (WCET 2011). 2011 (Cited
on pages 74, 166).
[Ele12] ElektronikPraxis. “ElektronikPraxis”. In: (02/2012), p. 15 (Cited on pages 2,
44).
[EPB+06] J. Eisinger, I. Polian, B. Becker, and A. Metzner. “Automatic Identiﬁcation of
Timing Anomalies for Cycle-Accurate Worst-Case Execution Time Analysis”.
In: Design and Diagnostics of Electronic Circuits and systems, 2006 IEEE.
04/2006, pp. 13–18. doi: 10.1109/DDECS.2006.1649563 (Cited on page 26).
[ES08] Michael Engel and Olaf Spinczyk. “System-on-chip Integration of Embedded
Automotive Controllers”. In: Proceedings of the 1st Workshop on Isolation
and Integration in Embedded Systems. IIES ’08. Glasgow, Scotland: ACM,
2008, pp. 29–34. isbn: 978-1-60558-126-2. doi: 10.1145/1435458.1435464.
url: http://doi.acm.org/10.1145/1435458.1435464 (Cited on page 1).
[Evi14] Evidence. ERIKA Enterprise - Open Source RTOS OSEK/VDX Kernel.
http://erika.tuxfamily.org/drupal. 2014 (Cited on page 40).
[FDG+09] Alberto Ferrari, Marco Di Natale, Giacomo Gentile, Giovanni Reggiani, and
Paolo Gai. “Time and Memory Tradeoﬀs in the Implementation of AUTOSAR
Components”. In: Proceedings of the Conference on Design, Automation and
Test in Europe. DATE ’09. Nice, France: European Design and Automation
Association, 2009, pp. 864–869. isbn: 978-3-9810801-5-5. url: http://dl.
acm.org/citation.cfm?id=1874620.1874830 (Cited on page 29).
[FH04] Christian Ferdinand and Reinhold Heckmann. “aiT: Worst-Case Execution
Time Prediction by Static Program Analysis”. English. In: Building the
Information Society. Ed. by René Jacquart. Vol. 156. IFIP International
Federation for Information Processing. Springer US, 2004, pp. 377–383. isbn:
978-1-4020-8156-9. doi: 10.1007/978-1-4020-8157-6_29. url: http:
//dx.doi.org/10.1007/978-1-4020-8157-6_29 (Cited on page 20).
[Fis81] J. A. Fisher. “Trace Scheduling: A Technique for Global Microcode Com-
paction”. In: IEEE Trans. Comput. 30.7 (07/1981), pp. 478–490. issn: 0018-
9340. doi: 10.1109/TC.1981.1675827. url: http://dx.doi.org/10.1109/
TC.1981.1675827 (Cited on pages 155, 160).
[FKP+07] Elena Fersman, Pavel Krcal, Paul Pettersson, and Wang Yi. “Task automata:
Schedulability, decidability and undecidability”. In: Information and Compu-
tation 205.8 (2007), pp. 1149–1172 (Cited on page 20).
[FL10] Heiko Falk and Paul Lokuciejewski. “A compiler framework for the reduc-
tion of worst-case execution times”. In: Journal on Real-Time Systems 46.2
(10/2010). DOI 10.1007/s11241-010-9101-x, pp. 251–300 (Cited on pages 9,
20, 37, 39).
[FLT06] Heiko Falk, Paul Lokuciejewski, and Henrik Theiling. “Design of a WCET-
Aware C Compiler”. In: 6th International Workshop on Worst-Case Execution
Time Analysis (WCET). Dresden/Germany, 07/2006 (Cited on page 37).
[Fre09] Freescale Semiconductor, Inc. Embedded Multicore: An Introduction, Rev.
0. www.freescale.com. Document Number: EMBMCRM. 2009 (Cited on
pages 5, 32, 45).
[FW86] Philip J. Fleming and John J. Wallace. “How Not to Lie with Statistics: The
Correct Way to Summarize Benchmark Results”. In: Commun. ACM 29.3
(03/1986), pp. 218–221. issn: 0001-0782. doi: 10.1145/5666.5673. url:
http://doi.acm.org/10.1145/5666.5673 (Cited on page 78).
[FW99] Christian Ferdinand and Reinhard Wilhelm. “Eﬃcient and Precise Cache
Behavior Prediction for Real-Time Systems”. In: Real-Time Systems 17.2-3
(12/1999), pp. 131–181. issn: 0922-6443. doi: 10.1023/A:1008186323068.
url: http://dx.doi.org/10.1023/A:1008186323068 (Cited on page 70).
[GAE+09] Jan Gustafsson, Peter Altenbernd, Andreas Ermedahl, and Björn Lisper.
“Approximate Worst-Case Execution Time Analysis for Early Stage Embed-
ded Systems Development”. In: Proceedings of the 7th IFIP WG 10.2 Inter-
national Workshop on Software Technologies for Embedded and Ubiquitous
Systems. SEUS ’09. Newport Beach, CA: Springer-Verlag, 2009, pp. 308–319.
isbn: 978-3-642-10264-6. doi: 10.1007/978-3-642-10265-3_28. url:
http://dx.doi.org/10.1007/978-3-642-10265-3_28 (Cited on
page 22).
[GE07] Jan Gustafsson and Andreas Ermedahl. “Experiences from Applying WCET
Analysis in Industrial Settings”. In: Proceedings of the 10th IEEE Interna-
tional Symposium on Object and Component-Oriented Real-Time Distributed
Computing. ISORC ’07. Washington, DC, USA: IEEE Computer Society,
2007, pp. 382–392. isbn: 0-7695-2765-5. doi: 10.1109/ISORC.2007.36. url:
http://dx.doi.org/10.1109/ISORC.2007.36 (Cited on pages 21, 23 sq.).
[Geb10] Gernot Gebhard. “Timing Anomalies Reloaded”. In: Proceedings of 10th In-
ternational Workshop on Worst-Case Execution Time (WCET) Analysis. Ed.
by Björn Lisper. Austrian Computer Society, 07/2010, pp. 5–15 (Cited on
page 26).
[GEL+10] Andreas Gustavsson, Andreas Ermedahl, Björn Lisper, and Paul Petters-
son. “Towards WCET Analysis of Multicore Architectures Using UPPAAL”.
In: 10th International Workshop on Worst-Case Execution Time Analysis
(WCET 2010). Ed. by Björn Lisper. Vol. 15. OpenAccess Series in Informatics
(OASIcs). The printed version of the WCET’10 proceedings is published by
OCG (www.ocg.at) - ISBN 978-3-85403-268-7. Dagstuhl, Germany: Schloss
Dagstuhl – Leibniz-Zentrum fuer Informatik, 2010, pp. 101–112. isbn: 978-
3-939897-21-7. doi: http://dx.doi.org/10.4230/OASIcs.WCET.2010.101.
url: http://drops.dagstuhl.de/opus/volltexte/2010/2830 (Cited on
pages 20, 33, 88).
[GGL12] Andreas Gustavsson, Jan Gustafsson, and Björn Lisper. “Toward Static Tim-
ing Analysis of Parallel Software”. In: 12th International Workshop on Worst-
Case Execution Time Analysis. Ed. by Tullio Vardanega. Vol. 23. OpenAc-
cess Series in Informatics (OASIcs). Dagstuhl, Germany: Schloss Dagstuhl –
Leibniz-Zentrum fuer Informatik, 2012, pp. 38–47. isbn: 978-3-939897-41-5.
doi: http://dx.doi.org/10.4230/OASIcs.WCET.2012.38. url: http:
//drops.dagstuhl.de/opus/volltexte/2012/3555 (Cited on page 88).
[GH10] Kees Goossens and Andreas Hansson. “The Aethereal Network on Chip After
Ten Years: Goals, Evolution, Lessons, and Future”. In: Proceedings of the 47th
Design Automation Conference. DAC ’10. Anaheim, California: ACM, 2010,
pp. 306–311. isbn: 978-1-4503-0002-5. doi: 10.1145/1837274.1837353. url:
http://doi.acm.org/10.1145/1837274.1837353 (Cited on page 86).
[GHJ+95] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design
Patterns: Elements of Reusable Object-oriented Software. Boston, MA, USA:
Addison-Wesley Longman Publishing Co., Inc., 1995. isbn: 0-201-63361-2
(Cited on page 1).
[GHK+11] Peter Gliwa, Jens Harnisch, Ursula Kelling, and Christoph Ficek. “From
Single-Core to Multi-Core Platforms - Systematic Migration of Hard Real-
Time Software in AUTOSAR”. In: Embedded World 28. 2011, pp. 979–992
(Cited on page 33).
[GKS+11] Ganesh Gopalakrishnan, Robert M. Kirby, Stephen Siegel, Rajeev Thakur,
William Gropp, Ewing Lusk, Bronis R. De Supinski, Martin Schulz, and
Greg Bronevetsky. “Formal Analysis of MPI-based Parallel Programs”. In:
Commun. ACM 54.12 (12/2011), pp. 82–91. issn: 0001-0782. doi: 10.1145/
2043174.2043194. url: http://doi.acm.org/10.1145/2043174.2043194
(Cited on page 122).
[GLM11] Antonio González, Fernando Latorre, and Grigorios Magklis. “Processor
Microarchitecture: An Implementation Perspective”. In: Synthesis Lectures
on Computer Architecture. Ed. by Mark D. Hill, University of Wisconsin.
Vol. Lecture #12. Morgan & Claypool, 2011. isbn: 9781608454532. doi:
10.2200/S00309ED1V01Y201011CAC012 (Cited on page 63).
[GR10] Daniel Grund and Jan Reineke. “Precise and Eﬃcient FIFO-Replacement
Analysis Based on Static Phase Detection”. In: Proceedings of the 2010 22nd
Euromicro Conference on Real-Time Systems. ECRTS ’10. Washington, DC,
USA: IEEE Computer Society, 2010, pp. 155–164. isbn: 978-0-7695-4111-2.
doi: 10.1109/ECRTS.2010.8. url: http://dx.doi.org/10.1109/ECRTS.
2010.8 (Cited on page 70).
[GRE+01] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and
R. B. Brown. “MiBench: A Free, Commercially Representative Embedded
Benchmark Suite”. In: Proceedings of the Workload Characterization, 2001.
WWC-4. 2001 IEEE International Workshop. WWC ’01. Washington, DC,
USA: IEEE Computer Society, 2001, pp. 3–14. isbn: 0-7803-7315-4. doi: 10.
1109/WWC.2001.15. url: http://dx.doi.org/10.1109/WWC.2001.15
(Cited on pages 76, 150).
[Gro08] Ian Grout. Digital Systems Design with FPGAs and CPLDs. Newton, MA,
USA: Newnes, 2008. isbn: 075068397X, 9780750683975 (Cited on page 44).
[GS93] Dirk Grunwald and Harini Srinivasan. “Data Flow Equations for Explicitly
Parallel Programs”. In: Proceedings of the Fourth ACM SIGPLAN Symposium
on Principles and Practice of Parallel Programming. PPOPP ’93. San Diego,
California, USA: ACM, 1993, pp. 159–168. isbn: 0-89791-589-5. doi: 10.
1145/155332.155349. url: http://doi.acm.org/10.1145/155332.155349
(Cited on page 122).
[Gün13] Christian Günter. “Unterstützung modularer WCET-Analyse durch
annotierte Binärobjekte”. Bachelor’s Thesis. TU Dortmund, 2013 (Cited on
pages 9, 48).
[GZ10] Satya Mohan Raju Gudidevuni and Wei Zhang. “A Time-predictable Dual-
core Prototype on FPGA”. In: Proceedings of the 48th Annual Southeast
Regional Conference. ACM SE ’10. Oxford, Mississippi: ACM, 2010, 7:1–
7:4. isbn: 978-1-4503-0064-3. doi: 10.1145/1900008.1900020. url: http:
//doi.acm.org/10.1145/1900008.1900020 (Cited on page 90).
[Har13] Tim Harde. “Vergleichende Studie von Arbitrierungsverfahren für Kommu-
nikationsstrukturen in eingebetteten Multicoresystemen”. Bachelor’s Thesis.
TU Dortmund, 2013 (Cited on page 9).
[HBH+11] J. Herter, P. Backes, F. Haupenthal, and J. Reineke. “CAMA: A Predictable
Cache-Aware Memory Allocator”. In: Real-Time Systems (ECRTS), 2011
23rd Euromicro Conference on. 07/2011, pp. 23–32. doi: 10.1109/ECRTS.
2011.11 (Cited on page 71).
[HE05] Arne Hamann and Rolf Ernst. “TDMA Time Slot and Turn Optimization
with Evolutionary Search Techniques”. In: Proceedings of the Conference on
Design, Automation and Test in Europe - Volume 1. DATE ’05. Washington,
DC, USA: IEEE Computer Society, 2005, pp. 312–317. isbn: 0-7695-2288-2.
doi: 10.1109/DATE.2005.299. url: http://dx.doi.org/10.1109/DATE.
2005.299 (Cited on pages 148 sq.).
[HG11] Andreas Hansson and Kees Goossens. On-Chip Interconnect with aelite: Com-
posable and Predictable Systems. Embedded Systems. Springer New York,
2011. isbn: 978-1-4419-6496-0 (Cited on page 86).
[HG12] Sebastian Hahn and Daniel Grund. “Relational Cache Analysis for Static
Timing Analysis”. In: Proceedings of the 24th Euromicro Conference on Real-
Time Systems (ECRTS ’12). Pisa, Italy, 07/2012, pp. 102–111. isbn: 978-1-
4673-2032-0. doi: 10.1109/ECRTS.2012.14 (Cited on page 71).
[HGB+08] Niklas Holsti, Jan Gustafsson, Guillem Bernat, Clément Ballabriga, Armelle
Bonenfant, Roman Bourgade, Hugues Cassé, Daniel Cordes, Albrecht Kadlec,
Raimund Kirner, Jens Knoop, Paul Lokuciejewski, Nicholas Merriam, Mari-
anne de Michiel, Adrian Prantl, Bernhard Rieder, Christine Rochange, Pascal
Sainrat, and Markus Schordan. “WCET 2008 – Report from the Tool Challenge
2008 – 8th Intl. Workshop on Worst-Case Execution Time (WCET)
Analysis”. In: 8th International Workshop on Worst-Case Execution Time
Analysis (WCET’08). Ed. by Raimund Kirner. Vol. 8. OpenAccess Series in
Informatics (OASIcs). Prague / Czech Republic: Schloss Dagstuhl – Leibniz-Zentrum
fuer Informatik, 09/2008. isbn: 978-3-939897-10-1 (Cited on
page 24).
[HGG+01] González Harbour, Gutiérrez García, Palencia Gutiérrez, and Drake Moyano.
“Mast: Modeling and analysis suite for real time applications”. In: 13th Eu-
romicro Conference on Real-Time Systems. IEEE. 2001, pp. 125–134 (Cited
on page 32).
[HHJ+05] R. Henia, A. Hamann, M. Jersak, R. Racu, K. Richter, and R. Ernst. “System
level performance analysis - the SymTA/S approach”. In: IEEE Proceedings
on Computers and Digital Techniques 152.2 (03/2005), pp. 148–166. issn:
1350-2387. doi: 10.1049/ip-cdt:20045088 (Cited on page 32).
[HKB+14] Chen-Wei Huang, Timon Kelter, Bjoern Boenninghoﬀ, Jan Kleinsorge,
Michael Engel, Peter Marwedel, and Shiao-Li Tsao. “Static WCET
Analysis of the H.264/AVC Decoder Exploiting Coding Information”. In:
International Conference on Embedded and Real-Time Computing Systems
and Applications (RTCSA). IEEE. Chongqing, China, 08/2014 (Cited on
pages 21, 24).
[HMC+93] Wen-Mei W. Hwu, Scott A. Mahlke, William Y. Chen, Pohua P. Chang,
Nancy J. Warter, Roger A. Bringmann, Roland G. Ouellette, Richard E.
Hank, Tokuzo Kiyohara, Grant E. Haab, John G. Holm, and Daniel M. Lav-
ery. “The Superblock: An Eﬀective Technique for VLIW and Superscalar
Compilation”. In: J. Supercomput. 7.1-2 (05/1993), pp. 229–248. issn: 0920-
8542. doi: 10.1007/BF01205185. url: http://dx.doi.org/10.1007/
BF01205185 (Cited on page 160).
[HMM12] Julien Henry, David Monniaux, and Matthieu Moy. “PAGAI: A Path Sensi-
tive Static Analyser”. In: Electron. Notes Theor. Comput. Sci. 289 (12/2012),
pp. 15–25. issn: 1571-0661. doi: 10.1016/j.entcs.2012.11.003. url:
http://dx.doi.org/10.1016/j.entcs.2012.11.003 (Cited on pages 18,
53).
[Höf12] Kai Höﬁg. “Failure-Dependent Timing Analysis - A New Methodology for
Probabilistic Worst-Case Execution Time Analysis”. English. In: Measure-
ment, Modelling, and Evaluation of Computing Systems and Dependability
and Fault Tolerance. Ed. by Jens B. Schmitt. Vol. 7201. Lecture Notes in
Computer Science. Springer Berlin Heidelberg, 2012, pp. 61–75. isbn: 978-3-
642-28539-4. doi: 10.1007/978-3-642-28540-0_5. url: http://dx.doi.
org/10.1007/978-3-642-28540-0_5 (Cited on page 22).
[HP08] Damien Hardy and Isabelle Puaut. “WCET Analysis of Multi-level Non-
inclusive Set-Associative Instruction Caches”. In: Proceedings of the 2008
Real-Time Systems Symposium. RTSS ’08. Washington, DC, USA: IEEE
Computer Society, 2008, pp. 456–466. isbn: 978-0-7695-3477-0. doi: 10.1109/
RTSS.2008.10. url: http://dx.doi.org/10.1109/RTSS.2008.10 (Cited
on page 73).
[HP09] Damien Hardy and Isabelle Puaut. “Estimation of Cache Related Migration
Delays for Multi-Core Processors with Shared Instruction Caches”. In: 17th
International Conference on Real-Time and Network Systems. Ed. by Laurent
George and Maryline Chetto and Mikael Sjodin. Paris, France, 2009, pp. 45–
54. url: http://hal.inria.fr/inria-00441959 (Cited on page 87).
[HPP11] Benedikt Huber, Wolfgang Puﬃtsch, and Peter Puschner. “Towards an open
timing analysis platform”. In: 11th International Workshop on Worst-Case
Execution Time Analysis. 07/2011 (Cited on pages 20, 38).
[HPP12] Benedikt Huber, Daniel Prokesch, and Peter Puschner. “A Formal Framework
for Precise Parametric WCET Formulas”. In: 12th International Workshop on
Worst-Case Execution Time Analysis. Ed. by Tullio Vardanega. Vol. 23. Ope-
nAccess Series in Informatics (OASIcs). Dagstuhl, Germany: Schloss Dagstuhl
– Leibniz-Zentrum fuer Informatik, 2012, pp. 91–102. isbn: 978-3-939897-41-
5. doi: http://dx.doi.org/10.4230/OASIcs.WCET.2012.91. url: http:
//drops.dagstuhl.de/opus/volltexte/2012/3560 (Cited on page 21).
[HPS12] Benedikt Huber, Wolfgang Puﬃtsch, and Martin Schoeberl. “Worst-case Ex-
ecution Time Analysis-driven Object Cache Design”. In: Concurrency and
Computation: Practice & Experience 24.8 (06/2012), pp. 753–771. issn: 1532-
0626. doi: 10.1002/cpe.1763. url: http://dx.doi.org/10.1002/cpe.
1763 (Cited on page 23).
[HRW13] Sebastian Hahn, Jan Reineke, and Reinhard Wilhelm. “Towards Composi-
tionality in Execution Time Analysis – Deﬁnition and Challenges”. In: CRTS.
12/2013 (Cited on pages 27 sq.).
[HS09] Benedikt Huber and Martin Schoeberl. “Comparison of Implicit Path Enumeration
and Model Checking Based WCET Analysis”. In: WCET. 2009
(Cited on page 20).
[HT07] Wolfgang Haid and Lothar Thiele. “Complex Task Activation Schemes in
System Level Performance Analysis”. In: Proceedings of the 5th IEEE/ACM
International Conference on Hardware/Software Codesign and System Syn-
thesis. CODES+ISSS ’07. Salzburg, Austria: ACM, 2007, pp. 173–178. isbn:
978-1-59593-824-4. doi: 10.1145/1289816.1289860. url: http://doi.acm.
org/10.1145/1289816.1289860 (Cited on page 34).
[HT98] Maria Handjieva and Stanislav Tzolovski. “Reﬁning Static Analyses by Trace-
Based Partitioning Using Control Flow”. English. In: Static Analysis. Ed.
by Giorgio Levi. Vol. 1503. Lecture Notes in Computer Science. Springer
Berlin Heidelberg, 1998, pp. 200–214. isbn: 978-3-540-65014-0. url: http:
//dx.doi.org/10.1007/3-540-49727-7_12 (Cited on page 17).
[HZX12] Yazhi Huang, Mengying Zhao, and Chun Jason Xue. “WCET-aware Re-
Scheduling Register Allocation for Real-Time Embedded Systems with Clus-
tered VLIW Architecture”. In: International Conference on Languages, Com-
pilers, Tools and Theory for Embedded Systems. 2012 (Cited on page 155).
[Inf08] Inﬁneon Technologies AG. TriCore 1, Instruction Set V1.3 & V1.3.1 Archi-
tecture. User’s Manual V1.3.8. 01/2008 (Cited on page 43).
[Inf09] Inﬁneon Technologies AG. TC1767 32-Bit Single-Chip Microcontroller. User’s
Manual V1.1. 05/2009 (Cited on pages 44 sqq.).
[Inf14] Inﬁneon Technologies AG. AURIX microcontroller. 2014 (Cited on page 23).
[Jac09] Bruce Jacob. “The Memory System: You Can’t Avoid It, You Can’t Ignore It,
You Can’t Fake It”. In: Synthesis Lectures on Computer Architecture. Ed. by
Mark D. Hill, University of Wisconsin. Vol. Lecture #7. Morgan & Claypool,
2009. isbn: 9781598295887. doi: 10.2200/S00201ED1V01Y200907CAC007
(Cited on page 82).
[JMR10] J. Schneider, M. Bohn, and R. Rößger. “Migration of Automotive Real-Time
Software to Multicore Systems: First Steps towards an Automated Solution”.
In: Proceedings Work-In-Progress Session of the 22nd Euromicro Conference
on Real-Time Systems. ECRTS’10. Brussels, Belgium, 07/2010, pp. 37–40
(Cited on page 33).
[Joh14] R. Colin Johnson. “IBM Puts Brain On-a-Chip”. In: EE Times (08/2014)
(Cited on page 5).
[JP86] M. Joseph and P. Pandya. “Finding Response Times in a Real-Time System”.
In: The Computer Journal 29.5 (1986), pp. 390–395. doi: 10.1093/comjnl/
29.5.390. eprint: http://comjnl.oxfordjournals.org/content/29/5/
390.full.pdf+html. url: http://comjnl.oxfordjournals.org/content/
29/5/390.abstract (Cited on page 30).
[KAL14] KALRAY Corporation. MPPA 256 - Many-core Processors. http://www.
kalray.eu. 2014 (Cited on page 23).
[KB03] Hermann Kopetz and G. Bauer. “The time-triggered architecture”. In: Pro-
ceedings of the IEEE 91.1 (01/2003), pp. 112–126. issn: 0018-9219. doi:
10.1109/JPROC.2002.805821 (Cited on pages 34, 85).
[KFM+11] Timon Kelter, Heiko Falk, Peter Marwedel, Sudipta Chattopadhyay, and
Abhik Roychoudhury. “Bus-Aware Multicore WCET Analysis through
TDMA Oﬀset Bounds”. In: Proceedings of the 23rd Euromicro Conference
on Real-Time Systems (ECRTS). Porto, Portugal, 07/2011, pp. 3–12 (Cited
on pages 10, 88, 96, 101, 104, 113).
[KFM+14] Timon Kelter, Heiko Falk, Peter Marwedel, Sudipta Chattopadhyay, and Ab-
hik Roychoudhury. “Static Analysis of Multi-Core TDMA Resource Arbitra-
tion Delays”. English. In: Real-Time Systems 50.2 (03/2014), pp. 185–229.
issn: 0922-6443. doi: 10.1007/s11241-013-9189-x. url: http://dx.
doi.org/10.1007/s11241-013-9189-x (Cited on pages 10, 33, 87 sq., 96,
104 sq., 113).
[KFM11] Jan C. Kleinsorge, Heiko Falk, and Peter Marwedel. “A Synergetic Approach
To Accurate Analysis Of Cache-Related Preemption Delay”. In: Proceedings
of the International Conference on Embedded Software (EMSOFT). Taipei,
Taiwan, 10/2011, pp. 329–338 (Cited on page 31).
[KFM13] Jan Kleinsorge, Heiko Falk, and Peter Marwedel. “Simple Analysis of Partial
Worst-case Execution Paths on General Control Flow Graphs”. In: Proceed-
ings of the International Conference on Embedded Software (EMSOFT 2013).
EMSOFT 2013. Montreal, Canada, 10/2013 (Cited on pages 74, 130, 166).
[KHM+13] Timon Kelter, Tim Harde, Peter Marwedel, and Heiko Falk. “Evaluation of
Resource Arbitration Methods for Multi-Core Real-Time Systems”. In: Pro-
ceedings of the 13th International Workshop on Worst-Case Execution Time
Analysis (WCET). Ed. by Claire Maiza. Paris, France, 07/2013 (Cited on
pages 10, 96, 113, 119).
[Kil73] Gary A. Kildall. “A Uniﬁed Approach to Global Program Optimization”.
In: Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on
Principles of Programming Languages. POPL ’73. Boston, Massachusetts:
ACM, 1973, pp. 194–206. doi: 10.1145/512927.512945. url: http://doi.
acm.org/10.1145/512927.512945 (Cited on page 12).
[Kir12] Raimund Kirner. “The WCET Analysis Tool Calcwcet167”. In: Proceedings of
the 5th International Conference on Leveraging Applications of Formal Meth-
ods, Veriﬁcation and Validation: Applications and Case Studies - Volume
Part II. ISoLA’12. Heraklion, Crete, Greece: Springer-Verlag, 2012, pp. 158–
172. isbn: 978-3-642-34031-4. doi: 10.1007/978-3-642-34032-1_17. url:
http://dx.doi.org/10.1007/978-3-642-34032-1_17 (Cited on
pages 20, 38).
[KKP+11] Raimund Kirner, Jens Knoop, Adrian Prantl, Markus Schordan, and Al-
brecht Kadlec. “Beyond Loop Bounds: Comparing Annotation Languages for
Worst-case Execution Time Analysis”. In: Software and Systems Modeling
10.3 (07/2011), pp. 411–437. issn: 1619-1366. doi: 10.1007/s10270-010-
0161-0. url: http://dx.doi.org/10.1007/s10270-010-0161-0 (Cited on
pages 19, 42, 76).
[KKP09] Raimund Kirner, Albrecht Kadlec, and Peter Puschner. “Precise Worst-Case
Execution Time Analysis for Processors with Timing Anomalies”. In: Proceed-
ings of the 2009 21st Euromicro Conference on Real-Time Systems. ECRTS
’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 119–128. isbn:
978-0-7695-3724-5. doi: 10.1109/ECRTS.2009.8. url: http://dx.doi.org/
10.1109/ECRTS.2009.8 (Cited on page 26).
[KM14] Timon Kelter and Peter Marwedel. “Parallelism Analysis: Precise WCET
Values for Complex Multi-Core Systems”. In: Third International Workshop
on Formal Techniques for Safety-Critical Systems (FTSCS). Ed. by
Cyrille Artho and Peter Ölveczky. Luxembourg: Springer, 11/2014 (Cited
on pages 10, 120).
[KMB14] Timon Kelter, Peter Marwedel, and Hendrik Borghorst. “WCET-aware
Scheduling Optimizations for Multi-Core Real-Time Systems”. In:
International Conference on Embedded Computer Systems: Architectures,
Modeling, and Simulation (SAMOS). Samos, Greece, 07/2014 (Cited on
pages 10, 147).
[KP01] R. Kirner and P. Puschner. “Transformation of path information for WCET
analysis during compilation”. In: Real-Time Systems, 13th Euromicro Con-
ference on, 2001. 2001, pp. 29–36. doi: 10.1109/EMRTS.2001.933993 (Cited
on page 38).
[KPP10] Raimund Kirner, Peter Puschner, and Adrian Prantl. “Trans-
forming Flow Information During Code Optimization for Timing
Analysis”. In: Real-Time Systems 45.1-2 (06/2010), pp. 72–105.
issn: 0922-6443. doi: 10.1007/s11241-010-9091-8. url:
http://dx.doi.org/10.1007/s11241-010-9091-8 (Cited on page 40).
[KS92] Jens Knoop and Bernhard Steﬀen. “The interprocedural coincidence theo-
rem”. English. In: Compiler Construction. Ed. by Uwe Kastens and Peter
Pfahler. Vol. 641. Lecture Notes in Computer Science. Springer Berlin Hei-
delberg, 1992, pp. 125–140. isbn: 978-3-540-55984-9. doi: 10.1007/3-540-
55984-1_13. url: http://dx.doi.org/10.1007/3-540-55984-1_13
(Cited on page 16).
[KSP+12] Daniel Kästner, Marc Schlickling, Markus Pister, Christoph Cullmann,
Gernot Gebhard, Reinhold Heckmann, and Christian Ferdinand. “Meeting
Real-time Requirements with Multi-core Processors”. In: Proceedings of the
2012 International Conference on Computer Safety, Reliability, and Security.
SAFECOMP’12. Magdeburg, Germany: Springer-Verlag, 2012, pp. 117–131.
isbn: 978-3-642-33674-4. doi: 10.1007/978-3-642-33675-1_10. url:
http://dx.doi.org/10.1007/978-3-642-33675-1_10 (Cited on
page 89).
[KSV96] Jens Knoop, Bernhard Steﬀen, and Jürgen Vollmer. “Parallelism for Free:
Eﬃcient and Optimal Bitvector Analyses for Parallel Programs”. In: ACM
Trans. Program. Lang. Syst. 18.3 (05/1996), pp. 268–299. issn: 0164-0925.
doi: 10.1145/229542.229545. url: http://doi.acm.org/10.1145/
229542.229545 (Cited on page 122).
[KV08] Johannes Kinder and Helmut Veith. “Jakstab: A Static Analysis Platform for
Binaries”. In: Proceedings of the 20th International Conference on Computer
Aided Veriﬁcation. CAV ’08. Princeton, NJ, USA: Springer-Verlag, 2008,
pp. 423–427. isbn: 978-3-540-70543-7. doi: 10.1007/978-3-540-70545-
1_40. url: http://dx.doi.org/10.1007/978-3-540-70545-1_40
(Cited on page 49).
[KV12] Bernhard Korte and Jens Vygen. Combinatorial Optimization: Theory and
Algorithms. 5th edition. Springer, 2012 (Cited on page 74).
[KVA+13] Leonidas Kosmidis, Tullio Vardanega, Jaume Abella, Eduardo Quiñones, and
Francisco J. Cazorla. “Applying Measurement-Based Probabilistic Timing
Analysis to Buﬀer Resources”. In: 13th International Workshop on Worst-
Case Execution Time Analysis. Ed. by Claire Maiza. Vol. 30. OpenAccess
Series in Informatics (OASIcs). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum
fuer Informatik, 2013, pp. 97–108. isbn: 978-3-939897-54-5. doi:
http://dx.doi.org/10.4230/OASIcs.WCET.2013.97. url: http://drops.
dagstuhl.de/opus/volltexte/2013/4126 (Cited on page 22).
[KY06] Amir Kamil and Katherine Yelick. “Concurrency Analysis for Parallel Pro-
grams with Textually Aligned Barriers”. In: Proceedings of the 18th Inter-
national Conference on Languages and Compilers for Parallel Computing.
LCPC’05. Hawthorne, NY: Springer-Verlag, 2006, pp. 185–199. isbn: 3-540-
69329-7, 978-3-540-69329-1. doi: 10.1007/978-3-540-69330-7_13. url:
http://dx.doi.org/10.1007/978-3-540-69330-7_13 (Cited on
page 121).
[KYM+09] Florian Kluge, Chenglong Yu, Jörg Mische, Sascha Uhrig, and Theo Ungerer.
“Implementing AUTOSAR Scheduling and Resource Management on an Embedded
SMT Processor”. In: Proceedings of the 12th International Workshop on
Software and Compilers for Embedded Systems. SCOPES ’09. Nice, France:
ACM, 2009, pp. 33–42. isbn: 978-1-60558-696-0. url: http://dl.acm.org/
citation.cfm?id=1543820.1543828 (Cited on page 33).
[KZR09] Raimund Kirner, W. Zimmermann, and D. Richter. “On undecidability re-
sults of real programming languages”. In: Proc. of Kolloquium Programmier-
sprachen und Grundlagen der Programmierung. 2009 (Cited on page 3).
[KZV09] Johannes Kinder, Florian Zuleger, and Helmut Veith. “An Abstract
Interpretation-Based Framework for Control Flow Reconstruction
from Binaries”. In: Proceedings of the 10th International Conference
on Veriﬁcation, Model Checking, and Abstract Interpretation. VMCAI
’09. Savannah, GA: Springer-Verlag, 2009, pp. 214–228. isbn:
978-3-540-93899-6. doi: 10.1007/978-3-540-93900-9_19. url:
http://dx.doi.org/10.1007/978-3-540-93900-9_19 (Cited on
page 49).
[LAL+13] Jing Li, Kunal Agrawal, Chenyang Lu, and Christopher Gill. Analysis of
Global EDF for Parallel Tasks. Tech. rep. Campus Box 1045 - St. Louis,
MO - 63130: Department of Computer Science & Engineering - Washington
University in St. Louis, 2013 (Cited on page 34).
[Lam78] Leslie Lamport. “Time, Clocks, and the Ordering of Events in a Distributed
System”. In: Commun. ACM 21.7 (07/1978), pp. 558–565. issn: 0001-0782.
doi: 10.1145/359545.359563. url: http://doi.acm.org/10.1145/
359545.359563 (Cited on page 86).
[LBJ+95] Sung-Soo Lim, Young Hyun Bae, Gyu Tae Jang, Byung-Do Rhee, Sang Lyul
Min, Chang Yun Park, Heonshik Shin, Kunsoo Park, Soo-Mook Moon, and
Chong Sang Kim. “An Accurate Worst Case Timing Analysis for RISC Pro-
cessors”. In: IEEE Transactions on Software Engineering 21.7 (07/1995),
pp. 593–604. issn: 0098-5589. doi: 10.1109/32.392980. url: http://
dx.doi.org/10.1109/32.392980 (Cited on page 70).
[LBR10] Karthik Lakshmanan, Gaurav Bhatia, and Raj Rajkumar. “Integrated End-
to-end Timing Analysis of Networked AUTOSAR-compliant Systems”. In:
Proceedings of the Conference on Design, Automation and Test in Europe.
DATE ’10. Dresden, Germany: European Design and Automation Associa-
tion, 2010, pp. 331–334. isbn: 978-3-9810801-6-2. url: http://dl.acm.org/
citation.cfm?id=1870926.1871007 (Cited on page 35).
[LBS+10] Arun Lakhotia, Davidson R. Boccardo, Anshuman Singh, and Aleardo Man-
acero Jr. “Context-sensitive Analysis Without Calling-context”. In: Higher
Order Symbol. Comput. 23.3 (09/2010), pp. 275–313. issn: 1388-3690. doi:
10.1007/s10990-011-9080-1. url: http://dx.doi.org/10.1007/
s10990-011-9080-1 (Cited on page 58).
[LCF+09] Paul Lokuciejewski, Daniel Cordes, Heiko Falk, and Peter Marwedel. “A Fast
and Precise Static Loop Analysis based on Abstract Interpretation, Program
Slicing and Polytope Models”. In: International Symposium on Code Gener-
ation and Optimization (CGO). Seattle / USA, 03/2009, pp. 136–146 (Cited
on page 41).
[LCS92] Corinna G. Lee, Paul Chow, and Mark G. Stoodley. UTDSP
Benchmark Suite. University of Toronto, 10 King’s College
Road, Toronto, Canada M5S 3G4, 1992. url:
http://www.eecg.toronto.edu/~corinna/DSP/infrastructure/UTDSP.html
(Cited on pages 76, 150).
[Ler09] Xavier Leroy. “Formal Veriﬁcation of a Realistic Compiler”. In: Commun.
ACM 52.7 (07/2009), pp. 107–115. issn: 0001-0782. doi: 10.1145/1538788.
1538814. url: http://doi.acm.org/10.1145/1538788.1538814 (Cited on
page 39).
[LES+13] Björn Lisper, Andreas Ermedahl, Dietmar Schreiner, Jens Knoop, and Peter
Gliwa. “Practical experiences of applying source-level WCET ﬂow analysis
to industrial code”. English. In: International Journal on Software Tools for
Technology Transfer 15.1 (2013), pp. 53–63. issn: 1433-2779. doi: 10.1007/
s10009-012-0255-9. url: http://dx.doi.org/10.1007/s10009-012-
0255-9 (Cited on pages 23 sq.).
[LGZ+09] Mingsong Lv, Nan Guan, Yi Zhang, Qingxu Deng, Ge Yu, and Jianming
Zhang. “A Survey of WCET Analysis of Real-Time Operating Systems”. In:
International Conference on Embedded Software and Systems, 2009. 05/2009,
pp. 65–72. doi: 10.1109/ICESS.2009.24 (Cited on pages 29 sq.).
[LHP09] Benjamin Lesage, Damien Hardy, and Isabelle Puaut. “WCET Analysis
of Multi-Level Set-Associative Data Caches”. In: WCET. 2009 (Cited on
page 73).
[LHP10] Benjamin Lesage, Damien Hardy, and Isabelle Puaut. “Shared Data Caches
Conﬂicts Reduction for WCET Computation in Multi-Core Architectures.”
English. In: 18th International Conference on Real-Time and Network Sys-
tems. Toulouse, France, 11/2010, p. 2283. url: http://hal.inria.fr/
inria-00531214 (Cited on page 87).
[LHS+98] Chang-Gun Lee, Joosun Hahn, Yang-Min Seo, Sang Lyul Min, Rhan Ha,
Seongsoo Hong, Chang Yun Park, Minsuk Lee, and Chong Sang Kim. “Anal-
ysis of cache-related preemption delay in ﬁxed-priority preemptive schedul-
ing”. In: Computers, IEEE Transactions on 47.6 (1998), pp. 700–713 (Cited
on pages 30 sq.).
[Liu00] Jane W. S. W. Liu. Real-Time Systems. 1st. Upper Saddle River, NJ, USA:
Prentice Hall PTR, 2000. isbn: 0130996513 (Cited on page 86).
[LLM+07] Xianfeng Li, Yun Liang, Tulika Mitra, and Abhik Roychoudhury. “Chro-
nos: A Timing Analyzer for Embedded Software”. In: Science of Computer
Programming 69.1-3 (12/2007), pp. 56–67. issn: 0167-6423. doi: 10.1016/
j.scico.2007.01.014. url: http://dx.doi.org/10.1016/j.scico.2007.
01.014 (Cited on page 20).
[LM10] Paul Lokuciejewski and Peter Marwedel. Worst-Case Execution Time Aware
Compilation Techniques for Real-Time Systems. Springer, 11/2010 (Cited on
page 39).
[LM97] Yau-Tsun Steven Li and Sharad Malik. “Performance analysis of embed-
ded software using implicit path enumeration”. In: IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems 16.12 (12/1997),
pp. 1477–1487. issn: 0278-0070. doi: 10.1109/43.664229 (Cited on pages 56,
74).
[LMW95] Y.-T. S. Li, S. Malik, and A. Wolfe. “Eﬃcient Microarchitecture Model-
ing and Path Analysis for Real-time Software”. In: Proceedings of the 16th
IEEE Real-Time Systems Symposium. RTSS ’95. Washington, DC, USA:
IEEE Computer Society, 1995, pp. 298–. isbn: 0-8186-7337-0. url: http:
//dl.acm.org/citation.cfm?id=827267.828940 (Cited on page 20).
[LMW96] Y.-T. S. Li, S. Malik, and A. Wolfe. “Cache Modeling for Real-time Soft-
ware: Beyond Direct Mapped Instruction Caches”. In: Proceedings of the 17th
IEEE Real-Time Systems Symposium. RTSS ’96. Washington, DC, USA:
IEEE Computer Society, 1996, pp. 254–. isbn: 0-8186-7689-2. url: http:
//dl.acm.org/citation.cfm?id=827268.828947 (Cited on pages 20, 70).
[LPF+10] Paul Lokuciejewski, Sascha Plazar, Heiko Falk, Peter Marwedel, and
Lothar Thiele. “Multi-Objective Exploration of Compiler Optimizations for
Real-Time Systems”. In: Proceedings of the 13th International Symposium
on Object/Component/Service-oriented Real-time Distributed Computing
(ISORC). Carmona / Spain, 05/2010, pp. 115–122 (Cited on page 37).
[LPM97] Chunho Lee, M. Potkonjak, and W.H. Mangione-Smith. “MediaBench: A Tool
for Evaluating and Synthesizing Multimedia and Communications Systems”.
In: Proceedings of the Thirtieth IEEE/ACM International Symposium on Mi-
croarchitecture. 12/1997, pp. 330–335. doi: 10.1109/MICRO.1997.645830
(Cited on pages 76, 150).
[LPT09] Kai Lampka, Simon Perathoner, and Lothar Thiele. “Analytic Real-time
Analysis and Timed Automata: A Hybrid Method for Analyzing Embedded
Real-time Systems”. In: Proceedings of the Seventh ACM International Con-
ference on Embedded Software. EMSOFT ’09. Grenoble, France: ACM, 2009,
pp. 107–116. isbn: 978-1-60558-627-4. doi: 10.1145/1629335.1629351. url:
http://doi.acm.org/10.1145/1629335.1629351 (Cited on page 34).
[LPW09] Philipp Lucas, Oleg Parshin, and Reinhard Wilhelm. “Operating Mode Spe-
ciﬁc WCET Analysis”. In: Proceedings of the 3rd Junior Researcher Workshop
on Real-Time Computing (JRWRTC). Ed. by Charlotte Seidner. 10/2009,
pp. 15–18 (Cited on page 21).
[LRB+12] I. Liu, J. Reineke, D. Broman, M. Zimmer, and E. A. Lee. “A PRET microarchitecture
implementation with repeatable timing and competitive performance”.
In: Computer Design (ICCD), 2012 IEEE 30th International Conference on.
09/2012, pp. 87–93. doi: 10.1109/ICCD.2012.6378622 (Cited
on pages 23, 89).
[LRL10] Isaac Liu, Jan Reineke, and Edward A. Lee. “A PRET Architecture
Supporting Concurrent Programs with Composable Timing Properties”. In:
44th Asilomar Conference on Signals, Systems, and Computers. 11/7/2010,
pp. 2111–2115. url: http://chess.eecs.berkeley.edu/pubs/803.html
(Cited on page 89).
[LRM06] Xianfeng Li, Abhik Roychoudhury, and Tulika Mitra. “Modeling Out-of-order
Processors for WCET Analysis”. In: Real-Time Systems 34.3 (11/2006),
pp. 195–227. issn: 0922-6443. doi: 10.1007/s11241-006-9205-5. url:
http://dx.doi.org/10.1007/s11241-006-9205-5 (Cited on pages 66,
88).
[LSL+09] Yan Li, Vivy Suhendra, Yun Liang, Tulika Mitra, and Abhik Roychoudhury.
“Timing Analysis of Concurrent Programs Running on Shared Cache Multi-
Cores”. In: Proceedings of the 2009 30th IEEE Real-Time Systems Symposium.
RTSS ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 57–67.
isbn: 978-0-7695-3875-4. doi: 10.1109/RTSS.2009.32. url: http://dx.
doi.org/10.1109/RTSS.2009.32 (Cited on pages 87 sq., 91, 94).
[Ltd14] Rapita Systems Ltd. RapiTime Explained. http://www.rapitasystems.
com/products/rapitime. 2014 (Cited on page 22).
[LTH02] Marc Langenbach, Stephan Thesing, and Reinhold Heckmann.
“Pipeline Modeling for Timing Analysis”. In: Proceedings of the 9th
International Symposium on Static Analysis. SAS ’02. London, UK:
Springer-Verlag, 2002, pp. 294–309. isbn: 3-540-44235-9. url:
http://dl.acm.org/citation.cfm?id=647171.716098 (Cited on page 66).
[LYG+10] Mingsong Lv, Wang Yi, Nan Guan, and Ge Yu. “Combining Abstract In-
terpretation with Model Checking for Timing Analysis of Multicore Soft-
ware”. In: Proceedings of the 2010 31st IEEE Real-Time Systems Symposium.
RTSS ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 339–
349. isbn: 978-0-7695-4298-0. doi: 10.1109/RTSS.2010.30. url: http:
//dx.doi.org/10.1109/RTSS.2010.30 (Cited on pages 33, 88).
[MA11] P. Montag and S. Altmeyer. “Precise WCET calculation in highly variant real-
time systems”. In: Design, Automation Test in Europe Conference Exhibition
(DATE), 2011. 03/2011, pp. 1–6. doi: 10.1109/DATE.2011.5763149 (Cited
on page 21).
[Mäl05] Mälardalen WCET research group. MRTC Benchmarks. 2005. url: http:
//www.mrtc.mdh.se/projects/wcet/benchmarks.html (Cited on pages 76,
150).
[Mar10] Amine Marref. “Compositional Timing Analysis”. In: SAMOS X - International
Conference on Embedded Computer Systems: Architectures, Modeling,
and Simulation. IEEE, 07/2010. url: http://www.es.mdh.se/publications/1819-
(Cited on page 28).
[Mar11] Peter Marwedel. Embedded System Design. 2nd. Secaucus, NJ, USA:
Springer-Verlag New York, Inc., 2011. isbn: 978-94-007-0257-8 (Cited on
pages 1, 32 sq., 82, 86).
[MB11] Robert Mittermayr and Johann Blieberger. Shared Memory Concurrent Sys-
tem Veriﬁcation using Kronecker Algebra. Tech. rep. 183/1-155. Cornell Uni-
versity Library, 2011 (Cited on page 31).
[MB12] Robert Mittermayr and Johann Blieberger. “Timing Analysis of Concurrent
Programs”. In: 12th International Workshop on Worst-Case Execution Time
Analysis (WCET 2012). 2012, pp. 59–68 (Cited on page 123).
[MGU+10] Jörg Mische, Irakli Guliashvili, Sascha Uhrig, and Theo Ungerer. “How to
Enhance a Superscalar Processor to Provide Hard Real-time Capable In-order
SMT”. In: Proceedings of the 23rd International Conference on Architecture of
Computing Systems. ARCS’10. Hannover, Germany: Springer-Verlag, 2010,
pp. 2–14. isbn: 3-642-11949-2, 978-3-642-11949-1. doi: 10.1007/978-3-642-11950-7_2.
url: http://dx.doi.org/10.1007/978-3-642-11950-7_2 (Cited on page 89).
[Mid12] Samuel P. Midkiﬀ. “Automatic Parallelization: An Overview of
Fundamental Compiler Techniques”. In: Synthesis Lectures on
Computer Architecture. Ed. by Mark D. Hill (University of Wisconsin).
Vol. Lecture #19. Morgan & Claypool, 2012. isbn: 9781608458424. doi:
10.2200/S003401D1V01Y201201CAC019 (Cited on page 33).
[Min12] Antoine Miné. “Static Analysis of Run-Time Errors In Embedded
Real-Time Parallel C Programs”. In: Logical Methods in Computer
Science 8.1:26 (03/2012), p. 63. doi: 10.2168/LMCS-8. url:
http://hal.archives-ouvertes.fr/hal-00748098 (Cited on page 122).
[MIS13] MISRA Ltd. Guidelines for the Use of the C Language in Critical Systems.
03/2013 (Cited on pages 2, 19).
[MMH01] G. Memik, W.H. Mangione-Smith, and W. Hu. “NetBench: A Benchmarking
Suite for Network Processors”. In: IEEE/ACM International Conference on
Computer Aided Design (ICCAD). 11/2001, pp. 39–42. doi: 10.1109/ICCAD.
2001.968595 (Cited on page 76).
[MMU11] Stefan Metzlaﬀ, Jörg Mische, and Theo Ungerer. “A Real-Time Capable
Many-Core Model”. In: Proceedings of 32nd IEEE Real-Time Systems Sym-
posium: Work-in-Progress Session. Vienna, Austria, 2011 (Cited on page 89).
[MPH14] Daniel Muench, Michael Paulitsch, and Andreas Herkersdorf. “Temporal Sep-
aration for Hardware-Based I/O Virtualization for Mixed-Criticality Embed-
ded Real-Time Systems Using PCIe SR-IOV”. In: 27th International Confer-
ence on Architecture of Computing Systems (ARCS). 02/2014, pp. 1–7 (Cited
on page 90).
[MR12] Mohamed Abdel Maksoud and Jan Reineke. “An Empirical Evaluation of
the Inﬂuence of the Load-Store Unit on WCET Analysis”. In: 12th Interna-
tional Workshop on Worst-Case Execution Time Analysis. Ed. by Tullio Var-
danega. Vol. 23. OpenAccess Series in Informatics (OASIcs). Dagstuhl, Ger-
many: Schloss Dagstuhl – Leibniz-Zentrum fuer Informatik, 2012, pp. 13–24.
isbn: 978-3-939897-41-5. doi: http://dx.doi.org/10.4230/OASIcs.WCET.
2012.13. url: http://drops.dagstuhl.de/opus/volltexte/2012/3553
(Cited on page 23).
[MR99] Mark Moir and Srikanth Ramamurthy. “Pfair Scheduling of Fixed and Mi-
grating Periodic Tasks on Multiple Resources”. In: Proceedings of the 20th
IEEE Real-Time Systems Symposium. RTSS ’99. Washington, DC, USA:
IEEE Computer Society, 1999, pp. 294–. isbn: 0-7695-0475-2. url: http:
//dl.acm.org/citation.cfm?id=827271.829072 (Cited on page 34).
[MRT14] MRTC Research Group. SWEET (SWEdish Execution Time tool). http:
//www.mrtc.mdh.se/projects/wcet/sweet/index.html. 2014 (Cited on
pages 20, 38).
[MUK+08] J. Mische, S. Uhrig, F. Kluge, and T. Ungerer. “Exploiting spare resources of
in-order SMT processors executing hard real-time threads”. In: Computer De-
sign, 2008. ICCD 2008. IEEE International Conference on. 10/2008, pp. 371–
376. doi: 10.1109/ICCD.2008.4751887 (Cited on page 89).
[Myc07] Alan Mycroft. “Programming Language Design and Analysis Motivated by
Hardware Evolution”. In: Proceedings of the 14th International Conference on
Static Analysis. SAS’07. Kongens Lyngby, Denmark: Springer-Verlag, 2007,
pp. 18–33. isbn: 3-540-74060-0, 978-3-540-74060-5. url: http://dl.acm.
org/citation.cfm?id=2391451.2391454 (Cited on page 1).
[NAC99] Gleb Naumovich, George S. Avrunin, and Lori A. Clarke. “An Eﬃcient Algorithm
for Computing MHP Information for Concurrent Java Programs”.
English. In: Software Engineering — ESEC/FSE ’99. Ed. by Oscar Nier-
strasz and Michel Lemoine. Vol. 1687. Lecture Notes in Computer Science.
Springer Berlin Heidelberg, 1999, pp. 338–354. isbn: 978-3-540-66538-0. url:
http://dx.doi.org/10.1007/3-540-48166-4_21 (Cited on page 121).
[NKJ10] Armand Navabi, Nicholas Kidd, and Suresh Jagannathan. Path-Sensitive
Analysis Using Edge Strings. Tech. rep. 10-006. Purdue University, Depart-
ment of Computer Science, 2010 (Cited on pages 17 sq.).
[NMR03] Hemendra Singh Negi, Tulika Mitra, and Abhik Roychoudhury. “Accurate
Estimation of Cache-related Preemption Delay”. In: Proceedings of the 1st
IEEE/ACM/IFIP International Conference on Hardware/Software Codesign
and System Synthesis. CODES+ISSS ’03. Newport Beach, CA, USA: ACM,
2003, pp. 201–206. isbn: 1-58113-742-7. doi: 10.1145/944645.944698. url:
http://doi.acm.org/10.1145/944645.944698 (Cited on page 31).
[NN13] Farhang Nemati and Thomas Nolte. “Resource sharing among real-time com-
ponents under multiprocessor clustered scheduling”. English. In: Real-Time
Systems 49.5 (2013), pp. 580–613. issn: 0922-6443. doi: 10.1007/s11241-
013-9180-6. url: http://dx.doi.org/10.1007/s11241-013-9180-6
(Cited on page 34).
[NP13] Jan Nowotsch and Michael Paulitsch. “Quality of service capabilities for hard
real-time applications on multi-core processors”. In: 21st International Con-
ference on Real-Time Networks and Systems (RTNS 2013). 2013, pp. 151–
160 (Cited on page 88).
[NPH+14] Jan Nowotsch, Michael Paulitsch, Arne Henrichsen, Werner Pongratz, and
Andreas Schacht. “Monitoring and WCET Analysis in COTS multi-core-SoC-
based Mixed-criticality Systems”. In: Proceedings of the Conference on De-
sign, Automation & Test in Europe. DATE ’14. Dresden, Germany: European
Design and Automation Association, 2014, 67:1–67:5. isbn: 978-3-9815370-2-
4. url: http://dl.acm.org/citation.cfm?id=2616606.2616689 (Cited
on page 88).
[NSE09] M. Negrean, S. Schliecker, and R. Ernst. “Response-time analysis of arbi-
trarily activated tasks in multiprocessor systems with shared resources”. In:
Design, Automation Test in Europe Conference Exhibition, 2009. DATE ’09.
04/2009, pp. 524–529. doi: 10.1109/DATE.2009.5090720 (Cited on page 87).
[Ope14] Open-Source Community. Free OSEK RTOS. http://sourceforge.net/
projects/opensek. 2014 (Cited on page 40).
[ORM+09] Jin Ouyang, Raghuveer Raghavendra, Sibin Mohan, Tao Zhang, Yuan Xie,
and Frank Mueller. “CheckerCore: Enhancing an FPGA Soft Core to Cap-
ture Worst-case Execution Times”. In: Proceedings of the 2009 International
Conference on Compilers, Architecture, and Synthesis for Embedded Systems.
CASES ’09. Grenoble, France: ACM, 2009, pp. 175–184. isbn: 978-1-60558-
626-7. doi: 10.1145/1629395.1629421. url: http://doi.acm.org/10.
1145/1629395.1629421 (Cited on page 23).
[ORS14] Haluk Ozaktas, Christine Rochange, and Pascal Sainrat. “Minimizing the
Cost of Synchronisations in the WCET of Real-time Parallel Programs”. In:
Proceedings of the 17th International Workshop on Software and Compilers
for Embedded Systems. SCOPES ’14. Sankt Goar, Germany: ACM, 2014,
pp. 98–107. isbn: 978-1-4503-2941-5. doi: 10.1145/2609248.2609261. url:
http://doi.acm.org/10.1145/2609248.2609261 (Cited on page 88).
[OSE01] OSEK/VDX. OSEK/VDX Time-Triggered Operating System. Version 1.0.
2001 (Cited on page 123).
[PC07] Rodolfo Pellizzoni and Marco Caccamo. “Toward the Predictable Integra-
tion of Real-Time COTS Based Systems”. In: Proceedings of the 28th IEEE
International Real-Time Systems Symposium. RTSS ’07. Washington, DC,
USA: IEEE Computer Society, 2007, pp. 73–82. isbn: 0-7695-3062-1. doi:
10.1109/RTSS.2007.51. url: http://dx.doi.org/10.1109/RTSS.2007.51
(Cited on page 32).
[PC10] R. Pellizzoni and M. Caccamo. “Impact of Peripheral-Processor Interference
on WCET Analysis of Real-Time Embedded Systems”. In: Computers, IEEE
Transactions on 59.3 (03/2010), pp. 400–415. issn: 0018-9340. doi: 10.1109/
TC.2009.156 (Cited on page 87).
[PH11] David A. Patterson and John L. Hennessy. Computer Architecture: A Quan-
titative Approach. 5th edition. The Morgan Kaufmann Series in Computer
Architecture and Design. Morgan Kaufmann, 2011. isbn: 012383872X (Cited
on pages 82 sq.).
[PHP14] Daniel Prokesch, Benedikt Huber, and Peter Puschner. “Towards Automated
Generation of Time-Predictable Code”. In: 14th International Workshop on
Worst-Case Execution Time Analysis. Ed. by Heiko Falk. Vol. 39. OpenAc-
cess Series in Informatics (OASIcs). Dagstuhl, Germany: Schloss Dagstuhl –
Leibniz-Zentrum fuer Informatik, 2014, pp. 103–112. isbn: 978-3-939897-69-
9. doi: http://dx.doi.org/10.4230/OASIcs.WCET.2014.103. url: http:
//drops.dagstuhl.de/opus/volltexte/2014/4609 (Cited on page 38).
[PK08] C. Paukovits and H. Kopetz. “Concepts of Switching in the Time-Triggered
Network-on-Chip”. In: 14th IEEE International Conference on Embedded and
Real-Time Computing Systems and Applications, 2008. RTCSA ’08. 08/2008,
pp. 120–129. doi: 10.1109/RTCSA.2008.18 (Cited on page 86).
[PK89] P. Puschner and Ch. Koza. “Calculating the Maximum Execution Time of
Real-time Programs”. In: Real-Time Systems 1.2 (09/1989), pp. 159–176.
issn: 0922-6443. doi: 10.1007/BF00571421. url: http://dx.doi.org/
10.1007/BF00571421 (Cited on page 74).
[PKF+11] Sascha Plazar, Jan C. Kleinsorge, Heiko Falk, and Peter Marwedel. “WCET-
driven Branch Prediction aware Code Positioning”. In: Proceedings of the
International Conference on Compilers, Architectures and Synthesis for Em-
bedded Systems (CASES). Taipei, Taiwan, 10/2011, pp. 165–174 (Cited on
page 37).
[PKH+12] Peter Puschner, Raimund Kirner, Benedikt Huber, and Daniel Prokesch.
“Compiling for Time Predictability”. English. In: Computer Safety, Reliabil-
ity, and Security. Ed. by Frank Ortmeier and Peter Daniel. Vol. 7613. Lecture
Notes in Computer Science. Springer Berlin Heidelberg, 2012, pp. 382–391.
isbn: 978-3-642-33674-4. doi: 10.1007/978-3-642-33675-1_35. url:
http://dx.doi.org/10.1007/978-3-642-33675-1_35 (Cited on
page 38).
[Plu10] Plurality Ltd. The HyperCore Architecture Version 1.0. 3 Hanotea St., Ne-
tanya 42300, Israel, 01/2010. url: www.plurality.com (Cited on page 84).
[PM11] M. Paolieri and R. Mariani. “Towards Functional-safe Timing-dependable
Real-time Architectures”. In: Proceedings of the 2011 IEEE 17th Interna-
tional On-Line Testing Symposium. IOLTS ’11. Washington, DC, USA: IEEE
Computer Society, 2011, pp. 31–36. isbn: 978-1-4577-1053-7. doi: 10.1109/
IOLTS.2011.5993807. url: http://dx.doi.org/10.1109/IOLTS.2011.
5993807 (Cited on page 89).
[Pou12] Louis-Noel Pouchet. PolyBench/C - The Polyhedral Benchmark Suite. http:
//www.cs.ucla.edu/~pouchet/software/polybench/. 2012 (Cited on
page 76).
[PP13] Dumitru Potop-Butucaru and Isabelle Puaut. Integrated Worst-Case
Response Time Evaluation of Multicore Non-Preemptive Appli-
cations. Rapport de recherche RR-8234. INRIA, 02/2013. url:
http://hal.inria.fr/hal-00787931 (Cited on pages 88, 91, 139).
[PPE+08] Traian Pop, Paul Pop, Petru Eles, Zebo Peng, and Alexandru Andrei. “Timing
Analysis of the FlexRay Communication Protocol”. In: Real-Time Systems
39.1-3 (08/2008), pp. 205–235. issn: 0922-6443. doi: 10.1007/s11241-007-
9040-3. url: http://dx.doi.org/10.1007/s11241-007-9040-3 (Cited on
page 35).
[PQC+09a] M. Paolieri, E. Quinones, F. J. Cazorla, and M. Valero. “An Analyzable Mem-
ory Controller for Hard Real-Time CMPs”. In: IEEE Embedded Systems Let-
ters 1.4 (12/2009), pp. 86–90. issn: 1943-0663. doi: 10.1109/LES.2010.
2041634. url: http://dx.doi.org/10.1109/LES.2010.2041634 (Cited on
page 89).
[PQC+09b] Marco Paolieri, Eduardo Quiñones, Francisco J. Cazorla, Guillem Bernat,
and Mateo Valero. “Hardware Support for WCET Analysis of Hard Real-
time Multicore Systems”. In: Proceedings of the 36th Annual International
Symposium on Computer Architecture. ISCA ’09. Austin, TX, USA: ACM,
2009, pp. 57–68. isbn: 978-1-60558-526-0. doi: 10.1145/1555754.1555764.
url: http://doi.acm.org/10.1145/1555754.1555764 (Cited on page 89).
[PRU13] Arthur Pyka, Mathias Rohde, and Sascha Uhrig. “A Real-Time Capable First-Level
Cache for Multi-Cores”. In: Workshop on High-Performance and Real-Time
Embedded Systems. HiRES 2013. Berlin, Germany, 01/2013 (Cited on
page 90).
[PS10] Christof Pitter and Martin Schoeberl. “A Real-Time Java Chip-Multiprocessor”.
In: ACM Transactions on Embedded Computing Systems (TECS) 10.1
(08/2010), 9:1–9:34. issn: 1539-9087. doi: 10.1145/1814539.1814548. url:
http://doi.acm.org/10.1145/1814539.1814548 (Cited on pages 46, 89).
[PS91] Chang Yun Park and Alan C. Shaw. “Experiments with a Program Timing
Tool Based on Source-Level Timing Schema”. In: Computer - Special issue on
real-time systems 24.5 (05/1991), pp. 48–57. issn: 0018-9162. doi: 10.1109/
2.76286. url: http://dx.doi.org/10.1109/2.76286 (Cited on page 74).
[PSC+10] Rodolfo Pellizzoni, Andreas Schranzhofer, Jian-Jia Chen, Marco Caccamo,
and Lothar Thiele. “Worst Case Delay Analysis for Memory Interference in
Multicore Systems”. In: Proceedings of the Conference on Design, Automa-
tion and Test in Europe. DATE ’10. Dresden, Germany: European Design
and Automation Association, 2010, pp. 741–746. isbn: 978-3-9810801-6-2.
url: http://dl.acm.org/citation.cfm?id=1870926.1871105 (Cited on
page 87).
[PSK08] Adrian Prantl, Markus Schordan, and Jens Knoop. “TuBound – A Concep-
tually New Tool for Worst-Case Execution Time Analysis”. In: 8th Inter-
national Workshop on Worst-Case Execution Time Analysis (WCET 2008).
ISBN: 978-3-85403-237-3. Prague, Czech Republic: Österreichische Computer
Gesellschaft, 2008, pp. 141–148. url: http://costa.tuwien.ac.at/papers/
wcet08-tubound.pdf (Cited on pages 20, 38).
[Ram00] G. Ramalingam. “Context-sensitive Synchronization-sensitive Analysis is Un-
decidable”. In: ACM Trans. Program. Lang. Syst. 22.2 (03/2000), pp. 416–
430. issn: 0164-0925. doi: 10.1145/349214.349241. url: http://doi.acm.
org/10.1145/349214.349241 (Cited on page 121).
[RBS+10] Christine Rochange, Armelle Bonenfant, Pascal Sainrat, Mike Gerdes, Julian
Wolf, Theo Ungerer, Zlatko Petrov, and Frantisek Mikulu. “WCET Analysis
of a Parallel 3D Multigrid Solver Executed on the MERASA Multi-Core”.
In: 10th International Workshop on Worst-Case Execution Time Analysis
(WCET 2010). Ed. by Björn Lisper. Vol. 15. OpenAccess Series in Informatics
(OASIcs). The printed version of the WCET’10 proceedings is published by
OCG (www.ocg.at) - ISBN 978-3-85403-268-7. Dagstuhl, Germany: Schloss
Dagstuhl – Leibniz-Zentrum fuer Informatik, 2010, pp. 90–100. isbn: 978-3-
939897-21-7. doi: http://dx.doi.org/10.4230/OASIcs.WCET.2010.90.
url: http://drops.dagstuhl.de/opus/volltexte/2010/2829 (Cited on
page 88).
[RD14] Jan Reineke and Johannes Doerfert. “Architecture-Parametric Timing Anal-
ysis”. In: Proceedings of the 20th IEEE Real-Time and Embedded Technology
and Application Symposium (RTAS). Ed. by Richard West. IEEE. 04/2014,
pp. 189–200. url: http://chess.eecs.berkeley.edu/pubs/1070.html
(Cited on page 21).
[Rei14] Jan Reineke. “Randomized Caches Considered Harmful in Hard Real-Time
Systems”. In: LITES 1.1 (2014), 03:1–03:13 (Cited on pages 22, 45).
[RNE+11] J. Rosén, C. Neikter, P. Eles, Zebo Peng, P. Burgio, and L. Benini. “Bus
Access Design for Combined Worst and Average Case Execution Time Opti-
mization of Predictable Real-Time Applications on Multiprocessor Systems-
on-Chip”. In: Real-Time and Embedded Technology and Applications Sympo-
sium (RTAS), 2011 17th IEEE. 04/2011, pp. 291–301. doi: 10.1109/RTAS.
2011.35 (Cited on page 148).
[RS09] Jan Reineke and Rathijit Sen. “Sound and Eﬃcient WCET Analysis in the
Presence of Timing Anomalies”. In: 9th International Workshop on Worst-
Case Execution Time Analysis (WCET’09). Ed. by Niklas Holsti. Vol. 10.
OpenAccess Series in Informatics (OASIcs). also published in print by Aus-
trian Computer Society (OCG) with ISBN 978-3-85403-252-6. Dagstuhl, Ger-
many: Schloss Dagstuhl – Leibniz-Zentrum fuer Informatik, 2009, pp. 1–11.
isbn: 978-3-939897-14-9. doi: http://dx.doi.org/10.4230/OASIcs.WCET.
2009.2289. url: http://drops.dagstuhl.de/opus/volltexte/2009/2289
(Cited on page 26).
[RSL88] R. Rajkumar, Lui Sha, and J.P. Lehoczky. “Real-time synchronization pro-
tocols for multiprocessors”. In: Real-Time Systems Symposium, 1988., Pro-
ceedings. 12/1988, pp. 259–269. doi: 10.1109/REAL.1988.51121 (Cited on
page 33).
[RWT+06] Jan Reineke, Björn Wachter, Stephan Thesing, Reinhard Wilhelm, Ilia Polian,
Jochen Eisinger, and Bernd Becker. “A Deﬁnition and Classiﬁcation of Timing
Anomalies”. In: Proceedings of 6th International Workshop on Worst-Case
Execution Time (WCET) Analysis. 07/2006 (Cited on pages 25 sq., 106).
[Sch97] R.R. Schaller. “Moore’s law: past, present and future”. In: Spectrum, IEEE
34.6 (06/1997), pp. 52–59. issn: 0018-9235. doi: 10.1109/6.591665 (Cited
on page 1).
[SCT09] Andreas Schranzhofer, Jian-Jia Chen, and Lothar Thiele. “Timing
predictability on multi-processor systems with shared resources”. In:
Embedded Systems Week - Workshop on Reconciling Performance with
Predictability. 2009 (Cited on page 87).
[SCT10] Andreas Schranzhofer, Jian-Jia Chen, and Lothar Thiele. “Timing Analysis
for TDMA Arbitration in Resource Sharing Systems”. In: Proceedings of the
2010 16th IEEE Real-Time and Embedded Technology and Applications Sym-
posium. RTAS ’10. Washington, DC, USA: IEEE Computer Society, 2010,
pp. 215–224. isbn: 978-0-7695-4001-6. doi: 10.1109/RTAS.2010.24. url:
http://dx.doi.org/10.1109/RTAS.2010.24 (Cited on page 87).
[SE09] Simon Schliecker and Rolf Ernst. “A Recursive Approach to End-to-end Path
Latency Computation in Heterogeneous Multiprocessor Systems”. In: Pro-
ceedings of the 7th IEEE/ACM International Conference on Hardware/Soft-
ware Codesign and System Synthesis. CODES+ISSS ’09. Grenoble, France:
ACM, 2009, pp. 433–442. isbn: 978-1-60558-628-1. doi: 10.1145/1629435.
1629494. url: http://doi.acm.org/10.1145/1629435.1629494 (Cited on
page 34).
[SF99] Jörn Schneider and Christian Ferdinand. “Pipeline Behavior Prediction for
Superscalar Processors by Abstract Interpretation”. In: Proceedings of the
ACM SIGPLAN 1999 Workshop on Languages, Compilers, and Tools for
Embedded Systems. LCTES ’99. Atlanta, Georgia, USA: ACM, 1999, pp. 35–
44. isbn: 1-58113-136-4. doi: 10.1145/314403.314432. url: http://doi.
acm.org/10.1145/314403.314432 (Cited on page 66).
[SHK14] H. Shah, Kai Huang, and A. Knoll. “Timing anomalies in multi-core architectures
due to the interference on the shared resources”. In: Design Automation
Conference (ASP-DAC), 2014 19th Asia and South Paciﬁc. 01/2014, pp. 708–
713. doi: 10.1109/ASPDAC.2014.6742973 (Cited on pages 26, 34).
[SHW11] Daniel J. Sorin, Mark D. Hill, and David A. Wood. “A Primer on
Memory Consistency and Cache Coherence”. In: Synthesis Lectures on
Computer Architecture. Ed. by Mark D. Hill (University of Wisconsin).
Vol. Lecture #16. Morgan & Claypool, 2011. isbn: 9781608455652. doi:
10.2200/S00346ED1V01Y201104CAC016 (Cited on page 83).
[SJ96] Thomas Schwederski and Michael Jurczyk. Verbindungsnetze - Strukturen
und Eigenschaften. Leitfäden der Informatik. Teubner, 1996, pp. I–XVI, 1–
420. isbn: 978-3-519-02134-6 (Cited on pages 84 sq.).
[SLL+11] Marcelo Santos, Björn Lisper, George Lima, and Veronica Lima. “Sequential
Composition of Execution Time Distributions by Convolution”. In: Proc. 4th
Workshop on Compositional Theory and Technology for Real-Time
Embedded Systems (CRTS 2011). Ed. by Robert Davis and Linh Thi Xuan
Phan. Best paper award. 11/2011, pp. 30–37. url: http://www.es.mdh.se/
publications/2215- (Cited on page 22).
[SNN+08] Simon Schliecker, Mircea Negrean, Gabriela Nicolescu, Pierre Paulin, and
Rolf Ernst. “Reliable Performance Analysis of a Multicore Multithreaded
System-on-chip”. In: Proceedings of the 6th IEEE/ACM/IFIP International
Conference on Hardware/Software Codesign and System Synthesis.
CODES+ISSS ’08. Atlanta, GA, USA: ACM, 2008, pp. 161–166.
isbn: 978-1-60558-470-6. doi: 10.1145/1450135.1450172. url:
http://doi.acm.org/10.1145/1450135.1450172 (Cited on page 87).
[SP10] Marc Schlickling and Markus Pister. “Semi-automatic Derivation of
Timing Models for WCET Analysis”. In: Proceedings of the ACM
SIGPLAN/SIGBED 2010 Conference on Languages, Compilers, and Tools
for Embedded Systems. LCTES ’10. Stockholm, Sweden: ACM, 2010,
pp. 67–76. isbn: 978-1-60558-953-4. doi: 10.1145/1755888.1755899. url:
http://doi.acm.org/10.1145/1755888.1755899 (Cited on page 67).
[SP78] Micha Sharir and Amir Pnueli. Two approaches to interprocedural dataﬂow
analysis. Tech. rep. 2. 251 Mercer Street, New York, N.Y.: New York Univer-
sity, Department of Computer Science, 09/1978 (Cited on page 57).
[SPC+10] Andreas Schranzhofer, Rodolfo Pellizzoni, Jian-Jia Chen, Lothar Thiele, and
Marco Caccamo. “Worst-case Response Time Analysis of Resource Access
Models in Multi-core Systems”. In: Proceedings of the 47th Design Automation
Conference. DAC ’10. Anaheim, California: ACM, 2010, pp. 332–337. isbn:
978-1-4503-0002-5. doi: 10.1145/1837274.1837359. url: http://doi.acm.
org/10.1145/1837274.1837359 (Cited on pages 87, 155).
[SPH+07] Jean Souyris, Erwan Le Pavec, Guillaume Himbert, Guillaume Borios, Victor
Jégu, and Reinhold Heckmann. “Computing the Worst Case Execution Time
of an Avionics Program by Abstract Interpretation”. In: 5th International
Workshop on Worst-Case Execution Time Analysis (WCET’05). Ed. by Rein-
hard Wilhelm. Vol. 1. OpenAccess Series in Informatics (OASIcs). Dagstuhl,
Germany: Schloss Dagstuhl – Leibniz-Zentrum fuer Informatik, 2007. isbn:
978-3-939897-24-8. doi: http://dx.doi.org/10.4230/OASIcs.WCET.2005.
810. url: http://drops.dagstuhl.de/opus/volltexte/2007/810 (Cited
on page 142).
[SPP+10] Martin Schoeberl, Wolfgang Puﬃtsch, Rasmus Ulslev Pedersen, and Benedikt
Huber. “Worst-case Execution Time Analysis for a Java Processor”. In: Soft-
ware Practice and Experience 40.6 (05/2010), pp. 507–542. issn: 0038-0644.
doi: 10.1002/spe.v40:6. url: http://dx.doi.org/10.1002/spe.v40:6
(Cited on page 23).
[SRK11] H. Shah, A. Raabe, and A. Knoll. “Priority division: A high-speed shared-memory
bus arbitration with bounded latency”. In: Design, Automation Test
in Europe Conference Exhibition (DATE), 2011. 03/2011, pp. 1–4. doi: 10.
1109/DATE.2011.5763319 (Cited on page 86).
[SSB09] Andrew Stone, Michelle Strout, and Shweta Behere. “May/must analysis and
the DFAGen data-ﬂow analysis generator”. In: Information and Software
Technology 51.10 (2009). Source Code Analysis and Manipulation, SCAM
2008, pp. 1440–1453. issn: 0950-5849. doi: http://dx.doi.org/10.1016/
j.infsof.2009.04.014. url: http://www.sciencedirect.com/science/
article/pii/S0950584909000482 (Cited on pages 18, 53).
[Str14] StreamIt Community. The StreamIt Benchmark Suite.
http://groups.csail.mit.edu/cag/streamit/shtml/benchmarks.shtml. 2014 (Cited
on page 76).
[Sut12] Herb Sutter. Welcome to the Parallel Jungle!
http://drdobbs.com/parallel/232400273. 01/2012 (Cited on
pages 4, 32).
[Syn14] Synopsys Inc. CoMET System Engineering IDE. http://www.synopsys.com.
2014 (Cited on pages 9, 44, 150).
[SZW+10] William N. Sumner, Yunhui Zheng, Dasarath Weeratunge, and
Xiangyu Zhang. “Precise Calling Context Encoding”. In: Proceedings of
the 32nd ACM/IEEE International Conference on Software Engineering -
Volume 1. ICSE ’10. Cape Town, South Africa: ACM, 2010, pp. 525–534.
isbn: 978-1-60558-719-6. doi: 10.1145/1806799.1806875. url:
http://doi.acm.org/10.1145/1806799.1806875 (Cited on page 58).
[Tar55] Alfred Tarski. “A lattice-theoretical ﬁxpoint theorem and its applications.” In:
Paciﬁc Journal of Mathematics 5.2 (1955), pp. 285–309 (Cited on page 15).
[Tay83a] Richard N. Taylor. “A General-purpose Algorithm for Analyzing Concurrent
Programs”. In: Commun. ACM 26.5 (05/1983), pp. 361–376. issn: 0001-0782.
doi: 10.1145/69586.69587. url: http://doi.acm.org/10.1145/69586.
69587 (Cited on pages 121 sq., 139).
[Tay83b] Richard N. Taylor. “Complexity of analyzing the synchronization structure of
concurrent programs”. English. In: Acta Informatica 19.1 (1983), pp. 57–84.
issn: 0001-5903. doi: 10.1007/BF00263928. url: http://dx.doi.org/10.
1007/BF00263928 (Cited on page 121).
[TBW95] K. Tindell, A. Burns, and A.J. Wellings. “Analysis of hard real-time com-
munications”. English. In: Real-Time Systems 9.2 (1995), pp. 147–171. issn:
0922-6443. doi: 10.1007/BF01088855. url: http://dx.doi.org/10.1007/
BF01088855 (Cited on page 31).
[TCN00] Lothar Thiele, Samarjit Chakraborty, and Martin Naedele. “Real-time Cal-
culus for Scheduling Hard Real-Time Systems”. In: International Symposium
on Circuits and Systems ISCAS 2000. Vol. 4. Geneva, Switzerland, 03/2000,
pp. 101–104 (Cited on page 32).
[TD00] Hiroyuki Tomiyama and Nikil D. Dutt. “Program Path Analysis to Bound
Cache-related Preemption Delay in Preemptive Real-time Systems”. In:
Proceedings of the Eighth International Workshop on Hardware/Software
Codesign. CODES ’00. San Diego, California, USA: ACM, 2000, pp. 67–71. isbn:
1-58113-268-9. doi: 10.1145/334012.334025. url:
http://doi.acm.org/10.1145/334012.334025 (Cited on page 31).
[Tea14] TRACES Team. OTAWA WCET Analysis Framework.
http://www.otawa.fr/. IRIT, 118 Route de Narbonne, F-31062 Toulouse,
2014 (Cited on page 20).
[Tec14] Tech-Pro.net. How the PCI Bus Works.
http://www.tech-pro.net/intro_pci.html. 2014 (Cited on page 85).
[The00] H. Theiling. “Extracting Safe and Precise Control Flow from Binaries”. In:
Proceedings of the Seventh International Conference on Real-Time Systems
and Applications. RTCSA ’00. Washington, DC, USA: IEEE Computer
Society, 2000, pp. 23–. isbn: 0-7695-0930-4. url:
http://dl.acm.org/citation.cfm?id=580571.828823 (Cited on page 49).
[The04] Stephan Thesing. Safe and Precise WCET Determination by Abstract
Interpretation of Pipeline Models. Pirrot, 2004. isbn: 9783937436005. url:
http://books.google.de/books?id=Sbq0SgAACAAJ (Cited on page 66).
[Thi05] Lothar Thiele. “Modular Performance Analysis of Distributed Embedded
Systems”. In: Lecture Notes in Computer Science. FORMATS 2005. Vol. 3829.
Springer Verlag, 2005, pp. 1–2 (Cited on page 34).
[Tid04] Tidorum Ltd. Bound-T Execution Time Analyzer.
http://www.bound-t.com/. 2004 (Cited on page 20).
[Tid10] Tidorum Ltd. Bound-T Time and Stack Analyser, Application Note
ARM7TDMI. http://www.bound-t.com/. 2010 (Cited on page 49).
[TSH+03] Stephan Thesing, Jean Souyris, Reinhold Heckmann, Famantanantsoa
Randimbivololona, Marc Langenbach, Reinhard Wilhelm, and Christian
Ferdinand. “An Abstract Interpretation-Based Timing Validation of Hard
Real-Time Avionics Software.” In: DSN. 2003, pp. 625–632 (Cited on
page 23).
[Val89] Antti Valmari. “Eliminating Redundant Interleavings During Concurrent
Program Verification”. In: Proceedings of the Parallel Architectures and
Languages Europe, Volume II: Parallel Languages. PARLE ’89. London,
UK: Springer-Verlag, 1989, pp. 89–103. isbn: 3-540-51285-3. url:
http://dl.acm.org/citation.cfm?id=646427.692577 (Cited on
page 121).
[Vol95] Jürgen Vollmer. “Data Flow Analysis of Parallel Programs”. In: Proceedings
of the IFIP WG10.3 Working Conference on Parallel Architectures and
Compilation Techniques. PACT ’95. Limassol, Cyprus: IFIP Working Group on
Algol, 1995, pp. 168–177. isbn: 0-89791-745-6. url:
http://dl.acm.org/citation.cfm?id=224659.224717 (Cited on page 122).
[WB05] Heinz Wörn and Uwe Brinkschulte. Echtzeitsysteme: Grundlagen,
Funktionsweisen, Anwendungen. eXamen.press Series. Springer, 2005. isbn:
9783540205883 (Cited on pages 85 sq.).
[WEE+08] Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti,
Stephan Thesing, David Whalley, Guillem Bernat, Christian Ferdinand,
Reinhold Heckmann, Tulika Mitra, Frank Mueller, Isabelle Puaut,
Peter Puschner, Jan Staschulat, and Per Stenström. “The Worst-case
Execution-time Problem – Overview of Methods and Survey of Tools”.
In: ACM Transactions on Embedded Computing Systems 7.3 (05/2008),
36:1–36:53. issn: 1539-9087. doi: 10.1145/1347375.1347389. url:
http://doi.acm.org/10.1145/1347375.1347389 (Cited on pages 18, 94).
[Weg03] Ingo Wegener. Komplexitätstheorie: Grenzen der Effizienz von Algorithmen.
Springer, 2003. isbn: 3-540-00161-1 (Cited on page 1).
[Weg12] Simon Wegener. “Computing Same Block Relations for Relational Cache
Analysis”. In: 12th International Workshop on Worst-Case Execution Time
Analysis. Ed. by Tullio Vardanega. Vol. 23. OpenAccess Series in
Informatics (OASIcs). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum fuer
Informatik, 2012, pp. 25–37. isbn: 978-3-939897-41-5. doi:
10.4230/OASIcs.WCET.2012.25. url:
http://drops.dagstuhl.de/opus/volltexte/2012/3554 (Cited on page 71).
[Weg99] Ingo Wegener. Theoretische Informatik - eine algorithmenorientierte
Einführung. 2. Auflage. Teubner, 1999. isbn: 3-519-12123-9 (Cited on page 3).
[Wei07] Karsten Weicker. Evolutionäre Algorithmen (Leitfäden der Informatik).
Vieweg+Teubner Verlag, 2007. isbn: 3835102192 (Cited on page 149).
[WFC+09] Reinhard Wilhelm, Christian Ferdinand, Christoph Cullmann, Daniel Grund,
Jan Reineke, and Benoît Triquet. “Designing Predictable Multicore
Architectures for Avionics and Automotive Systems”. In: Workshop on Reconciling
Performance with Predictability (RePP). 10/2009. url:
http://www.tik.ee.ethz.ch/~jchen/RePP/papers/2-3.pdf (Cited on page 89).
[WGK+10] J. Wolf, M. Gerdes, F. Kluge, S. Uhrig, J. Mische, S. Metzlaff, C. Rochange,
H. Cassé, P. Sainrat, and T. Ungerer. “RTOS Support for Parallel Execution
of Hard Real-Time Applications on the MERASA Multi-core Processor”. In:
13th IEEE International Symposium on Object/Component/Service-Oriented
Real-Time Distributed Computing (ISORC), 2010. 05/2010, pp. 193–201. doi:
10.1109/ISORC.2010.31 (Cited on page 29).
[WGR+09] Reinhard Wilhelm, Daniel Grund, Jan Reineke, Marc Schlickling, Markus
Pister, and Christian Ferdinand. “Memory Hierarchies, Pipelines, and Buses for
Future Architectures in Time-critical Embedded Systems”. In: IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems 28.7
(07/2009), pp. 966–978. issn: 0278-0070. doi: 10.1109/TCAD.2009.2013287.
url: http://dx.doi.org/10.1109/TCAD.2009.2013287 (Cited on pages 19,
27 sq., 45).
[WHK+13] B.C. Ward, J.L. Herman, C.J. Kenna, and J.H. Anderson. “Making Shared
Caches More Predictable on Multicore Platforms”. In: Real-Time Systems
(ECRTS), 2013 25th Euromicro Conference on. 07/2013, pp. 157–167. doi:
10.1109/ECRTS.2013.26 (Cited on page 7).
[Wil04] Reinhard Wilhelm. “Why AI + ILP Is Good for WCET, but MC Is Not,
Nor ILP Alone”. English. In: Verification, Model Checking, and Abstract
Interpretation. Ed. by Bernhard Steffen and Giorgio Levi. Vol. 2937. Lecture
Notes in Computer Science. Springer Berlin Heidelberg, 2004, pp. 309–322.
isbn: 978-3-540-20803-7. doi: 10.1007/978-3-540-24622-0_25. url:
http://dx.doi.org/10.1007/978-3-540-24622-0_25 (Cited on
pages 20, 88).
[Wil12] Stephan Wilhelm. “Symbolic Representations in WCET Analysis”. PhD
thesis. Saarland University, 2012. isbn: 978-3-8442-2463-4 (Cited on pages 63,
66).
[WT06] Ernesto Wandeler and Lothar Thiele. “Optimal TDMA Time Slot and
Cycle Length Allocation for Hard Real-time Systems”. In: Proceedings of the
2006 Asia and South Pacific Design Automation Conference. ASP-DAC ’06.
Yokohama, Japan: IEEE Press, 2006, pp. 479–484. isbn: 0-7803-9451-8. doi:
10.1145/1118299.1118417. url: http://dx.doi.org/10.1145/1118299.1118417
(Cited on page 148).
[XMO13] XMOS. XMOS Timing Analyzer Whitepaper, Rev. 1.1.
http://www.xmos.com/products/tools/xta. 2013 (Cited on page 90).
[YKS11] Man-Ki Yoon, Jung-Eun Kim, and Lui Sha. “WCET-Aware optimization of
shared cache partition and bus arbitration for hard real-time multicore
systems”. In: (2011) (Cited on page 148).
[YZ08] Jun Yan and Wei Zhang. “WCET Analysis for Multi-Core Processors with
Shared L2 Instruction Caches”. In: Proceedings of the 2008 IEEE Real-Time
and Embedded Technology and Applications Symposium. RTAS ’08.
Washington, DC, USA: IEEE Computer Society, 2008, pp. 80–89. isbn:
978-0-7695-3146-5. doi: 10.1109/RTAS.2008.6. url:
http://dx.doi.org/10.1109/RTAS.2008.6 (Cited on page 87).
[ZCS03] Min Zhao, Bruce Childers, and Mary Lou Soffa. “Predicting the Impact of
Optimizations for Embedded Systems”. In: Proceedings of the 2003 ACM
SIGPLAN Conference on Language, Compiler, and Tool for Embedded
Systems. LCTES ’03. San Diego, California, USA: ACM, 2003, pp. 1–11. isbn:
1-58113-647-1. doi: 10.1145/780732.780734. url:
http://doi.acm.org/10.1145/780732.780734 (Cited on pages 37, 155).
[ZKW+04] Wankang Zhao, Prasad A. Kulkarni, David B. Whalley, Christopher A. Healy,
Frank Mueller, and Gang-Ryung Uh. “Tuning the WCET of Embedded
Applications”. In: IEEE Real-Time and Embedded Technology and Applications
Symposium. 2004, pp. 472–481 (Cited on page 38).
[ZKW+05] Wankang Zhao, William C. Kreahling, David B. Whalley, Christopher A.
Healy, and Frank Mueller. “Improving WCET by Optimizing Worst-Case
Paths”. In: IEEE Real-Time and Embedded Technology and Applications
Symposium. 2005, pp. 138–147 (Cited on page 155).
[ZLT02] Eckart Zitzler, Marco Laumanns, and Lothar Thiele. “SPEA2: Improving the
Strength Pareto Evolutionary Algorithm for Multiobjective Optimization”.
In: EUROGEN2001 Conference. 2002 (Cited on page 150).
[ZVS+94] Vojin Zivojnović, Juan M. Velarde, Christian Schläger, and Heinrich Meyr.
“DSPSTONE: A DSP-oriented Benchmarking Methodology”. In: Proceedings
of the International Conference on Signal Processing and Technology
(ICSPAT’94). 1994 (Cited on page 76).
[ZY09] Wei Zhang and Jun Yan. “Accurately Estimating Worst-Case Execution Time
for Multi-core Processors with Shared Direct-Mapped Instruction Caches”. In:
Proceedings of the 2009 15th IEEE International Conference on Embedded
and Real-Time Computing Systems and Applications. RTCSA ’09.
Washington, DC, USA: IEEE Computer Society, 2009, pp. 455–463. isbn:
978-0-7695-3787-0. doi: 10.1109/RTCSA.2009.55. url:
http://dx.doi.org/10.1109/RTCSA.2009.55 (Cited on page 87).

Appendix A
Employed Benchmarks
The following table lists the benchmarks used in the evaluations. For each
benchmark, it gives the name, the Lines Of Code (LOC), the binary code size S
in bytes, the number of loops L, the maximum loop nesting depth D, the loop
bound range B, the average lower ($B^{avg}_{min}$) and upper ($B^{avg}_{max}$)
loop bound, and the number of context blocks $|V^C_\tau|$ for a context graph
with unlimited call string length but without virtual unrolling.
Name LOC S L(D) B $B^{avg}_{min}$/$B^{avg}_{max}$ $|V^C_\tau|$
DSPstone-fixed-point
complex-multiply-fixed 22 220 0(0) − 0.0/0.0 6
complex-update-fixed 26 172 0(0) − 0.0/0.0 6
convolution-fixed 27 108 2(1) 16 16.0/16.0 14
dot-product-fixed 22 124 1(1) 2 2.0/2.0 11
fft-1024-13 311 932 9(3) [0 − 2048] 571.7/686.3 95
fft-1024-7 282 888 8(3) [0 − 1024] 386.6/515.6 90
fft-16-13 156 928 9(3) [0 − 32] 11.0/13.0 95
fft-16-7 149 884 8(3) [0 − 16] 7.9/10.1 90
fir2dim-fixed 81 536 13(3) [3 − 16] 5.4/5.4 111
fir-fixed 40 188 2(1) [15 − 16] 15.5/15.5 21
iir-biquad-N-sections-fixed 42 284 3(1) [4 − 20] 10.7/10.7 31
iir-biquad-one-section-fixed 25 204 0(0) − 0.0/0.0 8
lms-fixed 44 316 3(1) [15 − 16] 15.7/15.7 26
matrix1-fixed 48 288 6(3) [10 − 100] 55.0/55.0 51
matrix1x3-fixed 23 116 2(2) 3 3.0/3.0 12
matrix2-fixed 48 316 6(3) [8 − 100] 54.7/54.7 51
n-complex-updates-fixed 38 280 2(1) 16 16.0/16.0 21
n-real-updates-fixed 29 212 2(1) 16 16.0/16.0 21
real-update-fixed 21 112 0(0) − 0.0/0.0 6
startup-fixed 99 340 5(1) [1 − 64] 27.4/33.2 58
DSPstone-floating-point
complex-multiply-float 22 288 0(0) − 0.0/0.0 12
complex-update-float 28 316 0(0) − 0.0/0.0 14
convolution-float 27 184 2(1) 16 16.0/16.0 19
dot-product-float 24 184 1(1) 2 2.0/2.0 13
fir2dim-float 81 852 13(3) [3 − 16] 5.4/5.4 149
fir-float 40 264 2(1) [15 − 16] 15.5/15.5 29
iir-biquad-N-sections-float 43 344 3(1) [4 − 20] 10.7/10.7 40
iir-biquad-one-section-float 25 296 0(0) − 0.0/0.0 17
lms-float 44 352 3(1) [15 − 16] 15.7/15.7 34
matrix1-float 49 300 6(3) [10 − 100] 55.0/55.0 53
matrix1x3-float 41 256 4(2) [3 − 9] 4.5/4.5 28
matrix2-float 49 340 6(3) [8 − 100] 54.7/54.7 56
n-complex-updates-float 38 352 2(1) 16 16.0/16.0 29
n-real-updates-float 29 228 2(1) 16 16.0/16.0 23
real-update-float 24 176 0(0) − 0.0/0.0 8
MRTC
adpcm-decoder 406 2812 14(1) [0 − 2424] 66.9/322.2 216
adpcm-encoder 434 2944 15(1) [0 − 2424] 63.3/303.5 249
binarysearch 35 156 1(1) 4 4.0/4.0 18
bsort100 54 224 3(2) [2 − 100] 67.0/99.3 31
compressdata 203 1152 4(1) [0 − 50] 14.0/14.0 32
countnegative 73 492 4(2) 20 20.0/20.0 40
cover 231 3192 3(1) [10 − 120] 60.0/60.0 789
duff 44 296 1(1) 100 100.0/100.0 50
edn 204 2004 12(3) [2 − 150] 53.8/53.8 80
expint 62 344 3(2) [49 − 100] 83.0/83.0 40
fac 21 116 1(1) 6 6.0/6.0 13
fdct 143 1136 2(1) 8 8.0/8.0 14
fibcall 22 96 1(1) 29 29.0/29.0 10
fir 225 268 2(2) [10 − 26] 13.5/18.0 22
insertsort 61 196 2(2) [1 − 9] 5.0/9.0 12
janne-complex 61 108 2(2) [0 − 9] 4.5/9.0 21
jfdctint 218 1460 3(1) [8 − 64] 26.7/26.7 20
lcdnum 62 348 1(1) 10 10.0/10.0 74
lms 160 1876 10(1) [0 − 201] 48.2/50.5 250
ludcmp 86 1092 11(3) [1 − 6] 3.0/5.1 90
matmult 57 496 5(3) 20 20.0/20.0 54
minver 162 1648 17(3) [1 − 3] 2.8/2.9 162
ndes 407 1800 12(1) [2 − 32] 19.4/19.4 244
petrinet 500 3920 1(1) 2 2.0/2.0 404
prime 34 320 1(1) [73 − 357] 73.0/357.0 89
qurt 88 820 1(1) 19 19.0/19.0 215
recursion 16 104 0(0) − 0.0/0.0 13
select 62 772 4(3) [0 − 16] 8.2/8.5 65
sqrt 45 320 2(1) [6 − 19] 12.5/12.5 41
MediaBench
cjpeg-jpeg6b-transupp 1599 2448 47(7) [1 − 29] 5.1/7.5 310
cjpeg-jpeg6b-wrbmp 1295 580 5(1) [30 − 512] 262.0/262.0 128
epic 994 4520 41(4) [0 − 9801] 132.0/316.6 332
gsm-decode 1380 7480 19(2) [0 − 648] 58.5/64.4 857
h264dec-ldecode-block 1571 4556 27(2) [0 − 16] 7.3/7.9 400
MiBench
bitcount 202 1088 4(2) [3 − 31] 13.0/14.2 97
dijkstra 227 908 5(2) [0 − 928] 167.6/292.2 86
PolyBench
2mm 133 1244 14(2) 32 32.0/32.0 103
3mm 158 1396 19(3) 32 32.0/32.0 130
atax 106 664 8(2) 32 32.0/32.0 60
cholesky 136 952 9(3) [0 − 32] 23.4/30.3 95
correlation 175 1500 14(3) [0 − 32] 28.7/30.9 166
covariance 116 792 11(3) 32 32.0/32.0 73
doitgen 119 972 13(4) 10 10.0/10.0 88
durbin 110 1176 7(2) [0 − 32] 22.7/31.6 69
dynprog 112 652 8(4) [0 − 10] 4.2/5.5 50
fdtd-2d 133 844 13(3) [2 − 32] 27.1/27.1 90
fdtd-apml 194 4376 13(3) [8 − 9] 8.7/8.7 206
floyd-warshall 87 548 7(3) 32 32.0/32.0 56
gemm 124 936 11(3) 32 32.0/32.0 83
gemver 121 1220 10(2) 32 32.0/32.0 91
gesummv 107 884 5(2) 32 32.0/32.0 54
gramschmidt 165 1196 17(3) [0 − 32] 29.4/31.2 145
jacobi-1d-imper 97 548 5(2) [2 − 500] 399.6/399.6 45
jacobi-2d-imper 104 784 9(3) [2 − 32] 27.8/27.8 71
lu 89 516 8(3) [0 − 32] 20.0/32.0 57
ludcmp-Poly 131 1028 12(3) [0 − 33] 16.2/32.2 96
mvt 97 876 7(2) 32 32.0/32.0 67
reg-detect 133 688 14(4) [0 − 32] 1.9/6.9 81
seidel-2d 95 632 7(3) [2 − 32] 27.1/27.1 59
StreamIt
audiobeam 6764 6372 22(2) [0 − 371] 26.2/26.2 1172
bitonic 56 352 3(1) [0 − 32] 21.3/26.7 42
UTDSP
adpcm 800 7644 8(2) [0 − 256] 34.5/35.4 1340
compress 603 4284 12(4) [8 − 16] 9.3/9.3 446
edge-detect 1134 716 10(4) [3 − 128] 20.0/77.6 153
fft-1024 186 576 4(3) [1 − 1024] 259.0/514.5 34
fft-256 90 580 4(3) [1 − 256] 66.5/130.0 34
fir-256-64 546 200 2(2) [1 − 256] 128.5/128.5 23
fir-32-1 33 136 1(1) 32 32.0/32.0 11
g721.marcuslee-decoder 2556 180 1(1) 2407 2407.0/2407.0 24
g721.marcuslee-encoder 2574 208 1(1) 2407 2407.0/2407.0 33
histogram 4132 300 6(2) [64 − 256] 128.0/128.0 37
iir-1-1 29 216 0(0) − 0.0/0.0 12
iir-4-64 73 504 3(2) [4 − 64] 44.0/44.0 27
latnrm-32-64 73 348 3(2) [31 − 64] 42.3/42.3 27
latnrm-8-1 47 356 2(1) [7 − 8] 7.5/7.5 23
lmsfir-32-64 88 472 3(2) [31 − 64] 53.0/53.0 48
lmsfir-8-1 40 284 2(1) [7 − 8] 7.5/7.5 22
lpc 328 4092 23(2) [0 − 320] 74.5/80.1 400
mult-10-10 232 220 3(3) 10 10.0/10.0 21
mult-4-4 64 196 3(3) 4 4.0/4.0 21
qmf-receive 8079 456 2(1) [11 − 4000] 2005.5/2005.5 41
qmf-transmit 8078 456 2(1) [11 − 4000] 2005.5/2005.5 41
v32.modem-eglue 507 144 1(1) [239 − 240] 239.0/240.0 9
misc
ammunition 2508 21704 77(2) [0 − 4008] 581.6/639.0 1938
anagram 2722 2980 23(3) [0 − 2279] 289.9/341.9 306
codecs-codrle1 111 772 4(2) [1 − 128] 10.5/69.5 201
codecs-dcodhuff 640 1456 11(2) [0 − 10280] 125.8/1061.3 358
g721-encode 899 6272 9(1) [0 − 256] 32.3/33.7 578
g723-encode 897 6272 9(1) [0 − 256] 32.3/33.7 578
h263 926 1672 7(2) [40 − 1024] 205.3/205.3 208
hamming-window 62 188 3(2) [19 − 401] 153.3/153.3 18
selection-sort 67 144 2(2) [1 − 299] 150.0/299.0 20
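
The loop-bound columns of the table above can be read as aggregates over the individual loops of a benchmark. The following Python sketch shows one plausible way to derive them. It assumes, which the text does not state explicitly, that B spans the smallest lower bound to the largest upper bound over all loops and that the two averages are plain arithmetic means over the loops. The function name `loop_bound_summary` and the two-loop input are hypothetical; the single-loop input uses the bounds listed in the `prime` row.

```python
def loop_bound_summary(bounds):
    """Aggregate per-loop (lower, upper) iteration bounds.

    Returns (L, B, avg_min, avg_max). B is None for a benchmark without
    loops, which the table shows as a dash.
    Assumption: the averages are arithmetic means over all loops.
    """
    if not bounds:
        return 0, None, 0.0, 0.0
    lowers = [lo for lo, _ in bounds]
    uppers = [hi for _, hi in bounds]
    n = len(bounds)
    return (n, (min(lowers), max(uppers)),
            round(sum(lowers) / n, 1), round(sum(uppers) / n, 1))


# Single loop with the bounds listed for "prime": [73 - 357].
print(loop_bound_summary([(73, 357)]))  # (1, (73, 357), 73.0, 357.0)

# Hypothetical two-loop benchmark (made-up bounds, not taken from the table).
print(loop_bound_summary([(1, 9), (9, 9)]))
```

Under these assumptions the single-loop case reproduces the 73.0/357.0 entry of the `prime` row. Whether the thesis weights loops differently, for example by nesting depth, cannot be inferred from the table alone.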
