Safe and Precise WCET Determination by Abstract Interpretation of Pipeline Models by Thesing, Stephan
Dissertation
submitted for the degree of Doctor of Engineering (Dr.-Ing.)
to the Faculties of Natural Sciences and Technology (Naturwissenschaftlich-Technische Fakultäten)
of Saarland University (Universität des Saarlandes)
by
Diplom-Informatiker
Stephan Thesing
Saarbrücken
July 2004
Date of the colloquium: 18.11.2004
Dean: Prof. Dr. Jörg Eschmeier, Universität des Saarlandes
Examination committee:
Chair: Prof. Dr. Holger Hermanns, Universität des Saarlandes
Reviewers: Prof. Dr. Reinhard Wilhelm, Universität des Saarlandes
Prof. Dr. Werner Damm, Universität Oldenburg
Academic staff member: Dr. Christian Lindig, Universität des Saarlandes
Declaration (Erklärung)
I hereby declare in lieu of an oath that I have written this thesis independently and without using any aids other than those indicated. Data and concepts taken directly or indirectly from other sources are marked with a reference to the source.
This thesis has not previously been submitted, either in Germany or abroad, in the same or a similar form in any procedure for obtaining an academic degree.
Saarbrücken,
Short Abstract
Failure of computer software in a hard real-time system leads to severe consequences and must be avoided by proving the correctness of the system’s software. A prerequisite for this is the determination of an upper bound for the worst-case execution times (WCET) of the tasks in the system. We show that for modern CPUs, WCETs can be obtained by static program analysis methods even for CPUs with execution-history-sensitive components like caches and pipelines. This is the first time that complex CPU features (out-of-order execution, speculation, etc.) have been included in a comprehensive and safe analysis.
The approach presented in this thesis is able to handle the analysis of very
complex architectures (PowerPC 755) by first modeling the CPU and peripherals
of the system and then using abstractions on some components of the system
to obtain an analysis. The analysis computes WCET for the basic blocks of the
program by simulating the abstract system model. The correctness of the approach
is shown.
A tool has been built based on this approach, which was evaluated under real-
life industry conditions by Airbus France in the course of the DAEDALUS project,
showing the practical applicability of the methodology.
Short Summary (Kurze Zusammenfassung)
Misbehavior of the computer software in a hard real-time system can have catastrophic consequences. To prevent such behavior, the correctness of the system’s programs must be proven beforehand. A prerequisite for this is the knowledge of upper bounds for the execution times of the programs (WCET). For modern CPUs, such bounds can effectively be obtained reliably only by static analysis methods, since execution times depend strongly on context-sensitive components (caches, pipelines). Until now, complex features of modern CPUs (out-of-order execution, speculation) were considered not to be efficiently statically analyzable.
This thesis presents an approach that is able to handle very complex architectures (such as the PowerPC 755). First, a model of the processor and the peripherals of the system is built, whose components can then be suitably abstracted to obtain an analysis. The analysis computes WCETs for the basic blocks of a program by simulating the abstracted processor model. The correctness of the analysis is guaranteed by the use of the theory of abstract interpretation.
Using this approach, a tool was developed that was evaluated under industrial conditions by Airbus France in the course of the DAEDALUS project. This clearly demonstrated the practical applicability of the presented approach.
Abstract
Hard real-time systems are computer systems that control critical physical plants (avionics, automotive, nuclear power plant control, weapon guidance, etc.). If such a computer system fails, the consequences can be severe: damaged property or even loss of life. Therefore, hard real-time systems must be checked for correctness before being deployed. One aspect of correctness is the timely response of the system, often expressed by temporal deadlines that must be met by the software tasks in the system. An essential component for proving that every task meets its deadline is the knowledge of an upper bound for the execution time of each task, the worst-case execution time (WCET). It has been shown to be practically impossible in most cases to obtain the WCET by measuring real execution times of tasks, due to the complex dependencies between the execution time and the input data or starting conditions of the system. Thus, safe upper bounds for the WCET can only be obtained by statically analyzing a task’s program code. Due to the restricted forms of programming used in hard real-time systems, it is in principle possible to compute a WCET from the program text alone (loop iteration and recursion bounds are assumed to be known).
Modern CPUs use features like caches and pipelines to improve performance. These features can lead to a huge variation in execution time as they are history sensitive: a memory access, e. g., may take just one cycle for an access that hits in the cache, while it may take more than 50 cycles for a cache miss and subsequent access to main memory. Whether an access hits in the cache depends on the contents of the cache and thus on the accesses performed before that access. A similar observation holds for effects in the processor’s pipeline. The static analysis of such features in today’s CPUs has proven difficult. Until recently, the static analysis of CPUs featuring branch prediction, out-of-order execution or speculation has been viewed as too complex to be used in practice [Eng02].
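The history sensitivity of a cache access can be illustrated with a small sketch. It assumes a fully associative cache with two lines and LRU replacement; the class `LRUCache`, the helper `cost_of_last_access` and the latencies of 1 and 50 cycles are illustrative assumptions chosen to match the orders of magnitude mentioned above, not a model of any particular processor.

```python
from collections import OrderedDict

HIT_CYCLES = 1    # assumed latency of a cache hit
MISS_CYCLES = 50  # assumed latency of a miss plus main-memory access


class LRUCache:
    """Hypothetical fully associative cache with LRU replacement."""

    def __init__(self, lines):
        self.lines = lines
        self.content = OrderedDict()  # oldest entry first

    def access(self, block):
        """Access a memory block and return the cycles the access takes."""
        if block in self.content:
            self.content.move_to_end(block)   # block becomes most recent
            return HIT_CYCLES
        if len(self.content) >= self.lines:
            self.content.popitem(last=False)  # evict least recently used
        self.content[block] = True
        return MISS_CYCLES


def cost_of_last_access(history, block, lines=2):
    """Replay an access history, then report the cost of accessing `block`."""
    cache = LRUCache(lines)
    for b in history:
        cache.access(b)
    return cache.access(block)


# The same access to block "a" is a hit or a miss depending on what came before.
print(cost_of_last_access(["a", "b"], "a"))       # hit
print(cost_of_last_access(["a", "b", "c"], "a"))  # miss: "a" was evicted by "c"
```

The two calls differ only in the accesses preceding the final one, yet the final access costs 1 cycle in the first case and 50 cycles in the second.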
We present a novel approach that is able to analyze architectures with the aforementioned features for real-life sized example programs. Our approach is based on a model of the CPU and the peripherals (memory, system controller, ...). We use a cycle-precise model with communicating units which have inner state and update rules. This resembles approaches taken by hardware description languages like VHDL or Verilog. The framework of abstract interpretation is used to define (safe) abstractions for some of the components of the model (e. g. the caches), reducing the model to details relevant to timing and to a size that can be practically handled. If the abstractions performed satisfy certain conditions, following from the theory of abstract interpretation, they are guaranteed to be safe in the sense that every concrete model state is subsumed by an abstract one. The abstract model obtained by this process can still be simulated cycle-wise and its simulation gives safe WCETs of the basic blocks of a program, while taking into account pipeline effects even across basic block boundaries. The WCETs for the basic blocks together with the control-flow graph of the program can then be transformed into an integer linear program (ILP) with the global execution time as the objective function to be maximized. The solution of this ILP is then a safe WCET of the program. This work presents two models in detail, namely for the Motorola ColdFire 5307 and the PowerPC 755, together with the abstractions used to obtain the abstract model for both processors. The abstract model simulation is then used as the transfer function of a data-flow analysis (DFA) over the control-flow graph of the binary program to be analyzed. This DFA is then the core of the implementation of the WCET analysis for the program.
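The shape of such an integer linear program can be sketched as follows. This is the common implicit path enumeration formulation, given here only as an assumption about the general form; the symbols are introduced for illustration: $B$ is the set of basic blocks, $t_b$ the WCET of block $b$, $x_b$ its execution count, $y_e$ the traversal count of control-flow edge $e$, and $n_\ell$ the known iteration bound of loop $\ell$. The sets $\mathrm{in}(b)$ and $\mathrm{out}(b)$ are assumed to include a virtual entry edge $e_0$ into the start block and a virtual exit edge out of each exit block.

```latex
\begin{align*}
\text{maximize}\quad & \sum_{b \in B} t_b \, x_b \\
\text{subject to}\quad
  & y_{e_0} = 1, \\
  & x_b = \sum_{e \in \mathrm{in}(b)} y_e = \sum_{e \in \mathrm{out}(b)} y_e
      && \text{for all } b \in B, \\
  & \sum_{e \in \mathrm{back}(\ell)} y_e \;\le\; n_\ell \cdot \sum_{e \in \mathrm{entry}(\ell)} y_e
      && \text{for every loop } \ell, \\
  & x_b,\; y_e \in \mathbb{N}.
\end{align*}
```

The flow constraints state that a block is executed as often as it is entered and left; the loop constraints use the known iteration bounds to keep the maximization problem bounded.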
This approach has led to the development of a commercial tool, aiT, which is based on provably correct methods and able to handle complex architectures for the computation of WCETs for real-life sized programs. Its prototypes have been evaluated during the DAEDALUS project by Airbus France with realistic benchmarks for avionics software under industry conditions. It has been awarded a European IST award for 2004. This demonstrates the practical relevance and applicability of the approach presented in this thesis.
Other aspects that are of relevance for the real-life utilization of WCET tools
are discussed in this thesis: validation of WCET tools and predictable hardware.
Validating the results of a WCET tool is critical if it is to be deployed in a critical
area like avionics. As our tool is based on provably correct abstractions, vali-
dation serves to detect implementation errors or errors made in the model itself.
Predictable hardware means hardware whose worst-case behavior does not differ
too much from its average-case behavior, making WCET prediction easier and
more precise. The identification of problematic features in processors that have
a bad worst-case behavior is thus an important step for design decisions in future
real-time systems.
The approach can be extended in several ways, e. g. by using different modeling techniques such as a model obtained from the authoritative VHDL code of a processor by abstraction steps. This approach is being studied in the AVACS Transregional Collaborative Research Center 14 sponsored by the German Research Foundation DFG. More processors (Motorola PPC 5xx, Infineon Tricore, TI TMS320C33, Motorola STAR12) have been or are currently being modeled in this framework by the company AbsInt.
Summary (Zusammenfassung)
Hard real-time systems are generally understood to be computer systems that control critical physical environments (e. g. flight control systems, nuclear power plant control systems, weapon guidance systems, etc.). If such a computer system is faulty, the consequences can be drastic: heavy material damage or even loss of human life. Hard real-time systems must therefore be checked for correctness before deployment. One aspect of correctness is the timely response of the system, which is often specified by time bounds that the tasks in the computer system must meet. An essential part of proving that every task meets its time bound is knowledge of the task’s execution time in the worst case, its worst-case execution time (WCET). It has mostly turned out to be impractical to determine the WCET by measuring real execution times, since there are complex dependencies between the execution time and the input data or initial states of the system. Safe upper bounds for the WCET can therefore only be obtained by applying static analysis methods to the program code. Since the programming styles used in hard real-time systems are restricted (no use of pointers, no heap, bounded recursion, etc.), it is in principle possible to compute a WCET from the program code alone (given known loop iteration and recursion bounds).
Modern CPUs use techniques such as caches and pipelines to increase their performance. These techniques can lead to a high variance of the execution time, since they depend on the execution history: a memory access, for example, may take just one cycle if it is a cache hit, or it may take more than 50 cycles if the requested datum is not in the cache and must be fetched from main memory. Whether an access is a cache hit depends on the cache contents and thus on the accesses performed before this access. The same holds for effects within the processor pipeline. The static analysis of such properties of today’s CPUs has proven difficult. Until recently, the analysis of CPUs with branch prediction, out-of-order execution or speculation was considered too complex for practical use, cf. [Eng02].
We present a new approach that is able to analyze architectures with the features mentioned above for realistic programs. Our approach is based on a model of the CPU and the peripherals (memory, system controller). We use a cycle-accurate model of intercommunicating units, which have an inner state and are equipped with state transition rules. This follows approaches from the field of hardware description languages such as VHDL or Verilog. Following the modeling, (safe) abstractions are defined for some of the components of the model with the help of the theory of abstract interpretation; they reduce the model to the parts that matter for timing behavior and bring the size of the model down to a manageable level. If the abstractions performed satisfy certain conditions stemming from the theory of abstract interpretation, it is guaranteed that every concrete model state is represented by an abstract state, i. e. the abstractions are safe. The abstract model obtained by this process can still be simulated cycle by cycle, which yields safe upper bounds for the execution time of the basic blocks of the program. In addition, pipeline effects are propagated across basic block boundaries. The WCETs of the basic blocks, together with the control-flow graph of the program, can then be transformed into an integer linear program in which the total execution time of the program is formulated as the objective function to be maximized. A solution of this integer linear program is then a safe upper bound for the execution time of the program. This thesis presents two detailed models, for the Motorola ColdFire 5307 and the Motorola PowerPC 755, together with the abstractions performed to obtain the abstract model. The abstract simulation is then used as the transfer function of a data-flow analysis (DFA) over the control-flow graph of the program. This DFA is then the core of the implementation of the WCET analysis of the program.
This approach has led to the development of a commercial tool, aiT, which rests on provably correct foundations and is able to handle complex architectures when computing WCETs of realistic programs. Its prototypes were evaluated during the DAEDALUS project by Airbus France with realistic benchmarks for avionics software under industrial conditions. The tool received an IST Prize of the European Union. This shows the practical relevance and applicability of the presented approach.
Other aspects that are important for the real-world applicability of WCET tools are also discussed: validation of WCET tools and predictability of hardware. Validating the results delivered by WCET tools is important if they are to be used in an area such as avionics. Since our tool is based on provably correct abstractions, validation here serves only to safeguard against implementation errors and errors in the model. Predictable hardware denotes hardware whose worst-case behavior is not far from its average-case behavior. This makes the prediction of WCETs easier and more precise, since fewer “accidents” in the behavior of the hardware have to be modeled or conservatively estimated. The identification of problematic features in processors that exhibit unfavorable worst-case behavior is therefore an important step in the design of future real-time systems.
The presented approach can be extended in several directions by using other modeling techniques. For example, a VHDL model of a processor could be reduced to an abstract model by abstraction steps. This extension is currently being investigated in the Transregional Collaborative Research Center AVACS of the DFG. Additional processors (Motorola PPC 5xx, Infineon Tricore, TI TMS320C33, Motorola STAR12) are currently being modeled within this approach by the company AbsInt.
Acknowledgements
Many people have contributed in different ways to this work.
First, I thank Prof. Wilhelm for letting me work on this interesting subject; he
gave the direction for my work and contributed his vast knowledge and advice in
many discussions that shaped the style and contents of this thesis. He created a
unique working atmosphere in his group, making the cooperative work with all
the members of the group very delightful.
Then, my colleagues deserve a big thanks for their discussions and suggestions
regarding this thesis. Especially Marc Langenbach, Sebastian Winkel, Christian
Probst, Jo¨rg Bauer, Daniel Ka¨stner and Stephan Diehl did a great job at comment-
ing on drafts of this thesis.
The people at AbsInt Angewandte Informatik have contributed in many dif-
ferent ways to this work. Reinhold Heckmann was an invaluable help in looking
at things the right way, asking the right questions and commenting on drafts of
my work. Christian Ferdinand and Florian Martin provided great insights and
implementation contributions. Most critical parts of the WCET framework were
implemented by Florian Martin, Michael Schmidt and Henrik Theiling. Without
this framework and the work done at AbsInt the results of this thesis would not
have been achieved. Also, Reinhold Heckmann and Christian Ferdinand helped
me very much by reviewing drafts of my thesis, pointing out the errors and miss-
ing ends.
The work described in this thesis has been done in the DAEDALUS project, supported by the European FP5 program under RTD project IST-1999-20527. In the DAEDALUS project, the people from Airbus France, Jean Souyris and Famantanantsoa Randimbivololona, provided a remarkable amount of practical experience and showed great interest in the results of our WCET work. Without their help, this work would never have reached its present level of practical usefulness.
My family has always been a big support for me. I am grateful to my parents, Helga and Josef, for helping me and waiting patiently for their son to finally finish his PhD work. I hope it was worthwhile to wait for this thesis :-)
Finally, I have to thank my son Raphael for being the child he is. My wife Yoomi deserves my highest gratitude for helping me and taking care of our family while I was busy filling the pages of this thesis. Her Asian patience was really put to the test sometimes.
For my wife Yoomi.
Contents
1 Introduction
1.1 Real-Time Systems
1.2 WCET Analysis
1.2.1 Dynamic Methods
1.2.2 Static Methods
1.3 Modern Hardware
1.4 Related Work
1.4.1 WCET determination
1.4.2 Pipeline Modeling
2 Modern Hardware
2.1 Caches
2.1.1 LRU Replacement
2.2 Pipelines
2.2.1 Pipeline Hazards
2.2.2 Performance Improving Features
2.2.3 Other Features
2.3 System Components
2.3.1 Memory
2.3.2 Peripherals
2.3.3 DMA, Multiprocessors
2.3.4 Busses
2.4 Timing Anomalies
2.5 The Motorola ColdFire 5307
2.5.1 The Pipeline of the ColdFire 5307
2.5.2 The Cache of the ColdFire 5307
2.5.3 System Configuration
2.5.4 Assumptions Made
2.5.5 Timing Anomalies with the MCF 5307
2.6 The Motorola PowerPC 755
2.6.1 The PowerPC Architecture
2.6.2 The PowerPC 755
2.6.3 PPC 755 Pipeline
2.6.4 PPC 755 Caches
2.6.5 Assumptions Made
2.6.6 Timing Anomalies with the PPC 755
3 Semantics and Analyses
3.1 Semantics
3.1.1 Program Representation
3.1.2 Concrete Semantics
3.2 Program Analysis
3.2.1 Abstract Interpretation
3.3 Data-Flow Analysis
3.3.1 Interprocedural Analyses
4 Pipeline Modeling
4.1 Finite State Automata
4.1.1 The Meaning of State Predicates
4.1.2 Instruction Execution Semantics
4.1.3 Inputs to Finite Automata
4.2 A Sequence of Abstractions
4.2.1 Introducing Components and Units
4.2.2 Abstract Components
4.2.3 A Semantics for Unit Transitions
4.3 Example 1: The MCF 5307
4.3.1 Instruction Address Generation (IAG)
4.3.2 Instruction Fetch Cycle 1 (IC1)
4.3.3 Bus Unit
4.3.4 Instruction Fetch Cycle 2 (IC2)
4.3.5 Instruction Early Decode (IED)
4.3.6 Instruction Buffer (IB)
4.3.7 Store Stall Timer (SST)
4.3.8 Execution Unit (EX)
4.3.9 State Predicates
4.4 Example 2: The PPC 755
4.4.1 FBPU
4.4.2 CQ
4.4.3 DU
4.4.4 IU1, IU2, SRU
4.4.5 LSU
4.4.6 FPU
4.4.7 CU
4.4.8 BU
4.4.9 CSU
4.4.10 State Predicates
5 Pipeline Analysis
5.1 Notation
5.2 Global Correctness
5.3 The Abstraction Using Unit Updates
5.3.1 Analysis for the MCF 5307
5.3.2 Analysis for the PPC 755
5.4 Nondeterminism
5.5 Parallelism in a Sequential Data-Flow Analysis
5.6 Other Caveats
6 A WCET Toolframe
6.1 Analyzing Binaries
6.2 Value Analysis
6.3 Separating Analyses
6.4 Visualization
6.5 Practical Results
7 Verification of the Analysis
7.1 Modeling Sources
7.1.1 Processor Handbooks
7.1.2 Experiments on the Hardware
7.1.3 Other Sources
7.2 Verifying the Modeling
7.3 Automation
8 Predictability of Modern Hardware
8.1 Sources of Complexity and Imprecision
8.2 Advice for Predictable Systems
9 Conclusions
9.1 Predicting Modern Hardware
9.2 Practical Usability
9.3 Future Work
Chapter 1
Introduction
This thesis is concerned with finding tight and safe predictions for the maximal
execution time of programs (also called worst-case execution time or WCET for
short). Finding upper bounds is important to guarantee that the response of a
computer system is always computed in time and that the system can thus react
to external events in a guaranteed time frame (a system that has such temporal
correctness requirements is called a real-time system). Finding a tight upper bound
is important in order to have a well scaled system, not reserving system power for
computing needs that will never arise in practice.
Although this problem is undecidable in general, it is decidable for a restricted set of programs that satisfy the following constraint: static bounds on recursion and on loop iterations must be known. For this restricted type of programs, the problem seems to be easily solved, e. g. by applying timing schemes to the program structure ([Sha89]). However, these methods are mostly defined only for high-level programming languages or make certain assumptions about the locality of timing behavior.
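To make the idea of a timing scheme concrete, the following is a minimal sketch of a bound computed over the program structure. The node types `Basic`, `Seq`, `If` and `Loop`, the function `wcet` and all cycle counts are hypothetical and chosen for illustration only; they are not taken from [Sha89]. The sketch also makes the locality assumption visible: every construct is given a fixed cost that does not depend on the execution history.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Basic:
    cycles: int            # assumed upper bound for a straight-line statement


@dataclass
class Seq:
    parts: List["Node"]    # statements executed one after another


@dataclass
class If:
    cond: int              # cycles to evaluate the condition
    then: "Node"
    orelse: "Node"


@dataclass
class Loop:
    cond: int              # cycles to evaluate the loop condition once
    body: "Node"
    bound: int             # known maximal number of iterations


Node = Union[Basic, Seq, If, Loop]


def wcet(node: Node) -> int:
    """Compute an execution-time bound bottom-up over the program structure."""
    if isinstance(node, Basic):
        return node.cycles
    if isinstance(node, Seq):
        return sum(wcet(p) for p in node.parts)
    if isinstance(node, If):
        # Take the more expensive branch.
        return node.cond + max(wcet(node.then), wcet(node.orelse))
    if isinstance(node, Loop):
        # The condition is evaluated once more than the body is executed.
        return (node.bound + 1) * node.cond + node.bound * wcet(node.body)
    raise TypeError(f"unknown node type: {type(node).__name__}")


program = Seq([
    Basic(2),
    Loop(cond=1, body=If(cond=1, then=Basic(5), orelse=Basic(3)), bound=10),
    Basic(4),
])
print(wcet(program))  # 2 + (11 * 1 + 10 * (1 + 5)) + 4 = 77
```

With caches and pipelines, the cost of a statement is no longer a constant of this kind, which is why such schemes do not carry over directly to modern hardware.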
Another aspect of the work presented in this thesis was practical applicability. While some methods have been presented to compute WCET (bounds), they ignored the fact that certain conditions are imposed on the usage of a tool derived from these methods. In the course of the DAEDALUS project (see http://www.di.ens.fr/cousot/projects/DAEDALUS; the project has been supported by the EU under IST-1999-20527), a tool based on the principles presented in this thesis has been implemented and evaluated by industrial end users. The feedback gained through this process has led to an efficient and realistic tool, called aiT (http://www.AbsInt.com/ait), which is sold by the company AbsInt (http://www.AbsInt.com). This tool has been awarded a European IST Prize for 2004.
In the remainder of this chapter, the topics of real-time systems and WCET determination are introduced and our approach to the latter is sketched. The problems for WCET determination introduced by modern hardware are sketched in Section 1.3. Related work is discussed in the last section of this chapter.
This thesis tries to collect all the material necessary for a full understanding of the issues involved in the analysis of modern hardware for WCET prediction. We deem this necessary since the application of an industrial-strength WCET prediction tool requires a lot of work in the framework surrounding the tool. On the other hand, the problems that occur in the safe design and implementation of such a tool necessitate a deeper reflection about the hardware and software components that make such a task complex and demanding.
As a consequence, many concepts are treated with considerable verbosity and
a lot of material from other sources is included. Therefore, we summarize the new
contributions of this thesis for clarity here. A technique to describe the hardware
that is to be analyzed has been developed: pipeline models. Clear semantics and
examples of applications have been given: Motorola MCF 5307 and Motorola
PowerPC 755. An analysis and its correctness are presented: pipeline analysis.
A discussion of hardware features that make precise predictions difficult is pre-
sented: Chapter 8. The problem of the verification of an implementation of an
analysis is discussed and solutions are suggested: Chapter 7. The real-life readi-
ness of the resulting analysis is shown and problems in the implementation of the
analysis are discussed.
The other chapters of this thesis are organized as follows: in Chapter 2, hardware features of modern processors which make computing the WCET difficult are introduced. Two processors are presented in detail, the Motorola ColdFire 5307 and the Motorola PowerPC 755. For both processors, a WCET tool has been implemented. In Chapter 3, the theoretical framework for the analyses behind our WCET tool is presented, i. e. abstract interpretation. Chapter 4 introduces the method used to model the timing behavior of processors, and presents the models for the MCF 5307 and the PPC 755. The analyses based on these models are presented in Chapter 5 together with the necessary correctness proofs. The tool using these analyses and the remaining framework needed for performing WCET analyses on machine programs are presented in Chapter 6, while the important question of how to verify the correctness of the implementation of the tool is discussed in Chapter 7. General comments and suggestions for the design of predictable (w.r.t. WCET determination) systems are collected in Chapter 8. Chapter 9 summarizes the contributions of this thesis and gives directions for future work.
1.1 Real-Time Systems
Embedded systems are computer systems interacting with a physical environment,
e. g. flight-control computers in planes, DVD players, nuclear power plant control,
etc. Using computer systems to control or interact with a physical environment
adds another level of complexity to the system. Not only must the system be
functionally correct, i. e. compute the correct result from its inputs, but also the
time it takes until the result is available is important. This is because the results are
usually used to control actuators that influence the physical environment, leading
to new system parameters, which are again inputs for the computer system. If
the results are available too late, the computed action may be inappropriate for
the correct functioning of the whole system. As an example, a new angle for the rudder of a plane that is applied too late may cause a resonance and oscillation of the plane, leading to a crash. Or, less dramatically, an audio frame delivered too late while playing a movie causes audio disturbance and loss of synchronization between picture and sound.
Systems for which timely functioning is important are called real-time sys-
tems. Timing constraints are often formulated as deadlines, i. e. points in time
until which all results must be available. If missing timing constraints only oc-
casionally is acceptable and does not cause the whole system to malfunction, the
system is a soft real-time system. Systems that will not work properly if timeli-
ness is not fulfilled are called hard real-time systems. In a hard real-time system
missing a deadline can lead to a loss of life and/or property in the physical environ-
ment controlled by the system. Examples of hard real-time systems are avionics,
especially flight-control systems, missile guidance, nuclear power plant control or
steering controls in cars. This thesis is primarily concerned with analyzing hard
real-time systems, with the focus on avionics. However, the results can be fully
applied to any hard real-time system and at least partially to soft real-time systems
as well.
For our setting, we look at real-time systems that are made up of a set of tasks.
Tasks are executed periodically and implement a certain sub-function of the
system. There may be dependencies between tasks $T_i$ and $T_j$ such that $T_i$
must finish its execution before $T_j$ starts executing. We assume that these
dependencies are given by a partial order $\le$ on the set
$\{T_i \mid 1 \le i \le n\}$ of tasks. Also, there are parameters associated
with each task $T_i$: a task may only be started after a certain point in time,
that is, there is a minimal release time $r_i$ for it. The deadline of the task
is denoted by $d_i$, the task's execution time by $e_i$, and the starting time
by $s_i$. We only consider uniprocessor systems, so all the tasks of the system
must be executed on one CPU. As there are normally several tasks without
dependencies ready to be executed at the same time, we introduce a priority
among tasks, a total order $<$ on the set of tasks. The task with the highest
priority is the smallest one in this ordering. From the tasks that are ready
(i. e. have no dependencies) the scheduler selects one to execute on the CPU.
This scheduling can be either static or dynamic, depending on when the decisions
for task selection are made: offline before the system starts to execute or
online by scheduling code. Furthermore, another characteristic of a real-time
system is whether the scheduling is preemptive or not. With preemptive
scheduling, the currently executing task is preempted from execution and the
next highest priority task is executed as soon as it becomes ready (if, e. g.,
its release time has passed). With nonpreemptive scheduling, the currently
executing task is always executed until it finishes. Real-time systems are
designed by fixing certain periods of execution of the subtasks in the systems
(coupled to sampling rates of sensors, etc). Thus, each task is also assigned an
individual period $p_i$. Two invocations of a task are separated by this period.
There are several strategies for computing the priority ordering $<$ on tasks.
The ordering can be static, giving fixed priorities to tasks, or dynamic, where
priorities are determined at run-time based on certain conditions. A static one
is, e. g., rate monotonic [LL73], where the priorities are assigned by inverse
length of the period, i. e. the task with the smallest period is assigned the
highest priority. A dynamic strategy is earliest deadline first (EDF), where the
highest priority is assigned to the task whose deadline is nearest. EDF with
preemptive scheduling is known to be optimal on a uniprocessor system, cf.
[Liu00]. A schedule obtained by applying a given scheduling strategy is feasible
if every task meets its deadline, i. e. $\forall i: s_i + e_i \le d_i$, and the
dependencies among the tasks and earliest release times are met.
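The two priority strategies can be illustrated with a minimal sketch. The `Task` record and both function names are hypothetical and only serve this illustration; `deadline` is assumed to be the absolute deadline of the current invocation of a task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical task record used only for this illustration.
    name: str
    period: int    # p_i
    deadline: int  # absolute deadline d_i of the current invocation


def rate_monotonic_order(tasks):
    """Static priorities: the task with the smallest period comes first."""
    return sorted(tasks, key=lambda t: t.period)


def edf_pick(ready):
    """Dynamic priorities: pick the ready task whose deadline is nearest."""
    return min(ready, key=lambda t: t.deadline)
```

Note that the rate monotonic order is computed once, offline, while the EDF choice has to be re-evaluated whenever the set of ready tasks changes.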
For all strategies there exist sufficient criteria that guarantee that a feasible
schedule can be generated if they are fulfilled, cf. [Liu00] or [BW90]. Such a
criterion is also called schedulability test. E.g., in the response time approach,
[JP86, TBW94], the maximal time, Ri, from the release of a task until it finishes
its execution is described by a recursive formula which takes into account the
preemptions of the task by other tasks with higher priorities and because of
dependencies. In the simplest form the equation is (tasks $T_j$ with lower
indices $j$ have higher priority):
\[
  R_i^{(n+1)} = e_i + \sum_{k=1}^{i-1} \left\lceil \frac{R_i^{(n)}}{p_k} \right\rceil e_k
\]
The start value can be chosen to be $R_i^{(0)} = e_i$. The recursive computation
of the $R_i^{(j+1)}$ can be stopped if $R_i^{(j+1)} = R_i^{(j)}$ $(=: R_i)$ for
some $j$, or if $R_i^{(j)} > d_i$, in which case no feasible schedule exists. The
necessary schedulability condition is then $\forall i: R_i \le d_i$.
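The fixed-point iteration described above can be sketched as follows. This is a minimal illustration, not the formulation of [JP86, TBW94]: it assumes integer execution times, periods and deadlines, tasks indexed from 0 in order of decreasing priority, and a hypothetical function name.

```python
def response_time(i, e, p, d):
    """Iterate R^(n+1) = e_i + sum_{k<i} ceil(R^(n) / p_k) * e_k.

    e, p, d are lists of execution times, periods and deadlines, ordered by
    decreasing priority. Returns the fixed point R_i, or None as soon as an
    iterate exceeds the deadline d_i (no feasible schedule).
    """
    r = e[i]  # start value R^(0) = e_i
    while True:
        # -(-r // p[k]) is the integer ceiling of r / p[k]
        nxt = e[i] + sum(-(-r // p[k]) * e[k] for k in range(i))
        if nxt > d[i]:
            return None
        if nxt == r:
            return r
        r = nxt
```

For example, with `e = [1, 2, 3]` and `p = d = [4, 6, 12]`, the iterates for the lowest priority task are 3, 6, 7, 9, 10, 10, so its response time is 10. The iteration terminates because the iterates never decrease and are bounded by the deadline.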
All these criteria need the WCETs of the tasks as inputs. More precisely, they
need upper bounds for all the WCETs of the tasks. In the literature, the notions
“WCET” and “upper bound for the WCET” are usually not distinguished. So,
in the following, when we talk about WCET we actually mean an upper bound
for the WCET. Thus, knowledge about the WCET ei of each task Ti is crucial for
proving that a feasible schedule for the whole real-time system exists so that it can
function correctly.
The remainder of this thesis concerns itself with obtaining a tight upper bound
for the WCET of a task. The best case execution time (BCET) of a task is some-
times also of importance to get a feeling for the utilization of the processor or for
multiprocessor scheduling. Mostly, the execution time of a task will be less than
its WCET. In the interval from the end of the task until its WCET has passed, ad-
ditional soft real-time tasks may be executed without the possibility of violating
the feasibility of scheduling (slack time). The maximal available slack time can
be bounded by the BCET of the tasks. The method presented later to compute
(upper bounds to) WCET can equally well be utilized to compute (lower bounds
to) BCET of tasks. In fact, all that is needed is redefining the ordering of the
abstract domain in Section 5.1 to use $\ge$ to sort the second component of each
pair instead of $\le$, adjusting the least upper bound and greatest lower bound
operators, as well as redefining $\gamma$ as
\[
  \gamma(\breve{s}, \hat{m}) = \{ (s, m) \mid s \in \Gamma(\hat{s}) \wedge
  \hat{m} \le m \le T_{max} \wedge \hat{s} \in \breve{s} \}
\]
1.2 WCET Analysis
In general it is not possible to obtain running times for programs due to the halting
problem. But for real-time systems it is possible to only use a restricted form of
programming which guarantees that programs always terminate. That is, recur-
sion is not allowed (or restricted) and the iteration counts of loops are known in
advance, or at least lower and (finite) upper bounds for them are known.
Two classes of methods to obtain WCETs can be distinguished:
• Dynamic methods utilize real program executions to obtain WCETs.
• Static methods only need the program itself, maybe extended with some
  additional information (like loop bounds).
1.2.1 Dynamic Methods
The easiest dynamic method is measuring the execution times of a number of real
program executions, either by augmenting the program with additional code to
manipulate timer hardware or by using a logic analyzer to observe the relevant
signals directly at hardware level (e. g. a write by the program to a special mem-
ory location indicating start and another one indicating end of execution). This
method has two drawbacks: it is not safe, and in practice it involves a lot of
effort in terms of measuring equipment and input simulation. It is not safe
because in virtually all cases the worst-case input to a program (see footnote 4)
is not known, and thus the WCET cannot directly be measured. A guarantee that the
WCET has been observed can therefore only be given by measuring for all inputs,
which is practically infeasible. The method is very expensive in terms of
resources needed (both hardware
and personnel) as the computer system controlling a complete real-time system
is connected by sensors and actuators to its environment. For the measurements,
sensor data has to be provided by other external simulation hardware, since these
measurements cannot be done in the real system for safety reasons. Also, addi-
tional hardware has to compute the effects of the actuator outputs and feed them
back to the sensors. This environment is needed to make sure that the program run
behaves as in the real system. If the sensor inputs in the measurement run differ
from those in the real system, the program may also behave differently, e. g. other
program paths are taken based on the input range of a sensor.
Additional drawbacks of this approach are that some variants need to use mod-
ified code (augmented with instructions to steer timers) or utilize a two-loop ap-
proach: a loop is put around the piece of code under examination. The execution
time of several loop iterations is measured to increase reliability and precision of
the results. This is compared against a measurement of the loop execution with an
empty body to separate the execution time of the loop instructions from the
execution time of the investigated code snippet. However, a certification
requirement makes this method difficult: validation of real-time systems,
especially in avionics, mandates that validation is done on exactly the same code
that will fly in the airplane later (see footnote 5). Using instrumented code is
not possible under these restrictions.
Another method combines measuring and static methods. Here, small snippets
of code are measured for their execution time, then a safety margin is added and
the results for code pieces are combined according to the structure of the whole
task. E.g. if a task first executes a snippet A and then a snippet B, the
resulting time is the one measured for A, $t_A$, added to the time $t_B$ measured
for B: $t = t_A + t_B$.
This reduces the amount of measurements that have to be made, as code snippets
tend to be reused a lot in control software and only the different snippets need to
be measured. However, it causes the need to argue about the correctness of
the composition step of the measured snippet times. Correctness depends, e. g. on
certain properties of the measurement environment of the snippets. One example
would be that the snippets are measured with an empty cache at the beginning
(Footnote 4: I.e. the input leading to the WCET.)
(Footnote 5: Test what you fly and fly what you test.)
under the assumption that this will lead to a larger execution time than with a
partially filled cache. In Section 2.5.5 we will show that especially this assumption
can be wrong. The problem of unknown worst-case input exists for this method
as well, while it is still infeasible to measure for all input values.
A practical disadvantage of these dynamic methods is that the hardware (and
the surrounding simulation equipment) must be available. In early stages of de-
velopment this is not the case. Design decisions about the layout of the system
must be made before results about worst-case performance are available. Since it
is extremely costly to change designs in later stages of the development process,
this puts a high responsibility and risk on the engineers building the system.
Program            Timing
If (a<b) Then      1
    c:=a+b;        2
    d:=1;          1
Else               1
    d:=0;          1
Fi                 4
Figure 1.1: Simple program with execution times
1.2.2 Static Methods
The second class of methods does not rely on executing code on real hardware but
rather takes the program code itself, combines it with some (theoretical) model
of the system and obtains WCETs from this combination. A simple variant, ap-
plicable to a wide range of programming languages and systems, is the use of
timing schemes, [Sha89]. Here, execution times are assumed to be known for the
atomic instructions of the program and the execution time for the whole program
is obtained by combining results for parts of the program. E.g., for the program
in Figure 1.1, the execution time of the whole program is made up of the time
to evaluate the condition of the If statement plus the maximum of the execution
times of the two branches of the If statement, after the Then and the Else key-
word. In the Then branch, the execution time of the two consecutive instructions
is the sum of both instructions, in this case 3. For loops, iteration bounds are used.
For WCET computation, the maximal iteration count is multiplied with the time
for the body plus the time for the evaluation of the loop’s condition.
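The composition rules of a timing scheme can be sketched as a small recursive evaluator. This is an illustration under stated assumptions, not Shaw's original formulation: the tuple encoding of programs is hypothetical, the loop rule follows the reading "iteration bound times (body plus condition)", and the timing of 1 on the Else row of Figure 1.1 is assumed to belong to the Else branch.

```python
def wcet(node):
    """Return the timing-scheme bound for a small program tree."""
    kind = node[0]
    if kind == "atom":    # ("atom", time)
        return node[1]
    if kind == "seq":     # ("seq", [child, ...]): times of children add up
        return sum(wcet(child) for child in node[1])
    if kind == "if":      # ("if", cond_time, then_node, else_node)
        _, cond, then_node, else_node = node
        return cond + max(wcet(then_node), wcet(else_node))
    if kind == "loop":    # ("loop", cond_time, max_iterations, body_node)
        _, cond, max_iter, body = node
        # One condition evaluation per iteration; a final failing exit test
        # would add one more cond_time and is left out in this reading.
        return max_iter * (wcet(body) + cond)
    raise ValueError(f"unknown node kind: {kind!r}")


# Assumed encoding of Figure 1.1: condition 1, Then branch 2 + 1,
# Else branch 1 + 1 (Else keyword plus d:=0).
figure_1_1 = ("if", 1,
              ("seq", [("atom", 2), ("atom", 1)]),
              ("seq", [("atom", 1), ("atom", 1)]))
```

Under these assumptions `wcet(figure_1_1)` evaluates to 1 + max(3, 2) = 4.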
This method gives a safe approximation to the WCET, if all the atomic exe-
cution times are WCETs and the upper loop bounds are correct (i. e. every loop
iteration count during program execution is smaller than the given bound).
Precision can be very bad if, e. g., a loop iterates 100 times but the WCET of
the body, $e_{body}$, only really occurs during one of these iterations and the
others are considerably faster (say twice as fast). The overapproximation is then
$99 \cdot 0.5 \cdot e_{body}$. Another
source of imprecision are the loop bounds: if they are overly conservative, the
result is still safe but far away from the real WCET.
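The size of the overapproximation in the loop example can be checked with a short calculation. The symbols $T_{scheme}$ and $T_{actual}$ are introduced here only for illustration, and the time for evaluating the loop condition is neglected.

```latex
\begin{align*}
  T_{scheme} &= 100 \cdot e_{body} \\
  T_{actual} &= e_{body} + 99 \cdot 0.5 \cdot e_{body} = 50.5 \cdot e_{body} \\
  T_{scheme} - T_{actual} &= 99 \cdot 0.5 \cdot e_{body} = 49.5 \cdot e_{body}
\end{align*}
```

The bound is thus almost twice the actual worst case in this scenario, although it remains safe.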
A different source of problems arises from some implicit assumptions under-
lying this method. Timing schemes assume that local worst-case execution times
are invariant. This means that the WCET obtained for a part of the program, say
one instruction, does not change if the same part later occurs again in the program.
E.g., in Figure 1.1, if the d:=0; instruction would occur again in another part of
the program, it still would take 1 time unit to execute, not two or three. For loops,
one assumption is the monotonicity of the loop body. This means that if one exe-
cutes a loop locally more often, this will globally also lead to a higher execution
time, so we can use the upper bound to the loop iterations as a safe number of
iterations to compute WCET. Both assumptions are not valid for systems where
execution times are dependent on the execution history. Unfortunately, modern
processors with caches and pipelines are sensitive to the execution history in their
execution time.
Related to timing schemes is the approach of modeling the flow of a program
as an integer linear program (ILP) with local worst case execution times at atomic
instructions. The WCET is computed by solving the ILP to maximize a formula
defining the execution time, cf. [LMA97].
An example C program and the ILP variables associated with it are depicted
in Figure 1.2. Here, the variables $a, b, \ldots$ on the right hand side are the
execution counts for the corresponding instructions. The $t_n$ are the execution
times for the instructions and the $x_i$ are the traversal counts of edges.
How these variables are connected is shown in Table 1.1. The inequality
$0 \le x_4 \le 10$ comes from the loop iteration bound. With given $x_i$, the
execution time of the whole program is
$t = \sum_{n \in \{a, b, c, \ldots\}} n \cdot t_n$. To determine an upper bound
to the WCET of the program, we solve the ILP by maximizing $t$, obtaining values
for the $x_i$ and $a, \ldots$. Additional constraints can be added to the ILP to
increase the precision of the solution. In the example, the constraint
$0 \le x_5 \le 6$ can be added because instruction e is executed at most six
times.
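The effect of the additional constraint can be seen in a small sketch. It does not call an ILP solver; because the example is tiny, it simply enumerates all integer solutions of the constraints and keeps the maximum. The instruction times passed in are hypothetical, and the two flow equations marked as assumptions ($x_2 = b$ and $c = x_3 + x_4$) are not listed in Table 1.1 but are added here so that every variable is determined.

```python
def max_time(t, x5_limit=None, loop_bound=10):
    """Maximize t = sum(n * t_n) over all integer solutions of the ILP.

    t maps instruction names 'a'..'h' to (hypothetical) execution times.
    x5_limit, if given, adds the constraint x5 <= x5_limit.
    Returns (maximal time, execution counts of the maximizing solution).
    """
    best = None
    for x4 in range(loop_bound + 1):      # 0 <= x4 <= loop_bound
        for x5 in range(x4 + 1):          # x5 + x6 = x4 with x5, x6 >= 0
            if x5_limit is not None and x5 > x5_limit:
                continue
            x6 = x4 - x5
            x7, x8 = x5, x6               # x7 = x5, x8 = x6
            x9 = x7 + x8
            x0 = 1
            a = x0
            x1 = a
            b = x1
            x2 = b                        # assumption: flow leaving b
            c = x2 + x9
            x3 = c - x4                   # assumption: c = x3 + x4
            counts = {"a": a, "b": b, "c": c, "d": x4,
                      "e": x5, "f": x6, "g": x7 + x8, "h": x3}
            total = sum(counts[n] * t[n] for n in counts)
            if best is None or total > best[0]:
                best = (total, counts)
    return best
```

With the hypothetical times `{"a": 1, "b": 1, "c": 2, "d": 2, "e": 4, "f": 1, "g": 1, "h": 1}` the maximum is 95 without the extra constraint and 83 with `x5_limit=6`, which shows how the added constraint tightens the bound.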
a: j:=0;
b: i:=0;
c: while (i < 10) {
d:   if (i < 6)
e:     j+=i+1;
     else
f:     j++;
g:   i++;
   }
h: j++;
[Control flow graph: nodes a to h, annotated with the execution times t_a to t_h;
edges labeled x0 to x9]
Figure 1.2: Example program with ILP variables

$a = x_0$              $b = x_1$              $c = x_2 + x_9$      $d = x_4$
$e = x_5$              $f = x_6$              $g = x_7 + x_8$      $h = x_3$
$0 \le x_4 \le 10$     $x_5 + x_6 = x_4$      $x_9 = x_7 + x_8$    $x_7 = x_5$
$x_8 = x_6$            $x_0 = 1$              $x_1 = a$
Table 1.1: Relations of example ILP variables

Another static method is to simulate the program in a model of the system,
namely the processor. Such simulators for processors are made available by the
manufacturers of the processors to ease code development for embedded platforms,
where the usual debug techniques cannot be applied. Unfortunately, simulation
will be much slower than running in the real system. The problems of measuring,
i. e. unknown worst-case input and the number of possible inputs, apply to this
method as well. They can partly be circumvented by using some form of abstraction
for the simulation, cf. the discussion of Lundqvist's and Stenström's work
[LS98, LS99] in Section 1.4.
It is also possible to use simulation only for small parts of the program, like
basic blocks, and to compose the results to form the WCET of the whole program.
If the small parts are considered in isolation, sufficient starting (i. e. worst-case)
scenarios for the simulation must be used to achieve a sound analysis. This sim-
ulation is very sensitive to so called timing anomalies (see below), where local
worst-cases may not lead to global worst-case scenarios but rather a locally better
scenario results in the global worst-case. These effects must be considered in the
worst-case scenarios for the simulation of the program parts by adding a safety
margin. This forced localization of the analysis reduces computation time but
can drastically decrease precision, if effects in the pipeline can reach over several
basic blocks.
Model checking, [CGP99], is a very popular approach in the area of verifica-
tion of embedded and real-time systems. In principle it can also be used to obtain
WCETs for programs. For this, the system (processor and periphery) and the pro-
gram itself must be coded as an automaton and an appropriate formula describing
the execution of the program within a time $t_1$ must be constructed. Model
checking can then determine if this formula is violated. If not, the WCET of the
program is not larger than $t_1$. Otherwise, the procedure is repeated for the
execution time $t_2 = 2 \cdot t_1$, etc. In the same manner, a tighter WCET bound
can be obtained after a successful check for $t_1$ by using a formula for
$t_2 = t_1 / 2$ and rechecking, increasing the time again if the check fails, in
analogy to binary search in sorted tables.
It has already been argued in [Wil04] that this approach (and further variants, de-
scribed in the paper) is inefficient for realistic systems if the program consists of
low-level machine code. Obtaining the formula to check is also not trivial.
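The search over time bounds can be sketched as follows. The function `holds` is a hypothetical stand-in for one complete model checking run; it is assumed to return True exactly if the program provably finishes within `t` time units, and to be monotone in `t`. Times are assumed to be positive integers.

```python
def tighten_wcet(holds, t1):
    """Find the smallest bound t with holds(t) by doubling and bisection.

    holds(t) is a hypothetical oracle standing for one model checking run.
    t1 is the initial guess (a positive integer).
    """
    # Phase 1: double the bound until the check succeeds.
    hi = t1
    while not holds(hi):
        hi *= 2
    # Largest bound known to fail: the previous guess, or 0 if the first
    # guess already succeeded (no program finishes within 0 time units).
    lo = hi // 2 if hi > t1 else 0
    # Phase 2: bisection between a failing bound lo and a succeeding bound hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if holds(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

Every call of `holds` corresponds to a full verification run, so the number of iterations, although only logarithmic in the WCET, multiplies an already high cost per check.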
Abstract Interpretation of Pipeline Models
Finally, one can apply a method called data flow analysis (DFA) to the problem
of determining WCETs. This method is well established in the area of compiler
construction and used to obtain information about a program that holds for all
executions of the program. It is, e. g., used to determine the values of variables
as a prerequisite to constant folding and expression evaluation optimization. The
charm of DFA is that it computes safety properties for each instruction of the
program that are valid for all executions of the program. Since DFA allows
abstractions of program values, like e. g. inputs, to be defined, results are computed once
over these abstractions. A well-founded theory, the framework of abstract inter-
pretation, underlies this approach, making correctness proofs for the results ob-
tained easier. This method also does not suffer from the drawbacks of measuring
and simulation, namely unknown worst-case inputs and prohibitively large input
value sets. However, it can be costly for large programs. Our approach presented
in this thesis uses DFA to obtain WCETs for local parts of the program, namely
basic blocks. The processor’s execution is abstractly simulated, and ILP methods
are utilized to combine these local worst cases into a global WCET. Unlike other
approaches, our approach considers the global flow of the program also in the
first DFA phase by propagating simulation results globally between basic blocks.
Thus, worst-case assumptions need not be made by our analysis on basic block
level. The DFA uses a set of abstract pipeline states as the domain for the values
propagated through the CFG.
[Figure: each basic block carries a data-flow value, which is a set of abstract
pipeline states; each abstract pipeline state represents a set of concrete real
pipeline states and is simulated cycle by cycle.]
Figure 1.3: Overview of pipeline analysis by abstract simulation
As sketched in Figure 1.3, each abstract pipeline state describes a set of con-
crete pipeline states. A concrete pipeline state represents the complete system
state relevant for timing, like cache contents, etc. A model defines the execution
of the whole system by evolving concrete pipeline states clock cycle by clock
cycle.
In the analysis, each abstract pipeline state is abstractly simulated cycle-wise.
It is possible that more than one result state is generated because of information
loss due to the abstraction that has been made. The maximal number of simulation
cycles for the instructions in the basic block starting with all abstract states is an
upper bound for the WCET of the basic block. Certain connections between the
abstract and concrete states guarantee correctness of the abstract simulation on
the abstract pipeline states. This approach allows the system to be modeled
precisely while still yielding efficient analyzers. In addition, the system hardware need not
be available for the analyzer to deliver results. Furthermore, parameters of the
model, e. g. the size of the cache or memory layout can be changed before each
analysis to gain hints on system performance and to support design decisions.
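The idea of cycle-wise abstract simulation with state splitting can be illustrated by a toy model. This is not the pipeline model of Chapters 4 and 5: the "pipeline" executes one memory access at a time, the latencies `HIT` and `MISS` are hypothetical, and an abstract state only records which blocks are definitely cached and which have unknown status. An access to a block of unknown status produces two successor states.

```python
HIT, MISS = 1, 10  # hypothetical latencies in cycles


def step(state, accesses):
    """Advance one abstract state by one clock cycle; return successor states.

    state = (pc, left, cached, maybe): index of the current access, cycles it
    still needs (0 = not yet issued), blocks known to be cached, and blocks
    whose cache status is unknown.
    """
    pc, left, cached, maybe = state
    if left > 0:  # access in flight: one more cycle passes
        left -= 1
        return {(pc + 1, 0, cached, maybe) if left == 0
                else (pc, left, cached, maybe)}
    block = accesses[pc]
    if block in cached:
        latencies = [HIT]
    elif block in maybe:
        latencies = [HIT, MISS]  # information loss: follow both cases
    else:
        latencies = [MISS]
    cached, maybe = cached | {block}, maybe - {block}
    return {(pc + 1, 0, cached, maybe) if lat == 1
            else (pc, lat - 1, cached, maybe) for lat in latencies}


def block_bound(accesses, cached=(), maybe=()):
    """Cycles until every abstract state has finished the basic block."""
    states = {(0, 0, frozenset(cached), frozenset(maybe))}
    states = {s for s in states if s[0] < len(accesses)}
    cycles = 0
    while states:
        successors = set()
        for s in states:
            successors |= step(s, accesses)
        # Drop finished states; the last one to finish determines the bound.
        states = {s for s in successors if s[0] < len(accesses)}
        cycles += 1
    return cycles
```

For the accesses `["A", "B", "A"]` with block A of unknown status, the simulation follows both the hit and the miss case for the first access and returns 21 cycles, the larger of the two paths (10 + 10 + 1 versus 1 + 10 + 1).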
As the next section shows, modern hardware poses a lot of problems for WCET
analysis by exhibiting non-local effects, because behavior is history dependent.
Our approach allows these effects to be captured in the model so that the analysis
will correctly predict and incorporate them. The central part of the analysis, the
system model and the abstraction over it, are described in Chapters 4 and 5, while
the other components of the industrial tool built from this approach are described
in Chapter 6.
1.3 Modern Hardware
As stated in the previous section, most WCET methods are not applicable if the
instruction times of atomic components are not fixed. Modern processors have
caches and pipelines to bridge the gap between slow memory access times and
fast processor cores and to accelerate program execution by overlapping instruc-
tion execution. Caches and pipelines increase the performance of program execu-
tion, at least in the average case. However, they introduce a dependency on the
execution history for an instruction’s execution time. E.g., if a memory read ac-
cesses data already in the cache, the instruction will execute fast. If the instruction
has to access main memory to fetch the data, the execution will take much longer.
Clearly, whether the data is in the cache or not, depends on the accesses happening
before this access. In order for this instruction to be a cache hit, there must have
been an access putting the data into the cache. Furthermore, there must not have
been too many other accesses to the cache because otherwise the loaded data may
have been replaced by a different data block. Due to the huge number of cache
configurations and of paths that lead to a single instruction, it is normally not pos-
sible to precisely predict the cache contents that occur when execution reaches
that instruction.
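The dependency of a single access on the execution history can be made concrete with a small cache simulation. The cache here is a hypothetical fully associative cache with two lines and LRU replacement, chosen only to keep the example short.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fully associative cache with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # least recently used block first

    def access(self, block):
        """Access a memory block; return True on a hit, False on a miss."""
        if block in self.lines:
            self.lines.move_to_end(block)   # block becomes most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict least recently used block
        self.lines[block] = True
        return False


def last_access_hits(trace, capacity=2):
    """Replay a trace on an empty cache; report whether the last access hits."""
    cache = LRUCache(capacity)
    hit = False
    for block in trace:
        hit = cache.access(block)
    return hit
```

The final access to block `x` hits after the history `x, a`, but misses after `x, a, b` (too many intervening accesses) and after `a, b` (no earlier access that loaded `x`), matching the two conditions stated above.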
The same holds, although less dramatically, for the behavior of a processor’s
pipeline. The execution time of an instruction depends on the instructions before
it in the pipeline (and partly also on the instructions after it, due to out-of-order
and speculative execution). The content of the pipeline at the execution start of an
instruction clearly depends on how this instruction was reached during program
execution, i. e. on the execution history.
Thus, one can no longer just look up the execution time of a single machine
instruction in the processor handbook. The same effect makes measuring execu-
tion times difficult, as the dependencies on the execution history are not linear but
can show chaotic behavior. That is, a small change in the value of an input pa-
rameter can lead to a huge change in the execution time of the whole program, as
the cache and pipeline contents are changed and may suddenly cause a completely
different behavior of the rest of the program. One example: the small change in
an input value causes one special program path to be traversed. Fetching this pro-
gram fragment causes other memory blocks in the cache to age so that they are in
the sequel thrown out of the cache, while they stayed in the cache when another
path was taken. Later, these blocks cause additional cache misses. This might
increase the execution time by a factor of 30.
Clearly, this behavior makes it impossible to practically apply a technique like
timing schemes from the previous section in order to obtain tight WCET bounds.
One might be able to find a penalty for all the effects that may cause a certain in-
struction to prolong its execution, thus removing the history dependency of the in-
struction (by adding the penalty to the WCET of the instruction). But this penalty
will clearly only occur for a few real executions. By using it for all instruction
occurrences, the WCET obtained is too imprecise, rendering it useless. On the
practical side, it has proved difficult to actually find a bounding penalty for effects
on an instruction due to the complicated interactions between different processor
features that may have effects on instruction execution. These interactions may
even lead to timing anomalies, cf. Section 2.4, where locally faster executions
(e. g. a cache hit) can lead to globally slower program execution.
The history dependencies also have other effects besides complicating WCET
computation. As said before, the WCETs are needed for scheduling analysis to
make sure that the tasks in the system can finish their execution before their dead-
lines. In systems that use preemptive scheduling, a task’s execution may be inter-
rupted by the scheduler and a higher priority task may be executed instead. The
execution of the first task is resumed when the second task has finished its execu-
tion. At the point of the resumption of the task’s execution, the history of the pro-
cessor has changed compared to an undisturbed execution of the task: the cache
contents may be different, the pipeline contents may be different, cf. [BN94]. This
may cause an additional amount of execution time for the task to be required: e. g.
because useful data has been replaced from the cache by the interrupting task and
has to be reloaded from slow main memory.
The changed pipeline contents may cause strange alignment effects for pro-
gram execution, leading to longer execution times (one example of such an effect
is presented in Section 2.6.6, where the effect is only bounded by the program
length, not by a processor dependent constant). Thus, the WCET of a task is no
longer constant but becomes history dependent: it depends on which task has
interrupted this task when and where. Most scheduling tests expect the WCET to
be constant. Some extensions of schedulability tests have been presented, e. g.
[Sch02] or [LHYM+96]. However, these approaches are either too computationally
complex or lose too much precision due to overly large overapproximations
to possible damages. For systems with preemption, no practically feasible
scheduling test exists today that can take history dependent effects of processors
into account. For static scheduling, where a sequence of tasks (segments) is com-
puted offline, a solution has been presented in [KT99] which takes the effects of
caches and pipelines into account across switches to different segments of other
tasks. A segment here is a linear piece of a task. The approach presented in the
paper also allows the use of precedences among, and earliest release times for, tasks.
Our approach only determines a WCET for the uninterrupted execution of a
task. Effects caused by preemption have to be considered separately. Because we
use an appropriately abstracted model of the processor and its periphery, we can
capture all history dependent effects safely. As the results in Section 6.5 show, the
results are also tight. In the next section we go on by discussing work related to
our approach.
1.4 Related Work
Related work in two areas is discussed: determining WCET and modeling of
processors or systems.
1.4.1 WCET determination
Based on Shaw’s timing scheme [Sha89] which focuses on high-level language
constructs and does not consider cache or pipeline effects, Lim et al. [LBJ+95,
LRM+94] and Hur et al. [HBL+95] present an analysis technique taking modern
processor features into account. Cache effects are modeled via bookkeeping
of first and last references to blocks, and reservation tables are used to
handle pipeline effects. As the target machine (a MIPS 3000) implements only
a simple pipeline, reservation tables whose resources are registers and pipeline
stages are sufficient. For loops in the CFG, an approximation operator, related to
a least upper bound operator in DFA, is used to determine cache contents and
execution costs. Results were only reported for toy sample programs. More
sophisticated processors featuring out-of-order execution, superscalarity, or set-
associative caches are not considered. Correctness of the approach is not proven,
nor is it clear how this could be done.
The approach was extended in [LHYM+96] to incorporate the effects of preemption
to the WCET computation of tasks. The approach uses the response time approach
for fixed priority scheduling (rate monotonic) and integrates the preemption
effects into the recursion equations for the response times in the form
\[
  r_i = e_i + \sum_{T_j \in hp(i)} \left\lceil \frac{r_i}{p_j} \right\rceil e_j + PC_i(r_i).
\]
Here, $hp(i)$ is the set of tasks with higher priorities than $T_i$ and
$PC_i(r_i)$ is the execution time penalty incurred on $T_i$. It is
computed by solving an ILP that describes the maximal damage in terms of re-
placed cache contents of the tasks. The constraints of this ILP are computed from
the useful cache blocks of a task. A useful cache block is a cache block that may
be referenced again after it has been loaded. Solving this ILP for each iteration of
the computation of ri seems to be very complex and only very small examples are
presented. Pipelines are not considered in the paper and it is not clear how timing
anomalies should be integrated into the approach efficiently.
Healy et al. [HAM+99, HWH95, MWH94] present another approach for predicting
WCETs in the presence of caches and simple pipelines. In a first step of
the analysis, a static cache simulator classifies instructions as cache hits or misses.
This information is used by a pipeline path analysis that computes the execution
time for a sequence of instructions. Loops are handled in a bottom-up manner.
Only the simple pipeline of a MicroSPARC is considered and in [HAM+99] only
direct-mapped caches are taken into account. The method can only be applied to
pipelines which can be described by resource usage patterns of instructions. For
their experimental results, the authors only consider a small direct-mapped cache
with small test programs. Complex architectures with timing anomalies are not
considered and can in fact not be handled by simple resource usage patterns.
Li et al. suggested another solution using the technique of integer linear pro-
gramming [LMA97]. Both cache and pipeline behavior prediction are formu-
lated as one linear program. The Intel i960KB [Int91] is investigated which has
a fairly simple pipeline. So only structural hazards need to be modeled keeping
the complexity of the integer linear program moderate. Branch prediction and/or
instruction prefetching are not considered at all. Obtaining the ILP modeling for
a more complex processor will be difficult. Using their approach for super-scalar
pipelines does not seem very promising considering the analysis times reported in
the article. Nonetheless, the description of the worst-case path through the pro-
gram via ILP is an elegant method and can be efficient if the size of the ILP is
kept small as is the case in our tool.
Lundqvist and Stenström presented an integrated approach to obtain WCET
bounds through the simulation of the pipeline in [LS98, LS99, Lun02]. They ex-
tend a pipeline simulator to handle unknown values in inputs. With this approach
we share conceptual similarities in that we perform a cycle-wise evolution of a
pipeline (model). In contrast to our approach, their method is an integrated one,
where value analysis for register/memory contents and execution time computa-
tion are parts of the same simulation. If the simulation cannot determine a branch
condition exactly due to dependencies on unknown (input) values, they have to
simulate both branch outcomes. Their method does not guarantee termination of
the analysis, but has the advantage of determining loop bounds and/or recursion
bounds for free6. However, we feel that their analysis is very costly due to the
huge amount of data that has to be kept for each branch they follow. In contrast,
our method does not keep information like register or memory contents in the
pipeline analysis phase. A value analysis can be executed before the cache and
pipeline analysis instead. In [LS99] experiments with a PowerPC like architec-
ture are made for small example programs. They use an extended PSIM simulator
with simple reservation tables for instructions. In all, it is not clear how well their
method scales up to programs of realistic sizes.
Narasimhan and Nilsen present a retargetable execution time analyzer for
RISC processors in [NN94]. The target architecture is modeled using an extended
MARIL language (see [BHE91]). The generated analyzer takes an assembly pro-
gram and a path represented by a sequence of labels and computes the time needed
to execute that path. Due to the use of MARIL the range of targetable processors
is significantly limited. Analyzing assembly programs complicates the integration
of instruction and data cache analysis. This leads to a large gap between predicted
and measured execution time, especially for larger inputs [NN94].
In contrast to Lundqvist and Stenström’s integrated approach, Engblom presents
a WCET tool with a clear separation of all the analysis modules in [Eng02].
The modules communicate using interface data structures. One main component
is a simulator that estimates the execution time for a given sequence of instruc-
tions. These timing estimates are composed to form the execution time of the
entire program. The quality of the obtained WCET is greatly influenced by the
quality of the simulator used. Cache behavior prediction is not incorporated in the
tool as the addressed targets do not have caches. This eliminates the problem of
cache and pipeline interaction, which becomes more difficult with more complex
pipelines, prefetching, and branch prediction. The author comes to the conclusion
that “... out-of-order processors are definitely too complex to model with current
techniques.”
Colin and Puaut describe a framework for tree-based WCET analysis in their
paper [CP01]. Instruction cache and pipeline behavior as well as branch prediction
are taken into account and are analyzed independently of one another reducing the
precision of the obtained WCET estimate.
The analyses are based on two intermediate representations: the syntax tree
and the control flow graph built from the assembly output of the compiler. As
the program is not yet translated to object code, it is not clear which machine
instruction an assembly instruction is mapped to, and as the program is not linked,
information on instruction addresses is not available. The syntax tree is used to
compose the WCET from smaller parts. This is not appropriate as it disregards the
execution context, leading to imprecise results. In [CP00], they present an analysis
to predict the branch prediction behavior of a processor with dynamic
prediction in the same framework. Modern features like out-of-order execution
are not considered. For such architectures, the presented analysis is either not
sound or too imprecise (after safety margins have been added to the results).
Footnote 6: If these do not depend in a non-trivial way on unknown input values.
In [BCP02] and [BBB03] Bernat et al. present a method to obtain probabilistic
guarantees for the schedulability and timeliness of a hard real-time system. They
argue that WCETs are rarely known and sometimes probabilistic guarantees suf-
fice. In the context our work was done in, i. e. avionics, we have to give hard
guarantees for the timeliness of the tasks in the system. So their approach cannot
be used in our application area. However, note that a relaxed, much faster version
of our analysis that only uses local worst-cases in its computations instead of fol-
lowing all possible scenarios, gives results which are true “most of the time”, cf.
Chapter 6. I.e. the timing anomalies will rarely occur in practice and local worst-
cases will with some probability also be global worst-cases. Thus, our results can
be used for the probabilistic method.
Model checking has been widely applied in the area of verification of real-time
systems and hardware [McM93, BCM+92, BBCZ98, McM98, CGP99]. However,
its usefulness for determining WCETs seems limited. Although the use of
timed automata in principle makes it possible to check a formula which describes
the execution of the program under the condition that it finishes before its dead-
line, this does not classify as a scheduling test under a preemptive scheduling
regime (the preemptions by other tasks are not considered). Even then, one must
use abstractions to be able to consider all possible states in the timed automata
(which depend on input data to the task). To obtain tight WCET bounds, model
checking must be repeatedly applied with differing time bounds to check for. How
to obtain the formula to check (and prove its correctness) and an adequate timed
automaton is another issue in this approach. See [Wil04] for a more in-depth dis-
cussion of these questions. In [Met04], Metzner has presented an application of
model checking for WCET analysis by analyzing an instruction cache in combina-
tion with a simple processor execution model. He argues that model checking can
be applied to and indeed improve the precision of WCET prediction by avoiding
the abstractions and approximations made by data-flow analyses based on abstract
interpretation. It will be interesting to see if this approach scales up to the analysis
of data caches and processors with branch prediction, prefetching and out-of-order
execution, all features that introduce nondeterminism in the analysis.
In [Fer97] Ferdinand presents a method for static cache and pipeline analysis
based on abstract interpretation. Using this methodology, Schneider et al. devel-
oped a pipeline behavior prediction [SF99] for the SuperSPARC processor. Both
analyses are components of a worst-case execution time prediction. However,
their approach uses a simple reservation table scheme in the pipeline analysis
which cannot be extended to more complex architectures. Also, they do not con-
sider prefetching and/or branch prediction and assume the sequence of instruction
accesses to be known statically. Ferdinand’s cache analysis is integrated into our
pipeline analysis, adapted to the update semantics of the ColdFire and PPC 750/755
caches.
The problem of scheduling effects on the WCETs of tasks has been studied
by Schneider in [Sch02]. There, maximal damage due to task preemptions is
computed by using response time analysis with additional damage terms. Our
pipeline analysis has been utilized in this work to obtain WCETs for undisturbed
executions and then damage costs are added to the results. Also, the effects of the
real-time operating system are considered in this work (interrupts, system calls,
etc).
Except for the works based on our pipeline analysis, none of the work men-
tioned above takes into account the combination of branch prediction, instruction
prefetching, speculative execution, or effects caused by, e. g., data accesses collid-
ing with wrap-around cache line fills on the external processor bus. Also, modern
computers feature separated core and bus clocks, causing new wait effects for
external bus accesses, which have not been investigated so far. Most models sim-
plify the memory architecture of the systems they consider; in a real system, the
memory space is made up of quite a few regions with different characteristics
concerning access timing, access mode7, and so on. Our tool makes it easy to
specify these parameters and the pipeline analysis takes all this into account when
examining memory accesses.
1.4.2 Pipeline Modeling
A lot of work exists on hardware modeling, especially of processors and
peripherals. Specialized languages (hardware description languages, HDLs) for
specifying hardware have been developed to help in EDA8. The most important
ones are Verilog [TM91] and VHDL [VHD00]. Both languages allow electronic
circuits to be synthesized automatically from (restricted) descriptions. The languages
are based on the notion of modules, each running a number of processes, commu-
nicating via signals. In the case of VHDL, signals can even carry more complex
data, while Verilog basically considers signals as physical wires with a limited set
of states (asserted, floating, high impedance, etc). Designs can be tested by
simulation of models, and models can be formulated at different levels of
abstraction (RTL, gate level, etc).
Footnote 7: E. g., speculative accesses with the PowerPC can be allowed on some
regions but disallowed on others.
Footnote 8: Electronic Design Automation.
A lot of work has been done on the verification of the synthesis step from
models into circuits, and also on obtaining higher level descriptions from lower
specifications [AK95], [McM98], [HQR98] or [JPM02]. Little work has been
done on the equivalence checking of abstract models with concrete ones, as would
be required for checking that the abstractions of our pipeline models are correct.
Another area where model checking can be used is to prove that certain timing
anomalies are not present in a processor or do at least not occur for a given pro-
gram. See Section 9.3 for a discussion.
First steps in the area of analyzing HDLs are the papers by Hymans [Hym02,
Hym03], which present an abstract interpretation of a VHDL kernel. By adding
a monitor and a test bench to an existing model, safety properties can be checked
by observing the analysis results for assertions in the monitor, which describe the
desired properties (e. g. that two signals are never asserted at once). The analysis
is complex on low level VHDL models, but the same kind of analysis should be
possible on our pipeline models (which are quite similar to VHDL), and should
be much faster due to the higher abstraction level of our pipeline models.
Recently, SystemC [Ope02] has been promoted as a modeling framework for
embedded systems ranging from processors to complete systems. SystemC is
basically a collection of C++ classes, with a predefined simulation engine. As
SystemC descriptions are in essence C++ programs, analyzing them is very dif-
ficult, as is the analysis of C++ itself. The implicit semantics of the predefined
SystemC components (the simulator, etc) has to be abstracted in order to obtain
at least a small amount of precision. The correctness of any analysis of a model
is difficult to establish due to the complex semantics of C++. Equivalence proofs
between models at different levels of abstraction are tedious for the same reason.
Apart from retargetable timing analysis, pipeline descriptions have been used
in code generation and simulation. Early work [BHE91] considered only reg-
isters and pipeline stages by the use of reservation tables. Improvements have
led to mixed-level languages such as Expression [HGG+99]. From a structural
specification of hardware resources, pipeline mechanism, and data transfer paths,
reservation tables are generated automatically [GHDN99], but this approach is not
able to model dynamic behavior like out-of-order execution. Validation of models
done in Expression is described in [MTH+02]. Correctness has to be expressed
as properties and it is checked if the flow in the model violates this property or
not. Due to the form of the description (LISP like), this mechanism is not suited
to model complex processors in a maintainable fashion.
Dynamic hardware features are handled in hardware description languages
that are used for retargetable simulators. The language LISA [ZPS+96, PZHM98]
uses a refined form of reservation tables called L-charts that can model dynamic
scheduling to some extent. The language RADL [Sis98] offers a flexible signal
mechanism that seems more promising in the context of modern hardware fea-
tures. Signals are emitted if a boolean expression on machine state components
holds; they are used to influence scheduling behavior. In contrast to our work
which also deals with signals, RADL describes a model very close to the silicon
level, where, e. g., latches between pipeline stages are defined implicitly and may
be accessed bitwise.
Chapter 2
Modern Hardware
The need for computing power is ever increasing, resulting in an increasing de-
mand for faster computer systems. The number of transistors that can be put
on a chip is on average doubling every 18 months, according to Moore’s Law,
[Moo65]1.
The growing availability of gates per chip has led to the integration of more
complex functions into CPUs and higher capacity of main memory. Furthermore,
a tighter packing of gates allows the clock speed of the chips to be increased, because
signal delay times are shortened. The increasing performance of CPUs and main
memory is depicted in Figure 2.1: following [HP96], an annual increase of 7% for
memories and of 35% (until 1987) and 55% (from 1987 on) for CPUs is assumed.
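These growth rates compound quickly. As a rough illustration, the following sketch recomputes the two curves of Figure 2.1. It assumes that the rates are annual and that both curves are normalized to 1 in 1980; the base year is taken from the start of the figure's time axis and is an assumption, as is the function name.

```python
def relative_performance(year, base_year=1980):
    """Return (cpu, memory) performance relative to base_year.

    Assumed annual growth: memory 7%; CPU 35% up to 1987, 55% afterwards.
    The normalization to 1.0 in base_year is an assumption of this sketch.
    """
    memory = 1.07 ** (year - base_year)
    early = min(year, 1987) - base_year   # years growing at 35%
    late = max(year - 1987, 0)            # years growing at 55%
    cpu = (1.35 ** early) * (1.55 ** late)
    return cpu, memory


for y in (1980, 1990, 2000):
    cpu, mem = relative_performance(y)
    print(f"{y}: CPU x{cpu:.1f}, memory x{mem:.1f}, gap x{cpu / mem:.1f}")
```

Under these assumptions the gap between the two curves grows by more than two orders of magnitude between 1980 and 2000.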
The enhanced performance capabilities of CPUs have led to the development
of sophisticated features like pipelining or branch prediction, in order to be able
to actually utilize them for machine code programs. The increasing gap between
the performance of CPUs and that of main memories is the main bottleneck for
system performance and will continue to be so. CPUs spend most of their time
waiting for slow main memory. The cause of the problem is that one can either
produce cheap but slow memory (DRAMs) of high capacity or an expensive but
fast one of lower capacity (SRAM, register cells). The solution applied today is
the introduction of a memory hierarchy. Between CPU (core) and main memory
a small but fast cache memory is placed. Since programs normally exhibit a high
degree of temporal and spatial locality, caching accesses for reuse in the fast mem-
ory improves the overall performance considerably. Trends nowadays go to two
or even three levels of cache memories between the CPU core and main memory.
Where caches are used to reduce access latencies, pipelining is used to make
use of the massive parallelism enabled by overlapping the execution of program
instructions. The execution of an instruction can be split into a number of phases
like fetching, decoding, executing and writing back, each consuming one clock
cycle. The different stages of consecutive instructions can be executed in parallel,
bringing the maximal performance to one instruction per processor cycle.
Footnote 1: The original paper by Moore estimated a doubling of transistors per
chip every 12 months. The doubling period for transistors per square centimeter
flattened to 18 months in the 1970s, cf. [Tuo02, Sto03].
Figure 2.1: Performance gap between CPUs and main memory (logarithmic plot of
performance, from 0.1 to 100000, against the year, from 1980 to 2005, with one
curve each for CPU and memory)
Both caches and pipelines are good at increasing the average performance
of the system. However, there may be pathological cases of programs that
exhibit a very bad worst-case performance. Moreover, pipelines and caches are very
history-sensitive: their behavior depends on the instructions executed before. This
makes it difficult to predict the behavior of the actual instruction. E.g. whether an
instruction cache access is a hit or a miss depends on the instructions fetched be-
fore the current one. Whether a pipeline has to stall due to hazards depends on the
predecessor instructions still in the pipeline, etc. Another issue is that pipelines
and caches may interfere in non-obvious ways such that an event locally classi-
fied as a best-case scenario, e. g. a cache hit, may globally lead to the worst-case,
i. e. a longer execution time of the whole program. Such contrived effects make
it very difficult to argue about the worst-case execution time of programs running
on systems with caches and/or pipelines.
In the next section, the architecture and behavior of caches are summarized,
followed by an overview of processor pipeline architectures in Section 2.2 and the
concepts of other system components in Section 2.3. Afterwards we present the
important notion of timing anomalies which are the main cause of the complexity
of WCET tools for modern processors. Two processors for which we have in-
stantiated the analysis described in this thesis are presented in more detail to give
an impression of the sort of hardware features that have been modeled with our
approach.
2.1 Caches
A cache is a small fast memory that sits between the main processor core and slower
memory. There can be several nested levels of caches: the one closest to the
processor is the L1 cache, the next one the L2 cache and so on. Data that is not
found in a cache at level $n$ is searched for in level $n + 1$ or finally the main memory.
The architecture of a cache is depicted in Figure 2.2.
Figure 2.2: Cache architecture (sets 0 to S−1, each consisting of lines 0 to A−1;
every line holds state bits D and V, a tag, and the line data)
The cache consists of $S$ sets with $A$ lines each. $A$ is called the associativity of
the cache. If $A = 1$ then the cache is called direct mapped; if $S = 1$, then it is called
fully associative, otherwise it is called $A$-way set associative. Each set $s_i$ thus is a
fully associative sub-cache in itself. Each line $l_j$ contains a memory block of data.
The size of this block is called the line size, $L$. Apart from the memory block,
a tag is stored in the line that gives the high-order memory address bits for the
memory block stored here. Finally, there are status bits in the line that indicate
whether the entry contains valid data ($V = 1$), and for data caches, whether the
data in the block has been changed in the cache but not written to main memory,
i. e. whether the line is dirty ($D = 1$). There can be additional status bits for
coherency state or process tags. The parameters characterizing a cache are its
capacity $C$, its line size $L$, the number of sets $S$ and the associativity $A$, with
$C = S \cdot A \cdot L$.
The line size and number of sets are always powers of 2, $L = 2^l$ and $S = 2^s$.
Normally, the associativity and capacity are also powers of two (one exception is
the instruction cache of the SuperSPARC, which has $A = 5$ and $C = 20480$, cf.
[Mic92]). Table 2.1 shows instruction cache configurations for some processors.
Another parameter is the replacement policy of the cache, which determines
the lines that should be replaced from the cache when it is full and further data is
accessed.
Processor                     C (capacity)   A (associativity)   L (line size, bytes)   Replacement
SuperSPARC II                 20kB           5                   64                     LRU
UltraSPARC III                32kB           4                   32                     LRU(?)
PPC 755                       32kB           8                   32                     Pseudo LRU
G5 (PPC 970)                  64kB           1                   32                     -
MCF 5307                      8kB            4                   16                     Pseudo Round Robin
Pentium III                   16kB           4                   32                     LRU
Intel IXC1000 (XScale Core)   32kB           32                  32                     Round Robin
SH5-101                       32kB           4                   32                     LRU
Alpha 21264                   64kB           2                   64                     Branch Predict
MIPS 24k                      64kB           4                   32                     LRU
Table 2.1: Some instruction cache configurations
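The number of sets is not listed in Table 2.1, but it follows from the relation between capacity, number of sets, associativity and line size. The sketch below solves that relation for the number of sets for three of the table rows. The function name is hypothetical, and 1 kB is assumed to be 1024 bytes, which matches the SuperSPARC capacity of 20480 bytes given above.

```python
def number_of_sets(capacity_bytes, associativity, line_size):
    """Solve C = S * A * L for S; reject inconsistent parameters."""
    sets, remainder = divmod(capacity_bytes, associativity * line_size)
    if remainder:
        raise ValueError("capacity is not a multiple of A * L")
    return sets


KB = 1024  # assumption: the table's kB means 1024 bytes
configs = {
    "SuperSPARC II": (20 * KB, 5, 64),
    "PPC 755": (32 * KB, 8, 32),
    "MCF 5307": (8 * KB, 4, 16),
}
for name, (c, a, l) in configs.items():
    print(f"{name}: S = {number_of_sets(c, a, l)}")
```

Note that the SuperSPARC II still has a power-of-two number of sets even though its associativity and capacity are not powers of two.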
The places where data blocks are stored in the cache are determined by the
addresses of the memory blocks. When accessing data at address $a$, the lower $l$
bits are used to select the data in the $2^l$ bytes of the line data. The next lower $s$
bits of the address form the index $i$ of the set where the block may reside, thus
$$ i = \lfloor a / L \rfloor \mathbin{\&} (S - 1), $$
where $\&$ denotes the bitwise AND.
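As an illustration of this address decomposition, the following sketch splits an address into tag, set index and byte offset. The function name and the example address are hypothetical; the cache parameters in the usage line (32-byte lines, 128 sets) are chosen for illustration only.

```python
def split_address(addr, line_size, num_sets):
    """Split an address into (tag, set index, byte offset within the line).

    line_size and num_sets must be powers of two, as stated in the text.
    """
    assert line_size & (line_size - 1) == 0, "line size must be a power of 2"
    assert num_sets & (num_sets - 1) == 0, "number of sets must be a power of 2"
    offset = addr & (line_size - 1)                # lower l bits
    index = (addr // line_size) & (num_sets - 1)   # next s bits
    tag = addr // (line_size * num_sets)           # remaining high-order bits
    return tag, index, offset


tag, index, offset = split_address(0x12345678, line_size=32, num_sets=128)
print(hex(tag), hex(index), hex(offset))
```

Two addresses can only compete for the same cache lines if they yield the same set index, which is why the index computation matters for any cache analysis.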
For data read accesses or instruction fetches, the $A$ elements in the selected
set are searched in parallel by comparing the top $w - (l + s)$ bits of the address,
where $w$ is the total number of address bits, with the tags of the lines in the
set (ignoring lines whose V bit is not set). If a match occurs, then we have a hit
in the cache and the data in the line (indexed by the lower $l$ bits of the address) is
returned to the processor. If the required memory block is not in the cache, then
it must be loaded from main memory. The replacement strategy decides where
the newly loaded memory block should be placed in this set. Normally, invalid
lines ($V = 0$) are filled with new data first. If there are no invalid lines, a line is
selected to be replaced from the cache and the new data takes its place. Which line
is replaced is determined by the replacement strategy. When new data is placed
into the cache, the tag of the line is updated with the tag bits of the access address,
the V bit is set and D is cleared (for data caches). Then the processor starts to
fetch the bytes of the data for this line from the memory hierarchy (next cache
level or main memory). As the line size is normally bigger than one data word, it
takes several external accesses to fill the complete line. If the data word that was
referenced in the line is fetched first, the fetch is called critical word first (and
wraps around at the end of the line). Otherwise the first word of the line (offset 0)
is fetched first. If the cache can serve accesses that hit while the remainder of a
line is being loaded, it is called a hit under miss cache.
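The difference between the two fill orders can be made concrete with a small sketch. It assumes a 32-byte line that is fetched in 4-byte words in ascending order with wrap-around; the word size, the function name and the example address are assumptions, and a real bus may transfer wider units or use a different burst order.

```python
def fill_order(addr, line_size=32, word_size=4, critical_word_first=True):
    """Return the byte offsets within the line in the order they are fetched."""
    words = line_size // word_size
    # Start at the referenced word, or at offset 0 for a plain line fill.
    start = (addr % line_size) // word_size if critical_word_first else 0
    return [((start + i) % words) * word_size for i in range(words)]


print(fill_order(0x1014))                             # starts at offset 20, wraps to 0
print(fill_order(0x1014, critical_word_first=False))  # starts at offset 0
```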
For data writes, there are several possibilities for the behavior of the cache.
If the write is always issued to main memory (changing the corresponding data
block in the cache, if it happens to be there), the cache is called write-through. If
the data is only changed in the cache, it is a write-back cache. In the latter case,
the data must eventually be flushed to memory to obtain a consistent state. This is
done either by explicitly flushing the cache through machine instructions or when
the modified line is replaced from the cache due to another cache miss. To record
if a line has been written but not flushed to memory, the cache line has the dirty bit
D set to 1. If a written-to block is not in the cache, the block may either be loaded
from memory first, placed in the cache and then modified (write allocate) or the
data may just be written to main memory, without loading the line. Normally,
write-back and write-allocate are used together.
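The write policies can be contrasted in a deliberately simplified model. The sketch below ignores capacity and replacement, represents the cache as a dictionary from block to a [value, dirty] pair, and counts the writes that reach main memory; all names are hypothetical.

```python
def write(cache, memory, block, value, write_back=True, write_allocate=True):
    """Apply one write and return the number of memory writes it causes."""
    mem_writes = 0
    if block not in cache and write_allocate:
        cache[block] = [memory.get(block), False]  # load the block first
    if block in cache:
        cache[block][0] = value
        if write_back:
            cache[block][1] = True                 # defer the memory update
    if not write_back or block not in cache:
        memory[block] = value                      # write reaches memory now
        mem_writes += 1
    return mem_writes


def flush(cache, memory):
    """Write all dirty blocks back to memory and clear their dirty bits."""
    for block, line in cache.items():
        if line[1]:
            memory[block] = line[0]
            line[1] = False


memory, cache = {7: 0}, {}
print(write(cache, memory, 7, 42), memory[7])  # write-back: memory still holds 0
flush(cache, memory)
print(memory[7])                               # now 42
```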
Some architectures allow several ways of the cache to be locked. Then, no
replacement occurs in these portions, but data may be loaded in the first access to empty
(invalid) lines of the way. This feature is often utilized to preallocate critical code
and data for fast and deterministic access.
All architectures allow only portions of their address space to be designated as
cacheable, so that only accesses to these areas go through the cache. The other
memory areas bypass the cache completely and always result in external bus
transactions.
A cache that holds both data and instructions is called a unified cache. Here,
data reads and writes compete with instruction fetches for entries in the cache.
Most systems nowadays internally have separate paths for data and instructions
(Harvard architecture) and the paths are connected to separate data and instruction
caches.
2.1.1 LRU Replacement
One replacement strategy that is often implemented is the least recently used strat-
egy. The line in a set that has been unreferenced for the longest time is the one
being replaced first. Here, the sets of the cache are independent, i. e. the refer-
ences of lines are counted separately for each set. An age of the A lines in a set is
introduced such that the line referenced last is the youngest line and the one un-
referenced for the longest time is the oldest. Each access to the set, cache hit and
miss, updates the ages of all lines. If the access is a cache hit, the line accessed
becomes the youngest line (age 0), and all lines that were younger than this line
before this access age by 1. For a cache miss, the oldest line is replaced from the
cache, all other lines age by 1 and the newly loaded line becomes the youngest
line. Figure 2.3 shows a set with 4 ways and the aging of its lines for a cache hit
and a cache miss.
Figure 2.3: Aging of cache lines under accesses for LRU (a set with 4 ways holding
the blocks a, b, c, d at ages 0 to 3, shown for accesses to c and e)
To describe this strategy precisely, we can define an update function for a
cache. The set of memory block addresses $M$ is the set of integers obtained from
addresses by leaving out the lower $l$ bits, which only select data in a line. We
describe the cache contents by giving for each memory block address its age in
the set it is mapped to. For blocks that are not in the cache, we introduce the “age”
$\infty$. So the set of ages is $G = \{0, \ldots, A - 1, \infty\}$. We introduce an ordering $<$ on
them by $0 < 1 < \cdots < A - 1 < \infty$. An addition $\oplus$ on $G$ is defined by
$$ a \oplus b = \begin{cases} \infty, & \text{if } a = \infty \lor b = \infty \lor a + b \geq A \\ a + b, & \text{otherwise} \end{cases} $$
The set a memory block is placed in is given by the function
$\mathit{set} : M \to \{0, \ldots, S - 1\}$, defined by $\mathit{set}(m) = m \bmod S$. We write
$\mathit{set}_i(M)$ for the set of memory blocks mapped to set $i$.
A cache is now a mapping $C : M \to G$ which maps memory block addresses
to their age. Since sets in the cache are independent, we introduce a mapping of
sets to ages, i.e. a set mapping $T_i$ is a function $T_i : \mathit{set}_i(M) \to G$. Then a cache is
defined in terms of $S$ set mappings: $C(m) = T_{\mathit{set}(m)}(m)$.
A set mapping $T_i$ must satisfy the constraint
$$ T_i(m) \neq \infty \Rightarrow \neg\exists m' : m' \neq m \land T_i(m) = T_i(m'), $$
i. e. there cannot be different memory blocks with the same age in the set.
The update of a cache when accessing a memory block $m$ is then described by
a function $U_c : (M \to G) \times M \to (M \to G)$ with
$$ U_c(C, m)(m') = \begin{cases} C(m'), & \text{if } \mathit{set}(m') \neq \mathit{set}(m) \\ U_s(T_{\mathit{set}(m)}, m)(m'), & \text{otherwise} \end{cases} $$
The update of a set mapping, $U_s(T_i, m)$, is given by
$$ U_s(T_i, m)(m') = \begin{cases} 0, & \text{if } m = m' \\ T_i(m'), & \text{if } m \neq m' \land T_i(m) \neq \infty \land T_i(m) < T_i(m') \\ T_i(m') \oplus 1, & \text{otherwise} \end{cases} $$
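The set and cache update functions can be transcribed almost literally into executable form. The following is a minimal sketch, not the implementation used in the tool: a set mapping is represented as a dictionary that holds only the blocks currently in the set, so a block that is absent from the dictionary has the age of an uncached block. The names `age_add`, `update_set` and `update_cache` are hypothetical.

```python
import math

INF = math.inf  # "age" of a block that is not in the cache


def age_add(a, b, assoc):
    """Saturating addition on ages: a result of A or more means 'not cached'."""
    return INF if a == INF or b == INF or a + b >= assoc else a + b


def update_set(ages, m, assoc):
    """Set update: return the new block -> age mapping after accessing m."""
    age_m = ages.get(m, INF)
    new = {m: 0}                                   # accessed block is youngest
    for other, age in ages.items():
        if other == m:
            continue
        if age_m != INF and age_m < age:
            new[other] = age                       # older than m: unchanged
        else:
            aged = age_add(age, 1, assoc)
            if aged != INF:                        # blocks reaching age A are evicted
                new[other] = aged
    return new


def update_cache(cache, m, num_sets, assoc):
    """Cache update: only the set that m maps to (m mod S) is changed."""
    i = m % num_sets
    return [update_set(t, m, assoc) if j == i else t for j, t in enumerate(cache)]


lru_set = {"a": 0, "b": 1, "c": 2, "d": 3}
print(update_set(lru_set, "c", assoc=4))  # hit: a and b age, d keeps its age
print(update_set(lru_set, "e", assoc=4))  # miss: d is evicted, the others age
```

Dropping evicted blocks from the dictionary keeps the representation finite while agreeing with the mathematical definition, in which they are mapped to the age of an uncached block.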
2.2 Pipelines
The process of executing one instruction can be divided into a number of sequen-
tial sub-operations. A simple example is the pipeline of the DLX machine from
[HP96]:
• IF: Instruction fetching from memory
• ID: Instruction decoding and fetching of register operands
• EX: Execution of the operation denoted by the instruction
• MEM: Memory access (read or write)
• WB: Write-back of the results into registers
Each of these sub-operations is called a pipeline stage. Instead of waiting
until an instruction has left the last stage of the pipeline (we say it has retired in
this case) before letting the next instruction enter the first stage, as depicted
in Figure 2.4, one can overlap the execution of different stages of subsequent
instructions. If there are no dependencies between stages, a perfect pipelining as in
Figure 2.5 is achieved. Here, the execution of an instruction takes one cycle, as
soon as the pipeline has been filled completely with instructions. We say that the
CPI (clock cycles per instruction) is equal to 1, in this case.
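The gain from overlapping can be quantified with a small sketch. It assumes the five DLX stages, one cycle per stage and no hazards, and it counts only the cycles spent in the stages (the “Finished” column of the figures is not counted); the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def cycles_sequential(n, stages=len(STAGES)):
    """Each instruction passes through all stages before the next one starts."""
    return n * stages


def cycles_pipelined(n, stages=len(STAGES)):
    """Ideal pipeline: fill once, then one instruction completes per cycle."""
    return stages + (n - 1) if n > 0 else 0


def occupancy(n, cycle):
    """Stage -> instruction number occupying it in the given cycle (1-based)."""
    return {s: cycle - k for k, s in enumerate(STAGES) if 1 <= cycle - k <= n}


n = 7
print(cycles_sequential(n), cycles_pipelined(n))
print(occupancy(n, 5))
```

For a large number of instructions, the ratio of cycles to instructions in the pipelined case approaches 1, which is the ideal CPI mentioned above.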
Clock Cycle   IF   ID   EX   MEM   WB   Finished
1             I1
2                  I1
3                       I1
4                            I1
5                                  I1
6                                       I1
7             I2
Figure 2.4: Sequential execution
2.2.1 Pipeline Hazards
Pipelining exploits the parallelism that is inherent in a program. In practice a CPI
of 1 is never achievable as there are dependencies between instructions and other
hazards. When a hazard occurs, parts of the pipeline have to be stalled, i. e. some
stages do not perform any work but wait until the stall condition has vanished.
Pipelines differ in which stages are stalled by a hazard: while some stall all stages
before the one the hazard occurs in, others only stall the stages that want to
deliver results to stages stalled by the hazard. This way, e. g., instruction fetching
to a prefetch queue could continue if a data hazard stalls instruction dispatch due
to data dependencies.
Clock Cycle   IF   ID   EX   MEM   WB   Finished
1             I1
2             I2   I1
3             I3   I2   I1
4             I4   I3   I2   I1
5             I5   I4   I3   I2    I1
6             I6   I5   I4   I3    I2   I1
7             I7   I6   I5   I4    I3   I2
Figure 2.5: Fully pipelined execution
There are three categories of hazards:
Structural Hazards
These hazards result from a shortage of functional units in the processor. They
occur when two instructions in the pipeline try to use the same unit in the chip.
Often the memory bus is the functional unit that causes these hazards: the IF and
MEM stages both want to transfer data over the bus. In this case, IF must be stalled
until MEM has finished the transfer. This inserts a bubble after the IF stage, i. e.
the ID stage is empty in the next cycle and performs no work. Another example
are the integer units of the PowerPC 755: if more than 4 integer instructions have
been issued, the issue of the next instruction must be stalled since there is no free
integer unit (or reservation station) available for it.
Data Hazards
These hazards are the most common ones and originate from data dependencies
between the operands of subsequent instructions. E.g. in Figure 2.6 the second
instruction needs the result of the first one as an input operand in register r8. The
second instruction cannot begin its execution until the first one has written back
the result into the register file.
0x10000: add r8,r9,r9
0x10004: sub r10, r8, r11
0x10008: add r11,r7,r7
0x1000C: add r6,r11,r11
0x10010: add r11,r8,r8
Figure 2.6: Data dependencies between instructions
There are three different kinds of data hazards between instructions:
• Read after write (RAW) hazards occur if a subsequent instruction reads a
register that is written by a previous instruction, as for the first two
instructions in Figure 2.6.
• Write after read (WAR) hazards occur if an instruction writes a register that
is read by a previous instruction. Here it must be guaranteed that the write
occurs only after the first instruction has read the register. This is
guaranteed for the simple pipeline of the DLX, due to its in-order structure. For
out-of-order execution, like the PowerPC 755, special logic must be
implemented to resolve this hazard. In Figure 2.6 the instructions at 0x10004 and
0x10008 cause a WAR hazard.
• Write after write (WAW) hazards occur if a subsequent instruction writes
the same register as a previous one. It must be guaranteed that only the last
write is performed to the register file. Instructions 0x10008 and 0x10010
cause a WAW hazard in the example. Again, for in-order pipelines with
only a single write-back stage, this hazard cannot occur. On the PowerPC
755, the reorder buffer and its in-order retirement regime guarantee that the
ordering of writes to the register file is maintained.
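The three kinds of data hazards can be checked mechanically for the instructions of Figure 2.6. The sketch below assumes the operand convention implied by the text, namely that the first operand is the destination register and the remaining ones are sources. It compares register names pairwise only and says nothing about whether a given pipeline actually stalls; the function names are hypothetical.

```python
def parse(line):
    """Parse 'op rd,rs1,rs2' into (destination, set of source registers)."""
    _, operands = line.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return regs[0], set(regs[1:])


def hazards(earlier, later):
    """Return the kinds of data hazard between an earlier and a later instruction."""
    d1, s1 = parse(earlier)
    d2, s2 = parse(later)
    found = set()
    if d1 in s2:
        found.add("RAW")   # later reads what earlier writes
    if d2 in s1:
        found.add("WAR")   # later writes what earlier reads
    if d1 == d2:
        found.add("WAW")   # both write the same register
    return found


program = [
    "add r8,r9,r9",      # 0x10000
    "sub r10, r8, r11",  # 0x10004
    "add r11,r7,r7",     # 0x10008
    "add r6,r11,r11",    # 0x1000C
    "add r11,r8,r8",     # 0x10010
]
print(hazards(program[0], program[1]))
print(hazards(program[1], program[2]))
print(hazards(program[2], program[4]))
```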
Control Hazards
The pipeline only delivers good performance, if it is filled with instructions. If a
dynamic branch instruction occurs, either as a conditional branch or an indirect
jump, the fetch stage cannot continue to fetch the next instruction after the branch
until the branch target is known, i. e. until the instruction that computes the argu-
ment to the branch has written back its result. This means that the pipeline will
run empty and has to be refilled with the instructions at the target address when
the depending instruction has been completed.
2.2.2 Performance Improving Features
Several features have been implemented in pipelines to avoid hazards or limit their
impact on (average-case) performance.
Prefetching
A pipeline can be designed to continue fetching instructions into a prefetch queue
during a stall in later stages of the pipeline caused, e. g., by a memory access.
Instructions can then be decoded and issued from this queue, eliminating delays due
to instruction fetching. The prefetch queue can either hold decoded instructions
(as for the MCF 5307, with 8 entries in the prefetch queue) or raw instruction
words (as for the PowerPC 755 with 6 entries in the instruction queue). Care must
be taken to flush the contents of prefetch queues when data in instruction memory
has changed. Prefetching introduces another complication for WCET analysis,
as the amount of prefetching depends on the overall state of the pipeline: if the
pipeline stalls due to data dependencies, the prefetch queue fills up with
instructions. Since instructions are normally fetched via the instruction cache, this (in
combination with branch prediction) alters contents and ages of lines in the cache,
which in turn influences the replacement behavior of the cache and thus the
global timing.
Branch Prediction
To reduce control hazards the pipeline should be filled even after a branch instruc-
tion has been fetched. This can be done by combining prefetching with decoding
of branches early in the pipeline and redirecting the fetching of instructions to
the known target of a static (non-dynamic) branch (branch folding). If the branch
depends on other results, branch folding can also be applied when the result is
already known during fetching/decoding of the branch instruction. If the result is
not yet known, one can predict the target of the branch in the case of conditional
branches, which only have two possible target addresses (the instruction follow-
ing the branch and the instruction at the fixed taken-target address of the branch).
Fetching of instructions is redirected to the predicted target address of the branch,
based on the assumption that after the result of the branch condition is known, it
will resolve the prediction as correct and the instructions are already available. If
the prediction went wrong (misprediction), the instructions fetched are discarded
and the fetching is redirected to the correct target address of the branch (and the
pipeline stalls until the instructions are available).
The prediction can be done in a variety of ways:
• Static branch prediction encodes the most probable target of the branch in
the branch instruction. E.g. for the MCF 5307, backward branches are predicted
as taken, as they correspond to loop back edges and are usually
taken N out of the N + 1 times the condition of a loop with N iterations
is evaluated. For other architectures, a special bit in the branch instruction
can indicate whether the branch should be predicted taken or not, to allow
for compiler optimizations of branch instructions (one example is the PowerPC
architecture). Static branch prediction has the advantage that it is easy
to implement in hardware and also rather easy to account for in a WCET
analysis.
• Dynamic branch prediction uses a cache to record the last outcomes of
branches and, based on this history, predicts the next branch direction.
Most architectures with dynamic branch prediction feature a branch history
table (BHT) that is organized as a small cache where each entry records
the address of a branch instruction and a branch counter of k bits. If the
value of the counter is $\geq 2^{k-1}$, the branch is predicted as taken, otherwise it
is predicted as not-taken. In the case of a misprediction, the counter corresponding
to the BHT entry of the branch is decremented by one, saturating at
0, otherwise it is incremented by one, saturating at $2^k - 1$. If $k > 1$, then this
prediction scheme reacts less sensitively to one misprediction in a series of
correct predictions. Usual values for k range from 0 to 4. The PowerPC 755
has a 512 entry BHT with $k = 2$ bits. Another form used, often in combination
with the BHT, is the branch target buffer (BTB). This is also a cache of
recent branches but instead of just containing a prediction for the outcome
of the branch, it stores some instructions at the predicted target address of
the branch. If a branch hits in the BTB, the instructions at the predicted
target address are immediately available to be inserted into the instruction
fetch stream. After a branch is first encountered, the instructions fetched
after the prediction are put into the BTB. The PowerPC 755 has a 64 entry
BTB (called BTIC here). The advantage of the BTB is that the instructions
are available earlier than they can arrive from the instruction cache (usually
one clock cycle).
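The two prediction schemes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the table is indexed directly by the word address without tags, all counters start at 0, and the counter update follows the conventional rule (increment when the branch is taken, decrement when it is not). The names `SaturatingCounterPredictor`, `static_backward_taken` and `mispredictions` are hypothetical.

```python
from typing import Iterable


def static_backward_taken(branch_address: int, target_address: int) -> bool:
    """Static scheme: predict backward branches (loop back edges) as taken."""
    return target_address < branch_address


class SaturatingCounterPredictor:
    """Branch history table of k-bit saturating counters (untagged, assumption)."""

    def __init__(self, k: int = 2, entries: int = 512) -> None:
        self.entries = entries
        self.max_value = (1 << k) - 1   # counter saturates at 2^k - 1
        self.threshold = 1 << (k - 1)   # predict taken if counter >= 2^(k-1)
        self.table = [0] * entries      # initial state is an assumption

    def _index(self, address: int) -> int:
        return (address >> 2) % self.entries  # 4-byte instructions assumed

    def predict(self, address: int) -> bool:
        return self.table[self._index(address)] >= self.threshold

    def update(self, address: int, taken: bool) -> None:
        i = self._index(address)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_value)
        else:
            self.table[i] = max(self.table[i] - 1, 0)


def mispredictions(predictor: SaturatingCounterPredictor, address: int,
                   outcomes: Iterable[bool]) -> int:
    """Count mispredictions of one branch for a sequence of actual outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(address) != taken:
            misses += 1
        predictor.update(address, taken)
    return misses
```

Feeding the outcome sequence of a loop that is entered several times shows the effect of k > 1: the single not-taken outcome at loop exit does not flip the prediction for the next loop entry. It also shows why the analysis must track predictor state, since the result depends on the execution history.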
Static branch prediction proves to be very efficient in practice with less than
30% mispredictions according to [HP96]. While dynamic branch prediction is
more accurate on average, the fetch behavior of programs depends on the execu-
tion history (the states of the BTB and BHT) and is thus much harder to analyze
than with static branch prediction. To obtain a precise WCET, the branch predic-
tion must be modeled precisely, which makes the resulting analysis more complex.
Delay Slots
A different method to avoid pipeline stalls due to dynamic branches is to use one
or more delay slots. Delay slots are the instructions directly following a branch.
These instructions are always executed, even if the branch is taken; program
execution continues at a different place in the program only after the delay slot
instructions have been executed. This way, the pipeline need not be stalled if the
condition of the branch is not yet known. Ideally, the condition of the branch is
known by the time the delay slot instructions have been executed and no further delay
occurs. The SPARC architecture, e. g., has one delay slot. Delay slots make it
somewhat more difficult to reconstruct control flow from a binary executable.
Forwarding/Shortcuts
RAW hazards can be partially eliminated in a pipeline by forwarding the result of
an operation to any other stage that holds an instruction depending on this result.
This way, the depending pipeline stage need not be stalled until the previous instruction
reaches the write-back stage. Shortcuts are special-case evaluations for
instruction execution, e. g. a multiplication can be shortened if one argument is 0
(or has the upper $8 \cdot n$ bits set to zero), which means that execution times for
instructions depend on operand values. Forwarding can be viewed as an instruction-independent
shortcut mechanism. Forwarding and shortcuts, together with timing
anomalies (see Section 2.4), can make a WCET computation expensive, as every
possible execution duration has to be considered.
Superscalarity
The simple DLX pipeline described above can only decode and start to execute
one instruction per clock cycle, thus the optimal CPI is equal to 1. Superscalar
(or multiple-issue) machines can start the execution of more than one instruction
per clock cycle. If a processor can issue two instructions at once, the ideal CPI is
0.5 (the PPC 970 can issue up to 4 instructions plus one branch per cycle, giving
an ideal CPI of 0.2). How many instructions are issued is decided dynamically,
according to dependency rules among the instructions in the prefetch queue. Naturally,
the dynamic nature of multiple-issue makes it harder to analyze
precisely for WCET, as the modeling complexity increases.
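The ideal CPI values quoted above follow from a simple relation. Writing $w$ for the maximum number of instructions that can be issued per cycle (a symbol introduced here for illustration only):

```latex
\[
  \mathrm{CPI}_{\mathrm{ideal}} = \frac{1}{w},
  \qquad w = 2 \;\Rightarrow\; \mathrm{CPI}_{\mathrm{ideal}} = 0.5,
  \qquad w = 4 + 1 = 5 \;\Rightarrow\; \mathrm{CPI}_{\mathrm{ideal}} = 0.2 .
\]
```

This is a lower bound on the CPI only; the value actually reached depends on the dependencies among the instructions.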
Out-of-order Execution
In order to overcome stalls due to structural and data hazards, a processor may
execute instructions out-of-order. That is, the execution of subsequent instructions
may be begun and finished before a previous instruction has finished its execution.
This greatly increases the utilization of the functional units in the processor, as
later instructions with no data dependencies can already be issued and dispatched
to the functional units while their predecessors are still waiting for input data or
continuing their execution. E. g. an integer division usually takes more than one
clock cycle and blocks subsequent instructions from executing in a strict in-order
pipeline.
In a processor with multiple-issue, the distribution of instructions to functional
units can be out-of-order too: which instruction gets to a functional unit and which
one has to wait in a reservation station can be decided based on availability of
operands. Alternatively, as in the PowerPC 755, issuing can be in-order with
only the execution being out-of-order.
Out-of-order execution makes the design of pipelines much more difficult, be-
cause correct reading and writing of results in program order must be maintained.
Also, the problem of precise exceptions arises. If one instruction causes an excep-
tion, e. g. a page fault for a memory access, then the operating system must know
exactly where to restart program execution after the page fault has been handled.
In an out-of-order pipeline it is not immediately clear how to restart program exe-
cution if an instruction causes an exception while predecessor instructions are still
unfinished.
Two approaches to maintaining a correct machine state for out-of-order pipelines
are the scoreboard [Tho64] and the Tomasulo algorithm [Tom67]. The
latter one uses reservation stations at functional units that take instructions waiting
for their input dependencies to resolve and a reorder buffer that keeps track of the
sequential dependencies of register updates.
For WCET analysis, out-of-order execution poses big challenges because of
its highly dynamic nature with great influence on caches and memory access be-
havior.
Superscalarity and out-of-order execution also make it harder to exactly define
the notion of “execution of an instruction”, as a dynamically varying number of
instructions begin and end execution per clock cycle.
Speculative Execution
The performance gained by out-of-order execution is reduced if a branch is encountered
whose condition is not yet known, as the instructions at the (predicted)
target address of the branch cannot be executed (issued). We call instructions at
the predicted target address of a branch speculative instructions until the branch
outcome is resolved.
The Tomasulo approach can be extended to allow speculative instructions to
begin their execution, i. e. being issued to functional units. If a prediction turns
out to be wrong, the speculative instructions are simply flushed from the units
and their entries in the reorder buffer are freed. Speculative instructions are not
allowed to write-back their results into the register file until the predicted branch
that caused them to be speculative has been resolved and the prediction affirmed.
Naturally, speculative execution makes the analysis of the pipeline behavior
even more complicated and necessitates a more complex modeling of the structure
of the pipeline in the analysis.
2.2.3 Other Features
Besides the features implemented in the CPU to enhance performance, there are
some more features implemented to support the operating system or to guarantee a
deterministic system state from the programmer’s point of view. In the following,
we present some of these features that may have an impact on WCET prediction.
Memory Protection
In order to support an OS that supplies multi-process, multi-user capabilities, pro-
tected memory areas must be provided by the CPU. Here, every process in the OS
has the impression that it runs alone on the CPU in its own address space. This
is implemented by using logical addresses in the processes, which are translated
to physical addresses by the virtual memory unit of the CPU. When switching
processes, the OS also switches this mapping, which is sometimes called VM
mapping.
The VM mapping is stored as an array of descriptors (PTE, Page Table En-
tries) in memory, each descriptor mapping a page of fixed size (mostly 4kB) from
logical to physical address. If an access to a page occurs, the CPU has to access
the corresponding PTE in memory to obtain the physical address for it, before it
can perform the access (possibly through the cache). To increase performance,
the CPU keeps a small cache of recently used mappings, so that further accesses
to a page whose mapping is in the cache can be performed without loading its as-
sociated PTE from memory. The capacity and associativity of this cache, which is
called Translation Lookaside Buffer (TLB), varies widely. Usually, LRU is used
as the replacement strategy for entries in the TLB. Apart from the mapping of
logical to physical address of a page, the PTEs also store access attributes for
the page, i. e. if the page can contain instructions, can be written to, is cacheable,
etc. Accesses with attributes not matching the page attributes lead to program
exceptions, as would accesses for which no mapping has been established. For
pages containing data, the PTEs also record if this page has been written to by
the processor, so that the OS can write back dirty buffer cache pages to files. If
the processor writes to a page, the associated PTE is fetched (if not present in
the TLB) from memory and a modified copy, which has the dirty bit set, is then
written back to memory before the write access is performed. Thus, a write ac-
cess can lead to two or more write accesses to memory if virtual memory is used.
The write-back of the PTE need only be performed for the first write access to the
page.
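The number of memory accesses caused by a single load or store under virtual memory can be illustrated with a small model. This is a sketch under simplifying assumptions: a fully associative TLB with LRU replacement, 4 kB pages, no data cache, and dirty bits that live only in the in-memory PTEs. The class `Tlb` and its method are hypothetical names.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4 kB pages, as in the text


class Tlb:
    """Fully associative TLB with LRU replacement (illustrative model)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()  # page number -> None, kept in LRU order
        self.dirty_pages = set()      # pages whose PTE already has the dirty bit set

    def memory_accesses(self, address: int, is_write: bool) -> int:
        """Return how many memory accesses one load/store causes."""
        page = address // PAGE_SIZE
        count = 0
        if page in self.entries:
            self.entries.move_to_end(page)        # TLB hit: refresh LRU position
        else:
            count += 1                            # TLB miss: fetch the PTE
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[page] = None
        if is_write and page not in self.dirty_pages:
            count += 1                            # first write: write back PTE
            self.dirty_pages.add(page)
        return count + 1                          # the access itself
```

The model shows why the cost of an access depends on the access history: the same store costs three memory accesses the first time and one afterwards, as long as the mapping stays in the TLB.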
For systems supporting I/O operations via dedicated pins (like the Intel archi-
tecture), a similar feature exists for controlling access to the I/O address space.
Some systems, like the MCF 5307, do not provide full virtual memory support
but can also associate certain areas of addresses with certain attributes. These
attributes then determine if a memory area is cacheable, writeable, executable,
etc. This feature is also present as an additional VM mapping mechanism in
systems supporting PTEs and TLBs to map large areas whose attributes never
change (like OS kernel address space), e. g. the IBAT and DBAT mapping registers
of the PowerPC family, cf. Section 2.6.
Since the memory access timing with enabled virtual memory depends on the
contents of the TLB and the status of the page (first write to a page), its behavior
is important for WCET determination. As addresses for data accesses are some-
times not precisely known in a WCET analysis, the behavior of VM is hard to
predict for them. As VM is rarely used in real-time systems, most WCET pre-
diction techniques assume that it is turned off, leading to more predictable access
behavior.
System Configuration
On systems embedding peripherals in the CPU, special control registers are usu-
ally available, which determine at which addresses the peripherals can be ac-
cessed. E.g. the MCF5307 has an embedded SRAM memory area of 4kB which
can be accessed via the fast processor core bus. The memory address of this area
must be programmed into a system register. Other registers, possibly memory
mapped, control the peripherals (UARTs, DMA engines, etc).
The WCET tool for the MCF5307 must know about the address of the SRAM
area, as accesses to it are faster than accesses to external memory. Also, accesses
to peripheral registers are faster than external memory accesses, so the register
areas of the peripherals must be known.
Although the mappings can be changed dynamically by writing new values to
the associated system registers, this is rarely necessary in real-time systems which
have a fixed configuration. Thus, such dynamic changes need not be modeled by
WCET tools.
Synchronization
In a highly parallel pipeline, changes to system registers or VM mappings may
not be able to affect instructions already (partially) being executed. To ensure
that after a given point of program execution, all subsequent instructions can rely
on the new system state imposed by such changes, synchronization points are
required. Some architectures, like the PowerPC, define synchronization properties
of certain instructions. Upon fetching such an instruction, the CPU automatically
prevents the execution of subsequent instructions or even refetches an instruction
stream to ensure that all effects by the instruction to the system state are committed
or that no two instructions alter the system state out of their program order.
Naturally, such behavior alters the timing of program execution quite signifi-
cantly compared to a fully pipelined program run. In combination with instruction
prefetching and speculation this can also have profound impacts on the ages or
contents of cache blocks. A WCET tool must therefore take the synchronization
behavior of instructions into account.
2.3 System Components
The caches and the pipeline of the processor are not the only influence on the timing
behavior; the other (computer) components of the real-time system must
be considered as well. This section will present some components and their influence
on system performance and thus WCET analysis. In our WCET tool, we have
modeled and integrated several of these components to obtain a correct and precise
analysis of the whole system.
2.3.1 Memory
The component with the greatest impact on execution time is the main memory.
Access to main memory is performed via a memory controller that translates bus
transactions of the processor to the access protocol of memory chips. In some
cases, this controller is embedded directly into the CPU (the MCF 5307, for example);
otherwise it has to be provided by external circuitry. The RAM controller is commonly
integrated into the main system controller (called northbridge in the PC world)
that also contains the interfaces to the peripheral busses.
SRAM
Static RAM (SRAM) implements the single RAM bit cells by transistors and thus
is quite fast. Little extra logic is needed to interface SRAM to the common address
and data buses of CPUs. On the other hand, SRAM is quite expensive and high
capacity chips are not easily available. The access times to SRAM are fixed, which
provides for good predictability and eases WCET prediction. SRAM systems are
only used for small memories and/or if the price of the system is irrelevant.
DRAM, SDRAM
Dynamic RAM (DRAM) has only one capacitor for each RAM bit cell, augmented
with a row of transistors, the sense amps, that amplify and drive the signals of a
row of RAM cells on the data bus. This design allows for extremely tight packing
of RAM cells and thus cheap and huge memories. The drawback is the slow speed
of the DRAM technology (around 10-16 times slower than SRAM). Internally, the
memory chips are organized as arrays of bit cells, with 4096 or 8192 bits in each
row and a varying number of columns (up to 4096). Each of these arrays con-
tributes 1 bit to the output data path of the chip, which has between 4 and 64 data
bits. So, for a 16 data bit chip, 16 RAM cell arrays are accessed in parallel in the
chip. In addition, each chip has a number of parallel banks, which duplicate the
array structure. Each data array has a row of sense amps, 4096 bits wide, which
takes the contents of one row of the array upon reading/writing. The access proto-
col to these chips differs slightly for some implementations, e. g. EDO (extended
data out), FPM (fast page mode) or SDRAM (synchronous DRAM).
Access to an SDRAM chip is synchronous to the rising edge of the memory
bus clock and can be divided into 3 phases, cf. [Mic03, Int99]:
1. One row of an array in the chip must be copied into the sense amps. This is
done by selecting the row with 12 address inputs and 2 bank select inputs
and asserting the RAS (row address strobe) signal. After a certain amount
of time (tRCD), the sense amps are filled with the contents of the row from
the memory array (which loses its contents). No access must be performed
during this waiting phase.
2. After tRCD has passed, a column in the array is selected, which means that
the corresponding bit in the row of sense amps is selected. This is done by
asserting the CAS signal. After a certain amount of time (tCL), the data has
been transferred from the sense amps into the output buffers and is available
at the data outputs. From now on, new data is available each clock cycle,
i. e. the SDRAMs work in burst mode. For write accesses, the data can be
transferred each cycle over the bus after CAS has been asserted.
3. After the burst is finished and before the next access to a different row can
begin, the contents of the sense amps must be copied back into the capac-
itors of the memory array (precharging). This procedure takes a certain
amount of time (tRP).
An SDRAM controller can keep rows (which are also called pages in the cor-
responding literature) in an SDRAM chip open, i. e. skip the precharge step until
it opens a new row. An access to an open page can skip the first step of the access
procedure above and thus is faster (by tRCD). Precharging times can be hidden if
the access after the current one (which necessitates the precharge) goes to a differ-
ent memory chip. SDRAM controllers implement different strategies for keeping
SDRAM pages open and mapping memory access addresses to memory banks
containing SDRAM and mapping the addresses to the column and row indices of
the memory arrays.
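The effect of the open-page policy on access times can be sketched as follows. This is a simplified model of a single bank, assuming that the controller always leaves the accessed row open, that precharging is never hidden, and that refreshes are ignored. The timing values are parameters in memory bus cycles; the class `SdramBank` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SdramBank:
    """One SDRAM bank under an open-page policy (illustrative model)."""
    t_rcd: int                      # RAS-to-CAS delay
    t_cl: int                       # CAS latency
    t_rp: int                       # precharge time
    open_row: Optional[int] = None  # row currently held in the sense amps

    def access(self, row: int, burst_length: int) -> int:
        """Return the cycles until the last word of the burst is transferred."""
        cycles = 0
        if self.open_row != row:
            if self.open_row is not None:
                cycles += self.t_rp   # phase 3 for the previously open row
            cycles += self.t_rcd      # phase 1: copy the row into the sense amps
            self.open_row = row
        # phase 2: first word after t_cl, then one further word per cycle
        cycles += self.t_cl + (burst_length - 1)
        return cycles
```

The same access thus has three possible durations depending on the state of the bank (row open, bank idle, other row open), which is the state a timing analysis has to track.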
In addition, the contents of the DRAM chips have to be refreshed periodically,
because the capacitors in the arrays lose their charge and thus their bit content.
Refreshing is done by copying the rows into the sense amps and writing them
back. For this, the SDRAM controller has to issue precharge commands to the
chips periodically (closing any open pages beforehand). During refresh periods,
the chips cannot be accessed. Different chips allow different refresh regimes:
refreshes distributed evenly for all rows along the refresh interval, or a burst of
4096 refreshes in sequence.
All this makes the access times to SDRAM quite unpredictable. Even without
considering refreshes, the open pages in the SDRAMs must be modeled precisely
in order to predict the timing behavior of the memory. Although SDRAM is less
predictable than SRAM, it is used more often in current real-time system designs
due to its price and capacity advantages. For custom-built SDRAM controllers,
the behavior of the controller w.r.t. open page and interleaving policies and refresh
strategies is mostly known. If a widely available standard controller is used, this
is most often not the case, posing additional problems for system predictability
and WCET analysis.
2.3.2 Peripherals
A real-time system has a number of sensors to receive inputs and actuators in
order to pass the computed actions to the environment. These are realized as some
peripheral chips which are accessed by the CPU either via special I/O instructions
or normal load/store instructions for memory-mapped peripherals.
A system controller manages the translation from the CPU memory bus to the
address mechanisms used by the peripherals. Access times (as viewed from the
CPU) for peripherals are mostly constant and thus nicely predictable for WCET
analysis. One problem for WCET analysis is that sometimes the analysis cannot
exactly determine the address of a data access (due to the static nature of the
analysis). In this case, the analysis must also consider that the access will go to a
memory-mapped peripheral, which results in increased complexity of the analysis
and probably worsens analysis results, as peripheral access times are large (even
compared to the slow main memory). A WCET analyzer must provide a means to
specify which memory areas are mapped to peripheral access registers.
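How an imprecisely known address affects the access time bound can be sketched with a simple memory map lookup. The region layout and cycle counts used in the usage example are invented for illustration, and `Region` and `worst_case_access` are hypothetical names.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """One entry of a user-supplied memory map (illustrative)."""
    start: int          # first address of the region
    end: int            # one past the last address
    name: str
    access_cycles: int  # assumed constant access time


def worst_case_access(regions: List[Region], lo: int, hi: int) -> int:
    """Upper bound on the access time for an address known only to lie in [lo, hi]."""
    candidates = [r.access_cycles for r in regions
                  if r.start <= hi and lo < r.end]  # regions overlapping [lo, hi]
    if not candidates:
        raise ValueError("address interval is not covered by the memory map")
    return max(candidates)
```

For example, with a hypothetical map of a fast SRAM region, a slower SDRAM region, and a slow peripheral region, an address interval that touches the peripheral region forces the bound up to the peripheral access time, even if the access actually goes to SRAM at run time.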
2.3.3 DMA, Multiprocessors
Direct memory access (DMA) by other peripherals sharing the same main mem-
ory as the CPU (or by other CPUs in a multiprocessor system) causes conflicts for
concurrent memory accesses of the CPU. As only one device can access the mem-
ory, the other(s) have to wait. As DMA activity runs in parallel with the program
execution and is not synchronized with it, this means that memory access time can
vary according to access blockage. If the main memory is built from SDRAM, the
accesses of the peripherals also cause changes in what pages the SDRAM con-
troller has opened, introducing further unpredictability for CPU memory access
times.
Cache coherency is another problem: some processors snoop the memory bus
for accesses of other components to memory and implement a coherency protocol
that enforces that two CPUs with caches have a coherent picture of the state of
their caches w.r.t. main memory. E.g. if a dirty line exists in one cache and a second
CPU writes data to memory at that address, then this line must not be written
back to memory by the first CPU; it must be invalidated in the cache instead. Otherwise,
incorrect data would end up in memory. Clearly, snooping influences the
cache behavior of processors and reduces the predictability of memory accesses,
making a correct and precise WCET analysis difficult.
2.3.4 Busses
Computer system components are connected by a collection of busses. Besides the
memory bus of the processor, there is the SDRAM bus and busses to peripherals or
FLASH memory. Some of the peripheral buses have quite complex architectures.
One example is the wide-spread PCI (Peripheral Component Interconnect) bus,
[SA95]. Accesses on the PCI bus result in one of three cycles:
1. Cong space cycles access the configuration registers and information that
allow to auto-configure PCI devices.
2. I/O space cycles access data in the I/O address space of the peripheral.
3. Memory space cycles access data in the memory address space of the pe-
ripheral.
On CPUs that don’t have I/O space instructions themselves (like the MCF 5307),
the second type of cycles is performed by memory access operations which are
translated by the system controller to PCI I/O operations. The same is true for
the config space accesses. Furthermore, the addresses of the PCI memory and the
addresses of the CPU need not be in a 1:1 relation, but are rather mapped by the
system controller.
Since the timing of the different accesses for the PCI bus is important for
WCET determination, the WCET tool must know about the memory mapping
for PCI accesses. Things get even more complicated if multiple PCI busses are
nested, as then transactions are forwarded with yet another access protocol from
one bus to the other.
One effect that occurs if multiple busses are present is that in most cases they
do not use the same clock speed to time accesses. Nowadays, the CPU's internal
clock (core clock) speed is nearly always a multiple of the clock speed of its
memory bus, and peripheral busses are in turn clocked at lower speeds than this
memory bus. This means that accesses crossing busses incur jitter, i. e. the access
has to wait for the next rising clock edge of the target bus before continuing,
which results in a delay of up to $n - 1$ cycles of the faster clock (if the ratio between the
clock speeds is n). Sometimes busses are clocked at speeds that are not integer
multiples of each other but half-cycle multiples, e. g. the core clock running at
4.5 × the speed of the memory bus clock. This gives even more jitter possibilities.
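The alignment delay can be computed directly from the clock ratio. The sketch below assumes that both clocks have a common rising edge at core cycle 0 and that a request issued exactly on a bus clock edge does not have to wait; `alignment_delay` is a hypothetical name.

```python
import math
from fractions import Fraction


def alignment_delay(core_cycle: int, ratio) -> Fraction:
    """Core cycles to wait until the next rising edge of the slower bus clock.

    `ratio` is the number of core cycles per bus cycle, e.g. 4 or Fraction(9, 2).
    """
    ratio = Fraction(ratio)
    # Bus edges occur at integer multiples of `ratio` core cycles.
    next_edge = math.ceil(Fraction(core_cycle) / ratio) * ratio
    return next_edge - core_cycle
```

For an integer ratio n, the delay takes the n values 0 to n - 1. For a ratio of 4.5, nine different delays between 0 and 4 core cycles are possible, in steps of half a cycle, which illustrates the additional jitter possibilities.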
2.4 Timing Anomalies
The interaction of several features in a pipeline can be very hard to predict. While
all these features try to increase (average) performance, there are cases where they
influence each other in such a way that a locally faster execution of an instruction
can lead to a globally longer execution time of the whole program. More generally,
if a change $\Delta T_1$ of the execution time of one instruction leads to a change $\Delta T$
of the global execution time, we say that a timing anomaly (cf. [Lun02]) occurs if
either:
• $\Delta T_1 < 0$, i. e. the instruction executes faster, and $\Delta T < \Delta T_1 \lor \Delta T > 0$, i. e.
the overall execution is accelerated by more than the acceleration of the
instruction or it runs longer than before
• $\Delta T_1 > 0$, i. e. the instruction takes longer to execute, and $\Delta T > \Delta T_1 \lor \Delta T < 0$,
i. e. the overall execution is extended by more than the delay of the instruction
or the program takes less time to execute than before
The case $\Delta T_1 < 0 \land \Delta T > 0$ is a critical case for WCET analysis, while
$\Delta T_1 > 0 \land \Delta T < 0$ is critical for BCET analysis. These cases make it impossible to use
local worst-case scenarios for WCET or BCET computation. This necessitates
a conservative approximation to the possible damages of all cases or forces the
analysis to follow all possible scenarios.
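The definition translates directly into a predicate on the two time differences. A minimal sketch in Python, with `is_timing_anomaly` as a hypothetical name:

```python
def is_timing_anomaly(delta_t1: int, delta_t: int) -> bool:
    """Check the timing anomaly condition for a local change delta_t1 of one
    instruction's execution time and the resulting global change delta_t."""
    if delta_t1 < 0:
        # Instruction got faster: anomaly if the program gains more than the
        # local speed-up or becomes slower.
        return delta_t < delta_t1 or delta_t > 0
    if delta_t1 > 0:
        # Instruction got slower: anomaly if the program loses more than the
        # local delay or becomes faster.
        return delta_t > delta_t1 or delta_t < 0
    return False
```

A local delay of 8 cycles that shortens the whole program by one cycle, for instance, satisfies the second condition.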
A similar effect is well known in scheduling theory. E.g., [Gra69] shows that
tasks scheduled on multiprocessors can meet their deadlines if the real execution
times are assumed to be the WCETs, but there may be tasks missing their
deadlines if they execute for less than the WCET.
Unfortunately, as [LS98, Lun02] have observed, the worst-case penalties im-
posed by a timing anomaly may not be bounded by a globally fixed constant, but
may depend on the program size. In Figure 2.7 an example from [Lun02] is shown
that demonstrates that a cache miss may not be the global worst case.
Here, the target architecture is a simplified PowerPC with two integer units.
The first integer unit (IU) can execute additions and subtractions in one cycle,
while only the second one (MIU) can execute multiplication and division but
needs 4 cycles for such an instruction. The load store unit (LSU) executes the
load and store instructions. The architecture is supposed to have out-of-order is-
sue and execution. In the example program, instructions 2 and 3 are issued to
IU, while 4 and 5 are issued to MIU and instruction 1 to LSU. There exist RAW
Example program:
1 LD r4,0(r3)
2 ADD r5,r4,r4
3 ADD r11,r10,r10
4 MUL r12,r11,r11
5 MUL r13,r12,r12

         Cache hit                  Cache miss
Cycle    LSU   IU/Res   MIU/Res     LSU   IU/Res   MIU/Res
1        1                          1
2        1     /2                   1     /2
3              2/3                  1     3/2
4              3/       /4          1     /2       4/
5                       4/5         1     /2       4/5
6                       4/5         1     /2       4/5
7                       4/5         1     /2       4/5
8                       4/5         1     /2       5/
9                       5/          1     /2       5/
10                      5/          1     /2       5/
11                      5/                2/       5/
12                      5/

Figure 2.7: Timing anomaly example
dependencies between instructions 1,2, instructions 3,4 and 4,5. If the load in in-
struction 1 hits in the cache, it takes two cycles to complete, otherwise it takes 10
cycles. In the case of a cache hit, instruction 2 is scheduled to the unit in cycle 3,
while instruction 3 goes to the reservation station. Instruction 4 has to wait until
instruction 3 finishes, before it can begin execution in cycle 5 due to the RAW
dependency. Thus the program finishes after cycle 12 in this case. If the load is
a cache miss, then instruction 2 is not scheduled to IU, but instruction 3 is sched-
uled first in cycle 3, due to the RAW dependency between instructions 1 and 2.
In this case, instruction 4 begins execution one cycle earlier in cycle 4, because
instruction 3 is finished after cycle 3. Here, instruction 2 overlaps execution with
instruction 5 in the last cycle (11) and the whole program finishes one cycle earlier
than in the case of the cache hit. This example also shows that discovering timing
anomalies can be quite complex.
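The example can be replayed with a small cycle-by-cycle simulation. The sketch below is a strong simplification and rests on assumptions read off Figure 2.7: instruction i reaches its unit or reservation station in cycle i, an instruction starts as soon as its unit is free and its operands were produced in an earlier cycle, and the oldest ready instruction wins. The names `Op` and `simulate` are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class Op(NamedTuple):
    num: int               # position in program order (1-based)
    unit: str              # functional unit executing the instruction
    latency: int           # execution time in cycles
    deps: Tuple[int, ...]  # instructions whose results are read (RAW)


def simulate(load_latency: int) -> int:
    """Return the last cycle in which any instruction of the example executes."""
    program = [
        Op(1, "LSU", load_latency, ()),  # LD  r4,0(r3)
        Op(2, "IU", 1, (1,)),            # ADD r5,r4,r4
        Op(3, "IU", 1, ()),              # ADD r11,r10,r10
        Op(4, "MIU", 4, (3,)),           # MUL r12,r11,r11
        Op(5, "MIU", 4, (4,)),           # MUL r13,r12,r12
    ]
    finish: Dict[int, int] = {}        # instruction -> last cycle of execution
    unit_free_at: Dict[str, int] = {}  # unit -> first cycle in which it is free
    waiting = list(program)
    cycle = 0
    while waiting:
        cycle += 1
        for op in list(waiting):       # oldest instruction first
            dispatched = op.num <= cycle  # assumption: one dispatch per cycle
            operands_ready = all(d in finish and finish[d] < cycle
                                 for d in op.deps)
            unit_free = unit_free_at.get(op.unit, 1) <= cycle
            if dispatched and operands_ready and unit_free:
                finish[op.num] = cycle + op.latency - 1
                unit_free_at[op.unit] = cycle + op.latency
                waiting.remove(op)
    return max(finish.values())
```

Under these assumptions, `simulate(2)` (cache hit) ends in cycle 12 and `simulate(10)` (cache miss) ends in cycle 11, matching the figure: the locally slower load yields the globally shorter execution.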
Some sufficient conditions for the absence of anomalies have been presented in
[Eng02] for simple processor designs. However, most modern processors show timing
anomalies, even relatively simple ones (like the Motorola ColdFire 5307, which
we will present in Section 2.5). Even more disturbing is that it is very difficult to
prove the absence of timing anomalies. Interactions that cause these anomalies
can be quite complicated. In order to show that no anomalies can occur, a pre-
cise model of the processor must be made and analyzed. It is a topic of ongoing
research to utilize abstract interpretation and model checking techniques for this.
Even knowing one timing anomaly scenario for a processor doesn’t make it easier
to find all anomalies for the processor, as they may be caused by different interac-
tions. Furthermore, the interactions may also reach over the processor itself and
include the attached peripherals and memories.
We will present examples of timing anomalies for two processors in the two
sections describing the MCF 5307 and the PowerPC 755.
2.5 The Motorola ColdFire 5307
The Motorola ColdFire 5307 (or MCF 5307 for short) is a member of the version
3 ColdFire family from Motorola. The ColdFire family of processors is a successor
to the well-known 68k processor series from Motorola. The MCF 5307
understands most of the instructions of the 68030 processor but restricts the instruction
length to be either two, four, or six bytes (one to three words), compared
to the up to 24 bytes of a 68030 instruction. In addition, a few instructions are
not supported. But in general, most of the existing code for M68k designs can be
reused.
As a member of the V3 ColdFire family, the MCF 5307 shares the general
structure of that family, which is shown in Figure 2.8.
Internally the MCF 5307 has a hierarchical design of busses to access data.
The processor core, which features the pipeline and a MAC unit, is connected via
a pipelined bus, the so-called K-Bus, to the cache and an on-chip SRAM module
of 4kB. Accesses to the K-Bus are completed in two cycles. The first cycle puts
the address and attributes (read/write, length) on the bus and in the second cycle,
the memory returns the data. Thus, there can be two accesses being performed in
parallel on the K-Bus, one in the address phase and one in the data phase.
The MCF 5307 integrates a number of peripherals, e. g. serial ports, DMA
machinery, etc. Peripherals that can perform DMA are connected to the Master
Bus (or M-Bus for short). The M-Bus is connected to the K-Bus via a K2M
Controller. The other (slave) peripherals are connected to yet another bus, the
slave bus which, in turn, is connected to the M-Bus. Also, the external bus of the
chip is connected to the M-Bus via a System Bus Controller.
[Figure 2.8: The inner architecture of the MCF 5307. Blocks shown: CFV3 core with
K-Bus, cache controller with tag and data memories, KRAM and KROM with their
controllers, K2M, Master Bus with master module, Slave Bus with slave modules,
System Bus Controller, external bus.]
The processor core and the K-Bus run at a different clock speed than the M-
Bus and the external bus. This introduces alignment delays for external accesses,
since the processor core (or rather the K2M controller) has to wait until the rising
edge of the external clock before starting an access.
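The alignment delay can be illustrated by a minimal sketch. It assumes, purely for
illustration, that the core clock is an integer multiple of the external clock and
that rising edges of the external clock coincide with core cycles 0, ratio,
2*ratio, and so on. The function name and the parameter ratio are hypothetical and
not taken from the MCF 5307 documentation.

```python
def alignment_delay(core_cycle: int, ratio: int) -> int:
    """Core cycles to wait until the next rising edge of the external clock.

    Assumption (not from the source): external rising edges fall on core
    cycles that are multiples of `ratio` (core clock / external clock).
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    # Distance from core_cycle to the next multiple of ratio (0 if aligned).
    return (-core_cycle) % ratio
```

With a ratio of 2 (as for a 45/90 MHz bus/core setting), an access that becomes
ready in an odd core cycle would wait one extra core cycle under this assumption.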
2.5.1 The Pipeline of the ColdFire 5307
The pipeline of the MCF 5307 is depicted in Figure 2.9. It consists of a fetch
pipeline where instructions are fetched from memory (or the cache), and an exe-
cution pipeline where instructions are executed. Fetch and execution pipeline are
separated by a FIFO instruction buffer (IB) that can hold at most 8 instructions.
The IB holds complete instructions, which can be from two to six bytes long.
The figure reflects the pipelined nature of the K-Bus in that instruction
fetches are distributed over two stages (IC1 and IC2); data reads or writes are
also performed in the AGEX and DSOC stages of the execution pipeline.
[Figure 2.9: The pipeline of the MCF5307. Instruction Fetch Pipeline (IFP): IAG
(Instruction Address Generation), IC1 (Instruction Fetch Cycle 1), IC2 (Instruction
Fetch Cycle 2), IED (Instruction Early Decode), followed by the FIFO Instruction
Buffer (IB). Operand Execution Pipeline (OEP): DSOC (Decode & Select, Operand
Fetch), AGEX (Address Generation, Execute). Bus signals: Address [31:0], Data [31:0].]
The Fetch Pipeline
The instruction fetch pipeline of the MCF 5307 is used to generate the next fetch
address, to issue the fetch via the K-Bus (and possibly via the K2M controller and
further via the external bus) and to receive the fetched double words. Furthermore,
it is responsible for the reassembly of instructions from the double words (four
bytes) fetched.
It has 4 stages: IAG (Instruction Address Generation), IC1 (Instruction Fetch
Cycle 1), IC2 (Instruction Fetch Cycle 2), and IED (Instruction Early Decode).
The IED stage is responsible for performing branch prediction of conditional and
unconditional branches whose target addresses are known statically, i. e., from the
instruction alone.
The IAG stage generates the next fetch address. Since fetching is done in portions
of one double word (four bytes), normally the next generated address is $a + 4$
if a is the address currently generated by IAG. In the event that fetching has to be
redirected, which can be done either by the IED stage due to branch prediction or
by the AGEX stage because it discovered a misprediction or a computed call, the
currently generated address is changed to the one given by either IED or AGEX.
The IC1 stage issues a fetch request via the K-Bus by placing the address
generated by IAG in the previous cycle on the bus together with the appropriate
attributes (instruction fetch, length). If the IC1 stage cannot issue the fetch be-
cause the bus is busy, it has to stall. In this case, it signals a stall to the IAG stage
so that that stage does not generate a new address but rather keeps the current one.
The IC2 stage receives the fetched double word from the K-Bus. If it has to
wait because the K-Bus cannot serve the data (external access and/or cache miss),
it signals a stall to the IC1 stage (which in turn signals a stall to IAG). If the IC2
stage received the data, it forwards it to the IED stage.
The IED stage is responsible for the decoding and reassembly of instructions
from the fetched double words. For this, it contains a small reassembly buffer of
eight bytes. When an instruction has been reassembled, it is checked if that in-
struction is a branch or call that is not computed (i. e. the target address is known
statically). If this is the case, the target of the instruction is predicted. For condi-
tional branches, the target is predicted to be the target address of the taken branch,
if the branch goes backward, i. e. the branch target address is smaller than the ad-
dress of the branch itself. Then, the IED signals the IAG that it should generate
the target address of the branch next. By this, fetching is redirected to the target
of the branch. Otherwise, the branch is predicted to be not taken and no change
of the fetch address is generated. If a change of the fetch address is signaled, also
a signal is sent to the IC1 and IC2 stages to inform them to discard their contents.
The remaining decode buffer in the IAG is cleared.
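The static prediction rule described above can be sketched as follows. This is a
minimal illustration of the rule as stated in the text (backward conditional
branches are predicted taken, forward ones not taken, statically known
unconditional targets are always followed); the function and parameter names are
hypothetical.

```python
def predict_taken(branch_addr: int, target_addr: int,
                  conditional: bool = True) -> bool:
    """Static prediction made by the IED stage for a branch with a
    statically known target address."""
    if not conditional:
        # Unconditional branch or call with static target: fetch is redirected.
        return True
    # Conditional branch: predicted taken only if it goes backward.
    return target_addr < branch_addr


def next_fetch_address(current: int, redirect_target=None) -> int:
    """Address generated by IAG: the redirect target if one is signaled,
    otherwise the next double word (current + 4)."""
    return redirect_target if redirect_target is not None else current + 4
```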
If the instruction being reassembled is a computed branch or a return instruc-
tion (RTS), then fetching is halted. The IAG, IC1 and IC2 stages are signaled to
clear their contents.
The reassembled instruction is put into the IB and at the same time forwarded
to the execution pipeline if the IB is empty.
In the case that either the local reassembly buffer cannot accept the next double
word from IC2 or the IB is full, the IED stalls and emits a stall signal to the IC2
stage (which in turn signals a stall to IC1, etc).
Because the IB is rather large (eight complete instructions) and the IED per-
forms static branch prediction, there may be a lot of redirections in the fetch flow,
which may be canceled by the AGEX stage later (mispredictions). It is thus rather
difficult to precisely analyse which memory blocks are accessed by the fetch en-
gine.
The Execution Pipeline
The execution pipeline consists of only two stages: Decode Select and Operand
fetCh (DSOC) and Address Generation and EXecution (AGEX). The DSOC fin-
ishes the decoding of the instruction and fetches the register operands. The AGEX
stage executes the instruction. For certain instructions, two iterations of the
instruction are required through the two stages. For instructions that perform a
memory read, the AGEX stage on the first iteration generates the read address and
sends it via the K-Bus. On the second iteration, the DSOC stage reads the result
and the AGEX stage performs the operation (and issues the write address for a
read-write kind of instruction).
By this, instructions can overlap by at most one cycle in the execution pipeline.
A special case is the MOVEM instruction, which performs up to 15 read or write
accesses to/from the register file. It iterates in the execution pipeline as long as
there is a remaining register to be transferred.
As a side condition, a write access must be separated by at least three processor
cycles from the previous write access. This does not apply to MOVEM transfer
instructions.
The time it takes the instructions to execute in AGEX varies from instruction
to instruction. So the AGEX stage can stall the DSOC stage.
The DSOC stage takes the first instruction from IB. If DSOC is stalled, it can
happen that the IB fills completely with prefetched instructions. In this case, the
IB signals this fact to the IED stage. Also, multiple iterations of one instruction
through the execution pipeline do not allow a subsequent instruction to be inserted
between iterations of this instruction. The next instruction can only enter DSOC
if the instruction has entered AGEX in its final iteration.
Data accesses by AGEX and DSOC must use the same K-Bus as the instruc-
tion fetches from IC1 and IC2. Data accesses take precedence in the case of
simultaneous accesses, i. e. the instruction fetch cannot be issued by IC1.
2.5.2 The Cache of the ColdFire 5307
The MCF 5307 features a unified instruction and data cache. The IC1 stage per-
forms instruction accesses, the AGEX stage initiates data reads and writes. Due
to the fact that instructions and data have to share the same cache, instruction
prefetches may throw out data entries in the cache, leading to increased data ac-
cess misses. Because the prefetching includes branch prediction along conditional
branches there may be a lot of accesses to the cache, whose addresses are not easy
to bound statically.
The cache of the MCF 5307 is 8kB in size, having 128 sets with four lines
of 16 bytes in each set. Thus, upon an access the lower 4 bits are the index into
the line of the entity being accessed (byte, word, double word), the next 7 bits are
the index into the cache giving the set number and the remaining 21 bits are used
as tags to search for line hits in the selected set. In addition there is one global
counter of 2 bits that gives the index of the line in a set that will be replaced on
the next cache miss.
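The address decomposition just described (4 offset bits, 7 set-index bits, 21 tag
bits) can be written down directly. The following is a small sketch; the function
name and constant names are chosen here for illustration.

```python
OFFSET_BITS = 4   # 16-byte lines
SET_BITS = 7      # 128 sets


def split_address(addr: int):
    """Split a 32-bit address into (tag, set_index, offset)."""
    addr &= 0xFFFFFFFF
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_index = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)   # remaining 21 bits
    return tag, set_index, offset
```

For example, address 0x0810 has tag 1, set index 1 and offset 0, so it competes
with address 0x0010 for the lines of set 1.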
If the accessed data is found in the cache (cache hit) then this counter is not
updated and only the data in the line is returned via the K-Bus. If the data is not
found in the cache, then it has to be loaded into the cache. Thus, an external access
is started via the M-Bus and the external bus. Then the addressed set is searched
for an invalid line (one containing no data). If one is found, the data read is placed
into that line and also returned over the K-Bus. If all lines contain valid data, the
line that is indexed by the global counter is replaced from the cache and the new
data is put into that line. Then the global counter is incremented by one (wrapping
around to 0 if it indexed line 3 before). This replacement strategy is called Pseudo
Round-Robin.
Thus, accesses to different sets are not independent of one another as would
be the case for other replacement strategies. This leads to strange cases, where
data can stay in the cache very long.
Set    Way 0    Way 1    Way 2    Way 3
0      0x0000   0x0800   0x1000   0x1800
1      0x0010   0x0810   0x1010   0x1810
...    ...      ...      ...      ...
126    0x07E0   0x0FE0   0x17E0   0x1FE0
127    0x07F0   0x0FF0   0x17F0   0x1FF0
Table 2.2: Cache contents after 512 accesses
Consider for example that the cache is empty at a given point and the replace-
ment counter is 0 (which is the case after a complete cache invalidation). Then we
access data linearly from address 0 on, where 0 is mapped to set 0. The memory
block at 0 is brought into set 0, way 0. The next block at address 16 is brought
into set 1, way 0, etc. After 128 accesses the first way of all sets is filled. Access
number 129 at address $0 + 128 \cdot 16$ (0x0800) is mapped to set 0. Because there is an invalid
line in this set, the block is put there, i. e. into set 0, way 1. This continues until
we have done 512 accesses, giving the cache contents as in Table 2.2, where the
addresses are given in hexadecimal notation.
After the 512th access, the replacement counter is still 0, having not been
updated during all those accesses. Now access 513 is mapped again to set 0. But
now we do not have an invalid line in set 0, so we must replace the line indexed
by the counter, i. e. line 0. Thus, the block at address 0x2000 is put into set 0, way
0, replacing the block at address 0x0000. The counter is incremented by one, now
indexing way 1. Access number 514 goes to set 1 and also does not find an invalid
line to be placed into. Thus, the line indexed by the counter is replaced, which
is line 1. This continues for the next 126 accesses, giving the cache contents of
Table 2.3, where the blocks having replaced other blocks are marked in boldface
(shown as a trailing *).
Set    Way 0     Way 1     Way 2     Way 3
0      0x2000*   0x0800    0x1000    0x1800
1      0x0010    0x2010*   0x1010    0x1810
...    ...       ...       ...       ...
126    0x07E0    0x0FE0    0x27E0*   0x1FE0
127    0x07F0    0x0FF0    0x17F0    0x27F0*
Table 2.3: Cache contents after 640 accesses
Since the replacement counter was incremented 128 times and was zero (on
access number 513), it is again zero after access number 640. So access number
641 to address 0x2800 again replaces the line in way 0 of set 0, as did access
number 513. From here on, this scheme continues and only the ways with elements
in boldface in Table 2.3 are ever replaced. The data in the other ways stays in
the cache forever. This shows that under certain access patterns only one fourth
of the cache is ever being used for caching recent data. It has been observed in
[HT01, HLTW03, FHL+01] that indeed the analysis that can be performed models
only one fourth of the cache contents, i. e. effectively a direct mapped cache of
size 2kB.
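The linear access example can be replayed with a small simulation of the
replacement strategy as described in this section. This is a sketch under the
stated cache parameters (128 sets, 4 ways, 16-byte lines, one global 2-bit
counter); the function name simulate and the representation of invalid lines as
None are choices made here for illustration.

```python
NUM_SETS, NUM_WAYS, LINE_SIZE = 128, 4, 16


def simulate(addresses, cache=None, counter=0):
    """Run accesses through a pseudo round-robin cache.

    cache[set][way] holds the line address, or None for an invalid line.
    Returns (cache, counter, misses).
    """
    if cache is None:
        cache = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
    misses = 0
    for addr in addresses:
        block = addr - addr % LINE_SIZE
        ways = cache[(block // LINE_SIZE) % NUM_SETS]
        if block in ways:
            continue                         # hit: counter is not updated
        misses += 1
        if None in ways:
            ways[ways.index(None)] = block   # fill the first invalid line
        else:
            ways[counter] = block            # replace line indexed by counter
            counter = (counter + 1) % NUM_WAYS
    return cache, counter, misses


# 640 linear accesses starting at address 0, one per cache line.
cache, counter, misses = simulate(range(0, 640 * LINE_SIZE, LINE_SIZE))
```

Inspecting cache[0] and cache[1] afterwards reproduces the first two rows of
Table 2.3, and the counter is back at 0, so every further linear access again
replaces the same way of its set.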
The cache of the MCF 5307 also has the special feature of wrap-around line
fills, i. e. first the double word accessed in the line with a cache-miss is read
over the external bus, followed by the remaining three words (wrapping around to
the first word of the line after reading the last word of the line). As soon as the
first word is read, it is delivered back to IC2/DSOC and the line fill continues in
the background. During such a background line fill completion, the cache is able
to serve accesses that do not reference the same line (hit under miss).
Things are further complicated because instruction fetches can cross cache
line boundaries, necessitating two cache line accesses. In fact, the access logic
splits all unaligned accesses into up to three aligned accesses on the K-Bus.
The cache semantics (cf. Section 2.1) for the ColdFire cache is complicated
by the fact that sets in the cache are not independent. Updating one set changes
the ages in the other sets, as there is only one global replacement counter. The
age is in this case defined as follows: the line that is pointed to by the replacement
counter has age 3. The next line has age 2 and so on, wrapping around after line 3
to line 0.
Cache hits do not change the replacement counter, and neither do cache misses if
there is an invalid line in the set referenced by the access. All other misses replace
the line pointed to by the replacement counter and increase the counter by one
(modulo 4).
The set of values of the replacement counter is $R = \{0, 1, 2, 3\}$. We model the
cache c as a pair of replacement counter and a set mapping. A set mapping maps
a set index to a line mapping s, which maps the line index to the memory block
contained in that line or $\bot$ if the line is invalid:
$$c \in C = R \times (\{0, \ldots, 127\} \to (\{0, \ldots, 3\} \to M_\bot))$$
$$s \in S = \{0, \ldots, 3\} \to M_\bot$$
The function $\mathit{loc} : C \times M \to \{0, 1, 2, 3, \bot\}$ maps a memory block to the line it
is contained in in the cache, or $\bot$ if it is not in the cache:
$$\mathit{loc}((r, d), m) = \begin{cases} l, & \text{if } \exists l : m = d(\mathit{set}(m))(l) \\ \bot, & \text{otherwise} \end{cases}$$
The update of the cache by an access to memory block m is described by the
function $U_c : C \times M \to C$:
$$U_c((r, d), m) = \begin{cases} (r, d), & \text{if } \mathit{loc}((r, d), m) \neq \bot \\ (r', d[\mathit{set}(m) \mapsto s']), & \text{otherwise} \end{cases}$$
where $(r', s') = U_s(d(\mathit{set}(m)), r, m)$.
The function that updates one set and the replacement counter is
$U_s : S \times R \times M \to R \times S$:
$$U_s(s, r, m) = \begin{cases} (r, s[l \mapsto m]), & \text{if } \exists l' : s(l') = \bot,\ l \text{ smallest such } l' \\ ((r + 1) \bmod 4, s[r \mapsto m]), & \text{otherwise} \end{cases}$$
To classify a memory access as a cache hit H or cache miss M, the function
$\mathit{classify} : C \times M \to \{H, M\}$ is defined:
$$\mathit{classify}(c, m) = \begin{cases} H, & \text{if } \mathit{loc}(c, m) \neq \bot \\ M, & \text{otherwise} \end{cases}$$
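The cache semantics defined above can be transcribed almost literally into a
purely functional sketch. The names update_set and update_cache stand for the set
and cache update functions, None stands for the invalid-line symbol, and set_of is
an assumption made here: it takes the set index from address bits 4 to 10 as
described in Section 2.5.2, whereas the thesis refers to the definition of set in
Section 2.1.

```python
from typing import Optional, Tuple

NUM_SETS, NUM_WAYS, LINE_SIZE = 128, 4, 16
LineMap = Tuple[Optional[int], ...]        # s: line index -> block or None
Cache = Tuple[int, Tuple[LineMap, ...]]    # c = (r, d)


def set_of(m: int) -> int:
    """Set index of memory block m (assumed: m is a line-aligned address)."""
    return (m // LINE_SIZE) % NUM_SETS


def loc(c: Cache, m: int) -> Optional[int]:
    """Line of the set that holds m, or None if m is not cached."""
    _, d = c
    s = d[set_of(m)]
    return s.index(m) if m in s else None


def update_set(s: LineMap, r: int, m: int) -> Tuple[int, LineMap]:
    """Set update: fill the smallest invalid line, else replace line r."""
    if None in s:
        l = s.index(None)
        return r, s[:l] + (m,) + s[l + 1:]
    return (r + 1) % NUM_WAYS, s[:r] + (m,) + s[r + 1:]


def update_cache(c: Cache, m: int) -> Cache:
    """Cache update: unchanged on a hit, otherwise update set and counter."""
    if loc(c, m) is not None:
        return c
    r, d = c
    i = set_of(m)
    r2, s2 = update_set(d[i], r, m)
    return r2, d[:i] + (s2,) + d[i + 1:]


def classify(c: Cache, m: int) -> str:
    """'H' for a cache hit, 'M' for a cache miss."""
    return "H" if loc(c, m) is not None else "M"


EMPTY: Cache = (0, tuple((None,) * NUM_WAYS for _ in range(NUM_SETS)))
```

Because a miss in one set may advance the shared counter r, update_cache on one
set changes which line a later miss in any other set will replace; this is the
dependence between sets discussed above.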
2.5.3 System Configuration
The MCF 5307 allows the internal SRAM area to be mapped at different addresses,
controlled by a system register. Also, the memory area that the registers of the
integrated peripherals are mapped to can be chosen by setting an internal register
with the movec instruction.
Furthermore, two access registers and a cache control register are used to con-
figure access properties for memory areas (i. e. up to three memory areas with
different properties can be configured). Configurable properties include caching
mode (uncached, cached write-through, cached write-back) and the use of an in-
ternal burst buffer to accelerate memory reaccesses.
Furthermore, memory controllers are built into the MCF 5307 for DRAM or
SDRAM and asynchronous access using up to 8 chip-select signals. The memory
areas corresponding to these external accesses are configurable by registers in the
peripheral register area.
2.5.4 Assumptions Made
Naturally, it is not possible or necessary to model all features of a processor for an
analysis. By placing certain restrictions on the code to be analyzed, the analyzer
can be simplified and made more efficient.
Mostly, the features that can cause problems are not used in the area of real-
time programming anyhow, while others are easy to avoid. For our analyzer,
we made a number of restrictions on the underlying hardware of the system, the
programming of system control registers and the behavior of programs.
Underlying Hardware
We assume that the system only uses SRAM or/and EEPROM for main memory
and that all access times over the external bus are fixed, either to SRAM or to
peripherals or to EEPROM (Flash). No other bus masters must use the external
bus and the internal DMA engines of the MCF 5307 must not be used. Thus,
the timing parameters for memory accesses are determined by the address of the
access alone. Thus, bus contention or variable access times need not be modeled.
System Configuration
We assume that the program (or rather, a part of the program whose timing is of in-
terest) does not change any settings in system registers. That is, the memory area
configuration is once initialized at system startup (whose timing is not analyzed)
and then never changed. We assume that all cacheable memory areas are config-
ured for write-through mode and that the burst buffer is not used on uncacheable
memory areas. This allows us to ignore the side effects of setting system regis-
ters in the model. Write-back mode is impossible to analyze precisely due to the
restricted form of cache analysis possible for the MCF 5307.
The cache is only modified by memory accesses made by the program (code
or data) and not manipulated by the special cpushl instruction. Cache locking
(an option offered by the MCF 5307) is not used. The effects of the cpushl
instruction are difficult to model precisely because the cache analysis cannot find
guaranteed cache misses. The possible memory accesses by this instruction would
make the results very imprecise. Cache locking would lead to imprecision for the
same reasons.
Programming
We assume that the program runs without causing exceptions and with interrupts
turned off. Instructions that manipulate CPU internal system registers (movec)
must not be used. Furthermore, instructions that cause special behavior, e. g. the
stop instruction must not be used (this instruction halts program execution until
an interrupt occurs). Other special instructions are “move to SR”, pulse, trap,
wddebug and halt. Disallowing these instructions enables us to simplify the
model considerably, since we do not have to model their complex behavior.
Furthermore, no self-modifying code is allowed and no dynamic allocation (at
the level of the OS). The code in the system must conform to the ABI (see
footnote 2) set forth by Motorola for the ColdFire family (e. g. stack pointers must
resume the same value after a call to a function that they had before the call).
This allows us to ignore data accesses w.r.t. the form of the control-flow graph of
the program.
For memory areas mapped to read-only devices, no write accesses may occur.
If a memory has alignment constraints for accesses, the program must not access
it misaligned (e. g. some control register must not be accessed at odd memory
addresses or can only be read by a word-sized transfer).
For data accesses, data must only be accessed naturally aligned in main mem-
ory. E.g. an access of a 16bit word must only occur at an even address.
Thus, we can ignore the special behavior caused by misaligned accesses in the
modeling process.
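The natural alignment requirement can be stated as a one-line check. The sketch
below assumes access sizes of 1, 2 or 4 bytes (byte, word, double word); the
function name is hypothetical.

```python
def is_naturally_aligned(addr: int, size: int) -> bool:
    """True if an access of `size` bytes at `addr` is naturally aligned."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    return addr % size == 0
```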
2.5.5 Timing Anomalies with the MCF 5307
Even though the MCF 5307 pipeline is a rather simple in-order architecture, it
shows timing anomalies. One anomaly stems from the replacement strategy of
the cache, which falsifies one commonly made local worst-case assumption: on
the MCF 5307 an empty cache is not the worst-case for program execution.
Another anomaly originates in the fact that fetch and execution pipeline are
independent (only coupled by the IB and prediction resolution) but use the same
unified cache.
Footnote 2: Application Binary Interface: defines e. g. the layout of stack frames and
calling conventions of procedures.
Empty Cache not Worst-Case
The assumption that an empty cache constitutes the worst possible cache w.r.t. ex-
ecution time behavior (as there cannot be any cache hits from this cache contents),
is not true for the MCF 5307.
Due to the dependence of the cache sets of the MCF 5307 on the 2-bit replace-
ment counter, some cache configurations and access patterns can lead to the effect
that newly loaded blocks throw out blocks loaded into a set on the last access to
that set, cf. Table 2.3. If the cache is empty, the first four accesses to a set will
place new data into the set, and replacement can only happen on the fifth access
to that set.
Table 2.4 shows an access sequence of 8 accesses to sets 0 ... 3 of the cache.
The accesses go to addresses 0x0, 0x10, 0x20, 0x30, 0x800, 0x810, 0x820, 0x830,
so sets 0 ... 3 each are accessed twice by this sequence.
               Empty Cache                       Filled Cache
Set    Line 0  Line 1  Line 2  Line 3    Line 0  Line 1  Line 2  Line 3
After 4 accesses
0      0       I       I       I         0       1800    2000    2800
1      10      I       I       I         1010    10      2010    2810
2      20      I       I       I         1020    1820    20      2820
3      30      I       I       I         1030    1830    2030    30
After 8 accesses
0      0       800     I       I         800     1800    2000    2800
1      10      810     I       I         1010    810     2010    2810
2      20      820     I       I         1020    1820    820     2820
3      30      830     I       I         1030    1830    2030    830
Table 2.4: Timing anomaly MCF 5307: Empty cache not worst-case (first 8 accesses)
The table shows cache misses as boldface and cache hits in a set in italics.
Assume that the non-empty cache is filled with memory blocks 0x1000, 0x1800,
0x2000, 0x2800 for set 0 and blocks 0x1010, 0x1810, 0x2010, 0x2810 for set 1,
etc, and the replacement counter is 0 before the first access.
For the empty start cache, after 8 accesses the eight memory blocks are in
ways 0 and 1 of sets 0 ... 3, causing 8 cache misses. For the other cache, after four
accesses, the first four memory blocks are in the sets, and the replacement counter
is again 0. So the next four accesses to blocks 0x800,. . . replace the blocks just
loaded by the previous accesses. After eight accesses, only the last four blocks of
the sequence are in the cache and there have been eight cache misses.
               Empty Cache                       Filled Cache
Set    Line 0  Line 1  Line 2  Line 3    Line 0  Line 1  Line 2  Line 3
After 12 accesses
0      0       800     I       I         0       1800    2000    2800
1      10      810     I       I         1010    10      2010    2810
2      20      820     I       I         1020    1820    20      2820
3      30      830     I       I         1030    1830    2030    30
After 16 accesses
0      0       800     I       I         800     1800    2000    2800
1      10      810     I       I         1010    810     2010    2810
2      20      820     I       I         1020    1820    820     2820
3      30      830     I       I         1030    1830    2030    830
Table 2.5: Timing anomaly MCF 5307: Empty cache not worst-case (16 accesses)
Now assume that this access sequence is repeated, which gives the results
shown in Table 2.5. For the empty start cache, all accesses are cache hits, since
the blocks have been previously loaded into the cache. For the filled cache, every
access is a cache miss, because the subsequences of four accesses throw out the
blocks needed by the next sequence of 4 accesses. After 16 accesses, the empty
cache has led to 8 cache misses to load the 8 blocks and 8 cache hits due to
block reuse. The filled cache has led to 16 cache misses. In general, for $8n$
accesses with this sequence, the empty cache leads to 8 cache misses and $8(n - 1)$
cache hits, while the other cache results in $8n$ cache misses. As the instructions
performing the accesses and the loop counter increment and test also take time to
execute, n iterations of the loop take $8M + 8(n - 1)H + nO$ cycles (empty cache)
and $8nM + nO$ cycles (filled cache), where M is the miss penalty, H the access
time for a cache hit and O the remaining overhead of the instructions in the loop.
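The miss counts and the cycle formulas can be checked with a short simulation of
the access sequence. This is a sketch: count_misses re-implements the pseudo
round-robin replacement described in Section 2.5.2, FILLED encodes the filled
start cache given in the text, and the parameters M, H and O of cycles are
placeholders whose concrete values are not taken from the source.

```python
LINE_SIZE, NUM_SETS, NUM_WAYS = 16, 128, 4
SEQUENCE = [0x0, 0x10, 0x20, 0x30, 0x800, 0x810, 0x820, 0x830]


def count_misses(initial, n, counter=0):
    """Misses caused by n repetitions of SEQUENCE.

    `initial` maps a set index to its four line contents (None = invalid);
    sets that are not listed start empty.
    """
    cache = {s: list(ways) for s, ways in initial.items()}
    misses = 0
    for addr in SEQUENCE * n:
        index = (addr // LINE_SIZE) % NUM_SETS
        ways = cache.setdefault(index, [None] * NUM_WAYS)
        if addr in ways:
            continue                        # hit: counter unchanged
        misses += 1
        if None in ways:
            ways[ways.index(None)] = addr   # fill first invalid line
        else:
            ways[counter] = addr            # replace line indexed by counter
            counter = (counter + 1) % NUM_WAYS
    return misses


# Filled start cache: set k holds 0x1000+16k, 0x1800+16k, 0x2000+16k, 0x2800+16k.
FILLED = {k: [base + LINE_SIZE * k for base in (0x1000, 0x1800, 0x2000, 0x2800)]
          for k in range(4)}


def cycles(misses, n, M, H, O):
    """Loop time: misses * M + hits * H + n * O, with 8 accesses per iteration."""
    hits = 8 * n - misses
    return misses * M + hits * H + n * O
```

For any n, count_misses({}, n) stays at 8 while count_misses(FILLED, n) grows as
8n, so the difference between the two scenarios is $8(n - 1)(M - H)$ cycles.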
In Figure 2.10 an example program for this anomaly is depicted; the measurements
of the execution times are depicted in Figure 2.11 for the empty and filled
cache scenarios. The loop in the program is executed N times. The number of
cycles is measured for the code sequence starting from the label Loop until after
the final branch of the last iteration (see footnote 3). The behavior of a filled cache
is equivalent to no caching at all. Turning off the cache and running the example
program indeed gives the same timing results as for the filled cache.
        lea.l   0x30000,A2
        move.l  #N,D3
Loop:
        move.l  0x00(A2),D1
        move.l  0x10(A2),D1
        move.l  0x20(A2),D1
        move.l  0x30(A2),D1
        move.l  0x800(A2),D1
        move.l  0x810(A2),D1
        move.l  0x820(A2),D1
        move.l  0x830(A2),D1
        subq    #1,D3
        bne     Loop
Figure 2.10: Timing anomaly MCF 5307: Program
As an empty cache is often assumed to be the worst-case scenario for program
execution, this behavior is of substantial impact for WCET computation. Unfortu-
nately, it is not obvious which concrete cache will be the worst-case for a program
execution, as this depends on the memory access sequence of the program. Thus,
more abstract notions of “undetermined cache” must be used as descriptions of
cache contents at program start. The theory of abstract interpretation underlying
our WCET tool readily provides a solution for this problem.
[Figure 2.11: Timing anomaly MCF 5307: Measurements. Plot of cycles (0 to 90000)
over iterations (0 to 500) for the empty cache and the filled cache.]
Footnote 3: The measurements were made on an MCF 5307 Evaluation board running at
45/90MHz (bus/core) using the built-in timers of the MCF5307. The program code was
in the internal SRAM so it does not influence the cache.
Unified Cache and Prefetching
The combination of prefetching and a unified cache leads to a scenario where
a data access of one instruction that is a cache miss leads to a shorter global
execution time of a complete program sequence compared to a cache hit by that
instruction’s data access.
This happens because in the case of the data access being a cache hit, the
prefetching of the IFP can continue, resulting in a cache miss that replaces two
lines, because of an instruction crossing a cache line boundary. Assuming that
those two lines contain useful data, successive data accesses to those lines result
in two additional cache misses. If the branch was mispredicted, those two lines
are fetched without being used in program execution.
If the data access is a cache miss, then prefetching cannot continue, as the
memory bus is in use by the data access. After the data access finishes, the ex-
ecution pipeline can resolve the branch condition before the misprediction of the
branch has led to a fetch of the two cache lines. In this case, only one cache miss
occurs, compared to two misses in the first scenario.
2.6 The Motorola PowerPC 755
2.6.1 The PowerPC Architecture
The architecture of the PowerPC family of microprocessors covers 32-bit and 64-
bit variants. The 32-bit architecture is described in [PPC97b], which fixes the set
of required and optional features of an implementation. The PPC 755 is a 32-bit
PowerPC.
For our purposes the following elements of the architecture are important:
• A set of 32 32-bit General Purpose Registers (GPR)
• A set of 32 64-bit Floating Point Registers (FPR)
• A set of Special Purpose Registers (SPR) of (at most) 32-bit each
• Some special registers: the link register (LR), the count register (CTR), the
  condition code register (CCR)
• Some instructions provide instruction synchronization and/or context
  synchronization
• Memory access translation: with IBAT/DBAT registers.
2.6.2 The PowerPC 755
The PPC 755 is an improved variant of the PPC 750, which differs from the lat-
ter CPU only by an enhanced L2-Cache interface, some additional SPRs, an ad-
vanced method to lock the instruction and/or data cache, support for software TLB
searches and a selectable reduced bus interface. Since we are assuming that most
of these features will not be used in the software considered, we can take the PPC
750 manual [PPC97a] as the basis of our description and model (see footnotes 4
and 5).
Footnote 4: We will sometimes directly refer to chapters in both the architecture and
PPC 750 manual in the form [PPC97a, Chapter] etc.
Footnote 5: The only PPC 755 features that are considered in the model are certain
cache locking configurations and instructions, like dcbf.
2.6.3 PPC 755 Pipeline
[Figure 2.12: The pipeline of the PowerPC 755. Units shown: Fetch Unit (FU), Branch
Processing Unit (BPU) with CTR and LR, Instruction Queue (IQ), Dispatch Unit (DU),
reservation stations, Integer Units (IU1, IU2), System Register Unit (SRU),
Load/Store Unit (LSU) with EA calculation and Store Queue, Floating-Point Unit (FPU)
with multiply, add and round stages, GPR and FPR files with six rename buffers each,
Completion Unit (CU) with reorder buffer, I-Cache, D-Cache, Bus Unit (BU), memory.]
The pipeline of the PowerPC 755 is depicted in Figure 2.12 and contains the
following units:
• The Fetch Unit (FU) fetches one to four instructions from memory (possibly
  via the instruction cache) and places them into the Instruction Queue (IQ).
  How many instructions are requested depends on the number of free slots in
  IQ: the requested number is the minimum of 4 and the number of free slots
  in the IQ.
• The Instruction Queue (IQ) holds instructions to be dispatched. It is
  organized as a FIFO with six entries.
• The Branch Processing Unit (BPU) takes care of predicting and performing
  branches. It reads the instructions as soon as they are in the IQ, removes
  taken (or predicted taken) branches, and redirects the instruction fetch
  accordingly. The BPU contains a branch target instruction cache and a branch
  history table to dynamically predict branches. Since we will not use this
  feature, these components are not shown in the figure.
• The Dispatch Unit (DU) dispatches instructions from IQ0 and IQ1 to the
  five functional units. Instructions are dispatched in-order, i.e. an instruction
  can only be dispatched from IQ1, if the instruction in IQ0 is also dispatched.
• The Completion Unit (CU) contains six entries that are assigned to
  dispatched instructions (except for some branch instructions handled directly
  by the BPU). The retirement of the instructions happens from the CU in
  order, i.e. an instruction can only retire if its predecessors are already retired.
  The PPC 755 can retire at most two instructions from the CU per cycle.
• The Integer Units (IU1/IU2) perform integer operations. IU1 can perform
  all operations, while IU2 cannot perform multiplications and divisions.
• The System Register Unit (SRU) handles operations on the SPRs, like e. g.
  mfspr.
• The Load/Store Unit (LSU) performs loads to GPRs/FPRs from memory
  and write operations from those registers to memory. It is pipelined with
  two stages: the first stage performs the calculation of the effective address
  of the operation and the second stage performs the memory access. Stores
  from the LSU are first gathered in a Store Buffer of three entries. This is not
  true for multiple store instructions like stmw which perform stores directly.
• The Floating-Point Unit (FPU) handles all floating-point operations and
  consists of a pipeline with three stages: the first stage performs
  multiplications and divisions. The second stage performs additions and the last
  stage rounds or converts the result.
• The GPR file and FPR file contain 32 registers each and in addition six
  integer rename registers and six floating-point rename registers.
• The Bus Unit (BU) is responsible for all accesses via the external bus. Data
  requested from the FU or the LSU (or written from the Store Queue) may
  also go through the instruction or data cache.
Instruction execution is handled in the following manner: first, instructions
are fetched by the FU, possibly via the I-Cache, from memory. The number of
instructions fetched depends on the number of free slots in the IQ. If in cycle n
there are m free slots in the IQ, then $\min(4, m)$ instructions are requested by the
fetcher. Even if in cycle n one to three slots may be freed by instruction dispatch
and branch processing, these slots are not counted in the number of free slots in
that cycle ([PPC97a, 6.3.1]). The instructions are put into the IQ in program order,
i.e. in address order.
Instructions in the lowest two entries in IQ (IQ0 and IQ1) can be dispatched
for execution. Dispatch happens in-order, so the instruction in IQ1 can only be
dispatched if the instruction in IQ0 (footnote 6) can be dispatched. Dispatching from IQ0 is
restricted by several rules ([PPC97a, 6.6.1.2]):
Z The execution unit needed by [IQ0] is available
Z Needed GPR and/or FPR rename buffers are available
Z An entry in the CU is available
Z No completion serialization instruction is being executed (see the kinds of serialization described below)
Dispatching from IQ1 can only happen, if
Z [IQ0] can be dispatched
Z [IQ0] is not a completion or refetch serialization instruction (see the kinds of serialization described below)
Z Execution unit needed by [IQ1] is available (with [IQ0] already dispatched)
Z GPR/FPR rename buffer available (after [IQ0] dispatched)
Z CU not full ([IQ0] already dispatched)
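The fetch and dispatch rules above can be illustrated by a small Python sketch. It is not the pipeline model of this thesis: the names `Instr`, `fetch_request`, `pick_unit` and `dispatch_count` are hypothetical, rename buffers are treated as one shared pool (the PPC 755 has separate GPR and FPR rename registers), and branches that take no CU entry are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    units: frozenset           # functional units able to execute the instruction
    renames: int = 1           # rename buffers needed (one shared pool: a simplification)
    serializing: bool = False  # completion or refetch serialization instruction


def fetch_request(free_iq_slots):
    """Number of instructions requested by the fetcher: min(4, m)."""
    return min(4, free_iq_slots)


def pick_unit(instr, free_units):
    """Pick a free unit for instr, or None; reverse name order prefers IU2 over IU1."""
    candidates = sorted(instr.units & free_units, reverse=True)
    return candidates[0] if candidates else None


def dispatch_count(iq0, iq1, free_units, free_renames, free_cu,
                   serialization_active=False):
    """Return how many of [IQ0], [IQ1] can be dispatched this cycle (0, 1 or 2)."""
    if iq0 is None or serialization_active:
        return 0
    unit0 = pick_unit(iq0, free_units)
    if unit0 is None or iq0.renames > free_renames or free_cu < 1:
        return 0
    if iq1 is None or iq0.serializing:
        return 1
    # Resources as they look once [IQ0] has been dispatched.
    units_left = free_units - {unit0}
    renames_left = free_renames - iq0.renames
    cu_left = free_cu - 1
    if (pick_unit(iq1, units_left) is None
            or iq1.renames > renames_left or cu_left < 1):
        return 1
    return 2


if __name__ == "__main__":
    add = Instr(frozenset({"IU1", "IU2"}))
    mul = Instr(frozenset({"IU1"}))
    free = {"IU1", "IU2", "LSU"}
    print(dispatch_count(add, mul, free, 6, 6))  # add goes to IU2, mul to IU1
    print(dispatch_count(mul, mul, free, 6, 6))  # second multiply finds IU1 taken
```

In this sketch the second instruction is always checked against the resources that remain after the first one has been dispatched, which is how the parenthesized conditions in the IQ1 rules are read here.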
For each instruction an entry in the CU is allocated in dispatch (i. e. program)
order. Certain instructions don’t take an entry in the CU and are removed from
the instruction flow by the BPU; these instructions are branch instructions that do
not update the LR or CTR.
Also, some branch instructions have additional requirements (cf. [PPC97a,
6.6.1.1]):
Z The bclr requires that the LR is available, i. e. there is no outstanding
computation writing the LR.
Z The bcctr requires that CTR is available.
Z Branch and link instructions require that a rename LR register is available.
Z The ‘branch conditional on ctr decrement’ instruction requires CTR avail-
ability (if the condition is not false).
Footnote 6: We write [IQn] to denote the instruction in IQn.
Z A branch conditional instruction cannot be executed following an unre-
solved branch. I.e. there can be at most one level of speculative execution.
Branches are also treated specially when they enter the IQ:
Z As soon as a branch is fetched, it is examined to decide whether it can be
folded or fallen-through (i. e. removed from the instruction stream).
Z Instructions are treated speculatively if the condition on which they depend
is not known. This is called an unresolved branch.
If a branch instruction that is unconditional and whose operands are available
enters the IQ, it is folded away if it is taken. The branch and all later instructions
in IQ are spilled and instruction fetch continues at the target of the branch. If the
branch is not taken, it remains in IQ until it reaches IQ0 or IQ1 and is then simply
discarded ([PPC97a, 6.4.1.1]).
Conditional branch instructions are treated like unconditional branches if the
condition is already known (i.e. the CR will not be modified by any instruction in
the CU or IQ). If this is not the case they are predicted according to their instruc-
tion encoding (cf. [PPC97b, 4.2.4.2]). If they are predicted as not taken, they will
fall-through. Otherwise, they will be folded. The processor will mark all subse-
quent instructions as speculative until the condition is resolved. If the condition
turns out to be different than predicted, all instructions after the branch are flushed
from CU and the functional units, and instruction fetch continues along the other
branch path.
When an instruction can be dispatched, it allocates the required rename registers
(footnote 7). After that, the instruction is assigned an entry in CU and it is dispatched
to the reservation stage of the functional unit required by the instruction (footnote 8). If the
operands required by that instruction are not available, it will stay in the reser-
vation stage until the instruction(s) that compute those operands are finished and
broadcast their results.
The LSU and FPU are pipelined, so two instructions can be in the LSU at the
same time or three instructions in the FPU. Some instructions cannot be pipelined
(e.g. certain FPU instructions like fdiv, cf. [PPC97a, 6.4.3]) and block the whole
unit until they are finished. Instructions already in the later stages of the FPU
continue their execution concurrently with a blocking instruction.
Footnote 7: For instructions that update CR, LR and CTR there is one rename register for each of those
registers.
Footnote 8: For integer instructions, one of IU1 and IU2 can be chosen if both are free and the instruction
does not perform multiplication or division. We assume that IU2 is chosen in this case, to keep IU1 free for
multiplications.
The LSU has two stages: EA calculation and access. If an instruction
in the LSU is speculative (after an unresolved branch), then it must not access
memory marked as ‘guarded’ in the DBAT register for that access. This means,
such an access will stall in the access stage until the branch is resolved (until the
speculative bit in the CU entry for that instruction is cleared). Stores are only
performed, when the instruction is retired from the CU, except for multiple store
instructions, like stmw, which are completion serialization instructions and thus
can never be executed speculatively. These instructions perform stores directly.
All other stores are kept in a three-entry store buffer at the LSU and are performed
out of that buffer when the instruction is retired from the CU. If the instruction must be flushed
due to a misprediction, the store is discarded from the store buffer.
We assume that a load reads data from the store buffer if it accesses the same
data as a previous store that has not yet been committed.
Completion of instructions can only happen from CU0 and CU1, restricted by
([PPC97a, 6.6.1.3]):
Z [CU0] must be finished
Z [CU0] must not follow an unresolved branch
Z [CU0] must not cause an exception
For [CU1], similar rules apply:
Z [CU0] must complete in the same cycle
Z [CU1] must be finished and must not follow an unresolved branch
Z [CU1] must not cause an exception
Z [CU1] must be an integer or load instruction.
Z there must not be more than two CR updates from both [CU0] and [CU1]
Z there must not be more than two GPR updates (footnote 9) for [CU0] and [CU1].
Z no more than two FPR updates may occur
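The completion rules can be sketched in the same style. This is a minimal illustration under assumptions, not the model used in the analysis: `CUEntry`, `can_retire` and `retire_count` are hypothetical names, and each entry simply records how many CR, GPR and FPR updates its instruction performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CUEntry:
    finished: bool = True
    speculative: bool = False   # follows an unresolved branch
    exception: bool = False
    int_or_load: bool = True    # integer or load instruction
    cr_updates: int = 0
    gpr_updates: int = 0
    fpr_updates: int = 0


def can_retire(entry):
    """Conditions every retiring instruction has to satisfy."""
    return (entry is not None and entry.finished
            and not entry.speculative and not entry.exception)


def retire_count(cu0, cu1):
    """Return how many of [CU0], [CU1] retire in this cycle (0, 1 or 2)."""
    if not can_retire(cu0):
        return 0
    if not can_retire(cu1) or not cu1.int_or_load:
        return 1
    # Joint limits on register updates for the two retiring instructions.
    if (cu0.cr_updates + cu1.cr_updates > 2
            or cu0.gpr_updates + cu1.gpr_updates > 2
            or cu0.fpr_updates + cu1.fpr_updates > 2):
        return 1
    return 2


if __name__ == "__main__":
    load_with_update = CUEntry(gpr_updates=2)   # e.g. lwzu: loaded value plus base register
    addition = CUEntry(gpr_updates=1)
    print(retire_count(load_with_update, addition))
    print(retire_count(addition, addition))
```

The first call reproduces the case from footnote 9: a load with update and an addition together need three GPR updates, so only the load retires in that cycle.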
Certain instructions require synchronization. Synchronization or serialization
can happen in three ways ([PPC97a, 6.3.3.2]):
Z Execution Serialization: These instructions are dispatched to a unit but do
not begin execution until all previously issued instructions have completed.
Example: mtspr.
Footnote 9: So we cannot complete a load with update instruction and an addition in the same cycle.
Z Completion Serialization: An instruction does not start execution until all
previously issued instructions have completed. In contrast to the execution
serialization, no instructions after this instruction may be issued to func-
tional units until it has completed execution. Example: lswi.
Z Refetch Serialization: Behaves like completion serialization but in addition,
all instructions in the IQ after this one are flushed and have to be refetched
when the instruction completes. Example: isync.
2.6.4 PPC 755 Caches
The PPC 755 has two separate caches for instructions and data. Internally, the data
paths to these caches are independent (Harvard architecture). Each cache has a
capacity of 32 kB and is eight-way set associative with a line size of 32 bytes (4
double words of 64 bits), thus each cache has 128 sets. The replacement strategy
of the caches is called pseudo least recently used (PLRU) by Motorola. Again, as
in the case of the pseudo round robin scheme of the ColdFire, the “pseudo” leads
to a loss in predictability. When accessing an address, instruction or data, in an
area that is marked as cacheable by the corresponding BAT registers, bits 27 to 31
select the byte in the line (footnote 10). Bits 20 to 26 select the set in the cache, and bits 0 to 19 are
compared against the tags of the (up to) eight lines in that set. Each line in the
set has a valid bit that is set if the line contains a data block. For the data cache,
another bit, the dirty bit, signals if the line has been written to and must be copied
to main memory on cache flush or replacement.
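As an illustration of this address decomposition, the following Python sketch splits a 32-bit address into tag, set index and byte offset. The function name `split_address` is hypothetical, and the 20-bit tag (bits 0 to 19 in Motorola numbering, i.e. the 20 most significant bits) is assumed from the cache geometry: 128 sets times 8 lines times 32 bytes gives 32 kB.

```python
LINE_BYTES = 32   # bytes per cache line
NUM_SETS = 128    # sets per cache
WAYS = 8          # lines per set


def split_address(addr):
    """Split a 32-bit address into (tag, set index, byte offset).

    Motorola numbers bits from the most significant end, so bits 27 to 31
    are the five least significant bits of the address.
    """
    if not 0 <= addr < 2 ** 32:
        raise ValueError("address must fit into 32 bits")
    byte = addr & 0x1F               # bits 27 to 31
    set_index = (addr >> 5) & 0x7F   # bits 20 to 26
    tag = addr >> 12                 # bits 0 to 19 (assumed tag width)
    return tag, set_index, byte


if __name__ == "__main__":
    print(split_address(0x1234))
    print(NUM_SETS * WAYS * LINE_BYTES)   # cache capacity in bytes
```

Addresses that differ by a multiple of 0x1000 (4096 bytes) fall into the same set, which is why the example blocks 0x0000, 0x1000, 0x2000 and so on all compete for the lines of one set.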
If a cache hit occurs, the data is delivered in the following cycle. For a cache
miss, new data is loaded into the set. If there is an invalid line in the set, the newly
read line is placed in that line. If there are multiple invalid lines in the set, the one
with the lowest index (0, ..., 7) is chosen. If there is no invalid line, i. e. the set is
completely full, one line is chosen to be replaced. A replacement in the data cache
may invoke a write-back of the line replaced, if it is dirty. Which line to choose
for replacement is governed by a logic that uses 7 state bits. Figure 2.13 depicts
the behavior of this selection logic.
In this figure, the 8 lines in a set are at the bottom and the seven state bits are
represented by the circles. Starting with bit 0 at the top, a path to a line is chosen
according to the state, 0 or 1, of each bit. E.g. if the seven bits are (with bit 0
at the left) 0110011 then line 2 is chosen. For each set, there is a separate set of
these seven bits, so unlike the ColdFire cache, set accesses are independent of one
another.
Footnote 10: Motorola counts the bits in a 32 bit word for the PowerPC in such a way that bit 0 is the most
significant bit with value 2^31 and bit 31 the least significant with value 2^0. Likewise for 64 bit
double words.
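[Figure: binary tree with state bit 0 at the root, bits 1 and 2 in the second layer, bits 3 to 6 in the third layer, and lines 0 to 7 as leaves; each edge is labeled with the bit value, 0 or 1, that selects it.]
Figure 2.13: PowerPC 755 PLRU replacement selection logic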
After a hit or a miss in a cache set, the state bits for that set are updated. This
update depends on the line in the set that has been accessed. Intuitively, the bits in
all three layers are changed so that the next line selected by the new setting points
away as far as possible from the old line. This means that every bit on the path
from bit 0 to the accessed line is set so that it points away from that line (a bit
that already points away keeps its value). Table 2.6 gives the update function on
the state bits. It denotes for the accessed line the new settings of the state bits,
where “U” means that the corresponding bit is unchanged.
        New State Bits
Line    0  1  2  3  4  5  6
0       1  1  U  1  U  U  U
1       1  1  U  0  U  U  U
2       1  0  U  U  1  U  U
3       1  0  U  U  0  U  U
4       0  U  1  U  U  1  U
5       0  U  1  U  U  0  U
6       0  U  0  U  U  U  1
7       0  U  0  U  U  U  0
Table 2.6: PPC 755 state bits update
This replacement scheme also has a pathological access sequence that lets some
part of the cache contents survive although it is not reused. One sequence that
shows this behavior is an alternation of accesses that hit in line 0 of the set and
then miss and replace in the line pointed to by the state bits. Assuming that all state
bits are zero initially and that valid data is in all lines, Table 2.7 gives an access
sequence. Here, the first column gives the running number of the sequence, the
second shows the line that is accessed, which is either line 0 for a cache hit (odd
numbered accesses) or the line pointed to by the state bits for a cache miss and
replacement. The third column presents the state bits after the access, the last
column gives the line pointed to by the state bits from column three. The even
numbered accesses (in bold) are misses that replace data in the lines pointed to by
the state bits from the preceding row in the table.
# Accessed State bits Pointer
1 0 1101000 4
2 4 0111010 2
3 0 1111010 6
4 6 0101011 2
5 0 1101011 5
6 5 0111001 2
7 0 1111001 7
8 7 0101000 2
9 0 1101000 4
Table 2.7: PPC 755 bad replacement sequence
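The selection logic of Figure 2.13 and the update of Table 2.6 are small enough to be replayed mechanically. The following Python sketch is an illustration only (the names `UPDATE`, `line`, `up` and `replay` are hypothetical, and state bits are kept as strings with bit 0 on the left); it regenerates the rows of Table 2.7 from the all-zero state.

```python
# Table 2.6: for each accessed line, the state bits that change and their new values.
UPDATE = {
    0: {0: 1, 1: 1, 3: 1},
    1: {0: 1, 1: 1, 3: 0},
    2: {0: 1, 1: 0, 4: 1},
    3: {0: 1, 1: 0, 4: 0},
    4: {0: 0, 2: 1, 5: 1},
    5: {0: 0, 2: 1, 5: 0},
    6: {0: 0, 2: 0, 6: 1},
    7: {0: 0, 2: 0, 6: 0},
}


def line(bits):
    """Line selected by the state bits (string of seven '0'/'1', bit 0 first)."""
    top = int(bits[0])                    # bit 0 chooses lines 0-3 or 4-7
    mid = int(bits[1 + top])              # bit 1 or bit 2 chooses a pair of lines
    low = int(bits[3 + 2 * top + mid])    # one of bits 3 to 6 chooses the line
    return 4 * top + 2 * mid + low


def up(bits, accessed):
    """State bits after an access to the given line; other bits are unchanged."""
    new = list(bits)
    for index, value in UPDATE[accessed].items():
        new[index] = str(value)
    return "".join(new)


def replay(n_accesses, bits="0000000"):
    """Alternate hits in line 0 (odd accesses) with misses replacing line(bits)."""
    rows = []
    for number in range(1, n_accesses + 1):
        accessed = 0 if number % 2 == 1 else line(bits)
        bits = up(bits, accessed)
        rows.append((number, accessed, bits, line(bits)))
    return rows


if __name__ == "__main__":
    for row in replay(9):
        print(*row)
    print(line("0110011"))   # the example state from the text, selecting line 2
```

The misses in this replay only ever hit lines 4, 6, 5 and 7, and the state after access 9 equals the state after access 1, so the pattern repeats.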
This sequence ends with the same state bits as after the first access, so this
scheme of lines replaced will repeat itself. Table 2.8 shows the cache contents
of the set after each of the accesses. The access sequence starts at address 0 and
shows which memory blocks (in hexadecimal notation) are in the set after the
access. The lines affected by the access are in bold. Lines affected by previous
accesses are in italics. The access sequence is 0, 0x8000, 0, 0x9000, 0, 0xa000,. . .
Only lines 4, 6, 5 and 7 are ever replaced, while lines 1, 2 and 3 stay in the
cache (although untouched). This means that those three lines not in italics or
bold in the last row of Table 2.8 do not take part in the replacement scheme for
this access pattern.
Even with a sequence of accesses where each access address is distinct from
the others it may take up to 16 misses and a total of 21 accesses until all lines in
a cache set have been replaced, as exemplified in Table 2.9. There, the accesses
are in bold and new lines are denoted in italics. In this example, line 7 survives
until access 21. With a real LRU strategy a line would survive at most 15 unique
accesses.
However, if one looks at a sequence of accesses where each access results
in a cache miss, then for every setting of the state bits at the beginning of the
sequence, every line survives at most 7 accesses. That is, all lines are replaced
after 8 accesses.
The precise cache semantics is defined with a cache $c \in C$ as a (total) mapping
from set indices to a pair of line mapping and replacement bits. The line mapping
$s$ of a set $(s, r) \in S$ maps line indices to the memory block contained in that line, or to $\bot$ if the
line is invalid. With $M$ the set of memory blocks and $M_\bot = M \cup \{\bot\}$:
$$C = \{0, \ldots, 127\} \to \big( (\{0, \ldots, 7\} \to M_\bot) \times R \big)$$
$$S = (\{0, \ldots, 7\} \to M_\bot) \times R$$
$$R = \{0, \ldots, 127\}$$
                            Line
 #     0     1     2     3     4     5     6     7
 1   0000  4000  2000  6000  1000  5000  3000  7000
 2   0000  4000  2000  6000  8000  5000  3000  7000
 3   0000  4000  2000  6000  8000  5000  3000  7000
 4   0000  4000  2000  6000  8000  5000  9000  7000
 5   0000  4000  2000  6000  8000  5000  9000  7000
 6   0000  4000  2000  6000  8000  a000  9000  7000
 7   0000  4000  2000  6000  8000  a000  9000  7000
 8   0000  4000  2000  6000  8000  a000  9000  b000
 9   0000  4000  2000  6000  8000  a000  9000  b000
10   0000  4000  2000  6000  c000  a000  9000  b000
Table 2.8: PPC 755 cache set contents
The update of a cache $c$ after a memory access to block $m$ uses the location of
a memory block in a set (its line), which is given by two functions,
$\mathit{loc} : C \times M \to \{0, \ldots, 7, \bot\}$ and $\mathit{loc}_s : S \times M \to \{0, \ldots, 7, \bot\}$:
$$\mathit{loc}(c, m) = \mathit{loc}_s(c(\mathit{set}(m)), m)$$
$$\mathit{loc}_s((s, r), m) = \begin{cases} l & \text{if } \exists l : m = s(l) \\ \bot & \text{otherwise} \end{cases}$$
The result $\bot$ means that the memory block is not in the cache.
As the sets of the cache are independent, the cache update is simply the application
of the set update function to the correct set:
$$U(c, m) = c[\mathit{set}(m) \mapsto U_s(c(\mathit{set}(m)), m)]$$
Line
# 0 1 2 3 4 5 6 7
1 09000 02000 03000 04000 05000 06000 07000 08000
2 09000 02000 03000 04000 05000 06000 07000 08000
3 09000 02000 03000 04000 05000 06000 07000 08000
4 09000 02000 03000 04000 05000 06000 07000 08000
5 09000 02000 03000 04000 0a000 06000 07000 08000
6 0b000 02000 03000 04000 0a000 06000 07000 08000
7 0b000 02000 03000 04000 0a000 06000 07000 08000
8 0b000 02000 0c000 04000 0a000 06000 07000 08000
9 0b000 02000 0c000 04000 0a000 06000 0d000 08000
10 0b000 0e000 0c000 04000 0a000 06000 0d000 08000
11 0b000 0e000 0c000 04000 0f000 06000 0d000 08000
12 0b000 0e000 0c000 10000 0f000 06000 0d000 08000
13 0b000 0e000 0c000 10000 0f000 06000 0d000 08000
14 11000 0e000 0c000 10000 0f000 06000 0d000 08000
15 11000 0e000 0c000 10000 0f000 12000 0d000 08000
16 11000 0e000 13000 10000 0f000 12000 0d000 08000
17 11000 0e000 13000 10000 0f000 12000 14000 08000
18 11000 15000 13000 10000 0f000 12000 14000 08000
19 11000 15000 13000 10000 16000 12000 14000 08000
20 11000 15000 13000 17000 16000 12000 14000 08000
21 11000 15000 13000 17000 16000 12000 14000 18000
Table 2.9: PPC 755 unique access sequence
If a set contains an invalid line and a cache miss occurs, that line is filled
with the new memory block (lowest line first). Otherwise, the line selected by the
replacement bits is replaced. After every access, the replacement bits are updated
according to Table 2.6. Let $\mathit{up} : R \times \{0, \ldots, 7\} \to R$ be this update function and let
$\mathit{line} : R \to \{0, \ldots, 7\}$ be the function mapping replacement bits to the line selected
by them, according to Figure 2.13.
Then the set update $U_s : S \times M \to S$ is defined as
$$U_s((s, r), m) = \begin{cases}
(s, \mathit{up}(r, \mathit{loc}_s((s, r), m))) & \text{if } \mathit{loc}_s((s, r), m) \neq \bot \\
(s[l \mapsto m], \mathit{up}(r, l)) & \text{if } \mathit{loc}_s((s, r), m) = \bot \land \exists l : s(l) = \bot \land l \text{ min.} \\
(s[\mathit{line}(r) \mapsto m], \mathit{up}(r, \mathit{line}(r))) & \text{otherwise}
\end{cases}$$
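To determine if a cache access is a cache hit H or cache miss M, the function
$\mathit{classify} : C \times M \to \{\mathrm{H}, \mathrm{M}\}$ is used:
$$\mathit{classify}(c, m) = \begin{cases} \mathrm{H} & \text{if } \mathit{loc}(c, m) \neq \bot \\ \mathrm{M} & \text{otherwise} \end{cases}$$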
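The definitions of loc, U, Us and classify translate almost literally into executable form. The Python sketch below is such a transcription under stated assumptions, not the implementation used in the analyzer: a cache is a tuple of 128 sets, a set is a pair of an eight-element line tuple and a seven-element bit tuple, `None` stands for the invalid entry, memory blocks are identified by their line-aligned address, and `run` is a hypothetical helper.

```python
from typing import Optional, Tuple

Bits = Tuple[int, ...]                              # seven replacement bits, bit 0 first
CacheSet = Tuple[Tuple[Optional[int], ...], Bits]   # (line mapping s, replacement bits r)

NUM_SETS, WAYS = 128, 8
EMPTY_SET: CacheSet = ((None,) * WAYS, (0,) * 7)    # None plays the role of "invalid"


def set_index(m):
    """Set selected by block address m (address bits 20 to 26, bit 0 being the MSB)."""
    return (m >> 5) & 0x7F


def line(r):
    """Line selected by the replacement bits (Figure 2.13)."""
    top = r[0]
    mid = r[1 + top]
    return 4 * top + 2 * mid + r[3 + 2 * top + mid]


def up(r, l):
    """Table 2.6: make the three bits on the path to line l point away from it."""
    top, mid, low = l >> 2, (l >> 1) & 1, l & 1
    new = list(r)
    new[0], new[1 + top], new[3 + 2 * top + mid] = 1 - top, 1 - mid, 1 - low
    return tuple(new)


def loc_s(cs, m):
    """Line of block m in the set, or None if the block is not in the set."""
    s, _ = cs
    return s.index(m) if m in s else None


def update_set(cs, m):
    """Set update Us: hit, fill of the lowest invalid line, or replacement."""
    s, r = cs
    l = loc_s(cs, m)
    if l is None:                                       # cache miss
        l = s.index(None) if None in s else line(r)     # lowest invalid line first
        s = s[:l] + (m,) + s[l + 1:]
    return s, up(r, l)


def loc(c, m):
    return loc_s(c[set_index(m)], m)


def update(c, m):
    """Cache update U: apply the set update to the set selected by m."""
    i = set_index(m)
    return c[:i] + (update_set(c[i], m),) + c[i + 1:]


def classify(c, m):
    return "H" if loc(c, m) is not None else "M"


def run(c, accesses):
    """Classify each access in turn and return (classifications, final cache)."""
    result = []
    for m in accesses:
        result.append(classify(c, m))
        c = update(c, m)
    return "".join(result), c


if __name__ == "__main__":
    start = (0x0000, 0x4000, 0x2000, 0x6000, 0x1000, 0x5000, 0x3000, 0x7000)
    cache = ((start, (0,) * 7),) + (EMPTY_SET,) * (NUM_SETS - 1)
    hits, cache = run(cache, [0, 0x8000, 0, 0x9000, 0, 0xA000, 0, 0xB000, 0, 0xC000])
    print(hits)
    print([format(block, "04x") for block in cache[0][0]])
```

Started from the first row of Table 2.8 with all replacement bits zero, the sketch reproduces the last row of that table. It can also be used to check the all-miss property stated above: for each of the 128 possible bit settings, eight consecutive misses in a full set replace all eight lines.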
2.6.5 Assumptions Made
The PPC 755 has some features with great impact on the desired pipeline model.
Some of these features can be controlled by configuration registers, others can
be avoided by following a certain style of programming. Again, it is desirable
to exclude certain features or behaviors in order to reduce the complexity of the
model.
We made the following assumptions in the design of our pipeline analyzer:
Z The PPC 755 can be used in real address mode, i.e. without address trans-
lation for data and/or instruction accesses. Since this mode poses some
strange conditions with regard to the ‘guarded’ attribute of memory (cf.
[PPC97b, 5.1.2.5.3]) and speculative execution, we assume that only
the DBAT/IBAT registers are ever used to perform block address translation.
This assumption allows us to drop the modeling of virtual memory,
which is complicated and cannot be done precisely. Also, we can reduce the
complexity of the memory access model in the LSU.
Z The PPC 755 can perform Store Gathering ([PPC97a, 2.3.4.3.5]). We do
not model this feature, which can be turned off through the HID0 SPR.
Allowing store gathering increases the complexity and reduces the precision of the
analysis, as we would not be able to determine precisely whether stores can be
gathered for imprecise data accesses.
Z All synchronous exceptions on the 755 are precise. However, we will not
model any exceptions, including floating-point exceptions or illegal instructions.
Exceptions are mostly a sign of an error that occurred during program
execution. If an error occurs, the system will go to recovery mode or shut
down. Either way, no WCET is needed for this case and we can ignore it in
our timing model.
Z The 755 can use an external L2 cache. We do not model that cache and
assume that it is turned off. Also, we will not model the case that this
L2 cache can be configured as a private SRAM area. Access to the L2
is complicated and increases the complexity of the interface to the caches.
The L2 hardware was not present in the system for which the analysis was originally
designed anyway.
Z We model only a subset of the data cache invalidate or flush instructions;
the others are assumed not to be present in the program. This is because
the semantics of these instructions is complicated and their behavior cannot be
analyzed precisely.
Z Dynamic branch prediction with a branch target instruction cache (BTIC)
and a branch history table (BHT) is built into the 755. We do not model
these features, and assume that they are turned off in HID0. Dynamic
branch prediction can in principle be modeled with increased complexity.
However, it does not give a great performance gain for typical real-time
applications and thus we decided to exclude it.
Z The instruction and/or data cache can be (partially) locked against replace-
ment. We assume this feature is only used in certain configurations and is
fixed during runtime. This adds only a slight amount of complexity and is a
useful feature, so it is included in the model for the pipeline.
Z The data cache must be configured for write-through mode. Write-back
cannot be analyzed precisely due to the bad worst-case behavior of the PPC
755 cache.
Z Effects on memory protection registers (TLB, etc) are not modeled. Virtual
memory was too difficult to model for this version of the analyzer.
Z Atomic update instructions for multiprocessor synchronization, lwarx and
stwcx, are not modeled. They can be modeled in principle but are rather
useless in a real-time system with only one bus master.
Z Multiple load/stores with dynamic load/store counts are not modeled, i.e.
the instructions lswx and stswx must not occur in any program. These in-
structions would lead to a huge loss in precision of the analysis because the
number of memory accesses is not known and must be approximated.
Z Memory access times are assumed to be constant for each memory area.
This simplifies the modeling of the chip set unit. Another version includes
a more elaborate chip set unit with a controller that supports SDRAM and
PCI.
2.6.6 Timing Anomalies with the PPC 755
The following two examples are taken from [Sch02]. The first example shows that
an empty pipeline at the beginning of the execution of an instruction sequence may
lead to a longer execution time of the whole sequence than a partially filled one.
The second example shows that effects of timing anomalies are unbounded by
fixed hardware constants, but can only be bounded by program length and iteration
counts of loops, which makes their approximation non-local.
Empty Pipeline may be worst-case scenario
One assumption about pipelines is that a program starting with no other instructions
occupying pipeline stages before it will run at least as fast as it would
if the pipeline were not empty. This assumption is based on the argument
that occupied stages ahead in the pipeline may lead to hazards for the
program and thus slow down the program execution. As the MPC 755 schedules
instructions dynamically, this assumption is no longer true. In fact, occupied
stages ahead in the pipeline can prevent scheduling decisions that lead to hazards
in the program itself.
The example program in Figure 2.14 exhibits this behavior. The program
whose timing is delayed starts at index 1, the unit an instruction may be scheduled
to is given in the third column. All multiplication instructions must be scheduled
to IU1, the other integer instructions may be scheduled to either IU1 or IU2, what-
ever unit is free (IU2 takes precedence if both units are free). The operands of the
multiplication instructions must be such that the multiply takes 5 cycles (upper
8 bits not zero). The RAW dependency between instructions i-2 and i-1 delays
execution of the latter until the former has finished. The RAW dependency between
instructions i-1 and i1 also forces the latter to wait until instruction i-1 has
finished execution.
Index  Instruction         Unit
i-3    addi r15,r15,4      IU1/IU2
i-2    lwz r20,0(r31)      LSU
i-1    addi r16,r20,4      IU1/IU2
i0     fadd f31,f31,f30    FPU
i1     addi r17,r16,4      IU1/IU2
i2     lwz r18,0(r17)      LSU
i3     addi r22,r18,4      IU1/IU2
i4     addi r21,r19,4      IU1/IU2
i5     mullw r25,r23,r24   IU1
i6     lwz r26,0(r22)      LSU
i7     lwz r27,0(r25)      LSU
i8     addi r8,r26,4       IU1/IU2
i9     addi r28,r25,4      IU1/IU2
i10    addi r5,r27,1       IU1/IU2
i11    mullw r4,r14,r30    IU1
i12    mullw r3,r6,r30     IU1
Figure 2.14: Timing anomaly MPC 755 I: Empty pipeline
If the instruction sequence is executed from index i-3 on, then instruction i-1
is scheduled to the IU2 unit, while i1 is scheduled to IU1 (IU2 is not free because
of the RAW dependency). Following the other dependencies, instruction i10 is
scheduled to IU2, while instruction i11 is scheduled to IU1 and begins execution
even before instruction i10. The whole sequence finishes after 19 cycles.
If the sequence starts with instruction i1 and an empty pipeline, then i1 will be
scheduled to IU2 (which is not occupied). Following the dependencies, instruction
i10 must be scheduled to IU1, as IU2 is occupied by instruction i9. Thus, instruc-
tion i11 must be scheduled to the reservation station of IU1, as it must execute on
IU1 and the unit itself is occupied by instruction i10. In this scenario, execution of
instruction i11 begins after instruction i10 has finished and the (shorter) sequence
takes 22 cycles.
The details of the instruction scheduling for these sequences can be found in
[Sch02, A.1].
Unbounded Timing Anomaly
The timing behavior of the MPC 755 is sensitive to small perturbations. Given the
program in Figure 2.15, when the pipeline is empty at the beginning, instruction
i2 is scheduled to IU2 and i3 to IU1. i2 cannot begin execution until i1 has finished
due to a RAW dependency.
Index Instruction Unit
i1 lwz r20,0(r2) LSU
i2 addi r21,r20,4 IU1/2
i3 mullw r19,r14,r29 IU1
i4 lwz r23,0(r20) LSU
i5 addi r24,r23,4 IU1/2
i6 addi r25,r14,4 IU1/2
i7 lwz r26,0(r19) LSU
i8 mullw r27,r14,r29 IU1
i9 lwz r28,0(r26) LSU
i10 addi r22,r28,0 IU1/2
Figure 2.15: Timing anomaly MPC 755 II: Unbounded effect
The whole sequence takes 10 cycles to execute once, 19 to execute twice and
$10 + 9(n - 1)$ to execute n times.
If the pipeline is not empty, but one instruction occupies IU2 for 1 cycle ini-
tially, then instruction i2 is scheduled to IU1 and instruction i3, which must exe-
cute in IU1, must wait for the end of instruction i2. Looking at several executions
of this scenario, it takes 12n cycles to execute this sequence n times. If this se-
quence forms the body of a loop and is executed 100 times, then the difference in
execution times is 299 cycles, although the initial delay was just one cycle due to
the added instruction.
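The arithmetic behind this figure can be spelled out in a few lines of Python. The function names are hypothetical; the cycle counts are the ones quoted above for the two scenarios.

```python
def cycles_empty_pipeline(n):
    """Cycles for n executions when the pipeline is initially empty."""
    return 10 + 9 * (n - 1)


def cycles_one_cycle_delay(n):
    """Cycles for n executions when IU2 is initially occupied for one cycle."""
    return 12 * n


def anomaly_penalty(n):
    """Difference between the two scenarios; algebraically this is 3n - 1."""
    return cycles_one_cycle_delay(n) - cycles_empty_pipeline(n)


if __name__ == "__main__":
    for n in (1, 2, 100):
        print(n, cycles_empty_pipeline(n), cycles_one_cycle_delay(n), anomaly_penalty(n))
```

Since the difference 12n - (10 + 9(n - 1)) = 3n - 1 grows linearly with the iteration count, no hardware constant bounds the effect of the initial one-cycle delay.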
Both scenarios presented here have been verified by measurements on a Sand-
Point 3 evaluation system from Motorola.
Chapter 3
Semantics and Analyses
Correctness of a hard real-time system is a crucial property and must be guaranteed.
This implies that the methods used to verify this correctness must also meet
stringent correctness and reliability goals. Therefore, every methodology used
in the validation of the requirements of a real-time system must have a sound and
clear basis. The way chosen here to achieve this is to define a clear (mathematical)
semantics for the system to be analyzed and to prove that the analyses deployed to
obtain information about the system are sound w.r.t. this semantics: if the analysis
determines that a given property is true for the system then it must be true for
all concrete executions of the system. In this chapter, we will present the general
framework for defining semantics for a real-time system consisting of a proces-
sor and associated peripherals, together with a general framework that guarantees
the correctness of analyses that fulfill certain conditions w.r.t. the semantics. The
presentation is an adapted recapitulation of the foundations of program semantics
and abstract interpretation.
The instantiated semantics definition for the pipeline analysis is presented in
Chapter 4, the analysis based on it is defined and proven correct in Chapter 5. An-
other analysis that is used in the framework of our WCET tool, the value analysis,
is described in Section 6.2.
It should be noted that the definition of the semantics itself must be looked at
under correctness considerations: in the context in which the analyses presented
in this work have been developed, authoritative information about the underlying
hardware was not available. The semantics itself had to be extracted from public
documents about the processors and the peripherals, combined with experimen-
tal verification and questioning of the design teams of the hardware. Thus, the
analyses are only correct if this modeling of the semantics is correct w.r.t. the real
hardware. This problem of validating the information the semantics is based on as
well as that of validating the implementation of the analyses is discussed in more
detail in Chapter 7. Possible ways to obtain a semantics (semi-)automatically from
authoritative descriptions (e. g. VHDL models) are the topic of ongoing research
and are discussed in Section 9.3.
3.1 Semantics
When defining the semantics of the program execution in a real-time system, we
may take one of two views:
Z With a system-centric semantics we describe how the system evolves as a
whole. In our case, this evolution would be in steps of one processor cycle.
Such a semantics would define a start state and an end state and then only
describe the evolution of a state, beginning with the start state until the end
state is reached.
Z With an instruction-centric semantics we describe what the effects of executing
a single instruction are on the state of the system.
The second view is equivalent to the first one: by composing the effects of the
instructions along an execution path of the program, we obtain the whole evolution
of the system. The first view can only be used to obtain the second one if we can
define a predicate on the system state that can consistently determine when an
instruction has finished its execution (e. g. when it has left the processor pipeline).
In the presence of modern processor features like branch folding and speculative
execution this is a non-trivial issue.
The instruction-centric semantics is easily utilized to denote the semantics of
a machine program w.r.t. the visible effects on the machine programming model.
Unfortunately, the machine programming model is only concerned with such
things as contents of registers and memory. Timing is not part of this model.
E.g., the PowerPC user model does not say anything about additional memory
reads being performed by a program, as far as these reads are from normal RAM
(not memory mapped control registers). A program may, and actually does on
the PPC 755, perform speculative data reads via instructions that will be canceled
(after a branch misprediction) later. Clearly, this influences the timing behavior
significantly.
A system-centric semantics on the other hand can capture such effects better in
that it can represent the evolution of the system with sufficient timing resolution.
This style of semantics has the drawback that it is more difficult to make state-
ments about parts of the program, as the state trace initially only describes the
execution from program start to program end. Using a system-centric semantics,
i. e. a pure state trace semantics, leads to an analysis that is quite similar to the
one used by Lundqvist and Stenström in [LS98] which uses approximations to execution
traces to obtain upper bounds to WCETs. The drawbacks of this approach
have already been discussed in Section 1.4.1.
Being able to modularize the analysis of a program by obtaining significant
information for parts of it (say, basic blocks in a machine program) and then using
other techniques to combine the results (such as ILP for the computation of the worst-case
path through the program) is a key point for obtaining efficient analyses. And
to obtain a modular analysis one needs to define the semantics in a modular way.
Luckily, the system-centric semantics can be utilized for the definition of an
instruction-centric semantics associated with the instructions of the program. This
is done by carefully defining when the effects of this instruction are no longer vis-
ible in the state and then assigning sub-sequences of the state trace to each instruc-
tion. This means that the effect of one instruction is represented by a sequence of
states (during which this instruction is being executed). Clearly, in the presence
of the highly parallel execution in today’s processor pipelines finding a sufficient
condition for the end of an instruction’s execution can be challenging.
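To make the idea of assigning sub-sequences of the state trace to instructions concrete, consider the following deliberately simplified Python sketch. It assumes that a state is nothing more than the set of instructions currently in the pipeline, which is far coarser than the states defined later; the names `instruction_spans` and `sub_trace` are hypothetical.

```python
def instruction_spans(trace):
    """Map each instruction to the (first, last) cycle in which it is in the pipeline."""
    spans = {}
    for cycle, in_flight in enumerate(trace):
        for instr in in_flight:
            first, _ = spans.get(instr, (cycle, cycle))
            spans[instr] = (first, cycle)
    return spans


def sub_trace(trace, span):
    """The sub-sequence of states during which an instruction is being executed."""
    first, last = span
    return trace[first:last + 1]


if __name__ == "__main__":
    # One set of in-flight instructions per processor cycle (illustrative data only).
    trace = [{"i1"}, {"i1", "i2"}, {"i2", "i3"}, {"i3"}]
    spans = instruction_spans(trace)
    for instr in sorted(spans):
        print(instr, spans[instr], sub_trace(trace, spans[instr]))
    total = sum(last - first + 1 for first, last in spans.values())
    print(total, len(trace))   # summed span lengths versus length of the whole trace
```

Because the sub-sequences of different instructions overlap, their lengths add up to more than the length of the whole trace. This is one reason why the end of an instruction's execution has to be defined carefully before per-instruction results can be combined.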
Since an instruction-centric semantics is better suited for defining the asso-
ciated analyses, and because numerous techniques to do so have already been
developed in the context of compiler construction theory, we will adopt this style
in our semantics definitions. We will make use of different aspects of the machine
semantics in two places in our WCET tool chain:
• The so-called value analysis [Sic97] tries to determine the addresses of data
accesses that may occur during program execution. Since the addresses
of data accesses can depend on register contents, we perform an interval
analysis on the machine registers. Interval analysis is a special case of the
analysis presented in [CH78] that determines linear inequalities between
program variables. Here, the machine registers are the variables and the
inequalities have the simple form $a \leq r \leq b$ with r being the contents of a
machine register and a and b constants. This way, the interval analysis is a
generalization of constant propagation [CCKT86] which is a standard anal-
ysis in compilers. For this analysis, we only consider the user visible effects
of the semantics, which makes it much easier to define the semantics and
the analysis itself. In Section 6.2 we will recapitulate how this is achieved
and how the results of this analysis are used in the pipeline analysis.
• The pipeline analysis itself determines upper bounds to execution times of
basic blocks in the program. To define the analysis we first define a state
trace semantics. Then we address the question how meaning can be as-
signed to the single instructions in the program, i. e. how an instruction-
centric semantics can be derived from the trace semantics. This approach
relies on the assumption that we are able to define when an instruction has
finished its execution, i. e. when it has left the pipeline. Thus, we are only
able to obtain pipeline analyses for those classes of processors for which
this is possible. As we will show, one is able to do so for processors that
perform branch folding, speculative execution and out-of-order execution,
so this assumption should not be too restrictive in practice. Chapters 4 and 5
elaborate on this in detail.
3.1.1 Program Representation
Both forms of semantics annotation require knowledge about the instruction that
will be executed after the current instruction. The value semantics, e. g., will
have different results for branch instructions, if the branch is being taken or not.
Consequently, the results for the instruction immediately following the branch in
the program and the one at the target address of the branch are different. There-
fore, we should assign semantics not to instructions but to pairs of instructions,
where the second instruction represents that instruction which will be executed
next. This view is the same as taken in [Cou81], where the next instruction to
execute is encoded in the semantic value (state) given to the semantics functions.
We choose to make explicit this dependency by defining semantics and analyses
on the control-ow graph (CFG) of a program. This graph describes (a superset
of) all possible execution paths of a program. It is made up of nodes which are
labeled with program constructs1. Directed, labeled, edges connect nodes, if the
execution can pass from the source node to the target node without passing other
nodes in between. With these edges, we will associate semantics.
How this CFG can be reconstructed from the executable code of a program is
described in Section 6.1. Figure 3.1 shows a fragment of a program in PowerPC
machine code and the corresponding CFG.
Definition 3.1.1 (CFG): A control-flow graph is a graph $G = (V, E)$, with a set
$V$ of vertices (or nodes) and a set $E \subseteq V \times V$ of edges. The nodes
are labeled with program constructs; we write $\mathit{prog}(v)$ for the label of
a node $v \in V$. The edges, which are written as $e = v_1 \to v_2$
($v_1, v_2 \in V$), are labeled with edge labels, denoted by $\mathit{lab}(e)$ for
an edge $e \in E$.
We assume that $G$ has a unique start node and a unique end node, which
are denoted by $s_G$ and $e_G$, resp. In addition, there must exist a path (see
below) through $G$ from $s_G$ to every node $v$. Likewise, a path must exist
from every node $v$ to $e_G$.
Note that a CFG may also contain nodes that do not correspond directly to
a program construct. E.g. to ensure that unique start and end nodes do exist,
1 In our case, the nodes are labeled with machine instructions.
0x10200a0: cmpi cr0, 0x0, r3, 10
0x10200a4: addi r0, zero, 1
0x10200a8: bc 0xc, cr0.gt, 0x10200b0.f
0x10200ac: addi r0, zero, 0
0x10200b0: or r3, r0, r0
0x10200b4: bclr 0x14
Figure 3.1: Program fragment and its CFG
we can add special nodes. These are labeled with special symbols to distinguish
them from program nodes (in Figure 3.1 the node labeled x is such a node.). The
RETURN and CALL nodes in Section 3.3.1 are other examples of special nodes
that do not correspond to program constructs.
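As an illustration of Definition 3.1.1, the following Python sketch encodes the program fragment of Figure 3.1 as a labeled graph and checks the two reachability requirements. The node names START and END, the edge labels and the helper names are assumptions made for this sketch only; they are not taken from the tool chain described in this thesis.

```python
from collections import deque

# prog(v): node labels. Addresses and instructions are copied from Figure 3.1;
# "START" and "END" are hypothetical special nodes without a program construct.
prog = {
    "START": None,
    0x10200a0: "cmpi cr0, 0x0, r3, 10",
    0x10200a4: "addi r0, zero, 1",
    0x10200a8: "bc 0xc, cr0.gt, 0x10200b0.f",
    0x10200ac: "addi r0, zero, 0",
    0x10200b0: "or r3, r0, r0",
    0x10200b4: "bclr 0x14",
    "END": None,
}

# lab(e): edge labels (the label names are illustrative assumptions).
edges = {
    ("START", 0x10200a0): "normal",
    (0x10200a0, 0x10200a4): "normal",
    (0x10200a4, 0x10200a8): "normal",
    (0x10200a8, 0x10200b0): "taken",
    (0x10200a8, 0x10200ac): "fallthrough",
    (0x10200ac, 0x10200b0): "normal",
    (0x10200b0, 0x10200b4): "normal",
    (0x10200b4, "END"): "normal",
}

def reachable(start, edge_set):
    """Return all nodes reachable from `start` along the given edges."""
    seen, work = {start}, deque([start])
    while work:
        u = work.popleft()
        for (a, b) in edge_set:
            if a == u and b not in seen:
                seen.add(b)
                work.append(b)
    return seen

def is_well_formed(nodes, edge_set, s, e):
    """Every node is reachable from s, and e is reachable from every node."""
    forward = reachable(s, edge_set)
    backward = reachable(e, {(b, a) for (a, b) in edge_set})
    return forward == set(nodes) and backward == set(nodes)
```

Calling is_well_formed(prog, edges, "START", "END") checks both path conditions of the definition for this fragment.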
3.1.2 Concrete Semantics
Before we can explore the static analysis of a program and its correctness, we
have to define the concrete semantics of a program. In the semantics presented
here, the system is described by a state, i. e. the contents of registers, memory,
etc. Execution of a program is then described by the changes to an initial state by
the instructions of the program executed along a path through the CFG:
Definition 3.1.2 (State): A program execution is defined by the transformation
of a state. We denote the set of all states by Σ, a single state by σ.
The effect of an instruction depends not only on the instruction itself but also
on the next instruction to be executed. That is, rather than describing an effect of
one instruction execution on a state, we attribute the effect to the edge in the CFG,
connecting an instruction and (one of) its successors.
Sometimes, it is impossible to traverse a certain edge given a state. E.g. if
a node in the CFG has multiple outgoing edges, corresponding to a conditional
branch instruction, then a state determines that only one edge may be traversed.
In such a case, the transfer functions associated to the other edges should produce
an “impossible” state when applied to the input state. For this, we could introduce
a special element $\bot \in \Sigma$ that denotes infeasible execution. But it will
later be more convenient to capture this situation in a different manner: instead of
defining transfer functions as functions from $E \to \Sigma \to \Sigma$, we take
them from the domain $E \to \mathcal{P}(\Sigma) \to \mathcal{P}(\Sigma)$. The second
argument to a transfer function will, however, either be the empty set,
$\emptyset$, denoting infeasible execution, or a singleton set $\{\sigma\}$
containing the execution state.
Definition 3.1.3 (Transfer function): The map
$T_{\cdot} : E \to \mathcal{P}(\Sigma) \to \mathcal{P}(\Sigma)$ assigns
a transfer function $T_e : \mathcal{P}(\Sigma) \to \mathcal{P}(\Sigma)$ to every
edge $e \in E$ in the CFG $G$.
We require that $T_e(\emptyset) = \emptyset$ for every $e$. If there are several
outgoing edges $e_1, \ldots, e_n$ for a node $n$, then there is at most one edge
$e_i$ with $T_{e_i}(x) \neq \emptyset$.
If $T_e(\{\sigma\}) = \emptyset$ then this means that $\sigma$ cannot occur as an input state on a real
execution along edge e. This is just one way to capture the fact that transfer func-
tions are partial functions. The condition that there can be at most one outgoing
edge along which the execution is feasible guarantees a deterministic program
execution.
With this, we can define the semantics of a program starting with a state σ as
the (possibly infinite) trace of states occurring during execution:
Definition 3.1.4 (Semantics): Given a CFG $G$, the semantics of $G$ is defined by
a function $S : \Sigma \to \Sigma^{\infty}$, defined by
$$S(\sigma) = S'(\{\sigma\}, s_G)$$
where (with $T = \{\sigma\}$)
$$S'(T, v) = \sigma \,.\, \begin{cases}
\varepsilon & \text{if } v = e_G \\
S'(T_e(T), v') & \text{if } v \neq e_G \wedge \exists e = v \to v' : T_e(T) \neq \emptyset \\
S'(T, v) & \text{otherwise}
\end{cases}$$
$S$ maps a start state $\sigma$ to a (possibly infinite) state trace $\sigma_1 . \sigma_2 . \ldots$. Execution is
along the edges of the CFG, starting with sG. At every node, we have either
reached the end node eG of the CFG and execution stops, or there is at most one
outgoing edge with a transfer function that gives a non-empty result for the current
state. If there is no such outgoing edge, the resulting state trace is infinite and will
“loop” at the current node. If there is one valid outgoing edge, then the state trace
is the current state, prolonged with the state trace from the target node with the
state transformed by the transfer function.
Since the program execution is defined to be deterministic, the semantics is
well defined, i. e. there is at most one edge leaving a node n which is feasible
for the continuation of the program execution. The trace is finite if the program
terminates, and infinite if the program either does not terminate or gets stuck in
an infeasible state (run time error).
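The trace semantics can be made concrete with a small Python sketch. It is only an illustration under simplifying assumptions: the toy CFG (nodes "s", "a", "e"), the single counter that serves as the whole machine state, and the names make_transfer and run are all hypothetical. Since a trace may be infinite, the sketch stops after a fixed number of steps.

```python
def make_transfer(guard, update):
    """Build T_e on sets of states: the empty set marks infeasible execution."""
    def t(states):
        return frozenset(update(s) for s in states if guard(s))
    return t

# Hypothetical toy CFG with start node "s", end node "e" and a loop at "a".
# A state is just the value of one counter r.
T = {
    ("s", "a"): make_transfer(lambda r: True, lambda r: r),
    ("a", "a"): make_transfer(lambda r: r > 0, lambda r: r - 1),
    ("a", "e"): make_transfer(lambda r: r <= 0, lambda r: r),
}

def run(sigma, start="s", end="e", limit=50):
    """Return (state trace, edge path), truncated after `limit` steps."""
    states, node = frozenset({sigma}), start
    trace, path = [], []
    for _ in range(limit):
        (current,) = states          # states is always a singleton here
        trace.append(current)
        if node == end:
            break
        feasible = [e for e in T if e[0] == node and T[e](states)]
        assert len(feasible) <= 1, "transfer functions must be deterministic"
        if feasible:
            edge = feasible[0]
            states, node = T[edge](states), edge[1]
            path.append(edge)
        else:
            path.append((node, node))  # stuck: the trace loops at this node
    return trace, path
```

For the start state 2, run(2) follows the loop edge twice before the exit edge becomes feasible, so the current state is recorded once per visited node.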
For the argumentation of the correctness of an analysis we will make use of
the path through the CFG that is taken during execution of the program, which is
equivalent to the program trace.
Definition 3.1.5 (Path): Let $G = (V, E)$ be a CFG. A (infinite) path $\pi$ in $G$
is a sequence of edges, written $e_1 . e_2 . \ldots$, where for two consecutive
edges $e_i, e_{i+1}$ it holds that
$\exists v_j, v_k, v_l \in V : e_i = v_j \to v_k,\ e_{i+1} = v_k \to v_l$. The set of
all paths (in $G$) is denoted by $\Pi^{\infty}$; the empty path by $\varepsilon$.
The set of all finite paths in $G$ is denoted by $\Pi$, $\Pi \subseteq \Pi^{\infty}$.
The set of all paths in $G$ which start at node $s$ and end in node $v$ is written
$P(s, v) \subseteq \Pi$.
For every path $\pi \in \Pi^{\infty}$, the starting node of $\pi$ is written as
$\mathit{start}(\pi)$;
$\mathit{start}(\pi) = v_1 \Leftrightarrow \pi = v_1 \to v_2 . e_2 . \ldots$.
Likewise, we write $\mathit{end}(\pi)$ for the ending node of a finite path
$\pi \in \Pi$,
$\mathit{end}(\pi) = v_n \Leftrightarrow \pi = e_1 . \ldots . v_{n-1} \to v_n$, $n > 0$.
The path taken during the execution of the program is already available in the
definition of the semantics: we only have to collect the edges e of the transfer
functions giving the successor state to one state in the trace. The function
$p_G : \Sigma \to \Pi^{\infty}$ gives this path:
$$p_G(\sigma) = p'_G(\{\sigma\}, s_G)$$
where
$$p'_G(S, v) = \begin{cases}
\varepsilon & \text{if } v = e_G \\
e \,.\, p'_G(T_e(S), v') & \text{if } v \neq e_G \wedge \exists e = v \to v' : T_e(S) \neq \emptyset \\
v \to v \,.\, p'_G(S, v) & \text{otherwise}
\end{cases}$$
From a given path pi, we can deduce the sequence of states that occur during
a walk along this path. For every finite path, the path semantics determines the
effect on a start state of executing the program along this path:
Definition 3.1.6 (Path semantics): Let $\pi \in \Pi$ be a finite path. Define the
path semantics
$[\![\cdot]\!] : \Pi \to \mathcal{P}(\Sigma) \to \mathcal{P}(\Sigma)$ as
$$[\![e_1 . \ldots . e_n]\!] := T_{e_n} \circ \ldots \circ T_{e_1}$$
Naturally, we cannot give a semantics to an infinite path. We can, however,
give a semantics to every finite prefix of such a path, which is sufficient for our
needs. Thus, if the program $G$ starting from $\sigma$ terminates and the last state
in $S(\sigma)$ is $\sigma'$, then we also have
$\sigma' = [\![p_G(\sigma)]\!](\{\sigma\})$.
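The path semantics is nothing more than function composition in execution order. A minimal Python sketch, assuming transfer functions are given as a dictionary from edges to functions on frozensets (the dictionary layout, the two example edges and the name path_semantics are assumptions of this sketch):

```python
from functools import reduce

def path_semantics(path, transfer):
    """Compose the transfer functions of the edges of `path`, first edge first."""
    return lambda states: reduce(lambda acc, e: transfer[e](acc), path, states)

# Two hypothetical edges: the first adds 1 to the state, the second doubles it.
transfer = {
    ("s", "a"): lambda S: frozenset(x + 1 for x in S),
    ("a", "e"): lambda S: frozenset(2 * x for x in S),
}

effect = path_semantics([("s", "a"), ("a", "e")], transfer)
```

Applying effect to frozenset({3}) first adds one and then doubles; the empty path yields the identity function, and the empty set is mapped to the empty set.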
As we want to obtain information for parts of the program later, we have to
look not only at the state after execution of the program but also at the states dur-
ing execution. Therefore, we associate with every program point, i. e. instruction,
the set of states that occur during execution at that point. Since program points
correspond to the nodes in our CFG, we define the set of possible states that may
occur at each node of the CFG as a map from CFG nodes to the powerset of
states. This can easily be done by gathering the path semantics of every prefix of
the execution path that ends in this node:
Definition 3.1.7 (State map): Let $\sigma$ be an initial start state for program
execution and $\pi = p_G(\sigma)$, then define the map
$\mathit{states}_\sigma : V \to \mathcal{P}(\Sigma)$ as
$$\mathit{states}_\sigma(v) := \bigcup \{ [\![\pi']\!](\{\sigma\}) \mid \pi' \in \mathit{prefixes}(\pi) \cap P(s_G, v) \}$$
where $\mathit{prefixes}(\pi)$ is the set of all finite prefixes of the path $\pi$.
$\mathit{states}_\sigma$ gives for every start state $\sigma$ the set of all states
that occur at a node during execution.
If we consider only one execution of the program, starting from a given state,
then any analysis based on this can only give information about that particular
execution. Our analyses will give results that are valid for all executions of the
program. Therefore we first define the collection of all program executions, the
collecting semantics:
Definition 3.1.8 (Collecting semantics): We define the collecting semantics
$A^{co}_G : \mathcal{P}(\Sigma) \to (V \to \mathcal{P}(\Sigma))$ (where $V$ are the
nodes from the CFG $G$) as
$$A^{co}_G(S) = \lambda v \,.\, \bigcup_{\sigma \in S} \mathit{states}_\sigma(v) \qquad (3.1.9)$$
If the set $S$ contains all possible start states, then $A^{co}_G(S)$ gives all
possible states at each program point that can occur during execution.
Everything that can be said about the program $G$ is represented by this
collecting semantics. Naturally, it is not computable in general; even if we restrict
ourselves to terminating programs and finitely many possible start states, we
cannot compute $A^{co}_G$ efficiently. Therefore, we have to search for approximate
descriptions of the information represented by $A^{co}_G$; such an approximation
will be called analysis of $G$. This information will necessarily be incomplete but
is required to be safe: if an approximation (analysis) says that a given property
at a program point $v$ holds, then this property must be satisfied by every state
$\sigma \in A^{co}_G(S)(v)$. The next section presents the framework of abstract
interpretation which defines some general relations between $A^{co}_G$ and any
approximation. Section 3.3 then presents an implementation method for analyses, the
data-flow analysis, and shows that it satisfies the necessary correctness constraints.
3.2 Program Analysis
This section presents the theory of static program analysis in the framework of ab-
stract interpretation, [CC77, CC79, Cou81, CC91, CC92a, CC92b, NNH99]. Due
to the high importance of the correctness of analysis results in real-time systems,
the presentation is quite detailed.
3.2.1 Abstract Interpretation
The framework of abstract interpretation defines relations between a concrete
semantics (here the collecting semantics) and an abstract semantics. While the
collecting semantics computes a set of states for each program point, thus works on
the domain $\mathcal{P}(\Sigma)$, an abstract semantics computes values on an
abstract domain $\hat{D}$. Abstract values $\hat{d}$ from this domain can be seen as
approximations of the set of states computed in the collecting semantics.
We assume that there is an ordering on abstract values, that captures the preci-
sion of the approximation represented by the abstract values. Therefore, we first
define some concepts about orderings and ordered sets that will be utilized in the
sequel.
Definition 3.2.1 (Partially Ordered Set): Given a set $A$, a relation
$\sqsubseteq_A \subseteq A \times A$ is called partial ordering iff it is reflexive,
transitive and anti-symmetric. We call $(A, \sqsubseteq_A)$ a partially ordered set.
Definition 3.2.2 (Bounds): For a partially ordered set $(A, \sqsubseteq_A)$ and a
set $Y \subseteq A$ we call an element $a \in A$ lower bound of $Y$, iff
$\forall y \in Y : a \sqsubseteq_A y$. Likewise we call $b \in A$ upper bound of $Y$
iff $\forall y \in Y : y \sqsubseteq_A b$. We call an element $l \in A$ greatest
lower bound (or glb for short) of $Y$ if it is a lower bound for $Y$ and for all
other lower bounds $a$ of $Y$ we have $a \sqsubseteq_A l$. We call an element
$u \in A$ least upper bound (or lub for short) of $Y$ if it is an upper bound for
$Y$ and for all other upper bounds $b$ of $Y$ we have $u \sqsubseteq_A b$. If
greatest lower bounds or least upper bounds exist, they are unique and we write
$\bigsqcap Y$ for the glb of $Y$ and $\bigsqcup Y$ for the lub of $Y$.
Definition 3.2.3 (Complete Lattice): A partially ordered set $(L, \sqsubseteq_L)$
is called a complete lattice if $\bigsqcup Y$ and $\bigsqcap Y$ exist for every
$Y \subseteq L$. A complete lattice has a least element $\bot \in L$ and a greatest
element $\top \in L$. For these we have $\bot = \bigsqcup \emptyset$ and
$\top = \bigsqcup L$. For convenience we write $a \sqcup b$ for
$\bigsqcup \{a, b\}$ and $a \sqcap b$ for $\bigsqcap \{a, b\}$. $\sqcap$ is called
meet operator and $\sqcup$ is called join operator. To define a complete lattice we
sometimes write it as a tuple
$(L, \sqsubseteq_L, \bot, \top, \bigsqcup, \bigsqcap)$.
We will use a partially ordered set $(A, \sqsubseteq_A)$ to capture the precision
of the elements from $A$. If we have $a, b \in A$ with $a \sqsubseteq_A b$ then we
say that $a$ is more precise than $b$; or that $b$ is more abstract than $a$.
If $(A, \sqsubseteq)$ is even a complete lattice, then we can find for any set
$Y \subseteq A$ of approximations the most precise one, $a$, which is
$a = \bigsqcap Y$.
Example 3.2.4 (Collecting Semantics): The domain used in the collecting
semantics, $\mathcal{P}(\Sigma)$, together with the subset ordering $\subseteq$ is a
complete lattice
$(\mathcal{P}(\Sigma), \subseteq, \emptyset, \Sigma, \bigcup, \bigcap)$.
Here, if $S \subseteq S'$ for subsets $S, S'$ of $\Sigma$, then $S'$ is more
abstract because it contains more states than $S$ and we usually cannot make more
precise statements that hold for all states in $S'$ than we can for the states in
$S$. □
Example 3.2.5 (Intervals): The set $\mathbb{N}_\top = \mathbb{N} \cup \{\top\}$
together with the ordering $\sqsubseteq_\top$ defined by
$$a \sqsubseteq_\top \top$$
$$a \sqsubseteq_\top b, \text{ if } a \neq \top \wedge b \neq \top \wedge a \leq b$$
is a partially ordered set.
The set $I = \mathbb{N}_\top \times \mathbb{N}_\top$ of intervals together with the
ordering $\sqsubseteq_I$ defined by $(a, b) \sqsubseteq_I (c, d)$ iff
$c \sqsubseteq_\top a \wedge b \sqsubseteq_\top d$ is a complete lattice with
$$\bot_I = (\top, 0)$$
$$\top_I = (0, \top)$$
$$\bigsqcup Y = (\min \{a \mid (a, b) \in Y\},\ \max \{b \mid (a, b) \in Y\})$$
$$\bigsqcap Y = (\max \{a \mid (a, b) \in Y\},\ \min \{b \mid (a, b) \in Y\})$$
This complete lattice can be used to approximate a set of natural numbers.
The set of numbers represented exactly by an interval $(a, b)$ is
$\{n \mid a \leq n \leq b\}$. Likewise, for any set of numbers $N$, we can give
intervals that represent at least these numbers. Since $I$ is a complete lattice we
can even give a most precise (i. e. smallest w.r.t. $\sqsubseteq_I$) interval that
contains all numbers from $N$: $(\min N, \max N)$. □
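The interval lattice of this example translates almost literally into code. The following Python sketch is an illustration only: it represents the greatest element by math.inf, stores an interval as a pair, and uses the hypothetical names leq, join and meet for the ordering, the least upper bound and the greatest lower bound.

```python
import math

TOP = math.inf        # stands in for the greatest element added to the naturals
BOTTOM_I = (TOP, 0)   # least interval: represents no number at all
TOP_I = (0, TOP)      # greatest interval: represents every natural number

def leq(i, j):
    """(a, b) is at least as precise as (c, d) iff c <= a and b <= d."""
    (a, b), (c, d) = i, j
    return c <= a and b <= d

def join(intervals):
    """Least upper bound: smallest interval enclosing all given intervals."""
    ivs = list(intervals)
    if not ivs:
        return BOTTOM_I
    return (min(a for a, _ in ivs), max(b for _, b in ivs))

def meet(intervals):
    """Greatest lower bound: the common part of all given intervals."""
    ivs = list(intervals)
    if not ivs:
        return TOP_I
    return (max(a for a, _ in ivs), min(b for _, b in ivs))
```

For instance, join([(1, 3), (5, 9)]) encloses both intervals and therefore also represents the number 4, which neither argument contains; this is the loss of precision that the ordering is meant to capture.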
Analysis will later give approximations, i. e. values from abstract domains
instead of a set of states as does the collecting semantics. These values must be
related to the results of the collecting semantics. To formalize this relation, we
can use a concretization function $\gamma : \hat{D} \to \mathcal{P}(\Sigma)$ that
maps abstract values to the set of states they approximate (or represent). We fix
some notations for functions working on partially ordered sets:
Definition 3.2.6 (Functions): Let $(A, \sqsubseteq_A)$ and $(B, \sqsubseteq_B)$ be
partially ordered sets. A function $f : A \to B$ is called monotone iff
$a \sqsubseteq_A a' \Rightarrow f(a) \sqsubseteq_B f(a')$. It is called surjective,
if $\forall b \in B : \exists a \in A : b = f(a)$, i. e. every element in $B$ can be
reached from $A$ via $f$. If $(A, \sqsubseteq_A)$ and $(B, \sqsubseteq_B)$ are
complete lattices, we call $f$ distributive (or completely additive) iff for all
subsets $Y \subseteq A$:
$f(\bigsqcup_A Y) = \bigsqcup_B \{f(a) \mid a \in Y\}$. Likewise we call $f$
completely multiplicative iff
$f(\bigsqcap_A Y) = \bigsqcap_B \{f(a) \mid a \in Y\}$.
Clearly, $\gamma$ must be monotone: the concretization of more abstract values must
also be more abstract in the collecting semantics. And for every concrete value
there must be an abstract one whose concretization approximates the concrete
value: $\forall S \in \mathcal{P}(\Sigma) : \exists \hat{d} \in \hat{D} : S \subseteq \gamma(\hat{d})$.
An analysis is now just an abstract semantics for a program $G$, working on a
domain $\hat{D}$ of abstract values. The analysis gives for each program point an
abstract value:
Definition 3.2.7 (Analysis): Given a partially ordered set
$(\hat{D}, \sqsubseteq_{\hat{D}})$ we call any monotone map
$A_G : \hat{D} \to (V \to \hat{D})$ an analysis on the program $G = (V, E)$.
Thus, $A^{co}_G$ can be viewed as an analysis (the most precise one), too. The
notion of precision via the partial orderings is also visible in the definition of
the collecting semantics: it is monotone in its first argument on the domain
$\mathcal{P}(\Sigma)$ with the ordering $\subseteq$, if we define the complete
lattice $(V \to A, \sqsubseteq_{\to})$ by
Definition 3.2.8 (Function Lattice): Let $V$ be a set, $(A, \sqsubseteq_A)$ a
complete lattice, then the functional lattice $(V \to A, \sqsubseteq_{\to})$ is a
complete lattice with
$f \sqsubseteq_{\to} g \Leftrightarrow \forall v \in V : f(v) \sqsubseteq_A g(v)$.
$\top_{\to} = \lambda v . \top_A$ and $\bot_{\to} = \lambda v . \bot_A$. The join
and meet operators are defined by
$\bigsqcup_{\to} F = \lambda v . \bigsqcup_A \{f(v) \mid f \in F\}$ and
$\bigsqcap_{\to} F = \lambda v . \bigsqcap_A \{f(v) \mid f \in F\}$, where
$F \subseteq V \to A$. If $(A, \sqsubseteq_A)$ is not a complete lattice but a
partially ordered set, then so is $(V \to A, \sqsubseteq_{\to})$.
The monotonicity of $A^{co}_G$ means that we obtain less precise concrete results if
we have less precise inputs: we cannot gain more precise information that is true
for all states at a program point from a set $S'$ of inputs, than we can gain for a
set $S \subseteq S'$ of inputs. Thus, $A^{co}_G$ is in itself an analysis with
$\hat{D} = \mathcal{P}(\Sigma)$.
With the monotone concretization function γ we can relate the results of an
analysis with those of the collecting semantics. A correct analysis will deliver
approximations that represent at least all concrete states that can occur during
execution, if the input approximation represents all possible inputs to the concrete
program execution:
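Definition 3.2.9 (Correct Analysis): Let $A_G : \hat{D} \to (V \to \hat{D})$ be an
analysis of $G$ on the partially ordered set $\hat{D}$ and
$\gamma : \hat{D} \to \mathcal{P}(\Sigma)$ be a concretization. We say that $A_G$ is
a correct analysis (w.r.t. $A^{co}_G$ and $\gamma$) of $G$ iff
$$\forall S \subseteq \Sigma, \hat{d} \in \hat{D} : S \subseteq \gamma(\hat{d}) \Rightarrow A^{co}_G(S) \sqsubseteq_{\to} \gamma \circ A_G(\hat{d}) \qquad (3.2.10)$$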
In fact, we can generalize this correctness approach by defining correctness
not w.r.t. $A^{co}_G$ but a different analysis $B_G : \hat{E} \to (V \to \hat{E})$
working on a domain $\hat{E}$:
Definition 3.2.11 (General Correct Analysis): Let
$B_G : \hat{E} \to (V \to \hat{E})$ and $A_G : \hat{D} \to (V \to \hat{D})$ be
analyses of $G$ on the partially ordered sets $\hat{E}$ and $\hat{D}$ respectively.
And let $\gamma : \hat{E} \to \hat{D}$ be a concretization function. We call $B_G$
a correct analysis of $G$ w.r.t. $A_G$ iff
$$\forall \hat{d} \in \hat{D}, \hat{e} \in \hat{E} : \hat{d} \sqsubseteq_{\hat{D}} \gamma(\hat{e}) \Rightarrow A_G(\hat{d}) \sqsubseteq_{\to} \gamma \circ B_G(\hat{e}) \qquad (3.2.12)$$
From now on we will omit the program G from the denotation of analyses, so
we write A instead of AG.
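One advantage of abstract interpretation is now that we can combine analyses
and still have a correct analysis. Suppose we have three domains $\hat{D}$,
$\hat{E}$ and $\hat{F}$ and analyses $A$, $B$ and $C$, working on $\hat{D}$,
$\hat{E}$ and $\hat{F}$ resp. Additionally, $B$ is correct w.r.t. $A$ and $C$ is
correct w.r.t. $B$. Let the concretizations be $\gamma_1 : \hat{E} \to \hat{D}$ and
$\gamma_2 : \hat{F} \to \hat{E}$. Then $C$ is also correct w.r.t. $A$ with the
concretization $\gamma = \gamma_1 \circ \gamma_2$:
$$\forall \hat{d} \in \hat{D}, \hat{e} \in \hat{E} : \hat{d} \sqsubseteq_{\hat{D}} \gamma_1(\hat{e}) \Rightarrow A(\hat{d}) \sqsubseteq_{\to} \gamma_1 \circ B(\hat{e}) \ \wedge$$
$$\forall \hat{e} \in \hat{E}, \hat{f} \in \hat{F} : \hat{e} \sqsubseteq_{\hat{E}} \gamma_2(\hat{f}) \Rightarrow B(\hat{e}) \sqsubseteq_{\to} \gamma_2 \circ C(\hat{f})$$
Because $\gamma_1$ is monotone the conjunction of the two assumptions implies
$\forall \hat{d}, \hat{f} : \hat{d} \sqsubseteq_{\hat{D}} \gamma_1(\gamma_2(\hat{f}))$
and the conclusion yields
$A(\hat{d}) \sqsubseteq_{\to} \gamma_1 \circ \gamma_2 \circ (C(\hat{f}))$, i. e. $C$
is correct w.r.t. $A$ with $\gamma = \gamma_1 \circ \gamma_2$, as the combination of
two monotone functions is again monotone.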
The ability to obtain a correct analysis from the combination of two correct
analyses can be very convenient for the design of the domains, analyses and con-
cretizations. It may be easier to define a domain and analysis “in between” the
collecting semantics and the final domain/analysis and to prove this analysis cor-
rect than to prove the correctness of the final analysis w.r.t. Aco. We will now
introduce such an intermediate analysis that makes it easier to prove the correct-
ness of our data-flow analyses later. We will call this analysis the Coarse Analysis.
It uses the same domain as the collecting semantics but does not compute the set
of states at a node by considering all feasible paths but it gathers the results along
all paths from the start node to a node:
Definition 3.2.13 (Coarse Analysis): For every transfer function $T_e$ we define
the transfer function $T'_e : \mathcal{P}(\Sigma) \to \mathcal{P}(\Sigma)$ by
$T'_e(S) = \bigcup \{T_e(\{\sigma\}) \mid \sigma \in S\}$.
We define the extended path semantics
$[\![\cdot]\!]' : \Pi \to \mathcal{P}(\Sigma) \to \mathcal{P}(\Sigma)$ by
$$[\![e_1 . \ldots . e_n]\!]' = T'_{e_n} \circ \ldots \circ T'_{e_1}$$
The coarse analysis
$A^c : \mathcal{P}(\Sigma) \to (V \to \mathcal{P}(\Sigma))$ is given by
$$A^c(S) = \lambda v \,.\, \bigcup_{\pi \in P(s_G, v)} [\![\pi]\!]'(S) \qquad (3.2.14)$$
Note that $A^c$ is monotone in its first argument because the extended path
semantics is monotone as the composition of the monotone functions $T'_e$.
This analysis is less precise than the collecting semantics, but still correct; we
just choose $\gamma : \mathcal{P}(\Sigma) \to \mathcal{P}(\Sigma)$ as
$\gamma = \lambda S . S$, the identity function:
Lemma 3.2.15 (Correctness of $A^c$): $A^c$ is correct w.r.t. $A^{co}$ with
$\gamma = \lambda S . S$.
Proof:
We just have to show that
$$A^{co}(S)(v) \subseteq A^c(S)(v)$$
holds for an arbitrary $S \in \mathcal{P}(\Sigma), v \in V$.
This can be written as
$$\bigcup_{\sigma \in S,\ \pi = p_G(\sigma)} \{ [\![\pi']\!](\{\sigma\}) \mid \pi' \in \mathit{prefixes}(\pi) \cap P(s_G, v) \} \subseteq \bigcup_{\pi' \in P(s_G, v)} [\![\pi']\!]'(S)$$
Obviously, it is always correct for a path
$\pi' \in \mathit{prefixes}(\pi) \cap P(s_G, v)$ that $\pi' \in P(s_G, v)$ holds.
Thus, it is sufficient to show
$[\![\pi']\!](\{\sigma\}) \subseteq [\![\pi']\!]'(S)$ for
$\sigma \in S, \pi = p_G(\sigma)$ and $\pi' \in P(s_G, v)$.
But this is the case, since a short look at the definitions shows that
$[\![\pi']\!](\{\sigma\}) = [\![\pi']\!]'(\{\sigma\})$. And for $[\![\pi']\!]'$ it
is true that
$S \subseteq S' \Rightarrow [\![\pi']\!]'(S) \subseteq [\![\pi']\!]'(S')$.
□
The coarse analysis is useful, because it is in general not possible to effectively
compute the set of feasible paths to a program point. The coarse analysis simply
ignores the feasibility constraint and determines the state map using all paths to
the program point. All feasible executions are included by this approach, but erro-
neous states are added at each program point, thus this analysis is less precise than
the collecting semantics. Since analyses can normally only compute information
about all paths, not just the feasible ones, the coarse analysis helps to prove these
analyses correct.
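The relation between the two semantics can be checked mechanically on a small acyclic example. The Python sketch below is only an illustration: the four-node CFG, the pair-valued states (x, y) and all function names are assumptions of this sketch. It computes the state sets per node once by following each execution (collecting semantics) and once by applying the lifted transfer functions along every path (coarse analysis), so that the inclusion stated in Lemma 3.2.15 can be asserted for this instance.

```python
def guard_update(guard, update):
    """Transfer function on sets of states; infeasible inputs are dropped."""
    def t(states):
        return frozenset(update(s) for s in states if guard(s))
    return t

# Hypothetical acyclic CFG: a branch on x that sets y, followed by a join at "e".
EDGES = {
    ("s", "a"): guard_update(lambda st: st[0] > 0, lambda st: (st[0], 1)),
    ("s", "b"): guard_update(lambda st: st[0] <= 0, lambda st: (st[0], 0)),
    ("a", "e"): guard_update(lambda st: True, lambda st: st),
    ("b", "e"): guard_update(lambda st: True, lambda st: st),
}
NODES = ["s", "a", "b", "e"]

def lifted(edge, states):
    """T'_e(S): union of T_e({sigma}) over all sigma in S."""
    out = frozenset()
    for sigma in states:
        out |= EDGES[edge](frozenset({sigma}))
    return out

def paths(src, dst):
    """Enumerate all edge sequences from src to dst (the CFG is acyclic)."""
    if src == dst:
        yield []
    for e in EDGES:
        if e[0] == src:
            for rest in paths(e[1], dst):
                yield [e] + rest

def coarse(S):
    result = {}
    for v in NODES:
        acc = frozenset()
        for p in paths("s", v):
            cur = frozenset(S)
            for e in p:
                cur = lifted(e, cur)
            acc |= cur
        result[v] = acc
    return result

def collecting(S):
    result = {v: set() for v in NODES}
    for sigma in S:
        node, cur = "s", frozenset({sigma})
        while True:
            result[node] |= cur
            nxt = [(e, EDGES[e](cur)) for e in EDGES
                   if e[0] == node and EDGES[e](cur)]
            if not nxt:
                break
            ((e, cur),) = nxt  # determinism: exactly one feasible edge
            node = e[1]
    return result
```

With the start states {(5, 9), (-1, 9)}, every node's set from collecting is contained in the corresponding set from coarse.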
Galois Connections
In Definition 3.2.11, the correctness of an analysis
$B_G : \hat{E} \to (V \to \hat{E})$ is defined w.r.t. an analysis
$A_G : \hat{D} \to (V \to \hat{D})$. In essence $B_G$ is correct, if its results
starting from an $\hat{e} \in \hat{E}$ are correct approximations to the results of
$A_G$ starting from a $\hat{d} \in \hat{D}$, and $\hat{e}$ is an approximation of
$\hat{d}$. The question remains, given a $\hat{d}$ for $A_G$, which $\hat{e}$ to
choose as the start of $B_G$. Naturally, it should be as precise as possible
to give the most precise analysis results for $B_G$. Abstract interpretation gives
us a means by which we can determine the most precise starting value for an analysis
by defining a counterpart to $\gamma$, namely $\alpha : \hat{D} \to \hat{E}$, the
abstraction function. $\alpha$ and $\gamma$ convert between the domains of the two
analyses under consideration and must form a Galois Connection to ensure correctness:
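Definition 3.2.16 (Galois connection): Let $\hat{D}$ and $\hat{E}$ be complete
lattices. Let $\alpha : \hat{D} \to \hat{E}$ and $\gamma : \hat{E} \to \hat{D}$ be
monotone maps. $(\hat{D}, \alpha, \gamma, \hat{E})$ is called Galois connection iff
$$\forall \hat{d} \in \hat{D} : \hat{d} \sqsubseteq_{\hat{D}} \gamma(\alpha(\hat{d})) \qquad (3.2.17)$$
and
$$\forall \hat{e} \in \hat{E} : \alpha(\gamma(\hat{e})) \sqsubseteq_{\hat{E}} \hat{e} \qquad (3.2.18)$$
We call $\alpha$ abstraction and $\gamma$ concretization.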
A Galois connection guarantees that abstracting and then going back by
concretization results in an equally or less precise element, cf. Figure 3.2, where
the ordering is represented by the vertical position of the elements.
Figure 3.2: Galois connection between two domains
A trivial example is the Galois Connection connecting the domains of the col-
lecting semantics and the coarse analysis:
Example 3.2.19 (GC Coarse Analysis): The tuple
$(\hat{D}, \lambda \hat{d} . \hat{d}, \lambda \hat{d} . \hat{d}, \hat{D})$ is a
Galois Connection. Thus, the most precise input value for the coarse analysis is
$\alpha(S) = S$, if $S$ is the set of all inputs. □
A less trivial example is the Galois Connection between
$\mathcal{P}(\mathbb{N}_\top)$ and the interval domain
$\mathbb{N}_\top \times \mathbb{N}_\top$:
Example 3.2.20 (Interval Galois Connection): The two complete lattices defined by
$(\mathcal{P}(\mathbb{N}_\top), \subseteq)$ and
$(\mathbb{N}_\top \times \mathbb{N}_\top, \sqsubseteq_I)$ together with the functions
$$\gamma_I(a, b) = \{n \mid a \sqsubseteq_\top n \sqsubseteq_\top b\}$$
$$\alpha_I(N) = (\min N, \max N)$$
form a Galois Connection.
□
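The two conditions (3.2.17) and (3.2.18) can be checked exhaustively for the interval example once the universe is made finite. The Python sketch below does this under stated assumptions: the universe is restricted to the numbers 0 to 4, the abstraction of the empty set is taken to be the least interval, and the names alpha, gamma, leq and subsets are chosen for this sketch only.

```python
import math
from itertools import combinations

U = range(0, 5)            # finite universe, assumed here for exhaustive checking
BOTTOM_I = (math.inf, 0)   # least interval, used as the abstraction of the empty set

def alpha(ns):
    """Abstraction: the smallest interval enclosing the set ns."""
    return (min(ns), max(ns)) if ns else BOTTOM_I

def gamma(iv):
    """Concretization: all numbers of the universe inside the interval."""
    a, b = iv
    return frozenset(n for n in U if a <= n <= b)

def leq(i, j):
    """Interval ordering: (a, b) below (c, d) iff c <= a and b <= d."""
    (a, b), (c, d) = i, j
    return c <= a and b <= d

def subsets(xs):
    xs = list(xs)
    return (frozenset(c) for r in range(len(xs) + 1) for c in combinations(xs, r))

# (3.2.17): abstracting and concretizing never loses elements.
extensive = all(N <= gamma(alpha(N)) for N in subsets(U))
# (3.2.18): concretizing and abstracting never loses precision.
reductive = all(leq(alpha(gamma((a, b))), (a, b)) for a in U for b in U)
```

Both flags evaluate to True for this universe. The set {1, 3} shows where precision is lost: gamma(alpha({1, 3})) also contains 2.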
Apart from Definition (3.2.16) there is an equivalent way of determining if a
pair of functions $\alpha, \gamma$ form a Galois Connection:
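Lemma 3.2.21 (Adjunction): Let $\alpha : \hat{D} \to \hat{E}$ and
$\gamma : \hat{E} \to \hat{D}$ be maps on the complete lattices $\hat{D}$ and
$\hat{E}$. Then the following statements are equivalent:
1. $\forall \hat{d} \in \hat{D}, \hat{e} \in \hat{E} : \alpha(\hat{d}) \sqsubseteq_{\hat{E}} \hat{e} \Leftrightarrow \hat{d} \sqsubseteq_{\hat{D}} \gamma(\hat{e})$
2. $(\hat{D}, \alpha, \gamma, \hat{E})$ is a Galois connection.
Proof: See [NNH99].
□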
This equivalence between $\alpha, \gamma$ forming an adjunction and them being a Galois
Connection will be utilized in a proof later. Also, we will need the fact that α is
distributive in the correctness proof for the data-flow analysis:
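Lemma 3.2.22 (Distributivity): Let $(\hat{D}, \alpha, \gamma, \hat{E})$ be a Galois
connection. Then $\alpha$ is distributive, i. e.
$$\forall Y \subseteq \hat{D} : \alpha(\bigsqcup Y) = \bigsqcup \{\alpha(\hat{d}) \mid \hat{d} \in Y\}$$
Proof:
From Lemma 3.2.21 we have that
$\forall \hat{d}, \hat{e} : \alpha(\hat{d}) \sqsubseteq_{\hat{E}} \hat{e} \Leftrightarrow \hat{d} \sqsubseteq_{\hat{D}} \gamma(\hat{e})$.
$$\begin{array}{rl}
 & \alpha(\bigsqcup Y) \sqsubseteq_{\hat{E}} \hat{e} \\
\Leftrightarrow & \bigsqcup Y \sqsubseteq_{\hat{D}} \gamma(\hat{e}) \\
\Leftrightarrow & \forall \hat{d} \in Y : \hat{d} \sqsubseteq_{\hat{D}} \gamma(\hat{e}) \\
\Leftrightarrow & \forall \hat{d} \in Y : \alpha(\hat{d}) \sqsubseteq_{\hat{E}} \hat{e} \\
\Leftrightarrow & \bigsqcup \{\alpha(\hat{d}) \mid \hat{d} \in Y\} \sqsubseteq_{\hat{E}} \hat{e}
\end{array}$$
Since this equivalence holds for all $\hat{e}$ it is especially true for
$\hat{e} = \alpha(\bigsqcup Y)$ and
$\hat{e} = \bigsqcup \{\alpha(\hat{d}) \mid \hat{d} \in Y\}$, which, together with
the anti-symmetry of $\sqsubseteq_{\hat{E}}$, proves the claim.
□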
Interestingly, γ already uniquely determines the α of the Galois Connection if
γ is a complete meet morphism, i. e. is completely multiplicative:
Lemma 3.2.23 (Galois Connection Facts): If $(\hat{D}, \alpha, \gamma, \hat{E})$ is a
Galois connection then
• $\gamma$ is completely multiplicative, i. e.
$$\bigsqcap_{\hat{D}} \{\gamma(\hat{e}) \mid \hat{e} \in Y\} = \gamma(\bigsqcap_{\hat{E}} Y)$$
If $\gamma$ is completely multiplicative, then
• $\gamma$ uniquely defines $\alpha$ by
$$\alpha(\hat{d}) = \bigsqcap_{\hat{E}} \{\hat{e} \mid \hat{d} \sqsubseteq_{\hat{D}} \gamma(\hat{e})\}$$
Proof:
See [NNH99] lemma 4.22.
□
Also, one can show that a completely additive function $\alpha$ uniquely determines
a $\gamma$ (and thus a Galois connection) as the least upper bound of all values
from $\hat{D}$ whose abstraction is more precise than the argument to $\gamma$. An
easy way to determine if there exists a Galois connection for a given $\gamma$ is to
check if $\gamma$ is completely multiplicative.
Having a Galois Connection for two or more domains, one can construct Ga-
lois Connections for various combinations of the domains, e. g. cross product,
function spaces, etc. We will later need the following Galois connection for the
state maps based on a Galois connection for the domains of two analyses:
Lemma 3.2.24 (Node map): Let $(\hat{D}, \alpha, \gamma, \hat{E})$ be a Galois
connection. Then we have that
$(V \to \hat{D}, \alpha_{\to}, \gamma_{\to}, V \to \hat{E})$ is also a Galois
connection, with
$$\alpha_{\to}(f) := \alpha \circ f$$
$$\gamma_{\to}(g) := \gamma \circ g$$
Proof:
Obviously, $\alpha_{\to}$ and $\gamma_{\to}$ are monotone, since $\alpha$ and
$\gamma$ are monotone. We have for all $v \in V, f \in V \to \hat{D}$:
$$f(v) \sqsubseteq_{\hat{D}} \gamma(\alpha(f(v))) = \gamma(\alpha_{\to}(f)(v)) = \gamma_{\to}(\alpha_{\to}(f))(v)$$
Which implies $f \sqsubseteq_{\to} \gamma_{\to}(\alpha_{\to}(f))$.
For all $v \in V, g \in V \to \hat{E}$ we have:
$$\alpha(\gamma(g(v))) = \alpha(\gamma_{\to}(g)(v)) = \alpha_{\to}(\gamma_{\to}(g))(v) \sqsubseteq_{\hat{E}} g(v)$$
From this we obtain $\alpha_{\to}(\gamma_{\to}(g)) \sqsubseteq_{\to} g$, as desired.
□
Having a Galois connection for two domains not only allows us to determine the
best input value for an analysis on the target domain, but even allows us to define
the best analysis possible with these domains and concretization/abstraction. This
result is obtained by giving a different correctness condition for an analysis with
the following theorem:
Theorem 3.2.25 (Correctness): If $A : \hat{D} \to (V \to \hat{D})$ is an analysis
and furthermore $(\hat{D}, \alpha, \gamma, \hat{E})$ is a Galois connection, then
every analysis $B : \hat{E} \to (V \to \hat{E})$ is correct w.r.t. $A$ if it fulfills
$$\alpha_{\to} \circ A \circ \gamma \sqsubseteq_{\to} B \qquad (3.2.26)$$
where $(V \to \hat{D}, \alpha_{\to}, \gamma_{\to}, V \to \hat{E})$ is as in
Lemma 3.2.24.
Proof:
See [NNH99].
□
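Correctness by this theorem implies correctness as in Definition (3.2.11):
$$\begin{array}{rl}
 & \alpha_{\to} \circ A \circ \gamma \sqsubseteq_{\to} B \\
\Rightarrow & \forall \hat{e} \in \hat{E} : \alpha_{\to}(A(\gamma(\hat{e}))) \sqsubseteq_{\to} B(\hat{e}) \\
\Rightarrow & \forall v \in V : \alpha(A(\gamma(\hat{e}))(v)) \sqsubseteq_{\hat{E}} B(\hat{e})(v) \\
\Rightarrow & A(\gamma(\hat{e}))(v) \sqsubseteq_{\hat{D}} \gamma(B(\hat{e})(v)) \\
\Rightarrow & \forall \hat{d} \sqsubseteq \gamma(\hat{e}) : A(\hat{d})(v) \sqsubseteq_{\hat{D}} \gamma(B(\hat{e})(v)) \\
\Rightarrow & A(\hat{d}) \sqsubseteq_{\hat{D}} \gamma \circ B(\hat{e})
\end{array}$$
by using the monotonicity of $\gamma$ in step three.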
Rather than verifying that a given analysis $B$ satisfies condition (3.2.26), we
can define our analysis $B$ as $B = \alpha_{\to} \circ A \circ \gamma$. This way we
not only obtain a correct analysis, but also the most precise one possible under the
given Galois Connection. Herein lies the great advantage of using Galois connections
for the definition of analyses.
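The construction "abstract, apply, concretize" can be tried out on a single operation. The Python sketch below is a deliberately simplified illustration: instead of a whole analysis mapping nodes to values it induces the abstract version of one concrete operation (integer halving on sets of numbers) over the interval domain, with a finite universe. The names induced, halve, alpha and gamma are assumptions of this sketch.

```python
import math
from itertools import combinations

U = range(0, 8)            # finite universe, assumed to keep the check exhaustive
BOTTOM_I = (math.inf, 0)   # least interval, abstraction of the empty set

def alpha(ns):
    """Abstraction: smallest enclosing interval."""
    return (min(ns), max(ns)) if ns else BOTTOM_I

def gamma(iv):
    """Concretization: numbers of the universe inside the interval."""
    a, b = iv
    return frozenset(n for n in U if a <= n <= b)

def halve(ns):
    """Concrete operation on sets of numbers (stays inside the universe)."""
    return frozenset(n // 2 for n in ns)

def induced(concrete):
    """Best abstract counterpart of `concrete`: alpha after concrete after gamma."""
    return lambda iv: alpha(concrete(gamma(iv)))

halve_abs = induced(halve)

def subsets(xs):
    xs = list(xs)
    return (frozenset(c) for r in range(len(xs) + 1) for c in combinations(xs, r))

# Safety check: the abstract result covers every concrete result.
safe = all(halve(N) <= gamma(halve_abs(alpha(N))) for N in subsets(U))
```

For the interval (1, 4), halve_abs concretizes to {1, 2, 3, 4}, halves to {0, 1, 2} and abstracts again to (0, 2); the safety check holds for all 256 subsets of the universe.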
Unfortunately, it is not always possible to define a Galois connection for a
given γ, e. g. by using Lemma (3.2.23). That means, γ may not always be com-
pletely multiplicative. One example, which is derived from a similar argument in
[CC92a] is the approximation of points in the plane by a rotated square:
Example 3.2.27 (Rotated Square): Let the points of the plane $\mathbb{R} \times \mathbb{R}$ be
approximated by a square of side length $a$ whose center is at a point $(x, y)$ and
which is rotated around this point by an angle $0 \le \varphi < \pi/2$. (We have to add
special symbols $\bot$ and $\top$ to $\mathbb{R}$ to make them complete lattices, similar to
$\mathbb{Z}^{\top}_{\bot}$, but this is not important for this example.) The concretization
$\gamma : \mathbb{R} \times \mathbb{R} \times \mathbb{R} \times \mathbb{R} \to \mathcal{P}(\mathbb{R} \times \mathbb{R})$
describes the set of points enclosed by the rotated square $(a, x, y, \varphi)$. If there
were a Galois connection, then there would have to exist an $\alpha$ that is defined by
(cf. Lemma (3.2.23))
\[ \alpha(P) = \bigsqcap \{ (a, x, y, \varphi) \mid P \subseteq \gamma(a, x, y, \varphi) \} \]
That is, there must be a best approximation to a set $P$ of points. Looking at
$P = \{ (c, d) \mid |c^2 + d^2| \le 1 \}$, i.e. a circle of radius 1 around the origin, we
can see that every square centered at the origin with $a = 2$ and an arbitrary
rotation angle $\varphi$ encloses this circle. However, different values of $\varphi$ lead
to squares enclosing different sets of points. Thus, we cannot define one
square to be the best square approximating the circle. □
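The argument of the example can be checked numerically. The following is only a sketch: the helper `in_rotated_square`, the sampling of the circle and the two test points are illustrative assumptions, not part of the original text. It shows that the squares with $\varphi = 0$ and $\varphi = \pi/4$ both enclose the circle, yet neither square contains the other, so there is no least enclosing square.

```python
import math

def in_rotated_square(p, a, x, y, phi, eps=1e-9):
    """True if point p lies in the square of side a centered at (x, y), rotated by phi."""
    dx, dy = p[0] - x, p[1] - y
    # Rotate the point by -phi into the coordinate frame of the square.
    u = dx * math.cos(phi) + dy * math.sin(phi)
    v = -dx * math.sin(phi) + dy * math.cos(phi)
    return abs(u) <= a / 2 + eps and abs(v) <= a / 2 + eps

# Sampled boundary of the unit circle (the squares are convex, so the boundary suffices).
circle = [(math.cos(math.radians(k)), math.sin(math.radians(k))) for k in range(360)]

square_0 = (2.0, 0.0, 0.0, 0.0)            # (a, x, y, phi) with phi = 0
square_45 = (2.0, 0.0, 0.0, math.pi / 4)   # same square rotated by pi/4

both_enclose = all(
    in_rotated_square(p, *square_0) and in_rotated_square(p, *square_45)
    for p in circle
)
# A point near a corner of the unrotated square, and one near a corner of the rotated one.
only_in_first = in_rotated_square((0.99, 0.99), *square_0) and not in_rotated_square((0.99, 0.99), *square_45)
only_in_second = in_rotated_square((1.3, 0.0), *square_45) and not in_rotated_square((1.3, 0.0), *square_0)
```

Since each square contains a point the other one lacks, the two abstract elements are incomparable although both are safe approximations of the circle.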
Another example, where it is not possible to define a best abstraction is the
cache analysis for the PowerPC 755 cache. This means that we will only be able to
define a Galois connection for the value analysis used in our WCET toolchain but
not for the PowerPC pipeline analysis, as it contains the PowerPC cache analysis.
The next section presents the type of analyses implemented in our toolchain,
the data flow analyses.
3.3 Data-Flow Analysis
Data-flow analysis (DFA) is a special case of a program analysis where abstract
values are propagated along the edges of the CFG and are transformed by abstract
versions of the transfer functions until a fixed point of the node map is reached.
If the abstract transfer functions fulfill certain conditions w.r.t. the coarse se-
mantics (or another analysis against which correctness should be shown) then the
computed fixed point is a correct analysis.
In the following we will start by presenting the Meet Over all Paths (MOP)
formulation of DFA, as an analysis. After we have shown the correctness of this
analysis, we will repair the principal drawback of the MOP analyses: their po-
tential uncomputability. We do this by providing a safe approximation to the
analyses, the so called Minimum Fixed Point (MFP) analyses. These analyses are
computable, if the underlying lattice fulfills the finite ascending chains condition
and if the abstract transfer functions are monotone.
This section finishes with some explanations on increasing the precision of the
analysis results by doing inter-procedural DFA. We will utilize this to increase
the precision of critical loops in the program, thus giving tighter bounds for the
WCET.
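Definition 3.3.1 (MOP analysis): Let $\hat{D}$ be the complete lattice of the DFA. We
call monotone functions $\hat{T}_\bullet : E \to \hat{D} \to \hat{D}$ abstract transfer functions.
Let $\widehat{\llbracket \cdot \rrbracket} : \Pi \to \hat{D} \to \hat{D}$ be defined by
\[ \widehat{\llbracket e_1 . \cdots . e_n \rrbracket} := \hat{T}_{e_n} \circ \cdots \circ \hat{T}_{e_1} \]
The MOP-analysis $\mathit{MOP} : \hat{D} \to (V \to \hat{D})$ is defined as
\[ \mathit{MOP} = \lambda \hat{d} . \lambda v . \bigsqcup \{ \widehat{\llbracket \pi \rrbracket}\, \hat{d} \mid \pi \in P(s_G, v) \} \tag{3.3.2} \]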
When showing that the MOP analysis is correct w.r.t. an analysis B working
on Eˆ, we can give an easy sufficient property that guarantees this if the domains
of the two analyses are connected by a Galois Connection:
Theorem 3.3.3 (Correctness MOP I): Let the assumptions as stated in Definition
3.3.1 be valid and let furthermore $T'_\bullet : E \to \mathcal{P}(\Sigma) \to \mathcal{P}(\Sigma)$ be the
transfer functions from Definition 3.2.13, $\llbracket \cdot \rrbracket'$ the path semantics from the
same definition and $(\mathcal{P}(\Sigma), \alpha, \gamma, \hat{D})$ be a Galois connection. If the
transfer functions $\hat{T}_\bullet$ satisfy
\[ \alpha \circ T'_e \circ \gamma \sqsubseteq_{\to} \hat{T}_e \tag{3.3.4} \]
then the following property holds: $\alpha_{\to} \circ A_c \circ \gamma \sqsubseteq_{\to} \mathit{MOP}$, i.e. MOP is
correct w.r.t. $A_c$.
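Proof: We first show for all $v \in V$, $\hat{d} \in \hat{D}$, $\pi \in P(s_G, v)$
\[ \alpha(\llbracket \pi \rrbracket'\, \gamma(\hat{d})) \sqsubseteq_{\hat{D}} \widehat{\llbracket \pi \rrbracket}\, \hat{d} \tag{\dagger} \]
by induction on $\pi$.
$\pi = \varepsilon$: Because of the Galois connection we have
\[ \alpha(\gamma(\hat{d})) \sqsubseteq_{\hat{D}} \hat{d} \]
With $\llbracket \varepsilon \rrbracket' = \lambda S . S$ and $\widehat{\llbracket \varepsilon \rrbracket} = \lambda \hat{d} . \hat{d}$ the claim follows.
$\pi = \pi' . e$: By the induction hypothesis we have
\[ \alpha(\llbracket \pi' \rrbracket'\, \gamma(\hat{d})) \sqsubseteq_{\hat{D}} \widehat{\llbracket \pi' \rrbracket}\, \hat{d} \]
From this follows
\[
\begin{array}{rl}
\Rightarrow & \hat{T}_e(\alpha(\llbracket \pi' \rrbracket'\, \gamma(\hat{d}))) \sqsubseteq_{\hat{D}} \hat{T}_e(\widehat{\llbracket \pi' \rrbracket}\, \hat{d}) = \widehat{\llbracket \pi \rrbracket}\, \hat{d} \\
\Rightarrow & \alpha(T'_e(\gamma(\alpha(\llbracket \pi' \rrbracket'\, \gamma(\hat{d}))))) \sqsubseteq_{\hat{D}} \widehat{\llbracket \pi \rrbracket}\, \hat{d} \\
\Rightarrow & \alpha(T'_e(\llbracket \pi' \rrbracket'\, \gamma(\hat{d}))) = \alpha(\llbracket \pi \rrbracket'\, \gamma(\hat{d})) \sqsubseteq_{\hat{D}} \widehat{\llbracket \pi \rrbracket}\, \hat{d}
\end{array}
\]
Here the first step is true because of the monotonicity of the $\hat{T}_\bullet$. The
second step is valid because of the assumed connection between $T'_\bullet$ and
$\hat{T}_\bullet$. The third step is a consequence of the Galois connection.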
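We now have that, with $v \in V$, $\hat{d} \in \hat{D}$, $\pi \in P(s_G, v)$ $(=: PP)$:
\[
\begin{array}{rl}
 & \alpha(\llbracket \pi \rrbracket'\, \gamma(\hat{d})) \sqsubseteq_{\hat{D}} \widehat{\llbracket \pi \rrbracket}\, \hat{d} \\
\Rightarrow & \bigsqcup \{ \alpha(\llbracket \pi \rrbracket'\, \gamma(\hat{d})) \mid \pi \in PP \} \sqsubseteq_{\hat{D}} \bigsqcup \{ \widehat{\llbracket \pi \rrbracket}\, \hat{d} \mid \pi \in PP \} \\
\Rightarrow & \alpha\bigl(\bigcup_{\pi \in PP} \llbracket \pi \rrbracket'\, \gamma(\hat{d})\bigr) \sqsubseteq_{\hat{D}} \bigsqcup \{ \widehat{\llbracket \pi \rrbracket}\, \hat{d} \mid \pi \in PP \} = \mathit{MOP}(\hat{d})(v) \\
\Rightarrow & \alpha(A_c(\gamma(\hat{d}))(v)) \sqsubseteq_{\hat{D}} \mathit{MOP}(\hat{d})(v) \\
\Rightarrow & \alpha_{\to} \circ A_c \circ \gamma \sqsubseteq_{\to} \mathit{MOP}
\end{array}
\]
where the second step can be concluded as a consequence of the distributivity
of $\alpha$ (Lemma 3.2.22). □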
As before, we can choose the abstract transfer functions as $\hat{T}_e = \alpha \circ T'_e \circ \gamma$ and
obtain the best MOP analysis possible under this Galois connection.
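As an illustration of how a best abstract transfer function is induced by $\alpha \circ T' \circ \gamma$, the following sketch uses a toy parity domain. Everything in it is an assumption made for the example and not taken from the original text: the finite universe of 16 values, the four abstract elements, and the concrete transfer function for an increment modulo 16.

```python
UNIVERSE = frozenset(range(16))  # assumed finite stand-in for the concrete values

BOT, EVEN, ODD, TOP = "bot", "even", "odd", "top"

def gamma(d):
    """Concretization: the set of concrete values described by d."""
    return {
        BOT: frozenset(),
        EVEN: frozenset(x for x in UNIVERSE if x % 2 == 0),
        ODD: frozenset(x for x in UNIVERSE if x % 2 == 1),
        TOP: UNIVERSE,
    }[d]

def alpha(s):
    """Abstraction: the smallest abstract element whose concretization contains s."""
    for d in (BOT, EVEN, ODD, TOP):
        if s <= gamma(d):
            return d

def concrete_inc(s):
    """Concrete transfer function on sets of values for 'x := x + 1' (mod 16)."""
    return frozenset((x + 1) % 16 for x in s)

def best_abstract(concrete_transfer):
    """Induced abstract transfer function alpha . T' . gamma."""
    return lambda d: alpha(concrete_transfer(gamma(d)))

abstract_inc = best_abstract(concrete_inc)
table = {d: abstract_inc(d) for d in (BOT, EVEN, ODD, TOP)}
```

A transfer function that always returns `TOP` would also satisfy condition (3.3.4), but it would be strictly less precise than the induced one.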
As we have seen earlier, it is not always possible to define a Galois connection
for a given set $\hat{D}$, but it is possible to give a monotone concretization
$\gamma : \hat{D} \to \mathcal{P}(\Sigma)$. This suffices to still assure the correctness of the analysis.
This approach has the disadvantage that the conditions on the transfer func-
tions are a little bit more complicated and that we are not guided to the best pos-
sible version of the analysis.
Theorem 3.3.5 (Correctness MOP II): Let again the assumptions from Definition
3.3.1 be valid and let furthermore $T'_\bullet : E \to \mathcal{P}(\Sigma) \to \mathcal{P}(\Sigma)$ be the
transfer functions from Definition 3.2.13, $\llbracket \cdot \rrbracket'$ the path semantics from the
same definition and $\hat{D}$ a complete lattice with a monotone function
$\gamma : \hat{D} \to \mathcal{P}(\Sigma)$. If the transfer functions $\hat{T}_\bullet$ satisfy
\[ S \subseteq \gamma(\hat{d}) \Rightarrow T'_e(S) \subseteq \gamma(\hat{T}_e(\hat{d})) \tag{3.3.6} \]
then we have
\[ \forall S, \hat{d} : S \subseteq \gamma(\hat{d}) \Rightarrow \forall v : A_c(S)(v) \subseteq \gamma(\mathit{MOP}(\hat{d})(v)) \]
i.e. MOP is a correct analysis on $\hat{D}$ w.r.t. $A_c$.
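Proof:
First, we observe that the following property holds:
\[ \bigcup \{ \gamma(\hat{d}) \mid \hat{d} \in Y \} \subseteq \gamma\bigl(\bigsqcup Y\bigr) \tag{3.3.7} \]
This is because $\gamma$ is monotone and
\[
\begin{array}{rl}
 & \forall \hat{d} \in Y : \hat{d} \sqsubseteq_{\hat{D}} \bigsqcup Y \\
\Rightarrow & \forall \hat{d} \in Y : \gamma(\hat{d}) \subseteq \gamma(\bigsqcup Y) \\
\Rightarrow & \bigcup \{ \gamma(\hat{d}) \mid \hat{d} \in Y \} \subseteq \gamma(\bigsqcup Y)
\end{array}
\]
We now show that
\[ \forall \pi \in P(s_G, v), S, \hat{d} : S \subseteq \gamma(\hat{d}) \Rightarrow \llbracket \pi \rrbracket' S \subseteq \gamma(\widehat{\llbracket \pi \rrbracket}\, \hat{d}) \]
by induction on $\pi$ under the assumption $S \subseteq \gamma(\hat{d})$.
If $\pi = \varepsilon$, the claim is trivially true. For $\pi = \pi' . e$, we assume that our claim
is valid for $\pi'$:
\[
\begin{array}{rl}
 & S \subseteq \gamma(\hat{d}) \\
\Rightarrow & \llbracket \pi' \rrbracket' S \subseteq \gamma(\widehat{\llbracket \pi' \rrbracket}\, \hat{d}) \\
\Rightarrow & T'_e(\llbracket \pi' \rrbracket' S) \subseteq \gamma(\hat{T}_e(\widehat{\llbracket \pi' \rrbracket}\, \hat{d})) \\
\Rightarrow & \llbracket \pi \rrbracket' S \subseteq \gamma(\widehat{\llbracket \pi \rrbracket}\, \hat{d})
\end{array}
\]
by using the assumptions on the transfer functions.
From this statement on the semantics of the paths to a node $v$ we have
\[
\begin{array}{rl}
 & \llbracket \pi \rrbracket' S \subseteq \gamma(\widehat{\llbracket \pi \rrbracket}\, \hat{d}) \\
\Rightarrow & \bigcup \{ \llbracket \pi \rrbracket' S \mid \pi \in P(s_G, v) \} \subseteq \bigcup \{ \gamma(\widehat{\llbracket \pi \rrbracket}\, \hat{d}) \mid \pi \in P(s_G, v) \} \\
\Rightarrow & A_c(S)(v) \subseteq \gamma\bigl(\bigsqcup \{ \widehat{\llbracket \pi \rrbracket}\, \hat{d} \mid \pi \in P(s_G, v) \}\bigr) \\
\Rightarrow & A_c(S)(v) \subseteq \gamma(\mathit{MOP}(\hat{d})(v))
\end{array}
\]
which proves our claim. Here, the second step utilizes (3.3.7). □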
We can define the correctness of MOP on $\hat{D}$ not only w.r.t. the coarse analysis
but also relative to another DFA, $\mathit{MOP}'$, on a different domain $\hat{E}$ and with
different abstract transfer functions $\hat{T}'_e$. Then the correctness condition for a
Galois connection between $\hat{E}$ and $\hat{D}$ is
\[ \alpha \circ \hat{T}'_e \circ \gamma \sqsubseteq_{\hat{D}} \hat{T}_e \]
And in the general case the abstract transfer functions must satisfy
\[ \hat{e} \sqsubseteq_{\hat{E}} \gamma(\hat{d}) \Rightarrow \hat{T}'_e(\hat{e}) \sqsubseteq_{\hat{E}} \gamma(\hat{T}_e(\hat{d})) \]
As said earlier, the MOP analyses are in general uncomputable, as the least
upper bound over the results of infinitely many paths is in general
uncomputable. Fortunately, one can formulate an approximation to the MOP
analyses which is computable if the complete lattice $\hat{D}$ fulfills the ascending
chain condition, i.e. every chain $\hat{d}_1 \sqsubset \hat{d}_2 \sqsubset \cdots$ is of finite length.
This second analysis, the Minimum Fixed Point analysis, works by computing
a fixed point by iterating propagations of elements from $\hat{D}$ through the CFG,
transformed by the abstract transfer functions:
Definition 3.3.8 (MFP-analysis): Let $\hat{T}_\bullet : E \to \hat{D} \to \hat{D}$ be the monotone abstract
transfer functions and $\hat{D}$ the lattice from Definition 3.3.1.
The MFP-analysis is defined as the smallest fixed point $\mathit{MFP} : \hat{D} \to V \to \hat{D}$
of the equation system
\[
\mathit{MFP}(\hat{d})(v) =
\begin{cases}
\hat{d} & \text{, } v = s_G \\
\bigsqcup \{ \hat{T}_{v' \to v}(\mathit{MFP}(\hat{d})(v')) \mid v' \to v \in E \} & \text{, otherwise}
\end{cases}
\]
The following theorem is crucial:
Theorem 3.3.9 (Correctness MFP): There always exists a computable smallest
fixed point of the MFP analyses and for all $\hat{d} \in \hat{D}$, $v \in V$ it holds that:
\[ \mathit{MOP} \sqsubseteq_{\to} \mathit{MFP} \]
The MFP and MOP analyses are identical if the abstract transfer functions
$\hat{T}_\bullet$ are distributive.
Proof: See [NNH99]. □
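The gap between MOP and MFP for non-distributive transfer functions can be illustrated with constant propagation on a small diamond-shaped CFG. This is only a sketch under assumptions: the CFG, the environment representation and all function names are hypothetical and chosen for the example. MOP applies the transfer functions along each path and joins at the end; MFP joins at every node.

```python
TOP = "T"  # abstract value "not a constant"

def join_env(e1, e2):
    """Least upper bound of two environments; None plays the role of bottom."""
    if e1 is None:
        return e2
    if e2 is None:
        return e1
    return {k: e1[k] if e1[k] == e2[k] else TOP for k in e1}

def assign(**consts):
    """Transfer function that assigns constants to variables."""
    return lambda env: None if env is None else {**env, **consts}

def add_xy(env):
    """Transfer function for z := x + y (monotone, but not distributive)."""
    if env is None:
        return None
    x, y = env["x"], env["y"]
    return {**env, "z": TOP if TOP in (x, y) else x + y}

def identity(env):
    return env

# Hypothetical acyclic CFG; transfer functions are attached to edges.
EDGES = {
    ("s", "a"): assign(x=1, y=2),
    ("s", "b"): assign(x=2, y=1),
    ("a", "j"): identity,
    ("b", "j"): identity,
    ("j", "e"): add_xy,
}
NODES = ["s", "a", "b", "j", "e"]

def paths(src, dst):
    """All edge sequences from src to dst (terminates because the CFG is acyclic)."""
    if src == dst:
        yield []
    for (u, v) in EDGES:
        if u == src:
            for rest in paths(v, dst):
                yield [(u, v)] + rest

def mop(start, node):
    """Join over all paths from s to node."""
    result = None
    for path in paths("s", node):
        env = start
        for edge in path:
            env = EDGES[edge](env)
        result = join_env(result, env)
    return result

def mfp(start):
    """Round-robin iteration of the MFP equation system until it is stable."""
    val = {v: None for v in NODES}
    val["s"] = start
    changed = True
    while changed:
        changed = False
        for v in NODES:
            if v == "s":
                continue
            new = None
            for (u, w), f in EDGES.items():
                if w == v:
                    new = join_env(new, f(val[u]))
            if new != val[v]:
                val[v], changed = new, True
    return val

start = {"x": TOP, "y": TOP, "z": TOP}
mop_e = mop(start, "e")
mfp_e = mfp(start)["e"]
```

On both paths `z` evaluates to 3, so MOP keeps the constant, whereas MFP has already lost `x` and `y` at the join node `j` and can only report `TOP` for `z`.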
Although this theorem makes only an existential claim about the least fixed
point of MFP, an algorithm to actually compute it can be derived from the
definition of the MFP-analyses. Figure 3.3 shows the work-set algorithm. The
algorithm begins with the starting value $\hat{d}$ at the start node and propagates the
transformed values along the edges to other nodes until a stable assignment ($M$)
of nodes to values is reached. A central component of the algorithm is the work-set
$W$, which contains all nodes that still have to be processed. In the beginning
every node is assigned the value $\bot$; only the start node is assigned the starting
value. The map $E$ assigns to every edge the value of its source node, transformed
by the transfer function of that edge. Initially, every edge is assigned the value
$\bot$, except that the edges going out from $s_G$ are assigned the value $\hat{T}_e(\hat{d})$.
Input: CFG $G = (V, E)$, starting value $\hat{d} \in \hat{D}$, transfer functions $\hat{T}_\bullet : E \to \hat{D} \to \hat{D}$.
Output: a map $M : V \to \hat{D}$ of nodes to values.
(* Init *)
$W := V$; (* Working set *)
$\forall v \in V : M(v) := \bot$;
$M(s_G) := \hat{d}$;
$\forall e \in E : E(e) := \bot$;
$\forall e = s_G \to v \in E : E(e) := \hat{T}_e(\hat{d})$
(* Iteration *)
while $W \neq \emptyset$ do
  $(v, W) := \mathit{extract}(W)$; (* extract a node *)
  $\mathit{new} := \bigsqcup \{ E(v' \to v) \mid v' \to v \in E \}$;
  if ($M(v) \neq \mathit{new}$) then (* changed *)
    $\forall v \to v'' \in E$ :
      $E(v \to v'') := \hat{T}_{v \to v''}(\mathit{new})$;
      $W := W \cup \{ v'' \}$;
    $M(v) := \mathit{new}$;
  fi
od
Figure 3.3: Work-set algorithm
During the iteration a node from the work set $W$ is selected; the value at this
node is computed as the least upper bound of all the values at the incoming edges
for this node. If the value has changed since the last iteration then the value at
this node is set to the new value $\mathit{new}$ and every outgoing edge $v \to v''$ is assigned
the value $\hat{T}_{v \to v''}(\mathit{new})$. The target nodes of the outgoing edges of this node are
inserted into the work set.
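The work-set iteration described above can be sketched in a few lines. The sketch below is not the PAG implementation; the function `work_set`, the toy CFG and the lattice (sets of possible values of a loop counter, joined by union) are assumptions for illustration. As one deviation from Figure 3.3, the value of the start node is pinned to the starting value, as the MFP equation system prescribes, so that processing the start node cannot overwrite it.

```python
def work_set(nodes, edges, start_node, start_value, bottom, join):
    """Work-set iteration; edges maps (source, target) to a monotone transfer function."""
    M = {v: bottom for v in nodes}            # node map
    M[start_node] = start_value
    E = {e: bottom for e in edges}            # edge map
    for (u, v), f in edges.items():
        if u == start_node:
            E[(u, v)] = f(start_value)
    W = set(nodes)                            # working set
    while W:
        v = W.pop()                           # extract some node
        if v == start_node:
            new = start_value                 # MFP(d)(s_G) = d
        else:
            new = bottom
            for (u, w) in edges:
                if w == v:
                    new = join(new, E[(u, w)])
        if M[v] != new:                       # changed: propagate along outgoing edges
            for (u, w), f in edges.items():
                if u == v:
                    E[(u, w)] = f(new)
                    W.add(w)
            M[v] = new
    return M

# Toy loop "i := 0; while i < 3: i := i + 1" with sets of possible values of i.
ALL = frozenset({0, 1, 2, 3})
edges = {
    ("s", "h"): lambda S: frozenset({0}) if S else frozenset(),  # i := 0
    ("h", "b"): lambda S: frozenset(i for i in S if i < 3),      # loop condition holds
    ("b", "h"): lambda S: frozenset(i + 1 for i in S),           # i := i + 1
    ("h", "x"): lambda S: frozenset(i for i in S if i >= 3),     # loop exit
}
result = work_set(["s", "h", "b", "x"], edges, "s", ALL, frozenset(), frozenset.union)
```

Because the transfer functions are monotone and the lattice is finite, the result does not depend on the order in which nodes are extracted from the work set.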
The DFA we have presented here is usually referred to as forward analysis in
the literature. Since we will only implement analyses that derive an approximation
to a node map from an approximation to the starting values of the program, this
suffices. One can also look at analyses that derive an approximation to the node
map from an approximation to all end values of the program, i. e. the values at eG.
This can be done along the same ways as presented in the last two sections and
leads to so-called backward analyses.
The work set algorithm shown above can be varied in many ways by using
special heuristics to determine the node selected in every iteration, cf. [NNH99,
Mar98, TMLA97].
3.3.1 Interprocedural Analyses
So far we have only looked at CFGs consisting of a single procedure (or func-
tion, subroutine). In reality every program contains several subroutines which are
called from different contexts and with different arguments.
main()
{
int j;
...
p(j);
...
q(1);
}
p(int k)
{
int i;
...
q(i+k);
}
q(int k)
{
...
}
Figure 3.4: Procedure calls
By context we mean the calling history of a procedure call. In Figure 3.4
procedure q can be called from main (with parameter 1), or it can be called
from p (which in turn was called from main), with parameter i+k. If we do not
consider contexts in the DFA, then every call of a subroutine is handled like the
combination of the different calls of this routine in the program, i. e. yields the
least upper bound of the analysis results for each call. Especially when analyzing
machine code, this can lead to a big loss in precision. E.g. a value analysis may
be able to obtain precise contents of the stack pointer for each call of a subroutine,
but the combination of these calls will give only a (large) interval of possible stack
pointer values. Since local parameters are usually accessed via the stack pointer,
this will also lead to a high loss of precision for most data accesses in the
subroutine. This imprecision will spill over to the pipeline analysis, since the data
cache contents will not be known precisely enough. Thus, we need different
analysis results for different contexts.
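The effect on the stack pointer can be made concrete with a small interval computation. The numbers below (stack pointer values on entry to q, an offset of 8 for a local variable, a cache line size of 32 bytes) are invented for illustration; only the two calling contexts of q are taken from Figure 3.4.

```python
from functools import reduce

LINE = 32  # assumed cache line size in bytes

def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def add(iv, k):
    """Interval of addresses at a constant offset k."""
    return (iv[0] + k, iv[1] + k)

def cache_lines(iv):
    """Indices of all cache lines an address in the interval may fall into."""
    return set(range(iv[0] // LINE, iv[1] // LINE + 1))

# Hypothetical stack pointer values on entry to q, one per calling context.
sp_on_entry = {
    ("main",): (0x0FF0, 0x0FF0),      # q called directly from main
    ("main", "p"): (0x0FD0, 0x0FD0),  # q called from p, which was called from main
}

# Without contexts: one value for q, the join over all calls.
merged_sp = reduce(join, sp_on_entry.values())
merged_lines = cache_lines(add(merged_sp, 8))

# With contexts: one value per calling history.
per_context_lines = {ctx: cache_lines(add(sp, 8)) for ctx, sp in sp_on_entry.items()}
```

With the merged interval the access to the local variable may touch either of two cache lines, so a cache analysis cannot classify it; in each separate context the accessed line is known exactly.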
These different contexts can be taken into account in the analysis by several
means, e.g.
• By inlining the bodies of the procedures. This only works for non-recursive
procedures and furthermore leads to an exponential growth of the CFG with
increasing call nesting depth.
• The data-flow value can be set to $\top$ at every procedure call. Although this
is a safe approximation, one loses information and thus precision this way.
The GNU C-Compiler gcc uses this technique in its analyses.
• One can compute effects, that is a mapping of incoming data-flow values to
outgoing ones for every procedure.
• By using the call string approach, where the calling history is coded into
the lattice of the data-flow values as a so-called “call string”, cf. [SP81].
• The static call graph method examines the static calling sequences of the
program. For every such sequence (up to a certain length) the analysis is
performed separately. Sequences that are longer than a fixed threshold are
joined together and thus are less precise. This technique is implemented in
the Program Analyzer Generator PAG, [Mar99b, TMLA97].
In the following we will only look at the last of the above techniques. We
assume that procedure calls occur in the CFG as in Figure 3.5: every call consists
of a CALL node that has an edge to the unique entry node of the procedure, its
ENTRY node. A second node, the RETURN node at the calling site, also has
an edge from the CALL node. This edge, which we call the local edge, serves
the purpose of modeling the effects on local elements (variables, etc) with the
associated transfer function. The unique end node of the procedure, its EXIT
node also emits an edge to the RETURN node.
From a semantics point of view one must ask what relation the local edge has
to the program's semantics: a control flow from the call of a procedure directly
to the RETURN node is obviously impossible. One has to define the meaning of
a path that contains such an edge, i.e. what the meaning of $\widehat{\llbracket \cdot \rrbracket}$
should be in this case.
A detailed theoretical foundation for this approach can be found in [Mar99b].
We will only present an intuitive argument: as soon as there are procedures,
the problem of incarnations of procedures arises. That is, there can be several
incarnations of a procedure. Each of these incarnations can have local variables
which are not visible in other procedures. In a real implementation these
incarnations are realized as stack frames (and a corresponding semantics can be
given to them). The number of simultaneously active incarnations of a procedure
is in general not bounded, so there can be no finite abstraction for this nesting if
one does not ignore local variables of procedures. So one can either leave the
effects of a procedure on the local variables of the calling procedure unmodeled².
In this case one has to set all information about local variables in the calling
procedure at a call site to $\top$ to ensure correctness. Or one can keep the values of
the local variables that are guaranteed to not be affected by the procedure call.
The local edge serves exactly this purpose. It models the known effects on the
local variables. By taking the join through the $\sqcup$ operator at the RETURN
node this information is then correctly mixed with the information from the EXIT
node of the called procedure.
[Figure: interprocedural CFG for the program of Figure 3.4. The procedures "main",
"p" and "q" each have an ENTRY and an EXIT node; the calls p(j) and q(1) in main
and q(i+k) in p each consist of a CALL and a RETURN node connected by a local edge.]
Figure 3.5: Interprocedural CFG
² A procedure can also affect the variables not visible in the procedure by using pointers.
The technique of static call graphs can be implemented by using mappings:
at every node in the CFG there is not one but several data-flow values, where each
value corresponds to a different context at the same node. Now the edges con-
nect the data-flow values at the nodes that have been connected by edges before.
Through the type of mapping of the elements the static call graph technique can
be realized. For the details cf. [TMLA97, Mar99a, Mar99b].
By this technique, the same work set algorithms can be used for interprocedu-
ral data-flow analyses as for intraprocedural analyses.
An even bigger problem for the precision of analysis information as computed
by the value and pipeline analyses is the behavior of a program in its loops. Pro-
grams spend most of their execution time in some of the loops present in them.
Thus, obtaining precise analysis results for loops will increase the precision of the
analysis for the whole program. Loops are naturally a place where reuse of cache
contents is most likely. Especially instruction cache but also data cache locality is
very likely to occur in loops. Here, the first iteration of a loop usually loads the
caches with the interesting memory blocks, e. g. the code of the loop itself. Further
iterations will access this information in the cache and thus execute considerably
faster. To capture this effect in an analysis of the execution, one must be able to
distinguish loop iterations in the analysis. There are also cases where one wants
to distinguish not only the first loop iteration from the remaining ones but also the
first N iterations from the remaining iterations. The VIVU approach implemented
in PAG allows this. In this approach, a loop is virtually inlined and then
virtually unrolled by factoring out loops as (virtual) procedures and reducing the
problem to a mapping for procedure calls.
In Figure 3.6 an example loop together with the virtual transformation of its
CFG is shown. Naturally, this transformation is only a conceptual one; the code of
the program being analyzed is not altered. The transformation makes it possible to
gain information for separate iterations of the loop, thus increasing the precision
of the DFA. On the left side of the figure is the original loop, on the right side the
transformed CFG. A virtual procedure loop_0000 with the code of the loop has
been added. A call to this procedure is represented by the box labeled
loop_0000_first in the original function test_routine. The iteration in the loop
is modeled by a call to the loop, represented by the box labeled loop_0000_rec in
the loop procedure.
Differentiating contexts in the DFA means that instead of one data flow el-
ement from Dˆ at every node of the CFG we have an array of elements, one for
each context the node appears in. These elements are then connected following
the CFG, but at CALL nodes the connection reflects the context change caused by
the call according to the chosen mapping. This defines a supergraph upon which
the propagation of data flow values is made, cf. [Mar99b, The02] for the details.
Figure 3.6: Program with loop and transformation
In Figure 3.7 the loop from Figure 3.6 is shown with a mapping that differentiates
the first two loop iterations from the remaining ones. Thus, in the loop there are
three data flow elements at every node. The elements with index 0 are those for
the analysis of the first iteration of the loop, index 1 is for the second iteration and
index 2 for all remaining iterations. The dotted lines in Figure 3.7 are the local
edges between the CALL and RETURN nodes. The transfer functions at those
edges and at the edges connecting the entry node elements with the first
instruction node are the identity function $\lambda \hat{d} . \hat{d}$. The same is true for the edges
from the RETURN to and from the EXIT nodes and the edges from the CALL nodes to
the ENTRY nodes.
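Why separate contexts for loop iterations help can be seen with a deliberately tiny model. The sketch below is an assumption-laden illustration, not the VIVU implementation: the abstract value at the loop head is a single must-information bit ("the loop code is definitely in the cache"), joined by logical and, and executing the loop body is assumed to always load the code into the cache. The last context stands for all remaining iterations and therefore has a back edge to itself.

```python
def join(a, b):
    """Join for must-information; None plays the role of bottom (not yet reached)."""
    if a is None:
        return b
    if b is None:
        return a
    return a and b

def classify_loop(num_contexts, cached_on_entry=False):
    """Classify the instruction fetch of the loop body for every iteration context."""
    head = [None] * num_contexts  # abstract value at the loop head, one per context
    changed = True
    while changed:
        changed = False
        for ctx in range(num_contexts):
            incoming = None
            if ctx == 0:
                # Edge from the code preceding the loop.
                incoming = join(incoming, cached_on_entry)
            if ctx > 0 and head[ctx - 1] is not None:
                # Edge from the previous context, whose body has loaded the code.
                incoming = join(incoming, True)
            if ctx == num_contexts - 1 and head[ctx] is not None:
                # Back edge of the last context, which covers all remaining iterations.
                incoming = join(incoming, True)
            if incoming != head[ctx]:
                head[ctx], changed = incoming, True
    return ["hit" if h else "miss" for h in head]

one_context = classify_loop(1)
three_contexts = classify_loop(3)
```

With a single context the uncached state from the loop entry is joined with the cached state from the back edge, so every iteration has to be treated as a miss; with separate contexts only the first iteration is.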
If this loop can be reached from different places in the call graph, i.e. the
subroutine containing this loop is called from more than one place, then
additional contexts are added distinguishing between the various calls. As loops
are treated in the same way as procedures, additional contexts also result from
nested loops.
The precision that can be gained in the DFA by using separate contexts for
separate loop iterations can be tuned against the time required for the analysis. In
the extreme case, loops are unrolled by this technique up to their maximal iteration
count, which must be known for WCET analysis anyhow. This gives the best
precision but also the longest analysis time in general³. The number of contexts
(i.e. separate data flow elements) increases rapidly for this complete unrolling of
loops, especially in the presence of nested loops. How many loop iterations are
unrolled can be tuned at run-time in the analysis. Some practical results about
the precision gained and additional analysis time needed for different amounts of
distinguished loop iterations are given in Section 6.5 for WCET analysis and in
Section 8.1 for the effects on predictability of modern processors.
[Figure: supergraph of the loop in test_routine with three data flow elements
(contexts 0, 1 and 2) at every node of the loop. Nodes: Entry test_routine, Call,
Return, Exit and the instruction nodes
cmpi cr0,0,r8,+10
bc 4, cr0.lt, 0x10200a8.f
add r9,r9,r8
addi r8,r8,+1
b 0x1020094]
Figure 3.7: Mapping for two differentiated loop iterations
³ Although there are cases where the more precise analysis is faster, because it has
to reiterate the same cycle in the CFG less often as the fixed point is reached earlier.
Chapter 4
Pipeline Modeling
Make for thyself a definition or descrip-
tion of the thing which is presented to
thee, so as to see distinctly what kind of
a thing it is in its substance, in its nu-
dity, in its complete entirety, and tell thy-
self its proper name, and the names of the
things of which it has been compounded,
and into which it will be resolved.
Marcus Aurelius, Meditations, III/11
In order to judge the correctness of an analysis of the timing behavior we need a
definition of the concrete behavior of the processor. In Chapter 5 we will develop
correct abstractions and analyses for the concrete semantics using the techniques
presented in Chapter 3.
We are interested in the timing behavior of a program being executed by the
processor, i. e. we have to describe not only the results of the program execution
(as far as they are relevant), but mainly how long the execution takes. There are
several possibilities to choose from for the concrete semantics of the program
execution:
• Simulation of low-level hardware models. Most chips nowadays are synthesized
by translating low-level hardware models into gate layouts. The low-level
models, written in a restricted set of VHDL ([VHD00, Ash02]) or Verilog
([TM91]), can be simulated too, and can thereby capture the timed execution
of the program by simulating the processor.
• RTL level hardware models are VHDL or Verilog models at the register
transfer level, which do not contain information necessary for synthesis.
Nevertheless, they are still regarded as authoritative for the behavior of the
processor on a cycle-exact level. These models are much faster to simulate and
less complex than the low-level models used for synthesis.
• An effects semantics, defined in the classical way for instructions. This
semantics defines the effects of program execution on a state by giving
semantics to single instructions and defining how the semantics have to be
combined to give the semantics for the whole program, e.g. by rules for
sequencing instructions.
• Designing a higher level model that represents the effects of program
execution on the state by hand, i.e. not based on an authoritative hardware
model. Here “higher” means that the model is not constructed from the low
level view of pins and signals in the processor but rather by identifying and
modeling logical components and their interaction.
Each of these alternatives has its advantages and drawbacks:
• Low-level hardware models are too cluttered with unnecessary information,
e.g. signal values that are not important for the timing of the program
execution: from the 360 pins of the PPC 755, only 44 may be of interest for
the timing of accesses to external memory. The situation would probably
be much worse for internal signals. Also, it is difficult to obtain efficient
abstractions from low-level models, because much of the structure has
vanished at the level necessary for synthesis. Using simulation of these models
is impractical due to the extremely slow simulation speed of full low-level
models.
• RTL level models are at a higher level and thus are more efficient to
simulate. However, they still contain much information that is never needed
for timed program execution. From an RTL model one could slice out only
those portions that influence program execution. This model would be much
smaller and easier to abstract. However, RTL models are rarely available to
the designers of a timing analyzer as they are considered confidential
information by the processor makers. In our work, no RTL models have been
available and thus we could not follow this approach.
• The classical way to define semantics by combining smaller instruction
semantics is difficult for modern processors, since here the instructions are
executed in parallel, thus composition constructs and semantics for concurrent
or parallel languages must be used. In addition, this parallelism is dynamic
and the decision what instructions to execute in parallel is performed by the
dispatching unit at run-time based on dynamic resource occupations. Thus,
the combination of the instruction semantics is not trivial at all. And it is not
clear how to easily obtain such a semantics and how to verify it. Furthermore,
the analysis of concurrent/parallel programs is still not very mature, and
easy-to-use and powerful methods are not available¹.
We choose to define a semantics along the lines of an RTL style hardware
model, but with higher abstraction. This model has to be constructed from the
available documentation about processor and system, augmented by experiments
run on a real system. Its design can follow the overall design of the processor
making it easier to obtain the model. Nonetheless, the validation of such a model
is a non-trivial task and can be time consuming.
Since we are only interested in analyzing current processor designs, we restrict
ourselves to synchronous designs, i. e. processors whose internal signals are syn-
chronized against the rising and/or falling edge of a system clock. Although there
are nowadays ideas to increase the speed of processors by using asynchronous
logic, it has not been shown how the increased design complexity of this approach
should be handled and how the processors could be verified.
Thus, we can handle time as discrete. The minimal time unit we want to use
for measuring program execution is a full processor clock cycle, since instruction
execution is normally synchronized against the rising edge of the processor clock.
The minimum time interval that can have an observable effect is one half of a
processor cycle².
Since the internal state of the processor and the contents of the main memory
together are finite in size, it is natural to use the cycle-wise evolution of a finite
state automaton as the concrete semantics of program execution. In this setting,
program execution begins with a state that is set up such that the processor will
start to execute the first instruction of the program (the program counter is set to
the corresponding address). Then, the automaton will perform transitions, every
transition taking one processor cycle, until the last instruction of the program
has been completed. During each transition, the transition relation transfers the
current state into a new state according to the (partial) execution of the current
instruction. The total number of transitions made by the automaton gives the
execution time of the program in processor cycles.
In the following section we will define the semantics of a program execution
using the notations from Section 3.2 on the machine program, represented as a
CFG G. We do this because the analysis we are performing is a data-flow analysis
on the program (or, more precisely, its CFG). Section 4.2 describes how a
processor model can be obtained in a way that reduces design complexity by a
series of abstractions.
¹ Static analysis of concurrent programs in practice seems to be limited to either
bit-vector problems, which are not useful in the context of WCET, or to the analysis
of an interleaving semantics, which is not practical for the size of our programs.
² It is one half of a cycle because internally events can be synchronized against the
rising and the falling edge of the clock.
4.1 Finite State Automata
For a given processor its execution can be described by giving a set $S$ of (machine)
states that the processor may be in. Such a state not only represents the inner
components of the processor but also all peripheral hardware states of the system,
e.g. the contents of main memory or the inner state of the main chipset connecting
the processor to the rest of the system. Apart from the set $S$ there is a transition
function $\delta$ between states. To denote the fact that the program execution has
halted, we introduce a special state $\bot$, so that $\delta : S \to S \cup \{\bot\}$. A transition from
a state $s \in S$ to a state $\delta(s)$ takes exactly one processor cycle. To summarize, we
define
Definition 4.1.1 (Finite Automaton): A finite automaton $A$ is a pair $(S, \delta)$
where $S$ is a finite set of states and $\delta : S \to S \cup \{\bot\}$ is a transition function
between states. When the automaton is in state $\bot$, we say that it has halted.
Naturally, by inspecting a state s W S , we must be able to say which of the
possibly many instructions being executed in parallel is the oldest one (regarding
program order) and if an instruction has been retired. How this is done depends
on the processor and the model, but the information is present in some way in
the components (see below) of a state. We will introduce generic state predicates,
which must be defined for each automaton, that given a state return the necessary
information.
In order to argue about properties of (short sequences of) instructions we will
describe a (machine) program G by its CFG, as in Section 3.2. Each node of the
CFG represents one machine instruction and is labeled with the address of that
instruction, denoted by $\mathit{addr}(n)$, and the instruction itself, denoted by $\mathit{prog}(n)$.
In the case of a conditional branch instruction, the edges going out of that node
are labeled with T and F for the edge corresponding to the taken branch and the
fall-through edge, resp. Given a program G, the execution of the finite automaton
starts with a state $s_{s_G}(d)$, which is set up such that the processor will start to fetch
the instruction at $\mathit{addr}(s_G)$. Here $d$ denotes the input data for this particular run,
which come from a set $D$.
It should be noted that the program we are going to execute is present in the
state, as it is stored in memory and the memory contents are part of the state.
Thus, by defining $S$ to be the set of all possible states, we have defined the finite
automaton for all possible programs, so that the automaton is only dependent on
the processor and the model itself that gives the structure of the states and the
transition function.
The execution of the program by the processor can be viewed in two ways:
• By giving the trace of the states that occur during execution of the finite
automaton $A$.
• By giving the path $\pi$ through the program's CFG (as in Definition 3.1.4)
together with transfer functions for edges in the CFG (cf. Definition 3.1.3).
The first view is more natural in that it represents the real execution of a processor
more closely, while the other one makes it easier to argue about properties during
execution of a given instruction and the correctness of an analysis. In any case, one
view is sufficient to define the other. We will start by defining the state trace
first, then the execution-path semantics next, which will be used later to prove the
correctness of the resulting analyses.
As noted in Section 3.1, we can define a system-centric semantics by just looking
at the evolution of a system as a whole. For an automaton we can define the
cycle-wise evolution by a sequence of states, called the state trace of processor
execution starting from a state $s$:
Definition 4.1.2 (State Trace): Given a finite automaton $A = (S, Ù)$, the state
trace starting from state $s \in S$ is defined by
$$T_A(s) = \begin{cases} \varepsilon & \text{if } Ù(s) = \bot \\ s \cdot T_A(Ù(s)) & \text{otherwise} \end{cases}$$
Since we require that every program execution terminates, the transition
function $Ù$ must not produce infinite chains of states, i.e. every trace has finite
length. The length of a trace is the number of cycles the execution took and thus
gives the execution time. The execution of the whole program $G$ with input $d$ is
represented by $T_A(s_{s_G}(d))$.
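The recursive definition of the state trace can be read as a simple loop. The following Python sketch is illustrative only: the names `state_trace` and `countdown`, and the use of `None` for the halting result of the transition function, are assumptions made for this example and not part of the model.

```python
from typing import Callable, List, Optional, TypeVar

S = TypeVar("S")

def state_trace(step: Callable[[S], Optional[S]], s: S) -> List[S]:
    """Iterative form of T_A(s); `step` plays the role of the transition function."""
    trace: List[S] = []
    while True:
        nxt = step(s)
        if nxt is None:          # the transition yields "halt": the trace ends here
            return trace
        trace.append(s)          # s contributes one cycle to the trace
        s = nxt

def countdown(n: int) -> Optional[int]:
    """Hypothetical toy automaton: counts down and halts at 0."""
    return None if n == 0 else n - 1

# The trace length is the execution time in cycles.
cycles = len(state_trace(countdown, 3))
```

The loop terminates only because the toy automaton always reaches a halting state, which mirrors the termination requirement stated above.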
In Section 3.1 we already argued that a modular analysis needs to
move away from the system-centric view to the instruction-centric view, which allows
us to associate partial state traces with instructions, thus defining the execution of an
instruction. As noted earlier, it is processor dependent when an instruction has
finished execution. Some parts of the processor state will reflect this information,
either a specific part alone or several parts in conjunction. To generalize these
specific dependencies, we introduce predicates that describe the information of a
state w.r.t. the execution of one instruction. These predicates have to be defined for
every processor and its accompanying automaton. In Sections 4.3 and 4.4 we will
give examples of the definition of these predicates using the processor modeling
introduced later.
So, to argue about the execution of single instructions (associated with nodes
$n$ in the CFG), we introduce the following predicates and functions:
Definition 4.1.3 (State Predicates): The predicate $F(n, s)$ is true if the
instruction at node $n$ is finished in state $s$. The predicate $N(n, s)$ is true if
the instruction at node $n$ is the next instruction to be finished in state $s$ after
the current one. The predicate $H(s)$ is true if program execution has halted
in state $s$. The function $R(n, s)$ returns, for a state $s$ in which the instruction
at $n$ is finished, a state $s'$ in which the acknowledgment of $n$ being finished has
been removed. Thus, the second oldest instruction being executed in $s$ is the
oldest being executed in $s'$.
4.1.1 The Meaning of State Predicates
These state predicates allow us to conceptually reduce the overlapped, parallel
execution of multiple instructions in the pipeline to a sequential execution. This is
similar to approaches taken in the area of the verification of processor designs,
e.g. [BD94, HYHD95, JSD98, SJD98, DP97, Bur96]. There, the problem is to
verify that an implementation of a processor design is correct w.r.t. a sequential
specification (the ISA, see footnote 3). The ISA defines the (observable) effects of
instructions on the programmer-visible system state (registers and memory). In it,
instruction execution is conceptually sequential, as in our instruction-centric view.
The approaches taken in verifying an actual implementation also require establishing a
connection between the processor states in the actual execution and the effects of
instructions on the state as defined in the ISA. This is done either by artificially
flushing an implementation pipeline state and then comparing the ISA-observable
parts of it against the result of an instruction's execution in the ISA, or by defining
a refinement relation between pipeline states of different implementations (where
one implementation is the sequential ISA) with the help of an abstraction and
observable components of pipeline states.
Our state predicates differ from these approaches in that they do not talk about
instruction effects on some programmer-visible part of the system but rather
determine when an instruction can have no further influence because it has left the
pipeline, and which instruction should be considered the next one to execute in a
sequential view. E.g. in processor verification an instruction $i$ may still be in the
pipeline in some retirement buffer in a state $s$, although all of its effects have been
committed. For verification, this state is not distinguishable (and need not be) from a
state $s'$ where the instruction has left the pipeline completely, as they are equal
after projection to the ISA-observable parts. For our state predicates, $F(i, s)$ would
be false, while $F(i, s')$ would be true. Thus, the states are not considered equal
with respect to the timing aspects of the execution.
Footnote 3: Instruction Set Architecture.
Two Examples for State Predicates
To illustrate this concept further, we give two examples of how state predicates can
be defined with the help of special components in the state. The first example
is a simple DLX-like pipeline without superscalarity or out-of-order execution
(see footnote 4). The second one features a branch prediction that folds away
branches in the fetch stage.
IF ID EX WB
Figure 4.1: Simple Pipeline
Consider the simple pipeline in Fig. 4.1, which consists of an instruction fetch
(IF), a decode (ID), an execution (EX) and a write-back (WB) stage.
As a first example, assume that all instructions flow sequentially through the
pipeline and that branches are computed in the EX stage, redirecting the fetching. Among
other components, a pipeline state $s$ records for each stage which instruction $i$ is in
that stage at the moment. E.g., we write $s(\mathrm{WB})$ for the instruction in the WB stage
in state $s$ (which can be $\varepsilon$ if the stage is empty). To record which instruction has
left the WB stage and thus the pipeline, we have another component RT (retired)
in the state, holding the last retired instruction. Naturally, these components are
updated together with other components (not shown in the example) during one
application of $Ù(s)$ to model the execution flow.
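With this we can define the state predicates in the following way:
$$F_1(i, s) = \begin{cases} \mathit{true} & \text{if } s(\mathrm{RT}) = i \\ \mathit{false} & \text{otherwise} \end{cases}$$
$$N_1(i, s) = \begin{cases} s(\mathrm{WB}) & \text{if } s(\mathrm{WB}) \neq \varepsilon \\ s(\mathrm{EX}) & \text{if } s(\mathrm{WB}) = \varepsilon \wedge s(\mathrm{EX}) \neq \varepsilon \\ s(\mathrm{ID}) & \text{if } s(\mathrm{WB}) = \varepsilon \wedge s(\mathrm{EX}) = \varepsilon \wedge s(\mathrm{ID}) \neq \varepsilon \\ s(\mathrm{IF}) & \text{otherwise} \end{cases}$$
$$H_1(s) = \begin{cases} \mathit{true} & \text{if } s(\mathrm{IF}) = s(\mathrm{ID}) = s(\mathrm{EX}) = s(\mathrm{WB}) = \varepsilon \\ \mathit{false} & \text{otherwise} \end{cases}$$
$$R_1(i, s) = s[\mathrm{RT} \mapsto \varepsilon]$$
Footnote 4: For simplicity, we leave out the MEM stage, which performs memory accesses.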
Thus, an instruction has left the pipeline if it occurs in RT. The next instruction
is the one in the last non-empty stage. Execution has halted if no instruction
is being executed or fetched. The function $R_1$ simply clears the RT component
of the last finished instruction.
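The four definitions for the simple pipeline translate almost literally into code. The sketch below is a hedged illustration: representing a state as a Python dictionary from stage names to instruction labels, using `None` for the empty slot, and the function names `f1`, `n1`, `h1`, `r1` are all assumptions made for this example.

```python
from typing import Dict, Optional

# Assumed representation: component name -> instruction label, None = empty slot.
State = Dict[str, Optional[str]]

def f1(i: str, s: State) -> bool:
    """F_1: instruction i is finished if it is recorded in RT."""
    return s["RT"] == i

def n1(i: str, s: State) -> Optional[str]:
    """N_1: the instruction in the last non-empty stage (IF if all later stages are empty)."""
    for stage in ("WB", "EX", "ID"):
        if s[stage] is not None:
            return s[stage]
    return s["IF"]

def h1(s: State) -> bool:
    """H_1: execution has halted if no stage holds an instruction."""
    return all(s[stage] is None for stage in ("IF", "ID", "EX", "WB"))

def r1(i: str, s: State) -> State:
    """R_1: clear the retirement record, leaving all other components untouched."""
    return {**s, "RT": None}

# A state in which i1 has just retired and the branch i sits in WB.
s1: State = {"IF": "j1", "ID": None, "EX": None, "WB": "i", "RT": "i1"}
s2 = r1("i1", s1)
```

Returning a fresh dictionary from `r1` keeps states immutable, which matches the functional style of the definitions.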
For the second example, we take the same pipeline layout, but now unconditional
branches are folded in the ID stage, i.e. they are removed from the instruction
stream and never enter the EX or WB stages. When waiting for such a branch
to finish execution (i.e. for its removal from the pipeline), there are several possible scenarios:
• The branch has not yet been fetched, thus it has not reached the ID stage.
Then we must perform more cycle transitions until it is discarded in the ID
stage after fetching completes.
• It has already been discarded while we waited for the end of a predecessor
instruction. Then its execution time is zero.
• It is in the IF or ID stage but has not yet been discarded.
To keep track of this, we add another component to the state, NFB (number of
folded branches). NFB is a counter that records the number of branches that have
been folded out in the ID stage. It is updated during $Ù(s)$ if a branch is folded out
in the ID stage.
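With the help of three functions $\mathit{br}$, $\mathit{cb}$, and $\mathit{targ}$, where $\mathit{br}(i)$ is true if
instruction $i$ is an unconditional branch, $\mathit{targ}(i)$ is the target instruction of an
unconditional branch $i$, and $\mathit{cb}(i)$ is true if the instruction $i$ is a conditional or computed
branch, we can define the state predicates by
$$F_2(i, s) = \begin{cases} \mathit{true} & \text{if } s(\mathrm{RT}) = i \wedge \neg \mathit{br}(i) \\ \mathit{true} & \text{if } \mathit{br}(i) \wedge s(\mathrm{NFB}) > 0 \\ \mathit{false} & \text{otherwise} \end{cases}$$
$$N_2(i, s) = \begin{cases} \mathit{targ}(i) & \text{if } \mathit{br}(i) \\ N_1(i, s) & \text{if } \neg \mathit{br}(i) \wedge \mathit{cb}(i) \\ i + 4 & \text{otherwise} \end{cases}$$
$$H_2(s) = H_1(s) \wedge s(\mathrm{NFB}) = 0$$
$$R_2(i, s) = \begin{cases} s[\mathrm{RT} \mapsto \varepsilon] & \text{if } \neg \mathit{br}(i) \\ s[\mathrm{NFB} \mapsto s(\mathrm{NFB}) - 1] & \text{otherwise} \end{cases}$$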
Here, an unconditional branch is finished if it has been folded out. If the
branch has not yet reached the ID stage, then $s(\mathrm{NFB})$ is zero and more
applications of $Ù$ (cycles) must pass before it reaches the ID stage and NFB is
incremented. The function $R_2$ now has to differentiate between unconditional branches
and other instructions (only the latter end up in RT). The next instruction also
becomes more difficult to determine, as one has to conceptually reinsert the
discarded branches. Thus, the $N_2$ predicate differentiates between instructions whose
successor is already in the pipeline or statically known (conditional or computed
branches and non-branches, resp.) and folded branches as the result. The folded branches
appear either as $i + 4$ (i.e. the next instruction after $i$) or as the statically known
targets of an unconditional branch to another such branch.
4.1.2 Instruction Execution Semantics
The instruction-centric semantics in Definition 3.1.4 is defined using the transfer
functions for each edge in the program. Our set $\Sigma$ is defined by $\Sigma = S \times Z$ as
the set of pairs containing a machine state and the number of cycles the execution
took so far (see footnote 5).
We then define the transfer functions as follows:
Definition 4.1.4 (State Transfer Functions): Let $A = (S, Ù)$ be a finite
automaton and let $G$ be a program. We define the state transfer functions
$T_{n \to n'} : P(\Sigma) \to P(\Sigma)$ as
$$T_{n \to n'}(t) = \begin{cases} \emptyset & \text{if } t = \{(s, m)\} \wedge F(n, s) \wedge \neg N(n', s) \\ \{(R(n, s), m)\} & \text{if } t = \{(s, m)\} \wedge F(n, s) \wedge N(n', s) \\ T_{n \to n'}(\{(Ù(s), m + 1)\}) & \text{if } t = \{(s, m)\} \wedge \neg F(n, s) \\ \emptyset & \text{otherwise} \end{cases}$$
This captures the fact that we define the execution of an instruction to have
finished when it leaves the pipeline. The number of state transitions with states that still
execute this instruction is the number of cycles it takes to finish. Note that
normally the execution of this instruction has already started earlier. Thus, it can
happen that it takes zero cycles to complete, if it has already been retired (which
can happen in the case of multiple retirements, as in the Motorola PPC 755). For
an architecture with out-of-order retirement, when an instruction $i$ retires before
its predecessor $i'$ (in fetch or program order), then $i$ will also have an execution
time of 0, as it has already left the pipeline. The execution of $i$ has occurred in
parallel with that of $i'$, and as $i'$ takes longer and we have waited a number of
cycles for it to leave the pipeline, we need not assign any real execution time to $i$
later (in fetch order).
Footnote 5: $Z = \{0, 1, \ldots, T_{max}\}$ is a subset of $\mathbb{N}$, containing all possible execution times, cf. Chapter 5.
Note that if an instruction determines that it will not execute along the edge
$n \to n'$ (but rather along an edge $n \to n''$), then the result is $T_{n \to n'}(t) = \emptyset$, meaning
infeasible execution along this edge.
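The transfer function can be sketched as a loop that applies cycle transitions until the instruction is finished and then checks the successor. The Python code below is only an illustration under stated assumptions: the toy pipeline (a fetch stream followed by IF, ID, EX, WB and RT slots, with no branches or stalls), the tuple encoding of states, and the helper names `step`, `finished`, `is_next`, `retire` and `transfer` are all hypothetical, and the loop relies on the instruction eventually finishing.

```python
from typing import Optional, Set, Tuple

# Assumed toy state: (remaining fetch stream, IF, ID, EX, WB, RT); None = empty slot.
Slot = Optional[str]
State = Tuple[Tuple[str, ...], Slot, Slot, Slot, Slot, Slot]

def step(s: State) -> State:
    """One cycle: every instruction advances one stage; IF is refilled from the stream."""
    stream, if_, id_, ex, wb, _rt = s
    fetched = stream[0] if stream else None
    return (stream[1:], fetched, if_, id_, ex, wb)

def finished(n: str, s: State) -> bool:
    """F(n, s): n is recorded as retired."""
    return s[5] == n

def is_next(n: str, s: State) -> bool:
    """N(n, s): n sits in the last non-empty pipeline stage."""
    for slot in (s[4], s[3], s[2], s[1]):
        if slot is not None:
            return slot == n
    return False

def retire(n: str, s: State) -> State:
    """R(n, s): remove the retirement record."""
    return s[:5] + (None,)

def transfer(n: str, n_next: str, s: State, m: int) -> Set[Tuple[State, int]]:
    """T_{n -> n'} applied to the singleton {(s, m)}."""
    while not finished(n, s):          # third case: one more cycle
        s, m = step(s), m + 1
    if is_next(n_next, s):             # second case: feasible edge
        return {(retire(n, s), m)}
    return set()                       # first case: infeasible edge

# Instruction "a" is in ID, "b" in IF, and "c" is still to be fetched.
s0: State = (("c",), "b", "a", None, None, None)
after_a = transfer("a", "b", s0, 0)
```

In this toy model, `after_a` contains a single pair whose cycle count is 3, and applying `transfer("b", "c", ...)` to that pair adds one more cycle.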
By the definition of the transfer function, the number $m$ of execution cycles
so far does not influence the new state. Thus, if $T_e(\{(s, m)\}) = \{(s', m + k)\}$, then
we have $\{(s', k)\} = T_e(\{(s, 0)\})$. This later allows us to have an abstraction for
the execution of one instruction starting at time 0 that describes every execution
of the instruction (starting from the same state $s$) at an arbitrary time $m$.
Two Examples for Instruction Execution
Again, we give two examples for the execution of an instruction $i$, with the same
two simple processors as above.
        i1
    i:  ba l
        i2
        i3
        i4
        ...
        ...
    l:  j1
        j2
        j3
        ...
Figure 4.2: An example instruction sequence
Consider that we are executing the simple program in Fig. 4.2 and are looking
at $T_{i_1 \to i}(\{(s, 0)\})$ for a state $s$, i.e. the execution of instruction $i_1$.
For the first simple processor architecture, the instruction $i_1$ is finished when
it ends up in the RT component. The same holds for the unconditional branch $i$.
Assuming that $i$ is in the IF stage in state $s$, $i_1$ is in the ID stage, and no
other instruction is being executed, the execution proceeds in the following way:
Cycle                   IF   ID   EX   WB   RT   F_1   N_1
0 (s)                   i    i1   ε    ε    ε
1                       i2   i    i1   ε    ε
2                       i3   i2   i    i1   ε
3 (s1)                  j1   ε    ε    i    i1   i1    i
3 (s2 = R(i1, s1))      j1   ε    ε    i    ε
4 (s3)                  j2   j1   ε    ε    i    i     j1
4 (s4 = R(i, s3))       j2   j1   ε    ε    ε
In this table, the last two columns give the instruction $j$ for which $F_1(j, s)$
holds and the instruction that is next after it ($N_1$).
After the third application of $Ù$ to the start state $s$, instruction $i_1$ has left the
pipeline. Thus, for instruction $i_1$ we have $T_{i_1 \to i}(\{(s, 0)\}) = \{(s_2, 3)\}$. Here, $s_2$ is
the result of applying $R$ to the state represented by the previous line in the above
table ($s_1$).
Going on from here, we have $T_{i \to j_1}(\{(s_2, 3)\}) = \{(s_4, 4)\}$, as the branch $i$
will retire in one more cycle and $s_4 = R(i, s_3)$.
For the processor variant that folds out branches in the ID stage, the execution
from an initial start state proceeds in the following way:
Cycle                   IF   ID   EX   WB   NFB   RT   F_2   N_2
0 (s)                   i    i1   ε    ε    0     ε
1                       i2   i    i1   ε    0     ε
2                       j1   ε    ε    i1   1     ε
3 (s1)                  j2   j1   ε    ε    1     i1   i1    i
3 (s2 = R(i1, s1))      j2   j1   ε    ε    1     ε    i     j1
3 (s3 = R(i, s2))       j2   j1   ε    ε    0     ε
Thus, $i_1$ again finishes in 3 cycles, but in this sequence $i$ has already been
folded out (cycle 2). After $T_{i_1 \to i}(\{(s, 0)\}) = \{(s_2, 3)\}$ we have
$T_{i \to j_1}(\{(s_2, 3)\}) = \{(s_3, 3)\}$, as $i$ has already left the pipeline, as indicated
by $F_2(i, s_2)$ being true. One can say that $i$ takes zero cycles to execute (from start
state $s_2$ on).
With this architecture, the sequence $i_1, i$ will finish after 3 cycles, one cycle
earlier than on the first processor without branch folding.
Collecting Semantics
Our primary point of interest is not the whole set of states that can occur at the
end of a program as a result of its execution but only the part that describes the time
it took to execute the program. And as we are interested in an upper bound for
any execution of a program with arbitrary input, we will look at the collecting
semantics (cf. Definition 3.1.8)
$$A^{co}_G(\{(s_{s_G}(d), 0) \mid d \in D\}) \qquad (4.1.5)$$
Here, $D$ contains all possible input data. By looking at the final instruction
of the program, we can then define the WCET of the program as the maximum
number of cycles over all executions:
$$\mathit{WCET}(G) = \max\{n \mid (s, n) \in A^{co}_G(\{(s_{s_G}(d), 0) \mid d \in D\})(e_G)\} \qquad (4.1.6)$$
As explained in Section 3.2.1, we can utilize the less precise Coarse Analysis
$A^{c}_G$ as a basis for our analyses in later chapters instead of $A^{co}$. The WCET
defined via the Coarse Analysis is
$$\max\{n \mid (s, n) \in A^{c}_G(\{(s_{s_G}(d), 0) \mid d \in D\})(e_G)\},$$
which is greater than or equal to $\mathit{WCET}(G)$.
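Once the pairs of state and cycle count at the end node are available, both bounds are plain maxima. The following Python sketch only illustrates this last step; the function name `wcet` and the example pairs (state names and cycle counts) are made up for the illustration and do not come from any analysis.

```python
from typing import Hashable, Iterable, Tuple

def wcet(end_pairs: Iterable[Tuple[Hashable, int]]) -> int:
    """Maximum cycle count among the (state, cycles) pairs collected at the end node."""
    return max(n for _state, n in end_pairs)

# Hypothetical results at the end node e_G (state names are placeholders).
exact = {("s_a", 12), ("s_b", 17)}
coarse = exact | {("s_c", 20)}      # a coarser analysis may add spurious pairs

bound_exact = wcet(exact)
bound_coarse = wcet(coarse)
```

Because the coarse set contains every pair of the exact set, its maximum can never be smaller, which is why the coarse bound remains safe.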
4.1.3 Inputs to Finite Automata
The finite automaton of Definition 4.1.1 is intended to model the execution of a
real computer system, more specifically, a hard real-time system. One may wonder
why there is no notion of the inputs to the system (e.g. sensor readings, etc.) in
the definition of the automaton execution ((4.1.2) and via the transfer functions in
(4.1.4)). As stated briefly at the introduction of the start state in Section 4.1 and in
equations (4.1.5, 4.1.6), we encode all input data to the program in the start state
$s_{s_G}(d)$. This encoding can be represented as the finite vector of sensor readings
that occur during execution. Naturally, this vector is not part of the physical
computer system. By putting the input vector into the state, we avoid having to deal
with an extra input vector in the definition of the automaton execution, without
losing any possible finite execution.
Since we require that termination of any program under consideration is
guaranteed, there can be only finitely many readings of input data. Thus, the state
space with the embedded input data vector is still finite. This would not be the
case if we allowed arbitrary (i.e. including non-terminating) programs. In addition,
since we assume that the program is already somehow represented in the state
(i.e. in computer memory), there can be only finitely many programs. Thus, the set
of possible input data $D$ can also be assumed to be finite.
4.2 A Sequence of Abstractions
While the last section was concerned with defining the semantics of a given fi-
nite automaton, i. e. with given set S and transition function Ù , this section will
describe how to define the structure of the elements of S (the states) and how to
give the transition function Ù . Since the inner workings of a processor can be
quite complex, especially if the processor has features like speculative execution
or out-of-order execution, one must find a way to describe the state and state tran-
sitions in a modular way. Otherwise, the complexity of the dependencies in the
state transitions would be impossible to handle. As stated before, we will adopt
a high-level hardware modeling concept to ease the definition of the state and
transition function.
4.2.1 Introducing Components and Units
First, a state has an inner structure; we can view a state as a tuple of components.
A component might represent the contents of a register or a prefetch queue, etc.
Naturally, the exact number and types of components depends on the processor to
be described.
Another feature that is very common in the description of hardware is
delayed information flow, i.e. new values of components take effect only
in the next cycle of the execution; see Figure 4.3 for an example
involving a “clear” signal and a “result” output signal of a digital circuit.
Assume we have a trigger signal clear, a result signal result, and an integral
state variable arg.
The clear signal should take effect only in the next execution of the code, i.e.
result should only be set in the next cycle.
    if (arg>10)
        clear:=1
    if (clear==1)
        result:=0
does not work, since result is set immediately. Yet, we can rewrite this example
to use a latch clear_delayed that is copied at the end of the cycle.
    if (arg>10)
        clear_delayed:=1
    if (clear==1)
        result:=0
    ...
    ...
    clear:=clear_delayed
Figure 4.3: Delayed signal example
We call a component of a state that can be viewed as such delayed data a
delayed signal, borrowing from the semantics of signal assignments with delta
delays in VHDL, which exhibit the same behavior as delayed signals.
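The latch-based rewrite of the delayed signal example can be tried out directly. The Python sketch below is an assumed rendering of Figure 4.3: the dictionary-based state, the function name `cycle`, and the default value 0 for an unasserted latch are choices made for this illustration.

```python
from typing import Dict

def cycle(state: Dict[str, int]) -> Dict[str, int]:
    """One execution of the code in Figure 4.3, using a latch for the clear signal."""
    nxt = dict(state)
    clear_delayed = 0                  # assumed default if the signal is not asserted
    if state["arg"] > 10:
        clear_delayed = 1              # assert the delayed signal
    if state["clear"] == 1:            # reads the value latched in the previous cycle
        nxt["result"] = 0
    nxt["clear"] = clear_delayed       # the latch is copied at the end of the cycle
    return nxt

s0 = {"arg": 11, "clear": 0, "result": 7}
s1 = cycle(s0)   # clear is asserted, but result is still untouched
s2 = cycle(s1)   # now the latched clear takes effect
```

Reading `state["clear"]` while writing only to `nxt` is what separates the current value from the value for the next cycle.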
So far, a state is:
Definition 4.2.1 (State): A state $\mu$ is a tuple of components and delayed
signals. The delayed signals are represented by a mapping
$$\{n \mapsto d_n \mid n \in \mathit{Lab}^{\delta}, d_n \in D_n\}$$
the components by a mapping
$$\{m \mapsto c_m \mid m \in \mathit{Lab}^{c}, c_m \in C_m\}$$
The set $\mathit{Lab}^{\delta}$ is a set of names for the delayed signals; likewise $\mathit{Lab}^{c}$ is a set
of names for the components. $D_n$ and $C_m$ are the domains of the delayed
signals and components. So a state $\mu$ can be written as the union of a mapping
for the delayed signals and a mapping for the components:
$$\mu = \{n_1 \mapsto d_1, \ldots, n_k \mapsto d_k, m_1 \mapsto c_1, \ldots, m_l \mapsto c_l\}$$
with $n_i \in \mathit{Lab}^{\delta}$, $d_i \in D_{n_i}$, $m_i \in \mathit{Lab}^{c}$, $c_i \in C_{m_i}$.
Each delayed signal has a default value, represented by a function
$\delta_0 : \mathit{Lab}^{\delta} \to \bigcup_{n \in \mathit{Lab}^{\delta}} D_n$, which gives the value of a delayed signal if it is not
explicitly asserted. We write $\llbracket \mu \rrbracket$ for the set containing all possible states and
$\llbracket \delta \rrbracket$ for all possible mappings for delayed signals.
A state represented by a mapping $\mu$ can equivalently be represented by a tuple
from $D_{n_1} \times \cdots \times D_{n_k} \times C_{m_1} \times \cdots \times C_{m_l}$. We can switch between both
representations via a bijection $\psi : \{1, \ldots, k + l\} \to \{n_i \mid 1 \le i \le k\} \cup \{m_i \mid 1 \le i \le l\}$.
However, we will introduce names for objects in our model description language
later anyhow. To ease the definition of the semantics of that language, we will use
names already here.
The values of the delayed signals and the components in a state are changed
by the transition function $Ù$ based on the actual values:
$$\mu_{next} = Ù(\mu_{actual})$$
Clearly, not all $\mu(n)_{actual}$ are needed for the computation of a $\mu(m)_{next}$. From
this, we can deduce that it may be a good idea to group components that represent
information belonging together and that have little influence on other components. E.g. all
components that directly represent the state of branch prediction of the processor
may be grouped together. We call a set of grouped components a unit. Now we
make explicit the dependencies between old and new values of components in
different units by defining instantaneous signals. Such signals, which can carry
parameters, like in VHDL, are sent from one unit to another unit. Issuing a signal
carries the information about the inner state of a unit to other units. With this,
the update of the inner state of a unit only depends on the delayed signals, its
inner state and the instantaneous signals it received. The instantaneous signals are
also used to structure the state evolution: they denote more abstract events that
happen. As an extension, there can be units that represent global register files
that are visible to all units (although this can be modeled by using instantaneous
signals broadcasting the unit state to all other units).
Definition 4.2.2 (Unit): A unit $u$ is a collection of components, represented as a
mapping
$$\mu_u = \{n \mapsto c_n \mid n \in \mathit{Lab}^{c}_u, c_n \in C_n\}$$
where $\mathit{Lab}^{c}_u \subseteq \mathit{Lab}^{c}$, and we decompose the components of a state, $\mu$, into
a tuple $(\mu_1, \ldots, \mu_N)$ of units, such that $\mathit{Lab}^{c}_u \cap \mathit{Lab}^{c}_{u'} = \emptyset$ for all $u \neq u'$ and
$\bigcup \{\mathit{Lab}^{c}_u \mid 1 \le u \le N\} = \mathit{Lab}^{c}$. Thus we have $N$ units $\mu_1, \ldots, \mu_N$ whose union
contains all the components from $\mu$, and we can write
$$\mu = \bigcup_{1 \le i \le N} \mu_i \cup \delta,$$
where $\delta$ are the delayed signals of the state. We write $\llbracket \mu_i \rrbracket$ for the set containing all
possible contents of unit $\mu_i$.
A named instantaneous signal ι is sent from one unit to one or more units
carrying data from a domain Iι. Emitting and receiving an instantaneous signal
will simply be done by adding an instance of the signal to the set of asserted
signals, as we will later see.
Definition 4.2.3 (Instantaneous Signal): An instantaneous signal $\iota$ is a tuple
$(u, \{u_1, \ldots, u_n\}, l, I_\iota)$, where $u$ and $u_1, \ldots, u_n$ are units, $l \in \mathit{Lab}^{\iota}$ is a label
and $I_\iota$ is the domain of the signal. An instance of a signal is a pair $(l, i)$,
where $i \in I_\iota$ is the actual data carried by the signal. $u$ is the unit emitting
the signal, $u_1, \ldots, u_n$ the ones receiving it, $\forall i : u \neq u_i$. An instantaneous
signal with label $l$ can be sent from only one unit but be received by multiple
units. Thus, for all signals $(u, U, l, I_\iota)$ there must not be a different one
$(u', U', l', I_{\iota'})$ with $l = l'$. We write $|\iota|$ for the set of all instantaneous signals
$\iota$. We write $\llbracket \iota \rrbracket$ for the set containing all possible instances of instantaneous
signals, i.e. for the set $\{(l, i) \mid (u, U, l, I) \in |\iota| \wedge i \in I\}$. A set $I$ of instances of
instantaneous signals, $I \subseteq \llbracket \iota \rrbracket$ with $(l, v) \in I \Rightarrow \nexists (l, v') \in I : v \neq v'$, will be written as
a partial mapping
$$I = \{n_1 \mapsto i_1, \ldots, n_l \mapsto i_l\}$$
where $n_j \in \mathit{Lab}^{\iota}$ and $i_j \in I_{n_j}$. Note that $\{n_1, \ldots, n_l\}$ need not be equal to
$\mathit{Lab}^{\iota}$, i.e. the mapping is partial. For an $l \in \mathit{Lab}^{\iota}$ we write $\mathit{targ}(l)$ for the set
$U$ of a signal $(u, U, l, I) \in |\iota|$.
Unit components and delayed signals together with asserted instantaneous signals
can be written as a mapping $\nu = \mu \cup I$. We write $\llbracket \nu \rrbracket$ for the set of all combined
mappings. Note that $\llbracket \mu \rrbracket \subseteq \llbracket \nu \rrbracket$.
For a mapping $f$ from $\llbracket \nu \rrbracket$ and a set $N$, we write $f|_N$ for the mapping obtained
from $f$ by keeping only labels from $N$, and $f \setminus N$ for the mapping equal to $f$
but not containing any labels from $N$. To denote the update of a mapping $f$ for
label $l$ with new value $v$, we write $f[l \mapsto v]$. For two mappings $f$ and
$g = \{l_1 \mapsto v_1, \ldots, l_n \mapsto v_n\}$ we write $f \oplus g$ for the mapping
$f[l_1 \mapsto v_1] \cdots [l_n \mapsto v_n]$.
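The four operations on mappings correspond closely to ordinary dictionary manipulation. The following Python sketch shows one possible reading; the function names `restrict`, `remove`, `update` and `override` are assumptions introduced for this example.

```python
from typing import Any, Dict, Set

Mapping = Dict[str, Any]

def restrict(f: Mapping, labels: Set[str]) -> Mapping:
    """Restriction: keep only the labels contained in `labels`."""
    return {l: v for l, v in f.items() if l in labels}

def remove(f: Mapping, labels: Set[str]) -> Mapping:
    """Removal: drop all labels contained in `labels`."""
    return {l: v for l, v in f.items() if l not in labels}

def update(f: Mapping, l: str, v: Any) -> Mapping:
    """Point update: f with label l mapped to the new value v."""
    return {**f, l: v}

def override(f: Mapping, g: Mapping) -> Mapping:
    """Override: apply every binding of g to f, one update after the other."""
    return {**f, **g}

nu = {"pc": 4, "queue": 2, "sig_flush": 1}
only_components = restrict(nu, {"pc", "queue"})
merged = override(nu, {"pc": 8, "sig_stall": 1})
```

All four functions return new dictionaries, so earlier mappings stay available, as in the mathematical notation.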
Our notation for updating units will be sequential, i.e. one unit will be updated
after the other. Since the instantaneous signals give the dependencies of a unit's
new state w.r.t. other units, the updates must be done such that all units that emit
a signal are updated before a unit that receives that signal. This is not
necessary for delayed signals. Also, we assume that signals emitted from different
units but received by the same unit are distinct. This can easily be made explicit
by using different labels (possibly including the sending unit's name) for the
signals. This is also true for delayed signals: only one unit may change the value
of a delayed signal. And again, this can be circumvented by using more delayed
signals, parameterized by which unit changed the signal. Note that this implies
that during the state update of a unit, the update must decide which signal from
which unit takes precedence. We could circumvent this by introducing a form of
resolved signals, as is done in VHDL to decide a signal's value if it is driven by
multiple sources, cf. [VHD00, clause 2.4]. We choose not to, since the priority
issues turned out to be easy to resolve.
Given the set $|\iota|$ we can define a dependency graph of the units connected by
the instantaneous signals:
Definition 4.2.4 (Unit Dependency): Given the set of instantaneous signals,
$|\iota|$, we define the dependency graph $\mathcal{D} = (V_{\mathcal{D}}, E_{\mathcal{D}})$ by
$$V_{\mathcal{D}} = \{1, \ldots, N\}$$
$$E_{\mathcal{D}} = \{(u, u') \mid (u, U, l, I) \in |\iota| \wedge u' \in U\}$$
We require that $\mathcal{D}$ is acyclic, i.e. there is no cyclic dependency by
instantaneous signals among units.
Given $\mathcal{D}$, let $\prec_{\mathcal{D}} \subseteq \{1, \ldots, N\} \times \{1, \ldots, N\}$ be an ordering on units that is
asymmetric and irreflexive such that
$$\forall (u, u') \in E_{\mathcal{D}} : u \prec_{\mathcal{D}} u'$$
We write $\prec_{\mathcal{D}}|_{U'}$ for the restriction of $\prec_{\mathcal{D}}$ to $U' \times U'$. By $\min_{\prec_{\mathcal{D}}}(U')$ we
denote a $u \in U'$ that is minimal, i.e. $\nexists u' \in U' : u' \prec_{\mathcal{D}} \min_{\prec_{\mathcal{D}}}(U')$. If such
a $u$ is not unique, we choose the smallest $u$ according to the ordering on
natural numbers. Note that $\prec_{\mathcal{D}} = \prec_{\mathcal{D}}|_{U}$.
Thus, $\prec_{\mathcal{D}}$ is an ordering under which the dependent units (those that are the target
of an instantaneous signal) are greater than the units that are the sources of the
dependencies. The ordering is not total by this definition: one always has some
freedom to arrange non-dependent units. $\prec_{\mathcal{D}}$ will be used to define the update
order in the unit update cycles.
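The choice of a minimal unit and the resulting update order can be sketched in a few lines. The Python code below is an illustration under assumptions: it takes the edge relation of the dependency graph itself as the ordering (which is admissible as long as the graph is acyclic), and the names `min_unit` and `update_order` are hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]   # (emitting unit, receiving unit)

def min_unit(units: Set[int], edges: Set[Edge]) -> int:
    """Smallest-numbered unit in `units` that receives no signal from a unit in `units`."""
    candidates = [u for u in units if not any((v, u) in edges for v in units)]
    return min(candidates)   # raises ValueError if the graph has a cycle within `units`

def update_order(n_units: int, edges: Set[Edge]) -> List[int]:
    """Repeatedly pick and remove the minimal unit, as a unit update cycle would."""
    remaining = set(range(1, n_units + 1))
    order: List[int] = []
    while remaining:
        u = min_unit(remaining, edges)
        order.append(u)
        remaining.remove(u)
    return order

# Units 1 and 2 both signal unit 3, which in turn signals unit 4.
order_a = update_order(4, {(1, 3), (2, 3), (3, 4)})
# Unit 3 signals unit 1; unit 2 is independent of both.
order_b = update_order(3, {(3, 1)})
```

Breaking ties by the smallest unit number makes the order deterministic, in line with the tie-breaking rule for the minimum.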
In our setting, we will do a cycle update in two steps: the first step performs all
updates for the first half cycle, the second those for the second half cycle. These
two steps are denoted by functions $Ù^S_1$ and $Ù^S_2$, working on units and delayed and
instantaneous signals, with
$$Ù^S_{1/2} : P(\{1, \ldots, N\}) \times \llbracket \nu \rrbracket \times \llbracket \delta \rrbracket \to \llbracket \nu \rrbracket \times \llbracket \delta \rrbracket$$
The first argument reflects the set of units that still need updating, the second
is the current unit and instantaneous signal contents, and the third reflects the values
of the delayed signals for the next cycle. The result is a pair of new unit/asserted
instantaneous signal contents and new delayed signals.
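Thus, the function $Ù$ on states can be defined as
$$Ù(\mu) = \begin{cases} \bot & \text{if } H(\mu) \\ M(Ù^S_2(\{1, \ldots, N\}, \nu_1, \delta_1)) & \text{otherwise} \end{cases}$$
where the function $M$ maps the half-cycle update results back to states:
$$M(\nu, \delta) = (\nu \oplus \delta)|_{\mathit{Lab}^{c} \cup \mathit{Lab}^{\delta}}$$
and $\nu_1$, $\delta_1$ are the results from the first half-cycle update:
$$(\nu_1, \delta_1) = Ù^S_1(\{1, \ldots, N\}, \mu, \delta_0)$$
Here $\delta_0$ supplies the default values of the delayed signals.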
The functions $Ù^S_{1/2}$ use only updates of units. The update of unit $u$ is given by
functions
$$Ù^u_{1/2} : \llbracket \nu \rrbracket \to \llbracket \nu \rrbracket \times \llbracket \delta \rrbracket$$
mapping the contents of the unit, the delayed signals and the instantaneous signals to
new unit and asserted signal contents and newly asserted delayed signals.
The $Ù^u_{1/2}$ are the fundamental semantics one has to give in order to describe a
processor system. With them we can write $Ù^S_{1/2}$ as
$$Ù^S_i(U, \nu, \delta') = \begin{cases} (\nu, \delta') & \text{if } U = \emptyset \\ Ù^S_i(U \setminus \{u\}, \nu', \delta' \oplus \delta'') & \text{otherwise} \end{cases}$$
where $u = \min_{\prec_{\mathcal{D}}}(U)$ is the minimal unit of $U$ under the unit dependency ordering and
$$(\nu'', \delta'') = Ù^u_i(\nu), \qquad \nu' = \nu \oplus \nu''$$
$Ù^S_i$ is well defined, as $U$ is a finite set.
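The recursion over the set of units still to be updated can be unrolled into a loop. The Python sketch below is a hedged illustration of one half-cycle update: the dictionary encoding of mappings, the two example unit updates, and the names `min_unit`, `half_cycle`, `unit_1` and `unit_2` are assumptions made for this example, and the dependency edges are used directly as the unit ordering.

```python
from typing import Any, Callable, Dict, Set, Tuple

Mapping = Dict[str, Any]
UnitUpdate = Callable[[Mapping], Tuple[Mapping, Mapping]]   # nu -> (nu'', delta'')

def min_unit(units: Set[int], edges: Set[Tuple[int, int]]) -> int:
    """Smallest-numbered unit that receives no signal from another remaining unit."""
    return min(u for u in units if not any((v, u) in edges for v in units))

def half_cycle(units: Set[int], nu: Mapping, delta_next: Mapping,
               updates: Dict[int, UnitUpdate],
               edges: Set[Tuple[int, int]]) -> Tuple[Mapping, Mapping]:
    """Update the units one after the other, emitters before receivers."""
    units = set(units)
    while units:
        u = min_unit(units, edges)
        nu_new, delta_new = updates[u](nu)
        nu = {**nu, **nu_new}                        # nu' = nu overridden by nu''
        delta_next = {**delta_next, **delta_new}     # delta' overridden by delta''
        units.remove(u)
    return nu, delta_next

def unit_1(nu: Mapping) -> Tuple[Mapping, Mapping]:
    """Hypothetical unit: emits the instantaneous signal sig_go carrying a + 1."""
    return {"sig_go": nu["a"] + 1}, {}

def unit_2(nu: Mapping) -> Tuple[Mapping, Mapping]:
    """Hypothetical unit: consumes sig_go and asserts the delayed signal d."""
    return {"b": nu.get("sig_go", 0) * 2}, {"d": 1}

result = half_cycle({1, 2}, {"a": 1, "b": 0}, {"d": 0},
                    {1: unit_1, 2: unit_2}, {(1, 2)})
```

Because unit 1 is updated first, unit 2 already sees the asserted signal in the same half cycle, while the delayed signal only appears in the second component of the result.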
4.2.2 Abstract Components
We are primarily interested in an analysis of the finite automaton describing our
system. In the DFA, which will be the framework we use, we will not be able to
precisely represent all components of a concrete state, for reasons of computability,
efficiency, etc. E.g. the input data vector in the state will not be represented in
the analysis, as we would have to represent all possible input data for all possible
executions. The contents of main memory will not be represented in the analysis
either, nor the cache contents. Nonetheless, we can always find correct
approximations for these components in the analysis. A component named $n_i$ about which
we do not want to (or cannot) have any knowledge in the analysis can always be
represented by an abstract domain $\hat{D} = \{\bot, \top\}$. Here, the element $\top$ represents the
complete set $D_{n_i}$, thus we have a concretization (cf. Section 3.2.1) $\gamma : \hat{D} \to P(D_{n_i})$
with $\gamma(\top) = D_{n_i}$. On the other hand, we may have an approximation for the
component's values that loses some knowledge but can represent some subsets of the
component's domain. The approximation for the contents of caches is such an
example. Then, we have a more complex abstract domain $\hat{E}$ and a concretization
$\gamma_{\hat{E}}$.
Now, modeling these components precisely would be wasted effort, as we will
have either no or only limited knowledge about them in the analysis later. To
make this fact explicit in the model, we introduce so called abstract components.
These are components for whose values we will have non-exact approximations
in the analysis. In contrast to abstract components, the remaining values of the
components will be represented exactly in the analysis later.
E.g. the abstract caches are only able to represent what is guaranteed to be in
the cache at a given point during execution but not everything that is in the cache.
On the other hand, the contents of the prefetch queue (instruction addresses) will
be represented exactly.
We introduce abstract components already here, for the concrete model and
semantics, because this makes it easier to define the abstraction of the complete
state and semantics for the analysis later. In the following discussion, abstract
components will occur as abstract types (i.e. predefined meanings for $D_{n_i}$, not
defined from atomic elements, as for the other components) and predefined
functions working on abstract types. The analysis later then only has to give abstract
domains for the abstract types and abstract versions of the predefined functions.
In the remainder of this section, we will show how the unit updates Ù u1 a 2 can
be defined by giving a language and semantics for an imperative unit update. In
the next two sections we will apply the techniques developed here by giving two
examples for processor models.
4.2.3 A Semantics for Unit Transitions
This section will introduce an imperative language which is used to define Ù ui .
The language does three things:
Z It provides a type system that is used to define the domains of the compo-
nents, delayed and instantaneous signals, i. e. Ci, Di and Iι.
Z It provides the means to define the unit update, including local variables.
Z It makes explicit the dependencies on abstract components both in the type
system and the program.
We will use local variables in our unit-update programs. These variables will
be referenced by identifiers from the set $Lab^{l}$.
To keep things simple, we assume that the sets $Lab^{c}_{u}$, $Lab^{\delta}$,
$Lab^{\iota}$ and $Lab^{l}$ are pairwise disjoint. The set $Lab$ is the union of
all these identifier sets. Additionally, we assume that the identifiers used in
structure or enum declarations are disjoint from other identifiers, i. e. those
in declarations or signal or component names.
We will now go on by defining the semantic domains of the language. Then
we will introduce the types of data values. With this, we show how the $C_i$,
$D_i$ and $I_\iota$ can be defined as the meaning of types in our language. Then
the abstract syntax of the language is introduced, together with the static
semantics, i. e. the side conditions on well formed programs. After that, we
will give the dynamic semantics of a program and show how the unit updates
$\Rightarrow^{u}_{i}$ can be defined in its terms.
Semantic Domains
The following semantic domains are defined:
• $Val$, the set of values
• $Types$, the set of types
• $Env = (Lab^{l} \to Val) \cup \nu$, the environments, containing mappings of
  local variables, state variables, delayed signal names and instantaneous
  signal names to the corresponding values. Note that $\nu \subset Env$.
• $Stmt$, the set of statements
• $Expr$, the set of expressions
Expressions as well as signals in the language have a type. We can construct
unit members and signal arguments/values from the following types. Here, the
brackets $[\![\cdot]\!]$ denote the meaning of the type. The concrete annotation
of the type in the program is given first, then the abstract syntax, then the
meaning.
• Int,
  $Int$,
  $[\![Int]\!] = \{-2^{63}, \ldots, 0, \ldots, 2^{63} - 1\}$, the set of 64 bit integers.
• Bool,
  $Bool$,
  $[\![Bool]\!] = \{true, false\}$. The set of boolean values.
• struct t1 id1; ... ; tn idn; end,
  $S(id_1 : t_1, \ldots, id_n : t_n)$,
  $[\![S(id_1 : t_1, \ldots, id_n : t_n)]\!] = \{ f \mid f : \{id_1, \ldots, id_n\} \to \bigcup_{1 \le i \le n} [\![t_i]\!],\ f(id_i) \in [\![t_i]\!] \}$.
  A record of types.
• enum id1 [(t1) | ε] | ... | idn [(tn) | ε] end,
  $E(id_1 : t_1, \ldots, id_n : t_n)$,
  $[\![E(id_1 : t_1, \ldots, id_n : t_n)]\!] = \{ (id_i, v_i) \mid 1 \le i \le n,\ v_i \in [\![t_i]\!] \}$.
  An enumeration with constructors. If no type is given in the syntax (case ε
  above), then the type $t_i$ in the abstract syntax is the unit type $()$.
• (t1, ..., tn),
  $(t_1, \ldots, t_n)$,
  $[\![(t_1, \ldots, t_n)]\!] = [\![t_1]\!] \times \cdots \times [\![t_n]\!]$. A tuple of types.
• t id[n],
  $t[n]$,
  $[\![t[n]]\!] = \{1, \ldots, n\} \to [\![t]\!]$. An array of values of type t with n elements.
• (no concrete syntax),
  $()$,
  $[\![()]\!] = \{()\}$. The unit type needed for signals without arguments.
• (no concrete syntax),
  $sig(t)$,
  $[\![sig(t)]\!] = [\![t]\!]$. A delayed or instantaneous signal.
• abstr(id),
  $abstr(id)$,
  $[\![abstr(id)]\!] = P(id)$. An abstract type has a meaning that is given by some
  function P. The identifier is used to differentiate between several abstract
  types.
• (no concrete syntax),
  $t_1 \times \cdots \times t_n \to t$,
  $[\![t_1 \times \cdots \times t_n \to t]\!] = [\![t_1]\!] \times \cdots \times [\![t_n]\!] \to [\![t]\!]$.
  A predefined function with n arguments of types $t_1, \ldots, t_n$ and result
  type $t$. $t_i$ and $t$ must not be signal types.
In addition, the meaning of all types must be finite sets. There is no concrete
syntax to denote the unit type $()$ or a signal type $sig(t)$. These types cannot
be manipulated by programs and are only used in the semantics definitions. Also,
predefined functions cannot be manipulated in the program, apart from being
invoked. Furthermore, predefined functions are the only means to introduce
values of abstract types in the program. This trick makes it easy to provide
abstractions later by simply demanding that (correct) abstract versions of the
predefined functions exist.
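The requirement that every type denotes a finite set can be made concrete with a small enumeration. The following Python sketch is only an illustration under assumptions that are not part of the formal definition: types are encoded as nested tuples (a hypothetical encoding), records and arrays are represented as tuples instead of functions, and Int is replaced by a small range so that the sets stay enumerable.

```python
from itertools import product

# Hypothetical encoding of the abstract syntax of types as nested tuples.
BOOL = ("Bool",)
UNIT = ("Unit",)


def small_int(lo, hi):
    """Stand-in for Int with a tiny range so that the set can be enumerated."""
    return ("Int", lo, hi)


def meaning(t):
    """Return the meaning of type t as a Python set of hashable values."""
    tag = t[0]
    if tag == "Bool":
        return {True, False}
    if tag == "Unit":
        return {()}
    if tag == "Int":
        return set(range(t[1], t[2] + 1))
    if tag == "Tuple":    # product of the member meanings
        return set(product(*(meaning(ti) for ti in t[1])))
    if tag == "Enum":     # pairs (constructor, argument value)
        return {(name, v) for name, ti in t[1] for v in meaning(ti)}
    if tag == "Struct":   # records, stored as tuples of (member, value) pairs
        names = [name for name, _ in t[1]]
        choices = product(*(meaning(ti) for _, ti in t[1]))
        return {tuple(zip(names, vs)) for vs in choices}
    if tag == "Array":    # n-element arrays, stored as n-tuples
        return set(product(meaning(t[1]), repeat=t[2]))
    raise ValueError(f"unknown type {t!r}")


# A state that is either "none" or an address in a (tiny) address range.
addr_or_none = ("Enum", [("none", UNIT), ("addr", small_int(0, 3))])
values = meaning(addr_or_none)
```

Here `values` contains the element `("none", ())` and one element per address. Abstract types and predefined functions are left out on purpose, since their meaning is supplied from outside the type system.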
The set $Types$ contains the syntax of all possible types. To denote all
meanings we can represent with our type system we write $[\![Types]\!]$. Thus,
$[\![Types]\!] = \bigcup_{t \in Types} [\![t]\!]$. This way we can define the
semantic domain $Val = [\![Types]\!]$.
For every domain $C_i$ of a component in unit $\mu_u$ we give one type
$CT_i \in Types$ and define $C_i = [\![CT_i]\!]$. Likewise for $D_i$ and $I_\iota$
with types $DT_i$ and $IT_\iota$.
Statements
A unit update program will simply be a statement from the domain Stmt. The
statements that define Stmt are
• Skip: This statement does nothing.
• l ← e, where $e \in Expr$ is an expression and $l \in lval$ is the location
  that is assigned to.
• If e Then s1 Else s2 Fi: a conditional with condition $e \in Expr$ and two
  statements $s_1, s_2 \in Stmt$.
• Goto Label, where Label is a program label.
• Emit id(e), where $e \in Expr$ is an expression and
  $id \in Lab^{\delta} \cup Lab^{\iota}$ is a signal name.
• s Where id1 : t1 = e1; ... idn : tn = en; introduces local variables $id_i$ in
  the statement $s \in Stmt$. The initial value of $id_i$ is given by the
  expression $e_i \in Expr$. The type of variable $id_i$ is $t_i \in Types$.
• For id = e1 To e2 : s introduces a loop, where the local integer variable
  'id' ranges from the value of $e_1 \in Expr$ to $e_2 \in Expr$ and the
  statement $s \in Stmt$ is executed for all the values assigned to id.
• Break: This statement terminates the innermost For loop that it occurs in.
• s1 s2: a sequence of statements for statements $s_1, s_2 \in Stmt$.
In a program, certain parts may be marked by a label. The places where a
label l may appear must not be nested in a For, an If-Then-Else or a Where
construct. For a program $p \in Stmt$, we write p@l for the program from the
label l on to the end. A label must occur after any Goto statement that
references it. Also, a Break instruction must not occur outside a For loop
construct.
The set lval is defined by
$l \in lval \iff l \in name \lor l = (id_1, \ldots, id_n)$, i. e. an lval is
either a name (defined below) or a tuple of identifiers.
Expressions
The set of expressions present in the language, $Expr$, is given by
• p(e1, ..., en), where p is a primitive operation or predefined function and
  $e_i \in Expr$
• (e1, ..., en), a tuple of expressions $e_i \in Expr$
• Received id(id1, ..., idn), a test on the received signal
  $id \in Lab^{\delta} \cup Lab^{\iota}$, with $\forall 1 \le i \le n : id_i \in Lab^{l}$
• A name $n \in name$
• A literal, i. e. either an integer n, or the values true or false, or an
  enumeration member id(e1, ..., en), $e_i \in Expr$
A name serves either as the left hand side of an assignment or can be
dereferenced to obtain a value. The set name contains
• id, an identifier, representing a signal name, a state variable or a local
  variable
• n.id, the selection of a structure member, $n \in name$
• id[e], an array reference, $e \in Expr$
• n.id[e], an array reference of a struct reference, $e \in Expr$, $n \in name$
Note that these four rules define the same set as the rules
• id
• id[e], $e \in Expr$
• id.n, $n \in name$
We will use this to perform induction on names from the front or the tail of a
name.
Although there are no provisions to declare the types of the unit components
and/or signals in the syntax of a program, we assume that there is a list of type
declarations given in addition to the program. A type declaration has the form
t id, where $t \in Types$. Also, we will later add the convenience to add typedefs to
the concrete syntax. A typedef introduces an abbreviation for a type, saving some
space in the type declarations. In our setting, a typedef is simply expanded to its
definition to give the concrete type in a declaration.
This concludes the syntax of our language. Yet, there are additional conditions
on the way expressions and statements may be used. E.g. the typing of expressions
must be valid, etc.
Thus, we now give the rules to determine the type of an expression. The
typing of an expression $e \in Expr$ is given by the following inference rules.
Here $e \vdash_{\xi} t$ means that the expression e has type t under the type
environment $\xi : Lab \to Types$. The initial typing environment for a program p
will contain entries for the unit components, and the delayed and instantaneous
signals. The function enum gives for an enum identifier id the enum type
$E(id_1 : t_1, \ldots, id_n : t_n)$ it belongs to, i. e. $id = id_i$ for some
$1 \le i \le n$. The function etype gives for an enum identifier id its type in
the enum, $t_i$.
(1) $n \vdash_{\xi} Int$, if n is an integer
(2) $true \vdash_{\xi} Bool$
(3) $false \vdash_{\xi} Bool$
(4) $() \vdash_{\xi} ()$
(5) $id \vdash_{\xi} \xi(id)$, if $id \in dom(\xi)$
(6) $id \vdash_{\xi} enum(id)$, if id is an enum identifier
(7) $\dfrac{e \vdash_{\xi} etype(id)}{id\ e \vdash_{\xi} enum(id)}$, if id is an enum identifier
(8) $\dfrac{e_1 \vdash_{\xi} t[n], \quad e_2 \vdash_{\xi} Int}{e_1[e_2] \vdash_{\xi} t}$
(9) $\dfrac{e \vdash_{\xi} S(id_1 : t_1, \ldots, id_n : t_n)}{e.id_i \vdash_{\xi} t_i}$
(10) $\dfrac{e_1 \vdash_{\xi} t_1, \ldots, e_n \vdash_{\xi} t_n}{(e_1, \ldots, e_n) \vdash_{\xi} (t_1, \ldots, t_n)}$
(11) $\dfrac{id_1 \vdash_{\xi} t_1, \ldots, id_n \vdash_{\xi} t_n, \quad id \vdash_{\xi} sig(t_1 \times \cdots \times t_n)}{Received\ id(id_1, \ldots, id_n) \vdash_{\xi} Bool}$
(12) $\dfrac{e_1 \vdash_{\xi} t_1, \ldots, e_n \vdash_{\xi} t_n, \quad p \vdash_{\xi} t_1 \times \cdots \times t_n \to t}{p(e_1, \ldots, e_n) \vdash_{\xi} t}$, if $t, t_i$ contain no signal types
These rules present no surprises, except for rule (12), whose condition
guarantees that signal types cannot be derived by other means than coming from
the typing environment. The static correctness of a program p is defined by the
following inference rules. Here, $s \vDash_{\xi} p$, where s is a statement and
$\xi$ a typing environment as above, denotes that s is well formed in the
program p under $\xi$.
(13) $Skip \vDash_{\xi} p$
(14) $Break \vDash_{\xi} p$
(15) $\dfrac{s_1 \vDash_{\xi} p, \quad s_2 \vDash_{\xi} p}{s_1\ s_2 \vDash_{\xi} p}$
(16) $Goto\ Label \vDash_{\xi} p$, if Label occurs in p
(17) $\dfrac{e \vdash_{\xi} t, \quad id \vdash_{\xi} sig(t)}{Emit\ id(e) \vDash_{\xi} p}$
(18) $\dfrac{e \vdash_{\xi} Bool, \quad s_1 \vDash_{\xi} p, \quad s_2 \vDash_{\xi} p}{If\ e\ Then\ s_1\ Else\ s_2\ Fi \vDash_{\xi} p}$
(19) $\dfrac{l \vdash_{\xi} t, \quad e \vdash_{\xi} t}{l \leftarrow e \vDash_{\xi} p}$, where t contains no signal types
(20) $\dfrac{e_1 \vdash_{\xi_0} t_1, \ldots, e_n \vdash_{\xi_{n-1}} t_n, \quad s \vDash_{\xi_n} p}{s\ Where\ id_1 : t_1 = e_1; \ldots\ id_n : t_n = e_n; \vDash_{\xi} p}$, with $\xi_0 = \xi$ and $\xi_{i+1} = \xi_i \oplus \{id_{i+1} \mapsto t_{i+1}\}$
(21) $\dfrac{e_1 \vdash_{\xi} Int, \quad e_2 \vdash_{\xi} Int, \quad s \vDash_{\xi'} p}{For\ id = e_1\ To\ e_2 : s \vDash_{\xi} p}$, with $\xi' = \xi \oplus \{id \mapsto Int\}$
Rule (16) does not make any assumptions on the correctness of the program
fragment at p@Label, since that will be checked by the sequencing rule (15).
It demands, however, that the label actually occurs in the program. The more
interesting rule is rule (19): the condition makes sure that signal values can
only be changed by the Emit construct. Also, signals in themselves cannot be
used in expressions by rule (12) for expressions and the fact that they cannot
be defined by a type declaration for unit components.
A program p for a unit u is statically correct iff $p \vDash_{\xi} p$, where
$\xi = \{ l \mapsto CT_l \mid l \in Lab^{c}_{u} \} \cup \{ l \mapsto DT_l \mid l \in Lab^{\delta} \} \cup \{ l \mapsto IT_l \mid l \in Lab^{\iota} \} \cup \{ P \mapsto t_P \mid P \text{ is a predefined function with type } t_P \}$
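The typing rules for expressions can be read as a recursive checker. The Python sketch below covers only literals, identifiers, tuples, array references, Received and calls of predefined functions, i. e. rules (1) to (3), (5), (8) and (10) to (12). The tuple encoding of expressions and types and all function names are assumptions made for this illustration; they are not part of the language definition.

```python
# Hypothetical tuple encoding of expressions and types; names are illustrative.
INT, BOOL = ("Int",), ("Bool",)


def contains_sig(t):
    """True if the type t mentions a signal type anywhere."""
    if t[0] == "Sig":
        return True
    return any(contains_sig(x) for x in t[1:] if isinstance(x, tuple))


def typeof(e, xi):
    """Type of expression e under typing environment xi (a dict), or TypeError."""
    kind = e[0]
    if kind == "lit":                                   # rules (1)-(3)
        return BOOL if isinstance(e[1], bool) else INT
    if kind == "id":                                    # rule (5)
        if e[1] not in xi:
            raise TypeError(f"unbound identifier {e[1]}")
        return xi[e[1]]
    if kind == "tuple":                                 # rule (10)
        return ("Tuple",) + tuple(typeof(x, xi) for x in e[1])
    if kind == "index":                                 # rule (8)
        arr, idx = typeof(e[1], xi), typeof(e[2], xi)
        if arr[0] != "Array" or idx != INT:
            raise TypeError("bad array reference")
        return arr[1]
    if kind == "received":                              # rule (11)
        args = tuple(typeof(("id", x), xi) for x in e[2])
        if xi.get(e[1]) != ("Sig", ("Tuple",) + args):
            raise TypeError("Received does not match the signal type")
        return BOOL
    if kind == "call":                                  # rule (12)
        fun = xi.get(e[1])
        args = tuple(typeof(x, xi) for x in e[2])
        if fun is None or fun[0] != "Fun" or fun[1] != args:
            raise TypeError("bad call")
        if contains_sig(fun[2]) or any(contains_sig(a) for a in args):
            raise TypeError("signal types may not flow through functions")
        return fun[2]
    raise TypeError(f"unknown expression {e!r}")


xi = {
    "x": INT,
    "a": INT,
    "buf": ("Array", BOOL, 4),
    "set": ("Sig", ("Tuple", INT)),
    "+": ("Fun", (INT, INT), INT),
}
```

With this environment, `typeof(("call", "+", [("id", "x"), ("lit", 1)]), xi)` yields the Int type, while passing a signal to a predefined function is rejected, mirroring the side condition of rule (12).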
Dynamic Semantics
The dynamic semantics for a program p is described by the transformation of an
environment into a modified environment together with the newly emitted delayed
signals (instantaneous signals are recorded in the environment). We will give
this semantics as $p \to^{(\rho, \delta_{\bot})}_{p} (\rho', \rho_{\delta})$,
meaning that the environment $\rho$ is transformed under program p into the
environment $\rho'$ and results in delayed signals $\rho_{\delta}$ being
asserted. $\delta_{\bot}$ is just the "inactive" environment for delayed
signals. Besides the mappings for the unit components and currently active
signals, the environment $\rho$ also contains mappings for the local variables
in each unit. Thus, we can write it as a mapping
$\rho = \{ l \mapsto v \mid l \in Lab^{c}_{u} \} \cup \{ l \mapsto v \mid l \in Lab^{\delta} \} \cup \{ l \mapsto v \mid l \in Lab^{l} \} \cup \{ l \mapsto v \mid l \in L \subseteq Lab^{\iota} \}$
Or, equivalently, represent it as a unit component/signal mapping extended with
mappings for the local variables:
$\rho = \nu \cup \{ l \mapsto v \mid l \in Lab^{l} \}$
With this, we can give $\Rightarrow^{u}_{i}$ in terms of the relation $\to_{p}$:
$\Rightarrow^{u}_{i}(\nu) = (\rho|_{Lab^{c} \cup Lab^{\delta} \cup Lab^{\iota}}, \rho_{\delta})$ iff $p_i \to^{(\nu, \delta_{\bot})}_{p_i} (\rho, \rho_{\delta})$   (4.2.5)
where $p_i$ is the program for half-cycle update i.
We start by giving the rules for evaluating an expression under an environment
$\rho$. Since the Received construct introduces side effects in expression
evaluation, the result of an expression is not only a value but also a new
environment. We write $e \to^{\rho}_{E} (v, \rho')$ if the expression e evaluates
to the value v and a new environment $\rho'$.
(22) $n \to^{\rho}_{E} (n, \rho)$, if n is an integer
(23) $b \to^{\rho}_{E} (b, \rho)$, $b \in \{true, false\}$
(24) $\dfrac{e_1 \to^{\rho}_{E} (v_1, \rho_1), \ldots, e_n \to^{\rho_{n-1}}_{E} (v_n, \rho_n)}{(e_1, \ldots, e_n) \to^{\rho}_{E} ((v_1, \ldots, v_n), \rho_n)}$
(25) $id \to^{\rho}_{E} (\rho(id), \rho)$, if $id \in dom(\rho)$
(26) $id \to^{\rho}_{E} ((id, ()), \rho)$, if id is an enum identifier
(27) $\dfrac{e \to^{\rho}_{E} (v, \rho')}{id\ e \to^{\rho}_{E} ((id, v), \rho')}$, if id is an enum identifier
(28) $\dfrac{e \to^{\rho}_{E} (v, \rho')}{id[e] \to^{\rho}_{E} (\rho'(id)(v), \rho')}$
(29) $\dfrac{n \to^{\rho \oplus \rho(id)}_{E} (v, \rho')}{id.n \to^{\rho}_{E} (v, (\rho' \setminus dom(\rho(id))) \oplus (\rho|_{dom(\rho(id))}))}$
(30) $Received\ id(id_1, \ldots, id_n) \to^{\rho}_{E} (true, \rho \oplus \{ id_i \mapsto v_i \mid 1 \le i \le n \})$, if
     $id \mapsto (v_1, \ldots, v_n) \in \rho \land id \in Lab^{\iota}$ or
     $id \in Lab^{\delta} \land \rho(id) \ne \delta_{\bot}(id)$
(31) $Received\ id(id_1, \ldots, id_n) \to^{\rho}_{E} (false, \rho)$, if
     $id \in Lab^{\iota} \land \neg\exists v : id \mapsto v \in \rho$ or
     $id \in Lab^{\delta} \land \rho(id) = \delta_{\bot}(id)$
(32) $\dfrac{e_1 \to^{\rho}_{E} (v_1, \rho_1), \ldots, e_n \to^{\rho_{n-1}}_{E} (v_n, \rho_n)}{p(e_1, \ldots, e_n) \to^{\rho}_{E} [\![p]\!](\rho_n, v_1, \ldots, v_n)}$
These rules fix the evaluation order of expressions to be from left to right.
Rules (30) and (31) are interesting. They define the value of the Received
construct to be true if the signal with the given identifier is either an
instantaneous signal and is present in the environment or is a delayed signal
and is set to something different from the default value for that signal. Rule
(32) uses the meaning $[\![p]\!]$ of the predefined function p, applied to the
meanings of the arguments of p, to define the semantics. This meaning includes
the side effects that the evaluation of p has on the environment. Thus,
$[\![p]\!](\rho, v) = (v', \rho')$, where $v'$ is the value computed by p and
$\rho'$ is the new environment.
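The evaluation rules for expressions translate fairly directly into a recursive function that returns a value together with a possibly extended environment. The sketch below is a simplification under stated assumptions: environments are Python dicts, an absent instantaneous signal is simply missing from the dict, the inactive value of a delayed signal is modeled by a single marker `INACTIVE`, and predefined functions are passed in as Python callables of the shape `(rho, *args) -> (value, rho)`. Enum constructors, arrays and structure selection are omitted.

```python
# Assumed representation: rho is a dict from names to values.
INACTIVE = None  # stands in for the value given by the "inactive" environment


def eval_expr(e, rho, delayed, prims):
    """Evaluate e under rho; return (value, new_rho). Never mutates rho."""
    kind = e[0]
    if kind == "lit":                        # rules (22), (23)
        return e[1], rho
    if kind == "id":                         # rule (25)
        return rho[e[1]], rho
    if kind == "tuple":                      # rule (24): left to right
        vals = []
        for sub in e[1]:
            v, rho = eval_expr(sub, rho, delayed, prims)
            vals.append(v)
        return tuple(vals), rho
    if kind == "received":                   # rules (30), (31)
        sig, binders = e[1], e[2]
        if sig in delayed:
            present = rho.get(sig, INACTIVE) is not INACTIVE
        else:
            present = sig in rho
        if not present:
            return False, rho
        new_rho = dict(rho)                  # bind the signal arguments
        new_rho.update(zip(binders, rho[sig]))
        return True, new_rho
    if kind == "call":                       # rule (32)
        vals = []
        for sub in e[2]:
            v, rho = eval_expr(sub, rho, delayed, prims)
            vals.append(v)
        return prims[e[1]](rho, *vals)
    raise ValueError(f"unknown expression {e!r}")


prims = {"+": lambda rho, a, b: (a + b, rho)}
rho = {"x": 4, "set": (100,)}
ok, rho_after = eval_expr(("received", "set", ["a"]), rho, set(), prims)
```

After the call, `ok` is true and `rho_after` binds the local variable `a` to the signal argument, which is the side effect that makes it necessary to thread the environment through expression evaluation.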
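With these evaluation rules, we can then define the semantics of a program
p. We write $p \to^{\tau}_{p} \tau'$ if the program p transforms the environment
$\rho$ into an environment $\rho'$ and newly asserted delayed signals
$\rho_{\delta}$ with $\tau' = (\rho', \rho_{\delta})$. As a special case, we
write $p \to^{\tau}_{p} Break(\tau')$ if program fragment p should be ignored
after executing a Break instruction. In all rules, we abbreviate the pair
$(\rho, \rho_{\delta})$ by $\tau$.
(33) $p' \to^{Break(\tau)}_{p} Break(\tau)$
(34) $Break \to^{\tau}_{p} Break(\tau)$, if $\tau \ne Break(\tau')$
(35) $Skip \to^{\tau}_{p} \tau$
(36) $\dfrac{e \to^{\rho}_{E} ((v_1, \ldots, v_n), \rho')}{(id_1, \ldots, id_n) \leftarrow e \to^{\tau}_{p} (\rho' \oplus \{ id_i \mapsto v_i \mid 1 \le i \le n \}, \rho_{\delta})}$
(37) $\dfrac{e \to^{\rho}_{E} (v, \rho')}{id \leftarrow e \to^{\tau}_{p} (\rho'[id \mapsto v], \rho_{\delta})}$
(38) $\dfrac{e' \to^{\rho}_{E} (v_1, \rho_1), \quad e \to^{\rho_1}_{E} (v_2, \rho_2)}{id[e'] \leftarrow e \to^{\tau}_{p} (\rho_2[id \mapsto (\rho_2(id)[v_1 \mapsto v_2])], \rho_{\delta})}$, if $1 \le v_1 \le n$
(39) $\dfrac{n \leftarrow e \to^{(\rho \oplus \rho(id), \rho_{\delta})}_{p} (\rho', \rho'_{\delta})}{id.n \leftarrow e \to^{\tau}_{p} ((\rho' \setminus dom(\rho(id))) \oplus (\rho|_{dom(\rho(id))}), \rho'_{\delta})}$
(40) $\dfrac{e \to^{\rho}_{E} (true, \rho'), \quad s_1 \to^{\rho'}_{p} \rho_2}{If\ e\ Then\ s_1\ Else\ s_2\ Fi \to^{\tau}_{p} (\rho_2, \rho_{\delta})}$
(41) $\dfrac{e \to^{\rho}_{E} (false, \rho'), \quad s_2 \to^{\rho'}_{p} \rho_2}{If\ e\ Then\ s_1\ Else\ s_2\ Fi \to^{\tau}_{p} (\rho_2, \rho_{\delta})}$
(42) $\dfrac{p@Label \to^{\tau}_{p} \tau'}{Goto\ Label \to^{\tau}_{p} \tau'}$
(43) $\dfrac{p@Label \to^{\tau}_{p} Break(\tau')}{Goto\ Label \to^{\tau}_{p} Break(\tau')}$
(44) $\dfrac{e \to^{\rho}_{E} (v, \rho')}{Emit\ id(e) \to^{\tau}_{p} (\rho' \oplus \rho'', \rho_{\delta} \oplus \rho'_{\delta})}$, where
     $\rho'' = \{ id \mapsto v \}$ if $id \in Lab^{\iota}$, and $\emptyset$ otherwise;
     $\rho'_{\delta} = \{ id \mapsto v \}$ if $id \in Lab^{\delta}$, and $\emptyset$ otherwise
(45) $\dfrac{e_1 \to^{\rho}_{E} (v_1, \rho_1), \ldots, e_n \to^{\rho_{n-1}[id_{n-1} \mapsto v_{n-1}]}_{E} (v_n, \rho_n), \quad s \to^{(\rho_n[id_n \mapsto v_n], \rho_{\delta})}_{p} \tau'}{s\ Where\ id_1 = e_1; \ldots\ id_n = e_n; \to^{\tau}_{p} ((\rho' \setminus M) \oplus (\rho|_{M}), \rho'_{\delta})}$, and $M = \{ id_1, \ldots, id_n \}$
(46) $\dfrac{s_1 \to^{\tau}_{p} \tau', \quad s_2 \to^{\tau'}_{p} \tau_2}{s_1\ s_2 \to^{\tau}_{p} \tau_2}$
(47) $\dfrac{e_1 \to^{\rho}_{E} (v_1, \rho_1), \quad e_2 \to^{\rho_1}_{E} (v_2, \rho_2)}{For\ id = e_1\ To\ e_2 : s \to^{\tau}_{p} (\rho_2, \rho_{\delta})}$, if $v_1 > v_2$
(48) $\dfrac{e_1 \to^{\rho}_{E} (v_1, \rho_1), \quad e_2 \to^{\rho_1}_{E} (v_2, \rho_2), \quad s \to^{(\rho_2[id \mapsto v_1], \rho_{\delta})}_{p} \tau_3, \quad For\ id = v_1 + 1\ To\ v_2 : s \to^{\tau_3}_{p} \tau_4}{For\ id = e_1\ To\ e_2 : s \to^{\tau}_{p} ((\rho_4 \setminus \{id\}) \oplus (\rho_2|_{\{id\}}), \rho_{\delta 4})}$
(49) $\dfrac{e_1 \to^{\rho}_{E} (v_1, \rho_1), \quad e_2 \to^{\rho_1}_{E} (v_2, \rho_2), \quad s \to^{(\rho_2[id \mapsto v_1], \rho_{\delta})}_{p} Break(\tau')}{For\ id = e_1\ To\ e_2 : s \to^{\tau}_{p} ((\rho' \setminus \{id\}) \oplus (\rho_2|_{\{id\}}), \rho'_{\delta})}$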
These rules use induction on the program length only, except rule (48) for the
For construct. But here, the expressions $e_1$ and $e_2$ are evaluated only once,
and because the set Int is finite, there can be only finitely many applications
of those rules. Note that the constraints on the label of a Goto statement
ensure that rule (42) performs induction on a smaller part of the program p.
Therefore, there is at most one $(\rho', \rho_{\delta})$ for a $\tau$ with
$p \to^{\tau}_{p} (\rho', \rho_{\delta})$, guaranteeing the well defined-ness of
$\Rightarrow^{u}_{i}$.
This concludes the concrete semantics for our state update.
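To illustrate how the For and Break rules interact, the following Python sketch interprets a small subset of the statements (Skip, Break, assignment to an identifier, sequencing, If and For). It is a sketch under assumptions and deviates from the rules in two labeled ways: Break is modeled with a Python exception instead of the Break marker that rule (33) propagates, and the delayed signals are threaded through all statements. The statement encoding and the parameter `eval_expr` (a function from an expression and an environment to a value and an environment) are hypothetical.

```python
class BreakSignal(Exception):
    """Carries the state tau at the point where Break was executed."""

    def __init__(self, tau):
        super().__init__("Break")
        self.tau = tau


def run(stmt, tau, eval_expr):
    """Execute stmt on tau = (rho, rho_delta); return the resulting pair."""
    rho, rho_d = tau
    kind = stmt[0]
    if kind == "skip":                               # rule (35)
        return tau
    if kind == "break":                              # rule (34)
        raise BreakSignal(tau)
    if kind == "assign":                             # rule (37)
        v, rho1 = eval_expr(stmt[2], rho)
        return {**rho1, stmt[1]: v}, rho_d
    if kind == "seq":                                # rule (46)
        for s in stmt[1]:
            tau = run(s, tau, eval_expr)
        return tau
    if kind == "if":                                 # rules (40), (41)
        v, rho1 = eval_expr(stmt[1], rho)
        return run(stmt[2] if v else stmt[3], (rho1, rho_d), eval_expr)
    if kind == "for":                                # rules (47)-(49)
        _, var, e1, e2, body = stmt
        v1, rho1 = eval_expr(e1, rho)
        v2, rho2 = eval_expr(e2, rho1)               # bounds evaluated once
        saved = {var: rho2[var]} if var in rho2 else {}
        cur = (rho2, rho_d)
        try:
            for i in range(v1, v2 + 1):
                cur = run(body, ({**cur[0], var: i}, cur[1]), eval_expr)
        except BreakSignal as b:                     # leave the innermost loop
            cur = b.tau
        out = {k: v for k, v in cur[0].items() if k != var}
        return {**out, **saved}, cur[1]              # restore the outer binding
    raise ValueError(f"unknown statement {stmt!r}")
```

Because the loop bounds are evaluated once and the loop variable is removed (or restored) afterwards, the function terminates for every finite range, which matches the termination argument given for rule (48).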
For the examples in the next two sections, we assume that a set of prede-
fined functions is available, namely addition, subtraction, multiplication, division,
remainder and negation on integers. Additionally, logical negation, logical and,
logical or, equality and inequality are assumed to be predefined. We will use the
more usual infix or prefix notation for these operations with the corresponding
symbols.
4.3 Example 1: The MCF 5307
The model for the MCF 5307 follows closely the description of its pipeline in
Section 2.5 and is depicted in Figure 4.4.
The following units exist:
IAG instruction address generation
IC1 instruction fetch cycle 1
IC2 instruction fetch cycle 2
IED instruction early decode
IB instruction buffer, the first four units correspond directly to the stages of the
IFP of the MCF 5307
EX execution unit. This unit summarizes the two stages of the OEP
SST store stall timer. This models a side condition on stores described in the
MCF 5307 manual
BU bus unit. This models the accesses to external memory and the internal
SRAM.
Figure 4.4 shows the signals between these units as arrows. In the following
sections, the various units are presented in detail together with the signals they
may receive and send out.
Here and in the next example we use a relaxed syntax compared to the formal
syntax definition of the previous sections. Also, not every function or
expression is formulated strictly in the given syntax in order not to clutter
the presentation too much.
4.3.1 Instruction Address Generation (IAG)
State and Signals
IAG carries as inner state IAGstate an address a, or none if the fetch pipeline
has been stopped. It may receive the following signals:
1. setEX(a') from the execution unit EX, where a' is the new address to be
   fetched;
2. stopEX from the execution unit EX if the processor is halted;
3. setIED(a') from instruction early decode IED, where a' is the new address
   to be fetched;
4. stopIED from instruction early decode IED if a return instruction or
   computed branch is found;
5. wait from its successor stage IC1 if the fetch pipeline is stalled.
If more than one signal is received in the same cycle, only one of them takes
effect. The signals from EX have highest priority and wait has lowest.
IAG may send a signal addr(a) to IC1, where a is an address. Unlike the
signals received by IAG, this signal is delayed, i. e. received by IC1 in the
next cycle. It models the standard way of advancing the pipeline by one step.
[Figure: the units IAG, IC1, IC2, IED, IB, EX, SST and BU, connected by arrows
for the signals addr(a), await(a), put(a), instr, start, store, next, stop,
set(a), read(a), write(a), data, hold, code(a), fetch(a), wait and cancel.]
Figure 4.4: Map of formal pipeline model
State Evolution
The inner state of IAG evolves according to the following program.
  If stop received from EX Then
    IAGstate ← none
  Else If set(a) received from EX Then
    IAGstate ← a
    Emit addr(a) signal
  Else If stop received from IED Then
    IAGstate ← none
  Else If set(a) received from IED Then
    IAGstate ← a
    Emit addr(a) signal
  Else If wait received Then
    do nothing
  Else
    If IAGstate ≠ none Then
      IAGstate ← IAGstate + 4
      Emit addr(IAGstate) signal
    Fi
  Fi
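The priority among the received signals is encoded by the order of the branches in the program above. A minimal Python sketch of the same update follows; the dictionary representation of received and emitted signals and the function name are assumptions made for this illustration.

```python
def iag_update(state, received):
    """One IAG update.

    state: current address (int) or None if the fetch pipeline is stopped.
    received: dict of signals seen this cycle, e.g. {"setEX": 0x100, "wait": ()}.
    Returns (new_state, emitted), where emitted maps the delayed signal
    name "addr" to its address argument if it was sent.
    """
    emitted = {}
    if "stopEX" in received:            # EX signals have highest priority
        state = None
    elif "setEX" in received:
        state = received["setEX"]
        emitted["addr"] = state
    elif "stopIED" in received:
        state = None
    elif "setIED" in received:
        state = received["setIED"]
        emitted["addr"] = state
    elif "wait" in received:            # lowest priority: keep the state
        pass
    elif state is not None:             # default: advance to the next address
        state += 4
        emitted["addr"] = state
    return state, emitted
```

For instance, `iag_update(0x100, {"setEX": 0x200, "wait": ()})` redirects to `0x200` although a wait signal is present, since the branch for the EX signal is tested first.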
4.3.2 Instruction Fetch Cycle 1 (IC1)
State and Signals
Like IAG, IC1 carries as inner state ICIstate an address a, or none if it is
temporarily empty. It may receive the following signals:
1. addr(a) from IAG communicating the next address to be fetched;
2. wait from its successor stage IC2 if the fetch pipeline is stalled;
3. wait from the execution unit EX if EX needs the bus for reading or writing;
4. cancel from EX or IED if prefetching is redirected or stopped.
IC1 may send a wait signal to its predecessor IAG, a fetch(a) signal to the
bus, where a is an address, and an await(a) signal to its successor IC2,
preparing it to receive the code at address a some time later. The await signal
is delayed.
State Evolution
The inner state of IC1 evolves in two steps:
  If addr(a) received Then
    ICIstate ← a
  Fi
  If cancel received Then
    ICIstate ← none
  Else If wait received Then
    If ICIstate ≠ none Then
      Emit wait signal
    Fi
  Else
    If ICIstate ≠ none Then
      Emit fetch(ICIstate) signal
      Emit await(ICIstate) signal
      ICIstate ← none
    Fi
  Fi
If IC1 need not wait and contains an address a, it puts a request for the code
at address a to the bus and changes its internal state to none which is usually
replaced by the next address in the first step of the next cycle. If not, the setting to
none prevents the same address from being fetched twice.
If IC1 is told to wait and contains none, it does not send a wait signal to IAG
because it is ready to receive a new address anyway.
If IC1 must wait and already contains an address a, it sends a wait signal to
IAG to prevent IAG from sending a new address before a is fetched.
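The two steps of the IC1 update and the three cases discussed above can be sketched in Python as follows. The representation of signals (a dict for received signals, a list of pairs for emitted ones) and the function name are assumptions for this illustration; the distinction between delayed and instantaneous signals is not modeled.

```python
def ic1_update(state, received):
    """One IC1 update.

    state: address (int) or None if IC1 is temporarily empty.
    received: dict of signal name -> argument for this cycle.
    Returns (new_state, emitted); emitted is a list of (signal, argument) pairs.
    """
    emitted = []
    if "addr" in received:                 # step 1: latch the new address
        state = received["addr"]
    if "cancel" in received:               # step 2
        state = None
    elif "wait" in received:
        if state is not None:              # stall the predecessor only if occupied
            emitted.append(("wait", None))
    elif state is not None:
        emitted.append(("fetch", state))   # request the code from the bus
        emitted.append(("await", state))   # announce the address to IC2
        state = None                       # avoid fetching the same address twice
    return state, emitted
```

A call such as `ic1_update(None, {"wait": None})` returns an empty signal list, which corresponds to the case where IC1 contains none and is ready to receive a new address anyway.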
4.3.3 Bus Unit
The cache proper can be modeled as yet another unit whose inner state is a con-
crete cache state.
Cache Semantics
This inner state is updated when a fetch(a) signal from IC1 is handled. The
hardware cache will send the code at address a to IC2 some time (at least 1
cycle) after the fetch request. The exact time depends on the concrete cache
state (hit or miss), but also on the status of the bus between CPU and cache, of
the bus between cache and memory, of the write buffer between cache and memory,
and of the memory itself. All these components must also be modeled by units
with inner state, communicating by signals. The unit is left out for space and
complexity reasons. All fetch signals from IC1 are answered by code signals to
IC2 some time (at least 1 cycle) later.
The code itself is irrelevant in the model considered here, only the time of the
answer matters. For the purpose of coordinating request and answer, we include
the requested address in the answer, which is thus a signal code(a) to IC2.
4.3.4 Instruction Fetch Cycle 2 (IC2)
State and Signals
The inner state ICIIstate of IC2 is none if it is temporarily empty, w(a) if IC2
is waiting for the code at address a, and r(a) if it has received the code.
IC2 may receive the following signals:
1. await(a) from IC1 communicating the address that is currently fetched;
2. code(a) from the bus controller if the code is sent now;
3. wait from its successor unit Instruction Early Decode (IED) if the byte
   buffer in IED is full;
4. cancel from EX or IED if prefetching is redirected or stopped.
IC2 may send a wait signal to its predecessor IC1, or a delayed put(a) signal
to its successor IED, where a is an address.
State Evolution
The inner state of IC2 evolves in three steps:
  If await(a) received Then
    ICIIstate ← w(a)
  Fi
  If w(a') = ICIIstate ∧ code(a') received Then
    ICIIstate ← r(a')
  Fi
  If cancel received Then
    ICIIstate ← none
  Else If wait received Then
    Emit wait signal
  Else
    If ICIIstate = r(a'') Then
      ICIIstate ← none
      Emit put(a'') signal
    Else If ICIIstate = w(a'') Then
      Emit wait signal
    Fi
  Fi
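The three steps of the IC2 update can be sketched in Python as shown below. The tagged-tuple encoding of the states w(a) and r(a), the dict of received signals and the function name are assumptions made for this illustration.

```python
def ic2_update(state, received):
    """One IC2 update.

    state: None, ("w", a) while waiting for the code at a, or ("r", a) once
    the code has been received. received: dict of signal name -> argument.
    Returns (new_state, emitted) with emitted a list of (signal, argument).
    """
    emitted = []
    if "await" in received:                              # step 1
        state = ("w", received["await"])
    if (state is not None and state[0] == "w"
            and received.get("code") == state[1]):       # step 2
        state = ("r", state[1])
    if "cancel" in received:                             # step 3
        state = None
    elif "wait" in received:
        emitted.append(("wait", None))
    elif state is not None and state[0] == "r":
        emitted.append(("put", state[1]))                # hand the code to IED
        state = None
    elif state is not None and state[0] == "w":
        emitted.append(("wait", None))                   # still waiting for code
    return state, emitted
```

While IC2 is still waiting for the bus, every update emits a wait signal to IC1, so the fetch pipeline behind IC2 is stalled for exactly as long as the code signal is outstanding.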
4.3.5 Instruction Early Decode (IED)
State and Signals
The inner state IEDstate of IED is a pair (b, q), where b is the number of bytes
in the byte buffer of the IED, and q is a queue of instructions. The number b is
even and satisfies 0 ≤ b ≤ 8, under the assumption that 8 is the maximal capacity
of the byte buffer. The queue q = [i_n, ..., i_1] contains all instructions
which start in the byte buffer; the last one (i_n) may be incomplete. The
maximal length of q is 8/2 = 4.
IED may receive the following signals:
1. put(a) from IC2 if new code is arriving;
2. wait from its successor unit, the Instruction Buffer (IB), if the buffer is
   full;
3. cancel from the execution unit EX if prefetching is redirected.
IED may send the following signals:
1. wait to its predecessor IC2 if the byte buffer is full;
2. setIED(a) to the Instruction Address Generation IAG if the fetch pipeline is
   redirected to address a;
3. stopIED to the Instruction Address Generation IAG if the fetch pipeline is
   stopped;
4. cancel to IC1 and IC2 if the fetch pipeline is redirected or stopped;
5. instr, a delayed signal sent to IB when an instruction is forwarded to IB
   (the instruction itself is irrelevant for the purpose of this model).
State Evolution
Step 1 reacts to put(a) signals:
  If put(a) received Then
    IEDstate ← (b + 4, I_{a+2} . I_a . q)
      where I_a is the list of instructions starting at address a, which is
      either a singleton or empty. I_{a+2} is defined analogously, '.' is list
      concatenation and (b, q) = IEDstate
  Fi
Step 2:
  If cancel received from EX Then
    IEDstate ← (0, [])
  Fi
Step 3:
  If wait not received Then
    If IEDstate contains complete instruction i,
       i. e. q = q' . [i] and b ≥ w = width(i) where IEDstate = (b, q)
    Then
      Step 3a: jump, call decode
      If i is an unconditional jump or call with target a
         or a conditional jump with target a which is predicted as taken
      Then
        Emit setIED(a) signal
        Emit cancel signal
        IEDstate ← (w, [i])
      Fi
      Step 3b:
      If i is a return instruction or indirect jump Then
        Emit stopIED signal
        Emit cancel signal
        IEDstate ← (w, [i])
      Fi
      Step 4:
      Emit instr signal
      IEDstate ← (b − w, q')
        where IEDstate = (b, q' . [i]) and w = width(i)
    Fi
  Fi
Step 5:
  If IEDstate = (b, q) with b + 4 > 8 Then
    Emit wait signal
  Fi
4.3.6 Instruction Buffer (IB)
State and Signals
The instruction buffer can hold up to 8 instructions. Its purpose is the separation
of the fetch pipeline (IAG till IED) from the execution pipeline (EX, see below).
The actual instructions in the buffer are irrelevant for the purpose of this
model; only the number of instructions matters. Thus, the inner state IBstate
of IB is an integer i with 0 ≤ i ≤ 8.
IB may receive an instr signal from IED when a new instruction is arriving, a
cancel signal from the execution unit EX when fetching is stopped or redirected,
and a next signal from EX indicating that EX is ready to start the execution of the
next instruction. The next signal is repeated if IB is not able to forward the next
instruction.
IB may send a wait signal to IED when the instruction buffer is full, and a
start signal to EX when a new instruction is ready to start. The start signal is only
sent in response to a next signal. It is delayed.
State Evolution
The inner state of IB evolves in four steps.
Step 1:
  If cancel received Then
    IBstate ← 0
  Else
    Step 2:
    If instr received Then
      IBstate ← IBstate + 1
    Fi
    Step 3:
    If next received and IBstate > 0 Then
      IBstate ← IBstate − 1
      Emit start signal
    Fi
    Step 4:
    If IBstate = 8 Then
      Emit wait
    Fi
  Fi
With this modeling, an instruction can enter and leave an empty instruction
buffer in the same cycle (in the ColdFire manual [Mot00], it is indicated that an
empty instruction buffer can be bypassed). Yet a full instruction buffer refuses to
accept another instruction in the next cycle even if an instruction can be forwarded
to the execution unit in that cycle because this is not known in advance.
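Both observations can be checked on a small executable version of the four steps. The Python sketch below assumes that received signals are given as a set of names and that emitted signals are collected in a list; these representations and the function name are chosen only for this illustration.

```python
def ib_update(count, received):
    """One IB update.

    count: number of buffered instructions (0..8).
    received: set of signal names seen this cycle.
    Returns (new_count, emitted) with emitted a list of signal names.
    """
    emitted = []
    if "cancel" in received:                     # step 1
        return 0, emitted
    if "instr" in received:                      # step 2
        count += 1
    if "next" in received and count > 0:         # step 3
        count -= 1
        emitted.append("start")
    if count == 8:                               # step 4
        emitted.append("wait")
    return count, emitted
```

Calling `ib_update(0, {"instr", "next"})` leaves the buffer empty and emits start, i. e. the instruction bypasses the empty buffer within one cycle. A buffer that is full at the end of an update emits wait, regardless of whether EX will request an instruction in the following cycle.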
4.3.7 Store Stall Timer (SST)
Before we come to the description of the execution unit EX, we describe SST,
an auxiliary unit that implements a sequence-related pipeline stall described as
follows in the ColdFire manual [Mot00]:
This type of stall involves consecutive store operations, excluding the
MOVEM instruction. For all store operations (except MOVEM), cer-
tain hardware resources within the processor are marked as “busy” for
two clock cycles after the final DSOC cycle of the store instruction.
If a subsequent store instruction is encountered within this two-cycle
window, it is stalled until the resource again becomes available.
State and Signals
Hence, the inner state SSTstate of SST is an integer t with 0 ≤ t ≤ 2. SST is
activated by a delayed store signal from EX and may send wait signals to EX,
which are ignored unless another store operation is performed.
State Evolution
SST evolves in two steps:
Step 1:
  If received store Then
    SSTstate ← 2
  Fi
Step 2:
  If SSTstate > 0 Then
    SSTstate ← SSTstate − 1
    Emit wait signal
  Fi
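The two-cycle window from the manual can be reproduced by running the two steps as a small Python function. This is a sketch; the function name and the Boolean encoding of the store and wait signals are assumptions made for the illustration.

```python
def sst_update(timer, store_received):
    """One SST update.

    timer: remaining busy cycles (0..2).
    store_received: True if the delayed store signal from EX arrived.
    Returns (new_timer, emit_wait).
    """
    if store_received:          # step 1: (re)start the two-cycle window
        timer = 2
    emit_wait = False
    if timer > 0:               # step 2: count down and stall store operations
        timer -= 1
        emit_wait = True
    return timer, emit_wait


# Trace after one store: wait is emitted in exactly two consecutive cycles.
trace = []
t = 0
for store in (True, False, False):
    t, wait = sst_update(t, store)
    trace.append(wait)
```

In this trace the wait signal is emitted in the cycle in which the store signal is received and in the following cycle, and no longer in the third cycle.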
4.3.8 Execution Unit (EX)
State and Signals
EX is the most complex unit of this pipeline model. It may receive the following
signals:
1. start (delayed) from IB when a new instruction is ready to be executed.
2. wait from SST indicating that a store operation has been performed recently.
3. data from the bus controller indicating that data is being sent to EX. The
actual data values are irrelevant for the purpose of this model.
EX may send the following signals:
1. next to IB when EX is ready to execute the next instruction.
2. store (delayed) to SST to prevent store operations in the next two cycles.
3. read(a) to the bus controller when EX wishes to read the memory at address
   a. (For the moment, we assume that this address is statically known.) This
   request will be answered by a data signal in the next cycle or some time
   later. EX is stalled until the answer arrives.
4. write(a) to the bus controller when EX wishes to write to the memory at
   address a. (For the moment, we assume that this address is statically
   known.) The actual data value that is written is irrelevant in this model.
   There is no answer, so EX need not wait (except for SST induced stalls).
5. wait to IC1 when EX needs the bus connection for data accesses. This
   signal prevents IC1 from using the bus for fetching code.
6. stopEX to IAG when the processor is halted.
7. setEX(a) to IAG when prefetching should continue at a.
8. cancel to IC1, IC2, IED, and IB when the processor is halted or prefetching
is redirected.
The behavior of EX is derived from schedules telling what each instruction
does in each cycle. The schedule for an instruction heavily depends on the struc-
ture of the operands of the instruction; memory operands need to access memory
while register operands need not. To be more precise, schedules are associated
with the edges of the control flow graph. For instance, the two edges starting from
a conditional branch instruction (one to the textual successor, one to the branch
target) carry two different schedules since only one of them includes a redirection
of the fetch pipeline.
The inner state of EX is a schedule (a suffix of a schedule associated with
an edge). A schedule is a list of items. An item is either a set of events to be
performed within one cycle, or a control item that conceptually needs no time.
The following events exist:
• read(a) telling that the instruction reads from memory address a in this
  cycle (more exactly, it puts the request for reading a to the bus).
• write(a) telling that the instruction writes to memory address a in this
  cycle.
• stop when the processor is halted.
• fetch(a) when prefetching is redirected to address a.
The following control items exist:
• await indicating that EX is awaiting a data signal.
• stall indicating that the next event set contains a store operation that may
  be stalled by SST.
• store indicating that the previous event set contained a store operation that
  induces SST stalling.
await items are inserted dynamically into schedules when running them, but stall
and store control items must already be present in the schedules associated with
control flow edges. The reason is that not all store operations are followed by
store items (the ones in MOVEM are not).
A set of events may be empty. We assume that the schedule for every edge
starts with an empty event set, i. e. instructions do not perform anything relevant
in their first cycle. This is important since the first cycle of an instruction may run
in parallel with the last cycle of the previous one. This first cycle is in fact not
modeled explicitly, but implicitly by the fact that the start signal from IB to EX is
delayed.
State Evolution
The evolution of EX is modeled by the following six steps.
Step 1:
If start received Then
  EXstate ← sched(acti)
    where sched gives the schedule for an instruction
    and acti is the instruction currently being analyzed
Fi
Step 2:
If EXstate = await . s Then
  If data received Then
    EXstate ← s
  Fi
Fi
Step 3:
If EXstate = stall . s Then
  If wait not received Then
    EXstate ← s
  Fi
Fi
Step 4:
If EXstate = S . s (S an event set) Then
  If write(a) ∈ S Then
    Emit write(a) signal
    Emit wait signal to IC1
  Fi
  If read(a) ∈ S Then
    Emit read(a) signal
    Emit wait signal to IC1
  Fi
  If stop ∈ S Then
    Emit stopEX signal to IAG
    Emit cancel signals to IC1, IC2, IED, and IB
  Fi
  If fetch(a) ∈ S Then
    Emit setEX(a) signal to IAG
    Emit cancel signals to IC1, IC2, IED, and IB
  Fi
  If read ∈ S Then
    EXstate ← await . s
  Else
    EXstate ← s
  Fi
Fi
Step 5:
If EXstate = store . s Then
  Emit store signal
  EXstate ← s
Fi
Step 6:
If EXstate = ε (the empty schedule) Then
  Emit next signal
Fi
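The step sequence above can be sketched in executable form. The following Python fragment is only an illustration under stated assumptions: the representation of schedules as lists, of event sets as frozensets of (kind, address) pairs, and the names ex_step, received and new_schedule are hypothetical and not part of the model definition.

```python
AWAIT, STALL, STORE = "await", "stall", "store"  # control items (assumed encoding)

def ex_step(state, received, new_schedule=None):
    """Apply Steps 1-6 once. `state` is the remaining schedule (a list),
    `received` the set of signal names seen in this cycle.
    Returns (new_state, emitted_signals)."""
    emitted = []
    s = list(state)
    if "start" in received and new_schedule is not None:    # Step 1
        s = list(new_schedule)
    if s and s[0] == AWAIT and "data" in received:           # Step 2
        s = s[1:]
    if s and s[0] == STALL and "wait" not in received:       # Step 3
        s = s[1:]
    if s and isinstance(s[0], frozenset):                    # Step 4
        events, s = s[0], s[1:]
        for kind in ("write", "read", "stop", "fetch"):
            for k, addr in events:
                if k != kind:
                    continue
                if kind in ("write", "read"):
                    emitted += [(kind, addr), ("wait", "IC1")]
                elif kind == "stop":
                    emitted += [("stopEX", "IAG"), ("cancel", None)]
                else:  # fetch: redirect prefetching
                    emitted += [("setEX", addr), ("cancel", None)]
        if any(k == "read" for k, _ in events):
            s = [AWAIT] + s          # wait for the data signal
    if s and s[0] == STORE:                                  # Step 5
        emitted.append(("store", None))
        s = s[1:]
    if not s:                                                # Step 6
        emitted.append(("next", None))
    return s, emitted
```

In this sketch a read event inserts an await item dynamically, so the schedule only advances past it in a cycle in which the data signal is received.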
4.3.9 State Predicates
The four state predicates F, H, N and R from Definition 4.1.3 remain to be
defined for this model.
For this we introduce three new state components mcfRetired, mcfNext and
mcfHalted:
• mcfRetired is a Boolean that is set in the final step of the EX update (the
step that emits the next signal). So we add the assignment
mcfRetired ← true there.
• mcfNext is the address of the next instruction. When we finish the schedule
for an instruction in that final step of the EX update, we add the assignment
mcfNext ← target(acti) there. Here, target is the target of the instruction,
which is the next address after the instruction for non-branching instructions
and the branch target otherwise.
• mcfHalted signals that the last instruction of the program has been retired.
In that case it is also set to true in the final step of the EX update.
Then the predicates can be defined by
• F(n, s) := s(mcfRetired)
• N(n, s) := (n = s(mcfNext))
• H(s) := s(mcfHalted)
• R(n, s) := s[mcfRetired ← false]
4.4 Example 2: The PPC 755
A model for the PPC750 can be given by modeling the functional units and in-
troducing signals exchanged between them. The evolution of the pipeline is then
modeled cycle-wise by giving an update order of the functional units and de-
scribing how their inner state evolves depending on the state and signals received.
Signals come in two flavors: instantaneous and delayed. Instantaneous signals
are received in the same cycle as they are generated, while delayed signals are
generated in one cycle and received in the next cycle.
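The difference between the two signal flavors can be illustrated by a small sketch. The class SignalBoard and its method names are hypothetical and only serve to show the timing: an instantaneous signal is visible to units updated later in the same cycle, a delayed one only after the cycle boundary.

```python
class SignalBoard:
    """Minimal sketch of signal delivery (assumed design, not the thesis code)."""

    def __init__(self):
        self.now = set()      # signals receivable in the current cycle
        self.pending = set()  # delayed signals, receivable in the next cycle

    def emit(self, name, delayed=False):
        # Delayed signals are parked until the next cycle starts.
        (self.pending if delayed else self.now).add(name)

    def received(self, name):
        return name in self.now

    def end_cycle(self):
        # Instantaneous signals expire; delayed ones become visible.
        self.now, self.pending = self.pending, set()
```

A unit that is updated later in the update order would call received() within the same cycle to observe an instantaneous signal such as flush.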
We suggest the following functional units in the pipeline model, cf. Figure 4.5:
• CSU: the chip set unit. It receives signals from the BU requesting transfers.
It delivers the data to the BU via signals. This unit models the memory and
other devices in the system, e.g. SDRAM and the PCI bus.
• BU: the bus unit. It receives signals requesting instruction fetch and/or
data reads/writes. It emits answering signals to these requests. The
instruction/data caches are modeled in this unit.
• FBPU: the fetch and branch prediction unit. This unit is the most complex
unit in the model, since it has to reflect the complicated dependencies
between instruction fetching, dispatching, branch folding and fall-through,
coupled with speculative execution. It requests instructions from the BU
and emits signals to reflect possible dispatch of instructions from the IQ as
well as mispredictions of speculative branches and correct predictions. It
receives instructions from the BU and dispatch notifications from the DU.
• DU: the dispatch unit models the dispatch rules of the instructions in IQ0/1,
which it receives from the FBPU. It dispatches instructions according to
their resource needs and the status of the functional units, which it monitors
from signals sent to it. Here also the execution, completion and refetch
serialization instruction semantics is handled by delaying dispatch until other
instructions have completed.
Figure 4.5: Model of the pipeline. (Units shown: CSU, BU, FBPU, DU, IU1, IU2,
SRU, FPU, LSU, CU, CQ. Signals shown: ts(Addr, type, len), aack, ta,
fetch(Addr_Len), fetched(Len), read(Addr_Len), store(AddrL), data, working,
finished, IQ(Addr[2], Bool[2]), reserve(CQIndex[2],Bool[2],Unit[2]),
flush/resolve, free(int:2), retired(RRusage[2]), stall, busy, done.)
• IU1 and IU2: the integer units directly model the concrete units. They
receive instructions from the DU and emit signals if they are busy or have
completed an instruction.
• SRU: the system register unit is modeled in a similar way to the IU1/2 units.
• LSU: the load store unit. This unit is a little bit more complex, since it is
pipelined and has to reflect this inner pipeline. In addition, some
instructions (e.g. cache manipulation instructions) must be handled in a special
way. It emits read requests to the BU, and store requests for multiple store
instructions. Single stores are handled by the CU. It receives read
acknowledgements from the BU and keeps track of stores and a busy BU to make
sure that the internal data bus is single threaded.
• FPU: the floating point unit is modeled to reflect its three stages and the fact
that special instructions occupy the whole pipeline.
• CU: the completion unit sends out retirement signals and holds to prevent
the DU from issuing instructions. It receives instructions from the DU and
the completion acknowledgments from the functional units. It signals the
LSU when a store instruction retires.
• CQ: the completion queue is a common data pool, containing the dispatched
instructions and their status.
The update order of the units is: FBPU, DU, CU, LSU, FPU, IU1, IU2, SRU,
BU and CSU.
The following signals are sent and received by/from these units, as indicated
in Figure 4.5:
• Instantaneous signals:
– IQ(Addr[2],Bool[2]): from FBPU to DU. This signal contains the
addresses of the two instructions in IQ0/1 (or none if a slot is empty),
together with flags indicating if these instructions are executed
speculatively.
– flush: from FBPU to CU and the functional units (FUs). This indicates
that a speculative branch was mispredicted and all speculative
instructions should be discarded.
– resolve: from FBPU to CU and the FUs. This signal indicates that
a speculation has been resolved correctly. All speculative instructions
now become non-speculative.
– reserve(CQIndex[2],Bool[2],Unit[2]): from DU to CU and the FUs.
This signal denotes the dispatch of the instructions with the
corresponding indices in CQ (or none if a slot is empty) to the functional
unit indicated by the third argument. The second argument is again a
flag denoting speculative execution of this instruction.
– fetch(AddrLen): from FBPU to BU. This signal requests instructions
to be fetched by the bus unit. The address and size of the instruction
block to fetch is given by the argument.
– read(AddrLen): from LSU to BU. This signal requests a data read of
the given address and length to be performed.
– store(AddrLen): from LSU to BU. This signal requests a data store
of the given address and length to be performed.
– finished: from CU to LSU. This signal flags the retirement of a
single-store operation.
• Delayed signals:
– fetched(int:3): from BU to FBPU. This signals the completion of an
instruction fetch. The length of the fetched instruction block in words
is given as argument to account for cases where fewer instructions than
requested could be fetched. (Footnote 6: It is unclear if this behavior
actually happens.)
– free(int:2): from DU to FBPU. This gives the number of instructions
to remove from the beginning of the IQ in the next cycle due to
dispatch in this cycle.
– stall: from CU to DU. This signal inhibits dispatch of new instructions
to implement serialization semantics.
– busy: from each FU to DU. A functional unit sending this signal
cannot accept another instruction being dispatched to it.
– retired(RRusage[2]): from CU to DU. This signals the retirement of
up to two instructions. It gives the number of rename registers freed
by the retirement of the instructions (none, if a slot is unused).
– done: from each FU to CU. This indicates that the first instruction in
the functional unit has completed execution.
– data: from BU to LSU. Indicates that a read request has finished and
the data is available.
– working: from BU to LSU. This signal indicates that the bus unit
cannot accept another read or write request in the next cycle.
– TS(a, t, l): from BU to CSU. This signal indicates the start of an
access at address a of type t ∈ {D, S, I} for l double words. The type
corresponds to data reads, data stores and instruction fetches. If l = 0
then this indicates a sub double word access.
– AACK: from CSU to BU. This signals the acknowledgment of the TS
signal by the CSU.
– TA: from CSU to BU. This signal is the indication that the next data
beat of a transfer has finished.
In the following sections the units will be discussed in more detail, including
their inner state and the cycle evolution.
When giving the inner state we will use a C-like declaration style, using arrays
with square brackets denoting element access and structures with dot (.) denoting
component selection.
An important concept is that of the index. An index denotes an element in
the IQ or the CQ. An index can be of type IQIndex if it points into the IQ, of
type CQIndex if it points into the CQ or of type Index if it can point into either
one. Since the IQ and CQ are modified very often, all indices have to be adjusted if
elements are removed from the queues. In the following when we say that “indices
are updated” we mean that all indices in all states are automatically adjusted.
4.4.1 FBPU
Inner State
The following types are defined:
Name    Component           Description
IQ      Addr address        Address of instruction
        int predLevel       Prediction level of instruction
State   WAIT(Addr, len)     Waiting for instructions to arrive from bus
        STOP(Index, Addr)   Stopped
        IGNR(Addr, l)       Will ignore next instructions arriving from bus
        RUN(Addr)           Unit is running normally
        HOLD(Addr)          Unit has encountered too many predictions
This means that IQ is a record (struct) with two components, namely address
of type Addr and predLevel of type int. In contrast, State is a kind of union of
5 alternatives, where each alternative is marked by a tag (e.g. WAIT) and carries
some data (e.g. something of type Addr). The address in the state is in most cases
the address of the next instruction to be fetched. In the WAIT case, it is the address
of the first instruction to arrive from the bus.
The inner state of FBPU is the following record (struct):
Type      Name         Description
int:3     predLevel    Global state of execution prediction
IQ        iq[6]        IQ entries
State     state        Global unit state
IQIndex   insIndex     Index into iq of instruction currently considered
Index     CRDep[2]     CR dependencies for first two predictions
Addr      altAddr[2]   Alternative address for prediction
Here, iq contains the instructions in the IQ. An entry contains the address of
the instruction and its prediction level. This level is 0, if the instruction is known
to be executed, 1 if it follows a speculative branch, 2 if it follows two speculative
branches. If the address of an entry in iq is none, then this entry is empty. Since
the IQ is a FIFO, if an entry iq[n] is empty, so are all entries iq[n+1], ..., iq[5].
We use last(iq) to give the index of the last (i.e. with the highest index) entry with
an address not equal to none. If the IQ is completely empty, then last(iq) = -1.
By free(iq) we denote the number of entries that are empty.
lastCR(i) is the index of the last instruction that writes into CR and comes
before the instruction with index i; lastCTR(i) is defined likewise.
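As an illustration of these two helper functions, a minimal Python sketch is given below. It assumes, hypothetically, that the IQ is represented as a list of (address, prediction level) pairs and that none is encoded as Python's None.

```python
def last(iq):
    """Index of the last entry whose address is not none, or -1 if iq is empty."""
    for idx in range(len(iq) - 1, -1, -1):
        if iq[idx][0] is not None:
            return idx
    return -1

def free(iq):
    """Number of empty entries in iq."""
    return sum(1 for addr, _ in iq if addr is None)
```

Because the IQ is a FIFO, all non-empty entries precede the empty ones, so free(iq) equals len(iq) - 1 - last(iq) in this representation.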
State Evolution
The inner state of the FBPU evolves in one cycle by executing the following micro
steps in the given order:
Step 1: This step throws out entries in IQ which have been dispatched in the
cycle before
If a free(n) signal is received, then the entries iq[0], ..., iq[n-1] are
discarded, i.e. iq[0] ← iq[n], ... etc. The last n entries in iq that are not empty
are set to be empty.
insIndex is adjusted.
Step 2: This step starts the fetching after a STOP condition has resolved
If state is STOP(idx, addr) and the instruction denoted by idx has completed in
CQ, then state ← RUN(addr)
Step 3: This step inserts fetched instructions into the IQ
If state is WAIT(addr, l) and fetched(len) has been received, then
  If len = l then
    state ← RUN(addr + 4 * len)
  Else
    state ← WAIT(addr + 4 * len, l - len)
  Fi
  Append len instructions at the end of iq,
  i.e. i = last(iq);
    iq[i+1] ← (addr, predLevel);
    ...
    iq[i+len] ← (addr + 4 * (len - 1), predLevel)
  Here predLevel comes from the state. insIndex is set to the index in iq of the
  first new instruction, i.e. insIndex ← i + 1.
Fi
Step 4: This step skips over ignored instruction fetches
If state is IGNR(addr, l) and fetched(x) is received then
  If x = l then
    state ← RUN(addr)
  Else
    state ← IGNR(addr, l - x)
  Fi
Fi
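The state transitions of Steps 3 and 4 can be sketched as follows. This is a simplified illustration only: states are encoded as Python tuples, the function name on_fetched is hypothetical, and the bookkeeping of predLevel and insIndex is left out.

```python
def on_fetched(state, n):
    """React to a fetched(n) signal.
    Returns (new_state, addresses_to_append_to_iq)."""
    tag = state[0]
    if tag == "WAIT":                       # Step 3: instructions arrive
        _, addr, l = state
        if n == l:
            nxt = ("RUN", addr + 4 * n)     # everything requested has arrived
        else:
            nxt = ("WAIT", addr + 4 * n, l - n)
        return nxt, [addr + 4 * k for k in range(n)]
    if tag == "IGNR":                       # Step 4: arriving words are dropped
        _, addr, l = state
        nxt = ("RUN", addr) if n == l else ("IGNR", addr, l - n)
        return nxt, []
    return state, []                        # other states ignore the signal
```

Note that in the IGNR case the address is not advanced, since it already denotes the redirected fetch target rather than the block that is being ignored.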
Step 5: This step checks for resolved predictions
Step 5a: Check for resolving of first level prediction
If d1 = CRDep[0] ≠ none and d1 has completed in CQ then
  // CR dependency resolved
  CRDep[0] ← none
  good ← 'prediction was correct'
  // Dependency resolved: decrement prediction level, restart fetching
  predLevel ← predLevel - 1
  If good then
    // Our prediction was right. Just adjust instructions in IQ
    If state = HOLD(a) then
      state ← RUN(a)
      iq[insIndex].pred ← iq[insIndex].pred - 1
    Fi
    ∀ 0 ≤ i ≤ last(iq): iq[i].pred ← max(0, iq[i].pred - 1)
    emit resolve signal
  Else
    // Prediction was wrong: flush instructions, redirect
    ∀ 0 ≤ i ≤ last(iq):
      If iq[i].pred > 0 then
        iq[i] ← (none, 0)
      Fi
    emit flush signal
    predLevel ← 0
    insIndex ← none
    clear CRDep
    state ← IGNR(altAddr[0]), if state = WAIT(x) or IGNR(y)
            RUN(altAddr[0]),  otherwise
  Fi
  // Move second dependency to first (if any), clear second
  move CRDep[1], altAddr[1] into CRDep[0], altAddr[0]
  clear CRDep[1], altAddr[1]
  // Look at second dependency, which is the first now
  ind ← 0
Else
  // Look at second dependency
  ind ← 1
Fi
Step 5b: Maybe resolve second level prediction
If d1 = CRDep[ind] ≠ none and d1 has completed in CQ then
  // CR dependency resolved
  CRDep[ind] ← none
  good ← 'prediction was correct'
  // Dependency resolved: decrement prediction level, restart fetching
  predLevel ← predLevel - 1
  Clear CRDep[ind], altAddr[ind]
  If good then
    If state = HOLD(a) then
      // If held, reset prediction level of holding branch
      state ← RUN(a)
      iq[insIndex].pred ← iq[insIndex].pred - 1
    Fi
    // Our prediction was right. Just adjust instructions in IQ
    ∀ 0 ≤ i ≤ last(iq):
      If iq[i].pred > ind then
        iq[i].pred ← max(0, iq[i].pred - 1)
      Fi
  Else
    // Prediction was wrong: flush instructions, redirect
    ∀ 0 ≤ i ≤ last(iq):
      If iq[i].pred > ind then
        iq[i] ← (none, 0)
      Fi
    insIndex ← none
    state ← IGNR(altAddr[ind]), if state = WAIT(x) or IGNR(y)
            RUN(altAddr[ind]),  otherwise
  Fi
Fi
Step 6: This step handles decoding of instructions and branch folding, etc
If insIndex > last(iq) or insIndex = none then
  insIndex ← none
  Goto Step 7
Fi
Let i be the instruction at address iq[insIndex].addr
Step 6a: If instruction uses CTR or LR and CTR/LR is already written: Stop
If uses(i) ∩ {CTR, LR} ≠ ∅ and
   ∃ i' ∈ CQ: i' not completed and write(i') ∩ uses(i) ≠ ∅ or
   ∃ i' ∈ iq[0..insIndex-1]: write(i') ∩ uses(i) ≠ ∅ then
  state ← STOP(li, addr+4)
    where li is the highest index of the instruction i' that
    fulfills the condition in cq[0..5], iq[0..insIndex-1] and
    (addr, pl) = last(iq).
  Goto Step 7
Fi
Step 6b: This step handles refetch serialization
If i is a refetch serialization instruction then
  Clear iq[insIndex+1], ..., iq[last(iq)]
  state ← STOP(insIndex, last(iq).addr + 4)
Fi
Step 6c: This step skips non branches
If i is not a branch then goto Step 6f
Step 6d: This step handles resolved branches
If the branch condition of i is resolved then
  If i is taken then
    // Taken branch
    iq[j] ← (none, 0)
    ...
    iq[last(iq)] ← (none, 0)
      where j = insIndex + 1, if i updates the LR or CTR
                insIndex,     otherwise
    state ← RUN(addr), where addr is the target address of the branch
    insIndex ← none
    Goto Step 7
  Else
    // Fall through branch
    Goto Step 6f
  Fi
Fi
Step 6e: This step handles predicted branches
predLevel ← predLevel + 1
∀ insIndex ≤ j < last(iq):
  iq[j].pred ← predLevel
If predLevel ≥ 3 then
  // Hold because of too many predictions
  state ← HOLD(last(iq).addr + 4)
  Goto Step 7
Fi
If predLevel < 3 then
  // Fold or fall-through based on prediction
  Let t and a be the predicted target and alternative target of i:
  altAddr[predLevel - 1] ← a
  CRDep[predLevel - 1] ← none,      if i does not depend on CR
                         lastCR(i), otherwise
  If i predicted taken then
    // Fold away following instructions
    If i updates LR or CTR then
      j ← insIndex + 1
    Else
      j ← insIndex
    Fi
    iq[j] ← (none, 0)
    ...
    iq[last(iq)] ← (none, 0)
    state ← IGNR(t), if state = WAIT(x)
            RUN(t),  otherwise
    insIndex ← none
    Goto Step 7
  Fi
Fi
Step 6f: Look at next instruction
insIndex ← insIndex + 1
Goto Step 6
Step 7: This step issues next fetch
If free(iq) ≠ 0 and state is RUN(addr) then
  Emit fetch((addr, len)) signal
  state ← WAIT(addr)
    where len = min(free(iq), 4).
Fi
Step 8: This step signals dispatchable instructions
a[0] ← a[1] ← none
p[0] ← p[1] ← false
s ← 0
If iq[0].addr ≠ none then
  If instruction at iq[0].addr is a fall through instruction then
    // Remove fall through branch
    iq[0] ← iq[1]
    ...
    iq[last(iq) - 1] ← iq[last(iq)]
    Adjust insIndex
  Else
    s ← s + 1
    If iq[0].pred < 2 then
      // Not twice predicted instruction
      a[0] ← iq[0].addr
      p[0] ← (iq[0].pred > 0)
    Fi
  Fi
Fi
If iq[s].addr ≠ none then
  // Second entry in IQ
  If instruction at iq[s].addr is a fall through instruction then
    // Remove fall through branch
    iq[s] ← iq[s+1]
    ...
    iq[last(iq) - 1] ← iq[last(iq)]
    Adjust insIndex
  Else
    If iq[s].pred < 2 then
      // Not twice predicted instruction
      a[s] ← iq[s].addr
      p[s] ← (iq[s].pred > 0)
    Fi
  Fi
Fi
If s > 0 then
  Emit IQ(a, p) signal
Fi
4.4.2 CQ
The CQ only contains state that is accessed by FBPU.
The following types are defined:
Name   Component       Description
Unit   Branch          Pseudo unit for branches
       IU1
       IU2
       SRU
       LSU
       FPU
       none            No unit assigned
CQ     Addr addr       Address of instruction
       Bool spec       Is executed speculatively
       Bool complete   Is finished
       Bool ready      Is ready to execute
       Unit unit       Unit this instruction is executed on
The CQ contains the following state:
Type   Name    Description
CQ     cq[6]   Completion queue
We define last(cq) and free(cq) as in the case of the FBPU.
4.4.3 DU
Inner State
The DU has as inner state the number of free rename registers. The following type
is defined:
Name      Component    Description
RRUsage   int:3 GPRs   Number of free GPR rename registers
          int:3 FPRs   Number of free FPR rename registers
          int:1 CTRs   Number of free CTR rename registers
          int:1 CRs    Number of free CR rename registers
          int:1 LRs    Number of free LR rename registers
Here int:6 is the type of numbers n with 0 ≤ n ≤ 6, and likewise for int:1.
The inner state is an RRUsage object telling the number of free rename
registers of the various kinds:
Type      Name     Description
RRUsage   rrfree   Number of free rename registers
We define addition and subtraction operations on RRUsage in the canonical
manner, together with comparison.
For an instruction i we write rruse(i) for the static rename register
requirements of that instruction; we write unit(i) for the set of all functional
units this instruction may be dispatched to.
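One possible reading of the "canonical" operations on RRUsage is component-wise arithmetic and component-wise comparison. The following Python sketch illustrates this reading; the field names and the example values are assumptions made for the illustration only.

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class RRUsage:
    gprs: int = 0
    fprs: int = 0
    ctrs: int = 0
    crs: int = 0
    lrs: int = 0

    def __add__(self, other):
        # Component-wise addition, e.g. when registers are freed on retirement.
        return RRUsage(*(x + y for x, y in zip(astuple(self), astuple(other))))

    def __sub__(self, other):
        # Component-wise subtraction, e.g. when an instruction is dispatched.
        return RRUsage(*(x - y for x, y in zip(astuple(self), astuple(other))))

    def __le__(self, other):
        # rruse(i) <= rrfree holds only if every register kind suffices.
        return all(x <= y for x, y in zip(astuple(self), astuple(other)))
```

With this reading, the dispatch test rruse(a[0]) ≤ rrfree fails as soon as a single kind of rename register is exhausted, even if all other kinds are still available.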
State Evolution
The following microsteps are performed on each cycle update by the DU.
Step 1: This step handles retirement of instructions
If retired(rr) signal received then
  rrfree ← rrfree + rr[0] + rr[1]
Fi
Step 2: This step handles a flush after misprediction
If flush signal received then
  ∀ 0 ≤ j ≤ last(cq):
    If cq[j].spec then
      rrfree ← rrfree + rruse(cq[j].addr)
    Fi
Fi
Step 3: This step handles stalls
If stall signal received then
  Goto Step 5
Fi
Step 4: This step handles dispatch
If IQ(a, p) signal not received then
  Goto Step 5
Fi
Step 4a: Dispatch first instruction
n ← 0
If a[0] ≠ none and free(cq) ≠ 0 and rruse(a[0]) ≤ rrfree then
  // Can dispatch first instruction
  If a[0] is a branch then
    u ← branch
  Else
    Find first u in [IU2, IU1, LSU, SRU, FPU] which satisfies:
      busy-u signal not received and u ∈ unit(a[0])
  Fi
  If u ≠ none then
    // Free unit available
    duunit[0] ← u
    i[0] ← last(cq) + 1
    duunit[1] ← none
    i[1] ← none
    n ← n + 1
    rrfree ← rrfree - rruse(a[0])
  Fi
Fi
Step 4b: Dispatch second instruction
If n > 0 and a[1] ≠ none and free(cq) > 1 and
   rruse(a[1]) ≤ rrfree and a[0] is not completion or
   refetch serializing Then
  // Maybe dispatch second instruction
  If a[1] is a branch then
    u ← branch
  Else
    Find first u in [IU2, IU1, LSU, SRU, FPU] \ {r[0].unit} which satisfies:
      busy-u signal not received and u ∈ unit(a[1])
    Note: LSU cannot accept two instructions
    although it has two reservation stations.
  Fi
  If u ≠ none then
    // Free unit available
    i[1] ← last(cq) + 2
    duunit[1] ← u
    n ← n + 1
    rrfree ← rrfree - rruse(a[1])
  Fi
Fi
Step 4c: This step performs the dispatch
If n ≠ 0 then
  Emit reserve(i, p, duunit) signal
  Add (a[0], p[0], c_0, true, duunit[0]), ...,
      (a[n-1], p[n-1], c_(n-1), true, duunit[n-1]) to CQ
    where c_i is true, if the instruction at address a[i]
    is a branch, false otherwise
  Note: ready is initially true, but may be set to false within CU.
  Emit free(n) signal
Fi
Step 5: The End
Noop
4.4.4 IU1, IU2, SRU
These three functional units work in the same way and have the same inner state.
Therefore, in this section we write U to denote one of IU1, IU2 or SRU.
Inner State
Unit U has the following inner state:
Type      Name     Description
CQIndex   res      Instruction in reservation station
CQIndex   work     Instruction executing in unit
int       cycles   Cycles remaining for executing instruction
We will write cycles(i) for an instruction i to denote the number of cycles this
instruction takes to completely execute.
State Evolution
The following micro steps are executed on each cycle update for U :
Step 1: This step handles a new instruction
If reserve(i, p, u) signal received then
  If i[0] ≠ none and u[0] = U then
    // Dispatch to this unit
    res ← i[0]
  Else If i[1] ≠ none and u[1] = U then
    res ← i[1]
  Fi
Fi
Step 2: This step handles a flush signal
If flush signal received then
  If res ≠ none and cq[res].spec then
    // Flushed from CQ, flush it from U
    res ← none
  Fi
  If work ≠ none and cq[work].spec then
    work ← none
    cycles ← 0
  Fi
Fi
Step 3: This step advances reservation to work
If res ≠ none and cq[res].ready = true and work = none then
  work ← res
  res ← none
  cycles ← cycles(cq[work].addr)
Fi
Step 4: This step does some work
If work ≠ none then
  cycles ← cycles - 1
  If cycles = 0 then
    // Done
    work ← none
    Emit done-U signal
  Fi
Fi
Step 5: This step handles the busy condition
If res ≠ none then
  Emit busy-U signal
Fi
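The interplay of reservation station, execution and the done/busy signals can be tried out with a small sketch. The class below is a simplified illustration, not the thesis implementation: the flush step is omitted, readiness is passed in as a flag, and the latency table standing in for cycles(i) is a hypothetical assumption.

```python
class SimpleUnit:
    """Sketch of a unit U with one reservation station and one execution slot."""

    def __init__(self, name, latency):
        self.name = name
        self.latency = latency   # assumed lookup: instruction -> cycles(i)
        self.res = None          # instruction in reservation station
        self.work = None         # instruction executing in unit
        self.cycles = 0

    def cycle(self, dispatched=None, ready=True):
        out = []
        if dispatched is not None:                 # Step 1: new instruction
            self.res = dispatched
        # Step 2 (flush) is omitted in this sketch.
        if self.res is not None and ready and self.work is None:  # Step 3
            self.work, self.res = self.res, None
            self.cycles = self.latency[self.work]
        if self.work is not None:                  # Step 4: do some work
            self.cycles -= 1
            if self.cycles == 0:
                self.work = None
                out.append("done-" + self.name)
        if self.res is not None:                   # Step 5: busy condition
            out.append("busy-" + self.name)
        return out
```

In this sketch the unit reports busy exactly when its reservation station is still occupied at the end of the cycle, which is the condition under which the DU must not dispatch another instruction to it.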
4.4.5 LSU
Inner State
The following type is defined:
Name     Component   Description
LState   Idle        Unit is idle
         Await       Unit waits for datum
         Ignore      Unit ignores next datum
         Hold        Unit is stopped
The LSU has the following inner state:
Type      Name       Description
CQIndex   res[2]     Two reservation stages
CQIndex   ea         Instruction in EA stage
CQIndex   access     Instruction in access stage
Addr      store[3]   The store buffer (store[0] is the next to execute)
CQIndex   sIns[3]    Instruction corresponding to store buffer entry
int       sPending   Number of store and sIns entries
LState    state      State of the access machinery
int       numAcc     Number of accesses instruction will perform
State Evolution
LSU performs the following micro steps in each cycle update:
Step 1: This step handles newly dispatched instructions
If reserve(i, p, u) signal received and {u[0], u[1]} ∩ {LSU} ≠ ∅ then
  // Dispatch for us
  If u[0] = LSU then
    j ← i[0]
  Else
    j ← i[1]
  Fi
  If res[0] = none then
    res[0] ← j
  Else
    res[1] ← j
  Fi
Fi
Step 2: This step handles a flush signal
If flush signal received then
  If access ≠ none and cq[access].spec then
    If state = Await then
      state ← Ignore
    Else
      state ← Idle
    Fi
    access ← none
  Fi
  If ea ≠ none and cq[ea].spec then
    ea ← none
  Fi
  If res[1] ≠ none and cq[res[1]].spec then
    res[1] ← none
  Fi
  If res[0] ≠ none and cq[res[0]].spec then
    res[0] ← none
  Fi
  ∀ 3 ≥ j ≥ 0:
    If sIns[j] ≠ none and cq[sIns[j]].spec then
      sIns[j] ← none
      store[j] ← none
    Fi
Fi
Step 3: This step handles the resolve signal
If resolve signal received then
  If state = Hold then
    state ← Idle
  Fi
Fi
Step 4: This step handles pipeline advance from reservation
If ea = none and res[0] ≠ none and cq[res[0]].ready = true then
  ea ← res[0]
  res[0] ← res[1]
  res[1] ← none
Fi
Step 5: This step handles delayed stores
If finished signal received then
  sPending ← sPending + 1
Fi
If state = Idle and working signal not received and sPending > 0 then
  Emit store((a, l)) signal, where a = store[0]
  sPending ← sPending - 1
  store[0] ← store[1]
  store[1] ← store[2]
  store[2] ← none
  sIns[0] ← sIns[1]
  sIns[1] ← sIns[2]
  sIns[2] ← none
  Goto Step 7
Fi
Step 6: This step processes accesses
If access = none then
  Goto Step 7
Fi
Step 6a: This step handles a read
If access is not a read access then
  Goto Step 6c
Fi
Step 6b:
If state = Idle and working signal not received then
  Let a be the address of the access
  If a is guarded and cq[access].spec then
    // Speculative access to guarded space
    state ← Hold
  Else
    state ← Await
    Emit read(a) signal
  Fi
Else If state = Await and data signal received then
  numAcc ← numAcc - 1
  If numAcc = 0 then
    access ← none
    Emit done-LSU signal
  Fi
Else If state = Ignore and data signal received then
  state ← Idle
Fi
Goto Step 7
Step 6c: This step handles a write access
If access is a single store then
  If sPending < 3 then
    store[sPending] ← addr(access)
    sIns[sPending] ← access
    sPending ← sPending + 1
    Emit done-LSU signal
    access ← none
  Fi
Else
  // Multiple word/string store
  If working signal not received then
    Emit store(a) signal, where a = addr(access) + c
      where c models the progress through the word/string
    numAcc ← numAcc - 1
    If numAcc = 0 then
      access ← none
      Emit done-LSU signal
    Fi
  Fi
Fi
Step 7: This step handles pipeline advance to access
If access = none and ea ≠ none then
  access ← ea
  ea ← none
  If access is a multiple load/store then
    numAcc ← '# of accesses'
  Else
    numAcc ← 1
  Fi
Fi
Step 8: This step handles the busy signal
If res[1] ≠ none then
  Emit busy-LSU signal
Fi
4.4.6 FPU
Inner State
The FPU contains the following inner state:
Type      Name        Description
CQIndex   res         Reservation station
CQIndex   work[3]     Stages
int       cycles[3]   Cycles remaining in stages
Bool      block       We are executing a blocking instruction
State Evolution
The FPU performs the following micro steps on each cycle update:
Step 1: This step handles new instructions
If reserve(i, p, u) signal received and {u[0], u[1]} ∩ {FPU} ≠ ∅ then
  If u[0] = FPU then
    res ← i[0]
  Else
    res ← i[1]
  Fi
Fi
Step 2: This step handles the flush signal
If flush signal received then
  If res ≠ none and cq[res].spec then
    res ← none
  Fi
  ∀ 0 ≤ j ≤ 2:
    If work[j] ≠ none and cq[work[j]].spec then
      work[j] ← none
      cycles[j] ← 0
      If j = 0 then
        block ← false
      Fi
    Fi
Fi
Step 3: This step advances the reservation station
If res ≠ none and cq[res].ready = true and work[0] = none then
  work[0] ← res
  res ← none
  If isblocking(cq[work[0]].addr) then
    // A blocking instruction
    block ← true
  Fi
  cycles[0] ← cycles(cq[work[0]].addr)
Fi
Step 4: This step does some work
∀ 0 ≤ j ≤ 2:
  If work[j] ≠ none then
    cycles[j] ← max(0, cycles[j] - 1)
  Fi
Step 5: This step advances the pipeline
If work[2] ≠ none and cycles[2] = 0 then
  Emit done-FPU signal
  work[2] ← none
Fi
If work[1] ≠ none and cycles[1] = 0 and work[2] = none then
  work[2] ← work[1]
  work[1] ← none
  cycles[2] ← fpucycles(3, cq[work[2]].addr)
Fi
If work[0] ≠ none and cycles[0] = 0 and work[1] = none and
   block = false then
  work[1] ← work[0]
  work[0] ← none
  cycles[1] ← fpucycles(2, cq[work[1]].addr)
Fi
Step 6: This step handles finish of blocking instruction
If block and cycles[0] = 0 and work[1] = work[2] = none then
  block ← false
  work[0] ← none
  Emit done-FPU signal
Fi
Step 7: This step handles the busy signal
If res ≠ none then
  Emit busy-FPU signal
Fi
4.4.7 CU
Inner State
The CU only uses the CQ as state.
State Evolution
We write gprups(i) to denote the number of GPR updates of instruction i; likewise
with fprups(i) and FPR updates.
These micro-steps are performed on each cycle update by CU:
Step 1: This step handles finished instructions
If done-U signal received then
  Let j be the smallest index in {0, ..., last(cq)} with
    cq[j].complete = false and cq[j].unit = U
  cq[j].complete ← true
Fi
Step 2: This step handles a flush signal
If flush signal received then
  ∀ 0 ≤ j ≤ last(cq):
    If cq[j].spec = true then
      cq[j] ← (none, true, false, false, none)
      Note: The spec field remains true because it is still accessed.
    Fi
Fi
Step 3: This step handles a resolve signal
If resolve signal received then
  ∀ 0 ≤ j ≤ last(cq):
    cq[j].spec ← false
Fi
Step 4: This step handles serialization
∀ 0 ≤ j ≤ last(cq):
  cq[j].ready ← true
  If cq[j] is execution serialization and j ≠ 0 then
    // Cannot execute if there is a predecessor in CQ
    cq[j].ready ← false
  Fi
  If cq[j] is completion serialization then
    // Cannot dispatch another instruction
    Emit stall signal
  Fi
Step 5: This step handles stalls due to operands
∀ 0 ≤ j ≤ last(cq):
  If cq[j].ready = true then
    ∀ o ∈ operands(cq[j]):
      k ← the largest index in {0, ..., j-1} with o ∈ write(k)
      If k ≠ none and cq[k].complete = false then
        cq[j].ready ← false
      Fi
  Fi
Step 6: This step handles retirement
n ← 0
If cq[0].addr ≠ none and cq[0].complete then
  Note: cq[0] cannot be speculative; it cannot depend on previous instructions.
  n ← 1
  rr ← rruse(cq[0])
  If cq[1].addr ≠ none and cq[1].complete then
    Note: Here cq[1] cannot be speculative either.
    If cq[1] is an integer or load instruction and
       gprups(cq[0]) + gprups(cq[1]) ≤ 2 and
       fprups(cq[0]) + fprups(cq[1]) ≤ 2 then
      // We can retire two instructions
      n ← 2
      rr ← rr + rruse(cq[1])
    Fi
  Fi
Fi
If n > 0 then
  Emit retired(rr) signal
  If cq[0], ..., cq[n-1] contain a single store instruction then
    Emit finished signal
  Fi
  Remove cq[0], ..., cq[n-1] from CQ
Fi
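The retirement rule of Step 6 can be illustrated by a short sketch that only computes how many instructions retire in a cycle. The dictionary-based CQ entries and the key names complete, kind, gprups and fprups are assumptions made for this illustration.

```python
def retire_count(cq):
    """Number of instructions (0, 1 or 2) retired from the head of cq."""
    if not cq or not cq[0]["complete"]:
        return 0                      # head of the queue has not finished
    if len(cq) > 1 and cq[1]["complete"]:
        first, second = cq[0], cq[1]
        if (second["kind"] in ("integer", "load")
                and first["gprups"] + second["gprups"] <= 2
                and first["fprups"] + second["fprups"] <= 2):
            return 2                  # both fit into the register write budget
    return 1
```

The two bounds on GPR and FPR updates reflect the condition in Step 6 that at most two registers of each kind are written by the retiring pair.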
4.4.8 BU
Inner State
The following types are defined:
Name       Component                    Description
IBCState   HH                           Instruction fetch is a hit-hit
           HM                           Fetch is a hit-miss
           MH                           Fetch is a miss-hit
           MM                           Fetch is a miss-miss
ACSource   D                            Access is a data read
           S                            Access is a store or cache line flush
           I                            Access is an instruction fetch
           N                            No access
Schedule   ts(Addr a, Bool isburst)     Start access by emitting a TS signal
           aack                         Wait for AACK signal
           emit(D)                      Emit a data signal
           emit(F)                      Emit a fetched signal
           ta                           Wait for a TA signal
Access     ACSource src                 The source of the access
           Addr CL                      The cacheline affected (if any)
           Schedule schedule[]          The schedule for the access
The bus unit contains the following inner state:
Type       Name      Description
Addr       IBAddr    Instruction address in instruction buffer or none
int:3      IBLen     # of instructions to fetch
IBCState   IBC       Cache behavior of instruction fetch
Bool       IBSched   Has fetch been put into acc?
Addr       DBAddr    Data read address in data buffer or none
int        DBLen     Length of data access in data buffer
Bool       DBSched   Has read been put into acc?
Addr       SBAddr    Data store address in store buffer or none
int        SBLen     Length of store in store buffer
Bool       SBSched   Has write been put into acc?
Access     acc[4]    Two pipelineable accesses and two split accesses
The state of the bus unit consists of four groups:
• The Instruction Buffer (IB) holds a request for an instruction fetch from
  the FBPU. The request remains in the IB until the instruction fetch has
  completed and the instructions have been returned to the FBPU. The IB
  is filled by a fetch(AddrLen) signal. It has four components: the address
  to fetch from, the number of instructions to fetch, the cache behavior, and
  whether it has already been put into acc. The cache behavior determines
  whether the instruction fetch is a miss or a hit for each of the two lines
  that may be involved. If the instruction fetch spans only one cache line (or
  none, if instructions are not fetched from cacheable memory), the other
  components are assumed to be ‘hits’. For example, an access covering two
  lines, where the first line is a cache miss and the second a cache hit, has
  the cache behavior classification ‘MH’. An access that covers one line and
  misses in that line is also classified as ‘MH’. The case ‘HH’ means that all
  line accesses hit in the cache, but that one of the two possible lines is
  currently still being loaded over the bus (by a previous access). Such an
  access has to wait until the current bus access has completed.
• The Data Buffer (DB) holds a read request from the LSU. It is filled upon
  receipt of a read(AddrLen) signal from the LSU. It contains three
  components: the address and length of the data to be read, and whether the
  read has already been put into acc.
• The Store Buffer (SB) holds a data write request from the LSU. It is filled
  upon receipt of a store(AddrLen) signal from the LSU and has the same
  components as the DB.
• Finally, the Access Slots (AC) give the accesses that are scheduled to be
  performed over the bus. The PPC755 can pipeline two accesses, i.e. two
  accesses may be active at the same time, overlapping address and data phases.
  Every access (Access) contains its source (DB, SB, IB, or empty if the slot
  is not used). In addition, the cache line affected by the access is recorded
  if the access is a cache line fill. Finally, the schedule for the access is
  recorded. A schedule contains a sequence of items:
  – ts(a, b) indicates that a TS signal is to be emitted to the CSU. The
    parameters for the signal are taken from the src attribute of the access
    and the a and b parameters of the item.
  – aack: the access has to wait for the AACK signal.
  – emit(D): a data signal is to be emitted.
  – emit(F): a fetched signal is to be emitted. The parameter to the signal
    depends on the alignment of the access in the IB.
  – ta: wait for a TA signal.
  Since an instruction fetch in IB may have to be split into up to three
  non-interruptible accesses (in the case that a fetch of four instructions
  starts at an address not divisible by 8), the two “additional” accesses have
  to be kept in acc[2,3]. This is just to ensure that the accesses are not
  interrupted by other accesses.
State Evolution
Some assumptions have been made in the model of the bus unit:
• The data cache is only used in write-through mode, i.e. stores cannot affect
  the cache contents.
• Data accesses do not cross cache line boundaries. This would violate the
  alignment restrictions of the PowerPC anyhow.
• When multiple requests are queued up at the bus unit, they are processed in
  the order: stores, data reads, instruction fetches.
We write $\mathit{cclass}(a)$ to denote the access classification of an address
$a$ (HH, HM, MH, or MM).
We write $\mathit{insinline}(a, l)$ for the number of instructions in the cache
line which contains the address $a$, where $l$ is the total number of
instructions.
The function $\mathit{getsched}(a, l, c)$ returns a vector of Access for the
instruction fetch from address $a$. The parameter $l$ gives the number of
instructions to fetch; the boolean parameter $c$ is true if the access is
cacheable. The function is defined by:
\[
\mathit{getsched}(a, l, \mathit{false}) =
\begin{cases}
\langle (I, \mathit{none}, s(a, 1)) \rangle, & \text{if } a, l \text{ span 1 double word} \\
\langle (I, \mathit{none}, s(a, 2)) \rangle, & \text{if } a, l \text{ span 2 double words} \\
\langle (I, \mathit{none}, s(a, 3)) \rangle, & \text{if } a, l \text{ span 3 double words}
\end{cases}
\]
where $s(a, n)$ is defined by
\[
s(a, n) =
\begin{cases}
\langle \mathit{ts}(a, \mathit{false}), \mathit{aack}, \mathit{ta}, \mathit{emit}(F) \rangle, & \text{if } n = 1 \\
\langle \mathit{ts}(a, \mathit{false}), \mathit{aack}, \mathit{ta}, \mathit{emit}(F) \rangle \cdot s(\mathit{al}(a), n - 1), & \text{if } n > 1
\end{cases}
\]
and $\mathit{al}(a)$ is the address $a$ aligned to the beginning of the next
double word.
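The recursive schedule construction for uncacheable fetches can be sketched in Python. The tuple encoding of schedule items and the helper `double_words_spanned` are assumptions made for illustration; the 4-byte instruction size and 8-byte double word follow the address arithmetic used elsewhere in this section.

```python
def al(a):
    """Address a aligned to the beginning of the next double word (8 bytes)."""
    return (a // 8 + 1) * 8


def s(a, n):
    """Schedule for n >= 1 single-beat double-word reads starting at a."""
    beat = [("ts", a, False), "aack", "ta", ("emit", "F")]
    return beat if n == 1 else beat + s(al(a), n - 1)


def double_words_spanned(a, l):
    """Number of double words touched by l 4-byte instructions starting at a."""
    return (a + 4 * l - 1) // 8 - a // 8 + 1


def getsched_uncached(a, l):
    """One Access (source I, no cache line) carrying the whole schedule."""
    return [("I", None, s(a, double_words_spanned(a, l)))]
```

For example, a fetch of four instructions from address 4 touches three double words and therefore yields three ts/aack/ta/emit groups, which matches the remark that a fetch may be split into up to three accesses.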
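\[
\mathit{getsched}(a, l, \mathit{true}) =
\begin{cases}
\langle (I, a, t(a, l)) \rangle, & \text{if 1 cache line access} \\
\langle (I, a, t(a, \mathit{insinline}(a))),\; (I, a', t(a', \mathit{insinline}(a'))) \rangle, & \text{else; with } a' = a + 4\,\mathit{insinline}(a)
\end{cases}
\]
where $t(a, l)$ is defined by
\[
t(a, l) =
\begin{cases}
\langle \mathit{ts}(a, t), \mathit{aack}, t', \mathit{ta}, \mathit{ta}, \mathit{ta} \rangle, & \text{if } l \le \mathit{ins}(a) \\
\langle \mathit{ts}(a, t), \mathit{aack}, t', t', \mathit{ta}, \mathit{ta} \rangle, & \text{if } l \le \mathit{ins}(a) + 2 \\
\langle \mathit{ts}(a, t), \mathit{aack}, t', t', t', \mathit{ta} \rangle, & \text{else}
\end{cases}
\]
where $t' = \mathit{ta}, \mathit{emit}(F)$ and $\mathit{ins}(a)$ is the number
of instructions in the double word containing $a$.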
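A small Python sketch of the burst schedule $t(a, l)$ may help to see how the three cases arise. Several points are assumptions: the second argument of ts is taken to mean "burst" (true), the case conditions are read as "less than or equal", and `ins` counts the instructions from $a$ to the end of its double word.

```python
def ins(a):
    """Assumed: instructions from a to the end of its 8-byte double word."""
    return 2 - (a % 8) // 4


def t(a, l):
    """Four-beat line fill; the first `emits` beats each deliver instructions."""
    if l <= ins(a):
        emits = 1
    elif l <= ins(a) + 2:
        emits = 2
    else:
        emits = 3
    schedule = [("ts", a, True), "aack"]
    for beat in range(4):
        schedule.append("ta")
        if beat < emits:
            schedule.append(("emit", "F"))
    return schedule
```

Under these assumptions every schedule waits for exactly four TA signals, and a fetched signal is emitted only for as many beats as are needed to deliver the $l$ requested instructions.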
The BU evolves in 8 steps:
Step 1: Advance signaled fetch into IB
If fetch(a, l) signal received then
  (IBAddr, IBLen, IBSched) := (a, l, false)
  IBC := HH
  Determine cache behavior
  If memory area at IBAddr is cacheable then
    IBC := cclass(IBAddr)
    Update the instruction cache
    If access is in one cache line only then
      If IBC = HH then
        If IBAddr does not clash with access in acc[0…3] then
          Emit fetched(IBLen) signal, clear IB
        Fi
      Fi
    Else (two-line access)
      n := 0
      If IBC ∈ {HH, HM} then
        If first line does not clash with acc[0…3] then
          n := insinline(IBAddr, IBLen)
          If IBC = HH and
             second line does not clash with acc[0…3] then
            n := n + insinline(IBAddr + 4·n, IBLen − n)
          Fi
        Fi
      Fi
      IBAddr := IBAddr + 4·n
      IBLen := IBLen − n
      If 0 = IBLen then
        Clear IB
      Fi
      If 0 ≠ n then
        Emit fetched(n) signal
        IBC := tail(IBC)
      Fi
    Fi
  Fi
Fi
Step 2: Advance signaled store to SB
If store(a, l) signal received then
  (SBAddr, SBLen, SBSched) := (a, l, false)
  If the access is to write-back area then (write-back, no replacement)
    Clear SB (returns immediately)
  Fi
Fi
Step 3: Advance signaled read to DB
If read(a, l) signal received then
  (DBAddr, DBLen, DBSched) := (a, l, false)
  If access is cacheable then
    If l = 0 then (cache line invalidation by dcbi)
      Remove cache line for DBAddr from cache
      If DBAddr does clash with acc[0…3] then
        DBLen := 0
      Else
        Emit data signal
        Clear DB
      Fi
      Goto step 4
    Else if l = 32 then (cache line zero by dcbz)
      If DBAddr does clash with acc[0…3] then
        DBLen := 0
      Else
        Emit data signal
        Clear DB
      Fi
      Goto step 4
    Fi
    cata := cclass(DBAddr)
    Update the data cache
    If access clashes with data cache line fill in acc[0…3] then
      DBLen := 0 (can return as soon as fill completes)
    Else If cata = H then
      Emit data signal, Clear DB
    Fi
  Fi
Fi
Step 4: Advance store to access
If SBAddr ≠ none and free(acc) and !SBSched then
  If 32 = SBLen then (dcbf instruction: cache line flush)
    Insert (S, none, ⟨ts(SBAddr, true), aack, ta, ta, ta, ta⟩) into acc
  Else (single store)
    Insert (S, none, ⟨ts(SBAddr, false), aack, ta⟩) into acc
  Fi
  SBSched := true
Fi
Step 5: Advance read to access
If DBAddr ≠ none and free(acc) and !DBSched then
  If DBAddr is cacheable then
    If DBAddr does not clash with acc then
      Insert (D, DBAddr,
              ⟨ts(DBAddr, true), aack, ta, emit(D), ta, ta, ta⟩) into acc
    Fi
  Else (single access)
    Insert (D, none, ⟨ts(DBAddr, false), aack, ta, emit(D)⟩) into acc
  Fi
  DBSched := true
Fi
Step 6: Advance fetch to access
If IBAddr ≠ none and free(acc) and !IBSched then
  If IBAddr is cacheable then
    If (IBAddr, IBLen) is in one cache line then
      If IBC = MH then (HH is a noop; HM, MM impossible)
        scheds := getsched(IBAddr, IBLen, true)
        Insert scheds into acc
        IBSched := true
      Fi
    Else (two cache lines)
      scheds := empty
      case IBC of
        HH: line clash, noop
        HM: first line clash
          n := insinline(IBAddr, IBLen)
          scheds := getsched(IBAddr + 4·n, IBLen − n, true)
        MH: first miss, second will be hit
          scheds := getsched(IBAddr, insinline(IBAddr, IBLen), true)
        MM: two misses
          scheds := getsched(IBAddr, IBLen, true)
      esac
      If scheds ≠ empty then
        Insert scheds into acc
        IBSched := true
      Fi
    Fi
  Else (uncacheable access)
    scheds := getsched(IBAddr, IBLen, false)
    Insert scheds into acc
    IBSched := true
  Fi
Fi
Step 7: Process accesses
If acc[1].src ≠ N then (a 2nd access is there)
  If hd(acc[1]) = ts(a, b) and hd(acc[0]) ∉ {aack, ts(a′, b′)} and
     the bus/core clock are aligned then
    Emit TS(a, acc[1].src, l) signal, where
      l = 4, if b = true
          0, if acc[1].src = S and SBLen ≤ 8
          1, else
    acc[1].schedule := tail(acc[1].schedule)
  Fi
Fi
If acc[0].src ≠ N then (1st access not empty)
  If hd(acc[0].schedule) = ts(a, b) and core/bus clock are aligned then
    Start transfer
    Emit TS(a, acc[0].src, l) signal, where l is defined as above.
    acc[0].schedule := tail(acc[0].schedule)
  Fi
  If AACK signal received then
    If hd(acc[0].schedule) = aack then
      acc[0].schedule := tail(acc[0].schedule)
    Else
      acc[1].schedule := tail(acc[1].schedule)
    Fi
  Fi
  If TA signal received then (1st access must wait for it)
    acc[0].schedule := tail(acc[0].schedule)
  Fi
  If hd(acc[0].schedule) = emit(D) then
    Emit data signal
    Clear DB
    acc[0].schedule := tail(acc[0].schedule)
  Fi
  If hd(acc[0].schedule) = emit(F) then
    n := ins(IBAddr)
    Emit fetched(n) signal
    IBLen := IBLen − n
    IBAddr := IBAddr + 4n
    If IBLen = 0 then
      Clear IB
    Fi
    acc[0].schedule := tail(acc[0].schedule)
  Fi
  If acc[0].schedule = ⟨⟩ then (access finished)
    case acc[0].src of
      S:
        Clear SB
      D:
        If acc[0].CL ≠ none and DBAddr clashes with acc[0].CL and
           !DBSched then
          DB waits for this access
          Clear DB
          Emit data signal
        Fi
      I:
        If acc[0].CL ≠ none and IBAddr clashes with acc[0].CL and
           !IBSched then
          n := 0
          If IBC = HH then
            n := IBLen
          Else If IBC = HM then
            n := insinline(IBAddr, IBLen)
          Fi
          If n ≠ 0 then
            Emit fetched(n) signal
            IBLen := IBLen − n
            IBAddr := IBAddr + 4n
            IBC := tail(IBC)
            If IBLen = 0 then
              Clear IB
            Fi
          Fi
        Fi
    esac
    Move acc[1…3] up one slot
    Clear acc[3]
  Fi
Fi
Step 8: Emit Working signal
If SBAddr ≠ none then
  Emit working signal
Fi
4.4.9 CSU
This unit has been left out for space considerations.
4.4.10 State Predicates
Finally, we have to define the state predicates from Definition 4.1.3. We thus
introduce new component variables, ppcRetired, ppcHalted, ppcNext, ppcBranches
and ppcNops:
• ppcRetired is a sequence of addresses which denote the instructions already
  retired. This accounts for multiple retirement. We update this sequence
  whenever an instruction retires from the CU.
• ppcHalted is a Boolean that records the fact that the last instruction of the
  program has been executed.
• ppcNext records the address of the next instruction to be executed.
• ppcBranches is an integer that counts the number of branches dropped or
  folded without an entry in the CQ.
• ppcNops is an integer that counts the number of noop instructions dropped
  by the DU.
The state predicates are more complicated to define for the PPC 755 because
instructions may be dropped without passing through the CQ or CU: branches and
noops. Therefore, we record the number of these instructions in the two variables
ppcNops and ppcBranches. They are increased in the FBPU and DU if such an
instruction is dropped or folded. In the CU, it is checked whether the current
instruction is such a noop or branch; if so, retirement is adjusted accordingly
and the variables are decremented. Then we can define
\[ \mathcal{F}(n, s) := \mathit{hd}(s(\mathit{ppcRetired})) = n \]
and
\[ \mathcal{R}(n, s) := s[\mathit{ppcRetired} \leftarrow \mathit{tl}(s(\mathit{ppcRetired}))] \]
ppcNext is set to the address of the next instruction as soon as that instruction
is clear, i.e. after speculative branches have been resolved. As for the ColdFire
version of this predicate, the address is computed by a function
$\mathit{target}(\mathit{acti})$ for the current instruction. This variable is
updated at the beginning for non-branch instructions and in the FBPU after the
branch resolves. Then we can define
\[ \mathcal{N}(n, s) := n = s(\mathit{ppcNext}) \]
Finally, ppcHalted is set as soon as the last instruction retires in the CU and
we have
\[ \mathcal{H}(s) := s(\mathit{ppcHalted}) \]
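To make the four state predicates concrete, here is a minimal Python sketch over a state represented as a dictionary. The dictionary representation and the function names are assumptions for illustration; the finished predicate is taken to be false when the retirement sequence is empty.

```python
def finished(n, s):
    """F(n, s): the head of the retirement sequence is instruction n."""
    return bool(s["ppcRetired"]) and s["ppcRetired"][0] == n


def retire_state(n, s):
    """R(n, s): a copy of s with the head of ppcRetired removed."""
    t = dict(s)
    t["ppcRetired"] = s["ppcRetired"][1:]
    return t


def is_next(n, s):
    """N(n, s): n is the address of the next instruction to execute."""
    return s["ppcNext"] == n


def halted(s):
    """H(s): the last instruction of the program has retired."""
    return s["ppcHalted"]
```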
Chapter 5
Pipeline Analysis
If you try to know everything, you will
learn nothing.
Democritus
The goal of our pipeline analysis is to find a correct approximation to the WCET
of a program defined by Equation (4.1.6). We do this by finding an abstraction
of the collecting semantics defined in Equation (4.1.5). Since our analysis is
implemented as a data-flow analysis, finding an abstraction for the collecting
semantics boils down to finding abstract versions $\hat{T}_e$ of the transfer
functions $T'_e$ associated with the edges in the CFG that satisfy Equation
(3.3.4) or (3.3.6).
First, the abstract domain of the analysis has to be determined and the relation
of abstract values of this domain to sets of concrete states has to be defined. An
abstract value $\hat{\sigma} \in \hat{\Sigma}$ is a pair $(\breve{s}, \hat{m})$,
where $\breve{s}$ is a set of abstract states $\hat{s} \in \hat{S}$, each of
which represents a set of concrete states $s$, and $\hat{m}$ is an upper bound
for the times of all executions ending in one of the concrete states $s$. Thus,
the abstract domain is $\hat{\Sigma} = \mathcal{P}(\hat{S}) \times Z$.
Additionally, we have a concretization function
$\Gamma : \hat{S} \to \mathcal{P}(S)$ which gives the set of concrete states
represented by an abstract state $\hat{s}$. We will see later how this $\Gamma$
is constructed from the definitions of the state components and their respective
abstractions. Unfortunately, it has proven difficult to define a Galois
connection between the domains $\mathcal{P}(\Sigma)$ and $\hat{\Sigma}$, simply
because $\Gamma$ does not allow us to define a corresponding abstraction function
$\alpha' : \mathcal{P}(S) \to \hat{S}$. This is because, e.g. for the PowerPC
cache, there does not exist a best abstraction for a given set of concrete
components, cf. Example 3.2.27 in Section 3.2.1. Thus, we cannot use condition
(3.3.4) to prove the soundness of an abstraction. Since the presence of a
concretization function with certain properties is sufficient to show
correctness by using (3.3.6), we will present the proof in this way.
In the following we will first determine what abstract entities have to be
defined and how they must be related to their concrete counterparts. Using this
we give a global correctness proof that relies only on a property of the state
transfer function $\Phi$ and its abstract counterpart $\hat{\Phi}$.
Then we will define our abstract version $\hat{\Phi}$ of $\Phi$ by providing
abstract unit updates and component domains. By proving that our definition of
$\hat{\Phi}$ has the properties required by the global correctness proof, the
correctness of our abstraction is shown.
This approach also allows other implementations of the pipeline analysis to be
proven correct: e.g. the cycle update can be formulated in the synchronous
language Esterel ([Ber]), and it then suffices to show that the correctness
conditions between $\Phi$ and the Esterel cycle update $\hat{\Phi}_E$ hold.
In the remainder of this chapter, we will first give some notational definitions
that will be used in the sequel. Then we will give the general correspondence
between the concrete domain and the abstract one, $\hat{\Sigma}$, based on a
concretization function $\Gamma : \hat{S} \to \mathcal{P}(S)$. With this
correspondence we give minimal relations between concrete and abstract
predicates and functions, including $\Phi$. In Section 5.2 we will show that any
analysis satisfying these relations is correct w.r.t. the coarse semantics.
After this, we will give in Section 5.3 abstractions for our definition of
$\Phi$ in terms of unit updates. The proof that these abstractions satisfy the
general abstraction requirements establishes the correctness of our analysis
using unit updates.
5.1 Notation
Given a pair $(a, b)$ of values, we define the functions
\[ \mathit{fst}(a, b) = a \qquad \mathit{snd}(a, b) = b \]
We extend these functions to sets of pairs, e.g.
$\mathit{fst}(A) = \{ \mathit{fst}(a, b) \mid (a, b) \in A \}$.
We abbreviate a tuple $(a_1, \ldots, a_n)$ by writing $\vec{a}$. We write
$(\vec{a}, \vec{b})$ for the tuple $(a_1, \ldots, a_n, b_1, \ldots, b_m)$.
Let $Z$ be a finite interval of $\mathbb{N}$ containing all possible execution
times: $Z = \{0, 1, \ldots, T_{max}\}$. $Z$ together with the usual ordering
$\le$ on natural numbers forms a complete lattice with least element $0$,
greatest element $T_{max}$, and $\bigsqcup_Z N = \max N$,
$\bigsqcap_Z N = \min N$. We define an addition $+$ on $Z$ as the usual addition
on natural numbers, saturated at the value $T_{max}$, i.e. the sum in $Z$ of two
numbers is $T_{max}$ if the sum of the corresponding natural numbers is larger.
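The saturated addition and the lattice operations on $Z$ are simple enough to state directly in Python. The concrete value of `T_MAX` below is an arbitrary placeholder, not a value taken from the thesis.

```python
T_MAX = 1_000_000  # placeholder bound; any finite upper limit works


def sat_add(m, n, t_max=T_MAX):
    """Addition on Z = {0, ..., t_max}, saturated at t_max."""
    return min(m + n, t_max)


def join_z(values):
    """Least upper bound in Z: the maximum (0 for the empty set)."""
    return max(values, default=0)


def meet_z(values, t_max=T_MAX):
    """Greatest lower bound in Z: the minimum (t_max for the empty set)."""
    return min(values, default=t_max)
```

Saturation keeps every time bound inside the finite lattice, so a bound that has reached the maximum stays there no matter how many further cycles are added.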
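Let the set $\mathcal{P}(S \times Z)$ be ordered by $\subseteq$. Then
$\mathcal{P}(S \times Z)$ is a complete lattice with least element $\emptyset$,
greatest element $S \times Z$, and $\bigsqcup = \bigcup$ and
$\bigsqcap = \bigcap$.
For a set $\hat{S}$ let $\mathcal{P}(\hat{S}) \times Z$ be ordered by
$\sqsubseteq_P$, defined by
\[ (\hat{S}_1, n_1) \sqsubseteq_P (\hat{S}_2, n_2) \text{ iff } \hat{S}_1 \subseteq \hat{S}_2 \wedge n_1 \le n_2 \]
Then it is a complete lattice with least element $(\emptyset, 0)$, greatest
element $(\hat{S}, T_{max})$ and
\[ \bigsqcup\nolimits_P N = \left( \bigcup \mathit{fst}(N), \max \mathit{snd}(N) \right) \]
\[ \bigsqcap\nolimits_P N = \left( \bigcap \mathit{fst}(N), \min \mathit{snd}(N) \right) \]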
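The product ordering and its least upper bound can be sketched in a few lines of Python. Representing an element of the abstract domain as a pair of a frozenset and an integer is an assumption made for illustration.

```python
def leq_p(x, y):
    """(S1, n1) is below (S2, n2) iff S1 is a subset of S2 and n1 <= n2."""
    (s1, n1), (s2, n2) = x, y
    return s1 <= s2 and n1 <= n2


def join_p(pairs):
    """Least upper bound: union of the state sets, maximum of the bounds."""
    pairs = list(pairs)
    states = frozenset().union(*(s for s, _ in pairs))
    return states, max((n for _, n in pairs), default=0)
```

Note that the join of the empty collection is the least element, the pair of the empty set and 0.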
Elements $\breve{s}, \breve{t}$ are always from the set $\mathcal{P}(\hat{S})$,
i.e. sets of abstract pipeline states. $\hat{s}$ and $\hat{t}$ stand for
abstract pipeline states from $\hat{S}$. We write $\hat{m}$, $\hat{n}$ and
$\hat{o}$ for elements from $Z$ if we want to hint that they come from a pair,
e.g. $(\breve{s}, \hat{m})$, of the abstract domain. We write $s$, $t$ for
concrete states and likewise $m$, $o$, $p$ for concrete execution time bounds
from $Z$. Furthermore, we write $\hat{\mathcal{S}}$ for a set of pairs from the
abstract domain, $\hat{\mathcal{S}} \subseteq \mathcal{P}(\hat{S}) \times Z$. A
variable $b$ is an element from the set Bool; $z$ is from the set Int. Abstract
values from a domain are written as $\hat{v}$ or $\hat{v}_i$. Concrete values
are written $v$ or $v_i$.
5.2 Global Correctness
For our pipeline analysis we have to specify abstract counterparts of the
concrete domains and semantic functions. We will represent a set of concrete
pipeline states from $S$ by one abstract pipeline state $\hat{s}$ from a set
$\hat{S}$. A set of such abstract states, $\breve{s} \in \mathcal{P}(\hat{S})$,
will be used as the counterpart for the set $\mathcal{P}(S)$, the domain of the
sets of concrete pipeline states.
That is, the concretization function $\Gamma : \hat{S} \to \mathcal{P}(S)$ maps
one abstract state to the concrete states that are represented by it. This
$\Gamma$ does not need to fulfill further constraints such as monotonicity. In
fact, we do not even define an ordering on the set $\hat{S}$. For different
analyses, different versions of $\Gamma$ can be used. In Section 5.3, we will
give the $\Gamma$ that corresponds to the abstractions on unit states defined
there.
The domain of the semantics, $\Sigma = S \times Z$, consists of pairs of a
concrete pipeline state and the execution time of the program when it reaches
this state. To obtain an analysis that correctly describes all executions of a
program, we have to find an approximation to the collecting semantics, i.e. we
have to abstract elements from $\mathcal{P}(S \times Z)$. We do this by choosing
the abstract domain $\hat{\Sigma} = \mathcal{P}(\hat{S}) \times Z$, where an
element of the abstract domain is a pair $(\breve{s}, \hat{m})$ of a set of
abstract states and an upper bound to execution times. The meaning of such a
pair is the set of concrete pairs, where each concrete state in the pair is
described by an abstract state and the execution times of all concrete pairs
are at most as large as the upper bound. More formally, we can give a
concretization function $\gamma$ that gives all concrete pairs described by one
element from $\hat{\Sigma}$:
\[ \gamma : \mathcal{P}(\hat{S}) \times Z \to \mathcal{P}(S \times Z) \]
\[ \gamma(\breve{s}, \hat{m}) = \bigcup \{ (s, m) \mid s \in \Gamma(\hat{s}) \wedge 0 \le m \le \hat{m} \wedge \hat{s} \in \breve{s} \} \qquad (5.2.1) \]
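The definition of $\gamma$ can be tried out on a toy domain. In the sketch below an abstract state is an integer interval and its concretization is the set of integers it contains; this toy `Gamma` is purely an assumption for illustration and has nothing to do with the real pipeline states.

```python
def gamma(abstract_states, m_hat, Gamma):
    """All concrete (state, time) pairs described by (abstract_states, m_hat)."""
    return {
        (s, m)
        for s_hat in abstract_states
        for s in Gamma(s_hat)
        for m in range(m_hat + 1)  # 0 <= m <= m_hat
    }


def interval_Gamma(s_hat):
    """Toy concretization: an abstract state (lo, hi) stands for lo..hi."""
    lo, hi = s_hat
    return set(range(lo, hi + 1))
```

Enlarging either the set of abstract states or the time bound can only add concrete pairs, which is the monotonicity property that the analysis relies on.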
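This $\gamma$ satisfies the correctness prerequisites of Theorem 3.3.5:
Lemma 5.2.2 (Monotonicity of $\gamma$): $\gamma$ is monotone.
Proof:
Let $(\breve{s}_1, \hat{m}_1) \sqsubseteq_P (\breve{s}_2, \hat{m}_2)$, i.e. we
have $\breve{s}_1 \subseteq \breve{s}_2 \wedge \hat{m}_1 \le \hat{m}_2$. Then
\begin{align*}
& (s, m) \in \gamma(\breve{s}_1, \hat{m}_1) \\
\Rightarrow\; & \exists \hat{s} \in \breve{s}_1 : s \in \Gamma(\hat{s}) \wedge m \le \hat{m}_1 \\
\Rightarrow\; & \hat{s} \in \breve{s}_2 \wedge m \le \hat{m}_1 \le \hat{m}_2 \\
\Rightarrow\; & (s, m) \in \gamma(\breve{s}_2, \hat{m}_2) \\
\Rightarrow\; & \gamma(\breve{s}_1, \hat{m}_1) \subseteq \gamma(\breve{s}_2, \hat{m}_2)
\end{align*}
□
In the following, we will extend $\Gamma$ to sets of abstract states, i.e.
\[ \Gamma(\breve{s}) = \bigcup \{ \Gamma(\hat{s}) \mid \hat{s} \in \breve{s} \} \]
To prove the global correctness of the pipeline analysis, we have to specify
the abstract versions $\hat{T}_e$ of the transfer functions $T_e$. We will
define them analogously to the transfer functions in the concrete semantics,
which are based on state predicates $\mathcal{F}, \mathcal{N}, \mathcal{H}$ and
a retirement function $\mathcal{R}$ together with the state transition function
$\Phi$.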
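Definition 5.2.3 (Abstract Counterparts): Given the four state predicates
$\mathcal{F}$, $\mathcal{N}$, $\mathcal{H}$ and $\mathcal{R}$ according to
Definition 4.1.3 and a state transfer function $\Phi$ from an abstract machine
according to Definition 4.1.1. Then predicates
$\hat{\mathcal{F}} \subseteq V \times \hat{S}$,
$\hat{\mathcal{N}} \subseteq V \times \hat{S}$,
$\hat{\mathcal{H}} \subseteq \hat{S}$ and a function
$\hat{\mathcal{R}} : V \times \hat{S} \to \hat{S}$ are called abstract state
predicates for the abstraction described by $\hat{S}$ and $\Gamma$ iff the
following conditions hold:
1. $\hat{\mathcal{F}}(n, \hat{s}) \Rightarrow \forall s \in \Gamma(\hat{s}) : \mathcal{F}(n, s)$
2. $\neg\hat{\mathcal{F}}(n, \hat{s}) \Rightarrow \forall s \in \Gamma(\hat{s}) : \neg\mathcal{F}(n, s)$
3. $\exists s \in \Gamma(\hat{s}) : \mathcal{N}(n, s) \Rightarrow \hat{\mathcal{N}}(n, \hat{s})$
4. $\hat{\mathcal{H}}(\hat{s}) \Rightarrow \forall s \in \Gamma(\hat{s}) : \mathcal{H}(s)$
5. $\neg\hat{\mathcal{H}}(\hat{s}) \Rightarrow \forall s \in \Gamma(\hat{s}) : \neg\mathcal{H}(s)$
6. If $\hat{\mathcal{R}}(n, \hat{s}) = \hat{t}$ then
   $\forall s \in \Gamma(\hat{s}) : \mathcal{R}(n, s) \in \Gamma(\hat{t})$ and
   $\forall t \in \Gamma(\hat{t}) : \exists s \in \Gamma(\hat{s}) : t = \mathcal{R}(n, s)$
A function $\hat{\Phi} : \hat{S} \to \mathcal{P}(\hat{S})$ is called abstract
transition function iff it satisfies the condition
\[ s \in \Gamma(\hat{s}) \Rightarrow \Phi(s) \in \Gamma(\hat{\Phi}(\hat{s})) \]
and if for every sequence $\hat{s}_1, \ldots$ with
$\hat{s}_{i+1} \in \hat{\Phi}(\hat{s}_i)$ there exists a $k \in \mathbb{N}$ such
that $\hat{\mathcal{F}}(n, \hat{s}_k)$ holds.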
This definition forces any abstract representation of pipeline states to be
exact w.r.t. the retirement and finishing of instructions. Also, it must always
be clear which instruction is the next one to execute for an abstract state. By
using not predicates but maps from $V \times \hat{S} \to \{0, 1, n\}$ one could
relax these restrictions by using three-valued logic functions. This would
introduce further nondeterminism into the analysis, since it would no longer be
clear if and when an instruction has finished execution and where to continue
the analysis. For this reason, we choose an exact representation of the state
predicates.
With these definitions we can define the abstract transfer functions for the
global pipeline analysis:
Definition 5.2.4 (Pipeline Transfer Functions): Given an abstract transition
function $\hat{\Phi}$ and abstract state predicates
$\hat{\mathcal{F}}, \hat{\mathcal{N}}, \hat{\mathcal{H}}, \hat{\mathcal{R}}$ for
the abstract pipeline state domain induced by $\hat{S}$ and $\Gamma$, we define
the abstract pipeline transfer functions
$\hat{T}_e : \mathcal{P}(\hat{S}) \times Z \to \mathcal{P}(\hat{S}) \times Z$ by
\[ \hat{T}_e(\breve{s}, \hat{m}) = \bigsqcup\nolimits_P \{ \hat{T}^{2}_e(\hat{s}, \hat{m}) \mid \hat{s} \in \breve{s} \} \qquad (5.2.5) \]
where the function
$\hat{T}^{2}_e : \hat{S} \times Z \to \mathcal{P}(\hat{S} \times Z)$ determines
for one abstract state/time pair the set of resulting abstract pairs:
\[
\hat{T}^{2}_{n \to n'}(\hat{s}, \hat{m}) =
\begin{cases}
\{ (\hat{\mathcal{R}}(n, \hat{s}), \hat{m}) \}, & \text{if } \hat{\mathcal{F}}(n, \hat{s}) \wedge \hat{\mathcal{N}}(n', \hat{s}) \\
\bigcup \{ \hat{T}^{2}_{n \to n'}(\hat{t}, \hat{m} + 1) \mid \hat{t} \in \hat{\Phi}(\hat{s}) \}, & \text{if } \neg\hat{\mathcal{F}}(n, \hat{s}) \\
\emptyset, & \text{otherwise}
\end{cases}
\qquad (5.2.6)
\]
The abstract transfer functions defined this way perform, for each abstract
state, the cycle-wise iteration using the abstract transition function
$\hat{\Phi}$ and join together the resulting sets of abstract states. The upper
bound for the execution time is obtained as the maximum of all upper bounds for
each iteration of one abstract state.
The $\hat{T}_e$ are well defined because of the restriction placed on the
abstract transition function $\hat{\Phi}$ in 5.2.3: after finitely many
applications of $\hat{\Phi}$, $\hat{\mathcal{F}}$ holds for the resulting state.
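The cycle-wise iteration described above can be sketched in Python with an explicit worklist instead of recursion. All parameter names are hypothetical: `F`, `N` and `R` play the role of the abstract state predicates and retirement function, `step` is the abstract transition function returning a set of successors, and termination relies on the assumption that every successor sequence eventually reaches a finished state.

```python
def transfer_single(s_hat, m_hat, n, n_next, F, N, R, step, t_max):
    """Resulting (abstract state, time) pairs for one abstract state."""
    results, work = set(), [(s_hat, m_hat)]
    while work:
        state, time = work.pop()
        if F(n, state):
            if N(n_next, state):  # n finished and n_next is executed next
                results.add((R(n, state), time))
            # finished, but a different successor follows: contributes nothing
        else:
            for succ in step(state):  # one more cycle for every successor
                work.append((succ, min(time + 1, t_max)))
    return results


def transfer(states, m_hat, n, n_next, F, N, R, step, t_max):
    """Join over all abstract states: union of states, maximum of times."""
    pairs = set()
    for s_hat in states:
        pairs |= transfer_single(s_hat, m_hat, n, n_next, F, N, R, step, t_max)
    return frozenset(s for s, _ in pairs), max((t for _, t in pairs), default=0)
```

In a toy setting where an abstract state is just a counter of remaining cycles and `step` may nondeterministically skip a cycle, the resulting bound is the longest path, as expected for an upper bound.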
Since we have already shown in Lemma 5.2.2 that the concretization function
$\gamma$ fulfills the necessary premises of Theorem 3.3.5, it only remains to
show that the $\hat{T}_e$ defined above satisfy condition 3.3.6 to establish
that the MOP analysis using the abstract transfer functions 5.2.6 is correct.
This means we have to show that, given abstract state predicates
$\hat{\mathcal{F}}, \hat{\mathcal{N}}, \hat{\mathcal{H}}, \hat{\mathcal{R}}$ and
an abstract transition function $\hat{\Phi}$, the following holds:
\[ \forall \mathcal{S} \in \mathcal{P}(S \times Z),\ (\breve{s}, \hat{m}) \in \mathcal{P}(\hat{S}) \times Z : \mathcal{S} \subseteq \gamma(\breve{s}, \hat{m}) \Rightarrow T'_e(\mathcal{S}) \subseteq \gamma(\hat{T}_e(\breve{s}, \hat{m})) \qquad (5.2.7) \]
Since the $\hat{T}_e$ are defined as a least upper bound, this proof can be
simplified by utilizing assumption 3.3.7: by showing that
\[ \mathcal{S} \subseteq \gamma(\breve{s}, \hat{m}) \Rightarrow T'_e(\mathcal{S}) \subseteq \bigcup \{ \gamma(\breve{t}, \hat{n}) \mid (\breve{t}, \hat{n}) \in \hat{\mathcal{S}} \} \qquad (5.2.8) \]
where
$\hat{\mathcal{S}} = \{ \hat{T}^{2}_e(\hat{s}, \hat{m}) \mid \hat{s} \in \breve{s} \}$,
we can show that 5.2.7 holds. Since the functions $T'_e$ are defined as a union
over the $T_e$ in 3.2.13, our proof is reduced to showing
\[ \{(t, n)\} = T_e(\{(s, m)\}) \wedge (s, m) \in \gamma(\breve{s}, \hat{m}) \Rightarrow \exists \hat{s} \in \breve{s} : \exists (\breve{t}, \hat{n}) \in \hat{T}^{2}_e(\hat{s}, \hat{m}) : (t, n) \in \gamma(\breve{t}, \hat{n}) \qquad (5.2.9) \]
Note that the trivial cases of $\mathcal{S} = \emptyset$ or
$T_e(\mathcal{S}) = \emptyset$ have been omitted from this condition.
Theorem 5.2.10 (Global Correctness): Given abstract state predicates and an
abstract transition function, condition (5.2.9) holds, i.e. the pipeline
analysis defined by the $\hat{T}_e$ of 5.2.4 is correct.
Proof:
By inspecting the definition of the $T_e$ in 4.1.4 we can see that
$\{(t, n)\} = T_{n \to n'}(\{(s, m)\})$ means that
$t = \mathcal{R}(\Phi^{k}(s))$ and $n = m + k$ for some $k \ge 0$.
Therefore, we show by induction on $k$ that the conclusion of 5.2.9 holds.
1. $k = 0$: $t = \mathcal{R}(s) \wedge \mathcal{F}(n, s) \wedge \mathcal{N}(n', s)$,
   and since $(s, m) \in \gamma(\breve{s}, \hat{m})$ there is an
   $\hat{s} \in \breve{s}$ such that $s \in \Gamma(\hat{s})$. By the properties
   of the abstract state predicates we have $\hat{\mathcal{F}}(n, \hat{s})$ and
   $\hat{\mathcal{N}}(n', \hat{s})$. Thus
   $\hat{T}^{2}_{n \to n'}(\hat{s}, \hat{m}) = \{ (\hat{\mathcal{R}}(n, \hat{s}), \hat{m}) \}$
   and also $t = \mathcal{R}(n, s) \in \Gamma(\hat{\mathcal{R}}(n, \hat{s}))$.
   This means that
   $(\mathcal{R}(n, s), m) \in \gamma(\{ \hat{\mathcal{R}}(n, \hat{s}) \}, \hat{m})$
   holds, proving the claim.
2. $k = k' + 1$: $t = \mathcal{R}(\Phi^{k'}(\Phi(s))) \wedge \neg\mathcal{F}(n, s)$,
   $n = m + k' + 1$. Again, there exists an $\hat{s} \in \breve{s}$ such that
   $s \in \Gamma(\hat{s})$. We also have $\neg\hat{\mathcal{F}}(n, \hat{s})$
   from the definition of the abstract state predicates. By the definition of
   the abstract transition function $\hat{\Phi}$ we have that
   $(\Phi(s), m + 1) \in \gamma(\hat{\Phi}(\hat{s}), \hat{m} + 1)$. Applying
   the induction hypothesis to
   $\{(t, n)\} = \mathcal{R}(\Phi^{k'}(\Phi(s), m + 1))$ we have that there
   exists a $\hat{u} \in \hat{\Phi}(\hat{s})$ :
   $\exists (\breve{t}, \hat{o}) \in \hat{T}^{2}_{n \to n'}(\hat{u}, \hat{m} + 1) : (t, n) \in \gamma(\breve{t}, \hat{o})$.
   By the definition of $\hat{T}_e$ for the case
   $\neg\hat{\mathcal{F}}(n, \hat{s})$ we have that
   $(\breve{t}, \hat{o}) \in \hat{T}^{2}_{n \to n'}(\hat{s}, \hat{m})$, proving
   the claim.
Thus, we can conclude that the abstract pipeline transfer functions $\hat{T}_e$
define a correct MOP analysis, if abstract pipeline predicates
$\hat{\mathcal{F}}, \hat{\mathcal{N}}, \hat{\mathcal{H}}, \hat{\mathcal{R}}$ and
an abstract transition function $\hat{\Phi}$ are given.
□
Note that there are no conditions on the concretization function $\Gamma$,
mapping one abstract state to a set of concrete states. We do not even require
that the set $\hat{S}$ is a domain or has an ordering defined on it. This makes
it easier to define new abstractions with different $\hat{S}$ and $\Gamma$.
In the following, we will present an abstraction that corresponds to our
approach of formulating $\Phi$ with unit updates using delayed and instantaneous
signals.
5.3 The Abstraction Using Unit Updates
As in Chapter 4 we will use the same units communicating via delayed or
instantaneous signals. Abstract unit updates use abstract domains for the types
of the unit components (but the components stay the same). An abstract state is
then obtained as the union of all the (abstract) units, in the same way as we
obtained a concrete state as the union of the (concrete) units in the previous
chapter.
The abstract domains are connected to the concrete domains of the unit
components via the concretization $\Gamma$, which will be defined for every
component below. The result of applying $\Gamma$ to an abstract state $\hat{s}$
is then composed from the results of applying it to the components.
As discussed earlier in Section 4.2.2, using abstractions for the components
may prohibit the precise representation of the components' values. Since the
components' values are used in the evaluation of expressions, information loss
in the components will introduce nondeterminism in the evaluation of
expressions in the unit update programs. Nondeterminism in expressions means
that the execution of instructions of the unit update program is also
nondeterministic. Thus, we may have several successor states for one state
after a unit update.
Because only the domains of the components in the units are changed, units
and thus abstract states are still represented as environments. The difference
between (concrete) environments and abstract ones is that abstract environments
map the components' names to abstract values from the abstract domains. For
components that do not contain abstract types, the abstract domains are the same
as the concrete ones: the information in these components is represented exactly
in the analysis. For abstract types, we use new underlying domains whose
elements are approximations to the elements of the domains of the abstract types
in the concrete semantics. That is, one element from the abstract domain
represents a set of elements in the concrete domain, in the same way as an
abstract state represents a set of concrete states.
We will denote abstract environments by $\hat{\rho}$, abstract delayed signals
by $\hat{\delta}$ or $\rho_\delta$. In the concrete case, we used a function $P$
to associate a domain of concrete values, $P(\mathit{id})$, to an abstract type
$\mathit{abstr}(\mathit{id})$, cf. Section 4.2.3. Now we will use a function
$\hat{P}$ to associate new (abstract) value domains $\hat{P}(\mathit{id})$ to
the abstract types. For every such pair $P(\mathit{id})$ and
$\hat{P}(\mathit{id})$ we assume that there is a concretization function that
maps an element from $\hat{P}(\mathit{id})$ to a set of values in
$P(\mathit{id})$ that are approximated by that element. We will call this
function $\Gamma$, too, like the concretization function for abstract states.
Thus, for every abstract type $\mathit{abstr}(\mathit{id})$,
$\Gamma : \hat{P}(\mathit{id}) \to \mathcal{P}(P(\mathit{id}))$.
The concretization function $\Gamma$, mapping an abstract state to a set of
concrete states represented by it, can then be defined by applying $\Gamma$ to
the values of the environmental mapping $\hat{\rho}$, which is the abstract
counterpart of $\rho$, and subsequently to mappings $\hat{\mu}$ and $\hat{\nu}$,
the counterparts of $\mu$ and $\nu$. We also extend $\Gamma$ to all domains
defined by our type system:
\begin{align*}
\Gamma(\hat{\rho}) &= \{ \rho \mid l \mapsto \hat{v} \in \hat{\rho} \wedge l \mapsto v \in \rho \wedge v \in \Gamma(\hat{v}) \} \\
\Gamma((\hat{v}_1, \ldots, \hat{v}_n)) &= \{ (v_1, \ldots, v_n) \mid v_i \in \Gamma(\hat{v}_i) \} \\
\Gamma((\mathit{id}, \hat{v})) &= \{ (\mathit{id}, v) \mid v \in \Gamma(\hat{v}) \} \\
\Gamma(z) &= \{ z \} \\
\Gamma(b) &= \{ b \}
\end{align*}
(5.3.1)
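The component-wise construction of $\Gamma$ for environments can be illustrated with a small Python sketch. The encoding is an assumption: an abstract value is either a frozenset of the concrete values it approximates or an exactly known value, and an environment is a dictionary from component names to values.

```python
from itertools import product


def gamma_value(v_hat):
    """A frozenset stands for its members; any other value is exact."""
    return set(v_hat) if isinstance(v_hat, frozenset) else {v_hat}


def gamma_env(env_hat):
    """All concrete environments that agree with env_hat component-wise."""
    labels = sorted(env_hat)
    choices = [sorted(gamma_value(env_hat[label])) for label in labels]
    return [dict(zip(labels, combo)) for combo in product(*choices)]
```

An abstract environment with one imprecise component of two possible values therefore stands for two concrete environments, while exactly represented components such as Booleans and integers contribute a single choice each.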
Structures and arrays are already covered by the case for abstract environ-
ments. Now that we have defined the abstract states completely as abstract envi-
ronments, we continue by giving the abstract counterparts of the transition func-
tions Ù , Ù Si and Ù
u
i . The abstract semantics of a unit update, ˆÙ
u
i will be given by
an abstract version  p of the relation t p. This will also be defined by a set of
inference rules.
In the concrete update we had the mapping δ æ , representing the unasserted
state of delayed signals. We will make use of its abstract counterpart, δˆ
æ
. Here,
δˆ
æ
can be chosen from a variety of abstract values. However, the one chosen must
be an approximation to δ æ , i. e. δ æêW Γ
K
δˆ æ
M
must hold. Also, there is an abstract
counterpart for the state  , ˆ with ÌW Γ
K
ˆ

M
.
Definition 5.3.2 (Abstract Transition Function): The function
ˆ
Ù
K
µˆ
M
F





=
ˆ
Ö? , if Hˆ
K
µˆ
M
p
K
©Ö=
ˆ
Ù
S
2 K = 1 S]\]\g\S N ?S νˆ {S δˆ { M>
K
νˆ {S δˆ {
M
W
ˆ
Ù
S
1 K = 1 S]\]\]\uS N ?S µˆ S δˆ æ M ? M , otherwise
(where p
K
Sˆ
M
F
=
K
νˆ {Ãp δˆ {
M
ë
K
Labc ® Labδ
M¾>
K
νˆ { S δˆ {
M
W Sˆ ? ) is called the ab-
stract transition function for unit updates.
The functions ˆÙ S1 a 2 traverse all units according to the dependencies between
units for half-cycle 1 and 2:
Definition 5.3.3 (Unit Traversion): The functions ˆÙ S1 a 2 : P K = 1 Sg\]\]\uS N ? M |  νˆ  |
 δˆ t P
K
 νˆ 
|
 δˆ 
M
are defined by
ˆ
Ù
S
i
K
U S νˆ S δˆ
M
F






=
K
νˆ S δˆ
M
? , if U
F
/0
©Ö=
ˆ
Ù
S
i K U õ = u ?S νˆ p νˆ { S δˆ p δˆ { M>
K
νˆ { S δˆ {
M
W
ˆ
Ù
u
i K νˆ M ? , otherwise
where u
F
min ó ô
K
U {
M
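The functions $\hat{\mathcal{T}}^u_i$ perform a half-cycle update for unit $u$:
Definition 5.3.4 (Abstract Unit Update): The functions
$\hat{\mathcal{T}}^u_i : [\![\hat{\nu}]\!] \to \mathcal{P}([\![\hat{\nu}]\!] \times [\![\hat{\delta}]\!])$
are defined by
\[
\hat{\mathcal{T}}^u_i(\hat{\nu}) = \{(\hat{\rho} \upharpoonright (Lab_c \cup Lab_\delta \cup Lab_\iota), \hat{\delta}) \mid
  p \overset{(\hat{\nu}, \hat{\delta}_0)}{\rightsquigarrow}_p (\hat{\rho}, \hat{\delta})\}
\]
where $p = p^u_i$ is the program for unit $u$ in half-cycle $i$.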
These three functions are defined similarly to their concrete counterparts; the
only significant change is that one abstract state update may produce a set of
possible successor states.
As in the case of the concrete semantics, we use a set of inference rules to
define the relation $p \overset{\hat{\tau}}{\rightsquigarrow}_p \hat{\tau}'$ mapping abstract environments to pairs of new
abstract environment and newly emitted abstract delayed signals.
As we can no longer expect that the abstract values can represent information
exactly, the relation $\rightsquigarrow_p$ is no longer deterministic, i.e. there may be several
$(\hat{\rho}, \hat{\delta})$ for one $(\hat{\nu}, \hat{\delta}_0)$ with
$p \overset{(\hat{\nu}, \hat{\delta}_0)}{\rightsquigarrow}_p (\hat{\rho}, \hat{\delta})$. In our framework, this nondeterminism
comes solely from the evaluation of expressions using abstract values instead of
the concrete ones: the inference rules for expression evaluation are nondeterministic.
The inference rules for program evaluation are the same as in the concrete case.
Taking the semantics from Section 4.2.3, we only have to replace all predefined
functions by abstract counterparts. Predefined functions are the only source
of nondeterminism in the abstract semantics. We replace the meaning of a predefined
function $f = [\![p]\!]$, $f : D_1 \times \cdots \times D_n \to D \times [\![\rho]\!]$, by an abstract counterpart
$\hat{f} = \widehat{[\![p]\!]}$, $\hat{f} : \hat{D}_1 \times \cdots \times \hat{D}_n \to \mathcal{P}(\hat{D} \times [\![\hat{\rho}]\!])$.
That is, the argument domains of $f$ are replaced by their abstract counterparts and the result
is a finite set of abstract values, coming from the abstract counterpart of the concrete result domain.
Here, the result is a pair of value and environment in the concrete case and a set of
pairs of abstract value and abstract environment in the abstract case. By this, we can
model the fact that a certain computation may not have a uniquely determined
result in the analysis and that the result of a function can be undetermined, but
lies within a set of values.
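Thus, we can define the evaluation of an expression $e$ under an abstract environment
$\hat{\rho}$, denoted by $e \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{v}, \hat{\rho}')$, by nearly the same set of
inference rules as in the concrete case:

(50) $n \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (n, \hat{\rho})$, if $n$ is an integer

(51) $b \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (b, \hat{\rho})$, $b \in \{true, false\}$

(52) $\dfrac{e_1 \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{v}_1, \hat{\rho}_1), \ldots, e_n \overset{\hat{\rho}_{n-1}}{\rightsquigarrow}_{\hat{E}} (\hat{v}_n, \hat{\rho}_n)}{(e_1, \ldots, e_n) \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} ((\hat{v}_1, \ldots, \hat{v}_n), \hat{\rho}_n)}$

(53) $id \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{\rho}(id), \hat{\rho})$, if $id \in dom(\hat{\rho})$

(54) $id \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} ((id, ()), \hat{\rho})$, if $id$ is an enum identifier

(55) $\dfrac{e \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{v}, \hat{\rho}')}{id\ e \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} ((id, \hat{v}), \hat{\rho}')}$, if $id$ is an enum identifier

(56) $\dfrac{e \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{v}, \hat{\rho}')}{id[e] \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{\rho}'(id)(\hat{v}), \hat{\rho}')}$

(57) $\dfrac{n \overset{\hat{\rho} \oplus \hat{\rho}(id)}{\rightsquigarrow}_{\hat{E}} (\hat{v}, \hat{\rho}')}{id.n \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{v}, (\hat{\rho}' \setminus dom(\hat{\rho}(id))) \oplus (\hat{\rho} \upharpoonright dom(\hat{\rho}(id))))}$

(58) $Received\ id(id_1, \ldots, id_n) \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (true, \hat{\rho} \oplus \{id_i \mapsto \hat{v}_i \mid 1 \le i \le n\})$,
     if $id \mapsto (\hat{v}_1, \ldots, \hat{v}_n) \in \hat{\rho} \wedge id \in Lab_\iota$ or $id \in Lab_\delta \wedge \hat{\rho}(id) \neq \hat{\delta}_0(id)$

(59) $Received\ id(id_1, \ldots, id_n) \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (false, \hat{\rho})$,
     if $id \in Lab_\iota \wedge \neg\exists\, id \mapsto \hat{v} \in \hat{\rho}$ or $id \in Lab_\delta \wedge \hat{\rho}(id) = \hat{\delta}_0(id)$

(60) $\dfrac{e_1 \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{v}_1, \hat{\rho}_1), \ldots, e_n \overset{\hat{\rho}_{n-1}}{\rightsquigarrow}_{\hat{E}} (\hat{v}_n, \hat{\rho}_n)}{p(e_1, \ldots, e_n) \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{v}, \hat{\rho}')}$,
     if $(\hat{v}, \hat{\rho}') \in \widehat{[\![p]\!]}(\hat{\rho}_n, \hat{v}_1, \ldots, \hat{v}_n)$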
Here, the nondeterminism is introduced by rule (60), where one of the return
value/environment pairs of the abstract function is chosen. Naturally, we have to
impose a condition on the abstract version of a predefined function, relating it to
its concrete counterpart. This condition is, not surprisingly, that every concrete
computation by a predefined function must be in the concretization of the abstract
computation, if the concrete arguments are represented by the abstract ones:
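\[
(\rho, v_1, \ldots, v_n) \in \Gamma((\hat{\rho}, \hat{v}_1, \ldots, \hat{v}_n)) \Rightarrow
[\![p]\!](\rho, v_1, \ldots, v_n) \in \bigcup \{\Gamma((\hat{v}, \hat{\rho}')) \mid
  (\hat{v}, \hat{\rho}') \in \widehat{[\![p]\!]}(\hat{\rho}, \hat{v}_1, \ldots, \hat{v}_n)\}
\tag{5.3.5}
\]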
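Condition (5.3.5) can be checked exhaustively on a small toy domain. The following Python sketch is only an illustration: it assumes closed integer intervals as abstract values, ignores environments, and uses a hypothetical predefined function `half` together with a hypothetical abstract counterpart `half_abs` that returns a set of intervals.

```python
from typing import Set, Tuple

Interval = Tuple[int, int]             # closed interval [lo, hi], lo <= hi


def gamma(iv: Interval) -> Set[int]:
    """Concretization of an interval: all integers it contains."""
    lo, hi = iv
    return set(range(lo, hi + 1))


def half(v: int) -> int:
    """Concrete predefined function (hypothetical example)."""
    return v // 2


def half_abs(iv: Interval) -> Set[Interval]:
    """Abstract counterpart returning a set of possible abstract results.

    Splitting at zero is an arbitrary choice made for this example; it
    yields more than one result, so a rule like (60) would have to pick one.
    """
    lo, hi = iv
    if lo < 0 <= hi:
        return {(lo // 2, -1), (0, hi // 2)}
    return {(lo // 2, hi // 2)}


def is_sound(iv: Interval) -> bool:
    """Check the toy version of (5.3.5) for one abstract argument."""
    covered: Set[int] = set()
    for result in half_abs(iv):
        covered |= gamma(result)
    return all(half(v) in covered for v in gamma(iv))
```

The check succeeds whenever every concrete result lies in the concretization of at least one of the abstract results, which is what the condition requires.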
The rules for $\rightsquigarrow_p$ are the same as those for $\to_p$, just replacing $\to_E$ by $\rightsquigarrow_{\hat{E}}$.
This abstract inference system has the property that any concrete computation
has an abstract one starting from correct approximations:
Theorem 5.3.6 (Abstract Computation Correctness): Let a concrete derivation
$p \xrightarrow{(\nu, \delta_0)}_p (\rho, \rho_\delta)$ be given and $(\nu, \delta_0) \in \Gamma((\hat{\nu}, \hat{\delta}_0))$, then there exist $\hat{\rho}$
and $\hat{\rho}_\delta$ with $p \overset{(\hat{\nu}, \hat{\delta}_0)}{\rightsquigarrow}_p (\hat{\rho}, \hat{\rho}_\delta)$ and
$(\rho, \rho_\delta) \in \Gamma((\hat{\rho}, \hat{\rho}_\delta))$.
Proof:
As the only difference between the sets of inference rules is the relation used
to denote evaluation of expressions, it is sufficient to prove that the concrete
evaluation of an expression is approximated by the abstract evaluation, i.e.
we have to show
\[
e \xrightarrow{\rho}_E (v, \rho') \wedge \rho \in \Gamma(\hat{\rho}) \Rightarrow
\exists \hat{v}, \hat{\rho}' : e \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{v}, \hat{\rho}') \wedge (v, \rho') \in \Gamma((\hat{v}, \hat{\rho}'))
\tag{5.3.7}
\]
Here, the only rule different between the concrete and abstract computations
is the evaluation of predefined functions. For them, we have to
show that there exist $\hat{\rho}', \hat{v}$ such that
$p(e_1, \ldots, e_n) \overset{\hat{\rho}}{\rightsquigarrow}_{\hat{E}} (\hat{v}, \hat{\rho}')$. We can assume
that there exist $\hat{v}_i$ and $\hat{\rho}_i$, $1 \le i \le n$, such that
$e_i \overset{\hat{\rho}_{i-1}}{\rightsquigarrow}_{\hat{E}} (\hat{v}_i, \hat{\rho}_i)$ (by induction
on the length of the concrete inference tree). For those we also have
$(v_i, \rho_i) \in \Gamma((\hat{v}_i, \hat{\rho}_i))$. By the condition on the abstract versions of the
predefined function meaning, we have
$(v, \rho') = [\![p]\!](\rho, v_1, \ldots, v_n) \in \bigcup \{\Gamma((\hat{v}, \hat{\rho}')) \mid (\hat{v}, \hat{\rho}') \in \widehat{[\![p]\!]}(\hat{\rho}, \hat{v}_1, \ldots, \hat{v}_n)\}$.
Thus, there exists $(\hat{v}, \hat{\rho}') \in \widehat{[\![p]\!]}(\hat{\rho}, \hat{v}_1, \ldots, \hat{v}_n)$
such that $(v, \rho') \in \Gamma((\hat{v}, \hat{\rho}'))$, proving the claim.
$\Box$
This theorem in turn implies that for $\nu \in \Gamma(\hat{\nu})$ we have
$\mathcal{T}^u_i(\nu) \in \Gamma(\hat{\mathcal{T}}^u_i(\hat{\nu}))$ by
the definitions of $\mathcal{T}^u_i$ and $\hat{\mathcal{T}}^u_i$. This can then be used to prove the
Theorem 5.3.8 (Abstract Unit Traversion Correctness): If it is true for a
pair $(\nu, \delta')$ that $(\nu, \delta') \in \Gamma((\hat{\nu}, \hat{\delta}'))$, then we also have that
\[
\mathcal{T}^S_i(U, \nu, \delta') \in \Gamma(\hat{\mathcal{T}}^S_i(U, \hat{\nu}, \hat{\delta}'))
\]
Proof:
We do induction on the number of elements in the set $U$.
• $|U| = 0$: Then $\mathcal{T}^S_i(U, \nu, \delta') = (\nu, \delta')$ and
  $\hat{\mathcal{T}}^S_i(U, \hat{\nu}, \hat{\delta}') = \{(\hat{\nu}, \hat{\delta}')\}$. The
  claim trivially holds.
• Let $|U| = n + 1$, $n \ge 0$. The assumptions allow us to conclude that
  $\mathcal{T}^u_i(\nu) = (\nu'', \delta'') \in \Gamma(\hat{\mathcal{T}}^u_i(\hat{\nu}))$. This means there exist
  $(\hat{\nu}'', \hat{\delta}'') \in \hat{\mathcal{T}}^u_i(\hat{\nu})$ with $(\nu'', \delta'') \in \Gamma((\hat{\nu}'', \hat{\delta}''))$.
  This also means that
  – $\nu \oplus \nu'' \in \Gamma(\hat{\nu} \oplus \hat{\nu}'')$
  – $\delta' \oplus \delta'' \in \Gamma(\hat{\delta}' \oplus \hat{\delta}'')$
  By using the induction hypothesis we have that
  $\mathcal{T}^S_i(U \setminus \{u\}, \nu \oplus \nu'', \delta' \oplus \delta'') \in
  \Gamma(\hat{\mathcal{T}}^S_i(U \setminus \{u\}, \hat{\nu} \oplus \hat{\nu}'', \hat{\delta}' \oplus \hat{\delta}''))$
  holds, which proves our claim by the definition of $\mathcal{T}^S_i$ and $\hat{\mathcal{T}}^S_i$.
$\Box$
Using this theorem we can finally prove that our $\hat{\mathcal{T}}$ is a correct abstract transition
function.
Theorem 5.3.9 (Abstract Transition Function Correctness): The function
$\hat{\mathcal{T}}$ defined in (5.3.2) is an abstract transition function as defined in (5.2.3).
Proof:
First, $\hat{\mathcal{T}}$ always reaches a state in which $\hat{F}$ holds, because $\hat{F}$ correctly
approximates $F$ and that one is guaranteed to finish every instruction eventually.
Then, we have to show that $\hat{\mathcal{T}}$ is a correct approximation to $\mathcal{T}$, i.e.
\[
\mu \in \Gamma(\hat{\mu}) \Rightarrow \mathcal{T}(\mu) \in \Gamma(\hat{\mathcal{T}}(\hat{\mu}))
\]
Assume that $H(\mu)$ is true. Then $\mathcal{T}(\mu) = \bot$ and because of the properties
imposed on $\hat{H}$ we have that $\hat{\mathcal{T}}(\hat{\mu}) = \{\hat{\bot}\}$, where $\bot \in \Gamma(\hat{\bot})$ is true. If
$\neg H(\mu)$ then also $\neg \hat{H}(\hat{\mu})$. By the previous theorem we have that
$(\nu_1, \delta_1) = \mathcal{T}^S_1(M, \mu, \delta_0) \in \Gamma(\hat{\mathcal{T}}^S_1(M, \hat{\mu}, \hat{\delta}_0))$, where $M = \{1, \ldots, N\}$.
This means there exist $(\hat{\nu}_1, \hat{\delta}_1) \in \hat{\mathcal{T}}^S_1(M, \hat{\mu}, \hat{\delta}_0)$ with
$(\nu_1, \delta_1) \in \Gamma((\hat{\nu}_1, \hat{\delta}_1))$. Applying that theorem once more we obtain
$(\nu_2, \delta_2) = \mathcal{T}^S_2(M, \nu_1, \delta_1) \in \Gamma(\hat{\mathcal{T}}^S_2(M, \hat{\nu}_1, \hat{\delta}_1))$. By
the properties of the $\oplus$ operator we then have
\[
\oplus(\nu_2, \delta_2) \in \Gamma\Bigl(\oplus\bigl(\bigcup \{\hat{\mathcal{T}}^S_2(M, \hat{\nu}', \hat{\delta}') \mid
  (\hat{\nu}', \hat{\delta}') \in \hat{\mathcal{T}}^S_1(M, \hat{\mu}, \hat{\delta}_0)\}\bigr)\Bigr)
\]
which proves the claim.
$\Box$
This concludes the correctness proof for our unit update style pipeline analy-
sis. The following sections will give details on the analyses for the two example
models presented in the previous chapter.
5.3.1 Analysis for the MCF 5307
For this analysis two major abstractions were made. First, the cache is abstracted
by a one-way (direct-mapped) abstract cache with 128 sets. All updates on the
cache are replaced by the abstract must cache update function for this cache
[Fer97]. All functions that classify a cache access are replaced by the classification
function for this cache, more precisely
\[
cl(m, c) =
\begin{cases}
\{hit\}, & \text{if } m \in c(set(m)) \\
\{hit, miss\}, & \text{otherwise}
\end{cases}
\]
$cl(m, c)$ returns the possible concrete values for the classification of memory block
$m$ with the abstract cache $c$.
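The classification function can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the analysis: the abstract cache is assumed to be a mapping from set index to the blocks guaranteed to be cached, and the line size, the number of sets and all names are assumptions made for the example.

```python
from typing import Dict, FrozenSet, Set

LINE_SIZE = 16     # bytes per cache line (assumed for this example)
NUM_SETS = 128     # number of cache sets (assumed for this example)

# Must cache: set index -> memory blocks guaranteed to be in the cache.
AbstractCache = Dict[int, FrozenSet[int]]


def block_of(address: int) -> int:
    """Memory block that contains the given byte address."""
    return address // LINE_SIZE


def set_of(block: int) -> int:
    """Cache set a memory block maps to."""
    return block % NUM_SETS


def classify(block: int, cache: AbstractCache) -> Set[str]:
    """Possible concrete classifications of an access to `block`."""
    if block in cache.get(set_of(block), frozenset()):
        return {"hit"}
    # Must information alone can never guarantee a miss.
    return {"hit", "miss"}


known = block_of(0x1000)
cache: AbstractCache = {set_of(known): frozenset({known})}
```

An access to `known` is classified as a guaranteed hit, while any other block, even one mapping to the same set, yields both outcomes.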
The other abstraction was more far-reaching: all addresses that may not be
known precisely in components or signals were replaced by intervals of addresses.
This affects the addresses of data accesses in the execution pipeline, $read(a)$ and
$write(a)$, and the state in the EX and bus unit. The predefined functions are
replaced by abstract versions that operate on intervals. As an example, the cache
update with such an interval has to touch all cache lines covered by the interval,
while no block can be placed in the cache if the interval is not precise (i.e. does not
consist of exactly one element). This results in loss of information about the (unified)
cache for any cached imprecise data access. The other important abstraction is the
function that returns the timings for a memory area referenced by an address. This
function must now return a set of possible timings, namely all the timings for the
memory areas spanned by the interval. The other necessary abstractions are obvious
from the concrete model.
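The abstract timing function for address intervals can be sketched as follows. The memory map, the cycle counts and all names are hypothetical and serve only to illustrate how an imprecise address produces a set of timings.

```python
from typing import List, NamedTuple, Set, Tuple


class MemoryArea(NamedTuple):
    name: str
    start: int          # first address, inclusive
    end: int            # last address, inclusive
    read_cycles: int    # access time in bus cycles


# Hypothetical memory map; the numbers are made up for this example.
AREAS = [
    MemoryArea("RAM", 0x0000_0000, 0x000F_FFFF, 2),
    MemoryArea("FLASH", 0x1000_0000, 0x100F_FFFF, 5),
    MemoryArea("IO", 0xF000_0000, 0xF000_FFFF, 12),
]


def read_timings(interval: Tuple[int, int], areas: List[MemoryArea]) -> Set[int]:
    """Timings of all areas that intersect the closed address interval."""
    lo, hi = interval
    return {a.read_cycles for a in areas if a.start <= hi and lo <= a.end}


precise = read_timings((0x0000_1000, 0x0000_1000), AREAS)    # one address
unknown = read_timings((0x0000_0000, 0xFFFF_FFFF), AREAS)    # no knowledge
restricted = read_timings((0x0000_0000, 0x0000_FFFF), AREAS) # known range
```

A precise address gives a single timing, whereas an unknown address gives one timing per configured memory area, each of which leads to its own successor state in the analysis.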
The abstract state predicates are the same as the concrete ones, i.e. we represent
this information exactly.
The other (implicit) components in the concrete model (memory, registers) are
abstracted by the domain with just the element $\top$, i.e. no knowledge is
available for these components.
5.3.2 Analysis for the PPC 755
Two major abstractions have been introduced: caches and memory address intervals.
The caches are the abstractions of the PowerPC 755 caches, which only
represent must information: some memory blocks can be guaranteed to be in the
cache. Thus, the abstract cache classification function is the same as for the ColdFire.
The cache update function is the one modeling the PowerPC 755 abstract
cache, which is able to retain information for at most half of the blocks in a set.
All data address components and signals are abstracted by address intervals.
This is relevant for the LSU and the store buffer as well as the bus unit and the
chip set unit, CSU. As for the ColdFire, all functions working on such addresses
were changed to abstract versions that return the collection of information valid
for the interval, e.g. access timings.
The function that gives the number of cycles for instructions to execute in the
functional units is abstracted by a version that always returns the set of possible
cycle count values, e.g. $\{1, 2, 3\}$ for the IU if a multiplication is performed.
Functions that determine if a speculative branch has been resolved correctly are
abstracted by a version that uses infeasibility information from the value analysis,
cf. Section 6.2, to sometimes limit the set of possible return values. If this is not
possible, it returns the set $\{false, true\}$.
The abstract state predicates are the same as the concrete ones.
Here too, memory and registers are abstracted by the single-element domain
$\top$, as for the ColdFire analysis.
5.4 Nondeterminism
The abstract computation semantics $\rightsquigarrow_p$ of the pipeline models is nondeterministic,
because some decisions are based on data for which no precise information
(or no information at all) is available due to the abstraction applied to the components.
The abstract transition function collects all possible results from the
computation semantics. The correctness of the analysis has been shown
for this transition function.
Naturally, it is costly to collect all possible results from the nondeterministic
computation semantics. If we are primarily interested in the WCET of a program,
we only need to consider the result(s) of $\rightsquigarrow_p$ that can lead to the WCET. For
architectures without timing anomalies this would be the local worst-case result,
e.g. the state resulting in a cache miss if a cache access could not be precisely
classified as hit or miss in a cycle update. In this case, we can define a deterministic
abstract computation semantics $\rightsquigarrow^l_p$ which is defined by
\[
p \overset{\hat{\tau}}{\rightsquigarrow}{}^l_p\, \tau' \iff
p \overset{\hat{\tau}}{\rightsquigarrow}_p \tau' \wedge \tau' \text{ represents the local worst-case}
\]
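The restriction to the local worst-case can be illustrated with a small Python sketch. It assumes that the only nondeterministic choice is a cache classification and that a miss is locally worse than a hit; the latencies and names are hypothetical. As stated above, such a reduction is only safe for architectures without timing anomalies.

```python
from typing import Dict, Set

# Assumed latencies in cycles, chosen only for this example.
LATENCY: Dict[str, int] = {"hit": 1, "miss": 6}


def local_worst_case(outcomes: Set[str]) -> Set[str]:
    """Keep only the outcome with the largest local latency."""
    worst = max(outcomes, key=lambda o: LATENCY[o])
    return {worst}


def successor_cycles(cycles: int, outcomes: Set[str], local_only: bool = False) -> Set[int]:
    """Cycle counts of all successor states of one cache access."""
    if local_only:
        outcomes = local_worst_case(outcomes)
    return {cycles + LATENCY[o] for o in outcomes}


all_successors = successor_cycles(10, {"hit", "miss"})
worst_only = successor_cycles(10, {"hit", "miss"}, local_only=True)
```

With `local_only` set, the unclassified access produces one successor instead of two, which makes this step of the toy semantics deterministic.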
Even for architectures with timing anomalies, we can make $\rightsquigarrow_p$ “less nondeterministic”
by only considering result states $\tau'$ that may lead to the global worst-case,
leaving out those states that are guaranteed to never participate in a global
worst-case execution time scenario. Unfortunately, it is not easy to decide which
local case cannot result in a global worst-case. For such a proof, it must be shown
that all state traces starting from this state are not longer than all possible traces
starting from other states. As there are subtle interdependencies between the components
of the pipeline model, we have not introduced such a reduction of nondeterminism
and stayed on the safe side of looking at all possible results of $\rightsquigarrow_p$ in
our analysis.
This complete exploration of the state space seems to be prohibitively expen-
sive in terms of memory usage and computation time. Fortunately, the problem is
reduced by the following observations:
• The number of possible result states is bounded as there are not many
  nondeterministic choices to be made in one cycle. Additionally, if a cycle
  update contains nondeterministic choices with multiple successor states, the
  next one will probably be deterministic for each of them, so the state graph
  does not branch at every level (cycle).
• The analysis itself records the number of executed cycles not with every
  state but rather has an upper bound for all cycles associated with a set of
  abstract states. Thus, states differing only in the number of cycles simulated
  on them fall together, reducing the number of states in the set. In fact, it can
  be seen in the actual outputs of the analysis that a lot of states fall together
  this way, reducing the state space further.
• We can always apply a widening to a set of states. This widening computes
  a smaller set of states, which is still a correct approximation for the original
  state set. In practice, we use this technique to reduce sets of states that differ
  only in their abstract cache contents, by replacing them with a single state,
  whose cache content is the least upper bound (in the abstract cache domain)
  of all the abstract caches of the original states.
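The widening described in the last item can be sketched as follows. The sketch makes simplifying assumptions that are not taken from the thesis: a state is a pair of an opaque pipeline part and a must cache, the must cache stores only sets of guaranteed blocks without ages, and the least upper bound is therefore taken to be the per-set intersection.

```python
from functools import reduce
from typing import Dict, FrozenSet, Hashable, List, Tuple

Cache = Dict[int, FrozenSet[int]]      # set index -> blocks guaranteed cached
State = Tuple[Hashable, Cache]         # (pipeline part, abstract cache)


def join_must(a: Cache, b: Cache) -> Cache:
    """Keep only blocks guaranteed in both caches (ages are ignored here)."""
    joined = {s: a[s] & b[s] for s in a.keys() & b.keys()}
    return {s: blocks for s, blocks in joined.items() if blocks}


def widen(states: List[State]) -> List[State]:
    """Merge states that differ only in their abstract cache contents."""
    groups: Dict[Hashable, List[Cache]] = {}
    for pipeline, cache in states:
        groups.setdefault(pipeline, []).append(cache)
    return [(pipeline, reduce(join_must, caches))
            for pipeline, caches in groups.items()]


states: List[State] = [
    ("P1", {0: frozenset({10}), 1: frozenset({21})}),
    ("P1", {0: frozenset({10}), 1: frozenset({37})}),
    ("P2", {0: frozenset({10})}),
]
merged = widen(states)
```

The two states with pipeline part `"P1"` collapse into one whose cache only retains block 10, so the result is smaller but still a safe approximation of the original set.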
In practice there are a number of significant sources of nondeterminism:
• cache accesses that cannot be classified precisely,
• external memory accesses whose address is not known precisely enough,
• varying execution times of instructions.
The nondeterminism due to caching comes from the fact that the cache analyses
are not able to classify every access precisely as cache hit or miss. These
accesses must then be treated both as cache miss and as cache hit, resulting in two
successor states. The problem is even more pronounced for the caches of the ColdFire
5307 and PowerPC 755, because the cache analysis is never able to classify accesses
as guaranteed cache misses, since no may analysis is available for these caches. Also, the
must analysis for the caches can only predict a reduced portion of the cache contents
(at most 1/4 for ColdFire and 1/2 for PPC 755). Because instruction cache
accesses can cross cache line boundaries, up to two cache lines may be accessed
for a single fetch. Thus, this can result in up to four successor states if these
accesses cannot be classified exactly by the analysis.
Varying execution times for functional units due to shortcut evaluation in the
units (e.g. multiplication of “short” numbers with leading zero bits) result in as
many successor states for one state as there are possible execution times for the
functional units. It may be possible to reduce this nondeterminism by exporting
more information from the value analysis (cf. Chapter 6) about the values of the
operands. However, this makes the analysis itself more complicated. The effects
of this nondeterminism are limited to a few cycles most of the time because the
resulting states fall together after the instruction has completed and the processor
waits for main memory to conclude an instruction fetch or data access.
The most severe source of nondeterminism arises from memory accesses. The
configuration of a real-time system typically contains a significant number of
memory areas with different access characteristics and access times. If the address
of a data access is not known precisely, all memory areas that intersect with
the interval used as abstraction of the data access address must be considered as
possible targets for the access. This results in as many successor states as there
are memory areas intersecting the interval. This is especially problematic for
analyses of functions in the program that contain arguments used as indices into
an array. As the values of these arguments are unknown initially, all array accesses
are assumed to span the whole address range. The result is typically a large number
of states and a loss of precision for the WCET result because slow memory
areas (controller register maps, etc.) are probably never really the target of these
accesses during a concrete execution of the function. This can also lead to a chain
of branches in the state graph. Some instructions access memory multiple times
to save and restore register sets at function entry or exit. If this branching in the
state graph occurs for every access, a lot of useless states are generated. One
can circumvent this problem by the assumption that those instructions will access
memory only in the same memory area during execution. Then, a state component
is added that records the memory area that is currently being accessed. This
component is initialized on the first access of the instruction to a different value
in each of the successor states, according to the memory areas possibly targeted by
the instruction. This way the state graph branches only once, for the first memory
access of the instruction.
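The effect of recording the accessed memory area in the state can be made concrete with a short sketch. The area names and the access count are hypothetical; the sketch only counts the branches in the state graph for one multi-access instruction.

```python
from itertools import product
from typing import List, Sequence, Tuple


def branch_every_access(areas: Sequence[str], accesses: int) -> List[Tuple[str, ...]]:
    """Every access may independently target any candidate memory area."""
    return list(product(areas, repeat=accesses))


def branch_once(areas: Sequence[str], accesses: int) -> List[Tuple[str, ...]]:
    """The area chosen at the first access is recorded and reused afterwards."""
    return [(area,) * accesses for area in areas]


AREAS = ["RAM", "FLASH", "IO"]       # hypothetical candidate areas
naive = branch_every_access(AREAS, 4)
pinned = branch_once(AREAS, 4)
```

For three candidate areas and four accesses the naive scheme explores 81 combinations, whereas the recorded area component leaves three, one per memory area.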
In practice, many of the nondeterministic choices can be avoided by introducing
user annotations into the analysis. These are assertions made by the user of
the analysis that add knowledge the analysis cannot infer itself. For example, the range
of a function argument may be restricted to a certain interval due to knowledge of
the size of the arrays accessed via this argument. This will drastically reduce the
branching due to unknown memory accesses and improve the precision of the results.
As mentioned in Chapter 6, there are a lot of annotations that can be added
to the analysis this way to improve the results and decrease the costs of the analysis
itself. A mode has even been implemented that makes the computation
deterministic by only using the local worst-case in pipeline state computations.
5.5 Parallelism in a Sequential Data-Flow Analysis
The concept of a control-flow graph relies on the sequential execution of the in-
structions in the program. Thus, also data-flow analysis assumes that instructions
are executed sequentially and extensions to parallel programs have not been too
successful. This seems to make the analysis of pipelines by data-flow analysis im-
possible, since pipelines are solely there to parallelize program execution as much
as possible.
Luckily the parallelism in a machine program is of a limited form, namely
designed in such a way as to guarantee the sequentiality of the effects of the pro-
gram on the architectural state. This is mostly due to the fact that certain kinds
of exceptions in the system must be guaranteed to be precise. Consider for ex-
ample an operating system with virtual memory. If a page fault occurs because
the accessed page has not yet been mapped in from backing store an exception is
raised. The handler routine must be able to continue the execution of the program
with the instruction that caused the fault after it has mapped in the page. For this,
the address of that instruction must be known and all effects of instructions before
this instruction must be known to be committed to the architectural state.
This does not keep the architectures from dropping instructions that have no
effect as soon as possible or from performing actions that have no effects on the state
(e.g. speculative data accesses). The important consequence of all these considerations
is that there exists a unique criterion for when an instruction has left the
pipeline: its effects have been committed. We can use this to enforce a virtual
sequentiality for our analyses as expressed by the notion of state predicates in
Definition 4.1.3.
In reality, the fact that certain instructions may be dropped without executing
in the processor (noops, folded branches) necessitates additional components in the
state modeling that record (for the analysis) that these instructions have been
encountered. By careful bookkeeping of these instructions, we can use the sequential
data-flow analysis to analyze the parallel pipeline of a processor.
One example of such bookkeeping comes from the fact that an instruction
stream is prefetched long before it even starts execution. Intuitively, this means
we have a look-ahead along edges in the CFG for the instructions fetched. We
must record not only the addresses of these instructions but also the contexts un-
der which they occur in the CFG, i. e. a look-ahead in the supergraph formed by
unrolling the CFG with contexts. For this reason, the analyses have state com-
ponents that record the context number for prefetched instructions. These can
be viewed as the information needed to compute the state predicate for the next
instruction.
5.6 Other Caveats
More practical things to consider are mainly related to the efficiency of the analysis.
The pipeline updates need a lot of information about the instructions currently
in the pipeline, for example the registers written to by an instruction (which is
identified by its address in the program code). Recomputing this information in
the update program for every cycle where it is accessed is costly. As this information
is fixed for the program, it can be precomputed. For this, a single pass along
the CFG is done initially and all relevant information is collected in a hash table
indexed by the instruction addresses.
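Such a precomputation can be sketched in Python as follows. The instruction record, the register names and the toy CFG are hypothetical; the sketch only shows the single pass that fills a hash table keyed by instruction address, and a lookup that a pipeline update might perform.

```python
from typing import Dict, FrozenSet, Iterable, List, NamedTuple, Tuple


class Instruction(NamedTuple):
    address: int
    mnemonic: str
    writes: Tuple[str, ...]    # registers written
    reads: Tuple[str, ...]     # registers read


class InstructionInfo(NamedTuple):
    writes: FrozenSet[str]
    reads: FrozenSet[str]


def build_info_table(basic_blocks: Iterable[List[Instruction]]) -> Dict[int, InstructionInfo]:
    """One pass over all basic blocks; the result is indexed by address."""
    table: Dict[int, InstructionInfo] = {}
    for block in basic_blocks:
        for insn in block:
            table[insn.address] = InstructionInfo(frozenset(insn.writes),
                                                  frozenset(insn.reads))
    return table


# Hypothetical CFG given as a list of basic blocks.
CFG = [
    [Instruction(0x100, "add", ("r3",), ("r4", "r5")),
     Instruction(0x104, "lwz", ("r6",), ("r3",))],
    [Instruction(0x108, "stw", (), ("r6", "r1"))],
]
INFO = build_info_table(CFG)


def has_raw_dependency(earlier: int, later: int) -> bool:
    """Does the later instruction read a register the earlier one writes?"""
    return bool(INFO[earlier].writes & INFO[later].reads)
```

Each cycle update then needs only constant-time dictionary lookups instead of decoding the instruction again.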
Chapter 6
A WCET Toolframe
No practical analysis for out-of-order su-
perscalar processors has been presented.
Due to the problems of very complex be-
havior, timing anomalies, and the prob-
lem of finding all relevant processor fea-
tures, we do not think that a practical,
safe, and tight analysis is feasible.
[Eng02]
Start by doing what is necessary, then
what is possible, and suddenly you are do-
ing the impossible
St. Francis of Assisi
The novel approach presented in this thesis explains how a system’s hardware can
be modeled and how a correct (pipeline) analysis of the system’s timing behavior
can be derived from this model. However, a real-life tool has many more components
aside from the pipeline analysis. The whole framework is the result of
several PhD theses and has been developed by the company AbsInt.
Figure 6.1 shows the toolchain that makes up AbsInt’s WCET analysis tool,
aiT. Eight main phases, realized by separate programs, can be identified:
[Figure 6.1: The aiT toolchain. Phases: CFG Reconstruction, Loop Transformation,
Loop Analysis, Value Analysis, Pipeline Analysis, Path Analysis, ILP Solver,
Visualization. Data: Binary Executable, Config Files, CRL, Value Information,
BB Timings, ILP, WCET+Path.]
1. The reconstruction of the control-flow graph takes a binary executable and
extracts the CFG of the part to be analyzed into the intermediate representation,
CRL. This phase is discussed in more detail in Section 6.1.
2. The second phase, loop transformation, searches for loops in the CFG extracted
by the first phase and marks them. As explained in Section 3.3.1,
loops are extracted in the intermediate representation as virtual procedures
so that the common context separation techniques can be applied for the
precise analysis of loops.
3. The loop analysis phase tries to find the iteration bounds for the loops in
the program statically. This is done by a data-flow analysis that derives
the loop bounds for certain code patterns based on safe information. For example,
if a variable is incremented by a constant increment upon every iteration
and the loop termination condition is a comparison against a fixed number,
the analysis will validate these conditions and derive the iteration bound
from the initial value of the loop iteration variable, the increment and the
comparison value. Many loop bounds can be automatically detected by this
analysis. Thus, only a small number of loop bounds must be given manually
by the user of the WCET tool.
4. The next phase, value analysis, uses a data-flow analysis to determine intervals
for the data memory accesses performed by the program. These
intervals represent safe bounds for the addresses of these accesses computed
dynamically at run-time. Section 6.2 has more details about this important
phase. The value analysis writes its results into a separate result file, which
is read by the next phase.
5. The pipeline analysis implements the concepts introduced in this thesis and
determines WCET bounds for the basic blocks of the program. These results
are written out into a separate result file, which is read by the path analysis
phase.
6. The path analysis generates an integer linear program (ILP) from the CFG
of the program and the results of the pipeline analysis. The objective func-
tion of the ILP is the execution time of the program, which is then max-
imized. This phase can be influenced by various configuration statements
to exclude special paths or whole code pieces and to add extra constraints
to model special relations between parts of the program. This phase is the
work of Henrik Theiling and is described in detail in his PhD thesis [The02].
7. The seventh phase solves the ILP produced by the path analysis with the
help of an ILP solver. In addition, this phase translates the results of the
ILP solver back to a WCET for the whole program and constructs a worst-case
path through the CFG of the program. This path is one of the (possibly)
many paths through the CFG upon which the computed WCET for the
whole program would occur. Also, this part computes WCET contributions
for all the basic blocks and functions in the program¹.
8. The last phase computes and displays a graphical and interactively explo-
rable representation of the WCET results. Section 6.4 has more details and
examples about this phase.
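The loop bound derivation mentioned for phase 3 can be illustrated for one simple pattern. The sketch below assumes a loop of the form `i = init; while i < limit: i += step` with a positive constant step and a safe lower bound on the initial value; this pattern and the function name are assumptions for the example and do not describe the patterns the tool actually recognizes.

```python
def max_iterations(init_lo: int, limit: int, step: int) -> int:
    """Upper bound on the iterations of `i = init; while i < limit: i += step`.

    `init_lo` is a safe lower bound on the initial value of i, since the
    smallest start value gives the largest number of iterations.
    """
    if step <= 0:
        raise ValueError("pattern requires a positive constant increment")
    if init_lo >= limit:
        return 0
    # Ceiling division of the distance by the step.
    return (limit - init_lo + step - 1) // step


def simulate(init: int, limit: int, step: int) -> int:
    """Count iterations by actually running the loop (for cross-checking)."""
    count, i = 0, init
    while i < limit:
        i += step
        count += 1
    return count
```

For example, a counter starting at 0 and incremented by 3 while it is below 10 executes the loop body four times, which is the bound returned by `max_iterations(0, 10, 3)`.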
All these phases are influenced by global configuration files and individual
command line settings. The global configuration files contain the information that
the user has to give about the code to be analyzed:
• The cache configurations. On the PowerPC 755, the caches can be partially
  locked and the amount of locking can be specified by giving the relevant
  cache size and associativity. Also, this feature can be used to explore the
  effects that different cache architectures (size, replacement strategy, associativity)
  would have on the WCET; Section 8.1 contains, as an example, a
  comparison of a real LRU cache against the MPC 755 PLRU cache.
¹ The WCET contributions for basic blocks are not the same as the results of the pipeline analysis
for these blocks. Blocks may never be traversed in a worst-case path or they may be traversed more
than once.
• The memory configuration. For each memory area in the system, its timings
  and attributes must be given. Timings give the access times for a memory
  access in bus cycles. Attributes determine the behavior of an access in this
  region:
  – Whether an access to this area is cacheable or not. Some memory areas in
    a system must not be cached by the processor, e.g. memory-mapped
    control registers; for others one may choose not to cache them for reasons
    originating from the software design itself (e.g. DMA memory areas).
  – Whether this memory area is guarded, i.e. data prefetching is not allowed for
    this area (control registers).
  – Timings and attributes can be given separately for code and data accesses
    and for read or write accesses to make fine-grained attribute control
    possible.
• The precision that the data-flow analysis should have for all the analyses
  performed in the toolchain. This is achieved by giving the context mapping
  which should be applied to introduce additional contexts into the data-flow
  analyses, cf. Section 3.3.1. As the loop iteration bounds are known, the
  analyses can be configured in such a way that every loop iteration is analyzed
  separately.
• Constraints for the instructions and control flow can be annotated in the
  configuration files. This makes it possible to exclude certain code segments from the
  analysis or to analyze at a finer granularity than a complete function.
• To make the analysis of isolated functions more precise, intervals for the
  values of arguments to the function can be specified. This is very helpful to
  avoid the assumption that a memory access indexed by a function parameter
  may span the whole memory in the system (including the control registers of
  slow peripheral devices). By giving an interval for the possible values of
  the parameter, the memory accesses can be restricted to the RAM areas,
  increasing the precision considerably.
• Finally, loop bounds for loops that cannot be detected by the loop analysis
  can be annotated. These can not only be given with the address of the loop;
  annotations like “the second loop in the function runs for at least 3
  and at most 10 iterations” can also be given².
² The tool can also support the extraction of certain annotations from comments placed into the
source code of the program.
The different phases of the toolchain can further be influenced by command
line options that affect only the selected phase.
• The pipeline analysis can be instructed to consider only local worst-cases
  in its abstract model simulation, reducing the nondeterminism of the model
  and decreasing the analysis costs. Naturally, the results may not be a WCET
  bound due to timing anomalies possible with the processor, cf. Section 8.1.
• A special cache mode lets the pipeline analysis assume that every access to
  a cacheable memory area is a cache hit (for instruction and/or data cache).
  This makes it possible to investigate an upper bound for the influence of the cache
  misses.
• In another cache mode, the caches are treated as containing no useful information
  at all, i.e. for every cache access, the analysis can assume neither
  a cache miss nor a cache hit with certainty and follows both possibilities. This can
  be used to obtain some information about the benefit of the caching used by
  comparing the results against a “normal” analysis run.
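The two cache modes can be pictured as different classification policies. The following Python sketch is an illustration only; the mode names, the `"uncached"` outcome and the function signature are hypothetical and do not correspond to actual aiT options.

```python
from typing import Set

MODES = ("normal", "always_hit", "no_info")    # hypothetical mode names


def classify_access(cacheable: bool, guaranteed_hit: bool, mode: str = "normal") -> Set[str]:
    """Possible outcomes of one access under the selected cache mode."""
    if mode not in MODES:
        raise ValueError(f"unknown cache mode: {mode}")
    if not cacheable:
        return {"uncached"}            # the cache modes do not apply
    if mode == "always_hit":
        return {"hit"}                 # every cacheable access is a hit
    if mode == "no_info":
        return {"hit", "miss"}         # cache knowledge is discarded
    # Normal mode: rely on the must information of the cache analysis.
    return {"hit"} if guaranteed_hit else {"hit", "miss"}
```

Comparing the WCET obtained under `"always_hit"` and `"no_info"` with a normal run brackets the influence that the cache has on the result.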
The main data that the phases of the toolchain share is a textual representation
of the CFG of the program to be analyzed, which is generated by the first
phase, control-flow reconstruction. This intermediate representation contains the
instructions of the program together with attributes for each instruction, grouped
into basic blocks. CRL is described in more detail in [Lan98]. Information gained
by many of the analyses is merged into CRL in the form of new attributes, or by
transformations of the CRL representation, e.g. for loop transformation. This
gives a clear interface between different phases of aiT and also to additional tools,
e.g. the trace verification of Section 7.3. Furthermore, the representation is generic,
so that some programs implementing phases in aiT can be shared among different
versions of aiT or the toolchains of other programs without modifications (e.g.
path analysis).
The toolchain of aiT is held together by a central driver program that calls
the different phases and also provides a graphical user interface within which the
configuration can be changed easily. Figure 6.2 shows the main window of aiT.
The executable (e. g. in ELF format) and the location of the configuration files
can be selected in the upper left window together with the start point in the exe-
cutable that is to be analyzed. The window on the lower left shows the disassem-
bly of the selected function, while any message produced during the analysis is
displayed in the window on the right. The tool can save all settings in a project
file that can also be used to perform analyses in batch mode without the graphical
user interface to facilitate analysis automation.
Figure 6.2: aiT’s main window
Figure 6.3 shows one dialog in which options influencing the different phases
can be set directly. Here, the initial value of the stack pointer and the cache modes
(cf. above) can be set as well as the ratio of the speeds of the processor core and bus
clocks. The PowerPC EABI3 defines that global data is addressed relative to two
registers, which are loaded with fixed addresses during the program prologue. The
value of these registers can be inferred directly from the executable’s symbol table
or given manually. Another switch in this dialog instructs the pipeline analysis to
only consider local worst-cases in its analysis. With this, e. g., a cache access not
known to be a cache hit is assumed to be a cache miss. The global view would
follow both alternatives in the analysis.
3The Embedded Application Binary Interface fixes the environment in terms of linkage and
register usage.
Figure 6.3: One of aiT’s option windows
aiT has three top-level functionalities:
• The reconstructed CFG can be displayed in an interactively explorable
representation.
• The WCET analysis can be performed and the results displayed graphically.
• The WCET analysis can be performed and the cycle-wise state evolution of
the abstract pipeline model can be displayed in an interactively explorable
view.
Examples of the outputs from these three modes are given in Section 6.4.
6.1 Analyzing Binaries
aiT takes as input a binary executable (in ELF or COFF format). While it would
be much easier to start the analysis on something at a “higher” level, e. g. assem-
bler or even C source code, this is not possible in our setting: it is the common
understanding in the avionics verification area that, for timing validation, an
analysis giving answers about the behavior of an executable (notably its running
time) has to be performed on the executable itself. Using source or intermediate
code is not feasible as this code undergoes transformations before it ends up in
the executable (compiler, code generators, ...). Another complication is that it
is difficult to use information from the source code (e. g. loop iteration bound
annotations made as comments in the C source code) in the analysis itself: if the
toolchain
that produces code for the avionics system is changed, the whole toolchain must
be recertified with the airworthiness authorities. Therefore, it is practically impos-
sible to introduce changes into this toolchain in order to facilitate validation. As
a consequence, aiT only relies on the binary executable and separate annotations
made by the user of the tool4. Practice has shown that this approach scales up to
real-life analysis tasks.
Working on executables also has the advantage that many important param-
eters are fixed and can be utilized directly in the analyses. E.g. the addresses of
the code are known, which would not be the case for assembler or C code. Thus,
cache analysis is much easier. Otherwise, modular cache analysis would be nec-
essary which reduces the precision and increases the costs of the framework, cf.
[RPTW04].
The reconstruction of control-flow from binaries is described in detail in
[The02]. We only highlight some important issues. The reconstructed
control-flow must be safe, i. e. every execution path possibly taken during a real
execution must be present in the CFG. This becomes difficult for machine in-
structions whose successor(s) are dynamically computed. This feature is often
used to implement function calls through a function pointer or, less obviously, to
implement high level language constructs equivalent to the switch statement in
C. Function tables for switch statements and similar constructs can often be rec-
ognized automatically if the compiler that generated the code is known. This is
because a compiler lays out the necessary function tables and call code in a fixed
way. The compiler can often be deduced from other information in the binary5.
If the target of a computed branch or call cannot be determined statically, the
CFG reconstruction terminates with an error message. In this case, user annotation
in a configuration file is required to specify all possible targets of that instruction.
The other alternative is to assume that all routines in the executable may be called
by this instruction, which would lead to very imprecise results.
The processor architecture itself can make it difficult to recognize function
calls and return instructions. E.g. on the PowerPC architecture, function call and
4Other versions of aiT are able to extract the annotations made in the source code itself.
5The gcc compiler defines a special variable in a data section to indicate that the code was
generated by it, etc.
return are performed by two machine instructions, blr and blrl. Both instructions
continue execution at the address stored in the link register. The second form in
addition stores the address of the instruction following it in the link register. Here, the
CFG reconstruction must make sure that the link register has been loaded with the
address of a function and identify the function, e. g. by using pattern matching on
the code. To complicate things, the link register may also be used for indirect or
far function calls and to implement switch statements.
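The effect of the two instructions on the program counter and the link register can be summarized in a small sketch. The function name and the string encoding of the mnemonics are hypothetical; the sketch only models the unconditional forms, with instruction addresses being word aligned and each instruction being 4 bytes long.

```python
def branch_via_link_register(mnemonic, pc, lr):
    """Return (next_pc, new_lr) for an unconditional blr or blrl at address pc."""
    target = lr & ~0x3  # the two low-order bits of the link register are ignored
    if mnemonic == "blr":
        # Typical use: return from a function.
        return target, lr
    if mnemonic == "blrl":
        # Typical use: indirect call; the return address goes to the link register.
        return target, pc + 4
    raise ValueError("unsupported mnemonic: " + mnemonic)
```

As both forms take their target from the link register, the opcode alone does not tell a return from a call or a switch table jump, which is why the CFG reconstruction has to inspect the code that loads the register.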
The result of the CFG reconstruction is the CFG in CRL representation, which
is written to a file. Attributes for each instruction in the CFG are added by this
phase. These attributes define entities like the registers used by an instruction, its
opcode, class, etc. Later phases make heavy use of the information computed in
this phase. In addition, the call graph is present in the CRL file as special attributes
for basic blocks: which block may call another block.
6.2 Value Analysis
The value analysis phase computes intervals as safe approximation to the ad-
dresses that may be used during data accesses6. In [Sic97] M. Sicks described
a value analysis for SPARC machine code that computes safe approximations for
the contents of machine registers. The principle is to perform the interval analy-
sis from [CH78] on machine programs. The value analysis performs an abstract
interpretation of the machine instruction semantics over the domain of intervals.
The exported result is not an interval for the contents of each machine register
at each program point, but only intervals for the instructions that access data
memory. Here, the address calculation of the instruction together with the
internally present interval information is used to compute a safe interval for
the address that is used during the memory access7. This analysis uses the same
context separation technique as the pipeline analysis, in order to analyze loops
or nested calls more precisely. As a consequence, memory access information is
produced for each context of an instruction and written to a result file.
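How an address interval follows from register intervals can be sketched as follows. This is a minimal illustration of interval arithmetic for a base-plus-displacement access, not the implementation used in aiT; the class, the register name and the example values are hypothetical, and wrap-around of machine arithmetic is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: int
    hi: int

    def __add__(self, other):
        # Abstract addition: add the bounds component-wise.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def join(self, other):
        # Least upper bound, used where control-flow paths merge.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))


def access_address(registers, base, displacement):
    """Safe interval for the address of an access to displacement(base)."""
    return registers[base] + Interval(displacement, displacement)


# Hypothetical example: r3 is known to lie in [0x1000, 0x1010].
registers = {"r3": Interval(0x1000, 0x1010)}
address = access_address(registers, "r3", 8)
```

Here `address` is the interval [0x1008, 0x1018]. Analyzing contexts separately keeps such intervals narrow, because fewer values have to be joined per context.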
As the value analysis has to compute information about the outcome of in-
structions influencing conditional branches8, it is often possible to determine that
a conditional branch will never branch to one of its two successors (fall-through
successor or taken-branch successor). In this case we know that the code starting
at that untaken successor is infeasible, i. e. will never be executed. This happens
very often if the analysis distinguishes enough context to separately analyze loop
iterations. As this information is also very important for the pipeline analysis, it is
6The instruction accesses are known and fixed by the structure of the CFG.
7Or accesses, as, e. g. the ColdFire can perform a read and write with one instruction.
8In other words, about the status register of the processor.
written to the result file in addition to the intervals for data memory accesses.
In the pipeline analysis this infeasibility information is used to decide whether
a predicted branch was resolved correctly, or whether a pipeline state can safely
be deleted during analysis because it represents execution of infeasible code.
In [FHL+01], results on the precision of the value analysis are presented.
They are reproduced in Table 6.1.
        reads [%]                 writes [%]                infeasible   time
Task    exact    good    ?        exact    good    ?        [%]          [s]
 1      93.32    3.69    2.99     94.51    4.11    1.38      9.1         61
 2      93.42    4.01    2.57     93.39    4.62    1.99      8.5         24
 3      92.58    4.20    3.22     93.25    4.59    2.16      9.3         31
 4      90.24    5.61    4.15     91.71    6.35    1.94     13.5         21
 5      93.79    3.33    2.88     94.61    3.83    1.56      7.0         59
 6      93.23    4.21    2.56     93.02    4.82    2.16     10.2         26
 7      92.06    4.63    3.31     93.13    5.01    1.86     11.0         35
 8      90.87    5.03    4.10     91.97    5.88    2.15     11.4         18
 9      93.97    3.22    2.81     94.74    3.72    1.54      6.8         56
10      92.40    4.72    2.88     92.79    5.32    1.89     11.0         26
11      92.16    4.47    3.37     92.98    4.90    2.12      9.4         31
12      90.51    5.28    4.21     91.57    6.16    2.27     11.5         24
Table 6.1: Value analysis results
These results are for the ColdFire 5307 and were obtained on a 1GHz Athlon.
The classifications cover all contexts and have the following meanings:
• exact means that the address is exactly known, i. e. an interval with exactly
one member
• good means that the interval spans at most 16 cachelines or 256 bytes
• ? means that the interval is larger than 256 bytes
• infeasible gives the percentage of instructions that were determined to
never be executed as stated above
• time gives the time for the analysis
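The three precision classes can be expressed as a small function. This is a sketch; the function name is hypothetical, and measuring the span of an interval as hi - lo + 1 bytes is an assumption about how the 256 byte limit is counted.

```python
def classify_access(lo, hi):
    """Classify an address interval [lo, hi] as "exact", "good" or "?"."""
    if lo == hi:
        return "exact"  # exactly one possible address
    span_bytes = hi - lo + 1  # assumed definition of the span
    if span_bytes <= 256:  # 16 cachelines of 16 bytes each
        return "good"
    return "?"
```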
The results for the PowerPC are of comparable quality. Naturally, the
precision heavily depends on the code structure of the analyzed executable. The
avionics code is very friendly for the analysis; the results will be less precise for
hand-written or highly optimized assembly.
6.3 Separating Analyses
At this point it should be stressed that the different phases of the toolchain im-
plement different kinds of analyses. While these analyses could in principle be
integrated into one big analysis, it is always desirable to separate them for
performance and engineering reasons. While one may lose some precision by this
separation, the gain in efficiency of the whole tool and the reduced design
complexity are worth the effort. How much precision is lost by the separation depends
on the interactions of the components that are the main focus of the individual
analyses. While nearly no precision is lost by separating value analysis from the
pipeline analysis, this may not be the case for other analyses. In former
publications (e. g. [FKL+99]) we considered a separate cache and pipeline analysis to
reduce the complexity of the pipeline analysis. However, at least for the processors
we implemented aiT for, the cache is not independent from the pipeline. Both the
ColdFire 5307 and the PowerPC 755 implement branch prediction, which redi-
rects fetching by some heuristics to reduce stall time due to control hazards. For
both processors, the fetch machinery and execution logic are independent. The
latter resolves the branch condition upon which is being speculated while the for-
mer fills the prefetch queue with instructions, altering cache contents and/or ages
of blocks in the cache. The amount of prefetching done depends on the speed of
the execution pipeline, thus its state and so the results of the pipeline analysis. The
pipeline analysis, on the other hand, depends on the results of the cache analysis,
i. e. cache access classifications and times. To break this interdependency, one
has to make conservative approximations in the cache analysis to account for the
prefetch that may (but need not) happen due to branch prediction.
Figure 6.4, taken from [HLTW03], gives some results for the loss of precision
incurred due to a separate cache analysis. Here, the conservative approximations
are depicted as APPROX, while the results taken from the integrated cache and
pipeline analysis are shown as PURE. The measure shown for twelve example
programs is the number of cachelines known to be in the cache divided by the
number of program points and contexts, cf. Section 8.1. The ratio between the
measures lies between 1.3 and 1.484. The benchmarks each had around 40kB
of code. The loss of precision is significant, especially if one considers that the
effects of data accesses on the unified cache have not been accounted for. Thus, we
decided in this case to do a more complex integrated cache and pipeline analysis.
[Plot: y-axis “lines/ins” (0 to 350), x-axis “task” (1 to 12), series PURE and APPROX]
Figure 6.4: Precision loss due to separate cache analysis
The same arguments and conclusions apply to the cache of the PowerPC and the
respective analyses.
Separating analyses is made easy in the framework developed by AbsInt, as
data can be merged into the intermediate CRL description or written out to sepa-
rate result files and easily be read back into subsequent phases and associated with
instructions and contexts.
6.4 Visualization
aiT includes an interactive visualization of the program and WCET analysis
results that is based on the powerful features of the aiSee graph visualization tool
[aiS].
One functionality of aiT is the visualization of the reconstructed CFG, as de-
picted in Figure 6.5. The CFG is first shown as the global call graph, connecting
functions together (lower half). Each function box can be unfolded showing the
basic blocks of the program and the instructions together with the control-flow
edges connecting them. As can be seen, loops are extracted into procedures to
Figure 6.5: CFG reconstructed by aiT
facilitate a more detailed analysis.
After computation of the WCET, the results are also shown graphically as in
Figure 6.6.
Apart from the overall WCET in the red box at the top, the CFG is shown again
with the depiction of one worst-case path, i. e. one path along which the WCET
will be reached. This path is shown in red color. Again, the routine boxes can be
unfolded showing the basic blocks and the information obtained for them. At the
edges connecting basic blocks, the contribution of execution along that edge for
the block at which the edge originates and the number of traversals of that edge
along the worst-case path are shown. The contribution of each function to the
WCET can also be displayed (not shown in the figure). This information gives
a detailed view about the WCET and its distribution among the functions and
basic blocks of the program. This information can be very useful if the computed
WCET differs from the expected one or is greater than the deadline of a task.
The third mode of the tool allows one to follow the cycle-wise simulation of the
abstract pipeline model. It contains all information that is computed by the analysis.
This view is selected for a given basic block and context (e. g. the third iteration
Figure 6.6: Result visualization
of a loop, etc). An overview of the information displayed is shown in Figure 6.7.
The whole figure corresponds to one instruction in the selected basic block.
The darker (red) boxes at the top are the starting states for the simulation, the
darker (blue) ones near the bottom are those states in which the instruction has
left the pipeline and thus finished its execution. These states are then the
starting states for the next instruction in the basic block (or the first instruction
in another basic block). Each row in the figure stands for one cycle of simulation.
The nondeterminism of the abstract model is already evident in the figure: in the
second row, the starting states split into up to four successor states. In this case,
this is because it is not known if the instruction fetch of an instruction is a cache
hit or miss. And as the fetch happens to cross a cache line boundary9 there are
four possible successor states: two cache lines involved and for each line cache
hit and cache miss; the analysis follows all possibilities. This state expansion is
reduced by the fact that often pipeline states coming from different columns in
the figure collapse into the same end state because they are identical. This effect
can very often be seen at the end of accesses to external memory as these take so
9The PowerPC 755 fetches up to four instructions in one block.
Figure 6.7: State exploration
long that all instructions in the pipeline have finished execution and the pipeline
is empty.
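The splitting and collapsing of states described above can be mimicked with sets of hashable states. The following is a toy sketch, not the abstract pipeline model of Section 4.4: the state encoding and all names are hypothetical, and the `drain` step simply stands for an external memory access that lasts long enough to empty the pipeline.

```python
from itertools import product


def split_on_fetch(state, unknown_lines):
    """One successor per hit/miss combination of the cache lines of unknown status."""
    return {
        state + tuple(zip(unknown_lines, outcome))
        for outcome in product(("hit", "miss"), repeat=len(unknown_lines))
    }


def step_all(states, step):
    """Advance every state by one step; identical results collapse in the set."""
    return {step(state) for state in states}


def drain(state):
    # Toy stand-in: after a long external access the pipeline is empty,
    # whatever the state looked like before.
    return ("pipeline empty",)


start = ("fetch pending",)
after_fetch = split_on_fetch(start, (0xA0, 0xC0))  # fetch crosses a line boundary
after_drain = step_all(after_fetch, drain)
```

In this toy setting, a fetch touching two lines of unknown status yields four successor states, and all four collapse into one state after the drain step.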
A more detailed view for one of the pipeline states from Figure 6.7 is shown
in Figure 6.8.
The information in the boxes is divided into three parts:
• The cache analysis information is displayed in the box on the left.
• The box on the right contains information about the instruction currently
being executed10, e. g. the opcode and some instantaneous signals, if asserted.
• In the center, the pipeline state is shown, including in this figure the values
of all components of the abstract pipeline model, as defined in Section 4.4.
10I.e. the node in the CFG where we are iterating the analysis simulation.
DU_RRFREE: (G: 6, F: 6, CTR: 1, CR: 1, LR: 1)
CQ: [0]: (0xb8, CTX:0, SPEC:0, COMP:0, RDY:1, UNIT:SRU)
    [1]: (0xbc, CTX:0, SPEC:0, COMP:0, RDY:1, UNIT:LSU)
...
SRU: (R: NONE, W: CQ:0, C: 1)
LSU: (R0: NONE, R1: NONE, EA: CQ:1, ACC: NONE)
DS_FREE(2)
must
 5: {{0xa0}{}}
 6: {{0xc0}{}}
D:
<empty>
instruction 0x000000b8
 op_id: 0x7c0002a6
 cycles = 50
flag = in progress
IS_IQ([ (A:0xb8, C:0, S:0), (A:0xbc, C:0, S:0)])
ISRESERVE([ (I:CQ:0, s:0, u:SRU), (I:CQ:1, s:0, u:LSU)])
I:
JIT_LEN: 2.0
IQ: [0]=(0xb8,  CTX: 0, PRED: 0)
    [1]=(0xbc,  CTX: 0, PRED: 0)
    ...
SPEC: (31, 31)
BRANCHES: [ ...]
ACT_CTX: 0
NOP_CNT: 0
NOP_SPEC_CNT: 0
I:
JIT_LEN: 1.0
 5: {{0xa0}{}}
 6: {{0xc0}{}}
D:
<empty>
instruction 0x000000b4
op_id: 0x94000000
cycles = 49
must
SUCC: 0xb8
IQ: ...
FBPU_PREDLEVEL: 0
FBPU_STATE: WAIT(0xb8, 0x3)
FBPU_INSINDEX: NONE
FBPU_CRDEP: (NONE,NONE)
DU_RRFREE: (G: 5, F: 6, CTR: 1, CR: 1, LR: 1)
CQ: ...
IU1: (R: NONE, W: NONE, C: 0)
IU2: (R: NONE, W: NONE, C: 0)
SRU: (R: NONE, W: NONE, C: 0)
LSU: (R0: NONE, R1: NONE, EA: NONE, ACC: NONE)
STORE[0]: (NONE, NONE, L: 1, I: NONE)
STORE[1]: (NONE, NONE, L: 1, I: NONE)
STORE[2]: (NONE, NONE, L: 1, I: NONE)
LSU_SPENDING: 0
LSU_MEMIDX: 0
LSU_NUMACC: 0
LSU_STATE: IDLE
FPU: (R: NONE, W0: NONE, W1: NONE, W2: NONE, C0: 0, C1: 0, C2: 0, B: 0)
BU_IB: 0xc0(1) MM  queued for bus
BU_SB: [0x4ffffe0,0x4ffffe0](4)  queued for bus
BU_ACC: [  ( SRC: IB CL: 0xa0 SCHED: {TA, TA} ),
           ( SRC: IB CL: 0xc0 SCHED: {TA, E(F), TA, TA, TA} ),
           ( SRC: SB CL: NONE SCHED: {TS, AACK, TA} ) ]
DS_WORKING
DS_FETCHED(2)
DS_RETIRED(GPR:1, FPR:0, CTR:0, CR:0, LR:0)
DS_RETIRED(GPR:1, FPR:0, CTR:0, CR:0, LR:0)
DS_RETIRED(GPR:1, FPR:0, CTR:0, CR:0, LR:0)
NEXTBRANCH: 0
BRANCH: FFFFFFFFFFFFFFFF
Figure 6.8: State exploration detail
For subsequent cycles, only the information that changed is displayed in order
not to clutter the view with irrelevant details. Naturally, the whole pipeline state
can be retrieved for these states too by unfolding the box in the middle.
All visualizations can be explored by moving around in the view, zooming and
folding or unfolding boxes. The information can also be exported to other formats
for further processing.
In addition to the visualizations, aiT also produces a textual form of the results
that is easier to integrate into batch jobs for the automated analysis of a sequence
of programs.
6.5 Practical Results
The aiT versions for the ColdFire 5307 and the PowerPC 755 have been designed
and refined for the needs of Airbus France. As has been stated earlier, the details
of the hardware that the programs to analyze will run on have a tremendous in-
fluence on the results of the WCET tool. Thus, comparing results against actual
run-time measurements on real hardware is not meaningful if the hardware that
the measurements are performed on differs from the hardware that the analyses
model. The target application for the tool is flight critical avionics software (pri-
mary flight control), thus the hardware that the analyses model is not available
outside of Airbus France. In the course of the DAEDALUS project, a detailed as-
sessment of the WCET results delivered by aiT was performed by Airbus France,
cf. Section 9.2. The assessment was done using large benchmarks typical for the
intended target domain. Due to the sensitive nature of the software under analysis
most of the results are not available for publication. Nonetheless, a small
subset of the evaluation was published in [TSH+03]. Table 6.2 shows a comparison for
the ColdFire version of the WCET tool.
Task    Airbus’ method    aiT’s results    precision improvement
 1      6.11 ms           5.50 ms          10.0 %
 2      6.29 ms           5.53 ms          12.0 %
 3      6.07 ms           5.48 ms           9.7 %
 4      5.98 ms           5.61 ms           6.2 %
 5      6.05 ms           5.54 ms           8.4 %
 6      6.29 ms           5.49 ms          12.7 %
 7      6.10 ms           5.35 ms          12.3 %
 8      5.99 ms           5.49 ms           8.3 %
 9      6.09 ms           5.45 ms          10.5 %
10      6.12 ms           5.39 ms          11.9 %
11      6.00 ms           5.19 ms          13.5 %
12      5.97 ms           5.40 ms           9.5 %
Table 6.2: Comparison between Airbus’ legacy method and aiT
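The improvement column appears to be the reduction of the bound relative to the bound of the legacy method. A short sketch of this arithmetic, where the function name is hypothetical and the formula is inferred from the table rather than stated in the text:

```python
def improvement_percent(legacy_ms, ait_ms):
    """Reduction of the WCET bound relative to the legacy bound, in percent."""
    return 100.0 * (legacy_ms - ait_ms) / legacy_ms


task_1 = improvement_percent(6.11, 5.50)
task_11 = improvement_percent(6.00, 5.19)
```

For tasks 1 and 11 this gives values that round to the 10.0 % and 13.5 % of the table. Recomputing from the rounded millisecond values can differ from the table in the last digit.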
Twelve benchmarks have been analyzed by Airbus’ personnel, and the results have
been compared to those of a method that has been in practical use at Airbus for a
long time to compute WCETs.
The results obtained by aiT are more precise than the results of Airbus’ legacy
method by up to 13.5% and never worse. The analyses were performed on a 1GHz
Athlon system with 1GB of main memory. Running times of the analyses were
below 30 minutes. The analysis results were also compared against real execution
times, finding that the predicted WCET is always an upper bound for these times.
We did not perform extensive measurements on “similar” hardware or inte-
grate a modeling of the hardware we have available because of the following rea-
sons:
• When measuring on hardware that differs from the hardware described by
the model, one has to add a margin to the results that accounts for the effects
of the differing hardware. In a few experiments, we found that this margin is
so large compared to the WCET results that an error of the analysis can
barely be found; furthermore, no meaningful statement about the precision
of the analysis can be made because of this margin.
• Making models for two rather complicated processors and peripherals is a
time-consuming task. Implementing the model and tuning and verifying
the results is a costly process. Therefore, we did not want to invest in a
hardware model11 that resembles the hardware we had available in order to
run larger benchmarks. Besides, no large publicly available benchmarks
exist that are similar in size to the benchmarks used by Airbus France for
their evaluation of aiT.
We are often asked to compare the results of our WCET analysis against those of
other tools. We think that this is impossible for the following reasons.
• There is no such thing as a realistic standard benchmark for our application
area. There are a few benchmarks with collections of small functions, like
matrix multiplication or fast Fourier transformation. Computing times for
these benchmarks is easily possible but has no significance for answering
the question whether the method scales up to real-life programs. Also, from
our experience, real-life benchmarks contain a lot more than these simple
functions.
• The processors used in different approaches are not comparable. E.g.
computing the WCET for a deterministic DSP without caches and without
timing-variant components is easy and can be done quite precisely. Computing
the same information for the PowerPC 755 is much more complex, and the
precision cannot be expected to be as high.
11More precisely, a model of the peripherals, in this case.
• Other work uses smaller benchmarks and already encounters complexity
problems. We chose to let the benchmarking of our tool be performed by
a third party with the benchmarks they classify as representative for the
complexity and size of real-life programs.
• One goal of the ARTIST2 Network of Excellence of the European Union is
the development of a realistic set of benchmarks for real-time applications.
Section 9.2 contains some conclusions about aiT from the reviewers of the
DAEDALUS project who had wider access to the results of the Airbus assess-
ment.
Chapter 7
Verication of the Analysis
Sed quis custodiet ipsos custodes?
Decimus Junius Juvenalis, Satura VI
The WCET analysis is used to obtain the bounds of each task’s execution time
needed for the schedulability analysis. The scheduling analysis is performed to
validate that the system runs within the timing bounds specified in its design.
As such, even a wrong scheduling analysis or a wrong WCET analysis does
not introduce errors into the system; it just fails to detect errors already in the
system. This does not mean that WCET analysis is useless; rather, it limits the
effort needed in practice to prove the absolute absence of bugs in the analysis.
Other tools, e. g. code generators, have to be validated more thoroughly, because
they can indeed introduce bugs into the system by generating wrong code1.
As we have said earlier, our methods for pipeline analysis (and the other
components of the aiT framework, cf. Chapter 6) are based on provably correct
methodologies. Thus, verifying the correctness of aiT’s results seems to be easy.
In practice, however, there are a number of sources for possible bugs in the tool:
• Implementation errors can never be ruled out completely, as the toolchain
components are (partially) written by hand; in particular, the implementation
of the abstract pipeline analysis is done in hand-coded C.
• The pipeline analysis is only correct w.r.t. the concrete model that is the
starting point of the abstractions and analyses. If this model does not de-
scribe reality, then also the analyses will not analyze reality and the results
1DO-178B [DO192], the standard governing the software development process for avionics
contains only short passages about analysis tools, while much more is said about code generating
tools.
will probably be invalid. The next section deals with this problem in more
detail.
• aiT uses an ILP to find the worst-case path and time of the whole program.
ILP solvers are known to sometimes produce incorrect results due to the fact
that the ILP is relaxed to the domain of floating point numbers and a solution
is searched for in this domain. Numerical instabilities in the computations
may lead to a wrong solution being found. In the AVACS project, methods
will be developed to verify the correctness of solutions returned by ILP
solvers. Another idea is to make use of the special form of ILPs generated
for the WCET problem and use purely integral techniques to solve them,
avoiding the numerical instabilities of currently available ILP solvers.
The next section will describe the sources used for the pipeline models in this
thesis and possible problems w.r.t. analysis result validation coming from them.
In Section 7.2 the steps undertaken to verify the correctness of aspects of the
modeling process are presented. This chapter closes with some ideas about the
automation of the validation process in Section 7.3.
7.1 Modeling Sources
As stated in Chapter 4, we did not have authoritative HDL models available for the
two processors for which aiT had to be instantiated first. Therefore, the models
were made using three sources of information:
• the processors’ handbooks,
• experiments performed on the actual hardware,
• communication with the design teams of the processors and the support
teams of the manufacturer.
7.1.1 Processor Handbooks
The models were designed starting from the descriptions in the processors’ hand-
books [PPC97a] and [Mot00]. Handbooks contain a varying degree of information
about the processor’s internal design. For neither the ColdFire nor the PowerPC
was the presented information sufficient to obtain the processor model with a
satisfying degree of precision. E.g. the description of the cache replacement and
design for the ColdFire in [Mot00] does not state explicitly whether there is one
cache replacement counter for all sets or one counter per cache set. Other
information is stated differently in different places in the handbooks, so that the
statements contradict each other.
However, the basic structure of the processor’s implementation and thus the basic
structure of our models is described in the handbooks.
7.1.2 Experiments on the Hardware
To decide unclear or unexplained situations in the processor’s program execution,
we designed test programs and executed them on the actual hardware. These
experiments were designed such that the program’s result differs for the different
possible scenarios. E.g., to settle the question whether the ColdFire has one
replacement counter per set or not, we designed the following experiment and program:
1. The cache is invalidated, clearing all lines and marking them as invalid.
2. The cache is filled with data by performing 512 accesses to addresses 0x0,
0x10, . . . , 0x1FF0 (in hexadecimal notation), giving the contents of Ta-
ble 7.1.
3. 128 more accesses are performed to the addresses 0x2000, 0x2010, . . . ,
0x27F0. If there is one replacement counter for each set, then the cache
contents are as in Table 7.2 and the counter is 1 for every set. If there is only
one global counter, then the contents are as in Table 7.3. In both tables, the
elements newly placed into the cache are denoted in boldface.
        Way
Set     0         1         2         3
0       0x0000    0x0800    0x1000    0x1800
1       0x0010    0x0810    0x1010    0x1810
...     ...       ...       ...       ...
126     0x07E0    0x0FE0    0x17E0    0x1FE0
127     0x07F0    0x0FF0    0x17F0    0x1FF0
Table 7.1: Cache contents after 512 accesses
After this setup phase, two different access patterns for 128 additional accesses
were measured for execution time:
1. Accessing the elements 0x00, 0x10, 0x20, . . . , i. e. the elements in way 0 of
Table 7.1.
2. Accessing the elements 0x00, 0x810, 0x1020, 0x1830, 0x40, . . . , i. e. the
elements in the diagonals of Table 7.1.
        Way
Set     0         1         2         3
0       0x2000    0x0800    0x1000    0x1800
1       0x2010    0x0810    0x1010    0x1810
...     ...       ...       ...       ...
126     0x27E0    0x0FE0    0x17E0    0x1FE0
127     0x27F0    0x0FF0    0x17F0    0x1FF0
Table 7.2: Cache contents, 640 accesses, per-set counter
        Way
Set     0         1         2         3
0       0x2000    0x0800    0x1000    0x1800
1       0x0010    0x2010    0x1010    0x1810
...     ...       ...       ...       ...
126     0x07E0    0x0FE0    0x27E0    0x1FE0
127     0x07F0    0x0FF0    0x17F0    0x27F0
Table 7.3: Cache contents, 640 accesses, global counter
If there is only one global replacement counter, then the first access pattern
will produce 32 cache misses for the elements 0x00, 0x40, . . . . It produces 96
cache hits for the elements 0x10, 0x20, 0x30, 0x50, 0x60, . . . . The second access
pattern will produce 128 cache misses in this case, as all the diagonal elements
have been replaced by the elements 0x2000, 0x2010,. . . . If there is one counter
per set, then the first access pattern will produce 128 cache misses, as all elements
in way 0 had been replaced. For the second pattern, one would observe 32 cache
misses for the accesses to 0x00, 0x40, . . . and 96 cache hits for the accesses to
0x810, 0x1020, . . . . As a cache miss takes around 8 bus cycles on our MCF
5307 board, or 32 internal cycles, while a cache hit should be served in 2 internal
cycles, the time for the complete access loop differs enough to be measurable with
the built-in timers for both scenarios (32 misses + 96 hits vs. 128 misses).
To obtain a time bound for the memory access times, the experiment was
performed with caching disabled, too.
Performing the experiment for both patterns shows times consistent with the
assumption that only one global counter exists.
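The expected miss counts of this experiment can be reproduced with a small
simulation. The following is only a sketch under stated assumptions, not a model
of the real MCF 5307: the class and function names are hypothetical, invalid
ways are assumed to be filled in way order without touching the counter, and the
counter is assumed to advance by one on every replacement. The cycle figures
(32 internal cycles per miss, 2 per hit) are the approximate values quoted above.

```python
LINE_SIZE = 0x10   # bytes per cache line
NUM_SETS = 128
NUM_WAYS = 4


class CounterCache:
    """4-way cache whose victim way is chosen by a replacement counter."""

    def __init__(self, global_counter):
        self.ways = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
        self.global_counter = global_counter
        # One shared counter, or one counter per set.
        self.counters = [0] if global_counter else [0] * NUM_SETS

    def access(self, addr):
        """Access addr; return True on a cache hit, False on a miss."""
        block = addr - addr % LINE_SIZE
        index = (addr // LINE_SIZE) % NUM_SETS
        ways = self.ways[index]
        if block in ways:
            return True
        if None in ways:
            # Assumption: invalid ways are filled first, counter untouched.
            ways[ways.index(None)] = block
        else:
            c = 0 if self.global_counter else index
            ways[self.counters[c]] = block
            self.counters[c] = (self.counters[c] + 1) % NUM_WAYS
        return False


def count_misses(global_counter, pattern):
    """Run the 512 + 128 setup accesses, then count misses of the pattern."""
    cache = CounterCache(global_counter)
    for addr in range(0x0000, 0x2800, LINE_SIZE):
        cache.access(addr)
    return sum(not cache.access(addr) for addr in pattern)


def loop_cycles(misses, accesses=128, miss_cost=32, hit_cost=2):
    """Rough internal-cycle estimate for one measured access loop."""
    return misses * miss_cost + (accesses - misses) * hit_cost


# Pattern 1: way 0 of Table 7.1. Pattern 2: the diagonals of Table 7.1.
way0 = [i * LINE_SIZE for i in range(NUM_SETS)]
diagonal = [(i % NUM_WAYS) * 0x800 + i * LINE_SIZE for i in range(NUM_SETS)]

for name, is_global in (("global counter", True), ("per-set counter", False)):
    for label, pattern in (("way 0", way0), ("diagonal", diagonal)):
        m = count_misses(is_global, pattern)
        print(f"{name:16} {label:9} misses={m:3} cycles~{loop_cycles(m)}")
```

Under these assumptions the two hypotheses produce opposite miss counts for the
two patterns, which is what makes them distinguishable by timing alone.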
Performing these experiments is difficult, as other influences, e. g. instruction
fetching must be avoided as far as possible. Our solution was to place the code
for the experiment into the internal SRAM of the MCF 5307, which is operated
independently of the external bus machinery, which was exclusively available for
the data accesses.
For the PowerPC 755 some details were validated or investigated by experiments
performed along the same lines. E.g. the handbook explicitly selects one
instruction to be used as a “no operation” instruction, namely the ori r0,r0,0
instruction. This instruction clearly has no effect on the architectural state.
For timing, it is important whether this instruction is dispatched to an integer
unit and does nothing there, or whether the dispatcher already throws away the
instruction, never dispatching it to a functional unit. Also, if the instruction
is thrown away, it is important to know whether it is thrown away before it
reaches the two dispatchable positions 0 and 1 in the instruction queue or
whether it is simply discarded during the normal dispatch from those entries.
An experiment has been designed that executes a special instruction sequence
that changes the dispatching of integer instructions to the IU1 and IU2 units,
causing some divide instructions to stall if nops are discarded and not to stall
if they are dispatched to a unit. Execution times (of this repeated sequence)
were measured with the internal performance counters of the MPC 755, counting processor
cycles. The experiments validated the behavior that nops are discarded from the
normal dispatch entries and are simply thrown away.
Designing and performing these experiments increases the costs of modeling
further. Therefore, we decided to only perform experiments if a behavior was in
doubt judging from the documentation or if there had been strong evidence that a
behavior was different from the documented one.
7.1.3 Other Sources
Apart from the information in the handbooks and the experiments performed, we
forwarded detailed questions about the processor’s behavior to the support hotline
of Motorola. These questions were in most cases forwarded to the design team
of the PowerPC 755 and answered by them, guaranteeing a sufficient level of
authoritative information. Naturally, this process was very time consuming and
increased the modeling effort considerably.
7.2 Verifying the Modeling
Not all aspects of the model had been verified by experiments or by information
from the design teams of the processors. Thus, there was still a lot of room for
errors in the model. To further validate the model, we used three methods:
• Feature Verification: to verify that the model reproduces certain features,
  e. g. that the number of branch mispredictions is correctly covered by the
  model, one can design experiments that use the performance counters of the
  chips to count the relevant events.
• Local Verification: small example programs that can be simulated
  deterministically even in the abstract model can be measured to verify the
  cycle-precise equivalence of model and real hardware.
• Global Verification: longer programs, which can no longer be
  deterministically simulated in the abstract model, can be analyzed and
  measured on the real hardware. The execution time predicted by the abstract
  model should be an upper bound to the observed runtimes.
This process is naturally very time consuming as for the first two methods, ex-
periments have to be designed and performed. Besides the design of the programs
to use in the experiments, also undesired influences have to be ruled out. E.g. it
must be made sure that the whole program code is already cached to minimize
the effects of instruction caching on the timing: an external fetch takes so long
that smaller internal effects are hidden by it. Thus, the next section contains some
ideas about a method by which the necessary effort for validation can be reduced.
7.3 Automation
Validation of an analysis against real executions of test programs can only use
the data that is observable with the real hardware. Internal processor state
cannot be observed without intrusion in most systems. Even the use of internal
debug facilities like JTAG may influence the execution of the processor. We will
restrict ourselves to the observation of the events on the external processor
bus. Here, a logic analyzer can be connected that samples the signals on the bus
and is able to collect a trace of events on the bus during the execution of the
program. This trace may contain events like “transfer start”, “address
acknowledge”, “data word transferred”, etc. with the corresponding attributes
(addresses, direction, . . . ). It can then be checked if this trace is covered
by the predictions made by the pipeline analysis.
The pipeline analysis is nondeterministic in the sense that if a cycle update
in the abstract model depends on abstract components, the analysis will follow
all possibilities and compute all possible successor states. Thus, all possible
executions of the program are covered by the analysis, not only the worst one.
Also, the abstract model and thus the pipeline analysis can be augmented to record
the same events in a state that can be observed by a logic analyzer in a real system.
If the analysis finds out that a transfer is put on the bus, a “transfer start” event
can be recorded with the corresponding pipeline state, giving a set of events for a
set of pipeline states.
Given a trace of real events and a set of events predicted by the pipeline analy-
sis for each instruction of the CFG, the validation effort is now reduced to showing
that there is a path through the CFG, along which we can see all events in the trace
predicted. If a trace is not covered this way, we have found an error. Repeating
this process for a number of traces increases the confidence one can have in the
correctness of the analysis. In practice, it is not possible to record the whole
execution of a program with a logic analyzer for technical reasons. Thus, a
prefix of the program execution must be considered.
More formally, we define
Definition 7.3.1 (Events, Trace): We assume that a set $Ev$ of events
$\varphi \in Ev$ is given. The empty event $\varepsilon \notin Ev$ is used when
we want to denote that no event occurs; we define
$Ev_\varepsilon := Ev \cup \{\varepsilon\}$. A trace $t \in Ev^+$ is a non-empty
sequence of events. The pipeline analysis delivers a mapping
$\xi : E \to \mathcal{P}(Ev)$ of edges in the CFG to sets of events happening
when execution goes from the source to the target node of that edge. This
mapping is called predicted events.

Given a trace $t = \varphi_1 . \varphi_2 . \ldots . \varphi_l$ of length $l$,
which represents the events observed during (a prefix of the) execution of
program G, we want to find a path through the CFG, starting at $s_G$ up to some
node n, that has an assignment of all events in t to the predicted events along
the path. For this, we define the notion of a match:

Definition 7.3.2 (Match): Given a trace $t$ and predicted events $\xi$, a match
of $t$ with $\xi$ is a sequence $m \in (E \times Ev_\varepsilon)^+$ of pairs of
an edge and a (possibly empty) event. A match
$m = (e_1, \varphi_1) . (e_2, \varphi_2) . \ldots . (e_k, \varphi_k)$ is
feasible if it satisfies $R(\varphi_1 . \ldots . \varphi_k) = t$ and
$e_1 . \ldots . e_k$ is a path through the CFG of the program G that starts at
$s_G$. Here, $R : Ev_\varepsilon^* \to Ev^*$ is defined by

$$
R(\varphi_1 . \ldots . \varphi_n) =
\begin{cases}
\varepsilon, & \text{if } n < 1 \\
R(\varphi_2 . \ldots . \varphi_n), & \text{if } n \geq 1 \wedge \varepsilon = \varphi_1 \\
\varphi_1 . R(\varphi_2 . \ldots . \varphi_n), & \text{otherwise}
\end{cases}
$$
A naive algorithm to find a match simply starts at $s_G$ and tries to match the
first event from the trace t against the set of events in the predicted events
for the outgoing edges of the node. If the event is found, then the trace t is
shortened by its first element and matching continues at the target node of the
edge e for which the event was found in $\xi(e)$. The edge e and the event are
recorded. If the event was not found, an empty event ε is assumed and the search
has to continue in all successor nodes, recording all outgoing edges and ε as
possible matches. If the match cannot continue, because none of the events
reachable from that node, even over empty events, is compatible with the event
at the head of the trace, backtracking has to be performed.
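The naive matching procedure can be sketched as a depth-first search with
backtracking. This is an illustrative sketch, not the implementation mentioned
below: the function names, the dictionary-based CFG encoding and the example
graph are assumptions, the empty event is represented by `None`, and an empty
event is tried on an edge whenever no matching event leads to a complete match.
A per-position `seen` set keeps the search from cycling through loops that only
contribute empty events.

```python
EPS = None  # representation of the empty event (an assumption of this sketch)


def reduce_events(events):
    """R from Definition 7.3.2: drop all empty events from a sequence."""
    return [e for e in events if e is not EPS]


def find_match(succ, xi, start, trace):
    """Search for a feasible match of trace against the predicted events.

    succ:  node -> list of (edge, target node) pairs (the CFG)
    xi:    edge -> set of events predicted for that edge
    Returns a list of (edge, event-or-EPS) pairs, or None if no match exists.
    """
    def search(node, i, seen):
        if i == len(trace):
            return []                      # whole trace (prefix) is covered
        edges = succ.get(node, [])
        # First try edges on which the next trace event is predicted.
        for edge, target in edges:
            if trace[i] in xi.get(edge, set()):
                rest = search(target, i + 1, frozenset({target}))
                if rest is not None:
                    return [(edge, trace[i])] + rest
        # Otherwise assume the empty event and continue in the successors.
        for edge, target in edges:
            if target not in seen:
                rest = search(target, i, seen | {target})
                if rest is not None:
                    return [(edge, EPS)] + rest
        return None                        # dead end: caller backtracks

    return search(start, 0, frozenset({start}))


# Hypothetical CFG: node b is a dead end, so the search has to backtrack.
succ = {"s": [("e1", "a")], "a": [("e2", "b"), ("e3", "c")],
        "c": [("e5", "d")], "d": [("e6", "f")]}
xi = {"e1": {"start"}, "e2": {"ack"}, "e3": {"ack"},
      "e5": set(), "e6": {"data"}}

match = find_match(succ, xi, "s", ["start", "ack", "data"])
print(match)
```

Applying `reduce_events` to the event components of a returned match yields the
trace again, which is the feasibility condition of Definition 7.3.2; a result of
`None` means the trace is not covered by the predictions.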
Currently, a master's thesis is under way to implement this automatic validation
of hardware traces, taking the textual output of a logic analyzer, searching for the
desired events in it and building the event trace. Then, this event trace is matched
against the results of the pipeline analysis, which writes out all predicted events
during program analysis.
In extension to the above definitions, also the distance in time of the events
can be considered in trace matching. This may be complicated by the fact that a
logic analyzer is not always capable of recording timing exactly enough.
With all the approaches sketched in this chapter, we hope to improve the con-
fidence in the correctness of the analysis results delivered by aiT. It may be noted
that the question of verifying WCET analysis results against reality under strin-
gent conditions (like DO-178B) has not been considered in previous work to the
best of our knowledge. Nonetheless, it forms an important part in the process
of introducing static analysis methods into the validation procedures for avionics
software.
Chapter 8
Predictability of Modern Hardware
To recognize the evil is already to remedy it in part
Otto von Bismarck
The pipeline analysis introduced in the previous chapters is based on quite com-
plex hardware models. The complexity is needed to precisely model all the inter-
actions of the different components in the system correctly. This chapter gives an
overview over the sources of the interactions and the consequences they have for
the complexity and precision of the analysis.
Some hardware features do not lend themselves easily to static analysis of their
behavior. On the other hand, the analysis itself may lose precision and cause
higher costs due to the abstractions it makes to obtain the abstract pipeline model.
The next section elaborates these issues and in Section 8.2 some thoughts about
the predictable design of real-time systems are summarized.
8.1 Sources of Complexity and Imprecision
There are subtle interdependencies between the complexity of an analysis, its cost
and the precision that can be obtained by it. A less complex analysis that does not
model, e. g., the cache contents precisely must make conservative assumptions in
the case of memory accesses. If local worst-cases can be utilized, many cache
misses will be assumed by the analysis, increasing the predicted WCET and lead-
ing to a loss of precision. Unfortunately, local worst-case decisions can rarely be
made, as demonstrated for the two processors handled in this thesis. Here, all pos-
sibilities must be considered and the less complex analysis has to consider both
cache hit and cache miss scenarios, increasing its costs while probably leading to
a less precise bound for the WCET¹. Furthermore, a less complex analysis will
probably not be able to ensure that the local worst-case is also the global
worst-case in some scenarios. The aiT tool for the PPC 755 has been augmented
with a run-time option to only consider local worst-cases in its computations.
Table 8.1 gives some comparisons for the running time of the WCET analysis with
and without this option and the differences of the predicted WCET bounds. It can
be seen that the local worst-case analysis indeed yields a smaller WCET bound
than the analysis that considers all scenarios.
                    Analysis Time       WCET Bound
Program    Size     Local    Global     Local         Global
edn        2372     9 s      64 s       149,631       152,818
prime      292      1 s      1 s        5,451         5,996
dry        2624     3 s      110 s      20,202,828    22,248,372
Table 8.1: Running times and WCET with and without local worst-case
The three programs in this experiment contain a collection of real-time algo-
rithms (filters, CRC computation) in the edn executable, the computation of the
first 20 prime numbers (prime) and the Dhrystone benchmark (dry).
The high analysis times are due to the fact that no extra information was
provided to the analysis; even the loop bounds were found by the analysis itself.
Also, no tuning in terms of context definitions was made, cf. Chapter 6. The
influence of these context settings on the analysis times and results with local
and global worst-case analysis is depicted in Figure 8.1 for the Dhrystone
benchmark. The times are given against the number of different contexts to
consider for one program point.
[Figure 8.1: WCET and analysis time for local and global worst-case settings.
Plot of the WCET bound (WCET Global, WCET Local) and the analysis time (Time
Global, Time Local) against the context bound.]

¹ As local worst-cases lead to global worst-cases most of the time but not all the time.

The difference in the computed WCET of the local worst-case only analysis w.r.t.
the global worst-case analysis, which ranges from 22.46% for context bound 1 to
8.03% for context bounds greater than 7, is not solely due to timing anomalies.
Rather, there is a large influence of the timing abstraction we use. Namely,
WCETs are computed for each basic block by taking the length of the maximal
execution sequence of the instructions in the block as WCET for that block.
Consider that two subsequent blocks are executed and that there are two paths
going through both blocks. On the first path, the first block takes 42 cycles to
execute, the second 24. On the second path, the first block takes 32 cycles and
the second one 42 cycles. The WCETs for the basic blocks will be 42 cycles for
both blocks, as this is the maximum of the execution times for the blocks.
Assume now that the first path originates from a local best-case that is
nonetheless considered by the global analysis. This path is not considered by
the local worst-case only analysis, and thus that analysis computes the WCETs 32
and 42 cycles for the first and second block, respectively. Thus, the total WCET
computed for the sequence of the two blocks is lower (32+42 cycles) than the one
computed by the global analysis (42+42 cycles). For the same reason, the bound
decreases if more contexts are separated by the analysis, because then the
number of paths through the same basic block/context pair is reduced.
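The arithmetic of the two-block example can be written down directly. The
snippet below is only a worked illustration of the numbers given in the text;
the dictionary layout and the function name are assumptions made for the sketch.

```python
# Cycles per basic block (block 1, block 2) on the two paths of the example.
paths = {"path 1": [42, 24], "path 2": [32, 42]}


def block_wise_bound(considered):
    """Sum of per-block maxima over the considered paths (timing abstraction)."""
    per_block = zip(*(paths[p] for p in considered))
    return sum(max(times) for times in per_block)


global_bound = block_wise_bound(["path 1", "path 2"])  # 42 + 42
local_bound = block_wise_bound(["path 2"])             # 32 + 42
longest_path = max(sum(times) for times in paths.values())

print(global_bound, local_bound, longest_path)
```

The block-wise maximum over both paths (84 cycles) exceeds the longest single
path in this example (74 cycles), which is the effect of the timing abstraction
described above.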
This experiment also shows that computing global worst-case results is much
more costly than the local worst-case analysis. The ratio of the analysis times
approaches the number 12.24 for the higher number of contexts for this small
example.
Another point that is always present when using abstract interpretation as the
basis for an analysis is the influence of the abstractions and approximations. An
abstracted element is not capable of holding the same amount of information as
the concrete one, e. g. for the abstract cache components we only have
information about a subset of the memory blocks that are guaranteed to be in the
cache when program execution reaches a given point. Also, joins in the CFG lead
to the merging of the information at the incoming edges. For correctness
reasons the merge must be at least as imprecise as the least upper bound defined
on the lattice. Furthermore, the information on which path the information was
computed is lost, which can lead to quite imprecise abstract information. In
Figure 8.2 we have two joins in the CFG, after the if statements. All paths
through this CFG are disjoint in the sense that any execution that goes along
the then branch of the first if goes along the else branch of the second if
statement. And all executions along the else branch of the first if go along the
then branch of the second if.
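[Figure 8.2: Two joins with information loss. CFG with two consecutive if
statements with the conditions b and !b, each with a then and an else branch;
the points 1 and 2 are marked in the graph.]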
So the only paths through the CFG are the ones depicted in the figure. A naive
analysis is not able to consider only these two paths through the CFG. Rather, at
point 1 in the figure it merges the information of both branches of the first if
and loses the information along which branch of the if this point was reached.
Then it cannot be decided that information should only be propagated along the
reverse branch of the second if. For cache analysis, e. g., this means that if
the alternating branches of the two if statements access the same data, these
accesses cannot be classified as cache hits, because at point 1 it cannot be
guaranteed that the data is in the cache for both branches of the first if.
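The information loss at the join can be illustrated with a strongly simplified
must-cache abstraction. This is a sketch under assumptions, not the cache
analysis of this thesis: the abstract state is just the set of blocks guaranteed
to be cached, ages and evictions are ignored, the join is set intersection, and
the block names x and y are hypothetical.

```python
def join_must(a, b):
    """Join for a must analysis: keep only what holds on both paths."""
    return a & b


def access(must, block):
    """Access a block; return the new must set and whether a hit is guaranteed."""
    return must | {block}, block in must


entry = frozenset()

# First if: the then branch accesses x, the else branch accesses y.
after_then1, _ = access(entry, "x")
after_else1, _ = access(entry, "y")

# Naive analysis: merge at point 1, then analyze both branches of the second if.
point1 = join_must(after_then1, after_else1)
_, naive_hit_x = access(point1, "x")   # else branch of the second if
_, naive_hit_y = access(point1, "y")   # then branch of the second if

# Path-separating analysis: then/else and else/then are kept apart.
_, ctx_hit_x = access(after_then1, "x")
_, ctx_hit_y = access(after_else1, "y")

print(naive_hit_x, naive_hit_y, ctx_hit_x, ctx_hit_y)
```

With the merged state at point 1 neither access can be guaranteed to hit,
whereas keeping the two feasible paths apart classifies both as hits.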
A similar situation exists for loops. Here, a naive analysis is not able to sepa-
rate the first loop iteration from the second or third, etc. This is especially bad for
cache analysis as the first iteration of a loop normally loads most of the necessary
instructions of the loop into the cache and subsequent iterations find the data in
the cache. The naive analysis would classify the accesses to the first instructions
in a cache line in a loop as cache misses. It has already been mentioned in Sec-
tion 3.3.1 how this can be avoided by using contexts for the data-flow analysis.
The contexts make different paths through the CFG distinguishable by the DFA.
The same principle can, if one desires, be applied to the situation of alternating
if branches mentioned above. Introducing more contexts does naturally increase
at first the costs of the analysis as more data flow elements must be propagated.
However, it may reduce the number of iterations necessary for one node in the
CFG, because the information stabilizes earlier. Finding a good compromise is a
matter of tuning the parameters that govern the generation of different contexts
for the analysis.
Another aspect related to the analysis technique is the reduced complexity of
the overall analysis if it is split into separate analyses running after each other. It
has already been discussed in Section 6.3 that for the case of separating the cache
from the pipeline analysis the loss in precision can be severe due to necessary
conservative approximations for the interdependent analyses.
Another point is that the hardware itself may have a behavior that makes a
static analysis difficult. This is the case if we cannot gather precise information
about the behavior of a component during execution of a program block if we
don’t have any information about the component at the beginning of the block.
The non-existence of information about a component is common for static analy-
sis if program blocks are analyzed in isolation (modular analysis) as the contexts
by which the block can be reached are unknown and must be approximated.
Furthermore, joins in the CFG may lead to loss of information about components,
because the static analysis computes information that is valid for all executions
passing a given program point. Therefore, at control-flow joins, only the
information guaranteed to be correct for all incoming edges can survive in the analysis.
Concrete examples of such hardware are the caches of the Motorola ColdFire
5307 and the PowerPC 755. As mentioned in Section 2.5.2, there is only one
global replacement counter c for the ColdFire cache. If an analysis does not
have information about the value of this counter², then it can be shown that no
information about the counter can ever be obtained that is different from
$0 \le c \le 3$ by a first-order analysis. Such an analysis does not take into
account the path by which the current program point was reached but only the
point (instruction) before the current one. A higher-order analysis would
naturally be much more complex. For the ColdFire cache the absence of knowledge
about the replacement counter has the effect that it can only be guaranteed
statically that the last accessed cache block is in the cache. Furthermore, the
absence of blocks cannot be guaranteed (may-analysis) as blocks can never be
shown to be replaced from the cache. In practice, this virtually reduces the
cache size of the MCF 5307 to 1/4 of the real cache for the analysis.
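Why an unknown counter stays unknown can be seen from a set-based abstraction of
the counter. The following is a hypothetical sketch: the counter is abstracted
by the set of its possible values, and it is assumed that a hit leaves the
counter unchanged while a miss with replacement advances it by one modulo the
number of ways.

```python
WAYS = 4


def update_counter(values, may_hit, may_miss):
    """Abstract update of the set of possible replacement counter values."""
    result = set()
    if may_hit:                      # a hit leaves the counter unchanged
        result |= values
    if may_miss:                     # a replacement advances the counter
        result |= {(c + 1) % WAYS for c in values}
    return result


unknown = {0, 1, 2, 3}

# Whatever the analysis knows about the access, "unknown" is a fixed point.
after_hit = update_counter(unknown, may_hit=True, may_miss=False)
after_miss = update_counter(unknown, may_hit=False, may_miss=True)
after_either = update_counter(unknown, may_hit=True, may_miss=True)

# With a known counter, only unclassified accesses lose information.
known_miss = update_counter({0}, may_hit=False, may_miss=True)
unclassified = update_counter({0}, may_hit=True, may_miss=True)

print(after_hit, after_miss, after_either, known_miss, unclassified)
```

Since incrementing modulo 4 permutes the full value set, no sequence of such
updates can shrink it, which matches the statement that no information other
than $0 \le c \le 3$ can be obtained.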
The pseudo LRU cache of the MPC755 shows a similar effect as the replacement
strategy is not monotone but “jumps” due to the decision tree used to determine
the block to be replaced next, cf. Section 2.6.4. In [HLTW03] the loss in
precision of the analysis of the pseudo LRU cache of the MPC755 compared to a
real 8-way associative LRU cache was investigated. Figure 8.3 shows the results
of one experiment. The experiment consisted in analyzing one executable with
different analysis settings for the contexts considered³.
The measure chosen for comparison was the number of lines guaranteed in
the cache (i. e. must analysis was performed) summed over all program points and
contexts and divided by the number of contexts and program points. The example
program was an 84 kB PowerPC code that contains code pieces typical for avionics
(filters, CRC computations, etc.). The ratio of the LRU and PLRU results is up to
1.609. As the whole program is just 2.5 times larger than the PowerPC cache of
32 kB, this result already shows a considerable loss of information due to the bad
analyzability of the PLRU strategy. Also, this shows that an LRU strategy recovers
from unknown information after a sequence of accesses. Here, the cache contents
and ages of the blocks in the cache are known quite precisely. This is because the
LRU replacement is more “monotone” in the aging of blocks in the cache.
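The recovery of an LRU cache from unknown contents can be illustrated with a
must-analysis update for a single cache set, written in the style of the
classical LRU must analysis. This is a sketch under assumptions and not the
analysis evaluated in [HLTW03]: the abstract state maps each block to an upper
bound of its age, blocks that are absent from the mapping are not guaranteed to
be cached, and the function name and block names are hypothetical.

```python
def must_update(ages, block, assoc):
    """Must-analysis update of one LRU set for an access to block.

    ages maps blocks to upper bounds of their age (0 = most recently used).
    """
    old_age = ages.get(block, assoc)   # assoc stands for "not guaranteed"
    updated = {}
    for other, age in ages.items():
        if other == block:
            continue
        # Only blocks younger than the accessed block grow older.
        new_age = age + 1 if age < old_age else age
        if new_age < assoc:            # otherwise the block may be evicted
            updated[other] = new_age
    updated[block] = 0
    return updated


# Start with no knowledge at all about an 8-way set.
state = {}
for blk in ["a", "b", "c", "d", "e", "f", "g", "h"]:
    state = must_update(state, blk, assoc=8)
print(state)

# The same update for a 2-way set: the oldest block drops out.
small = {}
for blk in ["a", "b", "c"]:
    small = must_update(small, blk, assoc=2)
print(small)
```

After as many accesses to distinct blocks as the set has ways, the complete
contents and ages of the set are known although nothing was known at the start.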
[Figure 8.3: PLRU vs. LRU cache analysis. Plot of the guaranteed cache lines
per instruction (lines/ins) for LRU and PLRU under the analysis settings cs0,
vivu, vivu(2), vivu(4) and vivu(5).]

² So it must assume it has values $0 \le c \le 3$.
³ “vivu” in the figure refers to the virtual unrolling of loops mentioned above, while “cs0” is
the callstring approach with length 0, cf. [Mar99a, Mar99b].

Another source of imprecision in the analysis related to the hardware may
arise if the average (or best) case and the worst-case behavior of components
differ significantly. One example is again caches. Another example is
“shortcuts” in the instruction execution. The MPC755, for example, can shorten
integer multiplications if the upper $8 \cdot n$, $1 \le n \le 4$, bits of one
operand are zero. Instead of three cycles, a multiplication may then only take
one cycle. While this increases average-case performance a bit, it makes the
analysis more imprecise, as it is mostly not possible to obtain tight intervals
for the operands of multiplications without increasing the analysis costs
considerably. Thus, all cases of execution time for the multiplication must be
considered by the analysis (1, 2, 3 cycles), increasing the costs and decreasing
the precision, as few fast multiplications can be predicted.
Last, the structure of the program code to analyze has a great influence on
the analysis costs and the precision of the results. If the code is generated from
higher-level control laws, it is quite easy to analyze, especially if no optimizing
compilers have been used to compile the generated C code to object code.
Hand-written or highly optimized code makes the analysis more costly, as it is
less likely to obtain good information that can limit the number of successor
pipeline states that have to be considered. Also, the quality of the generated
bounds may be poor, as the analysis may not be able to separate enough contexts
in the optimized code, cf. above.
8.2 Advice for Predictable Systems
In the following we list a number of processor properties whose combination will
allow high precision in statements about the timing behavior and a modular design
of the timing analyzer.
• Separate data and instruction caches: separate caches eliminate the
  interdependencies of instruction prefetching and data accesses. This way the
  precision loss of separate cache and pipeline analyses can be reduced. In an
  integrated analysis, worst-case assumptions can be made more easily since
  these dependencies do not have to be considered.
• Cache replacement strategies: these should be immune against “chaos”.
  This means that when cache contents are not known at one point, subsequent
  accesses can recover knowledge about the new cache contents. The ColdFire
  cache with its global replacement counter does not allow recovering knowledge
  about the counter if one has no information on its value at some point. LRU
  replacement strategies recover from “chaos”: after some cache updates, the
  ages of the new elements in the cache are known.
  Naturally, the update strategy should be (locally) deterministic, otherwise
  little can be statically said about cache contents.
  The cache architecture should allow both must and may analyses for the
  caches. Neither the ColdFire nor the PowerPC 755 cache makes this possible,
  although information about what is guaranteed not to be in the cache is
  valuable in restricting the non-determinism in the pipeline analysis.
• Branch prediction, if any, should be static: the modeling of dynamic branch
  prediction would lead to an even more complex integrated analysis. A static,
  separate and precise analysis of dynamic branch prediction is difficult since
  it also depends on the pipeline state.
• Out-of-order execution should be limited: with out-of-order execution, one
  has to consider the effects of all possible interleavings of instructions.
  Clearly, this is difficult and imprecise to do statically in a separate
  analysis since there are many possible interleavings, whereas a worst-case
  interleaving is not likely to occur during execution but must be assumed to
  ensure a correct result. In an integrated approach all interleavings are
  considered but most of them will not be worst cases. The required granularity
  of the pipeline model for this increases both design complexity and analysis
  complexity.
• Shortcuts should be avoided: in general, shortcuts in the hardware design,
  e. g. special cases to accelerate some operation if certain (dynamic)
  conditions hold, should not be used. While they definitely improve average
  performance, they have little gain in running typical real-time tasks;
  nonetheless, they must be modeled in quite some detail in the pipeline
  analysis or give rise to increased nondeterminism.
Chapter 9
Conclusions
In summary, the idea is to give all of the
information to help others to judge the
value of your contribution; not just the in-
formation that leads to judgement in one
particular direction or another.
Richard P. Feynman
Hard real-time systems must be verified, requiring, among others, knowledge
about upper bounds for the worst-case execution time (WCET) of the tasks in
the system. Static analysis working on the program code is currently viewed as
the only safe and practical method of obtaining WCET bounds.
Features of modern hardware (cache, pipelines, SDRAM, . . . ) which have
been introduced to improve average-case performance of systems are not easy
to handle within static analysis due to their history sensitive behavior and strong
interactions between different features (e. g. caching and branch prediction).
Precise results can be difficult to obtain, because there exist bad worst-case
scenarios for the performance of certain features which have to be assumed for
a safe analysis result in case the scenario cannot be ruled out by the analysis.
Furthermore, the interaction of features may make it difficult or impossible to
decide locally which of the possible behaviors will lead to the globally longest
execution time (timing anomalies). This necessitates an integrated, global analysis
approach in order to conservatively capture all possible effects on the execution
time.
This work presents a new approach to the problem of obtaining safe bounds
for WCETs for complex hardware architectures and realistically sized programs.
Our approach is semantics based, utilizing safe abstractions using the technique of
abstract interpretation. This way, the correctness of the results can be proven, an-
other advantage over currently advocated methods. We use a model of the system,
containing encapsulated units with inner state and update rules, communicating
via typed signals. The model forms a cycle-precise simulation of program execu-
tion in the hard real-time system. Components of this model are then abstracted
using the framework of abstract interpretation, resulting in an abstract model that
still can be simulated in a cycle-precise way.
Finding all possible simulation sequences starting from a set of system start
states in this nondeterministic abstract model allows obtaining a WCET bound
for the basic blocks of the program and successor states to the subsequent basic
blocks of the program. Integration of this technique into a data-flow analysis over
the control-flow graph of the program gives an analysis that delivers WCET
bounds for all the basic blocks of a program.
In this thesis, a concrete semantics for the definition of cycle update programs
in a model has been given together with a semantics describing the cycle-wise
execution. A method to define abstractions for this concrete semantics has been
presented and its correctness has been proven. The correctness proof can be ap-
plied to other forms of semantics for cycle updates as well. The approach has
been exemplified by giving two models for the Motorola ColdFire 5307 and the
PowerPC 755.
A complete WCET tool, aiT, has been implemented with this analysis as the
central component. The other phases of the tool are also described in this thesis.
9.1 Predicting Modern Hardware
Timing anomalies and bad worst-case behavior make the precise, safe and fast
analysis of modern hardware difficult. As widely-used processors are normally
designed towards a good average-case performance, little attention is paid to the
worst-case scenarios from the side of chip manufacturers. For hard real-time sys-
tems, the average-case performance is of less importance than the worst-case per-
formance, since only the latter can be assumed to be available in the scheduling
analysis.
Bad worst-case performance can be defined in terms of the difference between
average and worst-case behavior. The smaller the gap between both cases, the
better the worst-case performance. If one behavior has a fast best or average-
case, e. g. a cache hit and a slow worst-case, e. g. a cache miss, then an analysis
must either be very detailed, e. g. the must and may analyses for caches must be
performed as part of the analysis, or it will be in danger of delivering results that
are too far away from the real worst-case, e. g. by assuming that every memory
access is a cache miss. In the latter case, the analysis is practically useless, in the
former it will be more complex and costly.
As WCET analysis has to consider all behaviors of the system if timing anomalies
cannot be ruled out¹, it is most efficient if there are not many possible
behaviors for a system component. E.g. static RAM has a fixed access timing, while
accesses to SDRAM take different amounts of time, depending on the pages open
in the memory banks accessed. An analysis of SDRAM has to consider both page
hit and miss behaviors. This, too, makes the analysis more complex and decreases
the speed of the analysis.
Even if one is willing to use a more complex analysis, e. g. to analyze cache
behavior more precisely, it may be difficult to obtain a precise static analysis.
This is because a static analysis has to handle correctly the case that nothing
is known about the component, e. g. that the cache contents are unknown at the
beginning of the (modular) analysis of a procedure in the program. It depends
on the system component whether a (first-order) analysis is able to compute
precise information while analyzing the procedure. E. g. for an LRU cache, cache
analysis can compute information for all the lines in the cache, as old content
(and thus the unknown content) of the cache is guaranteed to be replaced with
new content. The cache of the Motorola ColdFire 5307 with its pseudo
round-robin replacement does not allow information to be computed for the whole
cache, as the value of the cache replacement counter remains unknown if it is
unknown at the beginning of the procedure, cf. Section 2.5.2.
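The LRU case can be sketched as follows. This is a minimal illustration of a
must-analysis update for a single cache set, not the implementation used in
aiT; the function names, the dictionary representation and the associativity
of 4 are assumptions made for the example.

```python
def must_update(must, block, assoc):
    """Abstract LRU update of one cache set in a must analysis.

    `must` maps a memory block to an upper bound on its LRU age.
    Blocks that are absent from the map are not guaranteed to be cached.
    """
    # An unknown block is treated as having the maximal age `assoc`.
    old_age = must.get(block, assoc)
    new = {}
    for other, age in must.items():
        if other == block:
            continue
        # Only blocks younger than the accessed block grow older.
        aged = age + 1 if age < old_age else age
        if aged < assoc:  # blocks reaching age `assoc` may be evicted
            new[other] = aged
    new[block] = 0  # the accessed block becomes the youngest
    return new


def analyze(accesses, assoc):
    """Run the must analysis, starting with no knowledge about the set."""
    must = {}
    for block in accesses:
        must = must_update(must, block, assoc)
    return must


# After `assoc` distinct accesses, every line of the set is known.
state = analyze(["a", "b", "c", "d"], assoc=4)
```

Starting from the empty (unknown) state, four accesses to distinct blocks are
enough to obtain information about all four lines of the set, because any
unknown old content has been pushed out by then.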
9.2 Practical Usability
The methods and observations of this thesis have been tested by implementing a
WCET tool, aiT, for two processors, the Motorola ColdFire 5307 and the
PowerPC 755. This work was done during the DAEDALUS project, which aimed at
the validation of critical software by static analysis and abstract testing.
Airbus France was the main industrial partner in the project and the end user
of the tools and new methodologies developed during the project. Consequently,
aiT was developed and tested in the setting of analyzing avionics software.
As the results of the project are considered for application in the verification
toolset for the Airbus project, Airbus France has performed an extensive
evaluation of all tools, including aiT. This evaluation was performed solely by
people from the software validation group of Airbus France, who chose their own
benchmarks. This ensured an unbiased evaluation process, as neither the
benchmarks nor the settings of the analysis parameters, etc., could be chosen
by the developers of aiT to steer the results in the “desired” direction.
Footnote 1: The problem of ruling out timing anomalies is difficult because all
possible interactions between components have to be considered.
In the final report on the DAEDALUS project, w.r.t. this evaluation process,
the EU review concluded:
The assessment performed by the end user [Airbus] . . . was
considered of exceptional quality.
A small part of the evaluation results has been published in [TSH D 03]. Due
to the sensitive nature of the software involved (flight control), the main parts
are not disclosed to the public. The reviewers of the project, to whom the
results were presented during the review, summarize the overall results of
aiT’s evaluation as follows in the final review report:
The results obtained on this topic [WCET analysis], and the
improvements of the ABSINT tool is one of the most important results of
the Daedalus project, from both technical and industrial points of view.
The WCET analysis tool for the ColdFire processor has been very
successful. The final evaluation of the tool for PowerPC was expected
to be at hand about a month after the final review. The ABSINT tool is
probably the best of its kind in the world, and it is justified to consider
this result as a breakthrough.
After the DAEDALUS project, the development of aiT for the PowerPC 755
has continued. In the course of this, work has started on modeling the system
controller of the target hardware processor board. Validation and introduction
of this specialized aiT version is currently under way.
Besides the PowerPC 755 version, aiT is available as a commercial tool,
marketed by AbsInt GmbH, for a number of other processors, e. g. PowerPC 5xx
and ARM, cf. http://www.absint.com/ait.
In summary, our approach can be applied successfully to real-life problems in
the area of avionics. It is the first one to handle a combination of advanced
processor features like
• speculative execution,
• prefetching,
• branch prediction,
• out-of-order execution,
• and caching
together in a provably correct way for a real-life processor, (independently)
evaluated on real-life benchmarks. This is also the first time that an
industrial-strength commercial tool for complex processors has been designed
and is being marketed.
9.3 Future Work
The application to areas other than avionics in which hard real-time systems
have to be analyzed, e. g. the automotive industry or train control systems, is
currently under investigation in other projects, e. g. the AVACS Transregional
Collaborative Research Center 14 sponsored by the German Science Foundation:
http://www.avacs.org.
The design of processor models based on handbooks, experiments and
information obtained from the processor designers is a complex and lengthy
process. The process may introduce errors at the level of the model. As the
correctness of the model can only be verified with test runs of the analyzer
and comparison against real execution traces, these errors may be costly to
find. As an alternative, models can be derived from an authoritative source,
namely the Verilog or VHDL descriptions of the processor (and the other system
hardware). The modularization and signal concepts of these languages are
similar to the ones used in this thesis. The abstraction methodologies can be
translated to these languages in a similar way, at least if the models are
clocked, i. e. operations are synchronized to the processor clock edge. During
AVACS, this approach will be studied for the (publicly available) LEON2 SPARC
clone.
Another area of active research is reducing the cost of the WCET analysis.
One aspect is to reduce the size of the state information that has to be kept in
the abstract model. Another aspect is to reduce the degree of nondeterminism
in the analysis, resulting in fewer successor states for one abstract pipeline
state during the analysis. If a processor does not possess timing anomalies,
then only the local worst case has to be considered in the case of
nondeterminism. We are investigating approaches to verify the absence of timing
anomalies by using model checking on processor models. With this, one can show
either the general absence of anomalies for a processor or the absence of
anomalies during the execution of a fixed program. Model checking can also be
used to determine that one component of a pipeline state subsumes another,
i. e. the information in one component uniquely (w.r.t. the abstractions made)
defines the contents of the other. Then, the second component can be safely
removed from the state, reducing the complexity of the WCET analysis.
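The effect of following only the local worst case can be sketched with a toy
model. The state here is just the number of elapsed cycles and the latencies
of the steps add up independently, so the model is free of timing anomalies
by construction; real abstract pipeline states carry far more information.
All names and latency values are hypothetical.

```python
def explore_all(steps):
    """Follow every successor at each nondeterministic step."""
    states = {0}  # set of possible elapsed-cycle counts
    for latencies in steps:
        states = {s + lat for s in states for lat in latencies}
    return states


def explore_local_worst(steps):
    """Follow only the locally worst successor at each step."""
    state = 0
    for latencies in steps:
        state += max(latencies)
    return state


# Three steps, each with two possible latencies (e. g. hit or miss).
steps = [(1, 10), (1, 10), (2, 5)]
all_states = explore_all(steps)
worst = explore_local_worst(steps)
```

In this anomaly-free model the single state kept by the second function gives
the same bound as the maximum over all states kept by the first, while the
first has to track several states.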
Bibliography
[aiS] http://www.aisee.com. aiSee Home Page.
[AK95] D. P. Appenzeller and A. Kuehlmann. Formal Verification of a Pow-
erPC Microprocessor. In Proceedings of the IEEE International
Conference on Computer Design (ICCD ’95), Austin, Texas, 1995.
[Ash02] P. J. Ashenden. The Designer’s Guide to VHDL. Morgan Kaufmann
Publishers, Academic Press, 2nd edition, 2002.
[BBB03] A. Burns, G. Bernat, and I. Broster. A Probabilistic Framework for
Schedulability Analysis. In R. Alur and I. Lee, editors, Proceedings
of the Third International Embedded Software Conference, EM-
SOFT, volume 2855 of Lecture Notes in Computer Science, pages
1–15, 2003.
[BBCZ98] S. Berezin, A. Biere, E. Clarke, and Y. Zhu. Combining Sym-
bolic Model Checking with Uninterpreted Functions for Out-of-
order Processor Verification. In Proceedings of the Formal Methods
in Computer-Aided Design, Second International Conference, FM-
CAD ’98, volume 1522 of Lecture Notes in Computer Science, Palo
Alto, California, USA, 1998. Springer.
[BCM D 92] J. Burch, E. Clarke, K. McMillan, D. Dill, and L. Hwang.
Symbolic Model Checking: 10^20 States and Beyond. Information and
Computation, 98(2), 1992.
[BCP02] G. Bernat, A. Colin, and S. M. Petters. WCET Analysis of Prob-
abilistic Hard Real-Time Systems. In Proceedings of the 23rd
Real-Time Systems Symposium RTSS 2002, pages 279–288, Austin,
Texas, USA, December 2002.
[BD94] J. R. Burch and D. L. Dill. Automatic verification of pipelined
microprocessor control. In David L. Dill, editor, Conference on
Computer-Aided Verication, volume 818 of Lecture Notes in Com-
puter Science, pages 68–80. Springer-Verlag, 1994. Stanford, Cal-
ifornia, June 21–23, 1994.
[Ber] G. Berry. The Esterel V5 Language Primer. Esterel Technologies.
[BHE91] D. G. Bradlee, R. R. Henry, and S. J. Eggers. The Marion Sys-
tem for Retargetable Instruction Scheduling. In Brent Hailpern,
editor, Proceedings of the ACM SIGPLAN ’91 Conference on Pro-
gramming Language Design and Implementation, pages 229–240,
Toronto, ON, Canada, June 1991. ACM Press.
[BN94] S. Basumallick and K. Nilsen. Cache Issues in Real-Time Sys-
tems. In Proceedings of the ACM SIGPLAN Workshop on Lan-
guage, Compiler and Tool Support for Real-Time Systems, June
1994.
[Bur96] J. R. Burch. Techniques for Verifying Superscalar Microprocessors.
In Proceedings of the 33rd Design Automation Conference DAC 96,
pages 552–557, Las Vegas, USA, June 1996. ACM Press.
[BW90] A. Burns and A. Wellings. Real-Time Systems and their Program-
ming Languages. Addison Wesley, 1990.
[CC77] P. Cousot and R. Cousot. A Unified Lattice Model for Static Anal-
ysis of Programs by Construction or Approximation of Fixpoints.
In Conference Record of the 4th ACM Symposium on Principles of
Programming Languages, pages 238–252, Los Angeles, CA, Jan-
uary 1977.
[CC79] P. Cousot and R. Cousot. Systematic Design of Program Analysis
Frameworks. In Proceedings of the 6th ACM Symposium on Prin-
ciples of Programming Languages, 1979.
[CC91] P. Cousot and R. Cousot. Comparison of the Galois connec-
tion and widening/narrowing approaches to abstract interpretation.
JTASPEFL’91, Bordeaux. BIGRE, 74:107–110, October 1991.
[CC92a] P. Cousot and R. Cousot. Abstract Interpretation Frameworks. Jour-
nal of Logic and Computation, 2(4):511–547, 1992.
[CC92b] P. Cousot and R. Cousot. Comparing the Galois Connection and
Widening/Narrowing Approaches to Abstract Interpretation. In
M. Bruynooghe and M. Wirsing, editors, Proceedings of the
International Workshop Programming Language Implementation and
Logic Programming, PLILP’92, volume 631 of Lecture Notes
in Computer Science, pages 269–295, Leuven, Belgium, August
1992. Springer.
[CCKT86] D. Callahan, K. D. Cooper, K. Kennedy, and L. Torczon. Interpro-
cedural Constant Propagation. ACM SIGPLAN Notices, 21(7):152–
161, June 1986. Proceedings of the ACM SIGPLAN ’86 Sympo-
sium on Compiler Construction, Palo Alto, USA.
[CGP99] E. M. Clarke, O. Grumberg, and D. A. Peled. Model Checking. The
MIT Press, 1999.
[CH78] P. Cousot and N. Halbwachs. Automatic Discovery of Linear Re-
straints Among Variables of a Program. In Proceedings of the
5th ACM SIGACT-SIGPLAN symposium on Principles Of Program-
ming Languages, pages 84–96. ACM Press, 1978.
[Cou81] P. Cousot. Semantic Foundations of Program Analysis. In S.S.
Muchnick and N.D. Jones, editors, Program Flow Analysis: Theory
and Applications, chapter 10, pages 303–342. Prentice-Hall, Inc.,
Englewood Cliffs, New Jersey, 1981.
[CP00] A. Colin and I. Puaut. Worst Case Execution Time Analysis for
a Processor with Branch Prediction. Real-Time Systems, Special
issue on worst-case execution time analysis, 18(2):249–274, April
2000.
[CP01] A. Colin and I. Puaut. A Modular and Retargetable Framework
for Tree-based WCET Analysis. In 13th Euromicro Conference on
Real-Time Systems, pages 37–44, June 2001.
[DO192] European Organisation for Civil Aviation Electronics. DO-178B:
Software Considerations in Airborne Systems and Equipment
Certification, Dec 1992.
[DP97] W. Damm and A. Pnueli. Verifying out-of-order executions. In
Proceedings of the IFIP WG 10.5 International Conference on Correct
Hardware Design and Verification Methods, pages 23–47.
Chapman & Hall, Ltd., 1997.
[Eng02] J. Engblom. Processor Pipelines and Static Worst-Case Execution
Time Analysis. PhD thesis, Faculty of Science and Technology,
Uppsala University, 2002.
[Erm03] A. Ermedahl. A Modular Tool Architecture for Worst-Case Execu-
tion Time Analysis. PhD thesis, Faculty of Science and Technology,
Uppsala University, 2003.
[Fer97] C. Ferdinand. Cache Behavior Prediction for Real-Time Systems.
PhD thesis, Saarland University, 1997.
[FHL D 01] C. Ferdinand, R. Heckmann, M. Langenbach, F. Martin,
M. Schmidt, H. Theiling, S. Thesing, and R. Wilhelm. Reliable
and Precise WCET Determination for a Real-Life Processor. In
Proceedings of the First International Workshop on Embedded
Software, EMSOFT 2001, volume 2211, 2001.
[FKL D 99] C. Ferdinand, D. Kästner, M. Langenbach, F. Martin, M. Schmidt,
J. Schneider, H. Theiling, S. Thesing, and R. Wilhelm.
Run-Time Guarantees for Real-Time Systems — The USES Approach.
In Proceedings of Informatik ’99 – Arbeitstagung
Programmiersprachen, Paderborn, 1999.
[GHDN99] P. Grun, A. Halambi, N. Dutt, and A. Nicolau. RTGEN: An Al-
gorithm for Automatic Generation of Reservation Tables from Ar-
chitectural Descriptions. In Proceedings on the 12th International
Symposium on Systems Synthesis, 1999.
[Gra69] R.L. Graham. Bounds on Multiprocessing Timing Anomalies.
SIAM Journal of Applied Mathematics, 17(2):416–429, March
1969.
[HAM D 99] C. A. Healy, R. D. Arnold, F. Mueller, D. B. Whalley, and M. G.
Harmon. Bounding Pipeline and Instruction Cache Performance.
IEEE Transactions on Computers, 48(1):53–70, January 1999.
[HBL D 95] Y. Hur, Y. H. Bae, S. Lim, S. Kim, B. Rhee, S. L. Min, C. Y. Park,
M. Lee, H. Shin, and C. S. Kim. Worst Case Timing Analysis of
RISC Processors: R3000/R3100 Case Study. In Proceedings of the
IEEE Real-Time Systems Symposium, pages 308–319, December
1995.
[HGG D 99] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, and A. Nicolau.
EXPRESSION: A Language for Architecture Exploration through
Compiler/Simulator Retargetability. In Proceedings of the Design,
Automation and Test in Europe Conference, DATE’99, pages 485–
490, March 1999.
[HLTW03] R. Heckmann, M. Langenbach, S. Thesing, and R. Wilhelm. The
Influence of Processor Architecture on the Design and the Results
of WCET Tools. Proceedings of the IEEE, 91(7), July 2003.
[HP96] J. L. Hennessy and D. A. Patterson. Computer Architecture. A
Quantitative Approach. Morgan Kaufmann Publishers, 2nd edition,
1996.
[HQR98] T. A. Henzinger, S. Qadeer, and S. K. Rajamani. You Assume,
We Guarantee: Methodology and Case Studies. In Proceedings of
the 10th International Conference on Computer-aided Verification
(CAV), volume 1427 of Lecture Notes in Computer Science, pages
440–451. Springer-Verlag, 1998.
[HT01] R. Heckmann and S. Thesing. Cache and Pipeline Analysis for the
Coldfire 5307. Technical report, Saarland University, 2001.
[HWH95] C. A. Healy, D. B. Whalley, and M. G. Harmon. Integrating the
Timing Analysis of Pipelining and Instruction Caching. In Pro-
ceedings of the IEEE Real-Time Systems Symposium, pages 288–
297, December 1995.
[HYHD95] R. C. Ho, C. Han Yang, M. Horowitz, and D. L. Dill. Architec-
ture validation for processors. In Proceedings of the 22nd An-
nual International Symposium on Computer Architecture, pages
404–413. Association for Computing Machinery, June 1995. Santa
Margherita Ligure, Italy, 22-24 June 1995.
[Hym02] C. Hymans. Checking Safety Properties of Behavioral VHDL De-
scriptions by Abstract Interpretation. In 9th International Static
Analysis Symposium (SAS’02), volume 2477 of Lecture Notes in
Computer Science, pages 444–460. Springer, 2002.
[Hym03] C. Hymans. Design and Implementation of an Abstract
Interpreter for VHDL. In 12th Advanced Research Working
Conference on Correct Hardware Design and Verification Methods
(CHARME’03), Lecture Notes in Computer Science. Springer,
2003.
[Int91] Intel Corporation. i960 KA/KB Microprocessor Programmer’s Ref-
erence Manual, 1991.
[Int99] Intel Corporation. PC SDRAM Specification, November 1999.
Revision 1.7.
[JP86] M. Joseph and P. Pandya. Finding Response Times in a Real-Time
System. The BCS Computer Journal, 29(5):390–395, January 1986.
[JPM02] S. Jolly, A. Parashkevov, and T. McDougall. Automated Equiva-
lence Checking of Switch Level Circuits. In Proceedings of the 39th
Design Automation Conference, DAC 2002, New Orleans, USA,
2002. ACM.
[JSD98] R. B. Jones, J. U. Skakkebaek, and D. L. Dill. Reducing manual ab-
straction in formal verification of out-of-order execution. In Formal
Methods in Computer-Aided Design, pages 2–17, 1998.
[KT99] D. Kästner and S. Thesing. Cache Aware Pre-runtime Scheduling.
Real-Time Systems Journal, 17(2/3):235–250, November 1999.
[Lan98] M. Langenbach. CRL – A Uniform Representation for Control
Flow. Technical report, Saarland University, Faculty for Computer
Science, 1998.
[LBJ D 95] S. Lim, Y. H. Bae, G. T. Jang, B. Rhee, S. L. Min, C. Y. Park,
H. Shin, K. Park, and C. S. Kim. An Accurate Worst Case Timing
Analysis Technique for RISC Processors. IEEE Transactions on
Software Engineering, 21(7), July 1995.
[LHYM D 96] C.-G. Lee, J. Hahn, Y.-M. Seo, S. L. Min, R. Ha, S. Hong, C. Y.
Park, M. Lee, and C. S. Kim. Analysis of Cache-related Preemption
Delay in Fixed-priority Preemptive Scheduling. In Proceedings of
the 16th Real-Time Systems Symposium, 1996.
[Liu00] J. W. S. Liu. Real-Time Systems. Prentice-Hall, 2000.
[LL73] C. L. Liu and J. W. Layland. Scheduling Algorithms for
Multiprogramming in a Hard Real-Time Environment. Journal of the
Association for Computing Machinery, 20(1):46–61, 1973.
[LMA97] Y.-T. S. Li, S. Malik, and A. Wolfe. Cache Modeling for Real-Time
Software: Beyond Direct Mapped Instruction Caches. IEEE
Real-Time Systems Symposium, January 1997.
[LRM D 94] S. Lim, B. Rhee, S. L. Min, C. Y. Park, H. Shin, and C. S. Kim.
Issues of Advanced Architectural Features in the Design of a Tim-
ing Tool. In Proceedings of the 11th IEEE Workshop on Real-time
Operating Systems and Software, 1994.
[LS98] T. Lundqvist and P. Stenström. Integrating Path and Timing
Analysis Using Instruction-Level Simulation Techniques. In F. Mueller
and A. Bestavros, editors, Proceedings of the ACM SIGPLAN
Workshop Languages, Compilers and Tools for Embedded
Systems (LCTES), volume 1474 of Lecture Notes in Computer Science,
pages 1–15, 1998.
[LS99] T. Lundqvist and P. Stenström. An Integrated Path and Timing
Analysis Method based on Cycle-Level Symbolic Execution.
Real-Time Systems Journal, 17(2/3):183–207, November 1999.
[Lun02] T. Lundqvist. A WCET Analysis Method for Pipelined Micropro-
cessors with Cache Memories. PhD thesis, Dept. of Computer En-
gineering, Chalmers University of Technology, Sweden, June 2002.
[Mar98] F. Martin. PAG – an Efficient Program Analyzer Generator. Inter-
national Journal on Software Tools for Technology Transfer, 2(1),
1998.
[Mar99a] F. Martin. Experimental Comparison of call string and functional
Approaches to Interprocedural Analysis. In S. Jaehnichen, edi-
tor, Proceedings of the 8th International Conference on Compiler
Construction, volume 1575 of Lecture Notes in Computer Science,
pages 63–75. Springer, 1999.
[Mar99b] F. Martin. Generation of Program Analyzers. PhD thesis, Saarland
University, 1999.
[McM93] K. McMillan. Symbolic Model Checking. Kluwer Academic Pub-
lishers, 1993.
[McM98] K. McMillan. Verification of an Implementation of Tomasulo’s
Algorithm by Compositional Model Checking. Lecture Notes in
Computer Science, 1427, 1998.
[Met04] A. Metzner. Why Model Checking Can Improve WCET Analysis.
In Proceedings of the 16th International Conference on Computer
Aided Verification, CAV’2004. Springer, July 2004.
[Mic92] Sun Microsystems. The SuperSPARC Microprocessor. Technical
White Paper, May 1992.
[Mic03] Micron Technology, Inc. 64MB Synchronous DRAM
MT48LC16M4A2 Datasheet, 2003.
[Moo65] G. Moore. Cramming More Components onto Integrated Circuits.
Electronics, 38(8), April 1965.
[Mot00] Motorola. MCF5307 ColdFire Integrated Microprocessor User’s
Manual, August 2000. MCF5307UM/D, Rev. 2.0.
[MTH D 02] P. Mishra, H. Tomiyama, A. Halambi, P. Grun, N. Dutt, and
A. Nicolau. Automatic Modeling and Validation of Pipeline Speci-
fications Driven by an Architecture Description Language. In Pro-
ceedings of ASPDAC-2002/VLSI Design 2002, 2002.
[MWH94] F. Mueller, D.B. Whalley, and M. Harmon. Predicting Instruction
Cache Behavior. In Proceedings of the ACM SIGPLAN Workshop
on Language, Compiler and Tool Support for Real-Time Systems,
1994.
[NN94] K. Narasimhan and K. Nilsen. Portable Execution Time Analysis
for RISC Processors. In ACM PLDI Workshop on Language, Com-
piler, and Tool Support for Real-Time Systems, June 1994.
[NNH99] F. Nielsen, H. Nielsen, and C. Hankin. Principles of Program Anal-
ysis. Addison-Wesley, 1999.
[NR95] K. Nilsen and B. Rygg. Worst-Case Execution Time Analysis on
Modern Processors. In Proceedings of the Second ACM SIGPLAN
Workshop on Languages, Compilers, and Tools for Real-Time Sys-
tems, LCT-RTS, June 1995.
[Ope02] Open SystemC Initiative. SystemC User’s Guide, 2.0 edition, 2002.
[PPC97a] Motorola. MPC750 RISC Processors User’s Manual, 1997.
[PPC97b] Motorola. PowerPC Microprocessor Family: The Programming
Environments for 32-bit Microprocessors, 1997.
[PZHM98] S. Pees, V. Zivojnovic, A. Hoffmann, and H. Meyr. Retargetable
Timed Instruction Set Simulation of Pipelined Processor
Architectures. In Proceedings of the International Conference on Signal
Processing Applications and Technology, September 1998.
[RPTW04] A. Rakib, O. Parshin, S. Thesing, and R. Wilhelm. Component-
wise Instruction Cache Behavior Prediction. In Proceedings of the
4th Intl. Workshop on Worst-Case Execution Time (WCET) Analy-
sis, Catania, Sicily, Italy, June 2004.
[SA95] T. Shanley and D. Anderson. PCI System Architecture. MindShare,
Inc., third edition, 1995.
[Sch02] J. Schneider. Combined Schedulability and WCET Analysis for
Real-Time Operating Systems. PhD thesis, Saarland University,
2002.
[SF99] J. Schneider and C. Ferdinand. Pipeline Behavior Prediction for
Superscalar Processors by Abstract Interpretation. In Workshop
on Languages, Compilers, and Tools for Embedded Systems, vol-
ume 34 of ACM SIGPLAN Notices, pages 35–44, May 1999.
[Sha89] A. C. Shaw. Reasoning About Time in Higher-Level Language
Software. IEEE Transactions on Software Engineering, 15(7):875–
889, July 1989.
[Sic97] M. Sicks. Adressbestimmung zur Vorhersage des Verhaltens von
Daten-Caches. Master’s thesis, Saarland University, 1997.
[Sis98] C. Siska. A Processor Description Language Supporting Retar-
getable Multi-Pipeline DSP Program Development Tools. In Pro-
ceedings of the 11th International Symposium on System Synthesis,
1998.
[SJD98] Jens U. Skakkebæk, Robert B. Jones, and David L. Dill. Formal
Verification of Out-of-order Execution Using Incremental Flushing.
In Proceedings of the 10th International Conference on Computer
Aided Verification, CAV’98, June 1998.
[SP81] M. Sharir and A. Pnueli. Two Approaches to Interprocedural Data
Flow Analysis. In S. S. Muchnick and N. D. Jones, editors, Pro-
gram Flow Analysis: Theory and Application, chapter 7, pages
189–233. Prentice-Hall, 1981.
[Sto03] J. Stokes. Understanding Moore’s Law. WWW, February 2003.
http://www.arstechnica.com/paedia/m/moore/moore-1.html.
[TBW94] K. Tindell, A. Burns, and A. Wellings. An Extensible Approach for
Analyzing Fixed Priority Hard Real-Time Tasks. The Journal of
Real-Time Systems, 6(2):133–151, March 1994.
[The02] H. Theiling. Control Flow Graphs for Real-Time System Analysis
- Reconstruction from Binary Executables and Usage in ILP-based
Path Analysis. PhD thesis, Saarland University, 2002.
[Tho64] J. E. Thornton. Parallel Operation in the Control Data 6600. In
Proc. Fall Joint Computer Conference, pages 33–40, 1964.
[TM91] D.E. Thomas and P.R. Moorby. The Verilog Hardware Description
Language. Kluwer Academic Publishers, Boston, Massachusetts,
1991.
[TMLA97] S. Thesing, F. Martin, O. Lauer, and M. Alt. PAG: User’s Manual.
Saarland University, 1.0 edition, 1997.
[Tom67] R. M. Tomasulo. An Efficient Algorithm for Exploiting Multiple
Arithmetic Units. IBM J. Research and Development, 11(1):25–33,
January 1967.
[TSH D 03] S. Thesing, J. Souyris, R. Heckmann, F. Randimbivololona,
M. Langenbach, R. Wilhelm, and C. Ferdinand. An Abstract
Interpretation-Based Timing Validation of Hard Real-Time Avion-
ics. In Proceedings of the International Performance and Depend-
ability Symposium (IPDS), June 2003.
[Tuo02] I. Tuomi. The Lives and Death of Moore’s Law. First Monday
WWW Journal, 7(11), November 2002. http://firstmonday.org.
[VHD00] Institute of Electrical and Electronic Engineers, New York. Draft
IEEE Standard P1076 2000/D3 VHDL Language Reference Man-
ual, 2000.
[Wil04] R. Wilhelm. Why AI + ILP is Good for WCET, but MC is Not, nor
ILP Alone. In Proceedings of the Fifth International Conference on
Verification, Model Checking and Abstract Interpretation, Venice,
Italy, January 2004.
[WM95] R. Wilhelm and D. Maurer. Compiler Design. Addison-Wesley,
1995.
[ZPS D 96] V. Zivojnovic, S. Pees, C. Schläger, M. Willems, R. Schoenen, and
H. Meyr. LISA – Machine Description Language and Generic
Machine Model. In Proceedings of the International Conference on
Signal Processing Applications and Technology, October 1996.
