Formale Verifikation von Mikroprozessoren mit Pipeline by Kröning, Daniel
Formal Verification of
Pipelined Microprocessors
Dissertation
submitted for the degree of Doktor der
Ingenieurwissenschaften (Dr.-Ing.) to the
Naturwissenschaftlich-Technische Fakultät I of the
Universität des Saarlandes
Daniel Kröning
Saarbrücken, 2001
Abstract
The subject of this thesis is the formal verification of pipelined
microprocessors. This includes processors with state of the art schedulers,
such as the Tomasulo scheduler, and speculation. In contrast to most of the
literature, we verify synthesizable designs at gate level. Furthermore, we
prove both data consistency and liveness. We verify the proofs using the
theorem proving system PVS. We verify both in-order and out-of-order machines.
For verifying in-order machines, we extend the stall engine concept pre-
sented in [MP00]. We describe and implement an algorithm that does the
transformation into a pipelined machine. We describe a generic machine
that supports speculating on arbitrary values. We formally verify proofs
for the Tomasulo scheduling algorithm with reorder buffer.
Short Abstract (German: Kurzzusammenfassung)
The subject of this dissertation is the formal verification of pipelined
microprocessors. This also includes processors with current scheduling
techniques such as the Tomasulo scheduler and speculative execution. In
contrast to large parts of the existing literature, we carry out the
verification at gate level. Furthermore, we prove both data consistency
and an upper bound on the execution time. The proofs are verified with the
theorem proving system PVS. Both in-order and out-of-order machines are
verified. For the verification of the in-order machines, we extend the
stall engine from [MP00]. We describe and implement a method that performs
the transformation into the pipelined machine. We describe a generic
machine that permits speculation on arbitrary values. We verify the proofs
for the Tomasulo scheduler with reorder buffer.
Extended Abstract
Microprocessors are in use in many safety-critical environments, such as
cars or planes. We therefore consider the correctness of such components
as a matter of vital importance. Testing microprocessors is limited by the
huge state space of modern microprocessors. We therefore think formal
verification is the sole way to obtain a guarantee.
This formal verification should be done such that any third party is able
to verify the correctness with low effort, i.e., we aim to provide a proof
of correctness that can be checked mechanically. In particular, we think
that all critical designs should be delivered in form of a four-tuple: 1)
the design itself, 2) a specification, 3) a human-readable proof, and 4) a
machine-verified proof.
In this thesis, we present proofs of correctness for complex micropro-
cessors. Designing microprocessors is considered an error-prone process.
A well-known example of this is the Pentium FDIV bug [Coe95, Pra95].
In this thesis, we provide a rigorously formal approach to hardware veri-
fication. The designs presented in this thesis include state of the art sched-
ulers, such as the Tomasulo scheduler [Tom67] and speculation. In con-
trast to most of the literature, the designs we provide are very close to
gate level. In particular, we are synthesizing some of the designs for the
XILINX FPGA series.
These designs are of high complexity, and so are the proofs. In contrast
to [MP95, Lei99, MP00], the proofs are machine verified using the theorem
proving system PVS [CRSS94]. We do not present the original PVS proof
in this thesis but aim to provide comprehensible paper-and-pencil proofs.
In order to verify sequential machines, we extend the data consistency
invariant given in [MP00] by defining a “correct value” of an implementa-
tion register such as IR:2. Given the correctness of functional components
such as the ALU, this allows for an almost fully automated proof of the
data consistency of the prepared sequential machine using PVS. We ar-
gue that the correct functional components provide correct results if given
correct inputs.
We extend the stall engine concept presented in [MP00] by providing
a fully generic stall engine design. In contrast to [MP00], our stall en-
gine design supports an arbitrary number of stages and allows for stalling
(and therefore clocking) all stages independently. Furthermore, it supports
pipeline bubble removal, i.e., the stages are clocked whenever the in-order
property permits this, so that bubbles are removed from the
pipeline where necessary. We formally verify data consistency and liveness
properties for this stall engine.
Using this extended stall engine, we improve the process of transforming
the prepared sequential machine into the pipelined machine by providing
a tool that does this transformation automatically. This includes the
generation of forwarding and interlock hardware.
We then prove the data consistency of the pipelined machine. We do
so by showing that the inputs of the pipeline stages are correct. Using this
fact, we argue the correctness of the output values as we do for the prepared
sequential machine, since the functional components of the machines are
identical.
We present a generic approach to speculative execution and propose a
data consistency criterion for such a machine. We then apply this method
in order to implement and prove DLX pipelines with branch prediction
and precise interrupts. It is a well-known fact that both techniques are im-
plemented using speculation [SP88]. However, to the best of our knowl-
edge, implementing both techniques as an instance of a generic speculation
mechanism is done for the first time.
Besides the in-order pipelines, we verify the correctness of the Tomasulo
scheduling algorithm with reorder buffer as described in [KMP99]. The
reorder buffer realizes in-order termination, which allows implementing
precise interrupts. The proof of correctness covers the arguments necessary
to show the uniqueness of the tags.
Furthermore, we rigorously prove the liveness of all machines we de-
sign, i.e., we prove that any given instruction sequence is executed within
a finite amount of time. Although critical, liveness issues are often not
covered in the open literature.
Summary (German: Zusammenfassung)
Microprocessors are used in many safety-critical areas, for example in
automobiles or aircraft. We therefore regard the correctness of such
components as vital. Testing processors is possible only to a limited
extent because of the extremely large state space of modern processors.
We are therefore of the opinion that formal verification is the only way
to obtain a guarantee.
This formal verification should be carried out in such a way that third
parties can check the correctness with little effort. We therefore want to
provide a proof that can be checked automatically. In particular, all
critical designs should be delivered in the form of four-tuples: 1) the
design itself, 2) a specification, 3) a proof that can be followed
manually, and 4) a machine-verifiable proof.
The subject of this dissertation is correctness proofs for complex
microprocessors. Creating microprocessor designs is considered
error-prone. A well-known example is the Pentium FDIV bug [Coe95, Pra95].
In this dissertation, the problem of hardware correctness is treated in a
strictly formal way. The designs include processors with current
scheduling techniques, such as the Tomasulo scheduler from [Tom67] and
speculative execution. In contrast to large parts of the existing
literature, the designs are specified at gate level. In particular, some
of the designs are synthesized for the XILINX FPGA series.
The designs are highly complex, which affects the proofs. In contrast to
[MP95, Lei99, MP00], the proofs are verified with the theorem proving
system PVS. We do not give the original PVS proof in this dissertation,
but try to give a comprehensible proof in conventional mathematical
notation.
In order to verify sequential machines, we extend the data consistency
invariant from [MP00] by defining a "correct value" of an implementation
register such as IR:2. Given the correctness of the functional components,
such as the ALU, this allows us to almost fully automate the proof of data
consistency of the prepared sequential machine in PVS. We argue that the
functional components deliver correct results if they receive correct
inputs.
We extend the stall engine concept from [MP00] by giving a fully generic
stall engine. In contrast to the stall engine from [MP00], our stall
engine permits an arbitrary number of stages and makes it possible to
stall all stages independently of each other. Furthermore, our stall
engine supports the removal of pipeline bubbles. This means that the
stages are in operation whenever the in-order property permits it, which
includes removing pipeline bubbles from the pipeline where necessary. We
verify the data consistency of this stall engine and state properties that
allow proving bounds on the run time.
With this extended stall engine, we improve the transformation of the
prepared sequential machine into the pipelined machine by implementing a
program that automates this transformation. This includes the generation
of forwarding and interlock circuits.
We then prove the data consistency of the pipelined machine. This is
achieved by proving that the inputs of the pipeline stages are correct. As
for the prepared sequential machine, we can then argue that the outputs
are correct, since the functional units are identical.
We give a generic approach to realizing speculative execution and
establish a data consistency criterion for it. We then apply this method
to implement and verify DLX pipelines with branch prediction and precise
interrupts. It is generally known that both techniques can be implemented
with speculative execution [SP88]. To our knowledge, however, this is the
first time that both techniques are implemented as an instance of a
generic mechanism for speculative execution.
Besides the in-order pipelines, we verify the correctness of the Tomasulo
scheduling algorithm with reorder buffer. The reorder buffer effects
in-order termination, which allows precise interrupts to be implemented.
The correctness proof contains the arguments needed to prove the
uniqueness of the tags.
Furthermore, we prove an upper bound on the execution time of programs on
all machines. Although this is a critical property, the topic is often
passed over in the open literature.

Contents
1 Introduction 1
1.1 Formal Verification of Microprocessors . . . . . . . . . . 1
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Basic Concepts 7
2.1 Specifying Machines . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Mathematical Machines . . . . . . . . . . . . . . 7
2.1.2 Notation . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Bits and Bit Vectors . . . . . . . . . . . . . . . . 9
2.1.4 Gates . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.5 Interpretations of Bit Vectors . . . . . . . . . . . . 13
2.2 Basic Circuits . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Binary Trees . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Zero Tester . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Equality Tester . . . . . . . . . . . . . . . . . . . 16
2.2.4 Parallel Prefix . . . . . . . . . . . . . . . . . . . . 16
2.2.5 Adders . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.6 Verification of a Carry Lookahead Adder . . . . . 21
2.3 Verification of an ALU . . . . . . . . . . . . . . . . . . . 22
2.3.1 Specification . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Implementation . . . . . . . . . . . . . . . . . . . 25
2.4 Specifying the Reference Machine . . . . . . . . . . . . . 27
2.4.1 DLX Architecture . . . . . . . . . . . . . . . . . 27
2.4.2 Configuration of an Integer DLX with Delayed PC 27
2.4.3 Initial Configuration . . . . . . . . . . . . . . . . 28
2.4.4 Transition Function . . . . . . . . . . . . . . . . . 29
2.5 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3 A Sequential Implementation Machine 35
3.1 The Prepared Sequential Machine . . . . . . . . . . . . . 35
3.2 How Hardware is Specified . . . . . . . . . . . . . . . . . 36
3.2.1 A Simple Hardware Description Language . . . . 36
3.2.2 The Register Set of the Implementation Machine . 37
3.2.3 Scheduling of the Prepared Sequential Machine . . 38
3.2.4 The Transition Function . . . . . . . . . . . . . . 41
3.2.5 Inputs . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.6 Register Files and Memory . . . . . . . . . . . . . 47
3.2.7 Multiport Read Accesses . . . . . . . . . . . . . . 51
3.2.8 Notation . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Precomputed Control . . . . . . . . . . . . . . . . . . . . 52
3.4 Implementing the Prepared Sequential DLX . . . . . . . . 53
3.4.1 Structure . . . . . . . . . . . . . . . . . . . . . . 53
3.4.2 The Instruction Fetch Stage . . . . . . . . . . . . 55
3.4.3 The Instruction Decode Stage . . . . . . . . . . . 55
3.4.4 The Execute Stage . . . . . . . . . . . . . . . . . 58
3.4.5 The Memory Stage . . . . . . . . . . . . . . . . . 59
3.4.6 The Write Back Stage . . . . . . . . . . . . . . . 60
3.5 Data Consistency Proof . . . . . . . . . . . . . . . . . . . 61
3.5.1 Properties of the Full Bits . . . . . . . . . . . . . 61
3.5.2 Scheduling Functions . . . . . . . . . . . . . . . . 63
3.5.3 Properties of the Scheduling Function . . . . . . . 65
3.5.4 Data Consistency Proof Strategy . . . . . . . . . . 71
3.5.5 Correctness of the Transition Functions . . . . . . 78
3.6 Liveness . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . 86
3.6.2 Liveness Criterion . . . . . . . . . . . . . . . . . 87
3.6.3 Liveness Properties of the Scheduling Logic . . . . 87
3.6.4 Liveness Proof for the Sequential DLX . . . . . . 88
3.7 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4 Pipelined Machines 91
4.1 Scheduling the Pipelined Machine . . . . . . . . . . . . . 91
4.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . 91
4.1.2 Scheduling Lemmas . . . . . . . . . . . . . . . . 95
4.1.3 The Scheduling Invariants . . . . . . . . . . . . . 98
4.2 Forwarding . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . 100
4.2.2 Forwarding from the Next Stage . . . . . . . . . . 101
4.2.3 Result Forwarding . . . . . . . . . . . . . . . . . 107
4.3 Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.4 Implementing the DLXpi . . . . . . . . . . . . . . . . . . 120
4.5 Data Consistency . . . . . . . . . . . . . . . . . . . . . . 121
4.6 Liveness . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . 129
4.6.2 Extended Liveness Calculus . . . . . . . . . . . . 130
4.6.3 Liveness Proof . . . . . . . . . . . . . . . . . . . 138
4.7 Performance . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.8 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5 Speculative Execution 149
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.2 Stall Engine with Speculation . . . . . . . . . . . . . . . . 151
5.3 Schedule with Speculation . . . . . . . . . . . . . . . . . 153
5.4 Scheduling Invariants . . . . . . . . . . . . . . . . . . . . 157
5.5 Speculative Inputs . . . . . . . . . . . . . . . . . . . . . . 159
5.6 Detecting Misspeculation . . . . . . . . . . . . . . . . . . 159
5.7 Rollback . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.8 Extended Read Access Semantics . . . . . . . . . . . . . 163
5.8.1 Specification Registers . . . . . . . . . . . . . . . 163
5.8.2 External Signals . . . . . . . . . . . . . . . . . . 165
5.9 Branch Prediction . . . . . . . . . . . . . . . . . . . . . . 166
5.9.1 The DLX without Delayed PC . . . . . . . . . . . 166
5.9.2 The Sequential DLX without Delayed PC . . . . . 167
5.9.3 The Pipelined DLX without Delayed PC . . . . . . 168
5.10 Data Consistency . . . . . . . . . . . . . . . . . . . . . . 173
5.10.1 Data Consistency Criterion . . . . . . . . . . . . . 173
5.10.2 Properties of the Pipeline . . . . . . . . . . . . . . 180
5.10.3 Data Consistency Invariants . . . . . . . . . . . . 184
5.11 Liveness . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
5.11.1 Liveness Proof Strategy . . . . . . . . . . . . . . 191
5.11.2 Properties of M(T) . . . . . . . . . . . . . . . . . 192
5.11.3 Rollback Properties . . . . . . . . . . . . . . . . . 195
5.11.4 Liveness Proof . . . . . . . . . . . . . . . . . . . 209
5.12 Precise Interrupts . . . . . . . . . . . . . . . . . . . . . . 211
5.12.1 Definition . . . . . . . . . . . . . . . . . . . . . . 211
5.12.2 The DLX with Interrupts . . . . . . . . . . . . . . 211
5.12.3 Hardware for the DLX with Interrupts . . . . . . . 218
5.12.4 Configuration of the Pipelined DLX with Interrupts 220
5.12.5 Transition Functions of Stage 0 . . . . . . . . . . 220
5.12.6 Transition Functions of Stage 1 . . . . . . . . . . 223
5.12.7 Transition Functions of Stage 2 . . . . . . . . . . 229
5.12.8 Transition Functions of Stage 3 . . . . . . . . . . 229
5.12.9 Transition Functions of Stage 4 . . . . . . . . . . 232
5.12.10 Data Consistency and Liveness . . . . . . . . . . . 239
5.13 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6 Out-of-Order Execution 241
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 241
6.2 The Tomasulo Algorithm with Reorder Buffer . . . . . . . 242
6.3 Tomasulo Data Structures . . . . . . . . . . . . . . . . . . 243
6.3.1 Reorder Buffer . . . . . . . . . . . . . . . . . . . 243
6.3.2 Register File Extensions . . . . . . . . . . . . . . 245
6.3.3 Reservation Stations . . . . . . . . . . . . . . . . 246
6.3.4 Producers . . . . . . . . . . . . . . . . . . . . . . 246
6.3.5 Initial Configuration . . . . . . . . . . . . . . . . 246
6.4 Tomasulo Protocols . . . . . . . . . . . . . . . . . . . . . 247
6.4.1 Formalization . . . . . . . . . . . . . . . . . . . . 247
6.4.2 Issue . . . . . . . . . . . . . . . . . . . . . . . . 249
6.4.3 CDB Snooping . . . . . . . . . . . . . . . . . . . 250
6.4.4 Dispatch . . . . . . . . . . . . . . . . . . . . . . 250
6.4.5 Completion . . . . . . . . . . . . . . . . . . . . . 252
6.4.6 Writeback . . . . . . . . . . . . . . . . . . . . . . 254
6.5 Data Consistency . . . . . . . . . . . . . . . . . . . . . . 254
6.5.1 Scheduling Functions . . . . . . . . . . . . . . . . 254
6.5.2 Function Unit Axioms . . . . . . . . . . . . . . . 259
6.5.3 ROB Flags . . . . . . . . . . . . . . . . . . . . . 261
6.5.4 ROB Properties . . . . . . . . . . . . . . . . . . . 262
6.5.5 Instruction Phases . . . . . . . . . . . . . . . . . 272
6.5.6 Tag Consistency . . . . . . . . . . . . . . . . . . 275
6.5.7 Data Consistency Criterion . . . . . . . . . . . . . 277
6.5.8 Forwarding Tags Consistency . . . . . . . . . . . 280
6.5.9 Tag Uniqueness . . . . . . . . . . . . . . . . . . . 283
6.5.10 Data Consistency Invariants . . . . . . . . . . . . 287
6.6 Liveness . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
6.7 Verifying the DLX Implementation . . . . . . . . . . . . . 303
6.7.1 Implementation Differences . . . . . . . . . . . . 303
6.7.2 Verifying the Instruction Fetch . . . . . . . . . . . 305
6.7.3 Verifying IEEEf . . . . . . . . . . . . . . . . . . 306
6.7.4 Verifying Interrupts . . . . . . . . . . . . . . . . . 308
6.8 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . 309
7 Perspective 311
7.1 Functional Units . . . . . . . . . . . . . . . . . . . . . . . 311
7.2 In-Order Scheduling and Forwarding . . . . . . . . . . . . 312
7.3 Speculation . . . . . . . . . . . . . . . . . . . . . . . . . 313
7.4 Out-of-Order Execution . . . . . . . . . . . . . . . . . . . 313
7.5 Synthesizing Hardware . . . . . . . . . . . . . . . . . . . 313
A Theorem Index 315
A.1 The PVS Proof Tree . . . . . . . . . . . . . . . . . . . . . 315
A.2 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . 316
A.3 A Sequential Implementation Machine . . . . . . . . . . . 317
A.4 Pipelined Machines . . . . . . . . . . . . . . . . . . . . . 318
A.5 Speculative Execution . . . . . . . . . . . . . . . . . . . . 319
A.6 Out-of-Order Execution . . . . . . . . . . . . . . . . . . . 321
B DLX Instruction Set 323
C Performance of the Pipelined DLX 331
D Liveness Verification using SMV 335
D.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 335
D.2 Using Induction . . . . . . . . . . . . . . . . . . . . . . . 337
Bibliography 341
Chapter 1
Introduction
1.1 Formal Verification of Microprocessors
Nowadays, microprocessors are in use in many safety-critical
environments, such as cars or planes. We therefore consider the
correctness of such components as a matter of vital importance.
Verifying the correctness of microprocessors used to be done by exten-
sive tests. However, the state space of modern microprocessors is huge and
tests never attain full coverage, especially for 64-bit designs. We therefore
think formal verification is the sole way to obtain a guarantee.
This formal verification should be done such that any third party is able
to verify the correctness with low effort, i.e., we aim to provide a proof of
correctness that can be checked mechanically. In particular, we think that
all critical designs should be delivered in form of a four-tuple: 1) the design
itself, 2) a specification, 3) a human-readable proof, and 4) a machine-
verified proof. Moreover, we think that there will be a considerable market
for such four-tuples.
Let us motivate why we distinguish human-readable proofs from
machine-readable proofs and why we demand both. This is not a common
demand. In industrial environments, low-effort, automated verification
is preferred.
However, proofs written for theorem proving systems tend to be hard to
read. This becomes worse the higher the degree of automation of the
theorem proving system is. We think that this leads to two drawbacks.
First, without a human-readable proof, one depends completely on the
soundness of the theorem proving system. In particular, one depends on the
clarity and accuracy of the specification language of the theorem proving
system.
The second drawback is that automated design verification is of no
aid in understanding the designs. In contrast, our experience is
that writing proofs, in particular the human-readable proofs, produces
generic theories and design approaches previously unknown. We therefore
claim that providing human-readable proofs will aid in automating the
actual design process, since generic theories allow for the development of
non-specialized tools with diversified use.
In this thesis, we present proofs of correctness for complex micropro-
cessors. Designing microprocessors is considered an error-prone process.
Due to the complexity of the designs, errors often remain undiscovered
even when extensive testing is done. A well-known example of this is
the Pentium FDIV bug [Coe95, Pra95].
1.2 Related Work
There are many publications on the formal verification of sequential ma-
chines, e.g., Cohn verified the VIPER processor [Coh87], Joyce verified
the Tamarack [Joy88a, Joy88b], Hunt verified the FM8501 [Hun94], and
Windley verified the AVM-1 [Win95].
In [HP96, PH94], Hennessy and Patterson describe a 32-bit RISC ar-
chitecture, the DLX, which serves as basis for many microprocessor ver-
ification projects. In [MP95], Mueller and Paul describe sequential DLX
designs at gate level, including a machine with precise interrupts.
The formal verification of a pipelined processor is reported in [BS89]:
Bickford and Srivas verify a three stage DLX-like RISC processor. In
[LO96], Levitt and Olukotun verify a five-stage DLX pipeline by trans-
forming it back into a sequential machine by removing stalling and roll-
back logic.
In [Hos00], Hosabettu verifies both in-order and out-of-order DLX
implementations that are not synthesizable. The pipelined implementation
has a trivial stalling logic. The verification is done using the completion
function approach and PVS.
Further literature on the verification of pipelined machines includes
[LO96], which covers automatic verification of pipelined microprocessors,
and [BM96], which provides a manual proof of a DLX pipeline. Burch and
Dill [BD94] verify a very simple pipeline. Henzinger et al. [HQR98] use
refinement mappings in order to model-check a RISC pipeline.
Besides PVS, other theorem proving systems are applied to hardware
verification, such as HOL [CGM86] or ACL2 [KM96]. There has been much
success in verifying complete, complex systems using theorem provers
[BS89, HGS99, SH99]. However, theorem proving systems always involve
much manual work.
Recent papers show the correctness of complex designs or schedulers
in theorem proving systems such as PVS. Hosabettu et al. [HGS99] prove
both safety and liveness of Tomasulo’s algorithm using PVS. Sawada and
Hunt [SH99] provide an ACL2 proof of a complete design implementing
a Tomasulo scheduler with reorder buffer.
Henzinger et al. [HQR98] verify a simple pipelined processor using a
model checker. McMillan [McM98] partly automates the proof by refine-
ment of Tomasulo’s algorithm presented in [DP97] with the help of com-
positional model checking. This technique is improved in [McM99b] by
theorem proving methods to support an arbitrary register size and number
of function units.
In the literature cited above, the complex designs are verified at very
high levels of abstraction. Moreover, there is little literature on the
details of actually implementing complex microprocessors. Gate-level
descriptions of microprocessors rarely go beyond simple machines, with
the exception of [Lei99] and [MP00]. In [Lei99], Holger Leister presents
out-of-order designs and evaluates the architectures regarding hardware
cost and performance. The correctness is argued using paper-and-pencil
proofs but is not machine-verified.
In [MP00], Silvia M. Mueller and Wolfgang J. Paul present gate-level
designs of pipelined DLX implementations including a machine with full
IEEE floating point arithmetic and interrupts. The correctness of the
machines is argued as follows: the authors build a sequential machine
with the structure of a pipelined machine, called the prepared
sequential machine. They transform this prepared sequential machine
into a pipelined machine by adding interlock and forwarding hardware.
This is supported by introducing the concept of a stall engine. The
stall engine encapsulates the logic required for generating clock enable
signals for the individual pipeline stages.
The correctness of the pipelined machine is argued as follows: given
the correctness of the prepared sequential machine, the authors prove the
pipeline to be correct by arguing that it simulates the prepared sequential
machine. This is done using a scheduling function. This function maps
a configuration of the physical machine to a configuration of the abstract
reference machine.
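To illustrate the idea, a scheduling function can be sketched as a table
that records, for every pipeline stage, the index of the instruction
currently held in that stage. The following Python fragment is only a
sketch under our own assumptions: the names next_schedule, sched and ue
(update enable) are hypothetical, and the update rule is one plausible
formulation rather than the definition used in [MP00].

```python
def next_schedule(sched, ue):
    """Advance a per-stage instruction-index table by one clock cycle.

    sched[k] is the index of the instruction in stage k (assumed model).
    ue[k] is True if stage k is clocked, i.e. passes its instruction on.
    """
    nxt = list(sched)
    if ue[0]:
        # Stage 0 fetches the next instruction of the program.
        nxt[0] = sched[0] + 1
    for k in range(1, len(sched)):
        if ue[k - 1]:
            # Stage k receives the instruction that stage k-1 held.
            nxt[k] = sched[k - 1]
    return nxt


if __name__ == "__main__":
    s = [0, 0, 0]
    for _ in range(2):
        s = next_schedule(s, [True, True, True])
    print(s)
```

With such a table, the claim that the pipeline simulates the sequential
machine can be phrased per stage: the registers of stage k should hold the
values the reference machine has for instruction sched[k].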
1.3 Contribution
In this thesis, we provide a rigorously formal approach to hardware verifi-
cation. The designs presented in this thesis include state of the art sched-
ulers, such as the Tomasulo scheduler [Tom67] and speculation. In con-
trast to most of the literature, the designs we provide are very close to
gate level. In particular, we are synthesizing some of the designs for the
XILINX FPGA series.
These designs are of high complexity, and so are the proofs. In contrast
to [MP95, Lei99, MP00], the proofs are machine verified using the theorem
proving system PVS [CRSS94]. However, we never present the original
PVS proof in this thesis. We aim to provide proofs that come close to
comprehensible paper-and-pencil proofs in the tradition of [KP95, MP95,
MP00]. We aim to maintain the full formal reasoning of the PVS proofs,
to the extent that the proofs are reviewable on a line-by-line basis. As
a result, several PVS proofs were rewritten to improve the readability of
the paper version of the proof.
In order to verify sequential machines, we extend the data consistency
invariant given in [MP00] by defining a “correct value” of an implementa-
tion register such as IR:2. Given the correctness of functional components
such as the ALU, this allows for an almost fully automated proof of the
data consistency of the prepared sequential machine using PVS. We ar-
gue that the correct functional components provide correct results if given
correct inputs.
We extend the stall engine concept presented in [MP00] by providing
a fully generic stall engine design. In contrast to [MP00], our stall
engine design supports an arbitrary number of stages and allows for stalling
(and therefore clocking) all stages independently. We formally verify data
consistency and liveness properties for this stall engine.
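As an illustration of what such a stall engine computes, the following
Python sketch derives the clock enable signals of an n-stage pipeline from
per-stage full bits and hazard signals. It is a minimal model under our own
assumptions: the signal names (full, hazard, stall, ue) and the equations
are one plausible formulation and are not taken from the design presented
later; stage 0 is assumed to always have an instruction available.

```python
def stall_engine_step(full, hazard):
    """Compute one clock cycle of a simplified stall engine.

    full[k]   : stage k currently holds an instruction
    hazard[k] : stage k cannot proceed this cycle (e.g. data hazard)
    Returns (ue, next_full), where ue[k] is the clock enable of stage k.
    """
    n = len(full)
    # stall[n] is a sentinel: nothing below the last stage can stall it.
    stall = [False] * (n + 1)
    for k in reversed(range(n)):
        # A stage stalls only if it holds an instruction; an empty stage
        # (a bubble) never propagates a stall upwards.
        stall[k] = full[k] and (hazard[k] or stall[k + 1])
    ue = [full[k] and not stall[k] for k in range(n)]
    # Stage k is full next cycle if it keeps its instruction (stall) or
    # receives one from stage k-1. Stage 0 is assumed always full.
    next_full = [True] + [ue[k - 1] or stall[k] for k in range(1, n)]
    return ue, next_full


if __name__ == "__main__":
    # Stage 1 is a bubble, stage 2 is stalled by a hazard.
    print(stall_engine_step([True, False, True], [False, False, True]))
```

In the example call, stage 0 is still clocked although stage 2 stalls,
because the empty stage 1 can absorb its instruction. This is the behavior
the abstract refers to as pipeline bubble removal.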
Using this extended stall engine, we can significantly improve the
process of transforming the prepared sequential machine into the pipelined
machine by providing a tool that does this transformation automatically.
This includes the generation of forwarding and interlock hardware. In
particular, the transformation of the PC environment of the DLX with
Delayed PC, i.e., removing the DPC register, turns out to be a special case of
adding forwarding.
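The role of forwarding and interlock hardware can be sketched as follows.
When a stage reads a register, the most recent instruction further down the
pipeline that writes this register supplies the value (forwarding); if that
value has not been computed yet, the reading stage must stall (interlock).
The Python fragment below is a behavioral sketch only; InFlight and
read_operand are hypothetical names, and the generated hardware is a
circuit, not a loop.

```python
from typing import NamedTuple, Optional


class InFlight(NamedTuple):
    dest: int              # destination register of an instruction in a later stage
    value: Optional[int]   # None while the result has not been computed yet


def read_operand(reg, regfile, in_flight):
    """Return (value, hazard) for a read of register reg.

    in_flight lists the instructions in later stages, youngest first, so
    the first match is the most recent writer of reg.
    """
    for instr in in_flight:
        if instr.dest == reg:
            if instr.value is None:
                return None, True        # interlock: result not available yet
            return instr.value, False    # forwarding: bypass the register file
    return regfile[reg], False           # no pending writer: plain read


if __name__ == "__main__":
    regfile = [0, 10, 20]
    pipeline = [InFlight(1, None), InFlight(1, 99), InFlight(2, 7)]
    print(read_operand(2, regfile, pipeline))
    print(read_operand(1, regfile, pipeline))
```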
We then prove the data consistency of the pipelined machine. We do so
by showing that the inputs of the pipeline stages are correct. Using this
fact, we argue the correctness of the output values as we do for the
prepared sequential machine, since the functional components of the machines
are identical.
We present a generic approach to speculative execution and propose a
data consistency criterion for such a machine. We then apply this method
in order to implement and prove DLX pipelines with branch prediction
and precise interrupts. It is a well-known fact that both techniques are im-
plemented using speculation [SP88]. However, to the best of our knowl-
edge, implementing both techniques as an instance of a generic speculation
mechanism is done for the first time.
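The common pattern behind both techniques can be sketched in a few lines:
work is started with a guessed value, the actual value becomes known later,
and a mismatch triggers a rollback. The following Python fragment is a
minimal sketch under our own assumptions; execute_speculatively and its
parameters are hypothetical names and do not describe the hardware
mechanism of chapter 5.

```python
def execute_speculatively(guessed, compute_actual, speculative_work):
    """Run speculative_work on a guess and repair the result on a mismatch.

    Returns (result, rolled_back).
    """
    result = speculative_work(guessed)
    actual = compute_actual()
    if actual != guessed:
        # Misspeculation: discard the speculative result and redo the
        # work with the actual value (the rollback).
        return speculative_work(actual), True
    return result, False


if __name__ == "__main__":
    # Branch prediction as an instance: guess the fetch address, then
    # compare it with the address the branch actually produces.
    print(execute_speculatively(4, lambda: 8, lambda pc: pc + 4))
```

For branch prediction the guessed value is the next fetch address; for
precise interrupts it is the guess that no interrupt occurs.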
Besides the in-order pipelines, we verify the correctness of the Tomasulo
scheduling algorithm with reorder buffer as described in [KMP99]. The
reorder buffer realizes in-order termination, which allows implementing
precise interrupts. The proof of correctness covers the arguments necessary
to show the uniqueness of the tags.
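Why tag uniqueness needs an argument can be seen in a small model. The
sketch below assumes that the tag of an instruction is the index of its
reorder buffer entry and that the buffer is a circular queue; both are our
assumptions for illustration, and the class and method names are
hypothetical. Tags are reused over time, so uniqueness can only hold among
the instructions currently in flight.

```python
class ReorderBuffer:
    """Circular reorder buffer that hands out entry indices as tags."""

    def __init__(self, size):
        self.size = size
        self.head = 0    # oldest entry, next to retire
        self.count = 0   # number of instructions in flight

    def issue(self):
        """Allocate the tail entry; return its tag, or None if full."""
        if self.count == self.size:
            return None  # issue has to stall
        tag = (self.head + self.count) % self.size
        self.count += 1
        return tag

    def retire(self):
        """Free the oldest entry (in-order termination); return its tag."""
        assert self.count > 0
        tag = self.head
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return tag

    def in_flight_tags(self):
        return [(self.head + i) % self.size for i in range(self.count)]


if __name__ == "__main__":
    rob = ReorderBuffer(4)
    for _ in range(4):
        rob.issue()
    rob.retire()
    rob.issue()  # tag 0 is reused only after its old owner has retired
    print(rob.in_flight_tags())
```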
Furthermore, we rigorously prove the liveness of all machines we de-
sign, i.e., we prove that any given instruction sequence is executed within
a finite amount of time. Although critical, liveness issues are often not
covered in the open literature.
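Informally, this liveness statement has the following shape, where the
predicate done(i, T), meaning that instruction i has terminated by the end
of cycle T, is notation we introduce here for illustration only:

```latex
\forall i \in \mathbb{N} \;\; \exists T \in \mathbb{N} : \quad \mathit{done}(i, T)
```

The German summary states the same property as an upper bound on the
execution time of programs.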
1.4 Organization
Chapter 2 describes basic concepts. We introduce the mathematical hard-
ware model, and describe the implementation and verification of basic cir-
cuits, such as adders. We use these basic circuits in order to implement and
verify an ALU. We then provide a formal specification of a DLX RISC mi-
croprocessor without interrupts and floating point instructions.
In chapter 3, we describe how we model the hardware of a micropro-
cessor. We describe the extended stall engine for the prepared sequential
machine. We introduce the functions used in order to model the registers,
the circuits between the registers and the forwarding logic. We use this
formalism in order to implement and verify a prepared sequential DLX.
We also show the liveness of the prepared sequential machine.
In chapter 4, we describe how the stall engine is modified in order to
get a pipelined machine. We describe how to add the forwarding and in-
terlock hardware and prove the correctness of the pipelined machine. This
comprises both data consistency and liveness.
In chapter 5, we describe a generic approach to speculative execution.
We prove its data consistency and liveness. We implement two machines as
examples: the first machine guesses whether branches are taken or not. The
second machine guesses whether we have an interrupt or not. We prove
that this realizes precise interrupts according to the specification given in
[MP00].
In chapter 6, we describe the results of verifying an out-of-order DLX
with Tomasulo scheduler as presented in [Krö99].
Chapter 2
Basic Concepts
2.1 Specifying Machines
2.1.1 Mathematical Machines
The subject of this thesis is to present a provably correct microprocessor.
A microprocessor is said to be correct if it interprets a given
instruction set architecture (ISA). The instruction set architecture is usually
given as an informal list of registers and instructions, and a specification
of the impact of these instructions on the values of the registers. The im-
plementation of this ISA, the microprocessor, is a piece of hardware.
In order to make a formal proof of the correctness of such a processor,
it is necessary to formalize the specification, the implementation, and the
correctness criterion.
Mathematical machines are a common method to model the behavior of
arbitrary microprocessor systems. There are different definitions of math-
ematical machines. In this thesis, the mathematical machine is used to
specify both the microprocessor hardware and the instruction set architec-
ture. The correctness criterion and its proof then rely on arguments on
these two mathematical machines.
The model used in this thesis is similar to the synchronous transition
states (STS) model used in [KP96, DP97]. In contrast to [DP97], the
mathematical machines here are fully deterministic to allow direct hardware
synthesis from the mathematical machine. A very similar approach is also
used in [Cyr93].
Definition 2.1 (Mathematical Machine) A mathematical machine, as used in
this thesis, is a triple $M = (C, c^0, \delta)$ that consists of the
following components:
• $C$ is the set of all possible configurations of $M$. An element $c$ of
  $C$ is called configuration or state of the machine.
• The initial configuration $c^0$ is a configuration of $M$.
• The transition function $\delta : C \to C$ maps one configuration $c^T$
  to its successor $c^{T+1}$.
The sequence $c^0, c^1, \ldots$ of configurations is called computation of
$M$. The configuration $c^T$ is called configuration in cycle $T$. The
configurations of $M$ in cycles $T \ge 1$ are defined recursively as follows:
$$c^T = \delta(c^{T-1})$$
In the literature, the transition function is often called next state function
[Cyr93].
2.1.2 Notation
Registers Both the specification and the implementation of a micropro-
cessor use registers. A register is a place where a value can be stored and
re-read in later cycles. In terms of mathematical machines, a value of a
register is part of the configuration c.
Let $R = \{R_1, \ldots, R_n\}$ be a finite set of registers. Each register
$R_i$ can have a value within a finite domain $W(R_i)$, i.e., $R_i \in W(R_i)$.
In order to allow an easy identification of the value of a register in the
configuration of a mathematical machine, all valid configurations in $C$ are
expected to be a tuple of the values of all registers:
$$C = W(R_1) \times W(R_2) \times \ldots \times W(R_n)$$
The value of a given register $R_i$ can be extracted from a configuration $c$
with a projection function $\varphi_{R_i}$. Let $c$ be $(a_1, a_2, \ldots, a_n)$.
$$\varphi_{R_i} : C \to W(R_i), \qquad \varphi_{R_i}(c) = a_i$$
Let $c = c^T$ be part of a computation of a mathematical machine. In this
case, let $R^T$ be a shorthand for $\varphi_R(c^T)$.
Let $c.R$ be a shorthand for the value of the projection $\varphi_R$ applied to $c$:
$$c.R := \varphi_R(c)$$
In analogy to that, let $\delta.R$ be a shorthand for the restriction of a state
transition function to a register value:
$$\delta.R : C \to W(R), \qquad \delta.R = \varphi_R \circ \delta$$
Signals
Definition 2.2 (Signal) A signal $s$ is defined as a mapping from the set of
configurations into an arbitrary domain $W(s)$:
$$s : C \to W(s)$$
Signals are therefore a shorthand for a calculation on a given configura-
tion.
2.1.3 Bits and Bit Vectors
In order to model gates and wiring between gates in a formal way, the
theorem proving system PVS [CRSS94] provides a bit vector library. Bits
are defined as a boolean value and bit strings are defined as a vector of
boolean values.
Definition 2.3 (Vector) An $n$-dimensional vector on a domain $D$ is a mapping
from $\{i \in \mathbb{N}_0 \mid i < n\}$ into $D$.
Let $a_n$ denote the component $n$ of the vector $a$:
$$a_n := a(n)$$
Definition 2.4 (Bits and Bit Vectors) A bit is a value in the domain
$\mathbb{B} = \{0, 1\}$. The value 0 is called FALSE and the value 1 is called
TRUE. An $n$-bit bit vector is an $n$-dimensional vector on $\mathbb{B}$. The
number $n$ is called length of the bit vector. If $a$ is an $n$-bit bit vector,
this is denoted by:
$$a \in bvec[n]$$
There is a projection function to get a subpart of an $n$-bit bit vector. Let
$x < n$ and $y \le x$. The function $a[x:y]$ takes a bit vector $a$ and returns
the subvector from $a_x$ downto $a_y$:
$$[x:y] : bvec[n] \to bvec[x-y+1]$$
$$a[x:y](i) := a(i+y) \qquad \forall\, 0 \le i \le (x-y)$$
Dots Notation Let $\circ$ be a binary operator on a set $T$:
$$\circ : T \times T \to T$$
Let $n$, $a$, $b$ be nonnegative integers with $b \ge a$. Let $X$ be an
$n$-dimensional vector on $T$. The following definition is used for the common
“dots notation”:
$$X_a \circ X_{a+1} \circ \ldots \circ X_b := r_{\circ,a,b}(b, X)$$
The function $r_{\circ,a,b}$ is defined recursively as follows: Let $v[n]$
denote the set of $n$-dimensional vectors on $T$.
$$r_{\circ,a,b} : \{a, \ldots, b\} \times v[b-a+1] \to T$$
$$r_{\circ,a,b}(i, X) := \begin{cases} X_a & : i = a \\ r_{\circ,a,b}(i-1, X) \circ X_i & : \text{otherwise} \end{cases}$$
In case $a$ is omitted, zero is assumed:
$$r_{\circ,b} : \{0, \ldots, b\} \times v[b+1] \to T$$
$$r_{\circ,b}(i, X) := \begin{cases} X_0 & : i = 0 \\ r_{\circ,b}(i-1, X) \circ X_i & : \text{otherwise} \end{cases}$$

        T = 0   T = 1   T = 2   T = 3   T = 4
A^T       0       1       0       1       1
B^T       0       0       1       1       1
Table 2.1 The computation of the example machine
2.1.4 Gates
Using the definition of bits above, the basic gates such as AND and OR are
defined in an obvious way: a gate like AND with two inputs and one output
is a mapping on two bits:
$$AND : \mathbb{B} \times \mathbb{B} \to \mathbb{B}$$
As an example, consider the following mathematical machine (a two bit
saturating counter): It has two one bit registers $R = \{A, B\}$ with
$W(A) = W(B) = \mathbb{B}$. The configuration set $C$ therefore is
$\mathbb{B}^2$. Let the transition function $\delta$ be defined as follows:
$$\delta.A(c) = \overline{c.A} \vee c.B$$
$$\delta.B(c) = c.A \vee c.B$$
Let the initial configuration $c^0$ be $(0, 0)$. This mathematical machine
models hardware: in order to illustrate the hardware modeled by mathe-
matical machines, the symbols from figure 2.1 are used.
The transition function δ models two OR-gates and one inverter. The
configuration set models two one-bit registers. In hardware, registers usu-
ally do not have defined initial values. In order to get the initial configu-
ration c0, an external signal reset is assumed. This signal is active during
cycle $-1$. Using multiplexers, this allows calculating the initial
configuration.

Figure 2.1 Symbols of the basic gates (tristate driver, AND, OR, XOR,
multiplexer, inverter, NAND, NOR, XNOR, flip-flop)

Figure 2.2 A two bit saturating counter

The hardware modeled by the mathematical machine described above is
illustrated by figure 2.2. Table 2.1 lists the values of the registers A and B
in the configurations c0 to c4.
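
As an illustration, the example machine can be simulated directly from its definition as a triple. The following is a minimal sketch in Python, not part of the PVS sources. The helper names `delta` and `computation` are chosen here, and the position of the inverter (on the A input of the first OR gate) is an assumption inferred from Table 2.1.

```python
from itertools import islice

def delta(c):
    """Transition function of the example machine; c is the tuple (A, B)."""
    a, b = c
    next_a = (1 - a) | b  # OR gate fed by the inverted A and by B (assumed)
    next_b = a | b        # OR gate fed by A and B
    return (next_a, next_b)

def computation(c0, delta):
    """Yield the configurations c^0, c^1, ... of the machine (C, c0, delta)."""
    c = c0
    while True:
        yield c
        c = delta(c)

# The configurations in cycles T = 0 to T = 4, as (A, B) pairs.
trace = list(islice(computation((0, 0), delta), 5))
print(trace)
```

Under these assumptions the printed pairs agree column by column with Table 2.1.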
2.1.5 Interpretations of Bit Vectors
The interpretation of a bit vector $a$ as a binary number is a mapping from
the $n$-bit bit vectors into $\{0, \ldots, 2^n - 1\}$. The mapping is denoted
by $\langle a \rangle_n$. If the length of the bit vector argument is obvious
in the context, just $\langle a \rangle$ is used.
$$\langle \cdot \rangle_n : bvec[n] \to \{0, \ldots, 2^n - 1\}$$
$$\langle a \rangle_n := \sum_{i=0}^{n-1} a_i \cdot 2^i$$
The PVS bit vector library provides the function bv2nat[n] for this
purpose. The value of this function is defined by a recursive function that
takes an $n$-bit bit vector and an index $i$: the function sums up the first
$i$ addends of the sum above:
$$\langle \cdot \rangle_n^i : \{0, \ldots, n\} \times bvec[n] \to \{0, \ldots, 2^n - 1\}$$
$$\langle a \rangle_n^i = \sum_{j=0}^{i-1} a_j \cdot 2^j$$
In PVS, this is defined using a recursion:
$$\langle a \rangle_n^i := \begin{cases} 0 & : i = 0 \\ 2^{i-1} \cdot a_{i-1} + \langle a \rangle_n^{i-1} & : \text{otherwise} \end{cases}$$
It is easy to prove that both definitions are equivalent and that
$\langle a \rangle_n^n = \langle a \rangle_n$ holds.
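
The two definitions can be compared on a small example. The sketch below is plain Python rather than PVS. It assumes a bit vector is represented as a list whose element `a[i]` is the bit $a_i$, and the name `bv2nat_rec` is introduced here for the recursive variant.

```python
def bv2nat(a):
    """Closed form: the sum of a[i] * 2**i, with a[0] the least significant bit."""
    return sum(bit << i for i, bit in enumerate(a))

def bv2nat_rec(a, i):
    """Recursive form: the value of the first i addends of the sum."""
    if i == 0:
        return 0
    return (a[i - 1] << (i - 1)) + bv2nat_rec(a, i - 1)

a = [1, 0, 1, 1]  # a_0 = 1, a_1 = 0, a_2 = 1, a_3 = 1
print(bv2nat(a), bv2nat_rec(a, len(a)))
```

Both calls return the same number when the recursion is started at the full length of the vector, which mirrors the equivalence claim above.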
The interpretation of a bit vector $a$ as a two’s complement number is a
mapping from the $n$-bit bit vectors into $\{-2^{n-1}, \ldots, 2^{n-1} - 1\}$:
$$[\,\cdot\,]_n : bvec[n] \to \{-2^{n-1}, \ldots, 2^{n-1} - 1\}$$
$$[a]_n := -a_{n-1} \cdot 2^{n-1} + \langle a[n-2:0] \rangle_{n-1}$$
The bit $a_{n-1}$ is called sign bit.
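This allows defining several operations on bit vectors such as addition
and subtraction:
$$+, - : bvec[n] \times bvec[n] \to bvec[n]$$
$$a + b := c \text{ such that } \langle c \rangle_n = \langle a \rangle + \langle b \rangle \bmod 2^n$$
$$a - b := c \text{ such that } \langle c \rangle_n = \langle a \rangle - \langle b \rangle \bmod 2^n$$
A similar definition is used for operations on a bit vector and an integer:
$$+, - : bvec[n] \times \mathbb{Z} \to bvec[n]$$
$$a + b := c \text{ such that } \langle c \rangle_n = \langle a \rangle + b \bmod 2^n$$
$$a - b := c \text{ such that } \langle c \rangle_n = \langle a \rangle - b \bmod 2^n$$
A unary minus on bit vectors is defined as follows:
$$- : bvec[n] \to bvec[n]$$
$$-a := c \text{ such that } \langle c \rangle_n = -\langle a \rangle \bmod 2^n$$
The function $zero\_extend_k$ extends a given $n$-bit bit vector to $k \ge n$
bits by adding zeros:
$$zero\_extend_k : bvec[n] \to bvec[k]$$
$$zero\_extend_k(a)_i = \begin{cases} a_i & : i < n \\ 0 & : \text{otherwise} \end{cases}$$
The function $sign\_extend_k$ extends a given $n$-bit bit vector to $k \ge n$
bits by adding the sign bit:
$$sign\_extend_k : bvec[n] \to bvec[k]$$
$$sign\_extend_k(a)_i = \begin{cases} a_i & : i < n \\ a_{n-1} & : \text{otherwise} \end{cases}$$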
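
The following sketch checks the intended properties of these definitions on all 4-bit vectors: zero extension preserves the binary value, sign extension preserves the two's complement value, and bit vector addition agrees with integer addition modulo $2^n$. It is an illustration in Python under the assumption that bit vectors are lists with the least significant bit first. The helper `nat2bv` is hypothetical and only used to build test vectors.

```python
def bv2nat(a):
    """Binary interpretation <a>."""
    return sum(bit << i for i, bit in enumerate(a))

def nat2bv(x, n):
    """Hypothetical helper: the n-bit vector c with <c> = x mod 2**n."""
    return [(x >> i) & 1 for i in range(n)]

def bv2int(a):
    """Two's complement interpretation [a]."""
    n = len(a)
    return -a[n - 1] * 2 ** (n - 1) + bv2nat(a[: n - 1])

def add(a, b):
    """a + b := c such that <c> = <a> + <b> mod 2**n."""
    n = len(a)
    return nat2bv((bv2nat(a) + bv2nat(b)) % 2 ** n, n)

def zero_extend(a, k):
    return a + [0] * (k - len(a))

def sign_extend(a, k):
    return a + [a[-1]] * (k - len(a))

vectors = [nat2bv(v, 4) for v in range(16)]
ok = all(
    bv2nat(zero_extend(a, 8)) == bv2nat(a)
    and bv2int(sign_extend(a, 8)) == bv2int(a)
    for a in vectors
)
print(ok)
```

The same modular addition serves both interpretations, which is why a single adder circuit suffices for binary and two's complement operands.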
2.2 Basic Circuits
2.2.1 Binary Trees
Definition 2.5 (Binary Tree Circuit) Let $n$ be a power of two, i.e.,
$n = 2^k$, $k \in \mathbb{N}_0$. Let $T$ denote a set and let
$\circ : T \times T \to T$ be a dyadic function that is associative. Let
$v[n]$ denote the set of $n$-dimensional vectors on $T$.
The binary tree is implemented as follows:
$$btree_{\circ,k} : v[2^k] \to T$$
$$btree_{\circ,k}(X) = \begin{cases} X_0 & : k = 0 \\ btree_{\circ,k-1}(X(0), \ldots, X(2^{k-1}-1)) \circ btree_{\circ,k-1}(X(2^{k-1}), \ldots, X(2^k-1)) & : \text{otherwise} \end{cases}$$
Lemma 2.1 The binary tree circuit $btree_{\circ,k} : v[n] \to T$ calculates the
following function:
$$btree_{\circ,k}(X) = X_0 \circ X_1 \circ \ldots \circ X_{n-1}$$
PROOF This is shown by induction on $k$. For $k = 0$, the claim is obviously
true. For $k+1$, the claim is:
$$btree_{\circ,k+1}(X) = X(0) \circ \ldots \circ X(2^{k+1}-1)$$
By definition of btree, this is equivalent to:
$$btree_{\circ,k}(X(0), \ldots, X(2^k-1)) \circ btree_{\circ,k}(X(2^k), \ldots, X(2^{k+1}-1)) = X(0) \circ \ldots \circ X(2^{k+1}-1)$$
By the induction premise for both btree instances, this is equivalent to:
$$(X(0) \circ \ldots \circ X(2^k-1)) \circ (X(2^k) \circ \ldots \circ X(2^{k+1}-1)) = X(0) \circ \ldots \circ X(2^{k+1}-1)$$
This is shown by induction using that $\circ$ is associative.
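
Lemma 2.1 can be illustrated by comparing the tree with a left-to-right fold. The sketch below is Python, not the PVS definition. It uses string concatenation as the operator because it is associative but not commutative, so the order of the operands is checked as well.

```python
from functools import reduce
import operator

def btree(op, xs):
    """Balanced binary tree of op gates; len(xs) must be a power of two."""
    if len(xs) == 1:
        return xs[0]
    half = len(xs) // 2
    return op(btree(op, xs[:half]), btree(op, xs[half:]))

xs = ["a", "b", "c", "d", "e", "f", "g", "h"]
tree = btree(operator.add, xs)    # depth 3 for 8 inputs
chain = reduce(operator.add, xs)  # X_0 o X_1 o ... o X_7, evaluated left to right
print(tree == chain)
```

The tree has logarithmic depth while the chain has linear depth, which is the reason for preferring it in hardware.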
2.2.2 Zero Tester
Let $n$ be a power of two. The zero tester is implemented as follows:
$$zerotester : bvec[n] \to \mathbb{B}$$
$$zerotester(a) = \overline{btree_{OR}(a)}$$
Lemma 2.2 The zero tester calculates the following function:
$$zerotester(a) = (\forall i : \overline{a_i})$$
This is shown by induction on $n$ using lemma 2.1.
2.2.3 Equality Tester
Using the zero tester, an equality tester is constructed as follows:
$$equalitytester : bvec[n] \times bvec[n] \to \mathbb{B}$$
$$equalitytester(a, b) = zerotester(a \oplus b)$$
Lemma 2.3 The equality tester is correct:
$$equalitytester(a, b) = (a = b)$$
The correctness is shown easily with lemma 2.2.
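
A brute-force check of lemma 2.3 for 4-bit operands can be sketched as follows. This is illustrative Python, not the PVS proof. It assumes that the zero tester inverts the output of the OR tree, so that it returns 1 exactly when all input bits are 0.

```python
import operator
from itertools import product

def btree(op, xs):
    """Balanced binary tree of op gates; len(xs) must be a power of two."""
    if len(xs) == 1:
        return xs[0]
    half = len(xs) // 2
    return op(btree(op, xs[:half]), btree(op, xs[half:]))

def zerotester(a):
    """1 iff every bit of a is 0 (inverted OR tree, an assumption)."""
    return 1 - btree(operator.or_, a)

def equalitytester(a, b):
    """Bitwise XOR of the operands followed by the zero tester."""
    return zerotester([x ^ y for x, y in zip(a, b)])

vectors = list(product((0, 1), repeat=4))
correct = all(equalitytester(a, b) == int(a == b)
              for a in vectors for b in vectors)
print(correct)
```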
2.2.4 Parallel Prefix
Definition 2.6 (Parallel Prefix) Let $T$ denote a set and let $v[n]$ denote
the set of $n$-dimensional vectors on $T$. Let $\circ : T \times T \to T$ be an
associative dyadic function. The $n$-fold generic parallel prefix circuit
$PP_{\circ,n} : v[n] \to v[n]$ calculates the following function:
$$PP_{\circ,n}(X)_i = X_0 \circ X_1 \circ \ldots \circ X_i \qquad i \in \{0, \ldots, n-1\}$$
Figure 2.3 The recursive specification of an n-fold parallel prefix circuit
The parallel prefix circuit is implemented by means of a recursive
definition (figure 2.3). Let $n$ be a power of two, i.e., $n = 2^K$ with
$K \in \mathbb{N}$, and let $X \in v[2^K]$ be the inputs of the circuit.
The function $pp^{X'}_{\circ}$ calculates the inputs $X'_0$ to $X'_{n/2-1}$ for
the next recursion step. The recursion depth is given by the first parameter $K$:
$$pp^{X'}_{\circ} : \mathbb{N} \times v[2^K] \to v[2^{K-1}]$$
$$pp^{X'}_{\circ}(K, X)_i := X(2 \cdot i) \circ X(2 \cdot i + 1)$$
Given those inputs, the function $pp^{Y}_{\circ}$ calculates the outputs $Y_0$
to $Y_{n-1}$. As above, the recursion depth is given by the first parameter $K$:
$$pp^{Y}_{\circ}(K, X)_i = \begin{cases} X_0 & : i = 0 \\ pp^{Y}_{\circ}(K-1, pp^{X'}_{\circ}(K, X))_{\frac{i-1}{2}} & : \text{odd } i \\ pp^{Y}_{\circ}(K-1, pp^{X'}_{\circ}(K, X))_{\frac{i}{2}-1} \circ X_i & : \text{even } i \end{cases}$$
The outputs of the parallel prefix circuit are the values $Y_0$ to $Y_{n-1}$:
$$pp_{\circ}(X)_i := pp^{Y}_{\circ}(K, X)_i$$
Theorem 2.4 The parallel prefix circuit is correct:
$$pp_{\circ}(X)_i = X_0 \circ X_1 \circ \ldots \circ X_i$$
In order to prove theorem 2.4, the definition pp1 is used. The first
parameter defines the number of inputs, the second parameter is the index of
the output, the third parameter is the input vector.
$$pp1 : \mathbb{N} \times \{0, \ldots, 2^K - 1\} \times v[2^K] \to T$$
$$pp1(K, i, X) := \begin{cases} X_0 & : i = 0 \\ pp1(K, i-1, X) \circ X_i & : \text{otherwise} \end{cases}$$
Lemma 2.5 This definition is equivalent to $PP_{\circ,n}$, which is an easy
proof by induction:
$$pp1(K, i, X) = PP_{\circ,n}(X)_i$$
Lemma 2.6 If $i$ is odd, applying pp1 to $X'_0$ to $X'_{(i-1)/2}$ is equivalent
to applying pp1 to $X_0$ to $X_i$:
$$pp1(K-1, (i-1)/2, pp^{X'}(K, X)) = pp1(K, i, X)$$
If $i$ is even and not zero, appending $X_i$ to the sequence above on the left
hand side produces the desired result:
$$pp1(K-1, i/2-1, pp^{X'}(K, X)) \circ X_i = pp1(K, i, X)$$
PROOF This is shown by induction on $i$. For $i = 0$, the claim is obvious.
For odd $i+1$, the claim is:
$$pp1(K-1, i/2, pp^{X'}(K, X)) = pp1(K, i+1, X)$$
By definition of pp1, this is equal to:
$$pp1(K-1, i/2-1, pp^{X'}(K, X)) \circ pp^{X'}(K, X)(i/2) = pp1(K, i+1, X)$$
Unfolding $pp^{X'}_{\circ}$, this results in:
$$pp1(K-1, i/2-1, pp^{X'}(K, X)) \circ (X_i \circ X_{i+1}) = pp1(K, i+1, X)$$
Since $\circ$ is associative, this is equal to:
$$(pp1(K-1, i/2-1, pp^{X'}(K, X)) \circ X_i) \circ X_{i+1} = pp1(K, i+1, X)$$
This is shown by unfolding the definition of pp1 on the right-hand side
and by the induction hypothesis for even $i$.
For even $i+1$, the claim is shown by the definition of pp1 and the
induction premise for odd $i$.
Lemma 2.7 The parallel prefix circuit computes pp1.
$$\forall\, 0 \le k \le K,\; X \in v[2^k],\; 0 \le i < 2^k : pp^{Y}(k, i, X) = pp1(k, i, X)$$
PROOF This is shown by induction on $k$. For $k = 0$, the claim is obvious.
For $k+1$, and after definition unfolding, the claim is:
$$pp^{Y}(k+1, i, X) \overset{!}{=} pp1(k+1, i, X)$$
For $i = 0$, the claim is shown by definition unfolding. If $i$ is odd, the
claim is:
$$pp^{Y}(k, (i-1)/2, pp^{X'}(k+1, X)) \overset{!}{=} pp1(k+1, i, X)$$
This is shown using the induction hypothesis and lemma 2.6.
If $i$ is even, the claim is:
$$pp^{Y}(k, i/2-1, pp^{X'}(k+1, X)) \circ X_i \overset{!}{=} pp1(k+1, i, X)$$
This is shown using the induction premise and lemma 2.6.
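
The recursive construction can be exercised against the sequential definition pp1. The sketch below is Python and only illustrates the recursion scheme. The function name `pp` and the list representation are chosen here, and `itertools.accumulate` plays the role of pp1.

```python
from itertools import accumulate
import operator

def pp(op, xs):
    """Recursive parallel prefix; len(xs) must be a power of two."""
    n = len(xs)
    if n == 1:
        return list(xs)
    # X'_i = X_{2i} o X_{2i+1}: combine neighbouring pairs.
    half = [op(xs[2 * i], xs[2 * i + 1]) for i in range(n // 2)]
    ys_half = pp(op, half)  # outputs Y' of the circuit of half the size
    ys = []
    for i in range(n):
        if i == 0:
            ys.append(xs[0])
        elif i % 2 == 1:
            ys.append(ys_half[(i - 1) // 2])              # odd i
        else:
            ys.append(op(ys_half[i // 2 - 1], xs[i]))     # even i > 0
    return ys

xs = list("abcdefgh")
parallel = pp(operator.add, xs)
sequential = list(accumulate(xs, operator.add))  # X_0 o ... o X_i for each i
print(parallel == sequential)
```

String concatenation is used as the operator because it is associative but not commutative, so a wrong operand order would be visible in the result.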
2.2.5 Adders
The definitions used in this section are taken from the PVS bit vector li-
brary. In order to define adders, the two functions cout and sum are used.
Using both functions, one gets a fulladder.
The functions take three input bits a, b, and cin. The function cout
calculates the carry-out bit of the adder, the function sum calculates the
sum bit.
$$cout, sum : \mathbb{B} \times \mathbb{B} \times \mathbb{B} \to \mathbb{B}$$
The functions are defined using XOR, AND, and OR gates as follows:
$$cout(a, b, cin) := (a \wedge b) \vee ((a \oplus b) \wedge cin)$$
$$sum(a, b, cin) := a \oplus b \oplus cin$$
Definition 2.7 (Carry Bits) Let $x$ and $y$ denote two $n$-bit bit vectors and
$cin$ a single bit. The carry bits $c(0)$ to $c(n-1)$ are defined as follows:
$$c(i) := \begin{cases} cout(x_0, y_0, cin) & : i = 0 \\ cout(x_i, y_i, c(i-1)) & : \text{otherwise} \end{cases}$$
Definition 2.8 (Adder) An $n$-bit adder implements the following function add
on two $n$-bit bit vectors $x$, $y$. The function is defined using the addition
on bit vectors as defined in section 2.1.5.
$$add : bvec[n] \times bvec[n] \to bvec[n]$$
$$add(x, y) = x + y$$
Let $c(i)$ denote the $i$-th carry bit as in definition 2.7. An $n$-bit adder
with carry-in and carry-out implements the following function addc on two
$n$-bit bit vectors $x$, $y$ and the carry-in bit $cin$:
$$addc : bvec[n] \times bvec[n] \times \mathbb{B} \to bvec[n] \times \mathbb{B}$$
$$addc(x, y, cin) := (result, cout)$$
$$\text{with } result := (x + y + \langle cin \rangle), \qquad cout := c(n-1)$$
The carry chain adder is implemented as follows:
$$cc : bvec[n] \times bvec[n] \times \mathbb{B} \to bvec[n] \times \mathbb{B}$$
$$cc(x, y, cin) := (result, cout)$$
$$i \in \{0, \ldots, n-1\} : result(i) := \begin{cases} sum(x_0, y_0, cin) & : i = 0 \\ sum(x_i, y_i, c(i-1)) & : \text{otherwise} \end{cases}$$
$$cout := c(n-1)$$
Lemma 2.8 The carry chain adder is correct according to definition 2.8.
The proof for this lemma is already in the PVS bit vector library.
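
For small widths, the statement of lemma 2.8 can also be checked exhaustively. The following Python sketch mirrors the definitions of cout, sum, and the carry chain. It is an illustration, not the PVS library code, and it assumes bit vectors are sequences with the least significant bit first. The function `sum_bit` corresponds to sum and is renamed to avoid shadowing the Python built-in.

```python
from itertools import product

def cout(a, b, cin):
    return (a & b) | ((a ^ b) & cin)

def sum_bit(a, b, cin):
    return a ^ b ^ cin

def cc(x, y, cin):
    """Carry chain adder: returns (result bits, carry-out)."""
    result, carry = [], cin
    for xi, yi in zip(x, y):
        result.append(sum_bit(xi, yi, carry))
        carry = cout(xi, yi, carry)  # this is c(i) from definition 2.7
    return result, carry

def bv2nat(a):
    return sum(bit << i for i, bit in enumerate(a))

n = 4
vectors = list(product((0, 1), repeat=n))
correct = all(
    bv2nat(cc(x, y, cin)[0]) + (cc(x, y, cin)[1] << n)
    == bv2nat(x) + bv2nat(y) + cin
    for x in vectors for y in vectors for cin in (0, 1)
)
print(correct)
```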
2.2.6 Verification of a Carry Lookahead Adder
The carry lookahead adder provides both low hardware cost and low depth
[KP95].
Let $c(0)$ to $c(n-1)$ denote the carry bits as defined in definition 2.7
for the addition of two $n$-bit bit vectors $a$ and $b$ and the carry-in bit
$cin$. The idea is to use a parallel prefix calculation (definition 2.6) in
order to calculate the carry bits $c(i)$. Using these bits, the carry
lookahead adder is realized as follows:
$$cla(a, b, cin) = (result, cout)$$
$$\text{with } result(i) = a(i) \oplus b(i) \oplus \begin{cases} cin & : i = 0 \\ c(i-1) & : \text{otherwise} \end{cases}$$
$$\text{and } cout = c(n-1)$$
The inputs $(g_i, p_i)$ and the associative function $\circ$ used for the
parallel prefix circuit are taken from [MP00]:
$$p_i := a(i) \oplus b(i)$$
$$g_i := \begin{cases} ((a(i) \oplus b(i)) \wedge cin) \vee (a(0) \wedge b(0)) & : i = 0 \\ a(i) \wedge b(i) & : \text{otherwise} \end{cases}$$
$$(g_1, p_1) \circ (g_2, p_2) := (g_2 \vee g_1 \wedge p_2,\; p_1 \wedge p_2)$$
The proof that $\circ$ is associative is trivial in PVS.
Let $G(i)$ and $P(i)$ denote the outputs of the parallel prefix circuit, i.e.,
according to theorem 2.4 (correctness of the parallel prefix circuit) this is:
$$G(i) = ((g_0, p_0) \circ \ldots \circ (g_i, p_i)).g$$
$$P(i) = ((g_0, p_0) \circ \ldots \circ (g_i, p_i)).p$$
We will now show that we get the carry bits by calculating $G(i)$ as above.
Lemma 2.9 The carry bits $c$ are $G$.
$$c = G$$
The proof for this claim is already given in [MP00]. We verify it using
PVS.
PROOF The proof proceeds by induction on $i$. For $i = 0$, the claim follows
by definition unfolding.
For $i+1$, the claim after applying theorem 2.4 (correctness of the parallel
prefix circuit) is:
$$c(i+1) = ((g_0, p_0) \circ \ldots \circ (g_{i+1}, p_{i+1})).g$$
By definition of $\circ$, this is equivalent to:
$$c(i+1) = g_{i+1} \vee ((g_0, p_0) \circ \ldots \circ (g_i, p_i)).g \wedge p_{i+1}$$
By the induction hypothesis, this is equivalent to:
$$c(i+1) = g_{i+1} \vee c(i) \wedge p_{i+1}$$
By definition of the carry bits, this is equivalent to:
$$a(i+1) \wedge b(i+1) \vee ((a(i+1) \oplus b(i+1)) \wedge c(i)) = g_{i+1} \vee c(i) \wedge p_{i+1}$$
This is shown by definition of $g_{i+1}$ and $p_{i+1}$. QED
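
Lemma 2.9 and the associativity of the operator can be checked by enumeration for a small width. The sketch below is Python and makes one simplification that should be kept in mind: it evaluates the prefixes sequentially with `itertools.accumulate` instead of using the recursive parallel prefix circuit, which by theorem 2.4 yields the same values. The function names are chosen here for illustration.

```python
from itertools import accumulate, product

def cout(a, b, cin):
    return (a & b) | ((a ^ b) & cin)

def carries(a, b, cin):
    """Carry bits c(0) to c(n-1) as in definition 2.7."""
    cs, carry = [], cin
    for ai, bi in zip(a, b):
        carry = cout(ai, bi, carry)
        cs.append(carry)
    return cs

def combine(left, right):
    """(g1, p1) o (g2, p2) := (g2 or (g1 and p2), p1 and p2)."""
    g1, p1 = left
    g2, p2 = right
    return (g2 | (g1 & p2), p1 & p2)

def lookahead_carries(a, b, cin):
    p = [ai ^ bi for ai, bi in zip(a, b)]
    g = [ai & bi for ai, bi in zip(a, b)]
    g[0] = ((a[0] ^ b[0]) & cin) | (a[0] & b[0])  # the carry-in enters via g_0
    return [G for G, _ in accumulate(zip(g, p), combine)]

vectors = list(product((0, 1), repeat=4))
lemma_2_9 = all(lookahead_carries(a, b, cin) == carries(a, b, cin)
                for a in vectors for b in vectors for cin in (0, 1))
pairs = list(product((0, 1), repeat=2))
associative = all(combine(combine(x, y), z) == combine(x, combine(y, z))
                  for x in pairs for y in pairs for z in pairs)
print(lemma_2_9, associative)
```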
2.3 Verification of an ALU
2.3.1 Specification
An ALU (arithmetic logical unit) performs operations such as addition,
subtraction, comparisons, and bitwise operations such as AND, OR, and
XOR.
The ALU takes two 32-bit bit vector operands a and b and additional
five bits f . These bits f control the operation performed by the ALU. The
ALU returns the result bit vector and an additional bit ovf that is set iff an
overflow occurred during an addition or subtraction.
f[4] f[3] f[2] f[1] f[0]   Function
 0    *    *    0    *     $a \ll b[4:0]$
 0    *    *    1    0     $a \gg b[4:0]$
 0    *    *    1    1     $a \gg_a b[4:0]$
 1    0    0    0    0     $a + b$ with overflow test
 1    0    0    0    1     $a + b$ without overflow test
 1    0    0    1    0     $a - b$ with overflow test
 1    0    0    1    1     $a - b$ without overflow test
 1    0    1    0    0     $a \wedge b$
 1    0    1    0    1     $a \vee b$
 1    0    1    1    0     $a \oplus b$
 1    0    1    1    1     $b[15:0]\,0^{16}$
 1    1    0    0    0     return zero
 1    1    0    0    1     $[a] > [b]$ ? 1 : 0
 1    1    0    1    0     $a = b$ ? 1 : 0
 1    1    0    1    1     $[a] \ge [b]$ ? 1 : 0
 1    1    1    0    0     $[a] < [b]$ ? 1 : 0
 1    1    1    0    1     $a \ne b$ ? 1 : 0
 1    1    1    1    0     $[a] \le [b]$ ? 1 : 0
 1    1    1    1    1     return one
Table 2.2 ALU functions
Figure 2.4 The ALU implementation
Table 2.2 lists the operations performed by the ALU. It is taken from
[MP95] with small modifications. The notation $a \ll b$ is used to denote
a left shift of $a$ with shift distance $b$, $a \gg b$ denotes a logic right
shift of $a$ with shift distance $b$, $a \gg_a b$ denotes an arithmetic right
shift of $a$ with shift distance $b$.
Overflow Let $\circ$ be an addition or subtraction, i.e., $\circ \in \{+, -\}$.
An overflow indicates that the result of $[a] \circ [b]$ is not in the range of
the 32-bit two’s complement numbers. Let $a \in T_n$ denote that $a$ is in the
range of the $n$-bit two’s complement numbers.
Table 2.2 does not provide overflow test and comparisons for unsigned
binary numbers in contrast to most microprocessors such as the
MIPS CPUs or the Intel Pentiums [KH92, Int95b]. We do so in order to
maintain the instruction set used in [MP00].
2.3.2 Implementation
Figure 2.4 [MP95] gives an overview of the ALU implementation. De-
pending on the signals f , the result from the appropriate unit is taken.
The addsub unit takes the operands a and b and one extra input bit sub,
which indicates whether to do an addition or a subtraction. If sub is set,
the unit performs a subtraction. The sub bit is calculated as follows:
sub := f4^ f3^ f2^ f1
The unit returns the result bit vector, and the flag bits ovf and neg. The
ovf bit is supposed to indicate the overflow condition described in the sec-
tion above. The neg bit is used for the comparison operations and indicates
that $[a] \circ [b]$ is below zero.
The addsub unit is realized as follows: Let op1 and op2 denote the
operands. The second operand is inverted in case of a subtraction.
$$op1 := a$$
$$op2 := b \oplus sub^{32}$$
This is justified by the following lemma:
Lemma 2.10 For all bitvectors $a$, inverting and incrementing $a$ implements
the unary minus on bitvectors.
$$(a \oplus (1, \ldots, 1)) + 1 = -a$$
This is shown in the PVS bit vector library.
Using the operands and the sub bit the result is calculated by an adder.
In the following, the carry lookahead adder (section 2.2.6) is used. How-
ever, there is also an implementation and proof of a compound adder, as
described in [MP00], in the PVS tree in order to allow cycle time vs. hard-
ware cost tradeoffs. The implementation and the proof are omitted here.
The sub bit is passed as carry-in bit to the adder. This realizes the incre-
mentation in case of a subtraction.
$$addsub(a, b, sub) := (result, ovf, neg)$$
$$\text{with } result = cla(op1, op2, sub).result$$
The bits ovf and neg are calculated as follows:
$$neg = cla(op1, op2, sub).cout \oplus op1[31] \oplus op2[31]$$
$$ovf = neg \oplus cla(op1, op2, sub).result[31]$$
Lemma 2.11 The calculation of result in the addsub unit is correct.
$$addsub(a, b, sub).result = a \circ b$$
This is shown using lemma 2.10 and 2.9.
Lemma 2.12 The calculation of the ovf signal in the addsub unit is correct.
$$addsub(a, b, sub).ovf = ([a] \circ [b]) \notin T_n$$
Lemma 2.13 The calculation of the neg signal in the addsub unit is correct.
$$addsub(a, b, sub).neg = ([a] \circ [b]) < 0$$
A proof for the lemmas 2.12 and 2.13 can be found in [MP00]. The full
proof is also in the PVS tree.
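
The three lemmas can be checked exhaustively for a reduced width. The sketch below is Python and deliberately departs from the hardware in two ways, both assumptions of this illustration: the width `n` is a parameter (4 in the check, where the thesis uses 32), and Python integer addition stands in for the carry lookahead adder. Operands are passed as unsigned integers holding the bit patterns.

```python
def addsub(a, b, sub, n=32):
    """Return (result, ovf, neg) for n-bit operands given as unsigned integers."""
    mask = (1 << n) - 1
    op1 = a & mask
    op2 = (b ^ (mask if sub else 0)) & mask  # invert b in case of a subtraction
    total = op1 + op2 + sub                  # sub doubles as the carry-in
    result = total & mask
    carry_out = total >> n
    top = n - 1
    neg = carry_out ^ (op1 >> top) ^ (op2 >> top)
    ovf = neg ^ (result >> top)
    return result, ovf, neg

def signed(v, n):
    """Two's complement value of the n-bit pattern v."""
    return v - (1 << n) if v >> (n - 1) else v

n = 4
mismatches = 0
for a in range(1 << n):
    for b in range(1 << n):
        for sub in (0, 1):
            exact = signed(a, n) - signed(b, n) if sub else signed(a, n) + signed(b, n)
            result, ovf, neg = addsub(a, b, sub, n)
            in_range = -(1 << (n - 1)) <= exact < (1 << (n - 1))
            if (result != exact % (1 << n) or neg != int(exact < 0)
                    or ovf != int(not in_range)):
                mismatches += 1
print(mismatches)
```

The neg bit is the sign of the exact result computed with one extra bit of precision, which is why it remains meaningful even when the n-bit result overflows.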
An equality tester is realized by testing if $a \oplus b$ is zero. Using the
output signal eq of the zero tester and the signals ovf and neg from the
addsub unit, the comp unit makes the comparisons as follows:
$$comp : bvec[5] \times \mathbb{B} \times \mathbb{B} \to \mathbb{B}$$
$$comp(f, neg, eq) = (f_2 \wedge neg) \vee (f_1 \wedge eq) \vee (\overline{eq} \wedge \overline{neg} \wedge f_0)$$
Using the lemmas 2.2, 2.3, 2.12, and 2.13, the correctness of the comp
unit is shown.
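
The comp equation can be compared against the comparison rows of table 2.2. The Python sketch below assumes the reading in which the third term uses the inverted eq and neg signals, i.e., f[2] selects "less than", f[1] selects "equal", and f[0] selects "greater than". The inputs neg and eq are derived directly from signed integers here instead of from the addsub unit and the zero tester.

```python
def comp(f, neg, eq):
    """f is the 5-bit function code as an integer; only f[2:0] are used."""
    f2, f1, f0 = (f >> 2) & 1, (f >> 1) & 1, f & 1
    return (f2 & neg) | (f1 & eq) | ((1 - eq) & (1 - neg) & f0)

# Comparison rows of table 2.2, keyed by the function code f[4:0].
rows = {
    0b11000: lambda a, b: False,   # return zero
    0b11001: lambda a, b: a > b,
    0b11010: lambda a, b: a == b,
    0b11011: lambda a, b: a >= b,
    0b11100: lambda a, b: a < b,
    0b11101: lambda a, b: a != b,
    0b11110: lambda a, b: a <= b,
    0b11111: lambda a, b: True,    # return one
}

agrees = all(
    comp(f, int(a < b), int(a == b)) == int(expected(a, b))
    for f, expected in rows.items()
    for a in range(-8, 8) for b in range(-8, 8)
)
print(agrees)
```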
Lemma 2.14 The ALU is correct.
This is shown by a case-split on the operation code f using the lemmas
above. The correctness of the shifter is assumed.
2.4 Specifying the Reference Machine
2.4.1 DLX Architecture
The reference machine used for all designs in this thesis is the DLX [HP96,
SK96]. However, the DLX architecture serves as an example only. The
algorithms and proof methods presented here do not depend on any
properties of the DLX architecture.
The DLX architecture is a load/store architecture with support for integer
and floating point arithmetic. The DLX instruction set (appendix B) is a
RISC instruction set and is similar to the MIPS instruction set.
The DLX architecture provides three register files:
• The general purpose register file (GPR) consists of 32 integer
  registers (R0,...,R31), each of which is 32 bits wide. The register R0 is
  defined to be always zero. The general purpose registers are used for
  all integer operations and memory addressing purposes.
• The floating point register file (FPR) consists of 32 single precision
  floating point registers (FGR0,...,FGR31), each of which is 32 bits
  wide. These registers can also be accessed as 16 double precision
  floating point registers (FPR0, FPR2,...,FPR30), each of which is 64
  bits wide. The register FPR0 is mapped onto the single precision
  registers FGR0 and FGR1, and so on:
  $$FPR0(i) = \begin{cases} FGR0(i) & : i < 32 \\ FGR1(i - 32) & : i \ge 32 \end{cases}$$
  The floating point registers are used by FPU (floating point unit)
  instructions only.
• The special purpose register file (SPR) consists of several registers
  needed for special purposes such as flags and masks. An example is
  the IEEE floating point flags register.
2.4.2 Configuration of an Integer DLX with Delayed PC
The configuration set of the DLX specification machine consists of the
visible registers (register files RF), the program counter (PC) registers, and
the main memory (MEM) of the machine:
$$C_{DLX} = W(RF) \times W(R_{PC}) \times W(MEM)$$
The DLX implementation presented in chapter 3 implements integer op-
erations only and no interrupts. The floating point and special purpose
registers are not needed therefore. The machine is called DLXσ.
$$RF = \{GPR[0], \ldots, GPR[31]\}$$
$$W(GPR[i]) = \mathbb{B}^{32}$$
In order to implement pipelining at a high performance level without
the need for a branch prediction mechanism, the DLX implemented in this
thesis uses the concept of delayed PCs [MPK00, MP00]: all modifications
to the PC register are delayed by one instruction, not just taken branches.
This is realized by buffering the PC register in a register called DPC (“de-
layed PC”). The Delayed PC technique is provably equivalent to the de-
layed branch semantics. The delayed branch semantics is, for example,
used in the MIPS [KH92], the SPARC [SPA92] and the PA-RISC [Hew94]
instruction set.
In order to implement the Delayed PC technique, two PC registers are
required: DPC, the delayed PC, and PC0:
$$R_{PC} = \{DPC, PC0\}$$
$$W(DPC) = W(PC0) = \mathbb{B}^{32}$$
The main memory of the DLX specification machine consists of $2^{30}$
memory cells, each of which is 32 bits wide. That accounts for a total of
four gigabytes of RAM:
$$MEM = \{MEM[0], \ldots, MEM[2^{30} - 1]\}$$
$$W(MEM[i]) = \mathbb{B}^{32}$$
2.4.3 Initial Configuration
The GPR registers and the main memory of the DLXσ machine are ini-
tialized with arbitrary but fixed values. The PC registers DPC and PC0 are
initialized as follows [MPK00]:
$$c^0.DPC = 0$$
$$c^0.PC0 = 4$$
2.4.4 Transition Function
The DLXσ machine provides control instructions (conditional branch and
jump), ALU instructions such as add and compare, and the memory in-
structions load and store. The instruction that is to be executed is encoded
in a 32-bit instruction word. This instruction word is fetched from the in-
struction memory IM, which is assumed to be constant in this thesis. The
instruction memory is not part of the configuration therefore.
Let the signal I denote the instruction word fetched. The address used
to fetch I is taken from the register DPC, as required by the Delayed PC
technique [MPK00]:
$$I(c) = IM(c.DPC)$$
I-type: Opcode (6 bits), RS1 (5), RD (5), Immediate (16)
R-type: Opcode (6 bits), RS1 (5), RS2 (5), RD (5), SA (5), Function (6)
J-type: Opcode (6 bits), PC Offset (26)
Figure 2.5 Integer instruction formats of the DLX
The DLX architecture provides three instruction formats for integer in-
structions (figure 2.5): the I-type format provides a 16-bit immediate con-
stant and two register addresses, the R-type format provides three regis-
ter addresses, a 5-bit immediate constant and an additional 6-bit function
code. The J-type format provides a 26-bit immediate constant, which is
used as PC offset.
The coding of the instructions is given in appendix B. In order to decode
the instruction word I, the following functions are used: The functions
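I_rtype, I_jtype, I_itype indicate an R-type, J-type, and I-type instruction,
respectively:
$$I\_rtype(I) = (\overline{I_{31}} \wedge \overline{I_{30}} \wedge \overline{I_{29}} \wedge \overline{I_{28}} \wedge \overline{I_{27}} \wedge \overline{I_{26}}) \vee (\overline{I_{31}} \wedge I_{30} \wedge \overline{I_{29}} \wedge \overline{I_{28}} \wedge \overline{I_{27}} \wedge I_{26})$$
$$I\_jtype(I) = (\overline{I_{31}} \wedge \overline{I_{30}} \wedge \overline{I_{29}} \wedge \overline{I_{28}} \wedge I_{27}) \vee (I_{31} \wedge I_{30} \wedge I_{29} \wedge I_{28} \wedge I_{27})$$
$$I\_itype(I) = \overline{I\_jtype(I)} \wedge \overline{I\_rtype(I)}$$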
The function I_RD extracts the index of the destination register from the
instruction word:
$$I\_RD(I) = \begin{cases} I[20:16] & : I\_itype(I) \\ I[15:11] & : I\_rtype(I) \\ 0^5 & : \text{otherwise} \end{cases}$$
The functions I_RS1 and I_RS2 extract the index of the first and second
operand from the instruction word, respectively:
$$I\_RS1(I) = I[25:21]$$
$$I\_RS2(I) = I[20:16]$$
The function I_immediate extracts the immediate constant from the
instruction word:
$$I\_immediate(I) = \begin{cases} sign\_extend_{32}(I[15:0]) & : I\_itype(I) \\ zero\_extend_{32}(I[10:6]) & : I\_rtype(I) \\ sign\_extend_{32}(I[25:0]) & : I\_jtype(I) \\ 0 & : \text{otherwise} \end{cases}$$
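
The decoding functions translate directly into shifts and masks on a 32-bit integer. The following Python sketch illustrates this under some assumptions: the instruction word is an unsigned integer whose bit $i$ is $I_i$, the I-type predicate is taken to be "neither R-type nor J-type", and the helper names `bits` and `sign_extend32` are chosen here. The example words are arbitrary field combinations, not entries looked up in appendix B.

```python
def bits(word, hi, lo):
    """The field word[hi:lo] as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend32(value, width):
    """Sign-extend a width-bit field to a 32-bit pattern."""
    if (value >> (width - 1)) & 1:
        value -= 1 << width
    return value & 0xFFFFFFFF

def is_rtype(word):
    return bits(word, 31, 26) in (0b000000, 0b010001)

def is_jtype(word):
    return bits(word, 31, 27) in (0b00001, 0b11111)

def is_itype(word):
    return not is_rtype(word) and not is_jtype(word)

def rd(word):
    if is_itype(word):
        return bits(word, 20, 16)
    if is_rtype(word):
        return bits(word, 15, 11)
    return 0

def immediate(word):
    if is_itype(word):
        return sign_extend32(bits(word, 15, 0), 16)
    if is_rtype(word):
        return bits(word, 10, 6)                 # zero extension of the SA field
    return sign_extend32(bits(word, 25, 0), 26)  # J-type: the only remaining case

# An I-type word with RS1 = 3, RD = 5 and the 16-bit immediate 0xFFFC (-4).
word_i = (0b001000 << 26) | (3 << 21) | (5 << 16) | 0xFFFC
print(bits(word_i, 25, 21), rd(word_i), hex(immediate(word_i)))
```

Since the three formats partition all instruction words under this reading, the "otherwise" case of I_immediate is never reached in the sketch.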
This allows defining the values of the source operands: the integer DLX
instructions can have up to two source operands. Let op1 and op2 denote
the values of these operands. If the address of the operand is zero, the value
of the operand is zero by convention:
$$op1(c) = \begin{cases} 0 & : I\_RS1(I) = 0 \\ c.GPR[I\_RS1(I)] & : \text{otherwise} \end{cases} \qquad (2.1)$$
$$op2(c) = \begin{cases} 0 & : I\_RS2(I) = 0 \\ c.GPR[I\_RS2(I)] & : \text{otherwise} \end{cases} \qquad (2.2)$$
Branch Mechanism The DLX architecture provides two instructions to
modify the PC0 register: the branch instructions test a given register for a
condition and add the offset given as immediate constant if the condition
holds; the jump instructions always set the PC0 register to the given value.
In order to determine the instruction coded by an instruction word I, a
boolean function is defined for each instruction. The equations for these
functions are generated from the instruction set in appendix B and are in
the PVS tree. A list of the functions is also in appendix B.
The functions I_j(I) and I_jr(I) return true iff the instruction is a jump
instruction. In case of I_j(I), the immediate constant is used as offset to
the PC, in case of I_jr(I) the jump target is the value of the first operand.
The function I_branch(I) is used to detect a branch. If the instruction is
a branch, I_branch_eq(I) indicates that the branch is to be taken if the
operand is zero. If I_branch_eq(I) does not hold, the branch is to be taken
if the operand is not zero.
Let GPRa be the value of the operand. The function bjtaken(I, GPRa)
is true iff the given instruction I is a taken branch or jump:
$$bjtaken : bvec[32] \times bvec[32] \to \mathbb{B}$$
$$bjtaken(I, GPRa) = I\_j(I) \vee I\_jr(I) \vee (I\_branch(I) \wedge (I\_branch\_eq(I) \leftrightarrow (GPRa = 0)))$$
The function next_pc calculates the new value of PC0 given the instruction
word I, the value of the first operand GPRa and the old value of PC0:
$$next\_pc(I, GPRa, PC0) = \begin{cases} GPRa & : bjtaken(I, GPRa) \wedge I\_jr(I) \\ PC0 + I\_immediate(I) & : bjtaken(I, GPRa) \wedge \overline{I\_jr(I)} \\ PC0 + 4 & : \text{otherwise} \end{cases}$$
$$\delta.PC0(c) = next\_pc(I, op1(c), c.PC0) \qquad (2.3)$$
According to the Delayed PC technique, the new value for DPC is the
old value of PC0:
$$\delta.DPC(c) = c.PC0 \qquad (2.4)$$
In case of a jump and link instruction, which is indicated by I_link(I),
the old value of PC0 plus four is stored in the destination register:
$$\delta.GPR[I\_RD(I)](c) = c.PC0 + 4$$
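
The effect of equations (2.3) and (2.4) on the fetch addresses can be traced with a small simulation. The Python sketch below abstracts from instruction decoding: whether the fetched instruction is a taken branch or jump, and its immediate constant, are passed in as parameters instead of being computed from an instruction word. The function `step` and the example program are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def next_pc(taken, is_jr, gpr_a, immediate, pc0):
    """New value of PC0; taken and is_jr stand for bjtaken and I_jr."""
    if taken and is_jr:
        return gpr_a
    if taken:
        return (pc0 + immediate) & MASK32
    return (pc0 + 4) & MASK32

def step(dpc, pc0, taken=False, is_jr=False, gpr_a=0, immediate=0):
    """One transition of the PC registers: returns the new (DPC, PC0)."""
    return pc0, next_pc(taken, is_jr, gpr_a, immediate, pc0)

# Initial configuration: DPC = 0, PC0 = 4. The instruction fetched at
# address 4 is assumed to be a taken jump with immediate constant 16.
dpc, pc0 = 0, 4
fetch_addresses = []
for _ in range(4):
    fetch_addresses.append(dpc)
    is_jump = dpc == 4
    dpc, pc0 = step(dpc, pc0, taken=is_jump, immediate=16 if is_jump else 0)
print(fetch_addresses)
```

In this trace the instruction at address 8 is still fetched after the jump at address 4, and only then does the fetch continue at the jump target. This is the one-instruction delay that the DPC register introduces.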
ALU Instructions The function ALUfunction(I) extracts the ALU func-
tion code from the instruction word. The ALU function codes are given in
table 2.2, page 23.
$$ALUfunction : \mathbb{B}^{32} \to \mathbb{B}^{5}$$
$$ALUfunction(I) = \begin{cases} 1\; I_{30}\; I[28:26] & : I\_itype(I) \\ I_5\; I_3\; (I_2 \wedge I_5)\; I[1:0] & : I\_rtype(I) \\ 0^5 & : \text{otherwise} \end{cases}$$
The ALU performs the DLX ALU instructions such as addition and
compare operations, which are indicated by I_ALU (two register operands)
and I_ALUi (one register operand and one immediate constant operand).
Furthermore, the shift operations are performed by the ALU. The shift
operations are indicated by I_shift (two register operands) and I_shifti (one
register operand and one immediate constant operand).
In case of an ALU or shift operation with two register operands, the
transition function for the destination register is:
δ:GPR[I RD(I)](c) = ALU(op1(c);op2(c);ALUfunction(I))
In case of an ALU or shift operation with one register operand and one
immediate constant operand, the transition function for the destination reg-
ister is:
δ:GPR[I RD(I)](c) =
ALU(op1(c); I immediate(I);ALUfunction(I))
Memory Instructions In order to access off-chip memory, the DLX ar-
chitecture provides load and store instructions. The load instructions copy
a value of a memory cell into a register. The store instructions copy the
value of a register into a memory cell.
As described in section 2.4.2, the DLX memory is organized in 32-bit
words. The address that is to be accessed is computed as follows: the
value of the first operand and the immediate constant provided in the in-
struction word are added. Let EA (effective address) denote this address.
It is defined using the addition on bit vectors as defined in section 2.1.5:
\[ EA := op1 + I_{immediate}(I) \]
Figure 2.6 The possible alignments for memory instructions (lw: EA[1:0] = 00;
lh: EA[1:0] = 00 or 10; lb: EA[1:0] = 00, 01, 10, or 11)
The DLX architecture supports memory accesses with variable widths:
byte (8 bits), half word (16 bits), and word (32 bits) accesses are allowed.
The bits EA[31 : 2] are used to select the word that is to be accessed.
The DLX architecture does not support non-aligned accesses, i.e., memory
accesses must not cross a memory cell boundary. In case of a word access,
this implies that EA[1 : 0] must be zero. In case of a byte or half word
access, EA[1 : 0] is used to specify the bytes in the memory cell. Figure
2.6 shows the allowed positions of the memory operand within a memory
cell.
In case of a load instruction, the 32 bits of the destination register are
always written. In case of a byte or half word load instruction, the memory
operand is stored in the register beginning with the least significant bits
and either a zero or a sign extension is performed. In case of the lh and
lb instructions, sign extension is performed, in case of the lhu and lbu
instructions, zero extension is performed.
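The alignment rule and the sign or zero extension for loads can be sketched as follows. This is a minimal illustration, not the circuit used in the thesis. It assumes that the byte selected by EA[1:0] = 0 occupies bits 7..0 of the memory word; the function name and its parameters are hypothetical.

```python
def load_value(word, ea_low, width, signed):
    """Return the 32-bit register value for a load.

    word   : contents of the 32-bit memory cell selected by EA[31:2]
    ea_low : EA[1:0], the byte offset within the cell
    width  : access width in bytes (1, 2, or 4)
    signed : True for lb/lh/lw, False for lbu/lhu
    """
    if ea_low % width != 0:
        # The access would cross a memory cell boundary.
        raise ValueError("misaligned access")
    bits = 8 * width
    # Assumption: byte offset 0 corresponds to the least significant byte.
    raw = (word >> (8 * ea_low)) & ((1 << bits) - 1)
    if signed and (raw >> (bits - 1)) & 1:
        # Sign extension: fill the upper bits of the register with ones.
        raw |= (0xFFFFFFFF << bits) & 0xFFFFFFFF
    return raw
```

With the word 0x8001FF7F, a signed byte load at offset 1 returns 0xFFFFFFFF, whereas the unsigned variant returns 0x000000FF.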
In case of a store instruction, the machines presented in the following
chapters assume full word accesses. This restriction will be removed in
chapter 6.
2.5 Literature
Besides the basic circuits presented here, there are more advanced circuits,
e.g., adders [LF80, Min95]. There are also HDL generators available for
arithmetic circuits such as adders and multipliers [PA96]. Basic circuits
with proofs in the PVS language are covered by [BJK01].
Fully automated verification of combinational circuits such as adders has
been reported using BDDs (binary decision diagrams) [Bry86, FFK88].
The BDDs of some circuits, such as multipliers, grow exponentially in
the number of input bits. A lot of literature addresses this issue [Bur91,
JNFSV97].
Barrett et al. [BDL98] extend an equivalence checker by decision procedures
for bit vector arithmetic and automatically verify components of a
microprocessor such as an instruction fetch unit. The decision procedures
are similar to those used in PVS.
The specification of microprocessors as mathematical machines is a common
technique [Gau95].
Chapter 3
A Sequential Implementation Machine
3.1 The Prepared Sequential Machine
In this chapter, an implementation machine is built that works as follows:
the calculation of a configuration of the specification machine is
split into n arbitrary phases, called stages. In each phase, the values of a
subset of the registers of the configuration of the specification machine are
calculated. The implementation machine performs the phases round-robin
and needs one transition for each phase, i.e., the implementation machine
needs n times as many transitions to do the same calculation as the
specification machine.
The calculation is still done in a sequential way: at no time are two
configurations of M_S calculated in parallel. However, the structure of the
machine will match the structure of the pipelined machine described in the
next chapter. The machine is therefore called the prepared sequential
machine [MP00], or M_σ.
Let M_S and M_I be mathematical machines. Let M_S = (C_S, c_S^0, δ_S) be
a specification machine and let M_I = (C_I, c_I^0, δ_I) be the implementation
machine. Let R_S be the registers of the specification machine and R_I be the
registers of the implementation machine. The following sections describe
how to build a prepared sequential machine that provably simulates the
specification machine.
3.2 How Hardware is Specified
3.2.1 A Simple Hardware Description Language
The hardware of the implementation machine consists of the registers of
the machine and of the data paths, which calculate the values of the regis-
ters. The registers are modeled by the configuration set of the mathematical
machine, and the data paths are modeled by the transition function δ. The
configuration set and the transition function δ are defined using a simple
hardware description language that is similar to a register transfer language
(RTL).
For example, in a register transfer language the new value for the DPC
register is specified as follows:
DPC := PC'
In this example, the value of PC' is used in order to specify the new value
of DPC. Formally, suppose the goal is to calculate the value DPC has in
configuration c_S^i with i > 0. In this case, the value used for PC' is the
value the register PC' has in configuration c_S^{i-1}. Thus, the following is
supposed to hold for all i > 0:
\[ c_S^{i}.DPC = c_S^{i-1}.PC' \]
A very similar definition is in [KP95].
In the example above, two things happen:
• The old value of PC' is read.
• The new value of DPC is written.
The hardware description language used in this thesis makes use of the
following language elements:
1. The configuration set is defined using a list of registers and addi-
tional information on the registers such as their domain. This is
described in the next section.
2. The transition function δ, i.e., the function computed by the gates
between the registers, is defined using a set of functions. This is
described in the sections 3.2.4 and 3.2.6.
3.2.2 The Register Set of the Implementation Machine
For all registers R of the implementation machine, let R ∈ out(k) denote
that the register R is updated by stage k ∈ {0, ..., n-1}.
Definition 3.1 (Specification Register) The registers of the implementation
machine include all registers of the specification machine. These registers
are called specification registers.
The fact that R ∈ R_I is a specification register is denoted by R ∈ spec.
By convention, a specification register R ∈ R_I is updated by exactly
one stage. Let the stage k = stage(R) be the stage that updates R. In
this case, the register R is also denoted by R.(k+1). This convention and
the notation are taken from [MP00].
Definition 3.2 (Implementation Register) In order to store temporary values
used for the calculation, further registers are added to the machine. These
registers are called implementation registers. The fact that R is an
implementation register is denoted by R ∈ impl.
For example, if a processor fetches an instruction word from the instruction
memory and stores it in the instruction word register, this instruction
word is an intermediate result of the computation of the next state of the
reference machine.
In contrast to specification registers, instances R.k of implementation
registers R can be present in multiple stages. The function stage(R) is
defined to be the first stage an instance of the implementation register is
present in:
\[ \forall R \in impl : \quad stage(R) = \min \{\, k \mid R.(k+1) \in out(k) \,\} \]
The property of a register of being an implementation or a specification
register is called the class of the register.
Thus, the configuration set is defined by listing the names of the
registers, their types (i.e., domains), and their classes. Furthermore, for
each register the stage(s) are given. In the case of a specification
register, only one stage is allowed. In the case of an implementation
register, multiple stages are allowed.
In addition to that, the command used in order to define a register is also
used in order to specify the value the register has in the initial configura-
tion.
Cycle     T=0  T=1  T=2  T=3  T=4  T=5  T=6
ue_0^T     1    0    0    0    1    0    0
ue_1^T     0    1    0    0    0    1    0
ue_2^T     0    0    1    0    0    0    1
ue_3^T     0    0    0    1    0    0    0
Table 3.1 The sequential scheduling of a four stage pipeline
3.2.3 Scheduling of the Prepared Sequential Machine
The next step is to define the transition function δ of the machine. The
registers of the prepared sequential machine are updated round-robin. In
each transition, the registers of only one stage are updated. The update
of the registers in out(k) of a stage k is controlled by a signal ue_k (update
enable): the registers in out(k) are updated iff ue_k is one. Table 3.1 gives
an example of the values of ue_k for a four stage pipeline. The same concept
is used by [MP00].
The stage that is updated before stage k is calculated by the function
prev(k):
\[ prev : \{0, \ldots, n-1\} \to \{0, \ldots, n-1\} \]
\[
prev(k) =
\begin{cases}
k-1 & : k \neq 0 \\
n-1 & : k = 0
\end{cases}
\]
Stage k is said to be updated in cycle T iff ue^T_{prev(k)} = 1 holds.
In analogy to the function prev, the function next(k) calculates the stage
that is updated after stage k:
\[ next : \{0, \ldots, n-1\} \to \{0, \ldots, n-1\} \]
\[
next(k) =
\begin{cases}
k+1 & : k \neq n-1 \\
0 & : \text{otherwise}
\end{cases}
\]
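As a small sanity check, the two functions can be written down directly in Python. The names prev_stage and next_stage are chosen for this sketch only; the number of stages n is passed explicitly.

```python
def prev_stage(k, n):
    """Stage that is updated before stage k in an n-stage machine."""
    return k - 1 if k != 0 else n - 1


def next_stage(k, n):
    """Stage that is updated after stage k in an n-stage machine."""
    return k + 1 if k != n - 1 else 0
```

The two functions are inverse to each other on {0, ..., n-1}, which matches the round-robin order of the stages.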
In order to allow the machine M_I to keep track of the stage that is
currently processed, a 1-bit register full.j is added to each stage
j ∈ {1, ..., n}. If full.j is set, the calculation of the registers R.j is
finished.
In the initial configuration, only the full bit of the last stage is set:
\[
c^{0}.full.j =
\begin{cases}
1 & : j = n \\
0 & : \text{otherwise}
\end{cases}
\]
In addition to that, a signal full_k is defined for each stage
k ∈ {0, ..., n-1} as follows:
\[
full_k(c) =
\begin{cases}
c.full.n & : k = 0 \\
c.full.k & : \text{otherwise}
\end{cases}
\]
If full_k(c_I^T) holds, it is said that stage k is full during cycle T.
In general, the registers in out(k) are updated iff full_k is active.
However, some operations on the registers might take more than one cycle,
such as an access to slow off-chip memory. This requires a means to stall the
machine, which is realized by a signal stall_k for each stage. If it is
active, the stage is stalled. The signal ue_k is therefore active iff the
stage is full and not stalled:
\[ ue_k = full_k \land \overline{stall_k} \]
Convention 3.1 The stall signal of a given stage k must not be active if the
stage is not full:
\[ \overline{full_k} \Rightarrow \overline{stall_k} \]
The transition functions of the full bits are defined as follows: a full bit
is set iff the stage was updated or the stage was full in the previous cycle
and the stage was stalled:
\[ 1 \le k < n : \quad \delta(c).full.k = ue_{k-1}(c) \lor (c.full.k \land stall_k(c)) \]
\[ k = n : \quad \delta(c).full.n = ue_{n-1}(c) \lor (c.full.n \land stall_0(c)) \]
Since stall_k(c) implies full_k(c) (convention 3.1), this definition can be
simplified to:
\[ 1 \le k < n : \quad \delta(c).full.k = ue_{k-1}(c) \lor stall_k(c) \]
\[ k = n : \quad \delta(c).full.n = ue_{n-1}(c) \lor stall_0(c) \]
Lemma 3.2 A stage is full iff it was updated or stalled in the previous cycle.
\[ full_k^{T+1} = ue^{T}_{prev(k)} \lor stall^{T}_{k} \]
This is shown by unfolding the definition of the full signal full_k and of
prev(k).
Figure 3.1 The prepared sequential machine
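The behavior of the full bits and update enable signals can be explored with a small simulation. The following Python sketch is an illustration only, not the PVS model or the generated hardware. The list-based representation (index j of the list `full` models the register full.j, index 0 is unused) and the function names are assumptions of this sketch.

```python
def full_bit_step(full, stall):
    """One transition of the full bits.

    full  : list of n+1 booleans, full[j] models full.j for j = 1..n
    stall : list of n booleans, stall[k] models stall_k for k = 0..n-1
    Returns the new full bits and the update enable signals of this cycle.
    """
    n = len(full) - 1
    # Signal full_k: stage 0 looks at full.n, stage k > 0 at full.k.
    full_sig = [full[n]] + [full[k] for k in range(1, n)]
    # ue_k is active iff the stage is full and not stalled.
    ue = [full_sig[k] and not stall[k] for k in range(n)]
    new_full = [False] * (n + 1)
    for k in range(1, n):
        new_full[k] = ue[k - 1] or (full[k] and stall[k])
    new_full[n] = ue[n - 1] or (full[n] and stall[0])
    return new_full, ue


def simulate(n, cycles, stalls=None):
    """Return the update enable signals ue_0..ue_{n-1} for each cycle."""
    full = [False] * n + [True]  # initial configuration: only full.n is set
    trace = []
    for t in range(cycles):
        stall = stalls[t] if stalls is not None else [False] * n
        full, ue = full_bit_step(full, stall)
        trace.append([int(u) for u in ue])
    return trace
```

Calling `simulate(4, 7)` without stalls yields, cycle by cycle, the columns of table 3.1. If stage 1 is stalled in cycle 1, no stage is updated in that cycle and the schedule resumes one cycle later.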
Figure 3.1 shows the registers of a prepared sequential machine and the
clock enable signals that are used for them. As described in chapter 2, the
circuits used in order to realize the calculation of the new values for the
initial state c0 are omitted.
3.2.4 The Transition Function
As described above, a register value is supposed to be written only if the
update enable signal of its stage is active. The value of the register should
remain unchanged otherwise. The overall transition function δ.R for a
register R ∈ out(k) is therefore generated as follows: if the update enable
signal is not active, the old value is taken. If the update enable signal is
active, the value provided by a function ω_kR(c) is taken, which is defined
later.
\[
\delta.R(c) =
\begin{cases}
\omega_k R(c) & : ue_k(c) = 1 \\
c.R & : \text{otherwise}
\end{cases}
\]
The functions ωkR are mappings from the configuration of the imple-
mentation machine into the domain of the register R. These functions are
generated from the hardware description language using the two simple
elements: write accesses and read accesses. The accesses are kept in a list.
In addition to the data of the read or write access, which is described below,
the list contains a flag for each access that specifies whether the access is a
read or write access.
A write access without write address is a five-tuple (R.(k+1), f_kR,
dep(R, k), f_kRwe, dep_we(R, k)). Write accesses with write address are
used in order to provide an address for memories or register files and will
be described in the next section.
The first element specifies the instance of the register that is written to.
We require that exactly one write access is given for each instance R.(k+1)
of each register.
The second element, the function f_kR, provides the value that is written
into the register R.(k+1), i.e., the range of the function is the domain of
the register. The function is called the register transition function. The
register transition functions basically model the combinatorial circuits
between the pipeline stages, for example the ALU or the FPU.
A register transition function takes the values of several registers as
arguments. These registers are listed in dep(R, k). Let a register R' be in
the list of a register R. In this case, it is said that the calculation of R
depends on R'. Let dep(R, k) denote the list of registers the calculation of
R depends on:
\[ dep(R, k) = (R'_1, \ldots, R'_j) \quad \text{with } R'_l \in \mathcal{R} \]
This allows defining the domain and range of the functions f_kR:
\[ f_k R : W(R'_1) \times \cdots \times W(R'_j) \to W(R) \]
Furthermore, a function f_kRwe may be provided as element four. The
function f_kRwe is called the write enable signal and can be used in order to
realize updates of the given register instance that are only performed under
a certain condition. As an example, consider that in a microprocessor
most registers are only changed by certain instructions. The write enable
function allows modeling this. This function may become non-trivial, for
example if writing the register is to be suppressed in case of an interrupt.
The range of the function f_kRwe is B. If it returns one, the write access
is to be done. If the value is zero, the write access is suppressed. The
domain of the function is defined in analogy to the domain of f_kR using a
list of input registers named dep_we(R, k):
\[ dep\_we(R, k) = (S'_1, \ldots, S'_o) \quad \text{with } S'_l \in \mathcal{R} \]
\[ f_k Rwe : W(S'_1) \times \cdots \times W(S'_o) \to \mathbb{B} \]
Let the functions γ_kR and γ_kRwe denote the values of the arguments of
the functions f_kR and f_kRwe. These functions are defined later using the
read accesses.
The effect of f_kRwe depends on whether R is an implementation or
specification register. In the case of a specification register, the
following behaviour is used: if the function f_kRwe returns true, the
updating of R.(k+1) is performed. If the function returns false, the updating
is suppressed and the value in the register does not change. Thus, if R is a
specification register, ω_kR is defined as:
\[
\omega_k R(c) =
\begin{cases}
f_k R(\gamma_k R(c)) & : f_k Rwe(\gamma_k Rwe(c)) \\
c.R & : \text{otherwise}
\end{cases}
\]
Note that there are actually two signals that determine whether a
specification register is to be clocked or not: both the update
enable and the write enable signals must be active, i.e., the clock enable
signal of a specification register R ∈ out(k) is:
\[ ue_k \land f_k Rwe(\gamma_k Rwe(c)) \]
This method is taken from [MP00]. It allows us to specify the stall
engine as a module, as done in the previous section.
If the write enable signal is false and R is an implementation register, the
following behavior is used: the value from the register in the previous stage
is written into the register. If there is no instance of the implementation
register in the previous stage, a pre-defined default value, e.g., zero, is
taken. Thus, if R is an implementation register, ω_kR is:
\[
\omega_k R(c) =
\begin{cases}
f_k R(\gamma_k R(c)) & : f_k Rwe(\gamma_k Rwe(c)) \\
c.R.k & : R \in out(k-1) \\
0 & : \text{otherwise}
\end{cases}
\]
This is illustrated in figure 3.2. As an example, consider a processor with
an ALU in stage 2. The results are stored in instances of the implementation
register C. In the case of an ALU instruction, f_2Cwe holds and the
output of the ALU is stored in the register C.3. Otherwise, the value in C.2
is taken. The ALU is modeled by the function f_2C. The multiplexer selecting
the appropriate value is modeled by the function ω_2C.
If no function f_kRwe is provided, the constant value true is taken instead,
i.e., the updating is performed unconditionally.
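The interplay of the update enable signal, the write enable signal, and the register class can be sketched in Python. This is a simplified illustration under assumptions: the values of f_kR and f_kRwe are passed in already evaluated, a missing instance in the previous stage is represented by None, and all function names are hypothetical.

```python
def omega_spec(f_value, write_enable, old_value):
    """omega_k R for a specification register: keep the value if not enabled."""
    return f_value if write_enable else old_value


def omega_impl(f_value, write_enable, prev_instance, default=0):
    """omega_k R for an implementation register.

    prev_instance is the value of R.k (the instance in the previous stage),
    or None if no such instance exists; then the default value is used.
    """
    if write_enable:
        return f_value
    return prev_instance if prev_instance is not None else default


def delta(omega_value, update_enable, old_value):
    """Overall transition: the register only changes if ue_k is active."""
    return omega_value if update_enable else old_value
```

In the ALU example above, `delta(omega_impl(alu_out, f2Cwe, c2), ue2, c3)` would give the next value of C.3, where `alu_out`, `f2Cwe`, `c2`, `ue2`, and `c3` are placeholders for the corresponding signals and register values.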
3.2.5 Inputs
The functions fkR and fkRwe above require certain inputs in order to pro-
vide a meaningful value, i.e., it is left to define the functions γkR and γkRwe.
Formalizing the inputs of a register transition function is the most impor-
tant concept of this thesis, since most of our arguments are used in order
to justify how to get those inputs. In particular, we will realize forward-
ing in pipelined machines and speculation by adjusting these functions ac-
cordingly, i.e., the functions model the forwarding logic and speculation
circuits.
Figure 3.2 Example for f_kR, f_kRwe, and ω_kR
Figure 3.3 Example for the input generation functions
This is illustrated in figure 3.3: it depicts a pipeline that reads two values
in stage 2 that require forwarding. The forwarding logic is modeled by the
functions γ_1A and γ_1B.
The register transition functions depend on a set of input registers. The
functions γ_kR and γ_kRwe provide the whole set. Let g_kR' be a function that
extracts the value of a single input register R' from the configuration of the
implementation machine.
\[ g_k R' : C \to W(R') \]
We will define g_kR' later on. Using g_kR', we define g_k, which takes a
configuration and a list of registers. It returns the input values provided by
g_kR':
\[ g_k(c, (R'_1, R'_2, \ldots, R'_j)) = (g_k R'_1(c), g_k R'_2(c), \ldots, g_k R'_j(c)) \]
Let f_kR be the register transition function and dep(R, k) be the list of
input registers (R'_1, ..., R'_j), as above. Using g_k, we define γ_kR:
\[ \gamma_k R : C \to W(R'_1) \times \cdots \times W(R'_j) \]
\[ \gamma_k R(c) = g_k(c, dep(R, k)) \]
The functions γ_kRwe and γ_kRre are defined in analogy to this definition.
It is left to define the functions g_kR', which calculate the actual input
value. As described above, calculating such input values may be complex,
for example in machines with forwarding or speculation. In the prepared
sequential machine, we need neither forwarding nor speculation. Thus, we
define rather simple functions g_kR' for this machine.
In the hardware description language, the definition of g_kR' is done using
read accesses. A read access without read address is a four-tuple (R', k,
f_kR're, dep_re(R', k)). Read accesses with read address will be described
in the next section. For each stage and for each register that is an input of
the stage, exactly one read access must be defined. The first element is the
register (not an instance thereof), the second element is the stage that
depends on the register, and the third element is a read enable function in
analogy to the write enable for write accesses. The function f_kR're also
depends on registers:
\[ dep\_re(R', k) = (U'_1, \ldots, U'_q) \quad \text{with } U'_l \in \mathcal{R} \]
\[ f_k R' re : W(U'_1) \times \cdots \times W(U'_q) \to \mathbb{B} \]
As above, γ_kR're is used in order to denote the input arguments of f_kR're.
In order to prevent this definition from becoming recursive, it is required
that the read accesses to those registers in dep_re(R', k) or in dep_we(R, k)
have no read enable signal.¹
The read enable function has the following purpose: If the read enable
signal fkR0re is not active, a default value, e.g., zero, is used as input.
This allows us to state whether we actually need an input or not. In case
of a microprocessor, not all instructions have an equal number of input
registers, some take one GPR operand, some two, and so on. The benefit of
knowing when we do not need an input becomes obvious if one considers a
machine with forwarding: in case forwarding fails because of data hazards,
we do not have to stall if the input is not used anyway.
If no function fkR0re is provided, the constant value true is taken instead,
i.e., the read access is performed unconditionally.
¹ It is feasible to extend this definition in order to allow a recursion.
However, no microprocessor design implemented for this thesis requires it.
If the read enable signal is active, the value provided by g_kR' is the value
of the register. As described above, this simple definition only works in the
prepared sequential machine. We will re-define g_kR' for faster machines
later on.
The formal definition of g_kR' depends on whether R' is an implementation
or specification register. Let w be stage(R').
If R' is an implementation register, an instance of R' is expected to be
in the previous stage. If there is no instance of R' in out(k-1), instances of
the register R' are added to out(w+1), ..., out(k-1) if not already present.
These registers are called “buffer registers”. The transition function for
these additional registers is:
\[ \omega_p R'(c) = c.R'.p \quad \text{for } p \in \{w, \ldots, k-1\} \]
After this is done, an instance of R' is in out(k-1). The value is read
directly from the register R'.k.
\[
g_k R'(c) =
\begin{cases}
c.R'.k & : f_k R' re(\gamma_k R' re(c)) \\
0 & : \text{otherwise}
\end{cases}
\]
If R' is a specification register, there is only one instance of R', by
definition. In this case, the value in this register is taken. It is required
that w ≥ k holds (this limitation is removed in chapter 5).
Thus, g_kR' is defined as follows:
\[
g_k R'(c) =
\begin{cases}
c.R'.(w+1) & : f_k R' re(\gamma_k R' re(c)) \\
0 & : \text{otherwise}
\end{cases}
\]
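The two cases of g_kR' and the way γ_kR collects the inputs can be sketched as follows. This is an illustration only. It assumes that a configuration is a dictionary from (register name, instance number) to values, that the read enable values are passed in already evaluated, and that all names and the shape of the dependency list are hypothetical.

```python
def g_k(config, k, reg, spec_stage=None, read_enable=True, default=0):
    """Input value of register `reg` for stage k.

    spec_stage is None for an implementation register; for a specification
    register it is w = stage(reg), and the single instance reg.(w+1) is read.
    """
    if not read_enable:
        return default  # the input is not needed; a default value is used
    instance = k if spec_stage is None else spec_stage + 1
    return config[(reg, instance)]


def gamma_k(config, k, deps):
    """Tuple of all inputs of a register transition function of stage k.

    deps is a list of (register, spec_stage, read_enable) triples that
    plays the role of dep(R, k) together with its read accesses.
    """
    return tuple(g_k(config, k, r, w, re) for (r, w, re) in deps)
```

For example, with a configuration holding A.2 = 5, B.2 = 7, and a specification register S.5 = 42 (so stage(S) = 4), the inputs of stage 2 with the read access to B disabled are (5, 0, 42).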
Figure 3.4 shows an example of how the functions f_kR and g_kR' are used in
order to model hardware. It shows the hardware for an unconditional write
access to a register R.(k+1) that depends on two implementation registers
R'_1 and R'_2. The read accesses to R'_1 and R'_2 are both unconditional.
Figure 3.4 The input and output functions for an unconditional write access to a
register R.(k+1) that depends on two implementation registers R'_1 and R'_2. The
read accesses to R'_1 and R'_2 are both unconditional.
3.2.6 Register Files and Memory
In hardware implementations of microprocessors, on-chip memory is used
to realize register files. In addition to that, microprocessors provide an
interface to off-chip memory in order to store larger amounts of data. The
microprocessor usually reads or writes only a small part of these memories
in each cycle.
In theory, accesses to these memories could be modeled as follows: the
complete contents are read, some parts are modified and re-written. How-
ever, in hardware the access to both memory and register files is limited.
For write accesses to register files or memory, the transition functions are
therefore expected to provide the value that is to be written, the address of
the register or memory cell that is to be modified, and a write enable signal.
For read accesses, the transition functions must provide the address and a
read signal. The functions ωkR defined above model the behavior of the
hardware, but are not suited for synthesizing hardware as soon as memory
or register files are involved.
The definition of the function ωkR is therefore changed if a register file
or memory is accessed. This is done by extending the hardware description
language using read and write accesses with address. It is presumed that
implementation registers are never in a register file or part of a memory.
A write access with write address is defined like a write access without
write address but with the additional elements f_kRwa and dep_wa(R, k), i.e.,
it is a seven-tuple. The write address function f_kRwa takes the registers in
the list dep_wa(R, k) as arguments, as done with the arguments of f_kR. The
function returns the address that is to be used. The range of the function
therefore is the set of possible addresses of the access. Let W_a(R) denote
this range.
\[ dep\_wa(R, k) = (V'_1, \ldots, V'_r) \quad \text{with } V'_l \in \mathcal{R} \]
\[ f_k Rwa : W(V'_1) \times \cdots \times W(V'_r) \to W_a(R) \]
The range of the function f_kR has to be adjusted accordingly such that
it matches the range of a single memory cell or register of the register file;
e.g., if a 32x32 bit register file named GPR.5 is accessed, f_4GPR returns a
32-bit vector. Let W_r(R) denote this range. In the example, the function
f_4GPRwa returns a five-bit vector.
\[ dep(R, k) = (R'_1, \ldots, R'_j) \quad \text{with } R'_l \in \mathcal{R} \]
\[ f_k R : W(R'_1) \times \cdots \times W(R'_j) \to W_r(R) \]
The value provided by the function f_kRwe enables (return value one) or
disables (return value zero) the write access to the register or memory cell.
This value is taken as the write enable signal.
The behavior of the memory or register file is modeled by the function
ω_kR as follows: let the brackets [ ] denote a projection used in order to
access a single memory cell or register of a register file and let the function
γ_kRwa denote the function that calculates the arguments of the function
f_kRwa as described in section 3.2.4.
\[
\forall x \in W_a(R) : \quad
\omega_k R(c)[x] =
\begin{cases}
f_k R(\gamma_k R(c)) & : f_k Rwe(\gamma_k Rwe(c)) \land x = f_k Rwa(\gamma_k Rwa(c)) \\
c.R[x] & : \text{otherwise}
\end{cases}
\]
Furthermore, a read address can be supplied for each read access to a
specification register that is a register file or memory. Such a read access
is called a read access with read address. In analogy to the write address
functions f_kRwa, this index is supplied by an additional function f_kR'ra,
called the read address. The function takes arguments as described above for
f_kR're. The list of registers is denoted by dep_ra(R', k). The range of the
function is W_a(R'), as described above.
Let the function γ_kR'ra denote the function that calculates the arguments
of f_kR'ra as described in section 3.2.4. The function g_kR' for a conditional
specification register read access with read address is:
\[ g_k R' : C \to W_r(R') \]
\[
g_k R'(c) =
\begin{cases}
c.R'[\, f_k R' ra(\gamma_k R' ra(c)) \,] & : f_k R' re(\gamma_k R' re(c)) \\
0 & : \text{otherwise}
\end{cases}
\]
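Addressed write and read accesses can be sketched in Python with a list that stands for the register file or memory. This is a minimal illustration, not the generated hardware. The value, address, and enable arguments are assumed to be already computed by the respective functions, and the function names are hypothetical.

```python
def omega_regfile(cells, value, write_enable, write_addr):
    """New contents of a register file after one write access.

    Only the addressed cell changes, and only if the write enable holds;
    every other cell keeps its old value.
    """
    new_cells = list(cells)  # copy: the old configuration stays unchanged
    if write_enable:
        new_cells[write_addr] = value
    return new_cells


def g_regfile(cells, read_addr, read_enable=True, default=0):
    """Value delivered by a read access with read address."""
    return cells[read_addr] if read_enable else default
```

For a 32-entry file initialized with zeros, writing 99 to address 5 changes exactly that cell; a disabled read of the same address delivers the default value zero.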
The generation of the hardware of the implementation machine can now
be done automatically by a program that reads the following:
• the register list, including the domain, type, class, and initial value
  of each register;
• the list of read and write accesses.
In the following section, the hardware description language above will
be used in order to implement a sequential DLX. This is followed by a
proof that this implementation simulates the specification as given in chap-
ter 2.
3.2.7 Multiport Read Accesses
In a microprocessor, there can be multiple read accesses to the same
register file in the same stage. For example, in a DLX implementation we
have two read accesses to the general purpose register file. We support
separate read enable and read address functions for these read accesses. Let
R be the register that is read.
By convention, we name these functions as follows: the read enable
function of the first access is named f_kRa_re, the read enable function of
the second access is named f_kRb_re. In analogy to that, the read address
function of the first access is named f_kRa_ra, the read address function of
the second access is named f_kRb_ra. The lists of inputs these functions
depend on are denoted by dep_re(Ra, k), dep_re(Rb, k), and so on. In analogy
to that, the function that provides the inputs to f_kRa_re is named γ_kRa_re,
and so on.
Since we have separate read enable and read address functions, we also
get different input values. We denote the value generated for the first read
access by g_kRa and the value generated for the second read access by g_kRb.
As an example, consider two read accesses in stage 1 to the GPR register
file. The read enable functions are named f_1GPRa_re and f_1GPRb_re. The
input generation functions are named g_1GPRa and g_1GPRb.
3.2.8 Notation
For the sake of simplicity, we introduce the following shorthand for formulas
that will be used very often in the rest of this thesis. Consider two functions
f_kQ and γ_kQ. Let x be a tuple of arguments of γ_kQ. In this thesis, we will
often need the value of f_kQ applied to γ_kQ of x:
\[ f_k Q(\gamma_k Q(x)) \]
We will denote this by fγ_kQ(x):
\[ f\gamma_k Q(x) := f_k Q(\gamma_k Q(x)) \]
For example, the function compositions used in the previous section will
be shortened as follows:
\[ f\gamma_k R(c) := f_k R(\gamma_k R(c)) \tag{3.1} \]
\[ f\gamma_k Rre(c) := f_k Rre(\gamma_k Rre(c)) \tag{3.2} \]
\[ f\gamma_k Rwe(c) := f_k Rwe(\gamma_k Rwe(c)) \tag{3.3} \]
\[ f\gamma_k Rra(c) := f_k Rra(\gamma_k Rra(c)) \tag{3.4} \]
\[ f\gamma_k Rwa(c) := f_k Rwa(\gamma_k Rwa(c)) \tag{3.5} \]
3.3 Precomputed Control
In the sections above, the signals fkRwa, fkRwe, fkRra, and fkRre are used
in order to specify which register is read or written. The functions that
calculate these signals can take an arbitrary number of registers as input
just as the functions fkR.
Consider a write enable signal of stage 4 in a five stage pipeline. Let
this write enable signal depend on an instruction word that is calculated by
stage 0. In order to read this instruction word in stage 4, one has to add
buffer registers for the stages 1 to 4. These registers are quite expensive. In
order to save hardware cost, one can calculate the value of the write enable
signal already in stage 0 or 1, thus saving the buffer registers.
In order to get the value of the write enable signal, registers for the write
enable signal are added instead. However, this requires only a one-bit
register for each stage. This is less expensive than the registers for the
full instruction word. This can be also done for other signals such as the
read/write address.
This method is called precomputed control [PH94, MP00]. If the value
of a control signal is calculated as described above, it is said that the signal
is precomputed.
Naming Convention Let s be the name of a precomputed control signal,
e.g., f_4GPRwe. The registers added in order to store the value of the signal
will be named s.k with k being the number of the stage the register is an
input of, i.e., s.k ∈ out(k-1). All registers containing precomputed control
are summarized by the register P.k.
For example, the registers containing the precomputed versions of the
write enable signal f_4GPRwe are called f_4GPRwe.1, f_4GPRwe.2, and so
on. Note that these registers are treated like implementation registers. In
particular, there are corresponding functions f_kR for each precomputed
signal R.
3.4 Implementing the Prepared Sequential DLX
3.4.1 Structure
The prepared sequential machine DLXσ is the first approach to implement
the DLX defined in chapter 2. The execution of an instruction in the DLXσ
is done in five stages. The organization of the stages is similar to the
pipeline of a MIPS R2000/R3000 [KH92] and is also used in [HP96, MP00]:
• In stage 0 (IF), the instruction fetch is done.
• In stage 1 (ID), the instruction word is decoded and the operands of
  the instruction are fetched.
• In stage 2 (EX), the ALU calculation is done.
• In stage 3 (M), the memory access for load and store instructions is
  done.
• In stage 4 (WB), the result of the instruction is written into the
  register file.
Figure 3.5 shows all registers of the machine DLXσ and the stage they
belong to. As described above, the signals used for precomputed control
are summarized as register P. Furthermore, the main components such as
ALU and memory are depicted. Table 3.2 lists the stages and summarizes
the registers that are written in and read in a given stage, respectively.
Initial Configuration and Transition Function In the initial configura-
tion, the values of specification registers of the DLXσ machine are identical
to the values of the corresponding registers of the specification machine.
The implementation registers are initialized with zero.
The transition function of the DLXσ machine is defined using register
transition functions as described in section 3.2.4.
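[Figure 3.5: block diagram of the stages IF, ID, EX, M, and WB with the
memories IM and DMEM, the register file GPR (signals Aad, Bad, Adata, Bdata),
the ALU, the nextPC and shift4load circuits, the control, and the registers
IR.1 to IR.4, A, B, PC', DPC, C.2 to C.4, MAR.3, MAR.4, MDRw.3, MDRr.4, and
P.2 to P.4.]
Figure 3.5 The prepared sequential DLX. The registers P.k summarize the
registers used for precomputed control.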
Stage  Reads                              Writes
0      DPC                                IR
1      IR, PC', GPR[Aad] = GPRa,          DPC, PC', Aad, Bad, A, B, C
       GPR[Bad] = GPRb
2      IR, A, B                           C, MAR, MDRw
3      IR, MAR, MDRw, C,                  C, MARh, DM[MAR[31:2]], MDRr
       DM[MAR[31:2]]
4      MDRr, MAR, IR                      GPR
Table 3.2 The registers the stages of the prepared sequential machine read and
write, without precomputed control.
3.4.2 The Instruction Fetch Stage
The instruction fetch stage IF reads the delayed PC register DPC uncondi-
tionally and fetches the instruction memory cell that DPC points to. This
value is stored in the only output register of the stage, the IR implemen-
tation register, unconditionally. The register transition function for IR.1
therefore is:
\[ f_0 IR(DPC) = IM[DPC] \tag{3.6} \]
3.4.3 The Instruction Decode Stage
The instruction decode stage ID reads the instruction word in the register
IR and decodes it.
The operand registers of the instruction are read and stored in two
implementation registers A and B. This is realized by means of two conditional
read accesses with read address to GPR. The naming conventions for such
multiport accesses are described in section 3.2.7.
The read enable functions for these read accesses depend on the instruc-
tion word and on the source address of the operand: we need to test the
instruction word in order to determine whether the instruction requires the
operand or not. This is determined by testing the instruction word read
from IR using the functions defined in chapter 2.
In addition to that, we test the address. If the address is zero, the read
access is not necessary, since the GPR register with address zero has the
constant value zero, as required by the DLX specification. If the read en-
able function is not active, the value zero is passed to the register transition
function by convention. This is exactly the value required by the specifi-
cation. Thus, we omit an extra multiplexer in order to get zero in case of a
read access to GPR[0].
The first operand is required by loads, stores, ALU instructions, branch
instructions, and the jump register instructions. Thus, the read access is
performed if the following condition holds:
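\[
\begin{aligned}
f_1 GPRa\,re(IR) = {} & (I\_load(IR) \lor I\_store(IR) \lor I\_ALUi(IR) \lor I\_branch(IR) \lor {} \\
                      & \;\, I\_jr(IR) \lor I\_shift(IR) \lor I\_ALU(IR) \lor I\_shifti(IR)) \\
                      & \land (I\_RS1(IR) \neq 0)
\end{aligned}
\tag{3.7}
\]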
The second operand is required by ALU instructions that do not use the
immediate constant as second argument and by store instructions. In case
of store instructions, the address of the second operand is stored in the RD
location of the instruction word and not in the RS2 location (appendix B).
\[
\begin{aligned}
f_1 GPRb\,re(IR) = {} & (I\_shift(IR) \lor I\_ALU(IR) \lor I\_store(IR)) \\
                      & \land ((I\_store(IR)\ ?\ I\_RD(IR) : I\_RS2(IR)) \neq 0)
\end{aligned}
\tag{3.8}
\]
The indices of the read accesses are calculated from the instruction word
in IR: In case of the first operand, I_RS1 provides the address. In case of
the second operand, it is necessary to distinguish store instructions
from ALU instructions:
\[
\begin{aligned}
f_1 GPRa\,ra(IR) &= I\_RS1(IR) \\
f_1 GPRb\,ra(IR) &= \begin{cases} I\_RD(IR) & : I\_store(IR) \\ I\_RS2(IR) & : \text{otherwise} \end{cases}
\end{aligned}
\tag{3.9}
\]
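The read enable and read address functions (3.7) to (3.9) can be sketched as follows. This is a minimal illustration, not the PVS definition: the `Instr` record, its `kind` field (which assumes that each instruction belongs to exactly one class), and the helper `read_port` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    """Hypothetical decoded instruction word. `kind` stands in for the
    predicates I_load, I_store, ...; rs1, rs2, rd for I_RS1, I_RS2, I_RD."""
    kind: str
    rs1: int = 0
    rs2: int = 0
    rd: int = 0


USES_A = {"load", "store", "ALUi", "branch", "jr", "shift", "ALU", "shifti"}
USES_B = {"shift", "ALU", "store"}


def gpra_ra(ir):
    return ir.rs1


def gprb_ra(ir):
    # stores keep the address of the second operand in the RD field
    return ir.rd if ir.kind == "store" else ir.rs2


def gpra_re(ir):
    return ir.kind in USES_A and gpra_ra(ir) != 0


def gprb_re(ir):
    return ir.kind in USES_B and gprb_ra(ir) != 0


def read_port(gpr, enable, address):
    # a disabled read access passes zero, which is the value of GPR[0]
    return gpr[address] if enable else 0
```

Because a disabled access yields zero, a read of register 0 needs no special case in `read_port`.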
The result of the instruction is buffered in the implementation register C.
In the decode stage, only the result of jump and link instructions is known
already, which is PC'+4. This value is therefore stored in C if the
instruction is a jump and link instruction. This is realized using a conditional
write access to C.2.
\[ f_1 C(PC') = PC' + 4 \tag{3.10} \]
\[ f_1 Cwe(IR) = (I\_jr(IR) \lor I\_j(IR)) \land I\_link(IR) \tag{3.11} \]
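[Figure 3.6: circuit NextPC with an adder (Add) and two multiplexers. Labeled
signals: the constant 4, I_immediate(IR), PC', GPRa, the select signals
bjtaken_imp and I_jr(IR), and the output next_pc_imp(IR, GPRa, PC').]
Figure 3.6 The implementation of next_pc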
Furthermore, the decode stage calculates the new values for the PC
registers DPC and PC' according to the Delayed PC technique. Let GPRa
denote the value of the first operand, as calculated by $g_1 GPRa$.
\[ f_1 DPC(PC') = PC' \tag{3.12} \]
\[ f_1 PC'(IR, GPRa, PC') = next\_pc\_imp(IR, GPRa, PC') \tag{3.13} \]
The function next_pc_imp implements the next_pc calculation as defined
in chapter 2. It is defined in the obvious way using a zero tester in order to
calculate the bjtaken signal, as defined in section 2.4.4:
\[
\begin{aligned}
bjtaken\_imp(IR, GPRa) := {} & I\_j(IR) \lor I\_jr(IR) \lor (I\_branch(IR) \land {} \\
                             & (I\_branch\_eq(IR) \Leftrightarrow zerotester\_imp(GPRa)))
\end{aligned}
\tag{3.14}
\]
By using the correctness of the zero tester (lemma 2.2), one easily shows
the correctness of the circuit that calculates bjtaken:
Lemma 3.3 The calculation of the bjtaken signal is correct:
\[ bjtaken\_imp = bjtaken \]
Using the bjtaken_imp signal and a carry lookahead adder as described
in section 2.2.6, we calculate the new PC (figure 3.6). In case of a jump
register instruction, we take the value of the GPR operand. In case of a
taken branch, we add the old PC and the PC offset from the instruction
word. In any other case, we take the old PC incremented by four.
Lemma 3.4 The calculation of the new PC is correct:
\[ next\_pc\_imp = next\_pc \]
This is shown easily using the correctness of the adder circuit.
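The selection described above can be sketched as follows. This is an illustration only: the 32-bit wrap-around, the argument lists, and the assumption that the immediate constant arrives already sign extended are choices made for this example and are not taken from the thesis.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit data paths


def bjtaken_imp(is_j, is_jr, is_branch, is_branch_eq, gpra):
    """Branch/jump taken signal; `zero` models the output of the zero tester."""
    zero = (gpra & MASK32) == 0
    return is_j or is_jr or (is_branch and (is_branch_eq == zero))


def next_pc_imp(pc, gpra, immediate, is_jr, taken):
    """New PC: the GPR operand for jump register instructions, otherwise the
    old PC plus either the offset from the instruction word or four."""
    if is_jr:
        return gpra & MASK32
    offset = immediate if taken else 4
    return (pc + offset) & MASK32
```

For example, `next_pc_imp(8, 0, 0xFFFFFFFC, False, True)` models a taken branch with offset -4 and yields 4.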
Precomputed Control In addition to that, the decode stage also does
the precomputation of several control signals. The following signals are
precomputed in the decode stage:
• the write enable signal and write address used in the write back
stage,
• the write enable signals of all C registers.
The formulae for these signals are given in the sections of the stages the
registers belong to for sake of simplicity. Note that the circuits calculating
the signal values actually belong to the decode stage. In the later stages,
the signal is just taken from the register holding the precomputed signal
and no calculation is performed.
3.4.4 The Execute Stage
In the execute stage, the result of all ALU instructions is computed. This
includes the integer instructions such as addition and subtraction, the shift-
ing instructions, and the compare instructions. Furthermore, the address
computation for memory instructions is performed.
The stage reads the values of the operands from the implementation
registers A.2 and B.2. However, both the memory instructions and the ALU
instructions with immediate constant (e.g., addi) take the immediate
constant from the instruction word as second operand. Let aluop2 denote the
value of the second operand:
\[
aluop2(IR, B) = \begin{cases} B & : I\_ALU(IR) \lor I\_shift(IR) \\ I\_immediate(IR) & : \text{otherwise} \end{cases}
\tag{3.15}
\]
The function aluop2 is used as a shorthand for this text only; in the PVS
tree, the expanded form is always used.
In case of ALU instructions, the operation that is to be performed is
provided by the function ALU_function. This function is defined in section
2.4.4. In case of memory instructions, as indicated by I_load and I_store,
an addition is performed in order to compute the effective address.
\[
alu\_f(IR) = \begin{cases} (1,0,0,0,0) & : I\_store(IR) \lor I\_load(IR) \\ ALU\_function(IR) & : \text{otherwise} \end{cases}
\]
The implementation register C.3 holds the result provided by the ALU.
It is only written on ALU instructions.
\[ f_2 C(IR, A, B) = ALU(A, aluop2(IR, B), alu\_f(IR)).result \]
\[ f_2 Cwe(IR) = I\_ALU(IR) \lor I\_ALUi(IR) \lor I\_shifti(IR) \lor I\_shift(IR) \]
In case of a memory instruction, the result (i.e., the address of the
memory operand) is stored in the register MAR.3.
\[ f_2 MAR(IR, A, B) = ALU(A, aluop2(IR, B), alu\_f(IR)).result \]
In the register MDRw.3, the second operand is stored, which is the value
to be stored in memory in case of a store instruction.
\[ f_2 MDRw(B) = B \]
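The register transition functions of the execute stage can be sketched as follows. The sketch makes several assumptions that are not from the thesis: instructions are identified by a hypothetical `kind` string, the ALU is replaced by a toy model with two operations, and the string "add" stands in for the function code (1,0,0,0,0).

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit data paths


def aluop2(kind, b, immediate):
    """Second ALU operand as in equation (3.15)."""
    return b if kind in ("ALU", "shift") else immediate


def toy_alu(a, b, function):
    """Stand-in for the ALU of chapter 2; only two operations are modeled."""
    if function == "add":
        return (a + b) & MASK32
    if function == "sub":
        return (a - b) & MASK32
    raise ValueError("operation not modeled in this sketch")


def execute_stage(kind, alu_function, a, b, immediate, c_in):
    """Returns the values of the output registers C.3, MAR.3, and MDRw.3."""
    # memory instructions force an addition to form the effective address
    function = "add" if kind in ("load", "store") else alu_function
    result = toy_alu(a, aluop2(kind, b, immediate), function)
    c_we = kind in ("ALU", "ALUi", "shift", "shifti")
    # with an inactive write enable, C is passed on from the previous stage
    return {"C": result if c_we else c_in, "MAR": result, "MDRw": b}
```

A call such as `execute_stage("load", None, 100, 7, 8, 55)` computes the effective address 108 in MAR and leaves C at its incoming value.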
3.4.5 The Memory Stage
In the memory stage, the memory access for load and store instructions is
performed. In order to realize load instructions, a conditional read access
with read address to DM is performed. The read access is performed iff
the instruction is a load instruction. The read address is the high-order 30
bits of the effective memory address stored in MAR.
\[ f_3 DMre(IR) = I\_load(IR) \]
\[ f_3 DMra(MAR) = MAR[31:2] \]
The result of the read access, as provided by $g_3 DM$, is named DMemout.
This result is stored in the register MDRr.
\[ f_3 MDRr(DMemout) = DMemout \]
The register C is passed to the next stage without modification. The
write enable function of the write access to C.4 is therefore constantly false
(compare the definition of conditional write accesses on page 43).
\[ f_3 Cwe(IR) = 0 \]
In order to realize store instructions, a conditional write access to DM
is performed. The value read from the register MDRw is written iff the
instruction is a store instruction. The address of the write access is the
upper 30 bits of the effective memory address, as above.
\[ f_3 DMwe(IR) = I\_store(IR) \]
\[ f_3 DMwa(MAR) = MAR[31:2] \]
\[ f_3 DM(MDRw) = MDRw \]
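The two conditional accesses to the data memory can be sketched as follows. The dictionary model of DM, the default value 0 for unwritten cells, and the function name are assumptions of this example.

```python
def memory_stage(is_load, is_store, mar, mdrw, dm):
    """Model of the accesses to DM in stage 3.

    `dm` maps 30-bit word addresses to data words. Returns DMemout and the
    updated memory; the input dictionary is left unchanged."""
    word_address = (mar >> 2) & 0x3FFFFFFF      # MAR[31:2]
    # a disabled read access passes zero by convention
    dmem_out = dm.get(word_address, 0) if is_load else 0
    new_dm = dict(dm)
    if is_store:
        new_dm[word_address] = mdrw
    return dmem_out, new_dm
```

For instance, a load from byte address 9 reads the word at word address 2, since the two low-order address bits are not part of the read address.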
3.4.6 The Write Back Stage
In the write back stage, the result of the instruction is stored in the register
file. In case of a load instruction, the data word fetched from the data
memory present in MDRr is shifted and masked prior to write back. This
is done using the function shift4load, which is defined in section 2.4.4.
In case of any other instruction, the result is read from the implementation
register C.
\[
f_4 GPR(C, IR, MAR, MDRr) = \begin{cases} shift4load(MAR, MDRr, IR) & : I\_load(IR) \\ C & : \text{otherwise} \end{cases}
\]
The write access to the register file is conditional; the condition is that
the instruction has a GPR destination operand. This is true for ALU/shift
instructions, loads, and jump and link instructions. Thus, the write enable
signal is:
\[
\begin{aligned}
f_4 GPRwe(IR) = {} & I\_ALU(IR) \lor I\_ALUi(IR) \lor I\_load(IR) \lor {} \\
                   & I\_shifti(IR) \lor I\_shift(IR) \lor {} \\
                   & ((I\_j(IR) \lor I\_jr(IR)) \land I\_link(IR))
\end{aligned}
\]
Furthermore, the write access has a write address. As defined in chapter
2, the function I_RD(IR) determines the address of the destination register.
\[ f_4 GPRwa(IR) = I\_RD(IR) \]
Both the write enable and the write address signals are precomputed in
the decode stage as described in section 3.3, i.e., the calculation is done in
the decode stage and the result is buffered using additional registers. In the
write back stage, one just takes the values from the registers.
Note that the cost savings of precomputing these signals are low in the
prepared sequential machine. However, in the pipelined machine presented
in the next chapter, we will need the signals in multiple stages for
forwarding. In this case, precomputing the signals saves a significant amount of
hardware since the computation has to be done only once. In order to
avoid changing the machine when it is pipelined, we
already introduce the precomputed control in this chapter.
3.5 Data Consistency Proof
3.5.1 Properties of the Full Bits
Using the equations for the full bits and update enable signals, it is easy to
conclude the following properties:
Lemma 3.5 If a stage is full, either the same or the previous stage was full
in the previous cycle.
\[ full_k^{T+1} \implies full_k^T \lor full_{prev(k)}^T \]
PROOF By lemma 3.2 (page 39), one can conclude from $full_k^{T+1}$ that
$ue_{prev(k)}^T$ or $stall_k^T$ holds. If $ue_{prev(k)}^T$ holds, $full_{prev(k)}^T$
holds by definition of the ue signals. If $stall_k^T$ holds, $full_k^T$ holds by
convention 3.1.
Lemma 3.6 If a stage is full, either the same or the next stage is full in the
next cycle.
\[ full_k^T \implies full_k^{T+1} \lor full_{next(k)}^{T+1} \]
PROOF Assume neither $full_k^{T+1}$ nor $full_{next(k)}^{T+1}$ holds. Using lemma 3.2
for stages k and next(k), one concludes that neither $ue_{prev(k)}^T$, nor
$ue_{prev(next(k))}^T$, nor $stall_k^T$, nor $stall_{next(k)}^T$ holds.
Since $full_k^T$ holds and $stall_k^T$ does not hold, it is concluded by
definition of $ue_k^T$ that $ue_k^T$ holds. An easy proof shows
that prev(next(k)) = k. This allows concluding that $full_{next(k)}^{T+1}$ holds,
which is a contradiction to the assumption.
Lemma 3.7 If a stage k becomes full in cycle T+1, the previous stage prev(k)
was full in the previous cycle and the output registers of the stage prev(k)
were updated.
\[ \lnot full_k^T \land full_k^{T+1} \implies full_{prev(k)}^T \land ue_{prev(k)}^T \]
PROOF By lemma 3.2, $ue_{prev(k)}^T$ or $stall_k^T$ holds. By convention 3.1,
$stall_k^T$ cannot hold because stage k is not full in cycle T. Thus,
$ue_{prev(k)}^T$ is concluded. The first claim, $full_{prev(k)}^T$, is concluded by
definition of ue.
Lemma 3.8 In every cycle, exactly one stage is full.
\[ \exists! k : full_k^T \]
PROOF The proof proceeds by induction on T. For T = 0, the claim obviously
holds. For T+1, one has to prove that at least one full bit is set and that
this full bit is unique.
It is easy to show that at least one full bit is set by a case split on the
stall signal of the stage whose full bit is set in cycle T. Let stage k be this
stage. If the stage is stalled, the full bit of the same stage is set in cycle
T+1. If the stage is not stalled, the full bit of stage next(k) is set in
cycle T+1.
This full bit is unique, i.e., $full_x^{T+1} \land full_y^{T+1}$ implies that x is
equal to y. Assume that $x \neq y$ holds. According to lemma 3.5, there are four
cases:
1. $full_x^T \land full_y^T$
2. $full_{prev(x)}^T \land full_y^T$
3. $full_x^T \land full_{prev(y)}^T$
4. $full_{prev(x)}^T \land full_{prev(y)}^T$
Cases 1 and 4 are disproved by the induction premise. Let case
2 hold (otherwise, swap x and y). According to the induction premise,
y = prev(x) must hold. Using lemma 3.7, $ue_y^T$ is concluded, which is equal
to $full_y^T \land \lnot stall_y^T$ by definition.
Since $prev(y) \neq y$, and because of the induction premise, $\lnot full_{prev(y)}^T$
holds. This allows concluding that $\lnot ue_{prev(y)}^T$ holds. Since $full_y^{T+1}$ is
active, this is a contradiction to $\lnot stall_y^T$ according to lemma 3.2.
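Lemmas 3.5 and 3.8 can be checked on random runs with a small simulation of the full bits. The sketch below is not the PVS proof and assumes several things: prev(0) is the last stage, only stage 0 is full in cycle 0, stalls are drawn at random for the full stage only (so that a stall implies a full stage, as in convention 3.1), and all function names are invented for this example.

```python
import random


def next_full(full, stall):
    """One cycle of the full bits: ue_k = full_k and not stall_k, and the new
    full bit of stage k is ue_prev(k) or stall_k (lemma 3.2)."""
    n = len(full)
    ue = [full[k] and not stall[k] for k in range(n)]
    return [ue[(k - 1) % n] or stall[k] for k in range(n)]


def simulate_full_bits(n, cycles, stall_probability=0.3, seed=0):
    rng = random.Random(seed)
    full = [k == 0 for k in range(n)]       # only stage 0 is full in cycle 0
    history = [full]
    for _ in range(cycles):
        stall = [full[k] and rng.random() < stall_probability for k in range(n)]
        full = next_full(full, stall)
        history.append(full)
    return history


def exactly_one_full(history):
    """Lemma 3.8 on a recorded run."""
    return all(sum(full) == 1 for full in history)


def lemma_3_5_holds(history):
    """A full stage or its predecessor was full in the previous cycle."""
    n = len(history[0])
    return all(
        (not new[k]) or old[k] or old[(k - 1) % n]
        for old, new in zip(history, history[1:])
        for k in range(n)
    )
```

Such a run only tests the properties for the sampled stall patterns; it does not replace the inductive argument.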
3.5.2 Scheduling Functions
Unless stalled, the implementation machine calculates parts of the
configurations $c_S^0, c_S^1, \ldots$ of the specification machine. A scheduling function
[MP00] specifies which configuration is being calculated by the machine
in a given stage and cycle. If stage k is full during cycle T, let
\[ sI(k, T) = i \]
denote that the implementation machine is performing a part of the
computation of configuration $c_S^{i+1}$ in stage k during cycle T. (The function
is named sI and not I, as in [MP00], because I is used as the identity in
PVS.)
In case of a microprocessor, let
\[ I_0, I_1, I_2, \ldots \]
denote an instruction sequence. In this case, the configuration $c_S^{i+1}$ of the
specification machine provides the values of the registers after executing
instruction $I_i$, i.e., instruction $I_i$ transforms configuration $c_S^i$ into
$c_S^{i+1}$:
\[ c_S^0 \xrightarrow{I_0} c_S^1 \xrightarrow{I_1} c_S^2 \cdots c_S^i \xrightarrow{I_i} c_S^{i+1} \cdots \]
This is different from the notation used in [MP00]. In [MP00], ci+1
denotes a value before the execution of Ii+1.
If sI(k, T) = i holds and if stage k is full during cycle T, it is said that
instruction i is in stage k during cycle T [MP00].
For this thesis, the domain of the function above is extended to cycles T
in which the stage k is not full in order to simplify some proofs. If the stage
k was never full before cycle T, sI(k, T) is supposed to be zero. If the
stage k was full before cycle T, the supposed value of the function sI(k, T)
is defined using the value the function had in the last cycle T' < T such
that $full_k^{T'}$ holds. In this case, sI(k, T) is supposed to be sI(k, T') + 1 in
anticipation of the next instruction in the stage. In contrast to the definition
of the scheduling function in [MP00], such a scheduling function sI is total.
A scheduling function of the prepared sequential machine is constructed
as follows. The following properties of the scheduling function should
obviously hold:
1. During cycle 0, all stages are in the initial configuration:
\[ \forall k : sI(k, 0) = 0 \]
2. If the output registers of a stage k are not updated during cycle T-1
(i.e., $ue_k^{T-1} = 0$), the stage was either not full or stalled. The stage
was inactive; the value of the scheduling function should not change
either.
\[ ue_k^{T-1} = 0 \implies sI(k, T) = sI(k, T-1) \]
3. If the output registers of a stage k are updated during cycle T-1
(i.e., $ue_k^{T-1} = 1$), the registers are updated with values of the same
configuration that is in the previous stage, i.e., stage k-1. The
scheduling function must reflect this.
\[ k \geq 1 \land ue_k^{T-1} = 1 \implies sI(k, T) = sI(k-1, T-1) \]
In case of the first stage (k = 0), the computation of the next
configuration of the specification machine is started:
\[ ue_0^{T-1} = 1 \implies sI(0, T) = sI(0, T-1) + 1 \]
           T = 0  T = 1  T = 2  T = 3  T = 4  T = 5  T = 6
sI(0, T)     0      1      1      1      1      2      2
sI(1, T)     0      0      1      1      1      1      2
sI(2, T)     0      0      0      1      1      1      1
sI(3, T)     0      0      0      0      1      1      1
Table 3.3 The values of sI in a four stage sequential machine in the absence of
stalls
This allows for a recursive definition of the scheduling function of the
prepared sequential machine:
\[
sI(k, T) = \begin{cases}
0 & : T = 0 \\
sI(k, T-1) & : T \neq 0 \land \lnot ue_k^{T-1} \\
sI(0, T-1) + 1 & : T \neq 0 \land ue_k^{T-1} \land k = 0 \\
sI(k-1, T-1) & : T \neq 0 \land ue_k^{T-1} \land k \neq 0
\end{cases}
\]
Table 3.3 illustrates the values of sI(k, T) for the first seven cycles
assuming four stages and that the stall signals are never active.
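The recursive definition can be transcribed directly. The following sketch takes the update enable signals of each cycle as input and computes the rows of sI; the function name and the list representation are assumptions of this example. Without stalls, exactly the stage T mod 4 is full and updated in cycle T of a four stage machine, which is used here to reproduce Table 3.3.

```python
def scheduling_function(ue_rows, n):
    """ue_rows[T][k] is the update enable signal of stage k in cycle T.
    Returns rows with rows[T][k] = sI(k, T) for T = 0 .. len(ue_rows)."""
    rows = [[0] * n]                       # sI(k, 0) = 0 for all stages
    for ue in ue_rows:
        previous = rows[-1]
        rows.append([
            previous[k] if not ue[k]       # stage not updated: value unchanged
            else previous[0] + 1 if k == 0 # first stage starts the next configuration
            else previous[k - 1]           # other stages take over from stage k-1
            for k in range(n)
        ])
    return rows


n = 4
# no stalls: the single full stage in cycle T is stage T mod n
ue_rows = [[k == T % n for k in range(n)] for T in range(6)]
table = scheduling_function(ue_rows, n)
```

Reading `table` column by column gives the rows of Table 3.3, e.g., `[row[0] for row in table]` is the row for sI(0, T).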
3.5.3 Properties of the Scheduling Function
Invariant 3.1 If the update enable signal of a stage is active in cycle T-1,
the value of the scheduling function for that stage increases by one. If the
update enable signal of a stage is not active, the value does not change. For
T > 0:
\[
sI(k, T) = \begin{cases}
sI(k, T-1) & \text{if } ue_k^{T-1} = 0 \\
sI(k, T-1) + 1 & \text{if } ue_k^{T-1} = 1
\end{cases}
\]
Invariant 3.2 Given a cycle T, the values of the scheduling functions of two
adjacent stages are either equal or the value of the scheduling function of the
earlier stage is greater by one.
Invariant 3.3 The value of the scheduling function of the earlier stage is
greater by one iff the full bit of the later stage is set. For k > 0:
\[ full_k^T = 1 \iff sI(k-1, T) = sI(k, T) + 1 \]
Negating both sides of the last equation and applying invariant 3.2 results
in:
\[ full_k^T = 0 \iff sI(k-1, T) = sI(k, T) \]
PROOF The proof of the invariants proceeds by induction. Let $P_i(T)$ denote
that invariant 3.i holds for cycle T. The claim is concluded as follows:
Invariant 3.1 for cycle T is shown using invariant 3.3 for cycle T-1.
Invariant 3.2 for cycle T is shown using invariant 3.1 in cycle T and
invariants 3.2 and 3.3 in cycle T-1. Invariant 3.3 for cycle T is shown using
invariant 3.1 in cycle T and invariants 3.2 and 3.3 in cycle T-1.
\[ P_3(T-1) \implies P_1(T) \]
\[ P_1(T) \land P_2(T-1) \land P_3(T-1) \implies P_2(T) \]
\[ P_1(T) \land P_2(T-1) \land P_3(T-1) \implies P_3(T) \]
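The three invariants can also be stated as an executable check on a recorded run. The function below is a sketch with assumed names and an assumed list representation: `full_rows[T][k]`, `ue_rows[T][k]`, and `sI_rows[T][k]` hold the full bit, the update enable signal, and the scheduling function value of stage k in cycle T.

```python
def invariants_hold(full_rows, ue_rows, sI_rows):
    """Check invariants 3.1 to 3.3 for all cycles T > 0 of a recorded run."""
    n = len(sI_rows[0])
    for T in range(1, len(sI_rows)):
        for k in range(n):
            step = 1 if ue_rows[T - 1][k] else 0
            if sI_rows[T][k] != sI_rows[T - 1][k] + step:    # invariant 3.1
                return False
        for k in range(1, n):
            difference = sI_rows[T][k - 1] - sI_rows[T][k]
            if difference not in (0, 1):                      # invariant 3.2
                return False
            if bool(full_rows[T][k]) != (difference == 1):    # invariant 3.3
                return False
    return True
```

Applied to the values of Table 3.3, where the update enable signals equal the full bits because no stage is stalled, the check succeeds.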
Proof of Invariant 3.1 The claim for the case $ue_k^{T-1} = 0$ holds by
definition of sI. Let $ue_k^{T-1} = 1$ hold. For the case k = 0, the claim follows
from the definition of sI. For k > 0, the claim is:
\[ sI(k, T) = sI(k, T-1) + 1 \]
According to the definition of sI(k, T), this is equivalent to:
\[ sI(k-1, T-1) = sI(k, T-1) + 1 \]
According to invariant 3.3 for cycle T-1, this is equivalent to
$full_k^{T-1} = 1$. This is true because of the definition of $ue_k^{T-1}$.
Proof of Invariant 3.2 For cycle T = 0, the claim holds by definition of
sI(k, 0).
For T > 0, let us consider the stages k-1 and k with k > 0. There are
four cases regarding the update enable signals $ue_k^{T-1}$ and $ue_{k-1}^{T-1}$ of
these stages:
1. Let both update enable signals be active. According to the definition
of the update enable signals, both stages are full. This is a contradiction
to the fact that at most one full bit is active in a given cycle
(lemma 3.8).
2. Let both update enable signals be not active. According to invariant
3.1, the values of the scheduling function do not change, and the
claim therefore follows from invariant 3.2 for cycle T-1.
3. Let the update enable signal of stage k be active and the update
enable signal of stage k-1 be not active. Let the first case given by
invariant 3.2 for cycle T-1 hold:
\[ sI(k-1, T-1) = sI(k, T-1) \]
Using invariant 3.1 for stage k on the right-hand side, one concludes:
\[ sI(k-1, T-1) = sI(k, T) - 1 \]
According to the definition of sI(k, T), this is equal to:
\[ sI(k-1, T-1) = sI(k-1, T-1) - 1 \]
This is a contradiction. The case above therefore never happens.
Let the second case given by invariant 3.2 for cycle T-1 hold, i.e.,
\[ sI(k-1, T-1) = sI(k, T-1) + 1 \]
holds. Using invariant 3.1 for both stages k and k-1,
sI(k-1, T) = sI(k, T) is concluded.
4. Let the update enable signal of stage k be not active and the update
enable signal of stage k-1 be active. Let the first case given by
invariant 3.2 for cycle T-1 hold, i.e., sI(k-1, T-1) is equal to
sI(k, T-1). Using invariant 3.1, sI(k-1, T) = sI(k, T) + 1 is
concluded.
Let the second case given by invariant 3.2 for cycle T-1 hold, i.e.,
sI(k-1, T-1) = sI(k, T-1) + 1 holds. According to invariant 3.3,
$full_k^{T-1}$ holds. According to the definition of the update enable
signals, $full_{k-1}^{T-1}$ also holds. This is a contradiction to lemma 3.8.
Proof of Invariant 3.3 For T = 0, the claim is shown using the definition
of sI. For T > 0, according to lemma 3.2, the claim is equivalent to:
\[ ue_{prev(k)}^{T-1} \lor stall_k^{T-1} \iff sI(k-1, T) = sI(k, T) + 1 \]
Since prev(k) = k-1 for all k > 0, this is equivalent to:
\[ ue_{k-1}^{T-1} \lor stall_k^{T-1} \iff sI(k-1, T) = sI(k, T) + 1 \]
The proof proceeds by a full case split on the values of the update enable
bits $ue_{k-1}^{T-1}$ and $ue_k^{T-1}$, as done in the proof of invariant 3.2. There
are four cases:
1. If both update enable signals are on, this is a contradiction to the fact
that at most one full bit is on (lemma 3.8).
2. If $ue_{k-1}^{T-1}$ is on and $ue_k^{T-1}$ is off, the left side of the equivalence
evaluates to true and the claim is equal to:
\[ sI(k-1, T) = sI(k, T) + 1 \]
Invariant 3.1 for cycle T and stages k-1 and k is used to show that
the claim is equal to:
\[ sI(k-1, T-1) + 1 = sI(k, T-1) + 1 \]
Obviously, this claim is equal to:
\[ sI(k-1, T-1) = sI(k, T-1) \]
Assume this claim does not hold. In this case, invariant 3.2 states
that
\[ sI(k-1, T-1) = sI(k, T-1) + 1 \]
holds. According to invariant 3.3 for cycle T-1, this implies that
$full_k^{T-1}$ holds. Since $full_{k-1}^{T-1}$ also holds because of the definition
of $ue_{k-1}^{T-1}$, this is a contradiction to the fact that at most one full bit
is on (lemma 3.8).
3. If $ue_{k-1}^{T-1}$ is off and $ue_k^{T-1}$ is on, it is left to show that
\[ stall_k^{T-1} \iff sI(k-1, T) = sI(k, T) + 1 \]
holds. Using invariant 3.1 for stages k-1 and k and cycle T, one
shows that this is equal to:
\[ stall_k^{T-1} \iff sI(k-1, T-1) = sI(k, T-1) + 2 \]
According to the definition of the update enable signal $ue_k^{T-1}$, the
stall signal $stall_k^{T-1}$ cannot be active. Invariant 3.2 shows that
sI(k-1, T-1) = sI(k, T-1) + 2 never holds. Thus, both sides of the
equivalence are false.
4. If both update enable signals are off, invariant 3.1 shows that the
claim is equal to:
\[ stall_k^{T-1} \iff sI(k-1, T-1) = sI(k, T-1) + 1 \]
Using invariant 3.3 for cycle T-1, one shows:
\[ sI(k-1, T-1) = sI(k, T-1) + 1 \iff full_k^{T-1} \]
Thus, the claim is equivalent to:
\[ stall_k^{T-1} \iff full_k^{T-1} \]
By definition of $ue_k^{T-1}$, $full_k^{T-1}$ implies $stall_k^{T-1}$ if $ue_k^{T-1}$ is
off. The opposite direction is given by convention 3.1. QED
Invariant 3.3 can be extended to multiple stages inductively, which
results in the following claim:
Lemma 3.9 Let k and l be stage numbers and l > k. If the full bit of all stages
between k and l (including stage l, not including stage k) is not set, the
scheduling functions for stage k and l are equal:
\[ (\forall m \mid m > k \land m \leq l : \lnot full_m^T) \implies sI(k, T) = sI(l, T) \]
PROOF The claim is shown by induction on l using invariant 3.3.
Lemma 3.10 Let stage k be full in cycle T. In this case, stages after stage k
contain the values of the same configuration and stages prior to stage k
contain the values of the next configuration.
\[
T > 0 \land full_k^T \implies sI(l, T) = \begin{cases}
sI(k, T) + 1 & : l < k \\
sI(k, T) & : \text{otherwise}
\end{cases}
\]
This lemma is the central lemma for showing the correctness of the
operands read. The lemma is almost identical to the dateline lemma
presented in [MP00].
Lemma 3.10 is illustrated by figure 3.7: Let $full_2^T$ hold and sI(2, T) be
i. In this case, the output registers of the stages 0 and 1 already contain the
values of configuration $c^{i+1}$. The stages 2, 3, and so on still contain the
values of configuration $c^i$.
[Figure 3.7: the stages k = 0 to 4 with their full bits, their output
registers R.1 to R.5, and the configuration ($c^i$ or $c^{i+1}$) whose values
each stage contains.]
Figure 3.7 Calculation of the configurations in the sequential prepared machine.
In the current cycle, instruction $I_i$ is in stage 2.
PROOF For l = k, the claim is obvious. For l < k, the claim is
\[ sI(l, T) = sI(k, T) + 1 \]
According to invariant 3.3, sI(k-1, T) = sI(k, T) + 1 holds, which shows
the claim for l = k-1. For l < k-1, lemma 3.8 states that the full bits of
the stages l+1 to k-1 are not set. Thus, lemma 3.9 can be used in order to
show sI(l, T) = sI(k-1, T), which shows the claim.
For l > k, lemma 3.8 states that the full bits of these stages are not set
either. Lemma 3.9 shows the claim. QED
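For a single cycle, the statement of lemma 3.10 can be written as a short check. The function name and the list arguments are assumptions of this sketch; `full[k]` and `sI[k]` are the full bit and the scheduling function value of stage k in the cycle under consideration.

```python
def dateline_holds(full, sI):
    """Check lemma 3.10 for one cycle in which exactly one stage is full."""
    k = full.index(True)                   # the full stage
    return all(
        sI[l] == sI[k] + (1 if l < k else 0)
        for l in range(len(sI))
    )
```

With the values of Table 3.3 for T = 5, stage 1 is full, stage 0 is one configuration ahead, and the stages 2 and 3 agree with stage 1, so `dateline_holds([False, True, False, False], [2, 1, 1, 1])` holds.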
Lemma 3.11 Stage k is full at the earliest in cycle k.
\[ full_k^T \implies T \geq k \]
PROOF The proof proceeds by induction over T. For T = 0, the claim is
concluded by the fact that during cycle 0, only the full signal $full_0$ is active.
Assuming the lemma for cycle T, the claim for T+1 is shown as follows:
For k = 0, the claim is obvious. Thus, the claim is shown for k > 0.
If $full_k^T$ or $full_{k-1}^T$ holds, one simply uses the induction premise. If
$full_k^T$ and $full_{k-1}^T$ do not hold, one shows that $full_k^{T+1}$ cannot hold
using lemma 3.2:
\[ full_k^{T+1} = ue_{k-1}^T \lor stall_k^T \]
Applying the definition of $ue_{k-1}^T$, this results in:
\[ full_k^{T+1} = (full_{k-1}^T \land \lnot stall_{k-1}^T) \lor stall_k^T = stall_k^T \]
According to convention 4.2, $stall_k^T$ cannot be active. QED
3.5.4 Data Consistency Proof Strategy
The correctness criterion for the machines presented in this thesis is based
on the scheduling function: the values of the specification registers of the
implementation machine must match the values of the corresponding
registers in the specification machine. Given a stage k and a cycle T, the
scheduling function provides the configuration of the specification
machine to compare with.
Thus, the correctness of the complete machine is asserted in the
following theorem:
Theorem 3.12 The value of a given specification register $R \in out(k)$ during
cycle T in the implementation machine must match the value of the same
register in the specification machine in the configuration $c_S^i$ with
i = sI(k, T).
\[ R_I^T = R_S^i \]
This data consistency criterion is taken literally from [MP00] but with an
index shift. This index shift arises from a notational difference: in [MP00],
$R_i$ denotes the value of R after the execution of $I_i$. In this thesis, $R_S^i$
denotes $c_S^i.R$, which is the value of R before the execution of $I_i$. This
difference can be adjusted by taking $R_S^{i+1}$. Thus, the correctness criterion
of [MP00] in the notation of this thesis is:
\[ R_I^{T+1} = R_S^{i+1} \]
Furthermore, in [MP00], the criterion is shown for cycles T with $ue_k^T$
only. Using invariant 3.1, one can conclude that for this case
sI(k, T+1) = i+1 holds. Inserting this into the equation above results in:
\[ R_I^{T+1} = R_S^{sI(k, T+1)} \]
This is exactly the correctness criterion as given above, except that the
criterion in [MP00] does not cover the values of the registers during cycle
0 (initial configuration).
The proof of the correctness criterion proceeds by induction on T. For
the PVS tree, an automated tool developed by the author generates this
proof. In the following, the generic algorithm used in order to generate the
proof is described.
Step 1 For all implementation registers R of the implementation
machine, a function $\Omega_k R(c)$ is defined. This function maps a configuration
c of the specification machine to the domain W(R) of the register R and
provides the “correct” value of R. It is not necessary to define this function
for specification registers, since the correct value of a specification register
is defined by the specification machine.
For intuition, take the prepared sequential machine and remove all
implementation registers. The inputs of the registers are connected to the
outputs (figure 3.8). The remaining specification registers share a common
clock. This machine processes one configuration of the specification
machine with each cycle unless stalled. The configuration set of this machine
exactly matches the configuration set of the specification machine. Let c be
such a configuration. In this machine, one can get the value of
$\Omega_{k-1} R'(c)$ right at the point where the register R'.k formerly was.
Formally, the functions $\Omega_k R$ are defined recursively: in analogy to g
and γ (section 3.2.4), functions G and Γ are defined, which provide the
correct input values for a register transition function f. In analogy to the
function ω, the function Ω is defined. The definition of Ω is identical to
the definition of ω except that Γ is used instead of γ.
\[ G_k R : C_S \to W(R) \]
Let $G_k(c, (R'_1, R'_2, \ldots, R'_j))$ denote a j-tuple of values calculated as
follows:
\[ G_k(c, (R'_1, R'_2, \ldots, R'_j)) = (G_k R'_1(c), G_k R'_2(c), \ldots, G_k R'_j(c)) \]
Let Γ be a function that maps a configuration of the specification
machine to the correct input values of a register transition function. Let
dep(R, k) be $(R'_1, \ldots, R'_i)$.
\[ \Gamma : C_S \to W(R'_1) \times \cdots \times W(R'_i) \]
\[ \Gamma_k R(c) = G_k(c, dep(R, k)) \]
The functions $G_k R$ can now be specified recursively: If R is a
specification register, $G_k R$ is:
\[ G_k R(c) = c.R \]
This allows a straightforward definition of $G_k R$ if R is an implementation
register. Since an instance of R must be in the previous stage, the correct
value of R.k is used, i.e., $\Omega_{k-1} R(c)$:
\[ G_k R(c) = \Omega_{k-1} R(c) \]
[Figure 3.8: two diagrams for the stages k and k+1. "Sequential prepared
machine": the registers R'_1.k and R'_2.k (clocked by $ue_{k-1}$, inputs
$\omega_{k-1} R'_1$ and $\omega_{k-1} R'_2$), the functions $g_k R'_1$ and $g_k R'_2$,
$\gamma_k R = g_k(c_I, dep(R, k))$, $f_k R$, $\omega_k R$, and the register R.(k+1)
(clocked by $ue_k$). "Machine without implementation registers":
$\Omega_{k-1} R'_1$ and $\Omega_{k-1} R'_2$, $G_k R'_1$ and $G_k R'_2$,
$\Gamma_k R = G_k(c_S, dep(R, k))$, $f_k R$, $\Omega_k R$, and R.(k+1).]
Figure 3.8 Relationship between $\omega_k R$, $g_k R'$ and $\Omega_k R$, $G_k R'$,
depicted for two stages k and k+1. Let $R'_1$ and $R'_2$ be implementation
registers and let R be an implementation register that depends on $R'_1$ and
$R'_2$. The read accesses to $R'_1$ and $R'_2$ are unconditional.
Remember that $f\Gamma_k Rre(c)$ is just a shorthand for
\[ f_k Rre(\Gamma_k Rre(c)) \]
as described in section 3.2.8 (page 51).
For a conditional read access to a specification register, $G_k R$ is defined
as follows:
\[
G_k R(c) = \begin{cases} c.R & : f\Gamma_k Rre(c) \\ 0 & : \text{otherwise} \end{cases}
\]
For a read access with read address to a specification register, $G_k R$ is
defined as follows:
\[
G_k R(c) = \begin{cases} c.R[f\Gamma_k Rra(c)] & : f\Gamma_k Rre(c) \\ 0 & : \text{otherwise} \end{cases}
\]
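The two case distinctions for specification registers can be sketched in a few lines. The dictionary representation of a configuration and the function name are assumptions of this example; the read enable and read address values are passed in already evaluated.

```python
def correct_input_value(config, register, read_enable, read_address=None):
    """Value G_k R(c) for a specification register R.

    `config` maps register names to values (or to lists of values for
    register files and memories). A disabled read access yields zero."""
    if not read_enable:
        return 0
    value = config[register]
    return value if read_address is None else value[read_address]
```

For example, with `config = {"GPR": [0, 7, 9]}` an enabled read with address 2 yields 9, while a disabled read yields 0.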
Step 2 A set of lemmas is claimed here and proven later. For each specification register $R \in out(k)$, one lemma is used:
Using the correct input values, the register transition function $f_kR$ provides the correct output value.
Let the inputs be calculated using values from configuration $c^i_S$. In this case, the output values can be found in configuration $c^{i+1}_S$ of the specification machine.
These lemmas assert the correctness of the non-scheduled implementation described above, i.e., the sequential machine that performs the calculation of a configuration of the specification machine within one transition without implementation registers. The lemmas are therefore called register transition function correctness lemmas.
Lemma 3.13 If the write access to the register is neither conditional nor has a write address, the claim is:
\[ c^{i+1}_S.R = f\Gamma_kR(c^i_S) \]
If the write access to the register is conditional, the value of the transition
function is used only if the write enable signal is active. If the write enable
signal is not active, one takes the value from the previous configuration.
The claim therefore is:
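\[ c^{i+1}_S.R = \begin{cases} f\Gamma_kR(c^i_S) & : f\Gamma_kR\mathit{we}(c^i_S) \\ c^i_S.R & : \text{otherwise} \end{cases} \]
In case of a write access with write address, the claim is, for all possible write addresses x:
\[ c^{i+1}_S.R[x] = \begin{cases} f\Gamma_kR(c^i_S) & : f\Gamma_kR\mathit{we}(c^i_S) \wedge f\Gamma_kR\mathit{wa}(c^i_S) = x \\ c^i_S.R[x] & : \text{otherwise} \end{cases} \]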
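The three claims of the register transition function correctness lemmas can be read as a description of how the next specification value of a register is obtained. The following Python sketch mirrors that reading; the function names and the list model of an addressed register are hypothetical and chosen only for illustration.

```python
from typing import List


def next_register_value(old: int, new: int, write_enable: bool = True) -> int:
    """Next value of a plain register: the transition function result if the
    write enable is active (always, for an unconditional write), else the
    value from the previous configuration."""
    return new if write_enable else old


def next_register_file(old_file: List[int], new: int,
                       write_enable: bool, write_address: int) -> List[int]:
    """Next value of an addressed register: only the entry selected by the
    write address changes, and only if the write enable is active."""
    return [new if (write_enable and x == write_address) else old
            for x, old in enumerate(old_file)]
```

With this model, a disabled write leaves every address unchanged, which corresponds to the "otherwise" branch of the claims.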
Step 3 Let T be a cycle. A stage correctness predicate $P_k(T)$ is defined for each stage. It will be used later on in the proofs of all central claims. The predicate $P_k(T)$ holds iff the values of the registers of stage k are correct in cycle T. This comprises both the implementation and the specification registers. Let $sP_k(T)$ denote the stage correctness predicate for the specification registers and let $iP_k(T)$ denote the stage correctness predicate for the implementation registers:
\[ P_k(T) \iff sP_k(T) \wedge iP_k(T) \]
The stage correctness predicate $sP_k(T)$ for the specification registers is given in analogy to the data consistency criterion in theorem 3.12: the values of the specification registers must match the values of the corresponding registers in the configuration of the specification machine indicated by the scheduling function. Thus, for all specification registers $R \in out(k)$ the following condition must hold:
\[ R^T_I = R^{sI(k,T)}_S \]
The stage correctness predicate for the implementation registers is given using the notion of a correct implementation register as defined in step 1. For all implementation registers $R \in out(k)$ the following condition must hold:
\[ R^T_I.(k+1) = \begin{cases} 0 & : sI(k,T) = 0 \\ \Omega_kR(c^{sI(k,T)-1}_S) & : \text{otherwise} \end{cases} \]
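[Figure 3.9: stage k between the registers Q.k (holding $Q^{i+1}$) and R.(k+1) (holding $R^i$), with the update enable signals $ue_{k-1}$ and $ue_k$, while instruction $I_i$ is in stage k]
Figure 3.9 Illustration of the values in the registers if instruction $I_i$ is in stage k (i.e., $sI(k,T) = i$). Q is a specification register in $out(k-1)$, R is a specification register in $out(k)$.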
The stage correctness predicate for the implementation registers is motivated as follows: it is supposed to provide information about the value of an implementation register $R \in out(k)$ during cycle T, i.e., about $R^T_I.(k+1)$. In case of $sI(k,T) = 0$, the register has never been written before, i.e., it still has its initial value, which is zero by definition:
\[ R^T_I.(k+1) = 0 \]
If $i = sI(k,T) > 0$, the register was last written while calculating a part of configuration $c^i$ (figure 3.9). Suppose this was done during cycle $T'$. By definition of the transition function, the following value was written:
\[ \omega_kR(c^{T'}_I) \]
This value has not changed since cycle $T'$, thus, it is still in the register during cycle T:
\[ R^T_I.(k+1) = \omega_kR(c^{T'}_I) \]
In case of correct calculations, the inputs used by $\omega_kR$ for the transition function $f_kR$ during cycle $T'$ while calculating configuration $c^i$ were taken
from configuration $c^{i-1}$. Thus, the right-hand side is:
\[ R^T_I.(k+1) = \Omega_kR(c^{i-1}_S) \]
This motivates the stage correctness predicate for implementation registers.
Lemma 3.14 All stage correctness predicates hold for the initial cycle, i.e., $P_k(0)$ holds for all stages k.
PROOF In the initial cycle, the value of all stage scheduling functions is zero. One therefore has to show that the values of the specification registers in the implementation machine during cycle 0 match the values of the corresponding registers in the specification. Since this is exactly the definition of $c^0_I$, the claim follows immediately. QED
3.5.5 Correctness of the Transition Functions
Lemma 3.15 The register transition function lemmas, as defined in step 2, hold.
PROOF The stages IF and EX do not write any specification register, thus, there is nothing to show.
Note that the following proofs are given here for illustration only. In PVS, the proofs are much simpler, since PVS is able to expand the definitions of the functions $f_kR$, $\Gamma_kR$, and $G_kR$ automatically. Furthermore,
the lemmas that show the correctness of circuits such as the ALU can be
applied automatically. The proofs rely on definition expansion and trivial
use of lemmas only, thus, the proofs below have just a few lines in PVS
and require almost no manual interaction.
Stage ID Stage ID writes the specification registers DPC and PC'. The claim of the register transition function lemma for register DPC is:
\[ f_1DPC(\Gamma_1DPC(c^i_S)) \overset{!}{=} c^{i+1}_S.DPC \]
Expanding the function $\Gamma_1DPC$ on the left-hand side, this is equal to:
\[ f_1DPC(G_1(c^i_S, PC')) \overset{!}{=} c^{i+1}_S.DPC \]
Expanding the function $G_1$ on the left-hand side, this is equal to:
\[ f_1DPC(G_1PC'(c^i_S)) \overset{!}{=} c^{i+1}_S.DPC \]
Since PC' is a specification register, by definition of $G_1PC'$ this is equal to:
\[ f_1DPC(c^i_S.PC') \overset{!}{=} c^{i+1}_S.DPC \]
Since $f_1DPC$ is just the identity (equation 3.12 page 57), this claim simplifies to:
\[ c^i_S.PC' \overset{!}{=} c^{i+1}_S.DPC \]
This holds because of the definition of $c^{i+1}_S.DPC$ (equation 2.3 page 31).
The claim of the register transition function lemma for register PC' is:
\[ f_1PC'(\Gamma_1PC'(c^i_S)) \overset{!}{=} c^{i+1}_S.PC' \]
The calculation of PC' depends on the first GPR operand. The functions for this operand use GPRa as register and not GPR in order to distinguish them from the functions of the second GPR operand.
Repeatedly expanding definitions as above, the claim is equal to:
\[ f_1PC'(G_1IR(c^i_S), G_1GPRa(c^i_S), G_1PC'(c^i_S)) \overset{!}{=} c^{i+1}_S.PC' \]
By definition of $f_1PC'$ (equation 3.13 page 57), this is equal to:
\[ next\_pc\_impl(G_1IR(c^i_S), G_1GPRa(c^i_S), G_1PC'(c^i_S)) \overset{!}{=} c^{i+1}_S.PC' \]
By lemma 3.4 (correctness of next_pc_impl), this is equal to:
\[ next\_pc(G_1IR(c^i_S), G_1GPRa(c^i_S), G_1PC'(c^i_S)) \overset{!}{=} c^{i+1}_S.PC' \]
There are three cases:
• If the instruction is a jump register or branch instruction, the register transition function $f_1PC'$ reads IR, the first GPR operand, and the old value of the PC' register. These values are passed to the function next_pc.
  One easily shows the correctness of the IR argument by expanding definitions, since IR is an implementation register:
  \[ G_1IR(c^i_S) = \Omega_0IR(c^i_S) = f_0IR(c^i_S.DPC) = IM[c^i_S.DPC] \]
  The correctness of the PC' argument is shown easily:
  \[ G_1PC'(c^i_S) = c^i_S.PC' \]
  The correctness of the GPR operand is assured as follows: the first GPR operand is read using a conditional read access with address. In case the read enable signal holds (equation 3.7), the GPR register with address I_RS1(IR) is read (equation 3.9):
  \[ G_1GPRa(c^i_S) = c^i_S.GPR[I\_RS1(IM[c^i_S.DPC])] \]
  This is exactly op1, as required by the specification (equation 2.1 page 30).
  If the condition does not hold, zero is returned by the $G_1$ function:
  \[ G_1GPRa(c^i_S) = 0 \]
  In case of a jump register or branch instruction this happens only if I_RS1(IR) is zero. In this case, register GPR0 is read, which is always zero when read, as required by equation 2.1.
• If the instruction is neither a jump nor a branch instruction, the value of IR, zero, and the old value of PC' are passed to next_pc. In this case, next_pc ignores the value of the second argument and therefore returns the correct result.
• If the instruction is a jump instruction, next_pc does not use the second argument, the GPR operand. The offset to the PC is provided by the immediate constant.
Stage M The memory stage writes a data word into the main memory
in case of a store instruction. The write access to the data memory is a
conditional write access with write address.
Most transition functions depend on the implementation register IR. As an example, it is shown how to assert the correctness of the IR arguments. By definition of $G_3IR$, one shows that $\Omega_2IR(c^i_S)$ is the value read. Repeatedly expanding the Ω functions and proceeding as above, one shows the correctness of the IR arguments:
\[ G_3IR(c^i_S) = \Omega_2IR(c^i_S) = \Omega_1IR(c^i_S) = \Omega_0IR(c^i_S) = f_0IR(c^i_S.DPC) = IM[c^i_S.DPC] \]
As described above, the write access to DM is conditional and has a write address. In case the write enable signal $f_3DM\mathit{we}(G_3IR(c^i_S))$ does not hold, nothing has to be shown. In the case that $f_3DM\mathit{we}(G_3IR(c^i_S))$ holds, one has to show that the data value written is correct and that the write address is correct.
The data value written is read from the MDRw register from the previous
stage. By definition, the correct value of the MDRw register is the correct
value of the B register written by the decode stage. The decode stage places
the second operand here. The correctness of this operand is asserted as
described in the section above.
The address used for the write access to DM is taken from the function $f_3DM\mathit{wa}$. The function reads MAR from the previous stage and strips the two least significant bits. The correct value of the MAR register is by definition the output of the ALU. The ALU performs an addition, which is shown easily using that $f_3DM\mathit{we}(G_3IR(c^i_S))$ holds and using the correctness of the ALU (lemma 2.14). Furthermore, it is shown easily that the
second operand of the ALU is the immediate constant. The first operand of
the ALU is generated by the decode stage. The correctness of this operand
is shown as in the proof for the stage ID. Thus, exactly the effective ad-
dress, as required by the specification, is used as write address for the write
access.
Stage WB The write back stage writes the GPR destination operand of
the instruction. The proof is similar to the proof used for the memory stage.
A conditional write access with address to GPR is performed. In the case
that the write enable signal is active, one has to show that the data value
written is correct and that the write address is correct.
In case of a load instruction, the data value is taken from the shift4load
circuit, which takes MAR, MDRr, and IR as inputs. The correctness of the
value in the MAR register is asserted as described above. The correctness
of the value in the MDRr register is asserted as follows: the register is
written by the memory stage using a conditional read access with address
to DM. It is easy to show that the read enable signal of this read access
is active using that the write enable signal holds. The correctness of the
address of the memory access is shown as described above.
If the instruction is not a load instruction, the data value is taken from
the C register. The C register is passed unmodified by the memory stage
from the execute stage. In case of an ALU/shift instruction, the correctness
of this value is asserted as follows: the correctness of the input operands
is asserted as described above; using the correctness of the ALU (lemma
2.14), the correctness of the result is shown.
In case of a jump and link instruction, the execute stage passes the value of the C register from stage ID. This is the correct value of the PC' register, as required by the specification.
The correctness of the index used for the write access to GPR is shown easily by definition unfolding. The index is written into an implementation register by stage ID and not changed in any subsequent stage. QED
Correctness of the Functions $g_kR$ In the proofs above, the correctness of the inputs of each stage is assumed. In real hardware, the implementations of the functions $g_kR$ are used in order to generate the input operands. It is therefore left to show that the values generated by the functions $g_kR$ actually match the correct values.
Lemma 3.16 Let $sI(k,T) = i$ and $full^T_k$ hold. Assuming that the stage correctness predicates $P_j$ hold in all cycles up to cycle T, the inputs generated by the functions $g_kR$ during cycle T are correct:
\[ g_kR(c^T_I) = G_kR(c^i_S) \]
PROOF In case of an implementation register, the correct value on the right-hand side is defined by the function $\Omega_{k-1}R$:
\[ g_kR(c^T_I) \overset{!}{=} \Omega_{k-1}R(c^i_S) \]
In case of a specification register, the correct value is given in the configuration of the specification machine. If the read access is neither indexed nor conditional:
\[ g_kR(c^T_I) \overset{!}{=} R^i_S \]
In case of a conditional read access, the correct value is zero if the read enable signal is not active. If the signal is active, the correct value is the same as in the case above.
\[ g_kR(c^T_I) \overset{!}{=} \begin{cases} R^i_S & : f\Gamma_kR\mathit{re}(c^i_S) \\ 0 & : \text{otherwise} \end{cases} \]
In case of an indexed read access, the correct value is defined using the correct value of the address. Let x denote this value:
\[ x := f\Gamma_kR\mathit{ra}(c^i_S) \]
\[ g_kR(c^T_I) \overset{!}{=} \begin{cases} R^i_S[x] & : f\Gamma_kR\mathit{re}(c^i_S) \\ 0 & : \text{otherwise} \end{cases} \]
The proof depends on the type of the register that is read and in which
stage the register is. The first thing is to show the correctness for the case
that neither a condition nor a read address is used.
1. Let the register that is to be read be an implementation register. By the definition of $g_kR$, the register from the previous stage is taken:
\[ g_kR(c^T_I) = c^T_I.R.k \qquad (3.16) \]
An implementation register is never read in stage k = 0, and one can therefore use lemma 3.11 and the fact that the full signal $full^T_k$ is active in order to conclude that $T \ge 1$ holds. For this case, invariant 3.3 (page 65) states:
\[ sI(k-1,T) = i+1 \]
The stage correctness predicate for $R \in impl$, cycle T, and stage k−1 states:
\[ R^T_I.k = \begin{cases} 0 & : sI(k-1,T) = 0 \\ \Omega_{k-1}R(c^{sI(k-1,T)-1}_S) & : \text{otherwise} \end{cases} \]
Since $sI(k-1,T) = i+1$ is never zero, this simplifies to:
\[ R^T_I.k = \Omega_{k-1}R(c^i_S) \]
Remember that $R^T_I.k$ just denotes $c^T_I.R.k$. Thus, one can insert this into equation 3.16. This changes equation 3.16 into:
\[ g_kR(c^T_I) = \Omega_{k-1}R(c^i_S) \qquad (3.17) \]
This is exactly the claim.
2. Let the register that is to be read be a specification register that is in the same stage in which it is read. By definition of $g_kR$, the value of R is read:
\[ g_kR(c^T_I) = c^T_I.R.(k+1) \qquad (3.18) \]
By using the stage correctness predicate for specification register R, stage k, cycle T, this is transformed into:
\[ c^T_I.R.(k+1) = R^{sI(k,T)}_S \qquad (3.19) \]
Since $i = sI(k,T)$, this is the claim.
3. Let the register that is to be read be a specification register that is in a later stage than the stage it is read in. Let w be stage(R). By definition of $g_kR$, the value of R is read:
\[ g_kR(c^T_I) = c^T_I.R.(w+1) \qquad (3.20) \]
By applying the stage correctness predicate for stage w and cycle T, this transforms into:
\[ g_kR(c^T_I) = R^{sI(w,T)}_S \qquad (3.21) \]
For cycle T = 0, both $sI(w,0)$ and $sI(k,0)$ are zero by definition of sI. Thus, the claim holds.
For cycles T > 0, one uses lemma 3.10 for cycle T and stages k and w. Because of $full^T_k$ and $w > k$, lemma 3.10 shows that
\[ sI(w,T) = sI(k,T) = i \]
holds. This concludes the claim.
This shows the claim for inputs without index and condition. The claim for inputs with index or condition is shown as follows: Since the inputs for the functions $f_kR\mathit{re}$ and $f_kR\mathit{ra}$ never use a condition or an index, the correctness of the inputs of these functions can be shown as above. If the condition does not hold, the claim obviously holds. If the condition holds, the proof proceeds as above. In case of an indexed access, the claim is shown using the arguments above and that the index is correct. QED
Lemma 3.17 Let $T'$ be greater than zero. Assuming all stage correctness predicates for the cycle $T'-1$, the predicate for stage k holds for cycle $T'$.
\[ (\forall l : P_l(T'-1)) \implies P_k(T') \]
PROOF In the following, T denotes the cycle $T'$ of the lemma and i denotes $sI(k,T)$. Let the update enable signal $ue^{T-1}_k$ be active. In this case, one uses invariant 3.1 in order to conclude that $sI(k,T-1) = i-1$. This allows using lemma 3.16 for cycle T−1 and configuration i−1. The lemma shows that the inputs of the stage transition functions are correct. In case of a specification register, lemma 3.15 is used to show that the output written in the register is correct. In case of an implementation register, the output value of the stage matches the correct value by definition of the correct value of an implementation register.
If the update enable signal $ue^{T-1}_k$ is not active, invariant 3.1 is used to show that the value of the stage scheduling function does not change from cycle T−1 to cycle T. Since the update enable signal is not active, the values in the registers do not change from cycle T−1 to cycle T, which shows the claim. QED
Theorem 3.18 All stage predicates hold for all cycles.
PROOF This is shown by induction on T. The case T = 0 is subsumed by lemma 3.14, the induction step is shown using lemma 3.17.
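The induction scheme behind this argument can be written down abstractly. The following Lean 4 sketch is only an illustration of the proof structure and is not taken from the PVS development: `P k T` stands for the stage correctness predicate of stage k in cycle T, `base` plays the role of lemma 3.14, and `step` plays the role of lemma 3.17. All three names are assumptions of this sketch.

```lean
-- Abstract shape of the argument: a base case for cycle 0 and a step that
-- derives the predicate of any stage in cycle T + 1 from all predicates in
-- cycle T together give the predicate for every stage and every cycle.
theorem all_stage_predicates_hold (P : Nat → Nat → Prop)
    (base : ∀ k, P k 0)
    (step : ∀ T k, (∀ l, P l T) → P k (T + 1)) :
    ∀ T k, P k T := by
  intro T
  induction T with
  | zero => exact base
  | succ T ih =>
    intro k
    exact step T k ih
```

The induction hypothesis quantifies over all stages, which is exactly what the premise of lemma 3.17 requires.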
Theorem 3.18 obviously implies the data consistency criterion as pro-
posed in theorem 3.12.
3.6 Liveness
3.6.1 Introduction
The liveness criterion used in this thesis is that the implementation machine
actually calculates any desired configuration of the specification machine
within a finite amount of time. In order to prove the liveness criterion for
the prepared sequential machine, a formal notion of “will happen in finite
time” is required.
Definition 3.3 (Time Predicate) A time predicate is a mapping from $\mathbb{N}_0$ to $\mathbb{B}$.
The constant time predicates always and never are defined as follows:
\[ always(T) = true \qquad (3.22) \]
\[ never(T) = false \qquad (3.23) \]
Let pred be a time predicate. The following notation is used:
\[ \exists\, pred :\iff \exists T \in \mathbb{N}_0 : pred(T) \qquad (3.24) \]
The operator $\exists^{T'}$ on a time predicate holds iff the predicate is true for a time $T \ge T'$.
\[ \exists^{T'} pred :\iff \exists T \in \mathbb{N}_0 : (pred(T) \wedge T \ge T') \qquad (3.25) \]
Lemma 3.19 If there exists a time $T \ge T'$ with pred(T), also a time $T''$ exists that is the smallest $T'' \ge T'$ satisfying the predicate.
PROOF Let S be the set of natural numbers that are greater than or equal to $T'$ and satisfy the predicate. The set is non-empty and has a lower bound. The minimum min(S) therefore exists and is $T''$.
Definition 3.4 (Finite False) Let pred be a time predicate. The predicate is called finite false iff for all T, $\exists^T pred$ holds. This implies that if pred(T) does not hold for a given T, there is a finite $T' \ge T$ such that pred(T') holds. In analogy to that, a predicate is called finite true iff $\overline{pred}$ is finite false.
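The notions of time predicate, smallest witness, finite false, and finite true can be explored with a small Python sketch. Quantification over all of $\mathbb{N}_0$ cannot be decided by enumeration, so the sketch searches only up to a caller-supplied `horizon`; this bound and all function names are assumptions introduced here for illustration and do not appear in the thesis.

```python
from typing import Callable, Optional

TimePredicate = Callable[[int], bool]


def always(T: int) -> bool:
    return True


def never(T: int) -> bool:
    return False


def first_time_from(pred: TimePredicate, start: int,
                    horizon: int) -> Optional[int]:
    """Smallest T with start <= T < horizon and pred(T), or None.
    Within the horizon this is the minimum used in lemma 3.19."""
    for T in range(start, horizon):
        if pred(T):
            return T
    return None


def exists_from(pred: TimePredicate, start: int, horizon: int) -> bool:
    """Bounded version of the operator 'exists a time T >= start'."""
    return first_time_from(pred, start, horizon) is not None


def finite_false_up_to(pred: TimePredicate, last_start: int,
                       horizon: int) -> bool:
    """Bounded check of 'finite false': from every start time up to
    last_start, the predicate becomes true before the horizon."""
    return all(exists_from(pred, T, horizon) for T in range(last_start + 1))


def finite_true_up_to(pred: TimePredicate, last_start: int,
                      horizon: int) -> bool:
    """Bounded check of 'finite true': the negated predicate is finite false."""
    return finite_false_up_to(lambda T: not pred(T), last_start, horizon)
```

A predicate that holds in every third cycle passes both bounded checks, while `always` fails the finite true check because it never becomes false.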
3.6.2 Liveness Criterion
Let $c^i_S$ be any desired configuration of the specification machine. The implementation machine is said to be alive iff for all stages k there exists a time $T \in \mathbb{N}_0$ with $sI(k,T) = i$:
\[ \exists T \in \mathbb{N}_0 : sI(k,T) = i \]
3.6.3 Liveness Properties of the Scheduling Logic
Let $ue_k$ denote the time predicate of the update enable signal of stage k. Let $stall_k$ denote the time predicate of the stall signal of stage k.
Lemma 3.20 Let $full^T_k$ hold for a stage k and a cycle T. Let the stall signal $stall_k$ be finite true (thus, it becomes false within a finite amount of time). This implies that $ue_k$ becomes true within a finite amount of time:
\[ full^T_k \implies \exists^T ue_k \]
PROOF Since $stall_k$ is finite true, there exists a cycle $T' \ge T$ such that the stall signal is not active. Let $T'$ be the smallest value with this property. If $T' = T$, $full^{T'}_k$ holds by premise.
If $T' > T$, assume that $full^{T'}_k$ does not hold. According to lemma 3.2, this implies that $stall^{T'-1}_k$ does not hold. This is a contradiction to the assumption that $T'$ is the smallest value.
Since $full^{T'}_k = 1$ and $stall^{T'}_k = 0$, $ue^{T'}_k$ holds by definition. QED
Lemma 3.21 Assuming that all stall signals are finite true, and that the update enable signal of stage 0 will be active within finite time after cycle T, the update enable signal of stage k will be active within finite time after cycle T.
\[ \exists^T ue_0 \implies \exists^T ue_k \]
PROOF This is shown by induction on k. For k = 0, the claim is subsumed by the premise.
For k+1, the induction premise states that there is a cycle $T' \ge T$ with $ue^{T'}_k$. By the transition function of the full bits, $full^{T'+1}_{k+1}$ holds. Lemma 3.20 concludes the claim.
Lemma 3.22 Assuming that all stall signals are finite true, the update enable signals are finite false.
PROOF The claim for $ue_0$ is that for all T there is a $T' \ge T$ such that $ue^{T'}_0$ holds. This is shown by induction on T. For T = 0, one uses lemma 3.20 and the fact that $full^0_0$ holds by definition.
For T+1, lemma 3.21 is used to argue that there exists a $T' \ge T$ such that $ue^{T'}_{n-1}$ holds. According to the transition function of the full bits, $full^{T'+1}_0$ holds. Lemma 3.20 is used to show the claim.
The claim for $ue_k$ with $k \ge 1$ is shown by induction on k. For k = 0, the claim is shown already. For k+1, the claim is shown as in lemma 3.21. QED
3.6.4 Liveness Proof for the Sequential DLX
Lemma 3.23 Let the update enable signal $ue_k$ of a stage k be off during the cycles $T''$ with $T' > T'' \ge T$. The value of the scheduling function does not change from cycle T to $T'$.
\[ (\forall T'' \mid T' > T'' \ge T : \overline{ue^{T''}_k}) \implies sI(k,T) = sI(k,T') \]
PROOF The proof proceeds by induction on $T'$ and by definition unfolding.
Theorem 3.24 Assuming that all stall signals are finite true, the machine is alive.
PROOF This is shown by induction on i. For i = 0, the claim is that there is a T such that $sI(k,T) = 0$ holds. By definition of sI, T = 0 satisfies this.
For i+1, the induction premise states that there is a cycle T such that $sI(k,T) = i$ holds. According to lemma 3.22, the update enable signal $ue_k$ is finite false. Thus, there is a cycle $T' \ge T$ such that $ue^{T'}_k$ holds. Let $T'$ be the smallest value that satisfies this. If $T'$ is equal to T, the claim holds by invariant 3.1.
If $T' > T$, lemma 3.23 states that $sI(k,T)$ is equal to $sI(k,T')$. Invariant 3.1 shows that $sI(k,T'+1)$ is i+1.
3.7 Literature
The concept of the prepared sequential machine and the DLX implementa-
tion is taken from [MP00]. There are many publications on the verification
of sequential machines, e.g., Cohn verified the VIPER [Coh87], Joyce ver-
ified the Tamarack [Joy88a, Joy88b], Hunt verified the FM8501 [Hun94],
and Windley verified the AVM-1 [Win95].
There is not much literature on the verification of liveness properties
of microprocessors. However, liveness verification is critical. In [MP96],
deadlocks in the original version of the well known scoreboard scheduler
are described. Furthermore, a corrected version is presented and its live-
ness is proven.
Chapter 4
Pipelined Machines
4.1 Scheduling the Pipelined Machine
4.1.1 Introduction
The prepared sequential machine, as described in the previous chapter, calculates a configuration of the specification machine within n transitions if no external stall condition arises, with n being the number of stages. In each transition of the prepared sequential machine, only one stage is in use. The data paths of the remaining stages are left idle.
In this chapter, the prepared sequential machine $M_\sigma$ is transformed into a pipelined machine $M_\pi$, which allows running all stages in parallel. This
concept is taken from [MPK00, MP00]. In contrast to the cited literature,
an automated tool is used in order to do the transformation including the
generation of stalling and forwarding logic. Furthermore, the transforma-
tion is not limited to microprocessors. Any prepared sequential machine
as specified in the previous chapter can be transformed into a pipelined
design.
The goal of the transformation is to use the formerly idle data paths in order to speed up the calculation of the desired configurations of the specification machine. As before, the registers of all stages are initialized with the values of $c^0_S$ and the first stage starts with the calculation of $c^1_S$. In the next cycle, the second stage starts with the calculation of $c^1_S$, as before. In contrast to the prepared sequential machine, the first stage does not idle but starts the calculation of $c^2_S$.
[Figure 4.1: schedule diagram with cycles 0 to 3 on one axis and stages 0 to 2 on the other; entries $c^1_S$, $c^1_S$, $c^1_S$, $c^2_S$]
Figure 4.1 Scheduling of the prepared sequential machine with n = 3 stages in the absence of external stalls.
In particular, the calculation of the configuration $c^2_S$ starts before the calculation of configuration $c^1_S$ is finished since stage 0 calculates only some parts of the configuration. Figure 4.1 shows how the prepared sequential machine calculates the configurations, and figure 4.2 shows how the pipelined machine uses the formerly idle stages to speed up this calculation.
The calculation of configuration $c^1_S$ is finished in cycle 2, as in the sequential machine. In contrast to the sequential machine, the calculation of configuration $c^2_S$ is finished already in cycle 3.
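The difference between the two schedules can be made concrete with a short Python sketch. It assumes that no stalls occur and reads "finished in cycle T" as "the last stage works on the configuration during cycle T"; the closed formulas and the function names are inferred from figures 4.1 and 4.2 for illustration and are not stated in this form in the thesis.

```python
def sequential_finish_cycle(i: int, n: int) -> int:
    """Cycle in which the last of n stages works on configuration i (i >= 1)
    in the prepared sequential machine: each configuration occupies the
    machine for n consecutive cycles."""
    return i * n - 1


def pipelined_finish_cycle(i: int, n: int) -> int:
    """Cycle in which the last of n stages works on configuration i (i >= 1)
    in the pipelined machine: configuration i enters stage 0 in cycle i - 1
    and advances by one stage per cycle."""
    return (i - 1) + (n - 1)
```

For n = 3 both machines finish the first configuration in cycle 2, but the second configuration is finished in cycle 5 by the sequential machine and in cycle 3 by the pipelined machine.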
A stage "runs" if the corresponding update enable signal is active. A
stage is updated if it is full and not stalled. The first step of the transfor-
mation therefore is to modify the machine such that there are as many full
stages as possible.
In the new initial configuration, no full bit is set in contrast to the initial configuration of the prepared sequential machine. The definition of the signal $full_0$ is changed: the first stage is defined to be always full, since one can start with the calculation of the next configuration any time the stage would be empty otherwise.
\[ full_0(c) := 1 \]
[Figure 4.2: schedule diagram with cycles 0 to 3 and stages 0 to 2; stage 0 works on $c^1_S$ to $c^4_S$, stage 1 on $c^1_S$ to $c^3_S$, stage 2 on $c^1_S$ and $c^2_S$]
Figure 4.2 Running all stages in parallel.
          T=0  T=1  T=2  T=3  T=4  T=5  T=6
ue^T_0     1    1    1    1    1    1    1
ue^T_1     0    1    1    1    1    1    1
ue^T_2     0    0    1    1    1    1    1
ue^T_3     0    0    0    1    1    1    1
Table 4.1 The update enable signals of a four stage pipeline in the absence of stalls
This is the only change required in order to get a pipelined schedule. This full bit is propagated to the next stage in each transition just as in the machine $M_\sigma$. Thus, if there is no stall signal, the full bits are never cleared after they are set. Table 4.1 illustrates the values of the update enable signals for a four stage pipeline after applying this modification, assuming that no stall signal is active.
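The values in table 4.1 can be reproduced with a few lines of Python. The sketch below assumes that a stage is updated iff it is full and not stalled, that stage 0 is always full, and that the full bit of stage k in the next cycle is set iff stage k−1 was updated or stage k was stalled; the function name and the list representation are hypothetical.

```python
from typing import List


def simulate_update_enables(n_stages: int, n_cycles: int) -> List[List[int]]:
    """Return trace[T][k] = value of the update enable signal of stage k in
    cycle T, for a pipeline without any stalls."""
    full = [1] + [0] * (n_stages - 1)      # cycle 0: only stage 0 is full
    trace = []
    for _ in range(n_cycles):
        stall = [0] * n_stages             # assumption: no stall is active
        ue = [int(full[k] and not stall[k]) for k in range(n_stages)]
        trace.append(ue)
        # Stage 0 stays full; stage k is full next cycle iff stage k-1 was
        # updated or stage k was stalled.
        full = [1] + [int(ue[k - 1] or stall[k]) for k in range(1, n_stages)]
    return trace
```

Transposing the result of `simulate_update_enables(4, 7)` gives one row per stage, in the layout of table 4.1.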
After n transitions, all stages therefore work in parallel. Every stage calculates a part of a different configuration of the reference machine. For example, let stage 2 calculate parts of $c^i_S$. In this case, stage 1 calculates parts of $c^{i+1}_S$ and stage 3 calculates parts of $c^{i-1}_S$. This is depicted in figure 4.3.
[Figure 4.3: pipeline of the stages $f_0$ to $f_{n-1}$ with registers R.1 to R.n, full bits full.1 to full.n, and update enable signals $ue_0$ to $ue_{n-1}$; neighbouring stages work on $c^{i+1}$, $c^i$, and $c^{i-1}$]
Figure 4.3 The structure of the pipelined machine
           T=0  T=1  T=2  T=3  T=4  T=5  T=6
sI(0,T)     0    1    2    3    4    5    6
sI(1,T)     0    0    1    2    3    4    5
sI(2,T)     0    0    0    1    2    3    4
sI(3,T)     0    0    0    0    1    2    3
Table 4.2 The values of sI in a four stage pipelined machine in the absence of stalls
The new values of the update enable signals affect the values of the
scheduling function also since the scheduling function is defined using the
update enable signals. Table 4.2 illustrates the values of sI(k;T ) for the
first seven cycles.
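Table 4.2 can be recomputed from the update enable signals. The following Python sketch assumes that the scheduling function starts at zero for every stage and increases by one exactly in those cycles in which the update enable signal of the stage is active, and that no stalls occur; the function name and the data layout are illustrative only.

```python
from typing import List


def simulate_schedule(n_stages: int, n_cycles: int) -> List[List[int]]:
    """Return table[T][k] = sI(k, T) for a pipeline without stalls."""
    full = [1] + [0] * (n_stages - 1)   # cycle 0: only stage 0 is full
    s = [0] * n_stages                  # sI(k, 0) = 0 for all stages
    table = [list(s)]
    for _ in range(n_cycles - 1):
        ue = list(full)                 # no stalls: updated iff full
        # The scheduling function of a stage advances when it is updated.
        s = [s[k] + ue[k] for k in range(n_stages)]
        # Stage 0 stays full; the full bits move down by one stage.
        full = [1] + ue[:-1]
        table.append(list(s))
    return table
```

Reading the result column by column yields the rows of table 4.2, one per stage.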
4.1.2 Scheduling Lemmas
The following simple lemmas are concluded from the new definition of the full signals:
Lemma 4.1 A stage is full iff it was updated or stalled in the previous cycle:
\[ \forall k \ge 1 : full^{T+1}_k = ue^T_{k-1} \vee stall^T_k \]
The signal $full_0$ is always active:
\[ full^T_0 = 1 \]
All other signals $full_k$ are not active during cycle 0:
\[ \forall k \ge 1 : full^0_k = 0 \]
This lemma is a counterpart of lemma 3.2 of the sequential machine.
PROOF This lemma is an implication of the transition function of the full bits and of the definition of the full signals.
Convention 4.2 In analogy to convention 3.1, it is required that if a stage is not full, it must not be stalled:
\[ \overline{full^T_k} \implies \overline{stall^T_k} \]
In addition to that, it is required that if a stage is stalled and the previous stage is full, the previous stage must be stalled also:
\[ \forall k \ge 1 : full^T_{k-1} \wedge stall^T_k \implies stall^T_{k-1} \]
Using the lemma and the convention above, it is easy to show that the pipeline has the same properties as a simple queue: no entry in the queue is lost and no entry in the queue is duplicated. These properties are subsumed by three trivial lemmas.
The equations for the stall signals of the prepared sequential machine
in chapter 3 also comply with this extended convention, which is shown
easily using lemma 3.8. Thus, all properties of the pipelined machine con-
cluded using this convention also hold in the sequential machine.
Lemma 4.3 If a stage is full and the previous stage is updated, the stage itself is updated, too.
\[ \forall k \ge 1 : full^T_k \wedge ue^T_{k-1} \implies ue^T_k \]
This ensures that the contents of a stage are never overwritten without moving into the next stage.
PROOF According to the definition of the update enable signals, it is sufficient to show that $full^T_k$ and $\overline{stall^T_k}$ hold. According to the premise of the lemma, $full^T_k$ holds. By the definition of the update enable signal $ue^T_{k-1}$, $\overline{stall^T_{k-1}}$ and $full^T_{k-1}$ hold. The claim is concluded by convention 4.2.
Lemma 4.4 If a stage is full and if its output registers are not updated, the full bit is preserved.
\[ \forall k \ge 1 : full^T_k \wedge \overline{ue^T_k} \implies full^{T+1}_k \]
PROOF By the definition of the update enable signals, one concludes that $stall^T_k$ holds. The claim is concluded using lemma 4.1.
Lemma 4.3 and lemma 4.4 guarantee that no configuration in a given stage is ever lost.
Lemma 4.5 If a configuration in a stage moves into the next stage (i.e., the output registers of a stage are updated), and if the next configuration is not clocked into the stage, the full bit is cleared:
\[ \forall k \ge 1 : full^T_k \wedge ue^T_k \wedge \overline{ue^T_{k-1}} \implies \overline{full^{T+1}_k} \]
This lemma guarantees that no configuration is duplicated.
PROOF By the definition of the update enable signals, one concludes $\overline{stall^T_k}$. The claim is concluded by lemma 4.1.
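The queue properties stated by lemmas 4.3, 4.4, and 4.5 can be exercised on randomly stalled pipelines. The Python sketch below is a test harness, not a proof. It assumes one particular way of generating stall signals that satisfies convention 4.2: a full stage is stalled iff it sees a random external stall request or the following stage is stalled. This stall rule, the probability 0.3, and all names are assumptions of the sketch.

```python
import random
from typing import List, Tuple


def step_signals(full: List[int],
                 ext_stall: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Compute stall, update enable, and next full bits for one cycle."""
    n = len(full)
    stall = [0] * n
    for k in reversed(range(n)):
        downstream = stall[k + 1] if k + 1 < n else 0
        # Only full stages stall; a stall propagates to full earlier stages.
        stall[k] = int(full[k] and (ext_stall[k] or downstream))
    ue = [int(full[k] and not stall[k]) for k in range(n)]
    next_full = [1] + [int(ue[k - 1] or stall[k]) for k in range(1, n)]
    return stall, ue, next_full


def queue_lemmas_hold(n_stages: int = 4, n_cycles: int = 200,
                      seed: int = 0) -> bool:
    """Check lemmas 4.3, 4.4 and 4.5 along one random run."""
    rng = random.Random(seed)
    full = [1] + [0] * (n_stages - 1)
    for _ in range(n_cycles):
        ext = [int(rng.random() < 0.3) for _ in range(n_stages)]
        _, ue, next_full = step_signals(full, ext)
        for k in range(1, n_stages):
            if full[k] and ue[k - 1] and not ue[k]:
                return False        # would violate lemma 4.3
            if full[k] and not ue[k] and not next_full[k]:
                return False        # would violate lemma 4.4
            if full[k] and ue[k] and not ue[k - 1] and next_full[k]:
                return False        # would violate lemma 4.5
        full = next_full
    return True
```

A run such as `queue_lemmas_hold(n_stages=6, n_cycles=500, seed=3)` exercises all three lemmas under many different stall patterns.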
The following lemma is the counterpart of lemma 3.11 in the sequential machine.
Lemma 4.6 Stage k is full at the earliest in cycle k.
\[ full^T_k \implies T \ge k \]
PROOF The proof proceeds by induction over T. For T = 0, the claim is concluded from lemma 4.1.
Assuming the lemma for cycle T, the claim for T+1 is shown as follows: For k = 0, the claim is obvious. Thus, the claim is shown for k > 0.
If $full^T_k$ or $full^T_{k-1}$ holds, one simply uses the induction premise. If $full^T_k$ and $full^T_{k-1}$ do not hold, one shows that $full^{T+1}_k$ cannot hold using lemma 4.1:
\[ full^{T+1}_k = ue^T_{k-1} \vee stall^T_k \]
The rest of the proof proceeds as the proof of lemma 3.11 (page 71). QED
4.1.3 The Scheduling Invariants
In order to prove the data consistency of the pipelined machine, the three
scheduling invariants presented for the prepared sequential machine in
chapter 3 (page 65) will be used. We will therefore show that they also
hold for the pipelined machine.
PROOF The proof of the invariants proceeds as in chapter 3: Let $P_i(T)$ denote that invariant i holds for the pipelined machine for the cycle T. The claim is concluded as in chapter 3:
\[ P_3(T-1) \implies P_1(T) \]
\[ P_1(T) \wedge P_2(T-1) \wedge P_3(T-1) \implies P_2(T) \]
\[ P_1(T) \wedge P_2(T-1) \wedge P_3(T-1) \implies P_3(T) \]
The proof of invariant 3.1 is identical to the proof presented in chapter 3. The proof depends on the definition of sI and invariant 3.3 only.
Proof of Invariant 3.2 The proof of invariant 3.2 presented in chapter 3 depends on lemma 3.8 ("exactly one stage full"), which no longer holds in the pipelined machine.
Let us consider the stages k−1 and k with k > 0. There are four cases regarding the update enable signals $ue^{T-1}_k$ and $ue^{T-1}_{k-1}$ of these stages:
1. Let both update enable signals be active. According to invariant 3.2, the values of the scheduling function of the stages in cycle T−1 are either equal or the value of the scheduling function of stage k−1 is greater by one. According to invariant 3.1, the values of both scheduling functions increase by one within the step to cycle T. The claim therefore holds by invariant 3.2 for cycle T−1.
2. Let both update enable signals be not active. According to invariant 3.1, the values of the scheduling function do not change and the claim therefore follows from invariant 3.2 for cycle T−1.
3. Let the update enable signal of stage k be active and the update enable signal of stage k−1 be not active. This case is argued as in chapter 3.
4. Let the update enable signal of stage k be not active and the update
enable signal of stage $k-1$ be active. Let the first case given by
invariant 3.2 for cycle $T-1$ hold, i.e., $sI(k-1,T-1)$ is equal to
$sI(k,T-1)$. Using invariant 3.1, $sI(k-1,T) = sI(k,T)+1$ is concluded.
Let the second case given by invariant 3.2 for cycle $T-1$ hold, i.e.,
$sI(k-1,T-1) = sI(k,T-1)+1$ holds. According to invariant 3.3,
$full_k^{T-1}$ holds. According to lemma 4.3, one can conclude $ue_k^{T-1}$
from $full_k^{T-1} \land ue_{k-1}^{T-1}$. This is a contradiction since $\overline{ue_k^{T-1}}$ was
assumed.
Proof of Invariant 3.3 The proof of invariant 3.3 presented in chapter
3 also depends on lemma 3.8, which no longer holds in the pipelined ma-
chine.
For T = 0, the claim can be shown by definition unfolding and using
lemma 4.6. For T > 0, according to lemma 4.1, the claim is equivalent to:
$$ue_{k-1}^{T-1} \lor stall_k^{T-1} \iff sI(k-1,T) = sI(k,T)+1$$
The proof proceeds by a full case split on the values of the update enable
bits $ue_{k-1}^{T-1}$ and $ue_k^{T-1}$, as done in the proof of invariant 3.2. There are four
cases:
1. If both update enable signals are on, it is left to show that
$$sI(k-1,T) = sI(k,T)+1$$
holds. According to invariant 3.1, this is equivalent to:
$$sI(k-1,T-1)+1 = sI(k,T-1)+2$$
It is sufficient to show that $full_k^{T-1}$ holds because of invariant 3.3
for the previous cycle. This is done using the definition of $ue_k^{T-1}$.
2. If $ue_{k-1}^{T-1}$ is on and $ue_k^{T-1}$ is off, one assumes the claim does not hold.
Using the same arguments as in the proof of invariant 3.3 in chapter
3, one can conclude that $full_k^{T-1}$ holds.
Using lemma 4.3, $ue_k^{T-1}$ is concluded. This is a contradiction to the
assumption above.
3. If $ue_{k-1}^{T-1}$ is off and $ue_k^{T-1}$ is on, the rest of the argumentation is
identical to the proof in chapter 3.
4. If both update enable signals are off, the arguments in chapter 3 can
be repeated using convention 4.2.
This concludes the claim. QED
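The invariants can also be exercised on a small software model of the stall engine. The following Python sketch is an illustration only and does not replace the PVS proof. It assumes that stage 0 is always full, that $ue_k = full_k \land \overline{stall_k}$, and that a full stage is stalled if it has a stall request of its own or if the next stage is stalled. The names `step`, `simulate`, and `request` are hypothetical. Along a random run, it checks lemma 4.6 and invariants 3.2 and 3.3 in the form used in the proofs above.

```python
import random


def step(full, s_i, request):
    """Advance the stall engine model by one cycle.

    full[k]    -- full bit of stage k (stage 0 is assumed to be always full)
    s_i[k]     -- value of the scheduling function sI(k, T)
    request[k] -- stall request of stage k (external stall or data hazard)
    """
    n = len(full)
    stall = [False] * n
    # A full stage stalls on its own request or if the next stage is stalled.
    for k in reversed(range(n)):
        next_stalled = stall[k + 1] if k + 1 < n else False
        stall[k] = full[k] and (request[k] or next_stalled)
    # Assumption: a stage is updated iff it is full and not stalled.
    ue = [full[k] and not stall[k] for k in range(n)]
    # Lemma 4.1: full_k^{T+1} = ue_{k-1}^T or stall_k^T.
    new_full = [True] + [ue[k - 1] or stall[k] for k in range(1, n)]
    # Invariant 3.1: sI increases by one exactly when the stage is updated.
    new_s_i = [s_i[k] + (1 if ue[k] else 0) for k in range(n)]
    return new_full, new_s_i


def simulate(n_stages, cycles, seed=0):
    """Run the model with random stall requests and check the invariants."""
    rng = random.Random(seed)
    full = [True] + [False] * (n_stages - 1)
    s_i = [0] * n_stages
    for t in range(cycles):
        for k in range(n_stages):
            if full[k]:
                assert t >= k                  # lemma 4.6
            if k > 0:
                diff = s_i[k - 1] - s_i[k]
                assert diff in (0, 1)          # invariant 3.2
                assert full[k] == (diff == 1)  # invariant 3.3
        request = [rng.random() < 0.3 for _ in range(n_stages)]
        full, s_i = step(full, s_i, request)
    return s_i
```

Calling `simulate(5, 200)` runs a five-stage model for 200 cycles; an `AssertionError` would indicate that one of the checked properties is violated in this model.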
4.2 Forwarding
4.2.1 Introduction
The new scheduling has an impact on the calculation of the input values of
the stages. Let stage k read an implementation register R. Read access to
implementation registers is not affected by the changes to the scheduler,
since according to the requirements of the hardware description language
(section 3.2.4), an instance of the register R must be in $out(k-1)$. Thus, the
value of the implementation register has been calculated by the previous
stage. A formal proof for that claim uses the same arguments as given for
the sequential machine.
However, the access to specification registers is affected. For the se-
quential machine, lemma 3.16 (page 82) shows that the value the register
has in the previous configuration of the specification machine (as given by
the scheduling function) is passed as input.
The goal is to modify the functions $g_kR$ such that the same proposition
can be made for the pipelined machine. This allows concluding that the
values the pipelined machine writes into the registers match those written
by the sequential machine. This proof method is taken from [MPK00,
MP00].
Let stage k read specification register $R \in out(w)$. There are two cases,
as k > w is not allowed so far (section 3.2.5, page 43):
1. If the read access is done in the stage that writes the register, i.e.,
k = w, it is sure that the register still contains the value from the
previous configuration as required. Nothing has to be changed in
this case. The formal proof for this proposition is identical to the
corresponding case in lemma 3.16.
An example is the read access to PC' in the decode stage of the DLX
as implemented in chapter 3.
2. If the read access is done in a stage before the stage that writes the
register, i.e., k < w, the access cannot be done, since the desired
value is not in the register yet.
There are two methods to overcome the limitation in the last case: for-
warding and, if this fails, stalling. These methods will be described in the
next sections.
4.2.2 Forwarding from the Next Stage
Forwarding makes use of the parallelism of the calculations of the configu-
rations. In the literature, forwarding is often also called bypassing [Fly95].
Let R be the specification register to be read. The technique used here is
presented in [KPM00]. Let stage(R) = w = k + 1, i.e., the register R is
an output register of the next stage, and let the read access have no
address. There are two cases (figure 4.4):
• If full.w is set, the stage w contains the configuration that the desired
value is part of. Since the full bit is set, the stage is still busy and the
desired value is not yet stored in the register. However, the register
transition function of R provides the final value. Since $R \in out(w)$,
$\omega_wR$ is this value.
• If full.w is not set, there is either no previous configuration (the
stage was never used after reset), or the previous configuration is
already in the next stage. In the first case, the stage still contains the
initial values, i.e., the values of $c_S^0$. In the second case, the calculation
is already done and the result is stored in the register. In both cases,
R.(w+1) contains the desired value.
Thus, in order to realize forwarding from the next stage, it is sufficient
to select between $\omega_wR$ and R.(w+1) depending on the signal full.w.
This is formalized as follows: The function $g_kR$ is used in order to generate
the input values for the register transition functions. The method
described above allows defining $g_kR$ for the pipelined machine for the case
w = k+1 and $R \in spec$. If R is an implementation register, no forwarding
is necessary and we use $g_kR$ from the prepared sequential machine.
Figure 4.4 How forwarding is done from the next stage: the calculation of Q.(k+1)
depends on R.(k+2). If stage k+1 is full (left panel), the output of stage k+1 is taken. If
not (right panel), the value from the register is taken.
If R is a specification register with w = k + 1 and the read access does
not have an address, $g_kR$ is:
$$g_kR(c) = \begin{cases}
R.(w+1) & : \overline{c.full.w} \land f\gamma_kRre(c) \\
\omega_wR(c) & : c.full.w \land f\gamma_kRre(c) \\
0 & : \text{otherwise}
\end{cases}$$
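The case distinction above amounts to a two-way multiplexer guarded by the read enable signal. A minimal Python sketch, assuming plain integers stand in for register values; the function and argument names are hypothetical and not taken from the PVS sources:

```python
def forward_from_next_stage(full_w, read_enable, omega_w_r, reg_r):
    """Input generation g_k R for w = k + 1 and a read access without address.

    full_w      -- full bit of stage w
    read_enable -- read enable signal of the access in stage k
    omega_w_r   -- value that stage w currently provides for R
    reg_r       -- current content of the register R.(w+1)
    """
    if not read_enable:
        return 0
    # Stage w is busy: take its output. Otherwise the register is up to date.
    return omega_w_r if full_w else reg_r
```

For example, `forward_from_next_stage(True, True, 7, 3)` selects the stage output 7, whereas `forward_from_next_stage(False, True, 7, 3)` selects the register content 3.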
The following lemma asserts that the input generation function $g_kR$ defined
above provides the correct value. This lemma corresponds to lemma
3.16 (page 82) as used in the sequential machine.
Lemma 4.7. Let $sI(k,T) = i$ and $full_k^T$ hold. Let R be a specification register and
w = k + 1. Assuming that the stage correctness predicates $P_j$ hold in all
cycles up to cycle T, the inputs generated by the function $g_kR$ during cycle
T are correct:
$$g_kR(c_I^T) = G_kR(c_S^i)$$
PROOF. Since R is a specification register, the correct value $G_kR$ on the right-hand
side of the claim is given in the configuration of the specification machine.
Since the read access does not have an address, this is:
$$g_kR(c_I^T) \stackrel{!}{=} \begin{cases}
R_S^i & : f\Gamma_kRre(c_S^i) \\
0 & : \text{otherwise}
\end{cases}$$
The correctness of the read enable signal is asserted as in the proof of
lemma 3.16. After that, the claim is shown for the last stage, which is stage
$n-1$, then for stage $n-2$, and so on until the claim is shown for all stages.
In case of the last stage, there is nothing to show since there is no next
stage to forward from. Assuming the claim holds for stage k+1, the claim
is shown for stage k as follows:
There are two cases regarding the full bit $full.w^T$:
1. The full bit $full.w^T$ is set. Thus, invariant 3.3 states:
$$sI(k+1,T) = i-1$$
Figure 4.5 Forwarding multiple times: register Q depends on R, which depends
on S.
The next thing to show is that the inputs of $f_wR$ are correct:
$$\gamma_wR(c_I^T) \stackrel{!}{=} \Gamma_wR(c_S^{sI(w,T)}) \quad (4.1)$$
This is done as follows: for implementation registers, one applies
lemma 3.16. For specification registers that are in the same stage,
one also uses lemma 3.16. For specification registers that are in
stage w+1, forwarding from the next stage is done. For this case,
the correctness is shown using the induction premise.
This situation is depicted in figure 4.5: register Q depends on register
R, which depends on register S. The proof covers this situation by
using induction as described. However, we do not recommend such a
design since it results in a long cycle time.
The correctness of the inputs implies that the outputs of the stage
given by $\omega_wR(c_I^T)$ are also correct. Formally, lemma 3.13 is used.
Let the write enable signal $f_wRwe$ be active:
$$\begin{aligned}
g_kR(c_I^T) &= \omega_wR(c_I^T) && \text{(by def. of } g_kR\text{)} \\
&= f\gamma_wR(c_I^T) && \text{(by def. of } \omega\text{)} \\
&= f\Gamma_wR(c_S^{sI(w,T)}) && \text{(eq. 4.1)} \\
&= f\Gamma_wR(c_S^{i-1}) && \text{(inv. 3.3)} \\
&= R_S^i && \text{(lemma 3.13)}
\end{aligned}$$
If the write enable signal $f_wRwe$ is not active, the function $\omega_wR$ takes
the value from the register, i.e., $c_I^T.R$. According to the stage correctness
predicate for stage k+1 and cycle T, this is equal to $R_S^{sI(w,T)}$.
This is equal to $R_S^{i-1}$ because of $sI(w,T) = i-1$. This is equal to $R_S^i$
because the register R is not changed since the write enable signal is
not active. Formally, one uses lemma 3.13.
$$\begin{aligned}
g_kR(c_I^T) &= \omega_wR(c_I^T) && \text{(by def. of } g_kR\text{)} \\
&= c_I^T.R && \text{(by def. of } \omega\text{)} \\
&= R_S^{sI(w,T)} && \text{(stage correctness)} \\
&= R_S^{i-1} && \text{(inv. 3.3)} \\
&= R_S^i && \text{(lemma 3.13)}
\end{aligned}$$
This concludes the claim.
2. The full bit $full.w^T$ is not set. In this case, $g_kR$ is by definition:
$$g_kR(c_I^T) = R_I^T \quad (4.2)$$
Using the stage correctness for the value on the right-hand side, this
is transformed into:
$$g_kR(c_I^T) = R_S^{sI(w,T)} \quad (4.3)$$
Invariant 3.2 shows that $sI(w,T)$ is either $i$ or $i-1$. Invariant 3.3
shows that it is not $i-1$, thus, $sI(w,T)$ is $i$. This concludes the
claim.
This concludes the correctness of the forwarding from the next stage. QED
Figure 4.6 Transformation of the PC environment (left panel: sequential machine; right panel: pipelined machine)
Example: Forwarding of DPC in the DLX The DLX implementation
in the previous chapter reads the register DPC in the instruction fetch stage
(section 3.4.2, page 55). The register DPC is an output register of the decode
stage. Thus, the read access to DPC in stage 0 can be realized using
the function $g_0DPC$ as defined above.
According to the definition above, the value read depends on the full bit
of the stage that writes DPC, i.e., it depends on $full_1$. If $full_1$ is not set, the
value from the register DPC.2 is taken. If $full_1$ is set, the value provided
by the transition function of DPC is taken. The transition function of DPC
reads PC' and outputs this value unmodified. If $full_1$ is set, one therefore
takes the value of PC' as input in stage 0.
$$g_0DPC = \begin{cases}
PC'.2 & : full_1 = 1 \\
DPC.2 & : full_1 = 0
\end{cases}$$
The PC environment before and after the transformation is depicted in
figure 4.6.
However, this does not disprove the correctness of the implementation in
[MP00]: in the pipelined implementation in [MP00], the value of DPC is
always taken from the register PC'. This is correct since the stall engine in
[MP00] ensures that the stages 0 and 1 are always clocked simultaneously.
This implies that the full bit $full_1$ is always active iff stage 0 reads DPC
except for the first time after reset. The calculation of the first DPC is
compensated using the signal reset.
4.2.3 Result Forwarding
In the general case, i.e., if w > k+1, $g_kR$ is still undefined. The method
used for the case w = k + 1 cannot be used with reasonable effort since
this would require combining the transition functions of two or even more
stages. Besides the extra hardware cost, these combined transition func-
tions would be too deep and would lengthen the cycle time.
However, forwarding over multiple stages is reasonable in one special
case: Microprocessor instruction sets usually offer different kinds of in-
structions, such as ALU and memory instructions. The value that is to be
forwarded is the result of these operations. The different instructions are
processed by different stages, e.g., by an execute stage and by a memory
stage as described in the previous chapter. The result is therefore available
in an early stage. The later stages just pass this result unmodified. The
transition functions that are left to be applied are very simple in this case:
they are just the identity.
In a sequential machine, it is possible to write the result in the register
file as soon as it is calculated. However, as shown by the example in figure
4.7, it is not possible to do so in the pipelined machine. Consider two
instructions I1 and I2:
I1 : R1 := DM[0]
I2 : R1 := R2+R3
If each instruction writes its result into the register file as soon as it is
available, the register R1 would contain the result of the memory instruction
I1 instead of the ALU instruction I2. The result is therefore written in the last
stage (WB) and not as soon as it is available. This is shown in figure
4.8.
In a pipelined implementation, the result therefore must be buffered in
an implementation register. The value of this implementation register is
written into the register file in the last stage.
In order to realize forwarding of results that are available in an early
stage, it is necessary to specify which implementation register holds the
intermediate result of a specification register.
Figure 4.7 Write back as soon as possible (pipeline diagram over time t for I1: R1 := M[0] and I2: R1 := R2+R3, marking the stage in which each instruction writes R1).
Figure 4.8 Write back in program order (the same two instructions, both writing R1 in stage WB).
Stage     Alias
2 (EX)    $C.3 =_a GPR$
3 (M)     $C.4 =_a GPR$
Table 4.3 Write aliases for the DLX
Definition 4.1 (Write Alias). Let R denote a specification register and Q with
$Q \in out(k)$ denote an implementation output register of stage k. In this case,
let $Q =_a R$ denote that Q is used in order to buffer the final value of R.
The register Q is called a write alias for R.
The list of write aliases is added to the hardware description file.
Example: Result Forwarding in the DLX Consider the prepared sequential
DLX as defined in chapter 3. Consider an ALU instruction, e.g.,
addi. The final result written into GPR, i.e., the sum, is known already in
stage 2. The result of the ALU instruction is written into the register C.3
(figure 3.5, page 54). Thus, one can define $C.3 =_a GPR$.
Table 4.3 shows the list of all write aliases defined for the DLX design.
Definition 4.2 (Valid Values). As soon as a value is written into a write alias
register Q with $Q =_a R$, it is assumed that this value matches the final value
of R in the configuration that is being calculated. Such a value is called a
valid value.
A register Q.(l+1) is written iff the write enable signal of the register
is active. The write enable signal of Q.(l+1) is $f_lQwe$. Thus, a value in
a register Q.(k+1) is valid iff any write enable signal $f_lQwe$ of the alias
register Q of a stage $l \le k$ is active.
Definition 4.3 (Valid Signals). The hardware transformation program uses the
list of write aliases in order to generate a set of additional signals
$Q_kvalid(c_I)$ for each $Q =_a R$.
In the following, it is assumed that precomputed versions of all write
enable signals $f_lQwe$ exist (see section 3.3, page 52). These precomputed
values are stored in implementation registers. Remember that these registers
are named just like the signal. We will use these registers in order to
calculate the valid signals in an obvious way:
$$Q_kvalid : C_I \to \mathbb{B}$$
$$Q_kvalid(c_I) := \bigvee_{l=stage(Q)}^{k} c_I.f_lQwe.k$$
This definition is slightly different from the definition of the valid signals
in [MP00]. The valid signals from [MP00] are on even if the instruction
does not write GPR, e.g., in case of a store instruction, which writes to the
data memory only. However, this does not disprove the correctness of the
hardware in [MP00], since special care is taken for instructions that do not
write GPR.
However, there is no guarantee that a value written into an alias register
matches the value finally written into the register that is to be forwarded.
Thus, this assumption must be proved for each Q =a R.
This is formalized as follows: Let $c_S$ be a configuration of the specification
machine. In analogy to the valid signals above, we define a correct
valid signal. The predicate $Q_kValid(c_S)$ holds iff a correct write enable
signal $f_lQwe$ of the alias register Q of a stage $l \le k$ holds.
As described in section 3.3, the registers holding the precomputed ver-
sions of control signals are treated just like implementation registers. Thus,
the definition of a correct value of a register, as given in section 3.5.4 (page
71), also applies for these registers. Thus, one can use the correct value of
the registers holding precomputed signals in order to define the correct
value of the valid signals.
The register holding the precomputed version of a signal is named just
like the signal, i.e., if the name of the write enable signal is $f_lQwe$, so is the
name of the register. The correct value of a register R.k is $\Omega_{k-1}R$. Thus,
the predicate $Q_kValid$ is defined as follows:
$$Q_kValid : C_S \to \mathbb{B}$$
$$Q_kValid(c_S) := \bigvee_{l=stage(Q)}^{k} \Omega_{k-1}f_lQwe(c_S)$$
Example: Result Forwarding in the DLX Consider the prepared sequential
DLX as defined in chapter 3. Let
$$I_0, I_1, I_2, \ldots$$
denote an instruction sequence, as in section 3.5.2 (page 63). As described
above, if $I_i$ is an ALU instruction, the final result written into GPR is known
already in stage 2. The result of the ALU instruction is written into the
register C.3. In this case, $C_2Valid(c_S^i)$ holds because the write enable signal
$f_2Cwe$ is active.
Let $I_i$ be a load instruction. In this case, $C_2Valid(c_S^i)$ does not hold
because neither $f_2Cwe$ nor $f_1Cwe$ is active.
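The valid predicate of this example is a plain disjunction over the write enable signals of the alias register. The following Python sketch illustrates it for the alias register C. The function `valid` and the table `C_WRITE_ENABLES` are hypothetical names. The entries for stages 1 and 2 follow the text (jump and link instructions activate the write enable of stage 1, ALU instructions that of stage 2); the stage 3 entry for load instructions is an assumption made for illustration only.

```python
def valid(write_enables, first_stage, k):
    """Valid predicate of stage k for a write alias Q.

    write_enables -- maps a stage number l to the write enable signal f_l Q we
    first_stage   -- the first stage that writes Q, i.e., stage(Q)
    """
    return any(write_enables.get(l, False) for l in range(first_stage, k + 1))


# Write enables of the alias register C per instruction class.
# Assumption: a load activates the write enable of stage 3 (M).
C_WRITE_ENABLES = {
    "jal": {1: True},
    "alu": {1: False, 2: True},
    "load": {1: False, 2: False, 3: True},
}

alu_valid_in_ex = valid(C_WRITE_ENABLES["alu"], 1, 2)
load_valid_in_ex = valid(C_WRITE_ENABLES["load"], 1, 2)
load_valid_in_m = valid(C_WRITE_ENABLES["load"], 1, 3)
```

Under these assumptions, the ALU instruction is valid in stage 2, whereas the load instruction becomes valid only in stage 3.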
Lemma 4.8. Let w = stage(R) hold. The following statement is shown for each $Q =_a R$:
if the valid predicate of a register Q holds, the correct value written in
this implementation register has to match the final value generated by the
register transition function of R. The correct value written into Q.(k+1)
is given by $\Omega_kQ(c_S^i)$ (section 3.5.4, page 71).
If the write access to R does not have an address:
$$Q_kValid(c_S^i) \implies \Omega_kQ(c_S^i) = c_S^{i+1}.R$$
If the write access to R has an address, we assume that the control signals
for the write address $f_wRwa$ are also precomputed. The correct value of
this address is just the correct value of the implementation register holding the
precomputed write address.
$$Q_kValid(c_S^i) \implies \Omega_kQ(c_S^i) = c_S^{i+1}.R[\Omega_{k-1}f_wRwa(c_S^i)]$$
This is illustrated in figure 4.9 for stage k = 2 (EX). In this stage, the
result, which is to be forwarded, is provided by the ALU. This is $\Omega_2C$.
This result is calculated using the values in the registers in out(1) as inputs.
Thus, the address is also taken from a register in out(1), which is
$f_4GPRwa.2$. The correct value of $f_4GPRwa.2$ is given by $\Omega_1f_4GPRwa$.
The same method is used in [MP00] with a different notation: the result
provided by the ALU is denoted by C'.2. The address is taken from
$f_4GPRwa.2$, which is a precomputed version of the write address used for
writing GPR. In correspondence to lemma 4.8, [MP00] provides a lemma
in order to argue that the value of C'.2 matches the final value assuming
the valid signal is active.
Figure 4.9 ALU results and the register address in the machine without implementation
registers
PROOF. Lemma 4.8 is shown for the write aliases for the pipelined DLX as defined
in table 4.3.
For $C.3 =_a GPR$, the claim is:
$$C_2Valid(c_S^i) \implies \Omega_2C(c_S^i) = c_S^{i+1}.GPR[\Omega_1f_4GPRwa(c_S^i)] \quad (4.4)$$
The first step is to conclude that the write enable signal $f_4GPRwe$ is
active using that $C_2Valid(c_S^i)$ holds, i.e., one shows that an instruction in
stage EX that is valid actually writes a GPR. Remember that the write
enable signal of GPR.4 is precomputed in the decode stage (section 3.3,
page 52), i.e., $f_4GPRwe$ just takes the value from the register $f_4GPRwe.4$.
The correct value of this register is $\Omega_3f_4GPRwe(c_S^i)$:
$$f_4GPRwe(\Omega_3f_4GPRwe(c_S^i)) = \Omega_3f_4GPRwe(c_S^i)$$
By repeatedly expanding the definition of the function on the right-hand
side, one gets:
$$\begin{aligned}
f_4GPRwe(\Omega_3f_4GPRwe(c_S^i)) &= \Omega_3f_4GPRwe(c_S^i) \\
&= \Omega_2f_4GPRwe(c_S^i) \\
&= \Omega_1f_4GPRwe(c_S^i) \\
&= f_1f_4GPRwe(\Omega_0IR(c_S^i))
\end{aligned}$$
That this value is active is shown by expanding the definition of the correct
valid signal:
$$\begin{aligned}
C_2Valid(c_S^i) &= \Omega_1f_1Cwe(c_S^i) \lor \Omega_1f_2Cwe(c_S^i) \\
&= f_1f_1Cwe(\Omega_0IR(c_S^i)) \lor f_1f_2Cwe(\Omega_0IR(c_S^i))
\end{aligned}$$
As defined in section 3.4.3 (page 55), $f_1f_1Cwe(\Omega_0IR(c_S^i))$ holds if the
instruction $I_i$ is a jump and link instruction. The write enable function
$f_1f_2Cwe(\Omega_0IR(c_S^i))$ holds if $I_i$ is an ALU/shift instruction (section 3.4.4).
This allows concluding that $f_1f_4GPRwe(\Omega_0IR(c_S^i))$ holds (section 3.4.6,
page 60). Thus, $\Omega_3f_4GPRwe(c_S^i)$ holds.
This allows concluding that the write enable signal with correct input
$f_4GPRwe(\Omega_3f_4GPRwe(c_S^i))$ is active, which is equivalent to $f\Gamma_4GPRwe$.
We conclude the claim using lemma 3.15: Since the write enable signal
$f\Gamma_4GPRwe(c_S^i)$ is active, lemma 3.15 (page 78, the generic claim is in
lemma 3.13, page 75) states for all addresses x:
$$c_S^{i+1}.GPR[x] = \begin{cases}
f\Gamma_4GPR(c_S^i) & : x = f\Gamma_4GPRwa(c_S^i) \\
c_S^i.GPR[x] & : \text{otherwise}
\end{cases} \quad (4.5)$$
By expanding definitions, one shows that $f\Gamma_4GPRwa(c_S^i)$ is equal to
the precomputed version, which is $\Omega_1f_4GPRwa(c_S^i)$. Using that equality,
equation 4.5 with $x = \Omega_1f_4GPRwa(c_S^i)$ is:
$$c_S^{i+1}.GPR[\Omega_1f_4GPRwa(c_S^i)] = f\Gamma_4GPR(c_S^i) \quad (4.6)$$
By inserting equation 4.6 into the claim (equation 4.4), the claim is transformed
into:
$$\Omega_2C(c_S^i) \stackrel{!}{=} f\Gamma_4GPR(c_S^i) \quad (4.7)$$
This new claim is shown as follows: the first step is to show that the
instruction is not a load instruction, as indicated by $I\_load(\Omega_1IR(c_S^i))$. This
is done using that $C_2Valid(c_S^i)$ holds. One shows that the instruction coded
by $\Omega_1IR(c_S^i)$ is either a jump and link or an ALU/shift instruction. Thus, it
cannot be a load instruction. One easily shows that $\Omega_1IR$ is equal to $\Omega_3IR$,
thus, $I\_load(\Omega_3IR(c_S^i))$ is also not active.
The proof proceeds by expanding the definition of $\Gamma_4GPR$ on the right-hand
side of the claim (equation 4.7):
$$\begin{aligned}
\Omega_2C(c_S^i) &\stackrel{!}{=} f_4GPR(\Gamma_4GPR(c_S^i)) \\
&= f_4GPR(\Omega_3C(c_S^i), \Omega_3IR(c_S^i), \Omega_3MAR(c_S^i), \Omega_3MDRr(c_S^i))
\end{aligned} \quad (4.8)$$
One then expands the definition of $f_4GPR$ (section 3.4.6). Since the
instruction is not a load instruction ($I\_load(\Omega_3IR(c_S^i))$ does not hold), the
function $f_4GPR$ returns the value of the C register. This transforms the
claim into:
$$\Omega_2C(c_S^i) \stackrel{!}{=} \Omega_3C(c_S^i) \quad (4.9)$$
This is shown by expanding the definition of $\Omega_3C(c_S^i)$ on the right-hand
side.
Very similar arguments are used in order to show the claim for $C.4 =_a GPR$. QED
Implementing Result Forwarding Thus, if such a $Q_kValid$ predicate
holds, it is possible to take the result written into the Q.(k+1) register as
the value for a read access. This is done only if the instruction in a given
stage actually writes the desired register. A signal is defined that is active
if this holds. This signal is called hit signal.
Definition 4.4 (Hit Signals). Let stage k depend on a specification register R
that is an output register of stage w with $w \ge k+1$. In addition to the valid
signals, a set of hit signals is defined as follows: if the write access to R
does not have an address:
$$\forall j \in \{k+1, \ldots, w\}: \quad R_khit[j](c_I) := full_j(c_I) \land f_wRwe.j$$
We use the precomputed version of the write enable signal of R in order
to determine if the instruction in the given stage writes R. If the write access
has an address, it is necessary to check the address of the write access
in addition to the conditions above. As above, we use the precomputed
version of the write address.
$$\forall j \in \{k+1, \ldots, w-1\}: \quad R_khit[j](c_I) := full_j(c_I) \land f_wRwe.j \land (f\gamma_kRra(c_I) = f_wRwa.j)$$
In case of stage w, the address and write enable signals are taken from
the write access directly:
$$R_khit[w](c_I) := full_w(c_I) \land f\gamma_wRwe(c_I) \land (f\gamma_kRra(c_I) = f\gamma_wRwa(c_I))$$
A very similar definition is in [MP00]. If any hit signal of a stage j is
active, let top denote the smallest such j, i.e., the topmost stage with active
hit signal:
$$top := \min\{\, j \in \{k+1, \ldots, w\} \mid R_khit[j](c_I) \,\}$$
This is undefined if no signal $R_khit[j]$ is active. The signal
$$R_khit[top]$$
is called topmost hit signal.
Using this definition, one can now define the forwarding function $g_kR$.
For sake of simplicity, let us assume that the read enable signal $f\gamma_kRre(c_I)$
is active. If not, $g_kR$ returns zero and no forwarding is necessary.
If the topmost hit signal is in stage w, i.e., the stage that actually outputs
R, one just takes the value written into R, which is provided by $f_wR$. If the
topmost hit signal is in a stage j < w, one takes the value written into the
alias register, i.e., $\omega_jQ(c_I)$. If no hit signal is active, one takes the value
from R.
If the write access to R does not have an address, $g_kR$ is:
$$g_kR(c) = \begin{cases}
f\gamma_wR(c) & : R_khit[w](c) \land w = top \\
\omega_jQ(c) & : j \in \{k+1, \ldots, w-1\} \land R_khit[j](c) \land j = top \\
R.(w+1) & : \text{otherwise}
\end{cases} \quad (4.10)$$
If the read access has an address, $g_kR$ is defined using the read address.
Let $x := f\gamma_kRra(c)$ be the address.
$$g_kR(c) = \begin{cases}
f\gamma_wR(c) & : R_khit[w](c) \land w = top \\
\omega_jQ(c) & : j \in \{k+1, \ldots, w-1\} \land R_khit[j](c) \land j = top \\
R.(w+1)[x] & : \text{otherwise}
\end{cases} \quad (4.11)$$
The same forwarding method is used in [MP00]. As in [MP00], the
comparison j = top is realized using a chain of multiplexers (in PVS, IF
... THEN ... ELSIF ... ENDIF is used). This is depicted by way of example in
figure 4.10.^1
^1 In larger pipelines, the delay of this circuit grows linearly with the pipeline size. For
large pipelines, a find-first-one circuit is faster.
Figure 4.10 Implementation of 3-stage forwarding (a chain of three multiplexers controlled by hit[2], hit[3], and hit[4] that selects among $\omega_2C$, $\omega_3C$, $\omega_4R$, and R.5)
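The multiplexer chain can be mirrored in software by scanning the hit signals from stage k+1 upwards. The following Python sketch assumes an active read enable signal, as in the text; the names `topmost_hit` and `forward` and the dictionary-based encoding of the signals are hypothetical. The sketch does not model the valid signals: a hit in a stage whose alias value is not yet valid has to be handled by stalling.

```python
def topmost_hit(hits, k, w):
    """Return the smallest j in k+1..w with an active hit signal, else None."""
    for j in range(k + 1, w + 1):
        if hits.get(j, False):
            return j
    return None


def forward(hits, k, w, stage_w_value, alias_values, reg_value):
    """Forwarding function g_k R for an active read enable signal.

    hits          -- maps a stage j to the hit signal of that stage
    stage_w_value -- value that stage w is about to write into R
    alias_values  -- maps a stage j < w to the value written into the alias register
    reg_value     -- content of R.(w+1), already indexed if the access has an address
    """
    top = topmost_hit(hits, k, w)
    if top is None:
        return reg_value            # no hit: read the register itself
    if top == w:
        return stage_w_value        # hit in the stage that writes R
    return alias_values[top]        # hit in an earlier stage: use the alias


# Decode stage (k = 1) reading a register written by stage w = 4.
example = forward({2: False, 3: True, 4: True}, 1, 4,
                  stage_w_value=40, alias_values={2: 20, 3: 30}, reg_value=50)
```

In the example, stages 3 and 4 both signal a hit; the topmost hit is in stage 3, so the alias value 30 of the younger instruction is selected.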
Observe that the forwarding from the next stage, as described in section
4.2.2, is just a special instantiation of the more general forwarding method
described in this section.
In case of a hit in a stage, lemma 4.8 will be used in order to argue about
the value read from the given stage. However, if the hit signal is not active,
one has to argue that one can safely ignore the contents of the stage in
order to do forwarding. This is asserted by the following lemma. In terms
of microprocessors, the lemma asserts that the instruction in a given stage
does not update the register that is to be forwarded if the hit signal is off.
Lemma 4.9. Let $Q =_a R$ hold and let $Q \in out(j)$ and $R \in out(w)$ hold. Consider the
correct value of the precomputed write enable signal of R.
If the write access to R does not have an address, the register R is not
modified if the write enable signal is not active:
$$\overline{\Omega_{j-1}f_wRwe} \implies c_S^i.R = c_S^{i+1}.R$$
Observe that $\Omega_{j-1}f_wRwe$ is just the correct version of the hit signal
(definition 4.4). Thus, it is called correct hit signal (the full signal is omitted
because the configuration $c_S$ does not have full bits; it processes one
instruction in one cycle).
If the write access to R has an address, in analogy to definition 4.4, the
address x is compared with the correct precomputed address, as defined by
$\Omega_{j-1}f_wRwa(c_S)$.
Thus, in terms of microprocessors, if the addresses are not equal or if
the write enable function returns false, the instruction in stage j does not
update the desired register.
$$\forall x \in W_a(R): \quad x \ne \Omega_{j-1}f_wRwa(c_S^i) \lor \overline{\Omega_{j-1}f_wRwe} \implies c_S^i.R[x] = c_S^{i+1}.R[x]$$
The lemmas 4.8 and 4.9 are called write alias correctness lemmas.
PROOF. Lemma 4.9 is shown for the write aliases for the pipelined DLX as defined
in table 4.3.
For $C.3 =_a GPR$, the claim is:
$$\forall x \in W_a(GPR): \quad x \ne \Omega_1f_4GPRwa(c_S^i) \lor \overline{\Omega_1f_4GPRwe(c_S^i)} \stackrel{!}{\implies} c_S^i.GPR[x] = c_S^{i+1}.GPR[x]$$
There are two cases regarding the value of $f\Gamma_4GPRwe(c_S^i)$. If it is off,
the claim directly follows from lemma 3.15 (page 78).
Thus, let $f\Gamma_4GPRwe(c_S^i)$ hold. From this, we easily show by expanding
definitions that $\Omega_1f_4GPRwe(c_S^i)$ holds:
$$\begin{aligned}
f\Gamma_4GPRwe(c_S^i) &= f_4GPRwe(\Gamma_4GPRwe(c_S^i)) \\
&= f_4GPRwe(\Omega_3f_4GPRwe(c_S^i)) \\
&= \Omega_3f_4GPRwe(c_S^i) \\
&= \Omega_2f_4GPRwe(c_S^i) \\
&= \Omega_1f_4GPRwe(c_S^i)
\end{aligned}$$
This transforms the claim into:
$$\forall x \in W_a(GPR): \quad (x \ne \Omega_1f_4GPRwa(c_S^i)) \stackrel{!}{\implies} c_S^i.GPR[x] = c_S^{i+1}.GPR[x]$$
Let ind be a shorthand for $f\Gamma_4GPRwa(c_S^i)$. Since the write enable signal
is active, lemma 3.15 states for all addresses x:
$$c_S^{i+1}.GPR[x] = \begin{cases}
f_4GPR(\Gamma_4GPR(c_S^i)) & : x = ind \\
c_S^i.GPR[x] & : \text{otherwise}
\end{cases} \quad (4.12)$$
By inserting this into the claim, the claim is transformed into:
$$\forall x \in W_a(GPR): \quad (x \ne \Omega_1f_4GPRwa(c_S^i)) \stackrel{!}{\implies} c_S^i.GPR[x] = \begin{cases}
f_4GPR(\Gamma_4GPR(c_S^i)) & : x = ind \\
c_S^i.GPR[x] & : \text{otherwise}
\end{cases}$$
We will now conclude $x \ne ind$ using $x \ne \Omega_1f_4GPRwa(c_S^i)$:
$$\begin{aligned}
ind &= f\Gamma_4GPRwa(c_S^i) \\
&= f_4GPRwa(\Gamma_4GPRwa(c_S^i)) \\
&= f_4GPRwa(\Omega_3f_4GPRwa(c_S^i)) \\
&= \Omega_3f_4GPRwa(c_S^i) \\
&= \Omega_2f_4GPRwa(c_S^i) \\
&= \Omega_1f_4GPRwa(c_S^i) \\
&\ne x
\end{aligned}$$
This concludes the claim. Very similar arguments are used in order to
show the claim for $C.4 =_a GPR$. QED
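The statement of lemma 4.9 for an addressed register file can be illustrated with a small executable model. The Python sketch below models the next-state behaviour of equation 4.12 on a list of integers and checks that every address that is not hit keeps its value. The names `gpr_step` and `unchanged_where_no_hit` are hypothetical, and the model abstracts from the precomputation of the write enable and write address signals.

```python
def gpr_step(gpr, write_enable, write_address, value):
    """Next state of an addressed register file (cf. equation 4.12)."""
    nxt = list(gpr)
    if write_enable:
        nxt[write_address] = value
    return nxt


def unchanged_where_no_hit(gpr, write_enable, write_address, value):
    """Check the claim of lemma 4.9 for every address x of the model."""
    nxt = gpr_step(gpr, write_enable, write_address, value)
    return all(
        gpr[x] == nxt[x]
        for x in range(len(gpr))
        # Premise of the lemma: addresses differ or the write enable is off.
        if x != write_address or not write_enable
    )
```

For instance, `gpr_step([0, 1, 2, 3], True, 2, 99)` changes only address 2, and `unchanged_where_no_hit([0, 1, 2, 3], True, 2, 99)` confirms that all other addresses keep their values.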
Consider a hit in stage top and that the valid signal of that stage, as given
by definition 4.3, does not hold. In this case, one cannot use lemma 4.8 to
argue that the values in the stage are valid. Lemma 4.9 cannot be used,
either, in order to argue that the stage can be ignored, since there is a hit in
the stage. In this case, forwarding, as described above, fails completely.
4.3 Stalling
If forwarding fails, the calculation of the input values of the stage is not
possible. It is necessary to delay the calculation in a stage until the data
is available. Since the result of prior stages has to be stored somewhere,
these stages have to wait also. In contrast to that, later stages must not be
stalled since these stages calculate the desired inputs. The mechanism used
to realize this is called stall engine and is introduced in [MP00]. In con-
trast to the stall engine in [MP00], the stall engine presented here supports
independent stall signals for each stage.
The stall engine is implemented by re-defining the signals $stall_k$. In the
prepared sequential machine, the stall signals are only used in order to obey
external stall conditions such as those caused by slow memory. In the pipelined
machine, internal stall conditions are added.
In order to calculate the signals $stall_k$, a signal is required that
indicates whether a given stage has to wait for an input value. The signal
$dhaz_k$ (data hazard) is active iff stage k is waiting for an input
operand. Stage k must be stalled if $dhaz_k$ is active and the stage is
full. This is called a data hazard stall.
Furthermore, the stage must be stalled if the next stage (stage k+1) is
stalled because the necessary data paths and registers are not available in
this case. This case is called a structural hazard stall.
Let $ext_k$ denote the disjunction of the external stall signals of stage k,
e.g., used for memory. Let $int_k$ denote the disjunction of the internal
stall signals. Since the last stage (stage n-1) has no next stage, the
definition of the signal $int_k$ depends on the stage number:
\[
\begin{aligned}
k \neq n-1:\quad int_k &:= dhaz_k \lor stall_{k+1} \\
int_{n-1} &:= dhaz_{n-1}
\end{aligned}
\tag{4.13}
\]
This allows defining the stall signal $stall_k$:
\[ stall_k := full_k \land (ext_k \lor int_k) \tag{4.14} \]
This definition of the stall signal obviously conforms to the stall signal
convention 4.2 (page 95; see footnote 2).
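Equations 4.13 and 4.14 can be evaluated from the last stage backwards,
because the stall signal of stage k only depends on the stall signal of
stage k+1. The following Python sketch illustrates this. The function name
stall_signals and the list-based encoding of the signals are assumptions
made for this illustration; they are not taken from the PVS sources or from
the hardware.

```python
from typing import List


def stall_signals(full: List[bool], ext: List[bool], dhaz: List[bool]) -> List[bool]:
    """Evaluate equations 4.13 and 4.14 for the stages 0..n-1.

    full[k], ext[k] and dhaz[k] are the full bit, the disjunction of the
    external stall signals and the data hazard signal of stage k.
    """
    n = len(full)
    stall = [False] * n
    # Walk from the last stage to the first one, since stall_k needs stall_{k+1}.
    for k in range(n - 1, -1, -1):
        if k == n - 1:
            int_k = dhaz[k]                    # equation 4.13, last stage
        else:
            int_k = dhaz[k] or stall[k + 1]    # equation 4.13, other stages
        stall[k] = full[k] and (ext[k] or int_k)   # equation 4.14
    return stall


# A data hazard in stage 1 of a full five-stage pipeline stalls stages 0 and 1,
# while the later stages keep running.
print(stall_signals([True] * 5, [False] * 5, [False, True, False, False, False]))
```

In this sketch, a hazard in stage 1 propagates to stage 0 as a structural
hazard stall, while stages 2 to 4 are not stalled and can therefore produce
the missing operand.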
As described above, forwarding fails if there is a hit in stage top and the
valid signal of the stage does not hold. For each input that requires result
forwarding, a separate data hazard signal is defined. Let $R \in out(w)$ be a
specification register that is read by stage k. The data hazard signal for
this input is then called $R_k dhaz$. The data hazard signal of stage k,
which is $dhaz_k$, is the disjunction of these data hazard signals.
In case of the DLX, we have two read accesses to the general purpose
register file. In analogy to the naming convention described in section 3.2.7
(page 51), GPRa and GPRb are used for the GPR operands:
\[ dhaz_1 := GPRa_1 dhaz \lor GPRb_1 dhaz \]
Footnote 2: In the PVS tree, this is shown in form of a TCC
(type-correctness condition).
The data hazard signals $R_k dhaz$ are defined as follows: if there is no
hit signal active, no data hazard is indicated. If there is a hit in any
stage, the stage given by top is examined. As in [MP00], a data hazard is
indicated if the stage is not valid. In case of stage w = stage(R), there is
no valid signal. In stage w, the result is written into the register and the
result is therefore known in stage w at the latest. Thus, there is no need
for a valid signal in stage w.
In addition to that, a data hazard is also indicated if the data hazard signal
of the stage top is active:
\[
R_k dhaz(c_I) :=
\begin{cases}
dhaz_w(c_I) & : R_k hit[w](c_I) \land w = top \\
dhaz_j(c_I) \lor \overline{Q_j valid(c_I)} & : j \in \{k+1, \dots, w-1\} \land R_k hit[j](c_I) \land j = top \\
0 & : \text{otherwise}
\end{cases}
\]
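Read as a decision procedure, the definition selects exactly one case
according to the stage top. The following Python sketch is a minimal
illustration under the assumption that top has already been computed
(section 4.2.3) and is passed in as a stage index, or as None if no hit
signal is active. The function name rk_dhaz and the dictionary encoding of
the signals are hypothetical.

```python
from typing import Dict, Optional


def rk_dhaz(k: int, w: int, top: Optional[int],
            hit: Dict[int, bool], dhaz: Dict[int, bool],
            q_valid: Dict[int, bool]) -> bool:
    """Data hazard signal for one operand R that is read by stage k.

    hit[j]     -- hit signal R_k hit[j] for the stages j in k+1..w
    dhaz[j]    -- data hazard signal dhaz_j of stage j
    q_valid[j] -- valid signal Q_j valid for the stages j in k+1..w-1
    """
    if top is None or not hit.get(top, False):
        return False                          # no hit: no data hazard
    if top == w:
        return dhaz[w]                        # stage w has no valid signal
    if k + 1 <= top <= w - 1:
        # Hazard if stage top itself waits for data or its value is not valid.
        return dhaz[top] or not q_valid[top]
    return False
```

For example, with k = 1, w = 4 and a hit in stage top = 2, the signal is
active exactly if stage 2 is not valid or has a data hazard of its own.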
This addition to [MP00] is motivated as follows: If the valid signal of
the stage is active, one uses lemma 4.8 in order to show that the output
of the stage matches the value finally written in the register that is read.
However, the output value of the stage is only correct if the inputs of the
stage are correct. Assume the stage uses forwarding in order to get one or
more inputs. These inputs are only correct if the forwarding does not fail.
It will turn out that forwarding fails iff the data hazard signal is active.
Thus, the data hazard signal is checked.
This does not disprove the correctness of the implementation of the for-
warding logic in [MP00]: in [MP00], data is forwarded from stages 2 to
4. For the calculation of GPR results, these stages never use forwarding in
order to get inputs.
This also applies to the pipelined DLX presented in this chapter: stages
2 to 4 never use result forwarding, so the data hazard signals of stages 2
to 4 are always false.
In hardware, the comparison of the stage number with top is done with
multiplexers as described in section 4.2.3.
4.4 Implementing the DLXpi
The implementation of the DLXpi is completely identical to the implemen-
tation of the machine DLXσ as described in the previous chapter except for
the following changes:
1. The definition of $full_0$ and the stall signals are changed as described
in sections 4.1.1 and 4.3.
2. Forwarding logic is added for the stages IF and ID as described in
section 4.2.
The process of introducing the new stalling and forwarding logic is fully
automated.
4.5 Data Consistency
The following lemmas assert the correctness of the result forwarding mech-
anism as presented in the previous sections. These lemmas correspond to
lemma 3.16 (page 82) of the sequential machine. They assert that the
inputs generated during cycle T-1 are correct. They will be used in order
to show that the values of the registers during cycle T are correct.
Lemma 4.10. Let $s_I(k,T) = i$ and $full_k^T$ hold. Let $R \in out(w)$ be a
specification register with w > k and let the stage correctness predicates
$P_j$ hold in all cycles up to cycle T. If there is no hit signal active,
register R is not modified from configuration $c_S^{s_I(w,T)}$ to
configuration $c_S^i$:
\[ R_S^{s_I(w,T)} = R_S^i \]
If the read access has an address, the claim is that the register with the
given address is not modified. Let x denote the address.
\[ x := f\Gamma_k Rra(c_S^i), \qquad R_S^{s_I(w,T)}[x] = R_S^i[x] \]
PROOF. The first step is to assert the correctness of an address value, if
present:
\[ f\Gamma_k Rra(c_S^i) \stackrel{!}{=} f\gamma_k Rra(c_I^T) \]
Thus, one has to show:
\[ \Gamma_k Rra(c_S^i) \stackrel{!}{=} \gamma_k Rra(c_I^T) \]
This is done as in the proof of lemma 3.16 (page 82). By definition, the
inputs required in order to calculate the address do not require forwarding.
The claim is then shown easily by a full case split on the full bits of
the stages k+1 to w. As soon as fixed values for the full bits are given,
the scheduling invariants can be used in order to determine the value of
$s_I(w,T)$ relative to $s_I(k,T)$ (T > 0 is shown easily using lemma 4.6).
For example, if all full bits are off, one easily shows that
$s_I(w,T) = s_I(k,T) = i$ holds. In this case, the claim above obviously
holds.
If the full bit of one or more stages is set, let
\[ diff := s_I(k,T) - s_I(w,T) \]
denote the difference between the values of the scheduling function. Using
the scheduling invariants 3.2 and 3.3, one easily shows that there are as
many active full bits as given by diff.
For each active full bit $full_l^T$, one argues that
\[ R_S^{s_I(l,T)} = R_S^{s_I(l,T)+1} \]
holds. If the read access has an address, it is argued that
\[ R_S^{s_I(l,T)}[x] = R_S^{s_I(l,T)+1}[x] \]
holds. This is done using the fact that the hit signal $R_k hit[l]$ is off
and by lemma 4.9 if $l \neq w$ and by lemma 3.13 if $l = w$.
This can be done diff times and concludes the claim. QED
Lemma 4.11. Let $s_I(k,T) = i$ and $full_k^T$ hold. Let $R \in out(w)$ be a
specification register with w > k and let the stage correctness predicates
$P_j$ hold in all cycles up to cycle T. If there is an active hit signal,
register R is not modified from configuration $c_S^{s_I(top,T)+1}$ to
configuration $c_S^i$:
\[ R_S^{s_I(top,T)+1} = R_S^i \]
If the read access has an address:
\[ x := f\Gamma_k Rra(c_S^i), \qquad R_S^{s_I(top,T)+1}[x] = R_S^i[x] \]
It is not surprising that one argues about $R_S^{s_I(top,T)+1}$. In case of
an active hit signal, the forwarding hardware takes the output of the stage
top. If instruction $I_{s_I(top,T)}$ is in stage top, the outputs of the
stage are part of configuration $c_S^{s_I(top,T)+1}$. Let $j = s_I(top,T)$
hold:
\[ c_S^j \xrightarrow{\;I_j\;} c_S^{j+1} \]
PROOF. This is shown with the same method as used in the proof of lemma
4.10. For example, if top = k+1, i.e., the hit is in the next stage, one
easily shows that $s_I(top,T) + 1 = i$ holds using invariant 3.3 and that
$full_{top}^T$ is active.
Let top be k+2 and $full_{k+1}^T$ not hold. In this case, one uses invariants
3.2 and 3.3 in order to show that $s_I(top,T) + 1 = i$. If a full bit is
active, one uses lemma 4.9, as above. QED
Lemma 4.12. Let $s_I(k,T) = i$ and $full_k^T$ hold and let the data hazard
signal $dhaz_k^T$ be not active. Let $R \in out(w)$ be a specification
register with w > k and let the stage correctness predicates $P_j$ hold in
all cycles up to cycle T. Let there be no hit signal active.
The claim is that the inputs generated by the function $g_k R$ during cycle
T are correct:
\[ g_k R(c_I^T) = G_k R(c_S^i) \]
PROOF. Since R is a specification register, the correct value on the
right-hand side of the claim is given in the configuration of the
specification machine. If the read access does not have an address, this
transforms the claim into:
\[
g_k R(c_I^T) \stackrel{!}{=}
\begin{cases}
R_S^i & : f\Gamma_k Rre(c_S^i) \\
0 & : \text{otherwise}
\end{cases}
\]
In case of a read access with address, the correct value is defined using
the correct value of the address, as in lemma 3.16 (page 82):
\[
x := f\Gamma_k Rra(c_S^i), \qquad
g_k R(c_I^T) \stackrel{!}{=}
\begin{cases}
R_S^i[x] & : f\Gamma_k Rre(c_S^i) \\
0 & : \text{otherwise}
\end{cases}
\]
The first step is to assert the correctness of an address value, if present,
as done in the proof of lemma 4.10:
\[ f\Gamma_k Rra(c_S^i) = f\gamma_k Rra(c_I^T) \]
By the same arguments, one shows the correctness of the inputs of the
read enable function $f_k Rre$. Let the read enable signal $f_k Rre$ be
active. Otherwise, the claim is trivial since zero is returned and no
forwarding is required. This transforms the claim into:
\[
\begin{aligned}
\text{no address:}\quad & g_k R(c_I^T) \stackrel{!}{=} R_S^i \\
\text{with address:}\quad & g_k R(c_I^T) \stackrel{!}{=} R_S^i[x]
\end{aligned}
\tag{4.15}
\]
Since no hit signal is active, by definition of $g_k R$ (equation 4.10),
$R_I^T$ is read:
\[ g_k R(c_I^T) = R_I^T \]
If the read access has an address, the correct address is used (equation
4.11):
\[ g_k R(c_I^T) = R_I^T[x] \]
Using the stage correctness predicate for cycle T and stage w, one easily
transforms the right-hand side of both equations into:
\[
\begin{aligned}
\text{no address:}\quad & g_k R(c_I^T) = R_S^{s_I(w,T)} \\
\text{with address:}\quad & g_k R(c_I^T) = R_S^{s_I(w,T)}[x]
\end{aligned}
\tag{4.16}
\]
This allows transforming the claim into:
\[
\begin{aligned}
\text{no address:}\quad & R_S^{s_I(w,T)} \stackrel{!}{=} R_S^i \\
\text{with address:}\quad & R_S^{s_I(w,T)}[x] \stackrel{!}{=} R_S^i[x]
\end{aligned}
\tag{4.17}
\]
This is concluded using lemma 4.10. QED
Lemma 4.13. Let $s_I(k,T) = i$ and $full_k^T$ hold and let the data hazard
signal $dhaz_k^T$ be not active. Let $R \in out(w)$ be a specification
register with w > k and let the
stage correctness predicates $P_j$ hold in all cycles up to cycle T. Let
there be an active hit signal.
The claim is that the inputs generated by the function $g_k R$ during cycle
T are correct:
\[ g_k R(c_I^T) = G_k R(c_S^i) \]
PROOF. Since R is a specification register, the correct value on the
right-hand side of the claim is given in the configuration of the
specification machine. If the read access does not have an address, this
transforms the claim into:
\[
g_k R(c_I^T) \stackrel{!}{=}
\begin{cases}
R_S^i & : f\Gamma_k Rre(c_S^i) \\
0 & : \text{otherwise}
\end{cases}
\]
In case of a read access with address, the correct value is defined using
the correct value of the address, as in lemma 3.16 (page 82):
\[
x := f\Gamma_k Rra(c_S^i), \qquad
g_k R(c_I^T) \stackrel{!}{=}
\begin{cases}
R_S^i[x] & : f\Gamma_k Rre(c_S^i) \\
0 & : \text{otherwise}
\end{cases}
\]
The claim is shown inductively beginning with the last stage and
proceeding from stage k+1 to stage k. In case of the last stage, which is
stage n-1, there is nothing to show since there is no stage below to forward
from. Assuming the claim holds for stages k' with k < k' < n, the claim is
shown for stage k as follows:
As in the proof of lemma 4.12, one asserts the correctness of the address
value and that the read enable signal is active.
As required in the premise, the data hazard signal $R_k dhaz^T$ is not
active. By definition of the data hazard signal, this implies that the valid
bit of the stage top is active and that the data hazard signal of stage top
is not active.
As described above, one assumes the correctness of the inputs of the
stages k' > k in order to show the correctness of the inputs of stage k.
Since top > k, one can apply the induction premise for stage top. This
shows the correctness of the inputs of the stage top:
\[ \gamma_{top} R(c_I^T) = \Gamma_{top} R(c_S^{s_I(w,T)}) \tag{4.18} \]
The claim is now shown by a case split on the value of top (in PVS, a
separate lemma is used for the possible values of top).
Let top = w hold, i.e., the hit is in the stage that writes R. Since
top = w, $g_k R$ returns the value written into R.(w+1). If the write access
does not have an address, this is (equation 4.10):
\[ g_k R(c_I^T) = f\gamma_w R(c_I^T) \tag{4.19} \]
As described above, one uses that the inputs of stage top are correct.
Formally, one uses equation 4.18, which transforms the last equation into:
\[ g_k R(c_I^T) = f\Gamma_w R(c_S^{s_I(w,T)}) \tag{4.20} \]
Using this equation, the claim is transformed into:
\[
\begin{aligned}
\text{no address:}\quad & f\Gamma_w R(c_S^{s_I(w,T)}) \stackrel{!}{=} R_S^i \\
\text{with address:}\quad & f\Gamma_w R(c_S^{s_I(w,T)}) \stackrel{!}{=} R_S^i[x]
\end{aligned}
\tag{4.21}
\]
One easily shows that $f\Gamma_w Rwe(c_S^{s_I(w,T)})$ holds by using that
the hit signal $R_k hit[w]$ is active (definition 4.4). If the read access
does not have an address, lemma 3.13 states:
\[ R_S^{s_I(w,T)+1} = f\Gamma_w R(c_S^{s_I(w,T)}) \tag{4.22} \]
This allows transforming the claim into:
\[ R_S^{s_I(w,T)+1} \stackrel{!}{=} R_S^i \tag{4.23} \]
This is concluded by lemma 4.11.
In case of a read access with address, the last thing is to show that the
address given by x matches the address actually used for the final write
access to R, as given by lemma 3.13:
\[ f\Gamma_w Rwa(c_S^{s_I(w,T)}) \stackrel{!}{=} x = f\gamma_k Rra(c_I^T) \]
The value on the right-hand side is equal to $f\gamma_w Rwa(c_I^T)$ because
the signal $R_k hit[w]$ is active (definition 4.4, page 114). This
transforms the claim into:
\[ f\Gamma_w Rwa(c_S^{s_I(w,T)}) \stackrel{!}{=} f\gamma_w Rwa(c_I^T) \]
Thus, it is sufficient to assert that the inputs of $f_w Rwa$ are correct:
\[ \Gamma_w Rwa(c_S^{s_I(w,T)}) \stackrel{!}{=} \gamma_w Rwa(c_I^T) \]
This is done as described above for the inputs of $f_w R$.
Let $top \neq w$ hold, i.e., the hit is not in the stage that writes R. Since
$top \neq w$, there must be a write alias for R for the stage. Let the
register Q be the alias register (i.e., $Q =_a R$).
In this case, $g_k R$ returns the value written into Q.(top+1) (by definition
of $g_k R$, equation 4.11):
\[ g_k R(c_I^T) = \omega_{top} Q(c_I^T) \tag{4.24} \]
As above, one argues that the inputs of $f_{top} Q$ are correct. Thus, the
output is correct.
\[ \omega_{top} Q(c_I^T) = \Omega_{top} Q(c_S^{s_I(top,T)}) \tag{4.25} \]
This allows transforming the claim into:
\[
\begin{aligned}
\text{no address:}\quad & \Omega_{top} Q(c_S^{s_I(top,T)}) \stackrel{!}{=} R_S^i \\
\text{with address:}\quad & \Omega_{top} Q(c_S^{s_I(top,T)}) \stackrel{!}{=} R_S^i[x]
\end{aligned}
\tag{4.26}
\]
Using lemma 4.11, the claim is transformed into:
\[
\begin{aligned}
\text{no address:}\quad & \Omega_{top} Q(c_S^{s_I(top,T)}) \stackrel{!}{=} R_S^{s_I(top,T)+1} \\
\text{with address:}\quad & \Omega_{top} Q(c_S^{s_I(top,T)}) \stackrel{!}{=} R_S^{s_I(top,T)+1}[x]
\end{aligned}
\tag{4.27}
\]
Since there is a hit in stage top, one concludes that the valid signal
$Q_{top} valid(c_I^T)$ is active. Using this, one easily shows that the
correct valid bit $Q_{top} Valid(c_S^{s_I(top-1,T)-1})$ holds:
\[ Q_{top} valid(c_I^T) = \bigvee_{l=stage(Q)}^{top} c_I^T.f_l Qwe.top \]
One transforms the right-hand side by applying the stage correctness
predicate for implementation registers, stage top-1 and cycle T:
\[
\begin{aligned}
Q_{top} valid(c_I^T) &= \bigvee_{l=stage(Q)}^{top} \Omega_{top-1} f_l Qwe(c_S^{s_I(top-1,T)-1}) \\
&= Q_{top} Valid(c_S^{s_I(top-1,T)-1})
\end{aligned}
\]
Since stage top is full, one can apply scheduling invariant 3.3 in order to
conclude that
\[ s_I(top-1,T) - 1 = s_I(top,T) \]
holds. Thus, $Q_{top} Valid(c_S^{s_I(top,T)})$ holds.
This allows using lemma 4.8 for stage top and configuration $s_I(top,T)$.
If the read access does not have an address, this concludes the claim.
In case of a read access with address, lemma 4.8 states:
\[ \Omega_{top} Q(c_S^{s_I(top,T)}) = R_S^{s_I(top,T)+1}[\Omega_{top-1} f_w Rwa(c_S^{s_I(top,T)})] \tag{4.28} \]
This transforms the claim into:
\[ R_S^{s_I(top,T)+1}[x] \stackrel{!}{=} R_S^{s_I(top,T)+1}[\Omega_{top-1} f_w Rwa(c_S^{s_I(top,T)})] \tag{4.29} \]
It is therefore left to show that the addresses match:
\[ \Omega_{top-1} f_w Rwa(c_S^{s_I(top,T)}) \stackrel{!}{=} x = f\gamma_k Rra(c_I^T) \]
The value on the right-hand side is equal to $f_w Rwa.top_I^T$ because the
signal $R_k^T hit[top]$ is active (definition 4.4). This transforms the claim
into:
\[ \Omega_{top-1} f_w Rwa(c_S^{s_I(top,T)}) \stackrel{!}{=} f_w Rwa.top_I^T \tag{4.30} \]
By using the stage correctness predicate for cycle T and stage top-1,
the right-hand side is transformed into:
\[ \Omega_{top-1} f_w Rwa(c_S^{s_I(top,T)}) \stackrel{!}{=} \Omega_{top-1} f_w Rwa(c_S^{s_I(top-1,T)-1}) \]
As above, one can use invariant 3.3 in order to show that $s_I(top,T)$ is
equal to $s_I(top-1,T) - 1$. This concludes the claim. QED
The following lemma corresponds to lemma 3.17 (page 85) in the sequential
machine:
Lemma 4.14. Let T' be greater than zero. Assuming all stage correctness
predicates for the cycle T'-1, the predicate for stage k holds for cycle T'.
\[ (\forall l : P_l(T'-1)) \implies P_k(T') \]
PROOF. The proof proceeds as the proof of lemma 3.17. However, for
the case i > 0 and $ue^{T-1}$, one uses lemmas 4.12 and 4.13 for operands
that require forwarding. These lemmas require that the data hazard signal is
not active. This is shown easily by definition of the stall and data hazard
signal and using that the update enable signal is active. QED
4.6 Liveness
4.6.1 Introduction
The liveness criterion of the pipelined machine is identical to the liveness
criterion of the prepared sequential machine as presented in chapter 3:
Let $c_S^i$ be any desired configuration of the specification machine. The
implementation machine is said to be alive iff for all stages k there exists
a time $T \in \mathbb{N}_0$ with $s_I(k,T) = i$:
\[ \exists T \in \mathbb{N}_0 : s_I(k,T) = i \]
As in chapter 3, this is shown by arguing that the update enable signal
is alive, as done in lemma 3.22. This lemma has the premise that all stall
signals are finite true. In the prepared sequential machine, only external
stall signals exist and this property was assumed. This is no longer true for
the pipelined machine since internal stall conditions were added (section
4.3).
Thus, a proof that the stall signals are finite true has to be given for the
pipelined machine. According to equation 4.14 (page 119), there are three
possible reasons for an active stall signal, given that the stage is full:
1. one of the external stall signals is active,
2. the data hazard signal is active,
3. the stall signal of the next stage is active.
Consider the following proof strategy: Beginning with the last stage,
which has no next stage, we will argue that the stall signals are finite true.
The external stall signals are still assumed to have this property.
Furthermore, one shows that the data hazard signal is finite true, which can
be done easily. It is now tempting to conclude that the disjunction of finite
true signals is also finite true.
[Figure 4.11: Two alternating, finite true signals A and B. The disjunction
$A \lor B$ is not finite true but constant true.]
However, this is wrong. A finite true signal is guaranteed to eventually
become false. The problem is that there is no guarantee that a signal that
is finite true stays false for more than one cycle once it becomes false. In
particular, one can think of two alternating signals that are both finite true
(figure 4.11). The disjunction never becomes false and therefore cannot be
finite true.
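The counterexample can be replayed on concrete signals. The following Python
sketch is an illustration only: the signals a and b, the helper
becomes_false and the bounded search window are assumptions of this sketch,
and a check over a finite window can only approximate the property "finite
true", which quantifies over all cycles.

```python
def becomes_false(sig, start, window):
    """True iff sig is false in some cycle t with start <= t < start + window."""
    return any(not sig(t) for t in range(start, start + window))


def a(t):
    return t % 2 == 0          # signal A: active in even cycles


def b(t):
    return t % 2 == 1          # signal B: active in odd cycles


def a_or_b(t):
    return a(t) or b(t)        # the disjunction of both signals


starts = range(50)
# Each signal on its own becomes false within two cycles of any start cycle.
print(all(becomes_false(a, T, 2) for T in starts))        # True
print(all(becomes_false(b, T, 2) for T in starts))        # True
# The disjunction never becomes false, however long one waits.
print(any(becomes_false(a_or_b, T, 1000) for T in starts))  # False
```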
“Finite true” therefore is too weak. For the three signals above, one
needs a stronger property such that one can conclude that the disjunction
is finite true. In case of stall conditions, one needs that the signal actually
stays false once it became false until all conditions are false. As soon as all
conditions are false, the update enable signal becomes active and the stage
therefore proceeds calculating.
4.6.2 Extended Liveness Calculus
The property of a signal that it “stays until” a given event (i.e., signal) is
formalized as follows:
Definition 4.5 (Stays Until). Let pred and predu be time predicates and T be
a cycle. The predicate pred is said to stay until predu from cycle T, iff the
following holds: Given an arbitrary cycle $T' \ge T$ such that predu does not
hold for cycles T'' with $T \le T'' < T'$, the predicate pred holds for all
cycles T'' with $T \le T'' \le T'$.
\[
stays\_until(pred, T, predu) :\iff
\forall T' \mid T' \ge T :
(\forall T'' \mid T \le T'' < T' : \overline{predu(T'')})
\implies (\forall T'' \mid T \le T'' \le T' : pred(T''))
\]
[Figure 4.12: Two signals satisfying stays_until(pred, T, predu). A signal
shown as a hatched box means that the value of the signal during this cycle
does not matter.]
This is illustrated in figure 4.12. If a signal is shown as a hatched box
this means that the value of the signal during this cycle does not matter.
Note that it is not required that signal predu ever becomes true after cycle
T. In particular, if predu never becomes true after cycle T, pred is required
to hold for all cycles $T' \ge T$ (one easily shows this using induction).
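The definition can be turned into a small executable check on finite traces.
The following Python sketch assumes that time predicates are modelled as
functions from cycle numbers to booleans and that only cycles up to a given
horizon are inspected; both the function name stays_until and the horizon
parameter are assumptions of this sketch, not part of the PVS theory.

```python
def stays_until(pred, T, predu, horizon):
    """Bounded check of definition 4.5 for the cycles T..horizon."""
    for T1 in range(T, horizon + 1):
        # Premise: predu does not hold in the cycles T <= T'' < T1.
        if all(not predu(t) for t in range(T, T1)):
            # Conclusion: pred holds in the cycles T <= T'' <= T1.
            if not all(pred(t) for t in range(T, T1 + 1)):
                return False
    return True


def predu(t):
    return t == 5              # predu becomes active in cycle 5


print(stays_until(lambda t: t <= 5, 0, predu, 20))   # True
print(stays_until(lambda t: t < 5, 0, predu, 20))    # False
```

The second call fails because the definition also requires pred to hold in
the first cycle in which predu is active: for T' = 5, predu is still off in
the cycles 0 to 4, so pred has to hold in the cycles 0 to 5.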
Lemma 4.15. Let pred and predu be time predicates and T be a cycle. Let T' be
a cycle with $T' \ge T$. Let pred stay until predu after cycle T. If predu is
off during cycles T'' with $T \le T'' < T'$, pred also stays until predu from
cycle T':
\[
stays\_until(pred, T, predu) \land
(\forall T'' \mid T \le T'' < T' : \overline{predu(T'')})
\implies stays\_until(pred, T', predu)
\tag{4.31}
\]
PROOF. By definition 4.5, stays_until(pred, T', predu) is equivalent to:
\[
\forall t' \mid t' \ge T' :
(\forall t'' \mid T' \le t'' < t' : \overline{predu(t'')})
\implies (\forall t'' \mid T' \le t'' \le t' : pred(t''))
\tag{4.32}
\]
By definition of stays_until, stays_until(pred, T, predu) is equivalent to:
\[
\forall t' \mid t' \ge T :
(\forall t'' \mid T \le t'' < t' : \overline{predu(t'')})
\implies (\forall t'' \mid T \le t'' \le t' : pred(t''))
\tag{4.33}
\]
Since $T' \ge T$, one can instantiate formula 4.33 with t' from formula 4.32.
This results in:
\[
(\forall t'' \mid T \le t'' < t' : \overline{predu(t'')})
\implies (\forall t'' \mid T \le t'' \le t' : pred(t''))
\tag{4.34}
\]
Obviously, the implication of equation 4.34 will conclude the claim as
given by the implication of equation 4.32. However, it is left to show that
the premise of the implication of equation 4.34 holds:
\[ \forall t'' \mid T \le t'' < t' : \overline{predu(t'')} \tag{4.35} \]
This is done as follows: if $t'' \ge T'$, one takes the premise in equation
4.32 in order to show $\overline{predu(t'')}$. If $t'' < T'$,
$\overline{predu(t'')}$ holds according to the premise in equation 4.31. QED
Lemma 4.16. Let pred1, pred2, and predu be time predicates. Let T be a cycle.
If pred2 implies pred1 for all cycles T'' with $T'' \ge T$, and pred2 stays
until predu after cycle T, pred1 also stays until predu after cycle T.
\[
(\forall T'' \mid T'' \ge T : pred2(T'') \implies pred1(T'')) \land
stays\_until(pred2, T, predu)
\implies stays\_until(pred1, T, predu)
\]
PROOF. This lemma is shown easily by expanding the definition of stays_until.
Lemma 4.17. Let pred1, pred2, and predu be time predicates. Let T be a cycle.
If both pred1 and pred2 stay until predu after cycle T, the conjunction
$pred1 \land pred2$ also stays until predu after cycle T.
\[
stays\_until(pred1, T, predu) \land stays\_until(pred2, T, predu)
\implies stays\_until(pred1 \land pred2, T, predu)
\]
PROOF. This lemma is shown easily by expanding the definition of stays_until.
An example for the lemma is given in figure 4.13.
[Figure 4.13: Three signals pred1, pred2, and predu satisfying the premise of
lemma 4.17: since stays_until(pred1, T, predu) and
stays_until(pred2, T, predu) hold, also
stays_until(pred1 ∧ pred2, T, predu) holds.]
Definition 4.6 ($\exists^T(pred, predu)$). Let pred and predu be two time
predicates. In analogy to the definition of $\exists^T$ (equation 3.25, page
86), one defines an operator that holds iff the predicate pred eventually
becomes true in a cycle $T' \ge T$ before predu does. Furthermore, it is
required that it stays true until predu becomes true, as defined in
definition 4.5:
\[
\exists^T(pred, predu) :\iff
\exists T' \mid T' \ge T : pred(T') \land
(\forall T'' \mid T \le T'' < T' : \overline{predu(T'')}) \land
stays\_until(pred, T', predu)
\]
This definition is illustrated in figure 4.14.
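As for definition 4.5, the operator can be checked on finite traces. The
following Python sketch again models time predicates as functions from cycle
numbers to booleans and inspects only the cycles up to a horizon. The names
exists_until, stays_until and horizon are assumptions of this sketch.

```python
def stays_until(pred, T, predu, horizon):
    """Bounded check of definition 4.5 for the cycles T..horizon."""
    for T1 in range(T, horizon + 1):
        if all(not predu(t) for t in range(T, T1)):
            if not all(pred(t) for t in range(T, T1 + 1)):
                return False
    return True


def exists_until(pred, predu, T, horizon):
    """Bounded check of definition 4.6: search for a witness cycle T1 >= T."""
    for T1 in range(T, horizon + 1):
        no_predu_before = all(not predu(t) for t in range(T, T1))
        if pred(T1) and no_predu_before and stays_until(pred, T1, predu, horizon):
            return True
    return False


def predu(t):
    return t == 7              # predu becomes active in cycle 7


# pred becomes true in cycle 3 and stays true: cycle 3 is a witness.
print(exists_until(lambda t: t >= 3, predu, 0, 20))       # True
# pred is true in cycles 3 and 4 only: it does not stay until predu.
print(exists_until(lambda t: t in (3, 4), predu, 0, 20))  # False
# pred becomes true only after predu was active.
print(exists_until(lambda t: t >= 8, predu, 0, 20))       # False
```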
Lemma 4.18. One easily shows that for any time predicate predu,
$\exists^T(pred, predu)$ implies that $\exists^T pred$ holds:
\[ \exists^T(pred, predu) \implies \exists^T pred \]
Definition 4.7 (Finite False and Stays Until). A time predicate pred is said
to be finite false and stays until a given predicate predu, iff
$\exists^T(pred, predu)$ holds for all T. In analogy to that, pred is said to
be finite true and stays until predu iff $\overline{pred}$ is finite false
and stays until predu.
[Figure 4.14: Two signals satisfying $\exists^T(pred, predu)$.]
The following lemmas are shown easily using lemma 4.18:
Lemma 4.19. Let pred and predu be time predicates. If pred is finite false
and stays until predu, it is also finite false as defined in definition 3.4
(page 86).
Lemma 4.20. Let pred and predu be time predicates. If pred is finite true and
stays until predu, it is also finite true.
Lemma 4.21. Given two time predicates pred1 and pred2 with
$\exists^T(pred1, predu)$ and $\exists^T(pred2, predu)$, the conjunction
eventually holds after T and before predu, and stays until predu.
\[
\exists^T(pred1, predu) \land \exists^T(pred2, predu)
\implies \exists^T(pred1 \land pred2, predu)
\]
PROOF. By expanding the definition of $\exists^T(pred1 \land pred2, predu)$,
one gets:
\[
\exists T' \mid T' \ge T : pred1(T') \land pred2(T') \land
(\forall T'' \mid T \le T'' < T' : \overline{predu(T'')}) \land
stays\_until(pred1 \land pred2, T', predu)
\tag{4.36}
\]
Since $\exists^T(pred1, predu)$ and $\exists^T(pred2, predu)$ hold, there are
cycles $t'_1 \ge T$ and $t'_2 \ge T$ such that $pred1(t'_1)$ and
$pred2(t'_2)$ hold. Let $t'_1 \ge t'_2$ hold (otherwise, swap pred1 and
pred2; see footnote 3). An example for this situation is given in figure
4.15.
Footnote 3: In PVS, one actually shows the case $t'_1 < t'_2$ by replaying
the proof.
[Figure 4.15: Illustration of the proof of lemma 4.21.]
We will now show that $t'_1$ satisfies equation 4.36, i.e.:
\[
pred1(t'_1) \land pred2(t'_1) \land
(\forall T'' \mid T \le T'' < t'_1 : \overline{predu(T'')}) \land
stays\_until(pred1 \land pred2, t'_1, predu)
\tag{4.37}
\]
This conjunction consists of four parts, which are now shown separately:
1. As described above, $pred1(t'_1)$ holds by definition of $t'_1$.
2. The second part, $pred2(t'_1)$, is shown using $\exists^T(pred2, predu)$:
As described above, $pred2(t'_2)$ holds with $t'_2 \le t'_1$. Furthermore,
pred2 stays active until predu holds, which is after $t'_1$. Thus,
$pred2(t'_1)$ holds.
3. One easily shows $\forall T'' \mid T \le T'' < t'_1 : \overline{predu(T'')}$
by expanding the definition of $\exists^T(pred1, predu)$.
4. Using lemma 4.15 with pred2 and predu and cycles T and $t'_1$, one
concludes:
\[ stays\_until(pred2, t'_1, predu) \tag{4.38} \]
This allows using lemma 4.17 for pred1 and pred2 and cycle $t'_1$:
\[ stays\_until(pred1 \land pred2, t'_1, predu) \tag{4.39} \]
This concludes the claim.
Using lemma 4.21, one easily shows:
Lemma 4.22. Given two time predicates pred1 and pred2 that are both finite
false and stay until predu, the conjunction $pred1 \land pred2$ is also
finite false and stays until predu.
Lemma 4.23. Given two time predicates pred1 and pred2 that are both finite
true and stay until predu, the disjunction $pred1 \lor pred2$ is also finite
true and stays until predu.
PROOF. Lemma 4.23 is shown easily using lemma 4.22 and the fact that
\[ pred1 \lor pred2 = \overline{\overline{pred1} \land \overline{pred2}} \tag{4.40} \]
holds. QED
The following two lemmas obviously hold (PVS shows them automatically):
Lemma 4.24. The predicate always (equation 3.22, page 86) is finite false and
stays until any predicate.
Lemma 4.25. The predicate never (equation 3.23, page 86) is finite true and
stays until any predicate.
The following lemma is shown easily (PVS shows it automatically):
Lemma 4.26. Let pred1 and pred2 be two time predicates. If pred1 holds
eventually after cycle T, the disjunction $pred1 \lor pred2$ also holds
eventually after cycle T:
\[ \exists^T pred1 \implies \exists^T (pred1 \lor pred2) \]
[Figure 4.16: Illustration of lemma 4.29.]
Using lemma 4.26, one easily concludes:
Lemma 4.27. Let pred1 and pred2 be two time predicates. If pred1 is finite
false, the disjunction $pred1 \lor pred2$ is also finite false.
Using lemma 4.27 and the definition of finite true, one easily concludes:
Lemma 4.28. Let pred1 and pred2 be two time predicates. If pred1 is finite
true, the conjunction $pred1 \land pred2$ is also finite true.
Assume one has the disjunction of two signals. One signal is finite true
and stays until predu, the other one is just finite true but implies
$\overline{predu}$. In this case one can conclude that the disjunction is
finite true. This is illustrated in figure 4.16.
Lemma 4.29. Let pred1, pred2, and predu be time predicates. If pred1 is
finite true and pred2 is finite true and stays until predu, and pred1 implies
$\overline{predu}$, the disjunction $pred1 \lor pred2$ is finite true.
PROOF. The claim is equivalent to:
\[ \forall T : \exists^T\, \overline{pred1 \lor pred2} \tag{4.41} \]
By definition of $\exists^T$, this is equivalent to:
\[ \forall T\ \exists T' \ge T : \overline{pred1(T') \lor pred2(T')} \tag{4.42} \]
Obviously, this is equivalent to:
\[ \forall T\ \exists T' \ge T : (\overline{pred1(T')} \land \overline{pred2(T')}) \tag{4.43} \]
According to the premise of the lemma, there is a cycle $T'_1 \ge T$ such
that $\overline{pred2}$ holds and stays until predu. Furthermore, there is
also a cycle $T'_2 \ge T'_1$ such that $\overline{pred1}$ holds. Let $T'_2$
be the smallest such cycle.
We will now show that cycle $T'_2$ satisfies the claim (equation 4.43), i.e.,
it is left to show that $\overline{pred2(T'_2)}$ holds. This holds since
pred2 is finite true and stays until predu. The signal predu cannot have been
active yet, since pred1 implies $\overline{predu}$ and $T'_2$ is the smallest
cycle after $T'_1$ such that $\overline{pred1}$ holds. QED
4.6.3 Liveness Proof
In order to prove the liveness of the machine, we have to show that the stall
signal of stage k is finite true. Assuming stage k is full, the stall signal
is a disjunction of the external stall signals $ext_k$ and the internal stall
signals $int_k$ (equation 4.14). We will need to argue that the internal
stall signal $int_k$ is finite true and stays until $ue_k$.
This will be done by induction. The following lemma will be used in
order to do the induction step. It states that if one stalls an arbitrary
stage for a time that is long enough, eventually all stages below become
empty, i.e., the pipeline drains. Let the time predicate $below\_empty_k(T)$
hold iff all stages below stage k are empty during cycle T:
\[ below\_empty_k(T) := \forall j \mid k < j < n : \overline{full_j^T} \tag{4.44} \]
Lemma 4.30. Let k be a stage number, i.e., $k \in \{0, \dots, n-1\}$. Let the
stall signals of all stages below stage k be finite true and let T be a
cycle. This implies that there is a cycle $T' \ge T$ such that if the update
enable signal is off from cycle T to T'-1, the full bits of the stages below
stage k are off during cycle T'.
\[
\exists T' \mid T' \ge T :
(\forall T'' \mid T \le T'' < T' : \overline{ue_k^{T''}})
\implies below\_empty_k(T')
\]
PROOF. As before, this is shown by induction on k beginning with n-1
and proceeding from k+1 to k. For k = n-1, there is nothing to show
since there are no stages below. Concluding from k+1 to k is done as
follows:
Since the stall signals of stages below stage k are assumed to be finite
true, stall signal $stall_{k+1}$ is also finite true. Thus, there is a cycle
$T'_1 \ge T$ such that the stall signal $stall_{k+1}^{T'_1}$ is not active.
Let $T'_1$ be the smallest such cycle. According to the premise of the lemma,
we have $\overline{ue_k^{T'_1}}$. According to lemma 4.1, this implies
\[ \overline{full_{k+1}^{T'_1+1}}. \]
Thus, stage k+1 is empty during cycle $T'_1 + 1$.
We now apply the induction premise in order to show that the stages below
stage k+1 eventually also become empty. According to the induction premise,
there is a cycle $T'_2 \ge T'_1 + 1$ such that if the update enable signal
$ue_{k+1}$ is off from cycle $T'_1 + 1$ to $T'_2 - 1$, all full signals below
stage k+1 are off.
We will now show that during cycle $T'_2$ all stages below stage k are
empty. The first step is to show that stage k+1 actually stays empty until
cycle $T'_2$:
\[ \forall T'' \mid T'_1 + 1 \le T'' \le T'_2 : \overline{full_{k+1}^{T''}} \tag{4.45} \]
This is done easily by induction on T''. For $T'' = T'_1 + 1$, we already
showed the claim above. For cycle T''+1, one uses the fact that the full
signal is not active in cycle T''. Thus, the stall signal cannot be active.
The update enable signal is not active by the premise of the lemma. Thus, the
claim can be concluded using lemma 4.1.
It is left to show that the update enable signal of stage k+1 is not
active from cycle $T'_1 + 1$ to cycle $T'_2 - 1$. This is easily argued since
the stage is not full. This concludes the claim. QED
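The draining argument can be observed in a small simulation. The following
Python sketch is an illustration under several assumptions: the full bits
are updated by $full_j^{T+1} = ue_{j-1}^T \lor stall_j^T$ and the update
enable signal is $ue_j = full_j \land \overline{stall_j}$, which is how
lemma 4.1 and the definition of ue are used in the proofs of this section;
the full bit of stage 0 is simply kept constant, since its definition
(section 4.1.1) is not repeated here; stage k is held by a permanently
active external stall signal; no data hazards occur. All function names are
hypothetical.

```python
def stall_signals(full, ext, dhaz):
    """Stall signals according to equations 4.13 and 4.14."""
    n = len(full)
    stall = [False] * n
    for k in range(n - 1, -1, -1):
        int_k = dhaz[k] if k == n - 1 else (dhaz[k] or stall[k + 1])
        stall[k] = full[k] and (ext[k] or int_k)
    return stall


def step(full, ext, dhaz):
    """One clock cycle: compute the full bits of the next cycle."""
    n = len(full)
    stall = stall_signals(full, ext, dhaz)
    ue = [full[j] and not stall[j] for j in range(n)]
    # Assumption: full_0 is kept constant; other stages follow lemma 4.1.
    return [full[0]] + [ue[j - 1] or stall[j] for j in range(1, n)]


def drain_time(full, k, max_cycles=100):
    """Cycles until all stages below stage k are empty while stage k is held."""
    n = len(full)
    ext = [j == k for j in range(n)]   # hold stage k with an external stall
    dhaz = [False] * n                 # assumption: no data hazards
    for t in range(max_cycles + 1):
        if not any(full[k + 1:]):
            return t
        full = step(full, ext, dhaz)
    return None


print(drain_time([True] * 5, k=1))   # 3
```

With five full stages and stage 1 held, the stages 2, 3 and 4 become empty
one after the other, so that the pipeline below stage 1 is drained after
three cycles in this model.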
Lemma 4.31. Let k be a stage number but not the last stage. If all stages
below stage k are empty during cycle T, this stays so until the output
registers of stage k are updated.
\[ below\_empty_k(T) \implies stays\_until(below\_empty_k, T, ue_k) \]
PROOF By definition of stays until (definition 4.5, page 130), we have
to show:
8T 0 j T 0  T : (8T 00 j T  T 00 < T 0 : uek(T 00)) (4.46)
=) (8T 00 j T  T 00  T 0 : below emptyk(T 00)) (4.47)
This is done by induction on $T'$. For $T' = T$, the claim holds according
to the premise of the lemma. The claim for cycle $T'+1$ is concluded as
follows: The claim is:
$$\forall T'' \mid T \le T'' \le (T'+1) : \mathit{below\_empty}_k(T'') \qquad (4.48)$$
For $T \le T'' \le T'$, this holds according to the induction premise. Thus,
it is left to show this for $T'' = T'+1$. By definition of $\mathit{below\_empty}_k$, the
claim is equal to:
$$\forall j \mid k < j < n : \overline{full_j^{T'+1}} \qquad (4.49)$$
Case one: If $j$ is equal to $k+1$, we show $\overline{full_{k+1}^{T'+1}}$ as follows: according
to lemma 4.1, a stage becomes full if it was either stalled or if the output
registers of the previous stage were updated. The update enable signal of
the previous stage, which is stage $k$, is not active according to the premise
of the lemma.
$$full_{k+1}^{T'+1} = ue_k^{T'} \vee stall_{k+1}^{T'} \quad \text{by lemma 4.1}$$
$$= stall_{k+1}^{T'} \quad \text{because of } \overline{ue_k^{T'}}$$
The stall signal $stall_{k+1}^{T'}$ cannot be active since stage $k+1$ is not full during
cycle $T'$ according to the induction premise. Thus, $full_{k+1}^{T'+1}$ is not active.
Case two: If $j$ is not equal to $k+1$, we show $\overline{full_j^{T'+1}}$ as follows:
$$full_j^{T'+1} = ue_{j-1}^{T'} \vee stall_j^{T'} \quad \text{by lemma 4.1}$$
$$= (full_{j-1}^{T'} \wedge \overline{stall_{j-1}^{T'}}) \vee stall_j^{T'} \quad \text{because of def. of } ue$$
The stall signal $stall_j^{T'}$ cannot be active since stage $j$ is not full during
cycle $T'$ according to the induction premise. The full signal $full_{j-1}^{T'}$ is also
not active because of the induction premise. This concludes the claim. QED
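The behaviour stated in lemma 4.31 can be observed on a small simulation. The following is a minimal sketch and not the PVS model of this thesis: it assumes the relations used in this section ($full_{k+1}^{T+1} = ue_k^T \vee stall_{k+1}^T$, $ue_k = full_k \wedge \overline{stall_k}$, $stall_k = full_k \wedge (ext_k \vee int_k)$ with $int_k = dhaz_k \vee stall_{k+1}$), a hypothetical five stage pipeline without data hazards, and that stage 0 is always full. The names `step`, `below_empty`, and `run` are illustrative only.

```python
def step(full, ext, dhaz):
    """One cycle of the (assumed) stall engine without speculation.

    full, ext, dhaz are lists of booleans indexed by stage.
    Returns (next_full, ue).
    """
    n = len(full)
    stall = [False] * n
    # stall_k depends on stall_{k+1}, so start with the last stage.
    for k in range(n - 1, -1, -1):
        internal = dhaz[k] or (k + 1 < n and stall[k + 1])
        stall[k] = full[k] and (ext[k] or internal)
    ue = [full[k] and not stall[k] for k in range(n)]
    # Lemma 4.1: a stage is full iff it was stalled or its predecessor updated.
    next_full = [True] + [ue[k - 1] or stall[k] for k in range(1, n)]
    return next_full, ue


def below_empty(full, k):
    """True if all stages j with k < j < n are empty."""
    return not any(full[k + 1:])


def run(n, cycles, ext_at):
    """Simulate `cycles` cycles; ext_at(T) yields the external stall inputs."""
    full = [True] + [False] * (n - 1)
    fulls, ues = [full], []
    for T in range(cycles):
        full, ue = step(full, ext_at(T), [False] * n)
        fulls.append(full)
        ues.append(ue)
    return fulls, ues


# Hypothetical scenario: stage 0 is stalled externally during cycles 0 to 2.
fulls, ues = run(5, 5, lambda T: [T < 3, False, False, False, False])
first_ue0 = next(T for T, ue in enumerate(ues) if ue[0])
```

In this scenario the stages below stage 0 remain empty up to and including the first cycle in which $ue_0$ is active, which is the situation lemma 4.31 describes.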
Lemma 4.32 Assuming that the external stall signals are finite true and stay until $ue_k$,
the disjunction of the external stall signals $ext_k$ is finite true.
PROOF Using lemma 4.23 one concludes that the disjunction is finite
true and stays until $ue_k$. Using lemma 4.20, one concludes that $ext_k$ is
finite true.
Lemma 4.33 Let $k$ be a stage number but not the last stage. If the stages below
stage $k$ are empty, the internal stall signal $int_k$ is off.
$$\mathit{below\_empty}_k(T) \implies \overline{int_k^T}$$
PROOF By definition, $int_k$ is:
$$int_k^T = dhaz_k^T \vee stall_{k+1}^T$$
If the stages below stage $k$ are empty, $dhaz_k^T$ cannot be active by definition
(empty stages never generate a data hazard). If the stages below stage
$k$ are empty, so is stage $k+1$. Thus, $stall_{k+1}^T$ cannot be active according to
the stall signal convention (convention 4.2). QED
In the following, we will conclude that $stall_k$ is finite true from the same
claim for $stall_l$ with $l > k$. The signal $stall_k$ includes the internal signal as
defined in equation 4.13. Thus, one has to show that the internal stall signal
eventually gets deactivated and stays so until the update enable signal is
activated. This is done as follows:
The internal stall signal of stage $k$ is deactivated if the stages
below stage $k$ are empty, at the latest.
The term “at the latest”, as used in the last sentence, will be formalized
by the next lemma. In the last sentences, three time predicates are used:
1. “The internal stall signal ... is deactivated” will be referred to by
time predicate $pred_1$,
2. “the update enable signal is activated” will be referred to by time
predicate $pred_u$,
3. “the stages below stage $k$ are empty” will be referred to by time
predicate $pred_2$.
[Figure 4.17: timing diagram of the signals $pred_2$, $pred_1$, and $pred_u$ over time $t$, with the cycles $T$ and $T'$ marked.]
Figure 4.17 Illustration of lemma 4.34: Since $pred_2$ implies $pred_1$ and $pred_2$
becomes active, $pred_1$ also becomes active.
According to lemma 4.33, $pred_2$ implies $pred_1$ (empty stages
never generate a hazard or stall signal). Furthermore, one easily shows that
$pred_u$ also implies $pred_1$ (the update enable signal is not active as long as
the stage is stalled). The notion “at the latest” will now be formalized
as follows: $pred_1$ holds if $pred_2$ holds “at the latest” means that assuming
$pred_1$ does not hold for a time that is long enough, $pred_2$ holds eventually.
Now there are two cases:
a) The predicate $pred_2$ becomes true. Since $pred_2$ implies $pred_1$, one
can conclude that $pred_1$ will hold eventually. This case is illustrated
in figure 4.17.
b) The predicate $pred_1$ becomes true before $pred_2$. However, this does
not imply that $pred_2$ will hold eventually. This case is illustrated in
figure 4.18.
The following lemma formalizes this claim:
[Figure 4.18: timing diagram of the signals $pred_2$, $pred_1$, and $pred_u$ over time $t$, with the cycles $T$ and $T'$ marked.]
Figure 4.18 Illustration of lemma 4.34: $pred_1$ becomes active before $pred_2$.

Lemma 4.34 Let $T$ be a cycle, $pred_1$, $pred_2$, and $pred_u$ time predicates. Furthermore,
let the following conditions hold:
1. Let both $pred_u$ and $pred_2$ imply $pred_1$ after cycle $T$.
$$\forall T'' \mid T'' \ge T : pred_u(T'') \implies pred_1(T'')$$
$$\forall T'' \mid T'' \ge T : pred_2(T'') \implies pred_1(T'')$$
2. Let there be a cycle $T'_1 \ge T$ such that if $\overline{pred_u}$ holds for all cycles $T''$
with $T \le T'' < T'_1$ then $pred_2(T'_1)$ holds.
$$\exists T'_1 \mid T'_1 \ge T : (\forall T'' \mid T \le T'' < T'_1 : \overline{pred_u(T'')}) \implies pred_2(T'_1)$$
3. If $pred_2$ holds in any given cycle $T' \ge T$, it is supposed to stay until
$pred_u$ after $T'$.
$$\forall T' \mid T' \ge T : pred_2(T') \implies \mathit{stays\_until}(pred_2, T', pred_u)$$
The claim is that this implies $\exists_T(pred_1, pred_u)$.
PROOF By expanding the definition of $\exists_T(pred_1, pred_u)$, one gets:
$$\exists T' \mid T' \ge T : pred_1(T') \;\wedge$$
$$(\forall T'' \mid T \le T'' < T' : \overline{pred_u(T'')}) \;\wedge$$
$$\mathit{stays\_until}(pred_1, T', pred_u)$$
Let 9T predu hold. In this case, there is a cycle T 0  T such that predu
is active. Let this be the smallest cycle with this property, which exists
according to lemma 3.19. We will now show that this cycle satisfies the
claim. According to the first condition above, pred1(T 0) holds. Since T 0
is the smallest cycle such that predu is active,
8T 00jT  T 00 < T 0 : predu(T 00)
obviously holds. One easily shows
stays until(pred1;T 0; predu)
143
Chapter 4
PIPELINED
MACHINES
by using the fact that predu(T 0) holds. If 9T predu holds, this concludes
the claim.
Assume 9T predu does not hold. In this case, predu never holds in any
cycle T 0  T . This allows using the second condition above in order to
conclude that there is a cycle T 01  T such that pred2(T 01) holds. We will
now show that this cycle satisfies the claim.
According to the first condition above, pred1(T 01) holds. As 9T predu
does not hold, one can conclude that
8T 00jT  T 00 < T 01 : predu(T 00)
holds. Using the third condition above, one easily concludes that
stays until(pred2;T 01 ; predu)
holds. Using the first condition and lemma 4.16, one shows that
stays until(pred1;T 01 ; predu)
holds. This concludes the claim.QED
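The expanded claim of lemma 4.34 can be tried out on finite traces. The following is a minimal sketch under stated assumptions: time predicates are modelled as Python lists of booleans, all quantifiers are bounded by the trace length (the lemma itself quantifies over all cycles), and the function names `stays_until` and `finite_true_until` are chosen for illustration only.

```python
def stays_until(p, T, u):
    """Bounded stays_until: whenever u is off on [T, T'), p holds on [T, T']."""
    for Tp in range(T, len(p)):
        if all(not u[t] for t in range(T, Tp)):
            if not all(p[t] for t in range(T, Tp + 1)):
                return False
    return True


def finite_true_until(p, u, T):
    """Bounded form of the expanded claim in the proof of lemma 4.34."""
    for Tp in range(T, len(p)):
        if (p[Tp]
                and all(not u[t] for t in range(T, Tp))
                and stays_until(p, Tp, u)):
            return True
    return False


F, Y = False, True
# Case a) of the text: pred2 becomes active and implies pred1.
pred2 = [F, F, F, Y, Y, Y, F, F]
predu = [F, F, F, F, F, Y, F, F]
pred1 = [F, F, F, Y, Y, Y, F, F]
holds = finite_true_until(pred1, predu, 0)

# A trace in which pred1 drops again while predu is never active
# does not satisfy the claim.
bad_pred1 = [F, F, Y, F, F, F, F, F]
never_u = [F] * 8
violated = not finite_true_until(bad_pred1, never_u, 0)
```

The second trace shows why the third condition of the lemma is needed: without it, $pred_1$ could become active and be withdrawn again before $pred_u$ holds.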
Lemma 4.35 Assuming that the external stall signals are finite true and stay until $ue_k$,
the stall signal is finite true.
PROOF The proof proceeds by induction on $k$. We begin with the last stage. The
induction step is done by concluding the claim for stage $k$ from the claim
for stages $l > k$.
For stage $k = n-1$ (i.e., for the last stage), the claim is shown as follows:
in case of the last stage, no forwarding is done, i.e., $dhaz_{n-1}$ is always false.
Thus, the stall signal of the last stage is:
$$stall_{n-1}^T = full_{n-1}^T \wedge ext_{n-1}^T \qquad (4.50)$$
According to lemma 4.32, this is finite true.
The induction claim for stage $k < (n-1)$ is shown as follows: The stall
signal of stage $k$ is:
$$stall_k^T = full_k^T \wedge (ext_k^T \vee int_k^T) \qquad (4.51)$$
Using lemma 4.28, one concludes that it is sufficient to show that
$$ext_k^T \vee int_k^T \qquad (4.52)$$
is finite true. This is concluded by lemma 4.29 using the predicates $ext_k$,
$int_k$, and $ue_k$. In order to apply lemma 4.29, one has to show that the
premises of the lemma hold. These premises are:
 the predicate $ext_k$ must be finite true,
 the predicate $int_k$ must be finite true and stay until $ue_k$,
 the predicate $ext_k$ must imply $\overline{ue_k}$.
The first premise holds according to lemma 4.32. The third premise
holds according to the definition of $ue_k$ and $stall_k$. It is left to show that the
second premise holds, i.e., that $int_k$ is finite true and stays until $ue_k$. This is
done by using lemma 4.34 as described above.
One now easily concludes the liveness criterion for the pipelined machine:
Theorem 4.36 Assuming that the external stall signals of stage $k$ are finite true and stay
until $ue_k$ for all stages $k$, the pipelined machine is alive.
PROOF Using lemma 4.35, one concludes that the stall signals are finite true. As
in theorem 3.24 (page 88), one concludes that the machine is alive.
4.7 Performance
The machine presented in this chapter almost matches the pipelined DLX
presented in [MP00]. One major difference is the stall engine. The stall
engine in [MP00] uses only two different clock enable signals. The first
clock enable signal controls stages 0 and 1 and the second clock enable
signal controls the rest of the pipeline. Thus, stages 0 and 1 are always
clocked simultaneously. The same holds for stages 2, 3, and 4.
In contrast to that, the stall engine used in this thesis supports stalling
all stages independently. This improves performance. Consider the following
example in a five stage integer DLX: The first instruction is a load
instruction (LW). Let the destination register of this instruction be R1. The
second instruction is an ALU instruction that calculates the disjunction of
a register value and an immediate constant (ORI). Let register R1 be the
source register. In stage ID, the machine is supposed to read the operand
register. However, this register is not yet available in this stage because the
load has not yet completed. Thus, in both machines a pipeline bubble is
inserted (figure 4.19, cycle 4).

Cycle   1    2    3    4      5      ...  10     11     ...
IF      LW   ORI  ADD  ADD    ADD         SUB    XOR
ID           LW   ORI  ORI    ORI         ADD    SUB
EX                LW   Bubb.  Bubb.  ...  ORI    ADD
MEM                    LW     LW          Bubb.  ORI
WB                                        LW     Bubb.
Figure 4.19 Scheduling in [MP00]: The cache miss in the MEM stage stalls the
pipeline completely.

Cycle   1    2    3    4      5      ...  10     11     ...
IF      LW   ORI  ADD  ADD    SUB         XOR    SW
ID           LW   ORI  ORI    ADD         SUB    XOR
EX                LW   Bubb.  ORI    ...  ADD    SUB
MEM                    LW     LW          ORI    ADD
WB                                        LW     ORI
Figure 4.20 Scheduling in this thesis: the bubble introduced because of the data
hazard is removed. The execution differs from [MP00] beginning with cycle 5.
Assume that the load instruction causes a data cache miss in stage MEM.
The machine in [MP00] stalls the execution completely. In contrast to that,
the machine presented in this thesis keeps stages 0 to 2 running for one
more cycle by removing the pipeline bubble in stage 2 (figure 4.20, cycle
5). Assume that the data word required for the load instruction is available
by cycle 10 in both machines. In the machine presented in [MP00], the
bubble proceeds until it reaches the end of the pipeline.
In order to quantify the performance impact of the new stall engine, we
performed simulations using the SPEC92 benchmarks as a workload. In
case of an integer-only workload, the new stall engine speeds up execution
on the five stage DLX pipeline by approximately 1.1%. The speedup
increases as more long latency instructions, in particular floating point
instructions, are involved. Appendix C gives more details on the simulation
environment and the results.
4.8 Literature
The concept of the transformation of a prepared sequential machine into a
pipelined machine is taken from [MPK00, MP00]. In addition to that, the
design of the pipelined DLX used as example is taken from [MP00].
Flynn’s classic textbook [Fly95] on pipelined processors states the
following on interlock hardware:
“As any pipelined processor designer knows, a great deal of
engineering effort is required to efficiently realize a fully
functional set of interlocks.”
However, to the best of our knowledge, most of the literature, including
[Fly95], skips over the details of implementing forwarding and interlock
hardware. An exception is [MP00], which presents the interlock and
forwarding logic at gate level. The stalling mechanisms described in the
literature including those in [MP00] usually assume that a pipeline bubble
floats through the complete pipeline [Fly95, HP96]. In contrast to that, the
stall engine presented in this thesis supports removal of pipeline bubbles,
which speeds up the execution.
In [LO96], Levitt and Olukotun verify a five-stage DLX pipeline by
transforming it back into a sequential machine by removing stalling and
rollback logic. Liveness is not argued.
In [Hos00], Hosabettu verifies a simple five stage DLX that is not
synthesizable. It has a trivial stalling logic. Stalls caused by slow memory
are not covered. The verification is done using the completion function
approach and PVS. Liveness is not argued.
Further literature on the verification of pipelined machines includes [BM96],
which provides a manual proof of a DLX pipeline. Burch and Dill [BD94]
verify a very simple pipeline. Henzinger et al. [HQR98] use refinement
mappings in order to model-check a RISC pipeline. Liveness is not argued.
Chapter 5
Speculative Execution
5.1 Introduction
Speculative execution is a technique to avoid stalling the pipeline
because of data dependencies in situations that do not permit forwarding.
Thus, instead of stalling, the calculation is continued with a value that
is guessed. As soon as the correct value is available, the correct value and
the guessed value are compared. If both are equal, the calculations made
with the guessed value are also correct.
If the guessed value and the correct value are different, all calculations
made with the guessed value are usually false. This is called misspeculation.
In this case, the calculation has to be restarted at the stage in which the
guess is made. This process is called rollback (in the literature, the
term squashing is often used [LO96]). It includes that all changes made
to the state of the machine based on false data have to be reverted. The
extra cycles required for the rollback and the wrong calculation are called
misspeculation penalty.
In this chapter, we will describe a generic method that allows speculating
on arbitrary values. The method includes automatic generation of the
circuits necessary to detect a misspeculation and to do the rollback in case
of a misspeculation. We will then use the method in order to implement
branch prediction and precise interrupts.
[Figure 5.1: pipeline diagram showing the registers R.1 to R.5 of the stages $k = 0$ to $k = 4$ for the cycles $T = 0$ to $T = 5$, annotated with the positions of the instructions $I_0$ to $I_3$.]
Figure 5.1 Execution of the instructions $I_0$ to $I_3$ in a pipelined machine with
speculation. Let the PC of $I_1$ be misspeculated. This is detected in stage 2 during
cycle $T = 3$, as illustrated by the flash symbol. Stages that are full are hatched.
Example Consider a pipelined machine with five stages. Let us guess
(i.e., speculate on) the correct value of the memory address used for the
instruction fetch (denoted by PC) in stage 0. Assume that the correct value
is available in stage $k = 1$ (decode).
Figure 5.1 gives an example of what can happen in such a machine: let the
mechanism guess the value of PC of the instruction $I_0$ correctly but not
of instruction $I_1$. The machine runs as usual until cycle $T = 2$. In cycle
$T = 2$, instruction $I_1$ is in stage 1 and the misspeculation is detected. Thus,
instruction $I_1$ has to be restarted completely. Assume that this takes one
cycle.
In cycle $T = 3$, instruction $I_1$ therefore is in stage 0 again. The calculation
restarts using the correct value of PC that is now known. The
instruction $I_2$ is completely evicted from the pipeline. Note, however, that
instruction $I_0$ proceeds (and terminates) as before. This is justified by the
fact that instruction $I_0$ does not depend on any data that was misspeculated.
Table 5.1 shows the schedule of this example.
However, the instruction $I_1$ might have made changes to the registers in
$out(0)$. Instruction $I_1$ usually relies on the original values, i.e., the values
written by $I_0$. Thus, one has to ensure that instruction $I_1$ calculates its
inputs using the values written by $I_0$ and not $I_1$. We will now describe how
such a mechanism is implemented.

              T = 0  T = 1  T = 2  T = 3  T = 4  T = 5
$s_I(0,T)$      0      1      2      1      2      3
$s_I(1,T)$      0      0      1      1      1      2
$s_I(2,T)$      0      0      0      1      1      1
$s_I(3,T)$      0      0      0      0      1      1
$s_I(4,T)$      0      0      0      0      0      1
Table 5.1 The values of $s_I$ in a five stage pipelined machine with speculation
5.2 Stall Engine with Speculation
In this section, we will describe a simple generic speculation mechanism
that allows speculating on values of arbitrary implementation registers.
The first step is to modify the stall engine such that we can evict instructions
from the pipeline in case of misspeculation. For this purpose, we
introduce signals $rollback_k$ with $k \in \{0, \ldots, n-1\}$. The signal $rollback_k$
is to be activated if misspeculation is detected in stage $k$. We will later on
describe how we detect misspeculation.
Using these signals, a set of signals $rollback'_k$ is defined. The signal
$rollback'_k$ is active if the instruction in stage $k$ has to be squashed because
of misspeculation. Assume a signal $rollback_k$ is active. In this case, one
has to evict all instructions in the stages 0 to $k$. Thus, $rollback'_k$ is active if
a rollback signal of stage $k$ or of any later stage is active:
$$rollback'_k = \bigvee_{i=k}^{n-1} rollback_i \qquad (5.1)$$
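Equation 5.1 is a suffix disjunction over the stage index. The following minimal sketch computes it sequentially in software; it is not the parallel prefix circuit used in hardware, and the function name `rollback_prime` is an assumption made for illustration.

```python
def rollback_prime(rollback):
    """Compute rollback'_k = OR of rollback_i for i = k .. n-1 (equation 5.1).

    `rollback` is a list of booleans indexed by stage.
    """
    n = len(rollback)
    result = [False] * n
    active = False
    # Walk from the last stage to stage 0 and accumulate the disjunction.
    for k in range(n - 1, -1, -1):
        active = active or rollback[k]
        result[k] = active
    return result


# Misspeculation detected in stage 2 of a five stage pipeline:
signals = rollback_prime([False, False, True, False, False])
```

With a rollback signalled in stage 2, the stages 0 to 2 are marked for eviction while stages 3 and 4 are unaffected.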
One easily speeds up this computation using the parallel prefix circuit
as described in section 2.2.4. Using the signals $rollback'_k$, we make the
following changes to the stall engine:
 The update enable signal of a stage $k$ is deactivated if the rollback
signal is active. Let $ue'_k$ denote the old update enable signal as used
in the previous chapters. Let $ue_k$ denote the new update enable signal.
The new update enable signal is:
$$ue_k := ue'_k \wedge \overline{rollback'_k} \qquad (5.2)$$
 The transition function for the full bits is changed as follows: Let
$\delta'.full.k$ denote the old transition function and let $\delta.full.k$ denote
the new one. The new transition function for $k \in \{1, \ldots, n-1\}$ is:
$$\delta.full.k := \delta'.full.k \wedge \overline{rollback'_k} \qquad (5.3)$$
The following simple lemmas are concluded from the new definition of
the signals and the new transition functions:
Lemma 5.1 A stage is full iff it was updated or stalled in the previous cycle and if there
was no rollback:
$$\forall k \ge 1 : full_k^{T+1} = (ue_{k-1}^T \vee stall_k^T) \wedge \overline{{rollback'}_k^T}$$
The signal $full_0$ is always active:
$$full_0^T = 1$$
All other signals $full_k$ are not active during cycle 0:
$$\forall k \ge 1 : full_k^0 = 0$$
This lemma is a counterpart of lemma 4.1 of the pipelined machine without
speculation.
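Equation 5.2 and lemma 5.1 together describe one clock cycle of the modified stall engine. The following is a minimal sketch of that step, assuming that the stall signals and the signals $rollback'_k$ are given as inputs and that the old update enable signal is $ue'_k = full_k \wedge \overline{stall_k}$ as in chapter 4. The function name `step_with_rollback` and the parameter names are hypothetical.

```python
def step_with_rollback(full, stall, rollback_p):
    """One cycle of the stall engine with speculation.

    full, stall, rollback_p (the signals rollback'_k) are lists of booleans
    indexed by stage. Returns (next_full, ue).
    """
    n = len(full)
    # Old update enable signal ue'_k, assumed as in chapter 4.
    ue_old = [full[k] and not stall[k] for k in range(n)]
    # Equation 5.2: no update while the stage is being rolled back.
    ue = [ue_old[k] and not rollback_p[k] for k in range(n)]
    # Lemma 5.1: stage 0 is always full; the other full bits are cleared
    # by a rollback.
    next_full = [True] + [
        (ue[k - 1] or stall[k]) and not rollback_p[k] for k in range(1, n)
    ]
    return next_full, ue


full = [True, True, True, True, False]
no_stall = [False] * 5
rollback_p = [True, True, True, False, False]
next_full, ue = step_with_rollback(full, no_stall, rollback_p)
```

In this example the instructions in stages 0 to 2 are evicted, while the instruction in stage 3 moves on to stage 4.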
Lemma 5.2 If a stage is full and is updated, the next stage is updated, too.
$$\forall k \ge 1 : full_k^T \wedge ue_{k-1}^T \implies ue_k^T$$
This lemma is a counterpart of lemma 4.3 of the pipelined machine without
speculation.
PROOF According to the definition of the update enable signals, we have to show
1) $full_k^T$, 2) $\overline{stall_k^T}$, and 3) $\overline{{rollback'}_k^T}$.
According to the premise of the lemma, $full_k^T$ holds. We show $\overline{stall_k^T}$ as
in the proof of lemma 4.3.
We show $\overline{{rollback'}_k^T}$ as follows: assume ${rollback'}_k^T$ holds. In this case,
${rollback'}_{k-1}^T$ also holds. Thus, $ue_{k-1}^T$ cannot be active. This is a contradiction
to the premise of the lemma.
Lemma 5.3 If a stage is full and if its output registers are not updated and if no rollback
is made, the full bit is preserved.
$$\forall k \ge 1 : full_k^T \wedge \overline{ue_k^T} \wedge \overline{{rollback'}_k^T} \implies full_k^{T+1}$$
PROOF By the definition of the update enable signals, one concludes that $stall_k^T$
holds. The claim is concluded using lemma 5.1.
Lemma 5.4 If a configuration in a stage moves into the next stage (i.e., the output
registers of a stage are updated), and if the next configuration is not clocked
into the stage, the full bit is cleared:
$$\forall k \ge 1 : full_k^T \wedge ue_k^T \wedge \overline{ue_{k-1}^T} \implies \overline{full_k^{T+1}}$$
PROOF By the definition of the update enable signals, one concludes $\overline{stall_k^T}$ and
$\overline{{rollback'}_k^T}$. The claim is concluded by lemma 5.1.
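Spelled out, the last step of this proof is a substitution into lemma 5.1. The following worked equation is a sketch of that step and uses only the signals named in the proof:

$$full_k^{T+1} = (ue_{k-1}^T \vee stall_k^T) \wedge \overline{{rollback'}_k^T} = (0 \vee 0) \wedge 1 = 0$$

Here $ue_{k-1}^T = 0$ is a premise of the lemma, while $stall_k^T = 0$ and ${rollback'}_k^T = 0$ follow from the premise $ue_k^T = 1$.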
The following lemma is the counterpart of lemma 4.6 in the pipelined
machine without speculation. The proof proceeds as in chapter 4.
Lemma 5.5 Stage $k$ is full at the earliest in cycle $k$.
$$full_k^T \implies T \ge k$$
5.3 Schedule with Speculation
Using the signals $rollback'_k$, it is possible to give a recursive specification
of a scheduling function $s_I(k,T)$ for the pipelined machine with speculation
that reflects the changes caused by a rollback.
It is constructed as follows: In “normal operation”, i.e., if no speculation
is made, the scheduling function should match the scheduling function of
the pipelined machine without speculation. However, in case of a rollback,
the scheduling function must provide values such that the instructions that
are evicted never entered the pipeline.
This allows for a recursive definition of the scheduling function of the
prepared sequential machine: For the sake of simplicity, we split the definition
of the function into three cases: 1) $T = 0$, 2) a rollback is made, and 3) no
rollback is made.
If $T = 0$ holds, $s_I(k,T)$ is zero, just as before:
$$s_I(k,0) := 0 \qquad (5.4)$$
If $T \ne 0$ holds and if no rollback is made, i.e., ${rollback'}_k^{T-1}$ does not
hold, we use the definition from chapter 3:
$$s_I(k,T) := \begin{cases} s_I(k,T-1) & : \overline{ue_k^{T-1}} \\ s_I(0,T-1)+1 & : ue_k^{T-1} \wedge k = 0 \\ s_I(k-1,T-1) & : ue_k^{T-1} \wedge k \ne 0 \end{cases}$$
If $T \ne 0$ holds and a rollback is made, i.e., ${rollback'}_k^{T-1}$ holds, we aim
to provide values as if the instructions that are evicted never were put into
the pipeline.
Assume the following example: Instruction $I_0$ does not use speculation
and proceeds through the pipeline as usual. Instruction $I_1$ uses speculation
and we misspeculate. This is detected in cycle $T = 3$ and stage $k = 2$. In
table 5.2, we depict a standard pipelined schedule such that $I_1$ is not put
into the pipeline before cycle $T = 4$. In table 5.3, we depict a schedule
such that $I_1$ uses speculation instead. Note that the schedules match after
the rollback in cycle $T = 4$.
In this example, during cycle $T = 3$, the following signals are active:
because of the misspeculation, $rollback_2^3$ is active. This implies that the
signals $rollback'_0$ to $rollback'_2$ are active by definition of these signals.
T            0  1  2  3  4  5  6  7  8
$s_I(0,T)$   0  1  1  1  1  2  3  4  5
$s_I(1,T)$   0  0  1  1  1  1  2  3  4
$s_I(2,T)$   0  0  0  1  1  1  1  2  3
$s_I(3,T)$   0  0  0  0  1  1  1  1  2
$s_I(4,T)$   0  0  0  0  0  1  1  1  1
Table 5.2 The values of $s_I$ in a five stage pipelined machine without speculation.
Instruction $I_1$ is delayed until cycle 5.

T            0  1  2  3  4  5  6  7  8
$s_I(0,T)$   0  1  2  3  1  2  3  4  5
$s_I(1,T)$   0  0  1  2  1  1  2  3  4
$s_I(2,T)$   0  0  0  1  1  1  1  2  3
$s_I(3,T)$   0  0  0  0  1  1  1  1  2
$s_I(4,T)$   0  0  0  0  0  1  1  1  1
Table 5.3 The values of $s_I$ in a five stage pipelined machine with speculation. We
misspeculate on instruction $I_1$ and detect this in cycle 4. In cycle 5, the execution
proceeds as if instruction $I_1$ was delayed until cycle 5.

             T = 0  T = 1  T = 2  T = 3  T = 4  T = 5
$s_I(0,T)$     0      1      2      3      1      2
$s_I(1,T)$     0      0      1      2      1      1
$s_I(2,T)$     0      0      0      1      1      1
$s_I(3,T)$     0      0      0      0      1      1
$s_I(4,T)$     0      0      0      0      0      1
Table 5.4 Illustration of the recursion made for $s_I(k,4)$ in case of a rollback.

We construct the scheduling function for this case as follows: for all
stages with rollback, we take the value of the scheduling function from
cycle $T-1$ from the last stage in which we detect a rollback. If $rollback'_{n-1}$ is
active, this is stage $n-1$. If not, this is the stage $k$ such that $rollback'_k$ holds
but $rollback'_{k+1}$ does not. We use the predicate $\rho(k,T)$ as a shorthand:
$$\rho(k,T) :\iff {rollback'}_k^T \wedge (k = n-1 \vee \overline{{rollback'}_{k+1}^T})$$
We assert the claim above in the following lemma:
Lemma 5.6 The construction described above provides the last stage with active rollback
signal.
$$\rho(k,T) \implies k = \max\{ j \in \{0, \ldots, n-1\} \mid rollback_j^T \}$$
PROOF One easily shows this inductively using the fact that if $rollback'_{k+1}$ is
active, this implies that the signal $rollback'_k$ is also active.
Thus, if $\rho(k,T-1)$ holds, we take $s_I(k,T-1)$ as value for $s_I(k,T)$. If it
does not hold, we use recursion in order to get the desired value: we walk
down the pipeline from $k$ to $k+1$ until $\rho(k,T-1)$ holds:
$$s_I(k,T) := \begin{cases} s_I(k,T-1) & : \rho(k,T-1) \\ s_I(k+1,T) & : \text{otherwise} \end{cases}$$
Since this case assumes that ${rollback'}_k^{T-1}$ holds, this simplifies to:
$$s_I(k,T) := \begin{cases} s_I(k,T-1) & : k = n-1 \vee \overline{{rollback'}_{k+1}^{T-1}} \\ s_I(k+1,T) & : \text{otherwise} \end{cases}$$
This recursion is illustrated in table 5.4. It is no longer obvious that this
recursion terminates for all values $k$ and $T$. One argues as follows: the
recursion terminates as soon as $T = 0$ is reached. In case of no rollback, $T$
decreases by one. In case of a rollback, either $T$ decreases or $k$ increases.
However, $T$ decreases at the latest when the end of the pipeline is reached,
i.e., if $k = n-1$ holds.
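The recursive definition can be evaluated cycle by cycle. The following is a minimal sketch, assuming a pipeline in which no stage is ever stalled, so that $ue_k = full_k \wedge \overline{rollback'_k}$, and in which the full bits evolve according to lemma 5.1. The function names and the encoding of rollback events as `(T, k)` pairs are assumptions made for illustration.

```python
def rollback_prime(rollback):
    """rollback'_k: some stage i >= k signals a rollback (equation 5.1)."""
    active, result = False, [False] * len(rollback)
    for k in range(len(rollback) - 1, -1, -1):
        active = active or rollback[k]
        result[k] = active
    return result


def schedule(n, cycles, rollback_events):
    """Return rows s[T][k] = s_I(k, T) for T = 0 .. cycles.

    rollback_events is a set of (T, k) pairs with rollback_k active in cycle T.
    Assumption of this sketch: no stage is ever stalled.
    """
    full = [True] + [False] * (n - 1)
    s = [[0] * n]                       # s_I(k, 0) = 0 (equation 5.4)
    for T in range(cycles):
        rbp = rollback_prime([(T, k) in rollback_events for k in range(n)])
        ue = [full[k] and not rbp[k] for k in range(n)]
        nxt = [0] * n
        # Highest stage first, because the rollback case reads s_I(k+1, T+1).
        for k in range(n - 1, -1, -1):
            if rbp[k]:
                last = k == n - 1 or not rbp[k + 1]
                nxt[k] = s[T][k] if last else nxt[k + 1]
            elif not ue[k]:
                nxt[k] = s[T][k]
            elif k == 0:
                nxt[k] = s[T][0] + 1
            else:
                nxt[k] = s[T][k - 1]
        s.append(nxt)
        # Full bits according to lemma 5.1 (all stall signals are off here).
        full = [True] + [ue[k - 1] and not rbp[k] for k in range(1, n)]
    return s


# Rollback signalled in stage 2 during cycle 3, as in the example above.
rows = schedule(n=5, cycles=8, rollback_events={(3, 2)})
stage0 = [row[0] for row in rows]
```

For this input, the computed values agree with table 5.3. Evaluating the stages from $n-1$ down to 0 within each cycle mirrors the termination argument: the rollback case only refers to a later stage of the same cycle.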
5.4 Scheduling Invariants
In this section, we will show that the scheduling invariants presented in
chapter 3 still hold for the stall engine of the pipelined machine with
interrupts. We have to make a small change to invariant one for the rollback
case. Invariants two and three still hold without any change.
Invariant 5.1 Assume that the rollback signal ${rollback'}_{k+1}^{T-1}$ is not active or that $k$ is the
last stage. If the update enable signal of stage $k$ is active in cycle $T-1$,
the value of the scheduling function for that stage increases by one. If the
update enable signal of the stage is not active, the value does not change.
For $T > 0$:
$$s_I(k,T) = \begin{cases} s_I(k,T-1) & \text{if } ue_k^{T-1} = 0 \\ s_I(k,T-1)+1 & \text{if } ue_k^{T-1} = 1 \end{cases}$$
Invariant 5.2 Given a cycle $T$, the values of the scheduling functions of two adjacent
stages are either equal or the value of the scheduling function of the earlier
stage is greater by one. This also holds in case of a rollback.
Invariant 5.3 The value of the scheduling function of the earlier stage is greater by one
iff the full bit of the later stage is set. For $k > 0$:
$$full_k^T = 1 \iff s_I(k-1,T) = s_I(k,T)+1$$
Negating both sides of the last equation and applying invariant 5.2 results
in:
$$full_k^T = 0 \iff s_I(k-1,T) = s_I(k,T)$$
This also holds in case of a rollback.
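Invariants 5.2 and 5.3 can be checked directly on the values listed in table 5.3. The following minimal sketch copies those values into a list and derives the full bits that invariant 5.3 implies; the names `TABLE_5_3`, `invariant_2_holds`, and `full_bits` are illustrative and not part of the thesis.

```python
# Values of s_I from table 5.3: TABLE_5_3[k][T] = s_I(k, T)
TABLE_5_3 = [
    [0, 1, 2, 3, 1, 2, 3, 4, 5],
    [0, 0, 1, 2, 1, 1, 2, 3, 4],
    [0, 0, 0, 1, 1, 1, 1, 2, 3],
    [0, 0, 0, 0, 1, 1, 1, 1, 2],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
]


def invariant_2_holds(table):
    """Adjacent stages differ by zero or one, the earlier stage being greater."""
    n, cycles = len(table), len(table[0])
    return all(table[k - 1][T] - table[k][T] in (0, 1)
               for k in range(1, n) for T in range(cycles))


def full_bits(table, T):
    """Full bits of stages 1 .. n-1 in cycle T as implied by invariant 5.3."""
    return [table[k - 1][T] == table[k][T] + 1 for k in range(1, len(table))]


ok = invariant_2_holds(TABLE_5_3)
full_before_rollback = full_bits(TABLE_5_3, 3)
full_after_rollback = full_bits(TABLE_5_3, 4)
```

In cycle 3 the stages 1 to 3 are full. In cycle 4, after the rollback, only stage 4 is full, because the instructions in stages 0 to 2 were evicted and $I_0$ moved on.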
PROOF The proof of the invariants proceeds as in chapter 3: Let $P_i(T)$ denote that
invariant $i$ holds for the pipelined machine with speculation for the cycle
$T$. The claim is concluded as in chapter 3:
$$P_3(T-1) \implies P_1(T)$$
$$P_1(T) \wedge P_2(T-1) \wedge P_3(T-1) \implies P_2(T)$$
$$P_1(T) \wedge P_2(T-1) \wedge P_3(T-1) \implies P_3(T)$$
Proof of Invariant 5.1 We make a case split on the value of the rollback
signal ${rollback'}_k^{T-1}$:
1. Let the rollback signal ${rollback'}_k^{T-1}$ be active. Since ${rollback'}_{k+1}^{T-1}$ is
not active or $k$ is the last stage, stage $k$ is the last stage with active
rollback signal. The update enable signal $ue_k^{T-1}$ is not active in this
case by definition. Thus, the claim is:
$$s_I(k,T) = s_I(k,T-1)$$
This holds by definition of $s_I(k,T)$.
2. Let the rollback signal ${rollback'}_k^{T-1}$ be not active. As we exclude
the case of a rollback, the proof proceeds as the proof of invariant
3.1 presented in chapter 3.
Proof of Invariant 5.2 Let us consider the stages $k-1$ and $k$ with $k > 0$.
Let ${rollback'}_k^{T-1}$ be active. We start with the induction claim:
$$s_I(k-1,T) = s_I(k,T)+1 \;\vee\; s_I(k-1,T) = s_I(k,T) \qquad (5.5)$$
The second equation holds because of the definition of $s_I(k-1,T)$ and
because the rollback signal is active.
Let ${rollback'}_k^{T-1}$ be not active. In this case, $k$ is either the last stage
or ${rollback'}_{k+1}^{T-1}$ is not active. Thus, no rollback is involved and the proof
proceeds as the proof of invariant 3.2 in chapter 4.
Proof of Invariant 5.3 For $T = 0$, the claim can be shown by definition
unfolding and using lemma 5.5. For $T > 0$, according to lemma 5.1, the
claim is equivalent to:
$$(ue_{k-1}^{T-1} \vee stall_k^{T-1}) \wedge \overline{{rollback'}_k^{T-1}} \iff s_I(k-1,T) = s_I(k,T)+1$$
As before, the proof in chapter 4 can be repeated if the rollback signal
${rollback'}_k^{T-1}$ is not active. Thus, let ${rollback'}_k^{T-1}$ hold. This implies that
$s_I(k-1,T)$ is equal to $s_I(k,T)$. Thus, the right hand side of the equivalence
in the claim cannot hold. The left hand side of the equivalence also does
not hold because ${rollback'}_k^{T-1}$ holds. This concludes the claim. QED
5.5 Speculative Inputs
For the sake of simplicity, we restrict ourselves to the case that the speculation
is done in the first stage. Let $R$ be an identifier for a value we want to
guess. Let $R \in \sigma$ denote this fact. The speculation mechanism is added in
three steps:
1. The first step is to add functions that do the guessing of the value. We
name those functions $f_0R_s$ by convention. These functions are called
speculation functions and can take arbitrary specification registers as
arguments as described in section 3.2.4 (page 41). In analogy to the
notation used in the previous chapters, this set of registers is denoted
by $dep_s(R,0)$. All other notation used for register transition functions
also applies for the speculation functions.
2. We add registers that record whether we still have to speculate or
whether the real value is already known. We denote this register by
$cR$. The domain of this register is one bit. If it is set, the correct
value of $R$ is known. If not, we have to speculate. We initialize these
registers with zero.
We furthermore add registers that save the real value in case of a
rollback. The registers are named $R$ and have the same domain as
the value we are guessing. These registers are initialized with an
arbitrary value, e.g., zero.
3. We make the guessed value provided by $f_0R_s$ available as input for
the register transition functions of stage 0. We do not allow a recursion
here, i.e., the input of a speculation function must not be a
speculative value.
The input generation function $g_0R$ for such a speculative value is
defined as follows: in case the bit in $cR$ is set, we return the value in
$c.R$. If not, we guess the value using $f_0R_s$.
$$g_0R(c_I^T) := \begin{cases} c_I^T.R & : c_I^T.cR = 1 \\ f^\gamma_0R_s(c_I^T) & : \text{otherwise} \end{cases} \qquad (5.6)$$
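Equation 5.6 is a simple selection between the saved real value and the guess. The following minimal sketch illustrates it in software; the `Config` class, its `pc_hint` field, and the guess function `guess_next` are hypothetical stand-ins for the configuration $c_I^T$ and the speculation function $f_0R_s$, and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Config:
    R: int        # register holding the saved real value of R
    cR: bool      # set iff the real value of R is known
    pc_hint: int  # hypothetical state read by the speculation function


def g0R(c: Config, guess: Callable[[Config], int]) -> int:
    """Input generation for a speculative value (equation 5.6)."""
    return c.R if c.cR else guess(c)


def guess_next(c: Config) -> int:
    """Hypothetical speculation function: assume the next sequential word."""
    return c.pc_hint + 4


known = g0R(Config(R=100, cR=True, pc_hint=8), guess_next)
guessed = g0R(Config(R=0, cR=False, pc_hint=8), guess_next)
```

After a rollback the bit $cR$ is set, so the restarted instruction reads the saved real value instead of guessing again.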
5.6 Detecting Misspeculation
The mechanism above allows guessing a register value. The guessed value
can be used as normal input to register transition functions. However, we
have to detect and handle the case that the speculation fails. In order to
detect that we misspeculated, it is necessary to store the guessed value to
have it available later on. For $R \in \sigma$, we do so by adding instances of an
implementation register named $R$, i.e., $R.1$, $R.2$, and so on.

[Figure 5.2: the output of $g_0R$ feeds the chain of registers $R.1$, $R.2$, $R.3$.]
Figure 5.2 The speculative value $R$ is guessed by stage 0 and then stored in
registers $R.1$, $R.2$, and so on.
If $ue_0$ is active and such a register $R.1$ with $R \in \sigma$ is updated, one simply
writes the value provided by $g_0R$ into the register, i.e., the guessed value.
In case of registers $R.k$ with $R \in \sigma$ and $k > 1$, we just take the value from
the previous stage, i.e., from $R.(k-1)$. This is depicted in figure 5.2.
In addition to the check for misspeculation, the value in these registers
can be used in order to read the speculative value in stages other than the
first stage. This is handled just like a normal read access to an implementation
register.
A misspeculation is detected as follows: let $R \in out(k)$ be an instance of
such a register. If a value is written into the register by write accesses as
used in the previous chapters, this value is compared with the value that is
in the instance of the register in the previous stage. If they do not match, a
misspeculation is detected and a rollback is signaled.
For this purpose, we define a misspeculation signal $R_k\mathit{misspec}$ for each
such register $R.(k+1)$. It is active if the value provided by the write access
and the value in the register do not match and if the stage is full but not
stalled.
$$R_k\mathit{misspec}(c_I) := (f^\gamma_kR(c_I) \ne c_I.R.k) \wedge full_k \wedge \overline{stall_k} \qquad (5.7)$$
This is depicted in figure 5.3. The test for the stall signal is motivated as
160
Section 5.7
ROLLBACK
eq
f γkR
f ullk
stallk
Rkmisspec
R:k
R:(k+1)
Figure 5.3 The speculative value R is compared with the value provided by the
register transition function. If they do not match, a misspeculation is signaled.
follows: the function fkR takes inputs. These inputs might be forwarded.
Thus, they are only guaranteed to be valid if the stall signal is not active.
Furthermore, we require that the functions fkR do not depend on values
that are guessed.
We use these signals in order to calculate the rollback signal of stage k:
It is just the disjunction of the $R_k\mathit{misspec}$ signals:
$$\mathit{rollback}_k(c_I) := \bigvee_{R.k \in \sigma} R_k\mathit{misspec}(c_I) \qquad (5.8)$$
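The two signals can be mimicked in software to make the roles of the full and stall bits explicit. The following is a minimal sketch, not the gate-level design: the class `SpecRegister` and its field names are hypothetical, and register values are modelled as plain integers.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SpecRegister:
    """One speculative value R as seen by stage k (hypothetical model)."""
    name: str
    stored_guess: int    # value held in the instance R.k
    written_value: int   # value provided by the write access of stage k


def misspec(reg: SpecRegister, full_k: bool, stall_k: bool) -> bool:
    """Mirror of (5.7): mismatch, stage full, and stage not stalled."""
    return reg.written_value != reg.stored_guess and full_k and not stall_k


def rollback(regs: Sequence[SpecRegister], full_k: bool, stall_k: bool) -> bool:
    """Mirror of (5.8): disjunction of the misspeculation signals of stage k."""
    return any(misspec(r, full_k, stall_k) for r in regs)
```

In this model a mismatch is ignored while the stage is stalled or empty, which matches the motivation given for the stall test: forwarded inputs are only guaranteed to be valid when the stage is not stalled.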
5.7 Rollback
During rollback, we have to revert changes to the machine made by the
instructions that used misspeculated data. Thus, the state of the machine
has to be changed as if the instructions that used misspeculated data never
entered the machine.
The rollback is realized as follows: the original values of the registers
that are changed during the speculation are saved in temporary registers.
All calculations store their results in the original place as before. If the
speculation fails, the original values are restored from the temporary
registers. If the speculation turns out to be correct, the values in the
temporary registers are just ignored. By convention, we name the temporary
register oQ if the name of the original register is Q.
Figure 5.4 Saving the original value of PC'.2 in oPC' for reading or rollback by
stages 2, 3, and 4.
In order to save hardware cost, we restore specification registers only.
In particular, we do not restore the implementation registers in case of a
rollback. The only justification for this is saving the gates and latches
required for the rollback in case of implementation registers. The price
paid for this is extra proof effort, since we have to argue that not restoring
implementation registers does not affect data consistency.
Example Consider a pipelined machine and a specification register PC'
that is written by stage 1 (decode). In the same cycle in which one clocks a
new value into PC'.2, the old value of the register is saved in an
implementation register called oPC'.2 (figure 5.4).
If this value is required in any later stage for rollback or any other pur-
pose, extra instances can be added to the stages in between. This is the
usual method to add instances of implementation registers, as already de-
scribed in chapter 3. Note that duplicating Q into oQ is expensive regarding
hardware cost. We therefore assume that Q is neither a register file nor a
memory.
Remember that $\omega_kQ$ denotes the value clocked into register Q. In order
to realize rollback, we change the function $\omega_kQ$ as follows: in case no
rollback is made, the function returns the same value as before. In case a
rollback is made, we have to select the appropriate instance of the register
oQ. Note that actually more than one rollback signal can be active
simultaneously. In this case, we have to take the original values from the latest
stage with active rollback signal. Remember that we used ρ(j,T) in order
to denote that stage j has this property. We now change the new value
clocked into Q.(k+1) as follows:
$$\omega_kQ(c_I^T) := \begin{cases} f\gamma_kQ(c_I^T) & : ue_k = 1 \\ oQ^T.j & : \rho(j,T) \\ Q^{T-1}.(k+1) & : \text{otherwise} \end{cases} \qquad (5.9)$$
We implement this using multiplexers (figure 5.5). This implementa-
tion is similar to the circuit in figure 4.10 (page 116) we use in order to
implement the minimum required for forwarding in chapter 4.
The implementation described here takes one cycle in order to detect
misspeculation and handle the rollback. The calculation of the next con-
figuration begins in the next cycle. In some designs, in particular in case
of branch prediction, the calculation of the next configuration begins in the
same cycle the misspeculation is detected. This saves one cycle but may
increase cycle time. Thus, this is a CPI vs. cycle time tradeoff. However,
we do not further evaluate this.
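The case distinction in (5.9) can be read as a priority selection. The sketch below is only an illustration under stated assumptions: the cases are tested from top to bottom, "latest stage" is taken to mean the highest stage number with an active rollback signal, and the function and parameter names are hypothetical.

```python
from typing import Dict, Optional


def latest_rollback_stage(rollback: Dict[int, bool]) -> Optional[int]:
    """Return the highest-numbered stage j with an active rollback signal,
    i.e. the stage for which rho(j, T) holds, or None if there is none."""
    active = [j for j, on in rollback.items() if on]
    return max(active) if active else None


def next_value(ue_k: bool, f_gamma: int, saved: Dict[int, int],
               rollback: Dict[int, bool], current: int) -> int:
    """Value clocked into Q.(k+1), following the three cases of (5.9).

    saved[j] models the instance oQ.j, current models the old register value.
    """
    if ue_k:                       # first case: regular update
        return f_gamma
    j = latest_rollback_stage(rollback)
    if j is not None:              # second case: restore from oQ.j
        return saved[j]
    return current                 # third case: keep the old value
```

With rollback signals active in stages 2 and 3, the value saved in stage 3 is selected, which corresponds to the multiplexer chain of figure 5.5.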
5.8 Extended Read Access Semantics
5.8.1 Specification Registers
In chapter 3, we did not allow read accesses to specification registers R in
stages k > stage(R). We now define semantics for such read accesses. This
is not related to speculation. In fact, one can define the same semantics for
the prepared sequential and pipelined machine without speculation. The
only reason why we did not introduce it in one of the previous chapters is
that we did not need such read accesses.
We aim to define read accesses to specification registers R in stages k >
stage(R) such that the claim of the input correctness lemmas still holds:
$$g_kR(c_I^T) \stackrel{!}{=} R_S^{sI(k,T)}$$
Figure 5.5 Selecting the correct value for restoring Q.2 in case of a rollback.
We realize this by reading the implementation register oR.k, as introduced
above. However, we do not define such read accesses with address.
We have to prove the claim above: in order to do so, we extend the stage
correctness predicates as introduced in chapter 3. The claim for registers
oR.(k+1) ∈ out(k) is:
$$oR^T.(k+1) = \begin{cases} R_S^{sI(k,T)-1} & : sI(k,T) > 0 \\ 0 & : \text{otherwise} \end{cases}$$
Assuming this stage correctness predicate, we easily show that inputs cal-
culated according to the rules above are correct:
Lemma 5.7 Let $\mathit{full}_k^T$ hold and let R be a specification register and k > stage(R).
Assuming that the stage correctness predicate $P_{k-1}$ holds in cycle T, the
inputs generated by the functions $g_kR$ during cycle T are correct:
$$g_kR(c_I^T) \stackrel{!}{=} R_S^{sI(k,T)}$$
PROOF By definition of $g_kR$, the value of oR.k is read:
$$g_kR(c_I^T) = c_I^T.oR.k \qquad (5.10)$$
By using the stage correctness predicate for register R, stage k-1, cycle
T, we transform the right hand side:
$$g_kR(c_I^T) = R_S^{sI(k-1,T)-1} \qquad (5.11)$$
According to invariant 5.3, we have sI(k-1,T) = sI(k,T) + 1. Thus,
we get:
$$g_kR(c_I^T) = R_S^{sI(k,T)} \qquad (5.12)$$
This is the claim. QED
5.8.2 External Signals
We further extend the read access semantics by defining external signals.
We allow accessing external signals by adding the name of the signal to
the list of registers a register transition function depends on. Let R be an
external signal that is read by stage k. The signal has an arbitrary domain
W(R). We assume a mapping from the instruction numbers into W(R)
that defines the value of the signal in the specification machine:
$$R_S : \mathbb{N} \to W(R)$$
Thus, the correct value of an external input R in stage k is:
$$G_kR(c_S) := R_S(c_S)$$
We have to assume that we get exactly the correct value if an instruction
in stage k reads R. This is done if stage k is full and not stalled.
$$\mathit{full}_k^T \wedge \overline{\mathit{stall}_k^T} \implies g_kR(c_I^T) = G_kR(sI(k,T))$$
Obviously, this is inconsistent if the same signal is read in multiple
stages. We therefore assume that a signal is read in exactly one stage.
5.9 Branch Prediction
5.9.1 The DLX without Delayed PC
Many microprocessors do not use delayed branch semantics because of
binary compatibility with earlier, sequential versions. One well-known ex-
ample is the Intel x86 family [Yeu84, Int95b]. Removing the delayed PC
from the specification significantly complicates a pipelined implementa-
tion.
In this section, we will give a specification of a DLX without Delayed
PC. We will then use speculation as described above in order to build a
pipelined DLX that provably implements this specification.
The first step is to remove the registers PC' and DPC from the specification.
We add a single register PC instead. The other registers (GPR, DM)
remain unchanged. As in chapter 2, let the signal I denote the instruction
word fetched. The address used to fetch I is taken from the register PC and
no longer from DPC:
$$I(c) = IM(c.PC) \qquad (5.13)$$
The transition function for the register PC is the same as for PC':
$$\delta.PC(c) = \mathit{nextpc}(I, op1(c), c.PC) \qquad (5.14)$$
The transition functions of DM and GPR remain unchanged. In case of
a jump and link instruction, we take PC+4 and no longer PC'+4.
Figure 5.6 Instruction fetch and next PC calculation in a prepared sequential
DLX without Delayed PC
5.9.2 The Sequential DLX without Delayed PC
Implementing and verifying a prepared sequential machine without de-
layed PC is trivial. One takes the prepared sequential machine from chapter
3 with minimal modifications. One just renames PC0 into PC and removes
the DPC register. The instruction fetch is made using PC as register (fig-
ure 5.6). No speculation is necessary. The proof of correctness follows the
proof given in chapter 3.
Figure 5.7 Instruction fetch and next PC calculation in a pipelined DLX without
Delayed PC and without speculation
5.9.3 The Pipelined DLX without Delayed PC
In chapter 4, we transformed the prepared sequential machine with Delayed
PC into a pipelined machine. This is still feasible for the machine
without Delayed PC. However, we have to forward register PC.2 into the
instruction fetch stage (figure 5.7). According to the forwarding mechanism
as given in the previous chapter, we have to select between the value
in the register PC.2 and the value written into PC.2 depending on the value
of the full bit full.1.
If the decode stage is full, which is the common case, we have to use the
value provided by the nextpc circuit as address for the instruction fetch.
In particular, the nextpc circuit uses the first GPR operand as input. This
operand might be forwarded, too. We therefore get a data path that passes
the ALU and the nextpc circuit and the instruction memory. We consider
such a path to be too long.
A common approach to this problem is using branch prediction. The
problem is the GPR operand. The GPR operand is used in order to decide
whether the branch is taken or not in case of a branch instruction. In case
of a jump register instruction, the operand value is used as target address.
The idea of branch prediction is to guess whether a branch is taken or
not. There are various methods to realize this. Implementing branch pre-
dictors lies beyond the scope of this thesis. There is a vast amount of
literature on sophisticated branch predictors, e.g. [Smi81, LS84, YP92,
CHYP94, PS94].
However, branch prediction is of no use regarding jump register in-
structions. Since jump register instructions are much less common than
branch instructions, a feasible solution is to stall the execution until the
GPR operand is available. Another solution is to guess the branch target,
too. This is what we implement. As for the branch predictor, we do not
elaborate how to implement the target predictor.
We implement branch prediction as follows: the first step is to move
register PC from stage 1 (decode) to stage 0 (fetch). This allows reading
the PC register in stage 1 without any forwarding. The next PC is now
calculated as follows: if the instruction fetched is neither a branch nor a
jump, we just take the old value and increment it by four. If it is a branch, we
guess whether it is taken or not. If it is a jump register instruction, we guess
the branch target. We denote these guessed values by branch_taken
and branch_target. We pass the address of the instruction to the predictor
(figure 5.8). Figure 5.9 shows how the new PC is calculated using the
guessed values.
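The selection described above can be sketched as a small function. This is a hedged illustration rather than the circuit of figure 5.9: the parameter names are hypothetical, the treatment of unconditional jumps and the target arithmetic `pc + immediate` are assumptions that the text does not fix, and the predictor itself is not modelled.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit program counter


def next_pc_speculative(pc: int, is_branch: bool, is_jump: bool,
                        is_jump_register: bool, immediate: int,
                        branch_taken: bool, branch_target: int) -> int:
    """Next PC computed in the fetch stage from guessed values only."""
    if is_jump_register:
        # the GPR operand is not available yet, so the target is guessed
        return branch_target & MASK32
    if is_jump or (is_branch and branch_taken):
        # assumed target computation, relative to the current PC
        return (pc + immediate) & MASK32
    # neither branch nor jump, or a branch guessed as not taken
    return (pc + 4) & MASK32
```

No GPR operand appears among the inputs, so this calculation does not depend on forwarded data.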
We instantiate the rollback mechanism as described in section 5.7 (page
161). The old PC value is stored in a register oPC.1. This allows restoring
the PC in case of a rollback (figure 5.10). Thus, the register PC.1 is clocked
if $ue_0$ is active or if the rollback signal $\mathit{rollback}'_1$ is active. The update
enable signal is used in order to select the appropriate source.
The register oPC.1 is also used for reading the PC register in stage 1
(decode stage): we read the PC register for jump and link instructions. Since
1 > stage(PC) = 0, we have to use the extended read access semantics as
introduced above.
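One clock step of the PC register and its saved copy can be sketched as follows. This is a simplified model under assumptions: the reset case is left out, the saved copy is assumed to keep its value during a rollback, and the function name `clock_pc` is hypothetical.

```python
from typing import Tuple


def clock_pc(pc: int, opc: int, ue0: bool, rollback1: bool,
             next_pc: int) -> Tuple[int, int]:
    """Return the new contents of (PC.1, oPC.1) after one clock edge."""
    if ue0:
        # regular update: clock in the next PC and save the old one
        return next_pc, pc
    if rollback1:
        # rollback: restore the PC from the saved copy
        return opc, opc
    # neither signal is active: both registers keep their values
    return pc, opc
```

The update enable signal is tested first because it is the signal that selects the source of the new PC value.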
As described in section 5.6 (page 159), the guessed values are stored
in instances of implementation registers. In case of the DLX with branch
registers these registers are branch_taken.1 and branch_target.1. In stage 1
(decode), we can calculate the correct values. We add a write access to the
register branch_taken.2. The value written is just bjtaken_imp as defined
in the previous chapter:
$$f_1\mathit{branch\_taken}(IR, GPRa) = \mathit{bjtaken\_imp}(IR, GPRa)$$
Figure 5.8 Instruction fetch and next PC calculation in a pipelined DLX without
Delayed PC and with speculation. Let bt be a shorthand for branch_taken and
btarget be a shorthand for branch_target. The prediction unit is denoted by spec.
The circuits for providing the old PC value in case of a rollback are omitted.
Figure 5.9 Calculating the next PC using speculation
Figure 5.10 Restoring the PC register in case of a rollback: The register PC is
clocked if $ue_0$ or $\mathit{rollback}'_1$ is active or in case of a reset. If $ue_0$ is active, the next
PC is clocked into PC.1 and the old PC is clocked into oPC.1. If $ue_0$ is not active,
the old PC from oPC.1 is clocked into PC.1. The multiplexer used in order to
handle the reset case is omitted.
Lemma 5.8 Given correct inputs, the function above calculates the correct value of
branch_taken. We denote this correct value by $\Omega\mathit{branch\_taken}(c_S)$:
$$\Omega\mathit{branch\_taken}(c_S) := \mathit{bjtaken}(I(c_S), op1(c_S))$$
The functions bjtaken, I, and op1 are defined in chapter 2.
The claim is that $f_1\mathit{branch\_taken}$ returns this value given correct inputs:
$$f\Gamma_1\mathit{branch\_taken}(c_S^i) = \Omega\mathit{branch\_taken}(c_S^i)$$
PROOF By expanding $f\Gamma_1\mathit{branch\_taken}$, we get the following claim:
$$\mathit{bjtaken\_imp}(\Omega_0IR(c_S^i), G_1GPRa(c_S^i)) \stackrel{!}{=} \Omega\mathit{branch\_taken}(c_S^i)$$
Using lemma 3.3 (page 57), we transform this into:
$$\mathit{bjtaken}(\Omega_0IR(c_S^i), G_1GPRa(c_S^i)) \stackrel{!}{=} \Omega\mathit{branch\_taken}(c_S^i)$$
By definition of $\Omega\mathit{branch\_taken}$, the claim is equal to:
$$\mathit{bjtaken}(\Omega_0IR(c_S^i), G_1GPRa(c_S^i)) \stackrel{!}{=} \mathit{bjtaken}(I(c_S^i), op1(c_S^i))$$
This is concluded as in lemma 3.15 (correctness of the transition
functions of the DLX without branch prediction): The first step is to assert that
$\Omega_0IR(c_S^i)$ is equal to $I(c_S^i)$. In case the instruction coded by $I(c_S^i)$ is a jump
instruction, the claim immediately follows from the definition of bjtaken.
In case of a branch instruction, one asserts that $G_1GPRa(c_S^i)$ is equal to
$op1(c_S^i)$. QED
Furthermore, we add a write access to the register branch_target. The
value written is GPRa (the first GPR operand) if we have a jump register
instruction and zero otherwise:
$$f_1\mathit{branch\_target}(IR, GPRa) = \begin{cases} GPRa & : I\_jr(IR) \\ 0 & : \text{otherwise} \end{cases}$$
As above, we define a correct value for branch_target. This is the GPR
operand in case of a jump register instruction. In case of any other
instruction, we use zero.
$$\Omega\mathit{branch\_target}(c_S^i) = \begin{cases} op1(c_S^i) & : I\_jr(I(c_S^i)) \\ 0 & : \text{otherwise} \end{cases}$$
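The write access for branch_target and its use in the comparison of section 5.6 can be sketched as follows. This is a minimal model under assumptions: instruction decoding is reduced to a boolean flag, the full and stall qualifiers of the misspeculation signal are omitted, and both function names are hypothetical.

```python
def f1_branch_target(is_jump_register: bool, gpr_a: int) -> int:
    """Value written to branch_target in decode: GPRa for a jump register
    instruction and zero otherwise."""
    return gpr_a if is_jump_register else 0


def target_mismatch(guess: int, is_jump_register: bool, gpr_a: int) -> bool:
    """True if the stored guess differs from the value computed in decode."""
    return guess != f1_branch_target(is_jump_register, gpr_a)
```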
Lemma 5.9 Given correct inputs, $f_1\mathit{branch\_target}$ calculates this value:
$$f\Gamma_1\mathit{branch\_target}(c_S^i) = \Omega\mathit{branch\_target}(c_S^i)$$
PROOF By expanding $f\Gamma_1\mathit{branch\_target}$ and swapping left hand side and right
hand side for readability, we get the following claim:
$$\Omega\mathit{branch\_target}(c_S^i) \stackrel{!}{=} \begin{cases} G_1GPRa(c_S^i) & : I\_jr(\Omega_0IR(c_S^i)) \\ 0 & : \text{otherwise} \end{cases}$$
One easily asserts that $\Omega_0IR(c_S^i)$ is equal to $I(c_S^i)$. This transforms the
claim into:
$$\Omega\mathit{branch\_target}(c_S^i) \stackrel{!}{=} \begin{cases} G_1GPRa(c_S^i) & : I\_jr(I(c_S^i)) \\ 0 & : \text{otherwise} \end{cases}$$
In case $I\_jr(I(c_S^i))$ does not hold, one easily concludes the claim by
expanding the definition of $\Omega\mathit{branch\_target}$. In case $I\_jr(I(c_S^i))$ holds, one
expands $\Omega\mathit{branch\_target}$ and gets the following claim:
$$op1(c_S^i) \stackrel{!}{=} G_1GPRa(c_S^i)$$
One asserts this easily using that we have a jump register instruction. QED
5.10 Data Consistency
5.10.1 Data Consistency Criterion
The data consistency criterion for both sequential and pipelined machines
is that we match values of the registers in the implementation machine
with values taken from the specification machine. This no longer works
in a machine with speculation. As an example, consider the PC register
in the pipelined machine without Delayed PC. If the speculation fails, we
actually write wrong values into this register. This wrong value might
never occur in the specification machine.
This gets even worse if one considers a machine that detects the
misspeculation in even later stages, e.g., in stage 3 or 4. In such a machine,
subsequent instructions are fetched using a wrong PC. This might lead to
completely undefined results. This is illustrated in figure 5.11: assume
instruction $I_0$ does not require speculation and that we misspeculated while
instruction $I_1$ was in stage 0. If $I_1$ is in stage 3, we have the following
situation: in registers R.k with k > 3, there is still correct data. In registers
R.3, we have the misspeculated data. In registers R.k with k < 3, we have
data calculated using misspeculated data.
The last stage that contains misspeculated data is called speculation
stage. In the example above, this is stage 3. If no misspeculation is done,
this is stage 0. In analogy to the scheduling function I(k,T), let Σ(T)
denote the number of this stage during cycle T.
$$\Sigma : \mathbb{N}_0 \to \{0, \ldots, n-1\}$$
We now adjust our data consistency criterion as follows: we no longer
claim anything for registers R.k with k < Σ(T). For registers R.k with
k > Σ(T), we use the very same criterion as before. For the registers R.k
with k = Σ(T), we need to distinguish the registers.
Figure 5.11 Illustration of the data consistency criterion for machines with
speculation: let stage 3 be the latest stage with misspeculated data. Stages 4 and 5
contain correct data, stages 1 and 2 contain data that were calculated using
misspeculated data.
Obviously, there are some registers R.k that contain wrong data, in
particular the implementation registers that hold the misspeculated data, i.e.,
the registers R.k with R ∈ σ. But there may be more registers with wrong
data, namely those that have been calculated using the misspeculated data
as input. We denote the set of registers R.(k+1) that is calculated using
speculated data by σ(k). Note that if one uses a speculative input R ∈ σ
in the stage in which the misspeculation is detected or in a later stage, it is
no longer considered a speculative input. This is motivated as follows: if the
input value is used for subsequent calculations, we know it is correct. Otherwise,
we make a rollback and the value is not used.
For all registers R.k that are not element of σ(k-1), we maintain the
original correctness criterion. Formally, we redefine the stage correctness
predicates introduced in chapter 3 as follows: the stage correctness
predicate $P_k$ no longer contains a claim about registers that are involved in
speculation, i.e., those that are element of σ(k).
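The adjusted criterion amounts to a three-way case distinction on the register index k relative to the speculation stage Σ(T). The following sketch only restates that distinction; the function name and the returned labels are hypothetical.

```python
NO_CLAIM = "no claim"
PARTIAL_CLAIM = "claim only for registers not in sigma(k-1)"
FULL_CLAIM = "original criterion"


def claim_for_registers(k: int, sigma_t: int) -> str:
    """Which correctness claim is made for the registers R.k in a cycle
    whose speculation stage is sigma_t."""
    if k < sigma_t:
        return NO_CLAIM        # data calculated using misspeculated data
    if k == sigma_t:
        return PARTIAL_CLAIM   # registers in sigma(k-1) are excluded
    return FULL_CLAIM          # data not affected by the misspeculation
```

For the situation of figure 5.11 with Σ(T) = 3, the registers R.1 and R.2 receive no claim, R.3 a partial claim, and R.4 and later registers the original claim.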
As before, let $sP_k(T)$ denote the stage correctness predicate for the
specification registers and let $iP_k(T)$ denote the stage correctness predicate for
the implementation registers. The new stage correctness predicate $P_k$ for
the output registers of stage k holds if both $sP_k$ and $iP_k$ hold, as before:
$$P_k(T) \iff sP_k(T) \wedge iP_k(T)$$
The stage correctness predicate $sP_k(T)$ for the specification registers is
the same as before but without the registers involved in speculation: Thus,
for all specification registers R ∈ out(k) and R ∉ σ(k) the following
condition must hold:
$$R_I^T = R_S^{sI(k,T)}$$
Furthermore, we modify the claim for implementation registers. We
have to do so because we do not restore implementation registers in case
of a rollback. As described in section 5.7 (page 161), this is motivated by
saving hardware cost. Let R.(k+1) be an implementation register. The
claim depends on the full bit $\mathit{full}_{k+1}^T$. If it is active, the claim for $R_I^T$ stays
the same as before. If it is not active, we just do not claim anything for
$R_I^T$. Thus, for all implementation registers R ∈ out(k) and R ∉ σ(k) the
following condition must hold:
$$\mathit{full}_{k+1}^T \implies R_I^T.(k+1) = \begin{cases} 0 & : sI(k,T) = 0 \\ \Omega_kR(c_S^{sI(k,T)-1}) & : \text{otherwise} \end{cases}$$
If $\mathit{full}_{k+1}^T$ holds, we have sI(k,T) = sI(k+1,T) + 1 according to invariant 5.3.
Thus, sI(k,T) = 0 cannot happen if $\mathit{full}_{k+1}^T$ holds. The condition above
therefore simplifies to:
$$\mathit{full}_{k+1}^T \implies R_I^T.(k+1) = \Omega_kR(c_S^{sI(k,T)-1})$$
In case of a rollback, the full bits of the affected stages are cleared. Thus,
we no longer have to show anything for the implementation registers in
those stages until new values are stored there. However, this new induc-
tion premise is weaker than the old one. We have to verify that it is still
sufficient for showing that the inputs of the transition functions are correct.
One easily asserts this. The lemmas used to argue the correctness of
input registers have the premise that the full bit is active (e.g., lemma 3.16,
page 82, lemma 4.7, page 103).
For the registers that are calculated using speculative values, we define
a separate predicate $P'_k(T)$. The predicate holds iff the values in the
registers that are in σ(k) are correct. In case of specification registers, the
correctness criterion is as before: we use the value provided by the
specification machine. In case of implementation registers, we used to define
the correct value using a definition for correct inputs. However, the
implementation registers R.(k+1) ∈ σ(k) depend on speculative values.
We therefore need a notion of a correct speculative value. As described
above, the function $\Omega R(c_S)$ denotes the correct value of a speculative value
R ∈ σ given a configuration of the specification machine. Using that
function, we define the correctness predicate for speculative implementation
registers as before.
This also allows defining a predicate $S_{-1}(T)$, which holds iff we
speculate correctly during cycle T. The predicate is used for the speculation
done in stage 0 only. Formally, this is done using the function $\Omega R(c_S)$. We
speculate correctly iff the input generation circuit provides this value for
all registers R ∈ σ:
$$S_{-1}(T) :\iff \forall R \in \sigma : g_0R(c_I^T) = \Omega R(c_S^{sI(0,T)})$$
As described above, the guessed values are stored in implementation
registers and are propagated by adding instances of these implementation
registers. We therefore define a correctness predicate $S_k(T)$ for those
registers R with R ∈ σ and R ∈ out(k). The predicate is defined in analogy to
the predicate $iP_k$ for implementation registers. As for the implementation
registers, we only claim anything if the full bit is active:
$$S_k(T) :\iff \left( \mathit{full}_{k+1}^T \implies R_I^T.(k+1) = \Omega R(c_S^{sI(k,T)-1}) \right)$$
Examples The use of the stage correctness predicates is illustrated in
figures 5.12, 5.13, and 5.14: In all figures, we summarize four classes of
registers:
• By S.k, we summarize the non-speculative specification registers,
• by S'.k, we summarize the speculative specification registers,
• by I.k, we summarize the implementation registers that are not
speculative, i.e., they are not element of σ(k-1),
• by I'.k, we summarize the implementation registers that are
speculative, i.e., they are element of σ(k-1).
If the box of the register is drawn using stronger lines, this denotes that the
correctness of the value in the register is claimed.
In figure 5.12, we show the transition from cycle T to cycle T+1 if
instruction $I_0$ moves from stage 0 to 1 and does not misspeculate. The
speculation function Σ is zero in both cycles. In cycle T, we claim the
correctness of the specification registers only, i.e., of the registers S.k and S'.k.
Since no full bit is set, we do not claim anything for implementation
registers. As soon as $I_0$ is in stage 1, the full bit $\mathit{full}_1$ is set. Thus, we claim the
correctness of the implementation registers I.1 during cycle T+1. Since
Σ(T+1) is zero, we did not misspeculate. We therefore also claim the
correctness of the speculative implementation registers I'.1 during cycle
T+1. Since we always claim the correctness of the registers S.k, we omit
those registers in later figures.
In figure 5.13, we show the case that we misspeculate. The speculation
function Σ therefore is 1 in cycle T+1. We therefore no longer claim that
the values in S'.1 or I'.1 are correct. However, we still claim that the values
in I.1 are correct: these values do not depend on the guessed values.
Figure 5.12 Claim of correctness in case we do not misspeculate. Σ(T) points
to the speculation stage. We claim correctness for registers drawn with stronger
lines.
Figure 5.13 Claim of correctness in case we misspeculate
Figure 5.14 Claim of correctness in case we do a rollback. After a rollback, Σ is
zero.
In figure 5.14, we show the rollback case. The speculation function Σ is
2 in cycle T and we detect the misspeculation in that cycle, as indicated by
the flash symbol. Thus, the speculation function is again 0 in cycle T+1.
As before, for cycle T and stage Σ(T) = 2, we only claim the correctness
of the values in I.2. For all later stages, we claim the correctness of the
values in S'.k, and I.k/I'.k iff the stage is full. In the example, stages 3 and
4 are full; we therefore claim the correctness of the values in I.3, I'.3, I.4,
and I'.4.
Registers of the DLX without Delayed PC As an example, consider
the pipelined DLX without Delayed PC described above. We have three
specification registers, which are PC, GPR, and DM. Of these registers,
only the register PC depends on speculative inputs since we detect any
misspeculation in stage 1. Thus, we have:
$$PC \in \sigma(0)$$
Thus, we remove the claim for PC from $sP_0$ and add it to $sP'_0$ instead:
$$sP'_0(T) :\iff PC.1^T = PC_S^{sI(0,T)}$$
We have speculative values branch_taken and branch_target. The
predicate $S_{-1}(T)$ therefore is:
$$S_{-1}(T) :\iff g_0\mathit{branch\_taken}(c_I^T) = \Omega\mathit{branch\_taken}(c_S^{sI(0,T)}) \wedge g_0\mathit{branch\_target}(c_I^T) = \Omega\mathit{branch\_target}(c_S^{sI(0,T)})$$
In out(0), we have instances of the speculative values branch_taken and
branch_target. The claim for those registers is in the predicate $S_0(T)$:
$$S_0(T) :\iff \left( \mathit{full}_1^T \implies \mathit{branch\_taken}^T.1 = \Omega\mathit{branch\_taken}(c_S^{sI(0,T)-1}) \wedge \mathit{branch\_target}^T.1 = \Omega\mathit{branch\_target}(c_S^{sI(0,T)-1}) \right)$$
5.10.2 Properties of the Pipeline
In this section, we conclude basic data consistency properties. We start
with a lemma that asserts that the machine is initialized properly:
Lemma 5.10 The predicates $P_k(T)$, $P'_k(T)$, and $S_k(T)$ hold for cycle T = 0 and k ≥ 0.
PROOF One easily asserts this lemma using that sI(k,0) = 0 holds.
For the following data consistency properties, we define a shorthand for
the term “the inputs of stage k are correct”. In analogy to the stage
correctness predicates, we use two predicates: one for inputs not affected by
misspeculation, and one for inputs affected by misspeculation.
Let $I_k(T)$ denote that the inputs of stage k that are not affected by
misspeculation are correct during cycle T. One shows this using the input
correctness lemmas. These lemmas in turn depend on certain stage
correctness predicates. In order to argue the correctness of the inputs of stage
k, we need the stage correctness predicates P of the stages k-1 and later
stages. In addition to that, we can use the stage correctness predicates P'
for stage k and later ones:
$$I_k(T) :\iff (\forall l : l \ge k-1 \implies P_l(T)) \wedge (\forall l : l \ge k \implies P'_l(T))$$
In analogy to that, let $I'_k(T)$ denote that the inputs of stage k that are
affected by misspeculation are correct during cycle T. The inputs affected
by misspeculation depend on the stage. In case of stage k = 0, we have the
guessed data, i.e., we use the predicate $S_{-1}(T)$. In case of stages k > 0, we
use the predicates $S_{k-1}(T)$ and $P'_{k-1}(T)$:
$$I'_k(T) :\iff \begin{cases} S_{k-1}(T) & : k = 0 \\ S_{k-1}(T) \wedge P'_{k-1}(T) & : \text{otherwise} \end{cases}$$
Lemma 5.11 Let the non-speculative inputs of stage k be correct during cycle T and let
the output registers of stage k be not affected by a rollback. In this case,
the stage correctness predicate $P_k$ holds during cycle T+1.
$$(k = n-1 \vee \overline{\mathit{rollback}'^{\,T}_{k+1}}) \wedge I_k(T) \implies P_k(T+1)$$
Lemma 5.12 Let all inputs of stage k be correct during cycle T and let the output
registers of stage k be not affected by a rollback. In this case, the stage
correctness predicate $P'_k$ holds during cycle T+1.
$$(k = n-1 \vee \overline{\mathit{rollback}'^{\,T}_{k+1}}) \wedge I_k(T) \wedge I'_k(T) \implies P'_k(T+1)$$
One easily asserts lemmas 5.11 and 5.12 as done in the pipelined
machine without speculation.
Obviously, if one combines the lemmas 5.11 and 5.12, one gets that
correctness of all inputs implies the correctness of all outputs unless there
is a rollback.
Lemma 5.13 If the update enable signal $ue_k$ is off and if the output registers of stage k
are not affected by a rollback, all predicates that hold in cycle T also hold
in cycle T+1:
$$P_k(T) \implies P_k(T+1)$$
$$P'_k(T) \implies P'_k(T+1)$$
$$S_k(T) \implies S_k(T+1)$$
PROOF The proof is trivial and uses the fact that neither the values in the registers
nor the predicates change from cycle T to T+1.
For all machines with speculation, we assume that there is a stage in which
we detect any misspeculation at the latest. For example, in the DLX
without Delayed PC, we detect the branch misprediction in stage 1 (decode)
at the latest. We denote the number of this stage by λ.
Lemma 5.14 If the update enable of stage λ is active, and if the non-speculative inputs
of the stage are correct, we did not misspeculate, i.e., the registers R.λ
holding the propagated speculative values have correct values.
$$ue_\lambda^T \wedge I_\lambda(T) \implies S_{\lambda-1}(T)$$
PROOF Remember that $S_{\lambda-1}(T)$ is defined as follows:
$$\mathit{full}_\lambda^T \implies R_I^T.\lambda = \Omega R(c_S^{sI(\lambda-1,T)-1})$$
Since $ue_\lambda^T$ holds, $\mathit{full}_\lambda^T$ also holds. Thus, we have to show:
$$R_I^T.\lambda \stackrel{!}{=} \Omega R(c_S^{sI(\lambda-1,T)-1})$$
According to invariant 5.3, we have sI(λ,T) = sI(λ-1,T) - 1. Thus,
the claim is transformed into:
$$R_I^T.\lambda \stackrel{!}{=} \Omega R(c_S^{sI(\lambda,T)})$$
This lemma is shown easily using the fact that $ue_\lambda^T$ implies that the
rollback signal $\mathit{rollback}_\lambda^T$ cannot be active. Furthermore, $\mathit{stall}_\lambda^T$ is not active
and $\mathit{full}_\lambda^T$ is active. Thus, the signals $R_\lambda\mathit{misspec}^T$ are also not active for
speculative registers R. Thus, by definition of the misspec signal, we have:
$$f\gamma_\lambda R(c_I^T) = R_I^T.\lambda$$
Since the inputs are correct, we have:
$$f\Gamma_\lambda R(c_S^{sI(\lambda,T)}) = R_I^T.\lambda$$
This allows transforming the claim into:
$$f\Gamma_\lambda R(c_S^{sI(\lambda,T)}) \stackrel{!}{=} \Omega R(c_S^{sI(\lambda,T)})$$
For the pipelined machine with branch prediction, we have two
speculative registers R, which are branch_taken and branch_target.
The claim above for branch_taken is concluded by lemma 5.8, and the
claim for the register branch_target is concluded by lemma 5.9.
182
Section 5.10
DATA
CONSISTENCY
The following lemma asserts that the guessed data is passed correctly
from one stage to the next if the update enable signal is active. Note that
this includes the case that the guessed data is wrong.
Lemma 5.15 If the update enable signal of stage k is active and if the
non-speculative inputs of stage k are correct, the predicate $S_k(T+1)$ holds
iff the predicate $S_{k-1}(T)$ holds.
\[ ue_k^T \wedge I_k(T) \Rightarrow S_k(T+1) = S_{k-1}(T) \]
One easily asserts this lemma by expanding the predicates S.
Lemma 5.16 Let there be a rollback in stage k−1 and cycle T. If the values in
the non-speculative output registers of stage k−1 are correct, we do the rollback
correctly, i.e., the correctness predicates for the output registers of stages
l < k hold during cycle T+1.
\[ \rho(k,T) \wedge P_{k-1}(T) \Rightarrow \forall\, 0 \le l < k : P_l(T+1) \wedge P'_l(T+1) \wedge S_l(T+1) \]
PROOF One easily asserts this by expanding the predicates. For implementation
registers, we do not have to show anything, because the rollback clears the
full bits. For specification registers, we use the predicate $P_{k-1}(T)$.
Lemma 5.17 Let stage k be full during cycles T and T+1. If the update enable
signal of the stage is not active and if the output registers of stage k are not
affected by a rollback, the predicate $S_k(T)$ is equal to $S_k(T+1)$.
\[ \overline{ue_k^T} \wedge (k = n-1 \vee \overline{{rollback'}_{k+1}^{T}}) \wedge full_{k+1}^{T} \wedge full_{k+1}^{T+1} \Rightarrow S_k(T) = S_k(T+1) \]
PROOF One easily shows this by arguing as follows: Since the update enable
signal is off and there is no rollback, the values of the registers do not
change. The claims of the predicates do not change either, since the full
bit is active in both cycles.
Lemma 5.18 If a $rollback'_k$ signal is active, there is a stage j such that this
is the last stage with active rollback' signal.
\[ {rollback'}_k^T \Rightarrow \exists j \ge k : \rho(j,T) \]
PROOF One easily shows this lemma using induction. One starts with
stage k and proceeds from k to k+1 until one either reaches the end of the
pipeline or a stage without active rollback0 signal.
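The induction in the proof of lemma 5.18 can be read as a simple search. The following is a minimal sketch, not part of the verified design: it assumes that the rollback' signals of one cycle are given as a list of booleans indexed by stage, and that ρ(j, T) means "rollback' is active in stage j and either j is the final stage or rollback' is not active in stage j+1". The function name `last_rollback_stage` is hypothetical.

```python
from typing import Optional, Sequence


def last_rollback_stage(rollback: Sequence[bool], k: int) -> Optional[int]:
    """Return a stage j >= k that is the last stage with active rollback'.

    rollback[i] models the rollback' signal of stage i in one fixed cycle
    (an assumption made for this sketch). Returns None if rollback' is not
    active in stage k, because lemma 5.18 makes no claim in that case.
    """
    if not rollback[k]:
        return None
    j = k
    # Proceed from stage j to j+1 until the end of the pipeline is reached
    # or the next stage has no active rollback' signal.
    while j + 1 < len(rollback) and rollback[j + 1]:
        j += 1
    return j
```

The loop terminates because the pipeline has finitely many stages, which mirrors the induction argument of the proof.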
5.10.3 Data Consistency Invariants
We introduced the speculation stage function Σ(T ) above without giving
a definition. In analogy to sI(k;T ), we now give a recursive definition
of Σ(T ). The recursive definition is constructed as follows: During cycle
T = 0, we obviously have Σ(T ) = 0, i.e., no instruction with misspeculated
data is in the pipeline.
The definition of $\Sigma(T)$ for $T > 0$ is constructed as follows: we consider
the stage that $\Sigma(T-1)$ points to. There are three cases:
1. If the update enable signal of the stage is not active and if the stage is
not affected by a rollback, the value of the speculation stage function
must stay the same, i.e., $\Sigma(T) = \Sigma(T-1)$.
2. If the update enable signal of the stage is active, the instruction in
the stage moves into the next stage. In case of stage $\Sigma(T-1) = 0$,
we need to distinguish whether we misspeculated or not. We misspeculated
iff $S_{-1}(T-1)$ does not hold. If we misspeculated, $\Sigma(T)$
must be one. If not so, $\Sigma(T)$ remains zero.
In case of stage $\Sigma(T-1) > 0$, we already know that we misspeculated,
i.e., $\Sigma(T) = \Sigma(T-1)+1$.
In addition to that, we define an upper bound for $\Sigma(T)$ that is λ. In
case $\Sigma(T-1)$ is greater than or equal to λ, we define $\Sigma(T)$ to be zero.
3. In case of a rollback, as indicated by ${rollback'}_{\Sigma(T-1)}^{T-1}$, the
speculation stage function becomes zero.
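Thus, we define $\Sigma(T)$ for $T > 0$ as follows: If $ue_{\Sigma(T-1)}^{T-1}$ holds, $\Sigma(T)$ is:
\[ \Sigma(T) := \begin{cases}
1 & : \Sigma(T-1) = 0 \wedge \overline{S_{-1}(T-1)} \\
\Sigma(T-1)+1 & : 0 < \Sigma(T-1) < \lambda \\
0 & : \text{otherwise}
\end{cases} \]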
If $ue_{\Sigma(T-1)}^{T-1}$ does not hold, $\Sigma(T)$ is:
\[ \Sigma(T) := \begin{cases}
0 & : {rollback'}_{\Sigma(T-1)}^{T-1} \\
\Sigma(T-1) & : \text{otherwise}
\end{cases} \]
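The recursive definition of Σ(T) can be read as a one-step transition function. The following is a minimal sketch under stated assumptions: the update enable signal, the rollback' signal, and the predicate S_{-1} of the previous cycle are passed in as plain booleans, and the function name `sigma_next` and its parameter names are hypothetical.

```python
def sigma_next(sigma_prev: int, ue: bool, rollback: bool,
               s_minus1: bool, lam: int) -> int:
    """Compute Sigma(T) from Sigma(T-1).

    sigma_prev : Sigma(T-1)
    ue         : update enable of stage Sigma(T-1) during cycle T-1
    rollback   : rollback' signal of stage Sigma(T-1) during cycle T-1
    s_minus1   : predicate S_{-1}(T-1); True iff the guessed values are correct
    lam        : number of the latest stage that detects misspeculation
    """
    if ue:
        # The instruction moves on to the next stage.
        if sigma_prev == 0 and not s_minus1:
            return 1                      # a new misspeculation enters stage 0
        if 0 < sigma_prev < lam:
            return sigma_prev + 1         # misspeculated data moves one stage
        return 0                          # no misspeculation, or bound reached
    if rollback:
        return 0                          # the rollback clears the speculation
    return sigma_prev                     # nothing moves, value is unchanged
```

Under these assumptions the result never exceeds `lam` as long as `sigma_prev` does not, which is the bound stated as lemma 5.19.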
Using this recursive definition for Σ(T ), we conclude several properties.
Lemma 5.19 The latest stage we detect misspeculation in, i.e., stage λ, is an
upper bound for the speculation stage function.
\[ \Sigma(T) \le \lambda \]
PROOF One easily shows this using the definition of the speculation stage
function above.
Lemma 5.20 The stage $\Sigma(T)$ is full during cycle T, i.e., there is an
instruction in the last stage containing misspeculated data.
\[ full_{\Sigma(T)}^{T} \]
PROOF The claim is shown using induction on T. For T = 0, one easily asserts
the claim using that $\Sigma(T)$ is zero and that $full_0$ is always active.
For T+1, we show the claim by a case split on the value of $ue_{\Sigma(T)}^{T}$.
• If $ue_{\Sigma(T)}^{T}$ holds, we have to show either $full_{\Sigma(T)+1}^{T+1}$ or
$full_0^{T+1}$. The latter claim holds because $full_0$ is always active. We show
$full_{\Sigma(T)+1}^{T+1}$ by applying lemma 4.1:
\[ full_{\Sigma(T)+1}^{T+1} = (ue_{\Sigma(T)}^{T} \vee stall_{\Sigma(T)+1}^{T}) \wedge \overline{{rollback'}_{\Sigma(T)+1}^{T}} \]
Since $ue_{\Sigma(T)}^{T}$ holds, the rollback' signal cannot be active and we get
that $full_{\Sigma(T)+1}^{T+1}$ holds.
• If $ue_{\Sigma(T)}^{T}$ does not hold and we have a rollback, we have to show
$full_0^{T+1}$, which one easily asserts as above.
If $ue_{\Sigma(T)}^{T}$ does not hold and we do not have a rollback, we have to
show that $full_{\Sigma(T)}^{T+1}$ holds. This follows directly from lemma 4.4 for
cycle T and stage $\Sigma(T)$. QED
The following lemma is easily concluded from lemma 5.13 and lemma
5.15:
Lemma 5.21 If we have correct inputs (both speculative and non-speculative)
during cycle T, and the values in the speculative output registers of stage k are
correct during cycle T, the values are also correct during cycle T+1.
\[ (k = n-1 \vee \overline{{rollback'}_{k+1}^{T}}) \wedge I'_k(T) \wedge I_k(T) \wedge S_k(T) \Rightarrow S_k(T+1) \]
We now claim two speculation invariants. We will later on show these
invariants using induction.
Invariant 5.4 If $\Sigma(T)$ is not zero, at least one speculative register of the
output registers of stage $\Sigma(T)-1$ has wrong values:
\[ \Sigma(T) \ge 1 \Rightarrow \overline{S^{T}_{\Sigma(T)-1}} \]
We will later on use this invariant in order to claim that we actually are
able to detect misspeculation. The following invariant is the data consis-
tency claim as introduced above:
Invariant 5.5 The data consistency predicates of all registers that are outputs
of stages $k \ge \Sigma(T)$ hold during cycle T. In addition to that, the predicate
for the non-speculative registers that are output registers of stage $\Sigma(T)-1$ holds.
\[ k \ge \Sigma(T)-1 \Rightarrow P_k(T) \]
\[ k \ge \Sigma(T) \Rightarrow P'_k(T) \]
\[ k \ge \Sigma(T) \Rightarrow S_k(T) \]
One easily asserts the following two claims by expanding the predicates:
Lemma 5.22 Let invariant 5.5 hold for cycle T. For all stages $k \ge \Sigma(T)$, the
non-speculative inputs of stage k are correct.
\[ k \ge \Sigma(T) \Rightarrow I_k(T) \]
Lemma 5.23 Let invariant 5.5 hold for cycle T. For stages $k > \Sigma(T)$, the
speculative inputs of stage k are correct.
\[ k > \Sigma(T) \Rightarrow I'_k(T) \]
The following lemma will be used as the induction step for showing
invariant 5.4:
Lemma 5.24 Let both speculation invariants hold during cycle T. This implies
that invariant 5.4 holds during cycle T+1.
PROOF We do a case split on the values of the update enable signal
$ue_{\Sigma(T)}^{T}$ and ${rollback'}_{\Sigma(T)}^{T}$.
If $ue_{\Sigma(T)}^{T}$ holds, there are three cases for $\Sigma(T)$:
1. Let $\Sigma(T)$ be zero. If $S_{-1}(T)$ holds, $\Sigma(T+1)$ is also zero and we
have nothing to show.
Thus, let $S_{-1}(T)$ not hold. In this case, we have $\Sigma(T+1) = 1$ and
we therefore have to disprove $S_0(T+1)$. This is easily done using
lemma 5.15 and lemma 5.22.
2. Let $\Sigma(T) \ge \lambda$ hold. In this case, $\Sigma(T+1)$ is zero and we have
nothing to show.
3. Let $0 < \Sigma(T) < \lambda$ hold. In this case, we have $\Sigma(T+1) = \Sigma(T)+1$.
We have to disprove $S_{\Sigma(T)+1}(T+1)$. As before, this is done using
lemma 5.15 and lemma 5.22.
Let $ue_{\Sigma(T)}^{T}$ not hold and let ${rollback'}_{\Sigma(T)}^{T}$ hold. In this case,
$\Sigma(T+1)$ is zero and we have nothing to show.
Let both $ue_{\Sigma(T)}^{T}$ and ${rollback'}_{\Sigma(T)}^{T}$ not hold. In this case,
$\Sigma(T+1)$ is equal to $\Sigma(T)$. We have to disprove $S_{\Sigma(T)-1}(T+1)$.
According to lemma 5.17 for stage $\Sigma(T)-1$, we have:
\[ S_{\Sigma(T)-1}(T+1) = S_{\Sigma(T)-1}(T) \]
The right hand side does not hold because of the induction premise.
However, we have to prove the premises of lemma 5.17: We disprove
$ue_{\Sigma(T)-1}^{T}$ as follows: according to lemma 5.20, stage $\Sigma(T)$ is full
during cycle T. Since $ue_{\Sigma(T)}^{T}$ is not active, we would overwrite the
contents of stage $\Sigma(T)$. This is not possible, as asserted by lemma 4.3.
In addition to that, we have to show that both $full_{\Sigma(T)}^{T}$ and
$full_{\Sigma(T)}^{T+1}$ hold, which is easily done using lemma 5.20. QED
The following lemma will be used as induction step for the case that $ue_k^T$
does not hold.
Lemma 5.25 Let the speculation invariants hold in cycle T. Let the update
enable signal $ue_k^T$ be not active. This implies that $S_k(T+1)$ and $P'_k(T+1)$
hold if $k \ge \Sigma(T)$ and that $P_k(T+1)$ holds if $k \ge \Sigma(T)-1$:
\[ k \ge \Sigma(T)-1 \Rightarrow P_k(T+1) \]
\[ k \ge \Sigma(T) \Rightarrow P'_k(T+1) \]
\[ k \ge \Sigma(T) \Rightarrow S_k(T+1) \]
Note that the claim of this lemma is not identical with speculation invariant
5.5. On the left hand side, we have $\Sigma(T)$ and on the right hand side, we
have the predicates for cycle T+1.
PROOF If the output registers of stage k are affected by a rollback, we conclude
that there is a last stage $l \ge k+1$ with active rollback signal (lemma 5.18).
We then use lemma 5.16 in order to conclude the claim.
If the output registers of stage k are not affected by a rollback, we use
lemma 5.12 in order to conclude the claim. QED
Lemma 5.26 Let the speculation invariants hold in cycle T. This implies that
$S_k(T+1)$ and $P'_k(T+1)$ hold if $k \ge \Sigma(T)+1$ and that $P_k(T+1)$ holds
if $k \ge \Sigma(T)$:
\[ k \ge \Sigma(T) \Rightarrow P_k(T+1) \]
\[ k \ge \Sigma(T)+1 \Rightarrow P'_k(T+1) \]
\[ k \ge \Sigma(T)+1 \Rightarrow S_k(T+1) \]
PROOF If $ue_k^T$ does not hold, we use lemma 5.25 in order to conclude the claim.
Thus, let $ue_k^T$ hold. This implies that the output registers of stage k are
not affected by a rollback. We conclude $P_k(T+1)$ using lemma 5.22
(non-speculative inputs correct) and lemma 5.11. We conclude $P'_k(T+1)$
using lemma 5.22 and lemma 5.23 (speculative inputs correct), and lemma
5.12. We conclude $S_k(T+1)$ using lemma 5.15. QED
The following lemmas are the induction step for showing invariant 5.5.
For sake of simplicity, we case-split using the values of the update enable
and rollback signals. Lemma 5.27 shows the claim if $ue_{\Sigma(T)}^{T}$ is active,
lemma 5.28 shows the claim if ${rollback'}_{\Sigma(T)}^{T}$ is active, and lemma 5.29
shows the claim if neither signal is active.
Lemma 5.27 Let both speculation invariants hold during cycle T and let the
update enable signal $ue_{\Sigma(T)}^{T}$ be active. This implies that invariant 5.5
holds during cycle T+1.
PROOF We do a case split on $\Sigma(T)$ and on $\Sigma(T+1)$.
1. If both $\Sigma(T)$ and $\Sigma(T+1)$ are zero, we conclude the claim as
follows: we conclude $P_k(T+1)$ for $k \ge 0$ using lemma 5.22 (inputs
correct) and lemma 5.11.
We conclude $P'_k(T+1)$ for $k > 0$ using lemma 5.23, 5.22 (inputs
correct) and lemma 5.12. For k = 0, we conclude $S_{-1}$ using the fact
that $\Sigma(T+1)$ is zero. We can then apply lemma 5.12.
We conclude $S_k(T+1)$ for $k \ge 0$ using lemma 5.21. The premises
of this lemma are shown as before.
2. Let $\Sigma(T)$ be λ and $\Sigma(T+1)$ be zero. In this case, we conclude the
claim using lemma 5.22 (inputs correct) and lemma 5.14.
3. Let $\Sigma(T) > \lambda$ hold. This is disproved using lemma 5.19.
4. The case $0 < \Sigma(T) < \lambda$ and $\Sigma(T+1) = 0$ is a contradiction to the
definition of Σ.
5. Let $\Sigma(T+1)$ be not zero. In this case, $\Sigma(T+1) = \Sigma(T)+1$ must hold
because of the active update enable signal. Because of $\Sigma(T+1) = \Sigma(T)+1$,
we can conclude the claim using lemma 5.26.
This concludes the claim. QED
Lemma 5.28 Let both speculation invariants hold during cycle T and let the
update enable signal $ue_{\Sigma(T)}^{T}$ be not active and let the rollback signal
${rollback'}_{\Sigma(T)}^{T}$ be active. This implies that invariant 5.5 holds during
cycle T+1.
PROOF Since ${rollback'}_{\Sigma(T)}^{T}$ holds, we have $\Sigma(T+1) = 0$. Thus, we
have to show all three predicates for all $k \ge 0$.
Since we have an active rollback signal, there is a stage $j \ge \Sigma(T)$ that
signaled the rollback, as asserted by lemma 5.18. There are three cases
regarding the value of j:
1. Let k = j hold. In this case, we have the stage in which the rollback
is detected. The output registers of this stage are not updated, i.e.,
$ue_k^T$ does not hold. We therefore are able to apply lemma 5.13, which
shows the claim.
2. Let k > j hold. In this case, we conclude the claim using lemma
5.26. This is feasible because of $j \ge \Sigma(T)$.
3. Let k < j hold. In this case, the output registers of stage k are
affected by the rollback and the claim follows from lemma 5.16. QED
Lemma 5.29 Let both speculation invariants hold during cycle T and let the
update enable signal $ue_{\Sigma(T)}^{T}$ and the rollback signal
${rollback'}_{\Sigma(T)}^{T}$ be not active. This implies that invariant 5.5 holds
during cycle T+1.
PROOF Since the update enable signal $ue_{\Sigma(T)}^{T}$ and the rollback signal
${rollback'}_{\Sigma(T)}^{T}$ are not active, we have $\Sigma(T+1) = \Sigma(T)$.
If $ue_k^T$ does not hold, we conclude the claim using lemma 5.25.
If $ue_k^T$ holds, we obviously have $k \ne \Sigma(T)$. For $k > \Sigma(T)$, we conclude
the claim using lemma 5.26. For $k < \Sigma(T)-1$, there is nothing to show.
For $k = \Sigma(T)-1$, we argue that $ue_{\Sigma(T)-1}^{T}$ cannot hold. According to
lemma 5.20, $full_{\Sigma(T)}^{T}$ holds. This is a contradiction to lemma 5.2. QED
Lemma 5.30 Both speculation invariants hold.
Note that speculation invariant 5.2 implies the data consistency of the
specification registers.
PROOF We show this by induction on T. For cycle T = 0, one easily concludes
the claim using that $\Sigma(0) = 0$ and using lemma 5.10. The claim for T+1
is concluded from the claim for cycle T using the lemmas 5.24, 5.27, 5.28,
and 5.29.
5.11 Liveness
5.11.1 Liveness Proof Strategy
As for the pipelined machine without speculation, we desire to prove that
the pipelined machine with speculation is alive. We maintain the very same
liveness criterion as we used for the prepared sequential and pipelined ma-
chine without speculation.
Unfortunately, we cannot repeat the liveness arguments of the pipelined
machine without speculation. This arises from the fact that the machine
with speculation restarts instructions in case of misspeculation. We will
therefore have to argue that this does not cause an infinite loop of rollbacks.
Informally, we argue as follows: in case there is no rollback, the exe-
cution proceeds as in the machine without speculation. In case there is
a rollback, we argue that we will not misspeculate on the same instruc-
tion twice. However, this only holds for rollbacks in the speculation stage
(lemma 5.16). For rollbacks in earlier stages, we cannot make any claim.
We therefore only consider the latest stage that is full. If there is a
rollback in the latest full stage, we can claim that this stage must be the
speculation stage Σ.
Formally, we define a function M(T) that maps a cycle T to the number
of the latest full stage:
\[ M(T) := \max \{ k \mid full_k^T \} \]
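As an illustration of this definition, the following minimal sketch computes M(T) from the full bits of a single cycle. It assumes the full bits are given as a sequence of booleans indexed by stage; the names `latest_full_stage` and `below_empty` are chosen for this sketch, the second one modeled on the predicate of the same name used later in this section.

```python
from typing import Sequence


def below_empty(full: Sequence[bool], k: int) -> bool:
    """True iff all stages with a number greater than k are empty."""
    return not any(full[k + 1:])


def latest_full_stage(full: Sequence[bool]) -> int:
    """Return M(T) = max{k | full[k]} for the full bits of one cycle."""
    if not full or not full[0]:
        # Stage 0 is always full in the machine, so the maximum exists.
        raise ValueError("full[0] is expected to be active")
    return max(k for k, bit in enumerate(full) if bit)
```

For the situation of cycle T1 in the example of this section, the full bits `[True, True, True, False, False]` give M(T1) = 2.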
In order to show liveness, we have to show that for all instructions i and
stages k there is a cycle T such that sI(k;T ) = i holds. We consider the
instruction in stage M(T ). Let this be instruction i. We will show that
this instruction will eventually arrive in the last stage using the arguments
above. We will then conclude that instruction i must have been in all stages
below at least once, which satisfies our claim.
After that, we argue that the instruction in the last stage will eventually
leave the pipeline. Then there must be a stage such that instruction
i+1 is in the last full stage of the pipeline. This is the first stage in the
worst case. We can now repeat the arguments made for instruction i for
instruction i+1 and so on.
This proof strategy is illustrated in figure 5.15: in cycle T1, we have
instruction Ii in stage k = 2. This is also the latest full stage, i.e., M(T1) = 2.
Figure 5.15 Illustration of the liveness proof strategy for machines with
speculation. M(T) points to the latest full stage. (The figure shows the stages
k = 0 to k = 4 with registers R.1 to R.5 during cycles T1, T2, and T3, the
markers M(T1), M(T2), and M(T3), and the instructions I_i, I_{i+1}, and I_{i+2}.)
This instruction will eventually arrive in the last stage. Let this be true in
cycle T2. The instruction will eventually leave the pipeline. Let this be true
in cycle T3. Then there is a stage such that instruction Ii+1 is in the last full
stage. In the example, this is stage k = 3.
We will now formalize this proof.
5.11.2 Properties of M(T)
We conclude a set of trivial lemmas from the definition of M(T):
Lemma 5.31 This maximum exists for all T.
PROOF One easily concludes this using that $full_0^T$ is active for all T by
definition of the signal.
Lemma 5.32 For T = 0, M(T) is zero:
\[ M(0) = 0 \]
One easily asserts this by the definition of the initial values of the full
bits.
Lemma 5.33 Stage M(T) is full during cycle T, which one concludes by definition
of max:
\[ full_{M(T)}^{T} \]
Lemma 5.34 All stages below stage M(T) are not full.
\[ k > M(T) \Rightarrow \overline{full_k^T} \]
Lemma 5.35 The predicate $below\_empty_k(T)$ (page 138) with k = M(T) holds
for all cycles T.
\[ \forall T : below\_empty_{M(T)}(T) \]
One easily concludes this by using lemma 5.34 and the definition of
below_empty.
Lemma 5.36 A stage is the latest full stage iff the stage is full and all stages
below are empty.
\[ M(T) = k \iff full_k^T \wedge below\_empty_k(T) \]
Lemma 5.37 In case ${rollback'}_{M(T)}^{T}$ holds, M(T+1) is zero. If $ue_{M(T)}$ is
active and M(T) is the last stage, we do not claim anything. If $ue_{M(T)}$ is
active and M(T) is not the last stage, we claim that M(T+1) is M(T)+1. In any
other case, we claim that M(T+1) is equal to M(T).
\[ \overline{ue_{M(T)}^{T} \wedge M(T) = n-1} \Rightarrow M(T+1) = \begin{cases}
0 & : {rollback'}_{M(T)}^{T} \\
M(T)+1 & : ue_{M(T)}^{T} \\
M(T) & : \text{otherwise}
\end{cases} \]
PROOF We do a case split on the values of ${rollback'}_{M(T)}^{T}$ and $ue_{M(T)}^{T}$.
1. Let $ue_{M(T)}^{T}$ hold. Thus, we only have to consider $M(T) \ne n-1$. In
this case, we claim that
\[ M(T+1) \stackrel{!}{=} M(T)+1 \]
holds.
We instantiate lemma 5.36 with cycle T+1 and stage M(T)+1.
This is:
\[ M(T+1) = M(T)+1 \iff full_{M(T)+1}^{T+1} \wedge below\_empty_{M(T)+1}(T+1) \]
Thus, the claim holds iff the right hand side of the equivalence above
holds. We show $full_{M(T)+1}^{T+1}$ using lemma 4.1 (full bit transition
function):
\[ full_{M(T)+1}^{T+1} = (ue_{M(T)}^{T} \vee stall_{M(T)+1}^{T}) \wedge \overline{{rollback'}_{M(T)+1}^{T}} \]
As $ue_{M(T)}^{T}$ holds, this simplifies to:
\[ full_{M(T)+1}^{T+1} = \overline{{rollback'}_{M(T)+1}^{T}} \]
We conclude that this rollback signal cannot be active since $ue_{M(T)}^{T}$
is active.
It is left to show that $below\_empty_{M(T)+1}(T+1)$ holds, i.e., that all
stages below stage M(T)+1 are empty during cycle T+1:
\[ \forall j \mid j > M(T)+1 : \overline{full_j^{T+1}} \]
We apply lemma 4.1, which replaces $full_j^{T+1}$:
\[ \forall j \mid j > M(T)+1 : \overline{(ue_{j-1}^{T} \vee stall_j^{T}) \wedge \overline{{rollback'}_j^{T}}} \]
This simplifies to:
\[ \forall j \mid j > M(T)+1 : (\overline{ue_{j-1}^{T}} \wedge \overline{stall_j^{T}}) \vee {rollback'}_j^{T} \]
We disprove $ue_{j-1}^{T}$ using that stage j−1 is not full during cycle T.
We disprove $stall_j^{T}$ using that stage j is not full during cycle T. This
concludes the claim.
2. Let ${rollback'}_{M(T)}^{T}$ hold. In this case, we claim that
\[ M(T+1) \stackrel{!}{=} 0 \]
holds, i.e., we have to show that stage 0 is full and all stages below
are empty. One easily asserts this by using the fact that the rollback
clears all full bits $full_j^{T+1}$ with $0 < j \le M(T)$ and that $full_0^{T+1}$
holds by definition.
3. Let both ${rollback'}_{M(T)}^{T}$ and $ue_{M(T)}^{T}$ not hold. In this case, we
claim that
\[ M(T+1) \stackrel{!}{=} M(T) \]
holds, i.e., that the number of the last full stage does not change
from cycle T to cycle T+1. We use lemma 5.36 with cycle T+1
and stage M(T). This is:
\[ M(T+1) = M(T) \iff full_{M(T)}^{T+1} \wedge below\_empty_{M(T)}(T+1) \]
Thus, the claim holds iff the right hand side of this equivalence
holds. For M(T) = 0, this claim holds by definition of the full signal.
For M(T) > 0, we show $full_{M(T)}^{T+1}$ using lemma 5.3 (full bits do not
get lost) for stage M(T) and cycle T:
\[ full_{M(T)}^{T} \wedge \overline{ue_{M(T)}^{T}} \wedge \overline{{rollback'}_{M(T)}^{T}} \Rightarrow full_{M(T)}^{T+1} \]
One easily concludes that $full_{M(T)}^{T}$ holds by lemma 5.33.
It is left to show that $below\_empty_{M(T)}(T+1)$ holds, i.e., that all
stages below stage M(T) are empty during cycle T+1:
\[ \forall j \mid j > M(T) : \overline{full_j^{T+1}} \]
One asserts this using the transition function as above. QED
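The case distinction of lemma 5.37 can be summarized as a small transition function. This is only a sketch under stated assumptions: the signals of stage M(T) during cycle T are passed in as booleans, the update enable and rollback' signals are assumed never to be active at the same time (as argued in the proof), and the name `m_next` is hypothetical. Where the lemma claims nothing, the sketch returns `None`.

```python
from typing import Optional


def m_next(m: int, ue: bool, rollback: bool, n: int) -> Optional[int]:
    """Return M(T+1) as claimed by lemma 5.37, or None if nothing is claimed.

    m        : M(T), the number of the latest full stage
    ue       : update enable of stage M(T) during cycle T
    rollback : rollback' signal of stage M(T) during cycle T
    n        : number of pipeline stages
    """
    if ue and m == n - 1:
        return None        # the instruction leaves the pipeline, no claim
    if rollback:
        return 0           # the rollback empties all stages except stage 0
    if ue:
        return m + 1       # the instruction moves into the next stage
    return m               # neither signal is active, nothing moves
```

For a pipeline with n = 5 stages, `m_next(2, True, False, 5)` yields 3, which corresponds to case 1 of the proof.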
5.11.3 Rollback Properties
Lemma 5.38 The last stage with active rollback' signal is full and not stalled.
Remember that we used ρ(k, T) as a shorthand for the fact that stage k is the
last stage with active rollback signal.
\[ \rho(k,T) \Rightarrow full_k^T \wedge \overline{stall_k^T} \]
PROOF One easily concludes this from the definition of the rollback
signals. The misspec signals are only active if the stage is full and not
stalled. Since we have the last full stage, there cannot be an instruction
below that causes a rollback.
Lemma 5.39 If we do a rollback in stage M(T) during cycle T, this is the last
stage with active rollback' signal.
\[ {rollback'}_{M(T)}^{T} \Rightarrow \rho(M(T),T) \]
One easily concludes this using the fact that all stages below stage M(T)
are empty.
We have to argue that we only do a rollback in case of a misspeculation.
Furthermore, we have to argue that the correct value is saved in case of a
misspeculation. We do this using the following two lemmas:
Lemma 5.40 Let T > 0 be a cycle and let stage k be the last stage with active
rollback' signal during cycle T−1. This implies that the values of the scheduling
function for stages $l \le k$ during cycle T match the number of the instruction
in stage k during cycle T−1.
\[ T > 0 \wedge \rho(k,T-1) \wedge l \le k \Rightarrow sI(l,T) = sI(k,T-1) \]
PROOF We prove this claim using induction on l. The induction starts with
l = k and proceeds from l to l−1.
For l = k, we conclude the claim by expanding the definition of $sI(k,T)$.
For l−1, we have the following claim:
\[ sI(l-1,T) \stackrel{!}{=} sI(k,T-1) \]
According to the induction premise, we have:
\[ sI(l,T) = sI(k,T-1) \]
This allows transforming the claim into:
\[ sI(l-1,T) \stackrel{!}{=} sI(l,T) \]
One easily asserts that ${rollback'}_l^{T-1}$ holds using the definition of the
signal. After that, one expands the definition of $sI(l-1,T)$ on the left
hand side. This concludes the claim. QED
The following lemma argues about the correctness of the rollback mech-
anism. Given correct non-speculative inputs, we claim that we only roll-
back in case of misspeculation. Furthermore, we claim that we correctly
restore the registers destroyed by the misspeculation.
Lemma 5.41 If stage k is the last stage with active rollback' signal during
cycle T and if the non-speculative inputs of this stage are correct, we have two
claims: a) we misspeculated, and b) the correct values are in the speculative
registers in cycle T+1.
\[ \rho(k,T) \wedge I_k(T) \Rightarrow \overline{S_{k-1}(T)} \wedge S_{-1}(T+1) \]
PROOF We show that we misspeculated as follows: because of the rollback in
stage k, at least one $misspec_k$ signal must be active. Let $R_k misspec$ be that
signal. Thus, we have:
\[ f^\gamma_k R(c_I^T) \ne R.k^T \]
Since $f_k R$ does not depend on inputs that are speculative by definition,
we can argue that $f_k R$ gets correct inputs. For example, in the DLX with
branch prediction, these functions use IR and GPRa as inputs. These registers
are not calculated using the guessed values. Thus, we have:
\[ f^\Gamma_k R(c_S^{sI(k,T)}) \ne R.k^T \]
Using the correctness of $f_k R$ (lemmas 5.8 and 5.9 for the DLX with
branch prediction), we get that the correct value of R is different from the
value in R.k during cycle T:
\[ \Omega R(c_S^{sI(k,T)}) \ne R.k^T \]
Since we have a rollback, we can conclude that stage k is full using
lemma 5.38. This allows applying invariant 3.3, which transforms this
into:
\[ \Omega R(c_S^{sI(k-1,T)-1}) \ne R.k^T \]
Because $full_k^T$ holds, this implies that $S_{k-1}(T)$ does not hold (compare
the definition of S as given in section 5.10.1). This concludes the first
claim.
We show that we store the correct values in the speculative registers as
follows: By definition of $S_{-1}(T+1)$, we have to show for the speculative
values R:
\[ g_0 R(c_I^{T+1}) \stackrel{!}{=} \Omega R(c_S^{sI(0,T+1)}) \]
One easily shows that in case of a rollback $g_0 R$ returns the values in R:
\[ c_I^{T+1}.R \stackrel{!}{=} \Omega R(c_S^{sI(0,T+1)}) \]
These registers hold the value provided by $f_k R$ in case of a rollback by
definition. Thus, the claim is transformed into:
\[ f^\gamma_k R(c_I^T) \stackrel{!}{=} \Omega R(c_S^{sI(0,T+1)}) \]
As above, we argue that the inputs of $f_k R$ are correct, which transforms
the claim into:
\[ f^\Gamma_k R(c_S^{sI(k,T)}) \stackrel{!}{=} \Omega R(c_S^{sI(0,T+1)}) \]
As above, we apply the lemma that shows the correctness of $f_k R$, which
transforms the claim into:
\[ \Omega R(c_S^{sI(k,T)}) \stackrel{!}{=} \Omega R(c_S^{sI(0,T+1)}) \]
It is left to show that $sI(k,T)$ is equal to $sI(0,T+1)$. This is easily done
using lemma 5.40. QED
Consider the following situation: Let us have a rollback in cycle T . Us-
ing the lemma above, we can conclude that we have the correct data in
the registers $c_I.R$ during cycle T+1. Now let stage 0 be stalled during
cycle T + 1 for any reason. We now have to argue that the correct data is
preserved for subsequent cycles until another rollback happens or the up-
date enable signal gets activated. If not so, we could get an infinite loop of
rollbacks if we “forget” the correct data because of stalls.
Lemma 5.42 Given that both $ue_0^T$ and ${rollback'}_0^T$ are not active and we
have no misspeculation, we also have no misspeculation in cycle T+1.
\[ \overline{ue_0^T} \wedge \overline{{rollback'}_0^T} \wedge S_{-1}^{T} \Rightarrow S_{-1}^{T+1} \]
PROOF By definition of $S_{-1}$, the claim is:
\[ g_0 R(c_I^T) = \Omega R(c_S^{sI(0,T)}) \stackrel{!}{\Rightarrow} g_0 R(c_I^{T+1}) = \Omega R(c_S^{sI(0,T+1)}) \]
One easily concludes this claim as follows: Using invariant 5.1, one
argues that $sI(0,T) = sI(0,T+1)$ holds. This transforms the claim into:
\[ g_0 R(c_I^T) \stackrel{!}{=} g_0 R(c_I^{T+1}) \]
One easily asserts this using the fact that the registers $g_0 R$ depends on
do not change from cycle T to T+1 because of the disabled rollback and
update enable signals. QED
We now define a predicate Mc(T ) that holds if we have a guarantee that
the instruction in stage M(T) will not rollback. We argue that once an
instruction causes a rollback, we have a guarantee that it will not do so
a second time. Thus, we will later on prove that Mc(T ) implies that the
rollback signal of stage M(T) is not active during cycle T .
We provide a recursive definition for Mc(T ) in analogy to the scheduling
function sI(k;T ). In cycle T = 0, we have no guarantee that instruction I0
will not cause a rollback. Thus, we define Mc(0) to be false.
For T > 0, we define $M_c(T)$ using the rollback and update enable signals.
If ${rollback'}_{M(T-1)}^{T-1}$ holds, we have a rollback and we argue that the
instruction in stage M(T−1) during cycle T−1 will not rollback a second
time. Because of the rollback, that instruction is in stage 0 during cycle
T. Because the rollback happened in the latest full stage, all stages later
than stage 0 are empty during cycle T. Thus, the instruction that caused
the rollback is still in the latest full stage. Thus, we define $M_c(T)$ to be
true for this case.
\[ {rollback'}_{M(T-1)}^{T-1} \Rightarrow M_c(T) = 1 \]
In case the update enable of stage M(T−1) is active, the instruction
proceeds into the next stage. We claim that the guarantee is maintained,
i.e., that $M_c(T) = M_c(T-1)$ holds. In case there is no next stage, i.e.,
in case of stage M(T−1) = n−1, the instruction we have a guarantee
for leaves the pipeline. In this case, we no longer have a guarantee and
therefore define $M_c(T)$ to be false.
\[ ue_{M(T-1)}^{T-1} \wedge M(T-1) \ne n-1 \Rightarrow M_c(T) = M_c(T-1) \]
\[ ue_{M(T-1)}^{T-1} \wedge M(T-1) = n-1 \Rightarrow M_c(T) = 0 \]
In case neither rollback nor update enable is active, we claim that $M_c(T)$
is $M_c(T-1)$. In order to summarize, the complete definition of $M_c(T)$ is:
\[ M_c(T) = \begin{cases}
0 & : T = 0 \\
1 & : {rollback'}_{M(T-1)}^{T-1} \\
0 & : ue_{M(T-1)}^{T-1} \wedge M(T-1) = n-1 \\
M_c(T-1) & : \text{otherwise}
\end{cases} \]
Lemma 5.43 Let T and T' be cycles with $T' \ge T$. Given that both the update
enable and rollback' signals of stage M(T) are not active during cycles
$T \le T'' < T'$, M(T') is equal to M(T) and $M_c(T')$ is equal to $M_c(T)$.
\[ (\forall\, T \le T'' < T' : \overline{ue_{M(T)}^{T''}} \wedge \overline{{rollback'}_{M(T)}^{T''}}) \Rightarrow M(T) = M(T') \wedge M_c(T) = M_c(T') \]
PROOF We show this claim using induction on T'. For T' = 0, we have T = T'
and the claim obviously holds. For T'+1, we easily conclude the claim as
follows:
1. We conclude M(T'+1) = M(T') by using lemma 5.37 and the fact
that the update enable and rollback' signals are not active. We then
use the induction premise in order to conclude M(T') = M(T).
2. We conclude $M_c(T'+1) = M_c(T')$ by expanding the definition of
$M_c(T'+1)$ and the fact that the update enable and rollback' signals
are not active. We then use the induction premise in order to conclude
$M_c(T') = M_c(T)$. QED
Lemma 5.44 For all stages k, let the external stall signals of stage k be finite
true and stay until $ue_k$. For all cycles T, the stall signal of stage M(T)
eventually gets deactivated after cycle T.
\[ \exists_T\, \overline{stall_{M(T)}} \]
PROOF Remember that the stall signal is calculated using internal and external
stall signals:
\[ stall_k^T = full_k^T \wedge (ext_k^T \vee int_k^T) \]
The internal stall signals handle data hazards and pipeline stalls, the external
stall signals are used for caches, for example. According to lemma
4.32 (page 140), the disjunction of the external stall signals $ext_k$ is finite
true. Thus, there is a cycle $T' \ge T$ such that $ext_{M(T)}^{T'}$ is not active.
Let T' be the earliest such cycle.
Observe that both lemma 4.33 (page 141) and lemma 4.31 (page 139)
hold also in the pipelined machine with speculation (the proof uses the
same arguments).
According to lemma 5.35, we have $below\_empty(M(T),T)$. According
to lemma 4.31, the stages below stage M(T) stay empty at least until
$ue_{M(T)}$ becomes active. One easily concludes that this does not happen
before cycle T' because before cycle T', the external stall signal is active.
Thus, we have $below\_empty(M(T),T')$.
Lemma 4.33 states that empty stages do not cause internal stall signals.
According to this lemma, the internal stall signals of stage M(T) cannot be
active during cycle T' because the stages below stage M(T) are empty.
Thus, both $ext_k$ and $int_k$ are not active during cycle T'. This implies that
the stall signal is not active and the claim holds. QED
Informally, consider an instruction in a stage. Assuming the stall signal
of the stage will eventually be deactivated, the instruction in the stage either
moves into the next stage or gets evicted because of a rollback.
Lemma 5.45 Formally, let stage k be full during cycle T. Let there be a cycle
$T' \ge T$ such that the stall signal $stall_k$ is not active. This implies that
either the update enable signal of stage k or the rollback' signal of stage k
eventually gets activated.
\[ full_k^T \wedge \exists_T\, \overline{stall_k} \Rightarrow \exists_T\, (ue_k \vee rollback'_k) \]
PROOF The proof is done in analogy to the proof of the counterpart lemma of
the machine without speculation, lemma 3.20 (page 87).
Lemma 5.46 For all cycles T, either the update enable signal of stage M(T) or
the rollback' signal of stage M(T) eventually gets activated.
\[ \exists_T\, (ue_{M(T)} \vee rollback'_{M(T)}) \]
PROOF This claim is shown by instantiating lemma 3.20 with stage
M(T). We show $full_{M(T)}^{T}$ using lemma 5.33. We show that the stall signal
will eventually be deactivated by using lemma 5.44.
Lemma 5.47 The speculation stage $\Sigma(T)$ is always above or equal to the last
full stage.
\[ \Sigma(T) \le M(T) \]
PROOF According to lemma 5.20, stage $\Sigma(T)$ is full during cycle T. Thus, this
cannot be below the last full stage.
Lemma 5.48 The non-speculative inputs of stage M(T) are always correct.
\[ I_{M(T)}(T) \]
PROOF By definition of $I_{M(T)}(T)$, we have to show:
\[ (\forall l : l \ge M(T)-1 \Rightarrow P_l(T)) \wedge (\forall l : l \ge M(T) \Rightarrow P'_l(T)) \]
Remember that $P_l(T)$ denoted non-speculative output registers of stage l
and $P'_l(T)$ denoted the output registers of stage l that depend on speculative
registers. We easily conclude both claims using the data consistency of the
machine (invariant 5.5, page 186) and $\Sigma(T) \le M(T)$ (lemma 5.47). QED
Lemma 5.49 If $M_c(T)$ holds, the speculation registers that are output of stage
M(T)−1 hold correct values.
\[ M_c(T) \Rightarrow S_{M(T)-1}(T) \]
PROOF We show this claim by induction on T. For T = 0, we have nothing to
show since $M_c(0)$ does not hold.
For T+1, we show the claim as follows:
In case we have a rollback, i.e., if ${rollback'}_{M(T)}^{T}$ is active, we have
M(T+1) = 0 and we have to show $S_{-1}(T+1)$, which is easily done using lemma
5.41.
In case we do not have a rollback but an active update enable signal
$ue_{M(T)}^{T}$, we have $M_c(T+1) = M_c(T)$. In case $M_c(T+1)$ does not hold,
there is nothing to show. Thus, $M_c(T)$ holds, and we therefore have
$S_{M(T)-1}(T)$. Using lemma 5.15, we conclude $S_{M(T)}(T+1)$. According to
lemma 5.37, we have M(T+1) = M(T)+1, and therefore $S_{M(T+1)-1}(T+1)$,
which concludes the claim.
In case both signals are not active, we have to do a case split on the
value of M(T): In case M(T) is zero, we conclude the claim using lemma
5.42. If not so, we argue that $ue_{M(T)-1}^{T}$ cannot be active using lemma 4.3.
According to lemma 5.37, we have M(T+1) = M(T), thus, we have to
show $S_{M(T)}(T+1)$, which is easily done using lemma 5.13 for stage M(T).
QED
The following lemma shows that the “intended meaning” of Mc(T ) is
achieved, i.e., that Mc(T ) implies that we do not have a rollback in stage
M(T ).
Lemma 5.50. If $M_c(T)$ holds, the rollback' signal of stage M(T) cannot be active during cycle T.
$$M_c(T) \implies \neg {rollback'}^{T}_{M(T)}$$
Proof. According to lemma 5.49, we have $S_{M(T)-1}(T)$.
Assume ${rollback'}^{T}_{M(T)}$ is active. Using lemma 5.39, we can conclude that $\rho(M(T), T)$ holds, i.e., stage M(T) is the last stage with active rollback' signal. Using lemma 5.48, we conclude $I_{M(T)}(T)$. This allows applying lemma 5.41 for stage M(T).
Lemma 5.41 states that $S_{M(T)-1}(T)$ cannot hold, which is a contradiction. QED
We now proceed in the liveness proof as follows: we show that an in-
struction in the last full stage is live, i.e., eventually moves into the next
stage. The first step is to show this assuming we have a guarantee that the
instruction will not cause a rollback. One easily shows this.
The next step is to conclude that this also happens in case we do not have that guarantee. We do this by arguing that the instruction will roll back at most once in the worst case.
Lemma 5.51. Assume we have a guarantee that the instruction in stage M(T) will not roll back. In this case, we claim that there is a cycle T' ≥ T such that the update enable signal is active and no rollback is signaled during cycles T to T'.
$$M_c(T) \implies \exists T' \ge T : ue^{T'}_{M(T)} \wedge \forall T'' : T \le T'' \le T' : \neg {rollback'}^{T''}_{M(T)}$$
Proof. According to lemma 5.46, we have a cycle T' ≥ T such that either the rollback signal ${rollback'}_{M(T)}$ or the update enable signal $ue_{M(T)}$ is active. Let T' be the smallest such cycle.
We will disprove that ${rollback'}_{M(T)}$ can be active. Using lemma 5.43, we conclude that $M_c(T')$ holds. Using lemma 5.50, we conclude that ${rollback'}_{M(T')}$ cannot be active.
Thus, $ue_{M(T)}$ must be active during cycle T'. We will show that cycle T' satisfies the claim. It is left to show that ${rollback'}_{M(T)}$ is not active from cycle T to cycle T'. For cycle T', we conclude this from the definition of the update enable signal (the update enable signal is not active in case of a rollback). For cycles T'' with T ≤ T'' < T', we conclude this from the fact that T' is the smallest cycle such that either ${rollback'}_{M(T)}$ or $ue_{M(T)}$ is active. QED
The following lemma is the counterpart of lemma 3.23 (page 88) for the
machine without speculation. The proof is done in analogy to the proof of
lemma 3.23.
Lemma 5.52. Let T and T' ≥ T be cycles. Let the update enable signal $ue_k$ and the ${rollback'}_k$ signal of a stage k be off during the cycles T'' with T' > T'' ≥ T. The value of the scheduling function does not change from cycle T to T'.
$$\forall T'' \mid T' > T'' \ge T : \neg ue^{T''}_{k} \wedge \neg {rollback'}^{T''}_{k} \implies sI(k, T) = sI(k, T')$$
Lemma 5.53. Given that $M_c(T)$ holds, there is a cycle T' ≥ T such that the next instruction is in stage M(T).
$$M_c(T) \implies \exists T' \ge T : sI(M(T), T') = sI(M(T), T) + 1$$
Proof. Let T'' be the earliest cycle with active update enable signal according to lemma 5.51, i.e., we have $ue^{T''}_{M(T)}$. Using lemma 5.52, we conclude that the value of the scheduling function for stage M(T) does not change from cycle T to T'':
$$sI(M(T), T) = sI(M(T), T'')$$
We then use invariant 5.1 in order to conclude that the value of the scheduling function for stage M(T) increases by one from cycle T'' to cycle T''+1:
$$sI(M(T), T''+1) = sI(M(T), T'') + 1$$
Thus, cycle T''+1 satisfies our claim. QED
Lemma 5.54. Let $M_c(T)$ hold, i.e., we have a guarantee that the instruction in stage M(T) will not roll back. Furthermore, let this stage not be the last stage, i.e., M(T) < n−1. In this case, there is a cycle T' such that the instruction in stage M(T) during cycle T is now in stage M(T)+1. Furthermore, the last full stage during cycle T' is stage M(T)+1 and the instruction in that stage is guaranteed not to roll back.
$$M_c(T) \wedge M(T) < n-1 \implies \exists T' \ge T : sI(M(T)+1, T') = sI(M(T), T) \wedge M(T') = M(T)+1 \wedge M_c(T')$$
Proof. Let T'' be the earliest cycle with active update enable signal according to lemma 5.51, i.e., we have $ue^{T''}_{M(T)}$. We will show that T''+1 satisfies the claim above. We show the three parts of the claim separately.
1. We show $sI(M(T)+1, T''+1) = sI(M(T), T)$ as follows: Using the same arguments as in the proof of lemma 5.53, we conclude:
$$sI(M(T), T''+1) = sI(M(T), T) + 1$$
One easily shows that the full signal $full^{T''+1}_{M(T)+1}$ is active using that the update enable signal is active. We then apply invariant 5.3, which states:
$$sI(M(T), T''+1) = sI(M(T)+1, T''+1) + 1$$
Thus, the first part of the claim is satisfied by T''+1.
2. We show M(T''+1) = M(T)+1 as follows: Using lemma 5.43, we conclude that M does not change from cycle T to T''. Using lemma 5.37, we conclude M(T''+1) = M(T'')+1. Thus, T''+1 satisfies the claim.
3. We show $M_c(T''+1)$ as follows: Using lemma 5.43, we conclude that M does not change from cycle T to T''. By definition of $M_c(T''+1)$, we conclude $M_c(T''+1) = M_c(T'')$. Thus, T''+1 satisfies the claim. QED
We can extend the claim of lemma 5.54 to multiple stages using induc-
tion:
Lemma 5.55. Let $M_c(T)$ hold, i.e., we have a guarantee that the instruction in stage M(T) will not roll back. Consider a stage k ≥ M(T). The claim is that there is a cycle T' ≥ T such that the instruction in stage M(T) during cycle T is in stage k during cycle T'. Furthermore, we claim that stage k is the last full stage during cycle T' and that the instruction will not roll back.
$$k \ge M(T) \implies \exists T' \ge T : sI(k, T') = sI(M(T), T) \wedge M(T') = k \wedge M_c(T')$$
Proof. We show the claim by induction on k. For k = M(T), the claim obviously holds. For the step from k to k+1, we apply lemma 5.54.
The following lemma has the very same claim as lemma 5.53. However,
we no longer premise that the instruction in stage M(T ) is guaranteed not
to rollback.
Lemma 5.56. For all cycles T, there is a cycle T' ≥ T such that the next instruction moves into stage M(T).
$$\exists T' \ge T : sI(M(T), T') = sI(M(T), T) + 1$$
Proof. We use lemma 5.46 in order to conclude that there is a cycle T'' ≥ T such that either the update enable or rollback' signal is active. Let T'' be the earliest such cycle. In case the update enable signal is active, we conclude the claim as done in the proof of lemma 5.53.
In case the rollback signal is active, we conclude the claim as follows: Using lemma 5.52, we conclude that the value of the scheduling function for stage M(T) does not change from cycle T to T'':
$$sI(M(T), T) = sI(M(T), T'')$$
We then use lemma 5.43 in order to conclude that M(T'') = M(T). This allows applying lemma 5.39. Lemma 5.39 states that stage M(T) is the last stage with active rollback signal during cycle T''. This allows applying lemma 5.40 with l = 0. Lemma 5.40 states that $sI(0, T''+1)$ is equal to $sI(M(T), T'')$.
Because of the rollback, we have M(T''+1) = 0 by lemma 5.37. Thus, we have:
$$sI(M(T), T) = sI(M(T''+1), T''+1)$$
Since we have a rollback, we now have a guarantee that the instruction in stage M(T''+1) during cycle T''+1 will not roll back. Thus, we can apply lemma 5.55 in order to conclude that this instruction eventually moves into stage M(T). Let t be that cycle.
$$sI(M(T), T) = sI(M(T), t)$$
We will then use lemma 5.53 in order to conclude that the value of the scheduling function will eventually increase by one. Let t' be that cycle.
$$sI(M(T), T) + 1 = sI(M(T), t')$$
Thus, cycle t' satisfies the claim. QED
The following lemma has a similar claim as lemma 5.54. However, we
do not premise that we have a guarantee that the instruction in stage M(T )
will not rollback.
Lemma 5.57. Let M(T) not be the last stage. There is a cycle T' ≥ T such that the instruction in stage M(T) during cycle T is in stage M(T)+1 during cycle T'. Furthermore, stage M(T)+1 is the last full stage during cycle T'.
$$M(T) < n-1 \implies \exists T' \ge T : sI(M(T)+1, T') = sI(M(T), T) \wedge M(T') = M(T)+1$$
 k   full_k   sI(k,T)
 0     1        4
 1     1        3
 2     0        3
 3     1        2
 4     1        1
 5     1        0       <- M(T)
 6     0        0
 7     0        0       <- n-1
Table 5.5 Illustration of lemma 5.59: In a pipeline with n = 8 stages, we have M(T) = 5 and therefore sI(5,T) = sI(7,T).
Proof. The proof follows the same pattern as the proof of lemma 5.56: in case the update enable signal becomes active, we argue as in lemma 5.54. If not so, we have a rollback and continue as in lemma 5.56.
The following lemma is an inductive extension of lemma 5.57.
Lemma 5.58. Consider an instruction in the last full stage during cycle T. There is a cycle T' such that this instruction is in the last stage and such that the last stage is the last full stage.
$$\exists T' : sI(n-1, T') = sI(M(T), T) \wedge M(T') = n-1$$
Proof. Let k be the number of the last full stage. One easily concludes this claim by induction on k. One starts with the last stage and proceeds inductively from stage k to stage k−1 until the desired stage is reached. The induction step is argued using lemma 5.57. QED
Lemma 5.59. The value of the scheduling function $sI(M(T), T)$ is equal to the value of the scheduling function in the last stage, $sI(n-1, T)$.
$$sI(M(T), T) = sI(n-1, T)$$
This lemma is illustrated by an example in table 5.5.
Proof. Let k be the number of the last full stage. One easily concludes this claim by induction on k. One starts with the last stage and proceeds inductively from stage k to stage k−1 until the desired stage is reached.
The induction step from k to k−1 is argued as follows: since $full_k$ does not hold, one can use the scheduling invariants 5.2 and 5.3 in order to argue that
$$sI(k, T) = sI(k-1, T)$$
holds. QED
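As a concrete check of lemma 5.59, the values of table 5.5 can be replayed in a few lines of Python. This is only an illustration on the single example from the table, not a proof; the list names `full` and `s_i` are chosen here and do not appear in the thesis.

```python
# Values copied from table 5.5 (n = 8 stages); index = stage number k.
full = [1, 1, 0, 1, 1, 1, 0, 0]   # full_k
s_i = [4, 3, 3, 2, 1, 0, 0, 0]    # sI(k, T)

n = len(full)
# M(T): the last (highest-numbered) full stage.
last_full = max(k for k in range(n) if full[k])

# Induction step of the proof: every empty stage below M(T) in the pipeline
# carries the same scheduling value as its predecessor.
step_holds = all(s_i[k] == s_i[k - 1] for k in range(last_full + 1, n))

# Claim of lemma 5.59 for this example.
claim_holds = s_i[last_full] == s_i[n - 1]
```

Both `step_holds` and `claim_holds` evaluate to true for the table, with `last_full` equal to 5.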
Lemma 5.60. For all instructions $I_i$, there is a cycle T such that the value of the scheduling function for the last stage and cycle T is i.
$$\exists T : sI(n-1, T) = i$$
Proof. One shows this claim using induction on i. For i = 0, T = 0 satisfies the claim.
For i+1, we show the claim as follows: According to the induction premise, there is a cycle T such that $sI(n-1, T) = i$ holds. According to lemma 5.59, we have instruction $I_i$ also in the last full stage, i.e., $sI(M(T), T) = i$.
We use lemma 5.58 in order to argue that instruction i is eventually in the last stage, i.e., we have a cycle T' such that $sI(n-1, T') = i$ and M(T') = n−1. We then use lemma 5.56 in order to conclude that there is a cycle T'' such that $sI(n-1, T'') = i+1$. Thus, cycle T'' satisfies the claim. QED
5.11.4 Liveness Proof
Using lemma 5.60, we show that for all instructions Ii, there is a cycle T
such that this instruction is in the last stage of the pipeline. However, our
liveness criterion as proposed in chapter 3 is stronger: it requires that we
can provide such a cycle T for each stage and not just for the last stage.
We will now argue as follows: given that an instruction is in the last stage, there must be cycles T' such that $I_i$ was in all stages k < n−1 before. For intuition, this means that instructions never skip over a stage.
Lemma 5.61. Let k > 0 be a stage. Let $I_i$ be the instruction given by $sI(k, T)$. In this case, there is a cycle T' such that $sI(k-1, T')$ is i. For intuition, if you have an instruction in a stage k > 0, there must be an earlier cycle such that this instruction is in the previous stage.
$$k > 0 \wedge sI(k, T) = i \implies \exists T' : sI(k-1, T') = i$$
Proof. We show this claim using induction on T. For T = 0, T' = 0 satisfies the claim since we have $sI(k, 0) = sI(k-1, 0) = 0$.
For T+1, we have the following claim (to be shown):
$$k > 0 \wedge sI(k, T+1) = i \overset{!}{\implies} \exists T' : sI(k-1, T') = i$$
We show the claim as follows: Assume we have k < n−1 and an active ${rollback'}_{k+1}$ signal during cycle T. We will show that cycle T+1 satisfies $sI(k-1, T+1) = i$. By definition of the rollback' signals, ${rollback'}_k$ must be active during cycle T. This implies that $sI(k-1, T+1)$ is equal to $sI(k, T+1)$ by definition of $sI(k-1, T+1)$.
Let k = n−1 or $\neg {rollback'}^{T}_{k+1}$ hold. If the update enable signal $ue^{T}_{k}$ is active, the desired instruction was in stage k−1 during cycle T, which satisfies the claim. If the update enable signal $ue^{T}_{k}$ is not active, the instruction was in stage k during cycle T. In this case, we apply the induction premise, which provides a cycle T' that satisfies the claim. QED
Lemma 5.62. We extend the argument of the previous lemma inductively for multiple stages: Let k and l ≤ k be stages and let $I_i$ be the instruction given by $sI(k, T)$. In this case, there is a cycle T' such that $sI(l, T')$ is i.
$$sI(k, T) = i \wedge l \le k \implies \exists T' : sI(l, T') = i$$
Proof. We show the claim using induction on l. We start with l = k and proceed from l to l−1. For l = k, the claim obviously holds. For the step from stage l to l−1 we apply lemma 5.61.
Theorem 5.63. For all instructions $I_i$ and stages k, there is a cycle T such that $sI(k, T)$ is equal to i. This is the liveness criterion proposed in chapter 3.
$$\exists T : sI(k, T) = i$$
Proof. Using lemma 5.60, we conclude that there is a cycle T' such that $sI(n-1, T') = i$ holds. For k = n−1, this satisfies the claim.
Thus, let k ≠ n−1 hold. In this case, we apply lemma 5.62, which provides us with a cycle T'' that satisfies the claim. QED
5.12 Precise Interrupts
5.12.1 Definition
Interrupts are events that change the flow of control of a program by means
other than a branch instruction [MP00]. They are used in order to realize
virtual memory, fast I/O, and arithmetic error handling.
In case of an interrupt, the state of the machine is saved and the execution
proceeds with an interrupt service routine (ISR). After the interrupt service
routine is done, the state of the machine is restored and the execution of
the program proceeds.
An interrupt between instruction $I_{i-1}$ and $I_i$ is precise if instructions $I_0$ to $I_{i-1}$ are completed before starting the ISR and later instructions ($I_i, \ldots$) did not change the state of the machine [SP88, Mül97].
5.12.2 The DLX with Interrupts
The specification of a DLX with interrupts used in the following section is
taken from [MP00]. Interrupts are events other than branches that modify
Table 5.6 Special purpose registers used for exception handling
address   name    meaning
0         SR      status register
1         ESR     exception status register
2         ECA     exception cause register
3         EPC     the exception PC
4         EDPC    the exception delayed PC
5         EDATA   exception data register
the flow of control. Each such event is assigned a number in {0, 1, ...}. If such an event occurs, the next instruction fetched and executed is taken from a special interrupt service routine. The address of this interrupt service routine is denoted by SISR. After the interrupt service routine is done, there are three ways to resume the execution:
1. The interrupted instruction is repeated.
2. The execution is continued with the instruction that follows the in-
terrupted instruction.
3. The program execution is aborted.
In order to support interrupt handling, the instruction set architecture of
the machine is extended. A set of registers is added to the configuration
of the machine: the registers are called special purpose registers and are
listed in table 5.6. Each register is 32 bits wide.
In order to access these new registers, two instructions are added: the instruction movs2i reads a special purpose register and stores the value in a GPR register. The instruction movi2s reads a GPR register and stores the value in a given special purpose register. The transition function $\delta.GPR$ is changed accordingly. Given an instruction word I, these instructions are indicated by $I_{movi2s}(I)$ and $I_{movs2i}(I)$.
The special purpose register SR is used in order to mask interrupts. If
bit j in the register SR is set, the interrupt number j is handled. If bit j is
not set, the interrupt is suppressed. However, not all interrupts can be sup-
pressed using SR. Interrupts that can be suppressed are called maskable.
Table 5.7 lists the interrupts supported by the DLX without floating point
instructions. This list is taken from [MP00]. The reset interrupt occurs
Interrupt             Symbol   Priority   Resume     Maskable
reset                 reset    0          abort      no
illegal instruction   ill      1          abort      no
misaligned access     mal      2          abort      no
page fault IM         ipf      3          repeat     no
page fault DM         dpf      4          repeat     no
trap                  trap     5          continue   no
FXU overflow          ovf      6          continue   yes
external I/O          ex[j]    7+j        continue   yes
Table 5.7 The interrupts and their priority
directly in the initial configuration of the machine. Thus, we start the ex-
ecution after reset at the interrupt service routine and no longer at address
zero.
The illegal instruction interrupt occurs iff the instruction word fetched does not encode a valid instruction. The misaligned access interrupt occurs iff the instruction fetch or the data memory access is not well-aligned.
The page fault IM/DM interrupts occur iff the memory system signals a
page fault during an instruction fetch or data memory access, respectively.
The trap interrupt is caused by a special instruction trap. It can be used for
system calls, for example. The trap instruction allows passing an immedi-
ate constant as parameter.
The FXU overflow interrupt occurs if an unmasked overflow occurs during an ALU instruction. The external I/O interrupts occur if an external signal $ex_S[j]$ with j ≥ 0 is active. These external interrupts can be used in order to realize fast I/O such as access to hard disks or networks.
Let CA denote a 32-bit signal that is defined as follows: iff an interrupt
with number j occurs, bit j of this signal is active. Let c be a configuration
of the specification machine. Using CA, the 32-bit signal MCA is defined
as follows:
$$MCA(c)[j] := \begin{cases} CA(c)[j] & : \text{if interrupt } j \text{ is not maskable} \\ CA(c)[j] \wedge SR[j] & : \text{if interrupt } j \text{ is maskable} \end{cases}$$
Thus, an interrupt is handled if there is at least one bit in MCA(c) set. This is indicated by the one bit signal JISR:
$$JISR(c) := \exists j \in \{0, \ldots, 31\} : MCA(c)[j]$$
If multiple interrupts occur, the interrupt with the lowest number is han-
dled with priority. The interrupt that is handled is indicated by a 32-bit
signal il (interrupt level). If no interrupt is to be handled, all bits of il are
zero. If there is an interrupt j to handle, exactly bit j is set.
$$il(c)[j] := \begin{cases} 1 & : JISR(c) \wedge j = \min\{ i \in \{0, \ldots, 31\} \mid MCA(c)[i] \} \\ 0 & : \text{otherwise} \end{cases}$$
The same interrupt service routine is used in order to handle all inter-
rupts. Thus, in order to enable this interrupt service routine to distinguish
the events that cause interrupts, a new special purpose register ECA is
added to the configuration set of the machine. In case of an interrupt,
the value of MCA(c) is stored in ECA. The interrupt service routine is ex-
pected to handle the interrupt event with the smallest number j such that
the bit ECA[ j] is set.
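The masking and priority rules above can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not the specification itself: signals are modeled as 32-bit integers, the function names `mca`, `jisr`, and `il` are chosen here for readability, and the maskable range (bits 6 to 31) is read off table 5.7.

```python
MASK32 = 0xFFFFFFFF
FIRST_MASKABLE = 6  # assumption: interrupts 6..31 are maskable (table 5.7)

def mca(ca: int, sr: int) -> int:
    """Masked cause: non-maskable bits pass through unchanged,
    maskable bits are ANDed with the status register SR."""
    non_maskable = (1 << FIRST_MASKABLE) - 1
    return ((ca & non_maskable) | (ca & sr & ~non_maskable)) & MASK32

def jisr(masked: int) -> bool:
    """JISR holds iff at least one bit of MCA is set."""
    return masked != 0

def il(masked: int) -> int:
    """Interrupt level as a one-hot vector: only the lowest set bit of MCA
    (the interrupt with the smallest number) survives; 0 if none is set."""
    return masked & -masked
```

For instance, with a pending page fault IM (bit 3) and a pending FXU overflow (bit 6) while SR is zero, `mca` keeps only bit 3, and `il` selects that bit.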
Instruction Fetch We support two interrupts that affect the instruction
fetch. We check whether the instruction word address is misaligned. Given
an effective address ea, the function imal(ea) holds if we have a misaligned
instruction word:
$$imal(ea) := ea[0] \vee ea[1] \qquad (5.15)$$
Furthermore, we support page faults for the instruction memory access. Page faults are indicated by an external signal $ipf_S(c)$.
If no page fault happens and if the instruction word is not misaligned,
the instruction word I(c) is defined as in chapter 2. In particular, we are
back to using Delayed PC and no longer use branch prediction. In case of
a misaligned instruction word or a page fault, we use zero as instruction
word. The instruction encoded by zero actually turns out to be a NOP.
$$I(c) := \begin{cases} 0 & : imal(c.DPC) \vee ipf_S(c) \\ IM[c.DPC] & : \text{otherwise} \end{cases} \qquad (5.16)$$
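A minimal Python sketch of definitions (5.15) and (5.16) follows. It assumes that addresses are plain integers and that the instruction memory is modeled as a dictionary from address to instruction word; both are modeling choices made here, and the name `instruction_word` is hypothetical.

```python
def imal(ea: int) -> bool:
    """Misaligned instruction address: one of the two low address bits is set."""
    return (ea & 0b11) != 0

def instruction_word(im: dict, dpc: int, ipf: bool) -> int:
    """Instruction word I(c): zero (a NOP) on misalignment or page fault,
    otherwise the memory content at DPC."""
    if imal(dpc) or ipf:
        return 0
    return im[dpc]
```

Note that the memory is only read in the fault-free case, so a misaligned DPC never indexes `im`.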
The transition functions of $PC'$ and DPC are changed in order to realize the jump to the interrupt service routine and the rfe instruction. This instruction is used in order to return from the interrupt service routine. In case of an rfe instruction, the registers SR, $PC'$, and DPC are restored from the corresponding special purpose registers.
In case of an interrupt, the new value of $PC'$ is the address of the interrupt service routine (SISR) plus four, i.e., the second instruction of the interrupt service routine. In case of an rfe instruction, the value in EPC is taken. Otherwise, the next PC is calculated as in the machine without interrupts.
We define a function $next\_pc'(I, op1, PC, EPC)$ as follows: in case of an rfe instruction, it returns EPC. Otherwise, the value provided by $next\_pc$ as defined in chapter 2 is returned:
$$next\_pc'(I, op1, PC, EPC) := \begin{cases} EPC & : I_{rfe}(I) \\ next\_pc(I, op1, PC) & : \text{otherwise} \end{cases}$$
As before, op1 is the first GPR operand. We use this new $next\_pc'$ function in order to define the new transition function for the $PC'$ register:
$$\delta.PC'(c) := \begin{cases} SISR + 4 & : JISR(c) \\ next\_pc'(I, op1, c.PC', c.EPC) & : \text{otherwise} \end{cases}$$
The transition function of DPC is no longer the identity. In case of an interrupt, the new value of DPC is the address of the interrupt service routine (SISR), i.e., the first instruction of the interrupt service routine. In case of an rfe instruction, the value in EDPC is restored. Otherwise, the new value of DPC is calculated as in the machine without interrupts.
$$\delta.DPC(c) := \begin{cases} SISR & : JISR(c) \\ c.EDPC & : \neg JISR(c) \wedge I_{rfe}(c) \\ c.PC' & : \text{otherwise} \end{cases}$$
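The case distinctions for $PC'$ and DPC can be sketched in Python as follows. The sketch makes several assumptions: the value of `SISR` is a placeholder (the text leaves it symbolic), the result of the chapter 2 function $next\_pc$ is passed in as an already computed argument `next_pc`, and the decoded predicate $I_{rfe}$ is passed in as the boolean `is_rfe`.

```python
SISR = 0x00000000  # hypothetical start address of the interrupt service routine

def next_pc_prime(is_rfe: bool, epc: int, next_pc: int) -> int:
    """next_pc': EPC on an rfe instruction, otherwise the ordinary next PC."""
    return epc if is_rfe else next_pc

def delta_pc_prime(jisr: bool, is_rfe: bool, epc: int, next_pc: int) -> int:
    """New PC': second instruction of the ISR on an interrupt."""
    if jisr:
        return SISR + 4
    return next_pc_prime(is_rfe, epc, next_pc)

def delta_dpc(jisr: bool, is_rfe: bool, edpc: int, pc_prime: int) -> int:
    """New DPC: first instruction of the ISR on an interrupt,
    EDPC on rfe without interrupt, otherwise the old PC'."""
    if jisr:
        return SISR
    if is_rfe:
        return edpc
    return pc_prime
```

After an interrupt, the pair (DPC, $PC'$) therefore becomes (SISR, SISR+4), which is exactly the delayed-PC state needed to start fetching the interrupt service routine.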
Data Memory Exceptions We have two exceptions that are caused by data memory accesses: data memory page faults are used in order to implement virtual memory, data memory misalignment interrupts indicate a misaligned memory access.
Data memory page faults are indicated by an external signal dpf. A misaligned memory access is detected using the effective address of the memory access and the instruction word.
The functions memW and memH hold if the memory operand of the given instruction is of word or half-word size, respectively. In case of stores, we only support word size accesses.
$$memW(I) = (I_{load}(I) \wedge I_{lw}(I)) \vee I_{store}(I)$$
$$memH(I) = I_{load}(I) \wedge (I_{lh}(I) \vee I_{lhu}(I))$$
Given an effective address EA, we have a misaligned address if we have a word access with active EA(0) or active EA(1), or if we have a half-word access with active EA(0). This is indicated by malAc:
$$malAc(I, EA) = (memW(I) \wedge (EA(0) \vee EA(1))) \vee (memH(I) \wedge EA(0))$$
We have a data memory misalignment exception in case of a load or store instruction with misaligned address:
$$dmal(I, EA) = (I_{load}(I) \vee I_{store}(I)) \wedge malAc(I, EA)$$
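The alignment check can be sketched in Python as follows. The decoded predicates (memW, memH, load, store) are passed in as booleans, since the instruction encoding is not repeated here; the parameter names are chosen for this illustration only.

```python
def mal_ac(mem_w: bool, mem_h: bool, ea: int) -> bool:
    """Misaligned access: a word access needs the two low address bits clear,
    a half-word access needs the lowest address bit clear."""
    word_misaligned = mem_w and (ea & 0b11) != 0
    half_misaligned = mem_h and (ea & 0b1) != 0
    return word_misaligned or half_misaligned

def dmal(is_load: bool, is_store: bool, mem_w: bool, mem_h: bool, ea: int) -> bool:
    """Data misalignment exception: only loads and stores can raise it."""
    return (is_load or is_store) and mal_ac(mem_w, mem_h, ea)
```

For example, address 6 is misaligned for a word access but aligned for a half-word access, while byte accesses (neither `mem_w` nor `mem_h`) never raise the exception.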
Transition Function of the SPRs Let $S_j$ be a special purpose register. In case there is no interrupt, we define the register transition function $\delta.S_j$ as follows: we take the first GPR operand in case we have a movi2s instruction with appropriate address and the old value otherwise:
$$\neg JISR(c) \implies \delta.S_j(c) = \begin{cases} op1(c) & : I_{movi2s}(I) \wedge \langle I_{immediate}(I)[4:0] \rangle = j \\ c.S_j & : \text{otherwise} \end{cases}$$
The transition function in case there is an interrupt depends on the reg-
ister.
As described above, we store the value of MCA in the register ECA in case of an interrupt:
$$JISR(c) \implies \delta.ECA(c) = MCA(c)$$
The EDATA special purpose register is used in order to store additional
information about the exception. In case of a trap instruction, the immedi-
ate constant provided with the instruction is stored in EDATA. This allows
passing of an argument to the interrupt service routine. In case of a page
fault or misaligned memory access, the memory address accessed is stored
in EDATA. In case of any other interrupt, we store zero in EDATA.
Let mem(c) indicate that we execute a load or store instruction:
$$mem(c) := I_{load}(I) \vee I_{store}(I) \qquad (5.17)$$
Let dmemea(c) denote the effective address of a data memory access. In case of an interrupt, the transition function for EDATA is:
$$JISR(c) \implies \delta.EDATA(c) = \begin{cases} c.DPC & : imal(c.DPC) \vee ipf_S(c) \\ dmemea(c) & : dpf_S(c) \wedge mem(c) \\ I_{immediate}(c) & : I_{trap}(I) \\ 0 & : \text{otherwise} \end{cases}$$
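The priority among the cases of the EDATA transition function is easiest to see as a chain of tests. The following Python sketch mirrors the case distinction as printed above; all inputs are passed in as precomputed values, and the parameter names are hypothetical.

```python
def delta_edata(dpc: int, imal_dpc: bool, ipf: bool,
                dpf: bool, is_mem: bool, dmem_ea: int,
                is_trap: bool, immediate: int) -> int:
    """New value of EDATA in case of an interrupt (JISR is assumed to hold)."""
    if imal_dpc or ipf:
        return dpc          # fault during instruction fetch: store the DPC
    if dpf and is_mem:
        return dmem_ea      # data page fault: store the effective address
    if is_trap:
        return immediate    # trap: pass the immediate constant to the ISR
    return 0                # any other interrupt
```

The cases are tested in the order in which they are listed, so an instruction fetch fault takes precedence over a data memory fault.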
Furthermore, the values of the registers DPC and $PC'$ are saved in special purpose registers EDPC and EPC in order to support resuming the instruction after the execution of the interrupt service routine. This depends on whether the interrupt is of type repeat or continue. This is indicated by a one bit signal repeat:
$$repeat(c) := (il(c) = 3) \vee (il(c) = 4)$$
If the interrupt is of type repeat, the values of DPC and $PC'$ in the current configuration are taken. If the interrupt is of type continue or abort, the values are taken that point to the following instruction, as calculated by $next\_pc'$. In case of an interrupt, the transition functions for EPC and EDPC are:
$$JISR(c) \implies \delta.EPC(c) = \begin{cases} c.PC' & : repeat(c) \\ next\_pc'(I, op1, c.PC', c.EPC) & : \text{otherwise} \end{cases}$$
$$\delta.EDPC(c) = \begin{cases} c.DPC & : repeat(c) \\ c.EDPC & : \neg repeat(c) \wedge I_{rfe}(I) \\ c.PC' & : \text{otherwise} \end{cases}$$
In case of an interrupt, the register SR is set to zero. This masks all interrupts, which prevents the interrupt service routine from being interrupted:
$$JISR(c) \implies \delta.SR(c) = 0$$
In order to restore the register SR before resuming the program, the value
of SR is saved in the special purpose register ESR. In case of an interrupt
of type repeat, the value from the current configuration is taken. In case
of an interrupt of type continue or abort, the value calculated for the next
configuration is taken:
$$JISR(c) \implies \delta.ESR(c) = \begin{cases} op1(c) & : \neg repeat(c) \wedge I_{movi2s}(I) \wedge \langle I_{immediate}(I)[4:0] \rangle = 0 \\ c.SR & : \text{otherwise} \end{cases}$$
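The state saved on an interrupt (EPC, EDPC, ESR) can be summarized in one Python sketch. It follows the prose above: for an interrupt of type repeat the current values are saved, otherwise the values for the following instruction. The function name `saved_state`, the tuple result, and the flag `writes_sr` (a movi2s instruction whose SPR address is 0) are assumptions made for this illustration.

```python
def saved_state(repeat: bool, is_rfe: bool, writes_sr: bool,
                pc_prime: int, dpc: int, edpc: int, sr: int,
                next_pc_prime: int, op1: int):
    """Return the new (EPC, EDPC, ESR) in case of an interrupt.

    next_pc_prime is the already computed result of next_pc'."""
    # EPC: current PC' for repeat, otherwise the PC' of the next configuration.
    epc_new = pc_prime if repeat else next_pc_prime

    # EDPC: current DPC for repeat; for continue/abort the DPC of the next
    # configuration, which is EDPC after an rfe and PC' otherwise.
    if repeat:
        edpc_new = dpc
    elif is_rfe:
        edpc_new = edpc
    else:
        edpc_new = pc_prime

    # ESR: the value SR would have had in the next configuration, i.e. op1
    # if the interrupted instruction writes SR and is not repeated.
    esr_new = op1 if (not repeat and writes_sr) else sr
    return epc_new, edpc_new, esr_new
```

In the repeat case the result is simply the current ($PC'$, DPC, SR), so the interrupted instruction can be fetched again after the interrupt service routine returns.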
Furthermore, in case of an interrupt of type repeat, the write access to
GPR and to the memory has to be suppressed. This is realized by modify-
ing the transition function for GPR accordingly.
A complete description of how the interrupt service routine is to be implemented such that it behaves like a procedure is given in [MP00].
5.12.3 Hardware for the DLX with Interrupts
In this section, we describe small circuits that are used for interrupt han-
dling.
MCA The circuit $MCA\_impl(CA, SR)$ calculates the masked cause register given CA and the status register SR:
$$MCA\_impl(CA, SR)[i] := \begin{cases} CA[i] \wedge SR[i] & : 6 \le i < 32 \\ CA[i] & : \text{otherwise} \end{cases}$$
Lemma 5.64. The circuit MCA is correct:
$$MCA\_impl(CA(c_S), c_S.SR) = MCA(c_S)$$
Proof. One easily asserts this claim by expanding the definitions of the functions $MCA\_impl$ and MCA.
JISR We calculate the JISR signal using a zero tester and MCA; the signal is active iff MCA is not zero:
$$JISR\_impl(MCA) := \neg zerotester(MCA)$$
Lemma 5.65. The calculation of JISR is correct:
$$JISR\_impl(MCA(c_S)) = JISR(c_S)$$
One easily asserts this claim by expanding the definition of both func-
tions and by applying lemma 2.2 (correctness of the zero tester, page 16).
repeat We calculate the repeat signal as done in [MP00]:
$$repeat\_impl(MCA) := \neg (MCA[0] \vee MCA[1] \vee MCA[2]) \wedge (MCA[3] \vee MCA[4])$$
Lemma 5.66. The circuit $repeat\_impl$ is correct:
$$repeat\_impl(MCA(c_S)) = repeat(c_S)$$
Proof. Let $MCA(c_S)$ be zero. In this case, one easily asserts that the bit provided by $repeat\_impl$ is not active. Furthermore, one asserts that $JISR(c_S)$ does not hold. This implies that $il(c_S)$ is also zero by definition. Thus, $repeat(c_S)$ does not hold, which concludes the claim for $MCA(c_S) = 0$.
Let $MCA(c_S)$ be not zero. In this case, one easily asserts that $JISR(c_S)$ is active. Furthermore, there is a smallest j such that $MCA(c_S)[j]$ holds. If this j is smaller than 3 or greater than 4, we do not have an interrupt of type repeat. One easily asserts that $repeat\_impl(MCA(c_S))$ does not hold in this case.
If it is equal to 3 or 4, we have an interrupt of type repeat. One easily asserts that $repeat\_impl(MCA(c_S))$ holds in this case. QED
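The case analysis of this proof can be replayed by brute force in Python. The sketch below compares a bit-level model of the circuit with a model of the specification (smallest set bit of MCA is 3 or 4) for all values of the low 12 bits of MCA; since both functions depend only on the position of the lowest set bit and on bits 0 to 4, higher bits cannot change the outcome. The function names are chosen here, and the check is an illustration rather than a replacement for the PVS proof.

```python
def repeat_impl(mca: int) -> bool:
    """Circuit model: none of bits 0..2 set, and bit 3 or bit 4 set."""
    return (mca & 0b00111) == 0 and (mca & 0b11000) != 0

def repeat_spec(mca: int) -> bool:
    """Specification model: the handled interrupt (lowest set bit) is 3 or 4."""
    if mca == 0:
        return False                       # no interrupt, il is zero
    level = (mca & -mca).bit_length() - 1  # index of the lowest set bit
    return level in (3, 4)

# Exhaustive comparison over the low 12 bits of MCA.
mismatches = [m for m in range(1 << 12) if repeat_impl(m) != repeat_spec(m)]
```

The list `mismatches` is empty, which matches the three cases of the proof (MCA zero, smallest j outside {3, 4}, smallest j in {3, 4}).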
Decoder In order to realize the special purpose register file, we need a decoder:
Definition 5.1 (Decoder). Let k be an integer and n be $2^k$. A decoder is a circuit with inputs $a \in bvec[k]$ and outputs $b \in bvec[n]$ such that for all i
$$b_i = 1 \iff \langle a \rangle = i.$$
An implementation can be found in [MP95]. A PVS proof is covered by [BJK01].
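A behavioral model of definition 5.1 can be written in a few lines of Python. It is a functional sketch, not the gate-level implementation from [MP95], and it assumes that the bit vector a is given as a list with the least significant bit first.

```python
def decoder(a_bits: list) -> list:
    """Decode a k-bit input into a one-hot vector of n = 2^k output bits.

    a_bits[0] is assumed to be the least significant bit of <a>."""
    value = sum(bit << i for i, bit in enumerate(a_bits))  # <a>
    n = 1 << len(a_bits)
    return [1 if i == value else 0 for i in range(n)]
```

For k = 3 and ⟨a⟩ = 5 (the address of EDATA in table 5.6), exactly output bit 5 is set; every output vector contains exactly one active bit.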
5.12.4 Configuration of the Pipelined DLX with Interrupts
We implement the pipelined DLX with interrupts using speculation. Us-
ing the generic speculation mechanism from this chapter, and the generic
forwarding mechanism from chapter 4, implementing the pipelined DLX
with interrupts is quite easy. We do this in three steps:
1. We start with the pipelined machine without interrupts as presented
in chapter 4. We add the special purpose registers, as described
above, to the configuration set.
2. We add two speculative values: the first value, JISR, is a one bit
register that indicates an interrupt. The second value, repeat, is a
one bit register that is set iff the interrupt is of type repeat. Given
those two speculative inputs, we can almost copy the specification
above in order to get an implementation.
3. We add write accesses to JISR and repeat in order to detect any
misspeculation.
Figure 5.16 gives an overview of the DLX pipeline with precise inter-
rupts. We now describe the changes to the pipelined machine in detail.
Configuration Set We extend the configuration set of the pipelined ma-
chine without interrupts by the special purpose registers as given by the
specification of the DLX with interrupts. We furthermore add a set of im-
plementation registers that we will describe later on.
Initial Configuration As before, the initial values of GPR and DM are arbitrary but fixed. The register DPC is initialized with SISR, the register $PC'$ with SISR+4. This will cause the ISR to be executed. All special purpose registers except for ECA are initialized with zero. The register ECA is initialized with one in order to indicate the reset.
5.12.5 Transition Functions of Stage 0
In stage 0, we do the instruction fetch. This is done by a write access to
the IR register. The write access depends on DPC and on ip f . We follow
220
[Figure 5.16: Overview of the DLX pipeline with precise interrupts (stages IF, ID, EX, M, WB). The registers I' are a shorthand for the speculative values JISR and repeat. The spec environment does the speculation.]
the definition of I(c) as given in the specification:
$$f^0_{IR}(DPC, ipf) = \begin{cases} 0 & : imal(DPC) \vee ipf \\ IM[DPC] & : \text{otherwise} \end{cases}$$
Lemma 5.67. The calculation of the instruction word is correct:
$$\Omega^0_{IR}(c^i_S) = I(c^i_S)$$
Proof. If one expands the left hand side of the claim, one gets the following claim (to be shown):
$$f^0_{IR}(c^i_S.DPC, ipf(c^i_S)) \overset{!}{=} I(c^i_S)$$
One easily asserts this claim by expanding $f^0_{IR}$ on the left hand side and I on the right hand side. QED
Exceptions We collect the interrupt cause bits CA in separate implementation registers. In the register $CA_{imal}$, we store whether we have an instruction word misalignment. The same applies for $CA_{ipf}$ and $CA_{ex}$. We assume an external signal ex that is a bitvector. The bits of the bitvector indicate the individual external interrupts. In contrast to [MP00], we detect the external interrupts in stage 0.
$$f^0_{CAimal}(DPC) = imal(DPC) \qquad (5.18)$$
$$f^0_{CAipf}(ipf) = ipf \qquad (5.19)$$
$$f^0_{CAex}(ex) = ex \qquad (5.20)$$
In addition to that, we speculate two values JISR and repeat. We spec-
ulate that we have an interrupt if we have an instruction memory page
fault, a trap instruction, a misaligned instruction word, or an external in-
terrupt. We detect external interrupts using a zero tester. Remember that
the function used for speculating R is called f0Rs. Thus, the function for
speculating JISR is:
f0JISRs(ip f ;DPC;ex) = ip f _ I trap( f0IR(DPC; ip f ))_
imal(DPC)_ zerotester(ex)
We speculate that we have an interrupt of type repeat if an instruction
memory page fault is signaled and if there is no instruction word misalign-
ment:
f0repeats(ip f ;DPC) = ip f _ imal(DPC)
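The two speculation functions of stage 0 can be illustrated with a small Python sketch. It is only a behavioural model under stated assumptions, not the gate-level circuit: the function names are hypothetical, the decoded trap predicate is passed in as a flag, and `any(ex)` stands in for the zero-tester term under the assumption that this term is active iff at least one external interrupt bit is set.

```python
from typing import Sequence


def f0_jisr_s(ipf: bool, is_trap: bool, imal: bool, ex: Sequence[bool]) -> bool:
    """Speculated JISR: some interrupt cause is already visible in stage 0."""
    # Assumption: the zero-tester term is active iff ex is not all zero.
    return ipf or is_trap or imal or any(ex)


def f0_repeat_s(ipf: bool, imal: bool) -> bool:
    """Speculated repeat: fetch page fault without instruction misalignment."""
    return ipf and not imal
```

With these definitions, a misaligned fetch that also page-faults is speculated as an interrupt, but not as one of type repeat.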
This implementation differs from the implementation given in [MP00]:
in [MP00], execution always starts under the assumption that no interrupt
happens. This includes the interrupts that can be detected in early stages.
As an example, consider a trap instruction. The machine in [MP00]
executes the instructions following the trap instruction as if no interrupt
happens. In stage 3, the misspeculation is detected and the instructions
following trap are evicted from the pipeline. In contrast to that, the machine
presented here never misspeculates on trap instructions. Thus, no rollback
is necessary. Following the trap instruction, the instructions of the
interrupt service routine are executed. We therefore waste no cycles in case of
the interrupts given above.
This speeds up execution. The price paid for this is extra
complexity. In particular, we have to forward the effect of interrupts. This
includes that interrupts modify all special purpose registers. In [MP00], the
authors remark that “forwarding the effect of this looks like a nightmare”.
We will later on describe the forwarding hardware we use for this.
5.12.6 Transition Functions of Stage 1
In stage 1, we do the operand fetching, the calculation of the new PC
registers, and the calculation of the precomputed control signals. As in chapter
3, let us define the precomputed control signals in the stages that use them.
PC' In order to calculate the new value of the PC' register, we implement
the function next_pc' as given by the specification as follows:
• In case of an rfe instruction, we take the value of the EPC input.
• In case of any other instruction, we use the value provided by the old
  next_pc_impl circuit, as defined in chapter 3.
Thus, next_pc'_impl is:
\[ next\_pc'\_impl(IR, GPRa, oldPC, EPC) := \begin{cases} EPC & : I\_rfe(IR) \\ next\_pc\_impl(IR, GPRa, oldPC, EPC) & : \text{otherwise} \end{cases} \]
The following lemma asserts that the circuit next_pc'_impl complies with
the specification next_pc'.
Lemma 5.68 The calculation of the new PC is correct:
\[ next\_pc'\_impl = next\_pc' \]
PROOF If I_rfe(I) holds, the claim obviously holds. If not so, one asserts the
claim using lemma 3.4.
Lemma 5.69 In the specification, we pass $op1(c_S^i)$ as parameter to next_pc.
In the implementation, we pass GPRa as input. Given that this input is correct, the
next_pc function returns the same value in both cases.
\[ next\_pc(I(c_S^i), G_1GPRa(c_S^i), c_S^i.PC') = next\_pc(I(c_S^i), op1(c_S^i), c_S^i.PC') \]
PROOF One asserts this claim in analogy to the proof of lemma 3.15 (correctness
of the transition functions of the sequential DLX).
For next_pc'_impl, we need the value of EPC in case of an rfe instruction.
We realize this by a conditional read access to EPC. The read enable
function returns true iff we have an rfe instruction:
\[ f_1EPCre(IR) = I\_rfe(IR) \tag{5.21} \]
This allows defining the register transition function for PC' in analogy
to the specification:
\[ f_1PC'(IR, JISR, PC', EPC, GPRa) = \begin{cases} SISR + 4 & : JISR \\ next\_pc'\_impl(IR, GPRa, PC', EPC) & : \text{otherwise} \end{cases} \]
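The case distinction for the new PC' value can be sketched in Python as follows. This is a minimal behavioural model under stated assumptions: the value of `SISR` is a placeholder, the old next_pc_impl circuit of chapter 3 is passed in as a hypothetical callable `old_next_pc`, and the decoded rfe predicate is passed in as a flag.

```python
from typing import Callable

SISR = 0x0000_0000      # placeholder start address of the interrupt service routine
MASK32 = 0xFFFF_FFFF    # assumption: 32-bit program counters


def next_pc_prime_impl(is_rfe: bool, epc: int, old_next_pc: Callable[[], int]) -> int:
    """rfe restores the PC from EPC; everything else uses the old circuit."""
    return epc if is_rfe else old_next_pc()


def f1_pc_prime(jisr: bool, is_rfe: bool, epc: int,
                old_next_pc: Callable[[], int]) -> int:
    """Register transition function for PC': an interrupt overrides the rest."""
    if jisr:
        return (SISR + 4) & MASK32
    return next_pc_prime_impl(is_rfe, epc, old_next_pc)
```

Note that the JISR case has priority, so an rfe instruction that is itself interrupted continues at SISR + 4 rather than at EPC.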
Lemma 5.70 Assuming correct inputs, the calculation of the new value of PC' is correct:
\[ c_S^{i+1}.PC' = f\Gamma_1PC'(c_S^i) \]
PROOF By expanding the definition of $c_S^{i+1}$ on the left hand side, we get:
\[ \delta.PC'(c_S^i) \stackrel{!}{=} f\Gamma_1PC'(c_S^i) \]
The function $f_1PC'$ uses JISR as input. The correct value of the JISR
input given configuration $c_S^i$ is:
\[ G_1JISR(c_S^i) = JISR(c_S^i) \]
Let $JISR(c_S^i)$ hold. In this case, both $f_1PC'$ and $\delta.PC'$ return SISR + 4
and the claim holds.
Let $JISR(c_S^i)$ not hold. In this case, we assert the correctness of the GPR
operand as in the proof of lemma 3.15, which is the corresponding lemma
for the machine without interrupts. We then apply lemma 5.68, which
concludes the claim. QED
DPC For defining the register transition function for DPC, we need the
register EDPC for rfe instructions. As above, we realize this by a conditional
read access to EDPC. The read enable function returns true iff we
have an rfe instruction:
\[ f_1EDPCre(IR) = I\_rfe(IR) \tag{5.22} \]
This allows defining the register transition function for DPC in analogy
to the specification:
\[ f_1DPC(IR, JISR, PC', EDPC) = \begin{cases} SISR & : JISR \\ EDPC & : I\_rfe(IR) \\ PC' & : \text{otherwise} \end{cases} \]
The following lemma asserts the correctness of this circuit.
Lemma 5.71 Assuming correct inputs, the calculation of the new value of DPC is
correct:
\[ c_S^{i+1}.DPC = f\Gamma_1DPC(c_S^i) \]
PROOF By expanding the definition of $c_S^{i+1}.DPC$ on the left hand side, we get:
\[ \delta.DPC(c_S^i) \stackrel{!}{=} f\Gamma_1DPC(c_S^i) \]
By expanding the definition of $f\Gamma_1DPC$ on the right hand side, we get:
\[ \delta.DPC(c_S^i) \stackrel{!}{=} f_1DPC(\Omega_0IR(c_S^i), JISR(c_S^i), c_S^i.PC', c_S^i.EDPC) \]
By applying lemma 5.67, we get:
\[ \delta.DPC(c_S^i) \stackrel{!}{=} f_1DPC(I(c_S^i), JISR(c_S^i), c_S^i.PC', c_S^i.EDPC) \]
One easily asserts this by expanding the functions $\delta.DPC$ and $f_1DPC$. QED
We precompute the values to be written into the special purpose registers
EPC and EDPC. This saves hardware cost, since this computation
depends on many registers. Furthermore, it allows forwarding these registers.
This includes the effect of interrupts. As the register C is responsible
for forwarding GPR registers, the registers Cepc and Cedpc are responsible
for forwarding EPC and EDPC. The new values are already available
in the decode/issue stage. Thus, the write condition is always true.
The new value of EPC is precomputed as follows: In case of an interrupt
of type repeat, we write PC'. In case of any other interrupt, we write the
new value of PC' without interrupt, which is given by next_pc'_impl. In case
there is no interrupt, we return GPRa in order to handle movi2s with EPC
as destination.
\[ f_1Cepc(IR, JISR, repeat, GPRa, PC', EPC) = \begin{cases} PC' & : JISR \land repeat \\ next\_pc'\_impl(IR, GPRa, PC', EPC) & : JISR \land \overline{repeat} \\ GPRa & : \text{otherwise} \end{cases} \]
\[ f_1Cepcwe(IR, JISR) = 1 \]
The write enable signal of EPC is precomputed as follows: we write to
EPC in case of a movi2s with appropriate destination and in case of an
interrupt.
\[ f_1f_4EPCwe(IR, JISR) = JISR \lor (I\_movi2s(IR) \land \langle I\_immediate(IR)[4:0] \rangle = 3) \]
The new value of EDPC is precomputed as follows: in case of an interrupt
of type repeat, we write DPC. In case of any other interrupt and an rfe
instruction, we write EDPC. In case of any other interrupt and any other
instruction, we write PC'. In case there is no interrupt, we return GPRa in
order to handle movi2s with EDPC as destination.
\[ f_1Cedpc(IR, JISR, repeat, GPRa, DPC, EDPC, PC') = \begin{cases} DPC & : JISR \land repeat \\ EDPC & : JISR \land \overline{repeat} \land I\_rfe(IR) \\ PC' & : JISR \land \overline{repeat} \land \overline{I\_rfe(IR)} \\ GPRa & : \text{otherwise} \end{cases} \]
The write enable signal of EDPC is precomputed as follows:
\[ f_1f_4EDPCwe(IR, JISR) = JISR \lor (I\_movi2s(IR) \land \langle I\_immediate(IR)[4:0] \rangle = 4) \]
We will show the correctness of these values when we describe the
transition functions of stage 4.
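The precomputation of the EPC and EDPC values and of their write enable signals can be summarised in a short Python sketch. It follows the prose above; the function and parameter names are hypothetical, the decoded predicates are passed in as flags, and `sa` stands for the numeric value of the five immediate bits that select the special purpose register.

```python
EPC_INDEX = 3   # SPR address of EPC, as used in the write enable above
EDPC_INDEX = 4  # SPR address of EDPC


def f1_cepc(jisr: bool, repeat: bool, pc_prime: int,
            next_pc_prime: int, gpr_a: int) -> int:
    """Value precomputed for EPC in stage 1."""
    if jisr and repeat:
        return pc_prime          # interrupted instruction is repeated
    if jisr:
        return next_pc_prime     # continue after the interrupted instruction
    return gpr_a                 # movi2s with EPC as destination


def f1_cedpc(jisr: bool, repeat: bool, is_rfe: bool, dpc: int,
             edpc: int, pc_prime: int, gpr_a: int) -> int:
    """Value precomputed for EDPC in stage 1."""
    if jisr and repeat:
        return dpc
    if jisr and is_rfe:
        return edpc
    if jisr:
        return pc_prime
    return gpr_a                 # movi2s with EDPC as destination


def spr_we(jisr: bool, is_movi2s: bool, sa: int, index: int) -> bool:
    """Write enable: any interrupt, or movi2s addressing this register."""
    return jisr or (is_movi2s and sa == index)
```

The order of the `if` statements encodes the priority of the cases, so the negated conditions of the case distinction do not have to be spelled out.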
Forwarding Logic for EPC/EDPC Using these precomputed values,
we get the following forwarding hardware for reading EPC and EDPC
in stage k = 1. We show this for EPC as an example. The circuits for EDPC
are identical. As before, we calculate hit signals $R_khit[j]$. Thus, the signals
are named $EPC_1hit[j]$. The hit signal is active iff the full bit of stage j and
the precomputed write enable signal of EPC in stage j are active:
\[ EPC_1hit[j](c_I) := full_j(c_I) \land f_4EPCwe.j \]
Using the hit signals, we calculate the forwarded value. This is done using
multiplexers, as illustrated in figure 5.17. The proof of correctness of this
logic is similar to the proof of correctness for forwarding GPR registers.
However, we do not have to argue about an address.
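A behavioural sketch of such a multiplexer chain in Python may help to see how the hit signals select the forwarded value. It rests on an assumption that is not stated explicitly in the text: as in GPR forwarding, the stage closest to the reading stage (the youngest instruction) is taken to have priority. The dictionary-based interface is hypothetical.

```python
from typing import Dict


def forward_epc(hit: Dict[int, bool], value: Dict[int, int], epc_reg: int) -> int:
    """Select the EPC value seen by stage 1.

    hit[j]   -- hit signal of stage j (full bit and precomputed write enable)
    value[j] -- value that the instruction in stage j is going to write
    epc_reg  -- content of the EPC register itself
    """
    # Assumption: stage 2 holds the youngest instruction and therefore wins.
    for stage in (2, 3, 4):
        if hit[stage]:
            return value[stage]
    return epc_reg
```

If no stage signals a hit, the register content is used unchanged, which corresponds to the innermost multiplexer input.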
Note that we need only very little effort in order to realize an instruction
fetch with interrupts. In particular, we only need a few arguments in or-
der to show correctness as we only instantiate the generic forwarding and
speculation mechanisms.
In stage 1, we fetch the operands. This is done exactly as in chapter 3
with the exception that we need the first GPR operand also in case of a
movi2s instruction.
[Figure 5.17: three multiplexers with the data inputs EPC.5, ω4EPC, ω3Cepc,
ω2Cepc and the select signals EPC1hit[2], EPC1hit[3], EPC1hit[4]]
Figure 5.17 Implementation of EPC forwarding
Note that we do not fetch the source operand of movs2i instructions in
stage 1 in contrast to the machine presented in [MP00]. We do so in order
to illustrate read accesses to registers other than in the decode stage. This
has both advantages and disadvantages: obviously, we save the forwarding
logic. The disadvantage is that an instruction that follows the movs2i and
uses the destination of the movs2i as source has to be stalled. However,
we do not see a severe performance impact of doing so. Furthermore,
if one desires forwarding, our generic forwarding approach will generate
appropriate forwarding hardware.
Exceptions In stage 1, we decode the instruction word and signal an
illegal instruction word exception if necessary. Given an instruction word
IR, the function ill(IR) indicates that it is illegal. Let I be the set of
instructions. The function ill is defined using the predicates I_x as defined in
chapter 2; an instruction word is illegal iff none of the predicates holds:
\[ ill(IR) := \overline{\bigvee_{x \in I} I\_x(IR)} \tag{5.23} \]
We store this bit in an implementation register CAill:
\[ f_1CAill(IR) = ill(IR) \tag{5.24} \]
We do not do this in stage 0 because the calculation of ill(IR) might get
slow in case that there are many instructions. Furthermore, we consider
illegal instruction exceptions to be rare. Thus, the price for misspeculation
is not often paid.
5.12.7 Transition Functions of Stage 2
In this stage, we do the ALU calculation. The transition functions from
chapter 3 are taken without modification. We store a bit indicating an ALU
overflow in a register CAovf:
\[ f_2CAovf(IR, A, B) = ALU(A, aluop2(IR, B), aluf(IR)).ovf \]
The functions aluop2 and aluf are taken from chapter 3.
5.12.8 Transition Functions of Stage 3
In this stage, we do the data memory access. Most transition functions
from chapter 3 are taken without modification. We store a bit indicating a
data memory page fault in a register CAdpf:
\[ f_3CAdpf(IR, dpf) = dpf \land (I\_load(IR) \lor I\_store(IR)) \]
We store a bit indicating a data memory misalignment exception:
\[ f_3CAdmal(IR, MAR) = dmal(IR, MAR) \]
Furthermore, we do not enable the data memory write enable signal in
case we have one of these exceptions.
Cause Collection In stage 3, all exceptions are now known. This allows
us to calculate the MCA register: we do this by reading all CA registers and
calculating CA. As a shorthand, let CAargs denote the list of arguments
used in order to calculate CA (in the PVS tree, we always use the expanded
form). This is:
\[ CAargs := (IR, CAill, CAimal, MAR, CAipf, dpf, CAtrap, CAovf, CAex) \]
The function CA_impl takes these inputs and provides CA:
\[ CA\_impl(CAargs)[i] := \begin{cases} 0 & : i = 0 \\ CAill & : i = 1 \\ CAimal \lor dmal(IR, MAR) & : i = 2 \\ CAipf & : i = 3 \\ (I\_load(IR) \lor I\_store(IR)) \land dpf & : i = 4 \\ CAtrap & : i = 5 \\ CAovf & : i = 6 \\ CAex[i - 7] & : \text{otherwise} \end{cases} \]
Using the CA bits, we calculate MCA using the MCA_impl circuit:
\[ f_3MCA(SR, CAargs) = MCA\_impl(CA\_impl(CAargs), SR) \]
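The cause collection can be sketched in Python as a function that assembles the 32 cause bits from the individual cause registers. This is only an illustration: the parameter names are hypothetical, the decoded load/store predicate is passed in as the flag `is_mem`, and `ca_ex` is assumed to hold 25 external interrupt bits so that the result has 32 entries. The masking step MCA_impl is not modelled.

```python
from typing import List, Sequence


def ca_impl(ca_ill: bool, ca_imal: bool, dmal: bool, ca_ipf: bool,
            is_mem: bool, dpf: bool, ca_trap: bool, ca_ovf: bool,
            ca_ex: Sequence[bool]) -> List[bool]:
    """Collect the interrupt cause bits CA[0..31]."""
    ca = [False] * 32
    ca[0] = False                 # reset is not raised by an instruction
    ca[1] = ca_ill                # illegal instruction
    ca[2] = ca_imal or dmal       # instruction or data misalignment
    ca[3] = ca_ipf                # instruction memory page fault
    ca[4] = is_mem and dpf        # data page fault, loads and stores only
    ca[5] = ca_trap
    ca[6] = ca_ovf
    for i in range(7, 32):        # external interrupts, shifted by 7
        ca[i] = ca_ex[i - 7]
    return ca
```

For instance, external interrupt line 2 ends up in cause bit 9.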
Forwarding Logic for SR The register transition function $f_3MCA$ depends
on SR, i.e., we have a read access to SR.4 in stage 3. This requires
forwarding. The forwarding mechanism described in the previous chapter
(forwarding from the next stage, page 101) generates the following hardware
(the definition of the function $\omega_4SR$ is expanded):
\[ g_3(c_I) = \begin{cases} f\gamma_4SR(c_I) & : full_4(c_I) \land f\gamma_4SRwe(c_I) \\ c_I.SR.4 & : \text{otherwise} \end{cases} \]
Thus, in case stage 4 is full and the write enable signal of SR.4 is active,
we use the value written into SR.4. This holds in particular if there is an
instruction in stage 4 that causes an interrupt or is a movi2s instruction
writing SR.
In any other case, we use the value in the register SR.4. The proof that
this is the correct input is given in chapter 4 (lemma 4.7).
The following lemma asserts that the implementation register MCA contains
the correct value, as defined using the configuration of the specification
machine.
Lemma 5.72 The calculation of the next value of MCA is correct:
\[ \Omega_3MCA(c_S^i) = MCA(c_S^i) \]
PROOF By expanding the function $\Omega_3MCA(c_S^i)$ on the left hand side, we
get (we omit the parameter list):
\[ f_3MCA(\ldots) \stackrel{!}{=} MCA(c_S^i) \]
By definition of $f_3MCA$, we get:
\[ MCA\_impl(CA\_impl(G_3(c_S^i, CAargs)), c_S^i.SR) \stackrel{!}{=} MCA(c_S^i) \]
By applying lemma 5.64, we get:
\[ MCA\_impl(CA\_impl(\ldots), c_S^i.SR) \stackrel{!}{=} MCA\_impl(CA(c_S^i), c_S^i.SR) \]
Thus, the claim is shown if $CA\_impl(\ldots)$ is equal to $CA(c_S^i)$. We show
this by a case split on the number of the exception, i.e., we show
\[ CA\_impl(G_3(c_S^i, CAargs))[i] \stackrel{!}{=} CA(c_S^i)[i] \]
for all $i \in \{0, \ldots, 31\}$. We show the claim for the external interrupts
as an example. The proofs for the other exceptions follow the same pattern.
For $i \geq 7$ (external interrupts), we have the following claim:
\[ \Omega_2CAex(c_S^i)[i] \stackrel{!}{=} ex_S(c_S^i) \]
By expanding the functions on the left hand side, we get:
\[ \Omega_1CAex(c_S^i)[i] \stackrel{!}{=} ex_S(I(c_S^i)) \]
\[ \Omega_0CAex(c_S^i)[i] \stackrel{!}{=} ex_S(I(c_S^i)) \]
\[ f_0CAex(ex_S(c_S^i))[i] \stackrel{!}{=} ex_S(I(c_S^i)) \]
This is concluded by expanding $f_0CAex$. QED
Detecting Misspeculation Using MCA, we can also calculate the correct
value of JISR and repeat. Thus, we can detect any misspeculation in stage
3. In case of JISR, we use the JISR_impl circuit as defined above.
Lemma 5.73 The new value of JISR is correct.
\[ f\Gamma_3JISR(c_S^i) = JISR(c_S^i) \]
One asserts this lemma using lemma 5.72 (correctness of MCA) and
lemma 5.65.
In case of the repeat register, we calculate the correct value using the
circuit repeat_impl.
Lemma 5.74 The new value of repeat is correct.
\[ f\Gamma_3repeat(c_S^i) = repeat(c_S^i) \]
We assert this lemma using lemma 5.72, which shows the correctness of
MCA, and lemma 5.66.
5.12.9 Transition Functions of Stage 4
In analogy to lemma 5.67, one easily shows that IR read in stage 4 is the
instruction word:
Lemma 5.75 The calculation of the instruction word is correct:
\[ \Omega_3IR(c_S^i) = I(c_S^i) \]
Write Access to GPR In this stage, the result of the instructions is written
into the destination register. In case of ALU instructions or load
instructions, we do this as in chapter 3. However, we have to change the
transition function of GPR in order to realize movs2i.
As described above, we read the source operand in stage 4. We just pass
the values of the special purpose registers as parameters to the transition
function. After that, we use a decoder (definition 5.1) in order to generate
select signals for multiplexers. Let decoder_impl be an implementation of
a decoder according to the definition.
\[ SAdec(IR) = decoder\_impl(I\_immediate(IR)[4:0]) \]
We define a shorthand SPRsrc, which denotes the value of the SPR
source operand:
\[ SPRsrc(IR, SR, \ldots, EDATA) = \begin{cases} SR & : SAdec(IR)[0] \\ ESR & : SAdec(IR)[1] \\ ECA & : SAdec(IR)[2] \\ EPC & : SAdec(IR)[3] \\ EDPC & : SAdec(IR)[4] \\ EDATA & : SAdec(IR)[5] \\ 0 & : \text{otherwise} \end{cases} \]
This allows defining the register transition function for GPR:
\[ f_4GPR(C, IR, MAR, MDRr, SR, \ldots, EDATA) = \begin{cases} shift4load(MAR, MDRr, IR) & : I\_load(IR) \\ SPRsrc(IR, SR, \ldots, EDATA) & : I\_movs2i(IR) \\ C & : \text{otherwise} \end{cases} \]
In addition to that, we modify the precomputed version of the write enable
signal $f_4GPRwe$ such that it is active in case of a movs2i instruction.
Furthermore, we disable it in case of an interrupt of type repeat, as
indicated by repeat:
\[ f_1f_4GPRwe(IR, repeat) = (I\_ALU(IR) \lor I\_ALUi(IR) \lor I\_load(IR) \lor I\_shifti(IR) \lor I\_shift(IR) \lor I\_movs2i(IR) \lor ((I\_j(IR) \lor I\_jr(IR)) \land I\_link(IR))) \land \overline{repeat} \]
One easily asserts the correctness of the $f_4GPR$ function in analogy to
the proof of lemma 3.15.
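The selection of the SPR source operand and the resulting GPR write value can be sketched in Python. This is an illustrative model with hypothetical names: the special purpose registers are passed as a dictionary, `sa` is the numeric value of the five immediate bits, the decoded predicates are flags, and the output of shift4load is passed in as an already computed value.

```python
from typing import Dict, List

# SPR addresses 0..5 as used by the SAdec select signals above.
SPR_NAMES = ("SR", "ESR", "ECA", "EPC", "EDPC", "EDATA")


def decoder(sa: int, width: int = 5) -> List[bool]:
    """One-hot decoder: exactly output number sa is active."""
    return [i == sa for i in range(2 ** width)]


def spr_src(sa: int, spr: Dict[str, int]) -> int:
    """Value of the SPR source operand of a movs2i instruction."""
    sel = decoder(sa)
    for index, name in enumerate(SPR_NAMES):
        if sel[index]:
            return spr[name]
    return 0  # unused SPR addresses read as zero


def f4_gpr(c: int, is_load: bool, is_movs2i: bool, load_value: int,
           sa: int, spr: Dict[str, int]) -> int:
    """Value written into the GPR register file in stage 4."""
    if is_load:
        return load_value
    if is_movs2i:
        return spr_src(sa, spr)
    return c
```

Because the decoder output is one-hot, at most one of the select signals is active and the order of the loop does not matter.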
In addition to the write access to GPR, we also have the write accesses
to the special purpose registers in stage 4.
Write Access to SR We perform a conditional write access to SR: in
case of an interrupt, as indicated by JISR, we write zero. In case of an rfe
instruction, we write ESR. Otherwise, we have a movi2s instruction and
write the value in the C register, which is the GPR operand:
\[ f_4SR(C, IR, JISR, ESR) = \begin{cases} 0 & : JISR \\ ESR & : I\_rfe(IR) \\ C & : \text{otherwise} \end{cases} \]
\[ f_4SRwe(IR, JISR) = JISR \lor I\_rfe(IR) \lor (I\_movi2s(IR) \land SAdec(IR)[0]) \]
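The conditional write access to SR can be modelled by a value function, a write enable function, and a register update that combines the two. The following Python sketch uses hypothetical names; the decoded predicates are passed in as flags and `sa` is the numeric SPR address taken from the immediate field.

```python
def f4_sr(c: int, is_rfe: bool, jisr: bool, esr: int) -> int:
    """Value written into SR if the write enable is active."""
    if jisr:
        return 0      # an interrupt masks all interrupts
    if is_rfe:
        return esr    # rfe restores the saved status register
    return c          # movi2s writes the GPR operand


def f4_sr_we(jisr: bool, is_rfe: bool, is_movi2s: bool, sa: int) -> bool:
    """Write enable of SR; address 0 selects SR."""
    return jisr or is_rfe or (is_movi2s and sa == 0)


def next_sr(sr: int, c: int, is_rfe: bool, is_movi2s: bool,
            sa: int, jisr: bool, esr: int) -> int:
    """New content of SR: written value if enabled, old value otherwise."""
    if f4_sr_we(jisr, is_rfe, is_movi2s, sa):
        return f4_sr(c, is_rfe, jisr, esr)
    return sr
```

The function `next_sr` has the same shape as the right hand side of the correctness statements for the special purpose registers: the transition function applies only when the write enable signal is active.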
Lemma 5.76 The correct value of C matches the GPR operand in case of a movi2s
instruction.
\[ I\_movi2s(I(c_S^i)) \Longrightarrow \Omega_3C(c_S^i) = op1(c_S^i) \]
PROOF Because we have a movi2s instruction, we have $\Omega_3C(c_S^i) = G_1GPRa(c_S^i)$.
This transforms the claim into:
\[ I\_movi2s(I(c_S^i)) \Longrightarrow G_1GPRa(c_S^i) = op1(c_S^i) \]
One concludes this claim by expanding the definition of $G_1GPRa$ on the
right hand side. QED
Lemma 5.77 The value written by $f_4SR$ is correct.
\[ c_S^{i+1}.SR = \begin{cases} f\Gamma_4SR(c_S^i) & : f\Gamma_4SRwe(c_S^i) \\ c_S^i.SR & : \text{otherwise} \end{cases} \]
PROOF Let us expand the definition of the write enable signal $f\Gamma_4SRwe$:
\[ f\Gamma_4SRwe(c_S^i) = c_S^i.JISR \lor I\_rfe(\Omega_3IR(c_S^i)) \lor (I\_movi2s(\Omega_3IR(c_S^i)) \land SAdec(\Omega_3IR(c_S^i))[0]) \]
By applying lemma 5.75, this is transformed into:
\[ f\Gamma_4SRwe(c_S^i) = c_S^i.JISR \lor I\_rfe(I(c_S^i)) \lor (I\_movi2s(I(c_S^i)) \land SAdec(I(c_S^i))[0]) \]
Using the correctness of the decoder circuit, this is transformed into:
\[ f\Gamma_4SRwe(c_S^i) = c_S^i.JISR \lor I\_rfe(I(c_S^i)) \lor (I\_movi2s(I(c_S^i)) \land \langle I\_immediate(I(c_S^i))[4:0] \rangle = 0) \]
By expanding the definition of $c_S^{i+1}$ on the left hand side of the claim (as
given in lemma 5.77), we get:
\[ \delta.SR(c_S^i) \stackrel{!}{=} \begin{cases} f\Gamma_4SR(c_S^i) & : f\Gamma_4SRwe(c_S^i) \\ c_S^i.SR & : \text{otherwise} \end{cases} \]
Let the write enable signal $f\Gamma_4SRwe(c_S^i)$ be not active. In this case, one
easily asserts the claim by expanding the definition of $\delta.SR$.
Let the write enable signal $f\Gamma_4SRwe(c_S^i)$ be active. By expanding the
definition of $f\Gamma_4SR$, we get:
\[ \delta.SR(c_S^i) \stackrel{!}{=} f_4SR(\Omega_3C(c_S^i), \Omega_3IR(c_S^i), JISR(c_S^i), c_S^i.ESR) \]
Using lemma 5.75, we get:
\[ \delta.SR(c_S^i) \stackrel{!}{=} f_4SR(\Omega_3C(c_S^i), I(c_S^i), JISR(c_S^i), c_S^i.ESR) \]
By expanding the definition of $f_4SR$, we get:
\[ \delta.SR(c_S^i) \stackrel{!}{=} \begin{cases} 0 & : JISR(c_S^i) \\ c_S^i.ESR & : I\_rfe(I(c_S^i)) \\ \Omega_3C(c_S^i) & : \text{otherwise} \end{cases} \]
In case of $JISR(c_S^i)$ or $I\_rfe(I(c_S^i))$, the claim holds by definition of $\delta.SR$.
In any other case, we can conclude that we have a movi2s instruction
because the write enable signal is active. This allows applying lemma 5.76
and we get:
\[ \delta.SR(c_S^i) \stackrel{!}{=} op1(c_S^i) \]
This is concluded by expanding the definition of $\delta.SR$. QED
Write Access to ESR We perform a conditional write access to ESR: in
case of an interrupt of type repeat, as indicated by JISR and repeat, we
write SR. In case of any other interrupt, we write C if we have a movi2s
instruction that uses SR as destination, and SR otherwise. In case there is
no interrupt, we return C in order to handle movi2s with ESR as destination.
\[ f_4ESR(C, IR, JISR, SR, repeat) = \begin{cases} C & : sel \\ SR & : \text{otherwise} \end{cases} \]
with a signal sel in analogy to [MP00]:
\[ sel = \overline{JISR} \lor (\overline{repeat} \land I\_movi2s(IR) \land SAdec(IR)[0]) \]
The write enable signal is active in case of an interrupt or a movi2s
instruction with destination ESR.
\[ f_4ESRwe(IR, JISR) = JISR \lor (I\_movi2s(IR) \land SAdec(IR)[1]) \]
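The select signal for ESR is easy to get wrong, so a small Python sketch of the cases described in the prose may be useful. The negations in `esr_sel` are taken from that prose description (C is chosen if there is no interrupt, or if a non-repeat interrupt hits a movi2s that writes SR); the names are hypothetical and the flag `movi2s_to_sr` abbreviates the conjunction of the movi2s predicate and the decoder output for SR.

```python
def esr_sel(jisr: bool, repeat: bool, movi2s_to_sr: bool) -> bool:
    """True iff C (instead of SR) is written into ESR."""
    return (not jisr) or ((not repeat) and movi2s_to_sr)


def f4_esr(c: int, sr: int, jisr: bool, repeat: bool, movi2s_to_sr: bool) -> int:
    """Value written into ESR if the write enable is active."""
    return c if esr_sel(jisr, repeat, movi2s_to_sr) else sr
```

The case of a non-repeat interrupt on a movi2s to SR saves C rather than SR, since the interrupted instruction counts as completed and its new SR value is the one to be restored later.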
Lemma 5.78 The value written by $f_4ESR$ is correct.
\[ c_S^{i+1}.ESR = \begin{cases} f\Gamma_4ESR(c_S^i) & : f\Gamma_4ESRwe(c_S^i) \\ c_S^i.ESR & : \text{otherwise} \end{cases} \]
PROOF The proof proceeds in analogy to the proof of lemma 5.77.
Write Access to ECA We perform a conditional write access to ECA: in
case of an interrupt, we write MCA. In case there is no interrupt, we return
C in order to handle movi2s with ECA as destination.
\[ f_4ECA(C, IR, JISR, MCA) = \begin{cases} MCA & : JISR \\ C & : \text{otherwise} \end{cases} \]
\[ f_4ECAwe(IR, JISR) = JISR \lor (I\_movi2s(IR) \land SAdec(IR)[2]) \]
Lemma 5.79 The value written by $f_4ECA$ is correct.
\[ c_S^{i+1}.ECA = \begin{cases} f\Gamma_4ECA(c_S^i) & : f\Gamma_4ECAwe(c_S^i) \\ c_S^i.ECA & : \text{otherwise} \end{cases} \]
PROOF The proof proceeds in analogy to the proof of lemma 5.77. However, we
need the correctness of the MCA input, which we assert using lemma 5.72.
Write Access to EPC We perform a conditional write access to EPC.
We already precomputed the value to be written and the write enable signal
in stage 1.
Lemma 5.80 The value written by $f_4EPC$ is correct.
\[ c_S^{i+1}.EPC = \begin{cases} f\Gamma_4EPC(c_S^i) & : f\Gamma_4EPCwe(c_S^i) \\ c_S^i.EPC & : \text{otherwise} \end{cases} \]
PROOF As in the proof of lemma 5.77, let us expand the definition of the write
enable signal $f\Gamma_4EPCwe$ (including the functions used to pass the
precomputed signals):
\[ f\Gamma_4EPCwe(c_S^i) = c_S^i.JISR \lor (I\_movi2s(\Omega_1IR(c_S^i)) \land \langle I\_immediate(\Omega_1IR(c_S^i))[4:0] \rangle = 3) \]
One easily asserts $\Omega_1IR(c_S^i) = I(c_S^i)$, which transforms the last equation
into:
\[ f\Gamma_4EPCwe(c_S^i) = c_S^i.JISR \lor (I\_movi2s(I(c_S^i)) \land \langle I\_immediate(I(c_S^i))[4:0] \rangle = 3) \]
Using the correctness of the decoder circuit, this is transformed into:
\[ f\Gamma_4EPCwe(c_S^i) = c_S^i.JISR \lor (I\_movi2s(I(c_S^i)) \land \langle I\_immediate(I(c_S^i))[4:0] \rangle = 3) \]
By expanding the definition of $c_S^{i+1}$ on the left hand side of the claim (as
given in lemma 5.80), we get:
\[ \delta.EPC(c_S^i) \stackrel{!}{=} \begin{cases} f\Gamma_4EPC(c_S^i) & : f\Gamma_4EPCwe(c_S^i) \\ c_S^i.EPC & : \text{otherwise} \end{cases} \]
Let the write enable signal $f\Gamma_4EPCwe(c_S^i)$ be not active. In this case,
one easily asserts the claim by expanding the definition of $\delta.EPC$.
Let the write enable signal $f\Gamma_4EPCwe(c_S^i)$ be active. By expanding the
definition of $f\Gamma_4EPC$ (including the functions that pass the precomputed
value), we get:
\[ \delta.EPC(c_S^i) \stackrel{!}{=} f_1Cepc(\Omega_1IR(c_S^i), JISR(c_S^i), repeat(c_S^i), G_1GPRa(c_S^i), c_S^i.PC', G_1EPC(c_S^i)) \]
By expanding the definition of $f_1Cepc$, we get:
\[ \delta.EPC(c_S^i) \stackrel{!}{=} \begin{cases} c_S^i.PC' & : JISR(c_S^i) \land repeat(c_S^i) \\ next\_pc'\_impl(\ldots) & : JISR(c_S^i) \land \overline{repeat(c_S^i)} \\ G_1GPRa(c_S^i) & : \text{otherwise} \end{cases} \]
We handle the three cases above separately:
1. In case of an interrupt of type repeat, one concludes the claim by
   expanding the definition of $\delta.EPC$.
2. In case of any other interrupt, the claim is transformed into (we omit
   the parameter list):
   \[ \delta.EPC(c_S^i) \stackrel{!}{=} next\_pc'\_impl(\ldots) \]
   By expanding the definition of $\delta.EPC(c_S^i)$ on the left hand side and
   by applying lemma 5.68, one gets:
   \[ next\_pc'(I(c_S^i), op1(c_S^i), c_S^i.PC', c_S^i.EPC) \stackrel{!}{=} next\_pc'(I(c_S^i), G_1GPRa(c_S^i), c_S^i.PC', G_1EPC(c_S^i)) \]
   In case we have an rfe instruction, one asserts that $G_1EPC(c_S^i)$ (the
   correct value when reading EPC) is equal to $c_S^i.EPC$ because the read
   enable function holds. The claim is then easily concluded by expanding
   the definition of next_pc'.
   In case we do not have an rfe instruction, we conclude the claim by
   expanding the definition of next_pc' and by applying lemma 5.69.
3. In case we do not have an interrupt, we can conclude that we have a
   movi2s instruction because the write enable signal is active. In this
   case, the claim is easily concluded by expanding $G_1GPRa$. QED
Write Access to EDPC We perform a conditional write access to EDPC.
We already precomputed the value to be written and the write enable signal
in stage 1.
Lemma 5.81 The value written by $f_4EDPC$ is correct.
\[ c_S^{i+1}.EDPC = \begin{cases} f\Gamma_4EDPC(c_S^i) & : f\Gamma_4EDPCwe(c_S^i) \\ c_S^i.EDPC & : \text{otherwise} \end{cases} \]
The proof proceeds as the proof of lemma 5.80.
Write Access to EDATA In stage 4, we perform a conditional write access
to EDATA: in case of a data memory page fault interrupt, we write
MAR. In case of a trap instruction, we write the immediate constant. In
case of any other interrupt, we write zero. In case there is no interrupt, we
return C in order to handle movi2s with EDATA as destination.
\[ f_4EDATA(C, IR, JISR, CAdpf, CAtrap, MAR) = \begin{cases} MAR & : JISR \land CAdpf \\ I\_immediate(IR) & : JISR \land \overline{CAdpf} \land CAtrap \\ 0 & : JISR \land \overline{CAdpf} \land \overline{CAtrap} \\ C & : \text{otherwise} \end{cases} \]
\[ f_4EDATAwe(IR, JISR) = JISR \lor (I\_movi2s(IR) \land SAdec(IR)[5]) \]
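The four cases for EDATA can be written down directly as a prioritised case distinction. The Python sketch below uses hypothetical parameter names, passes the immediate constant in as a value, and encodes the negated conditions through the order of the `if` statements.

```python
def f4_edata(c: int, immediate: int, mar: int,
             jisr: bool, ca_dpf: bool, ca_trap: bool) -> int:
    """Value written into EDATA if the write enable is active."""
    if jisr and ca_dpf:
        return mar          # faulting data address
    if jisr and ca_trap:
        return immediate    # trap number from the instruction word
    if jisr:
        return 0            # any other interrupt
    return c                # movi2s with EDATA as destination
```

A data page fault takes precedence over a trap here only because the first case is tested first; whether both cause bits can be set for the same instruction is not modelled.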
The following lemma asserts the correctness of the transition function
for EDATA.
Lemma 5.82 The value written by $f_4EDATA$ is correct.
\[ c_S^{i+1}.EDATA = \begin{cases} f\Gamma_4EDATA(c_S^i) & : f\Gamma_4EDATAwe(c_S^i) \\ c_S^i.EDATA & : \text{otherwise} \end{cases} \]
The proof proceeds as the proof of lemma 5.77. However, we use MAR
as input in case of a data memory page fault.
5.12.10 Data Consistency and Liveness
One concludes the data consistency and liveness of the pipelined machine
with interrupts just as we concluded the data consistency of the pipelined
machine with branch prediction.
Note in particular that PVS almost fully automates the proofs of the
lemmas given above, which are needed in order to show the correctness of
the pipelined machine with speculation.
5.13 Literature
In the open literature, speculation is a common approach for implementing
processors without delay slot: Levitt et al. use a predict-not-taken scheme
[LO96] in a DLX implementation. Boerger and Mazzanti provide two
DLX implementations [BM96]: the first assumes an empty instruction after
jumps/branches. The second implementation stalls the instruction fetch
for one cycle. Saxe et al. [SGGH94] also use speculation.
In [VB00], Velev and Bryant extend Burch and Dill’s pipeline flushing
technique in order to automatically verify a dual-instruction issue, in-order
DLX with five stages and branch prediction. Mispredicted branches are
detected late. A generic speculation approach or a stall engine is not used.
Chapter 6
Out-of-Order Execution
6.1 Introduction
In the previous sections, we presented various implementations of
pipelined RISC processors. These implementations strictly processed
the instructions in program order. However, the performance of these
designs drops as soon as long latency instructions such as memory accesses
are involved. For example, consider a load instruction with cache miss in
the memory stage. In this case, the stall signal of the stage is activated and the
instructions above the memory stage are stalled.
Furthermore, consider an ALU instruction that follows the load in the
execute stage:
EX: R3:=R1+R2
M: R4:=Mem[R5]
If there is no data dependency, the result of the ALU instruction is al-
ready known in the execute stage and could be written into the register file.
However, the in-order execution rule prohibits this and the ALU instruction
has to wait for the load.
Thus, dropping this rule can result in better performance. This technique
is called out-of-order execution. The most popular out-of-order execution
algorithm is the Tomasulo scheduling algorithm [Tom67]. It is one of the
most competitive scheduling algorithms and provides CPI rates down to
1.1 on a single-instruction issue machine [Ger98, Del98, MLD+99]. The
algorithm is widely used, e.g., by IBM PowerPC, Intel Pentium-Pro or
AMD K5 [Mot97, CS95]. The original Tomasulo scheduler uses out-of-order
termination and therefore does not support precise interrupts without
extra hardware. We support precise interrupts by adding a reorder
buffer [SP88]. The reorder buffer sorts the instructions in program order
before termination.
[Figure 6.1: block diagram with IF, Decode/Issue, the reservation stations,
the functional units FU 1 to FU 3, the producers, the CDB, the ROB, and
the register file]
Figure 6.1 Basic structure of a microprocessor with Tomasulo Scheduler and
reorder buffer
In this chapter, we describe the results of implementing and verifying a
DLX with Tomasulo scheduler, precise interrupts and floating point unit
using PVS. The designs, the scheduling protocols, and most proofs are
taken from [KMP99, Krö99].
6.2 The Tomasulo Algorithm with Reorder Buffer
Figure 6.1 depicts the basic structure of a microprocessor with Tomasulo
scheduler and reorder buffer. The execution begins with the instruction
fetch, as in the in-order machine. The Tomasulo scheduling algorithm
does not cover this phase; it is assumed that the instruction fetch is done in
program order. We will use the very same instruction fetch mechanism as
in the pipelined in-order machines described in the previous chapters.
In the next stage, the instruction is decoded. This includes fetching the
operands if available. The instruction and the operands are then passed to
a reservation station (RS). This is called issue. The reservation stations
are the central data structure of the Tomasulo scheduling algorithm. The
reservation stations act as a queue for the instructions and are located between
the decode/issue stage and the functional units. Note that the instruction is
passed to the reservation station even if forwarding fails. This is in contrast
to the in-order machine, which stalls in this case.
As soon as all operands are available, the instruction is passed from the
reservation station to the functional unit. This is called dispatch. This
is done without obeying the program order of the instructions, i.e., the
instructions can overtake each other at this point. After the functional unit
has finished the execution, the result of the instruction is passed to a special
register, called producer.
In case the producer holds an instruction, it requests a result bus, called
common data bus (CDB). As soon as the request is acknowledged, the
result is put on this bus. This is called completion. In contrast to commercial
designs such as IBM’s PowerPC, we support only one CDB. The bus
is used for two purposes: 1) The instruction is passed to the reservation
stations that wait for the result because of a data dependency, and 2) the
result is passed to the reorder buffer.
The reorder buffer re-sorts the instructions back in program order. The
benefit of this is that we can write the results into the register file in pro-
gram order (in-order termination). This allows precise interruptions of the
instruction stream.
In the following sections, we will describe the data structures and proto-
cols used to realize this in detail.
6.3 Tomasulo Data Structures
6.3.1 Reorder Buffer
The reorder buffer [SP88] is a ring-buffer that serves two purposes in a
machine with Tomasulo scheduler. The main purpose is to re-sort the
instructions such that the instructions terminate in program order. For that
purpose, each reorder buffer entry provides space to store the result of an
instruction. We support instructions that write multiple registers. This is
useful for supporting double precision floating point instructions.
[Figure 6.2: ring of eight reorder buffer entries numbered 0 to 7, holding the
instructions I2 to I5, with the pointers ROBhead and ROBtail]
Figure 6.2 Illustration of the reorder buffer pointers
Furthermore, each reorder buffer entry has a valid bit. The bit indicates
that the result of the instruction is in the reorder buffer entry. A reorder
buffer entry with active valid bit is called valid reorder buffer entry.
The second purpose of the reorder buffer is to provide means to assign
a tag to each instruction. The tag is assigned during instruction issue and
stays unique until the instruction terminates. The tag is the address of the
reorder buffer entry of the instruction. Let ϑ denote the number of tag (i.e.,
ROB address) bits. Thus, the reorder buffer has
\[ \Theta := 2^{\vartheta} \]
entries. We denote the value of the ROB entry with address tag during
cycle T with $ROB[tag]^T$.
The reorder buffer is accessed using two pointers, the head and tail pointers.
These pointers are stored in ϑ-bit registers. We denote the value of
the head pointer during cycle T by $ROBhead^T$, and the value of the tail
pointer by $ROBtail^T$. Instructions are put in the ROB entry ROBtail points
to, and removed from the entry ROBhead points to. After an instruction is
put in the ROB, the ROBtail pointer is increased. After an instruction is
removed from the ROB, the ROBhead pointer is increased. The pointers
wrap around if they reach the end of the ROB. This is illustrated in figure
6.2.
Let issue(T) denote that we issue an instruction during cycle T. This
allows defining the values of ROBtail recursively. We initialize the ROB
pointers with zero. The ROBtail pointer is increased iff we issue an in-
struction.
ROBtailT :=
8
<
:
0 : T = 0
ROBtailT 1 +1 : issue(T  1)
ROBtailT 1 : otherwise
Note that the incrementation for the case issue(T  1) holds is a bitvec-
tor operation as described in chapter 2. Thus, the ROBtail pointer wraps
around.
In analogy to that, let writeback(T ) denote that we terminate an instruc-
tion during cycle T . This allows defining the values of ROBhead recur-
sively.
ROBheadT :=
8
<
:
0 : T = 0
ROBheadT 1 +1 : writeback(T  1)
ROBheadT 1 : otherwise
As above, the increment in the case $writeback(T-1)$ is a
bitvector operation as described in chapter 2. Thus, the ROBhead pointer
wraps around.
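The two recursions can be illustrated with a small Python sketch. It is not part of the PVS model: the function name `rob_pointers`, the parameter `theta` (standing for ϑ), and the boolean traces `issue` and `writeback` are assumptions made for this example only.

```python
def rob_pointers(theta, issue, writeback):
    """Return lists (tail, head) with tail[T], head[T] for T = 0..len(issue).

    issue[T] / writeback[T] are booleans for cycle T (hypothetical traces).
    The increment is taken modulo 2**theta, mimicking a theta-bit register.
    """
    size = 1 << theta          # Theta = 2**theta ROB entries
    tail, head = [0], [0]      # both pointers are initialized with zero
    for T in range(len(issue)):
        tail.append((tail[T] + 1) % size if issue[T] else tail[T])
        head.append((head[T] + 1) % size if writeback[T] else head[T])
    return tail, head


# Example: a 2-bit ROB (4 entries), five issues, two writebacks.
tail, head = rob_pointers(2, [True] * 5, [False, True, True, False, False])
print(tail)  # [0, 1, 2, 3, 0, 1]
print(head)  # [0, 0, 1, 2, 2, 2]
```

The tail pointer returns to 0 after four issues, which is the wrap-around behaviour described above.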
6.3.2 Register File Extensions
As before, the register file holds the values of the specification registers of
the machine. We still denote the set of registers by R (in PVS, we just
number the registers). We denote the value of the register $r \in R$ during
cycle T by $R[r]^T.data$. We assume that all registers have a common width.
We denote the set of possible values of a register by W (R).
The register file is extended with a producer table. The producer table
records which instruction in the machine writes a given register. For that
purpose, the producer table contains two data items for each register.
The first is a valid bit. We denote the value of the valid bit of register r
during cycle T by $R[r]^T.valid$. If it is set, there is no instruction currently
executing with the register as destination. If it is not set, there is such an
instruction. In this case, the second item, a reorder buffer tag, points to the
last instruction with the register as destination. We denote the value of this
tag by $R[r]^T.tag$.
6.3.3 Reservation Stations
The reservation stations act as a queue for the instructions and their source
operands. We give each reservation station a number. We denote the values
in reservation station number rs during cycle T by $RS[rs]^T$. Each reservation
station has a full bit RS[rs].full. It indicates that the reservation station is in
use. In addition to that, we store the tag of the instruction in the reservation
station in RS[rs].tag.
We support instructions with an arbitrary number of source operands.
Let x denote the number of a source operand. For each source operand,
we store a valid bit RS[rs].op[x].valid. If the bit is set, the value of the
operand is stored in RS[rs].op[x].data. If it is not set, we store the tag of
the instruction producing the value in RS[rs].op[x].tag.
6.3.4 Producers
The producers buffer the results from the function units until the CDB
is available. We have a separate producer for each function unit. Each
producer consists of a full bit, a tag, and the result. We denote these items
of producer fu by P[fu].full, P[fu].tag, and P[fu].result.
6.3.5 Initial Configuration
We make the following assumptions about the initial values of these registers.
• The valid bits of the registers must be set in the initial configuration.
  We do not make an assumption on the values of the registers or the
  tags.
• The full bits of the reservation stations must not be set. We do not
  make any assumptions about the other values in the reservation stations.
• The full bits of the producers must not be set. We do not make any
  assumptions about the other values in the producers.
It is important that we do not make too many assumptions on initial values,
since realizing fixed initial values in hardware is expensive. For example,
assuming initial values of a register usually prohibits implementing the
register as RAM. In particular, note that we do
not make any assumption about the initial values of the ROB entries.
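The data structures of sections 6.3.2 to 6.3.4 and the initial configuration can be summarized in a Python sketch. The class and field names are chosen for this illustration, results are reduced to a single integer instead of the parts result[0], result[1], and so on, and the default values of fields without an initial-value assumption are arbitrary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Register:
    """Register file entry together with its producer table items."""
    data: int = 0        # arbitrary: no assumption on the initial value
    valid: bool = True   # must be set in the initial configuration
    tag: int = 0         # ROB tag of the last writer; arbitrary initially


@dataclass
class Operand:
    valid: bool = False
    data: int = 0        # meaningful only if valid is set
    tag: int = 0         # tag of the producing instruction otherwise


@dataclass
class ReservationStation:
    full: bool = False   # must not be set in the initial configuration
    tag: int = 0
    op: List[Operand] = field(default_factory=list)


@dataclass
class Producer:
    full: bool = False   # must not be set in the initial configuration
    tag: int = 0
    result: int = 0      # simplification: one result part only


# Hypothetical machine with 4 registers, 2 reservation stations, 2 producers.
regs = [Register() for _ in range(4)]
stations = [ReservationStation(op=[Operand(), Operand()]) for _ in range(2)]
producers = [Producer() for _ in range(2)]
```

Only the `valid` and `full` defaults encode assumptions from the text; everything else could be initialized with any value.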
6.4 Tomasulo Protocols
6.4.1 Formalization
In this section, we describe the protocols of the Tomasulo scheduling
algorithm. These protocols form the transition function of a generic and
abstract microprocessor with Tomasulo scheduler. The configuration set
of this machine comprises the reservation stations, the reorder buffer
including the pointers, the register files, the producers, and the producer
tables.
We denote the configuration of this machine during cycle T by $c^T_{aI}$ (abstract
implementation).
The transition function of the machine is denoted by $\delta_{aI}$. It maps the
configuration of the machine during cycle T to the next configuration of
the machine during cycle T + 1. We will compose this function using
functional specifications of the Tomasulo protocols, which are issue, CDB
snooping, dispatch, completion, and writeback. We name the functions for
these protocols issue, snoop, dispatch, completion, and writeback. These
functions are called protocol functions.
\[ \delta_{aI} := issue \circ snoop \circ dispatch \circ completion \circ writeback \]
Thus, the issue protocol has priority over CDB snooping and so on. This
is important if two protocols change the same register value in the same
cycle. The final value in the register is the value provided by the proto-
col with the higher priority. We omit the transition function for the ROB
pointers, since we already specified the values of those pointers above.
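Why the composition order yields a priority can be seen in a small Python sketch. The dictionary-based configuration and the two toy protocol functions below are hypothetical stand-ins, not the protocol functions of the thesis: the leftmost function of the composition is applied last, so its write survives.

```python
from functools import reduce


def compose(*fs):
    """compose(f, g, h)(c) == f(g(h(c))): the rightmost function runs first."""
    return lambda c: reduce(lambda acc, f: f(acc), reversed(fs), c)


# Toy protocol functions on a dict-based configuration (illustration only).
def issue(c):
    # issue clears the valid bit of a destination register "r1"
    return {**c, "r1.valid": 0}


def writeback(c):
    # writeback sets the valid bit of "r1"
    return {**c, "r1.valid": 1}


delta = compose(issue, writeback)   # issue is applied last, so it wins
print(delta({"r1.valid": 1}))       # {'r1.valid': 0}
```

With the order reversed, `compose(writeback, issue)`, the valid bit would end up set, i.e., writeback would have priority.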
Notation We specify the protocols using a notation similar to the nota-
tion used in [KMP99]. The notation is also very similar to the notation
used in PVS. Consider the following example:
    R[4].data := R[3].data
This is a shorthand for $R[4]^{T+1}.data = R[3]^T.data$.
As before, we consider a stream of instructions $I_0, I_1, \ldots$. Each instruction
has source and destination registers. By S(i, x), we denote the number
of the register that is source operand x of instruction $I_i$. By D(i, x), we
denote the number of the register that is destination operand x.
By dest(i, r), we denote the fact that instruction $I_i$ has r as destination
register, i.e., that there is an x with D(i, x) = r.
Embedding Convention In a machine with Tomasulo Scheduler and re-
order buffer, there are different places where results are stored or propa-
gated before writing the results into the register file. These are the pro-
ducers, the CDB, and the ROB. We support multiple destination registers
for a single instruction. By convention, each destination register is on a
well-defined part of the result bus or registers. For example, consider the
DLX with floating point instructions. That machine has a maximum of
three results for each instruction. Thus, the result busses and registers have
space for three 32-bit registers, result[0], result[1], and result[2].
In case of the DLX, we embed the results as follows: By convention, all
floating point registers with odd numbers are on result[1], all other “nor-
mal” registers are on result[0]. In order to handle exceptions, we define a
dummy register CA, which is on result[2]. This allows handling the IEEE
flags register and exceptions.
For example, the result of a double precision floating point instruction
with destination register FPR0 is embedded as follows: The lower part
of the result, i.e., the part that is written into FGR0, is on result[0]. The
higher part, i.e., the part that is written into FGR1, is on result[1]. The
exceptions/IEEE flags are on result[2].
Formally, we define an embedding function. Let d denote the maximum
number of destination operands. The embedding function e maps a register
to a number in $\{0, \ldots, d-1\}$. Thus, destination register r is on result[e(r)].
6.4.2 Issue
Let $I_i$ be the instruction to be issued during cycle T (figure 6.3). The first
step is to invalidate the destination registers of instruction $I_i$. Thus, we clear
the valid bit of all registers R[r] with dest(i, r) and set the tag of register r
to $ROBtail^T$.
In contrast to the issue protocols given in [MPK00], we cover two differ-
ent ways to issue an instruction: the first way is as described in [MPK00]
and as done by the original Tomasulo scheduling algorithm. During is-
sue, the instruction is stored in a reservation station along with the source
operands that are available.
The second way is to skip the reservation stations and to store the result
of the instruction in the reorder buffer directly. This speeds up the execu-
tion of simple instructions. Examples for this are branches, jumps, and the
trap instruction.
The result of these instructions is already known in the issue stage. We
indicate these instructions by the predicate issue_with_result(i). In case of
such an instruction, the reservation stations are not modified by the issue
protocol. However, we set the valid bit of the ROB entry ROBtail points
to and store the result in the result data item. We denote this result by
issue_result(i). For example, this could be the PC address in case of a
jump-and-link instruction.
Machines that support instructions that are directly issued into the ROB
are usually not covered in the open literature. The Tomasulo implementation
in [Krö99] uses this feature. However, the proof does not cover it.
In case issue_with_result(T) does not hold, we clear the valid bit of the
ROB entry $ROBtail^T$. Let issue_rs(T, rs) hold iff reservation station rs
is used for issue during cycle T. We initialize this reservation station as
follows: we set the full bit of the reservation station and store the ROBtail
pointer in the tag data item. Besides the full bit and tag, the reservation
station holds the source operands.
The Tomasulo scheduling algorithm with reorder buffer supports differ-
ent places to forward the source operands from. For each operand of the
instruction three sources have to be checked:
1. The operand might be in the register file. In this case, the valid bit
of the register is set. If it is not in the register file, the producer table
provides the tag of the last instruction writing it.
2. The operand might be on the CDB. In order to determine which
instruction is on the CDB, the result on the CDB comes with a valid
bit and a tag. If the valid bit is set, the tag indicates the instruction
on the CDB. Thus, we check the valid bit and compare the tag on the
CDB with the tag from the producer table. If they match, we take
the result on the CDB as source operand according to the embedding
convention.
3. The operand might be in the reorder buffer. This is indicated by the
valid bit of the reorder buffer entry that the tag in the producer table
points to. If the bit is set, we take the result from the ROB according
to the embedding convention.
If none of the three cases above applies, the source register is the desti-
nation of a preceding, incomplete instruction. The tag of this instruction is
in the producer table, and instead of the operand, the tag of this instruction
is stored in the reservation station.
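The operand lookup described above can be sketched in Python as follows. The dictionary layout and the name `fetch_operand` are assumptions of this sketch; the order of the checks follows the three cases in the text.

```python
def fetch_operand(reg, cdb, rob, e_r):
    """Return the operand record to be stored in the reservation station.

    reg: dict with keys valid, data, tag (register file + producer table)
    cdb: dict with keys valid, tag, result (list indexed by the embedding)
    rob: list of dicts with keys valid, result
    e_r: embedding index e(r) of the source register
    """
    if reg["valid"]:                                  # 1. register file
        return {"valid": True, "data": reg["data"]}
    if cdb["valid"] and cdb["tag"] == reg["tag"]:     # 2. CDB
        return {"valid": True, "data": cdb["result"][e_r]}
    entry = rob[reg["tag"]]
    if entry["valid"]:                                # 3. reorder buffer
        return {"valid": True, "data": entry["result"][e_r]}
    # None of the cases applies: remember the tag of the producer instead.
    return {"valid": False, "tag": reg["tag"]}


rob = [{"valid": False, "result": [0]}, {"valid": True, "result": [7]}]
cdb = {"valid": True, "tag": 0, "result": [5]}
print(fetch_operand({"valid": False, "data": 3, "tag": 1}, cdb, rob, 0))
# {'valid': True, 'data': 7}
```

In the example, the register is not valid and the CDB carries a different tag, so the value is taken from the valid ROB entry with tag 1.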
6.4.3 CDB Snooping
During issue, the operands in the reservation station that are not available
are marked as not valid. On completion, the result of an operation is put
on the CDB. Instructions in the reservation stations, which depend on this
result, read the operand data from the CDB (figure 6.4). The reservation
stations identify the results by comparing the tag on the CDB with the tag
in the reservation station.
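A minimal Python sketch of CDB snooping for one reservation station is given below. The dictionary layout, the name `snoop`, and the list `e_of_operand` (the embedding index of each source operand) are assumptions of this sketch. It also checks the valid bit of the CDB explicitly, which is an assumption made here in analogy to the issue protocol.

```python
def snoop(rs, cdb, e_of_operand):
    """Update the operands of one reservation station from the CDB.

    rs:  {"full": bool, "op": [{"valid", "data", "tag"}, ...]}
    cdb: {"valid": bool, "tag": int, "result": [...]}
    e_of_operand[x]: embedding index of source operand x (assumed given)
    """
    if not (rs["full"] and cdb["valid"]):
        return rs
    ops = []
    for x, op in enumerate(rs["op"]):
        if not op["valid"] and op["tag"] == cdb["tag"]:
            # The awaited result is on the CDB: read it and mark it valid.
            op = {"valid": True,
                  "data": cdb["result"][e_of_operand[x]],
                  "tag": op["tag"]}
        ops.append(op)
    return {**rs, "op": ops}


rs = {"full": True,
      "op": [{"valid": False, "data": 0, "tag": 2},
             {"valid": True, "data": 4, "tag": 0}]}
cdb = {"valid": True, "tag": 2, "result": [11, 12]}
out = snoop(rs, cdb, [1, 0])
print(out["op"][0])  # {'valid': True, 'data': 12, 'tag': 2}
```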
6.4.4 Dispatch
During instruction dispatch (figure 6.5), an instruction moves from a
reservation station entry into the actual function unit. We denote this fact by the
predicate dispatch(T, rs). If the predicate holds, the instruction in
reservation station rs is dispatched during cycle T.
The reservation stations that are dispatched are determined by the hardware
using a fair arbiter, which selects only full reservation stations with valid
operands. Thus, we can assume that the reservation stations rs that are
dispatched are full and have valid operands:
\[ dispatch(T, rs) \implies RS[rs]^T.full \wedge \forall x : RS[rs]^T.op[x].valid \]

    if issue(T) then
    {
      RS[rs].full := 1;
      RS[rs].tag := ROBtail;
      For all source operands x of I_i, let r be S(i, x):
        if R[r].valid then
          RS[rs].op[x] := R[r];
        elsif CDB.tag = R[r].tag ∧ CDB.valid then
          RS[rs].op[x].valid := 1;
          RS[rs].op[x].data := CDB.result[e(r)];
        elsif ROB[R[r].tag].valid then
          RS[rs].op[x].valid := 1;
          RS[rs].op[x].data := ROB[R[r].tag].result[e(r)];
        else
          RS[rs].op[x].valid := 0;
          RS[rs].op[x].tag := R[r].tag;
        endif
      For all registers r with dest(i, r):
        R[r].tag := ROBtail;
        R[r].valid := 0;
    }
Figure 6.3 Issue protocol for issuing instruction $I_i$ during cycle T.

    ∀ operands x of instruction I_i:
    if RS[rs].full ∧ ¬RS[rs].op[x].valid ∧
       (RS[rs].op[x].tag = CDB.tag)
    {
      RS[rs].op[x].valid := 1;
      RS[rs].op[x].data := CDB.result[e(S(i, x))];
    }
Figure 6.4 CDB snooping protocol for instruction $I_i$ in reservation station rs

    if dispatch(T, rs) then
    {
      Pass instruction, operands,
      and tag to FU
      RS[rs].full := 0;
    }
Figure 6.5 Dispatch protocol

In addition to passing the instruction to the function unit, the reservation
station is freed during dispatch. Note that clearing the full bit may conflict
with setting the full bit as done by the issue protocol. Since the issue
protocol has priority, the full bit is set in this case.
6.4.5 Completion
During completion (figure 6.6), the result and the ROB tag in a producer
P[fu] are put on the CDB. Let the predicate completion(T) hold iff the
machine completes an instruction during cycle T. Let fu = compl_p(T) denote
the number of the producer that holds that instruction. That number is determined
by the hardware among the full producers using a fair arbiter. Thus, we
can assume that the producer is full:
\[ completion(T) \implies P[compl\_p(T)]^T.full \]
During completion, the corresponding reorder buffer entry is filled with the
result and the valid bit is set. Let $FU[fu]^T.valid$ denote that function
unit fu provides a result during cycle T. Let $FU[fu]^T.result$ denote that
result. Let $FU[fu]^T.tag$ denote the tag that accompanies the result.
If the function unit provides a new result, this result is stored in the
producer. Otherwise, the full bit of the producer is cleared if the producer
is the one that completes during this cycle.
    if completion(T) then
    {
      CDB^T.valid = 1;
      CDB^T.result = P[compl_p(T)].result;
      CDB^T.tag = P[compl_p(T)].tag;
      ROB[CDB^T.tag].valid := 1;
      ROB[CDB^T.tag].result := CDB^T.result;
    }
    ∀ function units fu:
      if FU[fu]^T.valid then
      {
        P[fu].full := 1;
        P[fu].result := FU[fu]^T.result;
        P[fu].tag := FU[fu]^T.tag;
      }
      elsif completion(T) ∧ compl_p(T) = fu then
        P[fu].full := 0;
      endif
Figure 6.6 Completion protocol
    if writeback(T) then
      for all registers r with dest(i, r):
      {
        R[r].data := ROB[ROBhead].result[e(r)];
        if ROBhead = R[r].tag then
          R[r].valid := 1;
      }
Figure 6.7 Retirement / writeback protocol for instruction $I_i$.
6.4.6 Writeback
During writeback (figure 6.7), the result of the instruction in the ROB entry
that ROBhead points to is written into the register file. As introduced
above, we denote this fact by the predicate writeback(T). We assume that
writeback is done iff the ROB entry is valid and the ROB is not empty. Let
ROBempty(T) denote that the ROB is empty during cycle T. It is defined
in section 6.5.3.
\[ writeback(T) \iff \neg ROBempty(T) \wedge ROB[ROBhead^T]^T.valid \]
During writeback, we store the result from the ROB entry in the registers.
Furthermore, we set the valid bit of the register if the tag of the instruction
matches the tag in the producer table.
Note that setting the valid bit may conflict with clearing the valid bit
during issue. As described above, the issue protocol has priority over the
writeback protocol, i.e., the setting of the valid bit is suppressed.
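The register file update during writeback can be sketched in Python. The dictionary encoding of the register file, the name `writeback_step`, and the embedding passed as a function `e` are assumptions of this illustration. The sketch covers only the writeback protocol itself, not the priority of the issue protocol.

```python
def writeback_step(regs, rob_entry, rob_head, dests, e):
    """Apply the writeback protocol and return the new register file.

    regs:      dict r -> {"data", "valid", "tag"}
    rob_entry: the ROB entry ROBhead points to, {"result": [...]}
    rob_head:  value of the ROBhead pointer
    dests:     registers r with dest(i, r)
    e:         embedding function, maps a register to a result index
    """
    new = {r: dict(v) for r, v in regs.items()}
    for r in dests:
        new[r]["data"] = rob_entry["result"][e(r)]
        if regs[r]["tag"] == rob_head:
            # No later instruction with destination r is in flight.
            new[r]["valid"] = True
    return new


regs = {1: {"data": 0, "valid": False, "tag": 2},
        2: {"data": 0, "valid": False, "tag": 3}}
out = writeback_step(regs, {"result": [9]}, 2, [1, 2], lambda r: 0)
print(out[1])  # {'data': 9, 'valid': True, 'tag': 2}
print(out[2])  # {'data': 9, 'valid': False, 'tag': 3}
```

Register 2 stays invalid in the example because its producer table entry points to a later instruction with tag 3.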
6.5 Data Consistency
6.5.1 Scheduling Functions
We need a formal way to state that “instruction $I_i$ is being issued during
cycle T” or “instruction $I_i$ is being dispatched during cycle T”. We do this
in analogy to the previous chapters using a scheduling function. While this
concept was introduced for in-order machines by [MP00], we extend it to
out-of-order machines in the obvious way.
Issue We recursively define a function sIissue that maps a cycle T to
the number of the instruction that is in the issue stage. Since we issue in
program order, that number increases by one in case that issue(T) holds
and stays unmodified otherwise. We start with instruction $I_0$.
\[
sIissue(T) :=
\begin{cases}
0 & : T = 0 \\
sIissue(T-1) + 1 & : issue(T-1) \\
sIissue(T-1) & : \text{otherwise}
\end{cases}
\]
Reservation Stations We also desire a way to define the instruction in a
given reservation station rs during a given cycle T. We do this by defining
a schedule function sIRS(rs, T) for reservation stations. Instructions are
put in a reservation station during issue. In case an instruction is issued
into reservation station rs, we take the value of $sIissue(T-1)$. Otherwise,
the value of sIRS(rs, T) remains unchanged.
\[
sIRS(rs, T) :=
\begin{cases}
0 & : T = 0 \\
sIissue(T-1) & : issue\_rs(T-1, rs) \\
sIRS(rs, T-1) & : \text{otherwise}
\end{cases}
\]
Note that the only point we put an instruction into a reservation station
is during issue. This is in contrast to the implementation given in [Krö99],
which moves the instructions from one reservation station into the next.
Reorder Buffer In analogy to the schedule of the reservation stations,
we can provide a schedule for the ROB. The function sIROB(tag, T) denotes
the instruction that is in the ROB entry with tag tag during cycle T.
We start with −1, which denotes that no instruction is in the ROB entry.
We need this special value because the ROB entries have nothing like
a full bit.
\[
sIROB(tag, T) :=
\begin{cases}
-1 & : T = 0 \\
sIissue(T-1) & : issue(T-1) \wedge tag = ROBtail^{T-1} \\
sIROB(tag, T-1) & : \text{otherwise}
\end{cases}
\]
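As an illustration, the schedules sIissue and sIROB can be computed from a given issue trace. The following Python sketch uses hypothetical names (`schedules`, `si_issue`, `si_rob`) and takes the trace `issue` and the number of tag bits `theta` as inputs. It is a direct transcription of the two recursions, not the PVS definition.

```python
def schedules(issue, theta):
    """Compute sIissue[T] and sIROB[tag][T] for T = 0..len(issue)."""
    size = 1 << theta                       # number of ROB entries
    si_issue, tail = [0], [0]
    si_rob = [[-1] for _ in range(size)]    # -1: no instruction in the entry
    for T in range(1, len(issue) + 1):
        for tag in range(size):
            hit = issue[T - 1] and tag == tail[T - 1]
            si_rob[tag].append(si_issue[T - 1] if hit else si_rob[tag][T - 1])
        step = 1 if issue[T - 1] else 0
        si_issue.append(si_issue[T - 1] + step)
        tail.append((tail[T - 1] + step) % size)   # bitvector increment
    return si_issue, si_rob


si_issue, si_rob = schedules([True, False, True], theta=1)
print(si_issue)  # [0, 1, 1, 2]
print(si_rob)    # [[-1, 0, 0, 0], [-1, -1, -1, 1]]
```

In the example, instruction 0 is placed in ROB entry 0 and instruction 1 in ROB entry 1, matching the tags assigned by the tail pointer.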
Function Units Let dispatch_fu(T, fu) denote the number of the reservation
station that is used for dispatching an instruction to function unit
fu during cycle T. In hardware, this number is represented in unary using
dispatch(T, rs).
Let sIdispatch(fu, T) denote the number of the instruction passed to
function unit fu during cycle T. This is defined using the schedule of the
reservation station.
\[ sIdispatch(fu, T) := sIRS(dispatch\_fu(T, fu), T) \]
We also define schedules for the functional units. Let sIfu(fu, T) denote
the number of the instruction that leaves function unit fu during cycle
T. The simplest functional unit is a combinatorial functional unit that
calculates its result within the same cycle the arguments are passed. The
32-bit ALU presented in chapter 2 is an example. For such a function unit,
sIfu(fu, T) is just:
\[ sIfu(fu, T) := sIdispatch(fu, T) \]
In case of more complex function units such as floating point dividers,
one has to construct a scheduling function. There are two ways to do so:
1) one constructs the function such that it matches the pipeline structure
of the functional unit, and 2) one defines the schedule using the tags the
function unit provides.
As an example for the first method, consider a function unit with four
stages and a cycle that allows iterating the instruction in stage 2 (figure
6.8). We denote the instruction in stage k of the function unit fu during
cycle T by sIfu(k, T). The instruction in stage 0 of the function unit is the
instruction that is dispatched:
\[ sIfu(0, T) := sIdispatch(fu, T) \]
This instruction proceeds into stage 1 iff the update enable signal $ue_{fu,0}$
is active. This update enable signal is local to the function unit fu.
\[
sIfu(1, T) :=
\begin{cases}
0 & : T = 0 \\
sIfu(0, T-1) & : ue^{T-1}_{fu,0} = 1 \\
sIfu(1, T-1) & : \text{otherwise}
\end{cases}
\]
This must be changed for stage 2, the stage with the back-cycle.
\[
sIfu(2, T) :=
\begin{cases}
0 & : T = 0 \\
sIfu(1, T-1) & : ue^{T-1}_{fu,1} = 1 \wedge sel^{T-1}_{1} = 0 \\
sIfu(2, T-1) & : ue^{T-1}_{fu,1} = 1 \wedge sel^{T-1}_{1} = 1 \\
sIfu(2, T-1) & : \text{otherwise}
\end{cases}
\]
[Figure 6.8: function unit with stages f0 to f3 between reservation station and producer, update enable signals $ue_{fu,0}$, $ue_{fu,1}$, $ue_{fu,2}$, select signal $sel_1$, and schedules sIfu(0, T) to sIfu(3, T)]
Figure 6.8 Construction of the scheduling function for a function unit with cycles
For stage 3 of the function unit, the scheduling function is defined in
analogy to the scheduling function of stage 1:
\[
sIfu(3, T) :=
\begin{cases}
0 & : T = 0 \\
sIfu(2, T-1) & : ue^{T-1}_{fu,2} = 1 \\
sIfu(3, T-1) & : \text{otherwise}
\end{cases}
\]
Since this is also the last stage of the function unit fu, we have
\[ sIfu(fu, T) := sIfu(3, T) \]
Producers In analogy to the scheduling function of the reservation stations,
we define the scheduling function of the producer registers. We
denote the number of the instruction in producer number fu during cycle
T by sIP(fu, T). In case the function unit provides a result, we take the
value from the schedule of the function unit as defined above. If not so, the
value of sIP(fu, T) does not change.
\[
sIP(fu, T) :=
\begin{cases}
0 & : T = 0 \\
sIfu(fu, T-1) & : FU[fu]^{T-1}.valid = 1 \\
sIP(fu, T-1) & : \text{otherwise}
\end{cases}
\]
As described above, the instruction in the producer with the number given
by compl_p(T) is put on the CDB during completion. We therefore define
the following shorthand for the instruction on the CDB during cycle T:
\[ sICDB(T) := sIP(compl\_p(T), T) \]
Writeback In analogy to sIissue, we recursively define a scheduling
function sIwriteback that maps a cycle T to the number of the instruction
that is in the writeback stage. Since we write back in program order,
that number increases by one in case that writeback(T) holds and stays
unmodified otherwise. We start with instruction $I_0$.
\[
sIwriteback(T) :=
\begin{cases}
0 & : T = 0 \\
sIwriteback(T-1) + 1 & : writeback(T-1) \\
sIwriteback(T-1) & : \text{otherwise}
\end{cases}
\]
6.5.2 Function Unit Axioms
In this section, we describe the assumptions we make regarding data con-
sistency properties of the functional units. We consider the functional units
as a “black box”. In particular, we do not provide implementations for data
memory or floating point function units. The design and verification of a
data memory function unit including virtual memory is the subject of the thesis
of Sven Beyer [Bey01]. The design and verification of an IEEE compliant
floating point unit including a divider is the subject of the thesis of Christian Jacobi
[Jac01].
Inputs and Outputs As described above, $FU[fu]^T.valid$ indicates that
function unit fu provides a result during cycle T. $FU[fu]^T.tag$ denotes
the tag the function unit provides, and $FU[fu]^T.result$ denotes the result
the function unit provides.
Let fuins(fu, T) denote the inputs of function unit fu during cycle T.
This is defined as follows: Let rs be a shorthand for dispatch_rs(T, fu).
This is the reservation station that is used for dispatching to function unit
fu.
\[
\begin{aligned}
fuins(fu, T).valid &:= dispatch\_rs(T, rs) \\
fuins(fu, T).tag &:= RS[rs]^T.tag \\
fuins(fu, T).source[x] &:= RS[rs]^T.op[x].data
\end{aligned}
\]
Tag Consistency Given that the function unit gets correct tags as inputs
up to cycle T, we assume that the function unit provides the correct tag of
the instruction as output during cycle T.
We formalize “gets correct tags as inputs up to cycle T” as follows:
\[ \forall T' \le T : fuins(fu, T').valid \implies fuins(fu, T').tag = I\_tag(sIdispatch(fu, T')) \]
We formalize “provides the correct tag of the instruction” as follows:
\[ FU[fu]^T.valid \implies FU[fu]^T.tag = I\_tag(sIfu(fu, T)) \]
Operand Consistency Given that the function unit gets correct source
operands as inputs up to cycle T, we assume that the function unit provides
the correct results of the instruction as output during cycle T.
We formalize “gets correct source operands as inputs up to cycle T” as
follows:
\[ \forall T' \le T : fuins(fu, T').valid \implies fuins(fu, T').source = source(sIdispatch(fu, T')) \]
We formalize “provides the correct results of the instruction” as follows:
\[ FU[fu]^T.valid \implies FU[fu]^T.result = result(sIfu(fu, T)) \]
Phase Consistency In order to show data consistency, we have to argue
that the function units do not generate “garbage output”. We assume two
things: 1) if an instruction leaves the function unit, it entered it before,
and 2) if instructions up to cycle T enter the function unit at most once, the
instructions leave the function unit at most once.
We formalize this as follows: Let in(i, T, fu) denote that instruction $I_i$
enters the function unit fu during cycle T.
\[ in(i, T, fu) :\iff fuins(fu, T).valid \wedge sIdispatch(fu, T) = i \]
In analogy to that, let out(i, T, fu) denote that instruction $I_i$ leaves the
function unit fu during cycle T.
\[ out(i, T, fu) :\iff FU[fu]^T.valid \wedge sIfu(fu, T) = i \]
If instruction $I_i$ leaves function unit fu during cycle T, there must be a
cycle $T' \le T$ such that it entered the function unit:
\[ out(i, T, fu) \implies \exists T' \le T : in(i, T', fu) \]
If the cycle $T' \le T$ such that instruction $I_i$ enters the function unit during
cycle $T'$ is unique, then the cycle $T'' \le T$ such that instruction $I_i$ leaves the
function unit during cycle $T''$ is unique.
\[ |\{T' \le T \mid in(i, T', fu)\}| = 1 \implies |\{T'' \le T \mid out(i, T'', fu)\}| = 1 \]
We do not make further assumptions regarding data consistency. In particular,
this allows that the latency of the function unit is variable and that
instructions leave the function unit in an order different from the dispatch order.
We make further assumptions on the function units in order to show
liveness. We describe these assumptions later.
6.5.3 ROB Flags
We need means to determine whether the reorder buffer is full or not. For
this purpose, we take the circuit from [Lei99]. It uses a ϑ+1 bit counter
register. The counter is incremented if we issue an instruction and do not
write back one simultaneously. This is indicated by ROBinc(T).
\[ ROBinc(T) = issue(T) \wedge \neg writeback(T) \]
In analogy to that, ROBdec(T) indicates that we decrement the counter.
This is done if we write back an instruction but do not issue one simultaneously.
\[ ROBdec(T) = \neg issue(T) \wedge writeback(T) \]
Thus, the value of the counter register during cycle T is defined as follows:
\[
ROBcount(T) :=
\begin{cases}
0^{\vartheta+1} & : T = 0 \\
ROBcount(T-1) + 1 & : ROBinc(T-1) \\
ROBcount(T-1) - 1 & : ROBdec(T-1) \\
ROBcount(T-1) & : \text{otherwise}
\end{cases}
\]
The ROB is empty iff the counter is zero:
\[ ROBempty(T) = (ROBcount(T) = 0^{\vartheta+1}) \]
The ROB is full iff the counter is the number of ROB entries Θ. We use
the binary encoding of Θ.
\[ ROBfull(T) = (ROBcount(T) = 1\,0^{\vartheta}) \]
We make the following assumptions:
• If we issue an instruction without simultaneous writeback, the ROB
  must not be full.
  \[ ROBinc(T) \implies \neg ROBfull(T) \]
• If we write back an instruction, the ROB must not be empty.
  \[ writeback(T) \implies \neg ROBempty(T) \]
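The counter and the two flags can be sketched in Python. This is an illustration under the assumption that the counter is modelled as an integer modulo $2^{\vartheta+1}$; the names `rob_count`, `rob_empty`, and `rob_full` are chosen for this example and do not refer to the circuit from [Lei99].

```python
def rob_count(theta, issue, writeback):
    """Return the counter values ROBcount(0..n) as integers."""
    mod = 1 << (theta + 1)                 # the register has theta+1 bits
    count = [0]
    for i, w in zip(issue, writeback):
        inc = i and not w                  # ROBinc(T)
        dec = w and not i                  # ROBdec(T)
        count.append((count[-1] + inc - dec) % mod)
    return count


def rob_empty(c):
    return c == 0


def rob_full(c, theta):
    return c == (1 << theta)               # binary: 1 followed by theta zeros


counts = rob_count(1, [True, True, False, True], [False, False, True, True])
print(counts)                  # [0, 1, 2, 1, 1]
print(rob_full(counts[2], 1))  # True
```

The extra counter bit is what distinguishes a full ROB (count Θ) from an empty one (count 0), although head and tail pointers coincide in both cases.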
6.5.4 ROB Properties
Definition 6.1 Let $tag \oplus i$ be a shorthand for a tag that is incremented i times. Formally,
this is defined using a recursion and the bit-vector incrementation as defined
in chapter 2:
\[
tag \oplus i :=
\begin{cases}
tag & : i = 0 \\
(tag \oplus (i-1)) + 1 & : \text{otherwise}
\end{cases}
\]
Note that we increment a bit vector with limited range. Thus, it will
wrap around. One easily verifies the following properties of the ROB
pointers:
Lemma 6.1 Let i be the number of the instruction in the issue stage. The ROB tail
pointer has been increased i times.
\[ ROBtail^T = 0^{\vartheta} \oplus sIissue(T) \]
Lemma 6.2 Let i be the number of the instruction in the writeback stage. The ROB
head pointer has been increased i times.
\[ ROBhead^T = 0^{\vartheta} \oplus sIwriteback(T) \]
The proof for both lemmas is easily done using induction on T .
Lemma 6.3 The value in the ROBcount register is less than or equal to the number of
ROB entries.
\[ \langle ROBcount(T) \rangle \le \Theta \]
PROOF One verifies this claim by induction on T. For T = 0, we have
\[ \langle ROBcount(T) \rangle = 0. \]
For T + 1, we show the claim by a full case split on the values of
ROBinc(T) and ROBdec(T).
• If neither ROBinc(T) nor ROBdec(T) holds, the value of ROBcount
  does not change and the claim is concluded using the induction
  premise.
• If ROBinc(T) holds, we assert the claim as follows: in case
  \[ \langle ROBcount(T) \rangle < \Theta \]
  holds, the claim is easily concluded. Assume
  \[ \langle ROBcount(T) \rangle = \Theta \]
  holds. In this case, we have a contradiction to the assumption above
  since ROBinc(T) holds and the ROB is full.
• If ROBdec(T) holds, we assert the claim as follows: in case
  \[ \langle ROBcount(T) \rangle \ne 0 \]
  holds, the claim is easily concluded. Assume
  \[ \langle ROBcount(T) \rangle = 0 \]
  holds. In this case, we have a contradiction to the assumption above
  since ROBdec(T) holds and the ROB is empty. QED
Lemma 6.4 Let
\[ instr\_in\_rob(T) = sIissue(T) - sIwriteback(T) \]
denote the difference between the number of issued and terminated instructions,
i.e., the number of instructions in the reorder buffer. We claim that
this number is equal to the binary number interpretation of the value of
ROBcount(T):
\[ instr\_in\_rob(T) = \langle ROBcount(T) \rangle \]
PROOF This claim is asserted by induction on T. For T = 0 we have
\[
\begin{aligned}
instr\_in\_rob(T) &= \langle ROBcount(T) \rangle \\
sIissue(T) - sIwriteback(T) &= \langle 0^{\vartheta+1} \rangle \\
0 - 0 &= \langle 0^{\vartheta+1} \rangle.
\end{aligned}
\]
For T + 1, we do a full case split on the values of the signals issue(T)
and writeback(T).
• If neither issue(T) nor writeback(T) holds, neither the values of the
  scheduling functions nor the ROB counter change from cycle
  T to T + 1. Thus, the claim is concluded by the induction premise.
• If both issue(T) and writeback(T) hold, both scheduling functions
  are incremented by one. Thus, the difference stays the same. The
  ROB counter does not change from cycle T to T + 1. Thus, the claim
  is concluded by the induction premise.
• In case issue(T) holds and writeback(T) does not hold, the difference
  is increased by one. The ROB counter is also increased by one.
  One asserts that the ROB counter does not wrap around by lemma
  6.3.
• In case issue(T) does not hold and writeback(T) holds, the difference
  is decreased by one. The ROB counter is also decreased by one.
  One asserts that the ROB counter does not wrap around using the
  assumption that we do not write back in case of an empty ROB. QED
Lemma 6.5 The number of instructions in the ROB is greater than or equal to zero.
\[ instr\_in\_rob(T) \ge 0 \]
One easily asserts this using lemma 6.4.
Lemma 6.6 The number of instructions in the ROB is less than or equal to the number
of ROB entries.
\[ instr\_in\_rob(T) \le \Theta \]
This is easily shown using lemma 6.4 and lemma 6.3.
The following lemma is easily concluded using lemma 6.5:
Lemma 6.7 The number of issued instructions is greater than or equal to the number of
terminated instructions.
\[ sIissue(T) \ge sIwriteback(T) \]
Lemma 6.8 If we terminate an instruction during cycle T, the number of issued
instructions is greater than the number of terminated instructions.
\[ writeback(T) \implies sIissue(T) > sIwriteback(T) \]
One easily shows this using lemma 6.7, lemma 6.4, and the fact that
we only write back if the ROB is not empty.
Lemma 6.9 The number of issued instructions up to cycle T is greater than or equal to the
number of terminated instructions up to cycle T + 1.
\[ sIissue(T) \ge sIwriteback(T+1) \]
One easily verifies this claim using lemma 6.8 for the case writeback(T)
and using lemma 6.7 otherwise.
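The lemmas on the ROB pointers and the counter can be checked on random traces with a small Python simulation. This is a plausibility check, not a proof, and it makes simplifying assumptions that are not part of the thesis: every ROB entry is treated as valid, so a writeback may happen whenever the ROB is not empty, and issue and writeback are chosen at random subject to the assumptions of section 6.5.3. All names are hypothetical.

```python
import random


def check_invariants(theta, steps, seed=0):
    """Simulate issue/writeback and assert lemmas 6.1, 6.2, 6.4, 6.6, 6.8."""
    rng = random.Random(seed)
    size = 1 << theta                       # Theta = 2**theta
    tail = head = count = si_issue = si_wb = 0
    for _ in range(steps):
        # Lemmas 6.1/6.2: the pointers are the schedules modulo 2**theta.
        assert tail == si_issue % size and head == si_wb % size
        # Lemmas 6.4-6.6: the counter is the number of instructions in the ROB.
        assert count == si_issue - si_wb and 0 <= count <= size
        # Assumption: write back only if the ROB is not empty.
        wb = count > 0 and rng.random() < 0.5
        # Assumption: issue without writeback only if the ROB is not full.
        iss = (count < size or wb) and rng.random() < 0.5
        if wb:
            assert si_issue > si_wb         # lemma 6.8
        tail = (tail + iss) % size
        head = (head + wb) % size
        count += (iss and not wb) - (wb and not iss)
        si_issue += iss
        si_wb += wb
    return True


print(check_invariants(theta=2, steps=1000))  # True
```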
Definition 6.2 As described above, we assign a tag to each instruction during issue. This
is the value of the ROB tail pointer. This pointer is increased by one each
time we issue an instruction. Thus, we define a function I_tag(i), which
denotes the tag of instruction $I_i$, as follows:
\[ I\_tag(i) := 0^{\vartheta} \oplus i \]
Lemma 6.10 I_tag(i) is the value of the ROB tail pointer during issue of instruction $I_i$.
\[ ROBtail^T = I\_tag(sIissue(T)) \]
This claim is easily concluded using lemma 6.1 and the definition of
I_tag.
Lemma 6.11 If an instruction is in ROB entry tag, then the tag of that instruction is tag.
\[ sIROB(tag, T) = i \implies tag = I\_tag(i) \]
PROOF One shows this claim by induction on T. For T = 0, there is nothing to
show since there is no instruction in the ROB (formally, sIROB(tag, 0) is
−1, and there is no instruction $I_{-1}$).
For T + 1, the claim is concluded by expanding the definition of sIROB.
If
\[ issue(T) \wedge tag = ROBtail^T \]
holds, we have sIROB(tag, T + 1) = sIissue(T). The claim is then concluded
using lemma 6.10.
If not so, we have sIROB(tag, T + 1) = sIROB(tag, T). The claim is then
concluded using the induction premise. QED
We will now show that this tag is unique beginning with the cycle the
instruction is issued until the instruction terminates. Formally, this means
that we can assign a single, unique instruction to each such tag.
Let issued(i, T) hold iff instruction $I_i$ is already issued during cycle T.
We define this predicate using the scheduling function sIissue:
\[ issued(i, T) :\iff sIissue(T) > i \]
However, it is not obvious that instruction $I_i$ was issued before cycle T
if sIissue(T) > i and vice versa. It is an implication of in-order issue. The
following lemma asserts one direction.
Lemma 6.12 If issued(i, T) holds, there is a cycle $T' < T$ such that $I_i$ is issued during
cycle $T'$.
\[ issued(i, T) \implies \exists T' < T : sIissue(T') = i \wedge issue(T') \]
266
Section 6.5
DATA
CONSISTENCY
PROOF The claim is shown by induction on T . For T = 0, we have
sIissue(0) = 0. Thus, sIissue(0) > i cannot hold and there is nothing to
show.
For T +1, we show the claim using a case split on issue(T ).
 If issue(T ) holds, we have
sIissue(T +1) = sIissue(T )+1
and therefore sIissue(T ) + 1 > i. Let sIissue(T ) > i hold. In this
case, we can apply the induction premise and the claim holds. Thus,
let sIissue(T ) = i hold. In this case, cycle T satisfies the claim.
 If issue(T ) does not hold, we have sIissue(T +1) = sIissue(T ) and
we can apply the induction premise to show the claim. QED
In analogy to $issued(i, T)$, we define a predicate $terminated(i, T)$ that holds iff instruction $I_i$ already terminated before cycle $T$.
\[ terminated(i, T) :\iff sI_{writeback}(T) > i \]
Let the predicate $\tau(i, T)$ be a shorthand for the fact that instruction $I_i$ is already issued during cycle $T$ but has not yet terminated.
\[ \tau(i, T) :\iff issued(i, T) \land \lnot terminated(i, T) \]
The following lemma will be used in order to show that issue is done in program order.
Lemma 6.13. Consider the instruction in the issue stage during cycle $T$. During cycle $T+1$, there is the same or a later instruction in the issue stage.
\[ sI_{issue}(T+1) \ge sI_{issue}(T) \]
The proof of lemma 6.13 is easily done by expanding the definition of the scheduling function $sI_{issue}(T+1)$.
Lemma 6.14. The instructions are issued in order, i.e., during cycle $T' \le T$ there is the same or an earlier instruction in the issue stage.
\[ \forall T' \le T : sI_{issue}(T') \le sI_{issue}(T) \]
This lemma is easily shown using induction on $T$ and lemma 6.13 as induction step.
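The behaviour of the issue scheduling function can be illustrated with a minimal Python sketch. It assumes only what the proofs above use: $sI_{issue}(0) = 0$, and $sI_{issue}$ is incremented exactly in cycles where $issue(T)$ holds. The function names and the random issue pattern are hypothetical and serve only to check lemmas 6.12 to 6.14 on one trace; this is no substitute for the PVS proofs.

```python
import random

def issue_schedule(issue_flags):
    """Return [sI_issue(0), ..., sI_issue(n)] for a list of per-cycle issue flags.

    Assumes sI_issue(0) = 0 and sI_issue(T+1) = sI_issue(T) + 1 iff issue(T).
    """
    schedule = [0]
    for flag in issue_flags:
        schedule.append(schedule[-1] + (1 if flag else 0))
    return schedule

def issue_cycle(schedule, issue_flags, i, T):
    """Witness for lemma 6.12: a cycle T' < T in which I_i is issued, or None."""
    for t in range(T):
        if issue_flags[t] and schedule[t] == i:
            return t
    return None

random.seed(0)
flags = [random.random() < 0.5 for _ in range(50)]  # hypothetical issue(T) trace
sI = issue_schedule(flags)

# Lemma 6.13: the schedule never decreases from one cycle to the next.
monotone_step = all(sI[T + 1] >= sI[T] for T in range(len(flags)))
# Lemma 6.14: earlier cycles hold the same or an earlier instruction.
in_order = all(sI[Tp] <= sI[T] for T in range(len(sI)) for Tp in range(T + 1))
# Lemma 6.12: issued(i, T) yields a witness cycle T' < T.
witnesses = all(issue_cycle(sI, flags, i, T) is not None
                for T in range(len(sI)) for i in range(sI[T]))
```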
Lemma 6.15. Let $i \ge 0$ and $j \ge 0$ hold. If one increments a tag $i$ times and after that $j$ times, this is equivalent to incrementing the tag $i + j$ times.
\[ (tag \oplus i) \oplus j = tag \oplus (i + j) \]
This is easily shown by induction on $j$.
Lemma 6.16. Let $T$ and $T' \ge T$ be cycles. $ROBtail^{T'}$ is equal to $ROBtail^T$ incremented $sI_{issue}(T') - sI_{issue}(T)$ times.
\[ \forall T' \ge T : ROBtail^{T'} = ROBtail^T \oplus (sI_{issue}(T') - sI_{issue}(T)) \]
PROOF By applying lemma 6.1 twice, the claim is transformed into:
\[ 0_\vartheta \oplus sI_{issue}(T') \overset{!}{=} (0_\vartheta \oplus sI_{issue}(T)) \oplus (sI_{issue}(T') - sI_{issue}(T)) \]
One shows $sI_{issue}(T') - sI_{issue}(T) \ge 0$ using lemma 6.14. This allows concluding the claim using lemma 6.15. QED
One easily verifies the following property of tag arithmetic (i.e., bit-vector arithmetic). It applies to incrementing tags as done for ROBhead and ROBtail.
Lemma 6.17. If one increments a tag $i$ times, the value of this tag is the value of the old tag plus $i$ modulo $\Theta$ (number of ROB entries).
\[ \langle tag \oplus i \rangle = (\langle tag \rangle + i) \bmod \Theta \]
The following lemma will be used in order to argue that certain entries in the ROB are not overwritten.
Lemma 6.18. If one increments a tag at least once and less than $\Theta$ times, the incremented tag is different from the old tag.
\[ 0 < j < \Theta \implies (tag \oplus j) \ne tag \]
PROOF According to lemma 6.17, we have
\[ \langle tag \oplus j \rangle = (\langle tag \rangle + j) \bmod \Theta \]
Assume $(tag \oplus j) = tag$ holds. In this case, the equation above transforms into:
\[ \langle tag \rangle = (\langle tag \rangle + j) \bmod \Theta \]
This only holds if $j$ is a multiple of $\Theta$ (this property of mod is shown in the PVS libraries). This is a contradiction to the premise of the lemma and we therefore have $(tag \oplus j) \ne tag$. QED
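The tag arithmetic of lemmas 6.15, 6.17, and 6.18 can be checked exhaustively for a small ROB with the following Python sketch. It is an illustration under assumptions: tags are modelled directly by their integer value, `THETA` is a hypothetical ROB size, and `inc` is a hypothetical name for the repeated increment.

```python
THETA = 8  # hypothetical number of ROB entries

def inc(tag, times=1):
    """Increment a tag `times` times, wrapping around after THETA - 1."""
    for _ in range(times):
        tag = (tag + 1) % THETA
    return tag

tags = range(THETA)

# Lemma 6.15: incrementing i times and then j times equals incrementing i + j times.
lemma_6_15 = all(inc(inc(t, i), j) == inc(t, i + j)
                 for t in tags for i in range(20) for j in range(20))
# Lemma 6.17: the value after i increments is (old value + i) mod THETA.
lemma_6_17 = all(inc(t, i) == (t + i) % THETA for t in tags for i in range(40))
# Lemma 6.18: between 1 and THETA - 1 increments never return to the old tag.
lemma_6_18 = all(inc(t, j) != t for t in tags for j in range(1, THETA))
# Exactly THETA increments wrap around to the old tag, so the bound is tight.
wraps_at_theta = all(inc(t, THETA) == t for t in tags)
```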
Entries in the ROB are overwritten if the ROB tail pointer wraps around.
This happens each Θ (number of ROB entries) instructions. The following
lemma asserts the fact that instruction Ii in the ROB is overwritten only in
this case.
Lemma 6.19. Let instruction $I_i$ be issued during cycle $T'$. Consider cycles $T > T'$. As long as no more than $\Theta$ instructions are issued from cycle $T'$ to $T$, the instruction in the ROB entry that $ROBtail^{T'}$ points to is, during cycle $T$, instruction $I_i$.
\[ issue(T') \land sI_{issue}(T') = i \land sI_{issue}(T) \le i + \Theta \implies sI_{ROB}(ROBtail^{T'}, T) = i \]
PROOF The proof proceeds by induction on $T$. For $T = 0$, there is nothing to show since there is no cycle $T' \ge 0$ with $T > T'$.
For $T+1$, let us consider the case $T = T'$. In this case, the claim holds by definition of $sI_{ROB}$.
The claim for the case $T > T'$ is (we swap left-hand side and right-hand side):
\[ i \overset{!}{=} sI_{ROB}(ROBtail^{T'}, T+1) =
\begin{cases}
sI_{issue}(T) & : issue(T) \land ROBtail^{T'} = ROBtail^T \\
sI_{ROB}(ROBtail^{T'}, T) & : \text{otherwise}
\end{cases} \]
We argue the two cases above separately. Assume
\[ issue(T) \land ROBtail^{T'} = ROBtail^T \]
holds. This implies that $sI_{issue}(T+1) = sI_{issue}(T) + 1$ holds because $issue(T)$ holds. This allows concluding that
\[ sI_{issue}(T) + 1 \le i + \Theta \]
holds. Since $I_i$ is issued during cycle $T' < T$, we also have $sI_{issue}(T) > i$. This allows applying lemma 6.18 with $j = sI_{issue}(T) - i$, which states:
\[ ROBtail^{T'} \ne ROBtail^{T'} \oplus (sI_{issue}(T) - i) \]
According to lemma 6.16 for cycles $T'$ and $T$, we have
\[ ROBtail^T = ROBtail^{T'} \oplus (sI_{issue}(T) - i). \]
This is a contradiction to $ROBtail^T = ROBtail^{T'}$. Thus,
\[ issue(T) \land ROBtail^{T'} = ROBtail^T \]
cannot hold. We therefore only have to show $sI_{ROB}(ROBtail^{T'}, T) = i$. This is done using the induction premise. QED
Lemma 6.20. If instruction $I_i$ has been issued but has not yet terminated, no more than $\Theta$ (number of ROB entries) instructions have been issued since $I_i$ was issued.
\[ \tau(i, T) \implies sI_{issue}(T) \le i + \Theta \]
This claim is easily concluded using lemma 6.6.
The following theorem provides the unique mapping from tags to instructions: we just use the ROB schedule. The tag of an instruction is unique if the instruction is in the ROB.
Theorem 6.21. If instruction $I_i$ has been issued but has not yet terminated, the instruction in ROB entry $Itag(i)$ is instruction $I_i$.
\[ \tau(i, T) \implies sI_{ROB}(Itag(i), T) = i \]
PROOF According to lemma 6.12, there is a cycle $T' < T$ such that instruction $I_i$ is issued during cycle $T'$. According to lemma 6.19 for cycles $T'$ and $T$ and instruction $I_i$, we have:
\[ sI_{issue}(T) \le i + \Theta \implies sI_{ROB}(ROBtail^{T'}, T) = i \]
We assert the left-hand side of the implication using lemma 6.20. Thus, we have:
\[ sI_{ROB}(ROBtail^{T'}, T) = i \]
It is therefore left to show that $ROBtail^{T'}$ is equal to $Itag(i)$. This is done using lemma 6.10. QED
From theorem 6.21, one easily concludes the following claim:
Lemma 6.22. Let $I_i$ and $I_j$ be instructions. If the tags of the instructions are equal and both unique, instruction $I_i$ is instruction $I_j$.
\[ Itag(i) = Itag(j) \land \tau(i, T) \land \tau(j, T) \implies i = j \]
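The following Python sketch simulates the ROB pointers and the ROB schedule under random traffic and checks lemma 6.10, lemma 6.20, lemma 6.23, and theorem 6.21 in every cycle. It rests on simplifying assumptions that are not taken from the thesis: an instruction is issued only while fewer than $\Theta$ instructions are in flight, instructions write back in order at random times, and $Itag(i)$ is modelled as `i % THETA`. All identifiers are hypothetical.

```python
import random

THETA = 4  # hypothetical number of ROB entries

def itag(i):
    """Itag(i): the zero tag incremented i times, modelled as i mod THETA."""
    return i % THETA

def simulate(cycles, seed=0):
    """Run random issue/writeback traffic; raise AssertionError if a check fails."""
    rng = random.Random(seed)
    si_issue, si_writeback = 0, 0   # scheduling functions at the current cycle T
    si_rob = [-1] * THETA           # sI_ROB(tag, T); -1 means "no instruction"
    rob_tail, rob_head = 0, 0
    for _ in range(cycles):
        assert rob_tail == itag(si_issue)          # lemma 6.10
        assert rob_head == itag(si_writeback)      # lemma 6.23
        for i in range(si_writeback, si_issue):    # exactly the i with tau(i, T)
            assert si_issue <= i + THETA           # lemma 6.20
            assert si_rob[itag(i)] == i            # theorem 6.21
        in_flight = si_issue - si_writeback
        issue = in_flight < THETA and rng.random() < 0.7
        writeback = in_flight > 0 and rng.random() < 0.5
        if issue:
            # sI_ROB(ROBtail^T, T+1) = sI_issue(T); all other entries keep their value.
            si_rob[rob_tail] = si_issue
            rob_tail = (rob_tail + 1) % THETA
            si_issue += 1
        if writeback:
            rob_head = (rob_head + 1) % THETA
            si_writeback += 1
    return si_issue, si_writeback

issued_total, terminated_total = simulate(2000)
```

Because tags repeat every `THETA` instructions, the check of theorem 6.21 only passes because it is restricted to instructions that are issued but not yet terminated.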
In analogy to lemma 6.10, we show:
Lemma 6.23. The ROBhead pointer during cycle $T$ is the tag of the instruction in the writeback stage.
\[ ROBhead(T) = Itag(sI_{writeback}(T)) \]
One easily concludes this claim using lemma 6.2.
Lemma 6.24. If we write back an instruction during cycle $T$, that instruction is in the ROB entry that ROBhead points to.
\[ writeback(T) \implies sI_{writeback}(T) = sI_{ROB}(ROBhead(T), T) \]
PROOF Using lemma 6.23, we transform the claim into:
\[ writeback(T) \implies sI_{writeback}(T) = sI_{ROB}(Itag(sI_{writeback}(T)), T) \]
The claim is concluded using theorem 6.21. It is left to show that the premise of theorem 6.21 holds, i.e., we have to show that
\[ \tau(sI_{writeback}(T), T) \]
holds. We show that the instruction is already issued using lemma 6.8. Furthermore, the instruction is obviously not terminated yet. QED
6.5.5 Instruction Phases
We distinguish the following phases of executing instruction $I_i$:
• Not issued: Before an instruction is issued, the instruction is in the “not issued” phase. Formally, this holds if $\lnot issued(i, T)$ holds.
• In RS: During issue, the instruction is stored in a reservation station unless $issue\_with\_result(i)$ holds. Formally, instruction $I_i$ is in a reservation station during cycle $T$ iff
\[ \exists rs : RS[rs]^T.full \land sI_{RS}(rs, T) = i \]
holds.
• In FU: During dispatch, the instruction is passed from the reservation station to a function unit. Formally, we say an instruction is dispatched during cycle $T$ iff there is a cycle $T' < T$ and a reservation station $rs$ such that instruction $I_i$ is in reservation station $rs$ and the instruction in that reservation station is dispatched.
\[ dispatched(i, T) :\iff \exists T' < T, rs : dispatch\_rs(T', rs) \land sI_{RS}(rs, T') = i \]
The instruction leaves the function unit if it is passed to a producer. Formally, an instruction is executed iff there is a cycle $T' < T$ and a producer $fu$ such that instruction $I_i$ is in the producer $fu$ and that producer is full.
\[ executed(i, T) :\iff \exists T' < T, fu : FU[fu]^{T'}.valid \land sI_{fu}(fu, T') = i \]
Formally, instruction $I_i$ is in a function unit during cycle $T$ iff
\[ dispatched(i, T) \land \lnot executed(i, T) \]
holds. Note that there are function units (the ALU, for example) that return the result in the same cycle they get the instruction. In this case, the condition above never holds, although the function unit is not bypassed.
• In producer: After leaving the function unit, the result of the instruction is stored in a producer. Formally, an instruction is in a producer iff there is a producer $fu$ such that instruction $I_i$ is in the producer $fu$ and the producer is full.
\[ \exists fu : P[fu]^T.full \land sI_P(fu, T) = i \]
• In ROB: As soon as the producer gets the CDB, the result in the producer is stored in the ROB. Formally, an instruction is in the ROB during cycle $T$ iff there is a ROB entry $tag$ such that the instruction in that entry is $I_i$ and the entry is valid and the instruction has not terminated yet.
\[ \exists tag : ROB[tag]^T.valid \land sI_{ROB}(tag, T) = i \land \lnot terminated(i, T) \]
[Figure 6.9: Instruction phase state diagram. Phases shown: not issued, in RS, in FU, in P, in ROB, terminated; one transition is labeled $issue\_with\_result(i)$.]
The phases of “normal” instructions, i.e., instructions $I_i$ that are not issued with result, are processed in the order above. Instructions with $issue\_with\_result(i)$ skip the phases “in RS”, “in FU”, and “in producer”. This is illustrated in figure 6.9. The figure shows the different phases and the transitions between the phases. However, one has to assert this property of the machine. This is done by the following lemmas.
Let $p(i, T)$ denote that instruction $I_i$ is in phase $p$ during cycle $T$.
Let pred(p) denote the set of predecessor phases of phase p according
to figure 6.9. For example, the “not issued” phase only has itself as prede-
cessor. The “in ROB” phase has three predecessor phases: “in ROB”, “not
issued”, and “in producer”.
In analogy to pred(p), let succ(p) denote the set of successor phases of
phase p according to figure 6.9. For example, the “not issued” phase has
two successor phases: “in RS” and “in ROB”.
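The predecessor and successor sets can be written down as a small table, as in the Python sketch below. The edge set is an assumption reconstructed from the surrounding text rather than read off figure 6.9 itself: every phase may stay where it is (a self-loop, counted here as a successor in the same way that the $pred$ examples above count a phase as its own predecessor), “not issued” leads to “in RS” and “in ROB”, and every other phase has one further successor. A direct transition from “in RS” to “in producer” for function units that finish in the same cycle is not modelled.

```python
# Hypothetical reconstruction of the phase graph; self-loops are included.
SUCC = {
    "not issued": {"not issued", "in RS", "in ROB"},
    "in RS": {"in RS", "in FU"},
    "in FU": {"in FU", "in producer"},
    "in producer": {"in producer", "in ROB"},
    "in ROB": {"in ROB", "terminated"},
    "terminated": {"terminated"},
}

def succ(phase):
    """Set of successor phases of `phase`, including the phase itself."""
    return SUCC[phase]

def pred(phase):
    """Set of predecessor phases of `phase`, derived from the successor table."""
    return {p for p, targets in SUCC.items() if phase in targets}

# Proper successors exclude the self-loop.
proper_succ_not_issued = succ("not issued") - {"not issued"}
```

With this table, `pred("in ROB")` yields the three predecessor phases named in the text, and the two proper successors of “not issued” are “in RS” and “in ROB”.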
Lemma 6.25. If instruction $I_i$ is in a given phase during cycle $T$, and not in any other phase, then the instruction is in at most one successor phase during cycle $T+1$, i.e., the successor phases mutually exclude each other.
PROOF For most phases, the claim is trivial, because they only have themselves and another state as successors. The only exception is the “not issued” phase, which has three successors. We therefore show the claim for the “not issued” phase as an example.
• If $issue(T)$ and $sI_{issue}(T) = i$ do not both hold, one easily concludes that instruction $I_i$ stays in the “not issued” phase during cycle $T+1$. Thus, we have to show that it is not in a reservation station or in the ROB. According to the premise of the lemma, the phases of $I_i$ are unique during cycle $T$. Thus, $I_i$ is not in the ROB or in a reservation station during cycle $T$. Since $I_i$ is also not issued, one easily verifies that it does not move into the ROB or into a reservation station.
• If $issue(T)$ and $sI_{issue}(T) = i$ hold, one easily concludes that instruction $I_i$ enters either the ROB or a reservation station, depending on $issue\_with\_result(i)$. If $issue\_with\_result(i)$ holds, one verifies that the instruction cannot be in a reservation station. If not, one verifies that the instruction cannot be in the ROB. QED
Lemma 6.26. If instruction $I_i$ is in a given phase during cycle $T+1$, we show that it must have been in one of the predecessor phases as given in figure 6.9 during cycle $T$:
\[ p(i, T+1) \implies \bigvee_{p' \in pred(p)} p'(i, T) \]
For example, if instruction $I_i$ is in phase “not issued” during cycle $T+1$, this implies that it must be in phase “not issued” during cycle $T$.
PROOF In PVS, we split this claim into 6 lemmas, one for each phase. We show the claim for the “not issued” phase and the “in RS” phase here as examples.
• The claim for the “not issued” phase is easily asserted by expanding the definition of “not issued” and by applying lemma 6.13.
• The claim for the “in RS” phase is asserted as follows: according to the premise, there is a reservation station $rs$ such that
\[ RS[rs]^{T+1}.full \land sI_{RS}(rs, T+1) = i \]
holds. Let $issue\_rs(T, rs)$ hold. In this case, we have
\[ sI_{RS}(rs, T+1) = sI_{issue}(T) \]
Thus, the instruction $I_i$ is in the issue stage during cycle $T$. Thus, it is in the “not issued” phase during cycle $T$, which concludes the claim.
Let $issue\_rs(T, rs)$ not hold. In this case, one easily asserts that the full bit $RS[rs]^T.full$ is active and $sI_{RS}(rs, T) = i$ holds. Thus, the instruction is in the “in RS” phase during cycle $T$, which concludes the claim. QED
Lemma 6.27. The phase of instruction $I_i$ during cycle $T$ is unique, i.e., the phases above exclude each other mutually.
PROOF One easily shows this claim by induction on $T$. For $T = 0$, one asserts that all instructions are in the “not issued” phase only.
For $T+1$, one shows the claim as follows: according to the induction premise, instruction $I_i$ is in at most one phase during cycle $T$. One applies lemma 6.25, which shows that the successor phases mutually exclude each other.
Furthermore, the instruction $I_i$ cannot be in a phase that is not a successor phase during cycle $T+1$, which is asserted by lemma 6.26. QED
6.5.6 Tag Consistency
We will now show that the tags transported in the machine are consistent with the scheduling functions, i.e., we will show that the tag stored together with instruction $I_i$ is $Itag(i)$.
Lemma 6.28. If a reservation station is full, the tag in that reservation station is the tag of the instruction in the reservation station.
\[ RS[rs]^T.full \implies RS[rs]^T.tag = Itag(sI_{RS}(rs, T)) \]
PROOF The claim is shown using induction on $T$. For $T = 0$, there is nothing to show because the reservation stations are not full in the initial configuration.
For $T+1$, we show the claim as follows: If an instruction $I_i$ is issued into reservation station $rs$ during cycle $T$, the value of the tag in the reservation station is defined by the issue protocol:
\[ RS[rs]^{T+1}.tag = ROBtail^T \]
According to lemma 6.1, this is equivalent to $0_\vartheta \oplus sI_{issue}(T)$. This is the definition of $Itag(i)$.
If no instruction is issued into reservation station $rs$ during cycle $T$, we apply the induction premise. QED
Lemma 6.29. If there is an instruction in a producer, the tag in the producer matches the tag of the instruction.
\[ P[fu]^T.full \implies P[fu]^T.tag = Itag(sI_P(fu, T)) \]
PROOF We show this claim by induction on $T$. For $T = 0$, there is nothing to show because the producer is not full in the initial configuration.
For $T+1$, we show the claim as follows: For the case that the instruction in the producer did not change from cycle $T$ to $T+1$, we apply the induction premise.
If a new instruction moved into the producer, we conclude the claim by making the following assumption: if the function unit gets correct tags as inputs for cycles $T'$ with $T' \le T$, this implies that the function unit passes the correct tag during cycle $T$. We will later on describe how to verify that property of the function units. We show that the function units get correct tags for $T'$ with $T' \le T$ using lemma 6.28. QED
Lemma 6.30. The tag on the CDB matches the tag of the instruction on the CDB.
\[ CDB^T.valid \implies CDB^T.tag = Itag(sI_{CDB}(T)) \]
PROOF We assume that we only complete instructions from producers that are full. Thus, we can apply lemma 6.29. The tag on the CDB matches the tag from the producer. Furthermore, the instruction on the CDB matches the instruction in the producer, by definition of $sI_{CDB}$. QED
6.5.7 Data Consistency Criterion
In this section we describe our data consistency criterion for the Tomasulo protocols. We define a formal notion for the correct input and output values of an instruction. We do this by defining an abstract machine that processes one instruction with each transition. We call this machine the abstract specification machine (aS). The configuration set of this machine consists of the registers.
Given an instruction $I_i$ (i.e., configuration $c^i_{aS}$ of this machine), we define the correct value of a source register $r$ to be the value of the register $r$ if $r \ne 0$ and to be zero if $r = 0$:
\[ source(i, r) :=
\begin{cases}
0 & : r = 0 \\
c^i_{aS}.R[r] & : \text{otherwise}
\end{cases} \]
The function $source(i)$ maps an instruction to the values of all source operands. Remember that $S(i, x)$ denotes the number of the register of source operand $x$. Let $s$ denote the number of source registers.
\[ source : \mathbb{N} \to W(R)^s \]
\[ source(i)(x) := source(i, S(i, x)) \]
Let $f_i$ be the function that maps the values of the source operands of instruction $I_i$ to the values of the destination operands unless we have $issue\_with\_result(i)$. Let $d$ denote the number of destination registers.
\[ f_i : W(R)^s \to W(R)^d \]
Thus, the result of instruction $I_i$ is:
\[ result(i) :=
\begin{cases}
issue\_result(i) & : issue\_with\_result(i) \\
f_i(source(i)) & : \text{otherwise}
\end{cases} \]
This allows defining the configurations of the abstract specification machine. We start with an initial configuration $c^0_{aS}$ and proceed using $f$. If instruction $I_{i-1}$ has register $r$ as destination register, then we take the new value of $R[r]$ from the result of $I_{i-1}$. If not, we take the value from the old configuration.
\[ c^i_{aS}.R[r] :=
\begin{cases}
c^0_{aS}.R[r] & : i = 0 \\
result(i-1)[e(r)] & : i \ne 0 \land dest(i-1, r) \\
c^{i-1}_{aS}.R[r] & : \text{otherwise}
\end{cases} \]
Proof Strategy We will show the correctness of a DLX implementation with Tomasulo scheduler as follows:
• We will show that a machine implementing the Tomasulo protocols given in the previous sections simulates the abstract machine aS. This is the hardest part of the proof.
• We will show that the DLX implementation with Tomasulo scheduler implements the Tomasulo protocols.
We will now conclude several trivial properties of the abstract specification machine aS.
Lemma 6.31. If instruction $I_i$ does not have destination register $R[r]$, then $R[r]$ is not changed by instruction $I_i$.
\[ \lnot dest(i, r) \implies R[r]^{i+1}_{aS} = R[r]^i_{aS} \]
The proof is done by expanding the definition of $R[r]^{i+1}_{aS}$.
Definition 6.3 ($L(i, r)$). Let the predicate $L(i, r)$ hold iff there is an instruction $j < i$ such that instruction $I_j$ has destination register $r$.
\[ L(i, r) :\iff \exists j < i : dest(j, r) \]
Lemma 6.32. Let $i$ and $j \le i$ be instructions. If $L(j, r)$ holds, so does $L(i, r)$.
\[ j \le i \land L(j, r) \implies L(i, r) \]
This holds by definition of the predicates.
Definition 6.4 ($last(i, r)$). Let $L(i, r)$ hold. Let $last(i, r)$ denote the number of the last instruction with destination register $r$ prior to instruction $I_i$. Formally, this is the maximum of the set of instructions $I_j$ with $j < i$ and $dest(j, r)$.
\[ last(i, r) := \max \{ j \mid j < i \land dest(j, r) \} \]
This set is always non-empty because of $L(i, r)$. Furthermore, the set is finite and has an upper bound. Thus, the maximum is defined if $L(i, r)$ holds.
The following property is easily shown using the definition of $last$ and the definition of $\max$.
Lemma 6.33. If $L(i, r)$ holds, the instruction $I_{last(i,r)}$ has destination register $r$.
\[ L(i, r) \implies dest(last(i, r), r) \]
Lemma 6.34. Let $L(i, r)$ and $i \ge 1$ hold. If instruction $I_{i-1}$ does not have destination register $r$, $L(i-1, r)$ holds.
\[ i \ge 1 \land L(i, r) \land \lnot dest(i-1, r) \implies L(i-1, r) \]
PROOF Because $L(i, r)$ holds, there must be an instruction $I_j$ with $j < i$ and $dest(j, r)$. Since this is not instruction $I_{i-1}$, it must be an instruction with $j < i-1$. Thus, $L(i-1, r)$ holds.
Lemma 6.35. Let $i \ge 1$ and $L(i, r)$ hold. If instruction $I_{i-1}$ does not have destination register $r$, then $last(i, r)$ is equal to $last(i-1, r)$.
\[ i \ge 1 \land L(i, r) \land \lnot dest(i-1, r) \implies last(i, r) = last(i-1, r) \]
PROOF Because of $L(i, r)$, $last(i, r)$ is defined. According to lemma 6.34, $L(i-1, r)$ holds. Thus, $last(i-1, r)$ is defined.
Let $j$ be $last(i, r)$. By definition of $\max$, this number is an element of $\{0, \ldots, i-1\}$. Because of $\lnot dest(i-1, r)$, $j$ cannot be $i-1$. Thus, $j$ is equal to $last(i-1, r)$. QED
Lemma 6.36. Let $i \ge 1$ hold. If instruction $I_{i-1}$ has destination register $r$, $last(i, r)$ is equal to $i-1$.
\[ i \ge 1 \land dest(i-1, r) \implies last(i, r) = i-1 \]
This is easily shown by using the definition of $\max$.
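The predicates $L$ and $last$ translate directly into executable form. The Python sketch below checks lemmas 6.33 to 6.36 on a randomly generated destination sequence; it assumes that `dests[j]` holds the set of destination registers of $I_j$, and the sequence itself is hypothetical test data.

```python
import random

def L(dests, i, r):
    """L(i, r): some instruction j < i has destination register r."""
    return any(r in dests[j] for j in range(i))

def last(dests, i, r):
    """last(i, r): the largest j < i with dest(j, r); only defined if L(i, r) holds."""
    return max(j for j in range(i) if r in dests[j])

random.seed(1)
NUM_REGS = 4
dests = [{random.randrange(NUM_REGS)} for _ in range(30)]  # destination set of each I_j

checked = 0
for i in range(1, len(dests) + 1):
    for r in range(NUM_REGS):
        if r in dests[i - 1]:
            assert last(dests, i, r) == i - 1                   # lemma 6.36
        elif L(dests, i, r):
            assert L(dests, i - 1, r)                           # lemma 6.34
            assert last(dests, i, r) == last(dests, i - 1, r)   # lemma 6.35
        if L(dests, i, r):
            assert r in dests[last(dests, i, r)]                # lemma 6.33
            checked += 1
```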
Lemma 6.37. Let $I_i$ and $I_j$ with $j \le i$ be instructions. If all instructions $I_{j'}$ with $j \le j' < i$ do not have destination register $r$, the value of $R[r]$ does not change from configuration $c^j_{aS}$ to $c^i_{aS}$.
\[ j \le i \land (\forall j \le j' < i : \lnot dest(j', r)) \implies R[r]^i_{aS} = R[r]^j_{aS} \]
One easily concludes this using induction on $i$ and the transition function of $R[r]$.
Lemma 6.38. Let $R[r]$ with $r \ne 0$ be a register and let $L(i, r)$ hold. In this case, the correct value of source register $r$ of $I_i$ is the result of the last instruction writing $R[r]$.
\[ r \ne 0 \land L(i, r) \implies source(i, r) = result(last(i, r))[e(r)] \]
PROOF By definition of $last(i, r)$, the instructions $I_j$ with $last(i, r) < j < i$ do not have destination register $r$. According to lemma 6.37, we have
\[ R[r]^i_{aS} = R[r]^{last(i,r)+1}_{aS} \]
The left-hand side is $source(i, r)$ by definition, and the right-hand side is $result(last(i, r))[e(r)]$ by definition of $R[r]^{last(i,r)+1}_{aS}$. QED
Lemma 6.39. Let there not be an instruction that is issued during cycle $T$ with destination $R[r]$. This implies that the value of source register $r$ of instruction $I_{sI_{issue}(T)}$ matches the value of source register $r$ of instruction $I_{sI_{issue}(T+1)}$.
\[ \lnot (issue(T) \land dest(sI_{issue}(T), r)) \implies source(sI_{issue}(T), r) = source(sI_{issue}(T+1), r) \]
PROOF If $issue(T)$ does not hold, we have $sI_{issue}(T) = sI_{issue}(T+1)$ and the claim obviously holds.
If $issue(T)$ holds, we apply lemma 6.37 and expand the definition of $source$. QED
6.5.8 Forwarding Tags Consistency
The Tomasulo scheduling algorithm does forwarding at two places: 1) dur-
ing issue, we forward from the CDB and from the ROB, 2) while in a
reservation station, we forward from the CDB.
Both forwarding from the ROB and from the CDB is done using the tag.
We will now show that the tags used for forwarding are correct.
Lemma 6.40. Let $I_i$ be the instruction in the issue stage during cycle $T$. If a register $R[r]$ is marked as “not valid” during cycle $T$ in the producer table, there is an instruction prior to instruction $I_i$ that writes $R[r]$, and the tag of the register in the producer table is the tag of the last instruction prior to instruction $I_{sI_{issue}(T)}$ writing $R[r]$.
\[ sI_{issue}(T) = i \land \lnot R[r]^T.valid \implies L(i, r) \land R[r]^T.tag = Itag(last(i, r)) \]
PROOF We verify that claim by induction on $T$. For $T = 0$, there is nothing to show because we make the valid bits of the registers active in the initial configuration.
For $T+1$, we conclude the claim as follows: In case $R[r]^{T+1}.valid$ holds, there is nothing to show. Thus, let $R[r]^{T+1}.valid$ not hold. We distinguish three cases:
• If an instruction with destination register $R[r]$ is issued during cycle $T$, we easily assert $L(i, r)$, since instruction $sI_{issue}(T)$ satisfies the claim.
We assert $R[r]^{T+1}.tag = Itag(last(i, r))$ as follows: we apply lemma 6.36, which states:
\[ last(i, r) = i - 1 \]
Thus, we have to show:
\[ R[r]^{T+1}.tag \overset{!}{=} Itag(i-1) \]
During issue, the ROB tail pointer is stored in $R[r].tag$. Thus, the claim is equivalent to:
\[ ROBtail^T \overset{!}{=} Itag(i-1) \]
According to the definition of $Itag$ and lemma 6.1, this is equivalent to:
\[ 0_\vartheta \oplus sI_{issue}(T) \overset{!}{=} 0_\vartheta \oplus (i-1) \]
\[ sI_{issue}(T) \overset{!}{=} i - 1 \]
This is concluded using the fact that $i = sI_{issue}(T+1)$ holds, by expanding the definition of $sI_{issue}(T+1)$, and by the fact that $issue(T)$ holds.
• If an instruction without destination register $R[r]$ is issued during cycle $T$, consider $R[r]^T.valid$. If $R[r]^T.valid$ holds, this implies that $R[r]^{T+1}.valid$ holds, which is a contradiction.
Thus, $R[r]^T.valid$ does not hold. This allows applying the induction premise for instruction $I_{i-1}$ and we get:
\[ L(i-1, r) \land R[r]^T.tag = Itag(last(i-1, r)) \]
We conclude $L(i, r)$ from $L(i-1, r)$ using lemma 6.32.
As the instruction that is issued during cycle $T$ does not have destination register $R[r]$, we have $R[r]^{T+1}.tag = R[r]^T.tag$, which transforms the claim into:
\[ R[r]^T.tag \overset{!}{=} Itag(last(i, r)) \]
Thus, it is left to show that $last(i-1, r) = last(i, r)$ holds. This is concluded using lemma 6.35.
• If no instruction is issued during cycle $T$, we assert that $R[r]^T.valid$ does not hold as in the case above. This allows applying the induction premise, which concludes the claim. QED
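The invariant of lemma 6.40 can be exercised with a small simulation of the producer table, sketched below in Python. The model is a deliberate simplification and not the protocol of the thesis: every instruction has exactly one destination register, instructions write back in order at random times, writeback sets the valid bit only if the tag in the producer table matches, and issue takes priority when both touch the same register in one cycle. All identifiers are hypothetical.

```python
import random

THETA, NUM_REGS = 4, 3   # hypothetical ROB size and number of registers

def run(cycles, seed=0):
    """Simulate issue and writeback; return how often the invariant was checked."""
    rng = random.Random(seed)
    dest = []                       # dest[j]: destination register of I_j
    valid = [True] * NUM_REGS       # R[r].valid, all set in the initial configuration
    tag = [0] * NUM_REGS            # R[r].tag
    si_issue = si_writeback = 0
    checks = 0
    for _ in range(cycles):
        i = si_issue                # instruction in the issue stage during this cycle
        for r in range(NUM_REGS):
            if not valid[r]:
                writers = [j for j in range(i) if dest[j] == r]
                assert writers                          # L(i, r)
                assert tag[r] == max(writers) % THETA   # R[r].tag = Itag(last(i, r))
                checks += 1
        in_flight = si_issue - si_writeback
        issue = in_flight < THETA and rng.random() < 0.6
        writeback = in_flight > 0 and rng.random() < 0.5
        if writeback:
            r = dest[si_writeback]
            if tag[r] == si_writeback % THETA:          # tag match: value becomes valid
                valid[r] = True
            si_writeback += 1
        if issue:                   # applied last, so issue wins on the same register
            r = rng.randrange(NUM_REGS)
            dest.append(r)
            valid[r], tag[r] = False, si_issue % THETA  # store ROBtail in R[r].tag
            si_issue += 1
    return checks

invariant_checks = run(3000)
```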
The following lemma will be used for the induction step in the proof of lemma 6.42.
Lemma 6.41. Let reservation station $rs$ be full during cycle $T+1$ and let the operand $x$ be not valid. There are two possible reasons for this: 1) this was already true during cycle $T$, or 2) an instruction was issued into the reservation station during cycle $T$.
\[ RS[rs]^{T+1}.full \land \lnot RS[rs]^{T+1}.op[x].valid \implies (RS[rs]^T.full \land \lnot RS[rs]^T.op[x].valid) \lor (issue(T) \land issue\_rs(T, rs)) \]
One easily asserts this claim by applying the definition of the issue protocol. Full bits of reservation stations are only set by the issue protocol, and the valid bit of the operand is only cleared by the issue protocol.
The following lemma will be used to argue the correctness of data that is forwarded into a reservation station.
Lemma 6.42. Let reservation station $rs$ be full and let instruction $I_i$ be in this reservation station. Let operand $x$ be not valid, and let $r$ be $S(i, x)$. This implies that $r$ is not zero, that there is an instruction prior to instruction $I_i$ with destination $R[r]$, and that the tag of operand $x$ is the tag of the last instruction prior to $I_i$ with destination $R[r]$.
\[ RS[rs]^T.full \land sI_{RS}(rs, T) = i \land \lnot RS[rs]^T.op[x].valid \implies r \ne 0 \land L(i, r) \land RS[rs]^T.op[x].tag = Itag(last(i, r)) \]
PROOF One asserts this claim by induction on $T$. For $T = 0$, there is nothing to show since the full bits of the reservation stations are not set in the initial configuration.
For $T+1$, we show the claim by applying lemma 6.41. Consider the case that an instruction is issued into the reservation station during cycle $T$. In this case, the claim is easily concluded using lemma 6.40 (correctness of the tags in the producer tables).
If no instruction is issued into the reservation station during cycle $T$, the tag in the reservation station does not change and we have
\[ RS[rs]^T.full \land \lnot RS[rs]^T.op[x].valid \]
according to lemma 6.41. This allows applying the induction premise, which concludes the claim. QED
6.5.9 Tag Uniqueness
We will now show the tag uniqueness properties for the different places
tags are used in the Tomasulo machine.
Recall that this property was shown in theorem 6.21. This theorem uses $\tau(i, T)$ as premise. Thus, we use “tag is unique” and $\tau(i, T)$ synonymously.
Lemma 6.43. Let $I_i$ be the instruction in the issue stage and let the valid bit of register $R[r]$ be not set. This implies that there is an instruction prior to $I_i$ writing $R[r]$ and the tag of the last such instruction is unique.
\[ sI_{issue}(T) = i \land \lnot R[r]^T.valid \implies L(i, r) \land \tau(last(i, r), T) \]
PROOF This claim is concluded by induction on $T$. For $T = 0$, there is nothing to show since the valid bits of all registers are set in the initial configuration.
For $T+1$, we apply lemma 6.40, which states that there is an instruction prior to $I_i$ writing $R[r]$ and that the tag in the producer table is the tag of instruction $j := last(i, r)$. In order to show the uniqueness of the tag, we have to assert that instruction $I_j$ is already issued but not yet terminated.
One easily asserts that instruction $I_j$ is already issued by definition of $last(i, r)$.
We show that instruction $I_j$ is not yet terminated by distinguishing two cases:
1. If an instruction with destination $R[r]$ is issued during cycle $T$, we show that $j = i - 1$ holds using lemma 6.36. This instruction cannot be terminated in cycle $T+1$, because this is a contradiction to lemma 6.9.
2. If no instruction with destination $R[r]$ is issued during cycle $T$, we assert that the valid bit of register $R[r]$ is not set during cycle $T$:
\[ \lnot R[r]^T.valid \]
This allows applying the induction premise for the instruction in the issue stage during cycle $T$ (instruction $sI_{issue}(T)$). Thus, we have:
\[ \tau(last(sI_{issue}(T), r), T) \]
If $issue(T)$ does not hold, we have $sI_{issue}(T) = i$ and the claim is concluded. Thus, let $issue(T)$ hold. We already showed the claim for the case that instruction $sI_{issue}(T)$ has destination register $R[r]$. For the case that it does not have such a destination register, we apply lemma 6.35, which states that
\[ last(i, r) = last(sI_{issue}(T), r) \]
holds. Thus, we have:
\[ \tau(j, T) \]
We therefore know that instruction $I_j$ did not terminate before cycle $T$. It is left to show that it does not terminate during cycle $T$. Assume it does terminate during cycle $T$. One easily asserts that the tag in the producer table of register $R[r]$ is the tag of instruction $I_j$ since it is unique according to the induction premise.
Thus, according to the writeback protocol, the valid bit of $R[r]$ is set during cycle $T$. This is a contradiction to the fact that $R[r]^{T+1}.valid$ does not hold. QED
Lemma 6.44. Let $I_i$ be in reservation station $rs$ and let that reservation station be full. This implies that the tag of instruction $I_i$ is unique.
\[ RS[rs]^T.full \implies \tau(sI_{RS}(rs, T), T) \]
PROOF One easily concludes that instruction $I_i$ is in phase “in RS”, as formally defined above. According to lemma 6.27, the instruction cannot be in two different phases during cycle $T$. Thus, it cannot be in the “not issued” phase, which allows concluding that it is already issued.
Furthermore, it cannot be in the “terminated” phase. Thus, $\tau(i, T)$ holds. QED
Lemma 6.45. Let $I_i$ be an instruction in a full reservation station. Let $x$ be a source operand that is not valid, and $r := S(i, x)$ be the source register. There is an instruction prior to $I_i$ writing $R[r]$. Let $I_j$ be the last instruction prior to instruction $I_i$ that writes $R[r]$.
We claim that instruction $I_j$ is in one of the following phases: 1) it is in a reservation station, 2) it is in a function unit, or 3) it is in a producer.
PROOF This claim is shown by induction on $T$. For $T = 0$, there is nothing to show since the reservation stations are not full in the initial configuration.
For $T+1$, we conclude the claim as follows: According to lemma 6.41, there are two cases: an instruction is issued into reservation station $rs$ during cycle $T$ or the instruction already was in the reservation station during cycle $T$.
• If an instruction is issued into the reservation station during cycle $T$, one easily asserts that the valid bit of the source register cannot be active (otherwise, the valid bit of the reservation station source operand is set and we have nothing to show). This allows applying lemma 6.40, which states that the tag of the last instruction writing the register is in the producer table. According to lemma 6.43, the tag is unique, i.e., instruction $I_j$ is already issued and has not yet terminated. Furthermore, the instruction is not in the “in ROB” phase during cycle $T$ and not on the CDB (otherwise, the valid bit of the reservation station source operand is set and we have nothing to show). Thus, it must be in a reservation station, function unit, or producer during cycle $T$.
We conclude the claim as follows: if the instruction is in a reservation station, we use lemma 6.59 in order to conclude that it either stays in that phase or enters a function unit. This concludes the claim.
If the instruction is in a function unit, we use lemma 6.59 in order to conclude that it either stays in that phase or enters a producer. This concludes the claim.
If the instruction is in a producer, we use lemma 6.59 in order to conclude that it either stays in that phase or moves into the ROB. The last case cannot happen, since this is a contradiction to the fact that the valid bit of the operand is not active. This is easily concluded since the tag of $I_j$ is valid because the instruction is in the “in producer” phase.
• If no instruction is issued into the reservation station during cycle $T$, one applies the induction premise. The induction premise states that instruction $I_j$ is in a reservation station, a function unit, or in a producer. After that, the claim is concluded as in the case above. QED
Lemma 6.46. Let I_i be an instruction in a full reservation station. Let x be a
source operand that is not valid, and r := S(i,x) be the source register. There is
an instruction prior to I_i writing R[r]. Let I_j be the last instruction prior to
instruction I_i that writes R[r]. The tag of that instruction is unique.
$$RS[rs]^T.full \wedge sI_{RS}(rs,T) = i \wedge \neg RS[rs]^T.op[x].valid \implies L(i,r) \wedge \tau(j,T)$$
PROOF One easily asserts this lemma by applying lemma 6.45. According to
lemma 6.27, the phases exclude each other. Thus, I_j cannot be in the “not
issued” or “terminated” phase, which concludes the claim. QED
Lemma 6.47. Let I_i be in producer fu and let that producer be full. This implies
that the tag of instruction I_i is unique.
$$P[fu]^T.full \implies \tau(sI_P(fu,T), T)$$
PROOF The instruction in the producer is in the “in producer” phase. Ac-
cording to lemma 6.27, the phases exclude each other. Thus, the instruc-
tion cannot be in “not issued” or “terminated” phase, which concludes the
claim.
Lemma 6.48. The tag of the instruction on the CDB is unique.
$$CDB^T.valid \implies \tau(sI_{CDB}(T), T)$$
One easily asserts this lemma by expanding the definition of sI_CDB(T)
and by applying lemma 6.47.
6.5.10 Data Consistency Invariants
In order to show data consistency, we claim a set of invariants. As done
in the previous chapters, we will show that all these invariants hold by
induction on T . The invariants are taken from [MPK00].
Invariant 6.1. Let instruction I_i be in the issue stage. Let r ≠ 0 be a register.
Let the valid bit of register R[r] be set. In this case, the register data is correct.
$$sI_{issue}(T) = i \wedge r \neq 0 \wedge R[r]^T.valid \implies R[r]^T.data = source(i,r)$$
Invariant 6.2. Let reservation station rs be full and let instruction I_i be in
reservation station rs. If an input operand of the reservation station is valid, the
value in the operand registers is the correct source operand of instruction I_i.
$$sI_{RS}(rs,T) = i \wedge RS[rs]^T.full \wedge RS[rs]^T.op[x].valid \implies RS[rs]^T.op[x].data = source(i)(x)$$
After all operands are valid, the instruction is passed to the function
unit. Once the instruction leaves the function unit, the result is stored in
a producer. The following invariant asserts that the producer holds the
correct result.
Invariant 6.3. Let producer fu be full and let instruction I_i be in producer fu.
The result in this producer is the result of instruction I_i.
$$sI_P(fu,T) = i \wedge P[fu]^T.full \implies P[fu]^T.result = result(i)$$
Once there is an instruction in a producer, the producer requests the
CDB. After the request is acknowledged, the result is put on the CDB.
Invariant 6.4. Let I_i be on the CDB. The result on the CDB is the result of I_i.
$$sI_{CDB}(T) = i \wedge CDB^T.valid \implies CDB^T.result = result(i)$$
While on the CDB, the results are written into the ROB. The following
invariant asserts that the results in the ROB are correct.
Invariant 6.5. Let I_i be in ROB entry tag and let that entry be valid. This implies
that the result in the ROB entry is the result of instruction I_i.
$$sI_{ROB}(tag,T) = i \wedge ROB[tag]^T.valid \implies ROB[tag]^T.result = result(i)$$
We now show lemmas that form the induction step of the invariant proof.
Lemma 6.49. Let invariant 6.3 (producer data consistency) hold during cycle T. This
implies that invariant 6.4 (CDB data consistency) holds during cycle T.
PROOF By definition, CDB^T.valid holds iff we complete an instruction, i.e.,
iff completion(T) holds. The producer holding the instruction we complete is
denoted by compl_p(T). We assume that we only complete an instruction
in a producer if that producer is full. Thus,
$$P[compl_p(T)]^T.full$$
holds. This allows applying invariant 6.3, which states that the result in the
producer is correct:
$$P[compl_p(T)]^T.result = result(sI_P(compl_p(T), T))$$
The term on the left hand side is the result on the CDB by definition:
$$CDB^T.result = result(sI_P(compl_p(T), T))$$
By definition of sI_CDB(T), we have sI_CDB(T) = sI_P(compl_p(T), T).
This concludes the claim.
Lemma 6.50. Let invariant 6.5 (ROB data consistency) and invariant 6.4 (CDB data
consistency) hold during cycle T. This implies that invariant 6.5 (ROB
data consistency) holds during cycle T+1.
PROOF In order to show the claim, we distinguish three cases:
1. Consider the case that an instruction is issued into ROB entry tag
during cycle T, i.e., we have:
$$issue(T) \wedge ROBtail^T = tag$$
In this case, the ROB entry tag is valid iff we have the result of the
instruction available during issue, i.e., iff issue_with_result(T) holds.
Thus, there is nothing to show unless issue_with_result(T) holds.
We easily conclude that sI_ROB(tag,T+1) is equal to sI_issue(T).
Thus, the result in the ROB is correct by definition.
2. Consider the case that we do not issue an instruction into ROB entry
tag during cycle T and that we receive a result from the CDB during
cycle T, i.e.:
$$CDB^T.valid \wedge CDB^T.tag = tag$$
In this case, the result on the CDB is stored in the ROB and we have
to argue its correctness:
$$result(sI_{ROB}(tag,T+1)) \overset{!}{=} ROB[tag]^{T+1}.result = CDB^T.result$$
According to invariant 6.4 (CDB data consistency), we have:
$$CDB^T.result = result(sI_{CDB}(T))$$
Thus, the claim holds if we show sI_ROB(tag,T+1) = sI_CDB(T),
i.e., it is left to show that the tag maps to the correct instruction.
These arguments are weak in [MPK00].
We show this formally using lemma 6.48. Lemma 6.48 states that
$$\tau(sI_{CDB}(T), T)$$
holds. This allows applying theorem 6.21, which states:
$$sI_{ROB}(Itag(sI_{CDB}(T)), T) = sI_{CDB}(T)$$
Thus, it is left to show:
$$sI_{ROB}(tag,T+1) \overset{!}{=} sI_{ROB}(Itag(sI_{CDB}(T)), T)$$
According to lemma 6.30, we have tag = Itag(sI_CDB(T)). This
transforms the claim into:
$$sI_{ROB}(tag,T+1) \overset{!}{=} sI_{ROB}(tag,T)$$
This is concluded by expanding the definition of sI_ROB(tag,T+1).
3. Consider the case that no instruction is issued into ROB entry tag and
that no result for ROB entry tag is on the CDB. We assert this case
using invariant 6.5 for cycle T. QED
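The three cases of this proof mirror the ways a single ROB entry can change from cycle T to cycle T+1. The following Python sketch illustrates that case split only; the names RobEntry and next_rob_entry, the keyword parameters, and the use of integers as results are assumptions made for illustration and are not taken from the verified design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RobEntry:
    valid: bool             # True once the result of the instruction is available
    result: Optional[int]   # payload; None while the entry is not valid


def next_rob_entry(entry: RobEntry, tag: int, *, issue: bool, rob_tail: int,
                   issue_with_result: bool, issue_result: int,
                   cdb_valid: bool, cdb_tag: int, cdb_result: int) -> RobEntry:
    """Hypothetical next-state function for ROB entry `tag` (cycle T -> T+1)."""
    # Case 1: an instruction is issued into this entry. The entry is valid
    # only if the result is already known at issue time.
    if issue and rob_tail == tag:
        if issue_with_result:
            return RobEntry(True, issue_result)
        return RobEntry(False, None)
    # Case 2: no issue into this entry, but the CDB carries a result for it.
    if cdb_valid and cdb_tag == tag:
        return RobEntry(True, cdb_result)
    # Case 3: neither issue nor CDB write; the entry keeps its contents.
    return entry
```

In this sketch, case 3 returns the old entry unchanged, which corresponds to applying invariant 6.5 for cycle T.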
Lemma 6.51. Let invariant 6.5 (ROB data consistency) and invariant 6.1 (register
file data consistency) hold during cycle T. This implies that invariant 6.1
(register file data consistency) holds during cycle T+1.
PROOF We distinguish three cases:
1. Consider the case that we issue an instruction with destination r during
cycle T. In this case, the valid bit R[r]^{T+1}.valid cannot hold and
there is nothing to show.
2. Consider the case that we write back an instruction with destination r
during cycle T and let the valid bit of R[r] be not active during cycle
T. We only do this writeback if the ROB entry that the ROB head
pointer points to is valid. According to invariant 6.5, this implies
that the result in the ROB entry is the result of the instruction. This
transforms the claim into:
$$result(sI_{ROB}(ROBhead^T, T))[e(r)] \overset{!}{=} source(i,r)$$
The tag of R[r] matches the ROB head pointer, since otherwise
R[r]^{T+1}.valid cannot hold and there is nothing to show.
According to lemma 6.40, that tag is equal to the tag of the last
instruction prior to the instruction in the issue stage that writes R[r].
This transforms the claim into:
$$result(sI_{ROB}(Itag(last(sI_{issue}(T), r)), T))[e(r)] \overset{!}{=} source(i,r)$$
According to lemma 6.43, that tag is unique. This allows applying
lemma 6.21, which transforms the claim into:
$$result(last(sI_{issue}(T), r))[e(r)] \overset{!}{=} source(i,r)$$
According to lemma 6.39, we have:
$$source(sI_{issue}(T), r) = source(sI_{issue}(T+1), r)$$
This transforms the claim into:
$$result(last(sI_{issue}(T), r))[e(r)] \overset{!}{=} source(sI_{issue}(T), r)$$
This is concluded using lemma 6.38.
3. If we neither issue an instruction with destination R[r] nor write
back an instruction with destination R[r] with R[r]^T.valid, assume
R[r]^T.valid does not hold. In this case, the valid bit R[r]^{T+1}.valid
cannot hold and there is nothing to show.
Thus, R[r]^{T+1}.valid holds. The claim is:
$$R[r]^T.data \overset{!}{=} source(i,r)$$
After applying the induction premise, this is transformed into:
$$source(sI_{issue}(T), r) \overset{!}{=} source(i,r)$$
We assert this using lemma 6.39. QED
Lemma 6.52. Let invariant 6.3 (producer data consistency) hold during cycle T and
invariant 6.2 (reservation station data consistency) hold during cycles T'
with T' ≤ T. This implies that invariant 6.3 (producer data consistency)
holds during cycle T+1.
PROOF One concludes this claim as follows: if an instruction moves
into the producer during cycle T, we make the assumption that the function
unit delivers a correct result given that it got correct inputs during all
cycles T' ≤ T. This is easily asserted using invariant 6.2 (reservation
station data consistency). For this, we have to assume that we only dispatch
instructions with valid operands.
If no instruction moves into the producer during cycle T, we conclude
$$sI_P(fu,T) = sI_P(fu,T+1).$$
Furthermore, we conclude that P[fu]^T.full holds and that the value in
P[fu].result does not change from cycle T to cycle T+1. This allows
concluding the claim from invariant 6.3 (producer data consistency) for
cycle T. QED
Lemma 6.53. If the tag on the CDB matches the tag of an instruction I_i and the tag
of that instruction is unique, then the instruction on the CDB is instruction I_i.
$$CDB^T.valid \wedge CDB^T.tag = Itag(i) \wedge \tau(i,T) \implies sI_{CDB}(T) = i$$
This is easily shown using lemma 6.30 (uniqueness of CDB tag) and
6.22.
The following two lemmas are used to argue the data consistency of the
reservation stations (invariant 6.2). Since this is where all forwarding is
done, this is the most complicated part of the proof. We therefore split the
proof of invariant 6.2 into two lemmas.
The first lemma shows the claim for the case the operand reading is
done in the issue stage. The second lemma shows the claim for the case
the operand reading is done in the reservation station. The same case split
is also done in [MPK00].
Lemma 6.54. Let invariant 6.2 (reservation station data consistency) and invariant
6.1 (register file data consistency) and invariant 6.4 (CDB data consistency)
and invariant 6.5 (ROB data consistency) hold during cycle T.
If an instruction is issued into reservation station rs, invariant 6.2 for
reservation station rs holds during cycle T+1.
PROOF We show this claim by a case split on the location the operand x
is read from. Let I_i be the instruction in the issue stage and let r = S(i,x)
be a shorthand for the number of the register we read.
• If r = 0 holds, we read zero and the claim holds by definition of
source(i,0).
• Reading from the register file: This is done iff R[r]^T.valid
holds. This allows applying invariant 6.1. This concludes the claim.
• Reading from the CDB: This is done only if R[r]^T.valid does not
hold. This allows applying lemma 6.40, which states that the tag
in the producer table is the tag of the last instruction writing R[r].
According to lemma 6.43, that tag is unique. This allows applying
lemma 6.53, which states that the last instruction writing R[r] is on
the CDB. According to invariant 6.4, the result on the CDB is the result
of that instruction.
Thus, it is left to show:
$$result(last(i,r))[e(r)] \overset{!}{=} source(i)(x)$$
We assert this using lemma 6.38.
• Reading from the ROB: We repeat the arguments from the case
above in order to show that the tag in the producer table is the tag
of the last instruction writing R[r]. Let tag denote the tag. This tag
is unique, and we therefore know that the instruction in ROB entry
tag is the last instruction writing R[r] (lemma 6.21). According to
invariant 6.5, the result in the ROB is the result of this instruction.
As before, we conclude the claim using lemma 6.38. QED
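The case split on the operand source can be pictured as a small selection function. The following Python sketch is an illustration under assumptions: the record types Reg, Cdb, RobEntry, and Operand, the function read_operand, and the priority of the CDB over the ROB are hypothetical, and the selection of the result component written e(r) above is omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Reg:
    valid: bool
    data: int = 0
    tag: int = 0      # producer-table tag; meaningful only if not valid


@dataclass(frozen=True)
class Cdb:
    valid: bool
    tag: int = 0
    result: int = 0


@dataclass(frozen=True)
class RobEntry:
    valid: bool
    result: int = 0


@dataclass(frozen=True)
class Operand:
    valid: bool
    data: Optional[int] = None
    tag: Optional[int] = None


def read_operand(r: int, regs: Dict[int, Reg], cdb: Cdb,
                 rob: Dict[int, RobEntry]) -> Operand:
    """Hypothetical operand read in the issue stage for source register r."""
    if r == 0:
        return Operand(True, 0)                    # register 0 reads as zero
    reg = regs[r]
    if reg.valid:
        return Operand(True, reg.data)             # read from the register file
    if cdb.valid and cdb.tag == reg.tag:
        return Operand(True, cdb.result, reg.tag)  # forward from the CDB
    if rob[reg.tag].valid:
        return Operand(True, rob[reg.tag].result, reg.tag)  # forward from the ROB
    return Operand(False, None, reg.tag)           # wait; keep the tag for snooping
```

If none of the sources provides the value, the sketch returns an operand that is not valid and only carries the tag; invariant 6.2 then has nothing to claim for it.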
Lemma 6.55. Let invariant 6.2 (reservation station data consistency) and invariant
6.4 (CDB data consistency) hold during cycle T.
If no instruction is issued into reservation station rs, invariant 6.2 for
reservation station rs holds during cycle T+1.
PROOF Let x be a source operand number. If the valid bit of operand x holds
during cycle T, one just applies invariant 6.2 for cycle T.
If not so, we snoop an operand from the CDB or we have nothing to
show. We argue the correctness of CDB snooping as follows: Let i be the
number of the instruction in reservation station rs during cycle T+1. The
claim of invariant 6.2 is:
$$RS[rs]^{T+1}.op[x].data \overset{!}{=} source(i)(x)$$
By expanding the definition of RS[rs]^{T+1}.op[x].data on the left hand
side, this is transformed into:
$$CDB^T.result[e(S(i,x))] \overset{!}{=} source(i)(x)$$
Invariant 6.4 states:
$$CDB^T.result = result(sI_{CDB}(T))$$
Thus, the claim is transformed into:
$$result(sI_{CDB}(T))[e(S(i,x))] \overset{!}{=} source(i)(x)$$
Thus, it is left to show that the result of the instruction on the CDB is the
source operand of the instruction in the reservation station. This is argued
as follows: According to lemma 6.38 with instructions I_i and I_{sI_CDB(T)}, the
claim above holds if we show the premises of the lemma. These premises
are:
$$S(i,x) \neq 0 \wedge L(i,S(i,x)) \wedge last(i,S(i,x)) = sI_{CDB}(T)$$
Thus, we have to show that the source register is not register 0 and that
there is an instruction before I_i that writes the register. One easily argues
this using invariant 6.42.
Furthermore, one has to show that the last instruction before I_i writing
the register is the instruction on the CDB. We argue this using the fact
that the tag on the CDB matches the tag stored in the reservation station
for the operand. According to invariant 6.42, that tag is the tag of the last
instruction writing the register.
Lemma 6.44 states that the tags in the reservation stations are unique.
This allows applying lemma 6.53, which concludes the claim. QED
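The snooping step argued in this proof amounts to a tag comparison against the CDB. A minimal Python sketch, assuming a hypothetical Operand record and a function snoop that are not part of the original design description, and again omitting the component selection e(S(i,x)):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Operand:
    valid: bool
    data: Optional[int] = None
    tag: Optional[int] = None   # tag of the producing instruction, if not valid


def snoop(op: Operand, cdb_valid: bool, cdb_tag: int, cdb_result: int) -> Operand:
    """Hypothetical update of one reservation station operand (cycle T -> T+1)
    when no instruction is issued into the reservation station."""
    if op.valid:
        return op                       # already valid: the data is kept
    if cdb_valid and cdb_tag == op.tag:
        # The tag on the CDB matches the tag stored for the operand:
        # take the result from the CDB and mark the operand valid.
        return Operand(True, cdb_result, op.tag)
    return op                           # still waiting; nothing to show
```

The correctness argument rests on the uniqueness of the stored tag: only then does a tag match identify the instruction on the CDB as the last writer of the source register.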
The following lemma combines the claims of lemma 6.54 and lemma
6.55.
Lemma 6.56. Let invariant 6.2 (reservation station data consistency) and invariant
6.1 (register file data consistency) and invariant 6.4 (CDB data consistency)
and invariant 6.5 (ROB data consistency) hold during cycle T. This implies
that invariant 6.2 for reservation station rs holds during cycle T+1.
This claim is shown using lemma 6.54 and lemma 6.55.
Theorem 6.57. The invariants 6.1 to 6.5 hold.
PROOF We show this claim by induction on T. We omit the simple arguments for
cycle T = 0.
The claim for T+1 is shown by applying lemmas 6.50, 6.51, 6.52, and
6.56 for cycle T and lemma 6.49 for cycle T+1. QED
Theorem 6.58. A machine implementing the Tomasulo protocols above satisfies the
following data consistency criterion:
$$R[r]^T_{aI}.data = R[r]^{sI_{writeback}(T)}_{aS}$$
Since all speculation registers are output of the writeback stage, this
criterion exactly matches the data consistency criterion as proposed for
the in-order pipelined machine.
PROOF Given the data consistency invariants above, one easily shows this claim
by induction on T. For T = 0, we have sI_writeback(T) = 0 and we therefore
have the claim that the registers are in the initial configuration. We
assume this.
For T+1, we show the claim as follows: In case writeback(T) does not
hold, one easily asserts that
$$sI_{writeback}(T) = sI_{writeback}(T+1)$$
holds and that the registers do not change from cycle T to T+1. Thus, the
claim is concluded using the induction premise. Let i be a shorthand for
sI_writeback(T).
In case writeback(T) holds, we do a case split on dest(i,r). If dest(i,r)
does not hold, we easily assert the claim using the induction premise.
If dest(i,r) and writeback(T) hold, we have the following claim:
$$ROB^T[ROBhead(T)].result[e(r)] \overset{!}{=} R[r]^{i+1}_{aS}$$
The register on the right hand side expands to the result of instruction I_i:
$$ROB^T[ROBhead(T)].result[e(r)] \overset{!}{=} result(i)[e(r)]$$
We assert this using invariant 6.5 for instruction I_i and tag ROBhead(T),
which holds according to theorem 6.57.
The claim of invariant 6.5 concludes the claim above. It is left to show
the premises of invariant 6.5, which are:
$$sI_{ROB}(ROBhead(T), T) = i \wedge ROB[ROBhead(T)]^T.valid$$
We assert the first part of this claim using lemma 6.24. The valid bit of
the ROB entry holds since we assume that we only write back if the valid
bit holds. QED
6.6 Liveness
We propose the following liveness criterion for the Tomasulo machine with
reorder buffer: we will show that all instructions will eventually be in the
terminated phase.
We use a similar liveness proof strategy as employed in chapter 4. We
show our claim by induction on i. Thus, the induction step is: given that
all instructions up to instruction I_{i-1} have terminated, instruction I_i
eventually terminates.
Informally, we show this as follows: We will show that instruction I_i
must be in a phase. According to lemma 6.27, that phase is unique. We
do a case split on the phase of instruction I_i. If instruction I_i is in the “in
ROB” phase, we easily assert that it eventually terminates. If instruction
I_i is in a producer, we assert that it will move into the “in ROB” phase. We
then conclude the claim as before. These arguments are continued until all
phases are covered.
We will now formalize this proof.
Lemma 6.59. If instruction I_i is in phase p during cycle T, this implies that it is
in one of the successor phases of phase p during cycle T+1.
$$p(i,T) \implies \bigvee_{p' \in succ(p)} p'(i,T+1)$$
PROOF We show this claim by way of example for the phase “not issued”. Thus, we
have to show that instruction I_i is still not issued, in a reservation station,
or in the ROB during cycle T+1.
• If issue(T) and sI_issue(T) = i do not both hold, one easily concludes
that instruction I_i stays in the “not issued” phase.
• If issue(T) and sI_issue(T) = i hold and issue_with_result(i) holds,
one easily shows that instruction I_i is in the reorder buffer during
cycle T+1.
• Otherwise, we assume that there is a reservation station rs such that
issue_rs(T,rs) holds. One easily verifies that instruction I_i is in that
reservation station during cycle T+1. QED
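The successor relation used in this lemma can be written down as a small table. The following Python sketch is a reconstruction from the phase transitions mentioned in the surrounding proofs, not the definition of succ from the thesis: the dictionary SUCC and the helper respects_succ are hypothetical names, and the edge from “in RS” directly to “in producer” assumes a function unit that may return its result in the cycle of dispatch.

```python
from typing import Dict, List, Set

# Assumed successor phases; every phase may also persist for another cycle.
SUCC: Dict[str, Set[str]] = {
    "not issued":  {"not issued", "in RS", "in ROB"},
    "in RS":       {"in RS", "in FU", "in producer"},
    "in FU":       {"in FU", "in producer"},
    "in producer": {"in producer", "in ROB"},
    "in ROB":      {"in ROB", "terminated"},
    "terminated":  {"terminated"},
}


def respects_succ(trace: List[str]) -> bool:
    """Check that consecutive phases of one instruction follow SUCC,
    i.e., that p(i,T) implies p'(i,T+1) for some p' in succ(p)."""
    return all(nxt in SUCC[cur] for cur, nxt in zip(trace, trace[1:]))
```

A trace such as "not issued", "in ROB", "terminated" corresponds to an instruction that is issued with its result already available.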
Lemma 6.60. Instruction I_i is in at least one phase during cycle T.
PROOF The claim is concluded by induction on T. For cycle T = 0, we conclude
the claim easily since all instructions are in the “not issued” phase.
For T+1, we conclude as follows: According to the induction premise,
instruction I_i is in at least one phase during cycle T. This allows applying
lemma 6.59, which states that instruction I_i is in one of the successor
phases of that phase. This concludes the claim. QED
The following lemmas form the induction step for the liveness proof.
Lemma 6.61. If there is a cycle such that instruction I_{i-1} either does not exist
or has terminated and instruction I_i is in the “in ROB” phase, instruction I_i
will eventually terminate.
PROOF Let T be the cycle given by the premise. According to the
premise, instruction I_i is in the “in ROB” phase during cycle T. This implies
that it is not terminated yet. Since we either have i = 0 or the previous
instruction is terminated, we have
$$i = sI_{writeback}(T)$$
We show that instruction I_i terminates during cycle T, i.e., it is left to
show that writeback(T) holds. As described above, we assume that we
always terminate if the ROB is not empty and the ROB entry that ROBhead
points to is valid. One easily asserts that the ROB is not empty during cycle
T using that instruction I_i is in the “in ROB” phase during cycle T.
According to the premise, there is a ROB entry tag that is valid and such
that
$$sI_{ROB}(tag,T) = i$$
holds. Using lemma 6.11, we assert that tag is the tag of instruction I_i.
Using lemma 6.23, we assert that entry tag is the entry ROBhead(T) points
to. Thus, the ROB entry ROBhead(T) points to is valid and we write back.
QED
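The writeback condition assumed in this proof (the ROB is not empty and the entry at the head pointer is valid) can be illustrated with a circular buffer. The following Python sketch rests on assumptions: the class Rob and its methods issue, complete, and writeback are hypothetical names, and results, flags, and register updates are left out.

```python
from typing import List, Optional


class Rob:
    """Hypothetical reorder buffer holding only the valid bits."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.valid: List[bool] = [False] * size
        self.head = 0    # oldest instruction that has not terminated
        self.tail = 0    # entry allocated by the next issue
        self.count = 0   # number of allocated entries

    def issue(self) -> int:
        """Allocate the tail entry; the returned index serves as the tag."""
        assert self.count < self.size, "issue must be stalled while the ROB is full"
        tag = self.tail
        self.valid[tag] = False
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return tag

    def complete(self, tag: int) -> None:
        """A result for entry `tag` arrives, e.g., via the CDB."""
        self.valid[tag] = True

    def writeback(self) -> Optional[int]:
        """Terminate the head entry iff the ROB is not empty and the head is valid."""
        if self.count > 0 and self.valid[self.head]:
            tag = self.head
            self.valid[tag] = False
            self.head = (self.head + 1) % self.size
            self.count -= 1
            return tag
        return None
```

Because only the head entry can terminate, a younger instruction whose result is already available still waits for all older instructions, which is the in-order termination used throughout the liveness argument.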
Lemma 6.62. If producer fu is full during cycle T, then there is a cycle T' ≥ T
such that the instruction is put on the CDB.
$$P[fu]^T.full \implies \exists T' \ge T : completion(T') \wedge compl_p(T') = fu \wedge sI_P(fu,T') = sI_P(fu,T)$$
PROOF In order to show this claim, we make the assumption that the CDB requests
are served using a fair arbiter. One has to show by induction that instruction I_i
stays in the producer fu until the request is served. For this
purpose, we have to assume that the function unit does not overwrite an
instruction in its producer. This is illustrated in figure 6.10. Formally, the
function unit fu provides a result during cycle T iff FU[fu]^T.valid holds.
The producer generates a stall signal if it is full and does not get the
CDB. Let fuins(fu,T).stall denote the value of this signal during cycle
T.
$$fuins(fu,T).stall := P[fu]^T.full \wedge \neg(completion(T) \wedge compl_p(T) = fu)$$
[Figure 6.10: Interface between function unit and producer. Signals: result, tag,
flags, valid (function unit to producer); stall (producer to function unit). The
function unit is fed from the reservation station; the producer drives the CDB.]
We assume that the function unit does not provide a result if it gets a
stall signal.
$$fuins(fu,T).stall \implies \neg FU[fu]^T.valid$$
Since the CDB is assigned using a fair arbiter, there is a cycle T' such
that the request is acknowledged. Using the assumption on the function
unit above, one easily shows by induction that the instruction stays in the
producer until this happens and is not overwritten. QED
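The stall signal and the fairness assumption can be illustrated as follows. The Python sketch below is not the arbiter of the verified design: it uses a round-robin arbiter as one possible fair arbiter (the proof only requires fairness), and the names stall, RoundRobinArbiter, and cycles_until_served are hypothetical.

```python
from typing import List, Optional


def stall(full: bool, completion: bool, compl_p: int, fu: int) -> bool:
    """Producer fu stalls its function unit iff it is full and does not get the CDB."""
    return full and not (completion and compl_p == fu)


class RoundRobinArbiter:
    """One possible fair arbiter for the CDB (assumption, see lead-in)."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.last = n - 1   # producer 0 has the highest priority initially

    def grant(self, requests: List[bool]) -> Optional[int]:
        # Scan the producers, starting right after the last winner.
        for k in range(1, self.n + 1):
            cand = (self.last + k) % self.n
            if requests[cand]:
                self.last = cand
                return cand
        return None         # no request, hence no completion in this cycle


def cycles_until_served(fu: int, n: int) -> int:
    """Number of cycles producer fu waits if all n producers request in every cycle."""
    arb = RoundRobinArbiter(n)
    for cycle in range(n):
        if arb.grant([True] * n) == fu:
            return cycle
    raise AssertionError("unreachable for a round-robin arbiter")
```

With n producers, a persistent request is acknowledged after at most n - 1 other grants in this sketch, so the cycle T' from the lemma exists; while waiting, stall is active and the function unit must not deliver a new result.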
Lemma 6.63. If there is a cycle such that instruction I_{i-1} either does not exist
or has terminated and instruction I_i is in the “in producer” phase, instruction I_i
will eventually terminate.
PROOF Let T be the cycle from the premise of the lemma. Thus, instruction I_i
is in a producer during cycle T. Let this be producer fu. We will show
that this instruction eventually moves into the reorder buffer. Although we
assume that all instructions prior to instruction I_i already terminated, this is
not obvious. In particular, there might be instructions later than instruction
I_i that block the CDB.
According to lemma 6.62, there is a cycle T' ≥ T such that the request
is served and the instruction is still in the producer. Formally, we have:
$$completion(T') \wedge compl_p(T') = fu \wedge sI_P(fu,T') = sI_P(fu,T)$$
One easily concludes that instruction I_i is in ROB entry Itag(i) during
cycle T'+1. This allows applying lemma 6.61, which shows that the
instruction eventually terminates. QED
Note that the assumption that the CDB is allocated using a fair arbiter is not
necessary for liveness; we make it for the sake of simplicity only. If the CDB
is not allocated using a fair arbiter, we can argue as follows: Informally,
assume instruction I_i is blocked in a producer by instructions later than I_i.
Since we terminate in-order, there is an upper bound for the number of
these instructions, which is the number of ROB entries. Thus, instruction
I_i will eventually get the CDB.
Lemma 6.64. If there is a cycle such that instruction I_{i-1} either does not exist
or has terminated and instruction I_i is in the “in FU” phase, instruction I_i will
eventually terminate.
PROOF Let T be the cycle from the premise of the lemma. Thus, instruction I_i is
in a function unit during cycle T. Let this be function unit fu. We will
show that this instruction eventually moves into the producer P. Although
we assume that all instructions prior to instruction I_i already terminated,
this is not obvious. In particular, there might be instructions later than
instruction I_i that block the function unit or the producer.
In order to show this claim, we have to make the following assumption
on the function units: Given that the signal fuins(fu,T).stall is finite
true and that instruction I_i entered the function unit, there is a later cycle
such that the instruction leaves the unit.
$$\left(\forall T'\, \exists T'' \ge T' : \neg fuins(fu,T'').stall\right) \wedge in(i,T,fu) \implies \exists T''' \ge T : out(i,T''',fu)$$
One easily asserts that the signal fuins(fu,T).stall is finite true using
the fact that the CDB is allocated using a fair arbiter. Thus, we have a cycle
T''' such that the instruction leaves the function unit. One easily asserts that
this instruction moves into the producer during that cycle. We then apply
lemma 6.63 in order to conclude the claim. QED
In analogy to lemma 6.62, we show:
Lemma 6.65. If a reservation station is full during cycle T, there is a cycle
T' ≥ T such that this reservation station is dispatched during cycle T'.
Furthermore, the instruction in the RS during cycle T' is the same as during
cycle T.
$$RS[rs]^T.full \implies \exists T' \ge T : dispatch\_rs(T',rs) \wedge sI_{RS}(rs,T') = sI_{RS}(rs,T)$$
PROOF As described above, dispatching is done using a fair arbiter. The
arbiter selects among the reservation stations that are full and valid. The
first thing to assert is that the reservation station is valid. Assume it is
not. In this case, one can apply lemma 6.45, which states that there is
an instruction I_j with j = last(i,r) that is in a reservation station, in a
function unit, or in a producer. This is a contradiction to the premise that all
instructions I_j with j < i are already terminated.
The function unit provides a stall signal. We denote this stall signal by
FU[fu]^T.stall. Dispatching is only done if the function unit is not stalled.
We assert this using the following assumption on function units: If the stall
signal that is input of the function unit is finite true, then the stall signal
that is output of the function unit is finite true.
$$\left(\forall T'\, \exists T'' \ge T' : \neg fuins(fu,T'').stall\right) \implies \left(\forall T'\, \exists T'' \ge T' : \neg FU[fu]^{T''}.stall\right)$$
One shows that the stall signal that is input of the function unit is finite
true using that the CDB is assigned using a fair arbiter, as above. This
concludes the claim. QED
Lemma 6.66. If there is a cycle such that instruction I_{i-1} either does not exist
or has terminated and instruction I_i is in the “in RS” phase, instruction I_i will
eventually terminate.
PROOF Let T be the cycle from the premise of the lemma. We conclude this claim
easily using lemma 6.65. According to this lemma, there is a cycle T' ≥ T
such that the instruction is dispatched. There are two cases:
• The function unit returns the result of instruction I_i in the same cycle.
In this case, one shows that the instruction moves into the “in
producer” phase and uses lemma 6.63 in order to conclude the claim.
• The function unit does not return the result of instruction I_i in the
same cycle. In this case, one shows that the instruction is in the “in FU”
phase during cycle T'+1 and uses lemma 6.64 in order to conclude
the claim. QED
Lemma 6.67. If there is a cycle such that instruction I_{i-1} either does not exist
or has terminated and instruction I_i is in the “not issued” phase, instruction I_i
will eventually terminate.
PROOF We will show that the instruction eventually either moves into
the ROB or into a reservation station, depending on issue_with_result(i).
This happens if the instruction is issued. We then conclude the claim using
lemma 6.61 or 6.66, respectively.
Thus, it is left to show that the instruction is eventually issued. The
issue stage belongs to the in-order part of the machine. As done in the
previous chapters, one easily concludes that this happens if the stall signal
of the stage is finite true. The issue stage is stalled if one of the following
conditions holds [Krö99]:
• The ROB is full. One argues that this cannot be the case since all
instructions I_j prior to I_i terminated. Thus, we have
$$sI_{issue}(T) = sI_{writeback}(T),$$
which implies that the ROB is empty (lemma 6.4).
• There is no reservation station available. One easily concludes that
all reservation stations are empty because all instructions are either
in the “not issued” or “terminated” phase during cycle T. Thus, they
cannot be in the “in RS” phase according to lemma 6.27.
• In case of the DLX, there are some instructions that require stalling
issue because they depend on registers that the Tomasulo scheduler
cannot forward. In case of a conditional branch or jump register
instruction, one has to wait until the source register is valid. Assume
it is not. In this case, we can apply lemma 6.43, which states that
there is an instruction I_j with j = last(i,r) that is already issued but
not yet terminated. This is a contradiction.
• In case the instruction is a movs2i and the source register is IEEEf,
we have to stall issue until the ROB is empty. This arises from the
fact that the Tomasulo scheduling algorithm is not able to forward
this register. As above, one easily concludes that the ROB is empty.
• The designs we verify are based on the designs presented in [Krö99].
The machine stalls issue until the ROB is empty in case the instruction
is an rfe instruction. This arises from the hardware cost constraints.
We do not have enough read ports for the SPR producer
table to forward ESR, EPC, and EDPC. As above, one easily concludes
that the ROB is empty.
Thus, the instruction is issued eventually, which concludes the claim. QED
Note that in contrast to the machine given in [Krö99], we do not have to
stall issue because of busy instruction memory. This arises from the fact
that our stall engine allows stalling stages independently.
The following lemma forms both the induction step and induction base
for the main liveness claim.
Lemma 6.68. If there is a cycle such that instruction I_{i-1} either does not exist
or has terminated, instruction I_i will eventually terminate.
PROOF Let T be the cycle from the premise. According to lemma 6.60, instruction
I_i is in a phase. If this is “not issued”, we conclude the claim using lemma
6.67. If it is “in RS”, we conclude the claim using lemma 6.66. If it is “in
FU”, we conclude the claim using lemma 6.64. If it is “in producer”, we
conclude the claim using lemma 6.63. If it is “in ROB”, we conclude the
claim using lemma 6.61. If it is “terminated”, the claim obviously holds. QED
Lemma 6.69. Instruction I_i eventually terminates.
PROOF We show this claim by induction on i. For i = 0, we apply lemma 6.68.
This is also done for the induction step.
6.7 Verifying the DLX Implementation
In this section, we show that the implementation machine I with configurations
c^0_I, ... complies with the specification.
6.7.1 Implementation Differences
We do not describe the implementation of the DLX with Tomasulo scheduler
and reorder buffer, since this design is already presented in [Krö99] in
detail including cost and cycle time analysis.
In this section, we describe the differences between the implementation
given in [Krö99] and the implementation used for this thesis. Figure 6.11
shows an overview of the hardware.
[Figure 6.11: Overview of the Tomasulo Hardware. Components shown: IM, PC
environment (PC', DPC), IR.1, reservation stations, function units ALU, FPU1,
FPU2, FPU3, and MEM, producers, CDB, ROB, and the register files GPR, FPR, and
SPR; the stages are labeled IF, ID, EX, C, and WB.]
Instruction Fetch In [Krö99], the PC environment from [Lei99] is used.
In order to prevent the destruction of the PC registers, stages 0 and 1 are
always clocked simultaneously. We remove this limitation by using the PC
environment and the stall engine described in chapter 5 (in-order machine
with Delayed PC and speculation) instead.
Issue As described above, we no longer need an issue stall because of
instruction memory stalls. This is a feature of the new stall engine.
Dispatch In contrast to [Krö99], the instructions do not move from one
RS into another. The implementation in [Krö99] is motivated by the
liveness proof, which uses the fact that one selects the oldest instruction
for dispatch. We use a fair arbiter instead.
Function Units In contrast to [Krö99], we do not implement out-of-order
dispatch for the memory unit. This simplifies implementing paging.
As an example, consider two store instructions. The first one modifies the
page table and the second one modifies a memory cell in a page that is
affected. Passing the instructions in program order to the memory function
unit significantly simplifies the task of building such a functional unit.
CDB In [Krö99], we allocated the CDB round-robin. We use a fair
arbiter instead (fairness is a weaker requirement than round-robin).
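The fairness requirement can be illustrated with a small sketch. The
thesis does not specify the arbiter circuit here, so the following
rotating-priority arbiter is only one possible fair arbiter, not the
design used in the thesis; the class and method names are hypothetical.
It has the property that a requester which keeps its request active is
granted after at most n cycles.

```python
from typing import List, Optional


class RotatingPriorityArbiter:
    """Illustrative fair arbiter (an assumption, not the thesis design).

    Each cycle, the scan for an active request starts just behind the
    requester that was granted last, so no persistent request starves.
    """

    def __init__(self, n: int) -> None:
        self.n = n
        self.last = n - 1  # requester 0 has highest priority initially

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the index of the granted requester, or None if idle."""
        for offset in range(1, self.n + 1):
            k = (self.last + offset) % self.n
            if requests[k]:
                self.last = k
                return k
        return None
```

With three requesters that all keep requesting, the grants cycle through
0, 1, 2; if only requesters 0 and 2 request, they alternate.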
6.7.2 Verifying the Instruction Fetch
In the proofs above, we assumed that the instruction fetch is done correctly.
The instruction fetch mechanism in stages 0 and 1 operates like the
in-order pipelined machine described in chapter 5. The verification of the
forwarding of DPC for the instruction fetch uses the very same arguments
as before.
One combines the two machines as follows: we define that we issue an
instruction if the output registers of the decode/issue stages are clocked.
This happens iff $ue_1^T$ is active, as described in the previous chapters.
$issue(T) := ue_1^T$
For the correctness proof, we argue on the schedules of both parts of the
machine. We argue that the schedule of the issue stage of the Tomasulo
part matches the schedule of the issue stage of the in-order pipeline.
$issue(T) \stackrel{!}{=} sI(1,T)$
We show this claim by induction on T. For T = 0, we have $issue(T) = 0$
and $sI(1,T) = 0$.
For T + 1, we show the claim by a case split on $ue_1^T$. If $ue_1^T$ does not
hold, the value of both scheduling functions does not change from cycle
T to T + 1 by definition. Thus, the claim is concluded using the induction
premise.
If $ue_1^T$ holds, we have
$sI(1,T+1) = sI(1,T)+1$
according to invariant 5.1.
By definition, issue(T) holds if $ue_1^T$ holds. Thus, we have
$issue(T+1) = issue(T)+1$
by definition of issue(T + 1). This allows concluding the claim using the
induction premise.
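The induction above can be mirrored by a small simulation. This is a
sketch under assumptions: the update enable signal of stage 1 is modeled
as a plain list of booleans, and the function names are hypothetical.
The schedule of the in-order part is computed by the recurrence from
invariant 5.1, while the issue schedule is computed independently as the
number of cycles in which the update enable was active.

```python
from typing import List


def sI_issue_stage(ue1: List[bool]) -> List[int]:
    """Schedule of the issue stage by recurrence: +1 whenever ue1 is active."""
    sI = [0]
    for ue in ue1:
        sI.append(sI[-1] + 1 if ue else sI[-1])
    return sI


def issue_count(ue1: List[bool], T: int) -> int:
    """Number of instructions issued before cycle T (cycles with ue1 active)."""
    return sum(1 for t in range(T) if ue1[t])
```

For any trace of the update enable signal, both functions agree in every
cycle, which is the claim shown by induction on T.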
6.7.3 Verifying IEEEf
The IEEEf (IEEE flags) register is a special case for the correctness proof
of the machine, since the IEEE standard [IEE85] requires that the bits in
this register are sticky. Thus, if a floating point instruction generates a
masked IEEE exception, the bit of this exception is set in the IEEEf
register. The bits that were set previously are maintained. However, in case of
a movi2s instruction with destination IEEEf, all bits are overwritten.
One argues the data consistency of the register by induction. As induction
claim we show the data consistency of the complete machine. For
T = 0, we show the correctness of the initialization. For T + 1, we have
the data consistency up to cycle T as premise. The first thing is to argue the
correctness of the interrupt mask in $SR_I^T$. This holds according to the
induction premise. Let i be a shorthand for $sI_{writeback}(T)$. We distinguish
three cases:
• If we do not write back an instruction, we have
  $sI_{writeback}(T) = sI_{writeback}(T+1)$.
  The registers also do not change. Thus, the claim holds.
• If we write back an instruction that is movi2s with destination register
  IEEEf, the correctness is shown as above.
• If we write back an instruction which sets IEEE flags, we have:
  $sI_{writeback}(T+1) = i+1$
  We assert the correctness of the flags as above using invariant 6.5.
  Let ieeeflags(i) denote the IEEE flags generated by instruction $I_i$:
  $ROB[ROBhead]^T.result[2] = ieeeflags(i)$
  We assert the correctness of the old value in the IEEE flags register
  using the induction premise:
  $IEEEf_I^T = IEEEf_S^i$
  The new value written into the IEEE flags register is the old value
  OR the masked new one.
  $IEEEf_I^{T+1} = IEEEf_I^T \vee (ROB[ROBhead]^T.result[2] \wedge SR_I^T)$
  The claim is that this is the correct value:
  $IEEEf_I^{T+1} \stackrel{!}{=} IEEEf_I^{i+1}$
  One expands the transition function of the specification machine on
  the right hand side:
  $IEEEf_I^{T+1} \stackrel{!}{=} IEEEf_I^i \vee (ieeeflags(i+1) \wedge CA_S^i)$
  This is easily concluded using the equations above.
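The three cases of the update of the sticky flags register can be
sketched as follows. This is a minimal illustration, not the hardware
of the thesis: the register is modeled as a five-bit integer (the five
flags of Table B.6), and the function name, the event names and the
parameter names are hypothetical.

```python
FLAG_BITS = 0b11111  # five IEEE flag bits, assumed from Table B.6


def next_ieeef(old: int, event: str, value: int = 0,
               mask: int = FLAG_BITS) -> int:
    """Next value of the sticky flags register.

    event == "none":   no instruction is written back, register unchanged
    event == "movi2s": all bits are overwritten with `value`
    event == "fp":     `value` holds the new flags; the masked new flags
                       are ORed into the old value (sticky bits)
    """
    if event == "none":
        return old
    if event == "movi2s":
        return value & FLAG_BITS
    if event == "fp":
        return old | (value & mask)
    raise ValueError(f"unknown event: {event}")
```

In the "fp" case, bits that were set previously are never cleared, which
is exactly the sticky behavior required above.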
One cannot forward the IEEEf register using the mechanisms described
above. We therefore stall the issue stage if we read this register until the
ROB is empty. As soon as the ROB is empty, we have
$sI_{issue}(T) = sI_{writeback}(T)$.
In this case, one easily concludes the correctness of the value in the
register using the data consistency criterion above.
6.7.4 Verifying Interrupts
In this section, we describe how to verify a machine that generates
interrupts. The proof method is taken from [MP00]. We show the data
consistency by induction on T. For T = 0, we have the correctness of the
initialization of the machine. Note that we do not process an interrupt during
cycle T. We realize the reset interrupt by adjusting the initial configuration
accordingly, as done in chapter 5.
Let lastint(T) denote the number of the last cycle before cycle T in
which we processed an interrupt plus one (i.e., the maximum value of
lastint(T) is T). In case no such cycle exists, we define lastint(T) to be
zero.
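The definition of lastint(T) can be made concrete with a short sketch.
The list `interrupt` is a hypothetical trace in which `interrupt[t]` is
true iff an interrupt is processed during cycle t; it is an assumption
made for this illustration only.

```python
from typing import List


def lastint(T: int, interrupt: List[bool]) -> int:
    """Last cycle before T with a processed interrupt, plus one; 0 if none."""
    for t in range(T - 1, -1, -1):
        if interrupt[t]:
            return t + 1
    return 0
```

If an interrupt is processed in cycle T - 1, the function returns T,
which is the maximum value mentioned above.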
In order to show the claim for T + 1, we distinguish two cases:
• If we have an interrupt during cycle T, we argue as follows: according
  to the induction premise, the data consistency for cycle T holds.
  The modifications made by an interrupt on the configuration are easy
  to verify using this fact.
• If we do not have an interrupt during cycle T, we argue as follows:
  We claim that the machine works as the abstract implementation
  machine without interrupts above from cycle lastint(T) to cycle T + 1.
  We initialize the abstract machine without interrupts using the
  configuration $c_I^{lastint(T)}$:
  $c_{aI}^0 := c_I^{lastint(T)}$
  We then show that the transitions made by both machines are equal
  from cycle lastint(T) to cycle T + 1 using induction on the cycle
  number. For this one uses the fact that there are no interrupts from
  cycle lastint(T) to cycle T + 1 by definition of lastint(T).
Liveness Note that the liveness of the machine with interrupts does not
require extra arguments as required in chapter 5. This arises from the fact
that the instruction that generates the interrupt retires as usual and is not
executed a second time. This is in contrast to the implementation of inter-
rupts given in chapter 5.
6.8 Literature
In this chapter, we formally verify the Tomasulo scheduling algorithm with
reorder buffer as presented in [MPK00]. In contrast to [MPK00], we verify
the correctness using PVS and argue the uniqueness of the tags.
The parts of the hardware are based on machines described in [Lei99].
The correctness of the designs presented in [Lei99] is not verified by
machine.
Hosabettu et al. verify implementations using a Tomasulo scheduler both
with and without reorder buffer [HGS99, HGS00, Hos00] using the
completion functions approach. The verification is done using PVS at a very
high level of abstraction. Gate-level designs are not verified. The
functional units are very simple and do not contain cycles. Despite that, the
size of the PVS proofs in [Hos00] is four times the size of the proofs for
this chapter of this thesis. However, [Hos00] makes extensive use of proof
strategies, which enlarges the PVS proofs significantly.
In [BBCZ98], Clarke et al. verify out-of-order processors by combining
symbolic model checking with uninterpreted functions. In [BCRZ99],
Clarke et al. verify safety properties of a PowerPC, which implements
out-of-order execution and precise interrupts.
Sawada and Hunt [SH99] verify the FM9801, which also features a re-
order buffer, using the theorem proving system ACL2. The number of
lemmas is enormous (nearly 4000).
Henzinger et al. [HQR98] verify a simple out-of-order processor us-
ing a model checker. McMillan [McM98] partly automates the proof by
refinement of Tomasulo’s algorithm presented in [DP97] with the help of
compositional model checking. This technique is improved in [McM99b]
by theorem proving methods to support an arbitrary register size and num-
ber of function units. In [McM99a], McMillan verifies the liveness of a
machine with Tomasulo scheduler using SMV.
Arvind and Shen [AS99] describe how to apply term rewriting systems
in order to model microprocessors. The authors give a simple out-of-order
RISC machine with reorder buffer as an example. The authors suggest the
use of tools such as PVS for verifying large, realistic machines.
Chapter 7
Perspective
This thesis covers the verification of in-order and out-of-order micropro-
cessor designs. We develop generic theories for forwarding and specula-
tion and demonstrate how they can be applied to DLX-like RISC proces-
sors. However, several aspects are not covered by this thesis.
7.1 Functional Units
Apart from a simple ALU, the correctness of the functional units is not
covered by this thesis. This ALU needs further enhancements. For example,
the ALU verified in this thesis lacks an integer multiplier. Furthermore, all
ALU instructions assume signed operands. Commercial microprocessors,
such as the MIPS series or the i860, support unsigned operations, too.
For example, this affects overflow detection. The 29K has three variants
for addition/subtraction operations:
1. Suppress interrupts,
2. signed (interrupt if the result is not in the range of the two’s
complement numbers),
3. unsigned (interrupt if the result is not in the range of the binary
numbers).
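The three variants can be illustrated for addition as follows. This is
a sketch only: the word width of 32 bits, the function names and the
mode names are assumptions for the illustration and are not taken from
the 29K instruction set.

```python
WIDTH = 32                      # assumed word width
MASK = (1 << WIDTH) - 1


def to_signed(x: int) -> int:
    """Interpret a WIDTH-bit pattern as a two's complement number."""
    return x - (1 << WIDTH) if x >> (WIDTH - 1) else x


def add_with_mode(a: int, b: int, mode: str):
    """Add two WIDTH-bit patterns; return (result bits, interrupt flag)."""
    raw = a + b
    result = raw & MASK
    if mode == "suppress":      # variant 1: never interrupt
        return result, False
    if mode == "unsigned":      # variant 3: result must be a binary number
        return result, raw > MASK
    if mode == "signed":        # variant 2: result must be two's complement
        exact = to_signed(a) + to_signed(b)
        lo, hi = -(1 << (WIDTH - 1)), (1 << (WIDTH - 1)) - 1
        return result, not (lo <= exact <= hi)
    raise ValueError(f"unknown mode: {mode}")
```

The result bits are the same in all three variants; only the overflow
condition differs. For example, 0x7FFFFFFF + 1 overflows as a signed
addition but not as an unsigned one.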
This also affects test/set operations. The design presented here offers
both ≤ and ≥ tests, which is superfluous since one can get the desired
operations by implementing one and swapping operands if necessary. The
test/set instructions implemented in this thesis assume that the operands are
two’s complement numbers. Processors such as the MIPS RISC series
also implement ALU test/set instructions that assume that the operands are
unsigned binaries.
Furthermore, modern microprocessors implement instructions with satu-
ration, i.e., if an overflow occurs, the result is set to the edge of the number
range.
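Saturation can be illustrated with a signed addition. This is a minimal
sketch, assuming 32-bit two's complement operands; the function name is
hypothetical.

```python
INT_MIN = -(1 << 31)            # assumed 32-bit two's complement range
INT_MAX = (1 << 31) - 1


def saturating_add(a: int, b: int) -> int:
    """Add two signed values; clamp the result to the edge of the range."""
    return max(INT_MIN, min(INT_MAX, a + b))
```

Instead of wrapping around or raising an interrupt, an overflowing sum
is replaced by the largest or smallest representable number.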
Floating point units are not covered at all by this thesis. The formal
verification of a complete floating point unit is subject of the PhD thesis of
Christian Jacobi [Jac01]. The adder is verified by Christoph Berg [Ber01]. The
proofs and designs are taken from [MP00] and verified using the theorem
proving system PVS. This includes a formalization of the IEEE standard and a
proof that the designs comply with this standard.
The architecture used in this thesis lacks SIMD (single instruction
multiple data) instructions. For example, one can process two single precision
floating point operands within a 64-bit word simultaneously with little extra
hardware cost.
Furthermore, we do not cover how to build and verify memory inter-
faces. The verification of a memory interface including first level on-chip
cache is subject of the PhD thesis of Sven Beyer [Bey01]. This includes
support for virtual memory, which is implemented using a TLB. The cor-
rectness is verified formally using PVS.
7.2 In-Order Scheduling and Forwarding
Besides the schedulers covered by this thesis, there are more scheduling
methods in use in commercial microprocessors. As for in-order machines,
this includes multiple instruction issue machines, i.e., pipelined in-order
machines with two or more parallel pipelines. These machines are able to
issue multiple instructions within the same cycle. Furthermore, we did not
verify in-order schedulers for functional units with variable latency, such
as result shift registers, as used in [MP00].
7.3 Speculation
The generic speculation mechanism presented in chapter 5 assumes that
we have a guarantee that an instruction is never rolled back twice. However,
one might want to build machines that require this feature. Note that the
hardware presented in chapter 5 supports it; it is left to show its correctness
for this case.
7.4 Out-of-Order Execution
The out-of-order machine we present uses a reorder buffer and therefore
in-order termination. Machines without reorder buffer and with out-of-order
termination are not verified. Furthermore, multiple instruction issue
machines with Tomasulo scheduler are not covered. Commercial
designs also feature two or more CDBs, which is not covered either. A machine
with Tomasulo scheduler, multiple instruction issue and reorder buffer is
described in [Hil00]. However, the designs are not verified by machine.
7.5 Synthesizing Hardware
The subject of the master’s thesis of Dirk Leinenbach is converting the PVS
hardware specification into synthesizable Verilog HDL. This allows
building hardware implementations of the designs using ASICs or FPGAs, and
thus realistic cost and performance measurements. In particular, it allows
evaluating the real hardware cost in chip area rather than gate count. This
means that one can take the hardware cost of wiring into account.
The Tomasulo scheduler uses several large bus structures. It is of interest
whether these bus structures have significant impact on the hardware cost
and cycle time of the design. The evaluation in [Krö99] does not cover
this, since the hardware model presented in [MP95] is used. This hardware
model does not take wiring into account.
Another approach of interest is automated conversion from Verilog or
other hardware description languages into PVS for formal verification. This
approach is used by Russinoff in order to verify AMD’s floating point units,
for example. He converts an in-house, synthesizable HDL into the ACL2
language and verifies the correctness using ACL2. The benefit of this
approach is that it permits verifying existing designs in HDL.
Appendix A
Theorem Index
A.1 The PVS Proof Tree
In this chapter, we provide a mapping from the theorems in this thesis to
the theorems in the PVS proof tree. This mapping is limited, however.
For the sake of simplicity, we sometimes present multiple lemmas of the PVS
proof tree as a single one in this thesis. For example, we have a single lemma
that states that the initialization of the machine is correct. In the PVS proof
tree, we use a separate lemma for each stage.
The following tables provide the number of the lemma or theorem, the
page number, the file name of the file the lemma is to be found in, and the
lemma name.
A.2 Basic Concepts
Th. Page File Name
2.1 15 btree btree lem
2.2 16 zerotester zerotester correct
2.3 16 tester equality tester correct
2.4 17 pp pp correct2
2.5 18 pp pp spec equiv lem
2.6 18 pp pp Xp lem
2.7 19 pp pp correct1 lem
2.8 20 bvhelp bv adder cin is add
2.9 21 cla cla cout lemma
2.10 25 alu addsub alu bv unary minus
2.11 26 alu addsub alu addsub result correct
2.12 26 alu addsub alu addsub ovf correct
2.13 26 alu addsub alu addsub neg correct
2.14 26 dlxalu imp alu correct
A.3 A Sequential Implementation Machine
Th. Page File Name
Conv. 3.1 39 pipetheory pipe stall correct
3.2 39 pipetheory pipe sequential full
3.3 57 bjtaken impl bjtaken imp correct
3.4 58 nextpc impl nextpc imp correct
3.5 61 pipetheory sched sequential lemma1
3.6 62 pipetheory sched sequential lemma2
3.7 62 pipetheory sched sequential lemma3
3.8 62 pipetheory sched sequential lemma4
Inv. 3.1 65 pipetheory sched lemma1
Inv. 3.2 65 pipetheory sched lemma2
Inv. 3.3 65 pipetheory sched lemma3
3.9 69 pipetheory full bit lemma
3.10 69 pipetheory sched sequential lemma
3.11 71 pipetheory sched pipe start
3.18 85 dlxs lemmas dlxs correct
3.19 86 live calculus weakEafter is strongEafter
3.20 87 pipetheory ue is live IS lem
3.21 87 pipetheory ue is live IS2
3.22 88 pipetheory ue is live seq
3.23 88 pipetheory ue sI lemma
3.24 88 pipetheory Machine is live
A.4 Pipelined Machines
Th. Page File Name
4.1 95 pipetheory pipe full def
4.3 96 pipetheory sched overwrite
4.4 96 pipetheory sched full bits save
4.5 97 pipetheory sched clear full bits
4.6 97 pipetheory sched pipe start
4.15 131 live calculus2 stays until gt
4.16 132 live calculus2 stays until impl lem
4.17 132 live calculus2 AND stays until
4.18 133 live calculus2 weakEafter and stays until
IMPLIES weakEafter
4.19 134 live calculus2 finite false and stays until
IMPLIES finite false
4.20 134 live calculus2 finite true and stays until
IMPLIES finite true
4.21 134 live calculus2 AND weakEafter and stays until
4.22 136 live calculus2 AND finite false and stays until
4.23 136 live calculus2 OR finite true and stays until
4.25 136 live calculus2 never is finite true and stays until
4.24 136 live calculus2 always is finite false and stays until
4.27 137 live calculus OR finite false
4.28 137 live calculus AND finite true
4.29 137 live calculus2 finite true OR finite true
and stays until lem
4.30 138 pipetheory pipe drain lem
4.35 144 pipetheory pipe stall is finite true
A.5 Speculative Execution
Th. Page File Name
5.1 152 pipetheory spec pipe full def
5.2 152 pipetheory spec sched overwrite
5.3 153 pipetheory spec sched full bits save
5.4 153 pipetheory spec sched clear full bits
5.5 153 pipetheory spec sched pipe start
5.11 181 spec theory spec premise1
5.12 181 spec theory spec premise2
5.13 181 spec theory spec premise3
5.14 182 spec theory spec premise4
5.15 183 spec theory spec premise5
5.16 183 spec theory spec premise6
5.17 183 spec theory spec premise7
5.18 183 spec theory rollback stage exists
5.19 185 spec theory spec max lemma
5.20 185 spec theory spec full lemma
5.21 186 spec theory stage spec correct lem
5.22 186 spec theory spec correct inputs lemma
5.23 186 spec theory spec correct spec inputs lemma
5.24 187 spec theory spec misspec step
5.25 188 spec theory spec data consistency lemma2
5.26 188 spec theory spec data consistency lemma1
5.27 189 spec theory spec data consistency1
5.28 189 spec theory spec data consistency2
5.29 190 spec theory spec data consistency3
5.30 190 spec theory spec inv hold
5.31 192 spec theory M max exists
5.32 192 spec theory live M0 lem
5.33 193 spec theory M is full
5.34 193 spec theory g M not full
5.35 193 spec theory M implies below empty
5.36 193 spec theory M lemma0
5.37 193 spec theory M lemma1
5.38 195 spec theory rollback correct
5.39 196 spec theory M rollback
5.40 196 spec theory sched rollback lem4
5.41 197 spec theory spec premise8
5.42 198 spec theory spec premise9
5.43 200 spec theory ue M lemma
5.44 200 spec theory M stall is finite true lem
5.45 201 spec theory spec ue is live IS lem
5.46 201 spec theory stage M is live lem
5.47 202 spec theory spec le M lem
5.48 202 spec theory M correct inputs lemma
5.49 202 spec theory mc premise spec
5.50 203 spec theory mc premise
5.51 204 spec theory live Mc lem1
5.52 204 spec theory spec ue sI lemma
5.53 204 spec theory live Mc lem2
5.54 205 spec theory live Mc lem3
5.55 206 spec theory live Mc lem4
5.56 206 spec theory live no Mc lem2
5.57 207 spec theory live no Mc lem3
5.58 208 spec theory live is M lem2
5.59 208 spec theory live sI M is laststage lem
5.60 209 spec theory live is M lem1
5.61 210 spec theory live sI exists IS
5.62 210 spec theory live sI exists lem
5.64 218 interrupts dlx MCA correct
5.65 218 interrupts dlx JISR correct
5.66 219 interrupts dlx repeat correct
5.67 222 dlxP f correct dlxP c0 IR correct
5.68 224 nextpc impl nextpci imp correct
5.69 224 dlxP f correct dlxP c1 nextpc correct
5.70 224 dlxP f correct dlxP f1 PCp correct
5.71 225 dlxP f correct dlxP f1 DPC correct
5.72 230 dlxP f correct dlxP c3 MCA correct
5.73 231 dlxP f correct dlxP f3 JISR correct
5.74 232 dlxP f correct dlxP f3 repeat correct
5.75 232 dlxP f correct dlxP c3 IR correct
5.76 234 dlxP f correct dlxP c3 C correct
5.77 234 dlxP f correct dlxP f4 SR correct
5.78 236 dlxP f correct dlxP f4 ESR correct
5.79 236 dlxP f correct dlxP f4 ECA correct
5.80 237 dlxP f correct dlxP f4 EPC correct
5.81 238 dlxP f correct dlxP f4 EDPC correct
5.82 239 dlxP f correct dlxP f4 EDATA correct
A.6 Out-of-Order Execution
Th. Page File Name
6.1 262 robtheory ROBtail inv
6.2 262 robtheory ROBhead inv
6.3 262 robtheory ROBcount inv
6.4 263 robtheory instr in rob lemma
6.5 264 robtheory min ROB
6.6 264 robtheory max ROB
6.7 265 robtheory sI issue ge sI writeback
6.8 265 robtheory sI issue gt sI writeback
6.9 265 robtheory sI issue ge sI writeback2
6.10 265 robtheory I tag issue lemma
6.11 266 robtheory ROB lemma
6.12 266 robtheory issued correct
6.13 267 robtheory in order issue aux0
6.14 267 robtheory in order issue
6.15 268 robtheory tag inc sum
6.16 268 robtheory ROBtail diff lemma
6.17 268 robtheory tag inc lemma aux
6.18 268 robtheory tag inc lemma
6.19 269 robtheory ROB invariant
6.20 270 robtheory ROB count lemma
6.21 270 robtheory tag unique lemma
6.22 271 robtheory tag unique lemma2
6.23 271 robtheory I tag writeback lemma
6.24 271 robtheory sI writeback lem
6.27 275 tomistate state unique
6.28 275 tomcorrect tag RS correct
6.29 276 tomcorrect tag P correct
6.30 276 tomcorrect tag CDB correct
6.31 278 tomspec not instr has dest lemma
6.32 278 tomspec ldef before
6.33 279 tomspec last has dest lemma
6.34 279 tomspec last has not dest lemma aux
6.35 279 tomspec last prev has not dest lemma
6.36 279 tomspec last prev has dest lemma
6.37 279 tomspec last lemma aux
6.38 280 tomspec last lemma
6.39 280 tomcorrect issue lemma1
6.40 281 tomcorrect prod tag inv
6.41 282 tomcorrect rs tag inv aux1
6.42 283 tomcorrect rs tag inv
6.43 283 tomcorrect tag unique R
6.44 285 tomcorrect tag unique RS
6.45 285 tomcorrect RS op lemma
6.46 286 tomcorrect tag unique RS op
6.47 286 tomcorrect tag unique P
6.48 287 tomcorrect tag unique CDB
6.49 288 tomcorrect inv P data IMPL inv CDB data
6.50 289 tomcorrect inv CDB data IMPL inv ROB valid
6.51 290 tomcorrect inv ROB valid IMPL inv R valid
6.52 291 tomcorrect inv RS valid IMPL inv P valid
6.53 292 tomcorrect CDB tag lemma
6.54 292 tomcorrect inv RS valid proof read issue
6.55 293 tomcorrect inv RS valid proof read snoop
6.56 295 tomcorrect inv RS valid proof
6.57 295 tomcorrect tom inv
6.58 295 tomcorrect data consistency
6.60 297 tomlive has state
6.61 297 tomlive liveness step in ROB
6.62 298 tomlive stays in P
6.63 299 tomlive liveness step in P
6.64 300 tomlive liveness step in FU
6.65 300 tomlive stays in RS
6.66 301 tomlive liveness step in RS
6.67 301 tomlive liveness step not issued
6.68 303 tomlive liveness step
6.69 303 tomlive liveness
Appendix B
DLX Instruction Set
This instruction set is taken from [MP95, MP00] with minimal modifica-
tions. The architecture was defined in [HP96]. A reference for the instruc-
tion formats and mnemonics is also [SK96].
[Figure: field layouts of the five instruction formats. I-type: Opcode,
RS1, RD, Immediate. R-type: Opcode, RS1, RS2, RD, SA, Function. J-type:
Opcode, PC Offset. FI-type: Opcode, RS1, FD, Immediate. FR-type: Opcode,
FS1, FS2/RS2, FD, 00, Fmt, Function.]
Figure B.1 Instruction formats of the DLX
IR[31 : 26] Mnem. d Effect
Data Transfer, mem = M[RS1 + imm]
100000 0x20 lb 1 RD=Sext(mem)
100001 0x21 lh 2 RD=Sext(mem)
100011 0x23 lw 4 RD=mem
100100 0x24 lbu 1 RD=$0^{24}$mem
100101 0x25 lhu 2 RD=$0^{16}$mem
101000 0x28 sb 1 mem=RD[7 : 0]
101001 0x29 sh 2 mem=RD[15 : 0]
101011 0x2b sw 4 mem=RD
Arithmetic, Logical Operation
001000 0x08 addi RD=RS1 + imm
001001 0x09 addiu RD=RS1 + imm (no overflow)
001010 0x0a subi RD=RS1 - imm
001011 0x0b subiu RD=RS1 - imm (no overflow)
001100 0x0c andi RD=RS1 $\wedge$ imm
001101 0x0d ori RD=RS1 $\vee$ imm
001110 0x0e xori RD=RS1 $\oplus$ imm
001111 0x0f lhgi RD=imm $0^{16}$
Test Set Operation
011000 0x18 clri RD=(false ? 1 : 0)
011001 0x19 sgri RD=(RS1 > imm ? 1 : 0)
011010 0x1a seqi RD=(RS1 = imm ? 1 : 0)
011011 0x1b sgei RD=(RS1 $\geq$ imm ? 1 : 0)
011100 0x1c slsi RD=(RS1 < imm ? 1 : 0)
011101 0x1d snei RD=(RS1 $\neq$ imm ? 1 : 0)
011110 0x1e slei RD=(RS1 $\leq$ imm ? 1 : 0)
011111 0x1f seti RD=( true ? 1 : 0)
Control Operation
000100 0x04 beqz PC=PC+4+(RS1 = 0 ? imm: 0)
000101 0x05 bnez PC=PC+4+(RS1 $\neq$ 0 ? imm: 0)
010110 0x16 jr PC=RS1
010111 0x17 jalr R31=PC+4; PC = RS1
Table B.1 I-type instruction layout
IR[31 : 26] IR[5 : 0] Mnem. Effect
Shift Operation
000000 0x00 000000 0x00 slli RD=RS1<<SA
000000 0x00 000001 0x01 slai RD=RS1<<SA (arith.)
000000 0x00 000010 0x02 srli RD=RS1>>SA
000000 0x00 000011 0x03 srai RD=RS1>>SA (arith.)
000000 0x00 000100 0x04 sll RD=RS1<<RS2[4:0]
000000 0x00 000101 0x05 sla RD=RS1<<RS2[4:0] (ar.)
000000 0x00 000110 0x06 srl RD=RS1>>RS2[4:0]
000000 0x00 000111 0x07 sra RD=RS1>>RS2[4:0] (ar.)
Data Transfer
000000 0x00 010000 0x10 movs2i RD=SA
000000 0x00 010001 0x11 movi2s SA=RS1
Arithmetic, Logical Operation
000000 0x00 100000 0x20 add RD=RS1+RS2
000000 0x00 100001 0x21 addu RD=RS1+RS2 (no overfl.)
000000 0x00 100010 0x22 sub RD=RS1-RS2
000000 0x00 100011 0x23 subu RD=RS1-RS2 (no overfl.)
000000 0x00 100100 0x24 and RD=RS1 $\wedge$ RS2
000000 0x00 100101 0x25 or RD=RS1 $\vee$ RS2
000000 0x00 100110 0x26 xor RD=RS1 $\oplus$ RS2
000000 0x00 100111 0x27 lhg RD=RS2[15:0] $0^{16}$
Test Set Operation
000000 0x00 101000 0x28 clr RD=( false ? 1 : 0)
000000 0x00 101001 0x29 sgr RD=(RS1 > RS2 ? 1 : 0)
000000 0x00 101010 0x2a seq RD=(RS1 = RS2 ? 1 : 0)
000000 0x00 101011 0x2b sge RD=(RS1 $\geq$ RS2 ? 1 : 0)
000000 0x00 101100 0x2c sls RD=(RS1 < RS2 ? 1 : 0)
000000 0x00 101101 0x2d sne RD=(RS1 $\neq$ RS2 ? 1 : 0)
000000 0x00 101110 0x2e sle RD=(RS1 $\leq$ RS2 ? 1 : 0)
000000 0x00 101111 0x2f set RD=( true ? 1 : 0)
Table B.2 R-type instruction layout
IR[31 : 26] Mnem. Effect
Control Operation
000010 0x02 j PC = PC + 4 + imm
000011 0x03 jal R31 = PC + 4; PC = PC + 4 + imm
111110 0x3e trap trap = 1; EDATA = imm;
111111 0x3f rfe SR = ESR; PC’ = EPC; DPC = EDPC
Table B.3 J-type instruction layout
IR[31 : 26] Mnem. d Effect
Load, Store
110001 0x31 load.s 4 FD[31 : 0] = mem
110101 0x35 load.d 8 FD[63 : 0] = mem
111001 0x39 store.s 4 m = FD[31 : 0]
111101 0x3d store.d 8 m = FD[63 : 0]
Control Operation
000110 0x06 fbeqz PC=PC+4+(FCC = 0 ? imm: 0)
000111 0x07 fbnez PC=PC+4+(FCC $\neq$ 0 ? imm: 0)
Table B.4 FI-type instruction layout
IR[31 : 26] IR[5 : 0] Fmt Mnem. Effect
Arithmetic and Compare Operations
010001 0x11 000000 0x00 fadd FD = FS1 + FS2
010001 0x11 000001 0x01 fsub FD = FS1 - FS2
010001 0x11 000010 0x02 fmul FD = FS1 * FS2
010001 0x11 000011 0x03 fdiv FD = FS1 / FS2
010001 0x11 000100 0x04 fneg FD = - FS1
010001 0x11 000101 0x05 fabs FD = abs(FS1)
010001 0x11 000110 0x06 fsqt FD = sqrt(FS1)
010001 0x11 000111 0x07 frem FD = rem(FS1, FS2)
010001 0x11 11c3c2c1c0 fc.cond FCC=(FS1 co FS2)
Data Transfer
010001 0x11 001000 0x08 000 fmov.s FD[31:0]=FS1[31:0]
010001 0x11 001000 0x08 001 fmov.d FD[63:0]=FS1[63:0]
010001 0x11 001001 0x09 mf2i RS = FS1[31:0]
010001 0x11 001010 0x0a mi2f FD[31:0] = RS
Conversion
010001 0x11 100000 0x20 001 cvt.s.d FD = cvt(FS1, s, d)
010001 0x11 100000 0x20 100 cvt.s.i FD = cvt(FS1, s, i)
010001 0x11 100001 0x21 000 cvt.d.s FD = cvt(FS1, d, s)
010001 0x11 100001 0x21 100 cvt.d.i FD = cvt(FS1, d, i)
010001 0x11 100100 0x24 000 cvt.i.s FD = cvt(FS1, i, s)
010001 0x11 100100 0x24 001 cvt.i.d FD = cvt(FS1, i, d)
Table B.5 FR-type instruction layout. Fmt=IR[8:6]
RM Symbol Rounding
00 RZ toward zero
01 RNE to next even
10 RPI toward +∞
11 RMI toward −∞
Bit Symbol Purpose
0 OVF overflow
1 UNF underflow
2 INX inexact result
3 DBZ divide by zero
4 INV invalid operation
Table B.6 Coding of the rounding mode RM and the interrupt flags IEEEf
IR[31 : 26] IR[5 : 0] Predicate Effect
100*** 0x04 I load load instructions
***000 0x00 I lb byte signed
***001 0x00 I lh halfword signed
***011 0x00 I lw full word
***100 0x00 I lbu byte unsigned
***101 0x00 I lhu halfword unsigned
1010** 0x0a I store store instructions
0*1*** 0x00 I ALUi i-type ALU instr.
0001** 0x01 I branch conditional branch
****1* 0x00 I branch f cc test FCC instead of RS1
*****0 0x00 I branch eq branch if equal
01011* 0x0b I jr jump register instr.
*****1 0x00 I link j/jr is a link instr.
00001* 0x01 I j jump instructions
111110 0x3e I trap trap instruction
111111 0x3f I r f e return from exception
000000 0x00 0000** 0x00 I shi f ti shift instr. with SA
000000 0x00 0001** 0x01 I shi f t shift instr.
000000 0x00 010000 0x10 I movs2i move sp. reg. to GPR
000000 0x00 010001 0x11 I movi2s move GPR to sp. reg.
000000 0x00 10**** 0x02 I ALU ALU instructions
Table B.7 The monomials of the predicates used to decode the instruction word
Appendix C
Performance of the Pipelined DLX
Short Pipeline
The first implementation uses a standard five stage pipeline as described
above. All simulations were made using a Pentium-like memory system,
i.e., a 16 KB split, two-way level one write-back cache with 32 bytes line
size and 4-1-1-1 bus bursts [Int95a, Int95b]. The cache uses LRU
replacement and read/write allocation.
As a workload, we used several benchmarks of the SPEC92 benchmark
suite [SPE91]. Table C.2 shows the benchmarks and performance results.
Instruction Latency Pipelined
addition, subtraction 5 full
conversion 3 full
multiplication 5 full
single precision division 17 five stages
double precision division 21 five stages
Table C.1 Latency of floating point instructions in cycles. Most floating point in-
structions can be executed fully pipelined; divisions and square root iterate except
for five stages.
Benchmark Type FP In. Dhaz before Dhaz after CPI before CPI after Sp.
008 espresso int92 0.0% 2.71% 1.84% 1.3567 1.3516 0.4%
013 spice2g6 fp92 1.6% 2.92% 2.07% 1.8678 1.8496 1.0%
015 doduc int92 8.7% 2.78% 2.11% 2.1027 2.0700 1.6%
022 li int92 0.0% 3.11% 2.64% 1.9056 1.8841 1.1%
023 eqntott int92 0.0% 4.09% 3.49% 1.6930 1.6769 1.0%
026 compress int92 0.0% 2.77% 2.20% 1.6212 1.6052 1.0%
034 mdljdp2 fp92 13.0% 1.83% 1.55% 1.6782 1.6654 0.8%
039 wave5 fp92 18.5% 3.21% 2.93% 1.8741 1.8414 1.8%
047 tomcatv fp92 31.1% 0.19% 0.16% 2.1999 2.1924 0.4%
048 ora fp92 27.6% 4.59% 3.45% 2.8024 2.7836 0.7%
052 alvinn int92 0.7% 3.58% 1.83% 2.3961 2.3520 1.9%
056 ear fp92 17.3% 2.02% 1.61% 2.1931 2.1804 0.6%
072 sc int92 0.0% 2.05% 1.55% 1.4530 1.4437 0.6%
077 mdljsp2 fp92 15.7% 1.26% 1.05% 1.6801 1.6709 0.6%
078 swm256 fp92 44.6% 1.61% 1.63% 2.1808 2.1445 1.7%
085 gcc int92 0.0% 4.74% 3.21% 2.3306 2.3042 1.1%
089 su2cor fp92 18.9% 2.20% 1.80% 2.9223 2.8949 0.9%
090 hydro2d fp92 14.6% 4.18% 4.00% 2.4236 2.3931 1.3%
093 nasa7 fp92 28.5% 0.23% 0.04% 2.0788 2.0786 0.0%
094 fpppp fp92 25.5% 2.58% 2.07% 3.1935 3.1269 2.1%
AVERAGE - - - - 2.0976 2.0755 1.6%
Table C.2 Experimental results gained using simulations on the short pipeline.
In the first column, the benchmark is given, the second column gives the type of
the program, the third column shows the percentage of floating point instructions.
The columns four and five show the percentage of cycles with data hazard stalls
before and after applying bubble removal. The last columns show the CPI values
and the total speedup.
In general, the performance depends on the number of multi-cycle
instructions, in particular floating point instructions. The table therefore gives
the percentage of floating point instructions in the execution stream simulated.
On floating point workloads, one can observe a speedup of up to two percent.
Most of the speedup results from reduced data hazards. The table therefore gives
the number of cycles spent idle in the decode stage because of data hazards
before and after adding bubble removal.
This performance gain might seem negligible. However, consider that
the hardware effort for this gain is just a couple of gates.
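The per-benchmark speedup values in Table C.2 appear to be consistent
with the ratio of the CPI before and after bubble removal, minus one.
The following sketch states this relation; the formula is inferred from
the table entries and is an assumption, and the function name is
hypothetical.

```python
def speedup_percent(cpi_before: float, cpi_after: float) -> float:
    """Relative speedup in percent, assuming equal instruction counts
    and cycle times before and after the optimization."""
    return (cpi_before / cpi_after - 1.0) * 100.0
```

For example, the CPI values 1.3567 and 1.3516 of 008 espresso yield
about 0.4 percent, and the values 3.1935 and 3.1269 of 094 fpppp yield
about 2.1 percent, matching the last column of the table.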
Benchmark CPI before CPI after Speedup
008 espresso 2.0853 2.0499 1.7%
013 spice2g6 2.3885 2.2735 5.1%
015 doduc 2.7613 2.5809 7.0%
022 li 2.6438 2.5121 5.2%
023 eqntott 2.2981 2.2024 4.3%
026 compress 2.0625 2.0061 2.8%
034 mdljdp2 2.0084 1.9108 5.1%
039 wave5 2.1896 1.9956 9.7%
047 tomcatv 2.6826 2.4668 8.7%
048 ora 3.2551 3.0940 5.2%
052 alvinn 2.9042 2.7288 6.4%
056 ear 2.8320 2.7181 4.2%
072 sc 2.2510 2.1875 2.9%
077 mdljsp2 2.0336 1.9429 4.7%
078 swm256 2.5566 2.4074 6.2%
085 gcc 2.8373 2.6987 5.1%
089 su2cor 3.4189 3.1873 7.3%
090 hydro2d 2.5153 2.3582 6.7%
093 nasa7 2.1982 2.1567 1.9%
094 fpppp 3.4060 3.0965 10.0%
AVERAGE 2.0976 2.0755 5.5%
Table C.3 Experimental results gained using simulations on the long pipeline
Long Pipeline
In order to achieve high clock frequencies, modern microprocessors feature
very long pipelines with up to twenty stages. However, longer pipelines
are also more sensitive to data hazards because of load instructions.
We therefore simulated a RISC pipeline with nine stages in total in order to
evaluate the effect of pipeline bubble removal. We expected the benefit of
pipeline bubble removal to increase with pipeline complexity.
All other parameters and the workload remain the same. Table C.3
shows the results. Not surprisingly, the CPI values rise. As expected, the
percentage speedup gained by the stall engine also increases. For individual
benchmarks, we see speedups of up to ten percent and an average of five
percent. However, the programs used for the nine stage pipeline are the
same as for the five stage pipeline, i.e., they are compiled (and therefore
optimized) for the five stage pipeline. Thus, we expect a lower CPI and a smaller
speedup for code compiled with optimizations for the nine stage pipeline.
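The speedup column of tables C.2 and C.3 appears consistent with the ratio of the two CPI values minus one, assuming that cycle time and instruction count are unchanged by bubble removal. The following Python sketch recomputes the column for three rows of table C.3; the function name `speedup_percent` is our own illustrative choice and not part of the simulator.

```python
def speedup_percent(cpi_before: float, cpi_after: float) -> float:
    """Relative speedup in percent, assuming equal cycle time and instruction count."""
    return (cpi_before / cpi_after - 1.0) * 100.0


# CPI values copied from table C.3 (long pipeline).
rows = {
    "008 espresso": (2.0853, 2.0499),
    "039 wave5": (2.1896, 1.9956),
    "094 fpppp": (3.4060, 3.0965),
}

for name, (before, after) in rows.items():
    print(f"{name}: {speedup_percent(before, after):.1f}%")
```

Rounded to one decimal place, the three results agree with the corresponding entries of the table.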
Appendix D
Liveness Verification using SMV
D.1 Introduction
The idea of model-checking [CE81b, CES86] is to check the complete set
of reachable states of a state transition system for a desired property, e.g.,
an invariant. This approach suffers from the state explosion problem,
since the state space grows exponentially with the number of variables.
However, one easily encapsulates the stall engine as a module with a
well-defined interface. In this module, the full bits are the only registers. All
other registers of the machine are not required for the stall engine. The
number of full bits is exactly the number of stages. Thus, there are five
one-bit registers in the stall engine for the DLX design discussed in the
previous chapters, and hence $2^5 = 32$ possible states. It therefore seems
feasible to verify properties of the stall engine for fixed size pipelines.
In the following, we will apply a well-known symbolic model-checking
system called SMV by Kenneth McMillan [McM93]. Symbolic model-checking
systems represent the state space as a Boolean formula. All operations
(property checking) are done on this formula, which usually is much
faster than just enumerating the reachable states.
An introduction to the hardware specification language used by SMV
is beyond the scope of this thesis. Thus, the specification of the stall engine
hardware in SMV language and a small introduction can be found in
appendix D. The specification of the liveness criterion in SMV is done using
temporal operators that are similar to those used in CTL (Computation
Tree Logic) [CE81b]¹.
Let $p$ be a time predicate. For the specification of the liveness property,
the following two temporal operators are used:
• The operator $\mathbf{F}\,p$ holds iff there is a cycle $T$ in the future such that
the predicate $p$ holds. The definition of the operator $\exists T$ in chapter
3 is identical to this definition.
• The operator $\mathbf{G}\,p$ holds iff the predicate $p$ holds for all future cycles.
Applying a CTL operator on a time predicate results in a time predicate.
Thus, the operators can be combined: For example, in SMV, $\mathbf{G}\,\mathbf{F}\,p$ denotes
a predicate $p$ that is finite true according to definition 3.4.
Thus, the assumption that the external stall signals are finite true is
denoted as follows in SMV language:
$\mathbf{G}\,\mathbf{F}\,\neg \mathit{ext}_k$ (D.1)
We furthermore assume that if all stages below stage $k$ are empty (i.e.,
not full), the data hazard signal of stage $k$ is off ($\mathit{below\_full}_k$ holds
iff at least one stage below stage $k$ is full):
$\mathbf{G}\,(\neg \mathit{below\_full}_k \Rightarrow \neg \mathit{dhaz}_k)$ (D.2)
Using these assumptions, SMV verifies the following property of the
stall engine logic for a fixed number of stages:
$\mathbf{G}\,\mathbf{F}\,\mathit{ue}_k$ (D.3)
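To make the meaning of a property of the form G F p concrete, the following Python sketch checks it on a small, explicitly enumerated transition system. This is not how SMV works internally (SMV operates symbolically), and the sketch ignores the assumptions (D.1) and (D.2). The functions `reachable`, `eg`, and `always_eventually` as well as the two toy systems are our own illustrative names. The check relies on the standard fixpoint view: assuming every state has at least one successor, G F p holds on all paths iff no reachable state starts an infinite path that avoids p forever.

```python
def reachable(init, succ):
    """All states reachable from the initial states (depth-first search)."""
    seen = set(init)
    stack = list(init)
    while stack:
        s = stack.pop()
        for t in succ[s]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def eg(states, succ, q):
    """Greatest fixpoint: states from which some infinite path stays in q."""
    z = {s for s in states if q(s)}
    while True:
        # Keep only states that can take one more step inside z.
        keep = {s for s in z if any(t in z for t in succ[s])}
        if keep == z:
            return z
        z = keep


def always_eventually(init, succ, p):
    """True iff on every path from init, p holds again and again."""
    states = reachable(init, succ)
    return not eg(states, succ, lambda s: not p(s))


# Toy system 1: a ring 0 -> 1 -> 2 -> 0; state 0 is visited infinitely often.
ring = {0: [1], 1: [2], 2: [0]}
# Toy system 2: state 1 may loop forever, so state 0 can be avoided.
stuck = {0: [1], 1: [1, 2], 2: [0]}

print(always_eventually([0], ring, lambda s: s == 0))
print(always_eventually([0], stuck, lambda s: s == 0))
```

The first call reports that the property holds, the second that it fails because of the self-loop on state 1. An explicit enumeration like this visits every reachable state, which is exactly what becomes infeasible as the number of state bits grows.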
However, the verification time grows exponentially in the number of
stages. Table D.1 shows experimental results on an AMD machine with
350 MHz. The liveness of the stall engine of the DLX pipeline with five
stages is verified in about one second; however, the run time becomes critical
with 10 stages and beyond. Commercial designs feature up to 30 stages,
which would result in an estimated run time of about 10,000 years. Moreover,
the author’s machine ran out of memory while model-checking the
stall engine with eleven or more stages.
¹ CTL is a subset of a more general temporal logic described in [CE81a], using the
syntax of [BMP81].
Stages BDD nodes Time [s]
1 14 0.04
2 850 0.07
3 4387 0.10
4 10027 0.34
5 25764 1.15
6 101598 6.12
7 265952 14.68
8 643058 46.97
9 1294767 242.43
10 5008999 833.76
Table D.1 Experimental results for verifying the liveness criterion of the stall
engine using SMV on an AMD machine with 350 MHz
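The exponential growth can be read off table D.1 by looking at the ratio of successive run times. The following Python sketch computes these ratios and their geometric mean; the variable names are our own, the numbers are copied from the table, and no attempt is made to extrapolate beyond ten stages.

```python
import math

# Run times in seconds from table D.1 for 1, 2, ..., 10 stages.
times = [0.04, 0.07, 0.10, 0.34, 1.15, 6.12, 14.68, 46.97, 242.43, 833.76]

# Growth factor of the run time when one more stage is added.
ratios = [later / earlier for earlier, later in zip(times, times[1:])]

# Geometric mean of the per-stage growth factors.
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

for stages, r in enumerate(ratios, start=2):
    print(f"{stages - 1} -> {stages} stages: factor {r:.2f}")
print(f"geometric mean growth per stage: factor {geo_mean:.2f}")
```

On these measurements, each additional stage multiplies the run time by roughly a factor of three on average, although the individual factors vary considerably.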
D.2 Using Induction
In this section, we try to speed up the verification manually. The first step
is to split the proof goal into n subgoals, one subgoal for the correctness
criterion of each stage. This both speeds up the verification and reduces
the memory consumption.
SMV supports a simple form of induction. This can be applied in analogy
to the proof presented in the previous section. By assuming the liveness
property for stage k+1, one simplifies the proof of the liveness property
for stage k. This speeds up the verification further and reduces the
memory consumption dramatically. The price for this is dropping the full
automation of the proof.
Table D.2 and figure D.1 show the run time for verifying the liveness
criterion for a stall engine with up to 18 stages using no induction and
using one stage induction.
Model-checking was initiated by Clarke and Emerson in [CE81b] and
[CES86]. Various authors improved the idea in order to handle larger state
spaces. In order to handle the state explosion problem, BDDs (binary decision
diagrams) were applied to model-checking [McM93]. Bryant's research
on BDD techniques [Bry86] is a major contribution to model-checking.
Recently, McMillan applied classical theorem proving
techniques to model-checking, e.g., in [McM98, McM99b].
Stages   no induction            1 stage induction
         BDD nodes   Time [s]    BDD nodes   Time [s]
1 14 0.04 14 0.04
2 237 0.07 237 0.08
3 701 0.16 701 0.16
4 1147 0.25 1147 0.24
5 2154 0.65 2154 0.38
6 10035 1.62 3842 0.63
7 14274 4.91 6747 1.32
8 27142 12.98 10046 2.78
9 36096 26.83 12413 4.25
10 51856 46.26 29375 7.36
11 70591 133.67 49935 15.93
12 98800 212.42 98857 22.70
13 139882 581.36 139945 102.35
14 213847 3096.31 213916 184.58
15 308379 10163.90 308454 386.58
16 444037 37322.40 444118 763.36
17 - - 662026 1729.37
18 - - 1044414 5416.22
Table D.2 Experimental results for verifying the liveness criterion of the stall
engine using SMV. The columns two and three contain the BDD node count and
the runtime in seconds for verifying the liveness criterion for all stages separately
using no induction. The columns four and five contain the BDD node count and
time for verifying using induction.
[Figure: run time in seconds (logarithmic axis from 0.01 to 100000) plotted over the number of stages (1 to 16), with one curve for "no induction" and one for "one stage induction"]
Figure D.1 Visualization of the experimental results for verifying the stall engine
using SMV in table D.2
Bibliography
[AS99] Arvind and Xiaowei Shen. Using term rewriting systems to design
and verify processors. IEEE Micro Special Issue on Modeling and
Validation of Microprocessors, 19(3):36–46, May/June 1999.
[BBCZ98] S. Berezin, A. Biere, E. Clarke, and Y. Zhu. Combining symbolic
model checking with uninterpreted functions for out-of-order pro-
cessor verification. In G. Gopalakrishnan and P. Windley, editors,
Formal Methods in Computer-Aided Design (FMCAD), volume 1522
of Lecture Notes in Computer Science, pages 369–386. Springer-
Verlag, 1998.
[BCRZ99] A. Biere, E. Clarke, E. Raimi, and Y. Zhu. Verifying safety prop-
erties of a PowerPC microprocessor using symbolic model checking
without BDDs. In Nicolas Halbwachs and Doron Peled, editors, Pro-
ceedings of the 11th International Conference on Computer Aided
Verification (CAV’99), volume 1633 of Lecture Notes in Computer
Science, pages 60–71. Springer-Verlag, 1999.
[BD94] Jerry R. Burch and David L. Dill. Automatic verification of pipelined
microprocessors control. In David L. Dill, editor, Proceedings of
the sixth International Conference on Computer-Aided Verification
(CAV’94), volume 818 of Lecture Notes in Computer Science, pages
68–80. Springer-Verlag, 1994.
[BDL98] Clark W. Barrett, David L. Dill, and Jeremy R. Levitt. A decision
procedure for bit-vector arithmetic. In Proceedings of ACM/IEEE
Design Automation Conference (DAC’98), pages 522–527. ACM
Press, 1998.
[Ber01] Christoph Berg. Verification of an IEEE floating point adder (draft).
Master’s thesis, Universität des Saarlandes, FB. Informatik, 2001.
[Bey01] Sven Beyer. Verification of a microprocessor’s memory interface
(Draft). PhD thesis, University of Saarland, Computer Science De-
partment, 2001.
[BJK01] Christoph Berg, Christian Jacobi, and Daniel Kroening. Formal ver-
ification of a basic circuits library. In Proc. 19th IASTED Inter-
national Conference on Applied Informatics, Innsbruck (AI’2001),
pages 252–255. ACTA Press, 2001.
[BM96] E. Boerger and S. Mazzanti. A practical method for rigorously con-
trollable hardware design. In J.P Bowen, M.B. Hinchey, and D. Till,
editors, ZUM’97: The Z Formal Specification Notation, volume 1212
of Lecture Notes in Computer Science, pages 151–187. Springer-
Verlag, 1996.
[BMP81] M. Ben-Ari, Z. Manna, and A. Pnueli. The temporal logic of branching
time. In Conference Record of the Eighth Annual ACM Symposium
on Principles of Programming Languages (POPL ’81), pages
164–176. ACM Press, Jan 1981.
[Bry86] Randal E. Bryant. Graph-based algorithms for boolean function ma-
nipulation. IEEE Transactions on Computers, C-35(8):677–691, Au-
gust 1986.
[BS89] M. Bickford and M. Srivas. Verification of a pipelined microproces-
sor using CLIO. In Proceedings of Workshop on Hardware Speci-
fication, Verification and Synthesis: Mathematical Aspects, volume
408 of Lecture Notes in Computer Science. Springer-Verlag, 1989.
[Bur91] Jerry R. Burch. Using BDDs to verify multipliers. In Proceedings
of the 28th ACM/IEEE Design Automation Conference (DAC’91),
pages 408–412, New York, 1991. ACM Press.
[CE81a] Edmund Clarke and Allen Emerson. Characterizing properties of
parallel programs as fixpoints. In 7th International Colloquium
on Automata, Languages and Programming, volume 85 of Lecture
Notes in Computer Science. Springer-Verlag, 1981.
[CE81b] Edmund Clarke and Allen Emerson. Synthesis of synchronization
skeletons for branching time temporal logic. In Logic of Programs:
Workshop, Yorktown Heights, volume 131 of Lecture Notes
in Computer Science. Springer-Verlag, 1981.
[CES86] Edmund Clarke, Allen Emerson, and A. P. Sistla. Automatic verifica-
tion of finite-state concurrent systems using temporal logic specifica-
tions. ACM Transactions on Programming Languages and Systems,
8(2):244–263, 1986.
[CGM86] A. Camilleri, M. Gordon, and T. Melham. Hardware verification
using higher order logic. In From HDL Descriptions to Guaranteed
Correct Circuit Designs, pages 41–66. North-Holland, 1986.
[CHYP94] Po-Yung Chang, Eric Hao, Tse-Yu Yeh, and Yale N. Patt. Branch
classification: a new mechanism for improving branch predictor performance.
In Proc. of the 27th Annual International Symposium on
Microarchitecture, pages 22–31, 1994.
[Coe95] Tim Coe. Inside the Pentium FDIV bug. Dr. Dobb’s Journal of
Software Tools, 20(4), Apr 1995.
[Coh87] Avra J. Cohn. A proof of correctness of the VIPER microproces-
sor: The first level. In Graham Birtwistle and P.A. Subrahmanyam,
editors, VLSI Specification, Verification and Synthesis, pages 27–71.
Kluwer Academic Publishers, 1987.
[CRSS94] D. Cyrluk, S. Rajan, N. Shankar, and M. K. Srivas. Effective theorem
proving for hardware verification. In 2nd International Conference
on Theorem Provers in Circuit Design, volume 901 of Lecture Notes
in Computer Science, pages 203–222. Springer-Verlag, 1994.
[CS95] Robert P. Colwell and Randy L. Steck. A 0.6µm BiCMOS processor
employing dynamic execution. International Solid State Circuits
Conference (ISSCC), 1995.
[Cyr93] David Cyrluk. Microprocessor verification in PVS: A methodology
and simple example. Technical Report SRI-CSL-93-12, SRI Com-
puter Science Laboratory, 1993.
[Del98] Peter Dell. Die Auswirkung von Mechanismen zur out-of-order
Ausführung auf den Cyclecount von RISC-Architekturen. Master’s
thesis, Universität des Saarlandes, FB. Informatik, 1998.
[DP97] W. Damm and A. Pnueli. Verifying out-of-order executions. In H.F.
Li and D.K. Probst, editors, Advances in Hardware Design and Verification:
IFIP WG 10.5 International Conference on Correct Hardware
Design and Verification Methods (CHARME), pages 23–47.
Chapman & Hall, 1997.
[FFK88] M. Fujita, H. Fujisawa, and N. Kawato. Evaluation and improve-
ments of boolean comparison method based on binary decision di-
agrams. In International Conference on Computer-Aided Design,
pages 2–5. IEEE Computer Society Press, 1988.
[Fly95] Michael Flynn. Computer Architecture: Pipelined and Parallel Pro-
cessor Design. Jones & Bartlett, 1995.
[Gau95] Thilo Gaul. An abstract state machine specification of the DEC-
Alpha processor family. Technical Report [Verifix/UKA/4], Univer-
sity of Karlsruhe, 1995.
[Ger98] N. Gerteis. The performance impact of precise interrupt handling on
a RISC processor (German). Master’s thesis, University of Saarland,
Computer Science Department, Germany, 1998.
[Hew94] Hewlett Packard. PA-RISC 1.1 Architecture Reference Manual,
1994.
[HGS99] Ravi Hosabettu, Ganesh Gopalakrishnan, and Mandayam Srivas. A
proof of correctness of a processor implementing Tomasulo’s al-
gorithm without a reorder buffer. In Laurence Pierre and Thomas
Kropf, editors, Correct Hardware Design and Verification Methods:
IFIP WG 10.5 Advanced Research Working Conference, CHARME
’99, pages 8–316. Springer-Verlag, 1999.
[HGS00] Ravi Hosabettu, Ganesh Gopalakrishnan, and Mandayam Srivas.
Verifying advanced microarchitectures that support speculation and
exceptions. In Allen Emerson and A. P. Sistla, editors, Proceedings
of the 12th International Conference on Computer Aided Verification
(CAV 2000), volume 1855 of Lecture Notes in Computer Science.
Springer-Verlag, 2000.
[Hil00] Mark Hillebrand. Design and evaluation of a superscalar RISC processor.
Master’s thesis, Universität des Saarlandes, FB. Informatik,
Saarbrücken, 2000.
[Hos00] Ravi Hosabettu. Systematic Verification of Pipelined Microproces-
sors. PhD thesis, University of Utah, Department of Computer Sci-
ence, 2000.
[HP96] John L. Hennessy and David A. Patterson. Computer Architecture:
A Quantitative Approach. Morgan Kaufmann Publishers, INC., San
Mateo, CA, 2nd edition, 1996.
[HQR98] Thomas A. Henzinger, Shaz Qadeer, and Sriram K. Rajamani. You
assume, we guarantee: Methodology and case studies. In Proceed-
ings of the 10th International Conference on Computer-aided Veri-
fication (CAV), volume 1427 of Lecture Notes in Computer Science,
pages 440–451. Springer-Verlag, 1998.
[Hun94] Warren A. Hunt. FM8501, a verified microprocessor, volume 795
of Lecture Notes in Artificial Intelligence and Lecture Notes in Com-
puter Science. Springer-Verlag, 1994.
[IEE85] Institute of Electrical and Electronics Engineers. ANSI/IEEE stan-
dard 754–1985, IEEE Standard for Binary Floating-Point Arith-
metic, 1985.
[Int95a] Intel Corporation. 2430FX PCIset Datasheet 82437FX System Con-
troller (TSC) and 82438FX Data Path Unit (TDP), 1995.
[Int95b] Intel Corporation. Pentium Processor Family Developer’s Manual,
Vol. 1-3, 1995.
[Jac01] Christian Jacobi. Formal Verification of a fully IEEE compliant
Floating Point Unit (Draft). PhD thesis, University of Saarland,
Computer Science Department, 2001.
[JNFSV97] Jawahar Jain, Amit Narayan, M. Fujita, and A. Sangiovanni-
Vincentelli. A survey of techniques for formal verification of com-
binational circuits. In International Conference on Computer De-
sign: VLSI in Computers and Processors (ICCD ’97), pages 445–
454. IEEE Society Press, 1997.
[Joy88a] Jeffrey J. Joyce. Formal specification and verification of micro-
processor systems. Microprocessing & Microprogramming, 24(1-
5):371–8, 1988.
[Joy88b] Jeffrey J. Joyce. Formal verification and implementation of a micro-
processor. In G. Birtwistle and P.A. Subrahmanyam, editors, VLSI
Specification, Verification and Synthesis, pages 129–158. Kluwer
Academic Publishers, 1988.
[KH92] Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice
Hall, 1992.
[KM96] Matt Kaufmann and J. S. Moore. ACL2: An industrial strength ver-
sion of nqthm. In Proc. of the Eleventh Annual Conference on Com-
puter Assurance, pages 23–34. IEEE Computer Society Press, 1996.
[KMP99] Daniel Kroening, Silvia M. Mueller, and Wolfgang Paul. A rigorous
correctness proof of the Tomasulo scheduling algorithm with precise
interrupts. In Proc. of the SCI’99/ISAS’99 International Conference,
1999.
[KP95] Jörg Keller and Wolfgang J. Paul. Hardware Design - Formaler Entwurf
Digitaler Schaltungen. TEUBNER, Stuttgart, Leipzig, 1995.
[KP96] Y. Kesten and A. Pnueli. An αSTS-based common semantics for
SIGNAL, STATECHART, DC+, and C. Technical report, Dept. of
Computer Science, Weizmann Institute, March 1996.
[KPM00] Daniel Kroening, Wolfgang J. Paul, and Silvia M. Mueller. Prov-
ing the correctness of pipelined micro-architectures. In Klaus Wald-
schmidt and Christoph Grimm, editors, Proc. of the ITG/GI/GMM-
Workshop Methoden und Beschreibungssprachen zur Modellierung
und Verifikation von Schaltungen und Systemen, pages 89–98. VDE
Verlag, 2000.
[Krö99] Daniel Kröning. Design and evaluation of a RISC processor with a
Tomasulo scheduler. Master’s thesis, University of Saarland, Com-
puter Science Department, Germany, 1999.
[Lei99] Holger Leister. Quantitative Analysis of Precise Interrupt Mecha-
nism for Processors with Out-Of-Order Execution. PhD thesis, Uni-
versity of Saarland, Computer Science Department, 1999.
[LF80] Richard E. Ladner and Michael J. Fischer. Parallel prefix computa-
tion. Journal of the ACM, 27(4):831–838, 1980.
[LO96] Jeremy Levitt and Kunle Olukotun. A scalable formal verification
methodology for pipelined microprocessors. In 33rd Design Au-
tomation Conference (DAC’96), pages 558–563, New York, 1996.
Association for Computing Machinery.
[LS84] Johnny K. F. Lee and Alan J. Smith. Branch prediction strategies and
branch target buffer design. Computer, 17(1):6–22, January 1984.
[McM93] Kenneth L. McMillan. Symbolic Model Checking. Kluwer, 1993.
[McM98] Kenneth L. McMillan. Verification of an implementation of Toma-
sulo’s algorithm by composition model checking. In Proc. 10th In-
ternational Conference on Computer Aided Verification, pages 110–
121, 1998.
[McM99a] Kenneth L. McMillan. Circular compositional reasoning about
liveness. In Laurence Pierre and Thomas Kropf, editors, Correct
Hardware Design and Verification Methods: IFIP WG 10.5 Ad-
vanced Research Working Conference, CHARME ’99, pages 342–
345. Springer-Verlag, 1999.
[McM99b] Kenneth L. McMillan. Verification of infinite state systems by compositional
model checking. In Correct Hardware Design and Verification
Methods: IFIP WG 10.5 International Conference on Correct
Hardware Design and Verification Methods (CHARME), volume
1703 of Lecture Notes in Computer Science, pages 219–233.
Springer-Verlag, 1999.
[Min95] Manfred Minimair. Design, analysis and implementation of an adder
by Ladner and Fisher. Technical Report 95-15, RISC-Linz, Johannes
Kepler University, Linz, Austria, 1995.
[MLD+99] Silvia M. Mueller, Holger Leister, Peter Dell, Nikolaus Gerteis,
and Daniel Kroening. The impact of hardware scheduling mecha-
nisms on the performance and cost of processor designs. In Proc.
of the 15th GI/ITG Conference ’Architektur von Rechensystemen’
ARCS’99, pages 65–73. VDE Verlag, 1999.
[Mot97] PowerPC 750 RISC Microprocessor Technical Summary, 1997.
[MP95] Silvia M. Müller and Wolfgang J. Paul. The Complexity of Simple
Computer Architectures. Lecture Notes in Computer Science 995.
Springer-Verlag, 1995.
[MP96] S. Müller and W. Paul. Making the original scoreboard mechanism
deadlock free. In Proc. 4th Israeli Symposium on Theory of Com-
puting and Systems (ISTCS), pages 92–99. IEEE Computer Society
Press, 1996.
[MP00] Silvia M. Müller and Wolfgang J. Paul. Computer Architecture:
Complexity and Correctness. Springer-Verlag, 2000.
[MPK00] Silvia M. Müller, Wolfgang Paul, and Daniel Kröning. Proving
the correctness of processors with delayed branch using delayed
PC. In I. Althoefer, N. Cai, G. Dueck, L. Khachatrian, M. Pinsker,
A. Sarkozy, I. Wegener, and Zhang Z., editors, Proc. Symposium
on Numbers, Information and Complexity, Bielefeld, pages 579–588.
Kluwer, 2000.
[Mül97] Silvia M. Müller. Complexity and correctness of computer architectures.
In Proc. 4th Workshop on Parallel Systems and Algorithms
(PASA’96), pages 125–146. World Scientific Publishing, 1997.
[PA96] J. Pihl and E. J. Aas. A multiplier and squarer generator for high
performance dsp applications. In Proceedings of the 39th Midwest
Symposium on Circuits and Systems, Iowa, 1996.
[PH94] David A. Patterson and John L. Hennessy. The Hardware/Software
Interface. Morgan Kaufmann Publishers, INC., San Mateo, CA,
1994.
[Pra95] Vaughan R. Pratt. Anatomy of the Pentium bug. In Peter D.
Mosses, Mogens Nielsen, and Michael I. Schwartzbach, editors,
TAPSOFT ’95: Theory and Practice of Software Development, vol-
ume 915 of Lecture Notes in Computer Science, pages 97–107.
Springer-Verlag, 1995.
[PS94] Dionisios N. Pnevmatikatos and Gurindar S. Sohi. Guarded execution
and branch prediction in dynamic ILP processors. In Proc. of the
21st Annual Symposium on Computer Architecture, pages 120–129,
1994.
[SGGH94] James B. Saxe, Stephen J. Garland, John V. Guttag, and James J.
Horning. Using transformations and verification in circuit design.
Formal Methods in System Design, 4(1):181–210, 1994.
[SH99] Jun Sawada and Warren A. Hunt. Results of the verification of a
complex pipelined machine model. In Laurence Pierre and Thomas
Kropf, editors, Correct Hardware Design and Verification Methods:
IFIP WG 10.5 Advanced Research Working Conference, CHARME
’99, pages 313–316. Springer-Verlag, 1999.
[SK96] Philip M. Sailer and David R. Kaeli. The DLX Instruction Set Archi-
tecture Handbook. Morgan Kaufmann, San Francisco, 1996.
[Smi81] James E. Smith. A study of branch prediction strategies. In Proceed-
ings of the 8th Annual Symposium on Computer Architecture, pages
135–148, 1981.
[SP88] James E. Smith and Andrew R. Pleszkun. Implementing precise in-
terrupts in pipelined processors. IEEE Transactions on Computers,
37(5):562–573, 1988.
[SPA92] SPARC International Inc. The SPARC Architecture Manual. Prentice
Hall, 1992.
[SPE91] SPEC Newsletter, Vol. 3, No. 4, 1991.
[Tom67] R.M. Tomasulo. An efficient algorithm for exploiting multiple arith-
metic units. IBM Journal of Research and Development, 11(1):25–
33, 1967.
[VB00] Miroslav N. Velev and Randal E. Bryant. Formal verification of su-
perscalar microprocessors with multicycle functional units, excep-
tions, and branch prediction. In Proceedings of ACM/IEEE De-
sign Automation Conference (DAC’00), pages 112–117. ACM Press,
2000.
[Win95] Phillip J. Windley. Formal modeling and verification of micropro-
cessors. IEEE Transactions on Computers, 44(1):54–72, 1995.
[Yeu84] Bik Chung Yeung. 8086/8088 Assembly Language Programming.
Wiley & Sons, 1984.
[YP92] Tse-Yu Yeh and Yale N. Patt. Alternative implementations of two-
level adaptive branch prediction. In Proc. of 19th Int. Sym. on Com-
puter Architecture, pages 124–134, 1992.
