Correctness of multi-core processors with operating system support by Lutsyk, Petro
Correctness of Multi-Core Processors
with Operating System Support
Dissertation
for the degree of
Doctor of Engineering
of the Faculty of Mathematics and Computer Science
at Saarland University (Universität des Saarlandes)
submitted by
Petro Lutsyk
Saarbrücken, October 2018
Date of the colloquium: 27 September 2018
Dean: Prof. Dr. Sebastian Hack
Chair of the examination committee: Prof. Dr. Verena Wolf
First reviewer: Prof. Dr. Wolfgang J. Paul
Second reviewer: Dr. Silvia Melitta Müller
Academic staff member: Dr. Jonas Oberhauser
Summary
In the course of supporting hypervisors, we verify a realistic pipelined multi-core processor
with an integrated mechanism for two-phase (nested) address translation. The nested translation
scheme is needed so that guests of the hypervisor (typically operating systems) can execute
their programs in translated mode. We consider the setup in which the operating systems run as
processes (in translated mode) of the hypervisor, which itself runs ‘on the bare hardware’,
i.e., without address translation.
The nested translation is performed by the nested memory management unit (MMU), with both
translation phases carried out in hardware. Both the specification and the implementation of
the nested MMU are presented in detail. We prove that the nested MMU correctly implements a
more general auxiliary specification which isolates the nested MMU from the rest of the
machine. The latter allows us to reduce the arguments for correctness of the MMU implementation
in any machine to a simple simulation between two software models.
The main contribution of this thesis is the complete correctness proof on paper for the
pipelined multi-core implementation of the MIPS-86 ISA, which was additionally extended to
support nested translation. As the name suggests, MIPS-86 combines the instruction set of MIPS
with the memory model of x86. First, we consider this extended MIPS-86 specification in the
sequential implementation, which serves to demonstrate the integration of the nested MMUs into
the MIPS processor and, for simplicity, is restricted to a single processor core. In the proof
of our main result, the correctness of the pipelined implementation, we refer to the sequential
case to show correctness of the MMU operation. This allows us to shift the focus to the
problems of pipelining the machine with speculative execution and interrupts, which have to be
taken into account when address translation is present.
Abstract
In the course of adding support for hypervisors, we verify a realistic pipelined multi-core
processor with an integrated mechanism for two-phase (nested) address translation. The nested
translation scheme is required to allow guests of the hypervisor (typically operating systems)
to execute their programs in translated mode. We consider the setup in which the operating
systems run as processes (in translated mode) of the hypervisor, which itself runs ‘on the bare
hardware’, i.e., without address translation.
The nested translation is performed by the nested memory management unit (MMU), with both
phases of translation performed in hardware. Both the specification and the implementation of
the nested MMU are presented in full detail. The nested MMU is proven to correctly implement
an auxiliary, more general specification which isolates the nested MMU from the rest of the
machine. The latter allows us to reduce arguments about the correctness of the MMU
implementation in any machine to a simple simulation between a pair of software models.
The main contribution of this thesis is the complete paper-and-pencil correctness proof for
the pipelined multi-core implementation of the MIPS-86 ISA, additionally extended to support
nested translation. As the name suggests, MIPS-86 combines the instruction set of MIPS
with the memory model of x86. First, we consider this extended MIPS-86 specification in the
sequential implementation, which serves to demonstrate the integration of the nested MMUs into
the MIPS processor and, for simplicity, is restricted to a single processor core. In the proof
of our main result, the correctness of the pipelined implementation, we refer to the sequential
case to show correctness of the MMU operation. This allows us to shift the focus towards
the problems of pipelining the machine with speculative execution and interrupts, which
must be considered in the presence of address translation.
Acknowledgments
First and foremost, I would like to thank Prof. Dr. Wolfgang J. Paul for the opportunity to study
computer architecture at his chair and to write a doctoral thesis under his supervision. Starting
from the very first lecture, about six years ago, it is hard to remember a meeting where he
did not try to share some of his experience, no matter whether it was in a lecture hall, on a
dance floor, or in his kitchen. I thank Prof. Paul for his contagious enthusiasm, endless
optimism, and patience. Needless to say, most of the mathematical culture necessary to write
this dissertation I learned from him, both through lectures and private lessons.
I also want to thank all my colleagues who have worked at the chair for maintaining a friendly
and creative atmosphere. In particular, I want to thank my friend Dr. Jonas Oberhauser for
always being ready to answer (many of) my questions and to help with advice, and for being a
responsive roommate and a great friend. I am also grateful to Jonas for giving me a hard time
in our scientific discussions, which in the end only motivated me to keep learning.
Finally, I want to thank my family and friends, who kept me motivated to complete this
dissertation. First, I thank my parents, who always believed in me and gave me the chance to
travel to Germany and prove myself here. I am also thankful to my brother and his family, who
took care of me from the first day of my stay in Germany. Last but not least, I thank my first,
and hopefully last, dancing partner for her support and patience during our whole time together.
Contents
Introduction
Part I Single-Core MIPS with Address Translation
1 Definitions and Notation
1.1 Sets, Sequences, and Records
1.2 Boolean Operators
1.3 Binary and Two’s Complement Numbers
1.4 Memory
1.4.1 Embedding
1.4.2 Sequential Semantics
2 Specification
2.1 Basic MIPS
2.1.1 Instruction Tables
2.1.2 Configuration and Instruction Fields
2.1.3 Instruction Decoding
2.1.4 Processor-Local Operations
2.1.5 Memory Operations
2.2 Summary
2.2.1 MIPS ISA
2.2.2 Software Conditions
2.2.3 Accesses of ISA
2.3 Interrupt Mechanism
2.3.1 Special Purpose Registers Revisited
2.3.2 Types of Interrupts
2.3.3 MIPS ISA with Interrupts
2.3.4 Specification of Most Internal Interrupt Event Signals
2.3.5 Accesses of ISA Revisited
2.4 Multi-Level Address Translation
2.4.1 Page Tables
2.4.2 Walks and Translation Requests
2.4.3 MIPS ISA with Address Translation
2.4.4 TLB Steps
2.4.5 Processor Core Steps
Part II Nested Address Translation (NAT)
3 Introduction and Specification
3.1 Virtualization
3.1.1 Translation Modes
3.1.2 Intercept Mechanism
3.1.3 Universal Addressing
3.2 Introduction to NAT
3.2.1 Ideas behind Nested Translation
3.2.2 Composition of Walks
3.2.3 Decomposition of Nested Walks
3.2.4 Overloading Notation
3.3 MIPS ISA with NAT
3.3.1 TLB Component
3.3.2 Translation Accesses
3.3.3 Faults of NAT
3.3.4 Processor Core
3.3.5 Execution of Instructions
3.3.6 Overloading Instructions
3.3.7 Interrupt Mechanism
3.4 Simplified Semantics
3.4.1 General Semantics for TLB
3.4.2 Added, Dropped, and Ragged Walks
3.4.3 TLB Equivalence
4 Implementation of Nested MMU
4.1 Redesigning TLB
4.1.1 Specification
4.1.2 Construction
4.2 Redesigning MMU
4.2.1 Specification
4.2.2 Construction
4.3 Liveness
4.3.1 Simple Translations
4.3.2 Nested Translations
5 Correctness of NAT Implementation
5.1 Accessing MMU
5.2 Correctness Statement
5.2.1 Stepping of TLB
5.2.2 Simulation of TLB
5.2.3 Simulation Theorem
5.3 Developing Formalism
5.3.1 Coverage of Hardware Walks
5.3.2 Dropping Translations
5.3.3 Adding Translations
5.4 Correctness Proof
5.4.1 Void Queries
5.4.2 Translation Queries
5.4.3 Invalidation Queries
Part III Single-Core MIPS with NAT
6 Sequential Processor with Nested MMUs
6.1 Sequential Processor
6.1.1 Control Logic
6.1.2 Collecting Interrupts
6.1.3 Implementation Registers
6.1.4 Connecting Components
6.1.5 Execution Levels and Intercepts
6.2 Cache Memory System in Sequential Processor
6.2.1 Connections to Caches
6.2.2 Stability of Inputs to Caches
6.2.3 Accesses of Hardware Computation
6.2.4 Relating Endings of Accesses with Hardware Control Signals
6.3 Liveness
6.3.1 Liveness of Control States
6.3.2 Uniqueness of Finish Cycles
7 Correctness of Sequential Implementation
7.1 Correctness Statement
7.1.1 Stepping of Components
7.1.2 Software Conditions
7.1.3 Simulation Theorem
7.1.4 Stepping Inputs
7.2 Developing Formalism
7.2.1 Scheduling Function
7.2.2 Relating Global Steps with Scheduling Function
7.3 Correctness Proof
7.3.1 Interrupt Processing
7.3.2 Instruction Execution
7.3.3 Implementation Registers
7.3.4 Maintaining Invariants
7.4 Correctness for TLBs
7.4.1 Address Translation
7.4.2 Invalidation of TLB
7.5 Verifying Guard Conditions
7.5.1 TLB Steps
7.5.2 Processor Core Steps
Part IV Multi-Core MIPS with NAT
8 Pipelined Processor with Nested MMUs
8.1 Pipelined Processor
8.1.1 Stall Engine Summary
8.1.2 Instruction Address
8.1.3 Interrupt Cause Pipeline
8.1.4 Ghost Pipeline for Translations in Use
8.1.5 Connecting Components
8.1.6 Forwarding Mechanism
8.2 Machinery for Description of Pipelines
8.2.1 Execution Modes
8.2.2 Scheduling Functions
8.2.3 Lowest Non-Live Stage
8.3 Cache Memory System in Pipelined Processor
8.3.1 Connections to Caches
8.3.2 Stability of Inputs to Caches
8.3.3 Accesses of Hardware Computation
8.3.4 Relating Endings of Accesses with Hardware Control Signals
8.3.5 Extracting Sequence of Processor Accesses
8.4 Liveness
8.4.1 Lowest Truly Full Stage
8.4.2 Liveness of Pipeline Stages
8.4.3 Instruction Stage
8.4.4 Actual Instruction Index
8.4.5 Uniqueness of Update Cycles
9 Correctness of Pipelined Implementation
9.1 Multi-Core Specification
9.1.1 Multi-Core Computation
9.1.2 Processor Local Computations
9.1.3 Accessing Memory
9.2 Multi-Core Processor
9.2.1 Stepping of Components
9.2.2 Software Conditions
9.2.3 Speculation Stage
9.2.4 Induction Hypothesis
9.2.5 Stepping Inputs
9.3 Developing Formalism
9.3.1 Relating Instruction Count with Scheduling Functions
9.3.2 Relating Global with Processor Local Steps
9.3.3 Properties of Σ
9.4 Verifying Guard Conditions
9.4.1 TLB Steps
9.4.2 Processor Core Steps
9.4.3 Processor Control Signals
9.4.4 Properties of Σ Revisited
9.5 Correctness for Memory
9.5.1 Matching Processor Accesses with Non-Void Accesses
9.5.2 Induction Step
9.5.3 Outputs to Accesses
9.6 Correctness for Pipeline Registers
9.6.1 Speculation on SPR Content
9.6.2 Matching MMU Outputs
9.6.3 Regular Execution
9.6.4 Speculative Execution
9.6.5 Exception Return
9.6.6 Interrupt
9.7 Correctness for TLBs
9.7.1 Adding Translations
9.7.2 Dropping Translations
9.8 Maintaining Invariants
9.8.1 Translations in Use and Invalidation of TLB
9.8.2 Matching Translation Requests
Conclusion
References
Index
List of Figures
1 Formal representation of computer systems
2 Two-step simulation of the MMU computation
3 Little endian embedding of byte-addressable memory into line-addressable memory
4 Visible data structures of MIPS ISA
5 Types and fields of MIPS instructions
6 Partitioning of address and page table entries
7 Page table entry address
8 Process of address translation
9 Partitioning of universal addresses
10 Address space of the universal page address upa
11 Process of simple translation
12 Process of nested translation
13 Possible ways to compose walk wn
14 Interface of TLB for nested translation
15 Basic construction of the hardware TLB
16 Circuit “tlbhit”
17 Circuit “tlbspc”
18 Interface of MMU for nested translation
19 Hardware layout of the MMU for nested translation
20 Control automaton of the nested MMU for simple translation
21 Control automaton of the nested MMU for nested translation
22 Circuit for walk creation (initialization and extension, data paths)
23 Initialization of the guest walk register before a nested call
24 Control automaton of the simple MMU
25 Control automaton of the nested MMU
26 Changes in the set of user walks (walksU) on adding of walks
27 Changes in the set of user walks (walksU) on dropping of guest walks
28 Data paths connecting the MMUs to the processor core and the memory system
29 Machine’s control automaton
30 Wiring of the special inputs of the SPR
31 TLB step performed by mmuE in cycle t
32 Invalidating step (of the processor core) performed in cycle t
33 Stall-rollback engine hardware
34 Instruction address computation
35 Collecting event signals in the cause pipeline
36 Pipelined MIPS processor connected to four caches and two (nested) MMUs
37 PC environment
38 Timing of the MMU control signals
39 Pipeline stages in cycles t and t′ > t considered in the proof of lemma 96
40 Rollback request signals in cycle t̃ considered in the proof of lemma 103
41 Pipeline stages target for the induction step in the proof of lemma 105
42 Pipeline stages in cycles t and t+1 considered in a proof in Sect. 9.6.4
List of Tables
1 Logical connectives
2 R-type instructions
3 J-type instructions
4 I-type instructions
5 Specification of ALU operations
6 Specification of shift unit operations
7 Specification of branch condition evaluation
8 Special purpose registers
9 Interrupts handled by the ISA
10 Changes of the mode on interrupts and returns from exceptions
11 Meaningful ISA signals for various interrupt levels
12 Local invisible registers in various pipeline stages

Introduction
The last decade witnessed a revolution in the mobile computing market, blurring the boundary
between ordinary computers and various specialized electronic devices. Today everything,
from mobile phones to toothbrushes, is capable of network communication. Mobile phones,
for instance, are no longer supplied with simple chips providing a limited, highly specialized
set of features. Now they are more similar to ordinary computers. In fact, they are
computers, computers with multi-core processors inside [ARM14]. Differences between today’s
desktop PC, laptop, tablet, and mobile phone are quantitative rather than qualitative, and
determine only the performance of the hardware. Leaving the performance aspects aside, all these
devices, running similar programs within similar operating systems, execute equivalent code.
Besides the obvious benefits for users and software developers, such unification of hardware
platforms creates advantages for hackers and developers of malicious software [ELMC18].
The tendency to equip everyday devices with microprocessors is expected to grow over time.
This also concerns life-critical systems controlling airplanes, trains, cars, etc. Programming
such systems requires a comprehensive understanding of the operational behavior of the
integrated processors. Operational behavior is normally described in manuals, which for modern
processors are typically several thousand pages long [Int16]. Fortunately, industrial engineers
simply do not trust things they do not understand. For this reason, they have for many years
avoided using many of the features provided by modern processors, including the parallelism of
multi-core CPUs. Consequently, the total number of processors and microcontrollers installed,
for instance, in a single vehicle is approaching one hundred. This is a real problem for the
industry, with sizeable risks for the customers: in a highly competitive environment, such as
the automotive industry, it is only a matter of time before dozens of chips are replaced by a
handful of powerful multi-core processors. Solving this problem and mitigating the risks is
becoming one of the central tasks in modern computing.
One possible solution is provided by formal verification of computer systems. As depicted
in Fig. 1, computer systems are formally described by models which justify computations on
various system layers: from the physical layer at the bottom — via the (OS) kernel layer — to
the applications at the top. The following principle was formulated almost thirty years ago (see
Related Work) but has not lost its relevance: correctness of the upper system layers hinges on
correctness of the lower layers, and is derived by refining the models at the upper layers by the
models at the lower (adjacent) layers. Clearly, in order to show correctness of the entire stack,
one has to prove that every model is refined by the model below.
According to Fig. 1, since one should not try to prove the laws of nature, one should start at
the layer of physical hardware. Given correct policies for abstracting away dynamic RAMs and
buses [KMP14], one can start at the layer of digital hardware. Of course, one can completely
ignore the complexity of industrial processors and avoid reasoning about the real hardware:
some verification projects exclude the hardware layers from the model stack and consider only
the software layers [SK17, KAK+]. These approaches rely on certain properties, such as
‘non-bypassable architectural constraints’ in [KAK+], guaranteed by most commercial chip
manufacturers. As practice shows, such guarantees can turn out to be broken after having been
trusted for many years. For instance, the security of modern computer systems heavily relies
on isolation of the memories (data) of programs running on the same physical machine.
In [LSG+18] the authors demonstrated how to exploit side effects of out-of-order (speculative)
execution to read the memory of the kernel; in [KGG+18], the memory of the victim’s process.
Since speculative execution became an industry standard decades ago [Int18], billions
of devices from the main suppliers (including microprocessors from Intel, AMD, and ARM)
were found vulnerable to the attacks reported in [LSG+18, KGG+18]. In order to avoid similar
consequences, one should start verification of computer systems (model stacks) at the layer of
digital hardware.
[Figure 1: stack of models, listed from top to bottom:
Operating System Kernel (Application Programming Interface)
Multi-Core ISA + Interrupts + Devices (parallel C & inline assembly)
Instruction Set Architecture (assembly)
Digital Hardware (logical gates)
Physical Hardware (physical gates)
Electrodynamics (electrical circuits)
An arrow labeled "abstraction" accompanies the stack; two adjacent layers are marked ℓ+1 and ℓ.]
Fig. 1: Formal representation of computer systems.
The models describing a computer system together form a stack, in the sense that the model at
layer ℓ implements the model at layer ℓ+1. That is, to justify the semantics of the model at
layer ℓ+1, one has to prove that the model at layer ℓ refines the model at layer ℓ+1, i.e., that
for every (low-level) computation at layer ℓ there is an equivalent (high-level) computation at
layer ℓ+1. For instance, one cannot justify the semantics of the instruction set architecture
(ISA) without establishing its refinement by the digital hardware.
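The refinement requirement from the caption of Fig. 1 (every computation at layer ℓ has an equivalent computation at layer ℓ+1) can be illustrated on finite toy models. The following Python sketch is not taken from the thesis: the function refines, the counter models, and the abstraction drop_busy are hypothetical names chosen for illustration. It checks a step-wise simulation condition, which is one common way to establish refinement, and it ignores initial states and liveness.

```python
from typing import Callable, Dict, Hashable, Set

State = Hashable
Steps = Dict[State, Set[State]]  # state -> set of possible successor states


def refines(low: Steps, high: Steps, abstract: Callable[[State], State]) -> bool:
    """Check that every step of `low` is matched by `high` under `abstract`.

    A low-level step s -> t is matched if the abstract state does not change
    (a stuttering step) or if abstract(s) -> abstract(t) is a high-level step.
    """
    for s, successors in low.items():
        for t in successors:
            a, b = abstract(s), abstract(t)
            if a != b and b not in high.get(a, set()):
                return False
    return True


# Layer l+1 (hypothetical specification): a counter modulo 4.
spec: Steps = {v: {(v + 1) % 4} for v in range(4)}

# Layer l (hypothetical implementation): each increment takes two cycles,
# tracked by a busy flag that the specification does not see.
impl: Steps = {}
for v in range(4):
    impl[(v, False)] = {(v, True)}
    impl[(v, True)] = {((v + 1) % 4, False)}

# A faulty implementation that skips a value when the increment completes.
faulty: Steps = {}
for v in range(4):
    faulty[(v, False)] = {(v, True)}
    faulty[(v, True)] = {((v + 2) % 4, False)}


def drop_busy(state):
    """Abstraction: forget the implementation-only busy flag."""
    value, _busy = state
    return value


print(refines(impl, spec, drop_busy))    # True
print(refines(faulty, spec, drop_busy))  # False
```

In this toy setting the two-cycle implementation refines the counter specification, while the faulty variant, which skips a value, does not.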
Related Work
The first attempts at formal verification of computer systems were made in the late 1980s, when
a systems verification approach was demonstrated on the model stack from Computational
Logic, Inc., or CLI stack for short [BHMY89, Bev89]. The CLI stack consisted of the
following components:
i) a 32-bit microprocessor (FM8502) designed (ISA specification), implemented at the gate
level, and verified by Warren Hunt;
ii) an assembly language (Piton) with the assembler, linker, and loader designed, implemented,
and verified by J Moore with help from Matt Kaufmann;
iii) two high-level languages with compilers implemented and verified by Bill Young
(micro-Gypsy) and Art Flatau (subset of Nqthm Pure Lisp); and
iv) the operating system (KIT) designed, implemented, and verified by Bill Bevier.
After formalizing the Netlist Description Language (NDL) in 1992, Bishop Brock and Warren
Hunt designed (ISA) and verified a new microprocessor (FM9001), which they described using
NDL. The Piton assembly language was then ported to the ISA of the FM9001 by J Moore, who
updated and reverified the code generators. This allowed the rest of the CLI stack to be ported
mechanically [Moo03].
In the 1990’s, with the rapid development of computer systems, the CLI stack quickly fell short
for many reasons. For instance, the FM9001 microprocessor implemented an unrealistically
simple architecture: no pipeline, no speculation, no floating-point unit, no cache. Though the
FM9001 implemented memory mapped I/O, none of the verified high-level languages had fa-
cilities to support it. Overall, both high-level languages were too simple to be of practical use.
In the early 2000’s, the results of the CLI stack project were revisited by one of its principal
architects, and the ‘grand challenge’ for pervasive verification of computer systems was pro-
posed to the formal methods community. The challenge was to design and mechanically verify
a practical computer system, from gates to software [Moo03].
Within the time frame from 2003 to 2010 a new effort to formally verify model stacks was
undertaken in the Verisoft and Verisoft XT projects [Ver07, Ver10]. Meeting the challenge
of [Moo03], in the course of the Verisoft project the ‘short’ CLI stack was extended by in-
tegrating devices and targeting a more realistic system architecture regarding both hardware
and software [DDB08, APST10]. However, an attempt to transfer the theory developed for se-
quential systems onto multi-core architectures made in the course of the Verisoft XT was less
successful. The main reason: lack of a pervasive theory of multi-core systems [Sch13b].
More recently — about five years ago — the missing guidelines were outlined [CPS13]. Fol-
lowing these guidelines, at the layer of hardware — the scope of this thesis — the gate-level
implementation of a multi-core CPU must be verified to implement its ISA specification. Since
commercial hardware vendors tend not to disclose layouts of their products, that multi-core
CPU first had to be designed and implemented. The resulting design included support for most
of the principal features that the modern industrial processors provide: multiple physical cores,
pipelining, multi-level address translation, internal and inter-processor interrupts, devices, etc.
All these mechanisms were specified in the doctoral thesis of Sabine Schmaltz [Sch13b].
Progress on this work is described in detail below.
In 2014, laying down the groundwork for multi-core hardware verification, the gate-level im-
plementation of a realistic RISC processor was proven correct [KMP14]. This was an imple-
mentation of a subset of the MIPS ISA, in particular containing full gate-level designs for both
a pipelined processor with hardware interlock and a sequentially consistent cache memory sys-
tem implementing the MOESI protocol. Afterwards, in a series of bachelor and master theses,
that implementation was successfully
• synthesized on an FPGA board [Mai14] and
• tested (four-core version) by numerical computation programs [Ols14], and
• extended to support store buffers (single-core version) [Lut14].
The first attempt to extend the design from [KMP14] with mechanisms to support hypervi-
sors (operating systems) was taken in the scope of the university lecture on multi-core system
architecture at Saarland University in late 2015. The multi-level address translation scheme
used in the lecture was previously designed and proven correct for a sequential single-core
machine in [Sch14a]. Integration of inter-processor interrupts into a multi-core machine with
sequential processors was covered in [Sch14b]. The memory management units (MMUs) and
advanced programmable interrupt controllers (APICs), constructed in [Sch14a] and [Sch14b]
resp., were integrated into the stages of the pipelined machine from [KMP14] and presented
to students. The design was preliminarily extended to support internal interrupts. Throughout
2016 the ideas presented in the lecture were summarized in the early version of the lecture
notes [Pau16].
Initially we believed that a full correctness proof for the new design, given the sketch in
[Pau16], could be covered in a series of master theses. Correctness of the internal interrupt
mechanism added in [Pau16] was first proven for a simpler design, namely for a five-stage
pipeline from [KMP14], in [Sch16a]. Unfortunately, the approach used in [Sch16a] turned out
to be not powerful enough to handle the additional pipeline stages added for address trans-
lation. In case of a rollback in the resulting seven-stage pipeline with speculative execution,
one has to stabilize signals in up to three pipeline stages in which the memory system is ac-
cessed, but the approach used in [Sch16a] could only be applied to stabilize signals in one
of the rolled-back stages. An improved mechanism, which allows signals to be stabilized in an
arbitrary number of rolled-back stages, was developed and proven correct by Jonas Oberhauser
in [LOP]. The obtained seven-stage pipelined machine (single-core version) with the rollback
mechanism of [LOP] was successfully implemented on an FPGA board [Zah16], building on
top of the practical realization of the five-stage pipeline from [Mai14].
Finally, correctness of the simplified inter-processor interrupt (IPI) mechanism was proven
for a multi-core machine with sequential processors in [Sch16b]. There the IPI mechanism
is restricted to support only the interrupts broadcast by the local APICs. Integration of the
IPI mechanism as specified in [Sch13a] (with an I/O APIC and a device subsystem) into the
machine from [LOP] is still, at the moment of writing, the subject for future work.
A different approach to formal verification of hardware is pursued by researchers at the
Massachusetts Institute of Technology. In contrast to the standard approach of modeling processors
as synchronous finite state machines (FSM), the authors of [VCAD15] use labeled transition
systems (LTS), a more abstract model of computation than FSM. In particular,
in their work the authors show how to use the model to
i) decompose hardware systems s.t. each component is modeled by a separate LTS and com-
position of LTSes of all components models the whole system, and
ii) replace a component with its simpler substitute if the LTS associated with that component
implements the LTS associated with its substitute.
These techniques allow the authors to formally verify the LTSes representing fairly complex
designs, including speculative processors and hierarchical cache memory systems. On the other
hand, usage of a more abstract model has certain drawbacks: while the model of FSM permits
a straightforward implementation at the register transfer level (RTL), the same is not obvious
for the model of LTS. To synthesize their design, the authors of [VCAD15] substitute the LTS
describing a processor by a program in a dialect (BlueSpec) of a high-level functional language
(Haskell); the latter representation is proven isomorphic to the one using LTSes. Therefore, the
proposed approach hinges on correctness of the proprietary compiler, which translates BlueSpec
descriptions to RTL (System Verilog). In the open literature we managed to find only
results on correctness of a compiler for a rather academic, simplified version of BlueSpec (Fe-Si)
[BC13]. Correctness of the BlueSpec compiler itself remains, to our knowledge, an open
question [Vij16, CVS+17]. Of course, that is insufficient to guarantee correctness of the entire
model stack as described in [Moo03].
Finally, we cannot ignore the results reported in [CVS+17]. There the authors present a framework
(Kami) for the Coq proof assistant, which allows one to i) design and verify “fairly realistic
processors” in Coq, and then ii) extract designs synthesizable on an FPGA board. The
Kami framework uses the BlueSpec compiler as a subroutine. As a case study, the authors
of [CVS+17] verify a multi-core processor, consisting of pipelined (four stage) cores and a
sequentially consistent cache memory system. In addition, the cores are equipped with branch
predictors, but do not implement any forwarding mechanisms. The model was instantiated
with the RISC-V [RIS18] cores implementing the subset of RV32I, the instruction set con-
sisting of 32-bit integer portion of the open-source RISC-V ISA [WLPA16], without subword
memory accesses. The design was synthesized to have the four-core processor and the mem-
ory system with direct-mapped caches. The processor produced by the Kami framework was
reported to be 4% slower (per core) compared to the sequential implementation of the entire
RV32I produced by the BlueSpec compiler from the design in [WZB+16].
In our opinion, the results above look somewhat strange. In [MP00] performance of the se-
quential implementation of a simple DLX processor was compared against its pipelined ver-
sion, also featuring no forwarding. At the moment of writing, the latter results were published
almost twenty years ago. According to the presented analysis, the speedup gained by pipelining
normally reaches 120%, provided that a low-latency (cache) memory system is used.
Goals and Approach
Hypervisors are kernels that allow multiple operating systems to run as processes on a single
physical machine [ABCC66]. If one wants to run user programs within these operating sys-
tems (which is the standard scenario), an additional phase of address translation is required.
The multi-level address translation scheme utilized in [Pau16] provides only one phase of
translation. While there are software techniques to deploy the second phase of translation
(actually, any number of translation phases), these techniques put an additional load on the
underlying hypervisor [Meg12, Kov13]. In order to avoid the aforementioned overheads, we
replace the translation scheme of [Pau16] with a more powerful one, which provides support
[Figure 2: three stacked computations. Top: ISA configurations $c^n$, $c^{n+1}$, stepped by $s(n)$.
Middle: general configurations $\tilde{c}^{t-1}, \tilde{c}^{t}, \tilde{c}^{t+1}, \tilde{c}^{t+2}$, stepped by $\tilde{s}(t)$.
Bottom: hardware configurations $mmu^{t-1}, mmu^{t}, mmu^{t+1}, mmu^{t+2}$.]
Fig. 2: Two-step simulation of the MMU computation.
At the bottom: hardware computation ($mmu^t$); in the middle: general computation ($\tilde{c}^t$); at the top: ISA
computation ($c^n$). TLB steps are generated by the hardware computation in cycle $t$. Oracle input $s(n)$
for the step performed in the semantics of [Sch13a] is provided by stepping function $s$, whereas oracle
input $\tilde{s}(t)$ for the step performed in the general semantics (Sect. 3.4) is provided by stepping function
$\tilde{s}$. Simulation relation between configurations of hardware $mmu^{t+1}$ and ISA $c^{n+1}$ is established in two
steps: first by showing simulation (simtlb) between $mmu^{t+1}$ and general configuration $\tilde{c}^{t+1}$ (Chap. 5), and
then by showing simulation (simISAtlb) between (software) configurations $\tilde{c}^{t+1}$ and $c^{n+1}$.
for two phases of address translation in hardware. Therefore, the main goal of this thesis is to
strengthen the virtualization capabilities of the machine from [Pau16] by designing the two-
phase scheme for (nested) address translation (NAT). Certainly, we are to prove correctness of
the resulting design in the spirit of [KMP14].
Usually integration of new mechanisms like store buffers or multi-level address translation
into the basic design from [KMP14] did not require significant changes of specifications or
modifications of constructions already proven correct. The necessary proof efforts in most
cases were the subject for bachelor or master theses like [Lut14], [Sch14a], [Sch16b]. With
NAT things turn out to be slightly more involved. Since the scheme of NAT is inherently
quite complex, it requires considerably more efforts to specify, construct, and integrate into the
basic design. Moreover, since NAT is used to translate both the instruction address (program
counters) and the effective addresses (memory accesses), a necessity to extract the correctness
of NAT from the arguments on overall machine correctness arises naturally.
Therefore, we handle NAT as follows. Immediately after presenting its formal specification we
introduce another, more general specification of address translation. The general specification
considers a single configuration component — the translation look-aside buffer (TLB), repre-
senting a set of translations — and defines the semantics by specifying how the translations
can be added to or dropped from this set. We show this specification to be equivalent to the
specification from [Sch13a] in case i) only the TLB component is considered and ii) a number
of conditions are fulfilled. After we design the memory management unit (MMU), we prove
that it correctly implements the general specification. This allows us to separate correctness
of the MMU implementation from the other correctness arguments (see Fig. 2). Moreover,
this proof is completely independent of the machine type in which the MMU components are
utilized.
Thus, in one of the chapters (Chap. 7) we show how to integrate NAT into a simple (single-
core, sequential) machine. There we can reveal important proof goals without blurring the
arguments related to the MMU correctness by the machine-specific details. The reader can treat
this chapter as a repetition before doing the main proof, just like a training session before going
into the wild. Afterwards, we reuse many results obtained in this chapter literally, replacing
only the machine-specific lemmas involved. It becomes obvious that integration of NAT boils
down to justification of conditions formulated together with the general specification, under
which the two specifications of TLB are equivalent (i.e., showing simISAtlb as explained in Fig. 2).
That not only makes our argument more modular, but also streamlines the correctness proof.
Modularity comes at the price that things become more lengthy overall and some mathemati-
cally redundant arguments are introduced. This mostly concerns parts where we interface the
components and therefore the proofs. We pay this price in favor of clarity and flexibility s.t. the
obtained results could be used in the future.
Outline
The material in this thesis is arranged in four major parts. In the first part we present all basic
definitions and notations used throughout the entire text (Chap. 1). For convenience, in the first
part we also include specifications of most of the components we build on (Chap. 2).
Part two is entirely devoted to nested address translation. Chapter 3 presents the process of
NAT, both informally and formally, by specifying the MIPS ISA with NAT. In this chapter
we also specify the new basic hardware mechanisms which we integrate: the privilege levels
(levels of execution) and the intercepts. In the next chapter (Chap. 4) we implement a processor
component which performs NAT in hardware. This component we call the nested memory
management unit or the nested MMU for short. Correctness and liveness of the nested MMU
are both proven in the end of part two (Chap. 5) of the thesis.
In the third part we consider the nested MMU in the sequential single-core implementation
of MIPS. The sequential case is included in order to present the important proof goals on a
simpler hardware. Thus, in Chap. 6 we interconnect a sequential core, two nested MMUs, and
four caches of the memory system from [KMP14]. Next, in Chap. 7 we show how to integrate
the correctness results from Chap. 5 to establish correctness for the TLB component.
Finally, in part four we consider the nested MMU in the (modified) pipelined multi-core im-
plementation from [KMP14]. First, in Chap. 8 we interconnect a pipelined core, two nested
MMUs, and four caches of the memory system from [KMP14], exactly as we connected in
part three. Then, in Chap. 9 we consider multiple processors from Chap. 8 connected to a
single cache memory system in parallel, where each processor is connected to four (private)
caches. The arguments from Chap. 7 are very helpful in Chap. 9, which covers most of the
new correctness results presented in this thesis. Using these arguments allows us to streamline
the presentation in Chap. 9 and focus on problems related to pipelined implementation.
Part I
Single-Core MIPS with Address Translation

1 Definitions and Notation
This is actually Sect. 1.3 of [LOP]; it collects some basic definitions and notation we use
massively throughout the entire text. Sections 1.1–1.3 summarize the material from Chap. 2
of [KMP14]. Section 1.4 describes the memories used in software and hardware models; also
it contains the definition of access based memory semantics from Chap. 8 of [KMP14]. All
these materials are included exclusively for completeness of presentation. The authorship over
these materials belongs to all the authors of [LOP].
1.1 Sets, Sequences, and Records
We denote by
$\mathbb{N} = \{0, 1, 2, \ldots\}$
the set of natural numbers including zero, by
$\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$
the set of integers, and by
$\mathbb{B} = \{0, 1\}$
the set of Boolean values. For $i < j$, intervals of integers are defined as
$[i : j] = \{i, i+1, \ldots, j\}$.
In mathematics, finite or infinite sequences a of elements ai are usually indexed starting from
1, i.e., they are written as
a = (a1, . . . ,an)
resp.
a = (a1,a2, . . .).
For computations or their sequences of inputs and outputs it is more convenient to start indexing
with 0, i.e., to write
(c0,c1, . . .).
In this way c0 is the start configuration and ci is the configuration reached after i steps. Finally
for finite bit strings $b$ it is most convenient to number sequences from right to left starting with
0:
$b = (b_{n-1}, \ldots, b_0)$.
For finite subsequences of elements with indices from $i$ to $j > i$ we borrow interval notation
from computer aided design (CAD) systems and write
$a[i : j] = (a_i, \ldots, a_j)$,
but if finite bit strings are involved we write this as
$a[j : i] = (a_j, \ldots, a_i)$.

Table 1: Logical connectives
x ∧ y     and
x ∨ y     or
/x, x̄     not
x ⊕ y     exclusive or, + modulo 2

The set of all sequences of length $n$ with elements from set $A$ is denoted as $A^n$. The Hilbert
epsilon operator picks an element $\varepsilon A$ from a set $A$. Applied to a singleton set it returns the
unique element of the set:
$\varepsilon\{x\} = x$.
The cardinality of finite sets $A$ is denoted by $\#A$. For symbols $x \in \mathbb{B}$ and natural numbers
$n \in \mathbb{N}^+$, a bit-string obtained by repeating $x$ exactly $n$ times is defined as
$x^n = \underbrace{x \ldots x}_{n \text{ times}}$.
For bit strings $x \in \mathbb{B}^n$ we abbreviate the high order and low order (approximate) half of the bits
as
$x_H = x[n-1 : \lfloor n/2 \rfloor]$
$x_L = x[\lceil n/2 \rceil - 1 : 0]$.
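The right-to-left numbering and the two halves can be illustrated with a small Python sketch. It models bit strings as text with the most significant bit first; the helper names `bits`, `high_half`, and `low_half` are ours and do not appear in the source. For odd $n$ the two halves share the middle bit, which is why they are only approximate halves.

```python
def bits(x: str, hi: int, lo: int) -> str:
    """x[hi : lo] with bits numbered from right to left, starting with 0."""
    n = len(x)
    return x[n - 1 - hi : n - lo]


def high_half(x: str) -> str:
    """x_H = x[n-1 : floor(n/2)]."""
    n = len(x)
    return bits(x, n - 1, n // 2)


def low_half(x: str) -> str:
    """x_L = x[ceil(n/2)-1 : 0]."""
    n = len(x)
    return bits(x, (n + 1) // 2 - 1, 0)


if __name__ == "__main__":
    print(high_half("1011"), low_half("1011"))    # even length: disjoint halves
    print(high_half("10110"), low_half("10110"))  # odd length: middle bit shared
```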
1.2 Boolean Operators
In Boolean Algebra we use the connectives from Table 1. For logical connectives $\circ \in
\{\wedge, \vee, \oplus\}$, bit-strings $a, b \in \mathbb{B}^n$, and a bit $c \in \mathbb{B}$, we borrow from vector calculus to define
the corresponding bitwise operations:
$/a[n-1 : 0] = (/a_{n-1}, \ldots, /a_0)$
$a[n-1 : 0] \circ b[n-1 : 0] = (a_{n-1} \circ b_{n-1}, \ldots, a_0 \circ b_0)$
$c \circ b[n-1 : 0] = (c \circ b_{n-1}, \ldots, c \circ b_0)$.
For records $x$ with selectors $n_1, \ldots, n_t$ we denote as usual with $x.n_i$ the component of the record
selected by $n_i$. For subsequences $(n_{s_1}, \ldots, n_{s_r})$ of the selectors we abbreviate the sequence of
record components selected by this subsequence as
$x.(n_{s_1}, \ldots, n_{s_r}) = (x.n_{s_1}, \ldots, x.n_{s_r})$.
In a hardware configuration $h$ with components $h.pc, h.dpc, \ldots$ we would for instance abbreviate
$h.(dpc, pc) = (h.dpc, h.pc)$.
1.3 Binary and Two’s Complement Numbers
For bit-strings $a = a[n-1 : 0] \in \mathbb{B}^n$ we denote by
$\langle a \rangle = \sum_{i=0}^{n-1} a_i \cdot 2^i$
the interpretation of bit-string $a$ as a binary number. String $a$ is called the binary representation
of length $n$ of the natural number $\langle a \rangle$. It is often useful to decompose $n$ bit binary representations
$a[n-1 : 0]$ into an upper part $a[n-1 : m]$ and a lower part $a[m-1 : 0]$. The connection
between these parts is reflected in the following lemma (lemma 2.9 in [KMP14]).
Lemma 1 (decomposition). Let $a \in \mathbb{B}^n$ and $n \ge m$. Then
$\langle a[n-1 : 0] \rangle = \langle a[n-1 : m] \rangle \cdot 2^m + \langle a[m-1 : 0] \rangle$.
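The decomposition lemma can be checked numerically on examples. The following Python sketch is only an illustration under our own conventions: bit strings are text with the most significant bit first, and the names `value`, `bits`, and `decomposition_holds` are hypothetical helpers, not notation from the source.

```python
def value(a: str) -> int:
    """<a>: the bit string a (most significant bit first) read as a binary number."""
    return sum(int(bit) << i for i, bit in enumerate(reversed(a)))


def bits(a: str, hi: int, lo: int) -> str:
    """a[hi : lo] with bits numbered from right to left, starting with 0."""
    n = len(a)
    return a[n - 1 - hi : n - lo]


def decomposition_holds(a: str, m: int) -> bool:
    """Check <a[n-1:0]> = <a[n-1:m]> * 2^m + <a[m-1:0]> for 0 <= m <= n."""
    n = len(a)
    upper = value(bits(a, n - 1, m))
    lower = value(bits(a, m - 1, 0))
    return value(a) == upper * 2**m + lower


if __name__ == "__main__":
    # 10110100 = 180, upper part 10110 = 22, lower part 100 = 4, and 22 * 8 + 4 = 180.
    print(decomposition_holds("10110100", 3))
```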
The set of natural numbers representable as binary numbers of length $n$ is denoted by
$B_n = \{\langle a \rangle \mid a \in \mathbb{B}^n\}$.
For $x \in B_n$ the binary representation of $x$ of length $n$ is denoted by
$x_n = bin_n(x) = \varepsilon\{a \in \mathbb{B}^n \mid \langle a \rangle = x\}$.
We denote by
$[a] = -2^{n-1} a_{n-1} + \langle a[n-2 : 0] \rangle$
the interpretation of bit-string $a$ as a two’s complement number. String $a$ is called the two’s
complement representation of length $n$ of the integer $[a]$. The set of integers
representable as two’s complement numbers of length $n$ is denoted by
$T_n = \{[a] \mid a \in \mathbb{B}^n\}$.
For $x \in T_n$ the two’s complement representation of $x$ of length $n$ is denoted by
$twoc_n(x) = \varepsilon\{a \in \mathbb{B}^n \mid [a] = x\}$.
The binary addition $+_n$ and subtraction $-_n$ of bit strings $a, b \in \mathbb{B}^n$ are defined by
$a +_n b = bin_n((\langle a \rangle + \langle b \rangle) \bmod 2^n)$
$a -_n b = bin_n((\langle a \rangle - \langle b \rangle) \bmod 2^n)$.
A simple computation shows that for the addition of $n$ bit numbers whose last $m$ bits are all
zero, it suffices to add the leading $n-m$ bits.
Lemma 2. Let $a, b \in \mathbb{B}^n$ and let $a[m-1 : 0] = b[m-1 : 0] = 0^m$. Then
$a +_n b = (a[n-1 : m] +_{n-m} b[n-1 : m]) \circ 0^m$.
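A minimal Python sketch of the modular addition and subtraction on bit strings, together with a check of Lemma 2 on one example, may help to make the definitions concrete. The function names `bin_n`, `add_n`, `sub_n`, and `lemma2_holds` are our own; bit strings are again text with the most significant bit first, and `lemma2_holds` assumes 0 < m < n.

```python
def value(a: str) -> int:
    """<a>: the bit string a read as a binary number."""
    return int(a, 2)


def bin_n(x: int, n: int) -> str:
    """Binary representation of length n; requires 0 <= x < 2^n."""
    return format(x, f"0{n}b")


def add_n(a: str, b: str) -> str:
    """a +_n b = bin_n((<a> + <b>) mod 2^n)."""
    n = len(a)
    return bin_n((value(a) + value(b)) % 2**n, n)


def sub_n(a: str, b: str) -> str:
    """a -_n b = bin_n((<a> - <b>) mod 2^n); Python's % is non-negative here."""
    n = len(a)
    return bin_n((value(a) - value(b)) % 2**n, n)


def lemma2_holds(a: str, b: str, m: int) -> bool:
    """If the last m bits of a and b are zero, adding the leading n-m bits suffices."""
    n = len(a)
    assert a[n - m:] == b[n - m:] == "0" * m
    return add_n(a, b) == add_n(a[: n - m], b[: n - m]) + "0" * m


if __name__ == "__main__":
    print(add_n("10101000", "01111000"))
    print(sub_n("0011", "0101"))
    print(lemma2_holds("10101000", "01111000", 3))
```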
As arithmetic units process binary numbers as well as two’s complement numbers, one would
expect in their specification also a counterpart of this definition for two’s complement numbers.
However, in manuals such a specification is usually absent; instead addition and subtraction
of two’s complement numbers are also specified by the binary addition and subtraction
operations $+_n$ and $-_n$. In a nutshell this works because according to lemma 2.14 of [KMP14]
for $a \in \mathbb{B}^n$ we have
$\langle a \rangle \equiv [a] \bmod 2^n$
which immediately implies for $\circ \in \{+, -\}$:
$\langle a \rangle \circ \langle b \rangle \equiv [a] \circ [b] \bmod 2^n$.
This is almost but not quite what one wants to know. Attempting to define two’s complement
operators $\circ'_n$ one observes that the exact result
$S = [a] \circ [b]$
might lie outside of the range $T_n$, just like $\langle a \rangle \circ \langle b \rangle$ might lie outside of $B_n$. For such situations
one replaces $S$ by a number $S \operatorname{tmod} 2^n$ which is congruent to $S$ modulo $2^n$ and which lies in the
representable range.
$S \operatorname{tmod} 2^n = \varepsilon\{x \in T_n \mid S \equiv x \bmod 2^n\}$
Observe that the set on the right hand side of the equation is a singleton set, because congruence
modulo $2^n$ is an equivalence relation with $T_n$ as a system of representatives. The definition of
two’s complement addition and subtraction operators $\circ'_n$ then becomes
$a \circ'_n b = twoc_n(([a] \circ [b]) \operatorname{tmod} 2^n)$.
The desired result (lemma 5.1 in [KMP14]) is then simply as follows.
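Lemma 3.
$a \circ_n b = a \circ'_n b$
Proof of lemma 3. Let
$s = a \circ_n b$.
Then
$[s] \equiv \langle s \rangle \bmod 2^n$
$\equiv \langle a \rangle \circ \langle b \rangle \bmod 2^n$
$\equiv [a] \circ [b] \bmod 2^n$.
Trivially we have $[s] \in T_n$. Thus
$[s] = ([a] \circ [b]) \operatorname{tmod} 2^n$. □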
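For small $n$ the statement of Lemma 3 can also be confirmed by exhaustive enumeration. The sketch below is an illustration only, not a substitute for the proof; the names `twoc_value` (for $[a]$), `tmod`, `twoc_n`, and `lemma3_holds` are our own, and `tmod` uses the closed form for the representative in $[-2^{n-1}, 2^{n-1}-1]$ instead of the epsilon operator.

```python
import operator


def value(a: str) -> int:
    """<a>: the bit string a read as a binary number."""
    return int(a, 2)


def twoc_value(a: str) -> int:
    """[a] = -2^(n-1) * a_(n-1) + <a[n-2:0]>."""
    n = len(a)
    return value(a) - (2**n if a[0] == "1" else 0)


def bin_n(x: int, n: int) -> str:
    """Binary representation of length n; requires 0 <= x < 2^n."""
    return format(x, f"0{n}b")


def tmod(s: int, n: int) -> int:
    """S tmod 2^n: the unique x in T_n congruent to S modulo 2^n."""
    half = 2 ** (n - 1)
    return (s + half) % 2**n - half


def twoc_n(x: int, n: int) -> str:
    """Two's complement representation of x in T_n, of length n."""
    return bin_n(x % 2**n, n)


def lemma3_holds(n: int) -> bool:
    """Compare binary and two's complement add/sub for all pairs of n-bit strings."""
    for op in (operator.add, operator.sub):
        for i in range(2**n):
            for j in range(2**n):
                a, b = bin_n(i, n), bin_n(j, n)
                binary = bin_n(op(value(a), value(b)) % 2**n, n)
                twos = twoc_n(tmod(op(twoc_value(a), twoc_value(b)), n), n)
                if binary != twos:
                    return False
    return True


if __name__ == "__main__":
    print(lemma3_holds(4))
```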
1.4 Memory
The state of a byte addressable memory $m$ with 32 address bits is modeled as a mapping
$m : \mathbb{B}^{32} \to \mathbb{B}^8$.
Such a memory will be used here in ISA specifications. For $x \in \mathbb{B}^{32}$ function value $m(x) \in \mathbb{B}^8$
models the current content of the memory at address $x$. The sequence $m_d(x)$ of $d$ consecutive
entries of memory $m$ starting at address $x$ is defined inductively as follows:
$m_1(x) = m(x)$
$m_{d+1}(x) = m(x +_{32} d_{32}) \circ m_d(x)$.
For bit strings $s \in \mathbb{B}^{8k}$, whose length is a multiple of 8, and $i < k$ we identify byte $i$ of $s$ as
$byte(i, s) = s[8(i+1)-1 : 8i]$.
A trivial induction on $d$ shows:
$i < d \to byte(i, m_d(x)) = m(x +_{32} i_{32})$. (1)
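The inductive definition of $m_d(x)$ and equation (1) can be sketched in Python as follows. This is an illustration under our own assumptions: the memory is a dictionary from integer addresses to byte values, unset addresses read as zero (the source treats $m$ as a total function and fixes no default), and multi-byte values are integers. The names `read_bytes` and `byte` are hypothetical.

```python
MASK32 = 2**32 - 1


def read_bytes(m: dict, x: int, d: int) -> int:
    """m_d(x): d consecutive bytes starting at x, with m(x +32 (d-1)) most significant."""
    result = 0
    for i in reversed(range(d)):
        # Address arithmetic is modulo 2^32, as in x +_32 i_32.
        result = (result << 8) | m.get((x + i) & MASK32, 0)
    return result


def byte(i: int, s: int) -> int:
    """byte(i, s) = s[8(i+1)-1 : 8i]."""
    return (s >> (8 * i)) & 0xFF


if __name__ == "__main__":
    m = {0x10: 0xAA, 0x11: 0xBB, 0x12: 0xCC, 0x13: 0xDD}
    word = read_bytes(m, 0x10, 4)
    print(hex(word))
    # Equation (1): byte i of m_d(x) is the memory byte at x +_32 i_32.
    print(all(byte(i, word) == m[0x10 + i] for i in range(4)))
```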
The hardware memory systems we are going to construct will be addressable by cache lines
which are 64 bits wide. The state resp. configuration of such a line addressable memory is thus
modelled as a mapping
$m : \mathbb{B}^{29} \to \mathbb{B}^{64}$.
The set of all such configurations is denoted by $K_m$.
1.4.1 Embedding
We define a conversion function $\ell$ which changes byte addressable memories
$m : \mathbb{B}^{32} \to \mathbb{B}^8$
to their line addressable version
$\ell(m) : \mathbb{B}^{29} \to \mathbb{B}^{64}$.
For line addresses $a \in \mathbb{B}^{29}$ it is defined by
$\ell(m)(a) = m_8(a \circ 000)$.
This is illustrated in Fig. 3. In processor correctness proofs it will serve as the simulation
relation between the byte addressable ISA memory and the cache line addressable hardware
memory system.
For the line addressable version of m we conclude the following.
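[Figure 3: the byte-addressable memory $m$ (8 bits wide) with the bytes at addresses $a \circ 000$ to $a \circ 111$,
mapped to the 64-bit line $\ell(m)(a) = m_8(a \circ 000)$ at line address $a$.]
Fig. 3: Little endian embedding of byte-addressable memory into line-addressable memory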
Lemma 4. Assume $i < 8$.
$byte(i, \ell(m)(a)) = m(a \circ 000 +_{32} i_{32})$
Proof of lemma 4.
$byte(i, \ell(m)(a)) = byte(i, m_8(a \circ 000))$ (definition)
$= m(a \circ 000 +_{32} i_{32})$ (equation 1) □
For byte addresses $x \in \mathbb{B}^{32}$, resp. for line addresses $x.l = x[31 : 3]$ and line offsets $x.o = x[2 : 0]$,
we then have
$m(x) = byte(\langle x.o \rangle, \ell(m)(x.l))$.
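The embedding and the last identity can be illustrated with a short Python sketch. As before, the dictionary-based memory with zero as the default content and the names `line`, `byte`, and `read_byte_via_line` are our own assumptions, chosen only to mirror $\ell(m)(a)$, $byte(i, s)$, and the identity $m(x) = byte(\langle x.o \rangle, \ell(m)(x.l))$.

```python
def line(m: dict, a: int) -> int:
    """l(m)(a) = m_8(a . 000): the eight bytes of line a, byte 0 least significant."""
    base = a << 3  # a . 000; no wrap-around since a has 29 bits
    return sum(m.get(base + i, 0) << (8 * i) for i in range(8))


def byte(i: int, s: int) -> int:
    """byte(i, s) = s[8(i+1)-1 : 8i]."""
    return (s >> (8 * i)) & 0xFF


def read_byte_via_line(m: dict, x: int) -> int:
    """m(x) recovered from the line x.l = x[31:3] and the offset x.o = x[2:0]."""
    return byte(x & 0b111, line(m, x >> 3))


if __name__ == "__main__":
    m = {0x20 + i: i + 1 for i in range(8)}
    print(hex(line(m, 0x20 >> 3)))
    print(all(read_byte_via_line(m, x) == m[x] for x in m))
```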
1.4.2 Sequential Semantics
We define memory semantics for line addressable memory with the help of accesses. Such
accesses $acc \in K_{acc}$ have the following components:
• processor address $acc.a \in \mathbb{B}^{29}$ (line address),
• processor data $acc.data \in \mathbb{B}^{64}$: the input data in case of a write or a compare-and-swap
(CAS),
• comparison data $acc.cdata \in \mathbb{B}^{32}$: the data for comparison in case of a CAS access,
• the byte write signals $acc.bw \in \mathbb{B}^8$ for write and CAS accesses,
• write signal $acc.w \in \mathbb{B}$,
• read signal $acc.r \in \mathbb{B}$, and
• CAS signal $acc.cas \in \mathbb{B}$.
At most one of the bits $w$, $r$ or $cas$ is allowed to be on.
$acc.r + acc.w + acc.cas \le 1$
In case none of these bits is on, we call the access void.
$void(acc) \equiv acc.r + acc.w + acc.cas = 0$
For technical reasons, we also require the byte write signals to be off in read accesses and to
mask one of the words in case of CAS accesses:
$acc.r \to acc.bw = 0^8$
$acc.cas \to acc.bw \in \{0^4 1^4, 1^4 0^4\}$.
The set of all such accesses is denoted by $K_{acc}$. For CAS accesses and line addressable memory
$m \in K_m$, we define the predicate $test(acc, m)$ which compares $acc.cdata$ with the upper or the
lower word of the memory line $m(acc.a)$ addressed by the access, depending on the byte write
signal $acc.bw[0]$.
$test(acc, m) \equiv acc.cdata = \begin{cases} m(acc.a)_L & acc.bw[0] = 1 \\ m(acc.a)_H & acc.bw[0] = 0 \end{cases}$
For $n = 64$ and strings $x, y \in \mathbb{B}^n$ function $modify$ describes the replacement of bytes of $x$ by the
corresponding bytes of $y$ under control of byte write signals $bw \in \mathbb{B}^8$.
$byte(i, modify(x, y, bw)) = \begin{cases} byte(i, y) & bw[i] = 1 \\ byte(i, x) & bw[i] = 0 \end{cases}$
Semantics of single accesses $acc$ operating on a memory $m$ is specified by a memory update
function
$\delta_M : K_m \times K_{acc} \to K_m$
and the answers
$dataout(m, acc) \in \mathbb{B}^{64}$
of read and CAS accesses. Let
$m' = \delta_M(m, acc)$.
Then memory is updated under control of signals $bw$, $w$ and $cas$. The line addressed by $acc.a$
is updated in case of a write or in case of a CAS with positive outcome of the test. Only bytes
$byte(i, m(acc.a))$ with active byte write signal $acc.bw[i]$ are updated by the corresponding bytes
of $acc.data$.
$m'(a) = \begin{cases} modify(m(a), acc.data, acc.bw) & acc.a = a \wedge (acc.w \vee acc.cas \wedge test(acc, m)) \\ m(a) & \text{otherwise} \end{cases}$
The answers $dataout(m, acc)$ of read or CAS accesses are defined as follows.
$acc.r \vee acc.cas \to dataout(m, acc) = m(acc.a)$
Obviously, void accesses and read accesses do not change the state of the memory, and the answer of
a void access is not specified. Overloading notation we consider linear access sequences
$acc' : \mathbb{N} \to K_{acc}$
where access $acc'(i)$ is access number $i$ of the sequence. Execution of the first $n$ accesses of
such a sequence is then inductively defined by
$\Delta^0_M(m, acc') = m$
$\Delta^{i+1}_M(m, acc') = \delta_M(\Delta^i_M(m, acc'), acc'(i))$.
Readers familiar with [KMP14] will notice that accesses there had an extra component $acc.f \in
\mathbb{B}$ whose activation signals so called flush accesses. These accesses necessarily have to be
considered in the implementation of shared memory systems, but in the specification of such a
system — and that is all we will rely on in this text — one can do without them.
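The sequential access semantics of this section can be summarized in a small executable model. The following Python sketch is our own illustration, not the construction from [KMP14]: lines are 64-bit integers stored in a dictionary (unset lines read as zero, which is an assumption), `bw` is an 8-bit integer whose bit $i$ is $acc.bw[i]$, and the names `Access`, `cas_test` (for $test$), `delta_m` (for $\delta_M$), and `run` (for $\Delta^n_M$) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    a: int            # line address (29 bits)
    data: int = 0     # 64-bit input data for write and CAS
    cdata: int = 0    # 32-bit comparison data for CAS
    bw: int = 0       # byte write signals, bit i corresponds to bw[i]
    w: bool = False
    r: bool = False
    cas: bool = False


def byte(i: int, s: int) -> int:
    return (s >> (8 * i)) & 0xFF


def modify(x: int, y: int, bw: int) -> int:
    """Replace byte i of x by byte i of y wherever bw[i] = 1."""
    result = 0
    for i in range(8):
        source = y if (bw >> i) & 1 else x
        result |= byte(i, source) << (8 * i)
    return result


def cas_test(acc: Access, m: dict) -> bool:
    """test(acc, m): compare cdata with the lower word if bw[0] = 1, else the upper word."""
    current = m.get(acc.a, 0)
    word = current & 0xFFFFFFFF if acc.bw & 1 else current >> 32
    return acc.cdata == word


def delta_m(m: dict, acc: Access) -> dict:
    """Memory update: writes and successful CAS accesses modify the addressed line."""
    assert acc.r + acc.w + acc.cas <= 1
    updated = dict(m)
    if acc.w or (acc.cas and cas_test(acc, m)):
        updated[acc.a] = modify(m.get(acc.a, 0), acc.data, acc.bw)
    return updated


def dataout(m: dict, acc: Access) -> int:
    """Answer of a read or CAS access: the line content before the update."""
    assert acc.r or acc.cas
    return m.get(acc.a, 0)


def run(m: dict, accesses: list, n: int) -> dict:
    """Execute the first n accesses of the sequence in order."""
    for acc in accesses[:n]:
        m = delta_m(m, acc)
    return m
```

A usage example under the same assumptions: starting from a memory holding `0x1111111122222222` at line 5, a write with `bw = 0b00000011` replaces only the two lowest bytes, and a subsequent CAS on the lower word (`bw = 0x0F`) succeeds only if `cdata` matches that word.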
2 Specification
The contents of this chapter are taken almost literally from the various sections of [LOP]. Thus,
Sects. 2.1 and 2.2 — partially Sect. 4.1 of [LOP] — introduce basic MIPS ISA specification
as well as semantics of the basic MIPS machine. Section 2.3 is exactly Sect. 7.1 of [LOP];
it extends the basic machine model from Sect. 2.1 and integrates the interrupt mechanism into
it. Finally, Sect. 2.4 is assembled from Sects. 9.1 and 9.2 of [LOP]; it presents
the basic variant of virtual address translation. All these materials are included exclusively for
completeness of presentation. The authorship over these materials belongs to all the authors
of [LOP].
2.1 Basic MIPS
This is a fairly detailed summary and extension of results from Chap. 5 of [KMP14]. We go
into considerable level of detail here, because i) the MIPS ISA is part of the formulation of
every processor correctness theorem in this book and ii) all processor designs will be derived
in one way or the other from the basic design. Understanding of future chapters will be greatly
facilitated if readers have these specifications, constructions and arguments really at their fin-
gertips.
The special purpose register file, move instructions and CAS instructions are already included
in the basic design, because constructions and correctness proofs require no additional con-
cepts. Changes in notation are marginal. Besides renaming of pins of units we only write the
simulation relation between byte addressable ISA memory c.m and line addressable hardware
memory h.m as
$h.m = \ell(c.m)$
instead of
$h.m \sim c.m$.
The introduction of instruction access $iacc(c)$ and data access $dacc(c)$ of ISA configurations
has been moved forward from Chap. 9 of [KMP14], and the crucial result relating the update
of byte addressable ISA memory $c.m$ with the update of its line addressable version $\ell(c.m)$ by
data access dacc(c) (lemma 9.3 of [KMP14]) is now formulated and proven in terms of ISA
configurations c alone.
2.1.1 Instruction Tables
For the purpose of reference we include the full instruction tables from [Sch13b]. In contrast
to [KMP14] we will already support move and CAS instructions in our basic processor de-
signs. The treatment of the following instructions is deferred to later chapters: i) eret and sysc
(interrupts), ii) mfence (store buffers), and iii) TLB instructions (address translation).
Table 2: R-type instructions
opcode  fun     rs     Mnemonic  Assembler-Syntax    Effect
Shift Operations
000000  000000         sll       sll rd rt sa        rd = sll(rt,sa)
000000  000010         srl       srl rd rt sa        rd = srl(rt,sa)
000000  000011         sra       sra rd rt sa        rd = sra(rt,sa)
000000  000100         sllv      sllv rd rt rs       rd = sll(rt,rs)
000000  000110         srlv      srlv rd rt rs       rd = srl(rt,rs)
000000  000111         srav      srav rd rt rs       rd = sra(rt,rs)
Arithmetic, Logical Operations
000000  100000         add       add rd rs rt        rd = rs + rt
000000  100001         addu      addu rd rs rt       rd = rs + rt
000000  100010         sub       sub rd rs rt        rd = rs − rt
000000  100011         subu      subu rd rs rt       rd = rs − rt
000000  100100         and       and rd rs rt        rd = rs ∧ rt
000000  100101         or        or rd rs rt         rd = rs ∨ rt
000000  100110         xor       xor rd rs rt        rd = rs ⊕ rt
000000  100111         nor       nor rd rs rt        rd = /(rs ∨ rt)
Test-and-Set Operations
000000  101010         slt       slt rd rs rt        rd = (rs < rt ? 1_32 : 0_32)
000000  101011         sltu      sltu rd rs rt       rd = (rs < rt ? 1_32 : 0_32)
Jumps, System Call
000000  001000         jr        jr rs               pc = rs
000000  001001         jalr      jalr rd rs          rd = pc + 4_32, pc = rs
000000  001100         sysc      sysc                System Call
Synchronizing Memory Operation
000000  111111         cas       cas rd rt rd cdata  rd' = m, m' = (rd = cdata ? rt : m)
Coprocessor Instructions
010000  011000  10000  eret      eret                Exception Return
010000          00100  movg2s    movg2s rd rt        spr[rd] := gpr[rt]
010000          00000  movs2g    movs2g rd rt        gpr[rd] := spr[rt]
TLB Instructions
000000  111101         flusht    flusht              flushes TLB translations
000000  111100         invlpg    invlpg rs rt        flushes TLB translations for addr. rt from ASID rs
Store Buffer Instruction
000000  111110         mfence    mfence              flushes the SB
Table 3: J-type instructions
opcode  Mnemonic  Assembler-Syntax  Effect
Jumps
000010  j         j iindex          pc = bin_32(pc + 4_32)[31:28] iindex 00
000011  jal       jal iindex        R31 = pc + 4_32, pc = bin_32(pc + 4_32)[31:28] iindex 00
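The jump target in Table 3 concatenates the upper four bits of the incremented program counter, the instruction index, and two zero bits. A minimal Python sketch of this computation follows; it assumes the usual 26-bit width of iindex in the MIPS J-type format, which the table itself does not state, and the function name `jump_target` is ours.

```python
MASK32 = 0xFFFFFFFF


def jump_target(pc: int, iindex: int) -> int:
    """pc' = bin_32(pc + 4_32)[31:28] . iindex . 00 (iindex assumed to be 26 bits wide)."""
    next_pc = (pc + 4) & MASK32             # pc +_32 4_32
    upper = next_pc & 0xF0000000            # bits [31:28] of the incremented pc
    return upper | ((iindex & 0x3FFFFFF) << 2)


if __name__ == "__main__":
    print(hex(jump_target(0x10000008, 0x123)))
    # The upper bits come from pc + 4, not from pc itself.
    print(hex(jump_target(0x0FFFFFFC, 0)))
```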
Table 4: I-type instructions. Note, we use the following shorthand (d(c) is the access width).
$m = c.m_{d(c)}(ea(c))$
$= c.m_{d(c)}(c.gpr(rs(c)) +_{32} sxtimm(c))$
opcode  rt     Mnemonic  Assembler-Syntax  Effect                                Access Width
Data Transfer
100100         lbu       lbu rt rs imm     rt = 0^24 m                           1
100101         lhu       lhu rt rs imm     rt = 0^16 m                           2
100000         lb        lb rt rs imm      rt = sxt(m)                           1
100001         lh        lh rt rs imm      rt = sxt(m)                           2
100011         lw        lw rt rs imm      rt = m                                4
101000         sb        sb rt rs imm      m = rt[7:0]                           1
101001         sh        sh rt rs imm      m = rt[15:0]                          2
101011         sw        sw rt rs imm      m = rt                                4
Arithmetic, Logical Operation, Test-and-Set
001000         addi      addi rt rs imm    rt = rs + sxt(imm)
001001         addiu     addiu rt rs imm   rt = rs + sxt(imm)
001010         slti      slti rt rs imm    rt = (rs < sxt(imm) ? 1_32 : 0_32)
001011         sltiu     sltiu rt rs imm   rt = (rs < sxt(imm) ? 1_32 : 0_32)
001100         andi      andi rt rs imm    rt = rs ∧ zxt(imm)
001101         ori       ori rt rs imm     rt = rs ∨ zxt(imm)
001110         xori      xori rt rs imm    rt = rs ⊕ zxt(imm)
001111         lui       lui rt imm        rt = imm 0^16
Branch
000001  00000  bltz      bltz rs imm       pc = pc + (rs < 0 ? imm 00 : 4_32)
000001  00001  bgez      bgez rs imm       pc = pc + (rs ≥ 0 ? imm 00 : 4_32)
000100         beq       beq rs rt imm     pc = pc + (rs = rt ? imm 00 : 4_32)
000101         bne       bne rs rt imm     pc = pc + (rs ≠ rt ? imm 00 : 4_32)
000110  00000  blez      blez rs imm       pc = pc + (rs ≤ 0 ? imm 00 : 4_32)
000111  00000  bgtz      bgtz rs imm       pc = pc + (rs > 0 ? imm 00 : 4_32)
2.1.2 Configuration and Instruction Fields
A basic MIPS configuration c has four user-visible data structures (Fig. 4):
• $c.pc \in \mathbb{B}^{32}$: the program counter (PC).
• $c.gpr : \mathbb{B}^5 \to \mathbb{B}^{32}$: the general purpose register (GPR) file consisting of 32 registers, each
32 bits wide. Register number zero is tied to $0^{32}$; writing to it has no effect.
\[ c.gpr(0^5) = 0^{32} \]
• $c.m : \mathbb{B}^{32} \to \mathbb{B}^8$: the processor memory. It is byte addressable; addresses have 32 bits.
• $c.spr : \mathbb{B}^5 \to \mathbb{B}^{32}$: the special purpose register (SPR) file.
Program counter and general purpose registers belong to the central processing unit (CPU).
Let K be the set of all basic MIPS configurations. A mathematical definition of the ISA will
be given by a function
δisa : K→ K
where
c′ = δisa(c)
is the configuration reached from configuration c, if the next instruction is executed. An ISA
computation is a sequence $(c^i)$ of ISA configurations with $i \in \mathbb{N}$ satisfying
\[
c^0.pc = 0^{32}, \qquad c^{i+1} = \delta_{isa}(c^i),
\]
i.e., initially the program counter points to address $0^{32}$ and in each step one instruction is
executed. In the remainder of this section we specify the ISA simply by specifying function
$\delta_{isa}$, i.e., by specifying
\[ c' = \delta_{isa}(c) \]
for all configurations c.

[Fig. 4: Visible data structures of MIPS ISA: pc, gpr, and spr (32 bits wide) and the memory m ($2^{32}$ bytes, 8 bits wide).]

Recall that in Sect. 1.3 for numbers $y \in B_n$ we abbreviated the binary
representation of y with n bits as
\[ y_n = bin_n(y) \]
and for memories
\[ m : \mathbb{B}^{32} \to \mathbb{B}^8, \]
addresses $a \in \mathbb{B}^{32}$, and numbers d of bytes, we denote the content of d consecutive memory
bytes starting at address a by $m_d(a)$.
The current instruction I(c) to be executed in configuration c is defined by the 4 bytes in
memory addressed by the current program counter:
\[ I(c) = c.m_4(c.pc). \]
Because instructions are 4 bytes long, we require for the time being that instructions are aligned
at and fetched from 4 byte boundaries:
c.pc[1 : 0] = 00.
When we treat interrupts, we will raise a misalignment interrupt if this condition is violated.
The six high order bits of the current instruction are called the opcode:
opc(c) = I(c)[31 : 26].
There are three instruction types: R-, J-, and I-type. The current instruction type is determined
by the following predicates:
\[
\begin{aligned}
rtype(c) &\equiv opc(c) = 0{*}0^4 \\
jtype(c) &\equiv opc(c) = 0^4 1{*} \\
itype(c) &= /rtype(c) \wedge /jtype(c).
\end{aligned}
\]
Depending on the instruction type, the bits of the current instruction are subdivided as shown
in Fig. 5. Register addresses are specified in the following fields of the current instruction:
rs(c) = I(c)[25 : 21]
rt(c) = I(c)[20 : 16]
rd(c) = I(c)[15 : 11].
For R-type instructions, ALU-functions to be applied to the register operands can be specified
in the function field:
\[ fun(c) = I(c)[5:0]. \]
[Fig. 5: Types and fields of MIPS instructions. I-type: opc [31:26], rs [25:21], rt [20:16], imm [15:0]. R-type: opc [31:26], rs [25:21], rt [20:16], rd [15:11], sa [10:6], fun [5:0]. J-type: opc [31:26], iindex [25:0].]
Three kinds of immediate constants are specified: the shift amount sa in R-type instructions,
the immediate constant imm in I-type instructions, and an instruction index iindex in J-type
(like jump) operations:
sa(c) = I(c)[10 : 6]
imm(c) = I(c)[15 : 0]
iindex(c) = I(c)[25 : 0].
Immediate constant imm has 16 bits. In order to apply ALU functions to it, the constant can be
extended with 16 high order bits in two ways: zero extension and sign extension:
\[
\begin{aligned}
zxtimm(c) &= 0^{16} \circ imm(c) \\
sxtimm(c) &= imm(c)[15]^{16} \circ imm(c) \\
          &= I(c)[15]^{16} \circ imm(c).
\end{aligned}
\]
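The field and immediate definitions above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the instruction word is modelled as an unsigned Python integer, and the helper names `bits` and `decode_fields` are hypothetical, not part of the specification.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return the bit slice word[hi:lo] as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fields(instr: int) -> dict:
    """Extract the instruction fields of a 32-bit instruction word I(c)."""
    imm = bits(instr, 15, 0)
    sign = bits(instr, 15, 15)
    return {
        "opc": bits(instr, 31, 26),
        "rs": bits(instr, 25, 21),
        "rt": bits(instr, 20, 16),
        "rd": bits(instr, 15, 11),
        "sa": bits(instr, 10, 6),
        "fun": bits(instr, 5, 0),
        "imm": imm,
        "iindex": bits(instr, 25, 0),
        # zero extension: 16 zeros in front of imm
        "zxtimm": imm,
        # sign extension: 16 copies of imm[15] in front of imm
        "sxtimm": (0xFFFF0000 if sign else 0) | imm,
    }


# Example word with opc = 100011 (lw), rs = 3, rt = 2, imm = 0xFFFC
fields = decode_fields(0x8C62FFFC)
```

For the example word, the sign-extended immediate represents the offset -4, while the zero-extended immediate is the plain 16-bit value.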
2.1.3 Instruction Decoding
For every mnemonic mn of a MIPS instruction from the tables above, we define a predicate
mn(c) which is true, if instruction mn is to be executed in configuration c. For instance,
\[
\begin{aligned}
lw(c) &\equiv opc(c) = 100011 \\
bltz(c) &\equiv opc(c) = 0^5 1 \wedge rt(c) = 0^5 \\
add(c) &\equiv rtype(c) \wedge fun(c) = 1 0^5.
\end{aligned}
\]
The remaining predicates directly associated to the mnemonics of the assembly language are
derived in the same way from the tables. We group the basic instruction set into six groups and
define for each group a predicate that holds, if an instruction from that group is to be executed.
• ALU-operations of I-type are recognized by the leading three bits of the opcode, i.e., I(c)[31:29];
ALU-operations of R-type are recognized by the two leading bits of the function code, i.e.,
I(c)[5:4]:
\[
\begin{aligned}
alur(c) &\equiv rtype(c) \wedge fun(c)[5:4] = 10 \\
alui(c) &\equiv itype(c) \wedge opc(c)[5:3] = 001 \\
alu(c) &= alur(c) \vee alui(c).
\end{aligned}
\]
• Shift unit operations are of R-type and are recognized by the three leading bits of the
function code. If bit fun(c)[2] of the function code is on, the shift distance is taken from
the register specified by rs(c) (footnote 1):
\[
\begin{aligned}
su(c) &\equiv rtype(c) \wedge fun(c)[5:3] = 000 \\
suv(c) &\equiv su(c) \wedge fun(c)[2].
\end{aligned}
\]
1 The suffix v in these mnemonics stands for “variable”; by analogy one would expect a suffix i
(“immediate”) for the other shifts.
• Loads and stores are of I-type and are recognized by the three leading bits of the opcode.
CAS instructions are of R-type and are recognized by all ones in the function code.
\[
\begin{aligned}
l(c) &\equiv itype(c) \wedge opc(c)[5:3] = 100 \\
s(c) &\equiv itype(c) \wedge opc(c)[5:3] = 101 \\
cas(c) &\equiv rtype(c) \wedge fun(c) = 1^6
\end{aligned}
\]
Memory operations are loads, stores, or CAS. For convenience we introduce the following
shorthands.
ls(c) = l(c)∨ s(c)
mop(c) = ls(c)∨ cas(c)
• Branches are of I-type and are recognized by the three leading bits of the opcode:
b(c) ≡ itype(c)∧opc(c)[5 : 3] = 000
≡ itype(c)∧ I(c)[31 : 29] = 000.
• Jumps are defined in a brute force way:
jump(c) = jr(c)∨ jalr(c)∨ j(c)∨ jal(c)
jb(c) = jump(c)∨b(c).
• Moves are from GPR to SPR or vice versa.
move(c) = movg2s(c)∨movs2g(c)
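The group predicates above can be sketched in Python as follows. This is a hedged illustration, not the author's decoder: the function name `group` and the returned labels are hypothetical, CAS is recognized by function code 111111 as in the R-type table, and for simplicity the R-type groups are only evaluated for opcode 000000, so coprocessor instructions (opcode 010000) fall into the label "other".

```python
def group(instr: int) -> str:
    """Classify a 32-bit instruction word into one instruction group."""
    opc = (instr >> 26) & 0x3F
    fun = instr & 0x3F
    rtype = (opc & 0b101111) == 0      # opc = 0*0^4
    jtype = (opc >> 1) == 0b00001      # opc = 0^4 1*
    itype = not (rtype or jtype)
    plain_r = rtype and opc == 0       # assumption: skip coprocessor opcode 010000

    if jtype or (plain_r and fun in (0b001000, 0b001001)):
        return "jump"                  # j, jal, jr, jalr
    if plain_r and fun == 0b111111:
        return "cas"
    if plain_r and (fun >> 4) == 0b10:
        return "alu"                   # alur: fun[5:4] = 10
    if plain_r and (fun >> 3) == 0b000:
        return "su"                    # shift unit: fun[5:3] = 000
    if itype and (opc >> 3) == 0b001:
        return "alu"                   # alui: opc[5:3] = 001
    if itype and (opc >> 3) == 0b100:
        return "load"
    if itype and (opc >> 3) == 0b101:
        return "store"
    if itype and (opc >> 3) == 0b000:
        return "branch"
    return "other"
```

The order of the tests matters only for readability here, since the bit patterns of the groups are disjoint for the basic instruction set.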
2.1.4 Processor-Local Operations
Reading out Data from Register Files
Addressing the general purpose register file c.gpr with rs(c) and rt(c) and the special register
file c.spr with rs(c) we obtain shorthands for register file contents:
A(c) = c.gpr(rs(c))
B(c) = c.gpr(rt(c))
S(c) = c.spr(rs(c)).
Moves
Instruction movg2s writes B(c) into the special purpose register file at address rd(c). Instruction
movs2g writes S(c) into the general purpose register file at address rd(c). The memory and
other register file contents are not changed. The pc is incremented by 4. A summary of move
operations (move(c)) is then
\[
\begin{aligned}
c'.pc &= c.pc +_{32} 4_{32} \\
c'.gpr(x) &= \begin{cases} S(c) & movs2g(c) \wedge x = rd(c) \wedge x \neq 0^5 \\ c.gpr(x) & \text{otherwise} \end{cases} \\
c'.spr(x) &= \begin{cases} B(c) & movg2s(c) \wedge x = rd(c) \\ c.spr(x) & \text{otherwise} \end{cases} \\
c'.m &= c.m.
\end{aligned}
\]
Table 5: Specification of ALU operations. Result alu.res ∈ B^n and overflow bit alu.ovf ∈ B are given
for ALU operands alu.a, alu.b ∈ B^n, function bits alu.f ∈ B^4, and special bit alu.i ∈ B.

f[3:0]  i  res                               ovf
0000    *  a +_n b                           [a] + [b] ∉ T_n
0001    *  a +_n b                           0
0010    *  a −_n b                           [a] − [b] ∉ T_n
0011    *  a −_n b                           0
0100    *  a ∧ b                             0
0101    *  a ∨ b                             0
0110    *  a ⊕ b                             0
0111    0  ¬(a ∨ b)                          0
0111    1  b[n/2−1 : 0] ◦ 0^{n/2}            0
1010    *  0^{n−1} ◦ ([a] < [b] ? 1 : 0)     0
1011    *  0^{n−1} ◦ (〈a〉 < 〈b〉 ? 1 : 0)     0
ALU-operations
ALU operations are defined with the help of Table 5. It defines functions res(a,b, f , i) and
ovf (a,b, f , i). As we do not treat interrupts yet, we use only the first of these functions here.
We observe that in all ALU operations of the ISA the left operand is always
alu.a(c) = A(c).
For R-type instruction the right operand is the register specified by the rt field. For I-type
instructions it is the sign extended immediate operand if
opc(c)[2] = I(c)[28] = 0
or zero extended immediate operand if
opc(c)[2] = 1.
Thus, we define immediate fill bit ifill(c), extended immediate constant xtimm(c), and right
operand alu.b(c) in the following way:
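\[
\begin{aligned}
ifill(c) &= \begin{cases} imm(c)[15] & opc(c)[2] = 0 \\ 0 & opc(c)[2] = 1 \end{cases} \\
xtimm(c) &= \begin{cases} sxtimm(c) & opc(c)[2] = 0 \\ zxtimm(c) & opc(c)[2] = 1 \end{cases} \\
         &= ifill(c)^{16} \circ imm(c) \\
alu.b(c) &= \begin{cases} B(c) & rtype(c) \\ xtimm(c) & \text{otherwise.} \end{cases}
\end{aligned}
\]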
Comparing Table 5 with the tables for I-type and R-type instructions we see that bits alu.f[2:0]
of the ALU control can be taken from the low order bits of the opcode for I-type instructions
and from the low order bits of the function field for R-type instructions:
\[
alu.f(c)[2:0] = \begin{cases} fun(c)[2:0] & rtype(c) \\ opc(c)[2:0] & \text{otherwise.} \end{cases}
\]
For bit alu.f[3] things are more complicated. For R-type instructions it can be taken from
the function code. For I-type instructions it must be forced to 1 only for the two test-and-set
operations, which can be recognized by
\[ opc(c)[2:1] = 01. \]
Thus we set
\[
alu.f(c)[3] = \begin{cases} fun(c)[3] & rtype(c) \\ \overline{opc(c)[2]} \wedge opc(c)[1] & \text{otherwise.} \end{cases}
\]

Table 6: Specification of shift unit operations. Result su.res ∈ B^n of the logical left shift sll, logical right
shift srl, and arithmetic right shift sra is computed for shift operand su.b ∈ B^n and (binary coded) shift
distance 〈su.dist〉 ∈ [0 : n−1] under control of the function bits su.f ∈ B^2. Results of the various shifts of
operand su.b by distance i are given in the rightmost column.

f[1:0]  res               res for 〈dist〉 = i
00      sll(b, 〈dist〉)    b[n−i−1 : 0] ◦ 0^i
10      srl(b, 〈dist〉)    0^i ◦ b[n−1 : i]
11      sra(b, 〈dist〉)    b[n−1]^i ◦ b[n−1 : i]
The i-input of the ALU distinguishes for
\[ alu.f[3:0] = 0111 \]
between the lui-instruction of I-type for i = 1 and the nor-instruction of R-type for i = 0. Thus,
we set it to itype(c). The result of the ALU and the arithmetic overflow computed with these
inputs are denoted respectively by
\[ alu.res(c) = alu.res(alu.a(c), alu.b(c), alu.f(c), itype(c)) \]
and
\[ alu.ovf(c) = alu.ovf(alu.a(c), alu.b(c), alu.f(c), itype(c)). \]
Depending on the instruction type, the destination register rdes is specified by the rd field or
the rt field:
\[
rdes(c) = \begin{cases} rd(c) & rtype(c) \\ rt(c) & \text{otherwise.} \end{cases}
\]
A summary of all ALU operations (alu(c)) is then
\[
\begin{aligned}
c'.pc &= c.pc +_{32} 4_{32} \\
c'.gpr(x) &= \begin{cases} alu.res(c) & x = rdes(c) \wedge x \neq 0^5 \\ c.gpr(x) & \text{otherwise} \end{cases} \\
c'.spr &= c.spr \\
c'.m &= c.m.
\end{aligned}
\]
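As an illustration of Table 5 for n = 32, the following Python sketch computes the ALU result and overflow bit. It is a sketch under assumptions: operands are unsigned Python integers, the row 0111 with i = 0 is taken to be nor, and the function names `alu_res` and `alu_ovf` merely mirror the table columns.

```python
MASK32 = 0xFFFFFFFF


def signed(x: int) -> int:
    """Two's complement value [x] of a 32-bit string given as unsigned int."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu_res(a: int, b: int, f: int, i: int) -> int:
    """Result column of Table 5 for n = 32."""
    if f in (0b0000, 0b0001):
        return (a + b) & MASK32
    if f in (0b0010, 0b0011):
        return (a - b) & MASK32
    if f == 0b0100:
        return a & b
    if f == 0b0101:
        return a | b
    if f == 0b0110:
        return a ^ b
    if f == 0b0111:
        # i = 1: lui (low half of b moved to the high half); i = 0: nor
        return ((b & 0xFFFF) << 16) if i else (~(a | b) & MASK32)
    if f == 0b1010:
        return int(signed(a) < signed(b))
    if f == 0b1011:
        return int(a < b)
    raise ValueError("function code not listed in Table 5")


def alu_ovf(a: int, b: int, f: int) -> int:
    """Overflow column of Table 5: only signed add and sub can overflow."""
    if f == 0b0000:
        r = signed(a) + signed(b)
    elif f == 0b0010:
        r = signed(a) - signed(b)
    else:
        return 0
    return int(not (-(1 << 31) <= r < (1 << 31)))
```

The two comparison rows differ only in the interpretation of the operands: two's complement for f = 1010 and binary for f = 1011.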
Shift Unit Operations
Results of shift unit operations are defined with the help of Table 6 as a function res(b, dist, f).
They come in two flavors. If fun(c)[2] is set, the shift distance su.dist(c) is specified by the
five low-order bits of the register specified by the rs field. Otherwise the shift distance is an immediate
operand specified by the sa field of the instruction.
\[
su.dist(c) = \begin{cases} A(c)[4:0] & fun(c)[2] = 1 \\ sa(c) & fun(c)[2] = 0 \end{cases}
\]
The left operand that is shifted is always the register specified by the rt field:
\[ su.b(c) = B(c) \]
and the control bits su.f[1:0] are taken from the low order bits of the function field:
\[ su.f(c) = fun(c)[1:0]. \]
The result of the shift unit computed with these inputs is denoted by
\[ su.res(c) = su.res(su.b(c), su.dist(c), su.f(c)). \]
For shift operations the destination register is always specified by the rd field. Thus, the shift
unit operations (su(c)) can be summarized as
\[
\begin{aligned}
c'.pc &= c.pc +_{32} 4_{32} \\
c'.gpr(x) &= \begin{cases} su.res(c) & x = rd(c) \wedge x \neq 0^5 \\ c.gpr(x) & \text{otherwise} \end{cases} \\
c'.spr &= c.spr \\
c'.m &= c.m.
\end{aligned}
\]
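The three shift results of Table 6 can be sketched in Python as follows. The sketch assumes that operands are unsigned integers of width n and that the distance is already in the range [0 : n−1]; the name `su_res` simply mirrors the table.

```python
def su_res(b: int, dist: int, f: int, n: int = 32) -> int:
    """Shift unit result for operand b, distance dist, and function bits f[1:0]."""
    mask = (1 << n) - 1
    if f == 0b00:                       # sll: b[n-dist-1:0] followed by dist zeros
        return (b << dist) & mask
    if f == 0b10:                       # srl: dist zeros followed by b[n-1:dist]
        return b >> dist
    if f == 0b11:                       # sra: dist copies of b[n-1], then b[n-1:dist]
        sign = b >> (n - 1)
        fill = (mask << (n - dist)) & mask if sign else 0
        return (b >> dist) | fill
    raise ValueError("function bits not listed in Table 6")
```

For distance 0 the fill pattern is empty, so all three shifts return the operand unchanged.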
Branch and Jump
A branch condition evaluation unit is specified in Table 7. It computes a function res(a, b, f).
We use this function with the following parameters:
bce.a(c) = A(c)
bce.b(c) = B(c)
bce.f (c) = opc(c)[2 : 0]◦ rt(c)[0]
and define the result of a branch condition evaluation as
bce.res(c) = bce.res(bce.a(c),bce.b(c),bce.f (c)).
The next program counter $c'.pc$ is usually computed as $c.pc +_{32} 4_{32}$. This order is only changed
in jump instructions or in branch instructions, where the branch is taken, i.e., the branch con-
dition evaluates to 1. We define
jbtaken(c) = jump(c)∨b(c)∧bce.res(c).
In case of a jump or a branch taken, there are three possible jump targets (see below).
i) Branch instructions involve a relative branch.
b(c)∧bce.res(c)
The pc is incremented by a branch distance:
\[
\begin{aligned}
bdist(c) &= imm(c)[15]^{14} \circ imm(c) \circ 00 \\
btarget(c) &= c.pc +_{32} bdist(c).
\end{aligned}
\]
Note that the branch distance is a kind of sign-extended immediate constant, but due
to the alignment requirement the two low-order bits of the branch distance must be zeros.
Thus, one uses the 16 bits of the immediate constant for bits [17:2] of the branch distance.
Sign extension is used for the remaining bits. Thus, backward branches are realized with
negative [imm(c)].
ii) R-type jumps
\[ jr(c) \vee jalr(c). \]
The branch target is specified by the rs field of the instruction:
\[ btarget(c) = A(c). \]
iii) J-type jumps
\[ j(c) \vee jal(c). \]
The branch target is computed in a rather peculiar way: first the pc is incremented by 4,
then bits [27:2] are replaced by the iindex field of the instruction:
\[ btarget(c) = (c.pc +_{32} 4_{32})[31:28] \circ iindex(c) \circ 00. \]

Table 7: Specification of branch condition evaluation. Result bce.res ∈ B is given for operands
bce.a, bce.b ∈ B^n and function bits bce.f ∈ B^4.

f[3:0]  res
0010    [a] < 0
0011    [a] ≥ 0
100*    a = b
101*    a ≠ b
110*    [a] ≤ 0
111*    [a] > 0
Now we can define the next pc computation for all instructions as
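\[
\begin{aligned}
btarget(c) &= \begin{cases}
c.pc +_{32} bdist(c) & b(c) \wedge bce.res(c) \\
A(c) & jr(c) \vee jalr(c) \\
(c.pc +_{32} 4_{32})[31:28] \circ iindex(c) \circ 00 & \text{otherwise}
\end{cases} \\
c'.pc &= nextpc(c) \\
&= \begin{cases} btarget(c) & jbtaken(c) \\ c.pc +_{32} 4_{32} & \text{otherwise.} \end{cases}
\end{aligned}
\]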
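The next pc computation can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `next_pc` is hypothetical, the register value A(c) and the branch condition result bce.res(c) are passed in as arguments `a` and `bce_res`, and all values are unsigned 32-bit integers.

```python
MASK32 = 0xFFFFFFFF


def next_pc(pc: int, instr: int, a: int, bce_res: bool) -> int:
    """Compute c'.pc from pc, instruction word, A(c), and bce.res(c)."""
    opc = (instr >> 26) & 0x3F
    fun = instr & 0x3F
    imm = instr & 0xFFFF
    iindex = instr & 0x03FFFFFF
    pc4 = (pc + 4) & MASK32

    if opc in (0b000010, 0b000011):               # J-type jumps: j, jal
        return (pc4 & 0xF0000000) | (iindex << 2)
    if opc == 0 and fun in (0b001000, 0b001001):  # R-type jumps: jr, jalr
        return a & MASK32
    is_branch = opc != 0 and (opc >> 3) == 0      # I-type with opc[5:3] = 000
    if is_branch and bce_res:
        sxt = imm | (0xFFFF0000 if imm & 0x8000 else 0)
        bdist = (sxt << 2) & MASK32               # imm[15]^14, imm, 00
        return (pc + bdist) & MASK32
    return pc4
```

Note that, as in the definition above, the branch distance is added to the current pc and not to the incremented pc.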
Jump and Link
Jump and link instructions
jal(c)∨ jalr(c)
are used to implement calls of procedures. Besides setting the pc to the branch target, they
prepare the so called link address
\[ linkad(c) = c.pc +_{32} 4_{32} \]
and save it in a register. For the R-type instruction jalr, this register is specified by the rd field.
J-type instruction jal does not have an rd field, and the link address is stored in register number
31 ($\langle 1^5 \rangle = 31$). Branch and jump instructions do not change the memory. Therefore, for the update
of registers in branch and jump instructions
\[ jb(c) \]
we have:
\[
\begin{aligned}
c'.gpr(x) &= \begin{cases} linkad(c) & (jalr(c) \wedge x = rd(c) \wedge x \neq 0^5) \vee (jal(c) \wedge x = 1^5) \\ c.gpr(x) & \text{otherwise} \end{cases} \\
c'.spr &= c.spr \\
c'.m &= c.m.
\end{aligned}
\]
2.1.5 Memory Operations
Memory operations access a certain number
d(c) ∈ {1,2,4}
of bytes of memory starting at a so called effective address ea(c). Letters b, h, and w in the
mnemonics define the width:
• b stands for d = 1 resp. a byte access;
• h stands for d = 2 resp. a half word access, and
• w stands for d = 4 resp. a word access.
Inspection of the instruction tables gives
\[
d(c) = \begin{cases}
1 & opc(c)[0] = 0 \\
2 & opc(c)[1:0] = 01 \\
4 & opc(c)[1:0] = 11 \vee cas(c).
\end{cases}
\]
Addressing is always relative to A(c). Except for CAS operations (which have R-type) the
offset is specified by the immediate field:
\[
ea(c) = \begin{cases} A(c) & cas(c) \\ A(c) +_{32} sxtimm(c) & \text{otherwise} \end{cases}
\]
Note that the immediate constant is sign extended. Thus, negative offsets can be realized in the
same way as negative branch distances. In the absence of misalignment interrupts addresses
are for the time being required to be aligned. If we interpret them as binary numbers they have
to be divisible by the width d(c):
d(c) | 〈ea(c)〉
or equivalently
mop(c)∧d(c) = 4→ ea(c)[1 : 0] = 00
mop(c)∧d(c) = 2→ ea(c)[0] = 0.
Stores
A store instruction takes the low order d(c) bytes of B(c) and stores them as $m_{d(c)}(ea(c))$. The
pc is incremented by 4 (as already defined by the next pc computation in Sect. 2.1.4). Other memory bytes
and register values are not changed.
\[
\begin{aligned}
c'.gpr &= c.gpr \\
c'.spr &= c.spr \\
c'.m(x) &= \begin{cases} byte(i, B(c)) & x = ea(c) +_{32} i_{32} \wedge i < d(c) \\ c.m(x) & \text{otherwise} \end{cases}
\end{aligned}
\]
Loads
Loads, like stores, access d(c) bytes of memory starting at address ea(c). The result is stored
in the low order d(c) bytes of the destination register, which is specified by the rt field of the
instruction. This leaves
\[ 32 - 8 \cdot d(c) \]
bits of the destination register to be filled by some bit fill(c). For unsigned loads (with a suffix
“u” in the mnemonics) the fill bit is zero; otherwise the result is sign extended with the leading bit of
\[ c.m_{d(c)}(ea(c)). \]
In this way a load result
\[ lres(c) \in \mathbb{B}^{32} \]
is computed
\[
\begin{aligned}
u(c) &= opc(c)[2] \\
fill(c) &= \begin{cases} 0 & u(c) \\ c.m(ea(c) +_{32} (d(c)-1)_{32})[7] & \text{otherwise} \end{cases} \\
lres(c) &= fill(c)^{32 - 8 \cdot d(c)} \circ c.m_{d(c)}(ea(c))
\end{aligned}
\]
and the general purpose register specified by the rt field is updated. Other registers and the
memory are left unchanged.
\[
\begin{aligned}
c'.gpr(x) &= \begin{cases} lres(c) & x = rt(c) \wedge x \neq 0^5 \\ c.gpr(x) & \text{otherwise} \end{cases} \\
c'.spr &= c.spr \\
c'.m &= c.m
\end{aligned}
\]
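The load result can be sketched in Python as follows. The sketch rests on assumptions that go beyond the text: memory is modelled as a dictionary from byte addresses to byte values (missing entries read as 0), and $m_d(a)$ is taken to place the byte at the highest address in the most significant position, which matches the fill bit definition above. The name `load_result` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def load_result(mem: dict, ea: int, d: int, unsigned: bool) -> int:
    """Compute lres for an access of d bytes (1, 2, or 4) at address ea."""
    # m_d(ea): byte at ea is least significant, byte at ea + d - 1 most significant
    value = sum(mem.get((ea + i) & MASK32, 0) << (8 * i) for i in range(d))
    # fill bit: 0 for unsigned loads, else bit 7 of the byte at ea + d - 1
    fill = 0 if unsigned else (value >> (8 * d - 1)) & 1
    # 32 - 8*d copies of the fill bit in front of the loaded bytes
    fill_bits = (((1 << (32 - 8 * d)) - 1) << (8 * d)) if fill else 0
    return fill_bits | value


mem = {0x10: 0x80, 0x11: 0xFF}
signed_byte = load_result(mem, 0x10, 1, unsigned=False)    # like lb
unsigned_byte = load_result(mem, 0x10, 1, unsigned=True)   # like lbu
```

For word accesses no fill bits remain, so signed and unsigned interpretation coincide.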
Compare and Swap (CAS)
Recall that CAS operations have R-type, thus no immediate constant is available and the effec-
tive address is just
ea(c) = A(c).
A load word operation with destination specified by the rd field is performed. The content of
register 9 of the special purpose register file
\[ cdata(c) = c.spr(9_5) \]
is compared with the memory word at the effective address
\[ castest(c) \equiv cdata(c) = c.m_4(ea(c)). \]
If the test is positive, the memory content is replaced by B(c).
\[
\begin{aligned}
c'.gpr(x) &= \begin{cases} c.m_4(ea(c)) & x = rd(c) \wedge x \neq 0^5 \\ c.gpr(x) & \text{otherwise} \end{cases} \\
c'.spr &= c.spr \\
c'.m(x) &= \begin{cases} byte(i, B(c)) & castest(c) \wedge x = ea(c) +_{32} i_{32} \wedge i < 4 \\ c.m(x) & \text{otherwise} \end{cases}
\end{aligned}
\]
For convenience, we introduce shorthands to classify the memory operations. Loads and CAS
read from memory. Stores and “positive” CAS write to memory. Thus CAS is both reading
and (potentially) writing.
read(c) = l(c)∨ cas(c)
write(c) = s(c)∨ cas(c)∧ castest(c)
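The effect of a CAS instruction on registers and memory can be sketched in Python as follows. This is an illustration under assumptions: `gpr` and `spr` are lists of 32 unsigned integers, memory is a dictionary from byte addresses to byte values with the byte at the lowest address least significant, and the function name `cas_step` is hypothetical. The pc update is omitted.

```python
MASK32 = 0xFFFFFFFF


def cas_step(mem: dict, gpr: list, spr: list, rs: int, rt: int, rd: int) -> bool:
    """Apply one CAS instruction in place and return castest(c)."""
    ea = gpr[rs]                      # effective address A(c)
    b = gpr[rt]                       # B(c), read from the old configuration
    old = sum(mem.get((ea + i) & MASK32, 0) << (8 * i) for i in range(4))
    castest = spr[9] == old           # compare cdata(c) with the memory word
    if castest:                       # positive test: memory word replaced by B(c)
        for i in range(4):
            mem[(ea + i) & MASK32] = (b >> (8 * i)) & 0xFF
    if rd != 0:                       # register 0 is never written
        gpr[rd] = old                 # the old memory word is loaded into rd
    return castest
```

Reading B(c) before any update matters when rd and rt name the same register: the store uses the old register value, as in the definition of c′.m above.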
2.2 Summary
For convenience, in this section we gather in one place all key parts defining the specification
machine and its computation (ISA computation):
i) MIPS ISA (in Sect. 2.2.1), presenting the material of Sect. 2.1 basically on one page,
ii) software conditions (in Sect. 2.2.2) under which the ISA above is meaningful, and
iii) accesses of the ISA computation (in Sect. 2.2.3), constructed according to the definitions
from Sect. 1.4.2.
2.2.1 MIPS ISA
We collect all previous definitions of destination registers for the general purpose register file
and the special purpose register file into
\[
xad(c) = \begin{cases} 1^5 & jal(c) \\ rd(c) & rtype(c) \\ rt(c) & \text{otherwise.} \end{cases}
\]
Also we collect the data gpr.in to be written into the general purpose register file.
\[
gpr.in(c) = \begin{cases} lres(c) & read(c) \\ C(c) & \text{otherwise} \end{cases}
\]
For technical reasons, we define on the way an intermediate result C that collects the possible
GPR input from arithmetic, shift, and jump instructions:
\[
C(c) = \begin{cases}
S(c) & movs2g(c) \\
B(c) & movg2s(c) \\
su.res(c) & su(c) \\
linkad(c) & jal(c) \vee jalr(c) \\
alu.res(c) & \text{otherwise.}
\end{cases}
\]
Finally, we collect in a general purpose register write signal all occasions when some general
purpose register is updated:
\[ gpr.w(c) = alu(c) \vee su(c) \vee read(c) \vee jal(c) \vee jalr(c) \vee movs2g(c). \]
Now we can summarize the MIPS ISA in four rules concerning the updates of pc, general
purpose registers, special purpose registers, and memory.
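\[
\begin{aligned}
c'.pc &= \begin{cases} btarget(c) & jbtaken(c) \\ c.pc +_{32} 4_{32} & \text{otherwise} \end{cases} \\
c'.gpr(x) &= \begin{cases} gpr.in(c) & gpr.w(c) \wedge x = xad(c) \wedge x \neq 0^5 \\ c.gpr(x) & \text{otherwise} \end{cases} \\
c'.spr(x) &= \begin{cases} C(c) & movg2s(c) \wedge x = xad(c) \\ c.spr(x) & \text{otherwise} \end{cases} \\
c'.m(x) &= \begin{cases} byte(i, B(c)) & write(c) \wedge x = ea(c) +_{32} i_{32} \wedge i < d(c) \\ c.m(x) & \text{otherwise} \end{cases}
\end{aligned}
\]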
2.2.2 Software Conditions
In the absence of misalignment interrupts we have already required that instruction fetch and
memory operations are aligned.
4 | 〈c.pc〉
mop(c) → d(c) | 〈ea(c)〉
Most of ISA memory will be implemented by some kind of RAM, which happens to have
unknown content after power up. The program counter initially points to address $0^{32}$ and the processor starts
fetching instructions from some initial program, for instance a boot loader. This program will
reside in ROM occupying the low order $2^{r+3}$ byte addresses of the memory system for some r.
Because write operations to ROM have no effect, we simply forbid store or CAS operations to
that region.
\[ write(c) \to \langle ea(c) \rangle \geq 2^{r+3} \]
Note that in the presence of guard conditions (see Sect. 2.4.3) the software conditions above
are guaranteed to hold only in case the corresponding computation is guarded, i.e., respects
all related guard conditions. As formally described in [Obe17], execution of instructions is
inherently split into two phases: fetch and execute. Within the fetch phase, all ISA signals
necessary to fetch the current instruction (I(c)) are computed, and the phase ends with a fetch
of the instruction word. The remaining computations performed to complete execution of
the current instruction constitute the execute phase, which of course ends with the current
instruction finishing execution.
All guard conditions imposed on steps of the processor core executing instructions are of course
split into two categories: those that restrict computations performed in the fetch phase, and all
the remaining ones, that restrict computations performed in the execute phase. Therefore, we
require the software conditions which apply in the fetch phase of computational step n to hold if
the guard conditions are obeyed by all steps before n as well as by all computations performed
in the fetch phase of step n. Clearly, all the remaining software conditions which apply to step
n must hold only if the guard conditions are obeyed by all steps up to n (before and within step
n).
The assumptions above turn out to be crucial when one tries to justify correctness of pipelined
machines. Thus, in Chap. 9 we actually rely on the software conditions for the fetch phase in
order to verify the guard conditions for the execute phase (see Sect. 9.4.2).
2.2.3 Accesses of ISA
As a guideline for the construction of environments sh4s and sh4l with their shifters supporting
the memory operations of the ISA, we rephrase these operations in terms of data accesses
dacc(c) to the line addressable version $\ell(c.m)$ of ISA memory. In a completely straightforward
way one specifies
\[
\begin{aligned}
dacc(c).a &= ea(c).l \\
dacc(c).cdata &= cdata(c) \\
dacc(c).r &= l(c) \\
dacc(c).w &= s(c) \\
dacc(c).cas &= cas(c)
\end{aligned}
\]
For the construction of the data memory input dmin(c) of store or CAS operations one shifts
the register B(c), whose low order d(c) ≤ 4 bytes are to be stored, by 〈ea(c)[1:0]〉 bytes to the
left and then makes two copies of the result F; which copy is used will be determined by the
byte write signals.
\[
\begin{aligned}
F(c) &= slc(B(c), \langle ea(c)[1:0] \rangle) \\
dmin(c) &= F(c) \circ F(c)
\end{aligned}
\]
For the byte write signals bw, and also for the byte read signals br to be used later for store
buffer forwarding, one first constructs a 4-bit wide mask mmask for memory operations with
d(c) ones at the right end. This mask is shifted left by 〈ea(c)[1:0]〉. Two copies of the result
f(c) are produced. The left copy is used in the byte read and byte write signals if ea(c)[2] is
on; otherwise the right copy is used.
\[
\begin{aligned}
mmask(c) &= mop(c) \wedge (0^{4-d(c)} 1^{d(c)}) \\
f(c) &= slc(mmask(c), \langle ea(c)[1:0] \rangle) \\
br(c) &= l(c) \wedge \bigl( (ea(c)[2] \wedge f(c)) \circ (\overline{ea(c)[2]} \wedge f(c)) \bigr) \\
bw(c) &= (s(c) \vee cas(c)) \wedge \bigl( (ea(c)[2] \wedge f(c)) \circ (\overline{ea(c)[2]} \wedge f(c)) \bigr)
\end{aligned}
\]
Note that byte read signals are not generated for CAS instructions because these instructions
will not use store buffer forwarding. Using the definition above we specify:
\[
\begin{aligned}
dacc(c).data &= dmin(c) \\
dacc(c).bw &= bw(c).
\end{aligned}
\]
As the output of the data access we have
\[ dmout(c) = dataout(\ell(c.m), dacc(c)). \]
For i < d(c) we locate the bytes of B to be stored in the data memory input by
byte(i,B(c)) = byte(i+ 〈ea(c)[1 : 0]〉,F(c))
= byte(i+ 〈0◦ ea(c)[1 : 0]〉,dmin(c))
= byte(i+ 〈1◦ ea(c)[1 : 0]〉,dmin(c))
= byte(i+ 〈ea(c)[2 : 0]〉,dmin(c)).
Similarly, for i < d(c) we locate the byte write and byte read signals of the bytes to be stored
in bw[7 : 0] resp. br[7 : 0] by
f (c)[ j] = 1↔ mop(c)∧∃i < d(c) : j = 〈ea(c)[1 : 0]〉+ i
bw(c)[ j] = 1↔ (s(c)∨ cas(c))∧∃i < d(c) : j = 〈ea(c)[2 : 0]〉+ i
br(c)[ j] = 1↔ l(c)∧∃i < d(c) : j = 〈ea(c)[2 : 0]〉+ i .
For later reference we summarize these arguments in the following lemma.
Lemma 5.
byte(i,B(c)) = byte(i+ 〈ea(c)[2 : 0]〉,dmin(c))
bw(c)[ j] = 1↔ (s(c)∨ cas(c))∧∃i < d(c) : j = 〈ea(c)[2 : 0]〉+ i
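The two statements of lemma 5 can be checked on concrete values with a small Python sketch. It is only an illustration under assumptions: the names `dmin`, `byte_write`, and `byte` are hypothetical helpers, values are unsigned integers, and slc is modelled as a cyclic left shift (by bytes for the data, by bits for the 4-bit mask).

```python
MASK32 = 0xFFFFFFFF


def byte(i: int, x: int) -> int:
    """Byte i of x, with byte 0 the least significant byte."""
    return (x >> (8 * i)) & 0xFF


def dmin(b: int, ea: int) -> int:
    """Data memory input: B cyclically shifted left by ea[1:0] bytes, copied twice."""
    s = 8 * (ea & 3)
    f = ((b << s) | (b >> (32 - s))) & MASK32
    return (f << 32) | f


def byte_write(ea: int, d: int, is_write: bool) -> int:
    """Byte write signals bw[7:0] for a store or CAS of width d at address ea."""
    if not is_write:
        return 0
    s = ea & 3
    mask = (1 << d) - 1                              # d ones at the right end
    f = ((mask << s) | (mask >> (4 - s))) & 0xF      # cyclic shift of the 4-bit mask
    return (f << 4) if ea & 4 else f                 # left copy if ea[2] is on


# Check both statements for all aligned accesses within one 8-byte line.
b = 0xAABBCCDD
for d in (1, 2, 4):
    for ea in range(0, 8, d):
        bw = byte_write(ea, d, True)
        for i in range(d):
            assert byte(i, b) == byte(i + (ea & 7), dmin(b, ea))
        for j in range(8):
            assert ((bw >> j) & 1) == int(any(j == (ea & 7) + i for i in range(d)))
```

For aligned accesses the cyclic shift of the mask never wraps around, which is why a single copy of f suffices to describe the bytes written within each half line.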
We prove the crucial result that the line addressable version $\ell(c'.m)$ of the memory of the next ISA
configuration is obtained by applying the data access dacc(c) defined in this way to $\ell(c.m)$ (which is
what the hardware memory h.m is specified to do).
Lemma 6.
\[ \ell(c'.m) = \delta_M(\ell(c.m), dacc(c)) \]
Proof of lemma 6. Abbreviating the right hand side
\[ M' = \delta_M(\ell(c.m), dacc(c)) \]
and applying the definition of memory semantics $\delta_M$ we get
\[
M'(a) = \begin{cases}
modify(\ell(c.m)(a), dmin(c), bw(c)) & a = dacc(c).a \wedge (dacc(c).w \vee dacc(c).cas \wedge test(dacc(c), \ell(c.m))) \\
\ell(c.m)(a) & \text{otherwise.}
\end{cases}
\]
For CAS accesses (dacc(c).cas = 1) we have:
\[
\begin{aligned}
test(dacc(c), \ell(c.m)) &\equiv dacc(c).cdata = \begin{cases} \ell(c.m)(dacc(c).a)_H & dacc(c).bw[0] \\ \ell(c.m)(dacc(c).a)_L & \text{otherwise} \end{cases} \\
&\equiv cdata(c) = \begin{cases} \ell(c.m)(ea(c).l)_H & ea(c)[2] \\ \ell(c.m)(ea(c).l)_L & \text{otherwise} \end{cases} \\
&\equiv cdata(c) = c.m_4(ea(c)) \\
&\equiv castest(c).
\end{aligned}
\]
Thus
\[
M'(a) = \begin{cases}
modify(\ell(c.m)(a), dmin(c), bw(c)) & write(c) \wedge a = ea(c).l \\
\ell(c.m)(a) & \text{otherwise.}
\end{cases}
\qquad (2)
\]
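With $x \in \mathbb{B}^{32}$ such that
\[
\begin{aligned}
x.l &= a \geq 2^r \\
\langle x.o \rangle &= j \in B_3
\end{aligned}
\]
and the definition of function modify, we rewrite equation 2 using lemma 5 and lemma 6.3
from [KMP14] as follows:
\[
\begin{aligned}
byte(\langle x.o \rangle, M'(x.l)) &= \begin{cases}
byte(j, dmin(c)) & write(c) \wedge a = ea(c).l \wedge bw(c)[j] \\
byte(j, \ell(c.m)(a)) & \text{otherwise}
\end{cases} \\
&= \begin{cases}
byte(i, B(c)) & write(c) \wedge a = ea(c).l \wedge j = \langle ea(c).o \rangle + i \wedge i < d(c) \\
byte(j, \ell(c.m)(a)) & \text{otherwise}
\end{cases} \\
&= \begin{cases}
byte(i, B(c)) & write(c) \wedge x = ea(c) +_{32} i_{32} \wedge i < d(c) \\
c.m(x) & \text{otherwise}
\end{cases} \\
&= c'.m(x).
\end{aligned}
\]
From the result above we conclude
\[ M' = \ell(c'.m). \qquad \Box \]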
Analogous to the above, we rephrase the fetch of the ISA instruction
\[ I(c) = c.m_4(c.pc) \]
in terms of an instruction access iacc(c) to the line addressable version $\ell(c.m)$ of the ISA memory.
\[
\begin{aligned}
iacc(c).a &= c.pc.l \\
iacc(c).r &= 1
\end{aligned}
\]
As the output of the instruction access we have
\[ imout(c) = dataout(\ell(c.m), iacc(c)). \]
By the software condition we know that the program counter is word-aligned.
\[ c.pc[1:0] = 00 \]
Similarly to lemma 6, using properties of the memory embedding, for the output of the instruction
access we conclude the following.
Lemma 7.
\[
I(c) = \begin{cases} imout(c)_L & c.pc[2] = 0 \\ imout(c)_H & c.pc[2] = 1 \end{cases}
\]
2.3 Interrupt Mechanism
We basically use the interrupt mechanism from [PBLS16] and [MP00]. In the specification we
adjust things in several ways:
• besides reset, there is only one external interrupt signal e ∈ B. This signal will later be
generated by an advanced programmable interrupt controller (APIC). It gets priority 1 and
is masked by bit 1 of the status register.
• there are two kinds of misalignment interrupts:
i) malf: misaligned instruction address (ia) during fetch, and
ii) malm: misaligned effective address (ea) during memory operation.
• overflow is not maskable. Since instructions add, addi, and sub create overflow interrupts
and their ‘unsigned’ counterparts do not, there is no need for an additional mask.
• we distinguish between page faults and general-protection faults. Until we treat address
translation we tie the corresponding interrupt event signals to zero.
Because misaligned memory accesses are now signaled by interrupts, we can drop the corre-
sponding software condition. In correctness proofs, even for the sequential implementation of
Chap. 7, this comes at the price of extra bookkeeping, because with a misaligned instruction
fetch, signal I(c) and the control signals derived from it all have no meaning; hence the proof
has in this case to work without referring to them.
Implementation Details
The pipelining and forwarding of signals (see Sect. 8.1) follows in a fairly straightforward way
the lines of [MP00]. But when it comes to rolling back interrupted instructions things get
involved. As in [Krö01] we treat interrupts as a special case of misspeculation (we speculate that
no interrupts occur and roll instructions in the pipeline back in case we discover that we were
wrong), and as in [Bey05] and [BJK+03] we stabilize the inputs to the cache memory system
in case an instruction is rolled back while a memory access is still in progress. But in later
constructions we will have rollback due to interrupts and due to ‘ordinary’ speculative
execution. For this purpose we present a new stall engine (see Sect. 8.1.1) which allows rollbacks
to be triggered in any pipeline stage and simultaneously stabilizes accesses to the cache memory
system during such rollbacks.
2.3.1 Special Purpose Registers Revisited
The machines specified and constructed before already have a special purpose register file
c.spr, and data can be transported between this file and the general purpose register file by means
of move instructions. So far, however, only the cas instruction made use of this file by taking
the ‘compare data’ cdata(c) from register $c.spr(9_5)$. The interrupt mechanism to be introduced
in this chapter makes use of quite a few more special purpose registers. In Table 8 we introduce
shorthands for the first 10 registers of the special purpose register file; except for pto, which is
used for address translation, the interrupt mechanism and the CAS instruction together make
use of all of them. Thus we abbreviate
\[
\begin{aligned}
c.sr &= c.spr(0_5) \\
&\;\;\vdots \\
c.cdata &= c.spr(9_5).
\end{aligned}
\]
For convenience, we introduce the following helper function to map the synonyms of the
special purpose registers back to the addresses of the corresponding registers in the SPR file.
\[ spr(a) = Z \;\to\; spr[Z] = a \]
Also we introduce the following abbreviations for addresses of the SPR registers.

sr   esr  eca  epc  edpc  edata  pto  mode  emode  cdata
0_5  1_5  2_5  3_5  4_5   5_5    6_5  7_5   8_5    9_5

Table 8: Special purpose registers

〈a〉  synonym for spr(a)  name
0    sr                  status register
1    esr                 exception status register
2    eca                 exception cause
3    epc                 exception pc
4    edpc                exception dpc
5    edata               exception data
6    pto                 page table origin
7    mode                mode register
8    emode               exception mode register
9    cdata               compare data

The mode register $c.mode \in \mathbb{B}^{32}$ distinguishes between system mode, where
\[ c.mode[0] = 0, \]
and user mode, where
\[ c.mode[0] = 1. \]
We abbreviate
\[ mode(c) = c.mode[0]. \]
Clearing bit sr[1] of the status register will mask external interrupts. After reset the machine
should be in system mode, the external interrupts should be masked, and the lowest bit of the
exception cause register should be set.
mode(c0) = 0
c0.sr[1] = 0
c0.eca[0] = 1
The purpose of the other special purpose registers will be explained shortly.
2.3.2 Types of Interrupts
Interrupts are triggered by interrupt event signals; they change the control flow of programs.
Here we consider event signals ev[0 : 10] from Table 9. We classify interrupts in three ways:
• internal or external. The first two interrupts are generated outside of the CPU and the memory
system. Renaming
\[ reset = ev[0] \]
and
\[ e = ev[1] \]
we collect them into the vector $eev \in \mathbb{B}^2$ of external interrupt event signals.
\[ eev[0:1] = (reset, e) \]
The other interrupts are generated within the CPU and the memory system. Their activation
depends only on the current ISA configuration c. We collect them into the vector $iev(c) \in \mathbb{B}^9$
of internal interrupts.
\[ iev(c)[2:10] = ev[2:10]. \]
More internal interrupt signals must be introduced for machines with floating point units
(see [MP00]). In multi-core processors one tends to additionally implement inter-processor
interrupts.
• maskability. Only e is maskable. When an interrupt is masked, the processor will not react
to it if the corresponding event signal is raised.
• resume type: whether and where execution of a program should be resumed after handling
of an interrupt. In many cases, when interrupts signal error conditions the program is
simply aborted. If page faults are handled transparently one obviously wants to repeat
the interrupted instruction after the missing page has been swapped into memory. For the
external interrupts other than reset one also repeats the interrupted instruction for slightly
less obvious reasons: interrupts will be handled according to their priority with small
numbers signaling high priority. If an external interrupt occurs simultaneously with an
internal interrupt and we would resume execution behind the interrupted instruction, then
the internal interrupt (with the lower priority) would be lost. In the remaining cases one
wishes to continue execution of the program behind the interrupted instruction. Clearly
this should be the case for system calls, i.e., interrupts generated by the sysc instruction.

Table 9: Interrupts handled by the ISA

index j  synonym for ev[j]  name                                         maskable  resume
0        reset              reset                                        no        abort
1        e                  external event signal                        yes       repeat
2        malf               misalignment on fetch                        no        abort
3        pff                page fault on fetch                          no        repeat
4        gff                general-protection fault on fetch            no        abort
5        ill                illegal instruction                          no        abort
6        sysc               system call                                  no        cont.
7        ovf                arithmetic overflow                          no        cont.
8        malm               misalignment on memory operation             no        abort
9        pfm                page fault on memory operation               no        repeat
10       gfm                general-protection fault on memory operation no        abort
For the time being we have neither devices generating external interrupts nor memory
management units generating page faults and general-protection faults. Therefore we will
temporarily tie the corresponding event signals to zero.
2.3.3 MIPS ISA with Interrupts
Recall that the transition function $\delta_{isa}$ for MIPS ISA defined so far computes a new MIPS
configuration c′ from an old configuration c and the value of the reset signal.
\[ c' = \delta_{isa}(c, reset) \]
We rename this transition function to $\delta^{old}_{isa}$ and the configuration computed by it to $c^*$.
\[ c^* = \delta^{old}_{isa}(c, reset) \]
Then we define the new transition function
\[ c' = \delta_M(c, eev) \]
which takes as inputs the old configuration c and the vector eev of the external event signals.
We proceed to define a predicate jisr(c,eev) which indicates that a jump to the interrupt
service routine is to be performed, and a predicate eret(c,eev) which indicates a return from the
interrupt service routine. If these signals are inactive, the machine should behave as it did
before with an inactive reset signal.
\[ /jisr(c,eev) \wedge /eret(c,eev) \;\rightarrow\; c' = c^* \]
For the computation of the jisr predicate we unify notation for external and internal interrupts
and define a vector ca(c,eev)[0 : 10] of cause signals by
\[ ca(c,eev)[i] = \begin{cases} eev[i] & i < 2 \\ iev(c)[i] & i \geq 2. \end{cases} \]
For the only maskable interrupt with index 1 we use bit 1 of the status register sr as a mask for
this interrupt. Thus we define the vector mca(c,eev)[0 : 10] of masked cause signals as
\[ mca(c,eev)[i] = \begin{cases} ca(c,eev)[i] \wedge c.sr[i] & i = 1 \\ ca(c,eev)[i] & \text{otherwise.} \end{cases} \]
With e = eev[1] and
\[ mask(c) = c.sr[1] \]
the masked cause bit for the external interrupt can be rewritten as
\[ mca(c,eev)[1] = e \wedge mask(c). \]
We jump to the interrupt service routine if any of the masked cause bits is on:
\[ jisr(c,eev) = \bigvee_i mca(c,eev)[i]. \]
In case this signal is active we define the interrupt level il(c,eev) as the smallest index of an
active masked cause bit (see p. 64).
\[ il(c,eev) = \begin{cases} \min\{ i \mid mca(c,eev)[i] \} & mca(c,eev) \neq 0^{11} \\ +\infty & \text{otherwise} \end{cases} \]
Execution of the interrupted instruction continues if the interrupt of the minimal level has type
continue.
\[ cont(c,eev) \equiv il(c,eev) \in \{6,7\} \]
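The cause, mask, and priority logic above can be sketched in a few lines of Python. This is only an illustration under assumptions not made in the text: bits are modeled as the integers 0 and 1, the status register as a Python integer, +∞ as None, and the function names are hypothetical.

```python
def masked_causes(eev, iev, sr):
    """eev = (reset, e); iev = nine bits for ev[2..10]; sr = status register as int."""
    ca = list(eev) + list(iev)           # cause vector ca[0:10]
    mca = ca[:]                          # only index 1 is maskable
    mca[1] = ca[1] & ((sr >> 1) & 1)     # the mask bit is sr[1]
    return mca


def jisr(mca):
    """Jump to the interrupt service routine if any masked cause bit is on."""
    return any(mca)


def interrupt_level(mca):
    """Smallest index of an active masked cause bit; None stands for +infinity."""
    return next((i for i, bit in enumerate(mca) if bit), None)


def cont_type(mca):
    """True if the interrupt that receives service has resume type continue."""
    return interrupt_level(mca) in (6, 7)
```

For example, if a system call (index 6) coincides with an unmasked external interrupt (index 1), the level is 1 and the instruction is repeated; with the external interrupt masked, the level is 6 and execution continues behind the sysc instruction.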
We handle interrupts with small indices with higher priority than interrupts with high indices.
Thus the interrupt level gives the index of the interrupt that will receive service. With an active
jisr(c,eev) signal many things happen at the transition to the next state c′:
• we jump to the start addresses sisr and $sisr +_{32} 4_{32}$ of the interrupt service routine. We fix
sisr to $0_{32}$, i.e., to the first address in the ROM.
\[ c'.dpc = 0_{32}, \qquad c'.pc = 4_{32} \]
• the maskable interrupt event signal e is masked.
\[ c'.sr = 0_{32} \]
The purpose of this mechanism is to make the calling of the interrupt handler (which will
take some instructions) not interruptible by external interrupts. Of course, the interrupt
service routine should also be programmed in such a way that its execution does not produce
internal interrupts.
• the old value c.sr of the status register is saved into the exception status register c.esr.
\[ c'.esr = c.sr \]
Thus it can be restored later. This does not work if the interrupted instruction writes
the status register (movg2s) and the resume type is continue. But only system calls and
arithmetic instructions have resume type continue, and they don’t write the special purpose
register file.
• in the exception registers epc and edpc we save something similar to the link address of
a function call. It is the pair of addresses where program execution will resume after the
handling of the interrupt, if it is not aborted. In ISA (and the hardware) we only distinguish
whether the resume type is continue or not. In all other cases we prepare for repetition of the
interrupted instruction by saving the current pc and dpc. No harm is done by this if the
handler/operating system decides to abort execution.
\[ c'.(edpc, epc) = \begin{cases} c^*.(dpc, pc) & cont(c,eev) \\ c.(dpc, pc) & \text{otherwise} \end{cases} \]
• in the exception data register edata, we save the effective address ea(c). After a page fault
on memory operation this provides the effective address which generated the interrupt to
the page fault handler.
\[ c'.edata = ea(c) \]
• in the exception cause register eca, we save the interrupt level in one-hot encoding. To
produce the value saved we use the first-one circuit f1 from [KMP14] (see footnote 2).
\[ c'.eca = 0^{21} \circ f1(mca(c,eev)) \]
• we back up the current machine mode into the exception mode register.
\[ c'.emode = c.mode \]
• finally, we switch to system mode.
\[ c'.mode = 0_{32} \]
• in case of continue interrupts we need to finish the interrupted instruction. We collect the
special purpose registers that are updated at jisr into the set
\[ J = \{ sr, esr, eca, epc, edpc, edata, mode, emode \}. \]
Then for the general purpose register file and the memory we define
\[ c'.(gpr, m) = \begin{cases} c^*.(gpr, m) & cont(c,eev) \\ c.(gpr, m) & \text{otherwise} \end{cases} \]
and for special purpose registers $spr(x) \notin J$ we specify
\[ c'.spr(x) = c.spr(x). \]
Note that no harm is done by the latter ‘simple’ specification, because instructions generating
continue interrupts do not update the special purpose register file.
This completes the definition of what happens on activation of the jisr(c,eev) signal.
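The register updates listed above can be collected into one function. The following Python sketch is an illustration only: the class Config, the helper names, and the use of plain integers for 32-bit registers are assumptions, the general purpose registers and the memory are left out, and c_star stands for the configuration $c^*$ computed by the old transition function.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    """Hypothetical container for the registers touched at jisr."""
    pc: int = 0
    dpc: int = 0
    sr: int = 0
    esr: int = 0
    eca: int = 0
    epc: int = 0
    edpc: int = 0
    edata: int = 0
    mode: int = 0
    emode: int = 0


def first_one(mca):
    """One-hot encoding of the smallest active index, as an integer (0 if none)."""
    for i, bit in enumerate(mca):
        if bit:
            return 1 << i
    return 0


def jisr_update(c, c_star, mca, ea):
    """Register part of the jisr step; requires at least one active bit in mca."""
    il = next(i for i, bit in enumerate(mca) if bit)
    resume = c_star if il in (6, 7) else c   # continue vs. repeat/abort
    return replace(
        c,
        dpc=0, pc=4,                      # start of the interrupt service routine
        sr=0,                             # mask the external interrupt
        esr=c.sr,                         # save the old status register
        epc=resume.pc, edpc=resume.dpc,   # resume addresses
        edata=ea,                         # effective address for the handler
        eca=first_one(mca),               # one-hot interrupt level
        emode=c.mode, mode=0,             # back up mode, enter system mode
    )
```

For a system call the saved pair (edpc, epc) comes from c_star, so the handler returns behind the interrupted instruction; for a page fault it comes from c, so the instruction is repeated.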
During a return from exception, i.e., if predicate eret(c,eev) is active, several things also
happen simultaneously:
• pc and dpc are restored from the exception pc and the exception dpc, respectively.
\[ c'.(dpc, pc) = c.(edpc, epc) \]
• the status register is restored from the exception status register.
\[ c'.sr = c.esr \]
• the mode register is restored from the exception mode register.
\[ c'.mode = c.emode \]
Footnote 2. Specification of the first-one circuit is as follows. For $x \in \mathbb{B}^n$ we define
$f1(x) \in \mathbb{B}^n$ s.t.
\[ f1(x)[i] \leftrightarrow x \neq 0^n \wedge i = \min\{ k \mid x[k] \}. \]
2.3.4 Specification of Most Internal Interrupt Event Signals
Except for the fault event signals, which obviously depend on the not yet defined mechanism
of address translation, we already can specify when internal event signals are to be activated.
• illegal instruction. By inspection of the tables in Sect. 2.1.1 we define a predicate
undefined(c) which is true if the current instruction I(c) is not defined in the tables.
Moreover we forbid in user mode i) the access of the special purpose registers by move
instructions, except for movg2s instructions with the target register number 9 (c.cdata; see
footnote 3), as well as ii) the execution of the eret instruction. Moreover we forbid in any mode
explicit moves to the mode register. Thus mode can only be changed by interrupts and eret.
\begin{align*}
ill(c) \equiv{} & undefined(c) \;\vee \\
& mode(c) \wedge eret(c) \;\vee \\
& mode(c) \wedge move(c) \wedge /(movg2s(c) \wedge xad(c) = 8_5) \;\vee \\
& movg2s(c) \wedge xad(c) = 7_5
\end{align*}
• misalignment. Misalignment occurs during fetch, if the low order bits of the pc are not both
zero. It also occurs during a memory operation, if the effective address ea(c) interpreted
as a binary number is not a multiple of the access width d(c).
\[ malf(c) \equiv c.dpc[1:0] \neq 00 \]
\[ malm(c) \equiv mop(c) \wedge (d(c) \nmid \langle ea(c) \rangle) \]
• system call. This event signal is simply identical with the predicate decoding a system call
instruction.
\[ sysc(c) \equiv opc(c) = 0^6 \wedge fun(c) = 001100 \]
• arithmetic overflow. This signal is taken from the overflow output of the ALU specification
if the ALU is used.
\[ ovf(c) \equiv alu(c) \wedge alu.ovf(c) \]
• page faults and general-protection faults. For the time being we tie the corresponding event
signals to zero.
Except for the generation of the fault signals this already completes the formal specification
of the interrupt mechanism. Note, computation of most of the interrupt event signals above is
trivial. In order to justify computation of the misalignment on memory operation, we proceed
to show the following technical result.
Lemma 8. Assume d(c) > 0.
\[ d(c) \nmid \langle ea(c) \rangle \;\leftrightarrow\; mmask(c)[2:1] \wedge ea(c)[1:0] \neq 0^2 \]
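Proof of Lemma 8. For convenience in the scope of this proof we use the following abbreviations.
\[ d = d(c) \in \{1,2,4\}, \qquad ea = ea(c) \in \mathbb{B}^{32}, \qquad mm = mmask(c) = 0^{4-d} \circ 1^d \]
First, for $k = \log d$ we argue that access width d divides effective address ea if and only if the
k low order bits of ea are zeros:
\begin{align*}
d \mid \langle ea \rangle &\leftrightarrow \exists z \in \mathbb{Z} : \langle ea \rangle = z \cdot d \\
&\leftrightarrow \exists z \in \mathbb{Z} : \langle ea \rangle = z \cdot 2^k + 0 \\
&\leftrightarrow \exists z \in \mathbb{Z} : \langle ea[31:k] \rangle = z \wedge \langle ea[k-1:0] \rangle = 0 \quad \text{(lemma 1)} \\
&\leftrightarrow ea[k-1:0] = 0^k
\end{align*}
and therefore
\[ d \nmid \langle ea \rangle \leftrightarrow 0^{2-k} \circ ea[k-1:0] \neq 0^2. \]
For memory mask mm we proceed to show the following.
\begin{align*}
mm[j] &\leftrightarrow (0^{4-d} \circ 1^d)[j] \\
&\leftrightarrow d > j \\
&\leftrightarrow k > \log j \\
&\leftrightarrow (0^{2-k} \circ 1^k)[\log j] \tag{3}
\end{align*}
Finally, using the result above we derive
\begin{align*}
0^{2-k} \circ ea[k-1:0] &\leftrightarrow (0^{2-k} \wedge ea[1:k]) \circ (1^k \wedge ea[k-1:0]) \\
&\leftrightarrow (0^{2-k} \circ 1^k) \wedge ea[1:0] \\
&\leftrightarrow mm[2:1] \wedge ea[1:0] \quad \text{(equation 3)}
\end{align*}
and the claim follows. □

Footnote 3. This move to the special purpose register is allowed in user mode in order to let
unprivileged users set the compare data for the CAS instruction appropriately.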
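Since d ranges over only three values and the claim depends only on the two low order bits of ea, Lemma 8 can also be checked exhaustively. The Python sketch below does this under the assumption that mmask(c) is the 4-bit string $0^{4-d} \circ 1^d$ stated in the proof; the function names are hypothetical.

```python
def mmask(d):
    """The mask 0^(4-d) . 1^d as a 4-bit integer."""
    return (1 << d) - 1


def lemma8_holds(d, ea):
    """Compare both sides of Lemma 8 for access width d and address ea."""
    lhs = (ea % d) != 0                       # d does not divide <ea>
    mm_2_1 = (mmask(d) >> 1) & 0b11           # mmask[2:1]
    rhs = (mm_2_1 & (ea & 0b11)) != 0         # mmask[2:1] AND ea[1:0] is nonzero
    return lhs == rhs


def check_lemma8(limit=1 << 12):
    """Check all access widths against every address below limit."""
    return all(lemma8_holds(d, ea) for d in (1, 2, 4) for ea in range(limit))
```

The check is a sanity test of the statement, not a replacement for the proof.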
2.3.5 Accesses of ISA Revisited
Due to misalignment on fetch an instruction access can now be void, and we redefine its read
component as
\[ iacc(c).r = \begin{cases} 0 & malf(c) \\ 1 & \text{otherwise.} \end{cases} \]
Similarly data accesses can become void due to interrupts which are not of type continue; note
that this includes misalignment on memory operation. We redefine the read, write, and CAS
components of these accesses as
\[ dacc(c).(r,w,cas) = \begin{cases} (0,0,0) & jisr(c,eev) \wedge /cont(c,eev) \\ (l(c), s(c), cas(c)) & \text{otherwise.} \end{cases} \]
Lemma 6 relating the data access of ISA with the memory update stays literally the same. In
the proof interrupts which are not of type continue now have to be treated as an additional case.
Lemma 9.
\[ \ell(c'.m) = \delta_M(\ell(c.m), dacc(c)) \]
2.4 Multi-Level Address Translation
This is the last section literally taken from [LOP]. It serves as a basis for the development of a
more advanced address translation mechanism in Chap. 3.
2.4.1 Page Tables
Pages have 4096 = 4K bytes resp. 1024 = K words. As illustrated in Fig. 6(a), byte addresses
$a \in \mathbb{B}^{32}$ are partitioned into page addresses and page offsets:
\[ a.pa = a[31:12], \qquad a.po = a[11:0]. \]
For i ∈ [2 : 1], page addresses $pa \in \mathbb{B}^{20}$ are further partitioned into level i page indices $pa.px_i$
as illustrated in Fig. 6(b) by
\[ pa.px_2 = pa[19:10], \qquad pa.px_1 = pa[9:0]. \]
[Fig. 6: Partitioning of address and page table entries. (a) Partitioning of the byte address;
(b) partitioning of the page address; (c) partitioning of the page table entry]
Page table entries $pte \in \mathbb{B}^{32}$ are one word long, thus K of them fit on a single page. A page
whose words are used as page table entries is called a page table. Page table entries are
partitioned into base address $pte.ba \in \mathbb{B}^{20}$, present bit $pte.p \in \mathbb{B}$, and rights bits
$pte.r \in \mathbb{B}^3$ as indicated in Fig. 6(c) by
\[ pte.ba = pte[31:12], \qquad pte.p = pte[11], \qquad pte.r = pte[10:8]. \]
The bits r[2 : 0] of a rights vector for a page will be interpreted in the following way:
• r[0] = r.w: write permission,
• r[1] = r.u: permission to access as a user, and
• r[2] = r.ex: permission to fetch as an instruction and execute.
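The partitioning of addresses and page table entries amounts to a few shifts and masks. The following Python sketch illustrates this with addresses and entries modeled as 32-bit integers; the function names and the dictionary used for the rights bits are assumptions made for the example.

```python
def split_address(a):
    """Return (pa, po, px2, px1) for a 32-bit byte address a."""
    pa = a >> 12             # a[31:12], page address
    po = a & 0xFFF           # a[11:0], page offset
    px2 = pa >> 10           # pa[19:10], level 2 page index
    px1 = pa & 0x3FF         # pa[9:0], level 1 page index
    return pa, po, px2, px1


def split_pte(pte):
    """Return (ba, p, rights) for a 32-bit page table entry."""
    ba = pte >> 12           # pte[31:12], base address
    p = (pte >> 11) & 1      # pte[11], present bit
    r = (pte >> 8) & 0b111   # pte[10:8], rights bits r[2:0]
    rights = {"w": r & 1, "u": (r >> 1) & 1, "ex": (r >> 2) & 1}
    return ba, p, rights
```

For instance, the address 0x00403ABC has page address 0x403, page offset 0xABC, level 2 index 1, and level 1 index 3.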
2.4.2 Walks and Translation Requests
Intuitively, multilevel (here: 2-level) address translation is achieved by walking a graph of page
tables, whose edges are defined by the base address fields of the page table entries. In each
level i of translation, the edge to follow from a page table is determined by the level i page
index a.pxi of the page address a which is translated. The central concept for formalizing this
are walks w, which have the following components:
• $w.a \in \mathbb{B}^{20}$: the page address to be translated.
• $w.\ell \in \{100, 010, 001\}$: one-hot encoding of the number of walk extensions still required.
• $w.ba \in \mathbb{B}^{20}$: the base address of the page table from which the walking is to be continued.
• $w.r \in \mathbb{B}^3$: the rights still remaining.
• $w.f \in \mathbb{B}$: the fault bit indicating that the page table entry from which the walk was
extended was not present.
The set of all walks is denoted by
\[ K_{walk} \subseteq \mathbb{B}^{47} \]
and the set of all non-faulting walks is denoted by
\[ K^+_{walk} = \{ w \in K_{walk} \mid w.f = 0 \}. \]
Walks can be created and extended. For the initiation of a walk one needs a page address
$a \in \mathbb{B}^{20}$ to be translated and a page table origin $pto \in \mathbb{B}^{32}$, which will come from the special
purpose register file. The initial walk
\[ w = winit(a, pto) \]
has the following components:
• w.a = a. The page address to be translated.
• $w.\ell = 100$. All levels of translation still need to be performed.
• w.ba = pto.pa. Translation starts at the page table with base address pto.pa.
• w.r = 111. Rights have not yet been restricted.
• w.f = 0. No missing page table entry has been found, because no page table entry has yet
been accessed.
We assume that the page table origin pto is page aligned, i.e., that
\[ pto[11:0] = 0^{12}. \]
For a base address $ba \in \mathbb{B}^{20}$ and a page index $x \in \mathbb{B}^{10}$ we define the page table entry address
ptea(ba,x), i.e., the address of the page table entry on page ba with index x, as
\[ ptea(ba,x) = ba \circ 0^{12} +_{32} 0^{20} \circ x \circ 00 = ba \circ x \circ 00. \]
This is illustrated in Fig. 7.
[Fig. 7: Page table entry address]
Overloading notation we define the page table entry address accessed for extending walk w as
\[ ptea(w) = ptea(w.ba, w.a.px_{level(w)}). \]
Thus the page table entry address is determined by the base address w.ba of the walk and the
page index $w.a.px_i$ of the address under translation, where i = level(w) is the level of the walk
(which determines the number of walk extensions still required).
\[ level(w) = \begin{cases} 2 & w.\ell = 100 \\ 1 & w.\ell = 010 \\ 0 & w.\ell = 001 \end{cases} \]
Walks with level 0 are called complete and cannot be further extended.
\[ complete(w) \equiv level(w) = 0 \]
For walk extension with a memory m one looks up the page table entry pte(w,m) in memory
m at address ptea(w), i.e.,
\[ pte(w,m) = m_4(ptea(w)). \]
The extension
\[ w' = wext(w,m) \]
of incomplete walk w with memory m is defined as follows.
\begin{align*}
w'.a &= w.a \\
w'.f &= /pte(w,m).p \\
w'.\ell &= \begin{cases} w.\ell & w'.f \\ 0 \circ w.\ell[2:1] & /w'.f \end{cases} \\
w'.ba &= pte(w,m).ba \\
w'.r &= pte(w,m).r \wedge w.r
\end{align*}
A few simple properties of walks are listed in the following lemma.
Lemma 10.
• Non-faulting walk extension decreases the level.
\[ /w'.f \rightarrow level(w') = level(w) - 1 \]
• Complete walks are not faulting.
\[ complete(w) \rightarrow /w.f \]
For virtual addresses $va \in \mathbb{B}^{32}$ and complete walks w with page address
\[ w.a = va.pa \]
we define the translated memory address tma(va,w) obtained by translating va with walk w as
\[ complete(w) \wedge w.a = va.pa \;\rightarrow\; tma(va,w) = w.ba \circ va.po. \]
The process of address translation by repeated walk extension is illustrated in Fig. 8.
[Fig. 8: Process of address translation]
Before we proceed, for bit strings x and y of equal length we define:
\[ x \leq y \equiv \forall i : x_i \leq y_i. \]
A translation request trq has two components:
• $trq.a \in \mathbb{B}^{20}$: the page address to be translated, and
• $trq.r \in \mathbb{B}^3$: the access rights requested.
A walk w matches a translation request trq if its address w.a equals trq.a and if w is faulting or
complete.
\[ match(trq,w) \equiv trq.a = w.a \wedge (w.f \vee complete(w)) \]
A matching walk provides a translation for the request if the walk is complete and provides at
least the rights requested.
\[ trans(trq,w) \equiv match(trq,w) \wedge /w.f \wedge (trq.r \leq w.r) \]
A matching walk leads to a page fault for the request if it is faulting.
\[ pfault(trq,w) \equiv match(trq,w) \wedge w.f \]
It leads to a general-protection fault if it is complete and the rights are insufficient.
\[ gfault(trq,w) \equiv match(trq,w) \wedge /w.f \wedge /(trq.r \leq w.r) \]
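The definitions of this subsection fit together as in the following Python sketch. It is an illustration under several assumptions: the walk level is stored as a number (2, 1, 0) instead of the one-hot field, the memory is a dictionary from word-aligned byte addresses to 32-bit words (missing entries read as 0), a translation request is a pair (a, r), and all Python names are hypothetical.

```python
from collections import namedtuple

# a: page address, level: extensions still required, ba: base address,
# r: remaining rights (3 bits), f: fault bit
Walk = namedtuple("Walk", "a level ba r f")


def winit(a, pto):
    """Initial walk for page address a and page table origin pto."""
    return Walk(a=a, level=2, ba=pto >> 12, r=0b111, f=0)


def ptea(w):
    """Address of the page table entry used to extend walk w."""
    px = (w.a >> 10) if w.level == 2 else (w.a & 0x3FF)
    return (w.ba << 12) | (px << 2)


def wext(w, m):
    """Extend incomplete walk w using memory m."""
    assert w.level > 0, "complete walks cannot be extended"
    pte = m.get(ptea(w), 0)
    present = (pte >> 11) & 1
    return Walk(a=w.a,
                level=w.level - 1 if present else w.level,
                ba=pte >> 12,
                r=((pte >> 8) & 0b111) & w.r,
                f=1 - present)


def rights_ok(requested, available):
    """Bitwise comparison requested <= available."""
    return (requested & ~available) == 0


def match(trq, w):
    return trq[0] == w.a and (w.f == 1 or w.level == 0)


def trans(trq, w):
    return match(trq, w) and w.f == 0 and rights_ok(trq[1], w.r)


def pfault(trq, w):
    return match(trq, w) and w.f == 1


def gfault(trq, w):
    return match(trq, w) and w.f == 0 and not rights_ok(trq[1], w.r)


def tma(va, w):
    """Translated memory address for a complete walk w with w.a = va.pa."""
    return (w.ba << 12) | (va & 0xFFF)
```

With a page table origin of 0x1000, a level 2 entry 0x2F00 at address 0x1004, and a level 1 entry 0x5B00 at address 0x200C, two extensions of winit(0x403, 0x1000) yield a complete walk with base address 5 and rights 011, so the virtual address 0x00403ABC is translated to 0x5ABC.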
2.4.3 MIPS ISA with Address Translation
Due to address translation the instruction fetch stage will be preceded in the pipelined
implementation by a stage for the translation of the instruction address. This gives rise to a second
delay slot. Thus we need 3 program counters pc, dpc, and ddpc and their counterparts epc,
edpc, and eddpc in the special purpose register file.
\[ c'.pc = \begin{cases} 8_{32} & jisr(c,eev) \\ c.epc & eret(c) \wedge /jisr(c,eev) \\ nextpc(c) & \text{otherwise} \end{cases} \]
\[ c'.dpc = \begin{cases} 4_{32} & jisr(c,eev) \\ c.edpc & eret(c) \wedge /jisr(c,eev) \\ c.pc & \text{otherwise} \end{cases} \]
\[ c'.ddpc = \begin{cases} 0_{32} & jisr(c,eev) \\ c.eddpc & eret(c) \wedge /jisr(c,eev) \\ c.dpc & \text{otherwise} \end{cases} \]
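The three case distinctions share the same guards, so they can be read as one function returning the triple of new program counters. A minimal Python sketch, assuming the configuration is given as a dictionary of integers and that the value nextpc(c) is supplied by the caller (both assumptions made for the example):

```python
def next_program_counters(c, jisr, eret, nextpc):
    """Return (pc', dpc', ddpc') for configuration c.

    c is a dict with keys pc, dpc, ddpc, epc, edpc, eddpc;
    jisr and eret are the values of the corresponding predicates.
    """
    if jisr:                      # jisr takes priority over eret
        return 8, 4, 0
    if eret:                      # restore from the exception registers
        return c["epc"], c["edpc"], c["eddpc"]
    return nextpc, c["pc"], c["dpc"]   # ordinary shift through the delay slots
```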
In ISA, the instruction address is
\[ ia(c) = c.ddpc \]
which will subsequently be translated or not depending on the current mode mode(c).
Configurations $c \in K_{M+T}$ are now triples with the components:
• c.core = c.core.(pc, dpc, ddpc, gpr, spr): processor core,
• $c.m : \mathbb{B}^{32} \to \mathbb{B}^8$: byte addressable memory, and
• $c.tlb \in K_{tlb}$: a TLB containing the walks currently available for address translation or
extension.
A translation look-aside buffer (TLB) is a cache for translations. Formally we simply define it
as a set of walks. Thus, if $K_{walk}$ is the set of configurations of walks, then the set $K_{tlb}$ of TLB
configurations is
\[ K_{tlb} = 2^{K_{walk}}. \]
The model allows the presence of different translations for the same page address in the TLB.
ISA for MIPS with TLB is nondeterministic in four respects:
• the interleaving of processor core steps and TLB steps,
• the (speculative) choice of initial walks to be placed in the TLB,
• the choice of a walk in the TLB that is to be extended, and
• the choice of walks $w_I$ for the translation of the pc for instruction fetch and $w_E$ for the
translation of effective addresses among the possibly multiple matching walks in the TLB.
Nondeterminism is formalized by a transition function whose inputs have real components as
well as oracle components. Formally we define an input alphabet Σ for a transition function
\[ \delta_{M+T} : K_{M+T} \times \Sigma \to K_{M+T} \]
mapping configurations c and inputs x into a next configuration
\[ c' = \delta_{M+T}(c, x). \]
There are two major cases: TLB steps and core steps. For core steps, we have to process the
external interrupt vector eev and we have, among other things, to deal with the new instructions
flusht and invlpg from Table 2. For the time being we do not deal with tagged TLBs, and thus
we ignore ASIDs specified in the rs field of the invlpg instruction.
The input alphabet Σ is the disjoint union of alphabets $\Sigma_{core}$ and $\Sigma_{tlb}$ for processor core and
TLB steps.
\[ \Sigma = \Sigma_{core} \mathbin{\dot\cup} \Sigma_{tlb} \]
2.4.4 TLB Steps
Inputs leading to TLB steps are only allowed in translated mode, i.e., when
mode(c) = 1.
In system mode the ISA TLB does not walk. This gets more complicated if one builds hardware
support for hypervisors.
For all page addresses $a \in \mathbb{B}^{20}$ we include the walk initialization step into the input alphabet.
\[ (winit, a) \in \Sigma_{tlb} \]
This step has the effect of placing an initial walk for a and page table origin c.pto into the TLB.
The processor core configuration is not changed.
\[ c'.tlb = c.tlb \cup \{ winit(a, c.pto) \}, \qquad c'.core = c.core \]
For all walks $w \in K_{walk}$ we also include the walk extension step into the input alphabet.
\[ (wext, w) \in \Sigma_{tlb} \]
The next state is only defined for incomplete walks in the TLB.
\[ /complete(w) \wedge w \in c.tlb \]
For such walks w the extension of w with memory c.m is included into the TLB. The processor
core configuration stays unchanged.
\[ c'.tlb = c.tlb \cup \{ wext(w, c.m) \}, \qquad c'.core = c.core \]
Note, in case of a faulty walk extension, the resulting faulty walk is stored in the TLB. The
latter walk is used to signal a fault on the forthcoming processor core step.
2.4.5 Processor Core Steps
For components x ∈ {pc,d pc,dd pc,gpr,spr} we abbreviate by overloading notation
c.x = c.core.x.
We denote undefined walks by the symbol ⊥, and for walks
\[ w_I, w_E \in K_{walk} \cup \{\bot\} \]
and all external interrupt vectors
\[ eev[1:0] = (e, reset) \in \mathbb{B}^2 \]
we include into $\Sigma_{core}$ the input symbol
\[ x = (w_I, w_E, eev) \in \Sigma_{core}. \]
Walks which are used are required to be in the TLB. For such inputs the machine performs
a processor core step. In translated mode, walk wI is used for translation of the instruction
address and walk wE is used for translation of the effective address in memory operations.
The formal definition of c′ is necessarily somewhat lengthy, because we have to adapt the
definition of the internal interrupt event signals to the case where the machine might be running
in translated mode. The interesting part of the definitions concerns of course the case when the
machine is running in translated mode, i.e., when
mode(c) = mode(c.core) = 1.
Reset
For inputs
\[ x = (w_I, w_E, e \circ 1) \]
we flush the TLB, initialize the program counters and put the processor into system mode:
\begin{align*}
c'.tlb &= \emptyset \\
c'.(ddpc, dpc, pc) &= (0_{32}, 4_{32}, 8_{32}) \\
c'.mode &= 0_{32}.
\end{align*}
Misalignment on Fetch
This is not affected by address translation.
\[ malf(c) = malf(c.core) \]
Faults on Fetch
The translation request for instruction fetch is
\[ trq_I(c) = (ia(c).pa, 110). \]
If the machine is running in translated mode, we require walk $w_I$ to be a walk from the TLB
matching translation request $trq_I(c)$.
\[ mode(c) \;\rightarrow\; match(trq_I(c), w_I) \wedge w_I \in c.tlb \]
We get a page fault or general-protection fault on fetch if translation of this request with the
chosen walk $w_I$ produces the corresponding fault.
\begin{align*}
pff(c,x) &\equiv mode(c) \wedge pfault(trq_I(c), w_I) \\
gff(c,x) &\equiv mode(c) \wedge gfault(trq_I(c), w_I)
\end{align*}
For convenience we abbreviate:
\[ ff(c,x) = pff(c,x) \vee gff(c,x). \]
Instruction Fetch
Instruction fetch is only possible in the absence of misalignment and, in translated mode,
in the absence of page fault and general-protection fault on fetch. Then we can define the physical
memory address $pma_I$ for instruction memory access. In system mode it is ia(c). In user mode
ia(c) is translated with walk $w_I$.
\[ /malf(c) \wedge /ff(c,x) \;\rightarrow\; pma_I(c,x) = \begin{cases} ia(c) & mode(c) = 0 \\ tma(ia(c), w_I) & mode(c) = 1 \end{cases} \]
The instruction to be executed is then fetched from $pma_I$.
\[ I(c,x) = c.m_4(pma_I(c,x)) \]
Now we write all functions f(c) which depend only on the instruction I(c) as
\[ f(c) = f'(I(c)) \]
and generalize them to
\[ f(c,x) = f'(I(c,x)). \]
This covers function fields like rd, predicates like addi or ls, addresses like xad, and also some
interrupt event signals like sysc. For any such function f we write the old definition of f(c)
as a function f′ of I(c). Substituting in subsequent definitions of functions F(c,eev) these
functions f(c) by functions f(c,x) we obtain properly generalized versions F(c,x) of these
functions as long as memory or new instructions are not involved. For instance:
\begin{align*}
A(c,x) &= c.gpr(rs(c,x)) \\
ea(c,x) &= A(c,x) +_{32} sxtimm(c,x).
\end{align*}
Illegal Interrupt
In user mode instructions flush and invlpg are also illegal.
\begin{align*}
ill(c,x) \equiv{} & undefined(c,x) \;\vee \\
& mode(c) \wedge eret(c,x) \;\vee \\
& mode(c) \wedge (flush(c,x) \vee invlpg(c,x)) \;\vee \\
& mode(c) \wedge move(c,x) \wedge /(movg2s(c,x) \wedge xad(c,x) = 8_5)
\end{align*}
Overflow and Misalignment on Memory Operation
The two definitions below are repeated for completeness of presentation.
\begin{align*}
ovf(c,x) &\equiv alu(c,x) \wedge alu.ovf(c,x) \\
malm(c,x) &\equiv mop(c,x) \wedge (d(c,x) \nmid \langle ea(c,x) \rangle)
\end{align*}
Faults on Memory Operation
The translation request for the effective address is
\[ trq_E(c,x) = (ea(c,x).pa, \; 01 \circ (s(c,x) \vee cas(c,x))). \]
If the machine is executing a memory operation in translated mode we require walk $w_E$ to be a
walk from the TLB matching translation request $trq_E(c,x)$.
\[ mode(c) \wedge mop(c,x) \;\rightarrow\; match(trq_E(c,x), w_E) \wedge w_E \in c.tlb \]
We get a page fault or general-protection fault on memory operation if translation of this
request with the chosen walk $w_E$ produces a fault.
\begin{align*}
pfm(c,x) &\equiv mode(c) \wedge pfault(trq_E(c,x), w_E) \\
gfm(c,x) &\equiv mode(c) \wedge gfault(trq_E(c,x), w_E)
\end{align*}
Analogous to above we abbreviate:
\[ fm(c,x) = pfm(c,x) \vee gfm(c,x). \]
Memory Operation
Execution of a memory operation is only possible in the absence of misalignment and, in
translated mode, of any fault on memory operation. Then we can define the physical memory
address $pma_E$ for data memory access. In system mode it is ea(c,x). In user mode ea(c,x) is
translated with walk $w_E$.
\[ /malm(c,x) \wedge /fm(c,x) \;\rightarrow\; pma_E(c,x) = \begin{cases} ea(c,x) & mode(c) = 0 \\ tma(ea(c,x), w_E) & mode(c) = 1 \end{cases} \]
Next we split cases on the memory operation performed. For loads (l(c,x)) we have
\[ fill(c,x) = \begin{cases} 0 & u(c,x) \\ c.m(pma_E(c,x) +_{32} (d(c,x)-1)_{32})[7] & \text{otherwise} \end{cases} \]
\[ lres(c,x) = fill(c,x)^{32 - 8 \cdot d(c,x)} \circ c.m_{d(c,x)}(pma_E(c,x)). \]
For stores (s(c,x)) we have
\[ c'.m(y) = \begin{cases} byte(i, c.gpr(rt(c,x))) & y = pma_E(c,x) +_{32} i_{32} \wedge i < d(c,x) \\ c.m(y) & \text{otherwise.} \end{cases} \]
Finally, in case of CAS (cas(c,x)), we have
\[ lres(c,x) = c.m_4(pma_E(c,x)) \]
\[ c'.m(y) = \begin{cases} byte(i, c.gpr(rt(c,x))) & castest(c,x) \wedge y = pma_E(c,x) +_{32} i_{32} \wedge i < 4 \\ c.m(y) & \text{otherwise.} \end{cases} \]
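The load, store, and CAS cases can be illustrated on a byte-addressable memory modeled as a Python dictionary. The sketch below makes several assumptions: absent bytes read as 0, byte i of a word is its i-th least significant byte (which is what taking the fill bit from the byte at address pma + d − 1 suggests), the outcome of castest is passed in as a boolean, and all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def load(m, pma, d, unsigned):
    """Load result: d bytes starting at pma, extended to 32 bits with fill bits."""
    value = sum(m.get((pma + i) & MASK32, 0) << (8 * i) for i in range(d))
    top_byte = m.get((pma + d - 1) & MASK32, 0)
    fill = 0 if unsigned else (top_byte >> 7) & 1
    if fill:
        value |= (MASK32 << (8 * d)) & MASK32   # 32 - 8*d copies of the fill bit
    return value


def store(m, pma, d, word):
    """Return a new memory in which the low d bytes of word are written at pma."""
    new = dict(m)
    for i in range(d):
        new[(pma + i) & MASK32] = (word >> (8 * i)) & 0xFF
    return new


def cas(m, pma, word, castest):
    """Return (load result, new memory); the store happens only if castest holds."""
    old = load(m, pma, 4, unsigned=True)
    return old, (store(m, pma, 4, word) if castest else m)
```

Loading the single byte 0xFF as a signed quantity gives 0xFFFFFFFF, while an unsigned load of the same byte gives 0x000000FF.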
Instruction Execution without invlpg, flush, or reset
Let
\[ \delta_M : K_M \times \mathbb{B}^2 \to K_M \]
be the ISA transition function without address translation. From this we obtain
\[ \delta_x : K_M \times \mathbb{B}^2 \to K_M \]
by replacing in the ISA specification all functions f(c) that have been generalized or changed
by f(c,x). Then
\begin{align*}
(c'.core, c'.m) &= \delta_x(c.core, c.m, e \circ 0) \\
c'.tlb &= c.tlb.
\end{align*}
Flush or invlpg
For the time being we use a single address space and no address space identifiers (ASIDs).
Execution of a flush instruction in system mode flushes the TLB.
\[ flush(c,x) \;\rightarrow\; c'.tlb = \emptyset \]
For technical reasons, the definition of intermediate result C is extended: in case of invlpg
instructions we include B(c,x).pa in the lower bits of the intermediate result C. This is the page
address whose translation in the TLB will be invalidated by the instruction. If we used
ASIDs, we could store them in the upper 12 bits of C.
\[ C(c,x) = \begin{cases} S(c,x) & movs2g(c,x) \\ B(c,x) & movg2s(c,x) \\ 0^{12} \circ B(c,x).pa & invlpg(c,x) \\ su.res(c,x) & su(c,x) \\ linkad(c,x) & jal(c,x) \vee jalr(c,x) \\ alu.res(c,x) & \text{otherwise} \end{cases} \]
Execution of an invlpg instruction in system mode removes from the TLB the walks translating
the page address given by the GPR register specified by the rt field. It also removes
all incomplete walks.
\[ invlpg(c,x) \;\rightarrow\; c'.tlb = \{ w \in c.tlb \mid (w.a \neq C(c,x)[19:0]) \wedge complete(w) \} \]
In both cases the register files and the memory are not changed and the program counters are
updated.
\begin{align*}
Z \in \{gpr, spr, m\} &\rightarrow c'.Z = c.Z \\
c'.pc &= c.pc +_{32} 4_{32} \\
c'.dpc &= c.pc \\
c'.ddpc &= c.dpc
\end{align*}
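Because the TLB is modeled as a set of walks, flush and invlpg are simple set operations. The following Python sketch illustrates them; the tuple type Walk with a numeric level field (0 meaning complete) and the function names are assumptions made for the example.

```python
from collections import namedtuple

# a: page address, level: extensions still required (0 = complete),
# ba: base address, r: remaining rights, f: fault bit
Walk = namedtuple("Walk", "a level ba r f")


def flush(tlb):
    """A flush instruction empties the TLB."""
    return frozenset()


def invlpg(tlb, page_address):
    """Keep only complete walks that translate a different page address."""
    return frozenset(w for w in tlb
                     if w.a != page_address and w.level == 0)
```

Note that invlpg also drops incomplete walks for other page addresses, exactly as the set comprehension in the specification does.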

Part II
Nested Address Translation (NAT)

3 Introduction and Specification
In this chapter we develop a more general scheme of address translation. This new scheme
is introduced in order to enhance the virtualization capabilities of our machine and gain hardware
support for so-called “hypervisors”, a special kind of kernel which we present later. In a
nutshell, we add one more layer of virtual memory. This allows the machine to run in three
different translation modes:
• untranslated: same as system mode before,
• translated: same as user mode before, and
• nested-translated: newly added.
As a result, user programs can now run translated. In the new translation mode, the machine
fetches code and accesses memory after two phases of translation:
i) translation of the instruction or effective address to the level of user addresses, and
ii) ordinary translation (as before) of the user level address to the physical address.
An important detail, which might be hard to notice at first glance, is that the first phase
of translation itself utilizes several rounds of translation, which are essentially of the same type
as in translation phase two. This overhead is simply due to the fact that one cannot access
memory directly in the first phase, since now, with an extra layer of virtual memory, the page
tables accessed during walking in the first phase are filled with user level (virtual) addresses.
Thus, in order to get to the physical addresses, one has to perform quite a few intermediate
ordinary translations, which is known as quadratic walking [Meg12]. How many and what
kind of accesses we perform for this so-called nested translation we describe in detail later.
However, already at this point one can think of nested translation as a more general
translation process in which the memory accesses are in turn preceded by translation
calls. We try to reflect this idea in our hardware descriptions to strengthen the intuition.
As for any newly added mechanism, we start with an informal description to provide some
background and then formalize the new concepts.
3.1 Virtualization
Aiming at virtualization support for hypervisors, from the variety of software and hardware
techniques available for virtualization [Age09], we implement virtual address
translation, as the most crucial one, together with the mechanism of intercepts. Since
hypervisors require two phases of address translation, whereas the hardware might provide only one
phase, the second phase of translation can be implemented in software [Kov13], usually with
the help of data structures called shadow page tables. In contrast to [Kov13], we implement
the second phase of address translation in hardware.
3.1.1 Translation Modes
The first thing we need in order to increase the number of supported translation modes is an
additional
• 32-bit data field to hold the origin of one extra page table structure, and
• 1-bit control flag to keep track whether the new mode is activated.
Conveniently, we make room for these data in the special purpose register file, and use
registers number 11 and 12 resp. to store the additional page table origin and mode. For
technical reasons (to back up the new mode on interrupts) we also require one more exception
register. As usual, it is taken from the SPR, where it gets number 13. For the newly involved
registers we choose the following synonyms:
\begin{align*}
c.npto &= c.spr(11_5) \\
c.nmode &= c.spr(12_5) \\
c.enmode &= c.spr(13_5).
\end{align*}
Now we have to adjust our notation from the previous chapters. There, using bit 0 of the
mode register
\[ mode(c) = c.mode[0] \]
we distinguished between the user mode (bit set) and the system mode (not set), and the user
mode was the only translated mode. In the prior versions of our specification it was a
reasonable and elegant solution to keep track of the translation mode using the
predicate mode(c). Since the number of modes increases, we can no longer use this predicate.
Instead, we introduce three new predicates, one per mode, and finally fix the names. We
redefine
\[ mode(c) = c.nmode[0] \circ c.mode[0] \]
and abbreviate
\begin{align*}
user(c) &\equiv mode(c) = 11 \\
guest(c) &\equiv mode(c) = 01 \\
host(c) &\equiv mode(c) \in \{10, 00\}.
\end{align*}
As the naming hints, the user mode now stands for the newly added mode of nested
translation, while the guest mode and the host mode stand for the ordinary translated mode (old
user mode) and the untranslated mode (old system mode) resp.
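The new mode predicates depend only on bit 0 of the registers nmode and mode. A small Python sketch of the decoding, with the registers modeled as integers and the function names chosen for the example:

```python
def mode_bits(nmode, mode):
    """Two-bit mode: nmode[0] concatenated with mode[0]."""
    return ((nmode & 1) << 1) | (mode & 1)


def privilege_level(nmode, mode):
    """Name of the translation mode / privilege level."""
    bits = mode_bits(nmode, mode)
    if bits == 0b11:
        return "user"      # nested translation
    if bits == 0b01:
        return "guest"     # ordinary translation
    return "host"          # untranslated, bits 10 or 00
```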
3.1.2 Intercept Mechanism
Together with the additional translation modes we introduce new privilege levels
for program execution. In our case, the resulting number of these levels equals the number of
supported translation modes, and for referring to a certain privilege level we use the name that
the corresponding translation mode has. In order to control an execution at a less privileged
level, at a more privileged level one should be able to specify what is allowed to happen
without pausing/aborting (an intercept) that execution. Following this intuition we describe
things formally.
Let x be the machine’s input for a processor step, as defined in Chap. 2. The new semantics of
jisr and eret are defined as follows. The jisr can now occur at any of the three privilege levels
and, depending on context, switches the machine to the privilege level of host or guest.
\[ jisr(c,x) \wedge (user(c) \wedge /icpt(c,x)) \;\rightarrow\; c'.nmode[0] = 0 \]
\[ jisr(c,x) \wedge (host(c) \vee icpt(c,x)) \;\rightarrow\; c'.mode[0] = 0 \]
Note, we specify the change of c.mode in a slightly redundant way: jisr at the level of host
should not cause any changes, whereas we clear the c.mode[0]. That is not a problem since the
c.mode[0] is zero in this case anyway.
Table 10: Changes of the mode on interrupts (JISR) and returns from exceptions (ERET)

JISR     user (11)    guest (01)   host (10)   host (00)
/icpt    guest (01)   guest (01)   host (10)   host (00)
icpt     host (10)    host (00)    host (10)   host (00)

ERET     user (11)    guest (01)   host (10)   host (00)
/expt    -            guest (01)   host (10)   host (00)
expt     -            user (11)    user (11)   guest (01)

[Fig. 9: Partitioning of universal addresses (fields vm, pr, px2, px1, po; labels upa, uva, paas, vaas)]
On eret, which is now allowed at the level of guest as well, the machine normally switches
back to the privileged level it was running at before the interrupt.1
eret(c,x)∧ /host(c) → c′.nmode = c.enmode
eret(c,x)∧host(c) → c′.mode = c.emode
While we manage to keep most of the old specifications, we obtain a variety of new machine
behaviors. Table 10 collects all possible cases of the mode change. On jisr we either
drop from the level of user or guest to the level of guest or host, depending on the intercept
signal, or the current mode stays unchanged. Note that on jisr at most one of the bits of
mode(c) ∈ $\mathbb{B}^2$ changes. To describe the change of mode on eret, we introduce the
exception predicate:
\[
expt(c) \equiv
\begin{cases}
c.emode[0] & host(c) \\
c.enmode[0] & guest(c).
\end{cases}
\]
The eret instruction — depending on the current mode and the exception — brings us to the
level of guest or user, or the current mode stays unchanged. An interesting detail is that from
the level of user we might drop directly to the level of host on jisr, and importantly, return
directly to the level of user with eret.
For the privilege levels of user and guest we specify the signal icpt(c,x) in Sect. 3.3.4. At the
most privileged level of host any computational behavior is allowed, and therefore nothing is
intercepted.
host(c) → icpt(c,x) = 0
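The mode changes of Table 10 can be replayed with a small executable sketch. It is only an illustration under assumptions: the mode is modelled as the bit pair (nmode[0], mode[0]) in the order used in Table 10, only bit 0 of the exception registers is modelled, and the function names are hypothetical.

```python
# Mode as the bit pair (nmode[0], mode[0]), in the order used in Table 10.
USER, GUEST, HOST_10, HOST_00 = (1, 1), (0, 1), (1, 0), (0, 0)


def level(mode):
    """Name of the privilege level encoded by the bit pair."""
    nm, m = mode
    if m == 0:
        return "host"
    return "user" if nm == 1 else "guest"


def jisr(mode, icpt):
    """Mode after a jump to the interrupt service routine (rows JISR of Table 10)."""
    nm, m = mode
    if level(mode) == "host":
        return mode            # nothing is intercepted at the level of host
    if icpt:
        return (nm, 0)         # intercepted: drop to host, nmode[0] is kept
    if level(mode) == "user":
        return (0, m)          # not intercepted: user drops to guest
    return mode                # not intercepted at guest: mode unchanged


def eret(mode, emode0, enmode0):
    """Mode after a return from exception (rows ERET of Table 10).

    Assumption: only bit 0 of the exception registers emode/enmode is modelled.
    """
    nm, m = mode
    lvl = level(mode)
    if lvl == "host":
        return (nm, emode0)    # host restores mode from emode
    if lvl == "guest":
        return (enmode0, m)    # guest restores nmode from enmode
    raise ValueError("eret is not available at the level of user")
```

For example, an intercepted interrupt at the level of user followed by an eret with emode[0] = 1 returns directly to the level of user: eret(jisr(USER, True), 1, 0) evaluates to USER.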
3.1.3 Universal Addressing
In a complex computer system, where multiple users run under the control of their operating
system(s), virtual memory is a key technique for separating and isolating accesses
to the single, conceptually shared memory of the computer. Combined with paging and swap
memory, virtual memory also allows programs to access a memory larger
than the physically installed one [Hil05]. Normally, user programs running under memory
virtualization cannot communicate, unless special effort is put into implementing data
exchange mechanisms, and each program sees a memory as large as it can address using
the provided ISA.
1 On eret, the machine does not necessarily switch to the previous mode (the mode in which the
interrupt occurred). Recall that the mode is now determined by a combination of values coming from
two different registers. On return from exception one of these registers is simply restored from the
corresponding exception register. This, of course, does not mean that the value of the latter register was
preserved during the interrupt handling.
[Fig. 10: Address space of the universal page address upa, laid out along the fields upa.vm, upa.pr and upa.pa. The figure marks AH (vm = 0000, pr = 0000 0000), AG(1) (vm = 0001, pr = 0000 0000) and AU(1,1) (vm = 0001, pr = 0000 0001).]
Since user addresses are not unique by nature, they are subject to address translation,
the result of which depends on the unique address space ID (ASID) under which the
program is running. This ID consists of the virtual machine ID as.vm (4 bits) and the process
ID as.pr (8 bits), which are extracted directly from the processor core. The virtual byte address
together with the ASID forms a universal virtual address, which we typically denote by
uva ∈ $\mathbb{B}^{44}$. Since address translation mostly operates on page addresses, we
introduce universal page addresses as well. These addresses are compositions of the
ASID and the regular page address, and we typically denote them by upa ∈ $\mathbb{B}^{32}$.
For convenience we partition the universal addresses in several ways as depicted in
Fig. 9. The corresponding definitions are obvious and are therefore omitted.
As a convention, we reserve the virtual machine ID (VMID) consisting of all zeros for pro-
grams which run at the privilege level of host. At this level the process ID (PRID) is ignored,
but for simplicity we assume it to be zero as well. Conversely, for programs which run at the
level of guest the VMID is taken from SPR register c.mode, whereas the PRID consisting of
all zeros is reserved. Lastly, for programs at the level of user the VMID and PRID are both
taken from the SPR registers, c.mode and c.nmode resp. Following this convention and using
the shorthands
vmid(c) = c.mode[31 : 28]
prid(c) = c.nmode[31 : 24]
we finally can define the ASID formally:
\[
asid(c) =
\begin{cases}
vmid(c) \circ prid(c) & user(c) \\
vmid(c) \circ 0^{8} & guest(c) \\
0^{12} & host(c).
\end{cases}
\]
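As an illustration, the ASID definition can be evaluated on 32-bit register values. This is a minimal sketch: it assumes that bit 0 of c.mode and c.nmode selects the level as in Table 10 (mode[0] = 1 for translated, additionally nmode[0] = 1 for nested translation), and the helper names are hypothetical.

```python
def bits(word, hi, lo):
    """Bit field word[hi:lo] of a 32-bit register value."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def asid(mode, nmode):
    """12-bit ASID (vm . pr) computed from the SPR values c.mode and c.nmode."""
    vmid = bits(mode, 31, 28)     # vmid(c) = c.mode[31:28]
    prid = bits(nmode, 31, 24)    # prid(c) = c.nmode[31:24]
    if mode & 1 == 0:             # host: ASID is all zeros
        return 0
    if nmode & 1 == 0:            # guest: process ID is all zeros
        return vmid << 8
    return (vmid << 8) | prid     # user: both fields are taken from the SPRs
```

With VMID 3 and PRID 5, the three levels yield 0x305 (user), 0x300 (guest) and 0x000 (host).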
Having introduced the concepts above, one naturally arrives at the following partition of the
address space of universal page addresses. All addresses whose vm and pr fields are both all
zeros fall into the address space of host, which we denote by AH. Addresses with non-zero vm
field and zero pr field fall into the address space of guest number 〈vm〉, which we denote by
AG(〈vm〉). Addresses with non-zero vm and non-zero pr fields fall into the address space of
user number (〈vm〉,〈pr〉), which we denote by AU(〈vm〉,〈pr〉). Formally we
define for i, j > 0:
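\[
\begin{aligned}
A_H &= \{\, upa \in \mathbb{B}^{32} \mid upa.(vm, pr) = 0^{12} \,\} \\
A_G(i) &= \{\, upa \in \mathbb{B}^{32} \mid upa.(vm, pr) = i_4 \circ 0^{8} \,\} \\
A_U(i, j) &= \{\, upa \in \mathbb{B}^{32} \mid upa.(vm, pr) = i_4 \circ j_8 \,\}.
\end{aligned}
\]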
Recall that we do not consider addresses with zero vm and non-zero pr fields (see Fig. 10). For
convenience we introduce abbreviations for the unions of the user and guest address spaces.
\[
A_U(i) \equiv \bigcup_{j} A_U(i, j)
\qquad
A_U \equiv \bigcup_{i} A_U(i)
\qquad
A_G \equiv \bigcup_{i} A_G(i)
\]
Also — in order to save brackets — we directly pass the bit-strings which encode the user/guest
numbers (instead of the user/guest numbers) to specify the address space.
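The partition of universal page addresses can be illustrated by a small classifier. The sketch assumes the field order vm (4 bits), pr (8 bits), pa (20 bits) from the most significant bit downwards, as suggested by Fig. 9; the function name and the tuple encoding of the result are hypothetical.

```python
def classify(upa):
    """Address space of a 32-bit universal page address.

    Returns ("AH",), ("AG", i) or ("AU", i, j); None marks the addresses
    with zero vm and non-zero pr field, which are not considered.
    """
    vm = (upa >> 28) & 0xF
    pr = (upa >> 20) & 0xFF
    if vm == 0 and pr == 0:
        return ("AH",)
    if vm != 0 and pr == 0:
        return ("AG", vm)
    if vm != 0:
        return ("AU", vm, pr)
    return None
```

Every universal page address falls into at most one of the sets, so the sets AH, AG(i) and AU(i, j) are pairwise disjoint.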
3.2 Introduction to NAT
We would like to reuse as much as possible of the previously developed formalism. Fortunately
there are very few things that change. Before specifying the mechanism of nested translation
formally, we first introduce some basic ideas and machinery in this preliminary section. Start-
ing with changes to the old concepts, we then gradually present what is necessary to form a
background for our work.
3.2.1 Ideas behind Nested Translation
Just as the ordinary translation (see Fig. 11) from Sect. 2.4, the nested translation (see Fig. 12)
is defined as a process of consecutive walking over the tree-like structure of page tables. These
page tables are of the same kind and organized the same way as the prior page tables, but the
only and crucial difference is — as we already mentioned — that they are filled with virtual
(not physical) addresses. Therefore, for translation of these virtual addresses one needs to
incorporate another (ordinary) page table structure, meaning that for the nested translation one
needs to use two page table structures simultaneously: a user page table for translation of
user addresses to the level of guest, and a guest page table for (ordinary) translation of guest
addresses to the level of host.2 The origins of these two structures (addresses of their root
tables) we keep in special purpose registers c.npto and c.pto resp. for the user and the guest
page tables, details of which we already elaborated previously (see Sect. 2.4.1). Note, the
address stored in register c.npto is an address from the guest address space, which requires
(ordinary) translation like any other virtual address.
When the machine runs at a privilege level below the level of host (i.e., at the level of guest
or user), the special purpose registers involved in the process of address translation must have
proper values. At the level of guest this amounts to setting the virtual machine field of register
c.mode to the guest’s actual VMID, and loading the corresponding page table origin into register
c.pto. At the level of user one additionally needs to set the process field of register c.nmode
to the user’s actual PRID and load the corresponding page table origin into register c.npto.
Under this setting the translation mechanisms, which we describe next, can properly serve
universal translation requests, i.e., translation requests containing universal page addresses
instead of ordinary ones.
We define this new flavor of translation requests formally by replacing the address
trq.a ∈ $\mathbb{B}^{20}$ of translation request trq by the universal page address
trq.upa ∈ $\mathbb{B}^{32}$. For the new requests we introduce the following (obvious) shorthands:
trq.as = trq.upa.as
trq.pa = trq.upa.pa.
For referring to the lower-level components we can use the naming convention of [LOP], e.g.,
trq.vm = trq.as.vm.
2 Naturally, one needs to maintain a page table structure for every guest, and a page table structure for
every user of that guest.
In a similar fashion we need to update all affected notation. For walks w ∈ Kwalk from
Sect. 2.4.2 we replace the address component w.a ∈ $\mathbb{B}^{20}$ by the universal page address
w.upa ∈ $\mathbb{B}^{32}$. Moreover, since in the process of nested translation page faults can occur
at two different levels (user and guest), we extend the walks by an extra “fault” bit,
which we attach at the right end. The resulting universal walks w ∈ Kuwalk have the following
components:
• w.upa ∈ $\mathbb{B}^{32}$: the universal page address to be translated,
• w.ℓ ∈ {100,010,001}: the number (in one-hot encoding) of walk extensions still required
to translate address w.upa,
• w.ba ∈ $\mathbb{B}^{20}$: the base address of the page table to be walked next (if the walk is not
faulty and not yet complete) or of the data page (if the walk is not faulty and complete),
• w.r ∈ $\mathbb{B}^{3}$: the access rights still remaining, and
• w.f ∈ $\mathbb{B}^{2}$: the fault bits indicating that walk extension is no longer possible.
The fault bits are given intuitive names:
w.fu = w.f[1] = w[1]
w.fg = w.f[0] = w[0].
Universal walks which have at least one of the latter bits set we call faulty.
Definition 1 (Faulty universal walk). For w ∈ Kuwalk
f(w) ≡ w.fu ∨ w.fg
Universal walks which have at most one of the latter bits set are called well-formed.
Definition 2 (Well-formed universal walk). For w ∈ Kuwalk
wfu(w) ≡ (w.fu + w.fg ≤ 1)
For the newly obtained walks we abbreviate:
w.as = w.upa.as
w.pa = w.upa.pa.
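The components of a universal walk and the two definitions above can be mirrored by a small record type. This is a sketch only: the class name UWalk, the use of Python integers for bit strings and the string representation of the one-hot level are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UWalk:
    upa: int    # 32-bit universal page address: vm (4) . pr (8) . pa (20)
    level: str  # one-hot level, one of "100", "010", "001"
    ba: int     # 20-bit base address
    r: int      # 3 bits of remaining access rights
    fu: int     # user-level fault bit, w.f[1]
    fg: int     # guest-level fault bit, w.f[0]

    @property
    def as_(self):
        """w.as: the 12-bit ASID part of the universal page address."""
        return self.upa >> 20

    @property
    def pa(self):
        """w.pa: the 20-bit page address part."""
        return self.upa & 0xFFFFF


def faulty(w):
    """Definition 1: at least one fault bit is set."""
    return bool(w.fu or w.fg)


def well_formed(w):
    """Definition 2: at most one fault bit is set."""
    return w.fu + w.fg <= 1
```

A walk with exactly one fault bit set is faulty and well-formed at the same time; only the combination fu = fg = 1 is excluded by well-formedness.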
Universal walks with zero pr field are called guest walks whereas universal walks with non-
zero pr field are called user walks. Guest walks provide translations from the address space
of guest to the address space of host, and user walks translate from the address space of user
to the address space of guest. In what follows we typically represent the guest walks by wg
resp. the user walks by wu.
Since a number of intermediate results are used in the process of nested translation, we need
some more notation. For a pair of universal walks, wu and wg below, we define a predicate
to express that the walks match, i.e., one of the walks is a (guest) walk of guest i, the other one
is a (user) walk of user j of guest i, and the base address of the user walk equals the virtual
address of the guest walk.
\[
match(w_u, w_g) \equiv w_u.ba = w_g.pa \;\land\;
\exists i > 0 : \bigl( w_g.upa \in A_G(i) \;\land\; \exists j > 0 : w_u.upa \in A_U(i, j) \bigr)
\]
We call such walks matching and, for convenience, we use the following notation to
abbreviate the corresponding predicate.
wu ↔ wg ≡ match(wu,wg)
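The match predicate can be checked mechanically on concrete walks. In this sketch walks are plain dictionaries with the keys "upa" and "ba", and the field layout vm (4 bits), pr (8 bits), pa (20 bits) is assumed as in Fig. 9; both choices are illustrative only.

```python
def fields(upa):
    """Split a 32-bit universal page address into (vm, pr, pa)."""
    return (upa >> 28) & 0xF, (upa >> 20) & 0xFF, upa & 0xFFFFF


def match(wu, wg):
    """wu and wg are dicts with keys 'upa' (32 bits) and 'ba' (20 bits)."""
    vm_u, pr_u, _ = fields(wu["upa"])
    vm_g, pr_g, pa_g = fields(wg["upa"])
    is_guest_walk = vm_g > 0 and pr_g == 0       # wg.upa in AG(i) for some i > 0
    is_user_walk = vm_u == vm_g and pr_u > 0     # wu.upa in AU(i, j) for some j > 0
    return wu["ba"] == pa_g and is_guest_walk and is_user_walk
```

A user walk of user 5 of guest 1 whose base address is 0x42 matches a guest walk of guest 1 for the page address 0x42, but it matches no walk of a different guest.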
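[Fig. 11: Process of simple translation. Labels: levels guest and host, the data page, address a with indices a.px2 to a.px0, and the guest walk wg.]
[Fig. 12: Process of nested translation. Labels: levels user, guest and host, the data page, address a, the origin npto, indices a.px2, a.px1 and a.px0, the user walks wu, w′u and w′′u, and the guest walks w1g, w2g and w3g.]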
Level of Matching Walks
Recall that walks defined in Sect. 2.4.2 have dedicated components to track the current level,
faultiness and rights they provide. We extend these properties in a natural way to pairs of
(matching) walks. We start with the level of a pair, which we introduce as a ternary number
(from zero to eight) formed from the individual levels of walks composing the pair. Formally,
for walks wu and wg we define
ℓ(wu,wg) = level(wu) ◦ level(wg).
Clearly, the process of nested translation is defined for the pairs of matching walks. Depending
on the level of matching walks, we translate — with the ordinary walk extension — through
one of the walks composing the pair. At levels 6 and 3 we translate through the user walk (wu
/ w′u) using the completed guest walk (w1g and w2g resp.), otherwise we translate through the
guest walk. At level 0 a memory access is performed through the completed user walk (w′′u)
using the completed guest walk (w3g). Schematically the process of nested address translation
is depicted in Fig. 12.
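The schedule of the nested translation can be replayed from the pair level alone. The sketch below assumes that the one-hot levels 100, 010 and 001 stand for 2, 1 and 0 remaining extensions, respectively; the function names are hypothetical.

```python
# Assumed decoding of the one-hot level: number of extensions still required.
ONE_HOT = {"100": 2, "010": 1, "001": 0}


def pair_level(level_u, level_g):
    """Level of a pair of walks as a ternary number between 0 and 8."""
    return 3 * ONE_HOT[level_u] + ONE_HOT[level_g]


def next_action(pair_lvl):
    """Which translation step is performed at the given pair level."""
    if pair_lvl == 0:
        return "memory access"
    if pair_lvl % 3 == 0:          # levels 6 and 3: the guest walk is complete
        return "extend user walk"
    return "extend guest walk"
```

Levels 8 down to 3 form the translation phase, in which the user walk is extended exactly twice (at levels 6 and 3); levels 2 down to 0 form the access phase.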
Faultiness of Matching Walks
Even if two universal walks are matching, we can use them for the nested translation only if
they are not faulting. Here we define the faultiness for arbitrary pairs of walks, but later we
use it to argue only about matching pairs. Again referring to Fig. 12, we conveniently split
the process of nested translation into two phases: the translation phase (at levels from 8 to 3) and
the access phase (at levels from 2 to 0). In the translation phase the user walk is either waiting to
be extended or is being extended (at levels 6 and 3) using the guest walk.3 In either case the
user walk is not yet completed, and we define the pair of walks to be faulting if i) one of the
walks is faulty or ii) the guest walk does not have rights sufficient to read the user’s page table
in host memory or to perform the user access.4 To formalize the second case we introduce a
conditional faultiness which we define as
\[
f(w_g \mid w_u) \equiv
\begin{cases}
w_g.r \not\geq 010 & /complete(w_u) \quad \text{[PT access]} \\
w_g.r \not\geq w_u.r & complete(w_u) \quad \text{[user access].}
\end{cases}
\]
3 Recall, we extend the user walk only after the current guest walk is completed.
4 The latter situation occurs in practice, for instance, if the user’s page table was placed into a memory
region where the guest does not have read permissions.
In the translation phase — when the user walk is not yet completed — the guest walks are used
to read the user’s page tables, i.e., the read rights suffice. In the access phase either the current
guest walk is being extended (at levels 2 and 1) or the final memory access is performed.
Anyway, what we care about in this phase are the rights for the memory access which is not a
read of the user’s page table anymore, but an actual memory access from the user’s program.
Using the conditional faultiness we formalize the faultiness of a pair (wu,wg) of walks as
f (wu,wg) ≡ f (wu)∨ f (wg)∨ f (wg | wu).
With a simple software condition on the content of guests’ page tables (via granting the full
rights), one could allow guests to manage their address spaces on their own. That would
simplify programming on the level of guests (by dropping the limitations on the memory usage)
and our definitions as well (by setting wg.r above to all ones).
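The faultiness of a pair of walks can be sketched on plain integers. The sketch assumes that rights are compared bit-wise (a ≥ b holds iff every bit set in b is also set in a) and that 010 encodes the read right needed for page table accesses; the function names are hypothetical.

```python
def geq(a, b):
    """Bit-wise comparison of 3-bit right vectors (assumed meaning of a >= b)."""
    return (a & b) == b


def cond_fault(wg_r, wu_r, wu_complete):
    """f(wg | wu): the guest rights do not suffice for the pending access."""
    needed = wu_r if wu_complete else 0b010   # user access vs. page table read
    return not geq(wg_r, needed)


def pair_fault(f_wu, f_wg, wg_r, wu_r, wu_complete):
    """f(wu, wg): one walk is faulty or the conditional faultiness holds."""
    return f_wu or f_wg or cond_fault(wg_r, wu_r, wu_complete)
```

A guest walk with rights 010 is sufficient while the user walk is still incomplete, but it faults in the access phase if the user walk carries rights 111.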
3.2.2 Composition of Walks
Just as in the case of ordinary translation, to access a virtual memory location at the level
of user one has to perform the nested translation of the (virtual) user address. This nested
translation results in a pair of matching complete walks, one user walk and one guest walk,
which together provide the translation for the given user address. After these walks are placed
into the specification TLB, they can be used by the stepping function, e.g., in the processor core
steps. Note that the specification TLB now stores walks of both types: user walks together
with guest walks. We describe this formally in Sect. 3.3, where we give the complete
specification of our new machine.
Intuitively, when aiming at hardware acceleration of the process sketched above, one would
try to reduce the number of walks involved, which essentially determines the number of TLB
look-ups necessary to obtain the physical address. A possible solution is to store in the
hardware TLB, apart from the ordinary guest walks, special walks which translate
user addresses directly (!) to physical addresses. Though simple to implement, this
solution requires us to introduce some more notation, so that we can formally argue about this
new type of walks.
We start with the notion of composition of two walks. We denote this operation by overloading
the symbol “◦” and formally define it for walks wu and wg as follows:
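\[
w_u \circ w_g =
\begin{cases}
w & w_u \leftrightarrow w_g \\
\bot & \text{otherwise}
\end{cases}
\]
where
\[
\begin{aligned}
w.upa &= w_u.upa \\
w.\ell &= w_u.\ell \\
w.ba &= w_g.ba \\
w.r &= w_u.r \\
w.f_u &= f(w_u) \\
w.f_g &= f(w_g) \lor f(w_g \mid w_u).
\end{aligned}
\]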
In essence, we transform the user walk such that its base address (a virtual address) is remapped to
the base address of the guest walk (a physical address). Note that the rights bits w.r and the
fault bits w.f are chosen according to the accumulated rights and the faultiness, respectively, both
defined above in Sect. 2.4.2. Motivated by the latter usage, we call walks obtained via this
operation nested, and typically represent them by wn.
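The composition of two walks can be sketched as follows. The sketch is self-contained and rests on assumptions that are not fixed in this section: a walk is complete iff its one-hot level is 001, rights are compared bit-wise, and ⊥ is represented by None; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UWalk:
    upa: int        # vm (4) . pr (8) . pa (20)
    level: str      # "100", "010" or "001"
    ba: int
    r: int
    fu: int = 0
    fg: int = 0


def vm(w): return (w.upa >> 28) & 0xF
def pr(w): return (w.upa >> 20) & 0xFF
def pa(w): return w.upa & 0xFFFFF
def complete(w): return w.level == "001"      # assumption: no extensions left
def faulty(w): return bool(w.fu or w.fg)


def match(wu, wg):
    return (wu.ba == pa(wg) and vm(wg) > 0 and pr(wg) == 0
            and vm(wu) == vm(wg) and pr(wu) > 0)


def cond_fault(wg, wu):
    needed = wu.r if complete(wu) else 0b010  # user access vs. page table read
    return (wg.r & needed) != needed


def compose(wu, wg) -> Optional[UWalk]:
    """wu o wg: the user walk with its base address remapped through wg."""
    if not match(wu, wg):
        return None                           # stands for the undefined value
    return UWalk(upa=wu.upa, level=wu.level, ba=wg.ba, r=wu.r,
                 fu=int(faulty(wu)),
                 fg=int(faulty(wg) or cond_fault(wg, wu)))
```

The nested walk keeps the address, level and rights of the user walk and takes only the base address from the guest walk; insufficient guest rights show up as a guest fault of the nested walk.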
[Fig. 13: Possible ways to compose walk wn. Labels: levels USER, GUEST and HOST, the walks wu, wg and wn, and the addresses wn.pa and wn.ba.]
In the obvious way we generalize the new operation to sets W1 and W2 of walks:
W1 ◦ W2 = {wu ◦ wg | (wu,wg) ∈ W1 × W2 ∧ wu ↔ wg}.
Directly from the definitions we derive the following trivial lemmas about the composition of
sets of walks.
Lemma 11. The composition of set W1 (of walks) with set W2 (of walks) is exactly the
composition of user walks from set W1 with guest walks from set W2:
W1 ◦ W2 = {w ∈ W1 | w.upa ∈ AU} ◦ {w ∈ W2 | w.upa ∈ AG}
Lemma 12. The composition of the union of sets W1 and W2 with set W3 is exactly the union of
the compositions of W1 with W3 and of W2 with W3:
(W1 ∪ W2) ◦ W3 = (W1 ◦ W3) ∪ (W2 ◦ W3)
Below we reserve a concise shorthand for the generalization above. In case the sets
being composed are equal, i.e., we consider a single set W of walks, we abbreviate:
W^◦ ≡ W ◦ W.
The following predicates help us to formalize the conditions under which walk extension steps are
possible. The first predicate indicates whether walk w is valid for the given set of walks W:
it is incomplete, not faulty, and belongs to W (presumably the TLB).
valid(W,w) ≡ w ∈ W ∧ /complete(w) ∧ /f(w)
The second predicate indicates whether a pair of walks (wu,wg) is valid for the given set of walks
W: the walks are matching, the pair is not faulty, and both walks from the pair are valid.
valid(W,wu,wg) ≡ valid(W,wu) ∧ wu ↔ wg ∧ valid(W,wg) ∧ /f(wu,wg)
3.2.3 Decomposition of Nested Walks
For the forthcoming section on semantics we need a kind of inverse operation to
walk composition: given a nested walk (wn), decompose it into two matching walks, namely
the user walk (wu) and the guest walk (wg). We naturally call this operation walk
decomposition. We define it for a given set W of walks (containing both user and guest walks)
which may be used for the decomposition (note that the resulting guest walk is complete by
definition of the composition operation):
u(wn,W) = ε{w ∈W | ∃wg : w◦wg = wn}
g(wn,W) = ε{w ∈W | ∃wu : wu ◦w = wn}.
Walk decomposition is inherently not unique, since the same nested walk can be composed
out of many different pairs of user and guest walks, as depicted in Fig. 13. We therefore face
the problem that the walks wu and wg into which walk wn is decomposed are not well
defined. Note that the definition above even allows walks which cannot be composed back into wn
(e.g., non-matching walks taken from different pairs).
A careful analysis shows that this situation (set W contains multiple pairs providing a
translation from wn.pa to wn.ba) only occurs in practice if there
are multiple distinct walks which translate from the address space of the same guest to the same
(physical) address wn.ba. This in turn means that multiple distinct locations in the virtual
memory of the same guest are represented by a single location in the physical memory. In particular,
if multiple guest addresses are mapped to the same physical address, the semantics of the virtual
memory becomes invalid for the level of guests.5 Nevertheless, programming in such a model is
reasonable [Hil05], but has to be carried out with caution.
Of course, the problem above can be avoided with certain (software) restrictions on the content
of the page tables, but we build our hardware to work for the general case. This forces us to
store additional data in the hardware TLB in order to preserve efficiency. These are data about
the origin of the nested walks: page addresses of the guest walks used to compose the nested
walks.
3.2.4 Overloading Notation
We finish this section with a short list of changes to old notation that we need next. Due to the
change of the walk format, we need to update those predicates which operate on walks.
The translation requests were already updated in Sect. 3.2.1. Now, for universal translation
requests trq and walks w we redefine the match predicate:
match(trq,w) ≡ (trq.upa = w.upa) ∧ (complete(w) ∨ f(w)).
Other predicates which we use in the next section (pfault and gfault) depend on match
internally, so we assume they are updated automatically. In a similar fashion we update the
definition of the translated memory address:
tma(a,w) = w.ba ◦ a.po.
Note that the latter definition is used only for complete non-faulty universal walks (w) which
translate the given universal address (a):
(a.upa = w.upa) ∧ complete(w) ∧ /f(w)
At this point our formalism is developed well enough to specify the new semantics. This
semantics, covering execution in the new translation mode, turns out to be suspiciously
similar to the original one [Sch13a]. That is not a coincidence: in order to achieve the
similarity of formulations we invested a lot into notation which allows us to hide the unnecessary
complexity inside the formalism. Choosing a simpler but bulkier notation instead would make
the text heavier and blur the intuition, which we want to preserve most. Apart from that, in
our opinion, the translation scheme we are adding is lengthy rather than complex.
3.3 MIPS ISA with NAT
Technically speaking, the machine with nested address translation is (in general) very similar
to the machine with ordinary address translation developed in Chap. 2. There is no need to
introduce new configuration components. For already existing components it is sufficient to
adjust their input alphabets and semantics of their steps. In order to save space and — what is
more important — to make definitions concise, in this section we discuss only changes to the
original specification from Sect. 2.4.
5 The same problem with the virtual memory semantics occurs for the level of users, though it does not
influence the uniqueness of the decomposition.
3.3.1 TLB Component
In the previous section, in the course of defining the process of nested translation, we
had to switch to a new format for the representation of walks (see Sect. 3.2.1). In the
new format we supplied the virtual address with a special field (ASID), which identifies the
address space to which that particular virtual address belongs,6 and an extra fault bit, to
distinguish between faults occurring at the levels of user and guest. Clearly, the TLB component has
to be changed accordingly to store walks of the universal format.
Recall that all walks of the universal format are collected into the set Kuwalk. Configurations of the
universal TLB are then taken from the powerset of Kuwalk.
\[ c.tlb \in K_{utlb} = 2^{K_{uwalk}} \]
Thus, the universal TLB component now stores walks of both types: guest walks and user
walks. For convenience we partition our new TLB according to the type of walks stored:
tlbG(c) = {w ∈ c.tlb | w.upa ∈ AG}
tlbU (c) = {w ∈ c.tlb | w.upa ∈ AU}.
As in the system mode before, at the level of host the TLB component does not perform any
steps. At the level of guest the TLB component is allowed to perform essentially the same steps
as in the former user mode: creation and extension of guest walks. Finally, at the level of
user the TLB component is allowed to perform all kinds of steps. As a result, very similar steps
are now allowed at both levels (guest and user), and to distinguish between those steps
we change the TLB portion Σtlb of the machine’s input alphabet Σ.
For walk creation steps we extend the input with one more ‘bit’ to distinguish initialization of
guest walks from initialization of user walks:
\[ \{winit\} \times \mathbb{B}^{20} \times \{guest, user\} \subset \Sigma_{tlb}. \]
Intuitively, the semantics of these steps stays unchanged: an initialized walk for the given
virtual address is included into the TLB. What changes is the initialization part: depending
on the new parameter, which states whether a guest or a user walk is created, either the guest’s
page table origin or the user’s one is chosen to initialize the walk. Moreover, the virtual address
is prepended with its address space identifier; the walk initialization function has to be
overloaded appropriately to handle universal page addresses. Below we specify the effect of this
transition formally, where x denotes the machine’s input.
\[
c'.tlb = c.tlb \cup
\begin{cases}
\{ initw(vmid(c) \circ 0^{8} \circ a,\; pto(c)) \} & x = (winit, a, guest) \\
\{ initw(vmid(c) \circ prid(c) \circ a,\; npto(c)) \} & x = (winit, a, user)
\end{cases}
\]
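The walk creation step can be sketched on a TLB modelled as a set of tuples. The function initw below is a hypothetical stand-in for the overloaded walk initialization function of Sect. 2.4: its initial level 100, full rights 111 and cleared fault bits are assumptions of this sketch, as are the function names and the tuple layout.

```python
def initw(upa, pto):
    """Hypothetical stand-in for walk initialization.

    Assumed result: (upa, level, ba, r, f) with the walk starting at the
    page table origin, with full rights and without faults.
    """
    return (upa, "100", pto, 0b111, 0b00)


def winit_step(tlb, mode, nmode, pto, npto, a, kind):
    """Effect of the input x = (winit, a, kind) on the TLB (a frozenset)."""
    vmid = (mode >> 28) & 0xF           # vmid(c) = c.mode[31:28]
    prid = (nmode >> 24) & 0xFF         # prid(c) = c.nmode[31:24]
    if kind == "guest":
        w = initw((vmid << 28) | a, pto)                   # vmid . 0^8 . a
    elif kind == "user":
        w = initw((vmid << 28) | (prid << 20) | a, npto)   # vmid . prid . a
    else:
        raise ValueError("kind must be 'guest' or 'user'")
    return tlb | {w}
```

Starting from an empty TLB, a guest step and a user step for the same page address a add two different walks, since their universal page addresses and page table origins differ.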
For walk extension steps we change the input to contain either one or two walks:
{wext}×Kuwalk× ({⊥}∪Kuwalk)⊂ Σtlb.
Assume again that x is the machine’s input.
x = (wext,w1,w2) ∈ Σtlb
The first walk passed (w1) is always the one to be extended. In case it is a guest walk, an
ordinary walk extension step is to be performed. In this case the second walk (w2) necessarily
has to be undefined.
w1.upa ∈ AG → w2 =⊥
If w1 is a user walk, again an ordinary walk extension step is to be performed, but this time
walk w1 is first composed with walk w2. In this case the second walk has to be (at least)
defined.
w1.upa ∈ AU → w2 ≠ ⊥
6 Recall, due to the virtual memory, each user or guest program operates within its own isolated address
space.
Note that both walks are expected to be in the universal walk format. Therefore the walk
extension function has to be overloaded as well to fit the new arguments’ format. The semantics
of walk extension steps is given below.
\[
c'.tlb = c.tlb \cup
\begin{cases}
\{ wext(w_1, c.m) \} & x = (wext, w_1, \bot) \\
\{ wext(w_1 \circ w_2, c.m) \} & x = (wext, w_1, w_2) \land w_2 \neq \bot
\end{cases}
\]
The first line of the definition above simply repeats the original behavior, whereas the second
line formalizes the result of the nested walk extension. In the latter case we make use of
our new notation and pass to the overloaded walk extension function the composition of two
walks. Obviously, we require those walks to be matching. For intuition we refer the reader to
the introductory Sect. 3.2.2, where we describe the process of nested translation in more detail.
Below we collect in one place all guard conditions on the TLB steps, including those we
introduced in Sect. 2.4.4 and those we informally imposed right above. To every condition
below we attach a short description.
• Steps of the TLB component are allowed at the levels of guest and user. At the level of
guest only two TLB steps are allowed: initialization of the guest walk and extension of the
guest walk. At the level of user any TLB steps are allowed.
x ∈ Σtlb → guest(c) ∨ user(c)
\[
x \in \Sigma_{tlb} \rightarrow x \in
\begin{cases}
\{ (winit, a, guest), (wext, w_1, \bot) \} & guest(c) \\
\{ (winit, a, *), (wext, w_1, w_2) \} & user(c)
\end{cases}
\]
• Only incomplete and non-faulty walks from the TLB are allowed to be extended (recall
that we always extend the first walk). This is the first condition below, which is inherited
and fits the new case of nested translation without changes. At the level of user we
allow extension of walks of the current user and guest; at the level of guest we allow
extension of walks of the current guest only. This is the second condition, which is caused
by the introduction of the ASID fields. In order to formulate it in a more intuitive way, we use
our new notation from Sect. 3.2 and the following shorthand for the current ASID:
asid(c) = i_4 ◦ j_8.
Note that if we allowed the ASID fields to be specified on walk initialization steps, we would
have to add a similar guard condition restricting those steps.
x = (wext,w1,w2) → valid(c.tlb,w1)
\[
x = (wext, w_1, w_2) \rightarrow w_1.upa \in
\begin{cases}
A_G(i) & guest(c) \\
A_G(i) \cup A_U(i, j) & user(c)
\end{cases}
\]
• On the nested translation steps — extensions of the user walks — the walk to be extended
and the walk used for extension are matching. The latter one is a complete walk from the
TLB which does not lead to a faulting composition (see Sect. 3.2.1).
x = (wext,w1,w2)→ valid(c.tlb,w1,w2)
These guard conditions apply simultaneously, i.e., all of the conditions above must be satisfied
for a TLB step to be possible. Later on, whenever we need to argue about TLB steps within a
lengthy proof, we would like to have these conditions in a more concise form. For that reason
we develop some more notation. Components of the machine’s input receive the following
names:
\[
x =
\begin{cases}
x.(t, a, l) & x \in \Sigma_{winit} \\
x.(t, w_1, w_2) & x \in \Sigma_{wext}
\end{cases}
\]
where
• .t gives the type of the performed step; obviously we have
x.t ∈ {winit,wext},
• .a gives the (virtual) page address for which a new walk is created; clearly
x.a ∈ $\mathbb{B}^{20}$,
• .l gives the level for which a new walk is created; we clearly have
x.l ∈ {guest,user},
• .w1 gives the universal walk extended in the corresponding step,
x.w1 ∈ Kuwalk,
• while .w2 gives the universal walk used for the extension of walk w1 (if not ⊥):
x.w2 ∈ {⊥} ∪ Kuwalk.
Using the notation above we can equivalently reformulate the guard conditions for TLB steps
as follows. In order to separate conditions that restrict guest walks from conditions that restrict
user walks, we introduce two dedicated predicates. Conditions restricting guest walks are
covered by
\[
T_G(c,x) \equiv /host(c) \land
\begin{cases}
x.l = guest & x.t = winit \\
x.w_1.vm = vmid(c) \land valid(c.tlb, x.w_1) \land x.w_2 = \bot & x.t = wext.
\end{cases}
\]
The remaining conditions, restricting the user walks, are covered by
\[
T_U(c,x) \equiv user(c) \land
\begin{cases}
x.l = user & x.t = winit \\
x.w_1.as = asid(c) \land valid(c.tlb, x.w_1, x.w_2) & x.t = wext.
\end{cases}
\]
As a result, the guard conditions for TLB steps are now satisfied iff they are satisfied either for
guest or for user walks.
T(c,x) = TG(c,x)∨TU (c,x)
Taking these guard conditions into account, walk initialization and walk extension functions
guarantee that for the guest and user walks stored in the TLB we resp. have
w ∈ tlbG(c) → w.fu = 0
w ∈ tlbU(c) → w.fg = 0.
The latter obviously implies that all walks stored in the TLB are well-formed.
Invariant 1.
w ∈ c.tlb → wfu(w)
3.3.2 Translation Accesses
In the process of address translation, the memory-hosted structures called page tables are tra-
versed as described in Sect. 2.4. Thus, on steps of the walk extension, i.e., when the specifica-
tion machine receives input
x = (wext,w1,w2) ∈ Σtlb,
the memory is accessed at addresses
\[
ptea(x) =
\begin{cases}
ptea(w_1 \circ w_2) & w_2 \neq \bot \\
ptea(w_1) & \text{otherwise.}
\end{cases}
\]
For the page table entries used on steps of the walk extension, according to Sect. 3.3.1, we
abbreviate
\[
pte(c,x) =
\begin{cases}
pte(w_1 \circ w_2, c.m) & w_2 \neq \bot \\
pte(w_1, c.m) & \text{otherwise}
\end{cases}
\]
and naturally obtain the following.
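Lemma 13.
\[ pte(c,x) = c.m_4(ptea(x)) \]
Proof of lemma 13.
\begin{align*}
pte(c,x) &=
\begin{cases}
pte(w_1 \circ w_2, c.m) & w_2 \neq \bot \\
pte(w_1, c.m) & \text{otherwise}
\end{cases} \\
&=
\begin{cases}
c.m_4(ptea(w_1 \circ w_2)) & w_2 \neq \bot \\
c.m_4(ptea(w_1)) & \text{otherwise}
\end{cases} \\
&= c.m_4(ptea(x)) \qquad \Box
\end{align*}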
Analogous to Sect. 2.2.3, we rephrase the read of page table entry pte(c,x) in terms of a
translation access tacc(x) to the line-addressable version ℓ(c.m) of the memory.
tacc(x).a = ptea(x).l
tacc(x).r = 1
As the output of the translation access we have
tmout(c,x) = dataout(ℓ(c.m), tacc(x)).
Using properties of the memory embedding, for the output to the translation access we derive
the following.
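Lemma 14.
\[
pte(c,x) =
\begin{cases}
tmout(c,x)_H & ptea(x)[2] \\
tmout(c,x)_L & \text{otherwise}
\end{cases}
\]
Proof of lemma 14.
\begin{align*}
pte(c,x) &= c.m_4(ptea(x)) && \text{(lemma 13)} \\
&= c.m_4(ptea(x).l \circ ptea(x)[2] \circ 00) && \text{(definition)} \\
&=
\begin{cases}
c.m_4(ptea(x).l \circ 100) & ptea(x)[2] \\
c.m_4(ptea(x).l \circ 000) & \text{otherwise}
\end{cases} \\
&=
\begin{cases}
c.m_8(ptea(x).l \circ 0^3)_H & ptea(x)[2] \\
c.m_8(ptea(x).l \circ 0^3)_L & \text{otherwise}
\end{cases} && \text{(definition)} \\
&=
\begin{cases}
\ell(c.m)(ptea(x).l)_H & ptea(x)[2] \\
\ell(c.m)(ptea(x).l)_L & \text{otherwise}
\end{cases} && \text{(definition)} \\
&=
\begin{cases}
dataout(\ell(c.m), tacc(x))_H & ptea(x)[2] \\
dataout(\ell(c.m), tacc(x))_L & \text{otherwise}
\end{cases} && \text{(definition)} \\
&=
\begin{cases}
tmout(c,x)_H & ptea(x)[2] \\
tmout(c,x)_L & \text{otherwise}
\end{cases} && \text{(definition)} \qquad \Box
\end{align*}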
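The statement of lemma 14 can be checked on a small byte-addressable memory. The sketch assumes little-endian byte order for multi-byte reads and word-aligned page table entry addresses; the function names and the sample memory are hypothetical.

```python
def m(mem, a, d):
    """Read d bytes starting at byte address a (little-endian, an assumption)."""
    return int.from_bytes(mem[a:a + d], "little")


def pte_direct(mem, ptea):
    """Page table entry read as a 4-byte word: m_4(ptea)."""
    return m(mem, ptea, 4)


def pte_via_line(mem, ptea):
    """The same entry selected from the 8-byte line that contains it."""
    line = m(mem, (ptea >> 3) << 3, 8)   # line address ptea.l, padded with 000
    if (ptea >> 2) & 1:                  # ptea[2] selects the high word
        return line >> 32
    return line & 0xFFFFFFFF             # otherwise the low word


# Sample memory with byte i stored at address i.
mem = bytes(range(32))
```

For every word-aligned address of the sample memory both functions return the same value; for instance the entry at address 4 is the high word of line 0.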
3.3.3 Faults of NAT
In the previous section we focused our attention on cases in which steps of the nested transla-
tion are actually performed, i.e., the walk extension function is involved to access the memory
— to read the user’s page table. Unfortunately, that is not the case in general. For instance,
those guest’s pages which contain the user’s page tables can be not-accessible due to insuffi-
cient rights or simply because they were temporarily swapped-out7. Therefore, the situations
7 Clearly, such situations are possible unless the hypervisor allocates for guests (operating systems)
sufficient number of memory-locked pages, or provides an alternative mechanism that allows guests
to specify the memory regions with user page table structures as non-swappable. For details we refer
to [Vir18].
3.3 MIPS ISA with NAT 63
in which a step of the nested translation cannot be performed may occur even though all user’s pages are present in the memory. We call such situations faults of the nested translation, or nested faults for short.
Another possible interpretation of the nested faults, especially taking into account their effect, is to consider them simply as yet another source of ordinary page faults which affects only the nested translation. A page fault is triggered at the level of user if a faulty walk (wY) is passed on the processor core step. In case the latter walk is guest-faulty
wY.fg = 1,
the interrupt is intercepted and the machine switches to the level of host (see Sect. 3.1.2). Thus, it suffices to have the same number of interrupt causes as before in order to handle the page faults which occur due to the nested translation. In Sect. 3.3.5 we cover the details of instruction execution in the presence of nested faults. In the remainder of this section we present the semantics of the processor core steps, starting with the changes to the input alphabet (Sect. 3.3.4).
3.3.4 Processor Core
Just as before, on the processor core step either
i) a current instruction is executed “uninterrupted”, or
ii) a jump to the interrupt service routine is performed due to an interrupt.
In both cases the occurrence of an interrupt entirely depends on the current machine’s input
and the processor local configuration, which is so far the configuration of the processor core
together with the TLB.8 On the level of semantics this means that either one transition function
is applied or the other. This behavior should not change after extension of the ISA with a new
translation mode, and it does not. Also we want to preserve the old format of the machine’s
input for the processor core steps. So, we leave the input alphabet Σcore unchanged, but as
before we switch to the universal walk format:
\[ \Sigma_{core} = (\{\bot\} \cup K_{uwalk})^2 \times \mathbb{B}^2. \]
Obviously this works for the processor core steps made at the level of host or guest. But what
about the level of user, where the machine is supposed to perform the nested translation...?
The trick lies in the following: in the latter case (level of user) we pass the nested walks, which
we formally specified in Sect. 3.2.1. Recall that nested walks are obtained using the operation
of walk composition we introduced above, where the user walks were “prolonged” via the
matching guest walks. Formally, these nested walks are not present in the specification TLB
and, strictly speaking, they are a “virtual” concept for the “real” specification configurations9.
However, using the notation we introduced together with the nested walks, we can form the
exact set of all possible nested walks which could be obtained from the current TLB. In the
new notation this set is c.tlb◦. As simple as that.
Using the notation of [LOP] from Sect. 2.4.5, we denote the machine’s input by
\[ x = (w_I, w_E, eev) \in \Sigma_{core} \]
and expect it to contain guest walks from the TLB on the level of guest
\[ guest(c) \wedge w_Y \neq \bot \rightarrow w_Y \in tlb_G(c) \]
and nested walks from the composed TLB on the level of user.
\[ user(c) \wedge w_Y \neq \bot \rightarrow w_Y \in c.tlb^{\circ} \]
8 The configuration of the memory does not influence occurrence of the interrupts (see Sect. 2.3).
9 For the hardware configurations they appear to be the format of data stored in the hardware TLB cache.
We discuss this in more detail in the forthcoming chapter on hardware (Chap. 4), but already now it
might be clear that usage of the same format for ISA and hardware simplifies the future correctness
proof considerably.
Table 11: Meaningful ISA signals for various interrupt levels
Interrupt level (columns): reset, e, malf, pff, gff, ill, sys, ovf, malm, pfm, gfm, ∞
Input/signal (rows, top to bottom): eev; exec, fetch; ...; wI; ...; pmaI, I; wE; ...; pmaE, lres
Note, above and in the sequel, we follow the convention from [LOP] for index Y, which ranges
over {I,E}. In addition, we introduce a similar convention for index X, which we use to iterate
over {G,U}. We specify the guards formally in Sect. 3.3.5. Again, below we list only those
things that change compared to the original specification in Sect. 2.4.5, and again, we assume
that x denotes the machine’s current input.
In the presence of interrupts several extra arguments are necessary to specify the instruction execution formally. Since, according to the specification from Sect. 2.3.3, the execution of the instruction is terminated immediately on interrupts of non-continue type, signals that would normally be utilized by the instruction are not used. For instance, we define the following two auxiliary signals to be always used.
\[ exec(c,x) \equiv jisr(c,x) \rightarrow cont(c,x) \]
\[ fetch(c,x) \equiv jisr(c,x) \rightarrow il(c,x) > 4 \]
In Table 11 we visualize this by listing the latter signals in the topmost row. Note, this row is left
entirely white, meaning that signals assigned to this row are always meaningful, independent of
the interrupt level. In the rows below, which are partially grayed out, we collect the remaining
signals. We distribute them according to the interrupt levels at which they become meaningful.
Since it does not make sense to make formal arguments about things which are not meaningful,
like signals that are not used, in the proofs we simply ignore such unused signals.
By analyzing the specification from Chap. 2, Table 11 can be easily populated and made precise. Though not all signals are listed, the number of rows in the table suggests that for the interrupts considered all signals fall into one of six possible groups. To show that the instructions are executed correctly one has to argue about signals from all groups (rows of the table) that are meaningful at the corresponding interrupt levels and, moreover, utilized by the executed instructions. Indeed, for every ISA signal we specify conditions under which it is used. For instance, for the walks passed as the machine’s input (wI and wE) we specify the used predicates to be as follows.
\[ used(w_I,c,x) \equiv /host(c) \wedge il(c,x) > 2 \]
\[ used(w_E,c,x) \equiv /host(c) \wedge il(c,x) > 8 \wedge mop(c,x) \]
Thus, the first walk (wI) is used whenever the machine runs in translated mode and the sampled interrupts are of interrupt level 3 or higher. Similarly, the second walk (wE) is used whenever the machine runs in translated mode, the sampled interrupts are of interrupt level 9 or higher, and a memory operation is performed (mop(c,x)). Note that in the absence of interrupts the interrupt level is by definition equal to +∞ (see Sect. 2.3.3). The used predicates for most of the signals are given in [LOP] in the form of tables.
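The four predicates above are plain Boolean functions of the mode and the interrupt level. The following is a minimal sketch, assuming the interrupt level is modeled as a number with math.inf standing for the absence of interrupts; the function names are illustrative.

```python
import math

def exec_(jisr, cont):
    """exec: either no jump to the interrupt service routine, or the interrupt is of type continue."""
    return (not jisr) or cont

def fetch(jisr, il):
    """fetch: either no interrupt, or the interrupt level is above 4."""
    return (not jisr) or il > 4

def used_wI(host, il):
    """The first walk is used in translated mode when no interrupt of level 2 or lower occurs."""
    return (not host) and il > 2

def used_wE(host, il, mop):
    """The second walk additionally requires level above 8 and a memory operation."""
    return (not host) and il > 8 and mop

NO_INTERRUPT = math.inf  # interrupt level in the absence of interrupts
```

For instance, used_wI(False, NO_INTERRUPT) holds, whereas used_wE(False, 8, True) does not, since an interrupt of level 8 is discovered before the second walk is needed.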
To finish the definition of semantics for the processor core steps we proceed as follows. In the next section we discuss changes of semantics for the “uninterrupted” execution of instructions. To ease the presentation, details covering execution of the invlpg and flusht instructions are taken out of the scope and considered later, in Sect. 3.3.6. In the latter section we also update the definition of the illegal interrupt. In short, the new definition allows guests to execute privileged instructions under certain restrictions. Details of the interrupt handling are covered in Sect. 3.3.7.
3.3.5 Execution of Instructions
We start with the translation requests and modify them in an obvious way to work with the universal page addresses. Formally we redefine them as follows:
\[ trq_I(c,x) = (asid(c) \circ ia(c,x).pa,\ 110) \]
\[ trq_E(c,x) = (asid(c) \circ ea(c,x).pa,\ 01s) \]
where asid(c) denotes the ASID of the current ISA configuration and the write-bit s of the requested rights is set only if a store or CAS access takes place.
\[ s \equiv s(c,x) \vee cas(c,x) \]
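The construction of the two translation requests can be sketched as follows. This is an illustration only, assuming a 12-bit ASID, 32-bit addresses whose upper 20 bits form the page address, and rights encoded as a 3-bit number read in the order written above; the function names are hypothetical.

```python
def upa(asid, va):
    """Universal page address: 12-bit ASID concatenated with the 20-bit page address va[31:12]."""
    return ((asid & 0xFFF) << 20) | ((va >> 12) & 0xFFFFF)

def trq_I(asid, ia):
    """Translation request for the instruction fetch: requested rights 110."""
    return (upa(asid, ia), 0b110)

def trq_E(asid, ea, store, cas):
    """Translation request for the memory access: requested rights 01s."""
    s = 1 if (store or cas) else 0   # write-bit, set only for store or CAS
    return (upa(asid, ea), 0b010 | s)
```

A load at effective address 0x12345678 under ASID 0xABC requests the universal page address 0xABC12345 with rights 010, while a store or CAS to the same address requests rights 011.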
Next we cover the generation of interrupt signals. Essentially, there are no changes to the generation mechanism developed in Sect. 2.3. As before, interrupts are discovered gradually, under the priorities specified in Table 9, and once an interrupt is discovered, the execution of the current instruction might be terminated if the resume type of the causing interrupt is not continue. Formally, we change the definition of page faults to reflect the obvious changes to the machine’s mode:
\[ pff(c,x) \equiv used(w_I,c,x) \wedge pfault(trq_I(c,x), w_I) \]
\[ pfm(c,x) \equiv used(w_E,c,x) \wedge pfault(trq_E(c,x), w_E). \]
Note that in the above definition the page faults are signaled in case the walks passed are faulty. Otherwise, in case the rights provided by these walks are insufficient to perform an access, the general-protection faults are signaled. The new faults are defined as follows:
\[ gff(c,x) \equiv used(w_I,c,x) \wedge gfault(trq_I(c,x), w_I) \]
\[ gfm(c,x) \equiv used(w_E,c,x) \wedge gfault(trq_E(c,x), w_E). \]
Since the main idea of NAT is to make the page faults that occur due to translations of in-
termediate guest addresses completely “invisible” at the level of guest, these page faults are
intercepted by the mechanism we developed previously, in Sect. 3.1.2.
\[ icpt(c,x) \equiv user(c) \wedge \bigvee_{Y} used(w_Y,c,x) \wedge w_Y.fg \]
In the absence of interrupts, the semantics of instruction execution at the level of user, when the machine runs in the new translation mode, essentially stays unchanged. For all instructions except invlpg and flusht, the only thing we update is the definition of the physical memory addresses used for the instruction fetch and the memory access:
\[
pma_I(c,x) = \begin{cases} ia(c,x) & host(c) \\ tma(asid(c) \circ ia(c,x), w_I) & \text{otherwise} \end{cases}
\]
\[
pma_E(c,x) = \begin{cases} ea(c,x) & host(c) \\ tma(asid(c) \circ ea(c,x), w_E) & \text{otherwise.} \end{cases}
\]
In the end we collect in one place all of the guard conditions on the inputs of processor core
steps. We conveniently split these guards into two categories: for the instruction fetch and for
the memory access. For every guard condition below we give a short description.
• At the levels of user or guest — with the address translation — in the absence of interrupts
of interrupt level 2 or lower, the first walk passed to the machine (via x ∈ Σcore) is the one
matching the translation request for the current instruction address: at the level of guest it
is a matching walk from the regular TLB, whereas at the level of user — a matching walk
from the composed TLB.
\[
/host(c) \wedge il(c,x) > 2 \rightarrow match(trq_I(c,x), w_I) \wedge w_I \in \begin{cases} tlb_G(c) & guest(c) \\ c.tlb^{\circ} & user(c) \end{cases}
\]
• The guard conditions on the second walk passed are analogous to those above, but apply
only on memory operations (mop(c,x)). In contrast to the conditions above, here we
require the absence of interrupts of interrupt level 8 or lower.
\[
/host(c) \wedge il(c,x) > 8 \rightarrow match(trq_E(c,x), w_E) \wedge w_E \in \begin{cases} tlb_G(c) & guest(c) \\ c.tlb^{\circ} & user(c) \end{cases}
\]
Similarly to Sect. 3.3.1, we reformulate the guard conditions for processor core steps in a more
concise way. This time the definitions become even shorter: conditions restricting walk wY
passed as the machine’s input are covered by predicate
\[
\Phi_Y(c,x) \equiv used(w_Y,c,x) \rightarrow match(trq_Y(c,x), w_Y) \wedge w_Y \in \begin{cases} c.tlb^{\circ} & user(c) \\ c.tlb & guest(c). \end{cases}
\]
Clearly, the guard conditions for the processor core steps are now satisfied iff they are satisfied
for both walks contained in the machine’s input.
Φ(c,x) =ΦI(c,x)∧ΦE(c,x)
3.3.6 Overloading Instructions
In the previous chapters on semantics with interrupts, we had to distinguish only between two possibilities, namely whether the machine runs in translated mode or not. Depending on that, certain instructions could be executed only at the privileged level (system mode), causing interrupts (illegal instruction) at the level of user programs (user mode). Now, having added a new level of privilege (“in the middle”), we need to change the old definitions to suit our new setting. Note that this is not just a syntactical patch caused by the change of naming; we want to allow, at the level of guest, the usage of instructions forbidden at the level of user. However, the effect of those instructions cannot be the same as when they are executed at the level of host, which causes us to overload them for guests.
Clearly, we argue only about instructions for operating system support. In Sect. 2.1.1 such
instructions were formally introduced and collected in Table 2. Here, we are interested in the
following ones:
• flusht: flushes all translations cached in the TLB, and
• invlpg: flushes the translations for address rt and ASID rs.
In order to define the semantics for the invlpg and flusht instructions, we introduce the following shorthands for the ASID and the page address to be invalidated, respectively.
\[
inva(c,x).as = \begin{cases} vmid(c) \circ A(c,x)[27:20] & guest(c) \\ A(c,x)[31:20] & \text{otherwise} \end{cases}
\]
\[
inva(c,x).pa = B(c,x)[31:12]
\]
As might be clear from the last lines, the invlpg instruction uses its first parameter, GPR register rs, to restrict the victim TLB translations to those whose pair of fields (vm, pr) equals the ASID provided by the parameter. Note that at the level of guest the first four bits of the “invalidation” ASID are substituted by the current VMID. This prevents guests from invalidating TLB translations of one another. As before, the “invalidation” page address is taken from the GPR register rt.
\[
invlpg(c,x) \rightarrow c'.tlb = \{ w \in c.tlb \mid w.as = inva(c,x).as \rightarrow w.pa \neq inva(c,x).pa \wedge complete(w) \}
\]
\[
flusht(c,x) \rightarrow c'.tlb = \begin{cases} \{ w \in c.tlb \mid w.vm = vmid(c) \rightarrow w.pr = 0^8 \} & guest(c) \\ \emptyset & \text{otherwise} \end{cases}
\]
Note that the flusht instruction executed at the level of guest flushes all TLB translations with the vm field equal to the current VMID except those with the pr field equal to zero. The latter ones translate from the level of guest to the level of host, and therefore must not be manipulated by guests.
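The effect of the two overloaded instructions on the specification TLB can be sketched as set comprehensions. This is a simplified illustration, assuming walks are reduced to the fields needed here (vm, pr, pa and a completeness flag) and the privilege level is passed as a string; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Walk:
    vm: int          # 4-bit virtual machine ID (upper part of the ASID)
    pr: int          # 8-bit lower part of the ASID; zero for guest-to-host walks
    pa: int          # 20-bit page address
    complete: bool

    @property
    def asid(self):
        return (self.vm << 8) | self.pr

def inva(level, vmid, A, B):
    """Invalidation ASID and page address taken from register values A (rs) and B (rt)."""
    pr = (A >> 20) & 0xFF                                   # A[27:20]
    vm = vmid if level == "guest" else (A >> 28) & 0xF      # guests cannot choose the VMID
    return ((vm << 8) | pr, (B >> 12) & 0xFFFFF)

def invlpg(tlb, level, vmid, A, B):
    """Keep w iff (w.as = inva.as) implies (w.pa != inva.pa and complete(w))."""
    asid, pa = inva(level, vmid, A, B)
    return {w for w in tlb if w.asid != asid or (w.pa != pa and w.complete)}

def flusht(tlb, level, vmid):
    """Guests keep foreign walks and their own guest-to-host walks; otherwise flush everything."""
    if level == "guest":
        return {w for w in tlb if w.vm != vmid or w.pr == 0}
    return set()
```

Note that invlpg also removes incomplete walks of the invalidated address space, and that a guest passing a foreign VMID in the upper four bits of rs still only affects its own translations.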
3.3.7 Interrupt Mechanism
Overall the interrupts mechanism stays unchanged. We touch only two aspects:
• we plug in the new definition of the illegal interrupt. Thus, at the level of guest, we allow the move instructions partially (not with all arguments). Similarly, we allow guests to use invlpg and flusht. Note that the latter instructions were overloaded in Sect. 3.3.6.
\begin{align*}
ill(c,x) \equiv{} & undefined(c,x) \ \vee \\
& user(c) \wedge eret(c,x) \ \vee \\
& user(c) \wedge (flusht(c,x) \vee invlpg(c,x)) \ \vee \\
& user(c) \wedge move(c,x) \wedge /(movg2s(c,x) \wedge xad(c,x) \in \{cdata\}) \ \vee \\
& guest(c) \wedge invlpg(c,x) \wedge inva(c,x) \in A_G \ \vee \\
& guest(c) \wedge movg2s(c,x) \wedge xad(c,x) \in \{pto, mode, nmode\} \ \vee \\
& host(c) \wedge movg2s(c,x) \wedge xad(c,x) \in \{mode\}
\end{align*}
• we integrate the new definitions spawned by the intercept mechanism. Recall, in Sect. 3.1.2
we specified the effects of jisr and eret on the machine’s state (mode registers).
In all other aspects our “new” mechanism repeats the one described in Sect. 2.3. Note, in
Sect. 2.3.3 we already changed the semantics for the exception cause register to store the hard-
ware analogue of the interrupt level (il(c,x)).
\[ jisr(c,x) \rightarrow c'.eca = 0^{21} \circ f1(mca(c,x)) \]
This allows us to keep the simulation relation for the processor core in its original and usual
form. For the same purpose we update semantics for the exception data register. Thus, in case
the executed instruction was not fetched, the effective address is clearly not meaningful, and
therefore the register is simply filled with zeros.
\[ jisr(c,x) \rightarrow c'.edata = ea(c,x) \wedge fetch(c,x) \]
3.4 Simplified Semantics
In fact, we want to reformulate the specification so that we can argue about the TLB transitions in a more structured manner. There are two kinds of transitions: active and passive. For the TLB component they are represented, respectively, by addition and drop of the translations10. Active transitions of the TLB are triggered explicitly, when the stepping function of the entire machine takes the corresponding values. In turn, passive transitions of the TLB are triggered within the processor core transitions, whenever an invalidating instruction is executed. Note that transitions of both types use certain data from the machine state, i.e., configurations of the processor core and memory (see Sect. 3.3). Aiming at increased modularity of the hardware correctness proofs, we introduce an alternative version of the TLB specification which allows us to abstract from the particular machine type. This removes the necessity to repeat the proofs for the MMU component later in this chapter, where we argue about correctness of different machines which contain MMUs for the nested address translation.
10 This changes if the access and dirty bits are taken into consideration. In the latter case more active
TLB transitions appear.
3.4.1 General Semantics for TLB
We refer to the specification below as the general semantics for TLB. In a nutshell it is the same
(equivalent) specification as given in Sect. 3.3. We refer to the latter semantics as the original
semantics for convenience. Configurations of the TLB component (tlb) remain unchanged and
still are taken from the same configuration set.
\[ tlb \in K_{utlb} \]
The next state configuration is given by the following transition function:
\[ \delta_{tlb} : K_{utlb} \times (\Sigma_{add} \cup \Sigma_{drop}) \rightarrow K_{utlb} \]
where the input alphabet is split into
\[ \Sigma_{add} = \{winit\} \times \mathbb{B}^{32} \times \mathbb{B}^{20} \ \cup\ \{wext\} \times K_{uwalk} \times \mathbb{B}^{32} \]
and
\[ \Sigma_{drop} = \{drop\} \times (\mathbb{B}^{32} \cup \mathbb{B}^{4} \cup \{all\}). \]
Thus, for addition of new walks by initialization we pass the universal page address (upa ∈ B^32) and a page table origin (pto ∈ B^20). For addition of new walks by walk extension we pass the extended walk (w ∈ K_uwalk) and a page table entry (pte ∈ B^32). In order to drop the stored walks we either pass a universal page address (upa ∈ B^32), a virtual machine ID (vm ∈ B^4), or a special keyword (all) to flush the TLB.
\[ y \in \Sigma_{add} \rightarrow y \in \{(winit, upa, pto), (wext, w, pte)\} \]
\[ y \in \Sigma_{drop} \rightarrow y \in \{(drop, upa), (drop, vm), (drop, all)\} \]
In this way the next configuration of the TLB component is completely determined by the current TLB configuration and an external input from the latter two sets. Also, both kinds of transitions in the general semantics are triggered explicitly by the external inputs, i.e., both are of active type. Recall that active transitions of a component are called steps. For convenience we introduce the following abbreviations for steps in the general semantics which add (tadd) and drop (tdrop) translations.
\[ tadd(y) \equiv y \in \Sigma_{add} \]
\[ tdrop(y) \equiv y \in \Sigma_{drop} \]
Finally, we give the specification of the general semantics in the following two lines.
\[ tadd(y) \rightarrow tlb' = tlb \cup \{w(y)\} \]
\[ tdrop(y) \rightarrow tlb' = tlb \setminus I(y) \]
Intuitively, the sets {w(y)} and I(y) abbreviate the walks which are, respectively, added to and dropped from the TLB on the corresponding steps. We define these sets formally in the next section. Only one guard condition remains meaningful in the general semantics. It concerns the steps of walk extension, which are still restricted to incomplete and non-faulty walks from the TLB.
\[ y = (wext, w, pte) \rightarrow valid(tlb, w) \]
3.4.2 Added, Dropped, and Ragged Walks
We start with the walk which is added by the corresponding TLB steps (tadd).
Definition 3 (Added Walk).
\[
w(y) = \begin{cases} winit(upa, pto) & y = (winit, upa, pto) \\ wext(w, pte) & y = (wext, w, pte) \end{cases}
\]
In the latter definition “winit” and “wext” are those functions which we later turn into the hardware circuits of the MMU component. Their exact specification is as follows. For a universal page address upa ∈ B^32 and a page table origin pto ∈ B^20 we define the output w of the function winit as
\begin{align*}
winit(upa, pto) &= w.(upa, \ell, ba, r, fu, fg) \\
&= (upa, 100, pto, 111, 0, 0).
\end{align*}
Recall that universal walks were formally defined in Sect. 3.2.1. Extension of a universal walk w ∈ K_uwalk using a page table entry pte ∈ B^32 is given by the function wext, which outputs either a faulty or an extended walk, depending on whether the target page is present in the memory or not.
\[
/pte.p \rightarrow wext(w, pte) = \begin{cases} w[fu := 1] & w.upa \in A_U \\ w[fg := 1] & w.upa \in A_G \end{cases}
\]
\begin{align*}
pte.p \rightarrow wext(w, pte) &= w'.(upa, \ell, ba, r, fu, fg) \\
&= (w.upa,\ 0 \circ w.\ell[2:1],\ pte.ba,\ w.r \wedge pte.r,\ 0,\ 0).
\end{align*}
Note, in case the target page is not present (/pte.p), the level of the extended walk becomes the level of the walk extension. A trivial lemma follows.
Lemma 15.
wext(w,m) = wext(w, pte(w,m))
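The functions winit and wext can be sketched as follows. This is a minimal illustration, assuming that a page table entry is given by its already decoded fields p, ba and r (the bit layout of entries is not modeled), and that guest addresses are exactly those universal page addresses whose pr field is zero; the class names are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Walk:
    upa: int      # 32-bit universal page address: vm (4 bits), pr (8 bits), pa (20 bits)
    level: int    # 3-bit level, 0b100 right after initialization
    ba: int       # 20-bit base address
    r: int        # 3-bit rights
    fu: int = 0   # user fault bit
    fg: int = 0   # guest fault bit

@dataclass(frozen=True)
class Pte:
    p: bool       # present bit
    ba: int       # base address stored in the entry
    r: int        # rights stored in the entry

def is_guest_address(upa):
    """Assumption: guest addresses have pr = 0^8, user addresses have a nonzero pr field."""
    return (upa >> 20) & 0xFF == 0

def winit(upa, pto):
    """Initial walk: level 100, base address pto, full rights, no faults."""
    return Walk(upa, 0b100, pto, 0b111)

def wext(w, pte):
    """Walk extension: set a fault bit if the page is absent, otherwise descend one level."""
    if not pte.p:
        return replace(w, fg=1) if is_guest_address(w.upa) else replace(w, fu=1)
    # 0 . level[2:1] is a right shift of the 3-bit level
    return Walk(w.upa, w.level >> 1, pte.ba, w.r & pte.r, 0, 0)
```

Two successful extensions of an initial walk move the level from 100 over 010 to 001, while the rights can only shrink, since they are accumulated by bitwise conjunction.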
Next we define the sets of walks which are invalidated on the corresponding TLB steps (tdrop).
Definition 4 (Dropped Walks).
\[
I(y) = \begin{cases} I_V(y) \cup I_C(y) & y = (drop, upa) \\ I_D(y) & y = (drop, vm) \\ K_{utlb} & y = (drop, all) \end{cases}
\]
where
\begin{align*}
I_V(y) &= \{ w \mid y = (drop, upa) \wedge w.upa = upa \} \\
I_C(y) &= \{ w \mid y = (drop, upa) \wedge w.as = upa.as \wedge /w.\ell[0] \} \\
I_D(y) &= \{ w \mid y = (drop, vm) \wedge w.upa \in A_U(vm) \}
\end{align*}
Into set I_V we collect all walks which translate the invalidated address (upa), into set I_C all incomplete walks which translate addresses from the invalidated address space (upa.as). The definitions of the latter two sets are only meaningful in case an individual address is invalidated. If an entire virtual machine is invalidated, we collect all user walks of the invalidated machine (vm) into set I_D.
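The dropping steps of the general semantics can be sketched over a finite TLB. This is an illustration only, assuming walks are reduced to their universal page address and level, that bit 0 of the level marks a complete walk, that user addresses of a machine are those with a nonzero pr field, and that drop inputs carry an explicit tag to tell a page address from a machine ID; all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Walk:
    upa: int     # 32-bit universal page address: vm (4), pr (8), pa (20)
    level: int   # 3-bit level; bit 0 set means the walk is complete

def asid(upa):
    return upa >> 20

def vm(upa):
    return upa >> 28

def pr(upa):
    return (upa >> 20) & 0xFF

def dropped(y, walks):
    """The set I(y), restricted to the finite collection `walks`."""
    kind, arg = y            # ("upa", upa), ("vm", vm) or ("all", None)
    if kind == "upa":
        i_v = {w for w in walks if w.upa == arg}
        i_c = {w for w in walks if asid(w.upa) == asid(arg) and not (w.level & 1)}
        return i_v | i_c
    if kind == "vm":
        return {w for w in walks if vm(w.upa) == arg and pr(w.upa) != 0}
    return set(walks)

def delta_drop(tlb, y):
    """tdrop(y) -> tlb' = tlb minus I(y)."""
    return tlb - dropped(y, tlb)
```

Dropping a single address thus removes the walks for that address together with all incomplete walks of the same address space, whereas dropping a virtual machine leaves its guest-to-host walks in place.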
For reasons that become clear later we formalize the following set of walks: all user walks from the TLB which are composable with the invalidated or incomplete walks.
Definition 5 (Ragged Walks).
\[ R(y) = \{ w \mid \{w\} \circ (I_V(y) \cup I_C(y)) \neq \emptyset \} \]
We call walks from the latter set ragged, in the sense that these walks do not count in the construction of the composed TLB once the sets of invalidated and incomplete walks are dropped. Obviously, the interesting case is when a guest address is invalidated, since otherwise the set of ragged walks is empty by definition.
\[ y = (drop, upa) \wedge upa \in A_U \rightarrow R(y) = \emptyset \]
We infer the following lemma about the sets of invalidated (I_V), incomplete (I_C), and ragged (R) walks. The external input y ∈ Σ_drop is omitted below.
Lemma 16.
\[ \overline{R} \circ (I_V \cup I_C) = \emptyset \qquad (1) \]
\[ R \circ \overline{(I_V \cup I_C)} = \emptyset \qquad (2) \]
Both parts of the latter lemma follow directly from the definition above. We state the lemma in
order to reflect the following intuitive property: the ragged walks are the only user walks that
can be successfully composed with the invalidated or incomplete walks (part 1), and vice versa
(part 2).
Results of this section are used in the correctness proofs of Sect. 5.2. One of our main concerns
there is to make sure that after the invalidated and incomplete walks were dropped, no walks
from set
R ◦ (IV ∪ IC)
are passed as external inputs. These walks (according to the lemma above) will no longer be
composable, and if passed, will violate one of the guard conditions of the processor core steps
(see Sect. 3.3.4).
3.4.3 TLB Equivalence
In this section we introduce two technical lemmas which in the future help us to reduce the
length of the arguments in the correctness proofs significantly. The first lemma asserts that
for any step of the original semantics there is a TLB equivalent step of the general semantics.
Intuitively, two steps are called TLB equivalent if for any given ISA configuration they produce
next-state configurations with identical TLB components.
Lemma 17. Let c be an ISA configuration and let x ∈ Σtlb be an external input for the step
performed in the original semantics. If the corresponding step performed in the general se-
mantics uses input y ∈ Σadd ∪Σdrop that satisfies the conditions below, these two steps are TLB
equivalent.
δISA(c,x).tlb = δtlb(c.tlb,y)
For steps which add walks into the TLB the conditions are as follows.
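\begin{align*}
x = x.(winit, a, guest) &\rightarrow y = (winit,\ vmid(c) \circ 0^8 \circ x.a,\ pto(c)) \\
x = x.(winit, a, user) &\rightarrow y = (winit,\ asid(c) \circ x.a,\ npto(c)) \\
x = x.(wext, w_1, w_2) &\rightarrow y = (wext,\ x.w_1,\ pte(c,x))
\end{align*}
For steps which drop walks from the TLB
\[ x = (w_1, w_2, eev) \]
the conditions are as follows.
\begin{align*}
/user(c) \wedge invlpg(c,x) &\rightarrow y = (drop, inva(c,x)) \\
guest(c) \wedge flusht(c,x) &\rightarrow y = (drop, vmid(c)) \\
host(c) \wedge flusht(c,x) &\rightarrow y = (drop, all)
\end{align*}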
Proof of lemma 17. This turns out to be a simple bookkeeping exercise. First we consider
addition of the new walks into the TLB. General semantics defines the next state configuration
of the TLB component as
c′.tlb = c.tlb∪{w(y)}
whereas the original semantics defines the next state configuration as
c′.tlb = c.tlb∪{w(c,x)}.
Thus, to prove the TLB equivalence it suffices to show
w(y) = w(c,x).
Cases to cover:
• initialization of the guest walk.
w(y) = winit(y.upa, y.pto)
= winit(vmid(c) ◦ 0^8 ◦ x.a, pto(c)) (assumptions)
= w(c,x)
• initialization of the user walk.
w(y) = winit(y.upa, y.pto)
= winit(asid(c) ◦ x.a, npto(c)) (assumptions)
= w(c,x)
• extension of the guest walk.
w(y) = wext(y.w, y.pte)
= wext(x.w1, pte(c,x)) (assumptions)
= wext(x.w1, pte(x.w1, c.m))
= wext(x.w1, c.m) (lemma 15)
= w(c,x)
• extension of the user walk.
w(y) = wext(y.w, y.pte)
= wext(x.w1, pte(c,x)) (assumptions)
= wext(x.w1, pte(x.w1 ◦ x.w2, c.m))
= wext(x.w1 ◦ x.w2, pte(x.w1 ◦ x.w2, c.m)) (equation 4)
= wext(x.w1 ◦ x.w2, c.m) (lemma 15)
= w(c,x)
In the latter case of the walk extension we use properties of the composition operation. Among others we use that the fault-bits of composition w1 ◦ w2 are identical to the fault-bits of walk w1 (see p. 56).
/f(w1 ◦ w2) → (w1 ◦ w2).(fu, fg) = 0^2
→ w1.(fu, fg) = 0^2 (4)
Next we consider dropping of walks from the TLB. The original semantics defines the next state configuration of the TLB depending on the type of the invalidating instruction. We split cases according to this type.
• invalidation of a page address. General semantics defines
\begin{align*}
c'.tlb &= c.tlb \setminus (I_V(y) \cup I_C(y)) \\
&= c.tlb \cap \overline{I_V(y)} \cap \overline{I_C(y)}.
\end{align*}
For the original semantics we derive
\begin{align*}
c'.tlb &= c.tlb \cap \{ w \mid w.as = inva(c,x).as \rightarrow w.pa \neq inva(c,x).pa \wedge w.\ell[0] \} \\
&= c.tlb \cap \{ w \mid w.upa \neq inva(c,x) \wedge (w.as \neq inva(c,x).as \vee w.\ell[0]) \}.
\end{align*}
To prove the TLB equivalence of two steps we show:
\begin{align*}
& c.tlb \cap \overline{I_V(y)} \cap \overline{I_C(y)} \\
&= c.tlb \cap \{ w \mid w.upa \neq y.upa \} \cap \{ w \mid w.as \neq y.upa.as \vee w.\ell[0] \} \\
&= c.tlb \cap \{ w \mid w.upa \neq inva(c,x) \} \cap \{ w \mid w.as \neq inva(c,x).as \vee w.\ell[0] \} && \text{(assumptions)} \\
&= c.tlb \cap \{ w \mid w.upa \neq inva(c,x) \wedge (w.as \neq inva(c,x).as \vee w.\ell[0]) \}.
\end{align*}
• invalidation of a virtual machine. General semantics defines
\begin{align*}
c'.tlb &= c.tlb \setminus I_D(y) \\
&= c.tlb \cap \overline{I_D(y)}.
\end{align*}
For the original semantics we derive
\begin{align*}
c'.tlb &= c.tlb \cap \{ w \mid w.vm = vmid(c) \rightarrow w.pr = 0^8 \} \\
&= c.tlb \cap \{ w \mid w.vm \neq vmid(c) \vee w.pr = 0^8 \}.
\end{align*}
To prove the TLB equivalence of two steps we show:
\begin{align*}
c.tlb \cap \overline{I_D(y)} &= c.tlb \cap \{ w \mid w.upa \notin A_U(y.vm) \} \\
&= c.tlb \cap \{ w \mid w.upa \notin A_U(vmid(c)) \} && \text{(assumptions)} \\
&= c.tlb \cap \{ w \mid w.vm \neq vmid(c) \vee w.pr = 0^8 \}.
\end{align*}
• invalidation of all translations. The TLB is flushed in both models. □
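The set equality used for the invalidation of a page address is, per walk, a purely propositional identity, and can be checked by enumerating all cases. The following is a small sketch; the three Boolean parameters stand for w.as = inva.as, w.pa = inva.pa and w.ℓ[0], and the function names are illustrative.

```python
from itertools import product

def original_keep(as_eq, pa_eq, complete):
    """Original semantics: w.as = inva.as implies (w.pa != inva.pa and w is complete)."""
    return (not as_eq) or ((not pa_eq) and complete)

def general_keep(as_eq, pa_eq, complete):
    """General semantics: w lies neither in I_V nor in I_C."""
    in_iv = as_eq and pa_eq              # w.upa = upa
    in_ic = as_eq and not complete       # same address space and incomplete
    return not (in_iv or in_ic)

def identity_holds():
    return all(original_keep(*bits) == general_keep(*bits)
               for bits in product([False, True], repeat=3))

assert identity_holds()
```

All eight assignments agree, which mirrors the last step of the derivation above.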
The second lemma of this section states that steps of the general semantics performed in the
given ISA configurations preserve simulation of the TLB component for the next-state config-
urations. For two ISA configurations c1 and c2 we abbreviate
simISAtlb(c1,c2) ↔ c1.tlb⊆ c2.tlb.
Lemma 18. Let c1 and c2 be two ISA configurations and let y ∈ Σadd ∪Σdrop be an external
input used to obtain the next-state configurations c′i s.t.
c′i.tlb = δtlb(ci.tlb,y).
The simulation of the TLB component is preserved for configurations c′1 and c′2.
simISAtlb(c1,c2) → simISAtlb(c′1,c′2)
Proof of lemma 18. According to the general semantics, there are only two cases:
• addition of walks into the TLB, i.e., y ∈ Σadd .
c′1.tlb = c1.tlb∪{w(y)}
⊆ c2.tlb∪{w(y)} (simISAtlb(c1,c2))
= c′2.tlb
• dropping of walks from the TLB, i.e., y ∈ Σdrop.
c′1.tlb = c1.tlb\I(y)
⊆ c2.tlb\I(y) (simISAtlb(c1,c2))
= c′2.tlb □
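Lemma 18 rests on the monotonicity of set union and set difference with a fixed second argument. The following sketch tests this on random finite sets; walks are abstracted to small integers and the step function is a simplified stand-in for δtlb, so this is a sanity check rather than a proof.

```python
import random

def step(tlb, y):
    """Simplified general semantics: y = ("add", walks) or ("drop", walks)."""
    kind, walks = y
    return tlb | walks if kind == "add" else tlb - walks

def check_preservation(trials=1000, seed=0):
    """Check that c1.tlb being a subset of c2.tlb is preserved by a common step."""
    rng = random.Random(seed)
    universe = range(16)
    for _ in range(trials):
        c2 = {w for w in universe if rng.random() < 0.5}
        c1 = {w for w in c2 if rng.random() < 0.5}          # guarantees c1 subset of c2
        y = (rng.choice(["add", "drop"]),
             {w for w in universe if rng.random() < 0.3})
        if not step(c1, y) <= step(c2, y):
            return False
    return True

assert check_preservation()
```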
4 Implementation of Nested MMU
In this chapter we adjust the hardware constructions from [LOP] such that they meet the specifications from Chap. 3. Later on we proceed gradually, but at this point it is important to stress our goal. In order to improve performance, we design the hardware to cache two kinds of walks: i) hardware guest walks and ii) hardware user walks. In the first case these are the ordinary guest walks from the ISA, exactly as before. However, in the second case these are compositions, as defined in Sect. 3.2.2, of the ordinary user and guest walks from the ISA. In hardware, at the level of user, the processor core will use the latter compositions directly, after a single look-up in the hardware TLB.
4.1 Redesigning TLB
We start with the presentation of a problem discovered in the process of defining the semantics for a machine with NAT. Consider the following scenario: assume that a certain user address was successfully translated and the corresponding translation is cached by the hardware. From the specification of NAT (Sect. 3.2.1) we infer that after the translation process finishes, the hardware TLB contains up to three auxiliary complete guest walks created by the MMU. On the other hand, the software TLB contains, among other translations used to obtain the resulting user walk, software counterparts of those hardware guest walks. We are particularly interested in the one (wg) which can be composed with the resulting software user walk (wu) to form a counterpart of the resulting hardware user walk (wU), i.e.,
wu ◦ wg = wU.
For details we refer to the introductory sections on NAT (Sect. 3.2). Now assume an invlpg instruction is executed and its target is the universal page address of that guest walk (wg). On the level of ISA this does not create any difficulty; we can easily specify the effect of such an invalidating transition. But (!) if one drops only the auxiliary guest walks from the TLB (both hardware and software), there are still user walks left: wU in the hardware and wu in the software TLB. This has very undesirable consequences: the hardware user walk (wU), which can still be used by the processor core, can no longer be reconstructed from the software user walk (wu) on the ISA level, since the matching guest walk necessary for the composition (wg) was dropped on the invlpg. As a result the hardware is able to perform steps which the ISA computation cannot simulate. Of course, we cannot allow such behavior.
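The scenario can be replayed on a toy model. The sketch below is a strong simplification, assuming single-level translations represented as dictionaries (user page to guest page, guest page to host page) instead of the walks and the composition operation of Sect. 3.2; the function names and the origin table are hypothetical.

```python
def compose(user_walks, guest_walks):
    """Toy composition: a user walk is prolonged by the guest walk matching its guest page."""
    return {u: guest_walks[g] for u, g in user_walks.items() if g in guest_walks}

def hw_invalidate_naive(hw_user, hw_guest, gpage):
    """Drop only the guest walk; cached compositions survive."""
    return dict(hw_user), {g: h for g, h in hw_guest.items() if g != gpage}

def hw_invalidate_fixed(hw_user, hw_origin, hw_guest, gpage):
    """Additionally drop every user walk that was composed through gpage."""
    kept_user = {u: h for u, h in hw_user.items() if hw_origin[u] != gpage}
    return kept_user, {g: h for g, h in hw_guest.items() if g != gpage}

# Software (ISA) view: user walk wu and guest walk wg.
sw_user = {0x10: 0x20}
sw_guest = {0x20: 0x30}
# Hardware view: the composed walk wU plus the guest page it was composed through.
hw_user = compose(sw_user, sw_guest)
hw_origin = dict(sw_user)

# invlpg targets guest page 0x20: the ISA drops wg, so nothing is composable any more.
isa_after = compose(sw_user, {})
stale_user, _ = hw_invalidate_naive(hw_user, sw_guest, 0x20)
fixed_user, _ = hw_invalidate_fixed(hw_user, hw_origin, sw_guest, 0x20)

assert isa_after == {}
assert stale_user == {0x10: 0x30}   # hardware could still use a walk the ISA cannot rebuild
assert fixed_user == {}
```

The naive invalidation leaves a composed walk in hardware that no longer has an ISA counterpart, while remembering the guest page of each composition allows the hardware to drop it as well.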
So, we change the effect of invalidating transitions in hardware: together with the dropped guest walk we require all (!) matching user walks to be dropped, to make sure these walks are not used by the hardware. This, however, creates another difficulty we have to overcome. Namely, in hardware we need to invalidate the user walks, which no longer contain the data about the guest walks they were composed from. For the sake of simplicity we consider the model of the hardware TLB where we store all the missing data in dedicated registers1. We fill these registers with the guest page addresses, i.e., the page addresses of the guest walk register (wG), which we lose after the walk composition.
Fig. 14: Interface of TLB for nested translation (inputs: upa, win, gin, inva, invm, trans, store, inval; outputs: hit, wout)
We spend the following two sections to present both formal specification and hardware im-
plementation of the TLB component. Thus, in Sect. 4.1.1 we describe the changes to the
specification of the hardware TLB. The exact implementation of the hardware TLB is then
presented in Sect. 4.1.2.
4.1.1 Specification
Since the construction from [LOP] does not provide all features necessary to perform NAT, we
propose a slightly more advanced design. Extension of functionality requires changes of the
TLB hardware interface. Now it has the following inputs:
• upa ∈ B^32 (universal page address): (virtual) address to look up in the TLB,
• win ∈ B^60 (walk input): hardware walk (translation) to add into the TLB,
• gin ∈ B^20 (guest input): page address of the guest walk which was used to obtain win,
• inva ∈ B^32 (invalidation address): (virtual) address to drop from the TLB,
• invm ∈ B^3 (invalidation mask): control bits to specify the invalidation query,
• trans ∈ B (translation request): control signal to execute the translation query,
• store ∈ B (store request): control signal to store the input walk (translation),
• inval ∈ B (invalidation request): control signal to execute the invalidation query.
Outputs of the hardware TLB remain unchanged:
• wout ∈ B^60 (walk output): hardware walk to be returned to the nested MMU,
• hit ∈ B (TLB hit): control signal indicating that the TLB contains a translation for the
requested (virtual) address.
The resulting interface is depicted in Fig. 14. As usual, for the hardware TLB to operate
properly, several operating conditions have to be satisfied. We list these conditions below.
• As before, at most one request is allowed at a time.
trans+ store+ inval ≤ 1
• Under a store request, the (virtual) address of the written walk has to be passed as the
invalidation address, as we already mentioned in the previous section. This requirement is
technical and very similar to the one we already had in [LOP]. It simplifies the hardware
mechanism which evicts conflicting2 walks.
store→ inva = win.upa
1 Without these extra registers the data about the “origin” of user walks can be restored from the data
available in hardware only partially. Moreover, certain restrictions on the content of the page tables
become necessary.
• Under the invalidation request, the invalidation mask is expected to have one of the supported
values, which in turn correspond to TLB flush, vmflush and invlpg resp.
inval→ invm ∈ {111,011,000}
• One more completely technical condition which simplifies the implementation: on trans-
lation and store requests, we expect bits of the invalidation mask to be all zeros.
trans∨ store→ invm = 000
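The four operating conditions above can be read as a simple check on the TLB inputs. The following Python sketch is only an illustration and not part of the hardware design: the function name and the encoding of invm as a three-character bit string are assumptions made here.

```python
def check_tlb_operating_conditions(trans, store, inval, invm, inva, win_upa):
    """Return the list of violated operating conditions (empty if all hold).

    invm is a 3-character bit string such as "011" (an encoding chosen here);
    inva and win_upa are integers standing for 32-bit addresses.
    """
    violated = []
    # trans + store + inval <= 1
    if int(trans) + int(store) + int(inval) > 1:
        violated.append("at most one request")
    # store -> inva = win.upa
    if store and inva != win_upa:
        violated.append("store -> inva = win.upa")
    # inval -> invm in {111, 011, 000}
    if inval and invm not in ("111", "011", "000"):
        violated.append("inval -> invm in {111, 011, 000}")
    # trans or store -> invm = 000
    if (trans or store) and invm != "000":
        violated.append("trans or store -> invm = 000")
    return violated
```

A well-formed store request, for instance, passes the address of the written walk as inva and keeps the mask at 000, so the returned list is empty.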
In the remainder of this section we formulate all required properties of the hardware TLB
component. For convenience we split these properties w.r.t. the request initiated in the current
hardware configuration h. As usual, we denote by h′ the resulting hardware configuration.
• Formally, the hardware TLB h.tlb consists of three components:
h.tlb.w : [1 : N] → B^60
h.tlb.g : [1 : N] → B^20
h.tlb.v : [1 : N] → B
where N ∈ ℕ denotes the maximum number of simultaneously stored entries. In [LOP]
the set of (valid) walks stored in a TLB was defined as
tlbset(tlb) = {tlb.w(i) | tlb.v(i)∧ i ∈ [1 : N]}.
We refer to the set of walks stored in the hardware TLB of configuration h using the
following shorthand.
tlbset(h) ≡ tlbset(h.tlb)
In a similar way we abbreviate every definition Z introduced below in this section.
Z(h) ≡ Z(h.tlb)
• Obviously, in the absence of any request, the content of the hardware TLB (the set of stored
walks) does not change.
/(trans(h)∨ store(h)∨ inval(h))→ tlbset(h′) = tlbset(h)
• Under a translation request, the content of the hardware TLB does not change either. More-
over, in case of a TLB hit, the output walk matches the (virtual) address requested and,
clearly, comes from the TLB.
trans(h)→ tlbset(h′) = tlbset(h)
trans(h)→ (hit(h)→ wout(h) ∈ tlbset(h))
trans(h)→ (hit(h)→ wout(h).upa = upa(h))
• After execution of a store request, the resulting hardware TLB contains the (former) walk
written, though it might drop one of the walks stored. A conflicting walk is always evicted,
and in case there is none and the hardware TLB is full, the victim walk is dropped.
store(h)→ tlbset(h′)⊆ tlbset(h)∪{win(h)}
store(h)→ tlbset(h′)∩{w | w.upa = win(h).upa}= {win(h)}
2 We call two walks in the hardware TLB conflicting if they provide translation for the identical (virtual)
addresses. It is crucial to eliminate conflicting walks and keep translations in the hardware TLB
“uniquely” present.
• Finally, after execution of the invalidation request, the resulting hardware TLB does not
contain the specified set of walks. For simplicity we call the latter set invalidated, and
define it formally as
\[
invset(h) =
\begin{cases}
tlbset(h) & invm(h) = 111 \\
tlbset(h) \cap \{ w \mid w.upa \in A_U(inva(h).vm) \} & invm(h) = 011 \\
tlbset(h) \cap \{ w \mid w.upa = inva(h) \} & invm(h) = 000.
\end{cases}
\]
In addition, according to the semantics of invlpg from Sect. 3.3.6, the hardware TLB loses
all incomplete walks from the target address space, as well as all user walks matching the
invalidated (guest) walk. We call the set of incomplete walks to be dropped incset and
define it as
\[
incset(h) =
\begin{cases}
tlbset(h) \cap \{ w \mid (w.as = inva(h).as) \land /w.\ell[0] \} & invm(h) = 000 \\
\emptyset & \text{otherwise.}
\end{cases}
\]
We call the other set of walks to be dropped ragset3. In order to formalize the set of ragged
(hardware) walks, we require the following technical definition.
Definition 6 (guest page address). For walks w ∈ tlbset(h):
gpa(w,h) ≡ ε{h.tlb.g(i) | h.tlb.v(i)∧ (h.tlb.w(i) = w)}
The guest page address of walk w from the hardware TLB h.tlb is stored in the component
h.tlb.g(i) if walk w is stored in the entry i. The definition above is well-defined due to the
invariant 2, which preserves uniqueness of the walks stored in hardware.
Formally, we define the ragset as
\[
ragset(h) =
\begin{cases}
tlbset(h) \cap \{ w \mid (w.upa \in A_U(inva(h).vm)) \land ((gpa(w,h) = inva(h).pa) \lor w.f_g) \} & inva(h) \in A_G \land invm(h) = 000 \\
\emptyset & \text{otherwise.}
\end{cases}
\]
Note that, since the incomplete guest walks are automatically dropped on invlpg, we also
include into the set of ragged walks all guest faulty (see Sect. 3.2.1) user walks from the given
address space, which we identify using flag w.f_g. Now, the effect of the invalidation
request on the hardware TLB is obvious.
inval(h)→ tlbset(h′) = tlbset(h)\ (invset(h)∪ incset(h)∪ ragset(h))
This finishes the hardware specification of the TLB component. It is easy to see that in this
specification the TLB is functional w.r.t. the addresses of the stored walks. To reflect the latter
property we introduce an invariant.
Invariant 2. For walks w1,w2 ∈ tlbset(h):
w1.upa = w2.upa→ w1 = w2
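To make the specification of the invalidation request more tangible, the following Python sketch computes the union of invset, incset and ragset over a small set-based model. It is an illustration under assumptions, not the formal model: the record Walk, the representation of the address space w.as as the pair (vm, pr), and the caller-supplied predicate in_AU (standing for membership in A_U(vm)) are hypothetical. The test inva(h) ∈ A_G is modelled as pr = 0, following the condition on the pr field used for the ragged walks in the hardware construction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Walk:
    vm: int          # virtual machine field of w.upa
    pr: int          # second field of w.upa (pr = 0 is read as a guest address)
    pa: int          # page address field of w.upa
    complete: bool   # stands for w.l[0]
    fg: bool = False  # guest-faulty flag w.f_g


def dropped_on_inval(entries, invm, inva, in_AU):
    """Walks removed by an invalidation request.

    entries: dict mapping each valid walk to its stored guest page address.
    inva:    triple (vm, pr, pa) of the invalidation address.
    in_AU:   hypothetical predicate in_AU(walk, vm) for 'walk.upa in A_U(vm)'.
    """
    vm, pr, pa = inva
    walks = set(entries)
    if invm == "111":                       # drop everything
        return walks
    if invm == "011":                       # drop one virtual machine
        return {w for w in walks if in_AU(w, vm)}
    # invm == "000": single page plus incomplete and ragged walks
    invset = {w for w in walks if (w.vm, w.pr, w.pa) == inva}
    incset = {w for w in walks if (w.vm, w.pr) == (vm, pr) and not w.complete}
    ragset = set()
    if pr == 0:                             # inva is a guest address
        ragset = {w for w in walks
                  if in_AU(w, vm) and (entries[w] == pa or w.fg)}
    return invset | incset | ragset
```

The resulting TLB content is then the set difference of the stored walks and the returned set, as in the last formula of the specification.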
In the next section we implement the TLB component according to the specification above.
Note that targeting the latter specification, we present a construction that can serve at most one
request at a time.4
3 This name is motivated by the situation which occurs in software: after execution of the invlpg on the
level of ISA, the user walks matching the invalidated guest walk cannot be composed with the walks
remaining (see Sect. 3.4.2).
4 The first operating condition restricts inputs of the hardware TLB appropriately. Of course, the latter
operating condition can be relaxed with a slightly more advanced implementation, but we do not
bother.
4.1.2 Construction
In a nutshell, the construction remains almost the same as in [LOP]: it is still a cache for address
translations (walks). However, it acquires several important properties which are absolutely
necessary for the nested translation. Thus, our construction allows invalidation of
multiple translations and, despite that, uses the hardware cache efficiently. First, we present
a basic construction of the novel hardware TLB. Then, we extend that basic construction in two
more steps:
i) we design a circuitry which performs dropping of all hardware ragged walks (user walks
composed from the guest walk being invalidated), and
ii) we add support for invalidation of (all translations of) a particular VM.
Fig. 15: Basic construction of the hardware TLB
[Figure: per-entry registers w(i), g(i), v(i) with clock enables w(i).ce, g(i).ce, v(i).ce for i ∈ [1 : N], circuits tlbhit and tlbspc, signals push and mov2, inputs win (60) and gin (20)]
As an intermediate signal in the forthcoming hardware constructions we use
\[
tlba(h) =
\begin{cases}
upa(h) & trans(h) \\
inva(h) & \text{otherwise.}
\end{cases}
\]
The reason is that we reuse the same circuits when computing internal hardware
hits, which in turn depend on the translation (upa) and invalidation (inva) addresses in exactly
the same way.
Basic Design
The basic construction of the hardware TLB is depicted in Fig. 15. The outputs, which are not
shown in the figure, are computed in the usual way by the OR-trees.
\[ hit(h) = \bigvee_i tlbhit(h)[i] \land trans(h) \]
\[ wout(h) = \bigvee_i tlbhit(h)[i] \land h.tlb.w(i) \]
There are circuits for computation of hardware hit and incomplete signals on per entry basis,
i.e., for i ∈ [1 : N]: “tlbhit” and “tlbspc” resp. Construction of these circuits can be easily
deduced from the formal specification of their outputs. Using the intermediate signals
hit(h)[i] ≡ h.tlb.w(i).upa = tlba(h)
inc(h)[i] ≡ (h.tlb.w(i).as = tlba(h).as)∧/h.tlb.w(i).ℓ[0]
we define:
tlbhit(h)[i] ≡ h.tlb.v(i)∧ ((invm(h) = 111)∨hit(h)[i])
tlbspc(h)[i] ≡ h.tlb.v(i)∧ ((invm(h) = 000)∧ inc(h)[i]).
Note, the definitions above are temporary and serve to present a simplified version of the
design. The final definitions of the latter signals are made later in this section, when we describe
dropping of the virtual machines and ragged hardware walks. We define two more auxiliary
signals in order to simplify the presentation below:
• a push (into entry i = 1) — indicates that data are written into the hardware TLB,
• a move-to (entry i > 1) — indicates that data are “moved” from the entry (i− 1) into the
entry i.
\[ push(h) \equiv store(h) \]
\[ mov2(h)[i] \equiv \bigwedge_{j<i} \bigl( h.tlb.v(j) \land /tlbhit(h)[j] \bigr) \]
Using these auxiliary signals, we easily define the inputs of the hardware walk and guest address
registers (x ∈ {w,g})
i = 1 : h.tlb.x(i).in = xin(h)
i > 1 : h.tlb.x(i).in = h.tlb.x(i−1)
and of the valid bits.
i = 1 : h.tlb.v(i).in ≡ push(h)
i > 1 : h.tlb.v(i).in ≡ push(h)∧mov2(h)[i]
Finally, for the clock-enable signals we split cases according to the initiated request. Note that
there is no point in clocking a data register if the input of the corresponding valid bit differs from
one.
\[
h.tlb.v(i).ce \equiv
\begin{cases}
tlbhit(h)[i] \lor h.tlb.v(i).in & store(h) \\
tlbhit(h)[i] \lor tlbspc(h)[i] & inval(h)
\end{cases}
\]
\[ h.tlb.x(i).ce \equiv h.tlb.v(i).in \]
Clearly, this basic construction refines the one presented in [LOP]. It provides equivalent
functionality, but supports extensions unavailable in our prior construction. This property
comes at a price: whenever the hardware TLB fills densely enough, a significant
portion of the registers is overwritten at once (!) on every store request.
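The effect of a store request on the basic design can be simulated directly from the register inputs and clock enables defined above. The following Python sketch is a behavioural illustration under assumptions, not the circuit itself: the class ShiftTlb, the record Walk with a single data field, and zero-based entry indices are hypothetical choices, and the operating condition inva = win.upa is applied to obtain tlba.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Walk:
    upa: int
    data: int = 0   # hypothetical stand-in for the remaining walk fields


class ShiftTlb:
    def __init__(self, n: int):
        self.w: List[Optional[Walk]] = [None] * n
        self.g: List[int] = [0] * n
        self.v: List[bool] = [False] * n

    def tlbset(self):
        return {w for w, v in zip(self.w, self.v) if v}

    def store(self, win: Walk, gin: int) -> None:
        n = len(self.v)
        tlba = win.upa  # operating condition: store -> inva = win.upa
        # Basic tlbhit with invm = 000: valid entry with a matching address.
        tlbhit = [self.v[i] and self.w[i].upa == tlba for i in range(n)]
        # mov2[i]: all entries before i are valid and not hit.
        mov2 = [all(self.v[j] and not tlbhit[j] for j in range(i))
                for i in range(n)]
        v_in = [True] + mov2[1:]          # push = store = 1 at the first entry
        old_w, old_g = list(self.w), list(self.g)
        for i in range(n):
            if v_in[i]:                   # x(i).ce = v(i).in
                self.w[i] = win if i == 0 else old_w[i - 1]
                self.g[i] = gin if i == 0 else old_g[i - 1]
            if tlbhit[i] or v_in[i]:      # v(i).ce on store
                self.v[i] = v_in[i]
```

In this model a store into a full TLB without a conflict shifts every entry by one position and drops the last one, whereas a store that conflicts with a stored walk shifts only the entries in front of the conflicting one, which is thereby overwritten.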
Dropping Virtual Machines
First we extend the basic TLB design to support the overloaded flusht instruction, which, when
executed at the level of guest, flushes all TLB translations of the running VM (see Sect. 3.3.6).
Recall, we select both the walk to be output on translation request and the walks to be invali-
dated on invalidation request using the internal hardware hit signal. Also recall, on translation
requests the invalidation mask is guaranteed to be all zeros (by the operating condition, see
Sect. 4.1.1), whereas on invalidation requests the mask is used to specify the invalidation query
further. So, in order to fulfill the hardware specifications, it suffices simply to adjust the former
computation of the tlbhit(h) signal appropriately.
\[
tlbhit(h)[i] \equiv h.tlb.v(i) \land
\begin{cases}
h.tlb.w(i).upa = tlba(h) & invm(h) = 000 \\
h.tlb.w(i).upa \in A_U(tlba(h).vm) & invm(h) = 011 \\
1 & invm(h) = 111
\end{cases}
\]
There are no changes in handling of the translation requests. The resulting hardware is com-
pletely trivial and depicted in Fig. 16.
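The adjusted hit computation can be sketched in a few lines of Python. This is an illustration under assumptions: the function names are hypothetical, the split of a 32-bit universal page address into vm (top four bits), pr (eight bits) and pa (twenty bits) is inferred from the comparator widths of the circuit, and same_vm is only a stand-in for membership in A_U(vm), whose exact definition is given elsewhere.

```python
def split_upa(upa):
    """Split a 32-bit address into (vm, pr, pa); field order is an assumption."""
    return (upa >> 28) & 0xF, (upa >> 20) & 0xFF, upa & 0xFFFFF


def select_tlba(trans, upa, inva):
    """tlba = upa on a translation request, inva otherwise."""
    return upa if trans else inva


def same_vm(w_upa, vm):
    # Hypothetical stand-in for "w_upa in A_U(vm)".
    return split_upa(w_upa)[0] == vm


def tlbhit(valid, w_upa, tlba, invm, in_AU):
    """Per-entry hit signal after the extension for dropping virtual machines."""
    if not valid:
        return False
    if invm == "111":                 # flush: every valid entry hits
        return True
    if invm == "011":                 # VM flush: entries of the selected VM hit
        return in_AU(w_upa, split_upa(tlba)[0])
    if invm == "000":                 # translation or single-page invalidation
        return w_upa == tlba
    raise ValueError("unsupported invalidation mask")
```

Since the mask is 000 on translation requests, the same function serves both the translation and the invalidation queries.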
Fig. 16: Circuit “tlbhit”
[Figure: comparators 4-eq (tlba.vm, w(i).vm), 8-eq (tlba.pr, w(i).pr), 8-zero (w(i).pr) and 20-eq (tlba.pa, w(i).pa), masked by the 4-bit signal invm′ and combined with v(i) into tlbhit[i]]
invm → invm′
000 → 0100
011 → 0011
111 → 1111
Dropping Ragged Walks
As the second extension we introduce a mechanism for dropping all the ragged walks from
the hardware TLB. In specifications of the hardware TLB (see definition of the ragset(h) in
Sect. 4.1.1) we require to drop all the user walks obtained through any matching guest walk
whose page address is invalidated. We can easily determine the ragged walk in the hardware
TLB by matching its guest page address (gpa) with the page address of the invalidation address
(tlba(h)). In particular, on per entry basis (i.e., for i ∈ [1 : N]) we deliberately compute the
following hardware signal.
rag(h)[i] ≡ (tlba(h).pr = 0^8) ∧ (h.tlb.w(i).upa ∈ A_U(tlba(h).vm)) ∧
((h.tlb.g(i) = tlba(h).pa) ∨ h.tlb.w(i).f_g)
This signal can already be used to drop the ragged hardware walks on invalidation request
(inval(h)). However, we want to reuse most of the definitions made in the previous section
and change the basic design as little as possible. We notice that the ragged walks are in a certain
sense “incomplete” without the invalidated walks. Since on inval(h) all incomplete walks are
necessarily dropped, we adjust the former computation of the tlbspc(h) signal s.t. it also includes
the ragged walks to be dropped.
tlbspc(h)[i]≡ h.tlb.v(i)∧ (invm(h) = 000)∧ (rag(h)[i]∨ inc(h)[i])
The corresponding circuit is depicted in Fig. 17. In the dotted boxes we highlight the hardware
parts which can be saved, i.e., their outputs can be taken from the hardware for computation of
the hit signal (tlbhit(h)).
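The per-entry signals inc, rag and tlbspc can be written down as plain predicates. The Python sketch below is an illustration under assumptions: the record Entry, the representation of the address space as the pair (vm, pr) and the predicate in_AU (a stand-in for membership in A_U(vm)) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    v: bool         # valid bit v(i)
    vm: int         # fields of w(i).upa
    pr: int
    pa: int
    complete: bool  # stands for w(i).l[0]
    fg: bool        # guest-faulty flag w(i).f_g
    g: int          # stored guest page address g(i)


def tlbspc(e, invm, tlba, in_AU):
    """Signal tlbspc[i] for entry e; tlba is the triple (vm, pr, pa)."""
    vm, pr, pa = tlba
    # inc[i]: same address space and incomplete walk
    inc = (e.vm, e.pr) == (vm, pr) and not e.complete
    # rag[i]: guest address invalidated, entry is a user walk of that VM and
    # was obtained through the invalidated guest page or is guest faulty
    rag = pr == 0 and in_AU(e, vm) and (e.g == pa or e.fg)
    return e.v and invm == "000" and (rag or inc)
```

The signal is only active for the mask 000, so flushes and VM flushes are still handled by tlbhit alone.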
Implementation Correctness
In the following trivial lemma we encapsulate the correctness of the TLB implementation.
Lemma 19. Assume that the invalidation request signal (inval(h)) is active.
invset(h) ⊆ {h.tlb.w(i) | tlbhit(h)[i]} (1)
ragset(h)∪ incset(h) ⊆ {h.tlb.w(i) | tlbspc(h)[i]} (2)
Proof of lemma 19.1. By case split on the value of the invalidation mask:
• invm(h) = 111. Dropping all walks.
invset(h) = tlbset(h)
= {h.tlb.w(i) | h.tlb.v(i)}
= {h.tlb.w(i) | tlbhit(h)[i]}
Fig. 17: Circuit “tlbspc”
[Figure: computation of inc[i] and rag[i] from w(i).vm, w(i).pr, w(i).f_g, w(i).ℓ[0], g(i), tlba.vm, tlba.pr, tlba.pa and the test tlba.pr = 0^8; tlbspc[i] combines both with v(i) and invm = 000]
• invm(h) = 011. Dropping virtual machines.
invset(h) = tlbset(h)∩{w | w.upa ∈ AU (inva(h).vm)}
= {h.tlb.w(i) | h.tlb.v(i)∧ (h.tlb.w(i).upa ∈ AU (inva(h).vm))}
= {h.tlb.w(i) | tlbhit(h)[i]}
• invm(h) = 000. Dropping individual walks.
invset(h) = tlbset(h)∩{w | w.upa = inva(h)}
= {h.tlb.w(i) | h.tlb.v(i)∧ (h.tlb.w(i).upa = inva(h))}
= {h.tlb.w(i) | tlbhit(h)[i]} □
Proof of lemma 19.2. Assume we have
invm(h) = 000
since otherwise there is nothing to show. For the incomplete walks we argue:
incset(h) = tlbset(h)∩{w | (w.as = inva(h).as)∧/w.ℓ[0]}
= {h.tlb.w(i) | h.tlb.v(i)∧ (h.tlb.w(i).as = inva(h).as)∧/h.tlb.w(i).ℓ[0]}
= {h.tlb.w(i) | h.tlb.v(i)∧ inc(h)[i]}.
For the ragged walks we assume also
inva(h).pr = 0^8
since otherwise there is nothing to show. Moreover, for the guest page address we show:
h.tlb.v(i) → gpa(h.tlb.w(i),h) = h.tlb.g(i).
Clearly we have
h.tlb.v(i) → h.tlb.w(i) ∈ tlbset(h).
Using the definition of the guest page address and invariant 2 we easily obtain
\begin{align*}
gpa(h.tlb.w(i),h) &= \varepsilon\{ h.tlb.g(j) \mid h.tlb.v(j) \land (h.tlb.w(j) = h.tlb.w(i)) \} \\
&= \bigvee_j \bigl( h.tlb.g(j) \land h.tlb.v(j) \land (h.tlb.w(j) = h.tlb.w(i)) \bigr) \\
&= h.tlb.g(i) \land h.tlb.v(i).
\end{align*}
Given the lines above we analogously argue:
ragset(h) = tlbset(h)∩{w | (w.upa ∈ A_U(inva(h).vm))∧
((gpa(w,h) = inva(h).pa)∨w.f_g)}
= {h.tlb.w(i) | h.tlb.v(i)∧ (h.tlb.w(i).upa ∈ A_U(inva(h).vm))∧
((h.tlb.g(i) = inva(h).pa)∨h.tlb.w(i).f_g)}
= {h.tlb.w(i) | h.tlb.v(i)∧ rag(h)[i]}.
Collecting the arguments we conclude:
ragset(h)∪ incset(h) = {h.tlb.w(i) | h.tlb.v(i)∧ (rag(h)[i]∨ inc(h)[i])}
= {h.tlb.w(i) | tlbspc(h)[i]}. □
4.2 Redesigning MMU
Of course, the major changes in hardware concern the implementation of the memory management
unit, the component responsible for address translation, for the nested translation. Since
we want to reuse the constructions developed previously in [LOP], our main challenge here is
to find a representation of the nested translation scheme s.t. the hardware we developed for the
address translation previously can be used as part of that scheme. We proceed as follows:
i) in Sect. 4.2.1 we identify those portions of hardware which can be reused, and explain
how to construct a nested MMU out of those pieces, then
ii) in Sect. 4.2.2 we implement the nested MMU in hardware, and finally
iii) in Sect. 4.3 we prove that our implementation is actually live.
4.2.1 Specification
Since in prospect we plan to replace the original MMU component from [LOP] by the nested
MMU developed here, we clearly want to keep the old interfaces unchanged, unless changes
are unavoidable. Thus, as the first and the most obvious step we add only two more inputs to
the original MMU interface, see Fig. 18:
• ptoU ∈ B^20: one more page table origin, for the nested address translation, and
• vmflush ∈ B: control signal to invalidate the entire VM (implements the flusht instruction
executed at the level of guest). The VM to be invalidated is chosen using the top-most four
bits of the inva input.
As another obvious step we change the original MMU component to work with the universal
addressing. This simple change can be described as follows: i) the width of address to be trans-
lated increases by 12 bits, and ii) all internal MMU circuits adjust to fit the new address width
s.t. changes propagate through the entire design. Taking these modifications into account, we
finish specification of the MMU interfaces.
In addition to the user page table origin (ptoU) we specified above, the translation interface
includes the following inputs and outputs:
• input treq ∈ B: translation request, control signal to start the translation,
• input upa ∈ B^32: universal page address, the virtual address to be translated,
• input ptoG ∈ B^20: page table origin, page address of the root page table to be used for
translation,
• output busy ∈ B: MMU busy, control signal indicating that the MMU is currently
translating and its output is not yet meaningful, and
• output wout ∈ B^60: walk output, the walk returned for the virtual address (upa).
The invalidation interface comprises the aforementioned special input vmflush and all signals
from the corresponding interface of the simple MMU:
• input flush ∈ B: flush, control signal to drop all stored translations,
• input invlpg ∈ B: invalidate page, control signal to drop a translation, and
• input inva ∈ B^32: invalidation address, universal address of translations to be dropped.
The memory interface of the nested MMU repeats the corresponding interface of the simple
MMU. For details we refer to [LOP].
Fig. 18: Interface of MMU for nested translation
[Figure: block mmu with inputs treq, invlpg, flush, vmflush, upa (32), inva (32), ptoG (20), ptoU (20), outputs wout (60), busy, and memory interface signals mreq, ma (32), mbusy, mout (64)]
Just as for the TLB component above, we collect all properties of the nested MMU component.
Overall, most properties of the original MMU repeat for the nested MMU. For compatibility
with the original MMU we use the same naming as we used for the common interface signals.
• Formally, we define the nested MMU h.mmu as follows5:
h.mmu.tlb ∈ [1 : N] → K_uwalk × B^20 × B
h.mmu.wG ∈ K_uwalk
h.mmu.wU ∈ K_uwalk.
• The MMU component is busy in hardware configuration h if and only if i) there is a
translation request in configuration h and ii) there is no hit for the given universal page address
in the hardware TLB of configuration h.
busy(h)↔ treq(h)∧/tlb.hit(h)
Later we show that the MMU component is busy in hardware configuration h if and only
if either i) the internal walk register (wU ) is being initialized, or ii) a memory access takes
place, or iii) a so-called nested translation access takes place, or iv) the hardware TLB is
being written in configuration h.
• On a hit in the hardware TLB in hardware configuration h, the walk output of the MMU
component in configuration h matches the given universal page address. Moreover, the
latter walk is either faulty, or complete, or both. Therefore, it can be used for translated
access.
tlb.hit(h)→ wout(h).upa = upa(h)
tlb.hit(h)→ f(wout(h))∨wout(h).ℓ[0]
• We finish with the specification of the invalidation interface of the nested MMU. The following
lines are self-explanatory; h and h′ below denote resp. the current and the resulting hardware
configurations.
flush(h)→ tlbset(h′) = ∅
vmflush(h)→ tlbset(h′) = tlbset(h)\{w | w.upa ∈ A_U(inva(h).vm)}
invlpg(h)→ tlbset(h′) = tlbset(h)\{w | (w.upa = inva(h)) ∨
((w.as = inva(h).as)∧/w.ℓ[0]) ∨
((inva(h) ∈ A_G)∧ rag(w,h))}
In the last line above we used the following abbreviation.
rag(w,h) ≡ (w.upa ∈ A_U(inva(h).vm))∧ ((gpa(w,h) = inva(h).pa)∨w.f_g)
5 As we already mentioned before (in Sect. 4.1.1), in hardware we additionally store with every user
walk a page address of the guest walk through which the user walk was obtained. Details of this we
cover later, in Sect. 4.2.2.
The resulting construction we call an MMU for the nested translation or nested MMU for
simplicity. In the remainder of this chapter we consider only the nested MMU(s), so often we
drop the word nested and simply use the name “MMU” to refer to this construction.
4.2.2 Construction
To start the section we describe how to construct the nested MMU in a rather sketchy fashion.
In a nutshell, the nested MMU consists of the following sub-components:
• one hardware TLB for universal addressing — mmu.tlb, and
• two universal walk registers — mmu.wG and mmu.wU .
Its overall (very general) construction scheme is depicted in Fig. 19. Conceptually, the nested
MMU component there is split into two parts. The first part (on the left) is responsible for
translation of the user addresses, whereas the second part (on the right) — for translation of the
guest addresses. Each part contains i) a walking unit (see below), which is directly connected
to ii) a universal walk register, and iii) a small circuit (a multiplexer) computing the address
of the page table entry to be accessed within the process of walk extension. Apart from that, the left
part contains another small circuit for computation of the walk composition. Since the same
sub-components occur at both sides, we naturally introduce the following indexing: names of
all components — and signals they generate — occurring on the left we index using capital
U , whereas names of components on the right are indexed using capital G. Note that the TLB
component is accessed whenever necessary by both sides (in turns), but does not belong to
either of them.
For technical reasons we distinguish between the requests for translation of the user addresses
and the requests for translation of the guest addresses. The former ones we call nested transla-
tion requests, whereas the latter ones — simple translation requests.
nreq(h) ≡ treq(h)∧ (upa(h) ∈ AU )
sreq(h) ≡ treq(h)∧ (upa(h) ∈ AG)
Clearly, the left hand side of the MMU component is designed to handle the nested translation
requests, whereas the right hand side — to handle the simple translation requests. In case of a
nested translation request and in the absence of hits in the hardware TLB — after initialization
of the user walk register (mmu.wU ) — another translation request is initiated: a simple transla-
tion request to the right hand side of the MMU component. At that point the regular translation
scheme is used: in the absence of hits in the hardware TLB, the guest walk register (mmu.wG)
is initialized and the ordinary translation starts. When the latter translation ends, the internal
translation call returns, which in turn completes one step of the nested walk extension.
Next in this section we cover the details of the exact implementation of the nested MMU in
hardware. To be clear, in our hardware implementation we pursue the following goal: in the absence
of nested translation requests we want the nested MMU to behave exactly (!) like the simple
MMU6. Taking into account that control over the simple MMU was performed via the
component’s internal state machine, we decide to include the latter machine into the nested MMU,
as a separate sub-component, entirely and without changes. Novel capabilities of the nested
MMU are required only in case of nested translations. Therefore, in case of the simple
translation requests, the control automaton (CA) of the nested MMU should delegate these requests
to the CA of the simple MMU and remain in its idle state. As simple as that. We structure the
content of this section as follows:
• we start with the presentation of the control automaton (of the nested MMU),
• then, we describe the implementation of the component’s data paths, and
• in the end, we connect interfaces of the involved sub-components and memory.
6 This will simplify the future correctness proof considerably.
Fig. 19: Hardware layout of the MMU for nested translation. To distinguish between the addresses used
by the two components we additionally suffixed the ptea in the construction above. For the simple MMU
we denoted the address by pteaG, since in the first case we extend the guest walk (wG). For the nested
MMU, though we actually extend the composition of the user walk (wU) and the current guest walk (wG),
we denoted the address by pteaU for simplicity.
[Figure: two walking units (wunit) with walk registers wU and wG, multiplexers computing pteaU and pteaG from ptoU, ptoG, pteU and pteG, a composition circuit for wU ◦ wG, and the shared tlb with output tlb.wout (60)]
Before we proceed to definitions, we discuss the notation we use below. First, in the scope
of this section all signals are defined w.r.t. the current hardware configuration. So, in order
to make the forthcoming definitions “lighter”, we drop the h, meaning that all signals below
belong to the same hardware configuration. For instance, the predicate nreq(h), which we
defined above, can be rewritten simply as
nreq ≡ treq∧ (upa ∈ AU ).
Also, all signals which are used in the short form (without mentioning the component they
belong to) are implicitly treated to belong to the nested MMU. In the definition above the upa
is meant to be the input address of the nested MMU, i.e., mmu.upa. Signals belonging, for
instance, to the TLB sub-component we distinguish by using the long form, i.e., simply by
prefixing them with tlb.
For convenience we introduce a shorthand to abbreviate the state of the nested MMU in which
both its control automata reside in their idle states.
idle ≡ idleU ∧ idleG
Fig. 20: Control automaton of the nested MMU for simple translation
[Figure: states idleG (start), fetch-pteG and write-tlbG; labeled transitions t1, t2 and t3]
Before we proceed with implementation of the nested MMU, we add a mechanism to restore
the latter state after losses of the translation requests (due to interrupts). For that we add one
more control flag (abort), which we implement as an ordinary set-clear flip-flop.
abort.set = abort ∧ idle∧ treq
abort.clr = abort ∧ idle∨ reset
Control Automaton of Simple MMU
The simple control automaton is depicted in Fig. 20. Exactly this construction was used in
[LOP] to control the (simple) MMU. For convenience we keep the original names of control states
and labels of transitions. Note that we use suffix G to distinguish control states of the simple
automaton from control states of the nested automaton, which we denote using suffix U. The
purpose of the simple automaton also remains unchanged: within the nested MMU it controls
the simple translation calls (in state fetch-pte) and stores the acquired simple translations in the
TLB (state write-tlb).
We explain the transitions next. Though we mention certain things to improve the presentation,
here we define only (!) conditions when a particular transition is taken. Everything else we
define later in dedicated sections.
• The nested CA, which now “wraps” the simple CA, initiates a simple translation sub-
request in two cases: i) as the nested translation call in state nested-call, and ii) if the
simple address translation is performed (at the level of guest). In the presence of the abort
signal, the request for the simple translation is raised only if both control automata reside
in their idle states (idle).
treqG ≡ (nested-call∧nreq∨ sreq)∧ (abort→ idle)
Of course, we could simply wait for the absence of the abort, but that would take an extra
hardware cycle. Instead, we raise the simple translation request as soon as the abort clear
signal is active.
In the absence of reset, the automaton leaves state idleG only if a simple translation request
leads to a miss in the hardware TLB.
(t1)≡ idleG∧/reset ∧ treqG∧/tlb.hit
• In state fetch-pteG, as the state’s name suggests, the memory is accessed in order to
read the page tables. In total there are three transitions leaving state fetch-pteG. The first
one (t2) is a self-loop. It implements the process of busy waiting for the memory access
to finish. The second transition (t3) is included to make sure that the control
automaton is initialized properly on reset and to return control to state idleG after
aborts.
(t2) ≡ fetch-pteG∧/reset ∧mbusy
(t3) ≡ fetch-pteG∧ (reset ∨/mbusy∧/treqG)
Definitions of transitions t2 and t3 are self-explanatory. The remaining transition (nameless
one) is taken “otherwise”, i.e., if none of the transitions above is taken. In the latter case,
control is brought to state write-tlbG, where a faulty or complete walk (wG) is written into
the hardware TLB.
• Finally, there is only one transition coming out of state write-tlbG. It returns control to
state idleG while the hardware TLB is being written. Every process of simple translation
which was not aborted terminates via this transition.
Fig. 21: Control automaton of the nested MMU for nested translation
[Figure: states idleU (start), nested-call, fetch-pteU and write-tlbU; labeled transitions t1 to t8]
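The transition conditions of the simple control automaton can be collected into a next-state function. The following Python sketch is an illustration, not the hardware: the function names and the string encoding of states are hypothetical, and the target of transition t1 (state fetch-pteG) is an assumption, since the text above only states when the automaton leaves idleG.

```python
def treq_G(nested_call, nreq, sreq, abort, idle):
    """treqG = (nested-call AND nreq OR sreq) AND (abort -> idle)."""
    return ((nested_call and nreq) or sreq) and ((not abort) or idle)


def simple_ca_next(state, reset, treqG, tlb_hit, mbusy):
    """Next state of the simple control automaton for one hardware cycle."""
    if state == "idleG":
        t1 = (not reset) and treqG and (not tlb_hit)
        # Assumption: t1 leads to fetch-pteG.
        return "fetch-pteG" if t1 else "idleG"
    if state == "fetch-pteG":
        if (not reset) and mbusy:                    # t2: busy waiting
            return "fetch-pteG"
        if reset or ((not mbusy) and (not treqG)):   # t3: reset or lost request
            return "idleG"
        return "write-tlbG"                          # nameless transition
    if state == "write-tlbG":
        return "idleG"                               # TLB is written, then idle
    raise ValueError("unknown state: " + state)
```

Stepping this function with a constant request and a memory that is busy for a few cycles yields the sequence idleG, fetch-pteG (repeated while mbusy holds), write-tlbG, idleG.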
Control Automaton of Nested MMU
Since internally the nested MMU is more complex than the simple MMU, it has a more
complicated control mechanism (as can be seen in Fig. 21). The automaton there has one state
more than the control automaton of the simple MMU. We have to account for one more
state (nested-call in the figure) to implement the nested translation calls. These calls are used
for translation of the (virtual) base addresses of the user walk (wU), since, as might be
concluded already, in the walk registers we store the hardware counterparts of the specification
user (wu) and guest (wg) walks. The remaining states acquire the meanings of those states
(of the simple MMU) whose names they repeat. In short: state fetch-pteU is used for reading
page tables and state write-tlbU is used for writing the result of nested translation
into the TLB. Transitions of the nested automaton are described below.
• Clearly, under the nested translation request, one should first check if the hardware TLB
contains a walk which translates the requested address (upa). Same as the treqG, in the
presence of the abort signal, the request for the nested translation is raised only if both
control automata reside in their idle states (idle). To end-up with more uniform notation,
we introduce a corresponding shorthand.
treqU ≡ nreq∧ (abort→ idle)
So, in the absence of reset, the automaton leaves state idleU only if a nested translation
request leads to a miss in the hardware TLB.
(t1)≡ idleU ∧/reset ∧ treqU ∧/tlb.hit
• Four transitions in total leave state fetch-pteU . The first two are analogous to the corre-
sponding transitions in the simple control automaton. Transition number three (t4) brings
control to the new state, nested-call, where translation of the current base address of user
walk (wU ) is requested. Below, the first three transitions are specified formally.
(t2) ≡ fetch-pteU ∧/reset ∧mbusy
(t3) ≡ fetch-pteU ∧ (reset ∨/mbusy∧/treqU )
(t4) ≡ fetch-pteU ∧/reset ∧/mbusy∧ treqU ∧/f(wextU)
Transition t4 is taken in the absence of reset, only if the result of walk extension (wext)
is not faulty. The remaining transition (nameless) is taken “otherwise”, i.e., if none of the
transitions above is taken. Intuitively, it aborts the process of nested translation on a page
fault. In the latter case, control is brought to state write-tlbU , where a faulty (in this case)
composed walk (wU ◦wG) is written into the hardware TLB.
• In turn, in state nested-call a nested translation request to the CA of the simple MMU (simple
CA) is initiated. Again, there are four transitions in total coming out of state nested-call.
And again, the first transition (t5) is a self-loop for busy-waiting, but this time one waits for
the simple CA to finish its operation. Similarly to transition t3 above, the second transition
(t6) is technical. It brings control back to state idleU on reset. The third transition (t7) here
returns control to state fetch-pteU, where the non-faulty and still incomplete user walk (wU) is
extended. Formally, the first three transitions are specified below.
(t5) ≡ nested-call∧/reset ∧busyG
(t6) ≡ nested-call∧ (reset ∨/busyG∧/treqU )
(t7) ≡ nested-call∧/reset ∧/busyG∧ treqU ∧/f(wU, tlb.wout)∧/wU.ℓ[0]
Recall, in Sect. 3.2.1 we defined a pair of walks to be faulting as follows.
f (wu,wg) ≡ f (wu)∨ f (wg)∨ f (wg | wu)
Again, the definitions of t5 and t6 are self-explanatory. Transition t7 is taken in the absence
of reset (as usual), only if the result of nested translation (tlb.wout) is not faulting (see
Sect. 3.3.3) and, moreover, the user walk (wU ) is incomplete. “Otherwise” (if none of the
transitions above is taken), the remaining transition (nameless one) brings control to state
write-tlbU . This happens in two cases: either i) the user walk (wU ) and the resulting guest
walk (tlb.wout) are faulting, or ii) otherwise, simply when the user walk (wU ) is already
complete. In the first case the nested translation “fails”, and a faulty composed walk (wU ◦
wG) is written into the hardware TLB, whereas in the second case the nested translation
“succeeds”, and a non-faulty composed walk (wU ◦wG) is written into the hardware TLB.
Though it is not strictly necessary, for convenience we give a name to the fourth transition
in this case and specify it formally as follows.
(t8) ≡ nested-call ∧ /reset ∧ /busyG ∧ treqU ∧ (f(wU , tlb.wout) ∨ wU.ℓ[0])
• Analogous to the simple control automaton, one transition is coming out of state write-tlbU .
It returns control to state idleU while the hardware TLB is being written. Every process of
nested translation which was not aborted terminates via this transition.
Walking Units
The walking unit of the simple MMU is the basis for the construction of the nested walking unit,
i.e., the walking unit of the nested MMU. The only difference is, literally, in the modified circuits
for computation of the walk initialization and extension functions. The resulting hardware is
schematically depicted in Fig. 22. Note that the widths of the inputs and outputs changed as well,
according to the new definitions (see Sect. 3.3.1).
Fig. 22: Circuit for walk creation (initialization and extension, data paths)
Under control of signal winit we select either the output of the initialization circuit (winit in
the figure) or the output of the walk extension circuit (wext in the figure). In the designs that
follow, this signal is provided by a corresponding control automaton, which actually controls the
walking unit. For the guest walking unit this signal is provided by the simple control automaton,
whereas for the user walking unit it is provided by the nested control automaton. For convenience,
we also introduce signal wext which, as the name suggests, indicates that a walk extension
is performed. The definitions are identical for both walking units: winit is raised whenever
transition (t1) is taken; wext is raised whenever the memory access finishes while the translation
is not aborted.
winit(wunitX ) ≡ idleX ∧ (t1)X
wext(wunitX ) ≡ fetch-pteX ∧/mbusy∧ treqX
Data Paths of Nested MMU
We manage to complete the construction of nested MMU with very little effort. All that is left
to do are the connections of the two walking units with the corresponding walk registers and
very few extra circuits, which we present next, and connections of the interfaces of components
we use, which we present immediately after.
In case of the nested call, the right side of the nested MMU performs translation of the current
base address of the user walk (wU ), whereas in case of a simple address translation, it translates
the universal page address (upa) provided to the nested MMU as an external input.
\[ wunit_G.upa = \begin{cases} uba(w_U) & \text{nested-call} \\ mmu.upa & \text{otherwise} \end{cases} \]
In the lines above we used the universal base address (uba) of a universal user walk.
w.upa ∈ AU → uba(w) = w.vm ◦ 0^8 ◦ w.ba
The latter address is fed into the guest walk register on nested translation calls. For intuition we
refer to Fig. 23. Either way, the page table origin used by the guest walking unit is, obviously,
taken directly from the nested MMU inputs.
wunitG.pto = mmu.ptoG
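The address selection above can be sketched in a few lines of Python, with bit strings modelled as integers. This is only an illustration: the field widths (4-bit vm, 8 zero bits, 20-bit ba, giving a 32-bit address) are read off Fig. 23, and the function names `uba` and `wunit_g_upa` as well as the integer encoding are assumptions of this sketch, not part of the construction.

```python
VM_BITS, PAD_BITS, BA_BITS = 4, 8, 20  # widths assumed from Fig. 23

def uba(vm: int, ba: int) -> int:
    """Universal base address vm ◦ 0^8 ◦ ba as a 32-bit integer."""
    assert 0 <= vm < (1 << VM_BITS) and 0 <= ba < (1 << BA_BITS)
    return (vm << (PAD_BITS + BA_BITS)) | ba

def wunit_g_upa(nested_call: bool, w_u_vm: int, w_u_ba: int, mmu_upa: int) -> int:
    """Address input of the guest walking unit (hypothetical signature)."""
    # On a nested call the base address of the user walk is translated,
    # otherwise the external input of the nested MMU is passed through.
    return uba(w_u_vm, w_u_ba) if nested_call else mmu_upa
```

For instance, `uba(0xA, 0x12345)` places the VM identifier in the top four bits and leaves the eight padding bits at zero.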
The input of the guest walk register is connected to the outputs of the TLB component and the
guest walking unit. The output of the TLB component is selected (by the multiplexer) once the
TLB component contains the requested translation.
\[ w_G.in = \begin{cases} tlb.wout & tlb.hit \\ wunit_G.wout & \text{otherwise} \end{cases} \]
This is done in order to flatten the differences between the ways in which translation of the
uba(wU ) is obtained, and thus to simplify the correctness proof in Chap. 5. In case the output
translation is obtained in the translation process, the control mechanism of the simple MMU
guarantees that the guest walk register contains this translation. Otherwise, if the translation is
output immediately and no prior translation process is done, the guest walk register contains
some obsolete data (from the previous translations), unless the TLB output is written there.
Fig. 23: Initialization of the guest walk register before a nested call
For technical reasons we distinguish between the special and standard clocking of the guest
walk register7.
wG.ce ≡ wG.cespc∨wG.cestd
As the naming suggests, the special clock enable signal handles the case in which the TLB
output is clocked-in, whereas the standard clock enable signal, which is equivalent to the cor-
responding signal from [LOP], covers the remaining cases.
wG.cespc ≡ nested-call∧ tlb.hit
wG.cestd ≡ winit(wunitG)∨wext(wunitG)
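To make the interplay of the two clock enable signals and the input multiplexer concrete, we give a minimal Python sketch of one clock cycle of the guest walk register. Walks are treated as opaque values; the function name and its keyword parameters are hypothetical and merely mirror the signals defined above.

```python
def guest_walk_register_next(w_g, *, nested_call, tlb_hit, tlb_wout,
                             winit_g, wext_g, wunit_g_wout):
    """Value of the guest walk register after one clock cycle."""
    ce_spc = nested_call and tlb_hit   # special: clock in the TLB output
    ce_std = winit_g or wext_g         # standard: walk initialization or extension
    if not (ce_spc or ce_std):
        return w_g                     # not clocked, register keeps its value
    # input multiplexer wG.in: TLB output on a hit, walking unit otherwise
    return tlb_wout if tlb_hit else wunit_g_wout
```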
The guest walk register is fed (back) into the walking unit
wunitG.win = wG
and into another small circuit (a multiplexer) which computes the address of the guest page
table entry used to extend the current guest walk.
\[ ptea_G = w_G.ba \circ \begin{cases} w_G.px_2 \circ 0^2 & w_G.\ell[2] \\ w_G.px_1 \circ 0^2 & \text{otherwise} \end{cases} \]
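As an illustration of this address computation, the following Python sketch concatenates the fields as integers. We assume here that the base address has 20 bits and each page index 10 bits, so that ba ◦ px ◦ 00 is a 32-bit address; the function names are chosen for this sketch only.

```python
def ptea(ba: int, px: int) -> int:
    """Page table entry address ba ◦ px ◦ 00 (20 + 10 + 2 bits, assumed widths)."""
    assert 0 <= ba < (1 << 20) and 0 <= px < (1 << 10)
    return (ba << 12) | (px << 2)

def ptea_g(ba: int, px2: int, px1: int, level2: bool) -> int:
    """Select the page index by the level bit ℓ[2] of the guest walk."""
    return ptea(ba, px2 if level2 else px1)
```

The two appended zero bits make every entry address word-aligned, since page table entries are 32 bits wide.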
Also, the guest walk register is (indirectly) connected to the walk input of the TLB component
s.t. the guest walk can be stored in the TLB cache. The details about connection to the TLB
interface are considered below in this section.
We repeat the definitions above for connections of the user walking unit (on the LHS of the
MMU component). They turn out to be considerably simpler and nearly self-explaining.
wunitU .upa = mmu.upa
wunitU .pto = mmu.ptoU
wunitU .win = wU
The universal page address and the page table origin are taken directly from the inputs of the
nested MMU. Note that the user walking unit inputs the user page table origin (ptoU ) instead
of the one for guests (ptoG), which in turn is obvious. The walk input of the user walking unit
is taken directly from the user walk register.
7 Stepping of the TLB component will be associated with the clocking of the walk registers, same as
in [LOP]. At the same time, updates of the guest walk register on the special clock enable signals
should not produce any specification steps.
The user walk register (wU ) receives data only from the user walking unit. The register
is clocked on initialization of the user walk and whenever the memory access (user walk
extension phase) finishes in the presence of a translation request.
wU .in = wunitU .wout
wU .ce ≡ winit(wunitU )∨wext(wunitU )
The user page table entry address computed on the LHS of the MMU component depends also
on the current guest walk (wG). Formally, the circuit inputs the composition of two walks (user
and guest) and outputs the address as specified below.
\[ ptea_U = w_G.ba \circ \begin{cases} w_U.px_2 \circ 0^2 & w_U.\ell[2] \\ w_U.px_1 \circ 0^2 & \text{otherwise} \end{cases} \]
The memory is accessed at this address (pteaU ) and at the guest page table entry address
(pteaG) resp. in the control states fetch-pteU and fetch-pteG. This implements the read of the
user’s and guest’s page tables resp. Note that the accesses are coming from both sides of the
MMU. Since the MMU component is connected only to one port of the memory system, with
a simple separation of accesses we can define a single page table entry address from the other
two as follows.
\[ ptea = \begin{cases} ptea_U & \text{fetch-pte}_U \\ ptea_G & \text{otherwise} \end{cases} \]
Then a single memory data output — page table entry (pte) — is available to both sides of the
nested MMU and fed into the user and guest walking units.
pteU = pteG = pte
The memory interface does not change. We define it later in this section.
Finally, we specify outputs of the nested MMU component. Data output does not change and
remains simply the data output of the hardware TLB.
mmu.wout = tlb.wout
But simplicity in the definition above is misleading. Recall, now the hardware TLB is used
(alternately) by two different state machines, namely by the control automata of the simple and
the nested MMU. In fact, both automata share the same output (tlb.wout) of the hardware TLB
for their translation look-ups. For the MMU busy signal we rely on two busy signals coming
from both sides of the nested MMU. Whenever at least one of these signals is raised the entire
component becomes busy.
mmu.busy ≡ busyU ∨busyG
The busy signal coming from the user/guest side of the nested MMU is produced by the re-
spective state machine as follows.
busyX ≡ /idleX ∨ (t1)X
Note, in case of a simple (not nested) translation, which keeps the nested control automaton
in its idle state (idleU ), the nested MMU remains busy until the simple translation request is
served.
Below we cover connections of interfaces. These connections are straightforward, so we ex-
plain them rather in a formal manner and give only brief explanations of the corresponding
formulas.
Memory Interface
The nested MMU starts the memory access in the following two situations: i) when the current
user walk (wU ) is being extended, and ii) when the current guest walk (wG) is being extended.
mreq ≡ mreqU ∨mreqG
In the first case, the memory is accessed at the address pteaU by the nested control automa-
ton, which resides in state fetch-pteU , as we defined previously in this section. In the second
case, the memory access is performed to the address pteaG and initiated by the simple control
automaton, which resides in state fetch-pteG (same as in [LOP]). In both cases the memory is
accessed at the address ptea, as we defined above in this section.
ma = ptea
Formally, we select the correct memory word using the third bit of the memory address and
notation from Sect. 1.1.
\[ pte = \begin{cases} mout_H & ma[2] \\ mout_L & \text{otherwise} \end{cases} \]
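The word selection can be sketched as follows. We assume, for the purpose of this sketch only, that the memory output is a 64-bit word whose upper and lower 32-bit halves correspond to moutH and moutL; the notation of Sect. 1.1 is not reproduced here, so this reading is an assumption.

```python
def select_pte(mout: int, ma: int) -> int:
    """Pick the 32-bit page table entry out of a 64-bit memory word."""
    mout_h = (mout >> 32) & 0xFFFFFFFF  # upper half (assumed meaning of moutH)
    mout_l = mout & 0xFFFFFFFF          # lower half (assumed meaning of moutL)
    # bit 2 of the memory address distinguishes the two words
    return mout_h if (ma >> 2) & 1 else mout_l
```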
TLB Interface
Next we connect the hardware TLB, which we designed in Sect. 4.1 specifically to support the
nested translation.8 The interface of the hardware TLB was depicted in Fig. 14. Conveniently
we split it into three parts, according to the form of performed access: i) translation look-up,
ii) translation registering, and iii) translation invalidation. Connections to each of these parts
are described below. Recall, at most one of these parts can be used at a time, i.e., the hardware
TLB cannot handle simultaneous requests to different parts of its interface.
i) Requests for translation look-ups are coming from both sides of the nested MMU
component. In both cases the request is initiated only when the corresponding control
automaton resides in its idle state (idleU resp. idleG).
tlb.trans≡ idleU ∧ treqU ∨ idleG∧ treqG
In the first case, under the nested translation request, the hardware TLB is checked to
contain a translation for input of the user walking unit (upaU ). In turn, in the second case,
under the simple translation request, the hardware TLB is checked to contain a translation
for input of the guest walking unit (upaG).
\[ tlb.upa = \begin{cases} upa_G & treq_G \\ upa_U & \text{otherwise} \end{cases} \]
ii) Writes are simple. The hardware TLB is written either i) in state write-tlbU of the nested
control automaton or ii) in the counterpart control state write-tlbG of the simple control
automaton. Note, writes are performed only in case the translation request was not aborted
(see Sect. 4.3).
tlb.store≡ write-tlbU ∧ treqU ∨write-tlbG∧ treqG
In the first case, composition of the current user (wU ) and guest (wG) walks is placed into
the hardware TLB. In the second case, the hardware TLB is fed with the guest walk (wG)
as before.
8 Due to certain limitations of the construction developed in [LOP] (namely, due to the absence of
ASIDs and mechanisms to invalidate multiple translations at a time), there was a necessity to redesign
the hardware TLB in order to make the nested translation possible.
\[ tlb.win = \begin{cases} w_U \circ w_G & \text{write-tlb}_U \\ w_G & \text{otherwise} \end{cases} \]
Additionally we (always) place into the hardware TLB the page address of the current
guest walk (wG). Details of the latter we covered in Sect. 4.1.1.
tlb.gin = wG.pa
iii) Finally, we describe the invalidation part. Obviously, one only activates the invalidation
signal of the hardware TLB if there is a corresponding processor request to the nested
MMU component. Recall that our construction supports the following three invalidation
mechanisms: i) flush, when all of the cached translations are dropped, ii) vmflush, when
only translations for a certain (input) VM are dropped, and iii) invlpg, when only
translations for a certain (input) universal page address are dropped.
tlb.inval ≡ flush ∨ vmflush ∨ invlpg
In hardware we distinguish between the cases above using the invalidation mask.
\[ tlb.invm = \begin{cases} 111 & flush \\ 011 & vmflush \wedge /flush \\ 000 & \text{otherwise} \end{cases} \]
The address or VM to invalidate is passed through the invalidation address. Unless in-
validation of the hardware TLB is requested, for technical reasons we always pass the
universal page address of the input walk (tlb.win) as a “victim” address to meet the oper-
ating conditions of the hardware TLB (see Sect. 4.1.1).
\[ tlb.inva = \begin{cases} mmu.inva & tlb.inval \\ tlb.win.upa & \text{otherwise} \end{cases} \]
Note, in case of vmflush, the ID of the “victim” VM is passed through the upper portion of the
invalidation address, i.e., tlb.inva.vm.
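The three invalidation inputs of the hardware TLB can be summarized in a short Python sketch. The mask values are taken from the case distinction above; the function name, the representation of the mask as a string, and the use of plain integers for addresses are assumptions made for illustration.

```python
def tlb_invalidation_inputs(flush, vmflush, invlpg, mmu_inva, tlb_win_upa):
    """Return (inval, invm, inva) as driven to the hardware TLB."""
    inval = flush or vmflush or invlpg
    if flush:
        invm = "111"
    elif vmflush:            # vmflush without a simultaneous flush
        invm = "011"
    else:
        invm = "000"
    # without an invalidation request the page address of the input walk
    # is passed as the "victim" address
    inva = mmu_inva if inval else tlb_win_upa
    return inval, invm, inva
```

Note that a simultaneous flush and vmflush is resolved in favour of flush, exactly as the condition vmflush ∧ /flush prescribes.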
Implementation Correctness
In the following simple lemma we justify our computation of the page table entry addresses
used internally by i) the simple and ii) the nested MMU.
Lemma 20.
pteaG(h) = ptea(wG) (1)
pteaU (h) = ptea(wU ◦wG) (2)
Proof of lemma 20.1. For the page table entry address of the simple MMU we show:
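\begin{align*}
ptea_G(h) &= w_G.ba \circ \begin{cases} w_G.px_2 \circ 00 & w_G.\ell[2] \\ w_G.px_1 \circ 00 & \text{otherwise} \end{cases} \\
&= \begin{cases} ptea(w_G.ba, w_G.px_2) & w_G.\ell[2] \\ ptea(w_G.ba, w_G.px_1) & \text{otherwise} \end{cases} \\
&= ptea(w_G.ba, w_G.px_{level(w_G)}) \\
&= ptea(w_G). \qquad \Box
\end{align*}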
In the proof lines below we use the shorthand
wN ≡ wU ◦wG.
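Proof of lemma 20.2. For the page table entry address of the nested MMU we argue similarly:
\begin{align*}
ptea_U(h) &= w_G.ba \circ \begin{cases} w_U.px_2 \circ 00 & w_U.\ell[2] \\ w_U.px_1 \circ 00 & \text{otherwise} \end{cases} \\
&= (w_U \circ w_G).ba \circ \begin{cases} (w_U \circ w_G).px_2 \circ 00 & (w_U \circ w_G).\ell[2] \\ (w_U \circ w_G).px_1 \circ 00 & \text{otherwise} \end{cases} \\
&= \begin{cases} ptea(w_N.ba, w_N.px_2) & w_U.\ell[2] \\ ptea(w_N.ba, w_N.px_1) & \text{otherwise} \end{cases} \\
&= ptea(w_N.ba, w_N.px_{level(w_N)}) \\
&= ptea(w_U \circ w_G). \qquad \Box
\end{align*}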
4.3 Liveness
In this section we show that the implementation is live, i.e., that translation requests are eventu-
ally served by the nested MMU. In order to develop some intuition, first we consider a typical
use case: translation of a user address. Starting from state idleU , we go through the remaining
states of the nested MMU and informally argue about the control mechanism. Note, we assume
the absence of reset.
i) In case of the nested translation request (treqU ), the control automaton of the nested MMU
leaves its idle state on transition (t1) if the request cannot be served immediately from the
TLB cache (/tlb.hit); control goes to state nested-call, while the user walk register (wU )
is initialized with the requested upa as the page address and the user PTO (ptoU ) as the
base address.
ii) In state nested-call a translation request to the simple CA is raised. The request is for
translation of the base address of the user walk register (ptoU ). Liveness of the simple CA
is assumed at this stage.
iii) When serving of the simple request finishes, control either returns to state fetch-pteU or
exits into state write-tlbU . Transition (t8) to state write-tlbU is taken in case of a fault
of the nested translation (see Sect. 3.3.3) or simply when translation of the user address
(upa) successfully finishes.
iv) In state fetch-pteU the memory access is performed with the aim of extension of composed
walk wU ◦wG (see Sect. 3.2.2). Liveness of the memory system is assumed at this stage.
v) Once the walk extension in state fetch-pteU is finished, we check the faultiness of the
extended walk (wextU ) and update the user walk register (wU ). In case the extension
was faulty ( f (wextU )), control exits into the write-tlbU state. Otherwise, a new round of
translation starts in state nested-call.
vi) Last, in state write-tlbU the translation process finishes and control returns to state idleU .
In order to show things formally we require somewhat more technical definitions. First of all,
to shorten the forthcoming proofs, we abbreviate functions Z of hardware configuration h in
cycle t as
Z(t) = Z(ht).
Hardware cycles t in which the process of address translation starts resp. ends are identified by
the following predicates.
t-start(t) ≡ idle(t)∧/idle(t+1)
t-end(t) ≡ idle(t)∧/idle(t−1)
Analogously we identify hardware cycles t in which the process of address translation
terminates due to the loss of the translation request. In such cases we say that the translation
process was aborted.
t-abort(t)≡ treq(t−1)∧busy(t−1)∧/treq(t)
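The three predicates can be evaluated on a recorded signal trace. The sketch below models each signal as a Python list of booleans indexed by the cycle number; treating cycles outside the trace as not satisfying a predicate is a choice of this sketch and is not prescribed by the definitions.

```python
def t_start(idle, t):
    """Translation starts in cycle t: idle in t, not idle in t + 1."""
    return idle[t] and t + 1 < len(idle) and not idle[t + 1]

def t_end(idle, t):
    """Translation ends in cycle t: idle in t, not idle in t - 1."""
    return idle[t] and t >= 1 and not idle[t - 1]

def t_abort(treq, busy, t):
    """Request was present and the unit busy in t - 1, request lost in t."""
    return t >= 1 and treq[t - 1] and busy[t - 1] and not treq[t]
```

For the trace `idle = [True, False, False, True, True]` the only starting cycle is 0 and the only ending cycle is 3.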
Fig. 24: Control automaton of the simple MMU (states idleG (start), fetch-pteG, write-tlbG; transitions t1 to t4)
Next we introduce notation for intervals of cycles in which the translation request is continuous.
Also we abbreviate the intervals in which the first and the last cycles are resp. the starting and
the ending cycles of the address translation process.
treq[t1 : t2] ≡ (t1 < t2)∧∀t ∈ [t1 : t2] : treq(t)
treg[t1 : t2] ≡ treq[t1 : t2]∧ t-start(t1)∧ t-end(t2)
The latter intervals of cycles we call regular translation phases. Below we consider liveness
of two control automata (simple and nested), which both operate on a single hardware TLB.
Throughout the proofs one can easily see that the following invariants are maintained by the
hardware.
Invariant 3.
w ∈ tlbset(t) → f(w) ∨ w.ℓ[0] (1)
f(wX ) ∨ wX.ℓ[0] → /fetch-pteU ∧ /fetch-pteX (2)
f(wU ,wG) ∨ (wU ↮ wG) → /fetch-pteU (3)
4.3.1 Simple Translations
The simple control automaton is depicted in Fig. 20. In addition to labels introduced in the
figure, using label (t4) we refer to a transition between states fetch-pteG and write-tlbG. For
convenience in Fig. 24 we give a construction of the automaton with this additional label.
Transitions of the simple control automaton are listed below.
(t1) ≡ /reset ∧ treqG∧/tlb.hit
(t2) ≡ /reset ∧mbusy ≡ /(t3)∧/(t4)
(t3) ≡ reset ∨/mbusy∧/treqG
(t4) ≡ /reset ∧ /mbusy ∧ treqG ∧ (f(wextG) ∨ wextG.ℓ[0])
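The transition list can be read as a next-state function. The following Python sketch is one possible encoding: states are plain strings, the walk-related conditions are passed in as booleans, and (t2) is taken to be the complement of (t3) and (t4), as stated above. All names are hypothetical.

```python
IDLE, FETCH, WRITE = "idleG", "fetch-pteG", "write-tlbG"

def simple_ca_next(state, *, reset=False, treq=True, tlb_hit=False,
                   mbusy=False, wext_faulty=False, wext_complete=False):
    """One step of the simple control automaton (illustrative encoding)."""
    if state == IDLE:
        t1 = not reset and treq and not tlb_hit
        return FETCH if t1 else IDLE
    if state == FETCH:
        t3 = reset or (not mbusy and not treq)
        t4 = (not reset and not mbusy and treq
              and (wext_faulty or wext_complete))
        if t3:
            return IDLE      # reset or aborted request
        if t4:
            return WRITE     # extended walk is faulty or complete
        return FETCH         # (t2): memory busy, or walk extended once more
    return IDLE              # write-tlbG always returns to idleG
```

A fault-free request without a TLB hit thus visits fetch-pteG, stays there while the memory is busy or the walk is still incomplete, and passes through write-tlbG back to idleG.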
Several restrictions on input signals are necessary for the simple MMU to function properly.
We formulate these restrictions in the form of operating conditions; these conditions are col-
lected below.
busyG(t)→ (treqG[t0 : t]→ upaG(t) = upaG(t0)) (1)
treqG(t)→ /tlb.ireq(t) (2)
Properties of Simple CA
In the following two lemmas we show some important properties of the simple control automa-
ton. Essentially, these lemmas are the counterparts of the corresponding lemmas from [LOP]
about properties of the local control automaton.
Lemma 21.
fetch-pteG(t) → ∃t ′ > t : /fetch-pteG(t ′) ∨ level(wG(t ′))< level(wG(t))
Proof of lemma 21. There are three transitions possible in state fetch-pteG in cycle t: (t2), (t3),
and (t4). In case the latter two transitions are taken, the control goes either to state idleG or
write-tlbG. In both cases we trivially obtain for t ′ = t+1:
/fetch-pteG(t ′).
If transition (t2) is taken, we deduce, obviously, /reset(t), and either
mbusy(t) i)
or
/mbusy(t) ∧ treqG(t) ∧ /(f(wextG(t)) ∨ wextG.ℓ[0]). ii)
In the first case we use liveness of the memory system to argue that
∃t ′′ > t : /mbusy(t ′′)
which brings us into the second case. There, we argue that
fetch-pteG(t)∧/mbusy(t) → wG(t+1) = wextG(t)
from the definitions of the update enable and input signals of the guest walk register. Using
correctness of the circuit which implements walk extension, we conclude for t ′ = t+1:
level(wG(t ′)) < level(wG(t)). □
Lemma 22. Assume the absence of reset in cycles [t0 : t1]. Let
fetch-pteG(t0) ∧ t1 = min{t ′ > t0 | /fetch-pteG(t ′)} ∧ treqG[t0 : t1].
Then
f(wG(t1)) ∨ wG(t1).ℓ[0] i)
and
wG(t1).upa = upaG(t0). ii)
Proof of lemma 22. From the absence of reset and stable translation request we conclude that
∀t ∈ [t0 : t1] : /(t3)(t)
and therefore
write-tlbG(t1).
Also we argue that transition (t4) was taken in cycle (t1−1). This immediately gives i), since
wG(t1) = wextG(t1−1)
from the definitions of the update enable and input signals of the guest walk register. For ii)
we refer to operating condition 1 and argue analyzing data paths of the MMU that
\[ w_G.in(t).upa = \begin{cases} upa_G(t) & winit_G(t) \\ w_G(t).upa & \text{otherwise} \end{cases} \]
given that circuits for walk initialization and extension are implemented correctly. □
Liveness of Simple CA
In the remainder of this section we elaborate on liveness of the simple control automaton and
results of the simple address translation. Thus, in the following lemma we show that every
started simple translation eventually ends.
Lemma 23.
t-startG(t) → ∃t ′ > t : t-endG(t ′)
Proof of lemma 23. We cover all cases that possibly occur in the process of simple transla-
tion. In every subsequent case we consider a complement to everything encountered so far.
Obviously, every translation process starts in the idle state of the control automaton:
idleG(t).
• transition (t1) is taken in cycle t. The control goes to state fetch-pteG.
fetch-pteG(t+1)
Using lemma 21 we conclude for some cycle t1 > t+1
/fetch-pteG(t1) i)
or
fetch-pteG(t1) ∧ level(wG(t1))< level(wG(t+1)). ii)
In the first case the control goes either to state idleG (abort) or to state write-tlbG.
idleG(t1)→ t ′ = t1
write-tlbG(t1)→ t ′ = t1+1
• in the second case the translation continues in state fetch-pteG, but the extended walk de-
creases its level. Moreover, this time the extended walk has level one, therefore following
the same argument as above we conclude for some cycle t2 > t1
/fetch-pteG(t2).
Again, translation is either aborted or not, resp. the control goes either to state idleG or to
state write-tlbG.
idleG(t2)→ t ′ = t2
write-tlbG(t2)→ t ′ = t2+1 □
In the next lemma we argue that every regular simple translation results in a TLB hit. As we
argue later, the latter hit signals that the hardware TLB contains a translation for the universal
(guest) address which was requested to translate.
Lemma 24.
tregG[t0 : t1] → tlb.hit(t1)
Proof of lemma 24. Given that translation is regular, we immediately obtain
∀t ∈ [t0 : t1] : /t-abortG(t).
Next we argue along the lines of the proof above. We consider only those cases in which
translation is not aborted.
• translation is served directly from the TLB. For t0 = t1 we conclude
tlb.hit(t0)
immediately from the definition of transition (t1) (which is not taken).
• transition (t1) is taken in cycle t0. The control goes to state fetch-pteG
fetch-pteG(t0+1)
and stays there (no abort) until some cycle t ′ > t0+1 (lemma 21). For cycles until t ′ there
is nothing to show:
∀t ∈ [t0 : t ′] : /t-endG(t).
Again, since translation is not aborted, the control goes to state write-tlbG.
write-tlbG(t ′)
There the content of the walk register is written into the TLB. From the TLB hardware
specification we get
tlbset(t ′+1)∩{w | w.upa = wG(t ′).upa}= {wG(t ′)}.
Using lemma 22 for cycle t ′ > t0+1 and operating condition 1 we obtain
wG(t ′).upa = upaG(t0+1) = upaG(t1).
When control returns to the idle state in cycle t ′+1 = t1, we conclude
tlb.hit(t ′+1)
since due to operating condition 2 we have
/tlb.ireq(t ′). □
From the latter lemma we trivially conclude (from the MMU interconnect):
treg[t0 : t] → wout(t) ∈ tlbset(t)
and
wout(t).upa = upaG(t).
And from part (1) of invariant 3 we conclude:
f(wout(t)) ∨ wout(t).ℓ[0].
In the last lemma proven in this section we show that the simple control automaton always
eventually returns to its idle state.
Lemma 25.
/idleG(t) → ∃t ′ > t : idleG(t ′)
Proof of lemma 25. By induction on number t of the hardware cycles. For the base case (t = 0)
there is nothing to show since immediately after reset the control resides in its idle state.
idleG(0)
For the induction step from t to t+1 we argue as follows. Clearly, we cover only
/idleG(t+1)
since otherwise, as usual, there is nothing to show. Next, we split cases on whether the control
resided in the idle state in cycle t or not.
• idleG(t) = 1. Directly from the definitions we have
t-startG(t).
Therefore, applying lemma 23 proves the claim for some cycle t ′ > t.
• idleG(t) = 0. In this case we argue using the induction hypothesis, which for some cycle
t ′ > t gives
idleG(t ′).
Since the control does not reside in its idle state in cycle t +1 (as we assumed), for cycle
t ′ we have
t ′ > t+1
which completes the induction step, and therefore the proof. □
Fig. 25: Control automaton of the nested MMU (states idleU (start), nested-call, fetch-pteU, write-tlbU; transitions t1 to t10)
4.3.2 Nested Translations
The nested control automaton is depicted in Fig. 21. In addition to labels introduced in the fig-
ure, using label (t9) we refer to a transition between states fetch-pteU and write-tlbU , whereas
using label (t10) — to a transition between states write-tlbU and idleU . For convenience in
Fig. 25 we give a construction of the automaton with additional labels. Transitions of the
nested control automaton are listed below. In addition to transition
(t1) ≡ /reset ∧ treqU ∧/tlb.hit
which takes control out of the idle state, we have four transitions emanating from state
fetch-pteU
(t2) ≡ /reset ∧mbusy
(t3) ≡ reset ∨/mbusy∧/treqU
(t4) ≡ /(t2)∧/(t3)∧/ f (wextU )
(t9) ≡ /(t2)∧/(t3)∧ f (wextU )
as well as another four transitions from state nested-call.
(t5) ≡ /reset ∧busyG
(t6) ≡ reset ∨/busyG∧/treqU
(t7) ≡ /(t5) ∧ /(t6) ∧ /(f(wU , tlb.wout) ∨ wU.ℓ[0])
(t8) ≡ /(t5) ∧ /(t6) ∧ (f(wU , tlb.wout) ∨ wU.ℓ[0])
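As with the simple automaton, the transitions listed above can be collected into a next-state function. The Python sketch below is an illustrative encoding under the following assumptions: states are strings, the predicates f(wextU), f(wU, tlb.wout) and wU.ℓ[0] are supplied as the booleans `wext_faulty`, `pair_faulty` and `walk_complete`, and transition (t1) leads to state nested-call as shown in Fig. 25.

```python
IDLE, CALL, FETCH, WRITE = "idleU", "nested-call", "fetch-pteU", "write-tlbU"

def nested_ca_next(state, *, reset=False, treq=True, tlb_hit=False,
                   mbusy=False, busy_g=False, wext_faulty=False,
                   pair_faulty=False, walk_complete=False):
    """One step of the nested control automaton (hypothetical encoding)."""
    if state == IDLE:
        t1 = not reset and treq and not tlb_hit
        return CALL if t1 else IDLE
    if state == FETCH:
        if not reset and mbusy:                  # (t2) wait for the memory
            return FETCH
        if reset or (not mbusy and not treq):    # (t3) reset or abort
            return IDLE
        return WRITE if wext_faulty else CALL    # (t9) resp. (t4)
    if state == CALL:
        if not reset and busy_g:                 # (t5) wait for the simple CA
            return CALL
        if reset or (not busy_g and not treq):   # (t6) reset or abort
            return IDLE
        if pair_faulty or walk_complete:         # (t8)
            return WRITE
        return FETCH                             # (t7)
    return IDLE                                  # (t10) leave write-tlbU
```

Stepping this function with a persistent request and no faults reproduces the alternation between nested-call and fetch-pteU that is analyzed in the liveness proof below.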
The operating conditions remain the same as for the simple control automaton. For
convenience we reformulate these conditions to match the interface of the nested MMU.
busy(t)→ (treq[t0 : t]→ upa(t) = upa(t0)) (1)
treq(t)→ /tlb.ireq(t) (2)
Properties of Nested CA
The first two lemmas directly follow from liveness of the cache memory system (lemma 26)
and simple control automaton (lemma 27).
Lemma 26.
fetch-pteU (t) → ∃t ′ > t : /fetch-pteU (t ′)
Lemma 27.
nested-call(t) → ∃t ′ > t : /nested-call(t ′)
In the next lemma we capture the following property: whenever the nested control automaton
leaves states fetch-pteU or nested-call with a persistent request for nested translation, the user
walk register contains a translation for the universal (user) address which was requested to
translate; moreover, the latter translation is incomplete only if composition of the walks in the
walk registers is faulty.
Lemma 28. Assume the absence of reset in cycles [t0 : t1]. Let
(fetch-pteU ∨nested-call)(t0) ∧ t1 = min{t ′ > t0 | write-tlbU (t ′)} ∧ treqU [t0 : t1].
Then
f(wU (t1)◦wG(t1)) ∨ wU (t1).ℓ[0] i)
and
wU (t1).upa = upaU (t0). ii)
Proof of lemma 28. From the absence of reset and stable translation request we conclude that
∀t ∈ [t0 : t1] : /(t3)(t)∧/(t6)(t).
State write-tlbU can only be reached through transitions (t8) and (t9). In the first case we argue
(t8)(t1−1)→ f(wU (t1−1), tlb.wout(t1−1)) ∨ wU (t1−1).ℓ[0]
→ f(wU (t1),wG(t1)) ∨ wU (t1).ℓ[0]
from the definitions of the update enable and input signals of the walk registers. The last line
implies i) by definition of the walk composition. In the second case we deduce
(t9)(t1−1)→ f (wextU (t1−1))
→ f (wU (t1))
which again implies i) by definition of the walk composition. For part ii) we argue in exactly
the same way as in the proof of lemma 22. □
Liveness of Nested CA
The following three lemmas are counterparts of the corresponding lemmas for the simple con-
trol automaton. The structure of their proofs will therefore repeat. Thus, in the first lemma
below we show that every translation eventually ends.
Lemma 29.
t-start(t) → ∃t ′ > t : t-end(t ′)
Proof of lemma 29. We consider only nested requests
treqU (t)
since otherwise the control stays in the idle state while the simple control automaton is used.
Liveness of the latter automaton was already proven (see lemma 23 above). Every nested
translation starts in the idle state of the nested control automaton:
idleU (t).
• transition (t1) is taken in cycle t. The control goes to state nested-call.
nested-call(t+1)
From lemma 27 we know there is a cycle t1 > t+1 such that
/nested-call(t1).
In case the control returns to the idle state (via transition (t6)) or goes to state write-tlbU
(via transition (t8)) we conclude trivially:
idleU (t1)→ t ′ = t1
write-tlbU (t1)→ t ′ = t1+1.
• otherwise, the control goes to state fetch-pteU (via transition (t7))
fetch-pteU (t1)
in which it stays until some cycle t2 > t1 (lemma 26). Again we consider two cases: either
i) the control returns to the idle state (via transition (t3)) or goes to state write-tlbU (via
transition (t9)), or ii) otherwise, the control returns to state nested-call (via transition (t4)).
In the first case we obtain trivially:
idleU (t2)→ t ′ = t2
write-tlbU (t2)→ t ′ = t2+1.
• in the second case
nested-call(t2)
we additionally argue that the level of the user walk decreases
level(wU (t2)) < level(wextU (t2−1))
= level(wU (t1))
relying on correctness of circuits for the walk extension and using definitions of the update
enable and input signals of the user walk register. This brings us into a loop (between the
control states nested-call and fetch-pteU ), which cannot last forever. Since the user walk is
initialized to have the second level9
winitU (t) → level(wU (t+1)) = 2
we conclude that at some point transition (t7) is no longer available and the loop is broken.
□
In the next lemma we argue that every regular translation results in a TLB hit. The latter hit
signal, as we argue afterwards, indicates that a translation for the universal address which was
requested to translate is present in the hardware TLB.
Lemma 30.
treg[t0 : t1] → tlb.hit(t1)
9 We consider page tables of depth 2 here, but that is a purely technical parameter. In general,
for page tables of depth d and nestedness of the privilege levels n, the total number of walk extensions
necessary to translate the virtual address of the highest privilege level is
(d+1)^(n−1) − 1.
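The count in footnote 9, read as (d+1)^(n−1) − 1, can be cross-checked by a small recursion. The recurrence below is our own reading and is not taken from the text: we assume that the lowest privilege level runs untranslated, and that a translation at each further level needs d page table reads of its own plus d + 1 translations one level below (one for each of the d base addresses and one for the final address).

```python
def walk_extensions(d: int, n: int) -> int:
    """Walk extensions for page tables of depth d and n privilege levels."""
    if n == 1:
        return 0  # assumption: this level runs without address translation
    # d own page table reads, and d + 1 addresses translated one level below
    return d + (d + 1) * walk_extensions(d, n - 1)

def closed_form(d: int, n: int) -> int:
    """Closed form (d + 1)^(n - 1) - 1 from footnote 9."""
    return (d + 1) ** (n - 1) - 1
```

For d = 2 and three levels (hypervisor, guest, user) both functions give 8, which matches three nested calls with two guest extensions each plus two user extensions.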
Proof of lemma 30. Again we argue only about the nested requests
treqU (t)
since otherwise the control stays in the idle state while the simple control automaton is used.
Given that translation is regular, we immediately obtain
∀t ∈ [t0 : t1] : /t-abortU (t).
Thus, we consider only those cases in which translation is not aborted.
• translation is served directly from the TLB. We conclude for t0 = t1
tlb.hit(t0)
immediately from the definition of transition (t1) (which is not taken).
• transition (t1) is taken in cycle t0. The control goes to state nested-call.
nested-call(t0+1)
Given that translation is regular in cycles [t0 : t1] (no abort), we conclude using liveness of
the nested control automaton (lemma 29) that the translation process ends with transition
(t10) — between states write-tlbU and idleU . Therefore in cycle t ′ = t1−1 the hardware
TLB is written
write-tlbU (t ′)
with the composition of walks in the walk registers such that in cycle t ′+1 = t1 we have
tlbset(t ′+1)∩{w | w.upa = wU (t ′).upa}= {wU (t ′)◦wG(t ′)}.
Using lemma 28 for cycle t ′ > t0+1 and operating condition 1 we obtain
wU (t ′).upa = upaU (t0+1) = upaU (t1).
When control returns to the idle state, we conclude
tlb.hit(t ′+1)
since due to operating condition 2 we have
/tlb.ireq(t ′). □
From the latter lemma we trivially conclude (from the MMU interconnect):
treg[t0 : t] → wout(t) ∈ tlbset(t)
and
wout(t).upa = upa(t).
And from part (1) of invariant 3 we conclude:
f(wout(t)) ∨ wout(t).ℓ[0].
Finally, we argue that the nested control automaton always eventually returns to its idle state
s.t. both control automata are found in the idle states. As a result, the nested MMU always
eventually lowers the busy signal.
Lemma 31.
/idle(t) → ∃t′ > t : idle(t′)
Proof of the latter lemma is analogous to proof of the corresponding lemma for the simple
control automaton above (lemma 25), and therefore is omitted. This liveness result completes
the implementation of the nested MMU. Correctness of the latter implementation is proven in
the next chapter.

5 Correctness of NAT Implementation
In the previous chapter we introduced a hardware construction which implements the process
of nested address translation. We continue referring to that construction using the name nested
MMU, though sometimes we call it simply MMU for short. Still, when we refer to the original
MMU design from [LOP], we explicitly use the name original MMU or simple MMU.
Construction of the nested MMU followed certain hardware specifications, which preceded
each of its subcomponents. In this chapter we justify those hardware specifications in the
usual manner, via correctness proofs.
5.1 Accessing MMU
In this chapter we are going to show that our implementation of the nested MMU works cor-
rectly in all possible hardware computations. For that purpose we formalize all accesses to the
nested MMU in the form of queries. By a query we understand the data arriving at the
component’s interfaces (input) while any of the request signals is high. Analyzing the interfaces
of the nested MMU from Sect. 4.2.1 we define query
qr ∈ Kquery
to have the following components:
• qr.upa ∈ B^32: universal page address to provide/invalidate translation(s) for,
• qr.ireq ∈ B^3: invalidation requests; bits [2 : 0] represent resp. requests for invalidation of
all translations, translations of a virtual machine, as well as individual translations, and
• qr.treq ∈ B: translation request; output a translation for the requested address; in case a
translation is not contained in the TLB, create it first.
We call a query well-formed if at most one of its request signals is raised.
wf(qr) ≡ (qr.treq+qr.ireq[0]+qr.ireq[1]+qr.ireq[2]≤ 1)
In case none of the request signals is raised, we say that the query is void.
void(qr) ≡ (qr.treq+qr.ireq[0]+qr.ireq[1]+qr.ireq[2] = 0)
According to the request signal raised we distinguish between translation and invalidation
queries. Moreover, depending on the requested address we distinguish between simple and
nested translation queries. Translation queries utilize some additional components:
• qr.ptoG ∈ B^20: guest page table origin,
• qr.ptoU ∈ B^20: user page table origin,
• qr.data ∈ B^64: data (memory line) to use in case of a walk extension, and
• qr.drdy ∈ B: control bit signaling that the data provided are ready to use.
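The query components and the two predicates wf and void can be sketched in Python as follows. This is only an illustration: the class name Query, the snake_case field names and the representation of ireq as a 3-tuple indexed like ireq[0], ireq[1], ireq[2] are assumptions of the sketch, and bit widths are not enforced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """One MMU query; field names mirror the text, bit widths are not checked."""
    upa: int = 0             # universal page address
    ireq: tuple = (0, 0, 0)  # (ireq[0], ireq[1], ireq[2])
    treq: int = 0            # translation request
    pto_g: int = 0           # guest page table origin
    pto_u: int = 0           # user page table origin
    data: int = 0            # memory line used for a walk extension
    drdy: int = 0            # data-ready control bit


def raised(qr):
    """Number of request signals raised in the query."""
    return qr.treq + sum(qr.ireq)


def wf(qr):
    """Well-formed: at most one request signal is raised."""
    return raised(qr) <= 1


def void(qr):
    """Void: no request signal is raised."""
    return raised(qr) == 0
```

Every void query is well-formed under these definitions, which matches the two formulas above.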
Application of queries is done simply by connecting their components to interfaces of the
nested MMU. Namely, in order to apply query qr to nested MMU configuration mmu we
connect the data inputs as well as the control inputs as follows. For the data inputs we have
mmu.upa = qr.upa
mmu.ptoG = qr.ptoG
mmu.ptoU = qr.ptoU
mmu.inva = qr.upa
mmu.mout = qr.data
whereas for the control inputs we have
mmu.treq = qr.treq
mmu.invlpg = qr.ireq[0]
mmu.vmflush = qr.ireq[1]
mmu.flush = qr.ireq[2]
mmu.mbusy = qr.drdy.
The resulting configuration mmu′ is given by the hardware transition function of the nested MMU.
mmu′ = δmmu(mmu,qr)
Formal definition of the latter function is omitted, however it can be easily extracted from the
formal hardware specification of the nested MMU presented in Sect. 4.2.1.
Naturally, we introduce query sequences
qr : N→ Kquery
to be ordinary (infinite) sequences of queries. Note, we use the same notation to denote queries
and query sequences. Hopefully, the difference will be clear from the context. We call a query
sequence well-formed simply if every query in that sequence is well-formed.
wf(qr) ≡ ∀t : wf(qr[t])
Let mmuø be the nested MMU hardware configuration after reset. Application of the first t queries
from query sequence qr naturally brings us to configuration
\[
\Delta^{t}_{mmu}(mmu_{ø}, qr) =
\begin{cases}
\delta_{mmu}(\Delta^{t-1}_{mmu}(mmu_{ø}, qr),\ qr[t-1]) & t > 0 \\
mmu_{ø} & \text{otherwise.}
\end{cases}
\]
For convenience we abbreviate the configuration above as
\[
mmu^{t}_{qr} \equiv \Delta^{t}_{mmu}(mmu_{ø}, qr).
\]
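The recursive definition above is an ordinary left fold of the transition function over the first t queries. A minimal Python sketch, assuming the query sequence is given as a function from cycle numbers to queries; the names iterate and toy_delta are hypothetical, and toy_delta merely stands in for the transition function, whose formal definition the text omits:

```python
def iterate(delta, reset, qr, t):
    """Configuration reached from `reset` after applying qr(0), ..., qr(t-1)."""
    mmu = reset
    for i in range(t):
        mmu = delta(mmu, qr(i))
    return mmu


def toy_delta(mmu, query):
    """Toy stand-in for the MMU transition function: sums up the queries."""
    return mmu + query
```

With t = 0 the loop body never runs and the reset configuration is returned, which is the "otherwise" case of the definition.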
Clearly, we prove correctness of the nested MMU implementation not for all possible hardware
computations, but for all computations in which the operating conditions of the nested MMU
are respected. These conditions were formulated in Sect. 4.3, where we covered circuit cor-
rectness and liveness of the nested MMU. Here we reformulate the latter conditions to restrict
the query sequences that we consider in the proofs below. Thus, we call a query sequence valid
if i) the sequence is well-formed and ii) application of the sequence does not violate operating
conditions of the nested MMU.
valid(qr) ≡ wf(qr) ∧ (mmu^t_qr.busy ∧ qr[t0 : t].treq → qr[t].upa = qr[t0].upa)
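On a finite prefix of a query sequence the validity condition can be checked mechanically. The sketch below reads the free variables t0 and t as universally quantified with t0 ≤ t, which is an interpretation rather than something the formula states explicitly. The busy flags are supplied by the caller, because the MMU itself is not modelled here; Q, wf and valid_prefix are hypothetical names.

```python
from collections import namedtuple

# Minimal query record for this sketch: ireq is a tuple of three bits.
Q = namedtuple("Q", "treq upa ireq")


def wf(q):
    """Well-formed: at most one request signal is raised."""
    return q.treq + sum(q.ireq) <= 1


def valid_prefix(queries, busy):
    """Check validity on a finite prefix.

    queries[t] is the query of cycle t; busy[t] is the busy signal of the
    configuration reached after the first t queries.
    """
    if not all(wf(q) for q in queries):
        return False
    for t, q in enumerate(queries):
        if not busy[t]:
            continue
        # Walk back over the maximal run of cycles with treq raised ending at t.
        t0 = t
        while t0 >= 0 and queries[t0].treq:
            if queries[t0].upa != q.upa:
                return False
            t0 -= 1
    return True
```

The inner loop visits exactly those t0 for which the translation request is raised throughout [t0 : t], so the requested address must stay stable for as long as the MMU is busy with it.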
5.2 Correctness Statement
At the end of this section (in Sect. 5.2.3) we state a (simulation) theorem claiming that the constructions
made in Chap. 4 implement their specifications. In a nutshell this theorem claims that
for every hardware cycle there is a configuration from the general computation (see Sect. 3.4)
s.t. the simulation relation holds between the hardware TLB and the TLB component of that
configuration.
Of course, in order to formulate the latter theorem, we must provide the missing definitions.
Thus, in Sect. 5.2.1 we specify how to construct the general computation from any valid MMU
query sequence. The aforementioned simulation relation is defined in Sect. 5.2.2.
5.2.1 Stepping of TLB
In the original MMU design, stepping of the TLB component was associated with clocking
of the walk register [LOP]. For the nested MMU component this concept does not change.
We naturally transfer it to a more general MMU construction, which uses two hardware walk
registers (guest and user) instead of one. Clocking these registers will “generate” different
specification steps, though the difference is purely technical. The guest walk register generates
steps of initialization or extension of the guest walks (identical to the original stepping),
whereas the user walk register generates steps of initialization or extension of the user walks (new).
We differentiate between the generated steps using the following hardware signals.
taddG(mmu) ≡ wG.ce_std
taddU(mmu) ≡ wU.ce
Whenever the guest walk register (wG) is updated on the standard clock enable signal, it generates
a guest TLB step (taddG). Every time the user walk register (wU) is updated, it generates
a user TLB step (taddU). The total number of TLB steps generated in configuration mmu is then
given by
tadd(mmu) = taddG(mmu)+ taddU (mmu).
From simple analysis of the control logic of the nested MMU (Sect. 4.2.2) we argue that at
most one TLB step can be generated in any given MMU configuration. Another trivial lemma
follows.
Lemma 32.
tadd(mmu)≤ 1
Proof of lemma 32. Using properties of the nested control automaton we argue:
wU.ce → treqU ∧ (idleU ∨ fetch-pteU)
→ /sreq ∧ /nested-call.
The last line implies that the translation request signal of the simple control automaton is low (see
Sect. 4.2.2). From there we immediately conclude:
/treqG → /wG.ce_std. □
For convenience we introduce a few additional abbreviations. Similarly to the above, we extract
the following two signals from the control logic of the nested MMU.
winitX (mmu) ≡ winit(mmu.wunitX )
wextX (mmu) ≡ wext(mmu.wunitX )
We distinguish initialization cycles of the guest walk register from the corresponding cycles
for the user walk register. For this purpose we use the predicates above (winitG and winitU
resp. for the guest and the user walk register).
winit(mmu) = winitG(mmu)+winitU (mmu)
Since the walk registers are updated in the initialization cycles (Sect. 4.2.2), we immediately
conclude from lemma 32 that at most one of these registers can be initialized in the given MMU
configuration.
Lemma 33.
winit(mmu)≤ 1
In contrast to the overall specification, where invalidation of the TLB content occurs on passive
TLB transitions within the processor core steps, the general specification defines the dedicated
TLB steps to drop the stored translations (see Sect. 3.4.1).
tdrop(mmu)≡ mmu.tlb.inval
Taking the newly introduced steps into account we define the total number of steps generated
by the nested MMU in configuration mmu as
ns(mmu) = tadd(mmu)+ tdrop(mmu).
Given that translation and invalidation queries are never served in parallel, using lemma 32 we
obtain that at most one TLB step (out of three) can be generated in any given MMU configu-
ration.
Lemma 34.
ns(mmu)≤ 1
Finally, using all the abbreviations above, we can specify the input passed to the specification
machine (value of the stepping function).
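\[
\begin{aligned}
tadd_G(mmu) &\rightarrow s(mmu) =
\begin{cases}
(winit, upa_G, pto_G) & winit_G(mmu) \\
(wext, w_G, pte_G) & \text{otherwise}
\end{cases} \\
tadd_U(mmu) &\rightarrow s(mmu) =
\begin{cases}
(winit, upa_U, pto_U) & winit_U(mmu) \\
(wext, w_U, pte_U) & \text{otherwise}
\end{cases} \\
tdrop(mmu) &\rightarrow s(mmu) =
\begin{cases}
(drop, inva) & mmu.invlpg \\
(drop, inva.vm) & mmu.vmflush \\
(drop, all) & \text{otherwise}
\end{cases}
\end{aligned}
\]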
Here we denote this particular input (for TLB steps) by s(mmu), however below, in Sect. 5.2.1,
we formalize it for a more general case of the non-sequential machines. There it becomes
a component of a vector with the length ns(h) — the number of all ISA steps performed in
configuration h.
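The step counters and the stepping function of this section can be sketched in Python. The record Signals and all of its field names are hypothetical stand-ins for the hardware signals named in the text, and the tuple encoding of the specification inputs is likewise an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical snapshot of the MMU signals used by the stepping function."""
    wg_ce_std: bool = False  # standard clock enable of the guest walk register
    wu_ce: bool = False      # clock enable of the user walk register
    tlb_inval: bool = False  # TLB invalidation signal
    winit_g: bool = False
    winit_u: bool = False
    invlpg: bool = False
    vmflush: bool = False
    upa_g: int = 0
    upa_u: int = 0
    pto_g: int = 0
    pto_u: int = 0
    w_g: object = None       # content of the guest walk register
    w_u: object = None       # content of the user walk register
    pte_g: int = 0
    pte_u: int = 0
    inva: int = 0            # address to invalidate
    inva_vm: int = 0         # virtual machine of that address


def tadd_g(m):
    return int(m.wg_ce_std)


def tadd_u(m):
    return int(m.wu_ce)


def tdrop(m):
    return int(m.tlb_inval)


def ns(m):
    """Total number of TLB steps generated in this configuration."""
    return tadd_g(m) + tadd_u(m) + tdrop(m)


def step_input(m):
    """Input passed to the specification machine, or None if no step occurs."""
    if ns(m) > 1:
        raise ValueError("more than one TLB step in a cycle (contradicts lemma 34)")
    if tadd_g(m):
        return ("winit", m.upa_g, m.pto_g) if m.winit_g else ("wext", m.w_g, m.pte_g)
    if tadd_u(m):
        return ("winit", m.upa_u, m.pto_u) if m.winit_u else ("wext", m.w_u, m.pte_u)
    if tdrop(m):
        if m.invlpg:
            return ("drop", m.inva)
        if m.vmflush:
            return ("drop", ("vm", m.inva_vm))
        return ("drop", "all")
    return None
```

The explicit check against ns(m) > 1 makes the dependence on lemma 34 visible: the case distinction is only unambiguous because at most one step is generated per configuration.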
5.2.2 Simulation of TLB
First of all we introduce a simulation relation
simtlb(mmu,c)
which expresses that the TLB of (hardware) MMU configuration mmu is correctly simulated
by the TLB of (software) ISA configuration c. Conveniently we split this simulation relation
into two parts: simT and simW , for simulation of the TLB content and the walk registers resp.
simtlb(mmu,c)≡ simT (mmu,c)∧ simW (mmu,c)
Next we specify each part separately.
Simulation of TLB Content
For convenience we split the set of walks stored in the hardware TLB (tlbset) into the sets of
hardware guest and user walks. We denote these sets by tlbG and tlbU resp. and define them
as follows.
tlbG(mmu) = {w ∈ tlbset(mmu.tlb) | w.upa ∈ AG}
tlbU (mmu) = {w ∈ tlbset(mmu.tlb) | w.upa ∈ AU}
Now the simulation of the hardware TLB content can be easily expressed: we require the hardware
guest walks to be contained in the software TLB, and the hardware user
walks to be contained in the composed TLB.
simT (mmu,c) ≡ tlbG(mmu) ⊆ c.tlb ∧
tlbU (mmu) ⊆ c.tlb◦
Simulation of Walk Registers
Two hardware walk registers involved in the process of nested address translation — guest
wG and user wU walk registers — were added for purely technical reasons: they are required
to implement the operation of walk extension in hardware. For walk extension one needs to
look-up the page table entries in the memory, and in order to perform an access to the hardware
memory (cache) we need registers to provide the inputs or retrieve the outputs of that access.
The content of these registers is considered meaningful only if the nested MMU component is
currently processing a translation request which involves (!) the corresponding walk register.
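\[
w_G(mmu) =
\begin{cases}
\{mmu.w_G\} & /mmu.abort \wedge /mmu.idle \\
\emptyset & \text{otherwise}
\end{cases}
\]
\[
w_U(mmu) =
\begin{cases}
\{mmu.w_U\} & /mmu.abort \wedge /mmu.idle_U \\
\emptyset & \text{otherwise}
\end{cases}
\]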
Thus, if the nested MMU stays in its idleU control state but remains busy, it processes an
ordinary translation request (of the guest address), which clearly does not involve the user
walk register. In case the nested MMU leaves the idleU state, there is a nested translation
request (of the user address) being processed, which involves both walk registers. Invalidation
requests do not trigger the busy signal. They are handled in one cycle and allowed only in
cycles in which the nested MMU is not busy (see Sect. 4.2.1).
Using the definitions above we easily express the simulation of the walk registers.
simW (mmu,c) ≡ wU (mmu)∪wG(mmu) ⊆ c.tlb ∧
wU (mmu)◦wG(mmu) ⊆ c.tlb◦
Note that the second part (for composition) by definition follows from the first part. It was
included to i) stress that the composition of hardware walks belongs to the composed TLB and
ii) justify the upcoming definitions (for sets of hardware walks). A trivial lemma follows.
Lemma 35.
mmu.idle ∨ mmu.abort → wG(mmu) ∪ wU(mmu) = ∅
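Lemma 35 can be checked exhaustively on the three control bits involved. The sketch below reads the case conditions of wG(mmu) and wU(mmu) as "neither aborted nor idle", which is the reading the lemma requires, and it additionally assumes that an idle MMU implies an idle nested automaton (both automata are in their idle states). All function names are hypothetical.

```python
from itertools import product


def w_g(abort, idle, reg):
    """Guest walk register content, if meaningful, as a set."""
    return {reg} if (not abort and not idle) else set()


def w_u(abort, idle_u, reg):
    """User walk register content, if meaningful, as a set."""
    return {reg} if (not abort and not idle_u) else set()


def lemma35_holds():
    """Exhaustively check: idle or abort implies both sets are empty."""
    for abort, idle, idle_u in product([False, True], repeat=3):
        if idle and not idle_u:
            continue  # assumed impossible: idle MMU with a busy nested automaton
        meaningful = w_g(abort, idle, "wG") | w_u(abort, idle_u, "wU")
        if (idle or abort) and meaningful:
            return False
    return True
```

Dropping the assumption about idleU would break the check, which shows where the "trivial" lemma relies on the structure of the two control automata.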
Next we specify the MMU configurations in which the specification TLB is stepped. We do it
in such a way that the resulting ISA computation simulates the hardware computation. In the
next sections we keep using our old abbreviations to save space.
wG ≡ mmu.wG
wU ≡ mmu.wU
5.2.3 Simulation Theorem
Finally we have enough machinery to justify the implementations of hardware units from
Chap. 4. Below and elsewhere throughout this section by h (c) and h′ (c′) we denote the
current and the next state configurations of hardware (ISA) resp. Informally
h→ h′
c→ c′
where the next state configuration of hardware and ISA are resp. defined by the hardware MMU
specification from Sect. 4.2.1 and the general TLB specification from Sect. 3.4. Independently
of a particular machine design, the nested MMU changes only in those configurations in which
it is queried, i.e., if there is a translation/invalidation query among the operations performed
by hardware (see Sect. 5.1). Otherwise, the nested MMU stays unchanged. If requested, the
nested MMU generates steps to be performed by the ISA computation, as we described in
Sect. 5.2.1. For convenience, on every reference to the MMU hardware configuration below
we abbreviate as usual:
h.Z ≡ mmu.Z
Z(h) ≡ Z(mmu).
The total number of steps performed within the hardware transition (h→ h′) is given by ns(h).
From lemma 34 we know
ns(h)≤ 1,
therefore it suffices to pass s(h) to the ISA computation as input to perform those steps (one
step at most). This input is extracted from the hardware computation in configuration h, as we
specified above, in Sect. 5.2.1. Having that we can define the next state configuration of ISA:
\[
c'.tlb =
\begin{cases}
\delta_{tlb}(c.tlb, s(h)) & ns(h) > 0 \\
c.tlb & \text{otherwise.}
\end{cases}
\]
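The construction of the ISA computation from the hardware computation can be sketched as a loop that applies at most one specification step per hardware cycle. The specification transition function is passed in as a parameter; toy_delta_tlb and its "add" and "drop_all" step kinds are a purely hypothetical stand-in and not the TLB specification of Sect. 3.4.

```python
def next_tlb(c_tlb, ns_h, s_h, delta_tlb):
    """Next specification TLB: step once if the hardware generated a step."""
    return delta_tlb(c_tlb, s_h) if ns_h > 0 else c_tlb


def run_spec(c0_tlb, trace, delta_tlb):
    """Specification TLBs for cycles 0..len(trace).

    trace[t] is the pair (ns(h), s(h)) extracted from hardware cycle t.
    """
    tlbs = [c0_tlb]
    for ns_h, s_h in trace:
        tlbs.append(next_tlb(tlbs[-1], ns_h, s_h, delta_tlb))
    return tlbs


def toy_delta_tlb(tlb, step):
    """Toy transition: 'add' inserts a walk, 'drop_all' clears the TLB."""
    kind, payload = step
    if kind == "add":
        return tlb | {payload}
    if kind == "drop_all":
        return frozenset()
    raise ValueError(f"unknown step kind: {kind}")
```

Cycles with ns(h) = 0 leave the specification TLB untouched, so the ISA computation stutters whenever the hardware generates no step.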
We leave the next state configuration of hardware (h′) without a formal definition. One can
compose this definition from the formal hardware specification of the nested MMU and the
implementation details from Chap. 4. The resulting transition function (δmmu) would natu-
rally depend on the values coming at the inputs of the component. In what follows we prove
correctness of the nested MMU hardware for an arbitrary valid sequence of input values.
Lemma 36. For every valid query sequence qr:
∃c^0 ∀t : simtlb(mmu^t_qr, c^t)
In order to prove the latter lemma we proceed as follows:
i) since in hardware we store walks both in the TLB cache and in the walk registers, first in
Sect. 5.3 we introduce an auxiliary simulation relation (simA) to smooth out these differences
and make the forthcoming proofs simpler, then
ii) we state an invariant about the user walks stored in hardware (inv◦), which turns out to be
crucial in order to show correct simulation of these walks, next
iii) we make the last preparations before we do proofs, namely we show how the walks
dropped/added by the invalidation/translation queries in hardware relate to the correspond-
ing walks in ISA, and finally
iv) we prove in Sect. 5.4 that the entire simulation relation for the TLB (simtlb) is preserved
throughout and after execution of invalidation/translation queries and, obviously, in the
absence of any queries.
5.3 Developing Formalism
First, we require a counterpart of the definition from [LOP] where we introduced the set of all
walks stored in hardware (walks(h)). In case of the nested translation we distinguish, as usual,
between the hardware guest (walksG(h)) and user (walksU(h)) walks. The following
definitions are self-explanatory.
walksG(h) = tlbG(h) ∪ wG(h)
walksU (h) = tlbU (h) ∪ wU (h)◦wG(h)
Since in the proofs later we argue mostly about these sets of walks, we introduce an auxiliary
simulation relation which operates on these sets.
simA(h,c) ≡ walksG(h) ⊆ c.tlb ∧
walksU (h) ⊆ c.tlb◦
Most of the effort will be spent on showing that the simulation above holds in the subsequent
configurations of hardware (h′) and ISA (c′). Once this is done, we extend this simulation to the
“entire” one (simtlb) with very little effort. Indeed, a trivial lemma shows that only simulation
of the user walk register is missing in that case.
Lemma 37.
simtlb(h,c) ≡ simA(h,c) ∧ wU (h)⊆ c.tlb
Proof of lemma 37. Gradually unfolding definitions we argue as follows.
simtlb(h,c) ≡ simT(h,c) ∧ simW(h,c)
≡ tlbG(h) ⊆ c.tlb ∧ wG(h) ∪ wU(h) ⊆ c.tlb ∧
  tlbU(h) ⊆ c.tlb◦ ∧ wU(h) ◦ wG(h) ⊆ c.tlb◦
≡ walksG(h) ⊆ c.tlb ∧ wU(h) ⊆ c.tlb ∧
  walksU(h) ⊆ c.tlb◦
≡ simA(h,c) ∧ wU(h) ⊆ c.tlb □
5.3.1 Coverage of Hardware Walks
It turns out that the simulation of user walks (walksU ) hinges on a slightly more general prop-
erty. Namely, we claim that for every user walk stored in hardware there is a pair of walks in
the specification TLB s.t. in composition walks from the pair form the (hardware) user walk
and the page address of the guest walk from the pair is the same as the guest page address of
the (hardware) user walk. Formally, we maintain the following invariant.
Invariant 4. For configurations h and c we claim that the following holds:
∀w ∈ walksU (h) ∃wu,wg ∈ c.tlb : (w = wu ◦wg)∧ (gpa(w,h) = wg.pa)
In order to have the guest page address defined for all user walks (walksU ), we extend the
definition s.t. it can be applied to the user walk stored in the user walk register.
Definition 7. For walk w ∈ wU (h)◦wG(h):
gpa(w,h) = wG.pa
Note, the latter is well defined since
wU (h)◦wG(h)⊆ tlbset(h)
occurs only after a write into the TLB. In case the walk was written into the entry i, by con-
struction we have
h.tlb.g(i) = wG.pa.
We call the invariant above a coverage of the (hardware) user walks, meaning that the walks in
hardware are covered by the walks in software, and refer to it in proofs simply by inv◦(h,c).
As we show below, this invariant trivially implies the desired simulation of the user walks.
To simplify the lines of the forthcoming proofs, we introduce a somewhat more technical notation. We
use the following to select from set walksU the user walks which are covered by pair (wu,wg) of
walks in the sense of invariant 4.
walksU (h)[wu][wg] = {w ∈ walksU (h) | (w = wu ◦wg)∧ (gpa(w,h) = wg.pa)}
In the obvious way we extend the notation above for passing sets of walks instead of individual
walks. For instance, to select using wg as a guest walk and all user walks possible, we write:
\[
walks_U(h)[*][w_g] \equiv \bigcup_{w_u \in K_{uwalk}} walks_U(h)[w_u][w_g].
\]
In case we need to consider only those user walks which are available in the specification TLB
of configuration c, we similarly write:
\[
walks_U(h)[c][w_g] \equiv \bigcup_{w_u \in c.tlb} walks_U(h)[w_u][w_g].
\]
Naturally, we apply the notation above to select using wu as a user walk and sets of guest walks,
either all possible or only those available in the specification TLB of configuration c. The new
notation turns out to be very intuitive and easy to use. For instance, invariant 4 can be equivalently
rewritten simply as follows.
walksU (h) = walksU (h)[c][c]
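The selection notation and the set form of invariant 4 translate directly into set comprehensions. In the Python sketch below, walk composition, the function gpa and the page address of a walk are passed in as parameters, since their definitions lie outside this section. The toy model at the end, in which a walk is a pair of a source and a target page, is purely hypothetical.

```python
def select(walks, wus, wgs, compose, gpa, pa):
    """Members w of `walks` with w = wu ∘ wg and gpa(w) = wg.pa
    for at least one pair (wu, wg) drawn from wus × wgs."""
    return {
        w
        for w in walks
        for wu in wus
        for wg in wgs
        if compose(wu, wg) == w and gpa(w) == pa(wg)
    }


def coverage(walks_u, c_tlb, compose, gpa, pa):
    """Invariant 4 in its set form: every hardware user walk is covered."""
    return select(walks_u, c_tlb, c_tlb, compose, gpa, pa) == set(walks_u)


# Toy model (hypothetical): a walk is a pair (source page, target page).
def toy_compose(wu, wg):
    """Chain a user walk with a guest walk if the pages match, else None."""
    return (wu[0], wg[1]) if wu[1] == wg[0] else None


def toy_pa(w):
    """Page address a walk translates to."""
    return w[1]


user, guest = ("va", "ga"), ("ga", "pa")
hw_walks = {("va", "pa")}
gpa_table = {("va", "pa"): "pa"}  # guest page address stored with the TLB entry
covered = coverage(hw_walks, {user, guest}, toy_compose, gpa_table.get, toy_pa)
uncovered = coverage(hw_walks, {user}, toy_compose, gpa_table.get, toy_pa)
```

In the toy model the hardware walk is covered as long as both the user and the guest walk are present in the specification TLB; removing the guest walk breaks the invariant.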
Proofs also become shorter, and thus easier to understand. Some of them nearly boil down to
simple bookkeeping. We demonstrate this in the proof below, where we show that the coverage
(inv◦(h,c)) implies the simulation of the user walks (walksU ).
Lemma 38. For configurations h and c such that inv◦(h,c) holds:
walksU (h)⊆ c.tlb◦
Proof of lemma 38.
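\[
\begin{aligned}
walks_U(h) &= walks_U(h)[c][c] && (inv^{\circ}(h,c)) \\
&= \bigcup_{w_u \in c.tlb}\ \bigcup_{w_g \in c.tlb} walks_U(h)[w_u][w_g] && \text{(notation)} \\
&= \bigcup_{w_u \in c.tlb}\ \bigcup_{w_g \in c.tlb} \{ w \in walks_U(h) \mid (w = w_u \circ w_g) \wedge (gpa(w,h) = w_g.pa) \} && \text{(definition)} \\
&\subseteq \bigcup_{w_u \in c.tlb}\ \bigcup_{w_g \in c.tlb} \{ w \mid w = w_u \circ w_g \} \\
&= c.tlb^{\circ} && \text{(definition)} \qquad \square
\end{aligned}
\]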
The notation above can be easily generalized for arbitrary sets of user walks. That helps us
to reduce the length of the arguments in the proofs below. For instance, first we show the
following.
Lemma 39. For any pair of sets W1 and W2 (of walks) such that
W1 ⊆W2[Wu][Wg]
we claim
W1 =W1[Wu][Wg]
where Wu and Wg are two arbitrary sets of resp. user and guest software walks.
Proof of lemma 39. From the assumptions we naturally obtain
W1 ⊆ W2[Wu][Wg]
   ⊆ W2[∗][∗]
   = W2 (5)
which allows us to derive
W1 ⊆ W2[Wu][Wg] ∩ W1
   = (W2 ∩ W1)[Wu][Wg] (equation 5)
   = W1[Wu][Wg].
The claim follows, since for any set of walks W by definition we clearly have
W[Wu][Wg] ⊆ W. □
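Lemma 39 only uses the fact that selection filters a set element by element, so it can be checked exhaustively on a small universe. The Python sketch below abstracts the covering condition into a predicate covers(w, wu, wg); the integer model used to exercise it is hypothetical and has nothing to do with actual walks.

```python
from itertools import chain, combinations


def select(walks, wus, wgs, covers):
    """Members of `walks` covered by at least one pair from wus × wgs."""
    return {w for w in walks if any(covers(w, wu, wg) for wu in wus for wg in wgs)}


def subsets(xs):
    """All subsets of xs, as tuples."""
    xs = list(xs)
    return chain.from_iterable(combinations(xs, k) for k in range(len(xs) + 1))


def lemma39_holds(universe, wus, wgs, covers):
    """Check for all W2 ⊆ universe and all W1 ⊆ W2[Wu][Wg]
    that W1 equals its own selection."""
    for w2 in map(set, subsets(universe)):
        selected = select(w2, wus, wgs, covers)
        for w1 in map(set, subsets(selected)):
            if select(w1, wus, wgs, covers) != w1:
                return False
    return True
```

For example, with integers as "walks" and covers(w, wu, wg) meaning w == wu + wg, the call lemma39_holds(range(6), {0, 1}, {1, 3}, lambda w, wu, wg: w == wu + wg) enumerates all 64 subsets of the universe.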
Now using the latter lemma we argue that the user walks from the hardware TLB are naturally
covered by the walks in the specification in case invariant 4 is maintained.
Lemma 40. For configurations h and c such that inv◦(h,c) holds:
tlbU (h) = tlbU (h)[c][c]
Proof of lemma 40. Using invariant 4 we derive
tlbU (h) ⊆ walksU (h) (definition)
= walksU (h)[c][c] (inv◦(h,c))
and the claim follows by lemma 39. □
5.3.2 Dropping Translations
Recall, in Sect. 3.4.2 we introduced the sets of invalidated (IV ) and incomplete (IC) walks,
which comprise translations that are dropped on the invalidating TLB steps. In particular,
set IV was defined s.t. it accumulates either the guest or the user walks, depending resp. on
whether the guest or the user address is invalidated. Counterparts of these sets in hardware
were defined in Sect. 4.1.1, in which we formalized the semantics of the hardware TLB. For
the upcoming argument about correctness of the invalidating TLB steps we need to relate the
sets defined for the specification with their hardware counterparts.
In the following lemma we show that in case a guest address is invalidated, the hardware drops
all guest walks that fall into the sets of invalidated and incomplete walks.
Lemma 41. On invalidation of an individual guest address
invlpg(h) ∧ inva(h) ∈ AG
we claim the following holds:
invset(h) = tlbG(h)∩IV(s(h)) (1)
incset(h) = tlbG(h)∩IC(s(h)) (2)
Proof of lemma 41.1.
tlbG(h) ∩ IV(s(h))
= tlbG(h) ∩ {w | w.upa = s(h).upa} (definition)
= tlbG(h) ∩ {w | w.upa = inva(h)} (stepping)
= tlbset(h) ∩ {w | w.upa = inva(h)} (assumptions)
= invset(h) (definition) □
Proof of lemma 41.2.
tlbG(h) ∩ IC(s(h))
= tlbG(h) ∩ {w | w.as = s(h).upa.as ∧ w.ℓ[0]} (definition)
= tlbG(h) ∩ {w | w.as = inva(h).as ∧ w.ℓ[0]} (stepping)
= tlbset(h) ∩ {w | w.as = inva(h).as ∧ w.ℓ[0]} (assumptions)
= incset(h) (definition) □
In addition, the hardware drops all user walks that are covered by the sets of invalidated and
incomplete (guest) walks. This is done in order to preserve the invariant, in accordance with
intuition from Sect. 4.1. Otherwise, in case a user address is invalidated, the hardware drops
all (user) walks that are covered by the sets of invalidated and incomplete (user) walks. We for-
malize the latter two results in the following lemma. In order to save space, for X ∈ {V,C,D}
we sometimes abbreviate:
IX = IX (s(h)).
Lemma 42. On invalidation of an individual address
invlpg(h)
we claim the following holds:
inva(h) ∈ AG → ragset(h) ⊇ tlbU (h)[c][IV ∪IC ] (1)
inva(h) ∈ AU → invset(h)∪ incset(h) ⊇ tlbU (h)[IV ∪IC ][c] (2)
Proof of lemma 42.1. In the proof below we argue using part (1) of invariant 3 as follows.
When ragged walk
w = wu ◦ wi
was composed from a certain incomplete guest walk in hardware, the latter guest walk (wi) was
contained in the hardware TLB.
w ∈ tlbU(h) → ∃h̃ : wi ∈ tlbG(h̃)
For all incomplete walks contained in the hardware TLB we know from the invariant that they are
faulty. Using the definition of the walk composition we conclude the following.
wi ∈ IC(s(h)) → f(wi) ∧ w.fg (6)
Having this we proceed to show the claim.
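\[
\begin{aligned}
& tlb_U(h)[c][I_V \cup I_C] \\
&\subseteq tlb_U(h)[*][I_V \cup I_C] \\
&= \bigcup_{w_i \in I_V \cup I_C} \{ w \in tlb_U(h) \mid \exists w_u : (w = w_u \circ w_i) \wedge (gpa(w,h) = w_i.pa) \} && \text{(notation)} \\
&\subseteq tlb_U(h) \cap \bigcup_{w_i \in I_V} \{ w \mid (w.vm = w_i.vm) \wedge (gpa(w,h) = w_i.pa) \} \ \cup \\
&\phantom{\subseteq{}} tlb_U(h) \cap \bigcup_{w_i \in I_C} \{ w \mid (w.vm = w_i.vm) \wedge w.f_g \} && \text{(definition, equation 6)} \\
&= tlb_U(h) \cap \{ w \mid (w.vm = s(h).upa.vm) \wedge ((gpa(w,h) = s(h).upa.pa) \vee w.f_g) \} && \text{(definition)} \\
&= tlb_U(h) \cap \{ w \mid (w.vm = inva(h).vm) \wedge ((gpa(w,h) = inva(h).pa) \vee w.f_g) \} && \text{(stepping)} \\
&= tlbset(h) \cap \{ w \mid (w.upa \in A_U(inva(h).vm)) \wedge ((gpa(w,h) = inva(h).pa) \vee w.f_g) \} \\
&= ragset(h) && \text{(definition)} \qquad \square
\end{aligned}
\]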
Proof of lemma 42.2.
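\[
\begin{aligned}
& tlb_U(h)[I_V \cup I_C][c] \\
&\subseteq tlb_U(h)[I_V \cup I_C][*] \\
&= \bigcup_{w_i \in I_V \cup I_C} \{ w \in tlb_U(h) \mid \exists w_g : (w = w_i \circ w_g) \wedge (gpa(w,h) = w_g.pa) \} && \text{(notation)} \\
&\subseteq tlb_U(h) \cap \bigcup_{w_i \in I_V} \{ w \mid w.upa = w_i.upa \} \ \cup \\
&\phantom{\subseteq{}} tlb_U(h) \cap \bigcup_{w_i \in I_C} \{ w \mid (w.as = w_i.as) \wedge (w.\ell = w_i.\ell) \} && \text{(definition)} \\
&\subseteq tlb_U(h) \cap \{ w \mid (w.upa = s(h).upa) \vee (w.as = s(h).upa.as) \wedge w.\ell[0] \} && \text{(definition)} \\
&= tlb_U(h) \cap \{ w \mid (w.upa = inva(h)) \vee (w.as = inva(h).as) \wedge w.\ell[0] \} && \text{(stepping)} \\
&= tlbset(h) \cap \{ w \mid (w.upa = inva(h)) \vee (w.as = inva(h).as) \wedge w.\ell[0] \} \\
&= invset(h) \cup incset(h) && \text{(definition)} \qquad \square
\end{aligned}
\]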
In the following lemma we formalize the last result about translations dropped from the hard-
ware TLB on invalidating steps. So, in case a virtual machine is invalidated, all (user) walks
that are covered by the set of dropped (user) walks are dropped from the hardware TLB.
Lemma 43. On invalidation of a virtual machine
vmflush(h)
we claim the following holds:
invset(h) ⊇ tlbU (h)[ID][c]
Proof of lemma 43.
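\[
\begin{aligned}
& tlb_U(h)[I_D][c] \\
&\subseteq tlb_U(h)[I_D][*] \\
&= \bigcup_{w_i \in I_D} \{ w \in tlb_U(h) \mid \exists w_g : (w = w_i \circ w_g) \wedge (gpa(w,h) = w_g.pa) \} && \text{(notation)} \\
&\subseteq tlb_U(h) \cap \bigcup_{w_i \in I_D} \{ w \mid w.upa = w_i.upa \} && \text{(definition of } \circ \text{)} \\
&\subseteq tlb_U(h) \cap \{ w \mid w.upa \in A_U(s(h).vm) \} && \text{(definition)} \\
&= tlb_U(h) \cap \{ w \mid w.upa \in A_U(inva(h).vm) \} && \text{(stepping)} \\
&= tlbset(h) \cap \{ w \mid w.upa \in A_U(inva(h).vm) \} \\
&= invset(h) && \text{(definition)} \qquad \square
\end{aligned}
\]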
Also we argue that after the invalidation queries the content of both walk registers is not mean-
ingful.
Lemma 44.
tdrop(h) → wG(h′) ∪ wU(h′) = ∅
Proof of lemma 44. By case split on whether the MMU is idle in configuration h:
• h.idle = 1. In case the idle MMU is not requested for translation, it remains idle.
h.idle∧/h.treq → h′.idle
• h.idle = 0. If the ongoing translation was not aborted yet, it is aborted.
/h.abort ∧/h.idle∧/h.treq → h′.abort
Otherwise, the abort signal stays high in configuration h′.
h.abort ∧/h.idle → h′.abort
In all considered cases the claim follows by lemma 35. □
5.3.3 Adding Translations
In Sect. 3.4.2 we specified the walks (w(x)) which are added to the specification TLB on
resp. guest (taddG) and user (taddU ) TLB steps. For the upcoming argument for correctness of
the TLB steps we need to show that these walks coincide with the inputs of the corresponding
walk registers in hardware.
Lemma 45. On addition of translations
tadd(h)
we claim the following holds:
taddG(h) → w(s(h)) = wg(h) (1)
taddU (h) → w(s(h)) = wu(h) (2)
Recall, by definition of the stepping function we have:
\[
tadd_X(h) \rightarrow s(h) =
\begin{cases}
(winit, upa_X(h), h.pto_X) & winit_X(h) \\
(wext, w_X, pte_X(h)) & \text{otherwise.}
\end{cases}
\]
Proof of lemma 45.1.
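\[
\begin{aligned}
w(s(h)) &=
\begin{cases}
winit(upa, pto) & s(h) = (winit, upa, pto) \\
wext(w, pte) & s(h) = (wext, w, pte)
\end{cases} && \text{(definition)} \\
&=
\begin{cases}
winit(upa_G(h), h.pto_G) & winit_G(h) \\
wext(w_G, pte_G(h)) & \text{otherwise}
\end{cases} && \text{(stepping)} \\
&= w_g(h) && \text{(construction)} \qquad \square
\end{aligned}
\]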
Proof of lemma 45.2.
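\[
\begin{aligned}
w(s(h)) &=
\begin{cases}
winit(upa, pto) & s(h) = (winit, upa, pto) \\
wext(w, pte) & s(h) = (wext, w, pte)
\end{cases} && \text{(definition)} \\
&=
\begin{cases}
winit(upa_U(h), h.pto_U) & winit_U(h) \\
wext(w_U, pte_U(h)) & \text{otherwise}
\end{cases} && \text{(stepping)} \\
&= w_u(h) && \text{(construction)} \qquad \square
\end{aligned}
\]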
5.4 Correctness Proof
Finally, we proceed to the proof of lemma 36. The proof is obviously by induction on the
number of hardware cycles t. For the base case (t = 0) we consider
c^0.tlb = ∅
and argue as follows. Simulation relation for the TLB content (simT) holds trivially.
tlbG(h^0) = ∅ ⊆ c^0.tlb
tlbU(h^0) = ∅ ⊆ c^0.tlb◦
The simulation relation for the walk registers (simW) holds by lemma 35. Invariant 4 also
holds trivially after the initialization of hardware.
walksU(h^0) = ∅ = walksU(h^0)[c^0][c^0]
For the induction step (t→ t+1) we split cases on the type of query processed by hardware in
cycle t:
• qr[t].treq — translation queries — new translation can be added (taddX (t)),
• qr[t].ireq — invalidation queries — old translation(s) are dropped (tdrop(t)), and
• otherwise — void queries — no steps are performed in cycle t.
We cover each type in a dedicated section below. For convenience we abbreviate configurations
of hardware (h) and ISA (c) in cycles t and t+1. For z ∈ {h,c} we abbreviate as usual:
(z, z′) ≡ (z^t, z^(t+1)).
5.4.1 Void Queries
We consider the simplest case first: we show that in the absence of any requests, the
configuration of the MMU component stays unchanged, which implies that the simulation
of the MMU content as well as invariant 4 (coverage of the user walks) are
maintained. Below we make our arguments formal.
Simulation
From the hardware construction — automata of the nested MMU, Sect. 4.2.2 — we first argue
that the content of the walk registers is not meaningful in configuration h′.
wU(h′) ∪ wG(h′) = ∅
In order to show the claim we cover the following cases:
i) loss of the translation request (aborting translations),
ii) clearing the pending abort signal (restoring after aborts), and
iii) waiting for a query/restore.
In all cases above we derive directly from the control mechanisms:
• aborting translations (h.abort.set)
h.abort.set → h′.abort
• restoring after aborts (h.abort.clr)
h.abort.clr → h′.idle
• waiting for a query/restore (/h.abort.set ∧/h.abort.clr)
h.abort⊕h.idle → h′.abort⊕h′.idle
According to the definition it remains to show
simT (h′,c′).
From the absence of queries we know:
i) no translation is performed in configuration h (/tadd(h)∧/h.tlb.store), and
ii) no invalidation is performed in configuration h (/tdrop(h)∧/h.tlb.inval).
Therefore, the hardware TLB does not change
/h.tlb.store∧/h.tlb.inval → h′.tlb = h.tlb
and the specification TLB does not change either.
/tadd(h)∧/tdrop(h) → c′ = c
Thus, simulation of the TLB content (simT ) we conclude from the arguments above.
tlbG(h′) = tlbG(h) ⊆ c.tlb = c′.tlb
tlbU (h′) = tlbU (h) ⊆ c.tlb◦ = c′.tlb◦
Invariant
Next we argue about the invariant. As we have already shown above, the content of the walk
registers is not meaningful in configuration h′. Therefore we have
wU(h′) ◦ wG(h′) = ∅
and proceed to show the following.
walksU (h′) = tlbU (h′)∪wU (h′)◦wG(h′) (definition)
= tlbU (h) (construction)
= tlbU (h)[c][c] (lemma 40)
= tlbU (h)[c′][c′] (c′ = c)
The claim follows by lemma 39. Note that in the absence of any requests, as well as in the
cycles in which the nested MMU is busy waiting (for the memory system), function gpa does
not change the values for the walks from set walksU .
5.4.2 Translation Queries
Here we consider the translation queries and show correctness of the TLB steps — of walk
initialization and extension. We need to make sure that in all hardware configurations which
can be encountered during execution of the translation queries the content of the nested MMU
is correctly simulated and all user walks are covered by the walks in the specification TLB.
Guest Simulation
First we establish simulation of the guest walks (walksG) as it was formulated in relation simA
(see Sect. 5.3). We split the proof of
walksG(h′)⊆ c′.tlb
into the following four cases:
i) clocking of the walk registers (adding walks),
ii) writing walks from the walk registers into the TLB cache,
iii) writing walks from the TLB back into the walk registers, and
iv) waiting for the memory (cycles in which the previous two cases do not apply).
• adding walks (tadd(h)). We split further into two cases, depending on the type (guest or
user) of the walk added.
– adding a guest walk (taddG(h)).
walksG(h′) = tlbG(h′)∪wG(h′) (definition)
= tlbG(h)∪{wg(h)} (construction)
⊆ walksG(h)∪{wg(h)} (definition)
⊆ c.tlb∪{wg(h)} (simA(h,c))
= c.tlb∪{w(s(h))} (lemma 45.1)
= c′.tlb (specification)
– adding a user walk (taddU (h)).
walksG(h′) = tlbG(h′)∪wG(h′) (definition)
= tlbG(h)∪wG(h) (construction)
= walksG(h) (definition)
⊆ c.tlb (simA(h,c))
⊆ c.tlb∪{w(s(h))}
= c′.tlb (specification)
• store a walk into the hardware TLB (/tadd(h)). We again split cases further, but this time
on whether a guest walk or a composition of walks is written.
– writing a guest walk (write-tlbG(h)).
walksG(h′) = tlbG(h′)∪wG(h′) (definition)
⊆ (tlbG(h)∪wG(h))∪wG(h) (construction)
= walksG(h) (definition)
⊆ c.tlb (simA(h,c))
= c′.tlb (c′ = c)
– writing a composition of walks (write-tlbU (h)).
walksG(h′) = tlbG(h′)∪wG(h′) (definition)
⊆ tlbG(h)∪wG(h) (construction)
= walksG(h) (definition)
⊆ c.tlb (simA(h,c))
= c′.tlb (c′ = c)
Note that in order to fit the walk which is written, the hardware evicts some valid entry
(stored walk) in case the TLB cache is full. Also, due to the invariant (about uniqueness of
walks with respect to translated addresses) maintained by our hardware construction (see
Sect. 5.2.3), a valid walk stored in hardware is overwritten each time one attempts to store
another walk for the same universal page address.
• writing the walk register with a walk from the hardware TLB (/tadd(h)). The latter con-
cerns only the guest walk register, which is written in state nested-call with a guest walk
from the TLB each time the requested translation was present in the cache (TLB) before
the simple translation started.
walksG(h′) = tlbG(h′)∪wG(h′) (definition)
= tlbG(h)∪{h.wout} (construction)
⊆ walksG(h) (definition)
⊆ c.tlb (simA(h,c))
= c′.tlb (c′ = c)
• busy waiting (/tadd(h)). Finally, we argue that in the hardware cycles in which the TLB
and the walk registers are not written (in the absence of the TLB steps (tadd(h)) and
outside of the control state write-tlbX (h)), the simulation of the guest walks is preserved.
walksG(h′) = tlbG(h′)∪wG(h′) (definition)
= tlbG(h)∪wG(h) (construction)
= walksG(h) (definition)
⊆ c.tlb (simA(h,c))
= c′.tlb (c′ = c)
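The case analysis above rests on elementary set reasoning. As a minimal sketch of the "adding a guest walk" case, assume walks are modelled as opaque strings and the hardware state is reduced to the set tlbG(h); the function names `add_guest_walk` and `covered` are hypothetical and serve only as illustration.

```python
def add_guest_walk(tlb_g, c_tlb, w):
    """One 'adding a guest walk' step (illustrative model only).

    Hardware: the guest walk register is overwritten with w, so
    walksG(h') = tlbG(h) | {w}.
    Specification: the same walk is added, so c'.tlb = c.tlb | {w}.
    """
    return tlb_g | {w}, c_tlb | {w}


def covered(walks_g, c_tlb):
    """Guest part of the simulation: walksG(h) is a subset of c.tlb."""
    return walks_g <= c_tlb


# Usage: the inclusion holds before the step and is preserved by it.
hw, spec = add_guest_walk({"w1"}, {"w1", "w2"}, "w3")
print(covered(hw, spec))
```

The other cases follow the same pattern with the hardware set unchanged or shrinking while the specification TLB stays the same.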
Invariant
Next we show that invariant 4 is maintained over the translation queries. In the proof we
implicitly use a hardware property: the values of function gpa are persistent for the user walks
from the TLB (tlbU ). Whether or not the hardware cache (h.tlb) is written, the values
“associated” with the hardware walks do not change.
We split the proof of
inv◦(h′,c′)
into the same cases as the proof above: i) adding walks, ii)-iii) writing walks, and iv) busy
waiting.
Fig. 26: Changes in the set of user walks (walksU ) on adding of walks
• adding walks. For convenience we introduce an abbreviation
{w′} = wU (h′)◦wG(h′)
to refer to the newly added walk. As depicted in Fig. 26, the set of user walks (walksU ) in
configuration h′ consists of i) the user walks in the TLB cache, which remain unchanged,
and ii) the newly added walk (w′):
walksU (h′) = tlbU (h′) ∪ {w′}.
First, we show that the coverage is maintained for the user walks from the TLB.
tlbU (h′) = tlbU (h) (construction)
⊆ walksU (h)∩walksU (h′) (definition)
= walksU (h)[c][c]∩walksU (h′) (inv◦(h,c))
⊆ walksU (h′)[c][c]
⊆ walksU (h′)[c′][c′] (specification)
From the definition of the guest page address for the newly added walk we have:
gpa(w′,h′) = (ε wG(h′)).pa.
Therefore, walk w′ is covered by walks wU (h′) and wG(h′), which are contained in the
specification TLB of configuration c′. We justify the latter below.
– adding a guest walk.
{w′} ⊆ walksU (h′)[wU (h′)][wG(h′)] (definition)
= walksU (h′)[wU (h)][wg(h)] (construction)
⊆ walksU (h′)[c][w(s(h))] (simW (h,c); lemma 45.1)
⊆ walksU (h′)[c′][c′] (specification)
– adding a user walk.
{w′} ⊆ walksU (h′)[wU (h′)][wG(h′)] (definition)
= walksU (h′)[wu(h)][wG(h)] (construction)
⊆ walksU (h′)[w(s(h))][c] (lemma 45.2; simW (h,c))
⊆ walksU (h′)[c′][c′] (specification)
• store a walk into the hardware TLB.
walksU (h′) = tlbU (h′)∪wU (h′)◦wG(h′) (definition)
⊆ (tlbU (h)∪wU (h)◦wG(h))∪wU (h)◦wG(h) (construction)
= walksU (h) (definition)
= walksU (h)[c][c] (inv◦(h,c))
= walksU (h)[c′][c′] (c′ = c)
Note, the set of user walks (walksU ) can only decrease after a write of the TLB.
• writing the guest walk register with a walk from the hardware TLB (/tadd(h)). Using
notation from the first case above, we first argue that coverage of the user walks from the
TLB is maintained; the proof stays literally the same. However we need to update the
argument for the newly obtained composition of walks.
{w′} ⊆ walksU (h′)[wU (h′)][wG(h′)] (definition)
= walksU (h′)[wU (h)][h.wout] (construction)
⊆ walksU (h′)[c][c] (simW (h,c))
= walksU (h′)[c′][c′] (c′ = c)
• busy waiting. Nothing changes in the hardware, so the claim carries over directly.
walksU (h′) = walksU (h) (construction)
= walksU (h)[c][c] (inv◦(h,c))
= walksU (h)[c′][c′] (c′ = c)
Note, in all cases the claim follows by lemma 39.
User Simulation
Finally, we show that simulation simtlb is preserved over the translation queries.
simtlb(h′,c′)
Previously we have shown
walksG(h′)⊆ c′.tlb
and
inv◦(h′,c′).
The latter by lemma 38 gives
walksU (h′)⊆ c′.tlb◦.
Therefore, we have simA(h′,c′) and by lemma 37 it remains to show
wU (h′)⊆ c′.tlb.
We consider only those cycles in which the walk registers are clocked (updated). In other
cycles their values do not change, and the proof is trivial. We split cases depending on the
register which is clocked. By lemma 32 this register is unique.
• updating the guest walk register (taddG(h)).
wU (h′) = wU (h) (construction)
⊆ c.tlb (lemma 37)
⊆ c.tlb∪{w(s(h))}
= c′.tlb (specification)
• updating the user walk register (taddU (h)).
wU (h′) = {wu(h)} (construction)
= {w(s(h))} (lemma 45.2)
⊆ c.tlb∪{w(s(h))}
= c′.tlb (specification)
5.4.3 Invalidation Queries
Lastly, we consider the invalidation queries and show correctness of the steps invalidating
the TLB content. In contrast to the translation queries, which in general take multiple
hardware cycles, our MMU implementation invalidates all specified translations in a
single cycle. This means we do not have to split cases depending on the particular action
performed by the nested MMU while it serves an invalidation query. Instead, we split the proofs
on specifics of the query served. Our proof goals remain unchanged: we are to show that
simulation simtlb is preserved and invariant 4 is maintained.
Guest Simulation
As before, we start with the simulation of the guest walks. From lemma 44 we have
walksG(h′) = tlbG(h′)
therefore it suffices to show
tlbG(h′)⊆ c′.tlb.
We split cases as follows:
• dropping individual walks (invlpg(h)). As before, we split cases further depending on the
type (guest or user) of the walks dropped.
– dropping guest walks (inva(h) ∈ AG).
tlbG(h′) = tlbG(h)\ (invset(h)∪ incset(h)) (construction)
= tlbG(h)\ (tlbG(h)∩ (IV ∪IC)) (lemma 41)
= tlbG(h)\ (IV ∪IC)
⊆ c.tlb\ (IV ∪IC) (simA(h,c))
= c′.tlb (specification)
– dropping user walks (inva(h) ∈ AU ).
tlbG(h′) = tlbG(h)\ (invset(h)∪ incset(h)) (construction)
⊆ tlbG(h)\ tlbU (h)[IV ∪IC ][c] (lemma 42.2)
= tlbG(h)
= tlbG(h)\ (IV ∪IC)
⊆ c.tlb\ (IV ∪IC) (simA(h,c))
= c′.tlb (specification)
• dropping virtual machines (vmflush(h)).
tlbG(h′) = tlbG(h)\ invset(h) (construction)
⊆ tlbG(h)\walksU (h)[ID][c] (lemma 43)
= tlbG(h)
= tlbG(h)\ID
⊆ c.tlb\ID (simA(h,c))
= c′.tlb (specification)
• dropping all walks (flush(h)). The simulation of the guest walks holds trivially.
tlbG(h′) = ∅ (construction)
= c′.tlb (specification)
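The invalidation cases above use the fact that removing the same set of walks from both sides preserves an inclusion. A minimal sketch of this step, assuming walks are modelled as opaque strings and the invalidated set is given explicitly (the names `invalidate` and `flush` are hypothetical):

```python
def invalidate(tlb_g, c_tlb, dropped):
    """Drop every walk in `dropped` from the hardware set and the
    specification TLB (set difference on both sides)."""
    return tlb_g - dropped, c_tlb - dropped


def flush(tlb_g, c_tlb):
    """Dropping all walks leaves both sets empty."""
    return set(), set()


# Usage: tlbG(h) is a subset of c.tlb before and after the step.
hw, spec = invalidate({"a", "b"}, {"a", "b", "c"}, {"b", "c"})
print(hw <= spec)
```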
Fig. 27: Changes in the set of user walks (walksU ) on dropping of guest walks
Invariant
Next we show that invariant 4 is maintained over the invalidation queries. Again, since from
lemma 44 we have
walksU (h′) = tlbU (h′)
together with lemma 39 it suffices to show
tlbU (h′) = tlbU (h)[c′][c′].
We split the proof into the same cases as the proof above: i) dropping individual walks, ii)
dropping virtual machines, and iii) dropping all walks.
• dropping individual walks (invlpg(h)). Note, in case one invalidates guest walks (the first
case below), the hardware also drops the set of ragged walks according to the hardware
specification from Sect. 4.2.1.
– dropping guest walks (inva(h) ∈ AG). For more intuition we refer to Fig. 27.
tlbU (h′) = tlbU (h)\ ragset(h) (construction)
⊆ tlbU (h)[c][c]\ tlbU (h)[c][IV ∪IC ] (lemma 40; lemma 42.1)
= (tlbU (h)[c][\overline{IV ∪IC}]∪ tlbU (h)[c][IV ∪IC ])\ tlbU (h)[c][IV ∪IC ]
⊆ tlbU (h)[c][\overline{IV ∪IC}]
= tlbU (h)[c′][c′] (specification)
– dropping user walks (inva(h) ∈ AU ).
tlbU (h′) = tlbU (h)\ (invset(h)∪ incset(h)) (construction)
⊆ tlbU (h)[c][c]\ tlbU (h)[IV ∪IC ][c] (lemma 40; lemma 42.2)
= (tlbU (h)[\overline{IV ∪IC}][c]∪ tlbU (h)[IV ∪IC ][c])\ tlbU (h)[IV ∪IC ][c]
⊆ tlbU (h)[\overline{IV ∪IC}][c]
= tlbU (h)[c′][c′] (specification)
• dropping virtual machines (vmflush(h)).
tlbU (h′) = tlbU (h)\ invset(h) (construction)
⊆ tlbU (h)[c][c]\ tlbU (h)[ID][c] (lemma 40; lemma 43)
= (tlbU (h)[\overline{ID}][c]∪ tlbU (h)[ID][c])\ tlbU (h)[ID][c]
⊆ tlbU (h)[\overline{ID}][c]
= tlbU (h)[c′][c′] (specification)
• dropping all walks (flush(h)). The invariant for the user walks holds trivially.
tlbU (h′) = ∅ (construction)
= tlbU (h)[c′][c′] (specification)
User Simulation
We finish the section showing that simulation simtlb is preserved over the invalidation queries.
simtlb(h′,c′)
This is a counterpart of the corresponding result about the translation queries. Hence, the proof
below follows the same pattern as the respective proof above. Previously we have shown
walksG(h′)⊆ c′.tlb
and
inv◦(h′,c′).
The latter by lemma 38 gives
walksU (h′)⊆ c′.tlb◦.
Therefore, we have simA(h′,c′) and by lemma 37 it remains to show
wU (h′)⊆ c′.tlb,
which we obtain trivially by lemma 44.
This finishes the proof of lemma 36 and with it the chapter on correctness of the nested MMU
implementation. In the forthcoming chapters we demonstrate how to prove correctness
of a simple sequential (yet not completely trivial) machine capable of performing the nested
address translation.
Part III
Single-Core MIPS with NAT

6 Sequential Processor with Nested MMUs
The content of this chapter is split into sections as follows. In Sect. 6.1 we specify which
components are involved and how these components are interconnected. In order to interconnect the
components we introduce several implementation registers together with a small state machine
to generate the control signals. Then, in Sect. 6.2 we interconnect the cache memory system
and identify the accesses performed by the hardware computation. In Sect. 6.3 we finish the
chapter by showing that the presented sequential implementation is live.
6.1 Sequential Processor
To compose the machine we are interested in, obviously, one requires: i) a sequential pro-
cessor core, ii) two hardware units performing the nested translation for both, the instruction
and effective address, and iii) a sequentially consistent shared memory. We resp. take: i) the
sequential processor core from [LOP] (with two delayed program counters1), ii) two nested
MMUs, formally specified and constructed in Chap. 4, and iii) the cache memory system
from [KMP14]. The latter memory system we connect to the processor core and the MMUs at
the four cache interfaces, as depicted in Fig. 28.
Note that we did not formulate any conditions for the translation inputs of the MMUs (dotted paths
in Fig. 28), for the following reason. For mmuI , the data for translation of course come
from the PCs (program counters). For instance, in pipelined machines mmuI would get the data
from one of the three PC registers, depending on the fullness of the pipe. PCs of
non-sequential machines are computed speculatively, which means
they take wrong values in case of mis-speculation (e.g., on eret). Often these (mis-speculated)
values do not even belong to the address range of the program that is executed. This in turn
means that translation steps can be performed for addresses that are never used by a source
program. Getting back to our non-deterministic semantics from Chap. 3, one notices that the
latter behavior is completely legitimate.
On the other hand, the dotted connections from the processor core to the MMUs are necessary
to satisfy a crucial (and very reasonable) guard condition in the specification: the walk passed
to the core has to match the universal address that the core is trying to access (see Sect. 3.3.4).
This condition expresses the essence of the address translation.
After all, the fact that we use these particular data (from the PCs) to start the process of address
translation, speculative in certain machines, is nothing more than one out of many
possibilities permitted in the specification (even though it might be an intuitive and elegant design
solution). Though the machine considered in this chapter is sequential, it uses the shared
memory. This turns out to be sufficient for the hardware to perform some translations that are
speculative in ISA (think, e.g., about the reset).
1 Since in Chap. 8 we consider a seven stage pipelined machine (with two delay slots), here we
immediately introduce the corresponding components (delayed PCs) in order to save time later. Thus, for
the simpler sequential machine we can introduce, e.g., special purpose register file with appropriate
number of exception PCs and use this hardware in the pipelined machine without changes. Moreover,
this allows us to reuse certain arguments in the later chapters and focus on more relevant things.
Fig. 28: Data paths connecting the MMUs to the processor core and the memory system
6.1.1 Control Logic
Execution of instructions in the sequential machine is subdivided into several phases. Within
each phase, which can last several cycles, the hardware performs various actions towards
execution of the next instruction in program order, which we refer to as
the current instruction in what follows. Below we describe every phase by means of actions
performed in the hardware.
i) instruction address translation. In case the machine runs in translated mode, instruction
MMU (mmuI) provides a translation for the current instruction.
ii) instruction fetch. Instruction cache (caI) provides the current instruction.
iii) effective address translation. In case the machine runs in translated mode and the cur-
rent instruction is a memory operation, data MMU (mmuE ) provides a translation for the
effective address.
iv) instruction execution. In case the current instruction is a memory operation, data cache
(caE ) provides the data required to perform this operation. Data structures of the sequen-
tial machine are updated by execution of the current instruction.
On interrupts, the hardware either jumps to the interrupt service routine or — in case the
resume type is continue — continues its current operation. Control logic that implements this
is not particularly difficult. The simple automaton in Fig. 29 is designed to keep track of
the current execution phase (in one of the control states) and change the machine’s control
signals according to the execution phase, availability of the requested data, and the presence of
unmasked interrupts. For convenience we introduce abbreviations for the following hardware
signals.
exec(h) ≡ jisr(h)→ cont(h)
mexec(h) ≡ exec(h)∧mop(h)
Fig. 29: Machine’s control automaton (states idle (start), IT, IF, ET, EX; transition labels
exec, exec∧/mmuI .busy, exec∧/caI .mbusy, exec∧/mmuE .busy)
The first signal is used to specify the transitions of the control automaton. Intuitively,
execution of the current instruction continues while this signal is active. On interrupts of the
non-continue type, execution of the current instruction is interrupted and a jump to the interrupt
service routine is performed. For that purpose we added several transitions returning control
to the initial state (see Fig. 29). States
S = {idle, IT, IF,ET,EX}
of the control automaton we naturally order as follows:
idle < IT < IF < ET < EX.
To keep track of the current control state we introduce a function
cs : KHW →S
which maps hardware configurations h ∈ KHW to control states σ ∈ S such that
σ(h) → cs(h) = σ .
While a cache or an MMU stays busy, we say that the corresponding control state is busy, and
abbreviate
busy(σ ,h) ≡
  h.mmuY .busy   if σ = YT
  h.caI .busy    if σ = IF
  h.caE .busy    if σ = EX
  0              if σ = idle.
From the construction of the control automaton we clearly have the following.
Lemma 46. Assume cs(h) ≠ idle.
busy(cs(h),h) ↔ cs(h′) = cs(h)
Control state σ ∈ S finishes if the control leaves state σ :
fin(σ ,h) ≡ (cs(h) = σ)∧/busy(σ ,h).
Execution of the current instruction ends either on an interrupt, or if state EX finishes:
endex(h) ≡ exec(h)→ fin(EX,h).
A trivial lemma follows directly from the construction of the control automaton.
Lemma 47.
endex(h)→ idle(h′) (1)
/endex(h)→ cs(h)≤ cs(h′) (2)
For convenience, we also abbreviate
finex(σ ,h) = fin(σ ,h)∧ exec(h),
and another trivial lemma follows.
Lemma 48. Assume cs(h) ≠ EX.
finex(cs(h),h) → cs(h)< cs(h′)
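The definitions of this subsection can be exercised on a small model. The following is a minimal sketch, not the hardware construction itself: the transition function `next_state` is an assumption reconstructed from the description of Fig. 29 (return to idle when execution ends, stay while the current unit is busy, otherwise advance by one phase), and the four busy signals are passed in as plain booleans.

```python
from itertools import product

# Control states, ordered as idle < IT < IF < ET < EX.
IDLE, IT, IF, ET, EX = range(5)


def busy(cs, mmu_i, ca_i, mmu_e, ca_e):
    """busy(sigma, h): the unit serving the current phase is still busy."""
    return {IDLE: False, IT: mmu_i, IF: ca_i, ET: mmu_e, EX: ca_e}[cs]


def fin(sigma, cs, units):
    """fin(sigma, h): the machine is in state sigma and sigma is not busy."""
    return cs == sigma and not busy(sigma, *units)


def endex(cs, exec_, units):
    """endex(h) = exec(h) -> fin(EX, h)."""
    return (not exec_) or fin(EX, cs, units)


def next_state(cs, exec_, units):
    """Assumed transition function of the control automaton."""
    if endex(cs, exec_, units):
        return IDLE
    if busy(cs, *units):
        return cs
    return cs + 1


def check_lemmas():
    """Exhaustively check lemmas 47 and 48 on this sketch."""
    for cs, exec_ in product(range(5), (False, True)):
        for units in product((False, True), repeat=4):
            nxt = next_state(cs, exec_, units)
            if endex(cs, exec_, units) and nxt != IDLE:        # lemma 47.1
                return False
            if not endex(cs, exec_, units) and not cs <= nxt:  # lemma 47.2
                return False
            finex = fin(cs, cs, units) and exec_
            if cs != EX and finex and not cs < nxt:            # lemma 48
                return False
    return True
```

Calling `check_lemmas()` enumerates all 160 combinations of state and input signals of the model.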
6.1.2 Collecting Interrupts
The event signals are collected in every control state visited throughout execution of the current
instruction. For convenience we introduce the following shorthands to group the event signals
collected in the various states.
ev(idle) = 0^9 ◦ e ◦ reset
ev(IT) = 0^8 ◦ malf ◦ 0^2
ev(IF) = 0^6 ◦ gff ◦ pff ◦ 0^3
ev(ET) = 0^2 ◦ malm ◦ ovf ◦ sysc ◦ ill ◦ 0^5
ev(EX) = gfm ◦ pfm ◦ 0^9
Note that we mask the external interrupt signal outside the idle state.
e(h) = idle(h)∧ eev(h)[1]
In case an external interrupt occurs in a cycle when the machine does not reside in the idle
state, the latter interrupt is masked until the hardware returns into the idle state. The interrupt
convention [MP00] makes sure that the external interrupt stays on, and eventually will be re-
ceived by the hardware. Computation of most internal interrupt event signals precisely follows
the specifications from Sect. 2.3, and therefore is omitted. For the page faults on fetch and
memory operation resp. we have the following.
pff(h) = /host(h) ∧ f(h.wI)
pfm(h) = /host(h) ∧ f(h.wE) ∧ mop(h)
Analogously, the general-protection faults on fetch and memory operation resp. are computed
as follows:
gff(h) = /host(h) ∧ /f(h.wI) ∧ (11 ≰ h.wI .r[xu])
gfm(h) = /host(h) ∧ /f(h.wE) ∧ (1s ≰ h.wE .r[uw]) ∧ mop(h)
where write-bit s is computed according to the specification (see Sect. 3.3.5).
s ≡ s(h)∨ cas(h)
For the masked cause signal we get the following:
mca(h) = imask(h) ∧ ⋁_{σ ≤ cs(h)} ev(σ ,h)
where the interrupt mask above is obtained from the status register.
imask(h) = 1^9 ◦ h.sr[1] ◦ 1.
Computation of processor control signals jisr and cont remains straightforward.
jisr(h) ≡ mca(h) ≠ 0^11
cont(h) ≡ f1(mca(h))[7 : 6] ≠ 0^2
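The collection of event signals can be illustrated on integers. This is a minimal sketch under stated assumptions: bit 0 is the rightmost bit of the 11-bit vectors, f1 is read as "find first one" (isolating the lowest set bit), and the argument `active` is a hypothetical stand-in for the already computed event signals (with e already masked outside the idle state).

```python
IDLE, IT, IF, ET, EX = range(5)   # control states, idle < IT < IF < ET < EX

# Bit positions in the 11-bit event vector (bit 0 is the rightmost bit).
RESET, E, MALF, PFF, GFF, ILL, SYSC, OVF, MALM, PFM, GFM = range(11)

# Event signals collected in each control state, following ev(sigma).
EV_BITS = {IDLE: (RESET, E), IT: (MALF,), IF: (PFF, GFF),
           ET: (ILL, SYSC, OVF, MALM), EX: (PFM, GFM)}


def ev(sigma, active):
    """Event vector of state sigma; `active` is the set of raised event bits."""
    return sum(1 << b for b in EV_BITS[sigma] if b in active)


def mca(cs, active, sr1=1):
    """Masked cause: events of all states up to cs, masked with imask."""
    imask = (0x7FF & ~(1 << E)) | (sr1 << E)   # 1^9 . sr[1] . 1
    collected = 0
    for sigma in range(cs + 1):
        collected |= ev(sigma, active)
    return imask & collected


def f1(x):
    """Keep only the lowest set bit (assumed meaning of f1)."""
    return x & -x


def jisr(m):
    return m != 0


def cont(m):
    """Bits [7:6] of f1(mca) are non-zero, i.e. ovf or sysc has priority."""
    return ((f1(m) >> 6) & 0b11) != 0


def exec_(m):
    """exec = jisr -> cont."""
    return (not jisr(m)) or cont(m)
```

For example, a page fault on a memory operation is not visible in `mca(IT, {PFM})`, since the signal is only collected once the machine has reached state EX.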
In general, on JISR we have to argue only about those interrupt signals which are used in a
particular control state of the machine. In order to keep track of the used interrupt signals we
throw in another technical definition. When given a control state, auxiliary function
J : S → N∗
returns the set of interrupts (indices) processed in that state. To define J formally we simply
specify all the values.
σ       idle  IT   IF     ET     EX
J (σ)   {1}   {2}  [3:4]  [5:8]  [9:10]
A trivial property follows directly from the definition.
∀σ1,σ2 ∈ S : σ1 ≤ σ2 ↔ max J (σ1) ≤ max J (σ2)
Using the latter function we give an alternative definition for the jisr signal:
jisr(h) = ⋁_{k ≤ max J (cs(h))} mca(h)[k].
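The monotonicity property of J stated above can be checked by enumeration. The sketch below reads the fourth column of the table as state ET (the state between IF and EX) and represents states by their names; the helper names are hypothetical.

```python
ORDER = ["idle", "IT", "IF", "ET", "EX"]   # idle < IT < IF < ET < EX

# Interrupt indices processed in each control state.
J = {"idle": {1}, "IT": {2}, "IF": {3, 4}, "ET": {5, 6, 7, 8}, "EX": {9, 10}}


def leq(s1, s2):
    """Order on control states."""
    return ORDER.index(s1) <= ORDER.index(s2)


def monotone():
    """s1 <= s2 holds exactly when max J(s1) <= max J(s2)."""
    return all(leq(a, b) == (max(J[a]) <= max(J[b]))
               for a in ORDER for b in ORDER)
```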
Below we introduce an invariant restricting the bits set in the masked cause signal based on the
current control state.
Invariant 5.
IF ≤ cs(h) → mca(h)[2 : 0] = 0^3 (1)
mop(h)∧EX ≤ cs(h) → mca(h)[8 : 0] = 0^9 (2)
Proof of invariant 5.1. By induction on the number of hardware cycles t. For the base case
(t = 0) there is nothing to show.
IF ≰ cs(0) = idle
For the induction step from t to t + 1 we split cases on the current control state (cs(t) ∈ S).
Moreover, we assume
exec(t)∧/fin(EX, t)
since otherwise we have endex(t) and by lemma 47 there is nothing to show.
IF ≰ cs(t+1) = idle
• IF ≤ cs(t). From the construction we have
mca(t+1)[1 : 0] = 0^2.
For the misalignment on fetch interrupts we derive
mca(t+1)[2] = malf(t+1) (definition)
= malf(t) (construction)
= mca(t)[2] (definition)
and the claim follows from the induction hypothesis.
• IF ≰ cs(t). In this case we consider only
IT(t)∧ IF(t+1)
since otherwise there is nothing to show. From the assumptions we derive
exec(t) ↔ /jisr(t) ∨ cont(t) (definition)
↔ ∀k ≤ max J (IT) : /mca(t)[k] (definition)
↔ mca(t)[2 : 0] = 0^3 (definition)
and the claim follows similarly as in the case above. □
Proof of the second part of invariant 5 is analogous to the proof of the first part, and therefore
is omitted.
6.1.3 Implementation Registers
According to Fig. 28, implementation of the processor core of the sequential machine utilizes
the following implementation (invisible) registers:
• h.core.wI , h.core.wE ∈ B^60: two walk registers, resp. for fetching instructions and
accessing the memory, and
• h.core.I ∈ B^32: an ordinary instruction register.
Obviously, the implementation registers are connected to the walk outputs of the corresponding
MMUs and updated whenever the corresponding translation phase finishes.
h.wY .in = h.mmuY .wout
h.wY .ce = YT(h)∧/h.mmuY .busy
The instruction register, as usual, is connected to the data output of the instruction cache and
updated whenever the instruction fetch phase finishes. As simple as that.
h.I.in =
  h.caI .pdoutH   if h.caI .pa[2]
  h.caI .pdoutL   otherwise
h.I.ce = IF(h)∧/h.caI .mbusy
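The selection of the instruction word can be sketched on integers. The sketch assumes that the cache delivers a 64-bit word, that pdoutH and pdoutL denote its upper and lower 32 bits, and that the index 2 refers to bit 2 of the byte address; none of these details are fixed by the text above, and the function name is hypothetical.

```python
def instruction_in(pdout: int, pa: int) -> int:
    """Select the 32-bit instruction word from a 64-bit cache output.

    Assumption: bit 2 of the address picks the upper (1) or lower (0) half.
    """
    high = (pdout >> 32) & 0xFFFFFFFF   # assumed meaning of pdoutH
    low = pdout & 0xFFFFFFFF            # assumed meaning of pdoutL
    return high if (pa >> 2) & 1 else low
```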
Basically, the walk registers are introduced to have realistic cycle times in hardware, whereas
the instruction register is necessary to support self-modifying code. Without the
instruction register we would have a problem with instructions writing to their own physical memory
addresses.2 Another trivial lemma follows directly from the construction of the control
automaton and the definitions of the update enable signals.
Lemma 49. For states σ ≠ idle and registers Z we have:
σ(h)∧σ(h′) → /h.core.Z.ce
2 Without the instruction register, the instruction data would be coming straight from the instruction
cache. Moreover, the instruction data — cache line containing the instruction — has to be present
in the data cache in order to be modified. Using terminology from [KMP14], modification of the
instruction data in the data cache would be a global write access, whereas fetching the instruction
word from the data cache — a local read access. According to [KMP14], global accesses do not
overlap with local reads except for the starting cycles.
In a design without the instruction register this would lead to a busy instruction cache in all cycles
of the global access except for the starting one. The latter includes the cycle in which the data cache
lowers the busy signal — in which the processor core is updated and the machine switches to the idle
state. Thus, the processor request to a busy instruction cache is lowered, which violates the operating
conditions of the cache memory system designed in [KMP14] (see Sect. 6.2.2).
6.1.4 Connecting Components
Here we collect in a bookkeeping manner the definitions which connect all components of the
machine together. Some of these definitions, such as the ones that provide the MMUs with
information about location of the page tables (pto, npto), are general in a sense that they apply
to other machine types without changes.
Translation Queries
We raise translation request treqI (for the instruction address) if the machine runs in translated
mode. We raise translation request treqE (for the effective address) in translated mode as
well, but only for memory operations.
treqI(h) ≡ /host(h)∧ IT(h)∧ exec(h)
treqE(h) ≡ /host(h)∧ET(h)∧mexec(h)
Having that we formally define the translation request input of mmuY as follows.
h.mmuY .treq = treqY (h)
The dedicated special purpose registers provide to mmuY the memory locations of the guest
and user page tables resp. as given below. For convenience we abbreviate
pto(h) = h.pto.pa
npto(h) = h.npto.pa
and define:
h.mmuY .ptoG = pto(h)
h.mmuY .ptoU = npto(h).
These inputs, as well as the ASID portions of the translation address inputs (below), are the
same for any MMU implementation, because any other definition would contradict
the specification of the TLB steps (see Sect. 3.3.1).
The remaining part of the translation queries — translation address inputs — are defined for
the two MMUs individually. Into mmuI we plug the page address of the actual instruction
address (ia(h)), obviously, prepended with the current ASID. Respectively, into mmuE — the
page address of the actual effective address (ea(h)), prepended in the same way.
h.mmuI .upa = asid(h)◦ ia(h).pa
h.mmuE .upa = asid(h)◦ ea(h).pa
Invalidation Queries
We formalize the invalidation queries in the same way as the translation queries above: it is the
data coming to an MMU through the invalidation inputs if (!) one of the invalidation request
signals is high. We define the invalidation request inputs of mmuY using the machine control
signals introduced above as follows.
h.mmuY .invlpg = finex(EX,h)∧ invlpg(h)∧user(h)
h.mmuY .vmflush = finex(EX,h)∧ flusht(h)∧guest(h)
h.mmuY .flush = finex(EX,h)∧ flusht(h)∧host(h)
As one can see, the same invalidation requests are raised for both MMUs. Universal page
addresses which are the subject of these requests are formally defined below.
h.mmuY .inva.as =
  vmid(h)◦A(h)[27 : 20]   if guest(h)
  A(h)[31 : 20]           otherwise
h.mmuY .inva.pa = B(h)[31 : 12]
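The construction of the invalidation address amounts to bit slicing and concatenation. The following sketch assumes 32-bit register values A and B and a 4-bit vmid (so that both cases yield a 12-bit address space identifier); the width of vmid and the helper names `bits` and `inva` are assumptions made for illustration.

```python
def bits(x: int, hi: int, lo: int) -> int:
    """Return the bit field x[hi:lo] as an integer."""
    return (x >> lo) & ((1 << (hi - lo + 1)) - 1)


def inva(a: int, b: int, vmid: int, guest: bool):
    """Return (inva.as, inva.pa) for register values A = a and B = b."""
    if guest:
        # vmid concatenated with A[27:20]; vmid assumed to be 4 bits wide
        as_ = ((vmid & 0xF) << 8) | bits(a, 27, 20)
    else:
        as_ = bits(a, 31, 20)
    pa = bits(b, 31, 12)
    return as_, pa
```

With A = 0xABC00000 and B = 0x12345678, the guest case with vmid = 7 yields the pair (0x7BC, 0x12345), while the non-guest case yields (0xABC, 0x12345).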
Fig. 30: Wiring of the special inputs of the SPR
SPR Environment
Wiring of the special purpose registers does not change much. We only need to
i) modify the input signals for registers h.eca and h.edata,
ii) modify the special input and clock enable signal for register h.mode, and
iii) provide the special input and clock enable signal for register h.nmode.
For registers h.pto and h.npto we do not change anything: still write them both manually, still
via the move instructions. Register h.enmode is wired as an ordinary exception register, i.e., to
backup the value of h.nmode on jisr.
Connection of the special input signals is depicted in Fig. 30. For the special purpose registers
we follow the specifications from Sect. 3.3.7, which are pretty straightforward. Thus, the
exception cause and the exception data registers are connected resp. to the (hardware) interrupt
level and the effective address masked outside control states ET and EX.
h.eca.in = 0^21 ◦ f1(mca(h))
h.edata.in = ea(h)∧ (ET ∨EX)(h)
For both mode registers
xmode ∈ {mode,nmode}
we connect the inputs as follows.
h.xmode.in =
  h.xmode ∧ (1^31 ◦ 0)   if jisr(h)
  h.exmode               otherwise
In this way the least significant bit of the mode register is cleared on jisr, whereas all bits are
restored from the corresponding exception register on eret. The special inputs are written into
the mode registers under control of the special clock enable signals raised in accordance with
the specification.
h.mode.ce = jisr(h)∧ (user(h)∨ icpt(h)) ∨ eret(h)∧host(h)
h.nmode.ce = jisr(h)∧ (user(h)∧ icpt(h)) ∨ eret(h)∧host(h)
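The wiring of the mode registers can be sketched with 32-bit integers and booleans. The function names below are hypothetical; the formulas follow the definitions above, with the mask read as thirty-one ones followed by a zero.

```python
CLEAR_LSB = 0xFFFFFFFE   # 1^31 followed by 0


def xmode_in(xmode: int, exmode: int, jisr: bool) -> int:
    """Special input of a mode register: clear bit 0 on jisr,
    otherwise restore from the corresponding exception register."""
    return xmode & CLEAR_LSB if jisr else exmode


def mode_ce(jisr, eret, user, host, icpt) -> bool:
    """Clock enable of h.mode."""
    return (jisr and (user or icpt)) or (eret and host)


def nmode_ce(jisr, eret, user, host, icpt) -> bool:
    """Clock enable of h.nmode."""
    return (jisr and (user and icpt)) or (eret and host)
```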
6.1.5 Execution Levels and Intercepts
Before we proceed, for every level of privilege we first argue about what instructions are pos-
sible/allowed (do not cause the illegal interrupt), and clearly we focus on execution of the
translation and invalidation queries.
Level of Host
At the level of host the address translation is completely disabled. There are no translation
queries to the MMUs since no translation requests are raised at this level (see Sect. 6.1.4).
host(h) → /treqI(h)∧/treqE(h) (7)
Thus, regular instructions (not invalidating the TLB cache) are executed in exactly the same
way as in the sequential machine without address translation [PBLS16].
At the same time, instructions which invalidate translations act only on the TLB component
and essentially are noops for the processor core. Both invalidating instructions are allowed at
the level of host.
host(h)∧ (invlpg(h)∨ flusht(h)) → /ill(h) (8)
Next we consider executions at the levels of guest and user in a similar way.
Level of Guest
At this (intermediate) level the address translation is enabled partially: only queries which
request translations for addresses of the running guest are allowed.
guest(h)∧ treqY (h) → upaY (h) ∈ AG(vmid(h)) (9)
Analyzing the control logic of the nested MMU one easily derives from the latter restriction
that the address translation at the level of guest is performed exclusively by the simple MMU.
Portions of this component which implement the translation scheme — the control automa-
ton and the walking unit — were incorporated from the machine with the ordinary address
translation [LOP].
Both invalidating instructions are allowed at this level, but, in contrast to the level of host,
the invlpg instruction is restricted to invalidate only translations that were accumulated by the
running guest.
guest(h)∧ (invlpg(h)∨ flusht(h)) → /ill(h)∧ invaY (h) ∈ AG(vmid(h)) (10)
Note, at the level of guest the flusht instruction is interpreted as the vmflush (see Sect. 3.3.6).
Level of User
At the level of user the address translation is enabled for both the user and the guest addresses.
At this level the translation queries are restricted — by the specification — to addresses of the
running user and addresses of the guest under which the user is running.
user(h)∧ treqY (h) → upaY (h) ∈ AU (vmid(h), prid(h)) (11)
For the guest addresses everything is simple: translation is performed exactly as in the case
above (for the level of guest), utilizing exactly the same hardware components. The translation
queries for the guest addresses arise in the process of the nested translation — translation of
the user addresses. This process was semi-formally described in Sect. 3.2.1, formally specified
— as an iterative process with restricted steps — in Sect. 3.3.1, and implemented in hardware
in Sect. 4.2.2. The fact that our hardware implementation does not violate the restrictions
imposed by the specification we formally proved in Sect. 5.2.
At the level of user invalidating instructions are forbidden in any form.
user(h)∧ (invlpg(h)∨ flusht(h)) → ill(h) (12)
Triggering Intercepts
Finally, we proceed to specify computation of the intercept signal. According to the specifica-
tion in Sect. 3.3.5, the intercepts are triggered only at the level of user. Recall, these intercepts
were introduced in order to make the page faults that occur in the process of nested translation
(due to translations of intermediate guest addresses) completely invisible at the level of guest.
Thus, given that the machine resides in control state
σ ∈ {IF,ET,EX}
we activate the intercept signal in case translation h.wI for the instruction address indicates a
guest level fault. Analogously, given that the machine resides in control state
σ ∈ {EX}
and a memory operation is performed, we activate the intercept signal if translation h.wE for
the effective address indicates a guest level fault. Combining the latter two causes, we clearly
obtain the following.
icpt(h) = user(h) ∧ (IF ≤ cs(h)∧h.wI . fg ∨
EX ≤ cs(h)∧h.wE . fg∧mop(h))
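As a minimal sketch, the intercept signal can be written as a boolean function. Control states are encoded as integers in the order idle < IT < IF < ET < EX, and the parameters `wi_fg` and `we_fg` are hypothetical names for the guest fault bits h.wI.fg and h.wE.fg.

```python
IDLE, IT, IF, ET, EX = range(5)   # control states in their natural order


def icpt(user: bool, cs: int, wi_fg: bool, we_fg: bool, mop: bool) -> bool:
    """Intercept: a guest level fault of the instruction translation from
    state IF on, or of the effective-address translation in state EX
    for memory operations; raised only at the level of user."""
    return user and ((IF <= cs and wi_fg) or (EX <= cs and we_fg and mop))
```

In state ET a fault recorded in h.wE does not yet trigger the intercept, since the second disjunct requires the machine to have reached state EX.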
This finishes construction of our sequential processor. In the next section we interconnect it
with a cache memory system, which will complete the sequential implementation.
6.2 Cache Memory System in Sequential Processor
In this chapter we consider a single-core sequential processor (as constructed above) connected
to a sequentially consistent cache memory system (as constructed in [KMP14]) with four
caches: two caches for translation of the instruction and effective addresses and two resp. for
the instruction fetch and the data access.
6.2.1 Connections to Caches
There are four caches available to the processor core and two MMUs for accesses.³
h.caY = h.ca(2·1{Y=E} + 1)
h.caYT = h.ca(2·1{Y=E})
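To make the numbering of the four caches explicit, the following Python sketch evaluates the two index expressions. The function names and the string encoding of ports ("I", "E") are our assumptions; only the arithmetic is taken from the definitions above.

```python
def indicator(a: bool) -> int:
    """Indicator function 1{a}: 1 if a holds, 0 otherwise."""
    return 1 if a else 0


def cache_index(port: str, translation: bool) -> int:
    """Index i of cache ca(i) for port Y in {"I", "E"}.

    translation=True selects the translation cache caYT,
    translation=False selects the cache caY used by the processor core.
    """
    if port not in ("I", "E"):
        raise ValueError("port must be 'I' or 'E'")
    base = 2 * indicator(port == "E")
    return base if translation else base + 1
```

Under this reading caIT, caI, caET and caE are caches 0, 1, 2 and 3 resp.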
We raise the processor requests to these caches as in [LOP]: the processor request to cache caY
is, obviously, raised by the processor core, while the request to cache caY T is raised by mmuY .
The requests to caches caI and caE are defined in the obvious way.
h.caI .preq = IF(h)∧ exec(h)
h.caE .preq = EX(h)∧mexec(h)
h.caY T .preq = h.mmuY .mreq
Since in the scope of this chapter the address translation remains our main interest, we proceed
to specify the processor addresses for the memory accesses. To define the processor address at
port Y , we introduce the physical memory addresses (pmaY ):
(Footnote 3: Note, the indicator function used above is defined for a ∈ B simply as follows: 1{a} = 1 if a holds, and 1{a} = 0 otherwise.)
pmaI(h) = pmaI(h).pa◦ ia(h).po
pmaE(h) = pmaE(h).pa◦ ea(h).po
where the page address of the physical memory address at port Y is
pmaY(h).pa =
  h.mmuY.pa          if host(h)
  h.mmuY.wout.ba     otherwise.
Thus, the physical memory address for port I is either the output of mmuI followed by the page
offset of ia(h) or simply ia(h), depending on whether the machine runs in the translated mode
or not resp. Transferring this onto port E: pmaE is either the output of mmuE followed by the
page offset of ea(h) or simply ea(h). The processor address provided to cache caY T is coming
from mmuY (see Fig. 28).
h.caY .pa = pmaY (h).l
h.caY T .pa = h.mmuY .ma.l
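The composition of the physical memory address can be illustrated by a short Python sketch. It rests on assumptions that are not fixed in this section: 32-bit addresses split into a 20-bit page address and a 12-bit page offset, and line addresses obtained by dropping the 3 low-order bits. The function and parameter names are ours.

```python
PAGE_OFFSET_BITS = 12  # assumption: 4 KiB pages
LINE_OFFSET_BITS = 3   # assumption: 8-byte memory lines


def pma(host: bool, mmu_pa: int, mmu_wout_ba: int, addr: int) -> int:
    """Physical memory address at one port.

    host        -- True if the machine runs untranslated
    mmu_pa      -- page address passed through the MMU (used in host mode)
    mmu_wout_ba -- base address of the walk output by the MMU (translated mode)
    addr        -- ia(h) for port I or ea(h) for port E
    """
    page = mmu_pa if host else mmu_wout_ba
    page_offset = addr & ((1 << PAGE_OFFSET_BITS) - 1)
    # Concatenate page address and page offset.
    return (page << PAGE_OFFSET_BITS) | page_offset


def line_address(addr: int) -> int:
    """Line address (.l) under the assumed line width."""
    return addr >> LINE_OFFSET_BITS
```

In host mode the sketch returns the original address whenever mmu_pa equals the page address of addr, which matches the informal reading given above.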
The remaining parts of the memory accesses are obvious and, moreover, the same as defined
in [LOP]. Thus, for the access types we simply have the following.
h.caI .(pr, pw, pcas) = 100
h.caE .(pr, pw, pcas) = l(h)◦ s(h)◦ cas(h)
h.caYT .(pr, pw, pcas) = 100
Data for the memory accesses at port E are provided in accordance with the ISA.
h.caE .pbw = bw(h)
h.caE .pdin = dmin(h)
h.caE .pcdin = cdata(h)
6.2.2 Stability of Inputs to Caches
In order to satisfy the operating conditions of the cache memory system, we have to guarantee
for all ports that all inputs of an ongoing memory access stay constant in the digital sense until
the processing of that access finishes [LOP]. For the machine from [KMP14] this property
was reflected in lemma 9.11 for the instruction and the data cache. Below we establish a
counterpart of the latter lemma for our design with four caches.
Lemma 50.
• For any cache:
ht .ca(i).mbusy → ht .ca(i).preq (13)
• For the translation caches:
ht .caYT .mbusy → ht+1.mmuY .(ma,mreq) = ht .mmuY .(ma,mreq) (14)
• For the instruction and data caches:
ht .caY .mbusy → ∀Z : /ht .core.Z.ce (15)
Proof of lemma 50. For cache i we argue by induction on the number of hardware cycles t. For
the base case (t = 0) there is nothing to show.
h0.ca(i).mbusy = 0
For the induction step from t to t + 1 we first show equations 14 and 15 using the induction
hypothesis (equation 13).
From the construction of the nested MMU (see Sect. 4.2.2), for the instruction translation and
the data translation cache we know that the processor address is taken directly from the internal
walk registers, which are not updated in cycles when the memory is busy. The processor
request signal from MMU is kept stable by the nested control automaton (see Sect. 4.2.2). For
the instruction cache we argue as follows.
ht .caI .mbusy→ ht .caI .preq (equation 13)
→ IF(t)∧ exec(t) (definition) (16)
→ IF(t)∧ IF(t+1) (interconnect) (17)
→ ∀Z : /ht .core.Z.ce (lemma 49)
The corresponding arguments for the data cache are analogous, and therefore are omitted. To
complete the induction step we proceed as follows. We assume
ht+1.ca(i).mbusy = 1,
since otherwise there is nothing to show. Next we split cases of whether cache i is busy in cycle
t.
• In case cache i is busy
ht .ca(i).mbusy = 1,
the claim follows directly from the lines above. For the instruction translation cache we
derive:
ht .caIT .mbusy → ht .caIT .preq (equation 13)
→ ht .mmuI .mreq (interconnect)
→ ht+1.mmuI .mreq (equation 14)
→ ht+1.caIT .preq (interconnect).
In turn, for the instruction cache we simply have:
ht .caI .mbusy → IF(t+1)∧ exec(t) (equations 17, 16)
→ IF(t+1)∧ exec(t+1) (equation 15)
→ ht+1.caI .preq (definition).
The corresponding arguments for the data translation cache and the data cache are analo-
gous, and therefore are omitted.
• Otherwise, in case cache i is not busy
ht .ca(i).mbusy = 0,
from the construction of caches [KMP14] we know that the master automaton of cache i
resides and remains in its idle state.
idle(i)t ∧ idle(i)t+1
Thus, the only way to become busy in cycle t+1 is to receive a processor request
ht+1.ca(i).preq = 1
which gives the claim. □
6.2.3 Accesses of Hardware Computation
Recall from [KMP14], the cache memory system used is accessed by accesses from the so-
called multi-port access sequence
acc : [0 : 4p−1]×N→ Kacc
where p denotes the number of processors (in the scope of this chapter p = 1). Also recall that
the latter sequence is defined with the help of auxiliary function e(i,k), which gives the ending
cycle of access (i,k). Thus, for non-flushing access k to the instruction cache we specify:
acc(1,k).a = pmaI(e(1,k)).l
acc(1,k).type = 1000.
For non-flushing access k to the data cache ending in cycle t = e(3,k) we have:
acc(3,k).a = pmaE(t).l
acc(3,k).data = dmin(t)
acc(3,k).cdata = cdata(t)
acc(3,k).bw = bw(t)
acc(3,k).type = l(t)◦ s(t)◦ cas(t)◦0.
Finally, for non-flushing access k to translation cache caYT ending in cycle
t = e(2·1{Y=E}, k)
we specify:
acc(2·1{Y=E}, k).a = mmu^t_Y.ma.l
acc(2·1{Y=E}, k).type = 1000.
All accesses ending in cycle t are collected into the set
E(t) = {(i,k) | e(i,k) = t}.
Note, the set above also includes accesses generated by the cache memory system internally
(flushes). For convenience we abbreviate the number of ending accesses by
ne(t) = #E(t).
As before, the total number of accesses ending before cycle t is given by function NE.
NE(0) = 0
NE(t+1) = NE(t)+ne(t)
Since we use a sequentially consistent memory system (from [KMP14]), we know there is a
sequential order
seq : [0 : 4p−1]×N→ N
in which the accesses are performed. Therefore, every access performed in cycle t gets some
sequential number
∀a ∈ E(t) ∃n ∈ [0 : ne(t)−1] : seq(a) = NE(t)+n
in the single-port access sequence acc′. Recall, by definition of sequence acc′ we have
acc′[seq(a)] = acc(a).
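The bookkeeping around E(t), ne(t), NE(t) and seq can be sketched in Python as follows. The representation is our assumption: ending[t] is the set E(t) of accesses (i,k) ending in cycle t, and seq is a dictionary from accesses to sequential numbers. The check is slightly stronger than the statement above, since it additionally requires that the accesses ending in cycle t occupy the numbers NE(t), ..., NE(t)+ne(t)−1 without repetition.

```python
from itertools import accumulate


def ne_prefix(ne):
    """Return [NE(0), ..., NE(len(ne))] with NE(0) = 0 and NE(t+1) = NE(t) + ne(t)."""
    return list(accumulate(ne, initial=0))


def check_seq(ending, seq):
    """Check that accesses ending in cycle t are numbered NE(t) .. NE(t)+ne(t)-1."""
    NE = ne_prefix([len(E) for E in ending])
    for t, E in enumerate(ending):
        numbers = sorted(seq[a] for a in E)
        if numbers != list(range(NE[t], NE[t] + len(E))):
            return False
    return True
```

For example, with one access ending in cycle 0 and two accesses ending in cycle 2, the latter two must receive the numbers 1 and 2 in some order.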
Using sequence acc′ we proceed to formulate an important property of the memory abstraction.
Namely, part 3 of lemma 8.65 in [KMP14], which we require later in the correctness proof.
Lemma 51 (relating hardware with atomic protocol).
m(h^t) = Δ_M^NE(t)(m(h^0), acc′)
In order to argue about the answers to read and CAS accesses, we would also require
lemma 8.67 of [KMP14]. Again for convenience we repeat the statement.
Lemma 52 (sequential consistency). Let e(i,k) = t and acc(i,k).r or acc(i,k).cas. Then
pdout(i)^t = Δ_M^seq(i,k)(m(h^0), acc′)(acc(i,k).a).
6.2.4 Relating Endings of Accesses with Hardware Control Signals
In this section we repeat the arguments made in Sect. [9.3.5] of [KMP14] about accesses to
the cache memory system performed by the processor. As before, now the accesses are also
performed by the MMUs. In the first lemma below we reflect that every walk extension per-
formed by the MMU is always accompanied by a read access ending in the cache utilized by
that MMU.
Lemma 53.
wext(mmu^t_I) → ∃k : e(0,k) = t ∧ acc(0,k).r (1)
wext(mmu^t_E) → ∃k : e(2,k) = t ∧ acc(2,k).r (2)
In the second lemma we show the opposite: every read access ending in the cache utilized by
the MMU triggers a walk extension step on that MMU.
Lemma 54.
e(0,k) = t ∧ acc(0,k).r → ( mmu^t_I.treq → wext(mmu^t_I) ) (1)
e(2,k) = t ∧ acc(2,k).r → ( mmu^t_E.treq → wext(mmu^t_E) ) (2)
Since the presented arguments remain trivial, we omit the proofs. Lemma 55 below is the
counterpart of lemmas [9.12] and [9.13] resp. from [KMP14] about accesses performed by the
processor core. The latter lemma is formulated to fit the sequential processor utilizing four
caches. Thus, every time state IF finishes, a read access in the instruction cache ends and vice
versa. The same holds for state EX and the data cache on execution of memory operations.
Lemma 55.
finex(IF, t) ↔ ∃k : e(1,k) = t ∧acc(1,k).r (1)
mop(t)∧finex(EX, t) ↔ ∃k : e(3,k) = t ∧acc(3,k). f (2)
Proofs of the last three lemmas are straightforward, and therefore are omitted.
6.3 Liveness
In this section we show that the sequential machine constructed in Sects. 6.1 and 6.2 is live.
In the end of Sect. 6.3.1 we show that every ISA instruction is eventually executed by the
sequential processor (lemma 60). Then, in Sect. 6.3.2, we show the following: if a given control
state is visited on execution of a given instruction, it is visited exactly once (equation 19).
6.3.1 Liveness of Control States
In the first lemma proven in this section we show that the control always eventually leaves the
current control state.
Lemma 56.
busy(cs(t), t) → ∃t ′ > t : /busy(cs(t), t ′)
Proof of lemma 56. By case split on the current control state (cs(t) ∈ S):
• cs(t) = idle. For state idle there is nothing to show, since by definition we have
busy(idle, t) = 0.
• cs(t) = IT . In this case we argue using liveness of the nested MMU as follows:
busy(IT, t) ↔ mmu^t_I.busy (definition)
→ ∃t′ > t : /mmu^t′_I.busy (lemmas 29–31)
↔ /busy(IT, t′) (definition).
• cs(t) = IF. In state IF we rely on liveness of the instruction cache:
busy(IF, t)↔ ht .caI .mbusy (definition)
→ ht .caI .preq (lemma 50)
→ ∃t ′ > t : /ht .caI .mbusy (lemma 8.68 of [KMP14])
↔ /busy(IF, t ′) (definition).
• cs(t) ∈ {ET,EX}. The arguments for states ET and EX are analogous to those presented
above for states IT and IF resp., and therefore are omitted. □
For convenience, we rewrite the latter result as follows.
Lemma 57.
/fin(cs(t), t) → ∃t ′ > t : fin(cs(t), t ′)
Proof of lemma 57. Unfolding the definition we obtain:
/fin(cs(t), t) ↔ (cs(t) ≠ cs(t)) ∨ busy(cs(t), t)
↔ busy(cs(t), t).
In case idle is the current control state, there is nothing to show.
busy(idle, t) = 0
Otherwise, we argue as follows. From lemma 56, for some cycle t ′ > t we have
busy(cs(t), t ′) = 0.
For minimum such cycle t ′ we clearly have
∀t̃ ∈ [t : t′−1] : busy(cs(t), t̃),
and therefore
∀t̃ ∈ [t : t′] : cs(t̃) = cs(t). □
Using the lemma above we proceed to show the following: the control always eventually leaves
every control state except EX, either to state idle (on core steps) or to the next control state.
Note, the control leaves state EX only to state idle (see Fig. 29).
Lemma 58.
cs(t) ≠ EX → ∃t′ ≥ t : cstep(t′) ∨ cs(t′) > cs(t)
Proof of lemma 58. From lemma 57, for some cycle t ′ ≥ t we have
fin(cs(t), t ′).
We assume exec(t ′), since otherwise the claim immediately follows for cycle t ′.
/exec(t ′) → cstep(t ′)
Therefore, we have
finex(cs(t), t ′)
which by lemma 48 implies
cs(t ′+1)> cs(t ′) = cs(t)
and the claim follows for cycle t′′ = t′+1. □
From the latter lemma one easily concludes the following result.
cs(t) ≠ EX → ∃t′ ≥ t : cstep(t′) ∨ (cs(t′) = EX) (18)
Next we argue that the processor core is stepped infinitely often.
Lemma 59.
/cstep(t) → ∃t ′ > t : cstep(t ′)
Proof of lemma 59. By case split on the current control state (cs(t) ∈ S):
• cs(t) = EX. State EX is busy in cycle t, since otherwise we obtain a contradiction.
fin(EX, t) → cstep(t)
From lemma 57, for some cycle t ′ > t we have
fin(EX, t′) = 1
and the claim follows.
• cs(t) ≠ EX. Using equation 18, for some cycle t′ > t we conclude
cstep(t ′)∨ (cs(t ′) = EX).
For cstep(t ′) there is nothing to show. Otherwise, for cycle t ′ we conclude
/cstep(t ′)∧ (cs(t ′) = EX)
and the claim follows exactly as in the case above, for some cycle t′′ > t′. □
Finally, we show that the sequential implementation is live, i.e., we show that every ISA in-
struction is eventually executed.
Lemma 60.
∀i ∃t : i = I(t)∧ cstep(t)
Proof of lemma 60. By induction on index i of the current instruction (I(t)). For the induction
base (i = 0) recall that immediately after reset we have by definition
I(0) = 0.
From lemma 59 we know the processor core is stepped infinitely often. Consider the first such
cycle t ′ > 0 in which the processor core performs a step. Clearly, we have
(∀t ∈ [0 : t ′−1] : /cstep(t))∧ cstep(t ′).
For the scheduling function we conclude by definition
I(t ′) = I(0) = 0.
For the induction step from i to i+ 1 we argue as follows. From the induction hypothesis we
know there is a cycle t such that
i = I(t)∧ cstep(t).
By definition we immediately obtain
I(t+1) = i+1.
Applying lemma 59 we know there is a cycle t ′ ≥ t+1:
cstep(t) ∧ (∀t̃ ∈ [t+1 : t′−1] : /cstep(t̃)) ∧ cstep(t′).
To complete the induction step we argue using the definition:
I(t′) = I(t+1) = i+1. □
6.3.2 Uniqueness of Finish Cycles
In Sect. 7.1.4 we must define the oracle inputs for the current instruction in cycles in which the
current instruction progresses through the visited control states. Therefore, we first must show
that the latter cycles are unique. This last section is spent to derive that result. Below we argue
that i) only the current control state can be busy and ii) a busy state guarantees the exec signal
is active.
Lemma 61. For any control state σ ∈ S:
busy(σ , t) → (cs(t) = σ)∧ exec(t)
Proof of lemma 61. By induction on the number of hardware cycles t. For the base case (t = 0)
there is nothing to show.
busy(σ ,0) = 0
For the induction step from t to t + 1 we argue as follows. In case state σ becomes busy in
cycle t+1
busy(σ , t) = 0,
we split cases on the value of σ ∈ S:
• σ = idle. For state idle there is nothing to show, since by definition we have
busy(idle, t+1) = 0.
• σ = YT . From the construction of the control automaton of the nested MMU (Sect. 4.2.2),
we immediately conclude
ht+1.mmuY .treq = 1
and the claim follows from the hardware interconnect.
• σ ∈ {IF,EX}. The claim follows analogous to the case above, from the first part of
lemma 50 (equation 13).
Otherwise, if state σ remains busy in cycle t
busy(σ , t) = 1,
from the induction hypothesis we have
cs(t) = σ ≠ idle
and from lemma 46 we conclude
cs(t+1) = cs(t).
Moreover, using lemma 49 we obtain
ht .core.Z.ce = 0,
and therefore
mca(t+1)[10 : 2] = mca(t)[10 : 2].
From the construction we also have
mca(t+1)[1 : 0] = 02 = mca(t)[1 : 0].
Thus, we conclude
exec(t+1) = exec(t)
and the claim follows from the induction hypothesis. □
Using the result above we proceed to show the following: if a given control state is visited on
execution of a given instruction, it necessarily finishes on execution of that instruction.
Lemma 62. For states σ ∈ S the following holds:
∀i : (∀t : i = I(t) → σ ≠ cs(t)) ∨ (∃t : i = I(t) ∧ fin(σ, t))
Proof of lemma 62. Equivalently, we proceed to show the following statement.
∃t : I(t) = i∧ cs(t) = σ → ∃t ′ : I(t ′) = i∧fin(σ , t ′)
From lemma 57, for some cycle t ′ ≥ t we have
fin(σ , t ′).
For the minimum such cycle t ′ we clearly have
∀t̃ ∈ [t : t′−1] : /fin(σ, t̃)
and therefore
∀t̃ ∈ [t : t′] : cs(t̃) = cs(t) = σ.
Moreover, from the definition of finish cycles we derive
/fin(σ, t̃) ↔ (cs(t̃) ≠ σ) ∨ busy(σ, t̃)
↔ busy(σ, t̃)
which by lemma 61 implies
∀t̃ ∈ [t : t′−1] : exec(t̃)
and therefore
∀t̃ ∈ [t : t′−1] : /cstep(t̃).
Therefore, for the scheduling function we conclude
∀t̃ ∈ [t : t′] : I(t̃) = I(t) = i
which completes the proof. □
Finally, we strengthen the result above as follows: if a given control state is visited on execution
of a given instruction, it finishes exactly once on execution of that instruction.
∀i ∀σ : (∀t : i = I(t) → σ ≠ cs(t)) ∨ (∃! t : i = I(t) ∧ fin(σ, t)) (19)
Proof of equation 19. From lemma 62 for states σ ∈ S we know
∃t : I(t) = i∧ cs(t) = σ → ∃t ′ : I(t ′) = i∧fin(σ , t ′).
Assume state σ ∈ S finishes in cycles t and t ′ > t
fin(σ , t)∧fin(σ , t ′)
with instruction
I(t) = I(t ′).
Next we split cases on the current control state (in cycle t):
• σ = EX. In state EX we conclude cstep(t), and derive a contradiction.
I(t ′) ≥ I(t+1) (monotonicity)
= I(t)+1 (definition)
= I(t ′)+1 (assumption)
• σ ≠ EX. In this case we assume /cstep(t), since otherwise the claim follows as above.
From the construction of the control automaton (from lemma 48), we clearly have
cs(t+1)> cs(t) = cs(t ′)
which implies
∃t̃ ∈ [t+1 : t′−1] : cstep(t̃)
since otherwise, again by construction of the control automaton (by lemma 47), we obtain
a contradiction.
cs(t+1)≤ cs(t ′)
Still, the contradiction follows exactly as in the case above.
I(t′) ≥ I(t̃+1) (monotonicity)
= I(t̃)+1 (definition)
≥ I(t)+1 (monotonicity)
= I(t′)+1 (assumption) □
The latter uniqueness result completes the implementation of the sequential machine. Correct-
ness of the sequential implementation is proven in the next chapter.

7 Correctness of Sequential Implementation
In order to prove that the hardware construction of the sequential machine with NAT from
Chap. 6 is correct, i.e., it implements the ISA from Sect. 3.3, in Sect. 7.1 we state a simple
(simulation) theorem. After developing some more auxiliary machinery in Sect. 7.2, we
proceed to show most of the correctness arguments constituting the induction step (correctness
for registers) in Sect. 7.3. In Sect. 7.4 we show that our machine construction fulfills the
correctness conditions on the MMU interfaces formulated in Sect. 3.4. This allows us to extend
results on MMU hardware correctness established for the general semantics in Chap. 5 to a
‘proper’ simulation relation (in the original semantics). Finally, in Sect. 7.5 we complete the
induction step by showing that the guard conditions are obeyed in the computations performed
by the sequential implementation.
7.1 Correctness Statement
Below we structure the arguments as follows. In Sect. 7.1.1 in the obvious way we specify the
hardware configurations in which steps of the ISA computation are generated. This allows us
to formulate the simulation theorem in Sect. 7.1.3. In Sect. 7.1.4 the definition of the stepping
function is completed. Note, the software conditions for the sequential implementation are
discussed in Sect. 7.1.2.
7.1.1 Stepping of Components
In this small section we cover definitions that concern the specification steps produced by the
sequential machine. Basically the stepping stays the same as it was defined in [LOP]. Thus,
the processor core step is tied to the end of the instruction execution.
cstep(h) = endex(h)
Note, a processor core step is performed on two occasions: i) execution of the current in-
struction uninterrupted (exec) and ii) jump to the interrupt service routine due to an unmasked
interrupt of a non-continue type (/exec). In the first case the machine has to reside in control
state EX and the data cache must not be busy.
Recall, the definition of tadd from Sect. 5.2 which specifies the total number of translation
steps made in the given hardware configuration for the given MMU. For mmuY we introduce
tstepY (h) = tadd(h.mmuY )
and instantiate the number of steps function for our sequential machine.
ns(h) = tstepI(h)+ tstepE(h)+ cstep(h)
Analyzing the definition above we argue that the total number of steps performed by the se-
quential machine in the given hardware configuration is at most one.
Lemma 63.
ns(h)≤ 1
In order to define the stepping function — specify the steps performed in every hardware
configuration — we first need to enumerate the steps globally. In turn this global ordering is
defined recursively on the number of hardware cycles t. In all predicates and functions p above
we switch from “current” configurations h to configurations ht (configurations h in hardware
cycle t), which for convenience we abbreviate:
p(t) ≡ p(ht).
Now the “global” number of steps function (NS(t)) is composed as usual: from the numbers of
steps performed in every single hardware cycle (ns(t)).
NS(0) = 0
NS(t+1) = NS(t)+ns(t)
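The step counting above can be sketched in Python. The encoding is our assumption: each hardware cycle is represented by a triple of booleans (tstepI, tstepE, cstep), and a violation of lemma 63 is reported as an error instead of being proven impossible.

```python
def ns(tstep_i: bool, tstep_e: bool, cstep: bool) -> int:
    """Number of specification steps generated in one hardware configuration."""
    return int(tstep_i) + int(tstep_e) + int(cstep)


def global_steps(trace):
    """Return [NS(0), ..., NS(len(trace))] for a list of (tstepI, tstepE, cstep)."""
    NS = [0]
    for tstep_i, tstep_e, cstep in trace:
        n = ns(tstep_i, tstep_e, cstep)
        if n > 1:
            # Lemma 63 states that this cannot happen in the sequential machine.
            raise ValueError("more than one step in a single cycle")
        NS.append(NS[-1] + n)
    return NS
```

Since at most one step is generated per cycle, NS(t) directly names the global step produced in cycle t whenever ns(t) = 1.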
7.1.2 Software Conditions
Of the software conditions listed in Sect. 2.2.2, only the one forbidding store or CAS
operations on ROM remains. Below we formulate the latter condition to fit the machine description.
SC(i) ≡ write^i_σ → ⟨pma^i_Eσ.l⟩ ≥ 2^r (20)
The other software condition forbidding the misaligned memory accesses is discarded. Align-
ment of the memory accesses is tested by hardware, and if violated, one of the misalignment
interrupts is raised.
7.1.3 Simulation Theorem
We present the result as usual — in the form of a simulation theorem. This section is used
to state the latter theorem. The statement of the theorem repeats the corresponding statement
for the sequential machine without the address translation. Formally we claim: there exists an
initial ISA configuration c0 s.t. for all hardware cycles t the simulation relation — as defined
later in this section — between configurations ht of hardware and cNS(t) of ISA holds and the
guard conditions — as defined later in this section — are satisfied for all steps before NS(t):
∃c^0 ∀t : sim(h^t, c^NS(t)) ∧ Γ^NS(t)
where
h^(t+1) = δHW(h^t)
c^(n+1) = δISA(c^n, s(n)).
Simulation Relation
Below we formulate the simulation relation for the sequential machine with nested MMUs. It
is quite obvious to define: we say that the simulation relation (sim) for the entire machine holds
iff
• both hardware TLBs, for instruction addresses and for the effective address,
are simulated by the software TLB, and
• the hardware core and the cache memory system are simulated resp. by the specification
core and the specification memory.
Formally, for the hardware configuration h and the ISA configuration c we define
sim(h,c) ≡ simp+m(h,c)∧ simtlb×2(h,c)
where the individual simulation relations prescribe the simulation of the processor and memory
(simp+m) and the TLBs (simtlb×2) resp. The former one is taken literally from [LOP]; we
provide the definition for completeness of presentation.
simp+m(h,c) ≡ c.(pc, dpc, ddpc) = h.(pc, dpc, ddpc) ∧
c.(gpr, spr) = h.(gpr, spr) ∧
ℓ(c.m) = m(h)
The latter one obviously comprises two parts: one for simulation of the TLB for instruction
addresses, and one for simulation of the TLB for effective addresses.
simtlb×2(h,c) ≡ simtlb(h.mmuI ,c)∧ simtlb(h.mmuE ,c)
Recall, simulation relation simtlb between the TLB of hardware configuration h.mmu and the
TLB of ISA configuration c was defined in Sect. 5.2.2. Note that we formulated the
latter relation in Chap. 5 to show correctness of the nested MMU implementation w.r.t. the
general semantics, defined in Sect. 3.4. In this chapter we use this relation unchanged to show
correctness of the nested MMU implementation w.r.t. the ordinary semantics.
Guard Conditions
In the obvious way we formalize the guard conditions for the sequential machine with NAT.
Using the machinery from Sect. 3.3, we collect into a single predicate Γ all guard conditions
for the computational step performed in ISA configuration c under oracle input σ ∈ Σ :
Γ(c,σ) ≡
  Φ(c,σ)   if σ ∈ Σcore
  T(c,σ)   if σ ∈ Σtlb.
In case the guard conditions hold for all ISA steps before global step n, we define
Γ^n ≡ ∀m < n : Γ(c^m, s(m)).
7.1.4 Stepping Inputs
The stepping function is defined for every global step number (NS(t)) by a case split — on the
type of step — as follows.
• In case a translation step is performed and the TLB for instruction addresses initiates this
step, we essentially use the definition from [LOP], very slightly adjusted to specifics of the
sequential machine.
tstepI_G(t) → s(NS(t)) =
  (winit, upaG(mmu^t_I).pa, guest)     if winitG(mmu^t_I)
  (wext, mmu^t_I.wG, ⊥)                otherwise
tstepI_U(t) → s(NS(t)) =
  (winit, upaU(mmu^t_I).pa, user)      if winitU(mmu^t_I)
  (wext, mmu^t_I.wU, mmu^t_I.wG)       otherwise
• If a translation step is performed and the TLB for effective addresses initiates this step, we
simply follow the same intuition as in the case above.
tstepE_G(t) → s(NS(t)) =
  (winit, upaG(mmu^t_E).pa, guest)     if winitG(mmu^t_E)
  (wext, mmu^t_E.wG, ⊥)                otherwise
tstepE_U(t) → s(NS(t)) =
  (winit, upaU(mmu^t_E).pa, user)      if winitU(mmu^t_E)
  (wext, mmu^t_E.wU, mmu^t_E.wG)       otherwise
• For processor core steps (cstep(NS(t))) we assemble the machine’s input gradually, as ex-
ecution of the current instruction progresses. Execution of instruction i progresses through
states σ ∈ S of the control automaton from Fig. 29 in cycles
T(σ , i) = {t | I(t) = i∧fin(σ , t)}.
The latter cycles (if they exist) are provably unique (see Sect. 6.3.2, equation 19).
τ(σ, i) =
  ε T(σ, i)   if T(σ, i) ≠ ∅
  −1          otherwise
Note that if control state σ is not visited throughout execution of instruction i, cycle τ(σ , i)
is simply chosen to be −1. For convenience we abbreviate
τYT(t) = τ(YT, I(t))
and specify the first two components of the oracle input for step NS(t) as follows.
s(NS(t)).wY =
  mmu^τYT(t)_Y.wout   if τYT(t) ≥ 0 ∧ mmu^τYT(t)_Y.treq
  ⊥                   otherwise
Thus, walk wY is taken from the outputs of mmuY once instruction i leaves the correspond-
ing translation state. Finally, for the external event signals we specify
s(NS(t)).eev = eevt∗
where the interrupts sampled at the end of the instruction execution (eevt∗) are not the same
interrupts that were sampled throughout the cycles of instruction execution (eev(t)):
eevt∗ = eev(t)∧ (idle(t)◦0).
According to Sect. 6.1.2, the last definition (for the processor core step) works for the external
interrupts. For reset it works by assumption that the hardware reset is never active in cycles
t > 0.
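The finish cycles T(σ, i) and the choice τ(σ, i) used for processor core steps can be sketched in Python. The trace encoding is our assumption: I[t] is the value of the scheduling function in cycle t, and fin[t] is the control state that finishes in cycle t (or None if no state finishes). Uniqueness of finish cycles (equation 19) is checked at run time in this sketch instead of being proven.

```python
from typing import List, Optional


def finish_cycles(I: List[int], fin: List[Optional[str]], sigma: str, i: int) -> List[int]:
    """T(sigma, i): cycles t with I(t) = i in which state sigma finishes."""
    return [t for t in range(len(fin)) if I[t] == i and fin[t] == sigma]


def tau(I: List[int], fin: List[Optional[str]], sigma: str, i: int) -> int:
    """tau(sigma, i): the unique finish cycle, or -1 if sigma is not visited."""
    T = finish_cycles(I, fin, sigma, i)
    if len(T) > 1:
        raise ValueError("finish cycles are expected to be unique (equation 19)")
    return T[0] if T else -1
```

With such a function, the walk component of the oracle input would be read from the MMU outputs in cycle tau(I, fin, "IT", i) or tau(I, fin, "ET", i), provided that cycle is non-negative and a translation was requested.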
7.2 Developing Formalism
We spend this technical section developing the machinery necessary to prove the simulation
theorem from Sect. 7.1.3. In particular, below we introduce functions I(t) and i(t), which will
be heavily used in the remainder of this chapter. Also, in Sect. 7.2.2 we establish an important
relationship between (scheduling) function I(t) and steps of the ISA computation.
7.2.1 Scheduling Function
In order to prove the correctness theorem for the sequential machine we require the following
auxiliary machinery. We introduce a simple scheduling function for the sequential machine
which keeps track of the number of instructions executed (processor core steps performed) up
to hardware cycle t.
I(0) = 0
I(t+1) =
  I(t)+1   if cstep(t)
  I(t)     otherwise
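A minimal Python sketch of the scheduling function, assuming the core steps are given as a list of booleans cstep[t]; the function name is ours.

```python
def schedule(cstep):
    """Return [I(0), ..., I(len(cstep))] with I(0) = 0 and I(t+1) = I(t) + cstep(t)."""
    I = [0]
    for c in cstep:
        I.append(I[-1] + (1 if c else 0))
    return I
```

The value I(t) stays constant between two core steps, which is the property used repeatedly in the proofs of this chapter.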
We also require function pseq, which as usual maps the local processor instruction indices to
the global step numbers. The definition is straightforward.
pseq(0) = min{n ∈ N | s(n) ∈ Σcore}
pseq(m+1) = min{n ∈ N | s(n) ∈ Σcore∧n > pseq(m)}
The index of the configuration in which instruction I(t) is executed we abbreviate by
i(t) = pseq(I(t)).
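For a finite prefix of the stepping function, pseq can be tabulated directly. In the Python sketch below the list s is our assumed encoding of the step types, with the string "core" standing for s(n) ∈ Σcore and any other string for a TLB step.

```python
def pseq_table(s):
    """Return [pseq(0), pseq(1), ...] for all core steps occurring in the prefix s."""
    # pseq(m) is the global index of the m-th step whose oracle input is a core input.
    return [n for n, kind in enumerate(s) if kind == "core"]
```

Given such a table and the scheduling function, i(t) is obtained as pseq_table(s)[I(t)], as long as the prefix s is long enough to contain the corresponding core step.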
In the correctness proof below in this chapter we extensively argue about signals in ISA con-
figuration n. For convenience we abbreviate these signals as
Z^n_σ = Z(c^n, s(n))
where s(n) denotes the external input provided by the stepping function for step n. With these
two shorthands we save a considerable amount of space (and brackets). For instance, when we
argue about signals in the configuration in which instructions are executed, we write
Z^i(t)_σ = Z(c^pseq(I(t)), s(pseq(I(t)))).
7.2.2 Relating Global Steps with Scheduling Function
In this section we establish several important relations between the number of instructions
executed up to hardware cycle t and the number of steps performed up to hardware cycle t. A
technical lemma follows directly from the definitions above.
Lemma 64.
cstep(t) → pseq(I(t)) = NS(t)
Proof of lemma 64. Consider two functions
f1(t) = pseq(I(t))
f2(t) = NS(t)
which are defined on domain
D = {t ∈ N | cstep(t)}.
Since both functions
fi : D→ Ri
are strictly increasing it suffices to show that their ranges are equal. We argue
R1 = {pseq(I(t)) | t ∈ D}
= {pseq(I(t)) | cstep(t)}
= {pseq(i) | i ∈ N}
and
R2 = {NS(t) | t ∈ D}
= {NS(t) | cstep(t)}
= {n ∈ N | s(n) ∈ Σcore}.
Let us denote the n-th minimum element of set S by min^n S. Using sorted sequences
R⃗i[n] = min^n Ri
one easily obtains by induction on n ∈ N:
R⃗1[n] = R⃗2[n].
From the monotonicity of function pseq we obviously have
R⃗1[n] = pseq(n).
For the induction base (n = 0) we argue as follows.
R⃗2[0] = min R2
= min{n ∈ N | s(n) ∈ Σcore}
= pseq(0)
For the induction step from n to n+1 we show the following.
R⃗2[n+1] = min^(n+1) R2
= min (R2 \ ⋃_{k≤n} R⃗2[k])
= min{m ∈ N \ ⋃_{k≤n} R⃗2[k] | s(m) ∈ Σcore}
= min{m ∈ N | s(m) ∈ Σcore ∧ m > R⃗1[n]}
= pseq(n+1) □
Using the latter lemma, for the walks passed to the specification machine one can easily derive
the following result.
w^i(t)_Yσ =
  mmu^τYT(t)_Y.wout   if τYT(t) ≥ 0 ∧ mmu^τYT(t)_Y.treq
  ⊥                   otherwise (21)
Another technical lemma we need in the correctness proof below is the following.
Lemma 65.
pseq(I(t))≥ NS(t)
Proof of lemma 65. Consider a pair of hardware cycles t1, t2 s.t.
cstep(t1)∧ cstep(t2)∧ (t ∈ (t1, t2)→ /cstep(t)).
Directly from the definition of the scheduling function we conclude:
∀t ∈ (t1, t2) : I(t) = I(t+1)
= I(t2).
Using that function NS by definition is monotonically increasing, we argue:
∀t ∈ (t1, t2) : pseq(I(t)) = pseq(I(t2))
= NS(t2) (lemma 64)
≥ NS(t). □
In the last lemma of this section we argue that starting from (global) step NS(t) up to (global)
step pseq(I(t)), in which the current instruction is executed, no processor core steps are per-
formed.
Lemma 66.
∀t : n ∈ [NS(t) : pseq(I(t))−1] → s(n) /∈ Σcore
Proof of lemma 66. We prove the statement by induction on n, where n denotes the length of
sub-interval
[NS(t) : NS(t)+n−1]⊆ [NS(t) : pseq(I(t))−1].
The base case holds trivially since for n = 0 there is nothing to show. For the induction step
from n−1 to
n≤ pseq(I(t))−NS(t)
we argue by contradiction as follows. Assume that NS(t)+n is a step of the processor core:
s(NS(t)+n) ∈ Σcore.
Obviously, there is a hardware cycle t ′ in which this step is performed, i.e.,
∃t ′ : NS(t ′) = NS(t)+n.
From the monotonicity of function NS we immediately conclude
t ′ ≥ t.
And from the monotonicity of function pseq we derive a contradiction:
NS(t)+n = NS(t ′)
= pseq(I(t ′)) (lemma 64)
≥ pseq(I(t))
> pseq(I(t))−1. □
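Lemmas 64 to 66 can be sanity-checked on finite toy traces. The Python sketch below is not part of the proof; it assumes that each hardware cycle is labeled with "none", "tlb" or "core" (at most one step per cycle, as in lemma 63) and derives s, pseq, NS and I from these labels. Cycles for which the prefix contains no further core step are skipped, since pseq(I(t)) is not available there.

```python
def check_step_lemmas(trace):
    """Check lemmas 64-66 on a finite trace of per-cycle step kinds."""
    s = [kind for kind in trace if kind != "none"]           # global step sequence
    pseq = [n for n, kind in enumerate(s) if kind == "core"]  # pseq(0), pseq(1), ...
    NS = 0  # NS(t)
    I = 0   # I(t)
    for kind in trace:
        if I < len(pseq):
            if pseq[I] < NS:                                  # lemma 65
                return False
            if any(s[n] == "core" for n in range(NS, pseq[I])):  # lemma 66
                return False
            if kind == "core" and pseq[I] != NS:              # lemma 64
                return False
        if kind != "none":
            NS += 1
        if kind == "core":
            I += 1
    return True
```

By construction of s from the trace the check is expected to succeed for every well-formed trace; its purpose is only to make the index arithmetic of the three lemmas concrete.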
7.3 Correctness Proof
At this point we finally have the machinery to show the correctness of the sequential machine
with NAT. Utilizing all results obtained in this chapter, we turn the proof into a simple book-
keeping exercise; the proof sketch occupies less than two pages. The proof is obviously by
induction on the number of hardware cycles t. Moreover, in Chap. 5 we have already
shown (lemma 36) that the translations in mmuY are contained in the TLB component of the
general computation c̃Y, defined earlier in Sect. 3.4.
∃c̃^0_Y ∀t : simtlb(mmu^t_Y, c̃^t_Y) (22)
Recall, stepping function s̃Y for general computation c̃Y was provided in Chap. 5. For convenience,
we rewrite the latter definition such that it better matches the hardware descriptions of
the sequential machine.
taddY_X(t) → s̃Y(t) =
  (winit, upaX(mmu^t_Y), mmu^t_Y.ptoX)     if winitX(mmu^t_Y)
  (wext, mmu^t_Y.wX, pteX(mmu^t_Y))        otherwise
tdropY(t) → s̃Y(t) =
  (drop, mmu^t_Y.inva)       if mmu^t_Y.invlpg
  (drop, mmu^t_Y.inva.vm)    if mmu^t_Y.vmflush
  (drop, all)                otherwise
Therefore, to incorporate the results of Chap. 5, it suffices to prove the following.
∃c^0 ∀t : simp+m(h^t, c^NS(t)) ∧ simISAtlb(c̃^t_I, c^NS(t)) ∧ simISAtlb(c̃^t_E, c^NS(t))
Finally, we extend the induction hypothesis with an additional term for simulation of implementation
registers, which we require to hold between configurations h^t and c^i(t).
siminv(h^t, c^i(t))
The latter auxiliary relation is necessary to perform the proof.
Simulation of Implementation Registers
We can formulate the simulation relation for implementation registers as follows.
siminv(h,c)≡
1. IF ≤ cs(h)∧used(wI ,c,x) → h.wI = x.wI
2. ET ≤ cs(h)∧used(I,c,x) → h.I = I(c,x)
3. EX ≤ cs(h)∧used(wE ,c,x) → h.wE = x.wE
Invariants
As additional proof goals we require the following invariants to hold.
Invariant 6.
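\[
IF \le cs(t) \wedge used(w_I)_\sigma^{i(t)} \;\rightarrow\; h^t.w_I \in
\begin{cases}
c^{i(t)}.tlb^{\circ} & user(t) \\
c^{i(t)}.tlb & guest(t)
\end{cases}
\]
\[
EX \le cs(t) \wedge used(w_E)_\sigma^{i(t)} \;\rightarrow\; h^t.w_E \in
\begin{cases}
c^{i(t)}.tlb^{\circ} & user(t) \\
c^{i(t)}.tlb & guest(t)
\end{cases}
\]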
Invariant 7.
IF ≤ cs(t)∧used(wI)i(t)σ → match(trqi(t)I σ ,ht .wI)
EX ≤ cs(t)∧used(wE)i(t)σ → match(trqi(t)E σ ,ht .wE)
Induction Base
For the induction base (t = 0) we argue as follows.
• simulation $sim_{ISAtlb}(\tilde{c}_Y^0, c^{NS(0)})$ we obtain by constructing an appropriate ISA configuration
$c^0$ from the general TLB configuration $\tilde{c}_Y^0$:
\[ c^0.tlb = \emptyset \supseteq \emptyset = \tilde{c}_Y^0.tlb. \]
• simulation $sim_{p+m}(h^0, c^{NS(0)})$ we obtain as usual by constructing an appropriate ISA
configuration $c^0$ from the hardware configuration $h^0$:
\[
\begin{aligned}
c^0.(pc,dpc,ddpc) &= h^0.(pc,dpc,ddpc) = (8_{32},4_{32},0_{32}) \\
c^0.(gpr,spr) &= h^0.(gpr,spr) \\
\ell(c^0.m) &= m(h^0).
\end{aligned}
\]
• simulation siminv(h0,ci(0)) holds trivially since the machine’s control automaton starts in
its idle state:
cs(0) = idle.
Induction Step
For the induction step from t to t+1 we show
\[ sim_{ISAtlb}(\tilde{c}_Y^{t+1}, c^{NS(t+1)}), \quad sim_{p+m}(h^{t+1}, c^{NS(t+1)}) \quad\text{and}\quad sim_{inv}(h^{t+1}, c^{i(t+1)}) \]
by case split i) on the number of steps performed in cycle t (ns(t)) and ii) on the types of these
steps (if there are any). Recall, by lemma 63 we know that
ns(t)≤ 1.
• ns(t) = 0. No steps are performed in cycle t. Simulations
\[ sim_{ISAtlb}(\tilde{c}_Y^{t+1}, c^{NS(t+1)}) \quad\text{and}\quad sim_{p+m}(h^{t+1}, c^{NS(t+1)}) \]
of the TLBs and processor resp. hold trivially, since none of the visible data structures are
updated in cycle t. Simulation
siminv(ht+1,ci(t+1))
of the implementation registers we cover in Sect. 7.3.3 below.
• tstep(t) = 1. One of the MMUs performs a TLB step in cycle t. We show simulation
\[ sim_{ISAtlb}(\tilde{c}_Y^{t+1}, c^{NS(t+1)}) \]
of the TLBs in Sect. 7.4.1 below. Simulations
simp+m(ht+1,cNS(t+1)) and siminv(ht+1,ci(t+1))
of the processor and implementation registers resp. hold trivially. Note, configuration of
the processor changes only at the processor core steps. Also, using lemma 49, we argue
that the implementation registers do not change either.
• cstep(t) = 1. Processor core performs a step in cycle t. For simulation
\[ sim_{ISAtlb}(\tilde{c}_Y^{t+1}, c^{NS(t+1)}) \]
of the TLBs we split cases on whether an invalidating instruction is executed in cycle t
(istep(t)). In Sect. 7.4.2 we present an argument for the case in which such an instruction
is executed. Otherwise, the latter simulation holds trivially. Simulation
simp+m(ht+1,cNS(t+1))
of the processor boils down to a simple bookkeeping exercise; it is covered in Sect. 7.3.2.
Finally, simulation
siminv(ht+1,ci(t+1))
for the implementation registers holds trivially, since according to lemma 47 the control
goes to the idle state from cycle t and there is nothing to show in cycle t+1.
The invariants are shown in Sect. 7.3.4; the guard conditions are verified in Sect. 7.5. Fol-
lowing this sketch we finish proving correctness of the sequential machine in the forthcoming
subsections.
7.3.1 Interrupt Processing
With the next lemmas we claim that all interrupts which were processed by hardware through-
out execution of the current instruction were processed correctly.
Lemma 67.
∀k ≤maxJ (cs(t)) : mca(t)[k] = mcai(t)σ [k]
Proof of lemma 67. For the external interrupt signal (k = 1), we split cases on whether a pro-
cessor core step is performed in cycle t.
• cstep(t) = 1. In this case we trivially obtain:
e(t) = idle(t)∧ eev(t)[1] (definition)
= s(NS(t)).eev[1] (stepping)
= eevi(t)σ [1] (lemma 64)
= ei(t)σ (definition).
Moreover, for the interrupt mask signal we show:
imask(t)[1] = ht .sr[1] (definition)
= cNS(t).sr[1] (IH)
= ci(t).sr[1] (lemma 64)
= imaski(t)σ [1] (definition).
Combining the latter two arguments clearly gives the claim for the first bit.
mca(t)[1] = mcai(t)σ [1]
• cstep(t) = 0. This case turns out to be more complicated. First, by contradiction
mca(t)[1]→ jisr(t)∧/cont(t)
→ /exec(t)
→ cstep(t)
we derive exec(t) and
0 = mca(t)[1] = ht .sr[1]∧ e(t).
Note, provided that the external interrupts are not masked in the SPR (ht .sr[1]), the external
interrupt signal (e(t)) must be low in cycle t if the machine resides in the idle state (since
otherwise a core step is performed). Outside the idle state, the external interrupts are
not visible to the hardware (Sect. 6.1.2), and therefore are not passed to the specification
machine (Sect. 7.1.4). Thus, it suffices to show
ei(t)σ = 0.
For the first cycle t ′ > t when a processor core step is performed
t ′ = min{t˜ > t | cstep(t˜)}
we clearly have
/idle(t ′)∧ cstep(t ′).
Analogous to the first case (cstep(t)), for the interrupt signal in cycle t ′ we derive the
following.
\[ e(t') = e_\sigma^{i(t')} \qquad (23) \]
Since processor core steps are not performed in cycles [t : t ′−1], we conclude
\[
\begin{aligned}
e_\sigma^{i(t)} &= e_\sigma^{i(t')} && \text{(definition)} \\
&= e(t') && \text{(equation 23)} \\
&= 0 && \text{(definition)}
\end{aligned}
\]
and the claim follows.
Since all other interrupts are not maskable, we argue in the remainder of this proof only about
the cause signals. Thus, for the misalignment on fetch (k = 2) we obtain:
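\[
\begin{aligned}
malf(t) &\equiv h^t.ddpc[1:0] \neq 0^2 && \text{(definition)} \\
&\equiv c^{NS(t)}.ddpc[1:0] \neq 0^2 && \text{(IH)} \\
&\equiv c^{i(t)}.ddpc[1:0] \neq 0^2 && \text{(lemma 66)} \\
&\equiv malf_\sigma^{i(t)} && \text{(definition).}
\end{aligned}
\]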
First, using the arguments above for walk wI we derive the following.
IF ≤ cs(t) → ( used(wI)i(t)σ ↔ /host(t) ) (24)
Proof of equation 24.
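\[
\begin{aligned}
/host(t) &\leftrightarrow /host(t) \wedge (mca(t)[2:0] = 0^3) && \text{(invariant 5.1)} \\
&\leftrightarrow /host_\sigma^{i(t)} \wedge (mca_\sigma^{i(t)}[2:0] = 0^3) && \text{(IH; lemma 66)} \\
&\leftrightarrow /host_\sigma^{i(t)} \wedge il_\sigma^{i(t)} > 2 && \text{(definition)} \\
&\leftrightarrow used(w_I)_\sigma^{i(t)} && \text{(definition)} \qquad \square
\end{aligned}
\]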
Having the result above, we argue about the page faults on fetch (k = 3)
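\[
\begin{aligned}
pff(t) &= /host(t) \wedge f(h^t.w_I) && \text{(definition)} \\
&= used(w_I)_\sigma^{i(t)} \wedge f(h^t.w_I) && \text{(equation 24)} \\
&= used(w_I)_\sigma^{i(t)} \wedge match(trq_{I\sigma}^{i(t)}, h^t.w_I) \wedge f(h^t.w_I) && \text{(invariant 7)} \\
&= used(w_I)_\sigma^{i(t)} \wedge pfault(trq_{I\sigma}^{i(t)}, h^t.w_I) && \text{(definition)} \\
&= used(w_I)_\sigma^{i(t)} \wedge pfault(trq_{I\sigma}^{i(t)}, w_{I\sigma}^{i(t)}) && \text{(IH)} \\
&= pff_\sigma^{i(t)} && \text{(definition)}
\end{aligned}
\]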
and the general-protection faults on fetch (k = 4) resp.
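\[
\begin{aligned}
gff(t) &= /host(t) \wedge /f(h^t.w_I) \wedge (11 \not\le h^t.w_I.r[xu]) && \text{(definition)} \\
&= used(w_I)_\sigma^{i(t)} \wedge /f(h^t.w_I) \wedge (11 \not\le h^t.w_I.r[xu]) && \text{(equation 24)} \\
&= used(w_I)_\sigma^{i(t)} \wedge match(trq_{I\sigma}^{i(t)}, h^t.w_I) \wedge /f(h^t.w_I) \wedge (11 \not\le h^t.w_I.r[xu]) \\
&= used(w_I)_\sigma^{i(t)} \wedge gfault(trq_{I\sigma}^{i(t)}, h^t.w_I) && \text{(definition)} \\
&= used(w_I)_\sigma^{i(t)} \wedge gfault(trq_{I\sigma}^{i(t)}, w_{I\sigma}^{i(t)}) && \text{(IH)} \\
&= gff_\sigma^{i(t)} && \text{(definition)}
\end{aligned}
\]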
For the misalignment on memory operation (k = 8) we similarly obtain:
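\[
\begin{aligned}
malm(t) &\equiv mmask(t)[2:1] \wedge (ea(t)[1:0] \neq 0^2) && \text{(definition)} \\
&\equiv mmask_\sigma^{i(t)}[2:1] \wedge (ea_\sigma^{i(t)}[1:0] \neq 0^2) && \text{(IH)} \\
&\equiv mop_\sigma^{i(t)} \wedge (d_\sigma^{i(t)} \nmid \langle ea_\sigma^{i(t)} \rangle) && \text{(lemma 8)} \\
&\equiv malm_\sigma^{i(t)} && \text{(definition).}
\end{aligned}
\]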
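The step justified by lemma 8 above relates a check on the low address bits to the divisibility condition of the ISA. The following is a minimal Python sketch of that relation, not the hardware definition: it assumes access widths d of 1, 2 or 4 bytes and does not model the encoding of mmask; the function names are hypothetical.

```python
def misaligned(ea: int, d: int) -> bool:
    """ISA view: the access width d (in bytes) does not divide the effective address."""
    return ea % d != 0


def misaligned_low_bits(ea: int, d: int) -> bool:
    """Bit-level view: one of the low log2(d) address bits is set (d a power of two)."""
    return (ea & (d - 1)) != 0


# Both views agree for all power-of-two widths assumed here.
agree = all(
    misaligned(ea, d) == misaligned_low_bits(ea, d)
    for ea in range(64)
    for d in (1, 2, 4)
)
```

For a word access (d = 4) the bit-level check reduces to testing whether ea[1 : 0] is nonzero, which is the shape of the condition used for the fetch address above.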
Analogously to equation 24, using the arguments above for walk wE we derive the following.
EX ≤ cs(t) → ( used(wE)i(t)σ ↔ /host(t)∧mop(t) ) (25)
Finally, we again similarly argue about the page faults on memory operation (k = 9)
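\[
\begin{aligned}
pfm(t) &= /host(t) \wedge mop(t) \wedge f(h^t.w_E) && \text{(definition)} \\
&= used(w_E)_\sigma^{i(t)} \wedge f(h^t.w_E) && \text{(equation 25)} \\
&= used(w_E)_\sigma^{i(t)} \wedge match(trq_{E\sigma}^{i(t)}, h^t.w_E) \wedge f(h^t.w_E) && \text{(invariant 7)} \\
&= used(w_E)_\sigma^{i(t)} \wedge pfault(trq_{E\sigma}^{i(t)}, h^t.w_E) && \text{(definition)} \\
&= used(w_E)_\sigma^{i(t)} \wedge pfault(trq_{E\sigma}^{i(t)}, w_{E\sigma}^{i(t)}) && \text{(IH)} \\
&= pfm_\sigma^{i(t)} && \text{(definition)}
\end{aligned}
\]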
and the general-protection-faults on memory operation (k = 10) resp.
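\[
\begin{aligned}
gfm(t) &= /host(t) \wedge mop(t) \wedge /f(h^t.w_E) \wedge (1s \not\le h^t.w_E.r[uw]) && \text{(definition)} \\
&= used(w_E)_\sigma^{i(t)} \wedge /f(h^t.w_E) \wedge (1s \not\le h^t.w_E.r[uw]) && \text{(equation 25)} \\
&= used(w_E)_\sigma^{i(t)} \wedge match(trq_{E\sigma}^{i(t)}, h^t.w_E) \wedge /f(h^t.w_E) \wedge (1s \not\le h^t.w_E.r[uw]) \\
&= used(w_E)_\sigma^{i(t)} \wedge gfault(trq_{E\sigma}^{i(t)}, h^t.w_E) && \text{(definition)} \\
&= used(w_E)_\sigma^{i(t)} \wedge gfault(trq_{E\sigma}^{i(t)}, w_{E\sigma}^{i(t)}) && \text{(IH)} \\
&= gfm_\sigma^{i(t)} && \text{(definition)}
\end{aligned}
\]
where for the write-bit checked by the hardware we obviously have
\[ s \equiv s(t) \vee cas(t) = s_\sigma^{i(t)} \vee cas_\sigma^{i(t)}. \qquad \square \]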
Processing of the remaining interrupt signals does not change compared to [LOP], and there-
fore is omitted. Assuming correctness of the simple circuits implementing the machine control
logic, in particular the signals above, we argue further as follows.
Lemma 68. On steps of the processor core (cstep(t)) the following holds:
\[
\begin{aligned}
jisr(t) &= jisr_\sigma^{i(t)} \\
f1(mca(t)) &= f1(mca_\sigma^{i(t)})
\end{aligned}
\]
Proof of lemma 68. First we consider the JISR signal.
\[
\begin{aligned}
jisr(t) &= \bigvee_{k \le maxJ(cs(t))} mca(t)[k] && \text{(definition)} \\
&= \bigvee_{k \le maxJ(cs(t))} mca_\sigma^{i(t)}[k] && \text{(lemma 67)}
\end{aligned}
\]
For the direction from left to right we clearly have:
\[
\begin{aligned}
jisr(t) &\rightarrow mca_\sigma^{i(t)}[10:0] \neq 0^{11} \\
&\rightarrow jisr_\sigma^{i(t)} && \text{(definition).}
\end{aligned}
\]
For the opposite direction we first recall that
cstep(t)∧/ jisr(t)→ cs(t) = EX.
Using that we trivially conclude (by contradiction):
\[
\begin{aligned}
/jisr(t) &\rightarrow mca_\sigma^{i(t)}[10:0] = 0^{11} \\
&\rightarrow /jisr_\sigma^{i(t)} && \text{(definition).}
\end{aligned}
\]
For the masked cause, from lemma 67 we have
mca(t)[k∗ : 0] = mcai(t)σ [k∗ : 0]
for
k∗ = maxJ (cs(t)).
Next we split cases on the value of the JISR signal.
• jisr(t) = 1. In case of an active JISR we know that
∃k ≤ k∗ : mca(t)[k]∧mcai(t)σ [k].
Using properties of the first one circuit we easily derive
\[
\begin{aligned}
f1(mca(t)) &= 0^{10-k^*} \circ f1(mca(t)[k^*:0]) \\
&= 0^{10-k^*} \circ f1(mca_\sigma^{i(t)}[k^*:0]) \\
&= f1(mca_\sigma^{i(t)}).
\end{aligned}
\]
• jisr(t) = 0. In case of an inactive JISR we simply have
k∗ = 10∧∀k ≤ k∗ : /mca(t)[k]∧/mcai(t)σ [k]
and the statement follows.
\[ f1(mca(t)) = 0^{11} = f1(mca_\sigma^{i(t)}) \qquad \square \]
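The first-one argument in the proof of lemma 68 can be replayed on concrete bit vectors. The following is a minimal Python sketch, not the circuit of the construction: we assume that f1 returns a one-hot vector marking the lowest-indexed set bit (index 0 has the highest priority) and we model the 11-bit masked cause as an integer. The names `WIDTH`, `f1`, `jisr` and `prefix_property_holds` are hypothetical.

```python
WIDTH = 11  # assumed width of the masked cause vector mca[10:0]


def f1(a: int) -> int:
    """One-hot vector of the lowest set bit of a (0 if a == 0)."""
    return a & -a


def jisr(mca: int, k_star: int) -> bool:
    """OR of the cause bits mca[k_star:0] that are visible in the current control state."""
    return (mca & ((1 << (k_star + 1)) - 1)) != 0


def prefix_property_holds(mca: int, k_star: int) -> bool:
    """If jisr is active, f1(mca) equals f1(mca[k_star:0]) padded with leading zeros."""
    low = mca & ((1 << (k_star + 1)) - 1)
    return (not jisr(mca, k_star)) or f1(mca) == f1(low)


# Exhaustive check over all 11-bit cause vectors and all cut-off indices.
holds_everywhere = all(
    prefix_property_holds(m, k) for m in range(1 << WIDTH) for k in range(WIDTH)
)
```

Since a set bit at an index not above k∗ is necessarily the first one of the whole vector or lies above it, the bits above k∗ cannot influence the result. This is why agreement of the two masked causes on bits [k∗ : 0] is enough.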
7.3.2 Instruction Execution
In Sect. 7.3.1 we deduced that the interrupts registered by hardware are identical to the in-
terrupts registered in the specification on execution of the current instruction. In this section
we argue that our sequential implementation executes instructions correctly. Instructions are
executed in cycles t in which the processor core steps are performed (cstep(t)). In case no
interrupts occurred in cycle t by lemma 68 we know
jisr(t) = 0 = jisri(t)σ
and the current instruction is executed in a regular fashion. Correctness for this case was es-
tablished for sequential machines a number of times; similar proofs can be found in [PBLS16],
[Sch14a], and [Sch16b].
Interrupted Execution
Next we consider the case in which a non-continue type interrupt was discovered in cycle t,
i.e., in which we have
jisr(t)∧/cont(t).
On non-continue type interrupts execution of the current instruction is terminated and no visible
data structures except for program counters and special purpose registers are updated. For
the program counters we trivially derive:
\[
\begin{aligned}
h^{t+1}.pc &= 8_{32} = c^{NS(t+1)}.pc \\
h^{t+1}.dpc &= 4_{32} = c^{NS(t+1)}.dpc \\
h^{t+1}.ddpc &= 0_{32} = c^{NS(t+1)}.ddpc.
\end{aligned}
\]
For the special purpose registers the arguments are close to trivial as well. For the status register
(sr) we simply have
ht+1.sr = 032 = cNS(t+1).sr.
For the mode registers we apply lemma 64 and split cases on the value of signal
user(t)∧/icpt(t) ↔ userNS(t)σ ∧/icptNS(t)σ .
In case the latter signal is active, we argue about the nested mode register (nmode)
ht+1.nmode = ht .nmode[31 : 1]◦0 = cNS(t).nmode[31 : 1]◦0 = cNS(t+1).nmode
whereas for the ordinary mode register (mode) the argument is trivial.
ht+1.mode = ht .mode = cNS(t).mode = cNS(t+1).mode
Otherwise, in case an interrupt is triggered at the lower privilege levels (of guest or host) or
in case of an intercept, the argument for the nested mode register repeats the corresponding
argument for the ordinary mode register above, and vice versa.
ht+1.nmode = ht .nmode = cNS(t).nmode = cNS(t+1).nmode
ht+1.mode = ht .mode[31 : 1]◦0 = cNS(t).mode[31 : 1]◦0 = cNS(t+1).mode
For most of the “exception” registers
Z ∈ {sr, pc,d pc,dd pc,mode,nmode}
we simply have
ht+1.eZ = ht .Z = cNS(t).Z = cNS(t+1).eZ.
Update of the exception cause register (eca) is justified by application of lemmas 68 and 64:
\[ h^{t+1}.eca = 0^{21} \circ f1(mca(t)) = 0^{21} \circ f1(mca_\sigma^{i(t)}) = c^{NS(t+1)}.eca. \]
Finally, update of the exception data register (edata) depends on the control state (cs(t)) in
which the processor core step is performed. On JISR by applying lemmas 68 and 64 we easily
derive:
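\[ IF < cs(t) \;\leftrightarrow\; mca(t)[4:0] = 0^5 \;\leftrightarrow\; mca_\sigma^{i(t)}[4:0] = 0^5 \;\leftrightarrow\; 4 < il_\sigma^{i(t)} \;\leftrightarrow\; fetch_\sigma^{NS(t)}. \]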
If the latter signal is active, we have
ht+1.edata = ea(t) = eaNS(t)σ = cNS(t+1).edata.
Otherwise, we simply have
ht+1.edata = 032 = cNS(t+1).edata.
The remaining special purpose registers do not change.
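The register updates argued above for a non-continue interrupt can be collected in one place. The following is a minimal Python sketch under stated assumptions, not the hardware construction: registers are modeled as 32-bit integers in a dictionary, the first-one function picks the lowest-indexed cause bit, and the function name `jisr_abort_update` and its parameters are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def jisr_abort_update(h: dict, mca: int, ea: int, user: bool, icpt: bool) -> dict:
    """Registers after a core step with jisr and /cont (hypothetical model)."""
    n = dict(h)
    # Exception registers eZ receive the old value of Z.
    for z in ("sr", "pc", "dpc", "ddpc", "mode", "nmode"):
        n["e" + z] = h[z]
    # Program counters and status register are reset.
    n["pc"], n["dpc"], n["ddpc"] = 8, 4, 0
    n["sr"] = 0
    # Exactly one of the two mode registers has its lowest bit cleared.
    if user and not icpt:
        n["nmode"] = h["nmode"] & ~1 & MASK32
    else:
        n["mode"] = h["mode"] & ~1 & MASK32
    # eca = 0^21 o f1(mca); f1 assumed to select the lowest set bit.
    n["eca"] = mca & -mca
    # edata holds the effective address only if no cause bit in [4:0] is set.
    n["edata"] = ea if (mca & 0x1F) == 0 else 0
    return n


old = {"sr": 0xF, "pc": 0x100, "dpc": 0xFC, "ddpc": 0xF8, "mode": 1, "nmode": 1}
after_pfm = jisr_abort_update(old, mca=1 << 9, ea=0x1234, user=True, icpt=False)
after_pff = jisr_abort_update(old, mca=1 << 3, ea=0x1234, user=False, icpt=False)
```

In the first call a page fault on a memory operation (k = 9) in user mode clears the lowest bit of nmode and stores the effective address; in the second call a page fault on fetch (k = 3) outside user mode clears the lowest bit of mode and zeroes edata.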
Uninterrupted Execution
In case the highest-priority interrupt that occurred throughout the cycles of execution of the
current instruction is of type continue, in cycle t we clearly have
jisr(t)∧ cont(t)
and execution of the current instruction is not interrupted (exec(t)). First of all, from the
hardware construction we know that the processor core step is performed in the execution
state.
cs(t) = EX
In contrast to the previous case, with continue type interrupts the results of instruction
execution must be written to their target data structures. Since in our simple design the only two
sources of continue type interrupts are system calls and arithmetic overflows,
we proceed as follows.
In case of a system call (sysc), no visible data structures are updated explicitly (by the system
call instruction).1 In case of the arithmetic overflow (ovf ), which occurs on execution of arith-
metic instructions, the intermediate result C is written into the target general purpose register.
Using lemma 64 we argue for the C-address
xad = xad(t) = xadNS(t)σ
and for the target GPR register.
ht+1.gpr(xad) =C(t) =CNS(t)σ = cNS(t+1).gpr(xad)
The program counters are updated with the initial values exactly like in the previous case (for
the non-continue type interrupts). Most of the special purpose registers are updated exactly
like in the previous case as well. The only difference is the exception program counters. For
the latter registers we argue using the induction hypothesis simply as follows.
ht+1.epc = nextpc(t) = nextpcNS(t)σ = cNS(t+1).epc
ht+1.ed pc = ht .pc = cNS(t).pc = cNS(t+1).ed pc
ht+1.edd pc = ht .d pc = cNS(t).d pc = cNS(t+1).edd pc
Note, in the argument for the exception PC above we additionally require lemma 64 to justify
correctness of the nextpc computation.
1 I.e., the general purpose register file and the memory are not affected; PCs are set to point to three
consecutive words after sisr (start of the interrupt service routine), and the special purpose registers
are updated according to the specification of continue type interrupts.
7.3.3 Implementation Registers
In order to show
siminv(ht+1,ci(t+1))
we split cases on the next control state (cs(t+1)). Obviously we assume
\[ cs(t+1) \neq cs(t) \]
since otherwise nothing changes and the simulation holds trivially.
• cs(t+1) ∈ {idle, IT}. The simulation relation holds trivially for this case.
• cs(t+1) = IF. In a dedicated section below we show
\[ used(w_I)_\sigma^{i(t+1)} \;\rightarrow\; h^{t+1}.w_I = w_{I\sigma}^{i(t+1)}. \]
• cs(t+1) = ET . In a dedicated section below we show
used(I)i(t+1)σ → ht+1.I = Ii(t+1)σ .
• cs(t+1) = EX. In a dedicated section below we show
\[ used(w_E)_\sigma^{i(t+1)} \;\rightarrow\; h^{t+1}.w_E = w_{E\sigma}^{i(t+1)}. \]
For cases in which
cs(t+1) ∈ {IF,ET,EX}
we argue that
i(t+1) = i(t)
since no processor core steps are performed. Therefore, the portions of the simulation relation
which we did not mention above hold trivially by induction.
Instruction Address Translation
Below we show that register wI , if used for translation of the instruction address, contains a
proper translation in cycle t + 1. Recall, we show correctness for register wI only in those
cycles in which
IT(t)∧/idle(t+1).
Therefore, using lemma 47 we conclude there is no processor core step in cycle t, which
implies
i(t+1) = i(t). (26)
From the used predicate we derive:
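\[
\begin{aligned}
used(w_I)_\sigma^{i(t+1)} &\leftrightarrow used(w_I)_\sigma^{i(t)} && \text{(equation 26)} \\
&\rightarrow /host_\sigma^{i(t)} && \text{(definition)} \\
&\leftrightarrow /host(c^{NS(t)}) && \text{(lemma 66)} \\
&\leftrightarrow /host(t) && \text{(IH).}
\end{aligned}
\]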
Moreover, we have exec(t), and thus mmutI .treq, which allows us to show the claim.
ht+1.wI = mmutI .wout (interconnect)
= wi(t)I σ (equation 21)
= wi(t+1)I σ (equation 26)
Instruction Fetch
We use this section to show that the instruction fetched in hardware cycle t is the instruction
executed in ISA configuration i(t). Recall, pmaI is the physical memory address of the fetched
instruction. In case the address translation is used we derive:
pmaI(t) = ht .wI .ba◦ ia(t).po (interconnect)
= ht .wI .ba◦ ia(cNS(t)).po (IH)
= ht .wI .ba◦ iai(t)σ .po (lemma 66)
= tma(asidi(t)σ ◦ iai(t)σ ,wi(t)I σ ) (equation 32; invariant 7; IH)
= pmai(t)I σ (definition).
Otherwise, without the address translation, we have:
pmaI(t) = ia(t) (interconnect)
= ia(cNS(t)) (IH)
= iai(t)σ (lemma 66)
= pmai(t)I σ (definition).
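The two derivations above distinguish the translated and the untranslated fetch address. The following is a minimal Python sketch of the address composition only, assuming a 12-bit page offset (consistent with page addresses being taken from bits [31 : 12] elsewhere in this chapter); the names `PAGE_BITS`, `page_offset` and `pma` as a Python function are hypothetical.

```python
from typing import Optional

PAGE_BITS = 12  # assumed width of the page offset po


def page_offset(addr: int) -> int:
    """Low PAGE_BITS bits of a 32-bit address (the field .po)."""
    return addr & ((1 << PAGE_BITS) - 1)


def pma(ia: int, walk_ba: Optional[int]) -> int:
    """Physical address of the fetch: walk_ba o ia.po if translated, ia otherwise."""
    if walk_ba is None:  # no address translation is used
        return ia
    return (walk_ba << PAGE_BITS) | page_offset(ia)


untranslated = pma(0x12345678, None)
translated = pma(0x12345678, 0xABCDE)
```

Only the base address comes from the walk register; the page offset is taken from the instruction address on both the hardware and the ISA side, which is why the induction hypothesis for the program counters suffices for that part.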
Having that the address used by the hardware to fetch the instruction is the address used in the
ISA, for the output of the instruction cache we show the following.
finex(IF, t) → pdout(1)t = imout i(t)σ (27)
Proof of equation 27. According to lemmas from Sect. 6.2.4, we know there is an access to
the instruction cache ending in cycle t (lemma 55)
∃k : e(1,k) = t ∧acc(1,k).r
and it is the only non-flushing access ending in cycle t (lemmas 54 and 55).
∀ j ∈ {0,2,3} ∀k : e( j,k) = t→ acc( j,k). f (28)
For convenience we abbreviate
a = pmaI .l
and derive:
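\[
\begin{aligned}
pdout(1)^t &= \Delta_M^{seq(1,k)}(m(h^0), acc')(a) && \text{(lemma 52)} \\
&= \Delta_M^{NE(t)+n_I}(m(h^0), acc')(a) && \text{(where } n_I \le ne(t)\text{)} \\
&= \Delta_M^{NE(t)}(m(h^0), acc')(a) && \text{(equation 28)} \\
&= m(h^t)(a) && \text{(lemma 51)} \\
&= \ell(c^{NS(t)}.m)(a) && \text{(IH)} \\
&= \ell(c^{i(t)}.m)(a) && \text{(lemma 66)} \\
&= dataout(\ell(c^{i(t)}.m), iacc_\sigma^{i(t)}) && \text{(definition)} \\
&= imout_\sigma^{i(t)} && \text{(definition).} \qquad \square
\end{aligned}
\]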
Hardware fetches instructions only in the absence of interrupts which prevent the fetch. Recall,
we argue about the instruction register only in those cycles in which
IF(t)∧/idle(t+1).
Thus, if no interrupts occurred
finex(IF, t) → / jisr(t)↔ mca(t)[4 : 0] = 05 (definition)
↔ mcai(t)σ [4 : 0] = 05 (lemma 67)
↔ used(I)i(t)σ (definition)
↔ used(I)i(t+1)σ (lemma 47.1),
for the instruction register we have:
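\[
\begin{aligned}
h^{t+1}.I &=
\begin{cases}
pdout(1)^t_H & pma_I[2] \\
pdout(1)^t_L & \text{otherwise}
\end{cases}
&& \text{(interconnect)} \\
&=
\begin{cases}
(imout_\sigma^{i(t)})_H & pma_I[2] \\
(imout_\sigma^{i(t)})_L & \text{otherwise}
\end{cases}
&& \text{(equation 27)} \\
&= I_\sigma^{i(t)} && \text{(lemma 7)} \\
&= I_\sigma^{i(t+1)} && \text{(lemma 47.1).}
\end{aligned}
\]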
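The case split on pmaI [2] above selects one word of the cache output. The following is a minimal Python sketch of that selection, assuming that the cache delivers a 64-bit line whose high word occupies bits [63 : 32] and whose low word occupies bits [31 : 0]; the function name `select_word` is hypothetical.

```python
def select_word(line: int, pma: int) -> int:
    """Pick the high or low 32-bit word of a 64-bit line using address bit 2."""
    if (pma >> 2) & 1:
        return (line >> 32) & 0xFFFFFFFF  # high word, written with index H in the text
    return line & 0xFFFFFFFF              # low word, written with index L in the text


line = 0x1111111122222222
low_word = select_word(line, 0x0)
high_word = select_word(line, 0x4)
```

Because the same selection is applied to the hardware cache output and to the ISA memory output, equality of the two lines (equation 27) carries over to the fetched instruction word.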
Effective Address Translation
The arguments here are completely analogous to those presented in the counterpart section on
translation of the instruction address, and therefore are omitted.
7.3.4 Maintaining Invariants
In this section we split cases on the next control state (cs(t+1)). Obviously we assume
\[ cs(t+1) \neq cs(t) \]
since otherwise nothing changes and the invariants hold trivially.
• cs(t+1) ∈ {idle, IT}. The invariants hold trivially for this case.
• cs(t+1) = IF. In a dedicated section below we show
\[
match(trq_{I\sigma}^{i(t+1)}, h^{t+1}.w_I) \quad\text{and}\quad h^{t+1}.w_I \in
\begin{cases}
c^{i(t+1)}.tlb & guest(t+1) \\
c^{i(t+1)}.tlb^{\circ} & user(t+1)
\end{cases}
\]
in case translation wI for the instruction address is used in step i(t+1).
• cs(t+1) = ET . The invariants hold trivially for this case.
• cs(t+1) = EX. In a dedicated section below we show
\[
match(trq_{E\sigma}^{i(t+1)}, h^{t+1}.w_E) \quad\text{and}\quad h^{t+1}.w_E \in
\begin{cases}
c^{i(t+1)}.tlb & guest(t+1) \\
c^{i(t+1)}.tlb^{\circ} & user(t+1)
\end{cases}
\]
in case translation wE for the effective address is used in step i(t+1).
For cases in which
cs(t+1) ∈ {EX}
we argue that
i(t+1) = i(t)
since no processor core steps are performed. Therefore, the portions of the invariants which
we did not mention above hold trivially by induction.
Translation for Instruction Address
That register wI is updated with a walk that is contained in the specification TLB (when the
processor core makes a step) we argue as follows:
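\[
\begin{aligned}
h^{t+1}.w_I &= mmu_I^t.wout && \text{(interconnect)} \\
&\in
\begin{cases}
tlb_{IG}(t) & mmu_I^t.upa \in A_G \\
tlb_{IU}(t) & mmu_I^t.upa \in A_U
\end{cases}
&& \text{(interconnect; spec.)} \\
&\subseteq
\begin{cases}
c^{NS(t)}.tlb & guest(t) \\
c^{NS(t)}.tlb^{\circ} & user(t)
\end{cases}
&& \text{(equation 22; IH)} \\
&\subseteq
\begin{cases}
c^{i(t)}.tlb & guest(t) \\
c^{i(t)}.tlb^{\circ} & user(t)
\end{cases}
&& \text{(lemma 66)} \\
&=
\begin{cases}
c^{i(t+1)}.tlb & guest(t+1) \\
c^{i(t+1)}.tlb^{\circ} & user(t+1)
\end{cases}
&& \text{(lemma 47.1).}
\end{aligned}
\]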
Note that we show correctness for implementation register h.wI only in those cycles in which
IT(t)∧/idle(t+1).
Therefore, the execution mode does not change in cycle t
mode(t+1) = mode(t)
which justifies the last line in the proof above.
That this walk matches the address of the executed instruction is established below.
ht+1.wI .upa = mmutI .wout.upa (interconnect)
= mmutI .upa (specification)
= asid(t)◦ ia(t).pa (interconnect)
= asid(cNS(t))◦ ia(cNS(t)).pa (IH)
= asidi(t)σ ◦ iai(t)σ .pa (lemma 66)
= asidi(t+1)σ ◦ iai(t+1)σ .pa (lemma 47.1).
That this walk is faulty or complete is one of the properties of the nested MMU.
Translation for Effective Address
The arguments here are completely analogous to those presented above on translation for the
instruction address, and therefore are omitted.
7.4 Correctness for TLBs
In this section we prove several results which establish correctness for the TLBs while the
MMUs are serving translation and invalidation queries. We prove these results under the as-
sumption that the MMU components are properly connected to i) the memory system on one
hand and ii) the processor core on the other hand.
7.4.1 Address Translation
From the interface between the MMU and the memory, in the following lemma we show that,
after the memory access was performed, the hardware operates on the same page table
entry as the specification machine does.
Lemma 69. Assume that mmuY performs a memory access in cycle t (wext(mmutY )).
\[ pte(mmu_Y^t) = pte_\sigma^{NS(t)} \]
Proof of lemma 69. From the construction we know that the hardware performs extension of
the user walk iff the nested control automaton of mmuY resides in state fetch-pteU .
wext(mmutY ) → ( mmutY .fetch-pteU ↔ wextU (mmutY ) ) (29)
First we argue about the page table entry addresses.
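\[
\begin{aligned}
ptea(mmu_Y^t) &=
\begin{cases}
ptea_U(mmu_Y^t) & wext_U(mmu_Y^t) \\
ptea_G(mmu_Y^t) & \text{otherwise}
\end{cases}
&& \text{(definition; equation 29)} \\
&=
\begin{cases}
ptea(mmu_Y^t.w_U \circ mmu_Y^t.w_G) & wext_U(mmu_Y^t) \\
ptea(mmu_Y^t.w_G) & \text{otherwise}
\end{cases}
&& \text{(lemma 20)} \\
&= ptea(s(NS(t))) && \text{(stepping)}
\end{aligned}
\]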
For the output of cache caYT we derive the following.
wext(mmutY ) → pdout(21{Y=E})t = tmoutNS(t)σ (30)
Proof of equation 30. According to lemmas from Sect. 6.2.4 we know there is an access to
cache caYT ending in cycle t (lemma 53)
∃k : e(21{Y=E},k) = t ∧acc(21{Y=E},k).r
and it is the only non-flushing access ending in cycle t (lemmas 54 and 55).
\[ \forall j \neq 2\,\mathbf{1}\{Y=E\}\ \ \forall k : e(j,k) = t \rightarrow acc(j,k).f \qquad (31) \]
For convenience we abbreviate
\[ a = mmu_Y^t.ma.l = ptea(mmu_Y^t).l = ptea(s(NS(t))).l \]
and derive:
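\[
\begin{aligned}
pdout(2\,\mathbf{1}\{Y=E\})^t &= \Delta_M^{seq(2\,\mathbf{1}\{Y=E\},k)}(m(h^0), acc')(a) && \text{(lemma 52)} \\
&= \Delta_M^{NE(t)+n_Y}(m(h^0), acc')(a) && \text{(where } n_Y \le ne(t)\text{)} \\
&= \Delta_M^{NE(t)}(m(h^0), acc')(a) && \text{(equation 31)} \\
&= m(h^t)(a) && \text{(lemma 51)} \\
&= \ell(c^{NS(t)}.m)(a) && \text{(IH)} \\
&= dataout(\ell(c^{NS(t)}.m), tacc_\sigma^{NS(t)}) && \text{(definition)} \\
&= tmout_\sigma^{NS(t)} && \text{(definition).} \qquad \square
\end{aligned}
\]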
Finally, for the page table entries we argue as follows.
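\[
\begin{aligned}
pte(mmu_Y^t) &=
\begin{cases}
pdout(2\,\mathbf{1}\{Y=E\})^t_H & ptea(mmu_Y^t)[2] \\
pdout(2\,\mathbf{1}\{Y=E\})^t_L & \text{otherwise}
\end{cases}
&& \text{(interconnect)} \\
&=
\begin{cases}
(tmout_\sigma^{NS(t)})_H & ptea(s(NS(t)))[2] \\
(tmout_\sigma^{NS(t)})_L & \text{otherwise}
\end{cases}
&& \text{(equation 30)} \\
&= pte_\sigma^{NS(t)} && \text{(lemma 14)} \qquad \square
\end{aligned}
\]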
In addition, we require that the data used to serve the translation queries are valid.
Lemma 70.
(pto,npto,asid)(t) = (pto,npto,asid)(cNS(t))
The proof of lemma 70 is trivial, and therefore we omit it.
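[Figure 31, diagram not reproduced: for cycles t−1 to t+2 it shows the MMU configurations of mmuI and mmuE , the general computations c˜I and c˜E , and the ISA configurations cNS(t) and cNS(t+1); the step inputs are labeled x and y, and the two proof goals are marked (a) and (b).]
Fig. 31: TLB step performed by mmuE in cycle t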
Translation Queries
To show correctness of the translation queries we assume an active TLB transition, i.e., a TLB
step, is performed by mmuY in cycle t (tstepY (t)). In the induction step we have to show both,
\[ sim_{ISAtlb}(\tilde{c}_I^{t+1}, c^{NS(t+1)}) \quad\text{and}\quad sim_{ISAtlb}(\tilde{c}_E^{t+1}, c^{NS(t+1)}). \]
The proof is illustrated in Fig. 31, where the latter two goals are denoted resp. by (a) and (b).
W.l.o.g. we assume that the step is performed by mmuE , and proceed to show (b), since (a)
turns out to be completely trivial in this case.
As usual, we denote the input for the step performed in the general semantics by y, whereas
the input of the corresponding step of the original semantics by x. Note, for the definitions of
the stepping inputs below we refer to pages 151 and 147 resp.
y = s˜E(t)
x = s(NS(t))
First we argue that the TLB step performed in the original semantics using input x is TLB
equivalent to a step performed in the general semantics using input y. Aiming at application
of lemma 17, according to Sect. 3.4.3, we proceed to show that the following conditions are
satisfied by input y. We split cases on the step type:
• step of walk initialization.
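\[
winit_G(mmu_E^t) \;\rightarrow\;
\begin{cases}
y.upa = vmid(c^{NS(t)}) \circ 0^8 \circ x.a \\
y.pto = pto(c^{NS(t)})
\end{cases}
\]
\[
winit_U(mmu_E^t) \;\rightarrow\;
\begin{cases}
y.upa = asid(c^{NS(t)}) \circ x.a \\
y.pto = npto(c^{NS(t)})
\end{cases}
\]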
• step of walk extension.
\[
wext_X(mmu_E^t) \;\rightarrow\;
\begin{cases}
y.pte = pte_\sigma^{NS(t)} \\
y.w = x.w_1
\end{cases}
\]
All these conditions are established simply by unfolding the definitions and using the lemmas
shown previously. For instance, on initialization of a user walk we argue as follows. For the
page table origin used in the general semantics we have
y.pto = mmutE .ptoU (definition)
= npto(t) (interconnect)
= npto(cNS(t)) (lemma 70)
whereas for the universal page address passed by input y we show
y.upa = upaU (mmutE) (definition)
= mmutE .upa.as◦upaU (mmutE).pa (construction)
= asid(t)◦upaU (mmutE).pa (interconnect)
= asid(cNS(t))◦upaU (mmutE).pa (lemma 70)
= asid(cNS(t))◦ x.a (definition).
On steps of the walk extension we argue similarly. For the page table entry used in the general
semantics we have
y.pte = pte(mmutE) (definition)
= pteNS(t)σ (lemma 69)
whereas for the universal walk passed by input y we show
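\[
\begin{aligned}
y.w &=
\begin{cases}
mmu_E^t.w_G & wext_G(mmu_E^t) \\
mmu_E^t.w_U & \text{otherwise}
\end{cases}
&& \text{(definition)} \\
&= x.w_1 && \text{(definition).}
\end{aligned}
\]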
Even though the proofs turn out to be completely straightforward, one has to repeat them for other
machine types in order to rely on correctness of the MMU construction.
Now, using auxiliary configuration c′ s.t.
\[
c'.Z =
\begin{cases}
\delta_{tlb}(c^{NS(t)}.tlb, y) & Z = tlb \\
c^{NS(t)}.Z & \text{otherwise,}
\end{cases}
\]
we finish the proof of (b) as follows.
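\[
\begin{aligned}
sim_{ISAtlb}(\tilde{c}_E^t, c^{NS(t)}) &\rightarrow sim_{ISAtlb}(\tilde{c}_E^{t+1}, c') && \text{(lemma 18)} \\
&\rightarrow sim_{ISAtlb}(\tilde{c}_E^{t+1}, c^{NS(t+1)}) && \text{(lemma 17)}
\end{aligned}
\]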
Finally, it remains to show (a). We obtain it directly from the induction hypothesis
\[ sim_{ISAtlb}(\tilde{c}_I^t, c^{NS(t)}) \;\rightarrow\; sim_{ISAtlb}(\tilde{c}_I^{t+1}, c^{NS(t+1)}) \]
since on steps of mmuE we obviously have
\[
\begin{aligned}
\tilde{c}_I^{t+1}.tlb &= \tilde{c}_I^t.tlb \\
c^{NS(t+1)}.tlb &\supseteq c^{NS(t)}.tlb.
\end{aligned}
\]
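The shape of the argument for goals (a) and (b) can be replayed on finite sets. The following is a minimal Python sketch, assuming that the TLB simulation can be read as set inclusion of the general TLB in the ISA TLB (as suggested by the induction base) and that a walk-creating step only adds a walk; walks are opaque tuples here, and the names `sim_isa_tlb` and `tlb_add` are hypothetical.

```python
def sim_isa_tlb(general_tlb: set, isa_tlb: set) -> bool:
    """Assumed reading of the simulation: the ISA TLB contains every walk of the general TLB."""
    return general_tlb <= isa_tlb


def tlb_add(tlb: set, walk) -> set:
    """A walk initialization or extension step only adds a walk."""
    return tlb | {walk}


# State before the step of mmuE: both general TLBs are covered by the ISA TLB.
gen_I, gen_E = {("w", 1)}, {("w", 2)}
isa = {("w", 1), ("w", 2), ("w", 3)}

# mmuE adds a walk; the same walk is added on the ISA side, mmuI does not move.
new_walk = ("w", 4)
gen_E_next = tlb_add(gen_E, new_walk)
gen_I_next = gen_I
isa_next = tlb_add(isa, new_walk)

goal_a = sim_isa_tlb(gen_I_next, isa_next)  # unchanged general TLB, larger ISA TLB
goal_b = sim_isa_tlb(gen_E_next, isa_next)  # same walk added on both sides
```

Goal (a) only needs that the ISA TLB does not shrink, whereas goal (b) needs that both sides are stepped with equivalent inputs, which is what the conditions on y establish.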
7.4.2 Invalidation of TLB
To prove correctness of the invalidation queries we exploit that the following conditions are
satisfied by the interface between the processor core and the MMU.
Lemma 71. Assume that an invalidating instruction is executed in cycle t (istep(t)).
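\[
\begin{aligned}
mmu_Y^t.invlpg &\leftrightarrow invlpg_\sigma^{NS(t)} \wedge user(c^{NS(t)}) && (1) \\
mmu_Y^t.vmflush &\leftrightarrow flusht_\sigma^{NS(t)} \wedge guest(c^{NS(t)}) && (2) \\
mmu_Y^t.flush &\leftrightarrow flusht_\sigma^{NS(t)} \wedge host(c^{NS(t)}) && (3)
\end{aligned}
\]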
Moreover, for the invalidation input of the TLB component we have:
\[ mmu_Y^t.inva = inva_\sigma^{NS(t)} \qquad (4) \]
The proof of lemma 71 is given below. For better presentation we merge parts 1–3.
Proof of lemmas 71.1–3. From simulation of implementation registers we immediately obtain
that the hardware and the ISA execute the same instruction
ht .I = Ii(t)σ
since the invalidating steps of the processor core are possible only in state EX. From inter-
connect we know that the invalidating instruction executed by the processor core is executed
uninterrupted.
istep(t) → cstep(t)∧ exec(t) (definition)
→ cstep(t)∧ (cont(t)∨/ jisr(t)) (definition)
→ execi(t)σ (lemma 68)
Having this, using correctness of the instruction decoder implementation we obtain the desired
result just as we regularly do for other predicates and function fields [PBLS16]. Thus, for the
invalidating instructions we conclude
\[
\begin{aligned}
invlpg(t) &\leftrightarrow invlpg_\sigma^{i(t)} \\
flusht(t) &\leftrightarrow flusht_\sigma^{i(t)}
\end{aligned}
\]
and the claim follows by lemma 64. □
Proof of lemma 71.4. The invalidation address is taken directly from the GPR. For the ASID
to be invalidated we argue:
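\[
\begin{aligned}
mmu_Y^t.inva.as &=
\begin{cases}
vmid(t) \circ A(t)[27:20] & guest(t) \\
A(t)[31:20] & \text{otherwise}
\end{cases}
&& \text{(interconnect)} \\
&=
\begin{cases}
vmid(c^{NS(t)}) \circ A_\sigma^{NS(t)}[27:20] & guest(c^{NS(t)}) \\
A_\sigma^{NS(t)}[31:20] & \text{otherwise}
\end{cases}
&& \text{(IH)} \\
&= inva_\sigma^{NS(t)}.as && \text{(definition).}
\end{aligned}
\]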
Similarly, for the invalidated page address we derive:
\[
\begin{aligned}
mmu_Y^t.inva.pa &= B(t)[31:12] && \text{(interconnect)} \\
&= B_\sigma^{NS(t)}[31:12] && \text{(IH)} \\
&= inva_\sigma^{NS(t)}.pa && \text{(definition).} \qquad \square
\end{aligned}
\]
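The composition of the invalidation address from the two GPR operands can be made concrete. The following is a minimal Python sketch, not the interconnect of the construction: the 4-bit width of vmid is inferred from the 12-bit address space tag (vmid ◦ A[27 : 20] versus A[31 : 20]) and is an assumption, as are the names `bits` and `inva` used as Python functions.

```python
from typing import Tuple


def bits(x: int, hi: int, lo: int) -> int:
    """Bit field x[hi:lo] as an integer."""
    return (x >> lo) & ((1 << (hi - lo + 1)) - 1)


def inva(A: int, B: int, vmid: int, guest: bool) -> Tuple[int, int]:
    """Return (as, pa): the 12-bit address space tag and the 20-bit page address."""
    if guest:
        tag = (vmid << 8) | bits(A, 27, 20)  # vmid o A[27:20]
    else:
        tag = bits(A, 31, 20)                # A[31:20]
    return tag, bits(B, 31, 12)              # pa = B[31:12]


as_guest, pa_guest = inva(0xABC00000, 0x12345678, vmid=0x5, guest=True)
as_other, pa_other = inva(0xABC00000, 0x12345678, vmid=0x5, guest=False)
```

In guest mode the upper four bits of the tag are forced to the current virtual machine ID, so a guest cannot name translations of another virtual machine regardless of the operand value.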
Invalidation Queries
To show correctness of the invalidation queries we assume a passive TLB transition, i.e., a
processor core step executing an invalidating instruction, is performed in cycle t (istep(t)). In
the induction step we have to show
\[ sim_{ISAtlb}(\tilde{c}_Y^{t+1}, c^{NS(t+1)}). \]
The proof is illustrated in Fig. 32. In contrast to the translation queries here we do not have to
differentiate between the MMUs, and can perform the proof for mmuY .
Just as before, we denote the input for the steps performed in the general semantics by y,
whereas the input of the corresponding step of the original semantics by x. Again, for the
definitions of the stepping inputs below we refer to pages 151 and 147 resp.
y = s˜Y (t)
x = s(NS(t))
[Figure 32, diagram not reproduced: for cycles t−1 to t+2 it shows the MMU configurations of mmuI and mmuE , the general computations c˜I and c˜E , and the ISA configurations cNS(t) and cNS(t+1); the step inputs are labeled x and y.]
Fig. 32: Invalidating step (of the processor core) performed in cycle t
We argue that in configuration NS(t) the step performed in the original semantics (using input
x) is TLB equivalent to a step performed in the general semantics (using input y) in cycle t.
Again aiming at application of lemma 17, according to Sect. 3.4.3, it suffices to show that the
following conditions are satisfied by input y. As usual we split cases on the step type:
• invalidation of one translation.
/user(t)∧ invl pg(t) → y.inva = invaNS(t)σ
• invalidation of all translations of certain guest.
guest(t)∧ f lusht(t) → y.vmid = vmid(cNS(t))
• invalidation of all translations.
host(t)∧ f lusht(t) → y = (drop,all)
The last case we have directly from the definition, i.e., there is nothing to show. For the
invalidation address (first case) we argue as follows.
y.inva = mmutY .inva (definition)
= invaNS(t)σ (lemma 71.4)
For the invalidated virtual machine ID we argue similarly.
y.vmid = mmutY .inva.vm (interconnect)
= vmid(t) (interconnect)
= vmid(cNS(t)) (lemma 70)
Once again, using auxiliary configuration c′ s.t.
\[
c'.Z =
\begin{cases}
\delta_{tlb}(c^{NS(t)}.tlb, y) & Z = tlb \\
c^{NS(t)}.Z & \text{otherwise,}
\end{cases}
\]
we complete the induction step for the TLB component exactly as above.
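\[
\begin{aligned}
sim_{ISAtlb}(\tilde{c}_Y^t, c^{NS(t)}) &\rightarrow sim_{ISAtlb}(\tilde{c}_Y^{t+1}, c') && \text{(lemma 18)} \\
&\rightarrow sim_{ISAtlb}(\tilde{c}_Y^{t+1}, c^{NS(t+1)}) && \text{(lemma 17)}
\end{aligned}
\]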
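The invalidation case relies on the fact that removing the same walks on both sides preserves inclusion. The following is a minimal Python sketch under simplifying assumptions: walks are modeled as (tag, page address) pairs, the virtual machine ID is taken to be the upper four bits of a 12-bit tag, invalidation of one translation removes exactly the matching pair, and the selector encoding and the name `tlb_drop` are hypothetical.

```python
def tlb_drop(tlb: set, selector) -> set:
    """Remove the walks matched by a drop input; walks are (tag, pa) pairs here."""
    kind, arg = selector
    if kind == "all":                     # invalidation of all translations
        return set()
    if kind == "vm":                      # all translations of one virtual machine
        return {w for w in tlb if (w[0] >> 8) != arg}
    return {w for w in tlb if w != arg}   # kind == "page": a single translation


general = {(0x5BC, 1), (0x5BC, 2)}
isa = general | {(0x6AA, 3)}
selectors = [("page", (0x5BC, 1)), ("vm", 0x5), ("all", None)]

# Inclusion of the general TLB in the ISA TLB survives every kind of drop step.
preserved = all(tlb_drop(general, s) <= tlb_drop(isa, s) for s in selectors)
```

The sketch depends on both sides receiving the same selector, which corresponds to the conditions on input y shown above for the three step types.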
7.5 Verifying Guard Conditions
In this section we show that the guard conditions are respected in all ISA steps until NS(t+1),
i.e.,
∀n < NS(t+1) : Γ (mcn,s(n)).
Directly from the induction hypothesis we obtain:
∀n < NS(t) : Γ (mcn,s(n)).
In case no steps are performed in cycle t, by definition we have
NS(t+1) = NS(t)
and the claim follows. Otherwise, we split cases on whether step
n = NS(t)
is a TLB step (see Sect. 7.5.1) or a processor core step (see Sect. 7.5.2).
Before we proceed to the induction step, for the implementation registers of the nested MMUs
we show the following: in case a walk extension step is performed by mmuY in cycle t, the
implementation register mmuY .wX contains valid walks on both, guest and user TLB steps (see
Sect. 5.2.1).
Lemma 72.
wextG(mmutY ) → valid(cNS(t).tlb,mmutY .wG) (1)
wextU (mmutY ) → valid(cNS(t).tlb,mmutY .wU ,mmutY .wG) (2)
Proof of lemma 72.1. Directly from the induction hypothesis we derive
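\[
\begin{aligned}
sim_{tlb}(mmu_Y^t, c^{NS(t)}) &\rightarrow sim_W(mmu_Y^t, c^{NS(t)}) && \text{(definition)} \\
&\rightarrow w_G(mmu_Y^t) \subseteq c^{NS(t)}.tlb && \text{(definition)} \\
&\rightarrow mmu_Y^t.w_G \in c^{NS(t)}.tlb && \text{(definition).}
\end{aligned}
\]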
From the construction of the nested MMU (see Sect. 4.3) we derive that the walk extended in
the guest walk register of mmuY is neither faulty nor complete.
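\[
\begin{aligned}
wext_G(mmu_Y^t) &\rightarrow mmu_Y^t.\text{fetch-pte}_G && \text{(construction)} \\
&\rightarrow /f(mmu_Y^t.w_G) \wedge /mmu_Y^t.w_G.\ell[0] && \text{(invariant 3.2)} \qquad \square
\end{aligned}
\]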
Proof of lemma 72.2. Again, from the induction hypothesis we derive
simtlb(mmu_Y^t, c^{NS(t)}) → simW(mmu_Y^t, c^{NS(t)}) (definition)
→ ⋃_X wX(mmu_Y^t) ⊆ c^{NS(t)}.tlb (definition)
→ ∧_X mmu_Y^t.wX ∈ c^{NS(t)}.tlb (construction).
From the construction of the nested MMU, we derive that the walks in the walk registers of
mmuY are neither faulty nor complete.
wextU(mmu_Y^t) → mmu_Y^t.fetch-pteU (construction)
→ ∧_X /f(mmu_Y^t.wX) ∧ /mmu_Y^t.wX.ℓ[0] (invariant 3.2)
From the arguments above we conclude
∧_X valid(c^{NS(t)}.tlb, mmu_Y^t.wX)
and it remains to show that i) the walks in the walk registers of mmuY are matching
mmutY .wU $ mmutY .wG
and ii) the composition of these walks is not faulty.
/f(mmu_Y^t.wU, mmu_Y^t.wG)
Both statements follow from part (3) of invariant 3, analogously to the arguments above. □
7.5.1 TLB Steps
Assume that step n is generated by mmuY .
s(n) ∈ Σtlb∧ tstepY (t)
We first consider execution at the guest level and then at the user level.
i) From the induction hypothesis we have
guest(cn).
From the interconnect and construction of the nested MMU we derive:
guest(t) → mmu_Y^t.upa ∉ AU (interconnect)
→ /treqU(mmu_Y^t) (definition)
→ /tstepYU(t) (definition).
Thus, in case mmuY performs initialization of a guest walk (winitG(mmutY )), for the step-
ping inputs by definition we have
s(n).(t, l) = (winit,guest)
which gives
TG(cn,s(n)).
In case mmuY performs extension of a guest walk (wextG(mmutY )), for the stepping inputs
by definition we have
s(n) = (wext, mmu_Y^t.wG, ⊥)
whereas from the first part of lemma 72 we have
valid(c^n.tlb, mmu_Y^t.wG).
From the arguments above we again conclude
TG(cn,s(n)).
ii) From the induction hypothesis we have
user(cn).
In case mmuY performs initialization or extension of a guest walk the arguments are iden-
tical to those presented in case i).
In case mmuY performs initialization of a user walk (winitU (mmutY )), for the stepping
inputs by definition we have
s(n).(t, l) = (winit,user)
which gives
TU (cn,s(n)).
In case mmuY performs extension of a user walk (wextU (mmutY )), for the stepping inputs
by definition we have
s(n) = (wext, mmu_Y^t.wU, mmu_Y^t.wG)
whereas from the second part of lemma 72 we have
valid(c^n.tlb, mmu_Y^t.wU, mmu_Y^t.wG).
From the arguments above we again conclude
TU (cn,s(n)).
7.5.2 Processor Core Steps
Assume that step n is performed by the processor core.
s(n) ∈ Σcore.
The stepping used in [LOP] differs from the one given above in Sect. 7.1.4, but it can easily be obtained from the induction hypothesis.
Lemma 73. On steps of the processor core (cstep(t)) the following holds:
used(wY)_σ^{i(t)} → s(NS(t)).wY = h^t.wY
Before we proceed, we establish two more auxiliary results.
used(wI)_σ^{i(t)} ∧ cstep(t) → IF ≤ cs(t) (32)
used(wE)_σ^{i(t)} ∧ cstep(t) → EX ≤ cs(t) (33)
Proof of equation 32.
cstep(t) ∧ IF ≰ cs(t) → cstep(t) ∧ jisr(t) (definition)
→ ∨_{k ≤ maxJ(cs(t))} mca(t)[k] (definition)
→ ∃k ≤ maxJ(cs(t)) : mca_σ^{i(t)}[k] = 1 (lemma 67)
→ il_σ^{i(t)} ≤ 2 (definition)
→ /used(wI)_σ^{i(t)} (definition) □
The proof of equation 33 repeats the proof of equation 32, and therefore is omitted.
Proof of lemma 73. Using equations 32–33, for walk wY we derive:
s(NS(t)).wY = (wY)_σ^{i(t)} (lemma 64)
= h^t.wY (IH). □
That the walks coming from the walk registers satisfy the guard conditions of the specification
was reflected in the induction hypothesis above. In case the first walk passed as the machine’s
input is used (for translation of the instruction address)
used(wI)_σ^n
it is contained in the corresponding TLB
h^t.wI ∈ { c^{i(t)}.tlb◦   if user(t)
         { c^{i(t)}.tlb    if guest(t)      (invariant 6)
       ∈ { c^n.tlb◦        if user(c^n)
         { c^n.tlb         if guest(c^n)    (lemma 64; IH)
and matches the corresponding translation request (by invariant 7 and lemma 64).
match((trqI)_σ^{i(c)}, h^t.wI) ↔ match((trqI)_σ^n, h^t.wI)
Applying lemma 73 (first part) we conclude
ΦI(cn,s(n)).
Analogous to the above, in case the second walk passed to the specification machine is used
(for translation of the effective address)
used(wE)_σ^n
it is contained in the corresponding TLB
wE .5q,t ∈ { c^{i(t)}.tlb◦   if user(t)
           { c^{i(t)}.tlb    if guest(t)      (invariant 6)
         ∈ { c^n.tlb◦        if user(c^n)
           { c^n.tlb         if guest(c^n)    (lemma 64; IH)
and matches the corresponding translation request (by invariant 7 and lemma 64).
match((trqE)_σ^{i(c)}, h^t.wE) ↔ match((trqE)_σ^n, h^t.wE)
Finally, we apply lemma 73 (second part) and conclude
ΦE(cn,s(n)).
This completes the induction step, and therefore the entire correctness proof for our implemen-
tation of the sequential machine with nested address translation. In the next chapter we show
correctness of the NAT implementation in the pipelined multi-core machine.

Part IV
Multi-Core MIPS with NAT

8 Pipelined Processor with Nested MMUs
In Chap. 6 we presented the sequential implementation of the ISA specification from Sect. 3.3.
In this chapter we proceed to pipeline the processor core from the latter implementation. Thus,
in Sect. 8.1 we present the hardware mechanisms essential to implement the pipelined proces-
sor. Then, in Sect. 8.2 we collect some auxiliary machinery necessary to formally argue about
the pipelined designs. Section 8.3 is a counterpart of Sect. 6.2, where we connect the pipelined
processor to the cache memory system. Finally, in Sect. 8.4 we derive several crucial results
on liveness of the pipelined implementation required later in Chap. 9.
8.1 Pipelined Processor
The implementation presented in this section assumes the absence of external interrupts,
which according to [Sch13a] are provided by the advanced programmable interrupt controllers
(APICs). Note, in order to integrate the APICs into the pipelined multi-core processor from
Chap. 9, one should connect the external interrupt signals to the corresponding outputs of the
APICs on every processor core.
8.1.1 Stall Engine Summary
In order to control the execution of instructions in the pipelined machine we incorporate the stall-rollback engine, developed in [LOP] from the original stall engine used in [KMP14]. The construction of the new control mechanism between stages k−1 and k of the pipeline is given in Fig. 33. As depicted in the figure, the original stall engine (on the left) is augmented with additional circuitry and an additional register in every stage. The original definition can therefore be rewritten as follows.
stall_k = full_{k−1} ∧ (haz_k ∨ stall_{k+1})
ue_k = full_{k−1} ∧ /stall_k ∧ /(rbp_{k−1} ∨ rbr_k)
full_k^{t+1} = stall_{k+1}^t ∧ /rollback_k^t ∨ ue_k^t
According to the definition above, the newly added hardware (on the right) is used i) to prevent updates and ii) to clear the pipeline stages. The definitions are collected below.
rbr_k = misspec_k ∨ rbr_{k+1}
rollback_{k−1} = (rbp_{k−1} ∨ rbr_k) ∧ /rhaz_k
rbp_{k−1}^{t+1} = (rbp_{k−1}^t ∨ rbr_k^t) ∧ rhaz_k^t
For all stages, registers rbp are initialized with zeros on reset, just like the full bits. For the
write back stage (k = 7) we override the definitions and use
stall7 = 0
rbr7 = 0.
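The definitions above can be exercised in a small simulation. The following is a minimal sketch of one cycle of the stall-rollback engine, not the construction from [LOP]: the function name `engine_step`, the list-based encoding, and the treatment of index 0 as a constant input stage are assumptions made for illustration only.

```python
N = 7  # pipeline stages 1..N; stage N is the write back stage


def engine_step(full, rbp, haz, rhaz, misspec):
    """Compute one cycle of the stall-rollback engine (illustrative sketch).

    All arguments are lists of booleans indexed 0..N. Index 0 models the
    input of stage 1; keeping full[0] constant is an assumption.
    Returns (next_full, next_rbp, ue).
    """
    stall = [False] * (N + 2)  # stall[N] = 0 by the write back override
    rbr = [False] * (N + 2)    # rbr[N] = 0 by the write back override
    for k in range(N - 1, 0, -1):
        stall[k] = full[k - 1] and (haz[k] or stall[k + 1])
        rbr[k] = misspec[k] or rbr[k + 1]

    ue = [False] * (N + 1)
    rollback = [False] * (N + 1)
    for k in range(1, N + 1):
        ue[k] = full[k - 1] and not stall[k] and not (rbp[k - 1] or rbr[k])
        rollback[k - 1] = (rbp[k - 1] or rbr[k]) and not rhaz[k]

    next_full = list(full)
    next_rbp = list(rbp)
    for k in range(1, N + 1):
        next_full[k] = (stall[k + 1] and not rollback[k]) or ue[k]
        next_rbp[k - 1] = (rbp[k - 1] or rbr[k]) and rhaz[k]
    return next_full, next_rbp, ue
```

Under these assumptions, a misspeculation in stage 5 of a full pipeline without hazards clears the full bits of stages 1 to 5 in the next cycle, while an active rollback hazard in stage 2 keeps stage 1 full and sets its rollback-pending bit instead.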
Fig. 33: Stall-rollback engine hardware between stages k−1 and k of the pipeline
Thus, whenever the hardware detects that a pipeline stage does not operate on relevant data, i.e., a misspeculation is detected in one of the stages, all stages above it in the pipeline must be cleared. During the development of [LOP] it turned out that changes to the stall engine were unavoidable. The modified construction, called the stall-rollback engine, was proposed by Jonas Oberhauser. As depicted in Fig. 33, the stalling mechanism of [KMP14] has undergone significant changes.
Signal misspec raised in stage k propagates through the pipeline and activates the rollback request signals in all stages starting from k, i.e., in stage k and in all stages above it. Once stage k is ready for rollback (for instance, when the memory busy signal becomes low in stage k)¹, the full bit of stage k−1 is cleared (unless stage k−1 is updated in the same cycle). Readiness is signalled by an inactive rollback hazard for stage k, and the clearing is achieved by activating the rollback signal for stage k−1, as depicted. Otherwise, if stage k is not yet ready for rollback, the rollback-pending bit is set for stage k−1 and the full bit is not cleared. This rollback-pending bit effectively acts as a rollback request signal for stage k while the rollback hazard for this stage is high.
Stages stabilized by rollback-pending bits propagate stall signals in the same way as before. As a necessary operating condition we require that the rollback hazard in stage k is active only if the stage above k is full and there is an active hazard signal in stage k.
rhaz_k → full_{k−1} ∧ haz_k (34)
The construction of the stall-rollback engine can be found in full detail in [LOP]. For convenience we introduce the shorthand
rfull_k = full_k ∧ /rbp_k
for the real full bit at stage k, and equivalently rewrite the definitions of the update enable signals
ue_k = full_{k−1} ∧ /stall_k ∧ /rbp_{k−1} ∧ /rbr_k
= rfull_{k−1} ∧ /stall_k ∧ /rbr_k (35)
and the rollback-pending bits.
1 Whenever the memory busy signal is low, according to the operating conditions for the cache mem-
ory system [KMP14], the memory input signals are not required to be stable. Therefore, the stage
providing the latter input signals can be rolled-back.
rbp_{k−1}^{t+1} = (rbp_{k−1}^t ∨ rbr_k^t) ∧ rhaz_k^t
= (rbp_{k−1}^t ∨ rbr_k^t) ∧ (rhaz_k^t ∨ /(rbp_{k−1}^t ∨ rbr_k^t))
= (rbp_{k−1}^t ∨ rbr_k^t) ∧ /rollback_{k−1}^t (36)
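Since equations 35 and 36 are purely propositional rewritings, they can be confirmed by enumerating all signal values. The sketch below does exactly that; the function name `check_rewrites` and the variable names are illustrative assumptions and do not appear in the original construction.

```python
from itertools import product


def check_rewrites():
    """Exhaustively compare the original and rewritten definitions."""
    for full_p, stall_k, rbp_p, rbr_k, rhaz_k in product([False, True], repeat=5):
        # full_p and rbp_p stand for full_{k-1} and rbp_{k-1}
        rfull_p = full_p and not rbp_p

        # equation 35: update enable expressed with the real full bit
        ue_def = full_p and not stall_k and not (rbp_p or rbr_k)
        ue_35 = rfull_p and not stall_k and not rbr_k
        if ue_def != ue_35:
            return False

        # equation 36: rollback-pending bit expressed with the rollback signal
        rollback_p = (rbp_p or rbr_k) and not rhaz_k
        rbp_def = (rbp_p or rbr_k) and rhaz_k
        rbp_36 = (rbp_p or rbr_k) and not rollback_p
        if rbp_def != rbp_36:
            return False
    return True
```

The check covers all 32 assignments of the five signals involved, which is sufficient because both rewritings are combinational identities within a single cycle.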
In the following lemmas we capture some of the important properties of the new control mech-
anism.
Lemma 74.
1. Stall signals prevent stages above from update:
stall_k → /ue_{k−1}
2. Rollback hazards prevent stages above from update:
rhaz_k → /ue_{k−1}
3. Pending rollbacks are set only at full stages:
rbp_k → full_k
Proof of lemma 74.1. In case the stage above k−1 has no full bit, the claim follows directly from the definition.
/full_{k−2} → /ue_{k−1}
Otherwise, using the definition of the stall signal we argue
full_{k−2} ∧ stall_k → stall_{k−1}
and the claim follows again directly from the definition.
stall_{k−1} → /ue_{k−1} □
Proof of lemma 74.2. From the operating condition of the stall engine (equation 34) we conclude
rhaz_k → full_{k−1} ∧ haz_k.
The latter by definition implies that the stall signal in stage k is active
full_{k−1} ∧ haz_k → stall_k
and the claim follows by the first part (of lemma 74). □
Proof of lemma 74.3. By induction on the number of hardware cycles t. For the base case (t = 0) there is nothing to show, since after the hardware reset we clearly have
rbp_k^0 = 0.
For the induction step from t to t+1 we argue as follows. The rollback-pending bits are set only in the presence of the rollback hazard signals.
rbp_k^{t+1} → rhaz_{k+1}^t
Repeating the proof lines for the second part (of lemma 74) we derive that stage k+1 is stalled in cycle t.
rhaz_{k+1}^t → stall_{k+1}^t
Moreover, stage k is not rolled back in the presence of the rollback hazard in cycle t
rhaz_{k+1}^t → /rollback_k^t
which by definition completes the induction step.
stall_{k+1}^t ∧ /rollback_k^t → full_k^{t+1} □
Lemma 75.
1. Updates of stages set the real full bits:
ue_k^t → rfull_k^{t+1} = 1.
2. Rollback requests clear the real full bits:
rbr_{k+1}^t → rfull_k^{t+1} = 0.
3. Truly full stages are not overwritten:
rfull_k^t ∧ ue_k^t → ue_{k+1}^t.
4. Stages become truly full only after updates:
/rfull_k^t ∧ rfull_k^{t+1} → ue_k^t.
5. Stages become empty only after updates or rollback requests from stages below:
rfull_k^t ∧ /rfull_k^{t+1} → ue_{k+1}^t ∨ rbr_{k+1}^t.
6. Unless a stage is updated, update of a stage below clears the real full bit:
/ue_k^t ∧ ue_{k+1}^t → rfull_k^{t+1} = 0.
Proof of lemma 75.1. For stage k updated in cycle t by definition we have
ue_k^t → full_k^{t+1}.
From the second part of lemma 74 we also have
ue_k^t → /rhaz_{k+1}^t.
We conclude that stage k has no rollback-pending bit in cycle t+1
/rhaz_{k+1}^t → /rbp_k^{t+1}
and the claim follows. □
Proof of lemma 75.2. In case the rollback hazard signal for stage k+1 is active in cycle t, the claim follows directly from the definition.
rhaz_{k+1}^t ∧ rbr_{k+1}^t → rbp_k^{t+1}
Otherwise, using the definition of the rollback signal we argue
/rhaz_{k+1}^t ∧ rbr_{k+1}^t → rollback_k^t,
and for the update enable signal of stage k in cycle t we derive
rbr_{k+1}^t → rbr_k^t → /ue_k^t.
The claim follows by definition from the arguments above.
rollback_k^t ∧ /ue_k^t → /full_k^{t+1} □
Proof of lemma 75.3. By contradiction, assume
ue_{k+1}^t = 0.
From the definition of the update enable signal we conclude the following.
/rfull_k^t ∨ stall_{k+1}^t ∨ rbr_{k+1}^t
From the assumptions we know that stage k has a real full bit in cycle t. From the definition of the stall signal we derive
stall_{k+1}^t ∧ rfull_k^t → stall_k^t
which contradicts the assumptions, since by definition of the update enable signal we have
stall_k^t → /ue_k^t.
Finally, in case the rollback request signal for stage k+1 is active in cycle t, we derive
rbr_{k+1}^t → rbr_k^t → /ue_k^t
which gives a contradiction, and therefore completes the proof. □
Proof of lemma 75.4. From the definition of the real full bit we have the following.
/full_k^t ∨ rbp_k^t
In case stage k is not full in cycle t, from the definition of the stall signal we have
/full_k^t → /stall_{k+1}^t.
Using the definition of the full bits we conclude
/stall_{k+1}^t ∧ /ue_k^t → /full_k^{t+1}
and the claim follows. Otherwise, if the rollback-pending bit for stage k is set in cycle t, we argue as follows. In case the rollback signal for stage k is active in cycle t, using the definition of the full bits we derive
rollback_k^t ∧ /ue_k^t → /full_k^{t+1}.
Finally, in case the rollback signal for stage k is inactive in cycle t, using the definition of the rollback-pending bits we obtain
rbp_k^t ∧ /rollback_k^t → rbp_k^{t+1}. □
Proof of lemma 75.5. From the definitions of the update enable (equation 35) and rollback signals we respectively have
rfull_k^t ∧ /rbr_{k+1}^t ∧ /ue_{k+1}^t → stall_{k+1}^t
and
/rbp_k^t ∧ /rbr_{k+1}^t → /rollback_k^t.
From the definitions of the full and the rollback-pending bits we respectively have
stall_{k+1}^t ∧ /rollback_k^t → full_k^{t+1}
and
/rbp_k^t ∧ /rbr_{k+1}^t → /rbp_k^{t+1}. □
Proof of lemma 75.6. From the definition of the update enable signal we have
ue_{k+1}^t → /stall_{k+1}^t.
Using the definition of the full bits we derive
/stall_{k+1}^t ∧ /ue_k^t → /full_k^{t+1}
and the claim follows. □
Fig. 34: Instruction address computation
8.1.2 Instruction Address
The instruction address in the seven-stage pipelined machine is taken from the pc, dpc, or ddpc, depending on the number of real full stages above the ID stage. If there are no real full stages, the instruction address is taken from the ddpc. If only one stage has a real full bit, we use the dpc to fetch an instruction. Finally, if both stages have real full bits, we use the pc. Formally, we specify the instruction address as follows.
ia_π = { ddpc_π   if /rfull_1 ∧ /rfull_2
       { dpc_π    if rfull_1 ⊕ rfull_2
       { pc_π     if rfull_1 ∧ rfull_2
The implementation (as shown in Fig. 34) is trivial.
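The case split for the instruction address can be written as a three-way selection. The following is a minimal sketch; the function name `instruction_address` and the plain integer encoding of the program counters are assumptions for illustration.

```python
def instruction_address(pc, dpc, ddpc, rfull1, rfull2):
    """Select the instruction address from pc, dpc or ddpc.

    rfull1 and rfull2 are the real full bits of the two stages above ID.
    """
    if rfull1 and rfull2:
        return pc      # both stages really full
    if rfull1 != rfull2:
        return dpc     # exactly one stage really full
    return ddpc        # no stage really full
```

For example, `instruction_address(12, 8, 4, False, False)` selects the ddpc value 4.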
8.1.3 Interrupt Cause Pipeline
The internal events are collected gradually, throughout the instruction execution, within stages
1–5. The external event signals are collected in stage 6. We use the following shorthands to
group the event signals collected in the various stages.
ev(1) = 0^6 ∘ gff ∘ pff ∘ malf ∘ 0^2
ev(2) = 0^11
ev(3) = 0^4 ∘ sysc ∘ ill ∘ 0^5
ev(4) = 0^2 ∘ malm ∘ ovf ∘ 0^7
ev(5) = gfm ∘ pfm ∘ 0^9
ev(6) = 0^9 ∘ e ∘ reset
The first five of the signals above are fed into the registers of the cause pipeline.
ca.1_π.in = ev(1)
k ∈ [2 : 5] → ca.k_π.in = ev(k) ∨ ca.(k−1)_π
The schematic construction of the cause pipeline is depicted in Fig. 35. Note that all stages of the cause pipeline consist of registers of the same size (11 bits wide), though in the “early” stages very few bits of these registers are actually in use. This could easily be fixed, but we refrain from doing so in favor of a simpler implementation. For convenience, below we introduce short acronyms to refer to particular bits of the cause pipeline registers in the various stages. Naturally, the names are chosen to mimic the names of the corresponding event signals.
Fig. 35: Collecting event signals in the cause pipeline
index    10  9   8   7   6   5   4   3   2
acronym  gm  pm  mm  of  sc  il  gf  pf  mf
Also for convenience we introduce the following notation. For acronym x we define:
ca.k_π[>x] ↔ ⟨ca.k_π[x : 0]⟩ = 0.
The computation of most internal event signals remains straightforward: it precisely follows the specifications from Sect. 2.3. In contrast to the sequential implementation, the page faults and general-protection faults are explicitly lowered in the presence of other interrupts of lower priority.² For instance, the page faults on fetch and on memory operation are respectively computed as follows.
pff = /host ∧ ca.1_π.in[>mf] ∧ f(mmuI.wout)
pfm = /host ∧ ca.5_π.in[>mm] ∧ f(mmuE.wout) ∧ mop.4_π
The computation of the masked cause signals is performed in accordance with the ISA:
mca(1) = ev(1)
k ∈ [2 : 5] → mca(k) = ev(k) ∨ ca.(k−1)_π
mca(6) = (ev(6) ∨ ca.5_π) ∧ imask
where the interrupt mask above is obtained from the status register.
imask = 1^9 ∘ sr_π[1] ∘ 1
It is easy to see that the computation of the page fault signals above can be equivalently rewritten as follows.
pff = /host ∧ mca(1)[>mf] ∧ f(mmuI.wout)
pfm = /host ∧ mca(5)[>mm] ∧ f(mmuE.wout) ∧ mop.4_π
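The bit-level conventions used for the cause pipeline can be illustrated on 11-bit integers. This is a minimal sketch under stated assumptions: the names `IDX`, `none_up_to`, `imask`, `mca6` and `page_fault_on_fetch` are hypothetical, cause registers are encoded as Python integers with bit 0 as the least significant bit, and `walk_faulty` stands for f(mmuI.wout).

```python
# Bit indices of the acronyms in the cause pipeline registers.
IDX = {"gm": 10, "pm": 9, "mm": 8, "of": 7, "sc": 6,
       "il": 5, "gf": 4, "pf": 3, "mf": 2}


def none_up_to(ca, x):
    """ca[>x]: bits x down to 0 of the cause register are all zero."""
    low_bits = (1 << (IDX[x] + 1)) - 1
    return ca & low_bits == 0


def imask(sr1):
    """Interrupt mask 1^9 o sr[1] o 1 as an 11-bit integer."""
    return 0b11111111101 | (int(sr1) << 1)


def mca6(ev6, ca5, sr1):
    """Masked cause in stage 6."""
    return (ev6 | ca5) & imask(sr1)


def page_fault_on_fetch(host, mca1, walk_faulty):
    """pff = /host and mca(1)[>mf] and f(mmuI.wout)."""
    return (not host) and none_up_to(mca1, "mf") and walk_faulty
```

With this encoding, a misalignment on fetch (bit 2 set) lowers the page fault on fetch, because `none_up_to(0b100, "mf")` is false.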
As depicted in Fig. 35, two more invisible registers are involved in the cause processing in stage 5 (apart from ca.5): jisr.5 and cont.5. The inputs of these registers are computed using the masked cause signal in stage 5.
jisr.5_π.in ≡ mca(5) ≠ 0^11
cont.5_π.in ≡ f1(mca(5))[7 : 6] ≠ 0^2
Note that the external event signals are ignored in the definitions above. Finally, using the latter registers, the processor control signals jisr and cont are generated in the memory stage. The auxiliary control signal exec is defined exactly as in Chap. 6.
2 Recall, in the sequential machine, execution of the current instruction is aborted once an unmasked
interrupt of a non-continue type is discovered (see Sect. 6.1.2).
jisr = ue_6 ∧ jisr.5_π
cont = ue_6 ∧ cont.5_π
exec ≡ jisr → cont
In the invariant below we justify the early computation of the processor control signals.
Invariant 8. In case stage 5 has a real full bit (rfull_5), the following holds:
jisr.5_π ↔ ca.5_π ≠ 0^11 (1)
cont.5_π ↔ f1(ca.5_π)[7 : 6] ≠ 0^2 (2)
mca(6) = ca.5_π (3)
Proof of invariant 8.1. By induction on the number of hardware cycles t. For the base case (t = 0) there is nothing to show, since there are no full pipeline stages after reset (/rfull_5^0). For the induction step from t to t+1 we split cases on whether stage 5 is updated in cycle t.
• ue_5^t = 1. In case the stage is updated, we easily obtain the desired result.
jisr.5_π^{t+1} ↔ mca(5)^t ≠ 0^11 (interconnect)
↔ ev(5)^t ∨ ca.4_π^t ≠ 0^11 (definition)
↔ ca.5_π^{t+1} ≠ 0^11 (interconnect)
• ue_5^t = 0. In this case we assume that stage 5 has a real full bit in cycle t (rfull_5^t), since otherwise there is nothing to show (part (4) of lemma 75).
/rfull_5^t ∧ /ue_5^t → /rfull_5^{t+1}
To complete the induction step we argue as follows.
jisr.5_π^{t+1} = jisr.5_π^t (specification)
↔ ca.5_π^t ≠ 0^11 (induction hypothesis)
↔ ca.5_π^{t+1} ≠ 0^11 (specification) □
The proof of the second part of invariant 8 is analogous to the proof above, whereas the proof of the third part is trivial, given that the external event signals are off (eev = 0^2) in the scope of this chapter.
8.1.4 Ghost Pipeline for Translations in Use
In order to keep track of the translations used by the pipelined machine in the course of instruction execution, we introduce a ghost pipeline of walks, i.e., of outputs of the MMUs. For mmuI we introduce ghost registers
wI.k
in stages k ∈ [1 : 5]; for mmuE we introduce the ghost register
wE.5.
Obviously, we connect
wI.1.in = mmuI.wout
wE.5.in = mmuE.wout
and for k ∈ [2 : 5]:
wI.k.in = wI.(k−1).
Fig. 36: Pipelined MIPS processor connected to four caches and two (nested) MMUs
8.1.5 Connecting Components
In this chapter we consider a pipelined processor connected to a sequentially consistent cache memory system (as constructed in [KMP14]) with four caches: two caches for translation of the instruction and effective addresses, and two caches for the instruction fetch and the data access, respectively. For simplicity we use the following intuitive names to abbreviate the caches in use.
itca_π = h_π.ca(0)
ica_π = h_π.ca(1)
dtca_π = h_π.ca(2)
dca_π = h_π.ca(3)
As depicted in Fig. 36, the cache for translation of instruction addresses is itca_π, whereas the cache for translation of effective addresses is dtca_π. Also as depicted in Fig. 36, two nested MMUs are connected in the memory stage. In this subsection we formally interconnect all these components.
Interconnect of MMUs
First we connect the two MMUs both to the processor core and to the pair of caches. The following inputs are identical for both memory management units.
mmuY.ptoG = pto_π
mmuY.ptoU = npto_π
mmuY.invlpg = ue_6 ∧ exec ∧ invlpg.5_π ∧ user
mmuY.vmflush = ue_6 ∧ exec ∧ flusht.5_π ∧ guest
mmuY.flush = ue_6 ∧ exec ∧ flusht.5_π ∧ host
Translation requests on both MMUs are raised only at the levels of user or guest, in the absence of jisr, and only for input stages with real full bits. The translation request on mmuE stays low in case no memory operation is performed.
treqI = /host ∧ /jisr ∧ mca(1)[>mf] ∧ rfull_0
treqE = /host ∧ /jisr ∧ mca(5)[>mm] ∧ rfull_4 ∧ mop.4_π
In case of a rollback, the translation request is simply lowered and no pending bit is set for the corresponding translation stage (see the definition of the rollback hazard signals on p. 185). This is possible because the nested MMUs can handle aborts (see Sect. 4.2.2). Translation requests are also lowered to perform an invalidation request.
mmuY.treq = treqY ∧ /inval(mmuY)
Thus, the invalidation requests have priority over the translation requests.³ The remaining inputs of mmuI are connected as follows.
mmuI.upa = asid_π ∘ ia_π.pa
mmuI.mout = itca_π.pdout
mmuI.mbusy = itca_π.mbusy
For mmuE the respective inputs are connected similarly.
mmuE.upa = asid_π ∘ ea.4_π.pa
mmuE.mout = dtca_π.pdout
mmuE.mbusy = dtca_π.mbusy
Interconnect of Stall Engine
The ordinary hazard signals haz_1 and haz_5 below make sure that in translated mode the translation stages are not updated while the corresponding MMUs are requested. The remaining hazard signals do not change compared to [LOP].
haz_1 = treqI ∧ (mmuI.busy ∨ inval(mmuI)) ∨ drain
haz_2 = ica_π.mbusy
haz_3 = hazA ∨ hazB
haz_5 = treqE ∧ (mmuE.busy ∨ inval(mmuE))
haz_6 = dca_π.mbusy
As the name suggests, the drain signal above is required to drain the pipeline behind the eret instructions. Since the eret instructions are not necessarily executed (they are illegal at the level of user), for stages k ≥ 2 we introduce the interrupt return signals, which indicate the presence of only the “legal” eret instructions in the pipeline.
iret_k = { eret(3) ∧ mca(3)[>il]      if k = 2
         { eret.k_π ∧ ca.k_π[>il]     otherwise
Then the drain signal is defined simply as follows.
drain = ∨_{k≥2} rfull_k ∧ iret_k
In the absence of rollback hazard signals for the translation stages, the rollback hazard signals for the rollback engine are exactly as introduced in [LOP].
3 Recall, our simple construction of the nested MMU supports at most one request at a time. Later
on for this reason the steps of address translation will never be performed in cycles in which the
corresponding processor core executes an invalidating instruction.
rhaz_k = { ica_π.mbusy   if k = 2
         { 0             otherwise
The following invariant clearly follows from the construction of the stall engine (Sect. 8.1.1).
Invariant 9.
rbp_k → k = 1
The program counters are restored (from the corresponding exception registers) on activation of signal
pcres = rfull_2 ∧ iret_2.
No new misspeculation signals are introduced.
misspec_2 = pcres
misspec_5 = jisr
The lemmas below follow directly from the machine’s control mechanism.
Lemma 76.
rbr_k^t ∧ ue_{k+1}^t → ∀ j ∈ [1 : k] : /rfull_j^{t+1}
Proof of lemma 76. For stages j < k the claim follows simply by part (2) of lemma 75. For stage k we argue as follows. From the definitions of the update enable and stall signals we respectively have
rbr_k^t → /ue_k^t
and
ue_{k+1}^t → /stall_{k+1}^t.
The claim follows from the arguments above and the definition of the full bits.
/stall_{k+1}^t ∧ /ue_k^t → /full_k^{t+1} □
As should be clear from the definitions above, on pcres the real full bit of the IT stage is cleared. As the eret instruction progresses down through the pipeline, the drain signal stays active and effectively prevents new instructions from entering the pipeline. As a result, there are no real full stages behind the legal eret instruction.
Lemma 77. For stages k > 2 the following holds:
1. rfull_k^t ∧ iret_k^t → ∀ j ∈ [1 : k−1] : /rfull_j^t
2. ue_{k+1}^t ∧ iret_k^t → ∀ j ∈ [1 : k] : /rfull_j^{t+1}
Proof of lemma 77.1. By induction on the number of hardware cycles t. For the induction base (t = 0) there is nothing to show, since after the hardware reset we clearly have
∀k : /rfull_k^0.
For the induction step from t to t+1 we split cases on whether stage k is updated in cycle t or not:
• ue_k^t = 1. From the definition of the update enable signal we conclude
rfull_{k−1}^t ∧ iret_{k−1}^t.
For stage k = 3 by definition we further conclude
pcres(t)
and therefore rbr_2^t. The claim follows by lemma 76.
∀ j ∈ [1 : 2] : /rfull_j^{t+1}
For stages k > 3 by definition we conclude
drain(t)
and therefore /ue_1^t. From the induction hypothesis we have
∀ j ∈ [1 : k−2] : /rfull_j^t
and therefore
∀ j ∈ [1 : k−1] : /ue_j^t.
From the arguments above and part (4) of lemma 75 we derive
∀ j ∈ [1 : k−2] : /rfull_j^{t+1}.
For stage k−1 we conclude the claim using part (6) of lemma 75.
rfull_{k−1}^{t+1} = 0
• ue_k^t = 0. From part (4) of lemma 75 we derive
rfull_k^t ∧ iret_k^t
and therefore, using the induction hypothesis, we conclude
∀ j ∈ [1 : k−1] : /rfull_j^t.
Analogously to the case above (ue_k^t) we argue
∀ j ∈ [1 : k−1] : /ue_j^t.
Again, from the arguments above and part (4) of lemma 75 we derive
∀ j ∈ [1 : k−1] : /rfull_j^{t+1}. □
Proof of lemma 77.2. From the definition of the update enable signal (equation 35) we have
ue_{k+1}^t → rfull_k^t
and therefore using part (1) of lemma 77 we conclude
∀ j ∈ [1 : k−1] : /rfull_j^t.
Repeating the arguments from the proof above (part (1) of lemma 77) we argue as follows. First, we argue that stages above k+1 are not updated in cycle t.
∀ j ∈ [1 : k] : /ue_j^t
From the arguments above and part (4) of lemma 75 we conclude
∀ j ∈ [1 : k−1] : /rfull_j^{t+1}.
For stage k using part (6) of lemma 75 we derive
rfull_k^{t+1} = 0. □
Fig. 37: PC environment
PC Environment
The computation of the next configuration PCs from [KMP14] is generalized in the following aspects. Apart from additional circuitry added to support updates of the ddpc, there are extra multiplexers to restore all PCs from the corresponding exception registers on pcres. The implementation in Fig. 37 literally follows the specification from Sect. 2.4.3. Formally, for the pipelined machine we specify:
pc_π.in = { 8_32      if jisr
          { epcF      if pcres ∧ /jisr
          { nextpc    otherwise
dpc_π.in = { 4_32     if jisr
           { edpcF    if pcres ∧ /jisr
           { pc_π     otherwise
ddpc_π.in = { 0_32    if jisr
            { eddpcF  if pcres ∧ /jisr
            { dpc_π   otherwise
where the exception PCs epcF, edpcF, and eddpcF are forwarded from the memory stage (see Sect. 8.1.6). The program counters are updated on both jisr and pcres:
xpc.ce = ue_3 ∨ jisr ∨ pcres.
Note that there is no need to stabilize the PCs in our seven-stage pipeline. The instruction cache is accessed via the physical memory address register, which effectively acts as a latch for the instruction cache address in the five-stage pipeline from [LOP].
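The three case definitions and the common clock enable can be summarized in one update function. This is a minimal sketch only: the function name `pc_update`, the tuple encoding of (pc, dpc, ddpc) and the use of plain integers for 32-bit values are assumptions for illustration.

```python
def pc_update(pcs, nextpc, exc_pcs, ue3, jisr, pcres):
    """Return the next values of (pc, dpc, ddpc).

    pcs:     current (pc, dpc, ddpc)
    exc_pcs: forwarded exception PCs (epcF, edpcF, eddpcF)
    """
    pc, dpc, ddpc = pcs
    if not (ue3 or jisr or pcres):   # clock enable xpc.ce is low
        return pcs
    if jisr:                         # jisr has priority over pcres
        return (8, 4, 0)
    if pcres:                        # restore from exception registers
        return tuple(exc_pcs)
    return (nextpc, pc, dpc)         # ordinary shift of the PCs
```

Note that jisr is tested before pcres, which mirrors the condition pcres ∧ /jisr in the specification.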
8.1.6 Forwarding Mechanism
Due to possible moves to the status registers pending in the local pipeline, we extend the forwarding mechanism from [KMP14] to deliver the most recent values of the SPR registers to the ID stage. Thus, we specify the hit signals to identify the data written to registers S and Z ∈ {epc, edpc, eddpc} in pipeline stage k ∈ [3 : 5]:
hitS[k] = rfull_k ∧ sprw_π.k ∧ xad_π.k = rs_π
hitZ[k] = rfull_k ∧ sprw_π.k ∧ xad_π.k = spr[Z]
Naturally, in order to forward the most recent data (in case there are several moves to the same register) we incorporate the top hit signals.
topZ[k] = hitZ[k] ∧ ∧_{j<k} /hitZ[j]
Using the notation above, we easily specify the input of the S register
S_π.in = { C.4_π.in      if topS[3]
         { C.k_π         if topS[k] ∧ k > 3
         { spr_π(rs_π)   otherwise
and the forwarded exception PCs.
ZF = { C.4_π.in   if topZ[3]
     { C.k_π      if topZ[k] ∧ k > 3
     { Z_π        otherwise
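The selection by top hit signals amounts to taking the first hit when scanning stages 3 to 5 from top to bottom. The sketch below illustrates this reading; the function name `forward` and the dictionary encoding of the hit signals and C registers are assumptions for illustration.

```python
def forward(hit, c4_in, c, default):
    """Forward the most recent value written to a register.

    hit:     dict mapping stages 3..5 to the hit signal of that stage
    c4_in:   input of register C.4 (used on a top hit in stage 3)
    c:       dict mapping stages 4 and 5 to the contents of C.4 and C.5
    default: register value used when no stage hits
    """
    for k in (3, 4, 5):
        if hit[k]:  # first hit from the top is the top hit
            return c4_in if k == 3 else c[k]
    return default
```

Scanning in increasing stage order makes the explicit conjunction over all j < k unnecessary, since the loop returns at the first stage with an active hit signal.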
8.2 Machinery for Description of Pipelines
In this small section we collect the machinery necessary to formally argue about the construc-
tions presented in Sect. 8.1 above. Thus, the definitions of the execution mode and scheduling
functions are given in Sect. 8.2.1 and 8.2.2 resp. In Sect. 8.2.3 we introduce the notion of live
circuit stages, which turns out to be crucial to prove correctness of the pipelined implementa-
tion in Chap. 9.
8.2.1 Execution Modes
The execution mode of the pipeline is determined exactly as for the sequential machine, by the values of the least significant bits of the special purpose registers mode_π and nmode_π. We define the execution mode of the pipeline in cycle t as
mode(t) = nmode_π^t[0] ∘ mode_π^t[0].
Thus, in our simple construction the execution mode is defined for the entire pipeline, i.e., in
cycle t almost all instructions are executed at the same level of privilege. The only exception
is a cycle immediately after jisr, in which an instruction in the write-back stage can have a
different (higher) level of privilege. In the latter case, the instruction in the write-back stage
finishes execution and leaves the pipeline in the same cycle (i.e., the cycle after jisr).
As before, we recognize only three modes of execution, which we intuitively encode by using
the names: user, guest, and host.
mode(t) ∈ {user,guest,host}
Of course we order these three modes by the level of privilege as follows.
host< guest< user
For convenience we introduce the following shorthands.
user(t) ≡ mode(t) = user
guest(t) ≡ mode(t) = guest
host(t) ≡ mode(t) = host
In the following lemma we formalize the simple property that the pipeline mode changes only after execution of jisr or eret. Namely, the pipeline mode decreases only after execution of jisr, and increases only after execution of eret.
Lemma 78. Assume that the pipeline mode changes in cycle t.
mode(t) > mode(t+1) → ue_6^t ∧ jisr.5_π^t (1)
mode(t) < mode(t+1) → ue_6^t ∧ /jisr.5_π^t ∧ eret.5_π^t (2)
8.2.2 Scheduling Functions
In order to keep track of instructions executed in the various stages of the pipeline we incor-
porate the scheduling functions. In the pipeline with misspeculation and rollbacks we use the
following definition, first introduced in Chap. 7 of [LOP]:
I(k, 0) = 0
I(k, t+1) = \begin{cases}
I(k, t) + 1 & ue_k^t ∧ k = 1 \\
I(k−1, t) & ue_k^t ∧ k > 1 \\
I(R(t), t) & /ue_k^t ∧ rbr_k^t \\
I(k, t) & \text{otherwise}
\end{cases}
where R(t) denotes the maximal stage with an active rollback request signal, i.e.,
R(t) = \begin{cases}
\max\{k \mid rbr_k^t\} & ∃k : rbr_k^t \\
0 & \text{otherwise.}
\end{cases}
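A minimal Python sketch of one step of the scheduling functions, assuming the signals of cycle t are given as dictionaries from stage numbers to booleans. This representation and the names step and rollback_stage are illustrative and not part of the hardware description. In the code an update is tested before a rollback request; by the definition of the update enable signal the two are never active in the same stage.

```python
STAGES = range(1, 8)  # circuit stages 1..7


def rollback_stage(rbr):
    """R(t): maximal stage with an active rollback request, 0 if there is none."""
    active = [k for k in STAGES if rbr.get(k, False)]
    return max(active) if active else 0


def step(I, ue, rbr):
    """Compute I(k, t+1) for all stages k from I(., t), ue^t and rbr^t."""
    R = rollback_stage(rbr)
    nxt = {}
    for k in STAGES:
        if ue.get(k, False):
            # stage 1 fetches the next instruction, other stages inherit from above
            nxt[k] = I[k] + 1 if k == 1 else I[k - 1]
        elif rbr.get(k, False):
            nxt[k] = I[R]      # rolled-back stages are reset to I(R(t), t)
        else:
            nxt[k] = I[k]      # stage keeps its instruction
    return nxt
```

For example, starting from all zeros, two updates of stage 1 (the second together with stage 2) followed by a rollback request in stages 1 and 2 reset both stages to the instruction that was in stage 2.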
In the following lemmas we capture the most important properties of the scheduling functions.
Essentially, these lemmas are the counterparts to the corresponding lemmas from [KMP14].
Lemma 79. Scheduling functions increase at real full stages.
I(k−1, t) = I(k, t) + rfull_{k−1}^t
Proof of lemma 79. By induction on the number of hardware cycles t. For the induction base
(t = 0) there is nothing to show, since after reset all real full bits are cleared
rfull_k^0 = 0
whereas for the scheduling functions by definition we have
I(k, 0) = 0.
For the induction step from t to t + 1 we argue as follows. First we proceed to show two
auxiliary results which greatly simplify our proof. For k > R(t) we claim that the following
holds.
I(k, t+1) = I(k, t) + ue_k^t (37)
(full ∧ /rbp)_{k−1}^{t+1} ↔ (full ∧ /rbp)_{k−1}^t ∧ stall_k^t ∨ ue_{k−1}^t (38)
Proof of equation 37. For stage k = 1 we obtain directly from the definition:
I(1, t+1) = I(1, t) + ue_1^t.
For stage k > 1 we split cases on whether stage k is updated in cycle t or not.
• ue_k^t = 0. In case stage k is not updated, we have
I(k, t+1) = I(k, t) (definition)
= I(k, t) + ue_k^t (assumption).
• ue_k^t = 1. Otherwise, we argue as follows.
I(k, t+1) = I(k−1, t) (definition)
= I(k, t) + rfull_{k−1}^t (induction hypothesis)
= I(k, t) + ue_k^t (assumption)
Note, from the construction of the stall engine (equation 35) we know
ue_k^t → rfull_{k−1}^t. □
Proof of equation 38.
(→) In order to show sufficiency, from the construction of the stall engine we derive:
full_{k−1}^{t+1} → stall_k^t ∧ /rollback_{k−1}^t ∨ ue_{k−1}^t
/rollback_{k−1}^t ∧ /rbp_{k−1}^{t+1} → /rbp_{k−1}^t
(←) For necessity, again from the construction of the stall engine, we argue as follows.
/rbr_k^t ∧ /rbp_{k−1}^t → /rbp_{k−1}^{t+1} ∧ /rollback_{k−1}^t
/rollback_{k−1}^t ∧ stall_k^t → full_{k−1}^{t+1}
Recall, we prove the statement only for stages k > R(t), which implies /rbr_k^t. In case the stage
above k is updated in cycle t (ue_{k−1}^t), the claim follows directly from lemma 75. □
Using the statements above we can easily complete the induction step. Below we split cases
on whether stage k is rolled-back in cycle t or not.
• rbr_k^t = 1. In this case all stages above k are rolled back as well (rbr_{k−1}^t). From lemma 75
we have
rfull_{k−1}^{t+1} = 0.
For the scheduling functions we derive:
I(k−1, t+1) = I(R(t), t) (definition)
= I(k, t+1) (definition).
• rbr_k^t = 0. Otherwise, the stages below k are not rolled back either (/rbr_{k+1}^t). Using
equation 38 we rewrite the claim
I(k−1, t+1) = I(k, t+1) + (rfull_{k−1}^t ∧ stall_k^t ∨ ue_{k−1}^t)
and split cases further on whether stage k is stalled in cycle t or not. In the first case
(stall_k^t), from the construction of the stall engine we derive
stall_k^t → /ue_k^t ∧ /ue_{k−1}^t
and argue as follows.
I(k−1, t+1) = I(k−1, t) (equation 37)
= I(k, t) + rfull_{k−1}^t (induction hypothesis)
= I(k, t+1) + rfull_{k−1}^t (equation 37)
In the second case (/stall_k^t), again from the construction of the stall engine (equation 35)
we derive
ue_k^t ↔ rfull_{k−1}^t ∧ /stall_k^t ∧ /rbr_k^t
which allows us to complete the induction step.
I(k−1, t+1) = I(k−1, t) + ue_{k−1}^t (equation 37)
= I(k, t) + rfull_{k−1}^t + ue_{k−1}^t (induction hypothesis)
= I(k, t+1) + ue_{k−1}^t (equation 37) □
The following lemma gives the value of the scheduling functions of all stages in the next cycle.
Lemma 80. In stages which are not rolled back, scheduling functions increase with updates;
in the remaining stages, scheduling functions are reset to I(R(t), t).
I(k, t+1) = \begin{cases}
I(k, t) + ue_k^t & k > R(t) \\
I(R(t), t) & \text{otherwise}
\end{cases}
Proof of lemma 80. For stages k > R(t) we refer the reader to the proof of equation 37. In the
latter proof, all occasions of “induction hypothesis” should be simply replaced by “lemma 79”.
For the remaining stages (k ≤ R(t)) the claim follows by definition. Note, from the definition
of the update enable signal we have
rbr_k^t → /ue_k^t. □
To streamline the proofs in the next sections we also formulate the following simple properties
of the scheduling functions which could be easily derived formally.
Lemma 81. For the scheduling functions of stages k and ℓ > k the following holds.
I(k, t) = I(ℓ, t) + ∑_{j=k}^{ℓ−1} rfull_j^t
Lemma 82. Assume that the memory stage is updated in cycle t < t′ (ue_6^t).
I(6, t) < I(6, t′)
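Lemma 81 can be obtained from lemma 79 by a telescoping sum. The following is a sketch of that calculation, assuming lemma 79 is applied to every stage j+1 with j ∈ [k : ℓ−1]:

```latex
\begin{align*}
I(k,t) - I(\ell,t)
  &= \sum_{j=k}^{\ell-1} \bigl( I(j,t) - I(j+1,t) \bigr)
  && \text{(telescoping)} \\
  &= \sum_{j=k}^{\ell-1} \mathit{rfull}_j^{\,t}
  && \text{(lemma 79 for stage } j+1\text{)}
\end{align*}
```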
8.2.3 Lowest Non-Live Stage
As we argue later in Sect. 9.6, to show correctness of the pipelined machine with misspecu-
lation it suffices to consider only those instructions which are executed (in the memory stage)
without rollbacks. We say that instruction in circuit stage k in cycle t is live if the following
holds.
live(k, t) ≡ k = 7 ∨ ∃t′ ≥ t : ( I(6, t′) = I(k, t) ∧ ue_6^{t′} ∧
∀t̃ ∈ [t : t′] ∀k̃ ∈ [k : 6] : I(k̃, t̃) = I(k, t) → /rbr_{k̃}^{t̃} )
Note, since instructions in the write-back stage are never rolled-back, for all cycles t we of
course derive live(7, t). The next lemma follows directly from the definition.
Lemma 83. Assume k′ > k.
live(k, t) → live(k′, t)
Circuit stage µ(t) is the lowest non-live stage in cycle t:
µ(t) = max{k ∈ [1 : 7] | /live(k, t)}.
Note, in case all stages are live in cycle t
∀k ∈ [1 : 7] : live(k, t)
we of course obtain
µ(t) = max ∅ = 0.
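As a small illustration, µ(t) can be computed from the liveness flags of one cycle. The sketch below assumes these flags are already available as a dictionary from stages 1..7 to booleans; the function name is chosen here for illustration.

```python
def lowest_non_live_stage(live):
    """mu(t): the maximal non-live stage in [1 : 7], or 0 if all stages are live."""
    non_live = [k for k in range(1, 8) if not live[k]]
    return max(non_live) if non_live else 0
```

By lemma 83 the non-live stages form an initial segment of the pipeline, so all stages strictly below the returned stage are live.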
First, we argue that pipeline stage µ(t)< 7 has a real full bit in cycle t.
Lemma 84. Assume µ(t)< 7.
rfull_{µ(t)}^t
Proof of lemma 84. In case all stages are live in cycle t, there is nothing to show: technical
stage zero is always truly full.
rfull_0^t
Otherwise, from the definition we have
/live(µ(t), t) ∧ live(µ(t)+1, t).
Clearly we have
I(µ(t), t) ≠ I(µ(t)+1, t)
and the claim follows from lemma 79. □
By definition the stage below µ(t) is live in cycle t. The simple lemma below follows.
Lemma 85. Assume that the stage below µ(t) is updated in cycle t (ue_{µ(t)+1}^t).
live(µ(t)+2, t+1)
Proof of lemma 85. In case stage µ(t)+2 is updated in cycle t, by definition we have
I(µ(t)+2, t+1) = I(µ(t)+1, t).
Otherwise, from the construction of the stall engine (part (3) of lemma 75) we know that stage
µ(t)+1 does not have a real full bit in cycle t
ue_{µ(t)+1}^t ∧ /ue_{µ(t)+2}^t → /rfull_{µ(t)+1}^t
and easily derive
I(µ(t)+2, t+1) = I(µ(t)+2, t) (definition)
= I(µ(t)+1, t) (lemma 79).
From the definition of µ we clearly have
live(µ(t)+1, t)
and therefore
live(µ(t)+2, t+1). □
Lemma 86. Assume µ(t) ∈ [1 : 4] and the stage below µ(t) is updated in cycle t (ue_{µ(t)+1}^t);
assume the rollback request signal at stage µ(t) is inactive in cycle t (/rbr_{µ(t)}^t).
/live(µ(t)+1, t+1)
Proof of lemma 86. By contradiction; assume
live(µ(t)+1, t+1).
For the scheduling function by definition we have
I(µ(t)+1, t+1) = I(µ(t), t)
which together with the above implies
∃t′ ≥ t+1 : I(6, t′) = I(µ(t), t) ∧ ue_6^{t′} ∧
∀t̃ ∈ [t+1 : t′] ∀k̃ ∈ [µ(t)+1 : 6] : I(k̃, t̃) = I(µ(t), t) → /rbr_{k̃}^{t̃}. (39)
From the assumption (/rbr_{µ(t)}^t) for cycle t we by definition have
∀k̃ ∈ [µ(t) : 6] : /rbr_{k̃}^t (40)
and proceed to show the following.
∀t̃ ∈ [t+1 : t′] : I(µ(t), t̃) = I(µ(t), t) → /rbr_{µ(t)}^{t̃} (41)
Proof of equation 41. By contradiction. For cycle t̃ we assume
I(µ(t), t̃) = I(µ(t), t) ∧ rbr_{µ(t)}^{t̃}
and split cases on whether stage µ(t) has a real full bit in cycle t̃ or not:
• rfull_{µ(t)}^{t̃} = 0. From lemma 79 we have
I(µ(t)+1, t̃) = I(µ(t), t)
which together with equation 39 gives
rbr_{µ(t)+1}^{t̃} = 0.
Using the definition of the misspec signals we derive a contradiction:
/rfull_{µ(t)}^{t̃} ∧ /rbr_{µ(t)+1}^{t̃} → /rbr_{µ(t)}^{t̃}.
• rfull_{µ(t)}^{t̃} = 1. In this case we argue as follows. Instruction
I(µ(t)+1, t+1) = I(µ(t), t)
is live (equation 39), and therefore can be found in stage k ∈ [µ(t)+1 : 6] in cycle t̃.
I(k, t̃) = I(µ(t), t)
The contradiction follows.
I(µ(t), t̃) > I(µ(t)+1, t̃) (lemma 79)
≥ I(k, t̃) (lemma 81)
= I(µ(t), t̃) (assumption) □
From the arguments above (equations 39–41) we conclude
live(µ(t), t)
which is a contradiction by definition. □
8.3 Cache Memory System in Pipelined Processor
This is a counterpart of Sect. 6.2 for the pipelined machine. The arguments presented in
Sects. 6.2.1–6.2.4 are mostly repeated in Sects. 8.3.1–8.3.4 resp. to fit the hardware descrip-
tion of the pipelined machine. In contrast to Chap. 6, at the end of this section we extract the
sequence of processor accesses (A), crucial for the correctness proofs performed in Chap. 9.
8.3.1 Connections to Caches
Connections to the instruction translation cache (itca_π) and the data translation cache (dtca_π)
are simple. Since in the scope of this thesis the MMUs only read the page tables
itca_π.(pr, pw, pcas) = 100
dtca_π.(pr, pw, pcas) = 100
there is no need to provide inputs other than the processor address (in the data fields):
itca_π.pa = mmu_I.ma.l
dtca_π.pa = mmu_E.ma.l.
The remaining inputs are the processor requests, which we connect as follows:
itca_π.preq = mmu_I.mreq
dtca_π.preq = mmu_E.mreq.
Connections to the instruction cache (ica_π) and the data cache (dca_π) remain almost unchanged.
The processor addresses are now connected to the physical memory address registers of the
corresponding pipeline stages:
ica_π.pa = pmaI.1_π.l
dca_π.pa = pmaE.5_π.l.
Since the instruction fetch and the memory stages shift down in the pipeline, the corresponding
control registers are used to provide the access types
ica_π.(pr, pw, pcas) = 100
dca_π.(pr, pw, pcas) = l.5_π ∘ s.5_π ∘ cas.5_π
and activate the processor requests.
ica_π.preq = full_1
dca_π.preq = full_5 ∧ mop.5_π ∧ /jisr.5_π
Data for the memory access are obviously taken from the memory input stage (stage 5) as well.
The processor compare-data are now connected directly to one of the SPR outputs (in contrast
to [KMP14], where they were coming from the GPR).
dca_π.pbw = bw.5_π
dca_π.pdin = min.5_π
dca_π.pcdin = cdata_π
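The control inputs of the data cache can be sketched as a small combinational function. The Python function below is only an illustration of the two equations for dca_π.preq and dca_π.(pr, pw, pcas); the function name and the boolean parameter names are assumptions made here.

```python
def dcache_control_inputs(full5, mop5, jisr5, l5, s5, cas5):
    """Return (preq, (pr, pw, pcas)) of the data cache for one cycle."""
    # a request is raised only for a memory operation that is not interrupted
    preq = full5 and mop5 and not jisr5
    # access type bits: load, store, compare-and-swap of stage 5
    access_type = (int(l5), int(s5), int(cas5))
    return preq, access_type
```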
8.3.2 Stability of Inputs to Caches
In order to show that interconnect of the cache memory system from the section above meets
the operating conditions of the cache memory system (see Sect. 6.2.2), we proceed to prove
the counterpart of lemma 50 for the pipelined processor.
Lemma 87.
• For any cache:
h_π^t.ca(i).mbusy → h_π^t.ca(i).preq (42)
• For the instruction translation cache:
itca_π^t.mbusy → mmu_I^{t+1}.(ma, mreq) = mmu_I^t.(ma, mreq) (43)
• For the instruction cache:
ica_π^t.mbusy → /ue_1^t ∧ full_1^{t+1} (44)
• For the data translation cache:
dtca_π^t.mbusy → mmu_E^{t+1}.(ma, mreq) = mmu_E^t.(ma, mreq)
• For the data cache:
dca_π^t.mbusy → /ue_5^t ∧ full_5^{t+1}
Proof of lemma 87. For cache i we argue by induction on the number of hardware cycles t. For
the induction base (t = 0) there is nothing to show.
h_π^0.ca(i).mbusy = 0
For the induction step from t to t +1 we first show the remaining statements using the induc-
tion hypothesis (equation 42). Note, for the translation caches the arguments do not change
compared to those presented in the proof of lemma 50, and therefore are omitted. For the
instruction and data caches the arguments become more interesting.
Due to rollbacks below in the pipeline, we need to stabilize inputs to the instruction cache. For
that purpose a new construction of the stall engine was elaborated (see Sect. 8.1.1). Now, an
active rollback hazard signal for the instruction fetch stage (rhaz_2) protects the real full bit of
the stage above, i.e., the request to the instruction cache, from clearing. Formally we derive:
ica_π^t.mbusy → ica_π^t.preq ∧ haz_2^t ∧ rhaz_2^t (equation 42)
→ full_1^t ∧ haz_2^t ∧ rhaz_2^t (interconnect)
→ stall_2^t ∧ stall_1^t ∧ /rollback_1^t (definition)
→ /ue_1^t ∧ full_1^{t+1} (definition).
For the data cache there is no need to stabilize the input stage since the memory stage is never
rolled back (/rbr_6). Again, formally we have:
dca_π^t.mbusy → dca_π^t.preq ∧ haz_6^t (equation 42)
→ full_5^t ∧ haz_6^t (interconnect)
→ stall_6^t ∧ (stall_5^t ∨ /full_4^t) (definition)
→ /ue_5^t ∧ full_5^{t+1} (definition).
To complete the induction step we argue as follows. We assume
h_π^{t+1}.ca(i).mbusy = 1,
since otherwise there is nothing to show. Next we split cases on whether cache i is busy in
cycle t.
• In case cache i is busy
h_π^t.ca(i).mbusy = 1,
the claim follows directly from the lines above. For the instruction translation cache we
derive:
itca_π^t.mbusy → itca_π^t.preq (equation 42)
→ mmu_I^t.mreq (interconnect)
→ mmu_I^{t+1}.mreq (equation 43)
→ itca_π^{t+1}.preq (interconnect).
In turn, for the instruction cache we simply have:
ica_π^t.mbusy → full_1^{t+1} (equation 44)
→ ica_π^{t+1}.preq (interconnect).
The corresponding arguments for the data translation cache and the data cache are analo-
gous, and therefore are omitted.
• Otherwise, in case cache i is not busy
h_π^t.ca(i).mbusy = 0,
one can literally follow the corresponding lines in the proof of lemma 50, and complete
the induction step. □
8.3.3 Accesses of Hardware Computation
In this small section we identify the accesses occurring in the hardware computation for later
reference. Similar to Sect. 6.2.3, we go through the interfaces of caches interconnected in
Sect. 8.3.1, and summarize the connected signals for every port. Given that (i,k) is a non-
flushing access ending in cycle t, according to [KMP14], we have:
acc(i,k).a = h_π^t.ca(i).pa
acc(i,k).data = h_π^t.ca(i).pdin
acc(i,k).cdata = h_π^t.ca(i).pcdin
acc(i,k).bw = h_π^t.ca(i).pbw
acc(i,k).type = h_π^t.ca(i).(pr, pw, pcas, 0).
Below we instantiate acc(i,k) for all ports of the single-core machine.
Accesses of Processor Core
For the instruction cache (ica_π) we have:
acc(1,k).a = pmaI.1_π^t.l
acc(1,k).type = 1000.
For the data cache (dca_π) we have:
acc(3,k).a = pmaE.5_π^t.l
acc(3,k).data = min.5_π^t
acc(3,k).cdata = cdata_π^t
acc(3,k).bw = bw.5_π^t
acc(3,k).type = l.5_π^t ∘ s.5_π^t ∘ cas.5_π^t ∘ 0.
Accesses of MMUs
For the access address, from the construction of the nested MMU we obtain
mmu_Y^t.ma = ptea(mmu_Y^t).
For the instruction translation cache (itca_π) we have:
acc(0,k).a = ptea(mmu_I^t).l
acc(0,k).type = 1000.
For the data translation cache (dtca_π) we have:
acc(2,k).a = ptea(mmu_E^t).l
acc(2,k).type = 1000.
8.3.4 Relating Endings of Accesses with Hardware Control Signals
Lemmas 88 and 89 below are the counterparts of lemmas [9.12] and [9.13] resp. from [KMP14]
about accesses performed by the processor core. The latter two lemmas are reformulated to fit
the pipelined processor utilizing four caches. Recall, every update of the instruction fetch stage
is accompanied by a read access ending in the instruction cache and vice versa. The analogous
result holds for the data cache and updates of the memory stage executing a memory operation.
Lemma 88.
ue_2^t → ∃k : e(1,k) = t ∧ acc(1,k).r (1)
exec(t) ∧ mop.5_π^t ∧ ue_6^t → ∃k : e(3,k) = t ∧ /acc(3,k).f (2)
Proof of lemma 88.1. First, using the definitions of the update enable and the rollback hazard
signals we derive
ue_2^t → full_1^t ∧ /rhaz_2^t
→ /ica_π^t.mbusy.
From the arguments above (full_1^t) and interconnect of the instruction cache (ica_π) we have
ica_π^t.preq = 1.
From the construction of caches [KMP14] we know that some non-flushing access to cache
ica_π ends in cycle t
ica_π^t.preq ∧ /ica_π^t.mbusy → someend(1, t) ∧ /flushend(1, t)
which by definition implies the claim, since for some k we conclude
e(1,k) = t ∧ /acc(1,k).f. □
Proof of lemma 88.2. Using the definitions of the update enable and the stall signals we argue
ue_6^t → full_5^t ∧ /stall_6^t
→ /haz_6^t (stall_7 = 0)
→ /dca_π^t.mbusy (definition).
Next we proceed to show /jisr.5_π^t. By contradiction we argue as follows:
ue_6^t ∧ jisr.5_π^t → jisr(t)
mop.5_π^t ∧ jisr.5_π^t → /cont(t)
which gives a contradiction, since by definition we derive /exec(t). Therefore, from the
arguments above (full_5^t and /jisr.5_π^t) and interconnect of the data cache (dca_π) we have
dca_π^t.preq = 1.
Repeating the arguments presented in the proof of part (1) (of lemma 88) we derive
someend(3, t) ∧ /flushend(3, t)
and the claim follows, since for some k we conclude
e(3,k) = t ∧ /acc(3,k).f. □
Lemma 89.
acc(1,k).r ∧ e(1,k) = t → ( /stall_3^t ∧ /rollback_1^t → ue_2^t ) (1)
/acc(3,k).f ∧ e(3,k) = t → exec(t) ∧ mop.5_π^t ∧ ue_6^t (2)
Proof of lemma 89.1. From the assumptions and construction of caches [KMP14] we derive
/acc(1,k).f ∧ e(1,k) = t → someend(1, t) ∧ /flushend(1, t)
which by definition gives
ica_π^t.preq ∧ /ica_π^t.mbusy.
From the arguments above and interconnect of the instruction cache (ica_π) we have
full_1^t ∧ /haz_2^t ∧ /rhaz_2^t
and using the assumptions we proceed as follows:
/stall_3^t ∧ /haz_2^t → /stall_2^t
/rollback_1^t ∧ /rhaz_2^t → /rbp_1^t ∧ /rbr_2^t.
The latter completes the proof, since using the definition we conclude
full_1^t ∧ /stall_2^t ∧ /rbp_1^t ∧ /rbr_2^t → ue_2^t. □
Proof of lemma 89.2. Analogous to the proof of part (1) (of lemma 89) we argue
/acc(3,k).f ∧ e(3,k) = t → someend(3, t) ∧ /flushend(3, t)
which by definition gives
dca_π^t.preq ∧ /dca_π^t.mbusy.
From the arguments above and interconnect of the data cache (dca_π) we have the following.
full_5^t ∧ mop.5_π^t ∧ /jisr.5_π^t ∧ /haz_6^t
Using the definitions we conclude
/stall_6^t ∧ /rbr_6^t.
Moreover, from invariant 9 we have /rbp_5^t, which by definition implies ue_6^t. Finally, from the
arguments above (/jisr.5_π^t) using the definition we conclude /jisr(t), and the claim follows. □
Note, accesses performed by the nested MMUs are covered by the corresponding lemmas from
Sect. 6.2, which apply to the pipelined machine without changes.
8.3.5 Extracting Sequence of Processor Accesses
Following the approaches introduced in [KMP14] and [LOP], in this section we proceed to
extract the sequence of accesses performed by processors in the course of execution of memory
operations. Accesses to the translation and data caches performed by the processors, or pro-
cessor accesses for short, can be easily extracted. Note, accesses to the instruction caches and
flushes are excluded.
A(t) = {(i,k) ∈ E(t) | (i mod 4) ≠ 1 ∧ /acc(i,k).f}
For convenience we abbreviate the number of the processor accesses by
na(t) = #A(t).
Of course we define the order of these accesses by introducing a dedicated function. For
y < na(t) we define:
x(0, t) = min {n | NE(t)+n ∈ seq(A(t))}
x(y, t) = min {n | NE(t)+n ∈ seq(A(t))∧n > x(y−1, t)}.
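The selection of processor accesses and the index function x can be sketched in Python. The sketch assumes that the accesses ending in cycle t are given as a list in the order of sequence acc′, so that list position n corresponds to index NE(t)+n. The tuple layout (i, k, flush) and the function name are assumptions made for this illustration.

```python
def processor_access_indices(ended):
    """Return [x(0,t), x(1,t), ...] for the accesses ending in cycle t.

    ended[n] = (i, k, flush) describes the access at index NE(t)+n:
    i is the cache port, k the access number on that port, and
    flush tells whether the access is a flush.
    """
    return [n for n, (i, _k, flush) in enumerate(ended)
            if i % 4 != 1 and not flush]


ended = [(1, 0, False), (0, 3, False), (3, 1, True),
         (3, 2, False), (5, 0, False), (6, 1, False)]
x = processor_access_indices(ended)   # increasing, so x[y] is x(y, t)
na = len(x)                           # number of processor accesses
xacc = [ended[n] for n in x]          # the extracted processor accesses
```

Instruction cache accesses (ports 1 and 5 in this example) and the flush are dropped, the remaining accesses keep their relative order.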
Using function x for indices
y ∈ [0 : na(t)−1]
we can easily extract from sequence acc′ the sequence of processor accesses xacc′_t:
xacc′_t[y] = acc′[NE(t) + x(y, t)].
The sequence of memory configurations obtained by executing only the accesses from se-
quence xacc′_t is defined as follows.
M_0 = m(h_π^t)
M_{y+1} = δ_M(M_y, xacc′_t[y])
Using notation for memory updates with access sequences, for y ≤ na(t) we obtain
M_y = Δ_M^y(M_0, xacc′_t[0 : y−1]).
We show a technical result first. Below in this chapter by acc′_t we denote the part of sequence
acc′ consisting of accesses ending in cycle t, i.e.,
acc′_t[0 : ne(t)−1] = acc′[NE(t) : NE(t+1)−1].
Lemma 90. For y ∈ [0 : na(t)−1] we claim
M_y = Δ_M^{x(y,t)}(M_0, acc′_t).
Proof of lemma 90. The proof is by an easy induction on y < na(t), and therefore we omit it.
Note, for the sequence of processor accesses xacc′_t by definition we clearly have
∀y ∈ [0 : na(t)−1] : xacc′_t[y] = acc′_t[x(y, t)]. □
Next we argue that in cycle t execution of accesses only from sequence xacc′_t gives us the correct
memory abstraction in cycle t+1. To show this we require the results on sequential
consistency from [KMP14] (lemma 51).
Lemma 91.
m(h_π^{t+1}) = Δ_M^{na(t)}(m(h_π^t), xacc′_t)
Proof of lemma 91. We start by showing that
m(h_π^{t+1}) = Δ_M^{NE(t+1)}(m(h_π^0), acc′) (lemma 51)
= Δ_M^{ne(t)}(Δ_M^{NE(t)}(m(h_π^0), acc′[0 : NE(t)−1]), acc′[NE(t) : NE(t+1)−1])
= Δ_M^{ne(t)}(m(h_π^t), acc′[NE(t) : NE(t+1)−1]) (lemma 51)
= Δ_M^{ne(t)}(m(h_π^t), acc′_t) (definition).
Therefore, using the definition of M_0 we are left to prove
Δ_M^{ne(t)}(M_0, acc′_t) = Δ_M^{na(t)}(M_0, xacc′_t)
which can easily be shown using lemma 90 and the fact that accesses from E(t)\A(t) do not
change the memory abstraction. □
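The statements of lemmas 90 and 91 can be illustrated on a toy memory model in Python. Everything in this sketch is an assumption made for illustration: accesses are tuples (port, kind, addr, data), only the kind "w" changes the memory, and reads and flushes leave the memory abstraction unchanged. It is not the memory semantics δ_M of [KMP14].

```python
from functools import reduce


def delta_m(mem, acc):
    """Toy memory semantics: only a write changes the memory."""
    _port, kind, addr, data = acc
    if kind == "w":
        updated = dict(mem)
        updated[addr] = data
        return updated
    return mem  # reads and flushes leave the abstraction unchanged


def run(mem, accesses):
    """Apply a sequence of accesses from left to right."""
    return reduce(delta_m, accesses, mem)


def is_processor_access(acc):
    """Membership in A(t): not an instruction cache port and not a flush."""
    port, kind, _addr, _data = acc
    return port % 4 != 1 and kind != "flush"


acc_t = [(1, "r", 0, None), (3, "w", 8, 7), (3, "flush", 8, None),
         (0, "r", 4, None), (7, "w", 8, 9)]
xacc_t = [a for a in acc_t if is_processor_access(a)]
m0 = {8: 0}
```

In this example, running all accesses of the cycle and running only the processor accesses yield the same memory, which is the shape of the claim of lemma 91.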
8.4 Liveness
In this section we proceed to show liveness of our pipelined implementation. Note, the latter
result is established at the end of Sect. 8.4.2 (lemma 98). In the remainder of this section we
develop, in Sects. 8.4.3 and 8.4.4, some more formalism in order to show, in Sect. 8.4.5, that
cycles in which instructions progress through the pipeline are unique (equation 65).
8.4.1 Lowest Truly Full Stage
Before we begin the proofs, we introduce one more technical definition. We say that pipeline
stage L(t) is the lowest truly full stage4 in the given cycle (t) if
L(t) = max {k ∈ [0 : 5] | rfull_k^t}.
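A minimal Python sketch of this definition, assuming the real full bits of stages 1..5 are given as a dictionary and using the fact that technical stage zero is always truly full. The function name is chosen here for illustration.

```python
def lowest_truly_full_stage(rfull):
    """L(t): the maximal stage in [0 : 5] with a real full bit."""
    # stage 0 is always truly full, so the maximum is always defined
    full_stages = [0] + [k for k in range(1, 6) if rfull.get(k, False)]
    return max(full_stages)
```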
In order to derive properties of L, we require the following result. Namely, we need to show
that the nested MMUs are live while utilized by the pipelined processor.
Lemma 92.
(L(t) = 0) ∧ mmu_I^t.treq → ( mmu_I^t.busy → ∃t′ > t : /mmu_I^{t′}.busy ) (1)
(L(t) = 4) ∧ mmu_E^t.treq → ( mmu_E^t.busy → ∃t′ > t : /mmu_E^{t′}.busy ) (2)
4 Registers of the memory stage were excluded for technical reasons: since instructions are executed in
the memory stage, it suffices to consider only the register stages above.
Fig. 38: Timing of the MMU control signals (treq, abort, idle, busy) in cycles t, t′, t′+1, t′′, and t′′+1
Proof of lemma 92.1. Timing diagrams from Fig. 38 are meant to illustrate the proof. Directly
from lemma 31 we argue for some cycle t′ > t:
idle(mmu_I^{t′}).
Clearly, we can assume that mmu_I is still busy in cycle t′ (assume that in cycle t mmu_I was
busy recovering from an abort, e.g., after a jisr), since otherwise the claim follows. From the
latter we know the translation request is high in cycle t′. We conclude
t-start(t′)
and using lemma 29 argue that translation ends in some cycle t′′ > t.
t-end(t′′)
Moreover, we argue that the translation is regular
treg[t′ : t′′]
since the translation request is never taken away in the latter cycles. Note, referring to the
hardware specification of the processor from Sect. 8.1.5, for cycles t̃ ∈ [t : t′′] we have:
k(t̃) = L(t) = 0
and
mmu_I^{t̃}.treq = /host(t̃) = mmu_I^t.treq.
Since the translation is regular, applying lemma 30 we derive that in cycle t′′ there is a hit in
the hardware TLB
mmu_I^{t′′}.tlb.hit
which essentially gives the claim. □
Since the proof of the second part is completely analogous, we omit it to shorten the presenta-
tion. Next we show that rollback request signals are not generated in stages below L(t).
Lemma 93.
R(t)≤ L(t)
Proof of lemma 93. By contradiction. Assume that the rollback request signal in stage L(t)+1
is active in cycle t. We derive
rbr_{L(t)+1}^t → \begin{cases}
misspec_2^t ∨ misspec_5^t & L(t) < 2 \\
misspec_5^t & 2 ≤ L(t) < 5
\end{cases}
→ \begin{cases}
rfull_2^t ∨ rfull_5^t & L(t) < 2 \\
rfull_5^t & 2 ≤ L(t) < 5
\end{cases}
→ \begin{cases}
L(t) ≥ 2 & L(t) < 2 \\
L(t) = 5 & 2 ≤ L(t) < 5
\end{cases}
which is a contradiction. □
In the following lemma we argue there is a cycle t ′ ≥ t in which the stage immediately below
L(t) is updated.
Lemma 94.
/ue_{L(t)+1}^t → ∃t′ > t : ue_{L(t)+1}^{t′}
Proof of lemma 94. Directly from the definitions we derive
ue_{L(t)+1}^t = rfull_{L(t)}^t ∧ /stall_{L(t)+1}^t ∧ /rbr_{L(t)+1}^t
= /stall_{L(t)+2}^t ∧ /haz_{L(t)+1}^t (45)
since there are no active rollback request signals in stages below L(t) (lemma 93). Moreover,
for the stall signals in stages below L(t)+1 we conclude
stall_{L(t)+2}^t → full_{L(t)+1}^t
→ rbp_{L(t)+1}^t (46)
since otherwise we obtain a contradiction.
full_{L(t)+1}^t ∧ /rbp_{L(t)+1}^t → rfull_{L(t)+1}^t
Referring to Sect. 8.1.5, where the processor components were interconnected, we argue by
case split on the lowest truly full stage:
• L(t) = 0. First, assume that the stall signal of stage IT is low in cycle t.
stall_2^t = 0
For the instruction address translation stage we have:
haz_1 = treq_I ∧ (mmu_I.busy ∨ inval(mmu_I)) ∨ drain
= mmu_I.treq ∧ mmu_I.busy.
Therefore, in case mmu_I is not requested or not busy in cycle t, there is nothing to show.
Otherwise, we argue using liveness of the nested MMU (lemma 92) and conclude the claim
for some cycle t′ > t.
In the presence of the stall signal of stage IT in cycle t we split cases on whether the
rollback hazard of stage IT is active in cycle t.
– rhaz_2^t = 0. In this case, from the construction of the stall engine we derive
/rhaz_2^t → /rbp_1^{t+1}
→ /stall_2^{t+1} (equation 46)
and the claim follows exactly as in the case above for some cycle t′ ≥ t+1.
– rhaz_2^t = 1. Otherwise, in case the instruction cache is busy in cycle t, from liveness of
the cache memory system [KMP14] for some cycle t′ > t we conclude
rhaz_2^{t′} = 0
and the claim follows as above, for some cycle t′′ ≥ t′+1.
Using specifics of the design of our pipeline, we derive:
stall_{L(t)+2}^t → rbp_{L(t)+1}^t (equation 46)
→ L(t)+1 = 1 (invariant 9)
→ L(t) = 0.
Therefore, according to equation 45, for stages L(t) > 0 we simply have
ue_{L(t)+1}^t = /haz_{L(t)+1}^t.
• L(t)= 1. For the instruction fetch stage we argue solely from liveness of the cache memory
system. In case the instruction cache is not busy, there is nothing to show. Otherwise, we
conclude the claim for some cycle t ′ > t. Note, the operating conditions of the instruction
cache are always respected by the new stall engine (see Sect. 8.1.1).
• L(t) = 2. For the instruction decode stage there is nothing to show, since there cannot be
any forwarding hits in stages below L(t).
haz_3^t ↔ haz_A^t ∨ haz_B^t
→ ∃k ∈ [3 : 5] : hit_A^t[k] ∨ hit_B^t[k]
→ L(t) ≥ 3
• L(t) = 3. For the execution stage there is nothing to show since the stage never produces
hazard signals.
• L(t) = 4. For the effective address translation stage we argue exactly as above, for the
instruction address translation. Namely, we show
haz_5 = mmu_E.treq ∧ mmu_E.busy
and either there is nothing to show (in case mmu_E is not requested or not busy in cycle t),
or we argue using lemma 92 to conclude the claim for some cycle t′ > t.
• L(t) = 5. For the memory stage we argue as for the instruction fetch above. Using liveness
of the cache memory system we conclude the claim for some cycle t′ > t in case the data
cache is requested and busy in cycle t. Otherwise, there is nothing to show. □
8.4.2 Liveness of Pipeline Stages
In the next lemma we show that the lowest truly full stage “creeps down” in cycles t in which
the stage immediately below L(t) is updated.
Lemma 95. Assume L(t)< 5.
L(t+1) = L(t) + ue_{L(t)+1}^t
Proof of lemma 95. From the definition of the lowest truly full stage for cycle t we know that
i) stages below L(t) are not truly full and ii) stages below L(t)+1 are not updated.
∀k > L(t) : /rfull_k^t ∧ /ue_{k+1}^t (47)
We split cases on whether the stage immediately below L(t) is updated in cycle t:
• ue_{L(t)+1}^t = 1. From above (equation 47) and construction of the stall engine for stages
below L(t)+1 we clearly have
∀k > L(t)+1 : /rfull_k^{t+1}.
By definition we therefore have
L(t+1) ≤ L(t)+1.
Construction of the control mechanisms (lemma 75) ensures that stage L(t)+1 is truly full
in cycle t+1 (rfull_{L(t)+1}^{t+1}). The latter obviously gives the claim, since
L(t+1) ≥ L(t)+1.
• ue_{L(t)+1}^t = 0. Together with equation 47, from the construction of the stall engine for stages
below L(t) we clearly have
∀k > L(t) : /rfull_k^{t+1},
which by definition gives
Fig. 39: Empty pipeline stages (shaded) below the lowest truly full stages in cycles t and t′ > t. Note,
instruction i = I(L(t), t) advances further down through the pipeline and in cycle t′ is found in stage L(t′).
L(t+1) ≤ L(t).
From lemma 93 we know
rbr_{L(t)+1}^t = 0.
Again from the construction of the stall engine (part (5) of lemma 75): the real full bit is
not cleared in the absence of rollback requests from below.
rfull_{L(t)}^t ∧ /ue_{L(t)+1}^t ∧ /rbr_{L(t)+1}^t → rfull_{L(t)}^{t+1}
The claim obviously follows, since
L(t+1) ≥ L(t). □
Next we show that the lowest truly full stage “creeps down” over time unless it is already “at
the bottom”. Figure 39 is included in order to illustrate the proof.
Lemma 96. Assume L(t)< 5.
∃t ′ > t : L(t ′) = L(t)+1
Proof of lemma 96. From lemma 94 we know the stage immediately below L(t) is updated in
some cycle t′ ≥ t.
ue_{L(t)+1}^{t′}
Moreover, for the first such cycle we know that stage L(t)+1 is not updated in cycles t̃ ∈ [t :
t′−1].
/ue_{L(t)+1}^{t̃}
Applying lemma 95 we obtain
L(t′) = L(t)
and
L(t′+1) = L(t′)+1 = L(t)+1. □
From the latter lemma one can easily derive the following property.
k ∈ [L(t)+1 : 5] → ∃t ′ > t : L(t ′) = k (48)
In the following lemma we state that the memory stage is updated infinitely often.
Lemma 97.
/ue_6^t → ∃t′ > t : ue_6^{t′}
Proof of lemma 97. By case split on the lowest truly full stage:
• L(t) = 5. Since by the assumption the memory stage is not updated in cycle t, the claim
follows directly from lemma 94. For some cycle t′ > t we conclude
ue_{5+1}^{t′}.
• L(t) < 5. Using equation 48 we obtain for some cycle t′ > t
L(t′) = 5.
The claim follows as above, from lemma 94. For some cycle t′′ ≥ t′ we conclude
ue_{5+1}^{t′′}. □
Recall, since the memory stage is never rolled back (/rbr_6) in our implementation, the follow-
ing holds.
I(6, t+1) = I(6, t) + ue_6^t (49)
Finally, we show that the pipelined implementation is live, i.e., we show that every ISA in-
struction is eventually executed.
Lemma 98.
∀i ∃t : i = I(6, t) ∧ ue_6^t
Proof of lemma 98. By induction on index i of the instruction in the memory stage. For the
induction base (i = 0) recall that immediately after reset we have by definition
I(6, 0) = 0.
From lemma 97 we know the memory stage is updated infinitely often. Consider the first such
cycle t′ > 0 in which the memory stage is updated. Clearly, we have
(∀t ∈ [0 : t′−1] : /ue_6^t) ∧ ue_6^{t′}.
Using equation 49 for the scheduling function we conclude:
I(6, t′) = I(6, 0) = 0.
For the induction step from i to i+1 we argue as follows. Directly from the induction hypoth-
esis we know there is a cycle t such that
i = I(6, t) ∧ ue_6^t.
From the definition of the scheduling functions we immediately obtain
I(6, t+1) = i+1.
Applying lemma 97 we know there is a cycle t′ ≥ t+1:
ue_6^t ∧ (∀t̃ ∈ [t+1 : t′−1] : /ue_6^{t̃}) ∧ ue_6^{t′}.
To complete the induction step we argue as above, using equation 49:
I(6, t′) = I(6, t+1) = i+1. □
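The induction above is constructive in the following sense: given a finite trace of the update enable signal of the memory stage, equation 49 determines for each instruction index the cycle in which it is executed. A minimal Python sketch, assuming the trace is a list of booleans indexed by cycle (the function name is illustrative):

```python
def execution_cycles(ue6):
    """Map instruction index i to the cycle t with I(6, t) = i and ue6[t]."""
    cycles = {}
    i6 = 0                       # I(6, 0) = 0 after reset
    for t, updated in enumerate(ue6):
        if updated:
            cycles[i6] = t       # instruction i6 is executed in cycle t
        i6 += int(updated)       # equation 49
    return cycles
```

Each index appears at most once in the result, since the scheduling function of the memory stage increases with every update.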
In Sect. 8.4.5 we are to show uniqueness of cycles in which instructions in circuit stages below
µ progress down the pipeline. The remainder of this section is spent to derive some of the
necessary arguments. Thus, below we show that every pipeline stage is updated infinitely
often.
Lemma 99.
/ue_k^t → ∃t′ > t : ue_k^{t′}
Proof of lemma 99. For the write back stage (k = 7) we assume /ue_7^t, since otherwise there is
nothing to show. From the construction of the stall engine we derive
rfull_6^t = 0.
From lemma 97 we know that the memory stage is updated in cycle t′ > t (ue_6^{t′}), and therefore
rfull_6^{t′+1} = 1
and the claim follows for t′′ = t′+1.
For the remaining stages by induction on ℓ ≤ 5 we proceed to show
/ue_{6−ℓ}^t → ∃t′ > t : ue_{6−ℓ}^{t′}.
The base case (ℓ = 0) is exactly the statement of lemma 97. For the induction step from ℓ to
ℓ+1 ≤ 5 we argue by contradiction as follows. From the induction hypothesis for some cycle
t′ ≥ t we have
ue_{6−ℓ}^{t′} = 1.
Further we assume
ue_{5−ℓ}^{t′} = 0,
since otherwise there is nothing to show, and from the construction of the stall engine (part (6)
of lemma 75) we conclude
rfull_{5−ℓ}^{t′+1} = 0. (50)
Again, from the induction hypothesis for some cycle t′′ > t′+1 we have
ue_{6−ℓ}^{t′′} = 1
which by definition of the update enable signal implies
rfull_{5−ℓ}^{t′′} = 1. (51)
By contradiction we argue
∃t̃ ∈ [t′+1 : t′′−1] : ue_{5−ℓ}^{t̃}.
Thus, we proceed to show
∀t̃ ∈ [t′+1 : t′′] : /rfull_{5−ℓ}^{t̃}
by induction on θ, where θ denotes the length of sub-interval
[t′+1 : t′+θ] ⊆ [t′+1 : t′′].
For the base case (θ = 0) there is nothing to show (equation 50). For the induction step from
θ to θ+1 ≤ t′′−t′ we argue as follows. For t̄ = t′+θ, using part (4) of lemma 75 we derive
/rfull_{5−ℓ}^{t̄} ∧ /ue_{5−ℓ}^{t̄} → /rfull_{5−ℓ}^{t̄+1}
which completes the proof; the contradiction immediately follows (equation 51). □
The following existence and uniqueness result is needed already in the next section (Sect. 8.4.3).
$\forall i\ \forall t : (\forall k : I(k+1,t) \ne i) \vee (\exists! k : I(k+1,t) = i \wedge rfull^t_k)$ (52)
Proof of equation 52. Equivalently, we show by contradiction the following statement.
$\exists k : I(k+1,t) = i \;\to\; \exists! \tilde k : I(\tilde k+1,t) = i \wedge rfull^t_{\tilde k}$
Assume that in cycle t stages k and $\tilde k > k$ are truly full
$rfull^t_k \wedge rfull^t_{\tilde k}$
with instruction
$I(k+1,t) = I(\tilde k+1,t)$.
The contradiction follows.
$I(k+1,t) = I(\tilde k+1,t) + \sum_{j=k+1}^{\tilde k} rfull^t_j$ (lemma 81)
$\ge I(\tilde k+1,t) + rfull^t_{\tilde k}$
$> I(\tilde k+1,t)$ □
8.4.3 Instruction Stage
Truly full pipeline stages containing in cycle t (in the circuit stages below) instruction i are called instruction stages of instruction i in the given cycle (t):
$P(i,t) = \{k \mid I(k+1,t) = i \wedge rfull^t_k\}$.
The latter stages (if they exist) are provably unique (see Sect. 8.4.2, equation 52).
$\pi(i,t) = \begin{cases} \varepsilon P(i,t) & P(i,t) \ne \emptyset \\ +\infty & \text{otherwise} \end{cases}$
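The definition of the instruction stage can be exercised on a small model. The following Python sketch is an illustration only, not part of the construction: it assumes, as in the application of lemma 81 in the proof of equation 52, that neighbouring scheduling functions differ exactly by the full bit between them, and it additionally assumes that stage 0 always counts as full. The function names and the choice of six circuit stages are hypothetical.

```python
import math
from itertools import product

LAST = 6  # assumption: circuit stages 1..6, full bits belong to stages 0..5


def scheduling(i_last, rfull):
    """Rebuild I(1..6) from I(6) and the full bits of stages 1..5.

    Uses I(j) = I(j+1) + rfull_j, the relation applied via lemma 81.
    """
    I = {LAST: i_last}
    for j in range(LAST - 1, 0, -1):
        I[j] = I[j + 1] + rfull[j]
    return I


def instruction_stages(i, I, rfull):
    """P(i, t): truly full stages k whose circuit stage k+1 holds i."""
    full = dict(rfull)
    full[0] = 1  # assumption: stage 0 is always considered full
    return [k for k in range(LAST) if I[k + 1] == i and full[k]]


def pi(i, I, rfull):
    """pi(i, t): the instruction stage, or +infinity if there is none."""
    stages = instruction_stages(i, I, rfull)
    return stages[0] if stages else math.inf


def stages_unique(i_last=3):
    """Check the uniqueness part of equation 52 for all full-bit patterns."""
    for bits in product((0, 1), repeat=5):
        rfull = dict(zip(range(1, 6), bits))
        I = scheduling(i_last, rfull)
        for i in range(i_last, i_last + 7):
            if len(instruction_stages(i, I, rfull)) > 1:
                return False
    return True
```

Under these assumptions `stages_unique()` enumerates all 32 patterns of full bits and finds no instruction with two instruction stages, which mirrors the argument of equation 52.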
Note, if instruction i is not contained anywhere in the pipeline in cycle t, stage $\pi(i,t)$ is chosen to be $+\infty$ merely for convenience. In the following lemma we derive the value of stage π for a given instruction in the next cycle.
Lemma 100. Assume $\pi(i,t) \le 5$ and $R(t) \le \pi(i,t)$.
$\pi(i,t+1) = \pi(i,t) + ue^t_{\pi(i,t)+1}$
Proof of lemma 100. From the definition of the instruction stage $\pi = \pi(i,t)$ we have
$I(\pi+1,t) = i \wedge rfull^t_\pi$.
Next we split cases on whether the stage below π is updated in cycle t or not:
• $ue^t_{\pi+1} = 1$. In this case we proceed to show that the instruction stage advances.
$I(\pi+2,t+1) = i \wedge rfull^{t+1}_{\pi+1}$
For the scheduling function in stage π+2 we derive
$I(\pi+2,t+1) = \begin{cases} I(\pi+1,t) & rfull^t_{\pi+1} \\ I(\pi+2,t) & \text{otherwise} \end{cases}$ (lemma 75.3; definition)
$= I(\pi+1,t)$ (lemma 79).
For the real full bit of stage π+1 the claim follows from part (1) of lemma 75.
• $ue^t_{\pi+1} = 0$. In this case we are to show that the instruction stage does not change.
$I(\pi+1,t+1) = i \wedge rfull^{t+1}_\pi$
For the scheduling function in stage π+1 by definition we have
$I(\pi+1,t+1) = I(\pi+1,t)$
and for the real full bit of stage π we argue using part (5) of lemma 75.
$rfull^t_\pi \wedge /ue^t_{\pi+1} \wedge /rbr^t_{\pi+1} \;\to\; rfull^{t+1}_\pi$ □
In the following lemma we show that rollback request signals are not generated in stages below
µ(t).
Lemma 101.
R(t)≤ µ(t)
Proof of lemma 101. Assume
R(t)> 0
since otherwise there is nothing to show. From the definition we know
R(t) = k ∈ {2,5}.
The instruction in circuit stage k is clearly not live in cycle t
$rbr^t_k \;\to\; /live(k,t)$
which by definition implies
$\mu(t) \ge k$. □
In order to show later that µ takes values only in range [0 : 5], we require the following auxiliary
result.
Lemma 102. Assume µ(t) = 5.
$ue^t_6 \;\to\; rbr^t_5$
Proof of lemma 102. By contradiction; we assume
$\mu(t) = 5 \wedge ue^t_6 \wedge /rbr^t_5$.
From the scheduling functions by definition we have
$I(6,t+1) = I(5,t)$.
The latter by lemmas 98 and 82 for some $t' \ge t+1$ gives
$I(6,t') = I(6,t+1) \wedge ue^{t'}_6$
which by definition implies
$live(6,t+1)$.
Repeating the arguments presented in the proof of lemma 86 we derive
$live(5,t)$
which is a contradiction. □
In case not all stages are live in cycle t, the value of µ in cycle t +1 is given in the following
lemma.
Lemma 103. Assume µ(t)> 0.
$/ue^t_{\mu(t)+1} \;\to\; \mu(t+1) = \mu(t)$ (1)
$ue^t_{\mu(t)+1} \;\to\; \mu(t+1) = \begin{cases} 0 & rbr^t_{\mu(t)} \\ \mu(t)+1 & \text{otherwise} \end{cases}$ (2)
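Read as a transition rule, the lemma determines µ(t+1) from µ(t) and two signals. The short Python sketch below restates it under the reading that part (1) concerns the case in which the stage below µ(t) is not updated (as its proof states: that stage is "neither updated nor rolled-back"). The function name and argument names are hypothetical.

```python
def next_mu(mu, ue_below, rbr_mu):
    """Value of mu(t+1) according to lemma 103.

    mu       -- mu(t), assumed positive as in the lemma
    ue_below -- update enable of stage mu(t)+1 in cycle t
    rbr_mu   -- rollback request of stage mu(t) in cycle t
    """
    if mu <= 0:
        raise ValueError("lemma 103 assumes mu(t) > 0")
    if not ue_below:
        return mu               # part (1): mu is unchanged
    return 0 if rbr_mu else mu + 1  # part (2): rollback resets, else advance
```

Written this way, the bound µ(t+1) ≤ µ(t) + ue of the generalized lemma that follows is immediate for µ(t) > 0.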
Proof of lemma 103.1. From lemmas 84 and 101 we respectively know
$rfull^t_{\mu(t)} = 1$
and
$rbr^t_{\mu(t)+1} = 0$.
Since the stage below µ(t) is neither updated nor rolled-back in cycle t, by definition of the
scheduling functions we have
I(µ(t)+1, t+1) = I(µ(t)+1, t).
From the definition of µ we clearly have
live(µ(t)+1, t)
and therefore
live(µ(t)+1, t+1).
From the arguments above we conclude
µ(t+1)< µ(t)+1.
From the construction of the stall engine (part (3) of lemma 75) we know that stage µ(t) is not
updated in cycle t.
$ue^t_{\mu(t)} = 0$
Next we split cases on the value of the rollback request signal in stage µ(t) in cycle t:
• $rbr^t_{\mu(t)} = 0$. In this case, from the definition of the scheduling functions we have
I(µ(t), t+1) = I(µ(t), t).
By definition stage µ(t) is not live in cycle t
/live(µ(t), t)
which implies
/live(µ(t), t+1).
From the arguments above we conclude
µ(t+1)≥ µ(t).
• $rbr^t_{\mu(t)} = 1$. In this case we proceed to show that the rollback request signal of stage µ(t) remains active in cycle t+1.
$rbr^t_{\mu(t)} \wedge /ue^t_{\mu(t)+1} \;\to\; rbr^{t+1}_{\mu(t)}$ (53)
Proof of equation 53. From the definition of the update enable signal we know that stage µ(t) is not overwritten in cycle t.
$ue^t_{\mu(t)} = 0$ (54)
Using lemma 101 we derive
$\mu(t) = R(t) \in \{2,5\}$.
In case µ(t) = 2, we argue using the definition of signal $misspec_2$ as follows.
$rbr^t_2 \;\to\; rfull^t_2 \wedge iret^t_2$ (interconnect)
$\to\; rfull^{t+1}_2 \wedge iret^{t+1}_2$ (lemma 75.5; equation 54)
$\to\; rbr^{t+1}_2$ (interconnect)
In the other case, if µ(t) = 5, we derive a contradiction.
$rbr^t_5 \;\to\; jisr(t)$ (interconnect)
$\to\; ue^t_6$ (definition) □
Therefore, we conclude
$\mu(t+1) \ge R(t+1)$ (lemma 101)
$\ge \mu(t)$ (equation 53)
and the claim follows. □
In the proof lines below we find the following abbreviations useful.
$empty{\uparrow}(k,t) \leftrightarrow \forall j \in [1:k] : /rfull^t_j$
$empty{\downarrow}(k,t) \leftrightarrow \forall j \in [k:5] : /rfull^t_j$
Proof of lemma 103.2. First consider the case with an active rollback request signal in stage µ(t) in cycle t ($rbr^t_{\mu(t)}$). Using lemma 101 we derive
µ(t) = R(t) ∈ {2,5}
and split cases on the value of µ in cycle t:
• µ(t) = 5. For the instruction in stage 1 in cycle t+1, from lemmas 98 and 82 we obtain
$\exists t' \ge t+1 : I(6,t') = I(1,t+1) \wedge ue^{t'}_6$. (55)
Also for cycles in which instruction i = I(1, t+1) progresses down through the pipeline we argue that the following holds.
$\forall \tilde t \in [t+1 : t'] : empty{\downarrow}(\pi(i,\tilde t)+1, \tilde t)$ (56)
Proof of equation 56. By induction on θ, where θ is the “length” of sub-interval
$[t+1 : t+1+\theta] \subseteq [t+1 : t']$.
For the base case (θ = 0) the claim follows from the arguments above: after the jisr in cycle t we have
$\pi(i,t+1) = 0$
and by lemma 76 we get
$empty{\downarrow}(1,t+1)$.
For the induction step from θ to $\theta+1 < t'-t$ we abbreviate
$\tilde t = t+1+\theta$
and split cases on whether the stage below $\pi(i,\tilde t)$ is updated in cycle $\tilde t$ or not. In case
$ue^{\tilde t}_{\pi(i,\tilde t)+1} = 1$
from lemma 100 we have
$\pi(i,\tilde t+1) = \pi(i,\tilde t)+1$.
From the induction hypothesis and definition of the update enable signal, for
$\pi \in [\pi(i,\tilde t)+2 : 5]$
by part (4) of lemma 75 we argue
$/rfull^{\tilde t}_\pi \wedge /ue^{\tilde t}_\pi \;\to\; /rfull^{\tilde t+1}_\pi$
which by definition gives the claim.
$empty{\downarrow}(\pi(i,\tilde t+1)+1, \tilde t+1)$
Otherwise, if the stage below $\pi(i,\tilde t)$ is not updated in cycle $\tilde t$
$ue^{\tilde t}_{\pi(i,\tilde t)+1} = 0$
we proceed to show the following auxiliary result.
$empty{\downarrow}(\pi(i,\tilde t)+1, \tilde t) \;\to\; R(\tilde t) \le \pi(i,\tilde t)$ (57)
(Note, the proof below is illustrated in Fig. 40(a).)
Proof of equation 57. By contradiction. Assume that the rollback request signal is active in cycle $\tilde t$ at stage
$r = R(\tilde t) > \pi(i,\tilde t)$.
From the hardware interconnect (Sect. 8.1.5, definition of the misspec signals) we know that stage r ≤ 5 has a real full bit in cycle $\tilde t$.
$rfull^{\tilde t}_r = 1$
The latter contradicts the assumption that there are no truly full stages between stages $\pi(i,\tilde t)$ and the memory stage in cycle $\tilde t$.
$\forall j \in [\pi+1 : 5] : /rfull^{\tilde t}_j$ □
[Fig. 40: Possible rollback request signals in cycle t̃. Panel (a): instruction i in stage π(i, t̃), memory stage M, rollback stage r ≤ 5. Panel (b): instruction i in stage π(i, t̃) ≥ 3, memory stage M, rollback stages r1, r2, r3.]
Using the result above and lemma 100 we get
$\pi(i,\tilde t+1) = \pi(i,\tilde t)$.
Repeating the arguments from the case above ($ue^{\tilde t}_{\pi(i,\tilde t)+1}$), for stages
$\pi \in [\pi(i,\tilde t)+1 : 5]$
we analogously derive
$rfull^{\tilde t+1}_\pi = 0$
which completes the induction step, and therefore the proof. □
From equations 56 and 57 we easily conclude the absence of rollback request signals for instruction i = I(1, t+1) while it progresses down through the pipeline.
$\forall \tilde t \in [t+1 : t'] : /rbr^{\tilde t}_{\pi(i,\tilde t)+1}$ (58)
From the result above we clearly have
$\forall \tilde t \in [t+1 : t'] : I(k,\tilde t) = I(1,t+1) \;\to\; /rbr^{\tilde t}_k$.
Combining the statement above with equation 55 we conclude
live(1, t+1)
and the claim follows by lemma 83.
• µ(t) = 2. Since stage 3 is updated in cycle t, from lemma 85 we obtain
$live(4,t+1)$
which by definition is
$\exists t' \ge t+1 : I(6,t') = I(4,t+1) \wedge ue^{t'}_6 \wedge \forall \tilde t \in [t+1 : t']\ \forall \tilde k \in [4:6] : I(\tilde k,\tilde t) = I(4,t+1) \to /rbr^{\tilde t}_{\tilde k}$.
Moreover, using the definition of signal $misspec_2$ we argue that in cycle t+1 stage 3 has a real full bit and contains the data of a legal eret instruction:
$rbr^t_2 \;\to\; rfull^t_2 \wedge iret^t_2$ (interconnect)
$\to\; rfull^{t+1}_3 \wedge iret^{t+1}_3$ (lemma 75.1).
For cycles in which instruction i = I(4, t+1) progresses down through the pipeline we proceed to show the following.
$\forall \tilde t \in [t+1 : t'+1] : live(\pi(i,\tilde t)+1,\tilde t) \wedge \pi(i,\tilde t) \ge 3 \wedge iret^{\tilde t}_{\pi(i,\tilde t)}$ (59)
Proof of equation 59. By induction on θ, where θ is the “length” of sub-interval
$[t+1 : t+1+\theta] \subseteq [t+1 : t'+1]$.
For the base case (θ = 0) the claim follows from the arguments above.
$live(4,t+1) \wedge \pi(i,t+1) = 3 \wedge iret^{t+1}_3$
For the induction step from θ to $\theta+1 \le t'-t$ we abbreviate
$\tilde t = t+1+\theta$
and split cases on whether the stage below $\pi(i,\tilde t)$ is updated in cycle $\tilde t$ or not. In case
$ue^{\tilde t}_{\pi(i,\tilde t)+1} = 1$
from lemma 100 we have
$\pi(i,\tilde t+1) = \pi(i,\tilde t)+1$
and using the induction hypothesis we conclude
$\pi(i,\tilde t+1) \ge 3 \wedge iret^{\tilde t+1}_{\pi(i,\tilde t+1)}$.
For the scheduling function in the stage below $\pi = \pi(i,\tilde t+1)$ we derive
$I(\pi+1,\tilde t+1) = \begin{cases} I(\pi,\tilde t) & rfull^{\tilde t}_\pi \\ I(\pi+1,\tilde t) & \text{otherwise} \end{cases}$ (lemma 75.3; definition)
$= I(\pi+1,\tilde t)$ (lemma 79).
From the induction hypothesis and lemma 83 we conclude
$live(\pi+1,\tilde t)$
which implies
$live(\pi+1,\tilde t+1)$.
Otherwise, if the stage below $\pi(i,\tilde t)$ is not updated in cycle $\tilde t$
$ue^{\tilde t}_{\pi(i,\tilde t)+1} = 0$
we proceed to show the following auxiliary result.
$live(\pi(i,\tilde t)+1,\tilde t) \wedge \pi(i,\tilde t) \ge 3 \wedge iret^{\tilde t}_{\pi(i,\tilde t)} \;\to\; R(\tilde t) = 0$ (60)
(Note, the proof below is illustrated in Fig. 40(b).)
Proof of equation 60. By contradiction. Assume that the rollback request signal is active in cycle $\tilde t$ at stage
$r = R(\tilde t) > 0$.
Next we split cases on the value of r ≤ 5:
• $r = r_1 > \pi(i,\tilde t)$. From the definition of the rollback request signals we conclude
$rbr^{\tilde t}_{\pi(i,\tilde t)+1} = 1$
which contradicts our first assumption (see Sect. 8.2.3):
$rbr^{\tilde t}_{\pi(i,\tilde t)+1} \;\to\; /live(\pi(i,\tilde t)+1,\tilde t)$.
• $r = r_2 = \pi(i,\tilde t)$. From the definition of the rollback request signals we conclude
$misspec^{\tilde t}_r = 1$.
From the second and the third assumption we have
$r \ge 3 \wedge rfull^{\tilde t}_r \wedge iret^{\tilde t}_r$
which gives a contradiction (see Sect. 8.1.5, definition of the misspec signals).
$misspec^{\tilde t}_r = 0$
• $r = r_3 < \pi(i,\tilde t)$. Again, from the definition of misspec signals we know that stage r has a real full bit in cycle $\tilde t$.
$rfull^{\tilde t}_r = 1$
From the second and the third assumption for $k = \pi(i,\tilde t)$ we have
$k \ge 3 \wedge rfull^{\tilde t}_k \wedge iret^{\tilde t}_k$
and using part (1) of lemma 77 we derive a contradiction:
$\forall j \in [1 : k-1] : /rfull^{\tilde t}_j$. □
Using the result above and lemma 100 we get
$\pi(i,\tilde t+1) = \pi(i,\tilde t)$
which by the induction hypothesis implies
$\pi(i,\tilde t+1) \ge 3 \wedge iret^{\tilde t+1}_{\pi(i,\tilde t+1)}$.
For the scheduling function in the stage below $\pi = \pi(i,\tilde t)$ by definition we have
$I(\pi+1,\tilde t+1) = I(\pi+1,\tilde t)$.
Finally, from the induction hypothesis we have
$live(\pi+1,\tilde t)$
which implies
$live(\pi+1,\tilde t+1)$. □
From equations 59 and 60 we easily conclude the absence of any rollback request signals while instruction i = I(4, t+1) progresses down through the pipeline.
$\forall \tilde t \in [t+1 : t'+1] : /rbr^{\tilde t}_1$ (61)
For the instruction in stage 1 in cycle t+1, from lemmas 98 and 82 we obtain
$\exists t'' \ge t+1 : I(6,t'') = I(1,t+1) \wedge ue^{t''}_6$. (62)
Moreover, in cycles in which instruction I(1, t+1) progresses down through the pipeline, repeating the arguments analogous to those presented in the case above (equations 56–58) we conclude
$\forall \tilde t \in [t'+2 : t''] : I(k,\tilde t) = I(1,t+1) \;\to\; /rbr^{\tilde t}_k$
which together with equation 61 gives
$\forall \tilde t \in [t+1 : t''] : I(k,\tilde t) = I(1,t+1) \;\to\; /rbr^{\tilde t}_k$.
Combining the statement above with equation 62 we conclude
live(1, t+1)
and the claim follows by lemma 83.
Finally we cover the case in which the rollback request signal in stage µ(t) is inactive in cycle t ($/rbr^t_{\mu(t)}$). Since stage µ(t)+1 is updated in cycle t, from lemma 85 we obtain
live(µ(t)+2, t+1)
which clearly gives
µ(t+1)< µ(t)+2.
Moreover, from lemma 102 we know
µ(t)< 5
which by lemma 86 gives
/live(µ(t)+1, t+1).
From the arguments above we have
µ(t+1)≥ µ(t)+1
and the claim follows. □
For convenience, we generalize the last result as follows.
Lemma 104.
$\mu(t+1) \le \mu(t) + ue^t_{\mu(t)+1}$
Proof of lemma 104. For µ(t) > 0 the claim follows directly from lemma 103. For µ(t) = 0 we proceed to show
$\mu(t+1) \le ue^t_1$.
In case stage 1 is updated in cycle t ($ue^t_1$), from lemma 85 we obtain
$live(2,t+1)$
which by lemma 83 gives the claim.
$\mu(t+1) < 2$
Otherwise, if stage 1 is not updated in cycle t ($/ue^t_1$), from lemma 101 we derive
$R(t) = 0$
and thus stage 1 is also not rolled back in cycle t ($/rbr^t_1$). Therefore, from the definition of the scheduling functions we have
$I(1,t+1) = I(1,t)$.
From the definition of µ we clearly have
$live(1,t)$
which implies
$live(1,t+1)$.
The claim follows by lemma 83.
$\mu(t+1) < 1$ □
[Fig. 41: Pipeline stages targeted in the induction step in each of the three cases considered in parts (1) (not shaded) and (2) (shaded). Panels: (a) µ unchanged, (b) µ increases, (c) rollback.]
8.4.4 Actual Instruction Index
Due to possible misspeculations, which lead to rollbacks of the affected stages, scheduling
functions can be reset to potentially lower values. In order to show that live instructions leave
all pipeline stages only once, we introduce an alternative way to index the instructions in the
pipeline. The actual instruction index below counts updates only of the live stages.
$i(k,0) = 0$
$i(k,t+1) = \begin{cases} i(k,t)+1 & ue^t_k \wedge k > \mu(t) \\ i(k,t) & \text{otherwise} \end{cases}$
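The recursive definition can be traced cycle by cycle. The following Python sketch is only an illustration: the update enable signals and the values of µ are supplied as hypothetical input traces (in the construction they are produced by the stall engine and by the definition of µ), and the function name is an assumption.

```python
def actual_index_trace(ue, mu, stages):
    """Return the list idx with idx[t][k] = i(k, t) for t = 0..len(ue).

    ue     -- list over cycles of dicts stage -> 0/1 (update enable)
    mu     -- list over cycles of the value mu(t)
    stages -- iterable of stage numbers k
    """
    stages = list(stages)
    idx = [{k: 0 for k in stages}]  # i(k, 0) = 0
    for t in range(len(ue)):
        prev = idx[-1]
        idx.append({
            # count the update only if stage k is below mu(t)
            k: prev[k] + 1 if ue[t][k] and k > mu[t] else prev[k]
            for k in stages
        })
    return idx
```

In such a trace an update of a stage k ≤ µ(t) leaves the index unchanged, which is the property that separates the actual instruction index from the scheduling functions.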
In the following lemma we relate the scheduling functions with the actual instruction index of
the corresponding pipeline stages.
Lemma 105.
k > µ(t) → i(k, t) = I(k, t) (1)
k ≤ µ(t) → i(k, t) = I(µ(t), t) (2)
(Note, the proof below is illustrated in Fig. 41.)
Proof of lemma 105. By induction on the number of hardware cycles t. For the base case
(t = 0) we argue that
µ(0) = 0
and there is nothing to show for part (2). For part (1), for all stages k we have
i(k,0) = 0 = I(k,0).
For the induction step from t to t+1 we argue as follows. For part (1), according to lemmas 103
and 104 there are three cases to cover:
• µ(t+1) = µ(t). For stages k > µ(t), using lemma 101 we have
k > R(t)
and argue as follows.
$i(k,t+1) = i(k,t) + ue^t_k$ (definition)
$= I(k,t) + ue^t_k$ (induction hypothesis)
$= I(k,t+1)$ (lemma 80)
• µ(t+1) = µ(t)+1. For stages k > µ(t)+1 the argument is exactly as above.
• $\mu(t+1) = 0 \ne \mu(t)$. Again, for stages k > µ(t) the argument is exactly as above. For stages k ≤ µ(t) we argue as follows. According to lemma 103 we have
$rbr^t_{\mu(t)}$
which by lemma 101 implies
µ(t) = R(t).
To complete the induction step for part (1) we derive:
i(k, t+1) = i(k, t) (definition)
= I(µ(t), t) (induction hypothesis)
= I(k, t+1) (lemma 80).
For part (2) we split cases exactly as for part (1).
• µ(t+1) = µ(t). According to lemma 103, the stage below µ(t) is not updated in cycle t.
$/ue^t_{\mu(t)+1}$
Moreover, since by lemma 84 stage µ always has a real full bit ($rfull^t_{\mu(t)}$), from construction of the stall engine (part (3) of lemma 75) we conclude that stage µ(t) is not overwritten in cycle t.
$/ue^t_{\mu(t)}$
Finally, from lemma 101 we have
µ(t)≥ R(t).
Thus, for stages k ≤ µ(t) we derive:
i(k, t+1) = i(k, t) (definition)
= I(µ(t), t) (induction hypothesis)
= I(µ(t), t+1) (lemma 80)
= I(µ(t+1), t+1).
• µ(t+1) = µ(t)+1. According to lemmas 103 and 104, stage µ(t)+1 is updated in cycle
t.
$ue^t_{\mu(t)+1}$
From lemma 101 we know that the latter stage is below R in cycle t.
µ(t)+1 > R(t).
To complete the induction step, for stage k = µ(t)+1 we derive:
i(k, t+1) = i(k, t)+1 (definition)
= I(k, t)+1 (induction hypothesis)
= I(k, t+1) (lemma 80)
= I(µ(t+1), t+1).
For stages k ≤ µ(t) we argue similarly to the case above:
i(k, t+1) = i(k, t) (definition)
= I(µ(t), t) (induction hypothesis)
= I(µ(t)+1, t+1) (definition)
= I(µ(t+1), t+1).
• $\mu(t+1) = 0 \ne \mu(t)$. For stages k < 0 there is nothing to show. □
8.4.5 Uniqueness of Update Cycles
In Sect. 9.2.5 we must define the oracle inputs for instructions in circuit stages below µ, in cycles in which those instructions progress down in the pipeline. Therefore, we first must show that the latter cycles are unique. This last section derives that result. Below we argue that the value of µ resets to zero only after a rollback in stage µ.
Lemma 106. Assume t˜ > t and µ(t)> 0.
$\mu(\tilde t) = 0 \;\to\; \exists t' \in [t : \tilde t-1] : rbr^{t'}_{\mu(t')}$
Proof of lemma 106. Consider the minimum cycle $t_1 > t$ such that $\mu(t_1) = 0$. For cycle
$t_0 = t_1 - 1 \ge t$
we clearly have $\mu(t_0) > 0$, and according to lemma 103 we obtain
$\forall t' \in [t : t_0-1] : /(rbr^{t'}_{\mu(t')} \wedge ue^{t'}_{\mu(t')+1})$
since otherwise a contradiction follows. In cycle $t_0$, again according to lemma 103, we have
$rbr^{t_0}_{\mu(t_0)} \wedge ue^{t_0}_{\mu(t_0)+1} = 1$
and
$\mu(t_0+1) = \mu(t_1) = 0$.
Clearly
$t_0 \in [t : t_1-1] \subseteq [t : \tilde t-1]$
and the claim follows. □
Next we show that the (non-zero) value of µ always eventually resets to zero.
Lemma 107. Assume µ(t)> 0.
$/rbr^t_{\mu(t)} \;\to\; \exists t' > t : rbr^{t'}_{\mu(t')}$
Proof of lemma 107. By contradiction; assume that the following holds.
$\forall t' > t : /rbr^{t'}_{\mu(t')}$ (63)
First, we proceed to show the following.
$\forall t' > t : \mu(t') > 0 \wedge I(\mu(t'),t') = I(\mu(t),t)$ (64)
Proof of equation 64. In case µ(t′) = 0, from lemma 106 we immediately derive a contradiction.
$\exists t'' \in [t : t'-1] : rbr^{t''}_{\mu(t'')}$
Therefore, we have
$\forall t' > t : \mu(t') > 0$.
For the scheduling functions, by induction on θ, we proceed to show
$\forall \theta \ge 0 : I(\mu(t+\theta),t+\theta) = I(\mu(t),t)$.
The base case holds trivially since for θ = 0 there is nothing to show. For the induction step
from θ to θ +1 we argue as follows. For t ′′ = t +θ , in case the stage below µ(t ′′) is updated
in cycle t ′′, we derive
I(µ(t ′′+1), t ′′+1) = I(µ(t ′′)+1, t ′′+1) (lemma 103.2)
= I(µ(t ′′), t ′′) (definition)
= I(µ(t), t) (induction hypothesis).
Otherwise, from lemma 84 we know that pipeline stage µ(t′′) has a real full bit in cycle t′′
$rfull^{t''}_{\mu(t'')} = 1$
which according to the construction of the stall engine (part (3) of lemma 75) implies
$ue^{t''}_{\mu(t'')} = 0$.
To complete the proof we argue similarly to the case above:
$I(\mu(t''+1),t''+1) = I(\mu(t''),t''+1)$ (lemma 103.1)
$= I(\mu(t''),t'')$ (definition)
$= I(\mu(t),t)$ (induction hypothesis). □
Using the result above, we proceed to derive a contradiction (by definition of µ):
$live(\mu(t),t)$.
From lemmas 98 and 82 we already have
$\exists t'' \ge t : I(6,t'') = I(\mu(t),t) \wedge ue^{t''}_6$
and (by definition of live) it remains to show
$\forall \tilde t \in [t : t'']\ \forall \tilde k \in [\mu(t) : 5] : I(\tilde k,\tilde t) = I(\mu(t),t) \;\to\; /rbr^{\tilde t}_{\tilde k}$.
• For stage $\tilde k = \mu(t')$ directly from the assumption (equation 63) we derive
$\forall t' > t : I(\tilde k,t') = I(\mu(t),t) \;\to\; /rbr^{t'}_{\tilde k}$.
• For stages $\tilde k > \mu(t')$, according to lemma 101, we have $/rbr^{t'}_{\tilde k}$, and therefore
$\forall t' > t\ \forall \tilde k \in [\mu(t')+1 : 5] : I(\tilde k,t') = I(\mu(t),t) \;\to\; /rbr^{t'}_{\tilde k}$.
• Finally, for stages $\tilde k < \mu(t')$, we proceed to show
$\forall t' > t\ \forall \tilde k \in [\mu(t) : \mu(t')-1] : I(\tilde k,t') = I(\mu(t),t) \;\to\; /rbr^{t'}_{\tilde k}$.
For maximal stage $k' < \mu(t')$ such that
$rbr^{t'}_{k'} = 1$
from the hardware construction and equation 63 we have
$\forall \ell \in [1 : k'] : rbr^{t'}_\ell \wedge /rbr^{t'}_{k'+1}$
and therefore, from the definition of the misspec signals we conclude
$rfull^{t'}_{k'} = 1$.
Using the arguments above, for the scheduling functions of stages $\ell \le k'$ we argue (by contrapositive):
$I(\ell,t') > I(k'+1,t')$ (lemma 81)
$\ge I(\mu(t'),t')$ (lemma 81)
$= I(\mu(t),t)$ (equation 64). □
Using the result above we argue that every pipeline stage is updated infinitely often below stage
µ .
Lemma 108.
$/(ue^t_k \wedge k > \mu(t)) \;\to\; \exists t' > t : ue^{t'}_k \wedge k > \mu(t')$
Proof of lemma 108. By case split on whether stage k is below µ in cycle t:
• k > µ(t). Consider the minimum cycle t′ > t in which stage k is updated ($ue^{t'}_k$). Note, from lemma 99 we know that the latter cycle exists. Thus, we have
$\forall \tilde t \in [t : t'-1] : /ue^{\tilde t}_k \wedge ue^{t'}_k$.
We proceed to show
$\forall \tilde t \in [t : t'] : k > \mu(\tilde t)$
by induction on θ, where θ denotes the length of sub-interval
$[t : t+\theta-1] \subseteq [t : t']$.
The base case holds trivially since for θ = 0 there is nothing to show. For the induction
step from θ ≤ t ′− t to θ +1 we argue as follows. For t ′′ = t+θ −1, in case
µ(t ′′)< k−1
using lemma 103 we obtain
µ(t ′′+1)≤ k−1 < k.
Otherwise, in case
µ(t ′′) = k−1
using part (1) of lemma 103 we have
µ(t ′′+1) = k−1 < k
which completes the induction step, and thus the argument for this case.
• k ≤ µ(t). From lemma 107 for some cycle t′ ≥ t we know that stage µ(t′) is rolled back ($rbr^{t'}_{\mu(t')}$). According to lemma 101, we split cases on the value of
$\mu(t') = R(t') \in \{2,5\}$.
Thus, in case µ(t′) = 5, from the hardware interconnect we derive that the memory stage is updated in t′ ($ue^{t'}_6$), and using part (2) of lemma 103 we conclude
$\mu(t'+1) = 0$.
Otherwise, in case µ(t′) = 2, from lemma 99 for the minimum cycle t′′ ≥ t′ we have
$\forall \tilde t \in [t' : t''-1] : /ue^{\tilde t}_3 \wedge ue^{t''}_3$.
Therefore, using part (1) of lemma 103 we derive
$\forall \tilde t \in [t' : t''] : \mu(\tilde t) = 2$.
Repeating the arguments from the proof of equation 53 we conclude
$\forall \tilde t \in [t' : t''] : rbr^{\tilde t}_2$
which by part (2) of lemma 103 implies
$\mu(t''+1) = 0$.
Therefore, in both cases, for some cycle $\bar t > t$ we have
$k > \mu(\bar t)$
and the claim follows as in the case above for some $\bar{\bar t} > t$. □
Next we show that the hardware cycles in which instructions leave the pipeline stages below µ exist (lemma 109) and, moreover, are unique (equation 65).
Lemma 109. For stages k ∈ [1 : 5] the following holds:
$\forall i\ \exists t : i = I(k,t) \wedge ue^t_k \wedge k > \mu(t)$
Proof of lemma 109. By induction on index i of the instruction in stage k. For the induction base (i = 0) we argue as follows. From lemma 108 we know that stage k below µ is updated infinitely often. Consider the first such cycle t′ in which stage k below µ is updated. Clearly, we have:
$(\forall t \in [0 : t'-1] : /(ue^t_k \wedge k > \mu(t))) \wedge ue^{t'}_k \wedge k > \mu(t')$.
For the scheduling function in stage k we conclude:
$I(k,t') = i(k,t')$ (lemma 105.1)
$= i(k,0)$ (definition)
$= 0$ (definition).
For the induction step from i to i+1 we argue as follows. Directly from the induction hypothesis we know there is a cycle t such that
$i = I(k,t) \wedge ue^t_k \wedge k > \mu(t)$.
Applying lemma 108 we also know there is a cycle t′ ≥ t+1:
$ue^t_k \wedge k > \mu(t) \wedge (\forall \tilde t \in [t+1 : t'-1] : /(ue^{\tilde t}_k \wedge k > \mu(\tilde t))) \wedge ue^{t'}_k \wedge k > \mu(t')$.
To complete the induction step, for the scheduling function in stage k we derive:
$I(k,t') = i(k,t')$ (lemma 105.1)
$= i(k,t)+1$ (definition)
$= I(k,t)+1$ (lemma 105.1)
$= i+1$. □
Finally, for stages k ∈ [1 : 5] we argue
$\forall i\ \exists! t : i = I(k,t) \wedge ue^t_k \wedge k > \mu(t)$ (65)
Proof of equation 65. From lemma 109 for stages k ∈ [1 : 5] we know
$\forall i\ \exists t : i = I(k,t) \wedge ue^t_k \wedge k > \mu(t)$.
Assume stage k ∈ [1 : 5] is updated below µ in cycles t and t′ > t
$ue^t_k \wedge k > \mu(t)$
$ue^{t'}_k \wedge k > \mu(t')$
with instruction
$I(k,t) = I(k,t')$.
The contradiction immediately follows.
$I(k,t') = i(k,t')$ (lemma 105.1)
$\ge i(k,t+1)$ (monotonicity)
$= i(k,t)+1$ (definition)
$= I(k,t)+1$ (lemma 105.1) □
The latter uniqueness result completes the implementation of the pipelined machine. Correctness of the pipelined implementation is proven in the next chapter. (Note, in Chap. 9 we consider a multi-core processor, every core of which is a pipeline as constructed above in this chapter.)

9 Correctness of Pipelined Implementation
In Chap. 8 we constructed the pipelined processor which, as we prove in this chapter, implements the ISA specification from Sect. 9.1. In order to formalize our arguments, in Sect. 9.2 we state, as usual, the (simulation) theorem. After developing some more machinery in Sect. 9.3, we perform the induction steps for every part of the induction hypothesis. Most of the arguments in Sects. 9.4–9.8 are analogous to the arguments presented in the corresponding sections of Chap. 7. The interesting parts, however, are Sect. 9.4, where we use the software conditions in order to show that the guard conditions are respected by our pipelined implementation, and Sect. 9.5, where we apply the software conditions to prove correctness for the output of the instruction cache.
9.1 Multi-Core Specification
In this chapter we consider a simplified version of the multi-core machine described in [Sch13a]. Here we assume that all processor cores start running simultaneously after receiving a common reset interrupt, i.e., no init signals are required in the absence of running flags. As we already mentioned above, the external interrupts (other than reset) are tied to zero for all cores. Thus, we can keep using the same processor model as defined for the single-core machine in Sect. 2.1. The latter also allows us to keep the definition of inputs for the processor core steps unchanged.
9.1.1 Multi-Core Computation
Configuration
Configuration $mc \in K_{MC}$ of the multi-core machine with nested MMUs consists of p processors connected to one shared byte-addressable memory:
• $mc.p : [0 : p-1] \to K_M$, and
• $mc.m : \mathbb{B}^{32} \to \mathbb{B}^{8}$.
In turn, the configuration of each processor q includes the following sub-components:
• $mc.p(q).(pc, dpc, ddpc) \in \mathbb{B}^{32}$: three program counters,
• $mc.p(q).(gpr, spr) : \mathbb{B}^{5} \to \mathbb{B}^{32}$: general and special purpose register files, and
• $mc.p(q).tlb \subseteq K_{uwalk}$: translation look-aside buffer.
Note, the initial configuration $mc^0$ of the multi-core system after reset must satisfy the following criteria: i) the program counters are initialized with proper values, ii) the first GPR register contains value zero, and iii) all maskable interrupts are masked.
$mc^0.p(q).(pc, dpc, ddpc) = (8_{32}, 4_{32}, 0_{32})$
$mc^0.p(q).gpr(0_5) = 0_{32}$
$mc^0.p(q).sr = 0_{32}$
Moreover, the SPR registers have values which indicate the occurrence of reset on the previous
step: i) the exception cause register has its first bit set to one and ii) both mode registers have
their first bits set to zero.
$mc^0.p(q).eca[0] = 1$
$mc^0.p(q).mode[0] = 0$
$mc^0.p(q).nmode[0] = 0$
Stepping Function
Every sequential global step number n is performed by the component and using the oracle
input specified by the stepping function
s : N→ Σ
where
Σ = {core}× [0 : p−1]×Σcore ∪
{tlb}× [0 : p−1]×Σtlb.
For convenience we introduce the following shorthands to select the fields of the stepping function value:
$s(n) = s(n).(t,u,o)$
where
• .t gives the type of component which is stepped; obviously, we have
$s(n).t \in \{core, tlb\}$.
• .u gives the number of the unit which is stepped; clearly
$s(n).u \in [0 : p-1]$.
• .o gives the oracle input for the step as defined in the component's specification; for inputs we have
$s(n).o \in \Sigma_{core} \cup \Sigma_{tlb}$.
Moreover, we partition the oracle inputs for the processor core steps further using the shorthands below. The meaning of particular fields is intuitive and matches the specification from Sect. 3.3.4.
$s(n).t = core \;\to\; s(n).o = s(n).o.(w_I, w_E, eev) \in \Sigma_{core}$
Semantics
Below we formalize the semantics of the multi-core MIPS machine with nested address translation by defining the transition function $\delta_{MC}$, which specifies for any global step number n the next state configuration $mc^{n+1}$. Depending on which component makes a step we distinguish two cases:
• $s(n).(t,u) = (core,q)$: step n is performed by the core of processor q. For the next step configuration of processor q and the shared memory component we have:
$mc^{n+1}.(p(q),m) = \delta_M(mc^n.(p(q),m), s(n).o)$
where $\delta_M$ denotes the transition function of the single-core MIPS with the nested address translation, which we specified previously in Sect. 3.3.4.
• $s(n).(t,u) = (tlb,q)$: step n is performed by the TLB of processor q. For the next step configuration of the TLB component of processor q we have:
$mc^{n+1}.p(q).tlb = \delta_T(mc^n.(p(q),m), s(n).o)$
where $\delta_T$ is the transition function of the TLB component, specified in Sect. 3.3.1. Note that the remaining components of processor q as well as the shared memory component do not change at this step:
$Z \ne tlb \;\to\; mc^{n+1}.p(q).Z = mc^n.p(q).Z$
$mc^{n+1}.m = mc^n.m$.
In both cases all other processors do not change at step n:
$q' \ne q \;\to\; mc^{n+1}.p(q') = mc^n.p(q')$.
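The two cases of the transition function can be summarized in a few lines of code. The Python sketch below is an illustration under simplifying assumptions: a processor is reduced to an opaque `core` part and a `tlb` part, and the single-core transition function and the TLB transition function are passed in as hypothetical callables `delta_m` and `delta_t` (their real definitions are those of Sects. 3.3.4 and 3.3.1 and are not reproduced here).

```python
from dataclasses import dataclass, replace
from typing import Any, Tuple


@dataclass(frozen=True)
class Proc:
    core: Any  # stands for pc, dpc, ddpc, gpr, spr (simplification)
    tlb: Any


@dataclass(frozen=True)
class MC:
    p: Tuple[Proc, ...]  # processors 0..p-1
    m: Any               # shared memory


def step_mc(mc, sigma, delta_m, delta_t):
    """One global step; sigma = (t, u, o) as selected from s(n)."""
    t, u, o = sigma
    proc = mc.p[u]
    if t == "core":
        # core step: processor u and the shared memory may change
        new_proc, new_m = delta_m(proc, mc.m, o)
    elif t == "tlb":
        # TLB step: only the TLB of processor u changes
        new_proc, new_m = replace(proc, tlb=delta_t(proc, mc.m, o)), mc.m
    else:
        raise ValueError("unknown component type")
    # all other processors are left untouched
    return MC(p=mc.p[:u] + (new_proc,) + mc.p[u + 1:], m=new_m)
```

Passing the transition functions as parameters keeps the sketch independent of the concrete ISA; any pair of functions with these shapes can be plugged in for experimentation.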
Guard Conditions
Using notation from Sect. 3.3 we introduce the guard conditions for the multi-core machine
with almost no effort. Thus, for multi-core configuration mc ∈ KMC and oracle input σ ∈ Σ we
define a single predicate Γ which covers all conditions restricting the computational steps:
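$\Gamma(mc,\sigma) \equiv \begin{cases} \Phi(mc.p(\sigma.u),\sigma.o) & \sigma.t = core \\ T(mc.p(\sigma.u),\sigma.o) & \sigma.t = tlb. \end{cases}$
In case the guard conditions hold for all ISA steps before step n, we abbreviate
$\Gamma^n \equiv \forall m < n : \Gamma(mc^m, s(m))$.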
9.1.2 Processor Local Computations
We begin with the very basic definitions. As usual, for processor q we define two auxiliary
functions to formalize the processor core step sequence
pseq(q,0) = min{n | s(n).(t,u) = (core,q)}
pseq(q, i+1) = min{n | s(n).(t,u) = (core,q)∧n > pseq(q, i)}
and the instruction count
ic(q,0) = 0
ic(q,n+1) =
{
ic(q,n)+1 s(n).(t,u) = (core,q)
ic(q,n) otherwise.
Two simple lemmas follow directly from the definitions above.
Lemma 110.
s(n).(t,u) = (core,q) → pseq(q, ic(q,n)) = n
Lemma 111.
pseq(q, ic(q,n))≥ n
Their proofs literally follow the arguments from Sect. 7.2.1, where the counterparts of these two lemmas were proven for the single-core (sequential) machine. For technical reasons we define
$pseq(q,-1) = -1$ (66)
and show the following result.
Lemma 112.
pseq(q, ic(q,n)−1)< n
Proof of lemma 112. By induction on the number of steps n. For the base case (n = 0) we
argue as follows.
pseq(q, ic(q,0)−1) = pseq(q,−1) (definition)
= −1 (equation 66)
< 0.
For the induction step from n to n+1 we split cases on the type of the step performed.
• s(n).(t,u) = (core,q). In case step n is a core step of processor q, we have:
pseq(q, ic(q,n+1)−1) = pseq(q, ic(q,n)) (definition)
= n (lemma 110)
< n+1.
• s(n).(t,u) 6= (core,q). Otherwise, we argue using the induction hypothesis:
pseq(q, ic(q,n+1)−1) = pseq(q, ic(q,n)−1) (definition)
< n (induction hypothesis)
< n+1. □
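The definitions of pseq and ic, together with lemmas 110–112, can be checked on a finite prefix of a stepping function. The Python sketch below is an illustration only: it represents the prefix as a list of pairs (t, u), ignores the oracle inputs, and tests lemma 111 only where the required core step lies inside the prefix. The helper name `check_lemmas` is hypothetical.

```python
def ic(s, q, n):
    """Instruction count: core steps of processor q among steps 0..n-1."""
    return sum(1 for m in range(n) if s[m] == ("core", q))


def pseq(s, q, i):
    """Global step number of the i-th core step of processor q."""
    if i == -1:
        return -1  # technical definition (equation 66)
    steps = [n for n, step in enumerate(s) if step == ("core", q)]
    return steps[i]  # raises IndexError if the prefix is too short


def check_lemmas(s, q):
    """Check lemmas 110-112 for processor q on the finite prefix s."""
    core_steps = sum(1 for step in s if step == ("core", q))
    for n in range(len(s) + 1):
        i = ic(s, q, n)
        # lemma 110: a core step n of q is the ic(q, n)-th core step
        if n < len(s) and s[n] == ("core", q) and pseq(s, q, i) != n:
            return False
        # lemma 111: the next core step of q is not before n
        if i < core_steps and pseq(s, q, i) < n:
            return False
        # lemma 112: the previous core step of q is before n
        if not pseq(s, q, i - 1) < n:
            return False
    return True
```

For example, for the prefix tlb(0), core(0), core(1), tlb(1), core(0), core(1) the core steps of processor 0 are the global steps 1 and 4, and the check succeeds for both processors.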
Next we establish a counterpart of lemma 66 for the multi-core machine.
Lemma 113.
∀n : m ∈ [n : pseq(q, ic(q,n))−1] → s(m).(t,u) 6= (core,q)
Proof of lemma 113. We prove the statement by induction on ℓ, where ℓ denotes the length of sub-interval
$[n : n+\ell-1] \subseteq [n : pseq(q, ic(q,n))-1]$.
The base case holds trivially since for ℓ = 0 there is nothing to show. For the induction step from ℓ−1 to
$\ell \le pseq(q, ic(q,n)) - n$
we argue by contradiction as follows. Assume that n+ℓ is a core step of processor q:
$s(n+\ell).(t,u) = (core,q)$.
Then a contradiction follows simply from the monotonicity of functions ic and pseq:
$n+\ell = pseq(q, ic(q,n+\ell))$ (lemma 110)
$\ge pseq(q, ic(q,n))$
$> pseq(q, ic(q,n))-1$. □
In the following lemma we capture a property of pseq, namely that the processors do not
perform steps in configurations apart from the configurations given by pseq.
Lemma 114.
m ∈ [pseq(q, i)+1 : pseq(q, i+1)−1] → s(m).(t,u) 6= (core,q)
Proof of lemma 114. By contradiction. Assume there is n such that:
n ∈ [pseq(q, i)+1 : pseq(q, i+1)−1]∧ s(n).(t,u) = (core,q).
Then for the instruction index
$i_n = ic(q,n)$
using lemma 110 we conclude:
$pseq(q, i_n) = n$.
From the monotonicity of function pseq we derive a contradiction:
$pseq(q,i) < pseq(q,i_n) < pseq(q,i+1)$
gives
$i < i_n < i+1$. □
The next lemma is a counterpart of lemma 9.9 from [KMP14].
Lemma 115.
$mc^n.p(q).core = mc^{pseq(q,ic(q,n))}.p(q).core$
Proof of lemma 115. By induction on n. For the base case we trivially obtain
$mc^{pseq(q,ic(q,0))}.p(q).core = mc^{pseq(q,0)}.p(q).core$ (definition)
$= mc^0.p(q).core$ (lemma 113).
For the induction step from n to (n+1) we split cases on whether step n is performed by processor q or not.
• $s(n).(t,u) = (core,q)$:
$mc^{pseq(q,ic(q,n+1))}.p(q).core = mc^{pseq(q,ic(q,n)+1)}.p(q).core$ (definition)
$= mc^{pseq(q,ic(q,n))+1}.p(q).core$ (lemma 114)
$= mc^{n+1}.p(q).core$ (induction hypothesis).
• $s(n).(t,u) \ne (core,q)$:
$mc^{pseq(q,ic(q,n+1))}.p(q).core = mc^{pseq(q,ic(q,n))}.p(q).core$ (definition)
$= mc^n.p(q).core$ (induction hypothesis)
$= mc^{n+1}.p(q).core$ (definition). □
As always, signals Z on processor q of a local configuration i are abbreviated as:
Z^{q,i}_σ = Z(mc^{pseq(q,i)}.p(q), s(pseq(q,i)).o).
For instance, we abbreviate the walks passed to the specification machine by
w^{q,i}_{I σ} = s(pseq(q,i)).o.w_I
w^{q,i}_{E σ} = s(pseq(q,i)).o.w_E
whereas for the external event signals passed we introduce in a natural way:
eev^{q,i}_σ = s(pseq(q,i)).o.eev.
Similarly, we abbreviate the used predicates in global steps as
used(Z,n) = used(Z, mc^n, s(n))
and in processor local steps as
used(Z)^{q,i}_σ = used(Z, mc^{pseq(q,i)}.p(q), s(pseq(q,i)).o).
9.1.3 Accessing Memory
In [KMP14] the type of the access was an abbreviation:
.type = .(r,w,cas,f).
The type component of an access is always used. Using the notation above we define for
acc ∈ K_acc, any step number n, processor q, and instruction index i:
used(acc.type,n) = 1
used(acc.type)^{q,i}_σ = 1.
For those components of accesses which are never used (e.g., .data components of instruction
accesses) the corresponding definitions are omitted below.
Address Translation
For translation of the instruction and effective address we introduce the access
tacc : ℕ → K_acc
(for translation access) which we define as follows. In case a TLB step is performed
s(n).t = tlb
we specify the translation access using notation from Sect. 3.3.2.
tacc(n) = tacc(s(n).o)
Otherwise, we simply specify a void access.
tacc(n).type = 0000
For convenience we abbreviate the output of the translation access in ISA step n as
tmout(n) = dataout(ℓ(mc^n.m), tacc(n)).
For the used predicates we obviously define:
used(tacc.a,n) = /void(tacc(n)).
Instruction Fetch
For fetching instructions we update the definition of the instruction access to match the new
specification. Now instructions are fetched from the pma_I. In contrast to the translation access
above, the instruction access
iacc : K_C × Σ_core → K_acc
is defined by the local processor configuration. In the absence of interrupts of level 4 or lower
il^{q,i}_σ > 4
we specify the instruction access to be a read of the memory line containing the instruction
word.
iacc^{q,i}_σ.a = pma^{q,i}_{I σ}.l
iacc^{q,i}_σ.type = 1000
Otherwise (il^{q,i}_σ ≤ 4), a void access is specified.
iacc^{q,i}_σ.type = 0000
For convenience we abbreviate as usual,
iacc(q,i) = iacc^{q,i}_σ.
For the used predicates we define:
used(iacc.a)^{q,i}_σ = /void(iacc(q,i)).
Data Access
We update the definition of the data access in the same way we updated the definition of the
instruction access above. Formally the data access
dacc : K_C × Σ_core → K_acc
is defined by the local processor configuration. In the absence of JISR on a memory operation
mop^{q,i}_σ ∧ /jisr^{q,i}_σ
we specify a non-void access based on the following signals.
dacc^{q,i}_σ.a = pma^{q,i}_{E σ}.l
dacc^{q,i}_σ.data = dmin^{q,i}_σ
dacc^{q,i}_σ.cdata = cdata^{q,i}_σ
dacc^{q,i}_σ.bw = bw^{q,i}_σ
dacc^{q,i}_σ.type = (l,s,cas,0)^{q,i}_σ
Otherwise, we simply specify a void access.
dacc^{q,i}_σ.type = 0000
As usual we abbreviate
dacc(q,i) = dacc^{q,i}_σ.
Finally, for the used predicates we define:
used(dacc.a)^{q,i}_σ = /void(dacc(q,i))
used(dacc.cdata)^{q,i}_σ = dacc(q,i).cas
used(dacc.data)^{q,i}_σ ≡ dacc(q,i).(w,cas) ≠ 0^2
used(dacc.bw)^{q,i}_σ = used(dacc.data)^{q,i}_σ.
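The used predicates for data accesses can be illustrated by a small Python sketch. It is not part of the formal model: the class `DataAccess`, the function `used_fields`, and the string labels for access components are names assumed here for illustration, and a void access is taken to be one whose type bits are all zero, as in the definition above.

```python
from dataclasses import dataclass


@dataclass
class DataAccess:
    # Type bits (r, w, cas, f) of an access; all False models type = 0000.
    r: bool = False
    w: bool = False
    cas: bool = False
    f: bool = False

    def void(self) -> bool:
        # Assumption: an access is void iff its type is 0000.
        return not (self.r or self.w or self.cas or self.f)


def used_fields(acc: DataAccess) -> set:
    """Return the components of a data access that count as used."""
    used = {"type"}                 # the type component is always used
    if not acc.void():
        used.add("a")               # used(dacc.a) = /void(dacc)
    if acc.cas:
        used.add("cdata")           # used(dacc.cdata) = dacc.cas
    if acc.w or acc.cas:
        used |= {"data", "bw"}      # used(dacc.bw) = used(dacc.data)
    return used
```

For a load only the address and the type are used, whereas a CAS access uses all components.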
9.2 Multi-Core Processor
Following the specification from Sect. 9.1, in a completely straightforward way we assemble
a multi-core processor by putting p pipelined cores (as constructed in Chap. 8) in parallel.
Clearly, the total number of caches involved increases to 4p. Everywhere in the remainder of
this chapter we use indices
q ∈ [0 : p−1]
to iterate over processors h_π.p(q) of the multi-core machine h_π. Of course, the four caches
connected to processor number q are chosen simply as follows.
itca^q_π = h_π.ca(4q)
ica^q_π = h_π.ca(4q+1)
dtca^q_π = h_π.ca(4q+2)
dca^q_π = h_π.ca(4q+3)
Note, for the sake of simplicity we may omit the processor indices (q) where possible.
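As an illustration of the cache numbering, the following Python sketch maps a processor index and a cache kind to the global cache index and back. The function names `cache_index` and `cache_owner` and the string labels are assumptions made for this example only.

```python
# Offsets 0..3 within the group of four caches of one processor,
# following itca = ca(4q), ica = ca(4q+1), dtca = ca(4q+2), dca = ca(4q+3).
CACHE_KINDS = ("itca", "ica", "dtca", "dca")


def cache_index(q: int, kind: str) -> int:
    """Global index of the cache of the given kind on processor q."""
    return 4 * q + CACHE_KINDS.index(kind)


def cache_owner(i: int):
    """Inverse mapping: (processor index, cache kind) of cache number i."""
    return i // 4, CACHE_KINDS[i % 4]
```

The inverse mapping corresponds to the tests (i mod 4) and ⌊i/4⌋ which are applied to cache indices in Sect. 9.2.1.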
9.2.1 Stepping of Components
In this section we specify i) which components perform steps in a given hardware cycle and ii)
the order in which these steps are performed. Recall, in Chap. 7 we introduced predicates cstep
and tstep_Y to specify resp. steps of the processor core and TLBs of the sequential single-core
machine. We update these predicates for use in the multi-core machine. Thus, processor q
performs a core step in cycle t in case the memory stage on processor q is updated in cycle t.
cstep(q,t) ≡ ue^{q,t}_6
Processor q performs a step of TLB Y ∈ {I,E} in cycle t in case the corresponding MMU on
processor q has generated a step in cycle t.
tstep_Y(q,t) ≡ tadd(h^t_π.p(q).mmu_Y)
The processor cores which are stepped in cycle t are collected into the set
PS(t) = {q | cstep(q,t)}
whereas the TLBs which are stepped in cycle t are collected into the sets
TS_Y(t) = {q | tstep_Y(q,t)}
depending on the MMU (mmu_I or mmu_E) which generated the corresponding step. The total
number of steps performed by all components in cycle t can be easily expressed using the
definitions above; we denote it by
ns(t) = #PS(t) + #TS_I(t) + #TS_E(t).
At this point we can write down the usual definition of the total number of steps performed by
all components before cycle t:
NS(0) = 0
NS(t+1) = NS(t)+ns(t).
For convenience, we also collect the numbers of steps performed within cycle t into the set
CS(t) = [NS(t) : NS(t+1)−1].
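The counters ns, NS and the step interval CS can be illustrated with a short Python sketch. Here the sets PS(t), TS_I(t) and TS_E(t) are simply given as lists of sets indexed by the cycle; this representation and the concrete example values are assumptions for illustration, since in the hardware these sets are derived from signals.

```python
def ns(PS, TS_I, TS_E, t):
    """Number of steps performed by all components in cycle t."""
    return len(PS[t]) + len(TS_I[t]) + len(TS_E[t])


def NS(PS, TS_I, TS_E, t):
    """Number of steps performed by all components before cycle t."""
    return sum(ns(PS, TS_I, TS_E, c) for c in range(t))


def CS(PS, TS_I, TS_E, t):
    """Global step numbers [NS(t) : NS(t+1)-1] performed within cycle t."""
    return list(range(NS(PS, TS_I, TS_E, t), NS(PS, TS_I, TS_E, t + 1)))


# Hypothetical example with two processors and three cycles.
PS = [{0, 1}, set(), {1}]
TS_I = [{0}, {1}, set()]
TS_E = [set(), {0, 1}, set()]
```

In this example three steps are performed in each of the cycles 0 and 1, so the steps of cycle 1 get the numbers 3, 4 and 5.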
Note that the ordering of the steps performed within cycle t, and therefore the global ordering of
steps, is in general not arbitrary. Broadly speaking, the control mechanisms implemented in
hardware can, in particular cycles, enforce a certain ordering of steps as the only possible one.
In our design the control mechanisms are implemented in a way
that allows us to step the components which perform memory accesses prior to the remaining
components. On the other hand, in order to make the proofs independent of the particular
implementation of the cache memory system, components accessing memory are stepped in
the sequential order in which the memory system in use orders their accesses.
In order to filter out accesses of the MMUs which do not generate steps (for instance, due to
possible abortions), we introduce some more notation. The three sets below accumulate accesses
generated by the MMUs (S_I and S_E) as well as accesses generated by processors (S_D).
S_I(t) = {(i,k) ∈ A(t) | (i mod 4) = 0 ∧ ⌊i/4⌋ ∈ TS_I(t)}
S_E(t) = {(i,k) ∈ A(t) | (i mod 4) = 2 ∧ ⌊i/4⌋ ∈ TS_E(t)}
S_D(t) = {(i,k) ∈ A(t) | (i mod 4) = 3 ∧ ⌊i/4⌋ ∈ PS(t)}
Note, an access is included only if the corresponding unit actually
performs a step in cycle t. By definition we obviously have
S(t) = S_I(t) ∪ S_E(t) ∪ S_D(t) ⊆ A(t).
As usual, we abbreviate the total number of the stepping accesses by
sa(t) = #S(t).
Analogous to Sect. 8.3.5, for y < sa(t) we define:
z(0,t) = min{n | NE(t)+n ∈ seq(S(t))}
z(y,t) = min{n | NE(t)+n ∈ seq(S(t)) ∧ n > z(y−1,t)}.
Using function z for indices
y ∈ [0 : sa(t)−1]
we can easily extract from sequence acc′ the sequence of stepping accesses zacc′_t:
zacc′_t[y] = acc′[NE(t)+z(y,t)].
Before we define the stepping order, we introduce one more technicality, which helps us to
distinguish between possibly two different TLB steps performed by the same processor in the
same hardware cycle. We extend the stepping function to include another field .g that gives the
component which generates the latter step. Formally, for global step n we specify
s(n).g ∈ {proc, mmu_I, mmu_E}.
Now we can specify the order in which components are stepped. Below we split cases on the
value of y < ns(t).
• y < sa(t). Components performing memory accesses are stepped in the order in which the
corresponding memory accesses are performed.
s(NS(t)+y).(g,u) = (proc,q) ↔ ∃k : NE(t)+z(y,t) = seq(4q+3, k) (67)
s(NS(t)+y).(g,u) = (mmu_Y,q) ↔ ∃k : NE(t)+z(y,t) = seq(4q+2·1_{Y=E}, k) (68)
• y ≥ sa(t). Components which do not perform memory accesses are stepped in an arbitrary
order. Each such component is stepped exactly once. Formally
s(NS(t)+[sa(t) : ns(t)−1]).(g,u) = (proc, Q_D(t)) ∪ ⋃_Y (mmu_Y, Q_Y(t))
where sets Q_D(t) and Q_Y(t) collect the processor indices resp. of the cores and MMUs
which generate steps in cycle t and perform no memory accesses.
Q_D(t) = {q | ∀y < sa(t) : s(NS(t)+y).(g,u) ≠ (proc,q)} ∩ PS(t)
Q_Y(t) = {q | ∀y < sa(t) : s(NS(t)+y).(g,u) ≠ (mmu_Y,q)} ∩ TS_Y(t)
Finally, we derive the step types based on types of the components which generate the
corresponding steps in a natural way.
s(n).t = { core   if s(n).g = proc
         { tlb    otherwise
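The stepping order within one cycle can be sketched in Python as follows. The sketch makes several assumptions that are not part of the formal definition: the stepping accesses S(t) are given as a list of cache indices already sorted in the order chosen by the memory system, only indices with (i mod 4) ∈ {0, 2, 3} occur in that list, and the "arbitrary" order of the remaining components is fixed to a sorted order. The names `GENERATOR` and `step_order` are hypothetical.

```python
# Generating component for a stepping access on cache i, by (i mod 4):
# 0 -> instruction MMU (itca), 2 -> effective-address MMU (dtca), 3 -> core (dca).
GENERATOR = {0: "mmuI", 2: "mmuE", 3: "proc"}


def step_order(stepping_caches, PS, TS_I, TS_E):
    """Return the list of (g, u) pairs stepped in one cycle, in stepping order."""
    # Steps y < sa(t): follow the order of the memory accesses.
    order = [(GENERATOR[i % 4], i // 4) for i in stepping_caches]
    stepped = set(order)
    # Steps y >= sa(t): every remaining stepped component exactly once.
    pending = [("proc", q) for q in sorted(PS)]
    pending += [("mmuI", q) for q in sorted(TS_I)]
    pending += [("mmuE", q) for q in sorted(TS_E)]
    order += [c for c in pending if c not in stepped]
    return order
```

For example, with stepping accesses on caches 7 and 0 the core of processor 1 is stepped first and mmu_I of processor 0 second; all other components stepped in that cycle follow afterwards.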
9.2.2 Software Conditions
Software conditions formulated for the basic MIPS machine from Chap. 2 (Sect. 2.2.2) are
no longer valid for the pipelined multi-core machine and are updated as follows. First, these
conditions are relaxed in that the addresses of memory accesses, both for instruction
fetch and memory operations, no longer have to be aligned. Alignment is now tested
by hardware, and whenever it is violated, the corresponding misalignment interrupts are raised. This
guarantees that the memory is still accessed only at aligned addresses.
On the other hand, we have to strengthen the software conditions in order to handle the problem
which occurs in pipelined machines due to self-modifying code. Namely, instruction i fetched
by the instruction fetch stage at address a is outdated (incorrect) if the memory at address a
is overwritten by any of the lower stages of the local pipeline, or simply by any
non-local pipeline of the multi-core processor.¹ In [KMP14] this problem was avoided by
introducing two disjoint regions for code and data, and restricting the code region to be read only,
which effectively forbids any self-modifying code. In this thesis self-modifying code is
allowed under the following software condition. For instruction i executed on processor q we
require that its address (pma^{q,i}_{I σ}) is not written in ISA steps after execution of instruction (i−5)
and before execution of instruction (i+1) on processor q.
SC_code(q,i) ≡ ∀n ∈ [pseq(q,i−5)+1 : pseq(q,i+1)−1] :
s(n).(t,u) = (core,q′) ∧ write(mc^n.p(q′), s(n).o) →
pma_E(mc^n.p(q′), s(n).o).l ≠ pma^{q,i}_{I σ}.l (69)
Note, in the software condition above we expose the depth of the pipelines used in the multi-
core machine. For these pipelines we know that local instruction i is not fetched earlier than
local instruction (i− 5) is executed. Therefore, in Sect. 9.5.3 we can show that the address
of instruction i fetched on processor q is not modified while instruction i progresses through
the local pipeline, neither by instructions in the lower stages on processor q nor by the other
processors.
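To illustrate the shape of the condition SC_code, the following Python sketch checks it on an abstract trace of ISA steps. It is only a sketch under stated assumptions: a trace is a list of tuples (kind, processor, written line), where the written line is None for steps that do not write; for i < 5 the interval is assumed to start at step 0; and if instruction i+1 does not occur in the finite trace, the interval is assumed to extend to the end of the trace. The names `pseq` and `sc_code` mirror the notation above, but the functions themselves are hypothetical.

```python
def pseq(trace, q, i):
    """Global index of the i-th core step (counting from 0) of processor q,
    or None if the trace contains no such step."""
    count = 0
    for n, (kind, u, _written) in enumerate(trace):
        if kind == "core" and u == q:
            if count == i:
                return n
            count += 1
    return None


def sc_code(trace, q, i, fetch_line):
    """Check that no core step in [pseq(q,i-5)+1 : pseq(q,i+1)-1] writes
    the line from which instruction i of processor q is fetched."""
    lo = pseq(trace, q, i - 5) if i >= 5 else None
    start = 0 if lo is None else lo + 1          # assumption for i < 5
    hi = pseq(trace, q, i + 1)
    end = len(trace) if hi is None else hi       # assumption for finite traces
    for n in range(start, end):
        kind, _u, written = trace[n]
        if kind == "core" and written == fetch_line:
            return False
    return True
```

Note that writes by any processor q′ are checked, not only those of processor q, and that the interval includes the step which executes instruction i itself.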
Finally, the software condition forbidding store or CAS operations on ROM remains. Below
we simply reformulate the latter condition to fit the description of the multi-core machine.
SC_ROM(q,i) ≡ write^{q,i}_σ → ⟨pma^{q,i}_{E σ}.l⟩ ≥ 2^r (70)
Formally, for instruction i executed on processor q we define the software conditions as
follows.
SC(q,i) ≡ SC_code(q,i) ∧ SC_ROM(q,i)
Recall from Sect. 2.2.2 that the software conditions are assumed to hold only if the
corresponding guard conditions are respected. For the multi-core machine we formalize the latter
assumption as follows: the software conditions for instruction fetch (SC_code) are not violated
by instruction i executed on processor q if the guard conditions are respected by all ISA steps
until pseq(q,i) and by the fetch in step pseq(q,i). Using lemma 110 we derive the following.
s(n).(t,u) = (core,q) ∧ Γ^n ∧ Φ_I(mc^n.p(q), s(n).o) → SC_code(q, ic(q,n)) (71)
Clearly, no software condition is violated by instruction i executed on processor q if the
guard conditions are respected by all ISA steps until pseq(q,i) and by step pseq(q,i). Again,
using lemma 110 we summarize the latter arguments as follows.
s(n).(t,u) = (core,q) ∧ Γ^{n+1} → SC(q, ic(q,n)) (72)
9.2.3 Speculation Stage
Instructions in the upper pipeline stages (above the memory stage) do not necessarily obey the
software conditions that we formally described in Sect. 9.2.2 above. In case the latter
instructions disobey the software conditions, the corresponding pipeline stages obviously cannot
contain the correct data, and should be excluded from the correctness statement we
formulate in Sect. 9.2.4. Below we introduce some more machinery which helps us to handle the
aforementioned cases. Note, for convenience processor indices are omitted in the scope of this
section.
¹ Note, though in single-core pipelined machines some forwarding circuits can be added to keep track
of the addresses written in the lower pipeline stages, in the multi-core machines this would require
adding some extra mechanisms for the inter-processor communication. To our knowledge, modern
processors do not implement such mechanisms.
All truly full stages above the memory stage containing the data of instructions which disobey
the software conditions for instruction fetch are collected into the set
S_soft(t) = {k ∈ [1 : 5] | rfull^t_k ∧ /SC_code(I(k,t)−1)}.
This allows us to determine the “lowest” pipeline stage containing the relevant data; it is called
the software speculation stage and formally defined as follows:
Σ_soft(t) = { max S_soft(t)   if S_soft(t) ≠ ∅
            { 0               otherwise.
Additionally, in the presence of misspeculation we clearly cannot claim correctness of the
pipeline stages which are subject to rollbacks. Thus, all truly full stages containing the
data of instructions executed with JISR (above the memory stage) and eret instructions (in
stages 1 and 2) are collected resp. into the following sets.
S_jisr(t) = {k ∈ [1 : 5] | rfull^t_k ∧ jisr^{I(k,t)−1}_σ}
S_eret(t) = {k ∈ [1 : 2] | rfull^t_k ∧ eret^{I(k,t)−1}_σ}
Note, in the definitions above we do not distinguish between legal and illegal eret instructions.
The reason is that any illegal instruction triggers a JISR in the ISA computation anyway.
Using the sets above we introduce yet another speculation stage
Σ(t) = max{Σ_jisr(t), Σ_eret(t)}
where
Σ_jisr(t) = { max S_jisr(t)   if S_jisr(t) ≠ ∅
            { 0               otherwise
and
Σ_eret(t) = { max S_eret(t)   if S_eret(t) ≠ ∅
            { 0               otherwise.
The latter allows us to determine the lowest pipeline stage containing the data of an instruction
that is executed, both in hardware and in software, without being rolled back; it is called the
hardware speculation stage and defined as follows:
Σ_hard(t) = max{Σ(t), µ(t)}.
For convenience, we introduce the following shorthands.
Σ̂(t) = max{Σ_soft(t), Σ_hard(t)}
Σ⃗(t) = (Σ_soft(t), Σ_hard(t))
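The definitions of the speculation stages can be summarized in a short Python sketch. It takes the sets S_soft(t), S_jisr(t), S_eret(t) and the value µ(t) (defined in Chap. 8) as plain inputs; how these are obtained from hardware signals is not modeled, and the function names are assumptions made for illustration.

```python
def max_or_zero(stages):
    """max S if the set S of stage indices is nonempty, and 0 otherwise."""
    return max(stages, default=0)


def speculation_stages(S_soft, S_jisr, S_eret, mu):
    """Return (sigma_soft, sigma_hard, sigma_hat) for one cycle.

    S_soft, S_jisr, S_eret: sets of stage indices as defined above.
    mu: the value mu(t), taken here as a given integer.
    """
    sigma_soft = max_or_zero(S_soft)
    sigma = max(max_or_zero(S_jisr), max_or_zero(S_eret))
    sigma_hard = max(sigma, mu)
    sigma_hat = max(sigma_soft, sigma_hard)
    return sigma_soft, sigma_hard, sigma_hat
```

The pair of the first two returned components corresponds to the vector shorthand, and the third component to the maximum of both speculation stages.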
9.2.4 Induction Hypothesis
Essentially, the simulation theorem proven in this chapter is a counterpart of the corresponding
theorem proven in the last chapter of [KMP14]. Here the latter result is extended to handle the
(internal) interrupts and two more pipeline stages, added to support virtual address translation.
Since the form of the simulation theorem does not change, to ease the presentation we proceed
directly to the induction hypothesis.
The hypothesis as usual consists of several parts, which we introduce below in separate sec-
tions:
i) fulfillment of guard conditions imposed on the constructed ISA computation,
ii) embedding of the line-addressable abstraction of the hardware memory (cache memory
system) into the byte-addressable memory of the ISA,
iii) presence of walks and walk compositions accumulated by the hardware (MMUs) resp. in
the TLB and constructed TLB of the ISA,
iv) coupling of the pipeline registers; includes couplings of both, visible and invisible pipeline
registers, resp. with the data structures and signals of the ISA,
v) coupling of the ghost walk registers (with the signals of the ISA), and
vi) invariants on contents of the ghost walk registers.
Guard Conditions
First, we require that steps of the constructed ISA computation respect their corresponding
guard conditions.
Γ^{NS(t)}
Simulation of Memory
For abstraction of the cache memory system we have:
m(h^t_π) = ℓ(mc^{NS(t)}.m).
Simulation of TLBs
Recall, each processor of the multi-core machine contains two MMUs, one for translation of
the instruction addresses and one for translation of the effective addresses:
h_π.p(q).mmu_I and h_π.p(q).mmu_E.
We abbreviate the mmu_Y of processor q as
mmu^q_Y = h_π.p(q).mmu_Y.
Using the simulation relation for TLBs, introduced in Sect. 5.2.2, for MMUs of processor q we
have:
simtlb(mmu^{q,t}_Y, mc^{NS(t)}.p(q)).
Simulation of Pipeline Registers
Let i be the index of instruction executed in stage k of processor q in cycle t
i = I(q,k, t)
and R be a pipeline register (configuration component) in stage k of processor q.
R ∈ reg(k)
As we stated previously in Sect. 8.2.3, we claim correctness only for instructions which are ex-
ecuted without being rolled-back, i.e., instructions in the live stages. Therefore, we additionally
assume liveness of the instruction used for simulation of certain pipeline registers. Thus, for
simulation of the visible registers in stage k we assume liveness of instruction i, i.e., live(k, t)
on processor q, which according to the definitions is equivalent to
k > µ(q, t).
For the invisible registers in stage k we analogously assume liveness of the related instruction,
i.e., liveness of instruction
i−1 = I(q,k, t)−1
= I(q,k+1, t) (lemma 79)
on processor q, or equivalently
k+1 > µ(q, t).
Analogous to the arguments above, for visible registers in stage k we assume that instruction i
is executed no later than the execution of jisr or eret. Formally, in case
Σ(q, t) = σ > 0
by definition we have jisrq, jσ or eret
q, j
σ for
9.2 Multi-Core Processor 233
Table 12: Local invisible registers in various pipeline stages
stage # stage name invisible registers
1 IT wI .1, ca.1pi [gf : 0], pmaI.1pi
2 IF wI .2, ca.2pi [gf : 0]
3 ID wI .3, ca.3pi [gf : 0]
· · ·
j = I(q,σ , t)−1
= I(q,σ +1, t) (lemma 79).
As we assume i≤ j on processor q, using lemma 81 we conclude
k ≥ σ +1
or equivalently
k > Σ(q, t).
For the invisible registers in stage k, repeating the arguments above for the related instruction,
i.e., instruction
i−1 = I(q,k+1, t)
on processor q, we analogously derive
k+1 > Σ(t).
Finally, for the visible registers in stage k we assume that no instructions violating the software
conditions were executed before execution of instruction i. Formally, if
Σ_soft(q,t) = σ > 0
by definition we have /SC_code(q,j) for
j = I(q,σ+1,t).
Therefore, assuming i ≤ j on processor q, we derive exactly as above
k > Σ_soft(q,t).
For the invisible registers in stage k, which contain the data (after execution) of the related
instruction, i.e., data of instruction i−1 on processor q, we must consider whether their content
actually depends on instruction i−1 or not. For that reason we distinguish between the local
invisible (Table 12) and non-local invisible registers. As the name suggests, the content of local
invisible registers is determined purely by the processor (local) state, and does not depend on
the instruction word fetched from the memory.
Therefore, for the local invisible registers in stage k we can assume, just as for the visible
registers, that no instructions violating the software conditions were executed before execution
of the related instruction, i.e.,
i−1 = I(q,k+1,t)
≤ I(q,σ+1,t) = j
on processor q. Repeating the arguments presented above, for local invisible registers we derive
k+1 > Σ_soft(q,t).
On the other hand, for the non-local invisible registers in stage k we assume that no instructions
violating the software conditions were executed up to the execution of the related instruction,
i.e.,
i−1 = I(q,k+1,t)
< I(q,σ+1,t) = j
on processor q. Following the same arguments as those presented above, for non-local invisible
registers we obtain
k > Σ_soft(q,t).
Summarizing the arguments above, depending on the type of register R, we require:
• for a visible register
Σ̂(q,t) < k → R^{q,t}_π = R^{q,i}_σ,
• for a local invisible register
Σ̂(q,t) ≤ k ∧ rfull^{q,t}_k ∧ used(R)^{q,i−1}_σ → R^{q,t}_π = R^{q,i−1}_σ,
• for a non-local invisible register
Σ⃗(q,t) ≤ (k−1,k) ∧ rfull^{q,t}_k ∧ used(R)^{q,i−1}_σ → R^{q,t}_π = R^{q,i−1}_σ.
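The premises of the three coupling requirements can be compared side by side in a small Python sketch. It evaluates only the left-hand sides of the implications, i.e., it tells for which registers the coupling is claimed at all. The comparison of the vector of speculation stages with (k−1,k) is read componentwise, in line with the derivation above (Σ_soft ≤ k−1 and Σ_hard ≤ k); the function name and the string labels are assumptions made for this illustration.

```python
def coupling_required(kind, k, sigma_soft, sigma_hard, rfull=True, used=True):
    """Premise of the coupling requirement for a register in stage k.

    kind: "visible", "local_invisible" or "nonlocal_invisible".
    sigma_soft, sigma_hard: software and hardware speculation stages.
    rfull, used: real full bit of stage k and used predicate of the register
                 (ignored for visible registers).
    """
    sigma_hat = max(sigma_soft, sigma_hard)
    if kind == "visible":
        return sigma_hat < k
    if kind == "local_invisible":
        return sigma_hat <= k and rfull and used
    if kind == "nonlocal_invisible":
        # Componentwise comparison with (k-1, k).
        return sigma_soft <= k - 1 and sigma_hard <= k and rfull and used
    raise ValueError("unknown register kind: " + kind)
```

For instance, with Σ_soft = k the coupling is still required for local invisible registers of stage k, but not for visible or non-local invisible ones.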
Simulation of Ghost Registers
Below we abbreviate the ghost walk registers of processor q in cycle t as
w_I.k^{q,t} = h^t_π.p(q).w_I.k
w_E.5^{q,t} = h^t_π.p(q).w_E.5
and require, as for the ordinary pipeline (invisible) registers, the following to hold.
Σ̂(q,t) ≤ k ∧ rfull^{q,t}_k ∧ used(w_I)^{q,i−1}_σ → w_I.k^{q,t} = w^{q,i−1}_{I σ}
Σ⃗(q,t) ≤ (4,5) ∧ rfull^{q,t}_5 ∧ used(w_E)^{q,i−1}_σ → w_E.5^{q,t} = w^{q,i−1}_{E σ}
Invariants
In addition we require for the ghost walk registers two more things. First, we require that the
ghost registers on processor q are contained, depending on the execution mode of the pipeline,
either in the (specification) TLB or in the constructed TLB.
Invariant 10.
Σ̂(q,t) ≤ k ∧ rfull^{q,t}_k ∧ used(w_I)^{q,i−1}_σ → w_I.k^{q,t} ∈ { mc^{NS(t)}.p(q).tlb°   if user(q,t)
                                                              { mc^{NS(t)}.p(q).tlb    if guest(q,t)
Σ⃗(q,t) ≤ (4,5) ∧ rfull^{q,t}_5 ∧ used(w_E)^{q,i−1}_σ → w_E.5^{q,t} ∈ { mc^{NS(t)}.p(q).tlb°   if user(q,t)
                                                                  { mc^{NS(t)}.p(q).tlb    if guest(q,t)
Also these ghost registers have to match the translation requests of the instructions executed
on processor q in the corresponding pipeline stages.
Invariant 11.
Σ̂(q,t) ≤ k ∧ rfull^{q,t}_k ∧ used(w_I)^{q,i−1}_σ → match(trq^{q,i−1}_{I σ}, w_I.k^{q,t})
Σ⃗(q,t) ≤ (4,5) ∧ rfull^{q,t}_5 ∧ used(w_E)^{q,i−1}_σ → match(trq^{q,i−1}_{E σ}, w_E.5^{q,t})
Induction Base
As usual, after the hardware reset the simulation of the ISA data structures is obtained via
extracting an appropriate ISA configuration mc^0 from the hardware configuration h^0. In this
regard for simulation of the ISA memory it suffices to specify
ℓ(mc^0.m) = m(h^0_π).
For the program counters and register files on all processors q we obtain analogously:
mc^0.p(q).(pc,dpc,ddpc) = h^0_π.p(q).(pc,dpc,ddpc) = (8_32, 4_32, 0_32)
mc^0.p(q).(gpr,spr) = h^0_π.p(q).(gpr,spr).
For the invisible pipeline registers there is obviously nothing to show in cycle t = 0, after reset.
Finally, for contents of the (specification) TLB and constructed TLB on all processors q we
obtain trivially:
mc^0.p(q).tlb ⊇ tlb_G(mmu^{q,0}_I) ∪ tlb_G(mmu^{q,0}_E) = ∅
mc^0.p(q).tlb° ⊇ tlb_U(mmu^{q,0}_I) ∪ tlb_U(mmu^{q,0}_E) = ∅.
Note, from lemma 35 there is nothing to show for the walk registers of the MMUs, since the
corresponding control automata reside in their idle states in cycle t = 0, after reset.
9.2.5 Stepping Inputs
Assume that processor q makes a step in cycle t and this step gets number n in the global
ordering of ISA steps. Let us consider the TLB steps first and assume that step n is generated
by mmu_Y. According to the definition (see Sect. 9.1.1), apart from the scheduling inputs
s(n).g = mmu_Y
s(n).u = q ∈ TS_Y(t)
the stepping function also provides the oracle input. We define the oracle input for the latter
TLB step as follows.
tstep_{Y G}(q,t) → s(n).o = { (winit, upa_G(mmu^{q,t}_Y).pa, guest)     if winit_G(mmu^{q,t}_Y)
                            { (wext, mmu^{q,t}_Y.w_G, ⊥)                otherwise
tstep_{Y U}(q,t) → s(n).o = { (winit, upa_U(mmu^{q,t}_Y).pa, user)      if winit_U(mmu^{q,t}_Y)
                            { (wext, mmu^{q,t}_Y.w_U, mmu^{q,t}_Y.w_G)  otherwise
Now assume that step n is performed by the processor core.
s(n).t = core
s(n).u = q ∈ PS(t)
Unlike the scheduling inputs, we assemble the oracle input for the processor core steps
gradually, as the associated instruction advances through the pipeline to the lower/later stages. At the
memory stage, when the processor core performs a step to execute that instruction, all fields of
the stepping function must be completely specified. Further assume that processor q makes a
step in cycle t to execute instruction
i = I(q,6,t).
Instruction i progresses down through pipeline stages k ∈ {1,5} in cycles t_k such that
I(q,k,t_k) = i ∧ ue^{q,t_k}_k.
Note, due to possible misspeculations, there might be multiple cycles t_k in which instruction i
progresses in stage k. The oracle inputs for global step n are specified in those cycles t_k in which
we in addition have
k > µ(q,t_k).
The latter cycles are provably unique (see Sect. 8.4.5, equation 65). In cycles t_1 and t_5 we
specify the first two components of the oracle input for step n.
s(n).o.w_I = { mmu^{q,t_1}_I.wout   if mmu^{q,t_1}_I.treq
             { ⊥                    otherwise
s(n).o.w_E = { mmu^{q,t_5}_E.wout   if mmu^{q,t_5}_E.treq
             { ⊥                    otherwise
Thus, the walk for translation of the instruction address is taken from the outputs of mmu_I once
instruction i leaves the first translation stage for the last time. Analogously, if a memory
operation is performed, the walk for translation of the effective address is taken from the outputs of
mmu_E once instruction i leaves the second translation stage for the last time. It is easy to see that
the statements below follow from the definition above; we use these statements to argue about
correctness for ghost walk registers in Sect. 9.6.3. Recall, the shorthands for the walk inputs
were defined earlier in Sect. 9.1.2.
w^{q,I(q,1,t_1)}_{I σ} = { mmu^{q,t_1}_I.wout   if mmu^{q,t_1}_I.treq
                         { ⊥                    otherwise           (73)
w^{q,I(q,5,t_5)}_{E σ} = { mmu^{q,t_5}_E.wout   if mmu^{q,t_5}_E.treq
                         { ⊥                    otherwise           (74)
Finally, the vector of external events provided to processor q is sampled right before instruction
i is executed in the memory stage. According to the assumption, the latter happens in cycle t.
s(n).o.eev = eev^{q,t}
From the definition above one can easily derive the following.
eev^{q,I(q,6,t)}_σ = eev^{q,t} (75)
Note, since we assumed for the time being the absence of external interrupts, for all processors
we have
eev^{q,t} = 00. (76)
9.3 Developing Formalism
In this technical section we develop the machinery necessary to justify the induction
hypothesis from Sect. 9.2.4. Thus, in Sect. 9.3.1 we establish an important relationship between the
scheduling functions (I) and the instruction count (ic), whereas in Sect. 9.3.2 we relate the steps
of the processor (pseq) to the steps of the ISA computation. Also, in Sect. 9.3.3 we elaborate on
properties of the software speculation stage (Σ_soft).
9.3.1 Relating Instruction Count with Scheduling Functions
In this section we establish an intuitive relation between the number of instructions executed
on the processor core q before global step NS(t) and the index of instruction in the memory
stage of processor q in cycle t.
Lemma 116.
ic(q,NS(t)) = I(q,6, t)
Proof of lemma 116. The proof is by an easy induction on the number of cycles t, and therefore
we omit it. For details we refer to the proof of lemma 9.15 of [KMP14]. □
Clearly, the instruction count for processor q changes at most once in the interval
[NS(t) : NS(t+1)]
since each processor can be stepped at most once in cycle t. In the following lemma we identify
the steps at which the latter changes occur.
Lemma 117. Assume that processor q performs a core step in cycle t, i.e.,
s(NS(t)+ y).(t,u) = (core,q).
Then the following holds for n≤ ns(t):
ic(q,NS(t)+n) = { ic(q,NS(t))     if n ≤ y
                { ic(q,NS(t+1))   otherwise.
Proof of lemma 117. First, by induction on n we show the following.
n≤ y → ic(q,NS(t)+n) = ic(q,NS(t)) (77)
The base case (n = 0) holds trivially. For the induction step from n to n+ 1 ≤ y we argue as
follows. According to the definition of the stepping function (equation 67) we know
n < y → s(NS(t)+n).(t,u) ≠ (core,q)
and therefore we have
ic(q,NS(t)+n+1) = ic(q,NS(t)+n) (definition)
= ic(q,NS(t)) (induction hypothesis).
Next, directly from the definition, for n = y+1 we obtain
ic(q,NS(t)+n) = ic(q,NS(t)+ y)+1 (definition)
= ic(q,NS(t))+1 (equation 77). (78)
Finally, by induction on n we show the following.
n > y+1 → ic(q,NS(t)+n) = ic(q,NS(t))+1 (79)
Again using equation 67 we derive
n > y → s(NS(t)+n).(t,u) ≠ (core,q).
For the base case (n = y+2) we argue as follows.
ic(q,NS(t)+ y+2) = ic(q,NS(t)+ y+1) (definition)
= ic(q,NS(t))+1 (equation 78)
For the induction step from n to n+1≤ ns(t) we argue as follows.
ic(q,NS(t)+n+1) = ic(q,NS(t)+n) (definition)
= ic(q,NS(t))+1 (induction hypothesis)
In order to obtain the claim we simply rewrite Eqs. 78 and 79 using
ic(q,NS(t+1)) = ic(q,NS(t)+ns(t)) (definition)
= ic(q,NS(t))+1 (equation 79). □
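The statement of lemma 117 can be checked on a concrete example with a few lines of Python. The sketch assumes that ic(q,n) counts the core steps of processor q among the global steps 0, …, n−1, which matches the way the instruction count is used in the proof above; the step sequence is a hypothetical single cycle with NS(t) = 0 and ns(t) = 4.

```python
def ic(steps, q, n):
    """Number of core steps of processor q among global steps 0..n-1."""
    return sum(1 for (t, u) in steps[:n] if t == "core" and u == q)


# Hypothetical steps of one cycle (NS(t) = 0, ns(t) = 4).
# Processor 0 performs its core step at position y = 2.
steps = [("tlb", 0), ("core", 1), ("core", 0), ("tlb", 1)]
q, y = 0, 2

# ic stays at ic(q, NS(t)) for n <= y and equals ic(q, NS(t+1)) afterwards.
counts = [ic(steps, q, n) for n in range(len(steps) + 1)]
```

In this example the list `counts` stays at the initial value up to n = y and takes the value of ic(q,NS(t+1)) for all larger n, as claimed by the lemma.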
For convenience we introduce one more technical result, which follows directly from lem-
mas 117 and 116.
Lemma 118. Assume that processor q performs a core step in cycle t, i.e.,
s(n).(t,u) = (core,q)∧n ∈ CS(t).
Then
ic(q,n) = I(q,6, t).
9.3.2 Relating Global with Processor Local Steps
In order to show correctness for the instruction cache output, we require the following technical
result.
Lemma 119. Let i = I(q,2, t). Then the following holds.
[NS(t) : NS(t+1)−1]⊆ [pseq(q, i−5)+1 : pseq(q, i+1)−1]
Proof of lemma 119. First, from monotonicity of function pseq we obtain
NS(t)≤ NS(t+1) ≤ pseq(q, ic(q,NS(t+1))) (lemma 111)
= pseq(q, I(q,6, t+1)) (lemma 116)
≤ pseq(q, I(q,6, t)+1) (lemma 80)
≤ pseq(q, i+1) (lemma 81).
Then, again from monotonicity of pseq we derive
NS(t+1)≥ NS(t) > pseq(q, ic(q,NS(t))−1) (lemma 112)
= pseq(q, I(q,6, t)−1) (lemma 116)
≥ pseq(q, i−5) (lemma 81). □
9.3.3 Properties of Σ
In the scope of this section we derive properties of the software speculation stage (Σ_soft). The
corresponding properties of other speculation stages will be shown later, in Sect. 9.4.4, after
we weaken the assumptions of the lemma below (equation 88).
Lemma 120. Assume that step n (in the global ordering of ISA steps) is a core step of processor
q performed in cycle t.
s(n).(t,u) = (core,q) ∧ n ∈ CS(t) i)
Assume that the ISA computation is guarded until step n and after the fetch in step n.
Γ^n ∧ Φ_I(mc^n.p(q), s(n).o) ii)
Then
Σ_soft(q,t) < 5.
Proof of lemma 120. From the assumptions we know that the instruction executed in step n
(by processor q) does not violate the software conditions for instruction fetch. Formally, we
argue
Γ^n ∧ Φ_I(mc^n.p(q), s(n).o) → SC_code(q, ic(q,n)) (equation 71)
→ SC_code(q, I(q,6,t)) (lemma 118)
→ SC_code(q, I(q,5,t)−1) (lemma 79)
and the claim follows. □
In the presence of a misspeculation, the value of the software speculation stage (Σ_soft) in the
next cycle is given in the following lemma. Note, for convenience processor indices are omitted
below in this section.
Lemma 121. Assume Σ_soft(t) = σ such that σ > 0.
/ue^t_{σ+1} → Σ_soft(t+1) = { 0   if rbr^t_{σ+1}
                            { σ   otherwise          (1)
σ < 5 ∧ ue^t_{σ+1} → Σ_soft(t+1) = σ+1 (2)
Proof of lemma 121. Assume
Σ_soft(t) = σ_0 > 0
Σ_soft(t+1) = σ_1.
By definition of Σ_soft we have the following.
rfull^t_{σ_0} ∧ /SC_code(I(σ_0,t)−1) (80)
∄k > σ_0 : rfull^t_k ∧ /SC_code(I(k,t)−1) (81)
First we show that the following holds.
∄k > σ_0+1 : rfull^{t+1}_k ∧ /SC_code(I(k,t+1)−1) (82)
Proof of equation 82. By contradiction. Assume there is such a stage
k > σ_0+1
and it was updated in cycle t. From the definitions of the update enable signal (equation 35)
and the scheduling functions we resp. have
ue^t_k → rfull^t_{k−1}
and (since k > 1)
ue^t_k → I(k,t+1) = I(k−1,t).
From the assumption and arguments above we derive
rfull^t_{k−1} ∧ /SC_code(I(k−1,t)−1)
which leads to a contradiction:
Σ_soft(t) ≥ k−1 > σ_0.
Otherwise, if stage k was not updated in cycle t, from the definitions of the full and the
rollback-pending bits we resp. have
full^{t+1}_k ∧ /ue^t_k → stall^t_{k+1} ∧ /rollback^t_k
and (using equation 36)
/rbp^{t+1}_k ∧ /rollback^t_k → /rbp^t_k ∧ /rbr^t_{k+1}.
From the definitions of the stall and update enable signals we resp. have
stall^t_{k+1} → full^t_k
and
stall^t_{k+1} → /ue^t_{k+1}.
Thus, from the assumption and arguments above we conclude
rfull^t_k ∧ /SC_code(I(k,t+1)−1)
where for the instruction index we derive:
I(k,t+1)−1 = I(k+1,t+1) (lemma 79)
= I(k+1,t) (definition)
= I(k,t)−1 (lemma 79).
The latter gives a contradiction:
Σ_soft(t) ≥ k > σ_0+1. □
Next we split cases on the values of the rollback request and update enable signals of stage
σ0+1 in cycle t:
• uetσ0+1∧σ0 < 5. In this case we are to show
σ1 = σ0+1.
From the first part of lemma 75 and definition of the scheduling functions we resp. have
uetσ0+1 → rfullt+1σ0+1
and (since σ0 > 0)
uetσ0+1 → I(σ0+1, t+1) = I(σ0, t).
From equation 80 and the arguments above we conclude
rfullt+1σ0+1∧/SCcode(I(σ0+1, t+1)−1)
which by definition implies
Σsoft(t+1) ≥ σ0+1.
• /uetσ0+1∧/rbrtσ0+1. In this case we have to show
σ1 = σ0.
First, using part (5) of lemma 75 we derive
rfulltσ0 ∧/uetσ0+1∧/rbrtσ0+1 → rfullt+1σ0 .
Thus, from equation 80 and the arguments above we conclude
rfullt+1σ0 ∧/SCcode(I(σ0, t)−1)
where for the instruction index we derive:
I(σ0, t)−1 = I(σ0+1, t) (lemma 79)
= I(σ0+1, t+1) (definition)
= I(σ0, t+1)−1 (lemma 79).
The latter by definition implies
Σsoft(t+1) ≥ σ0.
Next we proceed to show
Σsoft(t+1) ≠ σ0+1.
We assume that the stage below σ0 has a real full bit in cycle t + 1, since otherwise the
claim follows. From the definitions of the full and the rollback-pending bits we resp. have
fullt+1σ0+1∧/uetσ0+1 → stalltσ0+2∧/rollbacktσ0+1
and (using equation 36)
/rbpt+1σ0+1∧/rollbacktσ0+1 → /rbptσ0+1.
Finally, from the definitions of the stall signal and the scheduling functions we resp. have
stalltσ0+2 → fulltσ0+1
and
/uetσ0+1∧/rbrtσ0+1 → I(σ0+1, t+1) = I(σ0+1, t).
From equation 81 and the arguments above we conclude
SCcode(I(σ0+1, t+1)−1)
which by definition implies
Σsoft(t+1) ≠ σ0+1.
• /uetσ0+1∧ rbrtσ0+1. Here we proceed to show
σ1 = 0.
Using part (2) of lemma 75 we conclude
∄k < σ0+1 : rfullt+1k
which by definition implies
Σsoft(t+1) ∈ {0}∪ [σ0+1 : 5].
Finally, for r = R(t) such that
σ0 < r ≤ 5
from the definition of the misspec signals we conclude
rbrtr → rfulltr.
Therefore, from the definition of the scheduling functions we have
/uetσ0+1∧ rbrtσ0+1 → I(σ0+1, t+1) = I(r, t)
and from equation 81 we conclude
SCcode(I(σ0+1, t+1)−1).
The latter by definition implies
Σsoft(t+1) ≠ σ0+1.
Equation 82 gives
Σsoft(t+1) ≤ σ0+1
which completes the proof in all cases above. □
For convenience we generalize the result above in the following lemma.
Lemma 122. Assume Σsoft(t) = σ.
Σsoft(t+1) ≤ σ + uetσ+1
Proof of lemma 122. For Σsoft(t) > 0 the claim follows directly from lemma 121. For Σsoft(t) = 0
we proceed to show the following.
Σsoft(t+1) ≤ uet1
Repeating the arguments presented in the proof of equation 82 we derive
∄k > 1 : rfullt+1k ∧/SCcode(I(k, t+1)−1)
which by definition implies
Σsoft(t+1) ≤ 1.
Therefore, if stage 1 is updated in cycle t (uet1), the claim follows. Otherwise, we argue by
contradiction; assume /uet1 and
Σsoft(t+1) = 1.
From the definition we have
rfullt+11 ∧/SCcode(I(1, t+1)−1).
Using parts (2) and (4) of lemma 75 we resp. conclude
/rbrt2∧ rfullt1.
From the definition of the misspec signals we further derive
/rbrt1
and therefore for the scheduling function in stage 1 by definition we obtain
I(1, t+1) = I(1, t).
From the arguments above we clearly have
Σsoft(t) = 1
which is a contradiction. □
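The case split of lemma 121 and the bound of lemma 122 can be sketched as a small next-value function. This is only an illustration under assumptions: the function and parameter names are hypothetical, the signals are passed in as plain booleans, and the case σ = 5 with an active update enable (about which lemma 121 makes no statement) is modelled by returning None.

```python
from typing import Optional


def next_sigma_soft(sigma: int, ue_next: bool, rbr_next: bool) -> Optional[int]:
    """Next value of the software speculation stage as stated in lemma 121.

    sigma    models Sigma_soft(t), assumed to satisfy 0 < sigma <= 5
    ue_next  models ue^t_{sigma+1} (the stage below sigma is updated in cycle t)
    rbr_next models rbr^t_{sigma+1} (rollback request of that stage)
    Returns None where lemma 121 makes no statement (sigma = 5 and ue_next).
    """
    if not 0 < sigma <= 5:
        raise ValueError("lemma 121 assumes 0 < sigma <= 5")
    if not ue_next:
        # part (1): without an update the stage either stays or is cleared
        return 0 if rbr_next else sigma
    if sigma < 5:
        # part (2): the misspeculating instruction moves one stage down
        return sigma + 1
    return None


def respects_lemma_122(sigma: int, ue_next: bool, nxt: int) -> bool:
    """Bound of lemma 122: Sigma_soft(t+1) <= sigma + ue^t_{sigma+1}."""
    return nxt <= sigma + int(ue_next)
```

For every input on which the sketch returns a value, that value respects the bound of lemma 122, which mirrors the first sentence of the proof above.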
9.4 Verifying Guard Conditions
Analogous to Sect. 7.5, we are to show that the guard conditions are respected in all ISA steps
until NS(t+1), i.e.,
∀n < NS(t+1) : Γ (mcn,s(n)).
Directly from the induction hypothesis we obtain:
∀n < NS(t) : Γ (mcn,s(n)).
For the remaining steps we proceed to show
∀n ∈ [NS(t) : NS(t)+ℓ−1] : Γ (mcn,s(n))
by induction on ℓ, where ℓ denotes the length of the sub-interval
[NS(t) : NS(t)+ℓ−1] ⊆ [NS(t) : NS(t+1)−1].
The base case holds trivially since for ℓ = 0 there is nothing to show. For the induction step
from ℓ−1 to ℓ ≤ ns(t) we split cases on whether step
n = NS(t)+ℓ
is a TLB step (see Sect. 9.4.1) or a processor core step (see Sect. 9.4.2).
Before we proceed to the induction step, we reformulate the result from Sect. 7.5 for the
implementation registers of the nested MMUs: in case a walk extension step is performed
by mmuY of processor q in cycle t, the implementation registers mmuY .wX contain valid walks
on both guest and user TLB steps.
Lemma 123. Assume q ∈ TAY (t).
tstepY G(q, t) → valid(mcNS(t).p(q).tlb,mmuq,tY .wG) (1)
tstepY U (q, t) → valid(mcNS(t).p(q).tlb,mmuq,tY .wU ,mmuq,tY .wG) (2)
Note, we can use the proof of lemma 72 literally, since the arguments involved rely entirely
on the internal construction of the nested MMU, namely on the construction of its control
mechanisms.
9.4.1 TLB Steps
Assume that step n is generated by mmuY on processor q.
s(n).(g,u) = (mmuY ,q)
Below we consider only those cases in which mmuY of processor q is not invalidated in cycle t
(/inval(mmuq,tY )), since otherwise we immediately derive a contradiction.
inval(mmuq,tY )→ /mmuq,tY .treq (definition)
→ /tstepY (q, t) (definition)
In case processor q performs a core step in cycle t and this step gets number m< n in the global
ordering of ISA steps, we apply lemma 120 to argue that
Σso f t(q, t)< 5
which allows us to use the induction hypothesis for non-local invisible registers in stage 5 to
argue that neither invlpg nor flusht is executed by processor q in step m.
s(m).(t,u) = (core,q)∧m ∈ [NS(t) : n−1] → /(invlpgσ ∨ flushtσ )q,ic(q,m) (83)
Similarly we argue that execution mode on processor q does not decrease in cycle t
mode(q, t)> mode(q, t+1)→ jisr(q, t) (lemma 78.1)
→ /mmuq,tY .treq (definition)
→ /tstepY (q, t) (definition)
and thus consider the following three cases that remain possible.
i) Remaining at the level of guest.
guest(q, t)∧guest(q, t+1)
As above we argue that jisr or eret are not executed by processor q in step m.
s(m).(t,u) = (core,q)∧m ∈ [NS(t) : n−1] → /( jisrσ ∨ eretσ )q,ic(q,m) (84)
From the induction hypothesis for visible register in the memory stage we derive
guest(mcNS(t).p(q))
which together with the result above (equation 84) and the fact that any “moves” to the
mode registers are illegal at the level of guest (see Sect. 3.3.6) gives
guest(mcn.p(q)).
From interconnect and construction of the nested MMU we derive:
guest(q, t)→ mmuq,tY .upa ∉ AU (interconnect)
→ /treqU (mmuq,tY ) (definition)
→ /tstepY U (q, t) (definition).
Thus, in case mmuY performs initialization of a guest walk (winitG(mmuq,tY )), for the stepping
inputs by definition we have
s(n).o.(t, l) = (winit,guest).
From the arguments above we easily conclude
TG(mcn.p(q),s(n).o).
In case mmuY performs extension of a guest walk (wextG(mmuq,tY )), for the stepping inputs
by definition we have
s(n).o = (wext,mmuq,tY .wG,⊥).
From the first part of lemma 123 we have
valid(mcNS(t).p(q).tlb,mmuq,tY .wG)
which together with the result above (equation 83) gives
valid(mcn.p(q).tlb,mmuq,tY .wG).
From the arguments above we again conclude
TG(mcn.p(q),s(n).o).
ii) Remaining at the level of user.
user(q, t)∧user(q, t+1)
Again, we argue that jisr or eret are not executed by processor q in step m.
s(m).(t,u) = (core,q)∧m ∈ [NS(t) : n−1] → /( jisrσ ∨ eretσ )q,ic(q,m) (85)
From the induction hypothesis for visible register in the memory stage we derive
user(mcNS(t).p(q))
which together with the result above (equation 85) and the fact that any “moves” to the
mode registers are illegal at the level of user (see Sect. 3.3.6) gives
user(mcn.p(q)).
In case mmuY performs initialization of a guest walk (winitG(mmuq,tY )) or extension of
a guest walk (wextG(mmuq,tY )) the arguments are identical to those presented in case i).
Further, in case mmuY performs initialization of a user walk (winitU (mmuq,tY )), for the
stepping inputs by definition we have
s(n).o.(t, l) = (winit,user).
From the arguments above we easily conclude
TU (mcn.p(q),s(n).o).
Finally, in case mmuY performs extension of a user walk (wextU (mmuq,tY )), for the stepping
inputs by definition we have
s(n).o = (wext,mmuq,tY .wU ,mmuq,tY .wG).
From the second part of lemma 123 we have
valid(mcNS(t).p(q).tlb,mmuq,tY .wU ,mmuq,tY .wG)
which together with the result above (equation 83) gives
valid(mcn.p(q).tlb,mmuq,tY .wU ,mmuq,tY .wG).
From the arguments above we again conclude
TU (mcn.p(q),s(n).o).
iii) Execution of eret at the level of guest.
guest(q, t)∧user(q, t+1)
Finally, analogous to above we argue that processor q executes eret in step m.
s(m).(t,u) = (core,q)∧m ∈ [NS(t) : n−1] → eretq,ic(q,m)σ (86)
From the induction hypothesis for visible register in the memory stage we derive
guest(mcNS(t).p(q))
which together with the result above (equation 86) gives
guest(mcn.p(q))∨user(mcn.p(q))
depending on whether a core step or a TLB step of processor q is performed first in cycle
t. The remaining proof lines literally repeat the arguments presented in case i).
9.4.2 Processor Core Steps
Assume that step n is performed by processor q.
s(n).(t,u) = (core,q)
Using the induction hypothesis, we can rewrite the stepping given above in Sect. 9.2.5 in terms
of the registers in stage 5 of the ghost walk pipeline as follows.
Lemma 124. Assume that processor q performs a core step in cycle t, i.e.,
s(n).(t,u) = (core,q)∧n ∈CS(t).
Then the following holds:
used(wI ,n) → s(n).o.wI = wI .5q,t (1)
used(wE ,n) → s(n).o.wE = wE .5q,t (2)
Proof of lemma 124.1. For the first walk passed as the machine’s input we derive:
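\[
\begin{aligned}
s(n).o.w_I &= w_{I\sigma}^{q,\,ic(q,n)} && \text{(lemma 110)} \\
&= w_{I\sigma}^{q,\,I(q,6,t)} && \text{(lemma 118)} \\
&= w_{I\sigma}^{q,\,I(q,5,t)-1} && \text{(lemma 79)} \\
&= w_I.5^{q,t} && \text{(IH)} \qquad \Box
\end{aligned}
\]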
That the walks coming from the MMUs — from the ghost pipeline — satisfy the guard condi-
tions of the specification was reflected in the induction hypothesis above. In case the first walk
passed to the specification machine is used (for translation of the instruction address)
used(wI ,n)
it is contained in the corresponding TLB (invariant 10)
\[
w_I.5^{q,t} \in
\begin{cases}
mc^{NS(t)}.p(q).tlb^{\circ} & user(q,t) \\
mc^{NS(t)}.p(q).tlb & guest(q,t)
\end{cases}
\]
and matches the corresponding translation request (invariant 11).
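\[
\begin{aligned}
match(trq_{I\sigma}^{q,\,I(q,5,t)-1},\, w_I.5^{q,t}) &\leftrightarrow match(trq_{I\sigma}^{q,\,I(q,6,t)},\, w_I.5^{q,t}) && \text{(lemma 79)} \\
&\leftrightarrow match(trq_{I\sigma}^{q,\,ic(q,n)},\, w_I.5^{q,t}) && \text{(lemma 118)}
\end{aligned}
\]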
Since there are no other core steps (except for step n) of processor q in cycle t
m ∈ [NS(t) : n−1] → s(m).(t,u) ≠ (core,q)
the TLB on processor q cannot decrease in size (before step n).
\[
w_I.5^{q,t} \in
\begin{cases}
mc^{n}.p(q).tlb^{\circ} & user(mc^{n}.p(q)) \\
mc^{n}.p(q).tlb & guest(mc^{n}.p(q))
\end{cases}
\]
Applying lemma 124 (first part) we conclude
ΦI(mcn.p(q),s(n).o).
Combining the induction hypotheses with the result above, we derive that the ISA computation
is guarded until step n and after the fetch in step n.
Γ NS(t)∧∀m ∈ [NS(t) : n−1] : Γ (mcm,s(m))∧ΦI(mcn.p(q),s(n).o) (87)
The latter allows us to apply the software condition, and proceed to the proof of the second
part of lemma 124.
Proof of lemma 124.2. For the second walk passed to the specification machine we argue by
repeating the proof of part (1) of lemma 124 as follows:
\[
\begin{aligned}
s(n).o.w_E &= w_{E\sigma}^{q,\,ic(q,n)} && \text{(lemma 110)} \\
&= w_{E\sigma}^{q,\,I(q,6,t)} && \text{(lemma 118)} \\
&= w_{E\sigma}^{q,\,I(q,5,t)-1} && \text{(lemma 79)} \\
&= w_E.5^{q,t} && \text{(equation 87; lemma 120; IH)} \qquad \Box
\end{aligned}
\]
In case the second walk passed to the specification machine is used (for translation of the
effective address)
used(wE ,n)
using lemma 120 we argue that the latter walk is contained in the corresponding TLB (invari-
ant 10)
\[
w_E.5^{q,t} \in
\begin{cases}
mc^{NS(t)}.p(q).tlb^{\circ} & user(q,t) \\
mc^{NS(t)}.p(q).tlb & guest(q,t)
\end{cases}
\]
and matches the corresponding translation request (invariant 11).
\[
match(trq_{E\sigma}^{q,\,I(q,5,t)-1},\, w_E.5^{q,t})
\]
Repeating the arguments above, we apply lemma 124 (second part) and conclude
ΦE(mcn.p(q),s(n).o).
Combining equation 87 with the result above, we derive that the ISA computation is guarded
until step n+1
Γ NS(t)∧∀m ∈ [NS(t) : n] : Γ (mcm,s(m))
which completes the induction step.
9.4.3 Processor Control Signals
In Sect. 9.4 we showed that the ISA computation is guarded until step NS(t + 1). The latter
allows us to rewrite lemma 120 simply as follows.
uet6 → Σsoft(t) < 5 (88)
Lemma 125. Assume processor q performs a core step in cycle t (cstep(q, t)).
1. mca(6)q,t = mca(6)q,I(q,6,t)σ
2. jisr(q, t) = jisrq,I(q,6,t)σ
Proof of lemma 125.1. First we show correctness for the event signals collected in the memory
stage.
ev(6)q,t = ev(6)q,I(q,6,t)σ (89)
Proof of equation 89.
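\[
\begin{aligned}
ev(6)^{q,t} &= 0^9 \circ eev^{q,t} && \text{(definition)} \\
&= 0^9 \circ s(n).o.eev && \text{(stepping)} \\
&= 0^9 \circ eev_{\sigma}^{q,\,ic(q,n)} && \text{(lemma 110)} \\
&= 0^9 \circ eev_{\sigma}^{q,\,I(q,6,t)} && \text{(lemma 118)} \\
&= ev(6)_{\sigma}^{q,\,I(q,6,t)} && \text{(definition)} \qquad \Box
\end{aligned}
\]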
For the masked cause signal of the memory stage we argue as follows.
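\[
\begin{aligned}
mca(6)^t &= (ev(6)^t \vee ca.5^t_{\pi}) \wedge imask^t && \text{(definition)} \\
&= (ev(6)^t \vee ca.5^t_{\pi}) \wedge (1^9 \circ sr^t_{\pi}[1] \circ 1) && \text{(definition)} \\
&= (ev(6)^{I(6,t)}_{\sigma} \vee ca.5^{I(6,t)}_{\sigma}) \wedge (1^9 \circ sr^{I(6,t)}_{\sigma}[1] \circ 1) && \text{(equations 88–89; IH)} \\
&= (ev(6)^{I(6,t)}_{\sigma} \vee ca.5^{I(6,t)}_{\sigma}) \wedge imask^{I(6,t)}_{\sigma} && \text{(definition)} \\
&= mca(6)^{I(6,t)}_{\sigma} && \text{(definition)} \qquad \Box
\end{aligned}
\]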
Proof of lemma 125.2. For the jisr signal we argue using the result above.
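\[
\begin{aligned}
jisr(t) &\leftrightarrow ue^t_6 \wedge jisr.5^t_{\pi} && \text{(definition)} \\
&\leftrightarrow ue^t_6 \wedge (mca(6)^t \neq 0^{11}) && \text{(invariant 8)} \\
&\leftrightarrow mca(6)^{I(6,t)}_{\sigma} \neq 0^{11} && \text{(lemma 125.1)} \\
&\leftrightarrow jisr^{I(6,t)}_{\sigma} && \text{(definition)} \qquad \Box
\end{aligned}
\]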
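The masking step used in the proof of lemma 125 can be made concrete with a small bit-vector sketch. It rests on assumptions not fixed by the text above: bit strings are modelled as Python integers, the leftmost bit of a concatenation is taken to be the most significant one (bit 10 of an 11-bit vector), and all function names are hypothetical.

```python
WIDTH = 11  # mca(6) is compared against 0^11 in the proof above


def imask(sr: int) -> int:
    """Interrupt mask 1^9 ∘ sr[1] ∘ 1 (assumption: leftmost bit is bit 10)."""
    sr1 = (sr >> 1) & 1
    return (0b111111111 << 2) | (sr1 << 1) | 1


def mca6(eev: int, ca5: int, sr: int) -> int:
    """Masked cause of the memory stage: (ev(6) ∨ ca.5) ∧ imask."""
    ev6 = eev & 0b11                      # ev(6) = 0^9 ∘ eev, eev has two bits
    return (ev6 | ca5) & imask(sr) & ((1 << WIDTH) - 1)


def jisr(ue6: bool, eev: int, ca5: int, sr: int) -> bool:
    """jisr is active iff stage 6 is updated and the masked cause is non-zero."""
    return ue6 and mca6(eev, ca5, sr) != 0
```

In this model an event on bit 1 is visible only if sr[1] is set, while events on all other bits always pass the mask.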
Later on, when we consider interrupts, we require the latter results to argue about correctness
for the special purpose registers.
9.4.4 Properties of Σ Revisited
We continue developing the arguments presented in Sect. 9.3.3.
Properties of Σ
For speculation stage Σ we show results analogous to the corresponding results about the
software speculation stage (Σsoft). First we argue that speculation stage Σ does not “overflow”,
i.e., it takes values only in the range [0 : 5].
Lemma 126. Assume Σ(t) = 5.
uet6 → rbrt5
Proof of lemma 126. By contradiction. From the definition of signal misspec5 we have
jisr(t) = 0.
The contradiction follows from part (2) of lemma 125.
Σ(t) < 5 □
The value of speculation stage Σ in the next cycle is given in the following lemma.
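Lemma 127. Assume Σ(t) > 0 and Σ(t) ≥ R(t).
\[
/ue^t_{\Sigma(t)+1} \;\rightarrow\; \Sigma(t+1) = \Sigma(t)
\tag{1}
\]
\[
ue^t_{\Sigma(t)+1} \;\rightarrow\; \Sigma(t+1) =
\begin{cases}
0 & rbr^t_{\Sigma(t)} \\
\Sigma(t)+1 & \text{otherwise}
\end{cases}
\tag{2}
\]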
Proof of lemma 127.1. By contradiction. Assume
Σ(t) = σ0 > 0
Σ(t+1) = σ1 ≠ σ0.
By definition of Σ we have the following.
rfulltσ0 ∧ ( jisrσ ∨ eretσ )I(σ0,t)−1 (90)
rfullt+1σ1 ∧ ( jisrσ ∨ eretσ )I(σ1,t+1)−1 (91)
In case
σ0 < σ1
we split cases on uetσ1 . In case stage σ1 is updated in cycle t, from the definitions of the update
enable signal (equation 35) and the scheduling functions we resp. have
uetσ1 → rfulltσ1−1
and (since σ1 > 1)
uetσ1 → I(σ1, t+1) = I(σ1−1, t).
From equation 91 and the arguments above we conclude
rfulltσ1−1∧ ( jisrσ ∨ eretσ )I(σ1−1,t)−1
which by definition implies
Σ(t)≥ σ1−1.
From the assumption (σ0 < σ1) we also derive
Σ(t)≤ σ1−1.
The contradiction follows, since by assumption (uetσ1 ) we obtain
uetΣ(t)+1.
Otherwise, if stage σ1 is not updated in cycle t, from the definitions of the full and the rollback-
pending bits we resp. have
fullt+1σ1 ∧/uetσ1 → stalltσ1+1∧/rollbacktσ1
and (using equation 36)
/rbpt+1σ1 ∧/rollbacktσ1 → /rbptσ1 .
From the definitions of the stall signal and the scheduling functions we resp. have
stalltσ1+1 → fulltσ1
and (since σ1 > R(t))
/uetσ1 → I(σ1, t+1) = I(σ1, t).
Thus, from equation 91 and the arguments above we conclude
rfulltσ1 ∧ ( jisrσ ∨ eretσ )I(σ1,t)−1
which by definition implies
Σ(t)≥ σ1.
The latter is a contradiction, since as an assumption (σ0 < σ1) we have
Σ(t)< σ1.
In case
σ1 < σ0
using part (5) of lemma 75 we derive
rfulltσ0 ∧/uetσ0+1∧/rbrtσ0+1 → rfullt+1σ0 .
Since σ0 ≥ R(t), from equation 90 and the arguments above we conclude
rfullt+1σ0 ∧ ( jisrσ ∨ eretσ )I(σ0,t)−1
where for the instruction index we derive:
I(σ0, t)−1 = I(σ0+1, t) (lemma 79)
= I(σ0+1, t+1) (definition)
= I(σ0, t+1)−1 (lemma 79).
The latter implies
Σ(t+1)≥ σ0
which is a contradiction, since as an assumption (σ1 < σ0) we have
Σ(t+1) < σ0. □
Proof of lemma 127.2. Assume
Σ(t) = σ0 > 0
Σ(t+1) = σ1.
By definition of Σ we have the following.
rfulltσ0 ∧ ( jisrσ ∨ eretσ )I(σ0,t)−1 (92)
First we show that the following holds.
∄k > σ0+1 : rfullt+1k ∧ ( jisrσ ∨ eretσ )I(k,t+1)−1 (93)
Proof of equation 93. By contradiction. Assume there is such stage
k > σ0+1
and it was updated in cycle t. From the definitions of the update enable signal (equation 35)
and the scheduling functions we resp. have
uetk → rfulltk−1
and (since k > 1)
uetk → I(k, t+1) = I(k−1, t).
From the assumption and arguments above we derive
rfulltk−1∧ ( jisrσ ∨ eretσ )I(k−1,t)−1
which leads to a contradiction:
Σ(t)≥ k−1 > σ0.
Otherwise, if stage k was not updated in cycle t, from the definitions of the full and the rollback-
pending bits we resp. have
fullt+1k ∧/uetk → stalltk+1∧/rollbacktk
and (using equation 36)
/rbpt+1k ∧/rollbacktk → /rbptk.
From the definitions of the stall signal and the scheduling functions we resp. have
stalltk+1 → fulltk
and (since k > R(t))
/uetk → I(k, t+1) = I(k, t).
Thus, from the assumption and arguments above we conclude
rfulltk ∧ ( jisrσ ∨ eretσ )I(k,t)−1
which gives a contradiction:
Σ(t) ≥ k > σ0+1. □
Further we split cases on the value of σ0:
• σ0 ∈ {1,3,4}. Since σ0 ≥ R(t), the rollback request signal in stage σ0 is inactive in cycle
t.
rbrtσ0 = 0
Therefore, we are to show
σ1 = σ0+1.
From the first part of lemma 75 and definition of the scheduling functions we resp. have
uetσ0+1 → rfullt+1σ0+1
and (since σ0 > 0)
uetσ0+1 → I(σ0+1, t+1) = I(σ0, t).
Thus, from equation 92 and the arguments above we conclude
rfullt+1σ0+1∧ ( jisrσ ∨ eretσ )I(σ0+1,t+1)−1
which by definition implies
Σ(t+1)≥ σ0+1.
Equation 93 gives
Σ(t+1)≤ σ0+1
and the claim follows.
• σ0 = 2. By definition (of Σ ) we have
rfullt2∧ ( jisrσ ∨ eretσ )I(2,t)−1.
We split cases on the value of the rollback request signal of stage 2 in cycle t. (Note, from
the assumptions we know σ0 ≥ R(t).)
rbrt2 ↔ rfullt2∧ irett2 (interconnect)
↔ eret(3)t ∧mca(3)t [>il] (definition)
↔ eretI(2,t)−1σ ∧mcaI(2,t)−1σ [>il] (IH)
↔ eretI(2,t)−1σ ∧/ jisrI(2,t)−1σ (definition)
Note, in the lines above, for any eret instruction (i) we clearly have
eret iσ → mcaiσ [10 : 6] = 05.
Thus, if the rollback request signal is active
rbrt2 = 1
we proceed to show
σ1 = 0.
From the first part of lemma 75 and definition of the scheduling functions we resp. have
uet3 → rfullt+13
and
uet3 → I(3, t+1) = I(2, t).
From the arguments above we conclude
rfullt+13 ∧ (eretσ ∧/ jisrσ )I(3,t+1)−1
which by definition implies
Σ(t+1) ≠ 3.
Using lemma 76 and equation 93 we resp. further conclude
∄k < 3 : rfullt+1k
and
∄k > 3 : rfullt+1k ∧ ( jisrσ ∨ eretσ )I(k,t+1)−1
which gives the claim.
Σ(t+1) = 0
Otherwise, in case the rollback request signal is inactive
rbrt2 = 0
we are to show
σ1 = σ0+1.
Analogous to the previous case we derive
rfullt+13 ∧ jisrI(3,t+1)−1σ
which by definition implies
Σ(t+1)≥ 3.
Equation 93 gives
Σ(t+1)≤ 3
and the claim follows.
• σ0 = 5. Finally, in this case from lemma 126 we have
rbrt5 = 1
and we proceed to show
σ1 = 0.
Using lemma 76 we conclude
∄k < 6 : rfullt+1k
which gives the claim.
Σ(t+1) = 0 □
The property of speculation stage Σ below can be easily established following analogous argu-
ments to those in the proof above.
Σ(t)< R(t) → Σ(t+1) = 0 (94)
For convenience, we generalize the lemma above as follows.
Lemma 128.
Σ(t+1)≤ Σ(t)+uetΣ(t)+1
Proof of lemma 128. In case
Σ(t)< R(t)
the claim holds trivially by equation 94. Otherwise, for Σ(t) > 0 the claim follows directly
from lemma 127. For Σ(t) = 0 we proceed to show
Σ(t+1)≤ uet1.
Repeating the arguments presented in the proof of equation 93 we derive
∄k > 1 : rfullt+1k ∧ ( jisrσ ∨ eretσ )I(k,t+1)−1
which by definition implies
Σ(t+1)≤ 1.
The remaining arguments are analogous to those presented in the proof of lemma 122. □
Properties of Σhard
Below we show that most of the properties of Σ also hold for hardware speculation stage
Σhard(t) = max{Σ(t),µ(t)}.
First, using lemma 101 we argue that the rollback request signals are not generated below
Σhard(t).
Lemma 129.
R(t)≤ Σhard(t)
Moreover, from lemmas 102 and 126 we conclude that speculation stage Σhard stays within
interval [0 : 5].
Lemma 130. Assume Σhard(t) = 5.
uet6 → rbrt5
In the following lemma we derive the value of Σhard in the next cycle, based on its value and
the control signals active in the current cycle.
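Lemma 131. Assume Σhard(t) = σ such that σ > 0.
\[
/ue^t_{\sigma+1} \;\rightarrow\; \Sigma_{hard}(t+1) = \sigma
\tag{1}
\]
\[
ue^t_{\sigma+1} \;\rightarrow\; \Sigma_{hard}(t+1) =
\begin{cases}
0 & rbr^t_{\sigma} \\
\sigma+1 & \text{otherwise}
\end{cases}
\tag{2}
\]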
Proof of lemma 131.1. Given that the stage below σ is not updated in cycle t (/uetσ+1), we
split cases on whether µ(t) is greater, smaller, or equal to Σ(t).
• Σ(t)< µ(t). In this case by definition we have
Σhard(t) = µ(t)
and therefore
Σ(t+1) ≤ Σ(t)+1 (lemma 128)
≤ µ(t). (95)
Using the result above we argue as follows.
Σhard(t+1) = max{Σ(t+1),µ(t+1)} (definition)
= max{Σ(t+1),µ(t)} (lemma 103.1)
= µ(t) (equation 95)
• Σ(t)> µ(t). In this case by definition we have
Σhard(t) = Σ(t)
and therefore
µ(t+1) ≤ µ(t)+1 (lemma 104)
≤ Σ(t). (96)
The following arguments are analogous to those presented in the case above.
Σhard(t+1) = max{Σ(t+1),µ(t+1)} (definition)
= max{Σ(t),µ(t+1)} (lemma 127.1)
= Σ(t) (equation 96)
• Σ(t) = µ(t). In this case we have
Σ(t+1) = Σ(t) (lemmas 127.1, 128)
= µ(t)
= µ(t+1) (lemmas 103.1, 104)
which clearly yields
Σhard(t+1) = Σhard(t). □
Proof of lemma 131.2. Given that the stage below σ is updated in cycle t (uetσ+1), we split
cases on whether µ(t) is greater than Σ(t) or not.
• Σ(t)< µ(t). In this case by definition we have
Σhard(t) = µ(t)
and therefore
Σ(t+1) ≤ Σ(t)+1 (lemma 128)
≤ µ(t). (97)
In case stage σ is rolled-back in cycle t (rbrtσ ), we clearly have
Σ(t+1) = 0 (equation 94)
= µ(t+1) (lemma 103.2)
which clearly gives
Σhard(t+1) = 0.
Otherwise, from lemma 102 we conclude
µ(t)< 5
and argue as follows.
Σhard(t+1) = max{Σ(t+1),µ(t+1)} (definition)
= max{Σ(t+1),µ(t)+1} (lemma 103.2)
= µ(t)+1 (equation 97)
• Σ(t)≥ µ(t). In this case by definition we have
Σhard(t) = Σ(t)
and therefore
µ(t+1) ≤ µ(t)+1 (lemma 104)
≤ Σ(t)+1. (98)
From lemma 126 we have
Σ(t)< 5
and the claim follows from the arguments below.
Σhard(t+1) = max{Σ(t+1),µ(t+1)} (definition)
= max{Σ(t)+1,µ(t+1)} (lemma 127.2)
= Σ(t)+1 (equation 98) □
Combining lemmas 104 and 128 we easily conclude the following generalization of the result
above.
Lemma 132. Assume Σhard(t) = σ .
Σhard(t+1)≤ σ +uetσ+1
Properties of Σˆ
In the remainder of this section we show that the properties corresponding to those of Σ and
Σhard also hold for
Σˆ(t) = max{Σsoft(t),Σhard(t)}.
Using lemma 129 we argue that the rollback request signals are not generated in stages
below Σˆ(t).
Lemma 133.
R(t)≤ Σˆ(t)
The value of Σˆ in the next cycle is derived in the following lemma.
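Lemma 134. Assume Σˆ(t) > 0.
\[
/ue^t_{\hat{\Sigma}(t)+1} \;\rightarrow\; \hat{\Sigma}(t+1) = \hat{\Sigma}(t)
\tag{1}
\]
\[
ue^t_{\hat{\Sigma}(t)+1} \;\rightarrow\; \hat{\Sigma}(t+1) =
\begin{cases}
0 & rbr^t_{\hat{\Sigma}(t)} \wedge \Sigma_{soft}(t) < \Sigma_{hard}(t) \\
\hat{\Sigma}(t)+1 & \text{otherwise}
\end{cases}
\tag{2}
\]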
Proof of lemma 134.1. Given that the stage below Σˆ(t) is not updated in cycle t (/uetΣˆ(t)+1),
we split cases on whether Σhard(t) is greater than, smaller than, or equal to Σsoft(t).
• Σsoft(t) < Σhard(t). In this case by definition we have
Σˆ(t) = Σhard(t)
and therefore
Σsoft(t+1) ≤ Σsoft(t)+1 (lemma 122)
≤ Σhard(t). (99)
Using the result above we proceed to derive the claim as follows.
Σˆ(t+1) = max{Σsoft(t+1),Σhard(t+1)} (definition)
= max{Σsoft(t+1),Σhard(t)} (lemma 131.1)
= Σhard(t) (equation 99)
• Σsoft(t) > Σhard(t). In this case by definition we have
Σˆ(t) = Σsoft(t)
and therefore
Σhard(t+1) ≤ Σhard(t)+1 (lemma 132)
≤ Σsoft(t). (100)
The remaining arguments are analogous to those presented in the case above.
Σˆ(t+1) = max{Σsoft(t+1),Σhard(t+1)} (definition)
= max{Σsoft(t),Σhard(t+1)} (lemma 121.1)
= Σsoft(t) (equation 100)
• Σsoft(t) = Σhard(t). In this case we have
Σsoft(t+1) = Σsoft(t) (lemmas 121.1, 122)
= Σhard(t)
= Σhard(t+1) (lemmas 131.1, 132)
which clearly gives
Σˆ(t+1) = Σˆ(t). □
Proof of lemma 134.2. Given that the stage below Σˆ(t) is updated in cycle t (uetΣˆ(t)+1), we split
cases on whether Σhard(t) is greater than Σsoft(t) or not.
• Σsoft(t) < Σhard(t). In this case by definition we have
Σˆ(t) = Σhard(t)
and therefore
Σsoft(t+1) ≤ Σsoft(t)+1 (lemma 122)
≤ Σhard(t). (101)
In case stage Σˆ(t) is rolled back in cycle t (rbrtΣˆ(t)), for stage σ = Σsoft(t) from the
construction of the stall engine we clearly have
rbrtΣˆ(t) → rbrtσ+1
and therefore
Σsoft(t+1) = 0 (lemma 121.1)
= Σhard(t+1) (lemma 131.2)
which clearly yields
Σˆ(t+1) = 0.
Otherwise (/rbrtΣˆ(t)), from lemma 130 we have
Σhard(t) < 5
and the claim follows from the arguments below.
Σˆ(t+1) = max{Σsoft(t+1),Σhard(t+1)} (definition)
= max{Σsoft(t+1),Σhard(t)+1} (lemma 131.2)
= Σhard(t)+1 (equation 101)
• Σsoft(t) ≥ Σhard(t). In this case by definition we have
Σˆ(t) = Σsoft(t)
and therefore
Σhard(t+1) ≤ Σhard(t)+1 (lemma 132)
≤ Σsoft(t)+1. (102)
From equation 88 we know
Σsoft(t) < 5
and the claim follows from the arguments below.
Σˆ(t+1) = max{Σsoft(t+1),Σhard(t+1)} (definition)
= max{Σsoft(t)+1,Σhard(t+1)} (lemma 121.2)
= Σsoft(t)+1 (equation 102) □
Finally, combining lemmas 122 and 132 we conclude the following generalization of the result
above.
Lemma 135.
Σˆ(t+1)≤ Σˆ(t)+uetΣˆ(t)+1
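The purely arithmetic part of combining lemmas 122 and 132 can be checked exhaustively. The sketch below is an illustration only: it abstracts the update enable signals of stages 1 to 6 as an arbitrary bit vector, takes the two per-stage bounds as the only hypotheses, and confirms that the same bound then holds for the maximum. The function name and the encoding are hypothetical.

```python
from itertools import product

STAGES = 6  # update enable signals ue_1 .. ue_6


def combination_holds() -> bool:
    """Check: a1 <= a + ue[a+1] and b1 <= b + ue[b+1]
    imply max(a1, b1) <= max(a, b) + ue[max(a, b) + 1]."""
    for ue_bits in product((0, 1), repeat=STAGES):
        ue = (None,) + ue_bits                      # ue[k] for k in 1..6
        # a models Sigma_soft(t), b models Sigma_hard(t), both in [0 : 5]
        for a, b in product(range(6), repeat=2):
            for a1 in range(a + ue[a + 1] + 1):     # every a1 allowed by lemma 122
                for b1 in range(b + ue[b + 1] + 1): # every b1 allowed by lemma 132
                    top = max(a, b)
                    if max(a1, b1) > top + ue[top + 1]:
                        return False
    return True
```

The check succeeds because the smaller of the two stages can grow by at most one and therefore never overtakes the bound of the larger one.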
9.5 Correctness for Memory
All processors that execute memory operations in cycle t we collect into the set
PA(t) = {q ∈ PS(t) | exec(q, t)∧mop.5q,tpi }.
Accordingly, we collect TLBs performing walk extensions in cycle t into the sets
TAY (t) = {q ∈ TSY (t) | wext(mmuq,tY )}
depending on the MMU which is performing the operation. Using the new notation we char-
acterize the stepping numbers y of components q stepped in cycle t.
Lemma 136. Assume that component q performs a step in cycle t.
s(NS(t)+ y).(g,u) = (proc,q)→ ( y < sa(t)↔ q ∈ PA(t) ) (1)
s(NS(t)+ y).(g,u) = (mmuY ,q)→ ( y < sa(t)↔ q ∈ TAY (t) ) (2)
Proof of lemma 136.1. For processor q accessing memory in cycle t there is an access to the
data cache ending in cycle t (lemma 88) and vice versa (lemma 89).
q ∈ PA(t) ↔ ∃k : (4q+3,k) ∈ SD(t)
Since at most one data access ends in the data cache of processor q in cycle t, the claim follows
from the definition of the stepping function (p. 229, equation 67). □
Proof of lemma 136.2. For mmuY on processor q accessing memory in cycle t there is an access
to the corresponding translation cache ending in cycle t (lemma 53). The opposite is true only
in case mmuY on processor q generates a step in cycle t (lemma 54). As a result, we conclude
the following.
q ∈ TAY (t) ↔ ∃k : (4q+2·1{Y=E},k) ∈ SY (t)
Analogous to above, since at most one translation access ends in the corresponding translation
cache of processor q in cycle t, the claim follows from the definition of the stepping function
(p. 229, equation 68). □
Moreover, we argue that the total number of components which are stepped in cycle t while
accessing memory equals the total number of stepping accesses.
#PA(t)+#TAI(t)+#TAE(t) = sa(t).
In order to map between the step numbers and numbers of the processor accesses, we introduce
the following helper function. For y < sa(t) we define:
γ(0, t) = min{ỹ | NE(t)+ x(ỹ, t) ∈ seq(S(t))}
γ(y, t) = min{ỹ | NE(t)+ x(ỹ, t) ∈ seq(S(t))∧ ỹ > γ(y−1, t)}.
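Read operationally, γ(y, t) is the position of the (y+1)-th stepping access among the accesses of cycle t. A minimal sketch, assuming the membership test NE(t)+x(ỹ, t) ∈ seq(S(t)) has already been evaluated into a list of booleans (the name `is_stepping` and this encoding are hypothetical):

```python
from typing import Sequence


def gamma(is_stepping: Sequence[bool], y: int) -> int:
    """Index of the (y+1)-th stepping access of the cycle.

    is_stepping[i] models the condition NE(t) + x(i, t) ∈ seq(S(t)).
    """
    found = -1
    for idx, flag in enumerate(is_stepping):
        if flag:
            found += 1
            if found == y:
                return idx
    raise ValueError("y must be smaller than the number of stepping accesses")
```

For the flags `[False, True, False, True]` the sketch yields `gamma(flags, 0) == 1` and `gamma(flags, 1) == 3`, so the values are strictly increasing in y, as required by the recursive definition.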
A simple lemma follows from the monotonicity of function x (see p. 198).
Lemma 137. For y < sa(t) the following holds.
z(y, t) = x(γ(y, t), t)
In the induction step below (Sect. 9.5.2) we require the following technical lemma.
Lemma 138. For k < na(t) the following holds.
∀y < sa(t) : k ≠ γ(y, t) → xacc′t [k].(w,cas) = 0^2
Proof of lemma 138. First we show the following auxiliary result.
n ∈ seq(E(t)\SD(t)) → acc′[n].(w,cas) = 0^2 (103)
Proof of equation 103. From the assumptions we derive
∃a ∈ E(t)\SD(t)∧ seq(a) = n
which allows us to conclude:
a ∈ E(t)\SD(t)→ acc(a).(w,cas) = 0^2
→ acc′[seq(a)].(w,cas) = 0^2
→ acc′[n].(w,cas) = 0^2. □
Using the result above and monotonicity of function z, we proceed as follows.
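\[
\begin{aligned}
\forall y < sa(t) : k \neq \gamma(y,t) &\rightarrow \forall y < sa(t) : NE(t)+x(k,t) \neq NE(t)+z(y,t) && \text{(lemma 137)} \\
&\rightarrow NE(t)+x(k,t) \in seq(A(t) \setminus S(t)) && \text{(definition)} \\
&\rightarrow acc'[NE(t)+x(k,t)].(w,cas) = 0^2 && \text{(equation 103)} \\
&\rightarrow xacc'_t[k].(w,cas) = 0^2 && \text{(definition)} \qquad \Box
\end{aligned}
\]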
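Lemma 138 matters because an access that is neither a write nor a compare-and-swap cannot change the memory. The following toy model illustrates this under loud assumptions: memory is a dictionary from addresses to values, and `Access`, `delta_m` and `run` are hypothetical stand-ins, not the memory semantics ∆M of the thesis.

```python
from typing import Dict, Iterable, NamedTuple


class Access(NamedTuple):
    addr: int
    data: int = 0
    w: bool = False      # write flag
    cas: bool = False    # compare-and-swap flag
    cdata: int = 0       # compare data for CAS


def delta_m(mem: Dict[int, int], acc: Access) -> Dict[int, int]:
    """One toy memory step: only writes and successful CAS modify memory."""
    new = dict(mem)
    if acc.w:
        new[acc.addr] = acc.data
    elif acc.cas and mem.get(acc.addr, 0) == acc.cdata:
        new[acc.addr] = acc.data
    return new


def run(mem: Dict[int, int], accs: Iterable[Access]) -> Dict[int, int]:
    """Apply a sequence of accesses from left to right."""
    for acc in accs:
        mem = delta_m(mem, acc)
    return mem
```

In this model, dropping every access with (w, cas) = (0, 0) from a sequence leaves the resulting memory unchanged, which is the property exploited when skipping the non-stepping accesses between γ(y, t) and γ(y+1, t).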
9.5.1 Matching Processor Accesses with Non-Void Accesses
In this section we identify the accesses registered at various ports (caches) of the memory
system in cycle t with the accesses performed by the ISA computation within interval CS(t).
Below we begin with data accesses, resp. accesses registered at the data caches.
Data Accesses
First, we argue about the accesses performed by processors on memory operations.
Lemma 139. Assume that processor q performs a memory access in cycle t.
q ∈ PA(t)∧ s(NS(t)+ y).(t,u) = (core,q)
For access components Z ∈ {a,data,cdata,bw, type} the following holds.
used(dacc.Z,NS(t)+ y) → dacc(q, ic(q,NS(t)+ y)).Z = zacc′t [y].Z
Proof of lemma 139. Using definition of the stepping function, for some access number k we
argue:
\[
\begin{aligned}
zacc'_t[y].Z &= acc'[NE(t)+z(y,t)].Z && \text{(definition)} \\
&= acc'[seq(4q+3,k)].Z && \text{(lemma 136.1)} \\
&= acc(4q+3,k).Z && \text{(definition)} \\
&= dacc_{\sigma}^{q,i}.Z && \text{(equation 88; IH)} \\
&= dacc(q, ic(q,NS(t)+y)).Z && \text{(lemma 118)}
\end{aligned}
\]
where i denotes the index of the instruction executed on processor q in cycle t.
\[
i = I(q,6,t) = I(q,5,t)-1 \quad \text{(lemma 79)} \qquad \Box
\]
Moreover, using the software conditions from Sect. 9.2.2 (equation 70) we conclude that the
read-only portion of the memory system (ROM) is never written.
Lemma 140. Assume that processor q performs a memory access in cycle t.
q ∈ PA(t)∧ s(NS(t)+ y).(t,u) = (core,q)
Then the following holds.
\[
zacc'_t[y].a < 2^r \;\rightarrow\; zacc'_t[y].(w,cas) = 0^2
\]
Next, we argue that the processors which do not execute memory operations necessarily per-
form void accesses.
Lemma 141. Assume that processor q stepped in cycle t does not access the memory.
q ∈ PS(t)\PA(t)∧ s(NS(t)+ y).(t,u) = (core,q)
Then the corresponding data access is void.
void(dacc(q, ic(q,NS(t)+ y)))
Proof of lemma 141. From the induction hypothesis we derive:
/mop.5q,tpi → /mopq,I(q,5,t)−1σ (equation 88; IH)
→ /mopq,I(q,6,t)σ (lemma 79)
→ void(daccq,I(q,6,t)σ ) (definition)
→ void(dacc(q, ic(q,NS(t)+ y))) (lemma 118). □
Translation Accesses
First, for the address of the page table entry read by mmuY on processor q we derive the
following result.
Lemma 142. Assume that mmuY on processor q performs a memory access in cycle t.
q ∈ TAY (t)∧ s(NS(t)+ y).(g,u) = (mmuY ,q)
Then for the page table entry address read by mmuY the following holds.
ptea(mmuq,tY ) = ptea(s(NS(t)+ y).o)
Proof of lemma 142. From the construction we know: mmuqY in cycle t performs extension of
the user walk iff the nested control automaton of mmuqY resides in state fetch-pteU in cycle t.
q ∈ TAY (t) → ( mmuq,tY .fetch-pteU ↔ wextU (mmuq,tY ) ) (104)
Given the equivalence above, we argue simply as follows:
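\[
\begin{aligned}
ptea(mmu_Y^{q,t}) &=
\begin{cases}
ptea_U(mmu_Y^{q,t}) & wext_U(mmu_Y^{q,t}) \\
ptea_G(mmu_Y^{q,t}) & \text{otherwise}
\end{cases}
&& \text{(definition; equation 104)} \\
&=
\begin{cases}
ptea(mmu_Y^{q,t}.w_U \circ mmu_Y^{q,t}.w_G) & wext_U(mmu_Y^{q,t}) \\
ptea(mmu_Y^{q,t}.w_G) & \text{otherwise}
\end{cases}
&& \text{(lemma 20)} \\
&= ptea(s(NS(t)+y).o) && \text{(stepping, p. 235)} \qquad \Box
\end{aligned}
\]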
Using the latter result, we argue about the accesses performed by MMUs on walk extensions.
Lemma 143. Assume that mmuY on processor q performs a memory access in cycle t.
q ∈ TAY (t)∧ s(NS(t)+ y).(g,u) = (mmuY ,q)
For access components Z ∈ {a,bw, type} the following holds.
tacc(NS(t)+ y).Z = zacc′t [y].Z
Proof of lemma 143. Using definition of the stepping function, for some access number k we
argue:
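\[
\begin{aligned}
zacc'_t[y].Z &= acc'[NE(t)+z(y,t)].Z && \text{(definition)} \\
&= acc'[seq(4q+2\cdot 1_{\{Y=E\}},k)].Z && \text{(lemma 136.2)} \\
&= acc(4q+2\cdot 1_{\{Y=E\}},k).Z && \text{(definition)} \\
&= tacc(s(NS(t)+y)).Z && \text{(lemma 142)} \\
&= tacc(NS(t)+y).Z && \text{(definition, p. 226)} \qquad \Box
\end{aligned}
\]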
Analogously to lemma 141, we argue that the MMUs which do not perform walk extensions necessarily perform void accesses.
Lemma 144. Assume that $mmu_Y$ of processor $q$ stepped in cycle $t$ does not access the memory.
\[ q \in TS_Y(t) \setminus TA_Y(t) \land s(NS(t)+y).(g,u) = (mmu_Y,q) \]
Then the corresponding translation access is void.
\[ void(tacc(NS(t)+y)) \]
Proof of lemma 144. Directly from the assumptions we derive:
\[
\begin{aligned}
/wext(mmu_Y^{q,t}) &\to winit(mmu_Y^{q,t}) && \text{(assumption)} \\
&\to s(NS(t)+y).o.t = winit && \text{(stepping)} \\
&\to void(tacc(NS(t)+y)) && \text{(definition).} \quad \square
\end{aligned}
\]
260 9 Correctness of Pipelined Implementation
9.5.2 Induction Step
For the memory component of ISA we proceed to show the following property.
Lemma 145. For $y \in [0 : ns(t)]$ we claim
\[
\ell(mc^{NS(t)+y}.m) = \begin{cases} \Delta_M^{\gamma(y,t)}(\ell(mc^{NS(t)}.m), xacc'_t) & y < sa(t) \\ \Delta_M^{na(t)}(\ell(mc^{NS(t)}.m), xacc'_t) & \text{otherwise} \end{cases}
\]
Proof of lemma 145. By induction on $y$. For convenience we introduce the following shorthand.
\[
acc_\sigma = \begin{cases} dacc(q, ic(q,NS(t)+y)) & s(NS(t)+y).(g,u) = (proc,q) \\ tacc(NS(t)+y) & s(NS(t)+y).(g,u) = (mmu_Y,q) \end{cases}
\]
For the base case, if stepping accesses are performed in cycle $t$ ($sa(t) > 0$), we have
\[
\begin{aligned}
\Delta_M^{\gamma(0,t)}(\ell(mc^{NS(t)}.m), xacc'_t) &= \Delta_M^{\gamma(0,t)}(\ell(mc^{NS(t)}.m), xacc'_t[0 : \gamma(0,t)-1]) \\
&= \ell(mc^{NS(t)+0}.m) && \text{(lemma 138).}
\end{aligned}
\]
Otherwise, if no stepping accesses are performed in cycle $t$ ($sa(t) = 0$), we have
\[
\begin{aligned}
\Delta_M^{na(t)}(\ell(mc^{NS(t)}.m), xacc'_t) &= \Delta_M^{na(t)}(\ell(mc^{NS(t)}.m), xacc'_t[0 : na(t)-1]) \\
&= \ell(mc^{NS(t)+0}.m) && \text{(lemma 138).}
\end{aligned}
\]
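For the induction step from $y$ to $y+1$ we split cases on the value of $y$.
• $y < sa(t)$. From lemmas 139 and 143 we have
\[ acc_\sigma = zacc'_t[y] \tag{105} \]
and argue using the induction hypothesis for $y < sa(t)$:
\[
\begin{aligned}
\ell(mc^{NS(t)+y+1}.m) &= \Delta_M(\ell(mc^{NS(t)+y}.m), acc_\sigma) && \text{(definition)} \\
&= \Delta_M(\Delta_M^{\gamma(y,t)}(\ell(mc^{NS(t)}.m), xacc'_t), zacc'_t[y]) && \text{(ind. hypothesis; equation 105)} \\
&= \Delta_M(\Delta_M^{\gamma(y,t)}(\ell(mc^{NS(t)}.m), xacc'_t[0 : \gamma(y,t)-1]), xacc'_t[\gamma(y,t)]) && \text{(lemma 137)} \\
&= \Delta_M^{\gamma(y,t)+1}(\ell(mc^{NS(t)}.m), xacc'_t) \\
&= \begin{cases} \Delta_M^{\gamma(y+1,t)}(\ell(mc^{NS(t)}.m), xacc'_t) & y+1 < sa(t) \\ \Delta_M^{na(t)}(\ell(mc^{NS(t)}.m), xacc'_t) & \text{otherwise} \end{cases} && \text{(lemma 138).}
\end{aligned}
\]
• $y \ge sa(t)$. From lemmas 141 and 144 we have
\[ void(acc_\sigma) \tag{106} \]
and argue using the induction hypothesis for $y \ge sa(t)$:
\[
\begin{aligned}
\ell(mc^{NS(t)+y+1}.m) &= \Delta_M(\ell(mc^{NS(t)+y}.m), acc_\sigma) && \text{(definition)} \\
&= \ell(mc^{NS(t)+y}.m) && \text{(equation 106)} \\
&= \Delta_M^{na(t)}(\ell(mc^{NS(t)}.m), xacc'_t) && \text{(induction hypothesis).} \quad \square
\end{aligned}
\]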
Using the results of this section we complete the induction step for the memory part with almost no effort. We argue as follows:
\[
\begin{aligned}
m(h_\pi^{t+1}) &= \Delta_M^{na(t)}(m(h_\pi^t), xacc'_t) && \text{(lemma 91)} \\
&= \Delta_M^{na(t)}(\ell(mc^{NS(t)}.m), xacc'_t) && \text{(IH)} \\
&= \ell(mc^{NS(t)+ns(t)}.m) && \text{(lemma 145)} \\
&= \ell(mc^{NS(t+1)}.m) && \text{(definition).}
\end{aligned}
\]
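The shape of the argument in lemma 145 can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the `Access` record, the dictionary memory, and the names `delta_m` and `delta_m_n` are hypothetical, CAS and byte-writes are ignored, and any non-writing access is treated as leaving memory unchanged (which is all that is needed of a void access here). The sketch shows that the iterated update can be unrolled one access at a time and that non-writing accesses do not change the memory.

```python
from typing import Dict, List, NamedTuple


class Access(NamedTuple):
    """Hypothetical, simplified access: line address, write flag, write data."""
    a: int
    w: bool = False
    data: int = 0


def delta_m(mem: Dict[int, int], acc: Access) -> Dict[int, int]:
    """One-step memory update: only a writing access changes the memory."""
    if not acc.w:
        return mem  # non-writing (in particular void) accesses are no-ops
    new_mem = dict(mem)
    new_mem[acc.a] = acc.data
    return new_mem


def delta_m_n(mem: Dict[int, int], accs: List[Access], n: int) -> Dict[int, int]:
    """Iterated update: apply the first n accesses of accs in order."""
    for acc in accs[:n]:
        mem = delta_m(mem, acc)
    return mem
```

In this model, `delta_m(delta_m_n(mem, accs, n), accs[n])` equals `delta_m_n(mem, accs, n + 1)`, which mirrors the unrolling step used in the case $y < sa(t)$.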
9.5.3 Outputs to Accesses
Next we show that the outputs of the cache memory system to accesses performed by the processor cores and MMUs match the outputs observed in the ISA. The processor cores access the memory system to fetch instructions and to execute load or CAS operations, whereas the MMUs access it to perform walk extensions.
Data Access
First, we argue about the outputs to the accesses performed by processors executing load or
CAS operations.
Lemma 146. Assume that processor q stepped in cycle t executes a load or a CAS operation.
\[ (l.5_\pi^{q,t} \lor cas.5_\pi^{q,t}) \land s(NS(t)+y).(t,u) = (core,q) \]
Then the following holds.
\[ pdout(4q+3)^t = dmout_\sigma^{q,I(q,6,t)} \]
Proof of lemma 146. For instruction executed on processor $q$ in cycle $t$ by lemma 118 we have
\[ i = I(q,6,t) = ic(q,NS(t)+y) \]
while for the line address of the corresponding data access, for some access number $k$ we derive:
\[
\begin{aligned}
a = dacc(q,i).a &= zacc'_t[y].a && \text{(lemma 139)} \\
&= acc'[NE(t)+z(y,t)].a && \text{(definition)} \\
&= acc'[seq(4q+3,k)].a && \text{(lemma 136.1; equation 67).}
\end{aligned}
\]
Then the output to the data access of the specification would be:
\[
\begin{aligned}
dmout_\sigma^{q,i} &= dataout(\ell(mc^{pseq(q,i)}.m), dacc(q,i)) && \text{(definition)} \\
&= \ell(mc^{pseq(q,i)}.m)(a) && \text{(definition)} \\
&= \ell(mc^{NS(t)+y}.m)(a) && \text{(lemma 110)} \\
&= \Delta_M^{\gamma(y,t)}(\ell(mc^{NS(t)}.m), xacc'_t)(a) && \text{(lemma 145)} \\
&= \Delta_M^{\gamma(y,t)}(m(h_\pi^t), xacc'_t)(a) && \text{(IH)} \\
&= M_{\gamma(y,t)}(a) && \text{(definition, p. 198).}
\end{aligned}
\]
Using definition of the stepping function, for some access number $k$ we proceed to show for the output of the data cache:
\[
\begin{aligned}
M_{\gamma(y,t)}(a) &= \Delta_M^{x(\gamma(y,t),t)}(m(h_\pi^t), acc'_t)(a) && \text{(lemma 90)} \\
&= \Delta_M^{x(\gamma(y,t),t)}(\Delta_M^{NE(t)}(m(h_\pi^0), acc'), acc'_t)(a) && \text{(lemma 51)} \\
&= \Delta_M^{NE(t)+z(y,t)}(m(h_\pi^0), acc')(a) && \text{(lemma 137)} \\
&= \Delta_M^{seq(4q+3,k)}(m(h_\pi^0), acc')(a) && \text{(lemma 136.1; equation 67)} \\
&= pdout(4q+3)^t && \text{(lemma 52).} \quad \square
\end{aligned}
\]
Instruction Fetch
Next, we argue about the outputs to the accesses performed by processors fetching instructions.
Lemma 147. Assume that stage 2 on processor $q$ is updated “below” $\vec{\Sigma}$ in cycle $t$.
\[ \vec{\Sigma}(q,t) < (1,2) \land ue_2^{q,t} \]
Then the following holds.
\[ pdout(4q+1)^t = imout_\sigma^{q,I(q,2,t)} \]
Proof of lemma 147. The instruction fetched on processor $q$ in cycle $t$ is abbreviated as
\[ i = I(q,2,t) \]
whereas for the line address of the corresponding instruction fetch access we derive:
\[
\begin{aligned}
a = iacc(q,i).a &= {pma_I}_\sigma^{q,i}.l && \text{(definition)} \\
&= {pma_I.1}_\pi^{q,t}.l && \text{(IH).}
\end{aligned}
\]
Note, by definition of $\vec{\Sigma}$ we know that software conditions for instruction fetch are met for instruction $i$ on processor $q$.
\[ SC_{code}(q,i) \]
We proceed to show that the memory line at address $a$ is the same in configurations $NS(t)$ and $pseq(q,i)$, in which instruction $i$ is executed.
\[ \ell(mc^{NS(t)}.m)(a) = \ell(mc^{pseq(q,i)}.m)(a) \tag{107} \]
Proof of equation 107. From the definition (equation 69) we conclude that instructions in steps
\[ n \in [pseq(q,i-5)+1 : pseq(q,i+1)-1] \]
do not modify the ISA memory at line address $a$.
\[ \ell(mc^{n}.m)(a) = \ell(mc^{pseq(q,i+1)}.m)(a) \]
Therefore, it suffices to show that
\[ NS(t),\ pseq(q,i) \in [pseq(q,i-5)+1 : pseq(q,i+1)] \]
which follows from lemma 119 and the monotonicity of function $pseq$, respectively. $\square$
Then the output to the instruction fetch access of the specification would be:
\[
\begin{aligned}
imout_\sigma^{q,i} &= dataout(\ell(mc^{pseq(q,i)}.m), iacc(q,i)) && \text{(definition)} \\
&= \ell(mc^{pseq(q,i)}.m)(a) && \text{(definition)} \\
&= \ell(mc^{NS(t)}.m)(a) && \text{(equation 107).}
\end{aligned}
\]
In order to complete the proof we require one more auxiliary result.
\[ n \in seq(E(t)) \land acc'[n].a = a \to acc'[n].(w,cas) = 0^2 \tag{108} \]
Proof of equation 108. By contradiction. From the assumptions and equation 103 we derive:
\[ \exists d \in seq(S_D(t)) \land seq(d) = n. \]
From the definition of $S$ for some processor $q'$, access $k$, and step number $y$ we have:
\[ d = (4q'+3,k) \land seq(d) = NE(t)+z(y,t). \]
From the definition of the stepping function we conclude
\[ y < sa(t) \land s(NS(t)+y).(t,u) = (core,q') \]
which by part (1) of lemma 136 implies
\[ q' \in PA(t). \]
Using the arguments above we can apply lemma 139 and first obtain
\[
\begin{aligned}
dacc(q', ic(q',NS(t)+y)).(w,cas) &= zacc'_t[y].(w,cas) \\
&= acc'[NE(t)+z(y,t)].(w,cas) \\
&= acc'[seq(d)].(w,cas) \\
&= acc'[n].(w,cas) \ne 0^2
\end{aligned}
\]
and then, analogously obtain
\[ dacc(q', ic(q',NS(t)+y)).a = acc'[n].a = a. \]
Therefore, for ISA step
\[ m = NS(t)+y \in [NS(t) : NS(t+1)-1] \]
we conclude
\[ write(mc^m, s(m)) \land pma_E(mc^m, s(m)).l = a \]
which according to lemma 119 violates the software conditions (equation 69). $\square$
In turn, for the output of the instruction cache and some access number $k$ we show:
\[
\begin{aligned}
pdout(4q+1)^t &= \Delta_M^{seq(4q+1,k)}(m(h_\pi^0), acc')(a) && \text{(lemma 52)} \\
&= \Delta_M^{NE(t)+n_I}(m(h_\pi^0), acc')(a) && \text{(where } n_I \le ne(t)\text{)} \\
&= \Delta_M^{NE(t)}(m(h_\pi^0), acc')(a) && \text{(equation 108)} \\
&= m(h_\pi^t)(a) && \text{(lemma 51).}
\end{aligned}
\]
Thus, the result follows directly from the induction hypothesis (IH). $\square$
Translation Access
Finally, we argue about the outputs to the accesses performed by MMUs extending walks.
Lemma 148. Assume that mmuY on processor q performs a walk extension in cycle t.
\[ q \in TA_Y(t) \land s(NS(t)+y).(g,u) = (mmu_Y,q). \]
Then the following holds.
\[ pdout(4q+2\cdot\mathbf{1}\{Y=E\})^t = tmout(NS(t)+y) \]
Proof of lemma 148. Given that for the line address of the translation access, for some access number $k$ we have
\[
\begin{aligned}
a = tacc(NS(t)+y).a &= zacc'_t[y].a && \text{(lemma 143)} \\
&= acc'[NE(t)+z(y,t)].a && \text{(definition)} \\
&= acc'[seq(4q+2\cdot\mathbf{1}\{Y=E\},k)].a && \text{(lemma 136.2; equation 68)}
\end{aligned}
\]
the output to the translation access of the specification would be:
\[
\begin{aligned}
tmout(NS(t)+y) &= dataout(\ell(mc^{NS(t)+y}.m), tacc(NS(t)+y)) && \text{(definition)} \\
&= \ell(mc^{NS(t)+y}.m)(a) && \text{(definition)} \\
&= \Delta_M^{\gamma(y,t)}(\ell(mc^{NS(t)}.m), xacc'_t)(a) && \text{(lemma 145)} \\
&= \Delta_M^{\gamma(y,t)}(m(h_\pi^t), xacc'_t)(a) && \text{(IH)} \\
&= M_{\gamma(y,t)}(a) && \text{(definition).}
\end{aligned}
\]
For the output of the corresponding cache and some access number $k$ we show:
\[
\begin{aligned}
M_{\gamma(y,t)}(a) &= \Delta_M^{x(\gamma(y,t),t)}(m(h_\pi^t), acc'_t)(a) && \text{(lemma 90)} \\
&= \Delta_M^{x(\gamma(y,t),t)}(\Delta_M^{NE(t)}(m(h_\pi^0), acc'), acc'_t)(a) && \text{(lemma 51)} \\
&= \Delta_M^{NE(t)+z(y,t)}(m(h_\pi^0), acc')(a) && \text{(lemma 137)} \\
&= \Delta_M^{seq(4q+2\cdot\mathbf{1}\{Y=E\},k)}(m(h_\pi^0), acc')(a) && \text{(lemma 136.2; equation 68)} \\
&= pdout(4q+2\cdot\mathbf{1}\{Y=E\})^t && \text{(lemma 52).} \quad \square
\end{aligned}
\]
9.6 Correctness for Pipeline Registers
First in this section we present the arguments necessary to perform the induction step for the pipeline registers (Sects. 9.6.1–9.6.2). To show that the registers are simulated correctly, we split cases on the values of the speculation stage $\hat{\Sigma}$ in cycles $t$ and $t+1$. The case split is as follows:
• $\hat{\Sigma}(t+1) = \hat{\Sigma}(t) = 0$; the value of $\hat{\Sigma}$ remains zero (Sect. 9.6.3),
• $\hat{\Sigma}(t+1) = \hat{\Sigma}(t)+1$; the value of $\hat{\Sigma}$ increases ($ue_{\hat{\Sigma}(t)+1}$, Sect. 9.6.4),
• $\hat{\Sigma}(t+1) = \hat{\Sigma}(t) \ne 0$; the (non-zero) value of $\hat{\Sigma}$ remains unchanged ($/ue_{\hat{\Sigma}(t)+1}$, also Sect. 9.6.4),
• $\hat{\Sigma}(t+1) = 0 \land \hat{\Sigma}(t) = 2$; the value of $\hat{\Sigma}$ resets to zero on pcres ($ue_3$, see Sect. 9.6.5), and
• $\hat{\Sigma}(t+1) = 0 \land \hat{\Sigma}(t) = 5$; the value of $\hat{\Sigma}$ resets to zero on jisr ($ue_6$, Sect. 9.6.6).
Note, in the last two cases according to lemma 134 we have $rbr_{\hat{\Sigma}(t)}$ and
\[ \Sigma_{soft}(t) < \Sigma_{hard}(t). \]
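The case split above can be written down as a small classifier, which makes it easy to see that the five cases are mutually exclusive. This is only an illustrative Python sketch: the function name and the returned labels are hypothetical, and transitions not covered by the case split yield `None` (the thesis argues in Sect. 9.6.4 why the listed cases suffice).

```python
from typing import Optional


def classify_transition(sigma_t: int, sigma_next: int) -> Optional[str]:
    """Classify the change of the speculation stage from cycle t to t+1."""
    if sigma_next == sigma_t == 0:
        return "remains zero"            # regular execution
    if sigma_next == sigma_t + 1:
        return "increases"               # stage below the old value is updated
    if sigma_next == sigma_t:
        return "remains unchanged"       # non-zero here, zero was handled above
    if sigma_next == 0 and sigma_t == 2:
        return "reset on pcres"
    if sigma_next == 0 and sigma_t == 5:
        return "reset on jisr"
    return None                          # not part of the case split
```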
9.6.1 Speculation on SPR Content
Recall from Chap. 8 that the machine mode is i) (speculatively) used in the upper pipeline
stages (Sect. 8.1.5) and ii) written on interrupts and returns from exceptions in the memory
stage (Sect. 8.2.1). In this section we justify our speculative usage of the machine mode. In the
following lemma we assume the absence of interrupts in the ISA computation as well as the
absence of a legal eret instruction in the pipeline.
Lemma 149. For pipelines such that
\[ \hat{\Sigma}(t) = 0 \quad \text{and} \quad \forall j \in [3:5] : /(rfull_j^t \land iret_j^t) \]
we claim the following holds for stages $k \le 6$:
\[ mode(t) = mode_\sigma^{I(k,t)} \tag{1} \]
\[ asid(t) = asid_\sigma^{I(k,t)} \tag{2} \]
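Proof of lemma 149.1. From the definition of Σ we know
\[ \forall j \in [1:5] : /(rfull_j^t \land jisr_\sigma^{I(j,t)-1}) \tag{109} \]
and
\[ \forall j \in [1:2] : /(rfull_j^t \land eret_\sigma^{I(j,t)-1}). \]
From the assumptions for stages $j \in [3:5]$ we derive
\[
\begin{aligned}
/(rfull_j^t \land iret_j^t) &\to /(rfull_j^t \land eret.j_\pi^t \land ca.j_\pi^t[>il]) && \text{(definition)} \\
&\to (rfull_j^t \land eret_\sigma^{I(j,t)-1} \to jisr_\sigma^{I(j,t)-1}) && \text{(IH)} \\
&\to /(rfull_j^t \land eret_\sigma^{I(j,t)-1}) && \text{(equation 109).}
\end{aligned}
\]
Altogether we clearly have
\[ \nexists j \in [1:5] : rfull_j^t \land (jisr_\sigma^{I(j,t)-1} \lor eret_\sigma^{I(j,t)-1}) \]
or using lemma 79:
\[ \nexists j \in [2:6] : rfull_{j-1}^t \land (jisr_\sigma^{I(j,t)} \lor eret_\sigma^{I(j,t)}). \tag{110} \]
From the induction hypothesis we also know
\[ mode(t) = mode_\sigma^{I(6,t)} \]
therefore it suffices to prove
\[ mode_\sigma^{I(6,t)} = mode_\sigma^{I(6-\ell,t)} \]
by induction on $\ell \le 5$. For the base case ($\ell = 0$) there is nothing to show. For the induction step from $\ell$ to $\ell+1 \le 5$ we argue as follows.
\[
\begin{aligned}
mode_\sigma^{I(6-\ell-1,t)} &= \begin{cases} mode_\sigma^{I(6-\ell,t)+1} & rfull_{6-\ell-1}^t \\ mode_\sigma^{I(6-\ell,t)} & \text{otherwise} \end{cases} && \text{(lemma 79)} \\
&= \begin{cases} mode_\sigma^{I(6-\ell,t)+1} & /jisr_\sigma^{I(6-\ell,t)} \land /eret_\sigma^{I(6-\ell,t)} \\ mode_\sigma^{I(6-\ell,t)} & \text{otherwise} \end{cases} && \text{(equation 110)} \\
&= mode_\sigma^{I(6-\ell,t)} && \text{(specification)} \\
&= mode_\sigma^{I(6,t)} && \text{(induction hypothesis)} \quad \square
\end{aligned}
\]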
Proof of lemma 149.2. Recall, in Sect. 3.2 we defined the ASID in configuration $c$ as
\[
asid(c) = \begin{cases} vmid(c) \circ prid(c) & user(c) \\ vmid(c) \circ 0^8 & guest(c) \\ 0^{12} & host(c). \end{cases}
\]
Having lemma 149.1, on the level of host there is clearly nothing to show. For the remaining levels of execution we need to show the following.
\[ mode(t) \ne host \to vmid(t) = vmid_\sigma^{I(k,t)} \]
\[ mode(t) = user \to prid(t) = prid_\sigma^{I(k,t)} \]
Using the induction hypothesis we rewrite the latter as follows.
\[ mode_\sigma^{I(6,t)} \ne host \to vmid_\sigma^{I(6,t)} = vmid_\sigma^{I(k,t)} \]
\[ mode_\sigma^{I(6,t)} = user \to prid_\sigma^{I(6,t)} = prid_\sigma^{I(k,t)} \]
Next we argue along the proof lines of lemma 149.1 and first obtain that there are no illegal instructions in truly full stages $j \in [2:6]$.
\[ \nexists j \in [2:6] : rfull_{j-1}^t \land ill_\sigma^{I(j,t)} \]
Using lemma 149.1 for the execution mode, for the virtual machine ID we derive
\[ mode_\sigma^{I(6,t)} \ne host \to \nexists j \in [2:6] : rfull_{j-1}^t \land movg2s_\sigma^{I(j,t)} \land (xad_\sigma^{I(j,t)} = mode) \]
and for the process ID we derive
\[ mode_\sigma^{I(6,t)} = user \to \nexists j \in [2:6] : rfull_{j-1}^t \land movg2s_\sigma^{I(j,t)} \land (xad_\sigma^{I(j,t)} = nmode). \]
For both IDs one can easily complete the proof by induction on $\ell \le 5$ by showing
\[ mode_\sigma^{I(6,t)} \ne host \to vmid_\sigma^{I(6,t)} = vmid_\sigma^{I(6-\ell,t)} \]
\[ mode_\sigma^{I(6,t)} = user \to prid_\sigma^{I(6,t)} = prid_\sigma^{I(6-\ell,t)}. \quad \square \]
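The ASID definition recalled in the proof of lemma 149.2 can be sketched in Python as follows. The bit widths are read off the definition itself ($0^{12}$ for the host and $vmid \circ 0^8$ for a guest suggest a 4-bit virtual machine ID followed by an 8-bit process ID); the function name and the string encoding of the execution level are assumptions made for this illustration only.

```python
PRID_BITS = 8  # width of the process ID, inferred from vmid ∘ 0^8
VMID_BITS = 4  # width of the VM ID, inferred from the 12-bit host ASID 0^12


def asid(level: str, vmid: int, prid: int) -> int:
    """12-bit address space identifier for execution level user/guest/host."""
    vmid &= (1 << VMID_BITS) - 1
    prid &= (1 << PRID_BITS) - 1
    if level == "user":
        return (vmid << PRID_BITS) | prid   # vmid ∘ prid
    if level == "guest":
        return vmid << PRID_BITS            # vmid ∘ 0^8
    if level == "host":
        return 0                            # 0^12
    raise ValueError(f"unknown execution level: {level}")
```

The sketch makes the proof obligation visible: on the host level neither ID matters, on the guest level only `vmid` matters, and on the user level both do.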
In the following sections we cannot always apply the result above, at least not for all stages. For that reason we proceed to limit the scope of lemma 149, which allows us to weaken its assumptions.
Lemma 150. For stages $k \in [3:6]$ such that
\[ \vec{\Sigma}(t) < (k-1,k) \land rfull_{k-1}^t \]
we claim the following holds:
\[ mode(t) = mode_\sigma^{I(k,t)} \tag{1} \]
\[ asid(t) = asid_\sigma^{I(k,t)} \tag{2} \]
Proof of lemma 150.1. From the definition of Σ we know
\[ \forall j \in [k:5] : /(rfull_j^t \land jisr_\sigma^{I(j,t)-1}) \tag{111} \]
and from the assumptions for stages $j \in [k:5]$ we derive
\[
\begin{aligned}
rfull_{k-1}^t &\to /(rfull_j^t \land iret_j^t) && \text{(lemma 77.1)} \\
&\to /(rfull_j^t \land eret.j_\pi^t \land ca.j_\pi^t[>il]) && \text{(definition)} \\
&\to (rfull_j^t \land eret_\sigma^{I(j,t)-1} \to jisr_\sigma^{I(j,t)-1}) && \text{(IH)} \\
&\to /(rfull_j^t \land eret_\sigma^{I(j,t)-1}) && \text{(equation 111).}
\end{aligned}
\]
Altogether we clearly have
\[ \nexists j \in [k:5] : rfull_j^t \land (jisr_\sigma^{I(j,t)-1} \lor eret_\sigma^{I(j,t)-1}) \]
or using lemma 79:
\[ \nexists j \in [k+1:6] : rfull_{j-1}^t \land (jisr_\sigma^{I(j,t)} \lor eret_\sigma^{I(j,t)}). \]
From the induction hypothesis we also know
\[ mode(t) = mode_\sigma^{I(6,t)} \]
therefore it suffices to prove
\[ mode_\sigma^{I(6,t)} = mode_\sigma^{I(6-\ell,t)} \]
by induction on $\ell \le 6-k$. In order to complete the proof one can literally follow the corresponding lines in the proof of lemma 149.1. $\square$
The proof of lemma 150.2 is similar to the proof of lemma 149.2, and therefore is omitted.
9.6.2 Matching MMU Outputs
Below we argue that in the absence of interrupts both in the hardware and in the ISA computation, the walks output by the two MMUs provide matching translations for the corresponding translation requests.
Lemma 151. Assume that the address translation is enabled ($/host(t)$).
\[ \hat{\Sigma}(t) = 0 \land mca(1)^t[>mf] \land ue_1^t \to match({trq_I}_\sigma^{I(1,t)}, mmu_I^t.wout) \tag{1} \]
\[ mop.4_\pi^t \land \vec{\Sigma}(t) < (4,5) \land mca(5)^t[>mm] \land ue_5^t \to match({trq_E}_\sigma^{I(5,t)}, mmu_E^t.wout) \tag{2} \]
From the specification of MMUs we know that the walks provided to the processor, and thus passed to the specification machine, are either faulty or complete.
\[ f(mmu_Y^t.wout) \lor mmu_Y^t.wout.\ell[0] \]
Proof of lemma 151.1. For the walk output of $mmu_I$ we clearly have
\[
\begin{aligned}
mmu_I^t.wout.upa &= mmu_I^t.upa && \text{(specification)} \\
&= asid(t) \circ ia_\pi^t.pa && \text{(interconnect)} \\
&= asid_\sigma^{I(1,t)} \circ ia_\pi^t.pa && \text{(lemma 149.2)} \\
&= asid_\sigma^{I(1,t)} \circ ia_\sigma^{I(1,t)}.pa && \text{(induction hypothesis).}
\end{aligned}
\]
Note, in the proof lines above we can apply lemma 149 since, again, here we consider only those cycles $t$ in which stage $k = 1$ is updated. From this we immediately derive the absence of a legal eret instruction anywhere in the pipeline.
\[ ue_1^t \to /haz_1^t \to /drain(t) \quad \square \]
Proof of lemma 151.2. For the walk output of $mmu_E$ we analogously derive
\[
\begin{aligned}
mmu_E^t.wout.upa &= mmu_E^t.upa && \text{(specification)} \\
&= asid(t) \circ ea.4_\pi^t.pa && \text{(interconnect)} \\
&= asid_\sigma^{I(5,t)} \circ ea.4_\pi^t.pa && \text{(lemma 150.2)} \\
&= asid_\sigma^{I(5,t)} \circ ea_\sigma^{I(5,t)}.pa && \text{(induction hypothesis).}
\end{aligned}
\]
Again, since we consider only those cycles $t$ in which stage $k = 5$ is updated, we immediately obtain that stage 4 has a real full bit in cycle $t$ ($rfull_4^t$). This justifies the application of lemma 150 above. $\square$
9.6.3 Regular Execution
In this section we present the correctness arguments covering execution of most of the instructions given that
\[ \hat{\Sigma}(t+1) = \hat{\Sigma}(t) = 0. \]
First we are to show the following:
• for the instruction address
\[ ia_\pi^t = ia_\sigma^{I(1,t)} \]
• for the physical memory addresses
\[ {pma_I}_\pi^t = {pma_I}_\sigma^{I(1,t)} \]
\[ {pma_E}_\pi^t = {pma_E}_\sigma^{I(5,t)} \]
• for the instruction fetched
\[ I(h_\pi^t) = I_\sigma^{I(2,t)} \]
• for the output of the memory access
\[ mout(h_\pi^t) = mout_\sigma^{I(6,t)} \]
Instruction address
We split cases on the number $n$ of real full stages above the ID stage:
\[ n(t) = \sum_{1 \le k < 3} rfull_k^t. \]
• $n(t) = 0$.
\[ ia_\pi^t = ddpc_\pi^t = ddpc_\sigma^{I(3,t)} = ddpc_\sigma^{I(1,t)} = ia_\sigma^{I(1,t)} \]
• $n(t) = 1 \land full_1^t \land /rbp_1^t$.
\[ ia_\pi^t = dpc_\pi^t = dpc_\sigma^{I(3,t)} = dpc_\sigma^{I(2,t)} = ddpc_\sigma^{I(2,t)+1} = ddpc_\sigma^{I(1,t)} = ia_\sigma^{I(1,t)} \]
• $n(t) = 1 \land full_2^t \land /rbp_2^t$.
\[ ia_\pi^t = dpc_\pi^t = dpc_\sigma^{I(3,t)} = ddpc_\sigma^{I(3,t)+1} = ddpc_\sigma^{I(2,t)} = ddpc_\sigma^{I(1,t)} = ia_\sigma^{I(1,t)} \]
• $n(t) = 2$.
\[ ia_\pi^t = pc_\pi^t = pc_\sigma^{I(3,t)} = dpc_\sigma^{I(3,t)+1} = dpc_\sigma^{I(2,t)} = ddpc_\sigma^{I(2,t)+1} = ddpc_\sigma^{I(1,t)} = ia_\sigma^{I(1,t)} \]
Physical Memory Addresses
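In case the address translation is used, we show
\[
\begin{aligned}
{pma_I}_\pi^t &= mmu_I^t.wout.ba \circ ia_\pi^t.po && \text{(interconnect)} \\
&= mmu_I^t.wout.ba \circ ia_\sigma^{I(1,t)}.po && \text{(induction hypothesis)} \\
&= tma(asid_\sigma^{I(1,t)} \circ ia_\sigma^{I(1,t)}, mmu_I^t.wout) && \text{(lemma 151.1)} \\
&= tma(asid_\sigma^{I(1,t)} \circ ia_\sigma^{I(1,t)}, {w_I}_\sigma^{I(1,t)}) && \text{(equation 73)} \\
&= {pma_I}_\sigma^{I(1,t)} && \text{(definition)}
\end{aligned}
\]
and
\[
\begin{aligned}
{pma_E}_\pi^t &= mmu_E^t.wout.ba \circ ea.4_\pi^t.po && \text{(interconnect)} \\
&= mmu_E^t.wout.ba \circ ea_\sigma^{I(5,t)}.po && \text{(induction hypothesis)} \\
&= tma(asid_\sigma^{I(5,t)} \circ ea_\sigma^{I(5,t)}, mmu_E^t.wout) && \text{(lemma 151.2)} \\
&= tma(asid_\sigma^{I(5,t)} \circ ea_\sigma^{I(5,t)}, {w_E}_\sigma^{I(5,t)}) && \text{(equation 74)} \\
&= {pma_E}_\sigma^{I(5,t)} && \text{(definition).}
\end{aligned}
\]
In case the address translation is not used, we have
\[ {pma_I}_\pi^t = ia_\pi^t = ia_\sigma^{I(1,t)} = {pma_I}_\sigma^{I(1,t)} \]
and
\[ {pma_E}_\pi^t = ea.4_\pi^t = ea_\sigma^{I(5,t)} = {pma_E}_\sigma^{I(5,t)}. \]
For the destination registers (in case they are updated) we now easily derive:
\[ {pma_I.1}_\pi^{t+1} = {pma_I}_\pi^t = {pma_I}_\sigma^{I(1,t)} = {pma_I}_\sigma^{I(1,t+1)-1} \]
and
\[ {pma_E.5}_\pi^{t+1} = {pma_E}_\pi^t = {pma_E}_\sigma^{I(5,t)} = {pma_E}_\sigma^{I(5,t+1)-1}. \]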
Instruction Fetched
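For the fetched instruction we show:
\[
\begin{aligned}
I(h_\pi^t) &= \begin{cases} pdout(4q+1)^t_H & {pma_I.1}_\pi^t[2] \\ pdout(4q+1)^t_L & \text{otherwise} \end{cases} && \text{(interconnect)} \\
&= \begin{cases} {imout_\sigma^{q,I(q,2,t)}}_H & {pma_I}_\sigma^{I(q,2,t)}[2] \\ {imout_\sigma^{q,I(q,2,t)}}_L & \text{otherwise} \end{cases} && \text{(lemma 147)} \\
&= I_\sigma^{q,I(q,2,t)} && \text{(lemma 7).}
\end{aligned}
\]
And for the instruction register (in case it is updated) we now easily derive:
\[ I_\pi^{t+1} = I(h_\pi^t) = I_\sigma^{q,I(q,2,t)} = I_\sigma^{q,I(q,2,t+1)-1}. \]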
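The word selection in the derivation above can be pictured with a short Python sketch. It assumes, for illustration only, that a cache line is 64 bits wide and an instruction word is 32 bits wide, so that bit 2 of the byte address chooses between the high half (subscript $H$) and the low half (subscript $L$); the function name is hypothetical.

```python
WORD_MASK = 0xFFFFFFFF  # 32-bit instruction word


def select_instruction(line: int, pma: int) -> int:
    """Pick the instruction word out of a 64-bit line using address bit 2."""
    if (pma >> 2) & 1:
        return (line >> 32) & WORD_MASK   # high half of the line
    return line & WORD_MASK               # low half of the line
```

Since the same selection is applied to the cache output on the hardware side and to the memory output on the ISA side, equality of the two lines and of the address bit suffices for equality of the fetched instructions.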
Output of Memory Access
For the output of the memory access we show:
\[
\begin{aligned}
mout(h_\pi^t) &= pdout(4q+3)^t && \text{(interconnect)} \\
&= dmout_\sigma^{q,I(q,6,t)} && \text{(lemma 146)} \\
&= mout_\sigma^{q,I(q,6,t)} && \text{(definition).}
\end{aligned}
\]
And for the pipeline register (in case it is updated) we now easily derive:
\[ mout.6_\pi^{t+1} = mout(h_\pi^t) = mout_\sigma^{q,I(q,6,t)} = mout_\sigma^{q,I(q,6,t+1)-1}. \]
Forwarding for A, B, and S
Forwarding for registers A and B stays literally the same as in [KMP14], except that now it has to be performed over more pipeline stages. For register S we argue as follows. For address
\[ rs = rs_\pi^t = rs_\sigma^{I(3,t)} \]
and special purpose register
\[ sprx_\pi^t = spr_\pi^t(rs) = spr_\sigma^{I(6,t)}(rs) = sprx_\sigma^{I(6,t)} \]
we derive:
\[
\begin{aligned}
S(h_\pi^t) &= \begin{cases} C.4_\pi^t.in & topS[3]^t \\ C.k_\pi^t & topS[k]^t \land k > 3 \\ sprx_\pi^t & \text{otherwise} \end{cases} && \text{(interconnect)} \\
&= \begin{cases} C_\sigma^{I(k,t)} & \min\{ j \mid movg2s_\sigma^{I(j,t)} \land (xad_\sigma^{I(j,t)} = rs) \} = k \in [4:6] \\ sprx_\sigma^{I(6,t)} & \text{otherwise} \end{cases} && \text{(IH)} \\
&= S_\sigma^{I(3,t)} && \text{(specification).}
\end{aligned}
\]
Therefore, for the S register we easily obtain:
\[ S_\pi^{t+1} = S(h_\pi^t) = S_\sigma^{I(3,t)} = S_\sigma^{I(3,t+1)-1}. \]
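The ISA-side case distinction for S can be read as a priority search: the youngest instruction in stages 4 to 6 that writes the requested special purpose register supplies the value, and otherwise the register file is read. The following Python sketch illustrates this under simplifying assumptions; the `StageInstr` record, the dictionary of stages, and the function name are hypothetical and do not model the hit signals $topS[k]$ themselves.

```python
from typing import Dict, NamedTuple, Optional


class StageInstr(NamedTuple):
    """Hypothetical summary of the instruction occupying a pipeline stage."""
    movg2s: bool  # instruction moves a general purpose register to an SPR
    xad: int      # address of the SPR that is written
    c: int        # result value C carried by the instruction


def forward_s(rs: int, stages: Dict[int, Optional[StageInstr]],
              spr: Dict[int, int]) -> int:
    """Value of SPR rs as seen by the instruction in the decode stage."""
    for k in (4, 5, 6):                  # smallest k first: youngest writer wins
        ins = stages.get(k)
        if ins is not None and ins.movg2s and ins.xad == rs:
            return ins.c
    return spr[rs]                       # no pending writer: read the SPR file
```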
Typical Pipeline Registers
Obviously, we consider two cases. If register R in stage $k$ is updated in cycle $t$ ($ue_k^t$), in the spirit of [KMP14] we first have to argue that all signals used for computation of the register’s input are correct; then, correctness of the register update follows, basically, from the register semantics. For better presentation, let us consider the update of the program counters, the visible registers in stage 3. For $dpc$ and $ddpc$, respectively, the arguments are trivial:
\[ dpc_\pi^{t+1} = pc_\pi^t = pc_\sigma^{I(3,t)} = dpc_\sigma^{I(3,t)+1} = dpc_\sigma^{I(3,t+1)} \]
and
\[ ddpc_\pi^{t+1} = dpc_\pi^t = dpc_\sigma^{I(3,t)} = ddpc_\sigma^{I(3,t)+1} = ddpc_\sigma^{I(3,t+1)}. \]
In turn, correctness of the pc update hinges on correct computation of the nextpc:
\[ pc_\pi^{t+1} = nextpc_\pi^t \overset{!}{=} nextpc_\sigma^{I(3,t)} = pc_\sigma^{I(3,t)+1} = pc_\sigma^{I(3,t+1)}. \]
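In order to establish correctness of the nextpc computation
\[
\begin{aligned}
nextpc_\pi^t &= \begin{cases} btarget_\pi^t & jbtaken_\pi^t \\ pc_\pi^t +_{32} 4_{32} & \text{otherwise} \end{cases} && \text{(interconnect)} \\
&\overset{!}{=} \begin{cases} btarget_\sigma^{I(3,t)} & jbtaken_\sigma^{I(3,t)} \\ pc_\sigma^{I(3,t)} +_{32} 4_{32} & \text{otherwise} \end{cases} && \text{(IH)} \\
&= nextpc_\sigma^{I(3,t)} && \text{(definition)}
\end{aligned}
\]
we show correctness for signals jbtaken, which is always used, and btarget, which is used only if a jump or branch instruction is executed.
\[ used(btarget)_\sigma^{I(3,t)} \leftrightarrow jbtaken_\sigma^{I(3,t)} \land il_\sigma^{I(3,t)} > gf \]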
For the first signal, using the induction hypothesis (for the instruction register)
\[ I_\pi^t = I_\sigma^{I(3,t)} \tag{112} \]
and correctness of the forwarding for A and B (in case they are used)
\[ used(A)_\sigma^{I(3,t)} \to A(h_\pi^t) = A_\sigma^{I(3,t)} \tag{113} \]
\[ used(B)_\sigma^{I(3,t)} \to B(h_\pi^t) = B_\sigma^{I(3,t)} \tag{114} \]
we argue as follows:
\[
\begin{aligned}
jbtaken_\pi^t &= jump_\pi^t \lor b_\pi^t \land bce_\pi^t.res && \text{(construction)} \\
&= jump_\sigma^{I(3,t)} \lor b_\sigma^{I(3,t)} \land bce.res_\sigma^{I(3,t)} && \text{(equations 112–114)} \\
&= jbtaken_\sigma^{I(3,t)} && \text{(definition).}
\end{aligned}
\]
Moreover, here we rely on correctness of the instruction decoder and correct implementation of the switching function res of the branch condition evaluation (BCE) (see Sects. 2.1.3 and 2.1.4).
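For the second signal (btarget), in case a jump or a branch instruction is executed
\[ jbtaken_\pi^t = 1 = jbtaken_\sigma^{I(3,t)}, \]
repeating the arguments above we derive:
\[
\begin{aligned}
btarget_\pi^t &= \begin{cases} pc_\pi^t +_{32} imm(h_\pi^t)[15]^{16} \circ imm(h_\pi^t) \circ 0^2 & b_\pi^t \land bce_\pi^t.res \\ A(h_\pi^t) & \text{otherwise} \end{cases} && \text{(construction)} \\
&= \begin{cases} pc_\sigma^{I(3,t)} +_{32} imm_\sigma^{I(3,t)}[15]^{16} \circ imm_\sigma^{I(3,t)} \circ 0^2 & b_\sigma^{I(3,t)} \land bce.res_\sigma^{I(3,t)} \\ A_\sigma^{I(3,t)} & \text{otherwise} \end{cases} && \text{(IH)} \\
&= btarget_\sigma^{I(3,t)} && \text{(definition).}
\end{aligned}
\]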
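As a rough illustration of the signals jbtaken, btarget, and nextpc discussed above, the following Python sketch models them on 32-bit unsigned integers. It is a minimal sketch, not the hardware construction: the helper names `sxt16` and `MASK32` are hypothetical, the shifted sign extension stands in for the sign-extended immediate followed by $0^2$ taken modulo $2^{32}$, and, as in the displayed case split, every taken jump that is not a taken branch uses the forwarded value of A as its target.

```python
MASK32 = 0xFFFFFFFF  # arithmetic is modulo 2^32, as for the +32 operator


def sxt16(imm: int) -> int:
    """Sign-extend a 16-bit immediate to a Python integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def jbtaken(jump: bool, b: bool, bcres: bool) -> bool:
    """A jump is always taken; a branch only if the condition evaluates to 1."""
    return jump or (b and bcres)


def btarget(pc: int, imm: int, a: int, b: bool, bcres: bool) -> int:
    """Branch target: pc-relative for taken branches, register A otherwise."""
    if b and bcres:
        return (pc + (sxt16(imm) << 2)) & MASK32
    return a & MASK32


def nextpc(pc: int, imm: int, a: int, jump: bool, b: bool, bcres: bool) -> int:
    """Next program counter: the target if taken, pc + 4 otherwise."""
    if jbtaken(jump, b, bcres):
        return btarget(pc, imm, a, b, bcres)
    return (pc + 4) & MASK32
```

Because the function depends only on `pc`, `imm`, `a`, and the decoded control bits, equality of these inputs on both sides (equations 112 to 114 and the induction hypothesis) immediately gives equality of the result.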
Let us also consider the update of the invisible registers, for instance register ea.4 (the effective address) in stage 4. Note, we have to show simulation of the invisible registers only if they are used. Thus, we argue
\[ used(ea.4)_\sigma^{I(4,t+1)-1} \to used(ea.4.in)_\sigma^{I(4,t)} \to used(A)_\sigma^{I(4,t)} \]
and show
\[
\begin{aligned}
ea.4_\pi^t.in &= \begin{cases} A_\pi^t & cas.3_\pi^t \\ A_\pi^t +_{32} imm.3_\pi^t[15]^{16} \circ imm.3_\pi^t & \text{otherwise} \end{cases} && \text{(construction)} \\
&= \begin{cases} A_\sigma^{I(4,t)} & cas_\sigma^{I(4,t)} \\ A_\sigma^{I(4,t)} +_{32} imm_\sigma^{I(4,t)}[15]^{16} \circ imm_\sigma^{I(4,t)} & \text{otherwise} \end{cases} && \text{(lemma 79; IH)} \\
&= ea_\sigma^{I(4,t)} && \text{(definition).}
\end{aligned}
\]
Therefore, for the ea.4 register we easily obtain:
\[ ea.4_\pi^{t+1} = ea.4_\pi^t.in = ea_\sigma^{I(4,t)} = ea_\sigma^{I(4,t+1)-1}. \]
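The effective address computation used above can be sketched in the same style. This is an illustrative Python model on 32-bit unsigned integers with hypothetical helper names: for CAS the address is taken directly from A, otherwise the sign-extended 16-bit immediate is added modulo $2^{32}$.

```python
MASK32 = 0xFFFFFFFF  # arithmetic is modulo 2^32


def sxt16(imm: int) -> int:
    """Sign-extend a 16-bit immediate to a Python integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(a: int, imm: int, cas: bool) -> int:
    """Effective address: A for CAS, A plus sign-extended immediate otherwise."""
    if cas:
        return a & MASK32
    return (a + sxt16(imm)) & MASK32
```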
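In case register R in stage $k$ is not updated in cycle $t$ ($/ue_k^t$), we clearly have
\[ used(R.k)_\sigma^{I(k,t+1)-1} \to used(R.k)_\sigma^{I(k,t)-1} \]
and argue as follows:
\[
\begin{aligned}
R.k_\pi^{t+1} &= R.k_\pi^t && \text{(construction)} \\
&= \begin{cases} R.k_\sigma^{I(k,t)} & vis(R.k) \\ R.k_\sigma^{I(k,t)-1} & \text{otherwise} \end{cases} && \text{(IH)} \\
&= \begin{cases} R.k_\sigma^{I(k,t+1)} & vis(R.k) \\ R.k_\sigma^{I(k,t+1)-1} & \text{otherwise} \end{cases} && \text{(lemma 80).}
\end{aligned}
\]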
Internal Event Signals
First, for the external event vectors we show the following.
Lemma 152. Assume $\hat{\Sigma}(t) < k$ where $k \le 6$.
\[ eev_\sigma^{I(k,t)} = 0^2 \]
Proof of lemma 152. Directly from the definitions we have
\[ \mu(t) \le \Sigma_{hard}(t) \le \hat{\Sigma}(t). \]
From the above and assumptions, by definition of $\mu$ we conclude
\[ live(k,t) \]
which by definition for some $t' \ge t$ implies
\[ I(k,t) = I(6,t') \land ue_6^{t'} \]
and the claim follows.
\[
\begin{aligned}
eev_\sigma^{I(k,t)} &= eev_\sigma^{I(6,t')} \\
&= 0^2 && \text{(equations 75, 76)} \quad \square
\end{aligned}
\]
The latter result allows us to show correctness for the event signals collected in stages $k \in [1:5]$.
\[ ue_k^t \to ev(k)^t = ev(k)_\sigma^{I(k,t)} \tag{115} \]
Proof of equation 115. The result can be easily derived from the statements below.
• $k = 1$. For the misalignment on fetch we show:
\[
\begin{aligned}
malf^t &\equiv ia_\pi^t[1:0] \ne 0^2 && \text{(definition)} \\
&\equiv ia_\sigma^{I(1,t)}[1:0] \ne 0^2 && \text{(IH)} \\
&\equiv malf_\sigma^{I(1,t)} && \text{(definition).}
\end{aligned}
\]
First, using the arguments above for walk $w_I$ we derive the following.
\[ used(w_I)_\sigma^{I(1,t)} \leftrightarrow /host(t) \land mca(1)^t[>mf] \tag{116} \]
Proof of equation 116.
\[
\begin{aligned}
/host(t) \land mca(1)^t[>mf] &\leftrightarrow /host(t) \land /malf(t) && \text{(definition)} \\
&\leftrightarrow /host_\sigma^{I(1,t)} \land /malf_\sigma^{I(1,t)} && \text{(lemma 149.1)} \\
&\leftrightarrow /host_\sigma^{I(1,t)} \land il_\sigma^{I(1,t)} > 2 && \text{(lemma 152)} \\
&\leftrightarrow used(w_I)_\sigma^{I(1,t)} && \text{(definition)} \quad \square
\end{aligned}
\]
Having the result above, we argue for the page fault on fetch as follows:
\[
\begin{aligned}
pff^t &= /host(t) \land mca(1)^t[>mf] \land f(mmu_I^t.wout) && \text{(definition)} \\
&= /host(t) \land mca(1)^t[>mf] \land match({trq_I}_\sigma^{I(1,t)}, mmu_I^t.wout) \land f(mmu_I^t.wout) && \text{(lemma 151.1)} \\
&= used(w_I)_\sigma^{I(1,t)} \land match({trq_I}_\sigma^{I(1,t)}, mmu_I^t.wout) \land f(mmu_I^t.wout) && \text{(equation 116)} \\
&= used(w_I)_\sigma^{I(1,t)} \land pfault({trq_I}_\sigma^{I(1,t)}, mmu_I^t.wout) && \text{(definition)} \\
&= used(w_I)_\sigma^{I(1,t)} \land pfault({trq_I}_\sigma^{I(1,t)}, {w_I}_\sigma^{I(1,t)}) && \text{(equation 73)} \\
&= pff_\sigma^{I(1,t)} && \text{(definition).}
\end{aligned}
\]
Note, the corresponding argument for the general-protection fault on fetch ($gff$) is analogous to the one presented above, and therefore is omitted.
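• $k = 4$. For the misalignment on memory operation we show:
\[
\begin{aligned}
malm^t &\equiv mmask(4)^t[2:1] \land ea(4)^t[1:0] \ne 0^2 && \text{(definition)} \\
&\equiv mmask(4)_\sigma^{I(4,t)}[2:1] \land ea(4)_\sigma^{I(4,t)}[1:0] \ne 0^2 && \text{(IH)} \\
&\equiv mop(4)_\sigma^{I(4,t)} \land (d(4)_\sigma^{I(4,t)} \nmid \langle ea(4)_\sigma^{I(4,t)} \rangle) && \text{(lemma 8)} \\
&\equiv malm_\sigma^{I(4,t)} && \text{(definition).}
\end{aligned}
\]
• $k = 5$. Analogously to equation 116, using the arguments above for walk $w_E$ we derive the following.
\[ used(w_E)_\sigma^{I(5,t)} \leftrightarrow /host(t) \land mca(5)^t[>mm] \land mop.4_\pi^t \tag{117} \]
Having the result above, we argue for the page fault on memory operation as follows:
\[
\begin{aligned}
pfm^t &= /host(t) \land mca(5)^t[>mm] \land mop.4_\pi^t \land f(mmu_E^t.wout) && \text{(definition)} \\
&= /host(t) \land mca(5)^t[>mm] \land mop.4_\pi^t \land match({trq_E}_\sigma^{I(5,t)}, mmu_E^t.wout) \land f(mmu_E^t.wout) && \text{(lemma 151.2)} \\
&= used(w_E)_\sigma^{I(5,t)} \land match({trq_E}_\sigma^{I(5,t)}, mmu_E^t.wout) \land f(mmu_E^t.wout) && \text{(equation 117)} \\
&= used(w_E)_\sigma^{I(5,t)} \land pfault({trq_E}_\sigma^{I(5,t)}, mmu_E^t.wout) && \text{(definition)} \\
&= used(w_E)_\sigma^{I(5,t)} \land pfault({trq_E}_\sigma^{I(5,t)}, {w_E}_\sigma^{I(5,t)}) && \text{(equation 74)} \\
&= pfm_\sigma^{I(5,t)} && \text{(definition).}
\end{aligned}
\]
The corresponding argument for the general-protection fault on memory operation ($gfm$) is omitted as well. $\square$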
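The two misalignment conditions used in the proof of equation 115 are simple enough to state as executable predicates. The Python sketch below is only an illustration: the function names follow the signal names, while the parameters are assumptions, with `d` standing for the access width in bytes and `ea` for the effective address read as a natural number.

```python
def malf(ia: int) -> bool:
    """Misalignment on fetch: the two low address bits are not both zero."""
    return (ia & 0b11) != 0


def malm(mop: bool, d: int, ea: int) -> bool:
    """Misalignment on memory operation: the width d does not divide ea."""
    return mop and (ea % d != 0)
```

For example, a 4-byte access to address `0x1002` is misaligned, whereas a 2-byte access to the same address is not.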
Cause Pipeline Registers
Using the results above we infer correct update for registers of the cause pipeline. Thus, in case stage $k > 1$ is updated in cycle $t$, the cause register in stage $k$ is written with the correct value.
\[
\begin{aligned}
ca.k_\pi^{t+1} &= ev(k)^t \lor ca.(k-1)_\pi^t && \text{(interconnect)} \\
&= ev(k)_\sigma^{I(k,t)} \lor ca.(k-1)_\sigma^{I(k,t)} && \text{(equation 115; IH)} \\
&= ca.k_\sigma^{I(k,t)} && \text{(definition)} \\
&= ca.k_\sigma^{I(k,t+1)-1} && \text{(lemma 80)}
\end{aligned}
\]
For the cause register in stage $k = 1$ the argument is analogous to the one presented above. Note, in case stage $k$ is not updated in cycle $t$, correctness for the corresponding cause register is shown exactly as for the typical pipeline registers (see above).
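The update rule of the cause pipeline, in which each stage ORs its own event signals into the cause bits inherited from the stage above, can be illustrated for a single instruction travelling down the pipeline. The Python sketch below uses integers as bit-vectors; the function name and the list encoding are assumptions made for illustration.

```python
from typing import List


def cause_through_pipeline(events: List[int]) -> List[int]:
    """Cause register contents after each stage for one instruction.

    events[k-1] is the event bit-vector collected in stage k.
    """
    ca = 0
    result = []
    for ev in events:
        ca |= ev          # ca.k = ev(k) OR ca.(k-1)
        result.append(ca)
    return result
```

Once a bit is set in some stage it stays set in all later stages, so the memory stage sees the union of all events collected for the instruction.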
Ghost Walk Registers
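If pipeline stage $k = 1$ is updated in cycle $t$ ($ue_1^t$), for the first ghost walk register of pipeline I we argue as follows.
\[
\begin{aligned}
w_I.1^{t+1} &= mmu_I^t.wout && \text{(interconnect)} \\
&= {w_I}_\sigma^{I(1,t)} && \text{(equation 73)} \\
&= {w_I}_\sigma^{I(1,t+1)-1} && \text{(definition)}
\end{aligned}
\]
For the first (and the only) ghost walk register of pipeline E we argue analogously.
\[
\begin{aligned}
w_E.5^{t+1} &= mmu_E^t.wout && \text{(interconnect)} \\
&= {w_E}_\sigma^{I(5,t)} && \text{(equation 74)} \\
&= {w_E}_\sigma^{I(5,t+1)-1} && \text{(lemma 80)}
\end{aligned}
\]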
[Figure 42, panels (a), (b), and (c): pipeline stages around $\hat{\Sigma}(t)$ in cycle $t$ (a) and around $\hat{\Sigma}(t+1)$ in cycle $t+1$ (b, c), annotated with the classes of registers considered in each stage.]
Fig. 42: Pipeline stages (not shaded) considered in the induction step (from t to t+1)
In case pipeline stage $k > 1$ is updated in cycle $t$ ($ue_k^t$), the ghost walk register in stage $k$ of pipeline I is of course written with the value of the corresponding register in the stage above.
\[
\begin{aligned}
w_I.k^{t+1} &= w_I.(k-1)^t && \text{(interconnect)} \\
&= {w_I}_\sigma^{I(k-1,t)-1} && \text{(induction hypothesis)} \\
&= {w_I}_\sigma^{I(k,t+1)-1} && \text{(definition)}
\end{aligned}
\]
Execution of eret
In case a legal eret instruction reaches the memory stage and the stage is updated, all stages above the memory stage become empty in cycle $t+1$ (part (2) of lemma 77).
\[ ue_6^t \land iret_5^t \to \forall j \in [1:5] : /rfull_j^{t+1} \]
The special purpose registers
\[ sprx \in \{sr, mode, nmode\} \]
are restored from their “exception” counterparts:
\[ sprx_\pi^{t+1} = esprx_\pi^t = esprx_\sigma^{I(6,t)} = sprx_\sigma^{I(6,t)+1} = sprx_\sigma^{I(6,t+1)}. \]
The remaining SPR registers do not change:
\[ spr_\pi^{t+1}(x) = spr_\pi^t(x) = spr_\sigma^{I(6,t)}(x) = spr_\sigma^{I(6,t)+1}(x) = spr_\sigma^{I(6,t+1)}(x). \]
None of the invisible registers in stage 6 are used except for the control bits, which are always used. For the control bits we derive:
\[ R.6_\pi^{t+1} = R.5_\pi^t = R_\sigma^{I(6,t)} = R_\sigma^{I(6,t+1)-1}. \]
9.6.4 Speculative Execution
In this section we consider two cases of speculative execution, i.e., an execution with a non-zero value of $\hat{\Sigma}$. First we consider the case in which the value of $\hat{\Sigma}$ increases, and then the case in which the non-zero value of $\hat{\Sigma}$ stays unchanged. In both cases, using lemma 133 we conclude the following.
\[ k > \hat{\Sigma}(t) \to /rbr_k^t \tag{118} \]
In terms of correctness arguments nothing changes for stages
\[ k > \hat{\Sigma}(t)+1 \]
in both cases, i.e., the same arguments as above apply for the speculative execution. The latter stages operate only with the visible registers from stages below $\hat{\Sigma}(t)+1$, and invisible registers from stages below $\hat{\Sigma}(t)$. As depicted in Fig. 42(a), in cycle $t$ registers in these stages are simulated correctly. Moreover, since there is nothing to show for stages $k < \hat{\Sigma}(t+1)$, in the induction step below we argue only about stages
\[ k \in [\hat{\Sigma}(t+1) : \hat{\Sigma}(t)+1]. \]
In order to justify the case split at the beginning of Sect. 9.6 (page 264), at the end of this section we argue that in the memory stage the value of $\hat{\Sigma}$ eventually resets to zero (equation 124).
Σˆ Increases
In case the value of $\hat{\Sigma}$ increases in cycle $t$
\[ \hat{\Sigma}(t+1) = \hat{\Sigma}(t)+1 \]
we have to argue only about the registers in stage
\[ k \in [\hat{\Sigma}(t+1) : \hat{\Sigma}(t)+1] = \{\hat{\Sigma}(t+1)\}. \]
According to lemma 134, the stage below $\hat{\Sigma}(t)$ is updated in cycle $t$.
\[ ue_{\hat{\Sigma}(t)+1}^t \tag{119} \]
As depicted in Fig. 42(b), in the induction step we must consider i) the local invisible registers in stage $k$ and ii) the non-local invisible registers in stage $k$, but only in case
\[ \Sigma_{soft}(t+1) < \Sigma_{hard}(t+1). \]
For the first part we require the correct simulation of
• the visible registers in stage $k$ and
• the local invisible registers in stage $k-1$
in cycle $t$, which we have according to Fig. 42(a). For the second part in addition to the above we require the correct simulation of
• the non-local invisible registers in stage $k-1$
in cycle $t$. According to Fig. 42(a) the latter registers are simulated correctly in cycle $t$ in case the following holds.
\[ \Sigma_{soft}(t) < \Sigma_{hard}(t) \tag{120} \]
Proof of equation 120. By contradiction; assume
\[ \Sigma_{soft}(t) \ge \Sigma_{hard}(t) \]
and
\[ \Sigma_{soft}(t+1) < \Sigma_{hard}(t+1). \]
The contradiction easily follows; using equations 119 and 88 we derive:
\[
\begin{aligned}
\Sigma_{soft}(t+1) &= \Sigma_{soft}(t)+1 && \text{(lemma 121.2)} \\
&= \hat{\Sigma}(t+1) && \text{(assumption)} \\
&\ge \Sigma_{hard}(t+1) && \text{(definition).} \quad \square
\end{aligned}
\]
Therefore, we argue for the invisible registers in stage $k$ as for the typical pipeline registers in Sect. 9.6.3 to show
\[ used(R.k)_\sigma^{I(k,t+1)-1} \to R.k^{t+1} = R_\sigma^{I(k,t+1)-1}. \]
Σˆ Remains Unchanged
In case the value of $\hat{\Sigma}$ does not change in cycle $t$
\[ \hat{\Sigma}(t+1) = \hat{\Sigma}(t) \]
we have to argue about the registers in stages
\[ k \in [\hat{\Sigma}(t+1) : \hat{\Sigma}(t)+1] = [\hat{\Sigma}(t+1) : \hat{\Sigma}(t+1)+1]. \]
According to lemma 134, the stage below $\hat{\Sigma}(t)$ is not updated in cycle $t$.
\[ /ue_{\hat{\Sigma}(t)+1}^t \tag{121} \]
Moreover, since by definition of $\hat{\Sigma}$ we have
\[ rfull_{\hat{\Sigma}(t)}^{t+1} \]
by part (3) of lemma 75 we conclude that stage $\hat{\Sigma}(t)$ is not updated either in cycle $t$.
\[ /ue_{\hat{\Sigma}(t)}^t \tag{122} \]
As depicted in Fig. 42(c), in the induction step we must consider i) all registers in the stage below $\hat{\Sigma}(t)$, ii) the local invisible registers in stage $\hat{\Sigma}(t)$, and iii) the non-local invisible registers in stage $\hat{\Sigma}(t)$, but only if
\[ \Sigma_{soft}(t+1) < \Sigma_{hard}(t+1). \]
For the first part, having equation 122, we require the correct simulation of
• all registers in stage $\hat{\Sigma}(t)+1$
in cycle $t$, which we have according to Fig. 42(a). For the second part, having equation 121, we require the correct simulation of
• the local invisible registers in stage $\hat{\Sigma}(t)$
in cycle $t$, which we have, again according to Fig. 42(a). Finally, for the third part, again using equation 121, we require the correct simulation of
• the non-local invisible registers in stage $\hat{\Sigma}(t)$
in cycle $t$. Analogously to the arguments above (Fig. 42(a)), the latter registers are simulated correctly in cycle $t$ if the following holds.
\[ \Sigma_{soft}(t) < \Sigma_{hard}(t) \tag{123} \]
Proof of equation 123. By contradiction; assume
\[ \Sigma_{soft}(t) \ge \Sigma_{hard}(t) \]
and
\[ \Sigma_{soft}(t+1) < \Sigma_{hard}(t+1). \]
The contradiction follows; using equations 118 and 121 we derive:
\[
\begin{aligned}
\Sigma_{soft}(t+1) &= \Sigma_{soft}(t) && \text{(lemma 121.1)} \\
&= \hat{\Sigma}(t+1) && \text{(assumption)} \\
&\ge \Sigma_{hard}(t+1) && \text{(definition).} \quad \square
\end{aligned}
\]
Therefore, for visible registers in stage Σˆ(t) there is nothing to show. For the invisible registers
in stage Σˆ(t) directly from the induction hypothesis we obtain
used(R.Σˆ(t))I(Σˆ(t),t+1)−1σ → R.Σˆ(t)t+1 = RI(Σˆ(t),t+1)−1σ .
For stage Σˆ(t)+1 we consider the following three cases.
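• the register stage is stalled.
\[ stall^{t}_{\hat\Sigma(t)+2} \]
We assume the stage has a real full bit. Then the real full bit is preserved.
\[ rfull^{t+1}_{\hat\Sigma(t)+1} \]
Both visible and invisible registers in stage k = Σˆ(t)+1 are not updated; their simulation
holds by induction:
\[ used(R.k)^{I(k,t+1)-1}_{\sigma} \rightarrow used(R.k)^{I(k,t)-1}_{\sigma} \]
and
\begin{align*}
R.k^{t+1} &= R.k^{t} && \text{(interconnect)} \\
&= \begin{cases} R^{I(k,t)}_{\sigma} & vis(R) \\ R^{I(k,t)-1}_{\sigma} & \text{otherwise} \end{cases} && \text{(IH)} \\
&= \begin{cases} R^{I(k,t+1)}_{\sigma} & vis(R) \\ R^{I(k,t+1)-1}_{\sigma} & \text{otherwise} \end{cases} && \text{(lemma 80).}
\end{align*}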
• the register stage is emptied.
\[ /ue^{t}_{\hat\Sigma(t)+1} \wedge ue^{t}_{\hat\Sigma(t)+2} \]
The visible registers are not updated; the simulation holds by induction as above. For the
real full bit of stage Σˆ(t)+1 in cycle t+1 using part (6) of lemma 75 we conclude:
\[ /rfull^{t+1}_{\hat\Sigma(t)+1}. \]
Thus, there is nothing to show for the invisible registers.
• the register stage remains empty.
\[ /rfull^{t}_{\hat\Sigma(t)+1} \wedge /rfull^{t+1}_{\hat\Sigma(t)+1} \]
Again, the visible registers are not updated and for the invisible registers there is nothing
to show.
Repeated Rollback
Below we consider a special case, in which the rollback of stage 2 is performed without the
reset of Σˆ , i.e., without the update of the stage below Σˆ . In this case we assume pcres(t) and
\[ \vec\Sigma(t) \in \{(0,2),(1,2)\}. \]
From the assumption and lemma 133 we have
\[ rbr^{t}_{2} \wedge /rbr^{t}_{5} \rightarrow R(t) = 2. \]
The scheduling functions for stages 1 and 2 are rolled back:
\[ I(1,t+1) = I(2,t+1) = I(R(t),t) = I(2,t) = I(3,t)+1. \]
Note, the value of the speculation stage $\vec\Sigma$ becomes
\[ \vec\Sigma(t+1) = (0,2) \]
and the rollback is repeated in the next cycle, until the stage below Σˆ is eventually updated.
Σˆ Becomes Zero
Since after reset all pipeline stages are empty, by definition we have
Σˆ(0) = 0.
According to lemma 134, the value of Σˆ increases only in cycles t in which the stage directly
below Σˆ(t) is updated.
\[ ue^{t}_{\hat\Sigma(t)+1} \]
Below we show that the value of Σˆ drops to zero after it reaches 5.
\[ \hat\Sigma(t) = 5 \wedge ue^{t}_{6} \rightarrow \hat\Sigma(t+1) = 0 \tag{124} \]
Proof of equation 124. Using equation 88, from the definition of Σˆ we derive
\[ \hat\Sigma(t) = \Sigma_{hard}(t). \]
From lemma 130 we have that the rollback request signal in stage 5 is active in cycle t
\[ rbr^{t}_{5} = 1 \]
and the claim follows by lemma 134. □
9.6.5 Exception Return
In this section we cover the details of restoring the program counters from the corresponding
exception registers in the pipelined processor. In this case, according to lemma 134, we assume
\[ \vec\Sigma(t) \in \{(0,2),(1,2)\} \]
and the active rollback request signal in stage Σˆ(t) in cycle t. Using the definition of the
misspec signals we conclude:
\begin{align*}
rbr^{t}_{\hat\Sigma(t)} &\rightarrow rbr^{t}_{2} \wedge /rbr^{t}_{5} && \text{(lemma 133)} \\
&\rightarrow pcres(t) && \text{(interconnect)} \\
&\rightarrow rfull^{t}_{2} \wedge iret^{t}_{2} && \text{(definition).}
\end{align*}
Moreover, we assume that the stage below Σˆ(t) is updated in cycle t ($ue^{t}_{\hat\Sigma(t)+1}$), since otherwise
the pcres signal stays active and the rollback is repeated in cycle t+1 (see Sect. 9.6.4, Repeated
Rollback). Finally, in cycle t+1 we have
\[ \hat\Sigma(t+1) = 0. \]
For the forwarded exception PCs (see Sect. 8.1.6), we argue about the forwarding circuits in
the usual way as follows:
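\begin{align*}
expcF^{t} &= \begin{cases}
C.4^{t}_{\pi}.in & topexpc[3]^{t} \\
C.k^{t}_{\pi} & topexpc[k]^{t} \wedge k > 3 \\
expc^{t}_{\pi} & \text{otherwise}
\end{cases} && \text{(construction)} \\
&= \begin{cases}
C^{I(k,t)}_{\sigma} & \min\{\, j \mid movg2s^{I(j,t)}_{\sigma} \wedge (xad^{I(j,t)}_{\sigma} = spr[expc]) \,\} = k \in [4:6] \\
expc^{I(6,t)}_{\sigma} & \text{otherwise}
\end{cases} && \text{(IH)} \\
&= expc^{I(3,t)}_{\sigma} && \text{(specification).}
\end{align*}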
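The case split of the forwarding circuit can be illustrated by a small Python sketch. It is not the hardware construction of the thesis; the function name `forward_expc` and the list encoding of the pipeline stages are assumptions made for illustration only. The sketch returns the value of the topmost hit (the youngest pending write to the exception PC) and falls back to the register content if no stage hits.

```python
from typing import List, Tuple


def forward_expc(expc_reg: int, stages: List[Tuple[bool, int]]) -> int:
    """Return the forwarded exception PC.

    `stages` lists the stages below the reader from top to bottom;
    each entry is (hit, value), where `hit` says that the instruction in
    that stage is a move writing the exception PC (hypothetical encoding).
    """
    for hit, value in stages:
        if hit:
            # topmost hit wins: it belongs to the youngest pending write
            return value
    # no pending write in the pipeline: use the register itself
    return expc_reg
```

For example, `forward_expc(0x100, [(False, 1), (True, 2), (True, 3)])` selects the value 2 of the topmost hit, whereas with no hit at all the register content `0x100` is returned.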
Using the arguments above, for the visible registers in stage 3 (program counters) we easily
derive:
\[ xpc^{t+1}_{\pi} = expcF^{t} = expc^{I(3,t)}_{\sigma} = xpc^{I(3,t)+1}_{\sigma} = xpc^{I(3,t+1)}_{\sigma}. \]
For the remaining pipeline registers the arguments are analogous to those we already presented
above in this section, and therefore are omitted. In particular, for stage 3 we obtain
\[ rfull^{t+1}_{3} \wedge iret^{t+1}_{3} \]
and therefore for stages k < 3 there is nothing to show, since by part (1) of lemma 77 we have
\[ \forall j \in [1:2] : /rfull^{t+1}_{j}. \]
In cycles t ′ ≥ t pipeline stages j ≤ k behind the legal eret instruction are drained; execution
of the legal eret instruction finishes in the memory stage with update of the special purpose
registers (see Sect. 9.6.3, Execution of eret).
9.6.6 Interrupt
Finally, we consider jumps to the interrupt service routine. In this case from the assumptions
we conclude jisr(t) and
Σˆ(t) = 5.
From the definition of Σˆ using lemma 79 we derive:
\[ rfull^{t}_{5} \wedge jisr^{I(5,t)-1}_{\sigma} \rightarrow jisr^{I(6,t)}_{\sigma}. \]
Since after the jisr in cycle t all real full bits except the last one (k = 7) are cleared, there is
nothing to show for the invisible registers in stages (k < 7) in cycle t + 1. For the program
counters we argue:
(pc,d pc,dd pc)t+1pi = (84,44,04) (interconnect)
= (pc,d pc,dd pc)I(6,t)+1σ (specification)
= (pc,d pc,dd pc)I(5,t)σ (lemma 79)
= (pc,d pc,dd pc)I(3,t+1)σ (definition).
For the exception cause register we have
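\begin{align*}
eca^{t+1}_{\pi} &= 0^{21} \circ f1(mca(6)^{t}) && \text{(interconnect)} \\
&= 0^{21} \circ f1(mca(6)^{I(6,t)}_{\sigma}) && \text{(lemma 125.1)} \\
&= eca^{I(6,t)+1}_{\sigma} && \text{(specification)} \\
&= eca^{I(6,t+1)}_{\sigma} && \text{(lemma 80),}
\end{align*}
whereas for the exception data register we have
\begin{align*}
edata^{t+1}_{\pi} &= ea.5^{t}_{\pi} \wedge (mca(6)^{t}[4:0] = 0^{5}) && \text{(interconnect)} \\
&= ea^{I(6,t)}_{\sigma} \wedge (mca(6)^{I(6,t)}_{\sigma}[4:0] = 0^{5}) && \text{(lemma 125.1)} \\
&= ea^{I(6,t)}_{\sigma} \wedge (il^{I(6,t)}_{\sigma} > 4) && \text{(definition)} \\
&= edata^{I(6,t)+1}_{\sigma} && \text{(specification)} \\
&= edata^{I(6,t+1)}_{\sigma} && \text{(lemma 80).}
\end{align*}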
For exception register esprx of SPR register
sprx ∈ {sr,mode,nmode}
we analogously obtain from the induction hypothesis:
\[ esprx^{t+1}_{\pi} = sprx^{t}_{\pi} = sprx^{I(6,t)}_{\sigma} = esprx^{I(6,t)+1}_{\sigma} = esprx^{I(6,t+1)}_{\sigma}. \]
For the nested mode register we argue as follows.
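\begin{align*}
nmode^{t+1}_{\pi}[0] &= \begin{cases} 0 & user(q,t) \vee /icpt(q,t) \\ nmode^{t}_{\pi}[0] & \text{otherwise} \end{cases} && \text{(interconnect)} \\
&= \begin{cases} 0 & user^{I(6,t)}_{\sigma} \vee /icpt^{I(6,t)}_{\sigma} \\ nmode^{I(6,t)}_{\sigma}[0] & \text{otherwise} \end{cases} && \text{(IH)} \\
&= nmode^{I(6,t)+1}_{\sigma}[0] && \text{(specification)} \\
&= nmode^{I(6,t+1)}_{\sigma}[0] && \text{(lemma 80)}
\end{align*}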
The corresponding argument for the mode register is analogous to the one above, and therefore
is omitted. For the remaining SPR registers we show:
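\begin{align*}
spr^{t+1}_{\pi}(x) &= \begin{cases} 0^{32} & x = 0^{5} \\ spr^{t}_{\pi}(x) & \text{otherwise} \end{cases} && \text{(interconnect)} \\
&= \begin{cases} 0^{32} & x = 0^{5} \\ spr^{I(6,t)}_{\sigma}(x) & \text{otherwise} \end{cases} && \text{(IH)} \\
&= spr^{I(6,t)+1}_{\sigma}(x) && \text{(specification)} \\
&= spr^{I(6,t+1)}_{\sigma}(x) && \text{(lemma 80).}
\end{align*}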
Note, moves to the SPR have no effect on interrupts, since move instructions cannot be inter-
rupted by continue type interrupts. According to Table 9, the only instructions which can be
interrupted by continue type interrupts are the arithmetic operations (addi, add, and sub) and
the system call.
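The register updates on a jump to the interrupt service routine, as derived in this section, can be summarized in a small Python sketch. It is only an illustration under stated assumptions: the dictionary encoding of the SPR file is hypothetical, f1 is assumed to isolate the lowest set bit of the masked cause ("find first one"), and the SPR with index 0^5 is assumed to be the status register sr. The update of the mode register, whose argument is omitted in the text, is left out.

```python
def f1(x: int) -> int:
    """Assumed meaning of f1: keep only the lowest set bit of x."""
    return x & -x


def jisr_update(spr: dict, mca: int, ea: int, user: bool, icpt: bool) -> dict:
    """Return the SPR contents after a jump to the interrupt service routine."""
    new = dict(spr)
    # exception registers receive the old values of sr, mode and nmode
    for x in ("sr", "mode", "nmode"):
        new["e" + x] = spr[x]
    # eca = 0^21 concatenated with f1(mca): only 11 cause bits are significant
    new["eca"] = f1(mca) & 0x7FF
    # edata = ea if no cause bit with index 0..4 is set (il > 4), else zero
    new["edata"] = ea if (mca & 0x1F) == 0 else 0
    # the SPR with index 0^5 is cleared (assumed here to be sr)
    new["sr"] = 0
    # bit 0 of nmode is cleared for user mode or if there is no intercept
    if user or not icpt:
        new["nmode"] = spr["nmode"] & ~1
    return new
```

Note that the exception registers are written with the old register values before sr and nmode are modified, which mirrors the order of the equations above.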
9.7 Correctness for TLBs
Reusing the results of Chap. 5 we argue that the translations in $mmu_Y$ of processor q are contained
in the TLB component of the general computation $\tilde c^{q}_{Y}$.
\[ \exists \tilde c^{q,0}_{Y}\ \forall t : sim_{tlb}(mmu^{q,t}_{Y}, \tilde c^{q,t}_{Y}) \tag{125} \]
This is lemma 36 from Chap. 5. Recall, the general stepping function (s˜Y ) for c˜Y was defined
in the latter chapter. We adjust the definition of s˜Y in the obvious way to fit the hardware
descriptions of the multi-core machine.
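\begin{align*}
tadd^{q,t}_{YX} &\rightarrow \tilde s^{q}_{Y}(t) = \begin{cases}
(winit,\ upa_{X}(mmu^{q,t}_{Y}),\ mmu^{q,t}_{Y}.pto_{X}) & winit_{X}(mmu^{q,t}_{Y}) \\
(wext,\ mmu^{q,t}_{Y}.w_{X},\ pte_{X}(mmu^{q,t}_{Y})) & \text{otherwise}
\end{cases} \\
tdrop^{q,t}_{Y} &\rightarrow \tilde s^{q}_{Y}(t) = \begin{cases}
(drop,\ mmu^{q,t}_{Y}.inva) & mmu^{q,t}_{Y}.invlpg \\
(drop,\ mmu^{q,t}_{Y}.inva.vm) & mmu^{q,t}_{Y}.vmflush \\
(drop,\ all) & \text{otherwise}
\end{cases}
\end{align*}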
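The case distinction of the adjusted stepping function can be sketched in Python as follows. This is a minimal illustration, not the formal definition: the class `MmuState` and its field names are hypothetical, the index X of the walk phase is dropped, and the component inva.vm is modeled as a separate field `inva_vm`.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class MmuState:
    """Hypothetical snapshot of the MMU signals used by the stepping function."""
    winit: bool = False    # a walk is initialized in this cycle
    upa: int = 0           # untranslated page address of the new walk
    pto: int = 0           # page table origin
    w: Any = None          # walk that is being extended
    pte: int = 0           # page table entry used for the extension
    invlpg: bool = False   # invalidation of a single page
    vmflush: bool = False  # invalidation of all translations of one VM
    inva: int = 0          # invalidation address
    inva_vm: int = 0       # VM identifier part of the invalidation address


def step_add(m: MmuState) -> tuple:
    """Step of the general computation when a translation is added."""
    if m.winit:
        return ("winit", m.upa, m.pto)
    return ("wext", m.w, m.pte)


def step_drop(m: MmuState) -> tuple:
    """Step of the general computation when translations are dropped."""
    if m.invlpg:
        return ("drop", m.inva)
    if m.vmflush:
        return ("drop", m.inva_vm)
    return ("drop", "all")
```

As in the definition above, a drop that is neither a page invalidation nor a VM flush removes all translations.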
Analogous to Chap. 7, in order to incorporate the results of Chap. 5 (equation 125), it suffices
to show the following.
\[ \forall q : \bigwedge_{Y} sim^{ISA}_{tlb}(\tilde c^{q,t+1}_{Y},\ mc^{NS(t+1)}.p(q)) \]
We split cases on whether translations are added (see Sect. 9.7.1) or dropped (see
Sect. 9.7.2) by the MMUs performing steps in cycle t (if any). Note, in the absence of queries
to mmuqY in cycle t, the simulation of TLB between the general and multi-core computation is
maintained trivially on processor q in cycle t+1.
9.7.1 Adding Translations
In case step n is generated by mmuY on processor q in cycle t
\[ s(n).(g,u) = (mmu_{Y}, q) \wedge n \in CS(t) \]
we show that the simulation of TLB between the general and multi-core computation is maintained
for processor q in cycle t+1.
\[ sim^{ISA}_{tlb}(\tilde c^{q,t+1}_{Y},\ mc^{NS(t+1)}.p(q)) \]
An analogous argument was made in Chap. 7. There we proved a corresponding result for the
sequential machine (see Sect. 7.4.1). The latter proof can be used as it is for the multi-core
machine; it suffices to reestablish the following arguments:
• for the page table origins and ASIDs
\[ winit^{q,t}_{YG} \rightarrow (pto,vmid)(q,t) = (pto,vmid)(mc^{n}.p(q)) \]
\[ winit^{q,t}_{YU} \rightarrow (pto,npto,asid)(q,t) = (pto,npto,asid)(mc^{n}.p(q)) \]
• for the page table entry address
\[ q \in TA_{Y}(t) \rightarrow ptea(mmu^{q,t}_{Y}) = ptea(s(n).o) \]
• for the page table entry
\[ q \in TA_{Y}(t) \rightarrow pte(mmu^{q,t}_{Y}) = pte(mc^{n}.p(q), s(n).o) \]
Page Table Origins
First, we argue that the special purpose registers which are used for the address translation in
cycle t do not change values in cycle t. These registers can be written only by the movg2s
instructions, which are illegal at the corresponding privilege levels.
\[ guest(q,t) \rightarrow mc^{n}.p(q).(pto,vmid) = mc^{NS(t)}.p(q).(pto,vmid) \tag{126} \]
\[ user(q,t) \rightarrow mc^{n}.p(q).(pto,npto,asid) = mc^{NS(t)}.p(q).(pto,npto,asid) \tag{127} \]
Having that we obtain the first part directly from the induction hypothesis:
\begin{align*}
(pto,vmid)(q,t) &= (pto,vmid)^{q,I(q,6,t)}_{\sigma} && \text{(IH)} \\
&= (pto,vmid)^{q,ic(q,NS(t))}_{\sigma} && \text{(lemma 116)} \\
&= (pto,vmid)(mc^{NS(t)}.p(q)) && \text{(lemma 115)} \\
&= (pto,vmid)(mc^{n}.p(q)) && \text{(equations 126, 127).}
\end{align*}
We skip the second part; the omitted argument is literally the same as above.
Page Table Entry Address
Recall, correctness of the page table entry address read by mmuY in cycle t was established
previously in Sect. 9.5.1 (see lemma 142).
Page Table Entry
For mmuY , using correctness of the translation cache output (established in Sect. 9.5.3), we
obtain:
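\begin{align*}
pte(mmu^{q,t}_{Y}) &= \begin{cases}
pdout(4q + 2\cdot 1\{Y=E\})^{t}_{H} & ptea(mmu^{q,t}_{Y})[2] \\
pdout(4q + 2\cdot 1\{Y=E\})^{t}_{L} & \text{otherwise}
\end{cases} && \text{(interconnect)} \\
&= \begin{cases}
tmout(n)_{H} & ptea(s(n).o)[2] \\
tmout(n)_{L} & \text{otherwise}
\end{cases} && \text{(lemma 148)} \\
&= pte(mc^{n}, s(n).o) && \text{(lemma 14).}
\end{align*}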
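The selection of the page table entry from the cache output can be illustrated by the following Python sketch. It assumes that the data output is a 64-bit line whose upper and lower 32-bit halves correspond to the indices H and L, and it reads the port index as 4q plus 2 for the MMU of pipeline E; the function names are hypothetical.

```python
def select_pte(line64: int, ptea: int) -> int:
    """Pick the 32-bit page table entry out of a 64-bit cache line.

    Bit 2 of the page table entry address decides between the upper
    and the lower word of the line.
    """
    high = (line64 >> 32) & 0xFFFFFFFF
    low = line64 & 0xFFFFFFFF
    return high if (ptea >> 2) & 1 else low


def mmu_port(q: int, y: str) -> int:
    """Cache port of MMU `y` ('I' or 'E') on processor q (assumed reading)."""
    return 4 * q + (2 if y == "E" else 0)
```

For instance, for the line `0x1111111122222222` an address with bit 2 set (such as `0x1004`) selects the upper word `0x11111111`, and an address with bit 2 cleared selects the lower word.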
9.7.2 Dropping Translations
For step n in which processor q executes an invalidating instruction in cycle t
\[ s(n).(t,u) = (core,q) \wedge (invlpg_{\sigma} \vee flusht_{\sigma})^{q,ic(q,n)} \wedge n \in CS(t) \wedge exec(q,t) \]
we show (as above) that the simulation of TLB between the general and multi-core computation
is maintained for processor q in cycle t+1.
\[ \bigwedge_{Y} sim^{ISA}_{tlb}(\tilde c^{q,t+1}_{Y},\ mc^{NS(t+1)}.p(q)) \]
Again, a similar result was proven for the sequential machine in Chap. 7 and, again, we can
reuse the proof. For the multi-core machine we need to update the following arguments:
• for the MMU control signals
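\[ mmu^{q,t}_{Y}.invlpg \leftrightarrow invlpg^{q,i}_{\sigma} \wedge user^{q,i}_{\sigma} \]
\[ mmu^{q,t}_{Y}.vmflush \leftrightarrow flusht^{q,i}_{\sigma} \wedge guest^{q,i}_{\sigma} \]
\[ mmu^{q,t}_{Y}.flush \leftrightarrow flusht^{q,i}_{\sigma} \wedge host^{q,i}_{\sigma} \]
• for the invalidation address
\[ mmu^{q,t}_{Y}.inva = inva^{q,i}_{\sigma} \]
where for convenience we abbreviate the instruction count on processor q in cycle t:
\[ i = ic(q,n) = ic(q,NS(t)) = I(q,6,t). \]
We argue only about one of the three invalidation signals above, namely for vmflush.
\begin{align*}
mmu^{q,t}_{Y}.vmflush &= flusht.5^{q,t}_{\pi} \wedge guest(q,t) && \text{(interconnect)} \\
&= flusht^{q,i}_{\sigma} \wedge guest^{q,i}_{\sigma} && \text{(IH)}
\end{align*}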
Arguments for the remaining control signals are identical. Finally, for the invalidation address
we argue in the same spirit and obtain:
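\begin{align*}
mmu^{q,t}_{Y}.inva &= \begin{cases}
vmid(q,t) \circ C.5^{q,t}_{\pi}[27:0] & guest(q,t) \\
C.5^{q,t}_{\pi} & \text{otherwise}
\end{cases} && \text{(interconnect)} \\
&= \begin{cases}
vmid^{q,i}_{\sigma} \circ C^{q,i}_{\sigma}[27:0] & guest^{q,i}_{\sigma} \\
C^{q,i}_{\sigma} & \text{otherwise}
\end{cases} && \text{(IH)} \\
&= inva^{q,i}_{\sigma} && \text{(definition).}
\end{align*}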
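The computation of the invalidation address can be sketched in Python as follows. The sketch assumes a 32-bit invalidation address, so that the VM identifier occupies the upper four bits when it is concatenated with bits [27:0] of the operand C; the function name and the bit widths are assumptions made for illustration.

```python
def invalidation_address(c: int, vmid: int, guest: bool) -> int:
    """Compute the 32-bit invalidation address passed to the MMU.

    At the level of guest the upper bits of the operand are replaced by
    the VM identifier (assumed to be 4 bits wide); otherwise the operand
    is used unchanged.
    """
    if guest:
        return ((vmid & 0xF) << 28) | (c & 0x0FFFFFFF)
    return c & 0xFFFFFFFF
```

Replacing the upper bits by the VM identifier means that a guest can only name addresses tagged with its own identifier, which matches the role of the concatenation in the equation above.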
9.8 Maintaining Invariants
In this section we assume that the pipeline of processor q executes at the privilege level of user
or guest, i.e., the address translation is used. For convenience, we repeat the statements of
invariant 10
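\[ \hat\Sigma(q,t+1) \le k \wedge rfull^{q,t+1}_{k} \rightarrow w_{I}.k^{q,t+1} \in \begin{cases}
mc^{NS(t+1)}.p(q).tlb^{\circ} & user(q,t+1) \\
mc^{NS(t+1)}.p(q).tlb & guest(q,t+1)
\end{cases} \]
\[ \vec\Sigma(q,t+1) \le (4,5) \wedge rfull^{q,t+1}_{5} \rightarrow w_{E}.5^{q,t+1} \in \begin{cases}
mc^{NS(t+1)}.p(q).tlb^{\circ} & user(q,t+1) \\
mc^{NS(t+1)}.p(q).tlb & guest(q,t+1)
\end{cases} \]
and invariant 11, formulated for cycle t+1 in which the address translation is used.
\[ \hat\Sigma(q,t+1) \le k \wedge I(q,k,t+1) = i \wedge rfull^{q,t+1}_{k} \rightarrow match(trq^{q,i-1}_{I\,\sigma}, w_{I}.k^{q,t+1}) \]
\[ \vec\Sigma(q,t+1) \le (4,5) \wedge I(q,5,t+1) = i \wedge rfull^{q,t+1}_{5} \rightarrow match(trq^{q,i-1}_{E\,\sigma}, w_{E}.5^{q,t+1}) \]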
Recall, in Sect. 9.2.4 we showed that both invariants (10 and 11) hold after the reset, in cycle
t = 0.
9.8.1 Translations in Use and Invalidation of TLB
To show that invariants 10 and 11 are maintained in cycle t+1, we split cases on stage k of the
ghost walk pipeline of processor q and cover the input stages first. Thus, for $mmu^{q}_{I}$ we consider
stage k = 1 of processor q in cycles t in which the stage is updated ($ue^{q,t}_{1}$). We obtain:
\begin{align*}
w_{I}.1^{q,t+1} &= mmu^{q,t}_{I}.wout && \text{(interconnect)} \\
&\in \begin{cases}
tlb_{U}(mmu^{q,t}_{I}) & user(q,t) \\
tlb_{G}(mmu^{q,t}_{I}) & guest(q,t)
\end{cases} && \text{(interconnect; spec.)} \\
&\subseteq \begin{cases}
mc^{NS(t)}.p(q).tlb^{\circ} & user(q,t) \\
mc^{NS(t)}.p(q).tlb & guest(q,t)
\end{cases} && \text{(IH).}
\end{align*}
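For $mmu_{E}$ we consider stage k = 5 with the active update enable signal ($ue^{t}_{5}$). Thus, given that
a memory operation is executed ($mop.4^{q,t}_{\pi}$), we analogously obtain:
\begin{align*}
w_{E}.5^{q,t+1} &= mmu^{q,t}_{E}.wout && \text{(interconnect)} \\
&\in \begin{cases}
tlb_{U}(mmu^{q,t}_{E}) & user(q,t) \\
tlb_{G}(mmu^{q,t}_{E}) & guest(q,t)
\end{cases} && \text{(interconnect; spec.)} \\
&\subseteq \begin{cases}
mc^{NS(t)}.p(q).tlb^{\circ} & user(q,t) \\
mc^{NS(t)}.p(q).tlb & guest(q,t)
\end{cases} && \text{(IH).}
\end{align*}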
In both cases we argue as follows. Since the corresponding stage k is updated, there is no
invalidating instruction in the memory stage. Using definitions from Sect. 8.1.5 we derive:
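\begin{align*}
ue^{t}_{k} \rightarrow /haz^{t}_{k} &\rightarrow /inval(mmu^{q,t}_{Y}) && \text{(interconnect, p. 184)} \\
&\rightarrow /(rfull^{q,t}_{5} \wedge (invlpg.5^{q,t} \vee flusht.5^{q,t})) && \text{(interconnect, p. 183)} \\
&\rightarrow /(invlpg_{\sigma} \vee flusht_{\sigma})^{q,I(q,6,t)} && \text{(IH; lemma 79)} \\
&\rightarrow /(invlpg_{\sigma} \vee flusht_{\sigma})^{q,ic(q,NS(t))} && \text{(lemma 116).}
\end{align*}
The latter obviously yields:
\[ mc^{NS(t)}.p(q).tlb \subseteq mc^{NS(t+1)}.p(q).tlb. \]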
In cycle t +1 for the ghost walk registers in stage k > 1 of pipeline I on processor q we argue
directly from the induction hypothesis:
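\begin{align*}
w_{I}.k^{q,t+1} &= \begin{cases}
w_{I}.(k-1)^{q,t} & ue^{q,t}_{k} \\
w_{I}.k^{q,t} & \text{otherwise}
\end{cases} && \text{(interconnect)} \\
&\in \begin{cases}
mc^{NS(t)}.p(q).tlb^{\circ} & user(q,t) \\
mc^{NS(t)}.p(q).tlb & guest(q,t)
\end{cases} && \text{(IH).}
\end{align*}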
The corresponding argument for the ghost walk registers of pipelines E is analogous, and
therefore is omitted.
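The update rule for the ghost walk registers can be sketched in Python as follows. It is only a model of the interconnect equation above; the list encoding of the pipeline (index 0 unused) and the function name are assumptions made for illustration.

```python
from typing import Any, List


def advance_ghost_walks(w: List[Any], ue: List[bool], w_in: Any) -> List[Any]:
    """Compute the ghost walk registers of the next cycle.

    w[k] is the walk in stage k (index 0 is unused), ue[k] is the update
    enable of stage k, and w_in is the walk delivered by the MMU to the
    input stage 1.
    """
    new = list(w)
    for k in range(1, len(w)):
        if ue[k]:
            # an updated stage takes the OLD content of the stage above
            new[k] = w_in if k == 1 else w[k - 1]
        # a stage that is not updated keeps its walk
    return new
```

Since every updated stage receives a walk that already satisfied the invariant in cycle t, the invariant carries over to cycle t+1 as long as no translation is dropped from the TLB.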
Finally, in case there is no invalidating instruction in the memory stage in cycle t we argue ex-
actly as above to finish the induction step. Otherwise, we proceed to show that the translations
used in the pipeline are never invalidated. We split cases on the execution mode of the pipeline.
On the level of host the address translation is not used, and therefore there is nothing to show
for translations in the ghost pipeline. On the levels of guest and user the address translation
is used. The invalidation of the MMUs is performed only in cycles in which an invalidating
instruction is executed. (Note, according to Table 9, invalidating instructions cannot be
interrupted by continue type interrupts.)
\begin{align*}
inval(mmu^{q,t}_{Y}) &\rightarrow exec(q,t) && \text{(interconnect)} \\
&\rightarrow /jisr(q,t) && \text{(specification)}
\end{align*}
We use that both, invalidation of any guest addresses on the level of guest, as well as invalida-
tion of any addresses on the level of user are illegal:
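\begin{align*}
/jisr(q,t) &\rightarrow \begin{cases}
/inval(mmu^{q,t}_{Y}) & user(q,t) \\
mmu^{q,t}_{Y}.inva \notin A_{G} & guest(q,t)
\end{cases} && \text{(construction)} \\
&\rightarrow \begin{cases}
/(invlpg_{\sigma} \vee flusht_{\sigma})^{q,I(q,6,t)} & user(q,t) \\
inva^{q,I(q,6,t)}_{\sigma} \notin A_{G} & guest(q,t)
\end{cases} && \text{(IH).}
\end{align*}
At the level of user we immediately conclude as above
\[ mc^{NS(t)}.p(q).tlb \subseteq mc^{NS(t+1)}.p(q).tlb \]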
and we are done. At the level of guest, from the induction hypothesis we know that in cycle t
the ghost walk register in stage k of pipeline Y matches translation request Y of the processor
for execution of instruction at stage k+ 1 in cycle t. Thus, the walks in the ghost pipeline
are the guest walks, and the guest walks are not invalidated at the level of guest as we argued
above.
\begin{align*}
guest(q,t) \wedge match(trq^{q,I(q,k,t)-1}_{Y\,\sigma}, w_{Y}.k^{q,t}) &\rightarrow w_{Y}.k^{q,t}.upa \in A_{G} \\
&\rightarrow w_{Y}.k^{q,t} \in mc^{NS(t+1)}.p(q).tlb
\end{align*}
Summarizing the arguments above, we conclude (invariant 10) that in cycle t + 1 the ghost
walk registers in stage k of pipeline Y are contained in the specification/constructed TLB of
ISA configuration NS(t+1).
\[ w_{Y}.k^{q,t+1} \in \begin{cases}
mc^{NS(t+1)}.p(q).tlb^{\circ} & user(q,t) \\
mc^{NS(t+1)}.p(q).tlb & guest(q,t)
\end{cases} \]
Therefore, it remains to show that the execution mode on processor q does not change in cycle
t.
mode(q, t+1) = mode(q, t)
In case the mode changes in cycle t, by lemma 78 we immediately obtain
\[ ue^{q,t}_{6} \wedge (jisr.5_{\pi} \vee eret.5_{\pi})^{q,t}. \]
In both cases by lemma 76 there is nothing to show for registers in the ghost walk pipeline.
\[ \forall k \in [1:5] : /rfull^{t+1}_{k} \]
9.8.2 Matching Translation Requests
Finally, we proceed to show (invariant 11) that in cycle t +1 the ghost walk register in stage k
of pipeline Y on processor q matches translation request Y of instruction
i = I(q,k, t+1)
at stage k+1 of processor q in cycle t+1.
\[ match(trq^{q,i-1}_{Y\,\sigma}, w_{Y}.k^{q,t+1}) \]
As above, we split cases on stage k of the ghost walk pipeline of processor q, and again we
cover the input stages first. For cycles t in which stage k = 1 of processor q is updated ($ue^{q,t}_{1}$),
using part (1) of lemma 151 we obtain:
\begin{align*}
match(trq^{q,I(q,1,t)}_{I\,\sigma}, mmu^{q,t}_{I}.wout) &\leftrightarrow match(trq^{q,I(q,1,t)}_{I\,\sigma}, w_{I}.1^{q,t+1}) && \text{(interconnect)} \\
&\leftrightarrow match(trq^{q,I(q,1,t+1)-1}_{I\,\sigma}, w_{I}.1^{q,t+1}) && \text{(definition).}
\end{align*}
For cycles t in which stage k = 5 of pipeline E on processor q is updated ($ue^{q,t}_{5}$), from part
(2) of lemma 151, given that a memory operation is executed ($mop.4^{q,t}_{\pi}$), we analogously
obtain:
\begin{align*}
match(trq^{q,I(q,5,t)}_{E\,\sigma}, mmu^{q,t}_{E}.wout) &\leftrightarrow match(trq^{q,I(q,5,t)}_{E\,\sigma}, w_{E}.5^{q,t+1}) && \text{(interconnect)} \\
&\leftrightarrow match(trq^{q,I(q,5,t+1)-1}_{E\,\sigma}, w_{E}.5^{q,t+1}) && \text{(lemma 80).}
\end{align*}
In case pipeline stage k of processor q is not updated in cycle t ($/ue^{q,t}_{k}$), we argue as above,
simply using the induction hypothesis for stage k:
\[ match(trq^{q,I(q,k,t)-1}_{Y\,\sigma}, w_{Y}.k^{q,t}) \leftrightarrow match(trq^{q,I(q,k,t+1)-1}_{Y\,\sigma}, w_{Y}.k^{q,t+1}). \]
If the ghost walk register in stage k > 1 of pipeline I on processor q is updated in cycle t ($ue^{q,t}_{k}$),
we argue analogously, using the induction hypothesis for stage k−1:
\[ match(trq^{q,I(q,k-1,t)-1}_{Y\,\sigma}, w_{Y}.(k-1)^{q,t}) \leftrightarrow match(trq^{q,I(q,k,t+1)-1}_{Y\,\sigma}, w_{Y}.k^{q,t+1}). \]
This ends the induction step and therefore the correctness proof for the pipelined multi-core
MIPS machine with nested translation. The latter proof is the main result of this thesis.
Conclusion
Initially, the goals of this thesis were to i) strengthen the virtualization capabilities of the
pipelined multi-core machine from [Pau16] and ii) prove correctness of the resulting design.
For the first goal, aiming at the hardware support for hypervisors, we had to replace the ad-
dress translation scheme used in [Pau16] with a more general one, permitting two phases of
address translation. For the second goal, we had to investigate the correctness proofs presented
in [Pau16] and, if necessary, repair the arguments broken by integration of the two-phase trans-
lation scheme.
Given the latter two goals, we briefly describe the main contributions of this thesis.
• First, we presented the formal specification of the nested address translation. Then, we
proposed a simple, still not completely trivial, implementation of the MMU for the nested
translation (nested MMU). The latter implementation was proven correct independent of
any particular machine semantics. Namely, for any valid sequence of MMU queries we
showed that computation of the nested MMU was simulated by a computation in the gen-
eral semantics, formalized in this thesis. As a result, we separated correctness of imple-
mentation of the nested address translation from the overall machine correctness.
• In order to improve overall performance, we additionally introduced the address space
identifiers (ASIDs) to designate entries in the processor-local translation cache, the trans-
lation look-aside buffer (TLB). The latter enabled the possibility to invalidate TLB entries
belonging to particular processes, while keeping other entries intact. At the same time,
in order to keep things simple, we dropped the access and dirty bits from our specifica-
tion. To make the nested translation invisible for the guests, we also introduced a simple
mechanism of intercepts.
• We demonstrated how to integrate the nested MMU and prove its correctness for various
machines. For that we integrated the nested address translation semantics into the MIPS-86
ISA specification [Sch13a]. Then we considered both the sequential (single-core) and the
pipelined (multi-core) implementations of the latter specification. For both implementations
we obtained liveness and correctness in complete paper and pencil proofs.
Within the proofs we exploited modularity, which allowed us to avoid repeating
arguments.
It is worth mentioning that in this thesis the proof for the pipelined machine used the following
new auxiliary mathematical concepts.
• The first concept was introduced to keep track of instructions violating the software
conditions, and formalized in the definition of speculation stage Σsoft. This definition turned
out to be crucial in the pipelined implementation to establish correctness for the output of
the instruction cache, and therefore of the fetched instruction.
• The second concept was introduced to keep track of non-live instructions, i.e., instructions
present in the pipeline but eventually discarded on rollbacks. The second concept was
formalized in the definition of lowest non-live circuit stage µ; it turned out to be crucial to
justify our “early” definition of the oracle inputs. Recall that in Sect. 9.2.5 we defined the
oracle inputs for instructions in stages below µ and in cycles in which these instructions
progress down the pipeline.
In the next section we discuss the obtained results in more detail, from the perspective of
problems we encountered in the process of working on this thesis. While discussing the latter
problems, we also inform the reader on how these (or the counterpart) problems were tackled
outside this thesis. In the end, we briefly discuss future work.
Discussion
Overall, the results presented in this thesis hinge heavily on both already published works
[KMP14, PBLS16] and ongoing projects [LOP]. Some of the results presented here required
us to repeat certain arguments from the aforementioned works; certain arguments from [Pau16]
were expected to change simply due to a slightly more advanced machine design. In a machine
with an additional level of execution, among other things, we had to elaborate the following.
Recall from Sect. 2.4 that one of the guard conditions restricting computations in [Sch13a]
demands that the translations used for instruction execution are taken from a local cache for
translations, i.e., the translation look-aside buffer (TLB). In [Pau16] translations used by in-
structions in the upper pipeline stages were never invalidated simply due to the following ob-
servations: i) invalidating instructions are illegal at the execution level which uses the address
translation and ii) change of the execution level effectively drains the pipeline. In this thesis,
with three levels of execution, the arguments above did not apply since invalidating instructions
are legal at the intermediate level (of guest), which uses the address translation.
Development and integration of the two-phase translation scheme into the pipeline from
[Pau16] turned out to be rather technical and required a moderate effort. However, in the
process of elaborating the details in the correctness proof for the resulting design, we discov-
ered a flaw in the original proof of [Pau16]. There sampling of the oracle inputs for the ISA
computation was performed as usual in the memory stage, in cycles of execution of the corre-
sponding instructions. At the same time, the main induction hypothesis about the content of the
implementation registers, including those located in the upper pipeline stages, was formulated
using the oracle inputs defined in future cycles (from the perspective of the cycle considered in
the induction hypothesis), with no assumptions that the latter cycles actually exist. Existence
of these (future) cycles in which the machine executes instructions clearly follows from the
machine liveness, which is normally shown after the machine correctness. After a few collec-
tive discussions, we decided to prove the machine liveness before correctness. In the presence
of rollbacks, this approach allowed us to show uniqueness of cycles in which instructions that
are not rolled-back progress down the pipeline. As a result, we sampled the oracle inputs for
instructions that are not rolled-back exactly in their progress cycles. The resulting arguments
ended up being somewhat lengthier than we expected. In [LOP], which is developed in
parallel to this thesis, the correctness proof is performed only for instructions which are not
rolled-back. Completed with the liveness proof afterwards, the latter approach is expected to
be more concise.
In general, when it comes to major changes to verified designs, we always proceed one step
at a time. Initially, being unaware of the problems described above, we decided to allow self-
modifying code in [Pau16] as one such step. Surprisingly, in the presence of self-modifying
code, the correctness proof for the pipelined machine with address translation became con-
siderably more difficult. Note that the presence of self-modifying code naturally forces us to
introduce a software condition that
i) in single-core pipelined machines forbids modifying the words of instructions while the
latter are executed in the pipeline, i.e., after fetch and before commit in the memory stage,
and
ii) in multi-core pipelined machines forbids modifying the words of instructions while the
latter are executed in the pipelines of the multi-core machine.
The difficulty arose for the following reason: in order to use that software condition we
must first argue that the guard conditions are respected in the computations produced by
our pipelined implementation. In turn, some of these guard conditions can be verified only
provided that the software conditions hold. Luckily, this mutual dependence could be bro-
ken [Obe17] by splitting execution of instructions into phases. Thus, in Sect. 2.2.2 we split
execution of an instruction into two phases: the fetch and the execute. Recall, the fetch phase
encompasses computations of all ISA signals necessary to fetch the current instruction, and
ends once the instruction word was fetched. Computations remaining to complete execution
of the current instruction constitute the execute phase. The sets of the software and guard
conditions were split accordingly. This allowed us to strengthen the formulations involved,
and thus resolve the problem of mutual dependence described above. The software conditions
which apply in the fetch phase, like the software condition about self-modifying code, were
expected to hold if the guard conditions were satisfied before the fetch. The latter assumption
was reflected in our induction hypothesis (see Sect. 9.2.4).
In [LOP] the problem described above is circumvented at the level of the ISA specification. There
the model is extended to include instruction buffers — load buffers for instructions. In the latter
model, instruction buffers are filled with data (instructions) non-deterministically, on dedicated
ISA steps, and emptied on interrupts and returns from exceptions. Instruction execution is
restricted only to the instructions found in the instruction buffer and, of course, matching the
actual instruction address. This makes the new model of [LOP] non-deterministic also in the
choice of instructions for execution. Thus, in the latter model instruction words from the
instruction buffer which are outdated due to self-modifying code are allowed to be executed.
The corresponding correctness proofs in [LOP] are expected to be more concise than their
counterparts from this dissertation. Basically, allowing instruction buffers in [LOP] pushes the
problems introduced by self-modifying code out of the pipeline correctness proof, where it
boils down to (not so amusing) application of the software condition described above.
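The instruction buffer model described above can be illustrated by a small Python sketch. It is not the model of [LOP]; the class and method names are hypothetical, and the non-determinism is represented by letting the caller decide when a fill step happens.

```python
class InstructionBufferModel:
    """Toy ISA-level model with a non-deterministically filled instruction buffer."""

    def __init__(self, memory: dict):
        self.memory = dict(memory)  # address -> instruction word
        self.ibuf = set()           # buffered (address, word) pairs

    def fill(self, addr: int) -> None:
        """Dedicated ISA step: buffer the word currently stored at addr."""
        self.ibuf.add((addr, self.memory[addr]))

    def write(self, addr: int, word: str) -> None:
        """Memory write; buffered copies are deliberately left untouched."""
        self.memory[addr] = word

    def executable(self, pc: int) -> set:
        """Instruction words that may be executed at the instruction address pc."""
        return {word for (addr, word) in self.ibuf if addr == pc}

    def flush(self) -> None:
        """Interrupts and returns from exceptions empty the buffer."""
        self.ibuf.clear()
```

After `fill(0)` followed by a write to address 0, the outdated word remains executable until the buffer is emptied, which is exactly the behavior that the model permits for self-modifying code.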
Another interesting observation was pointed out by Jonas Oberhauser. Essentially, instruction
buffers and their related problems are quite similar to translation look-aside buffers (TLBs) and
problems tackled in this thesis. Similar to instruction buffers, TLBs are filled with data (transla-
tions) non-deterministically, on dedicated ISA steps. In general, non-deterministic mechanisms
leave a lot of space for concrete implementations and, as we discussed above, allow one to
streamline correctness proofs for pipelined designs. Moreover, the provided functionality, such as
TLB invalidating instructions, usually allows one to construct certain programming disciplines to
obtain effectively deterministic programming models on the higher layers of the model stack.
Reduction theorems are proven to justify abstraction of the mechanisms as above, provided that
a certain software discipline, derived in the proof process, is obeyed. Such disciplines might
require some extra (usually minor) effort for programming in high-level languages, or (ideally)
could be entirely integrated into compilers.
Future Work
First and foremost, we are interested in adding support for the inter-processor interrupts, which
requires adding the advanced programmable interrupt controllers (APICs). Much work on this
topic has already been completed in the process of working on this thesis. Thus, at the moment
of writing we strongly believe that integration of APICs into the current design requires only
minor local changes, and does not affect the structure of the arguments presented here or in
[LOP].
Still, in order to integrate the results of this thesis (or of [LOP]) with the results from [Obe17],
where the problem of store buffer reduction was tackled, we have to adjust the specification
for external interrupts. Namely, we have to change the model of [Sch13a] such that delivery of external
interrupts can be delayed. The ability to postpone delivery of the inter-processor interrupts for
an arbitrary number of cycles is crucial to reach higher layers of the model stack, which provide
semantics without store buffers. Using the new mechanisms we should also be able to postpone
the activation of JISR to the subsequent instructions if an external interrupt occurs during the
memory operation. Such a change would allow us to keep the usual return type (repeat) for external
interrupts.
Finally, one would think of adding devices. Thus, at the moment of writing the hard drive
has already been added as a sample device into the pipelined machine of [LOP]. The latter
result has to be extended to meet the specifications from [Sch13a]. According to the latter
specifications, the devices are interconnected with local APICs and I/O APIC into a single
device sub-system, where all devices are connected and communicate via the interrupt bus.
We believe that a moderate effort is required to finish the implementation of the device sub-
system as specified in [Sch13a], and integrate the corresponding correctness arguments into
the overall correctness proof.
References
ABCC66. R. Adair, R. Bayles, L. Comeau, and R. Creasy. A Virtual Machine System for the 360/40.
Technical report, IBM Corp., Cambridge Scientific Center, May 1966.
Age09. Ole Agesen. Software and Hardware Techniques for x86 Virtualization. Technical paper,
VMware, Inc., August 2009.
https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/software_hardware_tech_x86_virt.pdf.
APST10. E. Alkassar, W. Paul, A. Starostin, and A. Tsyban. Pervasive Verification of an OS Micro-
kernel: Inline Assembly, Memory Consumption, Concurrent Devices. In Third International
Conference on Verified Software: Theories, Tools, and Experiments (VSTTE’10), volume 6217
of LNCS, pages 71–85. Springer, 2010.
ARM14. ARM and QUALCOMM. Enabling the Next Mobile Computing Revolution with Highly Integrated
ARMv8-A based SoCs. White paper, ARM Limited/QUALCOMM Technologies, Inc.,
July 2014. https://www.arm.com/files/pdf/ARM_Qualcomm_White_paper_Final.pdf.
BC13. Thomas Braibant and Adam Chlipala. Formal Verification of Hardware Synthesis. In Natasha
Sharygina and Helmut Veith, editors, Computer Aided Verification, pages 213–228, Berlin,
Heidelberg, 2013. Springer Berlin Heidelberg.
Bev89. William R. Bevier. Kit and the Short Stack. Journal of Automated Reasoning, 5(4):519–530,
December 1989.
Bey05. Sven Beyer. Putting It All Together — Formal Verification of the VAMP. PhD thesis, Saarland
University, Saarbrücken, 2005.
BHMY89. William R. Bevier, Warren A. Hunt, J. Strother Moore, and William D. Young. An Approach
to Systems Verification. Journal of Automated Reasoning, 5(4):411–428, December 1989.
BJK+03. Sven Beyer, Christian Jacobi, Daniel Kroening, Dirk Leinenbach, and Wolfgang J. Paul.
Instantiating Uninterpreted Functional Units and Memory System: Functional Verification of the
VAMP. In Daniel Geist and Enrico Tronci, editors, Correct Hardware Design and Verification
Methods, Proc. 12th IFIP WG 10.5 Advanced Research Working Conference (CHARME’03),
L’Aquila, Italy, volume 2860 of LNCS, pages 51–65. Springer, 2003.
CPS13. Ernie Cohen, Wolfgang Paul, and Sabine Schmaltz. Theory of Multi Core Hypervisor
Verification. In Peter van Emde Boas, Frans C. A. Groen, Giuseppe F. Italiano, Jerzy Nawrocki,
and Harald Sack, editors, Proc. SOFSEM’13: Theory and Practice of Computer Science,
Spindleruv Mlyn, Czech Republic, volume 7741 of LNCS, pages 1–27. Springer, 2013.
CVS+17. Joonwon Choi, Muralidaran Vijayaraghavan, Benjamin Sherman, Adam Chlipala, and Arvind.
Kami: A Platform for High-level Parametric Hardware Specification and Its Modular
Verification. Proc. ACM Program. Lang., 1(ICFP):24:1–24:30, August 2017.
DDB08. Matthias Daum, Jan Dörrenbächer, and Sebastian Bogan. Model Stack for the Pervasive
Verification of a Microkernel-based Operating System. In Bernhard Beckert and Gerwin Klein,
editors, 5th International Verification Workshop (VERIFY’08), volume 372 of CEUR Workshop
Proceedings, pages 56–70. CEUR-WS.org, 2008.
ELMC18. Shayan Eskandari, Andreas Leoutsarakos, Troy Mursch, and Jeremy Clark. A first look at
browser-based Cryptojacking. CoRR, abs/1803.02887, 2018. http://arxiv.org/abs/1803.02887.
Hil05. Mark Hillebrand. Address Spaces and Virtual Memory: Specification, Implementation, and
Correctness. PhD thesis, Saarland University, Saarbrücken, 2005.
Int16. Intel Corporation. Intel(R) 64 and IA-32 Architectures Software Developer’s Manual.
Intel Corp., September 2016.
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf.
Int18. Intel Corporation. Intel Analysis of Speculative Execution Side Channels. White paper,
Intel Corp., January 2018.
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/analysis-of-speculative-execution-side-channels-white-paper.pdf.
KAK+. Gerwin Klein, June Andronick, Ihor Kuz, Toby Murray, Gernot Heiser, and Matthew
Fernandez. Formal Verification at Scale. Communications of the ACM. To appear.
KGG+18. Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan
Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting
Speculative Execution. ArXiv e-prints, January 2018.
KMP14. M. Kovalev, S.M. Müller, and W.J. Paul. A Pipelined Multi-core MIPS Machine: Hardware
Implementation and Correctness Proof, volume 9000 of LNCS. Springer, 2014.
Kov13. Mikhail Kovalev. TLB Virtualization in the Context of Hypervisor Verification. PhD thesis,
Saarland University, Saarbrücken, 2013.
Krö01. Daniel Kröning. Formal Verification of Pipelined Microprocessors. PhD thesis, Saarland
University, Saarbrücken, 2001.
LOP. P. Lutsyk, J. Oberhauser, and W.J. Paul. A Pipelined Multi-core Machine with Operating
Systems Support: Hardware Implementation and Correctness Proof. LNCS. To appear.
LSG+18. Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Stefan
Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown. ArXiv
e-prints, January 2018.
Lut14. Petro Lutsyk. Pipelined MIPS Processor with a Store Buffer. Master’s thesis, Saarland
University, Saarbrücken, 2014.
Mai14. Giorgi Maisuradze. Implementing and Debugging a Pipelined Multi-Core MIPS Machine.
Master’s thesis, Saarland University, Saarbrücken, 2014.
Meg12. Natarajan Meghanathan. Virtualization of Virtual Memory Address Space. In Proceedings of
the Second International Conference on Computational Science, Engineering and Information
Technology, CCSEIT ’12, pages 732–737, New York, NY, USA, 2012. ACM.
Moo03. J. Strother Moore. A Grand Challenge Proposal for Formal Methods: A Verified Stack, pages
161–172. Springer Berlin Heidelberg, Berlin, Heidelberg, 2003.
MP00. Silvia M. Müller and Wolfgang J. Paul. Computer Architecture, Complexity and Correctness.
Springer, 2000.
Obe17. Jonas Oberhauser. Justifying The Strong Memory Semantics of Concurrent High-Level
Programming Languages for System Programming. PhD thesis, Saarland University, Saarbrücken,
2017.
Ols14. Loris Olsem. Development Toolchain for MIPS. Bachelor’s thesis, Saarland University,
Saarbrücken, 2014.
Pau16. W.J. Paul. Multi-Core System Architecture. Lecture notes, Saarland University,
Saarbrücken, 2016.
http://www-wjp.cs.uni-saarland.de/lehre/vorlesung/rechnerarchitektur/ws15/books/mcsysbook.pdf.
PBLS16. W.J. Paul, C. Baumann, P. Lutsyk, and S. Schmaltz. System Architecture: An Ordinary
Engineering Discipline. Springer, 2016.
RIS18. RISC-V Foundation. RISC-V ISA, 2018. https://riscv.org/risc-v-isa/.
Sch13a. Sabine Schmaltz. MIPS-86 — A Multi-Core MIPS ISA Specification. Technical report,
Saarland University, Saarbrücken, 2013.
http://www-wjp.cs.uni-saarland.de/publikationen/SchmaltzMIPS.pdf.
Sch13b. Sabine Schmaltz. Towards the Pervasive Formal Verification of Multi-Core Operating Systems
and Hypervisors Implemented in C. PhD thesis, Saarland University, Saarbrücken, 2013.
Sch14a. Oliver Schmitt. Design and Verification of Memory Management Units for Single-Core CPUs.
Bachelor’s thesis, Saarland University, Saarbrücken, 2014.
Sch14b. Konstantin Schwarz. Integration of Inter-Processor Interrupts into Multi-Core Machines.
Bachelor’s thesis, Saarland University, Saarbrücken, 2014.
Sch16a. Felix Schmidt. Correctness of a Pipelined MIPS Machine with Interrupts and a Cache Memory
System. Master’s thesis, Saarland University, Saarbrücken, 2016.
Sch16b. Konstantin Schwarz. Correctness of Multi-Core Machines with Non-Pipelined Processors and
Inter-Processor Interrupts. Master’s thesis, Saarland University, Saarbrücken, 2016.
SK17. Hira Syeda and Gerwin Klein. Reasoning about Translation Lookaside Buffers. In Proceedings
of the 21st International Conference on Logic for Programming, Artificial Intelligence and
Reasoning, pages 490–508, Maun, Botswana, May 2017.
VCAD15. Muralidaran Vijayaraghavan, Adam Chlipala, Arvind, and Nirav Dave. Modular Deductive
Verification of Multiprocessor Hardware Designs, pages 109–127. Springer International
Publishing, Cham, 2015.
Ver07. Verisoft Consortium. The Verisoft Project, 2003–2007. http://www.verisoft.de/.
Ver10. Verisoft XT Consortium. The Verisoft XT Project, 2007–2010. http://www.verisoftxt.de/.
Vij16. Muralidaran Vijayaraghavan. Modular Verification of Hardware Systems. PhD thesis,
Massachusetts Institute of Technology, Cambridge, Massachusetts, 2016.
Vir18. libvirt Virtualization API. Application Development. Description of the XML schemas for
domains. libvirt Project, 2018. https://libvirt.org/formatdomain.html#elementsMemoryBacking.
WLPA16. Andrew Waterman, Yunsup Lee, David A. Patterson, and Krste Asanović. The RISC-V
Instruction Set Manual, Volume I: User-Level ISA, Version 2.1. Technical Report
UCB/EECS-2016-118, EECS Department, University of California, Berkeley, May 2016.
WZB+16. Andy Wright, Sizhuo Zhang, Thomas Bourgeat, Muralidaran Vijayaraghavan, Jamey Hicks,
and Arvind. Riscy Processors: A collection of open-sourced RISC-V processors. In 4th
RISC-V Workshop, July 2016.
Zah16. Shahd Zahran. Implementing and Debugging a Pipelined MIPS Machine with Interrupts and
Multi-Level Address Translation. Master’s thesis, Saarland University, Saarbrücken, 2016.

Index
accesses, see memory
actual instruction index, see instruction
added walks, see walk
address
partitioning, 37
space, see universal addressing
translation
multi-level, 37
nested, 53
request, 40
address space ID, see ASID
ASID, see universal addressing
binary numbers, 10
bit-strings
as binary numbers, 10
bit-operations, 10
busy control state, see control state
byte addressable memory, see memory
control
automaton
for nested translation, 86
for simple translation, 85
of sequential machine, 126
state
busy, 127
current, 127
finishes, 127
coverage of hardware walks, see walk
current control state, see control state
current instruction, see instruction
dropped walks, see walk
effective address, 25
exception, see interrupt
registers, see registers
execution level, see mode
faultiness of matching walks, see walk
finishes (control state), see control state
full bit, see stall engine
general purpose register file, see GPR
general-protection fault, 36
on fetch, 43
on memory operation, 44
ghost walk pipeline, 182
GPR, 17
guard conditions, 28
on core steps, 65
on TLB steps, 60
guest
mode, see mode
page address, 76
hardware
walk registers, see registers
walks, 107
hazard signal, see stall engine
host mode, see mode
illegal instruction interrupt, 36
implementation registers, see registers
instruction
address, 41
current, 126
execution, 45, 65
fetch, 43
index (actual), 214
live, 191
stage, 206
tables, 15
intercept
mechanism, 50
signal, 65
interrupt, 32
cause pipeline, 180
JISR, 33
level, 34
masking, 34
return from exception, 35
semantics, 34
jump to interrupt service routine, see interrupt JISR
line addressable memory, see memory
live instruction, see instruction
local invisible registers, see registers
lowest
non-live circuit stage, 191
truly full stage, 199
matching walk, see walk
memory, 12
accesses, 13
accesses of ISA, 28
embedding, 12
management unit, 81
semantics, 13
virtual, 49
MIPS ISA
ALU-operations, 21
branches and jumps, 23
J-type jumps, 24
jump and link, 24
R-type jumps, 23
configuration, 17
computation, 17
next configuration, 17
instructions
current instruction, 18
decoding, 19
I-type, 18
immediate constant, 19
instruction fields, 18
J-type, 18
opcode, 18
R-type, 18
loads and stores, 25
effective address, 25
loads, 25
stores, 25
shift unit operations, 22
summary, 27
MMU, see memory management unit
accesses, 103
implementation, 83
queries, see MMU accesses
specification, 81
mode, 50
register, 31
multi-level translation, see address translation
NAT, see nested translation
nested translation, see address translation, see translation
overflow
interrupt, 36
page, 37
address, see universal addressing
fault, 36
on fetch, 43
on memory operation, 44
tables, 37
PC, 17
physical memory address
for instruction fetch, 43, 65
for memory operation, 44, 65
PRID, see universal addressing
process ID, see PRID
processor
accesses, 198
core, 41
steps, 42, 63
local computations, 223
step, see processor core
program counter, see PC
ragged walks, see walk
real full bit, 176
registers
exception (esr, eca, epc, edata), 32
hardware walk, 83
implementation (pipelined), 183
implementation (sequential), 130
local invisible, 233
reset
interrupt, 33
semantics, 43
return from exception, see interrupt
rollback, see stall engine
hazard, 176
pending, 175
request, 175
scheduling function(s)
for pipelined machine, 189
for sequential machine, 148
sequential memory semantics, see memory
simple translation, see translation
simulation
for pipelined machine, 231
for sequential machine, 146
of TLB, 72, 106
of TLB content, 107
of walk registers, 107
software conditions, 27
for pipelined machine, 229
for sequential machine, 146
speculation stage, 230
stall engine, 175
stall-rollback engine, see stall engine
status register, see special purpose register
system call
instruction, 16
interrupt, 36
TLB, see translation look-aside buffer
equivalence, 70
implementation, 76
simulation, see simulation
specification (hardware), 74
steps, 42, 59
translated memory address, 40, 58
translation, see universal walk
accesses, 61
look-aside buffer, 41
nested, 83
request, see address translation
simple, 83
translation mode, see mode
universal
address, see universal addressing
partitioning, 51
addressing, 51
walk, see walk
update enable signal, see stall engine
user mode, see mode
valid walk, see walk
virtual
address, see universal addressing
machine ID, see VMID
memory, see memory
VMID, see universal addressing
void accesses, see memory accesses
walk, 54
added, 68
complete, 39
composition, 56
coverage, 109
decomposition, 57
dropped, 69
extension, 39
faultiness (of matching walks), 55
faulty, 54
initialization, 38
level, 55
matching, 54
ragged, 69
valid, 57
walking unit, 87
