Formal verification of a processor with memory management units by Dalinger, Iakov
Formal Verification of
a Processor with
Memory Management Units
Dissertation
submitted for the degree of
Doctor of Engineering (Doktor der Ingenieurwissenschaften)
of the Faculties of Natural Sciences and Technology
of Saarland University
Iakov Dalinger
dalinger@wjpserver.cs.uni-sb.de
Saarbrücken, June 2006
Date of the colloquium: 9 June 2006
Dean: Prof. Dr.-Ing. Thorsten Herfet
Chair of the examination committee: Prof. Dr.-Ing. Gerhard Weikum
First reviewer: Prof. Dr. Wolfgang J. Paul
Second reviewer: Prof. Dr. Peter-Michael Seidel
Academic staff member: Dr. Mark Hillebrand
I hereby declare that I have produced this thesis without impermissible help from third parties and without using any aids other than those stated. Data and concepts taken directly or indirectly from other sources are marked with a reference to the source. This thesis has not previously been submitted, in the same or a similar form, in any other examination procedure in Germany or abroad.
Saarbrücken, June 2006
Acknowledgments
I thank my wife Yulia and my parents for their support during the years in which this thesis was written.
I would like to thank Prof. Paul for his supervision and for the interesting topic of this thesis.
I also thank the members of Prof. Paul's chair, in particular Mark Hillebrand and Sven Beyer, for many stimulating discussions.
Abstract
In this thesis we present the formal verification of a memory management unit which operates correctly only under specific operating conditions. We also present the formal verification of a complex processor, the VAMP, which supports address translation by means of memory management units. The VAMP is an out-of-order 32-bit RISC CPU with the DLX instruction set, fully IEEE-compliant floating point units, and a memory unit. The VAMP also supports precise internal and external interrupts. It is modeled on the gate level and verified with respect to its specification. This thesis builds on the formal proof of the VAMP without address translation [Bey05] and on the paper-and-pencil specification, implementation, and correctness proof of a memory management unit [Hil05].
Short Summary
In this dissertation we present the formal verification of a memory management unit that works correctly only under certain operating conditions. We also present the formal verification of the VAMP, a complex processor that supports address translation. The VAMP is an out-of-order 32-bit RISC CPU with the DLX instruction set, fully IEEE-compliant floating point units, and a memory unit. The VAMP supports precise internal and external interrupts. It is modeled on the gate level and verified with respect to a formal specification. This work is based on the formal proof of the VAMP without address translation [Bey05] and on the paper-and-pencil specification, implementation, and correctness proof of a memory management unit from [Hil05].
Extended Abstract
In this thesis we report on the formal verification of an out-of-order processor with support for address translation. For this purpose a simple memory management unit was specified and implemented. The correctness proof for the memory management unit alone is simple, but it depends on nontrivial operating conditions. The design of the memory management unit was inspired by the dissertation of Mark Hillebrand [Hil05].
As the next step, we extend the microprocessor VAMP without address translation [BJK+03, BJK+05, Krö01, Bey05] with two memory management units, one for address translation on instruction fetch and one for address translation on load/store accesses. To make it possible to realize swap memory, we also extend the VAMP with precise external interrupts. The verification is then split into two sub-steps, namely the correctness without interrupts and the correctness with interrupts.
Both implementation and verification of the processor and the mem-
ory management unit were carried out in the theorem proving system
PVS [OSR92].
The VAMP is a pipelined out-of-order [Krö01] 32-bit RISC CPU with the DLX instruction set, fully IEEE 754 [IEE85] compliant floating point units [Ber01, BJ01a, Jac02a, Jac02b] for single- and double-precision operations, a memory unit with a cache memory interface, delayed PC, and precise interrupts. The design and the formal verification of the Tomasulo [Tom67] out-of-order algorithm and of the VAMP's floating point units are based on the PhD theses of Daniel Kröning [Krö01] and Christian Jacobi [Jac02a], respectively. The implementation of the Tomasulo algorithm with the floating point units, instruction fetch, and a memory unit, as well as the integration of precise interrupts and caches into the proof of the VAMP, was done by Sven Beyer [Bey05].
This work yields a formally verified gate-level implementation of the VAMP with support for address translation, with interrupts, and with a cache memory interface with split instruction and data caches. Decomposing the implementation into several abstraction levels allows us to reason more effectively than arguing about the huge overall implementation at once.
We show the overall correctness of the VAMP implementation with respect to its specification. We focus on the memory management unit, the data memory access, self-modifying code, instruction fetch, and precise external interrupts. The results of this work are published in [DHP05]. Their importance becomes clear in the context of the verification of a complete computer system, which consists of a processor as hardware and an operating system and several applications as software. Hardware support is needed since common operating systems require paging and address translation on the hardware level in order to give each user program its own virtual memory.
Summary
In this thesis we report on the formal verification of an out-of-order processor with support for address translation. For this purpose a simple memory management unit was specified and implemented. The correctness proof of this unit is simple in itself, but depends on nontrivial operating conditions. The development of the memory management unit follows the implementation from the thesis of Mark Hillebrand [Hil05].
In a further step we extend the microprocessor VAMP without address translation [BJK+03, BJK+05, Krö01, Bey05] with memory management units, one for address translation on instruction fetch and one for address translation on load/store instructions. To enable the implementation of swap memory, we extend the VAMP with support for precise external interrupts.
Both the implementation and the verification of the processor and the memory management unit are carried out in the proof system PVS [OSR92].
The VAMP is a pipelined out-of-order [Krö01] 32-bit RISC CPU with the DLX instruction set, fully IEEE 754 [IEE85] compliant floating point units [Ber01, BJ01a, Jac02a, Jac02b] for single- and double-precision operations, a memory unit with a cache memory interface, delayed PC, and precise interrupts. The development and the formal verification of the Tomasulo out-of-order algorithm [Tom67] and of the floating point units of the VAMP are based on the dissertations of Daniel Kröning [Krö01] and Christian Jacobi [Jac02a]. An implementation of the Tomasulo algorithm with floating point units, instruction fetch, a memory unit with caches, and precise internal interrupts was developed by Sven Beyer [Bey05].
The result of both steps is a gate-level, formally verified implementation of the VAMP with support for address translation, interrupts, and a cache memory interface with separate instruction and data caches. Decomposing the implementation into abstraction levels makes it possible to argue about correctness more efficiently than by considering the complete implementation at once.
We show the complete correctness of the VAMP implementation with respect to its specification. We concentrate on the memory management unit, the data memory access, self-modifying code, instruction fetch, and precise external interrupts. The results of this work are published in [DHP05]. The importance of this work arises in the context of the verification of a complete computer system, which contains a processor as hardware and an operating system as well as various applications as software. Standard operating systems require paging and address translation on the hardware level in order to provide each application with its own virtual memory.
Contents
1 Introduction
1.1 Notation
1.2 The PVS System
1.3 Basic Circuits
1.4 Virtual Memory
1.5 Proof Decomposition
1.5.1 The Memory Interface Layer
1.5.2 The CPU Interface Layer
2 The Memory Management Unit
2.1 Specification of the MMU
2.1.1 Assumption for the MMU
2.1.2 Guarantees of the MMU
2.2 MMU Design
2.3 MMU Correctness
3 The VAMP with Virtual Memory Support
3.1 Specification of the VAMP
3.2 Implementation of the VAMP
3.2.1 Tomasulo Algorithm
3.2.2 The VAMP Design
3.2.3 Implementation of the VAMP Memory Unit
3.3 Correctness Criteria
3.3.1 Scheduling Functions
3.3.2 Correctness Invariant
3.3.3 Proof Overview
3.4 Correctness between Interrupts
3.4.1 Correctness of the Memory Unit on Load / Store
3.4.2 Instruction Fetch
3.5 Correctness with Interrupts
3.5.1 Overall Correctness
4 Conclusion
4.1 Summary
4.2 Related Work
4.3 Future Work
4.3.1 Automated Methods
4.3.2 Hardware Optimizations and Extensions
4.3.3 Formal Software Verification
A VAMP Instruction Set
B Lemmas in PVS
List of Figures
1.1 Organization of Virtual Memory
1.2 Proof Decomposition of the VAMP without Address Translation
1.3 Proof Decomposition of the VAMP with Address Translation
1.4 Timing of the Memory Interface for DMMU
1.5 Timing of the Interface between CPU and DMMU
2.1 Interfaces of the MMU
2.2 Example of the Behavior of the Signal busym
2.3 Address Translation for the virtual address
2.4 Page Table Entry
2.5 Data Paths of the MMU
2.6 Control Automaton of the MMU
2.7 Possibility of Requests to the Memory
3.1 The VAMP Data Paths
3.2 Extension of the VAMP with Two MMUs
3.3 The VAMP Memory Unit
3.4 Stabilizing Circuit for the Data MMU
3.5 Stabilizing Circuit for the Instruction MMU (Circuit genPC)
3.6 Fetch Implementation in the VAMP
3.7 Memory Sequence for the DMMU
3.8 Memory Sequence for the IMMU
A.1 Instruction Formats of the VAMP
List of Tables
1.1 The Memory Interface (between MMUs and Physical Memory)
1.2 The CPUI Interface (between CPU and MMUs)
3.1 Special Purpose Registers of the VAMP
3.2 Supported Interrupts in the VAMP
3.3 Scheduling Functions of the VAMP
4.1 Comparison of the Verification Effort in PVS
A.1 I-type Instruction Layout
A.2 R-type Instruction Layout
A.3 J-type Instruction Layout
A.4 FI-type Instruction Layout
A.5 FR-type Instruction Layout
A.6 Floating-point Relational Operators for the fc Instruction
B.1 Lemmas in PVS context mmu
B.2 Lemmas in PVS context dlxtom_mmu
Chapter 1
Introduction
Nowadays, processors are used in many life-critical devices, e.g., in nuclear power stations, airplanes, trains, cars, or medical devices. Processor developers use diverse tests in order to find bugs in processors. Since non-exhaustive tests do not cover all possible cases of a processor's behavior, industrial processors may still contain bugs. Formal verification increasingly turns out to be the only alternative, since the result of a formal verification is equivalent to a simulation of all possible cases. In addition, it scales better than simulation when the correctness results of several modules are put together.
The work presented here can be viewed as a link between two projects. The first is the Verified Architecture Microprocessor (VAMP) project [JK00, BBJ+02, BJK+03, BJK+05], in which an instruction set architecture was specified and a complete microprocessor was implemented on the gate level and formally verified in PVS [OSR92]. The second is the Verisoft project [Ver03], which is funded by the German federal government. One of the goals of the Verisoft project is to show the correctness of a system which has the VAMP as hardware and a compiler, an operating system, and several applications as software. The main distinction from previous processors is that the processor which we develop in this thesis has hardware support for address translation, which is essential for implementing a multi-tasking operating system.
The VAMP is a 32-bit RISC CPU with DLX instruction set [HP96b].
In addition to a memory unit with a cache memory interface [Bey05], the
VAMP features a Tomasulo scheduler [Krö01], a fixed point unit, three float-
ing point units [Ber01,BJ01a,Jac02a,Jac02b], and precise interrupts. For the
formal verification of the VAMP, we focus on the memory unit and on in-
struction fetch. We prove the overall VAMP implementation correctness on
the gate level with respect to a programmer’s model that just executes one
instruction in a step. For the formal verification the theorem-proving system
PVS [OSR92] was used.
In this thesis we prove only the correctness of the hardware used in the Academic System of the Verisoft project. Note that software bugs may also cause wrong system behavior. Hence both hardware and software correctness proofs are necessary, and the hardware verification is just a first step on the way to establishing the correctness of the whole computer system.
Outline
In Chapter 1 we review the basic notation, which is used for the whole thesis.
We also introduce the PVS system, virtual memory, and the general proof
decomposition approach of this thesis.
In Chapter 2 we present a non-optimized construction of the memory management unit (MMU), the hardware that supports virtual memory. We also prove its correctness under nontrivial operating conditions: the memory cells that are used for address translation must not change during an MMU request.
In Chapter 3 we establish the overall correctness of the VAMP with address translation. Note that in pipelined processors separate MMUs are used for instruction fetch and load / store. We present the specification and implementation of the VAMP. We show how the operating conditions for both MMUs can be guaranteed by a combination of hardware mechanisms and software conventions in order to exclude RAW hazards for fetch with address translation. Note that a pipelined processor with address translation has more RAW hazards than one without. As a software convention we require the use of a sync instruction before a fetch from a modified memory location.
Finally, in Chapter 4 we summarize the results, discuss advantages and drawbacks of our system, and present related work and possibilities for further work.
1.1 Notation
In this section we introduce notation used throughout the thesis. Since our work is based on Beyer's previous work [Bey05], we use the same notation and abbreviations as in that work.
We define N as the set of natural numbers including 0 and N+ := N\{0}.
The set of integer numbers is denoted by Z. We start with definitions for
integer intervals.
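Definition 1.1.1 Let $n, m \in \mathbb{Z}$ be integer numbers. We define the following integer intervals:
\[
\begin{aligned}
[n : m] &:= \{n, \ldots, m\} \\
]n : m] &:= \{n + 1, \ldots, m\} \\
[n : m[ &:= \{n, \ldots, m - 1\} \\
]n : m[ &:= \{n + 1, \ldots, m - 1\} \\
\mathbb{Z}_n &:= [0 : n[ \\
\mathbb{Z}_{\leq n} &:= [0 : n] \\
\mathbb{Z}_{\geq n} &:= \mathbb{N} \setminus \mathbb{Z}_n.
\end{aligned}
\]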
Next we give definitions for bitvectors and arrays. We start with a stan-
dard definition of a word. We use the star (“*”) to indicate that definitions
and lemmas were copied from [Bey05].
Definition* 1.1.2 Let $\Sigma \neq \emptyset$ be a set called alphabet. A word of length $n \in \mathbb{N}$ over the alphabet $\Sigma$ is a function $a : \mathbb{Z}_n \to \Sigma$. A word $a$ is uniquely identified by the $n$-tuple of values $(a(n-1), a(n-2), \ldots, a(0))$. As a shorthand for this tuple, we also use $a_{n-1}a_{n-2} \ldots a_0$ or just $a[n-1 : 0]$. The set $\Sigma^n := \{a \mid a : \mathbb{Z}_n \to \Sigma\}$ is the set of all words of length $n$ over $\Sigma$. The set of all finite words over $\Sigma$ is given by $\Sigma^* := \bigcup_{n \in \mathbb{N}} \Sigma^n$. The concatenation of words is defined by
\[
\begin{aligned}
&\circ : \Sigma^* \times \Sigma^* \to \Sigma^* \\
&a[n-1 : 0] \circ b[m-1 : 0] = (a(n-1), \ldots, a(0), b(m-1), \ldots, b(0))
\end{aligned}
\]
Instead of writing $\circ$ as infix operator, we also simply drop it, using $a[n-1 : 0]\,b[m-1 : 0]$ for concatenation.
Definition 1.1.3 A domain is an alphabet. We abbreviate B := {0, 1}. A
bitvector of length n is an array of length n over the domain B.
Let b be a bitvector of length n. We introduce the notation b[k : l] for a subbitvector and b[k] for the k-th bit of the bitvector b, where k ∈ Zn and l ∈ Zk.
Note that the length of almost all bitvectors used in this thesis is divisible by 8. Sometimes we want to work with only one part (always 8 bits) of a bitvector. Therefore we introduce the following shorthand notation:
Definition* 1.1.4 For any $w \in \mathbb{B}^{8 \cdot B}$ and $b < B$, we use the shorthand notation $|w|_b = w[8 \cdot b + 7 : 8 \cdot b]$ to select the $b$-th byte of bitvector $w$.
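The byte selection of Definition 1.1.4 can be illustrated by a minimal Python sketch. It assumes, purely for illustration, that a bitvector of length 8·B is encoded as the non-negative integer it represents; the names `select_byte` and `num_bytes` are hypothetical and do not come from the PVS theories.

```python
def select_byte(w: int, b: int, num_bytes: int) -> int:
    """Return |w|_b = w[8*b+7 : 8*b] for a bitvector w of length 8*num_bytes.

    Assumption: w is encoded as the natural number it represents.
    """
    if not 0 <= b < num_bytes:
        raise ValueError("byte index b must satisfy 0 <= b < num_bytes")
    if not 0 <= w < 1 << (8 * num_bytes):
        raise ValueError("w does not fit into 8*num_bytes bits")
    # Shift byte b down to the lowest position and mask out 8 bits.
    return (w >> (8 * b)) & 0xFF
```

For a 32-bit word (B = 4), byte 0 is the least significant byte and byte 3 the most significant one.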
Note that a bitvector of length 1 is a bit. For the whole thesis we identify the values 1 and 0 of a bit with the values TRUE and FALSE of a boolean, respectively.
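Definition* 1.1.5 Let $n \in \mathbb{N}^+$ and $a \in \mathbb{B}^n$. We call
\[
\langle a \rangle := \sum_{i=0}^{n-1} a[i] \cdot 2^i
\]
the binary number represented by $a$. Note that $\langle \cdot \rangle : \mathbb{B}^n \to \mathbb{Z}_{2^n}$ is bijective.¹ Thus, the function $bin_n := \langle \cdot \rangle^{-1}$, $bin_n : \mathbb{Z}_{2^n} \to \mathbb{B}^n$, that returns the binary representation of a natural number is well defined.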
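Proposition* 1.1.6 Let $a, b \in \mathbb{B}^n$, $m \in \mathbb{Z}_n$, and $k \in \mathbb{Z}_{2^m}$. The following statements hold:
• $\langle a \rangle \in \mathbb{Z}_{2^n}$
• $\langle a \rangle = \langle b \rangle \iff a = b$
• $\langle a \rangle = 2^m \cdot \langle a[n-1 : m] \rangle + \langle a[m-1 : 0] \rangle$
• $\langle a \rangle \bmod 2^m = k \iff \langle a[m-1 : 0] \rangle = k$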
We will use these properties in order to prove our correctness criteria
but we do not need to prove these properties because all of these proofs are
included in the PVS bitvectors library.
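Although these properties are proven once and for all in PVS, they can be sanity-checked for small n by brute force. The following Python sketch assumes that a bitvector is encoded as a tuple of bits with index 0 as the least significant bit; the names `value`, `bin_n`, and `check_proposition` are illustrative and are not the PVS functions bv2nat and nat2bv.

```python
from itertools import product

def value(a):
    """<a>: the binary number represented by the bit tuple a (a[0] is the LSB)."""
    return sum(bit << i for i, bit in enumerate(a))

def bin_n(x, n):
    """bin_n(x): the n-bit representation of a natural number x < 2**n."""
    if not 0 <= x < 2 ** n:
        raise ValueError("x must lie in [0, 2**n)")
    return tuple((x >> i) & 1 for i in range(n))

def check_proposition(n):
    """Exhaustively check the statements of Proposition 1.1.6 for length n."""
    for a in product((0, 1), repeat=n):
        x = value(a)
        assert 0 <= x < 2 ** n            # <a> is in Z_{2^n}
        assert bin_n(x, n) == a           # <.> is injective, bin_n is its inverse
        for m in range(n):                # m in Z_n
            high = value(a[m:])           # <a[n-1 : m]>
            low = value(a[:m])            # <a[m-1 : 0]>, empty (value 0) for m = 0
            assert x == 2 ** m * high + low
            assert x % 2 ** m == low
    return True
```

Such a check only covers finitely many lengths and is no substitute for the PVS proofs; it merely makes the statements concrete.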
Since our hardware implementation contains RAMs we now introduce a
definition for RAM.
Definition* 1.1.7 For any $a \in \mathbb{N}$ and $d \in \mathbb{N}$, a $2^a \times d$-RAM $R$ is a function $\mathbb{B}^a \to \mathbb{B}^d$ that maps any input address $adr \in \mathbb{B}^a$ to its data value in $R$, denoted by $R[adr]$.
Note that we can use λ-notation for arrays and RAMs to introduce unnamed functions. We use λ-notation only for function definitions; we do not use the full λ-calculus. For example, the following equation trivially holds:
\[
0 = bin_n\langle \lambda_{i \in \mathbb{Z}_n} FALSE \rangle, \quad \text{where } n \in \mathbb{N}^+
\]
Definition 1.1.8 A function $inp : \mathbb{N} \to D_I$ is called an input signal over the domain $D_I$. We use the shorthand notation $inp^t$ for the value of $inp$ in cycle $t$, i.e., $inp^t = inp(t)$.
Let $c_{init}$ be some initial configuration over the domain $D_c$, $inp$ some input signal over the domain $D_I$, and $nextc : D_c \times D_I \to D_c$ a function that computes the configuration in the next cycle based on the current state and some input. We then denote the configuration in cycle $t$ given a starting configuration $c_{init}$ with $c[c_{init}]^t$, i.e., we have
\[
\begin{aligned}
c[c_{init}]^0 &:= c_{init} \\
c[c_{init}]^{t+1} &:= nextc(c[c_{init}]^t, inp^t)
\end{aligned}
\]
If we do not explicitly care for the starting configuration, but just assume an arbitrary, but fixed one, we also simply write $c^t$.
¹ This property is formally verified in the PVS standard library bitvectors, cf. Section 1.2.
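The recursive definition of the configuration sequence can be read as a simple simulation loop. The following Python sketch is only an illustration under the assumption that the input signal is given as a function from cycles to inputs; the names `run`, `next_counter`, and `enable` are hypothetical examples and not part of the VAMP model.

```python
def run(c_init, next_c, inp, t):
    """Return c[c_init]^t for next-state function next_c and input signal inp."""
    c = c_init                      # c[c_init]^0 := c_init
    for cycle in range(t):
        c = next_c(c, inp(cycle))   # c^{cycle+1} := nextc(c^cycle, inp^cycle)
    return c

# Hypothetical example: a 2-bit counter that increments when its input is 1.
def next_counter(c, en):
    return (c + en) % 4

def enable(t):
    """Input signal that is 1 in even cycles and 0 in odd cycles."""
    return 1 if t % 2 == 0 else 0
```

With these example definitions, `run(0, next_counter, enable, 5)` applies the inputs of cycles 0 to 4 and therefore counts the three even cycles among them.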
Our proofs depend not only on the signal behavior in a given cycle; sometimes we also use additional information about the past or the future.
Definition* 1.1.9 Let $S$ be a signal over the domain $D$, $P$ a predicate on $D$, and $t \in \mathbb{N}$ a cycle. We introduce the shorthand notation $P^t$ for $P(S^t)$. We define a predicate indicating that $P$ held in a cycle prior to $t$, namely $\exists last_P(t) := \exists t' < t : P^{t'}$, and another predicate indicating that $P$ holds in the present cycle or in a future cycle, $\exists next_P(t) := \exists t' \geq t : P^{t'}$.
In case $\exists last_P(t)$ holds, we define the last cycle where $P$ held as $last_P(t) := \max\{t' < t : P^{t'}\}$. If $\exists next_P(t)$ holds, we define the next cycle where $P$ holds as $next_P(t) := \min\{t' \geq t : P^{t'}\}$.
If it is clear from the context which predicate $P$ is considered, we will abbreviate $last_P(t)$ with $last(t)$ and $next_P(t)$ with $next(t)$.
Based on the last definition we can formulate some properties.
Proposition* 1.1.10 Let $S$, $D$, $P$, and $t$ be as in Definition 1.1.9. Then the following properties hold:
• $\exists last_P(t) \implies P^{last_P(t)} \land \forall t' \in\ ]last_P(t) : t[\ : \neg P^{t'}$
• $\exists next_P(t) \implies P^{next_P(t)} \land \forall t' \in [t : next_P(t)[\ : \neg P^{t'}$
• $P^0 \land t > 0 \implies \exists last_P(t)$
• $t' \geq t \land \exists last_P(t) \implies \exists last_P(t')$
• $t' \leq t \land \exists next_P(t) \implies \exists next_P(t')$
• $t' < t \land P^{t'} \implies \exists last_P(t) \land last_P(t) \geq t'$
• $t' \geq t \land P^{t'} \implies \exists next_P(t) \land next_P(t) \leq t'$
• $t' \geq t \land \exists last_P(t) \implies last_P(t') \geq last_P(t)$
• $t' \leq t \land \exists next_P(t) \implies next_P(t') \leq next_P(t)$
• $t' \in\ ]last_P(t) : t] \implies last_P(t) = last_P(t')$
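The functions last and next can be made concrete by a small Python sketch. It assumes that a signal is a function from cycles to values and that the search for the next cycle is bounded by a finite `horizon`, which is an artifact of the sketch and not part of the definition; `None` stands for the case where the corresponding existence predicate does not hold. All names are illustrative.

```python
def last(P, S, t):
    """last_P(t): the greatest t' < t with P(S(t')), or None if no such cycle exists."""
    for tp in range(t - 1, -1, -1):
        if P(S(tp)):
            return tp
    return None

def next_cycle(P, S, t, horizon):
    """next_P(t): the least t' >= t with P(S(t')), searched only below horizon."""
    for tp in range(t, horizon):
        if P(S(tp)):
            return tp
    return None

# Hypothetical example signal: high in every fourth cycle (0, 4, 8, ...).
def pulse(t):
    return t % 4 == 0

def is_high(v):
    return v
```

For the example signal, the last cycle before cycle 6 in which the signal was high is cycle 4, and the next such cycle is cycle 8. The last property of the proposition can be observed directly: for every t' in ]4 : 6] the function `last` again yields 4.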
1.2 The PVS System
There exist several systems which provide an integrated environment for the development and analysis of formal specifications. One of these systems is the Prototype Verification System (PVS) [OSR92]. PVS consists of a specification language, a number of predefined libraries, a theorem prover, and examples that illustrate different methods of using the system in several application areas. Since our work builds on previous work done in PVS, we use the same verification environment.
The specification language of PVS is based on classical, typed higher-order logic. PVS includes a bitvectors library. This is important for us since the signals of our hardware implementation are represented in PVS as bitvectors. The bitvectors library provides a great variety of lemmas relating natural numbers and bitvectors. The conversion between natural numbers and bitvectors is done with the bv2nat and nat2bv functions. In our description we write the set of PVS bitvectors bvec[n] as $\mathbb{B}^n$ and a single bit as $\mathbb{B}$. Logical operations in PVS, e.g., AND, OR, and NOT, model the basic gates of the hardware circuits.
This interactive theorem prover provides all standard techniques used in paper-and-pencil proofs, e.g., induction on natural numbers, case splitting, skolemization, and application of lemmas.
We do not present all of the PVS lemmas which were used to prove the processor correctness, because some lemmas are intuitively obvious and are needed only for the formal PVS theories. Moreover, we present the proofs of non-trivial lemmas in mathematical language, omitting some formal details and hiding PVS syntax. Still, the structure of lemmas and proofs in this thesis straightforwardly reflects the structure of the lemmas in PVS.
1.3 Basic Circuits
In this section we describe the basic circuits which are used as macros in the lemmas presented later in this thesis. All these circuits were proven correct in [BJK01].
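Definition 1.3.1 Let $n \in \mathbb{N}^+$. We define the specification for some parametrized circuits:
• An n-bit adder is a circuit $Adder_n$ computing the function
\[
\begin{aligned}
&adder_n : \mathbb{B}^n \times \mathbb{B}^n \times \mathbb{B} \to \mathbb{B}^{n+1}, \\
&adder_n(a, b, c) := bin_{n+1}(\langle a \rangle + \langle b \rangle + \langle c \rangle).
\end{aligned}
\]
• An n-bit adder and subtractor is a circuit $Add\_sub_n$ computing the function
\[
\begin{aligned}
&add\_sub_n : \mathbb{B}^n \times \mathbb{B}^n \times \mathbb{B} \to \mathbb{B} \times \mathbb{B}^n, \\
&add\_sub_n(a, b, sub) = (neg, sum), \text{ where} \\
&add\_sub_n(a, b, sub).neg := (\langle a \rangle - \langle b \rangle < 0), \\
&add\_sub_n(a, b, sub).sum :=
\begin{cases}
bin_{n+1}((\langle a \rangle - \langle b \rangle) \bmod n) & \text{if } sub \\
bin_{n+1}(\langle a \rangle + \langle b \rangle) & \text{otherwise.}
\end{cases}
\end{aligned}
\]
[Figure 1.1: Organization of Virtual Memory (labels: CPU, RAM, secondary storage, virtual memory)]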
Note that we do not describe implementation and correctness of these cir-
cuits in the thesis. Implementations and correctness proofs for these circuits
are available in the PVS basic library presented in [BJK01].
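To illustrate what it means for a circuit to satisfy such a specification, the following Python sketch compares a gate-level adder with the function adder_n on all inputs of a small width. The ripple-carry construction is chosen only for illustration and is an assumption of this sketch; it is not claimed to be the implementation from [BJK01]. Bitvectors are encoded as bit tuples with index 0 as the least significant bit, and all function names are hypothetical.

```python
from itertools import product

def value(a):
    """<a>: the binary number represented by the bit tuple a (a[0] is the LSB)."""
    return sum(bit << i for i, bit in enumerate(a))

def bin_n(x, n):
    """The n-bit representation of x as a bit tuple."""
    return tuple((x >> i) & 1 for i in range(n))

def adder_spec(a, b, c):
    """Specification: adder_n(a, b, c) = bin_{n+1}(<a> + <b> + <c>)."""
    return bin_n(value(a) + value(b) + c, len(a) + 1)

def ripple_carry_adder(a, b, c):
    """Gate-level sketch: a chain of full adders built from XOR, AND, OR."""
    out = []
    carry = c
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)                 # sum bit of the full adder
        carry = (x & y) | (carry & (x ^ y))       # carry-out of the full adder
    out.append(carry)                             # bit n of the result
    return tuple(out)

def check_adder(n):
    """Compare implementation and specification on all inputs of width n."""
    bits = list(product((0, 1), repeat=n))
    return all(ripple_carry_adder(a, b, c) == adder_spec(a, b, c)
               for a in bits for b in bits for c in (0, 1))
```

The exhaustive comparison is feasible only for small n; the PVS proofs establish the correctness for arbitrary n.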
1.4 Virtual Memory
Software often requires more memory than the hardware physically provides. The most common solution to this problem is the use of virtual memory, as shown in Figure 1.1.
Virtual memory is created by using secondary storage to simulate additional random-access memory. The address space of the secondary storage is made available to the user of the computer system by mapping virtual addresses to physical addresses.
The virtual address space is divided into pages, usually a few kilobytes large. Each virtual address is split into a virtual page index and a byte index, which is an offset within the page. For a page size of $2^n$ bytes, the byte index consists of the $n$ least significant bits of the virtual address.
Virtual memory is usually much larger than physical memory, making it possible to run programs for which the total code plus data size is greater than the amount of RAM available. The excess is stored on secondary storage, usually on a hard disk. A page is copied from disk to RAM (“paged in”) when an attempt is made to access it and it is not in RAM. This is known as “demand paging”. It is performed by a collaboration between the CPU, the memory management unit, which we describe later, and the operating system kernel. The program is unaware of this collaboration; it just sees a large address space, only a part of which corresponds to physical memory. The price is that programs may run slower.
A memory management unit is a piece of hardware which performs address translation. It is located between the processor and the physical memory, which usually consists of a cache and main memory. The MMU maps the virtual page index to a physical page index and leaves the byte index unchanged. The memory contains a page table which is indexed by the virtual page index (more details in Chapter 2, Section 2.1.2). Each page table entry (PTE) contains the physical page index corresponding to the virtual one. The physical page index is combined with the page offset (byte index) to give the complete physical address.
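The translation described above can be sketched in a few lines of Python. The sketch makes several simplifying assumptions that are not taken from this thesis: a page size of 2^12 bytes, a single-level page table modeled as a dictionary from virtual to physical page indices, and no protection or validity bits. The actual page table entry format of the VAMP is defined in Chapter 2; the names below are hypothetical.

```python
PAGE_BITS = 12  # assumption for this sketch: pages of 2**12 = 4096 bytes

def split(va):
    """Split a virtual address into (virtual page index, byte index)."""
    return va >> PAGE_BITS, va & ((1 << PAGE_BITS) - 1)

def translate(va, page_table):
    """Map a virtual address to a physical address via a one-level page table."""
    vpi, byte_index = split(va)
    ppi = page_table[vpi]                     # page table entry: physical page index
    return (ppi << PAGE_BITS) | byte_index    # the byte index is left unchanged
```

For example, with a page table that maps virtual page 3 to physical page 0x7F, the virtual address 0x3ABC is translated to 0x7FABC: only the page index changes, the offset 0xABC is kept.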
For each page the PTE may also contain the following information:
• whether the page has been written to,
• when it was last used,
• whether a process may read and write it,
• whether the page is stored in the physical memory or on the hard disk.
If no physical memory has been allocated to a given virtual page, the MMU signals a “page fault” to the CPU. The operating system then tries to find an unused page in RAM and loads the requested page into it. If there is no free RAM, it chooses an occupied page using some replacement algorithm, saves that page to disk, and reuses its place.
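The page fault handling just described can be sketched as follows. This is a toy model under explicit assumptions: physical memory is a fixed number of page frames, the replacement algorithm is FIFO (chosen only because it is short), and the disk is represented by a log of swapped-out pages. None of the names below come from the thesis or from the Verisoft operating system.

```python
from collections import OrderedDict

class PageFault(Exception):
    """Raised by the (modeled) MMU when a virtual page is not resident."""

class Pager:
    def __init__(self, num_frames):
        self.free = list(range(num_frames))   # unused page frames in RAM
        self.resident = OrderedDict()         # virtual page -> frame, in load order
        self.swapped_out = []                 # pages saved to disk (log only)

    def lookup(self, vpi):
        """MMU side: return the frame of a resident page or signal a page fault."""
        if vpi not in self.resident:
            raise PageFault(vpi)
        return self.resident[vpi]

    def handle_fault(self, vpi):
        """OS side: find a frame for the page, evicting the oldest page if needed."""
        if self.free:
            frame = self.free.pop(0)
        else:
            victim, frame = self.resident.popitem(last=False)  # FIFO replacement
            self.swapped_out.append(victim)                    # save victim to disk
        self.resident[vpi] = frame
        return frame

    def access(self, vpi):
        """An access that transparently pages in on demand."""
        try:
            return self.lookup(vpi)
        except PageFault:
            return self.handle_fault(vpi)
```

With two frames, accessing pages 10, 11, and then 12 forces page 10 out of RAM; a later lookup of page 10 faults again.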
In this thesis we prove the correctness of a system which contains three
components: a processor, a memory management unit, and a physical mem-
ory. Note that our model does not contain any secondary storage, but the
processor has external interrupts which can help to extend the model with
the secondary storage [HIP05].
1.5 Proof Decomposition
We now give an overview of the overall correctness proof of the VAMP processor with support for address translation. Since our work is an extension of the previous work of our chair, we base this thesis on the previous proof decomposition described in [Bey05]. It consists of the modules presented in Figure 1.2.
We want to reuse as many lemmas as possible from the previous proof without changes. Therefore we proceed as follows:
[Figure 1.2: Proof Decomposition of the VAMP without Address Translation (labels: processor (CPU), memory interface, instruction cache, data cache, bus protocol, Physical Memory)]
[Figure 1.3: Proof Decomposition of the VAMP with Address Translation (labels: external interrupts, processor (CPU), IMMU (fetch), DMMU (load / store), memory interface, Physical Memory)]
- First, in order to extend the VAMP with two new modules we need to change the CPU module because we need some additional registers. We also need to extend the CPU module with external interrupts, with the help of which we can work with external devices, e.g. a hard disk as secondary storage.
- Second, the memory interface developed in the previous work does not differ from a standard dual-port RAM interface (one port for instructions and one for data). We want to keep this interface, because then the CPU module can be used together with any RAM which has the same interface. Therefore we try not to change the memory interface at all. In this case we can reuse all the old proofs for the module Physical Memory.
A model of a processor with address translation is presented in Figure 1.3. There are two additional modules in this model:
• the Instruction MMU (IMMU), which performs translated and untranslated memory accesses for instruction fetch,
• the Data MMU (DMMU), which performs translated and untranslated memory accesses for load / store instructions.

Table 1.1: The Memory Interface (between MMUs and Physical Memory)

Signal               Description
Memory interface input to DMMU
dadr[a − 1 : 0]      word address of the data access
din[8 · B − 1 : 0]   data word to be written
mw                   signals data write access
mr                   signals data read access
mbw[B − 1 : 0]       selects bytes of the data word to be written
Memory interface input to IMMU
iadr[a − 1 : 0]      word address of the instruction access
imr                  signals instruction read access
clear                initializes memory interface (the signal comes directly from CPU)
Memory interface output from IMMU
ibusy                signals pending instruction access
inst[8 · B − 1 : 0]  read data on finished instruction access
Memory interface output from DMMU
dbusy                signals pending data access
dout[8 · B − 1 : 0]  read data on finished data access
1.5.1 The Memory Interface Layer
We keep the definition of the memory interface which was described in
[Bey05], with one difference: the initiator of a memory request is now the
DMMU or the IMMU, not the CPU. We can also divide the memory
interface into two parts: the first part is the interface with the IMMU
and the second the interface with the DMMU. Note that the signal clear
still comes directly from the processor (not from one of the MMUs).
Definition* 1.5.1 A memory system MI for a pipelined microprocessor
is a piece of hardware with inputs and outputs according to Table 1.1, where
$2^a$ is the number of cells stored in the memory for $a \in \mathbb{N}$ and $B$ is the
number of bytes stored per memory cell. Note that Beyer calls MI a memory
interface.
We call the sequence of MMU and CPU outputs to the memory interface
valid if there is only an initial clear, the read and write signals on the data
port are never raised simultaneously, and any data or instruction access is
stalled by an active dbusy or ibusy, respectively. Formally, we have
• $\forall t \in \mathbb{N} : clear^t \iff t = 0$
• $\forall t \in \mathbb{N}^+ : \neg mw^t \lor \neg mr^t$
• $\forall t \in \mathbb{N}^+ : (mr^t \lor mw^t) \land dbusy^t \implies in_d^{t+1} = in_d^t$,
where $in_d \in \{dadr, din, mw, mr, mbw\}$
• $\forall t \in \mathbb{N}^+ : imr^t \land ibusy^t \implies in_i^{t+1} = in_i^t$, where $in_i \in \{iadr, imr\}$
[Timing diagram; signals shown: clk, mr, mw, dadr, dout, din, dbusy]
Figure 1.4: Timing of the Memory Interface for DMMU
A partial timing diagram of the memory interface is depicted in Figure 1.4.
The DMMU starts a request in a cycle $t$ (one of the two signals, $mr^t$ for
a read or $mw^t$ for a write, is active). The address $dadr^t$, the data $din^t$,
and $mbw^t$ keep their values during the whole duration of the request, until
the cycle $t' \ge t$ in which $dbusy^{t'}$ is inactive.
The timing diagram between the IMMU and the memory is similar to the
previous one. Note that on the IMMU side we have only read accesses.
We now define the correctness of the memory interface with valid inputs
from DMMU and IMMU . The memory interface is correct if it is live and
consistent according to the following definition.
Definition* 1.5.2 Let $init\_mem \in (\mathbb{B}^{8 \cdot B})^{2^a}$ be the initial memory content
of a memory interface.
We introduce a parameterized predicate on the memory interface I/O by
\[ MI.bw(ad, b) := (ad = dadr) \land mw \land mbw[b] \land \neg dbusy \]
in order to capture a write to byte $b$ of address $ad$ and define the memory
content $M_I$ in cycle $t \in \mathbb{N}^+$ recursively as follows:
\[ M_I^1 := init\_mem \]
\[ \left| M_I^{t+1}[\langle ad \rangle] \right|_b :=
\begin{cases}
\left| din^t \right|_b & \text{if } MI.bw(ad, b)^t \\
\left| M_I^t[\langle ad \rangle] \right|_b & \text{otherwise}
\end{cases} \]
We call a memory interface correct iff on valid input from the MMUs
according to Definition 1.5.1, the following conditions hold for all $t \in \mathbb{N}^+$:
1. $mr^t \land \neg dbusy^t \implies dout^t = M_I^t[\langle dadr^t \rangle]$ (data cache consistency)
2. $imr^t \land \neg ibusy^t \implies inst^t = M_I^t[\langle iadr^t \rangle]$ (instruction cache
consistency)
3. $\exists\, next_{\neg dbusy}(t)$ (data cache liveness)
4. $\exists\, next_{\neg ibusy}(t)$ (instruction cache liveness)
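The recursive definition of the memory content can be replayed on a finite
sequence of interface signals. The following Python fragment is only a
minimal sketch: the function name replay_memory, the dictionary layout of a
cycle, and the representation of a memory cell as a list of B byte values
are assumptions made for illustration and are not part of the formal model.

```python
from typing import Dict, List


def replay_memory(init_mem: List[List[int]], cycles: List[Dict], B: int) -> List[List[int]]:
    """Replay byte writes on a copy of init_mem (hypothetical helper).

    init_mem: one list of B byte values per memory cell.
    cycles:   per-cycle signal values with keys "mw", "dbusy", "dadr",
              "din" (list of B bytes) and "mbw" (list of B booleans).
    """
    mem = [list(cell) for cell in init_mem]  # do not modify the caller's data
    for c in cycles:
        # A byte b of cell dadr is written only if mw and mbw[b] are active
        # and dbusy is inactive; otherwise the old byte is kept.
        if c["mw"] and not c["dbusy"]:
            for b in range(B):
                if c["mbw"][b]:
                    mem[c["dadr"]][b] = c["din"][b]
    return mem
```

A write that is still stalled by dbusy leaves the memory unchanged in this
sketch; only the cycle in which dbusy is inactive updates the selected bytes.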
1.5.2 The CPU Interface Layer
We now define the last interface according to Figure 1.3. This interface is
divided into two parts: the first part is the interface between the CPU and
the DMMU and the second is the interface between the CPU and the IMMU.
Definition 1.5.3 A CPU interface CPUI for a pipelined microprocessor
is a circuit with inputs and outputs according to Table 1.2.
We call CPU outputs to the CPU interface valid if the read and write
signals on the data port are never raised simultaneously and any data or
instruction access is stalled by an active dbusy or ibusy, respectively.
Formally, we have
• $\forall t \in \mathbb{N}^+ : \neg mw^t \lor \neg mr^t$
• $\forall t \in \mathbb{N}^+ : (mr^t \lor mw^t) \land dbusy^t \implies in_d^{t+1} = in_d^t$,
where $in_d \in \{dadr, dout, mw, mr, mbw, dpto, dptl\}$
• $\forall t \in \mathbb{N}^+ : imr^t \land ibusy^t \implies in_i^{t+1} = in_i^t$,
where $in_i \in \{iadr, imr, ipto, iptl\}$
A partial timing diagram of the CPUI interface is depicted in Figure 1.5.
The timing diagram between the IMMU and the CPU is similar to it in the
case of a read access.
In the following chapter we present the design and formal verification of
the MMU which is used in Chapter 3 as instruction MMU and data MMU .
In Chapter 3, we specify a processor model, then we build hardware for it
and verify the data consistency of the whole VAMP processor.
Signal               Description
Interface output from CPU
  dadr[a−1 : 0]      word address of the data access
  dout[8·B−1 : 0]    data word to be written
  mw                 signals data write access
  mr                 signals data read access
  dpto[19 : 0]       data page table origin
  dptl[19 : 0]       data page table length
  dmode              system/user mode for the data access
  mbw[B−1 : 0]       selects bytes of the data word to be written
  iadr[a−1 : 0]      word address of the instruction access
  imr                signals instruction read access
  ipto[19 : 0]       instruction page table origin
  iptl[19 : 0]       instruction page table length
  imode              system/user mode for the instruction access
  clear              initializes memory interface
Interface input to CPU
  ibusy              signals pending instruction access
  inst[8·B−1 : 0]    read data on finished instruction access
  iexcp              signals exception on finished instruction access
  dbusy              signals pending data access
  din[8·B−1 : 0]     read data on finished data access
  dexcp              signals exception on finished data access
Table 1.2: The CPUI Interface (between CPU and MMUs)
[Timing diagram; signals shown: clk, mr, mw, dadr, din, dout, dpto, dptl,
dmode, dbusy, dexcp]
Figure 1.5: Timing of the Interface between CPU and DMMU
Chapter 2
The Memory Management Unit
The work presented in this chapter is based on Hillebrand's work [Hil05].
In this chapter we introduce the specification and the implementation of a
memory management unit and we formally verify the MMU. As mentioned
before, the MMU is a hardware unit which supports the implementation of
virtual memory; it is located between the processor and the memory.
2.1 Specification of the MMU
Since the MMU communicates with the processor and the memory, we specify
both interfaces. The set of all signals in an interface is called an interface
observation. We also specify the memory configuration.
We define the configuration $C_{spec}$ for the specification of the MMU. This
configuration has the following components:
• $Iobs_p$ – the processor interface observations.
• $Iobs_m$ – the memory interface observations.
• $Memory$ – the configuration of the physical memory.
An element $c \in C_{spec}$ of the MMU specification configuration is a triple:
$c = (iobs_p \in Iobs_p,\ iobs_m \in Iobs_m,\ mem \in Memory)$
The processor interface observation $iobs_p$ is a 12-tuple:
$iobs_p = (t, mw, mr, mbw, addr, dout, din, pto, ptl, excp, busy, reset)$
We first describe the components which are inputs to the MMU:
• $t \in \mathbb{B}$ – the memory access type: if this flag is set we have a
translated memory access, otherwise an untranslated memory access.
• $mw \in \mathbb{B}$ – if this flag is set we have a write access.
• $mr \in \mathbb{B}$ – if this flag is set we have a read access.
• $mbw \in \mathbb{B}^8$ – the memory byte write signals. As every memory request
accesses a double word, the signal mbw indicates which bytes
are written. We use this component only in the case of a write access.
• $addr \in \mathbb{B}^{29}$ – data address of the memory access. Depending on the
flag t, the address is either a virtual or a physical one.
• $dout \in \mathbb{B}^{64}$ – the data to be stored in the memory. This data is
used for write accesses only.$^1$
• $pto \in \mathbb{B}^{20}$ – the page table origin is used only for translated memory
operations. The address $\langle pto \circ 0^{12} \rangle$ is the first address of the page
table.
• $ptl \in \mathbb{B}^{20}$ – the page table length is used only for translated memory
accesses and shows the size of the page table.
• $reset \in \mathbb{B}$ – the reset flag is active in case of a processor reset.
The rest of the components are outputs from the MMU:
• $din \in \mathbb{B}^{64}$ – the data read from the memory. The data is used for
read accesses only.
• $excp \in \mathbb{B}$ – the exception flag; this flag is set in case the memory
operation is terminated abnormally.
• $busy \in \mathbb{B}$ – the busy flag signals to the processor that the MMU is busy.
The memory interface observation $iobs_m$ is a 7-tuple:
$iobs_m = (mw, mr, mbw, addr, dout, din, busy)$
We now describe the components which are inputs to the memory:
• $mw \in \mathbb{B}$ – if this flag is set we have a write access.
• $mr \in \mathbb{B}$ – if this flag is set we have a read access.
• $mbw \in \mathbb{B}^8$ – the memory byte write signals. As every memory request
accesses a double word, the signal mbw indicates which bytes are written.
We use this component only in the case of a write access.
• $addr \in \mathbb{B}^{29}$ – data address in the physical memory.
• $dout \in \mathbb{B}^{64}$ – the data to be written into the memory.
$^1$ In this chapter, we use the nomenclature in and out from a processor's
point of view, e.g. dout is an output from the processor's point of view.
[Block diagram: Processor, MMU, Memory. Signals between Processor and MMU:
mw, addr, dout, mbw, mr, pto, ptl, t, busy, din, excp. Signals between MMU
and Memory: mw, addr, dout, mbw, mr, busy, din]
Figure 2.1: Interfaces of the MMU
The rest of the components are outputs from the memory:
• $din \in \mathbb{B}^{64}$ – the data read from the memory.
• $busy \in \mathbb{B}$ – the busy flag signals to the MMU that the memory is busy.
Both interfaces are shown in Figure 2.1.
Definition 2.1.1 Depending on the read and write signals we define a new
signal for the interface $iobs_p$ by
$iobs_p.req = iobs_p.mw \lor iobs_p.mr$
and similarly for the interface $iobs_m$ by
$iobs_m.req = iobs_m.mw \lor iobs_m.mr$.
Note also that we use the short notation $iobs_p.inputs$ ($iobs_m.inputs$) for
the processor (memory) inputs, i.e. all signals which are inputs of the MMU
(physical memory). Similarly, we write $iobs_p.outputs$ ($iobs_m.outputs$)
for the processor (memory) outputs, i.e. all signals which are outputs of the
MMU (physical memory).
Note that we use record notation for both interfaces, e.g. component din
of $iobs_m$ is denoted by $iobs_m.din$ and $iobs_p.pto$ denotes component pto of
$iobs_p$.
Definition 2.1.2 We call a memory configuration mem a function that
maps 29-bit addresses to 64-bit data, i.e. the memory is organized in $2^{29}$
double words:
$mem \in Memory = \{f : \mathbb{Z}_{2^{29}} \to \mathbb{B}^{64}\}$
A function that maps times $t \in \mathbb{N}$ to the configuration of the MMU
specification $f(t) \in C_{spec}$ is called a trace.
Definition 2.1.3 We define the set of all specification traces as follows:
$Trace = \{f : \mathbb{N} \to C_{spec}\}$
2.1.1 Assumption for the MMU
We make assumptions on the signals which are inputs of the MMU. Later, in
Chapter 3, we need to prove these assumptions when we integrate the MMU
into the processor according to Definition 1.5.1 and Definition 1.5.3.
Definition 2.1.4 Let s be any signal from the interfaces $iobs_p$ or $iobs_m$. We
use the short notation $s_x$ for $iobs_x.s$, where x denotes one of the two
interfaces: p for the processor and m for the memory.
We can now define the following properties of the processor interface.
We formulate these properties as predicates on traces. For the whole section
$trc \in Trace$ denotes a trace.
Definition 2.1.5 We call the signals read and write of the processor
interface mutually exclusive if the following predicate holds:
\[ p\_mr\_mw\_mutexc(trc) := \forall t \in \mathbb{N} : \neg(trc(t).mr_p \land trc(t).mw_p) \]
Definition 2.1.6 We call the input signals of the processor interface stable
if the following predicate holds:
\[ \begin{aligned}
p\_req\_is\_stable(trc) := {} & \forall t \in \mathbb{N} : trc(t).req_p \land trc(t).busy_p \implies \\
& trc(t+1).inputs_p = trc(t).inputs_p
\end{aligned} \]
Now we define a predicate for the processor request.
Definition 2.1.7 We define the predicate is_req_proc for a processor
request in the following way:
\[ \begin{aligned}
is\_req\_proc(t, t', trc) := {} & (t > 0 \implies \neg trc(t-1).busy_p) \; \land \\
& (t \le t') \land trc(t).req_p \land \neg trc(t').busy_p \; \land \\
& \forall t'' \in [t : t'[ \; : trc(t'').busy_p
\end{aligned} \]
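The predicates above can be evaluated on a finite prefix of a trace. The
following Python fragment is a minimal sketch under stated assumptions: a
trace is modelled as a list of dictionaries holding only the processor-side
signals, and the helper names cycle, req_p and inputs_p are hypothetical
names chosen for illustration. A check on a finite prefix can of course only
refute the predicates, which quantify over all of N, and not establish them.

```python
from typing import Dict, List

# Names of the processor-side signals that are inputs of the MMU.
INPUT_SIGNALS = ("t", "mw", "mr", "mbw", "addr", "dout", "pto", "ptl", "reset")


def cycle(mr=False, mw=False, busy=False, addr=0) -> Dict:
    """Build one trace entry (hypothetical helper; unused signals are zero)."""
    return {"t": False, "mw": mw, "mr": mr, "mbw": 0, "addr": addr,
            "dout": 0, "pto": 0, "ptl": 0, "reset": False, "busy": busy}


def req_p(c: Dict) -> bool:
    """req_p = mw_p or mr_p."""
    return bool(c["mw"] or c["mr"])


def inputs_p(c: Dict) -> tuple:
    """All processor-side inputs of the MMU in one cycle."""
    return tuple(c[name] for name in INPUT_SIGNALS)


def p_mr_mw_mutexc(trc: List[Dict]) -> bool:
    """Read and write are never active in the same cycle."""
    return all(not (c["mr"] and c["mw"]) for c in trc)


def p_req_is_stable(trc: List[Dict]) -> bool:
    """Inputs keep their values while a request is pending and busy is active."""
    return all(inputs_p(trc[i + 1]) == inputs_p(trc[i])
               for i in range(len(trc) - 1)
               if req_p(trc[i]) and trc[i]["busy"])


def is_req_proc(t: int, t_end: int, trc: List[Dict]) -> bool:
    """A processor request starts in cycle t and ends in cycle t_end."""
    starts_fresh = t == 0 or not trc[t - 1]["busy"]
    return (starts_fresh and t <= t_end and req_p(trc[t])
            and not trc[t_end]["busy"]
            and all(trc[k]["busy"] for k in range(t, t_end)))
```

For example, for the prefix [cycle(), cycle(mr=True, busy=True, addr=5),
cycle(mr=True, busy=True, addr=5), cycle(mr=True, addr=5), cycle()] the call
is_req_proc(1, 3, trc) holds, whereas is_req_proc(2, 3, trc) does not,
because busy is still active in cycle 1.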
The signal $busy_p$ is active during the whole processor request except for
the last cycle. Since the state of $inputs_p$ can only change when $busy_p$ is
inactive (Definition 2.1.6), we can prove in the following lemma that all
processor inputs are stable during a processor request.
Lemma 2.1.8 All processor inputs are stable during a processor request:
\[ \begin{aligned}
& \forall t, t', t'' \in \mathbb{N} : t \le t' \le t'' \implies \\
& \quad (is\_req\_proc(t, t'', trc) \land p\_req\_is\_stable(trc)) \implies \\
& \quad trc(t).inputs_p = trc(t').inputs_p
\end{aligned} \]
Proof: Let $t \le t' \le t''$. We show the claim by induction on $t'$. For the base
case $t' = t$ there is nothing to show. For the induction step $t' \to t' + 1$ we
have that
\[ \begin{aligned}
& t' < t'' \; \land \\
& is\_req\_proc(t, t'', trc) \; \land \\
& p\_req\_is\_stable(trc) \; \land \\
& trc(t).inputs_p = trc(t').inputs_p
\end{aligned} \]
where the last conjunct is the induction hypothesis. We have to show that
\[ trc(t).inputs_p = trc(t' + 1).inputs_p. \]
Since $t \le t' < t''$, the predicate $is\_req\_proc(t, t'', trc)$ yields
$trc(t').busy_p$. Moreover, $trc(t).req_p$ holds and by the induction hypothesis
the inputs in cycle $t'$ equal those in cycle $t$, so
$(trc(t').mr_p \lor trc(t').mw_p) \land trc(t').busy_p$ holds. With the help of
$p\_req\_is\_stable(trc)$ at time $t'$ we conclude:
\[ trc(t').inputs_p = trc(t' + 1).inputs_p \]
Together with the induction hypothesis this finishes the proof. □
Finally, we define the overall correctness criterion of the processor
interface as follows:
Definition 2.1.9 Processor inputs are called correct if the read and write
signals are mutually exclusive and the inputs are stable. Formally:
good_p_interface(trc) :=
p_mr_mw_mutexc(trc) ∧
p_req_is_stable(trc)
[Timing diagram; signals shown: $req_m$, $busy_m$]
Figure 2.2: Example of the Behavior of the Signal $busy_m$
For the interface between the MMU and the memory we make some
assumptions which are based on the properties of the signals in
Definition 1.5.1. We now define a predicate for a memory request similar to
the predicate is_req_proc.
Definition 2.1.10 We define the predicate is_req_mem in the following
way:
\[ \begin{aligned}
is\_req\_mem(t, t', trc) := {} & (t > 0 \implies (\neg trc(t-1).busy_m \lor \neg trc(t-1).req_m)) \; \land \\
& t \le t' \land trc(t).req_m \land \neg trc(t').busy_m \; \land \\
& \forall t'' \in [t : t'[ \; : trc(t'').busy_m
\end{aligned} \]
Note that this predicate is defined slightly differently than is_req_proc.
For $t > 0$ a memory request can also start in cycle $t$ when
$busy_m^{t-1} \land \neg req_m^{t-1}$ holds. Hence the start of a request in cycle $t$ does not
depend on the signal $busy_m$ in cycle $t - 1$ if there was no request to the
memory in that cycle (see Figure 2.2), i.e. this definition can also be used
with a memory interface in which the signal busy is undefined while there
are no requests.
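Definition 2.1.11 We call memory inputs live if the predicate
m_liveness(trc) holds, i.e. formally we have
\[ \begin{aligned}
m\_liveness(trc) := {} & \forall t \in \mathbb{N} : trc(t).req_m \implies \\
& \exists t' \in \mathbb{Z}_{\ge t} : \neg trc(t').busy_m
\end{aligned} \]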
We split the consistency of the memory into three assumptions. The
first assumption covers the behavior of the memory except in the last cycle
of a memory request and the other two cover terminating read and write
accesses, respectively. Note that we assume here that the MMU has exclusive
access to the memory.
Definition 2.1.12 The predicate m_ack_write(trc) holds if the memory
does not change whenever there is no read or write signal at the memory
inputs or the memory is busy:
\[ \begin{aligned}
m\_ack\_write(trc) := {} & \forall t \in \mathbb{N} : (\neg trc(t).req_m \lor trc(t).busy_m) \\
& \implies trc(t).mem = trc(t+1).mem
\end{aligned} \]
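Definition 2.1.13 The predicate m_read_consist(trc) holds if at the end
of any memory read access we have the correct data on the output from the
memory and during the last cycle of the access the memory does not change:
\[ \begin{aligned}
m\_read\_consist(trc) := {} & \forall t, t' \in \mathbb{N} : is\_req\_mem(t, t', trc) \land trc(t).mr_m \implies \\
& trc(t').din_m = trc(t').mem[\langle trc(t).addr_m \rangle] \; \land \\
& trc(t').mem = trc(t'+1).mem
\end{aligned} \]
Definition 2.1.14 The predicate m_write_consist(trc) holds if at the end
of any memory write access the memory is updated in the following way:
\[ \begin{aligned}
m\_write\_consist(trc) := {} & \forall t, t' \in \mathbb{N} : is\_req\_mem(t, t', trc) \land trc(t).mw_m \implies \\
& \forall a \in \mathbb{B}^{29} : (a \ne trc(t).addr_m \implies \\
& \qquad trc(t').mem[\langle a \rangle] = trc(t'+1).mem[\langle a \rangle]) \; \land \\
& \forall b \in \mathbb{Z}_8 : \left| trc(t'+1).mem[\langle trc(t).addr_m \rangle] \right|_b = \\
& \qquad \begin{cases}
\left| trc(t).dout_m \right|_b & \text{if } trc(t).mbw_m[b] \\
\left| trc(t').mem[\langle trc(t).addr_m \rangle] \right|_b & \text{otherwise}
\end{cases}
\end{aligned} \]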
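The byte-wise update required by m_write_consist can be illustrated on
64-bit integers. The following Python fragment is only a sketch: it assumes
that byte b of a double word occupies bits [8b+7 : 8b], models the memory as
a sparse dictionary in which missing addresses read as 0, and uses the
hypothetical helper names byte_of, masked_write and write_step_ok.

```python
def byte_of(x: int, b: int) -> int:
    """Byte b of x, assuming byte b occupies bits [8b+7 : 8b]."""
    return (x >> (8 * b)) & 0xFF


def masked_write(old: int, dout: int, mbw: int) -> int:
    """Merge dout into old: byte b is taken from dout iff bit b of mbw is set."""
    new = 0
    for b in range(8):
        source = dout if (mbw >> b) & 1 else old
        new |= byte_of(source, b) << (8 * b)
    return new


def write_step_ok(mem_before: dict, mem_after: dict,
                  addr: int, dout: int, mbw: int) -> bool:
    """Check the conclusion of m_write_consist for one terminating write."""
    addresses = set(mem_before) | set(mem_after)
    # All double words except the addressed one must be unchanged.
    others_unchanged = all(mem_before.get(a, 0) == mem_after.get(a, 0)
                           for a in addresses if a != addr)
    expected = masked_write(mem_before.get(addr, 0), dout, mbw)
    return others_unchanged and mem_after.get(addr, 0) == expected
```

With mbw = 0b00000011 only the two lowest bytes of the addressed double word
are replaced, which corresponds to the two cases of the definition above.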
Finally, we define the predicate for the overall correctness of the memory
interface in the following way:
Definition 2.1.15 We call memory inputs and a memory configuration correct
if the memory is consistent and live:
good_m_interface(trc) :=
m_liveness(trc) ∧
m_ack_write(trc) ∧
m_read_consist(trc) ∧
m_write_consist(trc)
2.1.2 Guarantees of the MMU
In this section we define properties which model correct MMU behavior.
We start with liveness.
Definition 2.1.16 The predicate p_liveness(trc) holds if all processor
requests end in finite time, i.e., after the read or write signal is set at the
inputs of the MMU, the MMU eventually releases busy:
\[ p\_liveness(trc) := \forall t \in \mathbb{N} : trc(t).req_p \implies \exists t' \in \mathbb{Z}_{\ge t} : \neg trc(t').busy_p \]
We also have to guarantee, according to Definition 1.5.1, that the read and
write signals are not set simultaneously.
Definition 2.1.17 The predicate m_mr_mw_mutexc(trc) holds if the
read and write signals to the memory are mutually exclusive:
\[ m\_mr\_mw\_mutexc(trc) := \forall t \in \mathbb{N} : \neg(trc(t).mr_m \land trc(t).mw_m) \]
The MMU keeps the inputs of the memory stable during requests.
Definition 2.1.18 The predicate m_req_is_stable(trc) holds if all input
signals of the memory interface stay stable whenever there is a memory
request and the memory is busy in the same cycle:
\[ \begin{aligned}
m\_req\_is\_stable(trc) := {} & \forall t \in \mathbb{N} : (trc(t).mr_m \lor trc(t).mw_m) \land trc(t).busy_m \implies \\
& trc(t).mr_m = trc(t+1).mr_m \; \land \\
& trc(t).mw_m = trc(t+1).mw_m \; \land \\
& trc(t).addr_m = trc(t+1).addr_m \; \land \\
& trc(t).mbw_m = trc(t+1).mbw_m \; \land \\
& trc(t).dout_m = trc(t+1).dout_m
\end{aligned} \]
Based on the definition of a memory request and the previous property
we can prove, in analogy to Lemma 2.1.8, that the inputs remain stable.
Lemma 2.1.19 All memory inputs are stable as long as the memory is busy:
\[ \begin{aligned}
& \forall t, t', t'' \in \mathbb{N} : t \le t' \le t'' \implies \\
& \quad (is\_req\_mem(t, t'', trc) \land m\_req\_is\_stable(trc)) \implies \\
& \quad trc(t).inputs_m = trc(t').inputs_m
\end{aligned} \]
We omit the proof of this lemma because it is similar to the proof of
Lemma 2.1.8.
Now we specify all the MMU operations. The MMU should realize four
types of memory operations as requested by the processor:
• Untranslated read and write – directly access the memory using 29-bit
physical addresses.
• Translated read and write – access the memory using 29-bit virtual
addresses. However, the translation of a virtual address can fail, in
which case the memory operation cannot be executed. In this case the
MMU should signal an exception to the processor.
We start with the definition of the untranslated read. In this case the
address from the processor is not translated. The memory content at the
processor address is returned, and no exception is signaled.
Definition 2.1.20 An untranslated read is specified by:
\[ \begin{aligned}
untr\_read(trc) := {} & \forall t, t' \in \mathbb{N} : is\_req\_proc(t, t', trc) \; \land \\
& trc(t).mr_p \land \neg trc(t).iobs_p.t \implies \\
& trc(t').din_p = trc(t').mem[\langle trc(t).addr_p \rangle] \; \land \\
& trc(t').mem = trc(t'+1).mem \land \neg trc(t').iobs_p.excp
\end{aligned} \]
In the case of an untranslated write access we only write the data from the
processor into the memory at the processor address. Depending on the
memory byte write signals this can be one, two, four, or eight bytes. Again,
no exception is signaled.
Definition 2.1.21 An untranslated write access is specified by:
\[ \begin{aligned}
untr\_write(trc) := {} & \forall t, t' \in \mathbb{N} : is\_req\_proc(t, t', trc) \; \land \\
& trc(t).mw_p \land \neg trc(t).iobs_p.t \implies \\
& \forall a \in \mathbb{B}^{29} : (a \ne trc(t).addr_p \implies \\
& \qquad trc(t').mem[\langle a \rangle] = trc(t'+1).mem[\langle a \rangle]) \; \land \\
& \neg trc(t').iobs_p.excp \; \land \\
& \forall b \in \mathbb{Z}_8 : \left| trc(t'+1).mem[\langle trc(t).addr_p \rangle] \right|_b = \\
& \qquad \begin{cases}
\left| trc(t).dout_p \right|_b & \text{if } trc(t).mbw_p[b] \\
\left| trc(t').mem[\langle trc(t).addr_p \rangle] \right|_b & \text{otherwise}
\end{cases}
\end{aligned} \]
For the translated operations we have to define an additional function
for address translation. This function is called decodeitr (decode
implementation translation function):
$decodeitr : \mathbb{B}^{20} \times \mathbb{B}^{20} \times Memory \times \mathbb{B} \times \mathbb{B}^{32} \to \mathbb{B} \times \mathbb{B}^{32}$
[Diagram labels: px, bx, $0^2$, pto, $0^{12}$, +, Page Table, ppx, v, p, pa;
bus widths 32, 20, 12]
Figure 2.3: Address Translation for the virtual address
Figure 2.3 shows the principle of address translation, which is done by
the decodeitr function. This function takes the following five inputs:
• $pto \in \mathbb{B}^{20}$ – the page table origin,
• $ptl \in \mathbb{B}^{20}$ – the page table length,
• $mem \in Memory$ – the memory configuration,
• $mw \in \mathbb{B}$ – the type of request, read or write,
• $va \in \mathbb{B}^{32}$ – the virtual address.
The function decodeitr has two outputs $excp \in \mathbb{B}$ and $pa \in \mathbb{B}^{32}$, where
excp indicates a translation exception and pa is the physical address, i.e.,
the translated memory address. For the calculation of excp and pa we define
some intermediate values. The virtual address is decomposed into a page
index $px \in \mathbb{B}^{20}$ and a byte index $bx \in \mathbb{B}^{12}$:
$va = px \circ bx$
The page table consists of page table entries pte. Each pte is 4 bytes
wide. The number of page table entries in the page table is equal to ptl + 1.
Based on pto and px we define the page table entry address $ptea \in \mathbb{B}^{32}$ as
follows:
$ptea = (\langle pto \circ 0^{12} \rangle + \langle px \circ 0^2 \rangle) \bmod 2^{32}$
[Page table entry layout with the fields ppx, v, p and the unused low-order
bits]
Figure 2.4: Page Table Entry
Based on ptea we define the page table entry $pte \in \mathbb{B}^{32}$ as follows
\[ pte =
\begin{cases}
mem(\lfloor ptea / 8 \rfloor)[31 : 0] & \text{if } ptea \bmod 8 = 0 \\
mem(\lfloor ptea / 8 \rfloor)[63 : 32] & \text{otherwise}
\end{cases} \qquad (2.1) \]
where $\lfloor a \rfloor$ for $a \in \mathbb{R}$ denotes rounding down to the next integer.
The page table entry $pte \in \mathbb{B}^{32}$ (see Figure 2.4) consists of several fields:
• ppx = pte[31 : 12] – the physical page index,
• v = pte[11] – the valid bit which indicates whether the page is in the
physical memory or on the secondary storage,
• p = pte[10] – the write protection bit,
• pte[9 : 0] – we do not use the last ten bits.
We have an exception in the following cases:
• the page index is larger than the page table length, i.e., the access
would be outside the page table,
• we have a write access and the page is protected,
• the page is not in the physical memory, as indicated by the valid bit.
Therefore excp is given by the following equations:
$lexcp = \langle px \rangle > \langle ptl \rangle$
$pteexcp = (mw \land p) \lor \neg v$
$excp = lexcp \lor pteexcp$
We calculate the physical address as follows: in the case of $\neg excp$, pa is
the concatenation of the physical page index and the byte index. Otherwise,
pa is set to $0^{32}$:
\[ pa =
\begin{cases}
0^{32} & \text{if } excp \\
ppx \circ bx & \text{otherwise}
\end{cases} \]
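The translation steps above can be summarized in a short executable sketch.
The following Python fragment is an illustration under stated assumptions,
not the formal model: bit vectors are represented by non-negative integers,
the memory is a dictionary from double-word addresses to 64-bit integers in
which missing entries read as 0, and the function name decode_itr is a
hypothetical Python spelling of decodeitr.

```python
def decode_itr(pto: int, ptl: int, mem: dict, mw: bool, va: int):
    """Return (excp, pa) for the virtual address va (illustrative sketch)."""
    px = (va >> 12) & 0xFFFFF          # page index, va[31:12]
    bx = va & 0xFFF                    # byte index, va[11:0]

    # Page table entry address: <pto o 0^12> + <px o 0^2> modulo 2^32.
    ptea = ((pto << 12) + (px << 2)) % 2**32

    # The memory is double-word addressed; select the lower or upper word.
    dword = mem.get(ptea // 8, 0)
    if ptea % 8 == 0:
        pte = dword & 0xFFFFFFFF
    else:
        pte = (dword >> 32) & 0xFFFFFFFF

    ppx = (pte >> 12) & 0xFFFFF        # physical page index, pte[31:12]
    v = bool((pte >> 11) & 1)          # valid bit, pte[11]
    p = bool((pte >> 10) & 1)          # write protection bit, pte[10]

    lexcp = px > ptl                   # access outside the page table
    pteexcp = (mw and p) or not v      # protected write or invalid page
    excp = lexcp or pteexcp

    pa = 0 if excp else (ppx << 12) | bx
    return excp, pa
```

For instance, with pto = 1 and px = 1 the entry address is 4096 + 4 = 4100,
so the entry is the upper word of the double word at address 512.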
We use the previously defined variables to define the result of a
translation r_t of a processor request starting in cycle t as follows:
Let
\[ \begin{aligned}
d = decodeitr(& trc(t).pto_p, \\
& trc(t).ptl_p, \\
& trc(t).mem, \\
& trc(t).mw_p, \\
& trc(t).addr_p \circ 0^3)
\end{aligned} \]
in
r_t(t).pa = d.pa,
r_t(t).excp = d.excp,
r_t(t).ptea = ptea,
r_t(t).lexcp = lexcp,
r_t(t).pteexcp = pteexcp,
r_t(t).pte = pte.
Now we can define both translated operations. We start with the
translated read. If there is no exception, the memory content at the physical
address is returned. Otherwise, an exception is signaled to the processor.
Definition 2.1.22 A translated read access is specified by:
\[ \begin{aligned}
tr\_read(trc) := {} & \forall t, t' \in \mathbb{N} : is\_req\_proc(t, t', trc) \; \land \\
& trc(t).mr_p \land trc(t).iobs_p.t \implies \\
& trc(t').iobs_p.excp = r\_t(t).excp \; \land \\
& trc(t').mem = trc(t'+1).mem \; \land \\
& (\neg r\_t(t).excp \implies trc(t').din_p = trc(t').mem[\langle r\_t(t).pa[31 : 3] \rangle])
\end{aligned} \]
The last operation is the translated write. In case of an exception the
memory is not modified. If there is no exception we only write the data from
the processor into the memory at the physical address.
Definition 2.1.23 A translated write access is specified by:
\[ \begin{aligned}
tr\_write(trc) := {} & \forall t, t' \in \mathbb{N} : is\_req\_proc(t, t', trc) \; \land \\
& trc(t).mw_p \land trc(t).iobs_p.t \implies \\
& \big( r\_t(t).excp = trc(t').iobs_p.excp \; \land \\
& \;\; (r\_t(t).excp \implies trc(t').mem = trc(t'+1).mem) \; \land \\
& \;\; (\neg r\_t(t).excp \implies trc(t').addr_m = r\_t(t).pa[31 : 3] \; \land \\
& \qquad (\forall a \in \mathbb{B}^{29} : (a \ne r\_t(t).pa[31 : 3] \implies \\
& \qquad\quad trc(t').mem[\langle a \rangle] = trc(t'+1).mem[\langle a \rangle])) \; \land \\
& \qquad \forall b \in \mathbb{Z}_8 : \left| trc(t'+1).mem[\langle r\_t(t).pa[31 : 3] \rangle] \right|_b = \\
& \qquad\quad \begin{cases}
\left| trc(t).dout_p \right|_b & \text{if } trc(t).mbw_p[b] \\
\left| trc(t').mem[\langle r\_t(t).pa[31 : 3] \rangle] \right|_b & \text{otherwise}
\end{cases} \; ) \big)
\end{aligned} \]
We can now define a predicate for overall MMU correctness.
Definition 2.1.24 We call a MMU specification trace correct iff it fulfills
the following predicate:
mmu_guarantee(trc) :=
p_liveness(trc) ∧
m_mr_mw_mutexc(trc) ∧
m_req_is_stable(trc) ∧
untr_read(trc) ∧
untr_write(trc) ∧
tr_read(trc) ∧
tr_write(trc)
We prove later for a MMU implementation that this predicate holds under
the assumptions given earlier.
2.2 MMU Design
In this section we introduce an implementation of the MMU . We use only
the basic circuits which were specified in Section 1.3.
Figure 2.5 shows the data paths of the MMU. In order to compute
lexcp we use the output neg of the circuit Add_sub21. We compute the page
table entry address with the help of the Adder32. We save the data from the
memory in the register dr. The address register ar can hold either a page
table entry address or a physical address. Since our memory supports only
double word accesses and a page table entry is only one word wide, we read
two page table entries at the same time and use ar[2] to choose the
right one. The last two multiplexers are used to compute the address which
we store in the address register ar. The corresponding control automaton is
presented in Figure 2.6.
[Data path diagram; labels include dr[63 : 0], $din_m$[63 : 0], $ptl_p$, $pto_p$,
lexcp, neg, sub, (v, p), arce, ar[31 : 0], $addr_m$[31 : 3], $din_p$[63 : 0],
$addr_p \circ 0^3$, add, $0^2$, $0^{12}$, $t_p$, Adder32, ar[2], drce, Add_sub21]
Figure 2.5: Data Paths of the MMU
The remaining signals are defined from the interface signals and the signals
of Figure 2.5 in the following way:
$mbw_m = mbw_p$
$dout_m = dout_p$
$pteexcp = \neg v \lor (p \land mw_p)$
$excp_p = lexcp \lor pteexcp$
$req_p = mr_p \lor mw_p$
$busy_p = \neg idle'$
where $idle'$ denotes that the control state in the next cycle will be idle.
Definition 2.2.1 We introduce the short notation $s_x^t$ for $trc(t).iobs_x.s$, where
$t \in \mathbb{N}$ denotes a hardware cycle, x denotes one of the two interfaces (p and m
for the processor and memory interface, respectively) and $trc \in Trace$,
e.g. $trc(t).iobs_p.busy$ and $busy_p^t$ denote the same signal.
The next state function of the MMU implementation takes as inputs the
following components:
• $inputs_p^t$ – the inputs from the processor in the current cycle t,
• $inputs_m^t$ – the inputs from the memory in the current cycle t,
• $c_{mmu}^t$ – the configuration of the MMU in the current cycle t. The
configuration $c_{mmu}$ contains the following components:
– $c_{mmu}^t.ar$ – state of the address register ar in the current cycle t,
– $c_{mmu}^t.dr$ – state of the data register dr in the current cycle t,
– $c_{mmu}^t.st$ – state of the control automaton in the current cycle t.
and, based on the data paths and the control automaton, produces the
following outputs:
• $outputs_p^t$ – outputs to the processor in the current cycle t,
• $outputs_m^t$ – outputs to the memory in the current cycle t,
• $c_{mmu}^{t+1}$ – configuration of the hardware in the next cycle t + 1.
The components $inputs_p^t$, $inputs_m^t$, $outputs_p^t$, and $outputs_m^t$ induce a trace
$trc \in Trace$. Note that the initial state of the control automaton is idle,
i.e. $c_{mmu}^0.st = idle$.
[State diagram; states: idle, seta, add, readpte, comppa, read, write;
transition and output labels include $req_p \land \neg t_p$, $req_p \land t_p$, $\neg req_p$, $mr_p$,
$\neg mr_p$, $busy_m$, $\neg busy_m$, lexcp, pteexcp, arce, add, drce, $mr_m$, $mw_m$]
Figure 2.6: Control Automaton of the MMU
2.3 MMU Correctness
In this section we show the correctness of the MMU .
Let trc ∈ Trace as before be a trace induced by the MMU implementa-
tion.
By correctness of the MMU we understand that if all the assumptions
are fulfilled our implementation satisfies the guarantees. Formally, we have
to prove the following lemma:
Lemma 2.3.1 The implementation of the MMU satisfies the specification
of the MMU in case the assumptions for both interfaces are fulfilled:
good_p_interface(trc) ∧ good_m_interface(trc)
=⇒ mmu_guarantee(trc)
We need some intermediate lemmas in order to prove this claim. We
start with the simple property from Definition 2.1.17, which guarantees
that the read and write signals to the memory are mutually exclusive.
Lemma 2.3.2 The write and read signals to the memory are never active
simultaneously:
m_mr_mw_mutexc(trc)
Proof: This property follows directly from the control automaton: the
memory read signal $mr_m$ is active only in the states read and readpte, and
the memory write signal $mw_m$ only in the state write. Since these states are
distinct, the two signals are never active in the same cycle. □
Now we want to prove the next property from Definition 2.1.18. At first
we prove only the part of this property concerning the two inputs $mr_m$ and
$mw_m$. We start with $mr_m$.
Lemma 2.3.3 If both signals $mr_m$ and $busy_m$ are active in the same cycle,
then $mr_m$ is also active in the next cycle:
$\forall t \in \mathbb{N} : mr_m^t \land busy_m^t \implies mr_m^{t+1}$
Proof: The proof of this lemma also follows from the inspection of the
MMU control automaton: if $mr_m$ is active we are in one of the two states
read or readpte. The state does not change in the next cycle because $busy_m$
is active, which proves that $mr_m$ is also active in the next cycle. □
For $mw_m$ the lemma and its proof are analogous to the previous lemma;
the only difference is that $mw_m$ is active iff we are in the state write.
Lemma 2.3.4 The memory inputs are stable:
$p\_req\_is\_stable(trc) \implies m\_req\_is\_stable(trc)$
Proof: By inspection of the control signals we can conclude that in case
of a memory request in cycle t the control automaton is in one of the three
states readpte, read, or write.
Since the control automaton is initially in state idle, we find a cycle $t'$
before t in which the processor request starts. The automaton leaves idle, and
during the whole time between $t'$ and t the signal $busy_p$ is active. The
processor inputs do not change between times $t'$ and $t + 1$ because of
p_req_is_stable(trc) at cycle t. Thus, the data input of the memory and the
memory byte write input of the memory (both of which come directly from the
processor interface) do not change:
\[ dout_m^t = dout_p^t = dout_p^{t+1} = dout_m^{t+1} \]
\[ mbw_m^t = mbw_p^t = mbw_p^{t+1} = mbw_m^{t+1} \]
The memory address is given by the address register ar[31 : 3], which is
clocked only in the states seta and comppa. Since we are in neither state seta
nor state comppa in cycle t, we can conclude that:
\[ addr_m^t = addr_m^{t+1} \]
For the last two signals $mr_m$ and $mw_m$ this property was already proved in
Lemma 2.3.3 and its analogue for $mw_m$. □
Lemma 2.3.5 Processor liveness holds:
$m\_liveness(trc) \land p\_req\_is\_stable(trc) \implies p\_liveness(trc)$
Proof: Since the busy signal of the processor interface is defined as
$busy_p = \neg idle'$, we can conclude that the processor interface is live if
for a request in cycle t there is a later cycle $t' \ge t$ such that in cycle
$t' + 1$ the control enters the state idle again.
All cycles in the automaton either contain the state idle or are
self-loops, i.e., we only have to prove that the states which have self-loops
are eventually left. These states are read, readpte, and write. Assume
we are in one of these states in some cycle $t''$. Then we conclude that
$mr_m^{t''} \lor mw_m^{t''}$ holds. From m_liveness(trc) in cycle $t''$ we conclude that
$\exists t''' \in \mathbb{Z}_{\ge t''} : \neg busy_m^{t'''}$, i.e., we leave the state and thus we reach state
idle again. □
Now we want to prove the correctness of all memory operations requested
by the processor. We start with the untranslated read request.
Lemma 2.3.6 Any untranslated read request from the processor to the
MMU is performed correctly:
p_req_is_stable(trc) ∧m_liveness(trc) ∧
p_mr_mw_mutexc(trc) ∧m_read_consist(trc) ∧
=⇒ untr_read(trc)
Proof: Let t, t′ ∈ N satisfy p_is_req(t, t′, trc). Note that in both cycles
t and t′ + 1 the control is in the idle state. In any intermediate cycle, it is
not in the idle state. Note also that during the request all inputs are stable
according to Lemma 2.1.8.
Since we have an untranslated request, i.e. $\neg t_p^{t}$, the control is in state
seta in cycle t + 1.
Since in cycle t + 1 the control signal arce is active and the mux select
signals add and t are both inactive, the address register ar[31 : 0] is clocked
as follows:
\[ ar^{t+2}[31:0] = addr_p^{t+1} \circ 0^3 = addr_p^{t} \circ 0^3 \]
Since we have a read request the control enters state read in cycle t + 2
and we start the memory request. Because of m_liveness we know $\neg busy_m^{t''}$
for some t′′ ≥ t + 2 and $busy_m^{t'''}$ for all t′′′ in between. With the help of
Lemma 2.3.4 all memory inputs during a memory request are stable. Hence
we conclude that we are in state read in cycle t′′ and idle in cycle t′′ + 1.
This means that the request ends in cycle t′′, i.e., we have that t′′ = t′. Based
on the m_read_consist assumption we conclude that:
\[ \begin{aligned}
din_m^{t'} &= trc(t').mem[\langle addr_m^{t'} \rangle] \\
&= trc(t').mem[\langle ar^{t'}[31:3] \rangle] = trc(t').mem[\langle ar^{t+2}[31:3] \rangle] \\
&= trc(t').mem[\langle addr_p^{t} \rangle]
\end{aligned} \]
With the help of m_read_consist we conclude that the memory does
not change between cycles t′ and t′ + 1.
Thus, we obtain:
\[ din_p^{t'} = din_m^{t'} = trc(t').mem[\langle addr_p^{t} \rangle] \;\land\; trc(t').mem = trc(t'+1).mem \;\land\; \neg excp_p^{t'} \]
□
Lemma 2.3.7 Any untranslated write request from the processor to the
MMU is performed correctly:
\[ \mathit{p\_req\_is\_stable}(trc) \land \mathit{m\_ack\_write}(trc) \land \mathit{p\_mr\_mw\_mutexc}(trc) \land \mathit{m\_write\_consist}(trc) \land \mathit{m\_liveness}(trc) \implies \mathit{untr\_write}(trc) \]
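Proof: Let t, t′ ∈ N satisfy p_is_req(t, t′, trc). As in the proof of
Lemma 2.3.6 we conclude:
\[ ar^{t+2}[31:0] = addr_p^{t+1} \circ 0^3 = addr_p^{t} \circ 0^3 \]
In analogy to Lemma 2.3.6 we have $write^{t''}$ where t + 2 ≤ t′′ ≤ t′ and $\neg busy_m^{t'}$.
Based on the m_write_consist assumption we conclude that:
\[ \begin{aligned}
&\forall a \in \mathbb{B}^{29} : (a \neq addr_m^{t+2} \implies trc(t').mem[\langle a \rangle] = trc(t'+1).mem[\langle a \rangle]) \;\land \\
&\forall b \in \mathbb{Z}_8 : \left| trc(t'+1).mem[\langle addr_m^{t+2} \rangle] \right|_b =
\begin{cases}
\left| dout_m^{t+2} \right|_b & \text{if } mbw_m^{t+2}[b] \\
\left| trc(t').mem[\langle addr_m^{t+2} \rangle] \right|_b & \text{otherwise}
\end{cases}
\end{aligned} \]
We also have the following equations from the construction of the MMU:
\[ dout_m^{t+2} = dout_p^{t+2}, \qquad mbw_m^{t+2} = mbw_p^{t+2}, \qquad addr_m^{t+2} = ar^{t+2}[31:3] \]
Moreover, an untranslated request never raises an exception.
Thus, we obtain:
\[ \begin{aligned}
&\forall a \in \mathbb{B}^{29} : (a \neq addr_p^{t} \implies trc(t').mem[\langle a \rangle] = trc(t'+1).mem[\langle a \rangle]) \;\land\; \neg excp_p^{t'} \;\land \\
&\forall b \in \mathbb{Z}_8 : \left| trc(t'+1).mem[\langle addr_p^{t} \rangle] \right|_b =
\begin{cases}
\left| dout_p^{t} \right|_b & \text{if } mbw_p^{t}[b] \\
\left| trc(t').mem[\langle addr_p^{t} \rangle] \right|_b & \text{otherwise}
\end{cases}
\end{aligned} \]
□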
Lemma 2.3.8 Any translated read request from the processor to the MMU
is consistent:
\[ \mathit{p\_req\_is\_stable}(trc) \land \mathit{m\_ack\_write}(trc) \land \mathit{p\_mr\_mw\_mutexc}(trc) \land \mathit{m\_read\_consist}(trc) \land \mathit{m\_liveness}(trc) \implies \mathit{tr\_read}(trc) \]
Proof: Let t, t′ ∈ N be the start and end time of any translated processor
read request. The control is in state idle in cycles t and t′ + 1. In any
intermediate cycle, it is not in state idle.
Since $t_p^{t}$ holds, the control is in state add in cycle t + 1. Since $arce^{t+1}$
and the mux select signals $add^{t+1}$ and $t_p^{t+1}$ are active, the address register
ar is clocked as follows:
\[ ar^{t+2}[31:0] = adder_{32}(0^{10} \circ addr_p^{t+1}[28:9] \circ 0^2,\; pto_p^{t+1} \circ 0^{12},\; 0)[31:0] \]
By using Definition 1.3.1 of an adder and input stability we can rewrite
this equation as follows:
\[ \langle ar^{t+2}[31:0] \rangle = \langle addr_p^{t}[28:9] \circ 0^2 \rangle + \langle pto_p^{t} \circ 0^{12} \rangle \bmod 2^{32} = r\_t(t).ptea \]
This proves that we have saved the page table entry address r_t(t).ptea in
the address register at cycle t + 2.
We compute $lexcp^{t+1}$ as $add\_sub_{21}(0 \circ ptl_p^{t+1},\; 0 \circ addr_p^{t+1}[28:9],\; 1).neg$
and by Definition 1.3.1 we have
\[ lexcp^{t+1} = (\langle ptl_p^{t} \rangle - \langle addr_p^{t}[28:9] \rangle < 0) \]
i.e. $lexcp^{t+1} = r\_t(t).lexcp$.
We now split cases on $lexcp^{t+1}$:
1. Let $lexcp^{t+1}$ hold. Then the control is in state idle at cycle
t + 2 and we have t′ = t + 1. Because of $lexcp^{t+1}$, we have r_t(t).excp.
This concludes the case since we do not change the memory.
2. Let $\neg lexcp^{t+1}$ hold. Then the control is in state readpte in cycle
t + 2. By the m_liveness assumption we know that at a later
time t′′ ≥ t + 2, $\neg busy_m^{t''}$ holds, and at this time we save the
input data from the memory $din_m^{t''}$ into the data register dr. The input
data from the memory at time t′′ contains two adjacent page table
entries whose addresses differ in the last bit. With the help of the mux select
signal ar[2] we later choose the page table entry which is needed. Since
the memory request at address r_t(t).ptea is finished in cycle t′′ + 1 we
also know that
\[ r\_t(t).pte = \begin{cases}
dr^{t''+1}[31:0] & \text{if } \langle r\_t(t).ptea \rangle \bmod 8 = 0 \\
dr^{t''+1}[63:32] & \text{otherwise}
\end{cases} \]
The control is at cycle t′′ + 1 in state comppa. Since the address
register ar is not changed between cycles t + 2 and t′′ + 1 we have
$r\_t(t).ptea = ar^{t''+1}$. Since $r\_t(t).ptea[1:0] = 0^2$ and processor inputs
are stable we have:
\[ \langle r\_t(t).ptea \rangle \bmod 8 = 0 \iff \neg ar^{t''+1}[2] \]
Since in cycle t′′ + 1 the control signal $arce^{t''+1}$ and the mux select
signal $t_p^{t''+1}$ are active and the mux select signal $add^{t''+1}$ is inactive,
the address register $ar^{t''+2}[31:0]$ is given by:
\[ \begin{aligned}
ar^{t''+2}[31:0] &= r\_t(t).pte[31:12] \circ addr_p^{t''+1}[8:0] \circ 0^2 \\
&= r\_t(t).pte[31:12] \circ bx \circ 0^2
\end{aligned} \]
We compute $pteexcp^{t'}$ as $\neg v \lor p \land mw_p^{t} = \neg r\_t(t).pte[11] \lor r\_t(t).pte[10] \land mw_p^{t}$.
Since $mw_p^{t}$ does not hold for a read request,
we can rewrite $pteexcp^{t'}$ simply as $\neg r\_t(t).pte[11]$.
We will now split cases on $pteexcp^{t''+1}$:
(a) Let $pteexcp^{t''+1}$ hold. Then the control is in state
idle at cycle t′′ + 2 and we obtain that t′′ + 1 = t′. Note that we
do not change the data register during cycle t′′ + 1 and so we have
r_t(t).excp. This concludes the claim since we do not change the
memory during the whole request.
(b) If $\neg pteexcp^{t''+1}$ holds then we have ¬r_t(t).excp and
$r\_t(t).pa = ar^{t''+2}[31:3]$. Since we have a read request, the control is in
state read at cycle t′′ + 2 and we start the second memory read
request at the physical address stored in ar. With the help of the
m_liveness assumption we know that $\neg busy_m^{\tilde{t}}$ for a later time
$\tilde{t} \geq t'' + 2$ and that the memory is busy all the time between
t′′ + 2 and $\tilde{t}$. This means that $\tilde{t} = t'$. Now based on the m_read_consist
assumption we can conclude that:
\[ \begin{aligned}
din_m^{t'} &= trc(t').mem[\langle addr_m^{t'} \rangle] \\
&= trc(t').mem[\langle ar^{t'}[31:3] \rangle] \\
&= trc(t').mem[\langle ar^{t''+2}[31:3] \rangle] \\
&= trc(t').mem[r\_t(t).pa]
\end{aligned} \]
This finishes the proof since during the processor request the
memory is not changed. □
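The translation steps traced in this proof can be summarized in a small executable sketch. It is only an illustration under assumptions that are not part of the thesis: the memory is modeled as a Python dictionary from double-word addresses to 64-bit integers, the names `Translation` and `translate` are hypothetical, and the physical double-word address is assumed to be the concatenation of pte[31 : 12] with the 9-bit offset addr[8 : 0].

```python
from dataclasses import dataclass


@dataclass
class Translation:
    ptea: int    # byte address of the page table entry (32 bits)
    lexcp: bool  # page table length exception
    excp: bool   # any translation exception (length or page table entry)
    pa: int      # physical double-word address, meaningful only if not excp


def translate(mem, pto, ptl, addr, mw):
    """Sketch of one address translation for a 29-bit double-word address."""
    px = (addr >> 9) & 0xFFFFF           # page index addr[28:9]
    bx = addr & 0x1FF                    # offset addr[8:0] inside the page
    ptea = ((px << 2) + (pto << 12)) % 2**32
    lexcp = ptl - px < 0                 # page index exceeds the table length
    if lexcp:
        return Translation(ptea, True, True, 0)
    dword = mem.get(ptea >> 3, 0)        # first memory access at floor(ptea / 8)
    # The double word holds two adjacent entries; pick the needed one.
    pte = dword & 0xFFFFFFFF if ptea % 8 == 0 else dword >> 32
    valid = bool((pte >> 11) & 1)        # pte[11]
    protected = bool((pte >> 10) & 1)    # pte[10]
    pteexcp = (not valid) or (protected and mw)
    pa = ((pte >> 12) << 9) | bx         # assumed layout: pte[31:12] followed by bx
    return Translation(ptea, False, pteexcp, pa)
```

For a read request `mw` is false, so the page table entry exception reduces to the valid bit being cleared, exactly as in the case split above.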
Lemma 2.3.9 Any translated write request from the processor to the MMU
is consistent:
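\[ \mathit{p\_req\_is\_stable}(trc) \land \mathit{m\_read\_consist}(trc) \land \mathit{p\_mr\_mw\_mutexc}(trc) \land \mathit{m\_write\_consist}(trc) \land \mathit{m\_liveness}(trc) \land \mathit{m\_ack\_write}(trc) \implies \mathit{tr\_write}(trc) \]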
Proof: The proof of this lemma is very similar to the previous one. We
only have two differences:
[Figure 2.7: Possibility of Requests to the Memory. Timing diagram of the signals $req_p$ and $req_m$ over the time points t1, t′ and t2.]
1. The signal pteexcp can now also be active because the protection bit
is set. The correctness of pteexcp is concluded similarly to that of pteexcp
in Lemma 2.3.8.
2. The second memory access is now a write access instead of a read
access. At the end of the request the control is in the state write
and we store the data into the memory according to the definition of
untranslated write. The correctness of this write is concluded similarly
to the untranslated write in Lemma 2.3.7. □
We can now summarize all of the lemmas from 2.3.2 to 2.3.9 to prove
Lemma 2.3.1.
Proof: [Lemma 2.3.1] We have already proved all parts of the predicate
mmu_guarantee(trc) in Lemmas 2.3.2 to 2.3.9. Both predicates
good_p_interface(trc) and good_m_interface(trc) also contain all the
assumptions which we needed for these lemmas. □
We now have the correctness proof for the MMU, but in order to use this
proof in the overall proof for the instruction MMU of the VAMP we need
one more lemma. The assumption that an MMU has exclusive access to
the memory does not hold in real processors (cf. Chapter 3). Therefore we
need the following lemma in order to distinguish the memory space of different
MMUs.
If we have a memory request which ends before the end of the processor
request then we have a translated operation and the address of this request
is $\lfloor \langle r\_t(t).ptea \rangle / 8 \rfloor$. If we also have a request to the memory at the end of
the processor request then the address of this request is equal to r_t(t).pa
(see Figure 2.7).
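Lemma 2.3.10 In case of a processor read request we have the same
addresses on the address bus of the memory interface at the end of the memory
request as in the specification of the MMU:
\[ \begin{aligned}
&\forall t_1, t_2 \in \mathbb{N} : \mathit{is\_req\_proc}(t_1, t_2, trc) \land \mathit{p\_mr\_mw\_mutexc}(trc) \land \mathit{p\_req\_is\_stable}(trc) \implies \\
&\quad (mr_p^{t_1} \implies (\forall t' \in [t_1 : t_2[ \; : (mr_m^{t'} \lor mw_m^{t'}) \land \neg busy_m^{t'} \implies \\
&\quad\quad \langle addr_m^{t'} \rangle = \lfloor \langle r\_t(t_1).ptea \rangle / 8 \rfloor \land t_p^{t_1})) \\
&\quad \land \\
&\quad ((mr_m^{t_2} \lor mw_m^{t_2}) \land \neg busy_m^{t_2}) \implies \\
&\quad\quad \exists t' \in [t_1 : t_2[ \; : (mr_m^{t'} \lor mw_m^{t'}) \land \neg busy_m^{t'} \;\land
\begin{cases}
addr_m^{t_2} = r\_t(t_1).pa[31:3] & \text{if } t_p^{t_2} \\
addr_m^{t_2} = addr_p^{t_2} & \text{otherwise}
\end{cases}
\end{aligned} \]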
Proof: We prove the two implications of the claim separately.
1. First we show that:
\[ \forall t' \in [t_1 : t_2[ \; : (mr_m^{t'} \lor mw_m^{t'}) \land \neg busy_m^{t'} \implies \langle addr_m^{t'} \rangle = \lfloor \langle r\_t(t_1).ptea \rangle / 8 \rfloor \land t_p^{t_1} \]
From $mr_m^{t'} \lor mw_m^{t'}$ we conclude that the automaton is at cycle t′ in one
of the three states: read, write, or readpte. Moreover, if the control is
in state read or write and t′ is the last cycle of a memory request then
it follows that t′ = t2, which is a contradiction because t′ < t2. Therefore
the control is in state readpte in cycle t′. We can only be in state
readpte when we have a translated request, i.e. the signal $t_p$ is active
during the processor request.
This concludes the claim since
\[ addr_m^{t'} = ar[31:3] = \lfloor r\_t(t_1).ptea / 8 \rfloor \]
we have already shown as part of Lemma 2.3.8.
2. Now we show that:
\[ \begin{aligned}
&((mr_m^{t_2} \lor mw_m^{t_2}) \land \neg busy_m^{t_2}) \implies \\
&\quad \exists t' \in [t_1 : t_2[ \; : (mr_m^{t'} \lor mw_m^{t'}) \land \neg busy_m^{t'} \;\land
\begin{cases}
addr_m^{t_2} = r\_t(t_1).pa[31:3] & \text{if } t_p^{t_2} \\
addr_m^{t_2} = addr_p^{t_2} & \text{otherwise}
\end{cases}
\end{aligned} \]
We have already proved the case in which $t_p^{t_2}$ holds as part of Lemma 2.3.8
and the other case as part of Lemma 2.3.6, and thus we
finish the claim. □
Chapter 3
The VAMP with Virtual Memory Support
The work presented in this chapter is based on the works of Sven Beyer
[Bey05], Mark Hillebrand [Hil05], and Daniel Kröning [Krö01].
The processor which we describe in this chapter has two principal
differences with respect to the processor in [Bey05]. First, we have an
implementation supporting virtual memory, which allows a user program to
use more memory than the physical RAM available. Second, we add a new
type of interrupt for communication with the external environment.
3.1 Specification of the VAMP
Typically, the specification of a microprocessor consists of three components:
a memory, a register file and a program counter. A step of computation is the
execution of one command which is fetched from the memory at the address
given by the program counter. Our specification follows this scheme.
In our case we have three register files:
• $GPR : \{0,1\}^5 \to \{0,1\}^{32}$ – a general purpose register file. It consists
of 32 registers. Every register is 32 bits wide. GPR[0] always
contains $0^{32}$.
• $FPR : \{0,1\}^5 \to \{0,1\}^{32}$ – a floating point register file. It consists
of 32 registers. Every register is 32 bits wide. All registers in the
floating point register file can also be used in even-odd pairs, encoding
an IEEE double-precision floating point number.
• $SPR : \mathbb{Z}_{17} \to \{0,1\}^{32}$ – a special purpose register file. It consists of
17 registers. Every register is 32 bits wide. We will discuss the
purpose of these registers later on. For now only note that we have
four new registers in the SPR, which are needed for address translation.
They are PTO, PTL, MODE and EMODE.
For the program counter we use a so-called delayed PC construction. In
such a delayed PC architecture, instruction updates to the PC do not affect
the next instruction. We need two program counters: $PC' \in \{0,1\}^{32}$ and
$DPC \in \{0,1\}^{32}$. The full definition of this construction is provided by Müller
and Paul [MP00]. Therefore, the VAMP specification configuration consists
of six components
\[ c_S = (c_S.GPR,\; c_S.SPR,\; c_S.FPR,\; c_S.M,\; c_S.PC',\; c_S.DPC) \tag{3.1} \]
Note that $c_S.M \in Memory$. Let us introduce some notation which will be
used in this chapter. We consider two different specification configurations.
The first is the configuration $c_S$, which reacts to interrupts, and the second
is $\tilde{c}_S$, which does not react to interrupts. We denote by $c_S^n$ the specification
configuration before the execution of instruction n and after the execution of
instruction n − 1. We denote by $\tilde{c}_S^n$ the corresponding specification configuration in
the case of not having interrupts. We also introduce two next step functions
δ and $\delta_u$, which do and do not react to interrupts, respectively. The
function $\delta_u$ takes only a specification configuration as input. We
now define a computation without interrupts with an initial configuration
$c_S^{init}$ as follows:
\[ \tilde{c}_S^{0} := c_S^{init}, \qquad \tilde{c}_S^{n+1} := \delta_u(\tilde{c}_S^{n}). \]
The function δ additionally takes the external signal reset and, since we
add external interrupts, the bit vector ext[18 : 0] as input. Therefore a
computation with interrupts with an initial configuration $c_S^{init}$ is defined as
follows:
\[ c_S^{0} := c_S^{init}, \qquad c_S^{n+1} := \delta(c_S^{n}, reset, ext_S[18:0]). \]
We denote by $c_S^n.F$ one of the six components of the configuration before
the execution of instruction n.
At first we will develop the next step computation for $\tilde{c}_S^{n+1} = \delta_u(\tilde{c}_S^{n})$ (or
$c'_S$) and then we will extend it to $c_S^{n+1} = \delta(c_S^{n}, reset, ext[18:0])$.
Our processor supports two modes of operation. The first mode is called
system mode. In this mode we do not have virtual addresses, we have only
physical addresses. The behavior of our processor in this mode is equal to
the behavior of the processor described in [Bey05].
The second mode is called user mode. In this mode we have virtual
addresses which we translate into physical addresses when we want to fetch
from, read from, or write to the memory. The current mode is indicated
by the lowest bit in the MODE register. If this bit equals 0 then we are
in system mode, otherwise in user mode. We denote this bit by t, i.e.
t(cS) = cS .SPR[MODE][0]. Note that all names of registers in the special
purpose register file are natural numbers and are equal to numbers from the
first column in Table 3.1.
For address translation we will use the decodeitr function which was
defined in Section 2.1 on the MMU specification. Based on the decodeitr
function for memory access to the instruction memory we define instruction
page fault ipf(cS) and instruction physical address ipa(cS):
\[ ipf(c_S) = \mathit{decodeitr}(pto, ptl, c_S.M, 0, c_S.DPC[31:3]).excp, \tag{3.2} \]
\[ ipa(c_S) = \mathit{decodeitr}(pto, ptl, c_S.M, 0, c_S.DPC[31:3]).pa \tag{3.3} \]
where pto and ptl are defined as
\[ pto = c_S.SPR[PTO][19:0], \tag{3.4} \]
\[ ptl = c_S.SPR[PTL][19:0] \tag{3.5} \]
Every fetch access to the memory must be aligned. Therefore, for the
access to the instruction memory we define the misalignment predicate
\[ imal(c_S) = c_S.DPC \bmod 4 \neq 0 \tag{3.6} \]
The function IR returns the current instruction, which we want to execute
in configuration cS . We define it as follows:
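\[ IR(c_S) = \begin{cases}
c_S.M[c_S.DPC + 3 : c_S.DPC] & \text{if } \neg t \land \neg imal(c_S) \\
c_S.M[ipa(c_S) + 3 : ipa(c_S)] & \text{if } t \land \neg imal(c_S) \land \neg ipf(c_S) \\
0^{32} & \text{otherwise}
\end{cases} \tag{3.7} \]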
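The case distinction for the instruction fetch can be illustrated by the following sketch. It makes several assumptions that are not taken from the thesis: the memory is a dictionary from byte addresses to byte values, the hypothetical callback `translate` stands in for the decodeitr function and returns a pair (page fault, physical byte address), and M[a + 3 : a] is read with the byte at address a as the least significant byte.

```python
def imal(dpc):
    """Instruction misalignment: the fetch address is not a multiple of 4."""
    return dpc % 4 != 0


def fetch(mem, dpc, user_mode, translate=None):
    """Return the 32-bit instruction word, or 0 in the 'otherwise' case."""
    if imal(dpc):
        return 0
    if user_mode:
        page_fault, base = translate(dpc)  # hypothetical stand-in for decodeitr
        if page_fault:
            return 0
    else:
        base = dpc                         # system mode: no translation
    # M[base + 3 : base]: the byte at 'base' is the least significant one.
    return sum(mem.get(base + i, 0) << (8 * i) for i in range(4))
```

In system mode the callback is never consulted, which mirrors the first case of the definition.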
An instruction consists of several fields, which encode the whole instruction
set of the processor according to Tables A.1 to A.6.
Based on IR(cS) and Figure A.1 in Appendix A, we define some functions.
For example RS1(cS) contains the index of the first source register in a
register file, RD(cS) contains the index of the destination register in a register
file, SA(cS) contains the index of the register in the SPR register file (source
or destination register depending on the command), etc.
Based on the Tables A.1 to A.6 we also introduce some predicates. For
example movs2i?(IR) holds iff $IR[31:26] = 0^6 \land IR[5:0] = 010^4$ (Table A.2).
Also we have predicates identifying groups of instructions, e.g., mem?(cS) is
true for all memory instructions.
If the predicate mem? holds, we define, based on the decodeitr function,
the data page fault dpf(c_S) and the data physical address dpa(c_S). The virtual
address is the effective memory address:
\[ ea(c_S) := c_S.GPR[RS1] + imm(c_S) \tag{3.8} \]
According to Tables A.1 to A.6 we have a memory write access only in
the case of:
\[ mw := sb? \lor sh? \lor sw? \lor store.s? \lor store.d? \tag{3.9} \]
We define the physical address and the page fault for a memory data access as
follows:
\[ dpf(c_S) = \mathit{decodeitr}(pto, ptl, c_S.M, mw, ea(c_S)).excp, \tag{3.10} \]
\[ dpa(c_S) = \mathit{decodeitr}(pto, ptl, c_S.M, mw, ea(c_S)).pa \tag{3.11} \]
The pto and ptl are the same as in the case of instruction memory access.
Note that data access must also be aligned:
\[ dmal(c_S) = \begin{cases}
ea(c_S) \bmod d \neq 0 & \text{if } \neg t \land mem? \\
dpa(c_S) \bmod d \neq 0 & \text{if } t \land mem? \\
0 & \text{otherwise}
\end{cases} \tag{3.12} \]
where d is the width of the access measured in bytes as introduced in Tables
A.1 and A.4.
Now we can define the next step computation for the memory:
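\[ c'_S.M = \begin{cases}
c_S.M \text{ with } c_S.M[ea(c_S) + d - 1 : ea(c_S)] := dest[8 \cdot d - 1 : 0] & \text{if } \neg t \land \neg dmal(c_S) \land mw \\
c_S.M \text{ with } c_S.M[dpa(c_S) + d - 1 : dpa(c_S)] := dest[8 \cdot d - 1 : 0] & \text{if } t \land \neg dmal(c_S) \land \neg dpf(c_S) \land mw \\
c_S.M & \text{otherwise}
\end{cases} \]
where
\[ dest = \begin{cases}
c_S.GPR[RD(c_S)] & \text{if } sb? \lor sh? \lor sw? \\
c_S.FPR[RD(c_S)] & \text{if } store.s? \\
c_S.FPR[RD(c_S)[4:1]\,1] \circ c_S.FPR[RD(c_S)[4:1]\,0] & \text{otherwise}
\end{cases} \]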
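As a rough illustration of the memory update, the following sketch writes the low 8·d bits of dest to an aligned address. The dictionary-based byte memory and the function names are assumptions made for this example only; the address passed in plays the role of ea or dpa, depending on the mode, and page faults are not modeled.

```python
def dmal(addr, d):
    """Data misalignment for an access of width d bytes."""
    return addr % d != 0


def store(mem, addr, d, dest):
    """Return the memory after storing dest[8*d-1 : 0] at addr.

    The memory is returned unchanged if the access is misaligned,
    which corresponds to the 'otherwise' case of the definition.
    """
    if dmal(addr, d):
        return mem
    new_mem = dict(mem)
    for i in range(d):
        # Byte i of dest goes to address addr + i.
        new_mem[addr + i] = (dest >> (8 * i)) & 0xFF
    return new_mem
```

Returning a fresh dictionary rather than mutating the argument keeps the sketch close to the functional style of the next step computation.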
Then the next step computation for the GPR is defined as follows:
\[ c'_S.GPR[RD(c_S)] = \begin{cases}
sext(c_S.M[ea(c_S) : ea(c_S)]) & \text{if } lb? \land \neg dmal(c_S) \land \neg t \\
sext(c_S.M[dpa(c_S) : dpa(c_S)]) & \text{if } lb? \land \neg dmal(c_S) \land t \land \neg dpf(c_S) \\
sext(c_S.M[ea(c_S) + 1 : ea(c_S)]) & \text{if } lh? \land \neg dmal(c_S) \land \neg t \\
sext(c_S.M[dpa(c_S) + 1 : dpa(c_S)]) & \text{if } lh? \land \neg dmal(c_S) \land t \land \neg dpf(c_S) \\
c_S.M[ea(c_S) + 3 : ea(c_S)] & \text{if } lw? \land \neg dmal(c_S) \land \neg t \\
c_S.M[dpa(c_S) + 3 : dpa(c_S)] & \text{if } lw? \land \neg dmal(c_S) \land t \land \neg dpf(c_S) \\
c_S.GPR[RS1(c_S)] \land imm(c_S) & \text{if } andi? \\
c_S.GPR[RS1(c_S)] - imm(c_S) & \text{if } subi? \\
\quad \vdots & \quad \vdots \\
c_S.GPR[RD(c_S)] & \text{otherwise}
\end{cases} \]
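The sign extension used in the load cases of this definition can be sketched as follows. The helper names and the dictionary-based byte memory are assumptions for this example, and the byte at the lowest address is taken as the least significant one.

```python
def sext(value, bits):
    """Sign-extend a value that is 'bits' wide to 32 bits."""
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & 0xFFFFFFFF


def load(mem, addr, d, signed=True):
    """Read d bytes starting at addr and extend the result to 32 bits."""
    raw = sum(mem.get(addr + i, 0) << (8 * i) for i in range(d))
    return sext(raw, 8 * d) if signed else raw
```

A byte load of 0x80 therefore yields 0xFFFFFF80, while a word load is returned unchanged.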
Note that sext denotes sign extension. We can describe the behavior of both program
counters in the following way:
\[ c'_S.PC' = \begin{cases}
c_S.PC' + 4 + imm(c_S) & \text{if } beqz? \land RS1(c_S) = 0 \lor bnez? \land RS1(c_S) \neq 0 \lor j? \lor jal? \\
c_S.GPR[RS1(c_S)] & \text{if } jr? \lor jalr? \\
c_S.PC' + 4 & \text{otherwise}
\end{cases} \tag{3.13} \]
\[ c'_S.DPC = c_S.PC' \tag{3.14} \]
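The effect of the delayed PC can be made visible with a short trace. In this sketch, which is an illustration rather than part of the specification, `step_pcs` receives an already computed target address for a taken branch or jump (or `None` otherwise), and the hypothetical dictionary `jumps` maps instruction addresses to such targets.

```python
def step_pcs(pc_prime, target=None):
    """One step of both program counters: returns (new PC', new DPC)."""
    new_pc_prime = (target if target is not None else pc_prime + 4) % 2**32
    return new_pc_prime, pc_prime  # DPC simply takes over the old PC'


def fetch_trace(jumps, steps, dpc=0, pc_prime=4):
    """List the addresses (DPC values) of the instructions that are fetched."""
    trace = []
    for _ in range(steps):
        trace.append(dpc)
        pc_prime, dpc = step_pcs(pc_prime, jumps.get(dpc))
    return trace
```

With a jump to address 100 located at address 0, the instruction at address 4 is still executed before control reaches address 100, i.e. the jump takes effect one instruction late.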
We will not formally define the next step function for the FPR register
file. All details about the FPR can be found in [Jac02a] and in [Bey05]. For the
SPR we first describe the next step computation without interrupts for the
IEEEf register. The IEEEf register accumulates the exception flags required by
the IEEE standard for floating-point operations. We compute this
register as follows:
\[ c'_S.SPR[IEEEf] = \begin{cases}
c_S.GPR[RS1(c_S)] & \text{if } movi2s? \land SA = IEEEf \\
c_S.SPR[IEEEf][31:5]\,(c_S.SPR[IEEEf][4:0] \lor CA(c_S)[11:7]) & \text{otherwise}
\end{cases} \]
Index  Name   Function
0      SR     Status register. Contains interrupt mask bits.
1      ESR    Exception status register. Saves SR in case of an interrupt.
2      ECA    Exception cause register. Saves exception cause in case of an interrupt.
3      EPC    Exception PC. Saves PC′ in case of an interrupt.
4      EDPC   Exception DPC. Saves DPC in case of an interrupt.
5      EData  Exception data. Saves additional exception data in case of an interrupt.
6      RM     Rounding mode. Encodes currently used rounding mode for all floating point operations.
7      IEEEf  IEEE flags register. Required by the IEEE standard to accumulate floating point interrupts.
8      FCC    Floating point condition code. Used to store result of floating point comparisons.
9      PTO    Page table origin.
10     PTL    Page table length.
11     EMODE  Exception MODE. Saves MODE in case of an interrupt.
12     S12    Not used for special purpose.
13     S13    Not used for special purpose.
14     S14    Not used for special purpose.
15     S15    Not used for special purpose.
16     MODE   Used to distinguish system and user mode.
Table 3.1: Special Purpose Registers of the VAMP
Note that CA(c_S) is defined later. More information
about this case can be found in [Bey05].
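The accumulating behavior of the IEEEf register can be sketched as follows. The encoding is an assumption of this example: registers and the cause vector are Python integers with bit i of the integer standing for bit i of the bit vector, and the hypothetical parameter `movi2s_value` models a movi2s instruction that targets IEEEf.

```python
def next_ieeef(ieeef, ca, movi2s_value=None):
    """Next value of IEEEf in a computation step without interrupts."""
    if movi2s_value is not None:
        return movi2s_value & 0xFFFFFFFF  # explicit write by movi2s
    flags = (ca >> 7) & 0x1F              # CA[11:7]: the five floating point causes
    # The upper bits are kept; the lower five bits accumulate new flags.
    return (ieeef & 0xFFFFFFE0) | ((ieeef & 0x1F) | flags)
```

Once a flag is set it stays set until software overwrites the register, which is the accumulation required by the IEEE standard.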
For all other registers in the SPR file we introduce the next step
computation without interrupts as follows:
\[ c'_S.SPR = \lambda_{i \in \mathbb{Z}_{17} \land i \neq 7} \begin{cases}
c_S.GPR[RS1(c_S)] & \text{if } movi2s? \land SA(c_S) = i \\
c_S.SPR[i] & \text{otherwise}
\end{cases} \tag{3.15} \]
Now we want to define specification computations with interrupts. We
first describe the SPR register file. The register file contains 17 registers:
seven registers dealing with interrupts, three for floating point operations,
and the new group of registers which we use for address translation. The
purpose of each register is presented in Table 3.1.
Note that we have four registers in the special purpose register file which
are not used for any special purpose, i.e. they are only read or updated using
movs2i?(IR) or movi2s?(IR) instructions. This has implementation-specific
reasons, which are described in Section 3.2.2.
All supported interrupts are given in Table 3.2. When an interrupt
occurs we execute a so-called jump to the interrupt service routine, which starts
at a fixed address SISR. We have two different kinds of interrupts: repeat
and continue. If we have a repeat interrupt, then after executing the interrupt
service routine we execute again the instruction which caused the
interrupt. Otherwise, after executing the interrupt service routine we execute
the instruction following the one which caused the interrupt. We can
also classify interrupts as maskable, i.e., they can be ignored under software
control, and nonmaskable. We save the interrupt mask in the SR register.
When more than one interrupt occurs simultaneously, the software handles
the interrupt with the smallest index. For the specification of interrupts we
define two additional functions: CA(c_S) for the exception cause (the output of
this function is a bit vector and the value of each bit indicates whether we
have an exception of this index or not) and EData(c_S) for the exception data
of the specification configuration. Our processor also has the external interrupt
reset and 19 external interrupts for the communication with external devices.
We have the illegal exception CA(c_S)[ill] in the following cases:
• If we cannot decode the new instruction in $IR_S$ as one of the
instructions according to Tables A.1 to A.5. This case is also present in the
previous VAMP.
• If user processes were allowed to update certain registers they could
hack the operating system. Therefore we have to forbid any access to
special purpose registers except IEEEf, SR, RM, FCC in user mode.
That is, we have an illegal interrupt if we decode the new instruction in
IR(c_S) as an access to the SPR register file (i.e., movi2s or movs2i) but not
to the IEEEf, SR, RM, FCC registers and we work in user mode.
• For the same reason we have to forbid any rfe command in user mode,
i.e. if we decode the new instruction as rfe and we work in user mode.
For the misalignment exception we have the following:
\[ CA(c_S)[mal] = imal(c_S) \lor dmal(c_S) \tag{3.16} \]
For page fault on fetch and page fault on load/store we have the
following:
\[ CA(c_S)[ipf] = ipf(c_S) \land \neg imal(c_S) \]
\[ CA(c_S)[dpf] = dpf(c_S) \land mem? \land \neg dmal(c_S) \]
CA(c_S)[trap] is true when we execute a trap instruction. CA(c_S)[ovf]
is true when we have an overflow in fixed point arithmetic and we execute
an instruction that detects overflow. All other exceptions are used only for
floating point arithmetic and we do not consider these exceptions here.
Index  Name   Type      Maskable  Interrupt
0      reset  repeat    no        reset
1      ill    repeat    no        illegal instruction
2      mal    repeat    no        misaligned memory access
3      ipf    repeat    no        page fault on fetch
4      dpf    repeat    no        page fault on load/store
5      trap   continue  no        trap instruction
6      ovf    continue  yes       fixed point overflow
7      OVF    continue  yes       floating point overflow
8      UNF    continue  yes       floating point underflow
9      INX    continue  yes       floating point inexact result
10     DIVZ   continue  yes       floating point division by zero
11     INV    continue  yes       floating point invalid operation
12     UNIMP  continue  no        floating point unimplemented
13-31  ext[i] continue  yes       19 external interrupts
Table 3.2: Supported Interrupts in the VAMP
We compute the exception data in the following way:
\[ EData(c_S) := \begin{cases}
DPC(c_S) & \text{if } imal(c_S) \lor ipf(c_S) \\
sext(imm(c_S)) & \text{if } trap? \\
ea(c_S) & \text{if } mem? \\
FPUresult_S & \text{if } fpu? \\
0^{32} & \text{otherwise}
\end{cases} \tag{3.17} \]
Note that we need to save not only the exception data in the case of
an interrupt but also the bit t, i.e. the register cS .SPR[MODE]. For this
purpose we extended the SPR register file with an EMODE register.
Now for the type of the interrupt we have:
\[ repeat(c_S) := \bigvee_{i=0}^{i<5} CA(c_S)[i]. \tag{3.18} \]
For detecting an interrupt we define the masked cause function as follows:
\[ MCA(c_S) := \lambda_{i \in \mathbb{Z}_{32}} \begin{cases}
CA(c_S)[i] \land c_S.SPR[SR][i] & \text{if } i \geq 6 \land i \neq 12 \\
CA(c_S)[i] & \text{otherwise}
\end{cases} \tag{3.19} \]
Note that the condition $i \geq 6 \land i \neq 12$ holds only for the maskable interrupts.
Now we can define the predicate JISR(c_S) which indicates whether any interrupt
occurs:
\[ JISR(c_S) := MCA(c_S) \neq 0^{32} \tag{3.20} \]
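The masking of the cause vector and the detection of an interrupt can be sketched as follows. The integer encoding (bit i of the integer stands for bit i of the bit vector) and the helper `interrupt_level`, which picks the smallest index among the pending interrupts, are assumptions made for this illustration.

```python
# Indices of the maskable interrupts: i >= 6 and i != 12.
MASKABLE = sum(1 << i for i in range(6, 32) if i != 12)


def mca(ca, sr):
    """Masked cause: maskable bits are gated by SR, the others pass through."""
    return (ca & sr & MASKABLE) | (ca & ~MASKABLE & 0xFFFFFFFF)


def jisr(ca, sr):
    """True iff at least one interrupt survives masking."""
    return mca(ca, sr) != 0


def repeat(ca):
    """True iff one of the causes 0 to 4 (the repeat interrupts) is active."""
    return (ca & 0x1F) != 0


def interrupt_level(ca, sr):
    """Smallest index of a pending interrupt, or None if there is none."""
    m = mca(ca, sr)
    return (m & -m).bit_length() - 1 if m else None
```

A fixed point overflow (index 6) is thus ignored while the corresponding bit of SR is cleared, whereas a trap (index 5) always leads to a jump to the interrupt service routine.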
Note that when JISR(c_S) holds we jump into the interrupt service
routine at address SISR. Computation steps with interrupts are equal to
computation steps without interrupts when JISR(c_S) does not hold, and our
specifications are always equal if we do not have any interrupts. We define
our next step computation function in the following way.
For both program counters we have:
\[ c_S^{n+1}.DPC := \begin{cases}
SISR & \text{if } JISR(c_S^n) \\
{c_S^n}'.DPC & \text{otherwise}
\end{cases} \tag{3.21} \]
\[ c_S^{n+1}.PC' := \begin{cases}
SISR + 4 & \text{if } JISR(c_S^n) \\
{c_S^n}'.PC' & \text{otherwise}
\end{cases} \tag{3.22} \]
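For the memory we have the following:
\[ c_S^{n+1}.M = \begin{cases}
repeat(c_S^n) \;?\; c_S^n.M : {c_S^n}'.M & \text{if } JISR(c_S^n) \\
{c_S^n}'.M & \text{otherwise}
\end{cases} \tag{3.23} \]
For the GPR we have the following:
\[ c_S^{n+1}.GPR = \begin{cases}
repeat(c_S^n) \;?\; c_S^n.GPR : {c_S^n}'.GPR & \text{if } JISR(c_S^n) \\
{c_S^n}'.GPR & \text{otherwise}
\end{cases} \tag{3.24} \]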
For the FPR we have the same construction as for the GPR register file.
The next step computation with interrupts $c_S^{n+1}$ for the SPR file is equal
to the next step computation without interrupts ${c_S^n}'$ when
$JISR(c_S^n)$ does not hold. In the case of $JISR(c_S^n)$ we compute $c_S^{n+1}$ for the SPR in the
following way:
\[ c_S^{n+1}.SPR = \lambda_{i \in \mathbb{Z}_{17}} \begin{cases}
0^{32} & i = SR \\
repeat(c_S^n) \;?\; c_S^n.SPR[SR] : {c_S^n}'.SPR[SR] & i = ESR \\
MCA(c_S^n) & i = ECA \\
repeat(c_S^n) \;?\; c_S^n.PC' : {c_S^n}'.PC' & i = EPC \\
repeat(c_S^n) \;?\; c_S^n.DPC : {c_S^n}'.DPC & i = EDPC \\
EData(c_S^n) & i = EData \\
repeat(c_S^n) \;?\; c_S^n.SPR[MODE] : {c_S^n}'.SPR[MODE] & i = EMODE \\
repeat(c_S^n) \;?\; c_S^n.SPR[i] : {c_S^n}'.SPR[i] & \text{otherwise}
\end{cases} \]
After executing an interrupt service routine we need to restore all
necessary registers. Therefore the last command in the interrupt service routine
should be the return from exception command rfe. This instruction
changes PC′, DPC and SPR in the following way:
\[ c_S^{n+1}.PC' := c_S^n.SPR[EPC] \]
\[ c_S^{n+1}.DPC := c_S^n.SPR[EDPC] \]
\[ c_S^{n+1}.SPR[SR] := c_S^n.SPR[ESR] \]
\[ c_S^{n+1}.SPR[MODE] := c_S^n.SPR[EMODE] \]
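How the jump to the interrupt service routine and the rfe instruction fit together can be sketched for the program counters and the SPR file. This is a simplified illustration under several assumptions: configurations are dictionaries with the hypothetical keys `spr`, `dpc` and `pc_prime`, the register indices follow Table 3.1, the value chosen for SISR is a placeholder, and registers that are not listed explicitly in the case distinction (including MODE) are handled by the "otherwise" case as printed above.

```python
SR, ESR, ECA, EPC, EDPC, EDATA, EMODE, MODE = 0, 1, 2, 3, 4, 5, 11, 16
SISR = 0  # placeholder: the text only says that SISR is a fixed address


def jisr_step(c, c_next, mca, edata, repeat):
    """Configuration after a jump to the ISR.

    c is the configuration before the interrupted instruction and
    c_next the configuration after executing it without interrupts.
    """
    src = c if repeat else c_next
    spr = list(src["spr"])            # 'otherwise' case for unlisted registers
    spr[SR] = 0
    spr[ESR] = src["spr"][SR]
    spr[ECA] = mca
    spr[EPC] = src["pc_prime"]
    spr[EDPC] = src["dpc"]
    spr[EDATA] = edata
    spr[EMODE] = src["spr"][MODE]
    return {"spr": spr, "dpc": SISR, "pc_prime": SISR + 4}


def rfe_step(c):
    """Configuration after the return from exception instruction."""
    spr = list(c["spr"])
    spr[SR] = c["spr"][ESR]
    spr[MODE] = c["spr"][EMODE]
    return {"spr": spr, "dpc": c["spr"][EDPC], "pc_prime": c["spr"][EPC]}
```

For a repeat interrupt the saved program counters point at the interrupted instruction itself, so executing rfe directly after the jump resumes at that instruction; for a continue interrupt they point at its successor.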
This completes the formal definition of the specification of the
microprocessor with MMU. Later we introduce an implementation and prove that our
implementation realizes this specification.
In addition the following proposition trivially holds.
Proposition* 3.1.1 As long as no interrupt occurs, both specification
computations are equal, i.e., for all n ∈ N:
\[ (\forall m \in \mathbb{Z}_n : \neg JISR(\tilde{c}_S^{m})) \implies \tilde{c}_S^{n} = c_S^{n} \]
Note that we can also exchange $JISR(\tilde{c}_S^{m})$ with $JISR(c_S^{m})$.
3.2 Implementation of the VAMP
3.2.1 Tomasulo Algorithm
There are two principles of instruction execution in microprocessors. The
first is called in-order. In this case all the instructions are executed in the
pipeline consecutively, i.e. instruction $I_i$ will be completely executed after
instruction $I_{i-1}$ and before instruction $I_{i+1}$ even if $I_i$ needs no result of $I_{i-1}$.
The second approach is called out-of-order (OOO) execution. In this case
an instruction $I_i$ that does not need any input from the instruction $I_{i-1}$ can
be executed before or simultaneously with the instruction $I_{i-1}$.
The implementation of the VAMP processor is based on the out-of-order
implementation principle. We implement the so-called Tomasulo algorithm
[Tom67]. The Tomasulo algorithm was first implemented in the IBM 360/91
Floating Point Unit and is used in many modern processors, e.g. Pentium 2,
3, 4; AMD K5, K6 and Athlon; Power P6, 603/ 604/ G3/ G4; MIPS R10000,
R12000; Alpha 21264.
In order to realize precise interrupts we implemented this algorithm with
a so-called reorder buffer (ROB) [SP88]. We call an interrupt between
instructions $I_{i-1}$ and $I_i$ precise iff instructions $I_0, \ldots, I_{i-1}$ were completed
before starting the interrupt service routine (ISR) and the processor did not
change the state of the machine for instructions $I_i, I_{i+1}, \ldots$. With the help
of the ROB we execute instructions out-of-order inside the pipeline, while
the instructions leave the pipeline in-order. We also extend the register file
with a so-called producer table in order to implement OOO execution. The
additional information per register is a valid bit and a tag. If the valid
bit is active then no instruction in the pipeline writes to the corresponding
register, i.e. the register contains the correct data. If the valid bit is inactive
we know that an instruction in the pipeline writes into this register. The
tag is assigned during instruction issue. It stays unique until the instruction
terminates.
[Figure 3.1: The VAMP Data Paths. The diagram shows the pipeline stages IF, ID, EX, C and WB; the PC environment with PC′, DPC and IR; the register files GPR, FPR and SPR; the reservation stations (RS) and producers (P) of the functional units MEM, XPU, FPU1, FPU2 and FPU3; the Common Data Bus (CDB); and the reorder buffer (ROB).]
Figure 3.1 shows the data paths of the VAMP processor. The execution
begins with the instruction fetch. The Tomasulo scheduling algorithm does
not cover this phase. In this phase the instruction is loaded from the memory.
The next stage is ID. In this stage the instruction is decoded and issued
to a reservation station (RS) with its source operands. Each functional
unit (FU) has its own reservation stations. We also reserve a new tag for
this instruction and mark the destination registers of this instruction as
invalid. We set the tag of each destination register to the instruction's tag. If
not all operands are valid in the registers or present in the ROB, we need to save for
each unavailable source operand the tag of the instruction which will return
this operand in its destination register. In order to obtain the unavailable operands,
the RS watches the Common Data Bus (CDB), which we describe
later, for the operands which are missing. This is called CDB snooping.
When all operands are valid and the FU is free, we execute the
instruction. After the execution we save the result in the producer
(P). After the producer puts the result on the CDB, the result of this
instruction is available to all instructions which wait for it. We also save the
result in the ROB. During the stages EX and WB of the pipeline
this instruction is invalid in the ROB. As soon as the oldest result in the ROB
is valid we can write the result from the ROB into the register files. Note that
we have an in-order output from the ROB because we also add the instructions
to the ROB in-order during instruction issue. Additional details about the
Tomasulo algorithm with reorder buffer can be found in the thesis of Daniel
Kröning [Krö01].
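The interplay of out-of-order completion and in-order retirement can be illustrated with a strongly simplified reorder buffer. The class below is a sketch under assumptions, not the VAMP design: tags are taken from an unbounded counter, every instruction has exactly one destination register, and reservation stations, the producer table and interrupts are left out.

```python
from collections import deque


class ReorderBuffer:
    """Minimal model: entries are kept in issue order and retired from the head."""

    def __init__(self):
        self.entries = deque()
        self.next_tag = 0

    def issue(self, dest):
        """Allocate an entry at issue time; its result is not yet valid."""
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "valid": False, "value": None})
        return tag

    def complete(self, tag, value):
        """A result for 'tag' appears on the common data bus."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["valid"], entry["value"] = True, value

    def retire(self, regs):
        """Write back results from the head while the oldest entry is valid."""
        retired = []
        while self.entries and self.entries[0]["valid"]:
            entry = self.entries.popleft()
            regs[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired
```

If the younger of two instructions completes first, nothing is written back until the older one has completed as well; this ordering is what allows interrupts to be precise.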
3.2.2 The VAMP Design
In analogy to the specification we denote by $c_I$ the implementation
configuration of the VAMP. All components of $c_I$ are defined later. The
implementation of the VAMP has five separate execution units. There is
one for the fixed point arithmetic XPU, three for the floating point
arithmetic (FPU1 for addition/subtraction, FPU2 for multiplication/division
and FPU3 for conversion/testing) and a memory unit MU. The design and
correctness proof of the three FPUs are omitted in this thesis and can be found
in [Jac02a]. The fixed point unit is implemented as an ALU with a shifter.
The correctness of this unit follows directly from the basic circuit
correctness because there are no registers inside this unit. The implementation and
correctness of the memory unit MU are described in Section 3.2.3.
Our implementation of the Tomasulo algorithm has eight reservation sta-
tions: four reservation stations for the fixed point unit and one reservation
station for each other functional unit. We also have some instructions which
use no functional unit, e.g. the instruction movs2i that copies data from a
special purpose register to a general purpose register. We write these in-
structions in the reorder buffer directly after they are issued but, of course,
since these instructions cannot snoop on the CDB they have to be stalled in
the stage decode until all their operands are available.
The previous implementation of the VAMP had up to six 32-bit source
operands. Six operands were only used for floating point operations:
four for the two 64-bit double precision operands, one for the status
register, and one for the rounding mode register. For any memory access we
used at most three operands, one for the address and two 32-bit operands
for 64-bit wide data.
In our implementation of the VAMP we keep the upper limit of six source
operands, but for any memory access instruction we add three new operands
for the page table origin, the page table
length, and the mode bit. Therefore we have six source operands for any
memory instruction. Note that since we keep the old limit on the number of source
operands these extensions do not enlarge the hardware of the VAMP. Alternatively,
we could read the PTO, PTL, and MODE registers directly from the SPR file. This
approach, however, would require a more complicated proof.
In the previous VAMP the producer table for the special purpose register
file was implemented as a RAM with 5-bit addresses, and we could
change the valid bit and the tag of only one register at a time.
Since the effect of the rfe instruction is now SPR[MODE] := SPR[EMODE] and
SPR[SR] := SPR[ESR], we need to update the producer table for two special
purpose registers at the same time. We use an approach similar to that for the
FPR register file and its producer table. We split the producer table for the special
purpose register file into two RAMs with 4-bit addresses. The first RAM
holds the valid bits and tags for registers 0 to 15 according to
Table 3.1, and the second RAM holds only the valid bit and the tag of the
single register MODE. Thus we can change the producer table for two SPR registers
simultaneously as long as one of these registers is the MODE register. Note
that we introduced the registers S12-S15 only in order to define all registers
in the first RAM.
We also use the same fetch mechanism in the VAMP implementation as
in the previous one. We have separate stages fetch and decode as shown in
Figure 3.1. Note that we fetch instruction Ii+1 and decode instruction Ii
simultaneously in order to evaluate the branch condition prior to the next
instruction fetch. Therefore we decided to use the delayed branch mechanism
in the VAMP in analogy to the DLX from [MP00]. For this purpose the semantics
of the assembler instruction set was changed in [MP00] such that all
jumps and branches are executed with a delay of one instruction, i.e.
the instruction following any jump or branch instruction is always
executed. More details and the correctness of the instruction fetch are described
later.
After the fetch of an instruction we save the result in the S1 registers,
which are the instruction register cI .S1.IR, the exception flag for misalign-
ment for instruction fetch cI .S1.imal, and the flag for the page fault on
instruction fetch cI .S1.ipf . Later, in our correctness criteria we see that
register cI .S1.IR is mapped to the function IR(cS) which is defined on the
programmer’s model, cI .S1.imal is mapped to imal(cS) and cI .S1.ipf to
CA(cS)[ipf ].
The previous VAMP had a mechanism for the internal interrupts. We
extend this mechanism for external devices.
Following [Bey05] the exception cause CA(cI) and the exception data
EData(cI) are parts of the result in the reorder buffer (only for internal
interrupts) and we can compute MCA(cI), JISR(cI) and repeat(cI) dur-
ing writeback in a way similar to the corresponding functions MCA(cS),
JISR(cS), and repeat(cS) in the programmer’s model.
As long as we do not have any interrupts the VAMP works as the standard
Tomasulo implementation, otherwise we make a so-called flush, i.e. we clear
the instruction register, all reservation stations, functional units, producers,
and the reorder buffer and mark all registers in the register file as valid.
After that we start to execute the ISR according to Equations 3.19, 3.20,
and 3.25 in the programmer’s model.
An implementation configuration is a 17-tuple cI = (PC ′, DPC, M ,
GPR, FPR, SPR, S1, RS, P , ROB, ROBhead, ROBtail, ROBcount, MU ,
FPU1, FPU2, FPU3). The components which were not yet described are:
• ROBhead – points to the head of the reorder buffer,
• ROBtail – points to the tail of the reorder buffer,
• ROBcount – contains the number of the entries, which are currently
in the ROB. This counter is used in order to test whether the ROB is
full or empty when ROBhead = ROBtail.
As before, we also do not have a component for the XPU because this func-
tional unit contains no registers. Later we will use the same components in
the implementation with an index I similar to the specification configuration.
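The role of the three ROB components can be illustrated by a minimal software sketch. It is not the VAMP hardware; the class name, the method names, and the entry layout are hypothetical and only serve to show why ROBcount is needed to distinguish a full from an empty buffer when ROBhead = ROBtail.

```python
class ReorderBuffer:
    """Circular buffer with head, tail, and an explicit entry counter."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.head = 0    # oldest entry, next to be written back
        self.tail = 0    # next free slot, filled at issue
        self.count = 0   # number of occupied entries

    def is_empty(self):
        return self.count == 0

    def is_full(self):
        return self.count == self.size

    def allocate(self, instr):
        """Add an instruction at the tail (in order); return its tag."""
        if self.is_full():
            raise RuntimeError("ROB full")
        tag = self.tail
        self.entries[tag] = {"instr": instr, "valid": False, "result": None}
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return tag

    def complete(self, tag, result):
        """Store a result for the given tag and mark the entry valid."""
        self.entries[tag]["valid"] = True
        self.entries[tag]["result"] = result

    def retire(self):
        """Remove the oldest entry if it is valid; otherwise return None."""
        if self.is_empty() or not self.entries[self.head]["valid"]:
            return None
        entry = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return entry["instr"], entry["result"]
```

In this sketch head and tail coincide both for the empty and for the full buffer; only the counter tells the two cases apart, and results leave the buffer in issue order even if they complete out of order.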
Following [Bey05], we define the memory content cI .M as follows:
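\[
c_I^0.M := \mathit{init\_mem}, \qquad
\bigl| c_I^{t+1}.M[\langle ad \rangle] \bigr|_b :=
\begin{cases}
\bigl| din^t \bigr|_b & \text{if } M_I.mbw(ad, b)^t \\
\bigl| c_I^t.M[\langle ad \rangle] \bigr|_b & \text{otherwise}
\end{cases}
\]
Note that we can apply decomposition to cI.M, i.e. $c_I^{t+t'}.M = c_I[c_I^t]^{t'}.M$.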
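The byte-wise update of the memory can be sketched as follows. This is only an illustration under the assumption that a memory is modeled as a dictionary from cell addresses to lists of B = 8 bytes; the function name next_mem and the representation of mbw as a Python function are hypothetical.

```python
B = 8  # bytes per memory cell (one double word)

def next_mem(mem, din, mbw):
    """Compute the memory of cycle t+1 from the memory of cycle t.

    mem : dict mapping a cell address to a list of B byte values
    din : list of B byte values on the data input in cycle t
    mbw : function (ad, b) -> bool, byte write signal for byte b of cell ad
    """
    new_mem = {}
    for ad, cell in mem.items():
        # A byte is replaced by the data input iff its write signal is active.
        new_mem[ad] = [din[b] if mbw(ad, b) else cell[b] for b in range(B)]
    return new_mem
```

Bytes whose write signal is inactive keep their old value, which corresponds to the second case of the definition.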
We also will use ctI for the configuration of hardware cycle t. We can
make a definition for initial implementation configurations as in [Bey05].
Definition* 3.2.1 We call an implementation configuration cI initial, de-
noted init?(cI), iff all reservation stations, execution units, producer regis-
ters, the ROB, and the decode stage are empty and all registers are valid, in
the program counter DPC the address SISR is saved and in PC ′ the address
SISR+ 4 is saved. Formally, we have:
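\[
\begin{aligned}
\mathit{init?}(c_I) :\iff{} & \neg c_I.S1.full \land c_I.ROBcount = 0 \;\land \\
& \mathit{empty?}(c_I.MU) \land \mathit{empty?}(c_I.FPU1) \;\land \\
& \mathit{empty?}(c_I.FPU2) \land \mathit{empty?}(c_I.FPU3) \;\land \\
& c_I.DPC = SISR \land c_I.PC' = SISR + 4 \;\land \\
& (\forall x \in \mathbb{Z}_8 : \neg c_I.RS[x].full) \land (\forall x \in \mathbb{Z}_5 : \neg c_I.P[x].full) \;\land \\
& (\forall x \in \mathbb{Z}_{32} : c_I.GPR[x].valid \land c_I.FPR[x].valid) \;\land \\
& (\forall x \in \mathbb{Z}_{17} : c_I.SPR[x].valid)
\end{aligned}
\]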
Note that the bit cI.RS[x].full is active iff reservation station x is full
and the bit cI.P[x].full is active iff producer x is full. The predicate empty? for
a functional unit holds only if the current configuration of this functional
unit is initial, i.e. the configuration after power up or an interrupt. Note also that
we do not need empty? for the XPU since the XPU is purely combinational.
We will not give any details on the predicate empty? for the FPUs since this
would involve the complete pipeline structure of the three floating point units.
The predicate empty? for the MU is defined in Section 3.2.3.
We define a function spec_conf that creates a specification configuration
from an initial implementation configuration. This function just takes all
the visible parts from the implementation, i.e., it is defined by the following
equations:
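\[
\begin{aligned}
\mathit{spec\_conf}(c_I).PC' &= c_I.PC' \\
\mathit{spec\_conf}(c_I).DPC &= c_I.DPC \\
\mathit{spec\_conf}(c_I).M &= c_I.M \\
\mathit{spec\_conf}(c_I).GPR &= \lambda x \in \mathbb{Z}_{32}.\ c_I.GPR[x].data \\
\mathit{spec\_conf}(c_I).FPR &= \lambda x \in \mathbb{Z}_{32}.\ c_I.FPR[x].data \\
\mathit{spec\_conf}(c_I).SPR &= \lambda x \in \mathbb{Z}_{17}.\ c_I.SPR[x].data
\end{aligned}
\]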
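As a rough illustration, the projection spec_conf can be written as a function that keeps only the visible components and drops valid bits and all other implementation state. The dictionary layout and the class Reg are assumptions made for this sketch and are not part of the VAMP model.

```python
from dataclasses import dataclass

@dataclass
class Reg:
    """Implementation register: data plus the valid bit of the producer table."""
    data: int
    valid: bool = True

def spec_conf(c_i):
    """Project an implementation configuration onto its visible part."""
    return {
        "PC'": c_i["PC'"],
        "DPC": c_i["DPC"],
        "M": c_i["M"],
        "GPR": [r.data for r in c_i["GPR"]],  # 32 general purpose registers
        "FPR": [r.data for r in c_i["FPR"]],  # 32 floating point registers
        "SPR": [r.data for r in c_i["SPR"]],  # 17 special purpose registers
    }
```

Components such as the reservation stations or the ROB counters do not appear in the result; they have no counterpart in the specification configuration.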
3.2.3 Implementation of the VAMP Memory Unit
Figure 3.2 shows the extension of the previous VAMP with two MMUs,
one for the instruction fetch (IMMU) and one for load / store (DMMU).
The interface signals of the MMUs are named according to Definitions 1.5.1
and 1.5.3. For the two interfaces we use the same subscripts p and m as were
introduced in Section 2.1 for CPUI and MI respectively.
The whole memory unit of the VAMP is presented in Figure 3.3 and
consists of three parts:
1. The memory system MI satisfying Definition 1.5.1 with address width
a = 29 and 8 bytes per cell, B = 8.
2. The DMMU, which is a copy of the MMU from Chapter 2, and some auxiliary
circuits for memory access processing. The auxiliary circuits (part of
CPUI) are:
• circuit shift4store is used to shift the data which have to be
written to the correct byte position in a double word of the CPUI
interface. We use this circuit because the data input and output
of the memory are 64 bits wide and we need to access the memory with
d ∈ {1, 2, 4, 8} bytes.
• circuit genbw is used to generate the correct byte write signals
mbw for the memory interface for store operations.
Figure 3.2: Extension of the VAMP with Two MMUs
• circuit shift4load performs a shifting similar to shift4store
but for read accesses. Since both signed and unsigned loads
are supported, the circuit shift4load also performs sign extension,
denoted as sext, or zero extension, denoted as zext, of the data
which is loaded.
• circuit comp_adr is used to compute the effective byte address
according to Definition 3.8.
• circuits flags, flags2, flags3 are used in order to compute
control signals for load / store memory accesses.
Complete definitions of these circuits can be found in [Bey05].
Implementations of these circuits (other than flags, flags2, flags3) are
given in [MP00, p. 78–88].
3. The IMMU, which is a copy of the MMU from Chapter 2, and some auxiliary
circuits which are part of CPUI. We compute the signal imr and the
instruction address iadr with the help of the circuit gen_pc. Since
the memory interface only allows 64-bit wide operations and
our instruction word is 32 bits wide, we use a multiplexer in order to
choose the correct instruction word. The construction of gen_pc is
described later.
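The behavior of the circuits genbw and shift4store on aligned accesses can be sketched in software as follows. This is not the circuit from [Bey05] or [MP00]. In particular we assume here that byte b of a double word occupies bits 8b+7 down to 8b; the byte order and the parameter names are assumptions of this sketch.

```python
def genbw(offset, d):
    """Byte write signals for an aligned access of d bytes at byte offset."""
    assert d in (1, 2, 4, 8) and offset % d == 0 and offset + d <= 8
    return [offset <= b < offset + d for b in range(8)]

def shift4store(data, offset, d):
    """Move the low d bytes of data to byte position offset of a 64-bit word."""
    assert d in (1, 2, 4, 8) and offset % d == 0 and offset + d <= 8
    mask = (1 << (8 * d)) - 1           # keep only the d bytes to be stored
    return (data & mask) << (8 * offset)
```

For a half word store at byte offset 4, for example, genbw activates exactly the write signals of bytes 4 and 5, and shift4store places the two data bytes at these positions.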
Figure 3.3: The VAMP Memory Unit
Load / store instructions are processed as follows. On dispatch, the
effective address ea is computed from two fields in the reservation station:
the instruction register, which contains a copy of the instruction word, and the
RS1 operand in the reservation station. The address ea is written into the
memory stage together with the shifted data from the circuit shift4store
for store instructions. The tag from the reservation station, the memory
byte write signals from the circuit genbw, and some flags encoding
the type of the memory access are written into the memory stage as well.
In case of misalignment (dmal = 1) we must not access the memory at all.
Otherwise we can start a request to the DMMU. Note that in the
case of a store we also need to wait until the tag of the instruction is
equal to ROBhead. In this case we know that the store instruction is the
oldest instruction in the pipeline, which is needed for the organization of precise
interrupts. The request to the DMMU takes at least three cycles. During
this time the DMMU is busy and initiates up to two memory accesses, of which
only the second can be a write access. The memory stage is stalled
and stall_out signals to the Tomasulo scheduler that the memory unit is
not ready to accept the next instruction on dispatch.
When the DMMU finally signals the end of the request with ¬dbusyp we
have two possibilities as in the previous VAMP. In the first case stall_in
signals that the Tomasulo scheduler cannot currently accept outputs from the
DMMU, and we need to wait until the signal stall_in is inactive. In this
case we therefore store the result of the memory access in intermediate
registers: we store the exception flag dexcp in the register mem.ctrl, and in
the case of a read access the data returned from the DMMU is stored in
the register mem.data. In the second case, or as soon as stall_in becomes inactive, the result of the
instruction leaves the memory unit. For load instructions the data passes through
the circuit shift4load, which shifts the double word to the right and
returns only the d ∈ {1, 2, 4, 8} rightmost bytes of the data. Note that shift4load
returns either sign extended or zero extended shifted data.
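A software sketch of the behavior of shift4load is given below. It only illustrates the shift to the right and the two kinds of extension. The byte numbering (byte b at bits 8b+7 down to 8b) and the extension width of 32 bits for d < 8 are assumptions of this sketch, not facts taken from the circuit definition.

```python
def shift4load(dword, offset, d, signed, width=32):
    """Return the d bytes at byte offset of a 64-bit word, extended to width bits."""
    assert d in (1, 2, 4, 8) and offset % d == 0 and offset + d <= 8
    value = (dword >> (8 * offset)) & ((1 << (8 * d)) - 1)  # d rightmost bytes
    if d == 8:
        return value                     # full double word, nothing to extend
    sign_bit = 1 << (8 * d - 1)
    if signed and (value & sign_bit):
        # sext: fill the bits above the loaded data with ones
        value |= ((1 << width) - 1) & ~((1 << (8 * d)) - 1)
    return value                         # zext leaves the upper bits at zero
```

With signed set, a loaded byte 0x80 becomes 0xFFFFFF80, whereas the zero extended result is 0x00000080.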
Instruction fetch works as follows. If the input signal fetch is
active and there is no misalignment imal, we start the IMMU access
with the instruction address received from the additional input adr_i
(we define this input later). Page table origin, page table length, and mode
are read directly from the SPR file. The request to the IMMU takes at least
three cycles. During this time the IMMU can also initiate up to two memory
accesses, all of which are read accesses. When the IMMU finally signals the
end of the access with ¬ibusyp we use PC[2] in order to select the half of
the double word which has to be saved in the instruction register.
Note that we need to guarantee valid inputs according to
Definitions 1.5.1 and 1.5.3 for both MMUs. Therefore we add a stabilizing
circuit in order to stabilize the DMMU inputs in case of an interrupt.
With the help of this stabilizing circuit we can later show that the predicate
p_req_is_stable from Section 2.1.1 holds. We introduce an additional
control bit rollback in the memory stage which is only active when
an interrupt JISR(cI) has occurred during a request to the DMMU. The
control bit rollback stays active until the end of the interrupted request, i.e.
until the DMMU signals ¬dbusyp. Note that in this case we do not use the
result of the interrupted request. The memory unit has to activate the signal
stall_out as long as rollback is active. As long as the bit rollback is active
one of the signals mrp or mwp is also active. With the help of the signal I_s we
encode the type of the access: the signal I_s is active for store operations,
and we prove later that store accesses cannot be interrupted. The stabilizing
circuit is defined in Figure 3.4.
Figure 3.4: Stabilizing Circuit for the Data MMU
The following equations describe the behavior of stabilizing circuit ac-
cording to Figure 3.4:
\[
\begin{aligned}
c'_I.rollback &= (mr \lor mw \lor rollback) \land (JISR(c_I) \lor rollback) \land dbusy \\
stall\_out &= mem.full \lor rollback \\
mr_p &= mr \lor rollback \land \neg mem.I\_s \\
mw_p &= mw \lor rollback \land mem.I\_s
\end{aligned}
\tag{3.25}
\]
where
• mw is a signal from the circuit flags2 and is active when the memory
stage is full, there is no misalignment, and the instruction
in the memory stage is decoded as a store instruction.
• mr is a signal from the circuit flags2 and is active when the memory
stage is full, there is no misalignment, and the instruction
in the memory stage is decoded as a load instruction. Note that when
mr ∨ mw holds we have ¬dmal(cI) and the signal full is active.
• full is a signal which indicates that there currently is an instruction
in the memory unit. Note that for all cycles t ∈ N we know that
1. $mr^t \lor mw^t \implies mem.full^t$,
2. $\neg(mem.full^t \land c_I^t.rollback)$.
Both properties follow directly from the construction of the MU and
were proved before in [Bey05].
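The equations of the stabilizing circuit can be evaluated cycle by cycle with a small sketch. The function below simply transcribes Equations 3.25 with the usual precedence of ∧ over ∨; the function name and the argument order are hypothetical, and the signal dbusy is passed in as an input rather than produced by a DMMU model.

```python
def stabilizer(mr, mw, rollback, jisr, dbusy, mem_full, i_s):
    """Return (next rollback bit, stall_out, mr_p, mw_p) for one cycle."""
    next_rollback = (mr or mw or rollback) and (jisr or rollback) and dbusy
    stall_out = mem_full or rollback
    mr_p = mr or (rollback and not i_s)   # keep the read request stable
    mw_p = mw or (rollback and i_s)       # keep the write request stable
    return next_rollback, stall_out, mr_p, mw_p
```

If an interrupt hits a load while the DMMU is still busy, rollback is set for the next cycle. After the flush mr is inactive, but mr_p stays active through rollback until the DMMU lowers its busy signal, so the request seen by the DMMU remains stable.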
For the stabilizing circuit genPC, which is based on the same principle, we add an
additional control bit irollback in order to stabilize the read signal imrp.
We also add a register mPC that stores the PC in the
cycle when the signal JISR(cI) is active during an access to the IMMU (see
Figure 3.5). If an interrupt occurs during a request to the
IMMU, the control bit irollback is raised and the IMMU takes the
address of the request from mPC. The control bit irollback becomes inactive in
the cycle after the end of the access to the IMMU, i.e. after ¬ibusyp. The
following equations summarize the behavior of the signals from Figure 3.5.
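\[
\begin{aligned}
irollback' &= \bigl(JISR(c_I) \land (\neg imal \land fetch \lor irollback)
\lor \neg JISR(c_I) \land irollback\bigr) \land ibusy_p \\
mPC' &=
\begin{cases}
PC[31:3] & \text{if } ibusy_p \land \neg irollback \\
mPC & \text{otherwise}
\end{cases} \\
imr_p &= \neg imal \land fetch \lor irollback \\
iadr_p &=
\begin{cases}
mPC & \text{if } irollback \\
PC[31:3] & \text{otherwise}
\end{cases}
\end{aligned}
\tag{3.26}
\]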
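Equations 3.26 can be transcribed into a one-cycle step function in the same way. The sketch below is only an illustration; the tuple representation of the state and the function name are hypothetical, and ibusy_p is an input here instead of being produced by an IMMU model.

```python
def gen_pc_step(state, pc, jisr, imal, fetch, ibusy_p):
    """One cycle of the genPC stabilizer; state is the pair (irollback, mPC)."""
    irollback, mpc = state
    pc_hi = pc >> 3                                   # PC[31:3]
    imr_p = (not imal and fetch) or irollback
    iadr_p = mpc if irollback else pc_hi
    next_irollback = ((jisr and ((not imal and fetch) or irollback))
                      or (not jisr and irollback)) and ibusy_p
    next_mpc = pc_hi if (ibusy_p and not irollback) else mpc
    return (next_irollback, next_mpc), imr_p, iadr_p
```

When an interrupt arrives during a busy fetch, the address of the running request is kept in mPC. In the following cycles the IMMU continues to see this address and an active read signal, although the PC has already changed.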
Now we give the construction of the address computation for instruction
fetch, which is unchanged from [Bey05]. We define the function adr_i(cI) which
returns the address used in the instruction fetch implementation according
to Figure 3.6.
\[
\mathit{adr\_i}(c_I) :=
\begin{cases}
c_I.SPR[EDPC] & \text{if } c_I.S1.full \land \mathit{rfe?}(c_I.S1.IR) \\
c_I.PC' & \text{if } c_I.S1.full \land \neg \mathit{rfe?}(c_I.S1.IR) \\
c_I.DPC & \text{otherwise}
\end{cases}
\tag{3.27}
\]
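The case distinction of Equation 3.27 corresponds to the following sketch. The dictionary layout of the configuration and the predicate is_rfe, which stands for the decoder predicate rfe?, are assumptions made for illustration only.

```python
def adr_i(c_i, is_rfe):
    """Select the address used for the next instruction fetch."""
    s1 = c_i["S1"]
    if s1["full"] and is_rfe(s1["IR"]):
        return c_i["SPR"]["EDPC"]   # return from exception
    if s1["full"]:
        return c_i["PC'"]           # decode stage holds a non-rfe instruction
    return c_i["DPC"]               # decode stage is empty
```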
Figure 3.5: Stabilizing Circuit for the Instruction MMU (Circuit genPC)
Figure 3.6: Fetch Implementation in the VAMP.
We introduce the predicate empty? for the memory unit. This predicate
holds if
• no instruction is executed in the memory unit,
• the state of the control automaton of the DMMU is unique, i.e. the
control cannot be in two different states in the same cycle. For this
purpose we define the predicate is_unary? which holds in this case,
• the state of the control automaton of the IMMU is unique, i.e. the
control cannot be in two different states in the same cycle,
• no interrupted read instruction is being executed in the memory unit,
• no interrupted fetch is being executed in the memory unit, and
• a write request cannot be interrupted.
Formally we have:
\[
\begin{aligned}
\mathit{empty?}(c_I.MU) := {} & \neg full \\
& \land \mathit{is\_unary?}(state_{DMMU}) \\
& \land \mathit{is\_unary?}(state_{IMMU}) \\
& \land (\neg rollback \implies state_{DMMU}.idle) \\
& \land (\neg irollback \implies state_{IMMU}.idle) \\
& \land (rollback \implies \neg I\_s)
\end{aligned}
\]
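The predicate can be sketched in software as follows. We assume here that the control automata are encoded with one bit per state and read is_unary? as "exactly one state bit is set"; this reading, the position of the idle bit, and all names are assumptions of the sketch.

```python
def is_unary(state_bits):
    """True iff exactly one bit of the state vector is set."""
    return sum(1 for b in state_bits if b) == 1

def mu_empty(full, rollback, irollback, i_s, dmmu_state, immu_state, idle=0):
    """Sketch of empty? for the memory unit; idle is the index of the idle bit."""
    def implies(a, b):
        return (not a) or b
    return (not full
            and is_unary(dmmu_state)
            and is_unary(immu_state)
            and implies(not rollback, dmmu_state[idle])
            and implies(not irollback, immu_state[idle])
            and implies(rollback, not i_s))
```

Note that the memory unit counts as empty even while an interrupted read is still being finished by the DMMU, as long as rollback is active and the interrupted access is not a store.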
3.3 Correctness Criteria
There are two basic concepts to organize correctness criteria for out-of-order
processors.
1. The first is based on the so-called “flushing” technique. At any time
there is some number of instructions in the processor pipeline. We can
execute these instructions while no new instructions are initiated, i.e.
while we stall the instruction fetch. After a certain number of cycles
all instructions which were in the pipeline have been executed; this is
called a “flushing” operation. The basic idea of this correctness criterion
is as follows. At the start the implementation of the processor has a
configuration $c_I^j$. We flush the pipeline, and by applying the function
spec_conf we obtain for some instruction i the specification configuration $c_S^i$
reached after executing all these instructions. We can do this
because, if the pipeline is empty and the implementation of the processor
is correct, the specification configuration is a part of the implementation
configuration. We can now execute one more instruction i + 1
on the specification configuration and obtain the next specification
configuration $c_S^{i+1}$. There is also another way to calculate $c_S^{i+1}$:
we do not stall the pipeline until the next instruction i + 1 begins to
execute. We then flush the pipeline and after flushing we can calculate
the new specification configuration $\hat{c}_S^{i+1}$. The implementation of our
processor satisfies the specification iff $c_S^{i+1} = \hat{c}_S^{i+1}$. This technique was
described in [BD94] and is widely used for automatic hardware verification.
This correctness criterion has been applied to many designs,
e.g., by Hosabettu et al. [HSG99] and Velev and Bryant [VB00].
Name     Scheduling function
fetch    Scheduling function for the stage fetch.
decode   Scheduling function for the stage decode.
issue    Scheduling function for the issue.
R        Scheduling function for every visible register R except the memory.
mem      Scheduling function for the stage mem which is inside the memory unit MU.
MI       Scheduling function for the visible memory.
wb       Scheduling function for the stage writeback.
inst     Scheduling function for the visible register files for correctness with interrupt.
Table 3.3: Scheduling Functions of the VAMP
2. The second concept is based on scheduling functions. Scheduling
functions map a pipeline stage in some cycle to the index of the
instruction, according to the specification, which is in that stage. They were introduced
in [MP00, BJK+03]. A similar idea, based on the Micro-Architectural
Execution Trace Table, was used by Sawada and Hunt
in [SH98, HS99].
Definition 3.3.1 If k is some stage in the CPU and t is a cycle, a
scheduling function sI(k, t) returns a natural number which we
associate with the index of the instruction that is in stage k in cycle t.
We decided to use this approach because the previous proof of the VAMP
was based on scheduling functions.
3.3.1 Scheduling Functions
Since we do not change the scheduling functions we only give their
definitions. We have only changed the signals which are used for the
computation of the scheduling functions. More details about scheduling functions
can be found in [Bey05, p. 126–132].
First we give the definitions of the scheduling functions without
interrupts. Table 3.3 shows some scheduling functions which are used
in this thesis. Scheduling functions are defined inductively over the cycle t.
If an instruction in stage k at time t is passed to the stage k′ we have
sI(k′, t + 1) = sI(k, t). We have sI(k, 0) = −1 in order to model an empty
pipeline in the initial cycle.
We introduce the scheduling functions according to Table 3.3. We start
with the scheduling function for the stage decode. The scheduling functions for the
stages fetch, decode and issue are introduced in [Bey05] with the help of
the following signals:
• stall.1 is a signal which indicates that in the next cycle we cannot accept
inputs for the stage decode.
• ue.0 and ue.1 are the update enable signals for instruction fetch and decode,
respectively. Signal ue.0 indicates that an instruction fetch is
completed, and signal ue.1 indicates that the instruction in the instruction
register is issued in the next cycle into one of the reservation
stations or directly into the reorder buffer.
• signal S1.full is active if the stage decode cannot accept inputs from
the stage fetch. It is defined in the following way:
\[
S1.full^0 = 0, \qquad S1.full^{t+1} = ue.0^t \lor stall.1^t
\]
where t ∈ N.
For fetching the next instruction we additionally need registers
from the special purpose register file, namely SPR[PTO], SPR[PTL] and
SPR[MODE]. These values are taken directly from the register file. Therefore
we have to know that no instruction which changes one of these registers
is being executed in the pipeline. For this purpose we introduce a new signal
fetch. This signal is inactive when
• there is an instruction in the pipeline writing to one of these registers, or
• the instruction in the stage decode is rfe or movi2s, i.e. one of the
instructions which may change the registers PTO, PTL, or MODE.
Additionally, the sync instruction stalls the fetch until all previous
instructions have left the pipeline. More details about the sync instruction are given
in Section 3.4.2. Formally we define:
\[
\begin{aligned}
fetch := {} & SPR[PTO].valid \land SPR[PTL].valid \land SPR[MODE].valid \;\land \\
& \neg\bigl(S1.full \land (\mathit{rfe?}(S1.IR) \lor \mathit{movi2s?}(S1.IR) \lor \mathit{sync?}(S1.IR))\bigr)
\end{aligned}
\]
Using the fetch signal, we change the old definition of the ue.0 signal to
\[
ue.0 = \neg ibusy_p \land \neg stall.1 \land fetch \land \neg irollback
\]
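The two definitions above can be sketched as Boolean functions. The representation of the valid bits as a dictionary and of the decoded instruction as a mnemonic string are assumptions of this sketch; in the hardware these are the valid bits of the SPR producer table and the predicates rfe?, movi2s? and sync? on S1.IR.

```python
def fetch_signal(spr_valid, s1_full, ir_kind):
    """fetch is active iff PTO, PTL, MODE are valid and decode does not block."""
    regs_valid = spr_valid["PTO"] and spr_valid["PTL"] and spr_valid["MODE"]
    blocking = s1_full and ir_kind in ("rfe", "movi2s", "sync")
    return regs_valid and not blocking

def ue0(ibusy_p, stall1, fetch, irollback):
    """Update enable of the fetch stage: a fetch completes in this cycle."""
    return (not ibusy_p) and (not stall1) and fetch and (not irollback)
```

An rfe instruction only blocks the fetch while it actually sits in the decode stage, i.e. while S1.full is active; a pending write to one of the three registers blocks it through the valid bits.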
Using this signal the definition of the scheduling function for the stage
decode is as follows:
\[
s_I(dec, 0) := -1, \qquad
s_I(dec, t+1) :=
\begin{cases}
s_I(dec, t) + 1 & \text{if } ue.0^t \\
s_I(dec, t) & \text{otherwise}
\end{cases}
\tag{3.28}
\]
For the issue scheduling function:
\[
s_I(issue, 0) := 0, \qquad
s_I(issue, t+1) :=
\begin{cases}
s_I(issue, t) + 1 & \text{if } ue.1^t \\
s_I(issue, t) & \text{otherwise}
\end{cases}
\tag{3.29}
\]
The scheduling function for the stage fetch is defined as sI(fetch, t) =
sI(dec, t) + 1. The scheduling function for every visible register R except the
memory is as follows:
\[
s_I(R, 0) := 0, \qquad
s_I(R, t+1) :=
\begin{cases}
s_I(k, t) + 1 & \text{if } ue.k^t \\
s_I(R, t) & \text{otherwise}
\end{cases}
\tag{3.30}
\]
We have two scheduling functions for the memory. The first scheduling
function is sI(mem, t) for the stage mem, which is inside the memory unit
MU . The definition of this function is straightforward: If the MU accepts a
new instruction in cycle t from the reservation station the scheduling func-
tion sI(mem, t + 1) is set to the index of that instruction. Otherwise the
scheduling function remains unchanged. Note that we initialize this schedul-
ing function with −1, i.e. sI(mem, 0) = −1.
The second scheduling function is the scheduling function for the visible
memory which is defined as follows:
\[
s_I(M_I, 0) := 0, \qquad
s_I(M_I, t+1) :=
\begin{cases}
s_I(mem, t) + 1 & \text{if } (mw^t \lor mr^t) \land \neg dbusy_p^t \\
s_I(M_I, t) & \text{otherwise}
\end{cases}
\tag{3.31}
\]
The scheduling function for the stage writeback is as follows:
\[
s_I(wb, 0) := 0, \qquad
s_I(wb, t+1) :=
\begin{cases}
s_I(wb, t) + 1 & \text{if } ue.wb(c_I^t) \\
s_I(wb, t) & \text{otherwise}
\end{cases}
\tag{3.32}
\]
where $ue.wb(c_I^t)$ is the update enable signal for the writeback stage.
Finally we need a scheduling function counting completed instructions
and interrupts in order to claim correctness with interrupts. This scheduling
function is defined as:
\[
s_I(inst, 0) = 0, \qquad
s_I(inst, t+1) =
\begin{cases}
s_I(inst, t) + 1 & \text{if } ue.wb(c_I^t) \lor JISR(c_I^t) \\
s_I(inst, t) & \text{otherwise}
\end{cases}
\tag{3.33}
\]
Of course, the following proposition trivially holds.
Of course, the following proposition trivially holds.
Proposition* 3.3.2 Until one cycle after the first interrupt, the scheduling
function counting instructions and interrupts and the writeback scheduling
function are equal, i.e., for all t ∈ N:
\[
(\forall k \in \mathbb{Z}_t : \neg JISR(c_I^k)) \implies s_I(inst, t) = s_I(wb, t)
\]
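The inductive definitions of the scheduling functions for decode, issue, writeback and inst all have the shape of a counter driven by an enable signal. The following sketch computes such counters from given signal traces and checks the proposition on a trace; the function names and the example traces are hypothetical.

```python
def counter(initial, enables):
    """Return [s(0), ..., s(n)] where s(t+1) = s(t) + 1 iff enables[t] is active."""
    values = [initial]
    for en in enables:
        values.append(values[-1] + 1 if en else values[-1])
    return values

def check_prop(ue_wb, jisr):
    """Compute s(wb, .) and s(inst, .) and check their equality before interrupts."""
    s_wb = counter(0, ue_wb)
    s_inst = counter(0, [w or j for w, j in zip(ue_wb, jisr)])
    for t in range(len(ue_wb) + 1):
        if not any(jisr[:t]):            # no interrupt in cycles 0, ..., t-1
            assert s_inst[t] == s_wb[t]
    return s_wb, s_inst

# Example traces: writebacks in cycles 1, 3, 5 and an interrupt in cycle 4.
s_wb, s_inst = check_prop([False, True, False, True, False, True],
                          [False, False, False, False, True, False])
```

In the example both counters agree up to and including cycle 4; from cycle 5 on the counter for inst is ahead because it also counts the interrupt.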
3.3.2 Correctness Invariant
We define correctness criteria for computation without interrupts and for
computation with interrupts. The former will be used for the proof of the
correctness of our VAMP between interrupts. The latter will be used for the
proof of the correctness of our VAMP with interrupts.
Definition 3.3.3 We call a VAMP implementation configuration correct
without interrupts in cycle t, denoted corr?(t), iff under the assumption
$\forall t' \in \mathbb{Z}_{\le t} : \neg JISR(c_I^{t'})$ the following conditions hold:
1. $c_I^t.M = \tilde{c}_S^{\,s_I(M_I,t)}.M$
2. $c_I^t.PC' = \tilde{c}_S^{\,s_I(issue,t)}.PC'$
3. $c_I^t.DPC = \tilde{c}_S^{\,s_I(issue,t)}.DPC$
4. $s_I(dec, t) \ge 0 \implies c_I^t.S1.IR = IR_S(\tilde{c}_S^{\,s_I(dec,t)}) \land c_I^t.S1.imal = imal(\tilde{c}_S^{\,s_I(dec,t)}) \land c_I^t.S1.ipf = ipf(\tilde{c}_S^{\,s_I(dec,t)})$
5. $\forall x \in \mathbb{Z}_{32} : c_I^t.GPR[x].data = \tilde{c}_S^{\,s_I(wb,t)}.GPR[x]$
6. $\forall x \in \mathbb{Z}_{32} : c_I^t.FPR[x].data = \tilde{c}_S^{\,s_I(wb,t)}.FPR[x]$
7. $\forall x \in \mathbb{Z}_{17} : c_I^t.SPR[x].data = \tilde{c}_S^{\,s_I(wb,t)}.SPR[x]$
The correctness criterion for computation with interrupts is the following:
Definition 3.3.4 We call a VAMP implementation configuration correct
with interrupts in cycle t, denoted corr_i?(t), iff the following conditions hold:
1. $(t = 0 \lor JISR(c_I^{t-1})) \implies c_I^t.M = c_S^{s_I(inst,t)}.M$
2. $(t = 0 \lor JISR(c_I^{t-1})) \implies c_I^t.PC' = c_S^{s_I(inst,t)}.PC'$
3. $(t = 0 \lor JISR(c_I^{t-1})) \implies c_I^t.DPC = c_S^{s_I(inst,t)}.DPC$
4. $\forall x \in \mathbb{Z}_{32} : c_I^t.GPR[x].data = c_S^{s_I(inst,t)}.GPR[x]$
5. $\forall x \in \mathbb{Z}_{32} : c_I^t.FPR[x].data = c_S^{s_I(inst,t)}.FPR[x]$
6. $\forall x \in \mathbb{Z}_{17} : c_I^t.SPR[x].data = c_S^{s_I(inst,t)}.SPR[x]$
Note that corr_i? and spec_conf, which was introduced in Section 3.2.2,
are related in the following way:
\[
(t = 0 \lor JISR(c_I^{t-1})) \land \mathit{corr\_i?}(t) \implies \mathit{spec\_conf}(c_I^t) = c_S^{s_I(inst,t)}
\]
3.3.3 Proof Overview
The structure of the proof is the same as it was in the VAMP without address
translation. In this thesis we present the parts of the whole proof which are
changed or added to the previous proof. Following [Bey05] we start with the
correctness proof of the VAMP without interrupts, then we extend it to deal
with interrupts.
The whole proof which we present in this thesis consists of three parts:
1. We prove the correctness of the memory access with address translation
in the case when we have load or store instructions (see Figure 3.2).
2. In the second part we describe the proof for the fetch mechanism which
was extended with the Instruction MMU (IMMU). Note that the data
MMU and the instruction MMU are identical but we use the IMMU
for read accesses only (see Figure 3.2).
3. Later we show the correctness of the mechanism dealing with external
interrupts, which allows I/O devices to be connected to the VAMP.
Definition 3.3.5 We use the short notation $dinputs_x$ ($iinputs_x$) for all signals
which are inputs to the DMMU (IMMU) and $doutputs_x$ ($ioutputs_x$) for all
signals which are outputs from the DMMU (IMMU), where x denotes the
interface: x = p for CPUI and x = m for MI.
3.4 Correctness between Interrupts
In this section we present the correctness proof of the VAMP
implementation without interrupts. We will show the induction step for the VAMP
correctness without interrupts only for the parts which were changed, i.e.
for the memory unit and for the instruction fetch, which were extended with
two MMUs.
We introduce the basic lemma for the correctness without interrupts. In
the following lemma we use the predicate synced_code? in order to deal with
self-modifying code. It is needed since an instruction which has been
fetched at a certain time could be overwritten by a store instruction that is still in
the pipeline. Note that we will give the formal definition of the predicate
synced_code? in Section 3.4.2 where we consider instruction fetch.
Lemma 3.4.1 The VAMP implementation is correct without interrupts.
Formally, we have for all t ∈ N:
\[
(\forall t' \in \mathbb{Z}_{\le t} : \neg JISR(c_I^{t'})) \land \mathit{synced\_code?} \implies \mathit{corr?}(t)
\]
In order to prove this lemma we show the validity of all parts of the predicate
corr? which were changed.
3.4.1 Correctness of the Memory Unit on Load / Store
We show the correctness of the memory unit as follows:
• First, the memory unit in cycle t produces the same outputs as the
corresponding part of the specification, i.e. the correct data is read in
the case of a load operation. We focus on the cycle when a request
actually completes. Note that we do not change the other part of the
proof when we have to store the result from the DMMU in the memory
unit and pass it to the producer later on an inactive stall_out.
• Second, the memory in the next cycle t + 1 fulfills the correctness
invariant, i.e. the correct data is stored in the memory according to
the correct address in the case of a store instruction. We also focus
only on the cycle when a request actually completes.
Note that in both cases we assume that the predicate corr?(t′) holds for
all cycles t′ ≤ t and that there are no interrupts up to time t′. The correctness
criteria of the Tomasulo algorithm guarantee the correctness of the
information in the memory reservation station. With the help of this information we
can easily conclude the correctness of the registers in the memory stage, i.e.
the effective address, the data to be stored, and several flags.
In both cases we focus on aligned accesses since the correctness of misaligned
accesses is trivial to show.
In the correctness proof of the memory unit we want to use the correctness
of the local MMU, which was shown in Chapter 2. For this purpose we have
to define five components that constitute a local MMU computation. These
are the memory sequence, the input and output sequences for the processor,
and the input and output sequences for the memory system. As in Chapter 2,
these five components induce a trace dtrc for the DMMU.
The idea behind these sequences is as follows. From the VAMP com-
putation we only take the request to the DMMU we are interested in. The
local MMU computation is constituted such that this request is placed at
the beginning of the local MMU computation and no more requests follow
afterwards. Figure 3.7 shows the construction of the memory sequence for
the DMMU from the memory sequence of the VAMP and construction of
the request signal. The start cycle t′ of the processor request to the DMMU
represents the local MMU cycle 0. The last cycle t of the processor request
to the DMMU represents the local MMU cycle t − t′.
[Figure 3.7: Memory sequence for the DMMU. The VAMP memories
$c_I^{t'}.M, c_I^{t'+1}.M, \ldots, c_I^{t}.M$ are mapped to the local MMU cycles
0, ..., t − t′; rows: memory sequence for DMMU, reqp to MMU, reqp to DMMU.]
Formally, we define:
• Memory sequence
\[ dtrc.mem = \lambda t'' \in \mathbb{N}.\ \begin{cases} c_I^{t''+t'}.M & \text{if } t' + t'' \le t \\ c_I^{t''+t'+1}.M & \text{otherwise} \end{cases} \]
Note that the local correctness of the MMU assumes a one-port memory,
whereas in our case the memory has two ports. According to Definition 1.5.1,
only read accesses can occur on the instruction port. Therefore the
data port interface of the two-port memory presents the same interface as
a one-port memory.
• Inputs sequence for inputs from the processor
\[ dtrc.inp = \lambda t'' \in \mathbb{N}.\ \begin{cases} dinputs_p^{t''+t'} & \text{if } t' + t'' \le t \\ X & \text{otherwise} \end{cases} \]
where X is an arbitrary dinputsp such that mrp and mwp are always
inactive.
• Inputs sequence for inputs from the memory system
\[ dtrc.inm = \lambda t'' \in \mathbb{N}.\ \begin{cases} dinputs_m^{t''+t'} & \text{if } t' + t'' \le t \\ X & \text{otherwise} \end{cases} \]
where X is an arbitrary dinputsm such that dbusym is always inactive.
• Outputs sequence for outputs to the processor
\[ dtrc.outp = \lambda t'' \in \mathbb{N}.\ \begin{cases} doutputs_p^{t''+t'} & \text{if } t' + t'' \le t \\ X & \text{otherwise} \end{cases} \]
where X is an arbitrary doutputsp such that dbusyp and dexcpp are
always inactive.
• Outputs sequence for outputs to the memory system
\[ dtrc.outm = \lambda t'' \in \mathbb{N}.\ \begin{cases} dinputs_m^{t''+t'} & \text{if } t' + t'' \le t \\ X & \text{otherwise} \end{cases} \]
where X is an arbitrary dinputsm such that mrm and mwm are always
inactive and $daddr_m = daddr_m^{t''+1}$.
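The construction of the local sequences can be illustrated by a small sketch. It is not part of the formal model: the names local_sequence and local_memory are hypothetical, and the global sequences are assumed to be given as functions from cycles to values.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def local_sequence(global_seq: Callable[[int], T], t_start: int,
                   t_end: int, idle: T) -> Callable[[int], T]:
    """Local input/output sequence: local cycle k is the global cycle
    t_start + k while the request lasts, and the idle value X afterwards."""
    def at(k: int) -> T:
        return global_seq(t_start + k) if t_start + k <= t_end else idle
    return at


def local_memory(global_mem: Callable[[int], T], t_start: int,
                 t_end: int) -> Callable[[int], T]:
    """Local memory sequence: as above, but after the request the memory of
    the global cycle t_start + k + 1 is used, as in the definition of dtrc.mem."""
    def at(k: int) -> T:
        if t_start + k <= t_end:
            return global_mem(t_start + k)
        return global_mem(t_start + k + 1)
    return at
```

With t_start = t′ and t_end = t, the local cycle 0 is the first cycle of the request and the local cycle t − t′ is its last cycle.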
Note that later, in order to use the correctness of the MMU from Chapter 2,
it will not be enough to prove only the assumptions. We also have to
show that this computation is a computation of the local MMU. It is easy
to show that these definitions constitute a valid local MMU computation.
We prove that the local MMU correctness, which was introduced in
Section 2.1, holds for the DMMU.
Lemma 3.4.2 :
∀t, t′ ∈ N : is_req_proc(t, t′, dtrc) ∧ corr?(t′) =⇒
mmu_guarantee(dtrc)
In order to prove this lemma we need one additional lemma.
Lemma 3.4.3 All inputs to the DMMU are stable if the DMMU is busy and
$mr_p^t \vee mw_p^t$ is active in the same cycle:
\[ \forall t \in \mathbb{N} : (mr_p^t \vee mw_p^t) \wedge dbusy_p^t \implies dinputs_p^t = dinputs_p^{t+1} \]
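The stability property of Lemma 3.4.3 can be checked on a recorded signal trace. The following sketch only illustrates the statement; the trace format (a list of dictionaries with the keys mr, mw, busy, and inputs) is an assumption and not part of the VAMP model.

```python
def inputs_stable_while_busy(trace):
    """Return True iff in every cycle with (mr or mw) and busy the
    processor inputs are unchanged in the next cycle."""
    for cur, nxt in zip(trace, trace[1:]):
        requesting = cur["mr"] or cur["mw"]
        if requesting and cur["busy"] and cur["inputs"] != nxt["inputs"]:
            return False
    return True
```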
Proof: The inputs dinputsp are taken directly from the stage mem. The proof
follows directly from the construction of the memory unit since the information
in the stage mem can be changed only if dbusyp is inactive. □
Proof: [Lemma 3.4.2] The predicate mmu_guarantee(dtrc) consists of
seven predicates. Based on Lemma 3.4.3 we conclude that the predicate
p_req_is_stable(dtrc) holds until the end of request to the DMMU . After
the request we always have ¬mwp and ¬mrp, thus p_req_is_stable(dtrc)
holds for the whole trace.
The second assumption of the local correctness for the processor interface
is that mrp and mwp are mutually exclusive, i.e. p_mr_mw_mutexc(dtrc).
This property is easy to prove since mrp is active only if ¬I_s holds and
mwp is active only when I_s holds (Figure 3.4, Equation 3.25). For all
cycles after the request to the DMMU the predicate p_mr_mw_mutexc(dtrc)
also holds since mrp and mwp are both inactive. Since we know that the
predicates p_req_is_stable(dtrc) and p_mr_mw_mutexc(dtrc) hold, we
conclude that the predicate good_p_interface(dtrc) also holds.
From Definition 1.5.2 correctness of the predicates m_ack_write(dtrc),
m_read_consist(dtrc), m_write_consist(dtrc), and m_liveness(dtrc)
during the request to the DMMU directly follows. Since mrp and mwp are
both inactive after the request to the DMMU the control of the DMMU is
in state idle. Then we conclude that mwm and mrm are also inactive. Since
mwm and mrm are inactive all these predicates trivially hold. Therefore we
know that the predicate good_m_interface(dtrc) also holds.
So we have proven all the assumptions for the DMMU. Therefore
we know the correctness of the DMMU according to Lemma 2.3.1, i.e.
mmu_guarantee(dtrc). □
We define the predicate which holds only in cycles when a request of the
DMMU actually completes.
Definition 3.4.4 We define the predicate commits for any t, i ∈ N in the
following way:
\[ commits(t, i) := (mw_p^t \vee mr_p^t) \wedge \neg dbusy_p^t \wedge sI(mem, t) = i \]
From the last cycle of a DMMU request we reconstruct the whole request.
Lemma 3.4.5 If an instruction in the memory unit commits then we have
a cycle t′ before the cycle t such that from t′ to t we have a request to the
DMMU . Formally, we have:
∀t, i ∈ N : commits(t, i) =⇒
∃t′ ∈ Z≤t : is_req_proc(t′, t, dtrc)
In order to prove this lemma we have to prove one additional lemma.
Lemma 3.4.6 When at any cycle t the DMMU is busy and this request was
not interrupted (as indicated by the rollback signal) then we have a cycle t′
before cycle t when the DMMU is not busy and from t′ to t the memory unit
is full and the DMMU is busy.
\[ \forall t \in \mathbb{N} : dbusy_p^t \wedge \neg rollback^t \implies \exists t' \in \mathbb{Z}_{\le t} : \neg dbusy_p^{t'} \wedge \forall t'' \in\, ]t' : t] : (dbusy_p^{t''} \wedge full^{t''}) \]
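The witness t′ of Lemma 3.4.6 can be found on a finite signal trace by walking backwards from cycle t. This is only an illustrative sketch under the assumption that the signals dbusy, full, and rollback are given as lists of booleans indexed by the cycle; the function name request_start is hypothetical.

```python
def request_start(dbusy, full, rollback, t):
    """Return the latest cycle tp <= t with not dbusy[tp] such that dbusy
    and full hold in all cycles of ]tp : t]; return None if the premise
    (dbusy[t] and not rollback[t]) fails or no such cycle exists."""
    if not dbusy[t] or rollback[t]:
        return None
    tp = t
    while tp >= 0 and dbusy[tp]:
        if not full[tp]:
            return None  # a busy cycle without a full memory stage
        tp -= 1
    return tp if tp >= 0 else None
```

Cycle tp + 1 is then the first cycle of the request, which is how the lemma is used in the proof of Lemma 3.4.5.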
Proof: We show the claim by induction on t:
Induction base (t = 0): The predicate init?(cI) always holds in cycle 0.
Therefore the predicate empty?(cI.MU) also holds. Since we have
$\neg c_I^0.rollback$, the control automaton of the DMMU is in state idle by
definition of empty?(cI.MU). We also have $\neg mem.full^0$ by definition of
empty?(cI.MU). Since the control is in state idle and we have
$\neg(mr_p^0 \vee mw_p^0)$ we conclude that $\neg dbusy_p^0$ and so we finish the base case.
Induction step (t → t + 1): We split cases on $dbusy_p^t \wedge \neg rollback^t$:
1. Let $dbusy_p^t \wedge \neg rollback^t$ hold. From the induction hypothesis we know
that:
\[ \exists t' \in \mathbb{Z}_{\le t} : \neg dbusy_p^{t'} \wedge \forall t''' \in\, ]t' : t] : (dbusy_p^{t'''} \wedge mem.full^{t'''}) \]
Additionally we can assume that:
\[ dbusy_p^{t+1} \wedge \neg c_I^{t+1}.rollback \]
We have to prove that:
\[ \exists t' \in \mathbb{Z}_{\le t+1} : \neg dbusy_p^{t'} \wedge \forall t'' \in\, ]t' : t+1] : (dbusy_p^{t''} \wedge mem.full^{t''}) \]
We skolemize the existential quantifier in the induction hypothesis.
After the skolemization we have a cycle t′. After that we instantiate
the last existential quantifier with cycle t′. In order to prove this
case we only have to show that $mem.full^{t+1}$ holds. Since we have
$\neg dbusy_p^{t'} \wedge dbusy_p^{t'+1}$ we know that $mr_p^{t'+1} \vee mw_p^{t'+1}$. Since the processor
inputs are stable between cycles t′ + 1 and t + 1 (Lemma 2.1.8) we have
$mr_p^{t+1} \vee mw_p^{t+1}$. This concludes the claim since we have $\neg c_I^{t+1}.rollback$.
Note that the definition of $c'_I.rollback$ was introduced in Equation 3.25.
2. Let $\neg(dbusy_p^t \wedge \neg rollback^t)$ hold. We split cases on $dbusy_p^t$:
(a) If $dbusy_p^t$ holds we have $rollback^t \wedge \neg rollback^{t+1}$. This is a
contradiction because from the construction of rollback we have
$dbusy_p^t \wedge rollback^t \implies rollback^{t+1}$. This concludes this case.
(b) Let $\neg dbusy_p^t$ hold. We only have to prove that $full^{t+1}$ holds. Since
we have $\neg dbusy_p^t$ and $dbusy_p^{t+1}$ we conclude $mr_p^{t+1} \vee mw_p^{t+1}$. This
concludes the proof since we have $\neg rollback^{t+1}$. □
We can now prove Lemma 3.4.5
Proof: [Lemma 3.4.5] We show the claim by induction on t:
\[ \forall t, i \in \mathbb{N} : commits(t, i) \implies \exists t' \in \mathbb{Z}_{\le t} : is\_req\_proc(t', t, dtrc) \]
Induction base (t = 0): Since sI(mem, 0) equals −1 we have i = −1. This
is a contradiction because i ≥ 0, and so we finish the base case.
Induction step (t→ t+ 1): We now split cases on commits(t, i):
1. Let commits(t, i) hold. We know that
commits(t, i) ∧
commits(t+ 1, i) ∧
∃t′ ∈ Z≤t : is_req_proc(t′, t, dtrc)
We have to prove that
∃t′ ∈ Z≤t+1 : is_req_proc(t′, t+ 1, dtrc).
We instantiate the last existential quantifier with t + 1. All parts of
the predicate is_req_proc(t + 1, t + 1, dtrc) may then be concluded
using the assumptions commits(t, i) and commits(t+ 1, i). We finish
this case.
2. Let ¬commits(t, i) hold. In this case we know
¬commits(t, i) ∧ commits(t+ 1, i)
and we have to prove that:
∃t′ ∈ Z≤t+1 : is_req_proc(t′, t+ 1, dtrc)
We now split cases on $dbusy_p^t$:
(a) Let $\neg dbusy_p^t$ hold. After instantiation of the existential quantifier
with cycle t + 1 we split cases on all parts of the predicate
is_req_proc(t + 1, t + 1, dtrc).
(b) Let $dbusy_p^t$ hold. In this case we have to prove that
\[ \neg commits(t, i) \wedge commits(t+1, i) \wedge dbusy_p^t \implies \exists t' \in \mathbb{Z}_{\le t+1} : is\_req\_proc(t', t+1, dtrc) \]
We split cases on $rollback^{t+1}$:
i. If $rollback^{t+1}$ holds then sI(mem, t + 1) = −1, which is a
contradiction because we have sI(mem, t + 1) = i and i ≥ 0. This
finishes this case.
ii. Let $\neg rollback^{t+1}$ hold. Since $dbusy_p^t$ is active we conclude
that $rollback^{t+1} = rollback^t$ and so we have $\neg rollback^t$.
Hence, the assumptions of Lemma 3.4.6 hold. By this lemma
we know that there exists a cycle tt < t such that $dbusy_p^{tt}$ is
inactive and in all cycles between tt and t the signals dbusyp and
full are active. Cycle tt + 1 is the cycle when the
request to the DMMU starts, i.e., we instantiate the existential
quantifier with cycle tt + 1 and so we finish the proof. □
We repeat a helper lemma which was introduced in [Bey05]. It states that
the specification memory that the instruction in the memory stage sees at the
end of a memory access is actually the specification memory of instruction
sI(MI, t).
Lemma* 3.4.7 If an instruction in the memory unit terminates its access
in cycle t, the specification memory for the memory unit and the one of the
instruction identified by the visible memory register scheduling function are
identical. Formally, we have
\[ (mw^t \vee mr^t) \wedge \neg dbusy_p^t \implies \tilde{c}_S^{sI(MI,t)}.M = \tilde{c}_S^{sI(mem,t)}.M \]
The proof of this lemma can be found in [Bey05, p. 148]. We now show that
the memory unit in cycle t produces the same outputs as the corresponding
part of the specification, i.e. the correct data is read in the case of a load
operation.
Lemma 3.4.8 Let the CPU correctness invariant without interrupts
(Definition 3.3.3) hold in cycle t, let a load instruction i = sI(mem, t) in the
memory unit complete its aligned access in cycle t on the address ea(cS) with
access width d ∈ {1, 2, 4, 8}, let $r\_sext = lb?(IR(\tilde{c}_S^i)) \vee lh?(IR(\tilde{c}_S^i))$, and
finally assume $\neg dexcp_p^t$. This load instruction then delivers the correct result,
i.e., we have
\[ corr?(t) \wedge mr_p^t \wedge \neg dbusy_p^t \wedge \neg dexcp_p^t \implies \]
\[ MU.dout^t = \begin{cases} sext(\tilde{c}_S^i.M[ea(c_S) + d - 1 : ea(c_S)]) & \text{if } r\_sext \wedge \neg t \\ zext(\tilde{c}_S^i.M[ea(c_S) + d - 1 : ea(c_S)]) & \text{if } \neg r\_sext \wedge \neg t \\ sext(\tilde{c}_S^i.M[r\_t(t).pa + d - 1 : r\_t(t).pa]) & \text{if } r\_sext \wedge t \\ zext(\tilde{c}_S^i.M[r\_t(t).pa + d - 1 : r\_t(t).pa]) & \text{otherwise} \end{cases} \]
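The sign and zero extension of the loaded bytes can be illustrated by a short sketch. It is not the VAMP data path: the name load_result is hypothetical, the memory is modeled as a byte string, the result is assumed to be 64 bits wide, and we read M[a + d − 1 : a] as a concatenation in which the byte at address a is the least significant one.

```python
MASK64 = (1 << 64) - 1


def load_result(mem: bytes, addr: int, d: int, sign_extend: bool) -> int:
    """Read d bytes starting at addr and extend them to 64 bits.

    addr stands for the effective address without translation and for
    the translated physical address otherwise."""
    raw = int.from_bytes(mem[addr:addr + d], "little")
    if sign_extend and (raw >> (8 * d - 1)) & 1:
        raw -= 1 << (8 * d)  # interpret the d bytes as a negative number
    return raw & MASK64
```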
Proof: Let an aligned load instruction in the memory unit finish its access
to the DMMU in cycle t without an exception, i.e., $mr_p^t \wedge \neg dbusy_p^t \wedge \neg dexcp_p^t$
holds. With the help of Lemma 3.4.7 we know that
$\tilde{c}_S^{sI(mem,t)}.M = \tilde{c}_S^{sI(MI,t)}.M$. Since the predicate corr?(t) holds we have
$\tilde{c}_S^i.M = c_I^t.M$. Moreover, it is easy to conclude the correctness of the
data in the memory stage, i.e. $ea(\tilde{c}_S^i) = ea(c_I^t)$. With the help of
Lemma 3.4.5 we know that there exists a cycle t′ such that t′ ≤ t and the
predicate is_req_proc(t′, t, dtrc) holds. Since we do not have any exception
$dexcp_p^t$, the correctness of the DMMU by Lemma 3.4.2 additionally
guarantees the following equation:
\[ din_p^t = \begin{cases} dtrc^{t-t'}.mem[\langle r\_t(t).pa[31:3] \rangle + 7 : \langle r\_t(t).pa[31:3] \rangle] & \text{if } t \\ dtrc^{t-t'}.mem[\langle ea(\tilde{c}_S^i)[31:3] \rangle + 7 : \langle ea(\tilde{c}_S^i)[31:3] \rangle] & \text{otherwise} \end{cases} \]
\[ \phantom{din_p^t} = \begin{cases} c_I^t.M[\langle r\_t(t).pa[31:3] \rangle + 7 : \langle r\_t(t).pa[31:3] \rangle] & \text{if } t \\ c_I^t.M[\langle ea(\tilde{c}_S^i)[31:3] \rangle + 7 : \langle ea(\tilde{c}_S^i)[31:3] \rangle] & \text{otherwise} \end{cases} \]
Since MU.dout is given by the output of shift4store, the access is sign or zero
extended and 64 bits wide. The correctness of the shift4store can be found
in [Bey05, p. 142]. Since all flags in the memory stage are also correct, we
finish the lemma. □
Lemma 3.4.9 Let the CPU correctness invariant without interrupts
(Definition 3.3.3) hold in cycle t, let an instruction i = sI(mem, t) in the memory
unit complete its aligned access in cycle t. The exception result of this
instruction is correct, i.e., we have
\[ corr?(t) \wedge (mr_p^t \vee mw_p^t) \wedge \neg dbusy_p^t \implies dexcp_p^t = CA(c_S^i)[dpf] \]
Proof: The proof follows directly from the local correctness of the DMMU
(see Lemmas 2.3.1 and 3.4.2). □
We can now define a lemma which shows that the memory in cycle t is
the same as the memory of the specification, i.e. the correct data is written
in the memory in the case of a store operation.
Lemma 3.4.10 Let the CPU correctness invariant hold in cycle t. The
memory part of the invariant still holds in cycle t+ 1, i.e., for all t ∈ N:
\[ corr?(t) \implies c_I^{t+1}.M = \tilde{c}_S^{sI(MI,t+1)}.M. \]
Proof: Because of corr?(t), we conclude $c_I^t.M = \tilde{c}_S^{sI(MI,t)}.M$. If no
instruction finishes its memory access in cycle t, the equations
sI(MI, t + 1) = sI(MI, t) and $c_I^{t+1}.M = c_I^t.M$ both hold and the claim holds. Let
therefore an instruction finish its memory access in cycle t, i.e., we have
$(mw_p^t \vee mr_p^t) \wedge \neg dbusy_p^t$. In this case, we have sI(MI, t + 1) = sI(mem, t) + 1.
By applying Lemma 3.4.7 we get $c_I^t.M = \tilde{c}_S^{sI(mem,t)}.M$. We split cases on
$mw_p^t$.
1. Let $\neg mw_p^t$ hold. In this case we have both $c_I^t.M = c_I^{t+1}.M$ and
$\tilde{c}_S^{sI(mem,t)}.M = \tilde{c}_S^{sI(mem,t)+1}.M$. Since corr?(t) holds we finish
this case.
2. Let $mw_p^t$ hold. Since $mw_p^t$ is active we have a store instruction in the
memory unit. We split cases on $dexcp_p^t$:
(a) Let $dexcp_p^t$ hold. In this case we have the same proof as in case
$\neg mw_p^t$ and we finish this case.
(b) Let $\neg dexcp_p^t$ hold. The memory stage contains correct data for
instruction sI(mem, t), the byte write signals and the data shifted
for store are correct, and shift4store is correct. The correctness of
the shift4store circuit can be found in [Bey05, p. 142]. Since we
know the correctness of the DMMU by Lemma 2.3.1 we are sure
that the correct data is written to the memory. This concludes
the proof. □
Note that in both Lemmas 3.4.8 and 3.4.10, we omitted the case of mis-
aligned accesses. In this case we react by simply not accessing the DMMU
and therefore also not the memory.
3.4.2 Instruction Fetch
In pipelined processors an instruction which has already been fetched could
be overwritten by a store instruction which is executed in the memory unit,
i.e. we have a RAW hazard. Note that in our case there are more possibilities
for RAW hazards than in [Bey05]. If we work in user mode we have a
RAW hazard not only when the data at the physical address is overwritten but
also when the page table entry is overwritten. As a solution to this problem Beyer
decided to use a so-called sync instruction before the fetch from a modified
location. The sync instruction stalls the fetch until all previous instructions
have left the pipeline. The sync instruction has the following properties:
1. In the specification model a sync instruction has to be a nop.
2. In the implementation model a sync instruction prohibits any further
fetch of instructions until all instructions which were before the sync
have left the pipeline.
For our VAMP implementation the instruction movs2i IEEEf,R0 fulfills
these criteria. More details about this case can be found in [Bey05, p. 139–
140]. We denote this instruction by sync. Since the sync?(cS) predicate is a
part of the fetch signal, such that we prevent fetching of the next instruction
after sync, the second criterion is also fulfilled.
Formally, let adr be a physical address. If the instruction Ii = IR(c˜iS) is
a fetch from adr and we have the instruction Ij which writes to adr before
the instruction Ii then there must be an instruction Ik between Ii and Ij
such that Ik is a sync instruction.
We formally define the required sync instruction before modified fetches.
Definition 3.4.11 We call the computation of an assembly code synced iff
for any address ad there is a sync instruction between the fetch of ad and
the last modification of address ad. We introduce a parameterized predicate
$write_{ad}$ on a specification configuration cS that holds iff a write to the double
word of address ad occurs in cS, i.e., the effective address in system
mode or the physical address in user mode matches ad and either a store or
a floating point store operation occurs. Let
\[ r\_tr = decodeitr(c_S.SPR[PTO][19:0],\ c_S.SPR[PTL][19:0],\ c_S.M,\ 1,\ ea(c_S)). \]
Formally, we define $write_{ad}$ for the double word of address ad as:
\[ write_{ad} :\iff (store?(c_S) \vee fstore?(c_S)) \wedge (t \wedge (r\_tr.pa[31:3] = ad[31:3]) \vee \neg t \wedge (ea(c_S)[31:3] = ad[31:3])) \]
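The double word comparison in this definition can be sketched as follows. The sketch is only an illustration: the function names are hypothetical, addresses are modeled as integers, and the flag translated plays the role of t (physical address in user mode, effective address in system mode).

```python
def same_double_word(a: int, b: int) -> bool:
    """Compare the bits [31:3] of two 32-bit addresses."""
    return (a & 0xFFFFFFFF) >> 3 == (b & 0xFFFFFFFF) >> 3


def writes_to(ad: int, is_store: bool, translated: bool,
              phys_addr: int, eff_addr: int) -> bool:
    """A store (or floating point store) writes to the double word of ad
    iff the relevant address lies in the same double word as ad."""
    target = phys_addr if translated else eff_addr
    return is_store and same_double_word(target, ad)
```

For example, a store to address 13 writes to the double word of address 8, since both addresses share the bits [31:3].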
Note that since writead is a predicate on a configuration, we can use
Definition 1.1.9. Since we distinguish two different computations depending
on whether we react to interrupts, we use the prefix cS or c˜S in order to
distinguish between the two possible instantiations.
A computation without interrupts fulfills the sync condition if under the
assumption that we do not have interrupts the following condition holds:
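\[ synced\_code? :\iff \forall n \in \mathbb{N} : \]
\[ \text{let } ptea = (\langle \tilde{c}_S^n.SPR[PTO][19:0] \circ 0^{12} \rangle + \langle c_S^n.DPC \circ 0^2 \rangle) \bmod 2^{32}, \]
\[ r\_tr = decodeitr(\tilde{c}_S^n.SPR[PTO][19:0],\ \tilde{c}_S^n.SPR[PTL][19:0],\ \tilde{c}_S^n.M,\ 1,\ ea(c_S^n)) \text{ in} \]
\[ (\tilde{c}_S.\exists lastwrite_{\tilde{c}_S^n.DPC} \wedge \neg \tilde{c}_S^n.SPR[MODE][0] \implies \exists m \in\, ]\tilde{c}_S.lastwrite_{\tilde{c}_S^n.DPC} : n[\ : sync?(IR_S(\tilde{c}_S^m))) \ \wedge \]
\[ (\tilde{c}_S.\exists lastwrite_{r\_tr.pa} \wedge \tilde{c}_S^n.SPR[MODE][0] \implies \exists m \in\, ]\tilde{c}_S.lastwrite_{r\_tr.pa} : n[\ : sync?(IR_S(\tilde{c}_S^m))) \ \wedge \]
\[ (\tilde{c}_S.\exists lastwrite_{ptea} \wedge \tilde{c}_S^n.SPR[MODE][0] \implies \exists m \in\, ]\tilde{c}_S.lastwrite_{ptea} : n[\ : sync?(IR_S(\tilde{c}_S^m))) \]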
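The shape of the sync condition can be illustrated on a strongly simplified instruction stream. The sketch below is an assumption-laden toy model and not the predicate synced_code? itself: every instruction is a dictionary with a fetch address, an optional write address, and a sync flag, and only one fetch-relevant address per instruction is considered (in user mode the same check would be repeated for the physical address and for the page table entry address).

```python
def synced(instrs) -> bool:
    """For every instruction n: if an earlier instruction j wrote to the
    double word that n is fetched from, some instruction m with
    j < m < n must be a sync instruction (j being the last such write)."""
    for n, ins in enumerate(instrs):
        last_write = None
        for j in range(n):
            w = instrs[j].get("write_addr")
            if w is not None and (w >> 3) == (ins["fetch_addr"] >> 3):
                last_write = j
        if last_write is not None:
            between = range(last_write + 1, n)  # open interval ]last_write : n[
            if not any(instrs[m].get("sync", False) for m in between):
                return False
    return True
```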
From the view of an assembly programmer, interrupts are in general non-
deterministic, e.g., timer-interrupts. This non-determinism is represented by
the sequence of external interrupts extS, which is a parameter of a specifica-
tion computation. Note that we define extS in Section 3.5.
\[ synced\_code\_i?(ext_S) :\iff \forall n \in \mathbb{N} : synced\_code?[c_S^n] \]
The notation $[c_S^n]$ in this case denotes the starting configuration reached by
the next state function with interrupts. This means that for a machine with
interrupts we only need synced code without interrupts, but from all starting
configurations reached by the next state function with interrupts.
Note that in system mode the predicate synced_code? has the same conditions
as in [Bey05]. In user mode
we must prevent write accesses not only to the physical address from which
we want to fetch but also to the address of the page table entry from
which we compute this physical address.
We now prove that the implementation of the instruction fetch is correct.
With the help of the next lemma we are independent of the delayed PC
architecture.
Lemma* 3.4.12 The address forwarding for instruction fetch is correct,
i.e., we have for all t ∈ N:
\[ corr?(t) \wedge \neg stall.1^t \implies adr\_i(c_I^t) = \tilde{c}_S^{sI(fetch,t)}.DPC. \]
Proof: We omit the proof because this lemma was taken without changes.
Since we now have correctness of the address adr_i used for instruction
fetch the following lemma does not depend on the delayed PC architecture.
Lemma 3.4.13 Let the VAMP fulfill the correctness invariant in some cy-
cle t and let the sync condition on the assembler code hold. The instruction
register part of the correctness invariant then holds in the next cycle. I.e.,
for i := sI(dec, t+ 1) we have for all t ∈ N:
\[ synced\_code? \wedge corr?(t) \wedge i \ge 0 \implies \]
\[ (c_I^{t+1}.S1.IR = IR_S(\tilde{c}_S^i) \ \wedge\ c_I^{t+1}.S1.imal = imal(\tilde{c}_S^i) \ \wedge\ c_I^{t+1}.S1.ipf = CA(\tilde{c}_S^i)[ipf]) \]
In order to prove this lemma we have to define some intermediate predicates
and prove some auxiliary lemmas. If we want to use the correctness results
of the local MMU for the IMMU, we have to give not only the sequences of
the hardware configuration, inputs, and outputs but also a correct memory
sequence. The hardware configuration, input, and output sequences for
the IMMU are similar to those for the DMMU. Note that the local
correctness of the MMU has
the assumption m_ack_write(trc) (during a processor request the memory
must be stable while busy is on). Therefore we cannot define the memory
mapping in the same way as it was introduced for the DMMU, because during
a request to the IMMU the DMMU can change the memory.
The following solution will suffice. In order to establish the assumption
m_ack_write(trc) we construct a memory which is not changed during
the whole request to the IMMU and is equal to the memory in the last
cycle of the request to the IMMU. This means that during the whole processor
request to the IMMU we use only the memory from the last cycle of the processor
request to the IMMU. Figure 3.8 shows the memory sequence and the construction
of the processor request for the IMMU.
[Figure 3.8: Memory Sequence for the IMMU. Every local MMU cycle
0, ..., t − t′ uses the memory $c_I^t.M$ of the last cycle t of the request;
rows: memory sequence for IMMU, reqp to MMU, reqp to IMMU.]
We also have to define five components that constitute a local MMU
computation for the IMMU . As for the DMMU these five components induce
a trace itrc for the IMMU . Formally, we define:
• Memory sequence
\[ itrc.mem = \lambda t'' \in \mathbb{N}.\ c_I^t.M \]
• Inputs sequence for inputs from the processor
\[ itrc.inp = \lambda t'' \in \mathbb{N}.\ \begin{cases} iinputs_p^{t''+t'} & \text{if } t' + t'' \le t \\ X & \text{otherwise} \end{cases} \]
where X is an arbitrary iinputsp such that imrp is always inactive.
• Inputs sequence for inputs from the memory system
\[ itrc.inm = \lambda t'' \in \mathbb{N}.\ \begin{cases} iinputs_m^{t''+t'} & \text{if } t' + t'' \le t \\ X & \text{otherwise} \end{cases} \]
where X is an arbitrary iinputsm such that ibusym is always inactive.
• Outputs sequence for outputs to the processor
\[ itrc.outp = \lambda t'' \in \mathbb{N}.\ \begin{cases} ioutputs_p^{t''+t'} & \text{if } t' + t'' \le t \\ X & \text{otherwise} \end{cases} \]
where X is an arbitrary ioutputsp such that ibusyp and iexcpp are
always inactive.
• Outputs sequence for outputs to the memory system
\[ itrc.outm = \lambda t'' \in \mathbb{N}.\ \begin{cases} iinputs_m^{t''+t'} & \text{if } t' + t'' \le t \\ X & \text{otherwise} \end{cases} \]
where X is an arbitrary iinputsm such that imrm is always inactive
and $iaddr_m = iaddr_m^{t''+1}$.
Note that, as for the DMMU, we also have to show that this computation
is a computation of the local MMU. Again it is easy to show that these
definitions constitute a valid local MMU computation. In order to show
that the assumptions for the local MMU correctness hold, we prove a lemma
which shows that the assumption p_req_is_stable(itrc) holds for the IMMU.
Lemma 3.4.14 All inputs to the IMMU are stable when the IMMU is busy
in the same cycle:
p_req_is_stable(itrc)
Proof: Note that we give the proof only for the time when we have the
processor request, i.e. from cycle 0 to cycle t − t′ (see Figure 3.8). After
cycle t − t′ the predicate p_req_is_stable(itrc) trivially holds because of
the trace construction.
Let t′′ denote the cycle which we get by skolemization of the universal
quantifier in the predicate p_req_is_stable(itrc). Since any request to the
IMMU is a read request and
$imr_p^{t''} = fetch^{t''} \wedge \neg c_I^{t''}.S1.imal \vee irollback^{t''}$,
we only have to prove that:
\[ (fetch^{t''} \wedge \neg c_I^{t''}.S1.imal \vee irollback^{t''}) \wedge ibusy_p^{t''} \implies iinputs_p^{t''} = iinputs_p^{t''+1} \]
We split the input stability into three groups. The first is iadrp, the second
is imrp, and the last is iptop, iptlp, and imodep.
Subproof for iadrp. Note that $iadr_p = adr\_i(c_I)[31:3]$. Since we also
need the stability of the last three bits later for the proof of the imrp
stability, we prove that the whole $adr\_i(c_I)$ is stable. We now split
cases on $irollback^{t''}$:
1. Let $irollback^{t''}$ hold. Since $ibusy_p^{t''}$ holds we have $irollback^{t''+1}$. Since
irollback holds in both cycles t′′ and t′′ + 1 we have $adr\_i(c_I^{t''}) = mPC^{t''}$
and $adr\_i(c_I^{t''+1}) = mPC^{t''+1}$ according to Figure 3.5. Because
$irollback^{t''}$ is active we do not update the register mPC between
cycles t′′ and t′′ + 1. This finishes this case.
2. Let $\neg irollback^{t''}$ hold. In this case we have to prove
$adr\_i(c_I^{t''}) = adr\_i(c_I^{t''+1})$. Since $fetch^{t''}$ implies $\neg rfe?(c_I^{t''}.S1.IR)$ we have
\[ adr\_i(c_I^{t''}) = \begin{cases} c_I^{t''}.PC' & \text{if } c_I^{t''}.S1.full \wedge \neg rfe?(c_I^{t''}.S1.IR) \\ c_I^{t''}.DPC & \text{otherwise} \end{cases} \]
We compute $adr\_i(c_I^{t''+1})$ as follows
\[ adr\_i(c_I^{t''+1}) = \begin{cases} c_I^{t''+1}.SPR[EDPC] & \text{if } c_I^{t''+1}.S1.full \wedge rfe?(c_I^{t''+1}.S1.IR) \\ c_I^{t''+1}.PC' & \text{if } c_I^{t''+1}.S1.full \wedge \neg rfe?(c_I^{t''+1}.S1.IR) \\ c_I^{t''+1}.DPC & \text{otherwise} \end{cases} \]
Since signal $ibusy_p^{t''}$ is active we have $\neg ue.0^{t''}$. Since $c_I.S1.IR$ is only
changed when signal ue.0 is active we have $c_I^{t''+1}.S1.IR = c_I^{t''}.S1.IR$.
Note also that since signal $ue.0^{t''}$ is inactive we have
$S1.full^{t''+1} = stall.1^{t''}$, and from the construction of the signal $stall.1^{t''}$ we conclude
that $stall.1^{t''} \implies S1.full^{t''}$ trivially. We now split cases on $stall.1^{t''}$.
(a) If $stall.1^{t''}$ holds then we have $adr\_i(c_I^{t''}) = c_I^{t''}.PC'$ and
$adr\_i(c_I^{t''+1}) = c_I^{t''+1}.PC'$. Since PC′ is changed only when ue.1
is active we have:
\[ adr\_i(c_I^{t''+1}) = c_I^{t''+1}.PC' = c_I^{t''}.PC' = adr\_i(c_I^{t''}) \]
and finish this case.
(b) Let $\neg stall.1^{t''}$ hold. Note that from the construction
$\neg stall.1^{t''} \implies \neg rfe$ trivially holds. We now split cases on $S1.full^{t''}$.
i. If $S1.full^{t''}$ holds then $adr\_i(c_I^{t''}) = c_I^{t''}.PC'$ and
$adr\_i(c_I^{t''+1}) = c_I^{t''+1}.DPC$. Since we compute $c_I^{t''+1}.DPC$ by:
\[ c_I^{t''+1}.DPC = \begin{cases} c_I^{t''}.PC' & \text{if } ue.1 \\ c_I^{t''}.DPC & \text{otherwise} \end{cases} \]
and ue.1 is active we have:
\[ adr\_i(c_I^{t''+1}) = c_I^{t''+1}.DPC = c_I^{t''}.PC' = adr\_i(c_I^{t''}) \]
and finish this case.
ii. If $\neg S1.full^{t''}$ holds then $adr\_i(c_I^{t''}) = c_I^{t''}.DPC$ and
$adr\_i(c_I^{t''+1}) = c_I^{t''+1}.DPC$. Since signal ue.1 is inactive we
have
\[ adr\_i(c_I^{t''+1}) = c_I^{t''+1}.DPC = c_I^{t''}.DPC = adr\_i(c_I^{t''}) \]
and finish the part of the lemma for iadrp.
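The address selection used in this subproof can be summarized by a small sketch. It is only an illustration and not the VAMP hardware: the dictionary keys are hypothetical names for the signals and registers mentioned above, and the priority of irollback over the other cases is our reading of Figure 3.5.

```python
def adr_i(cfg: dict) -> int:
    """Select the fetch address from a (simplified) configuration."""
    if cfg["irollback"]:
        return cfg["mPC"]    # stabilized address during a rolled back fetch
    if cfg["S1_full"] and cfg["rfe"]:
        return cfg["EDPC"]   # return from exception
    if cfg["S1_full"]:
        return cfg["PCp"]    # PC' of the instruction in stage 1
    return cfg["DPC"]        # delayed PC otherwise
```

The subproof then amounts to showing that this function returns the same value in the cycles t′′ and t′′ + 1 whenever the IMMU is busy.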
Subproof for imrp. We have to prove that $imr_p^{t''} = imr_p^{t''+1}$. In case
$irollback^{t''}$ is active, $imr_p^{t''} = imr_p^{t''+1}$ follows directly from the construction
of the stabilize circuit for the IMMU (Figure 3.5) since we have $ibusy_p^{t''}$. Let
$\neg irollback^{t''}$ hold. Of course, $irollback^{t''+1}$ is also inactive. Since we have
already proved the stability of the address we have $c_I^{t''}.imal = c_I^{t''+1}.imal$ and
we only have to prove that $1 = fetch^{t''} = fetch^{t''+1}$. We now split cases on
$stall.1^{t''}$.
1. If $stall.1^{t''}$ holds then, as we know, $S1.full^{t''}$ and $S1.full^{t''+1}$ are
both active. Since we know from the previous part of the proof that
$c_I^{t''+1}.S1.IR = c_I^{t''}.S1.IR$, according to the definition of the signal
fetch we only have to prove that the registers PTO, PTL, and MODE
are valid. Since the validity of a register is changed from 1 to 0 only in case of
$S1.full \wedge \neg stall.1$ we conclude this case.
2. Let $\neg stall.1^{t''}$ hold. We now split cases on $S1.full^{t''}$.
(a) Let $S1.full^{t''}$ hold. In this case we also have that
$c_I^{t''+1}.S1.IR = c_I^{t''}.S1.IR$ and we only have to show the validity of the
registers PTO, PTL, and MODE. Since in cycle t′′ we do not have any instruction
which changes the validity of the registers PTO, PTL, and MODE we
finish this case.
(b) Let $\neg S1.full$ hold. Since the validity of a register can be changed
from 1 to 0 only when S1.full holds, we finish the claim for imrp.
Subproof for iptop, iptlp, and imodep. Note that iptop, iptlp, and imodep
are read directly from the SPR file. We only show the stability of iptop
because the proofs for the other two registers are similar. We split cases on
$irollback^{t''}$.
1. Let $irollback^{t''}$ hold. By induction we can prove that
$irollback^{t''} \implies c_I.ROBcount = 0 \wedge \neg c_I.S1.full$. Thus $c_I.ROBcount = 0$; since
the content of registers is changed only in the stage writeback and iff
$c_I.ROBcount \ne 0$ we finish this case.
2. Let $\neg irollback^{t''}$ hold. If the content of register $c_I^{t''}.SPR[PTO]$ is
changed, this register is not valid. Thus we have a contradiction because we have
$c_I.SPR[PTO].valid$. □
We now prove that the local MMU correctness holds for the trace itrc.
Lemma 3.4.15 Local MMU correctness for the trace itrc holds:
\[ \forall t', t \in \mathbb{N} : is\_req\_proc(t', t, itrc) \wedge corr?(t) \wedge \neg irollback^t \wedge synced\_code? \wedge adr\_i(c_I^t) = c_S^{sI(fetch,t)}.DPC \implies mmu\_guarantee(itrc) \]
In order to prove this lemma we need some additional lemmas which are
partially taken from [Bey05].
Lemma* 3.4.16 Each memory instruction which terminated without an
exception is committed, i.e.:
\[ \forall t, i \in \mathbb{N} : mem?(c_S^i) \wedge \neg imal(c_S^i) \wedge \neg dmal(c_S^i) \wedge \neg ipf(c_S^i) \wedge \neg dpf(c_S^i) \wedge sI(wb, t) > i \implies sI(MI, t) > i \]
Lemma* 3.4.17 If the content of a double word memory cell differs in
two different configurations then there exists an instruction in between which
changes this content, i.e.:
\[ \forall i \in \mathbb{N}, i' \in \mathbb{Z}_{\ge i}, adr \in \mathbb{B}^{32} : c_S^i.M[adr[31:3]] \ne c_S^{i'}.M[adr[31:3]] \implies \]
\[ \exists j \in [i : i'[\ : write_{adr}(c_S^j) \wedge mem?(c_S^j) \wedge \neg imal(c_S^j) \wedge \neg dmal(c_S^j) \wedge \neg ipf(c_S^j) \wedge \neg dpf(c_S^j) \]
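The existence claim of this lemma can be illustrated by a search over a finite sequence of memory snapshots. The sketch rests on assumptions that are not part of the formal model: mems[k] is a dictionary from double word indices to contents in configuration k, instruction j leads from configuration j to configuration j + 1, and find_writer is a hypothetical name.

```python
def find_writer(mems, i: int, i2: int, adr: int):
    """Return the first j in [i : i2[ whose execution changes the double
    word of adr, or None if this double word is unchanged in that range."""
    dw = adr >> 3  # bits [31:3] select the double word
    for j in range(i, i2):
        if mems[j].get(dw) != mems[j + 1].get(dw):
            return j
    return None
```

If the contents in the configurations i and i2 differ, the search cannot return None, which mirrors the statement of the lemma.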
Lemma 3.4.18 An instruction which changes data in the memory at an
address which is a page table entry address for another instruction is
terminated when that other instruction is fetched, i.e.:
\[ \forall t \in \mathbb{N} : \text{let } ptea = (\langle \tilde{c}_S^{sI(fetch,t)}.SPR[PTO][19:0] \circ 0^{12} \rangle + \langle c_S^{sI(fetch,t)}.DPC \circ 0^2 \rangle) \bmod 2^{32} \text{ in} \]
\[ synced\_code? \wedge fetch^t \implies \forall i \in \mathbb{Z}_{sI(fetch,t)} : write_{ptea}(c_S^i) \implies sI(wb, t) > i \]
Lemma 3.4.19 An instruction which changes data in the memory at an
address which is a physical address for another instruction is terminated
when that other instruction is fetched, i.e.:
\[ \forall t \in \mathbb{N} : \text{let } r\_tr = decodeitr(c_S^{sI(fetch,t)}.SPR[PTO][19:0],\ c_S^{sI(fetch,t)}.SPR[PTL][19:0],\ c_S^{sI(fetch,t)}.M,\ c_S^{sI(fetch,t)}.SPR[MODE][0],\ ea(c_S^{sI(fetch,t)})) \text{ in} \]
\[ synced\_code? \wedge \neg ibusy_p^t \wedge stall.1^t \wedge fetch^t \wedge \neg irollback^t \implies \]
\[ \forall i \in \mathbb{Z}_{sI(fetch,t)} : (write_{r\_tr.pa}(c_S^i) \wedge c_S^{sI(fetch,t)}.SPR[MODE][0] \ \vee\ write_{c_S^{sI(fetch,t)}.DPC}(c_S^i) \wedge \neg c_S^{sI(fetch,t)}.SPR[MODE][0]) \implies sI(wb, t) > i \]
We omit the proofs of Lemmas 3.4.18 and 3.4.19; similar properties (with-
out address translation) have been proven in [Bey05].
Proof: [Lemma 3.4.15] Since on instruction fetch we produce only read
accesses for the IMMU, the assumption p_mr_mw_mutexc(itrc) always
holds. Lemma 3.4.14 proves that the predicate p_req_is_stable(itrc)
also holds. Therefore the predicate good_p_interface(itrc) holds. The
assumptions on the memory access for the IMMU, m_ack_write(itrc),
m_write_consist(itrc), and m_liveness(itrc), are fulfilled because of
Definition 1.5.1 up to the end of the processor request, i.e. up to cycle t − t′. Since
imrp is inactive after the request to the IMMU, the control of the IMMU is
in state idle. Then we conclude that imrm is also inactive. Since imrm is
inactive all these predicates trivially hold. In order to show the correctness of
the IMMU, we only have to prove that the predicate m_read_consist(itrc)
also holds. Let $t_s$ and $t_e$ denote the start and end time of a memory request
such that the following inequality holds:
\[ 0 \le t_s \le t_e \le t - t' \]
Since during the whole processor request memory sequence in itrc is
stable according to our construction of itrc trace equation itrc(te).mem =
itrc(te + 1).mem trivially holds. Therefore we only have to prove that:
itrc(te).inm.din = itrc(te).mem[itrc(ts).outm.adr]
With the help of Definitions 1.5.1 and 1.5.2 we rewrite this equation as
follows:
Ct
′+te
I .M [iadr
t′+te
m ] = itrc(te).mem[itrc(ts).outm.adr]
Since we know from Lemmas 2.1.19 and 2.3.4 that the inputs to the memory
are stable, we rewrite further:
\[
C_I^{t'+t_e}.M[iadr_m^{t'+t_e}] = C_I^{t}.M[iadr_m^{t'+t_e}]
\]
Since the predicate corr?(t) holds for all cycles up to t it suffices to show
\[
c_S^{sI(MI,t'+t_e)}.M[iadr_m^{t'+t_e}] = c_S^{sI(MI,t)}.M[iadr_m^{t'+t_e}]
\]
In case t′ + te = t the equation holds trivially. Otherwise, in order to
derive a contradiction, we assume that
\[
c_S^{sI(MI,t'+t_e)}.M[iadr_m^{t'+t_e}] \neq c_S^{sI(MI,t)}.M[iadr_m^{t'+t_e}]
\]
From Lemma 3.4.17 it follows that there exists an instruction j which has
overwritten the data at the address $iadr_m^{t'+t_e}$, i.e.:
\[
\exists j \in [t'+t_s : t[\ :\ write_{iadr_m^{t'+t_e}}(c_S^j) \wedge mem?(c_S^j) \wedge \neg imal(c_S^j) \wedge
\neg dmal(c_S^j) \wedge \neg ipf(c_S^j) \wedge \neg dpf(c_S^j)
\]
In Lemma 2.3.10 we proved that our local MMU can access the memory
during a request at only two addresses in user mode (the page table entry
address and the physical address) and only at the physical address in system
mode. Since an access at the physical address can occur in either mode only if
t′ + te = t, we work in user mode and $iadr_m^{t'+t_e}$ is equal to the page
table entry address.
We also know that
sI(MI , t′ + te) ≤ j ≤ sI(MI , t) ≤ sI(fetch, t) = sI(fetch, t′ + te)
Therefore, since synced_code? and $fetch^{t'+t_e}$ both hold, we get from
Lemma 3.4.18 that sI(wb, t′ + te) > j. According to Lemma 3.4.16
we then have sI(MI , t′ + te) > j. Thus we have found the contradiction
sI(MI , t′ + te) > j ≥ sI(MI , t′ + te).
Therefore we know that the predicate good_m_interface(dtrc) also
holds. We have thus shown that all assumptions for the IMMU hold,
which gives us the correctness of the IMMU access. □
Note also that, given the last cycle of a request to the IMMU, we can
easily reconstruct the whole request.
Lemma 3.4.20 If a fetch to the IMMU terminates in cycle t then there
exists a cycle t′ ≤ t when this access is started. Formally, we have:
\[
\forall t \in \mathbb{N} :\ imr_p^t \wedge \neg ibusy_p^t \wedge \neg irollback^t \implies
\exists t' \in \mathbb{Z}_{\le t} :\ is\_req\_proc(t', t, itrc)
\]
The proof is done by induction on t′ and omitted here.
We now can prove correctness on the instruction fetch, i.e. Lemma 3.4.13.
Proof: [Lemma 3.4.13] Let synced_code?, corr?(t), and i ≥ 0 hold. If
no fetch occurs in cycle t, i.e., $\neg ue.0^t$, we set i = sI(dec, t) and have
$c_I^{t+1}.S1.IR = c_I^t.S1.IR$, $c_I^{t+1}.S1.imal = c_I^t.S1.imal$, and
$c_I^{t+1}.S1.ipf = c_I^t.S1.ipf$; the proof is finished because the
instruction register part of corr?(t) holds. Hence,
we assume that a fetch occurs, i.e., $\neg ibusy_p^t \wedge \neg stall.1^t \wedge fetch^t \wedge \neg irollback^t$.
We define i = sI(dec, t) + 1 and i = sI(fetch, t). By Lemma 3.4.12 we obtain
that $adr\_i(c_I^t) = c_S^i.DPC$ holds. If $c_S^i.DPC$ is misaligned, the claim holds
trivially. Hence, we have to show the above claim for the instruction register
itself and for the page fault on instruction fetch.
Let $adr\_i(c_I^t)$ be aligned, i.e., $\tilde{c}_S^i.DPC \bmod 4 = 0$. Note
that the instruction port of the CPUI interface is accessed at address
$iadr_p^t = adr\_i(c_I^t) \text{ div } 8$ and, depending on $adr\_i(c_I^t) \bmod 8$,
either the upper or the lower word is selected as input to the instruction
register. Because of $\neg ibusy^t \wedge fetch^t \wedge \neg irollback^t$ and
Lemma 3.4.20 there is a cycle t′ in which the request to the IMMU is started,
i.e., we have is_req_proc(t′, t, itrc).
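The word selection described above can be illustrated with a small sketch. It assumes a byte-addressed memory modeled as a Python `bytes` object and a little-endian byte order within a word (suggested by the M[a + 3 : a] notation, but still an assumption); the function name `select_instruction_word` is hypothetical and is not taken from the VAMP sources.

```python
def select_instruction_word(mem: bytes, adr: int) -> int:
    """Return the 32-bit word fetched for an aligned byte address adr."""
    if adr % 4 != 0:
        raise ValueError("misaligned instruction address")
    iadr = adr // 8          # doubleword address presented to the instruction port
    base = 8 * iadr
    if adr % 8 == 4:
        word = mem[base + 4:base + 8]    # upper word: M[8*iadr+7 : 8*iadr+4]
    else:
        word = mem[base + 0:base + 4]    # lower word: M[8*iadr+3 : 8*iadr+0]
    # Assumption: the byte at the lowest address is the least significant one.
    return int.from_bytes(word, "little")


memory = bytes(range(16))                # toy memory: byte k holds the value k
upper = select_instruction_word(memory, 4)
lower = select_instruction_word(memory, 8)
```

In this toy memory, address 4 selects the upper word of doubleword 0 and address 8 selects the lower word of doubleword 1.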
For the following equations we define the short notation:
\[
\begin{aligned}
pa(c_I) &= decodeitr(c_I^t.SPR[PTO][19:0],\ c_I^t.SPR[PTL][19:0],\ c_I^t.M,\ 0,\ c_I^t.DPC).pa, \\
pa(c_S) &= decodeitr(\tilde{c}_S^i.SPR[PTO][19:0],\ \tilde{c}_S^i.SPR[PTL][19:0],\ \tilde{c}_S^i.M,\ 0,\ \tilde{c}_S^i.DPC).pa.
\end{aligned}
\]
We now consider two cases:
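1. The following implications both hold.
\[
\begin{aligned}
&(\neg \tilde{c}_S^i.SPR[MODE][0] \implies \\
&\quad c_S^{sI(MI,t)}.M[\tilde{c}_S^i.DPC + 3 : \tilde{c}_S^i.DPC] = c_S^i.M[\tilde{c}_S^i.DPC + 3 : \tilde{c}_S^i.DPC])\ \wedge \\
&(\tilde{c}_S^i.SPR[MODE][0] \implies \\
&\quad c_S^{sI(MI,t)}.M[pa(c_S) + 3 : pa(c_S)] = c_S^i.M[pa(c_S) + 3 : pa(c_S)]\ \wedge \\
&\quad c_S^{sI(MI,t)}.M[ptea(c_S) + 3 : ptea(c_S)] = c_S^i.M[ptea(c_S) + 3 : ptea(c_S)])
\end{aligned}
\]
where
\[
ptea(c_S) = \bigl(\langle \tilde{c}_S^i.SPR[PTO][19:0] \circ 0^{12} \rangle + \langle c_S^i.DPC \circ 0^{2} \rangle\bigr) \bmod 2^{32}.
\]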
Note that we have already proved in Lemma 2.3.10 that the MMU accesses
the memory during any untranslated processor request at most once,
at address $\tilde{c}_S^i.DPC$. During a translated processor request
the MMU accesses the memory at most twice: first at the page table entry
address and then at the physical address. By
applying Lemma 3.4.15 we conclude the correctness of the IMMU,
i.e. mmu_guarantee(itrc). Therefore, by using Definition 1.5.2 of a
correct memory interface and corr?(t), we obtain in the case without
any exception (signal $iexcp_p^t$ inactive):
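\[
\begin{aligned}
c_I^{t+1}.S1.IR &=
\begin{cases}
c_I^t.M[8 \cdot iadr_p^t + 7 : 8 \cdot iadr_p^t + 4] & \text{if } adr\_i(c_I^t) \bmod 8 = 4 \wedge \neg imode_p^t \\
c_I^t.M[8 \cdot pa(c_I) + 7 : 8 \cdot pa(c_I) + 4] & \text{if } adr\_i(c_I^t) \bmod 8 = 4 \wedge imode_p^t \\
c_I^t.M[8 \cdot iadr_p^t + 3 : 8 \cdot iadr_p^t + 0] & \text{if } adr\_i(c_I^t) \bmod 8 \neq 4 \wedge \neg imode_p^t \\
c_I^t.M[8 \cdot pa(c_I) + 3 : 8 \cdot pa(c_I) + 0] & \text{if } adr\_i(c_I^t) \bmod 8 \neq 4 \wedge imode_p^t
\end{cases} \\
&=
\begin{cases}
c_I^t.M[pa(c_S) + 3 : pa(c_S)] & \text{if } \tilde{c}_S^i.SPR[MODE][0] \\
c_I^t.M[\tilde{c}_S^i.DPC + 3 : \tilde{c}_S^i.DPC] & \text{otherwise}
\end{cases} \\
&=
\begin{cases}
\tilde{c}_S^{sI(MI,t)}.M[pa(c_S) + 3 : pa(c_S)] & \text{if } \tilde{c}_S^i.SPR[MODE][0] \\
\tilde{c}_S^{sI(MI,t)}.M[\tilde{c}_S^i.DPC + 3 : \tilde{c}_S^i.DPC] & \text{otherwise}
\end{cases} \\
&=
\begin{cases}
\tilde{c}_S^i.M[pa(c_S) + 3 : pa(c_S)] & \text{if } \tilde{c}_S^i.SPR[MODE][0] \\
\tilde{c}_S^i.M[\tilde{c}_S^i.DPC + 3 : \tilde{c}_S^i.DPC] & \text{otherwise}
\end{cases}
\end{aligned}
\]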
The first equation holds because of the memory unit construction. The
second equation holds because the predicate mmu_guarantee(itrc)
holds and $adr\_i(c_I^t) = c_S^i.DPC$. Since the predicate corr?(t) holds
we get the third equation. The last equation holds because the above
two implications hold.
For the page fault on instruction fetch we have, with the help of Lemma 3.4.15,
\[
\begin{aligned}
c_I^{t+1}.CA[ipf] &= ipf(c_I^t) \wedge c_I^t.SPR[MODE][0] \\
&= ipf(c_S^i) \wedge \tilde{c}_S^i.SPR[MODE][0]
\end{aligned}
\]
where $ipf(c_I^t)$ is the page fault on fetch in the implementation in cycle t and
$ipf(c_S^i)$ is the page fault on fetch in the specification. Note that both
are equal because of IMMU correctness (Lemma 3.4.15) and because
of corr?(t). Therefore
\[
IR_S(c_S^i) =
\begin{cases}
c_S^{sI(MI,t)}.M[pa(c_S) + 3 : pa(c_S)] & \text{if } \tilde{c}_S^i.SPR[MODE][0] \\
c_S^{sI(MI,t)}.M[\tilde{c}_S^i.DPC + 3 : \tilde{c}_S^i.DPC] & \text{otherwise}
\end{cases}
\]
holds, and the claim is proved.
2. Assume the implications
\[
\begin{aligned}
&(\neg \tilde{c}_S^i.SPR[MODE][0] \implies \\
&\quad c_S^{sI(MI,t)}.M[\tilde{c}_S^i.DPC + 3 : \tilde{c}_S^i.DPC] = c_S^i.M[\tilde{c}_S^i.DPC + 3 : \tilde{c}_S^i.DPC])\ \wedge \\
&(\tilde{c}_S^i.SPR[MODE][0] \implies \\
&\quad c_S^{sI(MI,t)}.M[pa(c_S) + 3 : pa(c_S)] = c_S^i.M[pa(c_S) + 3 : pa(c_S)]\ \wedge \\
&\quad c_S^{sI(MI,t)}.M[ptea(c_S) + 3 : ptea(c_S)] = c_S^i.M[ptea(c_S) + 3 : ptea(c_S)])
\end{aligned}
\]
do not both hold.
Since the proofs for both implications are similar, we only show
the proof for the case when $\tilde{c}_S^i.SPR[MODE][0]$ is inactive. In
order to derive a contradiction we assume that
\[
c_S^{sI(MI,t)}.M[\tilde{c}_S^i.DPC + 3 : \tilde{c}_S^i.DPC] \neq c_S^i.M[\tilde{c}_S^i.DPC + 3 : \tilde{c}_S^i.DPC]. \tag{3.34}
\]
Note that since instruction i is fetched into the pipeline in cycle t, one
easily proves by induction that i ≥ sI(MI , t) holds. According to
Lemma 3.4.17 there exists an instruction j ∈ [sI(MI , t) : i[ such that
the following predicate holds:
\[
write_{c_S^{sI(fetch,t)}.DPC}(c_S^j) \wedge mem?(c_S^j) \wedge \neg imal(c_S^j) \wedge
\neg dmal(c_S^j) \wedge \neg ipf(c_S^j) \wedge \neg dpf(c_S^j)
\]
From Lemma 3.4.19, which uses synced_code?, and since j < i
we conclude that sI(wb, t) > j. Then according to
Lemma 3.4.16 we obtain sI(MI , t) > j, which contradicts
sI(MI , t) ≤ j. In a similar way we get the same contradiction in
user mode for both addresses, the page table entry address and the
physical address, which finishes the proof. □
3.5 Correctness with Interrupts
In this section we extend correctness of the VAMP with the correctness
proof of interrupts. Since correctness of the internal interrupts was already
established in [Bey05] we will focus on external interrupts. The proofs have
only a slight difference, namely the predicate JISR is now formulated over
internal and external interrupts.
In order to specify external interrupts we introduce a function $ext_S$
which represents external interrupts in the specification. This function maps
the index i of an instruction to the bitvector of external interrupts:
\[
ext_S(i) = \lambda x \in \mathbb{Z}_{19}.\ \exists t \in \mathbb{N} :\
(ext^t[x] \wedge (writeback^t \vee JISR(c_I^t)) \wedge sI(inst, t) = i)
\]
Note that the VAMP processor samples external interrupt signals only
during the writeback stage, i.e. not every external interrupt signal is sampled.
Every instruction is in the writeback stage for only one cycle.
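The definition of the specification-level external interrupts can be mirrored by a small executable sketch. It works on a finite trace instead of all cycles t ∈ N, and it reads Z19 as the index range 0, ..., 18; both are assumptions, and the names `Cycle`, `ext_s`, and `lines` are hypothetical helpers, not part of the PVS theories.

```python
from dataclasses import dataclass
from typing import List

NUM_EXT = 19  # assumption: x ranges over the indices 0..18


@dataclass
class Cycle:
    ext: List[bool]   # external interrupt lines in this cycle
    writeback: bool   # writeback signal is active in this cycle
    jisr: bool        # JISR is active in this cycle
    si_inst: int      # value of sI(inst, t) in this cycle


def ext_s(trace: List[Cycle], i: int) -> List[bool]:
    """Bit x is set iff some cycle of the trace samples ext[x] for instruction i."""
    return [
        any(c.ext[x] and (c.writeback or c.jisr) and c.si_inst == i for c in trace)
        for x in range(NUM_EXT)
    ]


def lines(*active: int) -> List[bool]:
    """Helper: bitvector with exactly the given lines active."""
    return [x in active for x in range(NUM_EXT)]


trace = [
    Cycle(lines(3), writeback=False, jisr=False, si_inst=0),  # not sampled
    Cycle(lines(5), writeback=True, jisr=False, si_inst=0),   # sampled for instruction 0
    Cycle(lines(7), writeback=True, jisr=False, si_inst=1),   # sampled for instruction 1
]
```

Line 3 is raised while nothing is written back, so it does not appear in the bitvector of any instruction in this trace.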
Since in the VAMP implementation all interrupts are triggered during
writeback, it is easy to prove correctness of the implementation with respect
to the specification of external interrupts. All instructions also leave the
pipeline in order because of the reorder buffer, so preciseness of interrupts
with respect to the register files is easy to show. We state the following
lemmas without proofs, since all proofs are similar to those for the VAMP
without address translation and can be found in [Bey05, p. 154–158].
Lemma* 3.5.1 VAMP correctness with interrupts holds up to the cycle of
the first interrupt, i.e., for all t ∈ N:
\[
(synced\_code\_i?(ext_S) \wedge \forall t' \in \mathbb{Z}_t : \neg JISR(c_I^{t'})) \implies corr?(t) \wedge corr\_i?(t)
\]
Lemma* 3.5.2 In case the first interrupt occurs, the specification memory
of the instruction in the MI stage and the instruction after the one in the
writeback stage are equal, i.e., for all t ∈ N:
\[
(synced\_code? \wedge \forall t' \in \mathbb{Z}_t : \neg JISR(c_I^{t'})) \wedge JISR(c_I^t) \implies
\tilde{c}_S^{sI(wb,t)+1}.M = \tilde{c}_S^{sI(MI,t)}.M
\]
Lemma* 3.5.3 In case of an interrupt, no write to the data memory is
currently in progress, i.e., for all t ∈ N:
\[
(\forall t' \in \mathbb{Z}_t : \neg JISR(c_I^{t'})) \wedge JISR(c_I^t) \implies \neg mw_p^t
\]
Lemma* 3.5.4 Let the first interrupt occur in cycle t. The memory part of
corr_i?(t + 1) then holds, i.e., we have for all t ∈ N:
\[
(synced\_code\_i?(ext_S) \wedge \forall t' \in \mathbb{Z}_t : \neg JISR(c_I^{t'})) \wedge JISR(c_I^t) \implies
c_I^{t+1}.M = c_S^{sI(inst,t+1)}.M
\]
Lemma* 3.5.5 VAMP correctness with interrupts also holds in the cycle
immediately after the first interrupt. Formally, we have for all t ∈ N:
\[
(synced\_code\_i? \wedge \forall t' \in \mathbb{Z}_{t-1} : \neg JISR(c_I^{t'})) \implies corr\_i?(t)
\]
Proposition* 3.5.6 Initially and in the cycle after an interrupt, the VAMP
is in an initial state according to Definition 3.2.1. Formally, we have for all
t ∈ N:
\[
(t = 0 \vee JISR(c_I^{t-1})) \implies init?(c^t)
\]
3.5.1 Overall Correctness
We now show overall correctness of the VAMP processor with address
translation. Since we have localized all changes in the module proofs, we can
use the top level of the previous VAMP proof without logical changes. We only
have to make technical changes, since we have introduced external interrupts as
a new parameter. In order to prove the main correctness theorem we first have to
prove an auxiliary lemma which states that correctness in some cycle t′ + 1
can be reduced to correctness in the cycle after the last interrupt before t′.
Lemma* 3.5.7 If we have a cycle t after an interrupt with corr_i?(t) and
an interrupt-free interval until some cycle t′ ≥ t, overall correctness with
interrupts also holds in cycle t′ + 1. Formally, we have for all t ∈ N:
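\[
\begin{aligned}
&synced\_code\_i? \wedge corr\_i?(t) \wedge (t = 0 \vee JISR(c_I^{t-1})) \implies \\
&\quad \forall t' \in \mathbb{Z}_{\ge t} :\ (\forall t'' \in [t : t'[\ :\ \neg JISR(c_I^{t''})) \implies corr\_i?(t'+1)
\end{aligned}
\]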
The proof can be found in [Bey05, p. 159]. Finally we can introduce the
main theorem of the VAMP correctness with address translation.
Theorem 3.5.8 Let the assembly code fulfill the sync condition. The VAMP
is then correct.
synced_code_i? =⇒ corr_i?(t)
Proof: Let synced_code_i? hold. We show corr_i?(t) by induction on t.
Induction base (t = 0): Since corr_i?(0) holds by definition we finish
this case.
Induction step (t → t + 1): We have to show corr_i?(t + 1). We now
split cases on $c_I.\exists last_{JISR}(t)$.
1. Let $\neg c_I.\exists last_{JISR}(t)$ hold. According to the definition of ∃last we have
$\forall k \in \mathbb{Z}_t : \neg JISR(c_I^k)$. Since we have already proved in the
base case that corr_i?(0) holds, we can apply Lemma 3.5.7 to cycles 0 and t in
order to conclude corr_i?(t + 1) and finish this case.
2. Let $c_I.\exists last_{JISR}(t)$ hold. We set $l := c_I.last_{JISR}(t)$; by
Definition 1.1.9 for last we also have l + 1 ≤ t, $JISR(c_I^l)$, and $\neg JISR(c_I^k)$ for
any k ∈ [l + 1 : t[. By induction hypothesis, corr_i?(l + 1) also holds.
Hence, we conclude corr_i?(t + 1) by applying Lemma 3.5.7 to cycles
l + 1 and t. This finishes the proof of the main theorem and shows that
the VAMP with address translation does not contain any functional
error. □
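The case split of the induction step can be made concrete with a short sketch. It models the JISR signal as a list of booleans indexed by cycle and reads last as the greatest cycle l < t with JISR active, which is an assumed reading of Definition 1.1.9 based on the properties used in the proof; the function names are hypothetical.

```python
from typing import List, Optional


def last_jisr(jisr: List[bool], t: int) -> Optional[int]:
    """Greatest cycle l < t with jisr[l] set, or None if there is none."""
    for l in range(t - 1, -1, -1):
        if jisr[l]:
            return l
    return None


def lemma_start_cycle(jisr: List[bool], t: int) -> int:
    """Cycle from which Lemma 3.5.7 is applied in order to reach t + 1.

    Case 1 (no interrupt before t): start at cycle 0.
    Case 2 (last interrupt in cycle l): start at cycle l + 1.
    """
    l = last_jisr(jisr, t)
    return 0 if l is None else l + 1


# Toy trace with a single interrupt in cycle 2.
jisr_trace = [False, False, True, False, False]
```

In both cases the interval from the returned cycle up to t is free of interrupts, which is exactly the premise of Lemma 3.5.7.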
Chapter 4
Conclusion
4.1 Summary
In this thesis we reported on the formal verification of a complex
processor supporting address translation and external interrupts. Our work
was based on the following previous work: the paper-and-pencil correctness
proof of memory management units [Hil05], the formal verification of the
Tomasulo algorithm [Krö01], and the formal verification of the VAMP processor
without address translation [Bey05]. The VAMP processor is a 32-bit RISC
processor with a Tomasulo scheduler, a memory unit with cache memory interface,
a fixed point unit, and three floating point units. The results of this thesis
are published in [DHP05].
In this thesis, we have presented the implementation and the formal
specification of a simple memory management unit and formally verified its
correctness. The correctness proof of an MMU alone is simple, but depends
on nontrivial operating conditions. As a next step, we have integrated the
MMU into the memory unit of the VAMP processor [Bey05, Krö01, JK00,
BBJ+02, BJK+03, BJK+05, Jac02a] for address translation on load/store
and into the fetch stage for address translation on instruction fetch.
In addition to the implementation, we also introduced the programmer’s
model of the VAMP which we use as specification in order to prove cor-
rectness of the VAMP. We showed the overall functional correctness of the
VAMP implementation with respect to the specification. We focused on
the data memory access, self-modifying code, instruction fetch and precise
external interrupts.
In Table 4.1 we compare the verification effort of the different proofs which
were done in PVS at our institute.
Index  Name of Theory                           # Commands  # Lemmas
1      Basics Circuits                                5505       134
2      ALU                                            2202        76
3      Pipelined DLX                                  1850        53
4      Tomasulo alg.                                  5847       186
5      VAMP without MMUs (w/o Tomasulo alg.)         24753       495
6      Local MMU                                      7399        85
7      VAMP with MMUs (w/o Tomasulo alg.)            36785       587
Table 4.1: Comparison of the Verification Effort in PVS.
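As a rough reading aid for Table 4.1, the following sketch computes the difference between rows 7 and 5. Whether this difference is the right measure of the extra effort for address translation is an assumption (the table does not say whether row 7 includes the local MMU proofs of row 6), so the numbers are only indicative.

```python
# (commands, lemmas) per theory, copied from Table 4.1
effort = {
    "VAMP without MMUs (w/o Tomasulo alg.)": (24753, 495),
    "Local MMU": (7399, 85),
    "VAMP with MMUs (w/o Tomasulo alg.)": (36785, 587),
}

base_cmds, base_lemmas = effort["VAMP without MMUs (w/o Tomasulo alg.)"]
mmu_cmds, mmu_lemmas = effort["VAMP with MMUs (w/o Tomasulo alg.)"]

# Assumed measure: plain difference between the two VAMP rows.
extra_cmds = mmu_cmds - base_cmds
extra_lemmas = mmu_lemmas - base_lemmas
relative_growth = extra_cmds / base_cmds

print(f"additional commands: {extra_cmds}, additional lemmas: {extra_lemmas}")
print(f"growth in commands relative to row 5: {relative_growth:.0%}")
```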
4.2 Related Work
Hardware verification is probably one of the most active and successful
fields in formal verification. Several works exist in the field of
formal processor verification, both with and without a Tomasulo algorithm.
The VAMP project was started in 2000 at Saarland University. Using
the theorem prover PVS, Beyer et al. showed the correctness of the
VAMP microprocessor without virtual memory support [BBJ+02, BJK+03,
BJK+05, Bey05, Krö01, JK00]. Sawada and Hunt [SH98, HS99] showed the
correctness of an out-of-order processor with precise interrupts and a store
buffer for the memory unit. They also use a sync instruction to
deal with self-modifying code. McMillan showed the correctness of an
out-of-order processor based on the Tomasulo algorithm using compositional model
checking techniques [McM98]. For this purpose he used the SMV verifier
and applied symmetry to reduce the number of assertions that have to be
verified. Brock and Hunt report on the formal verification of a very
simple, non-pipelined processor, the FM9001 [BHK94]. Aagaard et al. presented
in [ACHK04] a new approach to the formal verification of pipelined processors
using equivalence verification and completion functions. They used
this approach to verify a 32-bit RISC processor with 47 instructions, four
stages, and in-order execution.
None of these results covers virtual memory support. Beyer already
noted in [Bey05] that the version of the VAMP presented there
has no hardware support for address translation. However, common
operating systems require paging and address translation in hardware
in order to give each user program its own virtual memory. We are not
aware of previous work on the formal verification of processors with MMUs
or of memory management units themselves.
Processor designs with address translation by means of an MMU with
translation lookaside buffers are presented, e.g., in [HP96a]. These designs are
not formally specified or proven correct. Jacob and Mudge provide a survey
of the address translation mechanisms present in modern microprocessors [JM98].
Note that there is currently more activity in the field of software verification
than in hardware verification.
4.3 Future work
Presently we see several main directions for further work. First, we should try
to reprove the VAMP using powerful automated tools. Second, on the hardware
side one wants to verify processors with pipelined MMUs, multi-level page
tables, and translation lookaside buffers.
4.3.1 Automated Methods
The main drawback of interactive proof tools is the strong dependency of the
proofs on the concrete implementation. Even slight modifications in some part
of the implementation may cause significant changes to the proofs.
Suppose that at a certain step of the proof one discovers a bug in the
implementation. In most cases the only way to fix the bug is to change the
implementation. Therefore, one also has to change the part of the proof which
has already been done. Thus, the verification process becomes iterative.
Schematically:
obtain a part of proof =⇒
discover a subtle bug =⇒
change the implementation =⇒
change the proof
This process consumes much time. Note that in our case mathematical correctness
was established with paper and pencil before the formal verification in PVS.
Thus, all bugs which we found by formal verification in PVS were
bugs which we had missed in the paper-and-pencil proof. Since industrial
processors usually do not have a mathematical correctness proof, the number of
iterations will increase. Therefore we need automated methods
in order to decrease the proof time. Using model checking and a light-weight
completion function, Clarke et al. have shown the correctness of
Tomasulo's algorithm in [BCBZ02]. Using efficient reductions of the logic of
equality with uninterpreted functions, Velev and Bryant [VB99] verified entirely
automatically the correctness of a set of superscalar DLX implementations [HP96b]
with in-order execution, having 2 pipelines with 5 stages each. The work presented
in that paper makes several simplifications and abstractions. For example,
the instruction loaded on fetch is produced by an uninterpreted
function, and the next-state computation of the memory is also done
by an uninterpreted function. Manolios and Srinivasan [MS05] presented
results which can be used to speed up the runtime of decision procedures
for automated tools.
In order to decrease proof time, the automated proof tool NuSMV is
used at our institute. NuSMV is a model checker with an integrated SAT
solver [Tve05a]. Model checking techniques are widely used to verify the
correctness of hardware systems [BD94, McM98]. The main problem with model
checking is the state explosion problem, i.e. the state space grows
exponentially with the system size. Abstraction is necessary in order to apply
model checking methods to large systems. The goal of abstraction is to
reduce the size of the state space while still being able to show the desired
property. Sergey Tverdyshev has already proved, with the help of NuSMV
and some additional tools, the correctness of all basic circuits and of most
of the ALU of the VAMP [Tve05b].
We now discuss which parts of the processor could also be verified with
automated tools.
• The MU will be difficult to verify as one module because of its large state
space. One can proceed by splitting the MU into several parts: DMMU,
shift4store, compadr, etc. Note that a correctness proof of the local
MMU with automated methods should be possible for the
current model of the MMU after applying data abstractions. Our
implementation of the MMU takes several bitvectors as inputs. As an example
we describe the data abstraction for the page table length ptl. Since
ptl is a bitvector of length 20, we have 2^20 possible states for ptl alone.
We can reduce the state space for ptl to two states if we use data
abstraction. This suffices since we use ptl only once, when we
compare it with pto, i.e. two states are enough to distinguish
the results of the compare operation. Of course, such abstractions need
to be applied to all inputs where this is possible. We can also substitute
20-bit adders for the 32-bit adders which are used to compute the page
table entry address. This also reduces the state space. After such a
sequence of reduction steps the model checker should hopefully be able to
prove the correctness of the MMU.
• The same approach can be used to establish the correctness of the
IMMU on the instruction fetch.
• The proofs of the FPUs already involve automated methods for the control
automata. They are described in [Jac02a]. They could also be proved
in a fully automated style. Jacobi et al. have already presented a
fully automated methodology for the verification of fused-multiply-add
FPUs in [JWPB05]. For the verification they used the IBM-internal
verification tool SixthSense.
• We can also try to apply automated methods to all control automata
in the VAMP. Since the state space of all automata in
the VAMP is not large, one can try to prove liveness and safety
properties by means of model checking. Note that the liveness of the whole
processor is still unproved. We have separate proofs of liveness for
all functional units and for the Tomasulo algorithm.
• We can also try to use automated methods for the whole processor in
a similar way as was done for the verification of a
Tomasulo scheduler in [McM98] and as described in [BCBZ02,
VB99].
• Due to symmetry we can hopefully find a sufficient abstraction for
caches. We might therefore be able to verify caches with the help of a
model checker as well.
So we conclude that if we want to prove the VAMP automatically we
need to invent a good abstract model in order to reduce the state space.
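The data abstraction for ptl described in the first item above can be sketched as follows. The sketch assumes that the only use of ptl is one comparison whose outcome selects between an exception and a normal translation; the direction of the comparison, the name `other` for the second operand, and all function names are illustrative assumptions rather than the actual MMU design.

```python
import random

WIDTH = 20  # ptl is a bitvector of length 20


def concrete_step(ptl: int, other: int) -> str:
    """Concrete model: the decision depends on two 20-bit values."""
    return "exception" if other > ptl else "translate"


def abstraction(ptl: int, other: int) -> bool:
    """Abstraction function: keep only the outcome of the single comparison."""
    return other > ptl


def abstract_step(cmp_result: bool) -> str:
    """Abstract model: ptl has been replaced by one bit."""
    return "exception" if cmp_result else "translate"


concrete_states = 2 ** WIDTH   # possible values of ptl in the concrete model
abstract_states = 2            # possible values of the abstracted input

# Spot check on random samples that both models agree.
random.seed(0)
samples = [
    (random.randrange(concrete_states), random.randrange(concrete_states))
    for _ in range(1000)
]
models_agree = all(
    concrete_step(p, o) == abstract_step(abstraction(p, o)) for p, o in samples
)
```

The random sampling is only a sanity check of the sketch; in a real flow the soundness of the abstraction would have to be argued or proved separately.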
4.3.2 Hardware Optimizations and Extensions
We can do some optimizations and extensions in the hardware in the future.
Since the local MMU is simple every translated request without exceptions
needs to access the memory twice. This is of course slow and does not satisfy
the requirements for industrial processors. Therefore we should optimize the
local MMU .
• Industrial processors with MMUs typically have an internal translation
lookaside buffer (TLB), which is used to cache page table entries.
With a TLB we do not need to do
two requests to the memory if the page table entry which we need is
stored in the TLB.
• In the implementation we perform two additions in order to calculate the
page table entry address: the first when we calculate the effective
address dva, and the second when we add the page index, which is a
part of the effective address, to the page table origin. The delay of
this computation is obviously twice the delay of an adder.
Note that the delay of a carry save adder (3/2-adder) is constant [MP00].
Hence we can reduce the delay to the delay of one adder plus the constant
delay of a 3/2-adder if we first compress the sum of the three bitvectors
to two bitvectors and then compute the page table entry address with a
normal adder. This can be done outside the MMU,
reducing the latency of MMU requests by one cycle. At present we have only
one reservation station for the memory unit. This makes all proofs easier,
but the implementation of the VAMP is slower. If we increase the number
of reservation stations we can pipeline the memory unit.
• In our implementation we stall a store access until the store
instruction is the oldest instruction in the pipeline. Note that it
suffices to stall only the write request to the memory until then,
i.e. the read access to the memory which we do in
order to get the page table entry can be started earlier.
• In order to verify a computer system which contains hardware, we have
to extend the hardware model with external devices, i.e. for a full
implementation of virtual memory in hardware we need
an external device storing the memory pages which are not held in
RAM. Hence, we have to prove the correctness of a model with swap memory.
In [HIP05] a paper-and-pencil proof for such a model is given.
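The carry save optimization mentioned in the list above can be illustrated on plain integers. This is only a sketch: the operands are treated as three already aligned 32-bit vectors, the shifting of the page index and the page table origin is omitted, and the function names are hypothetical.

```python
from typing import Tuple

MASK32 = (1 << 32) - 1


def carry_save(a: int, b: int, c: int) -> Tuple[int, int]:
    """3/2-adder: compress three 32-bit vectors into a sum and a carry vector.

    Every bit position is handled independently, so the depth of the
    circuit does not depend on the width of the operands.
    """
    s = a ^ b ^ c                                  # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1     # carries, moved one position up
    return s & MASK32, carry & MASK32


def add3(a: int, b: int, c: int) -> int:
    """Three-operand addition with one 3/2-adder and one normal adder."""
    s, carry = carry_save(a, b, c)
    return (s + carry) & MASK32                    # the only carry-propagating addition


result = add3(0xFFFFFFFF, 1, 2)
```

The result agrees with two chained additions modulo 2^32, but only one carry-propagating adder lies on the critical path.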
The following extensions in the hardware could be done in the future.
• We also can try to organize multi-level address translation which can
be formally specified in the same way as in [Hil05].
• In order to separate memory pages which we can execute, i.e. pages
which contain program code and pages which contain data, we can
extend the format of the page table entry as follows. We introduce
a new flag x (executable). The fetch can be executed only for pages
whose flag x is active. Hence, we forbid any access on the instruction
fetch to pages which do not contain program code. This extension is
also done in many industrial processors in order to distinguish between
regions of data and regions of instructions in memory.
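The proposed executable flag can be sketched as an additional check on instruction fetch. The page table entry is modeled as a record with named fields instead of a concrete bit layout; the field names, the presence of a valid flag, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageTableEntry:
    ppx: int           # physical page index
    valid: bool        # assumed flag: page is present in memory
    executable: bool   # proposed flag x


def fetch_page_fault(pte: PageTableEntry) -> bool:
    """Page fault on fetch: page is missing or must not be executed."""
    return (not pte.valid) or (not pte.executable)


def load_store_page_fault(pte: PageTableEntry) -> bool:
    """In this sketch, data accesses ignore the flag x."""
    return not pte.valid


code_page = PageTableEntry(ppx=0x12, valid=True, executable=True)
data_page = PageTableEntry(ppx=0x34, valid=True, executable=False)
```

With these entries a fetch from `data_page` faults, while a load from the same page does not.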
4.3.3 Formal Software Verification
The last direction for future work which we want to discuss here is the
correctness of software. As a part of the Verisoft project [Ver03] we try to
verify the following software components.
• Formally verify a compiler for a high-level language. Petrova and
Leinenbach currently work on the verification of a compiler for a subset of the
C language [Lei05, Pet04]. Apart from some optimizations and inline
assembler their work is almost finished.
• Formally verify a simple operating system [GHLP05, Bog05].
• Formally verify several applications, e.g. an email client, support of
the TCP/IP protocol, and some signature software.
Appendix A
VAMP Instruction Set
The VAMP instruction set is taken from [Bey05] with minimal modifications.
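Type     Fields with bit widths, from IR[31] down to IR[0]
I-type   Opcode (6) | RS1 (5) | RD (5) | Immediate (16)
R-type   Opcode (6) | RS1 (5) | RS2 (5) | RD (5) | SA (5) | Function (6)
J-type   Opcode (6) | PC Offset (26)
FI-type  Opcode (6) | RS1 (5) | FD (5) | Immediate (16)
FR-type  Opcode (6) | FS1 (5) | FS2 (5) | FD (5) | 00 (2) | Fmt (3) | Function (6)
Figure A.1: Instruction Formats of the VAMP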
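The instruction formats can be made concrete with a small field extractor. The opcode position IR[31 : 26] and the function position IR[5 : 0] are taken from the tables of this appendix; the positions of the register and immediate fields follow the usual DLX layout and are assumptions here, as are the function names.

```python
def bits(ir: int, hi: int, lo: int) -> int:
    """Extract the bit field IR[hi : lo] from a 32-bit instruction word."""
    return (ir >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_i_type(ir: int) -> dict:
    return {
        "opcode": bits(ir, 31, 26),
        "rs1": bits(ir, 25, 21),     # assumed field position
        "rd": bits(ir, 20, 16),      # assumed field position
        "imm": bits(ir, 15, 0),
    }


def decode_r_type(ir: int) -> dict:
    return {
        "opcode": bits(ir, 31, 26),
        "rs1": bits(ir, 25, 21),     # assumed field position
        "rs2": bits(ir, 20, 16),     # assumed field position
        "rd": bits(ir, 15, 11),      # assumed field position
        "sa": bits(ir, 10, 6),       # assumed field position
        "function": bits(ir, 5, 0),
    }


# addi with RS1 = 1, RD = 2, imm = 5 (opcode 001000 from Table A.1)
addi_word = (0b001000 << 26) | (1 << 21) | (2 << 16) | 5
# add with RS1 = 1, RS2 = 2, RD = 3 (opcode 0^6, function 100000 from Table A.2)
add_word = (1 << 21) | (2 << 16) | (3 << 11) | 0b100000
```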
IR[31 : 26]  Mnem.   d  Effect
Memory operations, pa is an effective address or a translated effective address
100000       lb      1  RD = sext(M[pa + d − 1 : pa])
100001       lh      2  RD = sext(M[pa + d − 1 : pa])
100011       lw      4  RD = M[pa + d − 1 : pa]
100100       lbu     1  RD = 0^24 M[pa + d − 1 : pa]
100101       lhu     2  RD = 0^16 M[pa + d − 1 : pa]
101000       sb      1  M[pa + d − 1 : pa] = RD[7 : 0]
101001       sh      2  M[pa + d − 1 : pa] = RD[15 : 0]
101011       sw      4  M[pa + d − 1 : pa] = RD
Arithmetic, logical operation
001000       addi       RD = RS1 + imm
001001       addiu      RD = RS1 + imm (no overflow)
001010       subi       RD = RS1 − imm
001011       subiu      RD = RS1 − imm (no overflow)
001100       andi       RD = RS1 ∧ imm
001101       ori        RD = RS1 ∨ imm
001110       xori       RD = RS1 ⊕ imm
001111       lhgi       RD = imm 0^16
Test and set operations
011000       clri       RD = 0^32
011001       sgri       RD = 0^31 (RS1 > imm)
011010       seqi       RD = 0^31 (RS1 = imm)
011011       sgei       RD = 0^31 (RS1 ≥ imm)
011100       slsi       RD = 0^31 (RS1 < imm)
011101       snei       RD = 0^31 (RS1 ≠ imm)
011110       slei       RD = 0^31 (RS1 ≤ imm)
011111       seti       RD = 0^31 1
Control operation
000100       beqz       PC′ = PC′ + 4 + (RS1 = 0 ? imm : 0)
000101       bnez       PC′ = PC′ + 4 + (RS1 ≠ 0 ? imm : 0)
000110       jr         PC′ = RS1
000111       jalr       R31 = PC′ + 4; PC′ = RS1
Table A.1: I-type Instruction Layout
IR[5 : 0]  Mnem.    Effect
Shift operations
000000     slli     RD = RS1 ≪ SA
000001     slai     RD = RS1 ≪ SA (arith.)
000010     srli     RD = RS1 ≫ SA
000011     srai     RD = RS1 ≫ SA (arith.)
000100     sll      RD = RS1 ≪ RS2[4 : 0]
000101     sla      RD = RS1 ≪ RS2[4 : 0] (arith.)
000110     srl      RD = RS1 ≫ RS2[4 : 0]
000111     sra      RD = RS1 ≫ RS2[4 : 0] (arith.)
Data transfer
010000     movs2i   GPR[RD] = SPR[SA]
010001     movi2s   SPR[SA] = GPR[RS1]
Arithmetic and logical operations
100000     add      RD = RS1 + RS2
100001     addu     RD = RS1 + RS2 (no overfl.)
100010     sub      RD = RS1 − RS2
100011     subu     RD = RS1 − RS2 (no overfl.)
100100     and      RD = RS1 ∧ RS2
100101     or       RD = RS1 ∨ RS2
100110     xor      RD = RS1 ⊕ RS2
100111     lhg      RD = RS2[15 : 0] 0^16
Test and set operations
101000     clr      RD = 0^32
101001     sgr      RD = 0^31 (RS1 > RS2)
101010     seq      RD = 0^31 (RS1 = RS2)
101011     sge      RD = 0^31 (RS1 ≥ RS2)
101100     sls      RD = 0^31 (RS1 < RS2)
101101     sne      RD = 0^31 (RS1 ≠ RS2)
101110     sle      RD = 0^31 (RS1 ≤ RS2)
101111     set      RD = 0^31 1
Table A.2: R-type Instruction Layout
Note that IR[31 : 26] = 0^6 holds for all instructions in this table and that we
identify a boolean value of true with 1 and false with 0.
IR[31 : 26]  Mnem.  Effect
000010       j      PC′ = PC′ + 4 + imm
000011       jal    GPR[31] = PC′ + 4; PC′ = PC′ + 4 + imm
111110       trap   trap = 1; EData = imm
111111       rfe    SR = ESR; PC′ = EPC; DPC = EDPC
Table A.3: J-type Instruction Layout
IR[31 : 26]  Mnem.    d  Effect
Memory operations, pa is an effective address or a translated effective address
110001       load.s   4  FD[31 : 0] = M[pa + d − 1 : pa]
110101       load.d   8  FD[63 : 0] = M[pa + d − 1 : pa]
111001       store.s  4  M[pa + d − 1 : pa] = FD[31 : 0]
111101       store.d  8  M[pa + d − 1 : pa] = FD[63 : 0]
Control operations
000110       fbeqz       PC′ = PC′ + 4 + (FCC = 0 ? imm : 0)
000111       fbnez       PC′ = PC′ + 4 + (FCC ≠ 0 ? imm : 0)
Table A.4: FI-type Instruction Layout
IR[5 : 0] IR[8 : 6] Mnem. Effect
Arithmetic and compare operations
000000 fadd FD = FS1 + FS2
000001 fsub FD = FS1− FS2
000010 fmul FD = FS1 ∗ FS2
000011 fdiv FD = FS1÷ FS2
000100 fneg FD = −FS1
000101 fabs FD = abs(FS1)
000110 fsqt FD = sqrt(FS1)
000111 frem FD = rem(FS1, FS2)
11c[3 : 0] fc.cond FCC = (FS1 c FS2)
Data transfer
001000 000 fmov.s FD[31 : 0] = FS1[31 : 0]
001000 001 fmov.d FD[63 : 0] = FS1[63 : 0]
001001 mf2i GPR[FD] = FPR[FS1][31 : 0]
001010 mi2f FPR[FD][31 : 0] = GPR[FS2]
Conversion
100000 001 cvt.s.d FD = cvt(FS1, s, d)
100000 100 cvt.s.i FD = cvt(FS1, s, i)
100001 000 cvt.d.s FD = cvt(FS1, d, s)
100001 100 cvt.d.i FD = cvt(FS1, d, i)
100100 000 cvt.i.s FD = cvt(FS1, i, s)
100100 001 cvt.i.d FD = cvt(FS1, i, d)
Table A.5: FR-type Instruction Layout
Note that IR[31 : 26] = 010001 holds for all instructions in this table.
Condition  Mnemonic       Relations                               Invalid if
code c     True   False   Greater (>)  Less (<)  Equal (=)  Unordered (?)  unordered
0000       F      T       0            0         0          0              No
0001       UN     OR      0            0         0          1              No
0010       EQ     NEQ     0            0         1          0              No
0011       UEQ    OGL     0            0         1          1              No
0100       OLT    UGE     0            1         0          0              No
0101       ULT    OGE     0            1         0          1              No
0110       OLE    UGT     0            1         1          0              No
0111       ULE    OGT     0            1         1          1              No
1000       SF     ST      0            0         0          0              Yes
1001       NGLE   GLE     0            0         0          1              Yes
1010       SEQ    SNE     0            0         1          0              Yes
1011       NGL    GL      0            0         1          1              Yes
1100       LT     NLT     0            1         0          0              Yes
1101       NGE    GE      0            1         0          1              Yes
1110       LE     NLE     0            1         1          0              Yes
1111       NGT    GT      0            1         1          1              Yes
Table A.6: Floating-point Relational Operators for the fc Instruction
Appendix B
Lemmas in PVS
In this appendix we list for each lemma in this thesis the corresponding name
in PVS. Lemmas in this thesis are identified by their unique number. In
PVS, they are identified by both a context and a name.
Name in PVS Number Number of commands
processor_inputs_stable_for_request 2.1.8 30
cache_inputs_stable_for_request 2.1.19 30
mmu_correct_hw_dec 2.3.1 40
cache_mr_mw_unary_hw 2.3.2 20
cache_mr_active_until_busy 2.3.3 25
cache_input_stable_hw 2.3.4 575
processor_interface_liveness_hw 2.3.5 615
untranslated_read_hw 2.3.6 610
untranslated_write_hw 2.3.7 815
translated_read_hw 2.3.8 1620
translated_write_hw 2.3.9 1720
addr_is_eq 2.3.10 1005
Table B.1: Lemmas in PVS context mmu
Name in PVS Number Number of commands
processor_request_is_stable 3.4.3 30
commits_is_req_proc_data 3.4.5 215
sI_inst_helper 3.3.2 40
dmmu_proof 3.4.2 780
vamp_correct_with_interrupt_extended 3.4.1 110
mem_conf_const_helper 3.4.7 15
mem_result_correct_data 3.4.8 1210
memory_unit_out_correct_dpf 3.4.9 540
vamp_induction_step_mem_conf 3.4.10 765
vamp_correct_fetch_PC 3.4.12 40
vamp_induction_step_S1 3.4.13 720
processor_request_is_stable_inst 3.4.14 500
immu_proof 3.4.15 780
terminated_memory_instruction_is_commited 3.4.16 100
mem_conf_helper2 3.4.17 60
self_modifying_helper_ptea 3.4.18 110
self_modifying_helper_pha 3.4.19 70
last_tact_of_req_is_req_proc_inst 3.4.20 180
vamp_correct_without_interrupt 3.5.1 55
vamp_correct_with_interrupt_extended 3.5.5 150
mem_commit_equal_mem 3.5.2 40
vamp_correct_JISR_step 3.5.3 100
vamp_correct_JISR_step_mem 3.5.4 230
vamp_after_last_JISR_is_initial 3.5.6 140
sI_inst_helper 3.5.7 220
sI_inst_correct 3.5.8 300
Table B.2: Lemmas in PVS context dlxtom_mmu
Bibliography
[ACHK04] M. Aagaard, V. Ciubotariu, J. Higgins, and F. Khalvati. Com-
bining equivalence verification and completion functions. In FM-
CAD 2004, volume 3312 of LNCS, pages 98–112. Springer, 2004.
[BBJ+02] Christoph Berg, Sven Beyer, Christian Jacobi, Daniel Kröning,
and Dirk Leinenbach. Formal verification of the VAMP micro-
processor (project status). In Symposium on the Effectiveness
of Logic in Computer Science (ELICS02), number MPI-I-2002-
2-007, pages 31–36. Max-Planck-Institut für Informatik, March
2002.
[BCBZ02] S. Berezin, E. Clarke, A. Biere, and Y. Zhu. Verification of out-of-
order processor designs using model checking and a light-weight
completion function. In Formal Methods in System Design, vol-
ume 20, 2002.
[BD94] J. R. Burch and D. L. Dill. Automatic verification of pipelined
microprocessor control. In CAV 94, volume 818 of LNCS, pages 68–80.
Springer-Verlag, 1994.
[Ber01] Christoph Berg. Formal verification of an IEEE floating point
adder. Master’s thesis, Saarland University, Saarbrücken, Ger-
many, May 2001.
[Bey05] Sven Beyer. Putting it all together—Formal Verification of the
VAMP. PhD thesis, Saarland University, Saarbrücken, Germany,
2005.
[BHK94] B. Brock, W. A. Hunt, and M. Kaufmann. The FM9001 micro-
processor proof. Technical Report 86, Computational Logic Inc.,
1994.
[BJ01a] Christoph Berg and Christian Jacobi. Formal verification of the
VAMP floating point unit. In CHARME 2001, volume 2144 of
LNCS, pages 325–339. Springer, 2001.
[BJ01b] Christoph Berg and Christian Jacobi. Formal verification of the
VAMP floating point unit. Lecture Notes in Computer Science,
2144:325–328, 2001.
[BJK01] Christoph Berg, Christian Jacobi, and Daniel Kröning. For-
mal verification of a basic circuits library. In Proc. 19th
IASTED International Conference on Applied Informatics, Inns-
bruck (AI’2001), pages 252–255. ACTA Press, 2001.
[BJK+03] Sven Beyer, Christian Jacobi, Daniel Kröning, Dirk Leinenbach,
and Wolfgang J. Paul. Instantiating uninterpreted functional
units and memory system: Functional verification of the VAMP.
In CHARME 2003, volume 2860 of LNCS, pages 51–65. Springer,
2003.
[BJK+05] Sven Beyer, Christian Jacobi, Daniel Kröning, Dirk Leinenbach,
and Wolfgang J. Paul. Putting it all together—formal verification
of the VAMP. International Journal of Software Tools for
Technology Transfer, Special Issue on ‘Recent Advances in Hardware
Verification’ (to appear), 2005.
[Bog05] Sebastian Bogan. Formal Verification of a Simple Operating Sys-
tem (Draft). PhD thesis, Saarland University, Saarbrücken, Ger-
many, 2005.
[DHP05] I. Dalinger, M. Hillebrand, and W. Paul. On the verification of
memory management mechanisms. In D. Borrione and W. Paul,
editors, CHARME 2005, LNCS. Springer, 2005.
[GHLP05] M. Gargano, M. Hillebrand, D. Leinenbach, and W. Paul. On the
correctness of operating system kernels. In J. Hurd and T. Mel-
ham, editors, TPHOLs 2005 (to appear), LNCS. Springer, 2005.
[Hil05] Mark Hillebrand. Address Spaces and Virtual Memory: Specification,
Implementation, and Correctness. PhD thesis, Saarland
University, Saarbrücken, Germany, 2005.
[HIP05] Mark Hillebrand, Thomas In der Rieden, and Wolfgang Paul.
Dealing with I/O devices in the context of pervasive system veri-
fication. In ICCD ’05. IEEE Computer Society, 2005. To appear.
[HP96a] J. L. Hennessy and D. A. Patterson. Computer Architecture:
A Quantitative Approach. Morgan Kaufmann, San Mateo, CA,
second edition, 1996.
[HP96b] John L. Hennessy and David A. Patterson. Computer Architec-
ture: A Quantitative Approach. Morgan Kaufmann Publishers,
INC., San Mateo, CA, 2nd edition, 1996.
[HS99] W. A. Hunt and J. Sawada. Verifying the FM9801 microarchi-
tecture. In IEEE Micro, pages 47–55, 1999.
[HSG99] Ravi Hosabettu, Mandayam Srivas, and Ganesh Gopalakrishnan.
Proof of correctness of a processor with reorder buffer using the
completion functions approach. In Computer-Aided Verification,
CAV ’99, volume 1633, pages 47–59. Springer-Verlag, 1999.
[IEE85] Institute of Electrical and Electronics Engineers. ANSI/IEEE
standard 754–1985, IEEE Standard for Binary Floating-Point
Arithmetic, 1985.
[Jac02a] Christian Jacobi. Formal Verification of a fully IEEE-compliant
Floating-Point Unit. PhD thesis, Saarland University, Saar-
brücken, Germany, 2002.
[Jac02b] Christian Jacobi. Formal verification of complex out-of-order
pipelines by combining model-checking and theorem-proving.
In Computer Aided Verification, 14th International Conference,
CAV 2002, volume 2404 of Lecture Notes in Computer Science.
Springer, 2002.
[JK00] Christian Jacobi and Daniel Kröning. Proving the correctness of
a complete microprocessor. In Proc. of the 30. Jahrestagung der
Gesellschaft für Informatik. Springer, 2000.
[JM98] Bruce Jacob and Trevor Mudge. Virtual memory in contempo-
rary microprocessors. IEEE Micro, 18(4):60–75, 1998.
[JWPB05] C. Jacobi, K. Weber, V. Paruthi, and J. Baumgartner. Automatic
Formal Verification of Fused-Multiply-Add FPUs. In DATE,
pages 1298–1303, 2005.
[Krö01] Daniel Kröning. Formal Verification of Pipelined Microproces-
sors. PhD thesis, Saarland University, Saarbrücken, Germany,
2001.
[Lei05] Dirk Leinenbach. Formal Verification of a Functional Compiler
of a C-like Language (Draft). PhD thesis, Saarland University,
Saarbrücken, Germany, 2005.
[McM98] K. McMillan. Verification of an implementation of Tomasulo’s
algorithm by compositional model checking. In CAV 98, volume
1427. Springer, June 1998.
[MP00] Silvia M. Müller and Wolfgang J. Paul. Computer Architecture:
Complexity and Correctness. Springer, 2000.
[MS05] Panagiotis Manolios and Sudarshan K. Srinivasan. A parame-
terized benchmark suite of hard pipelined-machine-verification
problems. In CHARME 2005, volume 3725 of LNCS, pages 363–
366. Springer, 2005.
[OSR92] S. Owre, N. Shankar, and J. M. Rushby. PVS: A prototype
verification system. In CADE 11, volume 607 of LNAI, pages
748–752. Springer, 1992.
[Pet04] Elena Petrova. Formal Verification of Compilers on the Source
Code Level (Draft). PhD thesis, Saarland University, Saar-
brücken, Germany, 2004.
[SH98] J. Sawada and W. A. Hunt. Processor verification with precise
exceptions and speculative execution. In CAV 98, volume 1427
of LNCS. Springer, 1998.
[SP88] James E. Smith and Andrew R. Pleszkun. Implementing pre-
cise interrupts in pipelined processors. In IEEE Transactions on
Computers, volume C-37, pages 562–573. 1988.
[Tom67] R. M. Tomasulo. An efficient algorithm for exploiting multiple
arithmetic units. In IBM Journal of Research & Development,
volume 11 (1), pages 25–33. IBM, 1967.
[Tve05a] Sergey Tverdyshev. Combination of Isabelle/HOL with automatic
tools. In 5th International Workshop on Frontiers of Combining
Systems (FroCoS) (to appear), Lecture Notes in Artificial
Intelligence. Springer, 2005.
[Tve05b] Sergey Tverdyshev. Formal Verification of the VAMP proces-
sor by using automated methods (Draft). PhD thesis, Saarland
University, Saarbrücken, Germany, 2005.
[VB99] M. Velev and R. Bryant. Superscalar processor verification us-
ing efficient reductions of the logic of equality with uninterpreted
functions to propositional logic. In L. Pierre and T. Kropf, edi-
tors, CHARME 1999, LNCS. Springer, 1999.
[VB00] Miroslav N. Velev and Randal E. Bryant. Formal verification
of superscalar microprocessors with multicycle functional units,
exceptions, and branch prediction. In DAC. ACM, 2000.
[Ver03] The Verisoft Project. http://www.verisoft.de/, 2003.
