We define physical machines as processors with physical memory and swap memory; in user mode physical machines support address translation. We report on the formal verification of a complex processor supporting address translation by means of a memory management unit (MMU). We give a paper-and-pencil proof that physical machines together with appropriate page fault handlers simulate virtual machines.
Introduction

The challenge of verifying entire systems
In the spirit of the famous CLI stack [Boy89], the research of this paper aims at the formal verification of entire computer systems consisting of hardware, compiler, operating system, communication system, and applications. Working with the Boyer-Moore theorem prover [BM88], the researchers of the CLI stack project succeeded as early as 1989 in formally proving the correctness of a system which contained, among others, the following components: (i) a non-pipelined processor [Hun89], (ii) an assembler [Moo89], (iii) a compiler for an imperative language [You89] with the only data type int, assignments, if-then-else, while loops, and function calls, (iv) a rudimentary operating system kernel [Bev89] written in machine language. This kernel provides scheduling for a fixed number of processes; each process has the right to access a fixed interval of addresses in the processor's physical memory. An attempt to access memory outside these bounds leads to an interrupt. Interprocess communication and system calls apparently were not provided.
In the years from 1989 to 2002, to the best of our knowledge, no verification project aiming at the formal verification of entire computer systems was started anywhere. In [Moo03] the principal investigator of the CLI stack project, J S. Moore, declares the design and formal verification of practical embedded systems 'from the transistor level to the software' a grand challenge problem. A central goal of the Verisoft project [Ver03], funded by the German Federal Government and started in 2003, is to solve this grand challenge problem. This paper takes two necessary steps towards the verification of entire complex systems. (i) We report on the formal verification of a processor with memory management units (MMUs). MMUs provide hardware support for address translation; address translation is needed for the implementation of address spaces provided by modern operating systems.
(ii) We present a paper and pencil correctness proof for a virtual memory emulation based on a very simple page fault handler. As the formal treatment of I/O-devices is an open problem we have to state the correctness of a driver for the swap memory as an axiom.
In subsequent papers we will address the verification of a compiler for a C-like language with in-line assembler code and of an operating system kernel. For preliminary versions of these results see [?, ?, ?].
Overview of this paper
In Section 2 we briefly review the standard formal definition of the DLX instruction set architecture (ISA) for virtual machines. We put emphasis on the handling of internal and external interrupts. In Section 3 on physical machines we enrich the ISA by the standard mechanisms for operating system support: (i) user mode / system mode and (ii) address translation in user mode. In Section 4 we present a (non-optimized) construction of an MMU and prove its correctness under nontrivial operating conditions. Note that in pipelined processors separate MMUs are used for instruction fetch and load / store. In Section 5 we show how the operating conditions for both MMUs can be guaranteed by a combination of hardware mechanisms and software conventions. Section 6 gives the main new arguments of the processor correctness proof assuming that the software conventions are met. In Section 7 we present a simple page fault handler which obeys the software convention. We show that a physical machine with this page fault handler emulates a virtual machine. In Section 8 we present conclusions and further work.
Related work
The processor verification presented here extends work on the VAMP processor presented in [BJK+03, Bey04]. The treatment of external interrupts is in the spirit of [SH98, MP00]. Formal proofs are in PVS [OSR92] and, except for some limited use of PVS's model checker, interactive. We stress that some central lemmas in [SH98, BJK+03] (e.g. correctness of Tomasulo schedulers) have similar counterparts, which can be proven using the very rich set of automatic methods for hardware verification (e.g. [?]). How to profit from these automatic methods in correctness proofs of entire processors continues to be an amazingly difficult topic of research. Some recent progress is reported in [ACHK04].
As for the new results of this paper: we are not aware of previous work on the verification of MMUs. We are also not aware of previous theoretical work on the correctness of virtual memory emulations.
Virtual machines
Notation
We denote the concatenation of bit strings $a \in \{0,1\}^n$ and $b \in \{0,1\}^m$ by $a \bullet b$. For bits $x \in \{0,1\}$ and positive natural numbers $n \in \mathbb{N}^+$ we define inductively $x^1 = x$ and $x^n = x^{n-1} \bullet x$. Thus, for instance, $0^5 = 00000$ and $1^2 = 11$. Overloading symbols like $+$, $\cdot$, and $<$, we allow arithmetic on bit strings $a \in \{0,1\}^n$. In these cases arithmetic is binary modulo $2^n$. We will consider $n = 32$ for addresses and register contents and $n = 20$ for page indices.
We model memories $m$ as mappings from addresses $a$ to values $m(a)$. For natural numbers $d$ we denote by $m_d(a)$ the content of the $d$ consecutive memory cells starting at address $a$.
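The notation above can be made concrete with a small sketch. It is only an illustration under stated assumptions: bit strings are represented as Python strings or integers with an explicit width, memories as dictionaries, and unwritten cells read as zero. The names `rep`, `add_mod`, and `mem_d` are hypothetical, and the order in which the $d$ cells are listed (lowest address first) is a choice made here, not something fixed by the text.

```python
def rep(x: int, n: int) -> str:
    """x^n: the bit x repeated n times, as a string of '0'/'1'."""
    assert x in (0, 1) and n >= 1
    return str(x) * n


def add_mod(a: int, b: int, n: int) -> int:
    """Binary addition modulo 2^n on n-bit values."""
    return (a + b) % (1 << n)


def mem_d(m: dict, a: int, d: int, n: int = 32) -> list:
    """m_d(a): contents of the d consecutive cells starting at address a.

    Cells are returned lowest address first (assumption); addresses wrap
    around modulo 2^n, and unwritten cells read as 0 (assumption).
    """
    return [m.get(add_mod(a, k, n), 0) for k in range(d)]
```

For example, `rep(0, 5)` yields the string `00000` from the text, and `mem_d` applied with `d = 4` returns the four bytes of a word.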
Specifying the instruction set architecture
Virtual machines are the hardware model visible for user processes. Important parameters of such a machine are the following:
• The number $V$ of pages of accessible virtual memory. This defines the set of accessible virtual addresses $VA = \{a \mid 0 \le a < V \cdot 4K\}$.
• The number e ∈ N of external interrupt signals.
• The status register $SR \in \{0,1\}^{32}$. This is the vector of mask bits for the interrupts. In physical machines it comes from the status register.
• The set $VSA \subseteq \{0,1\}^5$ of addresses of user visible special purpose registers. Table 1 shows the entire set of special purpose registers that will be visible for a physical machine. For the virtual machine only the registers $RM$, $IEEEf$, and $FCC$ will be visible. Hence $VSA = \{00110, 00111, 01000\}$.
Formally, a configuration of a virtual machine is a 6-tuple $c_V = (c_V.PC, c_V.DPC, c_V.GPR, c_V.SPR, c_V.vm, c_V.p)$.
The individual components are:
• The program counter $c_V.PC \in \{0,1\}^{32}$ and the delayed program counter $c_V.DPC \in \{0,1\}^{32}$. Both are used to implement the delayed branch mechanism (see Chapter 4 in [MP00] for details).
• The general purpose register file $c_V.GPR : \{0,1\}^5 \to \{0,1\}^{32}$. Registers are 32 bits long; register 0 always contains $0^{32}$.
• The special purpose register file $c_V.SPR : VSA \to \{0,1\}^{32}$. We also refer to special purpose registers by $c_V.x = c_V.SPR[x]$, where $x$ is one of the synonyms from Table 1.
• The byte addressable virtual memory $c_V.vm : VA \to \{0,1\}^8$.
• The write protection function $c_V.p : \{va.px \mid va \in VA\} \to \{0,1\}$. Virtual addresses on the same page have the same protection bit.
Let $C_V$ be the set of virtual machine configurations. An instruction set architecture (ISA) is formally specified as a transition function $\delta_V : C_V \times \{0,1\}^e \to C_V$ mapping configurations $c_V \in C_V$ and a vector of external event signals $eev \in \{0,1\}^e$ to a next configuration $c'_V = \delta_V(c_V, eev)$. For the DLX instruction set we outline the formal definition of this function with an emphasis on the treatment of interrupts.
The instruction to be executed in configuration $c_V$ is found in the four bytes in virtual memory starting at the address of the delayed PC: $I(c_V) = c_V.vm_4(c_V.DPC)$.
The opcode consists of the leading six bits of the instruction: $opc(c_V) = I(c_V)[31:26]$.
The type of an instruction determines how the bits outside the opcode are interpreted. If the opcode consists of all zeros we have, for instance, an R-type instruction: $R\text{-}type(c_V) = (opc(c_V) = 000000)$.
Other instruction types are defined in a similar way. Many instructions can be decoded just from the opcode, e.g. a load word instruction is recognized by $lw(c_V) = (opc(c_V) = 100011)$.
For R-type instructions one has to refer to a secondary opcode. Depending on the instruction type the register destination address is found at different positions in the instruction
In a similar way one can define register source addresses $RS1(c_V)$, $RS2(c_V)$, the sign extended immediate constant $simm(c_V)$, etc. The effective address of a load or store instruction is computed as the sum of the general purpose register addressed by $RS1(c_V)$ and the sign extended immediate constant: $ea(c_V) = c_V.GPR(RS1(c_V)) + simm(c_V)$.
A load word instruction stores four bytes of virtual memory starting at address $ea(c_V)$ into the general purpose register addressed by the register destination address $RD(c_V)$. This can be expressed by equations like $lw(c_V) \Rightarrow c'_V.GPR(RD(c_V)) = c_V.vm_4(ea(c_V))$.
Components of the configuration that are not mentioned on the right-hand side of the implication are meant to be unchanged. This definition, however, ignores both internal and external interrupts; therefore even for virtual machines it is an oversimplification.
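The decoding steps and the simplified load word semantics described above can be sketched as follows. This is a minimal illustration, not the formal definition: the field positions of the source register (bits 25..21), the destination register (bits 20..16), and the immediate (bits 15..0) are assumptions taken from the usual DLX I-type layout, the byte order within a word is an assumption, and the update of the program counters as well as all interrupts are ignored. All function and key names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def bits(word, hi, lo):
    """Return word[hi:lo] as an integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def vm4(vm, a):
    """Four bytes of vm starting at address a.

    Assumption: the lowest address holds the least significant byte.
    """
    return sum(vm.get((a + k) & MASK32, 0) << (8 * k) for k in range(4))


def step_lw(c):
    """One step for a load word instruction (no interrupts, no PC update)."""
    instr = vm4(c["vm"], c["DPC"])
    if bits(instr, 31, 26) != 0b100011:      # opcode of load word
        raise ValueError("not a load word instruction")
    rs1 = bits(instr, 25, 21)                # assumed field position
    rd = bits(instr, 20, 16)                 # assumed field position
    imm = bits(instr, 15, 0)
    simm = imm - (1 << 16) if imm & 0x8000 else imm   # sign extension
    ea = (c["GPR"][rs1] + simm) & MASK32     # arithmetic modulo 2^32
    gpr = list(c["GPR"])
    if rd != 0:                              # register 0 always contains 0
        gpr[rd] = vm4(c["vm"], ea)
    return {**c, "GPR": gpr}                 # other components unchanged
```

Returning a copy of the configuration with only the register file replaced mirrors the convention that components not mentioned in the equation stay unchanged.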
Interrupts
We define below a predicate JISR(cV , eev) (jump to interrupt service routine) depending on both the current configuration cV and the current values eev ∈ {0, 1} e of the external interrupt event signals. Only if this signal stays inactive does the above equation hold
An activation of the JISR signal has for physical machines a well defined effect on the program counters and on the special purpose registers. The effect on virtual machine computations however is that control is handed over to the operating system kernel. This effect can only be defined in a model, which includes the operating system kernel too.
For the same reason rfe instructions (return from exception) are illegal for virtual machines.
For the definition of signal $JISR(c_V, eev)$ for physical machines, we consider the 32 interrupts from Table 2 with indices $j \in IP = \{0, \ldots, 31\}$. For virtual machines we ignore page fault interrupts, thus we only consider $j \in IV = IP \setminus \{3, 4\}$.
The activation of signal $JISR(c_V, eev)$ can be caused by the activation of external interrupt lines $eev[j]$ or by internal interrupt event signals $iev(c_V)[j]$. We define the cause vector $ca(c_V, eev)$ componentwise: $ca(c_V, eev)[j] = eev[j]$ if interrupt $j$ is external, and $ca(c_V, eev)[j] = iev(c_V)[j]$ if it is internal.
For virtual machines, but not for physical machines, an attempt to read or write special purpose registers other than $RM$, $IEEEf$, $FCC$ is illegal. Reading or writing these registers is achieved with commands movi2s or movs2i; special purpose registers are addressed with instruction field $SA(c_V) = I(c_V)[10:6]$. Thus a straightforward formal definition of the illegal instruction signal would include a term like $(movi2s(c_V) \vee movs2i(c_V)) \wedge SA(c_V) \notin VSA$.
The cause vector $ca(c_V, eev)$ is ANDed bitwise with a mask vector, which is derived from the status register $SR$, in order to obtain the masked cause vector. The interrupt occurs, i.e. signal $JISR(c_V, eev)$ is active, if at least one masked cause bit is on.
Thus, although a virtual machine cannot read or write the status register SR, this register is nevertheless visible via the masked cause vector and the JISR-signal.
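The computation of the JISR signal from cause and mask bits can be sketched as below. This is a simplified illustration under a loudly stated assumption: the mask vector is taken to be the status register itself, i.e. every interrupt is treated as maskable, which the text does not claim. The function name `jisr` and the integer encoding of bit vectors are hypothetical.

```python
IP = set(range(32))      # interrupt indices of the physical machine
IV = IP - {3, 4}         # virtual machine: page faults are ignored


def jisr(ca: int, sr: int, indices: set) -> bool:
    """Jump to interrupt service routine?

    ca:      cause vector as a 32-bit integer (bit j = cause of interrupt j)
    sr:      status register as a 32-bit integer of mask bits
    indices: the interrupt indices considered (IP or IV)

    Assumption: the mask vector equals sr, so all interrupts are maskable.
    """
    mca = ca & sr                        # masked cause vector
    return any((mca >> j) & 1 for j in indices)
```

With this sketch a page fault cause bit (index 3 or 4) activates the signal for index set `IP` but not for `IV`, which matches the distinction between physical and virtual machines made above.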
Physical machines
Physical machines are the sequential programming model of the hardware as seen by the programmer of an operating system kernel. Compared with the ISA of the virtual machine, more details are visible in configurations cP ∈ CP of physical machines.
• All special purpose registers from Table 1 are visible. Formally $c_P.SPR : PSA \to \{0,1\}^{32}$ with $PSA = \{bin_5(a) : a \le 13\}$, where $bin_n(a) \in \{0,1\}^n$ is the bit vector of length $n$ whose value as a binary number is $a$.
The mode register $c_P.SPR[10000] = c_P.mode$ distinguishes between system mode ($c_P.mode[0] = 0$) and user mode. In system mode reading and writing any special purpose register is legal.
• Page faults are visible, thus in the definition of the JISR-signal the full set of indices IP is used instead of IV .
• For physical machines the next state δP (cP , eev) is defined in the case of an active signal JISR(cP , eev). See [MP00] for details. Similarly, in system mode physical machines can legally execute an rfe (return from exception) instruction.
• Instead of a uniform virtual memory the user now sees two memories: physical memory $c_P.pm$ and swap memory $c_P.sm$. The number of pages of available physical memory is specified by a parameter $P$.
• In user mode accesses to physical memory are translated.
In the remainder of this section we first specify a 1-level translation mechanism and the corresponding internal interrupt event signals $pff(c_P)$ and $pfls(c_P)$. Then we model I/O operations with the swap memory.
Address translation
In user mode, i.e. if $c_P.mode = 1$, virtual addresses $va = va.px \bullet va.bx$ are transformed into three signals $pma(c_P, va)$, $pff(c_P)$, $pfls(c_P)$, where $pff$ and $pfls$ are interrupt event signals and $pma$ is the physical memory address in case the page fault signals stay inactive. Memory region $c_P.pm_{c_P.ptl \cdot 4 + 4}(c_P.pto \cdot 4K)$ is interpreted as the current page table. The page table entry address for virtual address $va$ is defined as $ptea(c_P, va) = c_P.pto \cdot 4K + 4 \cdot va.px$ and the corresponding page table entry is $pte(c_P, va) = c_P.pm_4(ptea(c_P, va))$. In case the valid bit is on, the physical memory address $pma$ is obtained by concatenation of the physical page index with the byte index $va.bx$: $pma(c_P, va) = ppx(c_P, va) \bullet va.bx$.
A page fault on fetch occurs if a page table length exception or an invalid access occurs for the virtual address $c_P.DPC$: $pff(c_P) = ptlexcp(c_P, c_P.DPC) \vee \neg v(c_P, c_P.DPC)$.
In the absence of page faults on fetch we now have in user mode $I(c_P) = c_P.pm_4(pma(c_P, c_P.DPC))$.
In the absence of page faults on fetch a page fault on load store concerns virtual address ea(cP ). Besides page table length exceptions and invalid access there might also be an attempt to perform a store operation, indicated by predicate s(cP ), to a write protected page
Multi-level translation can be formally specified in a similar way, see e.g. [Hil05, Chapter 5].
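The 1-level translation described in this section can be sketched as follows. This is only an illustration under several assumptions that the text does not fix: the page table entry is assumed to hold the physical page index in bits 31..12, the valid bit in bit 11, and the protection bit in bit 10; a page table length exception is assumed to occur when the page index exceeds `ptl`; and the byte order of a word is assumed to put the least significant byte at the lowest address. The names `translate` and `pm4` and the dictionary-based configuration are hypothetical.

```python
PAGE = 4096  # page size 4K


def pm4(pm, a):
    """Four bytes of physical memory starting at a (byte order assumed)."""
    return sum(pm.get(a + k, 0) << (8 * k) for k in range(4))


def translate(c, va, store=False):
    """Return (pma, fault) for virtual address va in user mode.

    c is a dict with keys 'pm' (byte memory), 'pto' (page table origin,
    a page index) and 'ptl' (page table length register).
    """
    px, bx = va >> 12, va & 0xFFF            # page index, byte index
    if px > c["ptl"]:                        # assumed form of ptlexcp
        return None, True
    pte = pm4(c["pm"], c["pto"] * PAGE + 4 * px)
    ppx = pte >> 12                          # assumed layout: bits 31..12
    valid = (pte >> 11) & 1                  # assumed layout: bit 11
    protected = (pte >> 10) & 1              # assumed layout: bit 10
    if not valid or (store and protected):   # invalid access or protected store
        return None, True
    return (ppx << 12) | bx, False           # ppx concatenated with bx
```

Called with `store=False` the function models the checks for a fetch or load; with `store=True` it additionally models the write protection check for stores.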
Modeling an I/O device
In order to handle page faults, one clearly has to be able to exchange pages between the physical memory cP .pm and the swap memory cP .sm. For a (minimal) detailed treatment of this process one has to do four things:
1. Define I/O-ports as a portion of memory shared between the CPU and the I/O device holding the swap memory.
2. Specify the detailed protocol of the I/O-devices.
3. Construct a driver program, say with the following three parameters passed at fixed addresses $a_1$, $a_2$, $a_3$ in physical memory: a physical page index parameter $ppxp(c_P) = c_P.pm_4(a_1)$, a swap memory page index parameter $spxp(c_P) = c_P.pm_4(a_2)$, and a physical-to-swap flag $p2s(c_P) = c_P.pm(a_3)$ indicating whether a page is to be transferred from physical to swap memory ($p2s = 1$) or vice versa.
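4. Show (among other things): if the driver is started in configuration $c_P$ and never interrupted, then it eventually terminates in a configuration $c'_P$ with $c'_P.sm_{4K}(spxp(c_P) \cdot 4K) = c_P.pm_{4K}(ppxp(c_P) \cdot 4K)$ if $p2s(c_P) = 1$, and $c'_P.pm_{4K}(ppxp(c_P) \cdot 4K) = c_P.sm_{4K}(spxp(c_P) \cdot 4K)$ otherwise.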
Without this detailed treatment of I/O devices we have to assume the existence of a correct driver program as an axiom.
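The effect that such an axiom attributes to the driver can be modelled abstractly as a page copy. The sketch below is an assumption-laden illustration, not the driver itself: it passes the three parameters directly instead of reading them from fixed addresses, represents both memories as byte-addressed dictionaries, and ignores I/O ports and the device protocol entirely. The name `driver_effect` is hypothetical.

```python
PAGE = 4096  # page size 4K


def driver_effect(pm, sm, ppxp, spxp, p2s):
    """Return (pm', sm') after an uninterrupted run of the swap driver.

    pm, sm: physical and swap memory as dicts from byte address to byte
    ppxp:   physical page index parameter
    spxp:   swap memory page index parameter
    p2s:    1 copies physical -> swap, 0 copies swap -> physical
    """
    pm2, sm2 = dict(pm), dict(sm)            # inputs are left untouched
    if p2s:
        src, dst, s, d = pm, sm2, ppxp, spxp
    else:
        src, dst, s, d = sm, pm2, spxp, ppxp
    for bx in range(PAGE):                   # copy one whole page
        dst[d * PAGE + bx] = src.get(s * PAGE + bx, 0)
    return pm2, sm2
```

The source page and all other pages stay unchanged in this model; whether the real driver also modifies scratch locations in system memory is outside the scope of the sketch.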
Construction and local correctness of MMUs
Notation
For cycles $t$ and hardware signals or register contents $x$ we denote by $x^t$ the value of $x$ during cycle $t$. We will refer to the hardware configuration by $h$. The components of this configuration are registers $h.R$ or memories $h.mem$. We often abbreviate $h.x$ by $x$.
Memory interface
We construct MMUs for processors with two first level caches, an instruction cache CI for fetches and a data cache CD for load / store instructions. Therefore the CPU communicates with the memory system via two sets of busses: one connecting the CPU with the instruction cache and the other one with the data cache (see Figure 2) . We assume that the same protocol is used on both busses. Examples of the protocol are shown in Figure 3 for an instruction fetch with and without a cache hit. The essential properties of the bus protocol and the memory system are the following:
1. Accesses last from the activation of a read or write request signal (in the example $mr$) until one cycle after the busy signal is turned off; if the busy signal is not turned on at all, accesses last a single cycle.
3. For the duration of an access, inputs from the CPU to the memory system must be held constant.
4. Liveness: if Conditions 2 and 3 are fulfilled, every access eventually ends.
5. Shared memory semantics: for cycles $t$ and addresses $a$ define $last(a, t)$ as the last cycle $t'$ before $t$ when a write access to address $a$ (necessarily on the data cache via bus $CD.din$) ended. Now assume a read access to cache $X \in \{CI, CD\}$ with address $a$ ends in cycle $t$. Then the result on bus $X.dout$ is
This definition allows us to define the state of the two-port memory system $m(h)$ at time $t$ by
For a formal and complete version of this definition see pages .... of [Bey04]. For a construction of a split cache system and a transcript of a formal proof that it satisfies this specification, see [Bey04], pages 1-110. Guaranteeing that the CPU keeps inputs constant (Condition 3) during all accesses requires the construction of so-called stabilizer circuits, both for the instruction port and for the data port of the memory system. These circuits are needed for instance, if a fetch is in progress and simultaneously 
MMU construction and operating conditions
Figures 4 and 5 show the datapaths and the control automaton of a straightforward, non-optimized construction of an MMU. Two copies of this MMU are placed between the CPU and the two caches as shown in Figure 6. Note that the interface to the memory system is eight bytes wide and the address width is only 29 bits. In user mode this MMU will only perform address translation under nontrivial operating conditions. Consider an access of the CPU to the MMU lasting from a start cycle $t_s$ to an end cycle $t_e > t_s$. We have to require that no signal or register content $x$ from the four groups listed below changes its value during the access, so for all $t \in \{t_s, \ldots, t_e\}$ we have $x^t = x^{t_s}$.
G1. Inputs from the CPU to the interface busses of the MMU; these are p.dout, p.addr, p.mr, and p.mw.
G2. The CPU registers h.mode, h.pto, and h.ptl relevant for translation.
G3. In case of a translated access, the page table entry used for translation, i.e. the shared memory content $m(h)_4(ptea)$ with $ptea = h.pto \cdot 4096 + 4 \cdot p.addr^{t_s}.px$. We define the signal $p.busy = \neg idle$, where $idle$ indicates that the next state of the control automaton is idle.
G4. In case of read access with physical address pa, the shared memory content m(h)8(pa).
Using definitions analogous to Section 3.1 one can define for hardware configurations $h$ and virtual addresses $va$ a page table entry address $ptea(h, va)$, a page table entry $pte(h, va)$, and a physical memory address $pma(h, va)$. Note that under the operating conditions the virtual address $va$, the translation $pma(h, va)$, and, in case of a read, the data read from the shared memory stay the same during the whole access.
Assuming these operating conditions, we outline a fairly straightforward correctness proof for the MMU in the next subsection. Guaranteeing the operating conditions will be a considerably tougher issue.
Local MMU correctness
There is an obvious case split on the kind of access (i) read / write (ii) translated / untranslated (iii) with / without exception. We treat here only the case of a translated read access without exception.
First we identify the sequence of states in the control automaton for such a request. For states s we denote by s + the fact that control stays in state s until the busy signal is taken away from the memory interface. Consider the cycle t ≥ t + 2 when control is in state read and the busy signal from the memory interface is zero. Using G4 one argues that the memory output mdout contains in cycle te = t +1 the required result of the translated read access
Guaranteeing the operating conditions
Stable inputs from the CPU to the MMUs (Condition G1) can be guaranteed by using the stabilizer circuits mentioned in Section 4.2 with very modest enhancements. Condition G4 for loads can be guaranteed if loads and stores are performed in order by the memory unit. Guaranteeing the remaining operating conditions (Conditions G2, G3, and G4 for fetch) requires a software convention and a hardware construction.
Software Synchronization Convention
We consider sequential computations of the physical machine $(c_P^0, c_P^1, \ldots)$; formally this means that we have for all steps $i$: $c_P^{i+1} = \delta_P(c_P^i, eev^i)$.
Recall that for physical machines the address $iaddr(c_P)$ from which an instruction is fetched depends on $c_P.mode$: $iaddr(c_P) = c_P.DPC$ if $c_P.mode = 0$, and $iaddr(c_P) = pma(c_P, c_P.DPC)$ otherwise.
The instruction $I(c_P)$ executed in configuration $c_P$ is then $I(c_P) = c_P.pm_4(iaddr(c_P))$.
We define an instruction as synchronizing if it is completed and the pipeline of the processor is drained before the (translation of the) fetch of the sequentially next instruction starts. The VAMP processor already has a synchronizing instruction, namely a movs2i instruction with $IEEEf$ as a source register. We now also define the rfe instruction to be synchronizing. With the help of function $I(c_P)$ one defines a predicate $syncing(c_P)$ stating that the instruction executed in configuration $c_P$ is synchronizing. The software sync-convention now has two parts:
1. Let $t < t'$. Assume $I(c_P^t)$ writes to $iaddr(c_P^{t'})$. Then for some $t''$ between $t$ and $t'$, instruction $I(c_P^{t''})$ must be synchronizing, i.e. we have $syncing(c_P^{t''})$. The corresponding condition is also needed without address translation in order to prevent the modification of an instruction in pipelined machines after it has been (pre-)fetched [SH98, BJK+03].
2. Let $t < t'$. Assume instruction $I(c_P^{t'})$ is in user mode ($c_P^{t'}.mode = 1$) and instruction $I(c_P^t)$ writes a page table entry read during the fetch of $I(c_P^{t'})$; the address of this entry is $ptea(c_P^{t'}, c_P^{t'}.DPC)$ for fetch. Then we also require $syncing(c_P^{t''})$ for an instruction $t''$ between $t$ and $t'$.
Clearly, the first sync-convention addresses operating condition G4 in case of a fetch, whereas sync-convention 2 addresses G3. In the hardware one has to address operating condition G2 and one has to implement the flushing of the processor once a synchronizing instruction is decoded.
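The first part of the sync-convention can be checked on a finite trace of the sequential machine with a sketch like the following. It is a simplification under stated assumptions: each step is abstracted to the address it is fetched from, the set of addresses it writes, and a flag telling whether it is synchronizing; addresses are compared for equality rather than as four-byte ranges; and "between" is read as strictly between the writing and the fetched instruction. The name `sync_violation` and the trace format are hypothetical.

```python
def sync_violation(trace):
    """Return the first pair (t, t2) violating part 1, or None.

    trace[t] is a dict with keys
      'iaddr'   : address the instruction of step t is fetched from,
      'writes'  : set of addresses written by the instruction of step t,
      'syncing' : True if the instruction of step t is synchronizing.
    """
    for t, step in enumerate(trace):
        for t2 in range(t + 1, len(trace)):
            if trace[t2]["iaddr"] in step["writes"]:
                # Assumption: the synchronizing step lies strictly between.
                between = trace[t + 1:t2]
                if not any(s["syncing"] for s in between):
                    return (t, t2)
    return None
```

A check for the second part would have the same shape, with the fetch address replaced by the address of the page table entry used for the fetch.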
Hardware mechanisms for synchronization
The VAMP processor has a two-stage pipeline for instruction fetch and instruction decode, followed by a Tomasulo scheduler. For details see [BJK+03, Kro01, Bey04]. Thus, there are many register stages $S$, e.g. $IF$ for instruction fetch, $PCs$ with $PC$ and $DPC$, $ID$ for instruction decode, $RS(rs)$ for reservation station $rs$, $ROB(tag)$ for the content of the reorder buffer with address $tag$, $RF$ for the register files, etc. In particular the instruction register $I$ belongs to stage $ID$.
The clocking and stalling of individual stages is achieved by a so-called stall engine. For an introduction to stall engines see [MP00] ; for better stall engines see [Kro01, Bey04] . We enhance here the stall engine from [Bey04] .
Three crucial data structures or signals are associated with each stage $S$ in the stall engine:
1. The full bit $full_S$. It is on if stage $S$ has meaningful data. Clearing the bit $full_S$ flushes the stage. Here we will only be concerned with bit $full_{ID}$ of the instruction decode stage.
2. The local busy signal $busy_S$. If this signal is on in cycle $t$, then the circuits with inputs from register stage $S$ do not produce meaningful data at the end of cycle $t$. Here we will only be concerned with the busy signal $busy_{IF}$ of the instruction fetch stage.
3. The update enable signal $ue_S$. It is like a clock enable signal. If $ue_S$ is active in cycle $t$, the stage $S$ has new data in cycle $t + 1$. We will use these signals in Section 6.1 for the definition of scheduling functions.
Let $busy_{IF}$ be the busy signal of the instruction fetch stage of a machine without memory management units. We define a new busy signal by
where signal $fetch(h)$ is almost the read signal for the instruction MMU of the CPU.
Signal $fetch$ is turned on if (i) no instruction changing registers $pto$, $ptl$, and $mode$ is in progress and (ii) no synchronizing instruction is in progress. Instructions in progress can be in the instruction decode stage, i.e. in the instruction register $IR$, or they are issued but not completed, thus they are in the Tomasulo scheduler and its data structures. In a Tomasulo scheduler an instruction in progress which changes a register $r$ from a register file is easily recognized by an inactive valid bit $r.v$. Thus we define
where function $fetch'(h)$ has to take care of instructions in the decode stage. Using predicates like $rfe()$, which are already defined for configurations, also for the contents of the instruction register, we set $fetch'(h) = \neg(h.full_{ID} \wedge (syncing(IR) \vee movi2s(IR) \vee rfe(IR)))$.
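The gating of instruction fetch described above can be written as two small Boolean functions. The first follows the formula given for the decode stage. The second is an assumption made for illustration only: it combines the decode stage condition with the valid bits of the registers pto, ptl, and mode, which is one plausible reading of the informal description but is not a formula taken from the text. All parameter names are hypothetical.

```python
def fetch_decode_ok(full_id, ir_syncing, ir_movi2s, ir_rfe):
    """Decode stage condition: no synchronizing, movi2s or rfe instruction
    sits in a full instruction decode stage."""
    return not (full_id and (ir_syncing or ir_movi2s or ir_rfe))


def fetch_ok(full_id, ir_syncing, ir_movi2s, ir_rfe, pto_v, ptl_v, mode_v):
    """Assumed overall fetch condition.

    Assumption: an issued but uncompleted write to pto, ptl or mode shows
    up as an inactive valid bit, so fetching additionally requires all
    three valid bits to be active.
    """
    decode_ok = fetch_decode_ok(full_id, ir_syncing, ir_movi2s, ir_rfe)
    return bool(decode_ok and pto_v and ptl_v and mode_v)
```

If the decode stage is empty (`full_id` is false), the contents of the instruction register are irrelevant, which is why the predicates on `IR` are guarded by the full bit.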
In the VAMP processor synchronizing instructions stay in the instruction decode stage until they can immediately proceed to the write-back stage.
Processor correctness
Correctness criteria
We are using correctness criteria based on scheduling functions from [MP00, Kro01, Bey04, SH98] . Register stages S of the hardware come in three flavours:
• Visible stages (with respect to the physical machine): these stages are (i) $PCs$ with the program counters $h.PC$, $h.DPC$, (ii) $RF$ with the register files $h.GPR$, $h.SPR$, and $h.FPR$, and (iii) stage $mem$ with the user visible memory. This latter stage does not exist directly as a hardware component; instead it is simulated by the memory system (main memory and caches) and its state in hardware configuration $h$ is encoded in function $m(h)$ (cf. Section 4.2).
• Invisible stages: the registers of these stages store values of auxiliary functions used in the definition of the sequential semantics of the physical machines. E.g. stage $ID$ with the instruction register stores values $I(c_P)$, stage mem with the input registers to the memory system for load / store operations stores $ea(c_P)$, and so on.
• Stages from the data structures of the Tomasulo scheduler.
We map hardware stages $S$ and hardware cycles $t$ to instruction numbers $i$ by means of scheduling functions: $sI(S, t) = i$. The intention is to relate configurations $h^t$ of the hardware with configurations $c_P^i$ of the specification machines in the following way:
1. For visible registers R from stages S = mem :
Thus the specified value of visible hardware register R is the same as the value of R in the specification machine before execution of the i'th instruction, where i = sI(S, t) is the instruction scheduled in stage S during cycle t. The three main definitions for scheduling functions which make this work are the following:
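1. In-order fetch: the fetch scheduling function is incremented if the instruction decode stage receives a new instruction. Recall that $ue_{ID}$ is the update enable signal of the instruction register: $sI(fetch, t+1) = sI(fetch, t) + 1$ if $ue_{ID}^t$, and $sI(fetch, t+1) = sI(fetch, t)$ otherwise.
2. The scheduling of a stage $S$ that is not updated does not change: $\neg ue_S^t \Rightarrow sI(S, t+1) = sI(S, t)$.
3. If data are clocked in cycle $t$ from stage $S$ to stage $S'$ (a formal definition depends on update enable signals, mux select signals, or driver enable signals during cycle $t$), then $sI(S', t+1) = sI(S, t)$ if $S'$ is not visible, and $sI(S', t+1) = sI(S, t) + 1$ if $S'$ is visible.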
Thus, intuitively, an instruction number $i = sI(S, t)$ moves together with the data through the datapath; upon reaching a register in a visible stage $S$, however, the register receives the value after the $i$-th instruction, i.e. before the $(i+1)$-th instruction.
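The in-order fetch rule can be illustrated by computing the fetch scheduling function from a sequence of update enable values. This is a minimal sketch under the assumption that scheduling starts with instruction number 0 in cycle 0; the function name `fetch_schedule` is hypothetical.

```python
def fetch_schedule(ue_id):
    """Return the list [sI(fetch, 0), sI(fetch, 1), ...].

    ue_id[t] is the update enable signal of the instruction decode stage
    in cycle t. Assumption: sI(fetch, 0) = 0.
    """
    schedule = [0]
    for ue in ue_id:
        # Incremented exactly when the decode stage receives a new instruction.
        schedule.append(schedule[-1] + 1 if ue else schedule[-1])
    return schedule
```

For the update enable sequence `[True, False, False, True]` the instruction number advances in the first and the last cycle only, so stalled cycles keep the same instruction scheduled in the fetch stage.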
Correctness proof with external interrupt signals
In general the hardware of a pipelined processor does not complete one instruction per cycle. As there are more cycles $t$ than instructions $i$, there are necessarily more external interrupt event signals $eev_h^t$ 'seen' by the hardware than event signals $eev^i$ seen by the sequential specification machine. As the computation of the sequential machine is defined by $c_P^{i+1} = \delta_P(c_P^i, eev^i)$,
one has to define the interrupt signal $eev^i$ seen by the specification machine from the signals $eev_h^t$ seen by the hardware and the processor. This has already been observed in [SH98, MP00].
The VAMP processor samples external interrupt signals only during the write-back stage $WB$ (i.e. it does not sample them all). Every instruction $i$ is in the write-back stage for only one cycle $t$. Call this cycle $t = WB(i)$. The correctness proof then works with $eev^i = eev_h^{WB(i)}$.
That no harm is done by the external interrupts signals, which are ignored both by the sampling of the hardware and the sequential programming model, is a problem that has to be solved by the protocol between the processor and the I/O devices.
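The sampling of external interrupt signals at write-back can be illustrated with a short sketch. It assumes that the hardware signals are given as a list indexed by cycle and that the write-back cycle of each instruction is known; the name `spec_events` is hypothetical.

```python
def spec_events(eev_h, wb):
    """Event signals seen by the sequential specification machine.

    eev_h[t]: external event vector seen by the hardware in cycle t
    wb[i]:    the single cycle in which instruction i is in write-back

    Events in cycles that are not a write-back cycle are ignored.
    """
    return [eev_h[t] for t in wb]
```

In the example `spec_events([0b00, 0b01, 0b10, 0b00, 0b11], [1, 4])` the event vector of cycle 2 is dropped, which is exactly the kind of ignored signal that the protocol between processor and I/O devices has to tolerate.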
Correctness proof
We give the new part of the processor correctness proof for a translated instruction fetch without exceptions. The other new cases are handled by similar arguments. Thus consider a translated read access on the instruction port of the CPU which lasts from cycle $t_s$ to cycle $t_e$, let $i = sI(fetch, t_s)$, and let $t \in \{t_s, \ldots, t_e\}$ be any cycle of the access. From the correctness proof for the processor hardware we conclude that on the address bus $PI.addr$ of the instruction MMU we have observed during cycle $t$ the correct delayed PC,
Let $i'(t) = sI(RF, t) < i$ be the instruction in the register stage during cycle $t$. Also by processor correctness we have for all registers $R$ of the register files $h^t.R = c_P^{i'(t)}.R$.
In particular this holds for registers $R \in \{pto, ptl, mode\}$. By the construction of the fetch signal all instructions $x < i$ that update special purpose register $pto$, $ptl$, or $mode$ have left the pipe already at cycle $t_s$ (and, because we fetch in order, no instructions $x > i$ can enter the pipe while instruction $i$ is in the fetch stage). We conclude for registers $R \in \{pto, ptl, mode\}$ and all cycles $t \in \{t_s, \ldots, t_e\}$ that $h^t.R = c_P^i.R$.
Let i (t) = sI(mem , t). From processor correctness we get
By the second part of the software sync-convention all instructions $x < i$ which write the page table entry address $ptea(c_P^i, c_P^i.DPC)$ have left the pipe already at cycle $t_s$. By the first part of the software sync-convention all instructions which write the physical memory address $pma(c_P^i, c_P^i.DPC)$ have also left the pipe already at cycle $t_s$. As above we conclude
Hence the operating conditions for the MMU are fulfilled, and we get from the local correctness lemma:
The rest is lengthy but trivial. Specializing the equations above for t = ts gives Thus, by selecting the right half in the double word P I.dout(h te ) using bit 2 of the delayed program counter, in cycle te + 1 we clock I(c 
Virtual machine simulation
In this section we outline an informal proof that a physical machine with an appropriate page fault handler can simulate a virtual machine. We will use pseudo code as well as C-like data structures in order to describe the handler and we will argue in the style of an efficient algorithms paper.
Making these arguments precise is not trivial; we give some details in the section on further work. For page indices $px$ we define the physical page $px$, the swap page $px$, and the virtual page $px$ respectively as $ppage(c_P, px) = c_P.pm_{4K}(px \bullet 0^{12})$, $spage(c_P, px) = c_P.sm_{4K}(px \bullet 0^{12})$, and $vpage(c_V, px) = c_V.vm_{4K}(px \bullet 0^{12})$.
We extend the definition of physical page indices $ppx(c_P, va)$ and valid bits $v(c_P, va)$ from addresses $va$ to page indices $px$ by $ppx(c_P, px) = ppx(c_P, px \bullet 0^{12})$ and $v(c_P, px) = v(c_P, px \bullet 0^{12})$.
Memory map of the physical machine
We partition the physical memory $c_P.pm$ into user memory and system memory according to Figure 7. Starting at the physical address with page index $abase$ we allocate $a$ pages of user memory. This defines the set of user page indices $UP = \{abase, \ldots, abase + a - 1\}$.
Physical addresses with page indices smaller than abase are used by the page fault handler and the swap memory driver.
We list below the data structures used by the handler and some invariants for them.
• A process control block $PCB$ to save the processor registers of the virtual processor when the processor runs in system mode.
• The page table $c_P.pm_{4V}(c_P.pto \cdot 4K)$, where $V = c_P.ptl + 1$ is the number of accessible virtual pages. We require for all virtual page indices $0 \le px < V$ that the corresponding physical page index belongs to a user page if it is valid: $v(c_P, px) \Rightarrow ppx(c_P, px) \in UP$.
• The physical page index $MRL$ of the most recently loaded page.
• A swap page index $sbase$ in swap memory. For virtual addresses $va$ we will use the swap memory address $sma(c_P, va) = sbase \cdot 4K + va$.
• A user page index $b \in \{0, \ldots, a - 1\}$. We call a user page $u \in \{0, \ldots, a - 1\}$ full if it stores a valid virtual page, i.e. if there is a virtual page index $0 \le px < V$ such that $v(c_P, px) \wedge ppx(c_P, px) = abase + u$. A user page which is not full is called free. We maintain as an invariant that user page $u$ is free iff $u > b$.
• An array $B$ of size $a$ holding virtual page indices. If $u \le b$, i.e. if user page $u$ is full, then
This data structure is used for victim selection, if a page has to be evicted.
• Parameters ppxp, spxp, and p2s of the driver for the swap memory.
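The two invariants stated explicitly in the list above can be checked mechanically. The following is a minimal sketch, assuming the page table is modelled as a list of entries with a physical page index and a valid bit; the class and function names are hypothetical, and the invariant on the array $B$ is not checked because this sketch does not rely on its precise form.

```python
from dataclasses import dataclass, field


@dataclass
class PageTableEntry:
    ppx: int = 0        # physical page index
    v: bool = False     # valid bit


@dataclass
class HandlerState:
    abase: int          # page index of the first user page
    a: int              # number of user pages
    V: int              # number of accessible virtual pages
    sbase: int          # swap page index of virtual page 0
    b: int = -1         # user pages 0..b are full, b+1..a-1 are free
    page_table: list = field(default_factory=list)

    def __post_init__(self):
        if not self.page_table:
            self.page_table = [PageTableEntry() for _ in range(self.V)]


def user_pages(s: HandlerState) -> range:
    """The set UP of user page indices."""
    return range(s.abase, s.abase + s.a)


def is_full(s: HandlerState, u: int) -> bool:
    """User page u is full if some valid virtual page is mapped to abase + u."""
    return any(e.v and e.ppx == s.abase + u for e in s.page_table)


def invariants_hold(s: HandlerState) -> bool:
    # Valid page table entries must point into user memory.
    valid_in_user = all(e.ppx in user_pages(s) for e in s.page_table if e.v)
    # User page u is free iff u > b, i.e. full iff u <= b.
    free_iff_above_b = all(is_full(s, u) == (u <= s.b) for u in range(s.a))
    return valid_in_user and free_iff_above_b
```

With `b = -1` and an all-invalid page table the invariants hold trivially, which matches the initial state used in the simulation proof.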
Simulation relation
For virtual machine configurations $c_V$ and physical machine configurations $c_P$ we define a simulation relation $B(c_P, c_V)$ stating that configuration $c_P$ is an encoding of configuration $c_V$. We require that the invariants of the previous subsections hold for the physical machine and that the physical machine is in user mode ($c_P.mode = 1$). The simulation relation is composed of two parts:
1. For all virtual addresses $va$: $c_V.p(va) = p(c_P, va)$, i.e. the write protection function is encoded in the protection bits of the page tables.
2. For all virtual addresses $va$:
$$c_V.vm(va) = \begin{cases} c_P.pm(pma(c_P, va)) & \text{if } v(c_P, va) \\ c_P.sm(sma(c_P, va)) & \text{otherwise.} \end{cases}$$
So, the user pages of physical memory act as a (write-back) cache for the swap memory. This condition may equivalently be formulated for all virtual page indices $px$ as
$$vpage(c_V, px) = \begin{cases} ppage(c_P, ppx(c_P, px)) & \text{if } v(c_P, px) \\ spage(c_P, sbase + px) & \text{otherwise.} \end{cases}$$
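The page-wise formulation of the memory part of the simulation relation can be read as a decoding function. The sketch below assumes that memories are modelled as dictionaries from page indices to page contents and that a page table entry is a pair `(ppx, v)`; the function names are hypothetical.

```python
def vpage_encoded(px, page_table, pm_pages, sm_pages, sbase):
    """Virtual page px as encoded by the physical machine configuration."""
    ppx, valid = page_table[px]
    # A valid page lives in physical memory; otherwise its copy in swap memory counts.
    return pm_pages[ppx] if valid else sm_pages[sbase + px]


def memory_part_holds(vm_pages, page_table, pm_pages, sm_pages, sbase):
    """Part 2 of the simulation relation, checked for all virtual page indices."""
    return all(
        vm_pages[px] == vpage_encoded(px, page_table, pm_pages, sm_pages, sbase)
        for px in range(len(page_table))
    )
```

Note that for a valid page the swap copy is ignored entirely. It may be stale, which is exactly the write-back behaviour described above.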
Page fault handler and software conditions
Assume the interrupt occurred during physical machine configuration $c_P$ encoding virtual machine configuration $c_V$, i.e. we have $B(c_P, c_V)$. We describe a very simple handler; the handler itself is never interrupted. Thus it suffices that the handler saves only the general purpose registers of the physical processor into the process control block. By testing the exception cause register $ECA$ it is easy to determine whether a page fault on fetch or a page fault on load/store has occurred. For the former we have $ECA[3] = 1$, for the latter we have $ECA[4] = 1$.
It is easy to specify and implement the handler for exceptions caused by a page table length violation or a rights violation: the exception is recognized and the simulation is stopped. Thus assume a page fault occurs because the required virtual page is not in physical memory. The virtual address $xva$ causing the exception is defined as
We call the page index of the exception virtual address the exception virtual page, $xv = xva.px$. Because this page is not in physical memory and $B(c_P, c_V)$ holds, we know that the correct virtual page is stored in swap memory: $spage(c_P, sbase + xv) = vpage(c_V, xv)$.
If $b = a - 1$, there are no free user pages and a victim physical page index $vp$ is selected from the user pages. However, the most recently loaded page is never chosen as victim to avoid deadlock, so $vp \in UP \setminus \{MRL\}$.
Let vp = abase + u. In the first case we increment b. Running the driver with parameters (ppxp, spxp, p2s) = (e, sbase + xv, 0)
will swap the exception virtual page into physical memory, i.e. we end up in a configuration $c'_P$ where
$$ppage(c'_P, e) = spage(c_P, sbase + xv) = vpage(c_V, xv).$$
Thus the simulation relation $B$ holds again with respect to $c_V$. The handler completes its work by restoring the general purpose registers from the process control block and executing an rfe instruction.
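The overall control flow of such a handler can be sketched as follows. This is only an illustration under several assumptions that are not taken from the text: a free page is allocated at user page $b + 1$, the victim is the first user page different from $MRL$, a driver call with `p2s = 1` copies a physical page to swap memory, `B[u]` holds the virtual page index stored in user page `u`, and at least two user pages exist. Memories are dictionaries from page indices to page contents, and the handler state `st` is a plain dictionary.

```python
def swap_driver(pm, sm, ppxp, spxp, p2s):
    """Copy one page between physical memory and swap memory.

    Assumption: p2s = 1 copies physical -> swap, p2s = 0 copies swap -> physical.
    """
    if p2s:
        sm[spxp] = pm[ppxp]
    else:
        pm[ppxp] = sm[spxp]


def handle_page_fault(st, pm, sm, xv):
    """Load virtual page xv; st has keys abase, a, sbase, b, MRL, B, pt."""
    if st["b"] < st["a"] - 1:
        # A free user page exists: take the next one (assumed allocation order).
        st["b"] += 1
        u = st["b"]
    else:
        # No free page: pick any user page except the most recently loaded one.
        u = next(u for u in range(st["a"]) if st["abase"] + u != st["MRL"])
        victim_vpx = st["B"][u]
        # Write the victim back to swap memory and invalidate its page table entry.
        swap_driver(pm, sm, st["abase"] + u, st["sbase"] + victim_vpx, 1)
        st["pt"][victim_vpx] = (0, False)
    e = st["abase"] + u
    # Swap the exception virtual page into physical page e.
    swap_driver(pm, sm, e, st["sbase"] + xv, 0)
    st["pt"][xv] = (e, True)
    st["B"][u] = xv
    st["MRL"] = e
    return e
```

Excluding $MRL$ from victim selection matters when one instruction faults twice: the page loaded for the fetch must still be present when the load/store fault has been handled.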
By inspection of the handler we see that the software synchronization conditions are fulfilled, and thus we can conclude that with this handler the hardware specified above works correctly. This means that $t$ steps of the virtual machine are simulated by the physical machine after $s(t)$ steps. This is shown by induction on $t$. For the initialization we set $b = -1$, i.e. all physical pages are invalid, and the entire virtual memory is stored in swap memory. In the induction step from $t$ to $t + 1$ there are several cases. If there is no exception, then one step of the virtual machine is simulated by one step of the physical machine running in user mode and we get $s(t + 1) = s(t) + 1$.
Simulation theorem
Because in a single instruction we can have up to two page faults, there remain four cases:
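For physical machine steps $s$ let $\tau(s)$ be the first step $s'$ after $s$ such that the machine is in user mode. Writing $c_P^{s'}$ for the configuration of the physical machine in step $s'$, this is
$$\tau(s) = \min\{\, s' > s \mid c_P^{s'}.mode = 1 \,\}.$$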
In all four cases we have a page fault in step $s(t) + 1$. In the first two cases the simulation of step $t + 1$ succeeds without page fault in the first step in user mode after step $s(t) + 1$, thus $s(t + 1) = \tau(s(t) + 1) + 1$.
In the other two cases the second page fault occurs in step $s' = \tau(s(t) + 1) + 1$.
Because the victim page of the second page fault is not the page swapped in during the handling of the first page fault, the simulation will succeed in step $s(t + 1) = \tau(s') + 1$.
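The case distinction on the step function $s$ can be replayed on a recorded trace of mode bits. The sketch below assumes that `modes[i]` is the mode of the physical machine in step `i` (1 for user mode, 0 for system mode) and that the number of page faults of the simulated instruction is known; both function names are hypothetical.

```python
def tau(modes, s):
    """First step after s in which the machine is in user mode (mode == 1)."""
    return next(i for i in range(s + 1, len(modes)) if modes[i] == 1)


def next_sim_step(modes, s_t, page_faults):
    """s(t+1), given s(t) and the number of page faults (0, 1 or 2) of the instruction."""
    if page_faults == 0:
        return s_t + 1
    first_return = tau(modes, s_t + 1)      # handler for the first fault has finished
    if page_faults == 1:
        return first_return + 1
    s_prime = first_return + 1              # step of the second page fault
    return tau(modes, s_prime) + 1
```

Because the handler is never interrupted, every run of system mode steps in such a trace corresponds to exactly one handler invocation.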
Summary and further work
We have presented two main results. (i) We have reported on the formal verification of a processor with (simple) MMUs. The local correctness proof for an MMU alone (Section 4.4) is straightforward, but it hinges on nontrivial operating conditions. Guaranteeing the operating conditions requires a variety of arguments, from very detailed arguments about the hardware (e.g. Section 5.2) to the format of page fault handlers (Section 7.3).
(ii) Arguing about low-level system software, we have given a paper and pencil proof for a simulation theorem stating that virtual machines are simulated by physical machines with appropriate page fault handlers. Because modern operating systems support multitasking and virtual memory, the results of this paper are crucial steps towards the verification of entire computer systems.
Presently we see three main directions for further work. (i) On the hardware side one wants to verify processors with pipelined MMUs, multi-level page tables, and translation look-aside buffers. (ii) On the low-level software side one wants to formally prove the simulation theorem of this paper. This is part of an ongoing effort to formally verify an entire operating system kernel as part of the Verisoft project. Paper and pencil proofs can be found in [SysArch04]. (iii) One wants to establish correctness theorems for the memory management mechanisms of shared memory multiprocessors. The thesis [Hil05] contains results of this nature.
