This paper presents a method for creating formally correct just-intime (JIT) compilers. The tractability of our approach is demonstrated through, what we believe is the first, verification of a JIT compiler with respect to a realistic semantics of self-modifying x86 machine code. Our semantics includes a model of the instruction cache. Two versions of the verified JIT compiler are presented: one generates all of the machine code at once, the other one is incremental i.e. produces code on-demand. All proofs have been performed inside the HOL4 theorem prover.
Introduction
Just-in-time (JIT) compilation is an effective technique for boosting the speed of program interpreters. The idea of JIT compilation, i.e. to dynamically translate input programs into native machine code, then execute only native code, is an old invention which dates back to 1960 [15] . However, the concept is still relevant today as JIT compilation is considered vital for competitive interpreter-based implementations of modern languages, Java, C# and ML etc.
To date, there seems to be no publications on verification of a full JIT compiler. The reasons for this is likely to lie in the fact that verification of JIT compilers poses a number of challenges that are generally considered hard:
1. Compiler verification. A JIT compiler needs to correctly map its input programs down to concrete machine code. In this case the target must be real machine code (numbers), not assembly code or intermediate code which most other verified compilers seem to output.
4.
Pointer arithmetic and code pointers. Pointer arithmetic and particularly code pointers, both natural parts of JIT compilation and execution of the generated code, have been considered hard to deal with in a formal context. This paper presents a method for creating formally correct JIT compilers. The tractability of our approach is demonstrated through, what we believe is the first, verification of a JIT compiler with respect to a semantics of self-modifying x86 code. The contributions of this paper are: a) a semantics suitable for verification of self-modifying x86 code, i.e. an operational semantics which takes into account the hazards of a possibly out-of-date instruction cache.
b) a Hoare logic that fits on top of (a) and can reason about selfmodifying code and code pointers by treating code as data and the program counter as a normal register, c) a workflow, using (b), with which we verified two JIT compilers: one generates all of the machine code at once, the other one is incremental i.e. produces code only on-demand.
The input language of our verified JIT compiler is a simple stack-based bytecode (Section 3) with support for a few branch instructions and stack operations: pop, push, subtract and swap.
Our definitions and proofs have been developed inside the HOL4 theorem prover [26] , which greatly helped keeping track of all the details involved. Our HOL4 proof scripts are available online [1] together with the verified JIT compilers. A very basic benchmark of how efficient the verified JIT compilers are when executed on real x86 hardware is presented in Section 4.6.
Main ideas
This section presents a high-level summary of the main ideas that make verification of JIT compilers possible. Subsequent sections explain the technical details of our proofs (Sections 3, 4, 5) and verification framework (Sections 6, 7).
Operational semantics
The basis of this work is a detailed and extensively tested semantics of x86 machine code (Section 6). This semantics is carefully designed to be suitable for verification of self-modifying x86 code, i.e. it includes an instruction cache that exposes the intricacies of a possibly out-of-date instruction cache. The memory is modelled as a function m and the instruction cache modelled as a function i:
2. data-reads and -writes access only memory m, and 3. occasionally i is updated with values from m.
Cache i can be out-of-date: reading from i might not return values accurate with respect to what is stored in m.
The model covers a number of 32-bit user-mode x86 instructions together with their precise bit-level fetch and decode.
Machine-code Hoare logic
We next set up a Hoare logic (informally described here, but formally defined in Section 7) to ease reasoning about the detailed x86 semantics. We define a Hoare triple {p} c {q}, in Section 7.4, as an abbreviating statement above the x86 semantics, and then prove each individual Hoare-triple statement as a theorem from the x86 semantics. This approach, from [9] , contrasts with the original Hoare logic [11] where certain Hoare triples were axioms.
The three key features of this machine-code Hoare logic are:
1. the program counter is treated as a normal register, 2. code is simply safe-to-execute (instruction-cache accurate) data 3. updates to the cache 'are made local' using a cache abstraction.
A few examples will explain these features. The treatment of the program counter as a normal register can be observed in the following Hoare triple. This Hoare-triple theorem describes an x86 instruction xchg eax,ebx, encoded as 93, which swaps the value v of register eax with the value w of register ebx, and simultaneously adds 1 (the byte length of the instruction) to the value p of program counter pc. For now, read * informally as 'and'. This separating conjunction * is defined and explained in Section 7.1.
{eax v * ebx w * pc p} p : 93 {eax w * ebx v * pc (p + 1)} Direct assignments to the program counter are performed by jumps to code pointers, e.g. x86 instruction jmp eax, encoded as FFE0, assigns the value of eax to the program counter pc:
This treatment of the program counter significantly eases the effort involved in reasoning about code pointers.
The fact that code is simply safe-to-execute data can be observed from the following four theorems. First, code need not be in the middle of a Hoare triple: {p} c {q} = {p * code c} ∅ {q * code c} Second, code can always be weakened to data in the postcondition:
Third, data in the precondition can be strengthened into code:
Fourth, arbitrary data can be turned into code (i.e. data can be made safe to execute) as a side-effect of certain jump instructions, e.g. jmp eax, which is encoded as FFE0 and we already saw in a theorem above, also fits the theorem:
These Hoare triples {p} c {q} satisfies unconventional properties. Their formal definition is given in Section 7, but for now informally read them as: for any state which satisfies p * code c, execution of the x86 semantics will always reach a state for which q * code c holds, and furthermore, no other part of the state was modified, i.e. our Hoare triple satisfies the frame rule from separation logic [24] , {p} c {q} =⇒ ∀r. {p * r} c {q * r} which is essential for local reasoning.
In order to support the frame rule, we had to introduce a lightweight cache abstraction (Section 7.3) which makes the nonlocal updates to the instruction cache seem local.
Verification of JIT compilers
We have used our Hoare logic for self-modifying x86 to construct two JIT compilers based on the following work flow.
1. We start by defining the syntax and operational semantics of the input bytecode. Let next −→ be a relation which describes one step of the execution and let exec −→ be sequence of next −→ that end with execution of a special stop instruction. This semantics represents states as tuples (xs, l, p, cs) that consist of a stack xs, a natural number l which keeps track of available stack space, a bytecode program cs and a program counter p.
2.
A coupling invariant, jit inv, is then defined which allows us to relate states in the bytecode semantics to states in the x86 code, given a base address a:
Informally, this jit inv assertion requires, for bytecode state (xs, l, p, cs) and base address a, that a stack is located in x86 memory containing xs and free space l, that x86 code equivalent to bytecode program cs is stored in memory from address a onwards and that the x86 program counter contains an x86 address equivalent to bytecode program counter p.
The invariant for the incremental version of our JIT compiler further requires that a code generator is present and that a mapping is maintained which keeps track of what and where bytecode instructions have been translated into x86 code.
3. It then follows that each successful execution exec −→ in the bytecode semantics is mimicked by the x86 implementation. The x86 code for the special stop instruction exits by jumping to an address w given in register edx. Here stack is the stack assertion from inside jit inv.
(xs, l, p, cs) exec −→ (xs2, l2, p2, cs2) =⇒ {jit inv (xs, l, p, cs) a * edx w} ∅ {stack (xs2, l2) * pc w * edx w * T} This theorem guarantees that any x86 state which satisfies jit inv will correctly perform evaluations according to exec −→ . 4. Finally, it remains to produce verified x86 machine code that can establish an appropriate state which satisfies jit inv. For this we needed verified x86 code which implements code generation i.e. the translation from bytecode to x86 code. By exploiting backwards compatibility to previously proved synthesis tools [22] , we were able to easily create this x86 code from verified functional descriptions. The final correctness theorem follows as a result of composing the code for establishing jit inv with the code for executing the bytecode program.
The resulting correctness theorem states that the JIT compiler correctly implements exec −→ : given a stack, a string representing the encoding of a bytecode program, and enough space to fit the gen-erated x86 code, the JIT compiler will always terminate in a state where the stack has been updated according to exec −→ .
Input language
To make this paper readable we chose a simple input language for our JIT compiler.
Syntax. The input language is a stack-based bytecode, supporting the following instructions. Here i is a 7-bit immediate constant.
pop pop top element off the stack sub subtract swap swap top two stack elements push i push value i onto stack jump i jump to instruction i jeq i jump to i, if top two equal jlt i jump to i, if top two less stop halt execution
Bytecode programs are lists of the above instructions.
Concrete encoding. The abstract syntax restricts constants to 7-bit values in order to make the concrete encoding (almost) readable as a string. Constants are encoded as a character "0" + i. Here ord maps characters to natural numbers, chr is the inverse of ord.
This encoding of immediate constants makes small constants 1, 2, 3 readable strings "1", "2", "3", while larger constants are less intuitive, e.g. 41, 42, 43 are encoded as "X", "Z", "[".
The concrete encoding of individual instructions is defined as the function enc. Here ++ concatenates strings.
Bytecode programs are encoded as the concatenation of the individual instruction encodings:
This concrete encoding makes small bytecode programs almost readable, e.g. the string "=6<4-j0sj0." is the encoding of a bytecode program which calculates the greatest common divisor: Given the above definition of an input language, we can construct our verified JIT compilers. As mentioned earlier, two JIT compilers will be presented: one JIT compiler generates all of the target x86 code as an initialisation step, the other one produces code incrementally on-demand. This section describes the first version; the second version is presented in Section 5.
Informal intuition
The most basic JIT compilers perform the following steps, they:
1. generate native machine code from bytecode, and then 2. let the generated code run on bare metal.
The first version our JIT compiler implements this separation between code generation and code execution.
The code generation, we consider, is also very simple: each bytecode instruction is always translated to the same x86 instructions. Before describing the exact mapping from bytecode instructions to x86 instructions, we start with an informal description of how the stack xs in the state of a bytecode program, from the previous section, is represented on x86:
• register eax holds the value of the top of the stack xs,
• register edi points to the rest of the stack; we write [edi] for the location of the first element of the rest of the stack.
• register edx holds the address to which stop is to jump.
The choice of general purpose registers eax, edi, edx is arbitrary, and does not depend on any special x86 features.
The mapping from bytecode instructions is informally the following. Push and pop, translate to some pointer arithmetic and move instruction mov, subtraction translates to subtraction on x86, the swap instruction translates to x86 instruction xchg (exchange), and jump instructions are translated into jump instructions on x86, conditional jumps require a compare instruction before the jump. The tedious part is to get the jump offsets correct as x86 instructions vary in length. Fortunately, our proof methodology helped us find and remove our off-by-one bugs early in the design.
Invariant maintained by the generated x86 code
The previous section described informally a coupling invariant which the generated by x86 code maintains to the bytecode. We start the construction of our JIT compiler by first formalising an invariant jit inv and proving that execution of each bytecode instruction respects this invariant:
Invariant jit inv maintains a stack, makes sure that appropriate x86 code is in memory, and also requires that the value of x86 program counter is set correctly. We will formalise each part of the invariant separately.
The stack consists of an array-like list of 32-bit words xs, that grows from address a upwards. Here M a x asserts that 32-bit word x is located at address a, while emp is just the unit of * .
We also have to be precise about what space is reserved as free stack space. Let space a n state that there is space for n instances of M elements below address a.
The stack maintained by jit inv has the following specification: the stack must not be empty [], the top of the stack is located in eax while the rest is placed in memory from some address a upwards, address a is stored in register edi. We also add an assertion which requires that a is 32-bit word aligned, i.e. a & 3 = 0.
To formalise that appropriate x86 code is in memory, we need to define the exact x86 encoding into which each bytecode should be translated. A 32-bit immediate constant w is represented as follows in an instruction as a list of 4 bytes. Here w2w translates 32-bit words to 8-bit words, and is logical-shift right.
The encoding of each instruction is now defined as xenc where t is a function which given the address of a bytecode instruction returns the 32-bit address for the corresponding x86 instruction; w2n converts an unsigned n-bit word into a natural number, and in this case w2w converts 7-bit words to 8-bit words.
Using this encoding function we define the length of each encoding.
Here n2w converts a natural number into a 32-bit word.
xenc length c = n2w (length (xenc (λx. 0) c))
Now we can define a mapping from address p in the bytecode program cs to addresses in the corresponding x86 code, given that the x86 code starts at address a:
The code assertion, code m , which jit inv will use, asserts, for each a ∈ domain m, that the memory of the x86 state contains code byte m(a) at address a. Let memb a m bs make sure that bytes bs appear in m from address a onwards.
We can now define an assertion which states that m contains code for an entire bytecode program cs, from address a onwards:
code in mem cs a m = ∀p c. fetch p cs = some c =⇒ let branch = (λi. addr cs a i − addr cs a p) in memb (addr cs a p) m (xenc branch c)
The value of the x86 program counter should always point at the addr cs a p, i.e. contain the address where instruction with index p is stored. Using these definitions we define jit inv as maintaining a stack, program counter, and appropriate x86 code. (Here s hides the value of the eflags that are irrelevant between instructions.) jit inv (xs, l, p, cs) a = stack (xs, l) * pc (addr cs a p) * s * ∃m. code m * code in mem cs a m
Proving that each case of next −→ satisfies jit inv is a simple exercise of applying the proof rules from Section 7.5 to Hoare triple theorems for each of the instructions that xenc can generate. For example, proving the case for swap starts from theorem: {eax x * edi w * M w y * pc p} p : 8707 {eax y * edi w * M w x * pc (p+2)} which then implies the following,
which in turn leads the following, etc.
code in mem cs a m =⇒ {stack (x :: y :: xs, l) * pc (addr cs a p) * code m } ∅ {stack (y :: x :: xs, l) * pc (addr cs a (p + 1)) * code m }
Bytecode programs are executed by x86 code
The previous section showed how to prove that jit inv is maintained for each transition of next −→ :
To prove that all successful executions exec −→ are handled correctly, we also proved, in much the same manner, the following result for the stop instruction: fetch p cs = some stop =⇒ {jit inv (xs, l, p, cs) a * edx w} ∅ {stack (xs, l) * pc w * edx w * T} The two theorems above are sufficient for proving that any successful execution of a bytecode program, i.e. exec −→ , will always be computed by the x86 code inside jit inv:
This theorem follows by a simple induction on the relation exec −→ .
Implementing code generation
The remaining part of the construction of the JIT compiler is to produce verified x86 machine code that can establish an appropriate state which satisfies jit inv in the precondition of the theorem above. For this we needed verified x86 code which implements code generation i.e. translation from bytecode to x86 code.
By exploiting backwards compatibility to previously developed synthesis tools [22] , we can create x86 code from functional descriptions in HOL4. For example, x86 code for calculating addr, given above, is constructed by first defining, in HOL4, a function x86 addr which we think might calculate addr.
(ORD #".")) then let r6 = r6 + 1w in let r5 = r5 + 2w in x86_addr (r1,r5,r6,g) else if g r6 = n2w (ORD #"p") then let r6 = r6 + 1w in let r5 = r5 + 5w in x86_addr (r1,r5,r6,g) else if g r6 = n2w (ORD #"c") then let r6 = r6 + 2w in let r5 = r5 + 10w in x86_addr (r1,r5,r6,g) else if g r6 = n2w (ORD #"j") then let r6 = r6 + 2w in let r5 = r5 + 5w in x86_addr (r1,r5,r6,g) else if (g r6 = n2w (ORD #"=")) ∨ (g r6 = n2w (ORD #"<")) then let r6 = r6 + 2w in let r5 = r5 + 8w in x86_addr (r1,r5,r6,g) else (r5,g)
Our synthesis tool can from such functions generate x86 code and prove that the generated x86 code executes the input function. In this case, the tool returns a Hoare-triple theorem which states that x86 addr is executed by the generated code:
By then manually proving that x86 addr calculates addr correctly, string in memory(a, encode cs, g) ∧ i < 256 =⇒ x86 addr pre (n2w i, w, a, g) ∧ x86 addr (n2w i, w, a, g) = (addr cs w i, g)
we can produce a formal guarantee that the x86 code produced by our synthesis tool indeed calculates addr.
We produced the full code generator that establishes the invariant jit inv by synthesising x86 code from functions which were manually proved to correctly write x86 instructions into memory with respect to jit inv.
Final correctness theorem
The resulting overall correctness theorem states that the JIT compiler correctly implements exec −→ : given a stack (xs, l), a string representing the encoding of a bytecode program cs, and enough space to fit the generated x86 code, this JIT compiler will always terminate in a state where the stack has been updated according to
Basic benchmark
Finally, we conducted experiments to discover how fast or slow this JIT compiler is on real x86 hardware (a 2.6 GHz Intel processor). In order to run our verified machine code, we wrote a small C wrapper which essentially just jumps to the verified machine code after making the necessary calls to the operating system for allocating heap space with execute permissions enabled. (The verified machine code is supplied to the GNU C compiler gcc as an assembly file consisting of only .byte directives.)
Our JIT compiler calculates the greatest common divisor (GCD) using the bytecode program "=6<4-j0sj0." (Section 3) in very fast times for small inputs, e.g. 135 and 345: $ time ./jit "=6<4-j0sj0." 135 345 Top of stack on exit: 5 real 0m0.003s user 0m0.001s sys 0m0.002s
Only for large inputs, e.g. the pair 2 and 200,000,000, do we get more reliable measurements:
Top of stack on exit: 2 real 0m0.131s user 0m0.128s sys 0m0.003s
Incidentally, the time, 0.128 seconds, matches the execution time of a C program which calculates GCD using a loop: Our JIT compiler makes no optimisations and thus the comparison can easily be made in favour of gcc by simply providing our JIT compiler with less optimal bytecode, e.g. "=8<4sj0s-sj0.". However, the point which we want to make is that simple JIT compilers produced as described above are not necessarily slow; by providing the JIT compilers with optimised bytecode we can get competitive performance.
Verified JIT compiler -version 2 -incremental
The previous section presented a JIT compiler which generates all of the native code at once and then jumps to the generated code. This section describes the construction of a verified JIT compiler which only generates code on-demand, incrementally.
The construction and proof of this incremental JIT compiler follows very closely the workflow of the simpler non-incremental JIT compiler. Therefore, this section will mainly concentrate on describing the intuition of how this incremental JIT compiler works, and will point out how the invariant used in the proof had to be modified.
Informal intuition
The basic intuition for our incremental JIT compiler is that each non-branching instruction (sub, swap, pop, push,stop) is generated as before, but branch instructions (jump, jlt, jeq) will be generated with calls to the code generator, e.g. bytecode instruction jeq i at location p will always initially result in the following x86 code:
When this code is run one of the call instructions will call the code generator, whose address is in register ebx. The value in register ecx will tell the code generator for which bytecode instruction it should generate x86 code. Suppose for this example that the x86 code calls the code generator using the first call, i.e. with p + 1 in ecx, then the code generator will perform a look-up in a table it maintains of where and what bytecode instructions have been translated to x86 code. If it finds that p + 1 has not yet been generated then it translates the longest sequence of non-jump instructions around p + 1 into native code and replaces the xor and call instructions with a jump directly to desired location in the new code, i.e. the code becomes:
(new code starts here) G: (code for instruction p + 1 in bytecode) (new code ends here) 1 Here xor ecx,i has the effect of ecx := i since, whenever such xor instructions are encountered, ecx contains zero. We chose to use xor instead of mov because xor allows us to use a shorter instruction encoding.
In case the code generator had already generated code for instruction p + 1, then no new code would have been generated, instead only the new jump instruction would have been inserted. The interesting aspect of this JIT compiler is that it can generate slightly different x86 code depending on the input, i.e. depending on which branches are taken in which sequence. Branches that are never taken will never be replaced by jmp instructions.
Invariant
This incremental JIT compiler maintains an invariant which, instead of having static code corresponding to the bytecode program, states that a code generator is present and x86 code for part of the bytecode exists.
The interesting part of the invariant describes the state maintained in between calls to the code generator. Its state consists of a partial mapping from bytecode instruction indexes (natural numbers) to locations (32-bit addresses) where that particular bytecode instruction is represented as x86 code. We represent this partial map as a total function to an option type; the type of this mapping, for which we will use variable j, is:
Defining how code is represented in memory is slightly more complicated now. In particular, each jump can potentially be represented in two ways, either as the combination of xor and call, or as just a direct jmp. The encoding of a jump to bytecode location p, in the generated x86 at address a, is either present as xor ecx, The relation between bytecode instructions and x86 code is defined as xenc inc c a p j bs: a relation which states that byte list bs, at machine address a, is a valid x86 code corresponding to bytecode instruction c, at location p. The first 5 cases are identical to the encoding xenc for the non-incremental version.
xenc inc (pop) a p j bs = (bs = xenc (λx. 0) pop) xenc inc (sub) a p j bs = (bs = xenc (λx. 0) sub) xenc inc (swap) a p j bs = (bs = xenc (λx. 0) swap) xenc inc (stop) a p j bs = (bs = xenc (λx. 0) stop) xenc inc (push i) a p j bs = (bs = xenc (λx. 0) (push i)) xenc inc (jump i) a p j bs = enc jmp a (w2n i) j bs xenc inc (jeq i) a p j bs = ∃bs0 bs1 bs2. (bs = bs0 ++ bs1 ++ bs2) ∧ (bs0 = [3B, 07, 0F, 84, 05, 00, 00, 00]) ∧ enc jmp (a+8) (p+1) j bs1 ∧ enc jmp (a+13) (w2n i) j bs2 xenc inc (jlt i) a p j bs = ∃bs0 bs1 bs2. (bs = bs0 ++ bs1 ++ bs2) ∧ (bs0 = [3B, 07, 0F, 82, 05, 00, 00, 00]) ∧ enc jmp (a+8) (p+1) j bs1 ∧ enc jmp (a+13) (w2n i) j bs2
We can now define what it means for bytecode cs to be represented in memory m according to mapping j: whenever bytecode instruction c is according to j stored at an address w, then there exists some sequence of bytes bs, which represents c, stored in memory m at address w. We use memb defined in Section 4.2.
code in mem cs j m = ∀p c w. (fetch p cs = some c) ∧ (j p = some w) =⇒ ∃bs. memb w m bs ∧ xenc inc c w p j bs
We separately assert a well-formedness requirement on mapping j: each x86 representation of a bytecode instruction which is not a jump must, if and only if present in the x86 code, be immediately followed by the x86 representation of the next bytecode instruction. Let ilength c be a function which return the length of each instruction encoding, e.g. ilength (jump i) = 5.
is jump c = ∃i. c ∈ { jump i, jeq i, jlt i, stop } wellformed cs j = ∀c p. (fetch p cs = some c) ∧ fetch (p+1) cs = none ∧ ¬is jump c =⇒ ((j p = none) ⇐⇒ (j (p+1) = none)) ∧ ∀w. (j p = some w) =⇒ (j (p+1) = some (w + ilength c))
The incremental JIT compiler's main invariant, i.e. its version of jit inv, now states that the stack is maintained as stack (xs, l), defined in Section 4.2; the program counter holds a value eip for which there exists some generated x86 code, i.e. j p = some eip; the code cs is represented in memory m according to a wellformed mapping j; and in order to make code generator work, we also include that the code generator is present in memory, that ebx holds a pointer to the entry point in the code generator and register ecx is zero (which is necessary for the xor trick to work). Some inessential details are elided with '. . . '.
jit inv (xs, l, p, cs) a = ∃m j eip w. stack (xs, l) * pc eip * j p = some eip * s * code m * code in mem cs j m ∧ wellformed cs j * code (codegen w) * ebx w * ecx 0 * . . .
The verification proof can be found in our HOL4 proof scripts [1] .
x86 semantics
We use an operational semantics for (user-mode) x86 machine code which builds on a previously presented semantics of x86 [25] . The following subsections will detail how the sequential instantiation of our previously developed x86 semantics is extended to include an instruction cache and read/write/execute memory permissions.
x86 state and memory model
The operational semantics represents x86 states as tuples which consist of a register file r, program counter/instruction pointer e, status bits/eflags s, memory m and instruction cache i:
The type definition makes use of the following data-types:
The type of each x86 state component is: r : regs → word32 e : word32 s : eflags → bool option m : word32 → (word8 * perm set) option i : word32 → (word8 * perm set) option
The option type is used in places where the actual value may be missing, e.g. the value of an eflag in s may be any of three values:
some T has value true (written T), some F has value false (written F), or none has undefined/unpredictable value.
Memory locations in m can either be absent (none), or present (some). Memory locations contain two components: the 8-bits of data that are stored at the address and a set of read/write/execute permissions describing how this data can be accessed. All memory reads and writes are defined in the operational semantics using read mem, write mem and read instr.
read mem a (r, e, s, m, i) = case m a of none → none | some (w, p) → if {r} ⊆ p then some w else none write mem a v (r, e, s, m, i) = case m a of none → none
The function which fetches 8-bits of an instruction gives priority to the instruction cache, and requires the execute permission x. read instr a (r, e, s, m, i) = case (i a, m a) of (none, none) → none | (none, some (w, p)) → if {r, x} ⊆ p then some w else none | (some (w, p), ) → if {r, x} ⊆ p then some w else none Our user-mode semantics has no instructions for altering the permissions attached to memory locations; instead permissions remain static through out execution. In reality operating systems provide user-mode programs with procedures they can call to alter such read/write/execute permissions on a per-page granularity.
Instruction-cache update
Our semantics executes an instruction-cache update icache −→ before each instruction is performed. Each update deletes some set of old address from the cache and loads, from memory, a new set of addresses into the cache:
(r, e, s, m, i) icache −→ (r2, e2, s2, m2, i2) = ∃new old. r2 = r ∧ e2 = e ∧ s2 = s ∧ m2 = m ∧ i2 = λaddr. if addr ∈ new then m addr else if addr ∈ old then none else i addr This cache update transition over approximates the number of different updates a real instruction cache can perform, e.g. a real cache cannot load the entire memory. Over approximating possible cache updates is sufficient, since we will prove that no cache update can cause unwanted behaviour.
A noteworthy feature of this cache update transition is that it will never introduce new inaccuracies. We say that the instruction cache is accurate (not out-of-date) for address a if the instruction cache either does not contain an entry for address a or memory location a is correctly represented in the cache, i.e. accurate a (r, e, s, m, i) = i a = none ∨ i a = m a Automatic cache updates will never introduce inaccurate entries:
Similarly inaccuracies might be removed:
Address a becomes inaccurate whenever address a is represented in the instruction cache and a store instruction successfully modifies the byte stored at address a using write mem a v, for some 8-bit data v.
Next-state relation
The top level next-state relation x86 −→ is defined as a composition of an instruction cache update, then fetch-and-decode, followed by execution of the fetched instruction. The definition of execute can be found in our proof scripts [1] . The function fetch and decode is explained in Appendix A.
(s icache −→ t) ∧ (fetch and decode t = some (instr, len)) ∧ (execute instr len t = some u)
Both the decoder and the execute function have been tested extensively against real x86 hardware, as part of previous work [25] . This gives us confidence that our semantics is, if not completely correct, at least very nearly completely right for the instructions it covers: 
Clearing the instruction cache
The x86 instruction set has no instruction for the sole purpose of clearing the instruction cache. However, the Intel Manual (March 2009) [12] states that it is safe to execute self-modifying code (in sequential programs) if the following steps are taken:
1. store modified code 2. jump to new code or intermediate code
execute new code
This ambiguous description makes it seem safe to assume that some kind of jump is enough to erase any instruction cache inaccuracies that might make the new code unsafe to execute.
In order to make as few assumptions as possible, we will only assume that one type of jump has the ability to clear the instruction cache. We make clearing the instruction cache a side-effect of execution of jump instructions of the form "jmp r32", i.e. jumps to a code pointer stored in a 32-bit register. Other jump instructions, such as branch-to-offset, procedure calls, and procedure returns, are not given this extra side-effect. The relevant part of the execute definition sets i to the empty cache, i.e. λa.none:
All other cases of execute leave the instruction cache i untouched.
Execution sequences
In the next section we will quantify over all possible x86 execution sequences. For this purpose, we define a valid execution sequence x from initial state s, written x86 seq s x, to be an infinite sequence of x86 states (with type N → x86 state) such that
and for each natural number n:
The above definition of x86 seq makes stuck states repeat forever. Repeating stuck states is reasonable as we will only consider judgements of total-correctness which asserts that each execution sequence x will contain a desirable final state x(n) satisfying some postcondition post:
∀x. x86 seq s x =⇒ ∃n. post (x(n))
Machine-code Hoare logic
This section presents a definition of a Hoare triple and associated Hoare proof rules which allow local reasoning in the presence of an instruction cache. The work presented in this section builds on previous experience [19, 21] in adapting separation logic [24] to machine languages.
Separating conjunction: *
Conventionally the separating conjunction * is defined to split partial functions. However, for this work, and previous work on Hoare logic for machine languages, we have found that such a separating conjunction is ill suited for machine languages as processor states are essentially multiple different mappings (mappings from register names to register values, memory locations to memory values, status-bit names to bit values etc.).
We choose to use a set-based separating conjunction in order to treat all resources uniformly and hence make the frame rule (presented later) apply to all types of resources simultaneously. Our set-based separating conjunction * splits a set (of state elements) into two sets: p * q is true for set s if s can be split into two disjoint sets u and v such that p is true for u and q is true for v.
The separating conjunction * is associative and commutative. Its unit is emp and angled brackets . . . will be used for carrying pure boolean assertions (∀p c s. (p * c ) s = p s ∧ c):
This separating conjunction requires states to be represented as sets of state components. Let the type of an x86 state component be defined by a data-type x86 el with the following constructors. A boolean is attached to each memory component to indicate whether of not that byte is accurately represented in the instruction cache.
Eip
: word32 → x86 el Reg : regs → word32 → x86 el Status : eflags → bool option → x86 el Mem : word32 → (word8 * perm) option → bool → x86 el This data-type allows us to define a translation function x86set which maps states represented as tuples x86 state, as described in the previous section, to states represented as sets of x86 state elements. Here range f = { y | ∃x. f x = y }. Let R r w assert that register r has value w, similarly let S a x assert that eflag a has value x, and let pc assert the value of the instruction pointer.
(R a x) s = (s = {Reg a x}) (S a x) s = (s = {Status a x}) (pc x) s = (s = {Eip x})
The above assertions have their intended meaning when used together with the translation function x86set, e.g.
(R a x * p) (x86set(r, e, s, m, i)) =⇒ (r a = x) (S a x * p) (x86set(r, e, s, m, i)) =⇒ (s a = x) (pc x * p) (x86set(r, e, s, m, i)) =⇒ (e = x) but * also separates between assertions of the same kind:
(R a x * R b y * p) (x86set(r, e, s, m, i) ) =⇒ a = b (S a x * S b y * p) (x86set(r, e, s, m, i) ) =⇒ a = b
(pc x * pc y * p) (x86set(r, e, s, m, i)) =⇒ F
We will often abbreviate R EAX w with just eax w, R EBX w with ebx w etc. Another abbreviating assertion is s which hides the values of the eflags:
Assertion s is used in theorems where we need to state that the eflags were modified but we want hide their actual values, which are frequently not of interest.
Memory assertions: M, code, data
The most basic memory assertion B a x is defined to state that byte x, which can be read and written but not executed, is located in memory at address a, which might or might not be accurately represented in the instruction cache:
(B a x) s = ∃acc. s = {Mem a (some (x, {r, w})) acc} A similar assertion M, which states that 32-bit data w is at address a, can be defined using four B assertions. Here w[j−i] extracts bits i to j (inclusive) from 32-bit word w.
Our machine-code Hoare triple represents code as set of code fragments: each element is a tuple (a, x, p) where a is a 32-bit address, x is 8-bits of an instruction and p is write permission w iff location a has write permissions set. The code assertion makes sure that code set c is stored in memory and accurately represented in the instruction cache, hence T below.
The data assertion data m, which informally states that m(a) is stored in memory at location a if a ∈ domain m, is defined as using auxiliary function aux:
Now let data m and datax m, respectively, assert that m is nonexecutable or executable data, regardless of which addresses are accurately represented acc in the instruction cache:
When turning datax m into a code assertion code m , the notation " m " stands for:
Cache abstraction:
Separation logic, which we use as an inspiration for our machinecode Hoare logic, is based on the notion of local reasoning: each action can be described by a Hoare triple that only describes small local updates, such Hoare triples can separately be brought into a grander context using, what is known as, the frame rule, which will be described later.
Applying ideas from separation logic naively to our x86 model does not work as the updates to the instruction cache are non-local and can occur at random; a basic set up that exposes instructioncache updates would struggle to support the vital frame rule.
In order to regain local-reasoning, the definition of our Hoare triple will assert pre/postconditions p, not just as,
but instead using an instruction cache abstraction p s which allows p to be true for some state t with a less accurate instruction cache but otherwise equivalent to s:
Since all non-local cache updates only introduce new accuracies, a different state t can always be chosen in such a way that old inaccuracies (that p might potentially depend on) can be reintroduced before p is asserted.
Definition of Hoare triple: {p} c {q}
The previous subsections defined the necessary building blocks for our instruction-cache-aware Hoare triple, namely: , code and x86 seq. We first define p q to be a total-correctness Hoare "triple" without any code: if some part of the x86 state satisfies p then every possible execution sequence will reach a state which satisfies q. Here ' * r' ensures that every resource that is modified is mentioned in precondition p.
Our machine-code Hoare triple {p} c {q} is an abbreviation which maintains code c as an invariant separate from pre-and postcondition p and q, respectively:
The relationship between the two judgements can be seen in the following theorems. Here ∅ is the empty set.
{p} ∅ {q} = p q {p} c {q} = {p * code c} ∅ {q * code c} These arise from the fact that: p * code ∅ = p * emp = p.
Proof rules
The machine-code Hoare triple, defined above, supports a few unusual proof rules, i.e. theorems proved from the definition of our machine-code Hoare triple {p} c {q} and x86 semantics. Section 2.2 already presented the following proof rules for transforming data into code and vice versa: {p} c {q} = {p * code c} ∅ {q * code c} {p} c {q * code d } =⇒ {p} c {q * datax d} {p * datax d} c {q} =⇒ {p * code d } c {q} Section 2.2 also mentioned that jumps to code pointers, i.e. instructions of the form jmp r32, can turn data into code, e.g. jmp eax, encoded as FFE0, turns data into code:
Other code related rules include the rule for code extension: {p} c {q} =⇒ ∀e. {p} (c ∪ e) {q} which illustrates well that these Hoare triples state only that code c is sufficient to transform states satisfying p into states satisfying q. Thus any extension e to set c will also be sufficient for transforming states satisfying p into states satisfying q. The last code related rule is one which introduces m : 
Quantitative data
Large parts of the basis for this work, the x86 semantics, machinecode Hoare logic and some proof automation, consist of only minor extensions to proof scripts/automation which were developed in previous work [19, 21, 22, 25] . [22] which maps functions in the logic of HOL4 to ARM, x86 and PowerPC machine code, used in Section 4.4. This compiler is based on a proof-producing decompiler [21] which is also included in the number 4851.
The verified JIT compilers can be run on real x86 hardware. A few tests, in Section 4.6, suggest that the JIT compiler is reasonably efficient: the JIT compiler evaluates the bytecode program for calculating the greatest common divisor (GCD), from Section 3, for the pair 2 and 200,000,000 in 0.13 seconds -a time which matches the execution time of compiled C code for calculating GCD. The C code to which we compare, also listed in Section 4.6, was compiled using the GNU C compiler gcc with optimisation turned on.
Related work
This paper has touched on a number of topics: JIT compilation, self-modifying code, verification of machine-code programs and compiler verification. Related work on each topic is briefly discussed below.
JIT compilation. The history of JIT compilation has been surveyed by Aycock [3] who traced back the origins of JIT compilation to McCarthy's LISP paper in 1960 [15] . But the term just-in-time (JIT) was only introduced in connection with attempts to speed up Java, which was slow before using JIT technology: " Java isn't just slow, it's really slow, surprisingly slow. " -from Tyma [28] and Aycock [3] Discussions such as those of Tyma [28] and Aycock [3] make it clear that JIT compilation is vital for implementation of efficient interpreters for modern languages: Java, C#, ML etc.
Self-modifying code. There is very little published work [8, 6] on verification of self-modifying code. To the best of our knowledge, Gerth [8] was the first to propose a solution. His solution is to not have code but instead only data. He demonstrates that linear temporal logic can be used to prove functional correctness and termination of self-modifying code in a toy assembly language. Gerth presents only a few small examples, questions whether his approach will scale, and does not include an instruction cache in his model of the assembly language. More recent work by Cai et al. [6] is somewhat more closely related to our work: they use a Hoare logic and target larger case studies in more realistic machine languages. Cai et al. developed an extension, called GCAP, to an assembly-level Hoare logic called CAP [29] , and formalised GCAP in the Coq theorem prover. They present a long list of small verification examples for MIPS and 16-bit x86 code, some of which generate code dynamically, but none of which is a full JIT compiler.
One of the main differences to this paper is their approach for mutable code: they introduce the concept of possibly overlapping code blocks that might or might not be present at any given time, but one of which must be present in memory when starting execution of a block at that address. In comparison, our approach seems much simpler since for us code is just safe-to-execute data:
{p} c {q} = {p * code c} ∅ {q * code c} Another contrast to the work by Cai et al. is that our work builds on a semantics of x86 which takes into account hazards of an outof-date instruction cache, while their work ignores the instruction cache. Also our framework provides total-correctness results, while GCAP produces partial-correctness results.
Verified machine code. Construction of the two versions of our JIT compiler required verified (functional correctness) implementations of code generation, i.e. x86 code that reads the input bytecode and produces equivalent x86 code.
The obvious way of creating this verified code would have been to, first, either handcraft or generate the machine code, then apply post hoc verification to the proposed machine code. Such an approach could have used:
• symbolic simulation which the ACL2 community has emphasised and successfully applied [4, 17] , e.g. Boyer and Yu [5] verified of machine code for the Motorola MC68020; • a programming logic directly, e.g. Cai et al. [6] chose this method, which works for small examples, but is very labourintensive to apply to larger examples [16] ; • verification condition generation (VCG) applied to machine code: Matthews et al. [14] presented a light-weight approach to trustworthy VCG for machine code -an approach that Hardin et al. have applied to AAMP7G machine code [10] ; or • decompilation into logic -a new method proposed by Myreen et al. [21] which maps, via proof, machine code into equivalent functions in logic (details and examples in Myreen [19] ).
Instead of using any of the above methods for post hoc verification of handcrafted code, we chose to synthesis appropriate machine code using a proof-producing compiler [22] . Our compiler maps functions from the logic of HOL4 down to functionally equivalent x86 machine code and completely automatically proves a certificate theorem. In order to get the desired verified machine code we wrote functions with the appropriate behaviour, proved them correct, and then related the verification result to the automatically generated machine code using the certificate theorem produced by our proofproducing compiler [22] . Influential work on proof-carrying code (PCC) by Necula [23] , typed-assembly language (TAL) by Morrisett et al. [18] , and foundational PCC (FPCC) by Appel [2, 27] share the goal of creating trustworthy software, but are not directly applicable, since work in this direction has aimed to automatically ensure safety properties. In contrast, this paper targets much stronger properties of full functional correctness and termination.
Compiler verification. This paper touched on the topic of compiler verification: the code generator inside the JIT compilers was proved to correctly translate (without any optimisations) bytecode into x86 machine code. From the point of view of compiler verification, this is a very simple and slightly unusual implementation to prove correct. Most papers on compiler verification, of which Dave has made a survey [7] , concentrate on proving a few optimising transformations correct. An impressive exception to this trend is Leroy's recent proof a full end-to-end implementation of an optimising C compiler [13] .
Another contrast to previous work on compiler verification, is that our simple code generator is implemented in x86 machine code (numbers) and produce concrete x86 code as output (numbers); while majority of other work implement the compiler in small toy languages or functional languages and output either assembly code or code in an intermediate language used only inside compilers.
Future work
The input language of our verified JIT compiler is a simple stackbased bytecode. In the future, we plan to extend this input language to a bytecode language suitable for reimplementation of our verified x86, ARM and PowerPC implementations of a LISP interpreter [20] , i.e. we aim to extend the language to operate over a garbage collected stack of s-expressions. Such an extension ought to be largely orthogonal to the developments here as that extension should boil down to replacing the stack assertion in jit inv with a more complicated assertion sexp stack, while code generation is, in principle, unchanged.
Acknowledgments
I would like to thank John Matthews for suggesting that I try to tackle the problem of verifying a JIT compiler. I am grateful for comments from Mike Gordon. This work was funded by the EPSRC, UK.
A. Instruction fetch and decode
Modelling fetch and decode for x86 is made interesting due to the complex instruction encodings that result in variable length instructions. Modelling x86 instruction fetch and decode was part of previous work [25] but did not get a thorough explanation there, so we present fetch-and-decode here, as the current author was responsible for modelling of fetch-and-decode for [25] .
Fetch. First, how do we cleanly separate fetching from decoding? The problem is that fetching needs to know how many bytes to fetch, but the number of bytes (the length of the x86 instruction) is only known after decoding.
A close inspection of the x86 manual [12] shows that no 32-bit mode x86 instruction is longer than 20 bytes. Our solution is to first fetch 20 bytes using fetch 20 a s, fetch 0 a s = [] fetch (n+1) a s = read instr a s :: fetch n (a+1) s Then evaluate decode (outlined later) on a list in which errors, i.e. none elements, have been replaced by zero: decode is applied to clean (fetch 20 a s) where clean is:
clean [] = [] clean (none :: cs) = 0 :: clean cs clean (some x :: cs) = x :: clean cs A successful application of decode returns some (i, len), where i is the abstract syntax tree representation of the instruction and len is the length (number of bytes) of the encoding. Before returning the result, we check that the first len bytes are not none, using not none len:
not none 0 cs = T not none (n+1) (none :: cs) = F not none (n+1) (some x :: cs) = not none n cs
The definition of fetch and decode is hence: fetch and decode (r, e, s, m, i) = let cs = fetch 20 e (r, e, s, m, i) in case decode (clean cs) of none → none | some (i, len) → if not none len cs then some (i, len) else none
Decode. The Intel Manual [12] defines instruction encodings using lists of the following style: each line has two parts, the first part (left of the bar) provides the encoding formats, the second part (right of the bar) provides the assembly instruction. Most assembly instructions have multiple different encodings. Here are some of the encodings for the move instruction mov:
" 89 /r | MOV r/m32, r32 " " 8B /r | MOV r32, r/m32 " " B8+rd id | MOV r32, imm32 " " C7 /0 id | MOV r/m32, imm32 "
The Intel manual provides a description of what each encoding symbol 89, B8+rd, /r, etc. means.
In order to minimise errors that can occur when trying to write the equivalent definitions in a theorem prover, we decided to copy across these lines from the Intel Manual into HOL4 and then write, in the HOL4 logic, an interpreter for this syntax based on Intel's description of what it means. The interpreter first splits each string into two at the bar character | and then breaks each side up into a list of tokens, e.g. the first line from above becomes the pair:
["89", "/r"] and ["MOV", "r/m32", "r32"]
When the interpreter receives a concrete input to decode, e.g. a list of bytes [0x89, 0x07, 0x86, . . . ], it will execute a match function which attempts to fit the encoding format, in this case ["89", "/r"], onto the concrete byte list. If a match is found then a separate function constructs the data-type used for representing the assembly instruction in our operational semantics. The interpreter also returns the unused tail of the input. The example above produces: some (Xmov (Xrm r (Xm none (some EDI) 0) EAX), [0x86, .
. .])
This output from the interpreter means that decoding found x86 instruction mov [edi], eax and that input [0x86, . . . ] was left unused, and hence the instruction consisted of the first two bytes of the input. The top-level decode function returns the generated data-type together with the number of bytes consumed from the input list.
Decoder speed-up. Running the decoder inside the theorem prover logic, e.g. evaluating decode [0x89, 0x07, 0x86, . . . ], to prove decoding results about specific machine instruction is very slow if done naively. It takes nearly 15 minutes for HOL4 to evaluate the above example without helping lemmas. (Evaluating decode outside the logic is not acceptable as we want a theorem certified by the logical core of HOL4.) However, a significant speed up can easily be achieved by partially evaluating the decode function for the list of x86 instruction encodings, i.e. the list of strings mentioned above. When the theorem produced by partial evaluation is supplied to the standard HOL4's evaluation engine, evaluation is performed inside the logic in less than 2 seconds for most instruction, some take up to a bearable 6 seconds.
