A transient fault is a temporary, one-time event that causes a change in state or erroneous signal transfer in a digital circuit. These faults do not cause permanent damage, but when they strike conventional processors, they may result in incorrect program execution. While detecting and correcting faults in first-order data may be accomplished relatively easily by adding redundancy, protecting against faults during control flow transfers is substantially more difficult. This paper analyzes the problem of maintaining the control-flow integrity of a program in the face of transient faults from a formal theoretical perspective. More specifically, we augment the operational semantics of an idealized assembly language with additional rules that model erroneous control-flow transfers. Next, we explain a strategy for detecting control-flow errors based on previous work by Oh [11] and Reis [16] . In order to reason about the correctness of the strategy relative to our fault model, we develop a new assembly-level type system designed to guarantee that any control flow transfer to an incorrect block will be caught before control leaves that block. The key technical result of the paper is a rigorous proof of this fundamental control-flow property for well-typed programs. We also prove that this new typed assembly language is sufficiently expressive to serve as a target for type-preserving compilation from a simple language of while programs.
Introduction
In recent decades, microprocessor performance has been increasing exponentially, due in large part to smaller and faster transistors. While such transistors yield performance enhancements, their lower threshold voltages and tighter noise margins make them less reliable [3, 19, 10] , rendering processors that use them more susceptible to transient faults. Transient faults or soft errors are often caused by external events, such as an energetic particle striking silicon atoms within a chip. These faults do not cause permanent damage, but may result in incorrect program execution by altering signal transfers or stored values.
While transient faults are currently rare, they have already been noticed in commodity processors and have caused significant failures. In 2000, Sun Microsystems acknowledged that cosmic rays interfered with cache memories and caused crashes in server systems at major customer sites, including America Online and eBay [4]. More recently, at the Los Alamos Neutron Science Center, Hewlett Packard acknowledged that its AlphaServer ES45 supercomputer was frequently crashing due to soft errors [7].
More importantly, as each processor generation increases clock rates, lowers voltages and increases the density of transistors, transient faults are more likely to occur. Figure 1, which is taken from work by Shekhar Borkar [5], illustrates the current and projected future trends in transient fault rates relative to chip feature size. With current trends suggesting fault rates are increasing at approximately 8% per chip generation, we may see a 100% rise in transient faults in just seven years.
In order to counter the future threat of transient faults, researchers from industry and academia have been searching for solutions to the reliability problem in both hardware and software. Broadly speaking, with sufficient hardware resources, hardware-only solutions are more efficient for a single, fixed reliability policy, but software-only solutions are more flexible and less costly. In terms of flexibility, software-only solutions may be deployed immediately on current hardware that already exists in the field, simply by recompiling the application in question. Consequently, if Los Alamos Labs is repeatedly suffering from soft errors because of insufficient protection in their current supercomputing facilities, software solutions offer the hope of an effective way to solve their problem right now. In terms of cost-effectiveness, recent studies have shown that software techniques for fault tolerance often add approximately 35% overhead [16] to the computation with no additional hardware cost, whereas a standard double- or triple-modular redundancy technique will add 100% or 200% to the hardware cost, with some additional performance overhead for communication between replicas. Hence, depending upon where a given application sits in the cost-performance-reliability trade-off space, software, hardware or some mix of the two may be the preferred solution.
Unfortunately, devising software solutions to the problem of transient faults, and making sure they are correct, is an extremely difficult task. Just as the many possible interleavings of threads make it difficult to reason about the properties of concurrent programs, the many possible scenarios in which transient faults can arise make it difficult to reason about the properties of faulty programs. Moreover, just as conventional testing is often an ineffective way to uncover bugs in concurrent programs, testing is likely to be an ineffective way to uncover reliability errors in possibly faulty programs.
Faced with these challenges, we and other researchers at Princeton have recently begun to develop type-theoretic techniques for reasoning about software in the presence of transient faults. In our first effort [21] , we devised a lambda calculus called λzap to serve as a highly idealized model for unreliable computations. The operational semantics of the calculus specify that any value may suddenly be corrupted during execution. However, programs are able to replicate computations and use atomic voting operations to check replicas against one another to detect and recover from transient faults. A type system for λzap guarantees that any well-typed program is fault tolerant. In our second piece of work [13] , we studied fault tolerance in the more realistic setting of assembly language with specialized hardware instructions to aid detection of faults.
Once again we devised a type system (this time called TALFT) and rigorously proved that it guarantees a strong fault tolerance property for all well-typed programs. From a theoretical perspective, these type systems codify formal reasoning techniques that allow programmers to prove strong reliability properties of their programs. Equally importantly, from a practical perspective, these type systems can be implemented and used to check the correctness of compiler outputs. Using a type checker to verify these reliability properties, where possible, is vastly superior to conventional testing as the type checker gives perfect coverage relative to the fault model whereas any test suite will be highly incomplete.
Despite the progress made to date, this prior work skirts the issue of how to reason about code that not only incurs faults to first-order data, but also may go wrong during a control flow transfer. The faulty lambda calculus λzap avoids the issue altogether by assuming the existence of high-level atomic operations to simultaneously check for errors, recover and jump to a new control flow point. The fault-tolerant typed assembly language TALFT admits the possibility of faults to the program counter, but requires a highly specialized instruction set and additional nonstandard hardware state to detect those faults.
Surprisingly, however, researchers [11, 16] have developed techniques for detecting certain classes of control-flow errors entirely in software. As mentioned above, software techniques have an advantage over hardware in that they may be deployed selectively and immediately at low cost to solve problems that arise in the field. Unfortunately, none of these new techniques have been proven sound. Oh et al. [11] evaluate the effectiveness of their techniques empirically, showing the number of harmful faults decreases by an order of magnitude, but give no specification of exactly which faults they attempt or do not attempt to detect. Reis et al. [16] lay out in careful English exactly which faults they believe their system defends against and where they believe the vulnerabilities remain. However, they give no mathematical specification of the semantics of their target machine, their fault model or their desired properties, and they make no attempt to prove that their claims are correct.
The purpose of this paper is to analyze the problem of control-flow integrity mathematically and to develop techniques for proving strong program properties in the presence of transient control-flow faults. As in our previous work, we do so by developing a type system that guarantees fault tolerance, relative to our chosen fault model, for all well-typed programs.
In this work, we have chosen to analyze the simplest possible control-flow fault model: one in which faults can cause jump instructions and conditional branches to transfer control to the beginning of any program block. Faults that cause control-flow transfers to places other than the beginning of a program block are not modeled. We also assume the standard Single Event Upset Model [15, 17] is in force and hence only a single fault may happen during execution of a program. We believe that investigating this simple, elegant fault model is an important and foundational first step in the exploration of this uncharted space. And while simple, we conjecture the basic principles we outline in this paper can be reused as building blocks in more complicated settings, just as the principles exposed by our study of λzap proved useful in our development of the more sophisticated TALFT. Finally, we believe this model, as it stands, has future practical value as it is possible for a hardware manufacturer to develop an auxiliary module that analyzes the instruction commit buffer to ensure every jump target is a valid one. Such a change to the hardware would be far less invasive than completely rearchitecting the instruction set as suggested in our previous work [13].
To summarize, there are three main contributions of this research.
• We have defined the first procedure (a type system) for verification of assembly code in the presence of transient control-flow faults. From a technical perspective, the type system introduces a novel way of classifying the reliability properties of program values and entire machine states, generalizing the earlier "color systems" used by λzap and TALFT. The type system is also of interest for the way it uses a collection of abstract types to track the state of the fault tolerance protocol.
• We have formulated a powerful fault-tolerance theorem for our type system and have developed the proof techniques that allow us to validate it. The key technical challenge we overcome is the fact that after a control-flow fault has occurred, one can count on almost no standard program invariant. So, how can one carry out a proof of type preservation under such circumstances?
• We have demonstrated that our type system is suitably expressive by showing how to compile a simple language of while programs into well-typed assembly language. We prove our translation is type-preserving.
The rest of the paper explains the problem of transient control-flow faults and our techniques for reasoning about them in more detail. First, Section 2 gives additional intuition about the problem and solution by explaining a simple assembly-language protocol for detecting control-flow errors. This protocol is a simplified version of the protocols used by Oh [11] and Reis [16]. Section 3 begins the more formal work by defining the syntax and operational semantics of an idealized assembly language. It also shows how to model erroneous control-flow transfers by adding special rules to the standard operational semantics. Section 4 defines the type system that guarantees that assembly code follows the simple protocol outlined earlier in Section 2. This type system, particularly the special value and machine state typing rules, codifies the major invariants needed to prove type safety and the subsequent strong reliability properties. Section 5 sketches the major components of the fault-tolerance proof. Due to space limitations, we are unable to include all the details in this paper. However, an online appendix [14] presents all the key lemmas, theorems and proofs. Section 6 shows that our typed assembly language is sufficiently expressive that it is possible to translate while programs into well-typed, fault-tolerant code. Finally, Section 7 discusses further related work and Section 8 concludes.
Informal Overview
When a transient fault causes the actual sequence of control flow blocks visited by a program to deviate from the expected sequence, we say a control-flow error has occurred. In this paper, control-flow errors arise in three different ways: (1) there may be a fault to the target address of a jump instruction; (2) there may be a fault to the target address of a conditional jump instruction; or (3) there may be a fault to the boolean used to decide whether to jump or fall through a conditional. Such faults may occur immediately prior to attempting the control-flow transfer or at any other time during the computation. However, whenever a control-flow operation is executed, we assume execution is either transferred to the beginning of some valid block, or to some invalid block or illegal instruction. In the latter case, we assume the hardware immediately catches an attempt to execute the illegal instruction. We do not consider the possibility that a fault causes a control flow transfer to a legal instruction in the middle of some valid block.
As is standard, we adhere to the Single Event Upset model [15, 17] , which states that only one fault may occur during an execution. However, even though just one fault occurs, faulty values may be copied, propagated and used in any way an ordinary value may be used. Hence, a single fault can lead to arbitrarily many corrupted values if not caught soon after it occurs.
The goal of this work is to develop and prove correct a software protocol that guarantees such control-flow errors can never go undetected. The central challenge in this endeavor is to overcome the problem that no single value can ever be trusted to be correct: a transient fault may strike any value in any register. Consequently, as is usual in fault tolerance, the solution is to avoid relying on any single value by replicating the critical state and checking replicas against one another. In this case, the critical state is the value of the program counter. Checking the correctness of a control-flow transfer involves creating a replica of the intended control-flow destination and then checking the replica against the real program counter to detect any difference.
To be more specific, compiled code creates the replica prior to any control-flow transfer by moving the intended destination into a designated register. We refer to this register as the intentions register ri. This intentions register is part of the global "calling convention" for fault-tolerant control flow transfers. We fix the register so that all jump targets know where to find the intended destination, even when there has been a control-flow fault.
As an example, to jump to address L2, one might use the following code sequence. In this code, we leave ellipses in between instructions to emphasize that our system allows flexible scheduling of instructions: ordinary instructions may be interleaved with the instructions used to guarantee fault tolerance.
Since the intentions register ri plays a special role in the protocol for detecting control-flow errors, we will need to type check the move instruction that loads this register in a special way. To designate the move as special, we henceforth write it intend L2 rather than mov ri, L2 as in the following example code. Once again, since the branch to the recovery code plays a special role in the fault-tolerance protocol, we give it the special syntax recovernz r2. Thus, our detection code will henceforth be written as follows.
L2: movi r2, L2
    ...
    sub r2, r2, ri
    ...
    recovernz r2
    ...
As an example of how a transient fault might be caught using our protocol, suppose register r2 is corrupted just prior to attempting to execute the jump to L2 in block L1. Upon arrival at some erroneous control flow block, say L3, the intended destination L2 remains safely untouched in register ri, though, unnervingly, all other program invariants may be disrupted. The target code compares the contents of ri (i.e., L2) with L3, which it loaded into r2 after arriving at the current block. It detects a difference and jumps to the recovery code.
One must also consider what happens if faults strike at different times or in different places. For instance, the jump target might have been corrupted much earlier than we suggested above, perhaps just after being initially loaded into r2, instead of just prior to the jump. Will that make a difference? In this case, no. Likewise, ri might be corrupted, either before or after jumping. In this case, we reach the correct destination, but it appears as though there was a fault because ri differs from the current block label (assuming the fault occurs prior to the subtraction). Unable to tell the difference between a fault in the intentions register and a fault in the control-flow transfer itself, we jump to recovery code.
A number of other scenarios must also be analyzed, and in order to have confidence in the solution, one must do so in a principled, disciplined fashion.
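One way to make this kind of case analysis systematic for the small jump example is to enumerate every single-register fault mechanically. The following Python sketch is an illustration only, not the formal machine defined later in the paper: block addresses are modeled as the integers 1 to 3, the names `run`, `undetected_errors`, `VALID` and `FAULT_VALUES` are hypothetical, and faults in the recovernz instruction itself are not modeled.

```python
from itertools import product

VALID = {1, 2, 3}                 # assumed addresses of blocks L1, L2, L3
FAULT_VALUES = [0, 1, 2, 3, 99]   # sample corrupted values; 0 and 99 are invalid addresses


def run(intended, fault=None):
    """Run the jump protocol once.

    fault is None or (point, register, value): a single corruption that is
    applied just before step number `point`.
    """
    regs = {"r2": 0, "ri": 0}

    def zap(point):
        if fault is not None and fault[0] == point:
            regs[fault[1]] = fault[2]

    zap(0)
    regs["r2"] = intended                      # movi r2, L
    zap(1)
    regs["ri"] = intended                      # intend L
    zap(2)
    here = regs["r2"]                          # jmp r2
    if here not in VALID:
        return ("hwerror", None)               # illegal address caught by hardware
    zap(3)
    regs["r2"] = here                          # movi r2, <label of block reached>
    zap(4)
    regs["r2"] = regs["r2"] - regs["ri"]       # sub r2, r2, ri
    zap(5)
    if regs["r2"] != 0:                        # recovernz r2
        return ("recover", here)
    return ("ok", here)


def undetected_errors(intended=2):
    """List every single fault that ends normally in the wrong block."""
    bad = []
    for point, reg, val in product(range(6), ["r2", "ri"], FAULT_VALUES):
        outcome = run(intended, (point, reg, val))
        if outcome[0] == "ok" and outcome[1] != intended:
            bad.append((point, reg, val))
    return bad
```

Under these assumptions the enumeration finds no run that ends normally in the wrong block. This is a sanity check on one example with a handful of fault values, not a substitute for a proof.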
It is important to observe that similar, but subtly different code sequences do not adequately protect against faults. In particular, optimizations like copy propagation, common subexpression elimination and some code motion transformations, are no longer semantics-preserving in the context of transient faults. For instance, the following simple change to the way block L1 was written above leads to a vulnerability.
Here, a single transient fault to r2 anywhere between execution of instructions (*) and (**) results in an uncaught control-flow fault as both the jump target and the intentions register will simultaneously be incorrect.
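The difference between the two ways of loading ri can be sketched in Python. This is a hedged illustration that assumes the vulnerable variant sets ri by copying r2 (as copy propagation might produce) rather than from the constant label; the function name `run_variant` and the integer addresses are hypothetical.

```python
VALID = {1, 2, 3}   # assumed addresses of blocks L1, L2, L3


def run_variant(intended, copy_from_r2, zap_r2_to=None):
    """Simulate one jump; optionally corrupt r2 between (*) and (**)."""
    r2 = intended                              # (*)  movi r2, L
    if zap_r2_to is not None:
        r2 = zap_r2_to                         # single transient fault to r2
    ri = r2 if copy_from_r2 else intended      # (**) copy of r2  vs.  intend L
    here = r2                                  # jmp r2
    if here not in VALID:
        return ("hwerror", None)
    diff = here - ri                           # checking code at the block reached
    return ("recover", here) if diff != 0 else ("ok", here)
```

With the constant-based intention, `run_variant(2, False, zap_r2_to=3)` reports a recovery; with the copied intention, `run_variant(2, True, zap_r2_to=3)` ends normally in block 3, because the replica inherited the fault.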
Likewise, the code motion transformation illustrated below shifts the move from a target block into the jumping block and creates a vulnerability.
L1: movi r2, L2
    intend L2
    movi r3, L2
    jmp r2
Lk: sub r3, r3, ri      (***)
    recovernz r3
    ...
Above, a fault to r2 causes a control-flow error, but testing r3 against ri at line (***) will not help detect the fault. The conclusions to draw from these examples are that the correctness properties of this code are indeed subtle and that verifying fault tolerance properties after the compiler has completed its suite of performance optimizations may help detect errors in code generation.
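A small Python sketch of the hoisted-move listing shows why the test at (***) is uninformative. It is illustrative only; `run_hoisted` and the integer addresses are assumptions, not part of the formal development.

```python
VALID = {1, 2, 3}   # assumed addresses of blocks L1, L2, L3


def run_hoisted(intended, zap_r2_to=None):
    """Simulate the code-motion variant with an optional fault to r2."""
    r2 = intended            # L1: movi r2, L2
    ri = intended            #     intend L2
    r3 = intended            #     movi r3, L2  (moved out of the target block)
    if zap_r2_to is not None:
        r2 = zap_r2_to       # fault to r2 just before the jump
    here = r2                #     jmp r2
    if here not in VALID:
        return ("hwerror", None)
    r3 = r3 - ri             # Lk: sub r3, r3, ri  (***) compares L2 with L2
    return ("recover", here) if r3 != 0 else ("ok", here)
```

Because r3 no longer records where control actually arrived, `run_hoisted(2, zap_r2_to=3)` ends normally in block 3 in this model.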
Conditional Branches. The protocol for handling conditional branches is slightly more involved than the case for jumps, but follows a similar pattern. We begin by assuming that the condition for the jump is held in registers r4 and r4'. These two registers must be independent replicas of one another. In other words, in the absence of faults, they should contain the same boolean value, and moreover, a fault to one should have no impact on the value of the other. Given this assumption (which will be verified by our type system), the following code sequence sets up a conditional branch, which may fall through to L2 or may jump to L3. The code uses a conditional branch brz r4, r3, which jumps to r3 if r4 is zero and otherwise falls through to L2. It also uses a conditional move cmovz r4', ri, r3', which moves the contents of r3' into ri if r4' is zero, and otherwise does nothing. (If the architecture does not have a cmovz instruction, a conditional branch and a move instruction can be used instead, but this branch will not be protected against faults.)
L1: ... // assumes r4 and r4' are independent replicas
    ...
    movi r3, L3
    movi r3', L3
    ...
    intend L2
    cmovz r4', ri, r3'
    brz r4, r3
L2: ...
    ...
L3: ...
Again, to notate the special role of ri and simplify the presentation, we will henceforth write the conditional move cmovz r4', ri, r3' as intendz r4', r3'. Intuitively, the intend instruction unconditionally sets the intentions register, whereas the intendz instruction conditionally sets the intentions register. The error-detection code in blocks labeled L2 and L3 is identical to the error-detection code discussed earlier for jumps, as it must be.
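The conditional-branch protocol can be exercised with a short Python sketch. It is a simplified model under stated assumptions: addresses are the integers 1 to 3 with L2 = 2 as the fall-through block and L3 = 3 as the branch target, a single fault is injected just before the exit code, and the names `run_branch` and `expected_block` are hypothetical.

```python
VALID = {1, 2, 3}   # assumed addresses: L1 = 1, L2 = 2 (fall-through), L3 = 3


def expected_block(cond):
    """Block that a fault-free run reaches for the given condition."""
    return 3 if cond == 0 else 2


def run_branch(cond, zap=None):
    """Run block L1; zap is None or (register, value), one fault before the exit code."""
    regs = {"r4": cond, "r4'": cond,      # independent replicas of the condition
            "r3": 3, "r3'": 3, "ri": 0}   # movi r3, L3 ; movi r3', L3
    if zap is not None:
        regs[zap[0]] = zap[1]
    regs["ri"] = 2                        # intend L2
    if regs["r4'"] == 0:                  # intendz r4', r3'
        regs["ri"] = regs["r3'"]
    here = regs["r3"] if regs["r4"] == 0 else 2   # brz r4, r3
    if here not in VALID:
        return ("hwerror", None)
    diff = here - regs["ri"]              # checking code at the block reached
    return ("recover", here) if diff != 0 else ("ok", here)
```

For instance, `run_branch(0, ("r4", 1))` falls through to block 2 while ri holds 3, so the checking code in block 2 reports a recovery.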
Summary
With just a few, well-thought-through instructions, it is possible to create a redundant copy of the intended destination of any control flow transfer prior to initiating the transfer itself. Moreover, at any control-flow target, it is possible to use that redundant copy to check that code has actually arrived at the proper place. However, as our examples illustrated, it is also easy to make slight errors in the process. In addition, since transient faults can occur at so many different places in the protocol and influence so many different bits of state, one needs proof to believe such a protocol will work. Hence, in the following sections, we make the machine's operational semantics and fault model precise and develop a sound type system strong enough to verify that the "good" instruction sequences we have discussed in this section are indeed fault tolerant.
The Control-Flow Machine
For clarity and elegance, we will work with a minimal assembly instruction set involving move (movi), subtraction (sub), jump (jmp) and conditional branch if zero (brz) instructions as well as the special macros intend, intendz and recovernz.
Instruction operands include constant values v and registers r. In the previous section, values were written unannotated, but from this point forward we annotate every value with a color (either green G, blue B or orange O). These colors have no operational significance, but they play a special role in the type system and proof of correctness. The only kind of value is an integer. In general, meta-variable n ranges over integers, but when we wish to emphasize that an integer will be used as an address, we use the meta-variable ℓ.
Instructions are grouped together in code blocks b. These blocks are always terminated by either a jump or a conditional branch instruction. Code memory C is a partial map from addresses to valid code blocks b. Addresses are ordered and we use the notation ℓ + 1 to refer to the address of the block following the block at ℓ. If a block at ℓ ends with a conditional branch, we assume ℓ + 1 inhabits the domain of C; in other words, conditional branches always have a block to fall through to.
The register file R is a mapping from registers to the colored values they contain. The registers include the intentions register ri and a number of general-purpose registers r1 through rn. We use the notation R(r) to denote the contents of r in R. We use the notation R[r → v] to denote a new register file R' created by updating R so it maps r to v. When we wish to refer to the unannotated integer n as opposed to the colored value c n in a register r in R, we use the notation R_val(r). Similarly, R_col(r) refers to the color annotating the value in r. An ordinary abstract machine state Σ is a tuple containing code C, history h, register file R and code block to be executed b. The history h is a sequence of labels. It records the code blocks visited during the current execution. In addition to ordinary abstract machine states, there are two special "final states." The state recover(h) represents a state in which a transient fault has occurred and has been caught. The labels in history h were visited during the execution. The state hwerror(h) represents a state in which a transient fault causes transition to an invalid address. Figure 2 summarizes the syntax of the assembly language and machine states.
[Figure 3. Operational Semantics]
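The machine states described above can be written down as plain data types. The following Python sketch is one possible encoding, not the paper's definition: the class and function names (`Value`, `State`, `Recover`, `HwError`, `r_val`, `r_col`, `update`) are hypothetical, and instructions are represented as opaque strings.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Color(Enum):
    GREEN = "G"
    BLUE = "B"
    ORANGE = "O"


@dataclass(frozen=True)
class Value:
    color: Color   # annotation only; no operational significance
    n: int         # the underlying integer


@dataclass(frozen=True)
class State:
    code: Dict[int, List[str]]   # C: address -> code block
    history: Tuple[int, ...]     # h: addresses of the blocks visited so far
    regs: Dict[str, Value]       # R: register name -> colored value
    block: List[str]             # b: code block currently being executed


@dataclass(frozen=True)
class Recover:
    history: Tuple[int, ...]     # fault occurred and was caught


@dataclass(frozen=True)
class HwError:
    history: Tuple[int, ...]     # fault caused a transition to an invalid address


def r_val(regs, r):
    """R_val(r): the unannotated integer stored in register r."""
    return regs[r].n


def r_col(regs, r):
    """R_col(r): the color annotating the value in register r."""
    return regs[r].color


def update(regs, r, v):
    """R[r -> v]: a new register file; the original is left untouched."""
    new_regs = dict(regs)
    new_regs[r] = v
    return new_regs
```

Returning a fresh dictionary from `update` mirrors the functional reading of R[r → v], in which the old register file R remains available.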
Dynamic Semantics
We model the dynamic semantics of the assembly language using a small step operational semantics. In general, the single step operational judgments have the form Σ −→_k F, where k, which is either zero or one, records the number of faults that occur during the step.
The Fault Model. The most interesting rules in the system are the rules modeling faults. The primary rule (zap-reg) simply states that the value in any register may be corrupted arbitrarily, though its color tag (which has no operational significance) remains unchanged.
The rule above may fire at any time. In particular, it may fire just prior to execution of a jump (jmp rt) or a branch (brz rz rt), corrupting the jump target in register rt. Such a fault models a control-flow error. Of course, it is equally possible that any other register is corrupted. For uniformity in our fault model, we also consider errors in execution of the recovernz rz instruction. Recall, this instruction is merely a macro for the conditional branch brz rz recover. However, since recover is a constant, it is unaffected by faults in registers modeled by the zap-reg rule (our other branching instructions take arguments in registers). To simulate a fault that causes control to jump somewhere other than the recover label when the rz register contains a non-zero value, we add the following rules.
The zap-recovernz1 rule expresses the possibility that a fault causes execution to jump to some random block labeled ℓ rather than the recovery code block. The zap-recovernz2 rule expresses the possibility that a fault causes control to jump to an illegal address. Attempted execution of code at this address results in immediate transition to the final state hwerror(h), where h represents the sequence of blocks visited not including the illegal address.
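The fault rules described in this subsection can be approximated in Python as follows. This is a sketch under assumptions rather than a transcription of the formal rules: register files map names to (color, integer) pairs, the function names are hypothetical, and any bookkeeping of the history h is left out.

```python
def zap_reg(regs, r, new_n):
    """zap-reg: corrupt the integer in register r, keeping its color tag.

    Returns the new register file and the number of faults in this step.
    """
    color, _old_n = regs[r]
    updated = dict(regs)
    updated[r] = (color, new_n)   # value changes arbitrarily, color is unchanged
    return updated, 1             # the step is indexed by k = 1


def faulty_recovernz(rz_value, code, fault_target):
    """Outcome of `recovernz rz` when a fault redirects the branch."""
    if rz_value == 0:
        raise ValueError("the zap-recovernz rules apply only when rz is non-zero")
    if fault_target in code:
        return ("block", fault_target)   # zap-recovernz1: lands in some valid block
    return ("hwerror",)                  # zap-recovernz2: illegal address
```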
Other Operational Rules. All other operational rules are presented in Figure 3 . The majority of these rules are quite unsurprising. For instance, the movi rule implements the move by updating the register file. Notice that the index on the arrow is "0" indicating no fault occurs during this transition. Naturally, the intend rule is very similar to movi as intend is just a macro for a move into ri.
Skipping to the bottom of the figure, it is important to notice there are two rules for expressing the semantics of a jmp rt instruction. The jmp rule fires whenever rt contains the address of a valid block. Of course, due to a fault earlier in execution, the address in rt may not be the intended destination for this jump. In addition to transferring control to the new block, this instruction does some bookkeeping. In particular, it extends the current history with the destination address and it changes the color of ri to be orange. The latter effect facilitates the proof of correctness and will be explained in further detail in Section 4. The second rule for jmp semantics fires whenever rt does not contain the address of a valid block. In this case, there is an attempt to transfer control to an illegal address, which is caught by the hardware. The rules for conditional branches follow a similar pattern to those for the unconditional jumps.
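The two jmp rules described above might be rendered in Python roughly as follows. The representation is an assumption made for illustration: `code` maps addresses to blocks, `history` is a tuple of addresses, register files map names to (color, integer) pairs, and `step_jmp` is a hypothetical name. The sketch also assumes that the history reported by hwerror excludes the illegal address, as stated for the recovernz case.

```python
def step_jmp(code, history, regs, rt):
    """Execute `jmp rt` from the given machine state components."""
    target = regs[rt][1]
    if target not in code:
        # jmp-hw-error: the hardware catches the transfer to an illegal address
        return ("hwerror", history)
    new_regs = dict(regs)
    new_regs["ri"] = ("O", regs["ri"][1])    # ri is recolored orange, value kept
    new_history = history + (target,)        # record the block being entered
    return ("state", code, new_history, new_regs, code[target])
```

Note that the rule does not consult ri at all; whether the destination was the intended one is decided later by the checking code of the block that was reached.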
Typing
The design of the type system is based on three main concepts:
• Classifying the reliability properties of values.
• Using abstract types to make sure that the fault tolerance protocol proceeds in the correct order, with no steps omitted or inappropriate steps inserted.
• Equivalence checking to ensure that redundant values act as proper backups to the original.
The following paragraphs explain the main intuitions behind each concept.
Classifying the Reliability Properties of Values. Since faults occur completely unpredictably and at run time, it is not possible for the type system to know which values have incurred faults or to track the propagation of presumed faulty values precisely. Consequently, as is usual, the type system will have to approximate these properties somehow. It does so by assigning each value to one of several compile-time "groups" and ensuring that each member of a group has related reliability properties. As a mnemonic, each group has an associated color c, which may be either green, blue or orange.
Most values either belong to the green group or to the blue group. These two groups have the property that they are mutually independent, unless a control-flow fault occurs. In other words, a fault in a green value can never percolate to a blue value and vice versa. Consequently, when no control-flow fault has occurred and corresponding green and blue values are compared, at least one of them must be correct. This mutual independence property is ensured by a series of simple checks in the type system that guarantee that green values are not used to construct blue values and vice versa.
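To give a flavor of such a check, the following Python sketch rejects any operation whose operands mix colors. It is a deliberately simplified stand-in and not one of the paper's typing rules; the function name `result_color` and the single-letter color strings are assumptions.

```python
def result_color(operand_colors):
    """Return the common color of the operands, or raise if colors are mixed."""
    colors = set(operand_colors)
    if len(colors) != 1:
        raise TypeError(f"operands mix colors {sorted(colors)}")
    return colors.pop()
```

Under this discipline a value computed from green operands is green, so a fault in a green register cannot reach a blue result through ordinary data flow.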
But what if a control-flow fault has occurred? In that case, almost all program invariants are invalidated, including any properties of either blue or green values. Fortunately, though, the defining characteristic of orange values is preservation of their properties in just this situation.
There are two general mechanisms by which one can guarantee orange values maintain their expected properties in the face of a control-flow fault. The first mechanism is to ensure that the orange value in question is not live across the control-flow transfer: If the value has been constructed in the current block and does not depend upon values in previous blocks, a control-flow error will not influence its properties. This first mechanism is used in the checking code at the beginning of each program block. In particular, the operation that moves a label into a register at the beginning of a block may label its results orange:
Lk: movi r2, Lk // r2 is orange ... sub r2, r2, ri ... recovernz r2 ...
The second mechanism involves ensuring that every possible control-flow transfer maintains the invariant in question. If the invariant is true across every control-flow transfer, then it is true no matter where control winds up. This second mechanism is used to classify the contents of ri as orange across every control-flow transfer. Just as the type system isolates green values from blue and blue from green, orange is also isolated from the other two. Again, the purpose is to avoid having a fault in one color influence the others.
While values are classified using colors, entire machine states are classified using a related concept called zap tags. Intuitively, each zap tag specifies which colors may no longer be trusted. For example, if zap tag Z is empty (written "·"), then there have been no faults during the computation, and all values, no matter what their color, satisfy the standard invariants associated with their compiletime type. On the other hand, if Z is a color c, then there has been a fault to a value colored c and, moreover, the corruption may have spread to any other value colored c. Consequently, values colored c will not necessarily satisfy any particular properties associated with their compile-time type.
The final zap tag CF classifies machine states after a control-flow error has occurred. In this case, control may have transferred somewhere totally unexpected, and so we know nothing about green or blue values. Fortunately, though, the properties of orange values remain valid. Figure 5 summarizes the properties that hold under each zap tag while in block ℓ. We say a value is trusted if it satisfies standard canonical forms properties (e.g., a value with code type is actually a pointer to valid code). We say a value is untrusted when we cannot guarantee standard canonical forms properties hold.
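The relationship between zap tags and trusted colors described above can be tabulated in a few lines of Python. The encoding is an assumption made for illustration: the empty zap tag is written "." and the function names are hypothetical.

```python
COLORS = {"G", "B", "O"}


def trusted_colors(zap):
    """Colors whose values may still be trusted under the given zap tag."""
    if zap == ".":           # no fault has occurred
        return set(COLORS)
    if zap == "CF":          # control-flow error: only orange values survive
        return {"O"}
    if zap in COLORS:        # fault to a value of color `zap`
        return COLORS - {zap}
    raise ValueError(f"unknown zap tag: {zap!r}")


def is_trusted(color, zap):
    return color in trusted_colors(zap)
```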
We say a zap tag Z is a subtype of another zap tag Z', written Z ≤ Z', when the values in machine states classified by Z are more trusted than the values in machine states classified by Z'. Hence the empty zap tag is a subtype of all other zap tags, and both the B and G zap tags are subtypes of CF.
Typing Protocol Stages. The instructions in each block can be thought of as being divided into three distinct stages: the checking code, the block body, and the exit code. Each of these stages has its own distinct invariants. The type of the intentions register ri encodes the current stage and ensures that the stages occur in the correct order. It also guarantees that no part of the protocol can be omitted and no inappropriate instruction added. These stages may be summarized as follows.
1. The checking code compares the intended target with the current location to determine if there has been a control-flow fault. In this region, ri must be colored orange and have basic type check.
2. In the block body, we already know that control flow correctly transferred to this block. At the end of this sequence, there is some green register that holds the target label for the next control-flow transfer and some blue register that holds the duplicate copy of this label. In the absence of faults, these two values are equal. In this region, ri must be colored blue and have basic type ok.
3. The exit code sequence sets the intended target and transfers control to the new block. In the exit code sequence, ri is colored blue and has type go when an intention has been set, and type goz when a conditional intention has been set.
For example, consider the code sequences from Section 2 shown in Figure 6. On entry, each block first checks that control has reached this block correctly, and it sets its intentions before transferring control to another block.
Testing Value Equivalence.
There are many places in the fault tolerance protocol where we require a blue value to be an independent and redundant copy of a green value. To ensure that blue and green values are equal in the absence of faults, we characterize them accurately using a language of static expressions. Moreover, many of the typing rules require that corresponding expressions are equal.
Onward. Now that we have summarized the intuitions behind the main concepts, we will proceed with the technical details.
Value Typing
The type of a value is a triple ⟨c, τ, e⟩. The color c is assigned according to the intuitions expressed in the previous subsection. A basic type τ is either an integer, a code type, or a special type that indicates the state of the fault tolerance protocol. The third component e is a static expression that describes the value in more detail. These expressions are used to require that blue and green computations compute identical results in the absence of faults. These expressions include variables x, integers n, subtraction e1 − e2, and conditional expressions e1 ? e2 : e3, which equal e2 when e1 is non-zero and e3 when e1 is zero. The judgment ∆ ⊢ e : κ holds when all variables in e are contained in ∆. The judgments ∆ ⊢ e1 = e2 and ∆ ⊢ e1 ≠ e2 hold when the relation holds for all substitutions of the variables in ∆. The judgment ∆ ⊢ S : ∆′ holds when S provides substitutions for all variables in ∆′, and the substituted expressions are well-formed in ∆.
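The static expression language is small enough to sketch directly. The following is a minimal illustration, assuming a Python encoding that is not part of the paper: the class names `Var`, `Int`, `Sub`, and `Cond` and the helpers `free_vars`, `well_formed`, and `evaluate` are hypothetical. It covers well-formedness (all variables contained in ∆) and evaluation under one substitution; it does not decide equality for all substitutions, which the judgments ∆ ⊢ e1 = e2 require.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Int:
    value: int


@dataclass(frozen=True)
class Sub:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Cond:
    test: "Expr"
    if_nonzero: "Expr"  # e2, chosen when the test is non-zero
    if_zero: "Expr"     # e3, chosen when the test is zero


Expr = Union[Var, Int, Sub, Cond]


def free_vars(e):
    """Collect the variable names occurring in expression e."""
    if isinstance(e, Var):
        return {e.name}
    if isinstance(e, Int):
        return set()
    if isinstance(e, Sub):
        return free_vars(e.left) | free_vars(e.right)
    return free_vars(e.test) | free_vars(e.if_nonzero) | free_vars(e.if_zero)


def well_formed(delta, e):
    """Well-formedness check: every variable of e appears in delta."""
    return free_vars(e) <= set(delta)


def evaluate(e, subst):
    """Evaluate e under a substitution mapping variable names to integers."""
    if isinstance(e, Var):
        return subst[e.name]
    if isinstance(e, Int):
        return e.value
    if isinstance(e, Sub):
        return evaluate(e.left, subst) - evaluate(e.right, subst)
    if evaluate(e.test, subst) != 0:
        return evaluate(e.if_nonzero, subst)
    return evaluate(e.if_zero, subst)
```

For instance, the expression (x − 3) ? 10 : 20 evaluates to 20 when x is 3 and to 10 for any other value of x.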
Value Typing Judgment. The value typing judgment has the form ∆; Ψ ⊢Z v : t and is shown in Figure 7. The context ∆ contains free expression variables, and the heap type Ψ maps integer addresses to basic types. The zap tag Z characterizes the current state of the machine as explained earlier. Z is always the empty tag when a user checks a program at compile time. It only takes on other values at run time for the purposes of the proof of preservation.
The main value typing judgment depends upon an auxiliary judgment with the form Ψ ⊢ n : τ. This auxiliary judgment allows integer n to be given either a basic int type, a stage description type ρ, or a code type Ψ(n). If e is equal to n and Ψ ⊢ n : τ, then c n can always be given the type ⟨c, τ, e⟩. However, if the zap tag Z is a color c, then all values c n can also be typed using any basic type and any well-formed expression. Such a general rule reflects the fact that we can make no guarantees about such values. When the zap tag is CF, any green or blue value can be given any type, including giving green values blue types and vice versa. In other words, as mentioned earlier, when there has been a control-flow fault, all bets are off for green and blue values.
Value Subtyping. There is also a subtyping relationship ∆ ⊢ t ≤ t′. As an example, this judgment allows type ⟨c, τ, e⟩ to be a subtype of ⟨c, int, e′⟩ whenever ∆ ⊢ e = e′. Please see the online appendix for further details [14].
Instruction Typing
Figure 8 presents the instruction typing judgment, which has the form ∆; Ψ; Γ ⊢ i : Γ′. As before, ∆ contains free expression variables and Ψ types heap addresses. Γ acts as the precondition for the instruction, mapping registers to their corresponding types prior to execution of the instruction. Γ′ acts as the postcondition for the instruction, mapping registers to the types guaranteed after execution of the instruction.
The simplest instruction to type check is the movi rd c n instruction. It merely updates the type of rd to be ⟨c, int, n⟩. The subtraction instruction sub rd ra rb requires that the values being subtracted are integers. Notice that it also requires the integer arguments to have the same color as the result. This restriction prevents faults in values of one color from influencing values of another. These two instructions place no restrictions on the type of ri, so they can occur during any stage of a block.
The unconditional intention instruction intend rt requires that ri has basic type ok. This restriction guarantees any new intend will occur after the checking code has been completed. Intentions are part of the blue computation, so the register that is used to set the intention must contain a blue value with code type. The type of ri is updated to reflect the new static expression and the new stage go.
The conditional intention instruction intendz rz rt is similar, although it must occur after an unconditional intention. In other words, to set intentions for a conditional branch, first use intend to set ri to contain the address of the fall through block, and then conditionally set it to contain the branch target. The resulting type of ri has basic type goz and a conditional expression guarded by the expression describing rz. If rz is nonzero, then ri will be described by ei, which describes the fall through branch. Otherwise, it is described by et, which describes the branch target.
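The effect of the two intention instructions on ri can be sketched operationally. This is a hedged illustration inferred from the typing description above, not the paper's operational semantics: the functions `intend` and `intendz`, the dictionary register file, and the register names `rf`, `rt`, and `rz` are assumptions, and colors are ignored.

```python
def intend(regs, rt):
    """Unconditional intention: copy the target held in rt into ri."""
    updated = dict(regs)
    updated["ri"] = updated[rt]
    return updated


def intendz(regs, rz, rt):
    """Conditional intention: overwrite ri with the branch target only
    when rz is zero; otherwise keep the fall-through intention."""
    updated = dict(regs)
    if updated[rz] == 0:
        updated["ri"] = updated[rt]
    return updated


def set_branch_intention(regs, fall_through_reg, rz, target_reg):
    """Exit code for a conditional branch: intend, then intendz."""
    return intendz(intend(regs, fall_through_reg), rz, target_reg)
```

With a fall-through label of 7 and a branch target of 12, ri ends up holding 12 when rz is zero and 7 otherwise, which mirrors the conditional expression guarded by the expression describing rz.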
Despite the fact that recovernz is syntactically an instruction, it is type-checked using the block typing judgment because it affects the set of free expression variables.
Block Typing
The block typing judgment ∆; Ψ; Γ; σ; ei; τopt ⊢ b contains a number of new pieces of information. In addition to ∆, Ψ, and Γ, the block typing judgment is parameterized by a sequence σ, an expression ei, and a type option τopt.
[Figure 9. Block Typing Judgment. Rules shown include recovernz-eq-t, recovernz-neq-t, and jmp-t.]
The sequence σ contains a list of expressions that describe the locations in the current history h. While typing a block at location ℓ, σ has the form xh • ℓ, meaning that the program has already visited some unknown sequence of locations (xh) leading up to this point and that the label of the current block is ℓ. The judgment ∆ ⊢ σ1 = σ2 holds when each expression in σ1 is equal to the corresponding expression in σ2 for all substitutions of the variables in ∆.
The expression ei describes the intended target when the transfer occurred to the current label ℓ. If control flow correctly transferred to ℓ, then ∆ ⊢ ei = ℓ.
The option type τopt contains the type of the label ℓ + 1 if such a label exists. It is used when a branch falls through to the subsequent block to determine the type of that block.
The block typing rules are presented in Figure 9 . The first rule, sequence-t, is used when the first instruction in a block is one of the basic instructions described previously. Descriptions of the other rules follow.
Recovery. There are three distinct rules for checking recovernz rz. All of them require the instruction to occur in the first stage of the block, when ri contains an orange value with basic type check. The operand register rz holds the result of comparing this value to the current label. The first rule, recovernz-t, applies when ri is described by variable xi. This is the rule used by a programmer to check the correctness of their program at compile time. Control only proceeds past this point in the block if xi is equal to the expression eℓ, which describes the current location, so the remainder of the block is typed by substituting eℓ for xi. The types of ri and rz are updated to reflect the deletion of xi. The judgments ∆ ⊢ Γ/ri/rz wf and ∆ ⊢ σ wf hold when all variables used in registers other than ri and rz, as well as the expressions in σ, are contained in ∆. Since none of these pieces of state contain xi, they do not need to be modified.
The other two rules, recovernz-eq-t and recovernz-neq-t, are needed to carry out the proof of type preservation (particularly the substitution lemma), but would never be used to type check programs prior to execution. In these situations, xi has already been replaced with a closed expression ei that describes the intentions register at block entry. Here, it is evident that either · ⊢ ei = eℓ or not, so there is one typing rule for each situation. The rule recovernz-neq-t does not place any requirements on the remainder of the block since control does not proceed past this point.
Control Flow Transfers.
In order to verify unexpected transfers from the end of one block to the beginning of any other, code blocks must have the same basic precondition. To be specific, each block must expect that the intentions register ri contains an orange value with basic type check that is described by a variable xi. This variable does not occur anywhere else in the function precondition. This condition entails that every target block can accept any orange value in ri.
The rule jmp-t requires that ri has type ⟨B, go, e′t⟩, specifying that the intention must already have been set before the jump. Also, the current jump target has a code type and is described by an expression et that is equal to e′t. This enforces that, in the absence of faults, the duplicate target is equal to the target.
The target label precondition contains a set of expression variables ∆t and requires a register file described by Γt and a history described by σt. There is some substitution St for the variables in ∆t so that the current register file type and sequence are subtypes of those required by the target. (The register file subtyping judgment is a straightforward extension of the value subtyping judgment.)
The jmp rt and brz rz rt instructions recolor the blue intention register to be orange when control is transferred to a new block. At first, this seems to contradict the rule that faults to a value of one color should never corrupt values of other colors. However, because the target block does not place any restrictions on the expression describing ri, the variable xi that describes the value can be instantiated with the value itself. Because of this, a blue value that is not trusted can become a trusted orange value during a control-flow transfer, continuing to leave only the blue values untrusted.
The rule brz-t is similar, but adds in the conditional register rz and specifies both the fall through and the branch cases.
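The ordering that the type of ri imposes on a block can be sketched as a small stage checker. This is an approximation under stated assumptions rather than the type system itself: it looks only at opcodes, ignores colors, expressions, and operands, and assumes that brz requires stage goz, which the text suggests but does not spell out. The names `STAGE_RULES`, `TRANSFER_RULES`, and `stages_in_order` are hypothetical.

```python
# Instructions that place no restriction on ri and may occur in any stage.
STAGE_NEUTRAL = {"movi", "sub"}

# opcode -> (stage required of ri beforehand, stage of ri afterwards)
STAGE_RULES = {
    "recovernz": ("check", "ok"),  # end of the checking code
    "intend": ("ok", "go"),        # unconditional intention
    "intendz": ("go", "goz"),      # conditional intention after intend
}

# control-flow transfer opcode -> stage required of ri (brz is an assumption)
TRANSFER_RULES = {"jmp": "go", "brz": "goz"}


def stages_in_order(opcodes):
    """Return True when the opcodes of one block respect the protocol
    stages and the block ends with a control-flow transfer."""
    stage = "check"  # every block starts in the checking stage
    for index, op in enumerate(opcodes):
        if op in STAGE_NEUTRAL:
            continue
        if op in TRANSFER_RULES:
            is_last = index == len(opcodes) - 1
            return is_last and stage == TRANSFER_RULES[op]
        if op not in STAGE_RULES or STAGE_RULES[op][0] != stage:
            return False
        stage = STAGE_RULES[op][1]
    return False  # the block never transferred control
```

A block such as `["movi", "sub", "recovernz", "movi", "intend", "jmp"]` passes, whereas a block that jumps without setting an intention, or that sets an intention before the check has completed, is rejected.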
Machine State Typing
Code Memory Typing. The judgment ⊢ C : Ψ describes the invariants for code memory. As described previously, all blocks
Register File Typing. The judgment Ψ ⊢Z R : Γ states that register file R has type Γ under zap tag Z given heap typing Ψ. It holds when each register in R has the corresponding type in Γ under Z. Again, values with colors that are affected by Z are not trusted to have their given types.
History Typing. A history h is described by sequence σ when each location is equal to the corresponding expression.
Machine State Typing. A machine state Σ is well-typed under zap tag Z when each of its elements is well-typed, and two additional invariants hold. (1) If Z is CF, then the current location ℓ is not equal to the intended location ei. Otherwise, if Z is not CF, then these two are equal. (2) If the current block b has proceeded past the checking stage, then it must be the case that ℓ is equal to ei. These two invariants together imply that it is not possible for code past the checking stage of a block to be well-typed under the CF zap tag. Consequently, a proof of type preservation will imply that any control-flow error will be caught in the checking stage of the next block.
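The run-time behavior behind these invariants can be sketched with a tiny simulation of the checking code. It is an illustration only: the functions `checking_code` and `run_transfer` are hypothetical, labels are plain integers, colors are omitted, and the comparison is modeled as a subtraction followed by a test for non-zero, following the description of the checking stage.

```python
def checking_code(current_label, ri):
    """Compare the intended target in ri with the label of the block
    that control actually reached."""
    difference = current_label - ri   # zero exactly when they agree
    if difference != 0:
        return "recover"              # control-flow error detected
    return "ok"                       # proceed to the block body


def run_transfer(intended_label, actual_label):
    """Model one control-flow transfer: the exit code records the
    intention, then control lands at actual_label, which differs from
    intended_label only if a control-flow fault occurred."""
    ri = intended_label
    return checking_code(actual_label, ri)
```

A fault-free transfer, `run_transfer(3, 3)`, proceeds past the checking stage, while a transfer that lands in the wrong block, `run_transfer(3, 5)`, is diverted to recovery before the block body runs.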
Type Safety
We have proven that the TALCF type system is sound using the standard notion of Progress and Preservation. Progress asserts that machine states well-typed under the empty zap tag can take a step to another ordinary machine state. States that are well-typed under any zap tag can also take a step, but this step may reach any state, including recover(h) or hwerror(h).
Preservation states that execution preserves typing. States well-typed under the empty zap tag continue to be so after taking a non-faulty step. States typed under any zap tag also remain well-typed after a non-faulty step, but the zap tag may escalate to a supertype. If a state is well-typed under the empty zap tag and takes a faulty step, then the resulting state is well-typed under some color c.
The zap tag may be elevated at control flow transfers. A zap tag of B or G becomes CF whenever the corruption has spread to the operands being used in the transfer. This way the block that results from the transfer can be well-typed under CF even when control has transferred to a totally unexpected block. The intentions register is always the only orange value that is live across control flow transfers, and we have already seen that it is well-typed even when a control fault has occurred.
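The zap tag ordering and its escalation at control-flow transfers can be sketched as follows. This is a hedged illustration: the string encoding of tags and the functions `is_subtype` and `escalate_at_transfer` are assumptions, reflexivity of the ordering is assumed, and the boolean flag stands in for the condition that corruption has spread to the operands used in the transfer.

```python
ZAP_TAGS = (".", "G", "B", "CF")  # "." is the empty zap tag


def is_subtype(z1, z2):
    """Z1 <= Z2: the empty tag is below every tag, and G and B are
    both below CF. Reflexivity is assumed."""
    if z1 == z2 or z1 == ".":
        return True
    return z1 in ("G", "B") and z2 == "CF"


def escalate_at_transfer(zap_tag, transfer_operands_corrupted):
    """A G or B tag becomes CF when the corruption has reached the
    operands used in the control-flow transfer."""
    if zap_tag in ("G", "B") and transfer_operands_corrupted:
        return "CF"
    return zap_tag
```

In this sketch escalation always moves to a supertype: for every tag Z, `is_subtype(Z, escalate_at_transfer(Z, flag))` holds for either value of the flag.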
Fault Tolerance Theorem
In this section, we first present a handful of definitions relating machine states to other states, and then use these definitions to formally state and prove the Fault Tolerance Theorem.
Machine State Simulation
We say that a faulty value simulates a fault-free value under color c if the values are equal when they are not colored by c. A faulty machine state Σf simulates a fault-free state Σ if they are identical modulo the values in registers colored c.
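The simulation relation on register files can be sketched as below. The representation is an assumption made for illustration: a value is a (color, integer) pair, a register file is a dictionary, and the functions `value_simulates` and `regfile_simulates` are hypothetical names. Requiring the two values to carry the same color is also an assumption of this sketch.

```python
def value_simulates(c, faulty_value, clean_value):
    """A faulty value simulates a fault-free value under color c when
    they agree, unless the value is colored c."""
    faulty_color, faulty_n = faulty_value
    clean_color, clean_n = clean_value
    if faulty_color != clean_color:   # assumption: colors must match
        return False
    return faulty_color == c or faulty_n == clean_n


def regfile_simulates(c, faulty_regs, clean_regs):
    """Register files are related when they have the same registers and
    every pair of corresponding values is related under c."""
    if faulty_regs.keys() != clean_regs.keys():
        return False
    return all(
        value_simulates(c, faulty_regs[r], clean_regs[r]) for r in clean_regs
    )
```

For example, a register file whose only difference from the fault-free one lies in a blue register is related to it under color B but not under color G.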
Program Execution
In order to reason about program execution, we extend the single step relation Σ −→k Σ′ from Section 3 to create two multistep judgments. The judgment Σ ; k F states that F is the result of executing the current block of Σ while incurring k faulty transitions. Execution proceeds up to the control-flow transfer statement at the end of the current block, or to the recover state if the block terminates prematurely by transitioning to recovery code. For example, if Σ = (C, h, R, i1; ...; in; jmp rt), then either F = (C, h, R′, recover(h)) or F = (C, h, R′, jmp rt).
The judgment Σ =⇒ h k F states that machine state Σ executes through a sequence of blocks h to reach state F while incurring k faulty transitions. In other words, if Σ = (C, h1, R, b), then F is either (C, (h1, h), R′, b′), hwerror(h1, h), or recover(h1, h).
The Fault Tolerance Theorem
A program is fault-tolerant if a faulty version of the program behaves in one of four possible ways with regards to the original, nonfaulty computation. The first possibility involves the faulty computation visiting the same sequence of blocks as the original and the final faulty state simulating the original result state under some color c. The second possibility involves the faulty computation attempting to transfer control to an invalid address outside the domain of code memory and triggering a hardware fault. Prior to the occurrence of the hardware fault, the faulty computation will have visited the same blocks as the original computation. The third possibility involves the faulty computation detecting a fault in software and jumping to recovery code even though no incorrect blocks have been visited. This situation can be caused by a fault affecting the intentions register or the checking code. And finally, the last possibility involves the faulty computation veering off course to a block that does not match the corresponding block in the original computation. In this case, the checking code in the invalid block catches the error and transfers control to the recovery code. The full proof appears in the online appendix [14] .
recover(h , h f ) and h f = (h1, l ) and h = (h1, l, h2)
Translation
In order to show that TALCF is sufficiently expressive to be of interest, we define a simple language of while loops, and show how to compile statements in this language into well-typed TALCF programs.
A Simple While Loop Language
The while loop language statements consist of assignment, subtraction, if statements, while loops, and sequences of statements. As all the variables in this language contain integers, the well-formedness judgment X ⊢ s simply enforces that all variables in s exist in the variable context X.
| if0 xz then s1 else s2 | while xz = 0 do s
Translating Statements
A 4-tuple of objects (L, C, i, ℓ) is used to track the code generated during the translation. C is the code memory that contains all blocks generated so far. L contains labels that may be referred to by blocks in C but whose corresponding blocks have not yet been generated. ℓ is the label that will be assigned to the block that is currently being generated. i contains the list of instructions for this block that have been generated so far.
with the translation of statement s. The free variables of s must be a subset of X. If the context X contains variables x1, ..., xn, then the translation requires 2n + 4 registers: a green copy rk and a blue copy r′k for each variable xk, the intention register ri, and three temporary registers tg, tb, and to. We have made no effort to optimize this translation; it merely serves to demonstrate the theoretical expressiveness of the target language.
[Figure 11. Translation of While Programs. Rules shown include t-while.]
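The register budget of the translation can be sketched as a simple allocation. The concrete register names produced here (`r1`, `r1'`, and so on) and the function `allocate_registers` are hypothetical; only the count of 2n + 4 and the roles of ri, tg, tb, and to come from the text.

```python
def allocate_registers(variables):
    """Assign one green and one blue register to each source variable,
    plus the intention register and three temporaries."""
    green = {x: f"r{k}" for k, x in enumerate(variables, start=1)}
    blue = {x: f"r{k}'" for k, x in enumerate(variables, start=1)}
    special = ["ri", "tg", "tb", "to"]
    return green, blue, special


def register_count(variables):
    """Total number of registers used for the given variable context."""
    green, blue, special = allocate_registers(variables)
    return len(green) + len(blue) + len(special)
```

For a context with three variables, the sketch allocates 2 · 3 + 4 = 10 registers.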
The translation rules and the statement of the Translation Theorem make use of the following macros that implement the protocol from Section 2.
check ≡ movi to O ℓ; sub to to ri; recovernz to
movi tg G t; brz rz tg
A subset of the translation rules is shown in Figure 11. Translating assignment statements simply adds assembly instructions to the end of the current instruction sequence. Sequencing two statements uses the partial translation from the first statement to translate the second. Translating while statements requires the addition of new blocks. The current block is terminated with an unconditional jump to a beginning block b that tests the condition and branches to an ending block e if the condition fails. Otherwise it falls through to the block s, which contains the translation of s and terminates with a jump back to the beginning block. The function NumBlock(s) calculates the number of blocks generated by the translation of s. The remaining rules are fully defined in the online appendix [14].
The Translation Theorem
To translate a statement s as a stand-alone program, it is translated as in the previous section with 1 as the starting label. Because there is no halt statement in TALCF , code is added to the last block in the translation to create an infinite loop. The function InitRegFile(X) creates an initial register file that maps each register used to translate X to 0.
The assembly language program corresponding to s is the TALCF state consisting of the generated code memory, a history with only the first location, an initial register file, and code to jump to the first label in code memory. If the original statement is well-formed, then the translation is well-typed.
Theorem 4 (Translation
where
Related Work
There is a long history of research into techniques for delivering fault tolerance in the presence of transient faults. What sets the current work apart from the vast majority of the literature is the use of a provably sound type system to verify reliability properties of low-level code. As mentioned in the introduction, this research follows previous work on λzap [21] and TALFT [13]. However, neither λzap nor TALFT provided software mechanisms for guaranteeing control-flow integrity. Recently, Elsman [6] has shown how to extend λzap so that the atomic voting operations can be broken down into a series of conditional statements. However, again, there is no treatment of control flow. Perhaps the most closely related work to the current paper is CFI, a provably sound technique for enforcing control-flow integrity in a security context [1, 2]. The goal of CFI is to guarantee that machine code obeys a predefined "control-flow policy" that constrains the sequence of blocks control can move through. The key distinction between CFI and our own work is the threat model, which makes all the difference. CFI attackers can modify arbitrary amounts of machine state in arbitrary ways; this sort of attacker models the threat posed by buffer-overflow vulnerabilities effectively. However, CFI attackers cannot touch three reserved registers during the execution of certain code sequences. Protecting against transient faults is, on the one hand, easier, because the attacker can only modify a single value as opposed to arbitrary amounts of state arbitrarily many times, but, on the other hand, more difficult, because no single bit of state can be a priori guaranteed to be protected. On balance, it appears that having just that tiny bit of protected state makes the solution and proof of correctness of the CFI problem simpler than the corresponding fault tolerance problem.
For instance, the CFI checker can be defined as a relatively straightforward series of context-insensitive conditions on the code; there is no need for a sophisticated type system. It is also the case that the structure of the desired theorems is somewhat different. In the case of CFI, the running code must satisfy a security policy specified as a control-flow graph. In our case, the desired end result is a simulation theorem that guarantees that every faulty run of the program is properly related to the non-faulty run.
Our work builds upon many past research efforts in fault tolerance, particularly those that deal with control-flow checking. For example, Oh et al. [11] developed a pure software control-flow checking scheme (CFCSS) wherein each control transfer generates a run-time signature that is validated by error checking code generated by the compiler for every block. The SWIFT system [16], another software-only fault tolerance system, also uses signature checking very much like that in the current paper. Venkatasubramanian, Hayes and Murray [20] proposed a technique called Assertions for Control Flow Checking (ACFC) that assigns an execution parity to each basic block and detects faults based on parity errors. Schuette and Shen [18] explored control-flow monitoring (ARC) to detect transient faults affecting the program flow on a Multiflow TRACE 12/300 machine with little extra overhead. Ohlsson and Rimen [12] developed a technique to monitor software control-flow signatures without building a control-flow graph. However, this latter technique requires additional hardware: a coprocessor is used to dynamically compute the signature from the running instruction stream, and a watchdog timer is used to detect the absence of block signatures. The distinguishing feature of our research is not the control-flow checking procedure itself, but the type system we designed to verify the code and our proof that well-typed programs are indeed fault tolerant. These previous efforts did not rigorously specify the properties they intended to enforce, nor did they prove that their techniques actually enforce them. Naturally, our research also builds upon previous work in the verification of low-level code, including the original typed assembly language (TAL) [8] and proof-carrying code (PCC) [9]. However, both TAL and PCC operate under the assumption of non-faulty hardware and therefore ignore the major issues of reliability on which this paper has focused.
Conclusion
Current trends in hardware design including increased transistor density, decreased voltages and increased clock rates are decreasing the reliability of modern processors. While these effects are currently limited, for the most part, to high-end clusters and supercomputing facilities, they pose a broader threat to future systems. One way to counter this trend is to shift some of the burden for reliability into software. However, reasoning about the correctness of software running on faulty hardware is an extremely difficult task, particularly when faults may affect program control flow.
In this paper, we defined a simple abstract machine that exhibits control-flow faults and we analyzed the correctness of a software protocol for detecting them. Our analysis proceeded through the definition of a type system that guarantees programs are reliable relative to a simple fault model. From a theoretical perspective, the type system serves as a tool for reasoning about the correctness of faulty programs. From a practical perspective, it may be implemented and used as a debugging tool in compilers that purport to generate reliable code. We have rigorously proven strong reliability properties for our type system and have shown it is sufficiently expressive to serve as the target for compilation of a simple language of while programs. Overall, we believe this is the first successful attempt at reasoning rigorously about software mechanisms for controlling control flow faults.
