A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal transfers. In this paper, we propose a new scheme for provably safe and reliable computing in the presence of transient hardware faults. In our scheme, software computations are replicated to provide redundancy while special instructions compare the independently computed results to detect errors before writing critical data. In stark contrast to any previous efforts in this area, we have analyzed our fault tolerance scheme from a formal, theoretical perspective. To be specific, first, we provide an operational semantics for our assembly language, which includes a precise formal definition of our fault model. Second, we develop an assembly-level type system designed to detect reliability problems in compiled code. Third, we provide a formal specification for program fault tolerance under the given fault model and prove that all well-typed programs are indeed fault tolerant. In addition to the formal analysis, we evaluate our detection scheme and show that it only takes 34% longer to execute than the unreliable version.
Introduction
A transient fault or soft error is a temporary hardware failure that alters a signal transfer, a register value, or some other processor component. While transient faults are temporary, they corrupt computations and have led to costly failures in high-end systems in recent years. For example, in 2000 there were reports that transient faults caused crashes at a number of Sun's major customer sites, including America Online and eBay [2]. Later, Hewlett Packard admitted multiple problems in the Los Alamos Labs supercomputers due to transient faults [7]. Finally, Cypress Semiconductor has confirmed "The wake-up call came in the end of 2001 with a major customer reporting havoc at a large telephone company. Technically, it was found that a single soft fail. . . was causing an interleaved system farm to crash" [28].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PLDI'07 June 11-13, 2007
Unfortunately, while soft errors can already cause substantial reliability problems, current trends in hardware design suggest that fault rates will increase in the future. More specifically, faster clock rates, increasing transistor density, decreasing voltages and smaller feature sizes all contribute to increasing fault rates [1, 11, 21] . Due to a combination of these factors, fault rates in modern processors have been increasing at a rate of approximately 8% per generation [3] .
These trends are well known in the architecture and compiler communities, and, consequently, many solutions to the threat of soft errors have been proposed. At a high level, all of these solutions involve adding redundancy to computations in one way or another, but the specifics vary substantially. For instance, there are proposals involving hardware-only solutions such as error-correcting codes, watchdog co-processors [6] and redundant hardware threads [4, 9, 16, 25] as well as software-only techniques that use both single and multiple cores [12, 13, 17, 18, 20, 24] . Broadly speaking, if the technique can scale, hardware-only solutions are more efficient for a single, fixed reliability policy, but software-only solutions are more flexible (they may be deployed exactly when, where, and to the degree needed) and less costly in terms of hardware. In an attempt to gain some of the best of both worlds, researchers have also recently proposed hybrid software-hardware solutions involving strong fault tolerance mechanisms implemented in hardware but controlled by the software running on the processor [19] .
Software-only and hybrid hardware-software techniques also possess at least one further, little-mentioned drawback: they may not actually work. To be fair, many of these techniques appear extremely promising. However, as far as we are aware, the published transient fault-tolerance techniques come with no rigorous proofs that they guarantee any particular reliability properties. In general, researchers satisfy themselves with presenting an algorithm for fault-tolerance and leave the audience to judge for themselves whether or not the algorithm is correct. In fact, the literature does not even precisely define what it might mean for an assembly-level program to be fault tolerant. This paper tackles this gaping hole in the existing literature by defining a new hybrid hardware-software technique for tolerating transient faults, and, unlike any previous work, actually proving it has strong fault-tolerance properties.
The specification and proof of fault tolerance come in several stages. First, before proving any particular properties, it is necessary to define a fault model precisely. Most of the current literature uses the Single Event Upset (SEU) Model, which states that only one fault may occur during execution [16, 19, 26]. However, the details of exactly where and when faults may occur are usually given in English. We also assume the SEU model, but we specify exactly where by including faulty transitions as formal rules in the operational semantics of our assembly language. Second, it is necessary to state precisely what "fault tolerance" actually means. Abstractly, a program is fault-tolerant if no fault can change the observable behavior of a program. More concretely, we assume our system operates in the presence of a memory-mapped output device, and hence a program is not fault-tolerant if a fault can cause a deviation in the sequence of values written to memory. We formalize this property more precisely as a mathematical theorem that relates faulty and non-faulty executions of a program.
Third, it is necessary to provide a technique for actually proving that specific programs are fault tolerant relative to the fault model. Our proof technique is presented in the form of a type system. All well-typed programs satisfy variants of the standard progress and preservation lemmas, even in the presence of transient faults, as well as the stronger fault tolerance property mentioned above. In addition to being theoretically important as a proof technique for fault tolerance, the type system can be used to debug compilers that intend to generate reliable code. If the output from these compilers type checks, their code will have strong fault tolerance guarantees. In the past, researchers have proposed testing compiler outputs using fault injection techniques that randomly insert errors into programs. However, using a type checker in this case is a much better idea. In principle, a conventional testing technique would need to test all combinations of features in conjunction with all combinations of faults, causing an explosion in the number of test cases, and yet still failing to achieve perfect fault coverage in practice. By using the type checker we have designed, one achieves perfect fault coverage relative to the fault model without needing to increase the compiler test suite.
The rest of this paper presents the details of our hybrid hardware-software fault-tolerance technique. Section 2 presents the syntax and operational semantics of the new, idealized assembly language we have designed for fault tolerance. It is a RISC-based architecture with special instructions to facilitate reliable communication with memory and to detect control-flow faults. Section 3 presents the key principles and formal definitions for the fault-tolerant assembly language type system (TALFT for short). Though the typing rules are specific to our particular setting, the underlying principles are more general; we believe many of these principles will apply to reasoning about related fault-tolerant systems. Our innovative combination of a TAL-like type-theory with concepts from classical Hoare Logics is a particularly general and important technical contribution. Section 4 describes the key theorems we have proven including Progress, Preservation, "No False Positives," and Fault Tolerance. Section 5 provides empirical evidence that our new hybrid solution to fault tolerance is feasible for many applications by measuring performance results on simulated hardware. Related work is discussed in more detail in Section 6. Due to space considerations, some of the technical details and all of the proofs have been omitted. A companion technical report [15] contains the complete specification of our system and a relatively detailed proof outline.
The Faulty Hardware
The faulty hardware is based on a simple RISC architecture, extended with features to support detection of control-flow faults and safe interaction with memory-mapped output devices. Correct use of these features makes it possible to detect all faults that might change a program's observable behavior. Most practical systems also need a fault recovery mechanism of some kind. However, since recovery is largely orthogonal to detection, we omit the former, focusing only on the latter in this paper.
The general strategy of every fault-tolerant program is to maintain two redundant and independent threads of computation, a green (G) computation and a blue (B) computation. The green computation generally leads slightly, and the blue computation generally trails, though there is a fair amount of flexibility in how the instructions in each computation may be interleaved. Prior to writing data out to a memory-mapped output device, the results of the two computations are checked for equivalence. If the results are not equivalent, the machine will signal that a fault has been detected. The arguments to any control-flow transfer must also be checked for faults. This methodology has been shown in the literature to be an effective implementation of fault tolerance [13, 18], and we expand on this style of implementation by formalizing the fault model and coverage.
The execution of assembly programs is specified using a small-step operational semantics that maps machine states (Σ) to other machine states. These machine states are made up of a number of components. The first component is the machine's register bank R, which is a total function that maps register names to the values contained therein. The meta variable a ranges over all sorts of registers, and meta variable r ranges only over general-purpose registers (r1, r2, ...). In addition to general-purpose registers, there are two program counter registers (pcG and pcB), which contain the same value unless there has been a fault. There is one additional special register, the destination register, d. Its role in control-flow checking will be explained later.
To facilitate proofs of certain theorems, the value in each register is tagged with the color (either green or blue) of the computation to which it belongs. However, these tags have no effect on the runtime behavior of programs. In addition to a register bank, the machine state includes a code memory C, which we model as a function mapping integer addresses n to instructions. The machine also has a value memory M, which maps addresses to integer values. In between the value memory and the processor is a special store queue, Q, which is used to detect faults before data is written to a memory-mapped output device. The store queue is a queue of address-value pairs. We will discuss the role of the queue in greater detail later.
Overall, an abstract machine state (Σ) may have the form fault, indicating the hardware has detected a transient fault, or the ordinary state (R, C, M, Q, ir), where the first four components are as discussed above, and ir is either an instruction i to be executed, or "·" indicating the next instruction should be fetched from code memory. Figure 1 summarizes the syntax of machine states. Here and elsewhere in the paper, we use overbar notation to indicate a sequence of objects.
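As a rough illustration of the machine state just described, the following Python sketch models colored values and the tuple (R, C, M, Q, ir). It is a minimal sketch, not the formal definition of Figure 1: the class names, the field names, the use of dictionaries for the total functions R, C, and M, the number of registers, and the initial colors assigned to registers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Colored:
    """A register value tagged with the color of its computation."""
    color: str  # "G" (green) or "B" (blue); has no run-time effect
    val: int


@dataclass
class State:
    """Ordinary machine state (R, C, M, Q, ir)."""
    R: Dict[str, Colored]       # register bank
    C: Dict[int, tuple]         # code memory: address -> instruction
    M: Dict[int, int]           # value memory: address -> integer
    Q: List[Tuple[int, int]] = field(default_factory=list)  # store queue
    ir: Optional[tuple] = None  # None plays the role of "."


FAULT = "fault"  # distinguished state: the hardware has detected a fault


def initial_state(code, memory, entry=0, num_regs=4):
    """Build a state in which the two program counters agree (hypothetical helper)."""
    R = {f"r{i}": Colored("G", 0) for i in range(1, num_regs + 1)}
    R["pcG"] = Colored("G", entry)
    R["pcB"] = Colored("B", entry)
    R["d"] = Colored("G", 0)  # assumption: d starts out holding 0
    return State(R=R, C=dict(code), M=dict(memory))
```

For instance, `initial_state({0: ("mov", "r1", Colored("G", 5))}, {256: 0})` yields a state with an empty queue, no current instruction, and both program counters at address 0.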
The Fault Model
The operational semantics is designed both to model proper execution of machine instructions and to make perfectly explicit, precise, and transparent all of our assumptions about when and where faults may occur. The central operational judgment has the form $\Sigma_1 \longrightarrow^{s}_{k} \Sigma_2$, which expresses a single step transition from state Σ1 to state Σ2 while incurring k faults and writing data s to a memory-mapped output device. We will work under the standard assumption of a single upset event and hence k will always be either 0 or 1. The data s is a (possibly empty) sequence of address-value pairs. While the operational semantics models the internal workings of the machine, the only externally observable behavior of the machine is the sequence of writes s to the output device or the signaling of a hardware-detected fault. If faults cause the processor to have drastically different internal behavior, but the externally observable sequence s is unchanged, we consider the program to have executed successfully.
Different fault-tolerance techniques protect different components of machines. In the literature, the protected areas are usually inside the Sphere of Replication (SoR) [16]. In our case, we target faults that may occur in data manipulated within the processor. We assume that both code memory C and value memory M are fully protected. This is often the case since error-correcting codes can very efficiently protect memory. To make these assumptions explicit, the following three operational rules specify exactly how faults may occur within our system.
Rule reg-zap nondeterministically introduces a fault into any register by replacing the value in that register with some other arbitrary value. There are no restrictions on how the underlying value might be changed. For instance, code pointers can be changed to arbitrary integer values; references may no longer be in bounds. However, the color tag is preserved to facilitate fault-tolerance proofs. Since the color tag is fictional (has no effect on run-time behavior), this poses no limitation on the fault model. Rules Q1-zap and Q2-zap alter the contents of the store queue in similar ways. Formally, these are the only faults that can occur. However, notice that since the program counters and targets of indirect jumps are susceptible to the reg-zap rule, we effectively capture many forms of "control-flow faults" studied previously. Notice also that we do not explicitly consider faults that occur during execution of an instruction. However, many such faults may easily be shown equivalent to correct execution of an instruction composed with a fault either immediately before or afterwards. For example, consider a simple register-to-register add instruction. Any fault within the adder hardware during execution of the add is equivalent to a correct add followed by a fault in the destination register.
An important benefit of our formal model is that there is actually some precise, concrete specification to analyze. Moreover, if a researcher wants to reason about the consequences of some fault that lives outside the formal model, this may be done by adding a new operational rule to the system and studying its semantic effect.
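The fault rules can be read as small state transformers. The sketch below is one such reading in Python, under stated assumptions: registers are represented as (color, value) pairs, and `queue_zap` is a hypothetical helper that stands in for both Q1-zap and Q2-zap by corrupting either the address or the value of one queued pair. The exact shape of the two queue rules is assumed here, not taken from the formal definition.

```python
def reg_zap(R, a, new_val):
    """Rule reg-zap: put an arbitrary value in register a, preserving its color tag."""
    color, _ = R[a]
    zapped = dict(R)          # leave the original register bank untouched
    zapped[a] = (color, new_val)
    return zapped


def queue_zap(Q, index, new_addr=None, new_val=None):
    """Corrupt the address or the value of one queued pair (assumed reading of Q1-zap / Q2-zap)."""
    addr, val = Q[index]
    zapped = list(Q)
    zapped[index] = (addr if new_addr is None else new_addr,
                     val if new_val is None else new_val)
    return zapped


def may_fault(faults_so_far):
    """Under the single-event-upset assumption, a fault rule may fire only once per execution."""
    return faults_so_far == 0
```

A fault outside the model, as discussed above, would correspond to adding one more function of this kind and studying which executions it makes possible.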
Instruction Semantics
The syntax of machine instructions was presented along with the rest of the components of our abstract machine in Figure 1. The semantics is described formally by the inference rules in Figures 2, 3, and 4, and explained informally below. The formal rules use several notational conventions. For instance, if R is a register file then R(a) is the contents of register a and R[a → v] is the updated register file with register a mapped to v. R++ is the register file that results from incrementing both pcG and pcB by 1. If R(a) is the colored value c n, we write $R_{val}(a)$ to denote n and $R_{col}(a)$ to denote c. The function find(Q, n) produces the first pair (n, n′) that appears in Q, or () if no pair (n, n′) appears in Q.
Instruction Fetch. The machine operates by alternately fetching an instruction from code memory and executing that instruction. When there is no current instruction to execute (i.e., ir = ·), the fetch rule should fire. This rule tests for equality of the two program counters to check for faults and loads the appropriate instruction from code memory. If pcG and pcB are the same but $R_{val}(pcG)$ is not a valid address in code memory, execution "gets stuck" (no rule fires). Fortunately, however, well-typed programs never get stuck, even when a single fault occurs. On the other hand, a fault can render the two program counters inequivalent. In this case, rule fetch-fail fires and causes a transition to the fault state. Abstractly, this transition represents hardware detection of a transient fault. Controlled program termination or perhaps recovery may follow. Fault recovery is an orthogonal issue to fault detection, so we leave it unspecified here. The fault model does not allow for the instruction itself to be corrupted.
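The three outcomes of instruction fetch (success, detected fault, stuck) can be sketched as follows. This is an illustrative Python rendering, assuming registers are (color, value) pairs and code memory is a dictionary; the function name and the use of `None` for a stuck state are assumptions of the sketch.

```python
FAULT = "fault"


def fetch(R, C):
    """Return the next instruction, FAULT, or None when execution is stuck.

    R maps register names to (color, value) pairs; C maps addresses to instructions.
    """
    pc_g = R["pcG"][1]
    pc_b = R["pcB"][1]
    if pc_g != pc_b:
        return FAULT      # rule fetch-fail: the program counters disagree
    if pc_g not in C:
        return None       # stuck: equal counters, but not a valid code address
    return C[pc_g]        # rule fetch
```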
Basic Instructions. The arithmetic and move instructions (rules op2r, op1r, and mov) are completely standard. The first arithmetic operation op $r_d$, $r_s$, $r_t$ performs op on the values in $r_s$ and $r_t$, storing the result in $r_d$. The second arithmetic operation uses a constant operand v in addition to $r_s$ and $r_d$. All constants are annotated with the color of the computation they belong to. Likewise, the mov instruction loads an annotated constant into a register.
Memory Instructions. Transient faults are problematic only when they change the results of computations and those results are observed by an external user. In our model, the only way a result can be observed is for a program to write it to memory, where a memory-mapped output device may read and process it. Without special hardware it appears impossible to guarantee that storage operations guard access to memory properly. No matter what sophisticated software checking is performed just before a conventional store instruction, it will be undone if a fault strikes between the check and execution of the store instruction. This is the conundrum of the Time-Of-Check-Time-Of-Use (TOCTOU) fault.
To avoid TOCTOU faults, our machine possesses a modified store buffer (the queue Q), which is similar to the store buffer used in previous hardware [16] and hybrid [19] fault tolerant systems. In addition, there are two special storage instructions, each tagged with a color. The green store instruction stG $r_d$, $r_s$ places the address-value pair ($R_{val}(r_d)$, $R_{val}(r_s)$) on the front of the queue (rule stG-queue). The blue store instruction stB $r_d$, $r_s$ retrieves the pair at the back of the queue, checks that it equals ($R_{val}(r_d)$, $R_{val}(r_s)$), and then stores it in memory (rule stB-mem). If the pairs are different, the hardware signals a fault. Failure rules appear in Appendix A.1. Since green stores must always come before blue stores, instruction scheduling is somewhat constrained. Section 5 evaluates performance both with and without this scheduling constraint and shows that its impact is negligible.
As an example, consider the following straight-line sequence:
    1  mov r1, G 5
    2  mov r2, G 256
    3  stG r2, r1
    4  mov r3, B 5
    5  mov r4, B 256
    6  stB r4, r3
These six instructions have the effect of storing 5 into memory address 256. Moreover, a fault at any point in execution, to either blue or green values or addresses, will be caught by the hardware when the blue store (instruction 6) compares its operands to those in the queue. In addition, our instruction set gives a compiler the freedom to allocate registers however it chooses (e.g., reusing registers 1 and 2 in instructions 4-6 instead of registers 3 and 4) and to change the instruction schedule in various ways (e.g., moving instruction 3 to a position between instructions 5 and 6). Interestingly, however, not all conventional optimizations are sound, and, of course, this is why type checking generated code can be so helpful in detecting compiler errors. For example, common subexpression elimination might result in the following code:
    1  mov r1, G 5
    2  mov r2, G 256
    3  stG r2, r1
    4  stB r2, r1
In this case, a fault in r1 after instruction 1, or a fault in r2 after instruction 2 will cause both instructions 3 and 4 to manipulate the same, but incorrect, address-value pair. The result would be to store an incorrect value at the correct location or a correct value at an incorrect location. Fortunately, the TALFT type system catches reliability errors like this one.
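The difference between the two sequences can be checked with a small simulation. The Python sketch below is a simplified interpreter for straight-line code that uses only mov, stG, and stB; it ignores program counters, models the queue as a list whose front is index 0, and injects at most one register fault after a chosen instruction. The function `run`, its parameters, and the treatment of a blue store on an empty queue as a detected fault are assumptions of the sketch, not the formal rules.

```python
FAULT = "fault"


def run(program, zap_after=None, zap_reg=None, zap_val=None):
    """Execute a straight-line program, optionally corrupting one register.

    The fault strikes right after the 1-based instruction number zap_after.
    Returns the list of (address, value) pairs written to memory, or FAULT.
    """
    R, Q, writes = {}, [], []
    for number, (op, x, y) in enumerate(program, start=1):
        if op == "mov":      # mov rd, (color, n): colors are ignored at run time
            R[x] = y[1]
        elif op == "stG":    # push (address, value) on the front of the queue
            Q.insert(0, (R[x], R[y]))
        elif op == "stB":    # compare with the pair at the back, then commit
            if not Q or Q[-1] != (R[x], R[y]):
                return FAULT
            writes.append(Q.pop())
        if number == zap_after:   # the single event upset
            R[zap_reg] = zap_val
    return writes


checked = [("mov", "r1", ("G", 5)), ("mov", "r2", ("G", 256)), ("stG", "r2", "r1"),
           ("mov", "r3", ("B", 5)), ("mov", "r4", ("B", 256)), ("stB", "r4", "r3")]
cse = [("mov", "r1", ("G", 5)), ("mov", "r2", ("G", 256)),
       ("stG", "r2", "r1"), ("stB", "r2", "r1")]
```

With no fault, both programs write (256, 5). If r1 is corrupted to 7 after instruction 1, `run(checked, 1, "r1", 7)` returns `FAULT`, whereas `run(cse, 1, "r1", 7)` silently returns `[(256, 7)]`, which is the incorrect-value-at-correct-location outcome described above.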
As mentioned in Section 2.1, many "intra-instruction" faults can be modeled by modifying the register file before or after the instruction. However, this is not the case for a fault that occurs during the execution of the stB-mem rule in between the comparisons and the store. The hardware designer must implement structures that detect or mask any faults that occur here. A designer who cannot meet the specification given by the operational semantics must acknowledge that a vulnerability may remain.
The load instructions also come in pairs: ldB and ldG. The only difference in their semantics is that ldG checks for a pending store in the queue before loading its value from memory, whereas ldB goes directly to memory, ignoring the queue. This wrinkle increases the freedom in instruction scheduling by allowing the green computation to load a value it may have recently stored before the blue computation has necessarily committed the store. Rules ldG-queue, ldG-mem, and ldB-mem specify these behaviors.
Notice that there is no mechanism for verifying the address used in loads. Hence, a fault can result in an invalid address. In practice such a load might induce a hardware exception such as a segmentation fault or might result in loading some arbitrary value. Failure rules that model both possibilities appear in Appendix A.1.
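The asymmetry between the two loads can be sketched in Python as follows. The sketch assumes that the queue is a list of (address, value) pairs whose front (the most recent green store) is at index 0, so that `find` returns the most recent pending store to an address; the function names `ld_green` and `ld_blue` are hypothetical, and the failure rules for invalid addresses are not modeled.

```python
def find(Q, n):
    """Return the first pair in Q whose address is n, or () if there is none."""
    for addr, val in Q:
        if addr == n:
            return (addr, val)
    return ()


def ld_green(Q, M, addr):
    """ldG: prefer a pending store in the queue (ldG-queue), else read memory (ldG-mem)."""
    pending = find(Q, addr)
    return pending[1] if pending else M[addr]


def ld_blue(Q, M, addr):
    """ldB: read memory directly, ignoring the queue (ldB-mem)."""
    return M[addr]
```

With `Q = [(256, 9)]` and `M = {256: 1}`, the green load of address 256 sees the pending value 9, while the blue load sees the committed value 1.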
Control-Flow Instructions.
Any change in the control-flow of a program may cause a different sequence of values to be stored and observed by an external user. Consequently, the hardware contains mechanisms to detect faults in addresses that serve as jump targets. Intuitively, these mechanisms mirror the solution to faults in stored data in that execution of a control-flow transfer is accomplished through two instructions. Our solution uses a combination of software and hardware control-flow protection that is similar to watchdog processors [6] , but that makes both versions of the control flow explicit as in software-only control flow protection [12, 18] .
To achieve an unconditional jump, one executes a jmpG instruction first and a related jmpB instruction at some point in the future. A jmpG r1 moves the destination address from r1 into the special destination register d (rule jmpG). Like the store queue, the destination register stores a programmer intention, initiated by the green computation. Later, when the blue computation attempts to commit the jump by executing a jmpB r2 instruction, the contents of r2 are compared to the contents of the destination register and if they are equal, control jumps to that location (rule jmpB). If the addresses are different, the hardware detects a fault (see rule jmpB-fail). Similar to the constraint for the store queue, forcing green control flow instructions to be executed before the corresponding blue version constrains the instruction schedule. Section 5 will show that this scheduling constraint has only a minimal performance impact. The following code illustrates a typical control-flow transfer.
    ldG r1, r2
    jmpG r1
    ldB r3, r4
    jmpB r3
Initially, registers r2 and r4 should point to the same memory location, which contains a code pointer to jump to. The example illustrates some of the flexibility in scheduling jump instructions. Conditional jumps are more complex, but follow the same principles. The green conditional bzG $r_z$, $r_d$ tests $r_z$ and, if it is 0, moves the contents of $r_d$ into destination register d (rules bz-untaken and bzG-taken). No control-flow transfer occurs until a blue conditional bzB $r_z'$, $r_d'$ tests the contents of its $r_z'$ register. If $r_z'$ is 0 then $r_d'$ must equal the contents of d, and if so, the control flow transfer occurs (rule bzB-taken). If $r_z'$ is not 0, it is not good enough merely to fall through: the contents of $r_z'$ might be faulty. To avoid this possibility, the instruction examines the destination register. If it is 0 (and hence a prior bzG instruction did not store an address), the fall-through occurs (rule bz-untaken). The rules for the associated failure cases appear in Appendix A.1. Our metatheory will show that this mechanism suffices to detect faults either in the green computation (registers $r_z$ and $r_d$) or the blue computation (registers $r_z'$ and $r_d'$).
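The two-phase protocol for jumps and conditional branches can be sketched as a handful of Python functions over plain integers. This is a simplified reading under stated assumptions: the destination register d holds 0 when no green intention is pending, 0 is never a legitimate jump target, and d is cleared once a blue instruction commits a transfer. The function names and the convention of returning a (new d, next pc) pair are illustrative only.

```python
FAULT = "fault"


def jmp_green(r1):
    """jmpG: record the intended target in the destination register d."""
    return r1


def jmp_blue(d, r2):
    """jmpB: commit the jump if blue agrees with the recorded target, else fault."""
    return (0, d) if r2 == d else FAULT


def bz_green(d, rz, rd):
    """bzG: record the target in d when the test value is 0; otherwise leave d alone."""
    return rd if rz == 0 else d


def bz_blue(d, rz, rd, fallthrough_pc):
    """bzB: commit the branch or the fall-through, or signal a fault.

    Returns (new_d, next_pc), or FAULT when the two colors disagree.
    """
    if rz == 0:
        if rd == d:
            return (0, d)             # taken; clearing d is an assumption of this sketch
        return FAULT
    if d == 0:
        return (d, fallthrough_pc)    # untaken in both colors
    return FAULT                      # green recorded a target, but blue did not branch
```

In this sketch a fault in the green test register makes the green branch record nothing, so the blue branch finds a mismatch; a fault in the blue test register makes the blue branch fall through while d is non-zero. Both cases return `FAULT`.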
Typing
The primary goal of the TALFT type system is to ensure that well-typed programs exhibit fail-safe behavior in the presence of transient faults. In other words, well-typed programs must guarantee that a memory-mapped output device can never read a corrupt value and make it visible to a user. We call this property "fault tolerance."
In the following sections, we explain the intuitions and principles behind the various elements of the type system. Throughout the discussion, the reader will notice that our typing rules are not syntax-directed. Of course, as with other sorts of typed assembly language or proof-carrying code, this fact presents no particular difficulty in practice: it is easy for a compiler to generate sufficient "typing hints" to make type reconstruction trivial. For the reader's reference, the objects used in the type system are presented in Figure 5.
Static Expressions
Our "type system" is actually a combination of two theories, one being a relatively simple type theory for assembly, inspired by previous work on TAL [8] , and the second being a Hoare Logic, designed to enforce the more precise invariants required for strong fault tolerance. The latter component requires we define a language of static expressions for reasoning about values and storage.
For the purposes of this paper, the static expressions are drawn from the standard theory of arithmetic and arrays used in many classical Hoare Logics (cf. Necula's thesis [10]). These static expressions are classified as either integers (kind $\kappa_{int}$) or memories (kind $\kappa_{mem}$). The integer expressions include variables, constants, simple arithmetic operations, and values from a memory (sel $E_m$ $E_n$ is the integer located at address $E_n$ in $E_m$). The memory expressions include variables, the empty memory (emp), and memory updates (upd $E_m$ $E_{n_1}$ $E_{n_2}$ is a memory $E_m$ updated so that address $E_{n_1}$ stores value $E_{n_2}$).
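One way to read closed static expressions is as a small evaluator that maps an integer expression to an integer and a memory expression to a finite map. The Python sketch below does this for expressions encoded as nested tuples. The tuple encoding, the function name `denote`, the choice of +, -, and * as the arithmetic operations, and the behavior of sel on an address that was never updated are assumptions of the sketch; the definition in Appendix A.2 is the authoritative one.

```python
def denote(E):
    """Evaluate a closed static expression to an int or to a memory (a dict)."""
    tag = E[0]
    if tag == "int":                   # ("int", n)
        return E[1]
    if tag == "op":                    # ("op", symbol, E1, E2)
        _, op, e1, e2 = E
        a, b = denote(e1), denote(e2)
        return {"+": a + b, "-": a - b, "*": a * b}[op]
    if tag == "emp":                   # ("emp",) is the empty memory
        return {}
    if tag == "upd":                   # ("upd", Em, En1, En2)
        _, em, en1, en2 = E
        memory = dict(denote(em))      # copy, so that updates are functional
        memory[denote(en1)] = denote(en2)
        return memory
    if tag == "sel":                   # ("sel", Em, En); KeyError if never written
        _, em, en = E
        return denote(em)[denote(en)]
    raise ValueError(f"not a closed static expression: {E!r}")
```

For example, selecting address 256 from the empty memory updated with 2 + 3 at address 256 evaluates to 5.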
The context ∆ is a mapping from variables to kinds, and the judgment $\Delta \vdash E : \kappa$ classifies expression E as having kind κ. The judgment $\Delta \vdash S : \Delta'$ holds when the substitution S maps variables in Dom(∆′) to values well-formed in ∆ with types in Rng(∆′). The judgment $\Delta \vdash E_1 = E_2$ is valid when E1 and E2 are equal objects in the standard model. The function [[E]] supplies the denotation of the closed static expression E as either an integer or a memory, depending on its kind. The definitions for [[E]] and $\Delta \vdash E_1 = E_2$ are shown in Appendix A.2, and the remaining judgments are defined in the companion technical report [15].
Figure 6. Value Typing.
Value Typing
Since faults strike values, corrupting their bit patterns in arbitrary ways, the subtleties of value typing are a key concern. Informally, the type system maintains three key pieces of information about every value:
1. A color (green or blue).
2. A basic type.
3. A static expression.
When there has been no fault in a value's color, the value exactly equals the static expression. Static expressions are used to guarantee that in the absence of faults, the green and blue computations produce equal values, and hence, dynamic fault detection checks always succeed.
To summarize, every value is typed using a triple $\langle c, b, E \rangle$, where c is a color, b is a basic type, and E is a static expression. The presence of the static expression makes this type a kind of singleton type.
Value Typing Judgment. The value typing judgment has the form $\Psi; \Delta \vdash_Z v : t$, where Ψ maps heap addresses to basic types, and ∆ contains the free expression variables. In the rule val-t, a colored value c n is given the type $\langle c, b, E \rangle$ when the static expression E is equal to n, and $\Psi \vdash n : b$. The judgment $\Psi \vdash n : b$ allows n to be given either the basic type int or the type of the address n in memory.
The two rules cond-t and cond-t-n0 are used to type the conditional type $(E = 0 \Rightarrow \langle G, \Theta \to \mathit{void}, E_r \rangle)$. When the static expression E is equal to zero, values of this type also have type $\langle G, \Theta \to \mathit{void}, E_r \rangle$. When E is not equal to zero, values with this type must be 0.
The final two rules for $\Psi; \Delta \vdash_Z v : t$ make use of the zap tag Z, which is either empty or a color c. If the zap tag is a color c, then there may have been a fault affecting data of that color. Data colored the same as the zap tag can be given any type, as it may have been arbitrarily corrupted. The static expression used in this type may not contain any free expression variables.
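The interplay between rule val-t and the zap tag can be sketched as a simple checker. The Python code below is a deliberately simplified reading: static expressions are assumed to be closed and already evaluated to integers, conditional types are omitted, and Ψ is a dictionary whose entry for an address is taken directly as the basic type that address may be given. The function names and these simplifications are assumptions of the sketch, not the rules of Figure 6.

```python
def has_basic_type(Psi, n, b):
    """n may be typed as int, or as the type that Psi records for address n."""
    return b == "int" or Psi.get(n) == b


def check_value(Psi, Z, value, t):
    """Decide whether colored value (c, n) has type (c_t, b, E) under zap tag Z.

    Z is None (no fault) or a color; E is an already-evaluated closed expression.
    """
    (c, n), (c_t, b, E) = value, t
    if c != c_t:
        return False               # the color of a value never changes
    if Z is not None and Z == c:
        return True                # zapped color: the value may be arbitrarily corrupt
    return n == E and has_basic_type(Psi, n, b)
```

In the absence of a fault, the green value 7 cannot be given a type whose static expression is 5; once the zap tag is green, the same judgment succeeds, while blue values must still match their static expressions exactly.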
Value Subtyping. There is also a subtyping relation $\Delta \vdash t \le t'$ that allows all types $\langle c, b, E_1 \rangle$ to be subtypes of $\langle c, \mathit{int}, E_2 \rangle$ when $\Delta \vdash E_1 = E_2$. This relation is extended to register file subtyping $\Delta \vdash \Gamma_1 \le \Gamma_2$, by requiring that the type of each general-purpose register in Γ2 be a supertype of the corresponding register in Γ1. Note that there is no required relationship between the special registers d, pcG, and pcB. The rules for these judgments appear in the companion technical report [15].
Instruction Typing
While many of the instruction typing rules are quite complex, the essential principles guiding their construction may be summarized as follows.
1. In the absence of faults, standard type theoretic principles should be valid. In order to guarantee basic safety properties, the type system checks standard properties in much the same manner as previous typed assembly languages [8]. For example, jump targets must have code types, while loads and stores must operate over values with reference types.
2. Green values only depend on other green values, and blue values only depend on blue values. When this invariant is maintained, a fault in a blue value can never corrupt a green value and vice versa.
3. Both green and blue computations have equal say in any dangerous actions. Dangerous actions include storing values to memory-mapped output devices and executing control-flow operations. When both blue and green computations are involved, a fault in just one color is insufficient to deceive the hardware fault detection mechanisms.
4. In the absence of faults, green and blue computations must compute identical values. To be more precise, green and blue computations must store identical values to identical storage locations and must issue orders to transfer control to identical addresses. If not, the hardware will claim to detect faults when there have been none, or alternatively, might exhibit incorrect behaviors when there is a fault.
The first three principles are relatively straightforward to enforce. The fourth principle leads to the most technical challenges as it requires we check equality constraints between values. Moreover, since construction of these values depends on storage, the type system must maintain a relatively accurate static representation of storage. We accomplish this latter challenge using techniques drawn from Hoare Logics. The former challenge (testing values for equality) is achieved through the use of the singleton types described earlier.
The Instruction Typing Judgment. The judgment for typing instructions has the form Ψ; Θ ⊢ ir ⇒ RT. Unlike the context Ψ, which only contains invariant heap typing assumptions, Θ contains fine-grained context-sensitive information about the current state of memory and the register file. More specifically, Θ consists of the following subcontexts: (1) ∆, which describes the free expression variables appearing in the other context-sensitive objects, (2) Γ, which describes the mapping of register names to types for register values, (3) (Ed, Es), which describes the values in the queue, and (4) Em, which describes memory, as one does in Hoare Logic. The "result" of checking an instruction is a result type RT. A result type may either be void, indicating that control does not proceed past the instruction (it is a jump), or a postcondition Θ′, which describes the state of memory and the register file after execution of the instruction.
The typing rules are defined using several notational abbreviations. The notation Γ++ adds one to the static expression associated with each program counter register in Γ. Figure 7 presents the typing rules for instructions, and the following paragraphs explain the main points of interest.
Typing Basic Instructions. Basic arithmetic operations are not "dangerous" to execute, so the definitions of their typing rules are driven by principles 1 and 2, mentioned earlier. Consider, for example, rule op2r-t for an arithmetic operation op. This rule requires that the operand registers contain integers with the same color c, in accordance with principle 2 (green depends on green, blue depends on blue). The result register rd has a type colored c as well. In accordance with principle 1, the result has integer type. The rule also states that the static expression describing the result register is Es op Et and that the state of the queue and memory are unchanged by evaluation of the instruction.
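The checks performed by a rule such as op2r-t can be pictured with a small Python sketch. It is not the rule from Figure 7: the program counter update (Γ++) and the unchanged queue and memory components are omitted, static expressions are plain strings, and the names `type_op2r` and `TypingError` are hypothetical.

```python
class TypingError(Exception):
    """Raised when an instruction violates one of the typing principles."""


def type_op2r(gamma, op, rd, rs, rt):
    """Return the register file type after `op rd, rs, rt`.

    gamma maps register names to (color, basic_type, static_expression).
    """
    color_s, basic_s, expr_s = gamma[rs]
    color_t, basic_t, expr_t = gamma[rt]
    # Principle 1: arithmetic operates on integers.
    if basic_s != "int" or basic_t != "int":
        raise TypingError("operands must have integer type")
    # Principle 2: a value may only depend on values of its own color.
    if color_s != color_t:
        raise TypingError("operands must have the same color")
    result = dict(gamma)
    # The result keeps the operands' color and is described by Es op Et.
    result[rd] = (color_s, "int", f"({expr_s} {op} {expr_t})")
    return result
```

For example, adding two green integers described by x and y yields a green integer described by (x add y), while mixing a green and a blue operand is rejected.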
Typing Memory Instructions. Store operations are "dangerous" (they make computed values observable by the outside world), so we must be particularly careful in the formulation of their typing rules. In accordance with principle 1, both green and blue store instructions (rules stG-t and stB-t) require that the address register has the basic type b ref and the value register has the corresponding basic type b. Intuitively, the store queue is a green object, and in accordance with principle 2, the green store instruction may push an address-value pair onto the front of the queue as long as both values are green. In accordance with principle 4, the rule for the blue store checks that the address-value pair to be stored is exactly equal to the address-value pair at the end of the queue. Since the arguments to the blue store have a blue type and the queue always contains green objects, both blue and green computations contribute to the actual storage operation (in accordance with principle 3).
The load operations are somewhat simpler than the store instructions since they are not "dangerous" in our model. However, like the store instructions, the operands of blue loads must be blue and the operands of green loads must be green. Once again, in accordance with principle 2, the result of a blue load is a value with a blue type, and likewise for a green load.
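The interplay between the green and blue store rules can be sketched at the level of static descriptions. The Python below is an illustration only, under several assumptions: static expressions are strings compared structurally, the queue description is a list of (address, value) expression pairs with the front at index 0, and the updated memory description is written with a hypothetical `upd(...)` notation. The function names are likewise hypothetical.

```python
class TypingError(Exception):
    """Raised when a store violates one of the typing principles."""


def _check_store_operands(gamma, rd, rs, color):
    (cd, bd, ed), (cs, bs, es) = gamma[rd], gamma[rs]
    # Principle 2: both operands must carry the instruction's color.
    if cd != color or cs != color:
        raise TypingError(f"operands of a {color} store must be {color}")
    # Principle 1: the address has type `b ref`, the value has type `b`.
    if bd != bs + " ref":
        raise TypingError("address and value types do not correspond")
    return ed, es


def type_store_green(gamma, queue, rd, rs):
    """Green store: push the described pair onto the front of the queue."""
    pair = _check_store_operands(gamma, rd, rs, "G")
    return [pair] + queue


def type_store_blue(gamma, queue, mem, rd, rs):
    """Blue store: the pair must equal the pair at the end of the queue."""
    pair = _check_store_operands(gamma, rd, rs, "B")
    # Principle 4: green and blue must agree on address and value.
    if not queue or queue[-1] != pair:
        raise TypingError("blue store does not match the pending green store")
    new_mem = f"upd({mem}, {pair[0]}, {pair[1]})"
    return queue[:-1], new_mem
```

A green store followed by a matching blue store empties the queue description and updates the memory description, whereas a blue store whose value expression differs from the queued one is rejected at type-checking time.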
Typing Control-Flow Instructions. While the typing rules for control-flow instructions have many premises, they continue to follow the same four principles as the other instructions. Much of the complexity is inherently due to principle 1, which mandates checking all the usual constraints associated with jumps in any typed assembly language.
The simplest rule involves the green unconditional jump. This instruction is just a move from register rd to the special destination register d. The type of register d is updated to the type of rd (obeying both principles 1 and 2). The rule contains constraints that d must be equal to 0 in both Γ and Γ′ since the hardware resets the destination register to 0 after a jump.
The blue unconditional jump is a true jump. According to principle 1, it checks the standard typing invariants needed to ensure safety in any typed assembly language, including (1) that the jump target has code type (see the first two premises), and (2) that the current state, including register file, memory, and queue, matches the expected state at the jump target, modulo some substitution S of static expressions for universally quantified variables ∆ from the code type (see the final seven premises).
The typing of the conditional branches is quite similar to that of unconditional jumps. One difference is that the bzG instruction is now a conditional move as opposed to an unconditional move. Hence, to represent the result of the move (unknown at compile time) the conditional type (Ez = 0 ⇒ G, Θ → void, Er) is used. In addition, since the conditional branch may fall through, the result of typing the bzG instruction is a proper postcondition as opposed to void, like jmpG.
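To make the division of labor between the two unconditional jumps concrete, the following Python sketch models their run-time effect on a register file represented as a dictionary. It is an approximation, not the operational semantics of Section 2: the assumption that a mismatch between d and the blue target leads to the fault state, and the way each program counter is updated from the value of its own color, are illustrative guesses, and all names are hypothetical.

```python
FAULT = "fault"


def jmp_green(regs, rd):
    """Green jump: record the intended target in the destination register d."""
    new = dict(regs)
    new["d"] = regs[rd]
    return new


def jmp_blue(regs, rd):
    """Blue jump: transfer control only if green and blue targets agree."""
    if regs[rd] != regs["d"]:
        return FAULT  # assumed: disagreement is reported as a detected fault
    new = dict(regs)
    new["pcG"] = regs["d"]   # green program counter from the green target
    new["pcB"] = regs[rd]    # blue program counter from the blue target
    new["d"] = 0             # the hardware resets d after the jump
    return new
```

With both target registers holding the same address, `jmp_green` followed by `jmp_blue` moves both program counters to that address and resets d to 0. If either copy of the target is corrupted between the two instructions, the comparison fails and no control transfer occurs.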
Machine State Typing
In order to prove various properties of the type system, we need to specify the invariants of machine states that are preserved during execution. The judgments for typing a machine state Σ are shown in Figure 8 and explained below.
Register File Typing. The judgment Ψ ⊢_Z R : Γ states that the register file R has the register file type Γ under heap typing Ψ and a zap tag Z. The contents of each register must have the type given to that register by Γ. Each program counter must have the appropriate color, and the program counters must compute equal values. (In the case where one program counter is corrupted, the zap tag Z in the first premise allows its actual value to differ from the expected computed value.)
Code Typing. The judgment Ψ ⊢ C states that code memory C is well-formed with respect to heap typing Ψ. The address 0 is not a valid code address. Each address must have a code type, and the code type must contain the precondition for the instruction at that address. If the instruction typing results in a postcondition Θ (meaning that control may fall through to the next instruction) then the subsequent instruction must be well typed using Θ as its precondition.
Memory Typing. The judgment Ψ ⊢ M : Em states that given heap typing Ψ the value memory M is well-formed and can be described by the static expression Em. The static expression Em must have kind κmem, and M must be the denotation of Em.
Queue Typing. The judgment Ψ ⊢_Z Q : (Ed, Es) means that queue Q can be described by the sequence of static expressions (Ed, Es) given heap typing Ψ and zap tag Z. When the queue is empty, it is described by the empty sequence. When the zap tag Z is not G, the first pair (n1, n2) must consist of an address n1 with type b ref and a value n2 with type b. This pair is described by the static expression pair (Ed, Es) when Ed evaluates to n1 and Es evaluates to n2. The remainder of the queue must be described by the remainder of the static expression sequence. All values in the queue are considered to be green, so when the zap tag is G, these values may have been arbitrarily corrupted. Accordingly, in this case, the only requirements are that each static expression must have kind κint and the length of the queue must be the same as the length of the static expression sequence.
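The queue typing conditions can be read as a simple check. The sketch below is a simplified Python rendering under explicit assumptions: static expressions are closed and are either integers or tuples (op, e1, e2), the heap typing is a dictionary from addresses to the basic type of their contents, only int cells are supported, and the kind check is implicit because every expression in this representation denotes an integer. The zap tag is `None` when empty. All names are hypothetical.

```python
import operator

_OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def eval_expr(e):
    """Evaluate a closed static expression: an int or a tuple (op, e1, e2)."""
    if isinstance(e, int):
        return e
    op, left, right = e
    return _OPS[op](eval_expr(left), eval_expr(right))


def queue_well_typed(psi, zap, queue, exprs):
    """Check that `queue` is described by the expression pairs `exprs`."""
    if len(queue) != len(exprs):
        return False
    if zap == "G":
        # Queue contents are green, so they may be arbitrarily corrupted;
        # only the lengths need to agree.
        return True
    for (n1, n2), (e_d, e_s) in zip(queue, exprs):
        # n1 must be an address of type `int ref`, n2 a value of type int.
        if psi.get(n1) != "int" or not isinstance(n2, int):
            return False
        # The static expressions must denote the actual address and value.
        if eval_expr(e_d) != n1 or eval_expr(e_s) != n2:
            return False
    return True
```

A queue holding the pair (100, 5) is accepted against the description ((add 60 40), 5) under the empty or blue zap tag. If the stored value is corrupted to 6, the check fails under those tags but still succeeds under zap tag G, reflecting that a green fault may have altered the queue.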
Machine State Typing. The judgment ⊢_Z Σ states that a machine state Σ is well-typed under zap tag Z. This judgment holds when Σ is a five-tuple (R, C, M, Q, ir), and these elements are each well-typed and consistent with each other. Note that Σ is not well-typed when it is the fault state fault.
Formal Results
In order to prove properties of our type system, we extend our single-step transition $\Sigma_1 \longrightarrow^{s}_{k} \Sigma_2$ from Section 2 to a sequence of n transitions containing exactly k faults, $\Sigma_1 \xrightarrow{\,n\,}{}^{s}_{k} \Sigma_2$, where n is greater than or equal to zero, and k is still either 0 or 1.
Type Safety
Progress states that well-typed states can take a step. In particular, a machine state that is well-typed under the empty zap tag can take a non-faulty step to another ordinary, non-faulty machine state. A machine state that is well-typed under a zap tag of color c can take a step, but the result of that step may either be another ordinary machine state or the fault state.
According to Preservation, if a machine state is well-typed under a zap tag Z, and it takes a non-faulty step to another machine state, then that resulting state will also be well-typed under Z. Additionally, if a state is well-typed under the empty zap tag, and it takes a faulty step, then there is some color c such that the resulting state is well-typed under c.
Progress and Preservation define the usual notion of type safety. In addition, part one of Progress, together with part one of Preservation entail the following important corollary: The hardware never claims to have detected a fault when no fault has occurred during execution of a well-typed program.
Corollary 3 (No False Positives) If $\vdash \Sigma$ then $\forall n.\ \Sigma \xrightarrow{\,n\,}{}^{s}_{0} \Sigma'$ and $\vdash \Sigma'$.
Fault Tolerance
A program is fault tolerant when all the faulty executions of that program simulate fault-free executions of the program. More precisely, the sequence of outputs from a faulty execution is required either to be identical to that of the fault-free execution or, in the case that the hardware detects the fault, to be a prefix of it. In order to reason about pairs of faulty and fault-free executions, we define similarity relations between values, register files, queues and machine states. Each of these relations is defined relative to the zap tag Z. Intuitively, if Z is empty, the related objects must be identical. If Z is a color c, the objects must be identical modulo values colored c. In the latter case, values colored c may be corrupted, and there is no hope they satisfy any particular relation. The formal definitions of these relations are shown in Figure 9.
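The intuition behind the similarity relations can be sketched for values and register files. The Python below is an informal reading of the prose, not the definitions in Figure 9: a value is assumed to be a pair (color, number), the empty zap tag is represented by `None`, and the function names are hypothetical.

```python
def similar_values(zap, v1, v2):
    """Relate two colored values under zap tag `zap` (None when empty)."""
    (color1, n1), (color2, n2) = v1, v2
    if color1 != color2:
        return False
    if zap is not None and color1 == zap:
        # Values of the zapped color may be corrupted: nothing is required.
        return True
    return n1 == n2


def similar_regfiles(zap, regs1, regs2):
    """Register files are similar when they are similar register by register."""
    return regs1.keys() == regs2.keys() and all(
        similar_values(zap, regs1[r], regs2[r]) for r in regs1
    )
```

Under the empty zap tag the relation collapses to equality. Under zap tag G, two register files that differ only in the contents of green registers are still considered similar, while any difference in a blue register breaks the relation.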
Using the similarity relations, we can state and prove the fault tolerance theorem for well-typed programs precisely. Assume that machine state Σ is well-typed under the empty zap tag, and non-faulty execution of Σ for n steps results in a state Σ′ and outputs a sequence of value-address pairs s. If somewhere during that execution a single fault is encountered, the faulty execution will either run for n + 1 steps or terminate in the fault state during that time. If the faulty execution takes n + 1 steps and reaches the non-faulty state Σf, then Σ′ simulates Σf and the sequence of output pairs is identical to that of the original execution. Alternatively, if the faulty execution reaches the fault state, then the output pairs will be a prefix of the non-faulty output pairs.
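The observable part of this guarantee, namely the relationship between the two output sequences, can be phrased as a small check. The Python sketch below covers only that output condition, not the state simulation, and the function name and argument layout are hypothetical.

```python
def outputs_consistent(fault_free, faulty, ended_in_fault):
    """Check the output condition of the fault tolerance property.

    fault_free, faulty: lists of (value, address) output pairs.
    ended_in_fault: True when the faulty run reached the fault state.
    """
    if ended_in_fault:
        # A detected fault may cut the run short: outputs must be a prefix.
        return faulty == fault_free[:len(faulty)]
    # An undetected fault must not change the outputs at all.
    return faulty == fault_free
```

For instance, if the fault-free run outputs [(1, 100), (2, 104)], a faulty run that halts in the fault state after emitting only (1, 100) is acceptable, whereas a faulty run that emits (3, 100) is not, whether or not the fault is later detected.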
Performance
To better understand how TALFT can be applied to real world situations, we simulated the TALFT hardware in the framework of a current computer architecture, the Intel Itanium 2 ISA. The instruction set of the Itanium 2 contains many more types of instructions than those specified in TALFT. While not an exact representation of the performance of TALFT, simulating the performance of TALFT applied to this architecture will give guidance as to the feasibility of this system in a real architecture.
To evaluate the performance impact of our techniques, a version of the VELOCITY compiler [23] was modified to add the reliability techniques of TALFT and was used to compile the SPEC CINT2000 and MediaBench benchmark suites. These executions were compared against binaries generated by the original VELOCITY compiler, which have no fault detection. The reliability transformation was compiled into the low level code immediately before register allocation and scheduling. To simulate the new hardware structures of TALFT, extra instructions were inserted to emulate the timing and dependences of the hardware structure accesses.
Performance metrics were obtained by running the resulting binaries with reference inputs on an HP workstation zx6000 with two 900 MHz Intel Itanium 2 processors and 4 GB of memory, running Red Hat Advanced Workstation 2.1. The perfmon utility was used to measure the CPU time. Figure 10 presents the execution time of the fault-tolerant code relative to baseline binaries with no fault detection. Naïvely, one might expect the fault-tolerant code to run twice as slowly as the fault-intolerant code since the number of instructions is essentially doubled. However, we find that smart instruction scheduling and efficient allocation of resources reduce the execution time to only 34% more than the fault-intolerant baseline on average. These simulations are in line with previously published software-only reliability performance experiments [18] that show the degradation due to redundant code to be less than double.
As alluded to in Section 2.2, Figure 10 compares the performance degradation both with and without the scheduling constraint that green memory and control flow instructions must be executed before the corresponding blue versions. In order to perform the second set of experiments, our compiler was modified to produce code that had more flexibility in the scheduling of the green and blue versions. We then simulated a more aggressive hardware implementation that could correlate the original and redundant memory operations regardless of the executed order. As expected, this version has better performance (in most cases) than the constrained code. Comparing both to the unprotected code, the version without the ordering constraint increases execution time by 30% while the version with the ordering constraint increases execution time by 34%. Although the colored ordering restriction of TALFT may seem costly, removing this restriction provides only a small improvement.
Related Work
Fault tolerance based on software replication is a well-populated field with decades of history. TALFT differs from previous approaches in that it provides a type-theoretic framework for obtaining strong guarantees about the reliability of machine code.
Most closely related to TALFT is our previous work on λzap, a highly abstract type-theoretic model for studying the basic principles of fault tolerance in the lambda calculus [26] . There are two important distinctions between TALFT and λzap. First, λzap, working at the level of the lambda calculus, is very far removed from real machine code. For instance, it lacks a program counter, a register file, memory, and load or store instructions. Memory references in particular constitute a key challenge in the current technical work. Second, the properties of the λzap type system are relatively weak compared with the properties of the current type system. The "end-to-end" fault tolerance property proven for λzap depends not only on the type system but also the nature of the translation from the ordinary simply-typed lambda calculus. In contrast, the type system of TALFT is much stronger, capable of ensuring a strong fault tolerance property independently of the process that compiles the code.
Also closely related to TALFT is the original TAL system, which first applied strong type checking to machine code to guarantee its safety [8] . TAL operates under the assumption of nonfaulty hardware and therefore ignores the major issues of reliability on which this paper has focused.
There have been various implementations of software-only, hardware-only, and hybrid techniques for transient fault mitigation. Hardware techniques have a long history of using very localized bit-level techniques like error correcting codes or parity bits additions. These techniques are efficient for storage structures like memory, but are costly or impossible to apply to other processor elements like pipeline latches or arithmetic units. Higher level techniques are used when protection is necessary for larger segments of the processor. These techniques include the duplication of coarse-grained structures such as functional units, processor cores [5, 22, 27] , or hardware contexts [9, 16, 25] .
To provide protection when the hardware costs of these approaches are prohibitive, software-only approaches have been proposed as alternatives [12, 13, 17, 18, 20, 24] . While software-only systems are cheaper to deploy and can be configured after deployment, they cannot achieve the same performance or reliability as hardware-based techniques, since they have to execute additional instructions and are unable to examine microarchitectural state. Despite these limitations, software-only techniques have shown promise, in the sense that they can significantly improve reliability with reasonable performance overhead [12, 13, 18] . TALFT attempts to exploit the benefits of both sorts of systems by using a hybrid approach to fault tolerance. There have been previous hybrid approaches to transient fault tolerance, some focusing solely on control-flow protection [14] and recently others looking at full processor protection [19] . This work differs from those previous approaches because regardless of the type of implementation, software, hardware, or hybrid, none of those previous approaches have given rigorous formal proofs of the correctness of their systems.
Conclusions
In conclusion, transient faults are already a significant cause for concern at major semiconductor manufacturers and threaten to be more so in the coming years and decades. This paper takes one step forward for the science of fault tolerance by presenting a principled and practical hybrid software-hardware scheme for detecting transient faults. More specifically, we identify four general principles for verifying correctness of fault tolerant systems and capture these in an assembly language type system. From a theoretical perspective, the type system acts as a sound proof technique for verifying reliability properties of programs. From a practical perspective, it can be used as a debugging aid within a compiler, strictly dominating any conventional testing technique. Our two main formal results show that a single fault affecting observable behavior in a well-typed program will always be detected, and that the system will not claim to have detected a fault when none has occurred. Despite the fact that well-typed programs essentially duplicate all computation, we provide simulation results showing an execution time of 1.34x that of the unprotected baseline, that is, an overhead of 34%.
