Smaller feature size, lower supply voltage, and faster clock rates have made modern computer systems more susceptible to faults. Although previous fault tolerance techniques usually target a relatively low fault rate and consider error recovery less critical, with the advent of higher fault rates, recovery overhead is no longer negligible. In this paper, we propose a scheme that leverages and revises a set of compiler optimizations to design, for each application hotspot, a smart recovery plan that identifies the minimal set of instructions to be re-executed in different fault scenarios. Such fault scenario and recovery plan information is efficiently delivered to the processor for runtime fault recovery. The proposed optimizations are implemented in LLVM and GEM5. The results show that the proposed scheme can significantly reduce runtime recovery overhead by 72%.
INTRODUCTION
While smaller feature sizes, lower supply voltages, and faster clock rates have given modern computer systems higher computational power, these trends also make silicon devices more vulnerable to various types of faults [14]. Transient faults caused by external events such as alpha-particle strikes, cosmic rays, or radiation from radioactive atoms [11] are expected to become more frequent as noise margins shrink, while process variation, voltage and temperature fluctuation, and in-progress wear-out [5, 2] expose systems to frequent irregular faults known as intermittent faults. These transient and intermittent faults usually lead to program failures, making fault resilience one of the most important design concerns.
A system-level fault tolerant solution typically integrates three functions: fault detection, execution checkpointing, and error recovery. Previously when the fault rate was low, many fault tolerance approaches focused on optimizing fault detection [12, 10] or checkpointing strategies [21, 18] since their overheads were consistently imposed onto the system regardless of whether a fault occurred or not. Those techniques performed recovery in a straightforward manner and sometimes even traded off recovery to reduce fault detection overhead [7, 15, 9] . Unfortunately, those approaches are insufficient and inefficient for future systems, especially the ones with high fault rates or with real-time constraints. In order for the computation to progress and meet deadline constraints before encountering another fault, the ability to quickly recover from a detected fault is critical.
* This work is supported by NSF grant #1253733.
The goal of this paper is to reduce recovery overhead without affecting overheads of the other two fault tolerance functions. We propose a scheme which leverages a set of compile-time optimizations to design a smart recovery plan for each application hotspot (i.e., a frequently executed loop). Our scheme targets hotspots because their code sizes are small but the runtime benefit is significant. For each hotspot, the proposed scheme analyzes instruction dependencies to identify the minimal set of instructions to be re-executed in different fault scenarios. Registers are also statically renamed to mitigate ambiguity in fault diagnosis. Such fault scenario and recovery plan information is efficiently delivered to the hardware for quick runtime fault recovery. The proposed compile-time and runtime optimizations are respectively implemented in LLVM [8] and GEM5 [1]. Experiments show that the scheme is capable of not only efficiently detecting faults, but also quickly recovering from them.
The rest of this paper is organized as follows: Section 2 briefly reviews previous work on error recovery. Section 3 outlines the fault tolerance framework and the technical motivation. Section 4 presents the proposed compiler-directed runtime recovery scheme. Section 5 describes the experimental framework and presents the results, while Section 6 concludes the paper.
BACKGROUND AND RELATED WORK
When a fault is detected, recovery is necessary, either by preventing faults from modifying computation states, or by rolling the execution back to a previously saved clean checkpoint. One technique of the first category is presented for chip multiprocessors in [7]. To detect faults, two threads of the same program are run in parallel. The trailing thread repeats and checks the computation performed by the leading thread. Instructions of the trailing thread are not committed until their correctness is verified, and the register state of this thread is used for recovery. This scheme imposes high detection overhead on the system because it needs to constantly check and buffer the results for every instruction. To reduce this overhead, a technique that dynamically creates the dependence chain among instructions and only checks the last instruction in the chain is presented in [15]. Yet the reduction in comparison overhead is insignificant since it can only create simple dependence chains. In sum, both recovery techniques need to execute all the faulty instructions even if they may not affect the correctness of program output.
The second type of recovery technique allows faulty results to be committed to registers. Once a fault is detected, the computation is rolled back to the last checkpoint and re-executed from there. To detect faults, some instructions are selected as "detection points". For example, the technique in [20, 13] distance between two comparison points leads to higher recovery overhead. In [4], only the live 1 registers are compared and checkpointed, while faults in dead registers are ignored. A follow-up work [3] aims to balance detection and recovery overhead by adapting checkpoint frequency according to runtime fault rates. In sum, these recovery techniques need to execute all the instructions between the last checkpoint and the detection point, without any discrimination between them.
A MOTIVATING EXAMPLE
As reviewed before, most existing fault tolerant solutions perform error recovery in a straightforward manner by rolling the computation back to the last checkpoint and re-executing all the instructions from there. This straightforward process neither reduces the amount of re-execution nor reuses any clean but not checkpointed value.
To illustrate the potential for reducing recovery overhead, consider the hotspot code shown in Figure 1 (a). The program is duplicated into two threads for redundant execution. T1 is the leading thread whose results are periodically verified by T2, the trailing thread. Fault detection is performed by comparing store values and addresses and periodically checkpointing the live registers. Checkpoints are placed at the loop entry before instruction 1. When T2 reaches instruction 6, since it is a store, T2 checks its value and address and detects a fault. In the default recovery approach, registers of both threads are restored to their last checkpointed values, and instructions 1-6 all need to be re-executed. This recovery strategy is independent of the fault origin. In other words, no matter which and how many instructions are faulty, the recovery overhead is the same.
However, a detailed examination of the Data Dependency Graph (DDG) indicates that upon detecting a fault, not all the instructions following the last checkpoint need to be re-executed. The DDG of the code in Figure 1(a) is shown in Figure 1(b). Nodes represent instructions, while an edge (u, v) represents a register value produced (written) by u and consumed (read) by v. Each edge is labeled with the register operand passed between the two instructions. The checkpointed register set which includes the live-in registers r0, r4, and sp is also depicted in the figure. Now consider the scenario of a fault originated in instruction 5 and detected at instruction 6. The default recovery policy has two limitations. First, it re-executes all instructions 1-6 indiscriminately, while according to the DDG, only instructions 5 and 6 are definitely faulty. The ones not located on the faulty instruction's dependence chain, such as instruction 4, can potentially be skipped during re-execution. Second, all the data produced by instructions 1-6 in the faulty run are thrown out, although some of them might still be correct and useful, such as results in r2, r1, and r4 produced by instructions 1, 2, and 4, respectively.
1 At any point of the code, a register is live if its subsequent access is a read, and dead if its subsequent access is a write.
Figure 2: System Framework
The example reveals the general guidelines for reducing recovery overhead. Specifically, the recovery process only needs to re-execute the detected faulty instruction(s) as well as the instructions that either directly, or through a dependence chain, produce inputs for the faulty instruction(s). Instructions whose results have not been affected by the fault can be skipped during re-execution, and their results could potentially be used for executing other instructions. Motivated by this observation, we propose a scheme to minimize the recovery overhead in application hotspots.
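The guideline above amounts to a backward traversal of the DDG that stops at producers whose results are known to be reusable. The following is a minimal sketch, not the authors' implementation. The edges 1→5, 3→5, and 2→3 follow the walk-through given later in the text; the edge 5→6 and the names `recovery_slice`, `DDG_PREDS`, and `reusable` are assumptions made for illustration.

```python
def recovery_slice(ddg_preds, faulty, reusable=frozenset()):
    """Return the faulty instructions plus every producer that feeds them,
    directly or through a dependence chain, skipping reusable producers."""
    to_visit = list(faulty)
    selected = set()
    while to_visit:
        instr = to_visit.pop()
        if instr in selected or instr in reusable:
            continue
        selected.add(instr)
        # Follow DDG edges backwards: producers of this instruction's inputs.
        to_visit.extend(ddg_preds.get(instr, ()))
    return sorted(selected)


# Hypothetical DDG: instruction -> instructions that produce its register inputs.
DDG_PREDS = {3: [2], 5: [1, 3], 6: [5]}

# Without any knowledge of clean results, the whole dependence chain is selected.
whole_chain = recovery_slice(DDG_PREDS, faulty=[5, 6])
# If the results of instructions 1 and 2 are known to be clean, they are reused.
pruned = recovery_slice(DDG_PREDS, faulty=[5, 6], reusable={1, 2})
```

In both cases instruction 4 is never visited because it is not on the dependence chain of the faulty instructions. Whether it must still be re-executed for other reasons (for example, restoring a checkpointed input) is decided by the full algorithm presented in Section 4.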
PROPOSED SCHEME

Framework Overview
The proposed scheme improves recovery efficiency through a collaboration between dynamic and static optimizers, as shown in Figure 2. The dynamic optimizer performs fault detection, checkpointing, and recovery under the guidance of the static optimizer, which is responsible for analyzing the code and generating recovery plans for the dynamic optimizer to use.
Dynamic optimizer
At runtime, the application is duplicated into two loosely coupled threads for redundant execution. Checkpoints are created at loop entries and include all the registers which are live-in to the loop. Algorithm 1 shows the runtime fault detection process performed by the trailing thread. Store instructions and checkpoints serve as detection points (dtcPnt). At a checkpoint, all the live registers are verified by the trailing thread, while upon a store, its value and address are compared. If no fault is detected, the trailing thread continues execution. If any fault is detected, the entire register sets of both threads are compared to identify all the registers with inconsistent values. Then the processor recovers the computation in the way shown in Algorithm 2. For every mismatch detected, the processor accesses the recovery plan, implemented as a direct mapped cache that accepts partial bits of dtcPnt as the index and the rest of dtcPnt alongside two other inputs as the tag, to get the MRS and RstRegs of the detected fault scenario. The MRS uses a sequence of 0s and 1s to represent instructions that can be skipped during the recovery process and the ones that have to be re-executed. The bottom part of Figure 2 shows how to selectively re-execute instructions according to MRS. Specifically, when re-execution starts from the checkpoint, each time a new PC is calculated, MRS is right-shifted as well. The rightmost bit of MRS is the selection signal of the multiplexer, which decides whether a new instruction should be fetched or replaced with a No-Operation (NOP).
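The MRS-driven replay can be modeled in a few lines. This is a sketch under two assumptions that the text does not state explicitly: a bit value of 1 means "re-execute" and 0 means "skip", and the rightmost bit corresponds to the first instruction after the checkpoint. The helper names `mrs_from_indices` and `replay` are hypothetical.

```python
NOP = "nop"


def mrs_from_indices(indices):
    """Build an MRS bit vector from 1-based instruction positions after the checkpoint."""
    mask = 0
    for position in indices:
        mask |= 1 << (position - 1)
    return mask


def replay(instructions, mrs):
    """Model the multiplexer: the rightmost MRS bit selects fetch (1) or NOP (0),
    and MRS is shifted right each time the PC advances."""
    issued = []
    for instr in instructions:
        issued.append(instr if mrs & 1 else NOP)
        mrs >>= 1
    return issued


# Re-execute only instructions 3-6 of a six-instruction loop body.
mask = mrs_from_indices([3, 4, 5, 6])
trace = replay(["i1", "i2", "i3", "i4", "i5", "i6"], mask)
```

Here `trace` replaces the first two instructions with NOPs and issues the remaining four.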
To ensure the correctness of the detection and recovery process, three constraints are enforced. First, the branch outcomes of the leading thread are buffered and delivered to the trailing thread for comparison, so as to detect faults in the control flow. Second, the leading thread waits for the trailing thread at the checkpoint, and both threads resume their executions after creating a new checkpoint. Third, store instructions are not committed until a new checkpoint is created. This is to avoid memory inconsistency caused by a load followed by a store to the same memory location [19] .
Static optimizer
As depicted in Figure 2 , entries of the recovery plan provide MRS and RstRegs for a particular register state vector (RegStat), identified at a specific detection point, reached from a specific path (i.e., a certain Branch Prediction History (BPH)) in the hotspot. Such information is extracted by the static optimizer. One challenge faced by the static optimizer is the inability to precisely locate runtime faults at compile time. To cover all possibilities, a recovery plan is needed for each possible combination of faulty registers. In order to minimize the amount of information to be extracted and delivered to the dynamic optimizer, the static optimizer defines prime fault scenarios which are indecomposable and can be used to derive the remaining non-prime fault scenarios.
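A software model of the recovery plan lookup may help make the indexing concrete. The sketch below assumes that the "two other inputs" forming the tag are BPH and RegStat, as suggested by the description of the plan entries; the class name `RecoveryPlan`, the number of index bits, and the example values are illustrative only.

```python
class RecoveryPlan:
    """Direct mapped table: low bits of dtcPnt select a line, and the remaining
    dtcPnt bits together with BPH and RegStat form the tag."""

    def __init__(self, index_bits):
        self.index_bits = index_bits
        self.lines = [None] * (1 << index_bits)

    def _split(self, dtc_pnt, bph, reg_stat):
        index = dtc_pnt & ((1 << self.index_bits) - 1)
        tag = (dtc_pnt >> self.index_bits, bph, reg_stat)
        return index, tag

    def insert(self, dtc_pnt, bph, reg_stat, mrs, rst_regs):
        index, tag = self._split(dtc_pnt, bph, reg_stat)
        self.lines[index] = (tag, mrs, rst_regs)

    def lookup(self, dtc_pnt, bph, reg_stat):
        """Return (MRS, RstRegs) on a hit, or None when the tag does not match."""
        index, tag = self._split(dtc_pnt, bph, reg_stat)
        line = self.lines[index]
        if line is None or line[0] != tag:
            return None
        return line[1], line[2]


plan = RecoveryPlan(index_bits=4)
# Hypothetical entry: detection point 0x8F4C, path 0b1, only r0 faulty.
plan.insert(0x8F4C, 0b1, 0b0001, mrs=0b111100, rst_regs=0b10000)
hit = plan.lookup(0x8F4C, 0b1, 0b0001)
```

A real implementation would likely drop the always-zero low bits of an aligned PC before indexing; that detail is omitted here.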
To identify the MRS for each prime fault scenario, the instructions that have to be re-executed during the recovery process should be determined, and the input availability of these instructions should be checked to form the RstRegs. Overall, instructions can be classified into the following three categories:
• Clean: instruction's output register is not overwritten by any other instruction and is clean.
• Faulty: instruction's output register is not overwritten by any other instruction and is faulty.
• Ambiguous: instruction's output register is overwritten by another instruction.
As a concrete example, consider the fault scenario shown in Figure 1. The RegStat at the detection point and the instruction classification are shown in Figure 1(c).
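The three categories can be computed from the last writer of each register before the detection point. The sketch below uses a hypothetical six-instruction body whose destination registers are chosen to be consistent with the text (instructions 1, 2, 4, and 5 write r2, r1, r4, and r0); the destination of instruction 3 is an assumption, and the function name `classify` is illustrative. Stores have no output register and are left unclassified here.

```python
def classify(code, faulty_regs, dtc_pnt):
    """Classify each register-writing instruction up to the detection point.

    code: list of (instruction number, destination register or None), in program order.
    """
    body = [(i, dst) for i, dst in code if i <= dtc_pnt and dst is not None]
    # Later entries overwrite earlier ones, so this maps each register to its last writer.
    last_writer = {dst: i for i, dst in body}
    classes = {}
    for i, dst in body:
        if last_writer[dst] != i:
            classes[i] = "ambiguous"  # output overwritten by a later instruction
        elif dst in faulty_regs:
            classes[i] = "faulty"     # last writer of a register found faulty
        else:
            classes[i] = "clean"      # last writer of a register found clean
    return classes


# Hypothetical loop body; instruction 6 is a store and has no destination register.
CODE = [(1, "r2"), (2, "r1"), (3, "r0"), (4, "r4"), (5, "r0"), (6, None)]
result = classify(CODE, faulty_regs={"r0"}, dtc_pnt=6)
```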
Given this instruction classification, the goal of the static optimizer is to minimize MRS by reducing the size of the ambiguity set. As shown in Figure 2 , it leverages two compiler optimizations to generate an efficient MRS: a Minimum Recovery Set Selecter (MRSS) which classifies instructions and generates the MRS, and an Ambiguity Set Minimizer (ASM) which minimizes the size of the ambiguity set through register renaming. These two optimizations are described below.
Minimum Recovery Set Selecter (MRSS)
MRSS algorithm
MRSS, implemented as an LLVM pass [8], differentiates instructions to be skipped during recovery from the ones to be re-executed. It classifies each instruction not based on its operation or its relation with the others, but based on its status in the fault scenario. For each fault scenario, it first classifies the instructions that form a dependence chain with the faulty instruction and then processes the remaining instructions.
Algorithm 3 shows the MRSS algorithm executed for each fault scenario. It uses a stack to hold all the instructions selected for re-execution and colors them red. It first pushes all the faulty instructions on the stack (Lines 2-6). A faulty instruction is the Last Writer (LW) of a faulty register before the detection point. For example, in Figure 1, r0 is detected as faulty and instruction 5 is its LW. The algorithm traverses the predecessors of faulty instructions in the reverse Depth-First Search (DFS) order in Lines 8-28. For each faulty instruction, this process continues until reaching a clean instruction or the end of the dependence chain. A predecessor instruction is added to the MRS (Lines 15-20) if its output is known to be faulty (a faulty instruction) or if it is not the LW of its destination register and hence its output cleanness is unknown (an ambiguous instruction).
Algorithm 3 also ensures, for each instruction added to the MRS, the availability of its input registers. An instruction gets inputs either from its predecessors in DDG or directly from the checkpoint (i.e., a loop-carried dependence). For the former case, input availability is ensured in the DFS search process. For the latter case, if the instruction's input register(s) are overwritten in the loop body, they need to be restored to their last checkpointed values. These registers form a modified set, as formulated in Equation (1):
Definitions of "Live-in", "Def", and other functions used in this paper are summarized in Table 1. Based on this formula, RstRegs is formed by identifying instructions in MRS which get their input(s) directly from the checkpoint and whose input register(s) are in ModSet (Lines 22-27).
Algorithm 3 MRSS Algorithm
instr ⇐ stack.pop();
10: if (instr.color = Red) then continue;
11: end if
12: instr.color ⇐ Red;
13: for all Pred of instr in DDG do
14: reg ⇐ Pred.dstReg;
15: if
Table 1 (partial): Registers read in loop L; LW(r): last producer of register r before detection point; FP(r): first producer of register r in loop L
Finally, a correct and complete recovery scheme should not only recover the faulty registers, but also preserve the value of clean registers at the detection point. The value of a clean register may change during recovery only if it is selected as one of the RstRegs or if it will be overwritten by an instruction selected for re-execution. To avoid changing any clean register, these two cases are checked in Algorithm 3, and the LW of any register in these two cases is pushed on the stack and added to MRS at Lines 25 and 18, respectively.
As an example, we outline the procedure of Algorithm 3 when applied to the fault scenario shown in Figure 1(b), assuming the fault originates at instruction 5 and is detected at instruction 6. Instructions are processed in the order of {6,5}→{1,3}→{2,4}. Instruction 5 is added to MRS as the LW of the faulty register r0. Instructions 1 and 3 are predecessors of 5, and instruction 2 is the predecessor of 3. When 3 is added to MRS as an ambiguous instruction, the algorithm detects that it needs the checkpointed value of r4 for re-executing 3, and r4 is in ModSet. Therefore, instruction 4 as the LW of r4 is also added to the MRS, and is processed to find further instructions on its dependence chain. The process terminates with MRS containing instructions 3, 4, 5, and 6 and RstRegs containing r4, as shown in Figure 1(c).
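The selection procedure described in this section can be sketched as follows. This is a reading of the prose, not a transcription of Algorithm 3. The six-instruction body is hypothetical and was chosen only to be consistent with the walk-through above; the function name `mrss`, the tuple encoding of instructions, and the treatment of ModSet as the live-in registers that are written in the body are all assumptions.

```python
def mrss(code, live_in, faulty_regs, dtc_pnt):
    """Select instructions to re-execute (MRS) and registers to restore (RstRegs).

    code: list of (instruction number, destination or None, source registers)
    for one straight-line path through the loop body, in program order.
    """
    body = [ins for ins in code if ins[0] <= dtc_pnt]
    by_idx = {ins[0]: ins for ins in body}
    lw = {dst: idx for idx, dst, _ in body if dst is not None}  # last writers
    mod_set = set(live_in) & set(lw)  # assumed: live-in registers overwritten in the body

    def producer(reg, idx):
        """Last writer of reg strictly before idx; None means the checkpoint supplies it."""
        found = None
        for j, dst, _ in body:
            if j >= idx:
                break
            if dst == reg:
                found = j
        return found

    # Start from the detection point and the last writers of the faulty registers.
    stack = [dtc_pnt] + [lw[r] for r in sorted(faulty_regs) if r in lw]
    red, rst_regs = set(), set()
    while stack:
        idx = stack.pop()
        if idx in red:
            continue
        red.add(idx)
        _, dst, srcs = by_idx[idx]
        # Re-executing idx clobbers dst, so the final value of dst must be reproduced.
        if dst is not None and lw[dst] != idx:
            stack.append(lw[dst])
        for reg in srcs:
            pred = producer(reg, idx)
            if pred is None:
                # Input comes from the checkpoint; restore it if the body overwrote it,
                # then reproduce its value at the detection point.
                if reg in mod_set:
                    rst_regs.add(reg)
                    stack.append(lw[reg])
                continue
            pred_dst = by_idx[pred][1]
            ambiguous = lw[pred_dst] != pred
            if ambiguous or pred_dst in faulty_regs:
                stack.append(pred)
    return sorted(red), rst_regs


# Hypothetical loop body consistent with the walk-through (not the code of Figure 1).
CODE = [
    (1, "r2", ("r0",)),
    (2, "r1", ("sp",)),
    (3, "r0", ("r1", "r4")),
    (4, "r4", ("r4",)),
    (5, "r0", ("r0", "r2")),
    (6, None, ("r0", "sp")),  # store: the detection point
]
mrs, rst = mrss(CODE, live_in={"r0", "r4", "sp"}, faulty_regs={"r0"}, dtc_pnt=6)
```

On this input the sketch selects instructions 3, 4, 5, and 6 and marks r4 for restoration, matching the outcome of the walk-through. The traversal order of the sketch may differ from the one listed in the text.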
Covering all fault scenarios
While theoretically the MRSS pass should determine a recovery plan for each possible combination of faulty registers, the amount of such information is exponential, since for n registers there is a total of 2^n fault scenarios. To minimize the amount of information to be extracted and delivered to the dynamic optimizer, the MRSS pass only generates recovery plans for prime fault scenarios, defined as indecomposable cases in which only one register is faulty. The other non-prime fault scenarios that contain more than one faulty register can be considered as a combination of multiple prime fault scenarios. Their recovery plans can be derived at runtime, by forming the union of recovery sets of those prime fault scenarios. Equations (2) and (3) show the way to derive MRS and RstRegs using prime fault scenarios in a system with N registers r1, r2, ..., rN. Only if register ri is faulty will its MRS and RstRegs of the prime scenario be included.
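With MRS and RstRegs stored as bit vectors, the union over prime scenarios reduces to a bitwise OR. The sketch below assumes that bit i of RegStat is set when register ri is faulty; the function name `derive_plan` and the example masks are hypothetical.

```python
def derive_plan(reg_stat, prime_plans):
    """Combine prime-scenario plans for every register flagged faulty in reg_stat.

    prime_plans maps a register number i to (MRS mask, RstRegs mask) for the
    prime scenario in which only register i is faulty.
    """
    mrs, rst_regs = 0, 0
    for i, (prime_mrs, prime_rst) in prime_plans.items():
        if (reg_stat >> i) & 1:  # include a prime plan only if its register is faulty
            mrs |= prime_mrs
            rst_regs |= prime_rst
    return mrs, rst_regs


# Hypothetical prime plans for registers 0 and 2.
PRIME_PLANS = {0: (0b111100, 0b10000), 2: (0b000001, 0b00000)}
combined = derive_plan(0b101, PRIME_PLANS)  # registers 0 and 2 both faulty
```

The table therefore needs one entry per register and detection context rather than one per subset of registers.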
Ambiguity Set Minimizer (ASM)
Modern register allocators carefully examine variable live ranges and employ different heuristics to minimize the number of registers spilled to memory. It is very likely to write a physical register multiple times inside the hotspot. Although this policy works in favor of code performance, it increases the size of the ambiguity set. The more registers overwritten in the hotspot, the less chance of minimizing the MRS.
The ASM pass runs counter to the current register allocator's goal: it minimizes the ambiguity set by avoiding overwriting registers inside hotspots as much as possible. While this goal can be achieved via Static Single Assignment (SSA) 2, this approach works only if there are enough free registers to hold all the assignments inside the hotspot. Instead, the ASM pass creates a pseudo-SSA form inside hotspots. It applies two optimizations: creating as many free registers as possible and, when overwriting a register is unavoidable, prioritizing the decisions according to the recovery overhead.
First, in the ASM pass, registers which are not defined or used inside the loop body are considered free, as shown below:
To preserve program data flow, live-in registers that are not accessed inside the loop are stored to memory before entering the hotspot, and loaded from memory upon exiting the hotspot. Such overhead is trivial since these store and load instructions are executed only once outside the loop.
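The free-register computation described above can be sketched as a set difference. The function names, the register list, and the example loop body are assumptions for illustration; `all_regs` is assumed to list only registers that the allocator may use.

```python
def free_set(all_regs, code):
    """Registers that are neither defined nor used inside the loop body."""
    touched = set()
    for _, dst, srcs in code:
        if dst is not None:
            touched.add(dst)
        touched.update(srcs)
    return set(all_regs) - touched


def spill_candidates(live_in, free):
    """Live-in registers that are free inside the loop: saved to memory before
    the hotspot and reloaded after it, so their values survive renaming."""
    return set(live_in) & set(free)


# Hypothetical loop body: (instruction number, destination, sources).
CODE = [
    (1, "r2", ("r0",)),
    (2, "r1", ("sp",)),
    (3, "r0", ("r1", "r4")),
    (4, None, ("r0", "sp")),
]
ALL_REGS = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "sp"]
free = free_set(ALL_REGS, CODE)
spills = spill_candidates({"r0", "r4", "sp", "r6"}, free)
```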
Once the FreeSet is obtained, the ASM pass starts to rename registers inside the hotspot, by processing instructions from the bottom of the DDG in a reverse Breadth First Search (BFS) order. This processing order pushes potential register overwrites to the beginning of the DDG. It helps reduce recovery overhead in worst case scenarios since usually more instructions have to be re-executed to recover from a fault originated at the upper levels of DDG, compared to faults that are closer to the detection points.
In the renaming process, if the target instruction has a destination, it is renamed to a register obtained from the FreeSet. The define-use chain is preserved by renaming all the subsequent usage of that renamed register. This process repeats until either there is no instruction left to be renamed or the FreeSet is empty. Since one definition for each register is still allowed, for each register the ASM pass marks one instruction as its First Producer (FP) and skips it during the rename process. Note that if a register forms a loop-carried dependence, not the first but the last assignment instruction is marked as its FP.
2 A program is in SSA form if each variable is assigned exactly once, and every variable is defined before it is used [6].
Figure 3 (recovered panel data): Faulty Instr.: 6, 8, 9; Amb. Instr.: 1, 2, 3, 4, 7; Clean Instr.: 5
Figure 3 illustrates the ASM process and its effect on the ambiguity set and MRS size. The fault scenario is a fault originating at instruction 6 and detected at instruction 9. For the original code, the MRS in Figure 3(c) shows that every instruction has to be re-executed; even the clean instruction 5 has to be re-executed to re-produce r1. In comparison, Figure 3(d) shows the recovery information for the renamed code as well as the instruction processing order. As can be seen, the ASM pass processes instructions in the reverse BFS order, and renames the ones not marked with * except for instruction 2. Since there are only 4 free registers, which are used to rename instructions 5, 6, 7 and 4, the FreeSet is empty by the time instruction 2 is processed. Although instruction 2 is ambiguous, it is not added to MRS since its successor instruction 3 is known to be clean. Overall, this example shows that the ASM pass successfully reduces the ambiguity set to 1, and as a result, MRS size is decreased from 9 to 3 and RstRegs becomes ∅.
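A simplified model of the renaming step is sketched below. It is not the ASM pass itself: it walks a straight-line body in reverse program order as a stand-in for the reverse BFS traversal of the DDG, it does not handle registers that are live after the loop, and the function name `asm_rename` and the example code are hypothetical.

```python
def asm_rename(code, free_regs, loop_carried=frozenset()):
    """Rename repeated definitions to free registers, processing bottom-up.

    code: list of (instruction number, destination or None, source registers).
    Each register keeps one definition: the last one if it carries a value
    across iterations, otherwise the first one.
    """
    defs = {}  # register -> positions of the instructions that write it
    for pos, (_, dst, _) in enumerate(code):
        if dst is not None:
            defs.setdefault(dst, []).append(pos)
    keep = {r: (p[-1] if r in loop_carried else p[0]) for r, p in defs.items()}

    out = [[idx, dst, list(srcs)] for idx, dst, srcs in code]
    free = list(free_regs)
    for pos in range(len(code) - 1, -1, -1):
        reg = code[pos][1]
        if reg is None or keep[reg] == pos:
            continue
        if not free:
            break  # no free register left: remaining overwrites stay in place
        new = free.pop(0)
        out[pos][1] = new
        # Preserve the define-use chain: rename readers up to the next original
        # definition of the same register (that instruction may itself read it).
        for later in range(pos + 1, len(code)):
            out[later][2] = [new if s == reg else s for s in out[later][2]]
            if code[later][1] == reg:
                break
    return [(idx, dst, tuple(srcs)) for idx, dst, srcs in out]


# Hypothetical body in which r1 is written twice.
CODE = [
    (1, "r1", ("r0",)),
    (2, "r2", ("r1",)),
    (3, "r1", ("r2",)),
    (4, None, ("r1",)),
]
renamed = asm_rename(CODE, free_regs=["r5"])
```

After renaming, every register in the example is written at most once, so no instruction in the body is ambiguous.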
EXPERIMENTAL EVALUATION
Experiment Setup
The static optimizer is implemented in LLVM 3.7 [8]. Both the MRSS and ASM passes are developed for the ARM architecture. MiBench benchmarks are used in the experiment. For each benchmark, two sets of recovery plans are generated, one for the baseline code (without ASM) and one for the converted code (with ASM).
The runtime fault detection, checkpointing, and recovery functions are simulated in GEM5 [1] , under SE mode with L1 and L2 caches, using simple timing CPU model for the ARM architecture. Each benchmark is duplicated into two threads: a clean thread that executes normally, and a faulty thread with faults injected. According to [16] , 70% of faults in the processor lead to inconsistencies in the register file. Therefore, our experiments inject faults into the register file, by randomly selecting one instruction and injecting a single bit-flip into one of its source or destination registers. A faulty thread is injected with 1000 faults inside its hotspots. Fault detection is implemented in the post-execution stage of GEM5. Two threads are compared at store instructions and at checkpoints, in the way described in Algorithm 1.
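The injection step can be modeled as follows. This is an illustrative sketch rather than the GEM5 modification used in the experiments; the 32-bit register width, the function name `inject_fault`, and the example code and register values are assumptions.

```python
import random


def inject_fault(code, regs, rng, width=32):
    """Pick one instruction at random and flip a single bit in one of its
    source or destination registers. Returns (instruction, register, bit)."""
    idx, dst, srcs = rng.choice(code)
    operands = list(srcs) + ([dst] if dst is not None else [])
    reg = rng.choice(operands)
    bit = rng.randrange(width)
    regs[reg] ^= 1 << bit  # single bit-flip in the register file
    return idx, reg, bit


# Hypothetical loop body and register file contents.
CODE = [
    (1, "r2", ("r0",)),
    (2, "r1", ("sp",)),
    (3, None, ("r1", "r2")),
]
regs = {"r0": 7, "r1": 0, "r2": 42, "sp": 0x1000}
before = dict(regs)
where = inject_fault(CODE, regs, random.Random(0))
```

Passing an explicit `random.Random` instance keeps a fault campaign reproducible across runs.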
Results
Impact of ASM: Table 2 reports the percentage of hotspot code in the SSA form, before and after applying ASM. On average, the ASM pass increases the SSA ratio of the code by 49%. Variations across benchmarks are due to the differences in their hotspot sizes and the number of free registers. By increasing the SSA ratio of the code, ASM is able to reduce the size of the ambiguity set and hence the MRS size. Columns 4-6 of Table 2 respectively show the min, average, and max reductions in MRS size. On one hand, the average maximum MRS size reduction of 58% confirms the effectiveness of ASM. On the other hand, the minimum reduction is 0% since ASM has no effect on those hotspots already in the SSA form. The existence of such cases pulls the average reduction down to 9%.
Fault classification: A classification of all the injected faults is performed to demonstrate the fault coverage of the experiment framework and quantify the rate of injected faults to which the proposed recovery scheme is applicable. Specifically, faults are classified into six categories: Crashed faults lead to unsuccessful termination of the program, Benign faults are neither detected nor manifested as an error in program output, SDC (Silent Data Corruption) faults corrupt program output without being detected, Control Faults are detected and they change program control flow, DUE (Detected Unrecoverable) faults are detected but cannot be recovered, while Recoverable faults are detected and recovered successfully. Figure 4 shows the classification of injected faults for different benchmarks. The average rates of Benign, Crashed, Control faults, and SDC are 13%, 26%, 4%, and less than 1%, respectively. The remaining 56% of faults are all Recoverable, and there is no DUE fault in the system. These results show that the proposed framework is applicable to 56% of faults.
Our experiments also confirm that there is no difference between the baseline code and the converted code in terms of fault coverage. The converted code never changes the instruction dependence chain and registers omitted from the checkpointed set, if any, never affect fault propagation and detection in the hotspot. In other words, in the two versions of code, a fault always propagates through the same path and is detected at the same detection point.
Recovery latency: Figure 5 depicts the recovery overhead reduction achieved by the proposed scheme compared to the traditional recovery scheme, represented as the number of clock cycles in GEM5 dedicated to the recovery process. Even for the baseline code without ASM pass being applied, the MRSS pass still reduces recovery time by 68%. The ASM pass reduces recovery overhead by an additional 4%, leading to a total reduction of 72%. Effect of ASM on different benchmarks is in line with the average MRS size reduction results in Table 2. Since faults are injected randomly in GEM5, a run-time fault scenario could be a combination of multiple prime scenarios. The relatively small reduction achieved by ASM indicates that many of the prime scenarios used at runtime are best case scenarios whose size cannot be significantly decreased by ASM.
Recovery plan table size: As shown in Figure 2, the recovery plan is stored in a direct mapped cache for the dynamic optimizer to access. Size of the cache, which is constrained by the largest hotspot in each benchmark, can be computed using the following equation:
log_2(entries) * (PCTag + BPH + RegStat + MRS + RstRegs)    (5)
Here, entries is the total number of entries in the recovery plan, RegStat and RstRegs are both 16-bit values with each bit corresponding to an ARM register, PCTag is the portion of PC bits used as tag in Figure 2, BPH is the minimum number of bits needed to identify a specific path, and MRS size is the number of instructions in the largest hotspot. These values and the cache size are reported in Table 3. It can be seen that a 16KB cache is sufficient to store recovery plans of all the benchmarks. According to CACTI [17], the access latency of this cache is 0.3ns which is usually shorter than one cycle.
CONCLUSIONS
In this paper, a fault resilience scheme that minimizes the recovery overhead in application hotspots has been proposed. Compile-time optimizations are leveraged to generate recovery plans for different fault scenarios to guide runtime error recovery. These optimizations, implemented in LLVM, identify the minimum set of instructions to be re-executed during the recovery process and exploit renaming techniques to further reduce the re-execution set. To evaluate the proposed scheme, intensive fault injection studies have been conducted in GEM5.
Results show that on average the recovery overhead can be reduced by 72%. Such a low-overhead recovery process enables current and future systems to efficiently tolerate the elevated and varying rates of faults.
