Abstract
Introduction
Dramatic increases in the number of transistors that can be integrated on a chip will deliver great performance gains. However, they will also expose a major roadblock, namely the poor reliability of the hardware. Indeed, in the near-future environment of low power, low voltage, relatively high frequency, and very small feature size, processors will be more susceptible to transient errors. Transient faults, also known as soft errors, are due to impacts from high-energy particles or other random external events that change the logic values of latches or logic structures. Error detection schemes are needed to ensure that a soft error does not go undetected and result in an erroneous computation. Once errors are detected reliably, it is often possible to use software schemes for error correction. The performance of error correction schemes is not critical, as long as errors are not too frequent; error detection, however, adds an overhead to all computations and has to perform efficiently. For this reason, we focus in our project on error detection.
Hardware error detection is used on modern microprocessors to detect errors in storage and buses: for example, ECC memory and parity bits for caches and various buses. Such codes generally add a low overhead to performance and chip size. On the other hand, it is much harder to detect errors in the computation pipeline: one needs to replicate significant fractions of the CPU logic in order to do so. Such replication is used in high-end fault-tolerant systems, such as IBM mainframes, HP NonStop or mission-critical computers. It is unclear whether the cost of essentially doubling hardware complexity is acceptable for commodity systems. For such systems, software error detection may be a preferable solution. The main advantage of software checking is its flexibility: different trade-offs between performance and reliability can be achieved on the same hardware, using different software approaches; fault-tolerant hardware cannot offer the same flexibility. Such flexibility can be used, for example, to achieve a higher level of reliability for large clusters built out of commodity components: a PC might be built to have a mean time between undetected failures (MTTUF) of, say, 10 years; this would result in an unacceptable MTTUF of about half a week for a 1000-node PC cluster. Alternatively, the flexibility may be used to achieve different levels of reliability for different software components: one may not care about undetected errors that affect the PC display during a game, but may want to avoid errors that corrupt the file system metadata.
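The cluster figure above follows from simple arithmetic. The short sketch below reproduces it under an assumption that the text does not state explicitly: undetected failures on different nodes are independent and occur at a constant rate, so the per-node failure rates add up. The function name is ours.

```python
# Assumption (not from the text): undetected failures on different nodes are
# independent and occur at a constant rate, so per-node failure rates add.
HOURS_PER_YEAR = 365 * 24


def cluster_mttuf_hours(node_mttuf_years: float, nodes: int) -> float:
    """Mean time to an undetected failure anywhere in the cluster, in hours."""
    node_rate_per_hour = 1.0 / (node_mttuf_years * HOURS_PER_YEAR)
    return 1.0 / (node_rate_per_hour * nodes)


hours = cluster_mttuf_hours(node_mttuf_years=10, nodes=1000)
print(f"{hours / 24:.2f} days")  # 10 years spread over 1000 nodes
```

Under this assumption the result is 87.6 hours, or about 3.65 days, which matches the "half a week" quoted for the 1000-node cluster.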
Current software approaches address the problem by replicating the instructions and adding checking instructions to compare the results, but they add a significant overhead. In this project, we are investigating compiler techniques to reduce the overhead of the software error checking approaches while maintaining similar reliability. We have proposed three novel techniques:
• The first technique is based on the fact that programs already have redundancy: if the compiler can determine the program sections where such redundancy exists, it can avoid the replication and later checking. We use boolean logic to identify a code pattern that corresponds to outcome-tolerant branches and develop a compiler algorithm that automatically finds those patterns and removes the unnecessary replicas.
• The second technique is based on the observation that faults that corrupt the application tend to quickly generate other noisy errors such as segmentation faults [10]. Thus, we can reduce replication of the instructions that tend to generate these types of errors, trading reliability for performance. In this paper, we remove the checks of the memory addresses and discuss situations where removing these checks has little effect on the fault coverage.
• The third technique considers the situation where the register file is hardware-protected with parity or ECC, as in Intel Itanium [6], Sun UltraSPARC [4] and IBM Power4-6 [1]. On these platforms, which we call register safe platforms, some checks and shadow registers are unnecessary and can be removed.
We have implemented the baseline replication and the proposed optimization techniques using the LLVM compiler infrastructure [5] and run experiments on Pentium 4 platforms using SPEC benchmarks. We have also built a fault injection framework to measure the reliability of our techniques.
The rest of the paper is organized as follows. Section 2 presents the baseline software checking; Section 3 describes the techniques to detect outcome tolerant branches; Section 4 describes the removal of address checks; Section 5 discusses the benefits of having a register file that is checked in hardware; Section 6 presents our experimental results; Section 7 concludes the paper and discusses our future work.
Baseline Software Checking
Software techniques such as SWIFT [7, 8] replicate the instructions of the original program and interleave the original instructions and their replicas in the same thread. Memory does not need to be replicated because the memory hierarchy is protected with ECC and scrubbing. Stores, loads, branches, function calls and returns are considered "synchronization" points and checking instructions are inserted before these instructions to validate certain values.
An example with the original code and its corresponding augmented code is shown in Figure 1-(a) and (b), respectively. The augmented code contains additional instructions that are shown in bold and uses additional registers that are marked with a '. Instructions 1 and 2 check that the load is loading from the correct address, instruction 3 makes a shadow copy of the value loaded in r3 into r3', instruction 4 replicates the addition, and instructions 5-8 check that the store writes the correct data to the correct memory address.
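Since the figure is not reproduced here, the following Python sketch simulates the augmented sequence as the text describes it. It is an illustration under stated assumptions, not the code of Figure 1: the addend (1), the memory layout, the helper names, and the way a fault is injected are all hypothetical. Only the register names and the numbering of the added instructions follow the text.

```python
class FaultDetected(Exception):
    """Raised when an original register and its shadow copy disagree."""


def check(regs, name):
    # Checking instructions: compare a register against its shadow copy.
    if regs[name] != regs[name + "'"]:
        raise FaultDetected(name)


def run_augmented(regs, memory, flip_r4=False):
    check(regs, "r6")                  # 1-2: validate the load address
    regs["r3"] = memory[regs["r6"]]    # original load
    regs["r3'"] = regs["r3"]           # 3: shadow copy of the loaded value
    regs["r4"] = regs["r3"] + 1        # original addition (addend is assumed)
    regs["r4'"] = regs["r3'"] + 1      # 4: replicated addition
    if flip_r4:
        regs["r4"] ^= 1 << 3           # hypothetical transient fault in r4
    check(regs, "r4")                  # 5-6: validate the data to be stored
    check(regs, "r6")                  # 7-8: validate the store address
    memory[regs["r6"]] = regs["r4"]    # original store


memory = {0x100: 41}
run_augmented({"r6": 0x100, "r6'": 0x100}, memory)
print(memory[0x100])  # 42

try:
    run_augmented({"r6": 0x100, "r6'": 0x100}, {0x100: 41}, flip_r4=True)
except FaultDetected as err:
    print("fault detected in", err)  # fault detected in r4
```

The second run shows the purpose of instructions 5-6: the corrupted r4 no longer matches r4', so the fault is caught before the store commits the wrong value to memory.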
Use of Boolean Logic to Find Outcome Tolerant Branches
This technique is based on the fact that programs have redundancy. For instance, Wang et al. [9] performed fault injection experiments and found that about 40% of all dynamic conditional branches are outcome tolerant. These are branches that, despite an error, converge to the correct point of execution. These branches are outcome tolerant due to redundancies introduced by the compiler or the programmer. An example of an outcome-tolerant branch appears in a structure such as if (A || B || C) then X else Y. In this case, if A is erroneously computed to be true, but B or C is actually true, the branch is outcome tolerant, since the code converges to the correct path. The control flow graph of this structure is shown in Figure 2-(a).
The state-of-the-art approach to check for errors is to replicate branches as shown in Figure 2-(b), where the circles correspond to the branch replicas. However, we can reduce overheads by removing the comparison replica when the branch correctly branches to X. If the original comparison in A is true, we need to execute the comparison replica to verify that the code correctly branches to X. However, if A is false, we can skip the execution of the A replica and move on to check B. We will only need to execute the A replica if both B and C are also false. The resulting control flow graph is shown in Figure 2-(c). In situations where A and B are false, but C is true, we can save a few comparisons. Outcome-tolerant branches also appear in code structures such as if (A && B && C) then X else Y, and in general in all code structures that contain one or more shortcut paths in the control flow graph. A basic shortcut path is edge(A->X) in Figure 3-(a), where both A and its child point to the same block. However, most shortcut paths are more complex. For instance, in Figure 3-(b), block A points to the same block pointed to by its grandchild (not its direct child). Thus, the optimizer should move A' from edge(A->B) to edge(B->Z) and edge(C->Y), as shown in Figure 3-(c).
Detecting the existence of a shortcut path is not sufficient to determine that there is an outcome-tolerant branch. The reason is that one of the blocks involved in the shortcut (such as block A' in Figure 3) can update a variable that is later used by instructions outside the block. That block needs to be replicated on all the paths; otherwise the update will not be visible outside the block. The algorithm to find and optimize shortcut graphs is described in [11].
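To make the saving concrete, the sketch below counts executed comparisons for the if (A || B || C) pattern under both schemes. It is a simplified model and not the compiler algorithm of [11]: each comparison is represented by the boolean value it produced, a fault is modeled as an original value that differs from its replica, and a mismatch is simply reported as "fault". The function names are ours.

```python
def fully_replicated(originals, replicas):
    """Baseline: every comparison is immediately followed by its replica."""
    executed = 0
    for orig, rep in zip(originals, replicas):
        executed += 2                      # comparison plus its replica
        if orig != rep:
            return "fault", executed
        if orig:
            return "X", executed           # short-circuit to the then-branch
    return "Y", executed


def outcome_tolerant(originals, replicas):
    """Optimized: replicas run only on the comparison that takes the branch
    to X, or for all comparisons when the code is about to fall through to Y."""
    executed = 0
    for orig, rep in zip(originals, replicas):
        executed += 1                      # original comparison only
        if orig:
            executed += 1                  # verify the branch to X
            return ("X" if rep else "fault"), executed
    executed += len(replicas)              # deferred replicas before Y
    return ("fault" if any(replicas) else "Y"), executed


# A and B false, C true, no fault: the replicas of A and B are skipped.
print(fully_replicated([False, False, True], [False, False, True]))  # ('X', 6)
print(outcome_tolerant([False, False, True], [False, False, True]))  # ('X', 4)

# A is really true but was computed as false; C is true, so X is still reached.
print(outcome_tolerant([False, False, True], [True, False, True]))   # ('X', 4)
```

The last call is the outcome-tolerant case: the error in A is never detected, yet execution converges to the correct block X, so skipping the A replica costs nothing in correctness.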
Removal of Address Checks
Recent experiments have shown that faults produce not only data corruption, but also events that are atypical of steady state operation and that can be used as a warning that something is wrong [10] . Thus, we can reduce the overhead of the software approaches and trade reliability for performance by reducing the replication, hoping that the error will manifest with these atypical events. 
Figure 4. Address check removal for pointer chasing
In this section we consider the removal of address checks before load and store instructions. Errors in the registers containing memory addresses may manifest as segmentation faults. Any fault-tolerant system must also include support for rolling back to a safe state. Thus, on a segmentation fault we can roll back and re-execute, and only communicate the error to the user if it appears again. By doing this the system will be vulnerable to errors, since some of these faulty addresses will access a legal space and the operating system will not be able to detect the error. Thus, this technique will decrease error coverage. Next, we discuss two techniques that the compiler can use to determine which load and store instructions are most suitable for address check removal.
Address checks can be removed when there are later checks on the same variable. For example, in Figure 1-(b), checking instructions (1-2) and (7-8) both check register r6. This makes the first check (1-2) unnecessary, because if an error occurs in r6 it will manifest as a segmentation fault or will eventually be detected by the checking instructions (7-8). We have observed many of these checks in the SPEC benchmarks due to the register indirect addressing mode, since the same register is used to access two fields of a structure, or because two array accesses share a common index. Removing these replicated checks can significantly reduce the software overhead.
Address checks can also be removed when the probability of an error in the loaded value is small. This case appears in pointer chasing, where the data loaded from memory is immediately used as the address of a subsequent load. An example is shown in Figure 4-(a) and (b). In this case, since the processor will issue the second load as soon as the first one completes, the probability of error is very small. In some cases, however, the value loaded by the first load is not exactly the one used by the next load, but is first modified by an add instruction. This occurs when accessing an element of a structure other than the first one. In this case, the probability of error is higher, and the checking instructions will also determine whether an error occurred during the computation of the addition. An example is shown in Figure 4-(c) and (d).
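The first rule above can be sketched as a backward scan over a straight-line block. This is only an illustration and not the authors' analysis: the tuple-based instruction format is hypothetical, control flow is ignored, and only address checks are candidates for removal.

```python
def remove_redundant_address_checks(block):
    """Drop an address check when the same register is checked again later in
    the block before being redefined. Handles straight-line code only."""
    checked_later = set()   # registers whose current value is checked later
    kept = []
    for instr in reversed(block):
        op = instr[0]
        if op == "check":
            _, reg, kind = instr
            if kind == "addr" and reg in checked_later:
                continue                     # a later check covers this one
            checked_later.add(reg)
        elif op in ("load", "add"):
            # instr[1] is the destination; later checks see the new value,
            # so they no longer cover earlier uses of this register.
            checked_later.discard(instr[1])
        kept.append(instr)
    kept.reverse()
    return kept


# Hypothetical encoding of the checks described for Figure 1-(b).
figure_1b = [
    ("check", "r6", "addr"),   # instructions 1-2
    ("load", "r3", "r6"),
    ("add", "r4", "r3"),
    ("check", "r4", "data"),   # instructions 5-6
    ("check", "r6", "addr"),   # instructions 7-8
    ("store", "r6", "r4"),
]
for instr in remove_redundant_address_checks(figure_1b):
    print(instr)
```

On this input only the first check of r6 is dropped, since instructions (7-8) check the same, unmodified register before the store. If r6 were redefined between the two checks, both would be kept.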
In this project we evaluate the removal of all address checks, either for loads only or for both loads and stores. Thus, our results are an upper bound on the performance benefit that can be obtained and on the reliability that can be lost. In the future we will write a data flow analysis to identify the checks that are safe to remove, as explained above.
Register Safe Platforms
Our last technique can be applied to processors where the register file is hardware-protected with parity, ECC, or other cost-effective mechanisms such as the ones proposed in [2]. Examples of processors where the register file is already protected with parity or ECC are Intel Itanium [6], Sun UltraSPARC [4] and IBM Power4-6 [1]. In this case, the shadow copy to a register after a load and some of the checking instructions are unnecessary, since an error in the register will be detected by the hardware.
An example is shown in Figure 1. The replicated code in Figure 1-(b) can be simplified as shown in Figure 1-(c). Register r3' is not necessary because registers and memory are safe, so instruction 4 can directly use the contents of register r3. Instructions 1, 2, 7 and 8 can be removed if we assume register r6 has been defined by a load. Instructions 5 and 6 cannot be removed because register r4 is defined by an addition, and we need to validate the result of the addition.
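The rule used in this example can be written down as a small filter over a block of instructions. The sketch below is an assumption-laden illustration: the tuple-based encoding, the "shadow_copy" and "add_replica" opcodes, and the defined_by map are hypothetical, and "shadow_copy" only models the copy made right after a load.

```python
def simplify_for_register_safe(block, defined_by):
    """Remove work that a hardware-protected register file makes redundant."""
    kept = []
    for instr in block:
        op = instr[0]
        if op == "check" and defined_by.get(instr[1]) == "load":
            continue    # value came from ECC memory into a protected register
        if op == "shadow_copy":
            continue    # replicas can read the protected original directly
        kept.append(instr)
    return kept


# Hypothetical encoding of Figure 1-(b); r6 is assumed to be defined by a load.
figure_1b = [
    ("check", "r6"),              # instructions 1-2
    ("load", "r3", "r6"),
    ("shadow_copy", "r3"),        # instruction 3
    ("add", "r4", "r3"),
    ("add_replica", "r4", "r3"),  # instruction 4, now reading r3 directly
    ("check", "r4"),              # instructions 5-6
    ("check", "r6"),              # instructions 7-8
    ("store", "r6", "r4"),
]
defined_by = {"r6": "load", "r3": "load", "r4": "add"}
for instr in simplify_for_register_safe(figure_1b, defined_by):
    print(instr)
```

The check on r4 survives because r4 is produced by a computation, and the protected register file says nothing about whether the adder produced the right value.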
Evaluation
We implemented our optimizations using the LLVM compiler [5] and modified the LLVM backend optimizer so that it does not remove the replicated instructions. For the evaluation we use SPEC CINT2000 and the C codes from SPEC CFP2000, running with the ref inputs. Experiments are done on a 3.6GHz Intel Pentium 4 with 2GB of RAM running Red Hat Linux 9.
Performance
When using boolean logic to eliminate replication and checks of outcome-tolerant branches (Section 3), three benchmarks (gzip, vpr, and perlbmk) achieve a 7% performance gain, though the average speedup across all tested benchmarks is 1.6%. There is also a negative impact on vortex, where we observe more load/store instructions after the optimization, meaning that this optimization introduces additional register spills that offset the benefit of executing fewer dynamic instructions. A performance plot is not shown due to space limitations, but can be found in [11]. On average, the register safe optimization (R) runs 16.0% faster than the fully replicated baseline (FullRep). After we remove checks for the addresses of loads (NAL), we get an average 20.2% speedup over (FullRep). If we further remove checks for the addresses of stores (NALS), we improve by a further 4.6%. If the register file is protected in hardware and we combine (NAL) or (NALS) with (R), we obtain an average speedup of 35.2% and 40.8%, respectively, reducing the software checking overheads by 44.9% and 50%, respectively. Notice that with (NALS) all address checks before loads and stores are removed, so the performance benefit of (R+NALS) versus (NALS) is due to the reduced register pressure (the shadow register after the load is not necessary) and the removal of a few additional checks on the data being stored.
Reliability
Our first technique is very conservative and should not affect the fault coverage. With the second technique, however, since we remove all the checks of memory addresses, memory can be corrupted. To evaluate the loss of fault coverage, we use Pin [3] to inject faults into the binary (excluding system libraries). A single-bit fault is injected into a random register during the execution of the program. After the fault is injected, the program is run to completion (unless it aborts) and its output is compared to a correct output. The outcome is categorized as unACE (output is correct), Detected, Self-Detected (fails program assertions), Seg Fault, or SDC (output is incorrect). SDC is the main type of error we want to prevent.
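The fault model and the outcome categories above can be summarized in a few lines. This sketch does not use Pin and is not the authors' framework: the 32-bit register width, the function names, and the precedence among the categories (a detected fault is reported before an assertion failure, and so on) are assumptions made for illustration.

```python
import random


def flip_random_bit(value, width=32, rng=random):
    """Single-bit fault: flip one randomly chosen bit of a register value."""
    return value ^ (1 << rng.randrange(width))


def classify(detected, assertion_failed, seg_fault, output, golden):
    """Map one fault-injection run to a category from the text."""
    if detected:
        return "Detected"        # caught by the inserted checking instructions
    if assertion_failed:
        return "Self-Detected"   # the program's own assertions fired
    if seg_fault:
        return "Seg Fault"
    return "unACE" if output == golden else "SDC"


original = 0x1234
faulty = flip_random_bit(original, rng=random.Random(0))
print(bin(original ^ faulty).count("1"))          # 1: exactly one bit differs
print(classify(False, False, False, "42", "42"))  # unACE
print(classify(False, False, False, "43", "42"))  # SDC
```

Only the last category is silent: in every other case either the output is still correct or the fault has produced an observable event that a recovery mechanism could act on.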
Our experimental results in Figure 6 show that after the program is replicated (FR), most (Seg Fault), (Self-Detected) and (SDC) outcomes move to the (Detected) category. If we remove checks for load addresses (NAL), then compared to (FR), (SDC) increases from 0.36% to 1.08% and (Seg Fault) increases from 4.47% to 8.05%. If we also remove checks for store addresses (NALS), (SDC) rises to 1.44% and (Seg Fault) rises to 9.02%. Given that we reduce the performance overhead by almost half, this loss of fault coverage seems acceptable.
Figure 6. Fault-detection rate breakdown
Conclusion and Future Work
This paper provides a summary of our ongoing work in the area of software instrumentation for fault tolerance. We have explored how to optimize the software-replicated program and made several contributions. First, we identify a code pattern that corresponds to outcome-tolerant branches, and develop a compiler algorithm that finds these patterns, avoiding unnecessary replication and checking. Second, we evaluate the removal of address checks for loads and stores, and analyze situations where these checks can be removed with little loss of fault coverage. Third, we identify the checks and shadow registers that can be removed when the register file is hardware-protected.
For the future we plan to extend the work in Section 4 and develop compiler algorithms to only remove those checks that can be safely removed. Then, we plan to run the replicated optimized code on a second thread to minimize the performance degradation of the original thread.
