
Introduction
Permanent faults or transient errors in a computer system may affect the control flow of a program, change the system status or modify the data stored in memory. If the system does not perform some run-time checking, an erroneous output may not be detected and may be used as a correct output. Therefore, in highly reliable and dependable computing, it is important to monitor the program to detect any abnormality in the system and take appropriate actions to avoid incorrect outputs.
On the other hand, trends in processor architecture have shown an increasing use of instruction-level parallelism (ILP) to improve performance. In addition to pipelining individual instructions, it has become very attractive to fetch multiple instructions at the same time and execute them in parallel to utilize functional units whenever possible. This form of ILP is called super-scalar execution. It provides a way to exploit available hardware resources in the system. However, the limited ILP in a program prevents full utilization of resources and, consequently, some functional units are idle during execution. Chang and Johnson [1] [2] have shown that the performance gain is not proportional to the maximum number of instructions that are issued simultaneously because of the limited ILP in a single thread of control flow.
Based on data presented in [2] , even in an ideal machine with infinite machine resources, perfect branch prediction, and ideal register renaming, there is no significant improvement in speedup beyond four instruction issues per cycle as shown in Fig.1.1 . Therefore, if we cannot simultaneously execute more than four instructions in one clock cycle but have enough resources, we can use most of the idle resources for error detection.
The focus of this paper is Error Detection by Duplicated Instructions (EDDI), a concurrent error detection technique that exploits the ILP of super-scalar architectures. Duplicated instructions in EDDI have no effect on the results of the program but detect errors in the system during run time. The instructions for error detection are scheduled to utilize idle resources in a super-scalar architecture to reduce performance overhead.
The single event upset (SEU) is one of the major sources of bit flips in memory [4]. A bit flip is an undesired change in the state of a memory cell; an SEU can cause the state of a memory cell to change from 0 to 1 or from 1 to 0. It may modify an instruction of the program or corrupt data stored in memory. EDDI can detect faults that can be modeled as bit flips in the memory. For example, when an instruction is fetched from the memory, one bit of the data bus gets complemented and modifies the opcode field of the instruction. This error can be modeled as a bit flip in the code segment in memory and can be detected by EDDI.
(Fig. 1.1: the graph is cited from [3] and shows the average of the speedups reported in Johnson [2] for several benchmarks.)
Transient errors in functional units, control logic, address buses and data buses can also change the intermediate value of the computation and result in incorrect outputs. A program may deviate from its correct instruction sequence, for example, due to a fault in a branch instruction, resulting in a control flow error. These are errors that can be detected by EDDI.
After addressing related work in Sec. 2, we present the algorithm in Sec. 3 and Sec. 4. We show the detailed analysis of the algorithm in the Appendix (Sec. 10). With the MIPS architecture, we estimate the fault coverage during compile time in Sec. 5 and verify the results by simulation in Sec. 6. In Sec. 7, we compare our technique with different duplication methods, and finally, we conclude the paper in Sec. 8.
Related Work
A traditional concurrent error detection technique is massive redundancy. N-modular hardware redundancy [5] and N-version programming [6] are examples of massive redundancy, but these techniques incur (N-1) hundred percent area or performance overhead. To reduce this overhead, system-level error checking methods have been proposed, such as designing self-checking programs [7], running a separate task for error checking [8], or employing a watchdog processor [9]. Signatures are embedded into the program during compile time as part of the instructions, and a run-time signature is generated and compared with the embedded signatures when the instructions are executed.
In an effort to decrease hardware cost, time redundancy has received attention. Time redundancy methods include alternating logic [22], alternate-data retry [23], data complementation [24], REcomputing with Shifted Operands (RESO) [25] and time redundancy in neural networks [26]. Time redundancy reduces the amount of extra hardware at the expense of additional time. If the system is able to complete its computations before its specified time limit, the extra time can be used for error detection.
Researchers have made attempts to exploit the unused resources of systems for concurrent error checking. In a multi-tasking and multi-computer environment, idle computers are used to execute replicated tasks for error checking [32]. Application of the RESO technique [25] to the Cray-1 has been studied using unutilized resources employing machine parallelism [33]. Utilizing the spare capacity in super-scalar and Very Long Instruction Word (VLIW) processors to tolerate functional unit failures has been proposed in [34]. These two techniques utilize idle resources in the system but need to modify the original processor architectures. Conversely, Available Resource-driven Control-flow monitoring (ARC) [35] does not require the incorporation of a specialized monitor and does not modify the original processor architecture. It is applied to a VLIW machine and exploits ILP for integrated control flow monitoring, but it only detects control flow errors.
The idea of duplicating instructions in a VLIW processor was previously investigated by Holm and Banerjee [36]. They assume fault-free memory operations and control operations, and duplicate only the ALU instructions to detect errors in data paths. For the duplicated ALU instructions, they use the same source registers. The advantage of using the same source registers is that register pressure is kept low. However, if the source registers of the operation are corrupted, the error cannot be detected. In EDDI, we do not assume any fault-free operations. with EDDI show high fault-secure capability: a fault in the system either has no effect on the output or an error signal is reported.
Error Detection by Duplicated Instructions
Preliminaries
Duplicated Instructions in EDDI have no effect on the result of the program but detect errors in the system during run-time. The purpose of EDDI is to check any deviation from the expected behavior of the computer system and detect an error caused by faults introduced into the system.
The basic idea of error detecting instructions is duplication of original instructions in the program but with different registers and variables. A master instruction is the original instruction in the source code and a shadow instruction is the duplicated instruction added to the source code.
General purpose registers and memory are partitioned into two groups, one for master and one for shadow instructions. After executing instructions in the block, the branching direction is determined by the value of the variable j. If a register (master register) holds a value for this variable j and is corrupted while executing instructions, it will produce an unexpected result. Thus, in order to detect erroneous branching, the value in the master register should be compared with the value in a shadow register that holds the same value as j.
A basic block is a branch-free sequence of instructions: there are no jumps into the block except at its first instruction and no jumps out of it except at its last instruction. A storeless basic block is a sequence of instructions in which there is no store instruction except possibly the last one; the last instruction may be a store instruction or a branch instruction. An example is shown in Fig. 3.2.
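The definition of a storeless basic block can be illustrated with a short sketch. It assumes instructions are given as plain mnemonic strings and that `sw`, `sh` and `sb` are the store mnemonics; the function name `split_storeless` and this representation are illustrative assumptions, not part of the original tool.

```python
from typing import List

# Assumed MIPS-style store mnemonics (an assumption for this sketch).
STORE_OPS = {"sw", "sh", "sb"}


def split_storeless(block: List[str]) -> List[List[str]]:
    """Split one basic block into storeless basic blocks.

    Each piece ends at a store instruction; the final piece ends with the
    last instruction of the basic block (a store or a branch).
    """
    pieces: List[List[str]] = []
    current: List[str] = []
    for instr in block:
        current.append(instr)
        if instr.split()[0] in STORE_OPS:
            pieces.append(current)  # a store closes the current piece
            current = []
    if current:
        pieces.append(current)  # remainder ends with the block's last instruction
    return pieces


if __name__ == "__main__":
    basic_block = [
        "lw r1, 0(r4)",
        "add r2, r1, r3",
        "sw r2, 4(r4)",
        "addi r5, r5, 1",
        "bne r5, r6, loop",
    ]
    for piece in split_storeless(basic_block):
        print(piece)
```

Every piece produced this way lies inside the original basic block, which matches the statement that a storeless basic block is always inside a basic block.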
A storeless basic block is always inside a basic block. Within a storeless basic block, shadow instructions are scheduled to maximize resource utilization by attempting to use idle resources, which are not used by master instructions. If the last instruction of a storeless basic block is a store instruction, a comparison instruction is placed before the store to compare the master and shadow register values that will be stored in memory. Scheduling of shadow instructions will be discussed in detail later in Sec. 4. The algorithm adding instructions for error detection is:
If $I_{kn}$ is a store or a conditional branch instruction
Scheduling algorithms will be discussed in detail later in Sec. 4. The correctness of the algorithm and fault coverage will be discussed in detail in Sec. 10.
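The full algorithm listing is not reproduced here, so the following is only a minimal sketch of the idea described in the prose: every instruction gets a shadow copy that uses shadow registers, and a comparison is placed before each store or conditional branch. The `Instr` type, the `r1 -> r1s` naming, the symbolic `SHADOW_OFFSET`, the `error_handler` label, and the choice to compare every register operand are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

STORES = {"sw"}            # assumed store mnemonics
LOADS = {"lw"}             # assumed load mnemonics
BRANCHES = {"beq", "bne"}  # assumed conditional branch mnemonics


@dataclass(frozen=True)
class Instr:
    op: str
    regs: Tuple[str, ...]  # register operands
    target: str = ""       # memory offset or branch label


def shadow(reg: str) -> str:
    """Hypothetical naming of the shadow register paired with a master register."""
    return reg + "s"


def add_eddi(block: List[Instr]) -> List[Instr]:
    """Return the block with shadow instructions and comparisons added."""
    out: List[Instr] = []
    for ins in block:
        if ins.op in STORES or ins.op in BRANCHES:
            # Compare master and shadow values before they leave the registers.
            for r in ins.regs:
                out.append(Instr("bne", (r, shadow(r)), "error_handler"))
        out.append(ins)  # master instruction
        if ins.op not in BRANCHES:
            target = ins.target
            if ins.op in STORES or ins.op in LOADS:
                target = ins.target + "+SHADOW_OFFSET"  # shadow memory copy
            out.append(Instr(ins.op, tuple(shadow(r) for r in ins.regs), target))
    return out


if __name__ == "__main__":
    block = [Instr("add", ("r1", "r2", "r3")), Instr("sw", ("r1", "r4"), "0")]
    for ins in add_eddi(block):
        print(ins)
```

In this sketch the shadow instruction simply follows its master; in the technique described here the shadow instructions would instead be scheduled into idle issue slots.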
Basically, an error caused by the faulty instruction will propagate through the dependency graph to the comparison instruction, which will detect the mismatch between the master and shadow pair.
Control flow errors will be detected if the control flow is transferred to a place from which there exists a path in $G_D$ to a comparison instruction and the number of master instructions executed is not equal to the number of shadow instructions executed on this path. This probability is 2/7 for the scheme in (b) and is much lower than the one in (a), which is 4/7.
The detailed analysis is presented in Appendix (Sec. 10).
Scheduling Instructions Exploiting Instruction-Level Parallelism
The shadow instructions should be scheduled with master instructions to maximize error detection coverage and to minimize time overhead using idle resources. Resource-constrained scheduling is a well-known intractable problem. We will first formalize the scheduling of master and shadow instructions and consider an exact solution method (integer linear programming), then describe heuristic algorithms for static as well as dynamic scheduling.
Integer Linear Programming
$G_{Dk}$ is a subgraph of $G_D$ denoting dependencies within the $k$th storeless basic block. A formal model of the scheduling problem under resource constraints can be achieved by using binary variables $x_{jl}$ with two indices, where $\lambda$ represents an upper bound on the latency of the $k$th storeless basic block. A binary variable $x_{jl}$ is 1 only when the instruction $I_{kj}$ starts in step $l$ of the schedule. Then, the scheduling can be formulated as a set of equations, (1) to (5), that have to be satisfied. The first four equations are the resource-constrained minimum-latency scheduling equations and are described in detail in [38]. Briefly, equation (1) states that the start time of each instruction is unique, and equation (2) states that the dependency relations in $G_{Dk}$ must be satisfied. The resource constraint must be satisfied at every cycle. An instruction $I_{kj}$ is executing at cycle $l$ when it has started at or before step $l$ and has not yet completed.
The number of all instructions of type $k$ executing at cycle $l$ must be less than or equal to $a_k$. This is shown in (3). We add one more condition in (5) and restrict the number of master instructions to always be greater than the number of shadow instructions until step $\lambda - 1$ in the $k$th storeless basic block. The number of master and shadow instructions should be equal at step $\lambda$, when scheduling is completed.
Objective function: minimize $\sum_j t_j$
Constraints: equations (1) to (5).
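The constraints described above can be made concrete with a small checker for a candidate schedule. This is a hedged sketch, not the formulation of [38]: it assumes a single resource type with `issue_width` units, counts instructions as "scheduled" once they have started, and uses the last start step of the schedule as the completion step. The function name `check_schedule` and its parameters are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def check_schedule(
    start: Dict[str, int],         # instruction -> start step (1-based, unique by construction)
    deps: List[Tuple[str, str]],   # (predecessor, successor) dependency edges
    latency: Dict[str, int],       # instruction -> execution time in cycles
    masters: Set[str],             # master instructions; all others are shadows
    issue_width: int,              # assumed single resource type with this many units
) -> bool:
    # Dependency constraint: a successor starts only after its predecessor finishes.
    for pred, succ in deps:
        if start[succ] < start[pred] + latency[pred]:
            return False

    # Resource constraint: instructions executing in any cycle must fit the resources.
    horizon = max(start[i] + latency[i] - 1 for i in start)
    for cycle in range(1, horizon + 1):
        busy = sum(1 for i in start if start[i] <= cycle < start[i] + latency[i])
        if busy > issue_width:
            return False

    # Master/shadow constraint: masters stay strictly ahead until the last step,
    # where the two counts must be equal.
    last = max(start.values())
    for step in range(1, last + 1):
        m = sum(1 for i in start if i in masters and start[i] <= step)
        s = sum(1 for i in start if i not in masters and start[i] <= step)
        if step < last and m <= s:
            return False
        if step == last and m != s:
            return False
    return True


if __name__ == "__main__":
    deps = [("M1", "M2"), ("S1", "S2")]
    lat = {"M1": 1, "M2": 1, "S1": 1, "S2": 1}
    ok = {"M1": 1, "S1": 2, "M2": 2, "S2": 3}
    print(check_schedule(ok, deps, lat, {"M1", "M2"}, issue_width=2))
```

A checker like this can validate the output of either an exact solver or a heuristic scheduler against the same set of conditions.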
List Scheduling
The disadvantage of the integer linear programming formulation is the computational complexity of the problem: generally it is an NP-complete problem [38] . The number of variables, the number of inequalities and their tightness affect the ability of computer programs to find a solution. Heuristic algorithms have been developed to solve this intractable problem. We consider the list scheduling algorithm [39] to schedule master and shadow instructions. 
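As an illustration of the heuristic, the sketch below implements a generic list scheduler. It assumes unit latency, a single resource type with `issue_width` slots per cycle, an acyclic dependency graph, and a priority equal to the longest dependency path to a sink; these choices and the name `list_schedule` are assumptions, and the sketch does not enforce the master-ahead-of-shadow condition discussed in the previous subsection.

```python
from typing import Dict, List, Set, Tuple


def list_schedule(
    instrs: List[str],
    deps: List[Tuple[str, str]],
    issue_width: int,
) -> Dict[str, int]:
    """Return a start cycle (1-based) for every instruction."""
    preds: Dict[str, Set[str]] = {i: set() for i in instrs}
    succs: Dict[str, List[str]] = {i: [] for i in instrs}
    for p, s in deps:
        preds[s].add(p)
        succs[p].append(s)

    # Priority: length of the longest path from the instruction to a sink.
    height: Dict[str, int] = {}

    def h(i: str) -> int:
        if i not in height:
            height[i] = 1 + max((h(s) for s in succs[i]), default=0)
        return height[i]

    start: Dict[str, int] = {}
    done: Set[str] = set()
    cycle = 0
    while len(start) < len(instrs):
        cycle += 1
        # Ready: not yet scheduled and all predecessors finished in earlier cycles.
        ready = [i for i in instrs if i not in start and preds[i] <= done]
        ready.sort(key=lambda i: -h(i))  # stable sort keeps list order on ties
        issued = ready[:issue_width]
        for i in issued:
            start[i] = cycle
        done |= set(issued)  # unit latency: results are available next cycle
    return start


if __name__ == "__main__":
    instrs = ["M1", "M2", "S1", "S2"]
    deps = [("M1", "M2"), ("S1", "S2")]
    print(list_schedule(instrs, deps, issue_width=2))
```

With two issue slots, the two-instruction master chain and its shadow chain finish in the same two cycles that the master chain alone would need, which is the effect the scheduling aims for.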
Interleaved Scheduling
A super-scalar processor with out-of-order execution capability is able to dynamically schedule instructions by resolving dependencies in hardware and to issue multiple instructions simultaneously in one clock cycle. In order to reduce the cost of the static scheduling discussed above, we can exploit the dynamic scheduling capability of the processor. The super-scalar processor has a buffer called the instruction window, in which it places decoded instructions and, at the same time, examines the instructions in the window to find those that can be issued in the same cycle. The instruction window serves as a pool of instructions, and the more ILP exists in this pool, the more instructions can be executed concurrently. Therefore, if we put as many master and shadow instructions as possible in the instruction window, we have a better chance of executing more instructions at the same time. Furthermore, instructions should be arranged such that control flow error detection coverage is maximized. The scheme we propose for achieving these two objectives together is shown in Fig. 4.1.
If we interleave master instructions with shadow instructions, we can put the same number of master and shadow instructions in the instruction window. Since there is no data dependency between the master and shadow instructions, this scheme will provide maximum ILP in the instruction window; however, a simple interleaving scheme will miss half of the control flow errors.
Hardware dynamically schedules the instructions in the instruction window, finding available instructions that can be issued in the same cycle. In this scheme, the scheduling to utilize idle resources is done by hardware, while control flow error detection is achieved by the arrangement of instructions.
Estimated Fault Coverage
Based on probability methods, we have derived formulas for estimating the error detection coverage of EDDI. These formulas and their derivation can be found in Appendix (Sec. 10).
This technique can provide us with the error detection coverage when EDDI is added to the program during compilation. This estimate is very useful when we need to predict the error detection coverage before we actually run the program, especially for computers in space applications.
The estimated fault coverage can be obtained from equation (13). Based on the percentage of branch instructions and store instructions after EDDI is added to the programs, the probability of missing an error is calculated for the benchmark programs, and the results are shown in Table I. We assume that $q_{st}$ = 0.95, i.e., 95% of the stored data are used later in the program execution. This is a reasonable number considering that most of the data in the above programs are repeatedly used during execution. We also assume that $q_{reg}$ = 0.9, i.e., the register utilization is 90%, since level 2 compiler optimization tries to maximize the register utilization.
The estimated numbers show that EDDI has around 98% to 99% error detection coverage in seven out of the eight benchmark programs. The EDDI technique is a pure software method, i.e., it does not require specialized hardware, nor does it alter the original processor architecture; nevertheless, based on the results shown in Table I, it achieves error detection coverage as high as that of hardware techniques. In most of the benchmark programs, $Pr_u(T_B, T_N)$ and $Pr_u(T_B, T_B)$, which are related to control flow errors, constitute the larger part of the undetected errors. The CFCSS [21] technique, which uses signature analysis, can be used to lower these probabilities, but it increases the performance overhead by adding signature-checking instructions to the original program.
Experimental Results
The same eight benchmark programs as in Sec. 5 are chosen for the experiment to verify the estimated coverage calculated in the previous section. Our target machines are an SGI Indigo with a MIPS R4400 processor and an SGI Octane that employs the 4-way super-scalar MIPS R10000 processor. First, the source files are compiled and the target machine code is generated without error detection instructions. A fault injector forces a bit flip in the code segment of the machine code. The location of the bit flip is determined by a random number generator. The corrupted machine code is executed and the results are shown in Table II.
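The fault model used in the experiment, a single bit flip at a random location in the code segment, can be sketched as follows. The function name `flip_random_bit` and the representation of the code segment as a `bytes` object are assumptions for illustration; the actual fault injector is not described in more detail here.

```python
import random
from typing import Tuple


def flip_random_bit(code: bytes, rng: random.Random) -> Tuple[bytes, int]:
    """Return a copy of `code` with one randomly chosen bit inverted,
    together with the index of the flipped bit."""
    bit = rng.randrange(len(code) * 8)      # uniform over all bit positions
    corrupted = bytearray(code)
    corrupted[bit // 8] ^= 1 << (bit % 8)   # invert exactly one bit
    return bytes(corrupted), bit


if __name__ == "__main__":
    segment = bytes(16)                     # placeholder code segment (all zeros)
    faulty, position = flip_random_bit(segment, random.Random(2024))
    print(position, faulty.hex())
```

Passing a seeded `random.Random` instance makes each injection reproducible, which is convenient when an experiment has to be rerun with the same fault locations.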
The numbers in the second row of the table (incorrect result) indicate the number of faults that cause the programs to produce incorrect results that look like correct ones to the observer. The third row gives the cases in which an erroneous result is repeatedly produced because the fault creates an infinite loop in the program. In the fourth row, the processor does not respond to the observer, so we have to stop the processor manually. The numbers in the fifth row show the number of faults that are detected by the operating system of the machine; a segmentation fault and a failed assertion are examples of such faults. Although a bit flip is inserted into the machine code, a correct result may still be produced; the number of these cases is shown in the sixth row. This can happen when the bit flip occurs in unused bits of a nop instruction that fills an empty slot of the pipeline, or in an unused field of an instruction, so that the bit flip has no effect on the output. The last row denotes the total number of faults that produced incorrect outputs without being detected (the sum of the second, the third and the fourth rows). On average, in the programs without EDDI, 20% of the injected faults produced incorrect outputs and were not detected, as shown in Table II.
Now, EDDI is included in the eight benchmark programs by the compiler postprocessor that we have developed. A bit flip is injected into the generated machine code, and then the corrupted machine code is executed. The results are shown in Table III. The first and third columns in the graph in The experiment validated the estimated error coverage as illustrated in Fig. 6.1.
The error coverage calculated in Sec. 5 was around 98.2% in eight benchmark programs, and the error coverage in the experiment is about 98.5% in the benchmark programs.
Since extra instructions are added to the original assembly code, the program with EDDI suffers from an increase in code size and a loss of performance compared to the original one. The execution time overhead is shown in Fig. 6 Notice that for most of the programs, the overhead in execution time is less than 100%. Since we have duplicated instructions, one might expect the overhead to be close to or greater than 100%. There are two reasons for this low execution time overhead. First, shadow instructions fill the empty slots of the pipeline; thus, we achieve higher utilization of the processor resources. Second, because the master and shadow instructions of a pair are always independent of each other, shadow instructions add more parallelism to the program. A 4-way super-scalar processor can exploit this parallelism better than a 2-way super-scalar processor. This effect can be seen in Fig. 6.2, where the execution times for running the programs on two different processors are shown. We can observe that the execution time overhead on the R10000 is less than that on the R4400: a 4-way issue machine has less time overhead than a 2-way issue machine. EDDI has less execution time overhead in processors that employ more aggressive super-scalar architectures.
Comparison of Different Source Level Duplication Methods
Our technique duplicates the instructions in assembly source code to achieve error detection capability. An alternative is to duplicate the source code in a high-level language such as C or Pascal. All the variables, assignments, calculations and arguments are duplicated in order to compare the pairs of variables or results of calculations. We consider two options for comparing the duplicated variables with the original variables in the source code: comparing the final outputs of the program, and comparing the variables whenever they are updated.
The former has the advantage of minimum performance overhead compared to the latter since, in the former, comparison instructions are placed only at the final output stage while, in the latter, comparison instructions are inserted whenever the variables are updated. However, comparing the values when they are updated has higher fault coverage than comparing only final output results. For example, if a branch instruction is corrupted and causes an infinite loop during the execution, the final output comparison cannot detect this error. 
Fig. 7.2 Undetected incorrect outputs in different levels of duplication
Overall, EDDI shows the highest fault coverage with medium performance overhead. This is because EDDI handles duplication at the assembly level; therefore, it has finer-grained error detection capability than the techniques applied at the C source level.
Discussion
The EDDI technique proposed in this paper is a pure software method that achieves high fault coverage in the computer system in which adding any extra hardware or modifying the existing hardware is not possible. Our simulation as well as our probabilistic estimate shows that our technique can achieve high error detection coverage.
We developed an algorithm for adding EDDI to a program and derived formulas that can be used to quickly estimate the fault coverage of EDDI while compiling the program. We verified the estimates by fault injection simulations. In the simulations, a fault injector forced a bit flip in the executable machine code. In eight benchmark programs without EDDI, on average, 20% of the injected faults produced incorrect results. However, in the programs augmented by EDDI, only an average of 1.5% of injected faults produced undetected incorrect results. This result shows that EDDI can achieve high fault coverage using a pure software technique alone. In a super-scalar architecture, duplicated instructions can be scheduled to use idle resources of the processor in order to reduce performance overhead. Experimental results show that approximately half of the benchmark programs have less than 50% execution time overhead when they run on a 4-way super-scalar processor. This execution time overhead might be high for application programs that need high performance, but if high reliability is required without any help from special hardware, EDDI would be one of the best candidate techniques to adopt.
Acknowledgements
The authors wish to thank Dr. Nirmal Saxena for his valuable suggestions and also Subhasish Mitra for his helpful comments.
Appendix Error Detection Coverage Estimate
Any fault that can be modeled as a bit flip in the data segment of memory can be detected by EDDI because there are always pairs of master and shadow variables in the data segment. The fault can be detected by comparing the two values of the master and shadow variables. However, it is somewhat difficult to determine the exact fault coverage of our technique in the code segment of memory since the effect of faults depends on the run-time behavior of the program. For instance, we cannot tell whether a conditional branch will be taken or not until it is actually executed with the run-time values of the condition variables. The error detection coverage also depends on the input data sets since the dynamic values of the registers and memory are strongly affected by the input data. However, based on the probabilistic method, we can get an estimate of the fault coverage. In this section, we describe how to estimate and predict the fault coverage before running the program. We discuss and analyze all nine possible cases to estimate the error coverage of our technique. The probability that an error occurs and escapes detection is calculated for each case; we then get the fault coverage by subtracting it from one. There might be some events that are not considered when calculating the probability, but we can ignore them when their probabilities are very low.
In the case of $(T_S, T_N)$, even if the incorrect stored data is not used later, this error will be detected if the normal instruction, which results from the error in the store instruction, modifies a live value in a master or shadow register. Otherwise, it will not be detected. If we ignore this special case and assume that the error is not detected only if the stored data is not used, the probability of these two cases not being detected will be (1). Let us denote the corrupted instruction by $I_f$ and the corrupted bit in $I_f$ by $b_f$, then:
where $(1 - q_{st})$ represents the ratio of stored data that are not used later in the program execution and $k$ is the ratio of possible normal instructions that can be created by a single bit flip in the opcode field to the total number of possible instructions in the ISA.
The rest of the cases are related to control flow errors; a normal or store instruction is changed to a branch instruction, or a branch instruction is changed to another instruction.
Case 3: $(T_N, T_B)$, $(T_S, T_B)$
Assuming a single-bit error, only instructions at Hamming distance 1 from a branch instruction can be changed into a branch instruction, so this case is restricted to those instructions. Let us call this event $E_h$. Normal and store instructions always exist as a master and shadow pair. If one of the pair is changed to a branch instruction and the other is executed, it is similar to Case 1, and the error might be detected. However, if one of the pair is changed to a branch instruction and causes an illegal jump to another location, but the other is not executed, this error is undetectable. Therefore, if we assume that these cases are only detected when the numbers of executed master and shadow instructions are different, the estimated probability is:
Case 4: $(T_B, T_N)$
This case happens when a bit flip occurs in the opcode field. If the branch instruction is changed to a normal or store instruction, a register field will be interpreted as the destination of the operation. For example, if an instruction 'bne r1 r2 label1' is altered to 'and r1 r2 r3', r1 becomes the destination register of the and operation. If this corrupted instruction is in the live range of r1, it will modify the value in r1, and this error will propagate and be detected when the comparison instruction is executed. Thus, we can estimate $Pr_u(T_B, T_N)$ using $(1 - q_{reg})$, where $q_{reg}$ represents the register utilization. For instance, if 90% of the registers are always live during execution, the probability that the register specified by the register field holds a live value is 0.9, and thus $(1 - q_{reg})$ would be 0.1. If the program is fully optimized, the compiler tries to maximize the register utilization and $(1 - q_{reg})$ can be kept low.
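The Hamming-distance condition can be checked mechanically for a given instruction set. The sketch below does this for a small subset of MIPS primary opcodes (6-bit opcode field); the subset, the grouping into branch and non-branch opcodes, and the function names are assumptions chosen for illustration and do not reproduce the classification used in the analysis.

```python
from typing import Dict, List, Set


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Subset of MIPS primary opcodes; "rtype" stands for opcode 0, which is
# shared by register-register ALU instructions such as add and and.
MIPS_OPCODES: Dict[str, int] = {
    "rtype": 0x00,
    "beq": 0x04,
    "bne": 0x05,
    "blez": 0x06,
    "bgtz": 0x07,
    "addi": 0x08,
    "addiu": 0x09,
    "lw": 0x23,
    "sw": 0x2B,
}
BRANCH_NAMES: Set[str] = {"beq", "bne", "blez", "bgtz"}


def can_become_branch(opcodes: Dict[str, int], branches: Set[str]) -> List[str]:
    """Non-branch mnemonics whose opcode is one bit flip away from a branch opcode."""
    return sorted(
        name
        for name, code in opcodes.items()
        if name not in branches
        and any(hamming(code, opcodes[b]) == 1 for b in branches)
    )


if __name__ == "__main__":
    print(can_become_branch(MIPS_OPCODES, BRANCH_NAMES))
```

Within this subset, only the register-register ALU opcode lies one bit flip away from a branch opcode, which illustrates why the event $E_h$ covers only a fraction of the instructions in a program.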
Case 5: $(T_B, T_S)$
If the Hamming distance between a branch and a store instruction is greater than one, this case cannot happen under the single-bit error assumption. On the other hand, if the Hamming distance is one, a bit flip in that bit position introduces this case. The changed instruction stores a value in the location specified by the offset field. If this stored value is used later, the error will be detected, but if it is not used, it will go undetected. Thus we can estimate the probability $Pr_u(T_B, T_S)$ of missing this case from the fraction $(1 - q_{st})$ of stored data that are not used later.
Now, if the register field is corrupted, this fault may or may not cause an error. We take an upper bound on the probability of missing the error since we cannot exactly predict the conditional jump, as shown in the following examples:
Branch_greater r1 r3 label1
If the register field is changed and r1 is altered to r2, it is difficult to tell whether the control flow will be corrupted or not unless we know the values of the registers at run time. For example, if r2 has positive values while r1 has negative values during the entire execution time, the branch will always be taken, which is incorrect. This analysis is only possible by simulation, which gives us the exact number but is very expensive. Let's take another example:
Branch_not_equal r1 r2 label1
If the register field is changed and r1 is altered to r3, but the value in r3 is still different from the value in r2, this fault does not change the control flow. If we assume equal probability for the numbers in the registers, we can guess that the probability of changing the control flow is almost zero since the probability that r2 and r3 have the same value is almost zero. As shown in the examples, it is very difficult to predict whether the fault affects the control flow or not without running the program. Therefore, assuming that faults in the register field of a branch instruction are not detected at all, the probability in (11) is an upper bound on the probability of missing this case. From (10) and (11), the probability that $(T_B, T_B)$ happens but is not detected is given by (12). From (6) to (12), the upper bound on the probability that a fault occurs in the code segment and is not detected is estimated as
$$P_u = Pr_u(T_S, T_S) + Pr_u(T_S, T_N) + Pr_u(T_N, T_B) + Pr_u(T_S, T_B) + Pr_u(T_B, T_N) + Pr_u(T_B, T_S) + Pr_u(T_B, T_B) \quad (13)$$
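Assuming the overall bound is the sum of the per-case probabilities of an undetected error, the coverage estimate follows by subtracting that sum from one. The sketch below shows only this arithmetic; the function name `estimated_coverage` is hypothetical, and the numbers in the example are placeholders, not values from Table I.

```python
from typing import Dict


def estimated_coverage(case_probs: Dict[str, float]) -> float:
    """Coverage estimate: one minus the summed per-case miss probabilities."""
    p_u = sum(case_probs.values())   # upper bound on an undetected error
    if not 0.0 <= p_u <= 1.0:
        raise ValueError("summed probability must lie in [0, 1]")
    return 1.0 - p_u


if __name__ == "__main__":
    # Placeholder per-case values chosen only to demonstrate the calculation.
    example = {
        "(T_S, T_S)": 0.001,
        "(T_S, T_N)": 0.001,
        "(T_N, T_B)": 0.002,
        "(T_S, T_B)": 0.001,
        "(T_B, T_N)": 0.004,
        "(T_B, T_S)": 0.001,
        "(T_B, T_B)": 0.005,
    }
    print(f"estimated coverage: {estimated_coverage(example):.3f}")
```

Because each term is an upper bound on the probability of missing an error, the resulting coverage is a lower bound on what the program augmented with EDDI should achieve under these assumptions.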
