ABSTRACT As process technology scales, electronic devices become more susceptible to radiation-induced soft errors. Silent data corruption (SDC) is considered the most severe outcome of a soft error. The likelihood that a faulty variable produces SDC varies widely from variable to variable. Without a vulnerability profile of the variables, the derived detectors often suffer from a low SDC detection rate or an unacceptable overhead. To assess the vulnerability of variables to SDC, this paper proposes a metric called Output Vulnerability Factor (OVF). The metric is used to rank the priority of each variable in the detector derivation process, so that the most SDC-prone variables in the program are selectively protected. The calculation of OVF is based on the enhanced Dynamic Dependence Graph (eDDG), a proposed instruction-level error propagation model. We filter out the edges that represent identified crash propagation paths and perform a backward traversal of the eDDG to obtain the SDC propagation paths. Further, the error masking probability is estimated for the edges that correspond to value comparison and logical operations. Fault injections show that our approach achieves an SDC detection rate of 65.0% when the top 10% of variables ranked by OVF are monitored. Compared with previous methods, the SDC detection rate increases by 12-21 percentage points.
I. INTRODUCTION
Radiation-induced errors caused by neutrons in cosmic rays or alpha particles in packaging material have provoked growing concerns [1], [2]. The fault rate per device (e.g., latch, SRAM cell) in a bulk CMOS process is projected to remain roughly constant or decrease slightly for the next several technology generations. Thus, with the increase in the number of transistors on a chip and the reduction of chip sizes, the fault rate per chip will grow with Moore's Law [3]. Soft errors have emerged as a severe challenge in aerospace computing [4].
Hardware faults can affect the running software in four ways: (1) they may have no effect on the application (benign), (2) they may stop execution (crash), (3) they may exhaust system resources while the program still cannot finish execution (hang), or (4) they may lead to incorrect outputs, also called silent data corruption (SDC). While crash and hang are important from an availability perspective, SDC is important from a reliability perspective because it causes the program to fail without any indication of the failure [5]. For application domains such as climate, energy and healthcare, applying erroneous output may lead to loss of property and even casualties [6], [7]. Erroneous output can be more dangerous than no output, since users are not aware of errors until a serious consequence occurs.
The associate editor coordinating the review of this article and approving it for publication was Chaoyong Li.
In order to detect hardware faults, applying redundant hardware and software has been the usual solution. Owing to their prohibitive overhead, hardware detection mechanisms are increasingly unacceptable for modern commodity systems. To alleviate the overhead, several software redundancy techniques have been proposed [8], [9], [10], which rely on duplication at the instruction level or on invariant-based assertions. If the entire set of variables is duplicated, the resulting overhead can be unacceptable for some applications. Hence, the trend has shifted toward adaptive and selective protection, which applies duplication or assertions only to the more susceptible variables.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
A great amount of prior work has identified SDC-prone variables using fault injection (FI) [11]. FI involves perturbing the program state to emulate the effect of a hardware fault and executing the program to completion to determine whether the fault causes an SDC [12]. Performing FI to obtain a profile for each variable takes too much time to be practical. As a result, researchers have tried to analytically construct error propagation models to identify SDC-causing instructions or variables [13], [14], [15]. These analytical models incur low overhead and are fast to execute. However, most existing models suffer from a lack of accuracy due to their limitations in modeling SDC-causing error propagation.
An SDC-causing error affects program behavior in a distinct way. Although it affects the data flow of the program, it does not cause severe damage to execution or raise any exception. We apply two basic features of SDC in modeling SDC-causing error propagation. First, the necessary condition for incurring SDC is an erroneous parameter of an output function. Since an erroneous output is obtained, SDC-causing error propagation ends with erroneous parameters of an output function. Data that affect the output function must therefore have an error propagation path to a parameter of the output function. The enhanced dynamic dependence graph (eDDG) is applied in this paper to capture instruction-level data dependences and error propagation paths. Second, since the outcomes of fault injection are mutually exclusive, errors that can be identified as crash-causing or benign incur no SDC. For the data found on an error propagation path, we conservatively assume that every bit is SDC-causing unless the case matches the error masking or crash-causing models.
In this paper, we define a metric called Output Vulnerability Factor (OVF) to quantify the SDC proneness of program variables. OVF is intended to estimate the probability of obtaining erroneous outputs when the variable itself is corrupted, i.e., of having SDC due to the corruption of the variable. In other words, the OVF of a variable measures how sensitive the application is to errors injected in that variable's memory location. The variables with high OVF should be given high priority in selective protection. The calculation of OVF is based on the SDC propagation paths in the eDDG. A possible SDC propagation path is denoted by a sequence of connected nodes that ends with a node representing a parameter of the output function. The SDC propagation paths are obtained by a backward traversal of the eDDG of the program.
Our main contributions in this paper are as follows:
• Propose OVF, a metric to assess vulnerability of variables to SDC-causing faults. Our approach only needs instruction-level profiling of fault-free execution and thus remains low cost.
• Propose a method to estimate the error masking probability for data used in value comparison and logical operations.
• Compare the SDC detection rate of detectors derived by OVF with previous techniques. The injection experiment results show that applying OVF achieves an SDC detection rate 12.2 to 20.8 percentage points higher than that of previous techniques.
The remainder of this paper is structured as follows. Section II discusses relevant prior work. Section III describes the model for SDC propagation. Section IV proposes the definition of Output Vulnerability Factor. The estimation of the error masking parameter is characterized in Section V, and the experimental results are then presented. Finally, the last section summarizes and concludes this work.
II. RELATED WORK
Identifying and duplicating critical variables can significantly reduce the memory and performance overheads, while still guaranteeing very good results in terms of fault-tolerance improvement. In recent years, several studies have addressed the issue of assessing vulnerability of variables in presence of soft errors.
Benso proposes the Criticality Function (CF) to measure the criticality of variables based on a runtime analysis of the variables' behavior [14]. CF is obtained by calculating the integral of the Instantaneous Criticality Function (ICF) across the variable's lifetime. Each read or write event of the variable modifies the value of ICF. At each read event, the ICF value increases by a constant β. At each write event, the ICF value is set to a constant K. A variable with many read operations is therefore likely to obtain a high CF. However, such variables may not have a significant effect on the output. For example, a variable used in exception handlers can be read many times and thus obtain a high CF value, yet an error in that variable is likely to cause a crash instead of SDC.
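To make the behavior of CF concrete, the following is a minimal sketch of the read/write rule described above. It assumes that ICF is piecewise constant between events, that it starts at 0 before the first write, and that the integral runs from the first to the last event; the function name `criticality` and the default values of `beta` and `K` are illustrative assumptions, not taken from [14].

```python
def criticality(events, beta=1.0, K=1.0):
    """Integrate a piecewise-constant ICF over a variable's access trace.

    events: list of (time, kind) pairs sorted by time, kind is 'r' or 'w'.
    Assumption: ICF is 0 before the first event and is held constant
    between two consecutive events.
    """
    cf = 0.0
    icf = 0.0
    prev_time = None
    for time, kind in events:
        if prev_time is not None:
            cf += icf * (time - prev_time)  # area accumulated since last event
        if kind == "w":
            icf = K            # a write sets ICF to the constant K
        elif kind == "r":
            icf += beta        # a read increases ICF by beta
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
        prev_time = time
    return cf


# A variable that is written once and then read repeatedly keeps gaining CF.
print(criticality([(0, "w"), (1, "r"), (2, "r"), (3, "r")]))
```

Under these assumptions a read-heavy variable accumulates CF regardless of whether its value ever reaches the output, which is the limitation noted above.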
Shoestring also shares our goal of reducing SDC by protecting only the sites that potentially result in SDC [15]. It employs a heuristic that treats all writes to memory and all function arguments as SDC-causing instructions. Since this condition corresponds to particular types of instructions, the heuristic cannot cover all possible SDC-causing instructions. Moreover, since a large number of sites satisfy the condition, it may lead to a high false positive rate.
Pattabiraman develops a set of metrics to identify which application variables to protect and where to place detectors [10], [16]. The goal, however, is to prevent fault propagation and avoid crashes with the minimum possible number of detector locations, not specifically to reduce SDC. Most crashes occur soon after the fault is manifested in the program [17]. The fanout metric is the set of all immediate successors of a node in the DDG, and it indicates how many nodes are directly impacted by an error in that node. The fanout metric has been shown to have the highest crash detection coverage, since errors in these variables are likely to propagate to many locations in the program and cause program failure.
Mukherjee introduces the architectural vulnerability factor (AVF) to represent the probability that a fault in a processor structure will affect the output of the program [18]. Un-ACE bits are defined as bits that are unnecessary for architecturally correct execution; errors in those bits are masked during execution. By identifying the un-ACE bits in the processor, the resulting AVF gives an upper bound on the soft error rate. Sridharan proposes the program vulnerability factor (PVF), which extracts the software-dependent components of AVF to provide a way for developers to estimate the soft error rate of their software [13]. Both AVF and PVF are conservative, as they assume that all erroneous executions are a major concern, not just those that result in SDC, and they do not account separately for crash-causing errors. These metrics do not distinguish between fault outcomes and treat crash, SDC and hang as equally severe, and thus they are not helpful for identifying SDC-causing variables. To serve our purpose of detecting SDC, we construct SDC propagation paths and attempt to filter out benign and crash-causing faults. By distinguishing between SDC and other outcomes, OVF allows protection techniques to better focus on the variables that, if corrupted, can potentially cause SDC.
III. THE MODEL FOR SDC PROPAGATION
To model the propagation of SDC-causing fault, we propose enhanced Dynamic Dependence Graph (eDDG) that builds on the original Dynamic Dependence Graph. We introduce the original DDG first and explain our modifications to the DDG.
A. THE DEFINITION OF OUTPUT
The definition of a program's output depends on the user's interaction with the program. For our purpose, we require a precise definition of what constitutes an output. Conceptually, an output is a value generated by the program that is sent to functions such as printf and cout. At the instruction level, the output refers to the destination operand of the instruction that loads a parameter of an output function. We define an instruction set $I_{out}$, which contains the dynamic instructions that load parameters of output functions.
B. DYNAMIC DEPENDENCE GRAPH (DDG)
DDG is a directed acyclic graph that captures the dynamic dependencies among the values produced in the course of program execution [19]. DDG is defined as a tuple $G = (V, E)$. $V$ is the set of instruction nodes, $V = \{i_0, \ldots, i_k, \ldots, i_n\}$. The subscript $k$ denotes the sequence number of the dynamic instruction in the execution. Each instruction is given as a tuple $i_k = (O_s, O_d, Pre, Suc)$. $O_s$ and $O_d$ denote the sets of source and destination operands. $Pre$ and $Suc$ denote the sets of predecessor and successor instructions in the DDG. $E$ is the set of edges. The source node of an outgoing edge corresponds to an instruction operand, and the destination node corresponds to the value produced by the instruction.
$O_s$, which ensures that no write operation on $u$ occurs between $t_{i_{k+1}}$ and $t_{i_j}$. $t_{i_{k+1}}$ is the moment after $i_k$ is executed and before $i_{k+1}$ is executed.
The DDG corresponding to the example code is shown in Fig.1. The yellow nodes represent the instructions contained in $I_{out}$. An edge in the DDG reveals a data dependence between two instructions, which also indicates a potential error propagation path. $i_{351} \to i_{353} \to i_{356}$ is an SDC-causing error
C. PROPAGATION MODEL: ENHANCED DYNAMIC DEPENDENCE GRAPH (eDDG)
Certain edges in the DDG are not real SDC propagation paths. Meanwhile, certain types of SDC propagation paths are not included in the DDG. Therefore we propose the eDDG, an enhancement of the original DDG methodology. Two major changes are made to the structure of the DDG. First, we delete the edges leaving the nodes that represent instructions writing ESP or EBP. Errors in those instructions have been verified to cause a crash with high probability. In our previous work [20], a series of fault injections on ESP and EBP were conducted. It was observed that an error in ESP leads to a benign outcome or SDC only if ESP points to a return address when RET is executed. The injection results reveal that most injections (98.5%) on RET-control ESP lead to a crash. Similarly, the crash rate for RET-control EBP reaches 98.3%. The SDC rate for RET-control ESP and EBP is below 0.5%, much lower than the baseline SDC rate (11.3%).
As the second change made to the DDG, certain edges are inserted to represent error propagation through control flow. A Jcc instruction checks the state of the EFLAGS register and decides whether to take the branch or not. If an error causes a wrong branch, the control flow is changed and the instructions in the correct basic block are not executed. Consequently, the values of the destination operands of the instructions in the correct basic block are likely to differ from those of the fault-free execution. We take the example in Fig.1 to describe the error propagation. If the Jcc instruction $i_{350}$ does not choose the branch $i_{351}$, then alt_sep is not assigned 0x1, which finally results in a wrong value in $i_{356}$. To represent this error propagation, we insert edges between the node representing the Jcc instruction and the dynamic instructions of the subsequent basic block. In Fig.1
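The two structural changes can be sketched as a small graph transformation. The sketch below is only an illustration under assumed data structures: the edge set, the `writes` map from a node to the registers it writes, and the `jcc_controlled` map from a Jcc node to the dynamic instructions of the basic block it guards are hypothetical representations, not the format produced by the tooling of this paper.

```python
def build_eddg(edges, writes, jcc_controlled):
    """Derive eDDG edges from DDG edges.

    edges:          iterable of (src, dst) node pairs of the DDG
    writes:         dict mapping a node to the set of registers it writes
    jcc_controlled: dict mapping a Jcc node to the dynamic instructions of
                    the basic block whose execution depends on that branch
    All three inputs are assumed representations for illustration.
    """
    crash_prone = {"esp", "ebp"}
    # Change 1: drop edges leaving nodes that write ESP or EBP, since errors
    # there are treated as crash-causing rather than SDC-causing.
    eddg = {
        (src, dst)
        for src, dst in edges
        if not (writes.get(src, set()) & crash_prone)
    }
    # Change 2: add control-flow edges from each Jcc node to the dynamic
    # instructions of the basic block it controls.
    for jcc, block in jcc_controlled.items():
        for node in block:
            eddg.add((jcc, node))
    return eddg


ddg_edges = {(348, 349), (351, 353), (353, 356)}
print(sorted(build_eddg(ddg_edges, {348: {"esp"}}, {350: [351]})))
```

With this representation, the edge from $i_{350}$ to $i_{351}$ in the example corresponds to the pair `(350, 351)` added by the second step.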
IV. OUTPUT VULNERABILITY FACTOR
In this section, we propose Output Vulnerability Factor (OVF) to quantify the effect of a variable on the output of program. The calculation of OVF is based on eDDG. The whole process has three steps (see Fig.2 ).
1) Value profiling. The application program is executed and an execution trace is obtained with detailed execution information of the dynamic instructions.
2) Model construction. The data trace is analyzed and the corresponding eDDG is constructed.
3) OVF computation. We traverse the eDDG from the nodes contained in $I_{out}$ and compute OVF for each variable. This step is discussed in Section IV-B.
We give the definition of OVF and then describe how to calculate OVF.
A. THE DEFINITION OF OUTPUT VULNERABILITY FACTOR
OVF for variable $v$ is calculated by evaluating the integral of $\delta(u, t)$ over the interval of its lifetime, shown in Eq.1. A tuple $(u, t)$ indicates a memory location $u$ at a given time $t$ of the program execution. $\delta(u, t)$ represents the instantaneous output vulnerability factor of the tuple $(u, t)$. $v(u, t)$ denotes the variable $v$ stored in the memory location $u$ at the moment $t$.
The lifetime of a variable is defined as the time period in which the variable is allocated in memory, shown in Eq.2. A variable's lifetime starts by the first write operation and ends by the last read operation.
The computation of $\delta(u, t)$ is described as follows. First, as shown in Eq.3, we set $\delta(u, t_{i_j})$ with $u \in i_j.O_s \wedge i_j \in I_{out}$ to 1, since an error in $(u, t_{i_j})$ can directly cause an erroneous output. Other SDC-causing $(u, t)$ can be identified by a backward traversal of the eDDG. The instantaneous OVF of a node in the eDDG depends on how the node affects its children and on the instantaneous OVF values of its children. The OVF of each node is calculated iteratively, starting from the nodes contained in $I_{out}$.
We propose Eq.4 to calculate $\delta(u, t)$ during read operations. $e_{i_w \to i_r}$ denotes the error propagation path from $i_w$ to $i_r$, and the subscript $|_{e_{i_w \to i_r}}$ denotes the instantaneous OVF contributed by the edge $e_{i_w \to i_r}$. $P_m(i_r)$ is introduced to denote the probability that $i_r$ masks the error during the propagation, and $1 - P_m(i_r)$ denotes the probability that an SDC-causing error propagates from the source operand of instruction $i_r$ to the destination operand. If $P_m(i_r) = 1$, error masking must happen; in other words, an error in $u$ has no impact on $u_d$. Here we mainly consider the error masking model for data used in value comparison and logical operations. The effect of dynamically dead values is already considered implicitly by using the eDDG model. The computation of $P_m(i_r)$ is discussed in Section V.
Eq.5 computes $\delta(u, t)$ when $i_r$'s predecessor node $i_w$ is being executed ($t_{i_w} \le t \le t_{i_{w+1}}$). Assume $i_w$ has the successors $i_{r_1}, \ldots, i_{r_k}, \ldots, i_{r_n}$, i.e., $u$ is written by $i_w$ and read by $i_{r_1}, \ldots, i_{r_k}, \ldots, i_{r_n}$. During the execution of $i_w$ ($t_{i_w} \le t < t_{i_{w+1}}$), $\delta(u, t)$ is set to 0 since the value of $u$ is refreshed after the write operation. If $u$ is corrupted as the write operation is done ($t = t_{i_{w+1}}$), all of $i_w$'s successors are affected by the wrong value of $u$. The error propagates through the read operations of $i_w$'s successors; therefore $\delta(u, t)$ after the write operation of $i_w$ is defined as the summation of the instantaneous OVF of $u$ before the read operations of $i_{r_1}, \ldots, i_{r_k}, \ldots, i_{r_n}$.
The calculation of $\delta(u, t)$ during $t_{i_{r_k}} \le t < t_{i_{r_k+1}}$ is shown in Eq.6. Assume $i_m$ is a successor of $i_w$ executed after $i_{r_k}$ ($i_m \in i_w.Suc \wedge r_k + 1 \le m \le r_n$). Since an error originating in $i_{r_k}$ can propagate to $i_m$, $\delta(u, t)$ during the read operation of $i_{r_k}$ ($t_{i_{r_k}} \le t < t_{i_{r_k+1}}$) is defined as the summation of the instantaneous OVF of $u$ before the read operations of all possible $i_m$. The successors of $i_w$ executed before $i_{r_k}$ are not affected by the error originating in $i_{r_k}$.
B. ALGORITHM TO CALCULATE OVF OF VARIABLES
The algorithm to calculate the OVF of variables is shown in Alg.1. The input of the algorithm is the instruction set $I_{out}$ and the eDDG. The output is the OVF of each variable of the program. The algorithm can be divided into two phases. The first phase spans lines 1 to 10. In this phase, the algorithm traverses the eDDG from the nodes contained in $I_{out}$ and identifies the SDC-causing nodes in the eDDG. ToCalNodeSet contains the nodes that are considered SDC-causing and whose OVF is calculated in the second phase. ToSearchNodeSet contains the nodes that need to be visited and is initialized to $I_{out}$. In each iteration, the algorithm pops a node from ToSearchNodeSet and pushes its predecessors into ToCalNodeSet and ToSearchNodeSet.
When ToSearchNodeSet becomes empty, all nodes that have a path to $I_{out}$ have been put into ToCalNodeSet.
The second phase starts by assigning the instantaneous OVF value to the nodes that belong to $I_{out}$. CalDoneNodeSet contains the nodes whose OVF calculation is finished. According to Eq.5 and Eq.6, we start to calculate a node's instantaneous OVF only after all of its successors have finished the calculation. In each iteration of the while loop, the algorithm picks a node from ToCalNodeSet. If all of the node's successors are contained in CalDoneNodeSet, the instantaneous OVF of the current node's destination and source operands is calculated by using Eq.4, Eq.5 and Eq.6. Then the current node is transferred from ToCalNodeSet to CalDoneNodeSet. The iteration repeats until ToCalNodeSet becomes empty. According to Eq.1 and Eq.2, the OVF of each variable is obtained by computing the integral over the interval of its lifetime.
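The two phases can be illustrated with a simplified sketch. It is not a transcription of Alg.1: it assigns one score per node instead of one value per $(u, t)$ tuple, it omits the lifetime integration of Eq.1, it keeps the nodes of $I_{out}$ fixed at 1, and it ignores successors that have no path to $I_{out}$ so that the loop always terminates on an acyclic graph. The function name `instantaneous_ovf` and the `p_mask` map (masking probability of the reading instruction) are assumptions for illustration.

```python
from collections import defaultdict


def instantaneous_ovf(edges, i_out, p_mask=None):
    """Per-node score from a backward traversal of an eDDG-like DAG.

    edges:  iterable of (src, dst) pairs
    i_out:  nodes that load parameters of the output function
    p_mask: dict node -> assumed masking probability P_m of that node
    """
    p_mask = p_mask or {}
    pre, suc = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        suc[src].add(dst)
        pre[dst].add(src)
    i_out = set(i_out)

    # Phase 1: collect every node that has a path to an output node.
    to_calc, to_search = set(), list(i_out)
    while to_search:
        node = to_search.pop()
        for p in pre[node]:
            if p not in to_calc and p not in i_out:
                to_calc.add(p)
                to_search.append(p)

    # Phase 2: score a node once all of its relevant successors are done.
    relevant = to_calc | i_out
    delta = {node: 1.0 for node in i_out}
    done = set(i_out)
    while to_calc:
        ready = [n for n in to_calc
                 if all(s in done for s in suc[n] if s in relevant)]
        if not ready:
            raise ValueError("graph is not acyclic")
        for n in ready:
            # Sum over readers, each weighted by its non-masking probability.
            delta[n] = sum((1.0 - p_mask.get(s, 0.0)) * delta[s]
                           for s in suc[n] if s in relevant)
            to_calc.discard(n)
            done.add(n)
    return delta


scores = instantaneous_ovf([(1, 2), (2, 3), (1, 4), (5, 3)], i_out={3},
                           p_mask={2: 0.5})
print(scores)
```

In this toy graph node 4 never reaches the output, so it receives no score, while node 1 is discounted by the assumed masking probability of its reader, node 2.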
We use the example shown in Fig.1 
V. ESTIMATION OF ERROR MASKING PARAMETER $P_m$
In this section, we estimate the error masking parameter $P_m$ for the nodes in the eDDG that involve value comparison and logical operations.
• Error masking in value comparison results from the fact that the information in the values being compared is reduced to a few bits. It has been observed that about 40% of all dynamic conditional branches do not influence the output [21].
• Logical operation derates errors that occur in AND operations when the corresponding bit in the other operand is 0, as well as OR operations when the corresponding bit in the other operand is 1.
We analyze the effect of bit flip based on the instruction semantics and determine the conditions for error masking. Bits of target data that satisfy the conditions are counted by a bit counter function and equations for estimating P m are derived.
Here we analyze the unsigned doubleword integer type. Signed integers can be analyzed by adding one more step that characterizes the sign bit. The size of an unsigned doubleword integer is 32 bits, and its value ranges from 0 to $2^{32} - 1$.
A. THE DEFINITION OF THE SET/UNSET BIT COUNTER FUNCTION
Error masking is sensitive to the value of the bit, i.e., whether the bit is set (1) or unset (0). For example, AND eax, 0x3 performs the logic calculation eax ← eax & 0x3. An error in the 7th to 2nd bits of eax is masked since the corresponding bits of the immediate are 0.
Algorithm 1
4: for all $i_k \in i_j.Pre$ do
5:   if $i_k \notin$ ToCalNodeSet then
6:     ToCalNodeSet ← ToCalNodeSet ∪ $i_k$;
7:     ToSearchNodeSet ← ToSearchNodeSet ∪ $i_k$;
8:   end if
9: end for
10: end while
11: for all $i_j \in I_{out}$ do
12:   for all $u \in i_j.O_s$ do
13:     $\delta(u, t_{i_j})$ ← 1;
14:   end for
15: end for
16: CalDoneNodeSet ← $I_{out}$;
17: while ToCalNodeSet ≠ ∅ do
18:   for all $i_j \in$ ToCalNodeSet do
19:     if $i_j.Suc \subseteq$ CalDoneNodeSet then
20:       calculate $\delta(u, t)$ by Eq.4, Eq.5 and Eq.6
21:       ToCalNodeSet ← ToCalNodeSet \ $i_j$;
22:       CalDoneNodeSet ← CalDoneNodeSet ∪ $i_j$;
23:     end if
24:   end for
25: end while
26: for all $v \in$ {set of variables of the program} do
27:   calculate OVF($v$) by Eq.1,2.
28: end for
Two functions are defined to count the unset bits and set bits of the numbers within a certain range. $f^k_0(a_1, a_2)$ counts the unset $k$th bits of the numbers ranging from $a_1$ to $a_2$, and $f^k_1(a_1, a_2)$ counts the set $k$th bits of the same numbers. The definitions of $f^k_0(a_1, a_2)$ and $f^k_1(a_1, a_2)$ are given in Eq.7,8, where $0 \le a_1 < a_2 \le 2^{32}$, $0 \le k \le 31$, and $x_k$ denotes the $k$th bit of $x$.
Given $f^k_0(0, a_1)$ and $f^k_0(0, a_2)$, $f^k_0(a_1, a_2)$ and $f^k_1(a_1, a_2)$ can be calculated by Eq.9,10. $f^k_0(0, a_1)$ and $f^k_0(0, a_2)$ are computed by Eq.11. The proof is given below. % represents the modulo operation, which finds the remainder after division of one number by another.
Proof: The $k$th bit of the numbers ranging from 0 to $a - 1$ repeats in a cycle whose length is $2^{k+1}$. Taking the numbers ranging from 0 to 8 as an example, the cycle length of the 0th bit is 2 ($k = 0$, $2^{k+1} = 2$, shown in column 0th-bit in Table 2), and the cycle length of the 1st bit is 4 ($k = 1$, $2^{k+1} = 4$, shown in column 1st-bit in Table 2). The $k$th bit of $a - 1$ is either 0 or 1. When the $k$th bit of $a - 1$ is 0, the $k$th bits from 0 to $a - 1$ constitute a string of 00...011... In Eq.11, $p_1$ equals the total number of zeroes in complete cycles, for $2^k$ is the number of zeroes in one complete cycle and $\lfloor a / 2^{k+1} \rfloor$ is the number of complete cycles. Furthermore, $p_2$ and $p_3$ calculate the number of residual zeroes. Suppose the number of residual zeroes is $n_0$. We just need to prove that $p_2 + p_3 = n_0$ in both situations (ending with either 0 or 1).
• Ending with 0.
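The cycle argument of the proof can be checked with a short sketch. It assumes that the range of $f^k_0(a_1, a_2)$ is the half-open interval from $a_1$ up to but excluding $a_2$, which is how the constraint $a_2 \le 2^{32}$ reads; the closed form used for the residual zeroes (a `min` expression) is one equivalent way of writing the count and is not claimed to be the exact form of Eq.11. The function names are illustrative.

```python
def unset_below(k, a):
    """Count the numbers x in [0, a) whose k-th bit is 0."""
    cycle = 1 << (k + 1)   # the k-th bit repeats with period 2^(k+1)
    half = 1 << k          # each period starts with 2^k zeroes
    complete = (a // cycle) * half        # zeroes in complete cycles
    residual = min(a % cycle, half)       # zeroes in the trailing partial cycle
    return complete + residual


def f0(k, a1, a2):
    """Count the numbers in [a1, a2) whose k-th bit is 0."""
    return unset_below(k, a2) - unset_below(k, a1)


def f1(k, a1, a2):
    """Count the numbers in [a1, a2) whose k-th bit is 1."""
    return (a2 - a1) - f0(k, a1, a2)


# Cross-check against direct enumeration on a small range.
for k in range(4):
    brute = sum(1 for x in range(3, 11) if not (x >> k) & 1)
    assert f0(k, 3, 11) == brute
print(f0(0, 0, 8), f0(1, 0, 8))
```

The closed form needs only a few integer operations per bit position, so summing over all 32 positions stays cheap even for ranges close to $2^{32}$.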
B. CALCULATION OF $P_m$ FOR DATA USED IN LOGICAL OPERATION
The calculation of $P_m$ for the AND and OR instructions is discussed in this section. The AND x, y instruction performs the bitwise AND of x and y. An error is masked when the corresponding bit in the other operand is 0. To estimate $P_m$, we need to count the unset bits of the operand within its range. Suppose that the value of the operand obeys a uniform distribution on the interval [min value, max value]. The min value and max value are extracted from the data trace of the instruction.
The calculation of $P_m$ for the AND and OR instructions is analyzed separately. $x_k$ represents the $k$th bit value of $x$, and $y_k$ represents the $k$th bit value of $y$. $N(A)$ denotes the number of outcomes favourable to $A$. $N(\Omega)$ denotes the total number of possible bit values in the sample space $\Omega$. Whether the flip occurs in $x$ or $y$ does not affect the analysis of the AND instruction. Assume the bit flip occurs in the $k$th bit of $y$.
1) AND x, y
If $x_k = 0$, the error is masked after the AND instruction is executed; otherwise the error propagates to $x$. The unset bits of $x$ within its range can be counted by the unset bit counter function.
2) OR x, y
If $x_k = 1$, the error is masked after the OR instruction is executed; otherwise the error propagates to $x$.
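A minimal sketch of this estimate follows. It assumes that the non-faulty operand $x$ is uniformly distributed over the closed interval [x_min, x_max] taken from the trace and, for the averaged value, that the flipped bit position is chosen uniformly among 32 bits. The per-bit ratio and the averaging are assumptions that follow the masking conditions stated above; they are not claimed to reproduce the exact equations of this section, and all function names are illustrative.

```python
def unset_below(k, a):
    """Count the numbers x in [0, a) whose k-th bit is 0."""
    cycle, half = 1 << (k + 1), 1 << k
    return (a // cycle) * half + min(a % cycle, half)


def p_mask_and(k, x_min, x_max):
    """P(bit k of x is 0): a flip in bit k of y is masked by AND x, y."""
    total = x_max - x_min + 1
    zeros = unset_below(k, x_max + 1) - unset_below(k, x_min)
    return zeros / total


def p_mask_or(k, x_min, x_max):
    """P(bit k of x is 1): a flip in bit k of y is masked by OR x, y."""
    return 1.0 - p_mask_and(k, x_min, x_max)


def average_mask(p_mask, x_min, x_max, width=32):
    """Average the per-bit masking probability over all bit positions."""
    return sum(p_mask(k, x_min, x_max) for k in range(width)) / width


# AND eax, 0x3: the other operand is the constant 0x3, so flips in eax are
# masked everywhere except in the two least significant bits.
print(average_mask(p_mask_and, 0x3, 0x3))
```

For the constant operand 0x3 the per-bit probability is 0 for bits 0 and 1 and 1 for all higher bits, which agrees with the AND eax, 0x3 example of Section V-A.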
C. CALCULATION OF $P_m$ FOR DATA USED IN VALUE COMPARISON
CMP instruction is often followed by Jcc instruction. Based on the conditional flag produced by CMP instruction, Jcc instruction takes one branch. The error in operand of CMP instruction is masked if the same branch is taken.
We take the JA instruction (jump near if above) as an example. Other types of Jcc instructions, which consider other conditions (such as equal and below), can be analyzed in a similar way. If $x > y$ in the CMP x, y instruction, JA performs a jump to the target instruction specified by the destination operand; otherwise execution continues with the instruction following the JA instruction. Assume the bit flip occurs in the $k$th bit of $y$. $y'$ represents the value of $y$ after the bit flip. The computation of $P_m$ is discussed in the following categories.
• when $x \le y_{min} \le y$
- If $y_k = 0$, the comparison result is not changed and thus the same branch is taken. The number of unset bits can be calculated by $\sum_{k=0}^{31} f^k_0(y_{min}, y_{max} + 1)$.
- If $y_k = 1$, to incur error masking, i.e., to maintain the comparison result, $y$ must satisfy $y \ge x + 2^k$. The number of set bits which satisfy this condition can be calculated by $\sum_{k=0}^{31} f^k_1(\max(x + 2^k, y_{min}), y_{max} + 1)$.
• when
Since the comparison result is not changed, error in y is masked.
k , y max + 1)).
• when $y < x \le y_{max}$
- If $y_k = 0$, $y' = y + 2^k < x$. To incur error masking, $y$ must satisfy
• when $y \le y_{max} < x$
- If $y_k = 0$, $y' = y + 2^k < x$. To incur error masking, $y$ must satisfy
To conclude, $P_m$ for data used in value comparison operations is shown below. The numerator is the sum of the numbers of events that satisfy the condition for error masking in each category.
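The masking condition for CMP followed by JA can also be evaluated by direct enumeration, which is a convenient reference for small ranges. The sketch below does not use the closed-form counts of this section. It assumes a fixed $x$, a value of $y$ uniformly distributed over [y_min, y_max], and a uniformly chosen flipped bit; the function name `p_mask_ja` and the `bits` parameter are illustrative.

```python
def p_mask_ja(x, y_min, y_max, bits=32):
    """Fraction of (y, k) pairs for which flipping bit k of y leaves the
    outcome of 'CMP x, y; JA target' unchanged (the jump is taken iff x > y).

    Brute-force enumeration: only intended for small ranges.
    """
    masked = 0
    total = 0
    for y in range(y_min, y_max + 1):
        taken = x > y
        for k in range(bits):
            y_flipped = y ^ (1 << k)          # single bit flip in y
            if (x > y_flipped) == taken:      # same branch: error is masked
                masked += 1
            total += 1
    return masked / total


# x = 2 and y in [3, 5] on 3-bit values: 6 of the 9 flips keep the branch.
print(p_mask_ja(2, 3, 5, bits=3))
```

Because every (value, bit) pair is visited, the cost grows with the size of the range, so such an enumeration is better suited to validating a closed-form estimate than to replacing it.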
VI. EXPERIMENTAL SETUP
This section describes the experimental infrastructure and the application workload used to evaluate the model. The effectiveness of OVF is validated by fault injections. In each experiment, detectors are derived from a fraction (top 10%, top 20%, ..., 100%) of the variables according to the OVF ranking. The values at the detector points are recorded and compared with the corresponding values in the golden run of the application. Except for address values, any deviation between the values in the golden run and the faulty run indicates a successful detection of the error. Address values are not compared in the detection process, since the address allocated by the OS often differs from one execution to another. Address values can be identified by their 8 most significant bits due to the virtual memory layout of the IA-32 architecture. For example, a stack address often starts with 0xbf.
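The detection rule can be sketched as follows. The sketch assumes 32-bit values recorded in the same order in both runs and classifies a detector point as an address from its golden-run value. The prefix set contains only the 0xbf example mentioned above; any further prefixes would have to come from the actual memory layout, and the function names are illustrative.

```python
def is_address(value, address_prefixes=frozenset({0xBF})):
    """Treat a 32-bit value as an address if its top 8 bits match a prefix."""
    return ((value >> 24) & 0xFF) in address_prefixes


def detects(golden, faulty):
    """Return True if any non-address detector value deviates from the
    golden run. golden and faulty are equally ordered value sequences."""
    return any(
        g != f
        for g, f in zip(golden, faulty)
        if not is_address(g)          # address values are never compared
    )


print(detects([1, 0xBFFFF000], [1, 0xBFFFE000]))  # only the address differs
print(detects([1, 2], [1, 3]))                    # a data value differs
```

Skipping address-like values avoids false alarms caused by differing allocations, at the price of missing errors that only corrupt addresses.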
A. FAULT MODEL
The fault model we assume is a single bit flip within the register file or memory. Faults in the opcode are not considered, since they always cause an illegal opcode exception rather than SDC [22]. The injection is done by flipping a randomly selected bit in the destination operand. We do not consider faults in system calls and library function calls, or faults in floating point registers.
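The single bit flip of this fault model amounts to an exclusive-or with a one-hot mask. The sketch below operates on a plain integer standing in for a 32-bit destination operand; the function name and the `rng` parameter are illustrative, and the snippet is independent of the Pin-based injector used in the experiments.

```python
import random


def flip_random_bit(value, width=32, rng=random):
    """Flip one uniformly chosen bit of a width-bit value.

    Returns the corrupted value and the index of the flipped bit.
    """
    bit = rng.randrange(width)        # bit position in [0, width)
    return value ^ (1 << bit), bit


rng = random.Random(0)                # fixed seed for a repeatable example
corrupted, bit = flip_random_bit(0x00000001, rng=rng)
print(hex(corrupted), bit)
```

Applying the same mask a second time restores the original value, which is convenient when an injected fault has to be undone.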
B. INFRASTRUCTURE
The experiment is conducted on an IPASON P18 workstation with an Intel i5 processor running Ubuntu 10.04, and GCC 4.4.3 is used for compilation.
In our experiments, we employ the dynamic instrumentation tool Pin for fault injection [23]. Pin is a dynamic binary instrumentation framework for the IA-32 and x86-64 instruction-set architectures that enables the creation of dynamic program analysis tools. The fault is injected into the destination operand after the target instruction is executed.
We develop a pintool called TracePrinter to extract data traces of instructions. The data trace contains the tag of instruction (sequence number of dynamic instruction, name of operand, etc) and the value of operand before/after the execution. By summarizing the trace of each static instruction, we obtain the maximum and minimum value of the operand for estimating P m . TracePrinter is also applied to generate traces of the instructions for detection rate assessment.
C. APPLICATION PROGRAMS
The benchmarks studied here are the Siemens suite of programs [24]. The Siemens programs considered are schedule and schedule2 (priority schedulers), print_tokens and print_tokens2 (lexical analyzers) and tcas (an aircraft collision avoidance system). These are C programs consisting of a few hundred lines of code, and each is equipped with an extensive test suite. The benchmarks only produce console output, implemented by the function printf.
The outcome of fault injection is classified into SDC, benign, crash and hang. The above outcomes are mutually exclusive and exhaustive.
VII. EXPERIMENTAL RESULTS
This section begins with results from an initial fault injection campaign to quantify the amount of opportunity available for our approach to exploit. We then proceed to examine the SDC detection rate for detectors derived from the variables with high OVF and compare the result with other methods. The cost of detectors is affected by benign detection rate. The benign detection rate for each method is presented in the end. 
A. PRELIMINARY FAULT INJECTION
For reference, Fig.3 shows the absolute SDC rates obtained by the fault injection experiments. The SDC rates of the studied programs range from 5.2% to 38.5%. The average SDC rate is 17.0%. The average benign rate is 58.0%, which is the highest among the outcomes.
B. SDC DETECTION RATE
In each experiment, detectors are derived from an increasing percentage (top 10%, top 20%, ..., 100%) of the variables according to the OVF ranking. Fig. 4 shows the SDC detection rate for detectors derived from varied fractions of variables. The results for SDC are normalized across the total number of errors observed. When the fraction of monitored variables reaches 10%, the average detection rate reaches 65.0%. When the fraction rises to 30%, the average detection rate reaches 82.0%. When all variables are monitored, the average detection rate reaches 89.8%. The detection rate does not reach 100% due to our definition of error detection: variables that hold addresses are not compared, and thus errors in addresses may not be covered by the detectors.
The detection rate increases slowly for most of the applications after the percentage of monitored variables exceeds 30%. The bottom 70% of the deployed detectors contribute only 7.8 percentage points of the SDC detection rate. The results indicate that OVF makes it possible to identify the variables that are critical to the propagation of SDC. Protecting only a small fraction of the variables is sufficient to achieve a relatively high SDC coverage.
C. COMPARISON WITH RELATED WORK
The results of fault injection campaign with the variable criticality metric CF [14] , fanout [25] and our method are compared in Fig.6 . The SDC detection rate is averaged across all studied programs. Each time we deploy an equal number of detectors derived by each method.
To obtain a low-cost solution for error detection, the SDC detection rate with a small fraction of variables protected is a major concern. When the fraction of monitored variables reaches 10%, the SDC detection rate for CF is 52.8%, and that for fanout is 44.2%. The SDC detection rate of our method (65.0%) is 12.2 percentage points higher than that of CF and 20.8 percentage points higher than that of fanout. As the percentage of monitored variables rises (≤ 50%), our method still achieves higher detection rates than the other two methods. When the percentage of monitored variables rises to nearly 100%, the detection rates of the three methods tend to equalize, since the protection gradually approaches full coverage.
By investigating the cases, we find that the variables preferred by CF or fanout tend to be of a particular type. The CF method is likely to choose variables with a long lifetime, because the index ICF increases whenever a read event occurs. Global variables or argv (the argument of the program's main procedure) are often identified as the most critical variables by the CF method. The fanout method chooses variables with a large number of successors, such as branch control variables.
To reveal the differences among the methods, we show the highest-ranked variables obtained by the three methods for the same program, tcas. The most critical variable obtained by CF is Positive_RA_Alt_Thresh[3]. The variable is initialized at the beginning of the program and its value is not changed afterwards. It is read many times by the function ALIM. According to the CF method, the variable has a long lifetime and its ICF increases each time it is read. According to the fanout method, the variable Own_Tracked_Alt is considered the most critical in tcas, since the node representing Own_Tracked_Alt has the largest number of immediate successors (5 successors). However, injections on Own_Tracked_Alt show a benign rate of over 90%. The related source code containing Own_Tracked_Alt is shown in Fig. 5. Own_Tracked_Alt is used in the return statement of the function Own_Below_Threat, whose return value determines the value of need_upward_RA. If Non_Crossing_Biased_Climb returns 0, the result of the AND (&&) operation is 0, so an erroneous Own_Tracked_Alt cannot change the result of the operation and does not contaminate need_upward_RA. The most critical variable obtained by our method is need_upward_RA. In Fig. 1, i_349 represents an instruction that accesses need_upward_RA. An error in i_349 (need_upward_RA) can propagate into a parameter of the printf function (i_356). The injection results show that the SDC rates of these variables are ordered as need_upward_RA (100%) > Positive_RA_Alt_Thresh[3] (10%) > Own_Tracked_Alt (7%), which indicates that need_upward_RA has a more significant effect on the output in the presence of soft errors.
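The masking effect on Own_Tracked_Alt can be illustrated with a small fault-injection sketch. It is a simplified model and not the tcas code of Fig. 5: it assumes that Own_Below_Threat compares Own_Tracked_Alt with a second altitude using "<", and the operand values, the 16-bit width, and the helper names are hypothetical.

```python
def own_below_threat(own_tracked_alt, other_tracked_alt):
    # Assumed form of the comparison; the actual source is shown in Fig. 5.
    return own_tracked_alt < other_tracked_alt


def need_upward_ra(non_crossing_biased_climb, own_tracked_alt, other_tracked_alt):
    # Python's `and` short-circuits like C's `&&`: a false left operand
    # fixes the result regardless of the right operand.
    return bool(non_crossing_biased_climb
                and own_below_threat(own_tracked_alt, other_tracked_alt))


def count_masked_flips(guard, own_alt, other_alt, width=16):
    """Count single-bit flips of own_alt that leave need_upward_ra unchanged."""
    golden = need_upward_ra(guard, own_alt, other_alt)
    masked = 0
    for bit in range(width):
        faulty_alt = own_alt ^ (1 << bit)  # inject a single-bit flip
        if need_upward_ra(guard, faulty_alt, other_alt) == golden:
            masked += 1
    return masked


# Hypothetical operand values.
print(count_masked_flips(False, 500, 600))  # guard is 0: every flip is masked
print(count_masked_flips(True, 500, 600))   # guard is 1: only some flips are masked
```

Even when the left operand of the AND is true, the comparison itself masks every flip that does not move Own_Tracked_Alt across the threshold. This is consistent with the high benign rate observed for this variable despite its large fanout.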
D. DETECTION RATE FOR BENIGN
Detecting a benign outcome incurs performance overhead without any increase in reliability. For low-cost detection, we expect a low benign detection rate. The benign detection rates of the three methods, averaged across all studied programs, are shown in Fig. 7. Detectors derived by OVF obtain a relatively low benign detection rate. The benign detection rate for monitoring the top 10% of variables ranked by OVF value is 6.2%, which is 1.6% lower than fanout and 3.6% lower than CF. Our method achieves a lower benign detection rate because the error propagation model accounts for the error masking effect. A high value of the parameter P_m implies a high probability of a benign outcome and results in a low instantaneous OVF. Variables with low OVF values are not chosen to derive detectors.
E. SDC DETECTION LATENCY
The SDC detection latency is measured as the number of dynamic instructions executed by the program from fault activation to SDC detection. Fig. 8 shows the detection latency for detectors derived by the three methods, averaged across all studied programs. OVF achieves a relatively low detection latency compared with the other methods. When the percentage of monitored variables is 10%, the detection latency of OVF is 13.8% lower than that of fanout and 33.1% lower than that of CF. As the fraction of monitored variables rises, the detection latency tends to decrease, since newly deployed detectors reduce the average distance between detectors. High-OVF variables are usually SDC-prone variables, meaning that faults originating at these sites are a major source of SDC. The derived detectors check the values of these variables, so the distance between fault activation locations and detectors is relatively short and faults can be detected with low latency. The CF method chooses variables with a long lifetime. These variables are often accessed in the initialization phase of execution, such as argv and argc in the main function. They may not be in or near the SDC-prone sites, which increases the detection latency.
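The latency measure and the effect of detector density can be sketched as follows. This is a simplified model under stated assumptions: positions are dynamic-instruction indices in a trace, the first detector reached at or after fault activation is assumed to catch the error, and all names and numbers are hypothetical.

```python
from bisect import bisect_left


def detection_latency(activation_index, detector_indices):
    """Dynamic instructions executed from fault activation to the next detector.

    detector_indices must be a sorted list of dynamic-instruction indices at
    which a detector executes. Returns None if no detector runs at or after
    the activation point (the fault goes undetected).
    """
    pos = bisect_left(detector_indices, activation_index)
    if pos == len(detector_indices):
        return None
    return detector_indices[pos] - activation_index


def mean_latency(activation_indices, detector_indices):
    """Average latency over the faults that are detected, or None if none are."""
    latencies = [detection_latency(a, detector_indices) for a in activation_indices]
    detected = [lat for lat in latencies if lat is not None]
    return sum(detected) / len(detected) if detected else None


# Hypothetical fault activation points and two detector placements.
activations = [150, 320, 390]
sparse_detectors = [100, 400]
dense_detectors = [100, 200, 330, 400]

print(mean_latency(activations, sparse_detectors))
print(mean_latency(activations, dense_detectors))
```

Adding detectors between existing ones shortens the distance from each activation point to the next check, which matches the downward trend of latency as more variables are monitored.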
VIII. CONCLUSION
This paper proposes a metric called Output Vulnerability Factor (OVF) to assess the vulnerability of variables to SDC. The metric can be used to rank the priority of variables in selective instruction-level detection techniques.
The enhanced DDG (eDDG) model is proposed to analyze error propagation using the trace of a fault-free execution.
To separate benign-causing faults from potential SDC-causing faults, we estimate the error masking probability for data used in value comparison and logistic operations. Experimental results show that our approach achieves a high SDC detection rate with low latency when detectors derived from high-OVF variables are deployed.
Our approach can help to derive SDC detectors for applications. Future work will involve integrating our approach into a compiler and automatically deriving invariant-based assertions for checking. Moreover, the OVF computation currently has to be performed for each input. We plan to use graph similarity measures to derive pairwise similarity scores for the nodes of different eDDGs. A high similarity score is likely to denote highly similar execution profiles and error propagation. Such a measure would further reduce the cost of OVF computation by sharing certain intermediate results between eDDGs.
