Abstract-Although algorithm level re-computing techniques can trade off the fault detection capability against the time overhead of a Concurrent Error Detection (CED) scheme, they incur 100% time overhead when the strongest CED capability is achieved. Using the idle cycles in the data path to do the re-computation can reduce this time overhead. However, dependences between operations prevent the re-computation from fully utilizing the idle cycles. Deliberately breaking some of these data dependences can further reduce the time overhead associated with algorithm level re-computing. According to the experimental results, the proposed technique brings the time overhead down to 0-60%, while the associated hardware overhead ranges from 12% to 50% depending on the design size.
NOMENCLATURE
VLSI Very Large Scale Integrated
CED Concurrent Error Detection
RTL Register Transfer Level
CDFG Control Data Flow Graph
ASAP As Soon As Possible schedule
$p_{x,i}$ A function unit with a single stuck-at fault is used multiple times before the results are checked. $p_{x,i}$ is the probability that the expected computation result is $x$ (logic 0 or 1) during the $i$th use of this module.
$p^{(j)}_{x,i}$ A function unit with two stuck-at faults is used multiple times before the results are checked. $p^{(j)}_{x,i}$ is the probability that the expected result is $x$ during the $i$th use of this module and due to the $j$th fault.
$e_1$, $e_2$ The fault offsets due to the 1st and 2nd faults respectively.
$m$, $n$ The number of times a defective module is used in normal and re-computations respectively.
I. SUMMARY AND CONCLUSIONS

We proposed a new Register Transfer (RT) level time redundancy based CED technique that uses ruptured data dependences in the data path. The proposed CED technique is applicable to data processing designs with strong data dependences and idle clock cycles in data paths. According to the experimental results, the CED designs using ruptured dependences have strong CED capabilities with 0-60% time overhead. Besides, this technique consumes less hardware than the semi-CED technique. The proposed technique does not cover the control logic of such data processing designs. A duplication scheme for control logic can provide good CED capability without involving too much overhead, because the control logic of such designs is simple and less hardware intensive.
II. INTRODUCTION
Motivation:
Deep sub-micron scaling of VLSI devices is being aggressively pursued to improve the speed of operation, reduce power consumption, and increase the scale of integration. Deep sub-micron scaling achieves these objectives by shrinking the device dimensions, reducing power supply voltages, and increasing the operating frequencies. However, such aggressive scaling negatively impacts device and overall VLSI system reliability by making the devices susceptible to Single Event Upsets (SEUs). An SEU is a radiation-induced bit-flip in a VLSI circuit. In previous VLSI technologies, which used larger device dimensions and higher power supply voltages, circuit nodes stored large amounts of charge. Only high-energy particles such as α-particles were energetic enough to perturb the state of the circuit nodes, causing bit-flips. In the deep sub-micron regime, less charge is stored in the circuit nodes. Hence even particles with less energy, such as neutrons at sea level, are able to flip the states of the nodes [17], [18].
Increasing operating frequencies is another reason for the increased susceptibility of deep sub-micron VLSI devices to SEUs. An SEU-induced bit-flip is a perturbation of the state of a circuit node lasting on the order of a few picoseconds. In a circuit which contains both combinational logic and sequential logic, an SEU in the sequential logic occurring just before the sensitive edge of the clock will be latched, resulting in a fault. On the other hand, an SEU in the combinational logic does not result in a fault if the clock period is large compared to the duration of the SEU (nanoseconds vs. picoseconds). As the operating frequencies move into the GHz and THz range, the clock periods approach the duration of an SEU, thereby increasing the probability of latching such faults. Fig. 1 shows the error rate in a simple circuit containing an inverter followed by a D latch [19]. The operating frequency is plotted along the x-axis (MHz) while the SEU error rate is plotted along the y-axis. The dotted line shows the SEU error rate in the sequential logic (D latch), the dashed line shows the SEU error rate in the combinational logic (inverter), and the solid curve shows the total SEU error rate of this circuit. From this plot, one can see that the probability of latching faults caused by SEUs in the combinational logic increases almost linearly with the operating frequency, and will dominate the total SEU error rate in high-speed deep sub-micron designs [19].
Fig. 1. Error rate versus operating frequency [19].
A third reason for the increased susceptibility of deep sub-micron VLSI circuits to SEUs is the gate threshold voltage $V_T$. $V_T$ is a function of the constant sub-threshold slope and the ratio of "on" current to "off" current. To scale down $V_T$, the on-off current ratio has to be reduced, and this in turn increases the susceptibility of the device to SEUs [20].
Radiation can also cause permanent faults through a Single Event Latchup (SEL). In today's CMOS technology, the combination of wells and substrates forms parasitic n-p-n-p structures. Radiation can trigger these structures and create a short from VDD to GND. This turns the parasitic structure fully on and causes a short across the device until the device burns up or its power is cycled. Such latchups can cause permanent damage to devices.
Related work: Antola et al. presented a register transfer (RT) level CED technique in which the datapaths of normal and re-computations share hardware resources when the 'aliasing' probability is not harmed [5] , [6] . This technique reduces the hardware overhead associated with the hardware redundancy based CED. They also provided a semi CED technique which uses the idle cycles in the data path to implement the re-computation [7] . The idle cycles of a component in a synchronous synthesized design are the control steps in which the outputs of this component are not latched, or not used by any other computation. However, the idle cycles in an iteration of normal computation may not be enough to implement an iteration of re-computation. An iteration of re-computation may span several successive iterations of normal computation. Karri and Iyer presented a RT level CED technique that uses the spare computation cycles and the spare data transfer cycles for CED [12] . Karri & Orailoglu [13] , and Lakshminarayana et al. [14] presented fault security based techniques which yield CED data paths with less than a proportional increase in hardware.
Blough et al. presented an algorithm to find the optimal checkpoints in a roll back based system [8] . The algorithm either searches for the shortest length recovery point subject to a constraint on the number of registers, or searches for the lowest cost recovery point subject to a constraint on the maximum delay. Karri & Orailoglu [10] , and Ravi et al. [11] also developed high-level synthesis algorithms targeting self-recovering data paths. Duplication and comparison of results at checkpoints was used in [10] , while duplication and comparison of results as soon as they become available was used in [11] . Hamilton and Orailoglu presented a roll forward based transient fault recovery technique by grouping operations in a scheduled and check-pointed Control Data Flow Graph (CDFG) of an algorithm into nonoverlapping sub CDFG called strings [1] . Upon detecting a fault, the faulty string is recomputed in parallel with the computation of the subsequent strings. This roll forward technique reduces the performance overhead associated with recovery.
Hamilton and Orailoglu presented a novel technique to diagnose a faulty unit [2] . The basic idea is as follows: if unit A generates a result identical to that generated by unit B when they perform operation 1, but generates a result that does not match with that generated by unit C when they perform operation 2, unit C is faulty.
An RT level built-in self-repair technique using spare modules was proposed in [15]. Chan and Orailoglu presented a methodology to re-configure the data path upon detecting a faulty unit [3]. An L/U block matrix is constructed such that the columns of the matrix indicate the hardware units while the rows indicate clock cycles. If the $j$th hardware unit is busy at the $i$th clock cycle, the cell $(i, j)$ indicates an operation. The cells above the diagonal form the U block while the cells below the diagonal form the L block. If a unit is diagnosed as faulty, the operations in the U block and the operations in the L block are rebound to other units, with a separate rebinding rule for each block. Orailoglu extended this technique to register operations as well [4]. Iyer, Karri and Koren presented an area-efficient technique for fabrication-time re-configurability called phantom redundancy. Phantom redundancy adds extra interconnect so as to render the resulting micro architecture reconfigurable in the presence of any (single) functional unit failure [9].
Background of Algorithm Level Re-computing: Algorithm level re-computing is a time-redundancy-based CED technique which uses two types of computations: normal computation and re-computation. They share the same hardware resources. Normal computations are carried out on all input samples up to the $k$th sample. After the $k$th input sample is processed by a normal computation, the result is stored. Then the $k$th result is re-computed and compared to the stored result, with a mismatch indicating an error. Different implementations of re-computation yield different fault detection capabilities. Straightforward duplication of operations in time will miss permanent faults because, for the same inputs, a hardware module with permanent faults will always produce the same faulty outputs. Karri and Wu proposed two algorithm-level re-computing techniques which exploit RT-level diversities (data diversity and allocation diversity) [16]. In allocation diversity, the operation-to-operator allocation used in normal computation is different from the one used in re-computation. In data diversity, operands are shifted before the re-computation. By enabling a fault to affect the normal result and the re-computed result in two different ways, the techniques which exploit RT level diversities yield good CED capability with low area overhead. The checking ratio $k$ is the ratio of the total number of results to the number of results that have been re-computed. $k$ can be used to trade off fault detection capability against the time overhead associated with CED. When $k$ is set to 1, the strongest CED capability is achieved at the cost of 100% time overhead. In this paper, we propose a new RT level time redundancy based CED technique which exploits the idle cycles in the data path. By deliberately breaking the data dependences of re-computation and using the idle cycles in a normal computation data path, we can reduce the time overhead associated with algorithm level re-computing.
We will review algorithm level re-computing techniques, and propose the new CED technique in Section III. The CED capability of the proposed scheme is analyzed in Section IV. Experimental results will be discussed in Section V, and a fault injection study is given in Section VI.
III. ALGORITHM LEVEL RE-COMPUTING USING RUPTURED DATA DEPENDENCES
A. The Idea
Consider a CDFG representation of an algorithm with three additions and two multiplications, shown in Fig. 2(a). In algorithm level re-computing [16], re-computations (the part of the CDFG constituted by shaded operations) use the same CDFGs and RT level schedules as normal computations, and start only after normal computations finish, as shown in Fig. 2(b). The technique uses 6 clock cycles; when the checking ratio is 1, it entails 100% time overhead. This overhead can be reduced by utilizing the idle cycles in normal computation, as shown in Fig. 2(c). Here the re-computation uses the same CDFG as the normal computation but a different RT level schedule. No extra operators are used. Meanwhile, operations in the CDFG which are carried out on an operator in normal computation are carried out on different operators during re-computation. The re-computation starts from clock cycle C2 and ends at clock cycle C5, resulting in 67% time overhead. The dependences between the two multiplications and the subsequent addition prevent the re-computation from using the idle adder in clock cycle C2. The semi CED technique [7] also has this problem. As shown in Fig. 2(d), the data dependence between additions 1 and 2 forces the re-computation to span three iterations of normal computation.
By breaking some of these dependences, we can further reduce the time overhead associated with CED. Fig. 2(e) shows the algorithm level re-computing technique using ruptured dependences. In this CED design, the re-computation of one addition receives its operands from the normal computation data path and is carried out in clock cycle C2. The results of the two multiplications and of two of the additions are each checked, with a mismatch indicating an error. The re-computation does not use extra operators, starts from clock cycle C2, and ends at clock cycle C4, thereby reducing the time overhead to 33%.
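The three overhead figures quoted for Fig. 2 follow from simple arithmetic, assuming the normal computation alone occupies 3 clock cycles (half of the 6 cycles that give 100% overhead). The helper name `time_overhead` below is ours and is only a minimal sketch of that calculation.

```python
def time_overhead(normal_cycles, ced_cycles):
    """Relative time overhead of a CED schedule over the non-CED schedule."""
    return (ced_cycles - normal_cycles) / normal_cycles


# Assumed cycle counts read off the description of Fig. 2:
# (b) re-computation after normal computation ends at C6,
# (c) re-computation in idle cycles ends at C5,
# (e) re-computation with ruptured dependences ends at C4.
for label, last_cycle in (("Fig. 2(b)", 6), ("Fig. 2(c)", 5), ("Fig. 2(e)", 4)):
    print(label, f"{time_overhead(3, last_cycle):.0%}")
```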
B. Generating a Ruptured CDFG and Scheduling
Generating a CDFG schedule with ruptured data dependences calls for a heuristic algorithm. Raghunathan et al. presented a rescheduling technique to manage the transient power consumed by the device [21]. This technique calculates the rescheduling potential for each operation and iteratively reschedules the design until it reaches a solution of the required quality. Here we present a different, greedy algorithm and apply it to the small example shown in Fig. 2(a). The generated CDFG may not have the optimal schedule. The procedure for generating the CDFG with ruptured dependences used for re-computation is as follows:
1) First, we duplicate the CDFG used for normal computation and use the idle cycles in the normal computation for re-computation as shown in Fig. 2(c) . If the time overhead of this CED design satisfies the user constraint, we stop. Otherwise, go to the next step.
2) Next, we identify the idle cycles in the newly constructed CED design and the As Soon As Possible (ASAP) schedule for each operation. Because we only need to consider the data dependences in the normal computation data path, the ASAP schedule of an operation in re-computation is the clock cycle in which its corresponding normal computation is carried out. For example, in Fig. 2(e), the ASAP clock cycle of the re-computed addition whose dependences are ruptured is C2.
3) Next, we identify the first idle cycle for each operation by choosing the first available idle cycle between its ASAP schedule and its current schedule. We select the operation which has the maximum difference between its current schedule and its first idle cycle, and partition the CDFG used for re-computation by breaking the data edges between the selected operation and its predecessors. After partitioning, the selected operation will receive operands from normal computation. Meanwhile, the outputs of its original predecessors (i.e., in re-computation) will be checked.
4) Next, we schedule the selected operation at its first idle cycle and reschedule the CDFG of re-computation. If the time overhead satisfies the user constraint, we stop. Otherwise, go to the second step.
To be more general, assume that the CDFG of the original design has $N$ operations. The first, second, and third steps each take $O(N)$ time, because each of these steps goes over all the nodes once. The fourth step introduces a loop, and the number of iterations depends on the value of $N$ as well as on the availability of idle cycles inside the data path. The worst-case number of iterations could be $N$, which makes the complexity of the whole procedure $O(N^2)$. After the scheduling phase completes, operations should be bound to operators under the constraint that the normal computation and the re-computation of an operation are carried out on two different operators. Most synthesis software, such as the Synopsys Behavioral Compiler, can do this job by just applying the resource constraint.
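The selection in step 3 can be sketched as follows. This is only an illustration under our own assumptions, not the authors' implementation: the data layout (a dictionary of re-computation operations with their operator kind, ASAP cycle, and current cycle, plus a set of idle cycles per operator kind) and all names and cycle numbers are hypothetical.

```python
def first_idle_cycle(asap, current, idle_cycles):
    """Earliest idle cycle between the ASAP and current schedule (inclusive)."""
    for cycle in range(asap, current + 1):
        if cycle in idle_cycles:
            return cycle
    return current  # no earlier idle slot: the operation stays where it is


def select_operation_to_rupture(ops, idle):
    """Pick the operation that gains the most cycles by moving to an idle slot.

    ops:  name -> (operator kind, ASAP cycle, current cycle)
    idle: operator kind -> set of idle cycles of that kind
    Returns (name, target cycle), or (None, None) if nothing can move earlier.
    """
    best_name, best_cycle, best_gain = None, None, 0
    for name, (kind, asap, current) in ops.items():
        cycle = first_idle_cycle(asap, current, idle.get(kind, set()))
        gain = current - cycle
        if gain > best_gain:
            best_name, best_cycle, best_gain = name, cycle, gain
    return best_name, best_cycle


# Hypothetical re-computation schedule: one adder slot is idle in cycle 2.
ops = {"add1'": ("add", 2, 4), "add2'": ("add", 3, 5), "mul1'": ("mul", 1, 2)}
idle = {"add": {2}, "mul": set()}
print(select_operation_to_rupture(ops, idle))
```

The selected operation would then have its incoming data edges cut and be rescheduled at the returned cycle (step 4), after which the idle cycles are recomputed and the loop repeats.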
IV. ANALYSIS OF CED CAPABILITY OF ALLOCATION DIVERSITY BASED CED TECHNIQUES
The algorithm-level techniques, including the basic allocation diversity (Fig. 2(b) ), semi-CED (Fig. 2(d)) , and algorithmlevel re-computing using ruptured data dependences (Fig. 2(e) ) use different operation-to-operator allocations. However, faults may still be missed even when allocation diversity is employed. We will estimate the probability of not detecting a fault in an RTL data path and use it as a measure of the CED capability of these allocation diversity based techniques. We list the common properties associated with the allocation diversity as follows:
1) The operations carried out on an operator in normal computation will be different from the operations carried out on it in re-computation.
2) The number of times an operator is used in normal computation may be different from the number of times it is used in re-computation.
3) If a defective operator is involved in either normal computation or re-computation but not both, its fault will be detected.
Assumption and Motivation: In the RTL model, effects of stuck-at faults can be modeled as an offset from the correct result in arithmetic operators such as adders and multipliers. Let us define the difference between the faulty result and the correct one as the fault offset. In an RTL design which employs algorithm-level re-computing, a (faulty) hardware module is reused several times within a computation, each time with a different set of operands. Hence a fault can affect the result several times. Algorithm-level re-computing cannot detect a fault if the accumulation of the fault offsets in the normal computation is identical to the accumulation of the fault offsets in the re-computation. Hence, the probability of detecting a fault is equal to 1 minus the probability that the accumulation of the fault offsets in the normal computation is identical to the accumulation of the fault offsets in the re-computation. A fault offset resulting from an operation carried out on a faulty unit can be either independent of the inputs to the unit (constant offset case), or dependent on the inputs (variable offset case), based on the location of the fault in the unit. For example, on one hand, a stuck-at-1 fault at the output of an XOR gate results in a constant offset of 1, independent of the input. On the other hand, a stuck-at-1 fault on one of the inputs to the XOR gate results in a variable offset of either 1 or $-1$, depending on the input.
We will derive an expression to compute the probability of detecting a fault by considering just the constant offset case discussed above for two reasons: (a) A closed form expression to compute the probability of detecting a fault which considers both the constant offset case and the variable offset case is very complex. Further, exhaustive simulation to determine this probability by considering all cases is impractical. (b) The probability that the accumulation of the fault offsets in the normal computation is identical to the accumulation of the fault offsets in the re-computation is more likely due to the constant offset case than due to the variable offset case. We confirm this by performing a gate level simulation on a 6-bit ripple carry adder. One bit cell of this ripple carry adder is shown in Fig. 3 .
Simulation: We simulated the circuit by considering all possible fault locations and by feeding pseudo random inputs to the resulting faulty circuit. We assume there is only one stuck-at-1 fault in the 6-bit ripple carry adder and any connection in the schematic shown in Fig. 3 has equal probability to be stuck-at-1. Table I summarizes the percentage of equal accumulated fault offsets due to the constant offset case and due to the variable offset case. The first row of Table I shows the number of times this ripple carry adder is used in normal computation while the first column shows the number of times it is used in re-computation. Every cell in Table I has two percentage numbers. The first one shows the percentage of equal offsets due to a constant fault offset while the second one shows the percentage of equal offsets due to a variable fault offset. These results show that the case where normal and re-computations have equal faulty offsets is mostly due to the constant fault offset case.
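The distinction between constant and variable offsets can be explored with a small gate-level model. The sketch below is not the circuit of Fig. 3, which is not reproduced here; it assumes a textbook full-adder cell (two XORs, two ANDs, one OR), and the node names `a`, `b`, `cin`, `x`, `sum`, and `cout` are our own. It enumerates all input pairs of a 6-bit ripple carry adder exhaustively rather than using pseudo random inputs.

```python
def full_adder(a, b, cin, stuck=None):
    """One bit cell; `stuck` names a node forced to 1 (hypothetical node names)."""
    def node(name, value):
        return 1 if stuck == name else value

    a, b, cin = node("a", a), node("b", b), node("cin", cin)
    x = node("x", a ^ b)                      # internal XOR output
    s = node("sum", x ^ cin)                  # sum output
    cout = node("cout", (a & b) | (x & cin))  # carry output
    return s, cout


def ripple_add(a, b, width=6, fault=None):
    """Ripple carry adder; `fault` is (bit position, node name) or None."""
    carry, result = 0, 0
    for i in range(width):
        stuck = fault[1] if fault is not None and fault[0] == i else None
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry, stuck)
        result |= s << i
    return result | (carry << width)


def fault_offsets(fault, width=6):
    """Set of all (faulty result - correct result) values over every input pair."""
    return {ripple_add(a, b, width, fault) - (a + b)
            for a in range(1 << width) for b in range(1 << width)}


print(fault_offsets((2, "sum")))  # stuck-at-1 on a sum output
print(fault_offsets((2, "x")))    # stuck-at-1 on an internal node
```

In this model, a stuck-at-1 on the sum output of bit 2 either leaves the result unchanged or adds 4, whereas a stuck-at-1 on the internal XOR node can shift the result by either +4 or -4 depending on the operands.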
Derivations: Operations such as multiplication and exponentiation in a computation magnify the effects of faults at their inputs, making them easy to detect. For example, consider a multiplication which calculates $x \times y$, where both $x$ and $y$ are addition results from two disjoint adders. Assuming that a faulty adder with fault offset $e$ calculates $x$ in the normal computation and calculates $y$ in the re-computation, the faulty multiplication result from the normal computation will be $(x + e)\,y$, while the faulty result from the re-computation will be $x\,(y + e)$. The fault will be missed only when $x = y$, which happens rarely. On the other hand, because linear operations such as additions and subtractions do not magnify the effects of faults at their inputs, we will focus on the parts of the CDFG which have only additions/subtractions. There might be multiplications present only at the beginning of the CDFG. We will consider stuck-at-1 faults in the following analysis. (A similar analysis can be carried out for stuck-at-0 faults.) Because in an RTL design a (faulty) hardware unit is reused several times within a computation, the fault can affect the result several times. Let $p_{0,i}$ be the probability that the expected result is 0 due to the $i$th use of the defective module, and let $p_{1,i}$ be the probability that the expected result is 1 due to the $i$th use of the defective module.
Consider the CDFG shown in Fig. 4. This CDFG uses two adders and takes two clock cycles. The adder (shaded in dark) is the defective module and is used twice (once in clock cycle 1 and once in clock cycle 2). After the first use of the faulty adder in clock cycle 1, the probability of a correct result is $p_{1,1}$ and the probability of a wrong result is $p_{0,1}$. Similarly, after the second use of the faulty adder in clock cycle 2 (assuming that the inputs to the second use are correct), let the probability of a correct result be $p_{1,2}$ and the probability of a wrong result be $p_{0,2}$. Now we will derive probabilities for not detecting the following three cases of faults: a single stuck-at-1 fault, stuck-at-1 faults in nonadjacent bit positions, and stuck-at-1 faults in adjacent bit positions.
Fig. 4. Example CDFG.
Case 1. A Single Stuck-at-1 Fault: In Fig. 4, every time the faulty adder is used, it may offset the correct result by the fault offset. Let $p_{0,i}$ and $p_{1,i}$ denote the probabilities that the expected result of the $i$th use is 0 (wrong under a stuck-at-1 fault) and 1 (correct), respectively. After both uses, we will have the expected result with probability $p_{1,1}\,p_{1,2}$; a wrong result offset once with probability $p_{0,1}\,p_{1,2}$ (first use only) or $p_{1,1}\,p_{0,2}$ (second use only); and a wrong result offset twice with probability $p_{0,1}\,p_{0,2}$. In fact, the probabilities for the second use depend on the probabilities for the first use. However, accounting for this would make the analysis too complex, so we assume these probabilities are independent of each other. This does not significantly alter the analysis result. These probabilities for the general case, in which the defective module is used several times, are summarized in Table II. If allocation diversity is employed, a fault will not be detected when the accumulated fault offsets that come from the normal computation and the re-computation are identical. If the defective module is used $m$ times in the normal computation and $n$ times in the re-computation, the probability that a fault is not detected is the probability that the two results are offset by the same amount, as given in (1). The probability of not detecting a fault depends on the number of times the defective module is used in normal computation, $m$, the number of times it is used in re-computation, $n$, the probability of a wrong result, and the probability of a correct result. The latter two may vary depending on the fault position and the implementation of the defective module. Fig. 5 plots the probability of not detecting a fault for three sets of these two probabilities. In Fig. 5, the x-axis is the number of times the defective module is used in normal computation, $m$, while the y-axis is the number of times the defective module is used in re-computation, $n$. The z-axis stands for the probability of not detecting the fault in this defective module. From this plot, we can observe that the probability of not detecting a single fault peaks when $m = n$, and is close to 0 when $m \gg n$ or $m \ll n$.
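Since (1) and Table II are not reproduced here, the following is only a sketch of one reading that is consistent with the text, under several assumptions of ours: the offset is constant, uses are independent, every use has the same probability `p_wrong` of a wrong result, and a fault counts as missed only when both computations accumulate the same nonzero number of offsets. The last assumption is chosen because it reproduces the 0.25 figure the paper quotes for an operator used once in each computation when the probabilities are taken as 0.5. The function names are hypothetical.

```python
from math import comb


def binom_pmf(uses, hits, p):
    """Probability that exactly `hits` of `uses` independent uses are affected."""
    return comb(uses, hits) * p**hits * (1 - p) ** (uses - hits)


def miss_probability_single(m, n, p_wrong):
    """Probability that normal and re-computation carry the same nonzero offset.

    m, n:    uses of the defective module in normal and re-computation
    p_wrong: assumed constant per-use probability of a wrong result
    """
    return sum(binom_pmf(m, j, p_wrong) * binom_pmf(n, j, p_wrong)
               for j in range(1, min(m, n) + 1))


print(miss_probability_single(1, 1, 0.5))  # equal usage
print(miss_probability_single(1, 8, 0.5))  # very uneven usage
print(miss_probability_single(3, 0, 0.5))  # module unused in re-computation
```

Under these assumptions the sketch shows the trend described for Fig. 5: the miss probability is largest when the usage counts are equal and falls toward 0 as they diverge.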
Case 2. Two Stuck-at-1 Faults in an Operator:
Let $p^{(1)}_{0,i}$ be the probability that the expected result is 0 during the $i$th use of the defective module due to the first fault, and let $p^{(1)}_{1,i}$ be the probability that the expected result is 1 during the $i$th use of the defective module due to the first fault. Also, let $p^{(2)}_{0,i}$ and $p^{(2)}_{1,i}$ be the corresponding probabilities due to the second fault. Last, let $e_1$ be the error offset due to the first fault, and $e_2$ be the error offset due to the second fault. Assuming that the first fault occurs in a more significant bit position and affects the result $a$ times, while the second fault occurs in a less significant bit position and affects the result $b$ times, the possible final results are the correct result plus $a\,e_1 + b\,e_2$.
a) Stuck-at-1 faults in nonadjacent bit positions:
If the faulty bit positions are so far apart that different combinations of fault occurrences yield different offsets, the probabilities of correct and faulty results when the defective module is used multiple times are summarized in Table III.
b) Stuck-at-1 faults in adjacent bit positions:
If the faulty bit positions are adjacent, some combinations of fault occurrences have the same effect on the final result as other combinations. Fig. 7 shows the probability of not detecting two adjacent stuck-at-1 faults for three different sets of probability values. The x-axis and y-axis stand for the number of times the defective module is used in the normal computation and the re-computation, respectively. It is easy to see that such faults will be missed with a higher probability than in Case 2a. However, the probability is still lower than in Case 1. Similar to Case 1 and Case 2a, the probability of not detecting this kind of fault peaks when the defective module is used the same number of times in the normal computation and the re-computation, and is close to 0 when one usage count is much larger than the other. Comparing Figs. 6 and 7 with Fig. 5 shows that the probability of not detecting a fault decreases as the number of faults in the hardware increases. This is because, as the number of faults increases, the number of possible faulty results increases, thereby reducing the possibility that these faulty results match.
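The two-fault cases can be sketched in the same spirit as Case 1. This is an illustration under our own assumptions, not the paper's Table III or its equations: both faults act independently with constant per-use probabilities `p1` and `p2`, the accumulated offset is the number of activations of each fault times its offset, and a fault is missed when the normal computation and the re-computation accumulate the same nonzero offset. Nonadjacent bit positions are modeled by offsets far enough apart that no two combinations collide (256 and 1 here), adjacent positions by offsets 2 and 1. All names are hypothetical.

```python
from collections import defaultdict
from math import comb


def binom_pmf(uses, hits, p):
    """Probability that exactly `hits` of `uses` independent uses are affected."""
    return comb(uses, hits) * p**hits * (1 - p) ** (uses - hits)


def offset_distribution(uses, p1, p2, e1, e2):
    """Distribution of the accumulated offset a*e1 + b*e2 after `uses` uses."""
    dist = defaultdict(float)
    for a in range(uses + 1):
        for b in range(uses + 1):
            dist[a * e1 + b * e2] += binom_pmf(uses, a, p1) * binom_pmf(uses, b, p2)
    return dist


def miss_probability_two(m, n, p1, p2, e1, e2):
    """Probability that both computations carry the same nonzero offset."""
    normal = offset_distribution(m, p1, p2, e1, e2)
    recomputed = offset_distribution(n, p1, p2, e1, e2)
    return sum(prob * recomputed.get(offset, 0.0)
               for offset, prob in normal.items() if offset != 0)


print(miss_probability_two(1, 1, 0.5, 0.5, 256, 1))  # nonadjacent, one use each
print(miss_probability_two(2, 2, 0.5, 0.5, 256, 1))  # nonadjacent, two uses each
print(miss_probability_two(2, 2, 0.5, 0.5, 2, 1))    # adjacent, two uses each
```

With one use each and probabilities of 0.5, both variants give 0.1875, matching the figures quoted for the multipliers of Fig. 2(b) and (d). With two uses each, the adjacent case comes out higher than the nonadjacent one, because colliding offsets (for example, two activations of the lower bit versus one of the higher bit) make matching results more likely.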
Comparing the Allocation Diversity Based CED Techniques: The proposed CED technique selectively breaks some data dependences of re-computation, thereby partitioning the CDFG of re-computation into several small sub-CDFGs. This improves the CED capability of the proposed technique in two ways:
1) It introduces additional check points at the output of each sub-CDFG. A defective module used in multiple sub-CDFGs is checked several times, thereby inherently increasing the probability of detecting the faults in it. Compared to the basic allocation diversity and semi CED techniques, which have only one check point for each primary output, the proposed CED technique performs better at detecting faults.
2) According to the analysis above, when a defective operator is used far more often in one of the two computations than in the other, the probability of not detecting its faults is close to 0. Partitioning the CDFG into several sub-CDFGs increases the unevenness of usage between the normal computation and the re-computation for every operator. Let us use the two multipliers of Fig. 2 as an example. In Fig. 2(b) and (d), each of the multipliers is used once in normal computation and once in re-computation. This results in probabilities of 0.25, 0.1875, and 0.1875 for missing a single stuck-at-1 fault, two nonadjacent faults, and two adjacent faults, respectively (assuming that the probabilities of correct and wrong results are all 0.5). By using the technique with ruptured dependences, these two multipliers are partitioned into two sub-CDFGs, and the probabilities of not detecting these three kinds of faults are all 0.
V. EXPERIMENTAL RESULTS
Although the experimental data and error detection probabilities are based on stuck-at-1 faults, the technique applies to stuck-at-0 faults as well. Meanwhile, we also implemented the examples using allocation diversity and semi-CED [7], and compared their area and performance overheads with those of the new technique using ruptured dependences.
FIR filter: Our example FIR filter implements $\sum_i \mathrm{Coef}(i) \times \mathrm{In}(i)$, where In(i) are previous inputs and Coef(i) are constant coefficients. It accepts one input, produces one output, and contains 17 multiplications and 16 additions. Table IV shows the results for the non-CED design and for the CED designs using basic allocation diversity, ruptured dependences, and semi CED in columns 2, 3, 4, and 5, respectively. The second and third rows show the number of operators used by these designs. In this example, all the CED techniques use the same number of operators as the non-CED design. The fourth row shows the area consumed in terms of unit cells, while the fifth row shows the corresponding area overhead. Because the original design consumes very little hardware, the CED design using ruptured dependences involves 53% area overhead, while the semi CED design incurs 74.3% area overhead. The number of clock cycles and the corresponding time overhead are listed in rows 6 and 7, respectively. Compared to the 8 clock cycles used by the non-CED design, the CED design using allocation diversity uses 16 clock cycles, resulting in 100% time overhead, while the CED design using ruptured dependences uses 12 clock cycles, resulting in 50% time overhead. In the semi CED design, the re-computation takes three iterations of normal computation, translating to 24 (3 × 8) clock cycles. However, because in the semi-CED technique the re-computation does not interrupt normal computation, the time overhead associated with this technique is 0. The last row shows the probabilities of not detecting faults in one of the operators. Because different designs use different operation-to-operator allocations, only the worst-case probability of not detecting a fault, among all the operators, is shown. We considered a single stuck-at-1 fault, two nonadjacent stuck-at-1 faults, and two adjacent stuck-at-1 faults, and combined the probabilities of not detecting these faults into one set (assuming fixed values for the probabilities of correct and wrong results).
The probabilities of not detecting faults in CED design using allocation diversity and semi-CED design are similar, because both of them only check the final results. The CED design using ruptured dependences partitions the CDFG of re-computation into 7 sub-CDFG, hence the probability of not detecting fault(s) in any operator is almost 0.
Windowed Filter: Our example windowed filter accepts one input, produces one output, and implements $\sum_i \mathrm{Coef}(i) \times \mathrm{In}(i)$ using 30 multiplications and 29 additions. Table V shows the results for all designs. The meaning of each row is the same as in Table IV. The non-CED design uses four adders and four multipliers, and takes 11 clock cycles. In this case, while the CED designs using allocation diversity and ruptured dependences do not use additional operators, the semi CED design uses one more multiplier and one more adder, and its re-computation spans two iterations of normal computation. The area overheads of the three CED designs are 23%, 25.5%, and 57.1%, while their time overheads are 100%, 54.5%, and 0, respectively. Meanwhile, because the CDFG of re-computation in the CED design using ruptured dependences is partitioned into 8 parts, this CED design achieves almost 0 probability of missing faults in any operator, while the designs using allocation diversity and semi CED have nonzero probabilities of missing the three kinds of faults (Table V).
Differential Equation Solver:
The third experiment is a differential equation solver which originates from a benchmark provided by [22]. The example from [22] presents a hardware description for numerically solving a particular differential equation, iterating until a loop variable reaches the user-specified limit. An iteration of the design contains six 32-bit multiplications, two 32-bit additions, and two 32-bit subtractions. Because the number of iterations depends on the inputs and on the limit, we cannot implement the semi CED technique due to the undetermined schedule. Table VI shows the results of the non-CED design, the CED design using allocation diversity, and the CED design using ruptured dependences. Because the non-CED design uses just one adder/subtractor, we have added one more adder/subtractor in both CED designs. However, the area overheads associated with both CED designs are just 7.1% and 12.5%, respectively. Besides, compared to the 100% time overhead introduced by the design using allocation diversity, no time overhead is introduced by the design using ruptured dependences. This is due to the strong data dependences existing in the original data path. Because the computation may take several iterations, the probability of not detecting faults in any of the operators is almost 0.
VI. FAULT INJECTION STUDY
With algorithm-level re-computing, we do not need to check every result as logic-level CED techniques do; we can check the results periodically. The normal computation is carried out on all input samples, but only selected samples are re-computed: after such a sample is processed by the normal computation, its result is stored, and the result is then re-computed and compared to the stored value, with a mismatch indicating an error. The "checking ratio" is defined as the ratio of the total number of results to the number of results that have been re-computed. The analysis in Section IV and the experimental results in Section V assume a checking ratio of 1, which means that all input samples are re-computed. When the checking ratio increases, some of the input samples are not re-computed, which reduces the time overhead. However, increasing the checking ratio also reduces the probability of detecting a fault, because faults that occur only in normal computations with no corresponding re-computation are not detected.
We use fault injection simulation to study the effect of increasing the checking ratio, choosing the Windowed filter using ruptured data dependences as an example. To simulate the injection of a stuck-at-1 fault, we appended an OR gate to every operator. One input of the OR gate is the output of the operator, while the other input controls fault injection: the output of the operator is stuck-at-1 when this input is 1. A stuck-at-0 fault can be injected similarly by appending an AND gate to the operator and applying 0 to one of the AND inputs.
The traditional way to simulate faults in a design is to model the design in a hardware description language such as VHDL or Verilog and to verify it with simulation software and a test bench. However, because our case needs a massive number of inputs to test the fault detection capability, we chose to implement the fault injection study as a C++ program, taking advantage of its programming flexibility and run-time efficiency.
First, we injected transient faults and counted how many of them are detected and how many are not. There are two reasons why a fault may go undetected. First, the proposed CED technique inherently misses some faults, as analyzed in Section III. Second, a fault may not be detected when the time between re-computations is larger than the duration of the fault. Fig. 8 plots the probability of detecting transient faults; one axis shows the checking ratio of the CED design and another shows the duration of the transient fault. The simulation shows that transient faults with longer duration are easier to detect and that designs with smaller checking ratios have stronger CED capability.
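The gate-based injection and the effect of the checking ratio on transient faults can be illustrated with a minimal C++ sketch. It is not the program used in this study. The names `FaultInjector`, `add_with_fault`, and `transient_fault_detected` are hypothetical, a single 32-bit adder stands in for the data path, the fault duration is counted in input samples, and the re-computation is assumed to be fault-free, so the inherent misses of the CED technique are deliberately not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-bit injection gates appended to an operator output.
struct FaultInjector {
    std::uint32_t or_control = 0;             // a 1 bit forces that output bit to 1 (stuck-at-1)
    std::uint32_t and_control = 0xFFFFFFFFu;  // a 0 bit forces that output bit to 0 (stuck-at-0)

    std::uint32_t apply(std::uint32_t operator_output) const {
        return (operator_output | or_control) & and_control;
    }
};

// A 32-bit adder followed by the injection gates.
std::uint32_t add_with_fault(std::uint32_t a, std::uint32_t b, const FaultInjector& gates) {
    return gates.apply(a + b);
}

// Returns true if a transient fault that is active for samples
// [fault_start, fault_start + fault_duration) is caught when only every
// checking_ratio-th sample is re-computed and compared.
bool transient_fault_detected(const std::vector<std::uint32_t>& a,
                              const std::vector<std::uint32_t>& b,
                              const FaultInjector& fault,
                              std::size_t fault_start,
                              std::size_t fault_duration,
                              std::size_t checking_ratio) {
    assert(a.size() == b.size());
    assert(checking_ratio >= 1);
    const FaultInjector fault_free{};  // default controls leave the output untouched
    for (std::size_t i = 0; i < a.size(); ++i) {
        const bool fault_active = i >= fault_start && i - fault_start < fault_duration;
        const std::uint32_t normal =
            add_with_fault(a[i], b[i], fault_active ? fault : fault_free);
        if ((i + 1) % checking_ratio != 0) continue;  // this sample is not re-computed
        // Assumption: the re-computation itself is not hit by the fault.
        const std::uint32_t recomputed = add_with_fault(a[i], b[i], fault_free);
        if (normal != recomputed) return true;
    }
    return false;
}
```

In this simplified model, a fault lasting one sample is always caught at a checking ratio of 1 whenever it corrupts the output, but it is missed at a checking ratio of 4 unless it happens to overlap a checked sample. Extending the duration so that the fault covers a checked sample makes it detectable again, which mirrors the trend reported for Fig. 8.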
Next, we injected permanent faults and studied the average number of re-computations necessary to detect a permanent fault. Fig. 9 plots the probability distributions of the number of re-computations versus the checking ratio; one axis shows the number of re-computations and another shows the different checking ratios. As shown in Fig. 9, 2 to 3 re-computations are able to detect 99% of the injected permanent faults.
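Counting the re-computations needed to expose a permanent fault can be sketched as follows. This is a toy model, not the experiment behind Fig. 9: it assumes two adders of which one has an output bit permanently stuck at 1, and a re-computation that simply swaps which adder executes each addition (a plain allocation swap rather than the ruptured-dependence schedule). All names are hypothetical and the inputs are uniformly random 32-bit words.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Adder whose output bit `bit` is permanently stuck at 1 when `faulty` is set.
std::uint32_t add_unit(std::uint32_t x, std::uint32_t y, bool faulty, unsigned bit) {
    const std::uint32_t sum = x + y;
    return faulty ? (sum | (std::uint32_t{1} << bit)) : sum;
}

// One checked sample: true if the normal and re-computed results differ.
bool allocation_swap_mismatch(std::uint32_t a, std::uint32_t b,
                              std::uint32_t c, std::uint32_t d, unsigned bit) {
    // Normal computation: a+b on the faulty adder, c+d on the healthy adder.
    const std::uint32_t n0 = add_unit(a, b, true, bit);
    const std::uint32_t n1 = add_unit(c, d, false, bit);
    // Re-computation: the two additions exchange adders.
    const std::uint32_t r0 = add_unit(a, b, false, bit);
    const std::uint32_t r1 = add_unit(c, d, true, bit);
    return n0 != r0 || n1 != r1;
}

// Number of re-computations until the first mismatch.
// Returns max_tries + 1 if the fault is not detected within max_tries.
unsigned recomputations_to_detect(std::mt19937& rng, unsigned bit, unsigned max_tries) {
    assert(bit < 32);
    for (unsigned n = 1; n <= max_tries; ++n) {
        const std::uint32_t a = static_cast<std::uint32_t>(rng());
        const std::uint32_t b = static_cast<std::uint32_t>(rng());
        const std::uint32_t c = static_cast<std::uint32_t>(rng());
        const std::uint32_t d = static_cast<std::uint32_t>(rng());
        if (allocation_swap_mismatch(a, b, c, d, bit)) return n;
    }
    return max_tries + 1;
}
```

Calling `recomputations_to_detect` repeatedly and building a histogram of the returned counts gives a distribution of the same kind as the one plotted in Fig. 9. In this toy model a fault escapes a single check only when the stuck bit already equals 1 in both correct sums, so most faults are exposed within the first few re-computations. The exact percentages depend on the real data path and are not reproduced here.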
