Abstract-Concurrent error detection (CED) based on time redundancy entails performing the normal computation and the re-computation at different times and then comparing their results. Time redundancy implemented in a straightforward manner can only detect transient faults. We present two algorithm-level time-redundancy-based CED schemes that exploit register transfer level (RTL) implementation diversity to detect both transient and permanent faults. At the RTL, implementation diversity can be achieved either by changing the operation-to-operator allocation or by shifting the operands before re-computation. By exploiting allocation diversity and data diversity, a stuck-at fault will affect the two results in two different ways. The proposed schemes yield good fault detection probability with very low area overhead. We used the Synopsys Behavioral Compiler (BC) to validate the schemes.
I. INTRODUCTION

A. Motivation
Deep submicron scaling of very large scale integration (VLSI) devices is being aggressively pursued to improve the speed of operation, reduce power consumption, and increase the scale of integration. Deep submicron scaling is achieving these objectives by shrinking the device dimensions, reducing power supply voltages, and increasing the operating frequencies. However, such aggressive scaling is negatively impacting device and overall VLSI system reliability by making them susceptible to single event upsets (SEUs). An SEU is a radiation-induced bit-flip in a VLSI circuit. In previous VLSI technologies that used larger device dimensions and higher power supply voltages, circuit nodes stored a large amount of charge. Only high-energy particles such as α-particles were energetic enough to perturb the state of circuit nodes, causing bit-flips. In the deep submicron regime, less and less charge is stored in the circuit nodes. Hence, even particles with less energy, such as neutrons at sea level, are able to flip the states of the nodes [22], [23].
Increasing operating frequencies is another reason for the increased susceptibility of deep submicron VLSI devices to SEUs. An SEU-induced bit-flip is a perturbation of the state of a circuit node, with the duration of the perturbation on the order of a few picoseconds. In a circuit that contains both combinational logic and sequential logic, an SEU in the sequential logic that occurs just before the rising edge of the clock will be latched, resulting in a fault. On the other hand, an SEU in the combinational logic does not result in a fault if the clock period is large when compared to the duration of the SEU (nanoseconds versus picoseconds). As the operating frequencies move into the gigahertz and terahertz range, the clock period is approaching the duration of an SEU, thereby increasing the probability of latching such faults. Fig. 1 shows the error rate in a simple circuit containing an inverter followed by a D latch [1]. The operating frequency (MHz) is plotted along the $x$-axis while the SEU error rate is plotted along the $y$-axis. The dotted line shows the SEU error rate in the sequential logic (D latch), the dashed line shows the SEU error rate in the combinational logic (inverter), and the solid curve shows the total SEU error rate of this circuit. From this plot one can see that the probability of latching faults caused by SEUs in the combinational logic increases almost linearly with the operating frequency and will dominate the total SEU error rate in high-speed deep submicron designs [1]. A third reason for the increased susceptibility of deep submicron VLSI circuits to SEUs is the gate threshold voltage $V_T$. $V_T$ is a function of the constant subthreshold slope and the ratio of ON current to OFF current. To scale down $V_T$, the on-off current ratio has to be reduced, and this in turn increases the susceptibility to SEUs [10].

Manuscript received May 7, 2001; revised June 15, 2002. This work was supported by NSF CAREER Award CCR 996139.
The authors are with the Electrical and Computer Engineering Department, Polytechnic University, Brooklyn, NY 11201 USA (e-mail: ramesh@india.poly.edu; kaijie@photon.poly.edu).
Digital Object Identifier 10.1109/TVLSI.2002.808440
Radiation can also cause permanent faults through single event latchup (SEL). In today's CMOS technology, the combination of wells and substrates forms parasitic n-p-n-p structures. Radiation can trigger these structures and create a short from the drain voltage (VDD) to the ground voltage (GND). This turns the circuit fully on and causes a short across the device until the device burns up or its power is cycled. Such latchup can cause permanent damage to devices.
These observations clearly bring out the significance of concurrent error detection (CED) of and recovery from SEU-induced bit-flip faults in current and future deep submicron designs. Although several CED schemes have been previously proposed, they either use 100% area overhead or entail 100% time overhead. In this paper we propose two register transfer level (RTL) time-redundancy-based CED techniques that have low area overhead. The proposed RTL CED uses allocation and data diversity techniques to ensure that an SEU will affect the normal computation and the re-computation in two different ways. In this paper we focus on CED techniques. For systems requiring fault detection and recovery, we assume the recovery mechanism is already built into the data path. For systems requiring fault detection followed by re-configuration, we assume that re-configuration will be done after detecting faults. Upon detecting an error, the computation will be repeated, since by just comparing two results we do not know which one is correct.
B. Related Research
Concurrent error detection (CED) mechanisms can be used to detect permanent and transient faults. A number of CED techniques based on time redundancy, hardware redundancy, and information redundancy have been developed [8] .
Patel and Fung [24] developed a logic level time redundancy-based CED technique for permanent faults called re-computing with shifted operands (RESO). If an arithmetic logic unit (ALU) performs a function $f$, and $x$ is an input to the function, then an error in the ALU can be detected by comparing $f(x)$ with the output of right_shift($f$(left_shift($x$))). For a given data input, the result of function $f$ is stored in a register. This is then compared with the result obtained using shifted operands. The error detection capability of RESO depends on the amount of shift. Minero et al. developed a similar technique called pseudo-duplication [17]. Re-computing with duplication with comparison (REDWC) is another time redundancy technique [9]. The work in [3] extends REDWC by increasing the number of partitions. Logic level time redundancy techniques that use alternating data have also been developed. These techniques check if $f(x) = \overline{f(\bar{x})}$, where $x$ is the input and $\bar{x}$ is its complement [27]. A divider satisfying such a self-dual property has been described in [29]. A CED technique for array multipliers using bi-directional operations has been presented in [2]. The performance overhead of time redundancy based error detection is about 100%.
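To make the RESO comparison concrete, the following Python sketch checks one addition by comparing the normal sum with the sum re-computed on left-shifted operands and shifted back. The fault model (a single adder output line stuck at 1), the function names, and the 4-bit operand range are assumptions chosen for illustration; they are not taken from [24].

```python
def faulty_add(a, b, fault_bit):
    """Adder whose output line `fault_bit` is stuck at 1 (assumed fault model)."""
    return (a + b) | (1 << fault_bit)


def reso_check(a, b, fault_bit, shift=1):
    """Return (normal_result, recomputed_result) for one RESO-checked addition."""
    normal = faulty_add(a, b, fault_bit)
    # Re-compute with both operands shifted left, then shift the sum back right.
    recomputed = faulty_add(a << shift, b << shift, fault_bit) >> shift
    return normal, recomputed


def undetected_errors(width=4, shift=1):
    """List the cases where both results agree but differ from the true sum."""
    escapes = []
    for fault_bit in range(width + 1):  # the sum of two width-bit values has width + 1 bits
        for a in range(1 << width):
            for b in range(1 << width):
                normal, recomputed = reso_check(a, b, fault_bit, shift)
                if normal == recomputed and normal != a + b:
                    escapes.append((a, b, fault_bit))
    return escapes


normal, recomputed = reso_check(1, 2, fault_bit=2)
print(normal, recomputed, normal != recomputed)
print(len(undetected_errors()))
```

Under this simplified model the shifted re-computation places the stuck line on a different bit of the result, so whenever the two results agree they both equal the true sum. Faults internal to a bit-slice, which [24] analyzes, are not modeled here.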
Recently, Nicolaidis [21] presented a logic level CED technique that combines hardware and time redundancy. A state-preserving element is used to preserve the previous state when it receives a noncode word from a self-checking module. In addition, time redundancy based perturbation-tolerant circuits and perturbation-detecting circuits are presented. Both are designed to correct or detect transient faults. Since a transient fault can be modeled as a perturbation of the correct output, two latches controlled by two clocks (with a delay between their rising edges) will latch two different outputs (one records the perturbed value and the other records the correct value) when a transient fault is present.
Triple modular redundancy (TMR) is a hardware redundancy based error correction technique that has been widely used [8], and it entails at least 200% hardware overhead. A time redundancy variant of TMR called re-computing with triplication with voting (RETWV), which trades off area overhead for increased error detection/correction latency, has been developed [5], [6], [28]. RETWV is similar to REDWC except that it partitions the operations and operators into thirds. RETWV has been applied to inner product units, convolvers [28], Newton-Raphson dividers [5], and Goldschmidt dividers [6]. The performance overhead of time redundancy based error correction is about 200%. In [18], Mitra and McCluskey proposed a logic synthesis technique for designing diverse implementations of combinational circuits in order to maximize the design diversity. They also provided a scheme to choose concurrent error detection techniques based on the diversity [19].
Information redundancy based CED techniques append check bits to the data to form code words [8]. The coding scheme then maps the normal output space of a function onto an extended code space such that, for the set of all input combinations, only a subset represents valid information. CED techniques based on Bose-Lin codes [4], Berger codes [15], [16], and parity codes [20] have been proposed. Information redundancy based CED needs to calculate the check codes for operands and predict the check code for computed results. Therefore, specialized cell libraries are required for a synthesis system supporting information redundancy based CED techniques. Such CED is suitable for logic operations and some simple arithmetic operations such as addition and subtraction, but it is not applicable to complex arithmetic operations such as multiplication and division.
Recently, several RTL techniques for concurrent error detection, recovery, and correction have been proposed. Karri and Orailoglu [13] and Ravi et al. [26] developed algorithms to synthesize self-recovering RTL data paths. While duplication and comparison at checkpoints was used in [13], an early comparison strategy wherein intermediate and final results are compared as soon as they become available was used in [26]. Although these algorithms reduce the voting overhead, they do not reduce the almost 100% hardware overhead of duplication-based CED. Karri and Iyer presented an RTL CED technique that uses the spare computation cycles of the functional units and the spare data transfer cycles in the interconnection network for CED [11]. Karri and Orailoglu [12] and Jha et al. [14] presented CED techniques that yield fault-secure data paths with a less than proportional increase in hardware. These techniques selectively check intermediate results in a control data flow graph (CDFG) of a computation, weaken some data precedence constraints, and reduce hardware overhead without violating the time constraints. An RTL built-in self-repair using spare modules was described in [7].
In this paper, we propose RTL time redundancy based CED techniques using allocation diversity and data diversity. The key idea of the proposed techniques is that by exploiting allocation diversity and data diversity, an SEU-induced stuck-at fault will affect the two results (one from the normal computation and the other from the re-computation) in two different ways. Since the normal computation and the re-computation use the same hardware, the area overhead is very low. However, there is a time overhead. We describe the general idea of algorithm level re-computing in Section II and discuss the fault model in Section III. We then describe algorithm level re-computing using allocation diversity in Section IV and algorithm level re-computing using data diversity in Section V. We analyze the CED capability of these basic schemes and, based on this analysis, propose additional improvements. Experimental results are discussed in Section VI. Finally, conclusions are given in Section VII.
II. ALGORITHM LEVEL RE-COMPUTING
Consider a CDFG with four additions and one multiplication shown in Fig. 2. The RTL schedule in Fig. 2(a) uses two adders, one multiplier, and three clock cycles. It does not support CED. Fig. 2(b) shows hardware redundancy based CED. "C" denotes comparison. Every large circle denotes a logic level CED operator. This kind of operator duplicates the original operation. The duplicate operation is carried out in the same clock cycle as the normal operation. The implementation of the duplicate operation can be identical to the original design (straightforward duplication) or can be different (design diversity). Although the computation time has not increased in this design, it comes at the cost of double the hardware: four adders and two multipliers. Fig. 2(c) shows time redundancy based CED. Here two logic level CED adders and one logic level CED multiplier are used, and each operation consumes two clock cycles. Although there is no increase in the number of operators, the computation time increases by 100% to six clock cycles since comparisons are executed after each operation.
Finally, in Fig. 2(d), the comparisons are moved out of the logic level operators to the end of the computation. First, the normal inputs are applied to the design to perform the normal computation. When the normal computation is finished, the results are saved to registers and the same inputs are applied to the design for re-computation. Following re-computation, the two results are compared, with a mismatch suggesting an error. This algorithm level re-computing technique puts the checking operation under the control of the designer. We do not have to check all the results as logic level CED techniques do; we can check the results periodically.
Algorithm level re-computing is a general time redundancy based CED technique. There are two types of computations: the normal computation and the re-computation. The normal computation is carried out on all input samples up to the $k$th sample. After the $k$th input sample is processed by the normal computation, the result is stored in a register. Then the re-computation is performed on the $k$th input sample again. When the result of the re-computation is available, it is compared to the stored result, and a mismatch suggests an error. $k$, called the checking ratio, is the ratio of the total number of results to the number of results that have been re-computed. Assuming that one iteration of a computation takes $c$ clock cycles, a basic design without CED capability takes $cN$ clock cycles to process $N$ input samples, while the design using algorithm level re-computation takes $cN(1 + 1/k)$ clock cycles. Hence, the time overhead is $1/k$. In the worst case, when a fault occurs just after one re-computation and is detected in the next re-computation, the detection latency is $(k+1)c$ clock cycles. $k$ can be used to trade off performance overhead against detection latency and fault detection capability. The smaller the value of $k$, the more results will be re-computed and checked. If $k$ is set to 1, all results are computed twice. Minimum detection latency is achieved, while the time overhead is 100%. If $k$ is set to 2, only half of the results will be re-computed and checked. Detection latency increases, while the time overhead is reduced to 50%. Let $t$ be the number of consecutive outputs affected by a transient fault. When $t \ge k$, all transient faults that affect the outputs are detected because there is always an input that will be re-computed and latched before the fault disappears. When $t < k$, only some transient faults are detected, with probability of detection $t/k$. Note that this detection capability is independent of the implementation of the re-computation technique.
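The tradeoff controlled by the checking ratio can be sketched in a few lines of Python. The function names and the `compute`/`recompute` callables are hypothetical, and the detection probability assumes that a transient fault is equally likely to start at any position within a checking period; this is an illustrative sketch, not the synthesized controller.

```python
from fractions import Fraction


def time_overhead(checking_ratio):
    """Extra clock cycles, as a fraction, when one result in `checking_ratio` is re-computed."""
    return Fraction(1, checking_ratio)


def transient_detection_probability(affected_outputs, checking_ratio):
    """Chance that a transient fault spanning `affected_outputs` consecutive
    results overlaps a checked result (uniform start position assumed)."""
    return min(Fraction(1), Fraction(affected_outputs, checking_ratio))


def run_with_recomputing(samples, compute, recompute, checking_ratio):
    """Process all samples; re-compute every `checking_ratio`-th one and compare."""
    results, mismatches = [], []
    for index, sample in enumerate(samples, start=1):
        result = compute(sample)
        results.append(result)
        if index % checking_ratio == 0 and recompute(sample) != result:
            mismatches.append(index)  # an error is suspected at this sample
    return results, mismatches


for ratio in (1, 2, 4):
    print(ratio, time_overhead(ratio), transient_detection_probability(2, ratio))
```

With a checking ratio of 1 every result is checked at 100% time overhead, while a ratio of 4 cuts the overhead to 25% at the cost of detecting only half of the transient faults that span two consecutive results.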
Further, different implementations of the re-computation task yield different fault detection capabilities. Straightforward duplication of operations in time can only detect transient faults. Permanent faults will be missed because, given the same inputs, a hardware module with permanent faults will always produce the same faulty outputs. In this paper we propose allocation diversity and data diversity to improve the CED capability of straightforward algorithm level re-computing.
III. RT LEVEL FAULT MODEL
In this paper, we will focus on transient and permanent stuck-at faults. Although the analysis in this paper focuses on stuck-at-1 faults, the results are valid for stuck-at-0 faults as well. In the analysis we model the faults as offsets from the correct result. Consider the 4-bit array multiplier shown in Fig. 3. The four adders enclosed in a dashed square form the third bit slice since their sums will be accumulated into the third bit of the result. Assuming that one of the connections (for example, the thick line shown in Fig. 3) is stuck-at-1, the faulty result output by the defective multiplier is offset from the correct result by 2 if the correct output of the thick line is zero, or zero if it is one. Table I summarizes all possible offsets due to one stuck-at-1 fault.
Effects of stuck-at faults can be modeled as an offset from the correct result in other arithmetic operators such as adders and subtractors as well.
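The offset view of a stuck-at fault can be illustrated with a short Python sketch. It assumes the fault sits directly on one output line of the operator; the helper names are hypothetical.

```python
def stuck_at_1_offset(correct_value, bit):
    """Offset introduced when output line `bit` is stuck at 1."""
    faulty_value = correct_value | (1 << bit)
    return faulty_value - correct_value


def stuck_at_0_offset(correct_value, bit):
    """Offset introduced when output line `bit` is stuck at 0."""
    faulty_value = correct_value & ~(1 << bit)
    return faulty_value - correct_value


# Bit 1 of 0b0101 is 0, so a stuck-at-1 fault adds 2**1; bit 1 of 0b0111 is already 1.
print(stuck_at_1_offset(0b0101, 1), stuck_at_1_offset(0b0111, 1))
# A stuck-at-0 fault subtracts 2**bit when the correct bit is 1.
print(stuck_at_0_offset(0b0111, 1), stuck_at_0_offset(0b0101, 1))
```

In both cases the offset is either zero or a signed power of two fixed by the faulty bit position, which is the property the analysis relies on.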
IV. ALLOCATION DIVERSITY
A. The Technique
Using straightforward algorithm level re-computing, permanent faults in the data path cannot be detected since both the normal computation and re-computation will be carried out on the same operators in the data path. To improve the fault detection capability, we propose a new technique called allocation diversity. In allocation diversity the normal computation and re-computation use the same algorithm for computing the results (i.e., they use identical CDFG). Further, they may use identical RTL operation-to-clock cycle schedules and hence the same operators but possibly different registers and interconnection network. However, operations in the CDFG that are carried out on an operator in the normal computation are carried out on a different operator during re-computation. Fig. 4 shows two implementations of a CDFG that computes . Corresponding operations in the normal and re-computations are allocated to different operators. For example, addition is carried out on adder 1 in the normal computation and on adder 3 in re-computation. At the beginning of each computation, the controller checks the checking ratio to decide if it is a normal computation or a re-computation. If it is a normal computation the controller will select the normal allocation shown in Fig. 4(a) . Otherwise, the controller will select the checking allocation shown in Fig. 4(b) .
Since the operation-to-operator allocations are different, a defective module will carry out different operations in the normal and re-computations. Most probably this will affect the final result in different ways thereby increasing the likelihood of error detection.
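A small Python simulation shows why re-allocating operations exposes a permanent fault that plain re-computation hides. The CDFG $(a+b)+(c+d)$, the two-adder data path, the allocation tables, and the stuck-at-1 output fault are all hypothetical choices for illustration and do not correspond to Fig. 4.

```python
def make_adder(stuck_bit=None):
    """Return an adder; if `stuck_bit` is given, that output bit is stuck at 1."""
    def add(a, b):
        total = a + b
        return total if stuck_bit is None else total | (1 << stuck_bit)
    return add


def evaluate(a, b, c, d, allocation, adders):
    """Compute (a + b) + (c + d); `allocation` maps each operation to an adder name."""
    s1 = adders[allocation["a+b"]](a, b)
    s2 = adders[allocation["c+d"]](c, d)
    return adders[allocation["sum"]](s1, s2)


# adder1 has a permanent stuck-at-1 fault on output bit 0; adder2 is fault free.
adders = {"adder1": make_adder(stuck_bit=0), "adder2": make_adder()}
normal_alloc = {"a+b": "adder1", "c+d": "adder2", "sum": "adder1"}
check_alloc = {"a+b": "adder2", "c+d": "adder1", "sum": "adder2"}

inputs = (2, 2, 1, 2)
normal = evaluate(*inputs, normal_alloc, adders)
same_allocation = evaluate(*inputs, normal_alloc, adders)  # straightforward re-computation
diverse = evaluate(*inputs, check_alloc, adders)           # allocation diversity
print(normal, same_allocation, diverse)
```

For these inputs the straightforward re-computation reproduces the same faulty value as the normal computation, so the comparison passes, whereas the diverse allocation yields a different value and the mismatch flags the fault.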
B. CED Capability
Returning to the example in Fig. 4 let us assume that only one operator has a stuck-at-1 fault. The stuck-at-1 fault can be modeled as an offset of from the correct result based on the bit position of the fault. The faulty results obtained in the normal and re-computations due to this stuck-at-1 fault are summarized in Table II . For example, if there is a stuck-at-1 fault in adder 3 that offsets the result by , the faulty result output by normal computation is while the faulty result output by the re-computation is . Since these two results differ by , the fault can be detected if . Similarly, a stuck-at-1 fault in multiplier 2 translates into a faulty result of during normal computation and a faulty result of during re-computation. In this case, since the two faulty results have the same offset, the fault may not be detected.
We will now compute the probability of missing a fault when allocation diversity based CED is used. Note that this is a conditional probability: given that a fault has occurred, what is the probability that the proposed CED techniques detect or miss it? Operations such as multiplication and exponentiation in a computation magnify the effects of faults at their inputs, making them easy to detect. On the other hand, since operations such as addition and subtraction do not magnify the effects of faults at their inputs, we will focus on parts of CDFGs that have only additions/subtractions. Further, we will consider stuck-at-1 faults in the following analysis.$^{1}$ Since in an RTL design a (faulty) hardware unit is reused several times within a computation, the fault can affect the result several times. Let $p_i$ be the probability that the expected result is 0 due to the $i$th use of the defective module and let $q_i$ be the probability that the expected result is 1 due to the $i$th use of the defective module.
Consider the CDFG shown in Fig. 5. This CDFG uses two adders and takes two clock cycles. Adder 2 (shaded in dark) is the defective module and is used twice (once in clock cycle 1 and once in clock cycle 2). After the first use of the faulty adder in clock cycle 1, the probability of a correct result is $q_1$ and the probability of a wrong result is $p_1$. Similarly, after the second use of the faulty adder in clock cycle 2 (assuming that the inputs to the second use are correct), let the probability of a correct result be $q_2$ and the probability of a wrong result be $p_2$. We will now derive the probabilities of missing faults for the following three cases: a single stuck-at-1 fault, stuck-at-1 faults in nonadjacent bit positions, and stuck-at-1 faults in adjacent bit positions.
Case 1. A Single Stuck-at-1 Fault:
In Fig. 5, every time the faulty adder 2 is used, it may offset the correct result$^{2}$ by $e$. After two uses, we will have the expected result with probability $q_1 q_2$, a wrong result offset by $e$ with probability $p_1 q_2$ (first use only) or $q_1 p_2$ (second use only), and a wrong result offset by $2e$ with probability $p_1 p_2$. These probabilities for the general case when the defective module is used $n$ times are summarized in Table III.
If allocation diversity is employed, a fault will not be detected when the faulty offsets that come from the normal computation and the re-computation are identical. If the defective module is used $m$ times in the normal computation and $n$ times in the re-computation, the probability that the two results are offset by the same amount is the sum, over every offset that both computations can produce, of the product of the two corresponding probabilities in Table III. The probability that a fault is missed is this sum restricted to nonzero offsets, since a zero offset in both computations means that no error occurred.
$^{1}$ A similar analysis can be carried out for stuck-at-0 faults.
$^{2}$ A result is correct if it is the expected result for the inputs even though these inputs may come from a faulty operation.
Fig. 6. The probability of missing a stuck-at-1 fault.
Fig. 6 plots the probability of missing a fault as a function of $m$ and $n$, assuming that $p_i = q_i = 0.5$ (i.e., in the correct output, 1's and 0's are equally likely). The $x$-axis and $y$-axis stand for the number of times the defective module is used in the normal computation and the re-computation, respectively, while the $z$-axis stands for the probability of missing faults in this defective module. From this plot we can observe that the probability of detecting a fault is lowest when $m = n$. Further, the probability of detecting a fault is highest when $m = 0$ or $n = 0$.
Case 2. Two Stuck-at-1 Faults in an Operator:
Let $p_i^{(1)}$ and $q_i^{(1)}$ be the above probabilities due to the first fault and $p_i^{(2)}$ and $q_i^{(2)}$ be the above probabilities due to the second fault. Also, let $e_1$ be the error offset due to the first fault and $e_2$ be the error offset due to the second fault. Assuming that the first fault occurs in a more significant bit position and affects the result $i$ times while the second fault occurs in a less significant bit position and affects the result $j$ times, the possible final results are offset from the correct result by $i e_1 + j e_2$.
a) Stuck-at-1 Faults in Nonadjacent Bit Positions.
If the faulty bit positions are so far apart that different ($i$, $j$) combinations yield different offsets, the probabilities of correct and faulty results when the defective module is used $n$ times are summarized in Table IV. The probability of missing a fault can be calculated using (1). If we assume $p_i = q_i = 0.5$, (1) can be simplified. Fig. 7(a) shows the probability of missing two nonadjacent stuck-at-1 faults.
Fig. 7(a). Probability of missing two disjoint faults.
b) Stuck-at-1 Faults in Adjacent Bit Positions.
Since the two faulty bit positions are adjacent, $e_1 = 2e_2$, and hence faults with different ($i$, $j$) combinations can produce identical faulty results. For example, a fault with combination ($i$, $j$) has the same effect on the final result as faults with combinations ($i-1$, $j+2$) and ($i+1$, $j-2$). Fig. 7(b) shows the probability of missing two adjacent stuck-at-1 faults. The $x$-axis and $y$-axis stand for the number of times the defective module is used in the normal computation and the re-computation, respectively. It is easy to see that faults will be missed with a higher probability compared to Case 2a. However, it is still better than Case 1. Similar to Case 1 and Case 2a, the probability of detecting this kind of fault is lowest when $m = n$ and highest when $m = 0$ or $n = 0$. Comparing Fig. 7(a) and (b) with Fig. 6 shows that the probability of missing a fault decreases as the number of faults in the hardware increases. This is because as the number of faults increases, the number of possible faulty results increases, thereby reducing the possibility that these faulty results match.
C. Improving CED Capability of Allocation Diversity
From the previous analysis, the error detection probability of CED based on allocation diversity can be improved by maximizing the difference between the number of times a defective module is used in the normal computation and the number of times it is used in the re-computation. However, it may not always be possible to achieve this unevenness in the allocations for all hardware units in a design.
Let us consider Fig. 8 as an example. Since the design uses three adders, a single allocation cannot simultaneously maximize the usage difference for all three adders. Operation-to-operator allocations for the normal computation and the re-computation are shown in Fig. 8(a) and (b), respectively. This allocation minimizes the probability of missing the faults introduced by faulty adder 2 by maximizing the difference between the number of times it is used in the normal computation and in the re-computation. In Fig. 8, adder 2 is used once in the normal computation and four times in the re-computation, yielding a 0.125 probability of missing a single fault, a 0.023 probability of missing two nonadjacent faults, and a 0.033 probability of missing two adjacent faults.$^{3}$ On the other hand, adder 1 is used four times in the normal computation and two times in the re-computation, yielding a 0.219 probability of missing a single fault, a 0.055 probability of missing two nonadjacent faults, and a 0.082 probability of missing two adjacent faults. Finally, adder 3 is used two times in the normal computation and once in the re-computation, yielding a 0.25 probability of missing a single fault, a 0.125 probability of missing two nonadjacent faults, and a 0.141 probability of missing two adjacent faults.
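The miss probabilities quoted for the three adders can be reproduced with a short Python sketch. It assumes that 0 and 1 are equally likely at each faulty bit on every use, that offsets accumulate additively, and that a fault counts as missed only when both computations carry the same nonzero offset. The function names and the per-use offset tuples are illustrative choices, not the paper's notation.

```python
from fractions import Fraction


def offset_distribution(uses, per_use_offsets):
    """Distribution of the accumulated offset after `uses` uses of the faulty
    operator; each use adds one of `per_use_offsets` with equal probability."""
    dist = {0: Fraction(1)}
    step = Fraction(1, len(per_use_offsets))
    for _ in range(uses):
        nxt = {}
        for total, prob in dist.items():
            for off in per_use_offsets:
                nxt[total + off] = nxt.get(total + off, 0) + prob * step
        dist = nxt
    return dist


def miss_probability(normal_uses, recompute_uses, per_use_offsets):
    """Probability that both computations carry the same nonzero offset."""
    normal = offset_distribution(normal_uses, per_use_offsets)
    recomp = offset_distribution(recompute_uses, per_use_offsets)
    same = sum(prob * recomp.get(total, 0) for total, prob in normal.items())
    return same - normal[0] * recomp[0]  # equal zero offsets mean no error occurred


SINGLE = (0, 1)                # one stuck-at-1 fault, offset in units of e
NONADJACENT = (0, 1, 256, 257) # two faults far apart: all combinations stay distinct
ADJACENT = (0, 1, 2, 3)        # two adjacent faults: e1 = 2 * e2

for name, uses in (("adder 2", (1, 4)), ("adder 1", (4, 2)), ("adder 3", (2, 1))):
    row = [round(float(miss_probability(*uses, model)), 3)
           for model in (SINGLE, NONADJACENT, ADJACENT)]
    print(name, row)
```

Under these assumptions the computed values round to the figures reported above for all three adders, and the probability drops to zero whenever the faulty operator is absent from one of the two computations.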
Partitioning the CDFG into smaller CDFG and checking the intermediate results output by these CDFG partitions can further improve the fault detection probabilities of the allocation diversity scheme. Fig. 8(c) and (d) show two-way partitioning of the CDFG of Fig. 8(a) and (b) . As shown in the figure, original CDFG has been divided into two CDFG partitions A and B and an " " denotes the intermediate results that are checked. For CDFG partition A, adder 1 and adder 2 are used in the normal computation while adder 2 and adder 3 are used in the re-computation. Allocation of CDFG partition A simultaneously maximizes the usage differences of all three adders; adder 1 is used four times in normal computation and 0 times in re-computation, adder 2 is used once in normal computation and four times in the re-computation and adder 3 is used once in normal computation and zero times in re-computation. Similarly allocation for CDFG partition B maximizes the usage differences for adder 3 and adder 1. In all these cases, if a defective module is involved either in the normal computation or in re-computation but not both, the probability of missing its fault is zero.
D. Comparison With Existing Techniques
Hardware duplication is a straightforward implementation of allocation diversity. The hardware used by the original computation is duplicated. Hardware is not shared between the original and the duplicate computations. This straightforward allocation diversity entails significant overhead. Instead of duplicating the data path, the allocation diversity technique shares operators between the normal computation and the re-computation without impairing the CED capability. It advances the state of the art as follows.
• When the radiation threat reaches sea level, as described in the motivation section, even commercial devices need CED support. However, commercial devices cannot afford much CED area overhead. Naive sharing of the data path between the normal computation and the re-computation compromises the CED capability. How seriously this sharing affects the overall CED capability has not been explored until now. For the first time, we propose allocation diversity as a way to explore this design space. The main motivation is to find a way of sharing the data path (to reduce overhead) without lowering the CED capability.
• Until now, time redundancy has been mainly used for transient fault tolerance and mostly at the system level. Typically, the same program code is executed at two time instances on the same hardware for transient fault CED. This cannot detect permanent faults. We developed allocation diversity technique at the RTL to support both permanent and transient fault CED. The key idea of allocation diversity is to allocate an operation in the normal computation and the corresponding operation in the re-computation on two different operators. Thereby a permanent fault in one of the operators affects the two results (one from normal computation and one from re-computation) in two different ways. By comparing these two results we can detect this permanent fault. We incorporate these CED constraints during the operation-to-operator allocation phase of RTL synthesis.
• We analyzed the CED capability of allocation diversity based time redundancy technique by estimating the probability of missing faults. The analysis reveals that the unevenness in the utilization of operators can be exploited to further improve the CED capability. When the number of times the defective operator is used in normal computation is much more or much less than the number of times the same operator is used in re-computation, the probability of missing faults is the lowest. Based on this observation, we proposed additional improvements to allocation diversity using CDFG partitioning and checking of selected intermediate results.
• Allocation diversity is a time redundancy-based technique and incurs low overhead (refer to Sections V and VI). Straightforward time redundancy entails 100% time overhead. We showed how, by controlling the checking ratio, this time overhead can be reduced as well. For example, when only every other normal computation is re-computed, the time overhead is reduced to 50%. The checking ratio can be changed dynamically to obtain the best tradeoff between CED capability and the associated performance penalty.
Although new design optimization procedures can be developed, they would be useful mainly if a new synthesis system were being designed from scratch. For easy transfer of this technology into industry, we instead describe an approach to incorporate algorithm level re-computing constraints into existing RTL synthesis systems such as the Synopsys Behavioral Compiler. Algorithm level re-computing using allocation diversity can be incorporated into an RTL synthesis tool as follows.
1) Make a duplicate copy of the very high-speed integrated circuits hardware description language (VHDL) model of the algorithm and create a conditional VHDL model with the original VHDL model in the true branch and the duplicate VHDL model in the false branch of the conditional statement. The checking ratio counter will be the control signal selecting one of these branches. The synthesis system will compile this conditional VHDL model into a conditional CDFG.
2) Specify an allocation constraint that ensures that each operation in the true branch of the CDFG and the corresponding duplicate operation in the false branch of the CDFG are performed by different hardware operators.
3) Invoke the scheduling and allocation algorithms supported by the synthesis system using the timing constraints (number of clock cycles) specified by the designer.
Observe that this approach is generic and can use the design optimization procedures (for the scheduling and allocation steps) supported by the specific synthesis system to solve this problem.
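The checking ratio arithmetic and the allocation constraint can be illustrated with a minimal Python sketch. The function names and the operation and adder labels are hypothetical and are not part of any synthesis tool; the sketch only assumes that one in every `check_every` computations is re-computed and that a binding is a map from operations to operators.

```python
from fractions import Fraction


def time_overhead(check_every: int) -> Fraction:
    """Extra time when one in every `check_every` computations is re-computed."""
    if check_every < 1:
        raise ValueError("check_every must be at least 1")
    return Fraction(1, check_every)


def shared_operator_ops(normal: dict, recompute: dict) -> list:
    """Operations bound to the same operator in both branches (constraint violations)."""
    return sorted(op for op in normal if normal[op] == recompute[op])


# Hypothetical binding for three additions on three adders (true branch).
normal_binding = {"add1": "adder1", "add2": "adder2", "add3": "adder3"}
# Rotating the operators gives every duplicate operation a different adder (false branch).
recompute_binding = {"add1": "adder2", "add2": "adder3", "add3": "adder1"}

print(time_overhead(1), time_overhead(2))                      # 1 1/2
print(shared_operator_ops(normal_binding, recompute_binding))  # []
```

An empty list means the allocation constraint of step 2 is satisfied for this binding; any operation that appears in the list would be executed by the same operator in both branches.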
V. ALGORITHM LEVEL RE-COMPUTING WITH SHIFTED OPERANDS
A. The Technique
RESO is a logic level concurrent error detection technique; we extend the idea to the RTL. In algorithm level re-computing with shifted operands (ARESO), a run of input samples is processed without checking. After the next input is processed, we store its result in a register and apply the shifted version of that input to the design to perform the re-computation. When the re-computed result is available, it is compared with the stored result to see whether they differ. The RTL data path used in an ARESO design is wider than that of the original non-CED design. For example, if the original design has a 32-bit data path, an ARESO design that supports 2-bit shifted operands needs a 34-bit data path.
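The store, shift, re-compute, and compare sequence can be sketched for a single addition as follows. This is a simplified illustration, not the authors' RTL: it assumes a stuck-at-1 fault on one output bit of the adder, left-shifted operands, and an adder widened by the shift amount. All function names are hypothetical.

```python
def stuck_at_1_add(a: int, b: int, adder_width: int, stuck_bit=None) -> int:
    """Adder of `adder_width` bits whose output bit `stuck_bit` is stuck at 1."""
    total = (a + b) & ((1 << adder_width) - 1)
    if stuck_bit is not None:
        total |= 1 << stuck_bit
    return total


def areso_add_check(a, b, width, shift, stuck_bit=None):
    """Return (normal_result, recomputed_result) for one addition."""
    adder_width = width + shift  # e.g. 32-bit data plus a 2-bit shift needs 34 bits
    result_mask = (1 << width) - 1
    # Normal computation: operands occupy the low `width` bits of the wide adder.
    normal = stuck_at_1_add(a, b, adder_width, stuck_bit) & result_mask
    # Re-computation: shift the operands left, add, then shift the result back.
    shifted = stuck_at_1_add(a << shift, b << shift, adder_width, stuck_bit)
    recomputed = (shifted >> shift) & result_mask
    return normal, recomputed


normal, recomputed = areso_add_check(2, 2, width=8, shift=1, stuck_bit=0)
print(normal, recomputed, normal != recomputed)  # 5 4 True
```

In this example the fault corrupts the normal result (5 instead of 4) but falls on a different bit position of the shifted computation, so the comparison flags the mismatch.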
B. The CED Capability of RESO and ARESO
Logic level RESO and its error detection capabilities are described in [24]. That paper shows that RESO with a one-bit shift can detect errors in the following:
1) all bit-wise logical operations when failures are confined to a single bit-slice;
2) the sum or the carry network (but not both) of an adder if they are implemented as disjoint logic networks;
3) either the sum, the carry, or the common logic shared by the sum and carry of an adder, if the sum function strongly depends on the function that the shared subnetwork represents.
The paper also shows that RESO with a k-bit shift can detect errors in:
4) all bit-wise logical operations when failures are confined to k adjacent bit-slices;
5) arithmetic operations in a ripple-carry adder and a carry-lookahead adder when failures are confined to k-1 adjacent bit-slices;
6) arithmetic operations in a group carry-lookahead adder when failures are confined to a group. Each group consists of a multi-bit adder and circuits for group-carry generate, group-carry propagate, and group carry-in.
In an array multiplier, an error in a single cell can be detected by shifting one of the operands by two bits, and multiple errors in a bit-slice can be detected by shifting one of the operands by a number of bits bounded by a ceiling expression. The bound follows from comparing the magnitude of the error offset when errors occur in a bit-slice with the magnitude of the error offset after one of the operands is shifted right: all errors in the bit-slice are detected when the corresponding inequality holds, and rearranging that inequality yields the bound on the shift.
In algorithm level re-computing, intermediate results are not checked and a defective module can be used several times before the final results are checked, so the effects of a fault accumulate. Hence, ARESO requires more bits to be shifted than the logic level RESO technique. For example, if a defective adder that offsets its result is used twice, the offsets can accumulate in the final result. By the analysis above, ARESO with a 1-bit shift is then not guaranteed to find this fault.
We use a technique similar to that of Section IV to calculate the probability of missing a fault. Fig. 9 shows the probabilities of missing faults for the three cases discussed in Section IV. In these plots, one axis gives the number of times the defective module is used, a second axis gives the number of bits shifted in the data path, and the third axis gives the probability of missing the fault(s). As the number of bits shifted increases, the probability of missing faults decreases. When only one bit is shifted and the defective module is used three times, we get the worst detection probability: the probability of missing this fault is 14%. When two bits are shifted in the data path, the probability of missing the same fault is reduced to zero.
Fig. 9. The probability of missing stuck-at-1 fault by using ARESO technique.
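The accumulation effect can be explored with a small exhaustive experiment. The sketch below is a toy model under stated assumptions (a chain of additions through one adder with a single output bit stuck at 1, left-shifted operands, a tiny data width so that all operand combinations can be enumerated); it is not the analysis of Section IV, and the rates it reports are not expected to reproduce the values in Fig. 9. All names are hypothetical.

```python
from itertools import product


def chain_sum(operands, adder_width, stuck_bit):
    """Accumulate operands through one adder whose output bit is stuck at 1."""
    mask = (1 << adder_width) - 1
    acc = operands[0]
    for value in operands[1:]:
        acc = (acc + value) & mask
        if stuck_bit is not None:
            acc |= 1 << stuck_bit
    return acc


def miss_rate(uses, shift, width, stuck_bit):
    """Fraction of corrupted normal results that the final comparison fails to flag."""
    adder_width = width + shift
    result_mask = (1 << width) - 1
    corrupted = missed = 0
    # `uses` additions consume `uses + 1` operands.
    for operands in product(range(1 << width), repeat=uses + 1):
        golden = sum(operands) & result_mask
        normal = chain_sum(operands, adder_width, stuck_bit) & result_mask
        shifted = [v << shift for v in operands]
        recomputed = (chain_sum(shifted, adder_width, stuck_bit) >> shift) & result_mask
        if normal != golden:
            corrupted += 1
            missed += normal == recomputed
    return missed / corrupted if corrupted else 0.0


for uses in (1, 2, 3):
    for shift in (1, 2):
        print(uses, shift, miss_rate(uses, shift, width=3, stuck_bit=1))
```

Only the final results are compared, mirroring the fact that intermediate results are not checked in algorithm level re-computing.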
C. Improving the CED Capability of ARESO
The CED capability of the ARESO technique can be improved, as we did for the allocation diversity technique. Inspecting Fig. 9, the worst cases for detecting faults occur under two conditions: first, when faults accumulate and the possible fault range exceeds the CED capability of the data path; second, when the defective module is used about two to four times. Preventing either or both of these conditions improves the detection capability. Shifting more bits in the data path, although straightforward, entails a large hardware overhead. Another way is to partition the CDFG into smaller ones and check the outputs of all the CDFG partitions. If a defective module is used in more than one CDFG partition, the probability of detecting the faults improves. When possible, we also try to avoid using a module less than four times in a CDFG partition. In the CDFG shown in Fig. 10(a), adders 1, 2, and 3 are used three, two, and two times, respectively. Assuming that the data path is 1-bit shifted, there is one set of probabilities of missing a single stuck-at-1 fault, two nonadjacent stuck-at-1 faults, and two adjacent stuck-at-1 faults at adder 1, and another set for adder 2 or 3. Fig. 10(b) partitions the original CDFG into two partitions, A and B, and checks the outputs of both of them. Adder 1 is used once in partition A and twice in partition B, while adders 2 and 3 are each used once in both partitions. The probability of missing faults in the partitioned CDFG is then calculated from the usage of each adder in the two partitions, which gives the probabilities of missing the three kinds of faults for adder 1 and for adders 2 and 3.
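One plausible way to combine per-partition results, offered only as an assumption and not as the equation used in the paper, is to treat the checks on the partitions as independent: a fault in a module then escapes only if every partition that uses the module misses it. The function name and the numeric inputs below are hypothetical.

```python
import math


def miss_probability_all_partitions(per_partition_miss):
    """Probability that a fault is missed in every partition, assuming independence."""
    for p in per_partition_miss:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
    return math.prod(per_partition_miss)


# Hypothetical per-partition miss probabilities for one adder in partitions A and B.
print(miss_probability_all_partitions([0.5, 0.2]))
```

Under this assumption, adding a partition in which the module is used can never increase the overall miss probability, which matches the qualitative claim that using a defective module in more than one checked partition improves detection.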
D. Comparison With Existing Techniques
The ARESO technique differs from RESO in the following ways.
• RESO is a logic level CED technique in which every operator is a self-checking operator. Implementing RESO requires a custom self-checking operator library. ARESO, on the other hand, is an algorithm level CED technique. Instead of using self-checking operators, ARESO checks the results at the end of the algorithm and uses normal operators, which means that it can be applied to any target library and is easy to integrate into generic RTL synthesis systems.
• In the RESO technique, the time used by a self-checking operator is two times that used by a non-self-checking operator. This involves 100% time overhead. Since the ARESO technique checks the results at the end of the algorithm, a checking ratio is introduced to trade off the time overhead against the CED capability. Although the time overhead is still 100% when the strongest CED capability is achieved, the ARESO technique provides a means to reduce this time overhead by trading it off against CED capability.
• The CED capability of RESO depends only on the number of bits by which the operands are shifted. The more bits shifted, the stronger the CED capability. In the ARESO technique, owing to the tradeoff introduced by the checking ratio, the CED capability can be tuned by considering the time overhead.
• In the RESO technique, the CED capability is analyzed at the gate level. We analyzed the CED capability of the ARESO technique at the RTL and plotted graphs showing the probability of missing faults. By reading these graphs we can identify cases when the faults are guaranteed to be detected and cases when the faults can be detected with probability close to 1.
Algorithm level re-computing using data diversity can be incorporated into an RTL synthesis tool as follows.
1) Make a duplicate copy of the VHDL model of the algorithm and create a conditional VHDL model with the original VHDL model in the true branch and the duplicate VHDL model in the false branch of the conditional statement. The original VHDL model accepts normal operands while the duplicate VHDL model accepts shifted operands. The checking ratio counter will be the control signal selecting one of these branches. The synthesis system will compile this conditional VHDL model into a conditional CDFG.
2) Invoke the scheduling and allocation algorithms supported by the synthesis system using the timing constraints (number of clock cycles) specified by the designer.
Observe that this approach is generic and can use the design optimization procedures (for the scheduling and allocation steps) supported by the specific synthesis system to solve this problem.
VI. EXPERIMENTAL RESULTS
We used Synopsys BC [30] to synthesize designs with allocation diversity and data diversity and to study the area and performance overhead associated with these techniques. The allocation of each operator can be specified with a Synopsys label, and BCVIEWER can be used to verify the final schedule and binding. We show the results on three examples: a finite impulse response (FIR) filter (16 16-bit additions, 17 8-bit multiplications), a windowed filter (15 32-bit multiplications, 29 32-bit additions), and an 8-point discrete cosine transform (DCT) (14 20-bit multiplications and 24 20-bit additions). As mentioned earlier, we present the probability of missing stuck-at-1 faults. However, the technique applies to stuck-at-0 faults as well.
A. FIR Filter
Consider an FIR filter that implements $y(n) = \sum_{i=0}^{16} c_i\,x(n-i)$. In the above equation, $x(n-i)$ are the current and previous inputs and $c_i$ are constant coefficients. The filter accepts one input, produces one output, and uses 17 multiplications and 16 additions. Our implementation uses three adders and four multipliers and takes eight clock cycles for each computation. No operation takes multiple clock cycles and no operations are chained. Table V shows the results of the basic design and of the CED designs using allocation diversity and data diversity. The second and third rows show the number of operators used by these designs. The fourth row shows the area, while the fifth row shows the corresponding area overhead. Since the actual layout area depends on the target library and our technique is library independent, we report the area in unit cells. One unit cell is equivalent to a two-input NAND/NOR gate. The actual layout area can be estimated by multiplying the number of unit cells by the layout area of a two-input NAND gate in the target library. Because the original design consumes very little hardware, all the proposed schemes involve a large overhead. The next three rows show the probabilities of missing faults in the three adders. The types of faults we analyzed here are a single stuck-at-1 fault, two nonadjacent stuck-at-1 faults, and two adjacent stuck-at-1 faults. For brevity, we combined the three probabilities into one set as shown in the table. In this design, the four multipliers have similar allocations and schedules, so the last row shows the probability of missing faults in one of them. According to this table, allocation diversity using CDFG partitioning reduces the probabilities of missing faults from around 30% to less than 4%, while ARESO with a partitioned CDFG reduces the probabilities of missing faults to almost zero.
TABLE V. FIR FILTER. TABLE VI. WINDOWED FILTER. TABLE VII. DISCRETE COSINE TRANSFORM.
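The operation counts quoted for the FIR example follow from a 17-tap sum of products. The sketch below assumes the usual FIR form, in which each output is the sum of coefficient times input over the current and 16 previous inputs. The symbol and function names are hypothetical, and the cycle figure it computes is only a resource-count lower bound that ignores data dependencies, not the schedule produced by the synthesis tool.

```python
def fir_output(coefficients, samples):
    """One output sample: sum of c_i * x(n-i); samples[0] is the current input."""
    if len(coefficients) != len(samples):
        raise ValueError("need one sample per coefficient")
    return sum(c * x for c, x in zip(coefficients, samples))


def operation_counts(taps):
    """Multiplications and additions needed for one output of a `taps`-tap FIR."""
    return {"multiplications": taps, "additions": taps - 1}


def min_cycles(op_count, operators):
    """Lower bound on cycles if each operator performs one operation per cycle."""
    return -(-op_count // operators)  # ceiling division


print(operation_counts(17))                    # {'multiplications': 17, 'additions': 16}
print(min_cycles(17, 4), min_cycles(16, 3))    # 5 6
```

Both bounds are below the eight clock cycles reported for the implementation, which is consistent with the additions having to wait for the products they consume.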
B. Windowed Filter
The second example we implemented is a windowed filter. It accepts one input, produces one output, and is implemented using 15 multiplications and 29 additions. Our implementation uses four adders and four multipliers and takes nine clock cycles for each computation. Table VI summarizes the results. In this case, because the original design consumes a large amount of hardware, the area overheads of the proposed schemes are around 15%. Both schemes have a lower probability of missing faults on adders than on multipliers. The reason is that, among the additions allocated to each adder, at least one is carried out prior to a multiplication. As we analyzed above, the effect of such a fault is magnified by the multiplication. By using a partitioned CDFG, we improved the probabilities of detecting all possible faults to almost one.
C. A One-Dimensional Eight-Point DCT
The 8-point DCT design accepts eight inputs, produces eight outputs and implements 14 multiplications and 24 additions/subtractions. It uses four adders, four multipliers, and 19 clock cycles for one computation. Table VII summarizes the results for RTL allocation diversity. In this design, each output corresponds to a partition of the DCT CDFG. Each of these partitions is independent of the other. Since in algorithm level re-computing we need to check all outputs, straightforward allocation diversity achieves fault detection probability of 1.
VII. CONCLUSION
We proposed two algorithm level re-computing CED schemes using allocation diversity and data diversity. In RTL allocation diversity the operation-to-operator allocation used in the normal computation is different from the one used in re-computation. In data diversity technique, shifted operands are applied to the re-computation. These proposed techniques entail about 10%-30% area overhead depending on the size of the original design. Although in some designs these techniques provide very good error detection probability, they do not do as well in other designs. For such designs partitioning the CDFG and checking some intermediate results increases the error detection probability to almost 1. The area for this enhancement is only slightly larger than that for the basic techniques. Synopsys BC is used to synthesize the designs.
