Abstract-In this paper, the authors propose an algorithm-level time redundancy-based concurrent error detection (CED) scheme against permanent and transient faults by exploiting the hardware allocation diversity at the register transfer level. Although the normal computation and the recomputation are carried out on the same data path, the operation-to-operator allocation for the normal computation is different from the operation-to-operator allocation for the recomputation. The authors show that the proposed scheme provides very good CED capability with very low area overhead.
I. INTRODUCTION
AS THE scale of integration increases, very large scale integration (VLSI) circuits are becoming more susceptible to permanent and transient faults. Concurrent error detection (CED) mechanisms can be used to detect permanent and transient faults. A number of CED techniques based on time redundancy, hardware redundancy, and information redundancy have been developed.
Patel and Fung [1] developed a logic level time redundancy-based CED technique for permanent faults called RE-computing with Shifted Operands (RESO). If an ALU performs a function $f$, and $x$ is an input to the function, then an error in the ALU can be detected by comparing $f(x)$ with the output of right_shift($f$(left_shift($x$))). For a given data input, the result of function $f$ is stored in a register. This is then compared with the result obtained using shifted operands. The error detection capability of RESO depends on the amount of shift. Minero et al. developed a similar technique called pseudoduplication [21]. RE-computing with Duplication With Comparison (REDWC) is another time-redundancy technique [3]. In REDWC, both the operator and the operation are split into two halves. The lower half operation and its duplicate are carried out on the two halves of the adder and the results are compared. This is then repeated on the upper half of the operation. Reference [5] extends REDWC by increasing the number of partitions.
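The RESO check described above can be illustrated with a minimal sketch. It assumes an 8-bit adder with one spare bit and models the fault as an output bit stuck at 1; the names `faulty_add` and `reso_check` are hypothetical and are not taken from [1].

```python
WIDTH = 8                      # assumed data-path width
MASK = (1 << (WIDTH + 1)) - 1  # one spare bit so that shifted operands fit


def faulty_add(a, b, stuck_bit=None):
    """Adder whose output bit `stuck_bit` is stuck at 1 (None means fault free)."""
    s = (a + b) & MASK
    if stuck_bit is not None:
        s |= 1 << stuck_bit
    return s


def reso_check(a, b, stuck_bit=None, shift=1):
    """Return True when the shifted recomputation disagrees with the normal result."""
    normal = faulty_add(a, b, stuck_bit)
    # Recompute on left-shifted operands, then shift the result back.
    recomputed = faulty_add(a << shift, b << shift, stuck_bit) >> shift
    return normal != recomputed
```

Because the shift moves the operands to different bit slices, the same stuck bit corrupts a different bit of the data in the two passes, which is what makes the comparison informative.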
Manuscript received July 30, 2001. This work was supported in part by the National Science Foundation under CAREER Award CCR 996139. This paper was recommended by Associate Editor R. Camposano.
The authors are with the Department of Electrical and Computer Engineering, Polytechnic University, Brooklyn, NY 11201 USA (e-mail: kaijie@photon.poly.edu; ramesh@india.poly.edu).
Publisher Item Identifier 10.1109/TCAD.2002.801110.
Logic level time redundancy techniques that use alternating data have also been developed. These techniques check whether $f(x) = \overline{f(\bar{x})}$, where $x$ is the input and $\bar{x}$ is its complement [22]. A divider satisfying such a self-dual property has been described in [6]. A CED technique for array multipliers using bidirectional operations has been presented in [4]. An array multiplier is partitioned into two symmetrical parts. One part performs the normal operation while the other part computes it in the reverse direction. The connections between the full adders in the multiplier are made bidirectional to support such a bidirectional operation. The performance overhead of time redundancy-based error detection is about 100%.
Triple modular redundancy (TMR) is a hardware redundancy-based error correction technique that has been widely used [2], and it entails at least 200% hardware overhead. A time redundancy variant of TMR called RE-computing with Triplication With Voting (RETWV), which trades area overhead for increased error detection/correction latency, has been developed [10], [12], [13]. RETWV is similar to REDWC except that it partitions the additions and adders into thirds. RETWV has been applied to inner product units, convolvers [10], Newton-Raphson dividers [12], and Goldschmidt dividers [13]. The performance overhead of time redundancy-based error correction is about 200%. In [23], Mitra and McCluskey proposed a logic synthesis technique for designing diverse implementations of combinational circuits in order to maximize the design diversity; they also provided a scheme to choose concurrent error detection techniques based on the diversity [24].
Information redundancy-based CED techniques append check bits to the data to form code words [2]. A coding scheme is then used to map the normal output space of a function onto an extended code space such that, for the set of all input combinations, only a subset represents valid information. CED techniques based on Bose-Lin codes [7], Berger codes [8], [18], and parity codes [19] have been proposed. Information redundancy-based CED needs to calculate the check code for operands and predict the check code for computed results. It is suitable for logic operations and some simple arithmetic operations such as additions and subtractions, but is not applicable to complex arithmetic operations such as multiplication and division.
Recently, several register transfer (RT) level techniques for concurrent error detection, recovery, and correction have been proposed. Karri and Orailoglu [16] and Ravi et al. [11] developed algorithms to synthesize self-recovering RT level data paths during the scheduling phase of high-level synthesis. While duplication and comparison at checkpoints was used in [16], an early comparison strategy wherein intermediate and final results are compared as soon as they become available was used in [11]. Although these algorithms reduce the voting overhead, they do not reduce the almost 100% hardware overhead of duplication-based CED. Karri and Iyer presented an RT level CED technique that uses the spare computation cycles of the functional units and the spare data transfer cycles in the interconnection network [14]. Karri and Orailoglu [15] and Lakshminarayana et al. [17] developed CED techniques that yield fault-secure data paths with less than proportional increase in hardware. These techniques selectively check intermediate results in a control data flow graph (CDFG) of a computation, weaken some data precedence constraints, and reduce hardware overhead without violating the time constraints.
In the built-in self-repair (BISR) technique proposed in [20] , some spare functional units (such as multipliers, adders) are synthesized during high-level synthesis. These spare modules are not used when the system is healthy. When some modules become faulty, the operations allocated to these faulty modules are reallocated and/or rescheduled to the remaining modules as well as the spare modules. BISR exploits the flexibility of high-level synthesis tasks of scheduling, allocation and assignment to reduce the number of spare modules synthesized. A system level BISR technique targeting multitask real-time systems using multiple processors and multiple memory units is proposed by Hong et al. [26] . Every task running on such a real-time system has two or more algorithmic choices. Each algorithmic choice of a task needs different amounts of processor and memory units. The system level BISR technique uses an optimal schedule for all tasks when no faults occur. However, when a fault occurs in either the processors or the memory units or both, the system level BISR technique selects another optimal schedule that can finish all the tasks using the remaining resources (processors and memory units).
Two approaches to FPGA fault-tolerance have been proposed. Shnidman et al. proposed an on-line testing technique for bus-based field programmable gate arrays (FPGA) [27] . This technique utilizes the feature of such an FPGA that all the function blocks are organized as columns. This technique reserves two columns of function blocks. One of them hosts the configuration data of the column under test while the other column has the testing circuits implemented. When a column is under test, its configuration data and flip-flop values are copied to the free column. When the test is finished, the configuration data and flip-flop values are restored. Lach et al. [28] exploit the reconfiguration flexibility to improve the FPGA system reliability. In this approach, the physical design is partitioned into a set of blocks. Each block contains a certain physical resource, an interface specification, a netlist, and more importantly multiple configurations. When a fault occurs in a block, it is reconfigured using a configuration that does not use the faulty region of the block.
Wu and Karri proposed an RT level time redundancy-based CED technique called Algorithm level RE-computing with Shifted Operands (ARESO) [25]. ARESO combines logic level RESO with RT level scheduling and pipeline synthesis. ARESO does not check intermediate results of a computation. Rather, it checks the final result of the data path and stores it in a register. Then the same input data are shifted and fed to the data path again. When the results are available, they are compared to the results saved before. A mismatch indicates an error. The detection capabilities of ARESO depend on the number of bits shifted.
In this paper, we propose an allocation diversity-based RT level time redundancy CED technique. The key idea of the proposed technique is that when the operations allocated to an operator during the normal computation are different from the operations allocated to the operator during the recomputation, a fault in the operator may affect the two results in different ways. Since the normal computation and the recomputation use the same hardware, the area overhead is very low. However, there is a time overhead.
We will describe the general idea of algorithm level recomputing in Section II and describe the RT level fault model in Section III. Then in Section IV, we will describe recomputing using allocation diversity, analyze its CED capability, and propose an enhanced scheme. Experimental results will be discussed in Sections V and VI, followed by conclusions in Section VII.
II. ALGORITHM LEVEL RECOMPUTING
Consider a CDFG with four additions and one multiplication shown in Fig. 1. The schedule in Fig. 1(a) uses two adders, one multiplier, and three clock cycles. It does not support CED. Fig. 1(b) shows the hardware redundancy-based CED technique. C denotes comparison. Every large circle denotes a logic level fault tolerant operator. This kind of operator duplicates the original operation. The duplicate operation is carried out in the same clock cycle as the normal operation. The implementation of the duplicate operation can be identical to the original design (straightforward duplication) or can be different (design diversity). It comes at the cost of double the hardware (four adders and two multipliers), though the computation time has not increased in this design. Fig. 1(c) shows a time redundancy-based CED. Here, two logic level fault tolerant adders and one logic level fault tolerant multiplier are used and each operation consumes two clock cycles. Although there is no increase in the number of operators, the computation time increases by 100% to six clock cycles.
Finally, in Fig. 1(d), the comparisons are moved out of the operators to the end of the computation. First, the normal inputs are applied to the design. When the normal computation is finished, the results are saved to registers and the same inputs are applied to the design for recomputation. Following recomputation, the two results are compared, with a mismatch suggesting an error. This technique, called algorithm level recomputing, makes the recomputation controllable by the designer. We do not check all the results as logic level CED techniques do; we check the results periodically. There are two types of computations: the normal computation and the recomputation. The normal computation is carried out on all input samples up to the $k$th sample. After the $k$th input sample is processed, the result is stored in a register. Then a recomputation is performed on the $k$th input sample and its result is compared with the stored normal result.
$k$, called the checking ratio, is the ratio of the total number of results to the number of results that have been recomputed and can be used to trade off performance overhead against detection latency and fault detection capability. The smaller the value of $k$, the more results will be recomputed and checked. If $k$ is set to 1, all results are recomputed and checked; minimum detection latency is achieved while the time overhead is 100%. If $k$ is set to 2, only half of the results will be recomputed and checked; detection latency increases while the time overhead is reduced to 50%.
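The role of the checking ratio can be sketched in a few lines. This is only an illustration under stated assumptions: the checking ratio is passed as the parameter `k`, `compute` and `recompute` stand for the normal computation and the recomputation of one input sample, and the function name `run_with_recomputing` is hypothetical.

```python
def run_with_recomputing(samples, compute, recompute, k):
    """Process samples; recompute and compare every k-th result.

    Returns (results, error_flags, passes), where passes counts executions of
    the data path (normal computations plus recomputations).
    """
    results, errors, passes = [], [], 0
    for index, x in enumerate(samples, start=1):
        y = compute(x)                 # normal computation
        passes += 1
        results.append(y)
        if index % k == 0:             # every k-th sample is checked
            passes += 1                # recomputation on the same input
            errors.append(recompute(x) != y)
    return results, errors, passes
```

With eight samples, `k = 1` gives 16 passes (100% time overhead) and `k = 2` gives 12 passes (50% time overhead), matching the figures quoted in the text.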
Algorithm level recomputing is a general time redundancy-based CED technique. Different implementations of the recomputation process yield different fault detection capabilities: detection of transient faults only, detection of permanent faults only, or detection of both permanent and transient faults.
III. RT LEVEL FAULT MODEL
In this paper, we consider transient and permanent stuck-at faults. Although the analysis in this paper is based on stuck-at-1 faults, the results extend to stuck-at-0 faults as well. According to Thaker et al. [29], when RTL constructs contain synthetic operators such as adders and multipliers, RTL faults are considered to be injected only at the input variables. However, this is only a subset of the collapsed gate level fault list. In this paper, we consider the faults of input variables as well as the faults occurring inside the synthetic operators. Regardless of the location of the fault, the technique will detect faults that manifest as outputs different from the expected result. Hence, in our analysis, we model a fault as an offset from the correct result.
Consider the four-bit array multiplier shown in Fig. 2. The four adders enclosed in a dashed rectangle form the third bit slice, since their sums will be accumulated to the third bit in the result. Assuming that one of the connections (for example, the thick line) is stuck-at-1, the faulty result output by the defective multiplier is offset from the correct result by the binary weight of that connection if the correct output of the thick line is 0, or by 0 if it is 1. Table I summarizes all possible offsets due to one stuck-at-1 fault.
IV. ALLOCATION DIVERSITY

A. The Technique
In algorithm level recomputing, the normal computation and the recomputation are executed on the same data path using the same input samples but at different times. Permanent faults in the data path cannot be detected if both the normal computation and the recomputation are carried out on the same operators in the data path. We propose a new technique called allocation diversity wherein the normal computation and the recomputation use the same algorithm (i.e., the same CDFG) and the same RT level schedule but different operation-to-operator allocations. Fig. 3 shows two implementations of the same CDFG. For example, an addition that is carried out on adder 1 in the normal computation is carried out on adder 3 in the recomputation. At the beginning of each computation, the controller checks the checking ratio. If it is a normal computation, the controller will select the normal allocation shown in Fig. 3(a). If it is a recomputation, the controller will select the checking allocation shown in Fig. 3(b). Since the operation-to-operator allocations are different, a defective operator will carry out different operations in the normal computation and the recomputation. Most probably, this will affect the final result in different ways, thereby increasing the likelihood of error detection.
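The effect of using two allocations can be seen in a small simulation. The sketch below does not reproduce the CDFG of Fig. 3; it assumes a hypothetical CDFG with four additions, two adders, and a stuck-at-1 fault on one output bit of adder 1. All names (`CDFG`, `NORMAL`, `CHECKING`, `make_adder`, `evaluate`, `detects`) are illustrative.

```python
def make_adder(stuck_bit=None):
    """Return an adder; if stuck_bit is given, that output bit is stuck at 1."""
    def add(x, y):
        s = x + y
        return s | (1 << stuck_bit) if stuck_bit is not None else s
    return add


# Hypothetical CDFG: out = ((a + b) + (c + d)) + e, as (name, left, right) triples.
CDFG = [("t1", "a", "b"), ("t2", "c", "d"), ("t3", "t1", "t2"), ("out", "t3", "e")]

# Same schedule, different operation-to-operator allocations.
NORMAL = {"t1": 1, "t2": 2, "t3": 1, "out": 1}     # adder 1 used three times
CHECKING = {"t1": 2, "t2": 1, "t3": 2, "out": 2}   # adder 1 used once


def evaluate(inputs, allocation, adders):
    """Execute the CDFG with each operation on its allocated adder."""
    values = dict(inputs)
    for name, left, right in CDFG:
        values[name] = adders[allocation[name]](values[left], values[right])
    return values["out"]


def detects(inputs, adders):
    """True when normal computation and recomputation disagree."""
    return evaluate(inputs, NORMAL, adders) != evaluate(inputs, CHECKING, adders)
```

With bit 1 of adder 1 stuck at 1 and inputs a=1, b=0, c=0, d=0, e=2, the normal computation yields 7 and the recomputation yields 5, so the fault is detected. With e=0 both computations yield 3 instead of 1, which illustrates that some input combinations let the fault escape.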
B. CED Capability
Returning to the example in Fig. 3, let us assume that only one of the operators has a stuck-at-1 fault. The stuck-at-1 fault can be modeled as an offset of $2^i$ from the correct result, where $i$ is the bit position of the fault. The faulty results obtained in the normal computation and the recomputation due to this stuck-at-1 fault are summarized in Table II. For example, if there is a stuck-at-1 fault in the $i$th bit position of adder 3, the normal computation and the recomputation output different faulty expressions (see Table II), and the fault can be detected whenever the two results differ. Similarly, a stuck-at-1 fault in multiplier 2 translates into a faulty result during the normal computation and a faulty result during the recomputation. Since these two faulty results have the same offset, the fault may not be detected.
We will compute the probability of missing a fault using allocation diversity-based CED. Operations such as multiplication and exponentiation in a computation magnify the effects of faults at their inputs, making them easy to detect. On the other hand, operations such as addition and subtraction do not magnify the effects of faults at their inputs. We will therefore focus on parts of CDFGs that have only additions/subtractions and consider stuck-at-1 faults. Let us consider a net of an operator that is used up to $m$ times during an iteration of computation. The possible logic value of this net is 0 or 1, depending on the inputs. By considering all the possible inputs, we define $P_0$ as the probability of 0 at this net and $P_1$ as the probability of 1 at this net. For ease of exposition, we use a second subscript to indicate the sequence number of the use of the operator. For example, when the operator is in its $j$th use, $P_{0,j}$ and $P_{1,j}$ are the probabilities of 0 and 1 at this net, respectively. In our analysis, all the $P_{0,j}$ are the same and all the $P_{1,j}$ are the same, that is, $P_{0,j} = P_0$ and $P_{1,j} = P_1$. Assuming a stuck-at-1 fault has occurred in this net, $P_{0,j}$ becomes the probability that the expected value is 0 in the $j$th use of the faulty operator, in which case the fault corrupts the result. Similarly, $P_{1,j}$ becomes the probability that the expected value is 1 in the $j$th use of the faulty operator, in which case the fault has no effect.
Consider the CDFG shown in Fig. 4. This CDFG uses two adders and takes two clock cycles. Adder 2 (shaded in dark) is the faulty module and is used twice (once in clock cycle 1 and once in clock cycle 2). After the first use of the faulty adder in clock cycle 1, let the probability of a correct result be $P_{1,1}$ and the probability of a wrong result be $P_{0,1}$. Similarly, after the second use of the faulty adder in clock cycle 2 (assuming that the inputs to the second use are correct), let the probability of a correct result be $P_{1,2}$ and the probability of a wrong result be $P_{0,2}$. Now we will derive probabilities for three cases: a single stuck-at-1 fault, stuck-at-1 faults in nonadjacent bit positions, and stuck-at-1 faults in adjacent bit positions.
Case 1. A Single Stuck-at-1 Fault:
In Fig. 4, every time the faulty adder 2 is used, it may offset the correct result by $2^i$, where $i$ is the bit position of the fault. (A result is correct if it is the expected result for the inputs, even though these inputs may come from a faulty operation.) Denoting the expected final result by $R$, we will have the expected result $R$ with probability $P_{1,1}P_{1,2}$, the wrong result $R + 2^i$ with probability $P_{0,1}P_{1,2}$, the wrong result $R + 2^i$ with probability $P_{1,1}P_{0,2}$, and the wrong result $R + 2 \cdot 2^i$ with probability $P_{0,1}P_{0,2}$. These probabilities for the general case when the faulty hardware is used $m$ times are summarized in Table III. A fault will not be detected when the faulty offsets that come from the normal computation and the recomputation are identical, even when allocation diversity is employed. If the faulty hardware is used $m$ times in the normal computation and $n$ times in the recomputation, the probability that the two results are offset by the same amount is
$$P_{\mathrm{same}} = \sum_{k=0}^{\min(m,n)} \binom{m}{k}\binom{n}{k} P_0^{2k} P_1^{m+n-2k}.$$
The term $k = 0$ corresponds to two results that are both correct. Hence, the probability that a fault is not detected is
$$P_{\mathrm{miss}} = \sum_{k=1}^{\min(m,n)} \binom{m}{k}\binom{n}{k} P_0^{2k} P_1^{m+n-2k}.$$
Fig. 5 plots $P_{\mathrm{miss}}$ as a function of $m$ and $n$ assuming that $P_0 = P_1 = 0.5$ (i.e., in the correct output, 1 and 0 are equally likely). From this plot, we can observe that the probability of not detecting a fault is large when $m = n$. Further, the probability of not detecting a fault is lowest when $m = 0$ or $n = 0$.
Case 2. Two Stuck-at-1 Faults in an Operator:
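Let $P_0$ and $P_1$ be the above probabilities (of an expected 0 and an expected 1 at the faulty net) due to the first fault and $Q_0$ and $Q_1$ be the above probabilities due to the second fault. Also, let $e_1$ be the error offset due to the first fault and $e_2$ be the error offset due to the second fault. Assuming that the first fault occurs in a more significant bit position and affects the result $k_1$ times while the second fault occurs in a less significant bit position and affects the result $k_2$ times, the possible final results are $R + k_1 e_1 + k_2 e_2$, where $R$ is the expected result and $0 \le k_1, k_2 \le m$ when the faulty hardware is used $m$ times.
a) Stuck-at-1 Faults in Nonadjacent Bit Positions: If the faulty bit positions are so far apart (such that $e_1 > m e_2$) that different combinations of $k_1$ and $k_2$ yield different offsets, the probabilities of correct and faulty results when the faulty hardware is used $m$ times are summarized in Table IV. With $m$ uses in the normal computation and $n$ uses in the recomputation, the two faults then contribute independently and the probability of missing the faults is
$$P_{\mathrm{miss}} = \left[\sum_{k=0}^{\min(m,n)} \binom{m}{k}\binom{n}{k} P_0^{2k} P_1^{m+n-2k}\right] \left[\sum_{l=0}^{\min(m,n)} \binom{m}{l}\binom{n}{l} Q_0^{2l} Q_1^{m+n-2l}\right] - (P_1 Q_1)^{m+n}. \quad (1)$$
The probability of missing a fault can be calculated using (1). If we assume $P_0 = P_1 = Q_0 = Q_1 = 0.5$, (1) can be simplified as follows:
$$P_{\mathrm{miss}} = \frac{\binom{m+n}{n}^2 - 1}{4^{m+n}}.$$
Fig. 6(a) shows the probability of missing two nonadjacent stuck-at-1 faults.
b) Stuck-at-1 Faults in Adjacent Bit Positions: Fig. 6(b) shows the probability of missing two adjacent stuck-at-1 faults. It is easy to see that faults will be missed with a higher probability compared to Case 2a. However, it is still better than Case 1. Similar to Case 1 and Case 2a, the probability of detecting faults is the lowest when $m = n$ and the highest when $m = 0$ or $n = 0$. Comparing Fig. 6(a) and (b) with Fig. 5 shows that the probability of missing a fault decreases as the number of faults in the hardware increases. This is because as the number of faults increases, the number of possible faulty results increases, thereby reducing the possibility that these faulty results match.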
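The two-fault cases can be explored by enumerating the offset distributions directly. This is a sketch under assumptions not spelled out in the text: each fault corrupts each use independently with probability 1/2, offsets add up, and a miss means that the normal computation and the recomputation end with the same total offset while at least one of them is corrupted. The names `offset_distribution` and `miss_probability` are hypothetical; `weights` holds the two fault offsets in units of the less significant one.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb


def offset_distribution(uses, weights):
    """Distribution of the total offset after `uses` uses of the faulty operator."""
    dist = {0: Fraction(1)}
    for w in weights:
        nxt = defaultdict(Fraction)
        for offset, p in dist.items():
            for k in range(uses + 1):      # fault with weight w strikes k times
                nxt[offset + k * w] += p * comb(uses, k) / Fraction(2 ** uses)
        dist = nxt
    return dist


def miss_probability(m, n, weights):
    """Probability that both computations show the same offset, fault-free case excluded."""
    normal = offset_distribution(m, weights)
    recomputed = offset_distribution(n, weights)
    same = sum(p * recomputed.get(offset, 0) for offset, p in normal.items())
    return same - normal[0] * recomputed[0]
```

Using `weights=[16, 1]` for faults that are far apart and `weights=[2, 1]` for adjacent faults, the results for (m, n) = (1, 4), (4, 2), and (2, 1) are 3/128, 7/128, 1/8 and 17/512, 335/4096, 9/64, which round to the values 0.023, 0.055, 0.125 and 0.033, 0.082, 0.141 quoted in Section IV-D.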
C. Time Overhead
Assuming that one iteration of a computation takes $T$ clock cycles, a design without CED capability takes $NT$ clock cycles to process $N$ input samples while the design using allocation diversity takes $NT(1 + 1/k)$ clock cycles, where $k$ is the checking ratio. Hence, the time overhead of allocation diversity is $1/k$.
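As a worked example of this overhead, take the nine-cycle iteration of the windowed filter from Section V and assume, purely for illustration, $N = 1000$ input samples and a checking ratio $k = 4$ (these two values are not from the paper). One recomputation of $T$ cycles is added for every $k$ samples:

```latex
\begin{align*}
\text{cycles without CED} &= N T = 1000 \times 9 = 9000,\\
\text{cycles with allocation diversity} &= N T + \frac{N}{k}\,T
   = 9000 + 250 \times 9 = 11\,250,\\
\text{time overhead} &= \frac{11\,250 - 9000}{9000} = \frac{1}{k} = 25\%.
\end{align*}
```

Setting $k = 1$ or $k = 2$ in the same expression gives the 100% and 50% overheads mentioned in Section II.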
D. Improving CED Capability
Error detection probability of allocation diversity-based CED can be improved by maximizing the difference between the number of times a defective operator is used in the normal computation and the number of times it is used in the recomputation. However, it may not always be possible to achieve this unevenness in the allocations for all hardware units in a design.
Let us consider Fig. 7 as an example. Since the design uses three adders, a single allocation cannot simultaneously maximize the usage difference for all three adders. An operation-to-operator allocation for the normal computation and one for the recomputation are shown in Fig. 7(a) and (b), respectively. This allocation minimizes the probability of missing the faults introduced by faulty adder 2 by maximizing the difference between the number of times it is used in the normal computation and in the recomputation. In Fig. 7, adder 2 is used once in the normal computation and four times in the recomputation, yielding a 0.125 probability of missing a single fault, a 0.023 probability of missing two nonadjacent faults, and a 0.033 probability of missing two adjacent faults. (Here we use the same assumption as above, that each output bit is equally likely to be 1 or 0.) On the other hand, adder 1 is used four times in the normal computation and two times in the recomputation, yielding a 0.219 probability of missing a single fault, a 0.055 probability of missing two nonadjacent faults, and a 0.082 probability of missing two adjacent faults. Finally, adder 3 is used two times in the normal computation and once in the recomputation, yielding a 0.25 probability of missing a single fault, a 0.125 probability of missing two nonadjacent faults, and a 0.141 probability of missing two adjacent faults.
Partitioning the CDFG into smaller sub-CDFGs and checking the intermediate results output by these sub-CDFGs can improve the fault detection probabilities of the allocation diversity scheme. Fig. 8 shows a two-way partitioning of the CDFG of Fig. 7. The original CDFG has been divided into two sub-CDFGs, A and B, and the intermediate results that are checked are marked in the figure. For sub-CDFG A, adder 1 and adder 2 are used in the normal computation while adder 2 and adder 3 are used in the recomputation. The allocation of sub-CDFG A simultaneously maximizes the usage differences of all three adders: adder 1 is used four times in the normal computation and zero times in the recomputation, adder 2 is used once in the normal computation and four times in the recomputation, and adder 3 is used zero times in the normal computation and once in the recomputation. Similarly, the allocation for sub-CDFG B maximizes the usage differences for adders 3 and 1. In all these cases, if a defective operator is involved either in the normal computation or in the recomputation but not both, the probability of missing a fault in it is zero.
V. EXPERIMENTAL RESULTS
New design optimization procedures could be developed, but this would be useful only if a new synthesis system were being designed from scratch. For easy transfer of this technology into industry, we instead describe an approach to incorporate algorithm level recomputing constraints into existing RT level synthesis systems such as Synopsys Behavioral Compiler (BC) [31]. Algorithm level recomputing using allocation diversity can be incorporated into an RT level synthesis tool as follows.
1) Make a duplicate copy of the VHDL model of the algorithm and create a conditional VHDL model with the original VHDL model in the true branch and the duplicate VHDL model in the false branch of the conditional statement. The checking ratio counter will be the control signal selecting one of these branches. The synthesis system will compile this conditional VHDL model into a conditional CDFG.
2) Specify an allocation constraint that ensures that each operation in the true branch of the CDFG and the corresponding duplicate operation in the false branch of the CDFG are performed by different hardware operators. Invoke the scheduling and allocation algorithms supported by the synthesis system using the timing constraints (number of clock cycles) specified by the designer.
Observe that this approach is generic and can use the design optimization procedures (for scheduling and allocation steps) supported by the specific synthesis system to solve this problem.
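The allocation constraint of step 2 can be stated as a simple check on a pair of allocations. The sketch below is independent of any synthesis tool: it assumes that each allocation is available as a mapping from operation names to operator names, and the function name `violates_diversity` as well as the operation and operator names in the usage example are hypothetical.

```python
def violates_diversity(normal_alloc, checking_alloc):
    """Return the operations that run on the same operator in both branches.

    normal_alloc, checking_alloc: dicts mapping an operation name to the
    operator it is bound to in the true (normal) and false (recomputation)
    branch of the conditional CDFG. An empty result means the constraint holds.
    """
    return sorted(
        op for op, unit in normal_alloc.items()
        if checking_alloc.get(op) == unit
    )
```

For example, with `{"add_a": "adder1", "add_b": "adder2", "mul_a": "mult1"}` as the normal allocation and `{"add_a": "adder2", "add_b": "adder1", "mul_a": "mult1"}` as the checking allocation, the function returns `["mul_a"]`, flagging the multiplication that would be recomputed on the same multiplier.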
In this section, we will show the results on three examples: a finite impulse response (FIR) filter, a windowed filter, and an eight-point discrete cosine transform (DCT). The experimental data and error detection probabilities reported here are based on stuck-at-1 faults.
The synthesis area result includes the controller. BC does not report the controller overhead separately. However, the area overhead due to allocation diversity is mainly due to an increase in the complexity of the controller and the additional comparators, registers, multiplexers, and interconnection lines. The controller overhead is mainly due to two sources. First, without the allocation diversity technique, control signals associated with the operators can be set as "don't cares" in those states where the operators are idle. Second, the allocation diversity technique entails additional control signals to control the checking circuitry. Notwithstanding these sources of controller overhead, the reasons for the small control area overhead associated with the allocation diversity technique are threefold:
1) The digital signal processing data paths tend to occupy far more area than the control paths, so the impact of control on the overall area of the chip is small.
2) While the complexity of the control logic increases, it does not correspond to the controller area in a one-to-one fashion. The controller area is a nonlinear function of the schedule, allocation, and mapping. Moreover, state assignment for the new controller results in a different encoding of the states, which is also a factor in keeping the control overhead low.
3) The control logic can be distributed throughout the design. The increase in controller area is accommodated in the dead areas of the layout.
The controller fault-tolerance can be implemented using the technique presented by Hellebrand [30] or by straightforward duplication. Table V shows all the results of the basic design, allocation diversity-based CED design, and allocation diversity-based CED design with CDFG partitioning. The second and third rows show the number of operators used by these designs. The fourth row shows the area consumed in terms of unit cells while the fifth row shows the corresponding area overhead. Because the original design consumes very little hardware, all the proposed schemes involve a large overhead. Rows 6-9 show the probabilities of missing faults in three adders. The types of faults we analyzed here are single stuck-at-1 fault, two nonadjacent stuck-at-1 faults, and two adjacent stuck-at-1 faults.
For brevity, we combined the three probabilities into one set as shown in the table. In this design, four multipliers have similar allocation and schedules, so in the last row we showed the probability of missing faults in one of them. According to this table, the scheme using CDFG partitioning reduces the probabilities of missing these faults by a factor of 7.5 from around 0.3 to less than 0.04.
B. Windowed Filter
Next we considered a windowed filter that accepts one input, produces one output, and implements $\mathrm{Out} = \sum_i \mathrm{Coef}_i \,(\mathrm{In}_i + \mathrm{In}_{i'})$ using 15 multiplications and 29 additions. Our implementation uses four adders and four multipliers and takes nine clock cycles. Table VI shows all the results. The meaning of each row is the same as in Table V. In this case, because the original design consumes a large amount of hardware, the area overheads of the CED schemes are around 15%. The allocation diversity scheme has a lower probability of missing the faults on adders than on multipliers. The reason for this is that among the additions allocated to each adder, at least one is carried out prior to a multiplication. As we analyzed above, the effect of such a fault will be magnified by the multiplication. Partitioning the CDFG improved the probabilities of detecting all possible faults to 1.
C. A One-Dimensional Eight-Point DCT
The eight-point DCT design accepts eight inputs and produces eight outputs using four adders, four multipliers, and 19 clock cycles. Table VII summarizes the allocation diversity-based CED results. In this design, each output corresponds to a sub-CDFG of the DCT CDFG. Each of these sub-CDFGs is independent of the others. Since in algorithm level recomputing we need to check all outputs, straightforward allocation diversity achieves a fault detection probability of 1.
VI. FAULT INJECTION STUDY
We used the windowed filter example for the transient and permanent fault injection study. To simulate the injection of a stuck-at-1 fault, we appended an OR gate to every operator. One input of the OR gate is the output of the operator while the other input controls fault injection. The output of the operator is stuck_at_1 if this input is 1. A stuck_at_0 fault can be similarly injected by appending an AND gate to the operator and applying 0 to one of the AND inputs. We implemented the fault injection study as a C++ program.
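The injection mechanism described above can be mimicked in software as follows. This is a minimal sketch and not the authors' C++ program: it assumes that a fault affects a single output bit of an operator, and the names `inject`, `stuck_bit`, `stuck_value`, and `enable` are illustrative. The `enable` argument plays the role of the control input of the appended gate, so a transient fault can be modeled by asserting it only for some of the calls.

```python
def inject(operator, stuck_bit, stuck_value):
    """Wrap `operator` so that `enable=True` forces one output bit to a fixed value."""
    mask = 1 << stuck_bit

    def faulty(x, y, enable=False):
        out = operator(x, y)
        if not enable:
            return out                 # control input inactive: fault-free output
        if stuck_value == 1:
            return out | mask          # OR gate: bit forced to 1
        return out & ~mask             # AND gate: bit forced to 0

    return faulty
```

For instance, `inject(lambda x, y: x + y, 2, 1)` returns an adder that outputs 3 for inputs 1 and 2 when the fault is disabled and 7 when it is enabled.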
First, we injected transient faults and computed how many of these faults are detected and how many are missed. There are two reasons why a fault is missed: 1) Allocation diversity inherently misses some faults, as we analyzed in Section IV-B. 2) A fault may also be missed if the time between recomputations is larger than the duration of the fault. Fig. 9 plots the probability of detecting transient faults. The x axis shows the duration of the transient fault and the y axis shows the checking ratio of the CED design. The simulation shows that transient faults with longer duration are easier to detect and that designs with smaller checking ratios have stronger CED capability. Fig. 9(b) shows that CDFG partitioning slightly improves the detection probability of allocation diversity-based CED.
Next, we injected permanent faults and studied the average number of recomputations necessary to detect a permanent fault. Fig. 10 plots the probability distributions of the number of recomputations versus checking ratio. The x axis shows the different checking ratios while the y axis shows the number of recomputations. Fig. 10(a) shows that for the allocation diversity-based CED design, four recomputations can detect 65% of all the injected single-bit permanent faults. This is increased to 80% with one more recomputation (five recomputations in total). Fig. 10(b) shows that for allocation diversity-based CED with CDFG partitioning, five recomputations improve this percentage to 93%.
VII. CONCLUSION
We proposed an RT level time redundancy-based CED scheme using allocation diversity. The operation-to-operator allocation used in the normal computation is different from the one used in the recomputation. The proposed technique entails about 10%-20% area overhead depending on the size of the original design. Although in some designs allocation diversity provides very good error detection probability for permanent faults, it does not do as well in other designs. For such designs, partitioning the CDFG and checking the intermediate results increases the error detection probability to almost 1. The area for this enhancement is only slightly larger than that for allocation diversity.
