Design diversity has long been used to protect redundant systems against common-mode 
Introduction
Concurrent Error Detection (CED) techniques are widely used for designing systems with high data integrity. A duplex system is an example of a classical redundancy scheme that has been used in the past for concurrent error detection. There are many examples of commercial dependable systems from companies like Stratus and Sequoia using hardware duplication [Kraft 81, Pradhan 96]. Hardware duplication is also used in the IBM G5 processor [Webb 97, Spainhower 99] and in the space shuttle. Figure 1.1 shows the basic structure of a duplex system. In a duplex system there are two modules (shown in Fig. 1.1 as Module 1 and Module 2) that implement the same or related logic functions (e.g., complement function). The two implementations can be the same or different. A comparator is used to check whether the outputs from the two modules agree. If the outputs disagree, the system indicates the presence of an error.
Data integrity means that the system either produces correct outputs or generates an error signal when incorrect outputs are produced. For a duplex system, data integrity is maintained as long as both modules do not produce identical erroneous outputs. In a duplex system, common-mode failures (CMFs) result from failures that affect more than one element at the same time, generally due to a single cause [Lala 94]. These include design faults and operational failures that may be due to external causes (such as EMI, power-supply disturbances and radiation) or internal causes. Common-mode failures in redundant VLSI systems are surveyed in [Mitra 00a].
Design diversity has been proposed in the past to protect redundant systems against common-mode failures. In [Avizienis 84], design diversity was defined as the "independent" generation of two or more software or hardware elements (e.g., program modules, VLSI circuit masks, etc.) to satisfy a given requirement. Design diversity has been applied to both software and hardware systems [Lyu 91, Briere 93, Riter 95].
Tohma proposed using the implementations of logic functions in true and complemented forms during duplication [Tohma 71]. The use of a particular circuit and its dual was proposed in [Tamir 84] to achieve diversity in order to handle common-mode failures. The basic idea is that, with different implementations, common failure modes will probably cause different error effects.
The above discussion of diversity is qualitative and does not provide any quantitative insight into the design or the analysis of systems using diverse duplication. In a recent paper [Mitra 99a], we developed a metric (called the D-metric) to quantify diversity among several designs and used this metric to perform reliability analysis of redundant systems. However, for arbitrary designs, the problem of calculating the value of the D-metric is NP-complete. In this paper, we present several techniques to calculate the D-metric.
D: A Design Diversity Metric
Assume that we are given two implementations (logic networks) of a logic function, an input probability distribution, and faults fi and fj that occur in the first and the second implementations, respectively. The diversity di,j with respect to the fault pair (fi, fj) is the conditional probability that the two implementations do not produce identical errors, given that faults fi and fj have occurred [Mitra 99a]. The di,j's generate a diversity profile for the two implementations with respect to a fault model. Consider a duplex system consisting of the two implementations under consideration.
In response to any input combination, the implementations can produce one of the following cases at their outputs: (1) Both of them produce correct outputs. (2) One of them produces the correct output and the other produces an incorrect output.
(3) Both of them produce the same incorrect value. (4) They produce non-identical incorrect outputs.
For the first case, the duplex system will produce correct outputs. For the second and the fourth cases, the system will report a mismatch so that appropriate recovery actions can be taken. However, for the third case, the system will produce an incorrect output without reporting a mismatch; thus, for the third case, the data integrity of the system is not preserved. D is the probability that the two implementations either produce error-free outputs or produce different error patterns on their outputs in the presence of faults affecting the two implementations.
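The four cases above can be sketched as a small classifier. This is an illustrative sketch only: the names `Outcome`, `classify` and `integrity_preserved` are hypothetical, and outputs are assumed to be represented as tuples of bits.

```python
from enum import Enum


class Outcome(Enum):
    BOTH_CORRECT = 1         # case (1): correct output, no mismatch
    ONE_INCORRECT = 2        # case (2): mismatch reported
    SAME_INCORRECT = 3       # case (3): undetected data corruption
    DIFFERENT_INCORRECT = 4  # case (4): mismatch reported


def classify(correct, out1, out2):
    """Classify one comparison of the two module outputs (tuples of bits)."""
    wrong1 = out1 != correct
    wrong2 = out2 != correct
    if not wrong1 and not wrong2:
        return Outcome.BOTH_CORRECT
    if wrong1 != wrong2:
        return Outcome.ONE_INCORRECT
    if out1 == out2:
        return Outcome.SAME_INCORRECT
    return Outcome.DIFFERENT_INCORRECT


def integrity_preserved(outcome):
    """Data integrity is lost only when both modules agree on a wrong value."""
    return outcome is not Outcome.SAME_INCORRECT
```

Only `Outcome.SAME_INCORRECT` escapes the comparator, which is why the metric counts exactly those input combinations.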
Consider any combinational logic function with n inputs and a single output. The single stuck-at fault model is used because of its effectiveness, as discussed in [...] Fig. 2.1a and the fault f2 = y stuck-at-0 in the implementation of Fig. 2.1b. The set of input combinations that detect f1 is {ABC = 101}. The set of input combinations that detect f2 is {ABC = 111, 101, 110}. It is clear that ABC = 101 is the only input combination that detects both f1 and f2. Hence, the joint detectability k1,2 of the fault pair (f1, f2) is 1. If a duplex system consisting of the two implementations in Fig. 2 
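The joint detectability in this example can be reproduced with a few lines of code. The sketch below assumes a uniform input distribution and relies on the fact that, for a single-output function, two faults detected by the same input necessarily produce the same (flipped) output bit. The function names are hypothetical, and the resulting d1,2 value is computed here under those assumptions.

```python
from fractions import Fraction


def joint_detectability(detect1, detect2):
    """Number of input combinations that detect both faults."""
    return len(set(detect1) & set(detect2))


def diversity_single_output(detect1, detect2, n_inputs):
    """d = 1 - k / 2^n for a single-output function with uniform inputs."""
    k = joint_detectability(detect1, detect2)
    return 1 - Fraction(k, 2 ** n_inputs)


# Detecting sets from the example (input combinations written as ABC).
t1 = {"101"}
t2 = {"111", "101", "110"}

print(joint_detectability(t1, t2))         # 1
print(diversity_single_output(t1, t2, 3))  # 7/8
```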
The D-metric can be used to perform data integrity analysis of duplex systems [Mitra 99a]. An estimate of data corruption latency using the above metric was presented in [Mitra 99b]. Suppose that faults f1 and f2 affect the two implementations N1 and N2 at cycle c. The data corruption latency is defined to be the number of cycles from c after which both the implementations produce the same error pattern at the output. The expression for the expected data corruption latency of a duplex system is shown below.
Expected data corruption latency
=
In the above expression, T is the mission time of the application under consideration. When the d1,2 value of a fault pair is 1, the data corruption latency is strictly infinite (because data integrity is guaranteed) and is limited by the system mission time. It is clear from the above expression that the fault pairs having their di,j values very close to 1 contribute the most to increasing the data corruption latency. Fault pairs with very low values of di,j have very little impact on increasing the expected data corruption latency of the system.
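The qualitative claim can be illustrated under a simplifying assumption that is not taken from [Mitra 99b]: inputs in successive cycles are independent, so a fault pair with diversity d produces an identical error in any given cycle with probability 1 - d, and the latency is a geometric random variable truncated at the mission time T. The function name `expected_latency` is hypothetical.

```python
def expected_latency(d, mission_time):
    """Expected value of min(L, T), where L is geometric with success
    probability 1 - d (assumed model) and T is the mission time in cycles.

    E[min(L, T)] = sum_{k=0}^{T-1} d^k = (1 - d^T) / (1 - d) for d < 1.
    """
    if d >= 1.0:
        # Data integrity is guaranteed; latency is limited by mission time.
        return float(mission_time)
    return (1.0 - d ** mission_time) / (1.0 - d)


for d in (0.1, 0.5, 0.9, 0.999, 1.0):
    print(d, expected_latency(d, 10_000))
```

Under this model the latency is about 1/(1 - d) cycles until it saturates at T, so only values of d very close to 1 move the result appreciably.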
Techniques for Reducing the Number of Fault Pairs
In this section, we first prove that the problem of calculating the D-metric presented in Sec. 2 is NP-complete.
Theorem 1: The calculation of the di,j value for a fault pair (fi, fj) is an NP-complete problem for arbitrary logic networks.
Proof: Consider any fault fi in an arbitrary combinational logic network. We want to find whether the fault is redundant or not. For that purpose, we can calculate the di,i value for two identical designs (corresponding to the given combinational logic network). The fault is redundant if and only if the di,i value is 1. However, we know that the problem of identification of a redundant fault is NP-complete because it can be reduced to the Boolean Satisfiability problem [Garey 79]. Hence, the problem of calculation of the di,j values is NP-complete.
Q.E.D.
For practical purposes, there are two problems associated with calculation of the D-metric. First, the number of fault pairs for which the di,j values must be calculated can be very large. Second, the problem of calculating the di,j value for a fault pair is NP-complete. In this section, we present techniques to reduce the number of fault pairs for which the di,j values have to be calculated; as a result, we obtain bounds on the D-metric for two implementations of a combinational logic function. The fault model that we consider is the single stuck-at fault model; i.e., all failures act as single stuck-at faults in N1 and N2. As our basis for the calculation of the lower and the upper bounds, we use the following two theorems.
Theorem 2: In a single-output fanout-free circuit C, for any single stuck-at fault f, the set of all test patterns that detect f is a subset of the set of all test patterns of either the stuck-at-0 or the stuck-at-1 fault at the output of C.
Theorem 3: In a single-output fanout-free circuit C, for any fault f, we can find a set S of single stuck-at faults at the inputs of C such that the set of all test patterns that detect f is a superset of the set of all test patterns that detect the faults in S.
The proofs of the above theorems follow directly from the analysis of the equivalence and dominance relationship of single-stuck faults in logic networks [To 73].
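The way such subset and superset relationships translate into bounds can be sketched for the single-output case with uniform inputs (both are assumptions of this sketch). Here `sub1` and `sub2` stand for known subsets of the true test sets of the two faults and `sup1` and `sup2` for known supersets; these names and the function `diversity_bounds` are hypothetical.

```python
from fractions import Fraction


def diversity_bounds(sub1, sup1, sub2, sup2, n_inputs):
    """Return (lower, upper) bounds on d for a single-output function.

    If sub_k is contained in the true test set T_k, and T_k is contained
    in sup_k, then
        |sub1 & sub2| <= |T1 & T2| <= |sup1 & sup2|,
    and d = 1 - |T1 & T2| / 2^n is bounded accordingly.
    """
    total = 2 ** n_inputs
    k_min = len(set(sub1) & set(sub2))
    k_max = len(set(sup1) & set(sup2))
    return 1 - Fraction(k_max, total), 1 - Fraction(k_min, total)


# Input combinations encoded as integers 0..7 for a 3-input function.
lower, upper = diversity_bounds({5}, {5, 7}, {5}, {5, 6, 7}, 3)
print(lower, upper)  # 3/4 7/8
```

The bounds coincide whenever the subsets and supersets are the exact test sets, which is the case when only fault equivalence is used.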
Consider the single-output fanout-free combinational logic network of Fig. 3 [...]
Given any arbitrary combinational logic network (not necessarily fanout-free or single-output), we can decompose the network into maximal single-output fanout-free regions and calculate the test sets of the single stuck-at faults at the inputs and the outputs of the different fanout-free regions (Fig. 3.2). Next, we can approximate the test sets of the stuck-at faults on the remaining leads of the given network using the superset and subset relationships explained earlier and calculate bounds on the di,j values for different fault pairs as explained next.
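The decomposition step described above can be sketched as follows. The netlist representation (a dictionary from each gate output net to its input nets) and the function name are assumptions made for illustration; a region root is taken to be any gate output that is a primary output or does not drive exactly one gate input.

```python
from collections import defaultdict


def fanout_free_regions(gates, primary_outputs):
    """Group gates into maximal single-output fanout-free regions.

    gates: {output_net: [input_nets]} for an acyclic combinational netlist.
    Returns {root_net: set of gate output nets inside that region}.
    """
    outputs = set(primary_outputs)
    fanout = defaultdict(list)
    for out, ins in gates.items():
        for net in ins:
            fanout[net].append(out)

    def is_root(net):
        # Fanout stems, dangling nets and primary outputs close a region.
        return net in outputs or len(fanout[net]) != 1

    regions = defaultdict(set)
    for gate in gates:
        net = gate
        while not is_root(net):
            net = fanout[net][0]  # follow the single fanout branch
        regions[net].add(gate)
    return dict(regions)


# Hypothetical netlist: g2 fans out to g3 and g4, so it closes a region.
example = {
    "g1": ["a", "b"],
    "g2": ["g1", "c"],
    "g3": ["g2", "d"],
    "g4": ["g2", "e"],
}
print(fanout_free_regions(example, ["g3", "g4"]))
```

Stuck-at faults then only need to be simulated exactly at the inputs and the root of each region.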
The above illustration also holds for a multiple-output combinational logic circuit. Given a general multiple-output combinational logic circuit, we first decompose it into maximal single-output fanout-free regions and calculate the test sets and error responses for the stuck-at faults at the inputs and outputs of the fanout-free regions. It follows directly from Theorems 2 and 3 that if a test pattern t detects a fault f inside a single-output fanout-free region and produces an erroneous pattern e at the output of the combinational logic network, then t also detects the stuck-at-0 or the stuck-at-1 fault at the output of the fanout-free region and produces the same erroneous pattern e at the output of the combinational logic network. To estimate the potential benefits of the reduction of fault pairs, we present some simulation results on MCNC benchmark circuits.
We synthesized two implementations of the truth tables with true and complemented outputs using the SIS tool [Sentovich 92]. For each benchmark circuit, we calculated the reduced number of fault pairs obtained by considering stuck-at faults at the inputs and outputs of fanout-free sub-circuits of the two implementations. We calculate the percentage reduction in the number of fault pairs to be considered as:
$$\left(1 - \frac{\text{Reduced number of fault pairs}}{\text{Total number of fault pairs}}\right) \times 100\%$$
The simulation results in Table 3.1 show that in the worst case we obtain around 40% (i.e., 1.6 times) reduction, while in the best case we obtain around 80% (i.e., 5 times) reduction. While the number of fault pairs can be greatly reduced, the accuracy of the bounds can suffer. However, the user can control the extent to which the reduction must be performed depending on the desired accuracy of the bounds. For example, if the user uses only fault equivalence rules for reduction of the number of fault pairs, then the bounds will be perfectly accurate.
Diversity Calculation for Datapath Circuits
In this section we calculate the value of the D-metric for different datapath circuits like adders, priority encoders, etc. Our main focus is on datapath designs based on iterative logic networks. However, similar analysis techniques can be used for other structures (e.g., trees and combinations of iterative logic networks and trees). In this section, we illustrate the calculation technique for a ripple-carry adder. Techniques for calculation of diversity for carry-select and carry-lookahead adders are described in [Mitra 00c].
Consider the design of an n-bit ripple-carry adder (Fig. 4.1a) using the full-adder blocks shown in Fig. 4.1b.
The following theorem tells us that for a duplex system containing two identical copies of the ripple-carry adder, the d1,2 value is 1 for any fault pair (f1, f2) affecting non-adjacent full-adder blocks in the two copies. This means that, for these fault pairs, we do not have to explicitly calculate the value of d1,2. This can significantly reduce the computational complexity of calculating the D-metric.
Theorem 4: Consider a duplex system consisting of two identical copies N1 and N2 of a ripple-carry adder (Fig. 4.1). Consider a fault f1 affecting the full-adder block FAi in N1 and a fault f2 affecting FAj in N2. If j > i+1 or i > j+1 (i.e., f1 and f2 affect two non-adjacent full-adder blocks in the two copies), then d1,2 = 1.
Proof: Consider the case when j > i+1. The case when i > j+1 is symmetric. If the fault f1 in the full-adder block FAi affects Si, a mismatch will be reported when the Si outputs of N1 and N2 are compared. If fault f1 affects only the Ci output of FAi in N1, then the Si+1 output of FAi+1 in N1 will be erroneous. Since |i-j| > 1, it is guaranteed that the Si+1 output of FAi+1 in N2 will be correct; hence, a mismatch will be reported. Hence, d1,2 = 1.
Q.E.D.
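Theorem 4 can be checked exhaustively on a small adder. The sketch below assumes a conventional two-XOR, two-AND, one-OR full adder, which may differ from the block of Fig. 4.1b, and models faults as single stuck-at faults on the nets of one block. All names are hypothetical.

```python
from itertools import product

NETS = ("a", "b", "cin", "x", "g", "p", "s", "cout")
FAULTS = [(net, value) for net in NETS for value in (0, 1)]


def full_adder(a, b, cin, fault=None):
    """Gate-level full adder; fault is (net, stuck_value) or None."""
    def f(net, value):
        return fault[1] if fault and fault[0] == net else value
    a, b, cin = f("a", a), f("b", b), f("cin", cin)
    x = f("x", a ^ b)
    g = f("g", a & b)
    p = f("p", x & cin)
    s = f("s", x ^ cin)
    cout = f("cout", g | p)
    return s, cout


def ripple_adder(a_bits, b_bits, c0, fault_block=None, fault=None):
    """Return (S0, ..., Sn-1, Cn-1) with an optional fault in one block."""
    carry, sums = c0, []
    for k, (a, b) in enumerate(zip(a_bits, b_bits)):
        s, carry = full_adder(a, b, carry, fault if k == fault_block else None)
        sums.append(s)
    return tuple(sums) + (carry,)


def diversity(n, i, f1, j, f2):
    """Exact d for f1 in FAi of N1 and f2 in FAj of N2, uniform inputs."""
    identical_errors = 0
    total = 0
    for bits in product((0, 1), repeat=2 * n + 1):
        a, b, c0 = bits[:n], bits[n:2 * n], bits[-1]
        good = ripple_adder(a, b, c0)
        out1 = ripple_adder(a, b, c0, i, f1)
        out2 = ripple_adder(a, b, c0, j, f2)
        identical_errors += (out1 == out2 != good)
        total += 1
    return 1 - identical_errors / total


def check_theorem4(n):
    """True if d = 1 for every fault pair in non-adjacent blocks."""
    return all(
        diversity(n, i, f1, j, f2) == 1.0
        for i in range(n) for j in range(n) if abs(i - j) > 1
        for f1 in FAULTS for f2 in FAULTS
    )


print(check_theorem4(3))                          # True
print(diversity(3, 1, ("s", 0), 1, ("s", 0)))     # 0.5
```

The second call shows the contrast: the same fault in the same block of both copies yields an identical error whenever the fault is activated.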
For the remaining fault pairs affecting adjacent full-adder blocks (i.e., j = i or j = i+1), we can form a circuit by cascading two full-adder blocks and calculate the exact value of di,j for every fault pair. The circuit containing a cascade of two full-adder blocks has only 5 inputs and 3 outputs. Hence, the set of input combinations that produce the same erroneous output in the two copies can be calculated easily. Our results show that it takes around 7 seconds (real-time) to calculate these input combinations for all fault pairs on a SUN Ultra Sparc 2 workstation.
Once these input combinations are obtained, the following procedure must be used to calculate the di,j value.
Let us suppose that we are considering fault pairs in the cascade of the blocks FAi+1 and FAi+2 in an n-bit adder. For each combination of Ai+2, Bi+2, Ai+1, Bi+1 and Ci for which the fault pair produces identical erroneous outputs, we calculate the number of input combinations of the full n-bit adder that produce the particular combination of Ai+2, Bi+2, Ai+1, Bi+1 and Ci.
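The counting step can be sketched for the particular assignment Ai+2 = Bi+2 = Ai+1 = Bi+1 = Ci = 1. The sketch assumes bits indexed from 0, an external carry-in to FA0, and a simple generate/propagate recurrence; these conventions and all names are assumptions of this sketch and need not match the recurrence used in the appendix.

```python
from itertools import product


def carry_one_count(k):
    """Number of assignments to (A0..Ak-1, B0..Bk-1, carry-in) for which
    the carry out of the k lowest full-adder blocks is 1."""
    count = 1  # k = 0: the carry is the carry-in itself, 1 for one value
    for bits in range(1, k + 1):
        lower_total = 2 ** (2 * (bits - 1) + 1)
        # Generate (A = B = 1): any lower assignment works.
        # Propagate (A != B, two ways): the lower carry must already be 1.
        count = lower_total + 2 * count
    return count


def matching_inputs(n, i):
    """Inputs of an n-bit adder with A, B = 1 at positions i+1 and i+2
    and Ci = 1; the bits above position i+2 are unconstrained."""
    return carry_one_count(i + 1) * 4 ** (n - i - 3)


def brute_force(n, i):
    """Same count by enumerating all 2^(2n+1) input combinations."""
    hits = 0
    for bits in product((0, 1), repeat=2 * n + 1):
        a, b, carry = bits[:n], bits[n:2 * n], bits[-1]
        for k in range(i + 1):  # ripple through FA0..FAi
            carry = (a[k] & b[k]) | ((a[k] ^ b[k]) & carry)
        if carry == 1 and a[i + 1] == b[i + 1] == a[i + 2] == b[i + 2] == 1:
            hits += 1
    return hits


print(matching_inputs(4, 0), brute_force(4, 0))  # 16 16
print(matching_inputs(5, 1), brute_force(5, 1))  # 64 64
```

Dividing such a count by the total number of input combinations gives the probability contribution of that assignment under a uniform input distribution.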
For example, suppose that for a particular fault pair, the input combination Ai+2 = Bi+2 = Ai+1 = Bi+1 = Ci = 1 produces identical erroneous patterns at the outputs of the cascaded block. The number of input combinations for the n-bit adder that satisfies the above assignment of values is 2 2 ( 2 -1). Since we only have to satisfy [...]
[Mitra 00c] shows the use of these techniques for carry-lookahead and carry-select adders. These techniques can be generalized for any iterative logic network. All these results demonstrate that for circuits exhibiting regular structures we can exploit the structural regularity to compute the value of the D-metric very quickly.
Diversity Estimation for General Combinational Logic Circuits
Signal Probability Calculation Model
In Sec. 4, we utilized the regularity in the implementation of datapath logic circuits to devise fast techniques to estimate the value of the D-metric. For general logic circuits (often called random logic circuits) we may not be able to exploit regularity to estimate the value of the D-metric because there may not be any regularity present in the structure of general combinational logic circuits. For a given fault pair (fi, fj), the problem of calculating the di,j value can be modeled as the signal probability calculation problem [Parker 75]. The modeling is shown in Fig. 5.1. As shown in Fig. 5.1, we consider three blocks: N1 in the presence of the fault f1, N2 in the presence of the fault f2, and the fault-free N1 block. In response to any input combination, if the incorrect outputs produced by the two faulty blocks match, then the same error pattern has been produced by the two faulty blocks. Otherwise, the two blocks either produced correct values or different error patterns at their outputs. The probability that the OUT signal is 1 is the same as the d1,2 value for the fault pair (f1, f2). It is known that signal probability calculation is a very hard problem: it is #P-complete [Motwani 97].
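The three-block model can be sketched by exhaustive enumeration, which is only practical for small input counts. The sketch assumes uniform inputs and represents each block as a Python callable on a tuple of input bits; the function name and the example circuit are hypothetical.

```python
from fractions import Fraction
from itertools import product


def exact_diversity(good, faulty1, faulty2, n_inputs):
    """Probability that OUT = 1, i.e. that the two faulty blocks do not
    produce the same erroneous output, over all 2^n input combinations."""
    ones = 0
    for x in product((0, 1), repeat=n_inputs):
        ref, o1, o2 = good(x), faulty1(x), faulty2(x)
        same_error = (o1 == o2) and (o1 != ref)
        ones += not same_error
    return Fraction(ones, 2 ** n_inputs)


# Hypothetical example: F = (a AND b) OR c.
def good(x):
    return (x[0] & x[1]) | x[2]


def a_stuck_at_1(x):  # fault in the first copy
    return (1 & x[1]) | x[2]


def c_stuck_at_1(x):  # fault in the second copy
    return 1


print(exact_diversity(good, a_stuck_at_1, c_stuck_at_1, 3))  # 7/8
```

In this example only the input abc = 010 makes both faulty copies output the same wrong value.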
The Parker-McCluskey method [Parker 75] can be used to calculate the exact di,j value for every fault pair (fi, fj) after modeling the problem as a signal probability calculation problem as shown before. However, this method has an exponential complexity in the worst case. Methods based on Binary Decision Diagrams (BDDs) can also be used for this purpose. The cutting algorithm [Savir 90] can also be used to obtain an approximate value of the D-metric in polynomial time. In Sec. 5.2, we describe an adaptive Monte-Carlo simulation technique to estimate the di,j values of fault pairs because it is much simpler compared to other techniques.
Adaptive Monte-Carlo Simulation
The classical Monte-Carlo technique can be used to estimate the di,j values [Motwani 97] after modeling the problem as a signal probability calculation problem. This involves choosing N independent input combinations uniformly at random and estimating the probability that, in response to a random choice, the two implementations will produce either correct values or non-identical error patterns at the outputs in the presence of the faults. The value of N will be discussed later in this section. We can define random variables Y1, ..., YN as follows: Yk = 1 if the two implementations produce either the correct outputs or non-identical error patterns in the presence of the faults for input combination k, and Yk = 0 if they produce identical error patterns at their outputs for input combination k. The estimator Z is defined to be:
$$Z = \frac{1}{N}\sum_{k=1}^{N} Y_k$$
The expected value of Z is given by
$$E(Z) = \frac{|T|}{2^n} = d_{i,j}$$
Here T is the set of all input combinations in response to which the implementations produce either correct values or non-identical error patterns at their outputs. The main challenge is to determine the value of N that we need to guarantee that the error in our approximation is bounded. For that purpose, we calculate the value of N such that the following relationship holds:
[Algorithm listing; only its final step is recoverable: Else, the di,j value of the fault pair = average of the results. End]
Using the Chernoff bound, it has been proved in [Motwani 97] that the above relationship holds if the following bound on N is satisfied:
When the value of di,j is very small, there is a chance of having to use an exponential number of simulations to estimate the value Z within the error bounds. This is the downside of the above technique. However, we can use the following approximation. As noted at the end of Sec. 2, only the high di,j values significantly affect the data integrity of diverse duplex systems. [...] Table 5.2 show that the theoretical arguments behind adaptive Monte-Carlo simulation are applicable for real-life circuits and demonstrate the effectiveness of the adaptive Monte-Carlo simulation technique.
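One way to read the adaptive scheme is sketched below. It rests on assumptions: the sample size uses the standard estimator bound N >= 4 ln(2/delta) / (eps^2 p) evaluated at the threshold p = 0.5, M independent experiments are run, and a fault pair is declared low as soon as one experiment estimates a value below 0.5. The parameter names `eps`, `delta` and `m`, the return convention, and the example circuit are all hypothetical.

```python
import math
import random


def samples_needed(eps, delta, p_min=0.5):
    """Assumed bound: N >= 4 ln(2/delta) / (eps^2 * p_min)."""
    return math.ceil(4.0 * math.log(2.0 / delta) / (eps * eps * p_min))


def monte_carlo_once(good, faulty1, faulty2, n_inputs, n_samples, rng):
    """One experiment: Z = (Y1 + ... + YN) / N over random inputs."""
    hits = 0
    for _ in range(n_samples):
        x = tuple(rng.getrandbits(1) for _ in range(n_inputs))
        ref, o1, o2 = good(x), faulty1(x), faulty2(x)
        hits += not (o1 == o2 and o1 != ref)  # Yk
    return hits / n_samples


def adaptive_estimate(good, faulty1, faulty2, n_inputs,
                      eps=0.1, delta=0.05, m=3, seed=0):
    """Return the averaged estimate, or None if the pair is declared low."""
    rng = random.Random(seed)
    n_samples = samples_needed(eps, delta)
    results = []
    for _ in range(m):
        z = monte_carlo_once(good, faulty1, faulty2, n_inputs, n_samples, rng)
        if z < 0.5:
            return None  # low di,j: little effect on data integrity
        results.append(z)
    return sum(results) / m


# Hypothetical example: F = (a AND b) OR c with two different faults.
def good(x):
    return (x[0] & x[1]) | x[2]


def a_stuck_at_1(x):
    return x[1] | x[2]


def out_stuck_at_1(x):
    return 1


print(adaptive_estimate(good, a_stuck_at_1, out_stuck_at_1, 3))
```

Because N is fixed from the 0.5 threshold rather than from the unknown di,j, the number of simulations does not grow when the true value is exponentially small.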
Conclusions
This paper demonstrates the feasibility of calculating the value of the design diversity metric for arbitrary combinational logic circuits.
Although the general problem is NP-complete, efficient algorithms can be devised for solving the problem. For datapath logic circuits and circuits with iterative logic networks, the regularity in the circuit structures can be exploited to compute the value of the diversity metric very fast. For general combinational logic circuits, reduction techniques using fault equivalence and fault dominance relationships can be applied to significantly reduce the number of fault pairs to be considered during diversity calculation. Next, the adaptive Monte-Carlo simulation technique can be used to obtain accurate estimates of the diversity metric for the reduced set of fault pairs using a number of simulations that is polynomial (instead of exponential) in the number of inputs of the combinational logic function. Moreover, the number of simulations to be used can be tuned depending on the error that can be tolerated during estimation. This paper describes techniques for estimating diversity in combinational logic circuits; a related paper [Mitra 00b] describes design diversity estimation techniques for sequential logic circuits.
[...] hence, Zj = 2 from the above recurrence relation.
Appendix B
There can be two kinds of errors in the adaptive Monte-Carlo estimation technique. The actual di,j value can be high but we can erroneously declare that the estimated di,j value is very low. Although this situation will produce pessimistic values, it is not desired. As noted in Sec. 5.2, δ is the probability that the value of di,j is out of the error bound. Hence, the probability that the di,j value is erroneously declared to be less than 0.5 is [...]. Thus, if we choose M, δ and ε appropriately, then we can obtain a close approximation of the di,j values.
The other source of error is due to the fact that the actual di,j value may be less than 0.5 but for all the Monte-Carlo simulation experiments the value is estimated to be greater than or equal to 0.5. The worst scenario is when the actual di,j value is exponentially small but the estimated value is close to 1. We show next that the probability of such an event is extremely small and is almost negligible. The following fact has been proved in [Motwani 97] (using the Chernoff bound):
If p [...], Pr[Z > 0.5] < [...], where p = |T| / 2^n.
The probability that the value of Z will be greater than 0.5 in all the M experiments is less than the M-th power of this bound. When the value of p is exponentially small, x is of the order of 2^n, and this probability becomes extremely small. Thus, our Monte-Carlo simulation is adaptive and suits the current application: it provides very good estimates for high di,j values and makes sure that we do not erroneously estimate very high (optimistic) values when the actual di,j value is extremely small (less than 0.5).
