Design diversity has long been used to protect redundant systems against common-mode failures. The conventional notion of diversity relies on "independent" generation of "different" implementations. This
Introduction
The use of redundancy techniques for designing fault-tolerant systems has been studied extensively [Siewiorek 92][Pradhan 96]. Triple Modular Redundancy (TMR) [Von Neumann 56] is an example of a redundancy scheme in which a module is replicated three times and the outputs are processed by a voter. Figure 1.1 shows a classical TMR system. Such a system produces correct results as long as at least two modules produce correct results.
In a redundant system, common-mode failures (CMFs) result from failures that affect more than one module at the same time, generally due to a common cause. These include operational failures that may be due to external causes (such as EMI, power-supply disturbances and radiation) or internal causes.
In addition to these common-mode failures, with the increasing complexity of the various designs, design mistakes are becoming very significant. It has been pointed out in [Avizienis 84] that design faults are reproduced when redundant copies are made. Thus, simple replication fails to enhance the system reliability against design faults.
Figure 1.1. Classical Triple Modular Redundancy
Design diversity has been proposed in the past to protect redundant systems against common-mode failures. In [Avizienis 84], design diversity was defined as the independent generation of two or more software or hardware elements (e.g., program modules, VLSI circuit masks, etc.) to satisfy a given requirement. Design diversity was also proposed in [Lala 94] as an avoidance technique against common-mode failures. Design diversity has been applied to both software and hardware systems.
N-version programming [Avizienis 77][Lyu 91] is an example of diversity in software systems. Hardware design diversity is used in the Primary Flight Computer (PFC) system of the Boeing 777 [Riter 95] and many other commercial systems [Briere 93]. For the Boeing 777, three different processors (from AMD, Intel and Motorola) are used in the PFC. Tohma proposed to use the implementations of logic functions in true and complemented forms during duplication [Tohma 71]. The use of a particular circuit and its dual was proposed in [Tamir 84] to achieve diversity in order to handle common-mode failures. The basic idea is that, with different implementations, common-mode failures will probably cause different errors.
Design diversity can prove to be useful in the context of dependable Adaptive Computing Systems (ACS). The field programmability of Field Programmable Gate Arrays (FPGAs) can be utilized to achieve diversity among the different modules. In an ACS environment, we can create diversity by synthesizing and downloading different implementations into FPGAs at any time. Thus, there is no need to manufacture multiple diverse ASICs.
In order to quantify the effect of diversity on the reliability of a redundant system, a metric is needed to quantify diversity among designs with the same specification [Tamir 84].
In addition to common-mode failures, with the high density of logic gates in a VLSI chip, multiple failures may become more frequent. For example, current research shows that multiple-event upsets (possibly due to a single radiation source) are common in VLSI chips [Liu 97][Reed 97]. The classical TMR reliability model is pessimistic because, in the presence of multiple module failures, it does not consider compensating effects of different faults [Siewiorek 75]. For the TMR system in Fig. 1.1, suppose that the output of Module 1 is stuck at 0 and the output of Module 2 is stuck at 1. Still, the system always produces correct outputs. This is an example of a compensating module fault. Earlier research in this area [Siewiorek 75][Stroud 94] has focused on predicting the effects of compensating faults on the reliability of a given TMR system. Hence, it is interesting to find out whether design diversity also helps in achieving better compensating effects of different faults, compared to simple replication.
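The compensating-fault example above can be checked with a few lines of Python. This is only an illustrative sketch for a single-bit output; the function names `majority` and `tmr_output` are ours and do not come from the paper.

```python
def majority(a, b, c):
    """Bitwise majority voter for single-bit module outputs."""
    return (a & b) | (b & c) | (a & c)


def tmr_output(correct_bit):
    """TMR output when Module 1 is stuck at 0 and Module 2 is stuck at 1."""
    module1 = 0            # output stuck at 0
    module2 = 1            # output stuck at 1
    module3 = correct_bit  # fault-free module
    return majority(module1, module2, module3)


# For either value of the correct output, one of the two stuck outputs
# happens to agree with the fault-free module, so the voter is still right.
results = {bit: tmr_output(bit) for bit in (0, 1)}
all_correct = all(results[bit] == bit for bit in (0, 1))
```

The two stuck-at outputs always disagree with each other, so the fault-free module casts the deciding vote in every cycle.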
In this paper, we address issues related to design diversity and examine their effects on the reliability of a redundant system. Some preliminary ideas related to this work were reported in [Saxena 98].
Our main contributions are: (1) develop a metric to quantify diversity among several designs; and (2) use this metric to perform reliability analysis of redundant systems. In Sec. 2, we introduce a design diversity metric and perform reliability analysis of redundant systems using this metric. Section 3 presents some preliminaries related to the stuck-at fault model and illustrates our analysis with the help of an example. We present simulation results in Sec. 4. Section 5 examines the effect of design diversity on the self-testing properties of a duplex system. Finally, we conclude in Sec. 6.
Design Diversity Metric And Reliability
D : A Design Diversity Metric
In this section, we introduce a metric to quantify diversity among several designs. We define the metric for a system with two designs implementing the same function. The metric has application in estimating the reliability of NMR systems with masking redundancy. Before defining the diversity metric, we first define the notion of diversity between two implementations with respect to a fault pair.
For two designs implementing the same function, the diversity with respect to a fault pair (fi, fj), di,j, is the probability that the designs do not produce identical error patterns, in response to a given input sequence, when fi and fj affect the first and the second implementations, respectively. For a given fault model, the design diversity metric, D, between two designs is the expected value of the diversity with respect to different fault pairs. Mathematically, we have

$$D = \sum_{(f_i, f_j)} P(f_i, f_j)\, d_{i,j}$$

where P(fi, fj) is the probability of the fault pair (fi, fj).
D is the probability that, in response to a given input sequence, the two implementations either produce error-free outputs or produce different error patterns on their outputs.

Example: Consider any combinational logic function with n inputs and a single output. The fault model considered is such that a combinational circuit remains combinational in the presence of the fault. Now, let us consider two implementations (N1 and N2) of the given combinational logic function, with fault fi affecting N1 and fault fj affecting N2. Let ki,j be the number of input combinations in response to which both faulty implementations produce the same erroneous output. If all input combinations are equally likely, then we can write di,j = 1 - ki,j/2^n.
The di,j's generate a diversity profile for the two implementations with respect to a fault model. Consider a duplex system consisting of the two implementations under consideration. In response to any input combination, the implementations can produce one of the following cases at their outputs: (1) both implementations produce correct outputs; (2) the implementations produce different outputs; (3) both implementations produce the same incorrect output. For the first case, the duplex system will produce correct outputs. For the second case, the system will report a mismatch so that appropriate recovery actions can be taken. However, for the third case, the system will produce an incorrect output without reporting a mismatch; thus, for the third case, the integrity of the system is lost due to the presence of faults in the two implementations. In the literature on fault-tolerance [Siewiorek 92][Pradhan 96], this system integrity has been referred to as the fault-secure property.
The quantity di,j is the probability that a duplex system, having two implementations of the logic function under consideration, remains fault-secure in the presence of the fault pair (fi, fj). For circuits with multiple outputs and a fault pair (fi, fj) affecting the two implementations, we define ki,j as the number of input patterns in response to each of which both implementations produce the same erroneous output pattern. Now, we can use the same formulas as in the single-output case. The above illustration of the design diversity metric can also be extended to sequential circuits and software programs. For small or medium-sized systems, the exact value of the diversity metric can be calculated manually or using computer programs. For large systems, the value can be estimated by using simulation techniques.
For two identical implementations of the same function, a common-mode failure (e.g., a design mistake) can be modeled as the same fault fi affecting the two implementations. Let m be the number of input sequences for which these two implementations produce identical error patterns at the outputs. Now, suppose that the second implementation is different from the first. For any fault fj affecting the second implementation (and fi affecting the first implementation), we cannot have more than m input sequences that produce identical error patterns at the outputs of the two implementations. Hence, di,i ≤ di,j. This property is useful for enhancing the reliability of a redundant system against common-mode failures by using diversity.
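The inequality between the replicated and the diverse case can be illustrated with a small Python sketch. The error maps below are hypothetical data chosen for illustration (they are not taken from any circuit in the paper), and the helper name `diversity` is ours; equally likely input patterns are assumed.

```python
def diversity(errors_a, errors_b, num_inputs):
    """d = 1 - k / num_inputs, where k counts the input patterns on which
    both faulty implementations produce the same erroneous output.

    Each argument maps an input pattern to the erroneous output produced
    for that pattern; patterns with correct outputs are simply absent.
    """
    k = sum(1 for pattern, bad in errors_a.items()
            if errors_b.get(pattern) == bad)
    return 1 - k / num_inputs


# Hypothetical error maps for a 3-input function (8 input patterns).
errors_fi = {0b110: 0, 0b011: 0}  # fault fi in the first implementation
errors_fj = {0b110: 0, 0b101: 1}  # fault fj in a different second implementation

d_ii = diversity(errors_fi, errors_fi, 8)  # replication hit by the same fault
d_ij = diversity(errors_fi, errors_fj, 8)  # diverse pair
```

With identical copies every erroneous pattern of fi is shared, so k equals the number of erroneous patterns; a different second implementation can share at most that many.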
Reliability Analysis
In this section, we calculate the reliability of duplex systems using the diversity metric described in Sec. 2.1.
We define the reliability of a duplex system as the probability that the system is fault-secure. The reliability calculation is independent of whether the redundant components are exact replicas or different implementations. We assume a discrete time model for the system. In such a model, the time axis is broken up into discrete time cycles and we apply inputs and observe outputs only at cycle boundaries.
As shown in Fig. 2.1, input combination (vector) vi is applied at the beginning of the i-th cycle. Also, Module 1 becomes faulty (f1) during cycle i and Module 2 becomes faulty (f2) during cycle j. Let p be the probability that a particular module is affected by a fault at cycle i. For simplicity, we assume that this probability p is the same for all the modules in the system at all times. The probability p can be looked upon as the failure rate per cycle.

First, we consider the case where f1 and f2 appear simultaneously at cycle i. It may be argued that if a random fault appears in a particular module, then chances are high that the second fault will also appear in that same module. However, we do not assume any such correlation in this paper. At time 0, everything is fault-free. So, before cycle i the system will produce correct results. However, starting from time i, in each cycle, the system will produce correct results with probability d1,2. The probability s1(f1, f2, t) that the system is fault-secure up to time t, even in the presence of the two faults f1 and f2, is given by:
The derivation of the above expression is shown in [Mitra 99]. Now we consider the case where f1 and f2 appear at different cycles.
As discussed earlier, in Fig. 2.1, Module 1 becomes faulty during cycle i and Module 2 becomes faulty during cycle j. It is clear that up to time j, a duplex system will be fault-secure. Hence, starting from time j, the system will be fault-secure with probability d1,2. Thus, the probability s2(f1, f2, t) that the system is fault-secure up to time t, in the presence of the two faults f1 and f2, is given by the following equation. The derivation of the second case is also shown in [Mitra 99]. This case is more complicated than the first case and is useful when we consider random independent faults in multiple modules. Now we have:

$$s(f_1, f_2, t) = s_1(f_1, f_2, t) + s_2(f_1, f_2, t)$$

Here s(f1, f2, t) is the probability that a duplex system is fault-secure up to time t, when Module 1 is affected by fault f1 and Module 2 by fault f2.
We can now characterize a duplex system using our diversity metric. In the following calculations, we assume that once a module becomes faulty, no other fault appears in that module. This assumption is simplistic and allows us to obtain closed-form reliability expressions. We calculate the probability that, up to time t, a duplex system is fault-secure. It is given by the following expression:
The above expression follows from the fact that, in a duplex system, when none of the modules fails the system produces correct outputs. When only one of the modules fails (due to single or multiple faults), the system is fault-secure. When both modules are faulty, we have to consider the d1,2 value for the fault pair (f1, f2) in the two modules. P(f1, f2) is the probability that faults f1 and f2 appear in Modules 1 and 2, respectively.

In Fig. 2.3, for a given pair of faults (f1, f2), we show the plots of the above expression for different values of d1,2. The mission time is shown along the X-axis; the MTTF (Mean Time To Failure in cycles) of a simplex system corresponds to 1 time unit. The probability that a fault appears in one cycle is 10^-12. Along the Y-axis, we show the probability that the duplex system is fault-secure. The classical analysis of duplex systems is pessimistic since it assumes that the system ceases to be fault-secure when two modules are faulty.
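The three situations described above (no faulty module, one faulty module, both faulty) can be turned into a small numerical sketch. The recurrence below is our own reading of the discrete-time model and is not the closed-form expression of [Mitra 99]. It assumes that each fault-free module fails independently with probability p per cycle, that a faulty module stays faulty, that a single fault pair with diversity d is involved, and that d applies from the cycle in which the second fault arrives. The function name is hypothetical.

```python
def duplex_fault_secure_probability(p, d, t):
    """Probability that a duplex system is still fault-secure after t cycles,
    under the assumptions stated in the lead-in (not the paper's formula)."""
    # Probability mass of the paths that are still fault-secure,
    # split by how many modules are faulty.
    none_faulty, one_faulty, both_faulty = 1.0, 0.0, 0.0
    for _ in range(t):
        new_none = none_faulty * (1 - p) ** 2
        new_one = none_faulty * 2 * p * (1 - p) + one_faulty * (1 - p)
        new_both = none_faulty * p ** 2 + one_faulty * p + both_faulty
        # With both modules faulty, the system stays fault-secure
        # in this cycle only with probability d.
        none_faulty, one_faulty, both_faulty = new_none, new_one, new_both * d
    return none_faulty + one_faulty + both_faulty


# d = 0 corresponds to the pessimistic classical analysis: the system is
# assumed to lose the fault-secure property as soon as both modules are faulty.
classical = duplex_fault_secure_probability(p=0.01, d=0.0, t=100)
diverse = duplex_fault_secure_probability(p=0.01, d=0.9, t=100)
```

Sweeping d between 0 and 1 for a fixed p gives a family of curves over the mission time t, which is one way to explore the qualitative behavior under these assumptions.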
The above expressions can be modified for common-mode failures (CMFs). The probability that a duplex system is fault-secure against common-mode failures up to time t is given by the following expression:

$$(1 - p)^t + \sum_{(f_1, f_2)} P(f_1, f_2)\, z(f_1, f_2, t)$$

Here, p is the probability that a CMF affects the two modules. In the above expression, z(f1, f2, t) is given by the following formula:
The derivation is shown in [Mitra 99]. The above expression is maximized when d1,2 is of the order of (1-p). This suggests that, for a common-mode failure that can be modeled as fault pair (f1, f2), we can obtain appreciable reliability improvement over classical systems when the value of d1,2 is of the order of (1-p). The following observations can be derived from this relationship.
(1) When the failure rate is high, even a small diversity can help enhance the system reliability over traditional replication. (2) If the failure rate is low, then d1,2 must be extremely high for appreciable reliability improvement over classical systems.
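To experiment with the interplay between failure rate and diversity, the following Python sketch evaluates a fault-secure probability for a single CMF fault pair. The term used for z is our own derivation from the discrete-time model and is an assumption, not the formula of [Mitra 99]. It takes the CMF to arrive at cycle i with probability (1-p)^(i-1) p, after which the system stays fault-secure with probability d1,2 in each remaining cycle. The function name and the single-fault-pair simplification (P(f1, f2) = 1) are also ours.

```python
def cmf_fault_secure_probability(p, d, t):
    """(1 - p)**t + sum over i of (1 - p)**(i - 1) * p * d**(t - i + 1).

    p: per-cycle probability that a CMF hits both modules (assumed constant)
    d: diversity d1,2 of the single fault pair modeling the CMF
    t: mission time in cycles
    """
    no_cmf = (1 - p) ** t
    with_cmf = sum((1 - p) ** (i - 1) * p * d ** (t - i + 1)
                   for i in range(1, t + 1))
    return no_cmf + with_cmf


# Example sweep: a relatively high failure rate with modest diversity,
# compared against d = 0 (every post-CMF cycle loses fault-security).
baseline = cmf_fault_secure_probability(p=0.05, d=0.0, t=50)
some_diversity = cmf_fault_secure_probability(p=0.05, d=0.5, t=50)
```

Direct summation is adequate for small t; for the very small failure rates discussed in the text, a closed form or higher-precision arithmetic would be needed.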
As a limiting case, consider the situation when the CMF failure rate is 0. In that case, diversity will not buy us any extra reliability against CMFs. In Fig. 2.4, for a given pair of faults (f1, f2), we show the plots of the above fault-secure probability expression for different values of d1,2. The failure rate per cycle is [...]. It is clear that we get appreciable improvement in reliability (over classical systems) when the value of d1,2 is very high (1-10^-12 or more). When the value of d1,2 is less than 1-10^-12, we do not see high reliability improvement.

Figure 2.5 plots the ratio of the following two quantities: (1) the probability that a duplex system is not fault-secure at time i, for a fault pair (f1, f2) with d1,2 = 1 -1iP.i; and (2) the probability that a duplex system is not fault-secure at time i, for a fault pair (f1, f2) with d1,2 = 1-10^-12. The failure rate per cycle is 10^-12. We call this ratio the gain. On the X-axis, we plot the mission time.

As Fig. 2.5 shows, the gain diminishes with longer mission times. However, it may be noted that diversity is helpful for mission times when a TMR system is most effective. This analysis allows us to derive relationships between the reliability of a redundant system, the diversity incorporated to protect the system against common-mode failures and the mission time. The graph in Fig. 2.5 helps us understand the payoffs of diversity as a function of mission time.

Now we consider design mistakes, which are special cases of common-mode failures. For these cases, the fault is always present. Simple analysis reveals that the probability that a duplex system is fault-secure up to time t, in the presence of design mistakes, is:

Thus, for design mistakes, for a given fault pair (f1, f2), the higher the value of d1,2, the higher the system reliability.
This implies that, for design mistakes, diversity among the two implementations in a duplex system helps to increase the probability that the system is fault-secure.
For replicated systems, we can define a common-mode failure as one that produces identical faults (and hence, identical errors) in the two systems. For diverse copies, there is no such simple way to model common-mode failures. Hence, for our simulations, we choose random pairs of faults in an unbiased way. Note that, from our observations in Sec. 2.1, it follows that, in the presence of a common-mode failure, the reliability of a diverse system is never worse than that of a non-diverse system.
While diversity in hardware designs is the main focus of this paper, the above ideas can be extended to analyze diversity in software modules. For estimating the diversity metric for software modules, we need to have a fault model for the software under consideration. Considering the range of values the input variables to the software module can possibly take, it may be difficult to compute the exact value of the metric. However, the value of the metric can be estimated using simulation techniques. Note that our observations about the relationships between diversity, mission time and failure rate still hold for software systems. One key feature of our analysis technique is that it is powerful but, at the same time, simpler than the models in [Eckhardt 85], [Tomek 93] and [Lyu 95].
We validate these observations using simulation data in Sec. 4. For simulation purposes we used the stuck-at fault model. In the next section, we introduce the preliminaries related to the stuck-at fault model and illustrate the calculation of our diversity metric using an example. For the rest of this paper, we assume that all failures show up as stuck-at faults in the circuit. We also assume that the failures are permanent; i.e., if a stuck-at fault shows up at some time instant t, then the fault remains at all time instants greater than t. For circuits made from SRAM-based FPGAs, unless we re-initialize the SRAMs (reload a given configuration), a transient fault in the configuration SRAM persists. Thus, the assumption of permanent fault behavior is reasonable.
Example
For example, consider the network shown in Fig. 3.1.
The function implemented by the network is wx + y.
Consider a stuck-at-0 (s-a-0) fault on the line y, denoted by y/0. The function implemented by the network, in the presence of the fault, is wx. Thus, the pattern w = 1, x = 0 and y = 1, when applied to the inputs of the logic circuit, causes the faulty network to produce a 0 and the fault-free network to produce a 1. Therefore, the fault y/0 is detected by the pattern w = 1, x = 0, y = 1.
Figure 3.1. An example logic circuit
A fault is said to be functionally equivalent to another fault if and only if the output function realized by the network with only the first fault present is equal to the function realized when only the second fault is present. For example, in the network of Fig. 3.1, in the presence of the fault x/0, the function implemented is y. In the presence of the fault z/0, the function implemented by the network is also y. Hence, the faults x/0 and z/0 are functionally equivalent. The set of functionally equivalent faults forms an equivalence class. A fault f1 dominates fault f2 if and only if all input combinations that detect f2 also detect f1. In our example, the fault p/0 dominates fault z/0.
Techniques for obtaining equivalence and dominance relationships among different fault pairs have been described in [McCluskey 71] and [To 73]. Now, we illustrate the calculation of our design diversity metric with respect to single stuck-at faults in the circuit of Fig. 3.1. There are 10 single stuck-at faults associated with this network. The faults are: w/0, w/1, x/0, x/1, y/0, y/1, z/0, z/1, p/0 and p/1. The corresponding fault equivalence classes are: F1 = {w/0, x/0, z/0}, F2 = {y/1, z/1, p/1}, F3 = {w/1}, F4 = {x/1}, F5 = {y/0} and F6 = {p/0}. The set of vectors that detect the faults in F1 is V1 = {w = 1, x = 1, y = 0}. We write V1 = {110}. Similarly, V2 = {000, 010, 100}, V3 = {010}, V4 = {100}, V5 = {001, 011, 101} and V6 = {001, 011, 101, 111, 110}. Here, the number of inputs (n) is 3. Consider the fault pair (f1, f2) = (w/0, p/0). The set of vectors that detect w/0 is V1. The set of vectors that detect p/0 is V6. Now, V1 ∩ V6 = {110}. Thus, the value of d1,2 is 7/8. In this way all the di,j's and the D metric can be calculated.
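The calculation for this network can be reproduced by exhaustive fault simulation. The Python sketch below assumes, based on the fault behavior described in the text, that z is the output of the AND gate (z = wx) and p is the primary output (p = z + y); the figure itself is not reproduced here, so this structure is an inference. The value of D at the end additionally assumes that all fault pairs are equally likely, which the paper does not state.

```python
from itertools import product

LINES = ("w", "x", "y", "z", "p")


def network(w, x, y, fault=None):
    """Evaluate p = w*x + y with an optional single stuck-at fault (line, value)."""
    def stuck(name, value):
        return fault[1] if fault and fault[0] == name else value

    w, x, y = stuck("w", w), stuck("x", x), stuck("y", y)
    z = stuck("z", w & x)     # AND gate output
    return stuck("p", z | y)  # OR gate output (primary output)


VECTORS = list(product((0, 1), repeat=3))  # (w, x, y) input combinations
FAULTS = [(line, value) for line in LINES for value in (0, 1)]


def detecting_vectors(fault):
    """Input vectors on which the faulty network differs from the fault-free one."""
    return {vec for vec in VECTORS if network(*vec, fault=fault) != network(*vec)}


def d(fault_1, fault_2):
    """d = 1 - k / 2^n for two copies of this single-output network."""
    k = len(detecting_vectors(fault_1) & detecting_vectors(fault_2))
    return 1 - k / len(VECTORS)


d_w0_p0 = d(("w", 0), ("p", 0))

# D under the extra assumption of equally likely fault pairs.
D = sum(d(f1, f2) for f1 in FAULTS for f2 in FAULTS) / len(FAULTS) ** 2
```

Because the network has a single output, two faulty copies produce the same erroneous output exactly on the vectors that detect both faults, which is why the intersection of detection sets is used for k.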
A Simulation-Based Approach

As we noted earlier, it is difficult to model the entire system. Even with the stuck-at fault model, it is difficult to derive the exact reliability equation for the following reasons:

1. For a given pair of faults (f1, f2), the calculation of d1,2 is an NP-complete problem [Gary 79]. The problem is related to the NP-complete test generation problem.

2. If multiple stuck-at faults appear in the modules at different cycles, then the reliability expressions will become complicated. In fact, it may not be possible to obtain a closed form.

Hence, we developed a simulation environment to examine the reliability of a TMR system in the presence of multiple faulty modules. In [Stroud 94], Stroud also used a simulation technique to obtain the survivability distribution used to calculate the reliability of the TMR system. However, our simulation approach differs from that of Stroud because our goal is to examine the effect of diversity on the reliability of TMR systems. For generating different designs, we minimized the truth tables corresponding to some MCNC benchmark circuits (clip, inc, Z5xp1, apex4 and rd84) using espresso. Then, we synthesized logic circuits after applying multi-level optimizations using the rugged script available in sis [Sentovich 92]. We subsequently mapped the multi-level logic circuits to the LSI Logic G10-p technology library [LSI 96]. Next, we complemented the outputs in the truth tables of the benchmark circuits to generate new truth tables. We used the same synthesis procedure for these new truth tables. Finally, we added inverters at the outputs of the new designs obtained. Table 4.1 summarizes the characteristics of the different simulated designs.

In the fourth column of Table 4.1, we report the number of candidate single stuck faults for the implementations of the circuits, obtained by synthesizing the given specification. The fifth column shows the number of candidate single stuck faults for the implementations of the circuits, obtained by synthesizing the given specifications with complemented outputs.

Simulation 1

We considered TMR systems (with identical and different implementations) for the benchmark circuits. In order to evaluate the effect of diversity in the presence of multiple independent failures, we conducted 100,000 experiments, each consisting of the following steps. In each experiment, we start from time instant 0, when all three modules are fault-free. We have a binary counter for generating the input patterns and we randomly pick a seed for that counter. In each iteration, for each of the modules, we generate a random variable to decide whether the module will be affected by a stuck-at fault. The probability that a particular module will be affected by some fault is proportional to the number of single stuck-at faults in that module; the constant of proportionality is 10^-6. If the random variable indicates that the module is going to be affected by a fault, then we randomly pick a stuck-at fault in that module. Now, we apply the content of the counter to the inputs and obtain the output. If the system output is correct (same as the output produced by the fault-free system), we proceed to the next iteration. Otherwise, we note the time instant when the failing output vector is produced and proceed to the next experiment. After completing all the experiments, we calculate the mean time to failure (MTTF) for the system by averaging the number of cycles up to which the system produced correct outputs for each experiment. The improvement in the MTTF over the classical TMR (replicated) MTTF is an indicator of the effectiveness of using a set of modules in a TMR system.

In Table 4.2, we show the MTTF for the TMR systems with identical and different implementations of the same logic specification. For example, for the Z5xp1 circuit, we formed three TMR systems. For TMR system 1, we replicated the circuits obtained by synthesizing the original truth table; hence, this TMR system is denoted by (T, T, T) (T stands for "true"). For TMR system 2, we replicated the circuits synthesized from truth tables with complemented outputs. This system is denoted by (C, C, C) (C stands for "complement"). These two TMR systems correspond to TMRs with identical implementations. TMR 3 contains different implementations of Z5xp1. In the fourth column of Table 4.2, we show the total number of single stuck faults for the whole TMR system. In the fifth column, we report the MTTF (the average number of cycles after which the TMR produces incorrect outputs). It is clear from Table 4.2 that the MTTF is strongly dependent on the number of single stuck-at faults in the TMR system. Hence, the MTTF is dominated by the reliability of the individual modules in the TMR system.

Simulation 2

The experiments in Simulation 2 are similar to those in Simulation 1. The only difference is that the probability that a particular module gets affected by any fault is fixed, independent of the total number of possible stuck-at faults in that module. The value of this probability is chosen to be 10". In Table 4.3, we present the results obtained from these experiments. For the Z5xp1 example, we find that the TMR with different implementations has a higher MTTF compared to TMR with identical implementations (replication). For each of the remaining cases, there is at least one TMR with replication that has a higher MTTF than a TMR with different implementations. The case of the inc benchmark is interesting. The TMR with different implementations has a higher MTTF compared to one of the replicated TMRs and a lower MTTF compared to another replicated TMR.

From the results of Simulation 1 and Simulation 2, it can be concluded that, for independent failures in multiple modules, it is not necessarily true that a TMR system with different implementations will survive (produce correct outputs) for a longer time compared to a TMR system with identical implementations.
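The experimental loop of Simulation 1 can be sketched in Python as follows. This is a toy reconstruction, not the authors' simulation environment: the module is a small hypothetical network computing wx + y rather than a synthesized benchmark, each module is assumed to receive at most one fault, the per-cycle fault probability is a free parameter, and a `max_cycles` cap is added so that every experiment terminates. All function names are ours.

```python
import random


def network(w, x, y, fault=None):
    """Toy module: p = w*x + y with an optional stuck-at fault (line, value)."""
    def stuck(name, value):
        return fault[1] if fault and fault[0] == name else value

    w, x, y = stuck("w", w), stuck("x", x), stuck("y", y)
    z = stuck("z", w & x)
    return stuck("p", z | y)


FAULTS = [(line, value) for line in "wxyzp" for value in (0, 1)]


def majority(a, b, c):
    return (a & b) | (b & c) | (a & c)


def run_experiment(rng, fault_prob, max_cycles):
    """Return the cycle at which the TMR output first goes wrong
    (or max_cycles if it never does within the cap)."""
    faults = [None, None, None]  # all three modules start fault-free
    counter = rng.randrange(8)   # random seed for the 3-bit binary counter
    for cycle in range(max_cycles):
        for m in range(3):
            if faults[m] is None and rng.random() < fault_prob:
                faults[m] = rng.choice(FAULTS)
        vec = ((counter >> 2) & 1, (counter >> 1) & 1, counter & 1)
        outputs = [network(*vec, fault=f) for f in faults]
        if majority(*outputs) != network(*vec):
            return cycle
        counter = (counter + 1) % 8
    return max_cycles


def estimate_mttf(num_experiments, fault_prob, max_cycles, seed=0):
    """Average number of cycles for which the TMR output stayed correct."""
    rng = random.Random(seed)
    total = sum(run_experiment(rng, fault_prob, max_cycles)
                for _ in range(num_experiments))
    return total / num_experiments


mttf_estimate = estimate_mttf(num_experiments=1000, fault_prob=0.01, max_cycles=5000)
```

To mimic Simulation 1 more closely, `fault_prob` would be set per module in proportion to its number of stuck-at faults; keeping it equal for all modules corresponds to the setup of Simulation 2.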
Simulation 3

For Simulation 3, for each benchmark circuit, we built duplex systems with identical and different implementations. For each of these systems, we performed 100,000 experiments. In each experiment, we randomly picked a single stuck-at fault pair (f1, f2) such that the fault f1 affects Module 1 and f2 affects Module 2. We injected these faults into the modules, applied input patterns from a binary counter (with random seed) and calculated the error latency (the number of cycles after which the system ceased to be fault-secure). For more discussion on error latency, the reader is referred to [Shedletsky 76]. The expected error latency for the injected fault pairs is shown in Table 4.4. We also calculated the percentage of fault pairs for which the two modules never produced the same erroneous outputs at the same time (compensating fault pairs). These are the fault pairs (f1, f2) that have d1,2 equal to 1.
As shown in Table 4.4, a duplex system consisting of different implementations of the Z5xp1 circuit has a higher percentage of compensating fault pairs, compared to the non-diverse version; however, that is not generally true.
These results from Simulations 3 and 4 indicate that, for multiple independent failures, the reliability of a redundant system is dominated by the profile of di,j values of the fault pairs (fi, fj). This property has been captured by our reliability analysis that has been presented in Sec. 2.2.
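The error-latency measurement for one injected fault pair can be sketched as below. The sketch works on output tables rather than gate-level netlists, which is a simplification of ours; the tables shown are those of the function wx + y and of three of its faulty versions, and the function name `error_latency` is hypothetical.

```python
def error_latency(good, faulty_1, faulty_2, seed_vector):
    """Cycles until both modules emit the same wrong output, or None if never.

    Each table is a list of outputs indexed by input pattern; inputs come
    from a binary counter that starts at seed_vector and wraps around.
    """
    n = len(good)
    for cycle in range(n):  # after n cycles the counter repeats its patterns
        v = (seed_vector + cycle) % n
        if faulty_1[v] == faulty_2[v] != good[v]:
            return cycle
    return None  # compensating fault pair (d1,2 = 1)


# Output tables indexed by the pattern (w, x, y) read as a 3-bit number.
good = [0, 1, 0, 1, 0, 1, 1, 1]  # fault-free wx + y
w0 = [0, 1, 0, 1, 0, 1, 0, 1]    # w stuck at 0: function y
p0 = [0] * 8                     # output stuck at 0
y1 = [1] * 8                     # y stuck at 1: output always 1

latency = error_latency(good, w0, p0, seed_vector=0)
compensating = error_latency(good, w0, y1, seed_vector=3) is None
```

Averaging the returned latency over many random fault pairs and seeds gives an expected error latency, and counting the `None` results gives the fraction of compensating fault pairs.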
In [Sakov 87], for a given combinational logic function, the fault detectability profiles for different implementations have been reported. Further studies are needed to synthesize circuit structures with high values of di,j for different fault pairs. It has been proved in [To 73] that, for fanout-free combinational logic networks, all internal single stuck-at faults are either equivalent to or dominate single stuck-at faults on the primary inputs of the network. Thus, if we want to implement two diverse fanout-free networks implementing the same function, the di,j values of the different fault pairs will be strongly dependent on the input combinations detecting the single stuck-at faults on network inputs and outputs. For both the networks, the set of patterns that detect the input or output stuck-at faults is independent of the network structure and is directly determined by the function the networks are implementing. Thus, chances are low that, for fanout-free networks and stuck-at faults, the diversity metric will achieve appreciably high values for networks synthesized in different ways, compared to simple replication. Thus, it appears to be important to focus on achieving diverse fanout structures of different networks to obtain high values of the diversity metric for fault pairs.
Simulation 5
Our previous simulation results mainly focused on independent faults in multiple modules of a TMR system. However, it has been observed in the literature [Avizienis 84][Lala 94] that design diversity is useful for handling correlated failures and common-mode failures. Since we did not find any data on common-mode failure mechanisms, we performed the following sets of experiments to estimate the possible effect of diversity in the presence of common-mode failures.
In a duplicated system with identical implementations, we can find a one-to-one correspondence between the leads of the two copies. Hence, for these duplicated systems, we injected fault pairs (f1, f2) such that f1 and f2 affect lead i of Module 1 and Module 2, respectively. Note that, in the presence of f1 and f2, the two modules behave exactly in the same way. Hence, they can be called common-mode faults. After injecting such a fault pair, we applied input combinations from a binary counter (with the seed chosen randomly). With this setup, we found the error latency (the number of cycles after which the duplex system ceased to be fault-secure). For duplex systems with different implementations, since we cannot establish such a one-to-one correspondence between the leads of the two copies, we performed 100,000 experiments; in each experiment we randomly chose a fault pair and calculated the error latency.

... designs under consideration. In fact, an interesting synthesis problem is to synthesize two copies of a given logic function such that the number of self-testable fault pairs is maximum.
Conclusions
In this paper, we have addressed the issue of design diversity in redundant (software or hardware) systems in order to handle common-mode failures and failures in multiple modules. In order to protect fault-tolerant systems against common-mode failures, design diversity has been used commercially. Conventionally, design diversity means "independent" generation of "different" designs. This notion of diversity is qualitative and has limitations. Hence, the need for a metric to quantify diversity between different systems has been expressed in the past.
In this paper, for the first time, we have introduced a metric to quantify diversity among different designs under a particular fault model, and explained how to calculate the overall system reliability in terms of this metric. In our example of the calculation of diversity for combinational logic circuits (Sec. 2.1), we have assumed that all the input combinations are equally likely. In the absence of any information about the relative frequency of the different input combinations, this is a reasonable assumption. However, for a particular application, if we have information about the relative frequencies (in the form of input traces, for example), then we can appropriately modify the above expression to incorporate this extra information (by changing the weights associated with different input combinations).
We have also produced simulation results to model real-life environments that inject multiple failures in duplex and TMR systems. Our theoretical and simulation results indicate that, in the presence of independent multiple module failures in redundant systems, mere use of different implementations does not guarantee higher reliability compared to redundant systems with identical implementations. It is more important to evaluate the reliability of the systems using our metric. On the other hand, for common-mode failures and design faults, there is a significant gain with different implementations. However, the gain decreases with increasing mission time.
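One way to read the weighting idea is to replace the count of shared erroneous inputs by the sum of their probabilities. The Python sketch below shows this under that interpretation, which is ours; the input distributions are made-up illustrations, and the function name `weighted_diversity` is hypothetical.

```python
def weighted_diversity(same_error_inputs, input_probability):
    """d = 1 - (total probability of the inputs on which both faulty
    implementations produce the same erroneous output)."""
    return 1 - sum(input_probability[v] for v in same_error_inputs)


patterns = ["000", "001", "010", "011", "100", "101", "110", "111"]

# Equally likely inputs: this reduces to d = 1 - k / 2^n.
uniform = {v: 1 / 8 for v in patterns}

# Hypothetical trace-derived distribution in which 110 dominates.
skewed = {v: 0.5 if v == "110" else 0.5 / 7 for v in patterns}

shared = {"110"}  # inputs where both faulty copies give the same wrong output
d_uniform = weighted_diversity(shared, uniform)
d_skewed = weighted_diversity(shared, skewed)
```

When the shared erroneous pattern is frequent in the application, the effective diversity for that fault pair drops well below the uniform-input value.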
Our analysis technique can be used to derive relationships between system reliability, diversity, mission time and system failure rate and compare reliabilities of multiple diverse systems. These relationships can help understand the cost and reliability tradeoffs while designing redundant systems with diversity.
For common-mode failures, diverse systems have no worse reliability compared to replicated systems. However, there is a further need to characterize common-mode failure mechanisms at the circuit level. With a good CMF fault model, (logical or layout-level) synthesis techniques can be used to incorporate sufficient diversity to protect systems against the modeled faults.
