Abstract
Introduction
Concurrent Error Detection (CED) techniques are widely used for designing systems with high availability and data integrity. A duplex system is a classical redundancy scheme for concurrent error detection. There are many examples of commercial dependable systems from companies like Stratus and Sequoia using hardware duplication [Kraft 81, Pradhan 96]. Hardware duplication has also been used in the IBM G6 processor. Figure 1.1 shows the basic structure of a duplex system. In a duplex system there are two modules (shown in Fig. 1.1 as Module 1 and Module 2) that implement the same logic function. The two implementations need not be identical; for example, one could be the complement of the other. A comparator checks whether the outputs from the two modules agree. If the outputs disagree, the system indicates the presence of an error. Data integrity is the property of a system that either produces correct outputs or generates an error signal when incorrect outputs are produced. For a duplex system, data integrity is maintained as long as both modules do not produce identical erroneous outputs (assuming that the comparator is fault-free). Since the comparator is crucial to the correct operation of the duplex system, special designs are needed to ensure that the data integrity of the system is not compromised due to comparator failure.
The comparator design in [Hughes 84] can be used for this purpose. Any duplex system is vulnerable to common-mode failures (CMFs) that affect both the modules of the system [Lala 94]. Design diversity through independent generation of different implementations of the two modules was identified as a possible solution to this problem [Avizienis 84]. In the presence of CMFs, the data integrity of a duplex system is not guaranteed to be preserved. Hence, CMFs must be detected using special techniques.
The main contributions in this paper are:
· An efficient algorithm for identifying non-self-testable faults (formally defined in Sec. 2) in duplex systems. These faults undermine the system data integrity.
· New techniques that use test points to detect all non-self-testable faults.
Our results indicate that the number of test points required for duplex systems with diverse implementations is significantly lower than those required for duplex systems with identical implementations.
We discuss the effects of diversity on the detectability of faults in duplex systems in Sec. 2. In Sec. 3, we present techniques to identify non-detectable faults in a duplex system that can potentially cause data integrity problems.
Section 4 describes test point insertion techniques to detect these faults. We present simulation results in Sec. 5 and conclude in Sec. 6.
Self-testing Properties of Fault Pairs in Duplex Systems
Consider a duplex system consisting of two implementations N1 and N2 of the same logic function and a comparator comparing their outputs. The duplex system is self-testing with respect to a fault pair (f1, f2) (f1 affecting N1 and f2 affecting N2) if there exists an input combination for which the two implementations produce different outputs in the presence of the faults.
The corresponding fault pair is said to be self-testable. If the two implementations produce different outputs in the presence of the fault pair, then the comparator will produce a Mismatch signal that can be used to initiate repair action. Some fault pairs, however, can be non-self-testable. The objective of our technique is to ensure that these fault pairs are detected. In this paper, we consider all single stuck-at fault pairs (f1, f2), with f1 affecting N1 and f2 affecting N2. This model includes common-mode failures that manifest themselves as single stuck-at faults in the individual implementations. In Table 2.1, we show simulation results comparing the percentage of non-self-testable fault pairs in duplex systems with identical and diverse implementations. Diverse implementations were obtained by synthesizing the logic functions in different ways.
More detailed information about the synthesis of different implementations can be obtained from Sec. 5.
For duplex systems with identical implementations, a common-mode failure (CMF) can be considered as one for which the corresponding leads in the two implementations are stuck at the same value. The two faulty modules then compute the same function, so the self-testability of common-mode failures is 0% in a duplex system with identical implementations. However, for a duplex system with different implementations, we have very few non-self-testable fault pairs. Thus, many of the potential CMFs can be detected by using diverse implementations. Self-testability enables on-line detection of faults (and CMFs included in the fault model) that affect the two modules of a duplex system. In the next section, we describe efficient techniques to identify non-self-testable fault pairs in a duplex system.
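The definition of a self-testable fault pair can be illustrated with a minimal sketch that checks every input combination exhaustively. The three-input function, the lead names (g, ng, nz, z) and the helper names below are hypothetical and chosen only for illustration; N2 is built from the complemented function followed by an output inversion, in the spirit of the diverse implementations discussed above.

```python
from itertools import product

def lead(name, value, fault):
    """Value on a lead, forced to the stuck value if the fault sits on that lead."""
    if fault is not None and fault[0] == name:
        return fault[1]
    return value

def n1(a, b, c, fault=None):
    """Hypothetical implementation N1 of z = (a AND b) OR c."""
    g = lead("g", a & b, fault)
    return lead("z", g | c, fault)

def n2(a, b, c, fault=None):
    """Hypothetical diverse implementation N2: complemented logic plus an output inverter."""
    ng = lead("ng", 1 - (a & b), fault)
    nz = lead("nz", ng & (1 - c), fault)
    return lead("z", 1 - nz, fault)

def self_testable(impl_a, fault_a, impl_b, fault_b, n_inputs=3):
    """True if some input makes the two faulty implementations disagree."""
    return any(
        impl_a(*x, fault_a) != impl_b(*x, fault_b)
        for x in product((0, 1), repeat=n_inputs)
    )

# Same stuck-at fault on the same lead of two identical copies: never detected.
identical = self_testable(n1, ("g", 0), n1, ("g", 0))
# Faults in the diverse pair: one pair is detected, another is not.
diverse_detected = self_testable(n1, ("g", 0), n2, ("ng", 0))
diverse_missed = self_testable(n1, ("g", 0), n2, ("ng", 1))
```

In this toy example the identical pair is not self-testable, while the diverse pair is self-testable for one combination of faults and not for another, which is why some non-self-testable pairs can remain even with diverse implementations.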
Identifying Non-self-testable Fault Pairs
In this section, we describe a technique to identify fault pairs that are not self-testable in a duplex system. The technique is approximate and is based on compaction of output responses for each fault.
We calculate a signature corresponding to each fault in each implementation. We call a fault pair non-self-testable if and only if the two faults forming the pair have the same signature. The reason behind this will be discussed later in this section. The algorithm is shown below (Algorithm 1).
[Algorithm 1. Only one step of the listing is recoverable: "Inject f2 in N2, simulate and store response R2(f2)".]
To calculate the signature associated with each fault, we use a Multiple-Input Signature Register (MISR) and a counter. The length of the MISR is at least 20 and greater than the number of outputs. For example, suppose that we want to calculate the signature associated with a particular fault f in N1. For each input combination, if the response of N1 in the presence of f is different from the fault-free response, then the counter is incremented and the faulty response is compacted into the MISR. The counter counts the number of test patterns that detect a particular fault. The signature of a fault is given by the pair <content of the MISR, value of the counter>. While updating a signature in Algorithm 1, the counter value is incremented by 1 and the circuit outputs are compacted in the MISR.
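The signature computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 20-bit register length follows the text, but the feedback tap positions, the packing of output vectors into integers and the function names are assumptions made for the example.

```python
MISR_BITS = 20                      # the text asks for a length of at least 20
MASK = (1 << MISR_BITS) - 1
TAPS = (19, 16)                     # assumed feedback taps, illustrative only

def misr_step(state, response):
    """Shift the register once with feedback, then XOR in the output vector."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> tap) & 1
    state = ((state << 1) & MASK) | feedback
    return state ^ (response & MASK)

def fault_signature(faulty_responses, good_responses):
    """Return (MISR content, counter) for one fault.

    Each response is an output vector packed into an int, one per input pattern.
    Only patterns that detect the fault update the counter and the MISR.
    """
    misr, count = 0, 0
    for faulty, good in zip(faulty_responses, good_responses):
        if faulty != good:
            count += 1
            misr = misr_step(misr, faulty)
    return (misr, count)

# A fault detected by the first and third of three patterns.
example = fault_signature([1, 0, 1], [0, 0, 0])
# A fault that no pattern detects keeps the initial signature.
undetected = fault_signature([0, 1, 1], [0, 1, 1])
```

Two faults that produce identical faulty responses on every pattern necessarily receive the same signature under this scheme, which is the property the identification step relies on.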
Our results show that using the counter or the MISR alone produces highly sub-optimal results. The suboptimality arises from the fact that faults f1 and f2 may have the same signature although the fault pair (f1, f2) may be self-testable. In this situation, our algorithm will declare the fault pair (f1, f2) to be non-self-testable. This situation is referred to as signature aliasing. However, as the results in Sec. 5 indicate, using both the counter and the MISR, we obtain very close to optimum and often fully optimum results with negligible aliasing.
If the signatures for faults f1 and f2 are different, then the fault pair (f1, f2) is self-testable. If a fault pair (f1, f2) is non-self-testable, then the corresponding signatures are equal. However, the converse may not be true. For example, signatures for faults f1 and f2 may be equal due to aliasing while the fault pair (f1, f2) is self-testable. In this case, we classify a self-testable fault pair as non-self-testable. Thus, aliasing results in one-sided error and makes our algorithm pessimistic (unlike fault detection, where aliasing may cause a defective part to be treated as a fault-free part).
A similar argument holds for the number of input patterns that must be applied to identify the self-testable fault pairs. In the worst case, for a system with n inputs, we have to apply all 2^n input combinations to identify self-testable fault pairs. If we use a reduced number of input combinations, a self-testable fault pair may be declared non-self-testable. However, the reverse situation cannot happen. Thus, using a reduced number of input combinations produces one-sided errors and pessimistic results. Depending on the number of inputs of the circuits, the required execution time, and the desired level of accuracy, we can therefore select the number of input combinations appropriately.
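The classification rule (equal signatures imply a candidate non-self-testable pair) can be sketched with a hash table keyed by signature, which avoids comparing every fault in N1 against every fault in N2. The data layout, the function name and the sample signatures below are hypothetical; the source does not state how matching signatures are located.

```python
from collections import defaultdict

def candidate_non_self_testable(sig1, sig2):
    """Pair up faults of N1 and N2 whose signatures are equal.

    sig1 and sig2 map a fault name to its (MISR content, counter) signature.
    Every truly non-self-testable pair is returned; aliasing can only add
    extra pairs, so the result is pessimistic, never optimistic.
    """
    by_signature = defaultdict(list)
    for fault2, signature in sig2.items():
        by_signature[signature].append(fault2)
    return [
        (fault1, fault2)
        for fault1, signature in sig1.items()
        for fault2 in by_signature.get(signature, [])
    ]

# Hypothetical signatures for a few faults in each implementation.
sig_n1 = {"A1": (3, 2), "A2": (7, 4)}
sig_n2 = {"B1": (3, 2), "B2": (5, 1), "B3": (3, 2)}
pairs = candidate_non_self_testable(sig_n1, sig_n2)
```

Here A1 is paired with B1 and B3 because their signatures match, while A2 and B2 appear in no candidate pair.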
The running time of Algorithm 1 can be further reduced by using deductive fault simulation techniques [Abramovici 90, Armstrong 72]. The simulation results presented in Sec. 5 clearly show a distinct advantage in execution time by using Algorithm 1 over exact techniques. In the next section, we describe test point insertion techniques so that all fault pairs that are identified as being non-self-testable become testable.
Enhancing Self-testing Properties Using Test Points
In this section, we discuss test point insertion techniques to enhance the self-testability of duplex systems. There are two types of test points: control test points and observation test points. In Secs. 4.1 and 4.2, we describe self-testability enhancement using control and observation test points, respectively.
Control Test Points
Consider the duplex system consisting of two identical modules, each implementing the logic circuit shown in Fig. 4.1a. Consider the fault pair in which the signal line corresponding to Z1 is stuck-at-0 in both modules. The duplex system will never produce a mismatch signal in the presence of these two faults. Thus, the fault pair is not self-testable. Next, suppose that for one of the two modules, we add test points T1 and T2 as shown in Fig. 4.1b. We make T1 = 0 and T2 = 0 and apply a test pattern for Z1 stuck-at-0. If the fault pair is not present, a mismatch signal will be produced. If the fault pair or other fault pairs are present, no mismatch signal will be produced. This observation can be used to detect the presence of the fault pair. A similar case with T1 = 0 or 1 and T2 = 1 arises for the fault pair in which Z1 is stuck-at-1 in both modules. Thus, control test points can enhance the self-testability of fault pairs in a duplex system. Note that, in a duplex system with two identical implementations, the fault pairs affecting the same leads in the two implementations are not self-testable. Thus, we have to add test points at each lead of the circuit in Fig. 4.1a.
The primary advantage of using control points is that we can utilize the available resources (comparators) of the duplex system for observing the response of the system to an input combination. Thus, we do not have to store simulated fault-free responses and compare the system response with these pre-stored responses to detect the presence of faults. However, we have to ensure that when the test points are activated, we apply a test vector for the untestable stuck-at fault pair. This can be achieved by using deterministic test patterns or pseudo-random patterns generated by an LFSR (Linear Feedback Shift Register). If LFSRs are used, a technique similar to the mapping logic technique [Touba 95] can be used for test point activation.
For detecting the presence of faults when the test points are activated, we can XOR the mismatch signal output of the comparator with a Test signal that is 1 when one of the test points is activated. The Test signal may be generated externally or by the test controller. During the idle cycles of the system, we can apply input combinations and activate appropriate test points (if necessary) to detect different fault pairs. Consider the example of a computation process shown in Fig. 4.2. Multipliers MUL1 and MUL2 are used in every alternate cycle (Cycle 1, Cycle 3, Cycle 5, etc.). Thus, during every even cycle, we can apply test patterns to the multiplier inputs. This can be done by adding extra instructions if the algorithm is implemented in a processor, or by using an LFSR. The basic advantage here is that we do not need any extra mechanism to process the response of the multipliers during the idle cycles. Use of idle cycles for concurrent error detection is also described in [Sohi 89].
Control points require extra area, may degrade circuit performance, and require more design effort.
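The role of the XOR between the comparator's mismatch output and the Test signal can be made concrete with a small truth table. The function name and the interpretation of the result as an error flag are assumptions for illustration: in normal operation (Test = 0) a mismatch signals an error, whereas with a control test point activated (Test = 1) a mismatch is the expected fault-free behavior and its absence signals an error.

```python
def error_flag(mismatch, test):
    """XOR of the comparator mismatch output with the Test signal."""
    return mismatch ^ test

# Enumerate all four combinations of (mismatch, test).
truth_table = {
    (mismatch, test): error_flag(mismatch, test)
    for mismatch in (0, 1)
    for test in (0, 1)
}
```

The single XOR gate lets the same error output serve both the normal mode and the test mode without any stored reference responses.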
Observation Test Points
Instead of adding control points, we can increase the self-testability of a duplex system by using observation test points. For example, in the duplex system of Fig. 4.1b, we can observe the logical values on the node Z1 instead of adding any control test point. As a result, we can detect all fault pairs involving a stuck-at fault on Z1. For observation purposes, we can perform signature analysis or directly observe the node Z1.
This approach has a distinct advantage over control points because we do not have to add extra gates. However, we have to route the observation points to signature analyzers and compare the computed signature of the logic values on the observation test points with the golden signature. For self-testable fault pairs, we can steal idle cycles of the system to apply test patterns. For non-self-testable faults, we must observe the logical values on the added test points (possibly through signature analysis). Thus, fault simulation and storage of fault-free signatures are necessary.
With observation test points, each application can be preceded and followed by testing phases. A high-level block diagram of an application with testing phases is shown in Fig. 4.3. Input patterns are applied using LFSRs or by compiling deterministic finite state machines. Note that, with observation test points, the faults can be detected more easily: we only have to excite the fault and need not worry about sensitizing the fault effect. Hence, we have to toggle logic values at the test point sites rather than propagate the fault effects to the outputs. Thus, the fault-free signature on the observation points is a 0 or a 1 for a stuck-at-1 or a stuck-at-0 fault, respectively. Finally, test points can help us perform quick fault-location and self-repair. Table 4.1 summarizes the advantages and disadvantages of the control and observation test points.
Choice of Test Points
For duplex systems with identical implementations, the test points to be inserted can be determined very easily. In such a system with m leads in each implementation, there must be at least 2m single stuck-at fault pairs that are not self-testable.
These are the same faults on the corresponding lines of the two implementations. Then we need m test points to detect all single stuck-at fault pairs [Mitra 00].
For a duplex system with different implementations, we find the non-self-testable fault pairs using Algorithm 1. Next, we choose the minimum number of test points that make all these fault pairs self-testable. This problem can be formulated as a covering problem. A non-self-testable fault pair (f1, f2) can be detected if we insert a test point on the signal line corresponding to f1 or f2. Thus, for each non-self-testable fault pair, we have two candidate signal lines.

Table 4.2
        (A1,B1)  (A1,B2)  (A1,B3)  (A2,B4)  (A3,B4)
A1         X        X        X
A2                                    X
A3                                             X
B1         X
B2                  X
B3                           X
B4                                    X        X

Let us suppose that we have 5 non-self-testable fault pairs, (A1, B1), (A1, B2), ..., (A3, B4), as shown in Table 4.2. To detect the fault pair (A1, B1), we have to add a test point at the signal line corresponding to A1 in the first implementation or at the signal line corresponding to B1 in the second implementation. The rows of Table 4.2 correspond to candidate test points. We put an X in an entry if the fault pair in that column can be detected by inserting a test point at the signal line corresponding to the row of the entry. Selecting the minimum number of rows in order to have X's under every column is the same as the classical covering problem encountered while finding the minimum number of prime implicants to represent a Boolean function [McCluskey 56].
Heuristic Test Point Selection
[Listing fragment: "While there are columns in the table ..."]
The covering problem is NP-complete and has exponential complexity in the worst case.
We implemented a simple heuristic algorithm as shown above. For the example in Table 4.2, we need two test points at the sites A1 and B4.
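One common heuristic for such a covering table is the greedy rule sketched below: while any fault pair remains uncovered, pick the candidate site that covers the most remaining pairs. This is an assumption about what a simple heuristic could look like, not a reconstruction of the authors' listing; the function name and the alphabetical tie-breaking are hypothetical.

```python
def greedy_test_points(fault_pairs):
    """Greedy covering: repeatedly pick the site that covers most uncovered pairs.

    fault_pairs is an iterable of (site_in_N1, site_in_N2) tuples; a test point
    on either site of a pair makes that pair detectable. Site names are assumed
    to be unique across the two implementations.
    """
    uncovered = set(fault_pairs)
    chosen = []
    while uncovered:
        counts = {}
        for pair in uncovered:
            for site in set(pair):
                counts[site] = counts.get(site, 0) + 1
        # Ties are broken alphabetically so that the result is deterministic.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        uncovered = {pair for pair in uncovered if best not in pair}
    return chosen

# The five non-self-testable fault pairs of the Table 4.2 example.
pairs_42 = [("A1", "B1"), ("A1", "B2"), ("A1", "B3"), ("A2", "B4"), ("A3", "B4")]
sites = greedy_test_points(pairs_42)
```

For the pairs of the Table 4.2 example this sketch selects A1 first (three pairs) and then B4 (the remaining two), matching the two test points stated in the text. A greedy choice is not guaranteed to be minimal in general.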
Simulation Results
In Tables 5.1(a) and 5.1(b), we show results on the number of test points needed to make all the fault pairs self-testable in a duplex system. For generating different implementations, we minimized the truth tables corresponding to some MCNC benchmark circuits using espresso.
Then, we synthesized logic circuits after applying multi-level optimizations using the rugged script available in sis [Sentovich 92]. We subsequently mapped the multi-level logic circuits to the LSI Logic G-10p technology library [LSI 96]. These implementations are referred to as "T" in Table 5.1(a).
Next, we complemented the outputs in the truth tables of the benchmark circuits to generate new truth tables. We used the same synthesis procedure for these new truth tables. Finally, we added inverters at the outputs of the new designs obtained. These implementations are referred to as "C" in Table 5.1(a). In the third column of Table 5.1(a), we show the total number of single stuck-at fault pairs (in millions) in a duplex system containing the two implementations. This gives an idea of the complexity of the problem. In the next three columns we show the number of observation test points needed to detect all the single stuck-at fault pairs using different techniques. The exact algorithm that can find the minimum number of test points is based on ATPG (Automatic Test Pattern Generation). The ATPG tool available in sis is used for this purpose. The running time of the ATPG-based exact technique is extremely high for large designs, as shown in Table 5.1(b). The entries marked with a dash are the cases where the exact technique ran for more than a day without producing any result. It follows from our discussions in Sec. 4 that the number of control test points is twice the number of observation test points. The columns of Table 5.1(b) show the execution time required for finding the non-self-testable fault pairs using Algorithm 1 and an ATPG-based approach. All the programs were executed on a Sun UltraSparc-2 workstation.
The following observations can be made from the data presented in Tables 5.1(a) and 5.1(b). It is clear from Table 5.1(a) that, by adding only a very few test points in a duplex system with different implementations, we can make all the fault pairs (and all modeled CMFs) self-testable. The number of test points needed for duplex systems with different implementations is orders of magnitude lower than that needed for duplex systems with identical implementations. For minimizing aliasing (and hence, reducing the number of test points to be added), we recommend using both the MISR and the counter for calculating fault signatures in Algorithm 1. As shown in Table 5.1(b), we obtain significant speedup using Algorithm 1 compared to an ATPG-based exact technique.
Conclusions
In this paper, we demonstrated the advantages of using diverse implementations in enhancing the self-testability of common-mode and multiple failures in duplex systems. This result is particularly useful in the context of adaptive computing systems that enable easy instrumentation of design diversity. Our technique for finding the non-self-testable fault pairs shows orders of magnitude improvement in execution time compared to other competitive techniques. An interesting extension to our solution will be to preprocess a given circuit to identify the subset of inputs that decide the testability of a given fault. This preprocessing will be useful for circuits having a large number of inputs where each output depends on only a very small subset of the inputs. We have also described test point insertion techniques to detect all modeled common-mode and multiple failures. This enhancement helps in increasing the system data integrity and availability. There are further opportunities to reduce the number of test points using fault equivalence relationships [Mitra 00]. The test point insertion techniques reported in this paper can be combined with other test point insertion techniques used in the context of digital testing [Touba 96] to reduce the test length and detect different fault pairs more efficiently.
Acknowledgments
This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. DABT63-97-C-0024.
