Continuously shrinking feature sizes result in an increasing susceptibility of circuits to transient faults, e.g. due to environmental radiation. Approaches to implement fault tolerance are known. But assessing the fault tolerance of a given circuit is a tough problem.
INTRODUCTION
According to Moore's Law the number of components per area increases at an exponential rate in integrated circuits. One consequence is an increase of externally induced transient faults [20] .
Techniques to cope with transient faults are available on the production level [26] or the design level [1, 10] . Even first tools to improve fault tolerance are available [25] .
Proving robustness with respect to transient faults is difficult. Simulation- or emulation-based methods [6, 18] can only cover a small portion of the states and the input space of a circuit. A formal analysis determines the probability of a fault to propagate to a primary output (see e.g. [16]). But the computational effort is extremely high.
* This work has been funded in part by DFG grants DR 287/19-1 and FE 797/5-1.
Methods commonly applied for formal verification can prove fault tolerance of an implementation. The approach of [3] proposes to use symbolic methods for the classical analysis of fault trees. But the faults have to be specified manually. Similarly, [13] and [15] rely on symbolic methods. These approaches analyze fault tolerance with respect to mutations of the implementation. As a result, [13] decides whether an implementation is fault tolerant or not, while [15] also provides data about the state space. Both techniques use the original circuit as a specification. The authors of [19] determine fault tolerance with respect to given formal properties. Only faults in state bits are considered. None of the techniques mentioned so far provides insight about circuit structures that are not fault tolerant.
The technique proposed here is similar to [11] and uses the same fault model. A circuit is classified as robust if no fault tampers with the output behavior. Detailed feedback about components that are not fault tolerant is returned. Thus, the approach in [11] provides a good basis, but has several limitations when considering practical problems. For example, the formal analysis may have to consider a large number of time steps before providing a result, and reachability analysis is required to get accurate information.
In contrast, our approach is more efficient and fits practical requirements. The main contributions of our work are:
• Practical model for robustness checking
We explain why a full formal analysis is overkill in practice and how a fault detection mechanism helps to prove fault tolerance.
• Fault tolerance within a lower and an upper bound
While running, our technique delivers bounds on the fault tolerance by determining robust, non-robust and non-classified components. Non-classified components point to potential Silent Data Corruption (SDC).
• Restricted observation window for formal analysis
Restricting the observation time significantly improves the performance.
• Avoiding full reachability analysis
Full reachability analysis is often too expensive. We show how to use a light-weight reachability analysis when assessing fault tolerance.
Experimental results evaluate the approach and the performance of the algorithm. As expected, the formal approach requires long run times, but it proves fault tolerance with respect to any possible input stimulus. Even circuits where full reachability analysis is not feasible are handled effectively by our algorithm. The optimization techniques, i.e. the consideration of structural information and the reuse of learned information, improve the performance by a factor of up to 11.
This paper is structured as follows: The underlying fault model and a basic approach to determine the robustness are discussed in the next section. Section 3 introduces a model bounded in time that also yields bounds on the robustness of a circuit. The incremental algorithm is explained in Section 4, while Section 5 presents optimization techniques to improve efficiency. Experimental results are given in Section 6. Section 7 concludes the paper.
FAULT MODEL AND BASIC APPROACH
We consider a synchronous sequential circuit C with Primary Inputs (PIs) X, Primary Outputs (POs) Y and state bits S. The number of components in C is denoted by |C|. Here, a component may be a gate, a module or a source level expression in the hardware description language. Our fault model assumes that a faulty component behaves non-deterministically in one time frame, i.e. the value of the output of the component does not depend on the values of the inputs. We consider single faults only and justify in Section 3 why this is sufficient. A component g is robust iff the output behavior of C cannot change when g is faulty. Let T be the set of robust components in C, then the robustness of C is given by |T|/|C|.
As suggested in [11] we use an instance of Boolean satisfiability (SAT) to measure robustness. In the following the output signal of component g is associated with variable g as well. A fault is modeled as follows (the formulation is similar to SAT-based diagnosis [21]): For a component g, a fault predicate p_g and a new variable g' are introduced; then g is replaced by ¬p_g → (g' = g) as shown in Figure 1. Consequently, the value of g' is specified by the circuit structure if p_g = 0. But if a fault at g is asserted by p_g = 1, g' may take any value. Given a circuit C, the circuit C' is created by replacing each component as explained above. Then, P denotes the set of fault predicates and G' denotes the set of newly introduced variables to replace the outputs of components.
Now, the SAT instance is created as shown in Figure 2: The circuit C is unrolled for t_d time frames as in bounded model checking [2]; the unrolled circuit is compared to the copy C' connected to t_d − 1 instances of C; the POs in the final time frame t_d are forced to be different; only one variable in P may take the value 1. This SAT instance is satisfied iff the output behavior of the faulty and the original circuit differ in time frame t_d. This may only happen when a faulty value is injected at component g with p_g = 1, i.e. component g is not robust. By finding all satisfying assignments, the non-robust components are retrieved. Once a component g has been found non-robust, further solutions for this component are blocked by inserting the constraint p_g = 0 into the SAT instance. All non-robust components are calculated by iteratively incrementing t_d.
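The effect of the fault predicate can be illustrated on a toy netlist. The following Python sketch is not the SAT formulation itself: exhaustive enumeration stands in for the SAT solver, only a single combinational time frame is considered, and the netlist (n1, n2, n3, y) as well as all function names are hypothetical. Passing a component name as `faulty` corresponds to p_g = 1, in which case the component output takes an arbitrary injected value.

```python
from itertools import product

# Hypothetical toy netlist in topological order: name -> (function, input names).
INPUTS = ("a", "b")
OUTPUTS = ("y",)
NETLIST = {
    "n1": (lambda a, b: a ^ b, ("a", "b")),
    "n2": (lambda a, b: a | b, ("a", "b")),
    "n3": (lambda n1, n2: n1 & n2, ("n1", "n2")),
    "y": (lambda n2, n3: n2 | n3, ("n2", "n3")),  # y = n2 | (n1 & n2) = n2
}


def evaluate(inputs, faulty=None, injected=0):
    """Evaluate the netlist; the output of `faulty` (if any) is replaced by `injected`."""
    values = dict(inputs)
    for name, (fn, ins) in NETLIST.items():
        computed = fn(*(values[i] for i in ins))
        values[name] = injected if name == faulty else computed
    return tuple(values[o] for o in OUTPUTS)


def non_robust_components():
    """Components for which some input and some injected value change the outputs."""
    result = set()
    for g in NETLIST:
        for bits in product((0, 1), repeat=len(INPUTS)):
            inputs = dict(zip(INPUTS, bits))
            golden = evaluate(inputs)
            if any(evaluate(inputs, g, v) != golden for v in (0, 1)):
                result.add(g)
                break
    return result


robust = set(NETLIST) - non_robust_components()
```

In this toy example only n1 is robust, because its value is always masked by n2 at the output.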
Note that this model already provides an improvement over [11]: fault injection is only required in the first time step instead of all time steps. If the set of initial states S(0) is equal to the set of reachable states S*, the exact value of the robustness is determined when reaching the maximal sequential depth of the correct circuit and a faulty circuit [11]. In the following we assume that S(0) is equal to S* until relaxing this constraint in Section 3.2.
BOUNDS FOR ROBUSTNESS
In this section we adjust the notion of robustness and the model to practical requirements. As a "side effect" the computational effort decreases. First we justify an observation window to restrict the number of time steps considered by the formal analysis. Then we explain how to relax reachability analysis in the context of our approach.
Observation Window
After detecting an internal malfunction, some action has to be taken at the system level in practice. Otherwise the effects of multiple faults may accumulate and may cause a disastrous failure. Therefore, we assume that a fault detection signal f exists. If a malfunction occurs, this is signaled by setting f within a given time bound of no more than t_d time steps. Assuming a very low probability for more than one fault within t_d time steps, this is safe. By this we retrieve exact bounds for the robustness while restricting the formal analysis to an observation window of t_d time steps.
We apply a case split to determine the robustness of a component g. Assume component g is faulty, then the robustness of g is assessed as follows:
To guarantee that a circuit is robust, neither non-robust nor non-classified components must remain. A non-robust component is a threat: a fault in this component may cause wrong output values. Knowledge about non-classified components is also essential. A fault in such a component cannot directly influence the output values, but changes the internal state of the circuit, i.e. causes SDC. If this is not detected within the required observation time t_d, an undetected error remains latent in the circuit. Effects of multiple faults may accumulate and eventually cause erroneous output.
The same model also handles circuits that directly correct faults instead of flagging a fault. In this case f is assumed to be constantly 0. As a result case 1(a) of the case split given above does not occur.
EXAMPLE 1. A (7,4)-Hamming-Code recognizes and repairs single faults [12]. Figure 3 shows a transmission using an encoder for 4 bit data, a bit-wise serial channel and a decoder. A failure in the transmitted code word is flagged by setting f. The timing is summarized like this:
The determined robustness depends on the value of t_d:
• t_d = 6: The input data reaches the POs. Faults that can be detected are flagged. All components are classified.
• t_d > 6: Faults injected at t = 0 do not influence the state of the model after more than 6 time frames.
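For illustration, a (7,4) Hamming encoder and decoder can be sketched in Python as follows. The bit ordering (parity bits at positions 1, 2 and 4) and the function names are assumptions of this sketch; the serial channel and the timing of Figure 3 are not modeled. The decoder returns the corrected data together with a flag that plays the role of the fault signal f.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit code word (positions 1..7: p1 p2 d1 p3 d2 d3 d4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Return (data bits, f); f = 1 if a single-bit error was found and corrected."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], int(syndrome != 0)


word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[4] ^= 1  # single transient fault on the channel
decoded, f = hamming74_decode(corrupted)
```

Any single flipped bit of the code word is repaired, so the data at the outputs stays correct while f reports the fault.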
Let T be the set of components classified as robust, S the set of components classified as non-robust and U the set of components not yet classified. Then, C = T ∪ S ∪ U. Now, a lower bound R_lb and an upper bound R_ub for the robustness of the circuit C are given by:
R_lb = |T| / |C| and R_ub = (|T| + |U|) / |C| = 1 − |S| / |C|.
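The bounds follow directly from the sizes of the three sets. The Python sketch below assumes that the lower bound counts only the robust components and that the upper bound additionally counts the non-classified ones, which is consistent with the robustness |T|/|C| defined in Section 2; the function name and the example component names are hypothetical.

```python
from fractions import Fraction


def robustness_bounds(robust, non_robust, unclassified):
    """Return (R_lb, R_ub) for disjoint sets T, S and U that partition the components."""
    total = len(robust) + len(non_robust) + len(unclassified)
    if total == 0:
        raise ValueError("circuit without components")
    lower = Fraction(len(robust), total)                      # only proven robust
    upper = Fraction(len(robust) + len(unclassified), total)  # everything not proven non-robust
    return lower, upper


# Before any classification all components are in U, so the bounds are 0 and 1.
initial = robustness_bounds(set(), set(), {"g1", "g2", "g3", "g4"})
partial = robustness_bounds({"g1", "g2"}, {"g3"}, {"g4"})
```

Both bounds coincide as soon as U is empty, which yields the exact robustness.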
Reachability
Up to now we assumed that S(0) was the set of reachable states S*. Consequently, the approach determined exact bounds for the robustness. In practice, reachability analysis is often not feasible due to the computational complexity. For example, the computation of S* with a Binary Decision Diagram (BDD) [4] often requires a large amount of main memory or exceeds run time limitations. Therefore we relax this requirement in the following.
In practice the bounds derived above for the robustness depend on the set of states S(0) applied in the initial time frame of the formal analysis. Assume that an overapproximation S↑ and an underapproximation S↓ of the reachable states are available, i.e. S↓ ⊆ S* ⊆ S↑.
When considering the subset S↓, some states reachable during normal operation are excluded. Therefore some components may not be activated even though they are relevant during normal operation, e.g. when the ADD-operation in a CPU is never activated. These components are classified as robust instead of non-robust. The robustness determined when using S↓ is at least as large as the real value. Therefore, using S↓ is safe when calculating an upper bound of the robustness, i.e. R_ub(S↓) ≥ R_ub(S*).
Similarly, considering unreachable states using S↑ decreases the calculated robustness. Consider a circuit with Triple Modular Redundancy (TMR). In the fault free case the three redundant modules are in the same state. Consequently, any internal fault of one module is masked and all components are robust. But when unreachable states are also considered where the states of the three modules deviate, a fault may change the output behavior of the overall circuit. Additional components may be classified as non-robust and it is safe to use the overapproximation to determine a lower bound on the robustness, i.e. R_lb(S↑) ≤ R_lb(S*).
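The TMR argument can be checked on a one-bit abstraction. In the Python sketch below each module is reduced to a single output bit that feeds a majority voter; this abstraction and all names are assumptions made for illustration. A single fault is masked whenever the three modules agree (the reachable states), but becomes visible in some of the states that only the overapproximation admits.

```python
from itertools import product


def majority(a, b, c):
    """Majority voter over three module outputs."""
    return (a & b) | (a & c) | (b & c)


def fault_visible(module_outputs, faulty_module):
    """True if flipping the output of one module changes the voter output."""
    golden = majority(*module_outputs)
    faulty = list(module_outputs)
    faulty[faulty_module] ^= 1
    return majority(*faulty) != golden


reachable = [(0, 0, 0), (1, 1, 1)]             # all modules agree
all_states = list(product((0, 1), repeat=3))   # overapproximation: no constraint

masked_on_reachable = not any(
    fault_visible(s, m) for s in reachable for m in range(3))
visible_on_overapproximation = any(
    fault_visible(s, m) for s in all_states for m in range(3))
```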
No full reachability analysis is required. Instead we apply lightweight approximations.
For S↓ we integrate partial reachability analysis into the formal analysis using the structure shown in Figure 4. The original circuit C is unrolled for t_i time frames and starts from a set of states R(0) known to be reachable, e.g. the reset state. Then, S↓ contains any state reachable from R(0) within t_i time steps as k is left unconstrained. The unrolling depth t_i controls the accuracy: S↓ remains an underapproximation of the reachable states when t_i is smaller than the state space diameter.
When leaving R(0) unconstrained, R(t_i) provides an overapproximation S↑ of the reachable states. But for our experiments we use a faster approach that assumes all states are reachable. Alternatively, in case of TMR circuits an invariant can force the states of the three redundant modules to be equal.
Of course, these light-weight approximations may be replaced by more elaborate approaches like the SAT-based procedure in [5] . In this case compactly representing S(0) is crucial.
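The light-weight underapproximation can be sketched with explicit state sets. The Python code below is an illustration under stated assumptions: it enumerates states explicitly instead of unrolling the circuit inside the SAT instance, and the modulo-6 counter with an enable input is a hypothetical example circuit. All states reachable from the reset state within at most t_i steps are collected, which gives S↓; taking all eight encodings of the three state bits corresponds to the overapproximation used in the experiments.

```python
def under_approximation(reset_states, step, input_values, t_i):
    """States reachable from `reset_states` within at most t_i time steps."""
    reached = set(reset_states)
    frontier = set(reset_states)
    for _ in range(t_i):
        frontier = {step(s, x) for s in frontier for x in input_values} - reached
        if not frontier:  # fixed point: the exact set of reachable states
            break
        reached |= frontier
    return reached


def counter_step(state, enable):
    """Hypothetical circuit: modulo-6 counter stored in three state bits."""
    return (state + 1) % 6 if enable else state


s_down_2 = under_approximation({0}, counter_step, (0, 1), 2)
s_down_10 = under_approximation({0}, counter_step, (0, 1), 10)
s_up = set(range(8))  # assume all encodings are reachable
```

With t_i = 2 only three of the six reachable states are covered, whereas t_i = 10 exceeds the diameter of this example and yields the exact set.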
ALGORITHM
This section provides an incremental algorithm to transform the calculation of bounds for the robustness into a sequence of SAT instances [7]. A SAT solver [9] is used to determine the solutions. Parameters for the algorithm are the circuit C, the set of states to be considered S_use and the size of the observation window t_d. The algorithm is based on the approach introduced in Section 2: The original circuit C and a copy C' are unrolled for an increasing number of time frames t ∈ [0 ... To analyze single faults, fault injection logic in time frame 0 is sufficient, which reduces the search space compared to [11]. This is valid because all states in S_use are considered as initial states of the formal analysis, i.e. S(0) = S_use. Also at most one fault predicate may take the value 1. The model supports faults in state elements as well as in combinational logic or at primary inputs.
Figure 5 shows the incremental algorithm that determines the lower and upper bound for the robustness. Once a component is classified, this information is used in the following iterations to reduce the run time. Given a circuit C, a copy C' with fault injection logic is created (Lines 2-5). Both copies are converted into Conjunctive Normal Form (CNF) [23] (Line 6). The initial states of both copies are forced to be equal (Line 7). The initial states S(0) for the formal analysis are restricted to S_use (Line 8).
Whether the algorithm computes exact bounds, a lower bound or an upper bound for the robustness depends on S_use as explained in Section 3.2. The number of fault predicates with value 1 is limited to one (Line 9).
Figure 5 (lines 6-14):
 6 convert to SAT instance;
 7 force init states of C_0 and C'_0 to be equal;
 8 force S(0) = S_use
 9 constrain ∑ p_g == 1;
10
11 T := ∅;
12 S := ∅;
13 U := all components g ∈ C_0;
14 t := 0;
Figure 6: Retrieving all solutions (lines 6-9):
 6 add constraint p_g = 0;
 7 done;
 8 return M;
 9 end function;
Then, the sets of robust (T), non-robust (S), and non-classified (U) components are initialized (Lines 11-13). In the beginning all components are non-classified. Next, the sets are incrementally updated for time frame t, starting at t = 0 up to t = t_d (Lines 14-44). As soon as all components are classified, i.e. U = ∅, the algorithm terminates. Fresh copies of C are appended to the unrolled circuits for t > 0 (Lines 16-20). A constraint forces C to behave fault free (Line 21). Additional logic compares the POs in time frame t (Line 22), where cmpPOs = 1 indicates a different value for fault free and faulty copy. Similarly, cmpFFs compares the states (Line 23).
Then the components S that can be classified as non-robust in time frame t are determined (Lines 25-27). The POs are forced to different values and the fault detection signal f is forced to 0 (Line 25). Each satisfying solution provides a component that is non-robust. The newly classified non-robust components S are returned by the subroutine extractAllSolutions shown in Figure 6 . The subroutine extracts one non-robust component per satisfying solution (Line 4) and forces the fault predicate of this component to 0 afterwards (Line 6).
The main routine in Figure 5 proceeds by removing the constraints on POs and f (Line 27). Next, the algorithm determines the remaining non-classified components U in a similar way (Lines 29-31). In case of non-classified components the constraints p_g = 0 are removed (Line 32) before the next iteration for t + 1 starts. Now, the newly classified set of robust components is available (Line 34). These components do not have to be considered in further iterations and their fault predicates are fixed to 0 (Line 35).
Finally, the sets T, S and U are updated by adding or assigning the newly classified components, t is increased, and the additional logic to compare POs and states is removed (Lines 37-43).
If non-classified components remain and t_d has not been reached, the next iteration starts (Line 15). Otherwise the algorithm terminates and returns the three sets T, S and U. As explained above, the parameter S_use determines whether exact values or approximations are returned.
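The overall classification loop can be mimicked on a toy circuit. The Python sketch below is an illustration under several assumptions and not the algorithm of Figure 5: explicit enumeration of initial states, input sequences and injected values replaces the incremental SAT calls, no fault detection signal is modeled (f is constantly 0), a component is treated as non-classified in time frame t if its fault can still make the states differ after that frame, and the circuit (a two-stage shift register with some masking logic) is hypothetical.

```python
from itertools import product

PIS = ("x",)
FFS = {"s1": "x", "s2": "s1"}  # state bit -> signal latched as next state
GATES = {                      # combinational gates in topological order
    "n": (lambda s1, x: s1 ^ x, ("s1", "x")),
    "m": (lambda n, s2: n & s2, ("n", "s2")),
    "y": (lambda s2, m: s2 | m, ("s2", "m")),  # y = s2 | (n & s2) = s2
}
POS = ("y",)
COMPONENTS = tuple(FFS) + tuple(GATES)
ALL_STATES = [dict(zip(FFS, bits)) for bits in product((0, 1), repeat=len(FFS))]


def frame(state, inputs, fault=None):
    """Simulate one time frame; `fault` is None or a pair (component, injected value)."""
    values = dict(inputs)
    values.update(state)
    if fault and fault[0] in FFS:
        values[fault[0]] = fault[1]
    for name, (fn, ins) in GATES.items():
        values[name] = fn(*(values[i] for i in ins))
        if fault and fault[0] == name:
            values[name] = fault[1]
    outputs = tuple(values[o] for o in POS)
    return outputs, {ff: values[src] for ff, src in FFS.items()}


def differences(init, seq, fault):
    """Per time frame: (POs differ, next states differ); the fault is injected in frame 0 only."""
    good, bad = dict(init), dict(init)
    result = []
    for t, bits in enumerate(seq):
        inputs = dict(zip(PIS, bits))
        out_good, good = frame(good, inputs)
        out_bad, bad = frame(bad, inputs, fault if t == 0 else None)
        result.append((out_good != out_bad, good != bad))
    return result


def classify(t_d, initial_states):
    """Return the sets (T, S, U) after observing time frames 0..t_d."""
    T, S, U = set(), set(), set(COMPONENTS)
    for t in range(t_d + 1):
        sequences = list(product(product((0, 1), repeat=len(PIS)), repeat=t + 1))
        runs = {g: [differences(init, seq, (g, v))[t]
                    for init in initial_states for seq in sequences for v in (0, 1)]
                for g in U}
        S_new = {g for g in U if any(po for po, _ in runs[g])}
        U_new = {g for g in U - S_new if any(st for _, st in runs[g])}
        T |= U - S_new - U_new
        S |= S_new
        U = U_new
        if not U:
            break
    return T, S, U
```

For this example, `classify(0, ALL_STATES)` leaves the state bit s1 non-classified because its fault has only reached the internal state, while `classify(1, ALL_STATES)` moves s1 to the non-robust set once the faulty value arrives at the output.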
OPTIMIZATION TECHNIQUES
The algorithm presented so far solves multiple sequential equivalence checking problems and sequential equivalence checking is a hard problem itself. Therefore, optimization is required to improve the performance.
Structural dominators are known to be helpful in many CAD algorithms. For example, the output of a fanout free region is a dominator for all nodes within the region. The notion of dominators is more general, though. A component g is dominated by a component e if any path from g to a primary output or state bit passes along e. Thus fault effects from g must propagate along e. If component e is robust, component g is robust as well. We determine dominators using the algorithm from [14]. Then, the algorithm to determine robustness runs in two steps. First, faults are only injected into components that dominate others. Second, the dominated components of non-robust and non-classified dominators are considered for a detailed classification. This speeds up the overall run time because the search space is pruned.
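The dominator relation can be computed on a small netlist graph as sketched below in Python. This is a simple recursive set intersection on an acyclic fanout graph and not the algorithm of [14]; the graph, the node names and the assumption that every node reaches an observation point (primary output or state bit input) are hypothetical choices for illustration.

```python
def dominators(fanout, sinks):
    """For each node, return the nodes that lie on every path to an observation point.

    fanout: dict mapping each node to its successor nodes (acyclic graph).
    sinks:  nodes that are observed directly (primary outputs, state bit inputs).
    """
    memo = {}

    def dom(node):
        if node not in memo:
            path_sets = [dom(s) for s in fanout.get(node, ())]
            if node in sinks:
                # Observed directly: no further node is required on this path.
                path_sets.append(frozenset())
            common = frozenset.intersection(*path_sets) if path_sets else frozenset()
            memo[node] = common | {node}
        return memo[node]

    return {node: dom(node) - {node} for node in fanout}


# a and b form a fanout free region with output c; d reaches two observation points.
FANOUT = {"a": ["c"], "b": ["c"], "c": ["y"], "d": ["y", "z"], "y": [], "z": []}
DOMINATED_BY = dominators(FANOUT, sinks={"y", "z"})
```

Here a and b are dominated by c and y, so a proof that c is robust also covers a and b, whereas d has no dominator and must be analyzed on its own.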
Instead of sequential equivalence checking, sequential Automatic Test Pattern Generation (ATPG) may also be used as the underlying engine. In this case one problem instance is created per fault that has to be considered. As an advantage, the size of the problem instance shrinks by only including those parts of the circuit that may be influenced by the particular fault. Moreover, similar to combinational ATPG, propagation constraints can be used to improve the performance of the engine [22, 8]. For an evaluation we used a SAT-based sequential ATPG engine. However, on the circuits considered the algorithm of Section 4 using equivalence checking was faster than the ATPG approach. This can be explained by the structural similarity of the problem instances. The algorithm of Section 4 creates one problem instance for all faults. This problem instance is kept and extended by further copies of the circuit until t_d is reached. Using the concept of incremental SAT [24], the proof engine keeps learned information for reuse in subsequent calls. Using ATPG, independent problem instances are created for all faults and similar information has to be learned from scratch. Therefore, we choose the algorithm based on equivalence checking for further experiments.
EXPERIMENTAL RESULTS
Experimental results are provided in the following. Our benchmark suite contains different types of sequential circuits, so that the results can be explained by considering the structure of the circuits:
• without fault tolerance,
• with Triple Modular Redundancy (TMR) and
• with fault detection.
The circuits without fault tolerance are taken from the ITC'99 benchmark suite and are referred to by their original names b01-b13. Using these circuits, fault tolerant TMR circuits were created. The circuit was
To create circuits with fault detection, the TMR circuits were extended with a signal f. While the states of the three instances are identical, no fault is detected, i.e. f = 0. Otherwise a fault is signaled, i.e. f = 1. Faults at the PIs do not activate f, because the TMR instances behave equivalently. Additionally, the Hamming model introduced in Example 1 provides fault detection. In all cases PIs, FFs and gates were considered as components.
All experiments were carried out on an AMD Athlon(tm) 64 X2 Dual Core CPU (3.0 GHz, 4 GB RAM, Linux). The SAT solver Chaff [17] with incremental SAT extension [24] has been used. The consideration of structural dominators improved the performance by a factor of up to 11 (see Section 5).
First, we analyze the influence of the observation window t_d on the robustness. Next the influence of constraints on S(0) and the quality of approximate bounds are evaluated and finally discussed. Note that a comparison to [11] is not given due to differing models, i.e. the bounds, the fault signal f and handling reachability.
Influence of t_d
Figure 7 visualizes the exact bounds retrieved for three example circuits using S* while extending the observation window. Note that the initial bounds are marked with t_d = −1.
As already discussed in Section 3, the exact value for the robustness of the Hamming model is retrieved at t_d = 6 where lower and upper bound converge. In case of b05 and b07 the bounds approach each other quite rapidly in the beginning, but do not meet within 10 time frames. While 15% of the components cannot be classified for b05, only 2% remain non-classified for b07. This shows that the convergence behavior of the bounds significantly depends on the design. The incremental algorithm may be stopped as soon as the bounds are close enough or no progress can be observed.
Influence of t_i
Table 2 summarizes further results. The overapproximation S↑ does not constrain S(0). For S↓ the approach shown in Figure 4 was used, R(0) denotes the reset state and t_i was set to 0, 1, or 10, respectively. BDD-based exact reachability analysis provides S*.
Columns |C| and |FF| give the number of components and state holding elements in the circuit, respectively. Column t_d denotes the length of the observation window required to classify all components. At t_d = 10 the algorithm has been stopped. Columns |U| and time give the number of non-classified components and the run time in seconds for computation, respectively.
When t_i is increased, more reachable states become observable at S(0) and thus the accuracy of the approximate upper bound increases. For example, the results in Table 2 show a significant improvement for b08 and b10 between t_i = 1 and t_i = 10. But the overhead for computing reachable states "on-the-fly" grows with t_i, i.e. more run time is required.
Additionally, for TMR circuits the number of components known to be non-classified increases. Some components classified as robust for small values of t_i become non-classified, e.g. when a fault in one of the three TMR instances does not affect the output behavior but changes the state.
The TMR circuits with fault detection mechanism immediately detect the error in one of the instances and thus an early classification as robust is possible. Already for t_i = 0 the upper bound is nearly correct and requires far less computation time in comparison to TMR circuits without fault detection. Only components close to PIs remain non-classified. Therefore t_d has to be increased while the faulty values are not observable at the POs.
Quality of approximate bounds
For non-TMR circuits, the bounds R_lb(S↑) and R_ub(S↓) with t_i = 10 are close to the exact ones R_lb(S*) and R_ub(S*). Only for b05 and b13 the exact bounds differ significantly. Here, increasing t_i or t_d may help to improve the accuracy of the approximate bounds.
For TMR circuits the distance between lower and upper bound is large. The lower bound R_lb(S↑) is not tight enough, because the initial states of the TMR instances are allowed to differ. Either an exact analysis or a manual invariant is required to improve the accuracy. Experiments using an invariant to force identical initial states yielded an almost exact lower bound R_lb(S↑). Due to page limitation no details are reported here.
The approximate bounds for circuits with fault detection are close to the exact bounds. Constraining the specification to fault free behavior by setting f = 0 in C_t forces the submodules to start in identical states. Consequently, the lower bound becomes more accurate and an exact analysis is not required.
Discussion
As shown in Table 2 , BDD-based reachability analysis often exceeds the run time limitation. Thus, no exact information about fault tolerance can be computed. Our approximation algorithm still provides a partial classification of components giving insight about fault tolerance of the circuits.
For both the approximate and the exact bounds, non-classified components are left for some circuits. Here, a large difference between upper and lower bound due to non-classified components always points to potential latent undetected errors that do not yet materialize in the output response. Such SDC may lead to faulty output responses in combination with other faults. Thus, the knowledge about the non-classified components is mandatory.
In summary, the main focus of the experiments is on the formal model and the formal algorithm to determine robustness. The proposed model fits practical needs by restricting the observation window. The approximate bounds provide information about fault tolerance even when an exact analysis cannot be applied. For some circuits run time is a bottleneck. Here, providing a set of manual invariants decreases the run time and increases the accuracy.
CONCLUSIONS
We presented an approach to formally prove the robustness of a circuit. The algorithm works on a bounded number of time steps and determines a lower bound and an upper bound on the robustness. Approximate sets of reachable states are sufficient to determine these bounds. An incremental algorithm and additional optimization techniques are provided. The results show that even if only a small number of time steps is considered, an exact value of the robustness can often be obtained. Otherwise a subset of non-robust and robust components is provided that can be used for further design modifications. Especially for circuits with a fault detection mechanism, accurate values are determined efficiently.
Future work focuses on improving the effectiveness for very large circuits. Possible directions are combining multiple engines like simulation, sequential ATPG and the proposed algorithm within a single framework, or using abstraction and hierarchical information. Moreover, assessing robustness in the presence of multiple faults remains an open issue.
