We propose a non-intrusive methodology for concurrent fault detection in FSMs 
Introduction
Concurrent test methods provide circuits with the ability to self-examine their operational health during normal functionality and indicate potential malfunctions. While such an indication is highly desirable, designing a concurrently self-testable circuit which, at the same time, conforms to the rest of the design specifications is not a trivial task. Issues to be addressed include the hardware cost and design effort incurred, potential performance degradation due to interaction between the circuit logic and the concurrent self-test logic, as well as the level of assurance required.
In this paper, we focus our interest on Finite State Machines (FSMs) and we explore the trade-offs between the aforementioned parameters, in order to devise a non-intrusive design method for concurrent fault detection. Non-intrusiveness implies that hardware may only be added in parallel to the given FSM, which is encoded, optimized, and implemented to meet specific requirements and may not be modified. The additional logic is expected to detect all circuit faults. Moreover, self-test has to be performed concurrently with the operation of the FSM and may not degrade it.
In concurrent test, additional hardware is added to the circuit in order to monitor its inputs during normal operation and generate an a priori known property that is expected to hold for the circuit output. A property verifier is subsequently utilized to identify and indicate any violation of the expected property, thus detecting potential circuit malfunctions. An important requirement in concurrent test is that the normal operation of the circuit may not be interrupted by false alarms; in other words, the concurrent test output indicator of the property verifier may not be asserted unless a malfunction is detected in the circuit.
* The author is supported in part through NSF grant CCR-9820850.
The simplest approach is to duplicate the circuit, thus imposing an identity property between the original circuit output and the replica output, which may be simply examined by a comparator operating as the property verifier. With the exception of common-mode failures, duplication will immediately detect any error in the circuit. However, it incurs significant hardware overhead that exceeds 100% of the original circuit, which may or may not be justifiable depending on the application that the circuit is intended for. While expensive schemes such as duplication detect all functional errors, simpler properties detecting only structural faults in a prescribed fault model exist. For example, the method proposed in [1] reduces the functionality of the duplicate so that it only predicts the output of the circuit for a set of test vectors adequate to detect all stuck-at faults. The latter, however, allows functional errors to go undetected until the structural fault that causes them is eventually detected. The concept of fault detection latency, the time difference between the appearance of an error and the detection of the causing fault, is thus introduced.
Since electronic circuits are employed in a wide spectrum of applications, ranging from mission-critical to simple commodity devices, concurrent test methods of various cost and efficiency are required. Related work is reviewed in section 2, and SPaRe, a concurrent fault detection method based on selective partial replication, is proposed in section 3. Experimental results in terms of hardware overhead, fault coverage, and fault detection latency are provided in section 4.
Related Work
Related research efforts in concurrent test can be roughly classified in one of the following two categories:
Concurrent Error Detection (CED): Approaches in this category require that all functional errors be detected with zero (or very small, bounded) latency. Duplication is the simplest CED method, limited however by its expensive hardware overhead. Reducing the area overhead below the cost of duplication typically requires redesign of the original circuit, thus leading to intrusive methodologies. One of the first successful attempts along this direction is described in [2], where resynthesis is employed to favorably encode the circuit, incorporating parity information and employing TSC checkers. Structural limitations requiring an inverter-free design were alleviated in [3], where a single parity bit and partitioning are employed. Multiple parity bits are used in [4]. While these methods are intrusive, they render totally self-checking circuits, guarantee zero latency, and typically provide hardware savings in the range of 15% over duplication.
Concurrent Fault Detection (CFD):
Approaches in this category require that for every structural fault in a prescribed fault model there exists at least one input combination that will detect the fault. Yet it is not guaranteed that every input combination that activates a fault will also detect it. While fault detection latency is, thus, introduced, this relaxation allows for significant hardware savings of CFD over CED methods. Among the few existing CFD schemes, properties specific to non-linear adaptive filters are used in [5] , achieving a 30% cost reduction with near-zero latency. Frequency response of linear filters is used as an invariance property in [6] , achieving a 50% cost reduction but introducing significant latency. Finally, a CFD approach exploiting transparency of RTL components is described in [7] , achieving over 90% fault security with 40% hardware overhead.
Proposed Method
SPaRe, a CFD method based on Selective Partial Replication, is proposed in this section. The key idea supporting SPaRe is presented through a small example, followed by an extensive description and analysis of the proposed method.
Motivation
Consider the 2-bit Up/Down Counter described in the table of figure 1. If the objective is to detect all errors occurring during normal operation, the duplication-based CED scheme shown on the left side of the figure will achieve this by comparing the two outputs of the FSM to the two outputs of its replica. If, however, the objective is to detect all faults, allowing possible fault detection latency, it is not necessary to compare both FSM outputs at every clock cycle. When we implemented the counter we noticed that by observing only one bit per state transition (shown in boldface in the table of figure 1), we detect all faults. Therefore, for the purpose of CFD it is sufficient to replicate the FSM only partially, appropriately selecting which bits to predict for each state transition in order to detect all faults. Partial FSM replication implies cost reduction over duplication.
This observation is the basis for the SPaRe methodology which is shown on the right side of figure 1 for the 2-bit Up/Down Counter. A combinational prediction logic is used to implement the 1-bit function that generates for each state transition the value shown in boldface in the table of figure 1. This value is stored in a D Flip-Flop and compared to the corresponding bit of the FSM state register one clock cycle later. A MUX is used to drive the appropriate FSM output to the comparator. The select line of the MUX is driven by a function of the previous state and the inputs of the FSM, in this case a simple XOR between PS1 and PS0, delayed by one clock cycle. All faults in the next state logic are, thus, detected. Additionally, by postponing the comparison by one clock cycle, faults in the state register are also detected.
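The checking scheme for the counter can be illustrated with a small behavioural simulation. This is only a sketch under stated assumptions: the table of figure 1 is not reproduced here, so the mapping from the select value (PS1 XOR PS0) to the observed bit index is assumed, the fault is modelled as a hypothetical function that corrupts the next-state word, and the one-cycle delay of the comparison is not modelled. The names `first_alarm` and `stuck_at_0_on_bit1` are illustrative.

```python
def bit(value, index):
    """Return bit `index` of `value` (index 0 is the least significant bit)."""
    return (value >> index) & 1


def next_state(ps, up):
    """Fault-free 2-bit up/down counter: count up if `up` is 1, else down."""
    return (ps + 1) % 4 if up else (ps - 1) % 4


def first_alarm(inputs, fault=None, start=0):
    """Return the first cycle at which the checker fires, or None.

    `fault` is a hypothetical behavioural fault model: a function that
    corrupts the next-state word produced by the FSM under test.
    """
    ps = start
    for cycle, up in enumerate(inputs):
        # Select line: XOR of the previous-state bits, as in the text.
        # Assumption: its value is used directly as the observed bit index.
        sel = bit(ps, 1) ^ bit(ps, 0)
        predicted = bit(next_state(ps, up), sel)  # 1-bit prediction logic
        ns = next_state(ps, up)                   # FSM next-state logic
        if fault is not None:
            ns = fault(ns)
        if bit(ns, sel) != predicted:             # comparator
            return cycle
        ps = ns
    return None


def stuck_at_0_on_bit1(ns):
    """Hypothetical fault: state bit 1 is stuck at 0."""
    return ns & 0b01


print(first_alarm([1, 1, 1, 1]))                      # fault-free run
print(first_alarm([1, 1, 1, 1], stuck_at_0_on_bit1))  # faulty run
```

In the fault-free run the predicted bit always matches, so no false alarm is raised; in the faulty run the alarm fires at the first transition in which the corrupted bit is the one being observed.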
SPaRe: Selective Partial Replication
The optimization objective of SPaRe is to minimize the output width of the prediction logic. Based on the observation that a subset of output bits per state transition is typically sufficient to detect all faults, SPaRe aims at identifying a minimal such set. The general version of SPaRe is depicted in figure 2. For every (n + k)-bit input combination, the prediction logic generates $\ell$ outputs that match a subset of $\ell$ out of the k FSM outputs. A Selection Logic is required to choose which FSM outputs to drive to the comparator for each (n + k)-bit input combination. Comparison is delayed by one clock cycle to also detect faults in the state register.
Success of SPaRe relies on efficient solutions to two key issues: identification of appropriate output values to be replicated by the prediction logic and cost-effective selection of circuit outputs to which they should be compared. Regarding the first issue, an ATPG tool capable of generating all test vectors and reporting both the good and faulty circuit outputs for every fault in the combinational next state logic is required. This information indicates the faults that can be detected at each output for each input vector and may be used to construct a matrix similar to the one shown in figure 3. SPaRe seeks a set of columns that covers all faults, such that the maximum number of output bits to be observed for any input vector is minimized. However, the exact selection of columns directly impacts the cost of the Selection Logic. More specifically, since the prediction logic only generates an $\ell$-bit function, additional logic is necessary to select among the k circuit outputs to which the predicted bits will be compared. As shown in figure 2, this can be viewed as $\ell$ k-to-1 MUXes, each of which requires log k address bits. Therefore, if we allow any possible subset of size $\ell$ for every (n + k)-bit input combination, the Address Logic will generate $\ell \cdot \log k$ (n+k)-input functions. Relative to duplication, the cost of the Prediction Logic is linear in $\ell$; the cost of the Selection Logic, however, increases almost linearly in $\ell$ up to $\ell = k/2$, at which point it starts decreasing, eventually becoming zero at $\ell = k$. Therefore, if $\ell > k/(\log k + 1)$, that is, if $\ell(\log k + 1) > k$, the total size of the Selection Logic and the Prediction Logic exceeds the cost of duplication. Imposing such an upper bound on $\ell$ could significantly reduce the fault coverage of this scheme. Instead, we impose restrictions on the complexity of the Address Logic and, by extension, on the acceptable solutions on the matrix of figure 3.
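The break-even point against duplication can be checked with a short calculation. The sketch below simply counts (n+k)-input functions: one per predicted bit plus log k address functions per MUX, against k functions for duplication. It treats every function as equally expensive and ignores the decrease of the Selection Logic cost beyond k/2, so it illustrates the bound only; the function names are hypothetical.

```python
import math


def spare_function_count(l, k):
    """(n+k)-input functions with unrestricted addressing:
    l prediction functions plus l * log2(k) address functions."""
    return l * (1 + math.ceil(math.log2(k)))


def exceeds_duplication(l, k):
    """Duplication needs k functions, one per next-state bit."""
    return spare_function_count(l, k) > k


k = 8  # example width; here k / (log k + 1) = 2
for l in range(1, k + 1):
    print(l, spare_function_count(l, k), exceeds_duplication(l, k))
```

For k = 8 the count stays within the cost of duplication only up to two predicted bits, which is why the unrestricted Address Logic is abandoned in favor of driving the MUX select lines directly.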
SPaRe eliminates the Address Logic altogether, requiring that the log k select inputs of each multiplexer be driven directly by log k of the (n + k) previous state and input bits. The form of acceptable solutions under this additional constraint, as well as a selection algorithm for identifying an appropriate set of columns that detects all faults, are discussed in the following section.
Selection Algorithm
We focus on the next state logic of the FSM, which, given a previous state and an input, generates the next state. The inputs to this component are $I_1, \ldots, I_k$ (the previous state) and $I_{k+1}, \ldots, I_{n+k}$ (the FSM inputs). The outputs of this component are $O_1, \ldots, O_k$ (the next state). We denote the set of the $2^{n+k}$ possible previous state/input combinations by V.
Assume for the moment that we are given the matrix of figure 3, say A. Recall that SPaRe eliminates the ADDRESS LOGIC component of figure 2. For simplicity, we assume that two specific input bits, denoted by $I_1$ and $I_2$, drive all MUXes and also that $\ell$ is given. As a result, each MUX selects only among four of the FSM outputs; we remove this assumption promptly. Thus, the SELECTION LOGIC component of the diagram is fully specified. The SELECTION LOGIC splits the input vectors into 4 disjoint groups, each corresponding to a possible value for the pair $I_1 I_2 \in \{00, 01, 10, 11\}$; for all vectors in each group the same output bits are observed at the output of the SELECTION LOGIC. We denote the groups by $G_1, G_2, G_3, G_4$. We now state the problem formally: given A, the groups $G_1, G_2, G_3, G_4$, and $\ell$, pick $\ell$ output bits for each group so that the number of covered faults is maximized.
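The partition of the input vectors induced by the addressing bits can be sketched as follows. The representation is an assumption made for illustration: vectors are tuples of 0/1 values, address bits are given as 0-based positions, and `split_into_groups` is a hypothetical helper name.

```python
from itertools import product


def split_into_groups(num_bits, address_bits):
    """Partition all `num_bits`-bit vectors by their values on `address_bits`.

    Returns a dict mapping each address value (a tuple of bits) to the list
    of vectors that share it; c address bits yield 2**c disjoint groups.
    """
    groups = {}
    for vector in product((0, 1), repeat=num_bits):
        key = tuple(vector[i] for i in address_bits)
        groups.setdefault(key, []).append(vector)
    return groups


# Example: n + k = 4 bits, with the first two bits driving the MUXes.
groups = split_into_groups(num_bits=4, address_bits=(0, 1))
for key, members in sorted(groups.items()):
    print(key, len(members))
```

Every vector falls in exactly one group, so a choice of output bits per group fully determines what the comparator observes for each previous state/input combination.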
Prior to presenting an algorithm to solve the above problem, we revoke the simplifications we made earlier, starting with the assumption that $\ell$ is given. In practice, we seek the minimum $\ell$ for which we can detect, e.g., 99% of the faults. Finding such an $\ell$ is straightforward; since $1 \le \ell \le k$, use binary search and solve the above problem log k times. We also assumed that the addressing bits ($I_1$ and $I_2$) were given; in practice we try all possible 2-bit addressing schemes ($\approx (n+k)^2/2$). If we were to use c > 2 bits to feed the MUXes, the number of possible addressing schemes increases; however, since we only allow up to log k addressing bits, it is always a small number. We note that in this case the number of groups would increase to $2^c$ instead of 4. Finally, we assumed that A is fully constructed; obviously, for large circuits, time/space constraints render this assumption infeasible. Thus, in large circuits, the following strategy is employed: for every fault, generate a large number (say r) of input vectors detecting it. Thus, assuming m faults in our circuit, at most mr vectors are generated. We subsequently identify the faults detected by each of these vectors, construct an m × mr matrix $\tilde{A}$ and solve the aforementioned problem in $\tilde{A}$ instead of A. Generally, $\tilde{A}$ admits less efficient solutions than A; as r increases the two solutions converge.
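The binary search over the number of predicted bits can be sketched as below. It assumes that the achievable coverage does not decrease as more bits are observed, and it takes a hypothetical callback `coverage_for` that solves the selection problem for a given number of bits and returns the fault coverage as a fraction.

```python
def minimal_predicted_bits(k, coverage_for, target=0.99):
    """Smallest number of bits in [1, k] whose coverage reaches `target`.

    Assumes coverage_for is nondecreasing. Returns None if even observing
    all k bits misses the target.
    """
    if coverage_for(k) < target:
        return None
    lo, hi = 1, k
    while lo < hi:
        mid = (lo + hi) // 2
        if coverage_for(mid) >= target:
            hi = mid          # mid suffices; try fewer bits
        else:
            lo = mid + 1      # mid is too few bits
    return lo


# Toy coverage figures for k = 6 (illustrative values, not measured data).
toy = [0.70, 0.90, 0.97, 0.992, 0.999, 1.0]
print(minimal_predicted_bits(6, lambda bits: toy[bits - 1]))
```

Because the selection algorithm is randomized, monotonicity holds only approximately in practice, so the result of such a search is best read as an estimate.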
The size of the solution space for the above problem, assuming that $\ell$ and c are fixed, is $\binom{n+k}{c} \binom{k}{\ell}^{2^c}$. If $\ell$ and c are small constants, the size of the solution space is polynomial in both n and k. In practice, though, $\ell$ might be close to k/2, in which case the size of the solution space grows exponentially in k and it is impossible to explore it exhaustively. To understand its size, if n = 2, k = 6, $\ell = 3$ and c = 2 there are $4.5 \cdot 10^6$ possible solutions, while, if c = 3, there are more than $14 \cdot 10^{11}$ possibilities. Thus, we describe an algorithm to explore the space of possible solutions efficiently; given infinite time, our algorithm explores the whole solution space. In practice, we explicitly limit its running time. We should also note that it is not necessary to drive all MUXes with the same input bits; indeed, better fault coverage might be achieved by using different bits. Thus, the solution space is even larger, since the number of possible addressing schemes increases.
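The two figures quoted above can be reproduced with a direct count. The counting used here is an assumption that happens to match both numbers: choose c of the n + k bits as addressing bits, then choose the observed output bits independently for each of the 2^c groups. The function name is hypothetical.

```python
from math import comb


def solution_space_size(n, k, l, c):
    """Number of (addressing scheme, per-group bit selection) combinations."""
    addressing_schemes = comb(n + k, c)    # which c bits drive the MUXes
    selections_per_group = comb(k, l)      # which l of the k outputs to observe
    return addressing_schemes * selections_per_group ** (2 ** c)


print(solution_space_size(n=2, k=6, l=3, c=2))
print(solution_space_size(n=2, k=6, l=3, c=3))
```

The first call gives roughly 4.5 million candidates and the second more than 1.4 trillion, which illustrates why exhaustive exploration is ruled out even for very small machines.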
Our algorithm is simple: it randomly decides which output bits to generate for each group of input vectors; we denote by $R_i$ the set of output bits that we generate for group $G_i$. Initially all the $R_i$ are empty. The algorithm essentially picks a group and decides which output bit to generate for this group; we decide which group to pick using biased sampling, favoring groups whose corresponding $R_i$ contains fewer elements. Biased sampling is also used to decide which output bit to include in $R_i$. We assign a score to every output bit not already included in $R_i$: this score reflects the significance of this particular output bit for fault detection.
Intuitively, the significance of an output bit is a function of the number of faults it detects, and, in particular, faults that are not detected by a large number of vectors in V. As an example, we tend to favor an output bit that detects 2 faults that no other input vector can detect over an output bit that detects 5 faults, each detected by 10 other input vectors as well. Every time an output bit is selected to be included in $R_i$, we remove all faults covered by that bit for any input vector in $G_i$. The above process is repeated until all $R_i$ contain exactly $\ell$ elements and the fault coverage is reported. If the result is unsatisfactory, we repeat the process until either a satisfactory result emerges or a fixed number T of iterations is exceeded; if the result is still unsatisfactory, we try a different addressing scheme. The SPaRe algorithm calls the BasicSPaRe algorithm with different $G_1, G_2, G_3, G_4$ until a target fault coverage is attained or the run time limit of the scheme is exceeded.
A brief note on x: while in our experiments a value of x = 1 returned acceptable solutions fast (typically, after trying at most 10 addressing schemes with T = 100), one could try different values of x to fine tune the algorithm. As an example, as x increases, our search becomes greedier: the output bit with the highest score is picked with very high probability. We prefer to present our algorithm using generic values for x; in practice, one could potentially use training data to learn the "best" value of x for the circuits at hand.
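One pass of the randomized selection can be sketched as follows. The exact sampling probabilities and the score expression are not available in this text, so the ones used here are assumptions chosen to match the verbal description: a group is picked with weight inversely related to how many bits it already has, and an output bit is scored by summing, over the new faults it would cover, the reciprocal of the number of observation points detecting each fault, raised to the power x. The data layout (a dict from (vector, output bit) to a set of faults) and the name `basic_spare_sketch` are also hypothetical.

```python
import random


def basic_spare_sketch(A, groups, k, l, x=1.0, rng=None):
    """Pick l output bits per group by biased sampling (requires l <= k).

    A: dict mapping (vector, output_bit) -> set of faults detected there.
    groups: list of lists of vectors. Returns (R, covered_faults).
    """
    rng = rng or random.Random(0)
    # How many observation points detect each fault; rare faults weigh more.
    detect_count = {}
    for faults in A.values():
        for f in faults:
            detect_count[f] = detect_count.get(f, 0) + 1

    R = [set() for _ in groups]
    covered = set()
    while any(len(r) < l for r in R):
        # (a) Biased pick of a group, favoring those with fewer chosen bits.
        open_groups = [i for i, r in enumerate(R) if len(r) < l]
        weights = [1.0 / (1 + len(R[i])) for i in open_groups]  # assumed form
        i = rng.choices(open_groups, weights=weights)[0]

        # (b) Score every output bit not yet in R[i] by its new coverage.
        candidates = [p for p in range(k) if p not in R[i]]
        gains = []
        for p in candidates:
            new = set()
            for v in groups[i]:
                new |= A.get((v, p), set()) - covered
            gains.append(new)
        # Small constant keeps every weight positive; assumed score form.
        scores = [(sum(1.0 / detect_count[f] for f in g) + 1e-9) ** x
                  for g in gains]
        j = rng.choices(range(len(candidates)), weights=scores)[0]

        R[i].add(candidates[j])
        covered |= gains[j]   # these faults no longer need to be covered
    return R, covered
```

A caller would repeat this pass up to T times per addressing scheme and keep the best coverage found; larger x concentrates the sampling on the highest-scoring bit, which makes the search greedier.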
Analysis
SPaRe is non-intrusive and, by construction, guarantees a pre-specified fault coverage; in our case, 98.5%. Furthermore, since SPaRe predicts and compares the appropriate portion of the circuit output for every state transition, no false alarm is possible. SPaRe introduces latency in the detection of an activated fault, which will remain undetected until an appropriate state transition is performed. We stress, however, that SPaRe checks for faults at every state transition; since most stuck-at faults are detected by many state transitions, we may conjecture that the average latency of SPaRe is small. In Section 4 we see that this conjecture is justified. We now outline the expected hardware overhead of SPaRe. The following statement relates the hardware (assuming multilevel implementation using 2-input gates) required to implement a function of n + k input bits and one output bit to the hardware required to implement a function of n + k input bits and k output bits ($k \ll 2^{n+k}$).
Remark 1 Almost all Boolean functions $f: \{0,1\}^{n+k} \to \{0,1\}^k$ require at least $k 2^{n+k}/(n+k)$ gates if the k output bits are uncorrelated.
Proof (sketch): We observe that the number of functions $f: \{0,1\}^{n+k} \to \{0,1\}^k$ is $2^{k 2^{n+k}}$. Thus, Shannon's counting argument [8] proves our statement.
Assume for the moment that the k output bits are uncorrelated; then, the minimum hardware required for SPaRe is $\ell/k$ times the minimum hardware required for the original circuit. We can only examine how the lower bound of the size of SPaRe behaves; indeed, tight bounds for circuit sizes are notoriously hard to prove even under stringent assumptions. In practice, the output bits of the PREDICTION LOGIC and the original FSM are correlated, otherwise some states of our FSM would be unreachable. It is not clear though that as correlation increases the ratio of the size of SPaRe over the size of the original circuit increases; one expects the size of SPaRe to decrease as correlation increases. In section 4 we observe that our predictions on the hardware overhead are quite accurate, even in the presence of correlation.
Algorithm BasicSPaRe
(a) Randomly pick one of the $R_i$, with probability [...]. Denote the one picked by $R_i$.
(b) Assign a score to each output bit $O_p \notin R_i$, p = 1, ..., k ($x \in \mathbb{R}$, usually x = 1). Score($O_p$) = [...]
Algorithm SPaRe
Input: A. Output: $R_1, R_2, R_3, R_4$ (initially empty).
Experimental Results
In this section, we compare SPaRe to duplication in terms of hardware overhead, fault coverage, and fault detection latency. In order to preserve generality, we employ random FSMs with $K = 2^k$ states and n inputs. We experiment with ten different types of (K, n) FSMs, where K is the number of states and n is the number of inputs.
Hardware Overhead
In terms of incurred hardware overhead, SPaRe and duplication differ in the following aspects: duplication employs a replica of the combinational next state logic of the original FSM, while SPaRe employs a prediction logic which generates fewer output bits. As a result, SPaRe uses a narrower state register and a narrower comparator than duplication. However, a few additional MUXes are employed in SPaRe, balancing the cost savings of these modules. Essentially, in order to compare SPaRe to duplication, it is adequate to compare the cost of the next state logic of the FSM to the cost of the prediction logic of SPaRe.
In order to obtain these costs, the next state function of the FSMs generated through the above process is converted to pla format, synthesized using the rugged script of SIS [9], and mapped to a standard cell library comprising only 2-input gates. Since the proposed methodology is non-intrusive, no assumptions are made as to how the FSMs are encoded or optimized. The hardware cost of the circuit is reported by SIS through the print_map_stats command and the circuit is then converted to ISCAS89 [10] format. ATALANTA [11] is used to generate all vectors detecting each fault, and HOPE [12] is employed to provide both the good machine and the bad machine responses for every (vector, fault) pair, revealing the output bits at which each fault may be detected for every vector. This information is used to construct the matrix A necessary for SPaRe, through which the prediction logic functions are identified. These functions are subsequently converted to pla format, synthesized using the rugged script of SIS [9], and mapped to a standard cell library comprising only 2-input gates. The cost of the prediction logic is reported by SIS [9] through the print_map_stats command and the circuit is converted to ISCAS89 [10] format.
The results are summarized in the table of figure 4. The cost of the next state logic is reported, along with the number $\ell$ of prediction logic bits generated through the algorithm of section 3.3. The percentage in the parenthesis indicates the expected hardware overhead, based on the analysis of section 3. The rightmost column provides the cost of the prediction logic as a percentage of the cost of the next state logic, indicating the hardware savings of SPaRe over duplication. As may be observed, the hardware overhead is, on average, 45% less than duplication. Furthermore, the average deviation between the expected overhead and the actual overhead is around 7%, implying that the ratio of predicted bits over next state logic bits is an accurate indication of incurred hardware overhead. We anticipate that this ratio will decrease further as the number of next state logic bits increases, thus resulting in even more savings.
Fault Coverage
In order to assess the effectiveness of the proposed method, we construct the FSM with SPaRe-based CFD and the FSM with duplication-based CED in ISCAS89 [10] format. The next state logic and the prediction logic are available from the hardware overhead experiment. Two copies of the next state logic, two state registers and a comparator are used for duplication. One copy of the next state logic, one copy of the prediction logic, a comparator and the MUXes for the selection logic are used for SPaRe.
Two experiments are performed employing these circuits. In the first experiment, we compare the number of faults in the original FSM detectable by SPaRe to those detectable by duplication. HITEC [13] is used to perform ATPG on the two constructed FSMs. In both ATPG runs only the faults in the original FSM are targeted and only the Test Output is made observable. The results are summarized in the table of figure 5. Duplication detects all testable faults in the original FSM, reported in the second row of the table. SPaRe, on the other hand, detects all faults that are covered in the solution provided by the algorithm of section 3.3. In our experiments, the threshold for algorithm termination was set to covering 98.5% of all faults. This is validated by ATPG, which yields an average fault coverage of 99% of all testable faults.
In the second experiment, we demonstrate the ability of SPaRe to also detect all testable faults in the hardware added for CFD. Two ATPG runs are performed using HITEC [13] on the FSM with SPaRe-based CFD, targeting all circuit faults. Both the test output and the original FSM outputs are made observable in the first ATPG run, while only the test output is made observable in the second ATPG run. The results are summarized in the table of figure 6. The number of faults missed by SPaRe in the tables of figures 5 and 6 is equal, indicating that all testable faults in the additional hardware are detected. On average, SPaRe-based CFD detects 99.4% of all testable faults.
Fault Detection Latency
The hardware savings achieved by SPaRe come at the cost of introducing fault detection latency. It is not possible to predict the exact latency of the method, since it depends on the values that appear at the FSM inputs during normal operation. Yet, an experimental indication of how much latency is introduced by SPaRe is necessary for its evaluation.
We measure fault detection latency based on fault simulation of randomly generated input sequences. More specifically, we use HOPE [12] to perform two fault simulations of the same sequence of randomly generated inputs, once observing both the test output and the FSM outputs, and a second time observing only the test output. The time step at which a fault is detected during the first fault simulation is the Fault Activation time, while the time step at which it is detected during the second fault simulation is the Fault Detection time. Fault Detection Latency is the time difference between Fault Activation and Fault Detection, therefore we can calculate the Fault Detection Latency for each fault, as well as the average Fault Detection Latency.
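The latency bookkeeping described above can be sketched with a few lines of post-processing. The input format is an assumption: each fault simulation is summarized as a dict from fault identifier to the time step of first detection, with faults that are never detected simply absent. The function name and the sample values are illustrative, not measured data.

```python
def latency_summary(activation, detection):
    """Summarize fault detection latency from two fault-simulation runs.

    activation: fault -> time step of first detection when both the test
                output and the FSM outputs are observed.
    detection:  fault -> time step of first detection when only the test
                output is observed.
    """
    latencies = [detection[f] - activation[f]
                 for f in activation if f in detection]
    missed = [f for f in activation if f not in detection]
    return {
        "activated_and_detected": len(latencies),
        "activated_but_missed": len(missed),
        "max_latency": max(latencies) if latencies else None,
        "avg_latency": sum(latencies) / len(latencies) if latencies else None,
    }


# Illustrative values only.
activation = {"a": 3, "b": 10, "c": 7}
detection = {"a": 3, "b": 25}
print(latency_summary(activation, detection))
```

Faults that are activated but not yet detected are counted separately and excluded from the average, matching the way the snapshots are reported per number of applied patterns.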
Worst-case results for each of the 10 different FSM types are summarized in the table of figure 7. We fault simulate a total of 5000 random patterns and snapshots of the results are shown after 10, 50, 100, 500, 1000, and finally all 5000 patterns are applied. For each snapshot, we provide the number of faults remaining non-activated, the number of faults activated and detected, and the number of faults activated but missed (not yet detected) by SPaRe. We also provide the maximum and the average fault detection latency for the faults that are both activated and detected. Figure 8 presents a plot of faults activated and faults detected by SPaRe on the (64, 3) FSM, as well as a plot of the average fault detection latency on the (64, 1), (64, 2), and (64, 3) FSMs versus the number of applied random patterns.
While the maximum latency is significant, ranging up to 2714 clock cycles for the (64,3) circuit, the average latency is small, ranging up to only 28.35 clock cycles, which is 1.05% of the maximum latency. Additionally, most faults are detected quickly and the typical 90-10 rule applies for the average latency. More specifically, 90% of the faults are detected within 50% of the average latency, while the other 50% is contributed by the remaining 10% of the faults. For example, once 500 random vectors are applied to the (64,3) circuit, 96.69% of the faults are activated and 93.67% are detected. The average fault detection latency at this point is 11.66, which is 41.12% of the average latency when all faults are detected. Furthermore, the plot of Figure 8(a) shows that the number of faults activated but not yet detected by SPaRe is constantly small. Finally, as indicated in the plot of Figure 8(b) , both the average and the maximum latency increase sub-linearly with the size of the circuit, guaranteeing scaling of SPaRe. Similar observations hold for all other circuits.
Conclusions
Cost-effective CFD requires careful examination of the trade-offs between the conflicting objectives of low hardware overhead, low fault detection latency, and high fault coverage. SPaRe explores the trade-off between fault detection latency and hardware overhead, under the additional constraint that the original circuit design may not be altered. Thus, a comparison-based approach is employed, where the next state logic of the original FSM is partially replicated into a prediction logic, selectively testing the circuit during normal operation. The problem of identifying cost-effective prediction logic functions is theoretically formulated and an algorithm for efficient selective partial replication is proposed. Experimental results demonstrate that SPaRe reduces the incurred hardware overhead by an average of 45% over duplication, while preserving the ability to detect more than 99% of the circuit permanent faults. Further reduction of this overhead is anticipated as the size of the circuit increases. While these savings come at the cost of introducing fault detection latency, the experimentally observed average latency is very low, ranging up to 28 clock cycles in the largest of our FSMs and scaling sub-linearly with the size of the circuit. In conclusion, when non-zero fault detection latency may be tolerated, SPaRe constitutes a powerful alternative to duplication.
