We discuss the problem of concurrent error detection (CED) 
Introduction
Considerable research effort has been devoted to concurrent error detection for both combinational and sequential circuits [1, 2, 3]. Proposed solutions explore the trade-offs between the three key parameters of this problem: achieved coverage, incurred overhead, and potential latency. The vast majority of existing approaches require that errors be detected with zero latency for combinational logic, or with a maximum latency of one clock cycle in sequential circuits so that faults in the bi-stable elements may also be detected. Consequently, such methods [4, 5, 6, 7, 8] attempt to reduce the incurred overhead by restricting the set of detectable errors based on realistic assumptions regarding potential malfunctions. Numerous methods whose primary objective is to incur very low overhead have also been proposed. In such cases [9, 10, 11], an unbounded latency between error occurrence and error detection is permitted, implying that errors may remain undetected indefinitely, thus reducing coverage. A third alternative, however, namely that of exploring the trade-off between latency and overhead while guaranteeing detection of a given set of errors, has received little attention.
In this work, we present a method for performing CED with bounded latency in FSMs. In essence, the objective of this approach is to reduce the overhead of CED by allowing a small delay between error occurrence and error detection. We emphasize, however, that the worst-case delay is bounded. Thus, while the hardware necessary for performing CED may be reduced, it is still possible to guarantee error detection within the specified latency. The efficiency of the method depends on the structure of the original FSM and the set of targeted errors. Therefore, we first study the problem of CED with bounded latency in FSMs in its general form and we derive a set of conditions that need to be met in order to provide error detection latency guarantees. We then demonstrate that this idea can be effectively implemented by extending an existing Parity-Based CED method for FSMs to accommodate bounded latency.
The proposed method is capable of detecting all errors resulting from any specified fault model. Such a fault model can be prescribed by providing the error-free response and all erroneous responses resulting from faults in the model for every transition in the FSM. Target fault models are expected to be restricted, in the sense that the set of resulting erroneous responses should be a subset of all possible circuit responses. Indeed, for an unrestricted fault model, information theory proves that any non-intrusive concurrent error detection circuit will be as complex as the original circuit [12]. When a restricted fault model is specified, however, more cost-effective solutions may be devised [13]. To our knowledge, the only previously proposed method that provides an upper bound to the detection latency is based on the use of convolutional codes. In this method, additional logic is used to generate key bits during every FSM transition, such that these keys are valid sequences in a convolutional code if and only if the FSM is operating correctly. The theoretical foundation for this method is described extensively in [14], yet no indication of its cost is provided. Unfortunately, for convolutional codes of latency more than one clock cycle, the method becomes cumbersome.
The paper is organized as follows. The general requirements for performing CED with bounded latency are discussed in section 2. A parity-based implementation of CED with bounded latency for FSMs is proposed in section 3, wherein the optimization problem of minimizing the number of required parity bits is also discussed. The Integer Program formulation of the parity bit minimization problem and the proposed solution based on Linear Program relaxation and Randomized Rounding are described in section 4. Experimental results on MCNC benchmark FSMs are provided in section 5, demonstrating that allowing a small, bounded latency can reduce the hardware cost while preserving the desired level of error coverage.
CED with Bounded Latency
In CED without latency, erroneous FSM transitions are detected immediately. Bounded latency, on the other hand, provides more freedom as to when to detect errors. Consequently, an erroneous transition may be ignored, as long as it is guaranteed that the causing fault will result in another error that will be detected within p clock cycles, where p is the specified latency bound. For this to be possible, we assume that a fault remains present for at least p clock cycles after causing an error. This assumption realistically reflects permanent faults and intermittent faults due to wear and tear. It may also reflect transient errors, if the error source has a continuous duration of at least a few clock cycles, which is the targeted latency bound. However, it does not reflect single event upsets (SEUs). For SEUs, concurrent error detection allowing bounded latency would either have to restrict the bound to one or use some form of memory, which increases the overall cost. One such solution is the method based on convolutional encoding, which is described in [14].
In order to omit immediate detection of an erroneous response, we need to enumerate all paths of length p, starting from the state where the error is initially activated. A CED mechanism should be capable of detecting the underlying fault in all such paths, yet not necessarily during the initial transition. Path enumeration, either explicit or implicit, is a costly procedure. However, since we only target a bounded latency of a few clock cycles, the exponential explosion is contained and the number of paths is manageable.
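The explicit path enumeration described above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary encoding `transitions[state][input_symbol]` and the function name `enumerate_paths` are hypothetical and not part of the original method description.

```python
from typing import Dict, List, Tuple

# Hypothetical FSM encoding (an assumption for this sketch):
# transitions[state][input_symbol] -> next state.
Transitions = Dict[str, Dict[str, str]]
Path = List[Tuple[str, str]]  # list of (input_symbol, next_state) pairs


def enumerate_paths(transitions: Transitions, start: str, p: int) -> List[Path]:
    """Return every path of exactly p transitions that leaves `start`."""
    frontier: List[Tuple[str, Path]] = [(start, [])]
    for _ in range(p):
        next_frontier: List[Tuple[str, Path]] = []
        for state, path in frontier:
            # Sort the input symbols so the enumeration order is deterministic.
            for symbol, nxt in sorted(transitions[state].items()):
                next_frontier.append((nxt, path + [(symbol, nxt)]))
        frontier = next_frontier
    return [path for _, path in frontier]
```

With up to 2^r input symbols per state, the number of paths grows as (2^r)^p, which is why the enumeration is only practical for a latency bound of a few clock cycles.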
By permitting latency in error detection, we anticipate simplification of the circuit necessary for implementing a CED method. In essence, latency relaxes the constraints in designing a CED circuit by allowing more flexibility as to when faults are detected. Unfortunately, overhead reduction due to latency reaches a saturation point, after which increasing the latency bound does not provide more choices. This happens because of loops during path enumeration. As soon as a loop occurs, enumeration along this path is terminated, since any additional latency increase will result in at least one path that expands along the loop. Detecting an error along this path implies detection of errors along all paths comprising the loop. Given a fault model, we can find the maximum latency of interest by finding the length of the shortest loop on each faulty FSM and selecting the largest value.
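A minimal sketch of the maximum-latency computation described above, assuming each faulty FSM is available as a directed state graph (a hypothetical adjacency-set encoding). The shortest loop is found by a breadth-first search from every state; the names `shortest_loop` and `max_latency_of_interest` are illustrative only.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set

# Hypothetical encoding: graph[state] is the set of successor states.
Graph = Dict[str, Set[str]]


def shortest_loop(graph: Graph) -> Optional[int]:
    """Length of the shortest directed loop, or None if the graph is acyclic."""
    best: Optional[int] = None
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            state = queue.popleft()
            for nxt in graph.get(state, ()):
                if nxt == source:
                    # States are dequeued in order of distance, so the first
                    # edge back to `source` closes its shortest loop.
                    length = dist[state] + 1
                    if best is None or length < best:
                        best = length
                    queue.clear()
                    break
                if nxt not in dist:
                    dist[nxt] = dist[state] + 1
                    queue.append(nxt)
    return best


def max_latency_of_interest(faulty_fsms: Iterable[Graph]) -> int:
    """Largest shortest-loop length over all faulty FSMs (acyclic ones skipped).

    Raises ValueError if no faulty FSM contains a loop.
    """
    lengths = [shortest_loop(g) for g in faulty_fsms]
    return max(length for length in lengths if length is not None)
```

A self-loop counts as a loop of length one, so a faulty FSM with a self-loop contributes a value of one to the maximum.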
Parity-Based Implementation
In this section, we develop a parity-based implementation of the general algebraic method [7] for CED in FSMs, where a set of parity trees performs lossless compaction of the circuit responses. Additional hardware is subsequently used to predict the compacted error-free responses and a comparator is employed to identify potential discrepancies between the output of the compactor and the output of the predictor. In an effort to reduce the incurred overhead, this method attempts to minimize the number of parity trees required for lossless compaction and, by extension, the size of the predictor and the comparator. We demonstrate how this method can be employed for performing parity-based CED with bounded latency in FSMs. The purpose of allowing latency is to reduce the overhead, i.e. the number of parity functions that need to be constructed in order to guarantee detection of all erroneous cases. The underlying principle for achieving this is that by allowing a small, bounded latency, we have more choices on how to detect each error and, therefore, a smaller number of parity functions might suffice. Under the assumption that the causing fault persists for at least p clock cycles after triggering an error, we do not necessarily need to detect the error during the first transition. Rather, we need to ensure that the selected parity functions are capable of detecting an effect of the fault along every possible path of length p, starting from the first erroneous state.
Problem Formulation
Consider the FSM with $r$ inputs, $s$ state bits, and $n - s$ outputs shown in Fig. 1. For every combination of a sequence of inputs $A = a_1, \ldots, a_p$, where $a_j \in \{0, \ldots, 2^r - 1\}$, $j \in \{1, \ldots, p\}$, and previous state $c$, $c \in \{0, \ldots, 2^s - 1\}$, any error caused by a fault $f$ will manifest itself as a difference between the sequence of error-free responses, $GM(A, c)$, and the sequence of erroneous responses produced in the presence of $f$.
During each of the $p$ FSM transitions, this difference is detectable in a set of state and output bits $b_i$, where $i \in \{1, \ldots, n\}$. The concatenation of these $p$ sets defines an Erroneous Case, $EC(A, c, f)$. Clearly, several combinations of transition sequences $(A, c)$ and faults $f$ may lead to the same erroneous case, i.e. the same $p$ sets of bits through which the effect of fault $f$ on transition sequence $(A, c)$ may be detected across the $p$ transitions. The union of all erroneous cases, $F = \bigcup_{\forall (A, c, f)} EC(A, c, f)$, can be represented in tabular format in an error detectability table, as shown in Fig. 2.

Detecting all circuit errors requires that at least one state/output bit in each erroneous case in $F$ be predicted through additional hardware and compared to its actual runtime value. To minimize the overhead, the number of predicted bits should be small. Yet, since faults on a state/output bit may only be detected on this bit, it is likely that all state/output bits will be included in the solution, leading to duplication. To overcome this limitation, we employ state/output compaction via parity trees. The key observation is that the parity (XOR) function of several state/output bits, an odd number of which detects an erroneous case, also detects the erroneous case. Therefore, it is possible that a small number of parity functions compacting the state/output bits will be adequate to cover all erroneous cases.
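As an illustration of how an erroneous case could be derived from simulated responses, the sketch below compares an error-free and an erroneous response sequence bit by bit. It assumes that each response is packed into an integer whose least significant bit is $b_1$; this encoding and the names `erroneous_case` and `detectability_table` are hypothetical choices made for this example.

```python
from typing import FrozenSet, Iterable, Sequence, Set, Tuple

# One erroneous case: for each of the p transitions, the set of bit indices
# (1..n) on which the error-free and erroneous responses differ.
ErroneousCase = Tuple[FrozenSet[int], ...]


def erroneous_case(good: Sequence[int], bad: Sequence[int], n: int) -> ErroneousCase:
    """Build the erroneous case for one (A, c, f) combination.

    Assumption: each response is an n-bit integer and bit b_1 is the LSB.
    """
    if len(good) != len(bad):
        raise ValueError("response sequences must have the same length p")
    case = []
    for g, b in zip(good, bad):
        diff = g ^ b  # a 1 marks a bit where the two responses disagree
        case.append(frozenset(i for i in range(1, n + 1) if (diff >> (i - 1)) & 1))
    return tuple(case)


def detectability_table(cases: Iterable[ErroneousCase]) -> Set[ErroneousCase]:
    """Union of all erroneous cases; identical cases collapse into one row."""
    return set(cases)
```

Because different (A, c, f) combinations may produce the same erroneous case, collecting the cases in a set keeps one row per distinct case.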
Using the information in the error detectability table, the optimization objective of our method is to minimize the number $q$ of parity bits that need to be constructed out of the next state/output bits $b_1$ through $b_n$, such that all Erroneous Cases are detected. An Erroneous Case is detected by a parity tree if and only if the parity tree comprises an odd number of bits $b_i$ that detect the Erroneous Case at some time-step between 1 and $p$. We note that the problem may be modelled as an NP-complete minimum cover problem, for which several heuristics exist [15]. However, expanding the error detectability table to explicitly represent all possible parity functions for each latency step is infeasible, since there is an exponential number of alternative parity combinations.
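The detection condition above can be checked directly on a set-based representation of the table. The following sketch assumes an erroneous case is stored as a sequence of $p$ sets of bit indices and a parity tree as a set of bit indices; both encodings and the function names are illustrative assumptions.

```python
from typing import AbstractSet, Iterable, Sequence


def parity_detects(parity_bits: AbstractSet[int],
                   case: Sequence[AbstractSet[int]]) -> bool:
    """True if the parity tree covers an odd number of detecting bits
    at some time-step of the erroneous case."""
    return any(len(parity_bits & step) % 2 == 1 for step in case)


def all_detected(parity_trees: Iterable[AbstractSet[int]],
                 table: Iterable[Sequence[AbstractSet[int]]]) -> bool:
    """True if every erroneous case is detected by at least one parity tree."""
    trees = list(parity_trees)
    return all(any(parity_detects(tree, case) for tree in trees) for case in table)
```

For example, a parity tree over bits {1, 2} misses an erroneous case whose first transition is detectable on bits {1, 2}, since an even number of its bits is affected, but it still detects the case if the second transition is detectable on bit {2} alone.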
The benefit of allowing bounded latency stems from the larger number of alternative ways to detect each Erroneous Case. This can be seen in the last row of the error detectability table of Fig. 2, where Erroneous Case m may be detected by more bits and combinations of bits across the p transitions than just in the first transition. Of course, it is also possible that for some Erroneous Cases latency will not provide any additional flexibility. This is the case, for example, for Erroneous Cases 1 and 2 in the table of Fig. 2. In the first case, the fault only affects the first transition and cannot be detected at a later time within p transitions. In the second case, the fault affects exactly the same bits in every one of the p transitions.
Based on the above observations, the proposed methodology is straightforward, as depicted in the form of a block diagram in Fig. 3. Given an FSM with r inputs, s state bits, and n − s outputs, XOR trees are employed to implement the q parity functions required for lossless state/output bit compaction. Combinational logic is employed to predict the values of the q bits that compact the n state/output bits for each FSM transition, and a comparator is used to identify any discrepancy. Similar to [16], registers are added to hold the output and the predicted values so that comparison is performed one clock cycle later, in order to also detect faults in the state register. Thus, all FSM errors in the restricted error model are detected within a latency of p clock cycles.
Proposed Algorithm
In this section, we demonstrate how to model the problem as an integer program; we subsequently use randomized rounding to identify feasible points, namely points satisfying all the constraints. Our integer program is an extension of the result in [17], which may be viewed as a special case of this section's formulation when latency is equal to one. We start by introducing some notation that will be useful throughout this section. Let $[x]$ denote the sequence $1, 2, 3, \ldots, x$ for any positive integer $x$. Assume that the FSM has a total of $n$ next state/output bits, denoted by $\{b_1, b_2, \ldots, b_n\}$ (see Fig. 1). We are also given a set of $m$ erroneous cases, denoted by $F = \{EC_1, EC_2, \ldots, EC_m\}$, and a target latency $p$, which is a positive integer.
The most important part of the input is a 3-dimensional array, which will be denoted by $V$. The dimensions of $V$ are $m \times n \times p$; we denote the $(i, j, k)$ element of $V$ by $V(i, j, k)$, and slices such as $V(:, :, k)$ and all other combinations are defined similarly. $V$ is a 0-1 array, defined as follows:

Definition 1 $V(i, j, k)$ is equal to 1 if and only if the erroneous case $EC_i$ is detected by bit $b_j$ with latency exactly $k$, where $i \in [m]$, $j \in [n]$, and $k \in [p]$.
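A minimal sketch of how the array $V$ might be populated from an error detectability table, assuming $V(i, j, k) = 1$ exactly when bit $b_j$ detects $EC_i$ at time-step $k$. The nested-list layout, the zero-based indexing, and the name `build_V` are assumptions made for illustration.

```python
from typing import AbstractSet, List, Sequence


def build_V(table: Sequence[Sequence[AbstractSet[int]]],
            n: int, p: int) -> List[List[List[int]]]:
    """Return V as nested lists indexed V[i][j][k] (all indices zero-based).

    table[i][k] is the set of bit indices (1..n) that detect erroneous
    case i+1 at time-step k+1.
    """
    V = [[[0] * p for _ in range(n)] for _ in range(len(table))]
    for i, case in enumerate(table):
        for k, step in enumerate(case):
            for bit in step:
                V[i][bit - 1][k] = 1  # shift the 1-based bit index to 0-based
    return V
```

The array has $m \cdot n \cdot p$ entries, so its size stays modest for the small latency bounds considered here.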
We remind the reader that, for boolean variables $x_1$, $x_2$, $x_1 \oplus x_2 = (x_1 + x_2) \bmod 2$. Hence, requiring $\left( \sum_{b_j \in \beta_\ell} V(i, j, k) \right) \bmod 2 = 1$ essentially means that the XOR of the bits in $\beta_\ell$ detects $EC_i$ with some latency $k \in [p]$. Thus, using $q$ parity bits (the XORs of the bits in $\beta_\ell$, $\ell \in [q]$) we can detect all erroneous cases.
We note that if we can solve the above problem in time T , then we may easily minimize q in T log n time: since 1 ≤ q ≤ n, we may perform binary search and find the optimal q, as shown in Algorithm 1.
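The binary search over $q$ can be sketched as follows. The feasibility test is passed in as a callable, `is_feasible`, which is a hypothetical stand-in for a solver of the fixed-$q$ problem. The sketch assumes that $q = n$ is always feasible and that feasibility is monotone in $q$.

```python
from typing import Callable


def min_parity_bits(n: int, is_feasible: Callable[[int], bool]) -> int:
    """Smallest q in [1, n] for which is_feasible(q) holds.

    Assumptions: is_feasible(n) is True, and if q parity bits suffice then
    so do q + 1. The oracle is called O(log n) times.
    """
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if is_feasible(mid):
            hi = mid        # mid parity bits suffice; try fewer
        else:
            lo = mid + 1    # mid parity bits are not enough
    return lo
```

If the feasibility test is itself randomized, a reported failure may be a false negative, in which case the value returned is an upper bound on the true minimum.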
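In the following, we will denote any subset of $\{b_1, \ldots, b_n\}$ by an $n$-dimensional binary vector $\beta^{(\ell)}$, whose $y$-th element is 1 if and only if $b_y$ belongs to the subset.

Statement 2 Given a positive integer $q$, find $q$ $n$-dimensional binary vectors $\beta^{(1)}, \ldots, \beta^{(q)}$ such that

$$\sum_{\ell=1}^{q} \sum_{k=1}^{p} \left[ \left( \sum_{y=1}^{n} \beta^{(\ell)}_y V(i, y, k) \right) \bmod 2 \right] \geq 1 \quad \forall i \in [m], \qquad \beta^{(\ell)} \in \{0, 1\}^n \quad \forall \ell \in [q],$$

or report the lack thereof.

In order to understand the above constraints, observe that if $\left( \sum_{y=1}^{n} \beta^{(\ell)}_y V(i, y, k) \right) \bmod 2$ is at least 1 for some $k \in [p]$, the XOR of the bits in $\beta^{(\ell)}$ detects the erroneous case $EC_i$ with latency at most $p$, where $i \in [m]$. Thus, in order for at least one of the $\beta^{(\ell)}$, $\ell \in [q]$, to detect $EC_i$, it is enough to satisfy the first constraint. We may now state our problem in matrix notation:

Statement 3 Given a positive integer $q$, find $q$ $n$-dimensional binary vectors $\beta^{(1)}, \ldots, \beta^{(q)}$ such that

$$\sum_{\ell=1}^{q} \sum_{k=1}^{p} \left[ \left( V(:, :, k)\, \beta^{(\ell)} \right) \bmod 2 \right] \geq \mathbf{1}_m,$$

or report the lack thereof, where $\mathbf{1}_m$ is an $m$-vector of 1s.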
Notice that $V(:, :, k)$ is a 2-dimensional array (a matrix of dimensions $m \times n$) that denotes the erroneous cases captured by the output bits with latency exactly equal to $k$.
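The matrix form of the coverage requirement can be verified numerically for a candidate set of parity vectors. The sketch below assumes $V$ is stored as nested lists indexed `V[i][y][k]` and that each parity function is a 0-1 list of length $n$; the helper names are hypothetical.

```python
from typing import List, Sequence


def detection_counts(V: Sequence[Sequence[Sequence[int]]],
                     betas: Sequence[Sequence[int]]) -> List[int]:
    """For each erroneous case, count the (parity vector, latency) pairs
    for which the mod-2 product of V(:, :, k) and beta equals 1."""
    m, n, p = len(V), len(V[0]), len(V[0][0])
    counts = [0] * m
    for beta in betas:
        for k in range(p):
            for i in range(m):
                counts[i] += sum(V[i][y][k] * beta[y] for y in range(n)) % 2
    return counts


def is_feasible_point(V: Sequence[Sequence[Sequence[int]]],
                      betas: Sequence[Sequence[int]]) -> bool:
    """True if every erroneous case is detected at least once."""
    return all(count >= 1 for count in detection_counts(V, betas))
```

A count of zero for any erroneous case means that none of the chosen parity functions detects it within the latency bound.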
We now remove the mod operator from the statement:
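Statement 4 Given a positive integer $q$, find vectors $\beta^{(\ell)} \in \{0, 1\}^n$, $r^{(\ell k)} \in \{0, 1\}^m$, and $w^{(\ell k)} \in \mathbb{Z}_{\geq 0}^m$, for $\ell \in [q]$ and $k \in [p]$, such that

$$V(:, :, k)\, \beta^{(\ell)} = r^{(\ell k)} + 2\, w^{(\ell k)} \quad \forall \ell \in [q],\ k \in [p], \qquad \sum_{\ell=1}^{q} \sum_{k=1}^{p} r^{(\ell k)} \geq \mathbf{1}_m,$$

or report the lack thereof.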
In order to understand the above constraints, observe that e.g. $r^{(1k)}$ is an $m$-dimensional 0-1 vector denoting whether $\{EC_1, \ldots, EC_m\}$ are detected by the XOR of the bits in the set represented by $\beta^{(1)}$ with latency $k$. We note that $w^{(1k)}$ is also an $m$-dimensional vector that removes the mod 2 operation. The sum of the $r^{(\ell k)}$ is, element-wise, at least one, thus guaranteeing that every erroneous case is detected.
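To illustrate how the auxiliary vectors remove the mod operator, the sketch below computes, for a given parity vector and latency $k$, a remainder vector `r` and a quotient vector `w` such that the integer product of $V(:, :, k)$ and the parity vector equals `r + 2 * w` element-wise. The nested-list layout of `V` and the name `decompose` are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def decompose(V: Sequence[Sequence[Sequence[int]]],
              beta: Sequence[int], k: int) -> Tuple[List[int], List[int]]:
    """Split the integer product of V(:, :, k) and beta into r + 2 * w.

    r[i] is 1 when the parity function detects erroneous case i at latency
    index k (zero-based); w[i] absorbs the even part of the sum.
    """
    r: List[int] = []
    w: List[int] = []
    for row in V:
        total = sum(row[y][k] * beta[y] for y in range(len(beta)))
        r.append(total % 2)
        w.append(total // 2)
    return r, w
```

Since `total` is a non-negative integer, `r` takes values in {0, 1} and `w` is a non-negative integer, so the pair satisfies a linear equality without any modular arithmetic.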
In statement 4, we described our problem as an integer program. Our goal is to find a feasible point; namely, values for all $r^{(\ell k)}$, $w^{(\ell k)}$ and $\beta^{(\ell)}$ (a total of $2qpm + qn$ variables) such that all the restrictions of statement 4 are satisfied. Identifying a feasible point for an integer program is NP-complete; we, therefore, employ a technique called randomized rounding [18] to solve it. The idea of randomized rounding is simple: solve the linear programming relaxation of the integer program (easily done in polynomial time) to obtain a fractional solution $\hat{x}$, and then round each of the $\hat{x}$ variables probabilistically to an integer value.

Raghavan et al. [18] argue that this simple algorithm identifies a feasible point with high probability, if one exists. In practice, we probabilistically round $\hat{x}$ a fixed number of times and verify that a solution is found.
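A minimal sketch of the repeated rounding step, under several assumptions that are not taken from the original text: only the relaxed parity-vector variables are rounded (each set to 1 with probability equal to its fractional value, the classical rounding rule), the remaining variables are implied by the rounded vectors, and the fractional solution is supplied by an external LP solver that is not shown. All function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def covers(V: Sequence[Sequence[Sequence[int]]],
           betas: Sequence[Sequence[int]]) -> bool:
    """True if every erroneous case is detected by some parity vector
    at some latency (odd overlap with the detecting bits)."""
    n, p = len(V[0]), len(V[0][0])
    return all(
        any(sum(row[y][k] * beta[y] for y in range(n)) % 2 == 1
            for beta in betas for k in range(p))
        for row in V
    )


def randomized_rounding(V: Sequence[Sequence[Sequence[int]]],
                        fractional_betas: Sequence[Sequence[float]],
                        trials: int, seed: int = 0) -> Optional[List[List[int]]]:
    """Round the relaxed parity vectors up to `trials` times.

    Returns the first rounded point that covers all erroneous cases,
    or None if no trial succeeds.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        # Each entry becomes 1 with probability equal to its fractional value.
        betas = [[1 if rng.random() < x else 0 for x in vec]
                 for vec in fractional_betas]
        if covers(V, betas):
            return betas
    return None
```

A return value of `None` does not prove infeasibility; it only indicates that none of the sampled roundings satisfied the coverage constraints.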
Experimental Results
The proposed methodology has been implemented and applied to several sequential MCNC benchmarks. After performing state assignment, the circuits are synthesized and mapped onto a standard-cell library using SIS [19]. Internally developed software employing fault simulation is used to identify the error-free and erroneous responses and to generate the error detectability table of Fig. 2. While the stuck-at fault model has been used as the source of errors, we emphasize that the method applies to any restricted error model, as discussed in section 2. Subsequently, Algorithm 1 is applied to compute the minimal number of parity functions for several values of latency p.
The results are summarized in Table 1. Under the first major heading, we provide details about the circuits that were used: name, number of inputs, state bits, outputs, gate count and the hardware cost reported by SIS. Under the second, third and fourth major headings, we provide the minimum number of parity functions required for complete fault coverage, the gate count and the hardware cost reported by SIS for latency p = 1, p = 2 and p = 3, respectively. The number of parity functions (hardware cost) for the basic parity-based method with latency p = 1 [17] for these examples is, on average, 53.00% (22.40%) smaller than the number of functions (hardware cost) necessary for duplicating the circuit. Addition of one more clock cycle, i.e. bounded latency p = 2, reduces the number of parity bits (hardware cost) by 11.70% (7.18%) over the number of parity bits (hardware cost) required for latency p = 1. Further increase of the bound to latency p = 3 yields an additional 7.23% (7.08%) reduction in the number of parity bits (hardware cost).
As discussed in section 2, the benefits of adding latency diminish as latency increases. In smaller FSMs, faults result in a large number of self-loops. This is the case, for example, for circuits donfile, s27, and s386. As the FSM size becomes larger, self-loops are less frequent and the benefits of increasing the detection latency are more significant. This is the case, for example, for circuits pma, s298, and s1488.
The reduction in the number of parity functions and the reduction in the hardware cost of the predictor are not necessarily proportional. This is the case, for example, for circuit dk16, where a latency of p = 2 reduces the number of parity functions by 16.67%, yet the hardware overhead is reduced by 19.94%. More surprisingly, the hardware overhead increases by 11.37% when the latency is increased from p = 2 to p = 3. A single complex parity function may require the same or more area than a larger number of simple parity functions. To the best of our knowledge, the literature lacks solutions that consider the actual area cost of parity functions as a metric in choosing which parity functions to select. In the absence of such methods, the most promising direction is to reduce the number of parity functions, anticipating that, on average, functions will incur the same area cost.
Conclusion
We introduced a technique that allows concurrent error detection with latency in FSMs. In order to preserve the level of attainable coverage, we bound the detection latency to a few cycles and we derive the necessary conditions to detect all possible errors within the specified latency period. Since a bounded delay is permitted in the detection of errors, the trade-off between latency and hardware required for concurrent error detection can be beneficially explored. In order to assess the effectiveness of this approach, we extended a latency-free parity-based method to perform error detection with bounded latency. We formulated the problem of minimizing the number of required parity bits as an Integer Program and we devised an algorithm based on Linear Program relaxation and Randomized Rounding to solve it. Experimental results indicate that, on average, the cost of the CED hardware decreases as latency is increased.
