Abstract: A formal model for the study of on-line diagnosis is introduced and used to investigate the diagnosis of unrestricted faults. Within this model a fault of a system $S$ is considered to be a transformation of $S$ into another system $S'$ at some time $\tau$. The resulting faulty system is taken to be the system which looks like $S$ up to time $\tau$ and like $S'$ thereafter. Notions of fault tolerance and error are defined in terms of the resulting system being able to mimic some desired behavior as specified by a system $\tilde{S}$. A notion of on-line diagnosis is formulated which involves an external detector and a maximum time delay within which every error caused by a fault in a prescribed set must be detected.
I. INTRODUCTION
In many applications, especially those in which a computer is being used to control some process in real time (e.g., telephone switching, flight control of an aircraft or spacecraft, etc.), it is desirable to constantly monitor the performance of the system, as it is being used, to determine whether the actual system is within tolerance of the intended system. Informally, by "on-line diagnosis" we mean a monitoring process of this type where the extent of the diagnosis depends on the meaning of "within tolerance." Thus, for example, if being within tolerance means having the same input-output behavior, then on-line diagnosis becomes on-line "detection." In the special case where the implementation of on-line diagnosis is completely internal to the system being diagnosed, it is referred to as "self-diagnosis" or "self-checking."
The incorporation of special hardware for the purpose of on-line diagnosis dates back to the relay computers developed by Bell Laboratories in the early-to-mid 1940's, where biquinary codes were used to dynamically check the operation of the computer [1]. A more general look at codes for checking logical operations was first taken by Peterson and Rabin in 1959 [2], where they showed that combinational circuits can vary greatly in their inherent on-line diagnosability. The use of coding techniques in the design of self-checking circuits was further explored by Carter and Schneider in 1968 [3] and by Anderson in 1971 [4]. In addition, a number of special on-line diagnosis methods have been considered which apply to specific hardware subsystems such as adders, counters, etc. (see [5], for example).
Given this background of techniques that have been proposed and, in many cases, used to improve the on-line diagnosability of a system, the following question arises quite naturally. With regard to any technique that might be employed, how complex must the diagnosing system be as compared to the system being diagnosed, if the latter is to be on-line diagnosable in some prescribed sense? To answer this question, one must, of course, designate the class of systems considered, the complexity measure, and the precise meaning of on-line diagnosis. In a first attempt, it appears reasonable to make these devices as general as possible in order to establish a framework for more incisive results that might follow.
Specifically, the systems we have chosen to consider are those which are representable as "discrete-time" systems when subjected to transient or permanent faults [6]. Such systems are generalizations of sequential machines and permit structure to vary as faults occur. As a measure of system complexity, we have chosen the number of reachable internal states. This measure reflects the memory capacity of a system and, without further restrictions on system structure, it is the only measure of structural complexity that has a reasonable interpretation. Finally, the concept of on-line diagnosis considered requires that any error caused by a fault be detected within some maximum allowable time delay.
Section II of the paper is concerned with the formal development of the notion of a discrete-time system and the associated concepts of fault, result of a fault, and error. Section III formalizes the above concept of on-line diagnosis and establishes an answer to the question posed above; namely, if no restrictions are placed on the potential faults of a system S, then the complexity of a detector D must be at least as great as that of S. Moreover, this result holds even when the allowed time delay for error detection is arbitrarily large. Section IV considers the on-line diagnosis of unrestricted faults for systems which have (delayed) inverses, that is, systems which are information lossless. Here it is shown that an inverse system can always be used for on-line diagnosis if it too is information lossless. Although the lossless condition is sufficient, it is shown further that there exist systems for which a lossy inverse can also be used for on-line diagnosis.
II. FAULTS AND ERRORS IN DISCRETE-TIME SYSTEMS
Informally, a discrete-time system is a causal, deterministic, finite-state system to which inputs (from a finite set) are applied at discrete instants of time and from which states and outputs (from a finite set) are observed at discrete instants of time. If, in addition, specific inputs are designated as "reset" inputs (used to initialize the system), then discrete-time systems can be formally defined as follows.
Definition 1: Relative to the time-base $T = \{\ldots, -1, 0, 1, \ldots\}$, a (resettable) discrete-time system (with finite input, output, and reset alphabets) is a system $S = (I, Q, Z, \delta, \lambda, R, \rho)$ where $I$ is a finite nonempty set, the input alphabet, $Q$ is a finite nonempty set, the state set, $Z$ is a finite nonempty set, the output alphabet, $\delta: Q \times I \times T \to Q$, the transition function, $\lambda: Q \times I \times T \to Z$, the output function, $R$ is a finite nonempty set, the reset alphabet, and $\rho: R \times T \to Q$, the reset function.
The first five elements, $I$, $Q$, $Z$, $\delta$, and $\lambda$, of a discrete-time system are the usual elements of a sequential machine but with $\delta$ and $\lambda$ generalized to account for possible variation of structure with time. The action of a reset $r \in R$ is described by $\rho$, the reset function, with the interpretation that if reset $r$ is applied at time $t - 1$ then the system will be in state $\rho(r,t)$ at time $t$. In the special case where $S$ is time-invariant we will adopt the usual terminology by referring to $S$ as a (resettable) sequential machine.
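As an illustration of Definition 1, the following minimal Python sketch represents a discrete-time system by its time-dependent transition, output, and reset functions. The class name `DiscreteTimeSystem`, the helper `run`, and the toggle example are hypothetical and not part of the paper. The sketch also assumes a Mealy-style convention in which the output at time $t$ is produced from the state and input at time $t$.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List, Sequence


@dataclass(frozen=True)
class DiscreteTimeSystem:
    """Time-dependent maps of a resettable discrete-time system (illustrative names)."""
    delta: Callable[[Hashable, Hashable, int], Hashable]  # delta(q, a, t) -> next state
    lam: Callable[[Hashable, Hashable, int], Hashable]    # lam(q, a, t) -> output symbol
    rho: Callable[[Hashable, int], Hashable]              # rho(r, t) -> state at time t


def run(system: DiscreteTimeSystem, r: Hashable, t: int, xs: Sequence[Hashable]) -> List[Hashable]:
    """Reset r is applied at time t - 1, so the state at time t is rho(r, t).

    The inputs xs are then applied at times t, t + 1, ... and the outputs collected.
    """
    q = system.rho(r, t)
    outputs = []
    for a in xs:
        outputs.append(system.lam(q, a, t))  # assumed Mealy-style output at time t
        q = system.delta(q, a, t)
        t += 1
    return outputs


# Hypothetical example: a toggle machine whose output function changes at time 3,
# i.e. the system looks like one sequential machine before time 3 and another after.
toggle = DiscreteTimeSystem(
    delta=lambda q, a, t: q ^ a,
    lam=lambda q, a, t: (q ^ a) if t < 3 else 0,
    rho=lambda r, t: 0,
)
outputs = run(toggle, "r0", 0, [1, 1, 1, 1, 1])
```

A time-invariant system (a sequential machine) is the special case in which none of the three functions depends on `t`.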
A particular discrete-time system can be viewed as a system which looks like some sequential machine $S_1$ in one time interval, like $S_2$ in another interval, and so on (see Fig. 1). Assuming familiarity with the concept of a sequential machine, with this view the more general concept of discrete-time system is easily understood. Moreover, as will be observed in the discussion that follows, discrete-time systems suffice to represent the structure and behavior of both "fault-free" and "faulty" digital systems in an on-line diagnosis environment.
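It will also be convenient to define the behavior of $S$ in state $q$, that is, the function $\beta_q: I^+ \times T \to Z$ where $\beta_q(x,t) = \bar{\lambda}(q,x,t)$ (with $\bar{\lambda}$ the natural extension of $\lambda$ to input sequences).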
Given a discrete-time system $S$, the reachable part of $S$ is the set $P = \{ q \in Q \mid q = \bar{\delta}(\rho(r,t), x, t) \text{ for some } r \in R,\ t \in T, \text{ and } x \in I^* \}$.
($\bar{\delta}$ denotes the natural extension of $\delta$ to $Q \times I^* \times T$.) $S$ is reachable if $P = Q$. $S$ is reduced if for all $q, q' \in P$, $\beta_q = \beta_{q'}$ implies $q = q'$. Concepts of simulation and realization that have been considered for sequential machines (see [7], for example) also extend easily to discrete-time systems. In particular, given two systems $S$ and $\tilde{S}$, $S$ realizes $\tilde{S}$ under $(g,h,k)$ if $g: (\tilde{I})^+ \to I^+$ is a semigroup homomorphism such that $g(\tilde{I}) \subseteq I$, $h: \tilde{R} \to R$, and $k: Z' \to \tilde{Z}$ where $Z' \subseteq Z$, such that for all $\tilde{r} \in \tilde{R}$ and $t \in T$,
$$\tilde{\beta}_{\tilde{r},t} = k \circ \beta_{h(\tilde{r}),t} \circ g$$
(where $\circ$ denotes left composition of functions). A pictorial representation of this notion is given in Fig. 2. A realization concept is quite useful when considering questions of diagnosability, for one often begins with a system specification $\tilde{S}$ which describes what the user wants but is not diagnosable. The solution is to find another system $S$ which is diagnosable and can realize the behavior of $\tilde{S}$ via the input encoding map $g$, the reset encoding map $h$, and the output decoding map $k$.
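For the time-invariant case (a sequential machine), the reachable part defined above can be computed by a breadth-first search from the reset states. The sketch below is illustrative only: the function name `reachable_states` and the example machine are hypothetical, and the time argument is dropped because time-invariance is assumed.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Set


def reachable_states(
    delta: Callable[[Hashable, Hashable], Hashable],
    rho: Callable[[Hashable], Hashable],
    inputs: Iterable[Hashable],
    resets: Iterable[Hashable],
) -> Set[Hashable]:
    """Return every state reachable from some reset state by some input sequence.

    The empty input sequence is allowed, so the reset states themselves are included.
    """
    inputs = list(inputs)
    frontier = deque(rho(r) for r in resets)
    seen = set(frontier)
    while frontier:
        q = frontier.popleft()
        for a in inputs:
            nxt = delta(q, a)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


# Hypothetical machine with state set {0, 1, 2, 3}: states 0..2 form a mod-3 counter,
# and state 3 is a trap that no reset or transition ever enters.
def example_delta(q, a):
    return (q + a) % 3 if q < 3 else 3


reachable = reachable_states(example_delta, lambda r: 0, inputs=[0, 1], resets=["r"])
is_reachable_machine = reachable == {0, 1, 2, 3}
```

The size of the returned set is the number of reachable internal states, which is the complexity measure adopted in the introduction.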
Given some discrete-time system $S$, let us now consider how faults effect changes in system structure. In general, if a fault occurs at some time $\tau$, $S$ will be transformed into some other system $S'$, and if $S$ is in state $q$ just before $\tau$ then $S'$ is in state $q'$ just after $\tau$. More formally, a fault of $S$ is a triple $f = (S', \tau, \theta)$ where $S'$ has the same input, output, and reset alphabets as $S$, $\tau \in T$, and $\theta: Q \to Q'$. The restriction on the input, output, and reset alphabets is reasonable since after the fault occurs the system will presumably have the same external terminals. The function $\theta$ describes the state transitions that result when the fault occurs. Note that the interpretation of fault here is one of effect, not cause. Thus, for example, if $S$ represents a switching network and some gate output $j$ becomes stuck-at-1 at time $\tau$, the fault is represented by the triple $f = (S', \tau, \theta)$ where $S'$ represents the network, as modified by a constant 1 at output $j$, and $\theta$ describes how this change affects the next state.
Given this interpretation, a formulation of the resulting faulty system is straightforward; more precisely, as follows. (Arguments not specified in the above definitions may be assigned arbitrary values.) A pictorial view of the result of f is presented in Fig. 3 .
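The informal description of a fault as a switch from $S$ to $S'$ at the fault time can be illustrated with a small simulation. This is only a sketch under stated assumptions and not the paper's formal definition of the result of a fault. It assumes that both machines are time-invariant, that the reset is applied so the system starts at time 0, that steps before the fault time use $S$, and that at the fault time the current state is mapped through the fault's state map before $S'$ takes over. The names `Machine` and `run_with_fault` are hypothetical.

```python
from typing import Callable, Hashable, List, NamedTuple, Sequence


class Machine(NamedTuple):
    """Time-invariant machine given by its three maps (illustrative container)."""
    delta: Callable[[Hashable, Hashable], Hashable]
    lam: Callable[[Hashable, Hashable], Hashable]
    rho: Callable[[Hashable], Hashable]


def run_with_fault(S: Machine, S_prime: Machine, tau: int,
                   theta: Callable[[Hashable], Hashable],
                   r: Hashable, xs: Sequence[Hashable]) -> List[Hashable]:
    """Outputs for reset r (state rho(r) at time 0) when the fault occurs at time tau."""
    if tau < 0:
        raise ValueError("this sketch only covers faults at or after time 0")
    q = S.rho(r)
    outputs = []
    for t, a in enumerate(xs):
        if t == tau:
            q = theta(q)  # assumed boundary convention: map the state at the fault time
        machine = S if t < tau else S_prime
        outputs.append(machine.lam(q, a))
        q = machine.delta(q, a)
    return outputs


# Hypothetical example: a toggle machine whose output line becomes stuck at 1.
good = Machine(delta=lambda q, a: q ^ a, lam=lambda q, a: q ^ a, rho=lambda r: 0)
stuck = Machine(delta=lambda q, a: q ^ a, lam=lambda q, a: 1, rho=lambda r: 0)

faulty_out = run_with_fault(good, stuck, 2, lambda q: q, "r", [1, 1, 1, 1])
fault_free_out = run_with_fault(good, stuck, 10, lambda q: q, "r", [1, 1, 1, 1])
```

In this example the two output sequences agree up to and including the fault time and first differ one step later, which is the kind of delay between fault and first error that the later definitions account for.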
Given the result $S^f$ of some fault $f$, the behavior of $S^f$ for initial condition $(r,t)$ (see (2.1)) can be conveniently formulated as follows. ($|x|$ denotes the length of sequence $x$.) The proof of Theorem 1 is a straightforward application of the general definition of behavior (2.1) to the faulty system $S^f$ given by Definition 2. Its utility is that it provides a formal means for comparing the behavior of a faulty system $S^f$ to that of the fault-free system $S$ or to that of some original specification $\tilde{S}$. In particular, we want to determine whether the behavior of $S^f$ is "within tolerance" of the specification $\tilde{S}$. The latter concept can be formalized as follows.
Let $\tilde{S}$ be a reduced, reachable specification of a time-invariant, discrete-time system (i.e., $\tilde{S}$ is a sequential machine) and let $S$ be a sequential machine that realizes $\tilde{S}$ under the functions $(g,h,k)$. (Our development at this point could be generalized to include time-varying systems. However, it seems reasonable to assume that the specification and desired fault-free realization are time-invariant.) We can assume further that $g$ and $h$ are onto, since the only input and reset symbols of concern in the realization $S$ are those which correspond to inputs and resets of $\tilde{S}$. Also, since $S$ and $\tilde{S}$ are time-invariant, it suffices to describe their behaviors for resets at time 0.
Accordingly, we will let $\tilde{\beta}_{\tilde{r}}$ and $\beta_r$ denote the behaviors $\tilde{\beta}_{\tilde{r},0}$ and $\beta_{r,0}$, respectively.
Given the above assumptions, we will say that a faulty system $S^f$ is "within tolerance" of $\tilde{S}$ or, alternatively, that the fault $f$ is "tolerated" if, behaviorally, $S^f$ relates to $\tilde{S}$ in the same way that $S$ relates to $\tilde{S}$. In other words, behaviorally, $S$ and $S^f$ can accomplish the same thing relative to $\tilde{S}$. (Note that although $S$ is presumed time-invariant, in general, $S^f$ will not be.) More formally, if $f$ is a fault of machine $S$, then $f$ is tolerated if, for all $\tilde{r} \in \tilde{R}$, $\tilde{\beta}_{\tilde{r}} = k \circ \beta^f_{h(\tilde{r})} \circ g$.
Alternatively, since $g$ and $h$ are onto, it follows that $f$ is tolerated if and only if, for all $r \in R$, $k \circ \beta_r = k \circ \beta^f_r$.
A fault which is not tolerated is capable of causing "errors" in the following sense. If $r \in R$, $x \in I^+$, and $y \in Z^+$ such that … that is, for reset $r$ and input sequence $x$, $S^f$ produces an output that is in error relative to $\tilde{S}$. It follows immediately from the definition that a fault $f$ is tolerated if and only if no errors are caused by $f$. Finally, since we will be interested in the time when an error first occurs, we will say that an error $(r, ua, vb)$ (where $r \in R$; $u, v \in I^+$; $a, b \in I$) is minimal if $(r, u, v)$ is not an error.
III. ON-LINE DIAGNOSIS
With respect to the concepts of fault and error de- …
Thus, the detector $D$ observes the operation of $S^f$ (see Fig. 4) and must make a decision, based on this observation, as to whether an error has occurred. Note that the fault-free realization $S$ and the detector are both time-invariant (i.e., machines), and that the detector takes no part in the computation of $S$'s output. The two conditions of the above definition can be paraphrased as: 1) $D$ responds negatively if no fault occurs, i.e., $D$ gives no false alarms; and 2) for all $f \in F$, $D$ responds positively within $n$ time steps of the occurrence of the first error caused by $f$.
Given this concept of on-line diagnosability, the investigation that follows will be concerned with the general case in which the set of potential faults is "unrestricted." More precisely, the set of unrestricted faults of machine $S$, denoted by $U$, is the set $U = \{ f \mid f \text{ is a fault of } S \}$. Note that this class of faults is truly unrestricted, for it is precisely the set of all possible faults of the machine being diagnosed. In particular, no bound is placed on the number of states in a faulty system $S^f$. Moreover, a fault $f = (S', \tau, \theta)$ need not be "permanent" in the sense that $S'$ is time-invariant. $U$ is therefore more general than fault classes considered in studies of off-line "checking experiments" of the type first investigated by Hennie [8].
Aside from representing a "worst-case" fault environment, there are certain practical reasons for considering $U$, at least at the outset. In particular, as the scale of integrated circuit technology becomes larger, it becomes more difficult to postulate a suitably restricted class of faults such as the class of all "stuck-at" faults. Moreover, although other failure models such as bridging failures have been proposed and studied (see [9] and [10], for example), little is known about the diagnosis of such failures. In addition, intermittent and multiple failures are also possible and are even more difficult to model. Finally, for a given failure it may be impossible to determine the $\theta$ function of the fault caused by this failure. Thus fault sets which do not restrict the fault mapping $\theta$ are advantageous.
One important property of the set of unrestricted faults is the relation between this fault set and the set of errors that may be caused by faults in this set. Given any $r \in R$, $x \in I^+$, and $y \in Z^+$ with $|x| = |y|$, there is a fault $f \in U$ such that $\hat{\beta}^f_r(x) = y$ (that is, $S^f$ produces the output sequence $y$ for reset $r$ and input sequence $x$). Therefore faults in $U$ can cause any possible erroneous behavior, and for $(S, U)$ to be $(D,n)$-diagnosable all of these possible erroneous behaviors will have to be detected by $D$. Due to the above observation it is clear that the output of $S^f$ (the system actually being observed by the detector) can give no information about what the correct output should be.
It is a well known and obvious fact that if a system is duplicated and both copies are run in parallel with the same inputs, then, by dynamically comparing the outputs of the two copies, any error which does not appear simultaneously in both copies will be immediately detected. Our view of duplication is shown in Fig. 5. In this figure the detector $D$ consists of a copy of $S$ along with a generalized EXCLUSIVE-OR gate whose output is 0 if and only if its inputs are identical.
Given …
Recall that given any $r \in R$, $x \in I^+$, and $y \in Z^+$ with $|x| = |y|$, there is a fault $f \in U$ such that $\hat{\beta}^f_r(x) = y$. Let $f \in U$ be a fault for which $\hat{\beta}^f_{r_1}(x_1 u a) = \hat{\beta}_{r_1}(x_1)\,\hat{\beta}_{q_2}(ua)$. Since it is known that $k(\beta_{q_1}(u)) = k(\beta_{q_2}(u))$, it follows that $(r_1, x_1 u a, \hat{\beta}^f_{r_1}(x_1 u a))$ …
Before stating this result formally, it is convenient to establish the following important lemma.
Step 1: Delete from the state table of $D'$ any row corresponding to a state $q$ for which $0 \notin \{ \lambda_{D'}(q,(z,a)) \mid (z,a) \in Z \times I \}$.
Step 2: In the resulting table, replace every reference to the deleted state with a reference to an arbitrary remaining state, and set the corresponding output to 1.
Step 3: Repeat Steps 1 and 2 until no further deletions are possible.
Since $|Q_{D'}| < \infty$, the above algorithm will terminate in a finite number of iterations.
… then an error must have occurred because if $D'$ is in $q$ then an error detection signal will be emitted regardless of the input to $D'$. Hence, this error could be signaled whenever a transition to $q$ is indicated, and there would be no loss in diagnosis and no possibility for a false alarm. Since all minimal errors which $q$ signaled would then be signaled before $D'$ got to state $q$, $q$ could be eliminated. This is the essence of what is accomplished in Steps 1 and 2. This elimination process is necessarily iterative because Step 2 may introduce new states to be deleted. Since this construction is diagnosis preserving, $(S, U)$ is $(D,n)$-diagnosable, thereby proving the lemma.
As an immediate consequence of Theorems 2 and 3, we obtain the following important result. This result answers the question posed in the introduction; namely, with regard to any technique which may be employed, to achieve unrestricted fault diagnosis the detector must be at least as complex, in terms of state set size, as the specification of the system being diagnosed.
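The state-elimination procedure of Steps 1 to 3 can be sketched in Python as follows. The representation is an assumption made for illustration: the state table of the detector is a dictionary mapping each state to a row, and each row maps an input symbol (an output/input pair observed by the detector) to a pair of next state and detector output. The function name `prune_alarm_states` is hypothetical, the "arbitrary remaining state" is taken to be the first surviving state, and reset handling for the detector is not modeled.

```python
from typing import Dict, Hashable, Tuple

Row = Dict[Hashable, Tuple[Hashable, int]]   # input symbol -> (next state, output in {0, 1})
Table = Dict[Hashable, Row]                  # state -> row of the state table


def prune_alarm_states(table: Table) -> Table:
    """Iteratively delete states that signal an error on every input (Steps 1 to 3)."""
    table = {q: dict(row) for q, row in table.items()}  # work on a copy
    while True:
        # Step 1: rows in which the output 0 never appears.
        doomed = {q for q, row in table.items()
                  if all(out == 1 for _, out in row.values())}
        if not doomed:
            return table  # Step 3: stop when no further deletions are possible
        survivors = [q for q in table if q not in doomed]
        if not survivors:
            # Assumption of this sketch: such a table always raises an alarm,
            # so it is rejected rather than reduced to an empty table.
            raise ValueError("every state signals an error on every input")
        fallback = survivors[0]  # the "arbitrary remaining state"
        for q in doomed:
            del table[q]
        # Step 2: redirect references to deleted states and force the output to 1.
        for row in table.values():
            for sym, (nxt, _out) in list(row.items()):
                if nxt in doomed:
                    row[sym] = (fallback, 1)


# Hypothetical three-state table: C always signals, and B does so once C is removed.
example_table = {
    "A": {"x": ("A", 0), "y": ("B", 0)},
    "B": {"x": ("C", 0), "y": ("C", 1)},
    "C": {"x": ("C", 1), "y": ("C", 1)},
}
pruned = prune_alarm_states(example_table)
```

The example needs two passes: deleting C turns B into an always-signaling state, which is the reason the text gives for the process being iterative.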
IV. DIAGNOSIS USING INVERSE MACHINES
Let us now consider the use of inverse machines for the diagnosis of unrestricted faults. An $(I,n)$-delay machine (delay machine) is a machine $S_n = (I, I^n, I, \delta, \lambda, R, \rho)$ such that if $a_i \in I$, $1 \le i \le n + 1$, then $\delta((a_1, \ldots, a_n), a_{n+1}) = (a_2, \ldots, a_{n+1})$ and $\lambda((a_1, \ldots, a_n), a_{n+1}) = a_1$.
Thus, an $(I,n)$-delay machine simply delays its input for $n$ time steps. Stated more precisely, if $S_n$ is an $(I,n)$-delay machine then $\beta_{(a_1, \ldots, a_n)}(a_{n+1} \cdots a_{n+m}) = a_m$.
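The transition and output functions of the delay machine translate directly into a few lines of Python. In this sketch a state is a tuple of the last $n$ symbols; the helper names `delay_step` and `run_delay` are hypothetical, and the initial state is passed in explicitly instead of being produced by a reset function.

```python
from typing import Hashable, List, Sequence, Tuple


def delay_step(state: Tuple[Hashable, ...], a: Hashable) -> Tuple[Tuple[Hashable, ...], Hashable]:
    """One step of an n-delay machine, where n = len(state).

    The next state drops the oldest stored symbol and appends the new input;
    the output is the oldest stored symbol (or the input itself when n = 0).
    """
    window = state + (a,)
    return window[1:], window[0]


def run_delay(initial: Sequence[Hashable], xs: Sequence[Hashable]) -> List[Hashable]:
    """Feed xs to the delay machine started in state `initial`; return all outputs."""
    state = tuple(initial)
    outputs = []
    for a in xs:
        state, z = delay_step(state, a)
        outputs.append(z)
    return outputs


# Starting in state (a1, a2) = ("p", "q") with n = 2 and inputs a3..a6 = "a".."d".
delayed = run_delay(("p", "q"), ["a", "b", "c", "d"])
```

With $m = 4$ inputs, the last output is $a_4$, the second input symbol, which matches the behavior formula stated above.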
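Let $S$ and $\bar{S}$ be two machines such that $R = \bar{R}$ and $Z = \bar{I}$. Then $\bar{S}$ is an ($n$-delayed) inverse of $S$ if there exists an $(I,n)$-delay machine $S_n$ with reset alphabet $R$ such that for all $r \in R$ and $x \in I^+$, $\bar{\beta}_r(\hat{\beta}_r(x)) = \beta^n_r(x)$, where $\beta^n_r$ denotes the behavior of $S_n$.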
Machines for which inverses exist can be easily characterized. Intuitively, such machines lose no information as they transform input sequences into output sequences. A machine $S$ is information lossless of delay $d$ if for all $r \in R$ and $a_1 a_2 \cdots a_n, b_1 b_2 \cdots b_n \in I^+$ ($a_i, b_i \in I$, $1 \le i \le n$), $\hat{\beta}_r(a_1 a_2 \cdots a_n) = \hat{\beta}_r(b_1 b_2 \cdots b_n)$ implies $a_i = b_i$ …
The basic relationship between information losslessness and inverses is given by the following theorem (see [10] , for example).
Theorem 4: S has an n-delayed inverse if and only if S is information lossless of delay n.
Information lossless machines and inverse machines were first introduced by Huffman [11] . He devised a test for information losslessness and for the existence of inverses. It should be pointed out that our definitions of these notions are oriented towards their use in diagnosis and that they vary slightly from Huffman's definitions.
Even [12] later devised a better means of determining information losslessness and he presented two means for obtaining inverses of information lossless machines. Kohavi and Lavallee [13] have shown that any machine can be realized by an information lossless machine.
We now state the basic result relating the use of lossless inverses with the diagnosis of unrestricted faults.
Theorem 5: Let $S$ be a lossless machine and let $\bar{S}$ be an $n$-delayed inverse of $S$. Let …
It is interesting to note that although $\bar{S}_1$ has fewer states than $S_1$, $D_1$ has more states than $S_1$. This is because there is an $(I_1,2)$-delay machine in $D_1$, in addition to the inverse $\bar{S}_1$. It is also worth pointing out that the delay in diagnosis using an inverse machine is not the delay of losslessness of the machine being diagnosed but rather that of its inverse. Thus an $n$-delayed inverse can be used to achieve diagnosis without delay if it is lossless of delay 0.
The following example shows that the converse of Theorem 5 does not hold. Namely, it is possible to diagnose the unrestricted faults of a machine using an inverse which is not lossless. However, not all inverses can be used for the diagnosis of unrestricted faults. The complete characterization of inverses which can be used for unrestricted fault diagnosis is still an open problem.
Example 2: Consider the reduced and reachable machines $S_2$ and $\bar{S}_2$ given by the state tables in Figs. 9 and 10. $\bar{S}_2$ is a 0-delayed inverse of $S_2$ and it can be used to construct a detector $D_2$ such that $(S_2, U)$ is $(D_2,0)$-diagnosable. However, $\bar{S}_2$ is not lossless.
In conclusion, it is interesting to note that results established in this and the preceding section have something to say about lossless machines, per se. Let $S$ be a reduced, reachable machine that is lossless of delay $d$. Let $\bar{S}$ be a lossless inverse of $S$. We have seen in Example 1 that such an inverse can have fewer states than the machine of which it is an inverse. In the following result we will give a lower bound on the state set size of $\bar{S}$ in terms of the state set size of $S$, the delay $d$ of $S$, and the input alphabet size of $S$. This result, which deals only with lossless and inverse machines, is proved using Corollary 3.1 and Theorem 5, results concerning the diagnosis of unrestricted faults.
V. CONCLUSION
The main result of this paper has shown that regardless of what diagnosis technique is applied and regardless of how much delay is allowed in diagnosis, if faults are unrestricted the detector must be at least as complex as the specification of the system being diagnosed. It should be pointed out that this result will not hold if the potential faults of $S$ or the set of possible input sequences to $S$ are suitably restricted. For example, one can construct resettable machines $S$ and $S'$, both reduced and reachable, such that they have identical input/output behavior for a restricted class of input sequences, and yet they have a different number of states. Thus, if input sequences were restricted, a detector could be constructed which has fewer states than the machine it monitors.
The results concerning on-line diagnosis using inverse machines have shown that if the inverse being used is lossless, then the class of unrestricted faults can be diagnosed with a delay equal to the delay of losslessness of the inverse system. However, if the inverse system under consideration is lossy, then further analysis is necessary because such inverses may or may not be capable of unrestricted fault diagnosis.
