On-line diagnosis of sequential systems, 2 by Sundstrom, R. J.
01214%-1-7 
https://ntrs.nasa.gov/search.jsp?R=19740021422 2020-03-23T06:13:33+00:00Z
THE UNIVERSITY OF MICHIGAN
SYSTEMS ENGINEERING LABORATORY
Department of Electrical and Computer Engineering
College of Engineering
SEL Technical Report No. 81
ON-LINE DIAGNOSIS OF SEQUENTIAL SYSTEMS: II
Robert J. Sundstrom
Under the direction of
Professor John F. Meyer
July 1974
Prepared under
NASA Grant
NGR-23-005-622
TABLE OF CONTENTS
I. INTRODUCTION
1.1 Outline of the Problem
1.2 Brief Survey of the Literature
1.3 Synopsis of the Report
II. A MODEL FOR THE STUDY OF ON-LINE DIAGNOSIS
2.1 Resettable Discrete-Time Systems
2.2 Resettable Systems with Faults
2.3 Fault Tolerance and Errors
2.4 On-line Diagnosis
III. GENERAL PROPERTIES OF DIAGNOSIS
IV. DIAGNOSIS OF UNRESTRICTED FAULTS
4.1 Unrestricted Faults
4.2 Diagnosis via Independent Computation and Comparison
4.3 Diagnosis with Zero Delay
4.4 Diagnosis with Nonzero Delay
V. DIAGNOSIS USING INVERSE MACHINES
5.1 Inverses of Machines
5.2 Diagnosis Using Lossless Inverses
5.3 Applicability of Inverses for Unrestricted Fault Diagnosis
VI. DIAGNOSIS OF NETWORKS OF RESETTABLE SYSTEMS
6.1 Networks of Resettable Systems
6.2 Unrestricted Component Faults
6.3 Characterization of Combinationally Diagnosable Networks
6.4 Construction of Combinationally Diagnosable Networks
VII. CONCLUSION
REFERENCES
APPENDIX
CHAPTER I 
Introduction 
1.1 Outline of the Problem
For many applications, especially those in which a computer
is controlling a real-time process (e.g., telephone switching,
flight control of an aircraft or spacecraft, control of traffic in a
transportation system, etc.), reliability is a major factor in the
design of the system. The need for high reliability arises because
of the serious consequences errors may have in terms of danger to
human lives, loss of costly equipment, or disruption of business or
manufacturing operations. For example, it is economically unsound
to shut down a steel mill for even a short time in order to repair
a comparatively inexpensive controlling computer. The seriousness
of the consequences, of course, depends upon the application and must
be weighed against the cost of improving the reliability.
A number of techniques exist for improving computer reliability.
One of the more obvious is the use of more reliable components. 
While the use of reliable components is clearly very important, it 
has been recognized that this technique alone is not sufficient to meet 
the requirements for modern ultrareliable computing systems [34]. 
Another general technique which is useful in some applications
is the use of masking redundancy such as Triple Modular Redundancy
or Quadded Logic [35]. One major drawback to masking redundancy
is that if failed components are not replaced and the mission time
is long, then the reliability of a system which uses masking
redundancy can actually be less than that of the corresponding simplex
system [25].
A third means of increasing system reliability and availability
is through fault diagnosis and subsequent system reconfiguration or
repair. For example, a computer designed to control telephone
switching, the No. 1 Electronic Switching System (ESS), contains
duplicates of each module, and fault diagnosis is achieved primarily
by dynamically comparing the outputs of both modules [11]. Once
the faulty module is identified, it is repaired manually with
diagnostic help from the fault-free computer. Another ultra-reliable
computer, the Jet Propulsion Laboratory Self-Testing and Repairing
(STAR) computer, also makes use of modularity and standby
sparing [4].
One means of performing fault diagnosis is to continuously
monitor the performance of the system, as it is being used, to
determine whether its actual behavior is tolerably close to the intended
behavior. It is this sort of monitoring which we mean by the term
"on-line diagnosis." Others have used the term "error detection"
to refer to this sort of monitoring ([22], [23]).
Implementation of on-line diagnosis may be external to the
system, both internal and external, or completely internal. In the
last extreme, on-line diagnosis is sometimes referred to as
"self-diagnosis" or "self-checking" ([8], [9]).
There are two essential requirements for on-line diagnosis.
The first is redundancy; more than the minimum amount of
information must be processed. The second is verifiability; the redundant
information must be checked for consistency.
The signals generated by a monitoring device can be used in
many ways. For example, the IBM System/360 utilizes checking
circuits to detect errors [6]. The signals generated by these
circuits are used in some models to freeze the computer so that the
instruction which was currently executing may be retried if possible,
and to assist in the checkout and repair of the computer if the
automatic retry attempt fails. Ultra-reliable computers typically use
the signals generated by the monitoring device to provide the computer
system with the information it needs to automatically reconfigure
itself so as to avoid using any faulty circuits. One other use for such
signals is to simply inform the system user that the system is not
operating properly and that there may be errors in his data.
In general, on-line diagnosis is used to verify that the system
is operating properly; or conversely, to signal that it is in need of
repair. In most computer systems this task is also performed in
some part by "off-line diagnosis." By off-line diagnosis we are
referring to the process of removing the system from its normal
operation and applying a series of prearranged tests to determine
whether any faults are present in the system. There are major
differences between on-line and off-line diagnosis and it is important
to be aware of the capabilities and the limitations of each.
One basic difference is that on-line diagnosis is a continuous
process whereas off-line diagnosis has a periodic nature. Transient
faults are difficult to diagnose with off-line diagnosis because if a
fault is transient in nature it may not be in the system when it is
tested. On the other hand, since on-line diagnosis is a continuous
monitoring process both permanent and transient faults can be
diagnosed. It has been recognized by Ball and Hardie [5] and
others that intermittents do occur frequently, and that finding an
orderly means to diagnose them is an important unsolved problem.
Thus the inability of off-line diagnosis to deal satisfactorily with
transients is a severe limitation.
Another basic difference is that the delay between the occurrence
of a fault and its subsequent detection is generally greater for
off-line than on-line diagnosis. Recovery after a fault has been diagnosed
may sometimes be achieved by reconfiguration and restarting.
However, in a real-time application unrepeatable or nonreversible
events may take place if an error occurs and is not immediately
detected. In any application, if there is a delay between the occurrence
of an error and the subsequent diagnosis of a fault, then
contamination of data bases may occur, thus making restarting
difficult. For these reasons, the inherent delay associated with
off-line diagnosis can be a serious limitation.
One further difference between on-line and off-line diagnosis
is that with off-line diagnosis the system must be removed from its
normal operation to apply the tests. This also may not be acceptable
in a real-time application.
The cost of either form of diagnosis depends on the nature of
the system to be diagnosed, the technology to be used in building
the system, and the degree of protection against faulty operation
that is required. With on-line diagnosis the cost is almost totally
in the design, construction, and maintenance of extra hardware.
With off-line diagnosis the cost is in the initial generation of the tests
and in the subsequent storage and running of these tests.
In general, off-line diagnosis is useful for factory testing and
for applications where immediate knowledge of any faulty behavior
is not essential. Off-line diagnosis is also useful for locating the
source of trouble once such trouble is indicated by on-line diagnosis.
For example, as stated earlier, the Bell System's No. 1 ESS uses
duplication and comparison as its primary error detection scheme. But
once an error has been detected, off-line diagnosis is used to
determine which processor exhibited the erroneous behavior and to locate
the faulty module in that processor.
In the Design Techniques for Modular Architecture for Reliable
Computing Systems (MARCS) study a more integrated use of
on-line diagnosis is proposed whereby a number of checking circuits
observe the performance of various parts of the computer [8].
With a scheme such as this, information about the location of a
fault can be obtained from knowledge of which checking circuit
indicated the trouble.
Both on-line and off-line diagnosis have been used to check
the operation of computers from the very first machines until the
present time. In a short paper published in 1957, Eckert [12] informs
us that off-line diagnosis was relied upon for the ENIAC computer,
that the BINAC system had duplicate processors, and that the UNIVAC
used a more economical on-line diagnosis scheme involving 35
checking circuits. During the past decade, however, the development of
theory and techniques for fault diagnosis in digital systems and
circuits has focused mainly on problems of off-line diagnosis (see
[9] and [14] for example).
An alternative means of performing diagnosis has been
investigated by White [36]. His novel scheme is similar to on-line diagnosis
in that it involves redundant processing of information and subsequent
checking for consistency. However, with his scheme the redundancy
is in time rather than in space. After every operation is performed,
a related operation is initiated which uses the same circuitry but
with different signals. The results of these two operations are then
checked for consistency.
One other approach to diagnosis is simply to have human users
or observers of the system watch for obvious misbehavior. Since
faults often give rise to behaviors which are clearly erroneous, many
faults can be detected in this manner. The effectiveness of this
method is highly dependent upon the individual system and program,
and is exceedingly difficult to evaluate. It seems reasonable to
assume, however, that this method is less effective than any of the
methods previously discussed. Certainly, this method is unacceptable
for many applications.
1.2 Brief Survey of the Literature 
The work that has been done on on-line diagnosis is mainly in
the area of techniques. One early paper is Kautz's study [21] of
fault detection techniques for combinational circuits. In this paper
he investigated a number of techniques including the use of codes
and the possibility of greater economy if immediate detection of
errors was not necessary. Many of the more common on-line
diagnosis techniques have been gathered together and published in
a book by Sellers, Hsiao, and Bearnson [33]. Much of what is in
this book and a large portion of the techniques that can be found
elsewhere in the literature are concerned with special circuits
such as adders and counters. For example, see the work of Avizienis
[3], Rao [32], Dorr [10], and Wadia [37].
Relatively little work can be found on the theory of on-line
diagnosis. As with the work on on-line diagnosis techniques, much
of the theory of on-line diagnosis focuses on arithmetic units.
In one of the earliest works of a theoretical nature, Peterson
[ Z S ] showed that an adder can be checked using a completely
independent circuit which adds the residues, modulo some base, of the
operands. He went on to show that any independent check of this
type was a residue class check. Further theoretical work
concerning the diagnosis of arithmetic units using residue codes can be
found in Massey [24] and Peterson [31].
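The residue check described above can be illustrated with a minimal sketch. The function names, the base 3, and the stuck-at fault are illustrative assumptions and are not taken from the report; the point is only that the checker works on residues and never recomputes the full sum.

```python
def residue_check(a: int, b: int, adder_output: int, base: int = 3) -> bool:
    """Return True if adder_output is consistent with a + b modulo `base`.

    The checker is independent of the adder: it adds only the residues
    of the operands and compares against the residue of the adder output.
    """
    expected_residue = ((a % base) + (b % base)) % base
    return adder_output % base == expected_residue


def faulty_add(a: int, b: int) -> int:
    """Hypothetical faulty adder: bit 2 (weight 4) of the sum is stuck at 0."""
    return (a + b) & ~0b100


# A correct sum passes the check.
ok = residue_check(3, 4, 3 + 4)
# The fault changes 7 into 3, an error of 4, which is nonzero modulo 3.
detected = not residue_check(3, 4, faulty_add(3, 4))
```

In this sketch any error whose magnitude is not a multiple of the base is caught, while an input such as (5, 6) does not exercise the fault at all and therefore produces a correct, accepted output.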
An early theoretical result of a more general nature was published
by Peterson and Rabin [30]. They showed that combinational circuits
can differ greatly in their inherent diagnosability and that in some
cases virtual duplication is necessary.
A later and very important paper is that of Carter and Schneider
[7]. They propose a model for on-line diagnosis which involves a
system and an external checker. The input and output alphabets of
the system are encoded and the checker detects faults by indicating
the appearance of a non-code output. A system is self-checking
if for every fault in some prescribed set, (i) the system produces
a non-code output for at least one code space input, and (ii) the
system never produces incorrect code space outputs for code space
inputs. Thus, (i) ensures that every fault can be detected during normal
usage, and (ii) ensures that if no fault has been detected then the output
can be relied upon to be correct. The checkers that they consider are
also self-checking. Using this model they prove that any system can be
designed to be self-checking for the class of single stuck-at faults.
Anderson [1] has named property (i) "self-testing" and property
(ii) "fault-secure," and he has investigated these properties for
combinational networks. In Chapter III it is shown that the notion
of diagnosis considered in this study is a generalization of the
fault-secure property.
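As a small illustration of properties (i) and (ii), the following sketch checks them for a combinational system given as a plain function. The duplicated one-bit circuit, the fault functions, and all names are hypothetical examples chosen for brevity, not constructions from the cited papers.

```python
def is_self_testing(faulty, code_inputs, code_outputs) -> bool:
    """Property (i): some code input makes the faulty system emit a non-code output."""
    return any(faulty(x) not in code_outputs for x in code_inputs)


def is_fault_secure(good, faulty, code_inputs, code_outputs) -> bool:
    """Property (ii): on code inputs, every faulty output is either correct or non-code."""
    return all(
        faulty(x) == good(x) or faulty(x) not in code_outputs
        for x in code_inputs
    )


# Hypothetical system: a one-bit signal carried on two wires (duplication code).
CODE_INPUTS = (0, 1)
CODE_OUTPUTS = {(0, 0), (1, 1)}


def good(x):
    return (x, x)


def first_copy_stuck_at_0(x):   # a single fault
    return (0, x)


def both_copies_stuck_at_0(x):  # a multiple fault, outside the single-fault class
    return (0, 0)
```

With these definitions the single fault is both self-testing and fault-secure, whereas the double fault is neither: on input 1 it yields the code word (0, 0), which is wrong but indistinguishable from a correct output.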
1.3 Synopsis of the Report
This report describes an investigation of theory and techniques
applicable to the on-line diagnosis of sequential systems.
With the decreasing cost of logic and the increasing use of computers
in real-time applications where erroneous operation can result in
the loss of human life and/or large sums of money, the use of on-line
diagnosis can be expected to increase greatly in the near future.
The importance of this area along with the relative lack of
theoretical results is our motivation for initiating this study of on-line
diagnosis.
The purpose of this investigation is to further the currently
insufficient store of information on the subject of on-line diagnosis.
The formal approach taken in this report leads to a fuller
understanding of current on-line diagnosis practices and suggests
generalizations of known techniques. It also provides a framework
for evaluating the advantages and limitations of the various on-line
diagnosis schemes.
In Chapter II, a complete model for the study of on-line diagnosis
is developed. First an appropriate class of system models is
formulated which can serve as a basis for a theoretical study of
on-line diagnosis. Then notions of realization, fault, fault-tolerance
and diagnosability are formalized which have meaningful
interpretations in the context of on-line diagnosis. The following chapters are
all concerned with the properties of the notion of diagnosis which is
introduced in this chapter.
Chapter III contains some elementary properties of diagnosis
which are independent of the particular class of faults under
consideration. The results of this chapter help to give a basic understanding
of on-line diagnosis and are used in the later chapters.
Chapter IV is concerned with the diagnosis of the set of
unrestricted faults. This set of faults is simply the set of all faults of
the system under consideration. The major result of this chapter
gives a lower bound on the amount of redundancy that must be employed
by any technique which can be used for unrestricted fault diagnosis.
In Chapter V, the use of inverse systems for the diagnosis
of unrestricted faults is considered. Inverse systems are formally
introduced, and a partial characterization of those inverse systems
which can be used for unrestricted fault diagnosis is obtained. Since
not every system has an inverse system, let alone one which is
suitable for unrestricted fault diagnosis, it is not always possible
to apply this technique directly. However, it is shown that every
system has a realization upon which this technique can be
successfully applied.
In Chapter VI, the diagnosis of systems which are structurally
decomposed and are represented as a network of smaller systems
is studied. The fault set considered here is the set of faults which
only affect one component system in the network. A characterization
of those networks which can be diagnosed using a purely
combinational detector is achieved. A technique is given which can be
used to realize any network by a network which is diagnosable in
the above sense. Limits are found on the amount of redundancy
involved in any such technique.
CHAPTER II
A Model for the Study of On-Line Diagnosis 
In this chapter we develop the model which we will be using in
this theoretical study of on-line diagnosis.
We begin by introducing a new class of system models, called
"resettable discrete-time systems," which will serve as the basis of
our study. Within this model we will consider a fault of a system $S$ to
be a transformation of $S$ into another system $S'$ at some time $\tau$. The
resulting faulty system is taken to be the system which looks like $S$ up
to time $\tau$ and like $S'$ thereafter.
Next the companion notions of fault tolerance and error are
defined in terms of the resulting system being able to mimic some
desired behavior.
Finally, our notion of on-line diagnosis is introduced. This
notion involves an external detector and a maximum time delay within
which every error caused by a fault in some prescribed set must be
detected.
2.1 Resettable Discrete-Time Systems
On-line diagnosis is inherently a more complex process than
off-line diagnosis because of two complicating factors: i) it has to deal with
input over which it has no control and ii) faults can occur as the system
is being diagnosed. We would like to build a theory of on-line diagnosis
using conventional models of time-invariant (stationary, fixed) systems
(e.g., sequential machines, sequential networks, etc.). However,
due to the second factor mentioned above these conventional models
can no longer be used to represent the dynamics of the system as it is
being diagnosed. A system which is designed and built to behave in a
time-invariant manner becomes a time-varying system as faults occur
while it is in use. Therefore, a more general representation based
on time-varying systems is required. Based on this fundamental
observation we have developed what we believe to be an appropriate model
for the study of on-line diagnosis.
Definition 2.1: Relative to the time-base $T = \{\ldots, -1, 0, 1, \ldots\}$, a
discrete-time system (with finite input and output alphabets) is a system
$$S = (I, Q, Z, \delta, \lambda)$$
where $I$ is a finite nonempty set, the input alphabet
$Q$ is a nonempty set, the state set
$Z$ is a finite nonempty set, the output alphabet
$\delta: Q \times I \times T \to Q$, the transition function
$\lambda: Q \times I \times T \to Z$, the output function.
The interpretation of a discrete-time system is a system which,
if at time $t$ it is in state $q$ and receives input $a$, will at time $t$ emit
output symbol $\lambda(q, a, t)$ and at time $t + 1$ be in state $\delta(q, a, t)$. In the special
case where the functions $\delta$ and $\lambda$ are independent of time (i.e., are
time-invariant), the definition reduces to that of a (Mealy) sequential
machine. In the discussion that follows we will assume, unless
otherwise qualified, that $S$ is finite-state (i.e., $|Q| < \infty$).
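The interpretation above can be sketched in a few lines of Python. The class name `DiscreteTimeSystem`, the `step` method, and the example system are assumptions made for illustration; only the roles of the transition function, the output function, and the time argument come from Definition 2.1.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Tuple

State = Hashable
Symbol = Hashable


@dataclass(frozen=True)
class DiscreteTimeSystem:
    # delta(q, a, t): state at time t + 1, given state q and input a at time t
    delta: Callable[[State, Symbol, int], State]
    # lam(q, a, t): output symbol emitted at time t
    lam: Callable[[State, Symbol, int], Symbol]

    def step(self, q: State, a: Symbol, t: int) -> Tuple[Symbol, State]:
        """Return (output emitted at time t, state at time t + 1)."""
        return self.lam(q, a, t), self.delta(q, a, t)


# Hypothetical time-varying system: a modulo-2 counter whose output
# is inverted from time 5 onward (its transition function never changes).
parity = DiscreteTimeSystem(
    delta=lambda q, a, t: (q + a) % 2,
    lam=lambda q, a, t: (q + a) % 2 if t < 5 else 1 - (q + a) % 2,
)
```

If `lam` and `delta` ignore their third argument, the object behaves like an ordinary Mealy machine, which is the time-invariant special case mentioned in the text.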
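To describe the behavior of a system, we first extend the
transition and output functions to input sequences in the following natural way.
If $I^*$ is the set of all finite-length sequences over $I$ (including the null
sequence $\Lambda$) then:
$$\bar{\delta}: Q \times I^* \times T \to Q$$
where, for all $q \in Q$, $a_1, \ldots, a_n \in I$, $t \in T$:
$$\bar{\delta}(q, \Lambda, t) = q$$
$$\bar{\delta}(q, a_1 a_2 \ldots a_n, t) = \delta(\bar{\delta}(q, a_1 a_2 \ldots a_{n-1}, t), a_n, t + n - 1).$$
Similarly, if $I^+ = I^* - \{\Lambda\}$:
$$\bar{\lambda}: Q \times I^+ \times T \to Z$$
where for all $q \in Q$, $a_1, \ldots, a_n \in I$, $t \in T$:
$$\bar{\lambda}(q, a_1 a_2 \ldots a_n, t) = \lambda(\bar{\delta}(q, a_1 a_2 \ldots a_{n-1}, t), a_n, t + n - 1).$$
Henceforth $\bar{\delta}$ and $\bar{\lambda}$ will be denoted simply as $\delta$ and $\lambda$.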
Relative to these extended functions, the behavior of $S$ in state $q$
is the function
$$\beta_q: I^+ \times T \to Z$$
where
$$\beta_q(x, t) = \lambda(q, x, t).$$
Thus, if the state of the system is $q$ and it receives input sequence $x$
starting at time $t$, then $\beta_q(x, t)$ is the output emitted when the last
symbol in $x$ is received (i.e., the output at time $t + |x| - 1$, where $|x|$ is
the length of $x$).
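A minimal sketch of the extended transition function and of the behavior in state q follows. The function names and the example `delta_example` and `lam_example` are hypothetical; input sequences are represented as Python tuples, and the k-th symbol of a sequence starting at time t is assumed to arrive at time t + k.

```python
def extended_delta(delta, q, x, t):
    """State reached from q after the input sequence x is applied starting at time t."""
    for k, a in enumerate(x):
        q = delta(q, a, t + k)  # the k-th symbol (0-based) arrives at time t + k
    return q


def behavior(delta, lam, q, x, t):
    """Output emitted when the last symbol of the nonempty sequence x is received."""
    if not x:
        raise ValueError("the behavior is defined only for nonempty input sequences")
    q_before_last = extended_delta(delta, q, x[:-1], t)
    return lam(q_before_last, x[-1], t + len(x) - 1)


# Hypothetical example: a modulo-3 counter whose output encodes both the
# current state and the current time, so the time dependence is visible.
def delta_example(q, a, t):
    return (q + a) % 3


def lam_example(q, a, t):
    return 10 * q + t


last_output = behavior(delta_example, lam_example, 0, (1, 1, 1), 4)
```

Here the system starts in state 0 at time 4, passes through states 1 and 2, and the last symbol is received in state 2 at time 6, which is the time t + |x| - 1 named in the text.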
Many investigations of on-line diagnosis and fault tolerance have
studied redundancy schemes such as duplication and triplication.
Typically they have not dealt with the problem of starting each copy of
a machine in the same state. In this study we will be examining these
schemes and others for which the same problem arises. Since many
existing systems have reset capabilities, and since this feature solves
the above synchronizing problem, we will use a special type of system
for which the reset capabilities are explicitly specified. This explicit
specification of the reset capability is essential since it is an important
part of the total system and it may be subject to failure.
Definition 2.2: A resettable discrete-time system (resettable system)
is a system
$$S = (I, Q, Z, \delta, \lambda, R, \rho)$$
where $(I, Q, Z, \delta, \lambda)$ is a discrete-time system
$R$ is a finite nonempty set, the reset alphabet
$\rho: R \times T \to Q$, the reset function.
A resettable system is resettable in the sense that if reset $r$ is
applied at time $t - 1$ then $\rho(r, t)$ is the state at time $t$. This method of
specifying reset capability is a matter of convenience. This feature
could just as well have been incorporated as a restriction on the
transition function relative to a distinguished subset of input symbols called
the reset alphabet. Thus a resettable discrete-time system can indeed
be regarded as a special type of discrete-time system. If $\delta$, $\lambda$, and $\rho$
are all independent of time the definition reduces to that of a resettable
sequential machine. Thus a resettable machine can be viewed as a
resettable system which is invariant under time-translations.
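The remark that resets could equally be treated as distinguished input symbols can be sketched as follows. The helper name `fold_reset_into_transition` and the example functions are hypothetical, and the sketch assumes that the input alphabet and the reset alphabet are disjoint.

```python
def fold_reset_into_transition(delta, rho, reset_alphabet):
    """Return a single transition function over inputs and resets.

    A reset applied at time t fixes the state at time t + 1 to rho(r, t + 1),
    regardless of the current state. Assumes reset symbols are not input symbols.
    """
    def delta_prime(q, a, t):
        if a in reset_alphabet:
            return rho(a, t + 1)
        return delta(q, a, t)

    return delta_prime


# Hypothetical example: a modulo-4 counter with two reset symbols.
def delta_counter(q, a, t):
    return (q + a) % 4


def rho_counter(r, t):
    # "r0" always resets to state 0; "r1" resets to a time-dependent state.
    return 0 if r == "r0" else t % 4


delta_prime = fold_reset_into_transition(delta_counter, rho_counter, {"r0", "r1"})
```

The shift from t to t + 1 mirrors the convention in the text that a reset applied at time t - 1 determines the state at time t.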
Given a resettable system we can view it as a system organized
as in Fig. 2.1.
Fig. 2.1. Schematic Diagram for $S = (I, Q, Z, \delta, \lambda, R, \rho)$
In many discussions we will not be directly concerned with the
output function of a system, but rather we will want to focus our
attention upon the state transitions. This motivates the following
definition.
Definition 2.3: A resettable discrete-time system $S = (I, Q, Z, \delta, \lambda, R, \rho)$
is a resettable state system if $Z = Q$ and $\lambda(q, a, t) = q$ for all $q \in Q$,
$a \in I$, and $t \in T$.
Since the output alphabet and output function of a resettable state
system need not be explicitly specified, a resettable state system
$S = (I, Q, Z, \delta, \lambda, R, \rho)$ will be denoted by the 5-tuple $(I, Q, \delta, R, \rho)$.
This formulation of resettable state systems as special types of
resettable systems allows us to directly apply the following theory of
on-line diagnosis to state machines.
Notation: Resettable systems will be denoted by $S$, $S'$, $S_1$, $S_2$, etc.,
and resettable machines will be denoted by $M$, $M'$, $M_1$, $M_2$, etc.
Unless otherwise specified, $M$ will denote the resettable machine
$(I, Q, Z, \delta, \lambda, R, \rho)$; $M'$ will denote the resettable machine $(I', Q', Z', \delta',
\lambda', R', \rho')$; and so forth. $\mathcal{S}(I, Z, R)$ will denote the set of systems with
input alphabet $I$, output alphabet $Z$, and reset alphabet $R$. That is,
$$\mathcal{S}(I, Z, R) = \{S' \mid I' = I,\ Z' = Z,\ R' = R\}.$$
$\mathcal{M}(I, Z, R)$ will denote the corresponding set of resettable machines.
Definition 2.4: A resettable sequential machine $M = (I, Q, Z, \delta, \lambda, R, \rho)$
is memoryless or combinational if $|Q| = 1$.
The triple $(I, Z, \lambda)$ where $\lambda: I \to Z$ will be used to denote any
memoryless machine with input alphabet $I$, output alphabet $Z$, and
output function $\lambda$. The memoryless machine $M = (I, Z, \lambda)$ is said to
realize the function $\lambda: I \to Z$.
We will represent sequential machines in the usual manner,
i.e., via transition tables or state graphs. Resettable machines are
represented by minor extensions of these two methods. The transition
table of a resettable machine is identical to that of a machine with the
addition of one column on the right to accommodate the reset function.
If $\rho(r) = q$ then $r$ will appear in this additional column in the row
corresponding to state $q$. Similarly, the state graph of a resettable
machine is identical to that of a machine with the addition of one short
arrow for each $r \in R$. This arrow will be labeled $r$ and will point
to state $\rho(r)$.
Example 2.1: Let $M_1$ be the sequence generator with reset alphabet
$\{0\}$ and input alphabet $\{1\}$ which has been implemented by the circuit
in Fig. 2.2.
Fig. 2.2. Circuit for $M_1$
The transition table and the state graph for $M_1$ are shown in
Figs. 2.3 and 2.4.
Fig. 2.3. Transition Table for $M_1$
Fig. 2.4. State Graph for $M_1$
The circuit in Fig. 2.2 is also an implementation of a similar machine
$M_2$ with input alphabet $\{0, 1\}$. The state graph for $M_2$ is shown in
Fig. 2.5.
Fig. 2.5. State Graph for $M_2$
Thus, in $M_2$ the input symbol "0" can be interpreted as an input or as
a reset. In $M_2$ the outputs for input 0 are explicitly specified whereas
in $M_1$ they may be regarded as classical "don't cares."
We can view a particular discrete-time system as a system which
looks like some machine $M_i$ in one time interval, like $M_{i+1}$ in another
interval, and so on. This is also a good means of specifying a system.
Fig. 2.6. A Discrete-Time System
Example 2.2: Suppose that $M_1$ was implemented as in Fig. 2.2 and
that this circuit operated correctly up to time 100 when gate 2 became
stuck-at-0. What actually existed was not a resettable machine but a
(time-varying) resettable system $S$ which looks like $M_1$ up to time 100
and like a different machine, say $M_1'$, thereafter. The graph for $M_1'$ is
shown in Fig. 2.7.
Fig. 2.7. Resettable Machine $M_1'$
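We can represent $S$ as follows:
$$S = \begin{cases} M_1 & \text{for } t < 100 \\ M_1' & \text{for } t \geq 100. \end{cases}$$
By this we mean that $I = I_1 = I_1'$ and likewise for $Q$, $Z$, and $R$, and that
$$\delta(q, a, t) = \begin{cases} \delta_1(q, a) & \text{for } t < 100 \\ \delta_1'(q, a) & \text{for } t \geq 100 \end{cases}$$
and similarly for $\lambda$ and $\rho$.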
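The piecewise construction of Example 2.2 can be sketched as a small helper that splices two time-invariant functions at a given time. Since the transition table of the machine in the example is not reproduced here, the two-bit counter and its stuck-at fault below are hypothetical stand-ins, as is the name `splice`.

```python
def splice(before, after, tau):
    """Return a time-varying function equal to `before` for t < tau and `after` for t >= tau."""
    def spliced(q, a, t):
        return before(q, a) if t < tau else after(q, a)

    return spliced


# Hypothetical fault-free machine: a two-bit counter that advances on every input.
def delta_ok(q, a):
    return (q + 1) % 4


# Hypothetical faulty machine: the high-order state bit is stuck at 0.
def delta_faulty(q, a):
    return ((q + 1) % 4) & 0b01


# The resettable system that actually exists when the fault occurs at time 100.
delta = splice(delta_ok, delta_faulty, 100)
```

The same helper can be applied to the output function and, with a one-argument variant, to the reset function, which is what "and similarly" in the text refers to.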
For resettable systems we take the definitions of $\bar{\delta}$, $\bar{\lambda}$, and $\beta_q$
to be the same as those for systems. It is also convenient in the case
of resettable systems to specify behavior relative to a reset input $r$
that is released at time $t$; that is, the behavior of $S$ for condition $(r, t)$
($r \in R$, $t \in T$) is the function
$$\beta_{r,t}: I^+ \to Z$$
where
$$\beta_{r,t}(x) = \beta_{\rho(r,t)}(x, t).$$
If $t = 0$, $\beta_{r,0}$ is referred to as the behavior of $S$ for initial reset $r$
and is denoted simply as $\beta_r$.
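It is useful to extend the behavior function $\beta_{r,t}$ in a natural
manner to represent the sequence-to-sequence behavior of $S$. For
$r \in R$ and $t \in T$,
$$\hat{\beta}_{r,t}: I^+ \to Z^+$$
where for all $a_1 \ldots a_n \in I^+$,
$$\hat{\beta}_{r,t}(a_1 \ldots a_n) = \beta_{r,t}(a_1)\,\beta_{r,t}(a_1 a_2) \cdots \beta_{r,t}(a_1 \ldots a_n).$$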
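A minimal sketch of the behavior after a reset and of its sequence-to-sequence extension is given below. It assumes that the reset state at time t is given by the reset function and that the extended output is simply the string of outputs produced by the successive prefixes of the input; the function names and the example system are hypothetical.

```python
def behavior_after_reset(delta, lam, rho, r, t, x):
    """Last output when the nonempty input x starts at time t, just after reset r is released."""
    if not x:
        raise ValueError("the behavior is defined only for nonempty input sequences")
    q = rho(r, t)  # state at time t determined by the reset
    for k, a in enumerate(x[:-1]):
        q = delta(q, a, t + k)
    return lam(q, x[-1], t + len(x) - 1)


def sequence_behavior(delta, lam, rho, r, t, x):
    """Tuple of outputs, one per prefix of x, i.e. the whole output sequence."""
    return tuple(
        behavior_after_reset(delta, lam, rho, r, t, x[:n])
        for n in range(1, len(x) + 1)
    )


# Hypothetical time-invariant example: a modulo-3 accumulator that outputs its next state.
def delta_acc(q, a, t):
    return (q + a) % 3


def lam_acc(q, a, t):
    return (q + a) % 3


def rho_acc(r, t):
    return r  # reset symbol r puts the system in state r


outputs = sequence_behavior(delta_acc, lam_acc, rho_acc, 1, 0, (1, 1, 2))
```

Recomputing every prefix is quadratic in the length of the input; it is written this way only to mirror the definition, and a single pass that collects outputs would give the same result.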
We will now introduce a few properties of resettable machines
which will be important to our developing model of on-line diagnosis.
A more complete treatment of the properties of resettable machines
can be found in the appendix.
We define these properties for resettable machines rather than
for resettable systems because we will be applying them to "fault-free"
systems, which in this study will always be time-invariant.
We begin with some concepts of "reachability." Let $M$ be a
resettable machine. The reachable part of $M$, denoted by $P$, is the
set
$$P = \{\delta(\rho(r), x) \mid r \in R,\ x \in I^*\}.$$
$M$ is reachable if $P = Q$. $M$ is $\ell$-reachable if
$$P = \{\delta(\rho(r), x) \mid r \in R,\ x \in I^* \text{ and } |x| \leq \ell\}.$$
An elementary result of graph theory states that in a directed
graph with $n$ points, if a point $v$ can be reached from a point $u$ then
there is a path of length $n - 1$ or less from $u$ to $v$. An immediate
consequence of this is that any machine $M$ is $(|P| - 1)$-reachable.
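The reachable part and the smallest bound for which a machine is reachable within that many inputs can both be computed by a breadth-first search from the reset states. The sketch below is one way to do this; the function name and the saturating-counter example are assumptions made for illustration.

```python
def reachable_part(inputs, resets, delta, rho):
    """Return (P, ell): the reachable part and the least ell such that every
    state of P is reached from some reset state by an input of length <= ell."""
    frontier = {rho(r) for r in resets}  # states reached by the null sequence
    reached = set(frontier)
    ell = 0
    while True:
        new = {delta(q, a) for q in frontier for a in inputs} - reached
        if not new:
            return reached, ell
        reached |= new
        frontier = new
        ell += 1


# Hypothetical machine on states 0..4: a counter that saturates at 3,
# so state 4 can never be entered from the single reset state 0.
def delta_sat(q, a):
    return min(q + a, 3)


def rho_sat(r):
    return 0


P, ell = reachable_part((0, 1), ("r",), delta_sat, rho_sat)
```

In this example the bound from the text is met with equality, since the four reachable states lie on a single path from the reset state.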
Let $M, M' \in \mathcal{M}(I, Z, R)$. $M$ is equivalent to $M'$ (written $M \equiv M'$)
if $\beta_r = \beta'_r$ for all $r \in R$. Two states $q \in Q$ and $q' \in Q'$ are equivalent
($q \equiv q'$) if $\beta_q = \beta'_{q'}$. It is easily verified that these are both equivalence
relations, the first on $\mathcal{M}(I, Z, R)$ and the second on the states of machines
in $\mathcal{M}(I, Z, R)$.
A resettable machine $M$ is reduced if for all $q, q' \in P$, $q \equiv q'$
implies $q = q'$. A basic result of sequential machine theory states that
for every machine there is an equivalent reduced machine and that this
machine is unique up to isomorphism. The corresponding result for
resettable machines is given in the appendix.
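For finite-state machines, equivalence in the above sense can be decided by exploring pairs of states reached from corresponding reset states and comparing outputs. The following sketch does this with a breadth-first search; the representation of a machine as a triple of functions and all example names are assumptions.

```python
from collections import deque


def equivalent(inputs, resets, m1, m2):
    """Decide whether two resettable machines have the same behavior for every reset.

    Each machine is a triple (delta, lam, rho) of time-invariant functions
    delta(q, a), lam(q, a), and rho(r).
    """
    d1, l1, r1 = m1
    d2, l2, r2 = m2
    start = {(r1(r), r2(r)) for r in resets}
    seen, queue = set(start), deque(start)
    while queue:
        q1, q2 = queue.popleft()
        for a in inputs:
            if l1(q1, a) != l2(q2, a):
                return False  # some input sequence separates the two behaviors
            nxt = (d1(q1, a), d2(q2, a))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Hypothetical machines over inputs {0, 1} with a single reset to state 0.
mod2 = (lambda q, a: (q + a) % 2, lambda q, a: (q + a) % 2, lambda r: 0)
mod4_parity = (lambda q, a: (q + a) % 4, lambda q, a: (q + a) % 2, lambda r: 0)
mod4_count = (lambda q, a: (q + a) % 4, lambda q, a: (q + a) % 4, lambda r: 0)
```

The four-state machine `mod4_parity` is equivalent to the two-state machine `mod2` but is not reduced, since its states 0 and 2 have the same behavior.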
A concept which is central to sequential machine theory is that of
a "realization." The corresponding resettable machine concept will
be very important to our theory of on-line diagnosis. We will
introduce it by first stating Meyer and Zeigler's definition of realization for
sequential machines [27].
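Definition 2.5: If $M$ and $\tilde{M}$ are sequential machines then $M$ realizes
$\tilde{M}$ if there is a triple of functions $(\sigma_1, \sigma_2, \sigma_3)$ where $\sigma_1: (\tilde{I})^+ \to I^+$ is
a semigroup homomorphism such that $\sigma_1(\tilde{I}) \subseteq I$, $\sigma_2: \tilde{Q} \to Q$,
$\sigma_3: Z' \to \tilde{Z}$ where $Z' \subseteq Z$, such that for all $\tilde{q} \in \tilde{Q}$,
$$\tilde{\beta}_{\tilde{q}} = \sigma_3 \circ \beta_{\sigma_2(\tilde{q})} \circ \sigma_1.$$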
It has been shown by Leake [23] that this strictly behavioral
definition of realization is equivalent to the structurally oriented
definition of Hartmanis and Stearns [16].
If $M$ and $\tilde{M}$ are resettable machines then our definition of
realization is somewhat different. Inherent in this definition is our
presupposition that a resettable system will be reset before every use.
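Definition 2.6: If $M$ and $\tilde{M}$ are two resettable machines then $M$ realizes
$\tilde{M}$ if there is a triple of functions $(\sigma_1, \sigma_2, \sigma_3)$ where $\sigma_1: (\tilde{I})^+ \to I^+$ is
a semigroup homomorphism such that $\sigma_1(\tilde{I}) \subseteq I$, $\sigma_2: \tilde{R} \to R$,
$\sigma_3: Z' \to \tilde{Z}$ where $Z' \subseteq Z$, such that for all $\tilde{r} \in \tilde{R}$,
$$\tilde{\beta}_{\tilde{r}} = \sigma_3 \circ \beta_{\sigma_2(\tilde{r})} \circ \sigma_1.$$
This concept can be viewed pictorially as in Fig. 2.8.
Fig. 2.8. $M$ Realizes $\tilde{M}$ under $(\sigma_1, \sigma_2, \sigma_3)$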
Example 2.3: Let $M_3$ and $\tilde{M}_3$ be the resettable machines shown in
Fig. 2.9 and Fig. 2.10.
Fig. 2.9. Resettable Machine $\tilde{M}_3$
Fig. 2.10. Resettable Machine $M_3$
Then $M_3$ realizes $\tilde{M}_3$ under the triple $(\sigma_1, \sigma_2, \sigma_3)$ where $\sigma_1: (\tilde{I}_3)^+ \to (I_3)^+$
is the identity, $\sigma_2: \tilde{R}_3 \to R_3$ is defined by $\sigma_2(r) = r_1$, and
$\sigma_3: Z_3 \to \tilde{Z}_3$ is the identity. To verify this claim we need only
observe that $\tilde{\beta}_r(x) = \beta_{r_1}(x)$ for all $x \in (\tilde{I}_3)^+$.
Notice that the definition of realization for resettable machines
is less restrictive than that for sequential machines in the sense that
for resettable machines we only require the realizing system to
mimic the behavior of the reset states of the realized machine, while
in the sequential machine case the realizing system must mimic the
behavior of every state of the realized system. On the other hand, the
definition in the resettable case is more restrictive in the sense that
for each reset state in the realized machine not only does there exist
a state in the realizing machine which mimics its behavior, but we also
know how to get to that state.
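For finite-state resettable machines the realization condition can be checked mechanically. The sketch below assumes that the input map sends each symbol of the realized machine to a single symbol of the realizing machine (so it is given as a dictionary on symbols), that the reset map and output map are also dictionaries, and that machines are triples of time-invariant functions. All names and the example machines are hypothetical.

```python
from collections import deque


def realizes(m, m_tilde, tilde_inputs, tilde_resets, sigma1, sigma2, sigma3):
    """Check that, for every reset r of the realized machine m_tilde, its behavior
    equals sigma3 applied to the behavior of m after reset sigma2[r] on mapped inputs.

    m and m_tilde are triples (delta, lam, rho) of functions delta(q, a),
    lam(q, a), rho(r); sigma1, sigma2, sigma3 are dictionaries.
    """
    delta, lam, rho = m
    delta_t, lam_t, rho_t = m_tilde
    start = {(rho_t(r), rho(sigma2[r])) for r in tilde_resets}
    seen, queue = set(start), deque(start)
    while queue:
        q_t, q = queue.popleft()
        for a_t in tilde_inputs:
            a = sigma1[a_t]
            z = lam(q, a)
            # the output of m must lie in the domain of sigma3 and decode correctly
            if z not in sigma3 or sigma3[z] != lam_t(q_t, a_t):
                return False
            nxt = (delta_t(q_t, a_t), delta(q, a))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Hypothetical realized machine: a modulo-2 counter with one reset "r".
m_tilde = (lambda q, a: (q + a) % 2, lambda q, a: (q + a) % 2, lambda r: 0)
# Hypothetical realizing machine: a modulo-4 counter with resets "r1" and "r2".
m = (
    lambda q, a: (q + a) % 4,
    lambda q, a: (q + a) % 4,
    lambda r: {"r1": 0, "r2": 1}[r],
)
sigma1 = {1: 1}                      # identity on the single input symbol
sigma3 = {0: 0, 1: 1, 2: 0, 3: 1}    # decode a count to its parity
```

With the reset map sending "r" to "r1" the condition holds, while sending "r" to "r2" fails immediately, which illustrates the remark that the definition also tells us which reset to apply to reach the mimicking state.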
Before proceeding with our model of on-line diagnosis we must
introduce a few notational conventions. The identity function on a
set $A$ will be denoted by $e_A$. When it is clearly understood which
set is being mapped the subscript will be deleted.
If $A_1, \ldots, A_n$ is a sequence of $n$ sets, its Cartesian product is
the set
$$A_1 \times \cdots \times A_n = \times_{i=1}^{n} A_i = \{(x_1, \ldots, x_n) \mid x_i \in A_i,\ i = 1, \ldots, n\}.$$
The Cartesian product of an empty sequence of sets is taken to be any
singleton set.
Given a Cartesian product $A = \times_{i=1}^{n} A_i$, a coordinate projection of $A$
is a function $P_i: A \to A_i$ defined by $P_i(x_1, \ldots, x_n) = x_i$.
If $f_1: A \to B_1, \ldots, f_n: A \to B_n$ is a sequence of functions, the
cross-product function $\times_{i=1}^{n} f_i: A \to \times_{i=1}^{n} B_i$ is defined by
$$\left(\times_{i=1}^{n} f_i\right)(a) = (f_1(a), \ldots, f_n(a)).$$
The cross-product function can be used to extend coordinate projections
to project onto any subset of coordinates: if $C \subseteq \{1, \ldots, n\}$ then
$P_C: A \to \times_{i \in C} A_i$ is defined by $P_C = \times_{i \in C} P_i$.
In particular $P_\emptyset$ is a constant function with domain $A$.
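The projection and cross-product conventions can be illustrated with a minimal Python sketch. The function names `projection`, `cross_product`, and `projection_on` are hypothetical and chosen only for this illustration; tuples stand in for elements of a Cartesian product, and coordinates are numbered from 1 as in the text.

```python
def projection(i):
    """Coordinate projection P_i (1-indexed, as in the text)."""
    return lambda xs: xs[i - 1]


def cross_product(*fs):
    """Cross-product function: a -> (f_1(a), ..., f_n(a))."""
    return lambda a: tuple(f(a) for f in fs)


def projection_on(coords):
    """P_C for a set C of coordinates, built as the cross product of the P_i, i in C."""
    return cross_product(*(projection(i) for i in sorted(coords)))


x = ("a", "b", "c")
print(projection(2)(x))          # b
print(projection_on({1, 3})(x))  # ('a', 'c')
print(projection_on(set())(x))   # ()
```

With the empty coordinate set the result is the empty tuple for every argument, which matches the remark that the projection onto no coordinates is a constant function into a singleton set.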
2.2 Resettable Systems with Faults
Our model of a "resettable system with faults" is a specialization
of Meyer's general model of a "system with faults" [28].
Informally, a "system with faults" is a system, along with
a set of potential faults of the system and description of what
happens to the original system as the result of each fault.
The original system and the systems resulting from faults
are members of one of two prescribed classes of (formal)
systems, a "specification" class for the original system and
a "realization" class for the resulting systems. More precisely,
we say that a triple $(\mathcal{S}, \mathcal{R}, \rho)$ is a (system)
representation scheme if
i) $\mathcal{S}$ is a class of systems, the specification class,
ii) $\mathcal{R}$ is a class of systems, the realization class,
iii) $\rho: \mathcal{R} \to \mathcal{S}$ where, if $R \in \mathcal{R}$, $R$ realizes $\rho(R)$.
By a class of systems, in this context, we mean a class of
formal systems, i.e., a set of formally specified structures
of the same type, each having an associated behavior that is
determined by the structure [28].
In this study we are concerned with the reliable use of a system. 
That is, we are concerned with degradations in structure which Meyer 
calls "life defects. " This is contrasted with reliable design in which 
case we would be concerned with "birth defects." Thus, in our case,
a specification is a realization and we choose a representation scheme
$(\mathcal{R}, \mathcal{R}, \rho)$ where $\rho$ is the identity function on $\mathcal{R}$.
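Assuming that a faulty resettable system has the same input,
output, and reset alphabets as the fault-free system $S$, the following
class of resettable systems will suffice as a realization class:
$$\mathcal{S}(I, Z, R) = \{S' \mid S' \text{ is a resettable system with input alphabet } I, \text{ output alphabet } Z, \text{ and reset alphabet } R\}.$$
In summary, the representation scheme that we are choosing for
our study of on-line diagnosis is the scheme $(\mathcal{R}, \mathcal{R}, \rho)$ where
$\mathcal{R} = \mathcal{S}(I, Z, R)$ and $\rho$ is the identity function on $\mathcal{R}$.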
In such a scheme the seemingly difficult problem of describing
faults and their results becomes relatively straightforward. Before
we state our particular notion of a fault and its results we will repeat
here Meyer's general notion of a "system with faults" [28].
A system with faults in a representation scheme
$(\mathcal{S}, \mathcal{R}, \rho)$ is a structure $(S, F, \varphi)$ where
i) $S \in \mathcal{S}$
ii) $F$ is a set, the faults of $S$
iii) $\varphi: F \to \mathcal{R}$ such that, for some $f \in F$,
$\rho(\varphi(f)) = S$.
If $f \in F$, the system $S^f = \varphi(f)$ is the result of $f$. If $\rho(S^f) = S$
then $f$ is improper (by iii), $F$ contains at least one improper
fault); otherwise it is proper. A realization $S^f$ is fault-free
if $f$ is improper; otherwise $S^f$ is faulty [28].
In applying this notion to our study we must first define what we
mean by a fault of a resettable system. Given a resettable system
$S \in \mathcal{S}(I, Z, R)$, a fault $f$ of $S$ can be regarded as a transformation of
$S$ into another system $S' \in \mathcal{S}(I, Z, R)$ at some time $\tau$. Accordingly,
the resulting faulty system looks like $S$ up to time $\tau$ and like $S'$
thereafter. Since $S$ may be in operation at time $\tau$ we must also be
concerned with the question of what happens to the state of $S$ as this
transformation takes place. We handle this with a function $\theta$ from
the state set of $S$ to that of $S'$. The interpretation of $\theta$ is that if $S$ is
in state $q$ immediately before time $\tau$ then $S'$ is in state $\theta(q)$ at time
$\tau$. More precisely,
Definition 2.7: If $S \in \mathcal{S}(I, Z, R)$, a fault of $S$ is a triple
$$f = (S', \tau, \theta)$$
where $S' \in \mathcal{S}(I, Z, R)$, $\tau \in T$, and $\theta: Q \to Q'$.
A fault $f = (S', \tau, \theta)$ of $S$ is a permanent fault if $S'$ is time-invariant.
We view the occurrence of a fault $f = (S', \tau, \theta)$ of a system $S$ as
shown in Fig. 2.11.
Fig. 2.11. A Fault $f = (S', \tau, \theta)$ of $S$
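Given this formal representation of a fault of $S$, the resulting
faulty system is defined as follows.
Definition 2.8: The result of $f = (S', \tau, \theta)$ is the system
$$S^f = (I, Q^f, Z, \delta^f, \lambda^f, R, \rho^f)$$
where $Q^f = Q \cup Q'$,
$$\delta^f(q, a, t) = \begin{cases} \delta(q, a, t) & \text{if } q \in Q \text{ and } t < \tau - 1 \\ \theta(\delta(q, a, t)) & \text{if } q \in Q \text{ and } t = \tau - 1 \\ \delta'(q, a, t) & \text{if } q \in Q' \text{ and } t \geq \tau \end{cases}$$
$$\lambda^f(q, a, t) = \begin{cases} \lambda(q, a, t) & \text{if } q \in Q \text{ and } t < \tau \\ \lambda'(q, a, t) & \text{if } q \in Q' \text{ and } t \geq \tau \end{cases}$$
$$\rho^f(r, t) = \begin{cases} \rho(r, t) & \text{if } t < \tau \\ \theta(\rho(r, t)) & \text{if } t = \tau \\ \rho'(r, t) & \text{if } t > \tau. \end{cases}$$
(Arguments not specified in the above definitions may be assigned
arbitrary values.)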
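The construction of the result of a fault can be sketched in Python as follows. This is only an illustration under stated assumptions: the `System` record, `result_of_fault`, and `run` are hypothetical names; the sketch dispatches on time alone, so it relies on the caller never supplying a state of the wrong state set (the "arbitrary values" case above); and the mod-2 counter with a stuck output is an invented example, not one of the machines in the figures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class System:
    delta: Callable  # transition function: (q, a, t) -> next state
    lam: Callable    # output function:     (q, a, t) -> output symbol
    rho: Callable    # reset function:      (r, t)    -> state


def result_of_fault(S, S_prime, tau, theta):
    """Build S^f for f = (S', tau, theta): S before time tau, S' from tau on."""
    def delta_f(q, a, t):
        if t < tau - 1:
            return S.delta(q, a, t)
        if t == tau - 1:
            return theta(S.delta(q, a, t))  # state is carried over into S'
        return S_prime.delta(q, a, t)

    def lam_f(q, a, t):
        return S.lam(q, a, t) if t < tau else S_prime.lam(q, a, t)

    def rho_f(r, t):
        if t < tau:
            return S.rho(r, t)
        if t == tau:
            return theta(S.rho(r, t))
        return S_prime.rho(r, t)

    return System(delta_f, lam_f, rho_f)


def run(S, r, xs, t=0):
    """Output sequence produced for inputs xs after reset r is released at time t."""
    q = S.rho(r, t)
    out = []
    for a in xs:
        out.append(S.lam(q, a, t))
        q = S.delta(q, a, t)
        t += 1
    return out


# Hypothetical example: a mod-2 counter whose output becomes stuck at 1 at time 3.
counter = System(delta=lambda q, a, t: q ^ a, lam=lambda q, a, t: q, rho=lambda r, t: 0)
stuck = System(delta=lambda q, a, t: q, lam=lambda q, a, t: 1, rho=lambda r, t: 0)
faulty = result_of_fault(counter, stuck, tau=3, theta=lambda q: q)

print(run(counter, "r", [1, 1, 1, 1, 1]))  # [0, 1, 0, 1, 0]
print(run(faulty, "r", [1, 1, 1, 1, 1]))   # [0, 1, 0, 1, 1]
```

The two runs agree until the state handed over by the identity map is frozen by the faulty system, after which the outputs diverge.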
In justifying this representation of the resulting faulty system one
should regard a fault $f = (S', \tau, \theta)$ as actually occurring between time
$\tau - 1$ and $\tau$. Note that, for any fault $f$ of $S$, $S^f \in \mathcal{S}(I, Z, R)$.
Example 2.4: Recall that in Example 2.2 $M_1$ was transformed into
$M_1'$ at time 100. We would say now that $f = (M_1', 100, e)$ is a permanent
fault of $M_1$ and that $S$ is the result of $f$ (i.e., $S = M_1^f$).
Example 2.5: Again consider $M_1$ as implemented by the circuit in
Fig. 2.2 and let $g$ be the fault which is caused by dl becoming stuck-at-1
at time 50. Then $g = (M_1'', 50, \theta)$ is a permanent fault of $M_1$ where $M_1''$
is the machine shown in Fig. 2.12 and $\theta: Q_1 \to Q_1''$ is defined by the
table 
1 
E 10 10 
0 
Fig. 2.12. Resettable Machine $M_1''$
$M_1^g$ will behave as $M_1$ up to time 50 and thereafter it will produce a
constant sequence of 1's.
To complete the model, a resettable system with faults, in this
representation scheme, is a structure
$$(S, F, \varphi)$$
where $S \in \mathcal{S}(I, Z, R)$, $F$ is a set of faults of $S$ including at least one
improper fault (e.g., $f = (S, 0, e)$), and $\varphi: F \to \mathcal{S}(I, Z, R)$ where
$\varphi(f) = S^f$, for all $f \in F$. Given this definition, we can drop the explicit
reference to $\varphi$ in denoting a resettable system with faults, i.e., $(S, F)$ will
mean $(S, F, \varphi)$ where $\varphi$ is as defined above.
In the remainder of this study we will be dealing exclusively with
resettable systems. Thus we will refer to resettable systems simply
as systems and to resettable machines as machines.
A word is in order about our definition of faults. The interpretation
here is one of effect, not cause, e.g., we don't talk of stuck-at-1
OR gates but rather of the system which is created due to some presumed
physical cause. We will refer to these physical causes as component
failures or simply as failures. A fault, by our definition, consists of
precisely that information which is needed to define the system which
results from the fault. This allows us to treat faults in the abstract,
independent of specific network realizations of the system and without
reference to the technology employed in this realization and the types
of failures which are possible with this technology. We are assured,
however, that for each fault we have enough information to assess the
structural and behavioral effects of the fault, in particular as these
effects relate to fault diagnosis and tolerance.
There are limits, however, to how much can be done with a purely
effect-oriented concept of faults. When a system is sufficiently structured
to allow a reasonable notion of what may cause a fault we certainly will
want to make use of this notion. When this is the case we may, through
an abuse of language, refer to a specific failure at time $\tau$ as a fault.
What we will mean is that we have stated a cause of a fault and that there
is a unique fault which is the result of this failure at time $\tau$.
It is interesting to see what the scope of our definition of fault is
in terms of the types of failures which will result in faults. Recall that
a fault $f$ of a system $S$ is a triple, $f = (S', \tau, \theta)$, where $S' \in \mathcal{S}(I, Z, R)$.
Thus $S'$ is a (resettable) system with the same input, output, and reset
alphabets as $S$. The previous sentence contains, implicitly, every
restriction that we have put on faults. First of all, $S'$ is a (resettable)
system. Thus it remains within our universe of discourse. In particular,
its reset inputs still act like reset inputs. That is, they cause
$S'$ to go into a particular state regardless of the state it was in when the
reset input was applied. The restrictions on the input, output, and reset
alphabets are reasonable since after a fault occurs the system
presumably will have the same input and output terminals as it had
before the fault occurred.
Let $f = (S', \tau, \theta)$ be a fault. Because $S'$ may vary with time we have
considerable latitude in the types of failures which we may consider.
In particular, we may consider simultaneous permanent failures in one
or more components, simultaneous intermittent failures in one or more
components, or any combination of the above occurring at the same or
varying times. For example, a fault $f$ may be caused by an AND gate
becoming stuck-at-1 at time $\tau_1$, followed by an OR gate becoming
stuck-at-0 at time $\tau_2$.
Let us now compute the behavior of $S^f$ in state $q$. Let $x = a_1 \ldots a_n \in I^+$.
Then
$$\beta^f_q(x, t) = \lambda^f(\delta^f(q, a_1 \ldots a_{n-1}, t), a_n, t + n - 1).$$
There are three cases which must be considered.
Case ii) $q \in Q$, $t + n - 1 \geq \tau$, and $t < \tau$. Say $t + n - m = \tau$. Then
Thus we have proved:
Theorem 2.1: Let $S$ be a system and $f = (S', \tau, \theta)$ a fault of $S$. Then for
each $t \in T$ and $x \in I^+$
(As in the definitions of $\delta^f$ and $\lambda^f$, arguments not specified may be
assigned arbitrary values.)
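Corollary 2.1.1: Let $S$ be a system and $f = (S', \tau, \theta)$ a fault of $S$. Then
for each $r \in R$, $t \in T$, and $x \in I^+$
$$\beta^f_{r,t}(x) = \begin{cases} \beta_{r,t}(x) & \text{if } t + |x| \leq \tau \\ \beta'_{\theta(\delta(\rho(r,t), y, t))}(z, \tau) & \text{if } t + |x| > \tau \text{ and } t \leq \tau, \text{ where } x = yz \text{ and } |y| = \tau - t \\ \beta'_{r,t}(x) & \text{if } t > \tau. \end{cases}$$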
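Proof: By its definition
$$\beta^f_{r,t}(x) = \beta^f_{\rho^f(r,t)}(x, t).$$
Again we have three cases to consider.
Case i) $t + |x| \leq \tau$. Then $t < \tau$ and $\rho^f(r, t) = \rho(r, t) \in Q$.
Therefore by Theorem 2.1
$$\beta^f_{r,t}(x) = \beta_{\rho(r,t)}(x, t) = \beta_{r,t}(x).$$
Case ii) $t + |x| > \tau$ and $t \leq \tau$. If $t < \tau$ then $\rho^f(r, t) = \rho(r, t) \in Q$
and Case ii) of Theorem 2.1 applies with $\rho(r, t)$ in place of $q$. If
$t = \tau$ then $\rho^f(r, t) = \theta(\rho(r, t)) \in Q'$ and Case iii) of the theorem
applies giving us
$$\beta^f_{r,t}(x) = \beta'_{\theta(\rho(r,t))}(x, t) = \beta'_{\theta(\delta(\rho(r,t), \Lambda, t))}(x, \tau).$$
Case iii) $t > \tau$. In this case $\rho^f(r, t) = \rho'(r, t) \in Q'$. Therefore
$$\beta^f_{r,t}(x) = \beta'_{\rho'(r,t)}(x, t) = \beta'_{r,t}(x).$$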
We have noted that we will often be interested in the physical cause
of a fault. For example, in a network realization of a machine we may
be interested in faults which are caused by a specific NAND gate
becoming stuck-at-1. Since this gate failure results in different faults
as we consider it occurring at different times it seems natural to give
a name to this family of faults. More generally, we will define an
equivalence relation on a set of faults such that a family of faults such as
we have just mentioned will be an equivalence class.
First we must define an equivalence relation on $\mathcal{S}(I, Z, R)$ such
that two systems $S, S' \in \mathcal{S}(I, Z, R)$ are equivalent if they are identical
except for a shift in time.
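Definition 2.9: Let $S, S' \in \mathcal{S}(I, Z, R)$. $S'$ is a $\tau$-translation of $S$ if
$Q = Q'$ and for all $a \in I$, $r \in R$, and $t \in T$
i) $\delta(q, a, t) = \delta'(q, a, t + \tau)$
ii) $\lambda(q, a, t) = \lambda'(q, a, t + \tau)$
iii) $\rho(r, t) = \rho'(r, t + \tau)$.
If $S'$ is a $\tau$-translation of $S$ then it can be shown that for all $q \in Q$,
$r \in R$, $x \in I^+$, and $t \in T$
$$\beta_q(x, t) = \beta'_q(x, t + \tau) \quad \text{and} \quad \beta_{r,t}(x) = \beta'_{r,t+\tau}(x).$$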
Definition 2.10: Let $(S, F)$ be a system with faults and let $f_1 = (S_1, \tau_1, \theta_1)$
and $f_2 = (S_2, \tau_2, \theta_2)$ be in $F$. Then $f_1$ is equivalent to $f_2$ ($f_1 \equiv f_2$) if $S_1$
is a $(\tau_1 - \tau_2)$-translation of $S_2$ and $\theta_1 = \theta_2$.
Theorem 2.2: The above relations are equivalence relations.
Proof: The relation of "$\tau$-translation" is an equivalence relation on
$\mathcal{S}(I, Z, R)$ because "=" is an equivalence relation. The relation "$\equiv$" on
a set of faults of a system is an equivalence relation because
"$\tau$-translation" and "=" are both equivalence relations.
Notation: We denote the equivalence class of $F$ which contains the
fault $f = (S', \tau, \theta)$ by $[f]_F$. When the class of faults is clear we will drop
the $F$. Generally if $F$ is not mentioned we take it to be the set of all
possible faults of a system $S$. We let $f_i = (S_i, i, \theta)$ denote the fault in
$[f]$ which occurs at time $i$. When dealing with behaviors, $\beta^{f_i}$ will denote
the behavior of $S^{f_i}$, and $\beta^i$ will denote the behavior of $S_i$.
Let $f_i = (S_i, i, \theta)$ and $f_j = (S_j, j, \theta)$ be equivalent faults of a machine
$M$. Since $M$ is an $(i-j)$-translation of itself, it can be verified directly
from Definition 2.8 that $M^{f_i}$ is an $(i-j)$-translation of $M^{f_j}$. Hence,
Theorem 2.3: Let $f$ be a fault of $M$ and let $f_i, f_j \in [f]$. Then for all
$q \in Q$, $x \in I^+$, $r \in R$ and $t \in T$
$$\beta^{f_i}_q(x, t + i) = \beta^{f_j}_q(x, t + j)$$
and
$$\beta^{f_i}_{r,t+i}(x) = \beta^{f_j}_{r,t+j}(x).$$
In this section we have defined and studied the notion of a fault
of a system. In the remainder of this study we shall limit our
investigations to the case in which the fault-free system is time-invariant.
That is, we shall be studying faults of machines. If $f = (S', \tau, \theta)$ is a
fault of a machine $M$ we will allow $S'$ to vary with time.
2.3 Fault Tolerance and Errors
Given a system with faults $(S, F)$ and a proper fault $f \in F$, an
immediate question is whether the faulty system $S^f$ is usable in the
sense that its behavior resembles, within acceptable limits, that of the
fault-free system $S$. We will use the general notion of a "tolerance
relation" [28] to make more precise what is meant by "acceptable
limits." A tolerance relation for a representation scheme $(\mathcal{S}, \mathcal{R}, \rho)$ is
a relation $\tau$ between $\mathcal{R}$ and $\mathcal{S}$ ($\tau \subseteq \mathcal{R} \times \mathcal{S}$) such that, for all $R \in \mathcal{R}$,
$(R, \rho(R)) \in \tau$ (i.e., $\rho \subseteq \tau$). In this section we will develop the
particular notions of "acceptable limits" that we will be using in this study of
on-line diagnosis.
Given a machine $M$ it will be understood that $M$ realizes a specific
reduced and reachable machine $\tilde{M}$ under the triple $(\sigma_1, \sigma_2, \sigma_3)$. Under
the intended interpretation, $\tilde{M}$ serves as the specification of some
desired behavior and $M$ serves as the fault-free realization of this
behavior. This relationship between $\tilde{M}$ and $M$ will underlie our basic
notions of fault tolerance, error and on-line diagnosis.
In this study we will only be concerned with the behavior of $M$
under those resets and inputs which correspond via $\sigma_1$ and $\sigma_2$ to resets
and inputs of $\tilde{M}$. No requirements will ever be put on $\beta_r(x)$ or $\beta^f_{r,t}(x)$,
where $f$ is a fault of $M$, if $r \notin \sigma_2(\tilde{R})$ or $x \notin \sigma_1(\tilde{I}^+)$ because these are
considered to be "non-code space resets" and "non-code space inputs."
For this reason we will always assume that $\sigma_1$ and $\sigma_2$ are onto. In
actually dealing with machines for which $\sigma_1$ or $\sigma_2$ is not onto, occurrences
of "non-code space resets" and "non-code space inputs" could be
ignored or they could be treated as errors which must be detected.
These two options correspond to Carter and Schneider's [7] Don't
Care Assignments 1 and 2.
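We will be using two basic notions of fault tolerance. The first,
and weaker, corresponds to the preservation of the behavior of $M$
only insofar as its mimicking of $\tilde{M}$ is concerned.
Definition 2.11: Let $f$ be a fault of a machine $M$. Then $f$ is 1-tolerated
by $M$ for resets at time $t$ if for all $\tilde{r} \in \tilde{R}$
$$\tilde{\beta}_{\tilde{r}} = \sigma_3 \circ \beta^f_{\sigma_2(\tilde{r}),t} \circ \sigma_1 .$$
Alternatively, since $\sigma_1$ and $\sigma_2$ are onto and since
$\tilde{\beta}_{\tilde{r}} = \sigma_3 \circ \beta_{\sigma_2(\tilde{r})} \circ \sigma_1$, $f$ is 1-tolerated by $M$ for resets at time $t$ if for
all $r \in R$
$$\sigma_3 \circ \beta^f_{r,t} = \sigma_3 \circ \beta_r .$$
In the special case where $f$ is 1-tolerated by $M$ for resets at time
0, we will simply say that $f$ is 1-tolerated by $M$.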
The second, and stronger, notion of tolerance does not allow for
the tolerance of any change in behavior.
Definition 2.12: Let $f$ be a fault of a machine $M$. Then $f$ is 2-tolerated
by $M$ for resets at time $t$ if for all $r \in R$
$$\beta^f_{r,t} = \beta_r .$$
Again, $f$ is 2-tolerated by $M$ if it is 2-tolerated by $M$ for resets
at time 0.
Our definition of 1-tolerated induces a relation $\tau_1$ on $\mathcal{R}$ where
$M^f \, \tau_1 \, M$ if and only if $f$ is 1-tolerated by $M$. If $f$ is improper then $M^f = M$
and thus $f$ is 1-tolerated by $M$. Hence $M \, \tau_1 \, M$, and therefore $\tau_1$ is
a tolerance relation. Likewise 2-tolerated induces a tolerance relation
$\tau_2$. If $f$ is 2-tolerated by $M$ then we can see that $f$ is 1-tolerated by $M$.
Hence, as sets, $\tau_2 \subseteq \tau_1$. Finally, note that if $\sigma_3$ is 1-1 and $f$ is
1-tolerated by $M$ then $f$ is 2-tolerated by $M$.
Example 2.6: Let $M$ be the realization of $\tilde{M}$ which consists of 3 copies
of $\tilde{M}$, a voter, and a disagreement detector as shown in Fig. 2.13. Then
any fault $f$ which affects only one copy of $\tilde{M}$ is 1-tolerated but may not
be 2-tolerated, and its presence may be detected by the disagreement
detector.
Fig. 2.13. Triple Modular Redundancy with Voting
and Disagreement Detecting
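The distinction drawn in this example can be sketched in Python. The sketch is illustrative only: `vote` and `tmr_outputs` are hypothetical names, each copy is modeled simply as a function from an input sequence to an output sequence, and the parity module and its stuck-at-1 variant are invented for the illustration rather than taken from Fig. 2.13.

```python
from collections import Counter


def vote(outputs):
    """Majority voter plus disagreement detector for one time step."""
    value, count = Counter(outputs).most_common(1)[0]
    disagreement = count < len(outputs)  # True when the copies are not unanimous
    return value, disagreement


def tmr_outputs(copies, xs):
    """Run three copies on the same inputs; emit (voted value, disagreement flag)."""
    runs = [copy(xs) for copy in copies]
    return [vote(column) for column in zip(*runs)]


# Hypothetical copies: a fault-free module and one whose output is stuck at 1.
good = lambda xs: [a % 2 for a in xs]
stuck_at_1 = lambda xs: [1 for _ in xs]

print(tmr_outputs([good, good, good], [0, 1, 2]))
print(tmr_outputs([good, good, stuck_at_1], [0, 1, 2]))
```

With one faulty copy the voted values are unchanged, so a map that keeps only the voted coordinate sees no difference (the fault is 1-tolerated in this sketch), while the disagreement flag changes, so the full output differs (the fault is not 2-tolerated).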
Our definitions of 1- and 2-tolerated by $M$ for resets at time $t$
are refined notions of fault tolerance. Coarser notions, and ones more
in keeping with the literature, would be behavioral equivalence for
resets at any time. We prefer our finer definitions, for with them the
effects of time can be more naturally analyzed. One question which
we will study later is: For resets at how many (and which) times must
a fault be tolerated for it to be tolerated for resets at any time?
When a discussion or theorem applies equally well to 1-tolerated
and to 2-tolerated we will just use the general term "tolerated." We
also do this later in this section when we discuss "errors."
Theorem 2.4: Let $f = (S', \tau, \theta)$ be a fault of machine $M$. Then $f$ is
tolerated by $M$ for resets at time $t$ if and only if $f_{\tau - t}$ is tolerated by $M$.
Proof: By Theorem 2.3, $\beta^f_{r,t} = \beta^{f_{\tau-t}}_r$. Hence,
$\sigma_3 \circ \beta_r = \sigma_3 \circ \beta^f_{r,t}$ if and only if $\sigma_3 \circ \beta_r = \sigma_3 \circ \beta^{f_{\tau-t}}_r$,
and $\beta_r = \beta^f_{r,t}$ if and only if $\beta_r = \beta^{f_{\tau-t}}_r$. This
establishes the result.
Thus, $f_i, f_j, f_k, \ldots$ are tolerated by $M$ for resets at times $t_1, t_2, t_3, \ldots$
respectively if and only if $f_{i-t_1}, f_{j-t_2}, f_{k-t_3}, \ldots$ are tolerated by $M$, where
by "$F$ is tolerated by $M$" we mean that each $f \in F$ is tolerated by $M$. Due
to this we will always consider resets to be released at time 0 when
dealing with fault tolerance of machines and no generality will be lost.
Clearly, due to Theorem 2.3, this same sort of time translation can be
applied to any other behavioral attribute.
Example 2.7: Let $M_4$ be the sequence generator shown in Fig. 2.14.
This machine could be implemented by the circuit shown in Fig. 2.15.
Fig. 2.14. Machine $M_4$
Fig. 2.15. Circuit for $M_4$
Let $f$ be a fault of $M_4$ which is caused by dt becoming stuck-at-1 at
time $\tau$. Then $f = (M_4', \tau, \theta)$ where $M_4'$ is the machine represented by
the graph in Fig. 2.16 and $\theta$ is as indicated below.
l+i 10 10 
11 11 
Fig. 2.16. Machine $M_4'$
Consider $f_{-1}$, i.e., the fault $(M_4', -1, \theta)$, and note that $\beta^{f_{-1}}_0(11) = 1$
whereas $\beta_0(11) = 0$. Thus $f_{-1}$ is not 2-tolerated by $M_4$. On the other
hand both $M_4$ and $M_4^{f_{-1}}$ will produce the sequence 00010101... when
reset at $-10$. Thus $f_{-1}$ is 2-tolerated by $M_4$ for resets at $-10$. By
applying Theorem 2.4 we can learn that $f_i$ is not 2-tolerated by $M_4$
for resets at time $i + 1$ and that $f_9$ is 2-tolerated by $M_4$.
Corresponding to our two types of fault tolerance we can define
two types of errors.
Definition 2.13: Let $M$ be a machine, $r \in R$, $x \in I^+$, and $y \in Z^+$ where
$|x| = |y|$. The triple $(r, x, y)$ is called a 1-error (2-error) of $M$ if
$$\sigma_3(\hat{\beta}_r(x)) \neq \sigma_3(y) \qquad (\hat{\beta}_r(x) \neq y),$$
where $\sigma_3$ is applied symbol by symbol.
If $(r, x, y)$ is an error of $M$ and $f$ is a fault of $M$ for which
$\hat{\beta}^f_r(x) = y$ then we say that the fault $f$ causes the error $(r, x, y)$. Note
that any given error could be caused by many different faults.
The relation between fault tolerance and errors is very simple.
A fault $f$ is tolerated if and only if it causes no errors. The relation
between 1-errors and 2-errors is also straightforward. Namely,
every 1-error is a 2-error, and if $\sigma_3$ is 1-1 then every 2-error is a
1-error. Errors are very important in any study of fault diagnosis
because a fault can never be detected until it causes an error. The general
goal of on-line diagnosis is protection against undesirable behavioral
manifestations of faults, i.e., protection against errors.
An error $(r, ua, vb)$ where $a \in I$ and $b \in Z$ is a minimal error if
$(r, u, v)$ is not an error. If $(r, x, y)$ is a minimal 1-error then it is a
2-error but not necessarily a minimal 2-error. This notion of minimal
(or first) errors will be central to our notion of diagnosis. A minimal
error $(r, x, y)$ is said to occur at time $|x| - 1$. This is the time at
which the last symbol in $y$ is emitted.
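The notions of error and minimal error can be checked mechanically, as in the following Python sketch. It rests on assumptions that should be read as such: a 2-error is taken to be a triple whose output sequence differs from the fault-free response, a 1-error one that still differs after the output map is applied symbol by symbol, and an error of length one is treated as minimal. The names `is_error`, `is_minimal_error`, `occurrence_time`, and the example behavior are hypothetical.

```python
def is_error(beta_hat, sigma3, r, x, y, kind):
    """kind=2: y differs from the fault-free response; kind=1: differs after sigma3."""
    expected = beta_hat(r, x)
    if kind == 2:
        return list(y) != list(expected)
    return [sigma3(z) for z in y] != [sigma3(z) for z in expected]


def is_minimal_error(beta_hat, sigma3, r, x, y, kind):
    """An error is minimal if dropping its last input/output pair leaves no error."""
    if not is_error(beta_hat, sigma3, r, x, y, kind):
        return False
    if len(x) == 1:
        return True  # assumption: a one-symbol error has no shorter prefix to test
    return not is_error(beta_hat, sigma3, r, x[:-1], y[:-1], kind)


def occurrence_time(x):
    """A minimal error (r, x, y) occurs at time |x| - 1."""
    return len(x) - 1


# Hypothetical fault-free behavior: outputs are (parity bit, disagreement flag).
beta_hat = lambda r, x: [(a % 2, False) for a in x]
sigma3 = lambda z: z[0]  # the specification only sees the parity bit

x = [0, 1]
y = [(0, False), (1, True)]  # the flag is wrong at the second step
print(is_error(beta_hat, sigma3, "r", x, y, kind=2))          # True
print(is_error(beta_hat, sigma3, "r", x, y, kind=1))          # False
print(is_minimal_error(beta_hat, sigma3, "r", x, y, kind=2))  # True
print(occurrence_time(x))                                     # 1
```

The example triple is a 2-error that is not a 1-error, which is possible here only because the illustrative output map is not 1-1.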
Often we will be in a situation where we are concerned with a
machine $M$ tolerating a set of faults which are all caused by the same
phenomenon but which may occur at any time. More specifically, let
$f$ be a fault of $M$. We would like results which assure us that if some
finite subset of $[f]$ is tolerated by $M$ then all of $[f]$ is tolerated by
$M$. Later we will be interested in the same problem with regard to
diagnosis.
Our first result of this nature hinges on the fact that any reachable
state of an $\ell$-reachable machine is reachable by time $\ell$.
Theorem 2.5: Let $f$ be a fault of an $\ell$-reachable machine $M$ and suppose
$f_i$ is tolerated by $M$ for $0 \leq i \leq \ell$. Then $f_i$ is tolerated by $M$ for all
$i \geq 0$.
Proof: Assume, to the contrary, that $f_i$ is not tolerated by $M$ for some
$i > \ell$. Then there exists an error $(r, x, y)$ which is caused by $f_i$.
Hence $\hat{\beta}^{f_i}_r(x) = y$. Let $x = x_1 x_2$ and $y = y_1 y_2$ where $|x_1| = |y_1| = i$.
By Corollary 2.1.1 we know that
$$y = \hat{\beta}_r(x_1)\, \hat{\beta}^i_{\theta(\delta(\rho(r), x_1))}(x_2, i).$$
Let $q = \delta(\rho(r), x_1)$. Since $M$ is $\ell$-reachable, there exist $s \in R$ and
$u \in I^*$ such that $|u| = j \leq \ell$ and $\delta(\rho(s), u) = q$. By Theorem 2.3
$\hat{\beta}^i_{\theta(q)}(x_2, i) = \hat{\beta}^j_{\theta(q)}(x_2, j)$. Therefore if $\hat{\beta}_s(u) = v$ then
$$\hat{\beta}^{f_j}_s(u x_2) = \hat{\beta}_s(u)\, \hat{\beta}^j_{\theta(\delta(\rho(s), u))}(x_2, j) = v\, \hat{\beta}^j_{\theta(q)}(x_2, j) = v y_2 .$$
Clearly, $(s, u x_2, v y_2)$ is an error and it is caused by $f_j$. Therefore $f_j$ is
not tolerated. Contradiction. This establishes the result.
The following general example shows that Theorem 2.5 is the
strongest result possible, in the sense that if the hypothesis is at all
weakened then there exists a fault $f$ and a machine $M$ for which the
conclusion is invalid.
Example 2.8: Consider the $\ell$-reachable autonomous machine $M_\ell$ shown
in Fig. 2.17.
Fig. 2.17. Machine $M_\ell$
Let $m$ be an integer between 0 and $\ell$, and let
$f = (M_\ell, \tau, \theta)$ be a fault of $M_\ell$ where
Consider $M_\ell$ to be realizing itself. That is, take $\tilde{M} = M_\ell$.
The occurrence of $f = (M_\ell, \tau, \theta)$ has an effect on the behavior of
$M_\ell$ if and only if $M_\ell$ would be in state $q_m$ at time $\tau$. Therefore, $f_i =
(M_\ell, i, \theta)$ is tolerated by $M_\ell$ if and only if $i \not\equiv m \pmod{\ell + 1}$. Hence
"$f_i$ is tolerated by $M_\ell$ for $i = 0, \ldots, m-1, m+1, \ldots, \ell$" does not imply "$f_i$
is tolerated by $M_\ell$ for all $i \geq 0$." Since both $m$ and $\ell$ were arbitrarily
chosen, this general example shows that the hypothesis of Theorem 2.5
cannot be weakened.
Let us now look at faults which occur before time 0. In the
previous result we have not mentioned this case because if $f_i$ and $f_j$
are equivalent faults and $i$ or $j$ is less than 0 then there is, in general,
no relation between the behaviors of $M^{f_i}$ and $M^{f_j}$ for resets released at
time 0. However, in the important special case where $f = (M', \tau, \theta)$
is a permanent fault, any $f_i \in [f]$ with $i < 0$ will, with respect to resets
released at time 0, cause identical behavior.
Lemma 2.6: Let $f = (M', \tau, \theta)$ be a permanent fault of $M$. Then
$\beta^{f_i}_r = \beta^{f_j}_r$ for all $r \in R$ and $i, j < 0$.
Proof: Let $i, j < 0$. Because $f$ is permanent, $f_i = (M', i, \theta)$ and
$f_j = (M', j, \theta)$. By Corollary 2.1.1, $\beta^{f_i}_r = \beta'_r$ and $\beta^{f_j}_r = \beta'_r$ for all $r \in R$.
This establishes the result.
Theorem 2.7: Let $f$ be a permanent fault of an $\ell$-reachable machine $M$.
If $f_i$ is tolerated by $M$ for $-1 \leq i \leq \ell$ then $f_i$ is tolerated by $M$ for all
$i \in T$.
Proof: By Lemma 2.6, $\beta^{f_i}_r = \beta^{f_{-1}}_r$ for all $i < 0$. Hence, $f_{-1}$ is tolerated
by $M$ implies that $f_i$ is tolerated by $M$ for all $i < 0$. By Theorem 2.5,
$f_i$ is tolerated by $M$ for all $i \geq 0$. This establishes the result.
Before leaving this line of development we will make some final
observations. Note that a machine $M$ is 0-reachable if and only if
$\rho(R) = P$. In particular, every memoryless machine is 0-reachable. By
Theorem 2.5, if $M$ is 0-reachable and $f_0$ is tolerated by $M$ then $f_i$ is
tolerated by $M$ for all $i \geq 0$.
If $f = (M', \tau, \theta)$ is a fault of $M$ we think of $f$ as affecting the reset
mechanism of $M$ if $\rho'(r) \neq \theta(\rho(r))$ for some $r \in R$. If this is not the case
then a further result, similar to Lemma 2.6, can be obtained.
Lemma 2.8: Let $f = (M', \tau, \theta)$ be a permanent fault of $M$ and suppose
that $\rho'(r) = \theta(\rho(r))$ for all $r \in R$. Then $\beta^{f_i}_r = \beta^{f_j}_r$ for all $r \in R$ and
$i, j \leq 0$.
Proof: Since $\rho'(r) = \theta(\rho(r))$, by Corollary 2.1.1, $\beta^{f_0}_r = \beta'_r$ for all $r \in R$.
The result now follows just as in the proof of Lemma 2.6.
Putting the above observations together yields:
Theorem 2.9: Let $f = (M', \tau, \theta)$ be a permanent fault of $M$. Suppose
that $\rho'(r) = \theta(\rho(r))$ for all $r \in R$ and that $\rho(R) = P$. If $f_i$ is tolerated by
$M$ for any $i \leq 0$ then $f_i$ is tolerated by $M$ for all $i \in T$.
Proof: By Lemma 2.8 $f_i$ is tolerated by $M$ for all $i \leq 0$. Since $\rho(R) =
P$, $M$ is 0-reachable. Therefore, by Theorem 2.5 $f_i$ is tolerated by $M$
for all $i \geq 0$. This establishes the result.
2.4 On-line Diagnosis 
Our notion of on-line diagnosis of a system involves an external
detector (assumed to be fault-free) which observes the input and the
output of the system and makes a decision as to whether the behavior
of the system is within "acceptable limits" as set forth by our notions
of fault tolerance. Initial synchronization of the system with its
detector is achieved by using the same reset to initialize both systems.
The formal relation between a system and its detector is that of
a "cascade connection."
Definition 2. 14: The cascade connection of two systems S1 and S2 for 
which R1 = R2 and I2 = Z1 X I1 is the system 
Schematically, $S_1 * S_2$ can be pictured as in Fig. 2.18.
Fig. 2.18. The Cascade Connection of $S_1$ and $S_2$
Notation: If $u = z_1 z_2 \ldots z_n \in Z^+$ and $v = a_1 a_2 \ldots a_n \in I^+$ then the pair
$[u, v]$ will denote the sequence $(z_1, a_1)(z_2, a_2) \ldots (z_n, a_n) \in (Z \times I)^+$.
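Let $S_1 * S_2$ be the cascade connection of $S_1$ with $S_2$. Let $\beta^1$, $\beta^2$,
and $\beta^*$ denote the behavior functions of $S_1$, $S_2$, and $S_1 * S_2$ respectively.
It can be shown directly from the definition of a cascade connection that
for all $x \in I_1^+$, $q_1 \in Q_1$, $q_2 \in Q_2$, $r \in R_1$, and $t \in T$,
$$\beta^*_{(q_1, q_2)}(x, t) = \beta^2_{q_2}([\hat{\beta}^1_{q_1}(x, t), x], t)$$
and
$$\beta^*_{r,t}(x) = \beta^2_{r,t}([\hat{\beta}^1_{r,t}(x), x]).$$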
We can now formally define our notion of on-line diagnosis.
Definition 2.15: Let $(M, F)$ be a machine with faults, let $D$ be a machine
for which $M * D$ is defined, and let $k$ be a nonnegative integer. $(M, F)$ is
$(D, k)$-1-diagnosable (2-diagnosable) if
i) $\beta^*_r = 0$ for all $r \in R$, and
ii) if $(r, x, w)$ is a minimal 1-error (2-error) caused by some $f \in F$
then
$$\hat{\beta}^D_r([\hat{\beta}^f_r(xy), xy]) \neq 0^{|xy|} \quad \text{for all } y \in I^* \text{ with } |y| = k .$$
Thus, the detector $D$ observes the operation of $M^f$ and must make
a decision based on this observation as to whether an error has occurred.
Note that the fault-free realization $M$ and the detector are both
time-invariant (i.e., machines), and that the detector takes no part in the
computation of $M$'s output.
Fig. 2.19. Diagnosis of $(M, F)$ using the Detector $D$
The two conditions of Definition 2.15 can be paraphrased as:
i) $D$ responds negatively if no fault occurs; i.e., $D$ gives no
false alarms, and
ii) for all $f \in F$, $D$ responds positively within $k$ time steps of the
occurrence of the first error caused by $f$.
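These two conditions can be tested on individual runs, as in the following Python sketch. It is a sketch under assumptions rather than a decision procedure for diagnosability: a detector is modeled as a function from a sequence of (output, input) pairs to a sequence of signals with 0 meaning "no alarm", only one run at a time is examined, and the names `no_false_alarms`, `detects_within`, `compare_detector`, and `delayed_detector` are hypothetical.

```python
def first_error_time(expected, actual):
    """Time of the first position where the observed output differs, or None."""
    for t, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return t
    return None


def no_false_alarms(detector, xs, expected):
    """Condition i) on one run: all signals are 0 when the machine is fault-free."""
    return all(s == 0 for s in detector(list(zip(expected, xs))))


def detects_within(detector, xs, expected, actual, k):
    """Condition ii) on one run: a nonzero signal by k steps after the first error."""
    t_err = first_error_time(expected, actual)
    if t_err is None:
        return True  # no error on this run, nothing to detect
    if len(xs) < t_err + k + 1:
        raise ValueError("run too short to observe k steps after the first error")
    signals = detector(list(zip(actual, xs)))
    return any(s != 0 for s in signals[: t_err + k + 1])


# Hypothetical detectors for a machine whose correct output is the input parity.
def compare_detector(pairs):
    return [0 if z == a % 2 else 1 for z, a in pairs]


def delayed_detector(pairs):
    # Reports each mismatch one time step late.
    return ([0] + compare_detector(pairs))[: len(pairs)]


xs = [0, 1, 2, 3]
expected = [0, 1, 0, 1]
actual = [0, 1, 1, 1]  # first error at time 2

print(no_false_alarms(compare_detector, xs, expected))              # True
print(detects_within(compare_detector, xs, expected, actual, k=0))  # True
print(detects_within(delayed_detector, xs, expected, actual, k=0))  # False
print(detects_within(delayed_detector, xs, expected, actual, k=1))  # True
```

The delayed detector illustrates why the delay is part of the definition: the same error is caught with delay 1 but not with delay 0.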
Condition i) implies $0 \in Z_D$, the output alphabet of $D$. Each
$z \in Z_D$ other than 0 is called a fault-detection signal. The choice of the
symbol "0" to indicate that the machine $M$ is operating properly is
purely for notational convenience. In general we could let any subset
of $Z_D$ indicate proper operation and let the complement of this set in
$Z_D$ be the set of fault-detection signals. In a practical application this
choice would depend on the design constraints on the detector.
As we have done with fault tolerance and with errors, if a theorem
or remark applies to both "1-diagnosable" and "2-diagnosable" we will
just state it once using the general term "diagnosable."
Let $D$ be a detector for $M$. Then $I_D = Z \times I$. There will be times
when the observation of $M$'s input by $D$ will be unnecessary or undesired.
If for all $z \in Z$ and $a, b \in I$, $(z, a)$ and $(z, b)$ are equivalent inputs of $D$
then we will say that $D$ is independent of $M$'s input. In this case the
behavior of $D$ does not depend on the second coordinate of $D$'s input and
we will take $I_D$ to be simply $Z$.
Recall that with this concept of diagnosis we are only
considering faults of $M$. Faults of $D$ must be analyzed separately. In
finding a realization $M$ of $\tilde{M}$ and a detector $D$ there is some leeway in
how much of the added complexity required for diagnosis should go
into the detector and how much should go into the realization. If it all
goes into the realization then $D$ will serve only to select out certain
coordinates of $M$'s output to be used as the output of $D$. That is, $D$
will be memoryless and realize a projection. In this case we will
say that $(M, F)$ is $k$-self-diagnosable. In general, it is desirable for
the detector to be self-diagnosable for some suitable
set of faults.
The basic on-line diagnosis problem can be stated as follows:
Given a machine $\tilde{M}$, a class of faults $F$, a class of detectors $\mathcal{D}$,
and a delay $k$, find an (economical) realization $M$ of $\tilde{M}$ and a
detector $D \in \mathcal{D}$ such that $(M, F)$ is $(D, k)$-diagnosable.
In this chapter we have developed a model for the study of
on-line diagnosis of resettable machines, and we have stated the basic
on-line diagnosis problem. We end this chapter by stating some
fundamental questions, the answers to which will help solve this basic
problem. We will begin to answer these questions in the following chapters.
I. Given $\tilde{M}$, $M$, and $F$, does there exist a detector $D$ and a
delay $k$ such that $(M, F)$ is $(D, k)$-diagnosable?
II. If such a $D$ and $k$ exist, how does one construct an optimal or
near-optimal detector? What might be criteria for optimality?
III. What time-space tradeoffs are possible between the added
complexity needed for diagnosis and the maximum allowable delay?
We expect that there will be situations where if the detector is given
additional time in which to indicate an error then diagnosis may be
simplified.
IV. What are good on-line diagnosis techniques? When is each
technique applicable? How does one compare techniques?
V. What relationships exist between faults and errors? Given
$M$ and $F$, what errors are possible? Given $\tilde{M}$ and $F$, how can one find
a realization $M$ of $\tilde{M}$ such that the machine with faults $(M, F)$ gives rise
only to errors of a given type? These are important questions
because given a diagnosis technique or a particular type of detector,
it will often be easy to determine just what types of errors are
detectable. The faults that are diagnosable will then have to be inferred
from this information. Conversely, we will want to find realizations
such that the faults we are concerned with will cause errors that we
can detect.
VI. What properties of system structure and system behavior are
conducive to on-line diagnosability? Structural properties are
important for it is expected that they will relate directly to diagnosis
techniques. Behavioral properties could be used to measure the
inherent diagnosability of a given behavior in terms of the minimum
added complexity which would be required to obtain a given level of
on-line diagnosis.
CHAPTER III
General Properties of Diagnosis
In this short chapter we will present a few results on diagnosis
per se. That is, they are general results which tell us some things
about diagnosis, independent of the particular fault set being diagnosed
or of any particular diagnosis technique. In the following chapters
we look at the diagnosis of specific sets of faults and investigate
the capabilities and limitations of on-line diagnosis techniques.
It is interesting to see how our concept of on-line diagnosis
compares with a similar concept introduced by Carter and Schneider
[7] and called "fault-secure" by Anderson [1]. As stated by
Anderson, "A circuit is fault-secure if, for every fault in a
prescribed set, the circuit never produces incorrect code space outputs
for code space inputs."
Before making a formal comparison we must translate this
notion into our framework. In doing so we will strive to be faithful
to Anderson's intent.
Definition 3.1: A machine with faults, (M, F), is fault-secure if
(r, x, ya), where $a \in Z$, is a minimal 2-error caused by some $f \in F$
implies $a \notin \{\beta_r(x) \mid r \in R,\ x \in I^+\}$.
Thus if (M, F) is fault-secure then a combinational detector which
only observes the output of M can detect all minimal 2-errors. More
formally,
Theorem 3.1: (M, F) is fault-secure if and only if (M, F) is
(D, 0)-2-diagnosable where D is memoryless and independent of M's input.
Proof: (Necessity) Assume that (M, F) is fault-secure. Define
$\lambda_D: Z \to \{0, 1\}$ by
$$\lambda_D(z) = \begin{cases} 0 & \text{if } z \in \{\beta_r(x) \mid r \in R,\ x \in I^+\} \\ 1 & \text{otherwise.} \end{cases}$$
Let D be the memoryless detector which realizes $\lambda_D$. Then D is
independent of M's input and it can easily be verified that (M, F) is
(D, 0)-2-diagnosable.
(Sufficiency) Assume that (M, F) is (D, 0)-2-diagnosable where D is
memoryless and independent of M's input. Let $\lambda_D: Z \to \{0, 1\}$
denote the function realized by D and let
$Z' = \{\beta_r(x) \mid r \in R,\ x \in I^+\}$.
Then $\lambda_D(z) = 0$ for all $z \in Z'$ for otherwise a false alarm could occur.
Let (r, x, ya), where $a \in Z$, be a minimal 2-error caused by some $f \in F$.
If $a \in Z'$ then $\lambda_D(a) = 0$ and f is not detected without delay.
Therefore $a \notin Z'$. Hence (M, F) is fault-secure.
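The detector built in the necessity half of this proof can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the two-state machine, its tables `delta` and `lam`, and the function names are hypothetical and do not come from the report. It collects the symbols a fault-free machine can emit and returns the memoryless function $\lambda_D$ that flags every other symbol.

```python
def fault_free_outputs(delta, lam, reset_states, inputs):
    """Collect every output symbol the fault-free machine can emit."""
    seen = set(reset_states)
    frontier = list(reset_states)
    outputs = set()
    while frontier:
        q = frontier.pop()
        for a in inputs:
            outputs.add(lam[(q, a)])
            nxt = delta[(q, a)]
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return outputs


def memoryless_detector(code_outputs):
    """Return lambda_D: 0 on fault-free output symbols, 1 otherwise."""
    return lambda z: 0 if z in code_outputs else 1


# Hypothetical two-state machine over Z = {0, 1, 2}; symbol 2 is never emitted.
delta = {("p", 0): "p", ("p", 1): "q", ("q", 0): "p", ("q", 1): "q"}
lam = {("p", 0): 0, ("p", 1): 1, ("q", 0): 1, ("q", 1): 0}
code = fault_free_outputs(delta, lam, ["p"], [0, 1])
detect = memoryless_detector(code)
```

Such a detector is useful only when the faults of interest push the output outside the set of fault-free symbols, which is exactly the fault-secure condition of Definition 3.1.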
Thus the concept of (D, k)-diagnosable is a generalization of the
concept of fault-secure. In particular, (D, k)-diagnosis allows for
(i) different tolerance relations, (ii) nonzero delay in diagnosis,
(iii) detectors with memory, and (iv) explicit observation by the
detector of the input to the system being monitored.
The following result is a consequence of the fact that every
1-error is a 2-error but not conversely.
Theorem 3.2: If (M, F) is (D, k)-2-diagnosable then (M, F) is
(D, k)-1-diagnosable, but not conversely.
Proof: Let (M, F) be (D, k)-2-diagnosable. Then no false alarms will
occur and every minimal 2-error will be detected within k time steps
of its occurrence. Let (r, x, y) be a minimal 1-error. Then
$\sigma_3(\hat\beta_r(x)) \neq \sigma_3(y)$ and hence $\hat\beta_r(x) \neq y$.
Thus $(r, x_1, y_1)$ is a minimal 2-error for
some $x_1$ and $y_1$ such that $x = x_1 x_2$ and $y = y_1 y_2$. Since this minimal
2-error is detected within k time steps of its occurrence the minimal
1-error (r, x, y) must also be detected within k time steps of its
occurrence. Hence (M, F) is (D, k)-1-diagnosable.
The counterexample which shows that the converse does not
hold is given in the next chapter in the proof of Theorem 4.4.
Due to this result, in stating theorems "1-diagnosable" is a
weaker hypothesis than "2-diagnosable."
Although the converse of Theorem 3.2 does not hold in general,
the following partial converse can be obtained.
Theorem 3.3: If (M, F) is (D, k)-1-diagnosable and $\sigma_3$ is 1-1 then
(M, F) is (D, k)-2-diagnosable.
Proof: We observed in Section 2.3 that if $\sigma_3$ is 1-1 then every
2-error is a 1-error. The result is an immediate consequence of
this fact.
The next result will help us to see the relationship between fault
diagnosis and fault tolerance.
Theorem 3.4: Let (M, F) be a machine with faults. If F is tolerated
by M then (M, F) is $(D_0, 0)$-diagnosable where $D_0$ is a trivial
memoryless machine which realizes the constant 0 function.
Proof: Condition i) is clearly satisfied, and condition ii) is
satisfied because if F is tolerated by M then no $f \in F$ will cause any errors.
The decision in this case can be trivially made since no errors
are ever produced. The situation for tolerated faults is not so simple
as this result may seem to indicate for it must be remembered that
1-tolerated does not imply 2-tolerated and thus a 1-tolerated fault
could be detected through a 2-error.
We will now develop some results concerning diagnosis which are
analogous to Theorems 2.5, 2.7 and 2.9. Recall that these theorems
allowed us to infer the tolerance of an infinite set of equivalent faults
from knowledge that a specific finite subset of them is tolerated.
Theorem 3.5: Let M be a machine and let D be a detector for M.
Suppose that the cascade connection M * D is $\ell$-reachable, and that
f is a fault of M. If $(M, \{f_i\})$ is (D, k)-diagnosable for $0 \le i \le \ell$ then
$(M, \{f_i\})$ is (D, k)-diagnosable for all $i \ge 0$.
Proof: Assume that $(M, \{f_i\})$ is (D, k)-diagnosable for $0 \le i \le \ell$.
Then condition i) of Definition 2.15 is immediately satisfied. Let
(r, x, w) be a minimal error caused by $f_i$ where $i > \ell$, and let $u \in I^*$
with $|u| = k$. To show that $(M, \{f_i\})$ is (D, k)-diagnosable for $0 \le i$
we need only show that
$$\hat\beta^D_r([\hat\beta^{f_i}_r(xu), xu]) \neq 0^{|xu|}.$$
Let $x = x_1 z$ where $|x_1| = i$, and let $\delta^*(\rho^*(r), x_1) = (q, q')$. Since
M * D is $\ell$-reachable there exist $s \in R$ and $y \in I^*$ with $0 \le |y| \le \ell$
such that $\delta^*(\rho^*(s), y) = (q, q')$. Say $|y| = j$. Since $(M, \{f_j\})$ is
(D, k)-diagnosable,
$$\hat\beta^D_s([\hat\beta^{f_j}_s(yzu), yzu]) \neq 0^{|yzu|},$$
and since the fault detection signal must occur after the fault occurs,
$$\hat\beta^D_{q'}([\hat\beta^{f_j}_{\theta(q)}(zu, j), zu]) \neq 0^{|zu|}.$$
Now by Theorem 2.3, $\hat\beta^{f_i}_{\theta(q)}(zu, i) = \hat\beta^{f_j}_{\theta(q)}(zu, j)$ and hence
$$\hat\beta^D_{q'}([\hat\beta^{f_i}_{\theta(q)}(zu, i), zu]) \neq 0^{|zu|}.$$
Therefore $\hat\beta^D_r([\hat\beta^{f_i}_r(xu), xu]) \neq 0^{|xu|}$.
Hence $(M, \{f_i\})$ is (D, k)-diagnosable for all $i \ge 0$.
Example 2.8, which shows that the hypothesis of Theorem 2.5
cannot be weakened, works likewise for Theorem 3.5. This example
works for both fault tolerance and fault diagnosis because, as was
pointed out by Theorem 3.4, tolerated faults are trivially
diagnosable.
Theorem 3.6: Let M be a machine and let D be a detector for M
such that M * D is $\ell$-reachable. If f is a permanent fault of M and
$(M, \{f_i\})$ is (D, k)-diagnosable for $-1 \le i \le \ell$ then $(M, \{f_i\})$ is
(D, k)-diagnosable for all $i \in T$.
Proof: Assume that f is a permanent fault and that $(M, \{f_i\})$ is
(D, k)-diagnosable for $-1 \le i \le \ell$. By Theorem 3.5, $(M, \{f_i\})$ is
(D, k)-diagnosable for all $i \ge 0$. By Lemma 2.6, $\beta^{f_i}_r = \beta^{f_{-1}}_r$
for all $r \in R$ and $i < 0$. Hence every $f_i$ with $i < 0$ will cause
exactly the same errors. Since $(M, \{f_{-1}\})$ is (D, k)-diagnosable it
follows that $(M, \{f_i\})$ is (D, k)-diagnosable for all $i < 0$. This
establishes the result.
Let D be a detector for a machine M. It will often be the case
that the second coordinate of the state of M * D can be uniquely
determined from the first coordinate. In particular, this is always
the case when $|Q_D| = 1$. More formally, the cascade connection of
$M_1$ with $M_2$ is synchronized if there exists a function $h: Q_1 \to Q_2$
such that for each $(q_1, q_2)$ in the reachable part of $M_1 * M_2$,
$h(q_1) = q_2$. Such a function is called the synchronizing function of
$M_1 * M_2$ and it must satisfy $h(\rho_1(r)) = \rho_2(r)$ for each $r \in R$.
If M * D is synchronized and M is $\ell$-reachable then M * D is
also $\ell$-reachable. We have observed in Chapter II that M is
0-reachable if and only if $\rho(R) = P$, and that, in particular, every
memoryless machine is 0-reachable. Hence if $\rho(R) = P$ and M * D
is synchronized then M * D is 0-reachable. In this case we know
that if $f_0$ is diagnosable then $f_i$ is diagnosable for $0 \le i$.
We terminate this line of development by stating the strongest
result of this nature.
Theorem 3.7: Let M be a machine for which $\rho(R) = P$. Let D be
a detector for M such that M * D is synchronized. Let $f = (M', \tau, \theta)$
be a permanent fault for which $\rho'(r) = \theta(\rho(r))$ for all $r \in R$. If
$(M, \{f_i\})$ is (D, k)-diagnosable for any $i \le 0$ then $(M, \{f_i\})$ is
(D, k)-diagnosable for all $i \in T$.
Proof: Assume that $(M, \{f_\ell\})$ is (D, k)-diagnosable where $\ell \le 0$.
By Lemma 2.6, $\beta^{f_i}_r = \beta^{f_j}_r$ for all $i, j \le 0$. Therefore $(M, \{f_i\})$ is
(D, k)-diagnosable for all $i \le 0$. Since $\rho(R) = P$ and M * D is
synchronized, M * D is 0-reachable. Thus by Theorem 3.5, $(M, \{f_i\})$
is (D, k)-diagnosable for all $i \ge 0$. This establishes the result.
CHAPTER IV
Diagnosis of Unrestricted Faults
With rapidly changing technology it is risky to rely too heavily
on the classical stuck-at model of circuit failures. Other failure
modes such as bridging failures have been proposed and studied
(see [26] and [15] for example) but little is known about the
diagnosis of such failures. Intermittent and multiple failures are also
possible. Adequate failure mode analysis often exists only for
outdated technology.
There are other problems in obtaining a suitably restricted set
of faults which are peculiar to on-line diagnosis. For a given
failure it may be impossible to determine the $\theta$ function of the fault
caused by this failure. Thus fault sets which do not restrict the
fault mapping $\theta$ are advantageous.
In this chapter we will develop some basic results concerning
the diagnosis of "unrestricted faults." This set of faults is truly
unrestricted for it is precisely the set of all faults of the machine
being diagnosed.
Unrestricted faults are typically diagnosed using the technique
of duplication. One of the aims of this chapter is to take a deeper
look at duplication and at a generalization of this scheme.
An alternative to using duplication for the diagnosis of
unrestricted faults is investigated in Chapter V.
The main result in this chapter states that to achieve 1-diagnosis
of the unrestricted faults of a machine M, the detector must have at
least as many states as $\bar{M}$, the behavioral specification for M.
Furthermore, to achieve 2-diagnosis, the detector must have at least
as many states as $M_R$, the reduction of M. These bounds on the state
set size of the detector are independent of the delay allowed for the
diagnosis.
4.1 Unrestricted Faults
As stated above, the set of unrestricted faults of a machine
is simply the set of all faults of that machine. More formally,
Definition 4.1: The set of unrestricted faults of machine M, denoted
by $U_M$, is the set $U_M = \{f \mid f \text{ is a fault of } M\}$. That is,
$$U_M = \{(M', \tau, \theta) \mid M' \in \mathcal{M}(I, Z, R),\ \tau \in T,\ \text{and } \theta: Q \to Q'\}.$$
When it is clear what machine is under consideration, the
identifying subscript will be dropped.
One important property of the set of unrestricted faults is the
relation between this fault set and the set of errors that may be
caused by faults in this set. Given any $r \in R$, $x \in I^+$ and $y \in Z^+$ with
$|x| = |y|$, there is a fault $f \in U$ such that $\hat\beta^f_r(x) = y$. Therefore
faults in U can cause any possible erroneous behavior, and for
(M, U) to be (D, k)-diagnosable all of these possible erroneous
behaviors will have to be detected by D.
Due to the above observation it is clear that the output of $M^f$
(the system actually being observed by the detector) can give no
information about what the correct output should be. Therefore,
for the diagnosis of unrestricted faults, the ability of D to observe
M's input directly is crucial. This observation is made explicit
in the following result.
Theorem 4.1: If (M, U) is (D, k)-1-diagnosable, D is independent of
M's input, and $\bar{M}$ is transition distinct then $\bar{M}$ is
autonomous.
Proof: Suppose that (M, U) is (D, k)-1-diagnosable, D is independent
of M's input, and $\bar{M}$ is transition distinct. Assume, to the contrary,
that $\bar{M}$ is not autonomous. Then there exist $r \in R$ and $x, y \in I^+$
such that $|x| = |y|$ and $\sigma_3(\beta_r(x)) \neq \sigma_3(\beta_r(y))$. Let $v \in I^*$ with $|v| = k$.
For no false alarms to occur we must have $\hat\beta^D_r(\hat\beta_r(xv)) = 0^{|xv|}$ and
$\hat\beta^D_r(\hat\beta_r(yv)) = 0^{|yv|}$. Let $f \in U$ be a fault for which
$\hat\beta^f_r(xv) = \hat\beta_r(yv)$.
Since $(r, x, \hat\beta^f_r(x))$ is a 1-error it must be detected within k time
steps of its occurrence. But $\hat\beta^D_r(\hat\beta^f_r(xv)) = \hat\beta^D_r(\hat\beta_r(yv)) = 0^{|yv|}$.
Contradiction. Hence $\bar{M}$ must be autonomous.
4.2 Diagnosis Via Independent Computation and Comparison 
It is a well-known and obvious fact that if a system is
duplicated and both copies are run in parallel with the same inputs then by
dynamically comparing the outputs of the two copies any error
which does not appear simultaneously in both copies will be
immediately detected.
Our view of duplication is shown in Fig. 4.1. In this figure
the detector D consists of a copy of M along with a generalized
Exclusive-OR gate whose output is 0 if and only if its inputs are
identical. Given such a detector D, it is immediately clear that
(M, U) is (D, 0)-2-diagnosable.
Fig. 4.1. Diagnosis via Duplication in the Detector
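The duplication scheme can be sketched as a short simulation. This is a minimal illustration, not the report's construction: the two-state machine, the class `Mealy`, and the stuck-at output fault injected from time `tau` are all hypothetical choices made for the example. The comparator emits 0 exactly when the monitored output equals the output of the fault-free copy.

```python
class Mealy:
    """A Mealy machine given by next-state and output tables."""

    def __init__(self, delta, lam, reset_state):
        self.delta, self.lam, self.state = delta, lam, reset_state

    def step(self, a):
        z = self.lam[(self.state, a)]
        self.state = self.delta[(self.state, a)]
        return z


def run_with_duplication(delta, lam, reset_state, inputs, stuck_at=None, tau=None):
    """Run M beside a fault-free copy; return the comparator output per step."""
    monitored = Mealy(delta, lam, reset_state)
    copy = Mealy(delta, lam, reset_state)
    alarms = []
    for t, a in enumerate(inputs):
        z = monitored.step(a)
        if stuck_at is not None and t >= tau:
            z = stuck_at  # hypothetical fault: output stuck from time tau on
        alarms.append(0 if z == copy.step(a) else 1)
    return alarms


# Hypothetical machine: fault-free outputs on inputs 1,1,0,0 are 1,0,1,0.
delta = {("p", 0): "p", ("p", 1): "q", ("q", 0): "p", ("q", 1): "q"}
lam = {("p", 0): 0, ("p", 1): 1, ("q", 0): 1, ("q", 1): 0}
clean = run_with_duplication(delta, lam, "p", [1, 1, 0, 0])
faulty = run_with_duplication(delta, lam, "p", [1, 1, 0, 0], stuck_at=1, tau=1)
```

In the faulty run the alarm is raised at exactly those steps where the stuck output differs from the correct one, which is the zero-delay behavior claimed for duplication.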
Duplication is an expensive technique, involving somewhat
more than twice the circuitry required for the unchecked system
alone, but it has a number of positive attributes. In addition to
being capable of diagnosing the unrestricted set of faults,
synthesis is easy and self-testing and self-diagnosable comparators
are known to exist [1].
The basic configuration shown in Fig. 4.1 can be generalized
to the configuration shown in Fig. 4.2. In this figure the detector
consists of a machine M' which runs in parallel with M and a
combinational comparator C which dynamically compares the
outputs of M and M'. Note that for the cascade connection M * D to be
defined we must have I' = I and R' = R.
Fig. 4.2. A Generalization of Duplication in the Detector
With this scheme M' may be much less complex than M.
However, we will show that there is a relationship between the size of
the state set of M' and the level of diagnosis which may be possible
using M'.
In the following result we give a necessary and sufficient
condition for (M, U) to be (D, 0)-diagnosable where D is structured
as in Fig. 4.2. The basic intuition for this result is that (M, U)
is (D, 0)-1-diagnosable if and only if it is possible to perfectly
predict the behavior of $\bar{M}$ from that of M'.
Theorem 4.2: Let M realize $\bar{M}$ under $(\sigma_1, \sigma_2, \sigma_3)$. Let [M', C]
denote a detector for M constructed from M' and C as shown in Fig.
4.2. There exists a $\sigma_3'$ such that M' realizes $\bar{M}$ under $(\sigma_1, \sigma_2, \sigma_3')$
if and only if there exists C such that (M, U) is ([M', C], 0)-1-diagnosable.
Similarly there exists $\sigma_3'$ such that M' realizes M
under $(e, e, \sigma_3')$ if and only if there exists a C such that (M, U) is
([M', C], 0)-2-diagnosable.
Proof: (Necessity) Assume that M' realizes $\bar{M}$ under $(\sigma_1, \sigma_2, \sigma_3')$.
Recall that $\sigma_1$ and $\sigma_2$ are assumed to be onto. Because of this
assumption, it follows that $\sigma_3' \circ \beta'_r = \sigma_3 \circ \beta_r$ for all $r \in R$.
Let C be the comparator shown in Fig. 4.3.
Fig. 4.3. The Comparator Used in the Proof of Theorem 4.2
Since $\sigma_3' \circ \beta'_r = \sigma_3 \circ \beta_r$ the detector [M', C] will give no false
alarms. Let (r, x, y) be a minimal 1-error caused by $f \in U$. Then
$\sigma_3(\beta^f_r(x)) \neq \sigma_3(\beta_r(x))$. Hence $\sigma_3(\beta^f_r(x)) \neq \sigma_3'(\beta'_r(x))$, and this will
cause the Exclusive-OR gate to emit a 1. Therefore the minimal
1-error (r, x, y) is detected with no delay. Hence (M, U) is
([M', C], 0)-1-diagnosable.
Similarly, if M' realizes M under $(e, e, \sigma_3')$ then $\beta_r = \sigma_3' \circ \beta'_r$
and a comparator as shown in Fig. 4.3, but without the $\sigma_3$ function,
can be used to achieve ([M', C], 0)-2-diagnosis of (M, U).
(Sufficiency) Assume that (M, U) is ([M', C], 0)-1-diagnosable. To
prove that there exists a $\sigma_3'$ such that M' realizes $\bar{M}$ under $(\sigma_1, \sigma_2, \sigma_3')$
we must exhibit a function $\sigma_3'$ and show that $\sigma_3 \circ \beta_r = \sigma_3' \circ \beta'_r$. This
is sufficient because M realizes $\bar{M}$ under $(\sigma_1, \sigma_2, \sigma_3)$
and $\sigma_1$ and $\sigma_2$ are assumed to be onto.
Since no false alarms may occur we know that $C(\beta_r(x), \beta'_r(x)) = 0$
for all $r \in R$ and $x \in I^+$. Define $\sigma_3'$ as follows: $\sigma_3'(\beta'_r(x)) = \sigma_3(\beta_r(x))$.
Since $\sigma_3'$ has the desired property we must simply verify that it is
indeed a function.
It is clear that every $z \in \{\beta'_r(x) \mid r \in R,\ x \in I^+\}$ has an image
under $\sigma_3'$. To see that this image is unique suppose that $\beta'_r(x) =
\beta'_s(y)$. We must show that $\sigma_3(\beta_r(x)) = \sigma_3(\beta_s(y))$. Let $\beta'_r(x) = a$,
$\sigma_3(\beta_r(x)) = b$, and $\sigma_3(\beta_s(y)) = c$. Then C(b, a) = C(c, a) = 0. Assume
to the contrary that $b \neq c$. Let $f \in U$ be a fault which causes the
output of M to be c at time $|x| - 1$ and which has no other effect.
Let x = uv where $v \in I$. Then $(r, x, \hat\beta_r(u)c)$ is a minimal 1-error
and since C(c, a) = 0, it is not detected when it occurs. This
contradicts the assumption that (M, U) is ([M', C], 0)-1-diagnosable. Hence
$\sigma_3'$ is a function and M' realizes $\bar{M}$ under $(\sigma_1, \sigma_2, \sigma_3')$.
The proof that (M, U) is ([M', C], 0)-2-diagnosable implies that
there exists a function $\sigma_3'$ such that M' realizes M under $(e, e, \sigma_3')$
is essentially the same as the above proof.
From Theorem 4.2 we know that if M realizes M' and M' is
reduced and reachable then $|Q| \ge |Q'|$. Hence Theorem 4.2 tells
us that if we use the scheme shown in Fig. 4.2 for the diagnosis of
unrestricted faults then we must have $|Q'| \ge |\bar{Q}|$ in order to achieve
1-diagnosis, and $|Q'| \ge |Q_R|$ in order to achieve 2-diagnosis,
where $M_R$ is the reduction of M.
4.3 Diagnosis with Zero Delay 
The question we will answer next is whether it is possible to
achieve (D, 0)-1-diagnosis of (M, U) with a detector which is less
complex, in terms of state set size, than the reduced and reachable
specification $\bar{M}$. One reason to believe that this may be possible
is the observation that if $\bar{M}$ has an inverse then this inverse may
have fewer states than $\bar{M}$, and yet a detector constructed using this
inverse may be capable of diagnosing all of U. Examples of such
inverses are given in the following chapter.
Theorem 4.3: If (M, U) is (D, 0)-1-diagnosable then $|Q_D| \ge |\bar{Q}|$.
Proof: Let (M, U) be (D, 0)-1-diagnosable, and assume, to the
contrary, that $|Q_D| < |\bar{Q}|$. Without loss of generality, assume
that M is reachable.
Claim: There exist $q, q' \in Q$ and $s \in Q_D$ such that $(q, s), (q', s)
\in P^*$, the reachable part of M * D, and $\sigma_3 \circ \beta_q \neq \sigma_3 \circ \beta_{q'}$.
Let $g: Q \to \mathcal{P}(Q_D) - \emptyset$ (where $\mathcal{P}(Q_D) = \{X \mid X \subseteq Q_D\}$) be
defined by $g(q) = \{s \mid (q, s) \in P^*\}$. Assume that the claim is not
true. Then $\sigma_3 \circ \beta_q \neq \sigma_3 \circ \beta_{q'}$ implies $g(q) \cap g(q') = \emptyset$. We know
from the proof of Theorem A.2 that for each $\bar{q} \in \bar{Q}$ there is a state
$f(\bar{q})$ for which $\bar\beta_{\bar{q}} = \sigma_3 \circ \beta_{f(\bar{q})} \circ \sigma_1$ and that f is necessarily 1-1.
Since $\bar{M}$ is reduced and reachable there must exist $|\bar{Q}| = \ell$ unique
states $\{q_1, \ldots, q_\ell\} \subseteq Q$ such that $i \neq j$ implies $g(q_i) \cap g(q_j) = \emptyset$,
and therefore $|Q_D| \ge |\bar{Q}|$. Contradiction. This establishes the
claim.
Let $q, q' \in Q$ and $s \in Q_D$ be such that $(q, s), (q', s) \in P^*$ and
$\sigma_3 \circ \beta_q \neq \sigma_3 \circ \beta_{q'}$. Then there exists a sequence ua where $u \in I^*$
and $a \in I$ such that $\sigma_3(\beta_q(ua)) \neq \sigma_3(\beta_{q'}(ua))$ and if $u \neq \Lambda$ then
$\sigma_3(\hat\beta_q(u)) = \sigma_3(\hat\beta_{q'}(u))$. Since $(q, s) \in P^*$, there exist $r \in R$ and
$y \in I^*$ such that $\delta^*(\rho^*(r), y) = (q, s)$.
Recall that given any $r \in R$, $x \in I^+$ and $y \in Z^+$ with $|x| = |y|$, there is a
fault $f \in U$ such that $\hat\beta^f_r(x) = y$. Let $f \in U$ be a fault for which
$\hat\beta^f_r(yua) = \hat\beta_r(y)\hat\beta_{q'}(ua)$. Since it is known that
$\sigma_3(\hat\beta_q(u)) = \sigma_3(\hat\beta_{q'}(u))$, it follows
that $(r, yua, \hat\beta^f_r(yua))$ is a minimal 1-error. Now (M, U) is (D, 0)-1-diagnosable
implies $\hat\beta^D_r([\hat\beta^f_r(yua), yua]) \neq 0^{|yua|}$. Since no false
alarms may occur, $\hat\beta^D_r([\hat\beta_r(y), y]) = 0^{|y|}$. Also, since $(q', s) \in P^*$,
$\hat\beta^D_s([\hat\beta_{q'}(ua), ua]) = 0^{|ua|}$. Hence
$\hat\beta^D_r([\hat\beta^f_r(yua), yua]) = 0^{|yua|}$.
This contradicts the assumption that (M, U) is (D, 0)-1-diagnosable.
Therefore $|Q_D| \ge |\bar{Q}|$.
Corollary 4.3.1: If (M, U) is (D, 0)-2-diagnosable then $|Q_D| \ge |Q_R|$,
where $M_R$ is the reduction of M.
Proof: Assume that (M, U) is (D, 0)-2-diagnosable, and consider
M to be realizing $M_R$. By Theorem 3.2, (M, U) is (D, 0)-1-diagnosable,
and hence, by Theorem 4.3, $|Q_D| \ge |Q_R|$.
Let us now consider the set of faults of M which are caused by
the output of M becoming stuck-at-v, where $v \in Z$, at some time $\tau$.
More formally, the set of permanent output faults of M is the set
$$F_o = \{f = (M', \tau, \theta) \mid M' = (I, Q, Z, \delta, \lambda', R, \rho) \text{ where } \lambda'(q, a) = \lambda'(s, b) \text{ for all } q, s \in Q \text{ and } a, b \in I\}.$$
Because the set of permanent output faults causes the same minimal
2-errors as the set of unrestricted faults, if $(M, F_o)$ is (D, 0)-2-diagnosable
then (M, U) is (D, 0)-2-diagnosable. However, U and $F_o$ do
not cause the same minimal 1-errors, and, in fact, $(M, F_o)$ is
(D, 0)-1-diagnosable does not imply that (M, U) is (D, 0)-1-diagnosable.
These statements are proved in the following result.
Theorem 4.4: $(M, F_o)$ is (D, 0)-2-diagnosable if and only if (M, U)
is (D, 0)-2-diagnosable. However, $(M, F_o)$ is (D, 0)-1-diagnosable
does not imply that (M, U) is (D, 0)-1-diagnosable.
Proof: Let $(M, F_o)$ be (D, 0)-2-diagnosable. Let (r, ya, w), where
$a \in I$, be a minimal 2-error which is caused by $f \in U$. To show that
(M, U) is (D, 0)-2-diagnosable it suffices to show that
$\beta^D_r([\hat\beta^f_r(ya), ya]) \neq 0$. Since (r, ya, w) is a minimal error,
$\hat\beta^f_r(y) = \hat\beta_r(y)$ and $\beta_r(ya) \neq \beta^f_r(ya)$. Say $\beta^f_r(ya) = b$, and
consider the fault $f' \in F_o$ which is caused
by the output of M becoming stuck-at-b at time $|y|$. Then
$\hat\beta^{f'}_r(ya) = \hat\beta^f_r(ya)$, and f' also causes the minimal 2-error (r, ya, w). Since
$(M, F_o)$ is (D, 0)-2-diagnosable we know that $\beta^D_r([\hat\beta^{f'}_r(ya), ya]) \neq 0$.
Hence $\beta^D_r([\hat\beta^f_r(ya), ya]) \neq 0$ and (M, U) is (D, 0)-2-diagnosable.
Now assume that (M, U) is (D, 0)-diagnosable. Since $F_o \subseteq U$,
it follows immediately that $(M, F_o)$ is (D, 0)-diagnosable.
We prove that $(M, F_o)$ is (D, 0)-1-diagnosable does not imply
(M, U) is (D, 0)-1-diagnosable by supplying a counterexample. Let
$\bar{M}_1$, $M_1$, $D_1$, and $\sigma_3: Z \to \bar{Z}$ be specified by the tables in Fig. 4.4.
Then $\bar{M}_1$ is reduced and reachable, and $M_1$ realizes $\bar{M}_1$ under
$(e, e, \sigma_3)$.
Fig. 4.4. Machines $\bar{M}_1$, $M_1$, and $D_1$ and $\sigma_3: Z \to \bar{Z}$
Since $|Q_{D_1}| < |\bar{Q}_1|$ we know from Theorem 4.3 that $(M_1, U)$ is not
$(D_1, 0)$-1-diagnosable. To see that $(M_1, F_o)$ is $(D_1, 0)$-1-diagnosable
takes a bit of analysis. Briefly, states A, D, and E duplicate states
a, d and e and any error which occurs when $M_1$ is in one of these
states is immediately detected. If $M_1$ is in b or c then $D_1$ will be
in BC and if the output becomes stuck-at-2 or 3 at this time it will
be immediately detected. If $M_1$ is in b or c and a stuck-at-0 or
stuck-at-1 fault occurs then it will be tolerated for one time step
and detected the next. This establishes the result.
In the above counterexample it is clear that $(M_1, F_o)$ is not
$(D_1, 0)$-2-diagnosable because a stuck-at-1 fault which occurs when
$M_1$ is in b causes a 2-error which is not immediately detected.
Therefore this example also proves that, in general, (M, F) is
(D, k)-1-diagnosable does not imply that (M, F) is (D, k)-2-diagnosable.
Also, if $(M, F_o)$ was (D, 0)-2-diagnosable for some D then by
Theorem 4.4 (M, U) would be (D, 0)-2-diagnosable and from Theorem
4.3 it would follow that $|Q_D| \ge |\bar{Q}|$. Hence this is also an example
of how 1-diagnosis may be achieved with a detector which is less
complex than the least complex detector which is sufficient for
2-diagnosis.
4.4 Diagnosis with Nonzero Delay 
Suppose now that we allow some arbitrary, but fixed, delay k > 0
in the detection process. Can this additional time be traded off for
less detector complexity? Unfortunately, for the unrestricted case,
the answer is no. In fact, if (M, U) is (D', k)-1-diagnosable then we
can construct a detector D, essentially by eliminating unnecessary
states of D', such that (M, U) is (D, 0)-1-diagnosable.
Before stating this result formally, we will establish an
important lemma.
Lemma 4.5: If (M, U) is (D', k)-1-diagnosable then there exists a
detector D such that $|Q_D| \le |Q_{D'}|$, (M, U) is (D, k)-1-diagnosable,
and for each $q \in Q_D$, $\lambda_D(q, (z, a)) = 0$ for some $(z, a) \in Z \times I$.
Proof: Assume that (M, U) is (D', k)-1-diagnosable and construct
D from D' as follows:
1) Delete from the state table of D' any row corresponding to
a state q for which $\lambda_{D'}(q, (z, a)) = 1$ for all $(z, a) \in Z \times I$.
2) In the resulting table, replace every reference to the
deleted state with a reference to an arbitrary remaining state, and set
the corresponding output to 1.
3) Repeat steps 1) and 2) until no further deletions are possible.
Since $|Q_{D'}| < \infty$ the above algorithm will terminate in a finite
number of iterations.
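The three steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the detector is given as dictionaries `delta` and `lam` keyed by (state, symbol), where each symbol stands for one pair (z, a); the function name `prune_detector`, the three-state table, and the rule of redirecting to the first remaining state are hypothetical choices (the construction allows any remaining state). The sketch also refuses to delete the last state, a case that cannot arise for a detector that gives no false alarms.

```python
def prune_detector(delta, lam, states, symbols):
    """Delete states whose output is 1 on every symbol, as in steps 1) to 3)."""
    delta, lam, states = dict(delta), dict(lam), list(states)
    while True:
        doomed = [q for q in states if all(lam[(q, s)] == 1 for s in symbols)]
        if not doomed or len(states) == 1:
            return delta, lam, states
        q = doomed[0]
        states.remove(q)                      # step 1: delete the row for q
        for s in symbols:
            del delta[(q, s)]
            del lam[(q, s)]
        target = states[0]                    # an arbitrary remaining state
        for key, nxt in list(delta.items()):
            if nxt == q:                      # step 2: redirect and signal
                delta[key] = target
                lam[key] = 1


# Hypothetical detector: "trap" always signals, so it is deleted first;
# after redirection "s1" always signals too and is deleted on the next pass.
symbols = ["x", "y"]
delta = {("s0", "x"): "s0", ("s0", "y"): "s1",
         ("s1", "x"): "trap", ("s1", "y"): "trap",
         ("trap", "x"): "trap", ("trap", "y"): "trap"}
lam = {("s0", "x"): 0, ("s0", "y"): 0,
       ("s1", "x"): 0, ("s1", "y"): 1,
       ("trap", "x"): 1, ("trap", "y"): 1}
new_delta, new_lam, new_states = prune_detector(delta, lam, ["s0", "s1", "trap"], symbols)
```

The example shows why the process must be repeated: deleting one state can turn a state that previously had a 0 entry into one that signals on every symbol.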
From the nature of the above construction it is clear that
$|Q_D| \le |Q_{D'}|$ and for each $q \in Q_D$, $\lambda_D(q, (z, a)) = 0$ for some
$(z, a) \in Z \times I$. It only remains to be shown that (M, U) is
(D, k)-1-diagnosable.
If the detector D' is in a state q for which
$0 \notin \{\lambda_{D'}(q, (z, a)) \mid (z, a) \in Z \times I\}$, then an error must have
occurred because if D' is in q
then an error detection signal will be emitted regardless of the input
to D'. Hence this error could be signaled whenever a transition to
q is indicated, and there would be no loss in diagnosis and no
possibility for a false alarm. Since all minimal errors which q signaled
would then be signaled before D' got to state q, q could be eliminated.
This is the essence of what is accomplished in steps 1) and 2).
This elimination process is necessarily iterative because step 2)
may introduce new states to be deleted.
Since this construction is diagnosis preserving, (M, U) is
(D, k)-1-diagnosable.
Theorem 4.6: If (M, U) is (D', k)-1-diagnosable then there exists
a detector D with $|Q_D| \le |Q_{D'}|$ such that (M, U) is
(D, 0)-1-diagnosable.
Proof: Assume that (M, U) is (D', k)-1-diagnosable. From Lemma
4.5 there exists a detector D such that $|Q_D| \le |Q_{D'}|$, (M, U) is
(D, k)-1-diagnosable, and for each $q \in Q_D$, $\lambda_D(q, (z, a)) = 0$ for some
$(z, a) \in Z \times I$.
Claim: (M, U) is (D, 0)-1-diagnosable.
Assume, to the contrary, that (M, U) is not (D, 0)-1-diagnosable.
Using induction on the delay of the diagnosis, we will deduce that
(M, U) is not (D, m)-1-diagnosable for all $m \ge 0$. This will establish
the result for it contradicts the hypothesis that (M, U) is
(D, k)-1-diagnosable.
Having assumed that the basis step for our induction is true,
we assume that (M, U) is not (D, m)-1-diagnosable for some $m \ge 0$, and
we must show that this implies (M, U) is not (D, m+1)-1-diagnosable.
Since (M, U) is not (D, m)-1-diagnosable, there exists a minimal
1-error (r, x, y) caused by $f \in U$ and a sequence $v \in I^*$ with $|v| = m$
such that $\hat\beta^D_r([\hat\beta^f_r(xv), xv]) = 0^{|xv|}$. Let
$\delta_D(\rho_D(r), [\hat\beta^f_r(xv), xv]) = s$.
Let $(z, a) \in Z \times I$ be such that $\lambda_D(s, (z, a)) = 0$. By Lemma 4.5 we know
that such a (z, a) exists. Let f' be a fault for which
$\hat\beta^{f'}_r(xva) = \hat\beta^f_r(xv)z$. Then $(r, x, \hat\beta^{f'}_r(x))$ is a minimal 1-error but
$\hat\beta^D_r([\hat\beta^{f'}_r(xva), xva]) = 0^{|xva|}$. Hence (M, U) is not
(D, m+1)-1-diagnosable. Therefore, (M, U) is not (D, 0)-1-diagnosable implies (M, U)
is not (D, m)-1-diagnosable for all $m \ge 0$.
But we know that (M, U) is (D, k)-1-diagnosable. Hence (M, U)
is (D, 0)-1-diagnosable. This establishes the result.
Corollary 4.6.1: If (M, U) is (D, k)-1-diagnosable then $|Q_D| \ge |\bar{Q}|$.
Proof: This is an immediate consequence of Theorem 4.6 and
Theorem 4.3.
Corollary 4.6.2: If (M, U) is (D, k)-2-diagnosable then $|Q_D| \ge |Q_R|$,
where $M_R$ is the reduction of M.
Proof: Assume that (M, U) is (D, k)-2-diagnosable, and consider M
to be realizing $M_R$. From Theorem 3.2, it follows that (M, U) is
(D, k)-1-diagnosable. The result now follows immediately from
Corollary 4.6.1.
We know from Theorem 4.4 and Corollary 4.3.1 that $(M, F_o)$
is (D, 0)-2-diagnosable implies $|Q_D| \ge |Q_R|$. Can this result be
generalized as was done for unrestricted faults by the previous
corollary? The following example shows that the answer is no.
This example serves as a good example of when a space-time
tradeoff is possible.
Example 4.1: Consider machines $M_2$ and $D_2$ of Fig. 4.5. Since
$M_2$ is reduced and reachable, $|Q_2| = |Q_{2R}|$, where $M_{2R}$ is the
reduction of $M_2$.

    M2:        0      1      R
         a    b/0    c/0     r
         b    a/2    d/2
         c    d/2    a/2
         d    e/0    c/0
         e    d/1    a/1

    D2:        0      1      2
         A    A/1    B/0    C/0
         B    A/0    B/1    C/0
         C    A/0    B/0    C/1

Fig. 4.5. Machines $M_2$ and $D_2$
Note that no output symbol can appear next to itself in any output
sequence produced by $M_2$. Since $D_2$ will produce an error detection
signal precisely when two consecutive inputs to it are identical, it
can detect all permanent output faults of $M_2$ with a delay of at most one.
Therefore $(M_2, F_o)$ is $(D_2, 1)$-2-diagnosable, yet $|Q_{2R}| > |Q_{D_2}|$.
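The claim of Example 4.1 can be checked by brute force on short input sequences. The sketch below makes several assumptions that should be read as such: the table `M2` is our reading of Fig. 4.5 with reset state a, the detector is modeled as remembering the last symbol it saw, and its reset state is taken to be B (modeled as "last symbol 1"), which the source does not state. The function names are hypothetical.

```python
from itertools import product

# Assumed reading of Fig. 4.5: state -> {input: (next_state, output)}.
M2 = {
    "a": {0: ("b", 0), 1: ("c", 0)},
    "b": {0: ("a", 2), 1: ("d", 2)},
    "c": {0: ("d", 2), 1: ("a", 2)},
    "d": {0: ("e", 0), 1: ("c", 0)},
    "e": {0: ("d", 1), 1: ("a", 1)},
}


def run_m2(inputs, stuck=None, tau=None):
    """Output sequence of M2 from reset state a, optionally stuck-at from tau."""
    state, out = "a", []
    for t, a in enumerate(inputs):
        state, z = M2[state][a]
        if stuck is not None and t >= tau:
            z = stuck
        out.append(z)
    return out


def run_d2(outputs, last=1):
    """D2 signals 1 when a symbol repeats; last=1 models an assumed reset state B."""
    alarms = []
    for z in outputs:
        alarms.append(1 if z == last else 0)
        last = z
    return alarms


def check(max_len=6):
    """True if D2 gives no false alarms and detects each error with delay <= 1."""
    for n in range(1, max_len + 1):
        for xs in product((0, 1), repeat=n):
            good = run_m2(xs)
            if any(run_d2(good)):
                return False  # false alarm on a fault-free run
            for tau, v in product(range(n), (0, 1, 2)):
                bad = run_m2(xs, stuck=v, tau=tau)
                errs = [t for t in range(n) if bad[t] != good[t]]
                if not errs or errs[0] == n - 1:
                    continue  # no error, or too few steps left to judge delay 1
                first_alarm = run_d2(bad).index(1)
                if not errs[0] <= first_alarm <= errs[0] + 1:
                    return False
    return True
```

Calling `check()` exercises every input sequence of length up to six and every stuck-at value and onset time, so it supports the example only for those bounded cases and is not a proof.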
CHAPTER V 
Diagnosis Using Inverse Machines 
It is well known that many circuits can be diagnosed by what is
commonly called a "loop check." This involves regenerating the
input to the circuit from the output and then comparing the
regenerated input with the actual input. Often the "inverse" circuit is easier
to implement than the original circuit, thus providing a savings over
duplication. For example, division can be checked using
multiplication. It is also possible to have greater confidence in a loop check
than in duplication, especially if the checking circuit is less complex
than the original circuit.
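The division example can be made concrete with a few lines of code. This is only an illustrative sketch: the function `checked_divide`, its `divider` parameter, and the deliberately faulty divider are hypothetical, and the divisor is assumed to be a positive integer. The check regenerates the dividend from the quotient and remainder by multiplication and compares it with the actual input.

```python
def checked_divide(a, b, divider=divmod):
    """Divide a by b (b > 0 assumed) and loop-check the result by multiplication."""
    q, r = divider(a, b)
    ok = (q * b + r == a) and (0 <= r < b)
    return q, r, ok


def faulty_divider(a, b):
    """Hypothetical faulty unit: the quotient is off by one."""
    return a // b + 1, a % b


good = checked_divide(17, 5)
bad = checked_divide(17, 5, divider=faulty_divider)
```

The checker here is a single multiplication and addition, which is simpler than the division it checks, in line with the remark that the inverse is often easier to implement than the original circuit.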
In this chapter we will investigate the use of "inverse machines"
for diagnosis using a loop check. Informally, machine $\bar{M}$ is an
inverse of machine M if $\bar{M}$ can reconstruct the input to M from its
output with at most a finite delay.
Machines which have inverses can be characterized as being
those machines which are "information lossless." Information
lossless machines are machines whose behavior functions satisfy a
condition which is similar to, but weaker than, the condition which
a 1-1 function must satisfy.
Information lossless machines and inverse machines were first
introduced by Huffman [18]. Huffman devised a test for information
losslessness and for the existence of inverses. It should be pointed
out that our definitions of these notions are slightly less general
than Huffman's. The definitions in this paper are directed towards
the use of inverse machines for diagnosis.
Even [13] later devised a better means of determining information
losslessness, and he presented two means for obtaining inverse
machines.
Information lossless machines and inverse machines are
discussed in textbooks by Kohavi [20] and Hennie [17]. Kohavi provides
a fuller description of Even's techniques for obtaining inverse
machines, and Huffman describes a different means of obtaining
inverse machines.
The questions about the use of inverse machines for diagnosis
which we seek to answer in this chapter are: When can an inverse
be used for the diagnosis of unrestricted faults? Given a machine
M and an inverse $\bar{M}$ of M, what will be the delay in diagnosis if $\bar{M}$
is used to diagnose M using a loop check? How can an arbitrary
machine be realized so that unrestricted fault diagnosis is possible
using a loop check?
We concentrate on unrestricted fault diagnosis in this chapter
because this is the most natural and important fault class which can
be diagnosed using a loop check. Inverse machines can be used for
the diagnosis of more restricted sets of faults but synthesis and
analysis for more general levels of diagnosis seem to be very
difficult.
5.1 Inverses of Machines 
Before we can formally define the inverse of a machine we need 
to introduce one preliminary notion. 
Definition 5.1: An (I, n)-delay machine (delay machine) is a machine
$M^n = (I, I^n, I, \delta, \lambda, R, \rho)$ such that if $a_i \in I$, $1 \le i \le n+1$, then

$$\delta((a_1, \ldots, a_n), a_{n+1}) = (a_2, \ldots, a_{n+1})$$

and

$$\lambda((a_1, \ldots, a_n), a_{n+1}) = a_1 .$$

An (I, n)-delay machine simply delays its input for n time steps.
Stated more precisely, if $M^n$ is an (I, n)-delay machine then

$$\beta^n_{(a_1, \ldots, a_n)}(a_{n+1} \cdots a_{n+m}) = a_m .$$
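The behavior of a delay machine can be illustrated with a short simulation. The following is only a sketch: the class name `DelayMachine`, the helper `run`, and the choice of (0, 0) as the initial contents of the state are assumptions made for the illustration and do not appear in the report.

```python
from collections import deque


class DelayMachine:
    """Sketch of an (I, n)-delay machine; the state holds the last n inputs."""

    def __init__(self, reset_state):
        # reset_state is the assumed initial tuple (a_1, ..., a_n); n = len(reset_state)
        self.state = deque(reset_state)

    def step(self, a):
        if not self.state:          # n = 0: the machine is the identity on I
            return a
        out = self.state.popleft()  # lambda((a_1, ..., a_n), a_{n+1}) = a_1
        self.state.append(a)        # delta((a_1, ..., a_n), a_{n+1}) = (a_2, ..., a_{n+1})
        return out


def run(machine, xs):
    """Apply the input sequence xs and collect the output sequence."""
    return [machine.step(x) for x in xs]


outputs = run(DelayMachine((0, 0)), [1, 0, 1, 1, 0])
print(outputs)  # the input sequence shifted right by two steps
```

With n = 2 the first two outputs are the assumed initial contents of the state, and every later output is the input applied two steps earlier.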
Definition 5.2: Let M and $\bar{M}$ be two machines such that $R = \bar{R}$ and
$Z = \bar{I}$. $\bar{M}$ is an (n-delayed) inverse of M if there exists an (I, n)-delay
machine $M^n$ with reset alphabet R such that for all $r \in R$ and
$x \in I^+$

$$\bar{\beta}_r(\hat{\beta}_r(x)) = \beta^n_r(x) .$$
Note that if $\bar{M}$ is an inverse of M then $I \subseteq \bar{Z}$. However, it is
not necessary to have $I = \bar{Z}$. Symbols which are in $\bar{Z}$ but not in I
can be useful for diagnosis. Since they will never appear while $\bar{M}$
is receiving its input from M, the appearance of one immediately
signifies that an error has occurred.
$\bar{M}$ might more properly have been dubbed a "right inverse" of
M for if $\bar{M}$ is an inverse of M it is not necessarily true that M is an
inverse of $\bar{M}$. This is illustrated in Example 5.1. This example
is a counter-example to the claims of Kohavi [20] and Even [13]
that if $\bar{M}$ is an inverse of M then M is an inverse of $\bar{M}$.
Example 5.1: Consider machines M1 and $\bar{M}_1$ of Fig. 5.1. $\bar{M}_1$ is
a 0-delayed inverse of M1 but M1 is not an inverse of $\bar{M}_1$.
Fig. 5.1. Machines M1 and $\bar{M}_1$
In fact, there is no machine which is
an inverse of $\bar{M}_1$. This is because the input symbols 0 and 2 are
equivalent and so there is no way in which they can be distinguished
once they have been applied.
Definition 5.3: A machine M is information lossless of delay n if
for all $r \in R$ and $a_1 a_2 \cdots a_m,\ b_1 b_2 \cdots b_m \in I^+$ ($a_i, b_i \in I$, $1 \le i \le m$)

$$\hat{\beta}_r(a_1 \cdots a_m) = \hat{\beta}_r(b_1 \cdots b_m)$$

implies $a_i = b_i$ for $1 \le i \le m-n$.
M is said to be lossless if it is information lossless of delay
n for some nonnegative integer n. M is lossy if it is not lossless.
Example 5.2: Machine M1 of Fig. 5.1 is information lossless of
delay 0 and machine $\bar{M}_1$ of Fig. 5.1 is lossy.
Fig. 5.2. Machine M in Series with an Inverse of M
Referring to Fig. 5.2, if M is lossless and $\bar{M}$ is an inverse of
M then intuitively no information is lost as sequences from $I^+$ are
transformed into sequences from $Z^+$ by M. The same is true for
the entire process which consists of transforming sequences from $I^+$
into sequences from $Z^+$ and then back again. Therefore it is somewhat
surprising to see, as we have in Example 5.2, that $\bar{M}$ may be lossy.
This may occur because while $\bar{M}$ must lose no information in
transforming the sequences it observes at the output of M, M may not be
capable of producing all possible output sequences. Thus while $\bar{M}$
must be lossless with respect to a subset of $Z^+$ it may be lossy with
respect to all of $Z^+$.
Even [13] gives an algorithm for determining if a given machine
is lossless, and if so, of what delay. It is particularly easy to
determine whether a given machine is lossless of delay 0. This is
because a machine M is lossless of delay 0 if and only if the output
symbols in every row which corresponds to a state $q \in P$ are all
distinct.
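The row test for losslessness of delay 0 is easy to mechanize. The sketch below assumes a state table stored as a dictionary `table[state][input] = (next_state, output)`; this representation, the function names, and the two small machines are hypothetical and serve only to illustrate the test.

```python
def reachable(table, start_states):
    """States reachable from the given start (reset) states."""
    seen, stack = set(start_states), list(start_states)
    while stack:
        q = stack.pop()
        for nxt, _ in table[q].values():
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_lossless_of_delay_0(table, start_states):
    """True if every reachable row of the state table has distinct outputs."""
    for q in reachable(table, start_states):
        outputs = [z for _, z in table[q].values()]
        if len(set(outputs)) != len(outputs):
            return False
    return True


# table[state][input] = (next_state, output); both machines are hypothetical
T_GOOD = {"A": {0: ("B", "x"), 1: ("A", "y")},
          "B": {0: ("A", "y"), 1: ("B", "x")}}
T_BAD = {"A": {0: ("B", "x"), 1: ("A", "x")},
         "B": {0: ("A", "y"), 1: ("B", "x")}}

print(is_lossless_of_delay_0(T_GOOD, ["A"]))
print(is_lossless_of_delay_0(T_BAD, ["A"]))
```

Only rows of reachable states are examined, in keeping with the restriction to states $q \in P$ in the text.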
Machines for which inverse machines exist can be characterized
as being precisely those machines which are lossless. More precisely,
Theorem 5.1: M has an n-delayed inverse if and only if M is
information lossless of delay n.
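Proof: (Necessity) Assume that $\bar{M}$ is an n-delayed inverse of M.
Let $r \in R$ and $a_1 \cdots a_m,\ b_1 \cdots b_m \in I^+$ ($a_i, b_i \in I$, $1 \le i \le m$) such
that $\hat{\beta}_r(a_1 \cdots a_m) = \hat{\beta}_r(b_1 \cdots b_m)$. We must show that $a_i = b_i$ for
all i, $1 \le i \le m-n$.
Since $\bar{M}$ is an n-delayed inverse of M there exists an (I, n)-delay
machine $M^n$ such that $\bar{\beta}_r \circ \hat{\beta}_r = \beta^n_r$. In particular,
$\bar{\beta}_r(\hat{\beta}_r(a_1 \cdots a_\ell)) = \beta^n_r(a_1 \cdots a_\ell) = a_{\ell-n}$ and
$\bar{\beta}_r(\hat{\beta}_r(b_1 \cdots b_\ell)) = \beta^n_r(b_1 \cdots b_\ell) = b_{\ell-n}$
for all $\ell$, $n < \ell \le m$.
Now $\hat{\beta}_r(a_1 \cdots a_m) = \hat{\beta}_r(b_1 \cdots b_m)$ implies
$\bar{\beta}_r(\hat{\beta}_r(a_1 \cdots a_\ell)) = \bar{\beta}_r(\hat{\beta}_r(b_1 \cdots b_\ell))$ for all $\ell$, $1 \le \ell \le m$.
Therefore $a_{\ell-n} = b_{\ell-n}$ for all $\ell$, $n < \ell \le m$. That is, $a_i = b_i$ for
all i, $1 \le i \le m-n$. Hence, M is lossless of delay n.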
(Sufficiency) Given a machine M which is lossless of delay n, we
can show that M has an n-delayed inverse by constructing one.
Techniques for constructing inverses of lossless sequential machines can
be found in Hennie [17] and Kohavi [20]. With minor modifications
to ensure the existence of suitable starting states, these techniques
can be used to construct inverses of resettable machines.
5.2 Diagnosis Using Lossless Inverses
If $\bar{M}$ is an n-delayed inverse of M then, by definition, there
exists an (I, n)-delay machine $M^n$ such that $\bar{\beta}_r \circ \hat{\beta}_r = \beta^n_r$. Diagnosis
using inverses can be performed by implementing M, $\bar{M}$, and $M^n$ and
dynamically checking to see if the above relationship holds. The
basic configuration for diagnosis using inverses is shown in Fig. 5.3.
Fig. 5.3. On-line Diagnosis Using Inverse Machines
Since an (I, 0)-delay machine is simply a combinational machine 
which realizes the identity function on I, a detector which uses a 
0-delayed inverse will  have the form shown in Fig. 5.4. 
Fig. 5.4. A Detector which Uses a 0-delayed Inverse
We now state the basic result relating the use of lossless
inverses with the diagnosis of unrestricted faults.
Theorem 5.2: Let M be a lossless machine and let $\bar{M}$ be an n-delayed
inverse of M. Let D be constructed from $\bar{M}$, the (I, n)-delay machine
which demonstrates that $\bar{M}$ is an n-delayed inverse of M, and an
Exclusive-OR gate as shown in Fig. 5.3. If $\bar{M}$ is lossless of delay
d then (M, U) is (D, d)-2-diagnosable.
Proof: Since $\bar{\beta}_r(\hat{\beta}_r(x)) = \beta^n_r(x)$, there will be no false alarms.
Let (r, x, w) be a minimal 2-error caused by a fault $f \in U$.
Then $\hat{\beta}^f_r(x) \ne \hat{\beta}_r(x)$. Let $y \in I^*$ with $|y| = d$. Since $\bar{M}$ is lossless
of delay d, $\hat{\bar{\beta}}_r(\hat{\beta}^f_r(xy)) \ne \hat{\bar{\beta}}_r(\hat{\beta}_r(xy))$. The Exclusive-OR gate will
detect this inequality, and hence the minimal 2-error will be detected
within d time steps of its occurrence. Therefore (M, U) is (D, d)-2-
diagnosable.
It is worth noting that the delay in diagnosis is not the delay of
losslessness of M but rather of its inverse $\bar{M}$. Thus an n-delayed
inverse can be used to achieve diagnosis without delay if it is lossless
of delay 0.
Example 5.6, which appears later in this chapter, shows that
the converse of Theorem 5.2 does not hold. Namely, it is possible
to diagnose the unrestricted fault set of a machine using an inverse
which is not lossless. However, not all inverses can be used for
the diagnosis of unrestricted faults. Example 5.5 shows how a lossy
inverse can be useless for diagnosis. The complete characterization
of inverses which can be used for unrestricted fault diagnosis
is still an open problem.
Given Theorem 5.2 and the observation that an inverse machine
may be lossy, an important question is whether every lossless
machine has a lossless inverse. This question is presently
unanswered. However, it can be shown that if M is lossless of delay
0 then there exists a lossless inverse of M.
Example 5.3: Consider machines M2 and $\bar{M}_2$ of Fig. 5.5. M2 is
lossless of delay 2 and $\bar{M}_2$ is a 2-delayed inverse. Since $\bar{M}_2$
is lossless of delay 0 it can be used to form a detector D2 such that
(M2, U) is (D2, 0)-2-diagnosable.
Fig. 5.5. Machines M2 and $\bar{M}_2$
The following example shows that it is possible to diagnose the
unrestricted fault set of a machine using a lossless inverse which
has fewer states than the reduction of the machine being diagnosed.
Example 5.4: Consider machines M3 and $\bar{M}_3$ of Fig. 5.6. $\bar{M}_3$ is
a 2-delayed inverse of M3, and $\bar{M}_3$ is itself lossless of delay 2.

    M3:            0      1
            a     e/0    f/0
            b     a/1    b/1
            c     a/0    b/0
            d     e/1    f/1
            e     a/0    c/1
            f     d/1    b/0

    $\bar{M}_3$:   0      1
            A     C/0    D/1
            B     D/0    C/1
            C     A/0    B/0
            D     C/1    D/1

Fig. 5.6. Machines M3 and $\bar{M}_3$
Therefore a detector D3 can be constructed from $\bar{M}_3$ and the
(I, 2)-delay machine $M^2$ of Fig. 5.7 such that (M3, U) will be (D3, 2)-
2-diagnosable. Notice that M3 is reduced and reachable and that
$|Q_3| > |\bar{Q}_3|$. However, because $M^2$ is also in the detector,
$|Q_{D_3}| = |\bar{Q}_3|\,|Q^2| = 16$. Therefore $|Q_3| < |Q_{D_3}|$. This is in keeping with
what we know from Corollary 4.6.2.
                   0       1
            00    00/0    01/0
            01    10/0    11/0
            10    00/1    01/1
            11    10/1    11/1

Fig. 5.7. Machine $M^2$
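The claim of Example 5.4 can be checked by brute force. The sketch below transcribes the two state tables of Fig. 5.6 as read from the figure, taking the six-state table to be M3 and the four-state table to be its inverse. The starting states "a" and "A" and the initial register contents (0, 0) of the delay machine are assumptions, since the figure as reproduced here does not show the reset functions. All function names are hypothetical.

```python
from itertools import product

# table[state][input] = (next_state, output), transcribed from Fig. 5.6
M3 = {"a": {0: ("e", 0), 1: ("f", 0)},
      "b": {0: ("a", 1), 1: ("b", 1)},
      "c": {0: ("a", 0), 1: ("b", 0)},
      "d": {0: ("e", 1), 1: ("f", 1)},
      "e": {0: ("a", 0), 1: ("c", 1)},
      "f": {0: ("d", 1), 1: ("b", 0)}}
M3_INV = {"A": {0: ("C", 0), 1: ("D", 1)},
          "B": {0: ("D", 0), 1: ("C", 1)},
          "C": {0: ("A", 0), 1: ("B", 0)},
          "D": {0: ("C", 1), 1: ("D", 1)}}


def run(table, state, xs):
    """Output sequence produced by applying xs from the given state."""
    out = []
    for x in xs:
        state, z = table[state][x]
        out.append(z)
    return out


def is_delayed_inverse(m, m0, inv, inv0, register, max_len):
    """Check inv(m(x)) == delay(x) for all binary x up to max_len.

    register is the assumed reset content of the delay machine; its
    length is the delay n.
    """
    for length in range(1, max_len + 1):
        for xs in product((0, 1), repeat=length):
            recovered = run(inv, inv0, run(m, m0, xs))
            delayed = (list(register) + list(xs))[:length]
            if recovered != delayed:
                return False
    return True


print(is_delayed_inverse(M3, "a", M3_INV, "A", (0, 0), 8))
```

The check is exhaustive only up to the chosen length bound, so it supports the claim rather than proving it. A proof would examine the reachable states of the series connection instead.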
From Corollary 4.6.2 we know that if (M, U) is (D, k)-2-diagnosable
then $|Q_D| \ge |Q_R|$, where $M_R$ is the reduction of M. Using this
corollary and Theorem 5.2 we can derive a lower bound on the state
set size of a lossless inverse of M. This bound is stated in terms
of the input alphabet size of M, the delay of losslessness of M, and
the state set size of $M_R$.
Theorem 5.3: Let M be lossless of delay n, let $M_R$ be the reduction
of M, and let $\bar{M}$ be a lossless n-delayed inverse of M. Then

$$|\bar{Q}| \ge \frac{|Q_R|}{|I|^n} .$$

Proof: Consider M to be realizing its reduction $M_R$ and consider M and $\bar{M}$
in the configuration used for diagnosis shown in Fig. 5.3. Since $\bar{M}$ is
lossless, by Theorem 5.2 (M, U) is (D, d)-2-diagnosable where d is
the delay of losslessness of $\bar{M}$. Now by Corollary 4.6.1 $|Q_D| \ge |Q_R|$.
Since D is built from $\bar{M}$ and an (I, n)-delay machine, which has $|I|^n$
states, $|Q_D| = |\bar{Q}|\,|I|^n$, and the bound follows.
If one has a lossless machine M of unknown delay and an inverse
$\bar{M}$ of M then a lower bound on the delay n of M can be found using the
following inequality:

$$n \ge \log_{|I|} \frac{|Q_R|}{|\bar{Q}|} .$$

This inequality was obtained directly from the one in Theorem 5.3.
Given a machine $M = (I, Q, Z, \delta, \lambda, R, \rho)$ let $Z'$ denote the subset
of Z which may actually appear in an output sequence of M. That is,
let $Z' = \{\beta_r(x) \mid r \in R,\ x \in I^+\}$.
The following result gives a very simple necessary condition
which all lossless machines must satisfy.
Theorem 5.4: If M is lossless then $|I| \le |Z'|$.
Proof: Assume that M is lossless of delay n. Let $f_r: I^+ \to Z^+ \times Q$
be defined by $f_r(x) = (\hat{\beta}_r(x), \delta(\rho(r), x))$.
Claim: $f_r$ is 1-1. Let $x, y \in I^+$ where $x \ne y$. If $|x| \ne |y|$ then
$|\hat{\beta}_r(x)| \ne |\hat{\beta}_r(y)|$ and hence $f_r(x) \ne f_r(y)$. Let $|x| = |y|$ and assume,
to the contrary, that $f_r(x) = f_r(y)$. Then $\hat{\beta}_r(x) = \hat{\beta}_r(y)$ and
$\delta(\rho(r), x) = \delta(\rho(r), y)$. This implies that $\hat{\beta}_r(xz) = \hat{\beta}_r(yz)$ for all
$z \in I^*$, and in particular for some z of length n. Since M is lossless of
delay n this implies that x = y. Contradiction. Hence if $|x| = |y|$ and
$x \ne y$ then $f_r(x) \ne f_r(y)$. Since either $|x| = |y|$ or $|x| \ne |y|$, the claim
is established.
Since $f_r: I^+ \to Z^+ \times Q$ is 1-1 and $|x| = |\hat{\beta}_r(x)|$ it follows that
$|I|^m \le |Z'|^m |Q|$ for all m > 0. Hence $|I|^m / (|Z'|^m |Q|) \le 1$ for
all m > 0. Since $|Q|$ is a fixed positive integer, this implies that
$|I| / |Z'| \le 1$, or $|I| \le |Z'|$.
This result has some immediate corollaries concerning inverses
of lossless machines.
Corollary 5.4.1: Let M be a lossless machine with $|I| < |Z'|$.
Then any inverse of M with $\bar{Z} = I$ is lossy.
Proof: Let $\bar{M}$ be an inverse of M with $\bar{Z} = I$. Since $\bar{M}$ is an inverse
of M, $Z' \subseteq \bar{I}$, and we know that $|I| < |Z'|$. Hence $|\bar{Z}| = |I| <
|Z'| \le |\bar{I}|$. By Theorem 5.4, $\bar{M}$ must be lossy.
This corollary says that if M is lossless and $|I| < |Z'|$ then
for an inverse $\bar{M}$ of M to be lossless $\bar{M}$ must have output symbols
which would never appear while $\bar{M}$ is receiving its input from M.
However, if a fault occurs to M and causes an error then $\bar{M}$ could
emit one of these symbols. The appearance of one of these symbols
in the output of $\bar{M}$ would immediately cause an error detection signal
because this same symbol cannot appear in the output of an (I, n)-
delay machine.
Corollary 5.4.2: Let M be a lossless machine with a lossless
inverse $\bar{M}$. If $\bar{Z} = I$ then $|I| = |Z'|$.
Proof: This follows immediately from Corollary 5.4.1.
Given the above result, an immediate question is whether M is
lossless and $|I| = |Z'|$ implies that any inverse of M is lossless.
As Example 5.5 shows, the answer is no.
Example 5.5: Consider machine $\bar{M}'_3$ of Fig. 5.8. $\bar{M}'_3$ is an inverse
of machine M3 of Fig. 5.6 and $|I_3| = |Z'_3|$, but $\bar{M}'_3$ is not lossless.
Fig. 5.8. Machine $\bar{M}'_3$
5.3 Applicability of Inverses for Unrestricted Fault Diagnosis 
The use of inverses as a technique for performing diagnosis
applies directly only to those machines which have suitable inverses.
In the following development we will show that given an arbitrary
machine M', we can always construct a realization M of M' such that
M has an inverse which can be used for diagnosis. The realizations
will be obtained simply by augmenting the output of the original
machine. Thus we will show that diagnosis using inverses is a
universally applicable technique.
Definition 5.4: M is an output-augmented realization of M' if
$M = (I', Q', Z' \times A, \delta', \lambda, R', \rho')$ and $\lambda = \lambda' \times \lambda_A$ for some $\lambda_A: Q' \times I' \to A$.
If M is an output-augmented realization of M' then M realizes
M' under $(e, e, P_{Z'})$ where $P_{Z'}$ is the projection of $Z' \times A$ onto $Z'$.
Kohavi and Lavallee [19] have given a construction which
proves the following results.
Theorem 5.5: Given any machine M', there exists an output-
augmented realization M of M' which is lossless of delay n for
some n, and in particular, for n = 0.
Theorem 5.6: If M' is lossless of delay n, then for every m,
$0 \le m \le n$, there exists an output-augmented realization M of M'
which is lossless of delay m.
The method that Kohavi and Lavallee use to achieve the above
results employs a "testing graph" which is used to determine if the
given machine M' is lossless, and if so of what delay. Output
augmentation which will yield the desired property is determined by a
method of cutting branches in this graph. Minimal augmentation
for losslessness of a desired delay is not guaranteed.
A lower bound on the amount of output-augmentation necessary
to make a particular machine lossless is given by Theorem 5.4.
This result tells us that for the output-augmented realization to be
lossless, the size of its output alphabet must be at least as
great as the size of its input alphabet.
Any machine can be made lossless of delay 0 simply by
augmenting its output with a copy of the input. This gives an upper
bound on the amount of output augmentation which is necessary to
make a given machine lossless of delay 0.
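Augmentation with a copy of the input is simple enough to show directly. In the sketch below a state table is a dictionary `table[state][input] = (next_state, output)`. This representation, the function names, and the small machine are hypothetical, and the row test is applied to every state rather than only to the reachable ones, which is sufficient for losslessness of delay 0.

```python
def augment_with_input(table):
    """Pair every output symbol with the input that produced it."""
    return {q: {a: (nxt, (z, a)) for a, (nxt, z) in row.items()}
            for q, row in table.items()}


def rows_have_distinct_outputs(table):
    """Row test for losslessness of delay 0, applied here to every state."""
    return all(len({z for _, z in row.values()}) == len(row)
               for row in table.values())


# table[state][input] = (next_state, output); a hypothetical machine
# whose row "p" repeats an output symbol and so fails the row test
LOSSY = {"p": {0: ("p", "x"), 1: ("q", "x")},
         "q": {0: ("p", "x"), 1: ("q", "y")}}

AUGMENTED = augment_with_input(LOSSY)
print(rows_have_distinct_outputs(LOSSY))
print(rows_have_distinct_outputs(AUGMENTED))
```

The augmented output alphabet is $Z \times I$, so this construction trades the largest possible amount of output redundancy for the simplest possible inverse.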
It is tempting to use the Kohavi and Lavallee technique to
augment the inverse of a machine in the hope of achieving a lossless
inverse. However, this is impossible because an output-augmented
realization of an inverse of M is not necessarily an inverse of M.
Example 5.6: Consider the configuration shown in Fig. 5.9. Here
M' is any machine, and M is the output-augmented realization of M'
which was formed simply by augmenting the output of M' with a
copy of its input. The inverse $\bar{M}'$ of M shown in this figure is
simply the combinational machine which realizes the projection of
$Z \times I$ onto I. This inverse is lossy and is clearly useless for
diagnosis.
Fig. 5.9. A Lossless Machine with a Lossy Inverse
Now augment the output of $\bar{M}'$ to form the machine $\bar{M}$ shown
in Fig. 5.10. This machine is lossless but it is not an inverse of
M and it too is useless for diagnosis.
Fig. 5.10. An Output-augmented Realization of $\bar{M}'$ of Fig. 5.9
Although Kohavi and Lavallee's technique cannot be used to
construct lossless inverses, it is an important technique because
it can be used to construct lossless of delay 0 realizations of any
given machine. The following result shows that given a machine
which is lossless of delay 0, an inverse of that machine can be
constructed which can be used for the diagnosis of unrestricted
faults.
Theorem 5.7: Let M be lossless of delay 0. Then there exists
an inverse $\bar{M}$ of M such that (M, U) is (D, 0)-2-diagnosable where
D is formed from $\bar{M}$ and an Exclusive-OR gate as shown in Fig. 5.4.
Proof: Let $\bar{M} = (Z, P, I \cup \{e\}, \bar{\delta}, \bar{\lambda}, R, \rho)$ where $e \notin I$ and for all
$q \in P$ and $a \in Z$,

$$\bar{\delta}(q, a) = \begin{cases} \delta(q, b) & \text{if } b \in I \text{ and } \lambda(q, b) = a \\ \text{arbitrary} & \text{if } a \notin \lambda(q, I) \end{cases}$$

$$\bar{\lambda}(q, a) = \begin{cases} b & \text{if } b \in I \text{ and } \lambda(q, b) = a \\ e & \text{if } a \notin \lambda(q, I) \end{cases}$$

Thus $\bar{M}$ is basically the same as M but with the roles of the
input and output interchanged.
The functions $\bar{\delta}$ and $\bar{\lambda}$ are well-defined for if M is lossless
of delay 0 and $q \in P$ then $\lambda(q, a) = \lambda(q, b)$ implies a = b.
If $|I| < |Z|$ then not every symbol in Z can appear in every
row of the state table of M. This is what gives rise to the
transitions of $\bar{M}$ which may be arbitrarily specified.
Consider M and $\bar{M}$ to be operating in series as shown in Fig.
5.2. Since M and $\bar{M}$ have the same reset function, they will initially
be in the same state. Now if M and $\bar{M}$ are both in some state $q \in P$
and the input symbol $b \in I$ is applied to M then M will emit $\lambda(q, b)$
and go to state $\delta(q, b)$. $\bar{M}$ will emit $\bar{\lambda}(q, \lambda(q, b)) = b$ and will go to
state $\bar{\delta}(q, \lambda(q, b)) = \delta(q, b)$. Thus M and $\bar{M}$ will make the same
state transitions and the present output of $\bar{M}$ will always be the
present input to M. Hence $\bar{M}$ is a 0-delayed inverse of M.
It remains to be shown that (M, U) is (D, 0)-2-diagnosable. This
must be shown directly because $\bar{M}$ is not necessarily lossless.
Since $\bar{M}$ is a 0-delayed inverse of M there will be no false alarms.
Let (r, xa, wb) where $a \in I$ and $b \in Z$ be a minimal 2-error. Since
any input sequence applied to M will cause M and $\bar{M}$ to experience
the same state trajectories, $\delta(\rho(r), x) = \bar{\delta}(\rho(r), w)$. Say $\delta(\rho(r), x) =
q$. Since (r, xa, wb) is a minimal 2-error, $\beta_r(xa) \ne b$. Now
$\bar{\lambda}(q, \beta_r(xa)) = a$ and therefore $\bar{\lambda}(q, b) \ne a$. This inequality will be
detected by the Exclusive-OR gate which will emit a fault detection
signal. Hence (M, U) is (D, 0)-2-diagnosable.
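The construction used in this proof can be sketched in a few lines. The code below is an illustration under stated assumptions, not the report's own procedure: state tables are dictionaries `table[state][input] = (next_state, output)`, the arbitrary transitions are resolved by staying in the same state, the machine `M` is hypothetical, and a fault is modelled crudely as the corruption of a single output symbol of M.

```python
E = "e"  # extra output symbol of the inverse, assumed not to be in I


def build_inverse(table, Z):
    """Swap the roles of input and output in a delay-0 lossless table."""
    inverse = {}
    for q, row in table.items():
        by_output = {z: (a, nxt) for a, (nxt, z) in row.items()}
        if len(by_output) != len(row):
            raise ValueError("row %r repeats an output symbol" % (q,))
        inverse[q] = {}
        for z in Z:
            if z in by_output:
                a, nxt = by_output[z]
                inverse[q][z] = (nxt, a)   # follow M and emit M's input
            else:
                inverse[q][z] = (q, E)     # arbitrary transition; emit e
    return inverse


def first_alarm(table, inverse, q0, xs, corrupt=None):
    """Index of the first step at which the comparison fails, else None.

    corrupt maps a step index to the symbol actually observed at M's
    output, a crude stand-in for the effect of a fault.
    """
    corrupt = corrupt or {}
    q = q_inv = q0
    for t, a in enumerate(xs):
        q, z = table[q][a]
        z = corrupt.get(t, z)
        q_inv, recovered = inverse[q_inv][z]
        if recovered != a:
            return t
    return None


# table[state][input] = (next_state, output); a hypothetical machine M
M = {"A": {0: ("B", "w"), 1: ("A", "x")},
     "B": {0: ("A", "y"), 1: ("B", "w")}}
M_INV = build_inverse(M, ["w", "x", "y"])

print(first_alarm(M, M_INV, "A", [0, 1, 1, 0]))
print(first_alarm(M, M_INV, "A", [0, 1, 1, 0], corrupt={1: "y"}))
```

In the fault-free run no alarm is raised, and a wrong output symbol is flagged at the step where it occurs, either because the inverse recovers a different input symbol or because it emits e.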
It should be noted that the inverse constructed in the proof of
the above theorem is not necessarily lossless. By using $|Z| - |I|$
new symbols, instead of just one, $\bar{M}$ could have been constructed to
be lossless of delay 0.
Example 5.7: Consider machine $\bar{M}'_1$ of Fig. 5.11. This machine
is an inverse of machine M1 of Fig. 5.1. It was constructed as
described in the proof of Theorem 5.7. The transitions of $\bar{M}'_1$ which
may be arbitrarily chosen are indicated by a "-". This inverse of
M1 is not lossless, but it can be used for the diagnosis of unrestricted
faults of M1.
Fig. 5.11. Machine $\bar{M}'_1$
A lossless inverse $\bar{M}''_1$ of M1 can be obtained from $\bar{M}'_1$ simply
by changing one of the "e" outputs in each row of the state table of
$\bar{M}'_1$ to e'. $\bar{M}''_1$ so constructed would be lossless of delay 0 because
the output symbols would be distinct in every row of the state table.
CHAPTER VI 
Diagnosis of Networks of Resettable Systems 
In this chapter we will consider the problem of diagnosing a
machine which has been structurally decomposed and is represented
as a network of resettable state machines. The networks that we
will be using are very general and they will allow us to work within
a wide range of structural detail.
The fault set which we will be applying to these networks is the
set of "unrestricted component faults." Informally, an unrestricted
component fault is a fault which only affects one component machine
but which may affect that component in an unrestricted manner.
This fault set is a natural restriction of the set of unrestricted
faults. We will show that it is possible to diagnose the set of
unrestricted component faults of a network with relatively little
redundancy.
This chapter focuses on the diagnosis of "state networks."
A state network is simply a network in which the external output is
the state of the network, i.e., a vector consisting of the state of each
component machine in the network. Since the state of a state network
is directly observable at its output, state networks are easier to
diagnose than arbitrary networks.
The results in this chapter characterize state networks which are
diagnosable using combinational detectors. A general construction
is given which can be used to augment a given state network such
that the resulting state network is diagnosable in the above sense.
Upper and lower bounds on the amount of redundancy required by
such an augmentation are derived.
6.1 Networks of Resettable Systems
The field of study known as "algebraic structure theory of
sequential machines" is concerned with the synthesis and decomposition
of sequential machines into networks of smaller component
machines. The networks considered in this chapter are very
similar to the "abstract networks" introduced by Hartmanis and
Stearns [15]. The major differences are in our use of resettable
state systems for the components and in our system connection rules
which force all computation to be done in the component systems or
in the external output function. Hartmanis and Stearns use sequential
state machines for their components and they allow for a combinational
function $f_i: (\times Q_j) \times I \to I_i$ to precede each component.
Definition 6.1: A network of resettable systems is a 6-tuple
$N = (I, R, (S_1, \ldots, S_n), (K_1, \ldots, K_n), Z, \lambda)$ where
  I is a finite nonempty set, the external input alphabet
  R is a finite nonempty set, the external reset alphabet
  $S_i = (I_i, Q_i, \delta_i, R, \rho_i)$ for each i, $1 \le i \le n$, is a resettable
    state system, a component system
  $K_i$ for each i, $1 \le i \le n$, is a subset of $\{Q_1, \ldots, Q_n, I\}$,
    a system connection rule
  Z is a finite nonempty set, the external output alphabet
  $\lambda: (\times_{i=1}^{n} Q_i) \times I \times T \to Z$, the external output function
such that for each i, $1 \le i \le n$, if $K_i = \{A_1, \ldots, A_m\}$ then
$I_i = \times_{j=1}^{m} A_j$.
Under the intended interpretation, the system connection rule
$K_i$ specifies from which parts of the network component i receives
its input.
By the convention we introduced in Section 2.1, if $K_i = \emptyset$ then
$I_i$ is any singleton set. Therefore if $M_i$ has no connections then it
is an autonomous machine.
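The role of the connection rules can be made concrete with a small simulation. The sketch below assumes that all components change state synchronously and that each component's input is the tuple of values selected by its connection rule. These are assumptions about how the network operates, and the two one-bit components, their transition functions, and all names are hypothetical.

```python
def network_step(deltas, connections, state, a):
    """One synchronous step of a network of state machines.

    connections[i] lists the sources feeding component i: the string
    "I" for the external input or an integer j for the state of
    component j.
    """
    nxt = []
    for delta_i, k_i, q_i in zip(deltas, connections, state):
        local_input = tuple(a if src == "I" else state[src] for src in k_i)
        nxt.append(delta_i(q_i, local_input))
    return tuple(nxt)


# two hypothetical one-bit components, each fed by the external input
# and by the state of the other component
def delta_0(q, inp):
    a, other = inp
    return q ^ (a & other)


def delta_1(q, inp):
    a, other = inp
    return a ^ other


DELTAS = (delta_0, delta_1)
CONNECTIONS = (("I", 1), ("I", 0))

state = (0, 1)
trajectory = [state]
for a in (1, 1, 0):
    state = network_step(DELTAS, CONNECTIONS, state, a)
    trajectory.append(state)
print(trajectory)
```

A component with an empty connection rule would receive the empty tuple at every step and so would behave as an autonomous machine.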
Example 6.1: The 6-tuple described in Fig. 6.1 specifies network
N1. This network has two component machines M1 and M2 with
state sets $\{p_1, p_2\}$ and $\{q_1, q_2\}$ respectively. M1 is connected to
the external input and the output (state) of M2 and M2 is connected
to the external input and the output (state) of M1. Network N1 can
be viewed pictorially as shown in Fig. 6.2.
Fig. 6.1. Network N1
Fig. 6.2. Diagram of Network N1
Since any machine may be viewed as a one component network
we see that a network may convey little or no structural information.
On the other hand the structural description given by the network
may be very detailed. For example, each component may be a two-
state state machine which represents only one flip-flop and one
coordinate of the global transition function.
Definition 6.2: A network $N = (I, R, (S_1, \ldots, S_n), (K_1, \ldots, K_n), Z, \lambda)$
defines the system $S_N = (I, Q, Z, \delta, \lambda, R, \rho)$ where

$$Q = \times_{i=1}^{n} Q_i$$
A network of resettable machines is a network in which the
component systems and the external output function are all time-
invariant. For example, network N1 of Fig. 6.1 is a network of
machines. The system defined by a network of machines N is also
time-invariant, and it will be denoted by $M_N$. A network of machines
N realizes a machine M if $M_N$ realizes M. Likewise the definitions
of reduced machines, reachable machines, and so forth can
be extended to apply to networks of machines.
Example 6.2: Consider network N1 of Fig. 6.1. This network
defines machine $M_{N_1}$ of Fig. 6.3 and it realizes $\tilde{M}_1$ of Fig. 6.4
because $M_{N_1}$ realizes $\tilde{M}_1$.
Fig. 6.3. Machine $M_{N_1}$
Fig. 6.4. Machine $\tilde{M}_1$
A network $N = (I, R, (S_1, \ldots, S_n), (K_1, \ldots, K_n), Z, \lambda)$ is a state
network if $Z = \times_{i=1}^{n} Q_i$ and $\lambda(q, a) = q$ for all $q \in \times_{i=1}^{n} Q_i$ and
$a \in I$. If N is a state network then $S_N$ is a state system. For state
networks it is unnecessary to explicitly specify the external output
alphabet and the external output function.
Since the fault set which we will be considering does not allow
for faults which affect the external output function, we will focus on
the diagnosis of state networks which realize state machines. The
diagnosis of the output function will be taken care of separately,
possibly by duplication.
Performing diagnosis on state networks is easier, in general,
than for arbitrary networks because with state networks the output
function does not mask the internal operation of the network.
Decomposing a network into a state network and an output function
and then diagnosing each separately has the effect of applying a
tighter tolerance relation to the diagnosis of the original network.
This is also due to the lack of any masking of the state by the output
function.
6.2 Unrestricted Component Faults 
Suppose that N and N' are networks. Then $f = (N', \tau, \theta)$ is a
fault of N if $f' = (S_{N'}, \tau, \theta)$ is a fault of $S_N$. Thus a fault of N can be
considered to be a transformation of N into another network N' at
some time $\tau$. The notions of fault tolerance, error, and diagnosis
are extended in a similar manner to apply to networks.
Given a network N, a natural set of faults to consider are those
which are caused by failures in one component of N. If $f = (N', \tau, \theta)$
is caused by failures which are restricted to one component of N then
N' will differ from N only in that one component. Likewise $\theta: \times Q_i
\to \times Q'_i$ will act as the identity on each coordinate except possibly
the one affected by f. These faults are described formally in the
following definition.
Definition 6.3: Let $N = (I, R, (M_1, \ldots, M_n), (K_1, \ldots, K_n), Z, \lambda)$ be
a network of machines. A fault $f = (N', \tau, \theta)$ of N is an unrestricted
component fault if for some j, $1 \le j \le n$
  i) $N' = (I, R, (M_1, \ldots, S'_j, \ldots, M_n), (K_1, \ldots, K_n), Z, \lambda)$ where
     $S'_j \in \mathcal{S}(I_j, Q_j, R)$ and
  ii) for all $(q_1, \ldots, q_n) \in \times_{i=1}^{n} Q_i$, $\theta(q_1, \ldots, q_n) = (q'_1, \ldots, q'_n)$
     implies $q_i = q'_i$ for all $i \ne j$.
The set of all unrestricted component faults of a network will
be denoted by $U_c$.
Note that since N' is a network, $S'_j$ is required to be a state
system. Because the output alphabets of $M_j$ and $S'_j$ are identical
and they are both state systems their state sets must also be
identical. Thus, unrestricted component faults do not permit state blowup
or collapse.
The fault set $U_c$ is sufficiently restricted to make possible its
diagnosis with relatively little redundancy. On the other hand, $U_c$
is not unduly restricted for it allows for any number and type of
physical failures to occur to any one component; subject, of course,
to the general restrictions on faults outlined in Section 2.3. Thus
using $U_c$ as the fault class greatly reduces the amount of failure
analysis which is necessary within the components.
6.3 Characterization of Combinationally Diagnosable Networks 
How can state networks for which a combinational detector can
diagnose the set of unrestricted component faults be characterized?
We shall show that one means of doing this is in terms of the amount
of network redundancy.
Given a network of machines N we will assume, as we have
earlier, that N realizes some reduced and reachable machine $\tilde{M}$.
Since the relation between the state set of N and the state set of $\tilde{M}$
will be of interest to us we will use the structurally oriented
characterization of a realization given by Theorem A.1, and will assume
that N realizes $\tilde{M}$ under $(\eta_1, \eta_2, \eta_3, \eta_4)$. We will assume as before
that $\eta_1$ and $\eta_2$ are onto. The natural extensions of $\eta_1$ and $\eta_3$ to
sequence to sequence mappings will also be denoted by $\eta_1$ and $\eta_3$.
The reachable part of N will be denoted by P.
Since $\tilde{M}$ is reachable the domain of $\eta_4$ is $\tilde{Q}$. Since $\tilde{M}$ is reduced
and $\eta_1$ and $\eta_2$ are onto it can be shown that $q, q' \in \tilde{Q}$ and $q \ne q'$
implies $\eta_4(q) \cap \eta_4(q') = \emptyset$. Let $\eta'_4: P \to \tilde{Q}$
where $\eta'_4(q) = q'$ if and only if $q \in \eta_4(q')$. Because $\eta_4$ induces a
partition of P, $\eta'_4$ is a well-defined function. Through an abuse in
notation, $\eta'_4$ will be referred to more suggestively as $\eta_4^{-1}$. This
function will play an important role in the following results.
If N is a state network which realizes a state machine under
$(\eta_1, \eta_2, \eta_3, \eta_4)$ then by Theorem A.1 $\eta_3(q) = \eta_4^{-1}(q)$ for all $q \in P$.
In this case we will take $\eta_3$ to be identical to $\eta_4^{-1}$.
Notation: Given a network N let $C \subseteq \{1, \ldots, n\}$ denote a subset of
the set of components. Let $C_i$ denote the particular subset $\{1, \ldots,
i-1, i+1, \ldots, n\}$. Let $q = (q_1, \ldots, q_n)$ and $s = (s_1, \ldots, s_n)$ be states
of N.
Each C induces a partition $\pi_C$ on $Q = \times Q_i$ where $q \equiv s\ (\pi_C)$ if
and only if $q_i = s_i$ for all $i \in C$.
A cover of a set L is a set of subsets of L whose union is L.
Thus every partition of L is also a cover of L. A cover J of L is
a singleton cover if $B \in J$ implies $|B| \le 1$. If J is a cover let
#|J| denote the cardinality of the largest element in J.
Let $C \subseteq \{1, \ldots, n\}$ and let $\pi_C = \{B_1, \ldots, B_\ell\}$. C induces the
cover

$$\hat{C} = \{\eta_4^{-1}(B_1 \cap P), \ldots, \eta_4^{-1}(B_\ell \cap P)\}$$

where if $B \subseteq P$ then $\eta_4^{-1}(B) = \{\eta_4^{-1}(q) \mid q \in B\}$. In particular,
$\eta_4^{-1}(\emptyset) = \emptyset$.
Each set of states which the components in C can take on
corresponds directly to a block of the partition $\pi_C$. Thus $\pi_C$
represents the information about the current state of N which is
given by the current states of components in C. $\hat{C}$ represents the
corresponding information as to the state of $\tilde{M}$ which N is currently
mimicking. If $\hat{C}$ is a singleton cover then the current state of each
component in C completely determines the corresponding state of $\tilde{M}$.
Note that the cover induced by $\{1, \ldots, n\}$ is always a singleton cover.
Definition 6.4: Component $M_i$ of a network N is redundant if $\hat{C}_i$
is a singleton cover. N is totally redundant if every component of
N is redundant.
If N is totally redundant then knowledge of the state of any n-1
components is sufficient to determine the corresponding state of $\tilde{M}$
although it may not be sufficient to determine the state of the
remaining component.
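Total redundancy can be tested directly from the reachable states of a network. The sketch below assumes that the reachable part P is given as a set of state tuples and that the correspondence between reachable network states and states of the realized machine is given as a dictionary `eta4_inv`. Both data structures, the function names, and the two small examples are hypothetical.

```python
from collections import defaultdict


def is_redundant(P, eta4_inv, i):
    """True if the states of all components but i fix the realized state."""
    blocks = defaultdict(set)
    for q in P:
        rest = q[:i] + q[i + 1:]          # forget coordinate i
        blocks[rest].add(eta4_inv[q])     # realized states still possible
    return all(len(block) <= 1 for block in blocks.values())


def is_totally_redundant(P, eta4_inv):
    """True if every component of the network is redundant."""
    n = len(next(iter(P)))
    return all(is_redundant(P, eta4_inv, i) for i in range(n))


# hypothetical duplicated one-bit machine: both copies always agree
P_DUP = {(0, 0), (1, 1)}
ETA_DUP = {(0, 0): 0, (1, 1): 1}

# hypothetical network with no redundancy: every state pair is reachable
P_FULL = {(0, 0), (0, 1), (1, 0), (1, 1)}
ETA_FULL = {q: q for q in P_FULL}

print(is_totally_redundant(P_DUP, ETA_DUP))
print(is_totally_redundant(P_FULL, ETA_FULL))
```

Each group of reachable states that agree outside coordinate i plays the role of one block of the induced cover, and the component is redundant exactly when every such block contains at most one realized state.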
Example 6.3: Consider network N1 of Example 6.1. N1 realizes
machine $\tilde{M}_1$ of Fig. 6.4 under $(e, e, e, \eta_4)$ where $\eta_4$ is defined by
the table:
Therefore $\hat{C}_1$ is not a singleton cover, M1 is not a redundant
component, and N1 is not totally redundant.
If $\pi$ is a partition of L, let $f_\pi: L \to \pi$ denote the natural mapping
induced by $\pi$. Let $\xi_C: Q \to \hat{C}$ be defined as $\xi_C(q) = \eta_4^{-1}(f_{\pi_C}(q)
\cap P)$. The interpretation of $\xi_C$ is as follows: given the state of
each component in C, take any $q \in Q$ which agrees with this
information, and $\xi_C(q)$ is the set of states of $\tilde{M}$ to which the current state of
N may correspond.
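Lemma 6.1: Let N be a totally redundant state network of machines,
and let $q = (q_1, \ldots, q_i, \ldots, q_n)$ and $q' = (q_1, \ldots, q'_i, \ldots, q_n)$ be states
of N. If $q, q' \in P$ then $\eta_4^{-1}(q) = \eta_4^{-1}(q')$.
Proof: Let $q, q' \in P$ and let $C = \{1, \ldots, n\}$. Then $\xi_C(q) = \{\eta_4^{-1}(q)\}$
and $\xi_C(q') = \{\eta_4^{-1}(q')\}$.
Since N is totally redundant, $\hat{C}_i$ is a singleton cover. Therefore
$\xi_C(q) = \xi_{C_i}(q)$. Likewise, $\xi_C(q') = \xi_{C_i}(q')$. Now $\xi_{C_i}(q)
= \xi_{C_i}(q')$ because q and q' agree on every component in $C_i$.
Therefore $\xi_C(q) = \xi_C(q')$, and hence $\eta_4^{-1}(q) = \eta_4^{-1}(q')$.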
Suppose that an unrestricted component fault f occurs to a
totally redundant network of machines N and causes a minimal
2-error (r, x, y). Say that $\beta_r(x) = q = (q_1, \ldots, q_n)$. Due to the
nature of f, namely that it affects only one component, $\beta^f_r(x) =
q' = (q_1, \ldots, q'_i, \ldots, q_n)$. If $q' \in P$ then Lemma 6.1 tells us that
this 2-error is not a 1-error because $\eta_4^{-1}(q) = \eta_4^{-1}(q')$.
Theorem 6.2: Let N be a state network which realizes a state
machine $\tilde{M}$ under $(\eta_1, \eta_2, \eta_3, \eta_4)$ where $\eta_3 = \eta_4^{-1}$. Then $(N, U_C)$
is (D, 0)-1-diagnosable for some combinational detector D if and
only if N is totally redundant.
Proof: (Necessity) Suppose that $(N, U_C)$ is (D, 0)-1-diagnosable
where D is combinational, and let D realize the function $\lambda_D$. Assume,
to the contrary, that N is not totally redundant. Then for some i,
$\mathcal{C}_{\bar{i}}$ is not a singleton cover. Hence there exist
$q = (q_1, \ldots, q_i, \ldots, q_n)$ and $q' = (q_1, \ldots, q_i', \ldots, q_n)$ in P such that
$\eta_4^{-1}(q) \neq \eta_4^{-1}(q')$.
Since $q, q' \in P$, $\lambda_D(q) = \lambda_D(q') = 0$ for otherwise a false alarm could
occur. Let $f \in U_C$ be a fault caused by the output of $M_i$ becoming
stuck-at-$q_i'$ at a time when N could be in q. This fault can cause
a 1-error which is not (D, 0)-1-diagnosable. Contradiction. Therefore
if $(N, U_C)$ is (D, 0)-1-diagnosable where D is combinational then
N must be totally redundant.
(Sufficiency) Assume that N is totally redundant. Let D be the
detector which realizes the function $\lambda_D: Q \to \{0, 1\}$ where
$\lambda_D(q) = 0$ if $q \in P$ and $\lambda_D(q) = 1$ otherwise.
Clearly, D will give no false alarms.
Let (r, x, y) be a minimal 1-error caused by $f \in U_C$. Let x = uab
where $a, b \in I$.
Then $\eta_4^{-1}(\beta_r(ua)) = \eta_4^{-1}(\beta^f_r(ua))$ and $\eta_4^{-1}(\beta_r(uab)) \neq \eta_4^{-1}(\beta^f_r(uab))$. Say
$\beta^f_r(ua) = q$. Then $\beta^f_r(uab) = \delta^f(q, a, t)$ where $t = |u|$. Because $f \in U_C$,
f can affect at most one component of N. Therefore $\delta(q, a)$ will
differ in at most one coordinate from $\delta^f(q, a, t)$. Let
$\delta(q, a) = s = (s_1, \ldots, s_j, \ldots, s_n)$ and let $\delta^f(q, a, t) = s' = (s_1, \ldots, s_j', \ldots, s_n)$.
Therefore $s \in P$, and $\eta_4^{-1}(s) \neq \eta_4^{-1}(s')$ because $\eta_4^{-1}(\beta_r(uab)) \neq \eta_4^{-1}(\beta^f_r(uab))$.
Applying Lemma 6.1 we deduce that $s' \notin P$. Therefore $\lambda_D(s') = 1$,
the 1-error (r, x, y) is detected without delay, and $(N, U_C)$ is
(D, 0)-1-diagnosable.
Given $C \subseteq \{1, \ldots, n\}$, let $\pi_C = \{B_1, \ldots, B_\ell\}$. Then C induces
a partition $\bar{\pi}_C$ on P where $\bar{\pi}_C = \{B_1 \cap P, \ldots, B_\ell \cap P\} - \{\emptyset\}$.
If a partition $\pi$ of a set L is a singleton cover then we will denote
this by writing $\pi = 0$. This notation is derived from the observation
that this partition is the least element of the lattice of all partitions
of L.
Corollary 6.2.1: Let N be a state network of machines. Then
$(N, U_C)$ is (D, 0)-2-diagnosable for some combinational detector D
if and only if $\bar{\pi}_{\bar{i}} = 0$ for all i, $1 \le i \le n$.

Proof: Consider N to be realizing the reduction of $M_N$. Then
$\eta_3$ is 1-1. By Theorems 3.2 and 3.3 $(N, U_C)$ is (D, 0)-2-diagnosable
for some combinational D if and only if $(N, U_C)$ is (D, 0)-1-
diagnosable for some combinational D.
Now since $\eta_3$ is 1-1, so is $\eta_4^{-1}$. Therefore $\mathcal{C}_{\bar{i}}$ is a singleton
cover if and only if $\bar{\pi}_{\bar{i}} = 0$. Hence N is totally redundant if and
only if $\bar{\pi}_{\bar{i}} = 0$ for all i, $1 \le i \le n$.
The result now follows immediately from Theorem 6.2.
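
The condition of Corollary 6.2.1 can be read operationally: the partitions induced on P by every set of n-1 components are all zero exactly when no two distinct reachable states differ in a single coordinate. The following is a minimal Python sketch of that check, assuming the reachable part is available as a collection of state tuples; the function names and the two small example networks are hypothetical and are not taken from the report.

```python
from itertools import combinations

def differ_in_exactly_one(q, q_prime):
    """True when two state tuples disagree in exactly one coordinate."""
    return sum(a != b for a, b in zip(q, q_prime)) == 1

def satisfies_corollary_6_2_1(reachable_states):
    """Check that no two reachable states agree on n-1 coordinates.

    This is the condition that every induced partition on P is zero,
    restated as a pairwise test (an assumption of this sketch).
    """
    states = set(reachable_states)
    return not any(differ_in_exactly_one(q, qp)
                   for q, qp in combinations(states, 2))

# Hypothetical example: two independent 2-state components ...
two_bits = [(0, 0), (0, 1), (1, 0), (1, 1)]
# ... and the same network with a third component tracking the sum mod 2.
with_parity = [(a, b, (a + b) % 2) for a, b in two_bits]

print(satisfies_corollary_6_2_1(two_bits))
print(satisfies_corollary_6_2_1(with_parity))
```

In the first network a single faulty component can move the network to another reachable state, so a combinational detector cannot see the error; in the second, any single-coordinate change leaves the reachable part.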
Example 6.4: Again consider network $N_1$ of Example 6.1. Let
$N_1'$ be the associated state network which is obtained from $N_1$ by
changing the external output function and alphabet. Let $\tilde{M}_1'$ be the
state machine corresponding to machine $\tilde{M}_1$ of Fig. 6.4. Then
$N_1'$ realizes $\tilde{M}_1'$ and $\mathcal{C}_{\bar{1}}$ is the same in this case as in Example 6.2.
Hence $N_1'$ is not totally redundant and from Theorem 6.2 we know
that $(N_1', U_C)$ is not (D, 0)-1-diagnosable for any combinational
detector D.
Now construct a new network $N_1''$ from $N_1'$ by adding a new
component $M_3$ as shown in Fig. 6.5.

I, R, $M_1$, $M_2$, $K_1$ and $K_2$ are identical to those
of network $N_1$ of Fig. 6.2.

Fig. 6.5. Network $N_1''$

Network $N_1''$ realizes machine $\tilde{M}_1'$ of this example under
$(e, e, \eta_3', \eta_4')$ where $\eta_3' = (\eta_4')^{-1}$ and where $\eta_4'$ is given by the table:

For network $N_1''$
(P1’92’S $ 2  (P2,92’s$; (PI7 92’S,>, (P2’ 42’s,)> 
c1 - 
and $\mathcal{C}_{\bar{1}} = \{\{a\}, \{d\}, \{c\}, \{b\}\}$. Thus $\mathcal{C}_{\bar{1}}$ is a singleton cover and
component $M_1$ is redundant. Similarly one can show that $M_2$ and
$M_3$ are redundant. Hence $N_1''$ is totally redundant, and $(N_1'', U_C)$ is
(D, 0)-1-diagnosable for some combinational detector D.
6.4 Construction of Combinationally Diagnosable Networks 
In Example 6.4 we showed that a totally redundant network could
be constructed from network $N_1'$ through the addition of one
component machine. In this section we will show that this can be done for
any network. In addition, we derive upper and lower bounds on the 
minimum number of states that such an additional component must 
have. 
Theorem 6.3: Let N be a state network of machines. Let $m_i = |Q_i|$,
and let $m = \max_{1 \le i \le n} m_i$. A network N' where N' realizes N and
$(N', U_C)$ is (D, 0)-2-diagnosable for some combinational detector D
can be constructed from N by the addition of an m state component.
Proof: Without loss of generality take $Q_i = \{0, \ldots, m_i - 1\}$. Let
$N = (I, R, (M_1, \ldots, M_n), (K_1, \ldots, K_n))$ and let
$N' = (I, R, (M_1, \ldots, M_n, M_{n+1}), (K_1, \ldots, K_n, K_{n+1}))$ where
$K_{n+1} = \{Q_1, \ldots, Q_n, I\}$ and where
$M_{n+1}$ is constructed such that for all $q = (q_1, \ldots, q_{n+1}) \in P'$, the
reachable part of N', $\sum_{i=1}^{n+1} q_i \equiv 0 \pmod{m}$. A machine $M_{n+1}$ with
m states which satisfies the above property is described below:
$Q_{n+1} = \{0, \ldots, m-1\}$
$\rho_{n+1}(r) \equiv -\sum_{i=1}^{n} \rho_i(r) \pmod{m}$ for all $r \in R$.
It is clear that N' realizes N. Therefore, it remains only to
be shown that $(N', U_C)$ is (D, 0)-2-diagnosable for some combinational
D.
Let D be the combinational machine which realizes the function
$\lambda_D: \times_{i=1}^{n+1} Q_i \to \{0, 1\}$ where
$\lambda_D(q_1, \ldots, q_{n+1}) = 0$ if $\sum_{i=1}^{n+1} q_i \equiv 0 \pmod{m}$, and 1 otherwise.
Since $(q_1, \ldots, q_{n+1}) \in P'$ implies $\sum_{i=1}^{n+1} q_i \equiv 0 \pmod{m}$ no false alarms
will occur.
Let (r, x, y) be a minimal 2-error caused by $f \in U_C$. Since
(r, x, y) is a minimal error and f only affects one component of N',
$\beta_r(x)$ and $\beta^f_r(x)$ will differ in exactly one coordinate. Say
$\beta_r(x) = (q_1, \ldots, q_i, \ldots, q_{n+1})$ and $\beta^f_r(x) = (q_1, \ldots, q_i', \ldots, q_{n+1})$ where
$q_i \not\equiv q_i' \pmod{m}$. Therefore $q_1 + \ldots + q_i' + \ldots + q_{n+1} \not\equiv 0 \pmod{m}$.
Hence, the error (r, x, y) is detected without delay, and $(N', U_C)$ is
(D, 0)-2-diagnosable.
In the proof of Theorem 6.3 we have given a construction which
can be used to form a totally redundant network from any network
of machines. This construction simply involves the addition of one
component to N. This theorem also gives an upper bound on the
amount of additional redundancy required to make a given network
totally redundant. This upper bound is stated in terms of the size of
the state set of the additional component.
The detector used in the proof of Theorem 6.3 simply checked
to see if the states of the components always summed to 0 (mod m).
By using a more complex detector, namely one which can determine
if the present state is in the reachable part, the number of states
which the additional component must have can be reduced.
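
The construction used in the proof of Theorem 6.3 behaves like a modular checksum over the component states. The sketch below illustrates that idea in Python, assuming component states are encoded as integers in 0..m-1; the function names and the sample values are hypothetical, and the transition function of the added component is not modeled here, only the invariant it maintains.

```python
def check_component_state(component_states, m):
    """State the added component must hold so that the total is 0 (mod m)."""
    return (-sum(component_states)) % m

def detector(all_states, m):
    """Combinational detector: 0 when the states sum to 0 (mod m), else 1."""
    return 0 if sum(all_states) % m == 0 else 1

def single_component_corruptions(all_states, m):
    """Yield every state tuple obtained by changing exactly one coordinate."""
    for i, q in enumerate(all_states):
        for wrong in range(m):
            if wrong != q:
                yield all_states[:i] + (wrong,) + all_states[i + 1:]

# Hypothetical example with m = 3 and two original components.
m = 3
original = (2, 1)
good = original + (check_component_state(original, m),)

print(good, detector(good, m))
print(all(detector(bad, m) == 1
          for bad in single_component_corruptions(good, m)))
```

Because two distinct values in 0..m-1 are never congruent modulo m, changing any one coordinate always changes the sum modulo m, which is why the single check detects every single-component error without delay.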
Let $m_i'$ be the number of states that $M_i$, $1 \le i \le n$, can actually
enter while $M_i$ is a component of network N, and let $m' = \max_{1 \le i \le n} m_i'$.
That is, let $m' = \max_{1 \le i \le n} |P_i(P)|$, where $P_i(P)$ is the projection onto
coordinate i of the reachable part of N. Then $m' \le m$ because
$P_i(P) \subseteq Q_i$, $1 \le i \le n$, and Theorem 6.3 holds with m replaced by m'.
This claim is established in the following theorem.
Theorem 6.4: Let N be a state network of machines. Let
$m_i' = |P_i(P)|$, and let $m' = \max_{1 \le i \le n} m_i'$. A network N' can be
constructed from N by the addition of an m' state component such that
N' realizes N and $(N', U_C)$ is (D, 0)-2-diagnosable.
Proof: Without loss of generality take $P_i(P) = \{0, \ldots, m_i' - 1\}$ and
$Q_i = \{0, \ldots, m_i - 1\}$. Construct N' by adding component $M_{n+1}$ where
N' and $M_{n+1}$ are exactly as in the proof of Theorem 6.3 except for
m being replaced by m'.
We will show that $(N', U_C)$ is (D, 0)-2-diagnosable by showing
that $\bar{\pi}_{\bar{i}} = 0$ for all i, $1 \le i \le n$, and then appealing to Corollary
6.2.1.
Assume, to the contrary, that $\bar{\pi}_{\bar{i}} \neq 0$ for some i, say for i = 1.
Let $\pi_{\bar{1}} = \{B_1, \ldots, B_\ell\}$. Then for some j, $1 \le j \le \ell$, $|B_j \cap P'| > 1$.
This implies the existence of two states $q = (q_1, q_2, \ldots, q_n)$ and
$q' = (q_1', q_2, \ldots, q_n)$ such that $q, q' \in P'$ and $q_1 \neq q_1'$. Now $q, q' \in P'$
implies $q_1 + q_2 + \ldots + q_n \equiv 0 \pmod{m'}$ and $q_1' + q_2 + \ldots + q_n \equiv 0 \pmod{m'}$.
Hence, $q_1 \equiv q_1' \pmod{m'}$ and since $0 \le q_1, q_1' < m'$,
$q_1 = q_1'$. Contradiction. Therefore $\bar{\pi}_{\bar{i}} = 0$ for all i, $1 \le i \le n$,
and the result follows immediately from Corollary 6.2.1.
A technique similar to the one used in the proof of Theorem 6.3
could be used for the diagnosis of n Mealy machines which operate
in parallel with the same inputs and resets. In this case one
additional Mealy machine would be required which had as many
output symbols as the machine with the largest output alphabet. There
is no guarantee, however, that this technique will result in a savings
over duplication, for the additional machine may need as many states
as the product of the number of states of the original n machines.
We have shown that given a network N, a totally redundant
network N' can be constructed through the addition of a component with
no more than m' states where $m' = \max |P_i(P)|$. This amount of
additional redundancy is not always necessary, for N may already
be totally redundant. The following example shows that this amount
of additional redundancy is not necessary even if no component of
the network is redundant.
Example 6.5: Consider state network $N_2$ of Fig. 6.6.

Fig. 6.6. Network $N_2$ (components $M_1$ and $M_2$)

$N_2$ realizes state machine $\tilde{M}_2$ of Fig. 6.7 under $(e, e, \eta_3, \eta_4)$ where
$\eta_3 = \eta_4^{-1}$ and where $\eta_3$ is given by the following table:

    p     q     $\eta_3(p, q)$
    p1    q1    a
    p1    q2    d
    p2    q1    b
    p2    q2    c
    p3    q3    e
    p3    q4    h
    p4    q3    f
    p4    q4    g

    $\tilde{M}_2$:    0    1    2    3    4
    a       b    a    e    h    c
    b       a    b    f    g    d
    c       d    c    f    f    a
    d       c    d    e    e    b
    e       e    f    c    d    g
    f       f    e    d    c    h
    g       g    h    d    b    e
    h       h    g    c    a    f

Fig. 6.7. Machine $\tilde{M}_2$
Since $|\tilde{Q}_2| = 8$ and $|Q_1 \times Q_2| = 16$ it should be clear that while $N_2$
is not totally redundant there is some redundancy in this network
realization of $\tilde{M}_2$. Thus if we were to add a component $M_3$ to $N_2$
in an attempt to form a totally redundant network $N_2'$ we should not
be too surprised if we succeeded with a component $M_3$ with fewer
than m' states, where for network $N_2$ m' = 4. In fact, if the 2-state
machine $M_3 = (Q_1 \times Q_2 \times I, \{s_1, s_2\}, \delta_3)$ were added to $N_2$ where
$\delta_3$ is such that $M_3$ is in $s_1$ whenever $M_1$ and $M_2$ are in (p1, q1),
(p2, q2), (p3, q3) or (p4, q4) and in $s_2$ whenever $M_1$ and $M_2$ are in
(p1, q2), (p2, q1), (p3, q4) or (p4, q3) then the network $N_2'$ so formed
would be totally redundant.
An intuitively satisfying means to verify this claim is as follows.
Component $M_i$ computes the information $\mathcal{C}_{\{i\}}$ about the
corresponding state of $\tilde{M}_2$. In this case the $\mathcal{C}_{\{i\}}$ are the following
partitions of $\tilde{Q}_2$:

    $\mathcal{C}_{\{1\}}$ = {a,d; b,c; e,h; f,g}
    $\mathcal{C}_{\{2\}}$ = {a,b; c,d; e,f; g,h}
    $\mathcal{C}_{\{3\}}$ = {a,c,e,g; b,d,f,h}

Since $\mathcal{C}_{\{1\}} \cdot \mathcal{C}_{\{2\}} = \mathcal{C}_{\{2\}} \cdot \mathcal{C}_{\{3\}} = \mathcal{C}_{\{1\}} \cdot \mathcal{C}_{\{3\}} = 0$ any two
components taken together provide total information as to the
corresponding state of $\tilde{M}_2$. Hence the remaining one will always be
redundant.
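
The pairwise-product argument of Example 6.5 can be checked mechanically. The following Python sketch forms the product of two partitions (all non-empty block intersections) and tests whether the result is the zero partition. The three partitions are entered by hand: the first two follow the example, while the third is derived here from the description of the added component and the table relating component states to the states a through h, so it should be read as an assumption of this sketch. The function names are hypothetical.

```python
def partition_product(p1, p2):
    """Product of two partitions: all non-empty intersections of blocks."""
    return {b1 & b2 for b1 in p1 for b2 in p2 if b1 & b2}

def is_zero(partition):
    """True when every block is a singleton (the least partition)."""
    return all(len(block) == 1 for block in partition)

# Partitions of {a, ..., h} induced by each component (third one assumed).
C1 = [frozenset("ad"), frozenset("bc"), frozenset("eh"), frozenset("fg")]
C2 = [frozenset("ab"), frozenset("cd"), frozenset("ef"), frozenset("gh")]
C3 = [frozenset("aceg"), frozenset("bdfh")]

for name, (x, y) in {"C1.C2": (C1, C2),
                     "C2.C3": (C2, C3),
                     "C1.C3": (C1, C3)}.items():
    print(name, is_zero(partition_product(x, y)))
```

Each product being zero means that any two components together pin down the state of the realized machine, so the third component is redundant in the sense of Definition 6.4.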
The following result gives a lower bound on the number of states
that an additional component must have in order for the resulting
augmented network to be totally redundant. If the network under
consideration is already totally redundant then the lower bound given
by this result is one. Since the behavior of a state machine with one
state is always a constant function, the actual addition of such a
component is unnecessary.
Theorem 6.5: Let N be an n component state network and let N'
be the state network formed from N by the addition of a component
with $\ell$ states. If N' is totally redundant then $\ell \ge \max_{1 \le i \le n} \#(\mathcal{C}_{\bar{i}})$.

Proof: Without loss of generality take $\#(\mathcal{C}_{\bar{1}}) = \max_{1 \le i \le n} \#(\mathcal{C}_{\bar{i}})$, and
let $\#(\mathcal{C}_{\bar{1}}) = d$. That is, if it is known that $M_2$ is in $q_2$, that $M_3$ is in $q_3$, and
so forth up to $M_n$ being in $q_n$ then there is still a d state uncertainty
as to which state of M the state of N currently corresponds. It is
necessary for $M_{n+1}$ to have at least d states to resolve this
uncertainty.
The above result provides a good lower bound on the amount
of additional redundancy required to form a totally redundant network,
and it does so by taking into account the redundancy which already
exists in the network. This level of redundancy, however, is not
always sufficient because it may be impossible to find a component
with d states which will simultaneously resolve the uncertainties
represented by $\mathcal{C}_{\bar{1}}, \mathcal{C}_{\bar{2}}, \ldots,$ and $\mathcal{C}_{\bar{n}}$. The following describes just
such a situation.
Example 6.6: Consider the state network $N_3$ of Fig. 6.8.

Fig. 6.8. Network $N_3$ (components $M_1$, $M_2$ and $M_3$)

This network realizes machine $\tilde{M}_3$ of Fig. 6.9.

    $\tilde{M}_3$:    0    1    2    R
    a       e    b    f    r
    b       g    c    h
    c       g    c    g
    d       h    d    h
    e       e    b    a
    f       f    b    a
    g       g    c    b
    h       h    d    b

Fig. 6.9. Machine $\tilde{M}_3$

For $N_3$ realizing $\tilde{M}_3$ we have
$m = \max_{1 \le i \le 3} |Q_i| = 3$ and $d = \max_{1 \le i \le 3} \#(\mathcal{C}_{\bar{i}}) = 2$.
Suppose that it is desired to add a component $M_4$ to $N_3$ in order
to form a totally redundant network. Theorem 6.5 tells us that $M_4$
must have at least 2 states, and Theorem 6.3 tells us that there is
a 3-state component which will work. We will show that in this case
it is not sufficient for $M_4$ to have 2 states.
Let $M_4$ be a 2-state component which when added to $N_3$ forms
$N_3'$. Let $\mathcal{C}_{\{4\}} = \{B_1, B_2\}$. Since $\mathcal{C}_{\{4\}}$ is a cover of $\tilde{Q}_3$, $B_1 \cup B_2 = \tilde{Q}_3$.
If $|B_1| \ge 5$ or $|B_2| \ge 5$ then $\mathcal{C}_{\bar{1}}$ would not be a singleton cover
because $M_2$ and $M_3$ have only 2 states each and together they could
not resolve a 5-state uncertainty. Therefore if $N_3'$ is to be
totally redundant we must have $|B_1|, |B_2| \le 4$ and thus $\mathcal{C}_{\{4\}}$ will
be a partition of $\tilde{Q}_3$.
For $N_3'$ to be totally redundant $M_4$ must resolve the following 8
pairs of states: {a, e}, {b, d}, {e, g}, {f, h}, {h, h}, {c, d}, {e, f}, and
{g, h}. It can resolve a pair only if the pair is split between $B_1$
and $B_2$. But this is easily seen to be impossible. Therefore
there is no 2-state component which when added to $N_3$ will form
a totally redundant network.
CHAPTER VII
Conclusion
In this report a fresh look at on-line diagnosis was taken
from a system theoretic point of view. The approach used in this
investigation was system theoretic in the sense that resettable
discrete-time systems were used as a basis for a well-developed
formal model of on-line diagnosis, and formal methods were used
to investigate this model. As evidenced by the results in Chapters
III through VI this approach has proved to be very fruitful. One
advantage of this approach is that the results developed in this
report are independent of any particular technology and may be
applied to any system which can be modeled as a resettable machine.
In Chapter II a complete model for the study of on-line
diagnosis was developed, and a number of fundamental questions
concerning on-line diagnosis were stated. Subsequent chapters
provided some answers to these questions for the unrestricted fault
case and the unrestricted component fault case. However, much
more work remains to be done which could be carried out
along the lines presented below.
Except for some of the examples and for the networks considered
in Chapter VI we have been dealing with abstract (i.e., totally
unstructured) systems. Such an approach is good for developing
formally the concepts involved in our theory and for studying the
diagnosis of unrestricted faults, but some of the questions raised
can best be studied in a more structured environment. One reason
for this is that with a structured system we can consider the causes
of faults. For example, given an abstract system it makes no sense
to speak of the set of faults caused by component failures of a
certain type or by bridging failures. However, given a structured
representation of a system (e.g., a circuit diagram) we can discuss
these and other types of failures (causes) and determine the
resulting faults (effects).
There are many different structural levels that could prove
useful to a further investigation into the theory of on-line diagnosis.
Two levels which we believe will be important are: the binary
state-assigned level and the logical circuit level. These levels and the
basis for their potential usefulness are explained below.
A machine M is said to be binary state-assigned if $Q = \{0, 1\}^n$
for some positive integer n. Given such a machine we can speak
of stuck-at-0 and stuck-at-1 and any other type of memory failure.
The faults corresponding to these failures can be enumerated and
comparisons can be made between various schemes for diagnosing
these faults. Memory faults have been studied before in other
contexts and they are an important class of faults for a number of
reasons. As we have seen, only a limited amount of structure is
needed to discuss them. Thus memory faults can be analyzed
before the circuit design of the machine is complete. Also, it is
memory which distinguishes truly sequential systems from purely
combinational (one-state) systems. Combinational systems are
inherently easier than sequential systems to analyze and a number
of techniques for the on-line diagnosis of such systems are known
(see [21] and [33] for example).
A system possesses structure at the logical circuit level if
a representation of the system is given in terms of a logical circuit
composed of primitive logical elements. These may be of the AND-OR
variety, threshold elements, or any similar elements of a "building
block" nature depending upon the technology being considered.
This level is useful for investigating failures in the primitive
components. The circuit in Fig. 2.2 is an example of a structural
representation at this level and the failure of this circuit discussed
in Example 2.2 is a simple example of the analysis that can be
conducted at this level.
Further work could also be performed at the network level
of structural detail which was introduced in Chapter VI. At this
level one could study the problem of implementing on-line diagnosis
on a whole computer whereas with the other levels the emphasis
would be on diagnosing one module. Note that in our definition of
diagnosis the detector is not constrained to give simply a yes-no
response. It could also provide extra information for use in
automatic fault location. Thus, at this level, the problem of which
subsystems must be explicitly observed by the detector to achieve
some desired fault location property could be studied.
One problem that requires extension of our present model
(at any structural level) is the problem of automatic reconfiguration
of the system under the control of the detector. To study this
problem, the model used would have to allow for feedback from the
detector to the system it is observing. The question of how such
an extension should be made is an interesting one and, if answered
satisfactorily, could serve as a basis for a systematic investigation
of reconfiguration techniques.
REFERENCES
[1] Anderson, D. A., "Design of Self-checking Digital Networks
Using Coding Techniques," Coordinated Science Lab, University
of Illinois, Urbana, Report R-527, Sept. 1971.
[2] Arbib, M. A., Theories of Abstract Automata, Prentice-Hall,
Englewood Cliffs, New Jersey, 1969.
[3] Avizienis, A., "Concurrent Diagnosis of Arithmetic
Processors," Digest of the First Annual IEEE Computer Conference,
Chicago, Illinois, Sept. 1967, pp. 34-37.
[4] Avizienis, A., G. C. Gilley, F. P. Mathur, D. A. Rennels,
J. A. Rohr, and D. K. Rubin, "The STAR (Self-Testing and
Repairing) Computer: An Investigation of the Theory and
Practice of Fault-Tolerant Computer Design," IEEE Trans.
on Computers, Vol. C-20, Nov. 1971, pp. 1312-1321.
[5] Ball, M. and F. Hardie, "Effects and Detection of Intermittent
Faults in Digital Systems," in 1969 Fall Joint Comput. Conf.,
AFIPS Conf. Proc., Vol. 35, Montvale, New Jersey, AFIPS
Press, 1969, pp. 329-336.
[6] Carter, W. C., H. C. Montgomery, R. J. Preiss, and H. J.
Reinheimer, "Design of Serviceability Features for the IBM
System/360," IBM Journal, Vol. 8, April 1964, pp. 115-126.
[7] Carter, W. C., and P. R. Schneider, "Design of Dynamically
Checked Computers," Proc. of the IFIPS, Edinburgh, Scotland,
August 1968, pp. 878-883.
[8] Carter, W. C., D. C. Jessep, W. G. Bouricius, A. B. Wadia,
C. E. McCarthy, and F. G. Milligan, "Design Techniques for
Modular Architecture for Reliable Computer Systems," IBM
Res. Report RA 12, Yorktown Heights, New York, March 1970.
[9] Chang, H. Y., E. G. Manning, and G. Metze, Fault Diagnosis
of Digital Systems, John Wiley and Sons, Inc., New York, 1970.
[10] Dorr, R. C., "Self-Checking Combinational Logic Binary
Counters," IEEE Trans. on Computers, Vol. C-21, Dec. 1972,
pp. 1426-1430.
[11] Downing, R. W., J. S. Nowak, and L. S. Tuomenoksa,
"No. 1 ESS Maintenance Plan," Bell System Technical Journal,
Vol. 43, Sept. 1964, pp. 1961-2019.
[12] Eckert, J. P., "Checking Circuits and Diagnostic Routines,"
Instruments and Automation, Vol. 30, Aug. 1957, pp. 1491-1493.
[13] Even, S., "On Information Lossless Automata of Finite Order,"
IEEE Trans. on Computers, Vol. EC-14, Aug. 1965.
[14] Friedman, A. D., and P. R. Menon, Fault Detection in Digital
Circuits, Prentice-Hall, Englewood Cliffs, New Jersey, 1971.
[15] Friedman, A. D., "Diagnosis of Short Faults in Combinational
Circuits," Dig. 1973 Int. Symp. Fault-Tolerant Computing,
June 1973, pp. 95-99.
[16] Hartmanis, J. and R. E. Stearns, Algebraic Structure Theory
of Sequential Machines, Prentice-Hall, Englewood Cliffs,
New Jersey, 1966.
[17] Hennie, F. C., Finite-State Models for Logical Machines,
John Wiley and Sons, Inc., New York, 1968.
[18] Huffman, D. A., "Canonical Forms for Information-Lossless
Finite-State Logical Machines," IRE Trans. on Circuit Theory,
Vol. CT-6, Special Supplement, May 1959, pp. 41-59.
[19] Kohavi, Z. and P. Lavallee, "Design of Sequential Machines
with Fault-Detection Capabilities," IEEE Trans. on Computers,
Vol. EC-16, Aug. 1967, pp. 473-484.
[20] Kohavi, Z., Switching and Finite Automata Theory,
McGraw-Hill, New York, 1970.
[21] Kautz, W. H., "Automatic Fault Detection in Combinational
Switching Networks," Stanford Research Institute Project No.
3196, Technical Report No. 1, Menlo Park, California, April
1961.
[22] Langdon, G. G. and C. K. Tang, "Concurrent Error Detection
for Group Look-Ahead Binary Adders," IBM Journal, Vol. 14,
Sept. 1970, pp. 563-573.
[23] Leake, R. J., "Realization of Sequential Machines," IEEE
Trans. on Computers (correspondence), Vol. C-17, Dec.
1968, p. 1173.
[24] Massey, J. L., "Survey of Residue Coding for Arithmetic
Errors," ICC Bulletin, Vol. 3, Rome, Italy, Oct. 1964,
pp. 195-269.
[25] Mathur, F. P., "On Reliability Modeling and Analysis of
Ultrareliable Fault-Tolerant Digital Systems," IEEE Trans.
on Computers, Vol. C-20, Nov. 1971, pp. 1376-1382.
[26] Mei, K. C. Y., "Bridging and Stuck-at Faults," Dig. 1973
Int. Symp. Fault-Tolerant Computing, June 1973.
[27] Meyer, J. F., and B. P. Zeigler, "On the Limits of Linearity,"
Theory of Machines and Computations (Edited by Z. Kohavi
and A. Paz), Academic Press, New York, 1971, pp. 229-241.
[28] Meyer, J. F., "A General Model for the Study of Fault
Tolerance and Diagnosis," Proc. of the 6th Hawaii International
Symposium on System Sciences, Jan. 1973, pp. 163-165.
[29] Peterson, W. W., "On Checking an Adder," IBM Journal,
Vol. 2, April 1958, pp. 166-168.
[30] Peterson, W. W. and M. O. Rabin, "On Codes for Checking
Logical Operations," IBM Journal, Vol. 3, April 1959,
pp. 163-168.
[31] Peterson, W. W., Error-Correcting Codes, MIT Press,
Cambridge, Mass., 1961.
[32] Rao, T. R. N., "Error-Checking Logic for Arithmetic-Type
Operations of a Processor," IEEE Trans. on Computers,
Vol. C-17, Sept. 1968, pp. 845-845.
[33] Sellers, F. F., M. Hsiao, and L. W. Bearnson, Error
Detection Logic for Digital Computers, McGraw-Hill, 1968.
[34] Short, R. A., "The Attainment of Reliable Digital Systems
through the Use of Redundancy--A Survey," Computer Group
News, March 1968, pp. 2-17.
[35] Tryon, J. G., "Quadded Logic," Redundancy Techniques for
Computing Systems (Edited by R. H. Wilcox and W. C. Mann),
Spartan Books, Washington D. C., 1962, pp. 205-228.
[36] White, J. C. C., "Programmed Concurrent Error-Detection
in an Unchecked Computer," Ph.D. dissertation, Electrical
Engineering and Computer Science, University of California,
Berkeley, 1973.
[37] Wadia, A. B., "Investigation into the Design of Dynamically
Checked Arithmetic Units," IBM Res. Report RC 2787,
Yorktown Heights, New York, Feb. 1970.
APPENDIX 
Resettable Machine Theory 
Our goal in the appendix is not to study the theory of resettable
machines per se but rather to cover that part of it which is used in
this study of on-line diagnosis. The theory of resettable machines
follows closely the theory of sequential machines. The main
differences in the definitions stem from the presupposition that a
resettable machine is reset before every use. One consequence of
this is that the "unreachable" states of a resettable machine are
always ignored.
We begin by repeating here the basic machine notions introduced
in Chapter II.
Let M be a resettable machine. The reachable part of M,
denoted by P, is the set
$P = \{\delta(\rho(r), x) \mid r \in R, x \in I^*\}$.
M is reachable if P = Q. M is $\ell$-reachable if
$P = \{\delta(\rho(r), x) \mid r \in R, x \in I^* \text{ and } |x| \le \ell\}$.
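
The reachable part can be computed by a breadth-first search from the reset states. The sketch below is one way to do this in Python, assuming the transition function and the reset function are given as dictionaries; the function name, the optional length bound parameter, and the small four-state machine are hypothetical illustrations rather than anything defined in the report.

```python
from collections import deque

def reachable_part(delta, rho, resets, inputs, max_len=None):
    """Return the states reachable from some reset state.

    delta maps (state, input) to the next state and rho maps a reset to a
    state. When max_len is given, only input sequences of length at most
    max_len are considered.
    """
    seen = {rho[r] for r in resets}
    frontier = deque((q, 0) for q in seen)
    while frontier:
        q, depth = frontier.popleft()
        if max_len is not None and depth >= max_len:
            continue
        for a in inputs:
            nxt = delta[(q, a)]
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

# Hypothetical machine: state 3 can never be entered after a reset.
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 2, (1, "b"): 0,
         (2, "a"): 2, (2, "b"): 1, (3, "a"): 0, (3, "b"): 3}
rho = {"r": 0}

print(sorted(reachable_part(delta, rho, ["r"], ["a", "b"])))
print(sorted(reachable_part(delta, rho, ["r"], ["a", "b"], max_len=1)))
```

Breadth-first order reaches each state along a shortest input sequence first, so the bounded search returns exactly the states reachable within the given length; comparing the bounded and unbounded results for a given bound tests the bounded notion of reachability.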
Let $M, M' \in \mathcal{M}(I, Z, R)$. M is equivalent to M' (written $M \equiv M'$)
if $\beta_r = \beta'_r$ for all $r \in R$. Two states $q \in Q$ and $q' \in Q'$ are
equivalent ($q \equiv q'$) if $\beta_q = \beta'_{q'}$. It is easily verified that these are
both equivalence relations, the first on $\mathcal{M}(I, Z, R)$ and the second on
the states of machines in $\mathcal{M}(I, Z, R)$. M is reduced if for all
$q, q' \in P$, $q \equiv q'$ implies $q = q'$.
If M and M' are two resettable machines then M realizes M' if
there is a triple of functions $(\sigma_1, \sigma_2, \sigma_3)$ where $\sigma_1: (I')^+ \to I^+$ is a
semigroup homomorphism such that $\sigma_1(I') \subseteq I$, $\sigma_2: R' \to R$,
$\sigma_3: Z'' \to Z'$ where $Z'' \subseteq Z$, such that for all $r' \in R'$
$\beta'_{r'} = \sigma_3 \circ \beta_{\sigma_2(r')} \circ \sigma_1$.
The following result is analogous to the result due to Leake [23]
which was cited in Section 2.2. It supplies us with an alternative,
and structurally oriented, definition of realization.
Theorem A.1: Let M and M' be two resettable machines with
reachable parts P and P'. M realizes M' if and only if there exists a
4-tuple of functions $(\eta_1, \eta_2, \eta_3, \eta_4)$ where
$\eta_1: I' \to I$
$\eta_2: R' \to R$
$\eta_3: Z \to Z'$
$\eta_4: P' \to \mathcal{P}(P) - \emptyset$    ($\mathcal{P}(P) = \{X \mid X \subseteq P\}$)
such that
i) $\delta(\eta_4(p'), \eta_1(a)) \subseteq \eta_4(\delta'(p', a))$ for all $p' \in P'$ and $a \in I'$
ii) $\eta_3(\lambda(p, \eta_1(a))) = \lambda'(p', a)$ for all $p' \in P'$, $a \in I'$, and $p \in \eta_4(p')$
iii) $\rho(\eta_2(r')) \in \eta_4(\rho'(r'))$ for all $r' \in R'$.
Proof: (Necessity) Assume that M realizes M'. Then there exists
an appropriate triple of functions $(\sigma_1, \sigma_2, \sigma_3)$ such that
$\beta'_{r'}(uv) = \sigma_3(\beta_{\sigma_2(r')}(\sigma_1(uv)))$
for each $r' \in R'$, $u \in (I')^*$ and $v \in (I')^+$. Hence,
Thus for each $p' \in P'$ there is a $p \in P$ such that
Consider $\eta_4: P' \to \mathcal{P}(P) - \emptyset$ defined by
and consider $\eta_1: I' \to I$ defined by
Claim: The 4-tuple $(\eta_1, \sigma_2, \sigma_3', \eta_4)$ where $\sigma_3'$ is an arbitrary extension
of $\sigma_3$ to Z satisfies i), ii), and iii).
i) Let $p \in \eta_4(p')$. We must show $\delta(p, \eta_1(a)) \in \eta_4(\delta'(p', a))$.
(Sufficiency) Suppose there exist functions $(\eta_1, \eta_2, \eta_3, \eta_4)$ as in the
statement of the theorem. Let $\sigma_1: (I')^+ \to I^+$ be the natural
extension of $\eta_1$ to sequences. That is, $\sigma_1(a_1 \ldots a_n) = \eta_1(a_1) \ldots \eta_1(a_n)$.
Claim: M realizes M' under $(\sigma_1, \eta_2, \eta_3)$. Consider $\xi: P' \to P$
where
$\xi(p')$ = some $p \in \eta_4(p')$ such that
$\rho(\eta_2(r')) = \xi(\rho'(r'))$ for all $r' \in R'$.
Let x = ya where $a \in I'$. Then
This completes the proof of Theorem A.1.
Theorem A.2: If M realizes M' and M' is reduced and reachable then
$|Q| \ge |Q'|$.

Proof: Assume that M realizes M' under $(\sigma_1, \sigma_2, \sigma_3)$ and that M'
is reduced and reachable. Then $\beta'_r = \sigma_3 \circ \beta_{\sigma_2(r)} \circ \sigma_1$
for all $r \in R'$.
Let $q' \in Q'$. Then there exists $r \in R'$ and $x \in (I')^*$ such that
$q' = \delta'(\rho'(r), x)$.
Hence there exists a function $f: Q' \to Q$ such that for each $q' \in Q'$,
$\beta'_{q'} = \sigma_3 \circ \beta_{f(q')} \circ \sigma_1$.
To prove that $|Q| \ge |Q'|$, it suffices to show that f is 1-1. Let
$q_1, q_2 \in Q'$ and assume that $f(q_1) = f(q_2)$. Then
$\beta'_{q_1} = \sigma_3 \circ \beta_{f(q_1)} \circ \sigma_1 = \sigma_3 \circ \beta_{f(q_2)} \circ \sigma_1 = \beta'_{q_2}$.
Since M' is reduced and reachable this implies
that $q_1 = q_2$. Hence f is 1-1. This establishes the result.
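Theorem A.3: The relation "realizes" is transitive. That is, M realizes M'
and M' realizes M" implies M realizes M".

Proof: (Sketch) Assume that M realizes M' under $(\sigma_1, \sigma_2, \sigma_3)$ and
M' realizes M" under $(\sigma_1', \sigma_2', \sigma_3')$. Then
$\beta''_{r''} = \sigma_3' \circ \beta'_{\sigma_2'(r'')} \circ \sigma_1'$ for all $r'' \in R''$. It follows
that $\beta''_{r''} = \sigma_3' \circ \sigma_3 \circ \beta_{\sigma_2(\sigma_2'(r''))} \circ \sigma_1 \circ \sigma_1'$. That is, M realizes M".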
If M and M' are resettable machines then M is isomorphic to M'
if there exist four 1-1 and onto functions
$\omega_1: I \to I'$
$\omega_2: R \to R'$
$\omega_3: Z \to Z'$
$\omega_4: P \to P'$
such that for all $q \in P$, $a \in I$ and $r \in R$
i) $\omega_4(\delta(q, a)) = \delta'(\omega_4(q), \omega_1(a))$
ii) $\omega_3(\lambda(q, a)) = \lambda'(\omega_4(q), \omega_1(a))$
iii) $\omega_4(\rho(r)) = \rho'(\omega_2(r))$.
The 4-tuple $(\omega_1, \omega_2, \omega_3, \omega_4)$ is called an isomorphism of M onto M'.
If $M, M' \in \mathcal{M}(I, Z, R)$ and $(e, e, e, \omega_4)$ is an isomorphism of M onto M',
then M is strongly isomorphic to M'. A basic result of sequential
machine theory states that for every machine there is an equivalent
reduced machine and that this machine is unique up to strong
isomorphism. The corresponding result for resettable machines is
given by Theorem A.4 and Corollary A.6.1.
Theorem A.4: For every resettable machine M there is a reduced
and reachable machine $M_R$ equivalent to M.

Proof: Let $M = (I, Q, Z, \delta, \lambda, R, \rho)$ and let $M_R = (I, Q_R, Z, \delta_R, \lambda_R, R, \rho_R)$
where
$Q_R = \{[q] \mid q \in P\}$    ($[q] = \{q' \mid q' \equiv q\}$)
$\delta_R([q], a) = [\delta(q, a)]$
$\lambda_R([q], a) = \lambda(q, a)$
$\rho_R(r) = [\rho(r)]$.
To prove this result we must verify (1) that $\delta_R$ and $\lambda_R$ are
well-defined, (2) that $M_R$ is reduced and reachable, and (3) that $M \equiv M_R$.
The details of this proof are very similar to the details of the
corresponding result in sequential machine theory. They may be
found in many textbooks which cover this theory (e.g., see Arbib
[2]).
$M_R$ as defined above is called the reduction of M. M' is a
reduced form of M if M' is reduced and $M \equiv M'$.
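
The equivalence classes that make up the state set of a reduction can be computed by successive partition refinement, the standard technique from sequential machine theory. The Python sketch below is an illustration under stated assumptions: states are first grouped by their output rows and then repeatedly split according to the classes of their successors. The function name, the dictionary encoding of the transition and output functions, and the three-state example are hypothetical; the state list passed in is assumed to be the reachable part.

```python
def state_equivalence_classes(states, inputs, delta, lam):
    """Map each state to a class id; equivalent states share an id.

    delta maps (state, input) to a state, lam maps (state, input) to an
    output. Refinement stops when a round splits no class.
    """
    def renumber(signature):
        ids = {}
        return {q: ids.setdefault(signature[q], len(ids)) for q in states}

    # Initial grouping: states with identical output rows.
    block = renumber({q: tuple(lam[(q, a)] for a in inputs) for q in states})
    while True:
        refined = renumber({
            q: (block[q],) + tuple(block[delta[(q, a)]] for a in inputs)
            for q in states
        })
        if len(set(refined.values())) == len(set(block.values())):
            return refined
        block = refined

# Hypothetical machine in which states 1 and 2 behave identically.
states = [0, 1, 2]
inputs = ["a"]
delta = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 1}
lam = {(0, "a"): 0, (1, "a"): 1, (2, "a"): 1}

classes = state_equivalence_classes(states, inputs, delta, lam)
print(classes)
```

Each class id plays the role of one state of the reduced machine; transitions and outputs of the reduced machine can be read off from any representative of a class.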
Lemma A.5: $M \equiv M'$ implies $\delta(\rho(r), x) \equiv \delta'(\rho'(r), x)$ for all $r \in R$
and $x \in I^*$.

Proof: Let $a \in I$, $x, y \in I^*$ and $r \in R$. Then
Theorem A.6: If M and M' are both reduced and $M \equiv M'$ then M
is strongly isomorphic to M'.

Proof: Assume that M and M' are reduced and that $M \equiv M'$. We
know that each $q \in P$ is representable in the form $\delta(\rho(r), x)$.
Define $\omega_4: P \to P'$ by
$\omega_4(\delta(\rho(r), x)) = \delta'(\rho'(r), x)$.
Claim: M is strongly isomorphic to M' under $(e, e, e, \omega_4)$. We must
show that $\omega_4$ is well-defined, 1-1 and onto and that for all $r \in R$,
$a \in I$ and $q \in P$
i) $\omega_4(\delta(q, a)) = \delta'(\omega_4(q), a)$
ii) $\lambda(q, a) = \lambda'(\omega_4(q), a)$
iii) $\omega_4(\rho(r)) = \rho'(r)$.
In the following we denote $\omega_4(q)$ by q'.
Well-defined: Let $p = \delta(\rho(r), x)$ and $q = \delta(\rho(s), y)$, and suppose that
p = q. Then $\beta_{\delta(\rho(r), x)} = \beta_{\delta(\rho(s), y)}$ and thus by Lemma A.5,
$\beta'_{\delta'(\rho'(r), x)} = \beta'_{\delta'(\rho'(s), y)}$. That is, $\beta'_{p'} = \beta'_{q'}$. Since M' is reduced
and $p', q' \in P'$ it follows that p' = q'. Hence $\omega_4$ is well-defined.
1-1: Again let $p = \delta(\rho(r), x)$ and $q = \delta(\rho(s), y)$ but now suppose that
$p \neq q$. Then by reapplying the above argument $p' \neq q'$. Hence,
$\omega_4$ is 1-1.
Onto: Since every $q' \in P'$ is representable in the form $\delta'(\rho'(r), x)$,
$\omega_4$ is onto.
That i), ii), and iii) are satisfied is straightforward to verify.
Corollary A.6.1: The reduced form of M is unique up to strong
isomorphism. That is, if M' and M" are reduced forms of M then
M' is strongly isomorphic to M".

Proof: If M' and M" are reduced forms of M then $M \equiv M'$ and
$M \equiv M''$. Hence $M' \equiv M''$. Since M' and M" are both reduced, by
Theorem A.6, M' is strongly isomorphic to M".

Theorem A.7: If $M \equiv M'$ then M realizes M'.

Proof: $M \equiv M'$ implies $\beta_r = \beta'_r$ for all $r \in R$. Hence M realizes M'
under (e, e, e).
A resettable machine M is autonomous if $|I| = 1$.
Given a resettable machine M, two input symbols $a, b \in I$ are
equivalent ($a \equiv b$) if $\lambda(q, a) = \lambda(q, b)$ and $\delta(q, a) \equiv \delta(q, b)$ for all $q \in P$.
M is transition distinct if no two of its input symbols are equivalent.
Any machine which has equivalent inputs is redundant in the sense
that the inputs in an equivalence class can be represented by any one
of its members without affecting the capabilities of the machine. The
following results give an alternative characterization of equivalent
inputs.
Theorem A.8: Let M be a resettable machine, and let $a, b \in I$. Then
$a \equiv b$ if and only if for all $x, y \in I^*$ and $r \in R$, $\beta_r(xay) = \beta_r(xby)$.

Proof: (Necessity) Suppose $a \equiv b$ and assume, to the contrary,
that $\beta_r(xay) \neq \beta_r(xby)$ for some $r \in R$ and $x, y \in I^*$. Let $q = \delta(\rho(r), x)$.
Now, $\beta_r(xay) \neq \beta_r(xby)$ implies $\beta_q(ay) \neq \beta_q(by)$. If $y = \Lambda$ then
$\lambda(q, a) \neq \lambda(q, b)$. If $y \in I^+$ then $\beta_{\delta(q, a)}(y) \neq \beta_{\delta(q, b)}(y)$ and hence
$\delta(q, a) \not\equiv \delta(q, b)$. Therefore $a \not\equiv b$. Contradiction. Hence $a \equiv b$
implies $\beta_r(xay) = \beta_r(xby)$ for all $x, y \in I^*$ and $r \in R$.
(Sufficiency) Assume that $a \not\equiv b$. Then for some $q \in P$, $\lambda(q, a) \neq
\lambda(q, b)$ or $\delta(q, a) \not\equiv \delta(q, b)$. Let $q = \delta(\rho(r), x)$. Then $\lambda(\delta(\rho(r), x), a) \neq
\lambda(\delta(\rho(r), x), b)$ or $\delta(\rho(r), xa) \not\equiv \delta(\rho(r), xb)$. Hence $\beta_r(xa) \neq \beta_r(xb)$ or
for some $y \in I^+$, $\beta_r(xay) \neq \beta_r(xby)$. Therefore if $\beta_r(xay) = \beta_r(xby)$
for all $r \in R$, and $x, y \in I^*$ then $a \equiv b$.
