Intermittent/transient fault phenomena in digital systems by Masson, G. M.
Produced by the NASA Center for Aerospace Information (CASI) 
https://ntrs.nasa.gov/search.jsp?R=19770020363 2020-03-22T09:45:32+00:00Z
THE JOHNS HOPKINS UNIVERSITY
(NASA-CR-153743) INTERMITTENT/TRANSIENT FAULT PHENOMENA IN DIGITAL
SYSTEMS Final Report (Johns Hopkins Univ.) 112 p HC A06/MF A01
CSCL 09C N77-27307 Unclas G3/33 36704
ELECTRICAL ENGINEERING DEPARTMENT
Final Report
Intermittent/Transient Fault Phenomena
in Digital Systems
NASA Research Grant NSG 1265
by
Gerald M. Masson
Electrical Engineering Department
The Johns Hopkins University
Baltimore, Maryland 21218
(301) 338-7013
Submitted to
National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23655
July, 1977
Project Participants
Name: Vinod K. Agarwal
Date & Place of Birth: 4/30/52, Mathura, India
Education: BE (Honors) 1973, Birla Institute of Tech. & Sci.,
Pilani, India
MS 1974, University of Pittsburgh, Pittsburgh,
Pennsylvania
Ph.D Candidate
Interest: Fault interrelationships; resolvability of multiple
fault situations; algorithms (network level) for
resolution; testing theory.
Name: Robert E. Glaser
Date & Place of Birth: 3/12/54, Baltimore, Maryland
Education: BES (General and Departmental Honors) 1975, The Johns
Hopkins University
Ph.D Candidate
Interest: Reliability of networks and systems; strategies
and algorithms for recovery, roll-back, survivability
and tolerance.
Name: Clifford L. Greenblatt
Date & Place of Birth: 6/16/52, Baltimore, Maryland
Education: BS (Honors) 1974, Case Western Reserve University
Ph.D Candidate
Interest: Causes and effects of intermittent/transient faults
(component and network levels); models and parameters.
Name: Sivanarayana Mallela
Date & Place of Birth: 8/4/52, Madras, India
Education: B. Tech 1974, Indian Institute of Technology,
Bombay, India
MS 1976, State University of New York at Buffalo
Ph.D Candidate
Interest: Analysis (detection and location) of transient/
intermittent faults (network level); reliability of
testing procedures; goodness factors and bounds.
Name: Shinji Nakamura
Date & Place of Birth: 2/27/44, Tokyo, Japan
Education: BS 1966, Gakushuin University (Physics)
MS 1969, Gakushuin University (Physics)
Ph.D Candidate
Interest: Equivalence of fault situations (component and
network level); functional approaches to coverage;
transformations.
1. Preliminaries
1.1 Introduction
A study of intermittent/transient faults (I/T faults)
in digital systems requires a confrontation with a multitude
of issues. It soon becomes clear, moreover, that not all of
these issues can be considered in a period of one year.
We have, therefore, taken one approach, of many possible
approaches, to this area, and have reached a point which we
feel indicates reasonable progress. In the following sections
we detail the state of our investigation and try to indicate
the strong and weak points of our current position in this
problem area. This section serves as an overview of these
results.
The ultimate goal of this study is to perform survivability
evaluation of digital systems for I/T faults. The framework
within which this evaluation is to be performed is generically
described by the CARE II approach. However, survivability has
heretofore been addressed primarily from the point of view of
long-term or uniform survivability. The explicit consideration
of I/T faults requires instead the consideration of interval
survivability. Interval survivability is a measure of the
probability of the system surviving a fixed time interval of
I/T fault activity, in the sense that at the end of this interval
the system can continue to operate (perhaps at the cost of some
recovery operations) in an acceptable (but perhaps degraded)
mode.
We have not, at this time, implemented a means of evaluating
such a survivability number. It is clear, however, that
the task of doing so is at least as complicated as that of
writing the actual CARE II program. Moreover, to do so is
beyond the scope of the work reported here and would, in fact,
overlap significantly with the development of CARE III. However,
crucial to any such evaluation is the capability to
detect and diagnose I/T faults. (This is sometimes referred
to as the "D" and "I" of the DIR function.) We have here,
then, the motivation for much of the work reported in
the following sections: the chief results consist
of the development of a methodology for detecting and diagnosing
I/T faults in digital systems. We will show that there
are specific bounds and guidelines to this detection and diagnosis
for I/T faults in general which must be taken into
account in any interval survivability evaluation. These bounds
and guidelines are detailed such that they can be incorporated
into any testing and diagnosing methodology.
In addition we report on the status of an experimental
attempt to determine the effect of various physical I/T faults
(for example, those which might be determined to be important
from the 1977 Learjet Experiments) on a type of module which
could be a part of a general computer system characterized by
the CARE II model. The goal here is to functionally determine
the effect of various I/T faults so that the methodologies for
testing and diagnosing I/T faults can be fully exploited by
taking into account the detailed test set requirements.
All of this, then, leads to further refinements in
interval survivability evaluation for digital systems in
the presence of I/T faults.
2. Interval Survivability
Our objective is to enhance and evaluate the survivability
of a computer system by use of fault tolerant techniques.
Survivability is evaluated as the probability that the system
will survive until a given time, $t$. The concept of survivability
is extended by the inclusion of degraded modes of operation.
For example, a system may survive to time $\tau$ in the
undegraded mode but survive from time $\tau$ to time $t$ in a
degraded mode.
Various methods of increasing the fault tolerance of a
computer system have been proposed in the literature. The
method considered here is the use of standby sparing. The
computer system is assumed to be partitioned into several
stages. Within each stage are a number of identical units.
A certain number of the units in a given stage are in active
operation while several other units serve as standby units.
In the event that an active unit fails, a spare unit is
tested and then switched in to replace the bad active unit.
The principles of detection and location of faults and
of recovery of the program in progress are essential to a
gracefully degrading, standby sparing computing system. A
system of both hardware and software detectors is provided
to detect the presence of any faults and then to locate the
fault to within a certain category. A strategy for recovering
from faults in each category must be provided. When the presence
of a fault is detected, further propagation of errors
is prevented by stopping the program in progress and holding
information needed for recovery. After the fault is isolated
to within a certain category and a spare unit is switched in to
replace the faulty unit (no spare unit is switched in for the
case of transient faults), a recovery strategy is then
carried out to restore the program in progress. A failure to
detect or isolate a fault, or a failure to restore the program
in progress, will result in a system failure.
In order to design for fault tolerance, a great deal
must be known about the nature of the faults that can occur.
A transient fault is generated external to the affected units;
therefore one must be able to identify transient faults so
that recovery may be done without replacing any units with a
spare unit. An intermittent fault is a failure that is generated
within a unit. If a fault is identified as intermittent,
it is best to treat the fault as a permanent fault, i.e.,
replace the unit with a spare unit to avoid the consequences
of recurrence of the intermittency. In order to detect faults
as quickly as possible, one must have a good model of the
probability of occurrence of each type of fault that may occur. Using
an accurate fault model, one then designs hardware detectors
and software procedures to detect the various faults in an
efficient manner. For permanent faults, accurate failure rates
are given by the manufacturer. However, additional study is
required to model intermittent faults, which are generally
caused by loose wire bonds in integrated circuits, and
transient faults, which are generally the result of electromagnetic
disturbances generated in a thunderstorm by nearby
lightning strikes.
The CARE II program is designed to evaluate the survivability
of the gracefully degrading and standby sparing
computing system described above. The program derives the
probability that the system can survive until time $t$.
Information required for CARE II includes failure rates of
the units and the rate of occurrence of nonrecoverable transient
faults. The CARE II program includes a coverage model
which demands a detailed knowledge of what faults may occur,
the characteristics of the hardware and software detectors,
and the probability of recovery given the class of fault and
the time elapsed from the occurrence of the fault to its
detection and to its isolation. Coverage is the part of the
fault tolerance design that includes detection, isolation,
and recovery from a fault. Once the fault is detected by a
given detector, the isolation and recovery procedures follow
according to a deterministic path. However, several detectors
may be capable of detecting the same fault, and which one
actually succeeds will then determine the isolation and
recovery procedures. Due to the nondeterministic nature of
fault detection, CARE II requires categorization of faults
into fault classes. Fault classes are chosen such that the
detection of a fault class by the set of detectors is statistically
independent with respect to the detectors. To derive
coverage coefficients, i.e., the probability of recovering
from a fault given that a certain number of spare units must
be tested before a good one is found, one must also know the
probability of occurrence of the various faults. Finally,
one must also know the effectiveness of the recovery strategy
for a given fault and given time delays included in detection
and isolation. Specifically, one must provide $r(\tau, \tau')$, where
$r(\tau, \tau')$ is the probability, for a given type of fault, that the
program will recover after a delay of $\tau$ seconds from fault
occurrence to fault detection and a delay of $\tau'$ seconds from
fault detection to fault isolation. CARE II assumes that
$r(\tau, \tau') = r'(\tau) + r''(\tau + \tau')$, where $r'(\tau)$ accounts for
propagation of errors before the fault is detected and $r''(\tau + \tau')$
accounts simply for total time lost in the running of the
program.
The present literature on gracefully degrading and
standby sparing computing systems assumes a uniform failure
rate. The problem of lightning induced faults brings up
the question of interval survivability, that is, design and
evaluation of the computing system with respect to toleration
of a severe transient fault inducing environment over a
limited period of time. It will be necessary to modify the
CARE II model to account for a change in detection and recovery
procedures for the severe transient interval. The emphasis
of the detection procedure must be shifted to detection of
the transient faults likely to occur in a thunderstorm.
Such a shift may easily be initiated by the detection of
electromagnetic disturbances by an external detector provided
for such a purpose. The external detector may provide data
on the nature of the transient environment so that the optimum
fault tolerant strategy may be employed. For example, the
fault tolerant strategy should be a function of the severity
of each electromagnetic disturbance as it occurs and also the
expected length of the severe transient fault inducing interval.
If several operations are being carried out at once,
those operations that are most affected by the transient
environment should be postponed, if possible, or done in a way
that is less sensitive to transient faults. To study the
modification of gracefully degrading and standby sparing computer
systems for interval survivability, one requires an extensive
knowledge of the effects of a thunderstorm on the operation of
the computing system and a knowledge of the effects of lightning
induced transient faults on the various functions of the
computing system.
3. Current Status of Module Self-Diagnosis Theory
3.1 Introduction
In this section we give an overview of the theory of
self-diagnosis of digital systems. It should be kept in mind that
while our concern is with I/T faults, the majority of the work
reviewed here deals with permanent faults. This work must nevertheless
be appreciated, as it represents an initial condition on the results
of Section 4. Moreover, we have seen in Section 2 that
self-diagnosis is a crucial factor for survivability evaluation.
In particular, we consider the design of such systems
which will operate in a distributed processing/decentralized control
mode. In order for such systems to operate in a fault tolerant
environment, it is imperative to incorporate into the design some
degree of self-diagnosability.
In this section, a system is considered which can be partitioned
into $n$ functional units, or modules, each possessing some degree of
intelligence. A major assumption to be made is that each module be
completely capable of testing the correctness of other specified
modules in the system. Such tests cannot be well defined in the
general sense. (Indeed, their generation is the responsibility of
the module designer.) They may be considered to be strictly hardware,
as in the case of a dedicated hardware monitor. Similarly, a module
may utilize various diagnostic software routines to generate and
compare test patterns that will be applied to the module being tested.
In the most general sense, however, a test may be thought of as any
combination of hardware and software which enables a unit to
successfully base a conclusion as to the operational state of another module.
A unique test will probably need to be designed for each module,
since a single universal test would not provide
adequate fault coverage for each module, due to their functional
differences.
It should be noted that the implementation of such tests is not a
trivial matter and poses a major obstacle to the design of totally
self-diagnosable systems.
The actual diagnosis problem is one of detection and location
of all faulty modules that may be present in the system. Once all
of the tests in the testing interconnection design have been
completed, it is the function of the entire system to detect the
presence of any single or multiple faults. A system diagnosis
algorithm must be implemented to examine the set of test outcomes
and determine if any faults have occurred. A factor which increases
the complexity of the problem is the possibility of incorrect
test outcomes produced by modules which are themselves faulty.
A basic assumption affecting the testing interconnection of
devices is an upper bound on the allowable number
of modules which may become defective at any one time. Depending
upon this figure, the testing scheme may be very simple or quite
complex. It will be shown that the number of modules in the system
and their testing interconnections have a direct dependence upon
this upper bound.
The goals of attaining a totally self-diagnosable system are
twofold. The first is in enabling the system to operate in a
fault tolerant environment. Upon the detection of any module
failures, the system could conceivably isolate those devices,
reconfigure, and then recover to resume its processes, all without
any external communications. Such performance would be essential.
3.2 SYSTEM MODELS AND THEIR ANALYSIS
Various system models have been proposed to aid in the analysis
of diagnosable systems. The basic goal of each of the models is to
demonstrate the testing interconnections employed and to deduce the
performance characteristics of such schemes. These characteristics
may include an upper bound on the number of faults that may occur
in the system while the system still possesses the ability to locate
just a single fault or perhaps to diagnose the entire fault situation.
Probably the most well known model was proposed by Preparata [3.7]
a decade ago. In this model the system is decomposed into $n$ different
subsystems, or modules, each with the capacity to test the correctness
of the others. The model itself is a graph-theoretic one in
which each of the $n$ nodes represents one of the $n$ modules and a directed
edge is included to denote a testing interconnection between two
modules. Each of the testing links is represented by $b_{ij}$, in which
unit $U_i$ evaluates unit $U_j$. The weight associated with each
$b_{ij}$ is $a_{ij} = 0$ or $1$. The test outcome $a_{ij}$ is 0 if module $U_i$ is
fault free and tests $U_j$ to be also fault free. However, if $U_i$ is
fault free and tests $U_j$ to be faulty, then $a_{ij} = 1$. In the situation
where the testing module $U_i$ is faulty, its test output could
be 0 or 1, regardless of the actual condition of the
module $U_j$ being tested. The testing connection of the system can be
represented by a connection matrix $C = \|c_{ij}\|$ where

$$c_{ij} = \begin{cases} 1 & \text{if } b_{ij} \text{ exists,} \\ 0 & \text{if } b_{ij} \text{ does not exist.} \end{cases}$$

Once all of the tests in the model have been completed, each $a_{ij}$
has been assigned a corresponding binary weight. It is from this
set of test outcomes, i.e. the system syndrome, that the system
diagnostics will be performed.
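The test-outcome rules of this model can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names, the 0-based unit numbering, the three-unit loop, and the use of a seeded random number generator to stand in for the arbitrary output of a faulty tester are all hypothetical and do not come from the report.

```python
import random


def connection_matrix(n, links):
    """Build C = ||c_ij||, with c_ij = 1 when testing link b_ij exists."""
    c = [[0] * n for _ in range(n)]
    for i, j in links:
        c[i][j] = 1
    return c


def syndrome(links, faulty, rng=None):
    """Return {(i, j): a_ij} for every testing link b_ij.

    A fault-free tester reports the true state of the tested unit;
    a faulty tester reports an arbitrary 0 or 1 (modelled here, as an
    assumption, by a pseudo-random choice).
    """
    rng = rng or random.Random(0)  # fixed seed only for repeatability
    outcome = {}
    for i, j in links:
        if i in faulty:
            outcome[(i, j)] = rng.randint(0, 1)
        else:
            outcome[(i, j)] = 1 if j in faulty else 0
    return outcome


# Hypothetical 3-unit loop: U0 tests U1, U1 tests U2, U2 tests U0.
links = [(0, 1), (1, 2), (2, 0)]
print(connection_matrix(3, links))
print(syndrome(links, faulty={1}))
```

With unit 1 faulty, the outcome of link (0, 1) is always 1 and that of link (2, 0) is always 0, while the outcome of link (1, 2) carries no reliable information.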
The system diagnostics may be oriented towards one of two
possible approaches. The first is often referred to as one-step
t-fault diagnosability. It is the goal of this method to identify
(locate) all of the faulty units in the system, given its syndrome.
A necessary constraint is that the number of faulty units does not
exceed the upper bound t. Another approach is in viewing the system
as being sequentially t-fault diagnosable. Here, it is guaranteed
that at least one faulty unit can be detected directly from analysis
of the system syndrome. Again, it is assumed that the maximum
number of faulty modules does not exceed t. It is obvious that a
one-step t-fault diagnosable system is also sequentially t-fault
diagnosable. The motivation behind each of these situations should
also be clear. In the one-step case, the syndrome is examined to
identify each of the faulty units. These units may all be replaced
at once, thus enabling the system to once again become fully
operational. On the other hand, in the sequentially t-fault diagnosable
situation, the syndrome is examined such that only one faulty unit
is detected. This faulty unit is then replaced and a new syndrome
is generated. This procedure would be repeated up to t times,
until all of the faulty units had been replaced. While one-step
t-fault diagnosability is more efficient than sequential t-fault
diagnosability, the complexity involved may not warrant the increased
performance.
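One-step diagnosis can be pictured as a search for the fault sets of size at most t that are consistent with the observed syndrome. The following Python sketch does this by exhaustive enumeration. It is an illustrative assumption, not the diagnostic algorithm of the report; the function names, the five-unit example, and the choice of 0 for every outcome produced by a faulty tester are hypothetical.

```python
from itertools import combinations


def consistent(syndrome, faulty):
    """True if every fault-free tester's outcome matches the fault set."""
    for (i, j), a in syndrome.items():
        if i not in faulty and a != (1 if j in faulty else 0):
            return False
    return True


def candidate_fault_sets(n, syndrome, t):
    """All fault sets of size <= t that could have produced the syndrome."""
    return [set(c)
            for k in range(t + 1)
            for c in combinations(range(n), k)
            if consistent(syndrome, set(c))]


# Hypothetical 5-unit system in which U_i tests U_(i+1) and U_(i+2) (mod 5).
# Units 1 and 3 are faulty; their (unreliable) outcomes are all set to 0 here.
example = {
    (0, 1): 1, (0, 2): 0, (2, 3): 1, (2, 4): 0, (4, 0): 0, (4, 1): 1,
    (1, 2): 0, (1, 3): 0, (3, 4): 0, (3, 0): 0,
}
print(candidate_fault_sets(5, example, t=2))
```

For this syndrome the only consistent fault set of size at most 2 is {1, 3}, so all faulty units are located in one step. A sequential procedure would instead be satisfied with identifying any one unit that is certainly faulty, replacing it, and re-testing.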
In his paper, Preparata made a few basic, but very important,
observations in relation to diagnosable systems. The first was in
generating a lower bound on the number of units needed to ensure that the
system is diagnosable. That is, given that a system is one-step
t-fault diagnosable, then $n \ge 2t + 1$. Conversely, if a system of $n$ modules
is said to be one-step t-fault diagnosable, then an upper bound on
its degree of diagnosability is $t \le (n-1)/2$. It should be emphasized
that these bounds may not be reached if an inefficient connecting
scheme is employed. Another important observation made was in
bounding the smallest number of units needed to test another to
ensure that the system is one-step t-fault diagnosable. It was found
that it was necessary that each unit be tested by at least t other
units. Thus, we see that for a system of $n$ units, the minimum number
of connections that enable one-step t-fault diagnosability is $N = nt$
links.
An interconnection design in which $n = 2t + 1$ and each unit is
tested by exactly t other units in such a way as to be one-step
t-fault diagnosable is said to be optimal. A well known class of
optimal designs is the so-called $D_{\delta t}$ design. In this design, a
testing link from $U_i$ to $U_j$ exists if and only if $j - i = \delta m$ (modulo $n$)
and $m$ assumes the values $1, 2, \ldots, t$. Examples of $D_{12}$ and $D_{22}$ designs,
with $n = 5$, are shown in Figure 3.1. It was shown that the $D_{\delta t}$ design
is an optimal one whenever $\delta$ and $n$ are relatively prime, and as such,
would allow the system designer to employ a most efficient testing
interconnection scheme in the overall design. Also, this type of
testing interconnection between devices provides the ability to
synthesize an efficient diagnostic algorithm, such as the one
proposed by Meyer and Masson [3.6]. (This algorithm is presented
in detail in the Appendix.)
As has already been pointed out, the complexity of a one-step
t-fault diagnosable interconnection scheme leads to a rather large
number of testing links. It is for this reason that sequential
t-fault diagnosable systems have been studied. Since we are
utilizing the same model as before, the lower bound on the number
of units, $n \ge 2t + 1$, is still valid. However, Preparata showed that
there exists a class of designs with the number of testing links
$N = n + 2t - 2$ such that the resulting system is sequentially
t-fault diagnosable. Essentially, this design was that of a simple
loop along with a subset of $2t - 2$ units all testing a common unit.
Such a testing interconnection scheme is shown in Figure 3.2 with
$n = 14$ and $t = 6$.
The simplest sequential t-fault analysis is through the use
of a single loop system. With this interconnection, a lower bound
on the number of units to guarantee sequential t-fault diagnosability
is given by the following:

$$n \ge 1 + (m+1)^2 + \lambda(m+1),$$

with $t = 2m + \lambda$, $m$ integral and $\lambda = 0, 1$.
A table comparing one-step t-fault diagnosis to sequential t-fault
diagnosis with respect to their lower bounds on the number of units
and testing links is given in Figure 3.3. Upon examination of the
table, it is obvious that as the allowable number of faults increases,
sequential t-fault diagnosis becomes much more cost effective in
relation to the hardware needed to realize the design.
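The trend can be recomputed from the expressions quoted in the text. The Python sketch below is not a reproduction of Figure 3.3; it simply tabulates, for a range of t, the optimal one-step design ($n = 2t + 1$, $N = nt$), the link count $N = n + 2t - 2$ of the sequential design class for the same n, and the single-loop unit bound, which is read here as $n \ge 1 + (m+1)^2 + \lambda(m+1)$ with $t = 2m + \lambda$. The function names are hypothetical.

```python
def one_step_optimal(t):
    """Units and links for an optimal one-step design: n = 2t + 1, N = n * t."""
    n = 2 * t + 1
    return n, n * t


def sequential_links(n, t):
    """Links in the sequential design class quoted above: N = n + 2t - 2."""
    return n + 2 * t - 2


def single_loop_min_units(t):
    """Smallest n for a single loop: 1 + (m+1)^2 + lam*(m+1), t = 2m + lam."""
    m, lam = divmod(t, 2)
    return 1 + (m + 1) ** 2 + lam * (m + 1)


# Columns: t, one-step n, one-step links, sequential links, single-loop n.
for t in range(1, 7):
    n, n_links = one_step_optimal(t)
    print(t, n, n_links, sequential_links(n, t), single_loop_min_units(t))
```

For $t = 6$, for instance, the one-step design needs 78 links on 13 units, whereas the sequential class needs 23 links on the same 13 units (24 links for the $n = 14$ case mentioned above).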
As one of the first to address the problem, Preparata
made a few important initial contributions to the
study of diagnosable systems. Among them have been the concepts
of one-step and sequential t-fault diagnosis, along with each of
their respective lower bounds on the number of units and testing
interconnections. Also, a class of optimal designs was proposed.
What was noticeably missing, however, was the means to determine
the diagnosability number, i.e. the maximum number of allowable
faulty units, of a general interconnection scheme.
To this end, it was necessary to be able to fully characterize
the connection assignment of diagnosable systems. Hakimi and Amin
[3.4] essentially picked up the study where Preparata left off, in
that they claimed to have shown both the necessary and sufficient
conditions for a system to be t-diagnosable. In their terminology,
a system that is t-diagnosable is directly analogous to
Preparata's one-step t-fault diagnosis. The model used is exactly
the same as the one that had previously been considered. That is,
the system in question can be viewed as a directed graph explicitly
showing the testing connections between modules.
One of the main results of Hakimi and Amin was the determination
of necessary and sufficient conditions for a system to be t-diagnosable
when the system has the property that no two units test each
other. Very simply, they found that such a system is t-diagnosable if and only
if each unit is tested by at least t other units (with $n \ge 2t + 1$
as before). As an example,
it is seen that the $D_{\delta t}$ design, which is known to be t-diagnosable,
satisfies the above conditions.
What was obviously needed, however, was a general characterization
of any system, regardless of the interconnection scheme. The
following approach was proposed. Consider a dipath from $U_{i_1}$ to $U_{i_k}$,
that is, a sequence of vertices and edges in G (the graph),
$U_{i_1}, (U_{i_1}, U_{i_2}), U_{i_2}, \ldots, (U_{i_{k-1}}, U_{i_k}), U_{i_k}$, where $(U_{i_1}, U_{i_2})$
denotes a directed edge from $U_{i_1}$ to $U_{i_2}$. When there is a dipath
from $U_i$ to $U_j$ in G, then $U_j$ is said to be reachable from $U_i$. G is
defined as being strongly connected if any pair of vertices are
mutually reachable. Hakimi and Amin claimed the following: if
$n \ge 2t + 1$ and $\kappa(G) \ge t$, then the system is t-diagnosable, where the
connectivity $\kappa(G)$ of a digraph G is the minimum number of vertices
whose removal from G yields a graph that is not strongly connected.
It is claimed that whenever these conditions are met by any system,
then the system diagnosability number can be verified.
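The connectivity used in this condition can be computed by brute force for small graphs. The Python sketch below is a minimal illustration, assuming the definition quoted above (fewest vertices whose removal leaves a graph that is not strongly connected); the function names, the two example graphs, and the convention of returning $n - 1$ when no removal disconnects the graph are assumptions made here.

```python
from itertools import combinations


def reachable(adj, start, removed):
    """Vertices reachable from `start` without entering removed vertices."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def strongly_connected(vertices, edges, removed=frozenset()):
    """True if all remaining vertices are mutually reachable."""
    live = [v for v in vertices if v not in removed]
    if len(live) <= 1:
        return True
    fwd, rev = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        rev.setdefault(v, []).append(u)
    root = live[0]
    return (len(reachable(fwd, root, removed)) == len(live)
            and len(reachable(rev, root, removed)) == len(live))


def connectivity(vertices, edges):
    """Fewest vertices whose removal destroys strong connectivity."""
    vertices = list(vertices)
    for k in range(len(vertices) - 1):
        for removed in combinations(vertices, k):
            if not strongly_connected(vertices, edges, frozenset(removed)):
                return k
    return len(vertices) - 1  # assumed convention: nothing disconnects it


# Hypothetical examples: a 5-unit design where U_i tests U_(i+1), U_(i+2),
# and a 3-unit loop with a fourth unit that is tested but tests nothing.
design = [(i, (i + m) % 5) for i in range(5) for m in (1, 2)]
loop_plus_leaf = [(0, 1), (1, 2), (2, 0), (2, 3)]
print(connectivity(range(5), design), connectivity(range(4), loop_plus_leaf))
```

The first graph has connectivity 2, so with $n = 5 \ge 2t + 1$ the quoted condition is met for $t = 2$. The second graph is not strongly connected, so its connectivity is 0 and the condition gives no guarantee for any $t \ge 1$.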
An example which questions these results is shown in Figure 3.4.
By inspection, it is obvious that the graph as shown is not strongly
connected. By considering $U_4$ and $U_3$ it is seen that a dipath from
$U_4$ to $U_3$ does not exist. Therefore, the connectivity of this system
is $\kappa(G) = 0$, which implies that the system is 0-diagnosable. A
discrepancy arises here in that it is felt that the system is
instead 1-diagnosable, or equivalently, sequentially 1-fault
diagnosable. It will be shown in the discussion of Russell and Kime's
model below that the above system is indeed sequentially 1-fault diagnosable.
Thus we see that, given a general interconnection scheme, necessary
and sufficient conditions have been proposed by Hakimi and Amin to
determine the degree of the system's diagnosability, although at
present they do seem to be questionable.
Another diagnostic model has been proposed by Russell and Kime
which tends to be somewhat more general than the ones previously
studied. In this model a system can be represented by either the
G array approach, $S = (\mathcal{F}, T, F, G)$, or by the diagnostic graph approach,
$S = (\mathcal{F}, T, F, G)$. Common to both is the set $\mathcal{F}$, which is the set of all
possible faults that may occur in the system. Thus, $\mathcal{F} = \{f_1, f_2, \ldots, f_n\}$.
Now consider all of the $2^n$ possible subsets of $\mathcal{F}$:
$F = \{F_1, F_2, \ldots, F_{2^n}\}$, where $F_k$ ($k = 1, 2, \ldots, 2^n$) is one of the allowable
fault patterns that may occur in the system. The entire class of
fault patterns can be represented by a $2^n \times n$ array, with one row per
fault pattern $F_1, \ldots, F_{2^n}$ and one column per fault $f_1, \ldots, f_n$:

$$F = [F_{ki}], \qquad F_{ki} = \begin{cases} 1 & \text{if } f_i \in F_k, \\ 0 & \text{otherwise.} \end{cases}$$

The set $T = \{t_1, t_2, \ldots, t_p\}$ represents the $p$ pass-fail tests that
can be applied to S. A test $t_j$ is a complete test for a fault $f_i$ if
and only if a) $t_j$ always fails for $f_i$ alone and b) $t_j$ always passes
for no faults. Thus, we see that the set of tests which are complete
for a fault pattern $F_k$ is given as $t(F_k) = t(f_i) \cup t(f_j) \cup \cdots \cup t(f_q)$, where
$f_i, f_j, \ldots, f_q \in F_k$. An invalid test set $T(F_k)$ is defined as the set of
tests that may not correctly specify the nature of the system in the
presence of a fault pattern $F_k$. That is, the test outcomes may become
unreliable in the case where a) $t_j$ might pass if $t_j \in t(F_k)$ and
b) $t_j$ might fail if $t_j \notin t(F_k)$. This leads us to the notion of a
valid test set. In the presence of a fault pattern $F_K$, a valid
test is one in which a) $t_j$ always fails if $t_j \in t(F_K)$ and b) $t_j$
always passes if $t_j \notin t(F_K)$. It can be shown that the valid test
set for a fault pattern can be derived as follows:

$$\text{Valid tests} = t(F_K) - T(F_K).$$
Up to this point we have completely described those parts of
the model which are common to both the diagnostic graph and G array
approaches. In the G array, or generalized fault table, the set of
test outcomes, or syndromes, is represented by a $2^n \times p$ matrix with
one row per fault pattern $F_1, F_2, \ldots, F_{2^n}$ and one column per test
$t_1, t_2, \ldots, t_p$. Its entries are

$$G_{jK} = \begin{cases} 0, & t_j \notin t(F_K) \text{ and } t_j \notin T(F_K) \text{ (i.e. always passes)} \\ 1, & t_j \in t(F_K) \text{ and } t_j \notin T(F_K) \text{ (i.e. always fails)} \\ X, & t_j \in T(F_K) \text{ (i.e. don't know)} \end{cases}$$
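The construction of the G array can be sketched in Python as follows. This is a minimal illustration under assumptions that are not in the report: the invalid test set of a pattern is taken to be the union of the tests invalidated by its individual faults, and the two-fault, two-test system, the dictionary layout, and the function name are all hypothetical.

```python
from itertools import combinations


def g_array(faults, tests, complete, invalidates):
    """Map each fault pattern to its row of G entries (0, 1, or 'X').

    complete[f]    : tests that are complete for fault f, i.e. t(f)
    invalidates[f] : tests whose outcome f makes unreliable
    """
    rows = {}
    for k in range(len(faults) + 1):
        for pattern in combinations(faults, k):
            t_f = set().union(*(complete[f] for f in pattern))
            big_t_f = set().union(*(invalidates[f] for f in pattern))
            rows[frozenset(pattern)] = tuple(
                "X" if t in big_t_f else (1 if t in t_f else 0)
                for t in tests
            )
    return rows


# Hypothetical system: t1 is complete for f1, t2 is complete for f2,
# and the presence of f1 invalidates t2.
faults = ["f1", "f2"]
tests = ["t1", "t2"]
complete = {"f1": {"t1"}, "f2": {"t2"}}
invalidates = {"f1": {"t2"}, "f2": set()}

for pattern, row in g_array(faults, tests, complete, invalidates).items():
    print(sorted(pattern), row)
```

In this example the rows for the patterns {f1} and {f1, f2} are identical, (1, X), so the two patterns cannot be told apart from the syndrome alone.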
To aid in visualizing the system, a diagnostic graph can be
drawn. This graph is essentially a digraph in which each allowable
single fault in the system is represented by a vertex of the graph.
A directed edge from $f_i$ to $f_j$ represents the situation in which $f_i$
being faulty invalidates a test that is complete for $f_j$. Thus, we see
that according to this model, the set of complete tests for a fault
is just the set of all incoming edges to it. In the presence of a
fault pattern $F_K$, the set of invalid tests is the set of all
outgoing tests of $f_i \in F_K$. An example of a diagnostic graph of a given
system is shown in Figure 3.5. Notice that this model differs
significantly from the one previously discussed in that more than one
unit may be needed to perform a test. Therefore, it is evident that
this type of model allows a finer partitioning of the system. For
example, both Data Channel (C1) and Memory (M1) act in conjunction
with each other in testing the ROM Control (R1).
This model as proposed is much more efficient than the one
introduced by Preparata in that it accommodates much
more general systems. In a normal design, a fault may invalidate
a test that would be performed. In such cases, the system could be
viewed to be morphic, that is, $T(F_i \cup F_j) = T(F_i) \cup T(F_j)$. The
present model, however, also enables special systems to be
represented. As an example, suppose there exists a design in which two
or more simultaneous faults are necessary in order to invalidate
a test. Such would be the situation in a triple modular redundant
design. These types of systems would be semi-morphic systems, in
which $T(F_i \cup F_j) \supseteq T(F_i) \cup T(F_j)$. Semi-morphic systems could be
modelled as above by making a morphic approximation. Thus, we see
that the diagnostic model covers a rather large class of system
designs.
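The difference between the two cases can be made concrete with a small Python sketch. It is only an illustration: the fault and test names, the `joint` table of fault combinations, and the function names are hypothetical, chosen to mimic a test that is invalidated only when two faults are present together.

```python
def invalid_morphic(pattern, per_fault):
    """T(F) as the union of the tests invalidated by each single fault."""
    out = set()
    for f in pattern:
        out |= per_fault.get(f, set())
    return out


def invalid_semimorphic(pattern, per_fault, joint):
    """T(F) with extra tests invalidated only by certain fault combinations."""
    out = invalid_morphic(pattern, per_fault)
    for group, tests in joint.items():
        if group <= pattern:  # every fault of the combination is present
            out |= tests
    return out


# Hypothetical data: f1 alone invalidates t3; f2 and f3 together
# (but neither alone) invalidate t1.
per_fault = {"f1": {"t3"}, "f2": set(), "f3": set()}
joint = {frozenset({"f2", "f3"}): {"t1"}}

f_i, f_j = {"f2"}, {"f3"}
union_of_parts = (invalid_semimorphic(f_i, per_fault, joint)
                  | invalid_semimorphic(f_j, per_fault, joint))
whole = invalid_semimorphic(f_i | f_j, per_fault, joint)
print(union_of_parts, whole, whole >= union_of_parts)
```

Here the invalid test set of the combined pattern strictly contains the union of the individual invalid test sets, which is the semi-morphic relation; with an empty `joint` table the two sides are equal, which is the morphic case.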
In their first paper, Russell and Kime [3.8] define diagnosability
with repair to be exactly the same as sequential t-fault
diagnosis. In order to derive the conditions for diagnosability,
the concept of a closed fault pattern was studied. A closed fault
pattern $F_k$ is one in which every test for each fault in $F_k$ is
invalidated in the presence of $F_k$. Note that this is analogous to
the classical masking ideas of combinational logic circuits. The
system closure index $C(S)$ is the cardinality of the smallest closed
fault pattern in the system. It is with this index that a necessary
condition is given for the system to be t-fault diagnosable with
repair. It is stated as follows:
C (s) -1, [ —	 + 12
It was also shown that for t = 1, 2, 3 the above is both necessary and sufficient. Now, recall that in Figure 3.4 the system was deemed to be 0-fault diagnosable according to Hakimi and Amin. Since 1-fault diagnosis is equivalent to a system which is 1-fault diagnosable with repair, we can also use Russell and Kime's results to determine the diagnosability of the hypothetical system. Upon examination of Figure 3.4, it is seen that the smallest closed fault pattern is {U1, U2, U3}, and therefore C(S) = 3. Since $C(S) = 3 \geq 2t + 1$, this implies that the system diagnosability number is t = 1. Therefore, it appears we have a contradiction between the two approaches, with Hakimi and Amin's results seeming to be in question.
In their second study, Russell and Kime [3.9] define diagnosability without repair to be equivalent to one-step t-fault diagnosis, or simply t-diagnosability. Fault masking was found to be essential in determining the conditions for t-diagnosability. A fault pattern $F_j$ is said to be masked by $F_K$ if every test for each fault in $F_j$ is invalidated in the presence of $F_K$. This concept is similar to that of faults completely masking others in logic networks. The masking index is the cardinality of the smallest fault pattern $F_K$ that masks $F_j$ and is denoted by $M(F_j)$. Thus, we see that $F_j$ is masked by $F_K$ if and only if $t(F_j) \subseteq T(F_K)$. Also, if a fault in $F_j$ is not masked by $F_K$, it is said to be exposed. The exposure index of a fault pattern [...] the exposure indices of the fault patterns containing k faults is termed the system exposure index of order k, $e_k(S)$. From these definitions, we can find the number of faulty elements in $F_K$ to be $M(F_K) + e(F_K)$.
Necessary and sufficient conditions for a system to be t-fault diagnosable without repair are given as the following:
a) $M(S) \geq t$
b) $C(S) \geq 2t + 1$
c) $e_k(S) \geq 2t + 1 - k$ for $k = t + 1, \ldots, \min(2t - 1, n)$
Thus, with these results one can determine to what degree a system is capable of being diagnosed without repair. A major obstacle in working with these ideas is the amount of bookkeeping involved in determining the system masking, closure and exposure indices.
A final model to consider has been published recently by Barsi [3.1]. The diagnostic model proposed is a slight modification of the one introduced by Preparata, with the aim of producing a more realistic representation of the system. The basic assumptions are the following:
1) each test is performed by a single unit;
2) each unit must be capable of testing any other unit;
3) no unit tests itself; and
4) for any pair $(U_i, U_j)$, unit $U_i$ performs at most one test of unit $U_j$.
The actual difference between this model and that of Preparata is with regard to the set of possible test outcomes. Assuming that a testing link exists from $U_i$ to $U_j$, we have the following set of allowable test outcomes:
$$a_{ij} = \begin{cases} 0, & \text{if } U_i \text{ is fault-free and } U_j \text{ is fault-free} \\ 1, & \text{if } U_i \text{ is fault-free and } U_j \text{ is faulty} \\ 0 \text{ or } 1, & \text{if } U_i \text{ is faulty and } U_j \text{ is fault-free} \\ 1, & \text{if both } U_i \text{ and } U_j \text{ are faulty.} \end{cases}$$
Notice that when both $U_i$ and $U_j$ are faulty, the test outcome is always a "1." The reasoning behind this is that some type of self-checking design is incorporated into the critical parts of the testing devices. Thus, according to this approach, a faulty tested unit always produces a "1" outcome regardless of the state of the tester, so a "0" test outcome guarantees that the tested unit is fault-free.
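The outcome rule of this model can be sketched in a few lines of Python. This is only an illustration: the function name `allowed_outcomes` and the Boolean encoding of the unit states are assumptions made here and are not part of the published model.

```python
def allowed_outcomes(tester_faulty, tested_faulty):
    """Return the set of allowable outcomes a_ij when U_i tests U_j.

    tester_faulty: True if the testing unit U_i is faulty (hypothetical encoding)
    tested_faulty: True if the tested unit U_j is faulty (hypothetical encoding)
    """
    if tested_faulty:
        # A faulty tested unit yields "1" whether or not the tester is faulty.
        return {1}
    if tester_faulty:
        # A faulty tester evaluating a fault-free unit is unreliable.
        return {0, 1}
    # Both units fault-free: the test passes.
    return {0}


def compatible_situations(outcome):
    """List the (tester_faulty, tested_faulty) pairs that can produce `outcome`."""
    return [(a, b) for a in (False, True) for b in (False, True)
            if outcome in allowed_outcomes(a, b)]
```

Calling `compatible_situations(0)` and `compatible_situations(1)` enumerates which fault situations are consistent with each observed outcome.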
Due to this slight difference in the model just discussed, the upper bound on the diagnosability number t is seen to increase. In fact, with a system of n units, the one-step diagnosability of the system is $t \leq n - 2$. Observe that this largely exceeds Preparata's bound of $t \leq (n-1)/2$. Necessary and sufficient conditions were also derived for the system to be one-step t-diagnosable. These are:
a) JB (x)J	 t 4x6 N and b) for each pair (x ,y) with
x E N, y e N, t B (x) J= J B (y) != t and y6 B (x) fl D (x) theme exists at
least one node u such that either u E B (x) - B (y) n B (x) and
B (u) + B (y), 
 or u e B (y) - B (x) (1 B (y) and B (u) ^- B (x) , where B (x) is
the predecessor set of x and D(x) is the successor set of x. Examination of the Dlt design reveals that only (a) of the above conditions determines an optimal interconnection design. In fact, equality in (a) specifies that a design is indeed optimal. A technique for the synthesis of an optimal design was proposed, although it is essentially equivalent to the Dlt design previously discussed.
Barsi also addressed the problem of diagnosability with repair. In doing so, the index $f_1(x)$ is defined to be the minimum number of faulty units that can give rise to a syndrome of all 1's in the system, with the constraint that unit x is faulty. Similarly, the index $f_0(x)$ is defined to be the minimum number of faulty units that can give rise to a system syndrome of all 1's provided that unit x is fault-free. Using these, it is claimed that a system having a strongly connected graph is t-diagnosable with repair if and only if
$$t = \max_{i=1,\ldots,n} [f_0(U_i), f_1(U_i)] - 1.$$
Thus, we see that there exists but one more way of determining the diagnosability of a given system.
In conclusion, various models used in studying diagnosable systems have been presented. It is felt that they adequately represent the research that has been directed in the area, ranging from the basic concepts to the more advanced analysis which represents the current state of the art with respect to diagnosable systems.
With this in mind, we can now proceed to our extensions of this theory to I/T faults.
Number of             One-step Diagnosis            Sequential Diagnosis
Allowable Faults      Units           Links         Units                                  Links
t                     $n \geq 2t+1$   $N \geq nt$   $n \geq 1+(m+1)^2+\lambda(m+1)$        $N \geq n$
1                     3               3             3                                      3
2                     5               10            5                                      5
3                     7               21            7                                      7
4                     9               36            10                                     10
5                     11              55            13                                     13
6                     13              78            17                                     17
7                     15              105           21                                     21
Figure 3.3. A comparison between one-step and sequential diagnosis systems.
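The tabulated values can be regenerated with a short Python sketch. It assumes the decomposition t = 2m + λ with λ equal to 0 or 1, which is not stated in the table itself but is consistent with every row; the function names are hypothetical.

```python
def one_step_requirements(t):
    """Minimum units and links for one-step diagnosis of t faults."""
    units = 2 * t + 1          # n >= 2t + 1
    links = units * t          # N >= n t
    return units, links


def sequential_requirements(t):
    """Minimum units and links for sequential diagnosis of t faults.

    Assumes t = 2m + lam with lam in {0, 1} (an assumption made here).
    """
    m, lam = divmod(t, 2)
    units = 1 + (m + 1) ** 2 + lam * (m + 1)   # n >= 1 + (m+1)^2 + lam (m+1)
    links = units                              # N >= n
    return units, links


if __name__ == "__main__":
    for t in range(1, 8):
        print(t, one_step_requirements(t), sequential_requirements(t))
```

The growing gap between the link counts in the two cases is the point of the comparison: the one-step requirement grows quadratically in t, while the sequential one equals the number of units.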
[Figure: interconnection graph on units U1 to U5]
Figure 3.4. An interconnection scheme where K(G) = 0.
REFERENCES
3.1 F. Barsi, F. Grandoni and P. Maestrini, "A theory of diagnosability of digital systems," IEEE Transactions on Computers, Vol. C-25, No. 6, June 1976, pp 585-593.
3.2 D.B. Burchby and L.W. Kern, "Specification of the fault-tolerant spaceborne computer (FTSC)," Proceedings of the Fault Tolerant Symposium, IEEE Cat. No. 76 CH 1094-2C, June 1976, pp 129-133.
3.3 A.M. Corluhan and S.L. Hakimi, "On an algorithm for identifying faults in a T-diagnosable system," Proceedings of the 1976 Conference on Information Sciences and Systems, The Johns Hopkins University, 1976, pp 370-375.
3.4 S.L. Hakimi and A.T. Amin, "Characterization of the connection assignment of diagnosable systems," IEEE Transactions on Computers, January 1974, pp 86-88.
3.5 T. Kameda, S. Toida and F.J. Allan, "A diagnosing algorithm for networks," Information and Control, Vol. 29, 1975, pp 141-148.
3.6 G.G.L. Meyer and G.M. Masson, "An efficient fault diagnosis algorithm for multiple processor architectures," to appear, IEEE Transactions on Computers.
3.7 F.P. Preparata, G. Metze and R.T. Chien, "On the connection assignment problem of diagnosable systems," IEEE Transactions on Computers, Vol. EC-16, No. 6, December 1967, pp 848-854.
3.8 J.D. Russell and C.R. Kime, "System fault diagnosis: Closure and diagnosability with repair," IEEE Transactions on Computers, Vol. C-24, No. 11, November 1975, pp 1078-1088.
3.9 J.D. Russell and C.R. Kime, "System fault diagnosis: Masking, exposure and diagnosability without repair," IEEE Transactions on Computers, Vol. C-24, No. 12, December 1975, pp 1155-1161.
3.10 J.J. Stiffler, "Architectural design for near-100% fault coverage," Proceedings of the Fault Tolerant Symposium, IEEE Cat. No. 76 CH 1494-2C, June 1976, pp 134-137.
3.11 S. Toida, "System diagnosis and redundant tests," IEEE Transactions on Computers, November 1976, pp 1167-1170.
I/T Testing and Diagnosis
4.0 Introduction
Previous work on intermittent fault detection has been done by Breuer [3-1] and Kamal and Page [3-2]. Breuer assumed that the statistics of the intermittent fault can be modeled by a two-state first-order Markov process. State FP corresponds to the fault being present at time $t_g$ and state FN corresponds to the fault not being present at $t_g$. The transition probabilities for going from one state at $t_g$ to either state FP or FN at $t_{g+1}$ are assumed known. From this, the steady state probabilities associated with states FP and FN at any time $t_g$ can be determined as a function of these probabilities at some initial time $t_0$. Let T be a collection of tests $X_1, X_2, \ldots, X_K$ where each $X_k$, $k = 1, 2, \ldots, K$, is a single test pattern for an intermittent fault in a combinational circuit. T will detect the presence of the intermittent fault under test if the fault is present when at least one of the $X_k$ is applied. The probability that T will detect the presence of the intermittent fault is a function of $D_k$, the time between the application of tests, and K, the number of tests applied.
The model used by Kamal and Page is a special case of the above one. In this case, the transition probabilities between the two states are assumed to be equal. Thus, the first-order Markov process reduces to a zero-order process. It is assumed that $p(w_i)$, the prior probability that the circuit is in $w_i$ [...]
4.1 Combinational circuits:-
A transient fault is intermittent if it occurs repeatedly. If a transient fault is not intermittent, it would be very difficult to test for it in a combinational network. This is because the combinational network would behave as if it is fault free after the transient has disappeared. If the transient does not occur repeatedly we might never catch it at all. Therefore, intermittent/transient faults will be considered here. It is necessary to characterize intermittent faults.
4.1.1 Model of Intermittent Faults:
Arrival:- We will assume that intermittent faults arrive in a random manner. The interarrival times of faults will be assumed to be independent and random with a known probability density function.
Duration:- After a fault arrives, it persists for a certain time, which is its duration. We will assume that the intermittent fault has a duration which is random with a known probability density function. We will also assume that the duration is independent of the arrival.
Depending on the nature of the fault, different density functions might be used to model the random nature of the interarrival times and the duration.
(i) If an assumption that short interarrival times (or durations) are more likely than longer ones is used, we arrive at an exponential or hyperexponential density as an approximation.
(ii) If an assumption that there is a definite mean interarrival time (or duration) with an associated spread is made, we would get a gamma, normal, Rayleigh, Erlang or Weibull approximation. Naturally, the chosen density function should have a value of zero for negative values of the argument (time, in this case).
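To make the arrival/duration model concrete, the following Python sketch generates a fault timeline by alternating a fault-free gap and a fault duration. It is only an illustration under stated assumptions: gaps and durations are drawn from Erlang densities (exponential when the phase count is 1), the gap is measured from the end of the previous fault, and the names `simulate_fault_intervals` and `fraction_faulty` are hypothetical.

```python
import random


def simulate_fault_intervals(lam, mu, k_arrival=1, k_duration=1,
                             horizon=1000.0, seed=0):
    """Return a list of (start, end) intervals during which the fault is present.

    lam, mu     : per-phase rates of the gap and duration densities (assumed)
    k_arrival   : number of Erlang phases of the fault-free gap
    k_duration  : number of Erlang phases of the fault duration
    """
    rng = random.Random(seed)
    intervals = []
    t = 0.0
    while True:
        # Fault-free gap: Erlang(k_arrival) with per-phase rate lam.
        t += rng.gammavariate(k_arrival, 1.0 / lam)
        if t >= horizon:
            break
        # Fault duration: Erlang(k_duration) with per-phase rate mu,
        # drawn independently of the gap that preceded it.
        duration = rng.gammavariate(k_duration, 1.0 / mu)
        intervals.append((t, min(t + duration, horizon)))
        t += duration
    return intervals


def fraction_faulty(intervals, horizon):
    """Fraction of the observation window during which the fault is present."""
    return sum(end - start for start, end in intervals) / horizon
```

For exponential gaps and durations the long-run fraction of time the fault is present should approach (1/mu) / (1/lam + 1/mu), which gives a quick sanity check on a simulated timeline.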
4.1.2 Fault Detection:
$t_f$: time fault arrives
$t_{si}$: time test (set) is applied
d: duration of the test (set)
It will be assumed that a fault arrival can be detected by a test only if the effect of the fault arrival is present for the complete duration of the test, since otherwise the output will change when observing for the presence of the fault, leading to uncertainty. When a test set is applied, a fault arrival can be detected only if it persists for the entire duration of the particular test(s) which test for it. Since tests in a test set can be applied in any order, it is conservative to assume that a fault arrival is not detected by a test set if it does not persist for the entire duration of the test set. Therefore, a fault arriving at $t_f$ can be detected by a test set applied at $t_{si}$ only if the duration of the fault, $t_d$, is such that
$$t_d > t_{si} + d - t_f$$
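The detection condition can be written as a one-line predicate. The sketch below is illustrative only; the function name is hypothetical, and the additional check that the fault arrives no later than the start of the test set is the requirement implied by demanding that the fault be present for the complete test set.

```python
def detectable(t_f, t_d, t_s, d):
    """True if a fault arriving at t_f with duration t_d can be detected
    by a test set applied at t_s and lasting d time units."""
    arrives_before_test = t_f <= t_s
    persists_through_test = t_d > t_s + d - t_f
    return arrives_before_test and persists_through_test
```

For example, a fault arriving at time 0 with duration 5 is detectable by a test set applied at time 2 with duration 2, whereas the same fault with duration 4 is not, because the inequality is strict.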
Due to the non-permanent nature of the faults, it is necessary to apply the test set repeatedly. If the network gives an incorrect output under any test set application, the testing can be stopped because the presence of a fault is indicated. But it is necessary to have a decision rule which will permit us to stop further applications of the test set at some stage when the network responds correctly to all the past applications of the test set. The decision rule which will be used here is: The conditional probability of not detecting a fault given that the network has an intermittent fault is close to 0. That is,
$$P(\text{fault not detected} \mid \text{network is faulty}) < \varepsilon$$
Depending on the level of confidence required about the fault free condition of the network (if it responds correctly to all the test set applications), $\varepsilon$ can be chosen to be as close to 0 as needed.
4.1.3 Sequential Analysis
Let $T_1, \ldots, T_k$ be k applications of the test set, applied at times $t_{s1} < t_{s2} < \ldots < t_{sk}$ respectively. All the probabilities mentioned below are conditional probabilities given that the network has an intermittent fault:
$P(T_i)$ = Probability that $T_i$ detects the presence of a fault
= Probability that there is a fault arrival at $t_f$ before $t_{si}$ such that its duration is $> t_{si} + d - t_f$
$P(\bar{T}_i)$ = Probability that $T_i$ does not detect the presence of a fault
= $1 - P(T_i)$
$P(\bar{T}_i \cap \bar{T}_{i-1} \cap \ldots \cap \bar{T}_1)$ = Probability that none of the i applications of the test set detect a fault.
All the fault arrivals which can be detected by $T_2$ can be classified into two groups.
i) Arrivals which can be detected by both $T_2$ and $T_1$.
ii) Arrivals which can be detected by $T_2$ and not $T_1$.
Given that $T_1$ has not detected the presence of any fault, the probability that $T_2$ detects a fault is just the probability that there is an arrival belonging to group ii). Hence, $P(T_2 \mid \bar{T}_1) = P(T_2) - P(T_2 * T_1)$, where $T_2 * T_1$ is the event that there is a fault arrival which can be detected by both $T_2$ and $T_1$.
$$P(\bar{T}_2 \mid \bar{T}_1) = 1 - P(T_2 \mid \bar{T}_1) = 1 - P(T_2) + P(T_2 * T_1)$$
We need to compute $P(\bar{T}_3 \mid \bar{T}_2 \cap \bar{T}_1)$. Given that neither $T_1$ nor $T_2$ has detected the presence of a fault, the probability that $T_3$ detects a fault is given by
$$P(T_3 \mid \bar{T}_2 \cap \bar{T}_1) = P(T_3) - P((T_3 * T_2) \cup (T_3 * T_1))$$
But the event $T_3 * T_1$ is a subset of the event $T_3 * T_2$. Hence,
$$P(T_3 \mid \bar{T}_2 \cap \bar{T}_1) = P(T_3) - P(T_3 * T_2)$$
$$P(\bar{T}_3 \mid \bar{T}_2 \cap \bar{T}_1) = 1 - P(T_3) + P(T_3 * T_2)$$
Similarly, for any $T_i$, $i > 1$,
$$P(\bar{T}_i \mid \bar{T}_{i-1} \cap \ldots \cap \bar{T}_1) = 1 - P(T_i) + P(T_i * T_{i-1})$$
The decision rule employed for the termination of the test set applications requires the computation of the probability $P(\bar{T}_k \cap \ldots \cap \bar{T}_1)$, which can be done as follows,
$$P(\bar{T}_1) = 1 - P(T_1)$$
$$P(\bar{T}_j \cap \bar{T}_{j-1} \cap \ldots \cap \bar{T}_1) = (1 - P(T_j) + P(T_j * T_{j-1}))\, P(\bar{T}_{j-1} \cap \ldots \cap \bar{T}_1)$$
As can be seen, this requires the computation of $P(T_j)$ for $1 \leq j \leq k$ and $P(T_j * T_{j-1})$ for $1 < j \leq k$.
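The recursion for the probability that none of the k applications detects a fault can be sketched in Python as follows. The function and argument names are hypothetical; the two input lists stand for the probabilities $P(T_j)$ and $P(T_j * T_{j-1})$, which are assumed to have been computed separately.

```python
def prob_none_detect(p_detect, p_overlap):
    """Probability that no application T_1..T_k detects a fault.

    p_detect[j]  : P(T_{j+1}), j = 0..k-1
    p_overlap[j] : P(T_{j+2} * T_{j+1}), j = 0..k-2, i.e. the probability of a
                   fault arrival detectable by two successive applications
    """
    if len(p_overlap) != len(p_detect) - 1:
        raise ValueError("need one overlap probability per successive pair")
    prob = 1.0 - p_detect[0]                            # P(not T_1)
    for j in range(1, len(p_detect)):
        prob *= 1.0 - p_detect[j] + p_overlap[j - 1]    # conditional miss factor
    return prob
```

With three applications, each detecting with probability 0.2 and with overlap probability 0.05 between successive applications, the result is 0.8 x 0.85 x 0.85 = 0.578.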
4.1.4 Queueing Theory Applications:-
The intermittent fault model assumed can be noted for its similarity to the models used in Queueing theory. Hence, the probabilities which need to be computed can be obtained using results from Queueing theory.
The fault system, where $F_i$ is the intermittent fault (say on the $i$-th line in a network), has fault arrivals and duration. Naturally, when a fault arrives, there cannot be another new arrival until its duration ends. Therefore, at any time, the arrival of a new fault depends on the previous arrival. But once a fault arrives, its duration is independent of previous arrivals and previous fault durations.
A queueing system is characterized by the following three factors:
1. The customer arrivals
2. The service time of customers
3. The service system
The customer arrivals and service times are expressed as statistical distributions. The service system can be described by the number of servers in the system and the queue discipline.
There is a one-to-one correspondence between the parameters of the fault model assumed and those associated with a queueing system. Thus, the arrival of faults corresponds to the customer arrivals, the duration of faults to the service time of customers, and the service system corresponding to the fault model is a single server system with no waiting permitted.
One implicit assumption made is that the arrival of faults begins before testing is started. However, the exact time of the beginning of the arrival process cannot be known. Hence, it will be assumed that the arrivals begin long before testing is started. This will permit us to use the steady state results of queueing theory (which are time independent and simple) rather than the transient ones (which are time dependent and hence complex).
The Erlang k-distribution will be used to approximate the statistical nature of the interarrival time and the fault duration. If k equals one, this reduces to the exponential distribution. When the Erlang k-distribution is used, the mean and standard deviation can be matched to those of the practical problem and yet some of the properties of the negative exponential distribution are maintained. The arrival (or duration) is divided into a fixed number of independent, identical (hypothetical) "phases", each phase having a negative exponential distribution. Each time the last phase ends, an arrival takes place or the duration ends, as the case may be. The Erlang k-distribution is,
$$f(t) = \frac{\lambda^k t^{k-1} e^{-\lambda t}}{(k-1)!}$$
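As a small check on the density, the following Python sketch evaluates an Erlang k-density with per-phase rate $\lambda$, taken here as $f(t) = \lambda^k t^{k-1} e^{-\lambda t} / (k-1)!$. The per-phase parameterization and the function name are assumptions of this sketch.

```python
import math


def erlang_pdf(t, k, lam):
    """Erlang density with k exponential phases, each of rate lam (assumed form)."""
    if t < 0:
        return 0.0   # density is zero for negative time
    return lam ** k * t ** (k - 1) * math.exp(-lam * t) / math.factorial(k - 1)
```

Setting k = 1 gives `lam * exp(-lam * t)`, the exponential density, which is the reduction mentioned in the text. Under this parameterization the mean is k / lam.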
With the above assumptions, some correspondence relations can be stated. Using these relations, probabilities required for fault analysis can be obtained from the equivalent Queueing model, which is a single server system with no waiting allowed. $F_i$ is the type of intermittent fault present, with arrivals and duration.
(i) P(effect of a fault arrival is present at time t) $\triangleq P(f_i) \Leftrightarrow$ P(a customer is in the service facility at time t)
(ii) P(effect of a fault arrival is present from time t to at least t+d) $\triangleq P(f_i \cap r_{f_i} > d) \Leftrightarrow$ P(a customer is in the service facility at time t and will require at least d more units of service time)
(iii) P(effect of any fault arrival is not present at time t) $\triangleq P(\bar{f}_i) \Leftrightarrow$ P(there is no customer in the service facility at time t)
(iv) P(effect of any fault arrival is not present from time t to at least t+s) $\triangleq P(\bar{f}_i \cap r_{\bar{f}_i} > s) \Leftrightarrow$ P(no customer is in the service facility at time t and no new customer will enter the service facility at least till time t+s)
4.1.5 Single Fault Detection Procedure:-
Let the set of faults in a circuit be $F = \{f_1, f_2, \ldots, f_n\}$, where each $f_i$ is a single fault. A fault event $E_f$ occurs when a single fault $f_i$ from F occurs. When the assumption that only a single fault can occur is made, the n faults in F are not independent occurrences. Therefore, the arrivals and duration considered will be those of $E_f$.
We will assume that each test set application $T_j$ has a duration d. It will also be assumed that the elapsed time between successive test set applications $T_{j-1}$ and $T_j$ is a constant, L. A test set application $T_j$ applied at $t_{sj}$ will detect the presence of a fault only if the effect of a fault arrival is present from $t_{sj}$ to at least $t_{sj} + d$. Therefore,
$$P(T_j) = P(f_i \cap r_{f_i} > d) \qquad \text{(i)}$$
A fault arrival which can be detected by $T_{j-1}$ can also be detected by $T_j$ only if the arrival occurs before $t_{s,j-1}$ and is such that the effect of the fault arrival is present from $t_{s,j-1}$ to at least $t_{s,j-1} + d + L$. Therefore,
$$P(T_j * T_{j-1}) = P(f_i \cap r_{f_i} > L + d) \qquad \text{(ii)}$$
The above probabilities are independent of $t_{sj}$ because we use the steady state results.
Hence, given k applications of the test set, the probability that none of them detects a fault is,
$$P(\bar{T}_k \cap \ldots \cap \bar{T}_1) = [1 - P(f_i \cap r_{f_i} > d)] \prod_{j=2}^{k} \left(1 - P(f_i \cap r_{f_i} > d) + P(f_i \cap r_{f_i} > L + d)\right)$$
Note that the above probability is the conditional probability given that the network has an intermittent fault. This is because the model we have assumed guarantees at least one fault arrival.
Now we consider 4 cases of arrival and duration.
(i) Exponential interarrival time: $f(t_a) = \lambda e^{-\lambda t_a}$
Exponential duration: $f(t_d) = \mu e^{-\mu t_d}$
$$P(f) = \frac{\lambda}{\lambda + \mu}$$
$$P(\bar{f}) = \frac{\mu}{\lambda + \mu}$$
$$P(f \cap r_f > d) = e^{-\mu d}\, \frac{\lambda}{\lambda + \mu}$$
$$P(\bar{f} \cap r_{\bar{f}} > s) = e^{-\lambda s}\, \frac{\mu}{\lambda + \mu}$$
Hence,
$$P(\bar{T}_k \cap \bar{T}_{k-1} \cap \ldots \cap \bar{T}_1) = \left(1 - \frac{\lambda}{\lambda + \mu} e^{-\mu d}\right) \left(1 - \frac{\lambda}{\lambda + \mu} e^{-\mu d} (1 - e^{-\mu L})\right)^{k-1}$$
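For the exponential/exponential case, the decision rule of Section 4.1.2 can be turned into a small calculation of how many test set applications are needed. The Python sketch below assumes $P(f \cap r_f > d) = \frac{\lambda}{\lambda+\mu} e^{-\mu d}$ and $P(f \cap r_f > L+d) = \frac{\lambda}{\lambda+\mu} e^{-\mu (L+d)}$; the function names, the search loop and the cap `k_max` are hypothetical.

```python
import math


def miss_probability(lam, mu, d, L, k):
    """P(no detection in k applications | network has an intermittent fault)
    for exponential interarrival (rate lam) and exponential duration (rate mu)."""
    p_detect = lam / (lam + mu) * math.exp(-mu * d)          # P(f and r_f > d)
    p_overlap = lam / (lam + mu) * math.exp(-mu * (L + d))   # P(f and r_f > L + d)
    return (1.0 - p_detect) * (1.0 - p_detect + p_overlap) ** (k - 1)


def applications_needed(lam, mu, d, L, eps, k_max=10 ** 6):
    """Smallest k with miss probability below eps, or None if k_max is reached."""
    for k in range(1, k_max + 1):
        if miss_probability(lam, mu, d, L, k) < eps:
            return k
    return None
```

When L is large compared with the mean fault duration, the overlap term vanishes and the miss probability decays geometrically with ratio `1 - p_detect`, so the required k grows like log(eps) / log(1 - p_detect).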
	(ii) Exponential . interarrival time':-- f(t 
a	
?'-jI ta
k-Erlang duration
I
 (k=m)	 f (t	 I'm td-1 
td
d
mI? (f) _ I.. P (n stages of duration remaining)	 X+A
n=1
P (f)
M.
P(Enrf>d)	 X P(n stages of duration remaining
n=1	 and remaining duration is >d)
M.	
X	
dil, m--tl	
S
I M
	
e.	 I	 (lid)
n=1	 S=O	 S!
-
P(fnr-f>S)	 ens P
is
Hence,
P (Tk,nTk_ 3_ . . . nT
m	 ^Iidm-1	 S	 M	 -Pd
I A	 X	 (pd)	 x
m (A +P)	
S!	
X M. (A+11)
n=1
	
n-- 1
M-1 ( S	 Slid)	 Oid+4L -jiL- _	
I e
S=O	 S!	 S!
t9-1 eAta
	
(iii) k-Erlang interarraval time M=9) , f(t a	 a
Exponential duration:	 f(td) pe-4td
A
P (f) =	 P (n stages of arrival remaining and fault in
in duration.) _ x 1 - (1+AQ) -
P (f) = li. +	 AQ
P (fnrf>d) = eua h (1-- (^- AQ) ~^)
u
P (fnrf>S) _	 P (n stages of arrival remaining with no fault
r	 n=1
in duratiofi and remaining stages need more
than S units of time)
(I+eAS (AS) q
n=1	 q-p	 qi
Hence,
P(Tkn ... i ,)
_ 	 -Q	 k-1
1-eud A 3,- (1+u
IT
 ) -t]) ( ,_;pa A (,_(,+gy) ) (l
-euL
A t -1 ^Ata(iv), k--Erlang interarrival time (k=Q) , f (ta ) - a
t0
-k-Erlang duration (k=m) , f (td) 	 pmt d m-]. Atde
(M-1)! i
First, it is necessary to solve for the following ML+z
simultaneous linear algebraic equations.
AP(r,-;0) W AP(r+l,-,4) + uP(ryl;l)
4' 15
-IL J
.(X+li)P(r,n;l) 	 XP(r+l,n;l) + IjP (r, n+1; 1)
r--1,2,...,9.-1
(?L+-p) P (r, m; 1) 	 XP(r+l,m;l)
(X+IA) P (2., n; 1) 	 pP(k,n+l;l)
n=1,2,.
(?L.+14) P (k,M, 1) T )LP (1,-; 0)
M
	
P(r,n,j)	 where
P(i,j,l)	 Probabilitythat there are L i stages of arrival
remaining and j stages of duration remaining and a
fault in duration.
A
*P(i,-,O)	 Probability that there are i stages of arrival
remainingM 4 and no faiilt in duration.
The required probabilities can be obtained as follows.
M
P M	
I	
1,..P(qn;l)
n--1 q--1
L^
li
P (f nrf >d)
k M	
--ud)X	 I P (q, n; 1) end m I
1 J
. S! S
,,-1 n=l	 S=O
1j
•
.	 •.	 ea+r _.f'awwT.R e.^:rrss.!..v..rre...—^—
I
q=1
P (fnrf>S) =	 Q-1
—JAS
	 (AS) c
q_1	 c=0	 c:
Hence,
P(Tk n ..... nTl)
Q	 zn	 -pd m-1
q31 n=1	 S=O	 S3
€ z	 m	 -pd m 1S	 S	 -uL k-1 a(j d)	 (tid IL)	 e
P (q p n 1)	
-
;Si.
q=1 n-1	 S=0
The probability that a fault is detected is
$$P_k = 1 - P(\bar{T}_k \cap \ldots \cap \bar{T}_1)$$
In practice, a fault can be detected if it exists for the entire duration of the particular test(s) which test for it. Since we required the fault arrival to affect the entire duration of the test set, the actual probability of fault detection will be greater than the $P_k$ obtained above. But the above procedure can be easily extended to give more accurate results.
Let the test set contain g tests. Each test has a duration of d/g, where d is the test set duration. Let the set of possible faults be $F = \{f_1, f_2, \ldots, f_n\}$. Let $P_{f_i}$ be the conditional probability that fault $f_i$ has occurred given that $E_f$ has occurred. Naturally,
$$\sum_{i=1}^{n} P_{f_i} = 1$$
It is necessary that $P_{f_i}$ be known for each i. By considering those tests which test for $f_i$, the probability $P(i)$ that $f_i$ is detected can be computed using the results developed above by substituting tests in place of test set everywhere. Then,
$$P_k = \sum_{i=1}^{n} P(i)\, P_{f_i}$$
This is the actual probability of detection of a fault.
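The weighting over individual faults is a plain expectation and can be sketched as follows. The function name and the tolerance used to check that the fault probabilities sum to one are assumptions of this sketch; the per-fault detection probabilities $P(i)$ are taken as given.

```python
def overall_detection_probability(p_fault, p_detect):
    """Combine per-fault detection probabilities P(i) with the conditional
    fault probabilities P_fi (which must sum to 1)."""
    if len(p_fault) != len(p_detect):
        raise ValueError("one detection probability is needed per fault")
    if abs(sum(p_fault) - 1.0) > 1e-9:
        raise ValueError("conditional fault probabilities must sum to 1")
    return sum(pf * pd for pf, pd in zip(p_fault, p_detect))
```

For three faults with conditional probabilities 0.5, 0.3 and 0.2 and detection probabilities 0.9, 0.8 and 0.5, the overall detection probability is 0.45 + 0.24 + 0.10 = 0.79.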
4.1.6 Multiple Fault Detection Procedure
Let the number of lines which can be faulty be n. All these lines are assumed to have identical fault characteristics. It is also assumed that the probability that b lines are faulty is the same as the probability that a single line is faulty. Due to the large density of circuits in present day IC's, this seems a reasonable assumption. We also assume that an intermittent fault on a line is either of the s-a-1 or s-a-0 type but not both. The arrival and duration of the intermittent faults corresponding to each of the n lines are similar to the corresponding ones of $E_f$ as used in the single fault case.
Here, it will be assumed that a multiple fault can be detected only if it is present for the entire duration of the test set (MFDTS). This is a reasonable assumption because, if any component fault in the multiple fault is present only for part of the duration of the test set, neither of the two different fault situations would be detected if the particular tests which test for them are not applied during their presence. All the assumptions regarding the test set applications will be the same as before. The probability that a given line in a network is faulty is $P_f$.
The probability that the given network is faulty is,
$$P_F = \sum_{b=1}^{n} \binom{n}{b} P_f = (2^n - 1) P_f$$
The probability that exactly b of the n wires are faulty is,
$$P_b = \binom{n}{b} P_f$$
The conditional probability that exactly b wires are faulty given that the network is faulty is,
$$P_{cb} = \frac{\binom{n}{b} P_f}{(2^n - 1) P_f} = \frac{\binom{n}{b}}{2^n - 1}$$
Clearly,
$$\sum_{b=1}^{n} P_{cb} = 1$$
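The conditional distribution of the fault multiplicity can be checked numerically. The sketch below assumes $P_{cb} = \binom{n}{b} / (2^n - 1)$, which follows from giving every non-empty set of faulty lines the same probability; the function name is hypothetical.

```python
from math import comb


def conditional_multiplicity(n):
    """Return {b: P_cb} for b = 1..n, the probability that exactly b of the
    n lines are faulty given that the network is faulty."""
    total = 2 ** n - 1                      # number of non-empty subsets of lines
    return {b: comb(n, b) / total for b in range(1, n + 1)}
```

For n = 3 this gives 3/7, 3/7 and 1/7 for b = 1, 2 and 3, which sum to one as required.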
Given that a b-wire multiple fault has occurred, each of the component wires of the multiple fault has fault arrivals with duration. Let the faults corresponding to the b wires be labeled $f_1, \ldots, f_b$. A multiple fault involving $f_1$ can be detected by a test set application if the effect of a fault arrival of $f_1$ exists for the complete duration of the test set, d, and none of the other faults has an arrival whose effect exists for only part of the duration d.
$$P(\text{multiple fault involving } f_1 \text{ is detected}) = P(f \cap r_f > d)\,\left(P(f \cap r_f > d) + P(\bar{f} \cap r_{\bar{f}} > d)\right)^{b-1}$$
P(multiple fault involving $f_2$ but not $f_1$ is detected)
= P($f_1$ has no fault arrival with an effect on the test set, $f_2$ has an arrival with effect for the complete duration d, and none of the other faults has an arrival with an effect for only part of the duration d)
$$= P(\bar{f} \cap r_{\bar{f}} > d)\, P(f \cap r_f > d)\,\left(P(f \cap r_f > d) + P(\bar{f} \cap r_{\bar{f}} > d)\right)^{b-2}$$
The probability that the b-wire multiple fault is detected by a test is,
$$S_b(d) = P(f \cap r_f > d) \sum_{i=1}^{b} \left(P(\bar{f} \cap r_{\bar{f}} > d)\right)^{i-1} \left(P(f \cap r_f > d) + P(\bar{f} \cap r_{\bar{f}} > d)\right)^{b-i}$$
Therefore, the conditional probability that a test set application will detect a multiple fault given that it has occurred is,
$$P(T_j) = \sum_{b=1}^{n} P_{cb}\, S_b(d) = \sum_{b=1}^{n} \frac{\binom{n}{b}}{2^n - 1}\, P(f \cap r_f > d) \sum_{i=1}^{b} \left(P(\bar{f} \cap r_{\bar{f}} > d)\right)^{i-1} \left(P(f \cap r_f > d) + P(\bar{f} \cap r_{\bar{f}} > d)\right)^{b-i}$$
A multiple fault arrival which can be detected by $T_{j-1}$ can also be detected by $T_j$ only if all the component fault arrivals occur before $t_{s,j-1}$ and are such that their effects are present from $t_{s,j-1}$ to at least $t_{s,j-1} + d + T$, where T denotes the elapsed time between successive test set applications (written L in the single fault case). This probability is $S_b(T+d)$, where we substitute T+d for d. Hence,
$$P(T_j * T_{j-1}) = \sum_{b=1}^{n} P_{cb}\, S_b(T+d) = \sum_{b=1}^{n} \frac{\binom{n}{b}}{2^n - 1}\, P(f \cap r_f > T+d) \sum_{i=1}^{b} \left(P(\bar{f} \cap r_{\bar{f}} > T+d)\right)^{i-1} \left(P(f \cap r_f > T+d) + P(\bar{f} \cap r_{\bar{f}} > T+d)\right)^{b-i}$$
Hence, given k applications of the test set, the probability that none of them detects a fault is,
$$P(\bar{T}_k \cap \ldots \cap \bar{T}_1) = \left(1 - \sum_{b=1}^{n} \frac{\binom{n}{b}}{2^n - 1} S_b(d)\right) \left(1 - \sum_{b=1}^{n} \frac{\binom{n}{b}}{2^n - 1} \left(S_b(d) - S_b(T+d)\right)\right)^{k-1}$$
Now we consider one of the 4 cases mentioned before.
Exponential interarrival time: $f(t_a) = \lambda e^{-\lambda t_a}$
Exponential duration: $f(t_d) = \mu e^{-\mu t_d}$
In this case $P(\bar{T}_k \cap \bar{T}_{k-1} \cap \ldots \cap \bar{T}_1)$ is given by the expression above with
$$S_b(x) = \frac{\lambda e^{-\mu x}}{\lambda + \mu} \sum_{i=1}^{b} \left(\frac{\mu e^{-\lambda x}}{\lambda + \mu}\right)^{i-1} \left(\frac{\mu e^{-\lambda x}}{\lambda + \mu} + \frac{\lambda e^{-\mu x}}{\lambda + \mu}\right)^{b-i}$$
evaluated at x = d and x = T + d.
4.1.7 Conclusion
As can be seen, the expression gets complex. However, if the actual values of the various parameters are substituted, the computation can be performed in a systematic manner, very easily on a computer. The various parameters can be estimated using methods employed in Queueing systems.
We have obtained expressions to determine the number of times a combinational circuit has to be tested when checking for the presence of single or multiple faults of the intermittent/transient type. Though dependent on the model used, since we have used quite a general model, these results should be useful.
4.2 Sequential Machines:-
A sequential machine can be tested by applying an appropriate sequence of input signals, termed a checking sequence, to the circuit and observing the output sequence that the circuit produces in response [3-3]. The checking sequence determines whether or not the sequential machine is operating in accordance with the given state table description rather than testing for specific hardware failures. Hence it is difficult to find the precise relationship between the presence of an I/T fault and its effect on the output sequence. Though the output of the sequential machine may be correct during the presence of the fault, it could be incorrect at a later stage. Therefore, it is convenient to model the faulty sequential machine as a probabilistic sequential machine.
4.2.1 The model:-
If the statistics of the I/T fault are known, the probability that the effect of the I/T fault is present at a given time can be calculated. A particular fault will affect the next state and output functions in a particular way. By knowing the exact way in which each fault will affect the next state and output functions, along with the relative probabilities of occurrence of these faults and their statistics, the faulty sequential machine can be modeled as a probabilistic sequential machine. Instead of the exact model, it is possible to set up an approximate model, with relative ease, by assuming that every possible combination of incorrect next state and/or output is equally likely when the effect of the I/T fault is present. In either case, we arrive at a probabilistic sequential machine model of the faulty machine.
4.2.2 Testing:-
The actual application of the checking sequence is preceded by the application of a homing sequence to bring the machine to a fixed starting state. Initially, if we assume that the machine is equally likely to be in any of its states, the probability of the machine being in any final state after the application of the homing sequence can be computed using the transition probabilities and the output response of the machine to the homing sequence.
If the initial state probabilities are known, the probability that the machine's output response to the checking sequence is correct can be easily found [3-4]. This represents the probability that the test does not detect the fault, given that the machine is faulty. Therefore, the probability that n applications of the test fail to detect the fault, given that the machine is faulty, can be calculated.
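One way to carry out this calculation is to propagate the state distribution through the checking sequence, keeping only the transitions whose output agrees with the fault-free response. The Python sketch below is a generic forward computation for a probabilistic sequential machine, not necessarily the method of the cited reference; the data layout and all names are assumptions made here.

```python
def prob_correct_response(initial, trans, inputs, expected_outputs):
    """Probability that the faulty machine produces the expected output sequence.

    initial          : dict state -> probability of starting in that state
    trans            : dict (state, input) -> list of (next_state, output, prob)
    inputs           : the checking sequence
    expected_outputs : the fault-free response to the checking sequence
    """
    dist = dict(initial)
    for x, y in zip(inputs, expected_outputs):
        nxt = {}
        for state, p in dist.items():
            for next_state, out, q in trans[(state, x)]:
                if out == y:   # keep only behavior indistinguishable from fault-free
                    nxt[next_state] = nxt.get(next_state, 0.0) + p * q
        dist = nxt
    return sum(dist.values())


def prob_n_applications_miss(p_single, n):
    """Miss probability after n applications, assuming (as a simplification)
    that each application starts from the same initial distribution and that
    the applications are statistically independent."""
    return p_single ** n
```

As a usage example, a machine that stays in state A and outputs 0 with probability 0.9, and otherwise moves to state B with output 1, produces the expected response 0, 0 to the input sequence 0, 0 with probability 0.81.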
4.2.3 Conclusion:
The testing of sequential machines for I/T faults is
straightforward once an exact model of the faulty machine as a
probabilistic sequential machine is obtained, due to the results
already available in this area. Hence, in this section, we have
just outlined the technique.
4.3 Self diagnosable systems:
Self diagnosing capability is becoming an important requirement
of systems as their complexity increases and greater emphasis is
being placed on their reliability. Design conditions for such
systems where the units are capable of testing each other have been
studied for permanent failures of units. One such system is the
t-fault diagnosable system proposed by Preparata et al. We shall
study the capability of the t-fault diagnosable systems to diagnose
intermittent/transient (I/T) faults in units.
4.3.1 Preliminaries:
We shall assume that a fault free unit correctly evaluates the
tested unit as being faulty or fault free while a faulty unit's
evaluation of the tested unit could be incorrect. Under such
circumstances, the diagnosis of the faulty units is achieved through
the results of the tests performed by the fault free units.
When a unit has an I/T fault, it may have to be tested several
times by a fault free unit before correct evaluation can be performed.
Therefore, we will assume that after every test routine, an updated
syndrome (set of test outcomes) is formed which describes the
evaluation of all the units to date. Anytime the updated syndrome
corresponds to a consistent set of faults, diagnosis can be performed.
Because of the time delay from the initiation of the I/T fault in a
unit to its detection, it is likely that additional units could have
faults initiated in them in the meantime. Therefore, even if
certain units are diagnosed as being faulty, one cannot be absolutely
certain that no more units are faulty. Hence, incomplete diagnosis
is inevitable. We shall designate the I/T fault capability of a
system, t', as the maximum number of units which could be faulty
such that the diagnosis is at worst incomplete but never incorrect,
i.e., a fault free unit is never diagnosed as being faulty.
4.3.2 The Two Partitions
[Fig. 4.3-1: partition of the system into $S_1$, $S_2$ and $R$]
$S_1$, $S_2$ are 2 sets of units, $S_1 \cap S_2 = \emptyset$ and
$|S_1|, |S_2| \le t$, $R$ being the remaining units in the system.
Because the system is t-fault diagnosable, it is not possible that
neither $S_1$ nor $S_2$ receives any testing links from $R$.
Therefore, there are only 2 possibilities:
i) Only one of $S_1$, $S_2$ receives links from $R$. In such a case,
there is a non-zero probability of diagnosing $S_1$ ($S_2$) as being
faulty when in fact $S_2$ ($S_1$) is faulty, if $S_1$ ($S_2$) receives
no testing links from $R$. In Fig. 4.3-1, if $S_2$ is the faulty set
of units, it is possible to obtain an updated syndrome where, due to
insufficient testing, all links from $R$ to $S_2$ are 0-links. If in
addition, all links from $S_2$ to $S_1$ are 1-links and all links
within $S_2$ are 0-links, regardless of the nature of the links from
$S_1$ to $S_2$, this syndrome would be a valid syndrome and would
correspond to a fault pattern where $S_1$ is the faulty set of units.
So we could diagnose $S_1$ as being the faulty units when in fact
$S_2$ is the set of faulty units.
ii) Both $S_1$ and $S_2$ receive links from $R$. In such a case,
there is a zero probability of diagnosing $S_1$ ($S_2$) as being
faulty when in fact $S_2$ ($S_1$) is faulty. This is because a valid
syndrome corresponding to a fault pattern where $S_1$ ($S_2$) is the
faulty set of units would require all the links from $R$ to $S_1$
($S_2$) to be 1-links and this could never happen if $S_2$ ($S_1$)
is indeed the faulty set of units.
[Fig. 4.3-2: partition in which both $S_1$ and $S_2$ receive links from $R$]
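The argument of case (i) can be replayed on a toy system. The sketch below is an illustration only: the function name `is_consistent`, the dictionary encoding of a syndrome and the three-unit system are assumptions, and the toy system is not claimed to be t-fault diagnosable. It checks whether a syndrome is valid for a hypothesized fault set under the stated rule that a fault free tester is always right and a faulty tester may report anything.

```python
from typing import Dict, Set, Tuple

Syndrome = Dict[Tuple[str, str], int]  # (tester, tested) -> 0-link or 1-link


def is_consistent(syndrome: Syndrome, faulty: Set[str]) -> bool:
    """True if every outcome reported by a fault free tester matches `faulty`."""
    for (tester, tested), outcome in syndrome.items():
        if tester in faulty:
            continue  # a faulty unit's evaluation carries no information
        if outcome != (1 if tested in faulty else 0):
            return False
    return True


# Hypothetical units: r plays the role of R, a of S1, b of S2.
# r tests b but not a, so S1 receives no testing links from R.
# Unit b really has an I/T fault that was inactive while r tested it.
updated_syndrome = {
    ("r", "b"): 0,  # 0-link from R to S2 caused by insufficient testing
    ("b", "a"): 1,  # the faulty unit accuses the fault free unit
    ("a", "b"): 1,  # outcome of the link from S1 to S2 is irrelevant here
}

wrong_diagnosis_is_valid = is_consistent(updated_syndrome, {"a"})
true_fault_set_is_valid = is_consistent(updated_syndrome, {"b"})
```

With this syndrome the fault pattern {a} is valid and the true pattern {b} is not, so a fault free unit would be diagnosed as faulty, which is exactly the situation that case (ii) rules out.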
4.3.3 I/T Fault Diagnosability
We can describe the I/T fault diagnosing capability of a
t-fault diagnosable system by an index t'. If the number of faulty
units does not exceed t', there is a zero probability of diagnosing
a fault free unit as faulty. This requires that given any 2 sets
of units $S_1$ and $S_2$, $|S_1|, |S_2| \le t'$, $S_1 \cap S_2 = \emptyset$,
both $S_1$ and $S_2$ receive at least one link from $R$. This
guarantees that even if $S_1$, $S_2$ were 2 sets of units such that
$|S_1|, |S_2| \le t'$ and $S_1 \cap S_2 \ne \emptyset$,
$S_1 \cup S_2 \ne S_1$, $S_1 \cup S_2 \ne S_2$, there is a zero
probability of diagnosing a fault free unit as faulty. The reasoning
is as follows.
Because $(S_1 - S_1 \cap S_2)$ and $S_2$ are 2 disjoint sets of units
with cardinality $\le t'$, $S_1 - S_1 \cap S_2$ receives at least one
link from $R$. Therefore, if $S_1$ ($S_2$) is the faulty set of units,
there is a zero probability of diagnosing $S_2$ ($S_1$) as being
faulty because the links from $R$ to $S_2 - S_1 \cap S_2$
($S_1 - S_1 \cap S_2$) would never be 1-links.
Therefore, when a set of units is diagnosed as faulty, there is
a 100% probability that all those units are faulty if the number
of faulty units is $\le t'$. Hence the diagnosis is incomplete at
worst but never incorrect as far as the faulty units are concerned.
If a set of units $S_1$ is faulty, it is always possible to diagnose
only a proper subset of $S_1$ as being faulty.
[Fig. 4.3-3: $S_1$ divided into the proper subsets $S_{11}$ and $S_{12}$]
In Fig. 4.3-3, $S_{11}$ and $S_{12}$ are proper subsets of $S_1$ such
that $S_{11} \cup S_{12} = S_1$. When $S_1$ is faulty, due to
insufficient testing, it is possible that all the links from $R$ to
$S_{12}$ are 0-links, while all those from $R$ to $S_{11}$ are
1-links. If, in addition, all the links from $S_{12}$ to $S_{11}$ are
1-links and all links within $S_{12}$ are 0-links, the diagnosis will
designate $S_{11}$ as the faulty units. As long as the number of
allowable faulty units is greater than 1, this sort of incomplete
diagnosis has a non-zero probability.
There are several partitions of a system into $S_1$, $S_2$ and $R$
(as in Fig. 4.3-1) such that $S_1$ receives no testing links from $R$.
t' has to be less than $\max(|S_1|, |S_2|)$ of each such partition:

$$t' = \min_{\text{over all partitions}} \bigl( \max(|S_1|, |S_2|) \bigr) - 1$$

4.3.4 Bounds for Asymmetric Testing
We shall now find bounds for t'. Let us define $\langle x \rangle$ as
the largest integer smaller than $x$.
Lemma 1: In any t-fault diagnosable system where no two units
test each other, the minimum value of t' is $\langle (2t+1)/3 \rangle$.
Proof: Let $k$ be the number of units in $S_1$. The maximum number of
links possible within $S_1$ is $k(k-1)/2$. This would require the
smallest number of links incident on $S_1$ from outside $S_1$. Since
each unit has to be tested by at least $t$ others, the smallest
number of links incident on $S_1$ from outside $S_1$ is
$kt - k(k-1)/2$. If $m$ is the cardinality of $S_2$, the smallest $m$
will be needed when each unit in $S_2$ tests each unit in $S_1$.
Therefore, the smallest size of $S_2$ is given by

$$km = kt - \frac{k(k-1)}{2}$$
$$m = t - \frac{k-1}{2}$$

Since $m$ has to be an integer, in any t-fault diagnosable system,
if $S_1$ has size $k$, the minimum size of $S_2$ is
$\lceil t - (k-1)/2 \rceil$. It is possible to design a system which
has a partition as in Figure 4.3-1, and these values of $|S_1|$ and
$|S_2|$. As the value of $k$ increases, the minimum size of $S_2$
decreases. The minimum value of $\max(|S_1|, |S_2|)$ occurs when
$m = k$, as can be seen from Fig. 4.3-4.
[Fig. 4.3-4: minimum size of $S_2$ versus $k$]

$$m = k = t - \frac{k-1}{2}$$
$$m = k = \frac{2t+1}{3}$$

If $(2t+1)/3$ is an integer, $\max(|S_1|, |S_2|) = (2t+1)/3$. When
$(2t+1)/3$ is not an integer, the smallest attainable value is the
next larger integer, since $m$ falls by only $1/2$ for each unit
increase in $k$. Hence

$$\text{minimum value of } \max(|S_1|, |S_2|) = \left\lceil \frac{2t+1}{3} \right\rceil$$

Since $t' = \min_{\text{over all partitions}} (\max(|S_1|, |S_2|)) - 1$,

$$t'_{\min} = \left\lceil \frac{2t+1}{3} \right\rceil - 1$$

which is the largest integer smaller than $(2t+1)/3$.
Q.E.D.
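The counting argument in the proof can be checked numerically. The following sketch is not part of the report; the function names are hypothetical, and it only evaluates the expression for the minimum size of $S_2$ derived above and takes the minimum over $k$.

```python
def min_s2_size(t: int, k: int) -> int:
    """Smallest integer m with m >= t - (k - 1) / 2, for |S1| = k."""
    # ceil(t - (k - 1) / 2) equals t - floor((k - 1) / 2) for integers t, k
    return t - (k - 1) // 2


def min_of_max(t: int) -> int:
    """Minimum over k of max(|S1|, |S2|) using the bound from the proof."""
    return min(max(k, min_s2_size(t, k)) for k in range(1, 2 * t + 2))


def lemma1_bound(t: int) -> int:
    """Lower bound on t' for asymmetric testing: min of max, minus one."""
    return min_of_max(t) - 1


bounds = {t: lemma1_bound(t) for t in range(1, 8)}
```

For every `t` tried, `min_of_max(t)` agrees with the closed form, the smallest integer not less than $(2t+1)/3$; for the value $t = 6$ used in the example of Section 4.3.7 the bound is 4.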
Lemma 2: The maximum value of t' is $t-1$.
Proof: Since the system we consider is t-fault diagnosable, there
exists a fault pattern comprising $(t+1)$ units which cannot be
diagnosed. Therefore, t' cannot be greater than $t$.
Since the system is t-fault diagnosable, there is at least one
unit $i$ which is tested by exactly $t$ units. Hence, the system has
a partition as in Fig. 4.3-1, with $S_1$ consisting of $i$ and $S_2$
of the $t$ units which test $i$. Therefore, in any system
$\min(\max(|S_1|, |S_2|))$ is at least as small as $t$. Therefore t'
can never exceed $(t-1)$.
We shall show that the maximum value of t' is $(t-1)$ by citing
a connection assignment where it is so. Let us consider the
$D_{1,t}$ connection. Let us try to form a partition as in
Fig. 4.3-1, by starting with a unit $j$ in $R$.
Now $S_1$ can only be a subset of $M_j$, where $M_j$ is the set of
units not tested by $j$. Because of the $D_{1,t}$ connection, any
$S_1$ will have one unit which is not tested by any of the remaining
units in $S_1$. Therefore, $S_2$ has to have a size of at least $t$.
Therefore t' has a value of $t-1$ in a system with a $D_{1,t}$
connection.
Q.E.D.
4.3.5 Non-Asymmetric Testing
Now, we shall consider a system which contains pairs of units
that test each other. The necessary and sufficient conditions for
such a system to be t-fault diagnosable were formulated by Hakimi
and Amin. We shall find bounds for t' for such systems.
Lemma 3: In any t-fault diagnosable system where some pairs of
units test each other, t' cannot be less than
$\lceil (2t+1)/3 \rceil - 1$, the largest integer smaller than
$(2t+1)/3$, and can be at most equal to $t$.
Proof: (i) One of the conditions necessary for a system to be
t-fault diagnosable is that for each integer $p$ with $0 \le p < t$,
given any set of units $R$ with $|R| = n - 2t + p$, the largest set
of additional units $S_2$ with every unit in $S_2$ being tested by at
least one unit in $R$ must be such that $|S_2| > p$.
For any given value of $p$ and a partition as in Fig. 4.3-1,
since $|S_1|$ decreases as $|S_2|$ increases, the smallest value of
$\max(|S_1|, |S_2|)$ occurs when $|S_1| = |S_2|$. Let
$|S_2| = p + k$ where $k > 0$. Of all possible values of $k$, for a
given $p$, there exists only one, $k^*$, for which $|S_1| = |S_2|$.

$$|S_1| = n - |R| - |S_2| = n - (n - 2t + p) - (p + k^*) = p + k^*$$
$$3(p + k^*) = 2t + k^*$$
$$\therefore\; p + k^* = \frac{2t + k^*}{3} = |S_1| = |S_2|$$

The minimum value of $k^*$ is 1 and there exists a $p$ for which this
equality holds.
Therefore, t' cannot be less than $\lceil (2t+1)/3 \rceil - 1$, the
largest integer smaller than $(2t+1)/3$.
(ii) t' obviously cannot be greater than $t$. Consider a system with
$2t+1 < n < 2t+3$. Construct a $D_{1,t}$ system with a bidirectional
link between units $i$ and $j$ if $|i-j| = 1 \bmod n$. In this case
each unit is tested by $(t+1)$ other units. If a unit $k$ is in $R$,
$S_1$ can only be a subset of $M_k$, the set of units not tested by
$k$. Any such $S_1$ will have a unit $m$ which is tested by at most
one other unit in $S_1$ and a unit $n$ which is tested by at least
one unit $p$ not in $S_1$ such that $p$ does not test $m$. Therefore,
every $S_2$ has a size of at least $t+1$ and hence for such a system
$t' = t$. Hence, the maximum value of t' is $t$.
Q.E.D.
We have established bounds for t'. For any given connection
assignment, the exact value of t' can be determined by examining
all partitions as in Fig. 4.3-1. We shall now give an algorithm
to do this.
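As a point of reference for small systems, the definition of t' can also be evaluated by exhaustive enumeration. The sketch below is a brute-force stand-in, not the procedure of the report: for every candidate $S_1$ it takes $S_2$ to be exactly the outside units that test some unit of $S_1$ (the smallest $S_2$ for which $S_1$ receives no links from the remainder $R$) and minimizes $\max(|S_1|, |S_2|)$. The function names, the 0-based unit numbering and the helper `d1t`, which assumes that in a $D_{1,t}$ connection unit $i$ tests units $i+1, \ldots, i+t$ modulo $n$, are assumptions made for illustration.

```python
from itertools import combinations
from typing import Set, Tuple

Link = Tuple[int, int]  # (tester, tested)


def it_capability(n: int, links: Set[Link]) -> int:
    """Brute-force t' = min over S1 of max(|S1|, |S2|) - 1 (exponential in n)."""
    best = n
    for size in range(1, n):
        if size >= best:
            break  # max(|S1|, |S2|) >= |S1|, so larger S1 cannot improve
        for combo in combinations(range(n), size):
            s1 = set(combo)
            # smallest S2: every unit outside S1 that tests a unit of S1
            s2 = {u for (u, v) in links if v in s1 and u not in s1}
            best = min(best, max(len(s1), len(s2)))
    return best - 1


def d1t(n: int, t: int) -> Set[Link]:
    """Assumed D_{1,t} connection: unit i tests i+1, ..., i+t (mod n)."""
    return {(i, (i + d) % n) for i in range(n) for d in range(1, t + 1)}


t_prime = it_capability(7, d1t(7, 3))
```

For the assumed $D_{1,3}$ connection on 7 units this gives `t_prime == 2`, in agreement with the value $t-1$ stated in Lemma 2.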
4.3.6 Procedure
Let us denote by $t'_m$ the smallest upper bound on the value of t'
based on all the available information at any stage. We know
the initial value of $t'_m$. Then, we examine all Type A partitions,
i.e., partitions as in Fig. 4.3-1 with $|S_1| \le t'_m$ and
$|S_2| \le t'_m$, which contain a unit $i$ in $R$. If there exists
such a partition, we update $t'_m$ and repeat. If there exists no
such partition we examine Type A partitions which contain a unit $j$
but not $i$ in $R$ and so on till all possible partitions have been
examined. After all the partitions have been examined, the current
value of $t'_m$ is the value of t'.
Let us define,
$\Gamma_i \triangleq$ set of units tested by unit $i$
$\Gamma_i^{-1} \triangleq$ set of units testing unit $i$
$P_{ij} \triangleq |\Gamma_i \cup \Gamma_j \cup i \cup j|$
$Q_{ij} \triangleq |\Gamma_i^{-1} \cup \Gamma_j^{-1} \cup i \cup j|$
Procedure 1: This will be used to find an upper bound, sufficiently
lower than $t'_m$, on the maximum possible size of $S_1$ in a type A
partition, when a unit $i$ is in $R$.
Lemma 4: The minimum possible size of $(S_2 \cup R)$ is $k$ only if
there is a set $C$ of at least $(k - t')$ units such that for each
pair of units $m, n \in C$, $P_{mn} \le k$.
Proof: Deleted because it is obvious.
Every set $C$ satisfying the condition in the Lemma is a possible
candidate for $R$ but $|S_2 \cup R|$ must be evaluated to make sure
that it is so. The minimum possible size of $(S_2 \cup R)$
establishes a limit on $|S_1|_{\max}$.
We start with a unit $i$ in $R$ and establish a lower limit on $k$
by first finding the smallest $k$ such that there are at least
$(k - t'_m - 1)$ units such that for every such unit $j$,
$P_{ij} \le k$. If the lower limit is to be increased, every possible
$C$ has to be formed and checked. If any $C$ results in a type A
partition, the value of $t'_m$ is updated, the new t' being
$(\max(|S_1|, |S_2|) - 1)$ for that partition, and the procedure is
repeated till the value of t' is unchanged during the iteration with
that value of $k$. If there exists no type A partition with
$|S_2 \cup R| = k$, we can increment $k$ and start all over again. We
can do this till we have reduced $|S_1|_{\max}$ to a number
sufficiently smaller than t'.
The various $C$'s can be evaluated by converting a Boolean
expression in a product of sums form to a sum of products
representation, e.g., if $P_{jn}, P_{jm} > k$, then $j$ and $n$, $m$
cannot be in the same $C$. We express it as
$(X_j \bar{X}_n \bar{X}_m + \bar{X}_j)$. The $C$'s are evaluated from

$$C_i = \prod_j \Bigl( X_j \prod_{q \in N_j} \bar{X}_q + \bar{X}_j \Bigr) \quad \text{where } N_j = \{ q : P_{jq} > k \}$$

Only those $X_j$ are considered which have $P_{ij} \le k$ and at
least $(k - t'_m - 1)$ such other terms not in $N_j$. No reductions
are performed on the sum of products. Also note that if a product
term contains more than $(k - t'_m)$ literals, all combinations of
size $\ge k - t'$ are possible $C$'s.
Procedure 2: When a unit $i$ is in $R$, $S_1$ can be formed only from
$M_i$, the set of units not tested by $i$. After establishing an
upper bound on $|S_1|_{\max}$ using procedure 1, procedure 2 can be
used to look for type A partitions.
Lemma 5: A type A partition exists with $|S_1| = |S_1|_{\max}$ only
if there exists a set of units $D$ in $M_i$ such that
$|D| = |S_1|_{\max}$ and for every pair of units $m, n$ in $D$,
$Q_{mn} \le |S_1|_{\max} + t'_m$.
Proof: Deleted because it is obvious.
Every set $D$ satisfying the condition in the Lemma is a
possible candidate for $S_1$ in a type A partition. However, each
$D$ has to be examined individually to check if $|S_2| \le t'_m$.
After starting with a unit $i$ in $R$ and arriving at a $t'_m$ and
$|S_1|_{\max}$ by using procedure 1, we can find a lower value for
$|S_1|_{\max}$ by first finding the largest number $w$ such that
there are at least $w$ units in $M_i$ such that each of them
satisfies the condition on $Q$ with at least $(w-1)$ other units. If
$|S_1|_{\max}$ is to be lowered, every possible $D$ has to be formed
and checked. If any $D$ results in a type A partition, the value of
t' is updated and the procedure is repeated for the same
$|S_1|_{\max}$. It is also possible to try to reduce $|S_1|_{\max}$
by repeating procedure 1 using the new value of t'. If there exists
no type A partition with $|S_1| = |S_1|_{\max}$ we decrement
$|S_1|_{\max}$ and repeat. We do this till $|S_1|_{\max}$ is reduced
to 1.
The $D$'s can be found in a manner analogous to that for finding
the $C$'s. The $X_j$'s considered are those units in $M_i$ which
satisfy the condition on $Q$ with at least $(|S_1|_{\max} - 1)$ other
units in $M_i$, the $X_q$'s representing units not satisfying the
condition on $Q$ with $X_j$. Also, if a product term in the sum of
products representation has more than $|S_1|_{\max}$ literals, only
combinations of size $|S_1|_{\max}$ are to be considered as
candidates for $D$.
Procedure 2': After we have examined all partitions which contain
a unit $i$ in $R$ and are examining partitions which contain a unit
$j$ in $R$, we can try to avoid examining some $D$'s which have
already been examined before. We can divide $M_j$ into
$M_j \cap M_i$ and $M_j \cap M_i^c$. Every $D$ must have at least one
unit from $M_j \cap M_i^c$ in order to be an unexamined one.
Therefore, $|S_1|_{\max}$ will be determined by the maximum size of
$D$ containing a unit from $M_j \cap M_i^c$.
After examining partitions containing units $i_1, i_2, \ldots, i_p$
in $R$, the next unit we pick should be a unit $j$ which has the
smallest $P_{i_s j}$ for all $j$ and all $i_s$ in $i_1, \ldots, i_p$.
We will then partition $M_j$ into $M_j \cap M_{i_s}$ and
$M_j \cap M_{i_s}^c$ and use procedure 2 with the exception that
$D$'s not containing any units from $M_j \cap M_{i_s}^c$ are not
examined at all.
4.3.7 An Example
We shall now use an example to clarify the algorithm. In
Table 1 is given a connection assignment, with the P's and Q's
given in Table 2.
Procedure 1: We shall start with unit 3 in $R$. Initially
$t'_m = t - 1 = 5$. We need at least $(k-6)$ units $j$ with
$P_{3j} \le k$. Therefore, the smallest value of $k$ is 9. The
possible candidates for $R$ with $|R \cup S_2| = 9$ are units 3, 2, 6
and 10. We now form $C_3$.

$$C_3 = x_3 (x_2 \bar{x}_{10} + \bar{x}_2)(x_6 \bar{x}_{10} + \bar{x}_6)(x_{10} \bar{x}_2 \bar{x}_6 + \bar{x}_{10})$$

At any stage, we multiply 2 sum of products terms only if they have
at least one common $x_j$. The $C$'s can be easily formed from these
disjoint sum of products terms.

$$C_3 = x_3 (x_2 x_6 \bar{x}_{10} + x_{10} \bar{x}_2 \bar{x}_6)$$

The maximum possible size of $C$ is 3 and Lemma 4 is not satisfied.
Therefore, we check for $k = 10$. Now, the $C$'s can be selected from
units 3, 2, 5, 6, 10, 13, 15. We again form $C_3$.

$$C_3 = x_3 (x_2 \bar{x}_{10} \bar{x}_{13} + \bar{x}_2)(x_5 \bar{x}_6 \bar{x}_{10} \bar{x}_{13} + \bar{x}_5)(x_6 \bar{x}_5 \bar{x}_{13} + \bar{x}_6)(x_{10} \bar{x}_2 \bar{x}_5 \bar{x}_{15} + \bar{x}_{10})(x_{15} \bar{x}_{10} \bar{x}_{13} + \bar{x}_{15})(x_{13} \bar{x}_2 \bar{x}_5 \bar{x}_6 \bar{x}_{15} + \bar{x}_{13})$$

$x_{10}$, $x_5$ and $x_{13}$ cannot be in any $C$ because every $C$
must have a size of at least 5 and they satisfy the condition on $P$
with only 3 other $x_j$'s. Since there are only 4 possible candidates
remaining, there exists no $C$ satisfying Lemma 4 for $k = 10$.
Therefore, the smallest $|S_2 \cup R|$ is greater than 10. Now we can
switch to procedure 2.
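The first step of this example ($k = 9$) can be reproduced by direct enumeration. The sketch below replaces the Boolean product-of-sums expansion of the report with a plain search for maximal sets in which every pair of units satisfies $P \le k$; this is a swapped-in technique chosen for brevity, and the function name `candidate_sets` and the dictionary encoding are hypothetical. The six $P$ values are the ones for units 2, 3, 6 and 10 in Table 2.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set

# P values among units 3, 2, 6, 10 as read from Table 2
P: Dict[FrozenSet[int], int] = {
    frozenset((3, 2)): 9, frozenset((3, 6)): 9, frozenset((3, 10)): 9,
    frozenset((2, 6)): 8, frozenset((2, 10)): 11, frozenset((6, 10)): 10,
}


def candidate_sets(seed: int, others: Iterable[int],
                   p: Dict[FrozenSet[int], int], k: int) -> List[Set[int]]:
    """Maximal sets containing `seed` in which every pair has P <= k."""
    def compatible(a: int, b: int) -> bool:
        return p[frozenset((a, b))] <= k

    pool = [u for u in others if compatible(seed, u)]
    found: List[Set[int]] = []
    for size in range(len(pool), -1, -1):  # largest sets first
        for combo in combinations(pool, size):
            if all(compatible(a, b) for a, b in combinations(combo, 2)):
                s = {seed, *combo}
                if not any(s < f for f in found):  # skip non-maximal sets
                    found.append(s)
    return found


c_sets = candidate_sets(3, [2, 6, 10], P, k=9)
largest = max(len(c) for c in c_sets)
```

The search returns the sets {2, 3, 6} and {3, 10}, matching the two product terms of the reduced expression for $C_3$; the largest has 3 units, fewer than the $k - t'_m = 4$ required by Lemma 4.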
Procedure 2: Since the minimum size of $k$ is $> 10$,
$|S_1|_{\max} = 4$. The units not tested by 3 are 2, 4, 6, 8, 9, 10,
12, 14 and 15. It can be seen from the table that none of these units
has more than 1 unit satisfying the condition on $Q$. Therefore,
there is no type A partition containing unit 3 in $R$. Now we look
for type A partitions not containing unit 3 in $R$.
4.3.8 Conclusion
We have established bounds on the I/T fault diagnosing
capability of t-fault diagnosable systems and given an algorithm
to determine this value for any connection assignment. Since this
does not take into account the statistics of the I/T fault, this
can be looked on as the minimum capability of the system.
  m\n   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   1    0  0  1  0  1  0  0  1  0  0  1  1  1  0  0
   2    0  0  0  1  1  0  1  0  0  0  1  0  1  1  0
   3    0  1  0  0  0  1  0  1  1  0  0  1  0  0  1
   4    1  0  0  0  0  0  0  0  0  0  1  1  1  1  1
   5    0  0  1  0  0  0  1  1  0  1  0  0  1  1  0
   6    1  0  0  1  0  0  0  1  0  1  0  1  1  0  0
   7    1  0  1  1  0  1  0  0  1  1  0  0  1  0  0
   8    0  1  0  1  0  0  1  0  0  0  1  1  0  0  1
   9    1  0  0  1  1  1  0  1  0  0  1  0  0  0  0
  10    1  0  0  1  0  0  0  1  1  0  0  1  0  1  0
  11    0  0  1  0  1  0  1  0  0  1  0  1  0  0  1
  12    0  1  0  0  1  0  1  0  1  0  0  0  0  1  1
  13    0  0  1  0  0  0  0  1  0  1  1  1  0  1  0
  14    1  0  0  0  0  1  1  1  1  0  1  0  0  0  0
  15    1  0  0  0  1  0  1  0  1  1  0  0  1  0  0
(m,n) = 1 iff m is tested by n
Number of units = 15, t = 6
TABLE 1
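The quantities $P_{ij}$ and $Q_{ij}$ defined in Section 4.3.6 follow mechanically from the connection assignment, so they can be recomputed as a check. The sketch below is an illustration with hypothetical names (`TABLE_1`, `p`, `q`, `not_tested_by`). The rows encode Table 1; the row for unit 12 assumes a leading 0, which is the reading that reproduces the $P$ and $Q$ values quoted in the worked example.

```python
# Row m, column n: "1" means unit m is tested by unit n (units numbered from 1).
TABLE_1 = [
    "001010010011100",  # unit 1
    "000110100010110",  # unit 2
    "010001011001001",  # unit 3
    "100000000011111",  # unit 4
    "001000110100110",  # unit 5
    "100100010101100",  # unit 6
    "101101001100100",  # unit 7
    "010100100011001",  # unit 8
    "100111010010000",  # unit 9
    "100100011001010",  # unit 10
    "001010100101001",  # unit 11
    "010010101000011",  # unit 12 (leading 0 assumed)
    "001000010111010",  # unit 13
    "100001111010000",  # unit 14
    "100010101100100",  # unit 15
]
UNITS = range(1, len(TABLE_1) + 1)

# testers[m]: units that test m; tests[n]: units that n tests
testers = {m: {n for n in UNITS if TABLE_1[m - 1][n - 1] == "1"} for m in UNITS}
tests = {n: {m for m in UNITS if TABLE_1[m - 1][n - 1] == "1"} for n in UNITS}


def p(i: int, j: int) -> int:
    """Size of the union of i, j and all units tested by i or j."""
    return len(tests[i] | tests[j] | {i, j})


def q(i: int, j: int) -> int:
    """Size of the union of i, j and all units testing i or j."""
    return len(testers[i] | testers[j] | {i, j})


def not_tested_by(i: int) -> set:
    """The set M_i of units (other than i) that i does not test."""
    return set(UNITS) - tests[i] - {i}
```

For instance, `p(3, 2)`, `p(3, 6)` and `p(3, 10)` all evaluate to 9, which is why units 2, 6 and 10 are the candidates considered with unit 3 for $k = 9$ in the example.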
  m\n   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   1    x 12 12 10 12  9 13 12 10 11 12 13 11 12 12
   2   11  x  9  9  9  8  9 11  9 11 10 10 11  9  7
   3   11 13  x 12 10  9 11 11 11  9 11 11 10 11 10
   4   10 10 12  x 12  9 12 12 11 11 11 12 11 11 11
   5   10 10 12 12  x 11 10 13 11 11 11 13 11 11 10
   6   10 12 11 10 11  x 11 10  8 10 11 12 11 11 10
   7   12 12 12 12 11 10  x 14 11 11 12 14 12 11 10
   8   11 10 10 10 12 11 13  x 12 12 12 12 13 12 13
   9   10 11 11 11 12 10 11 11  x 11 13 13 13 11 10
  10   11 12 11 10 11  9 11 11 10  x 13 12 10 11 11
  11   10 11 11 11 10 12 12 10 12 12  x 12 12 11 11
  12   12 10 10 11 11 13 13 10 12 11 10  x 13 12 10
  13    9 11 11 10  9 10 12 11 12 10 10 12  x 11 12
  14   11 11 11 11 11 11 11 11  9 10 12 11 11  x 11
  15   11 11 12 11 10 11 10 12 11 11 10 10 12 11  x
Entries above the diagonal are P(m,n); entries below the diagonal
are Q(m,n).
TABLE 2
3-1 M. A. Breuer, "Testing for intermittent faults in digital
circuits", IEEE Trans. Comput., vol. C-22, No. 3, pp. 241-246,
March 1973.
3-2 S. Kamal and C. V. Page, "Intermittent faults: a model and
a detection procedure", IEEE Trans. Comput. (Special Issue
on Fault-Tolerant Computing), vol. C-23, pp. 713-719,
July 1974.
3-3 F. C. Hennie, "Fault detection experiments for sequential
circuits".
3-4 T. L. Booth, "Sequential Machines and Automata Theory",
John Wiley, 1967.
3-5 F. P. Preparata, G. Metze and R. T. Chien, "On the connection
assignment problem of diagnosable systems", IEEE Trans.
Comput., vol. EC-16, No. 6, pp. 848-854, December 1967.
3-6 S. L. Hakimi and A. T. Amin, "Characterization of the
connection assignment of diagnosable systems", IEEE Trans.
Comput., pp. 86-88, January 1974.
3-7 T. L. Saaty, "Elements of Queueing Theory", McGraw-Hill, 1961.
5. Tolerance Strategies
5.1 Introductory Statements:
Given that a machine is imperfect, information may or
may not be known about the types of phenomena which will
produce I/T faults in it. In either case, the problem is the
same: how can the circuitry be arranged to eliminate the I/T
faults or the effects of the I/T faults? Is enough information
available to do this? If not, what can be done to make
this goal possible? This report is intended to address these
problems. Material pertinent to this subject is listed in
references 5-1 through 5-10.
5.2 I/T Fault Intolerance:
The first method is to attempt to eliminate the I/T faults.
This approach is called fault intolerance. The most reliable
components are used in constructing equipment to lower the
probability of overall failure for a given mission time. No redundancy
is employed, and in general, all components must function properly
for the system to operate. The overall reliability is the
product of the component reliabilities, and since these have
been arranged to be as high as possible, the system will have a
high reliability.
There is a limit to how much the reliability can be increased,
both physically and economically. The I/T fault intolerance
approach adds nothing new to the structure of a system, and
better reliabilities are required, so the next approach
is considered.
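The product rule quoted above shows why component quality alone runs out of room. A minimal sketch, assuming independent component failures and using illustrative reliability figures that are not taken from the report:

```python
from math import prod
from typing import Iterable


def series_reliability(component_reliabilities: Iterable[float]) -> float:
    """Reliability of a non-redundant system in which every component must work."""
    return prod(component_reliabilities)


# Hypothetical system: 100 components, each with reliability 0.999
# over the mission time.
system = series_reliability([0.999] * 100)
```

Even with every component at 0.999, the system as a whole comes out at roughly 0.90, and adding components can only lower the figure.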
5.3 I/T Fault Tolerance:
The idea behind fault tolerance is to utilize redundant
circuitry to eliminate the effects at the output of internal
faults. The fact that faults will occur is accepted; only
through redundancy can the effects be eliminated.
5.3.1 Faults in Combinational Circuits:
In a combinational circuit, the worst that an I/T fault
can do is to cause the circuit to produce erroneous output
while the fault is present. If measures are taken to insure
that these erroneous outputs are masked, the task is accomplished.
The job is simplified because there can be no after-effects
of the I/T faults.
Triple modular redundancy has been used at this level,
as shown in Figure 5-1. If no more than one module in a
triple experiences a fault at the same time, the voter output
will be correct. Depending upon the voter reliability, single
or triple voter schemes are used. With combinational circuits,
triple modular redundancy may be used at any level. Circuits
may consist of a single overall triplication, or of triplication
of modules which are triplications of modules, etc. Note that
while transient faults have been implied in this discussion,
single permanent faults per module will also be masked.
Naturally, another I/T fault in another portion of that same group
can cause system failure.
5.3.2 Sequential Circuits:
When dealing with sequential circuits, I/T faults may
affect them after the fault disappears. Wakerly points
this out in [5-2]. If triple modular redundancy is used on
a level low enough so that the replicated modules are
combinational, the overall circuit may be sequential in nature
but the redundancy scheme will filter out all single I/T
faults per module. If it is desired to triplicate modules
which are themselves sequential, then a build-up of faults
can occur. An I/T fault in a sequential machine can change
the machine's state. This can clearly lead to improper
execution. Therefore, applying triple modular redundancy
to sequential modules is more complicated than to
combinational modules. However, it has become increasingly
important to apply redundancy to sequential modules.
Breaking a circuit down into portions so small that they are
all combinational results in modules which are comparable in
complexity to the voters themselves (not very complex). It
is doubtful that fault tolerant machinery constructed in this
manner would be much more reliable than the original
non-redundant machine.
The level of complexity of commercially available
integrated circuits is constantly rising, and due to this
constraint many times it is impossible to break a circuit down
into combinational modules. Wakerly shows that the modules
must be restorable, and that restoring inputs must be applied
in normal operation. Some circuit structures directly lend
themselves to this automatically in normal operation, while
for others special resynchronizing inputs must be devised and
applied. In the latter case, the problem is that only single
faults per module can be guaranteed not to affect the output,
between applications of the resynchronizing inputs.
If enough is statistically known about the I/T faults,
then the overall reliability can be computed for various
resynchronizing input frequencies.
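The masking property claimed for triple modular redundancy in Section 5.3.1 rests on a bit-by-bit majority vote. A minimal sketch of such a voter follows; the function name and the example words are illustrative, and the voter itself is assumed to be fault free.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority of three module output words."""
    return (a & b) | (a & c) | (b & c)


# Hypothetical 4-bit module outputs; the third replica is hit by an
# I/T fault that corrupts two of its bits while the others are correct.
correct = 0b1010
voted = majority_vote(correct, correct, 0b0110)
```

As long as at most one of the three replicas is wrong in any given bit position, `voted` equals the correct word; two replicas wrong in the same bit defeat the vote, which is the failure case noted at the end of Section 5.3.1.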
5.3.2.1 Larger Scale Modularization:
In [5-3] Wakerly describes a method of utilizing triple
modular redundancy with microprocessors with associated memories.
Considerable discussion is given to the voter placement problem.
The resultant system constantly runs a program of resynchronizing
routines which restore the registers on the central processor
chip as well as the external memory. To prevent a single
microprocessor which has gone awry from deteriorating the system,
the processors are periodically restarted.
Wakerly's scheme will certainly work, but it has limitations.
Due to the choice of placing the voters after the memory
instead of between the processor and the memory, if a failure
occurs within a processor, its faulty data can be placed
anywhere within its external memory. To keep the memory clean,
the entire memory must periodically be completely rewritten.
It is not sufficient to wait until an error is detected and
only rewrite the present address location. The problem is
that for any reasonably large memory it will take a considerable
amount of time to totally rewrite. Furthermore, the
cleansing must be done fairly frequently to capture as many
I/T faults as possible. This leaves very little time for the
processors to perform the original task assigned them. For
these reasons it is believed that the triple modular redundancy
scheme for microprocessors described in [5-3] needs improvement.
5.3.2.2 Special Problems with Microprocessor Implementations
To a limited degree, fault tolerance implementations for
low level circuits exist. Designing a random logic circuit to
be fault tolerant is possible primarily for two reasons:
analysis and test set generation. Due to the low level nature
of the circuit the effects of any particular fault can be
analyzed. It is only because I/T faults injected into a circuit
can be analyzed that it then becomes possible to generate test
sets. The problem of test set generation is complex even in
low level circuit descriptions; however, there are techniques
available which give best solutions, at least in theory if not
in practice.
There are also practices devised for connecting
large computers together to provide some measure of fault
tolerance. Even in this case, it is not clear what a good measure
of fault tolerance is. It is also necessary to specify what
the faults are. I/T and permanent faults may be considered.
The problem under investigation falls between the above two
categories. How can microcomputers be best configured to
provide fault tolerance? Microprocessors have peculiarities
which must be considered in planning tolerance strategies.
The relative price of each component changes what should be
duplicated in the overall system. The capabilities of
microcomputers are not as great as those of large computers, and
strategies developed for large computers are often not at all
suitable for implementation with microprocessors.
5.3.2.2.1 Microprocessor Redundancy Schemes
Redundancy schemes applied to microprocessors fall into
two categories: those which are specifically designed for the
microprocessor, and those which are general in nature and are
originally intended as reliability schemes for larger processor
systems, but are adapted to microprocessor based processors.
Ideas based on larger systems which are later applied to
microprocessor systems sometimes make little sense. There are some
features of microprocessors which must be taken into account
when devising tolerance schemes. The first is complexity. If
the redundancy scheme uses so much extra hardware as to
overshadow the amount of hardware that the microprocessor itself
has, then the reliability of the hardware added for the extra
reliability will probably be such that, when compared to the
reliability of the original nonredundant system, little will be
added, if indeed the so called reliable system is not less
reliable than the simple system. Another feature which makes
the microprocessor very different from larger processors is
speed. The microprocessor is quite slow when compared with
large computers and minicomputers. Elaborate reconfiguration
schemes implemented in software may take a long
time to execute, and depending on the application, may or may
not be suitable.
The report from Ultrasystems [5-7] on reconfigurable
computer systems is quite complete. Many of the ideas presented
there can be adapted to microprocessor designs. They
categorize their approaches as mostly software, hardware aided
software, and mostly hardware. The mostly hardware proposals
involve a large amount of hardware, and would not be desirable
to implement on a processor system using microprocessors as
the processor elements, as the complexity of the extra hardware
is large compared to the relatively small amount of hardware which
the individual microprocessors require. Reliability is closely
connected with the amount of interconnections, and hardware
designs involving large quantities of integrated circuits
demanding many interconnections tend to become unreliable.
Therefore, any hardware added to microprocessors for I/T fault
tolerance should be small to moderate when compared with the
complexity of the microprocessor itself.
The major thrust of microprocessor based controllers is
to replace hardware with software. Continuing in this manner,
it would seem that any fault tolerant microprocessor system
should use as little added hardware as is possible for the
added fault tolerance. The techniques mentioned in [5-7]
under mostly software can be adapted reasonably well to
microprocessor systems. These range from minimal additions
to vast reconfiguration software monitors. The large monitors
should be avoided with microprocessor implementations
since microprocessors do not usually have a large amount of
memory to hold elaborate programs, and very elaborate monitor
programs would tend to take a long time to execute on
microprocessor systems. Nevertheless, there are some very good
techniques discussed in [5-7], and those which can be used on
microprocessor systems will be briefly outlined here.
The applications program is broken into program segments.
The choice of program segments can greatly affect the relia-
bility of the end product. No more than one output statement
should be in any one program segment, and large calculations
should be broken down into several segments. A set of variables
called the state vector is associated with each program segment.
The state vector is such that in order to leave a particular
segment with the correct data, all that should be needed is
the state vector input to that segment. Naturally, the larger
the program segment, the larger will be the state vector. If
the program is operating properly, and if the state vector is
correct, then the output of that program segment should be
correct. That data can then be used as the state vector for
the next program segment. Comparison of state vectors is the
major reliability addition made in multiple processor imple-
Ii	
^l
v l
i
1
5-9 ;
mentations in this approach.
Multiple processors all execute
the same program and produce state vectors. Before a program
segment is initiated, the state vectors of all processors
are compared. If all agree, processing continues in the
normal manner. If a disagreement is found, one of several
things will occur. If there are more than two processors,
then program rollahead can be used. A vote is taken on the
state vectors, and any processor which disagrees with the
outcome of the vote has its state vector forcibly changed to what
the others have. Program execution then continues as normal. If
there are only two processors, or if there is a tie vote with
an even number of processors, rollback must be used. It is
known that a mistake has occurred, but it is not known where.
The previous state vector is reloaded, and execution
of the prior program segment is repeated. Rollback, of course,
takes longer than rollahead, and the larger the program segment,
the longer the recovery time when rollback is used.
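The comparison step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name resolve_state_vectors, the representation of a state vector as a tuple, and the treatment of "no strict majority" as a rollback case are not taken from [5-7].

```python
from collections import Counter

def resolve_state_vectors(vectors):
    """Compare the state vectors produced by all processors.

    Returns (action, vector): action is "continue", "rollahead" or
    "rollback"; vector is the agreed state vector, or None on rollback.
    (Hypothetical helper; state vectors are assumed to be tuples.)
    """
    top_vector, top_count = Counter(vectors).most_common(1)[0]
    if top_count == len(vectors):
        return "continue", top_vector      # every processor agrees
    if 2 * top_count > len(vectors):
        # A strict majority exists: dissenting processors are
        # overwritten with the majority vector (rollahead).
        return "rollahead", top_vector
    # Two processors, a tie, or no majority: repeat the prior segment.
    return "rollback", None

print(resolve_state_vectors([(1, 2), (9, 9), (1, 2)]))
```

With three processors and one dissenter the sketch selects rollahead; with two disagreeing processors it falls back to rollback, as the text requires.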
Many other considerations come into play with multiple
processor systems, such as keeping track of the frequency of
errors in each module, knowing when to remove a processor from the
system, and trying to restart faulty processors at a slow rate.
These techniques can be implemented on microprocessors.
Reliability schemes have been specifically developed for
microprocessor systems. Wakerly [5-3] describes a triple
modular redundancy system for microprocessors. He replicates
processor/memory pairs and adds voters. He discusses where
the optimum place is to put the voters, and decides that
placing the voters on the output of the memory is best. This system
is simple, and the hardware automatically assures that the
processors receive the proper data from memory. The biggest
problem with this implementation is that it is possible for
memory locations to be changed to bad data, so that periodically
it is necessary to read and rewrite the entire memory contents.
This cleans up any errors; however, for larger memories it
could take a considerable period of time to execute.
Even this minimal arrangement requires a lot of circuitry: 24
voters for an 8 bit machine. Reliability curves can be pro-
vided for the various systems.
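As a rough illustration of voting on the memory output and of the periodic read-and-rewrite pass, the sketch below models the three memories as Python lists of 8 bit words. The names vote and scrub are hypothetical, and the software loop only mimics what the hardware voters of [5-3] would do.

```python
def vote(a, b, c):
    """Bitwise majority of three words: each bit takes the value
    held by at least two of the three copies."""
    return (a & b) | (a & c) | (b & c)

def scrub(mem_a, mem_b, mem_c):
    """Read every location through the voter and write the voted
    value back to all three memories (assumed equal lengths)."""
    for addr in range(len(mem_a)):
        good = vote(mem_a[addr], mem_b[addr], mem_c[addr])
        mem_a[addr] = mem_b[addr] = mem_c[addr] = good

# One copy holds a corrupted word at address 1.
m1, m2, m3 = [0x3C, 0xA5], [0x3C, 0xA5], [0x3C, 0x25]
scrub(m1, m2, m3)
print(m3)
```

The scrub loop touches every address, which is why its running time grows with the memory size.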
5.3.2.2.2 Possible Use of Bit Slice Microprocessors for
Tolerance
Many microprocessor fault tolerance approaches utilize
a modular structure. Processors are replicated and compari-
sons are made between them. There is a lot of overhead in
these designs for the limited amount of reliability gained,
and an alternative approach is desired. An interesting
possibility is the use of the bit slice microprocessor designs
for this purpose. What would be significantly useful would
be a tolerance structure which built a sixteen bit micro-
processor out of five four bit slice microprocessors, leaving
one extra for redundancy. This would be an overhead of only
25% versus 200% for a triple modular redundant system.
However, the only part of the bit slice microprocessor design
which is actually modular in the slice sense is the register,
arithmetic, and logic unit (RALU). The contemporary bit slice
processor chips are powerful and include registers and shifters.
These are useful for multiplication and similar powerful in-
structions. These are sequential in nature, and this in it-
self is a problem. To make a transparent redundant system
from these devices with voters on the outputs of the RALUs
would require that the sequential portions of them not be
used. This destroys most of their power, and is unreasonable.
Even if this were not a problem, the RALU is a minor portion
of the overall circuitry of which the bit slice microprocessor
is composed, and it is not reasonable to make the RALU toler-
ant while not doing anything to the rest of the circuitry to
improve the fault tolerance. If the bit slice microprocessors
included most of the slice properties throughout most of the
circuitry, then perhaps good advantage could be made of them
for fault tolerance implementations. No way is seen to do
this with present bit slice microprocessors which is any better
than non bit slice microprocessors, and no way is evident to
design a new type of bit slice microprocessor which would
allow one to take advantage of the slice properties for fault
tolerance implementations.
5.3.2.2.3 The Problem of Determining the Effectiveness
of Designs
There are many schemes proposed to achieve fault
tolerant microcomputer systems, most incorporating multiple
processors for the redundancy needed for the fault tolerance.
The present situation is such that, short of construction
and operation in hostile environments, there is no good
way to determine the relative effectiveness of the various
approaches - indeed, even to verify that a particular imple-
mentation will perform as claimed. The reason that there is
so much difficulty in determining these parameters is that
the fault class being considered is phenomenally large -
namely, all intermittent/transient faults. Parameters to be
determined are such things as the sensitivity of the strategy
to burst type faults, dependent faults, how long the recovery
times are for different faults, the longest expected and the
mean recovery times, and catastrophic faults. Very little is
understood about the various categories of faults which can
be utilized in analyzing fault tolerant approaches to micro-
processor designs. What is needed then is a general model
for I/T faults. It is not clear, however, that a general
model for such faults exists. Realizing this problem, the
research emphasis has been shifted from fault tolerant strate-
gies for microprocessors to that of measuring and modeling the
intermittent/transient faults which can influence microprocessor
based systems.
5.3.2.2.4 Plan for Resolving the Problem
The work thus far has been directed towards obtaining
the means to further examine proposed multiple processor
schemes. This is being accomplished somewhat experimentally.
A microprocessor system has been constructed for this use.
The first step was to subject the processor to various in-
duced faults. These induced faults are faults that are
supposed to copy the real world intermittent/transient faults
to which such a microprocessor system might reasonably be
expected to be exposed. Data to be collected from
the Lear jet experiments to be conducted in Florida will be
a more realistic guide in choosing which faults to induce.
In fact, there is little distinction between actual fault
situations and induced faults. The faults which a circuit is
exposed to in normal operation are the faults of interest,
but by the very nature of the fact that a study of those faults
is being made, the circuit under test is not in normal opera-
tion. This is particularly true in view of the fact that one
cannot wait around for the natural faults to manifest them-
selves, but must force the circuit into a faulty situation.
Any faults to which a network is purposefully exposed are
called induced faults. The induced faults are as close an
approximation to real faults that the circuit would normally
be exposed to as is possible. Naturally, the fault rate will
be higher in the induced faults than in a mildly hostile
environment, but this is necessary in order to accomplish
our goals.
The first step in the experimental procedure was to
subject the test microprocessor to different induced fault
situations. Such things as high or low power supplies,
noisy power supplies, heat, and electromagnetic interference
are possible induced faults which we have attempted to consider.
Our choices will soon be coupled with the Lear jet
experiments, which will ultimately be the guide in choosing
and implementing the induced faults. The experiments we are
performing can be described as follows. First of all, it is not
known how these induced faults will affect the processor:
these are decentralized faults, and the individual lines
actually driven to faulty values are not known. A diagnostic
program which will give data on the faults as they occur is
running on the processor while faults are being
induced. The purpose is to collect enough data on each
induced fault so that the data will be a signature of each
fault, and give an indication of the severity of that fault.
When enough data is collected on each induced fault, the
fault emulation stage begins. This work is still in its
initial stages, but fault emulation can be briefly characterized
as follows.
Induced faults will be approximated by emulated faults.
Emulated faults are faults for which it is known how, and
more importantly, where they affect the circuit. For example,
suppose a noisy power supply causes intermittent fault situa-
tions to occur in the processor. The cause of the fault is
known, the power supply. However, it is not known where that
fault is affecting the circuit to cause the failures. It
could be internal to any of the integrated circuits, and may
not even be directly observable on the pins of the packages.
A logic level somewhere in the processor must be changed to
cause the failure, but it is not known which level has been
changed to the other, nor on which line the level has
been changed. Emulated faults will be postulated for each
induced fault, and the same tests will be run with each emu-
lated fault as was done with the induced faults. Comparison
of the emulated fault data and the induced fault data will
serve as a feedback loop to improve the accuracy of the
postulated emulated faults. In this manner, a set of emu-
lated faults will be constructed which in a sense are a
model of the induced faults which are a close representation
of the actual intermittent/transient faults encountered in
a real situation. The emulated faults for a specific induced
fault may be used as a model of that induced fault
because the direct effects of the emulated faults on the
processor are known. This allows the entire system to be simulated
on a large computer, and evaluations can be made of
the effectiveness of the fault tolerance strategy. The
emulated faults do not have to be known deterministically,
but may only be known statistically. The proposed way to
generate these emulated faults is to have some intelligent
device (minicomputer or specialized hardware) drive various
fault injection networks which are embedded throughout the
microprocessor. This will give a large degree of freedom
in arriving at emulated faults in the hope that a match can
be obtained.
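One way to picture the comparison of induced fault data with emulated fault data is sketched below. Everything in it is an assumption made for illustration: the report does not say what form a fault signature takes, so the sketch treats it as a normalized histogram of the error messages printed by the diagnostic program, and the error codes and candidate names are hypothetical.

```python
from collections import Counter

def signature(error_codes):
    """Normalized histogram of the error codes logged during one run."""
    counts = Counter(error_codes)
    total = sum(counts.values())
    return {code: c / total for code, c in counts.items()} if total else {}

def distance(sig_a, sig_b):
    """L1 distance between two signatures (0.0 means identical)."""
    codes = set(sig_a) | set(sig_b)
    return sum(abs(sig_a.get(c, 0.0) - sig_b.get(c, 0.0)) for c in codes)

def best_emulation(induced_log, candidate_logs):
    """Return the name of the postulated emulated fault whose
    signature is closest to that of the induced fault."""
    target = signature(induced_log)
    return min(candidate_logs,
               key=lambda name: distance(target, signature(candidate_logs[name])))

# Hypothetical logs: error codes reported while each fault was active.
induced = ["ADD", "ADD", "MEM"]
candidates = {
    "data_line_0_forced_low": ["ADD", "ADD", "MEM", "ADD"],
    "clock_glitch": ["JMP", "JMP"],
}
print(best_emulation(induced, candidates))
```

In the feedback loop described in the text, the distance to the best candidate would indicate whether the postulated emulated fault needs further refinement.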
The plan is to construct a microprocessor, expose it
to various hostile environments, measure the effects, postu-
late an equivalent emulated fault, expose it to that proposed
emulated fault, measure the effects, and arrive at a reason-
ably approximate class of emulated faults which can be used
to model a large class of real hostile fault environments.
The concept of an emulated fault includes any faults which
can be injected into the microprocessor in a manner such
that its direct effects are known, either deterministically
or statistically. These effects must be first order effects,
meaning that it is clear that the particular fault emulation
is directly causing the effect, and that the effect is not an
indirect consequence of that injection.
On the other side are induced faults. These are an
attempt to expose the processor to a real hostile environment
without waiting for the processor to experience intermittent/
transient faults on its own. In order to test the validity
of the postulated emulated faults, faults must be induced into
the processor which can be expected to closely parallel a
real hostile environment. How these induced faults cause
failures in the processor need not be known - indeed, this
is the entire point of the current research; in general
this is not known. In short, the induced faults are being
substituted for real world faults, and the emulated faults
are effectively modeling the induced faults, though perhaps
not in a conventional manner. The whole purpose of this
procedure is to produce a methodology to test the effectiveness
of various microprocessor fault tolerance strategies.
A suitable microprocessor for these proposed experiments
has been constructed. Figure 5-2 shows the configuration.
The processor has been constructed on plug boards, so that it
may be easily modified. This allows any of the lines to be
broken for the insertion of the fault injection networks. The
processor is complete, and the testing program is to be developed
and checked. Figure 5-3 shows the diagnostic program. When
running, it periodically prints a message to indicate that it
is still working. Implementation on the 8080 microprocessor
has the advantage that if the stack gets changed, and a non-
memory location is used for the stack, the processor will jump
to a nonexistent location, and receive the data hexadecimal FF,
which corresponds to the restart (RST 7) interrupt instruction
on the 8080.
Advantage is taken of this in the diagnostic program, and in
normal operation the interrupt instruction is never reached.
A count is kept of the times an interrupt is received as an
indication of the processor running awry. The checking program
checks all of the microprocessor instructions, and
prints a message if execution is improper, perhaps also with
a time tag. It includes memory checks.
The exact nature of the induced faults must be chosen,
and suitable circuitry designed and built to create the
faults. All of the data collecting must be done, and then
the fault injection networks must be built for the emulations.
Strategies for the emulations must be developed, and again
data must be collected until the end goal is reached. Figure
5-4 shows the proposed method for emulating faults.
This plan will yield the needed information on how intermittent/
transient faults affect microprocessors, and permit
further investigation of microprocessor tolerance structures.
Figure 5-1: Triple Modular Redundancy
[Diagram: original module; single voter output; triple voter output.]
Voter logic, as far as legible:
INPUTS   OUTPUT
0 0 0    0
0 0 1    0
0 1 0    0
0 1 1    1
1 0 0    0

Figure 5-2: Processor
[Block diagram: 8224 clock generator, 8080 CPU, bus controller (8226),
ROM, RAM, and 8251 USART for input/output (Teletype), joined by
address, data, and control lines.]

Figure 5-3: Diagnostic Program for Fault Detection

Figure 5-4: Fault Emulation Experiment
[Diagram: fault injection networks (FIN) inserted in the address, data,
and control lines, driven by dedicated circuitry or a PDP-11 minicomputer.]
5-1 P. M. Merryman, A. A. Avizienis, "Modeling Transient Faults
in TMR Computer System", Proceedings 1975 Annual Reliability
and Maintainability Symposium.
5-2 John F. Wakerly, "Transient Failures in Triple Modular
Redundancy Systems with Sequential Modules", IEEE Trans. on
Computers, May 1975.
5-3 John F. Wakerly, "Reliability of Microcomputer Systems
Using Triple Modular Redundancy", IEEE Trans. Comp., April 1975.
5-4 J. A. Abraham and D. P. Siewiorek, "An algorithm for the
accurate reliability evaluation of Triple Modular Redundancy
networks", IEEE Trans. Comp., vol. C-23, No. 7, pp. 682-692,
1974.
5-5 John F. Wakerly and E. J. McCluskey, "Design of Low-cost
general-purpose self-diagnosing computers", Technical Note
No. 38, Digital Systems Laboratory, Stanford University,
Stanford, Ca.
5-6 John F. Wakerly, "Checked binary addition using check symbol
prediction and checksum codes", Technical Note 39, Digital
Systems Laboratory, Stanford University, Stanford, Ca.
5-7 "Definition and trade-off study of reconfigurable airborne
digital computer system organizations", Final report,
November 1974.
5-8 A. Avizienis, "The Methodology of Fault-Tolerant Computing",
Proc. First USA-Japan Computer Conference, 1972.
5-9 W. G. Bouricius, et al., "Reliability Modeling Techniques
for Self-Repairing Computer Systems", Proc. ACM 1969 Ann. Conf.
5-10 W. G. Bouricius, et al., "Reliability Modeling for Fault-
Tolerant Computers", IEEE Trans. Comp., vol. C-20, No. 11,
November 1971.
AN EFFICIENT FAULT DIAGNOSIS ALGORITHM FOR SYMMETRIC
MULTIPLE PROCESSOR ARCHITECTURES
INTRODUCTION
Consider a general model of a multiple processor architecture
consisting of $n$ digital modules denoted $U_0, U_1, \ldots, U_{n-1}$
and some associated interconnection design, denoted by [illegible].
These modules, for example, could be $n$ processors implementing a
segmented algorithm [6]. Regardless of the use of the multiple
processor architecture, we will assume that each $U_i$ is capable of
testing the other $U_j$'s to which it is directly connected for some
specified class of faults. If a module contains any such fault we
will refer to it as faulty. The problem we will study in this paper
is the diagnosis of an existing fault situation among the modules
given their respective testing results. This problem is not new and
has been examined elsewhere in the literature [1,3,4,5,7,8]. The
results to be presented here represent a new approach to such
diagnosis. In particular, the diagnosis procedure described will
be seen to be sufficiently straightforward to be easily implementable
on a simple processor, e.g., a microprocessor, and for a
proper interconnection design among the processors and upper bound [...]
PRELIMINARIES
Given $n$ modules $U_0, U_1, \ldots, U_{n-1}$, we will denote the
modules which $U_i$ tests by $U_{f(r,i)}$, $r = 1, 2, \ldots, t$, where
$f(r,i) \in [0, 1, \ldots, n-1]$, $i = 0, 1, \ldots, n-1$. For
convenience, we will always assume that $U_i$ tests itself and,
regardless of its state, concludes that it is fault free. The outcome
of the test of module $U_{f(r,i)}$ by module $U_i$ will be denoted
$a(i, f(r,i))$, where

$a(i, f(r,i)) = 0$ if $U_i$ concludes that $U_{f(r,i)}$ is fault free,
$a(i, f(r,i)) = 1$ otherwise.

It should be noted that the conclusion of, say, $U_i$ regarding the
state (faulty or fault free) of the modules to which it is
connected is only reliable if indeed $U_i$ is fault free. If with
each module $U_i$ we associate a test table $B_i$, $i = 0, 1, \ldots, n-1$,
where $B_i$ represents the conclusion of $U_i$ regarding the states of
all the modules, we have the problem of determining the existing
fault situation based on the available test results. Whether
or not this is feasible clearly depends on the number of faults
and the interconnection design. We will assume in the following
that at most $t$ modules can be simultaneously faulty and that
every module is tested by at least $t$ other modules. Under some
assumptions on the interconnection design, Preparata, Metze and
Chien [7] have shown that it is feasible to diagnose any valid
fault situation. However, the diagnosis algorithms which have
been proposed to do so are quite complex [1,4,8]. We propose here
a new diagnosis algorithm for this problem. For the purpose of
explanation we will assume in the following that the interconnection
design between the modules is the so-called $D_{1,t}$ design of [7],
wherein there is a testing interconnection from $U_i$ to $U_j$ if and
only if $j - i = m$ (modulo $n$) and $m$ assumes the values from 1 to
$t$. The results presented here have been extended to more
general interconnection designs, but since they are descriptively
cumbersome, these extensions will not be detailed.
DIAGNOSIS ALGORITHM
Each test table $B_i$ has components $B_{i,0}, B_{i,1}, \ldots, B_{i,n-1}$,
where $B_{i,j}$ represents the conclusion of module $U_i$ regarding the
state of module $U_j$. If module $U_i$ "believes" that module $U_j$ is
fault free, then $B_{i,j}$ is set to the value 0, otherwise $B_{i,j}$ is
set to the value 1. Suppose that $B_0, B_1, \ldots, B_{n-1}$ are complete
in the sense that every module has a conclusion regarding the
state of each of the modules $U_i$, $i = 0, 1, \ldots, n-1$. We will
assume here that if a module is fault-free, its corresponding
table is correct.

Lemma 1: There exist at least $n-t$ of the $B_i$ tables which
are identical.
Proof: Since at most $t$ modules are faulty, and since a table
corresponding to a fault-free module correctly describes the
fault situation, the lemma follows.

Lemma 2: If there exists only one set of identical tables
$B_{i(1)}, B_{i(2)}, \ldots, B_{i(s)}$ such that $s \ge n-t$, then each of the
tables in this set correctly describes the existing fault
situation.
Proof: We already know that there exist at least $n-t$ correct
and therefore identical tables. Therefore, if only one set
of identical tables has a cardinality larger than or equal to $n-t$,
this set must consist of the correct tables.

It should be clear that no conclusion can be made regarding
the fault situation if there exists more than one set of identical
tables with cardinality larger than or equal to $n-t$.

Theorem 1: Suppose that $n \ge 2t+1$; then there exists one and only
one set of identical tables with cardinality larger than or
equal to $n-t$.
Proof: Suppose that $n \ge 2t+1$ and assume that there exist two
sets of identical tables of cardinality $n_1$ and $n_2$ respectively.
Assume $n_1 \ge n-t$
and $n_2 \ge n-t$.
Then, $n_1 + n_2 \ge 2n - 2t$.
We know that $n \ge n_1 + n_2$ and therefore
$n \ge 2n - 2t$.
This inequality, used in conjunction with $n \ge 2t+1$, yields
$2n \ge 2n + 1$
and we conclude that we cannot have two sets of identical tables
of cardinality larger than or equal to $n-t$ when $n \ge 2t+1$.
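The voting step implied by Lemma 2 and Theorem 1 can be sketched as follows. The function name vote_on_tables and the list-of-lists representation of the tables are assumptions made for illustration only; the tables themselves are taken as given.

```python
from collections import Counter

def vote_on_tables(tables, t):
    """Return the table shared by at least n - t modules if exactly
    one such set of identical tables exists, otherwise None."""
    n = len(tables)
    groups = Counter(tuple(table) for table in tables)
    winners = [table for table, size in groups.items() if size >= n - t]
    return list(winners[0]) if len(winners) == 1 else None

# n = 5, t = 2 (n >= 2t + 1): three tables agree, so the vote is decisive.
tables = [[0, 1, 0, 0, 1],
          [0, 0, 0, 0, 0],
          [0, 1, 0, 0, 1],
          [0, 1, 0, 0, 1],
          [1, 1, 0, 0, 0]]
print(vote_on_tables(tables, 2))

# n = 4, t = 2 (n < 2t + 1): two sets of size n - t exist, no conclusion.
print(vote_on_tables([[0, 1, 0, 0], [0, 1, 0, 0],
                      [0, 0, 1, 0], [0, 0, 1, 0]], 2))
```

The second call shows the ambiguous situation that the condition of Theorem 1 rules out.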
At this point we need an efficient procedure to build the
complete $n$ tables $B_0, B_1, \ldots, B_{n-1}$ such that if module $U_i$ is
fault free, then the table $B_i$ reflects accurately the fault
situation of the multiple processor architecture. Such an
algorithm is presented in the following to compute the tables
$B_0, B_1, \ldots, B_{n-1}$.

Algorithm 1: Let $i$ in $[0, 1, \ldots, n-1]$ and $t$ in $[1, 2, \ldots, n-1]$ be given.
Step 0: Set $B_{i,m} = 0$ for $m = 0, 1, \ldots, n-1$, set $j = i$, set $k = i+1$
and set $N_F = 0$.
Step 1: If $N_F = t$, stop; else, go to Step 2.
Step 2: If $k = i$, stop; else, go to Step 3.
Step 3: If $a(j,k) = 1$, set $B_{i,k} = 1$, set $N_F = N_F + 1$ and go to
Step 4; else, set $j = k$ and go to Step 4.
Step 4: Set $k = k+1$ and go to Step 1.

Notes:
(i) All additions are performed modulo $n$.
(ii) We assume a $D_{1,t}$ interconnection design, i.e., $f(r,j) = j + r$
(modulo $n$); and therefore, we use the notation $a(j, j+r)$
instead of the more general notation $a(j, f(r,j))$.
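The steps of Algorithm 1 translate almost line for line into the Python sketch below. The function name build_table and the convention of passing the test outcomes as a callable a(j, k) are assumptions made for illustration; the assert inside the loop checks that only tests available in a $D_{1,t}$ design are consulted.

```python
def build_table(i, n, t, a):
    """Algorithm 1: build table B_i for a D_{1,t} design.

    a(j, k) returns the outcome of module j testing module k
    (0 = judged fault free, 1 = judged faulty).
    """
    B = [0] * n                      # Step 0
    j, k, nf = i, (i + 1) % n, 0
    while nf != t and k != i:        # Steps 1 and 2
        # Only tests that exist in a D_{1,t} design are used.
        assert 1 <= (k - j) % n <= t
        if a(j, k) == 1:             # Step 3
            B[k] = 1
            nf += 1
        else:
            j = k
        k = (k + 1) % n              # Step 4, addition modulo n
    return B

# Hypothetical outcomes: every tester reports the truth, U1, U3, U6 faulty.
faulty = {1, 3, 6}
truthful = lambda j, k: 1 if k in faulty else 0
print(build_table(0, 9, 3, truthful))
```

Starting from module 0, the sketch hands the role of tester to each module it finds fault free and stops once $t$ faults have been recorded or the ring has been traversed.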
Theorem 2: If a $D_{1,t}$ interconnection design is used, if the
maximum number of faults which may occur is $t$ and if module $U_i$
is fault-free, then the table $B_i$ constructed by the algorithm
accurately reflects the existing fault situation.
Proof: We need to show that the algorithm is well defined and
that it produces tables $B_i$ which are correct whenever $U_i$ is not
faulty. The technique we use to prove the theorem is based on
the use of invariant assertions as described in [2] (see Fig. 1).
We assume that a $D_{1,t}$ interconnection design is used, i.e.,
module $U_i$ tests the modules $U_{i+1}, U_{i+2}, \ldots, U_{i+t}$. The algorithm
uses the quantity $a(j,k)$ which contains the result of the test of
module $k$ by module $j$. It follows that the algorithm is well
defined if and only if $j$ and $k$ are related by
$k = j + r$
where $r$ is some integer in $[1, 2, \ldots, t]$.
Assume that before executing Step 3, the following assertion holds:
(A1) $j+1 \le k \le j+1+N_F$.
Then it can be shown that (A1) still holds after the execution of
Step 4. Clearly (A1) is satisfied by the initial values given to
$k$ and $N_F$ and therefore we conclude that (A1) is always
satisfied before the execution of Step 3.
It is only possible to reach Step 3 if
$N_F < t$.
It follows that just before the execution of Step 3, the quantities
$j$ and $k$ are related by the assertion
(A2) $j+1 \le k \le j+t$
which shows that the algorithm is well defined.
The first part of the proof showed that the algorithm is
well defined. We now prove that if $U_i$ is fault free, then the
table $B_i$ reflects the actual fault situation. Following again
the approach described in [2], we show that the following assertions
are always satisfied before the execution of Step 3:
(A3) The module $U_j$ is not faulty.
(A4) $B_i$ accurately reflects the existing fault situation up to
$k-1$, i.e., for all $m$ in $[i, i+1, i+2, \ldots, k-1]$,
$B_{i,m} = 0$ if and only if module $U_m$ is not faulty
and $B_{i,m} = 1$ if and only if module $U_m$ is faulty.
(A5) $N_F$ contains the number of faulty modules up to $k-1$, i.e.,
$N_F = \sum_{m=i}^{k-1} B_{i,m}$.
It can be shown that if (A3), (A4) and (A5) are true before the
execution of Step 3, then they are still true after the execution
of Step 4. Clearly (A3), (A4) and (A5) hold after the execution
of Step 0 and therefore we conclude that (A3), (A4) and (A5) are
always true before the execution of Step 3.
Now, suppose that the algorithm stops in Step 1. We know
that $B_i$ is correct up to $k$ and that $N_F = t$. In other words, $B_{i,m}$
correctly reflects the fault situation for $m = i, i+1, \ldots, k$ and $t$
faults have been detected. But we have assumed that at most $t$
faults may occur and therefore this implies that the remaining
modules are not faulty. The $B_{i,m}$ for $m = k+1, k+2, \ldots, i-1$ are equal
to 0 and therefore the complete table $B_i$ is correct.
Suppose that the algorithm instead stops in Step 2; then $B_i$
is correct up to $k = i$ and therefore $B_i$ is correct.
We have shown that when the algorithm stops, it
produces the correct table. It remains to be shown that it indeed
stops after a finite number of iterations. We note that $k$ takes
the values $i, i+1, i+2, \ldots$ and therefore if the algorithm does
not stop in Step 1, it must necessarily stop in Step 2. This
concludes the proof of the theorem.
ACCELERATED ALGORITHM
The diagnosis of the set of faulty modules based on
the results of Lemma 1, Lemma 2, and Theorem 1 requires that
the tables $B_i$, $i = 0, 1, \ldots, n-1$ be compared. This process is
time consuming and may be avoided. For each $j = 0, 1, \ldots, n-1$,
let $Y_j$ be the number of indices $i$ for which $B_{i,j} = 1$, i.e.,
$Y_j$ = cardinality of $\{ i \in [0, 1, \ldots, n-1] \mid B_{i,j} = 1 \}$.
Then, these quantities may be used in a diagnostic algorithm
as follows:

Algorithm 2: Let $t$ in $[1, \ldots, n-1]$ be given.
Step 0: Compute the tables $B_0, B_1, \ldots, B_{n-1}$ by using
Algorithm 1.
Step 1: Compute the quantities $Y_0, Y_1, \ldots, Y_{n-1}$.
Step 2: Let $V = \{ j \in [0, 1, \ldots, n-1] \mid Y_j \ge t+1 \}$.

Theorem 3: If a $D_{1,t}$ interconnection design is used, if the
maximum number of faults which may occur is $t$ and if $n \ge 2t+1$,
then $U_j$ is faulty if and only if $j$ is in $V$.
Proof: The result is a direct consequence of Lemmas 1 and 2
and Theorems 1 and 2.
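A compact Python sketch of Algorithm 2 follows. It restates Algorithm 1 as build_table so that the block is self-contained; the function names, the callable a(j, k), and in particular the answers given by the faulty testers in the example are assumptions made for illustration and are not the test outcomes of the paper's figures.

```python
def build_table(i, n, t, a):
    """Algorithm 1 restated: table B_i under a D_{1,t} design."""
    B, j, k, nf = [0] * n, i, (i + 1) % n, 0
    while nf != t and k != i:
        if a(j, k) == 1:
            B[k], nf = 1, nf + 1
        else:
            j = k
        k = (k + 1) % n
    return B

def diagnose(n, t, a):
    """Algorithm 2: return the set V of modules declared faulty."""
    tables = [build_table(i, n, t, a) for i in range(n)]          # Step 0
    Y = [sum(tables[i][j] for i in range(n)) for j in range(n)]   # Step 1
    return [j for j in range(n) if Y[j] >= t + 1]                 # Step 2

# Hypothetical outcomes for n = 9, t = 3 with U1, U3 and U6 faulty.
FAULTY = {1, 3, 6}

def outcome(j, k):
    if j in FAULTY:
        return (j + k) % 2          # arbitrary, unreliable answer
    return 1 if k in FAULTY else 0  # fault-free testers report the truth

print(diagnose(9, 3, outcome))      # prints [1, 3, 6]
```

No table comparison is needed: a column count of at least $t+1$ can only be reached with the help of a fault-free tester, which is the content of Theorem 3.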
Algorithm 2 is well suited for implementation on a microprocessor.
For example, on an Intel 8080 microprocessor, the total amount
of memory necessary to store the data and the program in the case
$n = 8$ and $t = 2$ is 176 words of 8 bits, i.e., 1408 bits.
We note that Algorithm 2 may be implemented in parallel
on a network of $N$ microprocessors, with $N \le n$. In particular,
if $N$ microprocessors are used, then it is possible to compute
in parallel all the tables $B_i$ and all the quantities $Y_j$. The
computational time necessary to diagnose the network of $n$ modules
using $N$ microprocessors for implementing Algorithm 2
is essentially $T \lceil n/N \rceil / n$, where $T$ is the computational time
necessary to execute the instructions of Algorithm 2 when a
single microprocessor is used and $\lceil n/N \rceil$ is the smallest
integer larger than or equal to $n/N$.
EXAMPLE
In order to demonstrate the simplicity of the algorithms, we
apply them to the network given in Figure 2. The network contains
$n = 9$ modules, $t = 3$, i.e., at most three modules may be faulty, and
a $D_{1,3}$ interconnection design is used, i.e., module $U_0$ tests
$U_1$, $U_2$ and $U_3$, module $U_1$ tests $U_2$, $U_3$, $U_4$, etc. Assume
that the modules $U_1$, $U_3$ and $U_6$ are faulty. Figure 3 contains a
possible set of test outcomes. The application of the algorithm
to these test outcomes yields the tables $B_i$, $i = 0, 1, 2, \ldots, 8$
given in Figure 4. We find that the tables $B_0$, $B_2$, $B_4$, $B_5$, $B_7$
and $B_8$ are identical. We have 6 identical tables and using Lemma 2,
we conclude that these tables reflect the correct fault situation
of the network, i.e., we conclude that the modules $U_1$, $U_3$ and $U_6$
are faulty. Alternatively, we may compute the quantities $Y_j$, i.e.,
$Y_0 = 0$, $Y_1 = 8$, $Y_2 = 0$, $Y_3 = 8$, $Y_4 = 0$, $Y_5 = 0$, $Y_6 = 8$, $Y_7 = 1$, and
$Y_8 = 1$, and then compute the set $V = \{ j \mid Y_j \ge 4 \} = \{1, 3, 6\}$.
Using Theorem 3, we conclude once again that $U_1$, $U_3$, and $U_6$ are
faulty.
CONCLUSION
An approach to the problem of fault diagnosis of symmetric
multiple processor architectures has been proposed. It consists
of constructing tables $B_i$, assuming that the corresponding
modules $U_i$ are not faulty, followed by a voting procedure.
The construction of the tables $B_i$ is decoupled in the sense
that each table may be constructed independently of the others.
It is possible to decrease the amount of computation necessary
to obtain all the tables $B_i$, $i = 0, 1, \ldots, n-1$ by increasing the
dependency between the construction of the various tables. It
is not difficult to find schemes in which the construction of
the table of $U_i$ depends on the tables of $U_0, U_1, \ldots$ Such
schemes are more complicated to code than the one we propose,
require more memory to store the program and do not lend themselves
to parallel implementation. Therefore, we feel that
our scheme, i.e., Algorithm 2, which has a time complexity of
$O(n^2)$ if sequentially implemented, and $O(n)$ if implemented on a
network of $n$ microprocessors, is ideally suited for the fault [...]
Figure 1. Flow chart of Algorithm 1, annotated with assertions (A1)
through (A5): interpretation when module $U_i$ is not faulty.
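 i \ k   0 1 2 3 4 5 6 7 8
   0     - 1 0 1 - - - - -
   1     - - 0 ? ? - - - -
   2     - - - 1 0 0 - - -
   3     - - - - 0 0 1 - -
   4     - - - - - 0 1 0 -
   5     - - - - - - 1 0 0
   6     0 - - - - - - 1 1
   7     0 1 - - - - - - 0
   8     0 1 0 - - - - - -
(- : $U_i$ does not test $U_k$; ? : entry illegible in this copy)
Figure 3. Test outcomes $a(i,k)$. Modules $U_1$, $U_3$ and
$U_6$ faulty.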
 i \ j   0 1 2 3 4 5 6 7 8
   0     0 1 0 1 0 0 1 0 0
   1     0 0 0 1 0 0 1 0 0
   2     0 1 0 1 0 0 1 0 0
   3     0 1 0 0 0 0 1 0 0
   4     0 1 0 1 0 0 1 0 0
   5     0 1 0 1 0 0 1 0 0
   6     0 1 0 1 0 0 0 1 1
   7     0 1 0 1 0 0 1 0 0
   8     0 1 0 1 0 0 1 0 0
Figure 4. Quantities $B_{i,j}$ obtained by using the algorithm
to process the test results given in Fig. 3.
REFERENCES
[1] A.M. Corluhan and S.L. Hakimi, "On an algorithm for identifying
faults in a T-diagnosable system", Proceedings of the
1976 Conference on Information Sciences and Systems, The
Johns Hopkins University, 1976, pp. 370-375.
[2] R.W. Floyd, "Assigning meanings to programs, mathematical
aspects of computer science", Proceedings of Symposia in
Applied Mathematics, American Mathematical Society, Providence,
Rhode Island, 1967, pp. 19-32.
[3] S.L. Hakimi and A.T. Amin, "Characterization of the connection
assignment of diagnosable systems", IEEE Transactions
on Computers, Vol. 23, January 1974, pp. 86-88.
[4] T. Kameda, S. Toida and F. Allan, "A Diagnosing Algorithm
for Networks", Information and Control, Vol. 29 (1975), pp. 141-148.
[5] S.N. Maheshwari and S.L. Hakimi, "On models of diagnosable
systems and probabilistic fault diagnosis", IEEE Trans. on
Computers, March 1976, pp. 228-236.
[6] G.G.L. Meyer, "A segmented algorithm for solving a class of
constrained discrete optimal control problems", IEEE Trans.
on Automatic Control, Vol. AC-19, No. 2, April 1974,
pp. [illegible].
[7] F.P. Preparata, G. Metze and R.T. Chien, "On the connection
assignment problem of diagnosable systems", IEEE Trans. on
Computers, Vol. EC-16, No. 6, December 1967, pp. 848-854.
[8] J.D. Russell and C.R. Kime, "On the diagnosability of digital
systems", 1973 Intl. Symposium on Fault Tolerant Computing,
IEEE Computer Society Publications, June 1973, pp. 139-144.
