Fault diagnosis in distributed computer networks by Dincer, Ibrahim.
Calhoun: The NPS Institutional Archive
Theses and Dissertations Thesis Collection
1987-12

















Thesis Advisor Jon T. Butler
Approved for public release; distribution is unlimited.

SECURITY CLASSIFICATION OF THIS PAGE
REPORT DOCUMENTATION PAGE
1a REPORT SECURITY CLASSIFICATION
Unclassified
1b RESTRICTIVE MARKINGS
2a SECURITY CLASSIFICATION AUTHORITY
2b DECLASSIFICATION / DOWNGRADING SCHEDULE
3 DISTRIBUTION / AVAILABILITY OF REPORT
Approved for public release;
distribution is unlimited
4 PERFORMING ORGANIZATION REPORT NUMBER(S)
NPS62-87
5 MONITORING ORGANIZATION REPORT NUMBER(S)






7a NAME OF MONITORING ORGANIZATION
Naval Postgraduate School
6c ADDRESS (City, State, and ZIP Code)
Monterey, CA. 93943-5000
7b ADDRESS (City, State, and ZIP Code)
Monterey, CA. 93943-5000




9 PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER









11 TITLE (include Security Classification)
Fault Diagnosis In Distributed Computer Networks
12 PERSONAL AUTHOR(S)
Dincer, Ibrahim











18 SUBJECT TERMS (Continue on reverse if necessary and identify by block number)
Preparata-Metze-Chien, Computer-Aided-Design
19 ABSTRACT (Continue on reverse if necessary and identify by block number)
This thesis introduces the concept of a diagnosis algorithm in the
context of the Preparata-Metze-Chien (PMC) model. It presents a
Computer-Aided-Design (CAD) tool for use in analyzing such algorithms.
That is, with this tool, the user can establish a multiprocessor system
and a set of test outcomes, and then analyze the properties of a specified
distributed diagnosis algorithm. Examples in this thesis include a
system in which: 1. Correct diagnosis is achieved in a small number of
iterations. 2. Correct diagnosis is never achieved. 3. An
oscillating situation exists in which faulty processors become
alternately enabled and disabled.
20 DISTRIBUTION / AVAILABILITY OF ABSTRACT
UNCLASSIFIED/UNLIMITED   SAME AS RPT   DTIC USERS
21 ABSTRACT SECURITY CLASSIFICATION
22a NAME OF RESPONSIBLE INDIVIDUAL
Prof. Jon T. Butler




DD FORM 1473, 84 MAR    83 APR edition may be used until exhausted
All other editions are obsolete
SECURITY CLASSIFICATION OF THIS PAGE




B.S., War Academy, Ankara, Turkey, 1978
Submitted in partial fulfillment of the
requirements for the degree of






A. NEED FOR STUDY
1. Preparata-Metze-Chien (PMC) Model
2. Perfect Tester
3. 1-Fail Safe Tester
4. 0-Fail Safe Tester
5. Ab Model
6. Aμ Model
7. AX Model
8. Partial Tester
9. Zero Information Tester
B. PROBLEM ENVIRONMENT
II. BACKGROUND
A. PREPARATA-METZE-CHIEN (PMC) GRAPH MODEL
B. ONE-STEP T-FAULT DIAGNOSABLE SYSTEMS
1. Necessary and Sufficient Conditions
2. Optimal Designs for One-Step t-Fault Diagnosability
C. SEQUENTIALLY DIAGNOSABLE SYSTEMS
D. GENERALIZATION OF FAULTS
1. tp-Fault Diagnosability
2. ti-Fault Diagnosability
3. t/s-Diagnosability





A. SIMPLE DIAGNOSABILITY TEST FOR MULTIPROCESSING SYSTEM
B. RECONFIGURATION
C. RELATIONSHIP BETWEEN ENABLED/DISABLED UNITS AND SYSTEM RELIABILITY
IV. METHOD OF APPROACH
A. WHY A CAD-TOOL?
B. TOOL DEFINITIONS
C. TOOL SPECIFICATIONS
D. TOOL REALIZATION
V. RESULTS
VI. CONCLUSIONS AND RECOMMENDATIONS
A. CONCLUSIONS
B. RECOMMENDATIONS
APPENDIX A: SOURCE CODE OF CAD-TOOL
APPENDIX B: HAND CALCULATIONS OF DIFFERENT CASES
LIST OF REFERENCES
INITIAL DISTRIBUTION LIST
LIST OF TABLES
1. Different Models of System Diagnosis
2. Menu of CAD-Tool
LIST OF FIGURES
2.1 Five Processor Multiprocessor System with Faulty Units and Test Outcomes
2.2 Assumed Test Outcomes in Preparata-Metze-Chien Model
2.3 A System and Associated Test Outcomes
2.4 An Example of Sequential Diagnosis Connection for n=14 and t=6
2.5 Five Processor Multiprocessor System for Two Arrangements of Faulty Processors
4.1 Flow Chart of CAD-Tool
4.2 Detailed Flow Chart of CAD-Tool
5.1 CAD-tool menu and test outcomes
5.2 Case 1 initial condition
5.3 Case 1 first iteration
5.4 Case 1 second iteration
5.5 Case 1 third iteration
5.6 Case 3 initial condition
5.7 Case 3 first iteration
5.8 Case 3 second iteration
5.9 Case 3 third iteration
5.10 Case 6 initial condition
5.11 Case 6 first iteration
5.12 Case 6 second iteration
5.13 Case 6 third iteration
5.14 Case 6 fourth iteration
5.15 Case 19 initial condition
5.16 Case 19 first iteration
5.17 Case 19 second iteration
5.18 Case 19 third iteration
5.19 Case 19 fourth iteration
5.20 Case 19 fifth iteration
ABSTRACT
This thesis introduces the concept of a distributed diagnosis algorithm in the context
of the Preparata-Metze-Chien (PMC) model. It presents a Computer-Aided-Design
(CAD) tool for use in analyzing such algorithms. That is, with this tool, the user can
establish a multiprocessor system and a set of test outcomes, and then analyze the
properties of a specified distributed diagnosis algorithm.
Examples in this thesis include a system in which:
1. Correct diagnosis is achieved in a small number of iterations.
2. Correct diagnosis is never achieved.
3. An oscillating situation exists in which faulty processors become alternately
enabled and disabled.
ACKNOWLEDGEMENTS
I would like to thank my advisor, Professor Jon T. Butler, for his valuable
assistance and patient guidance. I appreciate his great support and encouragement.
I wish to express my deep respect to my government for sending me here for this
education.
I would also like to thank Dr. Dana Madison for his great contribution.
Special thanks to my wife Makbule and my son Melih for their continuous support.
I. INTRODUCTION
A. NEED FOR STUDY
The advent of inexpensive microprocessor elements has made multiprocessor
computing networks much more practical. This fact has led to an increasing interest in
the high reliability of such networks. The prospect of ultra reliability has inspired
research into the use of computers in applications where low reliability previously
precluded their use. These include aircraft control systems, for which the Federal
Aviation Administration (FAA) has specified as a standard a probability of failure of
10^-9 in a 10 hour operating period [Ref. 1].
The traditional approach to computer reliability is through redundancy, where reliable
outputs are the result of a vote on three or more less reliable outputs. In the theory of
system diagnosis [Ref. 2], a graph is used to model a multiprocessing system, where nodes
represent the processors and arcs represent tests between processors. One goal of the
theory is to determine which tests achieve the highest tolerance to faults. It has been
shown [Ref. 3] that for the same system reliability, greater throughput can be achieved
from a system diagnosis approach than from modular redundancy. Conversely, for the same
throughput, a system diagnosis approach yields greater reliability [Ref. 3].
Beginning with the Preparata-Metze-Chien model, many models have been developed
for system diagnosis. The best known models are the following [Ref. 4]:
1. Preparata-Metze-Chien (PMC) model: This model was used in this research and
will be explained in Chapter II. This model is represented by Ap in Table 1.1.
2. Perfect tester: In this model, test outcomes correspond to perfect diagnosis of
faulty units. In other words, if the tested unit is faulty, the test outcome will be
fail (1) no matter what the status of the testing unit is (faulty or fault-free). If the
tested unit is fault-free, the test outcome will be pass (0) regardless of the status of the
testing unit. This model is represented by Aa in Table 1.1.
3. 1-Fail safe tester: This model never has an incorrect 0. That is, there may be
incorrect fail test outcomes (e.g., when a faulty unit tests a fault-free unit, the test
outcome will be 1), but there will never be an incorrect pass (0) outcome. It is
represented by Aw in Table 1.1.
4. 0-Fail safe tester: This model never has an incorrect 1. That is, when a faulty unit
tests another faulty unit, the test outcome will be 0, which is an incorrect pass outcome.
However, there is never an incorrect fail test outcome. The model is represented by Ay in
Table 1.1.
5. Ab is a model in which a faulty unit will never incorrectly diagnose another faulty
unit. However, in this model a faulty unit testing a fault-free unit will produce 0 or 1
arbitrarily.
6. Aμ is a model in which a faulty testing unit may not correctly diagnose another
faulty unit. Test outcomes can be 0 or 1 arbitrarily.
7. Ax is a model in which a faulty testing unit always diagnoses a fault-free unit
incorrectly, producing a fail test outcome. However, a faulty testing unit produces 0 or 1
arbitrarily for a faulty tested unit.
8. Partial tester: In this model, there is the possibility that a fault-free testing unit
cannot correctly diagnose a faulty unit. This model is examined by Simoncini and
Friedman [Ref. 5]. They considered the problem where system tests may be incomplete,
i.e., a fault-free unit may be able to detect faulty units only with percentage p (p <
100). This model is represented by Apt in Table 1.1.
9. Zero information tester: This model provides no reliable test outcomes. This
model was considered by Marion L. Blount [Ref. 6]. Several different fault detection
situations can be addressed:
a. A fault-free unit can fail to diagnose another fault-free unit.
b. A fault-free unit can fail to diagnose a faulty unit.
c. A faulty unit can give a correct diagnosis of another unit (faulty or fault-free).
This model is represented by Ao in Table 1.1.
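Tester -> Tested            Aa   Aw   Ab   Ay   Aμ   AX   Ap   Apt  Ao
fault-free -> fault-free    0    0    0    0    0    0    0    0    X
fault-free -> faulty        1    1    1    1    1    1    1    X    X
faulty -> fault-free        0    1    X    0    0    1    X    X    X
faulty -> faulty            1    1    1    0    X    X    X    X    X
0 = pass, 1 = fail, X = arbitrary (0 or 1)
Table 1.1 Different models of system diagnosis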
All the models mentioned previously apply to a graph theoretic system. Analysis of
such systems is typically done by hand calculation, which limits the number of units.
The number of system fault configurations that can be examined is small as well. Thus,
the analysis of such systems is difficult. Also, there is much interest in making the model
more realistic. This, in fact, inspired the models described. For example, Ab was proposed
to model tests among processors that consist of comparing results of computations. The
goal of this thesis is to further improve the model. Specifically, it addresses the problem
of reconfiguration, where there has been relatively little study so far.
B. PROBLEM ENVIRONMENT
The fault diagnosis problem is to determine the faulty processors given the set of test
outcomes. Almost all previous studies have assumed a central diagnoser, which collects
all of the test results and identifies faulty processors from them. This assumption
simplifies the problem and avoids the complexities of reliable replacement. But a central
diagnoser is also a processor, which might fail. In this case, system diagnosis may not be
accurate. To provide accurate system diagnosis, the central diagnoser should be ultra
reliable. This will be expensive and will require extra maintenance effort. To overcome
these difficulties, distributed system diagnosis is proposed. In the distributed systems




A. PREPARATA-METZE-CHIEN (PMC) GRAPH MODEL
A multiprocessing system is composed of n processors. Each processor is called a
unit (node), where a unit is a well-identifiable portion of the system which cannot be
further decomposed for the purpose of diagnosis. Units are indicated by Ui, 0 ≤ i ≤ n-1.
These units must be powerful enough to test other individual units. A test
corresponds to an arc between processors with the arrow pointing to the tested unit. Arcs
are denoted by aij, where i is the number of the unit doing the test and j is the number
of the unit being tested. Each test has two outcomes, pass and fail; 0's correspond to
pass test outcomes and 1's correspond to fail test outcomes. Faulty processors are
indicated by X's. Figure 2.1 shows a 5-processor multiprocessor system, where U2 and
U3 are faulty. A test is meaningful only if the testing unit itself is fault-free; otherwise








Figure 2.1 Five processor multiprocessor system with faulty units and test outcomes
Figure 2.2 shows how test results occur in the model we have chosen. The top arc
goes from a fault-free node to a fault-free node, and for this case a 0 (pass) outcome is
always produced. The second arc goes from a fault-free node to a faulty node, and for this
case a 1 (fail) outcome is always produced. The third arc goes from a faulty node to a
fault-free node, and the fourth arc goes from a faulty node to a faulty node. The outcomes
of the last two cases are unpredictable and can be 0 or 1 arbitrarily.
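The four cases of Figure 2.2 can be written as a small C function. This is only an illustrative sketch: the names `outcome_t`, `pmc_outcome`, and the `faulty[]` array are assumptions made here and do not come from the thesis source code in Appendix A.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible results of one test in the PMC model (hypothetical type). */
typedef enum { PASS = 0, FAIL = 1, ARBITRARY = 2 } outcome_t;

/* Outcome of unit `tester` testing unit `tested` under the PMC model.
 * faulty[i] is true when unit i is faulty (assumed representation). */
outcome_t pmc_outcome(const bool faulty[], int tester, int tested)
{
    assert(tester >= 0 && tested >= 0 && tester != tested);
    if (faulty[tester])
        return ARBITRARY;                 /* faulty tester: result may be 0 or 1 */
    return faulty[tested] ? FAIL : PASS;  /* fault-free tester is always correct */
}
```

For the system of Figure 2.1 (U2 and U3 faulty), the test of U2 by U1 is a reliable fail, while any test performed by U2 or U3 is arbitrary.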
Definition 1: The set of test outcomes aij represents the syndrome of the system;
obviously aij can be assigned if and only if the corresponding testing link exists [Ref. 3:
p-848]. In Figure 2.1 the syndrome of the system for one loop will be (a01, a12, a23, a34,
a40), where the left to right arrangement of the aij is intended to reflect the direction of
the loop. Diagnosis is the process of determining the faulty units given a set of test outcomes.












[Figure 2.2 legend: symbols for fault-free and faulty units]















      a01  a12  a23  a34  a40
a)     X    0    0    0    1
b)     1    X    0    0    0
c)     X    X    0    0    1
Figure 2.3 A system and associated test outcomes
Faults in units Ui and Uj are distinguishable if the syndromes associated with them are
different. The two faults are indistinguishable if the syndromes associated with them
may be the same. These definitions may be directly extended to
distinguishable and indistinguishable sets of faults, called fault patterns. Figure 2.3
depicts a system and its test outcomes for three different cases. If U0 is faulty, the
syndrome shown in line a is produced. If U1 is faulty, the syndrome shown in line b is
produced. They are distinguishable since the value of a40 is different. The multiple fault
pattern {U0, U1} has the syndrome in line c, and since it may be the same as the
syndrome for the fault {U0} (depending on the unpredictable values of a01 and a12), {U0}
and {U0, U1} are indistinguishable.
B. ONE-STEP T-FAULT DIAGNOSABLE SYSTEMS
Definition 2: A system of n units is one-step t-fault diagnosable if all faulty units
within the system can be uniquely identified, provided the number of faulty units present
does not exceed t [Ref. 3].
1. NECESSARY AND SUFFICIENT CONDITIONS:
In this section we investigate the relationship between n and t (the number of
faulty units) for one-step diagnosable systems.
Theorem 1: If a system with n units is one-step t-fault diagnosable, then n ≥ 2t+1.
Conversely, if n ≥ 2t+1, it is always possible to provide a connection to form a system
that is one-step t-fault diagnosable [Ref. 3].
Proof: To prove the converse, we construct a maximally connected graph, that is,
we make a connection among all possible pairs of these n units in both directions. One
characteristic of such a graph is that there exists a loop connecting any subset of the n
units. It is easily verified that, given any loop connecting z units with all test outcomes
in the loop exhibiting the value 0, the z units in the loop are either all faulty or all
fault-free. In particular, if z ≥ t+1, all units in the loop must be fault-free. Otherwise,
this would violate the hypothesis on the maximum number of faulty units. Locating a loop
of t+1 or more fault-free units essentially completes the diagnosis process, since any
identified fault-free unit immediately locates all faulty units through direct links.
Since the system can have at most t faulty units, it must contain at least t+1 fault-free
units; hence the existence of a loop of t+1 or more fault-free units is guaranteed.
For a system with n < 2t+1 units and an arbitrary connection, we show the
existence of two distinct allowable fault patterns that may result in exactly the same
syndrome. An allowable fault pattern for our specific case is any fault pattern with at
most t faulty units. We can consider odd and even n as two separate cases, but both
cases are analogous. Consider the case of an even number of nodes, n = 2t0 with t0 ≤ t.
We partition the system into two parts, P1 and P2, each with the same number of
units t0. Suppose all units in P1 are faulty and all units in P2 are fault-free. Then, all links
between units within P2 will have the value 0 and all links pointing from units in P2 to units
in P1 will have the value 1. Since the units in P1 are faulty, many possible configurations of
values may occur. One such possible configuration is for all links between units in P1 to
have the value 0 and all links pointing from units in P1 to units in P2 to have the value 1. From
symmetry, it is seen that when all units in P1 are fault-free and all units in P2 are faulty,
the same pattern of test results may occur. Hence, it is not always possible for the system
to differentiate between the two allowable fault patterns, and the system is not one-step
t-fault diagnosable [Ref. 3: p-850].
2. OPTIMAL DESIGNS FOR ONE-STEP t-FAULT DIAGNOSABILITY:
For this model it has been shown that the number of units n must be at least 2t+1
for a system to be one-step t-fault diagnosable. We now derive a lower bound on the
number of units that test a particular unit.
Theorem 2: In a one-step t-fault diagnosable system, each unit is tested by at least t
other units [Ref. 3: p-850].
Proof: Suppose, to the contrary, that the system is one-step t-fault diagnosable and that
U1, U2, ..., Uk are all the units in the system which test a certain unit U0, with
k < t. Consider the case in which U1, U2, ..., Uk are all faulty. The outcomes of the tests
performed by these faulty units may, of course, assume arbitrary values. Hence there is
no reliable test being performed on U0, and the two legitimate fault patterns {U1, U2,
..., Uk} and {U0, U1, U2, ..., Uk}, neither of which has more than t faults, are not
distinguishable. Hence, according to Definition 2, the system is not one-step t-fault
diagnosable. Since a contradiction has been reached, the assertion stated in the theorem
is proved.
Definition 3: A one-step t-fault diagnosable system is said to be optimal if n =
2t+1 and each unit is tested by exactly t units [Ref. 3: p-850].
In general, many optimal designs exist for a system. To describe a family of such
designs Dt, it is convenient to designate the n units by U0, U1, ..., Un-1, and to perform
any computation on the subscripts modulo n. We will consider a class of designs in
which the testing connection at each unit is identical. In fact, whether there is a testing
link from Ui to Uj depends entirely upon the value of l = j-i (modulo n). A test exists if and
only if 1 ≤ l ≤ t. Preparata, Metze and Chien [Ref. 2] showed that a design Dt is an
optimal one-step t-fault diagnosable system.
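The connection rule of the design Dt can be sketched in C as follows. The sketch assumes an adjacency-matrix representation; the names `build_dt`, `testers_of`, and `MAX_UNITS` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_UNITS 20  /* illustrative bound, not taken from the thesis code */

/* Fill link[i][j] = true when unit i tests unit j in the design Dt:
 * a link exists iff (j - i) mod n lies in the range 1..t. */
void build_dt(int n, int t, bool link[MAX_UNITS][MAX_UNITS])
{
    assert(n >= 1 && n <= MAX_UNITS && t >= 0 && t < n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int offset = ((j - i) % n + n) % n;  /* non-negative residue */
            link[i][j] = (offset >= 1 && offset <= t);
        }
    }
}

/* Number of units that test unit j. */
int testers_of(int n, bool link[MAX_UNITS][MAX_UNITS], int j)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (link[i][j])
            count++;
    return count;
}
```

With n = 5 and t = 2, every unit is tested by exactly 2 units, so the design uses nt = 10 testing links in total.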
C. SEQUENTIALLY DIAGNOSABLE SYSTEMS:
Definition 4: A system of n units is sequentially t-fault diagnosable if at least one
faulty unit can be identified without replacement, provided the number of faulty units
present does not exceed t [Ref. 3: p-849].
It is obvious that every system which is one-step t-fault diagnosable is also
sequentially diagnosable. But a system which is sequentially diagnosable may not be
one-step t-fault diagnosable. In the previous section, we have seen that nt links are
required for a system of n units to be one-step t-fault diagnosable (design Dt). The
investigation of sequentially diagnosable systems is motivated by the expectation that
fewer test links are required in such systems. Theorem 1 is valid for sequentially
diagnosable systems also. Hence for any sequentially t-fault diagnosable system, n ≥
2t+1.
Theorem 3: There exists a class of designs with N = n+2t-2 testing links that are
sequentially t-fault diagnosable [Ref. 3: p-852].
Proof: Consider the following design. First, connect all units U0, U1, ..., Un-1 in a
loop such that for every i there is a link from Ui to Ui+1 (all subscripts are taken modulo
n). Second, select a subset S1 of 2t-2 units from the set {U1, U2, U3, ..., Un-2} and
establish a link from each unit of S1 to U0. This is shown in Figure 2.4. Let the number
of testing signals from S1 and Un-1 to U0 having the value 0 (1) be n0 (n1), so that
n0 + n1 = 2t-1. The following cases are possible:
Case 1: n1 > t. The assumption that U0 is not faulty implies that n1 > t units are faulty,
thus violating the hypothesis on the maximum number of faulty units. Therefore n1 > t
implies that U0 is faulty.
Case 2: n1 < t. Since n0 + n1 = 2t-1, n1 < t gives n0 ≥ t. The assumption that U0 is
faulty implies that the n0 ≥ t units which passed it are also faulty, for a total of at least
t+1 faulty units. This also violates the hypothesis.
Therefore n1 < t implies that U0 is not faulty.
Case 3: n1 = t. Consider the set S' = S1 ∪ {Un-1} ∪ {U0}, a total of 2t units. If U0
is not faulty, the set contains n1 = t faulty units; if U0 is faulty, the set contains U0 and
n0 = t-1 additional faulty units, for a total of t. In both cases the set contains t faulty units.
We conclude that all units of the system not contained within the set S' are not faulty, and
at least one fault-free unit can be identified. Therefore, n1 = t implies the existence and
identification of at least one fault-free unit.
To locate at least one faulty unit we proceed as follows. In case 1, U0 is the faulty
unit. In cases 2 and 3 we have located at least one fault-free unit. To locate a faulty unit
we simply travel along the loop of testing links in the direction of the arrows. We follow
the test signals until we see a 1 for the first time; the unit being tested by this link is faulty
[Ref. 3: p-852]. So, considering all three cases above, we have identified at least
one faulty unit, which is what sequential diagnosis requires.
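The case analysis on U0 and the walk along the loop can be sketched in C. This is an illustration under assumptions: `classify_u0`, `first_faulty_on_loop`, and the array `loop_outcome[]` (where entry i holds the outcome of Ui testing Ui+1, modulo n) are hypothetical names, and the walk must start from a unit already known to be fault-free.

```c
#include <assert.h>

typedef enum { U0_FAULTY, U0_FAULT_FREE, U0_UNDECIDED } u0_status_t;

/* n1 = number of fail (1) outcomes among the 2t-1 tests applied to U0
 * by the units of S1 and by Un-1. */
u0_status_t classify_u0(int n1, int t)
{
    assert(t >= 1 && n1 >= 0 && n1 <= 2 * t - 1);
    if (n1 > t) return U0_FAULTY;       /* Case 1 */
    if (n1 < t) return U0_FAULT_FREE;   /* Case 2 */
    return U0_UNDECIDED;                /* Case 3: units outside S' are fault-free */
}

/* Walk the loop from a known fault-free unit `start`. Returns the index of
 * the first unit found faulty, or -1 if every test along the loop passes. */
int first_faulty_on_loop(int n, const int loop_outcome[], int start)
{
    for (int k = 0; k < n; k++) {
        int i = (start + k) % n;        /* current (fault-free) tester */
        if (loop_outcome[i] == 1)
            return (i + 1) % n;         /* unit tested by this link is faulty */
    }
    return -1;
}
```

The walk is sound because each tester reached before the first 1 has itself just been passed by a fault-free unit, so its own test result is reliable.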
Figure 2.4 An example of sequential diagnosis connection for n=14 and t=6
D. GENERALIZATION OF FAULTS
tp-fault diagnosability: A system is tp-fault diagnosable if and only if the application of
the test set identifies precisely which faults are present, provided the number of faults
does not exceed tp [Ref. 9]. (This is precisely one-step t-fault diagnosability.)
Most work on the self-diagnosability of systems has assumed that only permanent
(solid) faults can be present. Consideration of intermittent faults is generally difficult,
since it requires a model of the behavior of these faults in a system and also requires
interactive testing strategies to detect faults. Mallela and Masson [Ref. 10] consider the
effect of intermittent faults in diagnosable systems. The existence of both permanent and
intermittent faults in a system, for example, affects the test outcome which is received
after repeated applications of the test routines. This outcome may generate an incomplete
diagnosis of faulty units, since not all the faulty units in the system may be detected.
ti-fault diagnosability: A system is ti-fault diagnosable if, in the presence of ti
intermittent faults, no fault-free unit will ever be diagnosed as faulty, and diagnosis will
be at worst incomplete [Ref. 4].
In general, the fact that a system is tp-fault diagnosable does not necessarily imply
that it is also ti-fault diagnosable. Mallela and Masson also give necessary and sufficient
conditions for one-step ti-fault diagnosability.
t/s-diagnosability: A multiprocessing system is t/s-diagnosable if one can always
identify a set of processors of size s or less which contains all permanently faulty
processors, provided there are no more than t faulty processors. In general, t ≤ s, and so
there is a relaxation of the restriction in previous studies that no fault-free processors can
be replaced [Ref. 7].
E. SMITH'S ALGORITHM:
Consider three replacement algorithms [Ref. 8] for faulty processors:
ST1: At each step, perform the tests and replace processors which fail at least one
test with randomly chosen spares. If all test results are pass, the system is assumed to be
correct.
ST2: At each step, perform the tests and replace processors which fail the maximum
number of tests. Replaced processors are placed back into the set of spares. If all test
results are pass, the system is assumed to be correct.
ST3: At each step, perform the tests and replace processors which fail the maximum
number of tests. Put these into SPARE-II and replace them with randomly selected
spares from SPARE-I. If the number of processors in SPARE-I is not sufficient, then
choose any additional needed spares randomly from SPARE-II. If all test
results are pass, the system is assumed to be correct (initially, all spares are in SPARE-I
and SPARE-II is empty).
ST1 is fast but tends to replace many fault-free processors (those which fail at least
one test applied by a faulty processor). ST2 replaces fewer fault-free processors, but it is
slower. ST3 is the most sophisticated, since it tends to maintain an enrichment in the set
of fault-free processors, and resorts to selection of suspected faulty spare processors only
when necessary [Ref. 8].
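The selection step shared by ST2 and ST3 (replace the processors which fail the maximum number of tests) can be sketched in C. The function name `select_max_failers` and its arguments are hypothetical, and the treatment of ties (every processor that reaches the maximum is selected) is an assumption made here, not a detail stated in the text.

```c
#include <assert.h>
#include <stdbool.h>

/* fails[i] = number of tests that unit i failed in this step.
 * Sets replace[i] for every unit whose count equals the maximum
 * (assumption: all tied units are selected).
 * Returns the number of units selected; 0 means all tests passed. */
int select_max_failers(int n, const int fails[], bool replace[])
{
    assert(n > 0);
    int max = 0;
    int selected = 0;
    for (int i = 0; i < n; i++)
        if (fails[i] > max)
            max = fails[i];
    for (int i = 0; i < n; i++) {
        replace[i] = (max > 0 && fails[i] == max);
        if (replace[i])
            selected++;
    }
    return selected;
}
```

A return value of 0 corresponds to the stopping condition in which all test results are pass and the system is assumed to be correct.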
d-disabling rule: Processor Ui is disabled (e.g: not allowed to participate in




Figure 2.5 Five processor multiprocessor system for two arrangements of faulty
processors, (a) and (b)
Consider the 1-disabling rule in Figure 2.5(a) and assume U2 and U3 are faulty and
enabled. Then U4 is disabled even though it is fault-free. U0 is also fault-free and
disabled. However, since U1 fails no test, it will become enabled permanently. It
follows that U2 and U3 will eventually be disabled. Thus the fault-free nodes U4 and U0,
which were originally disabled, will become enabled permanently. Consider the system in
Figure 2.5(b), where there are also two faulty units, and assume the 1-disabling rule
applies as before. If U2 and U4 are enabled before any of the other processors are enabled,
the fail test outcomes they produce disable U0, U1 and U3. Since all fault-free processors
are disabled and the tests among faulty processors are pass, both faulty processors are
enabled. Unlike the case just discussed, the system will never correct itself. Thus, a
permanent situation exists where all faulty processors are enabled and all fault-free
processors are disabled. In the same figure, if we apply the 2-disabling rule with the same
initial conditions (i.e., U2 and U4 are faulty and enabled), the fault-free processors will
eventually become disabled, while only one of the faulty processors will be disabled.
Thus, the 1- and 2-disabling rules lead to an unsatisfactory diagnosis.
III. PROBLEM
A. SIMPLE DIAGNOSABILITY TESTS FOR MULTIPROCESSING SYSTEMS
Recall that we are interested in distributed fault diagnosis of the system, since ultra
reliability can be achieved less expensively. The basic idea behind distributed
self-diagnosis is that the diagnosis algorithm is executed on the remaining intact units of
the system. In contrast to central diagnosis, which assumes an external (perfect) unit
for computing diagnosis results, distributed diagnosis is performed throughout the system.
First, a node is diagnosed by its immediate neighboring nodes. In a second step, these
local diagnosis results are used to disable processors.
To achieve distributed fault diagnosis in a system, each unit is equipped with
disabling circuitry. Thus, testing processors can determine the status of the tested
processor. The problem of identifying how many faulty processors can be tolerated
before it is impossible to correctly identify them is a very difficult task in general
multiprocessing systems. For example, in some cases, as is shown in Chapter II, Figure
2.3, two different fault patterns produce the same test outcome (syndrome).
The problem of locating faulty processors within a multiprocessor system by
temporarily halting normal operation and placing it in a diagnostic mode has been
studied using the PMC model. When the number of modules in the system is large, some
of them will be idle at a given moment. A test may be any sort of check by one processor
on the operation of another, including applying test vectors and checking the resulting
outputs. In a concept introduced by Nair, Metze, and Abraham [Ref. 9] called "roving
diagnosis", one part of the system diagnoses a second part, while the remainder of the
system continues normal operation. The part most recently diagnosed as fault-free then
takes its turn in diagnosing other parts. Thus, there appears to be a subsystem of
diagnosing and diagnosed units which "roves" through the system until no part of it
remains undiagnosed. However, roving diagnosis must ensure that the first diagnosis will
produce unique, identifiable results. The checks are performed at the system level on data
elements that constitute the results of computations on these systems. It is assumed [Ref.
10: 298] that each processor has a local memory on which it performs reads and writes.
In addition, it can communicate with other processors in the system through the buffers at
various input and output ports. A processor cannot read or write from any other
processor's local memory, even in the presence of a fault. A fault is any condition that
causes a malfunction in a single processor while performing operations.
B. RECONFIGURATION
Definition 5: A system is c-correctable using the d-disabling rule if and only if:
1. All faulty nodes are eventually permanently disabled.
2. All fault-free processors are eventually permanently enabled, provided there are c
or fewer faulty nodes [Ref. 7].
The main goal in system reconfiguration is to switch in all fault-free units and to
switch out all faulty units. This switching is not between two working systems, but
between a working system and spares. The goal is not only to switch out the faulty units
but also to keep the working system functional. That gives more flexibility to the system
but increases the cost. The problem is to derive a distributed strategy for correct
switching which is insensitive to the arrangement of faulty processors. Sometimes it may
be difficult to replace a specific processor, so rearrangement of the applied tests can give
more accurate results. A flexible test arrangement will allow an approach which views
the diagnostic task as one of arranging processors into two groups, a working group and a
spare group. Another approach is to have three groups: one group for critical operations,
one for noncritical operations, and one for spares. However, in this thesis we will
consider only the first approach.
C. RELATIONSHIP BETWEEN ENABLED/DISABLED UNITS AND SYSTEM
RELIABILITY
In an implementation of distributed diagnosis, two major problems must be considered
in order to have correct diagnostics:
1. Reliable implementation of the disabling criteria and function.
2. Reliable transmission of the appropriate test signals (pass, fail) and of the result
signals of the disabling criteria (enabled or disabled) for system units.
It should be noted that in distributed diagnosis, only local information is used to
identify faulty processors. In central diagnosis all test results are used. Thus, we would
expect distributed diagnosis to be less accurate. This manifests itself in a smaller number
of faulty nodes which can be tolerated in distributed diagnosis.
IV. METHOD OF APPROACH
A. WHY A CAD-TOOL?
Our approach to the problem of developing diagnosis strategies is to develop a CAD
(Computer Aided Design) tool for the simulation of different fault patterns and different
reconfiguration strategies. Previous studies have all relied on hand calculations for this
purpose. When the number of units in the system grows beyond seven, hand
calculations become complex, so the user can simulate only a limited number of
units and fault patterns. Using the CAD-tool, the user can simulate from 2 to 20 units
with various fault patterns. The restriction to 20 units is due to limitations of the monitor
screen.
Thus, the tool gives the user the opportunity to simulate a large number of
units and fault patterns in a system. The number of units in a network is known in
advance and can be predefined in the tool program. The names and number of faulty
nodes are determined by the user. Testing connections can be predefined by the user or
the program. The user chooses the test procedure (worst_case or user_defined_case)
and also defines the disabling criteria. After input by the user, the
CAD-tool determines the test results and the disabled and enabled units, and then displays
the system on a control unit monitor. By using the CAD-tool, a computer network is
automatically controlled without any hand calculation.
B. TOOL DEFINITIONS
This CAD-tool is written in the C programming language [Ref. 12] using the PMC graph
model. The terms used in the program are listed below with short explanations:
N = The number of units in the system (may range from 1 to 20).
f = The number of faulty nodes (0 < f < N-1).
T = The number of units which test one unit. This number is the same for all units.
Test results for each test connection are determined by the program according to the
user's choice of the worst case or an arbitrary case.
For the worst case, the program itself determines all test results. That is, faulty testing
units produce a fail (1) test outcome for fault-free tested units and a pass (0) test outcome
for faulty tested units. This information is exactly opposite to the status of the units, which
is why it is called the worst case. For the user defined (arbitrary) case, the test outcomes
produced by faulty testing units (for faulty or fault-free tested units) are defined by the user.
d = The disabling criteria, which is defined by the user. If a tested unit receives at least d
fail test outcomes from enabled units, the unit will be disabled.
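The worst-case rule and the disabling criteria defined above can be sketched in a few lines of C. This is only an illustration, not the thesis program: the function names worst_case_outcome and should_disable are hypothetical, and the fault and enable states are passed as 0/1 flags rather than through the program's own arrays.

```c
#include <assert.h>

/* Hypothetical helper: PMC test outcome under the worst-case assumption.
   A fault-free tester reports the true status of the tested unit
   (0 = pass, 1 = fail); a faulty tester reports the opposite. */
int worst_case_outcome(int tester_faulty, int tested_faulty)
{
    return (tester_faulty != 0) ^ (tested_faulty != 0);
}

/* Hypothetical helper: a unit is disabled when at least d of its T
   testers are enabled AND report fail (1). Outcomes produced by
   disabled testers are ignored. */
int should_disable(const int outcome[], const int tester_enabled[], int T, int d)
{
    int j, count = 0;
    assert(T >= 1 && d >= 1);
    for (j = 0; j < T; ++j)
        if (tester_enabled[j] && outcome[j] == 1)
            ++count;
    return count >= d;
}
```

For example, with outcomes {1, 0, 1} and only the first two testers enabled, the unit is disabled for d = 1 but stays enabled for d = 2, because the third tester's fail outcome is not counted.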
C. TOOL SPECIFICATION
Figure 4.1 shows the flowchart of the main body of the system tool. As can be seen,
the user can specify initial conditions and then allow the system to execute diagnostic
steps one after the other.
Figure 4.2 shows a more detailed flowchart of the program. First, the user defines the
number of units in the system. If this number is less than 1 or greater than 20, the
program produces an error message. The user then defines the number and the names of the
faulty nodes. Next, the user defines T (the number of units testing one unit) and the test
procedure (worst case or arbitrary case). The program determines the test results and
displays them on the screen. The user defines the disabling criteria and the number and
names of the enabled units (all units are disabled initially). The tool displays the whole












































Figure 4.2 Detailed flow chart of CAD-tool
To see the application of the disabling rule, the user selects option #5 from the menu
shown in Table 4.1. The program then determines the enabled and disabled units and
displays the first iteration by calling the drawing subroutine. The user can go on to further
iterations under the same conditions. After some number of iterations, the user can exit the




3. SET TEST RESULTS
4. SET THE DISABLING CRITERIA
5. APPLY DISABLING RULE
6. EXIT
Table 4.1 Menu of CAD-tool
D. TOOL REALIZATION
The CAD tool is made up of five main parts (subroutines). The first, menu option #1,
gives a brief explanation of the program. Option #2 sets up the type of system, the number
and names of units, and the number and names of faulty units. Option #3 sets up T and the
test procedure. Option #4 sets up the disabling criteria and the number and names of the
enabled units; it then displays the initial conditions of the system by calling the drawing
subroutine. Option #5 applies the disabling criteria, determines the enabled and disabled
units, and then displays the system. In the drawing subroutine, enabled fault-free units are
green, and enabled faulty nodes are also green with X's inside the circles. Disabled
fault-free nodes are red, and disabled faulty nodes are red with X's inside the circles. Test
results are represented by the color of the testing arrows: a green arrow means a pass (0)
test outcome, and a red arrow means fail (1). After each option completes, the menu returns
to the screen, so if the user makes a mistake somewhere in the program, he/she can correct
it easily by choosing the same option from the menu. The main part of the program is very
straightforward and just calls the subroutines according to the selected menu options.
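The display convention described above separates two independent attributes: the circle color follows the enabled/disabled state, while the X mark follows the fault state. A minimal sketch of that mapping is given here. It assumes, from the fragments of the program listing, that 'E' marks an enabled unit in disable_array and 'B' marks a faulty unit in fault_array; the enum and the function names are hypothetical and are not taken from the thesis code.

```c
#include <assert.h>

enum display_color { DISPLAY_GREEN, DISPLAY_RED };

/* Circle color depends only on the enabled/disabled state
   ('E' is assumed to mean enabled). */
enum display_color unit_color(char disable_flag)
{
    return disable_flag == 'E' ? DISPLAY_GREEN : DISPLAY_RED;
}

/* An X is drawn inside the circle only for faulty units
   ('B' is assumed to mean faulty). */
int unit_has_x(char fault_flag)
{
    return fault_flag == 'B';
}

/* Arrow color encodes the test outcome: 0 = pass (green), 1 = fail (red). */
enum display_color arrow_color(int outcome)
{
    assert(outcome == 0 || outcome == 1);
    return outcome == 0 ? DISPLAY_GREEN : DISPLAY_RED;
}
```

Keeping the two attributes in separate functions mirrors the four cases listed in the text (green, green with X, red, red with X) without enumerating them one by one.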
V. RESULTS
Figure 5.1 shows a photograph of the CAD-tool menu. Figures 5.2 through 5.5
show the initial condition and three iterations of a five unit multiprocessor system.
In this system U2 and U3 are faulty and initially enabled, and are shown in green;
the other units are disabled and shown in red. The disabling criteria is 1 and the test
results are the worst case. After the first iteration, units U0 and U4 are disabled (red) and
all the other units are enabled (green). After the second iteration, U1 is enabled and all
the other units are disabled. After the third iteration, all faulty units (U2, U3) are disabled
and all fault-free units are enabled. In this case, the 1-disabling criteria gives the desired
results. This example is explained in Appendix B as Case 1.
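The iteration sequence of this example can be reproduced with a short simulation. The sketch below is not the thesis program. It assumes a ring test connection in which unit Uk is tested by U(k-1), ..., U(k-T) modulo N (the pattern suggested by the tester rows in Appendix B), worst-case test outcomes, and a synchronous update in which every unit is re-evaluated from the previous enabled set. The value T = 2 used for this five unit case is an inference, since the test matrix for the case is not reproduced here; the function names are hypothetical.

```c
#include <assert.h>

#define MAX_UNITS 20   /* upper limit on N stated for the tool */

/* One synchronous application of the disabling rule.
   faulty[k] and enabled[k] are 0/1 flags; next[] receives the new flags.
   ASSUMPTION: Uk is tested by U(k-1) ... U(k-T) (mod N), and outcomes
   follow the worst-case rule (a faulty tester reports the opposite). */
void apply_disabling_rule(int N, int T, int d,
                          const int faulty[], const int enabled[], int next[])
{
    int k, j;
    assert(N >= 2 && N <= MAX_UNITS && T >= 1 && T < N && d >= 1);
    for (k = 0; k < N; ++k) {
        int count = 0;                      /* fails reported by enabled testers */
        for (j = 1; j <= T; ++j) {
            int tester = (k - j + N) % N;
            int outcome = faulty[tester] ^ faulty[k];
            if (enabled[tester] && outcome == 1)
                ++count;
        }
        next[k] = (count < d);              /* disabled when count >= d */
    }
}

/* Runs `steps` iterations in place and returns the resulting enabled set
   packed as a bit mask (bit k set means Uk is enabled). */
unsigned long run_iterations(int N, int T, int d, const int faulty[],
                             int enabled[], int steps)
{
    int next[MAX_UNITS];
    int s, k;
    unsigned long mask = 0;
    for (s = 0; s < steps; ++s) {
        apply_disabling_rule(N, T, d, faulty, enabled, next);
        for (k = 0; k < N; ++k)
            enabled[k] = next[k];
    }
    for (k = 0; k < N; ++k)
        if (enabled[k])
            mask |= 1UL << k;
    return mask;
}
```

Under these assumptions, starting from N = 5, T = 2, d = 1 with U2 and U3 both faulty and enabled, three successive single-step calls give the enabled sets {U1, U2, U3}, {U1} and {U0, U1, U4}, which is the sequence described for this example.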
Figure 5.1 CAD-tool menu and test outcomes
Figure 5.2 Initial condition
Figure 5.3 First iteration
Figure 5.4 Second iteration
Figure 5.5 Third iteration
Figures 5.6 through 5.9 show another five unit multiprocessor system. In this
example, U1 and U4 are faulty and enabled initially. The disabling criteria is 2 and the test
results are again the worst case. After the first iteration all units are enabled. After the second
iteration only U4 is disabled and all the other units are enabled. Figure 5.8 and Figure 5.9
are identical, which means that the system stays in that state and cannot correct itself.
This example is explained in Appendix B as Case 3.
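Whether a system settles into a steady state or oscillates can be checked mechanically by iterating until a state repeats. The sketch below uses Floyd's cycle-finding method on enabled sets packed as bit masks; it is an illustration under stated assumptions and not part of the thesis program. It assumes the ring test connection (Uk tested by U(k-1), ..., U(k-T) modulo N) and worst-case outcomes, takes T = 2 for this five unit example as an inference, and uses the hypothetical names next_state and find_period.

```c
#include <assert.h>

/* Next enabled set under the disabling rule. Bit k of `faulty` / `enabled`
   is set when Uk is faulty / enabled.
   ASSUMPTION: ring test connection and worst-case test outcomes. */
unsigned long next_state(int N, int T, int d,
                         unsigned long faulty, unsigned long enabled)
{
    unsigned long next = 0;
    int k, j;
    assert(N >= 2 && N <= 20 && T >= 1 && T < N && d >= 1);
    for (k = 0; k < N; ++k) {
        unsigned long tested_faulty = (faulty >> k) & 1UL;
        int count = 0;
        for (j = 1; j <= T; ++j) {
            int tester = (k - j + N) % N;
            unsigned long tester_enabled = (enabled >> tester) & 1UL;
            unsigned long tester_faulty = (faulty >> tester) & 1UL;
            if (tester_enabled && (tester_faulty ^ tested_faulty))
                ++count;
        }
        if (count < d)
            next |= 1UL << k;
    }
    return next;
}

/* Returns the period of the cycle the system eventually enters
   (1 means a steady state) and stores one state of that cycle in *settled. */
int find_period(int N, int T, int d, unsigned long faulty,
                unsigned long enabled, unsigned long *settled)
{
    unsigned long slow = next_state(N, T, d, faulty, enabled);
    unsigned long fast = next_state(N, T, d, faulty, slow);
    int period = 1;
    while (slow != fast) {                 /* fast advances two steps per loop */
        slow = next_state(N, T, d, faulty, slow);
        fast = next_state(N, T, d, faulty,
                          next_state(N, T, d, faulty, fast));
    }
    fast = next_state(N, T, d, faulty, slow);
    while (fast != slow) {                 /* walk once around the cycle */
        fast = next_state(N, T, d, faulty, fast);
        ++period;
    }
    *settled = slow;
    return period;
}
```

With N = 5, T = 2, d = 2 and U1, U4 both faulty and initially enabled (mask 0x12), the sketch reports period 1 with the settled set {U0, U1, U2, U3} (mask 0x0F). That is a steady state in which the faulty unit U1 remains enabled, in agreement with the behavior described for this example. A period greater than 1 would indicate an oscillation.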
Figure 5.6 Initial condition
Figure 5.7 First iteration
Figure 5.8 Second iteration
Figure 5.9 Third iteration
Figures 5.10 through 5.14 show a seven unit multiprocessor system. In this system,
U1, U3 and U5 are faulty units and enabled initially. The test results are again the worst case
and the disabling criteria is 2. After the first iteration, U4 and U6 are disabled and all the
other units are enabled. After the second iteration, U3, U4 and U6 are disabled and the other
units are enabled. After the third iteration only U3 is disabled. After the fourth iteration all
faulty units are disabled and all fault-free units are enabled. This indicates that the
2-disabling criteria works and the system corrects itself. This example is explained in
Appendix B as Case 6.
Figure 5.10 Initial condition
Figure 5.11 First iteration
Figure 5.12 Second iteration
Figure 5.13 Third iteration
Figure 5.14 Fourth iteration
Figures 5.15 through 5.20 show a six unit system. In this system U1, U3 and U5 are
faulty units and the disabling criteria is 2. The test results are arbitrary (user defined) and
are defined as follows: faulty testing units produce a fail (1) test outcome for faulty tested
units and a pass (0) outcome for fault-free tested units. In this example, faulty
units are alternately disabled and enabled, so the system never corrects itself. It
displays an oscillation of period six. This example is explained in Appendix B as Case 19.
Figure 5.15 Initial condition
Figure 5.16 First iteration
Figure 5.17 Second iteration
Figure 5.18 Third iteration
Figure 5.19 Fourth iteration
Figure 5.20 Fifth iteration
VI. CONCLUSIONS AND RECOMMENDATIONS
A. CONCLUSION
This thesis introduces distributed diagnosis. The analysis of distributed diagnosis is
difficult without a CAD tool. In this research, a CAD-tool has been developed based
upon the PMC graph model. Using this tool, the user can simulate a variety of
configurations and fault patterns. The tool provides a step by step procedure for the user to
follow. The information related to the faulty nodes (the number and the
names of faulty nodes) is provided by the user, who can then simulate the system for as
many iterations as desired.
In the CAD-tool, the fail test outcomes produced by enabled processors are counted for each
unit and compared with the disabling criteria. If the number of fail test outcomes reaches or
exceeds the criteria, the unit is disabled. Unlike the central diagnosis algorithm, which
eventually settles on a final arrangement of processors, the algorithm described here exhibits
dynamic behavior.
B. RECOMMENDATIONS
It is expected that this tool will be used to study optimum disabling criteria for various
systems. For example, we hope that it will free the user from the tedium of generating
examples, allowing him/her to prove properties of the system. One possibility is that it could





/* This menu lets the user choose among the main functions of the program.
A user running the program for the very FIRST TIME should choose
option #2. Choosing INTRODUCTION is exempt from this restriction. */









N, fmax, f, T, k,
j, i, no_units_set,
































3. SET THE TEST RESULTS
4. SET THE DISABL. CRITERIA









































THESIS TOPIC: FAULT TOLERANT COMPUTING
*
* IN DISTRIBUTED COMPUTER NETWORKS.
*
* Author: Ibrahim DINCER
*
Thesis Advisor: Prof. Jon T. BUTLER
* NAVAL POSTGRADUATE SCHOOL
*















































printf("* DATE : APRIL 23,1987 *\n");
printf("This program is for simulation of distributed \n");
printf("diagnosis algorithm in a computer network. For this\n");
printf("purpose PREPARATA_METZE_CHIEN is used. The number\n");
printf("\n");
printf("of nodes in the system is restricted TO NO MORE \n");
printf("\n");
printf("THAN 20. The user enters the number of nodes, faulty\n");
printf("\n");
printf("nodes in the network, test procedure and disabling \n");
printf("\n");










printf("FMAX=NUMBER OF ALLOWED FAULTY NODES IN THE SYSTEM \n");
printf("\n");
printf("T= NUMBER OF UNITS WHICH ARE TESTING ONE UNIT \n");
outcomes and shows enabled fault_free nodes and
disabled faulty nodes.
N= NUMBER OF NODES IN THE SYSTEM
D=DISABLING CRITERIA FOR FAULTY NODES
F= NUMBER OF FAULTY NODES IN THE SYSTEM
/* THIS SUBROUTINE DEFINES THE NAMES OF NODES AND ALSO DEFINES THE FAULTY
NODES IN THE SYSTEM */
units( )
{




printf("%c%d,", 'U', i);
}
printf("\n");




/* THESE TWO LOOPS KEEP THE USER WITHIN THE ALLOWED LIMITS FOR
FAULTY UNITS */
while (f <= 0 || f > N)
{
printf(" 'F' SHOULD BE GREATER THAN ZERO ");
printf(" AND LESS THAN N \n");
scanf("%d", &f);
}
printf("THERE ARE %d FAULTY NODES \n\n", f);
= 1
;
/* INDICATES THE ARRAY TO DEFINE THE FAULTY NODES */
for (i = 0; i <= N-1; ++i)
{




printf("ENTER THE FAULTY UNIT NUMBER ONE AT A TIME \n");

scanf("%d", &i);
while (i > (N-1) || i < 0)
{
printf(" UNIT NUMBER IS NOT VALID, TRY AGAIN !\n");
scanf("%d", &i);
}
if (fault_array[i] == 'B')
{
printf("THIS UNIT IS PREVIOUSLY DEFINED AS ");




fault_array[i] = 'B';












printf(" TO DETERMINE THE NETWORK ENTER ONE OF THE ");
printf(" OPTIONS BELOW\n");
printf("\n");
printf(" 1. DESIGN \n\n");
printf(" 2. ARBITRARY SYSTEM \n\n");
scanf("%d", &p);




printf("ENTER THE NUMBER OF NODES IN THE SYSTEM\n\n");
scanf("%d", &N);
while (N > 20 || N <= 0)
{
printf("THE NUMBER OF UNITS IS NOT VALID,");









printf (" THIS SYSTEM WILL BE DEFINED LATER \n\n" ) ;
}
}
/* THIS SUBROUTINE DETERMINES THE TEST RESULTS FOR THE SYSTEM. IN THE
'WORST_CASE', THE PROGRAM DETERMINES ALL THE TEST RESULTS; FOR THE ARBITRARY CASE,
TEST RESULTS FOR THE TESTED UNITS BY 'FAULTY' TESTING UNITS WILL BE DEFINED
BY THE USER. */
test ( )
{
printf(" 'T' IS THE NUMBER OF UNITS TESTING ONE NODE; ENTER");
printf(" 'T' \n");
scanf("%d", &T);
printf(" DO YOU WANT 'WORST_CASE' TEST RESULTS? IF YES, ENTER");
printf("1 \n");
scanf("%d", &w);















if ((fault_array[k] == 'B') && (fault_array[l] == 'B'))
{
test_array[k][j] = 0;
}
else if ((fault_array[k] == 'B') ||














else /* THIS PART: user_defined ARBITRARY TEST RESULTS */
{
for (j = 1; j <= T; ++j)
{










printf("TEST RESULT NODE #%d BY NODE # %d IS ", k, l);
scanf("%d", &test_array[k][j]);
while (test_array[k][j] != 0 && test_array[k][j] != 1)
{
printf("TEST RESULTS SHOULD BE 0 OR 1 \n");
scanf("%d", &test_array[k][j]);
}
printf("test_array[%d][%d] = %d\n", k, j, test_array[k][j]);
}
else if (fault_array[k] == 'B')
{









for (k = 0; k <= N-1; ++k) /* THIS PART PRODUCES TEST_RESULT MATRIX */
{
for (j = 1; j <= T; ++j)
{
printf(" %d ", test_array[k][j]);
}
printf("\n\n");
}
} /* END OF TEST SUBROUTINE */
/* THIS PART OF PROGRAM IS DRAWING THE NETWORK FOR DISPLAY */






, j, k, x, y, x1, y1, x2, y2, x3, y3, x4, y4, x5, y5;
int x6, y6, x7, y7, x8, y8, t, r, R;
char number[20], Z;
float pi, theta, phi, rho, psi, tau;
short ang;
pi = 3. U16295;
ginit( ) ;
viewport (400 ,1000,100, 700);
cursoff ( ) ;
color(BLUE) ;
clear ( ) ;
linewidth( 4 ) ;








































move2i(x5, y5);
draw2i(x6, y6);
1 = 1 );
&& disable_array [i]

















&& disable_array[i] == 'E'
color(GREEN);
circfi(x, y, r);
}









sprintf(number, "U%d", i);






























x1 = r * sin(phi);


























/* THIS PART DETERMINES THE INITIAL CONDITIONS AND DISPLAYS




printf("ENTER THE NUMBER OF ENABLED NODES\n");
scanf("%d", &no_en_set);
printf("ENTER THE MINIMUM NUMBER OF FAIL TEST RESULTS BY");
printf("ENABLED PROCESSORS WHICH DISABLE THE TESTED ");





for (i = 0; i <= N-1; ++i)
{







printf("ENTER THE ENABLED UNIT NUMBER ONE AT A TIME \n");

if (i > N-1 || i < 0)
{
printf("UNIT NUMBER IS NOT VALID, TRY AGAIN !\n");
}
else if (disable_array[i] == 'E')
{
{





disable_array[i] = 'E';
printf("ENABLED UNIT #%d IS U%d\n\n", j, i);






/* THIS PART OF THE PROGRAM DETERMINES ENABLED AND DISABLED














if ((test_array[k][j]






if (count >= dis_crit)
{










































textport( 0,350. 1 0,900 )
;
linewidth( 6 )






"%d" , ^response )
;




















HAND CALCULATION OF DIFFERENT CASES
Case 1. A five unit multiprocessor system, U2 and U3 are faulty units and shown
underlined. Test results are worst case, disabling criteria is 1. In the matrix shown below











a. first iteration with I.C
U1, U2, U3 are enabled






U2, U3 are disabled
* all faulty nodes are disabled, all fault-free nodes are enabled
Case 2. A five unit multiprocessor system in which U2 and U3 are faulty units and
enabled initially. Test results are the arbitrary (user defined) case and disabling criteria is 1.











a. first iteration with I.C
U0, U1, U2, U3 are enabled
U4 is disabled
b. second iteration
U0, U1 are enabled
U2, U3, U4 are disabled
c. third iteration
U0, U1, U4 are enabled
U2, U3 are disabled
* all faulty nodes are disabled, all fault-free nodes are enabled.
Case 3. A five unit multiprocessor system, U1 and U4 are faulty units and enabled











a. first iteration with I.C
all nodes are enabled
b. second iteration
U0, U1, U2, U3 are enabled
U4 is disabled
* system stays in that state forever
* so system is not 2-fault 2-correctable












a. first iteration with I.C
all nodes are enabled
b. second iteration
U0, U2, U3 are enabled
U1, U4 are disabled
* all faulty nodes are disabled, all fault-free nodes are enabled.
Case 5. A seven unit multiprocessor system, U1, U3, U5 are faulty and enabled






















a. first iteration with I.C
U1, U3, U5 are enabled
U0, U2, U4, U6 are disabled
* system stays in that state forever. So system is not 3-fault 1-correctable
Case 6. This system is the same as the previous case, but disabling criteria is 2.
a. first iteration with I.C
U0, U1, U2, U3, U5 are enabled
U4, U6 are disabled
b. second iteration
U0, U1, U2, U5 are enabled
U3, U4, U6 are disabled
c. third iteration
U0, U1, U2, U4, U5, U6 are enabled
U3 is disabled
d. fourth iteration
U0, U2, U4, U6 are enabled
U1, U3, U5 are disabled
* all faulty nodes are disabled, all fault-free nodes are enabled.
Case 7. A seven unit multiprocessor system, U1, U3, U5 are faulty units and enabled
















U5     1  0  1
      U5 U4 U3
U6     0  0  1
a. first iteration with I.C
all nodes are enabled
b. second iteration
U0, U2, U4, U6 are enabled
U1, U3, U5 are disabled
* all faulty nodes are disabled, all fault-free nodes are enabled
Case 8. A seven unit multiprocessor system, U0, U1, U3, U4 are faulty units and U1,
U3, U4 are enabled initially. Test results are worst case, disabling criteria is 1.
      U6 U5 U4
U0     1  1  0
      U0 U6 U5














a. first iteration with I.C
U0, U1, U3, U4 are enabled
U2, U5, U6 are disabled
* system stays in that state forever. So it is not 4-fault 1-correctable.
Case 9. This case is the same as the previous case, except that the test results are the
arbitrary case.
      U6 U5 U4






U3     1  1  0
      U3 U2 U1





a. first iteration with I.C
U1 is enabled
U0, U2, U3, U4, U5, U6 are disabled
b. second iteration
U0, U1, U4, U5, U6 are enabled
U2, U3 are disabled
c. third iteration
U4 is enabled
U0, U1, U2, U3, U5, U6 are disabled
d. fourth iteration
U1, U2, U3, U4 are enabled
U0, U5, U6 are disabled
e. fifth iteration
U1 is enabled and U0, U2, U3, U4, U5, U6 are disabled.
* This is the same state as iteration #1, so the system is not 4-fault 1-correctable.
Case 10. This system is the same as case 8, but disabling criteria is 2 in this case.
The test results are the same as in case 8.
a. first iteration with I.C
U0, U1, U2, U3, U4 are enabled
U5, U6 are disabled.
b. second iteration
U0, U1, U3, U4 are enabled
U2, U5, U6 are disabled
c. third iteration
U0, U1, U3, U4 are enabled
U2, U5, U6 are disabled
* This is the same state as the previous iteration; the system stays in that loop forever.
That means the system is not 4-fault 2-correctable.
Case 11. This is the same as case 9, but disabling criteria is 2 in this case.
Test results are the same as in case 9.
a. first iteration
all nodes are enabled
b. second iteration
U2, U5, U6 are enabled
U0, U1, U3, U4 are disabled
* all faulty units are disabled, all fault-free units are enabled.
Case 12. An eight unit multiprocessor system, U0, U3, U5, U7 are faulty units and

















a. first iteration with I.C
U0, U3, U5, U7 are enabled
U1, U2, U4, U6 are disabled
* system stays in that state forever. So system is not 4-fault 1-correctable.
When we simulate the system to see whether it is 2-correctable, we can easily see that
U0 will never be disabled. So the system is not 2-correctable either.
Case 13. This is the same as case 12, but test results are arbitrary and disabling
criteria is 2.
      U7 U6 U5














U7     1  0  1
a. first iteration with I.C
all nodes are enabled.
b. second iteration
U1, U2, U4 and U6 are enabled
U0, U3, U5, U7 are disabled
* all faulty units are disabled, all fault-free units are enabled.
Case 14. A nine unit multiprocessor system, U0, U1, U2, U3 are faulty units and U0,
U2, U3 are enabled. Test results are worst case, disabling criteria is 1.
      U8 U7 U6 U5
U0     1  1  1  1
      U0 U8 U7 U6
U1     0  1  1  1
      U1 U0 U8 U7
U2     0  0  1  1
      U2 U1 U0 U8
U3     0  0  0  1
      U3 U2 U1 U0
U4     1  1  1  1
      U4 U3 U2 U1
U5     0  1  1  1
      U5 U4 U3 U2
U6     0  0  1  1
      U6 U5 U4 U3
U7     0  0  0  1
      U7 U6 U5 U4
U8     0  0  0  0
a. first iteration with I.C
U0, U1, U2, U3, U8 are enabled




all the others are disabled
c. third iteration
U4, U5, U6, U7, U8 are enabled
U0, U1, U2, U3 are disabled
* all faulty nodes are disabled, all fault-free nodes are enabled.
Case 15. A nine unit multiprocessor system, U0, U3, U5, U8 are faulty units and U3,
U5, U8 are enabled. Test results are arbitrary case, disabling criteria is 1.
      U8 U7 U6 U5
U0     1  1  1  0
      U0 U8 U7 U6
U1     1  0  0  0
      U1 U0 U8 U7
U2     0  1  1  0
      U2 U1 U0 U8
U3     1  1  0  1
      U3 U2 U1 U0
U4     1  0  0  0
      U4 U3 U2 U1
U5     1  0  1  1
      U5 U4 U3 U2
U6     1  0  0  0
      U6 U5 U4 U3
U7     0  1  0  1
      U7 U6 U5 U4
U8     1  1  0  1
a. first iteration with I.C
U1, U5, U8 are enabled
U0, U2, U3, U4, U6, U7 are disabled
b. second iteration
U1, U4, U8 are enabled
U0, U2, U3, U5, U6, U7 are disabled
c. third iteration
U1, U4, U6, U7 are enabled
U0, U2, U3, U5, U8 are disabled
d. fourth iteration
U1, U2, U4, U6, U7 are enabled
U0, U3, U5, U8 are disabled
* All faulty nodes are disabled, all fault-free nodes are enabled.
Case 16. This is the same as the previous case, but disabling criteria is 2.
a. first iteration
U0, U1, U2, U3, U4, U5, U6, U8 are enabled
U7 is disabled
b. second iteration
U1, U4, U6 are enabled
U0, U2, U3, U5, U7, U8 are disabled
c. third iteration
U0, U1, U2, U3, U4, U6, U7 are enabled
U5, U8 are disabled
d. fourth iteration
U1, U2, U4, U6, U7 are enabled
U0, U3, U5, U8 are disabled.
* All faulty nodes are disabled, all fault-free nodes are enabled.
Case 17. This case is the same as case 15, but disabling criteria is 3.
a. first iteration
all nodes will be enabled
b. second iteration
U1, U2, U4, U6, U7 are enabled
U0, U3, U5, U8 are disabled
* All faulty nodes are disabled, all fault-free nodes are enabled.
Case 18. A nine unit multiprocessor system, U1, U3, U5, U8 are faulty units and U1,










U8 U7 U6 U5
1 1
U0 U8 U7 U6
1 1 1
U1 U0 U8 U7
1 1
U2 U1 U0 U8
1 1
U3 U2 U1 U0
1 1
U4 U3 U2 U1
1 1 0
U5 U4 U3 U2
1 1
U6 U5 U4 U3
1 1
U7 U6 U5 U4
1 1 1
a. first iteration with I.C
U0, U1, U2, U3, U5, U8 are enabled
U4, U6, U7 are disabled.
b. second iteration
U1, U5, U8 are enabled
U0, U2, U3, U4, U6, U7 are disabled.
c. third iteration
U1, U3, U4, U5, U6, U7, U8 are enabled
U0, U2 are disabled.
d. fourth iteration
U3, U5 are enabled.
U0, U1, U2, U4, U6, U7, U8 are disabled.
e. fifth iteration
U0, U1, U2, U3, U4, U5, U8 are enabled
U6, U7 are disabled.
f. sixth iteration
U1, U8 are enabled
U0, U2, U3, U4, U5, U6, U7 are disabled.
* System is not 4-fault 2-correctable.
Case 19. A six unit multiprocessor system, U1, U3, U5 are faulty units and only U1



















U0, U2, U3, U4 are enabled
U1, U5 are disabled.
b. second iteration
U0, U1, U2, U3, U4 are enabled.
U5 is disabled.
c. third iteration
U0, U1, U2, U4 are enabled.
U3, U5 are disabled.
d. fourth iteration
U0, U1, U2, U4, U5 are enabled.
U3 is disabled.
e. fifth iteration
U0, U2, U4, U5 are enabled.
U1, U3 are disabled.
f. sixth iteration
U0, U2, U3, U4, U5 are enabled.
U1 is disabled.
* That is the I.C state; the system oscillates and returns to the I.C state every six iterations.
LIST OF REFERENCES
1. J.H. Wensley, et al., "SIFT: Design and analysis of fault tolerant computer for
Aircraft Control," Proc. of IEEE, Vol. 66, No. 10, pp. 1240-1255, October 1978.
2. F.P. Preparata, G. Metze, and R.T. Chien, "On the connection assignment problem of
diagnosable systems," IEEE Trans. on Comp., Vol. C-16, pp. 848-854, Dec. 1967.
3. K.Y. Chwa and S.L. Hakimi, "Schemes for fault tolerant computing: A comparison of
modularly redundant and t-diagnosable systems," Inform. and Control, Vol. 49, No. 3,
pp. 212-238, June 1981.
4. Arthur D. Friedman and Luca Simoncini, "System level fault diagnosis," IEEE Trans.
on Comp., Vol. 13, pp. 47-53, March 1980.
5. Simoncini Karunanithi and A.D. Friedman, "System diagnosis with t/s diagnosability,"
Proc. of the 7th Fault-tolerant Comp. Symp., pp. 65-71, June 1977.
6. M.L. Blount, "Probabilistic Treatment of Diagnosis in Digital systems," Proc. 7th Intl.
Conf. on Fault Tolerant Computing, pp. 72-77, June 1977.
7. J.T. Butler, "On the design of distributed diagnosable multiprocessing systems,"
Naval Postgraduate School, Monterey, CA, research proposal.
8. A.L. Hopkins, T.B. Smith, and J.H. Lala, "FTMP-A highly reliable fault-tolerant
multiprocessor for aircraft," Proc. of IEEE, Vol. 66, No. 10, pp. 1221-1239,
October 1978.
9. R. Nair, G. Metze and J. Abraham, "Design Considerations for Fault-Tolerant
Distributed Digital Systems," unpublished manuscript.
10. S. Mallela and G. Masson, "Diagnosable systems for intermittent faults," IEEE Trans.
on Comp., Vol. C-27, pp. 560-566, 1978.




1. Defense Technical Information Center 2
Cameron Station
Alexandria, VA 22304-6145
2. Library, Code 0142 2
Naval Postgraduate School
Monterey, CA 93943-5002
3. Department Chairman, Code 62 1
Department of Electrical and Computer Engineering.
Naval Postgraduate School
Monterey, CA 93943-5000
4. Dr. Jon T. Butler, Code 62 BU 5
Department of Electrical and Computer Engineering.
Naval Postgraduate School
Monterey, CA 93943-5000
5. Dr. Bruno O. Shubert, Code 55 SY 1
Department of Operational Analysis
Naval Postgraduate School
Monterey, CA 93943
6. Dr. Dana E. Madison, Code 52 1
Department of Computer Science
Naval Postgraduate School
Monterey, CA 93943
7. Dr. Joo Kang Lee 1




8. Director of Research Administration, Code 012 1
Naval Postgraduate School
Monterey, CA 93943-5000
9. Kara Kuvvetleri Komutanligi 1
Egitim Dairesi Baskanligi
Bakanliklar, Ankara, Turkey
10. Kara Harp Okulu 1
Bakanliklar, Ankara, Turkey
11. Muhabere Okul Komutanligi 1
Mamak, Ankara, Turkey




13. Lrjg. Mustafa Paktuna 1
Marmara cad. No: 158/6
Kocamustafapasa, Istanbul, Turkey





15. Dr. George Abraham 1
Code 7500
NRL
4555 Overlook Ave. S.W.
Washington, DC 20375
16. Dr. Lou Schmid 1
OHT 20T













