Tutorial: Advanced fault tree applications using HARP by Dugan, Joanne Bechta et al.
NASA Technical Memorandum 102747
TUTORIAL
ADVANCED FAULT TREE APPLICATIONS USING HARP
Joanne Bechta Dugan
Duke University
Durham, North Carolina
Salvatore J. Bavuso
NASA Langley Research Center
Hampton, Virginia
Mark A. Boyd
Duke University
Durham, North Carolina
NOVEMBER 1993
NASA
National Aeronautics and
Space Administration
Langley Research Center
Hampton, VA 23681-0001
(NASA-TM-102747) TUTORIAL: ADVANCED FAULT TREE APPLICATIONS USING HARP (NASA) 30 p
N94-17447
Unclas
G3/61 0194087

Tutorial: Advanced Fault Tree Applications using HARP
Contents

I Background 1

1 Introduction 1

2 Behavioral decomposition 1
   2.1 The fault occurrence and repair model (FORM) 2
   2.2 The fault and error handling model (FEHM) 3
       2.2.1 Example memory FEHM 3
       2.2.2 Example processor FEHM 3
   2.3 Near-coincident faults 5
   2.4 Combining FORM and FEHM models 7

3 Dynamic fault-tree gates 7
   3.1 Functional dependency gate 7
   3.2 Cold spare gate 8
   3.3 Priority-AND gate 8
   3.4 Sequence enforcing gate 8

II Examples 10

4 Cm*: a loosely-coupled distributed system 10
   4.1 System description 10
   4.2 Failure criteria 10
   4.3 Fault tree model 10

5 AIPS: a system of fault-tolerant building blocks 12
   5.1 System description 12
   5.2 Failure criteria and parameters 12
   5.3 Fault recovery 12
   5.4 Fault tree model 12
   5.5 Truncated fault tree 13

6 FTPP: Fault tolerant parallel processor 14
   6.1 System description 14
   6.2 Failure criteria and parameters 14
   6.3 Fault recovery 15
   6.4 Fault tree models 15
       6.4.1 Configuration #1 15
       6.4.2 Configuration #2 15
       6.4.3 Configuration #3 18
   6.5 Results 18

7 ASID MAS: a mission avionics system 18
   7.1 System description 18
   7.2 Failure criteria and parameters 20
   7.3 Fault recovery 20
   7.4 Fault tree model 20
       7.4.1 Fault tree with no pooled spares 20
       7.4.2 Modeling pooled spares 20
       7.4.3 Full model and results 22

8 Three fault tolerant hypercube architectures 22
   8.1 System description 22
       8.1.1 Architecture 1 22
       8.1.2 Architecture 2 22
       8.1.3 Architecture 3 22
   8.2 Failure criteria and parameters 24
   8.3 Fault recovery 24
   8.4 Fault tree models 25
       8.4.1 Hot spares 25
       8.4.2 Cold spares 25
       8.4.3 Warm spares 25
   8.5 Results 26

References 26

Abstract
Reliability analysis of fault tolerant computer systems for critical applications is complicated by several factors. In this tutorial, we discuss these modeling difficulties and describe and demonstrate dynamic fault tree modeling techniques for handling them. Several advanced fault tolerant computer systems are described, and fault tree models for their analysis are presented. HARP (the Hybrid Automated Reliability Predictor) is a software package developed at Duke University and NASA Langley Research Center that is capable of solving the fault tree models presented in this tutorial.
Part I
Background
1 Introduction
Fault tolerant computer systems for critical applications are characterized by several factors which complicate their analysis. Systems designed to achieve high levels of reliability frequently employ high levels of redundancy, dynamic redundancy management, and complex fault and error recovery techniques. In this tutorial we consider advanced fault tree modeling techniques to include these factors in the analysis of system reliability.

In this tutorial, we assume the following:

• Faults occur randomly and are statistically independent.

• Lifetime distributions are exponential. Faults occur at a constant average rate, which is referred to as the failure rate of the component.

• Mission lengths are relatively short, so that the probability of more than a few failures is low.

• The systems are not repairable while in use.

Systems which violate these assumptions can be handled by more sophisticated techniques which fall outside the scope of this tutorial.
There are several possible uses for the reliability analysis of fault tolerant computer systems for critical applications. In addition to predicting the reliability of the system for a specified mission time, these techniques can facilitate tradeoff analysis for various fault tolerant techniques, or can be used to compare alternative architectures for a system still in the design phase. Even if a system exists only as a rough sketch on paper, analysis techniques can be used to analyze parametric sensitivity in order to determine which factors have the strongest impact on the reliability of the system.
Fault trees are frequently used for reliability analysis of critical systems. Fault tree models are well accepted and solution methods are well known, but exact analysis of fault trees with many basic events is often expensive, both in terms of developing the model and in solving the model once it is developed. Also, several important types of dynamic behavior in advanced fault tolerant systems cannot be adequately captured in a standard fault tree model. These dynamic behaviors include transient recovery, intermittent errors, and sequence dependency. Markov models present an alternative modeling technique that is flexible enough to model nearly any such dynamic system. Tools and techniques exist for the solution of even very large Markov models. However, the construction of a Markov model for any but the simplest system can be tedious and error prone.
To exploit the relative advantages of both fault trees and Markov models, while avoiding many of the shortcomings, we define a model that is flexible enough to capture the dynamic aspects of the system, but which is (almost) as easy to use as a standard fault tree. The model construction and solution process is facilitated by the new model in three major ways, which are defined and demonstrated via example in this tutorial.

• Behavioral decomposition is used to separately define models for system structure and fault recovery.

• Several additional gates are introduced into the fault tree model to capture dynamic behavior.

• The fault tree model of system structure is internally and automatically converted to a Markov model, to which is added the fault recovery information.

These techniques have been implemented in HARP (the Hybrid Automated Reliability Predictor), a software package for the analysis of advanced fault tolerant systems, developed by NASA Langley Research Center and Duke University.

The example models described here are all solved using HARP. The techniques implemented in HARP are described in more detail in other publications. References [4, 14, 23, 2, 3] are general papers describing HARP. More details of the models presented here, as well as other models using HARP, appear in [1, 6, 7, 15, 11, 12, 10]. Modeling the recovery process is covered in detail in [13].
2 Behavioral decomposition
A common approach to modeling complex systems consists of structurally dividing the system into smaller subsystems (e.g. processors, memory units, buses), analyzing the dependability of the subsystems separately, and then combining the subsystem solutions to obtain the system solution. A system level analysis can then be effected by analyzing each subsystem separately and combining the results to obtain the final solution. This structural decomposition is allowed only if the subsystems' fault tolerant behaviors are mutually statistically independent.

Figure 1: Fault tree model of example system
An alternative to such a structural decomposition is behavioral decomposition. Generally, the time scale for the occurrence of faults and their associated errors is relatively long (i.e. weeks or months) while the time scale for recovery is relatively short (milliseconds). Behavioral decomposition exploits this time scale difference by allowing an analyst to describe the two behavior types (occurrence and recovery) in separate models.

Using behavioral decomposition, the model is decomposed into fault-occurrence and repair (FORM) and fault and error handling (FEHM) submodels. The FORM contains information about the structure of the system and the fault arrival process. The FEHM (often called the coverage model) allows for the modeling of permanent, intermittent, and transient faults, and models the on-line recovery procedure necessary for each fault type. We describe this process of model construction by way of a simple three processor, two memory (3P2M) example system.
2.1 The fault occurrence and repair
model (FORM)
We wish to model a computer consisting of three processors and two shared memories (3P2M) communicating over a shared bus. The system is operational as long as one processor can communicate with one of the memories. We describe the system structure model as a fault tree, as shown in figure 1, where the top event, System Failure, is caused by bus failure OR all processors failing OR both memories failing. The abbreviation for the combined basic event i * j represents i statistically independent occurrences of component type j.

Figure 2: Markov chain model of example system

Figure 2 shows the (continuous time) Markov chain representation of the system whose fault tree is shown in figure 1. The states are labeled with an ordered triple, where element

1. denotes the number of operational processors,

2. denotes the number of operational memories, and

3. denotes the state of the bus.

An arc between states (i, j, k) and (i-1, j, k) is labeled with i × λ (where λ is the failure rate of processors). Likewise, an arc between states (i, j, k) and (i, j-1, k) is labeled with j × μ (where μ is the failure rate of memory units). The failure rate of the bus is ν.

F1 represents exhaustion of the processor cluster,

F2 represents exhaustion of the memories, and

F3 represents failure of the bus.
The fault tree in figure 1 can be automatically converted to the Markov chain in figure 2. All possible occurrences of basic events that leave the system operational are enumerated; each combination becomes a state in the Markov chain.

The advantage of allowing a fault tree description of the system is that the modeler need not perform the tedious task of determining the Markov chain representation of a system that can be described as a fault tree. Very often, a relatively simple fault tree can give rise to a very large and complicated state space in the corresponding Markov chain. The modeler can use the parsimony of the fault tree representation of the system to generate the state space of the Markov chain automatically, and then make adjustments to the Markov chain as needed.
Figure 3: General structure of FEHM
2.2 The fault and error handling model (FEHM)
We next concentrate on modeling the detailed behavior of the system when a fault occurs. The general structure of a model that represents the recovery process that is initiated when a fault occurs is shown in figure 3. The entry point to the model signifies the occurrence of the fault, and the three exits signify three possible outcomes. The transient restoration exit (labeled R) represents the correct recognition of and recovery from a transient fault. A transient is usually caused by external or environmental factors, such as excessive heat or a "glitch" in the power line. It is generally believed that the vast majority of faults are transient. Successful recovery from a transient fault restores the system to a consistent state without discarding any components, for example by retrying an instruction or rolling back to a previous checkpoint. Reaching this exit successfully requires timely detection of an error produced by the fault, performance of an effective recovery procedure, and the swift disappearance of the fault (the cause of the error).

The permanent coverage exit (labeled C) denotes the determination of the permanent nature of the fault, and the successful isolation and removal of the faulty component. The single point failure exit (labeled S) is reached when a single fault causes the system to crash. This generally occurs if an undetected error propagates through the system, or if the faulty unit cannot be isolated and thus the system cannot be reconfigured.
2.2.1 Example memory FEHM

As an example of a FEHM for the memory subsystem of figure 1, a hypothetical recovery procedure is shown in figure 4.
• The memory uses an error correcting code, so a single-bit error is always detectable and correctable, and no reconfiguration is required. If 98% of all memory faults affect only a single bit, then the probability of reaching the R exit is r = 0.98.
• The 2% of faults that affect more than one memory bit are 95% detectable. When a multiple memory error is detected, the affected portion of memory is discarded, the memory mapping function is updated, and the needed information is reloaded from a previous checkpoint and updated to represent the current state of the system. Experimentation on a prototype system revealed that this recovery from the detected multiple memory errors works 85% of the time. Thus, the probability of reaching the C exit is the probability that a multiple fault occurs, is detected, and is recovered from: c = (0.02) × (0.95) × (0.85) = 0.01615. The first two moments¹ of the time to perform this recovery have been determined by experiment to be 0.45 and 0.25 (time scale in seconds).
• There are two paths to the single-point failure exit.

  1. The memory fault causes a single-point failure if a multiple-bit error is not detected (with probability 0.02 × 0.05), or

  2. if a multiple-bit memory error is detected, but the attempted recovery is not successful (with probability 0.02 × 0.95 × 0.15).

  Thus, the probability of reaching the S exit is s = (0.02) × ((0.05) + (0.95) × (0.15)) = 0.00385. (These three exit probabilities are recomputed in the short sketch below.)
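The three exit probabilities can be checked with a few lines of arithmetic. The sketch below simply recomputes r, c, and s from the branch probabilities stated above; the variable names are ours for illustration and are not HARP input.

```python
# Recompute the memory FEHM exit probabilities from the stated branch probabilities.
p_single_bit   = 0.98   # fault affects a single bit; always corrected by the ECC
p_multi_bit    = 0.02   # fault affects more than one bit
p_detect_multi = 0.95   # a multiple-bit error is detected
p_reconfig_ok  = 0.85   # reconfiguration and reload succeed once detected

r = p_single_bit                                   # transient restoration exit
c = p_multi_bit * p_detect_multi * p_reconfig_ok   # permanent coverage exit
s = p_multi_bit * ((1 - p_detect_multi)
                   + p_detect_multi * (1 - p_reconfig_ok))   # single-point failure exit

print(r, c, s)   # 0.98  0.01615  0.00385  (the three exits sum to 1)
```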
2.2.2 Example processor FEHM

The recovery process for the faults that occur in the processor is more complex. A processor contains built-in test circuitry so that error checking occurs concurrently with instruction execution. If an error is detected, the instruction is retried immediately. Partial results are stored in case the retry is unsuccessful, so that the computation can be continued from some intermediate point (called a checkpoint). The process of continuing a computation from a previously saved checkpoint is called a rollback. In some cases the fault is such that the rollback is not successful, so the computation must start over after a system-level recovery procedure is invoked.

An example of a processor fault coverage model is shown in figure 5, and represents the following recovery procedure [20].
¹If, in k successive experiments, the recovery times are T_1, T_2, ..., T_k, then the mean (first moment) is (1/k) Σ_{i=1}^{k} T_i and the second moment is (1/k) Σ_{i=1}^{k} T_i^2.
Figure 4: Recovery model for memory subsystem

Figure 5: Recovery model for processor subsystem
Transient Recovery Procedure. Assume that the fault is transient, and begin a multi-step recovery procedure that continues as long as an error is detected. If an error persists after all three steps have been performed, then a permanent recovery procedure must be invoked.

Step 1. Wait for 0.1 second and do nothing. If the fault is transient it may disappear during this time, allowing rollback to succeed.

Step 2. Retry the current instruction several times, for as long as a half-second. The probability that the retry will be successful (i.e., no error is detected) is 0.5.

Step 3. If an error persists, perform a rollback to a previous checkpoint, followed by recomputation, taking 2 sec. total. The rollback succeeds in removing the error 80% of the time.
If the fault is transient, a transient recovery can be successful only if the fault has disappeared before the step begins. For the analysis of this example, assume that the lifetime of a transient fault is exponentially distributed with a mean lifetime of 0.25 seconds. Transients comprise 90% of all faults, while the other 10% are permanent.

Permanent recovery. If an error still persists after the rollback, it is assumed to be caused by a permanent fault, and a system level permanent fault recovery process is begun to remove the offending processor from the set of active units and to reconfigure the system to continue without it. The permanent fault recovery process succeeds with probability 0.875.
The analysis of this coverage model consists of calcu-
lating the probability of system recovery (PSRi) for each
of the three steps of transient recovery, and for perma-
nent recovery. This calculation entails determining two
intermediate sets of quantities: the probability that the
transient has gone before step i is reached, and the prob-
ability that step i is taken.
Transient Recovery Exit. The transient recovery exit is reached if the fault is transient, and any of the three steps is successful in achieving system recovery.

Step 1. Step 1 is taken with probability one immediately upon the occurrence of the fault. Step 1 performs no actual recovery, so the probability of successful system recovery at the end of step 1 is zero (PSR1 = 0).

Step 2. System recovery from a transient error will occur in step 2 if

• the transient has disappeared during step 1 (with probability 1 − e^(−0.1/0.25) = 0.329), and

• the retry is successful (with probability 0.5).

Thus, PSR2 = (0.329) × (0.5) = 0.165.

Step 3. System recovery from a transient will occur in step 3 if

• steps 1 and 2 are unsuccessful (with probability 1 − 0.165 = 0.835), and

• the transient has disappeared during step 1 or 2 (with probability 1 − e^(−(0.1+0.5)/0.25) = 0.909), and

• the rollback is successful (with probability 0.8).

Thus, PSR3 = (0.835) × (0.909) × (0.8) = 0.607.

The probability of transient recovery is then the sum of the probabilities associated with the three steps (0 + 0.165 + 0.607 = 0.772). The transient restoration exit is reached if the fault is transient (with probability 0.9) and if transient recovery is successful; thus, r = 0.9 × 0.772 = 0.695.
Permanent Coverage Exit. There are two cases to consider for the analysis of the permanent recovery exit. The first case deals with the invocation of the permanent recovery procedure to handle a persistent transient fault; the second deals with recovery from a permanent fault.

Case 1. The permanent recovery process is initiated against a transient fault if the three steps of transient recovery have been unsuccessful. The probability associated with this case is then the product of

• the probability that the fault is transient (0.90);

• the probability that the three steps of transient recovery were not successful (1 − 0.772 = 0.228); and

• the probability that the permanent recovery process is successful (0.875).

Case 2. The permanent recovery process is successful against a permanent fault if

• the fault is permanent (with probability 0.10); and

• the permanent recovery process succeeds (with probability 0.875).

The probability of reaching the permanent coverage exit is then the sum of the probabilities associated with the two cases, thus c = 0.179 + 0.0875 = 0.267.
Single-Point Failure Exit. There are two cases to consider for the single-point failure exit.

Case 1. For a transient fault, the single-point failure exit is reached if the permanent recovery procedure is invoked, and fails to achieve system recovery. The probability associated with this scenario is (0.228 × (1 − 0.875) = 0.028). Multiplying by the probability that the fault is transient yields the probability for Case 1: (0.9 × 0.028 = 0.0252).

Case 2. For a permanent fault, the single-point failure exit is reached if the permanent recovery procedure is unsuccessful. Multiplying by the probability that the fault is permanent yields the probability for Case 2: ((1 − 0.875) × (0.10) = 0.0125).

The probability of reaching the single-point failure exit is the sum of the probabilities associated with Cases 1 and 2, thus s = (0.0252 + 0.0125) = 0.0377.
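As a cross-check on the processor FEHM arithmetic, the sketch below recomputes r, c, and s directly from the stated parameters. Carrying full precision rather than the three-digit intermediate values used in the text gives s of about 0.038 instead of 0.0377; the names are illustrative only.

```python
import math

p_transient   = 0.9     # fraction of processor faults that are transient
mean_life     = 0.25    # mean lifetime of a transient fault, seconds
p_retry_ok    = 0.5     # step 2: instruction retry succeeds
p_rollback_ok = 0.8     # step 3: rollback succeeds
p_perm_ok     = 0.875   # system-level permanent recovery succeeds

gone_before_2 = 1 - math.exp(-0.1 / mean_life)          # transient gone before step 2
gone_before_3 = 1 - math.exp(-(0.1 + 0.5) / mean_life)  # transient gone before step 3

psr1 = 0.0                                        # step 1 performs no recovery
psr2 = gone_before_2 * p_retry_ok                 # ~0.165
psr3 = (1 - psr2) * gone_before_3 * p_rollback_ok # ~0.607
p_trans_recovery = psr1 + psr2 + psr3             # ~0.772

r = p_transient * p_trans_recovery                                          # ~0.695
c = (p_transient * (1 - p_trans_recovery) + (1 - p_transient)) * p_perm_ok  # ~0.267
s = (p_transient * (1 - p_trans_recovery) + (1 - p_transient)) * (1 - p_perm_ok)

print(round(r, 3), round(c, 3), round(s, 3))   # the three probabilities sum to 1
```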
2.3 Near-coincident faults
In highly reliable systems, such as those used for flight
control, the probability of a second fault occurring while
attempting recovery from a given fault cannot be ignored.
The occurrence of a second, near-coincident fault (while
attempting to handle a single fault) causes immediate sys-
tem failure, if the second and first faults are critically
coupled. The modeler must designate which sets of faults
are critically coupled, or can assume either extreme: all
faults are critically coupled or no faults are critically cou-
pled. Once the set of critically coupled faults has been
determined, the calculation of the probability of near-
coincident faults is straightforward, given some measure
of the time spent in a recovery model. In the 3P2M ex-
ample, if a processor fails while a memory failure is being
handled, or during the recovery from a fault in another
processor, the system fails. If, however, a memory fails
during a processor recovery, no immediate failure occurs.
A bus failure would interfere with processor or memory
recovery.
A fourth exit is then added to the FEHM model, representing the occurrence of a near-coincident fault before another exit is reached. Consequently, the probability of reaching one of the original three exits is reduced by a factor equaling the probability that an interfering near-coincident fault does not occur. This single-entry, 4-exit model is then automatically inserted into the FORM model, as described in the following section.
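For a rough sense of the reduction factor, consider the special case of an exponentially distributed recovery time competing with a Poisson arrival of critically coupled faults; the probability that recovery finishes first is the familiar race between two exponentials. This sketch is ours and covers only this special case; HARP itself handles general recovery-time distributions.

```python
def p_no_interfering_fault(mean_recovery_seconds, coupled_fault_rate_per_hour):
    """P(recovery completes before a critically coupled fault arrives),
    assuming an exponentially distributed recovery time."""
    delta = 1.0 / mean_recovery_seconds              # recovery completion rate (1/s)
    gamma = coupled_fault_rate_per_hour / 3600.0     # interfering fault rate (1/s)
    return delta / (delta + gamma)

# Example with hypothetical numbers: a 0.45 s mean recovery time racing
# critically coupled faults that arrive at a combined rate of 2e-4 per hour.
factor = p_no_interfering_fault(0.45, 2e-4)
print(factor)   # ~ 1 - 2.5e-8; r, c and s are scaled by this factor, and the
                # remainder becomes the probability of the near-coincident exit.
```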
Figure 6: Combination of FEHM and FORM models

Figure 7: Reduction of combined FEHM and FORM models
Cause of Failure            Probability
Exhaustion of Processors    2.20 × 10^-10
Exhaustion of Memories      1.61 × 10^-10
Exhaustion of Buses         9.99 × 10^-6
Single Point Failure        3.53 × 10^-5
Near-Coincident Faults      4.49 × 10^-10
Total Unreliability         4.53 × 10^-5

Table 1: Solution of 3P2M example system
2.4 Combining FORM and FEHM mod-
els
Once the FORM and FEHM models are described, they are then combined. We demonstrate this process for the Markov chain in figure 2 which results from the fault tree in figure 1. For each failure of a redundant component, the appropriate FEHM model is invoked. That is, a FEHM model is inserted on each failure arc between operational states in the Markov chain, as shown in figure 6. In the 3P2M example, the FEHMs on the horizontal failure arcs are copies of the processor recovery model (figure 5), while the FEHMs on the vertical failure arcs are copies of the memory coverage model (figure 4). Two failure states are inserted:

• FSPF, denoting the occurrence of a single-point failure, and

• FNCF, denoting the occurrence of critically coupled near-coincident faults.

Each FEHM model is then solved for the probability of reaching each of its three exits, and the FEHM model is replaced by a branch point. The resulting Markov chain (see figure 7) is then solved for the reliability of the system, which is given by the probability that the system is not in any failure state.

Table 1 shows the results of the reliability analysis for a 10 hour mission of the 3P2M example. For this model, we assume that the failure rate of the processor is λ = 10^-4, for the memory μ = 10^-5, and for the bus ν = 10^-6. The largest contributor to the unreliability is single-point failure, that is, faults from which recovery is not successful.
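The branch-point substitution can be pictured as splitting each failure arc of the FORM-generated chain according to the solved FEHM exit probabilities. The sketch below shows that split for a single arc under the simplifying assumptions of instantaneous coverage and no near-coincident exit; it illustrates the idea only and is not HARP's internal algorithm.

```python
def split_failure_arc(rate, r, c, s):
    """Split one failure arc (rate = number of components x component failure rate)
    using FEHM exit probabilities with r + c + s = 1."""
    return {
        "covered_reconfiguration": c * rate,  # arc to the degraded operational state
        "single_point_failure":    s * rate,  # arc to the absorbing failure state FSPF
        "transient_restoration":   r * rate,  # component kept; effectively stays in the same state
    }

# First memory failure in the 3P2M example: 2 memories at mu = 1e-5 per hour,
# with the memory FEHM exit probabilities from section 2.2.1.
print(split_failure_arc(2 * 1e-5, r=0.98, c=0.01615, s=0.00385))
```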
3 Dynamic fault-tree gates
A major disadvantage of traditional fault tree analysis
is the inability of standard fault tree models to capture
sequence dependencies in the system, and still allow an
analytic solution. As an example of a sequence dependent
failure, consider a system with one active component and one standby spare connected with a switch controller [19]. If the switch controller fails after the active unit fails (and thus the standby is already in use), then the system can continue operation. However, if the switch controller fails before the active unit fails, then the standby unit cannot be switched into active operation and the system fails. Thus, the failure criteria depend not only on the combinations of events, but also on the sequence in which events occur.

Figure 8: Functional dependency gate
Systems with various sequence dependencies are usually modeled with Markov models. If, instead of using standard fault tree solution methods, the fault tree is converted to a Markov chain for solution, the expressive power of a fault tree can be expanded by allowing certain kinds of sequence dependencies to be modeled by defining special purpose gates to capture specific types of sequence dependent behaviors. There are several different kinds of sequence dependencies in fault tolerant systems. This section identifies several such dependencies, and defines specific gates to express these behaviors in fault tree models. Part II demonstrates the use of these gate types in several examples.
3.1 Functional dependency gate
Suppose that a system is configured such that the occurrence of some event (call it a trigger event) causes other dependent components to become inaccessible or unusable. In this case, later failures of the dependent components will not further affect the system and should not be considered. A functional dependency gate (see figure 8) has a single trigger input (either a basic event or the output of another gate in the tree), a non-dependent output (reflecting the status of the trigger event) and one or more dependent basic events. The dependent basic events are functionally dependent on the trigger event. When the trigger event occurs, the dependent basic events are forced to occur. In the Markov chain generation, when a state is generated in which the trigger event is satisfied, all the associated dependent events are marked as having
occurred. The occurrence of any of the dependent basic events has no effect on the trigger event.

Figure 9: Cold spare gate
The functional dependency gate is useful where communication is achieved through some network interface elements, where the failure of the network element isolates the connected components. In this case, the failure of the network element is the trigger event and the connected components are the dependent events. Part II describes several applications of the functional dependency gate.
3.2 Cold spare gate
Consider a system that utilizes cold spares, that is, spare components that are unpowered, and thus do not fail before being used. Such systems cannot be modeled exactly using standard fault tree techniques because the system failure criteria cannot be expressed in terms of combinations of basic events, all using the same time frame.

We address this fault tree deficiency by introducing a cold spare gate (see figure 9), with one primary input and one or more alternate inputs. All inputs are basic events. The primary input is the one that is originally powered on, and the alternate input(s) specify the (initially unpowered) components that are used as replacements for the primary unit. The cold spare gate has one output which becomes true after all the input events occur.
The conversion of the fault tree to a Markov chain makes the consideration of cold spares possible. In a state where the primary unit is operational, the cold spares are not permitted to fail. However, once the primary unit has failed, then the first alternate unit can fail. After the first alternate fails, the remaining alternates are allowed to fail, one at a time in the order specified, until the spares are exhausted. The possibility of being unable to reconfigure the spare unit correctly into operation is captured in the (separately specified) coverage model.

The functional dependency gate and the cold spare gate can interact in an interesting way. Suppose that the spare units are functionally dependent on some other (otherwise unrelated) component. The occurrence of the trigger event can render one or more of the spares unusable, even if they have not been switched into active operation yet. Then, if the primary unit fails, the spares are unavailable to replace it. This is the one case where a spare can "fail" even while it is unpowered. Part II gives examples of the use of the cold spare gate.

Figure 10: Priority-AND gate
3.3 Priority-AND gate
The priority-AND gate is logically equivalent to an AND gate, with the added condition that the events must occur in a specific order. The priority-AND gate (as shown in figure 10) has two inputs, A and B. The output of the gate is true if both A and B have occurred, and if A occurred before B. If both events have not occurred, or if B occurred before A, then the gate does not fire. To represent the behavior that A occurs before B which occurs before C, the priority-AND gates can be cascaded as shown in figure 11.
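Since the gate's output depends on the order in which events occur, its behavior is easiest to state in terms of occurrence times. The short sketch below (our notation; math.inf stands for an event that never occurred) evaluates a single priority-AND gate and the cascaded arrangement of figure 11.

```python
import math

def pand(t_a, t_b):
    """Priority-AND: fires iff both events occurred and A occurred before B."""
    return t_a < math.inf and t_b < math.inf and t_a < t_b

def a_then_b_then_c(t_a, t_b, t_c):
    """Cascaded priority-AND of figure 11: A before B before C."""
    return pand(t_a, t_b) and pand(t_b, t_c)

print(pand(2.0, 5.0))                  # True: A failed first, then B
print(pand(5.0, 2.0))                  # False: B failed first, so the gate does not fire
print(a_then_b_then_c(1.0, 3.0, 7.0))  # True
```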
3.4 Sequence enforcing gate
The sequence enforcing gate forces events to occur in a particular order. The input events are constrained to occur in the left-to-right order in which they appear under the gate (i.e., the leftmost event must occur before the event on its immediate right, which must occur before the event on its immediate right is allowed to occur, etc.). There may be any number of inputs (see figure 12), the first of which may be a (possibly replicated) basic event or the output of some other gate. All inputs other than the first are limited to being (possibly replicated) basic events. The sequence enforcing gate can be contrasted with the priority-AND gate in that the priority-AND gate detects whether events occur in a particular order (the events can occur in any order), whereas the sequence enforcing gate will only allow the events to occur in a specified
order.
In the generation of a Markov chain from a fault tree containing a sequence enforcing gate, states that represent any other ordering than that specified by the sequence enforcing gate are never generated. In part II of this tutorial we will show an interesting application of the sequence enforcing gate to model pooled spares.
Figure 11: Cascading priority-AND gates
Figure 12: Sequence enforcing gate
Part II
Examples
We study several examples of advanced fault tolerant systems, and develop fault tree models to analyze the reliability of these systems. The models are all solved with HARP, the Hybrid Automated Reliability Predictor, developed at NASA Langley Research Center and Duke University. The parameters used for these models and the details of the recovery mechanisms are pure conjecture, and should not be interpreted as a factual representation of the parameters associated with the systems.
4 Cm*: a loosely-coupled dis-
tributed system
4.1 System description
An instance of the Cm* system (shown in figure 13) consists of 2 clusters of processors and memories connected by links [22]. Each cluster consists of 4 local switch interface controllers (S.locals), each attached to one processor and one 12K memory module. Each processor has 4K of memory on board. The K.map is a cluster controller connecting the S.locals; the clusters are connected by inter-cluster communications (L.inc). A fault in the K.map renders the associated S.locals (and their connected processors and memories) inaccessible, while a fault in an S.local makes the processor and memory modules connected to it inaccessible.
The Cm* system exhibits three characteristics that are typical of reliable distributed systems.

1. There are functional interdependencies which can make the development of the fault tree model difficult, for example, the dependence of the accessibility of the processors and memories on the state of the S.locals.

2. There are many potential system states: since there are 27 components, the system can be in any one of 2^27 > 134 million states, if any component can be in one of two states, functional and failed.

3. There are many failure modes: there are 5405 minimal cut sets for this system (a cut set is a set of components whose failure causes the system to fail).
4.2 Failure criteria
The system is considered operational as long as there are 3 processors that can communicate with 3 memories. As long as the L.inc is operational, these requirements can be satisfied by the components of both clusters. But, if the L.inc fails, the requirements must be met within one cluster.

Figure 14: Fault tree model of Cm* system
4.3 Fault tree model
The development of the fault tree model of the Cm* system is simplified by the use of a functional dependency gate, to capture the interconnection dependencies. A fault tree model of the Cm* system is shown in figure 14. System failure (the top event) can be attributed to one of two causes which are shown as inputs into the uppermost OR gate. Failure occurs when either the L.inc fails and the requirements cannot be satisfied by a single cluster (the left input to the uppermost OR gate), or (independent of the state of the L.inc) there are an insufficient total number of processors or memories in both clusters. The output of an m/n gate is true when m of the n input events have occurred.

The functional dependencies of the S.locals on the K.maps and of the processors and memories on the associated S.local are captured in the functional dependency gates (FDEP) shown in figure 14. In this case, there were no explicit reliability requirements concerning the K.maps or S.locals, so the functional dependency gate is not explicitly connected to the top event in the fault tree.
Figure 13: A diagram of the Cm* system. (P = LSI-11 processor; Mp = memory (12K words); S.local = local switch interface controller; K.map = cluster controller; L.inc = inter-cluster communications)
Figure 15: Fault tree model of Cm* system without functional dependency gates
Figure 16: An AIPS I/O network used for example calculations
In order to solve a fault tree model containing functional dependency gates via standard combinatorial solution methods, we need to convert the model to a strictly combinatorial one. To accomplish this conversion, the dependency gates can be replaced with OR gates in the following manner. For each occurrence of a dependent basic event, replace that basic event with a logical OR of the basic event and its trigger event. Thus in the Cm* system, each basic event representing a processor failure is replaced by a logical OR of that processor event, its S.local and its K.map. Memory events are altered in a similar manner. The fault tree that results from replacing the functional dependency gates is shown in figure 15. The replacement of the functional dependency gates only produces a correct result if no FEHM models are used, that is, if all faults are permanent and are instantaneously and perfectly covered.
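Under the stated conditions (independent, permanent, instantaneously and perfectly covered faults), the replacement amounts to computing each dependent basic event as the OR of the component and its triggers. A small sketch with hypothetical failure probabilities:

```python
def or_probability(*event_probs):
    """Probability that at least one of several independent events occurs."""
    p_none = 1.0
    for p in event_probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

# Hypothetical mission failure probabilities for one processor and its triggers.
q_processor = 1e-3
q_s_local   = 5e-4
q_k_map     = 2e-4

# Effective basic-event probability after replacing the FDEP gate with an OR:
q_effective = or_probability(q_processor, q_s_local, q_k_map)
print(q_effective)   # ~1.7e-3, used wherever the processor basic event appears
                     # in the purely combinatorial version of the fault tree
```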
5 AIPS: a system of fault-
tolerant building blocks
5.1 System description
An example of the AIPS (Advanced Information Processing System) I/O network is shown in figure 16. The AIPS system, designed at the Charles Stark Draper Laboratory, is intended to provide fault-tolerant building blocks that can be used for a variety of real-time control applications [16]. The AIPS I/O network might be used in a flight control system, and consists of 3 rings, each of which contains 5 nodes. Three of the nodes on each ring (those labeled A, B, E) are connected to sensors and/or actuators. Each such device is triplicated, with one copy of each device connected to each ring, via a node in the same location (with the same letter label). The remaining two nodes, C and D, are termed root nodes because they provide the connections to the triplicated computers.
5.2 Failure criteria and parameters
The I/O network fails when

1. nodes in the same location on two different rings either fail or become isolated from both root connections, OR

2. 2 of the 3 computers fail or become disconnected from both rings, OR

3. 2 of the three rings become disconnected from both computers.

As long as a node can communicate with one computer, it can communicate with all computers that are up, because the computers are assumed to be connected by a perfectly reliable interconnection mechanism (such as shared memory). For the purpose of this analysis we consider only the I/O network and the computer connections, and not the possible failures of the devices (such as sensors and actuators) connected to the nodes. The failure parameters used for this analysis are
used for this analysis are
. Node failure rate: 6 x 10 -6 per hour
• Link failure rate: 12 x 10 -s per hour
• Computer failure rate: 10 -4 per hour
5.3 Fault recovery
Recovery from faults in nodes and links is assumed to
be perfect and instantaneous. For the computers, how-
ever, more detailed coverage modeling is necessary. It
is assumed that 85 percent of the faults that occur in
the computer system are transient, with the remaining 15
percent being permanent or intermittent in nature. Re-
covery from computer faults is assumed to be perfect, but
not instantaneous: the time to recover from a transient
is 1 second, while the time to recover from a permanent
or intermittent is uniformly distributed between 1 and 5
seconds. During the recovery interval, if a second, near-
coincident fault occurs in either of the other computers,
the recovery is interrupted, and system loss is conserva-
tively assumed to occur.
5.4 Fault tree model
The fault tree model of the AIPS I/O network has 102
nodes, including 39 basic events, and is too large to be
presented here as a whole. However, figure 17 is a sketch of the fault tree with some of the paths complete. The system fails when one of the seven triplicated subsystems fails (hence seven 2/3 gates are connected to the top OR gate), these being node groups A through E, the computers, and the root connections between the rings and the computers. A representative of each of the 7 subsystems is shown in detail; the other members of each triplicated subsystem are analogous. The results of the solution of this model appear in table 2.

Figure 17: Fault tree model of AIPS I/O network
5.5 Truncated fault tree
An interesting alternative to the development of the full fault tree model is the concept of a truncated fault tree. For the AIPS network (figure 16), the expansion of only 2 failure levels produced a reasonably accurate estimate of the system unreliability. For this case, we could have produced a similar result with a much simpler fault tree, one which explicitly defined only the 2-component failure combinations. Consider the fault tree representation of the AIPS network in figure 18. The top event of this tree is 2-component-failure system loss, where the system loss is caused by losing 2 members of any triplicated subsystem. No combination of 2 link failures, or one link failure and one other component failure, can lead to system failure, and so the link basic events do not input to any gates in
the truncated fault tree. The presence of these dangling
basic events (basic events that do not input to any gate in the fault tree) can be used to bound the failure probability. If the dangling basic events are ignored then the solution of the fault tree gives an optimistic estimate of the unreliability of the system.

Figure 18: Truncated fault tree model of AIPS network
If we are using a strictly combinatorial solution method, we can use the dangling basic events to determine the upper bound on the unreliability by using a k-out-of-n gate. Connect all n basic events (those that are dangling as well as those that are not) to a 3-out-of-n gate (a gate that is activated on the third component failure), and OR its output with the top event of the tree. This is equivalent to assuming that the third component failure causes system failure.
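In the combinatorial setting, the extra term contributed by the 3-out-of-n gate is just the probability that at least three of the n components fail during the mission. The sketch below computes that tail probability for independent components; the per-component probability used here is hypothetical.

```python
def prob_at_least_k(event_probs, k):
    """P(at least k of the independent events occur), via the distribution
    of the number of occurrences (a Poisson-binomial dynamic program)."""
    dist = [1.0]                       # dist[j] = P(exactly j occurrences so far)
    for p in event_probs:
        new = [0.0] * (len(dist) + 1)
        for j, pj in enumerate(dist):
            new[j]     += pj * (1 - p)
            new[j + 1] += pj * p
        dist = new
    return sum(dist[k:])

# Example: 39 basic events, each with a hypothetical 1e-4 mission failure probability.
bound_term = prob_at_least_k([1e-4] * 39, 3)
print(bound_term)   # ~9e-9; ORed with the truncated tree's top event, it turns the
                    # truncated result into an upper bound on the full unreliability
```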
If we need to include the effects of imperfect coverage in the model, we can use the dangling basic events in conjunction with the conversion of the fault tree to a Markov chain. As the Markov chain state space is expanded, all the basic events become part of the state definition. The resulting Markov chain can be used to produce bounds on the unreliability of the system from the solution of the truncated fault tree. It is not necessary in this case to add the m-out-of-n gate as was done with the strictly combinatorial solution. The basic events are simply left dangling. The presence of dangling basic events is crucial to the determination of correct bounds on the system unreliability.
The solution of the truncated Markov chain corresponding to the truncated fault tree of the AIPS system is shown in table 3. A comparison of the numbers in this table with those in table 2 shows that the truncated fault tree can give reasonable results. The time needed by a reliability analyst to determine a truncated fault tree is substantially less than the time required to derive a complete fault tree model of a system. Further, the combination of a truncated solution technique and a truncated fault tree can allow more faith to be placed in the model, since if there are missing failure combinations they may
Full Fault Tree Model Solution, AIPS I/O Network Example System

Truncation Level               1 Component Failure           2 Component Failures
Size of Truncated Model        42 states, 190 transitions    770 states, 5155 transitions
Lower Bound on Unreliability   0.125e-6                      0.126e-6
Upper Bound on Unreliability   2.94e-6                       0.128e-6
Total Run Time                 65 CPU seconds                1295 CPU seconds

Table 2: Solution of example AIPS system
Truncated Fault Tree Model Solution, AIPS I/O Network Example System

Truncation Level               1 Component Failure           2 Component Failures
Size of Truncated Model        42 states, 190 transitions    770 states, 4879 transitions
Lower Bound on Unreliability   0.126e-6                      0.1261e-6
Upper Bound on Unreliability   0.640e-6                      0.1263e-6
Total Run Time                 58 CPU seconds                1144 CPU seconds

Table 3: Solution of truncated fault tree model of AIPS system
be accounted for by the bounding technique.

Figure 19: An instance of the fault tolerant parallel processor
6 FTPP: Fault tolerant parallel
processor
6.1 System description
Next we consider several models of the FTPP (Fault Tolerant Parallel Processor) [18, 17] cluster, to compare various configurations of triads with spares. An instance of an FTPP cluster is shown in figure 19, and consists of 16 processing elements (PE), with 4 connected to each of 4 network elements (NE). The network elements are fully connected. In the clusters modeled here, the 16 processors are logically connected to form 4 triads, each with one spare. We investigate three triad/spare configurations, the first two with hot spares and the third with cold spares:

#1 utilizes hot spares; there is one spare for each triad and all spares are attached to the same network element.

#2 also uses hot spares; there is one spare on each network element and the spare PE can substitute for any failed PE attached to the same network element.

#3 is the same as #1, with all spares on the same NE, but in configuration #3 the spares are cold.

The processing elements in all three configurations functionally depend on the network element to which they are connected. If a network element experiences a permanent failure, the processing elements connected to it are then considered failed.
6.2 Failure criteria and parameters
For all models, a triad fails when it has fewer than 2 active components; the system fails if any triad fails. Failures occur at a constant rate of 1.1 × 10^-4 per hour for processing elements, and 1.7 × 10^-5 per hour for network elements.
Figure 20: Configuration #1 with one spare per triad

Figure 22: Configuration #2 with one spare per NE
6.3 Fault recovery
Recovery and reconfiguration from faults in processing elements are both perfect, but take a non-zero amount of time. If a second fault occurs in any other component during attempted recovery from a first fault, the system fails. Half of the faults that occur in the processing elements are transient, and can be recovered from without discarding the affected component. The remainder of faults are permanent. The time to recover is exponentially distributed with a mean of 3.6 seconds. Coverage of NE failures is both instantaneous and perfect.
6.4 Fault tree models
6.4.1 Configuration #1
Configuration #1 (shown in figure 20) divides the active elements of a triad among NE1, NE2, and NE3, and uses the PE's on NE4 as spares. The PE's that are in the same relative position on the first three network elements form a triad, and the PE in the same relative position on NE4 serves as a hot (active) spare for the triad.

The fault tree model for configuration #1, shown in figure 21, uses four functional dependency gates (FDEP) to reflect the dependence of the processing elements on the network elements. The FDEP gates are not explicitly connected to the other gates in the tree, since the reliability requirements (all 4 triads must be operational) do not explicitly mention the network elements. Figure 21 shows four 3/4 gates connected to the top OR gate, one 3/4 gate for each triad. A triad fails when only one element remains (3 of the 4 elements have failed).
6.4.2 Configuration #2
Configuration #2 is an FTPP cluster with hot spares distributed across the network elements instead of grouped on the same network element (see figure 22). The spare element on each network element can substitute for any failed PE connected to the same NE. That is, processing element TS1 can substitute for a failed PE connected to NE1.

The fault tree model of this system is a bit more complex than the one presented in section 6.4.1, and is shown in figure 23. The functional dependency gates FDEP again reflect the dependence of the processing elements on the network elements. A triad failure is again attributed to losing the majority of operational elements, but it is more difficult to describe the failure of a member of the triad. A member of the triad is failed if it and its spare fail or if its spare is not available when needed. The spare is not available if some other PE on the same NE fails and uses the spare before it is needed by the first PE. For example, in figure 23, the leftmost OR gate that inputs into the leftmost 2/3 gate represents the failure of the first member of the first triad. This member fails if both T11 (the first member of the first triad) and its spare (TS1) fail, or if the spare is being used because another failure has already occurred when T11 fails. The spare will already be in use when T11 fails if either T22 or T33 (the other two active components on the same NE) have failed before T11 does. This condition is reflected in the Priority-AND gate that inputs to the same OR gate. There is a similar structure of AND and Priority-AND gates to represent the failure of the other members of the triads.
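The gate structure just described can be written out directly in terms of failure times. In the sketch below (our notation; math.inf means the component never failed; names follow figure 23), the first member of triad 1 is failed either because T11 and its spare TS1 have both failed, or because T11 failed after the spare had already been claimed by T22 or T33.

```python
import math

def member1_of_triad1_failed(t_T11, t_TS1, t_T22, t_T33):
    """Failure condition for the first member of triad 1 in configuration #2."""
    if t_T11 == math.inf:                                  # T11 never failed
        return False
    spare_also_failed = t_TS1 < math.inf                   # AND gate: T11 and TS1
    spare_already_used = t_T22 < t_T11 or t_T33 < t_T11    # priority-AND terms
    return spare_also_failed or spare_already_used

print(member1_of_triad1_failed(4.0, math.inf, 2.0, math.inf))  # True: T22 claimed the spare first
print(member1_of_triad1_failed(4.0, math.inf, 6.0, math.inf))  # False: the spare was still free
```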
Figure 21: Fault tree model for configuration #1

Figure 23: Fault tree model for configuration #2

Figure 24: Fault tree model for configuration #3 with one COLD spare per triad
6.4.3 Configuration #3

The third configuration is used to investigate the effect on reliability of keeping the spares unpowered until needed. The FTPP configuration modeled in this section is the same as configuration #1 (figure 20) except that the spares are cold rather than hot. There is one spare for each triad, and all spares are connected to the same network element. The fault tree model for this system, shown in figure 24, uses the cold spare gate. There is one cold spare gate for each member of each triad, where the initially active members of the triad are used as the primary inputs. The basic event representing the cold spare PE is connected to all three cold spare gates since it can substitute for any of the elements.
6.5 Results
This section presents the results obtained from solution of the models of the three FTPP configurations for a mission length of 10 hours. Table 4 compares the reliability of the three configurations. We solved a truncated model (described in more detail later in this section) which produces bounds on the unreliability from a partial solution of the model. Table 4 shows the bounds on the unreliability, and the best case (optimistic) estimate of the probabilities of exhaustion of network elements (exh NE), exhaustion of processing elements (exh PE) and near-coincident failures (NCF).
Configuration #2 (that distributed the hot spares across the network elements) not only required a more complicated fault tree for analysis, but also was appreciably less reliable than configuration #1. In configuration #2, the failure of 2 network elements (alone) can kill the system, since the failure of 2 network elements removes 2 members from at least one triad. For example, if NE1 and NE2 both fail, then T11 and T12 are both disabled, and no spare is available to replace them (because of the functional dependencies). The solution of the model for configuration #2 shows that the predominant cause of failure is the exhaustion of network elements. In configuration #1, the loss of 2 network elements (alone) does not cause any triad to fail, even though it can render all the spare elements unusable.
In the #3 configuration, the spare elements remained unpowered until needed, resulting in a modest decrease
in unreliability. Since near-coincident failures contributed
more highly to the unreliability of the system, the effect
of keeping the PEs unpowered was not as significant as
might be expected.
For all three models, the Markov chain was truncated after the consideration of 2 or 3 faults, and so a pair of bounds on the actual reliability were generated. The bounds were tight enough after only considering 2 faults for configuration #2, but we needed to consider a larger model for the other two cases. The reason that the bounds were tighter for configuration #2 is that there were a significant number of failure states encountered when only considering 2 component failures. In the #1 and #3 configurations, there were not many failure states with only 2 failed components. Unfortunately, the number of states in a Markov chain increases exponentially with the number of component failures considered, so the increase in accuracy is accompanied by a large increase in solution times. Table 5 compares the results obtained from the smaller model (truncated after 2 failures) and the larger model (truncated after 3 failures), as well as the size of the models and the run time for the complete generation and solution of the model on a DECstation 3100.
7 ASID MAS: a mission avionics
system
7.1 System Description
The ASID (Advanced System Integration Demonstration) project was the first large scale effort in the development of the PAVE PILLAR architecture for advanced tactical fighters. The Boeing Military Airplane Company was one of five contractors who designed implementations of the PAVE PILLAR project. A unique feature of the Boeing implementation [5] is the use of dual processor pairs wherever a single processor is required. This processor-pair uses comparison monitoring so as to achieve very high levels of error detection. For critical functions, high levels of reliability are assured by using redundant processor-pairs in duplex or triplex mode. We analyze the reliability of the critical functions of the mission avionics part of the ASID system.
There are several critical functions within the mission
avionics system (MAS). The loss of any of these functions
causes the MAS to fail. These critical functions include
the vehicle management subsystem (VMS), the crew sta-
tion control and display functions, mission and systems
management, local path generation, and scene and ob-
stacle following functions. The vehicle management sub-
system provides airframe control, including flight and
propulsion control, as well as providing utility systems
management and control. The crew station subsystem
displays information to the pilot, contains mechanisms
for pilot control actions, and manages crew station ac-
tivity. The mission and systems management subsystem
allocates resources for real time control functions.
Figure 25 is a block diagram of the architecture of the
critical mission avionics system. One processing unit is
required for the crew station functions, local path generation, and mission and system management. Each of these
Configuration                  #2: Hot spare per NE   #1: Hot spare per triad   #3: Cold spare per triad
(Best Case) Unreliability      0.207 × 10^-6          0.406 × 10^-7             0.264 × 10^-7
(Worst Case) Unreliability     0.417 × 10^-6          0.407 × 10^-7             0.266 × 10^-7
(Best case) exh. NE            0.174 × 10^-6          0.135 × 10^-8             0.104 × 10^-8
(Best case) exh. PE            0.327 × 10^-8          0.910 × 10^-8             0.705 × 10^-8
(Best case) NCF                0.302 × 10^-7          0.302 × 10^-7             0.183 × 10^-7

Table 4: Results of the solution of all three FTPP models
Configuration                  #2: Hot spare/NE       #1: Hot spare/triad       #3: Cold spare/triad

Truncated at 2 component failures:
(Best Case) Unreliability      0.207 × 10^-6          0.406 × 10^-7             0.263 × 10^-7
(Worst Case) Unreliability     0.417 × 10^-6          0.242 × 10^-6             0.132 × 10^-6
Number of states               201                    123                       225
Number of transitions          877                    581                       817
Runtime (CPU seconds)          138                    99                        99

Truncated at 3 component failures:
(Best Case) Unreliability      analysis not           0.406 × 10^-7             0.264 × 10^-7
(Worst Case) Unreliability     necessary for          0.407 × 10^-7             0.266 × 10^-7
Number of states               this example           961                       2307
Number of transitions                                 5469                      9777
Runtime (CPU seconds)                                 2653                      5055

Table 5: Comparison of accuracy and model size
Figure 25: Block diagram of mission avionics system architecture
processing units is supplied with a hot spare backup to take over control if the primary processor should detect an error. Each of the processing units is really a pair of tightly coupled processors so as to maximize the probability of fault detection and minimize latency. Although there are really 4 active processors for each of these functions, we treat the processor-pairs as a single processing unit, since they are not used independently. When a mismatch of results is detected, both members of the processing pair are removed from the system. Figure 25 thus shows that there are two processing units for these functions, where one is the primary unit and the other is a hot spare.
The scene and obstacle subsystem and the VMS both require more functionality than one processing unit can provide, and thus each use 2 processing units. The scene and obstacle processing units are also replicated, providing a hot spare backup. The VMS is triplicated, providing 2 hot spare backups.
In addition to the hot spare backups, 2 additional pools of spares are provided, each containing 2 spare processing units. The first pool can be used to cover the first 2 processor failures in the subsystems other than the VMS; the second pool covers the first 2 failures in the VMS. The processing units are connected via 2 triplicated bus systems, the first being a data bus and the second being the mission management bus. The replicated memory is connected to the data bus. The VMS has an additional triplicated bus, the vehicle management bus.
7.2 Failure criteria and parameters
The MAS fails if any of the functions cannot be per-
formed, or if both of the 2 memories fail, or if all 3 of
any one type of bus fail. The following MTBF (mean time between failures) values, and the failure rates (per hour) they give rise to as reciprocals, were used; a short conversion sketch follows the list.
• processor pairs: 40,000 hours; failure rate: 2.5 × 10^-5
• buses: 400,000 hours; failure rate: 2.5 x 10 -6
• memories: 1,000,000 hours; failure rate: 1.0 × 10 -6
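As a small aside (ours, not part of HARP), the rates above follow from the MTBF values under the usual constant-failure-rate assumption, rate = 1/MTBF. A minimal Python sketch:

# Hedged sketch: convert the MTBF values of section 7.2 into failure rates,
# assuming exponentially distributed lifetimes (rate = 1 / MTBF).
mtbf_hours = {
    "processor pair": 40_000,
    "bus": 400_000,
    "memory": 1_000_000,
}
failure_rates = {name: 1.0 / mtbf for name, mtbf in mtbf_hours.items()}
for name, lam in failure_rates.items():
    print(f"{name}: {lam:.1e} per hour")
# processor pair: 2.5e-05, bus: 2.5e-06, memory: 1.0e-06 (per hour)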
7.3 Fault recovery
Fault detection is perfect (because of the processing pairs)
but it takes between 0.15 second and 5 seconds (uniformly distributed) for recovery to occur. If a second, near-coincident failure occurs during this interval, we say that the system fails because of near-coincident failures (NCF).
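To see why this window matters, the probability that some other component fails while a recovery of uniformly distributed duration is in progress can be written in closed form. The sketch below is our own illustration of that calculation, not HARP's FEHM solution; the aggregate rate passed in is an assumed, illustrative value.

import math

def p_near_coincident(total_rate_per_hour, a_sec, b_sec):
    """Probability that at least one other component fails while a recovery
    uniformly distributed on [a_sec, b_sec] seconds is in progress.
    Assumes the remaining components fail exponentially with the given
    aggregate rate (a simplification, for intuition only)."""
    lam = total_rate_per_hour / 3600.0            # convert to per second
    a, b = a_sec, b_sec
    # E[1 - exp(-lam * T)] for T ~ Uniform(a, b)
    return 1.0 - (math.exp(-lam * a) - math.exp(-lam * b)) / (lam * (b - a))

# Assumed aggregate rate for the remaining components (illustrative only).
print(p_near_coincident(total_rate_per_hour=5.0e-4, a_sec=0.15, b_sec=5.0))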
Figure 27: Fault tree model of a 2-duplex system
7.4 Fault tree model
The fault tree model of the mission avionics system is
complicated by the presence of the pooled spares. For
ease of exposition, we first present a fault tree model that
ignores the pooled spares. We then describe the methodology for modeling pooled spares via a fault tree with sequence dependency gates, by way of a simple example. Finally, we define the full fault tree model of the mission avionics subsystem including the pooled spares.
7.4.1 Fault tree with no pooled spares
The fault tree model of the MAS with no pooled spares is shown in figure 26. This fault tree shows that the MAS fails if any of the critical functions fail, or if either of the
bus systems fail, or if both memories fail. There are 3
types of components in the example fault tree, processing
units (type 1), buses (type 2) and memories (type 3). The
crew station, for example, uses 2 components of type 1,
so its basic event is labeled 2 * 1. The memory subsystem
uses 2 memories and is thus labeled 2 * 3, while the mission management bus subsystem uses three buses and is labeled 3 * 2.
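Because the figure 26 tree (without pooled spares) contains no sequence dependencies, it can be evaluated combinatorially. The sketch below (ours, not HARP output) evaluates only the branches whose gate structure is unambiguous from the text, the three triplicated bus systems and the duplicated memory, assuming perfect coverage and ignoring near-coincident faults, so it illustrates the structure rather than reproducing the reported result.

import math
from math import comb

def p_fail(rate_per_hour, t_hours):
    """Probability that a single exponential component has failed by time t."""
    return 1.0 - math.exp(-rate_per_hour * t_hours)

def at_least(k, n, p):
    """Probability that at least k of n identical components have failed."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

t = 200.0                       # mission time used in section 7.4.3 (hours)
p_bus = p_fail(2.5e-6, t)
p_mem = p_fail(1.0e-6, t)

# A bus system (3 * 2 basic event) fails when all 3 of its buses fail; the
# memory subsystem (2 * 3 basic event) fails when both memories fail.
branch_probs = [at_least(3, 3, p_bus)] * 3 + [at_least(2, 2, p_mem)]
p_top_partial = 1.0 - math.prod(1.0 - q for q in branch_probs)
print("partial (bus + memory branches only):", p_top_partial)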
7.4.2 Modeling pooled spares
Before we add the pooled spares to the fault tree model
of the MAS, consider a simple system with two duplexes
and 2 pooled spares. The fault tree model of this 2 duplex
system is shown in figure 27, while the equivalent Markov
chain is shown in figure 28. This equivalent Markov chain
is determined automatically by HARP.
Next, consider the desired Markov chain representation
of the same 2-duplex system with the addition of 2 pooled
spares (figure 29). The 2 pooled spares cause 2 states
to be added to the front of the Markov chain. These 2
states represent the first 2 failures in the system which will
deplete the spares.
Figure 26: Fault tree model of MAS with no pooled spares
Figure 28: Markov chain model of a 2-duplex system
Figure 29: Markov chain model of a 2-duplex system with
2 pooled spares
Figure 30: Fault tree model of a 2-duplex system with 2 pooled spares
After the first 2 failures, 2 functioning duplexes remain, and the rest of the Markov chain in figure 29 is identical to that in figure 28.
We can use the fault tree shown in figure 30 to represent the 2-duplex system with 2 pooled spares. In figure 30, the combination of the 2/6 gate (which fires after the first 2 of 6 failures) and the FDEP gate creates a Markov chain that models the first 2 failures of 6 components. After the first 2 failures, the FDEP gate stops any more of the 6 components from failing. The two SEQ gates in figure 30 do not allow the two basic events labeled with 1 * 2 to begin to fail until after the 2/6 gate has fired. After the 2/6 gate has fired, then the rest of the fault tree (which is identical to the one in figure 27) can occur as usual. This combination of FDEP and SEQ gates can be used in a more general setting to tie multiple Markov chains together.
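For intuition about the chain of figure 29, the following sketch builds a small continuous-time Markov chain of the kind HARP generates and solves it for a transient unreliability. The state space and rates below are our assumptions for illustration only: 4 active units each failing at rate lam, 2 pooled spares that immediately replace a failed unit while any remain, and a duplex that fails once both of its units are down with no spare available.

import numpy as np
from scipy.linalg import expm

lam = 2.5e-5          # per hour; illustrative value (processor-pair rate)

# States: 0 = 2 spares left, 1 = 1 spare left, 2 = no spares, all units up,
#         3 = one duplex degraded, 4 = both duplexes degraded, 5 = failed.
Q = np.zeros((6, 6))
Q[0, 1] = 4 * lam      # any of 4 active units fails; a pooled spare swaps in
Q[1, 2] = 4 * lam
Q[2, 3] = 4 * lam      # first unreplaceable failure degrades one duplex
Q[3, 4] = 2 * lam      # a unit of the other duplex fails
Q[3, 5] = 1 * lam      # the degraded duplex loses its second unit: failure
Q[4, 5] = 2 * lam      # either degraded duplex loses its second unit
np.fill_diagonal(Q, -Q.sum(axis=1))

t = 200.0                          # mission time in hours
p = expm(Q * t)[0]                 # transient probabilities starting in state 0
print("unreliability ~", p[5])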
7.4.3 Full model and results
Figure 31 is the full fault tree model of the MAS, including the pools of spares. The leftmost FDEP and SEQ gates show the 2 spares for the vehicle management system, while those to the right represent the other 2 spares.
Because of the sequence dependency gates, this fault tree cannot be solved by standard combinatorial methods, but rather must be converted to a Markov chain for solution. HARP performs this conversion automatically, and produces a truncated Markov chain with 479 states and 2517 transitions. The Markov model is truncated after considering 5 component failures. Instead of producing an exact reliability estimate, bounds that encompass the reliability of the full model are produced. For a 200 hour interval, the unreliability lies between 1.138 × 10^-7 and 1.146 × 10^-7.
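One common way to read such bounds off a truncated chain (in the spirit of the bounded approximation of [16]; the variable names and numbers below are ours) is to keep one aggregate state that collects all pruned paths: the lower bound counts only the explicitly reached failure states, and the upper bound additionally assumes every pruned path leads to failure.

def truncation_bounds(p_failure_states, p_truncated_aggregate):
    """Lower bound: probability mass in explicit failure states.
    Upper bound: same mass plus everything pruned at the truncation level."""
    lower = sum(p_failure_states)
    upper = lower + p_truncated_aggregate
    return lower, upper

# Illustrative numbers chosen only to reproduce the interval quoted above.
print(truncation_bounds([1.138e-7], 0.8e-9))   # -> (1.138e-07, 1.146e-07)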
8 Three fault tolerant hypercube
architectures
We next model three fault tolerant hypercube architectures. All three contain 8 processing nodes connected in a hypercube of dimension 3. All three consist of 2 fault-tolerant modules with each module containing 4 processing nodes. The three architectures differ in the ways that spare nodes are incorporated into the fault-tolerant modules, in the way that messages are routed between processing nodes, and in the architecture of the individual processing nodes. The hypercube architectures are described in more detail in [7] and are discussed only briefly here.
8.1 System description
8.1.1 Architecture 1
Architecture 1 is based on the hierarchical approach to
sparing proposed by Rennels[21] and is depicted in figure
32. It consists of 2 fault tolerant modules of processing
nodes. Each module contains 4 processing nodes and one
spare node. The spare is connected by a port to each of
the 4 active processors in the module.
The processing nodes themselves are comprised of 5 individual processors (4 active processors and a spare) which communicate over a bus and share a memory
module. The memory module contains spare bit planes
and spare chips within a bit plane. The processing node is
connected to its neighboring nodes in the hypercube by 4
ports. Three ports communicate across the three dimen-
sions of the hypercube, and the fourth port communicates
with the spare processing node of the module.
8.1.2 Architecture 2
Architecture 2, also depicted in figure 32, is identical to
architecture 1 except that the ports within each process-
ing node are replaced by hyperswitch ports[9]. The hy-
perswitch allows an adaptive routing method to avoid
failed or congested links within the hypercube. It permits
any 2 nodes of the hypercube to communicate as long as
there exists any nonfailed path between them anywhere
throughout the hypercube.
8.1.3 Architecture 3
Architecture 3 [8], depicted in figure 33, differs from architectures 1 and 2 in several important ways.
Figure 31: Full fault tree model of mission avionics system

Figure 32: Architecture 1 (Rennels)
Processing nodes are again configured into 2 fault tolerant modules (each containing 4 active processing nodes and one spare), however the inter-node connections are mediated by decoupling switches rather than being direct connections between ports of neighboring nodes. The hypercube connectivity and the switching of spares online and failed nodes offline is performed using these decoupling switches[8].
The switches are intended to be comparatively simple de-
vices. One consequence of using the switches to control
access to the spare nodes is that the spares cannot provide
redundancy for links as was possible for architectures 1 and 2.
The processing nodes of the hypercube are much simpler and contain processors that are much less powerful than those of architectures 1 and 2. Each processing node
consists of 2 processors which perform identical compu-
tations in parallel. The output is compared to detect
faults. A recovery module is responsible for fault han-
dling upon the detection of a processor fault. The node
may either declare itself failed or attempt a reconfigura-
tion to a simplex configuration upon detection of such a
processor fault. Both processors have access to a single
memory module and a DMA (direct memory access) mod-
ule. Finally, each processing node communicates with the
outside world through three ports, each of which connects
the node to its neighbor across one dimension of the hy-
percube.
Figure 33: Architecture 3 (Chau and Liestman)
For this discussion we examine only the processing
nodes of the various candidate architectures in isolation from the ensemble. The processing nodes of each architecture themselves can be configured in a variety of ways. The configuration chosen can affect the reliability and
power consumption of the node, which can in turn affect
the overall ensemble reliability of the hypercube multi-
processor.
8.2 Failure criteria and parameters
The processor nodes for architectures 1 and 2 are identical, so their failure criteria are closely related. The difference between them is due to the message routing scheme employed by each architecture. A processor node for architectures 1 and 2 will fail if:
• the memory fails OR
• the bus fails OR
• 2 out of the 5 processors fail (the first processor failure is presumably recovered from by switching in the spare to take the failed processor's place) OR
• the node is disconnected from the other processing nodes in the hypercube.
The events that cause a node to be disconnected differ for
the two architectures.
The routing algorithm used for architecture 1 allows
only one path between each pair of nodes in the hyper-
cube. Since the spare processing node in each of the two
fault tolerant modules can relay messages within the module when a direct connection between 2 nodes in the module is not possible, it takes the failure of 2 of the four ports
in a processing node to disconnect the node. In architec-
ture 2, a hyperswitch is used instead of the single path
routing algorithm, so that all four ports in a node must
fail in order to disconnect the node.
A processing node for architecture 3 fails if:
• the memory fails OR
• the DMA unit fails OR
• both processors fail OR,
since the single path routing algorithm is used for this architecture, the node will fail if any of its 3
ports fail.
The component failure rates for all three architectures are listed below; a rough static sketch using these rates and the failure criteria above follows the list.
• Active processor (architectures 1 and 2): 1.990 × 10^-6 per hour
• Active processor (architecture 3): 2.306 × 10^-7 per hour
• Warm spare processor (architecture 2): 1.0 × 10^-6 per hour
• Shared memory (architectures 1 and 2): 3.477 × 10^-7 per hour
• Memory (architecture 3): 1.147 × 10^-7 per hour
• DMA module (architecture 3): 3.477 × 10^-7 per hour
• Intra-node bus (architectures 1 and 2): 1.147 × 10^-7 per hour
• Hyperswitch and I/O port (all architectures): 3.477 × 10^-7 per hour
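The following sketch (ours, for illustration only) writes the failure criteria above as boolean structure functions and produces a crude static estimate for an architecture 3 node; it assumes perfect coverage and ignores spare-activation dynamics, so it illustrates the structure rather than reproducing the modeled results.

import math

def node_failed_arch12(mem, bus, proc_failures, port_failures, hyperswitch):
    """Architectures 1 and 2: memory OR bus OR 2-of-5 processors OR
    disconnection (2 of 4 ports for arch 1, all 4 ports for arch 2)."""
    ports_needed = 4 if hyperswitch else 2
    return mem or bus or proc_failures >= 2 or port_failures >= ports_needed

def node_failed_arch3(mem, dma, proc_failures, port_failures):
    """Architecture 3: memory OR DMA OR both processors OR any of 3 ports."""
    return mem or dma or proc_failures >= 2 or port_failures >= 1

def p_fail(rate, hours):
    return 1.0 - math.exp(-rate * hours)

t = 10 * 8760.0                          # ten-year mission, in hours
p_port = p_fail(3.477e-7, t)
p_mem3 = p_fail(1.147e-7, t)
p_dma = p_fail(3.477e-7, t)
p_proc3 = p_fail(2.306e-7, t)

# Architecture 3 static estimate: series combination of memory, DMA, the
# three ports, and the event "both processors fail".
p_node3 = 1.0 - (1 - p_mem3) * (1 - p_dma) * (1 - p_port) ** 3 * (1 - p_proc3 ** 2)
print("architecture 3 node, 10-year static estimate ~", p_node3)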
8.3 Fault recovery
The FEHM used for the processors assumes that a processor failure can be detected, located, and the spare successfully switched in to replace the failed processor 95% of the time, and that the time required to do all of this is uniformly distributed between 0.9 seconds and 1.1 seconds. The remaining 5% of the time the reconfiguration attempt does not succeed, leading to node failure. The FEHM used for the ports assumes detection and deactivation of a failed port is successful 98% of the time, and that the time required for this is exponentially distributed with a mean of 0.1 sec. Again, the remaining
2% of the time a port failure is not successfully detected,
leading to node failure. No transient restoration is at-
tempted, i.e., all failures are considered to be permanent.
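The processor FEHM above can be sampled directly. The Monte Carlo sketch below (our own illustration, not HARP's solution method) estimates an effective per-fault coverage that also charges near-coincident faults arriving during the reconfiguration window; the aggregate rate of the competing components is an assumed value.

import random, math

def effective_coverage(c=0.95, lo=0.9, hi=1.1, other_rate_per_hour=3e-6,
                       trials=200_000, seed=1):
    """Fraction of processor faults that are covered and see no
    near-coincident fault during the Uniform(lo, hi) second recovery."""
    rng = random.Random(seed)
    lam = other_rate_per_hour / 3600.0           # competing rate, per second
    covered = 0
    for _ in range(trials):
        if rng.random() > c:                     # reconfiguration itself fails
            continue
        window = rng.uniform(lo, hi)             # recovery duration, seconds
        if rng.random() < math.exp(-lam * window):   # no near-coincident fault
            covered += 1
    return covered / trials

# At these rates the near-coincident term is tiny, so the estimate is ~0.95;
# HARP accounts for the same effect exactly rather than by sampling.
print(effective_coverage())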
Figure 34: Fault tree model of architecture 1 processing node with hot spares
Figure 35: Fault tree model of architecture 2 processing node with hot spares
Figure 36: Fault tree model of architecture 3 processing node with hot spares
8.4 Fault tree models
8.4.1 Hot spares
Figures 34 and 35 model the processing nodes in archi-
tectures 1 and 2 when the spare processor in the node is
a hot spare (the spare is powered on and operating all the time) and hence fails at the same rate as the active processors. The fault trees differ only in the modeling of port failures, as architecture 1 fails when 2 of the four ports fail (hence the 2/4 gate), while architecture 2 doesn't fail
until all four ports have failed (hence the AND gate). Fig-
ure 36 depicts a fault tree model for the processing nodes
of architecture 3.
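Because the hot-spare trees of figures 34 and 35 contain no sequence dependencies, a rough static reading is possible. The sketch below (ours, assuming perfect coverage, so it understates the modeled unreliability) combines the 2/5 processor gate with either a 2/4 port gate (architecture 1) or an AND of the four ports (architecture 2), using the rates of section 8.2; the processor rate exponent is our reconstruction of the garbled listing.

import math
from math import comb

def p_fail(rate, hours):
    return 1.0 - math.exp(-rate * hours)

def at_least(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

t = 10 * 8760.0
p_proc = p_fail(1.990e-6, t)      # active processor, architectures 1 and 2
p_port = p_fail(3.477e-7, t)
p_mem = p_fail(3.477e-7, t)
p_bus = p_fail(1.147e-7, t)

common = 1.0 - (1 - p_mem) * (1 - p_bus) * (1 - at_least(2, 5, p_proc))
arch1 = 1.0 - (1 - common) * (1 - at_least(2, 4, p_port))   # 2/4 port gate
arch2 = 1.0 - (1 - common) * (1 - p_port ** 4)              # AND of 4 ports
print("arch 1 node:", arch1, " arch 2 node:", arch2)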
8.4.2 Cold spares
Power consumption by a multiprocessor with spare nodes
can be reduced by having the spares be cold spares, un-
powered until they are needed to replace a failed active
processor. A cold spare processor cannot fail until it is
activated and brought online. In HARP this type of configuration is modeled using the Cold Spare gate, as depicted in figure 37 by a fault tree for architecture 2. The cold spare gate ensures that the spare processor does not fail until one of the 4 active processors fail. The 2/5 gate in parallel with the cold spare gate maintains the requirement that 2 processor failures cause the node to fail. Such
a configuration not only reduces power consumption, but
also enhances the reliability of the processing node.
8.4.3 Warm spares
Instead of being unpowered, the spare may be partially
powered up. It may then fail before being activated but at
a lesser rate than the active processors. Such a processor
is called a warm spare and can be modeled in HARP us-
ing the Sequence Enforcing gate as shown in figure 38 for
architecture 2. In this example two pseudo-components
(appearing as inputs to the OR gate whose output feeds
into the Functional Dependency gate) are used to rep-
resent the 4 active processors and spare before any pro-
cessor failures. Upon the first failure of a processor (either active or spare), these two pseudo-components are "turned off" as far as the fault tree is concerned by the Functional Dependency gate. The 4 remaining processors, now all active, are represented by the "4*processor" basic event which appears as the rightmost input to the Sequence Enforcing gate. This basic event had been "turned off" prior to the first processor failure by the Sequence Enforcing gate. After the first processor failure, the leftmost input to the Sequence Enforcing gate is turned on, which "turns on" the basic event that is its rightmost input (i.e. the processors of this basic event are now permitted to fail).
Figure 37: Fault tree model of architecture 2 processing node with cold spares
Note that because this basic event is also an input to the top OR gate of the fault tree, a subsequent failure of any of the 4 processors will cause the node to fail, again maintaining the requirement that failure of 2 of the 5 processors cause node failure. Al-
though a spare does not fail while unpowered, upon power
up and activation there can be some probability that the
spare does not operate properly. Such a situation can be
modeled as a warm spare.
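The three sparing policies of sections 8.4.1 through 8.4.3 differ only in the spare's failure rate while dormant. The sketch below (ours, assuming perfect coverage, so it is only a structural illustration) models the 5-processor subsystem of an architecture 2 node as a small Markov chain parameterized by that dormant rate: equal to the active rate for a hot spare, between zero and the active rate for a warm spare, and zero for a cold spare. The ordering of the resulting probabilities matches the trend noted in section 8.5.

import numpy as np
from scipy.linalg import expm

def p_two_proc_failures(lam_active, lam_dormant, hours):
    # States: 0 = 4 active + dormant spare, 1 = 4 running units (spare
    #         consumed or spare failed first), 2 = second failure (lost).
    Q = np.zeros((3, 3))
    Q[0, 1] = 4 * lam_active + lam_dormant
    Q[1, 2] = 4 * lam_active
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return expm(Q * hours)[0, 2]

lam_a = 1.990e-6                  # active processor rate (our reconstruction)
t = 10 * 8760.0
for label, lam_d in [("hot", lam_a), ("warm", 1.0e-6), ("cold", 0.0)]:
    print(label, p_two_proc_failures(lam_a, lam_d, t))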
8.5 Results
Figure 39 compares the 10 year unreliabilities of the processing nodes of each of the three architectures assuming all of them use hot spares. The unreliability of the architecture 3 processing nodes is much lower than those
for architectures 1 and 2, reflecting that the reliability of
individual processors for architecture 3 is much greater
than that of the others and there are only 2 that can fail
instead of 5. As anticipated, the unreliability for archi-
tecture 2 nodes is slightly better than the unreliability for
architecture 1 nodes.
Figure 40 shows the 10 year unreliabilities for architec-
ture 2 processing nodes using hot, warm, and cold spares.
In general, the reliability increases from configuration to
configuration in that order. This is to be expected, since
the failure rate of the spare during its inactive period de-
creases in that order.
Figure 38: Fault tree model of architecture 2 processing node with warm spares

Figure 39: Comparison of node unreliabilities of all three architectures using hot spares

Figure 40: Unreliability of architecture 2 processing nodes with various types of spares

9 References

[1] S. J. Bavuso, Joanne Bechta Dugan, K. S. Trivedi, E. M. Rothmann, and W. E. Smith. Analysis of typical fault-tolerant architectures using HARP. IEEE Transactions on Reliability, R-36(2):176-185, June 1987.
[2] Salvatore Bavuso and Sandra Howell. A graphical language for reliability model generation. In Proceedings Annual Reliability and Maintainability Symposium, 1990.

[3] Salvatore J. Bavuso and Joanne Bechta Dugan. HiRel: Reliability/availability integrated workstation tool. In Proceedings of the Reliability and Maintainability Symposium, pages 491-500, January 1992.

[4] Salvatore J. Bavuso and Anna Martensen. A fourth-generation reliability predictor. In Proceedings of the Reliability and Maintainability Symposium, pages 11-16, January 1988.

[5] Stephen W. Behnen, William A. Whitehouse, Richard J. Farrell, F. Mark Leahy, and LeRoy E. Morn. Advanced system integration demonstrations (ASID) system definition. Technical report, AF Wright Aeronautical Laboratories, 1984.

[6] M. A. Boyd, M. Veeraraghavan, Joanne Bechta Dugan, and K. S. Trivedi. An approach to solving large reliability models. In AIAA/IEEE Digital Avionics Systems Conference, San Jose, CA, October 1988.

[7] Mark A. Boyd and Jesus Tuazon. Fault tree models for fault tolerant hypercube multiprocessors. In Proceedings of the Reliability and Maintainability Symposium, January 1991.

[8] S. C. Chau and A. Liestman. Proposal for a fault-tolerant binary hypercube architecture. In Proc. IEEE Int. Symp. on Fault-Tolerant Computing, FTCS-19, pages 323-330, June 1989.

[9] E. Chow, J. Peterson, and H. Madan. Hyperswitch network for the hypercube computer. In Digest of the 13th Symposium on Computer Architecture, pages 90-99, May 1988.

[10] Joanne Bechta Dugan, Salvatore Bavuso, and Mark Boyd. Dynamic fault tree models for fault tolerant computer systems. IEEE Transactions on Reliability, September 1992.

[11] Joanne Bechta Dugan, Salvatore Bavuso, and Mark Boyd. Fault trees and sequence dependencies. In Proceedings of the Reliability and Maintainability Symposium, pages 286-293, January 1990.

[12] Joanne Bechta Dugan, Salvatore J. Bavuso, and Mark A. Boyd. Fault trees and Markov models for reliability analysis of fault tolerant systems. Reliability Engineering and System Safety, 39:291-307, 1993.

[13] Joanne Bechta Dugan and K. S. Trivedi. Coverage modeling for dependability analysis of fault-tolerant systems. IEEE Transactions on Computers, 38(6):775-787, 1989.

[14] Joanne Bechta Dugan, K. S. Trivedi, Mark K. Smotherman, and Robert M. Geist. The hybrid automated reliability predictor. AIAA Journal of Guidance, Control and Dynamics, 9(3):319-331, May-June 1986.

[15] Joanne Bechta Dugan, Malathi Veeraraghavan, Mark Boyd, and Nitin Mittal. Bounded approximate reliability models for fault tolerant distributed systems. In Proceedings 8th Symposium on Reliable Distributed Systems, pages 137-147, 1989.

[16] E. Feldman and P. S. Babcock. Reliability evaluation of AIPS I/O networks. Technical Report AIPS-87-15, C. S. Draper Laboratory, Inc., Cambridge, MA, June 1987.

[17] Richard E. Harper. Reliability analysis of parallel processing systems. In Proceedings of the 8th Digital Avionics Systems Conference, pages 213-219, 1988.

[18] Richard E. Harper, Jaynarayan H. Lala, and John J. Deyst. Fault tolerant parallel processor architecture overview. In Proceedings of the 18th Symposium on Fault Tolerant Computing, pages 252-257, 1988.

[19] E. J. Henley and H. Kumamoto. Probabilistic Risk Assessment. IEEE Press, 1982.

[20] Ying-Wah Ng and Algirdas Avizienis. A model for transient and permanent fault recovery in closed fault-tolerant systems. In Proc. IEEE Int. Symp. on Fault-Tolerant Computing, FTCS-6, pages 182-187, June 1976.

[21] D. A. Rennels. On implementing fault-tolerance in binary hypercubes. In Proc. IEEE Int. Symp. on Fault-Tolerant Computing, FTCS-16, pages 344-349, July 1986.

[22] D. P. Siewiorek and R. S. Swarz. The Theory and Practice of Reliable System Design. Digital Press, Bedford, MA, 1982.

[23] K. S. Trivedi, Robert Geist, Mark Smotherman, and Joanne Bechta Dugan. Hybrid modeling of fault-tolerant systems. Computers and Electrical Engineering, An International Journal, 11(2 & 3):87-108, 1985.
