Automatic specification of reliability models for fault-tolerant computers by Liceaga, Carlos A. & Siewiorek, Daniel P.
(NASA-TP-3301) AUTOMATIC SPECIFICATION OF RELIABILITY MODELS
FOR FAULT-TOLERANT COMPUTERS (NASA) 70 p
N93-31869
Unclas
H1/66 017_951
https://ntrs.nasa.gov/search.jsp?R=19930022680 2020-03-17T05:19:28+00:00Z
NASA
Technical
Paper
3301
1993
National Aeronautics and
Space Administration
Office of Management
Scientific and Technical
Information Program
Automatic Specification
of Reliability Models for
Fault-Tolerant Computers
Carlos A. Liceaga
Langley Research Center
Hampton, Virginia
Daniel P. Siewiorek
Carnegie Mellon University
Pittsburgh, Pennsylvania
The use of trademarks or names of manufacturers in this
report is for accurate reporting and does not constitute an
official endorsement, either expressed or implied, of such
products or manufacturers by the National Aeronautics and
Space Administration.
Contents
Summary
1. Introduction
1.1. Background
1.2. Previous Work
1.3. Motivation
1.4. Organization
2. Graphical User Interface (GUI) Definition
2.1. Graphs
2.1.1. Structure
2.1.1.1. External
2.1.1.2. Internal
2.1.2. Hierarchy
2.1.2.1. Physical
2.1.2.2. Logical
2.1.3. System Reconfiguration
2.1.4. Requirement
2.2. Parameters
2.2.1. Active Component
2.2.2. Spare Component
2.2.3. Component Repair
2.2.4. Subsystem Recovery
2.2.5. System Reconfiguration
2.2.6. Model Generation
2.2.7. Model Evaluation
2.3. Summary and Recommendations
3. Automated Reliability Modeling (ARM) Implementation
3.1. Graphical User Interface
3.1.1. Graphical Editing Windows
3.1.2. Parameter Specification and Action Selection Windows
3.2. Reading and Processing the System Description
3.2.1. Detection of Symmetry in the External Structure Graph
3.2.2. Determining the Subsystem Hierarchies
3.3. Specifying the System Reliability Model
3.3.1. Constants
3.3.2. State Space Variables and the Start State
3.3.3. Functions and Final State Conditions
3.3.4. Failure Transitions
3.3.4.1. Logical subsystem component failures
3.3.4.2. Spare failures
3.3.4.3. Dependents in logical subsystems
3.3.5. Potentially General Transitions
3.3.5.1. Spare recoveries
3.3.5.2. Component repairs
3.3.5.3. Logical subsystem recoveries
3.3.5.4. Reconfigurations that retire a subsystem
3.3.5.5. Reconfigurations that degrade a subsystem
3.4. Advanced Features Not Yet Implemented
3.4.1. Reinitializing Reconfigurations
3.4.2. Mission Phase Change Reconfigurations
4. Application Examples and Results
4.1. Comparison With Example Multiprocessor Results
4.2. Application to Systems Described in the Literature
4.2.1. Software Implemented Fault-Tolerance (SIFT) Computer
4.2.2. Comparison With Self-Generated Results
4.2.2.1. Tandem computer
4.2.2.2. Stratus computer
5. Analysis
5.1. Summary of Assumptions
5.2. Utility
5.2.1. Adding System Characteristics
5.2.2. Performing Design Tradeoffs
5.3. Performance
5.4. Validation
5.5. Lessons Learned
6. Conclusions
6.1. Summary of Work and Contributions
6.2. Future Work
Appendix--ARM Program Algorithms
A1. Symmetry Detection
A2. Determining the Subsystem Hierarchies
A3. Specifying Potentially General Transitions
References
Tables
Table 1.1. Summary of Previous Work
Table 2.1. Categories of Subsystems and Components
Table 2.2. Sources of Major GUI Input Categories
Table 3.1. File Name Suffixes for Each Class of Graph
Table 3.2. File Name Suffixes for Each Class of Parameters
Table 3.3. State Space Variable Functions
Table 4.1. All Permanent Failure Rates of the Multiprocessor
Table 4.2. Some Spare Component Parameters of the Multiprocessor
Table 4.3. Reliability Model Statistics for the Multiprocessor
Table 4.4. Probability of Failure Results for the Multiprocessor
Table 4.5. Reliability Model Statistics for a SIFT Computer
Table 4.6. Probability of Failure Results for a SIFT Computer
Table 4.7. Some Active Component Parameters of a Tandem Computer
Table 4.8. Some Component Repair Parameters of a Tandem Computer
Table 4.9. Some System Reconfiguration Parameters of a Tandem Computer
Table 4.10. Reliability Model Statistics for a Tandem Computer
Table 4.11. Probability of Failure Results for a Tandem Computer
Table 4.12. Some Active Component Parameters of a Stratus Computer
Table 4.13. Some Component Repair Parameters of a Stratus Computer
Table 4.14. Some System Reconfiguration Parameters of a Stratus Computer
Table 4.15. Reliability Model Statistics for a Stratus Computer
Table 4.16. Probability of Failure Results for a Stratus Computer
Table 5.1. Effect of Simple Changes in the System Description on a Manual ASSIST File
Table 5.2. Probability of Failure Results for Some Variations of a Stratus Computer
Table 5.3. Probability of Failure Results for Some Variations of a SIFT Computer
Table 5.4. ARM Model Specification Performance
Table 5.5. Effect of Coverage on the Probability of Failure of a SIFT Computer
Table 5.6. Effect of the Failure Rate on the Probability of Failure of a SIFT Computer
Table 5.7. Effect of the Reconfiguration Rate on the Probability of Failure of a SIFT Computer
Figures
Figure 1.1. Statically redundant processor triad
Figure 1.2. Dynamically redundant active processor with m spares
Figure 1.3. Hybrid redundant triad of active processors with m spares
Figure 1.4. Adaptive voting with n processors
Figure 1.5. Adaptive hybrid n-tuple of active processors with m spares
Figure 1.6. Reliability graph (state transition diagram) of a triad
Figure 1.7. Hierarchy of Markov models
Figure 2.1. System description hierarchy
Figure 2.2. Main window
Figure 2.3. Graphs menu
Figure 2.4. Parameters menu
Figure 2.5. Model menu
Figure 2.6. External structure of a multiprocessor
Figure 2.7. Internal structure of each of the six memory components
Figure 2.8. Physical hierarchy of the multiprocessor
Figure 2.9. Physical hierarchy of the printed circuit board subsystem type
Figure 2.10. Logical hierarchy of the multiprocessor
Figure 2.11. Logical hierarchy of the triad (T) subsystem class
Figure 2.12. Reinitialization of the multiprocessor
Figure 2.13. Degradation of the multiprocessor
Figure 2.14. Requirements of the multiprocessor
Figure 2.15. Requirement of the printed circuit board subsystem type
Figure 2.16. Requirements of the triad subsystem class
Figure 2.17. Active component parameters with example values
Figure 2.18. Spare component parameters with example values
Figure 2.19. Component repair parameters with example values
Figure 2.20. Subsystem recovery parameters with example values
Figure 2.21. Partial Markov model of a processor dual with m cold spares and repair
Figure 2.22. System reconfiguration parameters with example values
Figure 2.23. Model generation parameters with default values
Figure 2.24. Model evaluation parameters with default values
Figure 3.1. Graphical editor window
Figure 3.2. Tools window
Figure 3.3. Drawing tools window
Figure 3.4. External structure with symmetry
Figure 3.5. Equivalence class graph
Figure 4.1. External structure of a SIFT computer
Figure 4.2. Logical hierarchy of a SIFT computer
Figure 4.3. Logical hierarchy of the sextuple subsystem class
Figure 4.4. Logical hierarchy of the quintuple (QT) subsystem class
Figure 4.5. Logical hierarchy of the quad (Q) subsystem class
Figure 4.6. Logical hierarchy of the nonreconfigurable dual (ND) subsystem class
Figure 4.7. Degradation from a processor sextuple (PST) to a quintuple (PQT)
Figure 4.8. Degradation from a processor quintuple to a quad (PQ)
Figure 4.9. Degradation from a processor quad to a triad (PT)
Figure 4.10. Degradation from a processor triad to a nonreconfigurable dual (PND)
Figure 4.11. Requirements of a SIFT computer
Figure 4.12. Requirements of the sextuple subsystem class
Figure 4.13. Requirements of the quintuple subsystem class
Figure 4.14. Requirements of the quad subsystem class
Figure 4.15. Requirements of the nonreconfigurable dual subsystem class
Figure 4.16. External structure of a Tandem computer
Figure 4.17. Logical hierarchy of a Tandem computer
Figure 4.18. Logical hierarchy of the dual subsystem class
Figure 4.19. Logical hierarchy of the simplex subsystem class
Figure 4.20. Degradation from a processor dual (PD) to a simplex (PS)
Figure 4.21. Degradation from a disk controller dual (KD) to a simplex (KS)
Figure 4.22. Degradation from a disk drive dual (DD) to a simplex (DS)
Figure 4.23. Degradation from a bus dual (BD) to a simplex (BS)
Figure 4.24. Requirements of a Tandem computer
Figure 4.25. Requirements of the PPL performance level
Figure 4.26. Requirements of the KPL performance level
Figure 4.27. Requirements of the DPL performance level
Figure 4.28. Requirements of the BPL performance level
Figure 4.29. Requirements of the dual subsystem class
Figure 4.30. Requirements of the simplex subsystem class
Figure 4.31. External structure of a Stratus computer
Figure 4.32. Physical hierarchy of a Stratus computer
Figure 4.33. Physical hierarchy of the MR subsystem type
Figure 4.34. Logical hierarchy of a Stratus computer
Figure 4.35. Requirements of a Stratus computer
Figure 4.36. Requirements of the MR subsystem type
Summary
The calculation of reliability measures using
Markov models is required for life-critical processor-
memory-switch (PMS) structures that have standby
redundancy or that are subject to transient or inter-
mittent faults or repair. The task of specifying these
models is tedious and prone to human error because
of the large number of states and transitions required
in any reasonable system. Therefore, model specifi-
cation is a major analysis bottleneck, and model ver-
ification is a major validation problem. The general
unfamiliarity of computer architects with Markov
modeling techniques further increases the necessity
of automating the model specification.
Automation requires a general system description
language (SDL) that can accommodate new
fault-tolerant techniques and system designs. For
practicality, this SDL should also provide a high level of
abstraction and be easy to learn and use.
This paper presents the first attempt to define and
implement an SDL with those characteristics. The
problems involved in the automatic specification of
Markov reliability models for arbitrary interconnec-
tion structures at the PMS level are identified and
analyzed. Solutions to these problems are generated
and implemented.
A program named ARM (Automated Reliability
Modeling) has been constructed as a research vehicle.
The ARM program uses a graphical user interface
(GUI) as its SDL. This GUI is based on a hierarchy
of windows. Some windows have graphical editing ca-
pabilities for specifying the system's communication
structure, hierarchy, reconfiguration capabilities, and
requirements. Other windows have text fields, pull-
down menus, and buttons for specifying parameters
and selecting actions.
The ARM program outputs a Markov reliability
model specification formulated for direct use by pro-
grams that generate and evaluate the model. The
advantages of such an approach include utility to a
larger class of users, who are not necessarily expert
in reliability analysis, and lower probability of human
error in the calculation.
1. Introduction
Computer systems are growing in complexity
and sophistication as multiprocessors and distributed
computers are coming into widespread use to achieve
higher performance and reliability. This growth,
which is being assisted by the availability of succes-
sively more complex building blocks, has increased
the importance of fault tolerance and system relia-
bility as design parameters. Thus, the calculation
of system reliability measures has become one of the
system design tasks. Several efforts have been re-
ported in the literature and are in progress to make
computing system reliability measures easier and
more efficient by providing designers with reliability
evaluation tools.
The analysis and evaluation of system reliability
for complex computer systems is very tedious and
prone to error even for experienced reliability analysts.
The model of a system with n components can
have up to $2^n$ states if it only has permanent faults
and they are not removed. Therefore, the model of a
system with just 10 components can have more than
1000 states.
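The growth of the state space can be made concrete with a minimal sketch. It assumes, as in the paragraph above, that each component is either working or failed and that faults are permanent; the function name is hypothetical and is not part of ARM.

```python
from itertools import product

def enumerate_states(n_components):
    """All working/failed combinations for n components with permanent faults only."""
    return list(product(("working", "failed"), repeat=n_components))

print(len(enumerate_states(3)))   # 8
print(len(enumerate_states(10)))  # 1024, i.e. more than 1000 states
```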
With the exception of the ADVISER (Advanced
Interactive Symbolic Evaluator of Reliability) and
the RMG (Reliability Model Generator) programs,
discussed in subsection 1.2, existing software tools
usually assume an understanding of the failure modes
and therefore are more in the nature of computa-
tional aids once the preliminary system decomposi-
tion and analysis have been manually achieved. Al-
though ADVISER does not make this assumption,
it uses combinatorial techniques, and it is therefore
limited in the complexity of systems and fault types
which it can analyze. The RMG program lacks a
high-level system description language (SDL) that is
easy to learn and use.
More advanced techniques are required to analyze
computer architectures that use standby redun-
dancy, can be repaired, or are susceptible to tran-
sient or intermittent faults. One possibility is the
Markov model, which is discussed in subsection 1.1.
The advantages offered by Markov models are that
they are widely used among reliability analysts and
that several programs, which are discussed in sub-
section 1.2, have been developed to solve them. How-
ever, Markov models cannot be used to analyze
nonexponentially distributed concurrent events. For
example, a fault that arrives while the system is re-
configuring itself around a previous fault would be
represented by a transition to a state in which two
faults are present. This new state would not take
into account the time that the system has already
spent reconfiguring from the first fault.
Another analysis possibility is the extended sto-
chastic Petri net (ESPN) described in Dugan et al.
(1984). The advantages offered by the ESPN are that
it can analyze concurrent events and can model sys-
tems at a lower level of detail than Markov models.
The ESPN "tokens" can be simultaneously enabled
to move concurrently at independent transition times.
The low-level modeling capability is due to mechanisms,
such as queues and counters, which can simulate
the algorithm of the process being modeled. To
analytically or numerically solve an ESPN, it must be
converted to a Markov model. However, if tokens are
moving concurrently at independent transition times
that are not exponentially distributed, the process
becomes non-Markovian (i.e., the transition probabilities
depend on past states). This situation makes
the conversion impossible. In general, an ESPN must
be solved by simulation.
Simulations can include any level of detail, and
they are thus flexible; however, for straightforward
Monte Carlo simulations, many repetitions are
needed to ensure accuracy. For example, in life-critical
applications that require a probability of failure
of $10^{-9}$ with a relative error of no more than
10 percent within a confidence interval of 95 percent,
approximately $3.8 \times 10^{11}$ simulation repetitions are
necessary (Liceaga 1992). In general, these applications
require a Markov model because it can be solved
analytically or numerically.
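The derivation of the repetition count is given in Liceaga (1992) and is not reproduced here. As a hedged illustration, the standard normal approximation for estimating a binomial proportion gives a figure of the same size. The formula n = z^2 (1 - p) / (e^2 p) and the function name below are assumptions made for this sketch, not taken from the cited work.

```python
from statistics import NormalDist

def monte_carlo_repetitions(p_fail, rel_error, confidence):
    """Repetitions needed so the confidence-interval half-width is rel_error * p_fail.

    Assumes the normal approximation to the binomial distribution.
    """
    # Two-sided critical value, e.g. about 1.96 for 95 percent confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return (z / rel_error) ** 2 * (1.0 - p_fail) / p_fail

n = monte_carlo_repetitions(p_fail=1e-9, rel_error=0.10, confidence=0.95)
print(f"{n:.2e}")  # about 3.84e+11
```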
This paper defines a general, high-level SDL that
is easy to learn and use, identifies and analyzes
the problems involved in the automatic specification
of Markov reliability models for arbitrary interconnection
structures at the processor-memory-switch
(PMS) level,¹ and generates and implements solutions
to these problems. The results of this research
have been implemented and experimentally validated
in the ARM (Automated Reliability Modeling)
program.
The ARM program uses a graphical user interface
(GUI) as its SDL. This GUI is based on a
hierarchy of windows implemented in the C programming
language using the Transportable Application
Environment Plus (TAE Plus) user interface development
tool for building X window-based applications
(Szczur 1990). Some windows have graphical
editing capabilities for specifying the system's communication
structure, hierarchy, reconfiguration capabilities,
and requirements. These windows have
been implemented using the schematic drawing editor
Schem (Vlissides 1990). Other windows have text
fields, pull-down menus, and buttons for specifying
parameters and selecting actions.
The ARM software outputs a Markov reliability
model specification formulated for direct use by
programs that generate the model. The advantages
of such an approach are utility to a larger class of
users, who are not necessarily expert in reliability
analysis, and lower probability of human error in the
calculation.
¹ Components are not limited to being a processor, memory, or
switch.
A brief background on reliability calculation at
the PMS level using Markov models is presented in
subsection 1.1. Previous work in the specification,
generation, and evaluation of reliability models is
surveyed in subsection 1.2. The goals for ARM are
stated and compared with those of previous efforts
in subsection 1.3. The organization of this paper is
presented in subsection 1.4.
1.1. Background
Present-day computer systems and the process
of designing and analyzing them can be viewed at
various levels of detail. Four levels, which are defined
in work by Siewiorek et al. (1982), range from the
circuit level, through the logic and programming
levels, to the PMS level. The PMS-level view of
digital systems is one in which the primitives include
processors, memories, switches, and transducers, as
opposed to the logic level in which the primitives
include gates, registers, and multiplexers.
Hardware components are susceptible to hard
and soft faults as discussed by Siewiorek and Swarz
(1992). A fault is an incorrect state of hardware or
software resulting from a physical change in the hard-
ware, interference from the environment, or design
mistakes (Laprie 1985). Hard or permanent faults are
continuous and stable, and they result from an irre-
versible physical change. Soft faults can be transient
or intermittent. Transient faults result from tempo-
rary environmental conditions. Intermittent faults
are occasionally active because of unstable hardware
or varying hardware or software states (e.g., as a
function of load or activity). Depending on whether
the intermittent fault is benign or active, the output
of the component will be correct or not, respectively.
Fault-tolerant computer systems can be affected
by a limited set of faults without interruptions in
their operation. Some computer systems achieve
fault tolerance by using redundant groups of com-
ponents to perform the same operations. The sys-
tem must determine which is the correct output
using diagnostics or majority voting. The various re-
dundancy techniques are discussed in Siewiorek and
Swarz (1992), and the more relevant ones are defined
below:
Static redundancy: Faults are masked through
a majority vote involving a fixed group of redundant
components. Thus, when the number
of faulty components reaches the maximum
that can be tolerated, any further faults will
cause errors at the output. Figure 1.1 illustrates
a statically redundant processor (P)
triad (a group of three components) and its
voter (V).
Figure 1.1. Statically redundant processor triad.
Dynamic redundancy: Faults are not masked
from causing errors at the output, but the
faulty components are detected, isolated, and
reconfigured out of the system. The faulty
components are replaced when spares are
available. Figure 1.2 illustrates a dynamically
redundant active processor (AP) with m
spares (SP).
Figure 1.2. Dynamically redundant active processor with m
spares.
Hybrid redundancy: Faults are masked
through a majority vote involving a group of
redundant components which are reconfigured
when spares are available. Thus, when the
number of faulty components reaches the maximum
that can be tolerated, any further faults
that occur before a faulty component is replaced
by a spare will cause errors at the output.
Figure 1.3 illustrates a hybrid-redundant
triad of active processors with m spares.
Figure 1.3. Hybrid-redundant triad of active processors with
m spares.
Adaptive voting: Faults are masked through
a majority vote involving a variable group of
redundant components without spares. Faulty
components are reconfigured out of the system
by excluding them from the voting process.
Thus, when the number of faulty components
reaches the maximum that can be tolerated,
any further faults that occur before a faulty
component is reconfigured out of the voting
process will cause errors at the output.
Figure 1.4 illustrates adaptive voting with
n processors and their voter (AV).
Figure 1.4. Adaptive voting with n processors.
Adaptive hybrid: Faults are masked through
a majority vote involving a variable group
of redundant components which are replaced
when spares are available. If spares are not
available, faulty components are reconfigured
out of the system by excluding them from the
voting process. Thus, when the number of
faulty components reaches the maximum that
can be tolerated, any further faults that occur
before a faulty component is replaced by a
spare or reconfigured out of the voting process
will cause errors at the output. Figure 1.5
illustrates an adaptive hybrid n-tuple of active
processors with m spares.
Figure 1.5. Adaptive hybrid n-tuple of active processors with
m spares.
For example, if a triad that uses hybrid redun-
dancy "recovers" from a fault by replacing the faulty
component with a spare, it can then tolerate a sec-
ond fault. The following two definitions are those
that will be used in this paper, but neither term has
a universally accepted definition:
Recovery: The process of detecting, isolating,
and reconfiguring a faulty component out of
the system.
Coverage: The probability that the system can
survive a fault in a component and successfully
recover. (If the system can always recover, it
has a "perfect" coverage of 1.)
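To make the voting-based redundancy techniques defined earlier in this subsection concrete, the following minimal sketch shows a strict-majority voter and the exclusion step used by adaptive voting. All function names are hypothetical, and the sketch ignores detection and timing, so it illustrates only the masking idea and not any particular system's voter.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of components, or None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def tolerable_faults(n):
    """Number of faulty components a strict-majority vote over n components can mask."""
    return (n - 1) // 2

def adaptive_vote(outputs, excluded):
    """Vote only over components that have not been reconfigured out of the system."""
    active = [v for i, v in enumerate(outputs) if i not in excluded]
    return majority_vote(active)

print(majority_vote([7, 7, 9]))            # one fault in a triad is masked
print(majority_vote([7, 9, 4]))            # a second fault defeats the static vote
print(adaptive_vote([7, 9, 7], {1}))       # faulty component 1 excluded from the vote
```

In the last call the faulty component has already been excluded, which corresponds to a successful recovery in the sense defined above.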
Spares are sometimes left unpowered until they
become part of the active configuration to reduce
their failure rates (Avižienis et al. 1971). They are
sometimes said to be cold if their failure rates are
assumed to be 0, warm if their failure rates are
reduced but not 0, or hot if their failure rates are
not reduced (Butler and Johnson 1990).
Reliability measures are defined in terms of prob-
abilities because the failure processes in hardware
components are nondeterministic. These various
measures are discussed in Siewiorek and Swarz
(1992). The more relevant ones are defined below:
Reliability: The conditional probability R(t)
that the system is operational throughout the
interval [0, t] given that it was operational
at time 0. (This measure is a nonincreasing
function whose initial value is 1.)
Availability: The probability A(t) that the
system is operational at time t.
Mean time to failure (MTTF): The expected
time of the first system failure assuming a new
(perfect) system at time 0.
Mean time to repair (MTTR): The expected
time for repair of a failed system.
Mean time between failures (MTBF): The ex-
pected time between failures in systems with
repair.
Availability is typically used as a figure of merit in
systems in which service can be delayed or denied for
short periods to perform preventive maintenance or
repair without serious consequences. The availability
is important in the calculation of system life-cycle
costs. If the limit of A(t) exists as t goes to infinity,
it expresses the expected fraction of time that the
system is available to perform useful computations
and has the following form:
$$\lim_{t \to \infty} A(t) = \frac{\mathrm{MTTF}}{\mathrm{MTBF}}$$
The MTBF is given by:
$$\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$$
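As a small worked example of the two relations above, the sketch below computes the limiting availability from an MTTF and an MTTR. The numerical values are hypothetical and chosen only for illustration.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Limiting availability MTTF / MTBF, with MTBF = MTTF + MTTR."""
    mtbf_hours = mttf_hours + mttr_hours
    return mttf_hours / mtbf_hours

# Hypothetical system: fails on average every 10 000 hours, repaired in 2 hours.
availability = steady_state_availability(10_000.0, 2.0)
print(f"{availability:.6f}")
```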
Reliability is used to describe systems in which re-
pair is typically infeasible, such as space applications.
The MTTF can be derived from R(t) as follows:
$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt$$
The most commonly used reliability function for
a single component, which is based on a Poisson process
with an exponential distribution, is called the
exponential reliability function, and it has the form
$$R(t) = e^{-\lambda t}$$
where $\lambda$ is the hazard or failure rate. The failure rate
is a constant that reflects the quality of the component
and is usually expressed in failures per million
hours for high-quality components. The exponential
reliability function is used when the failure rate
is time independent, such as when components do
not age. After a burn-in period, permanent faults in
electronic components often follow a relatively constant
failure rate. The MTTF for the exponential
reliability function has the following form:
$$\mathrm{MTTF} = \frac{1}{\lambda}$$
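The closed form for the MTTF can be checked numerically by integrating the exponential reliability function. The sketch below uses a plain trapezoidal rule over a finite horizon; the failure rate value, the horizon, and the function names are assumptions chosen for illustration.

```python
import math

def reliability_exponential(t, failure_rate):
    """Exponential reliability function R(t) = exp(-failure_rate * t)."""
    return math.exp(-failure_rate * t)

def mttf_numeric(reliability, horizon, steps=200_000):
    """Trapezoidal approximation of the integral of R(t) from 0 to horizon."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt

lam = 1e-4  # hypothetical failure rate in failures per hour
approx = mttf_numeric(lambda t: reliability_exponential(t, lam), horizon=40.0 / lam)
print(approx, 1.0 / lam)  # the two values should agree closely
```

The horizon of 40 / lam is chosen so that the truncated tail of the integral is negligible.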
Many other reliability functions have been formulated.
The second most common reliability function,
which is based on the Weibull distribution, is called
the Weibull reliability function, and it has the form
$$R(t) = e^{-(\lambda t)^{\alpha}}$$
where $\lambda$ in this case is the scale parameter and
$\alpha$ is the shape parameter. (Other reparameterized
forms are also common.) It is equivalent to the
exponential function when $\alpha$ is 1. The Weibull
reliability function is used when the failure rate is
time dependent. Permanent faults for components
that age can be described using an increasing failure
rate ($\alpha > 1$). In that case, the system is not as
good as new when repair takes place. Data presented
in McConnel (1981) and Castillo, McConnel, and
Siewiorek (1982) indicate that transient faults follow
a decreasing failure rate ($\alpha < 1$).
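The effect of the shape parameter can be illustrated with a short sketch of the Weibull reliability function in the parameterization used above, together with its hazard rate. The hazard rate expression follows from the standard definition z(t) = -R'(t)/R(t); the function names and parameter values are hypothetical.

```python
import math

def reliability_weibull(t, scale, shape):
    """Weibull reliability function R(t) = exp(-(scale * t) ** shape)."""
    return math.exp(-((scale * t) ** shape))

def hazard_weibull(t, scale, shape):
    """Hazard rate -R'(t)/R(t) = shape * scale * (scale * t) ** (shape - 1), for t > 0."""
    return shape * scale * (scale * t) ** (shape - 1)

scale = 1e-3  # hypothetical scale parameter, per hour
for shape in (0.5, 1.0, 2.0):
    early = hazard_weibull(100.0, scale, shape)
    late = hazard_weibull(200.0, scale, shape)
    print(shape, early, late)  # decreasing, constant, and increasing failure rate
```

With a shape parameter of 1 the hazard rate is constant and the function reduces to the exponential reliability function.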
The failure processes of different components are
assumed to be independent of one another. This as-
sumption is not strictly true, such as when electrical,
mechanical, or thermal conditions in one component
affect other components in its proximity. However,
the assumption is close enough in practice to be used
to simplify the analysis.
The state of a system represents all that must be
known to describe the system at any instant. As the
system changes, such as when components fail or are
repaired, so does its state. These changes of state
are called state transitions. If all possible states are
assumed to be known, a discrete-state system model
is used; if this assumption is not made, a continuous-
state system model is used. If the state transition
times are assumed to be restricted to some multiple
of a given time interval, a discrete-time system model
is used. If it is assumed that state transitions can
occur at any time, a continuous-time system model
is used. Systems can be classified according to their
state space and time parameter as the following:
discrete state and discrete time
discrete state and continuous time
continuous state and discrete time
continuous state and continuous time
For a discrete-state system, a state transition dia-
gram (STD) may be drawn. This transition diagram
is a directed graph. The nodes correspond to sys-
tem states, and the directed arcs indicate allowable
state transitions. Each arc has a label that iden-
tifies the distribution of the conditional probability
that the system will go from the originating node to
the destination node of that directed arc given the
previous history of the system and that the system
was initially at the originating node. The label used
depends on the distribution. For example, the label
could be the hazard rate for the exponential distribu-
tion, the scale and shape parameters for the Weibull
distribution, or the mean and standard deviation for
a general distribution.
If transitions are allowed from failed states to
operational states, then the STD is an availability
graph and A(t) may be obtained from it. The
term R(t) may be obtained by specifically disallowing
failed to working state transitions from the STD, thus
making it a reliability graph.
A reliability graph of a triad is given in figure 1.6.
In this model, it is assumed that the components have
a perfect coverage of 1. The horizontal transitions
represent fault arrivals. These transitions follow an
exponential distribution. Consequently, $\lambda$ represents
the constant hazard rate. The coefficients of $\lambda$ represent
the number of working processors being actively
used in the configuration. The vertical transition represents
recovery from a fault. This recovery follows
a general distribution. Consequently, $\mu$ and $\sigma$ represent
its mean and standard deviation. A competition
exists between the two transitions that are leaving
state 2. If the second fault wins the competition,
then system failure occurs; however, if the removal
of the first fault wins the competition, then the sys-
tem reconfigures into a simplex (i.e., it only uses one
of the two working components). Unless otherwise
noted in the state descriptions, all working proces-
sors are being actively used in the configuration.
The information conveyed by the STD is often
summarized in a square matrix called the state tran-
sition matrix (STM). The STM element in row i and
column j is the label on the arc from state i to state j.
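As a minimal sketch, the STM of the triad in figure 1.6 can be
written as a square table indexed by state. The state numbers
(1 to 5), the numeric values of LAM, MU, and SIGMA, and the
tuple encoding of the arc labels are assumptions made for this
illustration; they are not taken from any program in this paper.

```python
# Sketch of a state transition matrix (STM) for the triad of figure 1.6.
# Assumed for illustration: state numbers 1-5 and all numeric values.
LAM = 1e-4      # hazard rate, faults per hour (hypothetical)
MU = 2.7e-4     # mean recovery time, hours (hypothetical)
SIGMA = 1.3e-4  # std. dev. of recovery time, hours (hypothetical)

STATES = {
    1: "3 working",
    2: "2 working",
    3: "System failed",
    4: "2 working; uses 1",
    5: "System failed",
}

# Arc labels: ("exp", rate) for exponential fault arrivals and
# ("gen", mean, std) for the generally distributed recovery.
ARCS = {
    (1, 2): ("exp", 3 * LAM),    # first fault, three active processors
    (2, 3): ("exp", 2 * LAM),    # second fault wins the competition
    (2, 4): ("gen", MU, SIGMA),  # recovery wins; reconfigure to simplex
    (4, 5): ("exp", 1 * LAM),    # fault in the single active processor
}


def build_stm(states, arcs):
    """Return the STM as a square list of lists; None marks 'no arc'."""
    order = sorted(states)
    return [[arcs.get((i, j)) for j in order] for i in order]


stm = build_stm(STATES, ARCS)
for i, row in zip(sorted(STATES), stm):
    print(i, row)
```

Row i and column j of the result hold the label on the arc from
state i to state j, so the two failed states have empty rows.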
The terminology used in this paper to denote the
various types of Markov models and the assumptions
they are based on are defined below. The hierarchy
of Markov models is illustrated in figure 1.7.
[Figure 1.6: state transition diagram of the triad. State
descriptions, in the order extracted: 3 working; 2 working;
System failed; 2 working, uses 1; System failed. Arc labels
recovered: 3λ and <μ, σ>.]
Figure 1.6. Reliability graph (state transition diagram) of a triad.
[Figure 1.7: tree with root "Markov model" and branches
"Homogeneous (time independent)", "Semi- (local time
dependent)", and "Nonhomogeneous (global time dependent)".]
Figure 1.7. Hierarchy of Markov models.
Markov model: A stochastic process model
whose future state depends only upon the
present state and not upon the state history
that led to its present state.
Homogeneous Markov model: A Markov
model whose state transition probabilities are
time independent. For the continuous-time
homogeneous Markov model, this implies that
the state transition times follow an expo-
nential distribution. This type of model is
discussed in Chung (1967) and Romanovsky
(1970) and applied to computer systems in
Makam and Avižienis (1982).
Semi-Markov model: A Markov model whose
state transition probabilities depend upon the
time spent in the present state, which is called
the local time. For the continuous-time semi-
Markov model, this implies that the state
transition times do not follow an exponen-
tial distribution; they might follow a Weibull
distribution or any other distribution. This
type of model is discussed and applied to
computer systems in White (1986).
Nonhomogeneous Markov model: A Markov
model whose state transition probabilities
depend upon the time since the system was
first put into operation, which is called the
global time. For the continuous-time non-
homogeneous Markov model, this implies that
the state transition times do not follow an
exponential distribution. Often these times
are assumed to follow a Weibull distribution,
but they can follow any other distribution.
This type of model is discussed and applied to
computer systems in Trivedi and Geist (1981).
The probability of being in a particular state
for a discrete-state, continuous-time Markov model
can be expressed with a differential equation. The
set of simultaneous differential equations that
describes these models is called the continuous-time
Chapman-Kolmogorov equations. For homogeneous
Markov models, these equations can be solved using
matrix or Laplace transformations.
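As an illustrative sketch only (not a method taken from the
programs discussed in this paper), the Chapman-Kolmogorov
equations of a small homogeneous model can be integrated
numerically and checked against a closed-form solution. The
example assumes a triad without reconfiguration that fails once
two of its three components have failed. The values of LAM and
T, the state ordering, and the choice of a fourth-order
Runge-Kutta integrator in place of a matrix or Laplace
transformation are all assumptions made for this illustration.

```python
import math


def derivative(p, q):
    """Return dP/dt = P Q for row vector p and generator matrix q."""
    n = len(p)
    return [sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]


def rk4_solve(p0, q, t_end, steps):
    """Integrate dP/dt = P Q from 0 to t_end with classical RK4."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = derivative(p, q)
        k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)], q)
        k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)], q)
        k4 = derivative([x + h * k for x, k in zip(p, k3)], q)
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


LAM = 1e-3  # hazard rate per hour (hypothetical)
T = 100.0   # mission time in hours (hypothetical)

# States: 0 = 3 working, 1 = 2 working, 2 = system failed.
Q = [
    [-3 * LAM, 3 * LAM, 0.0],
    [0.0, -2 * LAM, 2 * LAM],
    [0.0, 0.0, 0.0],
]

p = rk4_solve([1.0, 0.0, 0.0], Q, T, steps=1000)
reliability = p[0] + p[1]
# Closed form for this model: R(t) = 3 exp(-2 LAM t) - 2 exp(-3 LAM t)
closed_form = 3 * math.exp(-2 * LAM * T) - 2 * math.exp(-3 * LAM * T)
print(reliability, closed_form)
```

The reliability is the total probability of the two operational
states; the failed state absorbs the remainder.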
If the state transition probabilities are time
dependent, it may be quite difficult to obtain
explicit solutions to the continuous-time
Chapman-Kolmogorov equations. Obtaining the exact
probability of reaching a state through a particular
path of transitions requires the solution of a
multiple integral, in which each integral represents
the probability of making one of the transitions in
the path. Often the integrals are approximated using
numerical integration techniques (Stiffler, Bryant,
and Guccione 1979). An alternative method is to
approximate the continuous-time model with
discrete-time equivalents (Siewiorek and Swarz 1992).
The major difficulty with the second method is that
many transition rates that are effectively zero in
the continuous-time model assume small, but nonzero,
probabilities in a discrete-time model.
1.2. Previous Work
Several programs exist, such as ARIES, SURF,
CARE III, HARP, SURE, PAWS, STEM, and
ASSURE, which use Markov models to evaluate the
reliability and/or availability of systems that use
standby redundancy or can be repaired and that are
susceptible to hard, transient, or intermittent faults.
All these programs can evaluate reliability. The
ARIES, SURF, and HARP programs can also evalu-
ate availability. Except for CARE III and ASSURE,
they all have the state transition matrix as one of the
system specification methods.
The ADVISER (Advanced Interactive Symbolic
Evaluator of Reliability) program, described in Kini
and Siewiorek (1982), automatically generates sym-
bolic reliability functions for PMS structures. The
program assumptions are that all the faults are per-
manent and stochastically independent, the PMS
system has a perfect coverage, and the failed com-
ponents are not repaired and returned to a nonfaulty
state. The program's primary input is the intercon-
nection graph of the PMS structure. Other program
inputs describe the components of the PMS struc-
ture by their types, reliability functions, internal port
connections, and ability to communicate with com-
ponents of the same type. The program also takes
as input the requirements for the system and its sub-
systems or clusters in the form of modified Boolean
expressions.
The ARIES (Automated Reliability Interactive
Estimation System) program, described in Makam,
Avižienis, and Grusas (1982), is restricted to homo-
geneous Markov models. The system can be specified
using a state transition matrix or as a series of inde-
pendent subsystems each containing identical mod-
ules that either are active or serve as spares. The
program uses a matrix transformation solution tech-
nique that assumes distinct eigenvalues for the state
transition matrix.
Described in Landrault and Laprie (1978), the
SURF program can solve semi-Markov models that
use exponential distributions or nonexponential dis-
tributions that are related to the exponential (e.g.,
Gamma, Erlang, and others). The method of stages
(Cox and Miller 1965) is used to produce a ho-
mogeneous Markov model. Matrix transformations
are used to obtain time-independent values, such as
MTTF and the limiting availability. The Laplace
transform is used to obtain time-dependent values,
such as availability and reliability.
The CARE III (Computer-Aided Reliability Es-
timation) program, described in Bavuso, Petersen,
and Rose (1984), can evaluate the reliability of sys-
tems that use reconfiguration to tolerate component
faults but that do not repair the faulty compo-
nents. The program uses a behavioral decompo-
sition/aggregation solution technique described in
Trivedi and Geist (1981). This technique assumes the
fault-occurrence behavior is composed of relatively
infrequent (slow) events, and the fault-handling
behavior is composed of relatively frequent (fast)
events. The fault-handling behavior is separately an-
alyzed using a fixed semi-Markov model that can
use exponential and uniform distributions. The
fault-occurrence behavior is analyzed using an ag-
gregate nonhomogeneous Markov model that can use
exponential and Weibull distributions. The fault-
handling behavior is reflected by parameters in the
aggregate nonhomogeneous Markov model. Numer-
ical integration techniques are used to solve these
Markov models. The fault-occurrence behavior is
specified using extended fault trees, which are auto-
matically converted to the nonhomogeneous Markov
model. The fault-handling behavior is specified
by providing the transition parameters of the fixed
semi-Markov model.
For HARP (Hybrid Automated Reliability Pre-
dictor), described in Dugan et al. (1986) and Howell
et al. (1990), the state transition probabilities can
have exponential, uniform, Weibull, or general dis-
tributions. (A histogram must be provided for gen-
eral distributions.) If the state transition matrix is
given by the user, HARP can only evaluate the avail-
ability of systems with constant repair rates. The
HARP program has several additional methods of
specifying the fault-occurrence behavior (e.g., fault
trees), all of which are automatically converted to a
nonhomogeneous Markov model. The fault-handling
behavior can also be specified by providing the
transition parameters of one of several models. The
program uses the same behavioral decomposition/
aggregation solution technique as CARE III, but
the various models are solved in a hybrid fashion.
Markov models are solved using numerical integration
techniques, and extended stochastic Petri nets
are solved by simulation.
The SURE (Semi-Markov Unreliability Range
Evaluator) program, described in Butler (1992),
evaluates the unreliability upper and lower bounds of
semi-Markov models. It uses new mathematical
theorems proved in White (1986) and Lee (1985). These
theorems provide a technique for bounding the
probability of traversing a specific path in the model
within a specified time. By applying the theorems
to every path of the model, the probability that the
system reaches any death state can be determined
within usually very close bounds. These theorems
assume that slow (with respect to the mission time)
exponential transitions describe the occurrence of
faults, and fast transitions that follow a general
distribution specified by its mean and standard
deviation describe the recovery process. The program
provides the option of pruning the model during its
evaluation by conservatively assuming system failure
once the probability of reaching a state falls below
a specified or automatically selected prune level.
Faults can be modeled as permanent, transient, or
intermittent as long as there are no loops in the
model which only have fast transitions. The only
input method of the program is the state transition
matrix.
Described in Butler and Stevenson (1988), the
PAWS (Padé Approximation With Scaling) and
STEM (Scaled Taylor Exponential Matrix) programs
evaluate the unreliability of homogeneous Markov
models. The input language for these two programs
is essentially the same as for the SURE program.
Although the numerical techniques used in these
programs are not as fast as the SURE technique, they
are suitable for loops with only fast transitions.
The ASSIST (Abstract Semi-Markov Specification
Interface to the SURE Tool) program, which
uses an abstract language for specifying Markov
reliability models, is described in Butler (1986). The
language has statements to specify the state space,
by defining the state variables and their range; the
start state, by the initial values of the state variables;
the death states, by a Boolean expression of the state
variables; and the state transitions, by a set of if-then
rules that define, in terms of the state variables, the
possible transitions, their rates, and their destination
states. This language has been implemented
in the ASSIST program to generate Markov
reliability models in the SURE input language (Johnson
1986). The implementation provides three optional
state space reduction techniques. The first technique
is pruning the model during its generation by
conservatively assuming system failure once a state
satisfies a prune condition specified as a Boolean
expression of the state variables (Johnson 1988). The
second technique is trimming the model by
conservatively altering states with outgoing recovery
transitions (White and Palumbo 1990). The outgoing
failure transitions of the altered states that do not go
to death states are changed to go to a single trim
state. The third technique combines pruning and
trimming by changing all states that meet a prune
condition to trim states. Each trim state has a single
transition to a death state at some trim rate. The trim
rate must be the sum of the failure rates of all
remaining components.
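The four kinds of statements described above (state space,
start state, death states, and if-then transition rules) can be
sketched as a small model generator. This is a Python analogue
written for illustration, not ASSIST syntax and not the ASSIST
implementation. The state variables (nc, nf), the rule set, and
all numeric values are assumptions chosen so that the generated
model mirrors the triad of figure 1.6.

```python
from collections import deque

LAM = 1e-4                         # hazard rate (hypothetical)
RECOVERY = ("gen", 2.7e-4, 1.3e-4)  # mean and std. dev. (hypothetical)

# State variables: nc = components in the configuration,
#                  nf = faulty components in the configuration.
START = (3, 0)


def is_death(state):
    """Death condition: the working components are not a majority."""
    nc, nf = state
    return 2 * nf >= nc


def transitions(state):
    """If-then rules giving (destination state, arc label) pairs."""
    nc, nf = state
    if nc > nf:                # a working component can still fail
        yield (nc, nf + 1), ("exp", (nc - nf) * LAM)
    if nc == 3 and nf > 0:     # triad with a fault degrades to simplex
        yield (1, 0), RECOVERY


def generate(start):
    """Expand the rules breadth first; death states are not expanded."""
    states, arcs = [start], {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if is_death(state):
            continue
        for dest, label in transitions(state):
            arcs[(state, dest)] = label
            if dest not in states:
                states.append(dest)
                queue.append(dest)
    return states, arcs


states, arcs = generate(START)
for (src, dest), label in arcs.items():
    print(src, "->", dest, label)
```

With these rules the generator reaches five states and four
transitions, two of the states being death states.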
The ASSURE program, described in Palumbo
and Nicol (1990), translates an extension of the
ASSIST language into C code, which is linked with
SURE solution procedures and executed to generate
and solve the model. This reduces the storage
required because completely expanded states are
discarded since the only state of consequence at any
time is the state being expanded. The extended
ASSIST language allows the use of user-defined C
functions to specify the death states and the state
transitions. This specification increases the size and
complexity of the systems that can be practically
modeled because it makes the model specification
more compact.
The RMG (Reliability Model Generator) program
is specified in Cohen and McCann (1990). As it
is now implemented, LISP expressions are required to
specify the system failure conditions whose
probabilities are to be evaluated and each component's
local reliability model (LRM) and function. An LRM is
specified in terms of the component modes, the
transitions between modes, and the characteristic (good,
bad, or none) of the outputs in terms of the modes
and the value or characteristic of the inputs. A
graphical input is used to specify the interconnection
graph of the PMS structure. It aggregates the
LRM's to specify a Markov reliability model in the
ASSIST language for the system failure conditions
given.
Table 1.1 gives the primary inputs and outputs of
the programs described in this subsection. None of
these programs is able to generate a Markov model
or its specification using a high-level SDL that is easy
to learn and use.
Table 1.1. Summary of Previous Work
Program name  Primary inputs                                       Primary outputs
ADVISER       PMS structure                                        Symbolic reliability function
ARIES         Homogeneous Markov model                             Reliability or availability estimate
SURF          Semi-Markov model                                    Reliability or availability estimate
CARE III      Fault tree and semi-Markov model parameters          Reliability estimate
HARP          Fault tree or nonhomogeneous Markov model            Reliability or availability estimate
SURE          Semi-Markov model                                    Reliability bounds
PAWS/STEM     Homogeneous Markov model                             Reliability estimate
ASSIST        Semi-Markov model specification                      Semi-Markov model
ASSURE        Semi-Markov model specification                      Reliability bounds
RMG           LRM's, PMS structure, and system failure conditions  Semi-Markov model specification
1.3. Motivation
The goal of this research and development effort
is to provide the computer architect a powerful and
easy-to-use software tool that will assume the
burden of an advanced reliability analysis that
considers intermittent, transient, and permanent faults for
computer systems of high complexity and
sophistication. The PMS level of computer system description
was selected because it is the highest level view of
digital systems and therefore the easiest to specify
and it is well known to computer architects. The
Markov model technique was selected because it is
powerful enough to accurately model most situations,
it is widely used among reliability analysts, and these
models can be evaluated by several programs that
have been developed.
Previous efforts have been limited in one of three
ways. Most efforts provided a computational aid once
the preliminary system decomposition and reliability
analysis had been manually achieved. Alternatively,
computer systems of less complexity and
sophistication were considered without transient and
intermittent faults, or they did not provide a high-level SDL
that is easy to learn and use.
1.4. Organization
The GUI is defined and illustrated in section 2.
The problems involved in the automatic specification
of Markov reliability models are identified and ana-
lyzed in section 3. Examples of GUI applications and
their results are given in section 4. An analysis of this
approach is presented in section 5. Conclusions are
drawn in section 6. The algorithms used by the ARM
program are shown in the appendix.
2. Graphical User Interface (GUI)
Definition
The GUI is the first of four steps in the automated
reliability modeling process proposed in this paper.
The second step is the automated specification of
the model in the ASSIST language. This step was
implemented in the ARM program. The last two
steps, the automated generation and evaluation of
the model, have already been implemented. The
third step has been implemented in the ASSIST
program, and the fourth step has been implemented
in the SURE, PAWS, and STEM programs.
In order of importance, the major goals of the
GUI are defined below:
General: To allow current and future fault-
tolerant techniques and system designs to be
accommodated
Hierarchical: To allow systems and subsys-
tems to be defined in terms of their subsys-
tems and components, respectively
Compact: To allow subsystem classes to only
be defined once with their component types
as formal parameters
Subsystems are in the same class if they have the
same hierarchy and requirements (e.g., triads that
require two of their three components). Subsystems
are of the same type if they are in the same class, are
composed of the same component types, and have
the same recovery parameters, if any (e.g., processor
triads). Components are of the same type if they have
the same function and parameters (e.g., processors).
These categories of subsystems and components are
summarized in table 2.1. For the sake of generality,
the GUI does not predefine any category.
[Figure 2.1: tree diagram. Node labels recovered: System;
Requirements*, Architecture, Parameters*; Structure,
Reconfigurations, Hierarchy; External*, Internal; Physical,
Logical.]
Figure 2.1. System description hierarchy. (The asterisk (*)
denotes parts that are always required.)
Table 2.1. Categories of Subsystems and Components
Category         Common attributes
Subsystem class  Hierarchy
                 Requirements
Subsystem type   Subsystem class
                 Component types
                 Recovery parameters
Component type   Function
                 Parameters
Each category is represented by an identifier
that starts with a letter and can contain letters,
underscores (_), and digits (e.g., a component type
could be represented by p). A subsystem identifier
can also end with a set of parentheses that enclose a
list of parameters separated by commas. Formal
parameters, which are identifiers that are not used to
represent a category or anything else, are used in the
identifier of a subsystem class (e.g., T(x)). Component
types are used instead of the formal parameters
in the identifier of a subsystem type (e.g., T(p)).
Type identifiers can be either (a) preceded by an
integer greater than 1 to represent multiple elements
of the same type (e.g., 2T(p)) or (b) followed by a
period and a list of subranges and/or integer numbers
in the range from 1 to the number of elements
of that type, which are separated by commas to
represent specific elements of the same type (e.g.,
T(p).1,2), but not both (a) and (b). A subrange
is specified by two positive integer numbers
separated by a dash, with the larger one on the right
(e.g., p.1-3). Unless elements are assigned specific
numbers, they are given the lowest positive numbers
available (e.g., the components represented by 2p
could be assigned the numbers 1 and 2).
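The identifier rules above can be sketched as a small parser.
This is an illustration only and is not the ARM implementation:
the function name parse_type_identifier, the returned
dictionary layout, and the restriction to non-nested parameter
lists are assumptions. The upper limit on element numbers is
not checked because it depends on the rest of the system
description.

```python
import re

_ID = r"[A-Za-z][A-Za-z0-9_]*"
_PATTERN = re.compile(
    r"^(?P<count>\d+)?"                                     # e.g. the 2 in 2T(p)
    r"(?P<name>" + _ID + r")"                                # type name
    r"(?:\((?P<params>" + _ID + r"(?:," + _ID + r")*)\))?"   # (p) or (x,y)
    r"(?:\.(?P<elems>\d+(?:-\d+)?(?:,\d+(?:-\d+)?)*))?$"     # .1,2 or .1-3
)


def parse_type_identifier(text):
    """Parse identifiers such as '2T(p)', 'T(p).1,2', or 'p.1-3'."""
    match = _PATTERN.match(text.replace(" ", ""))
    if match is None:
        raise ValueError("not a valid type identifier: %r" % text)
    count, elems = match.group("count"), match.group("elems")
    if count is not None and elems is not None:
        raise ValueError("a count and an element list cannot be combined")
    if count is not None and int(count) <= 1:
        raise ValueError("a leading count must be greater than 1")
    elements = None
    if elems is not None:
        elements = []
        for part in elems.split(","):
            if "-" in part:
                low, high = (int(x) for x in part.split("-"))
                if not 0 < low < high:
                    raise ValueError("bad subrange: %r" % part)
                elements.extend(range(low, high + 1))
            else:
                number = int(part)
                if number < 1:
                    raise ValueError("element numbers start at 1")
                elements.append(number)
    params = match.group("params")
    if elements is not None:
        total = len(elements)
    else:
        total = int(count) if count is not None else 1
    return {
        "count": total,
        "name": match.group("name"),
        "params": params.split(",") if params else [],
        "elements": elements,
    }


print(parse_type_identifier("2T(p)"))
print(parse_type_identifier("T(p).1,2"))
print(parse_type_identifier("p.1-3"))
```

An identifier that combines forms (a) and (b), such as 2p.1,2,
is rejected with a ValueError.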
The system's description is divided into require-
ments, architecture, and parameters. The require-
ments depend on the application of the system. How
the system was designed determines the architecture.
The technology used to implement the system com-
ponents determines the parameter values (e.g., fail-
ure rates). The sources of the major GUI input
categories are summarized in table 2.2. Figure 2.1
shows the hierarchy of the system description. The
actual GUI inputs are the leaves of the tree shown in
figure 2.1.
Table 2.2. Sources of Major GUI Input Categories
Major GUI input category  Source
Requirements              Application
Architecture              Design
Parameters                Implementation technology
The GUI starts by displaying the main window
shown in figure 2.2. It contains text fields for entering
the system name and the name of the current
selection; the graphs, parameters, and model pull-down
menus; and a button to quit the GUI. The current
selection, which is the initial name used by win-
dows that describe a component type, subsystem
type, or reconfiguration, changes automatically to
the last name entered in the first text field of any
such window, but it can also be changed manually.
[Main window text fields: System Name, Current Selection.]
Figure 2.2. Main window.
The graphs menu, shown in figure 2.3, displays
a window for editing the graphs described in sub-
section 2.1. The parameters menu, shown in fig-
ure 2.4, displays windows, with text fields and but-
tons for parameter specification, which are described
in subsection 2.2. The model menu, shown in fig-
ure 2.5, executes the programs that specify, generate,
and evaluate the Markov model, based on the sys-
tem description given through the GUI. The ARM
program will notify the user if the system descrip-
tion is incomplete (e.g., if the external structure has
not been given) and not specify the model. Subsec-
tion 2.3 presents a summary of the GUI and recom-
mendations on how to reduce the number of errors
in the system description.
2.1. Graphs
The following subsections describe the graphs
used for specifying the system's communication
structure, hierarchy, reconfiguration capabilities, and
requirements.
2.1.1. Structure
Graphs with unidirectional and bidirectional
edges describe the system's external and internal
communication structures. It is assumed that com-
ponents which communicate and are critical (i.e., re-
quired for the system to be operational) must be
able to continue communicating. If this assump-
tion is not true, the result will be conservative. The
main purpose of the communication structure de-
scription is to analyze which component failures will
prevent communication between critical components
and therefore cause system failure.
[Menu items: External Structure, Internal Structure,
Physical Hierarchy, Logical Hierarchy, System
Reconfiguration, Requirement.]
Figure 2.3. Graphs menu.
[Menu items: Active Component, Spare Component,
Component Repair, Subsystem Recovery, System
Reconfiguration, Model Generation, Model Evaluation.]
Figure 2.4. Parameters menu.
[Menu items: Specify, Generate, Evaluate.]
Figure 2.5. Model menu.
2.1.1.1. External. A system's external structure
is defined as the communication interconnection of
all its components. The external structure graph
is required for all systems because it is also used
to identify the system components, their types, and
their connectivity equivalence classes (defined in sub-
section 3.2.1). In the external structure graph, the
nodes represent one or more components of the same
type. Unless specific numbers are assigned, the com-
ponents represented by the same node are assigned
a continuous range of numbers (e.g., the components
represented by a node labeled 3p could be assigned
the numbers 1 through 3). A unidirectional edge be-
tween two nodes indicates that all the components of
the source node can communicate with all the com-
ponents of the target node. A bidirectional edge be-
tween two nodes indicates that all the components of
one node can communicate with all the components
of the other node and vice versa.
A plus sign (+) at the end of a component iden-
tifier indicates that this is a self-talking component.
A majority of components of the same type are pas-
sive, and they do not need to communicate. Exam-
ples of passive components are memories, buses, and
input/output transducers. Self-talking components
need to exchange information amongst one another.
Examples of self-talking components are processors,
direct-memory-access device controllers, and other
"smart" controllers. If not specified, the default is
for components to be passive and not communicate
with their own type. This information is needed to
prevent ARM from requiring communication paths
between components of the same type that never ex-
change information. Not taking this behavior into
account would lead to a pessimistic evaluation of the
system reliability.
An asterisk (*) at the end of a component iden-
tifier indicates that every input port of this compo-
nent is internally connected to all output ports of
the component. Most buses have this internal struc-
ture. If not indicated in this way or as described in
subsection 2.1.1.2, the default is for every port of a
component to be disconnected from the other ports
of the component.
The graph in figure 2.6 describes the external
structure of a multiprocessor composed of six pro-
cessors p, six memories m, six watchdog timers
w, four transmit buses tb, four receive buses rb,
and four watchdog buses wb. The processors and
watchdog timers need to communicate with com-
ponents of their own type. The processors com-
municate through the memory as described in sub-
section 2.1.1.2. The watchdog timers communicate
through the watchdog bus. All the buses have the
typical internal structure described above. This
multiprocessor will be used as a running example
throughout this section.
[Graph; node labels recovered include 6p+ and 6m.]
Figure 2.6. External structure of a multiprocessor.
2.1.1.2. Internal. A component's internal struc-
ture is defined as the communication interconnection
of its ports. This internal structure of one or more
components can be described by a graph inside a
component with its external port connections labeled
on the outside of the component. The absence of an
edge between two ports indicates that they cannot
communicate through this component.
The internal structure graph of a component is
used to determine which of its neighbors can com-
municate through it. Two components are neighbors
if they are interconnected in the external structure
graph. If none of a component's neighbors can com-
municate through it, no internal structure needs to
be specified because by default a component cannot
be used for communication by its neighbors.
The internal structure of each of the six memories
is described by the graph in figure 2.7. This struc-
ture indicates that the processors can communicate
through the memory.
[Graph; external port connections labeled 4rb and 4tb.]
Figure 2.7. Internal structure of each of the six memory
components.
2.1.2. Hierarchy
A system can have physical and/or logical hier-
archies that contain physical and logical subsystems,
respectively. These hierarchies are different partial
views of the same system; therefore, a component of
a physical subsystem may also be a component of a
logical subsystem. The difference between a physical
and a logical subsystem is in their ability to be recon-
figured and in how their failure affects the system's
operation, as explained in the next two subsections.
If present, the system hierarchies show what sub-
systems are in the initial system configuration and
define the composition of the subsystems that may
be part of those hierarchies.
A group of components with its own set of re-
quirements constitutes a subsystem. If a subsystem
does not meet its requirements, then none of its com-
ponents are able to perform their function. If a subset
of the system components, but not all of them, de-
pends on one or more components in the subset, the
subset needs to be defined as a subsystem by giving
its hierarchy and requirement graphs. The subsystem
defined for the subset must be placed in either the ap-
propriate system hierarchy graph (if it is part of the
initial system configuration) or the destination node
of a system reconfiguration graph (if it can be part of
a future system configuration). The system physical
or logical hierarchy graphs can only be given if there
are physical or logical subsystems, respectively.
Redundant subsystems are composed of multi-
ple components with the same function to increase
their reliability or availability. Some of these redun-
dant subsystems may be part of the initial system
configuration, while others serve as alternatives for
system reconfiguration (e.g., a quad subsystem that
reconfigures into a triad).
A system hierarchy is described by nondirectional
tree graphs. Root nodes (identified by a circle)
represent the system or one of its subsystems. Other
nodes (identified by a rectangle) represent one or
more identical subsystems or components.
Unless they are assigned specific components,
subsystems are assigned components with the lowest
numbers available. For example, if there were six
processors, numbered 1 through 6, and two processor
triads, one triad would be assigned processors 1
through 3 and the other triad would be assigned
processors 4 through 6.
2.1.2.1. Physical. Physical subsystems cannot be
reconfigured. However, the failure of a physical sub-
system does not preclude the system from operating,
as long as the system requirements are met.
Figures 2.8 and 2.9 describe the physical hierar-
chy of the multiprocessor (MP). Initially, the multi-
processor contains six printed circuit boards (PCB's),
which belong to the same physical subsystem type.
Each board contains a processor, memory, and a
watchdog timer.
Figure 2.8. Physical hierarchy of the multiprocessor.
Figure 2.9. Physical hierarchy of the printed circuit board
subsystem type.
[Tree; node labels recovered include T(wb).]
Figure 2.10. Logical hierarchy of the multiprocessor.
2.1.2.2. Logical. Logical subsystems can be re-
configured. Before component failures cause them to
fail, they can recover by replacing the failed compo-
nents with spares. If not enough spares are avail-
able, the system can degrade to a lesser number of
subsystems or a less redundant subsystem. When
a logical subsystem fails, the system also fails un-
less it can be reinitialized by a separate subsystem
or component.
Figures 2.10 and 2.11 describe the logical hierar-
chy of the multiprocessor. Initially, the multiproces-
sor contains two processor triads, one memory triad,
one watchdog triad, one transmit bus triad, one re-
ceive bus triad, and one watchdog bus triad. These
triads are each composed of three components of the
same type.
Figure 2.11. Logical hierarchy of the triad (T) subsystem
class.
The ARM program will automatically determine
what components are spares by comparing the ex-
ternal structure with the logical hierarchy; any extra
instances of components in the external structure,
beyond what is included in the logical hierarchy, will
be assumed to be spares. Therefore, from figures 2.6
and 2.10, the spare components are assumed to be
three memories, three watchdog timers, one transmit
bus, one receive bus, and one watchdog bus.
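The comparison described above amounts to subtracting the
component counts used by the logical hierarchy from the counts
in the external structure. The sketch below illustrates that
arithmetic for the running example; it is not the ARM
algorithm, and the data layout and the function name
find_spares are assumptions. The counts come from the
descriptions of figures 2.6 and 2.10 in the text.

```python
from collections import Counter

# Components in the external structure (figure 2.6).
EXTERNAL = Counter({"p": 6, "m": 6, "w": 6, "tb": 4, "rb": 4, "wb": 4})

# Logical hierarchy (figure 2.10) as
# (number of subsystems, components per subsystem, component type).
LOGICAL = [
    (2, 3, "p"),   # two processor triads
    (1, 3, "m"),   # one memory triad
    (1, 3, "w"),   # one watchdog triad
    (1, 3, "tb"),  # one transmit bus triad
    (1, 3, "rb"),  # one receive bus triad
    (1, 3, "wb"),  # one watchdog bus triad
]


def find_spares(external, logical):
    """Return the components of each type left over as spares."""
    used = Counter()
    for n_subsystems, size, ctype in logical:
        used[ctype] += n_subsystems * size
    spares = {}
    for ctype, total in external.items():
        extra = total - used[ctype]
        if extra < 0:
            raise ValueError("hierarchy uses more %s than exist" % ctype)
        if extra:
            spares[ctype] = extra
    return spares


spares = find_spares(EXTERNAL, LOGICAL)
print(spares)
```

The result reproduces the spares listed in the text: three
memories, three watchdog timers, and one bus of each kind.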
2.1.3. System Reconfiguration
The future system configurations are described in
terms of the reconfigurations allowed. A change in
the system's configuration in response to some trig-
gering event is defined as a reconfiguration. A re-
configuration occurs when the system is reinitialized
because of a logical subsystem failure or when the
system degrades to a lesser number of subsystems
or a less redundant subsystem because no spares ex-
ist to replace a failed component. Also, the mission
phase may change, thus causing the system to recon-
figure. If the system is to be reinitialized because
of a logical subsystem failure, only one reconfigura-
tion must do so. To simplify the model specification,
a single reconfiguration will only be allowed to de-
grade a subsystem by, at most, two components.
For example, one reconfiguration could take a sub-
system from a quintuple to a triad, and a subsequent
reconfiguration could take it to a simplex.
A reconfiguration is described in part by one or
more unidirectional graphs. A source node repre-
sents one or more of the components or subsystems
(physical or logical) which must be active before the
reconfiguration. A destination node represents either
the reinitialized system or one or more of the logical
subsystems that will be active after the reconfigura-
tion in place of the logical subsystems identified by its
source node. Each edge is labeled with the name of
a specification that will provide the triggering event
and the rest of the reconfiguration parameters, de-
scribed in subsection 2.2.5, which will complete the
description of the reconfiguration. The specification
name can contain letters, underscores, and digits in
any order.
Figure 2.12 describes the reinitialization of the
multiprocessor by the watchdog triad. Figure 2.13
describes the degradation of the multiprocessor from
two processor triads (PT's) to one. If the recovery
rate of the remaining triad is specified as being
greater than 0, the working processors in the
deactivated triad are assumed to become spares.
[Graph; edge label recovered: Restart.]
Figure 2.12. Reinitialization of the multiprocessor.
[Graph; edge label as extracted: 2 to 1PT.]
Figure 2.13. Degradation of the multiprocessor.
Currently, only the reconfigurations that degrade
the system have been implemented. Therefore, at
the present time, reconfiguration graphs are needed
only for systems that have logical subsystems and
can degrade to a lesser number of logical subsystems
and/or to less redundant logical subsystems.
2.1.4. Requirement
The requirement of a system or subsystem is de-
fined as the minimum set of subsystems and com-
ponents needed. Performance levels can be used
to identify the nondegraded mode and the various
degraded modes of operation a system might have.
This requirement is described by one or more suc-
cess trees. Root nodes (identified by a circle) repre-
sent the system, one of its subsystems, or a perfor-
mance level. Other nodes (identified by a rectangle)
represent one or more identical subsystems, a per-
formance level, or one or more identical components.
It is assumed that components in the system success
tree are not in any logical subsystem. A success tree
is required for all systems, subsystems, and perfor-
mance levels.
Success trees and fault trees use the same nota-
tion, but they define the combination of events that
will cause the system to succeed or fail, respectively.
The advantages of success trees over fault trees are
that (1) they are more intuitive for a computer engi-
neer who is concerned with making the system work
and not with how it can fail and that (2) a conserva-
tive reliability estimate is produced if some modes of
operation are left out of the success tree, because sys-
tem failure is assumed for those modes of operation,
whereas an optimistic reliability estimate is produced
if a failure mode is left out of a fault tree.
The graphs in figures 2.14 to 2.16 describe the
system and subsystem requirements of the multi-
processor. This multiprocessor can operate at one of
two performance levels. To achieve full performance
(FP), both processor triads, the watchdog triad,
and the memory triad must be operational. The
requirements for degraded performance (DP) are the
same except that only one processor triad is needed.
Each printed circuit board requires that its memory
be working for it to be operational. Subsystems of
the triad class require two of their three components
to operate.
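As an illustration, the multiprocessor requirements above can be written as a small success-tree evaluation. This is only a sketch under stated assumptions: the function names and the dictionary layout are hypothetical, the printed circuit board requirement is omitted, and the code does not reproduce how ARM itself stores success trees.

```python
def triad_ok(working):
    """A subsystem of the triad class needs two of its three components."""
    return working >= 2


def performance_level(state):
    """Return 'FP', 'DP', or None (system failed) for a multiprocessor state.

    `state` is a hypothetical dict giving the number of working components
    in each processor triad, in the watchdog triad, and in the memory triad.
    """
    processor_triads_up = sum(triad_ok(w) for w in state["processor_triads"])
    common_ok = triad_ok(state["watchdog"]) and triad_ok(state["memory"])
    if not common_ok:
        return None
    if processor_triads_up == 2:
        return "FP"  # full performance: both processor triads operational
    if processor_triads_up == 1:
        return "DP"  # degraded performance: one processor triad is enough
    return None


print(performance_level({"processor_triads": [3, 2], "watchdog": 3, "memory": 2}))
print(performance_level({"processor_triads": [3, 1], "watchdog": 3, "memory": 2}))
```

Because any state not covered by the tree returns None (failure), leaving a mode of operation out of such a description can only make the estimate more conservative, which is the second advantage of success trees noted above.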
2.2. Parameters
The following subsections describe the parameter
specification windows. Any time unit may be used
for the parameter values as long as it is the same
one for all of them. The time unit used for ARM
parameters throughout this paper is hours. The
OK and CANCEL buttons in each window save and
discard the parameter changes made, respectively.
Selecting either button makes the window disappear.
The ARM program will assume that a transition
which reconfigures components and/or subsystems in
or out of the system describes sequential processes.
For example, if n faults exist in one or more subsystems
of the same type with recovery rate ρ, the rate at which
one of the faulty components is replaced by a spare is
assumed to be ρ, not nρ. If this assumption is not true,
the result will be conservative. Typically, these transitions
are fast, in which case this assumption being false would
have little effect.
The SURE program requires slow transitions to
follow an exponential distribution, but it allows fast
transitions to follow a general distribution. Because
transitions that reconfigure components and/or sub-
systems in or out of the system are typically fast,
ARM allows them to follow a general distribu-
tion. However, in SURE, the transition probability
must be given for each general transition competing
with other fast transitions (Butler and White 1988).
Figure 2.14. Requirements of the multiprocessor.
Figure 2.15. Requirement of the printed circuit board subsystem type.
Figure 2.16. Requirements of the triad subsystem class.
These probabilities must be given for each combination
of one or more general transitions competing with other
fast transitions. The number of combinations of n competing
general transitions taken two or more at a time is as follows:
\[ \sum_{j=2}^{n} \frac{n!}{j!\,(n-j)!} \]
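To give a feel for how quickly this count grows, the number of combinations of n transitions taken j at a time, summed for j from 2 to n, can be evaluated directly. The following sketch (the function name is an assumption made for illustration) also checks the closed form 2^n - n - 1, which follows because the 2^n subsets of n transitions include one empty subset and n single-element subsets.

```python
from math import comb


def competing_combinations(n):
    """Number of ways to choose two or more of n competing general transitions."""
    return sum(comb(n, j) for j in range(2, n + 1))


for n in range(2, 7):
    # Closed form: all subsets minus the empty set and the n singletons.
    assert competing_combinations(n) == 2**n - n - 1
    print(n, competing_combinations(n))
```

The exponential growth is the reason a single probability per general transition is a substantial simplification.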
To simplify the system description and the model
specification, ARM requests only one probability
for each general transition. This is the occurrence
probability when it is competing with any of the
fast exponential rates at which transient faults dis-
appear or intermittent faults become benign (sub-
section 2.2.1). Although these rates are fast, ARM
does not allow them to follow a general distribution
so that only one transition probability is needed for
each general transition. This transition probability
will be assumed to be the same for all competing
fast exponential transitions. This assumption is not
strictly true; however, it is often close enough in
practice to be used to simplify the analysis.
Because ARM only asks for the probability of a
general transition for the case when it is competing
with the fast exponential rates at which transients
disappear or intermittent faults become benign, these
general transitions cannot compete with potentially
general transitions. A potentially general transition
is one that ARM allows to follow a general distribu-
tion. The only transitions that ARM allows to follow
a general distribution are those that reconfigure com-
ponents and/or subsystems in or out of the system.
However, all fast exponential transitions can com-
pete. To determine which potentially general transi-
tions should take precedence over others and which
ones have the same precedence and therefore should
compete, ARM requires that the user assign a posi-
tive integer priority to each potentially general tran-
sition. A value of 1 will be interpreted as the highest
priority. Transitions that are assigned the same pri-
ority can compete if they follow exponential distri-
butions, their triggering conditions are met, and the
triggering conditions of higher priority transitions are
not met.
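The precedence rule can be sketched as a selection of the triggered transitions that share the best priority value. This is a simplified illustration, not ARM's code: the dictionary keys and transition names are hypothetical, and the additional condition that competing transitions follow exponential distributions is not checked here.

```python
def competing_group(transitions):
    """Return the names of the triggered transitions that share the best priority.

    Each transition is a hypothetical dict with the keys 'name', 'priority'
    (1 is the highest priority), and 'triggered' (True if its triggering
    conditions are met).
    """
    triggered = [t for t in transitions if t["triggered"]]
    if not triggered:
        return []
    best = min(t["priority"] for t in triggered)
    return [t["name"] for t in triggered if t["priority"] == best]


transitions = [
    {"name": "restart", "priority": 1, "triggered": False},
    {"name": "recover_p", "priority": 2, "triggered": True},
    {"name": "recover_m", "priority": 2, "triggered": True},
    {"name": "repair_p", "priority": 3, "triggered": True},
]
print(competing_group(transitions))  # ['recover_p', 'recover_m']
```

In this example the priority 3 transition is held back because the triggering conditions of two priority 2 transitions are met, while the priority 1 transition plays no role because it is not triggered.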
Initially, numeric and selection parameters are
assigned an appropriate default value. Probabilities
default to 1 or 0. Priorities and coverage probabilities
default to 1. Rates, means, standard deviations, and
transition probabilities default to 0.
For each component, one of the failure rates de-
scribed in subsection 2.2.1 must not be 0. Other-
wise, ARM will notify the user and not specify the
model. All other parameters may be left at their
default values. Therefore, ARM does not prompt the
user for any values.
Instead of values, all numeric parameters except
priorities can also be given variable identifiers that
start with a letter and can contain letters, under-
scores (_), and numbers. One of these variables can
be given a range as described in subsection 2.2.7 if
it is not used for the ASSIST trim rate described
in subsection 1.2. If a variable is used for the trim
rate, ASSIST will prompt for its value. The SURE,
PAWS, or STEM programs will prompt for the value
of all other variables without a range.
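The rule for variable identifiers (a leading letter followed by letters, underscores, or digits) maps directly onto a regular expression. The sketch below is an assumption about how such a check could be written; the function name is hypothetical, and restricting letters to the ASCII range is a choice made here, not something stated in the text.

```python
import re

# A letter first, then any mix of letters, digits, and underscores.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_variable_identifier(text):
    """True if `text` is acceptable as a variable identifier under the rule above."""
    return _IDENTIFIER.fullmatch(text) is not None


for candidate in ("lambda_p", "rate2", "2rate", "_rate", "rate-p"):
    print(candidate, is_variable_identifier(candidate))
```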
Numeric parameters are assumed to be indepen-
dent of the system state. This assumption is not
strictly true; however, it is often close enough in
practice to be used to simplify the analysis.
2.2.1. Active Component
The active component parameters with example
values are shown in figure 2.17. First is the name
of the component type. Second is the arrival rate
of permanent faults (0.00005 per hour or 2 × 10^4
hours between permanent failures). The next two
are the arrival (0.0005 per hour) and disappearance
rates (4000 per hour or 0.9 seconds to removal) of
transient faults. If the arrival rate of transient faults
is not 0, then the disappearance rate must have a
value other than 0. The next three are the rates
at which intermittent faults arrive, become benign,
and become active again. If the arrival rate of
intermittent faults is not 0, then the benign and
active rates must have values other than 0. All six
rates are assumed to describe concurrent processes.
For example, if there are n working components of
the same type with a permanent failure rate of λ,
the rate at which one of them fails permanently is
assumed to be nλ. If this assumption is not true, the
result will be conservative.
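The example values quoted above can be cross-checked with a few lines of arithmetic. The sketch below converts an exponential rate into its mean time and forms the aggregate rate for n concurrent processes; the function names are assumptions made for illustration, and the component count of 3 is an arbitrary example.

```python
SECONDS_PER_HOUR = 3600.0


def mean_time_hours(rate_per_hour):
    """Mean time between events for an exponential rate given per hour."""
    return 1.0 / rate_per_hour


def concurrent_rate(n, rate_per_hour):
    """Rate at which the first of n independent, concurrent processes fires."""
    return n * rate_per_hour


# Hours between permanent faults at 0.00005 per hour.
print(mean_time_hours(0.00005))
# Seconds for a transient to disappear at 4000 per hour.
print(mean_time_hours(4000.0) * SECONDS_PER_HOUR)
# Rate of the first permanent fault among 3 working components.
print(concurrent_rate(3, 0.00005))
```

The first two values agree with the figures given in the text (about 2 × 10^4 hours and about 0.9 seconds).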
The disappearance and benign rates are assumed
to describe fast transitions if they are not 0. This
is not a severe restriction because the behavior of
a transient fault with a slow disappearance rate ap-
proximates that of a permanent fault and so does the
behavior of an intermittent fault with a slow benign
rate. These are the only fast exponential transitions
that may compete with general transitions.
2.2.2. Spare Component
The spare component parameters with example
values are shown in figure 2.18. First is the name
of the component type. Second is the failure rate
factor used to indicate which type of spare this is.
Figure 2.17. Active component parameters with example values.
Figure 2.18. Spare component parameters with example values.
Figure 2.19. Component repair parameters with example values.
It is 0 for cold, in the exclusive range of 0 through 1
for warm, and 1 for hot. This factor, which is
the spare's fraction of the active component's failure
rates, defaults to 1. Third is the fraction of faults
that can be detected in a component of this type
while it is a spare. This fraction defaults to 0.
Fourth is the fault coverage of a spare component
of this type. Fifth is the recovery priority. Sixth is
the parameter that indicates whether the recovery
time of detectable faults follows an exponential or
general distribution. The next three parameters for
the recovery time are (1) the rate, (2) the conditional
mean (μ), and (3) the conditional standard deviation
(σ). Parameter 1 is for an exponential distribution,
and parameters 2 and 3 are for a general distribution
given that the transition takes place.
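The role of the failure rate factor can be illustrated with a short sketch. The function names are hypothetical; the only facts taken from the text are that a factor of 0 denotes a cold spare, a factor strictly between 0 and 1 a warm spare, a factor of 1 a hot spare, and that the factor is the spare's fraction of the active component's failure rates.

```python
def spare_kind(factor):
    """Classify a spare from its failure rate factor."""
    if factor == 0:
        return "cold"
    if factor == 1:
        return "hot"
    if 0 < factor < 1:
        return "warm"
    raise ValueError("failure rate factor must lie between 0 and 1")


def spare_failure_rate(active_rate, factor):
    """Failure rate of a component while it is a spare."""
    return factor * active_rate


for factor in (0, 0.25, 1):
    print(factor, spare_kind(factor), spare_failure_rate(0.0005, factor))
```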
The last parameter is the probability (P) that
this transition will take place if it is competing with
fast exponential transitions (whose rates add up to
λ). This parameter defaults to 0. If the specification
of the competing transitions is not consistent, SURE
will not evaluate the model. To be consistent, these
transitions must meet the following condition:
\[ P \le \frac{2}{2(1 + \lambda\mu) + \lambda^{2}(\mu^{2} + \sigma^{2})} \]
This expression was derived from the conditions
given in Butler and White (1988).
2.2.3. Component Repair
The component repair parameters with example
values are shown in figure 2.19. First is the name of
the component type. Second is the probability that
the system can survive the reintegration of this type
of component once it has been repaired. Third is
the repair and reintegration priority. The remaining
parameters specify the repair and reintegration time
distribution.
2.2.4. Subsystem Recovery
The subsystem recovery parameters with example
values are shown in figure 2.20. First is the name of
the subsystem type. Second is the fault coverage for
components in this type of subsystem. Third is the
recovery priority. The remaining parameters specify
the recovery time distribution.
Figure 2.21 illustrates the meaning of the param-
eters of active components and subsystem recoveries
using a partial Markov model of a processor dual with
m cold spares and repair. Except for states 0 and 6,
all the states have additional transitions to additional
states, none of which are shown. If the spares were
warm or hot, state 0 would also have transitions rep-
resenting the failure of the spares. Permanent, tran-
sient, or intermittent failures can take the system into
states where a faulty component actively produces er-
rors. From these states, either the system will detect
these errors and succeed or fail in reconfiguring out
the faulty component, the fault will disappear if it is
a transient, or the fault will become benign if it is
intermittent. If the faulty component is reconfigured
out, it can be repaired and the system can succeed
or fail in bringing it back into the configuration. The
following notation applies to figure 2.21:
Parameter   Description
F           fault coverage
R           repair coverage
α           intermittent active rate
β           intermittent benign rate
δ           transient disappearance rate
λ           permanent failure rate
μ           repair rate
ρ           recovery rate
τ           transient failure rate
ω           intermittent failure rate
State   Description
0       no faults; m spares
1       1 permanent fault; m spares
2       1 transient fault; m spares
3       1 active intermittent fault; m spares
4       1 benign intermittent fault; m spares
5       no faults; m - 1 spares
6       system failed
2.2.5. System Reconfiguration
The system reconfiguration parameters with ex-
ample values are shown in figure 2.22. First is the
specification name. The second parameter indicates
whether this reconfiguration is triggered by a logical
subsystem failure (default), a mission phase change,
or a component in a logical subsystem failing without
a spare.
Figure 2.20. Subsystem recovery parameters with example values.
Figure 2.21. Partial Markov model of a processor dual with m cold spares and repair.
Figure 2.22. System reconfiguration parameters with example values.
Figure 2.23. Model generation parameters with default values.
Figure 2.24. Model evaluation parameters with default values.
Third is the name of the component type whose failure
without a spare will trigger the reconfiguration.
Fourth is the probability the system can
survive the triggering event and successfully reconfig-
ure. Fifth is the priority of the mission phase change,
if any, and the consequent reconfiguration. This pri-
ority can be used to order multiple phase changes
into a sequence. Because mission phase changes can
occur at any moment, in most cases they should be
given a priority equal to or less than other transi-
tions; otherwise, lower priority transitions will have
to wait for the higher priority phase changes to oc-
cur. If the reconfiguration is reinitializing the system
because of a logical subsystem failure, it must have
a priority of 1, and it must be the only transition to
have that priority. The remaining parameters specify
the combined time distribution of the reconfiguration
and the mission phase change, if any.
2.2.6. Model Generation
The window with default values shown in fig-
ure 2.23 allows a user familiar with the ASSIST pro-
gram, which is described in subsection 1.2, to select
which, if any, state space reduction techniques are to
be used in generating the model and to specify any
associated parameters. The first parameter is the op-
tional ASSIST prune condition, which is specified as
a Boolean expression of the total number of compo-
nent failures (TNF) and/or the number of failures
for a type of component (e.g., NF(p)). For example,
the following expression would prune the model when
there were two processor failures or three component
failures of any type:
NF(p) >= 2 or TNF >= 3
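The effect of this example condition can be sketched as a predicate over per-type failure counts. This is an illustration only: ASSIST evaluates prune conditions itself, and the function name and the dictionary of failure counts are assumptions made here.

```python
def should_prune(failures):
    """Evaluate the example condition NF(p) >= 2 or TNF >= 3.

    `failures` is a hypothetical dict mapping a component type name to the
    number of failed components of that type in the current state.
    """
    nf_p = failures.get("p", 0)   # NF(p): failures of component type p
    tnf = sum(failures.values())  # TNF: total number of component failures
    return nf_p >= 2 or tnf >= 3


print(should_prune({"p": 2}))                  # True: two processor failures
print(should_prune({"p": 1, "m": 1}))          # False: neither limit reached
print(should_prune({"p": 1, "m": 1, "w": 1}))  # True: three failures in total
```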
The second parameter indicates the optional trim-
ming method to be used. The last two parameters in-
dicate whether the trim rate should be selected auto-
matically or manually and the trim rate to be used
when it is selected manually. If the trim rate is to be
selected automatically, variables cannot be used for
the arrival rates of faults.
2.2.7. Model Evaluation
The window with default values shown in fig-
ure 2.24 allows a user familiar with the SURE,
PAWS, or STEM programs, which are described in
subsection 1.2, to specify parameters used in the eval-
uation of the model. The first parameter is the mis-
sion time used for calculating the failure probability.
The next two parameters indicate whether the
SURE prune level should be selected automatically
or manually and the prune level to be used when it
is selected manually. The ASSIST pruning affects
which states are generated, whereas the SURE prun-
ing affects which of the generated states are evalu-
ated. If no SURE pruning is desired, the SURE prune
level selection should be manual, and the prune level
should be left at its default value of 0.
The fourth parameter is the maximum number
of times the SURE program will go around a loop
in the model before truncating its traversal. The
fifth parameter is the number of digits of accuracy
required. The SURE program will issue a warning
if SURE pruning and truncation result in an upper
bound on the failure probability that does not meet
this accuracy requirement.
The last two parameters are used when the failure
probability is to be calculated as a function of a
previously defined variable. In that case, the name
of the variable must be given along with its range.
The range can be specified as follows:
l to h add i
where l and h are the low and high ends of the range
and i is the increment added to vary the variable's
value over that range. The range can also be specified
as follows:
l to h by f
where f is the multiplication factor used to vary the
variable's value over the range.
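The two range forms can be read as an additive sweep and a multiplicative sweep. The sketch below enumerates the values each form would produce; the function names are hypothetical, and treating the high end as inclusive (with a small tolerance for rounding) is an assumption, since the text does not say how the end point is handled.

```python
def additive_range(low, high, increment):
    """Values for 'l to h add i': low, low + i, low + 2i, ... up to high."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    values = []
    k = 0
    # Multiply by a counter instead of accumulating to limit rounding drift.
    while low + k * increment <= high + 1e-12 * abs(high):
        values.append(low + k * increment)
        k += 1
    return values


def multiplicative_range(low, high, factor):
    """Values for 'l to h by f': low, low * f, low * f^2, ... up to high."""
    if low <= 0 or factor <= 1:
        raise ValueError("low must be positive and factor greater than 1")
    values = []
    value = low
    while value <= high * (1 + 1e-12):
        values.append(value)
        value *= factor
    return values


print(additive_range(1.0, 2.0, 0.5))
print(multiplicative_range(1, 1000, 10))
```

A multiplicative sweep is the natural choice when a rate is varied over several orders of magnitude.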
2.3. Summary and Recommendations
Although the graphs and parameters described
in the previous two subsections can be given in any
order, the number of errors in the system description
may be reduced by following the same order of steps
each time. The order suggested by the GUI pull-
down menus is recommended because it should be
easy to remember since the user sees it every time the
pull-down menus are used and it has been designed
to be natural and intuitive. The order and steps
recommended for describing a system are as follows:
1. Identify all the system components and their
interconnections in the external structure
graph.
2. If the neighbors of a component use it to
communicate, indicate so by ascribing to the
component either the fully connected inter-
nal structure typical of buses by following its
identifier in the external structure graph with
an asterisk (*) or a specific internal structure
graph.
3. If a subset of the system components, but
not all of them, depends on one or more
components in the subset, define the subset
as a subsystem by giving its hierarchy and
requirement graphs. If the subsystem defined
for the subset is part of the initial system
configuration, include it in the appropriate
system hierarchy graph or else it is part of
a future system configuration and must be
included in a system reconfiguration graph as
a destination node.
4. If physical and/or logical subsystems ex-
ist, give the system physical and/or logical
hierarchy graphs.
5. If the system can degrade to a lesser number
of logical subsystems and/or to less redundant
logical subsystems, give the system
reconfiguration graphs that describe these
degradations.
6. Give the success tree of the system and each
subsystem and performance level.
7. Use the active component parameters window
to assign at least one nonzero failure rate to
each component type.
8. If any of the system components at some
point become potential spares (components
that are not part of a logical subsystem) and
the default values do not apply to some of
them, use the spare component parameters
window to assign the applicable values.
9. If some of the system component types can
be repaired, use the component repair parameters
window to define the process.
10. Use the subsystem recovery parameters window
to define the process if there are logical
subsystems and their faulty components can
be replaced by spares.
11. If any system reconfiguration graphs have
been given, use the system reconfiguration
parameters window to specify each reconfiguration.
12. Use the model generation parameters window
to provide the applicable values if the default
values do not apply.
13. If the default values do not apply, use the
model evaluation parameters window to
provide the applicable values.
3. Automated Reliability Modeling
(ARM) Implementation
The use of fixed-size arrays in the ARM program
is limited to the storage of values or variable identifiers
for parameters whose lengths are determined
by the GUI implementation. (These lengths could be
easily changed in later versions of ARM.) All other
data are stored in dynamically allocated structures
or arrays; therefore, the size of the problems which
ARM can handle is limited only by the computer on
which it is running. Dynamic allocation also permits
ARM to more efficiently use storage by allocating
only what is needed by the current problem.
The ARM program has been implemented using
a C program with more than 8000 source lines of
which more than 1000 were automatically generated,
as described in subsection 3.1.2. The following subsections
identify the problems involved in doing so
and describe the steps taken to solve them.
3.1. Graphical User Interface
The GUI, which is based on a hierarchy of win-
dows, has two types of windows. The implementa-
tion of the graphical editing windows is described in
subsection 3.1.1. The implementation of the win-
dows used to specify parameters and select actions is
described in subsection 3.1.2.
3.1.1. Graphical Editing Windows
Graphical editing has been implemented using
the schematic drawing editor Schem (Vlissides 1990).
This editor provides three windows with which to
create, view, and edit graphs.
The editor window, shown in figure 3.1, provides
arrows for selecting what part of the graph is being
viewed and at what scale. This window also provides
five pull-down menus. The File menu can be used for
reading and saving files; creating, adding, or removing
tools for the creation of graphical components
called elements; and creating a netlist, a textual
representation of a graph that identifies its elements
and their interconnections. The Edit menu can be
used to copy, paste, or delete graphical components
and to assign the graph a name that identifies it in
the netlist (by using the command Info). The Struc-
ture menu can be used to group components and then
copy or move them as a single entity. The Align menu
can be used to center graphical components or align
them in relation to one another. The View menu can
be used to display the drawing tools window and to
center the whole graph. The two other windows are
displayed automatically when Schem is executed.
The tools window, shown in figure 3.2, provides
commands for selecting and moving graphical com-
ponents, connecting the nodes of graphical elements,
and assigning to the elements and their nodes the
names that identify them in the netlist. This window
also displays the tools currently available for creating
graphical elements. The node, wire, bulb, and switch
tools (the top four shown in the window) are the only
ones provided by Schem. All other elements have to
be created by the Schem user. Figure 3.2 shows 7 of
the 28 tools created for the ARM user.
The drawing tools window, shown in figure 3.3,
provides five pull-down menus, eight drawing tools,
and four commands. The pull-down menus can be
used to select the current font, brush, pattern, and
foreground and background colors.
Figure 3.1. Graphical editor window.
Figure 3.2. Tools window.
Figure 3.3. Drawing tools window.
The drawing tools can be used to create a graphical component
that can be text or a line, rectangle, or circle. The
commands can be used to scale, stretch, rotate, or
reshape a graphical component.
The netlist file of each graph must be generated
for ARM to process the graph. These files must be
of the net type. Their file names must be composed
of an identifier, an underscore (_), and a suffix. The
identifier must correspond to the name of the system,
subsystem, component(s), or performance level being
described, except that underscores are substituted
for parentheses and periods. The suffix describes
the class of graph in the file. The suffixes that
must be used for each class of graph are shown in
table 3.1. For example, the file specifications for
the hierarchies of logical subsystem class T(x) and
physical subsystem CR_A.1 are T_x_logical.net and
CR_A_1_physical.net.
Table 3.1. File Name Suffixes for Each Class of Graph
Graph class               File name suffix
External structure        External
Internal structure        Internal
Physical hierarchy        Physical
Logical hierarchy         Logical
System reconfiguration    Reorder
Requirement               Require
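The naming convention for netlist files can be sketched as a small helper. Several details here are assumptions rather than facts from the text: the function and dictionary names are hypothetical, the suffixes are written in lowercase as in the two example file names, the name CR_A.1 is read as ending in the digit 1, and a trailing underscore left by a closing parenthesis is dropped because the T(x) example implies it.

```python
import re

# Suffixes from table 3.1, in lowercase as in the example file names.
GRAPH_SUFFIX = {
    "external structure": "external",
    "internal structure": "internal",
    "physical hierarchy": "physical",
    "logical hierarchy": "logical",
    "system reconfiguration": "reorder",
    "requirement": "require",
}


def netlist_file_name(name, graph_class):
    """Build a netlist file name from a described name and a graph class."""
    # Parentheses and periods are replaced by underscores.
    stem = re.sub(r"[().]", "_", name)
    # Assumption: drop the underscore produced by a final closing parenthesis.
    stem = stem.rstrip("_")
    return f"{stem}_{GRAPH_SUFFIX[graph_class]}.net"


print(netlist_file_name("T(x)", "logical hierarchy"))
print(netlist_file_name("CR_A.1", "physical hierarchy"))
```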
The names that identify the graph and its ele-
ments in the netlist must be exactly the same as
the name of the system, subsystem, component(s),
or performance level being described. The only ex-
ception to this requirement is that logical gate ele-
ments and the root elements (identified by a circle)
of tree graphs must retain their original netlist names
of and, or, plus root. The names that identify graph
element nodes in the netlist file must be composed
of a function identifier and a tag. The function iden-
tifier has to be input, output, or bidirectional. For
internal structure graphs, the tag is composed of an
underscore followed by the name of the element that
the node is connected to in the external structure
graph. For system reconfiguration graphs, the tag is
composed of an underscore followed by the name of
the specification.
For all other graphs, a tag is needed only if
an element has more than one node with the same
function. If present, this tag must be composed of an
underscore followed by an identifier that is unique
for nodes with the same function and in the same
element. For example, two nodes in the same element
could be named input_1 and input_2.
3.1.2. Parameter Specification and Action
Selection Windows
The windows for entering parameters and select-
ing actions have been implemented using the TAE
Plus user interface development tool for building
X window-based applications (Szczur 1990). These
windows were defined using the TAE Plus work-
bench. The workbench generates the more than 1000
source lines that display these windows, including an
event handler for each item in a window. Function
calls have been added to each event handler to check
the validity of the input and to store it if so indicated
by the user. The inputs from each window are stored
in a separate file.
These parameter files are of the par type. Their
file names are composed of an identifier, an under-
score (_), and a suffix. The identifier corresponds
to the name of the system, subsystem type, compo-
nent type, or reconfiguration specification being de-
scribed, except that underscores are substituted for
parentheses and periods. The suffix describes the
class of parameters in the file. The suffixes that
are used for each class of parameters are shown in
table 3.2.
Table 3.2. File Name Suffixes for Each Class of Parameter
Parameter class           File name suffix
Active component          Active
Spare component           Spare
Component repair          Repair
Subsystem recovery        Recovery
System reconfiguration    Reorder
Model generation          Generate
Model evaluation          Evaluate
3.2. Reading and Processing the System
Description
When indicated by the user, ARM uses the cur-
rent system name to read and process the system de-
scription files and check them for completeness and
consistency. If at any point these files are found to be
incomplete or inconsistent, the user is notified, and
the reading and processing is aborted. The ARM
program takes the following steps to read, process,
and check the system description files:
1. Reads the external structure
2. Reads the component parameters
3. Determines the internal structure of the
components
4. Detects symmetries in the external structure
5. Reads the system physical hierarchy
6. Determines the hierarchy of physical
subsystems
7. Reads the system logical hierarchy
8. Determines the hierarchy of logical
subsystems in the initial configuration
9. Reads the reconfiguration graphs and
parameters
10. Determines the hierarchy of logical
subsystems not in the initial configuration
11. Reads the logical subsystem parameters
12. Reads the requirements
13. Reads the modeling parameters
The program identifies the system component
types in the external structure in step 2 and reads
their active parameter files. It also reads the
spare and repair parameter files if they are present;
otherwise, it uses default values.
In step 3, the program reads the internal structure
files of each vertex in the external structure if they
are present; otherwise, the default internal structure
indicated in the external structure is used. It also
records any numbers assigned to specific components
in the external structure.
The program divides each component type into
classes in step 4; these classes are equivalent in
terms of their connections in the external structure,
as described in subsection 3.2.1. The program also
assigns numbers to those components that did not
have any in the external structure.
In step 5, the program reads the system physical
hierarchy file if it is present; if so, it then identifies the
physical subsystems, their types, and their classes.
In step 6, the program determines the hierarchy
of any physical subsystems as described in subsection
3.2.2. It then assigns specific components to
the physical subsystems where necessary. Based on
the components assigned, the program then identifies
the component equivalence classes in each physical
subsystem.
In step 7, the program reads the system logical
hierarchy file if it is present; if so, it then identifies the
logical subsystems in the initial configuration, their
types, and their classes.
Step 8 is the same as step 6 except for the logical
subsystems in the initial configuration.
In step 9, the program reads the system reconfiguration
file if it is present; if so, it then finds a reconfiguration
graph whose source vertex represents a
previously identified component or subsystem, stores
the reconfiguration information, and identifies any
new logical subsystem type and class in the destination
vertex. This step assumes that the new subsystem
type can contain any component equivalence
classes in the old subsystem type. When there are
no more reconfiguration graphs to process, it reads
the parameter files that specify the reconfigurations.
In step 10, the program reads the hierarchy file
of any logical subsystem class that is not in the initial
configuration. It then determines the hierarchy
of any logical subsystem that is not in the initial configuration
from the hierarchy of its subsystem class
and the component type arguments, if any, of its
subsystem type.
The program, in step 11, reads the recovery parameter
files of the logical subsystem types if present;
otherwise, it uses default values.
In step 12, the program reads the system requirements
file. If there are any subsystem classes or performance
levels whose requirements are not defined in
the system success tree, it reads their requirements
files. From the requirements of its subsystem class
and the component type arguments, if any, of its subsystem
type, the program then determines the conditions
under which each subsystem will fail and any
component dependencies that exist.
In step 13, the program reads the model generation
and evaluation parameter files if present;
otherwise, default values are used.
3.2.1. Detection of Symmetry in the External
Structure Graph
Each component type is divided into equivalence
classes because ARM assumes that when a com-
ponent in a logical subsystem fails, it can only be
replaced by an equivalent component. Substruc-
tures in the external structure graph G are consid-
ered symmetric if they are isomorphic and the cor-
responding vertices of the two graphs have identical
component-type labels. Symmetrical substructures
are assumed to be identical in function and reliability.
Figure 3.4. External structure with symmetry.
Subsection A1 shows the algorithm used by
ARM to detect symmetries in the external struc-
ture graph G. This algorithm has been derived from
the one used in ADVISER (Kini and Siewiorek 1982)
for nondirected graphs with a single component per
vertex because the ADVISER algorithm is mature,
well documented, and simple. However, the ARM
algorithm applies to directed graphs that can have
multiple components per vertex. This algorithm is
based on the component-type labels and the degree
of the vertices in the graph. The degree of a vertex is
the number of neighbor vertices which it has of each
type. Two vertices are neighbors if they are inter-
connected. Neighbors can be of the input, output, or
bidirectional types.
The ARM algorithm requires three steps to parti-
tion the vertex set of a labeled graph into equivalence
classes whose vertices are symmetrical. In the first
step, the partition is based on the component-type la-
bel of each vertex. For the second step, the partition
is based on the degree of each vertex. In the third
step, partitioning is attempted based on the number
of neighbor types each vertex has in each equivalence
class.
The last step must be repeated until there are no
more changes in the equivalence classes. The reason
for this is that each partition changes the number of
neighbors in each equivalence class; therefore, other
partitions may become necessary. In the worst case,
this repetition will stop when each equivalence class
has a single element.
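The three steps above can be illustrated with a short Python sketch. It is not the algorithm of subsection A1; it is a minimal partition refinement written under several assumptions: each vertex holds a single component, a bidirectional link is given as two directed edges, and the degree of a vertex is taken to be the triple of its input-only, output-only, and bidirectional neighbor counts. The function name and the small example graph are hypothetical.

```python
from collections import defaultdict

def symmetry_classes(labels, edges):
    """labels: vertex -> component-type label; edges: directed (src, dst) pairs.
    Returns vertex -> class id for the refined partition."""
    out_nb, in_nb = defaultdict(set), defaultdict(set)
    for u, v in edges:
        out_nb[u].add(v)
        in_nb[v].add(u)

    def kinds(v):
        # Split the neighbors of v into input-only, output-only, bidirectional.
        both = in_nb[v] & out_nb[v]
        return {"in": in_nb[v] - both, "out": out_nb[v] - both, "bi": both}

    def renumber(keys):
        # Give every distinct key a small integer class id.
        ids = {k: n for n, k in enumerate(sorted(set(keys.values()), key=repr))}
        return {v: ids[k] for v, k in keys.items()}

    # Step 1: partition by component-type label.
    cls = renumber({v: (lab,) for v, lab in labels.items()})
    # Step 2: split each class by degree (neighbor count of each kind).
    cls = renumber({v: (cls[v], tuple(len(s) for s in kinds(v).values()))
                    for v in labels})
    # Step 3: split by neighbor counts per class; repeat until nothing changes.
    while True:
        keys = {}
        for v in labels:
            counts = []
            for kind, nbrs in kinds(v).items():
                per_class = defaultdict(int)
                for w in nbrs:
                    per_class[cls[w]] += 1
                counts.append((kind, tuple(sorted(per_class.items()))))
            keys[v] = (cls[v], tuple(counts))
        new_cls = renumber(keys)
        if len(set(new_cls.values())) == len(set(cls.values())):
            return new_cls
        cls = new_cls

# Hypothetical example: p1 serves two memories, p2 serves one.
labels = {"p1": "p", "p2": "p", "m1": "m", "m2": "m", "m3": "m"}
links = [("p1", "m1"), ("p1", "m2"), ("p2", "m3")]
edges = links + [(v, u) for u, v in links]   # every link is bidirectional
classes = symmetry_classes(labels, edges)
```

In this example the second step separates p1 from p2 by degree, and the first pass of the third step separates m3 from m1 and m2 because its only neighbor now lies in a different class. Because each pass can only split classes, an unchanged class count is enough to detect that the partition is stable.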
In the first step, the example external structure
in figure 3.4 is divided into five equivalence classes--
one for each component type. In the second step,
components of the ne type are split into two equivalence
classes--one of degree 5 and another of degree 6.
The first time the third step is taken, the program
splits the pe and the lop component types into two
equivalence classes each--one connected to the ne
equivalence class of degree 5 and the other connected
to the ne equivalence class of degree 6. The second
time that the third step is taken, the eight
equivalence classes are left unchanged.
Each class is related to other classes in a connectivity
sense because the vertices in the class are
symmetrically connected to the vertices in other
classes. These equivalence classes and their connectivity
relationships may be viewed as defining
another graph G'. The vertices of G' correspond
uniquely to the equivalence classes in G. Unlike the
basic directed graph without self-loops, which was
taken to be the model for G, G' may have vertices
that have self-loops. A self-loop occurs when the vertices
in the same equivalence class are connected to
each other in some symmetric fashion, thus making
the equivalence class its own neighbor. Also, the
number of links or connection density between two
vertices of G' can be greater than 1. This would be
the result of a case in which multiple vertices in the
same equivalence class are connected to one or more
vertices in another equivalence class.
Figure 3.5. Equivalence class graph.
Figure 3.5 shows the equivalence class graph corresponding
to the external structure in figure 3.4.
Vertex 2ne[1] corresponds to equivalence class 1 of
component type ne, which has two elements in that
class. The 2/1 on the edge between vertex 2n[1] and
vertex 4s[1] indicates that each element of equivalence
class 2n[1] is connected to two elements of
equivalence class 4s[1] and each element of equivalence
class 4s[1] is connected to one element of
equivalence class 2n[1].
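The edge labels of the equivalence class graph can be computed directly from a symmetric partition. The following Python sketch is only an illustration under stated assumptions: links are treated as undirected and distinct, the function name class_graph is hypothetical, and the six-vertex example merely reproduces the 2/1 relationship described above rather than the full structure of figure 3.4.

```python
from collections import defaultdict

def class_graph(cls, links):
    """cls: vertex -> class name; links: distinct undirected (u, v) pairs.
    Returns {(A, B): n}, where n is the number of class-B neighbors that
    every member of class A has."""
    counts = {v: defaultdict(int) for v in cls}
    for u, v in links:
        counts[u][cls[v]] += 1
        counts[v][cls[u]] += 1
    members = defaultdict(list)
    for v, a in cls.items():
        members[a].append(v)
    density = {}
    for a, vs in members.items():
        first = dict(counts[vs[0]])
        # In a symmetric partition all members of a class see the same counts.
        if any(dict(counts[v]) != first for v in vs[1:]):
            raise ValueError(f"class {a} is not symmetric")
        for b, n in first.items():
            density[(a, b)] = n
    return density

cls = {"n1": "2n[1]", "n2": "2n[1]",
       "s1": "4s[1]", "s2": "4s[1]", "s3": "4s[1]", "s4": "4s[1]"}
links = [("n1", "s1"), ("n1", "s2"), ("n2", "s3"), ("n2", "s4")]
density = class_graph(cls, links)
# A link inside one class (added here only for illustration) yields a self-loop.
with_loop = class_graph(cls, links + [("n1", "n2")])
```

Here density[("2n[1]", "4s[1]")] and density[("4s[1]", "2n[1]")] together correspond to the 2/1 edge label, and the extra link in with_loop makes class 2n[1] its own neighbor.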
3.2.2. Determining the Subsystem Hierarchies
The hierarchies of any logical subsystems in the
initial configuration and of any physical subsystems
are determined using the algorithm in subsection A2.
This algorithm tries to obtain the hierarchy of each
subsystem class from the system hierarchy or a separate
file. For those subsystem classes whose hierarchy
it cannot obtain, the algorithm goes through
each subsystem in the class and reads its individual
subsystem hierarchy file that assigns specific components
to the subsystem. For those subsystem classes
whose hierarchy it did obtain, this algorithm then
goes through each subsystem in the class and de-
termines its hierarchy based on the hierarchy of its
subsystem class and the component-type arguments,
if any, of its subsystem type.
3.3. Specifying the System Reliability
Model
If the system description files are complete and
consistent, ARM then specifies the system reliability
model in the ASSIST language. A reliability model
specification in the ASSIST language must define the
following:
1. Any constants
2. State space variables
3. Start state
4. Any functions
5. Death conditions
6. Any pruning conditions
7. State transitions
The following subsections describe the steps that
ARM takes to make these definitions.
3.3.1. Constants
The ARM parameters that were given numeric
values will be defined as ASSIST constants. An
ASSIST input statement will be specified if a variable
identifier is given for the trim rate because ASSIST
needs that value to generate the model. The SURE
input statements will be specified for any other ARM
parameters that were given variable identifiers with-
out a range. If one of the variable identifiers was
given a range, a SURE variable definition statement
will be specified for it. Only one input or variable
definition statement will be specified for each vari-
able identifier even if it is used for more than one
ARM parameter.
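The rules in the preceding paragraph amount to a small decision table, sketched below in Python. The sketch only chooses which kind of statement a parameter needs; it does not emit ASSIST or SURE syntax. The parameter names, the identifiers TR, LAMBDA, and MU, and the convention of writing a ranged identifier as a tuple are all assumptions made for illustration.

```python
def statement_kind(name, value):
    """value is a number, an identifier string, or (identifier, low, high)."""
    if isinstance(value, (int, float)):
        return "ASSIST constant"
    if name == "trim rate":
        return "ASSIST input"             # ASSIST needs this value itself
    if isinstance(value, tuple):
        return "SURE variable definition"  # identifier given with a range
    return "SURE input"                    # identifier given without a range

def plan_statements(params):
    """Return (kind, name-or-identifier) pairs, one per variable identifier."""
    seen, plan = set(), []
    for name, value in params.items():
        kind = statement_kind(name, value)
        if kind == "ASSIST constant":
            plan.append((kind, name))
            continue
        ident = value[0] if isinstance(value, tuple) else value
        if ident in seen:
            continue                       # identifier already has a statement
        seen.add(ident)
        plan.append((kind, ident))
    return plan

params = {
    "trim rate": "TR",
    "PFR_1": 1e-4,
    "PFR_2": "LAMBDA",
    "TDR_1": "LAMBDA",                     # same identifier reused
    "CRR_1": ("MU", 1.0, 10.0),
}
plan = plan_statements(params)
```

The identifier LAMBDA is used for two parameters but appears only once in the plan, which mirrors the rule that each variable identifier receives a single input or variable definition statement.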
The ARM parameters will not be placed in
ASSIST constant arrays because if a variable iden-
tifier is given for one of the parameters, then one of
the array elements would be undefined. Although
the undefined array element could be given some
initial value like 0 and then redefined when SURE
evaluates the model, each redefinition would cause a
SURE warning message. Longer model specification
files are a consequence of not using arrays for con-
stants, but longer files are justified to avoid these
warnings and to provide the convenience of using
variable identifiers.
In addition to the ARM parameters, scalar and
array constants used in later definitions have to
be specified. These constants define the system
components and the logical subsystems.
The system component constants and their defi-
nitions are as follows:
NCT: Number of component types
LCTS: Largest component-type size
NEC: Number of equivalence classes
LECS: Largest equivalence class size
CT: Component type of each equivalence class
The size of a component type or equivalence class
is the number of components in the type or class.
All of the above are integer scalars except CT, which
is an integer array indexed by the equivalence class
number.
The logical subsystem constants and their defini-
tions are as follows:
NLT: Number of logical subsystem types
LLTS: Largest logical subsystem-type size
NL: Number of logical subsystems
LNEL: Largest number of equivalence classes in
a logical subsystem
LESL: Largest number of equivalent components
in a logical subsystem
LT: Type of each logical subsystem
NEL: Number of equivalence classes in each logi-
cal subsystem
EC: Equivalence class number of a subset of
equivalent components in a logical subsystem
The size of a logical subsystem type is the number of
subsystems in the type. All of the above are integer
scalars except LT, NEL, and EC. The constants LT
and NEL are integer arrays indexed by the logical
subsystem number. The constant EC is an integer
array whose first index is the logical subsystem num-
ber and whose second index is the equivalent subset
number within the subsystem.
All the parameters involved in a potentially gen-
eral transition are defined as constants except the
priority that is used by ARM in specifying these tran-
sitions as described in subsection 3.3.5. The order in
which all of the constants are defined is as follows:
1. Trimming model generation parameters
2. Model evaluation parameters
3. System component constants
4. Active component parameters
5. Spare component parameters
6. Component repair parameters
7. Logical subsystem recovery parameters
8. Logical subsystem constants
9. Reconfiguration parameters
3.3.2. State Space Variables and the Start State
The system components will be divided into those
that belong to a logical subsystem and those that
do not. Although those that do not will be referred
to as spares, because they often are, they are not
necessarily spares. The components of a logical
subsystem will be divided into subsets of components
in the same equivalence class.
The state space variables and their definitions are
as follows:
CF: Coverage failure
LTC: Logical subsystem-type count
LO: Logical subsystem operational
C: Components in a logical subsystem subset
P: Permanent failures in a logical subsystem subset
T: Transient failures in a logical subsystem subset
A: Active intermittent failures in a logical subsystem subset
B: Benign intermittent failures in a logical subsystem subset
NF: Number of failure transitions per component type
NR: Number of components reconfigured out per equivalence class
WS: Working equivalent spares
PS: Permanent failures in equivalent spares
TS: Transient failures in equivalent spares
AS: Active intermittent failures in equivalent spares
BS: Benign intermittent failures in equivalent spares
Except for CF, which is a Boolean scalar, all others
are integer arrays. The LTC variable is indexed by
the logical subsystem type number, and it indicates
the number of operational logical subsystems for each
type. The LO variable is indexed by the logical
subsystem number. The first index of C, P, T, A,
and B is the logical subsystem number, and the
second index is the equivalent subset number within
the subsystem. The NF variable is indexed by the
component type number. The variables NR, WS,
PS, TS, AS, and BS are indexed by the equivalence
class number.
Table 3.3. State Space Variable Functions
Function                                        Definition
NB(i,j) = P[i,j] + T[i,j] + A[i,j]              Number of nonbenign failures in subset
WB(i,j) = C[i,j] - NB(i,j)                      Number of working or benign components in subset
W(i,j) = WB(i,j) - B[i,j]                       Number of working components in subset
PL(i) = sum(P[i,*])                             Number of permanent failures in logical subsystem
TL(i) = sum(T[i,*])                             Number of transient failures in logical subsystem
AL(i) = sum(A[i,*])                             Number of active failures in logical subsystem
NBL(i) = PL(i) + TL(i) + AL(i)                  Number of nonbenign failures in logical subsystem
TNF = sum(NF)                                   Total number of failure transitions
NS(k) = WS[k] + PS[k] + TS[k] + AS[k] + BS[k]   Number of equivalent spares
NBS(k) = PS[k] + TS[k] + AS[k]                  Number of nonbenign failures in equivalent spares
FCE = sum(T, A, B, TS, AS, BS)                  Number of fast exponentials competing with any general transitions
The ARM program assumes that the only log-
ical subsystems that are operational initially are
those present in the system logical hierarchy; it also
assumes a start state where no failures have yet
occurred.
3.3.3. Functions and Final State Conditions
The state space variable functions and their defi-
nitions are given in table 3.3. The ASSIST function
sum adds all the elements in one or more dimensions
of one or more arrays.
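As an aid to reading the later transition statements, the state space variable functions can be written as ordinary Python functions over a state dictionary. This is a sketch, not ARM output: it assumes that nonbenign failures are the permanent, transient, and active intermittent failures (P, T, and A), that arrays are stored as dictionaries keyed by the ASSIST indices, and that the example state is hypothetical.

```python
def NB(s, i, j):
    """Nonbenign failures in subset j of logical subsystem i."""
    return s["P"][i, j] + s["T"][i, j] + s["A"][i, j]

def WB(s, i, j):
    """Working or benign components in the subset."""
    return s["C"][i, j] - NB(s, i, j)

def W(s, i, j):
    """Working components in the subset."""
    return WB(s, i, j) - s["B"][i, j]

def NBL(s, i):
    """Nonbenign failures over all subsets of logical subsystem i."""
    return sum(NB(s, a, b) for (a, b) in s["C"] if a == i)

def NS(s, k):
    """Equivalent spares of class k in any condition."""
    return sum(s[name][k] for name in ("WS", "PS", "TS", "AS", "BS"))

def NBS(s, k):
    """Nonbenign failures among the equivalent spares of class k."""
    return s["PS"][k] + s["TS"][k] + s["AS"][k]

# Hypothetical state: subsystem 2 has one subset of three components,
# one with a permanent fault and one with a benign intermittent fault.
state = {
    "C": {(2, 1): 3}, "P": {(2, 1): 1}, "T": {(2, 1): 0},
    "A": {(2, 1): 0}, "B": {(2, 1): 1},
    "WS": {1: 2}, "PS": {1: 0}, "TS": {1: 1}, "AS": {1: 0}, "BS": {1: 0},
}
subsystem_failed = WB(state, 2, 1) < 2   # a condition of the form WB(2,1) < 2
```

With this state, two components are still working or benign, so a condition of the form WB(2,1) < 2 is not yet satisfied.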
If a pruning condition was given, it is imple-
mented using the state variable NF and the function
TNF. For example, the pruning condition
NF(p) >= 2 or TNF >= 3
can be implemented by the following statement:
pruneif NF[1] >= 2 or TNF >= 3;
The Boolean expression represented by the sys-
tem requirements success tree is logically negated
and used as a death state condition. For example,
the MP system requirements can be implemented by
the following statement:
deathif not ((LTC[1] >= 2 and LTC[2] >= 1
and LTC[3] >= 1) or (LTC[1] >= 1
and LTC[2] >= 1 and LTC[3] >= 1));
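The negation of a success tree can be sketched in a few lines of Python. The nested-tuple representation of the tree and the function names below are assumptions made for illustration; only the form of the resulting deathif statement is taken from the example above.

```python
def expr(node):
    """Render a success tree as a Boolean expression string.
    A leaf is a string; an inner node is (operator, child, child, ...)."""
    if isinstance(node, str):
        return node
    op, *children = node
    return "(" + f" {op} ".join(expr(c) for c in children) + ")"

def death_statement(success_tree):
    """The system fails exactly when the success expression is false."""
    return f"deathif not {expr(success_tree)};"

mp_requirements = (
    "or",
    ("and", "LTC[1] >= 2", "LTC[2] >= 1", "LTC[3] >= 1"),
    ("and", "LTC[1] >= 1", "LTC[2] >= 1", "LTC[3] >= 1"),
)
statement = death_statement(mp_requirements)
```

Keeping the success tree intact and wrapping it in a single not avoids having to push the negation through every and/or node.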
The conditions under which a logical subsystem
fails are used to trigger a restart reconfiguration, if
there is one; otherwise, they are used as death state
conditions. For example, the requirements of the
logical subsystem T(m) can be implemented by the
following statement:
deathif WB(2,1) < 2;
Other death state conditions include situations
in which the components in the system requirements
success tree can no longer communicate. The condi-
tion for being in the death state that corresponds to
a coverage failure is for CF to be true.
3.3.4. Failure Transitions
The ARM program analyzes the system descrip-
tion to specify the failure transitions. For each failure
transition, the following must be specified: the con-
ditions under which the transition could take place,
its destination state, and its transition rate. Fail-
ure transitions are only specified for those fault types
with nonzero rates.
Several options exist for specifying failure tran-
sitions for components that depend on a component
that has a soft fault:
1. The components could be put in the same
state as the component they depend on; how-
ever, they could become benign at their own
rate even before the component they depend
on.
2. They could be declared to have a permanent
fault to avoid the problem described in op-
tion 1. To implement option 2, these com-
ponents would have to be (a) tracked to de-
termine if they were reconfigured out so they
could be declared to be working when the
component they depend on becomes benign
or (b) left failed.
The ARM program implements option 2(b) because
it is conservative and it avoids the inconsistency
problems with option 1 and the implementation dif-
ficulties with option 2(a).
3.3.4.1. Logical subsystem component failures.
The condition for fault arrival in a logical subsys-
tem is that working components exist. The destina-
tion state is one in which the number of failures in
the subsystem and in the component's type (NF) has
been increased by 1. Also any components that de-
pended on the failed component are marked as failed.
The fault arrival rate is obtained by multiplying the
number of working equivalent components in the sub-
system by their failure rate. For example, the arrival
of a permanent fault in equivalent subset 1 of log-
ical subsystem 2 can be described by the following
statement:
if W(2,1) > 0 tranto NF[CT[EC[2,1]]]++,
P[2,1]++ by W(2,1) * PFR_^CT[EC[2,1]];
where PFR stands for the permanent failure rate. A
caret (^) is used to concatenate a string and a value
to form a previously defined identifier.
The condition for the disappearance of a transient
fault exists when there are components with tran-
sient faults. The destination state is one in which the
number of transient failures in the subsystem (T) has
been decreased by 1. The transition rate is obtained
by multiplying the number of equivalent components
with transient faults in the subsystem (T) by their
disappearance rate. For example, the disappearance
of a transient fault in equivalent subset 1 of logi-
cal subsystem 2 can be described by the following
statement:
if T[2,1] > 0 tranto T[2,1]--
by fast T[2,1] * TDR_^CT[EC[2,1]];
where TDR stands for the transient disappearance
rate.
The condition for an intermittent fault to go from
active to benign is that components with active in-
termittent faults exist. The destination state is one
in which the number of benign components (B) has
been increased by 1, and the number of compo-
nents with active intermittent faults (A) has been de-
creased by 1. The transition rate is obtained by mul-
tiplying the number of equivalent components with
active intermittent faults in the subsystem (A) by
their benign rate. For example, an intermittent fault
that goes from active to benign in equivalent sub-
set 1 of logical subsystem 2 can be described by the
following statement:
if A[2,1] > 0 tranto B[2,1]++, A[2,1]--
by fast A[2,1] * IBR_^CT[EC[2,1]];
where IBR stands for the intermittent benign rate.
The condition for an intermittent fault to go from
benign to active is that components with benign
intermittent faults exist. The destination state is one
in which the number of components with active in-
termittent faults (A) has been increased by 1 and
the number of components with benign intermittent
faults (B) has been decreased by 1. The transition
rate is obtained by multiplying the number of equivalent
components with benign intermittent faults in
the subsystem (B) by their active rate. For example,
an intermittent fault going from benign to active
in equivalent subset 1 of logical subsystem 2 can be
described by the following statement:
if B[2,1] > 0 tranto A[2,1]++, B[2,1]--
by B[2,1] * IAR_^CT[EC[2,1]];
where IAR stands for the intermittent active rate.
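The four statements above follow one pattern: a guard on a count, a rate equal to that count times a per-component rate, and a small change to the counts. The Python sketch below restates that pattern for a single equivalent subset. It is an illustration under assumptions: the Subset class and function name are hypothetical, only the permanent fault arrival is shown among the arrivals, and the NF counter and dependent components are left out.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Subset:
    C: int        # components in the subset
    P: int = 0    # permanent failures
    T: int = 0    # transient failures
    A: int = 0    # active intermittent failures
    B: int = 0    # benign intermittent failures

    @property
    def W(self):
        """Working components, as given by the function W(i,j)."""
        return self.C - self.P - self.T - self.A - self.B

def subset_transitions(s, pfr, tdr, ibr, iar):
    """Return (description, rate, destination state) for each enabled move."""
    moves = []
    if s.W > 0:
        moves.append(("permanent fault arrival", s.W * pfr,
                      replace(s, P=s.P + 1)))
    if s.T > 0:
        moves.append(("transient disappearance", s.T * tdr,
                      replace(s, T=s.T - 1)))
    if s.A > 0:
        moves.append(("active to benign", s.A * ibr,
                      replace(s, A=s.A - 1, B=s.B + 1)))
    if s.B > 0:
        moves.append(("benign to active", s.B * iar,
                      replace(s, A=s.A + 1, B=s.B - 1)))
    return moves

# Hypothetical rates; three components, one with a transient fault.
moves = subset_transitions(Subset(C=3, T=1), pfr=1e-4, tdr=10.0, ibr=5.0, iar=0.5)
```

Only the first two moves are enabled for this state, because no component has an intermittent fault.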
3.3.4.2. Spare failures. Spare failure transitions
are specified only for those component types in which
the failure rates factor is nonzero. The condition for
fault arrival is that working spares exist. The desti-
nation state is one in which the number of working
spares (WS) has been decreased by 1, both the num-
ber of failed spares and the number of failed compo-
nents of the spare's type (NF) have been increased
by 1, and any components that depended on the
failed component are marked as failed. The fault
arrival rate is obtained by multiplying the number of
working spares in the equivalence class (WS) by their
failure rate and failure rates factor. For example, the
arrival of a permanent fault in a spare in equivalence
class 1 can be described by the following statement:
if WS[1] > 0
tranto WS[1]--, NF[CT[1]]++, PS[1]++
by WS[1] * PFR_^CT[1] * FRF_^CT[1];
where FRF stands for the failure rates factor.
The condition for transient disappearance is that
spares with transient faults exist. The destination
state is one in which the number of working spares
(WS) has been increased by 1 and the number of
failed spares has been decreased by 1. The transition
rate is obtained by multiplying the number of spares
with transient faults in the equivalence class (TS) by
their disappearance rate and failure rates factor. For
example, the disappearance of a transient fault in a
spare in equivalence class 1 can be described by the
following statement:
if TS[1] > 0 tranto WS[1]++, TS[1]--
by fast TS[1] * TDR_^CT[1] * FRF_^CT[1];
The condition for an intermittent fault to go from
active to benign exists when there are spares with
active intermittent faults. The destination state is
one in which the number of benign spares (BS) has
been increased by 1, and the number of spares with
active intermittent faults (AS) has been decreased
by 1. The transition rate is obtained by multiplying
the number of spares with active intermittent faults
in the equivalence class (AS) by their benign rate
and failure rates factor. For example, an intermittent
fault going from active to benign in a spare in
equivalence class 1 can be described by the following
statement:
if AS[1] > 0 tranto BS[1]++, AS[1]--
by fast AS[1] * IBR_^CT[1] * FRF_^CT[1];
The condition for an intermittent fault to go
from benign to active is that spares with benign
intermittent faults exist. The destination state is
one in which AS has been increased by 1, and BS
has been decreased by 1. The transition rate is
obtained by multiplying the number of spares with
benign intermittent faults in the equivalence class
(BS) by their active rate and failure rates factor.
For example, an intermittent fault going from benign
to active in a spare in equivalence class 1 can be
described by the following statement:
if BS[1] > 0 tranto AS[1]++, BS[1]--
by BS[1] * IAR_^CT[1] * FRF_^CT[1];
3.3.4.3. Dependents in logical subsystems. If each
component of type A depends on one component of
type B and some components of type A can be in one
or more logical subsystems, then multiple transitions
are specified for the failure of a component of type B.
One transition is for the case in which the dependent
component is a spare, and the other transitions are
for each of the logical subsystems containing a com-
ponent of type A. Therefore, if the dependent compo-
nent can be in n logical subsystems, n + 1 transitions
are generated. If there are two dependent compo-
nents that can be in n and m logical subsystems,
then (n + 1)(m + 1) transitions are generated and so
on.
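The transition count grows as a product because every dependent component independently can be a spare or in any one of its possible logical subsystems. A brief Python sketch, using hypothetical subsystem numbers and a hypothetical function name, enumerates these cases:

```python
from itertools import product

def dependent_cases(homes):
    """homes[d] lists the logical subsystems that dependent d can belong to.
    Each case assigns every dependent either 'spare' or one subsystem."""
    options = [["spare"] + [f"L{n}" for n in subsystems] for subsystems in homes]
    return list(product(*options))

# Two dependents: the first can be in subsystems 1 or 2 (n = 2),
# the second only in subsystem 3 (m = 1).
cases = dependent_cases([[1, 2], [3]])
```

For n = 2 and m = 1 this gives (2 + 1)(1 + 1) = 6 cases, one transition per case.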
In addition to the conditions mentioned in sub-
sections 3.3.4.1 and 3.3.4.2, these multiple transitions
are conditioned on the dependent components being
operational as a spare or in a logical subsystem, as
the case may be. The fault arrival rates mentioned
in subsections 3.3.4.1 and 3.3.4.2 are multiplied by
the probability of each dependent component being
in a particular logical subsystem or being a spare.
For example, if a component in subset 1 of logical
subsystem 2 depends on a component in equivalence
class 3, the arrival of a permanent fault in a spare in
equivalence class 3 can be described by the following
statement:
if WS[3] > 0 & W(2,1) > 0 tranto WS[3]--,
PS[3]++, NF[CT[3]]++, P[2,1]++
by WS[3] * PFR_^CT[3] * FRF_^CT[3]
* W(2,1) / (WS[EC[2,1]] + W(2,1));
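The last factor of this rate is the probability that the dependent component is one of the working components of subset 1 of logical subsystem 2 rather than one of the working spares of the same equivalence class. The Python sketch below evaluates the rate for hypothetical counts and rates. The complementary weight used for the case in which the dependent is a spare is an assumption inferred from the description, not a statement taken from ARM.

```python
def prob_dependent_in_subsystem(ws_dep, w_sub):
    """Chance that the dependent is among the w_sub working components of the
    subset instead of the ws_dep working spares of its equivalence class."""
    return w_sub / (ws_dep + w_sub)

def spare_fault_rate(ws, pfr, frf, ws_dep, w_sub):
    """Rate of the transition in which the dependent is in the subsystem."""
    return ws * pfr * frf * prob_dependent_in_subsystem(ws_dep, w_sub)

# Hypothetical values: WS[3] = 2, WS[EC[2,1]] = 1, W(2,1) = 3.
p_in_subsystem = prob_dependent_in_subsystem(ws_dep=1, w_sub=3)
p_is_spare = 1 - p_in_subsystem      # assumed weight of the companion case
rate = spare_fault_rate(ws=2, pfr=1e-4, frf=0.5, ws_dep=1, w_sub=3)
```

Because the two weights sum to 1, the transitions for the separate cases together carry the same total rate as the unconditioned spare failure.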
3.3.5. Potentially General Transitions
The ARM program analyzes the system descrip-
tion to derive the potentially general transitions.
These transitions are only derived for those fault
types with a nonzero failure rate. They must also
have a nonzero transition rate, if they are exponential,
or mean, if they are general. For each potentially
general transition, the following must be derived:
the conditions under which the transition would take
place, its destination state, and its transition rate
expression.
Each condition can lead to two types of transitions:
a successful transition, if the coverage
is not 0, and a system failure transition, if the
coverage is not 1. The destination state for such sys-
tem failure transitions is a death state in which CF
is true. The transition rate expression of a system
failure transition is the same as that for a successful
transition except that the coverage c is replaced by
(1 - c). Therefore, only the destination states and
transition rate expressions of successful transitions
are given in the subsections that follow.
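The split of each condition into a successful transition and a system failure transition can be sketched as follows in Python. The function name and the dictionary keys are hypothetical; the rule itself is the one stated above, with the coverage c weighting the successful transition and (1 - c) weighting the transition to the death state in which CF is true.

```python
def split_by_coverage(base_rate, c):
    """Return the rates of the transitions that a condition gives rise to.
    base_rate is the rate expression without the coverage factor."""
    transitions = {}
    if c != 0:
        transitions["success"] = base_rate * c
    if c != 1:
        transitions["coverage_failure"] = base_rate * (1 - c)
    return transitions

partial = split_by_coverage(4.0, 0.75)   # both transitions are generated
perfect = split_by_coverage(4.0, 1)      # no system failure transition
```

With perfect coverage no transition to the CF death state is specified, which keeps the generated model smaller.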
The rate expression of a general transition is com-
posed of the mean time, standard deviation, and
probability. Except for spare recoveries, the tran-
sition time mean and standard deviation are the
repair, recovery, or reconfiguration mean and stan-
dard deviation. The transition probabilities given in
the subsections that follow are for the case in which
there are no competing fast exponential transitions.
These probabilities are multiplied by the repair, re-
covery, or reconfiguration probability when there is
such competition.
After all the potentially general transitions have
been derived, they are specified based on their pri-
ority using the algorithm in subsection A3. This al-
gorithm only allows competition between transitions
with the same priorities.
3.3.5.1. Spare recoveries. Spare recovery tran-
sitions are derived only for those component types
whose failure rates factor and detectable fraction are
nonzero. The transition condition is that a spare has
failed. The destination state is one in which the num-
ber of components reconfigured out (NR) has been
increased by 1 and the number of failed spares has
been decreased by 1.
The exponential transition rate expression is the
product of multiplying the recovery rate, the cover-
age, the detectable fraction, and the probability that
the system recovers from this fault type and not some
other fault types that it may have in spares of this
component type. The general transition probability
is the product of multiplying the coverage by the pre-
vious probability. The general transition time mean
and standard deviation are the quotients of dividing
the detectable fraction into the recovery mean and
standard deviation. For example, the recovery from
a permanent fault in a spare in equivalence class 2
and of type 1, which has two equivalence classes, can
be described by the following statement:
if PS[2] > 0 tranto NR[2]++, PS[2]--
by DF_1 * SRR_1 * SRC_1 * PS[2]
/ (NBS(1) + NBS(2));
where SRR, SRC, and DF stand for the spare recovery
rate, the spare recovery coverage, and the detectable
fraction, respectively.
3.3.5.2. Component repairs. The transition con-
dition for a component repair is that NR is nonzero.
The destination state is one in which the number
of working spares (WS) in the equivalence class has
been increased by 1 and NR has been decreased by 1.
An alternative to this destination state is to restore a
subsystem that was retired or whose redundancy had
been diminished instead of increasing the number
of working spares. Because this alternative requires
tracking all subsystem retirements and degradations
to decide which one to reverse, it has not yet been
implemented.
The exponential transition rate expression is the
product of multiplying the repair rate, the coverage,
and the probability that a component from this
equivalence class is repaired and not some other
component of this type. The general transition
probability is the product of multiplying the coverage
by the previous probability. For example, the repair
of a component in equivalence class 2 and of type 1,
which has two equivalence classes, can be described
by the following statement:
if NR[2] > 0 tranto WS[2]++, NR[2]-- by
CRR_1 * CRC_1 * NR[2] / (NR[1] + NR[2]);
where CRR and CRC stand for the component repair
rate and coverage.
3.3.5.3. Logical subsystem recoveries. The tran-
sition conditions for a logical subsystem recovery are
that the subsystem is operational, one of its compo-
nents has failed, and there is an equivalent spare to
replace it. If the spare is working properly, the des-
tination state is one in which NR has been increased
by 1 and both the number of failed components in
that subsystem and the number of spares have been
decreased by 1. If the spare is not working properly,
the destination state is one in which NR has been
increased by 1 and the number of spares has been
decreased by 1.
The exponential transition rate expression is the
product of multiplying the recovery rate, the cover-
age, and two probabilities. One probability is that of
the system recovering from this fault and not some
other faults that it may have in subsystems of this
type. The other probability is that of getting a spare
in the working state that is indicated in the destination
state. The general transition probability is the
product of multiplying the coverage by the previous
two probabilities. For example, the recovery from a
permanent fault in equivalent subset 1 of logical
subsystem 2 of type 3, which has two subsystems, can
be described by the following statement:
if LO[2] & P[2,1] > 0 & WS[EC[2,1]] > 0
tranto NR[EC[2,1]]++, P[2,1]--,
WS[EC[2,1]]-- by LRR_3 * LRC_3 * PL(2)
/ (NBL(2) + NBL(4)) * WS[EC[2,1]]
/ NS(EC[2,1]);
where LRR and LRC stand for the logical subsystem
recovery rate and coverage.
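The rate expression in this statement is a product of four factors, which the following Python sketch evaluates for hypothetical numbers. The function and argument names are assumptions; the factors are those of the statement above, namely the recovery rate, the coverage, the share of this fault among the nonbenign faults in subsystems of the type, and the share of working spares among the equivalent spares.

```python
def recovery_rate(lrr, lrc, pl, nbl_same_type, ws, ns):
    """Exponential rate of a recovery that swaps in a working spare.

    pl            -- permanent failures in this subsystem, PL(2)
    nbl_same_type -- nonbenign failure counts of every subsystem of the type
    ws, ns        -- working and total equivalent spares, WS[k] and NS(k)
    """
    p_this_fault = pl / sum(nbl_same_type)   # recover from this fault first
    p_working_spare = ws / ns                # the spare drawn is working
    return lrr * lrc * p_this_fault * p_working_spare

# Hypothetical values: NBL(2) = NBL(4) = 1, three of four spares working.
rate = recovery_rate(lrr=100.0, lrc=0.5, pl=1, nbl_same_type=[1, 1], ws=3, ns=4)
```

For these values the rate is 100 * 0.5 * (1/2) * (3/4) = 18.75.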
3.3.5.4. Reconfigurations that retire a subsystem.
The transition conditions for a reconfiguration that
retires a subsystem are that the subsystem is oper-
ational, one of its components has failed, there is
no equivalent spare to replace it, and the number
of operational subsystems of this type is the same
as specified in the source vertex of the reconfigura-
tion graph. The destination state is one in which the
subsystem is no longer operational, the number of
operational subsystems of this type (LTC) has been
decreased by 1, NR has been increased by 1, the num-
ber of failed components in that subsystem is 0, and
the number of spares has been incremented by the
number of components in that subsystem minus 1.
The exponential transition rate expression is the
product of multiplying the reconfiguration rate, the
coverage, and the probability that the system recon-
figures because of this fault and not some other faults
it may have in subsystems of this type. The general
transition probability is the product of multiplying
the coverage by the previous probability. For exam-
ple, the reconfiguration from a permanent fault in
the only equivalent subset of logical subsystem 2, of
a type which has two subsystems, can be described
by the following statement:
if LO[2] & P[2,1] > 0 & WS[EC[2,1]] = 0
& LTC[LT[2]] = 2 tranto NR[EC[2,1]]++,
WS[EC[2,1]] = WS[EC[2,1]] + W(2,1),
PS[EC[2,1]] = PS[EC[2,1]] + P[2,1] - 1,
TS[EC[2,1]] = TS[EC[2,1]] + T[2,1],
AS[EC[2,1]] = AS[EC[2,1]] + A[2,1],
BS[EC[2,1]] = BS[EC[2,1]] + B[2,1],
LTC[LT[2]]--, LO[2] = false, P[2,1] = 0,
T[2,1] = 0, A[2,1] = 0, B[2,1] = 0 by
RR_3 * RC_3 * PL(2) / (NBL(2) + NBL(4));
where RR and RC stand for the system reconfiguration
rate and coverage.
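The destination state of this statement can be checked with a short Python sketch that applies the same assignments to a state held in dictionaries. The representation and the function names are hypothetical, and the sketch covers only a subsystem with a single equivalent subset, as in the example.

```python
def retire_enabled(s, i, j, k, ltype, count):
    """Guard: operational, a permanent fault, no working spare, and the
    number of operational subsystems matches the source vertex."""
    return (s["LO"][i] and s["P"][i, j] > 0
            and s["WS"][k] == 0 and s["LTC"][ltype] == count)

def retire(state, i, j, k, ltype):
    """Return the state after retiring subsystem i (subset j, class k)."""
    s = {name: dict(values) for name, values in state.items()}
    working = (s["C"][i, j] - s["P"][i, j] - s["T"][i, j]
               - s["A"][i, j] - s["B"][i, j])
    s["NR"][k] += 1                        # the faulty component is removed
    s["WS"][k] += working                  # the rest become spares
    s["PS"][k] += s["P"][i, j] - 1
    for sub, spare in (("T", "TS"), ("A", "AS"), ("B", "BS")):
        s[spare][k] += s[sub][i, j]
    for name in ("P", "T", "A", "B"):
        s[name][i, j] = 0
    s["LTC"][ltype] -= 1
    s["LO"][i] = False
    return s

# Hypothetical state: subsystem 2 of type 3, three components, one failed.
before = {
    "C": {(2, 1): 3}, "P": {(2, 1): 1}, "T": {(2, 1): 0},
    "A": {(2, 1): 0}, "B": {(2, 1): 0},
    "WS": {1: 0}, "PS": {1: 0}, "TS": {1: 0}, "AS": {1: 0}, "BS": {1: 0},
    "NR": {1: 0}, "LTC": {3: 2}, "LO": {2: True},
}
enabled = retire_enabled(before, 2, 1, 1, 3, 2)
after = retire(before, 2, 1, 1, 3)
```

After the transition, the spares plus the component reconfigured out account for all three components of the retired subsystem, so no component is lost or duplicated.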
3.3.5.5. Reconfigurations that degrade a sub-
system. If a subsystem is going to be degraded by m
components and m is greater than 1, two alternative
ways exist to deal with the m - 1 components that
could be working but are not going to be part of the
new subsystem. These components could be recon-
figured out of the system, or they could be declared
to be spares.
The implementation of either alternative must
consider all the possible ways to choose m - 1 components
out of the n components in the old subsystem
and the probability of each selection. The reason for
these alternatives is that more than one of the n com-
ponents might have a fault. In that case, it makes a
difference which ones are chosen.
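The enumeration of selections can be illustrated with the Python sketch below. It assumes, purely for illustration, that every selection of components is equally likely; the weighting that ARM actually uses is not given in this section. The function name and the component states are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def removed_fault_distribution(components, k):
    """components: list of component states, e.g. 'working' or 'permanent'.
    Returns {number of faulty components among the k removed: probability},
    assuming each selection of k components is equally likely."""
    picks = list(combinations(range(len(components)), k))
    tally = Counter(
        sum(components[i] != "working" for i in pick) for pick in picks
    )
    return {faulty: Fraction(n, len(picks)) for faulty, n in tally.items()}

# Three remaining components, one of them faulty; one more is removed.
dist = removed_fault_distribution(["working", "working", "permanent"], 1)
```

In this example the removed component is faulty with probability 1/3, so the destination state differs between selections, which is why each selection and its probability must be considered.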
The first alternative is always conservative, but
for some systems, it could be overly conservative.
The ARM program implements the second alter-
native in a way that does not lead to optimistic
results. For subsystems that do not recover from
faults by using spares, their recovery rate or mean,
whichever applies, would be 0, and therefore the
m - 1 components would not be used as spares for
them.
The transition conditions for a reconfiguration
that degrades a subsystem by m components are that
the old subsystem is operational, one of its components
has failed, no equivalent spare exists to replace
it, and the number of operational subsystems of this
type (LTC) is greater than or equal to the number
specified in the source vertex of the reconfiguration
graph. The destination state is one in which the old
subsystem is no longer operational, NR has been in-
creased by 1, the number of spares has been incre-
mented by m - 1, the number of failed components
in the old subsystem is 0, a new subsystem is opera-
tional with the components of the old subsystem mi-
nus m, and the number of operational subsystems is
decreased for the old subsystem's type and increased
for the new subsystem's type. To simplify the model
specification, ARM currently limits m to 2.
The exponential transition rate expression is the
product of multiplying the reconfiguration rate, the
coverage, and the probability that the system recon-
figures because of this fault and not because of some
other faults that it may have in subsystems of this
type. The general transition probability is the product
of multiplying the coverage by the previous probability.
For example, the degradation to a subsystem
that has two fewer components, caused by a permanent
fault in the only equivalent subset of logical
subsystem 2, of a type which has two subsystems,
can be described by the following
statement:
if LO[2] & P[2,1] > 0 & WS[EC[2,1]] = 0
& LTC[LT[2]] = 2 tranto NR[EC[2,1]]++,
WS[EC[2,1]]++, LTC[LT[2]]--,
LO[2] = false, C[2,1] = 0, P[2,1] = 0,
T[2,1] = 0, A[2,1] = 0, B[2,1] = 0,
LTC[LT[3]]++, LO[3] = true,
C[3,1] = C[2,1] - 2, P[3,1] = P[2,1] - 1,
T[3,1] = T[2,1], A[3,1] = A[2,1],
B[3,1] = B[2,1] by RR_5 * RC_5 * PL(2)
/ (NBL(2) + NBL(4));
3.4. Advanced Features Not Yet
Implemented
Two of the more advanced features of the ARM
system description language have not yet been imple-
mented due to the complexities they involve. If their
use is attempted, ARM warns the user that these fea-
tures have not yet been implemented. The following
subsections outline how they might be implemented
in the future.
3.4.1. Reinitializing Reconfigurations
The transition conditions for a reinitializing re-
configuration are that (1) a logical subsystem has
failed; (2) excluding the components whose fail-
ure caused the logical subsystem to fail, there are
still enough components to meet the system require-
ments; and (3) the components and/or subsystems
required for the reinitializing reconfiguration have
not failed. The destination state is one in which the
components whose failure caused the logical subsys-
tem to fail have been reconfigured out of the system.
The destination state is also one in which either those
components reconfigured out would be replaced by
spares, if available, or the failed subsystem would be
retired or degraded.
The exponential transition rate expression is the
product of multiplying the reconfiguration rate by
the coverage. The general transition probability is
the coverage of the reinitializing reconfiguration.
3.4.2. Mission Phase Change Reconfigurations
No transition conditions exist for mission phase
change reconfigurations because they can occur at
any moment. The destination state is determined
by the reconfiguration graph. This graph is one in
which the logical subsystems in the source vertex are
no longer operational and the logical subsystems in
the destination vertex are now operational.
The exponential transition rate expression is
the product of multiplying the reconfiguration rate
by the coverage. The general transition proba-
bility is the coverage of the mission phase change
reconfiguration.
4. Application Examples and Results
The following subsections illustrate the type of
systems that can be described with the GUI and give
evidence that the results ARM generates are correct.
This is done by comparing the results ARM generates
for four systems with manually generated results. To
make this comparison possible, each pair of results
is based on the same architecture, requirements, and
parameters. For all the system reconfigurations in
the examples of this section, the triggering event used
was a component that failed without a spare. All of
the models were evaluated by SURE using a mission
time of 10, a loop truncation level of 25, and a re-
quirement of 2 digits of accuracy. The manually and
ARM-generated Markov reliability model specifica-
tions for these four systems are presented in Liceaga
(1992).
4.1. Comparison With Example
Multiprocessor Results
The first system is the multiprocessor used as
an example throughout section 2 but without repair.
The architecture and requirements of this system re-
main the same except that no reinitializing reconfig-
uration is used because that feature of ARM is not
yet implemented.
However, some of the parameters used are not the
same. Only permanent failure rates were used, and
they are shown in table 4.1. A value of 0.999999999
was used for all fault coverage probabilities, and a
value of 1 was used for all priorities. A failure rates
factor of 1 was used for all the spare components.
The other parameters used for the spare components
are shown in table 4.2. A rate of 7.8 × 10^3 was used
for all the subsystem recoveries. The values used for
the reconfiguration from two processor triads to one
were shown in figure 2.22. To generate the model,
an ASSIST pruning condition of TNF >= 5 was used
without any trimming. To evaluate the model, a
SURE prune level of 1 × 10^-15 was used.
Statistics comparing the manually and ARM-
generated Markov reliability model specifications are
shown in table 4.3. All of the ASSIST and SURE
execution times in table 4.3 and throughout this pa-
per were measured on a Sun Microsystems SPARC-
station 2 computer. The probability-of-failure results
are compared in table 4.4.
Table 4.1. All Permanent Failure Rates of the Multiprocessor

                                                  Component type
Parameter                       p           m           w           tb          rb          wb
Permanent failure rate, hr^-1   5 × 10^-5   5 × 10^-5   2 × 10^-6   3 × 10^-6   3 × 10^-6   3 × 10^-6
Table 4.2. Some Spare Component Parameters of the Multiprocessor

                                      Component type
Parameter              p            m    w            tb   rb   wb
Detectable fraction    0.9          0    0.9          0    0    0
Recovery rate, hr^-1   3.3 × 10^3   0    1.8 × 10^4   0    0    0
Table 4.3. Reliability Model Statistics for the Multiprocessor
Statistic Manual ARM
ASSIST file lines 195 631
Model generation time, hr 0.36 2.72
States 33 611 46 338
Transitions 694 980 1 092 898
Model evaluation time, hr 1.76 3.18
Table 4.4. Probability-of-Failure Results for the Multiprocessor

Measure       Manual             ARM                Difference      Percentage
Lower bound   2.01939 × 10^-10   2.01835 × 10^-10   1.04 × 10^-13   5.15 × 10^-2
Upper bound   2.05305 × 10^-10   2.05307 × 10^-10   -2 × 10^-15     -9.74 × 10^-4
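
The Difference and Percentage columns of table 4.4 can be reproduced with a short calculation. The Python sketch below assumes, judging from the tabulated values, that the difference is the manual result minus the ARM result and that the percentage is taken relative to the manual result. The function name is illustrative and is not part of ARM.

```python
def compare_bounds(manual, arm):
    """Return (difference, percentage) of an ARM bound relative to the manual one."""
    difference = manual - arm
    percentage = 100.0 * difference / manual
    return difference, percentage


# Lower and upper probability-of-failure bounds from table 4.4.
lower = compare_bounds(2.01939e-10, 2.01835e-10)
upper = compare_bounds(2.05305e-10, 2.05307e-10)
print(f"lower: {lower[0]:.3g}  {lower[1]:.3g} percent")
print(f"upper: {upper[0]:.3g}  {upper[1]:.3g} percent")
```

The same calculation applies to the comparison tables for the SIFT, Tandem, and Stratus computers.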
4.2. Application to Systems Described in
the Literature
The following subsections illustrate the applica-
bility of the GUI to systems that have been described
in the technical literature and give evidence that the
results ARM generates are correct.
4.2.1. Software Implemented Fault-Tolerance
(SIFT) Computer
A SIFT computer, described in Goldberg et al.
(1984), can initially be configured as a processor p
sextuple (ST). In addition to a central processing unit
(CPU), each processor contains its own memory, in-
put/output port, and power supply. As processors
fail, SIFT first reconfigures into a quintuple, then a
quad, then a triad, and finally into a nonreconfig-
urable dual.
The logical hierarchy and requirements of the
triad subsystem class were shown in figures 2.11
and 2.16. The remainder of the architecture and re-
quirements of this SIFT computer are described by
figures 4.1 to 4.15. Figures 4.1 to 4.3 describe the initial
configuration. Figures 4.4 to 4.6 describe possible
future configurations. Figures 4.7 to 4.10 describe
Figure 4.1. External structure of a SIFT computer.
Figure 4.2. Logical hierarchy of a SIFT computer.
Figure 4.3. Logical hierarchy of the sextuple subsystem class.
Figure 4.4. Logical hierarchy of the quintuple (QT) subsystem class.
Figure 4.5. Logical hierarchy of the quad (Q) subsystem class.
Figure 4.6. Logical hierarchy of the nonreconfigurable dual (ND) subsystem class.
Figure 4.7. Degradation from a processor sextuple (PST) to a quintuple (PQT).
Figure 4.8. Degradation from a processor quintuple to a quad (PQ).
Figure 4.9. Degradation from a processor quad to a triad (PT).
Figure 4.10. Degradation from a processor triad to a nonreconfigurable dual (PND).
Figure 4.11. Requirements of a SIFT computer.
Figure 4.12. Requirements of the sextuple subsystem class.
Figure 4.13. Requirements of the quintuple subsystem class.
Figure 4.14. Requirements of the quad subsystem class.
Figure 4.15. Requirements of the nonreconfigurable dual subsystem class.
Figure 4.16. External structure of a Tandem computer.
Figure 4.17. Logical hierarchy of a Tandem computer.
Table 4.5. Reliability Model Statistics for a SIFT Computer

Statistic                    Manual   ARM
ASSIST file lines            18       330
Model generation time, sec   0.23     1.69
States                       12       18
Transitions                  17       17
Model evaluation time, sec   0.02     0.063
Table 4.6. Probability-of-Failure Results for a SIFT Computer

Measure       Manual             ARM                Difference        Percentage
Lower bound   7.43383 × 10^-15   7.47849 × 10^-15   -4.466 × 10^-17   -0.601
Upper bound   7.71581 × 10^-15   7.76217 × 10^-15   -4.636 × 10^-17   -0.601
the degradations from six to two processors. Fig-
ures 4.11 and 4.12 describe the requirements of the
initial configuration, and figures 4.13 to 4.15 describe
the requirements of possible future configurations.
A permanent failure rate of 1 × 10^-4 hr^-1 was
used for the processors. A coverage of 1, a priority
of 1, and a rate of 3.6 × 10^3 were used for all the
system reconfigurations. The model was generated
without any pruning or trimming. Default values
were used for the model evaluation parameters, and
these were shown in figure 2.24.
A Markov reliability model for this SIFT com-
puter was specified on page 48 of Butler and Johnson
(1990). Statistics comparing the manually and
ARM-generated Markov reliability model specifica-
tions are shown in table 4.5. Their probability-of-
failure results are compared in table 4.6.
4.2.2. Comparison With Self-Generated Results
The following subsections give further evidence
that the results ARM generates are correct by com-
paring them with results that were generated manually.
A value of 4 × 10^3 was used for all transient
disappearance rates and intermittent benign rates. A
value of 4 × 10^-2 was used for all intermittent active
rates.
4.2.2.1. Tandem computer. An almost minimal
version of a Tandem computer, described in Katzman
(1977), is composed of one processor p dual, one disk
controller k dual, two disk drive d duals, two fans f,
two power supplies ps, and one interprocessor bus b
dual. In addition to a CPU, each processor contains
its own memory. When a component in a dual (D)
fails, the subsystem is reconfigured into a simplex (S).
This Tandem computer requires all subsystems,
one fan, and one power supply for it to be opera-
tional. Each dual requires that one of its components
be working for it to be operational.
The architecture and requirements of this Tan-
dem computer are described by figures 4.16 to 4.30.
Figures 4.16 to 4.19 describe the initial and fu-
ture configurations. Figures 4.20 to 4.23 describe
the degradations from dual to simplex subsystems.
Figures 4.24 to 4.30 describe the initial and future
configuration requirements.
To generate the model, an ASSIST pruning con-
dition of TNF >= 4 was used without any trim-
ming. To evaluate the model, a SURE prune level
of 1 × 10^-11 was used. The other parameters used
for this Tandem computer are shown in tables 4.7
to 4.9.
Statistics comparing the manually and ARM-
generated Markov reliability model specifications are
shown in table 4.10. Their probability-of-failure
results are compared in table 4.11.
4.2.2.2. Stratus computer. An almost minimal
version of a Stratus computer, described in Siewiorek
and Swarz (1992), is composed of one computer
module and two disk drive d duals. The components
in this version of a computer module are grouped into
two module regions (MR's). Each MR is composed
of a processor board pb, a memory board mb, a disk
Figure 4.18. Logical hierarchy of the dual subsystem class.
Figure 4.19. Logical hierarchy of the simplex subsystem class.
Figure 4.20. Degradation from a processor dual (PD) to a simplex (PS).
Figure 4.21. Degradation from a disk controller dual (KD) to a simplex (KS).
Figure 4.22. Degradation from a disk drive dual (DD) to a simplex (DS).
Figure 4.23. Degradation from a bus dual (BD) to a simplex (BS).
Figure 4.24. Requirements of a Tandem computer.
Figure 4.25. Requirements of the PPL performance level.
Figure 4.26. Requirements of the KPL performance level.
Figure 4.27. Requirements of the DPL performance level.
Figure 4.28. Requirements of the BPL performance level.
Figure 4.29. Requirements of the dual subsystem class.
Figure 4.30. Requirements of the simplex subsystem class.
Table 4.7. Some Active Component Parameters of a Tandem Computer

                                                     Component type
Parameter                          p           k           d           f           ps          b
Permanent failure rate, hr^-1      5 × 10^-5   2 × 10^-5   4 × 10^-5   1 × 10^-6   3 × 10^-5   3 × 10^-6
Transient failure rate, hr^-1      5 × 10^-4   2 × 10^-4   4 × 10^-4   1 × 10^-5   3 × 10^-4   3 × 10^-5
Intermittent failure rate, hr^-1   5 × 10^-6   2 × 10^-8   4 × 10^-6   1 × 10^-7   3 × 10^-6   3 × 10^-7
Table 4.8. Some Component Repair Parameters of a Tandem Computer

                             Component type
Parameter            p    k    d    f    ps   b
Repair priority      5    7    9    8    6    10
Repair rate, hr^-1   30   30   30   30   30   15
Table 4.9. Some System Reconfiguration Parameters of a Tandem Computer

                                          Specification name
Parameter                     PD_to_PS     KD_to_KS     DD_to_DS     BD_to_BS
Component type                p            k            d            b
Reconfiguration coverage      0.999999     0.99999975   0.99999975   0.99999975
Reconfiguration priority      1            2            3            4
Reconfiguration rate, hr^-1   1.8 × 10^3   7.2 × 10^3   25           7.2 × 10^3
Table 4.10. Reliability Model Statistics for a Tandem Computer

Statistic                    Manual    ARM
ASSIST file lines            157       794
Model generation time, min   6.46      75.6
States                       11 945    12 920
Transitions                  284 121   314 967
Model evaluation time, min   3.94      8.85
Table 4.11. Probability-of-Failure Results for a Tandem Computer

Measure       Manual            ARM               Difference      Percentage
Lower bound   1.84026 × 10^-5   1.7866 × 10^-5    5.366 × 10^-7   2.92
Upper bound   2.10696 × 10^-5   2.05823 × 10^-5   4.873 × 10^-7   2.31
controller board kb, a power supply ps, and a module
bus b. Each board contains duplicated logic and
an onboard comparator that will stop the board
from transmitting on the bus in case of a mismatch.
Except for the power supply, each component in one
MR forms a dual with the component of the same
type in the other MR. The bus is used to perform the
OR logical function on the output signals of boards
in a dual. Therefore, no reconfiguration takes place
when a board fails. When a component fails in a disk
drive or bus dual, the subsystem is reconfigured into
a simplex.
This Stratus computer requires all subsystems for
it to be operational. Each dual requires that one of
its components be working for it to be operational.
Each MR requires that its power supply be working
for it to be operational.
The logical hierarchy and requirements of the
dual and simplex subsystem classes are shown in fig-
ures 4.18, 4.19, 4.29, and 4.30. The requirements of
the DPL and BPL performance levels are shown in
figures 4.27 and 4.28, respectively. The system
reconfiguration graphs corresponding to the disk drive and
bus subsystems are shown in figures 4.22 and 4.23,
respectively. The remainder of the architecture and
requirements of this Stratus computer are described
by figures 4.31 to 4.36. Figures 4.31 to 4.34 complete
the description of the initial configuration. Fig-
ures 4.35 and 4.36 complete the description of the
initial configuration requirements.
To generate the model, an ASSIST pruning condi-
tion of TNF >= 4 was used without any trimming. To
evaluate the model, a SURE prune level of 1 × 10^-10
was used. The other parameters used for this Stratus
computer are shown in tables 4.12 to 4.14.
Statistics comparing the manually and ARM-
generated Markov reliability model specifications are
shown in table 4.15. Their probability-of-failure
results are compared in table 4.16.
Figure 4.31. External structure of a Stratus computer.
Figure 4.32. Physical hierarchy of a Stratus computer.
Figure 4.33. Physical hierarchy of the MR subsystem type.
Figure 4.34. Logical hierarchy of a Stratus computer.
Figure 4.35. Requirements of a Stratus computer.
Figure 4.36. Requirements of the MR subsystem type.
Table 4.12. Some Active Component Parameters of a Stratus Computer

                                                     Component type
Parameter                          pb          mb          kb          d           ps          b
Permanent failure rate, hr^-1      5 × 10^-5   5 × 10^-5   2 × 10^-5   4 × 10^-5   3 × 10^-5   3 × 10^-6
Transient failure rate, hr^-1      5 × 10^-4   5 × 10^-4   2 × 10^-4   4 × 10^-4   3 × 10^-4   3 × 10^-5
Intermittent failure rate, hr^-1   5 × 10^-6   5 × 10^-6   2 × 10^-6   4 × 10^-6   3 × 10^-6   3 × 10^-7
Table 4.13. Some Component Repair Parameters of a Stratus Computer

                             Component type
Parameter            pb   mb   kb   d    ps   b
Repair priority      5    4    6    7    3    8
Repair rate, hr^-1   30   30   30   30   30   15
Table 4.14. Some System Reconfiguration Parameters of a Stratus Computer

                                Specification name
Parameter                     DD_to_DS     BD_to_BS
Component type                d            b
Reconfiguration coverage      0.99999975   0.99999975
Reconfiguration priority      1            2
Reconfiguration rate, hr^-1   25           7.2 × 10^3
Table 4.15. Reliability Model Statistics for a Stratus Computer
Statistic Manual ARM
ASSIST file lines 155 1472
Model generation time, min 4.2 58.88
States 8197 10 462
Transitions 193 790 248109
Model evaluation time, min 3.13 8.37
Table 4.16. Probability-of-Failure Results for a Stratus Computer

Measure       Manual            ARM               Difference       Percentage
Lower bound   8.0522 × 10^-5    7.53168 × 10^-5   5.2052 × 10^-6   6.46
Upper bound   9.03802 × 10^-5   8.85121 × 10^-5   1.8681 × 10^-6   2.07
5. Analysis
The next three subsections analyze the assump-
tions, utility, and performance of ARM. Sub-
section 5.4 discusses how ARM might be validated.
Subsection 5.5 gives some lessons learned from using
ARM.
5.1. Summary of Assumptions
The purpose of this subsection is to make sure
that the potential users of the ARM approach to
reliability modeling are aware of the assumptions that
are inherent in it. The user is responsible for deter-
mining whether those assumptions are applicable to
the system whose reliability they want to estimate or
the error they introduce is acceptable given the re-
liability and accuracy the system requires. If any of
the following six assumptions are not true, the result
will be conservative:
1. It is assumed that components which commu-
nicate and are critical (i.e., required for the
system to be operational) must be able to
continue communicating.
2. A transition that reconfigures components
and/or subsystems in or out of the system
is assumed to describe sequential processes.
For example, if there are n faults in one or
more subsystems of the same type with recov-
ery rate p, the rate at which one of the faulty
components is replaced by a spare is assumed
to be p not np. Typically, these transitions
are fast, in which case this assumption being
false would have little effect.
3. A rate that characterizes the failure behavior
of a component is assumed to describe con-
current processes. For example, if there are
n working components of the same type with
permanent failure rate λ, the rate at which
one of them fails permanently is assumed to
be nλ.
4. It is assumed that when a component in a log-
ical subsystem fails, it can only be replaced by
a component that is equivalent in terms of its
type and its connections to other components.
5. Components that depend on a component
that has a soft fault are assumed to have failed
permanently.
6. It is assumed that repaired components be-
come spares and that they are not used to
restore a subsystem that was retired or whose
redundancy had been diminished.
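
Assumptions 2 and 3 treat rates differently, and the contrast can be made concrete with a small calculation. The Python sketch below is illustrative only: the function names are hypothetical, and the numeric values are borrowed from the SIFT example in section 4 purely as sample inputs.

```python
def sequential_recovery_rate(n_faulty, recovery_rate):
    """Assumption 2: spares are switched in one at a time, so the rate does not scale with n."""
    return recovery_rate if n_faulty > 0 else 0.0


def concurrent_failure_rate(n_working, failure_rate):
    """Assumption 3: components fail concurrently, so the individual rates add up."""
    return n_working * failure_rate


# Sample inputs: three faulty components awaiting a spare, and six working
# processors that each fail permanently at 1e-4 per hour.
print(sequential_recovery_rate(3, 3.6e3))
print(concurrent_failure_rate(6, 1e-4))
```

The first function returns the same rate for any positive number of faults, while the second grows linearly with the number of working components.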
The following four assumptions are not strictly
true. However, they are often close enough in
practice to be used to simplify the analysis:
7. The failure processes of different components
are assumed to be independent of one another.
8. Each failure process is assumed to follow an
exponential distribution.
9. Numeric parameters are assumed to be
independent of the system state.
10. The transition probability of a potentially
general transition is assumed to be the same
against any competing fast exponential
transitions.
To evaluate a model specified by ARM using
SURE and get close bounds on the probability of
failure, the following three assumptions must be met:
11. The rate at which an intermittent fault goes
from benign to active is assumed to be slow.
12. The coverage c and rate p specified for tran-
sitions that could have been general must be
such that the successful transition rate pc and
the system failure transition rate p(1 - c) that
they produce are either slow or fast. If the
product of a transition rate and the mission
time is less than 0.01, it is slow, and if it is
greater than 100, it is fast.
13. The following are assumed to be fast:
a. Recovery transitions
b. Reconfiguration transitions
c. Repair transitions
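
The slow and fast thresholds in assumption 12 can be checked mechanically before a model is evaluated. The Python sketch below is a minimal illustration, not part of ARM or SURE: the function names are hypothetical, and the rate and the mission time are assumed to be expressed in the same time units.

```python
SLOW_LIMIT = 0.01   # rate * mission time below this is "slow"
FAST_LIMIT = 100.0  # rate * mission time above this is "fast"


def classify_rate(rate, mission_time):
    """Classify a transition rate as 'slow', 'fast', or 'neither'."""
    product = rate * mission_time
    if product < SLOW_LIMIT:
        return "slow"
    if product > FAST_LIMIT:
        return "fast"
    return "neither"


def check_assumption_12(rate, coverage, mission_time):
    """Return the classes of the successful and the system failure transitions."""
    success = classify_rate(rate * coverage, mission_time)
    failure = classify_rate(rate * (1.0 - coverage), mission_time)
    return success, failure


# Sample inputs from subsection 4.1: a recovery rate of 7.8e3, a coverage of
# 0.999999999, and a mission time of 10.
print(check_assumption_12(7.8e3, 0.999999999, 10.0))
```

A transition for which either returned class is "neither" does not meet assumption 12 under this reading of the thresholds.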
If the system components do not meet assump-
tion 11, the system's probability of failure due to in-
termittent faults can be calculated by first estimating
or measuring, with fault injection experiments, the
recovery and reconfiguration times for intermittent
faults and then using those times in a model in which
the components fail permanently at the intermittent
fault arrival rate. A transition that does not meet
assumption 12 can be specified as a general transi-
tion with a mean and standard deviation of 1/p, but
then it cannot compete with other potentially gen-
eral transitions. Assumptions 12, 13a, and 13b do
not severely restrict the systems that can be modeled
because most systems meet them to achieve their
reliability requirements. However, assumption 13c is not
usually true. The PAWS or STEM programs can be
used to evaluate models that do not meet assumptions 11
through 13.
5.2. Utility
As demonstrated in section 4, the major goals
given for the GUI in section 2 were achieved. The
GUI is quite general in that all the redundancy tech-
niques defined in subsection 1.1 can be accommo-
dated. It makes use of physical and logical hierar-
chies. The GUI also uses subsystem classes and types
to reduce the number of subsystems that need to be
defined.
The input from each GUI window, whether it be a
graph or a set of parameters, is stored in a separate
file. A single file can be part of the description of
more than one system. A file can also be edited with
the GUI to produce similar files without having to
create them from scratch. The files used to describe
the systems used as examples in section 4 will be
available to the ARM user so that they may be
reused in this manner. Sharing files can be very
useful because, as demonstrated in section 4, many
systems use the same subsystem classes and types.
Even without reusing files, an experienced user could
describe each of the systems used as examples in
section 4 in approximately 1 hour.
The GUI is an alternative to learning the ASSIST
language and using it to manually specify the relia-
bility model. Subsection 5.2.1 illustrates how simple
changes in the system description given through the
GUI would require changing a large percentage of a
manual ASSIST file. Subsection 5.2.2 illustrates how
easy and natural it is to make architectural changes
for design tradeoff studies using the GUI.
5.2.1. Adding System Characteristics
This subsection compares what it takes to make
a simple addition to a system description given
through the GUI versus the corresponding percent-
age of changes and additions to a manual ASSIST
file. The example system used for these additions is
the Stratus computer described in subsection 4.2.2.2
but without dependencies between components, im-
perfect coverage, transient and intermittent faults,
and repair.
Dependencies between components were added
by giving the hierarchy and requirements of the
MR physical subsystem type and the system phys-
ical hierarchy. Imperfect coverage, transient and in-
termittent faults, and repair were added by sim-
ply changing the corresponding parameter values.
The effect of these additions, described in subsec-
tion 4.2.2.2, on a manual ASSIST file is shown in ta-
ble 5.1. The model specifications for these additions
were also generated by ARM to validate the manually
generated specifications. The probability-of-failure
results calculated with the manual and ARM model
specifications are compared in table 5.2.
5.2.2. Performing Design Tradeoffs
The following variations of the sextuple SIFT
presented in subsection 4.2.1 were considered:
1. One quintuple plus one spare
2. One quad plus two spares
3. One triad plus three spares
4. One dual plus four spares
5. Two duals plus two spares
6. Three duals
7. Two triads
In addition to the parameter values used for the
sextuple SIFT, the following parameter values were
used for the other variations of SIFT: a coverage
of 1, a priority of 1, and a rate of 3.6 × 10^3 for the
spare and subsystem recoveries; and a failure rates
factor of 1 and a detectable fraction of 1 for the spare
components.
The quintuple SIFT required the same graphs
as the sextuple SIFT except for the ST(x) logical
subsystem class hierarchy and requirements and the
PST_to_PQT system reconfiguration. Of the remain-
ing graphs, only two had to be modified to replace
ST by QT in the system logical hierarchy and remove
ST(p) from the system requirements.
The quad SIFT required the same graphs as the
quintuple SIFT except for the QT(x) logical sub-
system class hierarchy and requirements and the
PQT_to_PQ system reconfiguration. Of the remain-
ing graphs, only two had to be modified to replace
QT by Q in the system logical hierarchy and remove
QT(p) from the system requirements.
The one-triad SIFT required the same graphs as
the quad SIFT except for the Q(x) logical subsystem
class hierarchy and requirements and the PQ_to_PT
system reconfiguration. Of the remaining graphs,
only two had to be modified to replace Q by T in
the system logical hierarchy and remove Q(p) from
the system requirements.
The one-dual SIFT required the same graphs
as the one triad SIFT except for the T(x) logical
subsystem class hierarchy and requirements and the
PT_to_PND system reconfiguration. Of the remain-
ing graphs, only two had to be modified to replace
T by ND in the system logical hierarchy and remove
T(p) from the system requirements.
Table 5.1. Effect of Simple Changes in the System Description
on a Manual ASSIST File

                Declaration lines, percent   Control line, percent
Addition        Changed   Added              Changed   Added
Dependency      0         0                  6.67      13.33
Coverage        1         2                  20.00     26.67
Transients      1         4                  40.00     40.00
Intermittents   1         6                  40.00     46.67
Repair          2         2                  20.00     126.67
All of above    2         13                 46.67     306.67
Table 5.2. Probability-of-Failure Results for Some Variations
of a Stratus Computer

Addition        Bound   Manual            ARM
None            Lower   9.21527 × 10^-7   9.21527 × 10^-7
                Upper   9.6179 × 10^-7    9.6179 × 10^-7
Dependency      Lower   2.0254 × 10^-6    2.0254 × 10^-6
                Upper   2.06689 × 10^-6   2.06689 × 10^-6
Coverage        Lower   9.21873 × 10^-7   9.21873 × 10^-7
                Upper   9.62206 × 10^-7   9.62206 × 10^-7
Transients      Lower   1.04005 × 10^-5   1.04005 × 10^-5
                Upper   1.14293 × 10^-5   1.14293 × 10^-5
Intermittents   Lower   1.11775 × 10^-6   1.03363 × 10^-6
                Upper   1.18582 × 10^-6   1.10316 × 10^-6
Repair          Lower   8.97387 × 10^-7   8.97377 × 10^-7
                Upper   9.64093 × 10^-7   9.65213 × 10^-7
All of above    Lower   8.0522 × 10^-5    7.53168 × 10^-5
                Upper   9.03802 × 10^-5   8.85121 × 10^-5
Table 5.3. Probability-of-Failure Results for Some Variations of a SIFT Computer

Configuration                  Lower bound        Upper bound
One sextuple                   7.47849 × 10^-15   7.76217 × 10^-15
One quintuple plus one spare   7.43369 × 10^-15   7.71588 × 10^-15
One quad plus two spares       3.28383 × 10^-10   3.34341 × 10^-10
One triad plus three spares    1.64194 × 10^-10   1.67218 × 10^-10
One dual plus four spares      1.99798 × 10^-3    2.00401 × 10^-3
Two duals plus two spares      3.99197 × 10^-3    4.004 × 10^-3
Three duals                    5.982 × 10^-3      6 × 10^-3
Two triads                     5.96846 × 10^-6    6.01271 × 10^-6
The two-dual SIFT and the three-dual SIFT re-
quired the same graphs as the one-dual SIFT except
that in the system logical hierarchy ND was replaced
by 2ND and 3ND, respectively. The two-triad SIFT
required the same graphs as the one-triad SIFT ex-
cept that in the system logical hierarchy T was re-
placed by 2T. The probability-of-failure results for all
of these variations of SIFT are compared in table 5.3.
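
A tradeoff study of this kind usually ends with a ranking of the candidate configurations. The Python sketch below orders the variations by the upper bounds copied from table 5.3. Ranking by the upper (conservative) bound is an assumption made for this illustration; the paper does not prescribe a ranking criterion.

```python
# Upper bounds on the probability of failure, copied from table 5.3.
upper_bounds = {
    "One sextuple": 7.76217e-15,
    "One quintuple plus one spare": 7.71588e-15,
    "One quad plus two spares": 3.34341e-10,
    "One triad plus three spares": 1.67218e-10,
    "One dual plus four spares": 2.00401e-3,
    "Two duals plus two spares": 4.004e-3,
    "Three duals": 6e-3,
    "Two triads": 6.01271e-6,
}

# Sort from most to least reliable using the conservative (upper) bound.
ranking = sorted(upper_bounds, key=upper_bounds.get)
for name in ranking:
    print(f"{name:30s} {upper_bounds[name]:.5e}")
```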
5.3. Performance
From the results presented in section 4, it is seen
that the ARM approach produces reliability model
specification files with about an order of magnitude
more lines. This occurs because the current
implementation specifies all possible parameters, state
variables, and functions that might be needed to
model the system. Consequently, the time to gener-
ate the model is increased by approximately an order
of magnitude. It may be possible to optimize ARM
to specify only what is actually needed.
From the results presented in section 4, it is
observed that the ARM approach specifies models
with, at most, about twice as many states
and transitions. Although this increases the time to
evaluate the model by about a factor of 2, it does
not limit the systems that can be analyzed because
the models can be piped directly into the evaluation
program without having to store them.
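
These ratios can be checked directly against the statistics in section 4. The Python sketch below copies the ASSIST file line counts and state counts from tables 4.3, 4.5, 4.10, and 4.15 and prints the ARM-to-manual ratio for each system. The dictionary layout and the function name are illustrative choices.

```python
# (manual, ARM) statistics copied from tables 4.3, 4.5, 4.10, and 4.15.
stats = {
    "Multiprocessor": {"lines": (195, 631), "states": (33611, 46338)},
    "SIFT": {"lines": (18, 330), "states": (12, 18)},
    "Tandem": {"lines": (157, 794), "states": (11945, 12920)},
    "Stratus": {"lines": (155, 1472), "states": (8197, 10462)},
}


def arm_to_manual_ratio(pair):
    """Return the ARM value divided by the manual value."""
    manual, arm = pair
    return arm / manual


for system, values in stats.items():
    line_ratio = arm_to_manual_ratio(values["lines"])
    state_ratio = arm_to_manual_ratio(values["states"])
    print(f"{system:15s} lines x{line_ratio:.1f}  states x{state_ratio:.2f}")
```

The line-count ratios vary considerably from system to system, whereas the state-count ratios all stay below 2.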
Table 5.4 shows the time and virtual memory re-
quired by ARM to read the system description and
specify the model of the systems used as examples
in section 4. These measurements were made on
a Digital Equipment Corporation VAXstation 3100
computer that uses the VMS operating system. The
memory utilization is given in blocks of 512 bytes
of 8 bits. From these measurements and the gener-
ation and evaluation times presented in section 4, it
can be concluded that the time and memory to auto-
matically specify the reliability model are not factors
limiting the systems that can be analyzed.
Table 5.4. ARM Model Specification Performance
System Time, CPUsec Memory
MP 2.60 6668
SIFT 2.02 6540
Tandem 3.24 6924
Stratus 3.82 6924
5.4. Validation
The user of the automated reliability modeling
process proposed in this paper should be aware of its
several sources of errors. The failure rates calculated
with MIL-HDBK-217F (U.S. Department of Defense
1991) can be off by several orders of magnitude,
especially for new technologies or environments for
which there are little data. This can also be true
of the parameters that describe how the system
responds to faults if they are not measured in the
laboratory.
For highly reliable systems, the coverage param-
eter values need to be so close to 1 (e.g., 0.9999999)
that it becomes impractical to measure them. For
that reason, some computer architects have opted to
prove their system designs and then use coverage pa-
rameter values of 1 (Moser et al. 1987). Coverage can
have a profound effect on the system probability of
failure. This effect is illustrated in table 5.5 for the
version of the SIFT computer presented in section 4.
Table 5.5. Effect of Coverage on the Probability of Failure
of a SIFT Computer

Coverage        Lower bound        Upper bound
1.0             7.47849 × 10^-15   7.76217 × 10^-15
0.99999999999   6.65572 × 10^-14   6.79125 × 10^-14
0.9999999999    5.98266 × 10^-13   6.09357 × 10^-13
0.999999999     5.91535 × 10^-12   6.02288 × 10^-12
0.99999999      5.90862 × 10^-11   6.01614 × 10^-11
0.9999999       5.90795 × 10^-10   6.01513 × 10^-10
0.999999        5.90788 × 10^-9    6.01503 × 10^-9
0.99999         5.90787 × 10^-8    6.01508 × 10^-8
0.9999          5.90786 × 10^-7    6.01562 × 10^-7
0.999           5.90776 × 10^-6    6.02104 × 10^-6
0.99            5.98368 × 10^-5    6.01487 × 10^-5
0.9             5.98234 × 10^-4    6.01352 × 10^-4
Other sources of errors are design and implemen-
tation faults in the software tools that specify, gener-
ate, and evaluate the reliability model. Although the
probability of this type of error must be minimized,
experience has shown that the probability of commit-
ting an error when manually performing these tasks
is far greater even for very new tools such as ARM.
If practical, the most desirable method of val-
idation is to formally prove that ARM is correct
and that it has a perfect reliability of 1. However,
manual proofs are lengthy, tedious, and error-prone
(Ramamoorthy and Bastani 1982). Furthermore, au-
tomated proving techniques are still impractical for
realistic programs (Ramamoorthy and Bastani 1982).
As with most software of significant size and
complexity, exhaustive testing is impossible because
there are an infinite number of system descriptions
that could be given to ARM as input. However, thor-
ough testing of all program features can be achieved
to uncover as many software faults as possible and in-
crease user confidence. This testing should combine
black-box and white-box techniques (Myers 1979).
The four application examples presented in section 4
are black-box test cases because only the output of
the program was considered. The goal of white-box
test cases is to make certain that all parts of the pro-
gram have been exercised.
One of the main difficulties with developing a test
case is defining what the output is expected to be.
As the software being tested grows in complexity,
so does the task of defining the expected output.
This is especially true for programs that automate
a previously manual task, such as ARM, because the
usual reason the task was automated is that it could
only be done quickly and reliably for simple cases,
as is the case for ARM. This difficulty limits the
number of complex test cases that it is practical to
develop. If and when other programs that perform
the same function become available, comparison with
them would be a possible solution to this problem.
In the case of ARM, a way to increase the benefits
provided by the test cases developed is to perform a
sensitivity analysis of each one. This analysis would
ensure that the two models agree not only for a
specific set of parameter values but also for a range
of values. The results of doing this for the version of
the SIFT computer presented in section 4 are shown
in tables 5.6 and 5.7.
Table 5.6. Effect of the Failure Rate on the Probability
of Failure of a SIFT Computer

Failure rate, hr^-1   Bound   Manual             ARM
1 × 10^-3             Lower   5.77469 × 10^-10   5.77513 × 10^-10
                      Upper   6.16966 × 10^-10   6.17013 × 10^-10
1 × 10^-4             Lower   7.43383 × 10^-15   7.47849 × 10^-15
                      Upper   7.71581 × 10^-15   7.76217 × 10^-15
1 × 10^-5             Lower   2.63296 × 10^-19   3.08019 × 10^-19
                      Upper   2.73023 × 10^-19   3.1934 × 10^-19
Table 5.7. Effect of the Reconfiguration Rate on
the Probability of Failure of a SIFT Computer

Reconfiguration rate, hr^-1   Bound   Manual             ARM
3.6 x 10^2                    Lower   2.38272 x 10^-14   2.79663 x 10^-14
                              Upper   2.7356 x 10^-14    3.19926 x 10^-14
3.6 x 10^3                    Lower   7.43383 x 10^-15   7.47849 x 10^-15
                              Upper   7.71581 x 10^-15   7.76217 x 10^-15
3.6 x 10^4                    Lower   6.1037 x 10^-15    6.10416 x 10^-15
                              Upper   6.16738 x 10^-15   6.16877 x 10^-15
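The agreement between the manually specified and ARM-specified models can be quantified as a relative difference for each row of a table. The following is a minimal sketch in Python, which is not the language of the ARM program; the values are transcribed from table 5.6, and the names `TABLE_5_6`, `relative_difference`, and `compare` are illustrative assumptions rather than part of ARM.

```python
# Values transcribed from table 5.6: failure rate (per hour) -> bound -> (manual, ARM).
TABLE_5_6 = {
    1e-3: {"lower": (5.77469e-10, 5.77513e-10), "upper": (6.16966e-10, 6.17013e-10)},
    1e-4: {"lower": (7.43383e-15, 7.47849e-15), "upper": (7.71581e-15, 7.76217e-15)},
    1e-5: {"lower": (2.63296e-19, 3.08019e-19), "upper": (2.73023e-19, 3.1934e-19)},
}


def relative_difference(manual, arm):
    """Difference between the ARM and manual results, relative to the manual result."""
    return abs(arm - manual) / manual


def compare(table):
    """Return the relative difference for every (parameter value, bound) pair."""
    return {
        value: {bound: relative_difference(*pair) for bound, pair in bounds.items()}
        for value, bounds in table.items()
    }


if __name__ == "__main__":
    for rate, bounds in compare(TABLE_5_6).items():
        for bound, diff in bounds.items():
            print(f"failure rate {rate:g}/hr, {bound} bound: {100 * diff:.4f} percent")
```

The same `compare` function could be applied to a dictionary built from table 5.7, keyed by reconfiguration rate instead of failure rate.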
5.5. Lessons Learned
For most systems, it is essential to specify a
reasonable ASSIST prune condition. This is the
factor that most directly controls the size of the
model generated. It is quicker to start with severe
pruning (e.g., TNF >= 3) and then reduce it (e.g.,
TNF >= 4) until its effect (the evaluation programs
report it as the prunestate bounds) on the probability
of failure bounds reaches some acceptable level (e.g.,
less than 1 percent).
When using SURE to evaluate the model, the
evaluation time can be reduced by manually selecting
a reasonable SURE prune level. For this type of
pruning, it is also quicker to start with a severe
pruning level (e.g., 1 x 10^-8) and then reduce it
(e.g., 1 x 10^-9) until its effect (SURE reports it as
the sure prune bounds) on the probability of failure
bounds reaches some acceptable level (e.g., less than
1 percent). These two types of pruning can and
should be used in conjunction with each other.
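The procedure described above, starting with severe pruning and relaxing it until its effect is acceptable, can be sketched as a simple loop. This is a hedged illustration only: `relax_pruning` and the `evaluate` callable are hypothetical names, `evaluate` stands in for a run of the model generation and evaluation programs, and interpreting "effect" as the ratio of the reported prune bound to the probability of failure bound is an assumption.

```python
def relax_pruning(levels, evaluate, tolerance=0.01):
    """Try prune settings from most to least severe until pruning is negligible.

    levels:    prune settings ordered from most severe to least severe.
    evaluate:  hypothetical callable; given a prune setting, it returns
               (failure_bound, prune_bound) as reported by the evaluation tool.
    tolerance: largest acceptable ratio prune_bound / failure_bound (assumed
               meaning of "less than 1 percent").
    """
    for level in levels:
        failure_bound, prune_bound = evaluate(level)
        if prune_bound <= tolerance * failure_bound:
            return level, failure_bound, prune_bound
    raise ValueError("no prune setting in 'levels' met the tolerance")


if __name__ == "__main__":
    # Made-up numbers, for illustration only; they are not results from the paper.
    fake_results = {3: (1.0e-9, 5.0e-10), 4: (1.0e-9, 5.0e-12)}
    print(relax_pruning([3, 4], fake_results.__getitem__))
```

Because each call to `evaluate` may be expensive, ordering `levels` from most severe to least severe keeps the early iterations cheap, which is the reason given above for starting with severe pruning.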
6. Conclusions
This paper has demonstrated that the tedious and
error-prone task of specifying reliability models can
be further automated by graphical representations.
From the results presented in section 4, it can be
concluded that the Automated Reliability Modeling
(ARM) approach produces reliability model specification
files with approximately an order of magnitude
more lines than manually written ones. Consequently,
the time to generate the model is increased by about
an order of magnitude. The number of states and
transitions, however, increased by at most a factor
of 2, and the time to evaluate the model only
increased by approximately a factor of 2. The
probability of failure calculated using ARM-specified
models was within 7 percent of that calculated using
manually specified models.
With present computers, the size of the specification
file is not a problem. Typical systems have large
models whose model generation time is in the range
of 10 to 100 minutes and whose model evaluation
time is in the range of 1 to 10 hours. Hence, the
modest increase in the model generation and evalua-
tion time will be more than offset by the time saved
in specifying the model, which in very complicated
systems could be months. Therefore, it can be con-
cluded that the ARM approach to automatic reliabil-
ity model specification is an efficient way to evaluate
the reliability of complex fault-tolerant systems.
6.1. Summary of Work and Contributions
The goal of this research and development ef-
fort was to provide the computer architect with
a powerful and easy to use software tool that as-
sumes the burden of an advanced reliability analysis
that considers intermittent, transient, and perma-
nent faults for computer systems of high complex-
ity and sophistication. This paper defined a gen-
eral, high-level system description language (SDL)
that is easy to learn and use, identified and analyzed
the problems involved in the automatic specification
of Markov reliability models for arbitrary intercon-
nection structures at the processor-memory-switch
(PMS) level, and generated and implemented solu-
tions to these problems. The results of this research
have been implemented and experimentally validated
in the ARM program.
The ARM program uses a graphical user interface
(GUI) as its SDL. This GUI is based on a hierarchy of
windows. Some windows have graphical editing ca-
pabilities for specifying the system's communication
structure, hierarchy, reconfiguration capabilities, and
requirements. Other windows have text fields, pull-
down menus, and buttons for specifying parameters
and selecting actions.
The ARM program outputs a Markov reliability
model specification formulated for direct use by pro-
grams that generate and evaluate the model. The
advantages of such an approach are utility to a larger
class of users, not necessarily expert in reliability
analysis, and lower probability of human error in the
calculation.
6.2. Future Work
This work could be extended in several ways. The
most obvious one is to implement reinitializing re-
configurations and mission phase change reconfigu-
rations. An outline of how this might be done has
been given in subsection 3.4. Another possible exten-
sion is to modify the way that repaired components
are currently handled such that they can be used to
restore subsystems that were retired or whose redun-
dancy had been diminished instead of just becoming
spares.
The ARM program could be generalized such that
a subsystem can be degraded by more than two com-
ponents at a time. This program could also be gen-
eralized such that components which depend on a
component that has a soft fault become benign when
the component they depend on becomes benign. An-
other way to generalize ARM would be to allow a
subsystem to be composed of other subsystems.
Further research is needed to describe transi-
tion time distributions that depend on the global
state of the system. Additional work is also needed
to easily allow competing general transitions. The
competing transition probabilities possibly could be
automatically estimated.
The ARM program could also be extended to au-
tomate the specification of availability models. How-
ever, this automation would require modifying the
present model generation and evaluation tools or
using other ones.
NASA Langley Research Center
Hampton, VA 23681-0001
January 25, 1993
Appendix
ARM Program Algorithms
This appendix shows the algorithms used by the ARM program.
A1. Symmetry Detection
The function definitions are as follows:
Split_Class (R, C, L): If the elements of class C do not all satisfy relation R, partitions class C by moving the
nonequivalent elements to a new class created after the last class L. Returns the number of equivalence classes.
Size (C): Returns the number of elements in the vertex equivalence class C.
Element (E, C): Returns element E of the vertex equivalence class C.
Equivalent (E, C, R): True if element E of class C is equivalent in terms of relation R to the preceding class
elements.
Equal_Degree (E, C): True if element E of class C has the same degree as the preceding elements of class C.
Equal_Neighbor_Classes (E, C): True if element E of class C has the same number of neighbor types in each
class as the preceding elements of class C.
Equivalent (Current_Element, Class, Relation) {
    if (Relation == Degree)
        return Equal_Degree (Current_Element, Class);
    else
        return Equal_Neighbor_Classes (Current_Element, Class);
}
Split_Class (Relation, This_Class, Last_Class) {
    Split = false;
    for (I = 2; I <= Size (This_Class); I++) {
        Current_Element = Element (I, This_Class);
        if (!Equivalent (Current_Element, This_Class, Relation)) {
            if (!Split) {
                Split = true;
                Last_Class++;
                /* Create a new Last_Class with the degree and
                   neighbor attributes of the Current_Element of This_Class. */
            }
            /* Move the Current_Element of This_Class to the Last_Class. */
        }
    }
    return Last_Class;
} /* Split_Class */
Symmetry () {
    /* Step 1: Split based on equal type. */
    Last_Class = Last_Type;
    for (I = 1; I <= Last_Class; I++)
        /* Add elements of type I to class I. */

    /* Step 2: Split based on equal degree. */
    I = 1;
    while (I <= Last_Class) {
        Last_Class = Split_Class (Degree, I, Last_Class);
        I++;
    }

    /* Step 3: Split based on equal neighbor classes. */
    New_Last = Last_Class;
    Done = false;
    while (!Done) {
        for (I = 1; I <= Last_Class; I++)
            New_Last = Split_Class (Neighbors, I, New_Last);
        if (Last_Class == New_Last)
            Done = true;
        else
            Last_Class = New_Last;
    }
} /* Symmetry */
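The three splitting steps (equal type, equal degree, equal neighbor classes) can be illustrated with a short runnable sketch. The sketch is written in Python rather than in the notation of this appendix, and it is not the ARM implementation: instead of comparing each element with the preceding elements of its class, it groups vertices by a computed signature, which is a swapped-in but related refinement technique. The names `symmetry_classes` and `refine` and the input format (a type per vertex plus a list of undirected edges) are assumptions.

```python
from collections import Counter


def symmetry_classes(types, edges):
    """Partition vertices into candidate equivalence classes.

    types: dict mapping each vertex to its component type.
    edges: iterable of (u, v) pairs describing undirected links.
    Returns a dict mapping each vertex to an integer class label.
    """
    neighbors = {v: set() for v in types}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    def refine(class_of, key):
        # Vertices stay together only if they share their old class and the key.
        signature = {v: (class_of[v], key(v)) for v in class_of}
        labels = {s: i for i, s in enumerate(sorted(set(signature.values()), key=repr))}
        return {v: labels[signature[v]] for v in class_of}

    # Step 1: split based on equal type.
    class_of = refine({v: 0 for v in types}, lambda v: types[v])
    # Step 2: split based on equal degree.
    class_of = refine(class_of, lambda v: len(neighbors[v]))
    # Step 3: split based on equal neighbor classes until nothing changes.
    while True:
        def neighbor_key(v, current=class_of):
            # Number of neighbors that v has in each current class.
            return tuple(sorted(Counter(current[n] for n in neighbors[v]).items()))

        refined = refine(class_of, neighbor_key)
        # Refinement never merges classes, so an equal count means a stable partition.
        if len(set(refined.values())) == len(set(class_of.values())):
            return refined
        class_of = refined
```

For a chain of five identical components a-b-c-d-e, steps 1 and 2 separate the two end vertices from the three interior ones, and step 3 then separates the middle vertex c from b and d because its neighbors fall in different classes.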
A2. Determining the Subsystem Hierarchies
The variable definitions are as follows:
View: Determines whether the algorithm is dealing with logical subsystems in the initial configuration or
physical subsystems.
Subsystem_Classes[V]: Number of subsystem classes in view V.
Subsystem_Types[C, V]: Number of subsystem types of class C in view V.
Subsystems[C, T, V]: Number of subsystems of type T and class C in view V.
Class_Hierarchy[C, V]: True if the hierarchy of subsystem class C in view V was obtained from the system
hierarchy or a separate file.
The function definitions are as follows:
Get_Class_Hierarchy (C, V): Returns true if it is able to obtain the hierarchy of subsystem class C in view V
from the system hierarchy or a separate file.
Read_Subsystem_Hierarchy (C, T, S, V): Reads the hierarchy file of subsystem S of type T and class C in
view V which assigns specific components to the subsystem.
Determine_Subsystem_Hierarchy (C, T, S, V): Determines the hierarchy of subsystem S of type T and class C
in view V from the hierarchy of its subsystem class and the component type arguments, if any, of its subsystem
type.
Subsystem_Hierarchies (View) {
    for (C = 0; C < Subsystem_Classes[View]; C++) {
        Class_Hierarchy[C, View] = Get_Class_Hierarchy (C, View);
        if (!Class_Hierarchy[C, View])
            for (T = 0; T < Subsystem_Types[C, View]; T++)
                for (S = 0; S < Subsystems[C, T, View]; S++)
                    Read_Subsystem_Hierarchy (C, T, S, View);
    }
    for (C = 0; C < Subsystem_Classes[View]; C++)
        if (Class_Hierarchy[C, View])
            for (T = 0; T < Subsystem_Types[C, View]; T++)
                for (S = 0; S < Subsystems[C, T, View]; S++)
                    Determine_Subsystem_Hierarchy (C, T, S, View);
} /* Subsystem_Hierarchies */
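The two-pass structure of this algorithm can be mirrored in a small runnable sketch. It is written in Python for illustration only and is not the ARM code: the three callables stand in for Get_Class_Hierarchy, Read_Subsystem_Hierarchy, and Determine_Subsystem_Hierarchy, the View argument is omitted for brevity, and the nested-dictionary input format is an assumption.

```python
def subsystem_hierarchies(classes, get_class_hierarchy, read_subsystem, determine_subsystem):
    """Collect a hierarchy for every subsystem in two passes.

    classes: dict mapping class name -> dict mapping type name -> list of subsystems.
    get_class_hierarchy(c): hypothetical; True if a class-level hierarchy is available.
    read_subsystem(c, t, s): hypothetical; reads a subsystem-specific hierarchy.
    determine_subsystem(c, t, s): hypothetical; derives a hierarchy from its class.
    """
    has_class_hierarchy = {}
    hierarchies = {}

    # Pass 1: classes without a class-level hierarchy are read per subsystem.
    for c, types in classes.items():
        has_class_hierarchy[c] = get_class_hierarchy(c)
        if not has_class_hierarchy[c]:
            for t, subsystems in types.items():
                for s in subsystems:
                    hierarchies[(c, t, s)] = read_subsystem(c, t, s)

    # Pass 2: the remaining classes derive each subsystem from the class hierarchy.
    for c, types in classes.items():
        if has_class_hierarchy[c]:
            for t, subsystems in types.items():
                for s in subsystems:
                    hierarchies[(c, t, s)] = determine_subsystem(c, t, s)

    return hierarchies
```

Passing the three operations in as callables keeps the sketch independent of how hierarchy files are actually stored, which this appendix does not specify.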
A3. Specifying Potentially General Transitions
The variable definitions are as follows:
Transitions[P]: Number of transitions of priority P.
Condition[P, T]: Transition condition T of priority P.
Destination[P, T]: Destination state T of priority P.
Rate[P, T]: Transition rate expression T of priority P.
LP: Lowest priority.
The function definition is as follows:
Condition_Or (I): Returns true if any of the transition conditions of priority I are true.
if (Condition_Or (1)) {
    for (T = 0; T < Transitions[1]; T++)
        if (Condition[1, T])
            tranto Destination[1, T] by Rate[1, T];
} else if (Condition_Or (2)) {
    for (T = 0; T < Transitions[2]; T++)
        if (Condition[2, T])
            tranto Destination[2, T] by Rate[2, T];
} /* Intermediate priorities, if any, follow the same pattern. */
else if (Condition_Or (LP)) {
    for (T = 0; T < Transitions[LP]; T++)
        if (Condition[LP, T])
            tranto Destination[LP, T] by Rate[LP, T];
}
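The effect of this priority scheme is that only the transitions of the highest priority level with at least one true condition are generated for a state. A minimal Python sketch of that selection follows; it is an illustration under stated assumptions, not the ARM or ASSIST implementation. The function name `enabled_transitions`, the representation of a state as a dictionary, and the rate labels in the usage example are all hypothetical.

```python
def enabled_transitions(transitions_by_priority, state):
    """Select the transitions of the highest priority level that applies.

    transitions_by_priority: dict mapping a priority number (1 is highest) to a
        list of (condition, destination, rate) tuples, where condition is a
        callable that takes the state and returns True or False.
    Returns (priority, [(destination, rate), ...]), or (None, []) if no
    condition is true at any priority.
    """
    for priority in sorted(transitions_by_priority):
        enabled = [
            (destination, rate)
            for condition, destination, rate in transitions_by_priority[priority]
            if condition(state)
        ]
        if enabled:  # plays the role of Condition_Or (priority)
            return priority, enabled
    return None, []


if __name__ == "__main__":
    # Made-up rules and rate labels, for illustration only.
    rules = {
        1: [(lambda s: s["active"] < 2, "system_failure", "FAILURE_RATE")],
        2: [
            (lambda s: s["spares"] > 0, "use_spare", "RECONFIG_RATE"),
            (lambda s: s["active"] > 2, "degrade", "RECONFIG_RATE"),
        ],
    }
    print(enabled_transitions(rules, {"active": 3, "spares": 1}))
```

Lower priority levels are never examined once a higher one applies, which corresponds to the else-if chain above.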
References
Avižienis, Algirdas; Gilley, George C.; Mathur, Francis P.;
Rennels, David A.; Rohr, John A.; and Rubin, David K.
1971: The STAR (Self-Testing and Repairing) Com-
puter: An Investigation of the Theory and Practice of
Fault-Tolerant Computer Design. IEEE Trans. Comput.,
vol. C-20, no. 11, Nov., pp. 1312-1321.
Bavuso, S. J.; Petersen, P. L.; and Rose, D. M. 1984: CARE
III Model Overview and User's Guide. NASA TM-85810.
Butler, Ricky W. 1986: An Abstract Language for Speci-
fying Markov Reliability Models. IEEE Trans. Reliab.,
vol. R-35, no. 5, Dec., pp. 595-601.
Butler, Ricky W. 1992: The SURE Approach to Reliability
Analysis. IEEE Trans. Reliab., vol. 41, no. 2, June,
pp. 210-218.
Butler, Ricky W.; and Stevenson, Philip H. 1988: The
PAWS and STEM Reliability Analysis Programs. NASA
TM-100572.
Butler, Ricky W.; and White, Allan L. 1988: SURE Reliability
Analysis--Program and Mathematics. NASA TP-2764.
Butler, Ricky W.; and Johnson, Sally C. 1990: The Art
of Fault-Tolerant System Reliability Modeling. NASA
TM-102623.
Castillo, Xavier; McConnel, Stephen R.; and Siewiorek,
Daniel P. 1982: Derivation and Calibration of a Transient
Error Reliability Model. IEEE Trans. Comput.,
vol. C-31, no. 7, July, pp. 658-671.
Chung, Kai Lai 1967: Markov Chains--With Stationary Tran-
sition Probabilities, Second ed. Springer-Verlag.
Cohen, Gerald C.; and McCann, Catherine M. 1990: Relia-
bility Model Generator Specification. NASA CR-182005.
Cox, D. R.; and Miller, H. D. 1965: The Theory of Stochastic
Processes. John Wiley & Sons, Inc.
Dugan, Joanne Bechta; Trivedi, Kishor S.; Geist, Robert M.;
and Nicola, Victor F. 1984: Extended Stochastic Petri
Nets: Applications and Analysis. Performance '84--Models
of Computer System Performance, E. Gelenbe,
ed., Elsevier Science Publ. Co., Inc., pp. 507-519.
Dugan, Joanne Bechta; Trivedi, Kishor S.; Smotherman,
Mark K.; and Geist, Robert M. 1986: The Hybrid Automated
Reliability Predictor. AIAA J. Guid., Control & Dyn.,
vol. 9, no. 3, May-June, pp. 319-331.
Goldberg, Jack; Kautz, William H.; Melliar-Smith, P.
Michael; Green, Milton W.; Levitt, Karl N.; Schwartz,
Richard L.; and Weinstock, Charles B. 1984: Development
and Analysis of the Software Implemented Fault-Tolerance
(SIFT) Computer. NASA CR-172146.
Howell, Sandra V.; Bavuso, Salvatore J.; and Haley, Pamela J.
1990: A Graphical Language for Reliability Model Generation.
Proceedings of 1990 Annual Reliability and Maintainability
Symposium, Inst. of Electrical and Electronics
Engineers, Inc., pp. 471-475.
Johnson, Sally C. 1986: ASSIST User's Manual. NASA
TM-87735.
Johnson, Sally C. 1988: Reliability Analysis of Large, Com-
plex Systems Using ASSIST. A Collection of Techni-
cal Papers, Part I--AIAA/IEEE 8th Digital Avionics
Systems Conference, Oct., pp. 227-234. (Available as
AIAA-88-3898-CP.)
Katzman, James A. 1977: System Architecture for Non-
stop Computing. 14th IEEE Computer Society Inter-
national Conference, IEEE Catalog No. 77CHl165-OC,
IEEE Computer Soc., pp. 77-80.
Kini, Vittal; and Siewiorek, Daniel P. 1982: Automatic
Generation of Symbolic Reliability Functions for
Processor-Memory-Switch Structures. IEEE Trans. Comput.,
vol. C-31, no. 8, Aug., pp. 752-771.
Landrault, C.; and Laprie, J.-C. 1978: SURF--A Program
for Modeling and Reliability Prediction for Fault-Tolerant
Computing Systems. Information Technology 78: Pro-
ceedings of the Third Jerusalem Conference on Informa-
tion Technology, Josef Moneta, ed., North-Holland Publ.
Co., pp. 17-26.
Laprie, Jean-Claude 1985: Dependable Computing and Fault
Tolerance: Concepts and Terminology. The Fifteenth
Annual International Symposium on Fault-Tolerant
Computing Digest of Papers, IEEE Catalog No.
85CH2143-6, IEEE Computer Soc., pp. 1-11.
Lee, Larry D. 1985: Reliability Bounds for Fault-Tolerant Sys-
tems With Competing Responses to Component Failures.
NASA TP-2409.
Liceaga, Carlos A. 1992: Automatic Specification of Relia-
bility Models for Life-Critical Processor-Memory-Switch
Structures. Ph.D. Diss., Carnegie-Mellon Univ., Apr.
Makam, Srinivas V.; and Avižienis, Algirdas 1982: ARIES 81:
A Reliability and Life-Cycle Evaluation Tool for Fault-
Tolerant Systems. FTCS 12th Annual International Sym-
posium, Fault-Tolerant Computing--Digest of Papers,
IEEE Catalog No. 82CH1760-8, IEEE Computer Soc.,
pp. 267-274.
Makam, Srinivas V.; Avižienis, Algirdas; and Grusas,
Gintaras 1982: UCLA ARIES 82 User's Guide. Rep.
No. CSD-82-830 (ONR Contract No. N00014-79C-0866),
Univ. of California, Aug.
McConnel, Stephen Roy 1981: Analysis and Modeling of
Transient Errors in Digital Computers. Ph.D. Diss.,
Carnegie-Mellon Univ.
Moser, Louise; Melliar-Smith, Michael; and Schwartz, Richard
1987: Design Verification of SIFT. NASA CR-4097.
Myers, Glenford J. 1979: The Art of Software Testing. John
Wiley & Sons, Inc.
Palumbo, Daniel L.; and Nicol, David M. 1990: Generation
and Analysis of Large Reliability Models. Proceedings of
9th IEEE/AIAA/NASA Digital Avionics Systems Con-
ference, Inst. of Electrical and Electronics Engineers, Inc.,
pp. 350-354.
Ramamoorthy, C. V.; and Bastani, Farokh B. 1982: Software
Reliability--Status and Perspectives. IEEE Trans. Softw.
Eng., vol. SE-8, no. 4, July, pp. 354-371.
Romanovsky, V. I. (E. Seneta, transl.) 1970: Discrete Markov
Chains. Wolters-Noordhoff Publ., Groningen
(Netherlands).
Siewiorek, Daniel P.; Bell, C. Gordon; and Newell, Allen
1982: Computer Structures: Principles and Examples.
McGraw-Hill, Inc.
Siewiorek, Daniel P.; and Swarz, Robert S. 1992: Reliable
Computer Systems: Design and Evaluation, 2nd ed. Dig-
ital Press.
Stiffler, J. J.; Bryant, L. A.; and Guccione, L. 1979: CARE III
Final Report, Phase I--Volume I. NASA CR-159122.
Szczur, Martha R. 1990: TAE Plus: Transportable Applica-
tions Environment Plus--A User Interface Development
Tool for Building X Window-Based Applications. Pro-
ceedings of the 4th MIT Annual X Conference.
Trivedi, Kishor S.; and Geist, Robert M. 1981: A Tutorial on
the CARE III Approach to Reliability Modeling. NASA
CR-3488.
U.S. Dep. of Defense 1991: Reliability Prediction of Elec-
tronic Equipment. MIL-HDBK-217F, Dec. 2. (Supersedes
MIL-HDBK-217E, Notice 1, Jan. 2, 1990.)
Vlissides, John M. 1990: Generalized Graphical Object Edit-
ing. Tech. Rep. CSL-TR-90-427 (NASA Grant NAGW-
419), Stanford Univ., June.
White, Allan L. 1986: Reliability Estimation for Reconfigurable
Systems With Fast Recovery. Microelectron. & Reliab.,
vol. 26, no. 6, pp. 1111-1120.
White, Allan L.; and Palumbo, Daniel L. 1990: State Re-
duction for Semi-Markov Reliability Models. Annual Re-
liability and Maintainability Symposium--1990 Proceed-
ings, IEEE Catalog No. 90CH2804-3, Inst. of Electrical
and Electronics Engineers, Inc., pp. 280-285.

REPORT DOCUMENTATION PAGE
Form Approved, OMB No. 0704-0188
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE: July 1993
3. REPORT TYPE AND DATES COVERED: Technical Paper
4. TITLE AND SUBTITLE: Automatic Specification of Reliability Models for Fault-Tolerant Computers
5. FUNDING NUMBERS: WU 505-64-10-07
6. AUTHOR(S): Carlos A. Liceaga and Daniel P. Siewiorek
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): NASA Langley Research Center, Hampton, VA 23681-0001
8. PERFORMING ORGANIZATION REPORT NUMBER: L-17144
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): National Aeronautics and Space Administration, Washington, DC 20546-0001
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: NASA TP-3301
11. SUPPLEMENTARY NOTES
Liceaga: Langley Research Center, Hampton, VA; Siewiorek: Carnegie Mellon University, Pittsburgh, PA.
12a. DISTRIBUTION/AVAILABILITY STATEMENT: Unclassified-Unlimited, Subject Category 66
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
The calculation of reliability measures using Markov models is required for life-critical processor-memory-switch
structures that have standby redundancy or that are subject to transient or intermittent faults or repair. The
task of specifying these models is tedious and prone to human error because of the large number of states and
transitions required in any reasonable system. Therefore, model specification is a major analysis bottleneck,
and model verification is a major validation problem. The general unfamiliarity of computer architects with
Markov modeling techniques further increases the necessity of automating the model specification. Automation
requires a general system description language (SDL). For practicality, this SDL should also provide a high level
of abstraction and be easy to learn and use. This paper presents the first attempt to define and implement an
SDL with those characteristics. A program named Automated Reliability Modeling (ARM) was constructed as
a research vehicle. The ARM program uses a graphical interface as its SDL, and it outputs a Markov reliability
model specification formulated for direct use by programs that generate and evaluate the model.
14. SUBJECT TERMS
System description language; Computer-aided design; Graphical user interface;
Reliability modeling; Markov models; Fault tolerance
15. NUMBER OF PAGES: 68
16. PRICE CODE: A04
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT
20. LIMITATION OF ABSTRACT
NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18
298-102
NASA-Langley, 1993
