Experimental analysis of computer system dependability by Tang, Dong & Iyer, Ravishankar, K.
NASA-CR-193401
June 1993 UILU-ENG-93-2227
CRHC-93-15
Center for Reliable and High-Performance Computing
EXPERIMENTAL
ANALYSIS OF
COMPUTER SYSTEM
DEPENDABILITY
Ravishankar K. Iyer
Dong Tang
(NASA-CR-193401) EXPERIMENTAL
ANALYSIS OF COMPUTER SYSTEM
DEPENDABILITY (Illinois Univ.)
118 p
N93-32233
Unclas
G3/60 0176675
Coordinated Science Laboratory
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
https://ntrs.nasa.gov/search.jsp?R=19930023044 2020-03-17T05:15:35+00:00Z
Experimental Analysis of Computer System Dependability
Ravishankar K. Iyer and Dong Tang
Technical Report CRHC-93-15
Center for Reliable and High-Performance Computing
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
© 1993 R.K. Iyer & D. Tang
Urbana, Illinois, USA
July 1993
ABSTRACT
This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system
dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault
injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to
dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase.
Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a
classification of research methods or study topics is outlined, followed by the discussion of these methods or topics as
well as representative studies.
The statistical techniques introduced include the estimation of parameters and confidence intervals, probability
distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique
used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers elec-
trical-level, logic-level, and function-level fault injection methods as well as representative simulation environments
such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation
fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE.
The discussion of measurement-based analysis covers measurement and data processing techniques, basic error
characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The
discussion involves several important issues studied in the area, including fault models, fast simulation techniques,
workload/failure dependency, correlated failures, and software fault tolerance.
ACKNOWLEDGMENTS
The authors thank Kumar Goswami, Timothy Tsai, Gwan Choi, and Weilun Kao for their contributions to this
manuscript. Special thanks go to Fran Wagner for proofreading the whole manuscript. Thanks are also extended to
Inhwan Lee and Darren Sawyer for their comments on the manuscript.
We highly appreciate D. P. Siewiorek, J. Arlat, E. W. Czeck, J. Karlsson, and G. B. Finelli for permission to use
figures (Figures 3.6, 4.2, 4.3, 4.4, 4.8, and 5.11) and algorithms (Sections 3.2.2 and 5.7.1) in their publications. Figure
3.9 is generated from "NEST: A Network Simulation and Prototyping Testbed," authored by A. Dupuy, J. Schwartz, Y.
Yemini, and D. Bacon, Communications of the ACM, Vol. 33, No. 10, copyright ACM 1990, by permission of the
ACM.
TABLE OF CONTENTS
1. INTRODUCTION
2. STATISTICAL TECHNIQUES USED IN THE AREA
2.1. Parameter Estimation
2.1.1. Point Estimation
2.1.2. Interval Estimation
2.2. Distribution Characterization
2.2.1. Empirical Distribution
2.2.2. Function Fitting
2.3. Multivariate Analysis
2.3.1. Clustering Analysis
2.3.2. Correlation Analysis
2.3.3. Factor Analysis
2.4. Importance Sampling
2.4.1. Overview of the Method
2.4.2. Applications in DTMC Simulation
3. DESIGN PHASE
3.1. Simulated Fault Injection at the Electrical Level
3.1.1. Simulation of a Microprocessor-Based Chip
3.1.2. FOCUS -- A Chip-Level Simulation Environment
3.2. Simulated Fault Injection at the Logic Level
3.2.1. Study of Bendix BDX-930
3.2.2. Study of IBM PC RT
3.3. Simulated Fault Injection at the Function Level
3.3.1. NEST -- A Network Simulation Testbed
3.3.2. DEPEND -- A System Dependability Analysis Environment
4. PROTOTYPE PHASE
4.1. Hardware-Implemented Fault Injection
4.1.1. FTMP
4.1.2. MESSALINE
4.2. Software-Implemented Fault Injection
4.2.1. FIAT
4.2.2. FERRARI
4.2.3. HYBRID
4.2.4. FINE
4.3. Radiation-Induced Fault Injection
5. OPERATIONAL PHASE
5.1. Measurements
5.2. Data Processing
5.3. Preliminary Analysis
5.3.1. Basic Statistics
5.3.2. Empirical TTE Distributions and Hazard Rates
5.3.3. Analytical TTE Distributions
5.4. Dependency Analysis
5.4.1. Workload/Failure Dependency
5.4.2. Two-Way Dependency
5.4.3. Multi-Way Dependency
5.5. Markov Reward Modeling
5.5.1. Modeling of a Distributed System
5.5.2. Modeling of an Operating System
5.6. Software Dependability
5.6.1. Error Interactions
5.6.2. Software Fault Tolerance
5.6.3. Software Defect Classification
5.7. Failure Prediction
5.7.1. Prediction Based on Heuristic Trend Analysis
5.7.2. Prediction Based on Statistical Analysis
6. CONCLUSION
REFERENCES
I. INTRODUCTION
In computer science more than in physical sciences, the experimenter must decide what to consider and what to
ignore in data gathering and analysis, sometimes without the benefit of prior information or easily available intuition.
How to obtain general models from experiments or measurements made in a particular environment is by no means
clear. This paper discusses the current research in the area of experimental analysis of computer system dependability.
The discussion centers around methodologies, major developments, and major directions of the research in the area.
Experimental evaluation of the dependability of a system can be performed at different phases of the system's
life. In the early design phase, CAD (Computer-Aided Design) environments are used to evaluate a design via simu-
lations, including simulated fault injections. Fault injection simulations can be used to investigate the effectiveness of
fault tolerant mechanisms, to evaluate system dependability, and to provide timely feedback to system designers.
However, simulations need accurate input parameters and the validation of output results. These should be estimated
based on previous measurement-based analysis. In the prototype phase, a system runs under controlled workload con-
ditions, and controlled fault injections are used to evaluate the system behavior under faults. Fault injections on real
systems can provide information about the process from fault occurrence to system recovery, including error latency,
propagation, detection, and recovery (reconfiguration may be involved). But fault injection can only study artificial
faults, and it cannot provide some dependability measures such as MTBF (Mean Time Between Failures) and avail-
ability. In the operational phase, a direct measurement-based approach can be used to measure systems in the field
under real workloads. The collected data contain a large amount of information about naturally occurring
errors/failures. Analysis of such data can provide understanding of actual error/failure characteristics and insight into
analytical models. Although measurement-based analysis is useful for evaluating real systems, it is limited to detected
errors. Further, conditions in the field can vary widely from one system to another, casting doubt on the statistical
validity of the results. Thus, all three approaches are complementary and essential for accurate dependability analysis.
In the design phase, fault injection simulations can be conducted at different levels: the electrical level, the logic
level, and the function level. The objectives of simulated fault injection are to determine dependability bottlenecks,
the coverage of error detection/recovery mechanisms, the effectiveness of reconfiguration schemes, the system TTF
(Time To Failure) distributions, reliability, availability, performance loss, and other dependability measures. The
resulting feedback of simulations can be extremely useful in cost-effective redesign of a system. In this paper, we
discuss different techniques used in fault injection simulations. We also introduce different levels of simulation tools.
In the prototype phase, while the objectives of physical fault injections are similar to those of simulated fault
injections in the design phase, the methods are radically different because real fault injection and monitoring facilities
are involved. Physical fault injections can be conducted at the hardware level (logic or electrical) or at the software
level (code or data corruption). Further, heavy-ion radiation techniques can also be used to inject faults and stress a
system. Instrumentation used in fault injection experiments is illustrated using real examples, including several
fault injection environments.
In the operational phase, the measurement-based approach needs to address issues such as how to monitor com-
puter errors and failures and how to analyze measured data to quantify system dependability characteristics. Although
there is extensive research on methods for the design and evaluation of fault tolerant systems, little is known about
how well these strategies work in the field. A study of production systems is valuable not only for accurate evaluation
but also for identifying reliability bottlenecks in system design. Several issues in measurement-based analysis, includ-
ing workload/failure dependency, modeling and evaluation based on data, software dependability in the operational
phase, and fault diagnosis are addressed.
Results of measurement-based analysis discussed in this paper are based on over 100 machine-years of data
gathered from IBM, DEC, and Tandem systems. The evaluation methodology discussed includes: the use of the
measured hardware and software error data to jointly characterize the interdependence between performance and
dependability, correlation analysis to quantify correlated failures and their impact on dependability, Markov reward modeling
of measured data to evaluate the loss of system service due to errors and failures, and algorithms that use on-line error
logs to perform automatic fault diagnosis and failure prediction.
Before discussing methodologies and developments for each of the three phases discussed above, we present an
overview of the relevant statistical techniques used in this area. The techniques cover the estimation of parameters
and confidence intervals, distribution characterization including function fitting, and multivariate analysis methods
including clustering analysis, correlation analysis, and factor analysis. Importance sampling, a statistical technique to
accelerate Monte Carlo simulation, is also introduced. These techniques are later used in the discussion of analysis of
data obtained from fault injections or measured from operational systems.
In discussing the experimental analysis approaches used in the three phases, a wide range of dependability
issues, including error latency, error propagation, error detection, error recovery, error correlation, workload/error
dependency, availability, reliability, performability, and reward rate, are addressed. In addition to presenting
methodologies and major developments in each of these phases, we also critique the relative merits and research issues for
different approaches. Most evaluation techniques introduced are illustrated via case studies of their uses on actual
systems.
II. STATISTICAL TECHNIQUES USED IN THE AREA
In this section, we will introduce several statistical techniques commonly used in the analysis of data collected
from fault injections and operational systems and used in simulation. The techniques discussed are not intended to be
comprehensive. For a comprehensive study of statistical methods, the reader is referred to the advanced texts of
statistics [Kendall77], [Dillon84]. In particular, we will discuss parameter estimation, distribution characterization,
and multivariate analysis techniques. Most of these techniques are widely used in every phase of the experimental
evaluation of dependability.
2.1. Parameter Estimation
The most important characteristics of a random variable are its distribution, mean, and variance. In practice,
means and variances are usually unknown parameters. Thus, how to estimate these unknown parameters from data
needs to be addressed.
2.1.1. Point Estimation
Point estimation is often used in experimental analysis, such as the estimation of the detection coverage from
fault injections and the estimation of MTBF (mean time between failures) from field data. Each fault injection and
each failure occurrence can be treated as an experiment. The following theory is based on the assumption that all
experiments are independent and have the same underlying distribution.
Given a collection of n experimental outcomes $x_1, x_2, \ldots, x_n$ of a random variable $X$, each $x_i$ can be considered
as a value of a random variable $X_i$. These $X_i$'s are independent of each other and identical to $X$ in distribution. The
set $\{X_1, X_2, \ldots, X_n\}$ is called a random sample of $X$. Our purpose is to estimate the value of some parameter $\theta$ ($\theta$
could be $E[X]$ or $\mathrm{Var}[X]$) of $X$ using a function of $X_1, X_2, \ldots, X_n$. The function used to estimate $\theta$,
$\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$, is called an estimator of $\theta$, and $\hat{\theta}(x_1, x_2, \ldots, x_n)$ is said to be a point estimate of $\theta$.
An estimator $\hat{\theta}$ is called an unbiased estimator of $\theta$ if $E[\hat{\theta}] = \theta$. The unbiased estimator that has the minimum
variance, i.e., the one that minimizes $\mathrm{Var}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$ among all $\hat{\theta}$'s, is said to be the unbiased minimum variance estimator.
It can be shown that the sample mean

$$ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{2.1} $$

is the unbiased minimum variance linear estimator of the population mean $\mu$, and the sample variance

$$ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \tag{2.2} $$

is, under some mild conditions, an unbiased minimum variance quadratic estimator of the population variance $\sigma^2$. If
an estimator $\hat{\theta}$ converges in probability to $\theta$, i.e.,

$$ \lim_{n \to \infty} P\left( \left| \hat{\theta}(X_1, X_2, \ldots, X_n) - \theta \right| > \epsilon \right) = 0, \tag{2.3} $$

where $\epsilon$ is any small positive number, it is said to be consistent.
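As a minimal sketch of point estimation, the following Python fragment computes the sample mean and the unbiased sample variance (the one with the n - 1 divisor) for a small data set. The function names and the failure-time values are hypothetical and serve only as an illustration; they are not taken from any measured system.

```python
from typing import Sequence


def sample_mean(xs: Sequence[float]) -> float:
    """Sample mean of the observations."""
    return sum(xs) / len(xs)


def sample_variance(xs: Sequence[float]) -> float:
    """Unbiased sample variance, using the n - 1 divisor."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sample_mean(xs)
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Hypothetical times between failures, in hours (illustrative values only).
tbf_hours = [12.0, 30.0, 7.0, 45.0, 16.0]
mtbf_estimate = sample_mean(tbf_hours)          # point estimate of the MTBF
variance_estimate = sample_variance(tbf_hours)  # point estimate of the variance
print(mtbf_estimate, variance_estimate)
```

With these five values the point estimate of the MTBF is 22 hours. With so few observations the estimate is crude, which is the motivation for the interval estimates of Section 2.1.2.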
A. Method of Maximum-Likelihood
If the functional form of the p.d.f, of the variable is known, the maximum likelihood is a good approach to
parameter estimation. In many cases, approximate functional forms of empirical distributions can be obtained (to be
discussed in Section 2.2). For example, the software TTE (Time To Error) in two measured distributed operating
systems was shown to have a hyperexponential distribution (see Section 5.3). In such cases, the maximum likelihood
method can be used to determine distribution parameters.
The idea of the maximum likelihood method is to choose an estimator based on the assumption that the
observed sample is the most likely to occur among all possible samples. The method usually produces estimators that
have minimum variance and are consistent. But if the sample size is small, the estimator may be biased.
Assuming $X$ has a p.d.f. (probability distribution function) $f(x \mid \theta)$, where $\theta$ is an unknown parameter, the joint
p.d.f. of the sample $\{X_1, X_2, \ldots, X_n\}$,

$$ L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta), \tag{2.4} $$

is called the likelihood function of $\theta$. If $\hat{\theta}(x_1, x_2, \ldots, x_n)$ is the point estimate of $\theta$ that maximizes $L(\theta)$, then
$\hat{\theta}(X_1, X_2, \ldots, X_n)$ is said to be the maximum likelihood estimator of $\theta$.
Now we use an example to illustrate the method. Let $X$ denote the random variable "time between failures" in a
computer system. Assuming $X$ is exponentially distributed with an arrival rate $\lambda$, we wish to estimate $\lambda$ from a random
sample $\{X_1, X_2, \ldots, X_n\}$. By Equation (2.4),

$$ L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_{i=1}^{n} x_i}. $$

How do we choose an estimator such that the estimated $\lambda$ maximizes $L(\lambda)$? An easier way is to find the $\lambda$ value
that maximizes $\ln L(\lambda)$, instead of $L(\lambda)$. This is because the $\lambda$ that maximizes $L(\lambda)$ also maximizes $\ln L(\lambda)$, and $\ln L(\lambda)$
is easier to handle. In this case we have

$$ \ln L(\lambda) = n \ln(\lambda) - \lambda \sum_{i=1}^{n} x_i. $$

To find the maximum, consider the first derivative

$$ \frac{d[\ln L(\lambda)]}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i. $$

The solution of this equation at zero,

$$ \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} X_i}, $$

is the maximum likelihood estimator for $\lambda$.
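The exponential example can be checked numerically. The sketch below, written under the assumption that the inter-failure times are available as a plain list, computes the closed-form estimate n / (sum of the x_i) and evaluates the log-likelihood at nearby rates to confirm that the estimate is a maximum. The function names and the data are hypothetical.

```python
import math
from typing import Sequence


def exponential_rate_mle(xs: Sequence[float]) -> float:
    """Maximum likelihood estimate of the rate: n divided by the sum of x_i."""
    if not xs or any(x <= 0 for x in xs):
        raise ValueError("need positive inter-failure times")
    return len(xs) / sum(xs)


def exponential_log_likelihood(rate: float, xs: Sequence[float]) -> float:
    """ln L(rate) = n ln(rate) - rate * sum(x_i) for an exponential sample."""
    return len(xs) * math.log(rate) - rate * sum(xs)


# Hypothetical times between failures (illustrative values only).
times = [2.0, 4.0, 6.0, 8.0]
rate_hat = exponential_rate_mle(times)

# The log-likelihood at the estimate should exceed its value at nearby rates.
at_estimate = exponential_log_likelihood(rate_hat, times)
below = exponential_log_likelihood(rate_hat * 0.75, times)
above = exponential_log_likelihood(rate_hat * 1.25, times)
print(rate_hat, at_estimate > below, at_estimate > above)
```

Comparing the log-likelihood at neighbouring rates is only a sanity check; it does not replace the derivative argument given in the text.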
B. Method of Moments
Sometimes it is impossible to find maximum likelihood estimators in closed form. For instance, it is difficult to
maximize the likelihood function built from the following p.d.f. of the gamma distribution $G(\alpha, \theta)$

$$ g(x) = \frac{1}{\Gamma(\alpha)\,\theta^{\alpha}}\, x^{\alpha-1} e^{-x/\theta} $$

in estimating $\alpha$ and $\theta$, because of the existence of the gamma function $\Gamma(\alpha)$. The gamma distribution is often found
useful for characterizing interval times in the real world. It will be shown in Section 5.3 that the software TTE in a
measured single-machine operating system fits a multi-stage gamma distribution. In such cases, the method of
moments can be used if an analytical relationship between the moments of the variable and the parameters to estimate
can be found.
To introduce the method of moments, we first bring out the concepts of sample moment and population
moment. The k-th (k = 1, 2, ...) sample moment of the random variable $X$ is defined as

$$ m_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k, \tag{2.5} $$

where $X_1, X_2, \ldots, X_n$ are a sample of $X$. The k-th population moment of $X$ is just $E[X^k]$.
Suppose there are k parameters to be estimated. The idea of the method of moments is to set the first k sample
moments equal to the first k population moments which are expressed as the unknown parameters, and then to solve
these k equations for the unknown parameters. The method usually gives simple and consistent estimators. However,
some estimators may not have unbiased and minimum variance properties. The following example shows details of
the method.
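Consider the above gamma distribution example. We wish to estimate $\alpha$ and $\theta$, based on a sample
$\{X_1, X_2, \ldots, X_n\}$ from a gamma distribution. Since $X \sim G(\alpha, \theta)$, we know

$$ E[X] = \alpha\theta, \qquad E[X^2] = \alpha^2\theta^2 + \alpha\theta^2. $$

The first two sample moments, by definition, are given by

$$ m_1 = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}, \qquad m_2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 \approx S^2 + \bar{X}^2, $$

where the approximation replaces the factor $(n-1)/n$ multiplying $S^2$ by 1, which is accurate for large $n$.
Setting $m_1 = E[X]$ and $m_2 = E[X^2]$ and solving for $\alpha$ and $\theta$, we obtain

$$ \hat{\alpha} = \frac{\bar{X}^2}{S^2}, \qquad \hat{\theta} = \frac{S^2}{\bar{X}}. $$

These are the estimators for $\alpha$ and $\theta$ from the method of moments.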
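A minimal sketch of the gamma example follows. It assumes the moment equations have been solved to give the shape estimate as the squared sample mean divided by the sample variance, and the scale estimate as the sample variance divided by the sample mean, with the sample variance computed using the n - 1 divisor. The function name and the data are hypothetical.

```python
from typing import Sequence, Tuple


def gamma_moment_estimates(xs: Sequence[float]) -> Tuple[float, float]:
    """Method-of-moments estimates (alpha_hat, theta_hat) for a gamma sample."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
    alpha_hat = mean ** 2 / s2   # shape estimate
    theta_hat = s2 / mean        # scale estimate
    return alpha_hat, theta_hat


# Hypothetical interval times (illustrative values only).
intervals = [1.0, 2.0, 3.0, 4.0, 5.0]
alpha_hat, theta_hat = gamma_moment_estimates(intervals)
print(alpha_hat, theta_hat)
```

By construction the product of the two estimates reproduces the sample mean, which gives a quick consistency check on any implementation.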
2.1.2. Interval Estimation
So far our discussion is limited to the point estimation of unknown parameters. The estimate may deviate from
the actual parameter value. To obtain an estimate with a high confidence, it is necessary to construct an interval
estimate such that the interval includes the actual parameter value with a high probability. Given an estimator $\hat{\theta}$, if

$$ P(\hat{\theta} - \epsilon_1 < \theta < \hat{\theta} + \epsilon_2) = \beta, \tag{2.6} $$

the random interval $(\hat{\theta} - \epsilon_1, \hat{\theta} + \epsilon_2)$ is said to be a $100 \times \beta\%$ confidence interval for $\theta$, and $\beta$ is called the confidence
coefficient (the probability that the confidence interval contains $\theta$).
A. Confidence Intervals for Means
In the following discussion, the sample mean $\bar{X}$ will be used as the estimator for the population mean. As
mentioned before, it is the unbiased minimum variance linear estimator for $\mu$. We first consider the case in which the
sample size is large. By the central limit theorem, $\bar{X}$ is asymptotically normally distributed, no matter what the population
distribution is. Thus, when the sample size n is reasonably large (usually 30 or above, sometimes at least 50 if the
population distribution is badly skewed with occasional outliers), $Z = (\bar{X} - \mu)/(S/\sqrt{n})$ can be approximately treated as
a standard normal variable. To obtain a $100\beta\%$ confidence interval for $\mu$, we can find a number $z_{\alpha/2}$ from the N(0, 1)
distribution table such that $P(Z > z_{\alpha/2}) = \alpha/2$, where $\alpha = 1 - \beta$. Then we have

$$ P\left( -z_{\alpha/2} < \frac{\bar{X} - \mu}{S/\sqrt{n}} < z_{\alpha/2} \right) = 1 - \alpha. $$

Thus, the $100(1 - \alpha)\%$ confidence interval for $\mu$ is approximately

$$ \bar{X} - z_{\alpha/2} \frac{S}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2} \frac{S}{\sqrt{n}}. \tag{2.7} $$
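The large-sample interval for the mean can be sketched with the Python standard library, which provides the standard normal quantile through `statistics.NormalDist`. The function name and the sample below are hypothetical, and the sketch assumes the sample is large enough (30 or more observations) for the normal approximation to apply.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def mean_confidence_interval(xs: Sequence[float],
                             confidence: float = 0.95) -> Tuple[float, float]:
    """Approximate large-sample confidence interval for the population mean."""
    n = len(xs)
    alpha = 1.0 - confidence
    # Upper alpha/2 point of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    center = mean(xs)
    # statistics.stdev uses the n - 1 divisor, i.e. it returns S.
    half_width = z * stdev(xs) / math.sqrt(n)
    return center - half_width, center + half_width


# Hypothetical sample of 36 observations (illustrative values only).
sample = [10.0, 14.0] * 18
low, high = mean_confidence_interval(sample, confidence=0.95)
print(low, high)
```

For small samples the normal quantile should be replaced by a Student t or chi-square quantile, as discussed next.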
If the sample size is small (considerably smaller than 30), the above approximation can be poor. In this case, we
consider two commonly used distributions: normal and exponential. If the population distribution is normal, the
random variable $T = (\bar{X} - \mu)/(S/\sqrt{n})$ has a Student t distribution with $n - 1$ degrees of freedom. By repeating the same
approach performed above with a t distribution table, the following $100(1 - \alpha)\%$ confidence interval for $\mu$ can be
obtained:

$$ \bar{X} - t_{n-1;\alpha/2} \frac{S}{\sqrt{n}} < \mu < \bar{X} + t_{n-1;\alpha/2} \frac{S}{\sqrt{n}}, \tag{2.8} $$

where $t_{n-1;\alpha/2}$ is a number such that $P(T > t_{n-1;\alpha/2}) = \alpha/2$. Theoretically, Equation (2.8) requires that $X$ have a normal
distribution. However, we will show later that the estimator is not very sensitive to the distribution of $X$ when the
sample size is reasonably large (15 or more).
If the population distribution is exponential, it can be shown that $\chi^2 = 2n\bar{X}/\mu$ has a chi-square distribution with
$2n$ degrees of freedom. Thus, the chi-square distribution table should be used. Because the chi-square distribution is
not symmetrical about the origin, we need to find two numbers, $\chi^2_{2n;1-\alpha/2}$ and $\chi^2_{2n;\alpha/2}$, such that $P(\chi^2 < \chi^2_{2n;1-\alpha/2}) = \alpha/2$
and $P(\chi^2 > \chi^2_{2n;\alpha/2}) = \alpha/2$. The obtained $100(1 - \alpha)\%$ confidence interval for $\mu$ is

$$ \frac{2n\bar{X}}{\chi^2_{2n;\alpha/2}} < \mu < \frac{2n\bar{X}}{\chi^2_{2n;1-\alpha/2}}. \tag{2.9} $$
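The exponential case can be sketched without a statistical table, because a chi-square variable with an even number of degrees of freedom has a closed-form upper tail (a finite Poisson-type sum). The code below uses that identity and a bisection search to find the two chi-square points, then forms the interval for the mean. All function names and data are hypothetical, the bisection is a substitute for the table lookup described in the text, and the sketch is intended for small samples only.

```python
import math
from typing import Sequence, Tuple


def chi2_upper_tail_even(x: float, dof: int) -> float:
    """P(chi-square with even dof > x) = exp(-x/2) * sum_{j < dof/2} (x/2)^j / j!."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, dof // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def chi2_upper_quantile_even(tail_prob: float, dof: int) -> float:
    """Value c with P(chi-square > c) = tail_prob, located by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_upper_tail_even(hi, dof) > tail_prob:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail_even(mid, dof) > tail_prob:
            lo = mid  # tail still too large, so the quantile lies to the right
        else:
            hi = mid
    return (lo + hi) / 2.0


def exponential_mean_ci(xs: Sequence[float],
                        confidence: float = 0.90) -> Tuple[float, float]:
    """Confidence interval for the mean of an exponential population."""
    n = len(xs)
    alpha = 1.0 - confidence
    two_n_xbar = 2.0 * sum(xs)  # equals 2 * n * sample mean
    upper_point = chi2_upper_quantile_even(alpha / 2.0, 2 * n)
    lower_point = chi2_upper_quantile_even(1.0 - alpha / 2.0, 2 * n)
    return two_n_xbar / upper_point, two_n_xbar / lower_point


# Hypothetical times between failures (illustrative values only).
low, high = exponential_mean_ci([5.0, 10.0, 15.0, 20.0, 50.0], confidence=0.90)
print(low, high)
```

The resulting interval is not symmetric about the sample mean, which reflects the skewness of the chi-square distribution.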
B. Confidence Intervals for Variances
The estimation of confidence intervals for variances is more complicated than that for means, because the sample
variance cannot be simply approximated by a unique distribution (such as the normal distribution) regardless of the
population distribution. However, irrespective of the population distribution, $\lim_{n \to \infty} \mathrm{Var}[S^2] = 0$. Thus, a good confidence
interval can be expected as long as the sample size n is large. Next, our discussion will be focused on the two
commonly used distributions: normal and exponential.
If $X$ is normally distributed, the sample variance $S^2$ can be used to construct the confidence interval. It is known
that the random variable $(n-1)S^2/\sigma^2$ has a chi-square distribution with $n - 1$ degrees of freedom. To determine a
$100(1 - \alpha)\%$ confidence interval for $\sigma^2$, we follow the procedure for constructing Equation (2.9) to find the numbers
$\chi^2_{n-1;1-\alpha/2}$ and $\chi^2_{n-1;\alpha/2}$ from the chi-square distribution table. The confidence interval is then given by

$$ \frac{(n-1)S^2}{\chi^2_{n-1;\alpha/2}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{n-1;1-\alpha/2}}. \tag{2.10} $$

Similar to Equation (2.8), our experience shows that this equation is not restricted to the normal distribution when the
sample size is reasonably large (15 or more).
If $X$ is exponentially distributed, Equation (2.9) can be used to estimate the confidence interval for $\sigma^2$, because
for the exponential random variable, $\sigma^2$ equals $\mu^2$. Since all terms in Equation (2.9) are positive, we can square
them. The result gives a $100(1 - \alpha)\%$ confidence interval for $\sigma^2$:

$$ \left( \frac{2n\bar{X}}{\chi^2_{2n;\alpha/2}} \right)^2 < \sigma^2 < \left( \frac{2n\bar{X}}{\chi^2_{2n;1-\alpha/2}} \right)^2. \tag{2.11} $$
C. Confidence Intervals for Proportions
Often, we need to estimate the confidence interval for a proportion or percentage whose underlying distribution
is unknown. For example, we may want to estimate the confidence interval for the detection coverage after fault
injection experiments. In general, given n Bernoulli trials with the probability of success on each trial being $p$ and the
number of successes being $Y$, how do we find a confidence interval for $p$? If n is large (particularly when $np > 5$ and
$n(1 - p) > 5$ [Hogg83]), $Y/n$ has an approximately normal distribution, $N(\mu, \sigma^2)$, with $\mu = p$ and $\sigma^2 = p(1 - p)/n$.
Note that $Y/n$ is the sample mean, which is an estimate of $\mu$ or $p$. By Eq. (2.7), the $100(1 - \alpha)\%$ confidence interval for
$p$ is

$$ \frac{Y}{n} \pm z_{\alpha/2} \sqrt{p(1-p)/n}. \tag{2.12} $$
This equation can be used to determine the number of injections required to achieve a given confidence interval
for an estimated fault detection coverage. Let n represent the number of fault injections and $Y$ the number of faults
detected in the n injections. Assume that all faults have the same detection coverage, which is approximately $p$. Now
we wish to estimate $p$ with the $100(1 - \alpha)\%$ confidence interval being $Y/n \pm e$. By Eq. (2.12), we have

$$ e = z_{\alpha/2} \sqrt{p(1-p)/n}. \tag{2.13} $$

Solving the equation for n:

$$ n = \frac{z_{\alpha/2}^2\, p(1-p)}{e^2}, \tag{2.14} $$

where n is the number of injections required to achieve the desired confidence interval in estimating $p$.
For example, assume detection coverage $p = 0.6$, confidence interval $e = 0.05$, and confidence coefficient
$1 - \alpha = 90\%$. Then the required number of injections is

$$ n = \frac{1.645^2 \times 0.6 \times 0.4}{0.05^2} \approx 260. $$
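The coverage calculation above can be reproduced with a short Python sketch. The function names are hypothetical, and two choices are assumptions of this sketch rather than part of the text: the result is rounded up to a whole number of injections, and the interval function substitutes the observed proportion for the unknown p.

```python
import math
from statistics import NormalDist
from typing import Tuple


def z_value(confidence: float) -> float:
    """Upper alpha/2 point of the standard normal, with alpha = 1 - confidence."""
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)


def required_injections(p: float, half_width: float, confidence: float) -> int:
    """Number of injections n = z^2 p (1 - p) / e^2, rounded up."""
    z = z_value(confidence)
    return math.ceil(z * z * p * (1.0 - p) / half_width ** 2)


def coverage_confidence_interval(detected: int, injections: int,
                                 confidence: float) -> Tuple[float, float]:
    """Normal-approximation interval for coverage, using detected / injections for p."""
    p_hat = detected / injections
    half = z_value(confidence) * math.sqrt(p_hat * (1.0 - p_hat) / injections)
    return p_hat - half, p_hat + half


n_needed = required_injections(p=0.6, half_width=0.05, confidence=0.90)
# Hypothetical outcome: 156 of 260 injected faults detected.
low, high = coverage_confidence_interval(detected=156, injections=260,
                                         confidence=0.90)
print(n_needed, low, high)
```

Because p is not known before the experiment, a common conservative choice is p = 0.5, which maximizes p(1 - p) and therefore the required number of injections.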
2.2. Distribution Characterization
Mean and variance are important parameters that summarize data by single numbers. Probability distribution
provides more information about data. Analysis of distributions can help one understand data in detail as well as the
underlying models. For example, if the waiting times in all states of a transition model are exponential, then the model
is a Markov model. Otherwise, it is a semi-Markov model. We will discuss empirical distribution functions and func-
tion fitting in this subsection.
2.2.1. Empirical Distribution
Given a sample of $X$, the simplest way to obtain an empirical distribution of $X$ is to plot a histogram of the
observations. The range of the sample space is divided into a number of subranges called buckets. The lengths of the
buckets do not have to be the same. Assume that we have k buckets, separated by $x_0, x_1, \ldots, x_k$, for the given sample
with the size of n. In each bucket, there are $y_i$ instances. Obviously, the sample size n is $\sum_{i=1}^{k} y_i$. Then, $y_i/n$ is an
estimation of the probability that $X$ takes a value in bucket i. We will call the histogram an empirical probability
distribution function (p.d.f.) of $X$. It is easy to construct the following empirical cumulative distribution function (c.d.f.)
from the histogram.

$$ F_k(x) = \begin{cases} 0, & x < x_0 \\ \sum_{l=1}^{i} \dfrac{y_l}{n}, & x_{i-1} \le x < x_i \\ 1, & x_k \le x \end{cases} \tag{2.15} $$
The key problem in plotting histograms is determining the bucket size. A small size may lead to a large varia-
tion among buckets so that the characterization of the distribution cannot be identified. A large size may lose details of
the distribution. Given a data set, it is possible to obtain very different distribution shapes by using different bucket
sizes. One guideline is that if any bucket has fewer than five instances, the bucket size should be increased or a variable
bucket size should be used. In our experience, 10 to 50 buckets are appropriate in most cases, depending on the
sample size. We will call the histogram constructed from data the empirical distribution.
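The construction of the histogram and the empirical c.d.f. can be sketched as follows. The sketch assumes buckets that are closed on the left and open on the right, and it allows unequal bucket lengths, since the bucket edges are passed in explicitly. The function names, the edges, and the sample are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def bucket_counts(sample: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count the instances per bucket, given sorted bucket edges x_0 < ... < x_k."""
    counts = [0] * (len(edges) - 1)
    for value in sample:
        if not edges[0] <= value < edges[-1]:
            raise ValueError("value outside the bucket range")
        # Index of the bucket whose left edge is the largest edge <= value.
        counts[bisect_right(edges, value) - 1] += 1
    return counts


def empirical_cdf(counts: Sequence[int]) -> List[float]:
    """Cumulative fractions; entry i is the c.d.f. value inside bucket i (0-based)."""
    n = sum(counts)
    cdf: List[float] = []
    running = 0
    for y in counts:
        running += y
        cdf.append(running / n)
    return cdf


# Hypothetical observations and bucket edges (illustrative values only).
observations = [1, 3, 4, 12, 15, 18, 22, 25, 31, 38]
edges = [0, 10, 20, 30, 40]
counts = bucket_counts(observations, edges)
cdf = empirical_cdf(counts)
print(counts, cdf)
```

With only ten observations every bucket here holds fewer than five instances, so by the guideline above a real analysis would merge buckets or collect more data; the small sample is used only to keep the example readable.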
2.2.2. Function Fitting
Analytical distribution functions are useful in analytical modeling and simulations. Thus, it is often desirable to
fit an analytical function to a given empirical distribution. Function fitting is not a trivial task and relies on certain
knowledge of statistical distribution functions. The procedure given in the following is based on our experience.
Given an empirical distribution, the first step is to make a good guess of the closest distribution function(s) by observing
the shape of the empirical distribution. The second step is to use a statistical package such as SAS to obtain the
parameters for a guessed function by trying to fit it to the empirical distribution. The third step is to perform a
significance test of the goodness-of-fit to see if the fitted function is acceptable. If the function is not acceptable, we have to
go to step 2 to try a different function.
Now we discuss step 3, the significance test. Assume that the given empirical c.d.f. is $F_k(x)$, defined in Eq. (2.15),
and the hypothesized c.d.f. is $F(x)$ (obtained from step 2 above). Our task is to test the hypothesis
\[
H_0: F_k(x) = F(x).
\]
There are two commonly used goodness-of-fit test methods: the chi-square test and the Kolmogorov-Smirnov
test. We now briefly introduce the two methods.
A. Chi-Square Test
The chi-square test assumes the distribution under consideration can be approximated by a multinomial distribution,
which usually holds. Let
\[
p_i = F(x_i) - F(x_{i-1}), \quad i = 1, \ldots, k,
\]
where $p_i$ is the probability that an instance falls into bucket i. If we define
\[
P[x_{i-1} \le X_i < x_i] = p_i, \quad i = 1, \ldots, k,
\]
then $X_1, X_2, \ldots, X_k$ have a multinomial distribution which is equivalent to the original distribution $F(x)$. Thus, for a
sample size of n, the expected number of instances falling into bucket i is $np_i$, by the above distribution. The sum of error
squares divided by the expected numbers,
\[
q_{k-1} = \sum_{i=1}^{k} \frac{(y_i - np_i)^2}{np_i}, \tag{2.16}
\]
is a measure of the "closeness" of the observed number of instances, $y_i$, to the expected number of instances, $np_i$, in
bucket i. If $q_{k-1}$ is small, we tend to accept $H_0$. The "smallness" can be measured in terms of statistical significance if
we treat $q_{k-1}$ as a particular value of the random variable $Q_{k-1}$. It can be shown that if n is large ($np_i \ge 1$), $Q_{k-1}$ has an
approximate chi-square distribution with k - 1 degrees of freedom, $\chi^2(k-1)$. If $H_0$ is true, we expect that $q_{k-1}$ falls
into an acceptable range of $Q_{k-1}$ so that the event is likely to occur. The boundary value, or critical value, of the
acceptable range, $\chi^2_\alpha(k-1)$, is chosen such that
\[
P[Q_{k-1} > \chi^2_\alpha(k-1)] = \alpha,
\]
where $\alpha$ is called the significance level of the test. Thus, we should reject $H_0$ if $q_{k-1} > \chi^2_\alpha(k-1)$. Usually, $\alpha$ is
chosen to be 0.05 or 0.1.
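The computation of Eq. (2.16) can be sketched in a few lines of Python. The observed counts and the hypothesized bucket probabilities below are made-up values for illustration, and the critical value 9.488 is the tabulated chi-square value for $\alpha = 0.05$ with 4 degrees of freedom; in practice it would be read from a table or a statistical package.

```python
def chi_square_statistic(observed, probs):
    """q_{k-1} of Eq. (2.16): sum of (y_i - n p_i)^2 / (n p_i) over buckets."""
    n = sum(observed)
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))

# Hypothetical data: k = 5 buckets, n = 100, hypothesized uniform distribution.
observed = [18, 22, 21, 19, 20]
probs = [0.2, 0.2, 0.2, 0.2, 0.2]

q = chi_square_statistic(observed, probs)
critical = 9.488          # chi-square critical value, alpha = 0.05, k - 1 = 4
reject_h0 = q > critical
print(q, reject_h0)
```

Here the statistic is far below the critical value, so the hypothesized distribution would not be rejected at the 0.05 level.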
B. Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is a non-parametric method in that it assumes no particular distribution for the
variable under consideration. The method uses the empirical c.d.f., instead of the empirical p.d.f., to perform the test,
which is more stringent than the chi-square test. The Kolmogorov-Smirnov statistic is defined by
\[
D_k = \sup_x |F_k(x) - F(x)|, \tag{2.17}
\]
where $\sup_x$ represents the least upper bound of all pointwise differences $|F_k(x) - F(x)|$. In calculation, we can choose
the midpoint between $x_{i-1}$ and $x_i$, for i = 1, ..., k, to obtain the maximum value of $|F_k(x_i) - F(x_i)|$. It is seen that $D_k$
is a measure of the closeness of the empirical and hypothesized distribution functions. It can be derived that $D_k$ follows
a distribution whose c.d.f. values are given by the table of Kolmogorov-Smirnov Acceptance Limits [Hogg83].
Thus, given a significance level $\alpha$, we can find the critical value $d_k$ from the table such that
\[
P[D_k > d_k] = \alpha.
\]
The hypothesis $H_0$ is rejected if the calculated value of $D_k$ is greater than the critical value $d_k$. Otherwise, we accept
$H_0$.
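A rough sketch of the midpoint calculation described above follows. The bucket counts and the fitted exponential rate are hypothetical, and the critical value uses the common large-sample approximation $1.36/\sqrt{n}$ for $\alpha = 0.05$ rather than the acceptance-limit table cited in the text; both choices are assumptions made for illustration.

```python
import math

def ks_statistic(edges, counts, F):
    """Largest |F_k - F| over bucket midpoints, with F_k taken from Eq. (2.15)."""
    n = sum(counts)
    running, d = 0, 0.0
    for i, y in enumerate(counts):
        running += y                              # cumulative count through bucket i
        mid = 0.5 * (edges[i] + edges[i + 1])     # midpoint of bucket i
        d = max(d, abs(running / n - F(mid)))
    return d

# Hypothetical bucketed data (n = 100) and a hypothesized exponential c.d.f.
edges = [0, 1, 2, 3, 4, 5]
counts = [40, 24, 14, 9, 13]
rate = 0.5                                        # assumed fitted parameter
d = ks_statistic(edges, counts, lambda x: 1.0 - math.exp(-rate * x))

d_crit = 1.36 / math.sqrt(sum(counts))            # large-n approximation, alpha = 0.05
reject_h0 = d > d_crit
print(d, d_crit, reject_h0)
```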
2.3. Multivariate Analysis
In reality, measurements are usually made on more than one variable. For example, a computer workload mea-
surement may include usages on CPU, memory, disk, and network. A computer failure measurement may collect data
on multiple components. Multivariate analysis is the application of methods that deal with multiple variables. These
methods, including clustering analysis, correlation analysis, and factor analysis to be discussed, identify and quantify
simultaneous relationships among multiple variables.
2.3.1. Clustering Analysis
Clustering analysis is useful for characterizing workload states in computer systems by clustering similar points
in resource usage. Assume we have a sample of p variables with a size of n. We call each instance in the sample a
point, which consists of p values. Clustering analysis identifies similar points and clusters them into groups (clusters).
Let $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ denote the ith point of the sample. The Euclidean distance between points i and j,
\[
d_{ij} = |x_i - x_j| = \left( \sum_{l=1}^{p} (x_{il} - x_{jl})^2 \right)^{1/2},
\]
is usually used as a similarity measure between points i and j.
There are several different clustering algorithms. The goal of these algorithms is to achieve small within-cluster
variation relative to the between-cluster variation. A commonly used algorithm is the k-means clustering algorithm.
The algorithm partitions a sample with p dimensions and n points into k clusters, $C_1, C_2, \ldots, C_k$. Denote the mean, or
centroid, of $C_j$ by $\bar{x}_j$. The error component of the partition is defined as
\[
E = \sum_{j=1}^{k} \sum_{i \in C_j} |x_i - \bar{x}_j|^2. \tag{2.18}
\]
The goal of the k-means algorithm is to find a partition that minimizes E.
The clustering procedure is as follows: Start with k groups, each of which consists of a single point. Each new
point is added to the group with the closest centroid. After a point is added to a group, the mean of that group is
adjusted to take the new point into account. After a partition is formed, search for another partition with smaller E by
moving points from one cluster to another cluster until no transfer of a point results in a reduction in E.
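The procedure just described can be sketched as follows. This is a small, unoptimized illustration that recomputes E after every tentative move; the function names, the choice of the first k points as seeds, and the rule that a cluster is never emptied are assumptions of the sketch, not part of the original description.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def partition_error(clusters):
    """E: total squared distance from each point to its cluster centroid."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def k_means(points, k):
    # Start with k groups, each seeded with a single point (here: the first k).
    clusters = [[p] for p in points[:k]]
    # Add every remaining point to the group with the closest current centroid.
    for p in points[k:]:
        min(clusters, key=lambda c: sq_dist(p, centroid(c))).append(p)
    # Transfer single points between clusters while a transfer reduces E.
    improved = True
    while improved:
        improved = False
        best = partition_error(clusters)
        for src in clusters:
            for p in list(src):
                if len(src) == 1:
                    continue                  # assumption: keep clusters non-empty
                for dst in clusters:
                    if dst is src:
                        continue
                    src.remove(p)
                    dst.append(p)
                    e = partition_error(clusters)
                    if e < best - 1e-12:      # keep the move only if E drops
                        best, improved = e, True
                        break
                    dst.pop()                 # otherwise undo the move
                    src.append(p)
    return clusters

# Hypothetical two-dimensional workload points (e.g., CPU and disk usage).
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
print(k_means(points, 2))
```

Like any k-means variant, this search stops at a partition that no single transfer improves, which need not be the global minimum of E.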
2.3.2. Correlation Analysis
Correlation analysis can be used to quantify error or workload dependency between two components in a system.
The correlation coefficient, $\mathrm{Cor}(X_1, X_2)$, between the random variables $X_1$ and $X_2$ is defined as
\[
\mathrm{Cor}(X_1, X_2) = \frac{E[(X_1 - \mu_1)(X_2 - \mu_2)]}{\sigma_1 \sigma_2}, \tag{2.19}
\]
where $\mu_1$ and $\mu_2$ are the means of $X_1$ and $X_2$, and $\sigma_1$ and $\sigma_2$ the standard deviations of $X_1$ and $X_2$, respectively. If we
use $\rho$ to denote the correlation coefficient, then $\rho$ satisfies $-1 \le \rho \le 1$. The correlation coefficient is a measure of the
linear relationship between two variables. When $|\rho| = 1$, we have $X_1 = aX_2 + b$, where $a > 0$ if $\rho = 1$, or $a < 0$ if $\rho = -1$.
In this extreme case, there is an exact linear relationship between $X_1$ and $X_2$. When $|\rho| < 1$, there is no exact linear
relationship between $X_1$ and $X_2$. In this case, $\rho$ measures the goodness of the linear relationship $X_1 = aX_2 + b$
between $X_1$ and $X_2$. Usually, a $\rho$ value of 0.5 or above is considered reasonably high.
Given random variables $X_1$, $X_2$, and $X_3$, and correlation coefficients between each pair, $\rho_{12}$, $\rho_{23}$, and $\rho_{13}$, we
know these variables are related to each other by $\rho_{12}$, $\rho_{23}$, and $\rho_{13}$. Since $X_1$ is related to $X_2$ and $X_2$ is related to $X_3$, a
partial dependence between $X_1$ and $X_3$ may be due to $X_2$. The partial correlation coefficient defined below quantifies
this partial dependence.
\[
\rho_{13.2} = \frac{\rho_{13} - \rho_{12}\rho_{23}}{\sqrt{(1 - \rho_{12}^2)(1 - \rho_{23}^2)}} \tag{2.20}
\]
The partial correlation coefficient can be considered as a measure of the common relationship among the three variables.
If a random variable, X, is defined on time series, the correlation coefficient can be used to quantify the time
serial dependence in the sample data of X. Given a time window $\Delta t > 0$, the autocorrelation coefficient of X on the
time series t is defined as
\[
\mathrm{Autocor}(X, \Delta t) = \mathrm{Cor}(X(t), X(t + \Delta t)), \tag{2.21}
\]
where t is defined on the discrete values $(\Delta t, 2\Delta t, 3\Delta t, \ldots)$. In this case, we treat $X(t)$ and $X(t + \Delta t)$ as two different
random variables and the autocorrelation coefficient is actually the correlation coefficient between the two variables.
That is, $\mathrm{Autocor}(X, \Delta t)$ measures the time serial correlation of X with a window $\Delta t$.
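The three coefficients of Eqs. (2.19) to (2.21) can be estimated from sample data as sketched below. The function names are illustrative, the sample moments stand in for the expectations in the definitions, and the lag is expressed as a number of samples, which assumes the series is recorded at a fixed interval $\Delta t$.

```python
import math

def cor(x, y):
    """Sample estimate of the correlation coefficient of Eq. (2.19)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_cor(r13, r12, r23):
    """Partial correlation of X1 and X3 given X2, as in Eq. (2.20)."""
    return (r13 - r12 * r23) / math.sqrt((1 - r12 ** 2) * (1 - r23 ** 2))

def autocor(series, lag):
    """Eq. (2.21): correlate the series with itself shifted by `lag` samples."""
    return cor(series[:-lag], series[lag:])

# Hypothetical usage samples for two components.
cpu = [1.0, 2.0, 3.0, 4.0, 5.0]
disk = [2.1, 3.9, 6.2, 8.1, 9.8]
print(cor(cpu, disk))
print(partial_cor(0.5, 0.5, 0.5))
print(autocor([1, -1, 1, -1, 1, -1], 1))
```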
2.3.3. Factor Analysis
The limitation of correlation analysis is that the correlation coefficient can only quantify dependency between
two variables. However, dependency may exist within a group of more than two variables or even among all variables.
The correlation coefficient cannot provide information about this multiple dependency. Factor analysis is one of the
statistical techniques used to quantify multi-way dependency among variables. The method attempts to find a set of
unobserved common factors which link together the observed variables. Consequently, it provides insight into the
underlying structure of the data. For example, in a distributed system, a disk crash can account for failures on those machines
whose operations depend on a set of critical data on the disk. The disk state can be considered to be a common factor
for failures on these machines.
Let $X = (x_1, \ldots, x_p)$ be a normalized random vector. We say that the k-factor model holds for X if X can be
written in the form
\[
X = \Lambda F + E, \tag{2.22}
\]
where $\Lambda = (\lambda_{ij})$ (i = 1, ..., p; j = 1, ..., k) is a matrix of constants called factor loadings, and $F = (f_1, \ldots, f_k)$ and
$E = (e_1, \ldots, e_p)$ are random vectors. The elements of F are called common factors, and the elements of E are called
unique factors (error terms). These factors are unobservable variables. It is assumed that all factors (both common and
unique factors) are independent of each other and that the common factors are normalized.
Each variable $x_i$ (i = 1, ..., p) can then be expressed as
\[
x_i = \sum_{j=1}^{k} \lambda_{ij} f_j + e_i,
\]
and its variance can be written as
\[
\mathrm{Var}(x_i) = \sum_{j=1}^{k} \lambda_{ij}^2 + \psi_i,
\]
where $\psi_i$ is the variance of $e_i$. Thus, the variance of $x_i$ can be split into two parts. The first part,
\[
\sum_{j=1}^{k} \lambda_{ij}^2,
\]
is called the communality and represents the variance of $x_i$ which is shared with the other variables via the common
factors. In particular, $\lambda_{ij} = \mathrm{Cor}(x_i, f_j)$ represents the extent to which $x_i$ depends on the jth common factor. The second
part, $\psi_i$, is called the unique variance and is due to the unique factor $e_i$; it explains the variability in $x_i$ not shared
with the other variables.
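As a small numerical illustration of this variance split, the sketch below takes a hypothetical loading matrix for p = 3 variables and k = 2 common factors and computes the communality and unique variance of each variable. It assumes that "normalized" means each $x_i$ has unit variance, so that the unique variance is one minus the communality; the loadings themselves are invented for the example.

```python
def decompose(loadings):
    """Return (communality, unique variance) per variable, assuming Var(x_i) = 1."""
    result = []
    for row in loadings:
        communality = sum(l * l for l in row)      # sum over j of lambda_ij^2
        result.append((communality, 1.0 - communality))
    return result

# Hypothetical factor loadings: rows are variables, columns are common factors.
loadings = [
    [0.9, 0.1],   # x_1 depends mostly on factor 1
    [0.8, 0.3],   # x_2 depends mostly on factor 1
    [0.2, 0.7],   # x_3 depends mostly on factor 2
]

for i, (h, psi) in enumerate(decompose(loadings), start=1):
    print(f"x_{i}: communality = {h:.2f}, unique variance = {psi:.2f}")
```

Estimating the loadings from data is the hard part of factor analysis and is normally left to a statistical package; the sketch only shows how the two parts of the variance are read off once the loadings are known.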
2.4. Importance Sampling
Importance sampling is a statistical method to reduce sample size while keeping estimates obtained from the
sample at a high level of confidence [Kahn53]. The method has recently been used to reduce the number of runs in
Monte Carlo simulations for evaluating computer dependability [Goyal92] [Choi92]. In the following, we first give
an overview of the method and then discuss its applications in the Monte Carlo simulation of discrete-time Markov
chains (DTMCs).
2.4.1. Overview of the Method
Assume that a random variable X has p.d.f. $f(x)$ and that $Y = h(X)$ is a function of X. Our goal is to estimate
the expected value of Y,
\[
\theta = E[Y] = E[h(X)] = \int_{-\infty}^{+\infty} h(x) f(x)\,dx, \tag{2.23}
\]
through sampling. That is, we generate a sample $\{x_1, x_2, \ldots, x_n\}$ according to $f(x)$, therefore generating
$\{y_1, y_2, \ldots, y_n\}$, and then calculate
\[
\hat{\theta} = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{n} \sum_{i=1}^{n} h(x_i).
\]
It may be very expensive to generate a statistically significant sample of X. For example, if $y_i = h(x_i) = 0$ for
most generated $x_i$, we may need an extremely large sample to estimate $\theta$ with a high level of confidence. However,
if we can make the rare $x_i$'s which are "important" for estimating $\theta$ be much more frequently selected in sampling
while keeping the estimate unbiased, the sample size will be greatly reduced. This is the basic idea of the importance
sampling method.
To do importance sampling, we change the p.d.f. of X from $f(x)$ to $g(x)$ such that those x's which are of importance
in our parameter estimation have higher occurrence probabilities in $g(x)$. We use $X'$ to represent the variable
which has p.d.f. $g(x')$. By Eq. (2.23), we have
\[
\theta = \int h(x) f(x)\,dx = \int h(x) \frac{f(x)}{g(x)} g(x)\,dx = \int h(x) \Lambda(x) g(x)\,dx, \tag{2.24}
\]
where
\[
\Lambda(x) = \frac{f(x)}{g(x)} \tag{2.25}
\]
is called the likelihood ratio. Let $Y' = h(X')\Lambda(X')$; then Eq. (2.24) becomes
\[
\theta = \int_{-\infty}^{+\infty} y' g(x)\,dx = E[Y']. \tag{2.26}
\]
Thus, instead of sampling from $f(x)$ to estimate the expected value of Y, the experiment is changed to sampling
from $g(x')$ to estimate the expected value of $Y'$. That is, we generate a sample $\{x'_1, x'_2, \ldots, x'_n\}$ according to $g(x')$,
therefore generating $\{y'_1, y'_2, \ldots, y'_n\}$, and then calculate
\[
\hat{\theta} = \bar{y}' = \frac{1}{n} \sum_{i=1}^{n} y'_i = \frac{1}{n} \sum_{i=1}^{n} h(x'_i) \Lambda(x'_i).
\]
The variance of the above estimator is
\[
\mathrm{Var}(Y') = E[(Y' - \theta)^2] = \int \left( \frac{h(x) f(x)}{g(x)} \right)^2 g(x)\,dx - \theta^2.
\]
To achieve the minimum variance, we should have
\[
g(x) = \frac{h(x) f(x)}{\theta}.
\]
But $\theta$ is the unknown parameter to estimate. A heuristic is that the shape of $g(x)$ should follow the shape of $h(x)f(x)$
as closely as possible.
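The effect of the change of measure can be seen in a small rare-event example, sketched below under stated assumptions: X is standard normal, $h(x) = 1$ when $x > 4$ and 0 otherwise, and $g$ is a normal density shifted to mean 4 so that the rare region is sampled about half the time. For these two densities the likelihood ratio works out to $\Lambda(x) = \exp(-\mu x + \mu^2/2)$. The choice of distribution, threshold, and shift is ours and is not taken from the cited studies.

```python
import math
import random

def naive_estimate(n, threshold, rng):
    """Plain Monte Carlo: sample from f = N(0, 1) and average h(x)."""
    return sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold) / n

def importance_estimate(n, threshold, rng):
    """Sample from g = N(mu, 1) and average h(x') * Lambda(x')."""
    mu = threshold                       # assumed shift: center g on the rare region
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu, 1.0)           # x' drawn from g
        if x > threshold:                # h(x') = 1
            total += math.exp(-mu * x + 0.5 * mu * mu)   # Lambda(x') = f(x') / g(x')
    return total / n

exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))   # P[X > 4] for a standard normal
print("exact     ", exact)
print("naive     ", naive_estimate(20000, 4.0, random.Random(1)))
print("importance", importance_estimate(20000, 4.0, random.Random(1)))
```

With 20000 samples the plain estimator usually sees no event at all, since the event probability is about 3 in 100000, while the biased sampler typically recovers the probability to within a few percent.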
2.4.2. Applications in DTMC Simulation
In many cases, the operation of a computer system can be modeled by a DTMC (Discrete Time Markov Chain)
[Trivedi82]. If the resulting DTMC is very large (such that it exceeds the available storage) or the functional simulation
(simulation of the execution of machine instructions, algorithms, etc.) is used above a DTMC, the Monte Carlo simulation
method is perhaps the only feasible way to solve the model. In dependability models, system failures are usually
rare events with extremely small probabilities. In order to obtain statistically significant results, a large number of
simulation runs is required, which can be very time consuming. In such a case, importance sampling can be used to
reduce simulation runs, usually by orders of magnitude.
Assume we have a DTMC $\{Y_s, s \ge 0\}$ with a set of states $\{S_1, S_2, \ldots, S_m\}$ and a transition matrix $[p_{ij}]$. For
each simulation run, we have a path $x_i = S_{i_0}, S_{i_1}, \ldots, S_{i_k}$. The occurrence probability of path $x_i$ is [Goyal92]
\[
P(x_i) = p_{i_0} p_{i_0 i_1} \cdots p_{i_{k-1} i_k},
\]
where each $p_{ij}$ is an element in $[p_{ij}]$. All possible paths constitute the probability space of a random variable: X =
$(x_1, x_2, x_3, \ldots)$.
To reduce simulation runs, we change the transition probability matrix from $[p_{ij}]$ to $[p'_{ij}]$ such that those paths
which are of importance in our dependability evaluation are more likely to be sampled. After the change, the occurrence
probability of path $x_i$ is
\[
P'(x_i) = p'_{i_0} p'_{i_0 i_1} \cdots p'_{i_{k-1} i_k}.
\]
Assume the dependability measure to evaluate is $\theta = E[h(X)]$. Then $\theta$ can be estimated using a sample,
$\{x_1, x_2, \ldots, x_n\}$, obtained from simulations by
\[
\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} h(x_i) \Lambda(x_i), \tag{2.27}
\]
where
\[
\Lambda(x_i) = \frac{P(x_i)}{P'(x_i)} = \frac{p_{i_0} p_{i_0 i_1} \cdots p_{i_{k-1} i_k}}{p'_{i_0} p'_{i_0 i_1} \cdots p'_{i_{k-1} i_k}}. \tag{2.28}
\]
The remaining question is how to determine $[p'_{ij}]$. Several heuristics called failure biasing have been proposed
in the literature [Lewis84] [Goyal92]. Here we introduce one of the commonly used heuristics. Assume that in state
$S_i$, transitions out of the state go to either a set of failure states, F (e.g., the system suffers one more component failure),
or a set of recovery states, R (e.g., the system recovers from a component failure). ($S_i$ itself can be treated as
either in F or in R.) It is obvious that we have
\[
\sum_{j \in F} p_{ij} + \sum_{j \in R} p_{ij} = 1.
\]
Define a parameter b such that the $p'_{ij}$'s satisfy
\[
\sum_{j \in F} p'_{ij} = b, \qquad \sum_{j \in R} p'_{ij} = 1 - b. \tag{2.29}
\]
Then we determine each $p'_{ij}$ in state $S_i$ by
\[
p'_{ij} = \begin{cases} b\,\dfrac{p_{ij}}{\sum_{k \in F} p_{ik}}, & j \in F \\ (1 - b)\,\dfrac{p_{ij}}{\sum_{k \in R} p_{ik}}, & j \in R \end{cases} \tag{2.30}
\]
The parameter b is usually chosen to be 0.5 [Goyal92]. Since the sum of the original probabilities to failure
states is often very small, by Eq. (2.29), the selection of b can significantly increase these probabilities, thus making
these transitions much more likely to occur in simulations.
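The biasing rule of Eq. (2.30) and the path likelihood ratio of Eq. (2.28) can be sketched as below. The dictionary representation of a transition row, the state names, and the probabilities are hypothetical; the sketch also assumes every path starts from the same initial state under both matrices, so the initial-state terms in Eq. (2.28) cancel.

```python
def failure_bias(row, failure_states, b=0.5):
    """Apply Eq. (2.30) to one row {next_state: p_ij} of the transition matrix."""
    sum_f = sum(p for s, p in row.items() if s in failure_states)
    sum_r = sum(p for s, p in row.items() if s not in failure_states)
    if sum_f == 0 or sum_r == 0:
        return dict(row)                 # only one kind of transition: leave as is
    return {s: (b * p / sum_f) if s in failure_states else ((1 - b) * p / sum_r)
            for s, p in row.items()}

def likelihood_ratio(path, P, P_biased):
    """Product of p_ij / p'_ij along a path (initial-state terms assumed equal)."""
    ratio = 1.0
    for i, j in zip(path, path[1:]):
        ratio *= P[i][j] / P_biased[i][j]
    return ratio

# Hypothetical row: two rare failure transitions and one likely recovery.
row = {"fail_cpu": 0.0005, "fail_disk": 0.0005, "repair": 0.999}
biased = failure_bias(row, {"fail_cpu", "fail_disk"}, b=0.5)
print(biased)   # failure transitions now carry total probability b
```

Each simulated path is then weighted by its likelihood ratio when it contributes to the estimate in Eq. (2.27), which is what keeps the biased simulation unbiased.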
III. DESIGN PHASE
In the early design phase of highly reliable systems, simulation is an important experimental means for performance
and dependability analysis. Compared to analytical modeling, simulation has the capability to model complex
systems in detail without being restricted to assumptions made in analytical modeling to keep the model mathematically
tractable. Thus, simulation is able to provide more accurate dependability evaluation than analytical models.
Simulations for dependability analysis can be performed by injecting faults in the system under study at the electrical
level, the logic level, and the function level. Dependability issues studied usually include but are not limited to: 1)
fault propagation, 2) fault latency, and 3) fault impact such as coverage, reliability, availability, and performance loss.
Figure 3.1 shows fault injections at the different levels.
Figure 3.1. Simulated Fault Injections at Different Levels
[Figure: fault injection at three levels. Electrical level: change current, change voltage (electrical circuits, physical
process). Logic level: stuck-at 0 or 1, inverted fault (logic gates, logic operation). Function level: change CPU register,
flip memory bit, etc. (functional units).]

Electrical-level fault injection simulation is usually used to emulate transient faults by changing the electric current
and voltage inside the circuits. The faulty current and voltage may cause errors in logic values at the gate level.
The gate-level errors may then propagate to other functional units and output pins of the chip. It has been reported
that transient faults account for more than 80% of the failures in computer systems [Siewiorek78], [Iyer86]. These
faults result from physical causes such as power transients, capacitive or inductive crosstalk, or cosmic particle
interventions [Yang92]. Electrical-level simulation can be used to study the impact of transient faults from the physical
level, but since the simulation has to track the propagation of faults from circuits to gates, to functional units, and
eventually out to the pins, it can be very time consuming and memory bound.
For this reason, logic-level fault injection simulation applies abstractions of physical fault models to logic gates
to study large VLSI, even computer systems. Commonly used fault models include stuck-at-0, stuck-at-1, and
inverted fault. These models are considered to be representative of faults at the gate level. Although simulation at the
logic level ignores the physical processes underlying gate faults, it still needs to trace the impact of gate-level faults to
higher levels. For the same reason that electrical-level simulation cannot be effectively used to study large VLSI systems,
logic-level simulation cannot effectively study large computer systems.
Function-level fault injection simulation is usually used to study dependability features of large computer or
network systems. Faults are injected into various components of the system under study. Functional fault models are
used in the simulation, while detailed processes of fault occurrence at lower levels are ignored. Functional models rep-
resent the manifestation of faults at the lower levels and are extracted from results obtained from electrical-level or
logic-level fault injections or from field measurements. For example, "flipped memory bit" and "CPU register error"
are two typical fault models. Analytical dependability models of computer systems are usually built at this level.
Compared to analytical models, simulation is capable of representing detailed architectural features, real fault condi-
tions, and inter-component dependencies, thereby providing more accurate and believable results.
There are several common issues for fault injections at all levels. The first issue is that given fault models (e.g.,
one bit flip in memory) and types (e.g., transient fault), where do we inject faults? A simple way is to randomly
choose a location from the injection space (e.g., all gates in a VLSI chip or all memory bits). This scheme is easy to
implement, but many faults may have similar impact (e.g., all faulty bits in an ALU may have the same effect) and
many faulty locations may not be exercised. Another way is to inject faults only to representative locations which
have different impact, or only to representative workload areas. This approach can be used to study fault impact in
terms of locations or workloads.
The second issue involves workloads. The impact of faults on system dependability is workload-dependent.
Hence it is important to analyze a system while it is executing representative workloads. These workloads can be real
applications, selected benchmarks, or synthetic programs. If the goal of study is to investigate fault impact on a
mission task, the real applications running in the mission may be used in the simulation. If the research goal is to
study fault impact on general workloads, several representative benchmarks may be selected for the simulation. If we
want to exercise every functional unit and location in the simulation, neither real applications nor benchmarks may
be appropriate. In this case, synthetic workloads can be designed for achieving the goal. The workload issue further
complicates simulation models and increases simulation time. It is necessary to develop ways to represent realistic
workloads while still maintaining reasonable simulation times.
The third issue is simulation time explosion, which occurs when: 1) too much detail is simulated, such as modeling
physical processes in fault injections at the electrical level, and 2) extremely small failure probabilities require
large simulation runs to obtain statistically significant results (the theory is discussed in Section 2.1). Several techniques,
including mixed-mode simulation [Saleh90] [Choi92], importance sampling [Goyal92] [Choi92], hybrid simulation
[Bavuso87], [Goswami93a], and hierarchical simulation [Goswami92], have been used to tackle the time explosion
problem.
Table 3.1 summarizes features and representative studies in simulated fault injections at different levels. We will
discuss these studies in the following three sections.

Table 3.1. Summary of Simulated Fault Injections

Category      Electrical Level            Logic Level                  Function Level
Approach      Alter electrical current    Inject stuck-at or inverted  Inject faults to CPU,
              and voltage in circuits     faults to logic gates        memory, I/O devices, etc.
Target Under  VLSI chip                   VLSI chip                    Computer system
Study         Software running            Computer system              Network system
              on the chip                 Software                     Software
Studies       Fault simulation [Yang92]   BDX930 [McGough81]           Trace-driven [Chillarege87]
              HA1602 [Duba88]             BDX930 [Lomelino86]          NEST [Dupuy90]
              FOCUS [Choi92]              IBM RT PC [Czeck91]          DEPEND [Goswami92]
                                                                       REACT [Clark93]
3.1. Simulated Fault Injection at the Electrical Level
There are several reasons for performing fault injections at the electrical level. First, the fault injection at this
level can be used to study the impact of physical causes which lead to faults and errors. Secondly, it has been pointed
out by previous studies [Banerjee82], [Beh82] that simple stuck-at fault models do not represent some types of faults.
Thirdly, some circuits are of a mixed analog/digital nature which cannot be fully characterized by logic-level fault
models. Thus, there is a growing need for fault simulators which can handle electrical transient faults and permanent
physical failures for the purposes of both circuit testing and dependability evaluation.
The basic simulation methodology used in fault injections at the electrical level is the mixed-mode method in
which the fault-free portions of the circuit are simulated at the logic level while the faulty portions of the circuit are
simulated at the electrical level [Saleh90]. Figure 3.2 illustrates the method. A simple CMOS AND gate with buffered
output is drawn in the figure. The dotted boxes indicate normal voltage waveforms for the circuit and the dashed
boxes contain waveforms resulting from a transient injection at the location marked by X. Notice that waveforms
within the electrical-level analysis behave in an analog fashion, but are discrete in the logic-level analysis.
Figure 3.2. Illustration of Fault Injection at Electrical Level
[Figure: a circuit between Vcc and GND divided into a logic-level analysis region, an electrical-level analysis region
containing the fault injection point, and a second logic-level analysis region. Legend: normal waveform; waveform
due to fault injection.]
A representative mixed-mode fault simulator is SPLICE1 [Saleh84]. The electrical analysis in SPLICE1 is
based on the method of iterated timing analysis (ITA), which incorporates a nonlinear relaxation method with
event-driven selective tracing. ITA has been shown to be accurate and fast (can provide a speed-up of up to two orders of
magnitude). The logic analysis in SPLICE1 is performed using a relaxation-based method including MOS-oriented
models. Recently, more advanced techniques, such as the concurrent mixed-mode simulation of permanent faults and
the dynamic mixed-mode simulation of transient faults, have been developed [Yang92].
We now discuss two studies in the electrical-level fault injection. Both use SPLICE1 as the fault simulation
engine. The first is a case study of the impact of different levels of current transients on a microprocessor-based chip.
The second is a fault injection tool which integrates fault injection engine, tracing facility, and graphical and statistical
analysis packages into a user environment.
3.1.1. Simulation of a Microprocessor-Based Chip
One of the studies in this field was an experimental analysis of the susceptibility of a microprocessor-based jet
engine controller to upsets caused by current and voltage transients through simulated fault injections [Duba88]. The
target system for the study was an HA1602, a microprocessor-based digital jet-engine controller designed by Hamilton
Standard for commercial aircraft and made available to NASA Langley AIRLAB. SPLICE1 was chosen for the fault
simulation in the study. A number of enhancements to SPLICE1 were made to facilitate the fault injection simulations.
The parameters used in the simulations were extracted from those used in the HA1602 design and circuit layout.
The application code running on the simulated processor was chosen such that all the functional units at which transient
fault injections were made were exercised. Fault injections were made at seven randomly chosen nodes in six
functional units. For each node, current transients were injected at five different charge levels: 0.5, 1.0, 2.0, 3.0, and
4.0 picoCoulombs. Each charge level was injected at five different time-points during the execution of the application
code sequence. This amounted to over 1000 fault injections/simulations.
The error data was generated by comparing each faulted simulation with a fault-free simulation. An error event
was defined as either a logic state change or a voltage level change large enough to cause a node to be faulted at a
future time. Error events were classified into three categories: 1) logic upsets: voltage transients large enough to
constitute logic level errors, 2) latched errors: errors in the first-level latches, and 3) pin errors: errors at the chip I/O
pins. The overall results from the experiments are shown in Table 3.2. It can be seen that the injected transients have a
41.6% chance of causing a logic upset (no errors), a 5.7% chance of resulting in a latched error (a latent error in the
circuit), and a 5.6% chance of an error propagating to pins. The other 47% of the injected transients have no impact on the
processor. Thus, only 11% of all injected transients cause either a permanent change in circuit behavior or affect the
external environment. The table shows that transients below 3.0 picoCoulombs have no significant impact on the circuit.

Table 3.2. Severity of Injected Transient Faults

Error Category        Occurrences   Percentage   Charge Threshold
Injected Transients   1050          100.0        3.0 pC
Logic Upsets          437           41.6         3.0 pC
Latched Error         60            5.7          3.0 pC
Pin Errors            59            5.6          3.0 pC
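The percentages quoted from Table 3.2 can be re-derived from the occurrence counts, as in the short check below. It assumes, as the text's "other 47%" implies, that the three error categories are counted as mutually exclusive outcomes of the 1050 injections.

```python
injected = 1050
counts = {"logic upsets": 437, "latched errors": 60, "pin errors": 59}

# Percentage of all injected transients that fall into each category.
percent = {name: round(100 * c / injected, 1) for name, c in counts.items()}

# Transients with no observed impact, and those that were latched or reached pins.
no_impact = injected - sum(counts.values())
lasting = counts["latched errors"] + counts["pin errors"]

print(percent)                                   # per-category percentages
print(round(100 * no_impact / injected, 1))      # share with no impact
print(round(100 * lasting / injected, 1))        # share latched or at pins
```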
The study also investigated the impact of current and voltage transients occurring in the different functional
units of the processor. An ALU transient was found to be the most likely to result in logic upsets and pin errors. Further, the
analysis of variance (ANOVA) technique was used to quantify the sensitivity of pin-level errors to error activity in the
different functional units. The results of ANOVA are shown in Table 3.2, which indicate that the output pin errors are
most sensitive to error activity in the ALU.
3.1.2. FOCUS -- A Chip-Level Simulation Environment
FOCUS is a simulation environment, developed at the University of Illinois, for fault sensitivity analysis of IC
chips [Choi92]. In the environment, a range of user-specified faults are automatically injected at the circuit level, and
fault propagation is measured at the gate and higher levels. Figure 3.3 depicts the overall experimental environment.
The environment takes as input a net-list of the hardware description of the system and converts it into a simulation
model. SPLICE1 is used as the fault simulation engine. The importance sampling technique, which has been intro-
duced in Section 2.4, is used in FOCUS to accelerate simulations.
Figure 3.3. FOCUS Experimental Environment
[Figure: block diagram with the labels: target system description; description of fault (transient/stuck-at fault type);
mixed-mode hierarchical fault simulation; graphical analysis (visual identifications, error propagation, manifestation);
statistical analysis (impact analysis); design feedback.]
The fault injection process is implemented as a run-time modification of the circuit, whereby a current source is
added to a target node,¹ thus altering the voltage level of the node over the time interval of the injected current waveform.
The experimental environment allows both transient and permanent (single or multiple) fault injections. Since
the injected current source is specified as a mathematical function, the resulting transients can be of varying shapes
and duration. For example, electrical power surge, in-chip alpha particle intervention, lightning, and bridging faults
can be modeled. The user can control the location of a fault, the time and duration of a fault, and the shape of the
current source.
The tracing facility monitors all switching activities in the target system, including fault propagation through
each gate or transistor, for all processed events. The trace data for each event consists of the time of the event, the
hierarchical node name, and the new and previous voltage levels (for electrical nodes) or logic levels (for logic nodes).
¹ A node is defined as a point in a conductive interconnection between electrical and/or logical elements.
The graphical analysis facility is used to visualize the error activity in different functional units of the processor
and the fault propagation on the major interconnects and at the external pins. The statistical analysis tools provide
impact analysis of the target system and generate necessary models to depict the fault behavior in the system (e.g., I/O
pin error distribution, latch error distribution, and internal fault propagation model).
The application of FOCUS is illustrated by studying a target system. The target system is a microprocessor
used in commercial aircraft for real-time control of jet-engine functions. The 16-bit microprocessor consists of six
major functional units: ALU, control, decoder, multiplexer, countdown, and watchdog, as shown in Figure 3.4. The system
incorporates a variety of fault tolerant design features at different levels, including software checks, parity checks,
memory test, and error counting. The objective of the study is to investigate the impact of charge-level transients on
latch, pin, and functional levels.
Figure 3.4. The Target Microprocessor System
[Figure: block diagram of the target microprocessor; recoverable block labels include I/O, Memory, Decode, Timing,
Countdown, Control, ALU, Watchdog, UART, Status, and Latch.]

Nearly 80 instruction cycles (90300 nanoseconds) of the application code were executed on the target system
during each simulation run. The application code was carefully selected to ensure that all of the functional units were
exercised. A total of 2100 simulations were performed to obtain stable results. During the simulation, all nodes
(including all latches and external I/O pins) in the circuit were monitored and processed. Table 3.3 summarizes the
overall impact of transients in the range 0.5 to 9.0 picoCoulombs. In the table, a first-order error is defined as one
which occurs during the first clock cycle following a transient fault injection; second and higher-order errors are those
occurring during the second and subsequent clock cycles.

Table 3.3. Impact of Transients Injected to the Target System

Type                                    Injections   Percent of Total   Resultant
                                        Involved     Injections         Errors
First-Order Latch Errors                470          22.4               2149
Second and Higher-Order Latch Errors    120          5.7                1829
First-Order Pin Errors                  255          12.1               1168
Second and Higher-Order Pin Errors      90           4.3                839
Functional Errors                       193          9.2                747
Figure 3.5 shows the propagation of the latch errors in time. In the figure, the x-axis represents the clock cycles
from the fault injection time, and the y-axis represents the total latch error count for each clock cycle. It can be seen
that, given a certain number of latch errors in the first clock cycle, the number of latch errors decreases significantly
until the fourth clock cycle. At approximately the fifth clock cycle, the number of errors rapidly multiplies. This is
because at this time, the error signal enters a unit with a large number of latches and high fan-out, e.g., the ALU registers.
After the sixth cycle, the number of errors decreases significantly until finally disappearing after the eighth
cycle. Thus, the impact of latch errors lasts at most 8 clock cycles from the time of fault injection.
Figure 3.5. Latch Error Occurrence in Time
[Plot not reproduced. x-axis: Time (Clock Cycles), 1.0 to 10.0; y-axis: Frequency of latched errors.]
3.2. Simulated Fault Injection at the Logic Level
Simulated fault injection at the logic level is similar to that at the electrical level in that both are circuit-level
simulations. The difference is in the fault models used. Electrical-level injection uses physical fault models, while
logic-level injection uses logic fault models. Logic-level fault simulation uses abstract logical models for both faults
and circuit functions to evaluate the behavior of a system. In contrast to the evaluation of the physical models used in
electrical-level simulation, logic-level simulations perform binary operations that represent the behavior of a given
device. They take binary input vectors and evaluate the output of the device for a given input pattern. Each signal in
the circuit is represented by a member of a set of boolean values depicting the steady-state conditions of the physical
circuit. For example, the set {1,0,X} is often used to describe high, low, and unknown voltage values for logic gates.
Fault injection at this level simply forces a node to either stuck-at-1 or stuck-at-0, or it inverts a logic value. Fault
injection location and time can be set arbitrarily. Hence, with logic simulation, one obtains outputs with discrete
values and possibly with some approximate timing information. Typically, outputs of the faulty and non-faulty
systems are compared to determine whether a fault has been detected.
For MOS technology, a gate-level logic simulator is inadequate to handle circuits containing pass transistors,
ratioed logic, buses, and other features that exhibit bidirectional signal flow and/or charge-sharing effects. To handle
such transistor networks without resorting to expensive electrical-level analysis, switch-level simulation was proposed
in [Bryant84]. Switch-level analysis allows bidirectional signal flow and different levels of signal strength. For
example, a discrete set {0,1,..,9} can be used to model different signal strengths or voltage levels. At this level,
transistor-level fault modeling can also be incorporated easily.
Fault simulation has been widely used for evaluating the coverage of a given set of test vectors for testing
manufacturing defects in a chip. Typically a set of test vectors, generated by an automatic test pattern generator
(ATPG), randomly, or manually, is submitted to the fault simulator to determine how many faults the test vectors can
detect. In this case, the generated test vectors become the workloads or inputs to the system. At the beginning of such
a simulation, a stuck-at fault is injected, and the faulty circuit is simulated while a given test is applied at the primary
inputs of the circuit. A similar run is performed without any fault injection. The logic values at the primary outputs of
the faulty circuit are then compared to the outputs of the fault-free run to determine if there is any difference in the
outputs. If the injected fault altered logic values at the outputs compared to the clean run, the fault is assumed to be
detected. If a fault is detected, there is no reason to continue the simulation for that specified fault. The process of
test generation and fault simulation is iterated until satisfactory fault coverage (the percentage of faults detected of
all theoretically detectable faults) is achieved. This application has been widely used in industry to evaluate and
assist test generation [Ruehli83] [Rogers85].
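The fault-simulation loop described above can be sketched in a few lines of Python. This is a minimal illustration, not taken from any of the cited simulators: the three-input circuit, the node names, and the functions simulate and fault_coverage are all hypothetical. Every single stuck-at fault in this particular circuit is detectable, so the fraction computed here coincides with fault coverage as defined above.

```python
from itertools import product

# Hypothetical circuit: y = (a AND b) OR (NOT c).
# Nodes: primary inputs a, b, c; internal n1 = a AND b, n2 = NOT c; output y.
NODES = ["a", "b", "c", "n1", "n2", "y"]


def simulate(vector, fault=None):
    """Evaluate the circuit for one input vector (a, b, c).

    fault is None for the fault-free run, or a (node, stuck_value) pair.
    """
    def force(node, value):
        # A stuck-at fault overrides whatever value the node would take.
        if fault is not None and fault[0] == node:
            return fault[1]
        return value

    a = force("a", vector[0])
    b = force("b", vector[1])
    c = force("c", vector[2])
    n1 = force("n1", a & b)
    n2 = force("n2", 1 - c)
    return force("y", n1 | n2)


def fault_coverage(test_vectors):
    """Return (coverage, undetected_faults) over all single stuck-at faults."""
    faults = [(node, value) for node in NODES for value in (0, 1)]
    undetected = []
    for fault in faults:
        for vector in test_vectors:
            if simulate(vector, fault) != simulate(vector):
                break  # detected: no need to simulate this fault further
        else:
            undetected.append(fault)
    return 1 - len(undetected) / len(faults), undetected
```

With the single vector (1, 1, 1) only the faults that pull the output low are detected; adding vectors raises the coverage, which is the iteration between test generation and fault simulation that the text describes.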
The use of fault simulation to perform dependability analysis at the design phase, and thus avoid the high cost
of an additional redesign/modification iteration after the finalized design is submitted for fabrication, is an ongoing
research effort. New techniques are being introduced to perform fault sensitivity analysis of very large circuits. The
simulation approach permits determination of a chip's fault sensitivity during the design stage. Through simulated
fault injection and subsequent fault propagation at the logic level, it is possible to identify critical bottlenecks in
reliability. To characterize a highly dependable VLSI system, we need to evaluate, simultaneously, the fault behaviors
of all components and their combined behavior.
For the dependability analysis of a system, either stuck-at or transient faults can be simulated. Stuck-at faults
can be simulated using conventional logic-level fault simulators that are widely available. A stuck-at fault injection
is performed by forcing the state of a node to a specified value for the entire simulation duration. By selectively
tracing/detecting a set of internal and external nodes, fault propagation can be monitored. Fault behavior in a system
can be modeled and analyzed by studying the fault propagation trace.
Transient fault injection is more complicated than stuck-at fault injection. Transient faults are injected by
altering the logic values of the target node momentarily during the simulation. For example, the output of a gate is set
to 1 while it should normally be 0. This faulty logic value is forced on the output for a specified time period.
Logic-level transient injection can be performed in two different ways. The above "bit-flip" effect can be performed on
the combinational part of the circuit using a timing simulator. The created "pulse" can then propagate and become
latched. Another way is to change the state of a machine by flipping a bit in a register or a memory element in the
system. These approaches, however, may not reflect the actual device-level transient behavior at the logic level,
because a transient can propagate in multiple paths and result in more than one latch error.
To evaluate system dependability based on realistic fault models, a fault-dictionary approach can be taken
[Choi93]. A fault-behavior dictionary generated from electrical-level fault analysis can be used as a fast look-up table
for a logic-level concurrent or parallel fault-injection simulation. First, an electrical-level fault-behavior dictionary
for a given chip can be generated by extensive fault simulations. In this step, gates around the fault-injection location
are extracted, and a subcircuit consisting of these gates is formed. This subcircuit is exercised by exhaustively
applying all input combinations while fault injection is performed. Faulty behavior at each of the subcircuit outputs is
analyzed and recorded in a dictionary. The resulting entry in the dictionary consists of the input vector, fault-injection
time, and fault location. Second, concurrent run-time fault injections of the generated logical error at the subcircuit
level using the fault dictionary can be performed. The concurrent simulator is used to propagate, in a single
simulation pass, the effects of the injected errors.
Both transient and permanent faults can be injected using switch-level or gate-level logic simulation. The
overall simulation approach for fault injections at the logic level consists of the following steps:
(1) Obtain the net-list of a design and devise appropriate simulation models to emulate the given design.
(2) Simulate the model using a logic-level simulator.
(3) During the simulation, run a given workload depicting the application or test software (by applying test
vectors to the primary inputs).
(4) Save the behavior of the system under fault-free conditions by tracing all the changes in the evaluated logic
events of monitored nodes for comparison with the subsequent fault-injection runs.
(5) Run the same workload again and inject a fault to a selected node during the simulation period and trace.
For a stuck-at fault: Force the state of the selected node to either 1 (for stuck-at-1 fault) or 0 (for stuck-at-0
fault) and evaluate the circuit. Hold the state to the stuck-at fault value throughout the simulation.
For a transient fault: Force the state of the selected node to a logic value that is the reverse of the normal state
(i.e., force a 0 if the normal state is a 1, and vice versa). Hold the state to the reverse value on the node for one
or more clock cycles. Let the fault effect propagate by evaluating the circuit with the corrupted logic state.
Release the forced state when a new signal/event arrives at that node.
(6) Monitor the behavior of the system under fault conditions.
(7) Compare the traces from the faulty and fault-free runs and identify the differences to determine where and
when the fault has propagated.
(8) Use collected statistical measurements to determine dependability parameters (e.g., detection coverage) and
the fault impact (e.g., minor program error or system failure).
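The steps above can be sketched on a toy circuit. The example below is an assumption-laden illustration rather than any cited tool: the target is a hypothetical 3-bit counter with synchronous clear, the monitored nodes are its three latches, the "output pin" is taken to be latch q2, and the names run, compare, and detection_coverage are invented for this sketch.

```python
def run(workload, fault=None):
    """Simulate a 3-bit counter with synchronous clear; return the node trace.

    workload: list of (enable, clear) input vectors, one per clock cycle.
    fault:    None for the fault-free run, or a dict with keys node (0..2),
              kind ("stuck" or "transient"), and either value (stuck)
              or start and duration in clock cycles (transient).
    """
    q = [0, 0, 0]                                  # latches q0, q1, q2
    trace = []
    for cycle, (en, clr) in enumerate(workload):
        if fault is not None:
            node = fault["node"]
            if fault["kind"] == "stuck":
                q[node] = fault["value"]           # held for the whole run
            elif fault["start"] <= cycle < fault["start"] + fault["duration"]:
                q[node] ^= 1                       # reverse of the value held
        trace.append(tuple(q))                     # steps 4 and 6: trace nodes
        if clr:
            q = [0, 0, 0]                          # clear overwrites any error
        else:
            q = [q[0] ^ en, q[1] ^ (q[0] & en), q[2] ^ (q[1] & q[0] & en)]
    return trace


def compare(golden, faulty):
    """Step 7: clock cycles at which the faulty trace differs from the golden one."""
    return [c for c, (g, f) in enumerate(zip(golden, faulty)) if g != f]


def detection_coverage(workload, output_node=2):
    """Step 8: fraction of one-cycle transients that become visible at the output."""
    golden = run(workload)
    detected = total = 0
    for node in range(3):
        for start in range(len(workload)):
            fault = {"node": node, "kind": "transient", "start": start, "duration": 1}
            faulty = run(workload, fault)
            total += 1
            if any(g[output_node] != f[output_node] for g, f in zip(golden, faulty)):
                detected += 1
    return detected / total


WORKLOAD = [(1, 0)] * 4 + [(1, 1)] + [(1, 0)] * 3   # count, clear once, count
GOLDEN = run(WORKLOAD)                               # step 4: fault-free trace
```

Comparing GOLDEN with a run that flips q2 at cycle 1 shows the error persisting until the clear at cycle 4 overwrites it, which is the kind of "where and when" information step (7) is meant to extract.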
The above fault injection steps should be repeated a large number of times for a given workload. If the
experiment is intended to estimate single measures (e.g., detection coverage), the number of injections required for
achieving a given confidence interval can be determined using Eq. (2.14). If the purpose of the experiment is to obtain
distributions (e.g., error latency distribution), the fault injections should not be stopped until the constructed
distribution is stable, i.e., two consecutively constructed distributions are not statistically different. Importance
sampling can be used to significantly reduce the number of simulation runs.
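Both stopping rules can be illustrated with a short sketch. Eq. (2.14) is not reproduced in this section, so the sample-size function below assumes the usual normal approximation to the binomial, which may differ in form from that equation. The stability check uses the two-sample Kolmogorov-Smirnov statistic as one possible way to compare consecutive empirical distributions; that choice, and the function names, are assumptions of this sketch.

```python
import math
from bisect import bisect_right
from statistics import NormalDist


def injections_needed(p_guess, half_width, confidence=0.95):
    """Injections needed so that the coverage estimate has the given
    confidence-interval half-width (normal approximation, assumed)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * p_guess * (1 - p_guess) / half_width ** 2)


def ks_statistic(xs, ys):
    """Largest gap between the empirical CDFs of two latency samples."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in points
    )


def distributions_stable(xs, ys, c_alpha=1.36):
    """True if the samples are not distinguishable at roughly the 5% level
    (c_alpha = 1.36 is the usual large-sample critical coefficient)."""
    n, m = len(xs), len(ys)
    return ks_statistic(xs, ys) < c_alpha * math.sqrt((n + m) / (n * m))
```

For example, with no prior knowledge of the coverage (p_guess = 0.5) and a desired half-width of 0.05 at 95% confidence, injections_needed returns 385.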
Two early studies in fault injections at this level took a digital avionic miniprocessor, BDX-930, as the target
system. The first study investigated the impact of faults at gates and pins on the output results of programs, with
emphasis on the fault latency and fault coverage issues [McGough81]. The second study investigated error
propagation from the gate level to the pin level [Lomelino86]. A recent study explored the behavior of transient faults
which occur during the normal execution of a processor [Czeck91]. The study quantified faults that can be emulated by
software-implemented fault injections (to be discussed in Section 4.2). We discuss these studies in the following two
subsections.
3.2.1. Study of Bendix BDX-930
An early study in this field was the simulated fault injection to the Bendix BDX-930, a digital avionic
miniprocessor, to investigate fault latency and coverage [McGough81]. The BDX-930 was composed of bit-slice
processors (AMD2901) and was used in a number of flight control avionic systems. Fault tolerance was achieved by
replication of the processing and voting in software. A gate-level emulator of the BDX-930 was developed for this
study. The run speed of the emulator was 25,000 times slower than the BDX-930 when hosted on a PDP-10.
The methodology used in the study was: Given a software program running on the processor, inject a single
stuck-at or inverted fault at a gate or pin selected randomly and observe the time to detection, assuming that a
detection occurs whenever there is a difference between the outputs of the faulty and fault-free processors executing
the same program. The experiment was repeated 600 to 1,000 times to construct an empirical latency distribution. Six
different programs were selected to repeat the above experimental procedure. In addition, a typical avionic flight
control system self-test program was written for this study and executed to determine fault detection coverage.
Results showed that most detected faults were detected in the first repetition of the program. Subsequent
repetitions did not significantly increase the proportion of detected faults. A large percentage of faults (about 60%
for the gate-level faults and 30% for the pin-level faults) remained undetected after as many as eight repetitions of the
program. The fault coverage of the self-test program was found to be 87% for the gate-level faults and 98% for the
pin-level faults.
The above study emphasized the impact of faults at gates and pins on the output results of programs. Another
study on the Bendix BDX-930 computer investigated error propagation from the gate level to the pin level
[Lomelino86]. In this study, a single AMD2901 processor chip in the BDX-930 was selected for fault injection and
error data collection. The processor was simulated using an event-driven, gate-level logic simulator developed at
NASA Langley [Migneault85]. The simulator was driven by a self-test program, developed for the BDX-930, which
provides a high probability of detecting error activity.
In the simulation, the single stuck-at fault model was applied to 150 selected gates for fault injection. The gates
were distributed throughout the nine function units of the AMD2901. Error data was collected by first simulating a
fault-free circuit, then simulating the circuit with a single injected fault, and finally comparing the two simulation
outputs to obtain the differences. Four sets of simulation experiments, consisting of 150 simulations per set, were
conducted. Results showed that, within the first 100 clock cycles, 78.7% of faults produced error propagation detected
within the chip and 66.7% of faults produced errors that propagated to the output pins. The error distribution at the
output pins was found to be sensitive to the locations of faults. The results also showed that the error activity
increases with the increase of concurrent microinstruction activity.
3.2.2. Study of IBM PC RT
In [Czeck91], a simulation model of the IBM RT PC was used to inject transient, gate-level faults for exploring
the behavior of transient faults which occur during the normal execution of a processor. The emulated hardware
functional units in the processor included: instruction prefetch buffer (IPB), microinstruction fetch (MIF), data fetch
and storage (DFS), ALU and shifter (ALU), and ROMP storage channel interface (RSCI). The simulation model
included both the original error detection mechanisms (EDM) which reside in the IBM RT PC (such as illegal
instruction traps and memory access violation) and additional error detection mechanisms which were provided in this
study for evaluating their effectiveness (such as timeout and control flow monitoring).
Figure 3.6 shows possible error manifestations after a fault injection. In the figure, minor errors are differences
in the internal processor state between the faulty and fault-free simulation runs, which have not been detected by an
EDM. Monitoring errors are those which are uncovered by the "duplicate and compare" EDM which monitors bus
addresses and data. Severe errors are those resulting in the change of a microinstruction and the instruction address
registers, which lead to a change in the control flow of the program. Fatal errors are those that have triggered a
system resident EDM and caused an abnormal termination of the application task. Results overdue occurs when the
task executes longer than a predetermined time limit and the execution is halted. Overwritten means that the injected
fault does not manifest as a minor error or a minor error is overwritten by correct data.
Figure 3.6. Error Manifestations in the IBM PC RT Simulation Model
[Diagram not reproduced. Legible outcome labels: Results Overdue*, Results Wrong, Results OK, Overwritten; * = error detection mechanism.]
Three workloads were selected for this study: an iterative matrix multiplication, a recursive Fibonacci program,
and an iterative Fibonacci program. These workloads were considered to be representative of the characteristics of
instruction set architectures and diversity in program structure. For each workload and each fault location, 1000 faults
were injected. The method of the experiments is as follows:
(1) For each workload, the fault-free behavior of the workload is extracted from the internal state of the processor
and saved for comparison during the subsequent fault injection experiments.
(2) A fault location is selected such that the fault in the location has a high probability of producing an error and
locations for different injections do not yield the same error behavior.
(3) The fault injection time is set to the start of the workload execution. The fault injection time will be advanced
by one cycle for each successive fault injection experiment.
(4) The fault is injected for a duration of one clock cycle at the location and time selected.
(5) For each successive clock cycle, the internal processor state of the faulty processor is compared with that
obtained in step 1. Differences are recorded for off-line analysis.
(6) The faulty behavior is monitored at each clock cycle until the program execution is completed or a severe
error causes the monitor to cease.
(7) The simulation run is restarted at step 3 and the time of the next fault injection is advanced by one clock
cycle.
Results of the study include: 40% to 55% of injected faults do not produce an error. Among the faults that
manifest as errors, approximately 2/3 can be emulated by the software-implemented fault injection approach (to be
discussed in Section 4). The other 1/3 of these faults manifest as errors in CPU components (e.g., microinstruction
control registers) that are not accessible to software. At the system level, the fault behavior showed a strong
dependency on the workload structure and instruction sequencing rather than the instruction mix. Error detection
latency was found to follow a Weibull distribution with a decreasing detection rate. The distribution represents two
error occurrence processes: a fast process in which fault manifestation and error propagation occur within a small time
window and a slow process in which dormant faults and errors are activated gradually by the workload.
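The link between a Weibull latency distribution and a decreasing detection rate can be made concrete. For a Weibull distribution with shape a and scale b, the distribution function is F(t) = 1 - exp(-(t/b)^a) and the detection (hazard) rate is h(t) = (a/b)(t/b)^(a-1), which decreases in t whenever a < 1. The sketch below evaluates both; the symbols and the parameter values used in the checks are illustrative assumptions, not fitted values from [Czeck91].

```python
import math


def weibull_cdf(t, shape, scale):
    """Probability that an error has been detected by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous detection rate at time t, given no detection so far."""
    return (shape / scale) * (t / scale) ** (shape - 1)
```

With shape below 1, many errors are detected almost immediately (the fast process) while the remainder are detected at an ever lower rate (the slow process); a shape of exactly 1 reduces to the exponential distribution with a constant rate.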
3.3. Simulated Fault Injection at the Function Level
Function-level fault injection simulation is used to study complete computer and network systems rather than
the components of which they consist. These studies typically consider the hardware, the software, their interactions,
and the inter-dependence between the various components of the system. There are at least five issues in developing
functional simulation models at the system level.
The first is a lack of well-established system-level fault models. This is partly due to the second issue: a large
and varied component domain. At the gate level, the basic components are gates with single functions and
well-defined interconnections. At this level, it is possible to establish a fault model, such as the single stuck-at fault
model, which can be consistently applied to all gates to model their fault behavior. At the system level, the basic
components include CPUs, communication channels, disks, software systems, and memory. The components have
complex inputs, perform multiple functions, have varied physical attributes (e.g., hardware and software), and have
complex interconnections. In addition to the diversity of the components that make up a system, two similar
components (such as two CPUs) can have different functions and behavior. This makes it difficult to establish a single
fault model that can be applied consistently to all components.
For this reason, various types of fault models are required, depending upon the type of component being
injected. The fault models are functional fault models that simulate system-level manifestations of gate-level faults.
For instance, a single bit-flip is typically used to simulate a memory or register fault. Various fault models can be
used for communication channels. Messages traversing the channel can be corrupted or destroyed, or the channel can
be made inoperative. A fault in a processing node can be modeled as a service interrupt caused by CPU, memory,
disk, or software faults in the node. More detailed fault models for a processor or other system components can be
derived from lower-level simulations using the fault-dictionary approach discussed in Section 3.2. For instance, a
gate-level simulation of a processor can be injected with faults while executing a typical workload. The effect of the
faults on the workload can be stored in a fault dictionary that contains, for each gate-level fault injected, the types of
effects and the probability of these effects. This dictionary can then serve as a fault model for system-level
simulations.
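A fault dictionary of this kind can be pictured as a table from each gate-level fault to a set of system-level effects with their probabilities, which the system-level simulator samples whenever it injects that fault. The sketch below is purely illustrative: the fault names, effect names, and probabilities are invented for the example and do not come from any study cited here.

```python
import random

# Hypothetical dictionary: gate-level fault -> [(system-level effect, probability)].
FAULT_DICTIONARY = {
    "alu_gate_stuck_at_1": [
        ("no_effect", 0.5),
        ("wrong_result", 0.3),
        ("processor_halt", 0.2),
    ],
    "register_bit_flip": [
        ("no_effect", 0.7),
        ("wrong_result", 0.3),
    ],
}


def sample_effect(fault, rng):
    """Draw one system-level effect for a gate-level fault from the dictionary."""
    effects, probabilities = zip(*FAULT_DICTIONARY[fault])
    return rng.choices(effects, weights=probabilities, k=1)[0]


rng = random.Random(0)                      # fixed seed for repeatable runs
effect = sample_effect("alu_gate_stuck_at_1", rng)
```

The system-level model then only needs to implement each effect once (for example, halting a simulated node), while the probabilities carry the information obtained from the lower-level simulation.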
The third issue, which is especially significant when simulating large, complex systems, is the effort and time
required to develop a functional simulation model. For fault injection studies and dependability analysis, two factors
contribute to this. One is the time and effort needed to describe the detailed functionality of the system components.
The other is the time and effort required to inject faults, initiate repairs, abort, reschedule and synchronize events, and
maintain a whole host of fault statistics. As the number of components in the system becomes large, a
well-formulated, structured, and automated approach is needed to contend with the complexity. A solution is to have a
tool which includes a library of software "objects" that provide the skeletal framework needed to conduct simulated
fault injection studies and that can be easily customized to meet user-specific needs.
The fourth issue is the impact of the software on system dependability. Dependability studies have tended to
focus on a system's hardware components. But as the hardware becomes more reliable, the software component is
becoming a more dominant factor [Gray90]. The effectiveness of functional detection and repair schemes depends
upon several application-specific measures such as detection latency and error propagation times. In order to study
the impact of the software on system dependability, methods are needed that allow the designer to incorporate the
application into the overall dependability study. Thus, the simulation tool should permit the execution of actual user
programs.
The fifth issue, and an extremely important one, is simulation time explosion. This occurs when the modeled
system has very small failure probabilities, requiring large simulation runs to obtain statistically significant results.
This is especially a problem with functional simulation because its primary benefit is detailed modeling, which further
contributes to simulation time explosion. Different acceleration techniques are used at the system level to reduce
simulation time explosion. Hierarchical and hybrid simulation methods have been shown to be very effective
[Goswami92] [Goswami93a]. The basic approach of these techniques is to: 1) break down a large, complex model into
simpler submodels, 2) analyze submodels individually, 3) combine the results from step 2 to build a simplified system
model, and 4) analyze the system model to obtain the solution. If the models in step 1 and step 3 are both simulation
models, the approach is called hierarchical simulation. If the models in step 1 are simulation models and the model in
step 3 is an analytical model, the approach is called hybrid simulation. As long as the interactions among the
subsystems are weak, this decomposition approach provides valid results. The approach is ideally suited for
dependability analysis because dependability models can usually be broken into two submodels -- a fault occurrence
submodel and an error handling submodel -- whose interactions are typically weak. Figure 3.7 illustrates the approach.
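The hybrid variant of this decomposition can be illustrated as follows. The sketch is not drawn from [Goswami92] or [Goswami93a]; it assumes a hypothetical error handling submodel in which an error is covered if its exponentially distributed detection latency falls below a recovery deadline, and a textbook analytical system model for a two-unit system without repair, whose mean time to failure is 1/(2r) + c/r for unit failure rate r and coverage c.

```python
import math
import random


def simulate_coverage(mean_latency, deadline, runs, rng):
    """Steps 1-2: simulate the error handling submodel and return the
    fraction of errors detected before the recovery deadline."""
    covered = sum(
        rng.expovariate(1.0 / mean_latency) < deadline for _ in range(runs)
    )
    return covered / runs


def duplex_mttf(failure_rate, coverage):
    """Steps 3-4: analytical system model (two units, no repair).
    The first failure is survived only with probability `coverage`."""
    return 1.0 / (2.0 * failure_rate) + coverage / failure_rate


rng = random.Random(1)                      # fixed seed for repeatable runs
c_hat = simulate_coverage(mean_latency=2.0, deadline=6.0, runs=20000, rng=rng)
mttf = duplex_mttf(failure_rate=1e-3, coverage=c_hat)
```

The expensive detailed simulation is confined to the submodel, where events are frequent, and the rare system failures are handled analytically; this is why the approach helps with simulation time explosion as long as the two submodels interact weakly.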
A good question is: since there are a large number of analytical modeling tools, including Petri-net based
simulation tools, what is the need for functional simulation tools for system-level dependability analysis? What
additional information and capabilities can they provide over analytical tools? The answer is that analytical modeling
tools only use probabilistic models to represent the behavior of a system. In essence, the effect of a fault on the system
is pre-defined by a set of probabilities and distributions. Functional simulation tools not only use stochastic modeling,
they also permit behavioral modeling, which does not require that the effect of the faults be pre-defined.
An example that demonstrates this capability is a study in which a distributed system using a centralized,
prediction-based load balancing scheme is evaluated under faults with DEPEND [Goswami92], a function-level
simulation tool to be discussed in Section 3.2.2. The study is illustrated in Figure 3.8. The load-balancing software that
makes task placement decisions and maintains the database is actually executed within the DEPEND environment on
a simulated distributed system. DEPEND's fault injection facilities are used to inject communication faults which
destroy and corrupt fields of the status messages sent to the CPU maintaining the database. Faults are also injected
Figure 3.7. Hierarchical Simulation
[Diagram not reproduced.]
Figure 3.8. Distributed System Executing Load Balancing Software
[Diagram not reproduced. Legible labels: Trace File, Processes arrive, Node 0, Fail Nodes, Job placement messages, Status messages, Corrupt or destroy status messages; processes executed on all nodes.]
into the CPU containing the load-balancing software, to erase its database. The effect of these faults is to corrupt the
database and impair the placement decisions made by the load-balancing software. If a purely probabilistic modeling
tool were used for this study, the user would have to pre-specify:
• the probability that a fault will corrupt the database,
• how each fault will corrupt the database,
• which portions of the database will be corrupted,
• the extent of corruption, and
• how each corruption will impair the placement decision made by the load-balancing software.
Needless to say, these factors are extremely difficult to obtain without executing actual software. Because
DEPEND executes actual software, these parameters are the results of (and not inputs to) the fault injection experi-
ment. Only the fault arrival rates and the types of faults injected need to be specified. Thus, DEPEND can identify
the failure mechanisms, obtain failure probabilities, and quantify the effect of faults. It can be used to pick out the key
features that must be modeled and help to determine and specify both the structure of, and the parameters to,
analytical models.
A single distinguishing feature between probabilistic modeling and behavioral modeling is brought out by one
of the results of this study (details of all the results can be found in [Goswami93b]). The study helped to uncover a
design feature of the software that caused erratic increases in system response time only when status messages were
destroyed. Once the software was modified, the erratic increase in response time ceased. Clearly, such results cannot
be obtained with analytical modeling tools.
An additional advantage of functional simulation tools is that they allow the use of any type of TTF
distribution. Unlike analytical modeling, in which only a few types of distributions are commonly used for the
tractability of models, the simulation method can handle any form of distribution, empirical or analytical.
An early study used a trace-driven simulation approach to analyze error latency [Chillarege87]. The approach is
based on sampled data of physical memory activity gathered, through hardware instrumentation, from a computer
system running normal workloads. The data are then used for a trace-driven simulation in which faults are inserted
into the trace to emulate fault occurrence and error discovery processes in the system. The approach provides a means
to study error latency in memory systems under real workloads.
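The idea of inserting faults into a memory trace can be sketched as follows. The trace format, the function name, and the discovery rule are assumptions made for illustration, not details taken from [Chillarege87]: a fault at an address is taken to be discovered at the first subsequent read of that address, and to vanish undiscovered if the address is written first.

```python
def error_latency(trace, fault_time, address):
    """Return the time from fault insertion to error discovery, or None if
    the fault is overwritten or never accessed again.

    trace: list of (time, op, address) records sorted by time,
           with op equal to "R" (read) or "W" (write).
    """
    for time, op, addr in trace:
        if time < fault_time or addr != address:
            continue
        if op == "W":
            return None              # a write overwrites the faulty data
        return time - fault_time     # first read discovers the error
    return None                      # address not referenced again


# Small hand-made trace standing in for sampled memory activity.
TRACE = [
    (0, "W", 0x10),
    (5, "R", 0x10),
    (7, "W", 0x20),
    (9, "R", 0x20),
    (12, "R", 0x10),
]
```

Repeating the insertion at many randomly chosen times and addresses yields an empirical latency distribution, and the fraction of None results gives the share of faults that never become errors under that workload.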
In recent years, several function-level simulation tools that can be used for fault injections have been or are
being developed. NEST, DEPEND, and REACT are three representative tools. REACT, a software testbed that
performs automated life testing of a variety of multiprocessor architectures through simulated fault injections, is being
developed at the University of Massachusetts [Clark93]. Several system, workload, and fault/error models, which are
representative of multiprocessor architectures and conditions, are embedded in the testbed. The tool can be used to
evaluate system reliability and availability metrics. Preliminary versions of the software have been reported to be
successfully employed in several studies of multiprocessor systems [Clark93].
NEST is a function-level testbed that specializes in modeling and evaluating distributed network systems
[Dupuy90]. Although the tool is not designed for fault injections, users can emulate node or link failures by deleting or
adding nodes and links or changing their features while the simulation is running. DEPEND, developed at the
University of Illinois, exploits the properties of the object-oriented paradigm to provide a general-purpose,
system-level dependability analysis tool that can evaluate various types of fault tolerant architectures [Goswami92].
The object-oriented feature of DEPEND makes the tool capable of modeling multiple levels of functional units to meet
a wide range of applications. The next two subsections discuss NEST and DEPEND, respectively.
3.2.1. NEST -- A Network Simulation Testbed
The NEtwork Simulation Testbed (NEST) is a graphical environment, running on the UNIX system, for
modeling, executing, and monitoring distributed network systems and protocols [Dupuy90]. Using a set of graphical
tools provided by NEST, the user can develop simulation models of communication networks. The model includes
node functions (e.g., routing protocols) and communication link behaviors (e.g., packet loss or delay features),
typically coded in C. These user procedures are linked with run-time routines embedded in NEST and executed by the
NEST simulation server. The user can reconfigure the modeled network system through graphical interaction or
programming. Built-in graphical tools allow users to program custom monitors to observe the simulation results
on-line.
Figure 3.9 shows the overall architecture of NEST. NEST consists of a simulation server and several client
monitors. The simulation server is responsible for running simulations. The generic client monitors are used to config-
ure simulation models and control their executions. The custom client monitors are used to observe simulation behav-
ior and display results. Clients can reside on separate machines so that the server is dedicated to time-consuming sim-
ulations.
Node functions are used to model distributed communicating processes running at network nodes (e.g.,
protocols and database transactions). NEST executes node processes and their communication calls using a set of
embedded primitives for sending, broadcasting, and receiving packets. The motion of a packet over links is simulated
by passing it through the link functions. Link functions are used to model the behavior of communication links (e.g.,
packet loss and link jamming). Link functions are also used to monitor and collect performance statistics of link
traffic. The simulation server schedules the execution of the node and link processes to meet the delay and timing
specified by the user.
Figure 3.9. Overall Architecture of NEST
[Diagram not reproduced. Legible labels: Design Node Behavior (Node Function), Design Link Behavior (Link Function), Simulation Server, Generic Client, Custom Client Monitor.]
The user can create and modify a network description (node and link functions and connections) using the
NEST graphical tools. Once the user has defined a simulation scenario, it is sent to the simulation server to be
executed. One of NEST's key features is its ability to reconfigure a scenario during the simulation run. The user may
delete or add nodes and links (thus failures can be emulated) or change their features while the simulation is running.
The impact of these changes may be instantly observed and interpreted. Such dynamically reconfigured simulations
can be used to study the impact of node/link failure and recovery on the modeled network system.
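The effect of reconfiguring a scenario during a run can be mimicked in a few lines. The sketch below does not use or imitate the NEST interface (NEST models are typically coded in C and driven through its own tools); the event format and the functions reachable and run_scenario are hypothetical, and the only quantity observed is whether a source node can still reach a destination at each simulation tick.

```python
from collections import deque


def reachable(links, src, dst):
    """Breadth-first search over undirected links (a set of frozensets)."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for link in links:
            if node in link:
                for other in link - {node}:
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
    return False


def run_scenario(links, events, src, dst, ticks):
    """Apply scheduled link deletions/additions and log connectivity per tick.

    events: dict mapping tick -> list of (action, link), action in
            {"delete", "add"}; deleting a link emulates a link failure.
    """
    links = {frozenset(link) for link in links}
    log = []
    for tick in range(ticks):
        for action, link in events.get(tick, []):
            if action == "delete":
                links.discard(frozenset(link))
            elif action == "add":
                links.add(frozenset(link))
        log.append(reachable(links, src, dst))
    return log
```

For a chain A-B-C in which link B-C is deleted at tick 2 and restored at tick 4, the connectivity log from A to C over six ticks is True, True, False, False, True, True, which is the failure-and-recovery pattern such dynamically reconfigured simulations are meant to expose.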
3.2.2. DEPEND -- A System Dependability Analysis Environment
DEPEND is an integrated design and fault injection environment [Goswami92]. It provides facilities to rapidly
model fault tolerant architectures and conduct extensive fault injection studies. It is ideally suited for evaluating spe-
cific fault tolerant mechanisms, detailed fault scenarios such as latent errors, and software behavior due to hardware
faults. It is a functional, process-based [Kobayashi78], [Schwetman86] simulation tool. The system behavior is
described by a collection of processes that interact with one another. A process-based approach was selected for
several reasons. It is an effective way to model system behavior, repair schemes, and system software in detail. It
facilitates modeling of inter-component dependencies, especially when the system is large and the dependencies are
complex, and it allows actual programs to be executed within the simulation environment. Both hierarchical and hybrid
simulation techniques have been used in DEPEND.
DEPEND exploits the properties of the object-oriented paradigm, specifically, modular decomposition and mod-
ular composability [Meyer88], to model different levels of components and to implement a variety of fault models.
Modular decomposition consists of breaking down a problem into small elements, whereas modular composition
favors production of elements that can be freely combined with each other to provide new functionality. If, for
instance, the fault injection process is divided into two elements or objects: an object that determines when to inject
and interrupt the system, and an object that determines the response to a fault (the fault model), then the two criteria
are met. The first object is common to all fault injection methods. It encapsulates the various mechanisms used to
determine the arrival time of a fault and interrupt the system. The second object is the fault model and is specific to
the component being injected and the type of fault injection study. The two are combined via function calls. Thus, by
specifying different fault model objects, one injector object can be used for all types of fault injections. Key objects,
such as the injector object, are designed to be parameterized. That is, the user can specify various fault arrival distri-
butions or trace files. This same approach is used to model components that are similar but not identical; common
aspects are encapsulated in an object which then invokes other objects to provide more specific functionality. Further-
more, because users can specify specific behaviors (e.g. their own fault model objects), the tool is not limited to any
predefined set of fault models or component types.
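The separation between the injector object and the fault model object can be sketched as follows. This is a minimal Python illustration under stated assumptions, not DEPEND's C++ library: the names `Injector`, `flip_bit_model`, and `drop_message_model`, and the dictionary used to stand in for a component, are all hypothetical.

```python
import random
from typing import Callable, List

# Hypothetical types; this is not the DEPEND API.
FaultModel = Callable[[dict, random.Random], None]   # mutates a component's state
ArrivalSampler = Callable[[random.Random], float]    # returns time until next fault


class Injector:
    """Decides WHEN to inject; delegates WHAT happens to a fault model."""

    def __init__(self, arrival: ArrivalSampler, fault_model: FaultModel, seed: int = 0):
        self.arrival = arrival
        self.fault_model = fault_model
        self.rng = random.Random(seed)

    def run(self, component: dict, horizon: float) -> List[float]:
        """Inject faults into `component` until simulated time passes `horizon`."""
        now, times = 0.0, []
        while True:
            now += self.arrival(self.rng)
            if now > horizon:
                return times
            self.fault_model(component, self.rng)  # the two objects meet in one call
            times.append(now)


def flip_bit_model(component: dict, rng: random.Random) -> None:
    """Memory-style fault model: flip one random bit of a 32-bit word."""
    component["word"] ^= 1 << rng.randrange(32)


def drop_message_model(component: dict, rng: random.Random) -> None:
    """Link-style fault model: record one lost message."""
    component["lost"] += 1


def exponential(rng: random.Random) -> float:
    return rng.expovariate(1.0 / 100.0)   # assumed mean time between faults: 100


def constant(rng: random.Random) -> float:
    return 250.0                          # assumed fixed inter-fault time


memory = {"word": 0}
link = {"lost": 0}
mem_times = Injector(exponential, flip_bit_model, seed=1).run(memory, horizon=1000.0)
link_times = Injector(constant, drop_message_model).run(link, horizon=1000.0)
```

The same `Injector` class serves both experiments; only the arrival sampler and the fault model passed to it differ, which is the parameterization the text describes.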
A library of objects that provide the skeletal foundation necessary to model an architecture and conduct simulated
fault-injection experiments is provided. This reduces the development time and effort needed to build simulation
models. In addition to decomposition, composition, and parameterization, the concept of inheritance [Meyer88]
makes it possible to provide a library with a minimum set of objects that can be readily specialized to model a wide
gamut of different architectures and fault injection experiments. With inheritance, users can inherit the properties of
an existing object and develop more specialized objects with minimum effort. Table 3.4 briefly describes some of the
major objects in the DEPEND library. Elementary objects provide basic functions, such as injecting faults and
compiling statistics. Complex objects created from several elementary objects simulate fundamental components found in
most fault tolerant architectures such as CPUs, self-checking processors, N-modular redundant processors,
communication links, voters, and memory.
Table 3.4. Some Objects Provided in DEPEND
Name           Type        Description
Active_elem    Elementary  Simulates a basic server. Several disciplines: first come
                           first serve, round robin, etc.
Injector       Elementary  Injects faults using distributions, trace files, and workloads.
Checksum       Elementary  Computes checksums.
Fault Report   Elementary  Compiles and displays fault statistics.
Voter          Elementary  Simulates a basic voter with timeout.
Server         Complex     Simulates a server with spares. Three policies: no spare, graceful
                           degradation, stand-by sparing. Automatic repair and reconfiguration.
Link           Complex     Simulates communication channels. Several fault types: link dead,
                           packet corruption, packet loss, and user defined faults.
NMR            Complex     Simulates dual self-checking, triple-modular redundant, and
                           N-modular redundant components.
Fault Manager  Complex     Simulates software fault management schemes. Logs faults
                           and shuts off components which exceed their fault threshold.
The steps required to develop and execute a model are shown in Figure 3.10. The user writes a control program
in C++ using the objects in the DEPEND library. The program is then compiled and linked with the DEPEND
objects and the run-time environment. The model is executed in the simulated parallel run-time environment. Here,
the assortment of objects including the fault injectors, CPUs, and communication links execute simultaneously to
simulate the functional behavior of the architecture. Faults are injected and repairs are initiated according to the user's
specifications, and a report containing the essential statistics of the simulation is produced.
DEPEND allows users to specify different fault models. In addition, DEPEND provides default fault routines
for each object to minimize user design time. For instance, the default fault model for a communication medium
simulates the effects of a noisy communication channel. Fields in the messages passed along the communication link are
actually corrupted or the message is destroyed.
Figure 3.10. The DEPEND Environment
[Figure labels: Control program written by user in C++; DEPEND object library; Printed.]
The fault injector is a fundamental object of DEPEND (Figure 3.11). It encapsulates the mechanism for injecting
faults. To use the injector, a user specifies the number of components, the TTF distribution for each component,
and the fault subroutine that specifies the fault model. In addition to user-specified distributions, the injector provides
constant time, exponential, hyperexponential, and Weibull distributions. The injector also provides a workload-based
injection scheme that varies the fault arrival rate based on a specified workload. The user provides a workload
function, a set of workload states, and an exponential fault arrival rate for each state. For example, the workload
function may be the utilization of a server. With this approach, the fault arrival rate will fluctuate with the utilization
of the server. The fault injector will periodically poll the workload function to update a state transition diagram to
maintain a history of the workload behavior. This history is used to inject a large number of faults during peak workload
conditions and fewer faults when the workload is low. This technique models the workload/failure dependency observed in
[Iyer80] and [Castillo81].
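A workload-based injection scheme of this kind can be sketched in a few lines of Python. The sketch is a simplification under stated assumptions: the state names, thresholds, and rates are invented for illustration, and only the current workload state is used (the state transition history maintained by DEPEND is not modeled). Because exponential inter-arrival times are memoryless, restarting the arrival draw at every poll does not bias the fault rate within a state.

```python
import random
from typing import Callable, List, Tuple

# Assumed workload states and exponential fault rates (faults per time unit).
RATES = {"low": 0.001, "medium": 0.005, "peak": 0.02}


def workload_state(utilization: float) -> str:
    """Map a utilization in [0, 1] to a workload state (thresholds are assumptions)."""
    if utilization < 0.3:
        return "low"
    if utilization < 0.7:
        return "medium"
    return "peak"


def inject_by_workload(workload: Callable[[float], float], horizon: float,
                       poll_interval: float, rng: random.Random) -> List[Tuple[float, str]]:
    """Return (time, state) pairs for faults injected up to `horizon`."""
    faults = []
    t = 0.0
    while t < horizon:
        state = workload_state(workload(t))      # periodic poll of the workload function
        rate = RATES[state]
        end = min(t + poll_interval, horizon)
        arrival = t + rng.expovariate(rate)      # exponential arrivals at the state's rate
        while arrival < end:
            faults.append((arrival, state))
            arrival += rng.expovariate(rate)
        t = end
    return faults


def utilization(t: float) -> float:
    """Hypothetical server utilization: idle for the first half, busy afterwards."""
    return 0.9 if t >= 5000.0 else 0.1


faults = inject_by_workload(utilization, 10000.0, 10.0, random.Random(42))
low = sum(1 for _, s in faults if s == "low")
peak = sum(1 for _, s in faults if s == "peak")
```

With these assumed rates, far more faults land in the busy half of the run than in the idle half, which is the workload/failure dependency the scheme is meant to reproduce.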
Figure 3.11. The Fault Injector Object
[Figure labels: workload; constant, exponential, Weibull, Erlang, workload based; Fault Injector.]
In addition to executing actual C++ and C programs, DEPEND provides an abstract software modeling environment
to simulate program behavior during the early design stages when actual code does not exist. The environment
represents application programs by decomposing them into graph models consisting of a set of nodes, a set of edges
that probabilistically determine the flow from node to node, and a mapping of the nodes to memory. The graph
models are then mapped to virtual memory and executed while errors are injected into the program's memory space.
The environment provides application-dependent parameters, such as detection and propagation times, and permits
meaningful application-dependent evaluation of function- and system-level error detection and recovery schemes.
This environment has been used to analyze memory-scrubbing schemes within the context of application programs
[Goswami93c]. The application-dependent coverage values obtained were compared with those obtained by traditional
schemes that assume uniform or random memory access patterns. The coverage values obtained using the traditional
approach were found to be up to 100% larger than those obtained with the software graph model. The findings
demonstrate the need for application-dependent evaluation, especially when evaluating the dependability of
application-specific systems.
DEPEND has been applied to evaluate several computer systems. In [Goswami91] and [Goswami92],
DEPEND was used to simulate the UNIX-based Tandem Integrity S2 fault tolerant system and evaluate how well it
handles near-coincident errors caused by correlated and latent faults. Issues such as memory scrubbing, reintegration
policies, and workload-dependent repair time were evaluated. The accuracy of the simulation model was validated by
comparing the results of the simulations with measurements obtained from fault injection experiments conducted on a
production Tandem Integrity S2 machine. DEPEND has also been used to study the CM5 connection machine, the
Parsytec high-performance computer being developed by the European Esprit project, the Space Station Data
Management System, and the computing element of the Hubble Telescope.
IV. PROTOTYPE PHASE
In the prototype phase of the development of fault tolerant systems, physical fault injection can be used to eval-
uate fault, error, failure, and fault tolerance characteristics of the developed systems. Normally, fault injection can be
applied only to fault tolerant systems, because the injected faults, if activated, would almost always crash the target
system without fault tolerant mechanisms. However, fault injection can also be used in non-fault-tolerant systems if
the system control flow can be well traced and the system state information can be obtained when it crashes because
of injected faults.
A fault injection environment typically contains the following components: target system, controller and
monitor, fault injector, data collector, and data analyzer, as shown in Figure 4.1. The target can be a VLSI chip, a computer
system, or a network system. When faults are injected into the target, either benchmarks or synthetic workloads
should be running on the target to emulate real workloads. The controller is a special software program, sitting on the
target or on another computer, which controls the overall fault injection experiments. The fault injector implements
fault injections into the target. The monitor keeps track of normal and abnormal executions of the workload and
initiates data collection whenever necessary. The data collector and analyzer perform on-line data collection and off-line
data processing and analysis, respectively.
Figure 4.1. Components in a Fault Injection Environment
[Figure labels: Controller; Monitor; Injector; Data Collector; Data Analyzer.]
Table 4.1. Categories of Fault Injections
Category       Hardware-Implemented          Software-Implemented                  Radiation-Induced
Approach       Inject faults to IC pins      Inject faults to components           Inject faults by applying
               by hardware instrumentation   by special software                   radiation rays to target
Advantages     No disturbance to workload    Flexibility                           Can induce transient faults
               High time resolution          Low cost                              inside IC evenly
Disadvantages  Limited access points         Workload disturbance                  Fault injection points
               High cost                     Low time resolution                   are uncontrollable
Studies        FTMP [Lala83]                 Accelerated Injection [Chillarege89]  Z-80 [Cusick85]
               FTMP [Shin84], [Shin86]       FIAT [Segall88], [Barton90]           MC6809E [Karlsson89]
               FTMP [Finelli87]              FERRARI [Kanawati92]                  MC6809E [Gunneflo89]
               MESSALINE [Arlat90]           HYBRID [Young92]
                                             FINE [Kao93]
The fault injector can be implemented by hardware, software, or radiation. Correspondingly, fault injection can
generally be divided into three categories: hardware-implemented (or hardware) fault injection, software-implemented
(or software) fault injection, and radiation-induced (or radiation) fault injection. Table 4.1 lists features and
representative studies in these categories. The monitor can also be implemented by hardware, software, or both (hybrid). If
the fault injector is implemented by software and the monitor is implemented by hardware or by both hardware and
software, the system is called a hybrid fault injection environment. The following three sections discuss in detail each
type of fault injection.
4.1. Hardware-Implemented Fault Injection
Hardware-implemented fault injection is a method of introducing faults in the hardware of a computer system
with the aid of additional hardware instrumentation. The method is well suited for studying dependability
characteristics which require high time resolution, such as fault latency in the CPU, which cannot be easily achieved by other
fault injection methods. For example, the occurrence of software-implemented faults is restricted by the system clock
(i.e., the injections must occur synchronously). Detections are similarly restricted by the system clock, unless an
external hardware monitor is used. Two main techniques are used to accomplish hardware-implemented fault injections.
The first approach involves the use of active probes attached to the desired hardware injection points. The cur-
rents through these injection points can be altered, thereby influencing the corresponding logic values. The types of
faults attainable with probes are usually limited to stuck-at faults. However, it is also possible to introduce bridging
faults by placing the probes across multiple hardware points. Care must be taken with the use of active probes that
force values onto injection points, because damage to the target hardware can result due to an inordinate amount of
current.
The second technique involves the insertion of additional hardware into the target computer system. Whereas
the first method uses active probes which are external to the target system, this method introduces additional hard-
ware, which becomes part of the target system. The most common approach requires the interpolation of a socket
between a chip and the circuit board. This socket has the capability to inject stuck-at faults or open faults, where the
chip pin is essentially tri-stated. In addition, more complex logical faults can be forced onto these pins. For instance,
the pin signals can be inverted, or ANDed or ORed with adjacent pin signals or even with previous signals on the
same pin.
In theory, the domain of possible injection locations is limited only by the physical constraints of the target sys-
tem that prevent the introduction of probes or other hardware. Since the target system is usually a complete prototype
computer system, fault injection below the chip pin level is impractical. Thus, the locus of most injections is the
pins of chips. In addition, active probes can be attached to certain circuit board locations, such as buses or other sig-
nal lines.
In addition to the range of possible injection locations, a major concern of any fault injection environment is the
fault types or models that are available. We have already discussed some types of faults achievable with probes and
sockets: stuck-at, open, bridging, or complex logical functions. Another important aspect of fault types is the
duration of the fault, which can be permanent, transient, or intermittent. Permanent faults are simply held on the
injection point until an error detection occurs. In contrast, transient faults are placed on the injection point only for an
active period, after which they are removed. Thus, the possibility exists that the transient fault may never even be
latched into a chip (i.e., the fault never produces an error), especially if the active period is less than the system clock
period on a synchronous machine. Intermittent faults are injected in the same manner as transient faults, but they are
also repeated, either randomly or according to some function. Both injection methods discussed previously are
capable of creating any of the three temporal fault types.
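The combination of a logical fault type with a temporal fault type can be illustrated with a small cycle-level model. This is a sketch under stated assumptions, not a description of any of the instruments discussed here: a pin is modeled as one logic value per clock cycle, all function names are hypothetical, and a permanent fault is simply held to the end of the trace rather than until detection.

```python
from typing import Callable, List

Fault = Callable[[int, int], int]   # (pin value, adjacent pin value) -> forced value
Window = Callable[[int], bool]      # clock cycle -> is the fault applied?


def stuck_at(value: int) -> Fault:
    return lambda pin, adjacent: value


def inverted(pin: int, adjacent: int) -> int:
    return pin ^ 1


def and_with_adjacent(pin: int, adjacent: int) -> int:
    return pin & adjacent


def permanent(start: int) -> Window:
    return lambda cycle: cycle >= start


def transient(start: int, length: int) -> Window:
    return lambda cycle: start <= cycle < start + length


def intermittent(start: int, length: int, period: int) -> Window:
    # Repeats a transient of `length` cycles every `period` cycles.
    return lambda cycle: cycle >= start and (cycle - start) % period < length


def inject(signal: List[int], adjacent: List[int], fault: Fault, window: Window) -> List[int]:
    """Return the pin trace with `fault` applied during the cycles selected by `window`."""
    return [fault(p, a) if window(c) else p
            for c, (p, a) in enumerate(zip(signal, adjacent))]


pin = [1, 0, 1, 1, 0, 1, 1, 0]
adjacent = [0, 1, 1, 0, 1, 1, 0, 1]

held_low = inject(pin, adjacent, stuck_at(0), permanent(2))
glitch = inject(pin, adjacent, inverted, transient(3, 2))
flaky = inject(pin, adjacent, stuck_at(0), intermittent(0, 1, 3))
masked = inject(pin, adjacent, stuck_at(1), transient(2, 1))
```

In this model `masked` equals the fault-free trace because the forced value coincides with the correct one during the active period. This is a different masking effect from the sub-clock-period case described in the text, but it shows in the same way that an injected fault need not produce an error.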
In the following, we will discuss two representative hardware-implemented fault injection environments: FTMP
[Lala83] and MESSALINE [Arlat89].
4.1.1. FTMP
Several studies in this area centered around the fault tolerant multiprocessor (FTMP) fault injection
instrumentation [Lala83], [Shin86], [Finelli87]. FTMP is a computer architecture which evolved over a 10-year period in
connection with several critical aerospace applications [Hopkins78]. The architecture was designed to have a failure rate
of the order of 10^-10 per hour. The basic blocks of the architecture are independent processor-cache modules and
memory modules which communicate through redundant buses. The modules are dynamically grouped into several
TMR triads or assigned to spare status. Jobs can be scheduled to any processor triad. All transactions between
processor modules and memory modules in a triad are voted bit-by-bit. When a fault occurs, the faulty module is isolated
and the faulty triad reconfigured. Fault detection, diagnosis, and recovery are handled in such a way that application
programs are not involved.
Figure 4.2 shows the diagram of the FTMP fault injection instrumentation developed at the Charles Stark
Draper Laboratory [Lala83], [Finelli87]. In an FTMP computer, there are several line replaceable units (LRUs), each
containing a processor, clock generator, power subsystem, and bus interface circuits. LRU #3 is constructed for
connection of the fault injector. All chips in LRU #3 are connected to sockets which allow them to be removed for
insertion of the fault injection implant. Each fault injection implant contains circuitry which can interrupt and reconnect
the pins in the sockets. Several different types of faults, such as stuck-at-0 and stuck-at-1, can be injected into the pins
by the implants. These implants are controlled by a VAX 11/750 computer. A special version of the system
configuration control (FSCC) program running in the FTMP communicates with the fault injection software (FIS) running in
the VAX 11/750 through one of the FTMP I/O ports and a 1553/UNIBUS data link.
Figure 4.2. FTMP Fault Injection Environment
[Figure labels: Fault Injection Software (VAX 11/750); 1553/Unibus DMA Board; Fault Injector; LRU #3; FTMP; FSCC Software.
Fault injection process: (1) Get ready from VAX; (2) FTMP acknowledges; (3) FTMP restores LRU #3; (4) Fault injected; (5) Data from FTMP.]
Faults are normally injected on one pin at a time. When an injection occurs, the FIS program chooses a fault and
a pin, applies the fault to the pin, and records the injection time. Once the FTMP detects and identifies the fault and
reconfigures the system, it sends this information along with the time of each event back to FIS. Upon receiving the
information, FIS removes the fault by restoring the pin to its normal state and notifies the FTMP. The FTMP then puts
the victim module back into an active state and notifies FIS that it is ready for another fault injection. This process is
repeated after a random delay.
In the experiments conducted at the Charles Stark Draper Laboratory [Lala83], a total of 21,055 faults were
injected, and 17,418 (83%) were detected. All of the detected faults were identified correctly, and the system
subsequently recovered successfully from each of these faults by replacing the faulty module. That is, the coverage in the
FTMP was 100% for the detected faults, which validated the FTMP architecture and implementation.
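The detected fraction quoted above can be reproduced from the two counts, together with a rough interval estimate. The interval below uses the normal approximation to the binomial, which is an assumption of this sketch and not a method attributed to [Lala83]; the function name is hypothetical.

```python
import math


def detection_estimate(detected: int, injected: int, z: float = 1.96):
    """Point estimate and approximate 95% interval for the detected fraction."""
    p = detected / injected
    # Normal approximation to the binomial; reasonable for large sample sizes.
    half_width = z * math.sqrt(p * (1.0 - p) / injected)
    return p, p - half_width, p + half_width


p, lower, upper = detection_estimate(17418, 21055)
print(f"detected fraction = {p:.3f}, approx. 95% interval = ({lower:.3f}, {upper:.3f})")
```

With more than 20,000 injections the interval is narrow, so the 83% figure is a tight estimate of the fraction of injected faults that the FTMP detected in this experiment.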
Another study using the FTMP fault injection instrumentation was reported in [Shin84], with emphasis on the
investigation of fault latency. Results showed that the hazard rate of fault latency is monotonically decreasing. Two
distributions with monotonically decreasing hazard rates, Weibull and gamma distributions, were then used to fit the
experimental results. The study also investigated the effect of fault latency on the probability of having multiple faults.
It was shown that there exists an optimal fault latency in minimizing the multiple fault probability.
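For reference, the Weibull hazard rate is h(t) = (k/s)(t/s)^(k-1) for shape k and scale s, which decreases monotonically in t when k < 1. The short sketch below evaluates it for a few shapes; the parameter values are illustrative assumptions and are not the values fitted in [Shin84].

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1) for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


times = [1.0, 2.0, 5.0, 10.0, 20.0]
decreasing = [weibull_hazard(t, shape=0.5, scale=10.0) for t in times]  # shape < 1
constant = [weibull_hazard(t, shape=1.0, scale=10.0) for t in times]    # exponential case
increasing = [weibull_hazard(t, shape=2.0, scale=10.0) for t in times]  # shape > 1
```

A decreasing hazard rate means that the longer a fault has remained latent, the less likely it is to become active in the next instant.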
Later, fault injection experiments on the same instrumentation were conducted at the NASA Langley Research
Center [Finelli87] to investigate two issues: fault sampling methods and fault recovery distributions. For each fault
injection, two choices must be made: the fault location (pins) and the fault type (stuck-at-1, stuck-at-0, inverted signal,
etc.). Thus, the possible fault set, or the collection of all different injected faults, can be very large. Exhaustive fault
injection is costly and time consuming. It is necessary to find appropriate sampling methods to reduce the time and
cost of testing. The study compared the effects (detection behavior) of different faults and grouped these faults into
several subsets according to the similarity in their effects. The result showed that the effects are not homogeneous
across the fault set. This indicates that stratified sampling methods, based on the fault subsets, should be developed for
fault injection. The study also showed that the fault recovery time is not exponentially distributed.
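A stratified sampling scheme over a fault set can be sketched as follows. The sketch makes several assumptions that do not come from [Finelli87]: the fault set is a hypothetical 40 pins by 3 fault types, the strata are formed by fault type for simplicity (the study grouped faults by similarity of detection behavior), and the same number of faults is drawn from every stratum.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Tuple

FaultSpec = Tuple[int, str]   # (pin number, fault type); hypothetical encoding


def stratified_sample(faults: List[FaultSpec], stratum_of: Callable[[FaultSpec], Hashable],
                      per_stratum: int, rng: random.Random) -> List[FaultSpec]:
    """Draw up to `per_stratum` faults, without replacement, from each stratum."""
    strata = defaultdict(list)
    for fault in faults:
        strata[stratum_of(fault)].append(fault)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


fault_types = ["inverted", "stuck-at-0", "stuck-at-1"]
fault_set = [(pin, ftype) for pin in range(40) for ftype in fault_types]
sample = stratified_sample(fault_set, lambda f: f[1], 5, random.Random(7))
```

Here 15 injections stand in for a fault set of 120, with every stratum represented; proportional or variance-based allocation across strata would be natural refinements.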
4.1.2. MESSALINE
MESSALINE [Arlat90] is a flexible, pin-level fault injection tool that has been developed at LAAS-CNRS in
Toulouse, France. The general architecture of MESSALINE and its environment is given in Figure 4.3. The
injection, activation, and collection modules are implemented in hardware on an Intel 310 microcomputer. The software
management module resides on a Macintosh II computer, which provides a flexible user interface.
The fault injection mechanism for MESSALINE uses active probes and socket insertion. Thus, fault types such
as stuck-at, open, bridging, and complex logical functions can be injected. Because the duration and frequency of
faults can be controlled, the fault injector can introduce permanent, transient, and intermittent faults. Signals collected
from the target system can provide feedback to the injector. Also, a device is associated with each injection point to
sense when and if each fault is activated and produces an error. MESSALINE has facilities to inject faults at up to 32
injection points simultaneously.
Figure 4.3. General Architecture of MESSALINE
[Figure labels: Target System; Inputs/Outputs; Synchronization; Readouts; Initialization; Control of the Experiment; Management of the Test Sequence; Operator.]
The application of MESSALINE has been shown in two experiments involving (1) a subsystem of a centralized,
computerized interlocking system (called PAI) for railway control applications and (2) a distributed system
corresponding to an implementation of the dependable communication system of the ESPRIT Delta-4 Project.
In the case of the PAI system, permanent stuck-at-0, stuck-at-1, and open circuit faults were injected into various
memory and CPU chips. The results indicated that CPU errors were more difficult to detect than memory errors. The
error detection mechanisms were analyzed individually, and it was discovered that the diagnosis software accounted
for most of the error coverage. The elimination of hardware detection would have decreased the overall coverage by
less than 3%.
The distributed communication system was injected with intermittent stuck-at-0 and stuck-at-1 faults. The
actual faults were injected into the network attachment controllers (NAC), which provide the connection for each node
to the local area network. Results showed that over 67% of all errors cause the injected NAC to be correctly identified
and extracted. Also, 24% of the injections did not cause a detectable error. Thus, in over 91% of the injections, the
distributed system was able to correctly handle the error. These experiments demonstrate the utility and flexibility of the
MESSALINE fault injection tool.
4.2. Software-Implemented Fault Injection
While hardware-implemented fault injections require special hardware instrumentation and interfaces to circuits
of the target systems, software-implemented fault injection provides a cheap and easy-to-control methodology. In
software-implemented fault injections, no extra hardware instrumentation is needed, and users can choose fault
locations in both hardware and software components accessible to machine instructions. In addition, software defects can
only be emulated using the software approach by changing code. Several techniques have been proposed to emulate
different types of hardware and software faults through software-implemented fault injections.
Software-implemented fault injection is done by changing the contents of memory or registers, based on some
fault models, to emulate the occurrence of hardware or software faults. Hardware faults can lead to software errors
and affect software executions (hardware-induced software errors). These faults can occur in CPU, memory, bus, and
networks. They may cause the system to execute incorrect instructions, access incorrect data, and produce incorrect
results. By software faults, we mean software design/implementation defects (e.g., incorrect initialization of a
variable or failure to check a boundary condition), and they may change software states to unexpected states. If software
data is corrupted by either hardware or software faults, we call the corrupted data software errors.
At least two issues need to be addressed for software fault injections. The first is, when a fault is injected into
a memory location or a register, who owns the memory location or which process is running on the processor? In
other words, what is the target of the fault injection? The second issue is what fault models should be used to simulate
hardware and software faults. We have discussed hardware fault models at function level in Section 3.3. Like hardware
models, software models should be built based on engineering experience and field measurements.
Several fault models and implementation techniques are listed in Table 4.2. All these techniques are similar in
that they change program or memory words. To inject software faults, the text segment needs to be modified. Some
typical software faults are: a variable is used before it is initialized; a module's interface is defined or used incorrectly;
statements are in the wrong order or omitted [Sullivan91]. As a result of executing faulty software code, the data
segment may be corrupted, causing software errors. Software errors can also be directly injected by changing the data
segment.
Table 4.2. Techniques Used for Software Fault Injection
Type            Method
Software Fault  Modify the text segment of the program.
Software Error  Modify the data segment of the program.
Memory Fault    Flip memory bits of the program.
CPU Fault       Use a trap to modify the memory area of the saved CPU registers.
Bus Fault       Use traps before and after an instruction to change the instruction or data
                used by the instruction and then restore them after the instruction is executed.
Network Fault   Modify or delete transmission messages.
When the software approach is used to emulate hardware faults, the faults are normally of transient nature. For
example, the faulty bits in memory or CPU registers can be overwritten by subsequent instructions. However, the
software approach can be used to emulate permanent faults by repeatedly injecting the same fault to a location
whenever there is an access to the location. For example, to emulate a permanent stuck-at-0 fault at a particular bit in a
memory word, the bit is changed to 0 after every write operation to the word. To emulate a permanent stuck-at-1 fault
at a bus address line, the corresponding bit in the effective address (in the program counter or in a CPU register) is set
to one before any access to the bus. It is obvious that the emulation is expensive, involving the monitoring and
execution of many extra instructions.
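The re-injection idea for permanent memory faults can be sketched with a toy word-addressable memory. This is an illustration under stated assumptions, not the mechanism of any tool discussed here: the class name `StuckAtMemory` and its methods are hypothetical, words are 32 bits wide, and the counter `extra_ops` stands in for the overhead of the extra instructions mentioned in the text.

```python
class StuckAtMemory:
    """Toy 32-bit word memory that emulates permanent stuck-at faults in software."""

    WORD_MASK = 0xFFFFFFFF

    def __init__(self, size: int):
        self.words = [0] * size
        self.stuck = {}        # address -> (and_mask, or_mask)
        self.extra_ops = 0     # number of re-injections performed (emulation overhead)

    def add_stuck_at(self, addr: int, bit: int, value: int) -> None:
        """Declare bit `bit` of word `addr` permanently stuck at `value` (0 or 1)."""
        and_mask, or_mask = self.stuck.get(addr, (self.WORD_MASK, 0))
        if value == 0:
            and_mask &= ~(1 << bit) & self.WORD_MASK
        else:
            or_mask |= 1 << bit
        self.stuck[addr] = (and_mask, or_mask)
        self._reapply(addr)

    def _reapply(self, addr: int) -> None:
        if addr in self.stuck:
            and_mask, or_mask = self.stuck[addr]
            self.words[addr] = (self.words[addr] & and_mask) | or_mask
            self.extra_ops += 1

    def write(self, addr: int, value: int) -> None:
        self.words[addr] = value & self.WORD_MASK
        self._reapply(addr)    # the fault is injected again after every write

    def read(self, addr: int) -> int:
        return self.words[addr]


mem = StuckAtMemory(4)
mem.write(1, 0xFF)
mem.add_stuck_at(1, bit=0, value=0)   # bit 0 of word 1 is now stuck at 0
mem.write(1, 0x0F)                    # the write cannot set the stuck bit
mem.write(2, 0x0F)                    # a fault-free word is unaffected
```

Every write to a faulty word costs one additional masking operation, which is the source of the emulation expense noted above; a real injector would also need to intercept the writes in the first place.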
Unlike hardware-implemented fault injections which are difficult to gear toward specific workload areas,
software fault injections can be targeted toward user applications, the operating system, or both. If the target is user
applications, the fault injector can be inserted into user applications or can be an extra layer between the user applica-
tions and the operating system. If the target is the operating system, the fault injector has to be embedded in the oper-
ating system, because it is very difficult to add an extra layer between the machine and the operating system.
Although the software-based approach is very flexible, it has some restrictions. First, the approach cannot inject
faults into locations not accessible to software. We have mentioned in Section 3.2 that approximately 1/3 of errors
produced in logic-level fault injections cannot be emulated through the software approach [Czeck91]. Secondly, the
software instrumentation may disturb the workload running in the target system and even change the structure of the
original software. A careful design can alleviate the perturbation to the workload. Another disadvantage is the low time
resolution of the approach, which may cause fidelity problems. For the long latency faults, such as memory faults, the
low time resolution may not be a problem. For the short latency faults, such as bus and CPU faults, the approach may
fail to capture the error behavior (e.g., propagation). This problem can be solved by using a hardware monitor, i.e., the
hybrid approach [Young92]. The hybrid approach combines the versatility of software-implemented injection and the
accuracy of hardware monitoring. It is well suited for measuring extremely short latencies.
There have been several studies using the software-based approach. In [Chillarege89], a failure acceleration
method is used to inject the overlay software faults into an IBM commercial transaction processing system. In the
failure acceleration method, fault injections are designed such that the fault/error latency is decreased and the probability
of a fault causing a failure is increased. An overlay occurs when a program writes into an incorrect area. It is estimated
that about 1/3 of software errors can be mapped into the overlay model [Chillarege89]. The study quantified the
immediate impact and potential hazards (which may cause a catastrophic failure in the future) of the injected faults.
In recent years, interest in developing software-implemented fault injection tools has increased. Several
environments have been published in the literature: FIAT [Segall88], FERRARI [Kanawati92], HYBRID [Young92], and
FINE [Kao93]. Table 4.3 lists features of these tools, which will be discussed in the following subsections.
Table 4.3. Comparison of Software-Implemented Fault Injections
Tool                  Hardware   Injection Target  Monitor   Fault Types                     To Evaluate
FIAT [Segall88]       PC RT      O.S., User        Software  Memory, CPU, Communication      Detection, Latency, Recovery
FERRARI [Kanawati92]  SPARC      User              Software  Memory, CPU, Bus, Control flow  Detection, Latency
HYBRID [Young92]      Tandem S2  O.S., User        Hybrid    Memory, CPU, Cache              Detection, Latency, Recovery
FINE [Kao93]          Sun        O.S., User        Software  Memory, CPU, Bus, Software      Detection, Propagation
4.2.1. FIAT
A number of fault injection studies at Carnegie Mellon University centered around FIAT (Fault Injection
Automated Testing), a software-implemented fault injection environment [Segall88], [Barton90], [Czeck91]. The FIAT
hardware implementation consists of IBM RT PCs connected by a token ring network. The FIAT software structure is
divided into two parts: Fault Injection Manager (FIM) and Fault Injection REceptor (FIRE). FIM is a global control
program responsible for all phases of the experiment. FIRE, under the control of FIM, collects the experimental
results and sends appropriate information to FIM for off-line analysis. Figure 4.4 shows the process of a typical fault
injection experiment.
[Figure 4.4. Typical Fault Injection Experiment in FIAT. Stages shown: Library Preparation, Exp. Definition, Exp. Execution, Data Analysis.]
FIAT has been used to study the impact of faults at the application workload level [Barton90]. Two representative
programs, a matrix multiplication task and a selection sort task, were chosen as application workloads. To
achieve fault tolerance, each task is executed on two different processors and the results are compared. Three fault
types were injected in the experiment: zero-a-byte, set-a-byte, and two-bit compensating. The zero-a-byte and
set-a-byte faults set 8 consecutive bits anywhere within a 32-bit word to zero or one, respectively. The two-bit
compensating fault complements 2 bits in a word such that the parity code does not detect it as an error. Faults were
injected into all locations within a workload, with a total of over 130,000 faults injected.
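The three fault types described above can be written as simple bit operations on a 32-bit word. The following Python sketch is illustrative only: the function names and the `shift`, `i`, and `j` parameters are assumptions made for this example and are not FIAT's actual interface.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit word


def zero_a_byte(word, shift):
    """Clear 8 consecutive bits starting at bit position `shift` (0..24)."""
    return word & ~(0xFF << shift) & MASK32


def set_a_byte(word, shift):
    """Set 8 consecutive bits starting at bit position `shift` (0..24)."""
    return (word | (0xFF << shift)) & MASK32


def two_bit_compensating(word, i, j):
    """Complement two distinct bits, which leaves the word parity unchanged."""
    if i == j:
        raise ValueError("bit positions must differ")
    return word ^ ((1 << i) | (1 << j))


def parity(word):
    """Return 0 if the word has an even number of 1 bits, 1 otherwise."""
    return bin(word & MASK32).count("1") & 1
```

Because flipping two bits changes the number of 1 bits by -2, 0, or +2, a single parity bit computed over the word cannot distinguish the faulted word from the original, which is why this fault type escapes parity checking.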
Results showed that there are a limited number of system-level fault manifestations. The mean error detection
coverage for different workloads and fault types is around 50% to 60%. Error detection latency was found to follow a
normal distribution. This result conflicts with those presented in [Shin86], [Finelli87], where the latency was shown
to follow gamma, Weibull, or log-normal distributions. This difference may be explained by the differences in
the experimental environment and detection mechanisms. In [Shin86], [Finelli87], the hardware-implemented fault
injection technique is used, and the resolution of detection time is on the order of milliseconds, while the time
resolution of the software-implemented FIAT is on the order of seconds, which may skew the results.
4.2.2. FERRARI
FERRARI (Fault and ERRor Automatic Real-time Injector), another software-implemented fault injection
environment, was recently developed at the University of Texas at Austin [Kanawati92]. The purpose of the development
of FERRARI was to evaluate complex systems by emulating most hardware faults in software. It was implemented on
SPARC workstations in an X-window environment. FERRARI consists of four software modules: 1) the initializer
and activator, 2) the user information, 3) the fault and error injector, and 4) the collector and analyzer. These four
modules are controlled by a manager module, which coordinates their operation.
The initialization and activation module prepares the target program for fault injection by extracting its
information, such as the starting address, the program size, and the execution time. The user information module receives
experiment parameters provided by the user, such as experiment mode, fault and error types, and dependability mea-
sures to obtain. The fault and error injection module is responsible for injecting different types of transient or perma-
nent faults, such as address line fault, data line fault, and fault in condition code flags. The data collection and analy-
sis module records experiment results, such as information about error detection, error latency, and failures, and it
determines statistics of these measures at the end of the experiment.
To demonstrate the capabilities of FERRARI and to study the behavior of the target system under faulty condi-
tions, over 600,000 fault injection runs were conducted on SUN4 SPARC workstations under different applications.
Results showed that the error coverage is highly dependent on the fault type. The highest coverage was obtained when
errors were injected in the task memory image. This is because the injected errors are likely to be exercised repeatedly
if the corrupted instructions are in a loop. An important finding is that a considerable number of undetected errors are
those that corrupted input/output routines and system libraries. These routines may tend to be ignored when error
detection techniques are embedded in the user code.
4.2.3. HYBRID
A major drawback of the above purely software-implemented fault injection environments is the low resolution
of detection time. If the error detection mechanism is implemented with hardware, the time resolution is greatly
enhanced. This approach is used in the hybrid fault injection environment developed at the University of Illinois at
Urbana-Champaign [Young92]. The hybrid environment combines the versatility of software injection and the
accuracy of hardware monitoring. It is well suited for measuring extremely short error latencies, and the introduced
overhead is minimal, so that error propagation and control flow are not significantly affected by the presence of
instrumentation.
In the hybrid environment, faults are injected via software, and the impact is measured by both software and
hardware. Figure 4.5 illustrates the subsystems that make up the environment. It consists of a fault injection system,
a hybrid monitor system to measure the effects of injected faults, and a supervisory system to automate the measure-
ments. The hybrid monitor system is further divided into a hardware monitor and a software monitor. Figure 4.6
illustrates how these systems are physically situated. The fault injector and software monitor execute on the test sys-
tem, while the supervisor program executes on the control host. Probes attach the hardware monitor to the
address/data backplane of the test system so that the monitor can analyze and record the signals generated. Communi-
cation between the supervisor and the hardware monitor takes place over an RS-232 or GPIB connection.
[Figure 4.5. Hybrid Fault Injection Environment. Blocks shown: Fault Injector; Hybrid Monitor (Hardware Monitor, Software Monitor); Supervisor.]
[Figure 4.6. Physical Layout of Hybrid Fault Injection System. Elements shown: test system, backplane probes, hardware monitor, GPIB or RS-232 link, control host.]
The function of the environment is to perform experiments that repeatedly inject faults and record observations.
The environment introduces faults into the test system during the execution of a target program, measures the effects
of that fault, and returns the test system to conditions present prior to fault injection. These operations form a single
observation loop. Faults can be injected into any location that has a physical address, e.g., CPU registers, cache, local
memory, mass storage, and network controllers. Faults can also be injected into locations allocated to a single,
executing user program or even into the kernel, and propagation can be characterized down to the instruction level.
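The observation loop (inject, measure, restore) can be illustrated with a toy model. The Python sketch below is not the hybrid environment's implementation: the "test system" is reduced to a list of integers, the fault is a single bit flip, and the names `observation_loop` and `run_target` are hypothetical.

```python
import random


def observation_loop(memory, run_target, n_runs, seed=0):
    """Repeat inject -> measure -> restore on a toy memory (list of ints)."""
    rng = random.Random(seed)
    golden = run_target(memory)              # fault-free reference result
    observations = []
    for _ in range(n_runs):
        snapshot = list(memory)              # conditions prior to injection
        addr = rng.randrange(len(memory))    # assumed fault location
        bit = rng.randrange(32)
        memory[addr] ^= 1 << bit             # inject a single bit flip
        try:
            same = run_target(memory) == golden
            outcome = "no effect" if same else "wrong result"
        except Exception:
            outcome = "detected"             # target raised: count as detection
        observations.append((addr, bit, outcome))
        memory[:] = snapshot                 # restore the pre-fault state
    return observations
```

For example, with `run_target = lambda m: sum(m[:4])` on an 8-word memory, faults injected into the last four words are never exercised and are recorded as "no effect", while faults in the first four words change the result.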
The fault injection environment was used to study dependability characteristics of a Tandem Integrity S2 fault
tolerant computer system [Jewett91]. A high degree of accuracy in measuring latency (within 20 ns) was obtained.
Measurements of the sensitivity of different instructions to faults indicated a 5% chance that a faulted MIPS RISC
instruction will not fail when executed. Modeling of multi-level error propagation showed that error detections were
due to multiple corruptions of state in as many as 57% of reads from wrong addresses and 37% of writes to wrong
addresses. The median latency associated with error detection by an individual CPU was on the order of 10 µs, and
the median delay between detection and the start of CPU shutdown was on the order of 100 ms. Kernel fault injection
studies show that a fault in the kernel is 2.6 times as likely to bring down a CPU as a fault elsewhere.
4.2.4. FINE
FINE is a UNIX-based fault injection environment developed at the University of Illinois at Urbana-Champaign
[Kao93]. The significance of FINE is twofold. First, it is the first tool that can inject software faults as well as
hardware errors. Second, it is the first tool built for tracing fault propagation among software modules. The software
faults that can be injected by FINE include initialization (missing or incorrect), assignment (missing or incorrect),
condition check (missing or incorrect), and function (incorrect) faults. FINE can also inject hardware errors such as
CPU (ALU, shifter, opcode decoder, or registers), memory (text segment or data segment), and bus errors (address
lines or data lines).
Figure 4.7 shows the FINE environment. FINE consists of a fault injector, a software monitor, a workload
generator, a controller, and several analysis utilities. The fault injector and software monitor are embedded in the kernel
so that faults can be injected there and their propagation can be monitored. Fault injection is implemented by
modifying the system trap handling routines, so the fault injector can be considered an extra layer between the
operating system and the machine. The software monitor traces the execution flow and key variables of the kernel.
Software probes are inserted into functions in the kernel to record the execution flow and the values of arguments and key
variables. The synthetic workload generator issues various system calls to activate injected faults. The distribution of
generated system calls can be specified by users to emulate real workloads or to deliberately accelerate the activation
of injected faults. The controller assigns experiment specifications to the fault injector and the monitor, and it initiates
experiments. The analysis utilities provide assistance in analyzing fault propagation. The target of the study is the
UNIX kernel, a non-stop, highly parameterized, complex service program with high impact and a broad spectrum
of workloads.
[Figure 4.7. The FINE Environment. Labels recoverable from the figure: Fault Injector, trace file, Analyze Fault Propagation.]
Experiments on SunOS 4.1.2 (on a SPARCstation IPC) were conducted by applying FINE to investigate fault
propagation and to evaluate the impact of various types of faults. Results showed that memory faults and software
faults usually have a very long latency, while bus faults and CPU faults tend to crash the system immediately. Nearly
90% of detected errors are detected by hardware. About half (47%) of the detected errors are data errors, which
are detected when the system tries to access an area it has no privilege to access. In software fault
propagation, incorrect control flow is the major impact for the first level of propagation, while data corruption is the
major impact for subsequent propagation. Analysis of fault propagation among the UNIX subsystems revealed that only
about 8% of faults propagate to other UNIX subsystems.
4.3. Radiation-Induced Fault Injection
Neither hardware-implemented nor software-implemented fault injections have a way to produce transient faults
at random locations inside ICs. Radiation-induced fault injections provide such a capability. One way to do this is to
expose the chip to the heavy-ion radiation from a Californium 252 (Cf 252) source [Gunneflo89], [Karlsson89]. The
heavy ions emitted from the source are capable of creating transient faults when they pass through a depletion region
in the IC. One advantage of this method is that it can produce transient faults at random locations evenly and can
cause either a single bit flip or multiple bit flips. This leads to large variation in the errors seen on the output pins of
the IC.
In the fault injection experiments reported in [Gunneflo89], [Karlsson89], the Cf 252 method was used to investigate
error coverage and detection latency for error detection schemes for the MC6809E 8-bit microprocessor. The
intention of the experiments was to characterize the effects of transient faults that originate inside a CPU. The
MC6809E is fabricated in NMOS, a technology sensitive to heavy ion radiation. The error detection schemes under
study are suitable for implementation with a watchdog processor that checks the behavior of the main processor on the
external bus. The developed experimental system is called FIST (Fault Injection system for Study of Transient fault
effects). Figure 4.8 shows the FIST diagram.
The heavy-ion radiation is provided by a commercially available 37 × 10^3 becquerel (1 µCi) Cf 252
source. The Cf 252 source is mounted inside a vacuum chamber together with a small computer system. One of the
system boards is placed on a mechanical fixture movable in three dimensions for accurate positioning of the CPU beneath
the Cf 252 source. The system has two MC6809E CPUs which operate synchronously using the same clock. One CPU
is exposed to heavy-ion radiation. The other is used as a reference to detect errors by comparing the outputs of
the two CPUs. When errors are detected by the comparison logic, the logic analyzer is triggered to record the external
bus signals. The monitoring computer is responsible for data acquisition and control of experiments.
[Figure 4.8. FIST Diagram. Components shown: MC6809E test CPU and MC6809E reference CPU inside the vacuum chamber, comparator and error flip-flops, logic analyzer, monitoring computer, host computer.]
A fault injection experiment is conducted in the following way. Before the experiment starts, the monitoring
computer fetches from the host computer a load file which contains the test program to be executed. The test program
is then loaded from the monitoring computer to the MC6809E system. After the loading, the test program is started
with a "go" command from the monitoring computer. When a mismatch is detected, the monitoring computer fetches
the recorded error data from the logic analyzer and the error flip-flops in the MC6809E system and transfers them to
the host computer. Finally, the MC6809E system is reset, and the test program is reloaded for the next experiment.
It was found from fault injection experiments that 78% of all errors affected control flow (i.e., caused the
processor to diverge from the correct sequence) and 17% caused errors in data. Results also showed that 30% of all
errors were multiple bit errors on the output pins, although the origin of each of these errors was only one single heavy
ion. The error recordings obtained from the experiments were also used as input to simulation models of different
error detection mechanisms to evaluate these error detection mechanisms without implementing them. The coverage
of several detection mechanisms was investigated. It was found that the best mechanism was the one that detects
access to the memory outside permitted areas and that the combination of two mechanisms gave a better coverage
than any one mechanism alone. It was also found that the type of the test program had a considerable influence on the
results of error detection mechanisms.
V. OPERATIONAL PHASE
When a computer system is in normal operation, various errors occur in both hardware and software. There are
many possible sources of errors, including untested manufacturing faults and software defects, wearing out of devices,
transient errors induced by radiation, power surges, or other physical processes, operator errors, and environmental
factors. The occurrence of errors is also highly dependent on workloads running in the system. A distribution of
operational outages from various error sources for several major commercial systems is reported in [Siewiorek92].
To understand dependability characteristics of a complex computer system, there is no better way than
measuring real systems and analyzing the measured data. Here, measuring real systems means monitoring and recording
naturally occurring errors and failures in the system while it is running under user workloads. Analysis of such
measurements can provide valuable information on actual error/failure behavior, identify system bottlenecks, obtain
dependability measures, and verify assumptions made in analytical models.
Given field error data collected from a real system, a measurement-based study consists of four steps, shown in
Figure 5.1: 1) data processing, 2) model identification and measure acquisition, 3) model solution if necessary, and 4)
analysis of models and measures. Step 1 consists of extracting necessary information from field data (the result can be
a form of compressed data, or flat data), classifying errors and failures, and coalescing repeated error reports. In a
computer system, a single problem commonly results in many repeated error observations occurring in rapid
succession. To ensure that the analysis is not biased by repeated observations of the same problem, all error entries which
have the same error type and occur within a short time interval (e.g., 5 minutes) of each other should be coalesced in
data processing. Thus, the output of this step is a form of coalesced data in which errors and failures are identified.
This step is highly dependent on the measured system. Coalescing algorithms have been proposed in [Tsao83],
[Iyer86], [Hansen92].
Step 2 includes identifying appropriate models (such as Markov models) and acquiring measures (such as
MTBFs and TBF distributions) from the coalesced data. Several models have been proposed and validated using real
data, such as the workload-dependent cyclostationary model in [Castillo81], the workload hazard model in [Iyer82a],
and the correlation models in [Tang92a]. Statistical analysis packages such as SAS [SAS85] or measurement-based
dependability analysis tools such as MEASURE+ [Tang93b] can be used to perform this analysis. Step 3 solves these
models to obtain some other measures (such as reliability and transient reward rates). Dependability and performance
modeling and evaluation tools such as SHARPE [Sahner87] can be used in this step. The most creative part is step 4,
the human analysis of models and measures obtained from data. New results are produced in this step. For example,
reliability bottlenecks can be identified from analysis of error/failure statistics, and workload/failure dependency can
be concluded by analysis of models. However, analysis methods may vary significantly from one study to another,
depending on research goals.
[Figure 5.1. Measurement-Based Analysis. Flow shown: field data; Step 1, data processing (coalesced data); Step 2, identifying models and acquiring measures; Step 3, model solution; Step 4, analysis of models and measures; results.]
Measurement-based dependability analysis of operational systems has evolved significantly over the past 15
years. These studies addressed one or more of the following issues: basic error characteristics, dependency analysis,
modeling and evaluation, software dependability, and fault diagnosis. The following paragraphs give a brief overview
of these studies, which are listed in Table 5.1.
Early studies in this field investigated transient errors in DEC computer systems and found that more than 95%
of all detected errors are intermittent or transient errors [Siewiorek78], [McConnel79]. The studies also showed that
the inter-arrival time of transient errors follows a Weibull distribution with a decreasing error rate. This distribution
was later shown to fit the software failure data collected from an IBM operating system [Iyer85b]. A recent study of
failure data from three different operating systems showed that TTE (time to error) can be represented by a
multi-stage gamma distribution for the measured single-machine operating system and by hyperexponential distributions for
the measured distributed operating systems [Lee93a].
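To make "a Weibull distribution with a decreasing error rate" concrete: under the standard two-parameter Weibull model, the hazard (error) rate decreases over time exactly when the shape parameter is below 1. The Python sketch below uses that textbook parameterization; the parameter names `shape` and `scale` are assumptions for illustration and are not taken from the cited studies.

```python
def weibull_hazard(t, shape, scale):
    """Hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# shape < 1: the hazard falls as time since the last error grows
decreasing = [weibull_hazard(t, shape=0.5, scale=1.0) for t in (1.0, 4.0, 16.0)]

# shape == 1: the Weibull reduces to the exponential, with a constant hazard
constant = [weibull_hazard(t, shape=1.0, scale=2.0) for t in (1.0, 4.0, 16.0)]
```

A decreasing hazard means that a new error is most likely shortly after the previous one, which is consistent with the bursty error behavior reported in the measurement studies.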
Studies of dependency between workload and failure in the early 1980s, based on measurements from IBM [Butner80]
and DEC [Castillo81] machines, revealed that the average system failure rate is strongly correlated with the
average workload on the system. The effect of workload-imposed stress on software was investigated in [Castillo82]
and [Iyer85b]. Recent analyses of DEC [Tang90], [Wein90] and Tandem [Lee91] multicomputer systems showed that
correlated failures across processors are not negligible, and their impact on availability and reliability is significant
[Dugan91], [Tang91], [Tang92a].
Table 5.1. Measurement-Based Studies of Computer System Dependability
Category               Issues                                         Studies
Data Coalescing        Analysis of time-based tuples                  [Tsao83], [Hansen92]
                       Clustering based on type and time              [Iyer86], [Lee91], [Tang93a]
Basic Error            Transient faults/errors                        [Siewiorek78], [McConnel79], [Iyer86]
Characteristics        Error/failure bursts                           [Iyer86], [Hsueh87], [Tang93a]
                       TTE/TTF distributions                          [McConnel79], [Iyer85b], [Lee93a]
Dependency Analysis    Hardware failure/workload dependency           [Butner80], [Castillo81], [Iyer82a]
                       Software failure/workload dependency           [Castillo82], [Iyer85b], [Mourad87]
                       Correlated failures and impact                 [Tang90], [Wein90], [Tang92a]
                       Two-way and multi-way failure dependency       [Dugan91], [Lee91], [Tang91]
Modeling and           Performability model for single machine        [Hsueh88]
Evaluation             Markov reward model for distributed system     [Tang93a]
                       Two-level models for operating systems         [Lee93a]
Software               Error recovery                                 [Velardi84], [Hsueh87]
Dependability          Hardware-related & correlated software errors  [Iyer85a], [Tang92b], [Lee93a]
                       Software fault tolerance                       [Gray90], [Lee92], [Lee93b]
                       Software defect classification                 [Sullivan91], [Sullivan92]
Fault Diagnosis        Heuristic trend analysis                       [Tsao83], [Lin90]
                       Statistical analysis of symptoms               [Iyer90]
                       Network fault signature                        [Maxion90a], [Maxion90b]
In [Hsueh88], analytical modeling and measurements were combined to develop measurement-based
reliability/performability models using data collected from an IBM mainframe. The results showed that a semi-Markov
process is better than a Markov process for modeling system behavior. Markov reward modeling techniques were further
applied to distributed systems [Tang93a] and fault tolerant systems [Lee92], to quantify performance loss due to
errors/failures for both hardware and software. A census of Tandem system availability indicated that software faults
are the major source of system outages in the measured fault tolerant systems [Gray90]. Analyses of field data from
different software systems investigated several dependability issues including the effectiveness of error recovery
[Velardi84], hardware-related software errors [Iyer85a], correlated software errors in distributed systems [Tang92b],
software fault tolerance [Lee92], [Lee93b], and software defect classification [Sullivan91], [Sullivan92].
Measurement-based fault diagnosis and failure prediction issues were investigated in [Tsao83], [Iyer90], [Lin90],
[Maxion90a], [Maxion90b].
In the following subsections, we discuss issues and representative studies involved in measurements, data
processing, preliminary analysis of data, dependency analysis, modeling and evaluation, software dependability, and fault
diagnosis.
5.1. Measurements
There are numerous theoretical and practical difficulties associated with making measurements. The question of
what and how to measure is a difficult one. A combination of installed and custom instrumentation has been used in
most studies. From a statistical point of view, sound evaluations require a considerable amount of data. In modern
computer systems, especially in fault tolerant systems, failures are rare. To obtain meaningful data for such
systems, measurements must be made for a considerably long period of time, or sometimes the measured system must be
exposed to high-stress conditions.
In an operational system, only detected errors can be measured, because an error is known only if it is detected.
There are basically two ways to make measurements: on-line automatic logging and human manual logging. Many
large computer systems such as IBM and DEC mainframes provide error-logging software in the operating system.
The software records error reports from different subsystems, such as memory or disk subsystems, and other system
events, such as reboots and shutdowns. The reports usually include information about the location, time, and type of
the error, the system state at the error time, and sometimes error recovery (e.g., retry) information. The recorded
reports are stored in a permanent system file chronologically. The main advantage of on-line automatic logging is
its ability to record a large amount of information about transient errors and to provide details of automatic error
recovery processes, which cannot be done manually. Disadvantages are that information can be lost when a system
fails too quickly for error messages to be recorded, and that an on-line log does not include information about the
cause and propagation of the error or about off-line diagnosis.
Table 5.2 shows a sample of extracted error logs from a VAXcluster multicomputer system. Often, meanings of
a record in the logs can differ between versions of the operating system and between machine models. Error detection
and recording routines may be written and modified over time by different people. For example, a careful study of
VAX error logs and discussion with the field engineers indicate that the operating system on different VAX machine
models may report the same type of error into different categories. Thus, it is necessary to distinguish these errors in
the subsequent error classification (to be discussed in Section 5.2).
Table 5.2. A Sample of Extracted Error Logs from a VAXcluster†
Entry  System ID  Logging Time             Subsystem & Unit  Interpretation
5815   Earth      20-DEC-1987 20:23:13.22  I/O, H0$DUA51:    Disk drive error
7005   Earth       4-JAN-1988 11:45:07.12  I/O, H3$MUA1:     Tape drive error
12979  Europa      8-JAN-1988 14:14:28.63  CI, EUR$PAA0:     Path _ went from good to bad
13005  Europa      8-JAN-1988 16:23:17.41  CI, EUR$PAA0:     Error logging datagram received
13734  Europa     19-JAN-1988 17:31:30.74  CI, EUR$PAA0:     Virtual circuit timeout
3260   Mercury    24-DEC-1987 04:54:52.06  Memory, TR #2     Corrected memory error
10939  Jupiter     1-APR-1988 09:57:39.40  Unknown Device
14209  Jupiter    16-MAY-1988 13:37:04.97  CPU, SBI          Unexpected read data fault
13941  Mars       25-FEB-1988 02:13:20.25  CPU, IBOX         Machine check
20937  Mars       18-APR-1988 16:46:39.75  BugCheck          Bad memory deallocation request size or address
27958  Mars       14-MAY-1988 20:57:46.48  BugCheck          Insufficient nonpaged pool to remaster locks
37790  Saturn     20-JUL-1988 18:51:49.15  BugCheck          Unexpected system service exception
† The sample is intended to illustrate the different types of errors logged. Therefore, the entry numbers are not consecutive.
Since the information provided by on-line error logs may not be complete, it is valuable to have operator logs
compensate for the missing information in on-line logs. Whenever possible, measurements should include both on-line
and operator logs. A good operator log should include information about failure diagnosis, component replacement,
hardware and software updates, etc. It is not easy to maintain an accurate and complete operator log. Unremitting
efforts must be made for a substantial period in obtaining measurements.
5.2. Data Processing
Usually, on-line logs contain a large amount of redundant and irrelevant information in various formats. Thus,
data processing must be performed to obtain useful, classified information and put it into a flat format that will
facilitate the subsequent analyses. The first step of data processing is error classification, which classifies errors in the
measured system into a number of types based on the subsystems and components in which they occur. There is no
uniform error classification, because different systems have different hardware and software architectures. But some
error types, such as CPU, memory, and disk errors, are seen in most systems. Table 5.3 lists an error classification
(major error types) for VAXcluster systems [Tang92b], [Tang93a].
Table 5.3. Major Error Types in VAXcluster
System    Type     Description
Hardware  CPU      CPU or bus controller errors
          Memory   Memory ECC errors
          Disk     Disk, drive, and controller errors
          Network  Local network and controller errors
Software  Control  Problems involving program flow control or synchronization
          Memory   Problems referring to memory management or usage
          I/O      Inconsistent conditions detected by I/O management routines
After error classification, the remaining data processing can be broadly divided into two steps: data extraction
and data coalescing. Data extraction selects useful entries such as error and reboot reports (throwing away useless
entries such as disk volume change reports) from the log file and transforms them into a flat format. The design of the
flat format depends on the needs of the subsequent analyses. The following is a possible format:
| entry number | logging time | error type | device id | other fields |
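As an illustration of data extraction, the Python sketch below turns already-extracted log fields shaped like the rows of Table 5.2 into one flat record. The record layout follows the format above, but the class name `FlatRecord`, the helper names, and the rule of splitting "Subsystem & Unit" at the first comma are assumptions made for this example.

```python
from datetime import datetime
from typing import NamedTuple

MONTHS = {m: i + 1 for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split())}


class FlatRecord(NamedTuple):
    entry_number: int
    logging_time: datetime
    error_type: str
    device_id: str


def parse_time(stamp):
    """Parse a timestamp such as '20-DEC-1987 20:23:13.22'."""
    date, clock = stamp.split()
    day, mon, year = date.split("-")
    hh, mm, ss = clock.split(":")
    sec, _, frac = ss.partition(".")
    micro = int(frac.ljust(6, "0")[:6]) if frac else 0
    return datetime(int(year), MONTHS[mon.upper()], int(day),
                    int(hh), int(mm), int(sec), micro)


def to_flat(entry, stamp, subsystem_and_unit):
    """Build one flat record; the device id is empty when no unit is given."""
    error_type, _, device = subsystem_and_unit.partition(",")
    return FlatRecord(int(entry), parse_time(stamp),
                      error_type.strip(), device.strip())
```

For instance, `to_flat("5815", "20-DEC-1987 20:23:13.22", "I/O, H0$DUA51:")` yields a record with error type `"I/O"` and device id `"H0$DUA51:"`. A real extractor would also need the error classification step described earlier, since the raw subsystem name is not always the final error type.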
In on-line error logs, a single fault in the system can result in many repeated error reports in a short period of
time. To ensure that the subsequent analyses will not be biased by these repeated reports, entries which correspond to
the same problem should be coalesced into a single event. A commonly used coalescing algorithm [Iyer86] merges
all error entries which have the same error type and occur within a ΔT interval of each other into a tuple. The
algorithm is as follows:
IF <error type> = <type of previous error>
AND <time away from previous error> < ΔT
THEN <put error into the tuple being built>
ELSE <start a new tuple>
A tuple reflects the occurrence of one or more errors of the same type in rapid succession and can be
represented by a record containing at least the following fields [Tsao83], [Tang93b]:
(1) tuple_id -- identification of the tuple
(2) no_entry -- number of error entries in the tuple
(3) start_time -- logging time of the first entry in the tuple
(4) end_time -- logging time of the last entry in the tuple
(5) err_type -- error type of the tuple
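The coalescing rule and the tuple record can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the cited studies: it assumes the entries are already sorted by logging time, that times are plain numbers in seconds, and that entries are `(time, error_type)` pairs.

```python
from dataclasses import dataclass


@dataclass
class ErrorTuple:
    tuple_id: int       # identification of the tuple
    no_entry: int       # number of error entries in the tuple
    start_time: float   # logging time of the first entry
    end_time: float     # logging time of the last entry
    err_type: str       # error type of the tuple


def coalesce(entries, delta_t):
    """Merge time-sorted (time, error_type) entries into tuples."""
    tuples = []
    for time, err_type in entries:
        last = tuples[-1] if tuples else None
        # The previous error is always the last entry of the last tuple.
        if (last is not None and err_type == last.err_type
                and time - last.end_time < delta_t):
            last.no_entry += 1          # put error into the tuple being built
            last.end_time = time
        else:                           # start a new tuple
            tuples.append(ErrorTuple(len(tuples) + 1, 1, time, time, err_type))
    return tuples
```

With a 5-minute window (`delta_t=300`), the entries `[(0, "disk"), (60, "disk"), (200, "disk"), (260, "cpu"), (1000, "disk")]` collapse into three tuples: a disk tuple with three entries, a CPU tuple, and a second disk tuple.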
Different systems may need different time intervals in data coalescing. A recent study on this issue [Hansen92]
defined two types of mistakes that can be made in data coalescing: collision and truncation. A collision occurs when
the detection times of two faults are close enough (within ΔT) that they are combined into a tuple. A truncation
occurs when the time between two reports caused by a single fault is greater than ΔT, so that the two reports are split
into different tuples. If ΔT is large, collisions are likely to occur. If ΔT is small, truncations are likely to occur. The
study found that there is a threshold time interval beyond which the number of collisions increases rapidly. Based on this
observation, the study proposed a statistical model which can be used to select an appropriate time interval to reduce
collisions. According to our experience, collision is not a big problem if the error type and device information are used
in data coalescing as shown in the above coalescing algorithm. Truncation is usually not considered to be a problem
[Hansen92]. There are techniques [Iyer90], [Lin90] which deal with this problem and which are used for fault
diagnosis and failure prediction (to be discussed in Section 5.7).
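The trade-off between collisions and truncations can be explored on synthetic data in which the underlying fault of each report is known. The Python sketch below is only an illustration of the two definitions, not the statistical model of [Hansen92]: it assumes all reports have the same error type, labels each report with a hypothetical `fault_id`, and checks adjacent reports only.

```python
def count_mistakes(events, delta_t):
    """Count collisions and truncations for time-sorted (time, fault_id) reports."""
    collisions = truncations = 0
    for (t_prev, f_prev), (t_cur, f_cur) in zip(events, events[1:]):
        merged = (t_cur - t_prev) < delta_t
        if merged and f_prev != f_cur:
            collisions += 1      # two different faults combined into one tuple
        elif not merged and f_prev == f_cur:
            truncations += 1     # one fault split across two tuples
    return collisions, truncations


# Hypothetical reports: fault "A" at 0, 10, 400 s; fault "B" at 420, 2000 s.
events = [(0, "A"), (10, "A"), (400, "A"), (420, "B"), (2000, "B")]
```

On this data a small window (`delta_t=5`) gives no collisions but three truncations, while a large window (`delta_t=5000`) gives one collision and no truncations, matching the qualitative behavior described above.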
5.3. Preliminary Analysis
Once coalesced data is obtained, basic dependability characteristics of the measured system can be identified by
a preliminary statistical analysis. Commonly used measures in the analysis include error/failure frequency, TTE or
TTF distribution, and error/failure hazard rate function. In the following discussion, data from a VAXcluster system
[Tang93a] is used to illustrate analysis methods.
5.3.1. Basic Statistics
Although it is not difficult, it is important to first obtain basic statistics such as frequency, percentage, and
probability from the measured data. These statistics provide a basic picture of the measured system. Often, dependability
bottlenecks can be identified by analysis of the statistics. Table 5.4 shows the error/failure statistics for the measured
VAXcluster. In the table, I/O errors include disk, tape, and network errors. Machine errors include CPU and memory
errors. Software errors are software-related errors. The 95% confidence intervals for the percentage and probability
estimates shown in the table are calculated using the method discussed in Section 2.1 for estimating confidence
intervals for proportions. Two bottlenecks can be identified from the table.
Table 5.4. Error/Failure Statistics for the VAXcluster
          Error                    Failure                  Recovery
Category  Frequency  Percentage    Frequency  Percentage    Probability
I/O       25807      92.87 ± 0.30  105        42.86 ± 6.20  0.996 ± 0.001
Machine   1721       6.19 ± 0.28   5          2.04 ± 1.77   0.970 ± 0.002
Software  69         0.25 ± 0.06   62         25.31 ± 5.44  0.101 ± 0.071
Unknown   191        0.69 ± 0.10   73         29.80 ± 5.73  0.618 ± 0.069
All       27788      100.0         245        100.0         0.991 ± 0.001
First,themajorerrorcategoryis I/Oerrors(93%),i.e.,errorsfromsharedresources.Thiscategoryof errorhas
averyhighrecoveryprobability(0.996).However,theserrors011resultinnearly43%ofall failures.This result
indicates that, although the system is generally robust to the impact of I/O errors, the shared resources still constitute a
major reliability bottleneck due to the sheer number of errors. An improvement in such a system may require using an
ultra-reliable network and a disk system to reduce the raw error rate, not just providing high recoverability.
Secondly, although software errors constitute only a small part of all errors (0.3%), they result in a significant share
of failures (25%). This is because software errors have a very low recovery probability (0.1). This software failure
estimate is conservative because there are significant unknown failures (30%). Some of these unknown failures could be
attributed to software problems. Thus, software-related problems are severe in the measured system.
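The percentages and confidence intervals in Table 5.4 can be checked with a short calculation. The sketch below assumes that the Section 2.1 method is the usual normal approximation for a proportion, p ± z·sqrt(p(1 - p)/n) with z = 1.96 for the 95% level, and that the recovery probability is the fraction of errors that do not lead to a failure. Both are assumptions, although they reproduce the I/O and software entries of the table. The function name `proportion_ci` is illustrative.

```python
import math

def proportion_ci(count, total, z=1.96):
    """Return (estimate, half-width) of a normal-approximation CI for a proportion."""
    p = count / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, half_width

# Counts taken from Table 5.4.
total_errors = 27788
io_errors = 25807
software_errors, software_failures = 69, 62

# Share of I/O errors among all errors (as a fraction).
io_share, io_half = proportion_ci(io_errors, total_errors)

# Software recovery probability: errors that did not end in a failure.
sw_recovery, sw_half = proportion_ci(software_errors - software_failures,
                                     software_errors)
```

Under these assumptions the I/O share evaluates to about 92.87% ± 0.30% and the software recovery probability to about 0.101 ± 0.071, in agreement with the table.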
5.3.2. Empirical TTE Distributions and Hazard Rates
TTE/TTF probability distributions and error/failure hazard rates are commonly used to investigate how errors
and failures occur across time. It is relatively easy to obtain empirical TTE/TTF distributions from data. Figure 5.2
shows the empirical TTE distribution function, f(t), for a measured VAXcluster system [Tang93a]. Notice that a
logarithmic scale is used for f(t) because of the large contrast between the largest and smallest values. It is seen
that about 67% of the TBEs are less than one minute. Most of these instances are "time between errors of two
different machines" because errors of the same type occurring within a five-minute interval of each other on the same
machine have been coalesced into a single error event. This fact implies that errors are likely to occur on different
machines in the measured system within a very short period of time.
The hazard rate characterizes error/failure intensity on time series. It can be considered to be the probability
that an error (failure) will occur within the coming unit of time, given that no error (failure) has occurred since the
start of the system or the last error (failure) occurrence. The mathematical definition of the hazard rate [Ross85] is as
follows:
\[
h(t) = \frac{\Pr\{\text{error in } (t, t+dt)\}}{\Pr\{\text{no errors in } (0, t)\}\, dt} = \frac{f(t)}{1 - F(t)} \tag{5.1}
\]
Figure 5.3 shows the empirical failure hazard rates computed from the VAXcluster failure data. The high hazard
rate near the origin, i.e., the high probability that a second failure will occur within a short time after a failure
occurrence, indicates that failures in the VAXcluster tend to occur in bursts. A second failure is most likely to occur
within the first two hours after a failure occurrence. Failure bursts have been observed in many studies [Iyer86],
[Hsueh87], [Bishop88]. In fact, in an early study of transient errors [McConnel79], the Weibull distribution with a
decreasing failure rate identified for the interarrival time of failures caused by transient errors implied the existence of
failure bursts.
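A binned version of Eq. (5.1) gives a simple way to compute an empirical hazard rate from observed times between failures. The sketch below is an illustration only and is not the procedure used to produce Figure 5.3: it assumes equal-width bins, estimates f(t) by the fraction of samples ending in a bin divided by the bin width, and estimates 1 - F(t) by the fraction of samples that reach the start of the bin. The function name and the sample values are hypothetical.

```python
import numpy as np

def empirical_hazard(intervals, bin_width, n_bins):
    """Binned estimate of h(t) = f(t) / (1 - F(t)) from observed intervals."""
    intervals = np.asarray(intervals, dtype=float)
    edges = bin_width * np.arange(n_bins + 1)
    # Number of observed intervals that end inside each bin.
    counts, _ = np.histogram(intervals, bins=edges)
    # Number of observed intervals that lasted at least until the bin start.
    at_risk = np.array([(intervals >= e).sum() for e in edges[:-1]])
    hazard = np.divide(counts, at_risk * bin_width,
                       out=np.zeros(n_bins), where=at_risk > 0)
    return edges[:-1], hazard

# Hypothetical times between failures, in hours.
starts, h = empirical_hazard([0.5, 0.5, 1.5, 2.5], bin_width=1.0, n_bins=3)
```

For the four sample intervals, two of four end in the first hour, one of the remaining two in the second, and the last one in the third, so the estimate is 0.5, 0.5, and 1.0 per hour. A hazard estimate that is highest in the first bins, as in Figure 5.3, is the signature of failure bursts.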
Figure 5.2. VAXcluster Empirical TTE Distribution (f(t) on a logarithmic scale versus t in minutes; Mean = 12.9,
Median = 0.08, Std. Dev. = 46.1)
Figure 5.3. VAXcluster Failure Hazard (h(t) versus t in hours)
5.3.3. Analytical TTE Distributions
A realistic, analytical form of TTE distributions is essential in modeling and evaluating computer system
dependability. Often, for simplicity or due to lack of information, TTEs are assumed to be exponentially distributed
[Arlat90b], [Laprie84]. Early measurement-based studies found that the Weibull distribution with decreasing failure
rate is representative of the time between failures (TBF) in a measured DEC computer system [McConnel79] and a
measured IBM operating system [Iyer85b]. A recent comparative study of the dependability of the Tandem
GUARDIAN, VAX VMS, and IBM MVS operating systems showed that the software TTE in a single machine can be
represented by a multi-stage gamma distribution and the software TTE in multicomputers can be represented by a
hyperexponential distribution [Lee93a]. In this section, we discuss these two types of distributions.
Before presenting the analytical TTE distributions, we first explain how a TTE distribution is obtained from a
multicomputer system, because both measured GUARDIAN and VMS were running on multicomputer systems. In
the measured multicomputer systems, all machine members are working in a similar environment and running the
same version of the operating system. If the whole system is treated as a single entity in which multiple instances of
an operating system are running concurrently, then every software error on all machines can be sequentially ordered
and a distribution can be constructed. The constructed TTE distribution reflects the software error characteristics for
the whole system. We will call this distribution the multicomputer software TTE distribution.
Figure 5.4. IBM MVS Software TTE Distribution (f(t) versus t in minutes):
f(t) = 0.748 g(t; 2.1, -1) + 0.055 g(t; 0.5, 0) + 0.069 g(t; 3.5, 3) + 0.030 g(t; 5.0, 8) + 0.098 g(t; 5.0, 1.7)
Figure 5.5. VAXcluster Software TTE Distribution (f(t) versus t in days):
f(t) = α1 λ1 e^(-λ1 t) + α2 λ2 e^(-λ2 t), with α1 = 0.67, λ1 = 0.20, α2 = 0.33, λ2 = 2.75
Figure 5.6. Tandem Software TTH Distribution (f(t) versus t in days):
f(t) = α1 λ1 e^(-λ1 t) + α2 λ2 e^(-λ2 t), with α1 = 0.87, λ1 = 0.10, α2 = 0.13, λ2 = 2.78
Figures 5.4 to 5.6 show the analytical TTE or TTH (Time To Halt) distributions fitted using SAS for the three
measured systems. None of the three empirical distributions could be fitted by a simple exponential function. The
fitting was tested using the Kolmogorov-Smirnov or Chi-square test (see Section 2.2) at a 0.05 significance level. The
two-phase hyperexponential distribution provided satisfactory fits for the VAXcluster and Tandem multicomputer
software TTE distributions. An attempt to fit the MVS TTE distribution to a phase-type exponential distribution led to a
large number of stages. As a result, the following multi-stage gamma distribution was used:
\[
f(t) = \sum_{i=1}^{n} a_i \, g(t; \alpha_i, s_i),
\]
where a_i ≥ 0, a_1 + ... + a_n = 1, and
\[
g(t; \alpha, s) =
\begin{cases}
0, & t < s, \\
\dfrac{1}{\Gamma(\alpha)} (t - s)^{\alpha - 1} e^{-(t - s)}, & t \ge s.
\end{cases}
\]
It was found that a 5-stage gamma distribution provided a satisfactory fit.
Figures 5.5 and 5.6 show that the multicomputer software TTE distribution can be modeled as a probabilistic
combination of two exponential random variables, indicating that there are two dominant error modes. The higher
error rate, λ2, with occurrence probability α2, captures both the error bursts (multiple errors occurring on the same
operating system within a short period of time) and concurrent errors (multiple errors on different instances of an
operating system within a short period of time) on these systems. The lower error rate, λ1, with occurrence
probability α1, captures regular errors and provides an inter-burst error rate.
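The two-phase hyperexponential model can be explored numerically. The sketch below evaluates the density, the hazard rate from Eq. (5.1), and the mean for the VAXcluster parameters given with Figure 5.5. The function names are illustrative, and the rate unit (per day) is inferred from the time axis of the figure.

```python
import math

def hyperexp_pdf(t, alphas, lams):
    """Density of a mixture of exponentials: sum of alpha_i * lam_i * exp(-lam_i * t)."""
    return sum(a * l * math.exp(-l * t) for a, l in zip(alphas, lams))

def hyperexp_survival(t, alphas, lams):
    """1 - F(t) for the same mixture."""
    return sum(a * math.exp(-l * t) for a, l in zip(alphas, lams))

def hyperexp_hazard(t, alphas, lams):
    """Hazard rate h(t) = f(t) / (1 - F(t)), as in Eq. (5.1)."""
    return hyperexp_pdf(t, alphas, lams) / hyperexp_survival(t, alphas, lams)

def hyperexp_mean(alphas, lams):
    """Mean of the mixture: sum of alpha_i / lam_i."""
    return sum(a / l for a, l in zip(alphas, lams))

# VAXcluster parameters from Figure 5.5 (t in days).
alphas, lams = (0.67, 0.33), (0.20, 2.75)
mean_tte = hyperexp_mean(alphas, lams)                      # about 3.47 days
hazards = [hyperexp_hazard(t, alphas, lams) for t in (0.0, 1.0, 10.0)]
```

The hazard rate starts near 1.04 per day and decays toward the lower rate λ1 = 0.20, which is the analytical counterpart of the burst behavior discussed above: shortly after an error, another error is much more likely than in the long run.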
Error bursts can be explained as repeated occurrences of the same software problem or as multiple effects of an
intermittent hardware fault on the software. Actually, software error bursts have been observed in laboratory
experiments reported in [Bishop88]. The study showed that, if the input sequences of the software under investigation are
correlated (rather than independent), one can expect more "bunching" of failures than those predicted using a constant
failure rate assumption. In an operating system, input sequences (user requests) are highly likely to be correlated.
Hence, a defect area can be triggered repeatedly.
5.4. Dependency Analysis
Many underlying dependencies exist among measured parameters and components, such as the dependency
between workload and failure rate and the dependency among failures on different components. Understanding such
dependencies is important for improving system dependability and developing realistic models. In this regard, the
workload/failure dependency issue was studied in the early 1980s and the correlated failure issue was investigated
recently.
Dependency between workload and failure was addressed in two approaches: statistical quantification of the
dependence between workload and failure rate [Butner80], [Iyer85b] and stochastic modeling of failures as functions
of workload [Castillo81]. Both demonstrated a strong correlation between workload and failure rate. This result
indicated that dependability models cannot be considered representative unless the system workload is taken into
account. Based on this result, several workload-dependent analytical models have been proposed [MeyerJ88],
[Aupperle89], [Dunkel90].
Recent measurements on VAXclusters [Tang90], [Wein90] and Tandem machines [Lee91] found that correlated
failures are not negligible in distributed systems. Further studies showed that even a small correlation can have a big
impact on system dependability [Dugan91], [Tang91], [Tang92a]. It was also shown that neither traditional models
assuming failure independence nor those few models believed to take correlation into account are representative of the
actual occurrence process of correlated failures observed in the measured systems [Tang93b].
In the following three subsections, dependency analysis is illustrated through three examples: 1) using a work-
load hazard model to analyze the dependency between workload and software failures in an IBM 3081 system, 2)
using the correlation analysis method to analyze the two-way dependency between errors on two different machines
in a VAXcluster system, and 3) using the factor analysis method to analyze the multi-way dependency among failures
on multiple processors in a Tandem fault tolerant system.
5.4.1. Workload/Failure Dependency
An early study [Castillo81] introduced a workload-dependent cyclostationary model to characterize system fail-
ure processes. The basic assumption is that the instantaneous failure rate of a system resource can be approximated by
a function of the usage of the resource considered. The model was applied to a PDP-10 machine running a modified
version of the standard TOPS-10 operating system. It was shown that the TTF distribution predicted by the model and
the one observed from the real system have an extremely good fit.
In [Iyer82a], a load hazard model was introduced to measure the risk of a failure as the system activity
increases. The proposed model is similar to the hazard rate defined in Eq (5.1). Given a workload variable X, the load
hazard is defined as
\[
z(x) = \frac{\Pr[\text{failure in load interval } (x, x + \Delta x)]}{\Pr[\text{no failure in load interval } (0, x)]\, \Delta x} = \frac{g(x)}{1 - G(x)} \tag{5.2}
\]
where g(x) is the p.d.f. of the variable "a failure occurs at a given workload value x" and G(x) is the corresponding
c.d.f. That is,
\[
g(x) = \Pr[\text{failure occurs} \mid X = x] = \frac{f(x)}{l(x)} \tag{5.3}
\]
where l(x) is simply the p.d.f. of the workload in consideration:
\[
l(x) = \Pr[X = x], \tag{5.4}
\]
and f(x) is the joint p.d.f. of the system state (failure state or non-failure state) and the workload:
\[
f(x) = \Pr[\text{failure occurs} \;\&\; X = x]. \tag{5.5}
\]
A constant hazard rate implies that failures are occurring randomly with respect to the workload. A hazard rate
that increases with X implies that the failure rate increases with increasing workload.
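Equations (5.2) to (5.5) can be turned into a binned computation. The sketch below is one possible reading and not the procedure of [Iyer82a]: it assumes that the workload is sampled once per observation interval, that a flag marks the intervals in which a failure occurred, and that g(x) is normalized to sum to one over the bins so that G(x) behaves as a c.d.f. The function name, the bin edges, and the sample data are hypothetical.

```python
import numpy as np

def load_hazard(workload, failed, edges):
    """Binned load hazard z(x) following Eqs. (5.2)-(5.5) (illustrative sketch).

    workload: workload value observed in each interval
    failed:   True for intervals in which a failure occurred
    edges:    bin edges for the workload variable
    """
    workload = np.asarray(workload, dtype=float)
    failed = np.asarray(failed, dtype=bool)
    edges = np.asarray(edges, dtype=float)

    n_all, _ = np.histogram(workload, bins=edges)
    n_fail, _ = np.histogram(workload[failed], bins=edges)
    total = len(workload)

    l = n_all / total                     # l(x): workload distribution, Eq. (5.4)
    f = n_fail / total                    # f(x): failure and workload jointly, Eq. (5.5)
    g = np.divide(f, l, out=np.zeros_like(f), where=l > 0)   # Eq. (5.3)
    if g.sum() > 0:
        g = g / g.sum()                   # assumption: normalize so G is a c.d.f.

    G_before = np.concatenate(([0.0], np.cumsum(g)[:-1]))    # G at each bin start
    survivors = 1.0 - G_before
    width = np.diff(edges)
    z = np.divide(g, survivors * width,
                  out=np.zeros_like(g), where=survivors > 1e-12)  # Eq. (5.2)
    return z

# Hypothetical data: 8 intervals, two workload bins, more failures at high load.
x = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
fail = [True, False, False, False, True, True, True, False]
z = load_hazard(x, fail, edges=[0.0, 1.0, 2.0])
```

In this small example one of four low-load intervals and three of four high-load intervals contain a failure, so z rises from 0.25 in the first bin to 1.0 in the second.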
Figure 5.7. Workload Hazard Plots for the IBM 3081 System (three panels showing z(x), on a logarithmic scale,
versus x = OVERHEAD, x = PAGEIN, and x = SIO, each annotated with its R² value)
The load hazard model was applied to the software failure and workload data collected from an IBM 3081 system
running the VM operating system. Based on the collected data, l(x), f(x), g(x), and z(x) were computed for
each workload variable. Figure 5.7 shows the z(x) plots for three selected workload variables:
(1) OVERHEAD -- fraction of CPU time spent on the operating system;
(2) PAGEIN -- number of page reads per second by all users;
(3) SIO (Start I/O) -- number of input/output operations per second.
The regression coefficient, R², which is an effective measure of the goodness of fit, is also provided in the figure.
The hazard plots show that the workload parameters appear to be acting as stress factors, i.e., the failure rate
increases as the workload increases. The effect is particularly strong in the case of the interactive workload measures
OVERHEAD and SIO. The correlation coefficients of 0.95 and 0.91 show that the failures closely fit an increasing load
hazard model. The risk of a failure also increases with increased PAGEIN, although at a somewhat lower correlation
(0.82). Because the vertical scale on these plots is logarithmic, the good linear fit indicates that the relationship between
the load hazard z(x) and the workload variable is exponential, i.e., the risk of a software failure increases exponentially
with increasing workload.
5.4.2. Two-Way Dependency
It was mentioned in Section 2.3 that the correlation coefficient can be used to quantify the linear dependence
between two variables. When errors/failures on two components are related, the correlation coefficient between the
two components is a good measure of such dependence. The question is how to obtain it from measured data.
The first step in correlation analysis is building a data matrix based on the measured data. Assume that there are
n components in the measured system and the measured period is divided into m equal intervals of Δt (e.g., 5
minutes). An m×n data matrix can then be constructed in the following way. The n columns of the matrix represent the n
components in the measured system. The m rows of the matrix represent the m time intervals. Element (i, j) of the
matrix is set to the number of errors occurring within interval i on component j. Column j can be regarded as a
sample of the random variable, Xj, which represents the state of component j in the system.
The second step is calculating correlation coefficients using Eq. (2.19) based on the data matrix. Each time, we
pick two columns (Xi and Xj) to calculate Cor(Xi, Xj). This step can be automated by using a statistical package
such as SAS. Table 5.5 lists the average correlation coefficients of the 21 pairs of machines in a VAXcluster for
different types of errors and failures [Tang93a]. Generally, the error correlation is high (0.62) and the failure correlation
is low (0.06). Disk and network errors are strongly correlated, because the processors in the system heavily use and
share the disks and the network concurrently.
Table 5.5. Average Correlation Coefficients for VAXcluster Errors

Error                                                     Failure
All     CPU     Memory   Disk    Network   Software       All
0.62    0.03    0.01     0.78    0.70      0.02           0.06
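The two steps of the correlation analysis can be sketched with NumPy in place of SAS. The sketch assumes that Eq. (2.19) is the usual Pearson correlation coefficient, which is what `numpy.corrcoef` computes, and that each error event is recorded as a (time, component index) pair. The function names and the sample events are hypothetical.

```python
import numpy as np

def error_count_matrix(events, n_components, t_end, delta_t):
    """Build the m x n data matrix: element (i, j) counts errors of component j in interval i."""
    m = int(np.ceil(t_end / delta_t))
    data = np.zeros((m, n_components), dtype=int)
    for time, comp in events:
        row = min(int(time // delta_t), m - 1)   # clamp an event at t_end into the last row
        data[row, comp] += 1
    return data

def pairwise_correlations(data):
    """Correlation coefficient for every pair of columns (components)."""
    corr = np.corrcoef(data, rowvar=False)        # n x n correlation matrix
    i, j = np.triu_indices(corr.shape[0], k=1)    # each unordered pair once
    return corr[i, j]

# Hypothetical events (time in minutes, component index) for two components.
events = [(1, 0), (1, 1), (11, 0), (11, 1), (12, 0), (12, 1)]
data = error_count_matrix(events, n_components=2, t_end=20, delta_t=5)
pairs = pairwise_correlations(data)
```

For seven machines the second function returns 7·6/2 = 21 coefficients, matching the 21 pairs averaged in Table 5.5. A component with no errors at all has zero variance, so its coefficients are undefined (NaN) and would need to be excluded before averaging.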
5.4.3. Multi-Way Dependency
If errors/failures on more than two components are related, the correlation coefficient is not enough to quantify
the dependence among these components, i.e., multi-way correlation. In such a case, the factor analysis method
introduced in Section 2.3 can be used to uncover the underlying multi-way correlation. In this subsection, the application
of factor analysis is illustrated using the processor failure data collected from a Tandem fault tolerant system [Lee91].
Similar to the correlation analysis discussed above, the first step is building an m×n data matrix based on
measurements, where n is the number of components in the system. The measured Tandem system is an 8-processor
multicomputer, i.e., n is 8. The Δt used is 30 minutes. Element (i, j) of this matrix has a value of 1 if processor j
halts during the i-th time interval; otherwise, it has a value of 0. The j-th column of the matrix represents the sample
halt history of processor j, while the i-th row of the matrix represents the state of the eight processors in the i-th time
interval. The matrix is thus called a processor halt matrix.
The second step is performing factor analysis by applying the SAS procedure FACTOR to the processor halt
matrix. The results are shown in Table 5.6. The numbers in the middle of the table are factor loadings, and the last
column shows communality. The bottom two rows show the amount of variance explained by each common factor
and its percentage of the total variance.

Table 5.6. Factor Pattern of the Tandem Processor Halts

Processor   Factor 1   Factor 2   Factor 3   Factor 4   Communality
1            0.997     -0.004     -0.069      0.023     1.00
2            0.000      0.000      0.000      0.000     0.00
3            0.061      0.012      0.853     -0.133     0.75
4            0.001      0.999     -0.011      0.021     1.00
5            0.982     -0.000      0.188     -0.018     1.00
6           -0.001      0.447     -0.005      0.009     0.20
7            0.047     -0.002      0.862      0.506     1.00
8           -0.007      0.762      0.090      0.641     1.00
Var.         1.965      1.781      1.519      0.685
Var. %       24.6       22.3       19.0       8.6
According to [Dillon84], factor loadings greater than 0.5 are considered to be significant. However, in reliability
analysis, factor loadings lower than 0.5 can be significant. The results show that there are four common factors.
Factor 1 captures the dependence between processor 1 and processor 5 and accounts for 24.6% of the total variance.
Factor 2 captures the multi-way dependence among processors 4, 6, and 8, although the contribution of processor 6 is
small (0.447² ≈ 0.20, i.e., 20% of its variance is explained by this factor). Factor 2 explains 22.3% of the total variance.
Factor 3 captures the dependence between processor 3 and processor 7, and contributes 19% to the total variance.
Factor 4 captures the dependence, although it is lower (with factor loadings 0.506 and 0.641), between processor 7 and
processor 8, and accounts for 8.6% of the total variance.
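The communalities and explained variances in Table 5.6 follow directly from the factor loadings: the communality of a processor is the sum of its squared loadings over the factors, and the variance explained by a factor is the sum of squared loadings over the processors. The sketch below reproduces these quantities from the loadings in the table. It does not perform the factor extraction itself, which the study carried out with the SAS procedure FACTOR. Dividing by the number of variables (8) to obtain percentages is an assumption that matches the table.

```python
import numpy as np

# Factor loadings from Table 5.6 (rows: processors 1-8, columns: factors 1-4).
loadings = np.array([
    [ 0.997, -0.004, -0.069,  0.023],
    [ 0.000,  0.000,  0.000,  0.000],
    [ 0.061,  0.012,  0.853, -0.133],
    [ 0.001,  0.999, -0.011,  0.021],
    [ 0.982, -0.000,  0.188, -0.018],
    [-0.001,  0.447, -0.005,  0.009],
    [ 0.047, -0.002,  0.862,  0.506],
    [-0.007,  0.762,  0.090,  0.641],
])

squared = loadings ** 2
communality = squared.sum(axis=1)        # per processor, over the four factors
variance = squared.sum(axis=0)           # per factor, over the eight processors
variance_pct = 100.0 * variance / loadings.shape[0]
```

Small differences in the third decimal place relative to the table are expected because the published loadings are rounded.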
5.5. Markov Reward Modeling
Many natural and social phenomena can be modeled by Markov or semi-Markov stochastic processes
[Trivedi82]. In the computer area, the Markov process is one of the most frequently used models in performance and
dependability evaluation. Compared to combinatorial models, Markov models have several advantages, such as the
ability to handle time-dependent failure rates, performance degradation, and interactions among components. In the
area of analytical modeling of computer systems, performability models [MeyerJ80], [MeyerJ92], availability models
[Goyal87], and Markov reward models [Reibman89], [Trivedi92] have all been addressed during the past 15 years.
However, how to apply these techniques to measured data is still not clear. Assumptions made in building analytical
models also need to be validated by measurement-based analysis.
In analytical modeling, Markov models are built based on some assumptions (such as independent failures on
different components) using individual component parameters (such as failure and recovery rates). The evaluated results
are highly dependent on input parameters and model assumptions. In measurement-based modeling, Markov models
are identified from data and are therefore called measured models [Tang93b]. No additional assumptions (beyond the
Markov property) are made in the construction of models. The measured models provide the best evaluation for real
systems as well as insight into the development of representative analytical models. Thus, it is valuable to identify
appropriate models from measured data. Measurement-based Markov reward modeling techniques are illustrated
through a system model generated for a VAXcluster and a software model generated for an IBM operating system.
5.5.1. Modeling of a Distributed System
The data used for the modeling was collected from a DEC VAXcluster system, consisting of seven machines,
for 250 days [Tang93a]. For this system, an error was defined as an abnormality in any component of the system. If
an error led to a termination of service on a machine, it was defined as a failure. A failure was identified by a reboot
following one or multiple error reports.
A. Model Construction
Since the measured VAXcluster has seven machines, an 8-state Markov error model is constructed. The eight
states, E0, E1, ..., and E7, are defined such that E_i represents the state wherein i machines observe errors at the same
time (the time granularity is chosen to be 1 second). For example, state E0 represents that none of the machines
experiences errors, i.e., the VAXcluster is in the normal (error-free) state; state E7 represents that all the machines
experience errors. At any measured time, the VAXcluster is in one of these states.
The transition probabilities for the 8-state model are estimated from the error event data. Given that the system is
in state i, the probability that it will go to state j, p_ij, is calculated as follows:
\[
p_{ij} = \frac{\text{observed number of transitions from } E_i \text{ to } E_j}{\text{observed number of transitions out of } E_i} \tag{5.6}
\]
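Equation (5.6) amounts to counting transitions in the observed state sequence and normalizing each row. A minimal sketch is given below. It assumes that the error event data has already been reduced to a time-ordered sequence of state indices (0 for E0, 1 for E1, and so on), and the function name and sample sequence are hypothetical.

```python
import numpy as np

def transition_probabilities(states, n_states):
    """Estimate p_ij as (transitions from i to j) / (transitions out of i), Eq. (5.6)."""
    counts = np.zeros((n_states, n_states))
    for src, dst in zip(states[:-1], states[1:]):
        if src != dst:                   # a transition always leaves the current state
            counts[src, dst] += 1
    out_of = counts.sum(axis=1, keepdims=True)
    # Rows of states that were never left stay all zero.
    return np.divide(counts, out_of, out=np.zeros_like(counts), where=out_of > 0)

# Hypothetical observed sequence of states.
P = transition_probabilities([0, 1, 0, 1, 2, 0], n_states=3)
```

In the sample sequence, state 1 is left twice, once to state 0 and once to state 2, so both probabilities in that row are 0.5. The diagonal is zero by construction, as in Table 5.7.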
Table 5.7 shows the transition probabilities calculated from the VAXcluster error data. Based on the table, an
error propagation model can be obtained by calculating the probability that the system goes from state E_i (i = 1, ..., 6)
to any of the lower states (E_{i-1}, ..., E_0) and the probability that it goes from E_i to any of the higher states (E_{i+1}, ...,
E_7). These probabilities are easily determined by summing all the row elements to the left of element (i, i), and all
row elements to the right of element (i, i), in the table. The error propagation model is shown in Figure 5.8. An
interesting error propagation characteristic is uncovered with this model. Notice that the transition probabilities to
higher states (numbers in the upper line) tend to increase as the state increases. That is, once an error domain
encompasses more than one machine, the probability of the domain involving more machines increases. In such situations,
error containment can become increasingly difficult.
Table 5.7. Transition Probability for the VAXcluster Error Model

State   E0     E1     E2     E3     E4     E5     E6     E7
E0      .000   .891   .084   .014   .004   .002   .002   .003
E1      .824   .000   .145   .023   .004   .003   .001   .000
E2      .239   .594   .000   .118   .035   .009   .004   .001
E3      .126   .211   .401   .000   .227   .024   .009   .003
E4      .079   .147   .102   .422   .000   .205   .034   .011
E5      .058   .115   .054   .073   .367   .000   .315   .018
E6      .070   .081   .024   .016   .073   .406   .000   .331
E7      .125   .104   .000   .021   .036   .161   .552   .000

Figure 5.8. An Error Propagation Model for the VAXcluster (chain of states E0 to E7; transition probabilities to
lower states from E1 through E7: .82, .83, .74, .75, .67, .67, 1.0)

Table 5.8 shows the mean holding time, the total holding time in the measured period, and the occupancy
probability in each state for the model. It is seen from the table that E7 has the longest mean holding time (2.31 minutes)
among all error states. Clearly, when all seven machines are affected by errors, the system takes the longest time to
recover. The occupancy probabilities provide evidence that errors on different machines (i.e., errors in the higher
states) are related. It is found that the measured occupancy probabilities for the higher states (E3 to E7) are quite
different from the occupancy probabilities analytically determined assuming error independence. For example,
consider the occupancy probability for E7. By Table 5.8, the measured occupancy probability for E7 is 0.0012. Assuming
that errors on different machines are independent, we can easily determine the occupancy probability for this state to
be at most 0.02^7 (about 1.3 × 10^-12), where 0.02 is the highest error occurrence probability among the seven
machines. That is, the measured value is higher than the calculated value by at least eight orders of magnitude.

Table 5.8. Holding Time (HT) & Occupancy Probability for the VAXcluster Error Model

State   Mean HT (min.)   Total HT (hr.)   Occ. Prob.
E0      22.39            5578.89          0.9298
E1      1.27             347.42           0.0579
E2      0.40             29.24            0.0049
E3      0.56             14.07            0.0023
E4      1.07             15.13            0.0025
E5      0.40             3.37             0.0006
E6      0.73             4.50             0.0007
E7      2.31             7.38             0.0012
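Both the error propagation probabilities and the independence comparison can be checked from the published numbers. The sketch below sums the entries of Table 5.7 to the left and to the right of the diagonal, and compares the measured occupancy probability of E7 with the bound 0.02^7 obtained under the independence assumption. Variable names are illustrative.

```python
import math
import numpy as np

# Transition probabilities from Table 5.7 (rows: from E0..E7, columns: to E0..E7).
P = np.array([
    [.000, .891, .084, .014, .004, .002, .002, .003],
    [.824, .000, .145, .023, .004, .003, .001, .000],
    [.239, .594, .000, .118, .035, .009, .004, .001],
    [.126, .211, .401, .000, .227, .024, .009, .003],
    [.079, .147, .102, .422, .000, .205, .034, .011],
    [.058, .115, .054, .073, .367, .000, .315, .018],
    [.070, .081, .024, .016, .073, .406, .000, .331],
    [.125, .104, .000, .021, .036, .161, .552, .000],
])

to_lower = np.tril(P, k=-1).sum(axis=1)   # entries left of the diagonal
to_higher = np.triu(P, k=1).sum(axis=1)   # entries right of the diagonal

# Independence check for E7: measured value versus the bound p_max ** 7.
measured_e7 = 0.0012
independent_bound = 0.02 ** 7
orders_of_magnitude = math.log10(measured_e7 / independent_bound)
```

The probabilities of moving to a lower state from E1 through E7 come out as approximately 0.82, 0.83, 0.74, 0.75, 0.67, 0.67, and 1.0, and the measured occupancy probability of E7 exceeds the independence bound by nearly nine orders of magnitude.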
B. Reward Analysis
Markov models can be used to conduct reward analysis [Trivedi92] to quantify the loss of service due to errors
and failures. The key step is to define a reward function which characterizes the performance loss in each degraded
state. For a multicomputer system, a generic reward function can be defined for both a single machine and the whole
system. Given a time interval ΔT (random variable), a reward rate for the system in ΔT is determined by
\[
r(\Delta T) = W(\Delta T) / \Delta T, \tag{5.7}
\]
where W(ΔT) denotes the useful work done by the system in ΔT and is calculated by
\[
W(\Delta T) =
\begin{cases}
\Delta T & \text{in normal state} \\
\Delta T - n\tau & \text{in error state} \\
0 & \text{in failure state}
\end{cases} \tag{5.8}
\]
where n is the number of raw errors (error entries in the log, see Section 5.2) in ΔT and τ is the mean recovery time for
a single error. Thus, one unit of reward is given for each unit of time when the system is in the normal state. In an
error state, the penalty paid depends on the recovery time the system spends in that state, which is determined by the
linear function ΔT - nτ (normally, ΔT > nτ; if ΔT < nτ, W(ΔT) is set to 0). In a failure state, W(ΔT) is by definition
zero.
Applying Eq. (5.8) to the VAXcluster, the reward rate formula has the following form:
\[
r(\Delta T) = \sum_{k=1}^{7} W_k(\Delta T) / (7 \times \Delta T), \tag{5.9}
\]
where W_k(ΔT) denotes the useful work done by machine k in time ΔT. Here all machines are assumed to contribute
an equal amount of reward to the system. For example, if three machines fail when the system is in E3, the reward rate
is 4/7.
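The reward function of Eqs. (5.7) to (5.9) can be written out directly. In the sketch below, the state labels, the function names, and the representation of each machine as a (state, number of raw errors) pair are illustrative choices. The symbol `tau` stands for the mean recovery time of a single error.

```python
def useful_work(dt, state, n_errors=0, tau=0.0):
    """W(dt) for one machine, following Eq. (5.8)."""
    if state == "normal":
        return dt
    if state == "error":
        return max(dt - n_errors * tau, 0.0)   # set to 0 when dt < n * tau
    if state == "failure":
        return 0.0
    raise ValueError(f"unknown state: {state}")

def system_reward_rate(dt, machines, tau):
    """r(dt) for the whole system, following Eq. (5.9).

    machines: list of (state, n_errors) pairs, one per machine; every machine
    contributes an equal share of the reward.
    """
    total_work = sum(useful_work(dt, state, n, tau) for state, n in machines)
    return total_work / (len(machines) * dt)

# The example from the text: three of seven machines have failed.
machines = [("failure", 0)] * 3 + [("normal", 0)] * 4
rate = system_reward_rate(60.0, machines, tau=0.001)
```

With three failed machines and four machines in the normal state, the reward rate is 4/7 regardless of the interval length.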
The expected steady-state reward rate, Y, can be estimated by [Trivedi92]
\[
Y = \frac{1}{T} \sum_{\Delta t_j \in T} r(\Delta t_j)\, \Delta t_j, \tag{5.10}
\]
where T is the summation of all Δt_j's (particular values of ΔT) in consideration. If we substitute r from Eq. (5.9) and
let ΔT represent the holding time of each state in the error model, Y becomes the steady-state reward rate of the
VAXcluster, which is also an estimate of system availability (performance-related availability). If we substitute r from Eq.
(5.9) and let ΔT represent the time span of the error event for a particular type of error, Y becomes the steady-state
reward rate of the system during the event intervals of the specified error. Thus, (1 - Y) measures the loss in
performance during the specified error event. Note that it is possible that there are failed machines when the system is in an
error state. Since the model is an empirical model based on the error event data (of which the failure event data is a
subset), the information about errors and failures of all machines for each particular Δt_j can be obtained from the data.

Table 5.9. Steady-State Reward Rate for the VAXcluster

τ     0.1 ms     1 ms       10 ms      100 ms
Y     0.995078   0.995077   0.995067   0.994971

Table 5.10. Steady-State Reward Rate for Each Error Type in the VAXcluster

CPU       Memory    Disk      Tape      Network   Software
0.14950   0.99994   0.61314   0.89845   0.56841   0.00008
The steady-state reward rate for the VAXcluster was computed with the mean single-error recovery time, τ, set to
0.1, 1, 10, and 100 ms. The results are given in Table 5.9. The table shows that the reward rate is not sensitive to τ.
This is because the overall recovery time is dominated by the failure recovery time, i.e., the major contributors to the
performance loss are failures, not non-failure errors. In the range of these τ values, the VAXcluster availability is
estimated to be 0.995. Table 5.10 shows the steady-state reward rate for each error type (τ = 1 ms) for the VAXcluster.
These numbers quantify the loss of performance due to the recovery from each type of error. For example, during the
recovery from CPU errors, the system can be expected to deliver approximately 15% of its full performance. During
disk error recovery, the average system performance degrades to nearly 61% of its capacity. Since software errors have
the lowest reward rate (0.00008), the loss of work during the recovery from software errors is the most significant.
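The steady-state reward rate of Eq. (5.10) is a time-weighted average of the per-interval reward rates. The sketch below shows the computation for a list of (interval length, reward rate) pairs. The function name and the sample intervals are hypothetical and are not taken from the VAXcluster data.

```python
def steady_state_reward(intervals):
    """Time-weighted average reward rate, following Eq. (5.10).

    intervals: sequence of (dt_j, r_j) pairs, where r_j is the reward rate
    observed over an interval of length dt_j.
    """
    intervals = list(intervals)
    total_time = sum(dt for dt, _ in intervals)
    return sum(r * dt for dt, r in intervals) / total_time

# Hypothetical example: 90 time units at full reward, 10 at half reward.
Y = steady_state_reward([(90.0, 1.0), (10.0, 0.5)])
loss = 1.0 - Y
```

In this example Y is 0.95, so the loss in performance, 1 - Y, is 5%. Restricting the intervals to the event spans of one error type gives the per-type figures of the kind shown in Table 5.10.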
5.5.2. Modeling of an Operating System
The modeled operating system is the IBM MVS system running on an IBM 3081 mainframe [Hsueh87]. The
measurement period is one year. A Markov model is developed using data collected from the system to describe error
detection and recovery inside an operating system. The MVS is a widely used IBM operating system. Primary
features of the system are reported to be efficient storage management and automatic software error recovery. The MVS
system attempts to correct software errors using recovery routines. The philosophy in the MVS is that for major
system functions, the programmer envisages possible failure scenarios and writes a recovery routine for each. It is,
however, the responsibility of the installation (or the user) to write recovery routines for applications.
Recovery routines in the MVS operating system provide a means by which the operating system prevents a total
loss on the occurrence of software errors. When a program is abnormally interrupted due to an error, the supervisor
routine gets control. If the problem is such that further processing can degrade the system or destroy data, the
supervisor routine gives control to the recovery termination manager (RTM), an operating system module responsible for
error and recovery management. If a recovery routine is available for the interrupted program, the RTM gives control
to this routine before it terminates the program. The purpose of a recovery routine is to free the resources kept by the
failing program, to locate the error, and to request either a retry or the termination of the program.
More than one recovery routine can be specified for the same program. If the current recovery routine is unable
to restore a valid state, the RTM can give control to another recovery routine, if available. This process is called
percolation. The percolation process ends if either a routine issues a valid retry request or no more recovery routines are
available. In the latter case, the program and its related subtasks are terminated. If a valid retry is requested, a retry is
attempted to restore a valid state using the information supplied by the recovery routine and then give control to the
program. For a retry to be valid, there should be no risk of error recurrence and the retry address should be properly
specified. An error recovery can result in the following four situations:
(1) Resume op (resume operation) -- The system successfully recovers from the error and returns control to the
interrupted program.
(2) Task term (task termination) -- The program and its related subtasks are terminated, but the system does not
fail.
(3) Job term (job termination) -- The job in control at the time of the error is aborted.
(4) System failure -- The job or task, which was terminated, is critical for system operation. As a result of the
termination, a system failure occurs.
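The percolation process described above can be summarized as a simple control loop. The sketch below is a schematic model only and does not correspond to any MVS interface: each recovery routine is represented by a hypothetical callable that returns True when it issues a valid retry request, and the outcome labels are illustrative.

```python
def recover(routines):
    """Schematic model of percolation through a chain of recovery routines.

    routines: ordered list of callables; each returns True if it issues a
    valid retry request and False if it cannot restore a valid state.
    Returns (outcome, percolations), where outcome is "retry" or "terminate".
    """
    percolations = 0
    for index, routine in enumerate(routines):
        if index > 0:
            percolations += 1        # control passed on to another routine
        if routine():
            return "retry", percolations
    # No routine issued a valid retry: the program and its subtasks are terminated.
    return "terminate", percolations

# Hypothetical chain: the first routine fails, the second requests a retry.
outcome = recover([lambda: False, lambda: True])
```

In the sample chain the second routine succeeds after one percolation. Whether a termination then corresponds to a task termination, a job termination, or a system failure depends on what was terminated, which the sketch does not model.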
A. Model Construction
The states of the model consist of eight types of error states (see Table 5.11) and four states resulting from
error recoveries. Figure 5.9 shows the model. The normal state represents that the operating system is running
error-free. The transition probabilities were estimated from the measured data using Eq. (5.6). Note that the system failure
state is not shown in the figure. This is because the occurrence of system failure was rare, and the number of observed
system failures was statistically insignificant.

Figure 5.9. MVS Software Error/Recovery Model

Table 5.11 shows the mean waiting time characteristics of the normal and error states in the model. Note that the
waiting time distribution of the normal state is the TTE distribution. It has been shown in Section 5.3.3 that this
distribution is not simply exponential (it is a multi-stage gamma distribution), so the model is a semi-Markov model. In the
table, a multiple software error is defined as an error burst consisting of more than one type of software error. The
average duration of a multiple error is at least four times longer than that of any type of single error, which is typically
in the range of 20 to 40 seconds, except for DLCK (deadlock) and OTHR (others). The average recovery time from a
program exception is twice as long as that from a control error (42 seconds versus 21 seconds). This is probably due
to the extensive software involvement in recovering from program exceptions.
An error recovery can be as simple as a retry or as complex as requiring several percolations before a successful
retry. The problem can also be such that no retry or percolation is possible. Figure 5.9 shows that about 83.1% of all
retries are successful. The figure also shows that the operating system is able to recover from 93.5% of I/O and data
management errors and 78.4% of control related errors by retries. These observations indicate that most I/O and
control related errors are relatively easy to recover from, compared to the other types of errors such as deadlock or storage
errors. Also note that "no percolation" occurs only in recovering from storage management errors. This indicates that
storage management errors are more complicated than the other types of errors.
Table 5.11. Mean Waiting Time
State                                  # Observations   Mean Waiting Time (Sec.)   Standard Deviation
Normal (Error-Free)                              2757                   10461.33             32735.04
CTRL (Control Error)                              213                      21.92                84.21
DLCK (Deadlock)                                    23                       4.72                22.61
I/O (I/O & Data Management Error)                1448                      25.05                77.62
PE (Program Exception)                             65                      42.23                92.98
SE (Storage or Address Exception)                 149                      36.82                79.59
SM (Storage Management Error)                     313                      33.40                95.01
OTHR (Other Type)                                  66                       1.86                12.98
MULT (Multiple Software Error)                    481                     175.59               252.79
B. Model Evaluation
The steady-state measures evaluated from the model are listed in Table 5.12. The definitions of these measures
are given in [Howard71].
(1) Transition probability (π_j) -- probability that the transition is to state j, given that a transition occurs
(2) Occupancy probability (Φ_j) -- probability that the system occupies state j at any time point
(3) Mean recurrence time (Θ_j) -- mean recurrence time of state j
The occupancy probability of the normal state can be viewed as the operating system availability without
degradation. The state transition probability, on the other hand, characterizes error detection and recovery processes
in the operating system. Table 5.12(a) lists the state transition probabilities and occupancy probabilities for the
normal and error states. Table 5.12(b) lists the state transition probabilities and the mean recurrence times of the
recovery and result states. A dashed line in the table indicates a negligible value (less than 0.00001). Table 5.12(a)
shows that the occupancy probability of the normal state in the model is 0.995. This indicates that for 99.5% of the
time the operating system is running error-free. For the other 0.5% of the time the operating system is in the error or
recovery states. For more than half of the error and recovery time (i.e., 0.29% out of 0.5%) the operating system is in
the multiple error state. The average reward rate for all software error and recovery states is estimated from data to
be 0.2736. Based on this reward rate and the occupancy probability for all error and recovery states shown in the
table (0.005), the steady-state reward loss in the modeled MVS can be evaluated to be 0.00363.
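The arithmetic behind the reward-loss figure can be sketched as follows. This is a reconstruction from the two numbers quoted in the text, and it assumes that the normal state carries a reward rate of 1, which is consistent with reading its occupancy probability as availability without degradation:

```latex
\[
\text{reward loss} \;\approx\; (1 - 0.2736) \times 0.005 \;=\; 0.003632 \;\approx\; 0.00363
\]
```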
By solving the model, it is found that the operating system makes a transition every 43.37 minutes. Table
5.12(a) shows that 24.74% of all transitions made in the model are to the normal state, 24.73% to error states
(obtained by summing the π's for all error states), 25.79% to recovery states, and 24.74% to result states. Since a
transition occurs every 43 minutes, it can be estimated that, on the average, a software error is detected every 3 hours
and a successful recovery (i.e., reaching the "resume op" state) occurs every 5 hours. Table 5.12(b) also shows that
more than 40% of software errors lead to job or task terminations, which cause the loss of service to users. However,
only a few of these terminations lead to system failures. This result indicates that recovery routines in MVS are
effective in avoiding system failures but are not so effective in avoiding user job terminations.
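The relationship between the measures above can be illustrated with a short sketch. It assumes the standard semi-Markov relations (occupancy of state j equals its transition probability times its mean waiting time, divided by the mean time between transitions; mean recurrence time equals the mean time between transitions divided by the transition probability). Whether the authors computed their values in exactly this way is an assumption; the function and dictionary names are illustrative only, and the numbers are taken from Table 5.11, Table 5.12, and the text.

```python
# Sketch: steady-state measures of a semi-Markov model from quoted figures.
# Assumption: standard semi-Markov relations; names here are hypothetical.

MEAN_TIME_BETWEEN_TRANSITIONS_MIN = 43.37  # quoted in the text

# Transition probabilities (pi_j) for a few states, from Table 5.12.
PI = {
    "normal": 0.2474, "io": 0.1299, "mult": 0.0431,
    "retry": 0.1704, "resume_op": 0.1414, "job_term": 0.0348,
}

# Mean waiting times in seconds, from Table 5.11 (normal and error states only).
MEAN_WAIT_SEC = {"normal": 10461.33, "io": 25.05, "mult": 175.59}


def mean_recurrence_hours(state):
    """Mean time between successive entries into `state`, in hours."""
    return MEAN_TIME_BETWEEN_TRANSITIONS_MIN / PI[state] / 60.0


def occupancy(state):
    """Long-run fraction of time spent in `state`."""
    cycle_sec = MEAN_TIME_BETWEEN_TRANSITIONS_MIN * 60.0
    return PI[state] * MEAN_WAIT_SEC[state] / cycle_sec


if __name__ == "__main__":
    for s in ("retry", "resume_op", "job_term"):
        print(s, round(mean_recurrence_hours(s), 2), "hours")
    for s in ("normal", "io", "mult"):
        print(s, round(occupancy(s), 6))
```

Under these assumptions the results agree with the tabulated values to within rounding, which suggests the mean recurrence times in Table 5.12(b) are expressed in hours.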
Table 5.12. Error/Recovery Model Characteristics
(a) Normal and error states
Measure                      Normal   CTRL      DLCK     I/O       PE         SE         SM        OTHR     MULT
π (transition prob.)         0.2474   0.0191    0.0020   0.1299    0.0060     0.0134     0.0281    0.0057   0.0431
Φ (occupancy prob.)          0.9950   0.00016   --       0.00125   0.000098   0.000189   0.00036   --       0.002913
(b) Recovery and resultant states
                             Recovery State                           Resultant State
Measure                      Retry    Percolation   No-Percolation   Resume Op.   Task Term.   Job Term.
π (transition prob.)         0.1704   0.0845        0.0030           0.1414       0.0712       0.0348
Θ (mean recurrence, hours)   4.25     8.55          241.43           5.11         10.16        20.74
5.6. Software Dependability
A great deal of research has been performed in the area of software reliability during the development phase.
Different models have been proposed (reviewed in [Musa87]) to characterize the reliability growth of the candidate
software through this phase. In general, these models can be divided into two classes. The first assumes that the fail-
ure rate is a function of the number of remaining defects in the software. Imperfect debugging and uncertainty in the
projected number of initial defects have also been modeled [Goel85]. The second class of models does not depend on
the knowledge of the number of the remaining defects [Littlewood80]. The failure rate is assumed to be a random
variable and the software reliability model involves two stochastic processes. Although most models perform well
within their own contexts, their performance varies significantly from one data set to another.
The operational phase of a mature software is much different from the development phase. In the operational
phase, a typical situation involves frequent changes and updates installed either by system managers or by vendors.
Often, without notification to the installation management, the vendor will install a change (patch) to fix a fault found
at some other installation. In a sense, the system being measured represents an aggregate of all such systems being
maintained by the vendor. In addition, software reliability in the operational phase is also affected by workload
effects, hardware problems, and environmental factors. Thus, software reliability in the operational phase cannot be
characterized by simply applying analytical models proposed for the development phase.
Studies dealing with software dependability issues for the operational phase have also evolved over the past 15
years. Software TTE distributions (Section 5.3), dependency between software failure and workload (Section 5.4), and
modeling of software error/recovery processes (Section 5.5) have been discussed in previous sections. In this section,
several other issues, including error interactions (i.e., hardware-related and correlated software errors), software fault
tolerance, and software defect classification are discussed.
5.6.1. Error Interactions
When software is running in a complex system, interactions between hardware and software, and interactions
among multiple processors can cause software error scenarios that cannot be seen during testing. Investigation of
such error scenarios is helpful for understanding characteristics of software errors in operational systems. In the
following, two kinds of such error scenarios are discussed: hardware-related software errors, which are a result of
interactions between hardware and software, and correlated software errors, which are a result of interactions among
processors through software protocols.
A. Hardware-Related Software Errors
In [Iyer85a], software errors related to hardware errors were described as hardware-related software errors.
More precisely, if a software error (failure) occurs in close proximity (within a minute) to a hardware error, it is called
a hardware-related software (HW/SW) error (failure). There are several causes of hardware-related software errors.
For instance, a hardware error, such as a flipped memory bit, may change the software condition, resulting in a
software error. Therefore, even though it is reported as a software error, it is actually caused by faulty hardware. Another
possibility is that the software may fail to handle an unexpected hardware problem such as an abnormal condition in
the network communication. This can be attributed to a software design flaw. Sometimes, both the hardware error
and the software error are symptoms of another, unidentified problem.
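The proximity criterion above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the cited studies: it assumes timestamps in seconds, treats "within a minute" as 60 seconds on either side of a hardware error, and uses a hypothetical function name.

```python
from bisect import bisect_left


def hardware_related(sw_times, hw_times, window=60.0):
    """Return the software-error timestamps lying within `window` seconds
    of at least one hardware-error timestamp (before or after it)."""
    hw = sorted(hw_times)
    related = []
    for t in sw_times:
        i = bisect_left(hw, t)
        # Only the nearest hardware errors on each side of t can be in range.
        neighbors = hw[max(i - 1, 0):i + 1]
        if any(abs(t - h) <= window for h in neighbors):
            related.append(t)
    return related


if __name__ == "__main__":
    # Hypothetical log: software errors at 100 s, 500 s, 1000 s.
    print(hardware_related([100, 500, 1000], [50, 1030]))
```

Sorting the hardware timestamps once and using a binary search keeps the scan efficient for long error logs.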
Table 5.13 shows the frequency and percentage of hardware-related software errors/failures (among all software
errors/failures) measured from an IBM 3081 system [Iyer85b] and two VAXclusters [Tang92b]. In the IBM system,
approximately 33% of all observed software failures are hardware-related. HW/SW errors are found to have large
error-handling times (high recovery overhead). The system failure probability for the HW/SW errors is close to three
times that for software errors in general. The VAXcluster data show that most hardware errors involved in HW/SW
errors are network errors (75%). This indicates that the major sources of hardware-related software problems in the
measured VAXclusters are network-related hardware or software components. This is a unique feature of the
multicomputer system, where processes rely heavily on intercommunication through the network.
Table 5.13. Hardware-Related Software Errors/Failures
             HW/SW Errors            HW/SW Failures
System       Frequency   Percent     Frequency   Percent
IBM/MVS      177         11.4        94          32.8
VAX/VMS      32          18.9        28          21.4
B. Correlated Software Errors
When multiple instances of a software system interact with each other in a multicomputer environment, the
issue of correlated failures should be addressed. Several studies [Tang90], [Wein90], [Lee91] found that significant
correlated processor failures exist in the measured multicomputer systems. Correlated software failures are also found
in the VAX VMS and the Tandem GUARDIAN operating systems [Lee93a]. The data showed that about 10% of soft-
ware failures in the measured VAXcluster and 20% of software halts in the measured Tandem system occurred on
multiple machines concurrently. To understand how correlated software failures occur, it is instructive to examine a
real case in detail.
Figure 5.10 shows a scenario of correlated software failures. In the figure, Europa, Jupiter, and Mercury are
machine names in the VAXcluster. A dashed line indicates that the corresponding machine is in a failure state. At
one time, a network error (net1) was reported from the CI (Computer Interconnect) port on Europa. This resulted in a
software failure (soft1) 13 seconds later. Twenty-four seconds after the first network error (net1), additional network
errors (net2, net3) were reported on the second machine (Jupiter), which were followed by a software failure (soft2).
The error sequence on Jupiter was repeated (net4, net5, soft3) on the third machine (Mercury). The three machines
experienced software failures concurrently for 45.5 minutes. All three software failures occurred shortly after network
errors occurred, so they were network error related. Further analysis of the data revealed that the network-related
software of the VAX/VMS is a potential software bottleneck in terms of correlated failures.
Figure 5.10. A Scenario of Correlated Software Failures
Note:
soft1, soft2, soft3 -- Exception while above asynchronous system traps delivery or on interrupt stack.
net1, net3, net5 -- Port will be re-started. net2, net4 -- Virtual circuit timeout.
The higher percentage of correlated software failures in the Tandem system can be attributed to the architectural
characteristics of the system. In the Tandem system, a single software fault can cause halts of two processors on
which the primary and backup processes (see Section 5.6.2) of the faulty software are executing. If the two halted
processors control a disk which includes files needed by other processors on the system, additional software halts can
occur on these processors. (In the Tandem system, a disk can typically be accessed by two processors via dual-port
disk controllers.) This explains why there is a higher percentage of correlated software failures in the Tandem system.
Note that the above scenario is a multiple component failure situation not expected in general system design,
which assumes failure independence. Even the Tandem fault tolerant system is not designed explicitly to guard
against this situation. Generally, correlated failures can stress recovery and break the protection provided by the fault
tolerance.
5.6.2. Software Fault Tolerance
While hardware fault tolerance techniques have been used successfully, the issue of software fault tolerance is
still not well addressed. Major approaches for software fault tolerance rely on design diversity [Avizienis84],
[Randell75]. But these approaches are usually inapplicable to large operating systems because of the immense cost of
developing and maintaining the software. However, some fault tolerance techniques not explicitly designed for
tolerating software faults can provide a certain amount of software fault tolerance. Understanding such techniques is
important for designing good approaches to improving software dependability. The Tandem GUARDIAN system, running on the
single-failure tolerant multicomputer system, is a good target for such evaluations.
The Tandem GUARDIAN operating system is a message-based distributed system built for on-line transaction
processing [Bartlett78]. High availability is achieved via single-failure tolerance techniques including the
process-pair approach. For each user program, there are two processes -- a primary process and a backup process --
executing the same program on two processors. During normal operation, the primary process performs all operations
for the user, while the backup process passively watches message flows. The primary process periodically sends
checkpoint messages to its backup. When the primary process detects an inconsistency in its state, it fails fast, and the
backup process takes over the responsibility of the primary process. This approach can tolerate transient software
errors, which will usually not be repeated by reexecuting the process.
A study of operating system fault tolerance achieved by the single-failure tolerance techniques implemented in a
Tandem multiprocessor system was reported in [Lee92]. The measured system had 16 processors and was working in
a high-stress environment. The data source was the processor halt log maintained by the GUARDIAN system for a
period of 23 months. The effect of the built-in fault tolerance mechanisms on software availability was evaluated by
reward analysis. Two reward functions were defined in the analysis. In the definition, i represents the system state in
which there are i failed processors, and n represents the total number of processors in the system. The first function
(SFT) reflects the fault tolerance of the Tandem system. In this function, the first processor halt does not cause any
degradation. For additional processor halts, the loss of service is proportional to the number of processors halted. The
second function (NSFT) assumes no fault tolerance. The difference between the two functions allows evaluation of
the improvement in service due to the built-in fault tolerance mechanisms.
SFT (Single-Failure Tolerance):
\[
r_i =
\begin{cases}
1, & i = 0 \\
1 - \dfrac{i-1}{n}, & 0 < i < n \\
0, & i = n
\end{cases}
\tag{5.13}
\]
NSFT (No Single-Failure Tolerance):
\[
r_i = 1 - \frac{i}{n}, \qquad 0 \le i \le n \tag{5.14}
\]
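The two reward functions can be written out directly, which makes the difference between them easy to check for small cases. The sketch below is illustrative: the function names are hypothetical, the expected reward is computed as the probability-weighted sum of state rewards (assumed here to be what Eq. (5.10) expresses), and the state probabilities in the example are made up for demonstration, not measured values.

```python
def reward_sft(i, n):
    """Reward with single-failure tolerance: the first halt costs nothing."""
    if not 0 <= i <= n:
        raise ValueError("i must satisfy 0 <= i <= n")
    if i == 0:
        return 1.0
    if i == n:
        return 0.0
    return 1.0 - (i - 1) / n


def reward_nsft(i, n):
    """Reward without fault tolerance: every halted processor costs 1/n."""
    if not 0 <= i <= n:
        raise ValueError("i must satisfy 0 <= i <= n")
    return 1.0 - i / n


def expected_reward(state_probs, reward):
    """Probability-weighted reward; state_probs[i] = P(i processors failed)."""
    n = len(state_probs) - 1
    return sum(p * reward(i, n) for i, p in enumerate(state_probs))


if __name__ == "__main__":
    # Hypothetical 4-processor system, for illustration only.
    probs = [0.990, 0.008, 0.001, 0.0005, 0.0005]
    print(1 - expected_reward(probs, reward_nsft))  # reward loss, no tolerance
    print(1 - expected_reward(probs, reward_sft))   # reward loss, with tolerance
```

The gap between the two printed losses is the kind of improvement that the analysis attributes to the built-in fault tolerance.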
Based on the above reward functions, the expected steady-state reward rate, i.e., the Y in Eq. (5.10), was evaluated
for software, non-software, and all halts. The results are given in Table 5.14. The bottom row shows the
improvement in service (i.e., reduction in reward loss) due to the fault tolerance. It is seen that the single-failure
tolerance in the measured system reduces the service loss due to software halts by 89% and due to non-software halts
by 92%. This clearly demonstrates the effectiveness of the implemented fault tolerance mechanisms against software
failures as well as non-software failures. The table also shows that software problems account for 30% of the service
loss in the measured system (with SFT). Although the system was working in a high-stress environment, the overall
reward loss is small (10^-4 with SFT). This reflects the high availability of the measured system.
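The improvement percentages can be reproduced from the loss values (1 - Y) listed in Table 5.14. The formula below is inferred from the description "reduction in reward loss" and is offered as a consistency check rather than as the authors' stated computation:

```latex
\[
\text{improvement} \;=\; 1 - \frac{(1-Y)_{\mathrm{SFT}}}{(1-Y)_{\mathrm{NSFT}}}
\]
\[
\text{software: } 1 - \frac{0.00007}{0.00062} \approx 0.89, \qquad
\text{non-software: } 1 - \frac{0.00016}{0.00205} \approx 0.92, \qquad
\text{all: } 1 - \frac{0.00023}{0.00267} \approx 0.91
\]
\[
\text{software share of loss with SFT: } \frac{0.00007}{0.00023} \approx 0.304
\]
```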
Table 5.14. Loss of Service Caused by Halts in the Tandem System
Measure                Software   Non-Software   All
NSFT   1-Y             .00062     .00205         .00267
       Percent         23.2       76.8           100
SFT    1-Y             .00007     .00016         .00023
       Percent         30.4       69.6           100
Improvement            89%        92%            91%
5.6.3. Software Defect Classification
In recent studies of software defects reported from the IBM MVS operating system [Sullivan91] and two IBM
large database management systems, DB2 and IMS [Sullivan92], a software defect classification scheme was
proposed. The scheme uses three concepts -- error type, defect type, and error trigger -- to classify software faults and
errors. The error type classifies the low-level programming mistakes that lead to software failures. The defect type is
a higher-level classification that distinguishes design mistakes, coding mistakes, and administrative mistakes. The
error trigger is related to the running environment; it distinguishes several ways that defective code which was not
executed during testing could be executed at the customer site. Tables 5.15 to 5.17 list major categories generated
from the data under the three criteria.
Table 5.15. Major Categories of Error Types
Error Type               Description
Allocation Management    A module uses a memory region after deallocating it.
Copying Overrun          The program copies data past the end of a buffer.
Data Error               The program produces or reads wrong data.
Interface Error          A module's interface is defined or used incorrectly.
Memory Leak              The program never deallocates memory it obtained from the system.
Pointer Management       A variable containing the address of data is corrupted.
Statement Logic          Statements are executed in the wrong order or are omitted.
Synchronization          An error occurs in locking or synchronization code.
Uninitialized Variable   A variable is used before it is initialized.
Undefined State          The system goes into a state the designers did not anticipate.
Wrong Algorithm          The program works but uses a wrong algorithm.
Table 5.16. Major Categories of Defect Types
Defect Type              Description
Function                 A program's functionality is missing, incomplete, or incorrect.
Data Struct/Algorithm    A data structure or algorithm has a design flaw.
Assignment/Checking      A coding mistake involves variable assignment or validation.
Interface                Errors are discovered in the interaction between components.
Timing/Synchronization   Errors occur in the management of shared or real-time resources.
Build/Package/Merge      Errors occur in version control or roll-up of fixes.
Table 5.17. Major Categories of Error Triggers
Error Trigger            Description
Workload                 Unusual workload conditions such as a user request with unexpected parameters.
Bug Fixes                A bug introduced when an earlier bug was fixed.
Client Code              Errors caused by propagation from application code running in protected mode.
Recovery/Exception       Problems in error recovery and exception handling.
Timing                   Errors caused by an unanticipated sequence of events.
The studies compared the error type, defect type, and error trigger distributions of the three products (DB2,
IMS, and MVS) and found that the three products' distributions differ significantly. However, they have some
common characteristics, such as the mode "undefined state." The studies also investigated the impact of software defects
on system availability for the MVS operating system. A comparison between overlay defects (defects that corrupt a
program's memory) and non-overlay defects demonstrated that the impact of an overlay defect is much higher.
Boundary conditions and allocation management were found to be the major causes of overlay defects.
5.7. Failure Prediction
Fault diagnosis and failure prediction are of significance for maintaining highly reliable systems.
Measurement-based studies have shown that it is possible to predict future failures based on the current and historical
on-line error information. Several heuristic and statistical approaches have been proposed. The heuristic approach
extracts characteristics of anomalous events, such as error reports [Lin90] or performance anomalies [Maxion90a], and
relates them to failures or faults by heuristic rules or signatures. The statistical approach uses statistical techniques to
quantify relationships among system error states defined on the basis of error rates and recognizes failure patterns using
the quantified relationships [Iyer90]. In the following, we discuss two typical approaches: 1) failure prediction based
on the heuristic trend analysis of error logs and 2) failure prediction based on the statistical analysis of error symptoms.
5.7.1. Prediction Based on Heuristic Trend Analysis
This approach is based on the observation that a system usually experiences a period of intermittent errors
before a hard failure occurs. The symptoms of intermittent errors can be used to predict impending failures. The early
study of this approach showed qualitatively that the frequency of error tuples was correlated to system failures, based
on measurements from a DEC disk subsystem [Tsao83]. Later, a heuristic trend analysis method, the dispersion frame
technique (DFT), was developed [Lin90]. DFT determines the relationship among errors by examining their closeness
in time and space.
Two concepts are used in DFT: dispersion frame (DF) and error dispersion index (EDI). A DF is defined as the
interval between two successive errors of the same type. The EDI is defined as the number of error occurrences
following the previous DF during the interval of one half of the previous DF or the DF before the previous DF. Each DF
is applied to the following two errors. A high EDI implies that the errors following the DF used to measure the
EDI are highly correlated. DFT consists of five heuristic rules developed from field experience:
(1) 3.3 rule: The two consecutive EDIs obtained by applying the same frame are at least 3.
(2) 2.2 rule: The two consecutive EDIs obtained by applying two successive frames are at least 2.
(3) 2 in 1 rule: A frame is less than 1 hour.
(4) 4 in 1 rule: Four errors occur within a 24-hour frame.
(5) 4 decreasing rule: There are four monotonically decreasing frames, and at least one frame is half the size of its
previous frame.
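Three of the rules above depend only on frame sizes and error times, so they can be sketched without the EDI. The code below is a simplified illustration, not the DFT implementation from [Lin90]: it omits the 3.3 and 2.2 rules (which require the EDI), assumes times are given in hours for one error type on one device, and reads "half the size" in rule (5) as "at most half". All function names are hypothetical.

```python
def dispersion_frames(times_h):
    """Intervals (hours) between successive errors; times must be sorted."""
    return [b - a for a, b in zip(times_h, times_h[1:])]


def rule_2_in_1(frames):
    """Rule (3): some frame is shorter than 1 hour."""
    return any(f < 1.0 for f in frames)


def rule_4_in_1(times_h):
    """Rule (4): some run of four errors falls within a 24-hour span."""
    return any(times_h[k + 3] - times_h[k] <= 24.0
               for k in range(len(times_h) - 3))


def rule_4_decreasing(frames):
    """Rule (5): four strictly decreasing frames, one at most half its predecessor."""
    for k in range(len(frames) - 3):
        window = frames[k:k + 4]
        pairs = list(zip(window, window[1:]))
        decreasing = all(a > b for a, b in pairs)
        halved = any(b <= a / 2 for a, b in pairs)
        if decreasing and halved:
            return True
    return False


if __name__ == "__main__":
    times = [0, 100, 160, 190, 200]      # hypothetical error times in hours
    frames = dispersion_frames(times)    # [100, 60, 30, 10]
    print(rule_2_in_1(frames), rule_4_in_1(times), rule_4_decreasing(frames))
```

In this hypothetical trace only the "4 decreasing" rule fires, since the frames shrink steadily but no frame is under an hour and no four errors fall within one day.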
Figure 5.11 demonstrates an example of DFT, including some activated heuristics. In the figure, the top line
represents the time sequence of five error occurrences (1, ..., 5) in a particular device. DFT is activated when a frame
size less than 168 hours (1 week) is encountered. Assume that all the frames in the figure fall within this threshold.
Each frame is applied to the following two errors by centering it on the time points of the two error occurrences. For
example, DF(1,2) is applied to errors 2 and 3, DF(2,3) is applied to errors 3 and 4, etc. An upward arrow represents a
failure warning issued under the above heuristic rules.
Figure 5.11. Dispersion Techniques
(Time line of error occurrences 1 to 5 with frames DF(1,2), DF(2,3), DF(3,4), and DF(4,5); the warnings shown are
the 3.3, 2.2, and 4 decreasing warnings.)
DFT was applied to the data collected from 13 public-domain file servers at Carnegie Mellon University over a
22-month period. Among 16 hard failures examined, DFT predicted 15, with 5 false alarms. That is, the successful
prediction rate is 93.7%. This result shows that DFT is very effective when coupled with good system instrumentation.
The disadvantage of this approach is that different systems may require different heuristics and parameters.
5.7.2. Prediction Based on Statistical Analysis
The objective of this approach is to recognize intermittent failures through statistical analysis and testing on
recorded error data. The approach starts by identifying key error patterns potentially symptomatic of failure
occurrences and then refines these patterns by scanning the rest of the data in stages for similar error patterns. At each stage,
the similarity is statistically tested. The approach is illustrated by the flowchart in Figure 5.12.
In the first stage, data coalescing is performed on the raw data to eliminate redundant reports. The output of this
stage is error records (tuples) characterized by error states (error type, machine condition, etc.). Next, all error
records occurring within a small time interval (15 minutes) are identified as error groups. Error groups represent
periods of high error activity (error bursts). Experience has shown that when system errors occur in bursts of a relatively
high error rate, the errors are often related. In the second stage, statistical analysis and hypothesis testing are
performed on each error group to determine whether a valid correlation exists among its members (error records).
Randomly formed groups in which members are statistically independent are rejected. Thus, the original error groups
consisting of records among which relationships can exist are refined to the validated error groups consisting of
records among which relationships do exist.
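The grouping step of the first stage can be sketched as follows. The text does not say exactly how the 15-minute interval is applied, so this sketch assumes that a record joins the current group when it occurs no more than 15 minutes after the previous record; other readings (for example, fixed 15-minute windows) are possible. The hypothesis testing of the second stage is not shown, and the function name and record layout are hypothetical.

```python
def group_errors(records, window_s=15 * 60):
    """Chain error records into groups by time proximity.

    records: iterable of (timestamp_seconds, error_state) tuples.
    A record starts a new group when it occurs more than `window_s`
    seconds after the previous record.
    """
    groups = []
    for rec in sorted(records):
        if groups and rec[0] - groups[-1][-1][0] <= window_s:
            groups[-1].append(rec)
        else:
            groups.append([rec])
    return groups


if __name__ == "__main__":
    # Hypothetical coalesced error records: (time in seconds, error state).
    log = [(0, "A"), (600, "B"), (1200, "A"), (5000, "C")]
    for g in group_errors(log):
        print(g)
```

Groups produced this way are only candidates; in the described approach each one would still have to pass a statistical test before its members are treated as related.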
Relationships can exist across error groups, i.e., a single cause can give rise to a persistent error and thus foster
multiple error groups within a short time. In the third stage, the output groups from the second stage are examined to
recognize related error groups and to eliminate stray error records. Several concepts are introduced for the analysis in
this stage. An error event is defined as the collection of error groups occurring within a given period (e.g., 24 hours)
and having at least two error states in common. A symptom is defined as a collection of statistically related error states
that are common to at least half of the groups in an event. A symptom set is defined as the collection of all symptoms
in an event. Figure 5.13 illustrates an event and its symptom set. The event is composed of three groups: G1, G2, and
G3. The error states in these groups are represented by A1, ..., A7. Two symptoms are extracted from these error states:
S1, which consists of A2 and A4, and S2, which consists of A5 and A6. Thus, S1 and S2 constitute the symptom set for
this event.
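The symptom extraction in this example can be sketched under two simplifying assumptions, both of which go beyond what the text states: the statistical relatedness test is skipped, and error states are bundled into one symptom when they occur in exactly the same groups. The group contents below are hypothetical, chosen only so that G1 and G2 share {A2, A4}, G2 and G3 share {A5, A6}, and G1 and G3 share nothing.

```python
from collections import defaultdict


def symptom_set(groups):
    """Bundle error states that appear in at least half of the groups.

    groups: list of sets of error-state labels (one set per error group).
    States with identical group membership form one symptom (an assumption).
    """
    membership = defaultdict(set)
    for idx, group in enumerate(groups):
        for state in group:
            membership[state].add(idx)

    bundles = defaultdict(set)
    for state, idxs in membership.items():
        if 2 * len(idxs) >= len(groups):   # present in at least half the groups
            bundles[frozenset(idxs)].add(state)
    return {frozenset(b) for b in bundles.values()}


if __name__ == "__main__":
    g1 = {"A1", "A2", "A4"}            # hypothetical group contents
    g2 = {"A2", "A4", "A5", "A6"}
    g3 = {"A3", "A5", "A6", "A7"}
    for s in symptom_set([g1, g2, g3]):
        print(sorted(s))
```

With these inputs the sketch yields the two symptoms of the example, {A2, A4} and {A5, A6}.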
Further, three simple rules are used in the fourth stage to recognize related events and to group them into sets
called super events. The rules ensure that the events so grouped will have sufficiently common structure to permit
testing for correlation. Two events are grouped into a super event if they satisfy any one of the following criteria: 1)
they have at least one symptom in common, 2) a symptom of one event is a proper subset of at least one symptom of
another event, or 3) if they are single-group events, then they have at least two error states in common. Figure 5.14
illustrates how super events are constructed. There is no time restriction when these rules are applied to the event
data. When a super event is created, a corresponding super symptom set is also created. The super symptom set starts
with just the symptoms of the first event of that super event. As another event is added, set intersection is performed
between its symptom set and each of the symptom sets already in the super event. All intersections are then added to
the super symptom set.
Figure 5.13. Derivation of an Event's Symptom Set
(Groups G1, G2, and G3: G1 ∩ G2 ∩ G3 = ∅, G1 ∩ G2 = {A2, A4}, G2 ∩ G3 = {A5, A6}, G1 ∩ G3 = ∅.)
Figure 5.14. Construction of Super Events
(Events grouped into Super-Event 1 and Super-Event 2.)
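The three grouping criteria can be expressed as a single predicate over pairs of events. The sketch below is an illustration only: the `Event` structure, the function names, and the greedy way events are attached to the first matching super event are assumptions, and neither the statistical validation nor the super symptom set bookkeeping is shown.

```python
from dataclasses import dataclass


@dataclass
class Event:
    groups: list    # list of sets of error states, one set per error group
    symptoms: set   # set of frozensets of error states


def related(a, b):
    """True if events a and b satisfy any one of the three criteria."""
    # 1) At least one symptom in common.
    if a.symptoms & b.symptoms:
        return True
    # 2) A symptom of one event is a proper subset of a symptom of the other.
    if any(s < t or t < s for s in a.symptoms for t in b.symptoms):
        return True
    # 3) Both are single-group events sharing at least two error states.
    if len(a.groups) == 1 and len(b.groups) == 1:
        return len(a.groups[0] & b.groups[0]) >= 2
    return False


def build_super_events(events):
    """Greedily attach each event to the first super event holding a related event."""
    supers = []
    for event in events:
        for members in supers:
            if any(related(event, m) for m in members):
                members.append(event)
                break
        else:
            supers.append([event])
    return supers


if __name__ == "__main__":
    e1 = Event([{"A1", "A2", "A4"}, {"A2", "A4"}], {frozenset({"A2", "A4"})})
    e2 = Event([{"A2", "A4", "A5"}, {"A2", "A4", "A5"}], {frozenset({"A2", "A4", "A5"})})
    e3 = Event([{"A9"}, {"A9"}], {frozenset({"A9"})})
    print([len(s) for s in build_super_events([e1, e2, e3])])
```

Because no time restriction applies to these rules, the predicate compares only the symptom and state sets and ignores when the events occurred.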
In each of the above stages, statistical analysis and hypothesis testing are performed to validate the correlations
among members in the formed groups or sets. The super events derived in the final stage can be used by service
engineers to judge potential failures. This methodology was applied to the on-line error log files from two CYBER
systems, and the results were compared to the log of failures and repair maintained by the system staff. In nearly 85% of
the cases, the engineers were directly able to confirm that the validated super events corresponded to real system prob-
lems. The evaluation was made both on the basis of their experience and from their field maintenance logs. For the
remaining 15% of the cases, the engineers agreed that a problem had existed, but that its manifestation was not severe
enough to be noticed by their analysis.
6. CONCLUSION
In this paper, we discussed methodologies and advances in the area of the experimental analysis of computer
system dependability. The discussion covered three fields: simulated fault injection, physical fault injection, and
measurement-based analysis of operational systems. The approaches used in the three fields are suited, respectively, to the
dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase.
Before discussing these fields, we introduced several statistical techniques used in all fields. For each field, we pro-
posed a classification of research approaches or topics. Then we presented detailed methodologies and representative
studies for each of these approaches or topics.
The statistical techniques introduced included the estimation of parameters and confidence intervals, probability
distribution characterization, several multivariate analysis methods, and importance sampling. For simulated fault
injections, we covered electrical-level, logic-level, and function-level simulation approaches as well as representative
simulation environments, such as FOCUS and DEPEND. For physical fault injections, we discussed hardware,
software, and radiation fault injection methods as well as several software and hybrid tools, including FIAT, FERRARI,
HYBRID, and FINE. For measurement-based analysis of operational systems, after an introduction to measurement
and data processing techniques, we presented methods used and representative studies in basic error characterization,
dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion covered
several important issues previously studied, including workload/failure dependency, correlated failures, and software
fault tolerance.
Fault injection simulations can be used to investigate the effectiveness of key design features of fault tolerant
systems and to provide timely feedback to system designers. Generally, most dependability measures (except input
parameters such as failure and recovery rates) can be obtained from simulations. However, simulations need accurate
input parameters and the validation of output results, which come from physical fault injections and
measurement-based analysis. Fault injection on real systems can produce information about error latency, error detection, error
propagation, error recovery, and system reconfiguration, but it can only study artificial faults and cannot produce some
dependability measures, such as MTBF and availability. Measurement-based analysis of operational systems under
real workloads can provide valuable information on actual failure characteristics and insight into analytical models.
This type of analysis provides a means to study naturally occurring errors and all measurable dependability metrics,
such as failure and recovery rates, reliability, and availability. However, the analysis is limited to detected errors.
Further, conditions in the field can vary widely from one system to another, casting doubt on the statistical validity of the
results. Thus, all three approaches are complementary and essential for accurate dependability analysis.
Significant progress has been made in all three fields over the past 15 years, especially in the last 5 years,
during which several dependability analysis tools have been developed. Increasing attention is being paid to: 1) combining
analytical modeling and experimental analysis and 2) combining system design and evaluation. In the first
aspect, state-of-the-art analytical modeling techniques are being applied to real systems to evaluate various depend-
ability and performance characteristics. Results from experimental analysis are being used to validate analytical mod-
els and to reveal practical issues that analytical modeling must address to develop more representative models. In the
second aspect, dependability analysis tools are being combined with each other and with other CAD tools to provide
an automatic design environment which incorporates multiple levels of joint evaluation of functionality, performance,
dependability, and cost. Software failure data from testing and operational phases are also providing feedback to the
software design, improving software reliability. Further interesting studies and advances in this area can be expected
in the near future.
REFERENCES
[Arlat90a] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.C. Fabre, J.C. Laprie, E. Martins, and D. Powell, "Fault Injection
for Dependability Validation: A Methodology and Some Applications," IEEE Trans. Software Engineering, Vol.
16, No. 2, pp. 166-182, Feb. 1990.
[Arlat90b] J. Arlat, K. Kanoun, and J.C. Laprie, "Dependability Modeling and Evaluation of Software Fault-Tolerant
Systems," IEEE Trans. Computers, Vol. 39, No. 4, pp. 504-513, April 1990.
[Aupperle89] B.E. Aupperle, J.F. Meyer and L. Wei, "Evaluation of Fault-Tolerant Systems with Nonhomogeneous
Workloads," Proc. 19th Int. Symp. Fault-Tolerant Computing, pp. 159-166, June 1989.
[Avizienis84] A. Avizienis and J.P.J. Kelly, "Fault Tolerance by Design Diversity: Concepts and Experiments," IEEE
Computer, pp. 67-80, Aug. 1984.
[Banerjee82] P. Banerjee and J.A. Abraham, "Fault Characterization of MOS VLSI Circuits," Proc. Int. Conf.
Circuits and Computers, pp. 564-568, 1982.
[Bartlett78] J.F. Bartlett, "A 'Nonstop' Operating System," Proc. Int. Hawaii Conf. System Science, pp. 103-117,
1978.
[Barton90] J.H. Barton, E.W. Czeck, Z.Z. Segall, and D.P. Siewiorek, "Fault Injection Experiments Using FIAT,"
IEEE Trans. Computers, Vol. 39, No. 4, pp. 575-582, April 1990.
[Bavuso87] S.J. Bavuso, J.B. Dugan, K.S. Trivedi, E.M. Rothman, and W.E. Smith, "Analysis of Typical Fault-Tolerant
Architectures using HARP," IEEE Trans. Reliability, Vol. 36, No. 2, pp. 176-185, June 1987.
[Beh82] C.C. Beh, K.H. Arya, C.E. Radke, and K.E. Torku, "Do Stuck Fault Models Reflect Manufacturing Defects?"
Proc. Int. Test Conf., pp. 35-42, 1982.
[Bishop88] P.G. Bishop and F.D. Pullen, "PODS Revisited -- A Study of Software Failure Behavior," Proc. 18th Int.
Symp. Fault-Tolerant Computing, pp. 2-8, 1988.
[Bryant84] R.E. Bryant, "A Switch-Level Model and Simulator for MOS Digital Systems," IEEE Trans. Computers,
Vol. 33, No. 2, pp. 160-177, Feb. 1984.
[Butner80] S.E. Butner and R.K. Iyer, "A Statistical Study of Reliability and System Load at SLAC," Proc. 10th Int.
Symp. Fault-Tolerant Computing, pp. 207-209, Oct. 1980.
[Castillo81] X. Castillo and D.P. Siewiorek, "Workload, Performance, and Reliability of Digital Computer Systems,"
Proc. 11th Int. Symp. Fault-Tolerant Computing, pp. 84-89, July 1981.
[Castillo82] X. Castillo and D.P. Siewiorek, "A Workload Dependent Software Reliability Prediction Model," Proc.
12th Int. Symp. Fault-Tolerant Computing, pp. 279-286, June 1982.
[Chillarege87] R. Chillarege and R.K. Iyer, "Measurement-Based Analysis of Error Latency," IEEE Trans.
Computers, Vol. C-36, No. 5, May 1987.
[Chillarege89] R. Chillarege and N.S. Bowen, "Understanding Large System Failures -- A Fault Injection
Experiment," Proc. 19th Int. Symp. Fault-Tolerant Computing, pp. 356-363, June 1989.
[Choi89] G.S. Choi, R.K. Iyer and V. Carreno, "FOCUS: An Experimental Environment for Validation of Fault Tolerant
Systems: A Case Study of a Jet Engine Controller," Int. Conf. Computer Design (ICCD), pp. 561-564, Oct. 1989.
[Choi92] G.S. Choi and R.K. Iyer, "FOCUS: An Experimental Environment for Fault Sensitivity Analysis," IEEE
Trans. Computers, Vol. 41, No. 12, pp. 1515-1526, Dec. 1992.
[Choi93] G. Choi, R. Iyer, and D. Saab, Fault Behavior Dictionary for Simulation of Device-Level Transients,
Technical Report, CRHC, University of Illinois at Urbana-Champaign, 1993.
[Clark93] J.A. Clark and D.K. Pradhan, "REACT: A Synthesis and Evaluation Tool for Fault-Tolerant Multiprocessor
Architectures," Proc. Annual Reliability and Maintainability Symposium, pp. 428-435, 1993.
[Courtois79] B. Courtois, "Some Results about the Efficiency of Simple Mechanisms for the Detection of
Microcomputer Malfunctions," Proc. 9th Int. Symp. Fault-Tolerant Computing, pp. 71-74, June 1979.
[Cusick85] J. Cusick, R. Koga, W. Kolasinski, and C. King, "SEU Vulnerability of the Zilog Z-80 and NSC-800
Microprocessors," IEEE Trans. Nuclear Science, Vol. NS-32, pp. 4206-4211, Dec. 1985.
[Czeck92] E.W. Czeck and D.P. Siewiorek, "Observations on the Effects of Fault Manifestation as a Function of
Workload," IEEE Trans. Computers, Vol. 41, No. 5, pp. 559-566, May 1992.
[Dillon84] W.R. Dillon and M. Goldstein, Multivariate Analysis, John Wiley & Sons, 1984.
[Duba88] P. Duba and R.K. Iyer, "Transient Fault Behavior in a Microprocessor: A Case Study," Proc. 1988 IEEE Int.
Conf. Computer Design: VLSI in Computers & Processors, pp. 272-276, Oct. 1988.
[Dugan91] J.B. Dugan, "Correlated Hardware Failures in Redundant Systems," Proc. 2nd IFIP Working Conf.
Dependable Computing for Critical Applications, Tucson, Arizona, Feb. 1991.
[Dunkel90] J. Dunkel, "On the Modeling of Workload-Dependent Memory Faults," Proc. 20th Int. Symp.
Fault-Tolerant Computing, pp. 348-355, June 1990.
[Dupuy90] A. Dupuy, J. Schwartz, Y. Yemini, and D. Bacon, "NEST: A Network Simulation and Prototyping
Testbed," Communications of the ACM, Vol. 33, No. 10, pp. 64-74, Oct. 1990.
[Finelli87] G.B. Finelli, "Characterization of Fault Recovery through Fault Injection on FTMP," IEEE Trans.
Reliability, Vol. R-36, No. 2, pp. 164-170, June 1987.
[Goel85] A.L. Goel, "Software Reliability Models: Assumptions, Limitations, and Applicability," IEEE Trans.
Software Engineering, Vol. SE-11, No. 12, pp. 1411-1423, Dec. 1985.
[Goswami90] K.K. Goswami and R.K. Iyer, "DEPEND: A Design Environment for Prediction and Evaluation of
System Dependability," Proc. 9th Digital Avionics Systems Conference, Oct. 1990.
[Goswami91] K.K. Goswami and R.K. Iyer, "A Simulation-Based Study of a Triple Modular Redundant System
using DEPEND," Proc. 5th Int. Tests, Diagnosis, Fault Treatment Conf., pp. 300-311, Sept. 1991.
[Goswami92] K.K. Goswami and R.K. Iyer, DEPEND: A Simulation-Based Environment for System Level Depend-
ability Analysis, Technical Report, CRHC 92-11, University of Illinois at Urbana-Champaign, June 1992.
[Goswami93a] K.K. Goswami and R.K. Iyer, "Use of Hybrid and Hierarchical Simulation to Reduce Computation
Costs," Int. Workshop Modeling Analysis & Simulation of Computer & Telecomm. Sys., San Diego, CA, pp. 197-202,
Jan. 1993.
[Goswami93b] K. K. Goswami, R. K. Iyer and M. Devarakonda, "Prediction-Based Dynamic Load-Sharing Heuris-
tics," IEEE Trans. Parallel and Distributed Computing, May 1993, to be published.
[Goswami93c] K.K. Goswami and R.K. Iyer, "Simulation of Software Behavior Under Hardware Faults," Proc. 23rd
Int. Symp. Fault-Tolerant Computing, June 1993.
[Goyal87] A. Goyal, S.S. Lavenberg and K.S. Trivedi, "Probabilistic Modeling of Computer System Availability,"
Annals of Operations Research, No. 8, pp. 285-306, March 1987.
[Goyal92] A. Goyal, P. Shahabuddin, P. Heidelberger, V.F. Nicola, and P.W. Glynn, "A Unified Framework for
Simulating Markovian Models of Highly Dependable Systems," IEEE Trans. Computers, Vol. 41, No. 1, pp. 36-51, Jan.
1992.
[Gray90] J. Gray, "A Census of Tandem System Availability Between 1985 and 1990," IEEE Trans. Reliability, Vol.
39, No. 4, pp. 409-418, Oct. 1990.
[Gray91] J. Gray and D.P. Siewiorek, "High-Availability Computer Systems," IEEE Computer, pp. 39-48, Sept.
1991.
[Gunneflo89] U. Gunneflo, J. Karlsson, and J. Torin, "Evaluation of Error Detection Schemes Using Fault Injection
by Heavy-ion Radiation," Proc. 19th Int. Symp. Fault-Tolerant Computing, pp. 340-347, June 1989.
[Hansen92] J.P. Hansen and D.P. Siewiorek, "Models for Time Coalescence in Event Logs," Proc. 22nd Int. Symp.
Fault-Tolerant Computing, pp. 221-227, July 1992.
[Heimann90] D.I. Heimann, N. Mittal and K.S. Trivedi, "Availability and Reliability Modeling for Computer
Systems," Advances in Computers, Vol. 31, pp. 175-233, 1990.
[Hogg83] R.V. Hogg and E.A. Tanis, Probability and Statistical Inference, Second Edition, Macmillan Publishing
Co., Inc., 1983.
[Howard71] R.A. Howard, Dynamic Probabilistic Systems, John Wiley & Sons, Inc., New York, 1971.
[Hsueh87] M.C. Hsueh and R.K. Iyer, "A Measurement-Based Model of Software Reliability in a Production Envi-
ronment," Proc. 11th Annual Int. Computer Software & Applications Conf., pp. 354-360, Oct. 1987.
[Hsueh88] M.C. Hsueh, R.K. Iyer, and K.S. Trivedi, "Performability Modeling Based on Real Data: A Case Study,"
IEEE Trans. Computers, Vol. 37, No. 4, pp. 478-484, April 1988.
[Iyer82a] R.K. Iyer and D.J. Rossetti, "A Statistical Load Dependency Model for CPU Errors at SLAC," Proc. 12th
Int. Symp. Fault-Tolerant Computing, pp. 363-372, June 1982.
[Iyer82b] R.K. Iyer, S.E. Butner, and E.J. McCluskey, "A Statistical Failure/Load Relationship: Results of a
Multicomputer Study," IEEE Trans. Computers, Vol. C-31, No. 7, pp. 697-705, July 1982.
[Iyer85a] R.K. Iyer and P. Velardi, "Hardware-Related Software Errors: Measurement and Analysis," IEEE Trans.
Software Engineering, Vol. SE-11, No. 2, pp. 223-231, Feb. 1985.
[Iyer85b] R.K. Iyer and D.J. Rossetti, "Effect of System Workload on Operating System Reliability: A Study on IBM
3081," IEEE Trans. Software Engineering, Vol. SE-11, No. 12, pp. 1438-1448, Dec. 1985.
[Iyer86] R.K. Iyer, D.J. Rossetti and M.C. Hsueh, "Measurement and Modeling of Computer Reliability as Affected
by System Activity," ACM Trans. Computer Systems, Vol. 4, No. 3, pp. 214-237, Aug. 1986.
[Iyer90] R.K. Iyer, L.T. Young, and P.V.K. Iyer, "Automatic Recognition of Intermittent Failures: An Experimental
Study of Field Data," IEEE Trans. Computers, Vol. 39, No. 4, pp. 525-537, April 1990.
[Jewett91] D. Jewett, "Integrity S2: A Fault-Tolerant Unix Platform," Proc. 21st Int. Symp. Fault-Tolerant
Computing, June 1991.
[Kahn53] H. Kahn and A.W. Marshall, "Methods of Reducing Sample Size in Monte Carlo Computations," Journal of the
Operations Research Society of America, Vol. 1, No. 5, pp. 263-278, 1953.
[Kanawati92] G.A. Kanawati, N.A. Kanawati, and J.A. Abraham, "FERRARI: A Tool for the Validation of System
Dependability Properties," Proc. 22nd Int. Symp. Fault-Tolerant Computing, pp. 336-344, July 1992.
[Kao93] W. Kao, R.K. Iyer, and D. Tang, "FINE: A Fault Injection and Monitor Environment for Tracing the UNIX
System Behavior under Faults," IEEE Transactions on Software Engineering, Dec. 1993, to be published.
[Karlsson89] J. Karlsson, U. Gunneflo, and J. Torin, "The Effects of Heavy-ion Induced Single Event Upsets in the
MC6809E Microprocessor," Proc. 4th Int. Conf. Fault-Tolerant Computing Systems, GI/ITG/GMA, Baden, Germany,
1989.
[Katzman78] J.A. Katzman, "A Fault-Tolerant Computing System," Proc. Int. Hawaii Conference on System Sci-
ence, pp. 85-102, 1978.
[Kendall77] M.G. Kendall, The Advanced Theory of Statistics, Oxford University Press, 1977.
[Kobayashi78] H. Kobayashi, Modeling and Analysis: An Introduction to System Performance Evaluation Methodol-
ogy, Addison-Wesley Publishing Co., 1978.
[Kronenberg86] N.P. Kronenberg, H.M. Levy and W.D. Strecker, "VAXcluster: A Closely-Coupled Distributed
System," ACM Trans. Computer Systems, Vol. 4, No. 2, pp. 130-146, May 1986.
[Lala83] J. Lala, "Fault Detection, Isolation and Reconfiguration in FTMP: Methods and Experimental Results,"
Proc. 5th AIAA/IEEE Digital Avionics Systems Conference (DASC), pp. 21.3.1-21.3.9, 1983.
[Laprie84] J.C. Laprie, "Dependability Evaluation of Software Systems in Operation," IEEE Trans. Software
Engineering, Vol. SE-10, No. 6, pp. 701-714, Nov. 1984.
[Laprie85] J.C. Laprie, "Dependable Computing and Fault Tolerance: Concepts and Terminology," Proc. 15th Int.
Symp. Fault-Tolerant Computing, pp. 2-11, June 1985.
[Law82] A. M. Law and W. D. Kelton, Simulation Modeling and Analysis, McGraw Hill Book Company, 1982.
[Lee91] I. Lee, R.K. Iyer and D. Tang, "Error/Failure Analysis Using Event Logs from Fault Tolerant Systems," Proc.
21st Int. Symp. Fault-Tolerant Computing, pp. 10-17, June 1991.
[Lee92] I. Lee and R.K. Iyer, "Analysis of Software Halts in Tandem System," Proc. 3rd Int. Symp. Software
Reliability Engineering, pp. 227-236, Oct. 1992.
[Lee93a] I. Lee, D. Tang, R.K. Iyer, and M.C. Hsueh, "Measurement-Based Evaluation of Operating System Fault
Tolerance," IEEE Transactions on Reliability, June 1993, to be published.
[Lee93b] I. Lee and R.K. Iyer, "Faults, Symptoms, and Software Fault Tolerance in the Tandem GUARDIAN90
Operating System," Proc. 23rd Int. Symp. Fault-Tolerant Computing, June 1993.
[Lewis84] E.E. Lewis and F. Bohm, "Monte Carlo Simulation of Markov Unreliability Models," Nuclear Eng. and
Design, Vol. 77, pp. 49-62, 1984.
[Lin90] T.T. Lin and D.P. Siewiorek, "Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis," IEEE
Trans. Reliability, Vol. 39, No. 4, pp. 419-432, Oct. 1990.
[Littlewood80] B. Littlewood, "Theories of Software Reliability: How Good Are They and How Can They Be
Improved?" IEEE Trans. Software Engineering, Vol. SE-6, No. 5, pp. 489-500, Sept. 1980.
[Lomelino86] D. Lomelino and R. Iyer, "Error Propagation in a Digital Avionic Processor: A Simulation-Based
Study," Proc. Real Time Systems Symposium, pp. 218-225, Dec. 1986.
[Maxion90a] R.A. Maxion, "Anomaly Detection for Diagnosis," Proc. 20th Int. Symp. Fault-Tolerant Computing, pp.
20-27, June 1990.
[Maxion90b] R.A. Maxion and F.E. Feather, "A Case Study of Ethernet Anomalies in a Distributed Computing
Environment," IEEE Trans. Reliability, Vol. 39, No. 4, pp. 433-443, Oct. 1990.
[McConnel79] S.R. McConnel, D.P. Siewiorek, and M.M. Tsao, "The Measurement and Analysis of Transient Errors
in Digital Computer Systems," Proc. 9th Int. Symp. Fault-Tolerant Computing, pp. 67-70, 1979.
[McGough81] J.G. McGough and F.L. Swern, Measurement of Fault Latency in a Digital Avionic Mini Processor,
NASA Contractor Report 3462, NASA, Washington, DC, 1981.
[Meyer88] B. Meyer, Object-oriented Software Construction, Prentice Hall International Series in Computer Science,
1988.
[MeyerJ80] J.F. Meyer, "On Evaluating the Performability of Degradable Computing Systems," IEEE Trans.
Computers, Vol. C-29, No. 8, pp. 720-731, Aug. 1980.
[MeyerJ88] J.F. Meyer and L. Wei, "Analysis of Workload Influence on Dependability," Proc. 18th Int. Symp.
Fault-Tolerant Computing, pp. 84-89, June 1988.
[MeyerJ92] J.F. Meyer, "Performability: A Retrospective and Some Pointers to the Future," Performance Evaluation,
Vol. 14, pp. 139-156, Feb. 1992.
[Migneault85] ?.?. Migneault, "The Diagnostic Emulation Technique in the Airlab," Internal Report, NASA-Langley
Research Center, 1985.
[Mourad87] S. Mourad and D. Andrews, "On the Reliability of the IBM MVS/XA Operating System," IEEE Trans.
Software Engineering, Vol. SE-13, No. 10, pp. 1135-1139, Oct. 1987.
[Musa87] J.D. Musa, A. Iannino, and K. Okumoto, Software Reliability: Measurement, Prediction, Application,
McGraw-Hill Book Company, 1987.
[Randell75] B. Randell, "System Structure for Software Fault Tolerance," IEEE Trans. Software Engineering, Vol.
SE-1, No. 2, June 1975.
[Reibman89] A. Reibman, R. Smith, and K. Trivedi, "Markov and Markov Reward Model Transient Analysis: An
Overview of Numerical Approaches," European Journal of Operational Research, Vol. 40, pp. 257-267, 1989.
[Rogers85] W. Rogers and J. Abraham, "CHIEFS: A Concurrent Hierarchical and Extensible Fault Simulator," Proc.
IEEE Int. Test Conference, pp. 710-716, 1985.
[Ruehli83] A.W. Ruehli and G.S. Ditlow, "Circuit Analysis, Logic Simulation, and Design Verification for VLSI,"
Proc. of the IEEE, Vol. 71, No. 1, pp. 34-48, Jan. 1983.
[Sahner87] R.A. Sahner and K.S. Trivedi, "Reliability Modeling Using SHARPE," IEEE Trans. Reliability, Vol.
R-36, No. 2, pp. 186-193, June 1987.
[Saleh84] R.A. Saleh, "Integrated Timing Analysis and SPLICE1," Mem. UCB/ERL M84/2, Elec. Res. Lab., U.C.
Berkeley, 1984.
[Saleh90] R.A. Saleh and A.R. Newton, Mixed-Mode Simulation, Kluwer Academic Publishers, June 1990.
[Segall88] Z. Segall, D. Vrsalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, R. Dancey, A. Robinson, and T.
Lin, "FIAT - Fault Injection Based Automated Testing Environment," Proc. 18th Int. Symp. Fault-Tolerant
Computing, pp. 102-107, June 1988.
[Schwetman86] H. Schwetman, "CSIM: A C-Based Process-Oriented Simulation Language," Proc. Winter
Simulation Conf., 1986.
[Shin84] K. Shin and Y. Lee, "Error Detection Process - Model, Design, and Its Impact on Computer Performance,"
IEEE Trans. Computers, Vol. C-33, No. 6, pp. 529-540, June 1984.
[Shin86] K.G. Shin and Y.H. Lee, "Measurement and Application of Fault Latency," IEEE Trans. Computers, Vol.
C-35, No. 4, pp. 370-375, April 1986.
[Siewiorek78] D.P. Siewiorek, V. Kini, H. Mashburn, S.R. McConnel, and M. Tsao, "A Case Study of C.mmp, Cm*,
and C.vmp: Part I -- Experience with Fault Tolerance in Multiprocessor Systems," Proc. of the IEEE, Vol. 66, No. 10,
pp. 1178-1199, Oct. 1978.
[Siewiorek92] D.P. Siewiorek and R.W. Swarz, Reliable Computer Systems: Design and Evaluation, Digital Press,
Bedford, Mass., 1992.
[Sullivan91] M.S. Sullivan and R. Chillarege, "Software Defects and Their Impact on System Availability -- A Study
of Field Failures in Operating Systems," Proc. 21st Int. Symp. Fault-Tolerant Computing, pp. 2-9, June 1991.
[Sullivan92] M.S. Sullivan and R. Chillarege, "A Comparison of Software Defects in Database Management Systems
and Operating Systems," Proc. 22nd Int. Symp. Fault-Tolerant Computing, pp. 475-484, July 1992.
[Tang90] D. Tang, R.K. Iyer and Sujatha Subramani, "Failure Analysis and Modeling of a VAXcluster System," Proc.
20th Int. Symp. Fault-Tolerant Computing, pp. 244-251, June 1990.
[Tang91] D. Tang and R.K. Iyer, "Impact of Correlated Failures on Dependability in a VAXcluster System," Proc.
2nd IFIP Working Conf. Dependable Computing for Critical Applications, Tucson, Arizona, Feb. 1991.
[Tang92a] D. Tang and R.K. Iyer, "Analysis and Modeling of Correlated Failures in Multicomputer Systems," IEEE
Trans. Computers, Vol. 41, No. 5, pp. 567-577, May 1992.
[Tang92b] D. Tang and R.K. Iyer, "Analysis of the VAX/VMS Error Logs in Multicomputer Environments -- A Case
Study of Software Dependability," Proc. Third Int. Symp. Software Reliability Engineering, Research Triangle Park,
North Carolina, pp. 216-226, Oct. 1992.
[Tang93a] D. Tang and R.K. Iyer, "Dependability Measurement and Modeling of a Multicomputer System," IEEE
Trans. Computers, Vol. 42, No. 1, pp. 62-75, Jan. 1993.
[Tang93b] D. Tang and R.K. Iyer, "MEASURE+ -- A Measurement-Based Dependability Analysis Package," Proc.
ACM SIGMETRICS Conf. Measurement and Modeling of Computer Systems, Santa Clara, California, pp. 110-121,
May 1993.
[Trivedi82] K.S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications,
Prentice-Hall, Englewood Cliffs, NJ, 1982.
[Trivedi92] K.S. Trivedi, J.K. Muppala, S.P. Woolet, and B.R. Haverkort, "Composite Performance and Dependability
Analysis," Performance Evaluation, Vol. 14, pp. 197-215, Feb. 1992.
[Tsao83] M.M. Tsao and D.P. Siewiorek, "Trend Analysis on System Error Files," Proc. 13th Int. Symp. Fault-Tolerant
Computing, pp. 116-119, June 1983.

