Experimental Analysis of Computer System Dependability by Iyer, Ravishankar K. & Tang, Dong
May 1994 (revised) UILU-ENG-93-2227
CRHC-93-15





Ravishankar K. Iyer 
Dong Tang
Coordinated Science Laboratory 
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE
REPORT DOCUMENTATION PAGE
1a. REPORT SECURITY CLASSIFICATION
Unclassified
1b. RESTRICTIVE MARKINGS
None
2a. SECURITY CLASSIFICATION AUTHORITY
2b. DECLASSIFICATION/DOWNGRADING SCHEDULE
3. DISTRIBUTION/AVAILABILITY OF REPORT
Approved for public release; distribution unlimited
4. PERFORMING ORGANIZATION REPORT NUMBER(S)
UILU-ENG-93-2227 CRHC 93-15
5. MONITORING ORGANIZATION REPORT NUMBER(S)
6a. NAME OF PERFORMING ORGANIZATION 
Coordinated Science Lab 
University of Illinois
6b. OFFICE SYMBOL 
(If applicable)
N/A
7a. NAME OF MONITORING ORGANIZATION
NASA, ONR
6c. ADDRESS (City, State, and ZIP Code)
1101 W. Springfield Avenue 
Urbana, IL 61801
7b. ADDRESS (City, State, and ZIP Code)
NASA Langley Res. Ctr., Hampton, VA 23665
Office of Naval Res., 800 N. Quincy, Arlington, VA 22217
8a. NAME OF FUNDING/SPONSORING 
ORGANIZATION
NASA, ONR, JSEP, Tandem, CS(
8b. OFFICE SYMBOL 
(If applicable)
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER
NASA NAG-1-613, N00014-91-J-1116 
N00014-90-J-1270. GSA CSC 4bR%Q
8c. ADDRESS (City, State, and ZIP Code)
NASA Langley Research Center, Hampton, VA 23665
ONR, 800 N. Quincy, Arlington, VA 22217









11. TITLE (Include Security Classification)
Experimental Analysis of Computer System Dependability
12. PERSONAL AUTHOR(S)
Ravishankar K. Iyer and Dong Tang
13a. TYPE OF REPORT: Technical
13b. TIME COVERED: FROM TO
14. DATE OF REPORT (Year, Month, Day): May 1994 (revised)
15. PAGE COUNT: 85
16. SUPPLEMENTARY NOTATION
17. COSATI CODES 18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number)
FIELD GROUP SUB-GROUP
Dependability evaluation, simulation, fault injection, measurement-based analysis, computer system,
data analysis, Markov model
20. DISTRIBUTION/AVAILABILITY OF ABSTRACT
[X] UNCLASSIFIED/UNLIMITED  [ ] SAME AS RPT.  [ ] DTIC USERS
21. ABSTRACT SECURITY CLASSIFICATION
Unclassified
22a. NAME OF RESPONSIBLE INDIVIDUAL 22b. TELEPHONE (Include Area Code) 22c. OFFICE SYMBOL





Experimental Analysis of Computer System Dependability




Center for Reliable and High-Performance Computing 
Coordinated Science Laboratory 
University of Illinois at Urbana-Champaign 
Urbana, Illinois 61801, USA
†currently at SoHaR Incorporated, Beverly Hills, California 90211, USA
©1994 R. K. Iyer & D. Tang 
May 1994
ABSTRACT
This chapter reviews an area that has evolved over the past 15 years: experimental analysis of computer 
system dependability. Methodologies and advances are discussed for three basic approaches to this analysis: 
simulated fault injection, physical fault injection, and measurement-based analysis. These three approaches 
are suited, respectively, to dependability evaluation in the three phases of a system’s life: the design phase, 
the prototype phase, and the operational phase. Before discussing dependability evaluation in these phases, 
several statistical techniques used are introduced. For each phase, a classification of research methods or 
study topics is outlined. The methods or topics are then discussed, along with representative studies.
The statistical techniques introduced in this chapter include the estimation of parameters and confidence
intervals, probability distribution characterization, and several multivariate analysis methods. Importance
sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The
discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection
methods as well as representative simulation environments such as FOCUS, NEST, REACT, and DEPEND. The
discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as
several software and hybrid tools, including FIAT, FERRARI, HYBRID, DEFINE, and SFI. The discussion
of measurement-based analysis covers measurement and data processing techniques, basic error
characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The
discussion involves several important issues in the area, including fault models, fast simulation techniques,
workload/failure dependency, correlated failures, and software fault tolerance.
ACKNOWLEDGMENTS
The authors thank Kumar Goswami, Timothy Tsai, Gwan Choi, and Wei-lun Kao for their contributions to 
this manuscript. Special thanks go to Fran Wagner for proof-reading and editing the manuscript. Thanks 
are also extended to Inhwan Lee and Darren Sawyer for their comments on the initial draft.
We wish to thank J. Abraham, J. Arlat, J. A. Clark, E. W. Czeck, G. B. Finelli, J. Karlsson, K. G. 
Shin, D. P. Siewiorek, and F. Yang for permission to use figures (Figures 0.3, 0.7, 0.8, 0.12, 0.13, 0.15, 
0.16, 0.17, 0.20, 0.21, 0.25, and 0.30) and algorithms (Sections 0.3.3 and 0.5.7) from their publications. 
Figure 0.10 is generated from “NEST: A Network Simulation and Prototyping Testbed,” authored by A. 
Dupuy, J. Schwartz, Y. Yemini, and D. Bacon, Communications of the ACM, Vol. 33, No. 10, copyright 
ACM 1990, by permission of the ACM.
This research was supported by NASA under Grant NAG-1-613, in cooperation with the Illinois Computer 
Laboratory for Aerospace Systems and Software (ICLASS), by Tandem Computers, by the Office of Naval 
Research under Grant N00014-91-J-1116, by the Joint Services Electronics Program under Grant N00014- 




0.2 STATISTICAL TECHNIQUES
0.2.1 Parameter Estimation
0.2.2 Distribution Characterization
0.2.3 Multivariate Analysis
0.2.4 Importance Sampling
0.3 DESIGN PHASE
0.3.1 Simulated Fault Injection at the Electrical Level
0.3.2 Simulated Fault Injection at the Logic Level
0.3.3 Simulated Fault Injection at the Function Level
0.4 PROTOTYPE PHASE
0.4.1 Hardware-Implemented Fault Injection
0.4.2 Software-Implemented Fault Injection
0.4.3 Radiation-Induced Fault Injection
0.5 OPERATIONAL PHASE
0.5.1 Measurements
0.5.2 Data Processing
0.5.3 Preliminary Analysis
0.5.4 Dependency Analysis
0.5.5 Markov Reward Modeling
0.5.6 Software Dependability
0.5.7 Failure Prediction
0.6 CONCLUSION
0.7 QUESTIONS
REFERENCES
0.1 INTRODUCTION
In computer science more than in the physical sciences, the experimenter must decide what data to gather 
and analyze, sometimes without the benefit of guidance, experience, or easily available intuition. How to 
obtain general models from experiments or measurements made in a particular environment is by no means 
clear. This chapter discusses the current research in the area of experimental analysis of computer system 
dependability. The discussion centers around methodologies, major developments, and major directions of 
the research in the area.
Experimental evaluation of the dependability of a system can be performed at different phases of the 
system’s life. In the design phase, computer-aided design (CAD) environments are used to evaluate the 
design via simulation, including simulated fault injection. Such fault injection tests the effectiveness of fault- 
tolerant mechanisms and evaluates system dependability, providing timely feedback to system designers. 
Simulation, however, requires accurate input parameters and validation of output results. Although the 
parameter estimates can be obtained from past measurements, this is often complicated by design and 
technology changes. In the prototype phase, the system runs under controlled workload conditions. In this
stage, controlled physical fault injection is used to evaluate the system behavior under faults, including the 
detection coverage and the recovery capability of various fault tolerance mechanisms. Fault injection on the 
real system can provide information about the failure process, from fault occurrence to system recovery, 
including error latency, propagation, detection, and recovery (which may involve reconfiguration). But this 
type of fault injection can only study artificial faults; it cannot provide certain important dependability 
measures, such as mean time between failures (MTBF) and availability. In the operational phase, a direct 
measurement-based approach can be used to measure systems in the field under real workloads. The collected 
data contain a large amount of information about naturally occurring errors/failures. Analysis of this data 
can provide understanding of actual error/failure characteristics and insight into analytical models. Although 
measurement-based analysis is useful for evaluating the real system, it is limited to detected errors. Further, 
conditions in the field can vary widely, casting doubt on the statistical validity of the results. Thus, all
three approaches (simulated fault injection, physical fault injection, and measurement-based analysis) are
required for accurate dependability analysis.
In the design phase, simulated fault injection can be conducted at different levels: the electrical level,
the logic level, and the function level. The objectives of simulated fault injection are to determine
dependability bottlenecks, the coverage of error detection/recovery mechanisms, the effectiveness of reconfiguration
schemes, performance loss, and other dependability measures. The feedback from simulation can be
extremely useful in cost-effective redesign of the system. This chapter discusses techniques for simulated
fault injection and introduces simulation tools at each of these levels.
In the prototype phase, while the objectives of physical fault injection are similar to those of simulated 
fault injection, the methods differ radically because real fault injection and monitoring facilities are involved. 
Physical faults can be injected at the hardware level (logic or electrical faults) or at the software level (code 
or data corruption). Heavy-ion radiation techniques can also be used to inject faults and stress the system. 
This chapter will illustrate the instrumentation involved in fault injection experiments using real examples, 
including several fault injection environments.
In the operational phase, measurement-based analysis must address issues such as how to monitor
computer errors and failures and how to analyze measured data to quantify system dependability characteristics.
Although methods for the design and evaluation of fault-tolerant systems have been extensively researched,
little is known about how well these strategies work in the field. A study of production systems is valuable
not only for accurate evaluation but also for identifying reliability bottlenecks in system design. This
chapter addresses several issues in measurement-based analysis, including workload/failure dependency, modeling
and evaluation based on data, software dependability in the operational phase, and fault diagnosis.
The measurement-based analysis results discussed in this chapter are based on over 200 machine-years of 
data gathered from IBM, DEC, and Tandem systems. The evaluation methodology discussed includes: the 
use of the measured hardware and software error data to jointly characterize the interdependence between 
performance and dependability, correlation analysis to quantify correlated failures and their impact on 
dependability, Markov reward modeling of measured data to evaluate the loss of system service due to 
errors and failures, and algorithms that use on-line error logs to perform automatic fault diagnosis and 
failure prediction.
Before discussing methodologies and developments for the design, prototype, and operational phases, 
we present an overview of the relevant statistical techniques used in this area. The techniques cover: the 
estimation of parameters and confidence intervals; distribution characterization, including function fitting; 
and multivariate analysis methods, including clustering analysis, correlation analysis, and factor analysis. 
Importance sampling, a statistical technique to accelerate Monte Carlo simulation, is also introduced. These 
techniques are later used in the analysis of data obtained from fault injection or from operational systems.
In discussing the experimental analysis approaches used in the three phases, a wide range of
dependability issues are addressed, including error latency, error propagation, error detection, error recovery, error
correlation, workload/error dependency, availability, reliability, performability, and reward rate. In addition
to presenting methodologies and major developments in each of the phases, we critique the relative merits
and research issues for the different approaches. Most evaluation techniques introduced are illustrated by
case studies of their uses on actual systems.
0.2 STATISTICAL TECHNIQUES
In this section, we introduce several statistical techniques commonly used to analyze the data collected from 
fault injection experiments and operational systems. The discussion is not intended to be comprehensive. For 
a comprehensive study of statistical methods, the reader is referred to the advanced texts in statistics, such as 
[Kendall77] and [Dillon84]. In particular, we discuss parameter estimation, distribution characterization, and 
some relevant multivariate analysis techniques. A statistical technique to accelerate Monte Carlo simulation 
is also discussed. Most of these techniques are widely used in every phase of the experimental evaluation of 
dependability.
0.2.1 Parameter Estimation
The most important characteristics of a random variable are its distribution, mean, and variance. In practice, 
means and variances are usually unknown parameters. Thus, how to estimate these parameters from data 
needs to be addressed.
Point Estimation
Point estimation is often used in experimental analysis. Examples include the estimation of the detection 
coverage from fault injections and the estimation of mean time between failures (MTBF) from field data. 
Each fault injection and each failure occurrence can be treated as an experiment. In this analysis, each 
experiment (e.g., injection of a fault) is assumed to be independent.
Given a collection of n experimental outcomes, $x_1, x_2, \ldots, x_n$, of a random variable $X$, each $x_i$ can be
considered as a value of a random variable $X_i$. These $X_i$'s are independent of each other and identical to
$X$ in distribution. The set $\{X_1, X_2, \ldots, X_n\}$ is called a random sample of $X$. Our purpose is to estimate
the value of some parameter $\theta$ ($\theta$ could be $E[X]$ or $\mathrm{Var}[X]$) of $X$ using a function of
$X_1, X_2, \ldots, X_n$. The function used to estimate $\theta$, $\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$, is called an
estimator of $\theta$, and $\hat{\theta}(x_1, x_2, \ldots, x_n)$ is said to be a point estimate of $\theta$.
An estimator $\hat{\theta}$ is called an unbiased estimator of $\theta$ if $E[\hat{\theta}] = \theta$. The unbiased estimator that has the
minimum variance, i.e., that minimizes $\mathrm{Var}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$ among all $\hat{\theta}$'s, is said to be the
unbiased minimum variance estimator. It can be shown that the sample mean
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{0.1}$$
is the unbiased minimum variance linear estimator of the population mean $\mu$, and the sample variance
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \tag{0.2}$$
is, under some mild conditions, an unbiased minimum variance quadratic estimator of the population variance
$\sigma^2$. If an estimator $\hat{\theta}$ converges in probability to $\theta$, i.e.,
$$\lim_{n \to \infty} P(|\hat{\theta}(X_1, X_2, \ldots, X_n) - \theta| > \epsilon) = 0 , \tag{0.3}$$
where $\epsilon$ is any small positive number, it is said to be consistent.
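As a minimal sketch of Equations 0.1 and 0.2, the sample mean and sample variance can be computed directly from a list of observations. The function names and the time-between-failures values below are hypothetical and serve only as an illustration.

```python
def sample_mean(xs):
    """Equation 0.1: arithmetic mean of the observations."""
    if not xs:
        raise ValueError("need at least one observation")
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Equation 0.2: sum of squared deviations divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sample_mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


# Hypothetical times between failures, in hours (illustrative data only).
tbf_hours = [12.0, 30.5, 7.25, 48.0, 19.25]
mtbf_estimate = sample_mean(tbf_hours)
variance_estimate = sample_variance(tbf_hours)
print(f"MTBF estimate: {mtbf_estimate:.2f} h, variance estimate: {variance_estimate:.2f} h^2")
```

Dividing by n - 1 rather than n is what makes the variance estimator unbiased.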
Method of Maximum Likelihood
If the functional form of the probability distribution function (p.d.f.) of the variable is known, the method 
of maximum likelihood is a good approach to parameter estimation. In many cases, approximate functional 
forms of empirical distributions can be obtained. In such cases, the maximum-likelihood method can be used 
to determine distribution parameters.
The maximum-likelihood method is to choose an estimator based on the assumption that the observed 
sample is the most likely to occur among all possible samples. The method usually produces estimators that
have minimum-variance and consistency properties. But if the sample size is small, the estimator may be 
biased.
Assuming $X$ has a p.d.f. $f(x|\theta)$, where $\theta$ is an unknown parameter, the joint p.d.f. of the sample
$\{X_1, X_2, \ldots, X_n\}$,
$$L(\theta) = \prod_{i=1}^{n} f(x_i|\theta) , \tag{0.4}$$
is called the likelihood function of $\theta$. If $\hat{\theta}(x_1, x_2, \ldots, x_n)$ is the point estimate of $\theta$ that maximizes
$L(\theta)$, then $\hat{\theta}(X_1, X_2, \ldots, X_n)$ is said to be the maximum-likelihood estimator of $\theta$.
The following example illustrates the method. Let $X$ denote the random variable “time between failures”
in a computer system. Assuming $X$ is exponentially distributed with an arrival rate $\lambda$, we wish to estimate
$\lambda$ from a random sample $\{X_1, X_2, \ldots, X_n\}$. By Equation 0.4,
$$L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_{i=1}^{n} x_i} .$$
How do we choose an estimator such that the estimated $\lambda$ maximizes $L(\lambda)$? An easier way is to find the
$\lambda$ value that maximizes $\ln L(\lambda)$, instead of $L(\lambda)$. This is because the $\lambda$ that maximizes $L(\lambda)$ also maximizes
$\ln L(\lambda)$, and $\ln L(\lambda)$ is easier to handle. In this case we have
$$\ln L(\lambda) = n \ln(\lambda) - \lambda \sum_{i=1}^{n} x_i .$$
To find the maximum, consider the first derivative
$$\frac{d[\ln L(\lambda)]}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i .$$
The solution of this equation at zero,
$$\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} ,$$
is the maximum-likelihood estimator for $\lambda$.
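The exponential example can be checked numerically. The following is a sketch under stated assumptions: the failure intervals are synthetic, drawn with a known rate so that the estimate can be compared against it, and the function names are illustrative rather than taken from any tool discussed in this chapter.

```python
import math
import random


def exponential_mle_rate(samples):
    """Maximum-likelihood estimate of the exponential rate: n / sum(x_i)."""
    if not samples or any(x <= 0 for x in samples):
        raise ValueError("samples must be non-empty and positive")
    return len(samples) / sum(samples)


def log_likelihood(rate, samples):
    """ln L(rate) = n ln(rate) - rate * sum(x_i) for the exponential p.d.f."""
    return len(samples) * math.log(rate) - rate * sum(samples)


# Synthetic times between failures with a known rate (an assumption made
# only so that the estimate can be compared with the true value).
rng = random.Random(1994)
true_rate = 0.02  # failures per hour, hypothetical
samples = [rng.expovariate(true_rate) for _ in range(5000)]

rate_hat = exponential_mle_rate(samples)
print(f"estimated rate = {rate_hat:.4f} per hour (true rate {true_rate})")

# The log-likelihood at the estimate should not be exceeded nearby.
for factor in (0.9, 1.1):
    assert log_likelihood(rate_hat, samples) >= log_likelihood(factor * rate_hat, samples)
```

The estimate is simply the reciprocal of the sample mean, so no numerical optimization is needed in this case.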
Method of Moments
Sometimes it is difficult to find maximum-likelihood estimators in closed form. One example is the p.d.f. of
the gamma distribution $G(\alpha, \theta)$:
$$f(x) = \frac{1}{\Gamma(\alpha)\,\theta^{\alpha}}\, x^{\alpha - 1} e^{-x/\theta} , \quad x > 0 .$$
The estimation of $\alpha$ and $\theta$ is complicated by the existence of the gamma function $\Gamma(\alpha)$. The gamma
distribution, however, is useful for characterizing interval times in the real world. As shown in Section 0.5.3,
the software TTE for the measured operating system fits a multistage gamma distribution. In such cases, the
method of moments can be used if an analytical relationship is found between the moments of the variable
and the parameters to be estimated.
To explain the method of moments, we introduce the simple concepts of sample moment and population
moment. The k-th (k = 1, 2, ...) sample moment of the random variable $X$ is defined as
$$m_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k , \tag{0.5}$$
where $X_1, X_2, \ldots, X_n$ are a sample of $X$. The k-th population moment of $X$ is just $E[X^k]$.
Suppose there are k parameters to be estimated. The method of moments is to set the first k sample 
moments equal to the first k population moments, which are expressed as the unknown parameters, and then 
to solve these k equations for the unknown parameters. The method usually gives simple and consistent 
estimators. However, some estimators may not have unbiased and minimum variance properties. The 
following example shows details of the method.
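Consider the above gamma distribution example. We wish to estimate $\alpha$ and $\theta$, based on a sample
$\{X_1, X_2, \ldots, X_n\}$ from a gamma distribution. Since $X \sim G(\alpha, \theta)$, we know
$$E[X] = \alpha\theta , \qquad E[X^2] = \alpha\theta^2 + \alpha^2\theta^2 .$$
The first two sample moments, by definition, are given by
$$m_1 = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X} , \qquad m_2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 = \frac{n-1}{n} S^2 + \bar{X}^2 .$$
Setting $E[X] = m_1$ and $E[X^2] = m_2$ and solving the two equations for the two parameters gives
$$\hat{\alpha} = \frac{m_1^2}{m_2 - m_1^2} = \frac{n \bar{X}^2}{(n-1) S^2} , \qquad \hat{\theta} = \frac{m_2 - m_1^2}{m_1} = \frac{(n-1) S^2}{n \bar{X}} .$$
These are the estimators for $\alpha$ and $\theta$ from the method of moments.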
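The gamma example can be sketched in a few lines. The code below assumes the shape/scale parameterization used above (mean $\alpha\theta$), which is also the one used by Python's `random.gammavariate`; the sample is synthetic and the parameter values are hypothetical.

```python
import random


def gamma_method_of_moments(samples):
    """Return (alpha_hat, theta_hat) by matching the first two moments."""
    n = len(samples)
    m1 = sum(samples) / n                 # first sample moment
    m2 = sum(x * x for x in samples) / n  # second sample moment
    spread = m2 - m1 * m1                 # equals alpha * theta^2 under the model
    if spread <= 0:
        raise ValueError("sample has no spread; moments cannot be matched")
    return m1 * m1 / spread, spread / m1


# Synthetic sample with known shape and scale (hypothetical values),
# used only to see how close the estimates come.
rng = random.Random(7)
true_alpha, true_theta = 2.0, 3.0
samples = [rng.gammavariate(true_alpha, true_theta) for _ in range(20000)]

alpha_hat, theta_hat = gamma_method_of_moments(samples)
print(f"alpha_hat = {alpha_hat:.3f}, theta_hat = {theta_hat:.3f}")
```

By construction the product of the two estimates equals the sample mean, which mirrors the first moment equation.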
Interval Estimation
So far, our discussion has been limited to the point estimation of unknown parameters. The estimate may 
deviate from the actual parameter value. To obtain an estimate with a high confidence, it is necessary 
to construct an interval estimate such that the interval includes the actual parameter value with a high 
probability. Given an estimator $\hat{\theta}$, if
$$P(\hat{\theta} - e_1 < \theta < \hat{\theta} + e_2) = \beta , \tag{0.6}$$
the random interval $(\hat{\theta} - e_1, \hat{\theta} + e_2)$ is said to be a $100 \times \beta\%$ confidence interval for $\theta$, and $\beta$ is called the
confidence coefficient (the probability that the confidence interval contains $\theta$).
A. Confidence Intervals for Means
In the following discussion, the sample mean $\bar{X}$ is used as the estimator for the population mean. As
mentioned before, it is the unbiased minimum variance linear estimator for $\mu$. We first consider the case in
which the sample size is large. By the central limit theorem, $\bar{X}$ is asymptotically normally distributed, no
matter what the population distribution is. Thus, when the sample size n is reasonably large (usually 30 or
above, sometimes 50 or more if the population distribution is badly skewed with occasional outliers),
$Z = (\bar{X} - \mu)/(S/\sqrt{n})$ can be approximately treated as a standard normal variable. To obtain a $100\beta\%$ confidence
interval for $\mu$, we can find a number $z_{\alpha/2}$ from the N(0,1) distribution table such that $P(Z > z_{\alpha/2}) = \alpha/2$,
where $\alpha = 1 - \beta$. Then we have
$$P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{S/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha .$$
Thus, the $100(1 - \alpha)\%$ confidence interval for $\mu$ is approximately
$$\bar{X} - z_{\alpha/2} \frac{S}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2} \frac{S}{\sqrt{n}} . \tag{0.7}$$
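A minimal sketch of Equation 0.7 follows. It uses the standard library's `statistics.NormalDist` in place of a printed N(0,1) table; the recovery-time data are synthetic and the function name is an assumption made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(samples, confidence=0.90):
    """Large-sample interval of Equation 0.7: x_bar +/- z * s / sqrt(n)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two observations")
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    half_width = z * stdev(samples) / math.sqrt(n)  # stdev divides by n - 1
    x_bar = mean(samples)
    return x_bar - half_width, x_bar + half_width


# Synthetic recovery times in seconds (hypothetical, skewed on purpose).
rng = random.Random(42)
recovery_times = [rng.expovariate(1.0 / 50.0) for _ in range(200)]

low, high = mean_confidence_interval(recovery_times, confidence=0.90)
print(f"90% confidence interval for the mean: ({low:.1f}, {high:.1f}) seconds")
```

With 200 observations the normal approximation is reasonable even though the underlying data are skewed; for small samples the intervals discussed next are more appropriate.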
If the sample size is small (considerably smaller than 30), the above approximation can be poor. In this
case, we consider two commonly used distributions: normal and exponential. If the population distribution
is normal, the random variable $T = (\bar{X} - \mu)/(S/\sqrt{n})$ has a Student t distribution with $n - 1$ degrees
of freedom. By repeating the same approach performed above with a t distribution table, the following
$100(1 - \alpha)\%$ confidence interval for $\mu$ can be obtained:
$$\bar{X} - t_{n-1;\alpha/2} \frac{S}{\sqrt{n}} < \mu < \bar{X} + t_{n-1;\alpha/2} \frac{S}{\sqrt{n}} , \tag{0.8}$$
where $t_{n-1;\alpha/2}$ is a number such that $P(T > t_{n-1;\alpha/2}) = \alpha/2$. Theoretically, Equation 0.8 requires that
$X$ have a normal distribution. However, we will show later that the estimator is not very sensitive to the
distribution of $X$ when the sample size is reasonably large (15 or more).
If the population distribution is exponential, it can be shown that $\chi^2 = 2n\bar{X}/\mu$ has a chi-square
distribution with $2n$ degrees of freedom. Thus, the chi-square distribution table should be used. Because the
chi-square distribution is not symmetrical, we need to find two numbers, $\chi^2_{2n;1-\alpha/2}$ and
$\chi^2_{2n;\alpha/2}$, such that $P(\chi^2 < \chi^2_{2n;1-\alpha/2}) = \alpha/2$ and $P(\chi^2 > \chi^2_{2n;\alpha/2}) = \alpha/2$. The obtained $100(1 - \alpha)\%$
confidence interval for $\mu$ is
$$\frac{2n\bar{X}}{\chi^2_{2n;\alpha/2}} < \mu < \frac{2n\bar{X}}{\chi^2_{2n;1-\alpha/2}} . \tag{0.9}$$

B. Confidence Intervals for Variances
The estimation of confidence intervals for variances is more complicated than that for means, because the
sample variance cannot be simply approximated by a unique distribution (such as the normal distribution)
regardless of the population distribution. However, irrespective of the population distribution,
$\lim_{n \to \infty} \mathrm{Var}[S^2] = 0$. Thus, a good confidence interval can be expected as long as the sample size n is large. Next, our
discussion will focus on the two commonly used distributions: normal and exponential.
If $X$ is normally distributed, the sample variance $S^2$ can be used to construct the confidence interval. It is
known that the random variable $(n - 1)S^2/\sigma^2$ has a chi-square distribution with $n - 1$ degrees of freedom. To
determine a $100(1 - \alpha)\%$ confidence interval for $\sigma^2$, we follow the procedure for constructing Equation 0.9 to
find the numbers $\chi^2_{n-1;1-\alpha/2}$ and $\chi^2_{n-1;\alpha/2}$ (the lower and upper critical values, respectively) from the
chi-square distribution table. The confidence interval is then given by
$$\frac{(n-1)S^2}{\chi^2_{n-1;\alpha/2}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{n-1;1-\alpha/2}} . \tag{0.10}$$
Our experience shows that this equation, like Equation 0.8, is not restricted to the normal distribution when 
the sample size is reasonably large (15 or more).
If $X$ is exponentially distributed, Equation 0.9 can be used to estimate the confidence interval for $\sigma^2$,
because for the exponential random variable $\sigma^2$ equals $\mu^2$. Since all terms in Equation 0.9 are positive, we
can square them. The result gives a $100(1 - \alpha)\%$ confidence interval for $\sigma^2$:
$$\left(\frac{2n\bar{X}}{\chi^2_{2n;\alpha/2}}\right)^2 < \sigma^2 < \left(\frac{2n\bar{X}}{\chi^2_{2n;1-\alpha/2}}\right)^2 . \tag{0.11}$$
C. Confidence Intervals for Proportions
Often, we need to estimate the confidence interval for a proportion or percentage whose underlying
distribution is unknown. For example, we may want to estimate the confidence interval for the detection coverage
after fault injection experiments. In general, given n Bernoulli trials with the probability of success on each
trial being $p$ and the number of successes being $Y$, how do we find a confidence interval for $p$? If n is large
(particularly when $np > 5$ and $n(1 - p) > 5$ [Hogg83]), $Y/n$ has an approximately normal distribution,
$N(\mu, \sigma^2)$, with $\mu = p$ and $\sigma^2 = p(1 - p)/n$. Note that $Y/n$ is the sample mean, which is an estimate of $\mu$ or
$p$. By Equation 0.7, the $100(1 - \alpha)\%$ confidence interval for $p$ is
$$\frac{Y}{n} \pm z_{\alpha/2} \sqrt{p(1 - p)/n} . \tag{0.12}$$
This equation can be used to determine the number of injections required to achieve a given confidence
interval for an estimated fault detection coverage. Let n represent the number of fault injections and $Y$
the number of faults detected in the n injections. Assume that all faults have the same detection coverage,
which is approximately $p$. Now we wish to estimate $p$ with the half-width of the $100(1 - \alpha)\%$ confidence
interval being $e$. By Equation 0.12, we have
$$e = z_{\alpha/2} \sqrt{p(1 - p)/n} . \tag{0.13}$$
Solving the equation for n gives us:
$$n = \frac{z_{\alpha/2}^2 \, p(1 - p)}{e^2} , \tag{0.14}$$
where n is the number of injections required to achieve the desired confidence interval in estimating $p$.
For example, assume detection coverage $p = 0.6$, confidence interval half-width $e = 0.05$, and confidence
coefficient $1 - \alpha = 90\%$. Then the required number of injections is
$$n = \frac{1.645^2 \times 0.6 \times 0.4}{0.05^2} = 260 .$$
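The worked example can be reproduced with a short script. This is a sketch, not a tool from the literature: the function names are illustrative, `statistics.NormalDist` stands in for the normal table, and in `coverage_interval` the unknown $p$ in Equation 0.12 is replaced by the observed proportion $Y/n$, which is an assumption of this sketch.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """z_{alpha/2} for a two-sided interval with the given confidence coefficient."""
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)


def required_injections(coverage, half_width, confidence):
    """Equation 0.14: n = z^2 * p * (1 - p) / e^2, rounded up to an integer."""
    z = z_value(confidence)
    return math.ceil(z * z * coverage * (1.0 - coverage) / half_width ** 2)


def coverage_interval(detected, injected, confidence):
    """Equation 0.12 with p replaced by the observed proportion Y / n."""
    p_hat = detected / injected
    e = z_value(confidence) * math.sqrt(p_hat * (1.0 - p_hat) / injected)
    return p_hat - e, p_hat + e


# Values from the example in the text.
n = required_injections(coverage=0.6, half_width=0.05, confidence=0.90)
print("required injections:", n)

# Hypothetical outcome: 156 of the 260 injected faults are detected.
low, high = coverage_interval(detected=156, injected=n, confidence=0.90)
print(f"90% interval for coverage: ({low:.3f}, {high:.3f})")
```

Because $p(1 - p)$ is largest at $p = 0.5$, using 0.5 as the assumed coverage gives a conservative upper bound on the number of injections when no prior estimate is available.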
0.2.2 Distribution Characterization
Mean and variance are important parameters that summarize data by single numbers. Probability
distribution provides more information about data. Analysis of distributions can help one understand the data in
detail and arrive at simple conclusions regarding the underlying models. For example, if the time to failure
and the recovery time for a system are all exponential, then the model is clearly a Markov model; otherwise,
it could be one of several types of semi-Markov models. We now discuss empirical distribution functions and
function fitting in this subsection.
Empirical Distribution
Given a sample of $X$, the simplest way to obtain an empirical distribution of $X$ is to plot a histogram of
the observations. The range of the sample space is divided into a number of subranges called buckets. The
lengths of the buckets are usually the same, although this is not essential. Assume that we have k buckets,
separated by $x_0, x_1, \ldots, x_k$, for the given sample of size n. In each bucket, there are $y_i$ instances. Clearly,
the sample size n is $\sum_{i=1}^{k} y_i$. Then, $y_i/n$ is an estimation of the probability that $X$ takes a value in bucket
i. The histogram is an empirical probability distribution function (p.d.f.) of $X$, or the empirical distribution.
It is easy to construct the empirical cumulative distribution function (c.d.f.) from the histogram, as follows:
$$F_k(x) = \begin{cases} 0 , & x < x_0 \\ \sum_{j=1}^{i} y_j / n , & x_{i-1} \le x < x_i \\ 1 , & x_k \le x \end{cases} \tag{0.15}$$
The key problem in plotting histograms is determining the bucket size. A small size may lead to such 
a large variation among buckets that the characterization of the distribution cannot be identified. A large 
size may lose details of the distribution. Given a data set, it is possible to obtain very different distribution 
shapes by using different bucket sizes. One guideline is that if any bucket has less than five instances, the 
bucket size should be increased or a variable bucket size should be used. In our experience, 10 or more 
buckets are sufficient in most cases, depending on the sample size.
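The construction of the histogram and of the empirical c.d.f. can be sketched as follows. The bucket boundaries and observations are hypothetical, and the helper names are assumptions; the last helper applies the five-instance guideline mentioned above.

```python
from bisect import bisect_right


def histogram(samples, edges):
    """Count the instances y_i falling in each bucket [x_{i-1}, x_i).

    Observations outside [x_0, x_k) are ignored in this sketch.
    """
    counts = [0] * (len(edges) - 1)
    for x in samples:
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    return counts


def empirical_cdf(counts):
    """Cumulative fractions (y_1 + ... + y_i) / n, one value per bucket."""
    n = sum(counts)
    running, cdf = 0, []
    for y in counts:
        running += y
        cdf.append(running / n)
    return cdf


def sparse_buckets(counts, minimum=5):
    """Indices of buckets with fewer than `minimum` instances."""
    return [i for i, y in enumerate(counts) if y < minimum]


# Hypothetical observations and equal-length buckets.
observations = [1, 3, 4, 8, 11, 12, 15, 19, 22, 27]
edges = [0, 10, 20, 30]

counts = histogram(observations, edges)
cdf = empirical_cdf(counts)
print("counts:", counts)
print("empirical c.d.f.:", cdf)
print("buckets that are too sparse:", sparse_buckets(counts))
```

With only ten observations every bucket here is flagged as sparse, which illustrates why the bucket size has to be chosen with the sample size in mind.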
Distribution Function Fitting
Analytical distribution functions are useful in analytical modeling and simulations. Thus, it is often desirable
to fit an analytical function to a given empirical distribution. Function fitting relies on knowledge of
statistical distribution functions. Given an empirical distribution, step 1 is to make a good guess of the closest
distribution function(s) by observing the shape of the empirical distribution. Step 2 is to use a statistical
package such as SAS to obtain the parameters for a guessed function by trying to fit it to the empirical
distribution. Step 3 is to perform a significance test of the goodness-of-fit to see if the fitted function is
acceptable. If the function is not acceptable, we go to step 1 to try a different function.
Now we discuss step 3, the significance test. Assume that the given empirical c.d.f. is $F_k(x)$, defined in
Equation 0.15, and the hypothesized c.d.f. is $F(x)$, obtained from step 2. Our task is to test the hypothesis
$$H_0 : F_k(x) = F(x).$$
Two commonly used goodness-of-fit test methods are the chi-square test and the Kolmogorov-Smirnov 
test. We now briefly introduce these two methods.
A. Chi-Square Test
The chi-square test assumes the distribution under consideration can be approximated by a multinomial
distribution. Let
$$p_i = F(x_i) - F(x_{i-1}), \quad i = 1, \ldots, k,$$
where $p_i$ is the probability that an instance falls into bucket i. If we define
$$P[x_{i-1} \le X_i < x_i] = p_i, \quad i = 1, \ldots, k,$$
then $X_1, X_2, \ldots, X_k$ have a multinomial distribution that is equivalent to the original distribution $F(x)$.
Thus, for a sample of size n, the expected number of instances falling into bucket i is $np_i$. The statistic
$$q_{k-1} = \sum_{i=1}^{k} \frac{(y_i - np_i)^2}{np_i} \tag{0.16}$$
is a measure of the closeness of the observed number of instances, $y_i$, to the expected number of instances,
$np_i$, in bucket i. If $q_{k-1}$ is small, we tend to accept $H_0$. The smallness can be measured in terms of statistical
significance if we treat $q_{k-1}$ as a particular value of the random variable $Q_{k-1}$. It can be shown that if n is
large ($np_i > 1$), $Q_{k-1}$ has an approximate chi-square distribution with $k - 1$ degrees of freedom, $\chi^2(k-1)$.
If $H_0$ is true, we expect $q_{k-1}$ to fall into an acceptable range of $Q_{k-1}$, so that the event is likely to occur.
The boundary value, or critical value, of the acceptable range, $\chi^2_\alpha(k-1)$, is chosen such that
$$P[Q_{k-1} > \chi^2_\alpha(k-1)] = \alpha,$$
where $\alpha$ is called the significance level of the test. Thus, we should reject $H_0$ if $q_{k-1} > \chi^2_\alpha(k-1)$. Usually,
$\alpha$ is chosen to be 0.05 or 0.1.
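As a worked illustration of the chi-square statistic in Equation 0.16, the sketch below tests hypothetical bucket counts against a hypothesized uniform distribution on [0, 5]. The counts, the uniform c.d.f., and the function names are assumptions made for this example. The critical value 9.488 is the standard tabulated value for a significance level of 0.05 with 4 degrees of freedom, and it is hard-coded here rather than computed.

```python
def bucket_probabilities(cdf, edges):
    """p_i = F(x_i) - F(x_{i-1}) for i = 1..k."""
    return [cdf(b) - cdf(a) for a, b in zip(edges, edges[1:])]

def chi_square_statistic(counts, probs):
    """q_{k-1} of Equation 0.16: sum of (y_i - n p_i)^2 / (n p_i)."""
    n = sum(counts)
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(counts, probs))

def uniform_cdf(x):
    """Hypothesized c.d.f. F(x): uniform on [0, 5] (an assumption for the example)."""
    return min(max(x / 5.0, 0.0), 1.0)

edges = [0, 1, 2, 3, 4, 5]            # k = 5 buckets
counts = [18, 22, 20, 25, 15]         # hypothetical observed y_i, n = 100
probs = bucket_probabilities(uniform_cdf, edges)
q = chi_square_statistic(counts, probs)

CRITICAL_0_05_DF4 = 9.488             # tabulated chi-square value, alpha = 0.05, k - 1 = 4
reject = q > CRITICAL_0_05_DF4
```

Here every expected count is 20, so q = (4 + 4 + 0 + 25 + 25) / 20 = 2.9, which is well below the critical value, and the uniform hypothesis is not rejected.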
B. Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is a nonparametric method in that it assumes no particular distribution for
the variable under consideration. The method uses the empirical c.d.f., instead of the empirical p.d.f., to perform
the test, which is more stringent than the chi-square test. The Kolmogorov-Smirnov statistic is defined by
$$D_k = \sup_x |F_k(x) - F(x)|, \tag{0.17}$$
where $\sup_x$ represents the least upper bound of all pointwise differences $|F_k(x) - F(x)|$. In calculation, we can
choose the midpoint between $x_{i-1}$ and $x_i$, for $i = 1, \ldots, k$, to obtain the maximum value of $|F_k(x_i) - F(x_i)|$.
It is seen that $D_k$ is a measure of the closeness of the empirical and hypothesized distribution functions. It
can be derived that $D_k$ follows a distribution whose c.d.f. values are given by the table of Kolmogorov-Smirnov
Acceptance Limits [Hogg83]. Thus, given a significance level $\alpha$, we can find the critical value $d_k$
from the table such that
$$P[D_k > d_k] = \alpha.$$
The hypothesis $H_0$ is rejected if the calculated value of $D_k$ is greater than the critical value $d_k$. Otherwise,
we accept $H_0$.
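The Kolmogorov-Smirnov statistic can be sketched in a few lines. Note the assumption: this version works on the raw sample, where the empirical c.d.f. is a step function that jumps by 1/n at each observation, rather than on the bucketed c.d.f. of Equation 0.15. The sample and the uniform hypothesized c.d.f. are hypothetical, and the critical value still has to be looked up in a table of acceptance limits.

```python
def ks_statistic(sample, cdf):
    """Largest absolute gap between the empirical c.d.f. of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical c.d.f. jumps from i/n to (i + 1)/n at x, so both
        # sides of the jump are compared with the hypothesized value.
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d

def uniform01_cdf(x):
    """Hypothesized c.d.f.: uniform on [0, 1] (an assumption for the example)."""
    return min(max(x, 0.0), 1.0)

d = ks_statistic([0.1, 0.4, 0.7], uniform01_cdf)   # hypothetical sample
```

For this sample the largest gap occurs at the last observation, where the empirical c.d.f. reaches 1 while the hypothesized value is 0.7, so d is 0.3.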
0.2.3 Multivariate Analysis
In reality, measurements are usually made on more than one variable. For example, a computer workload 
measurement may include usages on the CPU, memory, disk, and network. A computer failure measurement 
may collect data on multiple components. Multivariate analysis is the application of methods that deal with 
multiple variables. These methods, which include cluster analysis, correlation analysis, and factor analysis 
(to be discussed), identify and quantify simultaneous relationships among multiple variables and are valuable 
tools for computer systems analysis.
Cluster Analysis
Cluster analysis is helpful in identifying patterns in data. Specifically, it can help in reducing a large number
of points plotted in n-dimensional space into a few identifiable states called clusters. A common use is in
characterizing workload states in computer systems by identifying the points in a resource usage plot that
are “similar” by some measure and grouping them into a cluster. Assume we have a sample of p workload
variables. We call each instance in the sample a point characterized by p values. In this case, the measure
of similarity is the Euclidean distance. Let $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ denote the ith point of the sample. The
Euclidean distance between points i and j,
$$d_{ij} = |x_i - x_j| = \left( \sum_{l=1}^{p} (x_{il} - x_{jl})^2 \right)^{1/2},$$
is usually used as a similarity measure between points i and j.
There are several different clustering algorithms. The goal of these algorithms is to achieve small within-cluster
variation relative to the between-cluster variation. A commonly used clustering algorithm is the
k-means algorithm. The algorithm partitions a sample with p dimensions and n points into k clusters, $C_1$,
$C_2, \ldots, C_k$. The mean, or centroid, of cluster $C_j$ is denoted by $\bar{x}_j$. The error component of the partition is defined
as
$$E = \sum_{j=1}^{k} \sum_{x_i \in C_j} |x_i - \bar{x}_j|^2. \tag{0.18}$$
The goal of the k-means algorithm is to find a partition that minimizes E.
The clustering procedure is as follows. Start with k groups, each of which consists of a single point. Each 
new point is added to the group with the closest centroid. After a point is added to a group, the mean of 
that group is adjusted to take the new point into account. After a partition is formed, search for another 
partition with smaller E by moving points from one cluster to another cluster until no transfer of a point 
results in a reduction in E.
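The clustering procedure above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the first k points seed the clusters, each later point joins the cluster with the closest centroid, and single points are then moved between clusters for as long as a move lowers the partition error E of Equation 0.18. The function names and the six sample points are assumptions, and the error is recomputed from scratch on every trial move for clarity rather than speed.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def partition_error(clusters):
    """E of Equation 0.18: squared distances of points to their cluster centroid."""
    return sum(sq_dist(x, centroid(c)) for c in clusters for x in c)

def k_means(points, k):
    # Start with k groups of one point each, then add the remaining points
    # to the group whose (updated) centroid is closest.
    clusters = [[p] for p in points[:k]]
    for p in points[k:]:
        nearest = min(range(k), key=lambda j: sq_dist(p, centroid(clusters[j])))
        clusters[nearest].append(p)

    # Move single points between clusters while a move reduces E.
    improved = True
    while improved:
        improved = False
        for src in range(k):
            for p in list(clusters[src]):
                for dst in range(k):
                    if dst == src or len(clusters[src]) == 1:
                        continue
                    before = partition_error(clusters)
                    clusters[src].remove(p)
                    clusters[dst].append(p)
                    if partition_error(clusters) < before - 1e-12:
                        improved = True
                        break
                    clusters[dst].pop()          # undo the trial move
                    clusters[src].append(p)
    return clusters

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]  # hypothetical workload points
clusters = k_means(points, 2)
```

Every accepted move strictly lowers E, so the refinement loop terminates, although the partition it reaches is only a local minimum and depends on the order of the points.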
A problem associated with clustering algorithms is the presence of outliers in the sample. Outliers can 
be an order of magnitude greater than any of the other points (which are usually more than 95%) of the 
sample and can be scattered over the sample space. As a result, the generated clusters may not characterize 
the features of the sample well. For example, most generated clusters contain only one or two outliers, with 
all other points groupable into only a few clusters. One way to deal with this problem is to specify in the 
algorithm the minimum number of points that form a cluster, typically 0.5% of the sample size. Another 
way is to define an upper bound for the radius (maximum distance between the centroid and any point 
in a cluster) of any generated cluster. A recommended range for the upper bound is 1.0 to 1.5 standard 
deviations of the sample [Artis86].
Correlation Analysis
The correlation coefficient, $Cor(X_1, X_2)$, between the random variables $X_1$ and $X_2$ is defined as
$$Cor(X_1, X_2) = \frac{E[(X_1 - \mu_1)(X_2 - \mu_2)]}{\sigma_1 \sigma_2}, \tag{0.19}$$
where $\mu_1$ and $\mu_2$ are the means of $X_1$ and $X_2$, and $\sigma_1$ and $\sigma_2$ are the standard deviations of $X_1$ and
$X_2$, respectively. If we use $\rho$ to denote the correlation coefficient, then $\rho$ satisfies $-1 \le \rho \le 1$. The
correlation coefficient is a measure of the linear relationship between two variables. When $|\rho| = 1$, we
have $X_1 = aX_2 + b$, where $a > 0$ if $\rho = 1$, or $a < 0$ if $\rho = -1$. In this extreme case, there is an exact linear
relationship between $X_1$ and $X_2$. When $|\rho| \ne 1$, there is no exact linear relationship between $X_1$ and $X_2$.
In this case, $\rho$ measures the goodness of the linear relationship $X_1 = aX_2 + b$ between $X_1$ and $X_2$. Usually,
a $\rho$ value of 0.5 or above is considered reasonably high. Correlation analysis can be used to quantify error
or workload dependency between two components in a system.
Given random variables $X_1$, $X_2$, and $X_3$, and the correlation coefficients between each pair, $\rho_{12}$, $\rho_{23}$, and
$\rho_{13}$, we know these variables are related to each other by $\rho_{12}$, $\rho_{23}$, and $\rho_{13}$. Since $X_1$ is related to $X_2$, and
$X_2$ is related to $X_3$, a partial dependence between $X_1$ and $X_3$ may be due to $X_2$. The partial correlation
coefficient defined below quantifies this partial dependence:
$$\rho_{13 \cdot 2} = \frac{\rho_{13} - \rho_{12}\rho_{23}}{\sqrt{(1 - \rho_{12}^2)(1 - \rho_{23}^2)}}. \tag{0.20}$$
The partial correlation coefficient can be considered as a measure of the common relationship among the
three variables.
If a random variable, X, is defined on a time series, the correlation coefficient can be used to quantify the
time serial dependence in the sample data of X. Given a time window $\Delta t > 0$, the autocorrelation coefficient
of X on the time series t is defined as
$$Autocor(X, \Delta t) = Cor(X(t), X(t + \Delta t)), \tag{0.21}$$
where t is defined on the discrete values $(\Delta t, 2\Delta t, 3\Delta t, \ldots)$. In this case, we treat $X(t)$ and $X(t + \Delta t)$
as two different random variables, and the autocorrelation coefficient is actually the correlation coefficient
between the two variables. That is, $Autocor(X, \Delta t)$ measures the time serial correlation of X with a window
$\Delta t$.
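The three coefficients of this subsection can be estimated from sample data as sketched below. The definitions in the text are stated for random variables; the code uses the usual sample estimates (sample means and sums of squared deviations), which is an assumption of this example, as are the function names and the small data series. The lag is expressed as a number of sampling intervals.

```python
import math

def correlation(xs, ys):
    """Sample estimate of the correlation coefficient of Equation 0.19."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def partial_correlation(r13, r12, r23):
    """Partial correlation of X1 and X3 given X2, from the pairwise coefficients."""
    return (r13 - r12 * r23) / math.sqrt((1 - r12 ** 2) * (1 - r23 ** 2))

def autocorrelation(series, lag):
    """Correlation between the series and itself shifted by `lag` samples (lag >= 1)."""
    return correlation(series[:-lag], series[lag:])

r_exact = correlation([1, 2, 3, 4], [2, 4, 6, 8])      # exact linear relation, positive slope
r_partial = partial_correlation(0.5, 0.5, 0.5)         # hypothetical pairwise coefficients
r_auto = autocorrelation([1, 2, 1, 2, 1, 2], 1)        # strictly alternating series
```

The alternating series gives a lag-1 autocorrelation of -1, and the same series at lag 2 gives +1, which shows how the window length changes the dependence that is measured.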
Factor Analysis
The limitation of correlation analysis is that the correlation coefficient can only quantify dependency between 
two variables. However, dependency may exist within a group of more than two variables or even among 
all variables. The correlation coefficient cannot provide information about this multiple dependency. Factor 
analysis is a statistical technique to quantify multiway dependency among variables. The method attempts 
to find a set of unobserved common factors that link together the observed variables. Consequently, it 
provides insight into the underlying structure of the data. For example, in a distributed system, a disk crash 
can account for failures on those machines whose operations depend on a set of critical data on the disk. 
The disk state can be considered to be a common factor for failures on these machines.
Let $X = (x_1, \ldots, x_p)$ be a normalized random vector. We say that the k-factor model holds for X if X
can be written in the form
$$X = \Lambda F + E, \tag{0.22}$$
where $\Lambda = (\lambda_{ij})$ ($i = 1, \ldots, p$; $j = 1, \ldots, k$) is a matrix of constants called factor loadings, and $F = (f_1, \ldots, f_k)$
and $E = (e_1, \ldots, e_p)$ are random vectors. The elements of F are called common factors, and the elements
of E are called unique factors (error terms). These factors are unobservable variables. It is assumed that
all factors (both common factors and unique factors) are independent of each other and that the common
factors are normalized.
Each variable $x_i$ ($i = 1, \ldots, p$) can then be expressed as
$$x_i = \sum_{j=1}^{k} \lambda_{ij} f_j + e_i,$$
and its variance can be written as
$$\sigma_i^2 = \sum_{j=1}^{k} \lambda_{ij}^2 + \psi_i,$$
where $\psi_i$ is the variance of $e_i$. Thus, the variance of $x_i$ can be split into two parts. The first part,
$$h_i^2 = \sum_{j=1}^{k} \lambda_{ij}^2,$$
is called the communality. It represents the variance of $x_i$ that is shared with the other variables via the
common factors. In particular, $\lambda_{ij} = Cor(x_i, f_j)$ represents the extent to which $x_i$ depends on the jth
common factor. The second part, $\psi_i$, is called the unique variance. It is due to the unique factor $e_i$ and
explains the variability in $x_i$ not shared with the other variables.
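The variance split described above is simple arithmetic once the loadings are known. The sketch below is only a numeric illustration: the loadings 0.6 and 0.3 are hypothetical, the function names are assumptions, and the variable is taken to be normalized so that its total variance is 1. Estimating the loadings themselves requires a statistical package and is not shown.

```python
def communality(loadings):
    """Variance shared through the common factors: sum of squared loadings."""
    return sum(l * l for l in loadings)

def unique_variance(loadings, total_variance=1.0):
    """Variance left to the unique factor after removing the communality."""
    return total_variance - communality(loadings)

loadings_x1 = [0.6, 0.3]                 # hypothetical loadings of one variable on k = 2 factors
shared = communality(loadings_x1)        # 0.36 + 0.09
unique = unique_variance(loadings_x1)    # remainder of the unit variance
```

In this example 45% of the variance of the variable is explained by the two common factors, most of it by the first one, and 55% is unique to the variable.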
0.2.4 Importance Sampling
Importance sampling has long been used to reduce sample size while keeping estimates obtained from 
the sample at a high level of confidence [Kahn53]. Specifically, it allows the analysis of rare events (such 
as failures or errors) with a high degree of accuracy when Monte Carlo techniques are used for numerical 
solutions. The method has been recently used to reduce the number of runs in Monte Carlo simulations 
for evaluating computer dependability [Goyal92] [Choi92]. In the following, we first give an overview of 
the method and then discuss its applications in the Monte Carlo simulation of discrete-time Markov chains 
(DTMCs).
Overview of the Method
Assume that a random variable X has p.d.f. $f(x)$ and that $Y = h(X)$ is a function of X. Our goal is to
estimate the expected value of Y,
$$\theta = E[Y] = \int_{-\infty}^{+\infty} h(x) f(x)\,dx, \tag{0.23}$$
through sampling. That is, we generate a sample $\{x_1, x_2, \ldots, x_n\}$ according to $f(x)$, therefore generating
$\{y_1, y_2, \ldots, y_n\}$. We then calculate
$$\hat{\theta} = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{n} \sum_{i=1}^{n} h(x_i).$$
It may be very expensive to generate a statistically significant sample of X. For example, if $y_i = h(x_i) = 0$
for most generated $x_i$, we may need an extremely large sample to estimate $\theta$ with a high level of
confidence. However, if we can, in sampling, select the rare $x_i$'s that are considered important for estimating
$\theta$ much more frequently while keeping the estimate unbiased, the sample size will be greatly reduced. This
is the essence of the importance sampling method.
To do importance sampling, we change the p.d.f. of X from $f(x)$ to $g(x)$ such that the x's of importance
in our parameter estimation have higher occurrence probabilities in $g(x)$. We use $X'$ to represent the variable
that has p.d.f. $g(x')$. By Equation 0.23, we have
$$\theta = \int_{-\infty}^{+\infty} h(x) \frac{f(x)}{g(x)} g(x)\,dx = E[h(X')\Lambda(X')], \tag{0.24}$$
where
$$\Lambda(x) = \frac{f(x)}{g(x)}$$
is called the likelihood ratio. Let $Y' = h(X')\Lambda(X')$. Then Equation 0.24 becomes
$$\theta = E[Y'].$$
Thus, instead of sampling from $f(x)$ to estimate the expected value of Y, the experiment is changed to
sampling from $g(x')$ to estimate the expected value of $Y'$. That is, we generate a sample $\{x'_1, x'_2, \ldots, x'_n\}$
according to $g(x')$, therefore generating $\{y'_1, y'_2, \ldots, y'_n\}$, and then calculate
$$\hat{\theta} = \bar{y}' = \frac{1}{n} \sum_{i=1}^{n} y'_i.$$
The variance of the above estimator is
$$Var(Y') = E[(Y' - \theta)^2] = \int_{-\infty}^{+\infty} \left( \frac{h(x) f(x)}{g(x)} \right)^2 g(x)\,dx - \theta^2.$$




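The variance vanishes when $g(x) = h(x)f(x)/\theta$ (for nonnegative $h$), because $Y'$ then equals $\theta$ for every sample.
But $\theta$ is the unknown parameter to estimate. A heuristic is that the shape of $g(x)$ should follow the shape
of $h(x)f(x)$ as closely as possible.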
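The method can be illustrated with a small rare-event example. Everything specific in the sketch below is an assumption chosen for illustration: X is exponential with rate 1, the quantity of interest is the probability that X exceeds 4 (so h is an indicator function and the exact answer is exp(-4)), and the sampling density g is an exponential with the smaller rate 0.25, which places far more mass in the region where h is nonzero.

```python
import math
import random

def importance_sampling_estimate(h, f, g, draw_from_g, n, rng):
    """Average of y'_i = h(x'_i) * f(x'_i) / g(x'_i) over n draws from g."""
    total = 0.0
    for _ in range(n):
        x = draw_from_g(rng)
        total += h(x) * f(x) / g(x)      # h times the likelihood ratio f/g
    return total / n

# Hypothetical rare event: X ~ Exp(1), theta = P[X > 4] = exp(-4).
h = lambda x: 1.0 if x > 4.0 else 0.0
f = lambda x: math.exp(-x)                       # original p.d.f.
g = lambda x: 0.25 * math.exp(-0.25 * x)         # heavier-tailed sampling p.d.f.
draw_from_g = lambda rng: rng.expovariate(0.25)

theta_hat = importance_sampling_estimate(h, f, g, draw_from_g, 20000, random.Random(1))
theta_exact = math.exp(-4)
```

Under f only about 1.8% of the draws would fall beyond 4, whereas under g roughly 37% do, and each such draw is weighted down by the likelihood ratio so that the estimate remains unbiased.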
Applications in DTMC Simulation
In many cases, the operation of a computer system can be modeled by a discrete time Markov chain (DTMC) 
[Trivedi82]. If the built DTMC is very large (such that it exceeds the available storage) or the functional 
simulation (simulation of the execution of machine instructions, algorithms, etc.) is used above a DTMC, 
the Monte Carlo simulation method is perhaps the only feasible way to solve the model. In dependability 
models, system failures are usually rare events with extremely small probabilities. To obtain statistically 
significant results, large simulation runs are required, which can be very time consuming. In such a case, 
importance sampling can reduce simulation runs, usually by orders of magnitude.
Assume we have a DTMC $\{Y_t, t \ge 0\}$ with a set of states $\{S_1, S_2, \ldots, S_m\}$ and a transition matrix
$[p_{ij}]$. For each simulation run, we have a path $x_i = S_{i_0}, S_{i_1}, \ldots, S_{i_k}$. The occurrence probability of path $x_i$
is [Goyal92]
$$P(x_i) = p_{i_0} p_{i_0 i_1} \cdots p_{i_{k-1} i_k},$$
where each $p_{ij}$ is an element in $[p_{ij}]$. All possible paths constitute the probability space of a random variable:
$$X = \{x_1, x_2, x_3, \ldots\}.$$
To reduce simulation runs, we change the transition probability matrix from $[p_{ij}]$ to $[p'_{ij}]$ such that the
paths that are important in our dependability evaluation are more likely to be sampled. After the change,
the occurrence probability of path $x_i$ is
$$P'(x_i) = p'_{i_0} p'_{i_0 i_1} \cdots p'_{i_{k-1} i_k}.$$
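Assume the dependability measure to evaluate is $\theta = E[h(X)]$. Then $\theta$ can be estimated using a sample,
$\{x_1, x_2, \ldots, x_n\}$, generated with $[p'_{ij}]$:
$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} h(x_i) \Lambda(x_i), \tag{0.27}$$
where the likelihood ratio is
$$\Lambda(x_i) = \frac{P(x_i)}{P'(x_i)} = \frac{p_{i_0} p_{i_0 i_1} \cdots p_{i_{k-1} i_k}}{p'_{i_0} p'_{i_0 i_1} \cdots p'_{i_{k-1} i_k}}. \tag{0.28}$$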
The remaining question is how to determine $[p'_{ij}]$. Several heuristics called failure biasing have been
proposed in the literature [Lewis84] [Goyal92]. Here we introduce one of the commonly used heuristics.
Assume that in state $S_i$, transitions out of the state go either to a set of failure states, F (e.g., the system
suffers one more component failure), or to a set of recovery states, R (e.g., the system recovers from a
component failure). ($S_i$ itself can be treated as either in F or in R.) It is obvious that we have
$$\sum_{j \in F} p_{ij} + \sum_{j \in R} p_{ij} = 1.$$
Define a parameter b such that the $p'_{ij}$'s satisfy
$$\sum_{j \in F} p'_{ij} = b.$$
Then we determine each $p'_{ij}$ in state $S_i$ by
$$p'_{ij} = \begin{cases} b \, \dfrac{p_{ij}}{\sum_{l \in F} p_{il}}, & j \in F \\ (1 - b) \, \dfrac{p_{ij}}{\sum_{l \in R} p_{il}}, & j \in R \end{cases} \tag{0.29}$$
The parameter b is usually chosen to be 0.5 [Goyal92]. Since the sum of the original probabilities to
failure states is often very small, the selection of b in Equation 0.29 can significantly increase these
probabilities, making these transitions much more likely to occur in simulations.
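Failure biasing can be sketched on a toy chain. The example below is an assumption-laden illustration rather than the method of any cited tool: the four-state chain, the failure probability eps = 0.01, the seed, and all function names are hypothetical. Each row of the transition matrix is rescaled so that failure transitions receive total probability b and the remaining transitions receive 1 - b, paths are simulated under the biased matrix, and each path that reaches the failure state contributes its likelihood ratio to the estimate. The initial state is fixed, so the initial-probability terms cancel in the ratio.

```python
import random

def bias_row(row, failure_targets, b=0.5):
    """Rescale one row {next_state: p_ij} so that failure transitions sum to b."""
    to_f = sum(p for j, p in row.items() if j in failure_targets)
    to_r = sum(p for j, p in row.items() if j not in failure_targets)
    if to_f == 0 or to_r == 0:
        return dict(row)                 # only one kind of transition: leave unchanged
    return {j: (b * p / to_f if j in failure_targets else (1 - b) * p / to_r)
            for j, p in row.items()}

def run_path(P, P_biased, start, stop_states, rng):
    """Simulate one path under P_biased; return (final state, likelihood ratio)."""
    state, ratio = start, 1.0
    while state not in stop_states:
        targets = list(P_biased[state])
        weights = [P_biased[state][j] for j in targets]
        nxt = rng.choices(targets, weights=weights)[0]
        ratio *= P[state][nxt] / P_biased[state][nxt]
        state = nxt
    return state, ratio

# Hypothetical chain: state = number of failed components, 3 = system failure,
# 0 = fully repaired. The measure is P[reach 3 before 0 | start in 1].
eps = 0.01
P = {1: {2: eps, 0: 1 - eps}, 2: {3: eps, 1: 1 - eps}}
failure_targets = {1: {2}, 2: {3}}
P_biased = {s: bias_row(row, failure_targets[s]) for s, row in P.items()}

rng = random.Random(7)
n = 20000
total = 0.0
for _ in range(n):
    final, ratio = run_path(P, P_biased, start=1, stop_states={0, 3}, rng=rng)
    total += ratio if final == 3 else 0.0
theta_hat = total / n
theta_exact = eps ** 2 / (1 - eps + eps ** 2)   # solved by hand for this small chain
```

Without biasing, only about one path in ten thousand would reach the failure state in this chain, so a run of this size would see it a couple of times at most. With b = 0.5, a large fraction of the paths reach it, and the likelihood ratios scale their contributions back to the original probabilities.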
0.3 DESIGN PHASE
In the design phase of a system, simulation is an important experimental means for performance and dependability
analysis. Compared to analytical modeling, simulation has the capability to model complex 
systems to a high degree of fidelity without being restricted to assumptions made to keep an analytical 
model mathematically tractable. However, simulation for dependability analysis involves the injection of 
faults into the system under study at three levels of abstraction: the electrical level, the logic level, and the 
function level. The issues studied usually include, but are not limited to, fault propagation, fault latency, and 
fault impact such as coverage, reliability, availability, and performance loss. Figure 0.1 shows fault injection 
at the different levels.
Fault injection examples by level: electrical (change current, change voltage); logic (stuck-at 0 or 1, inverted fault); function (change CPU register, flip memory bit, etc.).
Figure 0.1: Simulated Fault Injections at Different Levels
Transient faults account for more than 80% of the failures in computer systems [Siewiorek78], [Iyer86]. 
These faults result from physical causes, such as power transients, capacitive or inductive crosstalk, or cosmic 
particle interventions [Yang92].
Electrical-level fault injection emulates transient faults by changing the electric current and voltage inside 
a circuit. The faulty current and voltage can cause errors in logic values at the gate level. The gate-level 
errors can then propagate to other functional units and output pins of the chip. Electrical-level simulation 
can be used to study the impact of transient faults in a circuit or chip. The simulation, however, can be very 
time consuming and memory bound, since it has to track the propagation of faults from circuits to gates, to 
functional units, and eventually out to the pins.
Logic-level fault injection simulates the abstractions of physical fault models to logic gates in order to 
study large VLSI circuits or systems. Commonly used fault models include stuck-at-0, stuck-at-1, bridging 
faults, and inverted faults. These models are generally considered to be representative of faults occurring at 
the gate level. Although simulation at the logic level ignores the physical processes underlying gate faults, it 
still needs to trace the impact of gate-level faults to higher levels. For the same reason that electrical-level 
simulation cannot be effectively used to study large VLSI systems, logic-level simulation cannot effectively 
study large computer systems.
Function-level fault injection simulation is usually used to study dependability features of large computers 
or networks. Faults are injected into various components of the system under study. Functional fault models
are used in the simulation, while detailed processes of fault occurrence at lower levels are ignored. Functional 
models representing the manifestation of faults can be extracted from results obtained from electrical- or 
logic-level fault injection or from field measurements. For example, “flipped memory bit” and “CPU register 
error” are two typical fault models. Analytical dependability models of computer systems are usually built 
at this level. Compared to analytical modeling, simulation is capable of representing detailed architectural 
features, realistic fault conditions, and intercomponent dependencies.
There are several common issues that apply to fault injection at all levels. The first is: What is the 
appropriate fault model at the chosen level of abstraction? While there is no easy answer to this question, 
field data and experience are valuable guides. The second issue is: For a given fault model (e.g., one bit
flip in memory) and fault type (e.g., transient fault), where should the fault be injected? A straightforward 
approach is to randomly choose a location from the injection space (e.g., all gates in a VLSI chip, or all 
memory bits). This scheme is easy to implement, but many faults may have similar impact (e.g., all faulty 
bits in an ALU may have the same effect) and many faulty locations may not be exercised. Another 
approach is to inject faults into a few representative locations under selected workloads so as to provide a 
broad evaluation of the system. This technique can be used to study fault impact in terms of locations or 
workloads.
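The straightforward random-location scheme for the "flipped memory bit" fault model can be sketched as below. The memory image is modeled as a Python bytearray, which is an assumption of this illustration, as are the function names and the memory size. A real function-level simulator would inject into its own model of memory or CPU registers.

```python
import random

def flip_bit(memory, bit_index):
    """Invert one bit of a bytearray, modeling a 'flipped memory bit' fault."""
    byte_index, bit = divmod(bit_index, 8)
    memory[byte_index] ^= 1 << bit

def inject_random_fault(memory, rng):
    """Pick a location uniformly from the injection space (all memory bits)."""
    bit_index = rng.randrange(len(memory) * 8)
    flip_bit(memory, bit_index)
    return bit_index

memory = bytearray(4)          # hypothetical 32-bit memory image, initially all zeros
flip_bit(memory, 9)            # deterministic injection: byte 1, bit 1
```

Returning the chosen bit index lets the experiment log where each fault was injected, which is needed later to relate fault impact to location or workload.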
The third issue involves workloads. The impact of faults on system dependability is workload-dependent. 
Hence, it is important to analyze a system while it is executing representative workloads. These workloads 
can be real applications, selected benchmarks, or synthetic programs. If the goal of study is to investigate 
fault impact on a mission task, the real applications running in the mission should be used in the simulation. 
If the goal is to study fault impact on general workloads, several representative benchmarks should be selected 
for the simulation. If the objective is to exercise every functional unit and location, neither real applications 
nor benchmarks may be appropriate. In this case, synthetic workloads may have to be designed for achieving 
the goal. The workload issue complicates simulation models and increases simulation time. It is essential to 
develop techniques to represent realistic workloads while maintaining reasonable simulation times.
The last issue is simulation time explosion. This occurs in two cases: 1) when too much detail is simulated,
such as modeling physical processes in fault injections at the electrical level, and 2) when extremely
small failure probabilities require large simulation runs to obtain statistically significant results (the theory
is discussed in Section 0.2.1). Several techniques, including mixed-mode simulation [Saleh90] [Choi92], importance
sampling [Goyal92] [Choi92], hybrid simulation [Bavuso87], [Goswami93a], and hierarchical simulation
[Goswami92], have been used to address the time explosion problem.
Table 0.1 summarizes features and representative studies in simulated fault injection at different levels. 
We discuss these studies in the following three sections.
Table 0.1: Summary of Simulated Fault Injection
Category | Electrical Level | Logic Level | Function Level
Approach | Alter electrical current and voltage in circuits | Inject stuck-at or inverted faults to logic gates | Inject faults to CPU,
0.3.1 Simulated Fault Injection at the Electrical Level
There are several reasons for simulating fault injection at the electrical level. First, fault injection at this 
level can be used to study the impact of physical causes that lead to faults and errors. Second, studies show 
that simple stuck-at fault models are not always representative of physical failures [Banerjee82], [Beh82], 
[Kim88]. Third, some circuits are of a mixed analog/digital nature, which cannot be fully characterized by 
logic-level fault models. Thus, there is a growing need for fault simulators that can handle electrical transient 
faults and permanent physical failures for the purposes of both circuit testing and dependability evaluation.
Figure 0.2: Illustration of Fault Injection at Electrical Level
Mixed-Mode Approach
The basic simulation methodology used in fault injections at the chip level is a mixed-mode approach. The 
fault-free portions of the circuit are simulated at the logic level while the faulty portions of the circuit are 
simulated at the electrical level. A representative mixed-mode fault simulator is SPLICE1 [Saleh84]. The 
electrical analysis in SPLICE1 is based on the method of iterated timing analysis (ITA), which incorporates 
a nonlinear relaxation method with event-driven selective tracing. ITA has been shown to be accurate and 
fast; it can provide a speed-up of up to two orders of magnitude. The logic analysis in SPLICE1 is performed 
using a relaxation-based method and MOS-oriented models.
In fault injection based on mixed-mode simulation, a circuit model description is read and modified by
adding a current source at the fault injection node. (A node is defined as a point in a conductive
interconnection between electrical and/or logical elements.) During each iteration, the scheduled node-events
in the current time step (virtual simulation time) are processed, and new events are scheduled and queued
in the event list. For each fan-in element in the processed node, the element type (electrical, switch-level,
logical) is determined. If the element type is electrical, then additional analysis is performed to determine
whether the node is an injection site (i.e., analog signals are dealt with). The fault injection time window is
the period between the fault-injection time t and t + dt, where dt is the duration of the fault. If the node
is an injection site and if the virtual simulation time is within the fault injection time window, the current
source representing the transient is activated and additional current is added to the total current calculation.
The additional current value is determined from a function representing the current source for the particular
virtual simulation time. The total current is used to calculate the fault voltage level at the processed node.
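The time-window check described above can be sketched as follows. This is a simplified illustration and not SPLICE1 code: the function names, the rectangular 2 mA pulse, and the time units are all hypothetical, and the calculation of the node voltage from the total current is left out.

```python
def injected_current(t, t_inject, duration, source):
    """Extra current at virtual time t; zero outside the fault injection window."""
    if t_inject <= t <= t_inject + duration:
        # The source is a function of the time elapsed since injection.
        return source(t - t_inject)
    return 0.0

def node_current(nominal, t, is_injection_site, t_inject, duration, source):
    """Total current at a node: nominal current plus any active injected current."""
    if is_injection_site:
        return nominal + injected_current(t, t_inject, duration, source)
    return nominal

pulse = lambda elapsed: 2.0e-3     # hypothetical rectangular pulse of 2 mA

inside = node_current(1.0e-3, 5.0, True, 4.0, 2.0, pulse)    # t within [4, 6]
outside = node_current(1.0e-3, 7.0, True, 4.0, 2.0, pulse)   # t after the window
```

Because the source is passed in as a function, transients of other shapes and durations can be modeled by swapping in a different function without touching the window logic.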
Figure 0.2 illustrates the overall simulation and fault injection approach. A simple CMOS AND gate 
with buffered output is illustrated in the figure. The dotted boxes indicate normal voltage waveforms for the 
circuit. The dashed boxes contain waveforms resulting from a transient injection at the location marked by 
X. Notice that waveforms within the electrical-level analysis behave in an analog fashion but are discrete in 
the logic-level analysis.
Dynamic Mixed Mode
Recently, a more efficient technique for performing transient fault simulation has been developed [Yang92]. 
The representation of a subcircuit that is subjected to transient fault injections is dynamically switched 
among different analysis modes (i.e., electrical and logic levels), during the simulation. The subcircuit is 
evaluated at the logic level until a fault is injected. When a transient fault is injected, the voltage level of the 
target node is upset and behaves in an analog fashion. Electrical-level analysis is used to evaluate this analog 
fault behavior. The analysis continues until the faulty voltage waveform stabilizes into a discrete signal. At 
this point, the subcircuit is evaluated with logic-level analysis until another transient occurs.
The dynamic approach is shown conceptually in Figure 0.3. For example, for the schematic circuit
diagram shown, if a transient fault occurs in subcircuit S3, one can determine that subcircuits S2, S3, and
S4 must be simulated at the electrical level to capture the effects of the transient. In the dynamic approach,
unlike static mixed-mode simulation, all subcircuits begin the simulation at the logic level. When a transient
occurs in subcircuit S3 at time t1,
the simulation mode changes to electrical-level analysis until the effect of the transient disappears. When
the effect of the fault injection propagates to subcircuits S2 and S4, these two subcircuits are also forced
to perform circuit simulation starting at time points t2 and t3, respectively. The electrical-level simulation
of these subcircuits terminates after a short time, when the transients in each subcircuit have settled, and
the subcircuits switch back to logic-level simulation. Since subcircuits S1, S5, S6, and S7 are not affected by the fault
injection, they remain at the logic level throughout the simulation period. This approach is better able
to reduce costly electrical-level analysis by exploiting the nature of the digital circuits while capturing the
desired behavior of transients.
We now describe an automated environment for electrical-level fault injection. The fault injection tool 
integrates a fault injection engine, a tracing facility, and graphical and statistical analysis packages into a 
user environment.
FOCUS: A Chip-Level Simulation Environment
FOCUS is a simulation environment for fault sensitivity analysis of IC chips [Choi92]. In this environment, 
a range of user-specified electrical-level transient faults are automatically injected at the circuit level, and 
fault propagation is measured at the gate and higher levels. SPLICE1 is used as the fault simulation engine. 
Figure 0.4 depicts the overall experimental environment. The environment takes as input a net-list of the 
hardware description of the system and converts it into a simulation model. The user provides the fault 
description data, and specified faults are injected during the simulated run-time of the target system. Faults 
that get propagated and detected are traced by the tracing facility, and the graphical analysis and impact 
analysis are performed on the fault data.
The fault injection process is implemented as a run-time modification of the circuit, whereby a current 
source is added to a target node, thus altering the voltage level of the node over the time interval of the 
injected current waveform. The experimental environment allows both transient and permanent (single or 
multiple) fault injections. Since the injected current source is specified as a mathematical function, the 
resulting transients can be of varying shapes and duration. For example, electrical power surge, in-chip 
alpha particle intervention, lightning, and bridging faults can be modeled. The user can control the location 
of a fault, the time and duration of a fault, and the shape of the current source.
Figure 0.3: Static and Dynamic Mixed-Mode Simulation
The tracing facility monitors all switching activities in the target system, including fault propagation 
through each gate or transistor, for all processed events. The trace data for each event consists of the time 
of the event, the hierarchical node name, and the new and previous voltage levels (for electrical nodes) or 
logic levels (for logic nodes).
The graphical analysis facility is used to visualize the error activity in different functional units of the 
processor and the fault propagation on the major interconnects and at the external pins. The statistical 
analysis tools provide impact analysis of the target system and generate the models necessary to depict the 
fault behavior in the system (e.g., I/O pin error distribution, latch error distribution, and internal fault 
propagation model).
The application of FOCUS is illustrated by analyzing a microprocessor used in commercial aircraft for 
real-time control of jet-engine functions. The 16-bit microprocessor consists of six major functional units. 
The arithmetic and logic unit (ALU), which contains six registers, can perform double precision arithmetic 
operations. The control unit, which issues signals to control ALU operations, is made up of combinational 
logic and several registers. The decoder unit decodes I/O signals, the multiplexer unit provides the discrete 
lines and buses, and the countdown unit drives chip-wide clock signals. The watchdog unit provides protection 
against faults by resetting the processor in the event of a parity error or in the event the application software 
is timed out by a software sanity timer. The signal to synchronize the dual system is also provided by this
unit. The chip runs at 12.08 MHz and is implemented in a 3-micron technology CMOS gate-array made of
2688 blocks of 4 N-channel and 4 P-channel transistors. The system incorporates a variety of fault-tolerant
design features at different levels, including software checks, parity checks, memory test, and error counting.
The objective of the study is to investigate the impact of charge-level transients on latch, pin, and functional
levels.
Figure 0.4: The FOCUS Experimental Environment
An experimental analysis of the susceptibility of the microprocessor to upsets due to current and voltage 
transients was conducted using simulated fault injection [Duba88] [Choi90]. The parameters used in the 
simulations were extracted from the actual microprocessor design and circuit layout. The application code 
running on the simulated processor exercised all the functional units at which transient fault injections were 
made. Fault injections were made at seven randomly chosen nodes in six functional units. For each node, 
current transients were injected at five charge levels: 0.5, 1.0, 2.0, 3.0, and 4.0 pC. Each charge level was 
injected at five time-points during the execution of the application code sequence. This amounted to over 
1000 simulated fault injections. Transients below 3.0 pC had no significant impact on the circuit.
The error data was generated by comparing each faulted simulation with a fault-free simulation. An 
error event was defined as either a logic state change or a voltage level change large enough to cause a node 
to be faulted at a future time. Error events were classified into three categories: 1) logic upsets (voltage 
transients large enough to constitute logic level errors), 2) latch errors (errors in the first-level latches), and 
3) pin errors (errors at the chip I/O pins).
Nearly 80 instruction cycles (90300 ns) of the application code were executed on the target system during 
each simulation run. The application code was carefully selected to ensure that all of the functional units 
were executed. A total of 2100 simulations were performed to obtain stable results. During the simulation, 
all nodes (including all latches and external I/O pins) in the circuit were monitored and processed. Table 0.2 
summarizes the overall impact of transients in the range 0.5 to 9.0 pC. In the table, a first-order error 
is defined as one that occurs during the first clock cycle following a transient fault injection; second and 
higher-order errors are those occurring during the second and subsequent clock cycles.
Table 0.2: Impact of Transients Injected to the Target System
Type                                  Injections Involved  Percent of Total Injections  Resultant Errors
First-Order Latch Errors              470                  22.4                         2149
Second and Higher-Order Latch Errors  120                  5.7                          1829
First-Order Pin Errors                255                  12.1                         1168
Second and Higher-Order Pin Errors    90                   4.3                          839
Functional Errors                     193                  9.2                          747
Figure 0.5 shows the propagation of the latch errors in time. In the figure, the x-axis represents the clock 
cycles from the fault injection time, and the y-axis represents the total latch error count for each clock cycle. 
It can be seen that, given a certain number of latch errors in the first clock cycle, the number of latch errors 
decreases significantly until the fourth clock cycle. At approximately the fifth clock cycle, the number of 
errors rapidly multiplies. This is because at this time, the error signal enters a unit with a large number of 
latches and high fan-out, such as the ALU registers. After the sixth cycle, the number of errors decreases 
significantly until finally disappearing after the eighth cycle. Thus, the impact of latch errors lasts, at most, 
up to eight clock cycles from the time of fault injection.
0.3.2 Simulated Fault Injection at the Logic Level
Simulated fault injection at the logic level is similar to that at the electrical level in that both are 
circuit-level simulations. The difference is in the fault models used. Logic-level fault simulation uses abstract 
logical models for both faults and circuit functions to evaluate the behavior of a system. In contrast to the 
evaluation of the physical models used at the electrical level, logic-level simulation performs binary operations 
that represent the behavior of a given device. It takes binary input vectors and evaluates the output of the 
device for a given input pattern. Each signal in the circuit is represented by a member in a set of boolean 
values depicting the steady-state conditions of the physical circuit. For example, the set {1,0,X} is often used 
to describe high, low, and unknown voltage values for logic gates. Fault injection at this level simply forces 
a node to either stuck-at-1 or stuck-at-0, or it inverts a logic value. Fault injection location and time can be 
set arbitrarily. Hence, with logic-level simulation, one obtains outputs with discrete values and possibly with 
some approximate timing information. Typically, outputs of the faulty and nonfaulty systems are compared 
to determine whether a fault has been detected.
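The fault models described above can be sketched in a few lines of Python. This is an illustrative sketch only: the three-gate circuit, the node names n1, n2, and out, and all function names are assumptions made for the example and do not correspond to any simulator cited in this section. Signals take values in {1,0,X}, with X represented by None.

```python
# Three-valued logic over {0, 1, X}; the unknown value X is modeled as None.
X = None

def and3(a, b):
    if a == 0 or b == 0:
        return 0            # a controlling 0 decides the output
    if a is X or b is X:
        return X
    return 1

def or3(a, b):
    if a == 1 or b == 1:
        return 1            # a controlling 1 decides the output
    if a is X or b is X:
        return X
    return 0

def not3(a):
    return X if a is X else 1 - a

def simulate(a, b, c, fault=None):
    """Evaluate out = (a AND b) OR (NOT c), a hypothetical circuit.

    fault is None or a pair (node, kind) with node in {'n1', 'n2', 'out'}
    and kind in {'sa0', 'sa1', 'invert'}.
    """
    def force(node, value):
        if fault is None or fault[0] != node:
            return value
        if fault[1] == "sa0":
            return 0
        if fault[1] == "sa1":
            return 1
        return not3(value)  # 'invert': flip the logic value

    n1 = force("n1", and3(a, b))
    n2 = force("n2", not3(c))
    return force("out", or3(n1, n2))

def detected(vector, fault):
    """A fault is detected when faulty and fault-free outputs are known and differ."""
    good = simulate(*vector)
    bad = simulate(*vector, fault=fault)
    return good is not X and bad is not X and good != bad
```

For example, detected((1, 1, 1), ("n1", "sa0")) is true because the fault changes the output from 1 to 0, whereas the input (0, 1, 0) masks a stuck-at-1 on n1 because the OR gate is already driven to 1 by n2.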
For MOS technology, a gate-level logic simulator is inadequate to handle circuits containing pass 
transistors, ratioed logic, buses, and other features that exhibit bidirectional signal flow and/or charge-sharing 
effects. To handle such transistor networks without resorting to expensive electrical-level analysis, 
switch-level simulation is used in [Bryant84]. Switch-level analysis allows bidirectional signal flow and different levels 
of signal strengths. For example, a discrete set {0,1,...,9} can be used to model different signal strengths 
or voltage levels. At this level, transistor-level fault modeling can also be easily incorporated.
This type of fault simulation has been widely used for evaluating the coverage of a given set of test 
vectors for testing manufacturing defects in a chip. Typically, a set of test vectors is generated either by an 
automatic test pattern generator (ATPG), randomly, or manually. The test vectors are then submitted to 
the fault simulator to determine how many faults they can detect. In this case, the test vectors become the 
workload or the inputs to the system. In the beginning of such a simulation, a stuck-at fault is injected, and 
the faulty circuit is simulated while a given test is applied at the primary inputs of the circuit. A similar 
run is performed again without any fault injection. The logic values at the primary outputs of the faulty 
circuit are then compared to the outputs of the fault-free run to determine if there is any difference in the 
outputs. If the injected fault altered logic values at the outputs compared to the clean run, then the fault is 
assumed to be detected. If a fault is detected, there is no reason to continue the simulation for that specific 
fault. This process of test generation and fault simulation is iterated until satisfactory fault coverage (the 
percentage of faults detected of all theoretically detectable faults) is achieved. This application has been 
widely used in industry to evaluate and assist test generation [Ruehli83] [Rogers85].
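As a minimal sketch of the test-evaluation loop just described, the following Python fragment injects each single stuck-at fault into a small hypothetical circuit, compares the primary output with the fault-free run, and drops a fault as soon as it is detected. The circuit, the node names, and the function names are assumptions made for illustration. In this particular circuit every stuck-at fault is detectable, so the total fault count serves as the denominator of the coverage.

```python
NODES = ("a", "b", "c", "n1", "out")   # hypothetical circuit: out = (a AND b) OR c

def evaluate(vector, fault=None):
    """Return the primary output; fault is None or (node, stuck_value)."""
    def force(node, value):
        return fault[1] if fault is not None and fault[0] == node else value

    a = force("a", vector[0])
    b = force("b", vector[1])
    c = force("c", vector[2])
    n1 = force("n1", a & b)
    return force("out", n1 | c)

def fault_coverage(test_vectors):
    """Simulate all single stuck-at faults and drop each one once detected."""
    faults = [(node, value) for node in NODES for value in (0, 1)]
    undetected = set(faults)
    for vector in test_vectors:
        good = evaluate(vector)                  # fault-free reference run
        for fault in sorted(undetected):         # only faults still undetected
            if evaluate(vector, fault) != good:
                undetected.discard(fault)        # fault dropping
    coverage = (len(faults) - len(undetected)) / len(faults)
    return coverage, undetected
```

With the single vector (1, 1, 0) the coverage is 0.4; adding (0, 1, 0), (1, 0, 0), and (0, 0, 1) raises it to 1.0, which mirrors the iteration between test generation and fault simulation.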
The use of fault simulation to perform dependability analysis at the design phase is somewhat more 
complicated. Here we are concerned with both permanent and transient faults and the time at which a fault 
is introduced. Fault sensitivity analysis of very large circuits through simulated fault injection and subsequent 
fault propagation can identify critical dependability bottlenecks. To characterize a highly dependable VLSI 
system, we need to evaluate, simultaneously, the fault behavior of all components as well as their combined 
behavior.
Stuck-at faults can be simulated using conventional logic-level fault simulators that are widely available. 
A stuck-at-fault injection is performed by forcing the state of a node to a specified value for the duration 
of the simulation. By selectively tracing/detecting a set of internal and external nodes, fault propagation 
can be monitored. Fault behavior in a system can then be modeled and analyzed by studying the fault 
propagation trace.
Transient faults are injected by altering the logic values of the target node for a specified time during 
the simulation. For example, the output of a gate is set to 1 when it should normally be 0. This faulty logic 
is forced on the output for a specified time period. Logic-level transient injection can be performed in two 
different ways. The above bit-flip effect can be performed on the combinational part of the circuit using a 
timing simulator. The created “pulse” can then propagate and may become latched or disappear. Another 
approach is to change the state of a machine by flipping a bit in a register or in a memory element in the 
system. These approaches, however, may not always reflect the actual device-level transient behavior at the 
logic level, because a transient can propagate in multiple paths and result in more than one latch error.
To evaluate system dependability based on realistic fault models, a fault-behavior dictionary approach 
can be taken [Choi93]. The approach is illustrated in Figure 0.6. A fault-behavior dictionary generated from 
electrical-level fault analysis can serve as a fast look-up table for a logic-level concurrent or parallel 
simulation. First, an electrical-level fault-behavior dictionary for a given chip can be generated by extensive fault 
simulation. In this step, gates around the fault-injection location are extracted, and a subcircuit consisting 
of these gates is formed. This subcircuit is exercised by exhaustively applying all input combinations while 
fault injection is performed. Faulty behavior at each of the subcircuit outputs is analyzed and recorded in a 
dictionary. The resulting entry in the dictionary consists of the input vector, fault-injection time, and fault 
location. Second, concurrent run-time fault injections of the generated logical error at the subcircuit level 
can be performed using the fault dictionary. A concurrent simulator can be used to propagate, in a single 
simulation pass, the effects of multiple injected errors.
Both transient and permanent faults can be injected using switch-level or gate-level logic simulation. The 
overall simulation approach for fault injection at the logic level consists of the following steps:
1. Obtain the net-list of a design and devise appropriate simulation models to emulate the given design.
2. Simulate the model using a logic-level simulator.
3. During the simulation, run a given workload depicting the application or test software (by applying 
test vectors to the primary inputs).
4. Save the behavior of the system under fault-free conditions by tracing all the changes in the evaluated 
logic events of monitored nodes for comparison with the subsequent fault-injection runs.
5. Run the same workload again and inject a fault to a selected node during the simulation period and 
trace.
   For a stuck-at fault: Force the state of the selected node to either 1 (for stuck-at-1 fault) or 
   0 (for stuck-at-0 fault) and evaluate the circuit. Hold the state to the stuck-at fault value 
   throughout the simulation.
   For a transient fault: Force the state of the selected node to a logic value that is the reverse 
   of the normal state (i.e., force a 0 if the normal state is a 1, and vice versa). Hold the state 
   to the reverse value on the node for one or more clock cycles. Let the fault effect propagate 
   by evaluating the circuit with the corrupted logic state. Release the forced state when a new 
   signal/event arrives at that node.
6. Monitor the behavior of the system under fault conditions.
7. Compare the traces from the faulty and fault-free runs and identify the differences to determine where 
and when the fault has propagated.
8. Use collected statistical measurements to determine dependability parameters (e.g., detection coverage) 
and the fault impact (e.g., minor program error or system failure).
Figure 0.6: Concurrent Transient Simulation
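The steps above can be sketched as follows. The 4-bit shift register, the workload bits, and the function names run, compare, and error_cycles are hypothetical stand-ins for a real net-list, workload, and logic simulator; the sketch only illustrates the fault-free run, the injection, and the trace comparison. For simplicity, the transient here is held for exactly one clock cycle.

```python
def run(workload, fault=None):
    """Simulate a hypothetical 4-bit shift register; return the per-cycle trace.

    fault is None or a dict with 'node' (stage index 0..3), 'kind'
    ('sa0', 'sa1' or 'transient') and 'start' (injection cycle).
    """
    state = [0, 0, 0, 0]
    trace = []
    for cycle, bit in enumerate(workload):
        state = [bit] + state[:3]                 # clock edge: shift in the input
        if fault is not None and cycle >= fault["start"]:
            node, kind = fault["node"], fault["kind"]
            if kind == "sa0":
                state[node] = 0                   # held until the end of the run
            elif kind == "sa1":
                state[node] = 1
            elif kind == "transient" and cycle == fault["start"]:
                state[node] ^= 1                  # reverse the value for one cycle
        trace.append(tuple(state))                # trace all monitored nodes
    return trace

def compare(golden, faulty):
    """Return (cycle, node) pairs where the faulty trace differs from the golden one."""
    return [(cycle, node)
            for cycle, (g, f) in enumerate(zip(golden, faulty))
            for node in range(len(g)) if g[node] != f[node]]

def error_cycles(differences):
    """Number of clock cycles in which at least one monitored node was in error."""
    return len({cycle for cycle, _ in differences})

workload = [1, 0, 1, 1, 0, 0, 1, 0]
golden = run(workload)                            # fault-free reference trace
faulty = run(workload, {"node": 1, "kind": "transient", "start": 2})
differences = compare(golden, faulty)             # [(2, 1), (3, 2), (4, 3)]
```

In this run the flipped bit moves one stage per clock cycle and leaves the register after three cycles, so the comparison shows both where and when the error propagated.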
The above fault injection steps should be repeated many times for a given workload. If the experiment is 
intended to estimate single measures (e.g., detection coverage), the number of injections required for achieving 
a given confidence interval can be determined using Equation 0.14. If the purpose is to obtain distributions 
(e.g., error latency distribution), the fault injections should not be stopped until the constructed 
distribution is stable, i.e., the two consecutive distributions constructed are not statistically different. Importance 





Two early studies in fault injection at this level analyzed a digital avionic miniprocessor, BDX-930, as 
the target system. The first study investigated the impact of faults at gates and pins on the output results 
of programs, with emphasis on the fault latency and fault coverage issues [McGough81]. The second study 
investigated error propagation from the gate level to the pin level [Lomelino86]. A recent study explored 
the behavior of transient faults that occur during the normal execution of a processor [Czeck91]. The study 
quantified faults that can be emulated by software-implemented fault injection (to be discussed). We discuss 
these studies in the following two subsections.
Study of Bendix BDX-930
An early study in this field was the simulated fault injection to the Bendix BDX-930, a digital avionic 
miniprocessor, to investigate fault latency and coverage [McGough81]. The BDX-930 was composed of 
bit-slice processors (AMD2901) and was used in a number of flight control avionic systems. Fault tolerance was 
achieved by replication of the processing and voting in software. A gate-level emulator of the BDX-930 was 
developed for this study. The run speed of the emulator was 25,000 times slower than the BDX-930 when 
hosted on a PDP-10.
The methodology used in the study was: Given a software program running on the processor, inject 
a single stuck-at or inverted fault at a gate or pin selected randomly and observe the time to detection. 
Assume that a detection occurs whenever there is a difference between the outputs of the faulty and 
fault-free processors executing the same program. The experiment is repeated 600 to 1,000 times to construct an 
empirical latency distribution. Six programs were selected to repeat the above experimental procedure. In 
addition, a typical avionic flight control system self-test program was written for this study and executed to 
determine fault detection coverage.
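One way to build such an empirical latency distribution from the injection results is sketched below. The function name and the latency values in the usage example are hypothetical and are not data from [McGough81]; an injection that was never detected is recorded as None.

```python
def empirical_latency_cdf(latencies):
    """Return [(time, fraction of all injections detected by that time), ...].

    latencies holds one entry per injection: the time to detection, or None
    if the fault was still undetected when the experiment ended.
    """
    total = len(latencies)
    detected = sorted(t for t in latencies if t is not None)
    points = []
    for count, t in enumerate(detected, start=1):
        if points and points[-1][0] == t:
            points[-1] = (t, count / total)   # merge ties at the same latency
        else:
            points.append((t, count / total))
    return points

# Hypothetical outcome of ten injections (times in program repetitions).
sample = [3, None, 1, 3, None, 7, 2, None, None, 1]
cdf = empirical_latency_cdf(sample)
```

The last point of the curve gives the fraction of injected faults that were detected at all; the remainder corresponds to the faults that stayed undetected.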
Results showed that most of the detected faults were detected during the first execution of the program. 
Subsequent repetitions did not significantly increase the proportion of detected faults. A large percentage 
of faults (about 60% for the gate-level faults and 30% for the pin-level faults) remained undetected after as 
many as eight repetitions of the program. The fault coverage of the self-test program was found to be 87% 
for the gate-level faults and 98% for the pin-level faults.
The above study emphasized the impact of faults at gates and pins on the output results of programs. 
Another study on the Bendix BDX-930 computer investigated error propagation from the gate level to the 
pin level [Lomelino86]. In this study, a single AMD2901 processor chip in the BDX-930 was selected for 
fault injection and error data collection. The processor was simulated using an event-driven, gate-level 
logic simulator developed at NASA Langley [Migneault85]. The simulator was driven by a self-test program 
developed for the BDX-930, which provides a high probability of detecting error activity.
In the simulation, the single stuck-at fault model was applied to 150 selected gates for fault injection. 
The gates were distributed throughout the nine function units of the AMD2901. Error data was collected 
by first simulating a fault-free circuit, then simulating the circuit with a single injected fault, and finally 
comparing the outputs of the two simulations. Four sets of simulation experiments, consisting of 150 simulations 
per set, were conducted. Results showed that 78.7% of faults produced error propagation detected within 
the chip and that 66.7% of faults produced errors that propagate to the output pins, within the first 100 
clock cycles. The error distribution at the output pins was found to be sensitive to the locations of faults. 
The results also showed that the error activity increases with the increase of concurrent microinstruction 
activity.
Study of IBM RT PC
In [Czeck91], a simulation model of the IBM RT PC was used to inject transient, gate-level faults for exploring 
the behavior of transient faults that occur during the normal execution of a processor. The emulated hardware 
functional units in the processor included: instruction prefetch buffer (IPB), microinstruction fetch (MIF), 
data fetch and storage (DFS), ALU and shifter (ALU), and ROMP storage channel interface (RSCI). The 
simulation model included original error detection mechanisms (EDM) that reside in the IBM RT PC (such 
as illegal instruction traps and memory access violation) and additional error detection mechanisms provided 
in this study for evaluating their effectiveness (such as timeout and control flow monitoring).
Figure 0.7 shows possible error manifestations after a fault injection. In the figure, minor errors are those 











been detected by an EDM. Monitoring errors are those uncovered by the “duplicate and compare” EDM 
that monitors bus addresses and data. Severe errors are those resulting in the change of a microinstruction 
and the instruction address registers, which lead to a change in the control flow of the program. Fatal 
errors have triggered a system resident EDM and caused an abnormal termination of the application task. 
Results overdue occurs when the task executes longer than a predetermined time limit and the execution is 
halted. Overwritten means that the injected fault does not manifest to a minor error or that a minor error 
is overwritten by correct data.
Three workloads were selected for this study: an iterative matrix multiplication, a recursive Fibonacci 
program, and an iterative Fibonacci program. These workloads were considered to be representative of the 
characteristics of instruction set architectures and diversity in program structure. For each workload and 
each fault location, 1000 faults were injected. The following experimental methodology was used:
1. For each workload, the fault-free behavior of the workload is extracted from the internal state of the 
processor and saved for comparison during the subsequent fault injection experiments.
2. A fault location is selected such that the fault in the location has a high probability of producing an 
error and locations for different injections do not yield the same error behavior.
3. The fault injection time is set to the start of the workload execution. The fault injection time will be 
advanced by one cycle for each successive fault injection experiment.
4. The fault is injected for a duration of one clock cycle at the location and time selected.
5. For each successive clock cycle, the internal processor state of the faulty processor is compared with 
that obtained in step 1. Differences are recorded for off-line analysis.
6. The faulty behavior is monitored at each clock cycle until the program execution is completed or a 
severe error causes the monitor to cease.
7. The simulation run is restarted at step 3 and the time of next fault injection will be advanced by one 
clock cycle.
It was found that 40% to 55% of injected faults do not produce an error. Among the faults that manifest 
to errors, approximately 2/3 of them can be emulated by the software-implemented fault injection approach 
(to be discussed). The other 1/3 of these faults manifest to errors in CPU components (e.g., microinstruction 
control registers) that are not accessible to software. At the system level, the fault behavior showed a strong 
dependency on the workload structure and the instruction sequencing rather than on the instruction mix. 
Error detection latency was found to follow a Weibull distribution with a decreasing detection rate. The
distribution represents two error occurrence processes: a fast process in which fault manifestation and error 
propagation occur within a small time window, and a slow process in which dormant faults and errors are 
activated gradually by the workload.
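A Weibull distribution has a decreasing hazard rate (here, detection rate) when its shape parameter is below 1. The sketch below evaluates the standard Weibull distribution function and hazard function; the function names and the parameter values in the usage example are arbitrary choices for illustration and are not the values fitted in [Czeck91].

```python
import math

def weibull_cdf(t, shape, scale):
    """P(latency <= t) for a Weibull distribution."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """Instantaneous detection rate at time t, given no detection before t."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

# With shape < 1 the detection rate falls as time passes.
rates = [weibull_hazard(t, shape=0.5, scale=10.0) for t in (1.0, 10.0, 100.0)]
```

A high early rate followed by a long tail is consistent with the two processes described above: errors that propagate quickly are detected almost at once, while dormant faults are detected at an ever lower rate.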
The next section introduces a tool that uses a hardware simulation engine to simulate permanent stuck-at 
faults at the gate/switch level and extracts a high-level error model automatically.
EMAX
EMAX is a high-level error model automatic extractor. It simulates user-selected, permanent, stuck-at faults 
that may occur inside a processor chip at the gate and/or the switch level and generates the output patterns 
produced by faulty circuits [Kanawati93]. See Figure 0.8. It is designed to investigate the representativeness 
of faults/errors applied at the pins or at the system level for the faults inside VLSI chips. Based on the error 












Figure 0.8: The EMAX Modules
EMAX consists of a workload input module, a fault simulation module, and a data analysis module. 
The workload input module takes as input workload patterns and circuit descriptions. The fault simulation 
module uses a hardware simulation engine, the Zycad MACH-1500, to speed up the simulation experiments. 
The data analysis module includes three submodules, an error pattern extractor module, an error pattern 
classification module, and a fault dictionary module. It reads the output fault records generated by the 
MACH-1500, extracts error patterns for every fault, divides error patterns into classes, and builds a fault 
dictionary.
The application of EMAX was illustrated by analyzing a processor. The processor has seven instructions; 
two of the instructions require two execution phases, and the others need four phases. Faults were injected 
at every node defined in the netlist of the processor. During the experiments, less than 16% of injected faults 
caused errors. Thirteen error types were observed, seven of which involved more than one component and 
one of which caused a processor halt. Execution phases of instructions were taken into account. The majority
of errors during phase 1 were address line errors that caused the processor to fetch wrong instructions. As a 
result, data line errors were observed during the next phase of execution. The majority of the errors in phase 
2 were also address line errors that caused the processor to fetch wrong instructions. The errors propagated 
to phase 3 as address errors and control line errors, which eventually propagated to phase 4 as data line 
errors and address line errors. Overall, 73% of errors were address line errors while fetching an instruction, 
13% were address line errors while fetching an operand, and each of the other error types contributed less 
than 5% of the errors.
0.3.3 Simulated Fault Injection at the Function Level
Function-level fault injection simulation is used to study overall computer and network systems rather than 
their individual components. These studies typically consider the hardware, the software, their interactions, 
and the interdependence between the various components of the system. There are several outstanding issues 
in developing functional simulation models at the system level: 1) a lack of well-established system-level fault 
models, 2) a large and varied component domain, 3) the effort and time required to develop a functional 
simulation model, 4) the impact of the software on system dependability, and 5) simulation-time explosion.
The first issue, a lack of well-established system-level fault models, is partly due to the second, a large 
and varied component domain. At the gate level, the basic components are gates with single functions and 
well-defined interconnections. At this level, it is possible to establish a fault model, such as the single 
stuck-at fault model, that can be consistently applied to all gates to model their fault behavior. At the system 
level, the basic components include CPUs, communication channels, disks, software systems, and memory. 
These components have complex inputs, perform multiple functions, and have varied physical attributes (e.g., 
hardware and software) and complex interconnections. In addition to the diversity of the components that 
make up a system, two similar components (such as two CPUs) can have different functions and behavior. 
This makes it difficult to establish a single fault model that can be applied consistently to all components.
For this reason, various types of fault models are required, depending on the type of component being 
studied. The fault models are functional fault models that simulate system-level manifestations of 
gate-level faults. For instance, a single bit-flip is typically used to simulate a memory or register fault. Various 
fault models can be used for communication channels. Messages traversing the channel can be corrupted or 
destroyed, or the channel can be made inoperative. A fault in a processing node can be modeled as a service 
interrupt caused by CPU, memory, disk, or software faults in the node. More detailed fault models for a 
processor or other system components can be derived from lower-level simulations using the fault-dictionary 
approach discussed earlier. For instance, a gate-level simulation of a processor can be injected with faults 
while executing a typical workload. The effect of the faults on the workload can be stored in a fault dictionary 
that contains, for each gate-level fault injected, the types of effects and the probability of these effects. This 
dictionary can then serve as a fault model for system-level simulations.
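The functional fault models mentioned above can be sketched as follows. All names, the 8-bit message width, and the dictionary contents are assumptions made for this illustration; in practice, the effect probabilities in the dictionary would come from the lower-level simulations.

```python
import random

def flip_bit(word, bit):
    """Memory or register fault: invert one bit of a stored word."""
    return word ^ (1 << bit)

def faulty_channel(messages, p_destroy, p_corrupt, rng):
    """Communication fault: each message is destroyed, corrupted, or delivered."""
    delivered = []
    for msg in messages:
        u = rng.random()
        if u < p_destroy:
            continue                                  # message destroyed
        if u < p_destroy + p_corrupt:
            msg = flip_bit(msg, rng.randrange(8))     # assumes 8-bit messages
        delivered.append(msg)
    return delivered

def sample_effect(dictionary, gate_fault, rng):
    """Fault dictionary look-up: draw a system-level effect for a gate-level fault."""
    effects, probabilities = zip(*dictionary[gate_fault])
    return rng.choices(effects, weights=probabilities, k=1)[0]

# Hypothetical dictionary entry: effects of one gate-level fault and their probabilities.
fault_dictionary = {"g17/sa0": [("no effect", 0.6), ("wrong result", 0.3), ("halt", 0.1)]}
effect = sample_effect(fault_dictionary, "g17/sa0", random.Random(0))
```

A system-level simulation would call sample_effect whenever a processor fault arrives, so that the detailed gate-level behavior is represented without being simulated again.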
The third issue, the effort and time required to develop a functional simulation model, is especially 
significant when simulating large, complex systems. Two factors contribute to the problem. One is the 
time and effort needed to describe the detailed functionality of the system components. The other is the 
time and effort required to inject faults, initiate repairs, abort, reschedule and synchronize events, and 
maintain a whole host of fault statistics. As the number of components in the system becomes large, a 
well-formulated, structured, and automated approach is needed to contend with the complexity. A solution is a 
tool that includes a library of software “objects” that provide the framework needed to conduct simulated 
fault injection studies and that can be easily customized to meet user-specific needs.
The fourth issue is the impact of the software on system dependability. Dependability studies have 
tended to focus on a system’s hardware components. But as hardware becomes more reliable, software is 
becoming a more dominant factor [Gray90]. The effectiveness of functional detection and repair schemes 
depends upon several application-specific measures, such as detection latency and error propagation times. To 
study the impact of the software on system dependability, methods should allow the designer to incorporate 
the software into the overall dependability study. The simulation tool should permit the execution of actual 
user programs and relevant operating system features.
The fifth issue, simulation time explosion, is extremely important. This occurs when the system 
modeled has very small failure probabilities requiring large simulation runs to obtain statistically significant 
results. This is especially a problem with functional simulation: its primary benefit, detailed modeling, 
further contributes to simulation time explosion. Acceleration techniques used at the system level can reduce 
simulation time explosion. Hierarchical and hybrid simulation methods have been shown to be very effective 
[Goswami92] [Goswami93a]. The basic approach of these techniques is to: 1) break down a large, complex 
model into simpler submodels, 2) analyze submodels individually, 3) combine the results from step 2 to build 
a simplified system model, and 4) analyze the system model to obtain the solution. If the models in step 1 
and step 3 are both simulation models, the approach is called hierarchical simulation. If the models in step 
1 are simulation models and the model in step 3 is an analytical model, the approach is called hybrid 
simulation. As long as the interactions among the subsystems are weak, this decomposition approach provides 
valid results. The approach is ideally suited for dependability analysis because dependability models can 
usually be broken into two submodels—a fault occurrence submodel and an error handling submodel—whose 
interactions are typically weak.
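A minimal sketch of the hybrid approach, under simplifying assumptions that are not taken from [Goswami92] or [Goswami93a]: the error-handling submodel is simulated on its own to estimate a coverage value, and that value is then inserted into a closed-form system model. Here an error is assumed to be handled if its exponentially distributed detection latency does not exceed a deadline, and the system is assumed to be two active units without repair.

```python
import random

def estimate_coverage(n_injections, mean_latency, deadline, rng):
    """Steps 1 and 2: simulate the error-handling submodel in isolation."""
    handled = sum(rng.expovariate(1.0 / mean_latency) <= deadline
                  for _ in range(n_injections))
    return handled / n_injections

def duplex_mttf(failure_rate, coverage):
    """Steps 3 and 4: analytical system model for two active units, no repair.

    Both units run for a mean of 1/(2*rate) until the first failure; with
    probability `coverage` that failure is handled and the surviving unit
    runs for a further mean of 1/rate.
    """
    return 1.0 / (2.0 * failure_rate) + coverage / failure_rate

coverage = estimate_coverage(20000, mean_latency=2.0, deadline=6.0,
                             rng=random.Random(1))
mttf = duplex_mttf(failure_rate=0.001, coverage=coverage)
```

The costly part (many injections) is confined to the small submodel, while the rare system failures are handled analytically, which is what makes the decomposition effective against simulation time explosion.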
Since there are many analytical modeling tools, including Petri-net based simulation tools, what is the 
need for functional simulation tools for system-level dependability analysis? What additional information 
and capabilities can they provide? The answer is that analytical modeling tools only use probabilistic models 
to represent the behavior of a system. In essence, the effect of a fault on the system is predefined by a set 
of probabilities and distributions. Functional simulation tools not only use stochastic modeling, they also 
permit behavioral modeling, which does not require that the effect of the faults be predefined.
An example that demonstrates this capability is a distributed system using a centralized, 
prediction-based, load-balancing scheme, shown in Figure 0.9. The system is evaluated under faults.
Figure 0.9: Distributed System Executing Load-Balancing Software
The load-balancing software that makes task placement decisions and maintains the database is executed 
on a simulated distributed system. Communication faults were injected to destroy and corrupt fields of 
the status messages sent to the CPU maintaining the database. Faults were also injected into the CPU 
containing the load-balancing software, to erase its database. The effect of these faults is to corrupt the 
database and impair the placement decisions made by the load-balancing software. If a purely probabilistic 
modeling tool was used for this study, the user would have to prespecify:
• the probability that a fault will corrupt the database,
• how each fault will corrupt the database,
• which portions of the database will be corrupted,
• the extent of corruption, and
• how each corruption will impair the placement decision made by the load-balancing software.
These factors are extremely difficult to obtain without executing the actual software. Because the 
simulation executes the actual software, these parameters are the results of (and not inputs to) the fault injection 
experiment. Only the fault arrival rates and the types of faults injected need to be specified. Thus, the 
simulation results can identify the failure mechanisms, obtain failure probabilities, and quantify the effect of 
faults. They can be used to pick out the key features that must be modeled and help to determine and specify 
both the structure of, and the parameters to, analytical models.
A singular feature that distinguishes behavioral modeling from probabilistic modeling is the ability of 
behavioral modeling to reveal design flaws in the software. For example, the simulation helped to uncover 
a design feature of the software that caused erratic increases in system response time only when status 
messages were destroyed. Once the software was modified, the erratic increase in response time ceased. 
Clearly, such results cannot be obtained with analytical modeling tools.
An additional advantage of functional simulation is that it allows the use of any type of TTF distribution. 
Unlike analytical modeling, in which only a few types of distributions are commonly used for the tractability 
of models, the simulation method can handle any form of distribution, empirical or analytical.
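The following sketch illustrates this flexibility: the same simulation loop accepts either an empirical TTF distribution (resampling measured values) or an analytical one (here a Weibull distribution). The function names and all numerical values are hypothetical.

```python
import random

def make_empirical_sampler(observed_ttfs):
    """Draw TTFs uniformly from measured values (the empirical distribution)."""
    data = list(observed_ttfs)
    return lambda rng: rng.choice(data)

def make_weibull_sampler(scale, shape):
    """Draw TTFs from an analytical Weibull distribution."""
    return lambda rng: rng.weibullvariate(scale, shape)

def failures_before(horizon, draw_ttf, rng):
    """Count failures in [0, horizon) when successive TTFs come from draw_ttf."""
    clock, count = 0.0, 0
    while True:
        clock += draw_ttf(rng)          # time of the next failure
        if clock >= horizon:
            return count
        count += 1

rng = random.Random(3)
n_empirical = failures_before(1000.0, make_empirical_sampler([120.0, 45.0, 300.0, 80.0]), rng)
n_weibull = failures_before(1000.0, make_weibull_sampler(150.0, 0.7), rng)
```

Only the sampler changes between the two runs; the rest of the simulation is unaffected by the form of the distribution.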
In recent years, several function-level simulation tools that can be used for fault injection have been or 
are being developed. NEST, DEPEND, REACT, and MEFISTO are four representative tools. NEST is a 
function-level testbed that specializes in modeling and evaluating distributed network systems [Dupuy90]. 
Although the tool is not designed for fault injection, users can make node or link failures by deleting or 
adding nodes and links or changing their features while the simulation is running. DEPEND, developed 
at the University of Illinois, exploits the properties of the object-oriented paradigm to provide a 
general-purpose, system-level dependability analysis tool that can evaluate various types of fault-tolerant 
architectures [Goswami92]. The object-oriented feature of DEPEND makes the tool capable of modeling multiple 
levels of functional units to meet a wide range of applications. REACT, a software testbed that performs 
automated life testing of a variety of multiprocessor architectures through simulated fault injections, was 
developed at the University of Massachusetts and Texas A&M University [Clark93a]. Several system, 
workload, and fault/error models, which are representative of multiprocessor architectures and conditions, 
are embedded in the testbed. MEFISTO, developed at LAAS-CNRS in France and Chalmers University in 
Sweden, is an integrated environment for applying fault injection into VHDL simulation models encompassing 
various levels of abstraction [Jenn94]. In the following, we discuss these tools.
NEST — A Network Simulation Testbed
NEST (NEtwork Simulation Testbed) is a graphical environment, running on the UNIX system, for modeling, 
executing, and monitoring distributed network systems and protocols [Dupuy90]. Using a set of graphical 
tools provided by NEST, the user can develop simulation models of communication networks. The model 
includes node functions (e.g., routing protocols) and communication link behaviors (e.g., packet loss or delay 
features), typically coded in C. These user procedures are linked with run-time routines embedded in NEST 
and executed by the NEST simulation server. The user can reconfigure a modeled network system through 
graphical interaction or programming. Built-in graphical tools allow users to program custom monitors and 
observe the simulation results on-line.
Figure 0.10 shows the overall architecture of NEST. NEST consists of a simulation server and several 
client monitors. The simulation server is responsible for running simulations. The generic client monitors 
are used to configure simulation models and control their executions. The custom client monitors are used to 
observe simulation behavior and display results. Clients can reside on separate machines so that the server 
is dedicated to time-consuming simulations.
Figure 0.10: Overall Architecture of NEST
Node functions are used to model distributed communicating processes running at network nodes (e.g., 
protocols and database transactions). NEST executes node processes and their communication calls using 
a set of embedded primitives for sending, broadcasting, and receiving packets. The motion of a packet over 
links is simulated by passing it through the link functions. Link functions are used to model the behavior 
of communication links (e.g., packet loss and link jamming). Link functions are also used to monitor and 
collect performance statistics of link traffic. The simulation server schedules the execution of the node and 
link processes to meet the delay and timing specified by the user.
The user can create and modify a network description (node and link functions and connections) using 
the NEST graphical tools. Once the user has defined a simulation scenario, it is sent to the simulation server 
to be executed. One of NEST’s key features is its ability to reconfigure a scenario during the simulation run. 
The user may delete or add nodes and links (thus failures can be emulated) or change their features while 
the simulation is running. The impact of these changes may be instantly observed and interpreted. Such 
dynamically reconfigured simulations can be used to study the impact of node/link failure and recovery on 
the modeled network system.
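The idea of emulating failure and recovery by reconfiguring a running scenario can be sketched as follows. This is not NEST's interface (NEST models are typically coded in C); the `Network` class and its methods are hypothetical and only show how removing and restoring a link changes what the modeled network can do.

```python
from collections import deque

class Network:
    """Toy network description: nodes connected by bidirectional links."""
    def __init__(self):
        self.links = {}                       # node -> set of neighbors

    def add_link(self, a, b):
        self.links.setdefault(a, set()).add(b)
        self.links.setdefault(b, set()).add(a)

    def remove_link(self, a, b):
        """Emulate a link failure by deleting the link."""
        self.links[a].discard(b)
        self.links[b].discard(a)

    def reachable(self, src, dst):
        """Breadth-first search over the links that are currently present."""
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                return True
            for nxt in self.links.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

net = Network()
for a, b in [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]:
    net.add_link(a, b)
before = net.reachable("A", "D")
net.remove_link("C", "D")                     # link failure during the run
after_failure = net.reachable("A", "D")
net.add_link("C", "D")                        # recovery
after_recovery = net.reachable("A", "D")
```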
NEST has been used to evaluate IPLS (a distributed Incremental Position Location System), topology 
recognition and broadcasting connectivity tables for the ARPAnet and Internet, various dynamic load-balancing 
schemes, and an experimental multiprocessor operating system.
DEPEND — A System Dependability Analysis Environment
DEPEND is an integrated design and fault injection environment [Goswami92]. It provides facilities to model 
fault-tolerant architectures and conduct extensive fault injection studies. It is ideally suited for evaluating 
specific fault-tolerant mechanisms, detailed fault scenarios such as latent errors, and software behavior due 
to hardware faults. It is a functional, process-based [Kobayashi78], [Schwetman86] simulation tool. The 
system behavior is described by a collection of processes that interact with one another. A process-based
approach was selected for several reasons. It is an effective way to model system behavior, repair schemes, 
and system software in detail. It facilitates modeling of intercomponent dependencies, especially when the 
system is large and the dependencies are complex, and it allows actual programs to be executed within the 
simulation environment. Both hierarchical and hybrid simulation techniques have been used in DEPEND.
DEPEND exploits the properties of the object-oriented paradigm, specifically, modular decomposition 
and modular composability [Meyer88], to model different levels of components and to implement a variety 
of fault models. Modular decomposition consists of breaking down a problem into small elements, whereas 
modular composition favors production of elements that can be freely combined with each other to provide 
new functionality. If, for instance, the fault injection process is divided into two elements or objects: an 
object that determines when to inject and interrupt the system, and an object that determines the response 
to a fault (the fault model), then the two criteria are met. The first object, called a key object, is common 
to all fault injection methods. It encapsulates the various mechanisms used to determine the arrival time 
of a fault and interrupt the system. The second object, the fault model, is specific to the component being 
injected and to the type of fault injection study. The two are combined via function calls. Thus, by specifying 
different fault model objects, one key object, such as the injector object, can be used for all types of fault 
injections. Key objects are designed to be parameterized. That is, the user can specify various fault arrival 
distributions or trace files. This same approach is used to model components that are similar but not 
identical. Common aspects are encapsulated in an object that then invokes other objects to provide more 
specific functionality. Furthermore, because users can specify specific behaviors (e.g., their own fault model 
objects), the tool is not limited to any predefined set of fault models or component types.
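The split between a key object and a fault model object can be sketched as follows. DEPEND itself is a C++ library; this Python sketch does not reproduce its classes, and the names `Injector`, `corrupt_message`, and `destroy_message` are hypothetical. It only shows how one injector can be reused with different fault models through a function call.

```python
import random

class Injector:
    """Key object: decides when faults arrive; knows nothing about their effect."""
    def __init__(self, ttf_sampler, fault_model, rng):
        self.ttf_sampler = ttf_sampler        # parameter: fault arrival distribution
        self.fault_model = fault_model        # parameter: component-specific response
        self.rng = rng

    def run(self, target, horizon):
        """Inject faults into target until simulated time exceeds horizon."""
        now, count = 0.0, 0
        while True:
            now += self.ttf_sampler(self.rng)
            if now > horizon:
                return count
            self.fault_model(target, self.rng)
            count += 1

def corrupt_message(link, rng):
    """Fault model for a link: flip one bit of an 8-bit payload."""
    link["payload"] ^= 1 << rng.randrange(8)

def destroy_message(link, rng):
    """Alternative fault model: the message is lost."""
    link["payload"] = None

rng = random.Random(7)
link = {"payload": 0b10101010}
injector = Injector(lambda r: 10.0, corrupt_message, rng)   # constant time to fault
n = injector.run(link, horizon=35.0)                         # faults at t = 10, 20, 30
```

Swapping `corrupt_message` for `destroy_message` changes the fault model without touching the injector, which is the reuse the text describes.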
A library of objects that provide the skeletal foundation necessary to model an architecture and conduct 
simulated fault injection experiments is provided. This reduces the time and effort needed to build simulation 
models. In addition to decomposition, composition, and parameterization, the concept of inheritance 
[Meyer88] makes it possible to provide a library with a minimum set of objects that can be readily specialized 
to model a wide range of architectures and fault injection experiments. With inheritance, users can reuse 
the properties of an existing object and develop more specialized objects with minimum effort. There are 
two types of objects: elementary objects and complex objects. Elementary objects provide basic functions 
such as injecting faults and compiling statistics. Complex objects created from several elementary objects 
simulate fundamental components found in most fault-tolerant architectures such as CPUs, self-checking 
processors, N-modular redundant processors, communication links, voters, and memory.
The steps required to develop and execute a model are shown in Figure 0.11. The user writes a control 
program in C++ using the objects in the DEPEND library. The program is then compiled and linked with 
the DEPEND objects and the run-time environment. The model is executed in the simulated parallel run-time 
environment. Here, the assortment of objects, including the fault injectors, CPUs, and communication 
links, execute simultaneously to simulate the functional behavior of the architecture. Faults are injected and 
repairs are initiated according to the user’s specifications, and a report containing the essential statistics of 
the simulation is produced.
DEPEND allows users to specify different fault models. In addition, DEPEND provides default fault 
routines for each object to minimize user design time. For instance, the default fault model for a communication 
medium simulates the effects of a noisy communication channel. Fields in the messages passed along 
the communication link are actually corrupted or the message is destroyed.
The fault injector is a fundamental object in DEPEND. It encapsulates the mechanism for injecting faults. 
To use the injector, a user specifies the number of components, the TTF distribution for each component, and 
the fault subroutine that specifies the fault model. In addition to user-specified distributions, the injector 
provides constant time, exponential, hyperexponential, and Weibull distributions. The injector also 
provides a workload-based injection scheme that varies the fault arrival rate based on a specified workload. 
The user provides a workload function, a set of workload states, and an exponential fault arrival rate for 
each state. For example, the workload function may be the utilization of a server. With this approach the 
fault arrival rate will fluctuate with the utilization of the server. The fault injector will periodically poll the 
workload function to update a state transition diagram to maintain a history of the workload behavior. This 
history is used to inject a large number of faults during peak workload conditions and fewer faults when 
the workload is low. This technique models the workload/failure dependency observed in [Butner80] and 
[Castillo81].
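A simplified sketch of workload-based injection is given below. It assumes three workload states defined by utilization thresholds and an illustrative exponential rate per state; these values, the function names, and the polling logic are assumptions. The sketch also omits the state transition history that the injector is described as maintaining.

```python
import math
import random

def workload_state(utilization):
    """Map a utilization in [0, 1] to a discrete workload state (thresholds assumed)."""
    if utilization < 0.3:
        return "low"
    if utilization < 0.7:
        return "medium"
    return "high"

# Illustrative exponential fault arrival rates per state (faults per time unit).
RATES = {"low": 0.001, "medium": 0.005, "high": 0.02}

def inject_by_workload(workload_fn, horizon, poll_interval, rng):
    """Poll the workload periodically and inject at the current state's rate."""
    fault_times = []
    t = 0.0
    while t < horizon:
        rate = RATES[workload_state(workload_fn(t))]
        # Probability of at least one exponential arrival in this polling interval.
        if rng.random() < 1.0 - math.exp(-rate * poll_interval):
            fault_times.append(t)
        t += poll_interval
    return fault_times

def daily_utilization(t):
    """Hypothetical workload function: busy in the second half of each 24-unit day."""
    return 0.9 if (t % 24) >= 12 else 0.1

faults = inject_by_workload(daily_utilization, horizon=24 * 2000,
                            poll_interval=1.0, rng=random.Random(3))
busy = sum(1 for t in faults if (t % 24) >= 12)
idle = len(faults) - busy
```

With these rates, far more faults land in the busy half of the day than in the idle half, which is the dependency the scheme is meant to reproduce.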
Figure 0.11: The DEPEND Environment
In addition to executing actual C++ and C programs, DEPEND provides an abstract software modeling 
environment to simulate program behavior during the early design stages when actual code does not exist. 
The environment represents application programs by decomposing them into graph models consisting of a 
set of nodes, a set of edges that probabilistically determine the flow from node to node, and a mapping of the 
nodes to memory. The graph models are then mapped to virtual memory and executed while errors are injected 
into the program’s memory space. The environment provides application-dependent parameters, such 
as detection and propagation times, and permits meaningful application-dependent evaluation of function- 
and system-level error detection and recovery schemes. This environment has been used to analyze memory-scrubbing 
schemes within the context of application programs [Goswami93c]. The application-dependent 
coverage values obtained were compared with those obtained by traditional schemes that assume uniform or 
random memory access patterns. The coverage values obtained using the traditional approach were found to 
be up to 100% larger than those obtained with the software graph model. The findings demonstrate the need 
for application-dependent evaluation—especially when evaluating the dependability of application-specific 
systems.
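The graph-model idea can be illustrated with a small sketch. The graph, the edge probabilities, and the node-to-memory mapping below are all hypothetical; the sketch only shows why the chance that an injected memory error is ever touched depends on the application's control flow rather than on a uniform access assumption.

```python
import random

# Hypothetical program graph: node -> list of (next_node, probability).
EDGES = {
    "init": [("loop", 1.0)],
    "loop": [("loop", 0.9), ("rare", 0.05), ("exit", 0.05)],
    "rare": [("loop", 1.0)],
    "exit": [],
}
# Assumed mapping of nodes to the memory words they touch.
MEMORY = {"init": {0, 1}, "loop": {2, 3}, "rare": {4, 5, 6, 7}, "exit": {1}}

def run_once(faulty_word, rng, max_steps=200):
    """Walk the graph; return the step at which the faulty word is touched, or None."""
    node = "init"
    for step in range(max_steps):
        if faulty_word in MEMORY[node]:
            return step
        successors = EDGES[node]
        if not successors:
            return None                       # program ended without touching the word
        nodes, probs = zip(*successors)
        node = rng.choices(nodes, weights=probs)[0]
    return None

def activation_probability(word, trials, rng):
    """Fraction of runs in which an error in the given word is accessed."""
    hits = sum(run_once(word, rng) is not None for _ in range(trials))
    return hits / trials

rng = random.Random(11)
hot = activation_probability(2, 2000, rng)    # word used in the main loop
cold = activation_probability(6, 2000, rng)   # word used only on the rare path
```

An error in the loop's word is always reached, while an error in the rarely used word is reached in only about half of the runs for this particular graph.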
DEPEND has been applied to evaluate several computer systems. In [Goswami91] and [Goswami92], 
DEPEND was used to simulate the UNIX-based Tandem Integrity S2 fault-tolerant system and evaluate 
how well it handles near-coincident errors caused by correlated and latent faults. Issues such as memory 
scrubbing, reintegration policies, and workload-dependent repair time were evaluated. The accuracy of the 
simulation model was validated by comparing the results of the simulations with measurements obtained from 
fault injection experiments conducted on a production Tandem Integrity S2 machine. DEPEND has also 
been used to study the CM5 connection machine, the Parsytec high-performance computer being developed 
by the European Esprit project, the Space Station Data Management System, and other systems.
REACT — A Software Testbed for Analyzing Multiprocessors
The REliable Architecture Characterization Tool (REACT) is a software testbed for analyzing a variety 
of fault-tolerant multiprocessor systems [Clark93a] [Clark93b]. It was developed to meet the need for a 
generalized simulation package that can evaluate high-level dependability metrics more precisely than feasible 
with analytical approaches. Most other fault-injection tools are oriented toward simulating systems over very 
short intervals of time, to measure the effects of error propagation or the latency and coverage of detection 
mechanisms. REACT, on the other hand, has the capability to evaluate reliability or availability, as functions 
of time, and analyze failure modes over the operational life of a system.
REACT abstracts a system at the architectural level and performs life testing through simulated fault- 
injection to measure dependability. This involves conducting a certain number of experiments or trials in 
which the operation of an initially fault-free system is simulated, while randomly occurring faults are injected 
into its components. Systems are operated until they either fail or reach a particular censoring time. Failure 
statistics are collected during each trial which are later aggregated over the entire simulation run in order to 
compute system dependability metrics.
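Simulated life testing with censoring can be sketched as below. The system model is an assumption made for illustration: three unrepaired modules with Weibull fault times, with the system failing at the second module fault. REACT's own models are considerably more detailed.

```python
import random

def run_trial(rng, censor_time, scale=1000.0, shape=1.0):
    """Simulate one trial of an initially fault-free, unrepaired three-module system.

    Returns the failure time, or None if the trial reached the censoring time.
    """
    fault_times = sorted(rng.weibullvariate(scale, shape) for _ in range(3))
    failure_time = fault_times[1]     # assumed rule: second module fault defeats the majority
    return failure_time if failure_time <= censor_time else None

def reliability(outcomes, t):
    """Estimate R(t) as the fraction of trials still operating at t (t <= censoring time)."""
    alive = sum(1 for f in outcomes if f is None or f > t)
    return alive / len(outcomes)

rng = random.Random(42)
outcomes = [run_trial(rng, censor_time=2000.0) for _ in range(5000)]
r_500 = reliability(outcomes, 500.0)
r_1000 = reliability(outcomes, 1000.0)
```

With shape 1.0 the fault times are exponential, so the estimate at t = 500 can be checked against the closed form 3e^(-2t/1000) - 2e^(-3t/1000), which is about 0.657.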
REACT can analyze multiprocessor systems that utilize N-modular redundancy, duplication and comparison, 
standby sparing or error-control coding to achieve fault tolerance. Figure 0.12 depicts the system 
model employed by REACT. This class of architectures consists of one or more processor modules (P) interconnected 
via buses (B) to one or more memory modules (M) through a block of fault tolerance mechanisms. 
The fault tolerance mechanisms supply the hardware necessary to detect, correct or mask errors during 
memory accesses, and reconfigure the system when modules fail. Pre-defined, functional level abstractions 
are used to model individual processors and memories. The number of processor and memory modules, their 
interconnection, and the operation of the fault tolerance mechanisms are all user-specified. This framework 
provides the flexibility needed to represent many different architectures, without the complexity of developing 
custom simulation models for each.
Figure 0.12: Class of Architectures REACT Can Analyze
REACT can automatically inject both permanent and transient faults into the processors, memories and 
the fault tolerance mechanisms of a system. Fault occurrence times are sampled from a Weibull distribution. 
Behavior of each type of faulty component within the system is independently governed by a stochastic 
model that accounts for both fault and error latency. These models were derived, in part, from the results 
of other low-level fault-injection experiments. Repair times for failed components are assumed to have a 
log-normal distribution after a fixed logistics delay. The time required to reintegrate a repaired component 
back into the system and to reboot the system after a critical failure are constant and user-specified.
A synthetic workload is assumed for which processors continually perform instruction cycles consisting 
of several possible memory references and the simulated execution of an instruction. Real code and data 
are not used by REACT, but errors are allowed to propagate throughout the system as if the application 
program was actually being executed. The workload model is specified by a mean instruction execution rate, 
the probabilities of performing a memory read and write access per instruction, plus a locality-of-reference 
model that determines which locations are accessed.
An example of how faults are modeled in REACT is illustrated by the processor fault model. Processor 
operation in the presence of faults is governed by the stochastic model shown in Figure 0.13. Ovals in this 
diagram represent valid processor states, while arcs indicate what state changes may take place. Dashed 
boxes are temporary nodes that are passed through when moving into a valid state. All processors initially 
start in the Fault Free state. Permanent faults occur at Weibull($\lambda_{pperm}$, $\alpha_{pperm}$) distributed interarrival 
times and become Latent. No errors are generated by Latent processor faults. A permanent fault will remain 
Latent for a period of time specified by a Weibull($\lambda_{latent}$, $\alpha_{latent}$) distribution and then become Active. A 
processor is considered to be failed and will produce errors during every instruction cycle when it is in the 
Active state due to a permanent fault. It can only leave this state and return to the Fault Free state if it is 
repaired and resynchronized.
Figure 0.13: Processor Fault Model
Transient processor faults occur at Weibull($\lambda_{ptran}$, $\alpha_{ptran}$) distributed interarrival times. A transient 
is assumed to generate a single logic upset that must be latched to become Latent. If not latched with 
probability $P_{latch}$, the processor will immediately return to the Fault Free state. A Latent transient fault 
will upset the control flow of an executing program with probability $P_{upset}$, after a Weibull($\lambda_{latent}$, $\alpha_{latent}$) 
distributed period of time. This will cause the processor to diverge from the correct instruction stream 
and enter the Control Upset state. It is assumed that the processor will remain in the Control Upset state, 
producing errors during every instruction cycle, until it is resynchronized. If the Latent transient does not 
result in a control flow upset, it will become Active after a Weibull($\lambda_{latent}$, $\alpha_{latent}$) distributed length of 
time. An Active transient fault is assumed to produce errors during one instruction cycle and then move into 
the Dormant state. With termination probability $P_{term}$, the processor will immediately return to the Fault 
Free state. Otherwise, it will hold in the Dormant state for a period of time that is Weibull($\lambda_{active}$, $\alpha_{active}$) 
distributed. After the holding time in the Dormant state expires, the transient will again become Active 
and repeat the same sequence of state changes. This cyclic behavior is intended to represent multiple errors 
being sourced from a single fault that eventually disappears. Register and cache faults may exhibit such 
behavior when they are read several times before being overwritten.
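The state changes followed by a single transient fault can be sketched as a small state machine. The sketch ignores the Weibull holding times and tracks only the sequence of states. It reads the latch probability as the probability that the upset is latched, which is an interpretation of the text, and the function name and parameters are hypothetical.

```python
import random

def simulate_transient(rng, p_latch, p_upset, p_term, max_cycles=10_000):
    """Follow one transient fault; return (final_state, error_cycles).

    error_cycles is None for Control Upset, where errors continue until
    the processor is resynchronized.
    """
    if rng.random() >= p_latch:
        return "Fault Free", 0                # upset was never latched
    # Latent: the fault either upsets control flow or becomes Active.
    if rng.random() < p_upset:
        return "Control Upset", None
    errors = 0
    for _ in range(max_cycles):
        errors += 1                           # Active: errors during one instruction cycle
        if rng.random() < p_term:
            return "Fault Free", errors       # fault disappears
        # Otherwise the fault is Dormant for a holding time, then Active again.
    return "Dormant", errors

rng = random.Random(5)
samples = [simulate_transient(rng, 1.0, 0.0, 0.25) for _ in range(4000)]
mean_errors = sum(e for _, e in samples) / len(samples)
```

With a termination probability of 0.25, one fault produces about four error cycles on average, which illustrates how a single fault can source multiple errors before it disappears.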
When a processor reads an error from memory (through the fault tolerance mechanisms), its state may 
become corrupted. All input errors will be latched internally, so they are assumed to have an immediate 
effect on the processor. Based on the probability $P_{upset}$, an input error will put the processor in either the 
Control Upset or Active state. The input error is assumed to behave like a transient fault once the processor 
has entered either of these states.
Processor errors can affect both the addresses and (write) data generated during an instruction cycle. The 
probabilities of an error affecting only an address ($P_{addr\,error}$) or data ($P_{data\,error}$) are specified by the user. 
Addresses and data are simultaneously affected by an error with probability $(1 - P_{addr\,error} - P_{data\,error})$. 
An erroneous address is assumed to access a random memory location. Erroneous data take on a random 
value.
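The three-way split of error effects can be sketched as follows. The function names, the memory size, and the word width are assumptions made for illustration.

```python
import random

def error_effect(rng, p_addr_error, p_data_error):
    """Pick which outputs an error corrupts: 'address', 'data', or 'both'."""
    if not 0.0 <= p_addr_error + p_data_error <= 1.0:
        raise ValueError("probabilities must sum to at most 1")
    u = rng.random()
    if u < p_addr_error:
        return "address"
    if u < p_addr_error + p_data_error:
        return "data"
    return "both"                             # probability 1 - p_addr_error - p_data_error

def corrupt_access(address, data, effect, rng, memory_size=1024, word_bits=32):
    """Apply the effect: a random memory location and/or a random data value."""
    if effect in ("address", "both"):
        address = rng.randrange(memory_size)
    if effect in ("data", "both"):
        data = rng.getrandbits(word_bits)
    return address, data

rng = random.Random(9)
effect = error_effect(rng, p_addr_error=0.4, p_data_error=0.4)
bad_address, bad_data = corrupt_access(0x10, 0xCAFE, effect, rng)
```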
The effectiveness of REACT was demonstrated through the analysis of several alternative multiprocessor 
architectures. Specifically, two dependability tradeoffs associated with triple-modular redundant (TMR) 
systems were investigated. Processor and memory modules are triplicated in a TMR system, and a majority 
voter is used to mask erroneous outputs from any one processor or memory at a time. The performance 
penalty of voting and the expense of triplicating modules, especially memory, are the major drawbacks of a 
TMR design. The first study explored the reliability-performance tradeoff made by voting unidirectionally, 
instead of bidirectionally, on either memory read or write accesses [Clark92]. The second study examined the 
reliability-cost tradeoff made by duplicating, rather than triplicating, memory modules and comparing their 
outputs via error detecting codes [Clark93b]. The effects of different failure rates, fault types, and workload 
conditions on dependability were considered. Both studies showed that in many cases, little reliability is 
sacrificed for potentially large performance increases or cost reductions, in comparison to the traditional 
TMR design.
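Majority voting in a TMR system can be sketched with a bitwise voter. The helper names are hypothetical, and the sketch does not model the unidirectional or bidirectional voting policies studied in the cited work.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three words: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)

def disagreeing_modules(a, b, c):
    """Indices of modules whose output differs from the voted word."""
    voted = majority_vote(a, b, c)
    return [i for i, word in enumerate((a, b, c)) if word != voted]

good = 0b1100
voted_single = majority_vote(good, good, 0b0000)       # one faulty module is masked
voted_mixed = majority_vote(good ^ 0b01, good ^ 0b10, good)
suspects = disagreeing_modules(good ^ 0b01, good ^ 0b10, good)
```

Because the vote is taken bit by bit, errors from two modules are still masked as long as they fall in different bit positions, as in the `voted_mixed` case.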
The MITRE Reliability and Maintainability Center has developed a simulation program based on REACT. 
The simulator was used to evaluate availability and other reliability metrics for a proposed Navy fixed, 
very low frequency transmitter that could not be accurately modeled using traditional analytical approaches.
MEFISTO — Fault Injection into VHDL Models
The growing use of VHDL in the development process of digital systems and its inherent abstraction capabilities 
[Dewey92] make it a privileged language to support the integration of fault injection as early as 
possible and throughout the successive validation phases of the design process of fault-tolerant systems in 
a coherent simulation framework. The Multi-level Error/Fault Injection Simulation Tool (MEFISTO) is 
an integrated environment for applying fault injection into VHDL simulation models encompassing various 
levels of abstraction. This tool is the result of an ongoing collaborative research effort between LAAS-CNRS 
in France and Chalmers University in Sweden [Jenn94] and is aimed at developing and using an integrated 
environment that supports the application of fault injection in VHDL models.
MEFISTO is intended for 1) estimating the coverage of fault tolerance mechanisms, 2) investigating 
mechanisms for mapping results from one level of abstraction to another, and 3) validating fault and error 
models applied during fault injection experiments carried out on the implementation of a fault-tolerant 
system (e.g., software implemented or pin-level fault injection).
MEFISTO supports several techniques for injecting faults into VHDL models. Specific components called 
saboteurs can be inserted in the VHDL model of the target system, or some of the system’s components or 
processes can be mutated. MEFISTO can also use the command language of the underlying simulation engine 
(the VantageSpreadsheet environment) to modify the variables and signals of the target VHDL model. Users 
define and execute a fault injection campaign, which is made up of a series of fault injection experiments. A
campaign consists of three phases: setup, simulation, and data processing.
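The saboteur idea and the three campaign phases can be sketched in Python rather than VHDL. The component, the saboteur function, and the fault list below are hypothetical stand-ins and do not reflect MEFISTO's actual interfaces.

```python
def component(a, b):
    """Stand-in for a simulated component: only the low nibble of `a` is used."""
    return ((a & 0x0F) + b) & 0xFF

def saboteur(signal_value, active, bit):
    """Pass the signal through unchanged unless active; then flip one bit."""
    return signal_value ^ (1 << bit) if active else signal_value

def experiment(a, b, bit):
    """Compare a fault-free (golden) run with a run that has the saboteur active."""
    golden = component(saboteur(a, False, bit), b)
    faulty = component(saboteur(a, True, bit), b)
    return golden != faulty

# Setup: build the fault list. Simulation: run each experiment.
# Data processing: summarize how many injected faults produced an error.
fault_list = [(a, b, bit) for a in (3, 200) for b in (5, 77) for bit in range(8)]
results = [experiment(*fault) for fault in fault_list]
error_fraction = sum(results) / len(results)
```

In this toy component, faults on the upper four bits of the input are masked, so half of the injected faults produce no error.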
The first set of experiments carried out with MEFISTO consisted of injecting transient faults on two 
architectures of a simple 32-bit processor: one behavioral and one structural. The structural (RTL) model 
was composed mainly of a finite state control machine, an ALU, a program counter, a register file, and 
several buffers and latches. The behavioral model consisted mainly of a VHDL process containing a large 
case statement that initiates the bus cycle appropriate to the operation code of the fetched instruction. 
Two fault injection campaigns with two different workloads (bubble-sort and heap-sort sorting programs) 
were run on each model. These experiments proved useful in 1) studying the impact of the fault injection 
techniques (i.e., signal and variable manipulation) on the experimental results and 2) analyzing the mapping 
between two sets of fault classes chosen at the two levels of abstraction considered.
Remaining Simulation Issues
While there has been much progress in developing functional simulations for designing and evaluating dependable 
systems, several important questions remain unanswered. The main question is that of fidelity to 
the actual system. How much more accurate is a simulation than an analytical model? For example, to what 
degree of detail must we model detection and recovery mechanisms in order to evaluate their effectiveness 
using simulation? Analytic tools use probabilistic models to represent the behavior of the system; in essence 
the effect of a fault on the system is predefined by a set of probabilities and distributions. Simulation tools 
vary in range from using stochastic modeling to permitting behavioral modeling, which does not require that 
the effect of the faults be predefined. From both a research and a user perspective, there is a need for methods 
to establish the fidelity of the simulation in the absence of an actual system. Other unresolved issues include: 
how to determine the accuracy of the assumed fault models with respect to fault behavior and how to 
establish this accuracy, how to model software faults, and how to model workloads and interactions.
0.4 PROTOTYPE PHASE
During the prototype phase of a fault-tolerant system, physical fault injection is used to evaluate fault, error, 
failure, and fault tolerance characteristics of the developed system. Fault injection is usually applied to fault- 
tolerant systems or components because the injected faults, if activated, would almost always crash a system 
without fault-tolerant mechanisms. However, fault injection can be used in non-fault-tolerant systems if 
the system control flow can be well traced and the system state information can be obtained when a crash 
occurs.
Figure 0.14: Components in a Fault Injection Environment
A fault injection environment typically contains the following components: target system, controller and 
monitor, fault injector, data collector, and data analyzer, as shown in Figure 0.14. The target can be a VLSI 
chip, a computer system, or a networked system.
Table 0.3: Categories of Fault Injections
Category   | Hardware-Implemented | Software-Implemented | Radiation-Induced
Approach   | Inject faults to IC pins by hardware instrumentation | Inject faults to components by special software | Inject faults by applying radiation rays to target
Advantages | No disturbance to workload; high time resolution | Wide choices of FI locations/models; low cost | Can induce transient faults inside IC evenly
Studies    | FTMP [Shin84], [Shin86]; FTMP [Finelli87]; MESSALINE [Arlat90a] | Accelerated Injection [Chillarege89] |
Faults are injected into the target while the system is exercised using benchmarks or synthetic workloads 
that emulate the system’s operational environment. The controller is a software program, on the target 
or on another computer, that controls the overall fault injection experiment. The fault injector contains 
information on the type of fault(s) to be injected and their location. It also contains appropriate hardware 
or software logic to ensure that faults are injected into the right component at the right time. The monitor 
keeps track of normal and abnormal executions of the workload and initiates data collection whenever 
necessary. The data collector and analyzer perform on-line data collection and off-line data processing and 
analysis, respectively.
The fault injector can be implemented by hardware, by software, or via radiation. Correspondingly, 
fault injection can generally be divided into three types: hardware-implemented, software-implemented, and 
radiation-induced. Table 0.3 lists features and representative studies in these categories. The monitor can 
be implemented by hardware, software, or by a combination of hardware and software (hybrid). If the fault 
injector is implemented by software and the monitor is implemented by hardware or by both hardware and 
software, the system is called a hybrid fault injection environment. Before discussing in detail each type of 
fault injection, we give a brief review of the formal methodology described in [Arlat90a].
In [Arlat90a], a formal methodology is proposed to characterize fault injection. The methodology consists 
of a tuple of four sets: FARM, where F is a set of faults, A is a set of activations, R is a set of readouts, and 
M is a set of derived measures. F and A constitute the input domain, and R and M constitute the output 
domain. The methodology can be applied to three levels of models: 1) axiomatic or analytical models, such as 
reliability block diagrams, fault trees, Markov chains, and Petri nets, 2) empirical models which incorporate 
more detailed behavioral and structural descriptions that may require the simulation approach to process 
them, and 3) physical models which correspond to hardware and/or software prototypes. The physical 
prototypes can be software-only, hardware-only, or hardware-and-software prototypes. 
Physical models are more related to the fault injections discussed in this section.
In physical models, the F set is based on physical faults. In the case of software-only and hardware- 
only prototypes, the A set consists of a set of test patterns for exercising the injected faults. In the case 
of hardware-and-software prototypes, the A set may vary significantly, depending on the target systems. 
For general systems, representative programs can be used as a solution. A fault injection experiment is 
characterized by a triple (f, a, r), where f is a fault in F, a is an activation trajectory in A, and r is an 
experiment outcome in R. The measures in M can be obtained from a series of fault injection experiments.
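How a measure in M can be derived from a series of (f, a, r) experiments is sketched below. The readout labels, the example faults, and the choice of coverage with a normal-approximation confidence interval are assumptions made for illustration and are not taken from [Arlat90a].

```python
import math
from collections import namedtuple

# One experiment is a triple (f, a, r): fault, activation, readout.
Experiment = namedtuple("Experiment", ["fault", "activation", "readout"])

def coverage(experiments):
    """Derived measure: fraction of experiments whose readout is 'detected',
    with a normal-approximation 95% confidence half-width."""
    n = len(experiments)
    detected = sum(1 for e in experiments if e.readout == "detected")
    c = detected / n
    half_width = 1.96 * math.sqrt(c * (1.0 - c) / n)
    return c, half_width

# Hypothetical campaign: 10 injected faults, 8 of which were detected.
experiments = [
    Experiment(f"stuck-at-0 on pin {i}", "sort benchmark",
               "detected" if i < 8 else "undetected")
    for i in range(10)
]
c, half_width = coverage(experiments)
```

With only ten experiments the interval is wide (about 0.8 plus or minus 0.25), which is one reason such measures are derived from a long series of injections.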
0.4.1 Hardware-Implemented Fault Injection
Hardware-implemented fault injection is a method of introducing faults in the hardware of a computer system 
with the aid of additional hardware instrumentation. The method is well suited for studying dependability
characteristics that require high time resolution and cannot be easily achieved by other fault injection methods, 
such as fault latency in the CPU. Two main techniques are used to accomplish hardware-implemented 
fault injections.
The first approach involves the use of active probes attached to the desired hardware injection points. The 
currents through these injection points can be altered, thereby influencing the corresponding logic values. 
The types of faults attainable with probes are usually limited to stuck-at faults. However, it is also possible 
to introduce bridging faults by placing the probes across multiple hardware points. Care must be taken with 
the use of active probes that force values onto injection points, because damage to the target hardware can 
result from an inordinate amount of current.
The second technique involves the insertion of additional hardware into the target computer system. 
Whereas the first method uses active probes that are external to the target system, this method introduces 
additional hardware that becomes part of the target system. The most common approach requires the 
insertion of a socket between a chip and the circuit board. This socket is able to inject stuck-at faults 
or open faults, where the chip pin is essentially tri-stated. In addition, more complex logical faults can be 
forced onto these pins. For instance, the pin signals can be inverted, ANDed, or ORed with adjacent pin 
signals or even with previous signals on the same pin.
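The logical faults that a socket can force onto a pin can be sketched as functions applied to a vector of pin values. The function names are hypothetical, and the choice of the next-higher pin (with wraparound) as the "adjacent" pin is an assumption.

```python
def stuck_at(value):
    """Fault that forces the pin to a fixed logic value."""
    return lambda pins, i, prev: value

def inverted(pins, i, prev):
    """Fault that inverts the pin signal."""
    return pins[i] ^ 1

def and_adjacent(pins, i, prev):
    """Fault that ANDs the pin with its adjacent pin (assumed: next index, wrapping)."""
    return pins[i] & pins[(i + 1) % len(pins)]

def or_previous(pins, i, prev):
    """Fault that ORs the pin with the previous signal on the same pin."""
    return pins[i] | prev[i]

def inject(pins, prev, pin_index, fault):
    """Return the pin values seen by the circuit with one pin faulted."""
    out = list(pins)
    out[pin_index] = fault(pins, pin_index, prev)
    return out

pins = [1, 0, 1, 1]       # current pin values
prev = [0, 1, 0, 0]       # values on the same pins in the previous cycle
faulted = inject(pins, prev, 0, stuck_at(0))
```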
In theory, the domain of possible injection locations is limited only by the physical constraints of the 
target system that prevent the introduction of probes or other hardware. Since the target system is usually 
a complete prototype computer system, fault injection below the chip pin level is impractical. Thus, most 
injections focus on the pins of chips. Also, active probes can be attached to certain circuit board locations, 
such as buses or other signal lines.
In addition to the range of possible injection locations, a major concern of any fault injection environment 
is what fault types or models are available. We have already discussed some types of faults achievable with 
probes and sockets: stuck-at, open, bridging, or complex logical functions. Another important aspect of fault 
types is the duration of the fault, which can be either permanent, transient, or intermittent. Permanent 
faults are simply held on the injection point until an error detection occurs. In contrast, transient faults are 
placed on the injection point only for an active period, after which they are removed. Thus, the possibility 
exists that the transient fault may never even be latched into a chip (i.e., the fault never produces an error), 
especially if the active period is less than the system clock period on a synchronous machine. Intermittent 
faults are injected in the same manner as transient faults, but they are also repeated, either randomly or 
according to some function. Both injection methods discussed previously are capable of creating any of the 
three temporal fault types.
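The three temporal fault types can be summarized in a short sketch. It is a simplified model, not taken from any injector described here: the parameter names (`start`, `active`, `period`) are hypothetical, intermittent faults repeat at a fixed period only, and the latching check assumes sampling edges at integer multiples of the clock period while ignoring setup and hold times.

```python
import math

def fault_active(kind, t, start, active=1.0, period=5.0):
    """Tell whether the fault is being forced onto the injection point at time t."""
    if t < start:
        return False
    if kind == "permanent":
        return True                          # held until detection
    if kind == "transient":
        return t < start + active            # one active period only
    if kind == "intermittent":
        # the active period repeats every `period` time units
        return (t - start) % period < active
    raise ValueError(f"unknown temporal type: {kind}")

def transient_can_be_latched(start, active, clock_period):
    """True if the active period overlaps at least one sampling clock edge."""
    first_edge = math.ceil(start / clock_period) * clock_period
    return first_edge < start + active
```

Under these assumptions, a transient whose active period is at least one clock period always overlaps an edge, while a shorter one may fall entirely between two edges and never produce an error.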
In the following section, we will discuss two representative hardware-implemented fault injection 
environments: FTMP [Lala83] and MESSALINE [Arlat90a].
FTMP
Several studies centered around the fault-tolerant multiprocessor (FTMP) fault injection instrumentation 
[Lala83], [Shin86], [Finelli87]. FTMP is a computer architecture that evolved over a 10-year period in 
connection with several critical aerospace applications. The architecture was designed to have a failure 
rate of the order of 10^-10 per hour. The basic blocks of the architecture are independent processor-cache 
modules and memory modules that communicate through redundant buses. The modules are dynamically 
grouped into several TMR triads or assigned to spare status. Jobs can be scheduled to any processor triad. 
All transactions between processor modules and memory modules in a triad are voted bit-by-bit. When a 
fault occurs, the faulty module is isolated and the faulty triad reconfigured. Fault detection, diagnosis, and 
recovery are handled in such a way that application programs are not involved.
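Bit-by-bit voting within a triad can be sketched as a bitwise majority function. This is a generic illustration of triple modular redundancy voting, not the FTMP hardware design; the function names are hypothetical.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)

def disagreeing_modules(a: int, b: int, c: int) -> list:
    """Indices (0, 1, 2) of the modules whose word differs from the voted word."""
    voted = vote(a, b, c)
    return [i for i, word in enumerate((a, b, c)) if word != voted]

# One module delivers a corrupted word; the vote masks it and exposes the module.
assert vote(0b1010, 0b1010, 0b0110) == 0b1010
assert disagreeing_modules(0b1010, 0b1010, 0b0110) == [2]
```

The second function reflects the idea that a disagreeing module is a candidate for isolation, after which the triad can be reconfigured.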
Figure 0.15 shows the diagram of the FTMP fault injection instrumentation developed at the Charles 
Stark Draper Laboratory [Lala83], [Finelli87]. In an FTMP computer, there are several line replaceable units 
(LRUs), each containing a processor, clock generator, power subsystem, and bus interface circuits. LRU 
#3 is constructed for connection of the fault injector. All chips in LRU #3 are connected to sockets that 
allow them to be removed for insertion of the fault injection implant. Each fault injection implant contains 
circuitry that can interrupt and reconnect the pins in the sockets. Several different types of faults, such as 
stuck-at-0 and stuck-at-1 can be injected into the pins by the implants. These implants are controlled by a 
VAX 11/750 computer. A special version of the System Configuration Control (FSCC) program running in
the FTMP communicates with the Fault Injection Software (FIS) running in the VAX 11/750 through one 
of the FTMP I/O ports and a 1553/UNIBUS data link.
[Figure 0.15 labels: (2) FTMP acknowledges; (3) FTMP restores LRU #3; (4) Fault injected; (5) Data from FTMP; FSCC Software]
Figure 0.15: FTMP Fault Injection Environment
Faults are normally injected on one pin at a time. When an injection occurs, the FIS program chooses 
a fault and a pin, applies the fault to the pin, and records the injection time. Once the FTMP detects and 
identifies the fault and reconfigures the system, it sends this information, along with the time of each event, 
back to FIS. Upon receiving the information, FIS removes the fault by restoring the pin to its normal state 
and notifies the FTMP. The FTMP then puts the victim module back into an active state and notifies FIS 
that it is ready for another fault injection. This process is repeated after a random delay.
In the experiments conducted at the Charles Stark Draper Laboratory [Lala83], a total of 21,055 faults 
were injected, and 17,418 (83%) were detected. All of the detected faults were identified correctly, and the 
system subsequently recovered successfully from each of these faults by replacing the faulty module. That 
is, the coverage in the FTMP was 100%, which validated the FTMP architecture and implementation.
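The figures quoted above can be checked with a few lines of arithmetic. The sketch reads the 100% coverage as conditional on detection (every detected fault was identified and recovered from), which is an interpretation of the text rather than a definition given in it.

```python
injected = 21_055
detected = 17_418
recovered = detected    # per the text: every detected fault was recovered from

detection_fraction = detected / injected     # share of injected faults detected
recovery_coverage = recovered / detected     # share of detected faults handled

print(f"{detection_fraction:.1%}")   # 82.7%
print(f"{recovery_coverage:.0%}")    # 100%
```

The detection fraction of 82.7% rounds to the 83% reported for the experiments.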
Another study using the FTMP fault injection instrumentation was reported in [Shin86], with emphasis 
on the investigation of fault latency. Results showed that the hazard rate of fault latency is monotonically 
decreasing. Two distributions with monotonically decreasing hazard rates, Weibull and gamma distributions, 
were then used to fit the experimental results. The study also investigated the effect of fault latency on the 
probability of having multiple faults. It was shown that there exists an optimal fault latency that minimizes 
the multiple fault probability.
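To see what a monotonically decreasing hazard rate looks like, the sketch below evaluates the standard Weibull hazard function h(t) = (k/s)(t/s)^(k-1), which decreases in t whenever the shape parameter k is below 1. The parameter values are illustrative only and are not the values fitted in the study.

```python
def weibull_hazard(t: float, shape: float, scale: float = 1.0) -> float:
    """Hazard rate of a Weibull distribution with the given shape and scale."""
    return (shape / scale) * (t / scale) ** (shape - 1)

times = [0.5, 1.0, 2.0, 4.0]
decreasing = [weibull_hazard(t, shape=0.5) for t in times]   # shape < 1
increasing = [weibull_hazard(t, shape=2.0) for t in times]   # shape > 1, for contrast

assert decreasing == sorted(decreasing, reverse=True)
assert increasing == sorted(increasing)
```

A decreasing hazard rate means that the longer a fault has remained latent, the less likely it is to be detected in the next instant.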
Later, fault injection experiments on the same instrumentation were conducted at the NASA Langley 
Research Center [Finelli87] to investigate two issues: fault sampling methods and fault recovery distributions. 
For each fault injection, two choices must be made: the fault location (pins) and the fault type (e.g., 
stuck-at-1, stuck-at-0, inverted signal). Thus, the possible fault set (the collection of all different injected faults) 
can be very large. Exhaustive fault injection is costly and time consuming. It is necessary to find appropriate 
sampling methods to reduce the time and cost of testing. The study compared the effects (detection behavior) 
of different faults and grouped these faults into several subsets according to the similarity in their effects. 
The results showed that the effects are not homogeneous across the fault set. This indicates that stratified 
sampling methods, based on the fault subsets, should be developed for fault injection. The study also showed 
that the fault recovery time is not exponentially distributed.
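Stratified sampling over fault subsets can be sketched as follows. The sketch uses proportional allocation, which is one common choice and not necessarily the scheme intended by the study; the stratum names and sizes are hypothetical.

```python
import random

def stratified_sample(strata, total, seed=0):
    """Draw faults from each stratum in proportion to the stratum's size.

    strata -- dict mapping a stratum name to the list of faults it contains
    total  -- approximate number of faults to inject overall
    """
    rng = random.Random(seed)
    population = sum(len(faults) for faults in strata.values())
    sample = {}
    for name, faults in strata.items():
        # at least one fault per stratum; rounding may shift the overall total slightly
        n = max(1, round(total * len(faults) / population))
        sample[name] = rng.sample(faults, min(n, len(faults)))
    return sample

# Hypothetical strata: faults grouped by similarity of detection behavior.
strata = {
    "group-A": [("pin", i, "stuck-at-0") for i in range(60)],
    "group-B": [("pin", i, "stuck-at-1") for i in range(30)],
    "group-C": [("pin", i, "inverted") for i in range(10)],
}
chosen = stratified_sample(strata, total=10)
```

With these sizes, the sample holds 6, 3, and 1 faults from the three groups, so each behavior class is represented without injecting all 100 faults.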
MESSALINE
MESSALINE [Arlat90a] is a flexible, pin-level fault injection tool that has been developed at LAAS-CNRS 
in Toulouse, France. The general architecture of MESSALINE and its environment is given in Figure 0.16. 
The injection, activation, and collection modules are implemented in hardware on an Intel 310 microcomputer.
[Figure 0.16 labels: Simulation, Activation, Injection, Collection; Control of the Experiment (Intel 310)]
Figure 0.16: General Architecture of MESSALINE
The fault injection mechanism for MESSALINE uses active probes and socket insertion. Thus, fault types 
such as stuck-at, open, bridging, and complex logical functions can be injected. Because the duration and 
frequency of faults can be controlled, the fault injector can introduce permanent, transient, and intermittent 
faults. Signals collected from the target system can provide feedback to the injector. Also, a device is 
associated with each injection point to sense when and if each fault is activated and produces an error. 
MESSALINE can inject faults at up to 32 injection points simultaneously.
The application of MESSALINE has been shown in two experiments involving 1) a subsystem of a centralized, 
computerized interlocking system (called PAI) for railway control applications and 2) a distributed 
system corresponding to an implementation of the dependable communication system of the ESPRIT Delta-4 
Project.
In the case of the PAI system, permanent stuck-at-0, stuck-at-1, and open circuit faults were injected to 
various memory and CPU chips. The results indicated that CPU errors were more difficult to detect than 
memory errors. The error detection mechanisms were analyzed individually, and it was discovered that the 
diagnosis software accounted for most of the error coverage. The elimination of hardware detection would 
have decreased the overall coverage by less than 3%.
The distributed communication system was injected with intermittent stuck-at-0 and stuck-at-1 faults. 
The actual faults were injected into the Network Attachment Controllers (NAC), which provide the connection 
for each node to the local area network. Results showed that over 67% of all errors caused the injected 
NAC to be correctly identified and extracted. Also, 24% of the injected faults did not cause a detectable error. 
Thus, in over 91% of the injections, the distributed system was able to correctly handle the error. These 
experiments demonstrate the utility and flexibility of the MESSALINE fault injection tool.
0.4.2 Software-Implemented Fault Injection
While hardware-implemented fault injection requires special hardware instrumentation and interface to 
circuits of the target systems, software-implemented fault injection provides a cheap and easy-to-control 
methodology. In software-implemented fault injection, no extra hardware instrumentation is needed, and 
users can choose fault locations in both hardware and software accessible to machine instructions. In addition, 
the software approach allows the emulation of software defects by an appropriate code change. Several
techniques have been proposed to emulate different types of hardware and software faults through 
software-implemented fault injection.
Software-implemented fault injection is achieved by changing the contents of memory or registers, based 
on specified fault models. Hardware faults in the CPU, memory, bus, and network can lead to software errors 
and affect software executions (produce hardware-induced software errors). Injections to emulate these errors 
are implemented as execution of incorrect instructions and access of incorrect data. By software faults, we 
mean software design/implementation defects (e.g., incorrect initialization of a variable or failure to check a 
boundary condition), which may change software states to unexpected states. If software data is corrupted 
by either hardware or software faults, we call these software errors.
At least two related issues need to be addressed for software-implemented fault injection. The first issue 
is which fault models should be used to simulate hardware and software faults. We have discussed hardware 
fault models at the function level in Section 0.3.3. Similar to models for hardware-implemented fault injection, 
models for software-implemented fault injection should be built based on engineering experience and field 
measurements. The second, and related, issue is who owns the memory location or which process is executing 
when a fault is injected into a memory location or a register. In other words, what is the target of the fault 
injection?
Several fault models and implementation techniques are listed in Table 0.4. All these techniques are 
similar in that they change program or memory words. To inject software faults, the text segment needs to 
be modified. Some typical software faults are: a variable is used before it is initialized; a module’s interface 
is defined or used incorrectly; statements are in wrong order or omitted [Sullivan91]. As a result of executing 
faulty software code, the data segment may be corrupted, causing software errors. Software errors can also 
be directly injected by changing the data segment.
Table 0.4: Techniques Used for Software Fault Injection
Type             Method
Software Fault   Modify the text segment of the program.
Software Error   Modify the data segment of the program.
Memory Fault     Flip memory bits of the program.
CPU Fault        Use a trap to modify the memory area of the saved CPU registers.
Bus Fault        Use traps before and after an instruction to change the code or data used
                 by the instruction and then restore them after the instruction is executed.
Network Fault    Modify or delete transmission messages.
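As a minimal sketch of the memory fault entry in Table 0.4, the following inverts one bit of a byte array that stands in for a program's memory image. The function name and the example image are hypothetical; a real injector would operate on the target process's address space rather than on a local buffer.

```python
def flip_bit(memory: bytearray, byte_offset: int, bit: int) -> None:
    """Emulate a memory fault by inverting one bit of the image in place."""
    if not 0 <= bit < 8:
        raise ValueError("bit must be in the range 0..7")
    memory[byte_offset] ^= 1 << bit

image = bytearray(b"\x0f\xf0")   # stand-in for a text or data segment
flip_bit(image, 0, 7)            # image[0] changes from 0x0f to 0x8f
flip_bit(image, 1, 4)            # image[1] changes from 0xf0 to 0xe0
```

Applying the same flip twice restores the original value, which is how an injector could remove such a fault after an experiment.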
When software injection is used to emulate hardware faults, it is usually assumed that the faults are 
transient in nature. For example, the faulty bits in memory or CPU registers can be overwritten by subsequent 
instructions. However, the software approach can be used to emulate permanent faults by repeatedly 
injecting the same fault into a location whenever there is an access to the location. For example, to emulate 
a permanent stuck-at-0 fault at a particular bit in a memory word, the bit is changed to 0 after every write 
operation to the word. To emulate a permanent stuck-at-1 fault at a bus address line, the corresponding bit 
in the effective address (in the program counter or in a CPU register) is set to 1 before any access to the bus. 
Clearly this emulation is expensive, involving the monitoring and execution of many additional instructions.
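The permanent stuck-at emulation for a memory word can be sketched with a toy memory model in which every write is followed by re-applying the fault. The class and method names are hypothetical, and the model stands in for the trap-based monitoring a real injector would need.

```python
class StuckAtMemory:
    """Toy word-addressable memory in which selected bits are permanently stuck."""

    def __init__(self, size: int, width: int = 32):
        self.words = [0] * size
        self.word_mask = (1 << width) - 1
        self.stuck0 = {}    # address -> mask of bits stuck at 0
        self.stuck1 = {}    # address -> mask of bits stuck at 1

    def stick(self, addr: int, bit: int, value: int) -> None:
        """Declare one bit of a word stuck at 0 or 1 from now on."""
        table = self.stuck1 if value else self.stuck0
        table[addr] = table.get(addr, 0) | (1 << bit)
        self.write(addr, self.words[addr])    # apply to the current contents

    def write(self, addr: int, data: int) -> None:
        data &= self.word_mask
        data &= ~self.stuck0.get(addr, 0)     # re-inject stuck-at-0 after the write
        data |= self.stuck1.get(addr, 0)      # re-inject stuck-at-1 after the write
        self.words[addr] = data

    def read(self, addr: int) -> int:
        return self.words[addr]

mem = StuckAtMemory(size=4)
mem.stick(addr=2, bit=0, value=0)
mem.write(2, 0xFF)
assert mem.read(2) == 0xFE    # bit 0 stays at 0 regardless of what was written
```

The extra masking on every write is the cost the text refers to: each access to the faulty location requires additional instructions.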
Unlike hardware-implemented fault injection, which is difficult to gear toward specific workload areas, 
software fault injection can be targeted toward user applications, the operating system, or both. If the target 
is a user application, the fault injector can be inserted into the user application or can be an extra layer between 
the user application and the operating system. If the target is the operating system, the fault injector has to 
be embedded in the operating system, because it is very difficult to add an extra layer between the machine 
and the operating system.
Although the software approach is flexible, it has some restrictions. First, the approach cannot inject 
faults to locations not accessible to software. We have mentioned in Section 0.3.2 that approximately 1/3 
of the errors produced in logic-level fault injections cannot be emulated through the software approach
[Czeck91]. Second, the software instrumentation may disturb the workload running in the target system 
and even change the structure of original software, although careful design of the injection environment 
can alleviate the perturbation to the workload. Third, the poor time resolution of the approach may cause 
fidelity problems. For the long latency faults, such as memory faults, the low time resolution may not be 
a problem. For the short latency faults, such as bus and CPU faults, the approach may fail to capture the 
error behavior (e.g., propagation). This problem can be solved by using a hardware monitor, i.e., the hybrid 
approach. The hybrid approach combines the versatility of software-implemented fault injection and the 
accuracy of hardware monitoring. It is well suited for measuring extremely short latencies. However, the 
hardware monitoring involved in this approach can decrease flexibility (e.g., limited observation points and 
buffer size of the monitor) and increase cost.











Table 0.5
                   FIAT       FERRARI    HYBRID      DEFINE     SFI
Hardware           PC RT      SPARC      Tandem S2   Sun        HARTS
Injection target   User       User       User        User       User
                   [O.S. is also listed for four of the five tools]
Monitor            Software   Software   Hybrid      Software   Software
Fault              Memory     Memory     Memory      Memory     Memory













There have been several studies using the software approach. In [Chillarege89], a failure-acceleration 
method is used to inject the overlay software faults into an IBM commercial transaction processing system. 
The failure process is said to be accelerated when the fault model is not altered and: 1) The fault latency 
is decreased; 2) The error latency is decreased; 3) The probability of a fault causing a failure is increased. 
An overlay occurs when a program writes into an incorrect area. It is estimated that about 1/3 of software 
errors can be mapped into the overlay model. The failure acceleration method is intended to minimize fault 
latency and to expose the fault impact on the system immediately after the fault injection by using the 
overlay model and by increasing the workload level.
The study found that a total loss of the primary service occurred in only 16% of all detected faults. A 
class of errors termed potential hazards was quantified to be at least 22%. Potential hazards do not affect 
the short term availability but will cause a catastrophic failure when there is a significant change in the 
operating state of the system. At least 41% of errors were identified as potential candidates for the category 
failure prevention and error repair. These errors do not affect the short term availability and experience 
adequate times for repair action, thus providing an opportunity for using inexpensive techniques to achieve 
high coverage.
In recent years, interest in developing software-implemented fault injection tools has increased. Several 
environments have been published in the literature: FIAT [Segall88], FERRARI [Kanawati92], HYBRID 
[Young92], DEFINE [Kao94], and SFI [Rosenberg93]. Table 0.5 lists features of these tools, which will be 
discussed in the following subsections.
FIAT
A number of fault injection studies at Carnegie Mellon University have centered around FIAT (Fault Injection 
Automated Testing), a software-implemented fault injection environment [Segall88], [Barton90], [Czeck91].
The FIAT hardware implementation consists of IBM RT PCs, connected by a token ring network. The 
FIAT software structure is divided into two parts: the fault injection manager (FIM) and the fault injection 
receptor (FIRE). FIM is a global control program responsible for all phases of the experiment. FIRE, under 
the control of FIM, collects the experimental results and sends appropriate information to FIM for off-line 
analysis. Figure 0.17 shows the process of a typical fault injection experiment.
Figure 0.17: Typical Fault Injection Experiment in FIAT
FIAT has been used to study the impact of faults on the application workload level [Barton90]. Two 
representative programs, a matrix multiplication task and a selection sort task, were chosen as application 
workloads. To achieve fault tolerance, each task was executed on two different processors, and the results 
were compared. Three fault types were injected in the experiment: zero-a-byte, set-a-byte, and 2-bit 
compensating. The zero-a-byte or set-a-byte fault sets eight consecutive bits anywhere within a 32-bit word to 0 
or 1, respectively. The 2-bit compensating fault complements two bits in a word such that the parity code would not detect 
it as an error. Faults were injected into all locations within a workload, with a total of over 130,000 faults 
injected.
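The three fault types can be sketched on a 32-bit word as follows. The function names are hypothetical, and the parity check assumes a single parity bit computed over the whole word; if parity were kept per byte, the two flipped bits would have to lie in the same byte for the fault to go undetected.

```python
WORD_MASK = 0xFFFFFFFF

def zero_a_byte(word: int, start_bit: int) -> int:
    """Clear eight consecutive bits beginning at start_bit (0..24)."""
    if not 0 <= start_bit <= 24:
        raise ValueError("start_bit must be in the range 0..24")
    return word & ~(0xFF << start_bit) & WORD_MASK

def set_a_byte(word: int, start_bit: int) -> int:
    """Set eight consecutive bits beginning at start_bit (0..24)."""
    if not 0 <= start_bit <= 24:
        raise ValueError("start_bit must be in the range 0..24")
    return (word | (0xFF << start_bit)) & WORD_MASK

def two_bit_compensating(word: int, bit_a: int, bit_b: int) -> int:
    """Complement two distinct bits, leaving the word's parity unchanged."""
    if bit_a == bit_b:
        raise ValueError("the two bit positions must differ")
    return word ^ (1 << bit_a) ^ (1 << bit_b)

def parity(word: int) -> int:
    """Parity of the word: 1 if the number of 1 bits is odd, else 0."""
    return bin(word).count("1") % 2

original = 0x12345678
faulty = two_bit_compensating(original, 3, 17)
assert faulty != original and parity(faulty) == parity(original)
```

Flipping two bits changes the count of 1 bits by -2, 0, or +2, so a word-level parity check cannot distinguish the faulty word from the original.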
Results showed that there are a limited number of system-level fault manifestations. The mean error 
detection coverage for different workloads and fault types is approximately 50% to 60%. Error detection 
latency was found to follow a normal distribution. This result conflicts with those presented in [Shin86], 
[Finelli87], where the latency was shown to follow either gamma, Weibull, or log-normal distributions. This 
difference may be explained by the differences in the experimental environment and detection mechanisms. 
In [Shin86], [Finelli87], the hardware-implemented fault injection technique is used, and the resolution of 
detection time is on the order of milliseconds, while the time resolution of the software-implemented FIAT 
is on the order of seconds, which may skew the results.
FERRARI
FERRARI (Fault and ERRor Automatic Real-time Injector), another software-implemented fault injection 
environment, was recently developed at the University of Texas [Kanawati92]. The purpose of FERRARI is 
to evaluate complex systems by emulating most hardware faults in software. It is implemented on SPARC 
workstations in an X-window environment. It consists of four software modules: the initializer and activator, 
the user information, the fault and error injector, and the collector and analyzer. These four modules are
controlled by the manager module, which coordinates the operation of the four modules.
The initialization and activation module prepares the target program for fault injection by extracting 
information, such as the starting address, the program size, the execution time, the output of an error- 
free program, and the addresses used by the program. The user information module receives experiment 
parameters provided by the user. These parameters include:
• duration, location, time, and bit position of the fault,
• user-specified or pseudo-random selection of the fault,
• fault type (XOR, set, or reset a bit; zero or set a byte),
• fault and error classes (hardware, control flow, user-defined), and
• dependability properties to measure (coverage, latency).
The fault and error injection module is responsible for injecting different types of transient or permanent 
faults, such as address line faults, data line faults, and faults in condition code flags. The data collection 
and analysis module records experiment results, such as information about error detection, error latency, 
and failures, and it determines statistics of these measures at the end of the experiment.
The main fault and error injection mechanism involves using software traps. At the appropriate time or 
program location, the program to be injected is trapped. The selected fault or error is then injected. For 
transient errors, the current instruction is executed and then the injected error is removed. For permanent 
faults, the injected fault is not removed. Instead, the program is trapped for the next n instructions, where 
n is the duration of the fault in instruction cycles. Table 0.6 lists the fault and error classes that FERRARI 
can inject.
Table 0.6: Fault and Error Classes Supported by FERRARI
Control Flow                   Hardware
Control bit errors             Address line when opcode is fetched
Data line when opcode fetch    Address line when operand stored
Instruction type faults        Wrong register
Control flow errors            Data line when loading
                               Data line when storing
                               Data line when opcode is fetched
                               Data byte enable store
                               Corrupted register
                               Condition code flag
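The trap-based mechanism described before Table 0.6 can be illustrated with a toy model. It is a simplified sketch, not FERRARI's implementation: the "program" is a list of add instructions on an accumulator, the "trap" is an ordinary branch in the interpreter loop, and a longer-lasting fault is modeled by corrupting the operand of each of the next `duration` instructions. All names are hypothetical.

```python
def run(program, inject_at=None, bit=0, duration=1):
    """Execute ('add', n) instructions, optionally corrupting operands.

    inject_at -- index of the first instruction to corrupt (None for a fault-free run)
    bit       -- operand bit to flip
    duration  -- number of consecutive instructions affected (1 models a transient)
    """
    acc = 0
    for pc, (op, operand) in enumerate(program):
        faulty = inject_at is not None and inject_at <= pc < inject_at + duration
        if faulty:
            saved = program[pc]                       # "trap": remember the original
            program[pc] = (op, operand ^ (1 << bit))  # inject the error
        current_op, current_operand = program[pc]
        if current_op == "add":                       # execute one instruction
            acc += current_operand
        if faulty:
            program[pc] = saved                       # remove the injected error
    return acc

program = [("add", 1)] * 4
assert run(program) == 4
assert run(program, inject_at=1, bit=1) == 6               # one instruction affected
assert run(program, inject_at=1, bit=1, duration=2) == 8   # two instructions affected
```

Because the original instruction is restored after each step, the program image is unchanged once the run completes, so the same image can be reused for the next injection.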
To demonstrate the capabilities of FERRARI and to study the behavior of the target system under faulty 
conditions, over 600,000 fault injection runs were conducted on SUN4 SPARC workstations under different 
applications. Results showed that the error coverage is highly dependent on the fault type. The highest 
coverage was obtained when errors were injected into the task memory image. This is because the injected 
errors are likely to be exercised repeatedly if the corrupted instructions are in a loop. An important finding is 
that a considerable number of undetected errors are those that corrupted input/output routines and system 
libraries. These routines may tend to be ignored when error detection techniques are embedded in the user 
code.
HYBRID
To enhance the low resolution of detection time in purely software-implemented fault injection environments, 
error detection mechanisms can be implemented with the combination of hardware and software. This 
approach is used in HYBRID, a hybrid fault injection environment developed at the University of Illinois
[Young92]. In HYBRID, faults are injected via software, and the impact is measured by both software and 
hardware.
The environment consists of a fault injection system, a hybrid monitor system (implemented by both 
hardware and software) to measure the effects of injected faults, and a supervisory system to automate the 
measurements. Figure 0.18 illustrates how these systems are physically situated. The fault injector and 
software monitor execute on the test system, while the supervisor program executes on the control host. 
Probes attach the hardware monitor to the address/data backplane of the test system so that the monitor 
can analyze and record the signals generated. Communication between the supervisor and the hardware 
monitor takes place over an RS-232 or GPIB connection.
Figure 0.18: Physical Layout of HYBRID
The function of the environment is to perform experiments that repeatedly inject faults and record 
observations. The environment introduces faults into the test system during the execution of a target 
program, measures the effects of that fault, and returns the test system to conditions present prior to fault 
injection. These operations form a single observation loop. Faults can be injected into any location that has a 
physical address, such as CPU registers, cache, local memory, mass storage, and network controllers. Faults 
can also be injected into locations allocated to a single, executing user program or even into the kernel, and 
propagation can be characterized down to the instruction level.
The fault injection environment was used to study dependability characteristics of a Tandem Integrity 
S2 fault-tolerant computer system [Jewett91]. High degrees of accuracy in measuring latency (within 20 ns) 
were obtained. Measurements of the sensitivity of different instructions to faults indicated a 5% chance that 
a faulted MIPS RISC instruction will not fail when executed. Modeling of multi-level error propagation 
showed that error detections were due to multiple corruptions of state in as many as 57% of reads from 
wrong addresses and 37% of writes to wrong addresses. The median latency associated with error detection 
by an individual CPU was on the order of 10 µs, and the median delay between detection and the start of 
CPU shutdown was on the order of 100 ms. Kernel fault injection studies show that a fault in the kernel is 
2.6 times as likely to bring down a CPU as a fault elsewhere.
DEFINE
DEFINE is a UNIX-based distributed fault injection environment developed at the University of Illinois 
[Kao94]. Its predecessor, FINE [Kao93], is a single-machine fault injection environment. The significance of 
DEFINE is twofold. First, it can emulate software faults as well as hardware errors. Second, it can trace 
fault propagation through software modules. The software faults that can be injected by DEFINE include 
initialization (missing or incorrect), assignment (missing or incorrect), condition check (missing or incorrect), 
and function (incorrect) faults. Injectable hardware errors include CPU (ALU, shifter, opcode decoder, or 
registers), memory (text segment or data segment), bus (address lines or data lines), and communication 
errors (missing messages or corrupted messages).
Figure 0.19 shows the DEFINE environment. DEFINE consists of a target system, a fault injector, a 
software monitor, a workload generator, a controller, and several analysis utilities. The target system is 
a group of connected machines, consisting of servers and clients. The controller, fault injector, software 




[Figure 0.19 labels: Workload Generator, Software Monitor, trace, Target System]
Figure 0.19: The DEFINE Environment
target system. The local fault injector and message recorder are embedded in the kernel so that faults can 
be injected there and their propagation can be monitored. Fault injection is implemented by modifying the 
system trap handling routines and hardware clock interrupt handling routines, so the fault injector can be 
considered an extra layer between the operating system and the machine. The fault injector uses hardware 
clock interrupts to control the time of fault injection and activation, and uses software traps to inject all the 
faults except communication faults and memory faults in the data/stack segments. The software monitor 
traces the execution flow and key variables of the kernel. Software probes are inserted into functions in the 
kernel to record the execution flow and the values of arguments and key variables. The synthetic workload 
generator issues various system calls to activate injected faults. The distribution of generated system calls 
can be specified by users to emulate real workloads or to deliberately accelerate the activation of injected 
faults. The controller assigns experiment specifications to the fault injector and the monitor, and it initiates 
experiments. The analysis utilities provide assistance in analyzing fault propagation. The target of the study 
is the UNIX kernel, a nonstopped, highly parameterized, complex service program with high impact and a 
broad spectrum of workloads.
Two experiments were conducted by applying DEFINE to investigate fault propagation and to evaluate 
the impact of various types of faults. The first experiment was on SunOS 4.1.2 (on a SPARCstation IPC). 
Results showed that memory faults and software faults usually have a very long latency, while bus faults and 
CPU faults tend to crash the system immediately. Nearly 90% of detected errors are detected by hardware. 
About half (47%) of the detected errors are data errors; these data errors are detected when the system tries 
to access an area it has no privilege to access. In the software fault propagation, incorrect control flow is the 
major impact for the first level of propagation, while data corruption is the major impact for the subsequent 
propagation. Analysis of fault propagation among the UNIX subsystems revealed that only about 8% of 
faults propagate to other UNIX subsystems. The second experiment was on six Sun workstations (one as 
server and the others as clients). Experimental results show that fault propagation from servers to clients 
occurs more frequently than from clients to servers. The majority of no-impact faults are dormant. The 
fault impact depends on the workload.
SFI
SFI is designed for validating dependability mechanisms on an experimental distributed real-time system, 
HARTS [Rosenberg93], [Han93]. See Figure 0.20. It introduces a new fault type, intermittent, in addition 
to permanent and transient faults. The interarrival time between intermittent faults can be deterministic or 
can follow a specified exponential distribution. Injectable errors include memory (code, global variables, or 










Figure 0.20: The Relationship of SFI Files
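The two interarrival options for intermittent faults mentioned above (deterministic, or drawn from an exponential distribution) can be sketched as a generator of injection times. The function and parameter names are hypothetical and do not reflect SFI's experiment description format.

```python
import random

def injection_times(count, mean_gap, exponential=False, seed=0):
    """Return `count` injection times with fixed or exponentially distributed gaps."""
    rng = random.Random(seed)
    now, times = 0.0, []
    for _ in range(count):
        # expovariate takes the rate, i.e. the reciprocal of the mean gap
        gap = rng.expovariate(1.0 / mean_gap) if exponential else mean_gap
        now += gap
        times.append(now)
    return times

fixed = injection_times(3, mean_gap=2.0)                       # [2.0, 4.0, 6.0]
randomized = injection_times(3, mean_gap=2.0, exponential=True)
```

Fixing the seed makes a randomized schedule repeatable, which helps when an experiment needs to be rerun with the same injection times.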
SFI consists of the SFI Experiment Generator (SEG) and the SFI Control Modules (SCM). The SEG 
takes as input a user-supplied experiment description to drive fault injection experiments. The SCM consists 
of fault injection routines that will be included into executable files by the SEG. Memory errors are injected 
by changing the contents of the selected address. Communication errors are injected by modifying the 
communication protocols to mimic the desired behavior. Processor errors are injected by changing the 
assembly code during compilation.
Two experiments on HARTS were conducted to investigate the effect of intermittent message losses between 
two adjacent nodes and the effect of routing using failure data. In the first experiment, a model of 
communication between two nodes was developed to predict the effect of intermittent message losses. Experimental 
results showed that the predicted values of average round-trip delay, average number of attempts 
per message, and frequency of number of attempts matched the observed values very well. The second 
experiment investigated three routing methods with or without failure information. The first method uses 
transmission time of a message on each link only. The second method considers transmission time and the 
average number of timeouts on each link. The third method uses the delivery time of test messages that are 
sent out by each node to its neighbors periodically. Results showed that none of the methods is best under 










0.4.3 Radiation-Induced Fault Injection
Neither hardware-implemented nor software-implemented fault injection has a way to produce transient 
faults at random locations inside ICs. Radiation-induced fault injection provides such a capability. One 
way to do this is to expose the chip to the heavy-ion radiation from a Californium-252 (Cf-252) source 
[Gunneflo89], [Karlsson89]. The heavy ions emitted from the source are capable of creating transient faults 
when they pass through a depletion region in the IC. One advantage of this method is that it can produce 
transient faults at random locations evenly and can cause either a single bit flip or multiple bit flips, leading 
to large variation in the errors seen on the output pins of the IC.
In the fault injection experiments reported in [Gunneflo89], [Karlsson89], the Cf-252 method was used to 
investigate error coverage and detection latency for error detection schemes for the MC6809E 8-bit microprocessor. 
The intention of the experiments was to characterize the effects of transient faults that originate 
inside a CPU. The MC6809E is fabricated in NMOS, a technology sensitive to heavy ion radiation. The 
error detection schemes under study are suitable for implementation with a watchdog processor that checks 
the behavior of the main processor on the external bus. The developed experimental system is called FIST 
(Fault Injection system for Study of Transient fault effects). Figure 0.21 shows the FIST diagram.
Figure 0.21: FIST Diagram
The heavy-ion radiation is implemented using a commercially available 37 × 10^3 Becquerel (1 μCi) Cf-252 
source. The Cf-252 source is mounted inside a vacuum chamber together with a small computer system. One 
of the system boards is placed on a mechanical fixture movable in three dimensions for accurate positioning
of the CPU beneath the Cf-252 source. The system has two MC6809E CPUs, which operate synchronously 
using the same clock. One CPU is exposed to heavy-ion radiation. The other is used as a reference to detect 
errors via comparison on the output from the two CPUs. When errors are detected by the comparison logic, 
the logic analyzer is triggered to record the external bus signals. The monitoring computer is responsible for 
data acquisition and control of experiments.
A fault injection experiment is conducted in the following way. Before the experiment starts, the mon­
itoring computer fetches from the host computer a load file that contains the test program to be executed. 
The test program is then loaded from the monitoring computer to the MC6809E system. After the loading, 
the test program is started with a “go” command from the monitoring computer. When a mismatch is 
detected, the monitoring computer fetches the recorded error data from the logic analyzer and the error 
flip-flops in the MC6809E system and transfers them to the host computer. Finally, the MC6809E system 
is reset, and the test program is reloaded for the next experiment.
It was found from fault injection experiments that 78% of all errors affected control flow (i.e., caused the 
processor to diverge from the correct sequence) and 17% caused errors in data. Results also showed that 
30% of all errors were multiple bit errors on the output pins, although the origin of each of these errors 
was only one single heavy ion. The error recordings obtained from the experiments were also used as input 
to simulation models of different error detection mechanisms to evaluate these error detection mechanisms 
without implementing them. The coverage of several detection mechanisms was investigated. It was found 
that the best mechanism was the one that detects access to the memory outside permitted areas and that 
the combination of two mechanisms gave a better coverage than any one mechanism alone. It was also found 
that the type of the test program had a considerable influence on the results of error detection mechanisms.
0.5 OPERATIONAL PHASE
When a computer system is in normal operation, various types of errors can occur both in the hardware 
and in the software. There are many possible sources of errors, including untested manufacturing faults and 
software defects, wearing out of devices, transient errors induced by radiation, power surges, or other physical 
processes, operator errors, and environmental factors. The occurrence of errors is also highly dependent on 
the workloads running on the system. A distribution of operational outages from various error sources for 
several major commercial systems is reported in [Siewiorek92].
There is no better way to understand dependability characteristics of a complex computer system than via 
direct measurement and analysis. Here, measuring a real system means monitoring and recording naturally 
occurring errors and failures in the system while it is running under user workloads. Analysis of such 
measurements can provide valuable information on actual error/failure behavior, identify system bottlenecks, 
quantify dependability measures, and verify assumptions made in analytical models.
Given field error-data collected from a real system, a measurement-based study consists of four steps, as 
shown in Figure 0.22: 1) data processing, 2) model identification and parameter estimation, 3) model solution 
if necessary, and 4) analysis of models and measures. Step 1 consists of extracting necessary information 
from field data (the result can be a form of compressed data, or flat data), classifying errors and failures, 
and coalescing repeated error reports. In a computer system, a single problem commonly results in many 
repeated error observations occurring in rapid succession. To ensure that the analysis is not biased by these 
repeated observations of the same problem, all error entries that have the same error type and occur within 
a short time interval (e.g., 5 minutes) of each other should be coalesced into a single record. The output of 
this step is a form of coalesced data in which errors and failures are identified. This step is highly dependent 
on the measured system. Coalescing algorithms have been proposed in [Tsao83], [Iyer86], [Hansen92].
Step 2 includes identifying appropriate models (such as Markov models) and estimating various mea­
sures of interest (such as MTBFs and TBF distributions) from the coalesced data. Several models have 
been proposed and validated using real data. These include workload-dependent cyclostationary models 
[Castillo81], a workload hazard model [Iyer82a], and error/failure correlation models [Tang92a]. Statistical 
analysis packages such as SAS [SAS85] or measurement-based dependability analysis tools such as 
MEASURE+ [Tang93b] are useful at this stage. Step 3 solves these models to obtain dependability measures 
(such as reliability, availability, and transient reward rates). Dependability and performance modeling and 
evaluation tools such as SHARPE [Sahner87] can be used in this step. Step 4, the most creative part of this
Figure 0.22: Measurement-Based Analysis
study, involves a careful interpretation of the models and measures obtained from the data; for example, 
the identification of reliability bottlenecks and the impact of design enhancements on availability. The analysis 
methods can vary significantly from one study to another, depending on project goals.




Analysis of time-based tuples
TTE/TTF distributions
[Siewiorek78], [McConnel79], [Iyer86]
Hardware failure/workload dependency
Software failure/workload dependency
Correlated failures and impact
Two-way and multiway failure dependency
[Butner80], [Castillo81], [Iyer82a]
[Castillo82], [Iyer85b], [Mourad87]
Performability model for single machine
Markov reward model for distributed system
Hardware-related & correlated software errors
Software fault tolerance
Software defect classification
[Velardi84], [Hsueh87]
[Iyer85a], [Tang92b], [Lee93a]
Heuristic trend analysis





Measurement-based dependability analysis of operational systems has evolved significantly over the past 
15 years. These studies have addressed one or more of the following issues: basic error characteristics, 
dependency analysis, modeling and evaluation, software dependability, and fault diagnosis. Table 0.7 provides 
a quick overview of the issues addressed in the literature.
Early studies in this field investigated transient errors in DEC computer systems and found that more 
than 95% of all detected errors are intermittent or transient errors [Siewiorek78], [McConnel79]. The studies 
also showed that the interarrival time of transient errors follows a Weibull distribution with a decreasing 
error rate. This distribution was also shown to fit the software failure data collected from an IBM operating 
system [Iyer85b]. A recent study of failure data from three different operating systems showed that time to 
error (TTE) can be represented by a multi-stage gamma distribution for a single-machine operating system 
and by a hyperexponential distribution for the measured distributed operating systems [Lee93a].
Several studies have investigated the relationship between system activity and failures. In the early 1980s, 
analysis of measurements from IBM [Butner80] and DEC [Castillo81] machines revealed that the average 
system failure rate was strongly correlated with the average workload on the system. The effect of 
workload-imposed stress on software was investigated in [Castillo82] and [Iyer85b]. Recent analyses of DEC [Tang90], 
[Wein90] and Tandem [Lee91] multicomputer systems showed that correlated failures across processors are 
not negligible, and their impact on availability and reliability is significant [Dugan91], [Tang91], [Tang92a].
In [Hsueh88], analytical modeling and measurements were combined to develop measurement-based 
reliability/performability models using data collected from an IBM mainframe. The results showed that a 
semi-Markov process is better than a Markov process for modeling system behavior. Markov reward modeling 
techniques were further applied to distributed systems [Tang93a] and fault-tolerant systems [Lee92] to 
quantify performance loss due to errors/failures for both hardware and software.
A census of Tandem system availability indicated that software faults are the major source of system out­
ages in the measured fault-tolerant systems [Gray90]. Analyses of field data from different software systems 
investigated several dependability issues including the effectiveness of error recovery [Velardi84], 
hardware-related software errors [Iyer85a], correlated software errors in distributed systems [Tang92b], software fault 
tolerance [Lee92], [Lee93b], and software defect classification [Sullivan91], [Sullivan92]. Measurement-based 
fault diagnosis and failure prediction issues were investigated in [Tsao83], [Iyer90], [Lin90], [Maxion90a], 
[Maxion90b], [Maxion93].
In the following subsections, we discuss issues and representative studies involved in measurements, data 
processing, preliminary analysis of data, dependency analysis, modeling and evaluation, software depend­
ability, and fault diagnosis.
0.5.1 Measurements
There are numerous theoretical and practical difficulties associated with making measurements. The question 
of what and how to measure is a difficult one. A combination of installed and custom instrumentation is 
typically used in most studies. From a statistical point of view, sound evaluations require a considerable 
amount of data. In modern computer systems, especially in fault-tolerant systems, failures are infrequent 
and, in order to obtain meaningful data, measurements must be made for a long period of time. Also, the 
measured system must be exposed to a wide range of usage conditions for the results to be representative. 
In an operational system, only detected errors can be measured.
There are two ways to make measurements: on-line automatic logging and manual logging by operators. 
Many large computer systems, such as IBM and DEC mainframes, provide error-logging software in the 
operating system. This software records information on errors occurring in the various subsystems, such as 
the memory, disk, and network subsystems, as well as other system events, such as reboots and shutdowns. 
The reports usually include information on the location, time, and type of the error, the system state 
at the time of the error, and sometimes error recovery (e.g., retry) information. The reports are stored 
chronologically in a permanent system file. The main advantage of the on-line automatic logging is its 
ability to record a large amount of information about transient errors and to provide details of automatic 
error recovery processes, which cannot be done manually. Disadvantages are that an on-line log does not 
usually include information about the cause and propagation of the error or about off-line diagnosis. Also, 
under some crash scenarios, the system may fail too quickly for any error messages to be recorded.
Table 0.8 shows a sample of extracted error logs from a VAXcluster multicomputer system. Often, the 
meaning of a record in a log can differ between versions of the operating system and between machine models. 
One reason is that error detection and recording routines are written and modified over time by different 
groups of people. For example, a careful study of VAX error logs and discussion with the field engineers 
indicate that the operating system on different VAX machine models might report the same type of error into 
different categories. Thus, it is important to distinguish these errors in the subsequent error classification.
Since the information provided by on-line error logs may not be complete, it is valuable to have operator 
logs supply the missing information. An operator log should include information on system crashes, 
failure diagnosis, component replacement, and hardware and software updates.
0.5.2 Data Processing
Table 0.8: A Sample of Extracted Error Logs from a VAXcluster†

Entry  System ID  Logging Time             Subsystem & Unit  Interpretation
5815   Earth      20-DEC-1987 20:23:13.22  I/O, H0$DUA51:    Disk drive error
7005   Earth       4-JAN-1988 11:45:07.12  I/O, H3$MUA1:     Tape drive error
12979  Europa      8-JAN-1988 14:14:28.63  CI, EUR$PAA0:     Path #0 went from good to bad
13005  Europa      8-JAN-1988 16:23:17.41  CI, EUR$PAA0:     Error logging datagram received
13734  Europa     19-JAN-1988 17:31:30.74  CI, EUR$PAA0:     Virtual circuit timeout
3260   Mercury    24-DEC-1987 04:54:52.06  Memory, TR #2     Corrected memory error
10939  Jupiter     1-APR-1988 09:57:39.40  Unknown Device
14209  Jupiter    16-MAY-1988 13:37:04.97  CPU, SBI          Unexpected read data fault
13941  Mars       25-FEB-1988 02:13:20.25  CPU, IBOX         Machine check
20937  Mars       18-APR-1988 16:46:39.75  BugCheck          Bad memory deallocation request size/address
27958  Mars       14-MAY-1988 20:57:46.48  BugCheck          Insufficient nonpaged pool to remaster locks
37790  Saturn     20-JUL-1988 18:51:49.15  BugCheck          Unexpected system service exception

†The sample is intended to illustrate the different types of errors logged. Therefore, the entry numbers are not consecutive.

Usually, on-line logs contain a large amount of redundant and irrelevant information in various formats. Thus, 
data processing must be performed to classify this information and to put it into a flat format to facilitate 
subsequent analyses. The first step in data processing is error classification. This process classifies errors 
in the measured system into types based on the subsystems and components in which they occur. There 
is no uniform or “best” error classification, because different systems have different hardware and software 
architectures. But some error types, such as CPU, memory, and disk errors, are seen in most systems. Table 
0.9 lists an error classification (major error types) for VAXcluster systems [Tang92b], [Tang93a].







CPU or bus controller errors 
Memory ECC errors 
Disk, drive, and controller errors 

Problems involving program flow control or synchronization 
Problems referring to memory management or usage 
Inconsistent conditions detected by I/O management routines
After error classification, the remaining data processing can be broadly divided into two steps: data 
extraction and data coalescing. Data extraction selects useful entries such as error and reboot reports 
(throwing away uninteresting entries such as disk volume change reports) from the log file and transforms 
the data set into a flat format. The design of the flat format depends on the needs of the subsequent 
analyses. The following is a possible format:
entry number | logging time | error type | device id | error description fields
In on-line error logs, a single fault in the system can result in many repeated error reports in a short period 
of time. To ensure that the subsequent analyses will not be distorted by these repeated reports, entries 
that correspond to the same problem should be coalesced into a single event, or tuple [Tsao83]. A typical 
data-coalescing algorithm merges all error entries of the same error type that occur within a ΔT interval of 
each other into a tuple. The algorithm is as follows:
IF <error type> = <type of previous error>
AND <time away from previous error> < ΔT 
THEN <put error into the tuple being built>
ELSE <start a new tuple>
A tuple reflects the occurrence of one or more errors of the same type in rapid succession. It can be 
represented by a record containing information such as the number of entries in the tuple and the time 
duration of the tuple.
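The coalescing rule above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the cited studies: the names `ErrorTuple` and `coalesce`, the record layout, and the sample log entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class ErrorTuple:
    """One coalesced event: errors of the same type in rapid succession."""
    error_type: str
    start: datetime
    end: datetime
    count: int = 1

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def coalesce(entries: List[Tuple[datetime, str]], delta_t: timedelta) -> List[ErrorTuple]:
    """Merge (logging time, error type) entries into tuples."""
    tuples: List[ErrorTuple] = []
    for time, error_type in sorted(entries):
        current = tuples[-1] if tuples else None
        # Same type as the previous error and less than delta_t away from it:
        # put the error into the tuple being built.
        if (current is not None
                and error_type == current.error_type
                and time - current.end < delta_t):
            current.end = time
            current.count += 1
        else:
            # Otherwise start a new tuple.
            tuples.append(ErrorTuple(error_type, time, time))
    return tuples


# Hypothetical log entries (not taken from the measured systems).
t0 = datetime(1988, 1, 8, 14, 0, 0)
log = [
    (t0, "CI"),
    (t0 + timedelta(minutes=2), "CI"),
    (t0 + timedelta(minutes=4), "CI"),
    (t0 + timedelta(minutes=30), "CI"),
    (t0 + timedelta(minutes=31), "disk"),
]
tuples = coalesce(log, timedelta(minutes=5))
for tup in tuples:
    print(tup.error_type, tup.count, tup.duration)
```

With a 5-minute interval, the first three entries form one tuple and the remaining two entries each start a new tuple. A device identifier could be added to the comparison in the same way as the error type.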
Different systems may need different time intervals in data coalescing. A recent study [Hansen92] defined 
two kinds of mistakes that can be made in data coalescing: collision and truncation. A collision occurs when 
the detection times of two faults are close enough (within ΔT) that they are combined into a tuple. 
A truncation occurs when the time between two reports caused by a single fault is greater than ΔT, so 
that the two reports are split into different tuples. If ΔT is large, collisions are likely to occur. If ΔT is 
small, truncations are likely to occur. The study found that there is a time-interval threshold beyond which 
collisions increase rapidly. Based on this observation, the study proposed a statistical model that can be 
used to select an appropriate time interval. In our experience, collision is not a big problem if the error type 
and device information are used in data coalescing as shown in the above coalescing algorithm. Truncation is 
usually not considered to be a problem [Hansen92]. Also, there are techniques [Iyer90], [Lin90] for dealing with 
truncation that have been used for fault diagnosis and failure prediction (see Section 0.5.7).
0.5.3 Preliminary Analysis
Once coalesced data is obtained, the basic dependability characteristics of the measured system can be 
identified by a preliminary statistical analysis. Commonly used measures in the analysis include error/failure 
frequency, TTE or TTF distribution, and error/failure hazard rate function. In the following discussion, 
data from several commercial systems are used to illustrate analysis methods.
Basic Statistics
It is important but easy to obtain basic statistics from the measured data such as frequency, percentage, 
and probability. These statistics provide an overall picture of the measured system. Often, dependability 
bottlenecks can be identified by analysis of these statistics. Table 0.10 shows the error/failure statistics for 
a measured VAXcluster [Tang93a]. In the table, I/O errors include disk, tape, and network errors. Machine 
errors include CPU and memory errors. Software errors are software-related errors. The 95% confidence 
intervals for the percentage and probability estimates shown in the table are calculated using the method 
for proportions discussed in Section 0.2.1.
Table 0.10: Error/Failure Statistics for the VAXcluster

            Error                   Failure                 Recovery
Category    Frequency  Percentage   Frequency  Percentage   Probability
I/O         25807      92.87±0.30   105        42.86±6.20   0.996±0.001
Machine     1721       6.19±0.28    5          2.04±1.77    0.970±0.002
Software    69         0.25±0.06    62         25.31±5.44   0.101±0.071
Unknown     191        0.69±0.10    73         29.80±5.73   0.618±0.069
All         27788      100.0        245        100.0        0.991±0.001

Two bottlenecks can be identified from the table. First, the major error category is I/O errors (93%), i.e., 
errors from shared resources. This category of error has a very high recovery probability (0.996). However, 
these errors still result in nearly 43% of all failures. This result indicates that, although the system is generally 
robust with respect to I/O errors, the shared resources still constitute a major reliability bottleneck due to 
the sheer number of errors. Improving such a system may require using an ultra-reliable network and a disk 
system to reduce the raw error rate, not just providing high recoverability.
Second, although software errors constitute only a small part of all errors (0.3%), they result in significant 
failures (25%). This is because software errors have a very low recovery probability (0.1). This software failure 
estimation is conservative because there are significant unknown failures (30%). Some of these unknown 
failures could also be attributed to software problems. Thus, software-related problems are severe in the 
measured system.
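As an illustration of how the interval estimates in the table can be obtained, the sketch below computes a 95% confidence interval for a proportion. It assumes the usual normal approximation; the method of Section 0.2.1 is not reproduced here, so this choice and the function name `proportion_ci` are assumptions.

```python
import math


def proportion_ci(count: int, total: int, z: float = 1.96):
    """Return (estimate, half-width) for a proportion, normal approximation."""
    p = count / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, half_width


# I/O failures as a share of all failures: 105 out of 245.
io_share, io_half = proportion_ci(105, 245)
print(f"I/O failure percentage: {100 * io_share:.2f} +/- {100 * io_half:.2f}")

# Software recovery probability: 69 errors, of which 62 led to failures.
sw_recovery, sw_half = proportion_ci(69 - 62, 69)
print(f"Software recovery probability: {sw_recovery:.3f} +/- {sw_half:.3f}")
```

Under this assumption the two results agree with the corresponding I/O and software entries reported for the VAXcluster.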
Empirical TTE Distributions and Hazard Rates
TTE/TTF probability distributions and error/failure hazard rates are commonly used to investigate how 
errors and failures occur across time. It is relatively easy to obtain empirical TTE/TTF distributions from 
data. Figure 0.23 shows the empirical TTE distribution function, f(t), for a measured VAXcluster system 
[Tang93a]. Notice that the logarithmic coordinate is used for f(t)  because of the big contrast between the 
largest and smallest values. It is seen that about 67% of the TBEs are less than one minute. Most of these 
instances are “time between errors of two different machines” because errors of the same type occurring 
within a 5-minute interval of each other on the same machine have been coalesced into a single error event. 
This fact implies that errors are likely to occur on the different machines in the measured system within a 






Mean = 12.9, Median = 0.08, Std. Dev. = 46.1 (horizontal axis: t in minutes)
Figure 0.23: VAXcluster Empirical TTE Distribution
The hazard rate characterizes error/failure intensity in time. It is the probability that an error (failure) 
will occur within the coming unit of time, given that no error (failure) has occurred since the start of the 
system or the last error (failure) occurrence. The mathematical definition of the hazard rate [Ross85] is as 
follows:
$$h(t) = \frac{\Pr\{\text{error in } (t, t + dt)\}}{\Pr\{\text{no errors in } (0, t)\}\,dt} = \frac{f(t)}{1 - F(t)} \tag{0.31}$$
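A binned estimate of this hazard rate can be sketched as follows. In each bin, the density is approximated by the fraction of intervals ending in the bin divided by the bin width, and the survival term by the fraction of intervals that have not ended before the bin. The function name and the sample data are hypothetical, and the binning scheme is one simple choice among several.

```python
from typing import List


def empirical_hazard(intervals: List[float], bin_width: float, n_bins: int) -> List[float]:
    """Estimate h(t) per bin as events / (intervals still at risk * bin width)."""
    at_risk = len(intervals)
    rates = []
    for k in range(n_bins):
        lo, hi = k * bin_width, (k + 1) * bin_width
        events = sum(1 for x in intervals if lo <= x < hi)
        rates.append(events / (at_risk * bin_width) if at_risk > 0 else 0.0)
        at_risk -= events  # these intervals have ended and are no longer at risk
    return rates


# Hypothetical times between failures, in hours (not measured data).
tbf_hours = [0.2, 0.5, 0.9, 1.4, 3.1, 7.5, 12.0, 30.0]
hazard = empirical_hazard(tbf_hours, bin_width=1.0, n_bins=4)
print(hazard)
```

A large value in the first bin relative to later bins is the signature of bursts: many intervals end shortly after the previous event.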
Figure 0.24 shows the empirical failure hazard rates computed from the VAXcluster failure data. The 
high hazard rate near the origin, i.e., the high probability that the second failure will occur within a short 
time after a failure occurrence, indicates that failures in the VAXcluster tend to occur in bursts. The most 
likely time for a second failure is the first 2 hours after a failure occurrence. An early study of transient 
errors [McConnel79], which fitted a Weibull distribution with a decreasing failure rate to the interarrival 
time of transient errors, implied the existence of failure/error bursts. Failure bursts have been observed in 






Figure 0.24 (horizontal axis: t in hours)















Analytical Time to Error Distributions
A realistic, analytical form of TTE distributions is essential in modeling and evaluating computer system 
dependability. Often, for simplicity or due to lack of information, the TTE is assumed to be exponentially 
distributed. Early measurement-based studies found that the Weibull distribution with a decreasing failure 
rate is representative of the time between failures (TBF) in a measured DEC computer system [McConnel79] 
and for a measured IBM-VM/SP operating system [Iyer85b]. A recent comparative study of the dependability 
of the Tandem GUARDIAN, DEC VAX VMS, and IBM MVS operating systems showed that the software 
TTE in a single machine can be represented by a multistage gamma distribution, and the software TTE in 
multicomputers can be represented by a hyperexponential distribution [Lee93a]. In this section, we discuss 
these distributions.
Figure 0.25 depicts the probability density function for disk errors in the CMU Andrew file system 
[Siewiorek92]. Analysis shows that the instantaneous error rate (hazard function) for this data is a decreasing 
function of time. The data is seen to best fit a Weibull distribution, i.e., the probability density function, 
f(t), is given by
$$f(t) = \alpha \lambda t^{\alpha - 1} \exp[-\lambda t^{\alpha}]$$
and the hazard (error rate) function is given by
$$h(t) = \alpha \lambda t^{\alpha - 1}$$
where $\alpha$ is the shape parameter and $\lambda$ is the scale or rate parameter. Note that if $\alpha = 1$, then the hazard 
function reduces to a constant (i.e., f(t) reduces to the exponential). The Weibull function has been found 
to describe a wide variety of hardware and software errors; it is superimposed upon the actual data presented 
in Figure 0.25. Note that $\alpha$ is usually much less than 1, which means the hazard function is decreasing.
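The relationship between the Weibull density and its hazard function can be checked numerically with a short sketch. The shape and rate values used below are illustrative assumptions, not parameters fitted to the Andrew data, and the function names are hypothetical.

```python
import math


def weibull_pdf(t: float, alpha: float, lam: float) -> float:
    """Weibull density with shape alpha and rate lam."""
    return alpha * lam * t ** (alpha - 1) * math.exp(-lam * t ** alpha)


def weibull_cdf(t: float, alpha: float, lam: float) -> float:
    """Weibull cumulative distribution function."""
    return 1.0 - math.exp(-lam * t ** alpha)


def weibull_hazard(t: float, alpha: float, lam: float) -> float:
    """Weibull hazard (error rate) function."""
    return alpha * lam * t ** (alpha - 1)


alpha, lam = 0.5, 0.3  # illustrative values only; shape < 1
for t in (0.5, 2.0, 10.0):
    ratio = weibull_pdf(t, alpha, lam) / (1.0 - weibull_cdf(t, alpha, lam))
    print(t, round(ratio, 6), round(weibull_hazard(t, alpha, lam), 6))
```

For each t the ratio f(t)/(1 - F(t)) matches the closed-form hazard, and with a shape parameter below 1 the hazard falls as t grows. Setting the shape parameter to 1 gives a constant hazard equal to the rate parameter.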
Figure 0.25: Distribution of Andrew Disk Errors (horizontal axis: time in hours)
Before presenting TTE distributions for the three operating systems studied in [Lee93a], we first explain 
how a TTE distribution is obtained from a multicomputer system, because two of the three operating systems 
were running on multicomputer systems. In a multicomputer system, typically, all the constituent machines 
work in a similar environment and run the same version of the operating system. The whole system can be
treated as a single entity in which multiple instances of an operating system are running concurrently. Every 
software error in the system is sequentially ordered, and a distribution is constructed. The constructed TTE 
distribution reflects the software error characteristics for the whole system. We will call this distribution the 
multicomputer software TTE distribution.
(a) IBM MVS Software TTE Distribution
(b) VAXcluster Software TTE Distribution
(c) Tandem Software TTH Distribution
Figure 0.26: Analytical Software Distributions Extracted from Data
Figure 0.26 gives the analytical TTE or time to halt (TTH) distributions extracted from data by using 
SAS for the three measured systems. All three empirical distributions failed to fit simple exponential 
functions. The fitting was tested using the Kolmogorov-Smirnov or Chi-square test (see Section 0.2.2) 
at a 0.05 significance level. The 2-phase hyperexponential distribution provided satisfactory fits for the 
VAXcluster and Tandem multicomputer software TTE distributions. An attempt to fit the MVS TTE 
distribution to a phase-type exponential distribution led to a large number of stages. As a result, the 
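following multistage gamma distribution was used:
$$f(t) = \sum_{i=1}^{n} a_i\, g(t; \alpha_i, s_i),$$
where $a_i \ge 0$, $\sum_{i=1}^{n} a_i = 1$, and
$$g(t; \alpha, s) = \begin{cases} 0, & t < s, \\ \dfrac{1}{\Gamma(\alpha)} (t - s)^{\alpha - 1} e^{-(t - s)}, & t \ge s. \end{cases}$$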
It was found that a 5-stage gamma distribution provided a satisfactory fit, which means that the software 
TTE distribution on the MVS has a complicated mode.
Figures 0.26(b) and 0.26(c) show that the multicomputer software TTE distribution can be modeled as 
a probabilistic combination of two exponential random variables, indicating that there are two dominant
error modes. The higher error rate, $\lambda_2$, with occurrence probability $\alpha_2$, captures both the error bursts 
(multiple errors occurring on the same operating system within a short period of time) and the concurrent 
errors (multiple errors on different instances of an operating system within a short period of time) on these 
systems. The lower error rate, $\lambda_1$, with occurrence probability $\alpha_1$, captures regular errors and provides an 
inter-burst error rate.
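A small sketch makes the two-mode interpretation concrete. It uses the standard form of a 2-phase hyperexponential, a probabilistic mixture of two exponential densities, which the text names but does not write out. The parameter values are illustrative assumptions, not the fitted values from [Lee93a], and the function names are hypothetical.

```python
import math


def hyperexp_pdf(t: float, a1: float, lam1: float, a2: float, lam2: float) -> float:
    """Mixture of two exponential densities with probabilities a1 and a2."""
    return a1 * lam1 * math.exp(-lam1 * t) + a2 * lam2 * math.exp(-lam2 * t)


def hyperexp_survival(t: float, a1: float, lam1: float, a2: float, lam2: float) -> float:
    """Probability that no error has occurred by time t."""
    return a1 * math.exp(-lam1 * t) + a2 * math.exp(-lam2 * t)


def hyperexp_hazard(t: float, a1: float, lam1: float, a2: float, lam2: float) -> float:
    """Hazard rate f(t) / (1 - F(t)) of the mixture."""
    return hyperexp_pdf(t, a1, lam1, a2, lam2) / hyperexp_survival(t, a1, lam1, a2, lam2)


# Illustrative parameters: a slow "regular" phase and a fast "burst" phase.
params = dict(a1=0.4, lam1=0.01, a2=0.6, lam2=2.0)
hazards = [hyperexp_hazard(t, **params) for t in (0.0, 1.0, 10.0, 1000.0)]
print(hazards)
```

The hazard starts at the probability-weighted average of the two rates and decays toward the lower rate, which is the behavior expected when bursts dominate shortly after an error and regular errors dominate afterwards.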
Error bursts can be explained as repeated occurrences of the same software problem or as multiple effects 
of an intermittent hardware fault on the software. Software error bursts have also been observed in laboratory 
experiments reported in [Bishop88]. The study showed that, if the input sequences of the software under 
investigation are correlated rather than independent, one can expect more “bunching” of failures than those 
predicted using a constant failure rate assumption. In an operating system, input sequences (user requests) 
are highly likely to be correlated. Hence, a defect area can be triggered repeatedly.
0.5.4 Dependency Analysis
Many underlying dependencies exist among measured parameters and components. Examples are the de­
pendency between workload and failure rate, and the dependency or correlation among failures on different 
components. Understanding and quantifying such dependencies is important for developing realistic 
models and hence better designs. The workload/failure dependency issue was studied in the early 1980s, and the 
correlated failure issue has been investigated more recently.
Dependency between workload and failure has been addressed through two approaches: statistical quantification of 
the dependence between workload and failure rate [Butner80], [Iyer85b] and stochastic modeling of failures as 
functions of workload [Castillo81]. Both approaches demonstrated the strong correlation between workload 
and failure rate. The results indicated that dependability models cannot be considered representative unless 
the system workload is taken into account. Based on this result, several workload-dependent analytical 
models have been proposed [MeyerJ88], [Aupperle89], [Dunkel90].
Recent measurements on VAXclusters [Tang90], [Wein90] and Tandem machines [Lee91] found that 
correlated failures are significant in distributed systems. Further, studies showed that even a 
small correlation can have a major impact on system dependability [Dugan91], [Tang91], [Tang92a]. Neither 
traditional models that assume failure independence nor those that are believed to take correlation into ac­
count are representative of the actual occurrence process of correlated failures observed in measured systems 
[Tang93b].
In the following subsections, dependency analysis is illustrated by discussing two issues: 1) dependency 
between workload and failure, and 2) dependency among errors/failures on different components in a com­
puter system.
Workload/Failure Dependency
An early study [Castillo81] introduced a workload-dependent cyclostationary model to characterize system 
failure processes. The basic assumption in the model was that the instantaneous failure rate of a system 
resource can be approximated by a function of the usage of the resource considered. Specifically, the failure 
rate of a particular resource, $\lambda(t)$, is assumed to be
$$\lambda(t) = a\,u(t) + b \tag{0.32}$$
where u(t) is a usage function of the resource, which, in turn, consists of a deterministic, periodic function 
of time, m(t), and a modified, stationary Gaussian process, z(t):
$$u(t) = m(t) + z(t) \tag{0.33}$$
The failure arrivals were assumed to follow a Poisson process. Thus, the failure process involves two 
stochastic processes: a Poisson process and a Gaussian process. Such a process was defined as a doubly 
stochastic process. The model was applied to a PDP-10 machine running a modified version of the standard 
TOPS-10 operating system. It was shown that the TTF distribution predicted by the model and the one 
observed from the real system have a very good fit at the significance level 0.36 in a $\chi^2$ test.
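The structure of such a doubly stochastic process can be sketched with a small simulation. This is only an illustration under strong simplifying assumptions that are not part of [Castillo81]: usage is held constant within each hour, the periodic component is a 24-hour sinusoid, the Gaussian component is drawn independently each hour, and usage is clipped to the range 0 to 1. All names and parameter values are hypothetical.

```python
import math
import random
from typing import List


def usage(hour: int, rng: random.Random, noise_sd: float = 0.05) -> float:
    """u = m + z: assumed daily sinusoid plus Gaussian noise, clipped to [0, 1]."""
    m = 0.5 + 0.3 * math.sin(2.0 * math.pi * (hour % 24) / 24.0)
    z = rng.gauss(0.0, noise_sd)
    return min(1.0, max(0.0, m + z))


def simulate_failures(hours: int, a: float, b: float, seed: int = 0) -> List[float]:
    """Generate failure times (in hours) from a workload-driven Poisson process."""
    rng = random.Random(seed)
    failures = []
    for k in range(hours):
        rate = a * usage(k, rng) + b  # failure rate as a linear function of usage; b > 0
        # Within the hour the rate is constant, so inter-arrival times are exponential.
        t = k + rng.expovariate(rate)
        while t < k + 1:
            failures.append(t)
            t += rng.expovariate(rate)
    return failures


failures = simulate_failures(hours=72, a=0.5, b=0.05, seed=1)
print(len(failures), failures[:5])
```

Because the exponential distribution is memoryless, restarting the arrival clock at each hour boundary does not bias the simulated process.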
[Castillo82] introduced a workload dependent software probabilistic model to predict the differences in 
manifestations of hardware transient and software errors as a function of system workload. The model
was applied to a modified version of the TOPS-10 operating system running on a PDP-10 machine. The 
central argument behind this study was that the observed software failure rate depends on the instantaneous 
complexity of the data to be processed while the system failure rate due to hardware transients is insensitive 
to the data complexity. If a system doubles its average fraction of time spent in the kernel mode, its failure 
rate due to hardware transients increases linearly. Thus, deviations from this expected linearity can be 
attributed to software errors.
In [Iyer82a], a load hazard model was introduced to measure the risk of a failure as the system activity 
increases. The proposed model is similar to the hazard rate defined in Equation 0.31. Given a workload 
variable X, the load hazard is defined as
$$z(x) = \frac{\Pr[\text{failure in load interval } (x, x + \Delta x)]}{\Pr[\text{no failure in load interval } (0, x)]\,\Delta x} = \frac{g(x)}{1 - G(x)}, \tag{0.34}$$
where g(x) is the p.d.f. of the variable “a failure occurs at a given workload value x” and G(x) is the 
corresponding c.d.f. That is,
$$g(x) = \Pr[\text{failure occurs} \mid X = x] = \frac{f(x)}{l(x)}, \tag{0.35}$$
where l(x) is simply the p.d.f. of the workload in consideration:
$$l(x) = \Pr[X = x], \tag{0.36}$$
and f(x) is the joint p.d.f. of the system state (failure state or nonfailure state) and the workload:
$$f(x) = \Pr[\text{failure occurs} \ \&\ X = x]. \tag{0.37}$$
A constant hazard rate implies that failures are occurring randomly with respect to the workload. An 
increasing hazard rate on the increase of X  implies that there is an increasing failure rate with increasing 
workload.
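The load hazard can be estimated from binned measurements. The following Python fragment is a minimal sketch under stated assumptions: the function name, the histogram binning, and the normalization of g over the bins (so that it behaves as a distribution) are illustrative choices, not the procedure of [Iyer82a]. For each bin it estimates the workload distribution, the joint failure-and-workload distribution, their ratio g, and finally z as g divided by the probability mass not yet consumed by lower bins, a discrete stand-in for g(x)/(1 - G(x)).

```python
from bisect import bisect_right

def load_hazard(workload_samples, failure_flags, edges):
    """Estimate the load hazard z(x) per workload bin.

    workload_samples: workload value observed in each time interval
    failure_flags:    True where a failure occurred in that interval
    edges:            ascending bin boundaries, len(edges) = bins + 1
    """
    bins = len(edges) - 1
    total = [0] * bins       # intervals observed in each bin
    failed = [0] * bins      # intervals with a failure in each bin
    for x, is_failure in zip(workload_samples, failure_flags):
        k = min(max(bisect_right(edges, x) - 1, 0), bins - 1)
        total[k] += 1
        failed[k] += int(is_failure)

    n = len(workload_samples)
    workload_pdf = [t / n for t in total]     # Pr[X = x]
    joint_pdf = [c / n for c in failed]       # Pr[failure occurs & X = x]
    cond = [j / w if w > 0 else 0.0 for j, w in zip(joint_pdf, workload_pdf)]
    norm = sum(cond)
    g = [c / norm if norm > 0 else 0.0 for c in cond]  # assumption: sum(g) = 1

    z, remaining = [], 1.0                    # remaining = 1 - G(x) below this bin
    for gk in g:
        z.append(gk / remaining if remaining > 1e-12 else 0.0)
        remaining -= gk
    return z
```

A z sequence that rises from bin to bin corresponds to the increasing hazard case described above; a flat sequence would suggest that failures occur randomly with respect to the workload.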
The load hazard model was applied to the software failure and workload data collected from an IBM 
3081 system running the VM operating system. Based on the collected data, l(x), f(x), g(x), and z(x) were 
computed for each workload variable. Figure 0.27 shows the z(x) plots for three selected workload variables:
(1) OVERHEAD—fraction of CPU time spent on the operating system
(2) PAGEIN—number of page reads per second by all users
(3) SIO (Start I/O)—number of input/output operations per second





[Three panels of z(x), one each for x = OVERHEAD, x = PAGEIN, and x = SIO]
Figure 0.27: Workload Hazard Plots for the IBM 3081 System
The hazard plots show that the workload parameters appear to be acting as stress factors, i.e., the failure 
rate increases as the workload increases. The effect is particularly strong in the case of the interactive 
workload measures OVERHEAD and SIO. The correlation coefficients of 0.95 and 0.91 show that the failures 
closely fit an increasing load hazard model. The risk of a failure also increases with increased PAGEIN, 
although at a somewhat lower correlation (0.82). Note that the vertical scale on these plots is logarithmic,
indicating that the relationship between the load hazard z(x) and the workload variable is exponential, i.e., 
the risk of a software failure increases exponentially with increasing workload.
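One way to check an exponential relationship of this kind is to regress ln z(x) on x, since a straight line on a logarithmic vertical scale corresponds to z(x) = exp(a + b x). The sketch below is illustrative only; the function name and the use of an ordinary least-squares fit are assumptions, not the method reported for Figure 0.27.

```python
import math

def log_linear_fit(xs, zs):
    """Least-squares fit of ln z = a + b*x; returns (a, b, r)."""
    ys = [math.log(z) for z in zs]            # requires every z > 0
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                             # growth rate of the hazard
    a = my - b * mx
    r = sxy / math.sqrt(sxx * syy)            # correlation of x with ln z
    return a, b, r
```

A positive slope b with r close to 1 would indicate that the hazard grows roughly exponentially with the workload variable.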
The above experimental results have been incorporated into analytical dependability models. [MeyerJ88] 
proposed a general, analytical approach to the study of workload effects on computer system dependability. 
In the study, a Markov renewal process model was established to represent the interaction between workload 
and fault accumulation in the system. Two types of interaction were considered: workload may help detect/correct 
a correctable fault, or it may cause the system to fail by activating an uncorrectable fault. The 
faults considered in the model were transient in nature, which were modeled as dormant, soft internal faults 
in the system. Such faults can accumulate in the system and be activated by workload.
The modeled workload consists of various types of task arrivals which have different processing requirements. 
The relationship between workload and system states was defined as follows: for each type of task 
i, there is a corresponding threshold value $m_i$. If the total number of dormant faults in the system does 
not exceed $m_i$, an arrival of a type i task activates and corrects any existing faults with fault tolerance 
mechanisms provided by the system, and brings the system back to the fault-free state. Otherwise, if the 
total number of dormant faults in the system exceeds $m_i$, the service request of type i tasks cannot be met and the 
system fails. The threshold value $m_i$ was called the fault margin associated with the type i task. The fault 
margins characterize the fault tolerance of the system.
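The fault margin rule can be illustrated by replaying a sequence of fault and task arrivals. The sketch below is a deterministic toy, not the Markov renewal model of [MeyerJ88]; the event encoding and the function name are hypothetical.

```python
def first_failure(events, margins):
    """Replay fault and task arrivals; return the index of the failing event.

    events:  sequence of "fault" or ("task", i) in arrival order
    margins: dict mapping task type i to its fault margin m_i
    Returns None if the system survives the whole sequence.
    """
    dormant = 0                           # dormant faults accumulated so far
    for index, event in enumerate(events):
        if event == "fault":
            dormant += 1                  # a new fault becomes dormant
            continue
        _, task_type = event
        if dormant <= margins[task_type]:
            dormant = 0                   # faults activated and corrected
        else:
            return index                  # margin exceeded: the system fails
    return None
```

With margins {1: 2, 2: 0}, a type 1 task clears up to two dormant faults, whereas a type 2 task fails the system if even one dormant fault is present when it arrives.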
The study examined some specific examples (input tasks as workload, and internal activity, or self-exercising, 
as workload) and showed how the probabilistic nature of “time to failure” can be formulated directly 
in terms of workload, fault arrival, and fault margins. The study also examined the role of self-exercising in 
fault tolerant systems and showed how self-exercising and fault tolerance interact in their influence on time 
to failure. The results provided new insights into how workload affects the dependability of fault tolerant 
systems.
Later, another methodology was proposed in [Aupperle89] for evaluating systems with nonhomogeneous 
workloads and fault arrivals. The proposed methodology employed analytic techniques based on Markov 
processes and stochastic activity networks. The modeled environment was assumed to vary between different 
utilization phases (e.g., passive use period and active use period) of random duration which have different 
workload effects on fault occurrence and recovery. External faults due to physical or human causes as well 
as internal design and operational faults were considered in the model. The study addressed questions about 
the system at random times, e.g., system availability at a phase transition or mean time to failure relative 
to the beginning of some phase. Since the methodology accounts for a nonhomogeneous environment, it can 
be used to evaluate the use of different fault recovery techniques during different phases.
In [Dunkel90], a workload dependent memory fault model was developed. The overall model consists of 
two submodels: 1) a memory fault occurrence model which depends on several workload parameters such 
as the page references and the execution time of tasks, and 2) a performance model which accounts for the 
influence of memory faults on system performance. Additional workload produced by fault handling was also 
taken into account. The study used queueing network analysis methods to evaluate the average performance 
loss caused by memory faults. The transient task reliability, which quantifies the risk that a single task 
suffers a memory fault, was also evaluated. The results demonstrated that the performance decrease caused 
by memory errors depends on system workload and operating system characteristics.
Failure Dependency among Components
It was mentioned in Section 0.2.3 that the correlation coefficient can be used to quantify the linear dependence 
between two variables. When errors/failures on two components are related, the correlation coefficient 
between the two components is a good measure of such dependence. The question is how to obtain it from 
measured data.
The first step in correlation analysis is building a data matrix based on the measured data. Assume 
that there are n components in the measured system and that the measured period is divided into m equal 
intervals of $\Delta t$ (e.g., 5 minutes). An m × n data matrix can then be constructed in the following way. The 
n columns of the matrix represent the n components in the measured system. The m rows of the matrix 
represent the m time intervals. Element (i, j) of the matrix is set to the number of errors (or set to 1 in 
the failure case) occurring within interval i on component j. Column j can be regarded as a sample of the 
random variable $X_j$, which represents the state of component j in the system.
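The construction of the data matrix can be sketched as follows. The event format (timestamp, component index) and the function name are assumptions made for illustration; timestamps and the interval length must use the same unit.

```python
def build_data_matrix(events, n_components, start, end, dt):
    """Count errors per (interval, component).

    events: iterable of (timestamp, component_index) pairs
    dt:     interval length in the same unit as the timestamps
    Returns an m x n list of lists, m = number of whole intervals.
    """
    m = int((end - start) // dt)
    matrix = [[0] * n_components for _ in range(m)]
    for timestamp, j in events:
        i = int((timestamp - start) // dt)
        if 0 <= i < m:
            matrix[i][j] += 1     # for failure data, set the element to 1 instead
    return matrix
```

For example, with 5-minute intervals (dt = 300 seconds), two errors on component 0 at 10 s and 20 s fall into row 0, and an error on component 1 at 310 s falls into row 1.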
Table 0.11: Average Correlation Coefficients for VAXcluster Errors
         Error                                          Failure
All    CPU    Memory   Disk   Network   Software        All
0.62   0.03   0.01     0.78   0.70      0.02            0.06
The second step is calculating correlation coefficients using Equation 0.19 based on the data matrix. Each 
time, we pick two columns ($X_i$ and $X_j$) to calculate $\mathrm{Cor}(X_i, X_j)$. This step can be automated by using a 
statistical package such as SAS. Table 0.11 lists the average correlation coefficients of the 21 pairs of machines 
in a VAXcluster for different types of errors and failures [Tang93a]. Generally, the error correlation is high 
(0.62) and the failure correlation is low (0.06). Disk and network errors are strongly correlated, because the 
processors in the system heavily use and share the disks and the network concurrently.
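The pairwise step can also be written directly. The sketch below assumes that Equation 0.19 is the usual Pearson correlation coefficient and averages it over all column pairs; a seven-machine system gives 21 such pairs. Function names are illustrative, and treating a constant column as uncorrelated is a choice made here to avoid division by zero.

```python
import math
from itertools import combinations

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0                    # constant column: treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def average_pairwise_correlation(matrix):
    """Mean of Cor(X_i, X_j) over all column pairs of an m x n matrix."""
    columns = list(zip(*matrix))
    pairs = list(combinations(range(len(columns)), 2))
    total = sum(correlation(columns[i], columns[j]) for i, j in pairs)
    return total / len(pairs)
```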
We have seen that the failure correlation coefficient in Table 0.11 is low (0.06). An important question 
is: does such a low correlation have an impact on dependability? Two independent studies, [Dugan91] and 
[Tang91], showed through different approaches that even a small correlation can have a significant impact 
on system unavailability. Here, we discuss the approach used in [Dugan91].
In [Dugan91], another type of correlation coefficient, which is different from that discussed above, was 
introduced to quantify the correlation between the i-th and (i + 1)-th failure. Let A and B be random 
variables representing failures of the first and second components, respectively. A (B) takes a value of 1 if 
the first (second) component is in the failure state. Otherwise, it takes a value of 0. (So A and B are similar 
to Xi in the matrix discussed above.) The steady-state linear correlation coefficient between A and B was 
defined as follows:
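$$\rho_{AB} = \frac{E[(A - m_A)(B - m_B)]}{\sqrt{E[(A - m_A)^2] \, E[(B - m_B)^2]}} , \qquad (0.38)$$
where $m_A$ and $m_B$ are the time averages (means) of A and B, respectively. Given the failure rate of the first 
component, $\lambda_1$, $\rho_{AB}$ can be used to determine the failure rate of the second component, $\lambda_2$, by
$$\lambda_2 = \frac{\mu (\lambda_1 + \lambda_1 \rho_{AB} + \mu \rho_{AB})}{\mu - \lambda_1 \rho_{AB} - \mu \rho_{AB}} , \qquad (0.39)$$
where $\mu$ is the recovery rate of components 1 and 2.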
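As a small numerical illustration, the relation can be evaluated directly. The sketch takes Equation 0.39 to read failure rate 2 = mu (lambda1 + lambda1 rho + mu rho) / (mu - lambda1 rho - mu rho), which is an assumption about the intended form; the function name and the guard against a non-positive denominator are likewise illustrative.

```python
def second_failure_rate(lambda_1, mu, rho):
    """Failure rate of the second component under the assumed form of Eq. 0.39.

    lambda_1: failure rate of the first component
    mu:       recovery rate shared by both components
    rho:      steady-state correlation coefficient between A and B
    """
    denominator = mu - lambda_1 * rho - mu * rho
    if denominator <= 0:
        raise ValueError("rho is too large for these rates")
    return mu * (lambda_1 + lambda_1 * rho + mu * rho) / denominator
```

With rho = 0 the function returns lambda_1 unchanged, as expected for independent components; for rho > 0 the returned rate is larger.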
The study applied the methodology to solve Markov models of two, three, and four component systems 
subject to permanent, intermittent, and transient correlated failures under a range of assumed correlation 
coefficients. Table 0.12 lists the evaluated unavailability for systems with 2, 3, and 4 components under 
several different correlation coefficients, given a set of component failure and recovery rates. The results 
showed that a correlation coefficient as small as 0.0001 to 0.01 can increase the unavailability of 
a system by several orders of magnitude.
Table 0.12: Effect of Correlation on Unavailability for Systems with 2 to 4 Components
$\rho_{AB}$            0                0.0001           0.001            0.01
2-Component System     4.84 × 10^-8     7 × 10^-8        4.28 × 10^-7     2.27 × 10^-6
3-Component System     1.06 × 10^-11    3.63 × 10^-10    6.71 × 10^-9     1.97 × 10^-7
4-Component System     2.34 × 10^-15    2.93 × 10^-11    1.40 × 10^-9     8.64 × 10^-8
If errors/failures on more than two components are related, the correlation coefficient is not enough to 
quantify the dependence among these components (multiway correlation). In such a case, the factor analysis 
method introduced in Section 0.2.3 can be used to uncover the multiway correlation. In the following, 
the application of factor analysis is illustrated using the processor failure data collected from a Tandem 
fault-tolerant system [Lee91].
Similar to the correlation analysis discussed above, the first step is building an m × n data matrix based 
on measurements, where n is the number of components in the system and m is the number of measured 
time intervals. The element (i, j) of this matrix has a value of 1 if processor j halts during the i-th time 
interval; otherwise, it has a value of 0. The j-th column of the matrix represents the sample halt history 
of processor j, while the i-th row of the matrix represents the state of the eight processors in the i-th time 
interval. The matrix is called a processor halt matrix.
In [Lee91], the factor analysis approach was applied to data collected from an 8-processor Tandem 
fault-tolerant system (i.e., n = 8). The time interval ($\Delta t$) used was 30 minutes. Results obtained by applying the 
SAS procedure FACTOR to the processor halt matrix are shown in Table 0.13. The numbers in the middle 
of the table are factor loadings, and the last column shows commonality. The bottom two rows show the 
amount of variance explained by each common factor and its percentage of the total variance.
Table 0.13: Factor Pattern of the Tandem Processor Halts
Processor   Factor 1   Factor 2   Factor 3   Factor 4   Commonality
1            0.997     -0.004     -0.069      0.023     1.00
2            0.000      0.000      0.000      0.000     0.00
3            0.061      0.012      0.853     -0.133     0.75
4            0.001      0.999     -0.011      0.021     1.00
5            0.982     -0.000      0.188     -0.018     1.00
6           -0.001      0.447     -0.005      0.009     0.20
7            0.047     -0.002      0.862      0.506     1.00











According to [Dillon84], factor loadings greater than 0.5 are considered significant. However, in reliability 
analysis, factor loadings lower than 0.5 can be significant. The results show that there are four common 
factors. Factor 1 captures the dependence between processor 1 and processor 5 and accounts for 24.6% of 
the total variance. Factor 2 captures the multiway dependence among processors 4, 6, and 8, although the 
contribution of processor 6 is small ($0.447^2$, i.e., 20% of its variance is explained by this factor). Factor 2 
explains 22.3% of the total variance. Factor 3 captures the dependence between processor 3 and processor 
7 and contributes 19% to the total variance. Factor 4 captures the dependence, although it is lower (with 
factor loadings 0.506 and 0.641), between processor 7 and processor 8 and accounts for 8.6% of the total 
variance.
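The relation between factor loadings and the commonality column can be checked numerically: the commonality of a processor is the sum of its squared loadings, and a single squared loading is the share of that processor's variance explained by one factor. The sketch below uses the loadings of processors 1 and 3 through 7 from Table 0.13; the helper name is illustrative.

```python
# processor -> loadings on Factors 1 to 4, copied from Table 0.13
loadings = {
    1: (0.997, -0.004, -0.069, 0.023),
    3: (0.061, 0.012, 0.853, -0.133),
    4: (0.001, 0.999, -0.011, 0.021),
    5: (0.982, -0.000, 0.188, -0.018),
    6: (-0.001, 0.447, -0.005, 0.009),
    7: (0.047, -0.002, 0.862, 0.506),
}

def commonality(row):
    """Sum of squared factor loadings for one variable."""
    return sum(a * a for a in row)

for processor, row in loadings.items():
    print(processor, round(commonality(row), 2))
```

Rounded to two decimals, the computed values agree with the last column of the table, and the squared loading of processor 6 on Factor 2 is about 0.20.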
0.5.5 Markov Reward Modeling
Many natural and social phenomena can be modeled by Markov or semi-Markov stochastic processes 
[Trivedi82]. In the computer area, the Markov process is one of the most frequently used models in performance 
and dependability evaluation. Compared to combinatorial models, Markov models have several advantages, 
such as the ability to handle time-dependent failure rate, performance degradation, and interactions among 
components. In the area of analytical modeling of computer systems, performability models [MeyerJ80], 
[MeyerJ92], availability models [Goyal87], and Markov reward models [Reibman89], [Trivedi92] have all been 
addressed. Typically, Markov models are built based on certain assumptions (such as independent failures 
on different components) using individual component parameters (such as failure and recovery rates). The 
evaluated results are highly dependent on the input parameters and on the model assumptions. A substantial 
amount of the research addresses the question of how to solve a given model. However, how to identify an 
accurate model to start with remains unclear. In addition, the assumptions made in building analytical models 
need to be validated by measurement-based analysis.
In measurement-based modeling, Markov models are identified from data and are therefore called measured 
models [Tang93b]. No additional assumptions (beyond the Markov property) are made in the construction 
of these models. Measured models provide the best evaluation for real systems as well as insight into
the development of representative analytical models. Thus, it is valuable to identify appropriate models 
from measured data. In the following sections, measurement-based Markov reward modeling techniques 
are illustrated by a system model generated for a VAXcluster and a software model generated for an IBM 
operating system.
Modeling of a Distributed System
The data used for the modeling was collected from a DEC VAXcluster system, consisting of seven machines, 
for 250 days [Tang93a]. For this system, an error was defined as an abnormality in any component of the 
system. If an error led to a termination of service on a machine, it was defined as a failure. A failure was 
identified by a reboot following one or multiple error reports.
A. Model Construction
Since the measured VAXcluster has seven machines, an 8-state Markov error model is constructed. The 
eight states, $E_0$, $E_1$, ..., and $E_7$, are defined such that $E_i$ represents the state wherein i machines observe 
errors at the same time (the time granularity is chosen to be 1 second). For example, state $E_0$ represents 
that none of the machines experiences errors, i.e., the VAXcluster is in the normal (error-free) state; state 
$E_7$ represents that all the machines experience errors. At any measured time, the VAXcluster is in one of 
these states.
The transition probabilities for the 8-state model are estimated from the error event data. Given that 
the system is in state $E_i$, the probability that it will go to state $E_j$, $p_{ij}$, is calculated as follows:
$$p_{ij} = \frac{\text{observed number of transitions from } E_i \text{ to } E_j}{\text{observed number of transitions out of } E_i} \qquad (0.40)$$
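The estimate of Equation 0.40 amounts to counting transitions in the observed state sequence. The following is a minimal sketch; the function name and the representation of the sequence as a list of state indices are assumptions made for illustration.

```python
from collections import Counter

def transition_probabilities(states, n_states):
    """Estimate p_ij from an observed sequence of visited states.

    states: state indices in visiting order; consecutive entries differ,
            because a transition always leaves the current state
    """
    counts = Counter(zip(states, states[1:]))   # (i, j) -> observed transitions
    out_of = Counter(states[:-1])               # i -> transitions out of state i
    return [[counts[(i, j)] / out_of[i] if out_of[i] else 0.0
             for j in range(n_states)]
            for i in range(n_states)]
```

Rows for states that were never left are returned as all zeros rather than raising a division error; this is a convenience of the sketch, not part of Equation 0.40.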
Table 0.14: Transition Probability for the VAXcluster Error Model
State   E0     E1     E2     E3     E4     E5     E6     E7
E0      .000   .891   .084   .014   .004   .002   .002   .003
E1      .824   .000   .145   .023   .004   .003   .001   .000
E2      .239   .594   .000   .118   .035   .009   .004   .001
E3      .126   .211   .401   .000   .227   .024   .009   .003
E4      .079   .147   .102   .422   .000   .205   .034   .011
E5      .058   .115   .054   .073   .367   .000   .315   .018
E6      .070   .081   .024   .016   .073   .406   .000   .331
E7      .125   .104   .000   .021   .036   .161   .552   .000
Figure 0.28: An Error Propagation Model for the VAXcluster
Table 0.14 shows the transition probabilities calculated from the VAXcluster error data. Based on the 
table, an error propagation model can be obtained by calculating the probability that the system goes from 
state $E_i$ (i = 1, ..., 6) to any of the lower states ($E_{i-1}$, ..., $E_0$) and the probability that it goes from $E_i$ to 
any of the higher states ($E_{i+1}$, ..., $E_7$). These probabilities are easily determined by summing all the row 
elements to the left of element (i, i), and all the row elements to the right of element (i, i) in the table. The 
error propagation model is shown in Figure 0.28. An interesting error propagation characteristic is uncovered 
with this model. Notice that the transition probabilities to higher states (numbers in the upper line) tend
to increase as the state increases. That is, once an error domain encompasses more than one machine, the 
probability of the domain involving more machines increases. In such situations, error containment can 
become increasingly difficult.
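The row sums that define the error propagation model can be reproduced from Table 0.14. The sketch below copies the table into a list of rows and sums the entries to the left and right of the diagonal; the variable and function names are illustrative.

```python
# Rows E0..E7 of Table 0.14; column j holds the probability of moving to Ej.
P = [
    [.000, .891, .084, .014, .004, .002, .002, .003],
    [.824, .000, .145, .023, .004, .003, .001, .000],
    [.239, .594, .000, .118, .035, .009, .004, .001],
    [.126, .211, .401, .000, .227, .024, .009, .003],
    [.079, .147, .102, .422, .000, .205, .034, .011],
    [.058, .115, .054, .073, .367, .000, .315, .018],
    [.070, .081, .024, .016, .073, .406, .000, .331],
    [.125, .104, .000, .021, .036, .161, .552, .000],
]

def propagation(P, i):
    """Return (probability of moving to a lower state, to a higher state)."""
    return sum(P[i][:i]), sum(P[i][i + 1:])

for i in range(1, 7):
    lower, higher = propagation(P, i)
    print(f"E{i}: lower {lower:.3f}, higher {higher:.3f}")
```

The sums toward higher states rise from about 0.18 for E1 to about 0.33 for E5 and E6, which matches the tendency noted above.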
B. Reward Analysis
Markov models can be used to conduct reward analysis [Trivedi92] to quantify the loss of service due to 
errors and failures. The key step is to define a reward function that characterizes the performance loss in 
each degraded state. For a multicomputer system, a generic reward function can be defined for both a single 
machine and the whole system. Given a time interval $\Delta T$ (random variable), a reward rate for the system 
in $\Delta T$ is determined by
$$r(\Delta T) = W(\Delta T) / \Delta T , \qquad (0.41)$$
where $W(\Delta T)$ denotes the useful work done by the system in $\Delta T$ and is calculated by
$$W(\Delta T) = \begin{cases} \Delta T & \text{in normal state} \\ \Delta T - n\tau & \text{in error state} \\ 0 & \text{in failure state} , \end{cases} \qquad (0.42)$$
where n is the number of raw errors (error entries in the log; see Section 0.5.2) in $\Delta T$ and $\tau$ is the mean 
recovery time for a single error. Thus, one unit of reward is given for each unit of time when the system is 
in the normal state. In an error state, the penalty paid depends on the recovery time the system spends in 
that state, which is determined by the linear function $\Delta T - n\tau$ (normally, $\Delta T > n\tau$; if $\Delta T < n\tau$, $W(\Delta T)$ 
is set to 0). In a failure state, $W(\Delta T)$ is by definition zero.
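The reward function translates directly into code. In the sketch below, the state labels, the parameter names, and the name tau for the mean recovery time per error are illustrative; all times must be given in the same unit.

```python
def useful_work(delta_t, state, n_errors=0, tau=0.0):
    """Useful work W done in an interval of length delta_t (Equation 0.42)."""
    if state == "normal":
        return delta_t
    if state == "error":
        # Linear penalty, clamped to 0 when delta_t < n_errors * tau.
        return max(delta_t - n_errors * tau, 0.0)
    if state == "failure":
        return 0.0
    raise ValueError(f"unknown state: {state}")

def reward_rate(delta_t, state, n_errors=0, tau=0.0):
    """Reward rate r = W / delta_t (Equation 0.41)."""
    return useful_work(delta_t, state, n_errors, tau) / delta_t
```

For instance, an error interval of 10 time units containing 3 raw errors with a mean recovery time of 0.5 units yields a reward rate of 0.85.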
Applying Equation 0.42 to the VAXcluster, the reward rate formula has the following form:
$$r(\Delta T) = \sum_{k=1}^{7} W_k(\Delta T) / (7 \times \Delta T) , \qquad (0.43)$$
where $W_k(\Delta T)$ denotes the useful work done by machine k in time $\Delta T$. Here all machines are assumed to 
contribute an equal amount of reward to the system. For example, if three machines fail when the system is 
in $E_3$, the reward rate is 4/7.
The expected steady-state reward rate, Y, can be estimated by
$$Y = \frac{1}{T} \sum_{j} r(\Delta t_j) \, \Delta t_j , \qquad (0.44)$$
where T is the summation of all $\Delta t_j$'s (particular values of $\Delta T$) in consideration. If we substitute r from 
Equation 0.43 and let $\Delta T$ represent the holding time of each state in the error model, Y becomes the steady-state 
reward rate of the VAXcluster, which is also an estimate of system availability (performance-related 
availability). If we substitute r from Equation 0.43 and let $\Delta T$ represent the time span of the error event for 
a particular type of error, Y becomes the steady-state reward rate of the system during the event intervals 
of the specified error. Thus, (1 − Y) measures the loss in performance during the specified error event. Note 
that it is possible that there are failed machines when the system is in an error state. Since the model is an 
empirical model based on the error event data (of which the failure event data is a subset), the information 
about errors and failures of all machines for each particular $\Delta t_j$ can be obtained from the data.
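Equations 0.43 and 0.44 can be combined into a short computation: the cluster reward rate for one interval is the mean useful work per machine divided by the interval length, and Y is the time-weighted average of these rates. The function names and the input layout are assumptions made for illustration.

```python
def cluster_reward_rate(work_per_machine, delta_t):
    """Equation 0.43: total useful work divided by (machines x delta_t)."""
    return sum(work_per_machine) / (len(work_per_machine) * delta_t)

def steady_state_reward(intervals):
    """Equation 0.44: time-weighted average of the reward rate.

    intervals: list of (delta_t, work_per_machine) pairs, one per interval
    """
    total_time = sum(delta_t for delta_t, _ in intervals)
    weighted = sum(cluster_reward_rate(work, delta_t) * delta_t
                   for delta_t, work in intervals)
    return weighted / total_time
```

With seven machines of which three deliver no work during an interval and four deliver full work, cluster_reward_rate returns 4/7, matching the example given for state E3.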
The steady-state reward rate for the VAXcluster was computed with $\tau$ being 0.1, 1, 10, and 100 ms. 
The results are given in Table 0.15. The table shows that the reward rate is not sensitive to $\tau$. This is 
because the overall recovery time is dominated by the failure recovery time, i.e., the major contributors 
to the performance loss are failures, not nonfailure errors. In the range of these $\tau$ values, the VAXcluster 
availability is estimated to be 0.995. Table 0.16 shows the steady-state reward rate for each error type 
($\tau$ = 1 ms) for the VAXcluster. These numbers quantify the loss of performance due to the recovery from 
each type of error. For example, during the recovery from CPU errors, the system can be expected to deliver 
approximately 15% of its full performance. During the disk error recovery, the average system performance 
degrades to nearly 61% of its capacity. Since software errors have the lowest reward rate (0.00008), the loss 
of work during the recovery from software errors is the most significant.
Table 0.15: Steady-State Reward Rate for the VAXcluster
$\tau$   0.1 ms     1 ms       10 ms      100 ms
Y        0.995078   0.995077   0.995067   0.994971
Table 0.16: Steady-State Reward Rate for Each Error Type in the VAXcluster
CPU       Memory    Disk      Tape      Network   Software
0.14950   0.99994   0.61314   0.89845   0.56841   0.00008
Modeling of an Operating System
The modeled operating system is the IBM MVS system running on an IBM 3081 mainframe [Hsueh87]. The 
measurement period is 1 year. A Markov model is developed using data collected from the system to describe 
error detection and recovery inside an operating system. The MVS is a widely used IBM operating system. 
Primary features of the system are reported to be efficient storage management and automatic software error 
recovery. The MVS system attempts to correct software errors using recovery routines. The philosophy in 
the MVS is that for major system functions, the programmer envisages possible failure scenarios and writes 
a recovery routine for each. It is, however, the responsibility of the installation (or the user) to write recovery 
routines for applications.
Recovery routines in the MVS operating system provide a means by which the operating system prevents 
a total loss on the occurrence of software errors. When a program is abnormally interrupted due to an error, 
the supervisor routine gets control. If the problem is such that further processing can degrade the system or 
destroy data, the supervisor routine gives control to the recovery termination manager (RTM), an operating 
system module responsible for error and recovery management. If a recovery routine is available for the 
interrupted program, the RTM gives control to this routine before it terminates the program. The purpose 
of a recovery routine is to free the resources kept by the failing program, to locate the error, and to request 
either a retry or the termination of the program.
More than one recovery routine can be specified for the same program. If the current recovery routine 
is unable to restore a valid state, the RTM can give control to another recovery routine, if available. This 
process is called percolation. The percolation process ends if either a routine issues a valid retry request or no 
more recovery routines are available. In the latter case, the program and its related subtasks are terminated. 
If a valid retry is requested, a retry is attempted to restore a valid state using the information supplied by 
the recovery routine and then give control to the program. For a retry to be valid, there should be no risk 
of error recurrence and the retry address should be properly specified. An error recovery can result in any 
of the following four situations:
1. Resume op (resume operation)—The system successfully recovers from the error and returns control 
to the interrupted program.
2. Task term (task termination)—The program and its related subtasks are terminated, but the system 
does not fail.
3. Job term (job termination)—The job in control at the time of the error is aborted.
4. System failure—The job or task, which was terminated, is critical for system operation. As a result of 
the termination, a system failure occurs.
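The percolation logic and its four outcomes can be summarized in a small sketch. This is an illustration of the control flow described above, not the MVS interface: the Outcome names follow the list, while the callable recovery routines and the is_job and is_critical flags are hypothetical stand-ins for information that the RTM would obtain from the system.

```python
from enum import Enum

class Outcome(Enum):
    RESUME_OP = "resume op"
    TASK_TERM = "task term"
    JOB_TERM = "job term"
    SYSTEM_FAILURE = "system failure"

def recover(recovery_routines, is_job, is_critical):
    """Percolate through recovery routines until one issues a valid retry.

    recovery_routines: callables returning True for a valid retry request
    is_job:            hypothetical flag, True if termination aborts the job
    is_critical:       hypothetical flag, True if the terminated work is
                       critical for system operation
    """
    for routine in recovery_routines:
        if routine():                     # valid retry: control returns
            return Outcome.RESUME_OP
    # No routine issued a valid retry (or none existed): terminate.
    if is_critical:
        return Outcome.SYSTEM_FAILURE
    return Outcome.JOB_TERM if is_job else Outcome.TASK_TERM
```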
A. Model Construction
The model consists of a normal state, eight types of error states (listed in Table 0.17) and four states 
(corresponding to the above four situations: Resume op, Task term, Job term, and System failure) resulting 
from error recoveries. The normal state represents the operating system running error-free. The transition 
probabilities from states to states were estimated from the measured data using Equation 0.40.
Table 0.17 shows the mean waiting time characteristics of the normal and error states in the model. 
Note that the waiting time distribution of the normal state is the TTE distribution. It has been shown 
in Section 0.5.3 that this distribution is not a simple exponential but a multistage gamma distribution, so the 
model is a semi-Markov model. In the table, a multiple software error is defined as an error burst consisting 
of more than one type of software error. The average duration of a multiple error is at least four times longer 
than that of any type of single error, which is typically in the range of 20 to 40 seconds, except for DLCK 
(deadlock) and OTHR (others). The average recovery time from a program exception is twice as long as 
that from a control error (42 seconds versus 21 seconds). This is probably due to the extensive software 
involvement in recovering from program exceptions.
Table 0.17: Mean Waiting Time
State                                   # Observations   Mean Waiting Time (Sec.)   Standard Deviation
Normal (Error-Free)                     2757             10461.33                   32735.04
CTRL (Control Error)                    213              21.92                      84.21
DLCK (Deadlock)                         23               4.72                       22.61
I/O (I/O & Data Management Error)       1448             25.05                      77.62
PE (Program Exception)                  65               42.23                      92.98
SE (Storage or Address Exception)       149              36.82                      79.59
SM (Storage Management Error)           313              33.40                      95.01
OTHR (Other Type)                       66               1.86                       12.98
MULT (Multiple Software Error)          481              175.59                     252.79
B. Model Evaluation
The steady-state measures evaluated from the model are listed in Table 0.18. The definitions of these 
measures are given in [Howard71].
1. Transition probability ($\pi_j$)—probability that the transition is to state j, given that a transition occurs
2. Occupancy probability ($\Phi_j$)—probability that the system occupies state j at any time point
3. Mean recurrence time ($\Theta_j$)—mean recurrence time of state j
The occupancy probability of the normal state can be viewed as the operating system availability without 
degradation. The state transition probability, on the other hand, characterizes error detection and recovery 
processes in the operating system. Table 0.18(a) lists the state transition probabilities and occupancy 
probabilities for the normal and error states.
Table 0.18(b) lists the state transition probabilities and the mean recurrence times of the recovery and 
result states. A dash (—) in the table indicates a negligible value (less than 0.00001). Table 0.18(a) shows 
that the occupancy probability of the normal state in the model is 0.995. This indicates that 99.5% of 
the time the operating system is running error-free. In the other 0.5% of the time, the operating system is 
in an error or recovery state. In more than half of the error and recovery time (i.e., 0.29% out of 0.5%) the 
operating system is in the multiple error state. The average reward rate for all software error and recovery 
states is estimated from data to be 0.2736. Based on this reward rate and the occupancy probability for all 
error and recovery states shown in the table (0.005), the steady-state reward loss in the modeled MVS can 
be evaluated to be 0.00363.
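The value 0.00363 follows from multiplying the occupancy probability of the error and recovery states by the fraction of reward lost in them. As a worked check (the symbols $\Phi_{\mathrm{err}}$ and $r_{\mathrm{err}}$ are introduced here only for illustration):

```latex
\text{reward loss} = \Phi_{\mathrm{err}} \, (1 - r_{\mathrm{err}})
                   = 0.005 \times (1 - 0.2736)
                   = 0.005 \times 0.7264
                   \approx 0.00363
```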
By solving the model, it is found that the operating system makes a transition every 43.37 minutes. 
Table 0.18(a) shows that 24.74% of all transitions made in the model are to the normal state, 24.73% to
Table 0.18: Error/Recovery Model Characteristics
          Normal   Error State
Measure   State    CTRL      DLCK     I/O       PE         SE         SM        OTHR     MULT
π         0.2474   0.0191    0.0020   0.1299    0.0060     0.0134     0.0281    0.0057   0.0431
Φ         0.9950   0.00016   —        0.00125   0.000098   0.000189   0.00036   —        0.002913
(a)
Measure   Recovery State   Resultant State
















error states (obtained by summing the $\pi$'s for all error states), 25.79% to recovery states, and 24.74% to 
result states. Since a transition occurs every 43 minutes, it can be estimated that, on the average, a software 
error is detected every 3 hours and a successful recovery (i.e., reaching the “resume op” state) occurs every 5 
hours. Table 0.18(b) also shows that more than 40% of software errors lead to job or task terminations, which 
cause the loss of service to users. However, only a few of these terminations lead to system failures. This result 
indicates that recovery routines in MVS are effective in avoiding system failures, but are not so effective in 
avoiding user job terminations.
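The estimate of one software error roughly every 3 hours can be reproduced from the transition probabilities in Table 0.18(a) and the mean time between transitions. As a worked check:

```latex
\sum_{j \in \text{error states}} \pi_j
  = 0.0191 + 0.0020 + 0.1299 + 0.0060 + 0.0134 + 0.0281 + 0.0057 + 0.0431
  = 0.2473,
\qquad
\frac{43.37\ \text{min}}{0.2473} \approx 175\ \text{min} \approx 2.9\ \text{hours}
```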
0.5.6 Software Dependability
A great deal of research has been performed in the area of software reliability during the development 
phase. Different models have been proposed (reviewed in [Musa87]) to characterize the reliability growth 
of the candidate software through this phase. In general, these models can be divided into two classes. 
The first assumes that the failure rate is a function of the number of remaining defects in the software. 
Imperfect debugging and uncertainty in the projected number of initial defects have also been modeled 
[Goel85]. The second class of models does not depend on knowing the number of the remaining defects 
[Littlewood80]. The failure rate is assumed to be a random variable, and the software reliability model 
involves two stochastic processes. Although most models perform well within their own contexts, their 
performance varies significantly from one data set to another.
The operational phase of mature software is much different from the development phase. In the operational 
phase, a typical situation involves frequent changes and updates installed either by system managers 
or by vendors. Often, without notification to the installation management, the vendor will install a change 
(patch) to fix a fault found at some other installation. In a sense, the system being measured represents 
an aggregate of all such systems being maintained by the vendor. In addition, software reliability in the 
operational phase is attributed to workload effects, hardware problems, and environmental factors. Thus, 
software reliability in the operational phase cannot be characterized by simply applying analytical models 
proposed for the development phase.
Studies dealing with software dependability issues for the operational phase have also evolved over the 
past 15 years. Software TTE distributions (Section 0.5.3), dependency between software failure and workload 
(Section 0.5.4), and modeling of software error/recovery processes (Section 0.5.5) have been discussed in 
previous sections. In this section, several other issues, including error interactions (i.e., hardware-related 
and correlated software errors), software fault tolerance, and software defect classification are discussed.
Error Interactions
When software is running in a complex system, interactions between hardware and software, and interactions 
among multiple processors can cause software error scenarios that cannot be seen during testing. Investigation 
of such error scenarios is helpful for understanding characteristics of software errors in operational 
systems. In the following, two kinds of such error scenarios are discussed: hardware-related software errors, 
which are a result of interactions between hardware and software, and correlated software errors, which are 
a result of interactions among processors through software protocols.
A. Hardware-Related Software Errors
In [Iyer85a], software errors related to hardware errors were described as hardware-related software errors. 
More precisely, if a software error (failure) occurs in close proximity (within a minute) to a hardware error, it 
is called a hardware-related software (HW/SW) error (failure). There are several causes of hardware-related 
software errors. For instance, a hardware error, such as a flipped memory bit, may change the software 
conditions, resulting in a software error. Therefore, even though the error is reported as a software error, 
it is actually caused by faulty hardware. Another possibility is that the software may fail to handle an 
unexpected hardware problem, such as an abnormal condition in the network communication. This can 
be attributed to a software design flaw. Sometimes, both the hardware error and the software error are 
symptoms of another, unidentified problem.
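The one-minute proximity criterion can be illustrated with a short sketch. It assumes that "close proximity" means within 60 seconds before or after a hardware error (the symmetric window is an assumption), and the function name `hw_related` and the sample timestamps are hypothetical.

```python
from bisect import bisect_left

WINDOW_SECONDS = 60  # "within a minute"; treating the window as symmetric is an assumption


def hw_related(software_times, hardware_times, window=WINDOW_SECONDS):
    """Return the software error timestamps (seconds) that lie within
    `window` seconds of at least one hardware error timestamp."""
    hw = sorted(hardware_times)
    related = []
    for t in software_times:
        # First hardware error at or after the start of the window around t.
        pos = bisect_left(hw, t - window)
        if pos < len(hw) and hw[pos] <= t + window:
            related.append(t)
    return related


# Hypothetical timestamps: software errors at 100 s and 1000 s have a
# hardware error within one minute; the one at 500 s does not.
print(hw_related([100, 500, 1000], [90, 1055]))
```

Dividing the number of flagged errors by the total number of software errors would give a percentage of the kind reported for each system.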
Table 0.19: Hardware-Related Software Errors/Failures
Category    HW/SW Errors          HW/SW Failures
Measures    Frequency   Percent   Frequency   Percent
IBM/MVS     177         11.4      94          32.8
VAX/VMS     32          18.9      28          21.4
Table 0.19 shows the frequency and percentage of hardware-related software errors/failures (among all 
software errors/failures) measured from an IBM 3081 system running MVS [Iyer85b] and two VAXclusters 
[Tang92b]. In the IBM system, approximately 33% of all observed software failures are hardware-related. 
HW/SW errors are found to have large error-handling times (high recovery overhead). The system failure 
probability for HW/SW errors is close to three times that for software errors in general. The VAXcluster data 
shows that most hardware errors involved in HW/SW errors are network errors (75%). This indicates that 
the major sources of hardware-related software problems in the measured VAXclusters are network-related 
hardware or software components. This is a unique feature in the multicomputer system, where processes 
rely heavily on intercommunications through the network.
B. Correlated Software Errors
When multiple instances of a software system interact with each other in a multicomputer environment, 
the issue of correlated failures should be addressed. Several studies [Tang90], [Wein90], [Lee91] found that 
significant correlated processor failures exist in the measured multicomputer systems. Correlated software 
failures are also found in the VAX VMS and the Tandem GUARDIAN operating systems [Lee93a]. The 
data showed that about 10% of software failures in the measured VAXcluster and 20% of software halts in 
the measured Tandem system occurred concurrently on multiple machines. To understand how correlated 
software failures occur, it is instructive to examine a real case in detail.
Figure 0.29 shows a scenario of correlated software failures. In the figure, Europa, Jupiter, and Mercury 
are machine names in the VAXcluster. A dashed line indicates that the corresponding machine is in a failure 
state. At one time, a network error (net1) was reported from the CI (Computer Interconnect) port on Europa. 
This resulted in a software failure (soft1) 13 seconds later. Twenty-four seconds after the first network error, 
network errors (net2, net3) were reported on the second machine (Jupiter), followed by a software failure 
(soft2). The error sequence on Jupiter was repeated (net4, net5, soft3) on the third machine (Mercury). The 
three machines experienced software failures concurrently for 45.5 minutes. All three software failures 
occurred shortly after network errors, so they were network-error related.
[Figure: time lines for Europa (net1, soft1), Jupiter (net2, net3, soft2), and Mercury (net4, net5, soft3)]
Note: soft1, soft2, soft3 = Exception while above asynchronous system traps delivery or on interrupt stack; 
net1, net3, net5 = Port will be re-started; net2, net4 = Virtual circuit timeout.
Figure 0.29: A Scenario of Correlated Software Failures
The higher percentage of correlated software failures in the Tandem system can be attributed to its 
architectural characteristics. In the Tandem system, a single software fault can cause halts of two processors 
on which the primary and backup processes (discussed below) of the faulty software are executing. If the 
two halted processors control a disk that includes files needed by other processors on the system, additional 
software halts can occur on these processors. (In the Tandem system, a disk can typically be accessed by 
two processors via dual-port disk controllers.) This explains why there is a higher percentage of correlated 
software failures in the Tandem system.
Note that the above scenario is a multiple component failure situation not expected in general system 
design, which assumes failure independence. Even the Tandem fault-tolerant system is not designed explicitly 
to guard against this situation. Generally, correlated failures can stress recovery and break the protection 
provided by the fault tolerance.
Software Fault Tolerance
While hardware fault tolerance techniques have been used successfully, the issue of software fault tolerance is 
still not well addressed. Major approaches for software fault tolerance rely on design diversity [Avizienis84], 
[Randell75]. But these approaches are usually not applied to large operating systems because of the cost 
they would add to developing and maintaining the software. However, some fault tolerance techniques not 
explicitly designed for tolerating software faults can provide a certain amount of software fault tolerance. 
Understanding such techniques is important for designing good approaches to improving software dependability. 
The Tandem GUARDIAN system, running on the single-failure-tolerant multicomputer system, is a 
good target for such evaluations.
The Tandem GUARDIAN operating system is a message-based distributed system built for on-line transaction 
processing [Bartlett78]. High availability is achieved via single-failure tolerance techniques including 
the process-pair approach. For each user program, there are two processes—a primary process and a backup 
process—executing the same program on two processors. During normal operation, the primary process 
performs all operations for the user, while the backup process passively watches message flows. The primary 
process periodically sends checkpoint messages to its backup. When the primary process detects an inconsistency 
in its state, it fails fast and the backup process takes over the responsibility of the primary process. 
This approach can tolerate transient software errors which will usually not be repeated by re-executing the 
process.
A study of operating system fault tolerance achieved by the single-failure tolerance techniques implemented 
in a Tandem multiprocessor system was reported in [Lee92]. The measured system had 16 processors 
and was working in a high-stress environment. The data source was the processor halt log maintained by 
the GUARDIAN system for a period of 23 months. The effect of the built-in fault tolerance mechanisms on 
software availability was evaluated by reward analysis. Two reward functions were defined in the analysis. 
In the definition, i represents the system state in which there are i failed processors, and n represents the 
total number of processors in the system. The first function (SFT) reflects the fault tolerance of the Tandem 
system. In this function, the first processor halt does not cause any degradation. For additional processor 
halts, the loss of service is proportional to the number of processors halted. The second function (NSFT) 
assumes no fault tolerance. The difference between the two functions allows evaluation of the improvement 
in service due to the built-in fault tolerance mechanisms.
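SFT (Single-Failure Tolerance):
\[
r_i =
\begin{cases}
1 & \text{if } i = 0 \\
1 - \dfrac{i-1}{n} & \text{if } 0 < i < n \\
0 & \text{if } i = n
\end{cases}
\tag{0.45}
\]
NSFT (No Single-Failure Tolerance):
\[
r_i = 1 - \frac{i}{n} \qquad \text{if } 0 \le i \le n
\tag{0.46}
\]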
Based on the above reward functions, the expected steady-state reward rate, i.e., the Y  in Equation 0.44, 
was evaluated for software, non-software, and all halts. The results are given in Table 0.20. The bottom row 
shows the improvement in service time (i.e., reduction in reward loss) due to the fault tolerance. It is seen 
that the single-failure tolerance in the measured system reduces the service loss due to software halts by 89% 
and that due to non-software halts by 92%. This clearly demonstrates the effectiveness of the implemented 
fault tolerance mechanisms against software failures as well as non-software failures. The table also shows 
that software problems account for 30% of the service loss in the measured system (with SFT). Although 
the system was working in a high-stress environment, the overall reward loss is small (on the order of 10^-4 with SFT). This 
reflects the high availability of the measured system.
Table 0.20: Loss of Service Caused by Halts in the Tandem System
Measure              Software   Non-Software   All
NSFT   1 - Y         .00062     .00205         .00267
       Percent       23.2       76.8           100
SFT    1 - Y         .00007     .00016         .00023
       Percent       30.4       69.6           100
Improvement          89%        92%            91%
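The two reward functions and the improvement figures in Table 0.20 can be checked with a short sketch. The function names are hypothetical, and `expected_reward` assumes that the steady-state reward rate Y is the probability-weighted sum of the per-state rewards; the state probabilities themselves are not given here and would have to come from the measured model.

```python
def reward_sft(i, n):
    """Reward with single-failure tolerance: the first halt costs nothing,
    each additional halt costs 1/n, and all n processors halted gives 0."""
    if not 0 <= i <= n:
        raise ValueError("i must be between 0 and n")
    if i == 0:
        return 1.0
    if i == n:
        return 0.0
    return 1.0 - (i - 1) / n


def reward_nsft(i, n):
    """Reward without fault tolerance: every halted processor costs 1/n."""
    if not 0 <= i <= n:
        raise ValueError("i must be between 0 and n")
    return 1.0 - i / n


def expected_reward(state_probs, reward):
    """Assumed form of Y: sum over states i of P(i) * r_i, where
    state_probs[i] is the steady-state probability of i failed processors."""
    n = len(state_probs) - 1
    return sum(p * reward(i, n) for i, p in enumerate(state_probs))


def improvement(loss_nsft, loss_sft):
    """Fractional reduction in reward loss (1 - Y) due to fault tolerance."""
    return (loss_nsft - loss_sft) / loss_nsft


# Reward-loss values (1 - Y) taken from Table 0.20.
for name, nsft, sft in [("software", 0.00062, 0.00007),
                        ("non-software", 0.00205, 0.00016),
                        ("all", 0.00267, 0.00023)]:
    print(name, round(100 * improvement(nsft, sft)), "%")
```

With the table's loss values, the computed reductions round to the 89%, 92%, and 91% shown in the bottom row.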
Software Defect Classification
Recent studies of software defects reported from the IBM MVS operating system [Sullivan91] and two large 
IBM database management systems, DB2 and IMS [Sullivan92], propose a software defect classification 
scheme. The scheme uses three concepts—error type, defect type, and error trigger—to classify software 
faults and errors. The error type classifies the low-level programming mistakes that lead to software failures. 
The defect type is a higher-level classification that distinguishes design mistakes, coding mistakes, and administrative 
mistakes. The error trigger is related to the running environment; it distinguishes several ways 
that defective code that was not executed during testing could be executed at the customer site. Tables 0.21 
to 0.23 list major categories of error types, defect types, and error triggers.
Table 0.21: Major Categories of Error Types
Error Type Description
Allocation Management A module uses a memory region after deallocating it.
Copying Overrun The program copies data past the end of a buffer.
Data Error The program produces or reads wrong data.
Interface Error A module’s interface is defined or used incorrectly.
Memory Leak The program never deallocates memory it obtained from the system.
Pointer Management A variable containing the address of data is corrupted.
Statement Logic Statements are executed in the wrong order or are omitted.
Synchronization An error occurs in locking or synchronization code.
Uninitialized Variable A variable is used before it is initialized.
Undefined State The system goes into a state the designers did not anticipate.
Wrong Algorithm The program works but uses a wrong algorithm.
Table 0.22: Major Categories of Defect Types
Defect Type Description
Function A program’s functionality is missing, incomplete, or incorrect.
Data Struct/Algorithm A data structure or algorithm has a design flaw.
Assignment/Checking A coding mistake involves variable assignment or validation.
Interface Errors are discovered in the interaction between components.
Timing/Synchronization Errors occur in the management of shared or real-time resources.
Build/Package/Merge Errors occur in version control or roll-up of fixes.
The studies compared the error type, defect type, and error trigger distributions of the three products 
(DB2, IMS, and MVS) and found that the three products’ distributions differ significantly. However, they 
have some common characteristics, such as the mode “undefined state.” The studies also investigated the 
impact of software defects on system availability for the MVS operating system. A comparison between 
overlay defects (defects that corrupt a program’s memory) and non-overlay defects demonstrated that the 
impact of an overlay defect is much greater. Boundary conditions and allocation management were found 
to be the major causes of overlay defects.
0.5.7 Failure Prediction
Fault diagnosis and failure prediction are significant for maintaining highly reliable systems. Measurement-based 
studies have shown that it is possible to predict future failures based on the current and historical 
on-line error information. Several heuristic and statistical approaches have been proposed. The heuristic 
approach extracts characteristics of anomalous events and relates them to failures or faults by heuristic rules 
[Lin90]. The statistical approach uses statistical techniques to quantify relationships among system error 
states (defined on the basis of error rates) and recognizes failure patterns using the quantified relationships 
[Iyer90]. Recently, the fault injection method has been used on networks to diagnose faults based on the 
information of performance anomaly caused by the faults [Maxion93]. In the following, we discuss these 
three approaches in detail.
Table 0.23: Major Categories of Error Triggers
Error Trigger Description
Workload Unusual workload conditions such as a user request with unexpected parameters.
Bug Fixes A bug introduced when an earlier bug was fixed.
Client Code Errors caused by propagation from application code running in protected mode.
Recovery/Exception Problems in error recovery and exception handling.
Timing Errors caused by an unanticipated sequence of events.
Prediction Based on Heuristic Trend Analysis
This approach is based on the observation that a system usually experiences a period of intermittent errors 
before a hard failure occurs. The symptoms of intermittent errors can be used to predict impending failures. 
An early study of this approach showed qualitatively that the frequency of error tuples was correlated with 
system failures, based on measurements from a DEC disk subsystem [Tsao83]. Later, a heuristic trend analysis 
method, the dispersion frame technique (DFT), was developed [Lin90], which determines the relationship 
among errors by examining their closeness in time and space.
Two concepts are used in the DFT: dispersion frame (DF), defined as the interval between two successive 
errors of the same type, and error dispersion index (EDI), defined as the number of error occurrences 
following the previous DF during the interval of one half of the previous DF or the DF before the previous 
DF. Each DF is applied to the following two errors. A high EDI implies that the errors following the DF 
used to measure the EDI are highly correlated. DFT consists of five heuristic rules developed from field 
experience:
• 3.3 rule—The two consecutive EDIs obtained by applying the same frame are at least 3.
• 2.2 rule—The two consecutive EDIs obtained by applying two successive frames are at least 2.
• 2 in 1 rule—A frame is less than 1 hour.
• 4 in 1 rule—Four errors occur within a 24-hour frame.
• 4 decreasing rule—There are four monotonically decreasing frames, and at least one frame is half the 
size of its previous frame.
[Figure: time line of five error occurrences (1 to 5) with dispersion frames DF(1,2), DF(2,3), DF(3,4), and 
DF(4,5); warnings shown: 3.3, 2.2, and 4 decreasing.]
Figure 0.30: Dispersion Techniques
Figure 0.30 demonstrates an example, including some activated heuristics, of the DFT. In the figure, 
the top line represents the time sequence of five error occurrences (1, . . . ,  5) in a particular device. DFT
is activated when a frame size less than 168 hours (1 week) is encountered. Assume that all the frames in 
the figure fall below this threshold. Each frame is applied to the following two errors by centering it on 
the time points of the two error occurrences. For example, DF(1,2) is applied to errors 2 and 3, DF(2,3) is 
applied to errors 3 and 4, etc. An upward arrow represents a failure warning issued under the above heuristic 
rules.
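As a rough illustration, the three rules that depend only on frame sizes and error times can be sketched as follows. This is a simplified sketch, not the DFT implementation of [Lin90]: the 3.3 and 2.2 rules are omitted because they require the EDI bookkeeping, the 168-hour activation threshold is not modeled, "half the size" is interpreted as "at most half" (an assumption), and all function names are hypothetical.

```python
HOUR = 3600.0  # timestamps are assumed to be in seconds


def dispersion_frames(times):
    """Intervals between successive errors of the same type (times sorted)."""
    return [b - a for a, b in zip(times, times[1:])]


def two_in_one(frames):
    """2 in 1 rule: the most recent frame is shorter than 1 hour."""
    return bool(frames) and frames[-1] < HOUR


def four_in_one(times):
    """4 in 1 rule: the four most recent errors fall within 24 hours."""
    return len(times) >= 4 and times[-1] - times[-4] <= 24 * HOUR


def four_decreasing(frames):
    """4 decreasing rule: the last four frames shrink monotonically and at
    least one of them is at most half the size of its predecessor."""
    last = frames[-4:]
    if len(last) < 4:
        return False
    pairs = list(zip(last, last[1:]))
    return all(b < a for a, b in pairs) and any(b <= a / 2 for a, b in pairs)


def warnings(times):
    """Names of the rules fired by the error history of one device."""
    frames = dispersion_frames(times)
    fired = []
    if two_in_one(frames):
        fired.append("2 in 1")
    if four_in_one(times):
        fired.append("4 in 1")
    if four_decreasing(frames):
        fired.append("4 decreasing")
    return fired


# Hypothetical error times (hours converted to seconds): frames of
# 100, 60, 30, and 10 hours shrink steadily, so only "4 decreasing" fires.
print(warnings([h * HOUR for h in (0, 100, 160, 190, 200)]))
```

In practice the check would be re-run each time a new error of the same type is logged for a device.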
DFT was applied to data collected from 13 public-domain file servers at Carnegie Mellon University 
over a 22-month period. Among 16 hard failures examined, DFT predicted 15, with 5 false alarms. That 
is, the successful prediction rate is 93.7% (15 of 16). This result shows that DFT is very effective when 
coupled with good system instrumentation.
Prediction Based on Statistical Analysis
The objective of this approach is to recognize intermittent failures through statistical analysis and testing on 
recorded error data [Iyer90]. The approach starts by identifying key error patterns potentially symptomatic 
of failure occurrences and then refines these patterns by scanning the rest of the data in stages for similar 
error patterns. The approach is divided into three stages: 1) identification and validation of error groups, 
2) identification and validation of error events, and 3) identification and validation of super events. At each 
stage, validation is done by statistical testing.
In the first stage, data coalescing is performed on the raw data to eliminate redundant reports. The 
outputs of this stage are error records (tuples) characterized by error states (error type, machine condition, 
etc.). Next, all error records occurring within a small time interval (15 minutes) are grouped together and 
identified as error groups. Error groups represent periods of high error activity (error bursts). Experience 
has shown that when system errors occur in bursts of a relatively high error rate, the errors are often 
related. Statistical analysis and hypothesis testing are performed on each error group to determine whether 
a valid correlation exists among its members (error records). Randomly formed groups in which members 
are statistically independent are rejected. Thus, the original error groups consisting of records among 
which relationships can exist are refined to the validated error groups consisting of records among which 
relationships do exist.
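The grouping step of the first stage can be sketched as follows. The sketch assumes that "within a small time interval" means each record follows the previous one by at most 15 minutes (a gap-based reading; a fixed-window reading is equally plausible), and it leaves out the statistical validation. The record layout and the name `form_error_groups` are hypothetical.

```python
GROUP_WINDOW = 15 * 60  # 15 minutes, in seconds


def form_error_groups(records, window=GROUP_WINDOW):
    """Group error records into bursts.

    `records` is a list of (timestamp_seconds, error_state) tuples. A record
    joins the current group if it occurs no more than `window` seconds after
    the previous record; otherwise it starts a new group.
    """
    groups = []
    for record in sorted(records, key=lambda r: r[0]):
        if groups and record[0] - groups[-1][-1][0] <= window:
            groups[-1].append(record)
        else:
            groups.append([record])
    return groups


# Hypothetical records: three errors in a burst, then an isolated error.
records = [(0, "A1"), (600, "A2"), (1400, "A3"), (5000, "A4")]
for group in form_error_groups(records):
    print([state for _, state in group])
```

Each resulting group would then be tested for correlation among its members, and groups whose members appear statistically independent would be rejected.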
[Figure: three error groups G1, G2, and G3 over error states A1 to A7, with G1 ∩ G2 = {A2, A4}, 
G2 ∩ G3 = {A5, A6}, G1 ∩ G3 = ∅, and G1 ∩ G2 ∩ G3 = ∅; symptom S1 = {A2, A4} and symptom S2 = {A5, A6}.]
Figure 0.31: Derivation of an Event’s Symptom Set
Relationships can exist across error groups, i.e., a single cause can give rise to a persistent error and thus 
foster multiple error groups within a short time. In the second stage, the output groups from the first stage 
are examined to recognize related error groups and to eliminate stray error records. Several concepts are 
introduced for the analysis in this stage. An error event is defined as the collection of error groups occurring 
... A symptom is defined 
as a collection of statistically related error states that are common to at least half of the groups in an event. 
A symptom set is defined as the collection of all symptoms in an event. Figure 0.31 illustrates an event and 
its symptom set. The event is composed of three groups: G1, G2, and G3. The error states in these groups 
are represented by A1, ..., A7. Two symptoms are extracted from these error states: S1, which consists 
of A2 and A4, and S2, which consists of A5 and A6. Thus, S1 and S2 constitute the symptom set for this 
event.
[Figure: events (EVENT 1, EVENT 2) combined into a super event.]
Figure 0.32: Construction of Super Events
In the third stage, three simple rules are used to recognize related events and to group them together into 
sets called super events. The rules ensure that the events so grouped will have sufficiently common structure 
to permit testing for correlation. Two events are grouped into a super event if they satisfy any one of the 
following criteria: 1) they have at least one symptom in common, 2) a symptom of one event is a proper 
subset of at least one symptom of another event, or 3) if they are single-group events, then they have at 
least two error states in common.
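The three grouping criteria map naturally onto set operations. The sketch below assumes a symptom is represented as a frozenset of error states and an event as its groups plus its symptom set; the `Event` class, the function `belong_together`, and the sample events are hypothetical, and the statistical validation is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    groups: tuple        # tuple of frozensets of error states, one per group
    symptoms: frozenset  # frozenset of symptoms; each symptom is a frozenset


def belong_together(a, b):
    """True if events a and b satisfy any of the three super-event criteria."""
    # 1) At least one symptom in common.
    if a.symptoms & b.symptoms:
        return True
    # 2) A symptom of one event is a proper subset of a symptom of the other.
    for x, y in ((a, b), (b, a)):
        if any(s < t for s in x.symptoms for t in y.symptoms):
            return True
    # 3) Both are single-group events sharing at least two error states.
    if len(a.groups) == 1 and len(b.groups) == 1:
        return len(a.groups[0] & b.groups[0]) >= 2
    return False


fs = frozenset
# Event shaped like the Figure 0.31 example: symptoms {A2, A4} and {A5, A6}.
e1 = Event(groups=(fs({"A1", "A2", "A3", "A4"}), fs({"A2", "A4", "A5", "A6"}),
                   fs({"A5", "A6", "A7"})),
           symptoms=fs({fs({"A2", "A4"}), fs({"A5", "A6"})}))
e2 = Event(groups=(fs({"A5", "A8"}), fs({"A5", "A9"})), symptoms=fs({fs({"A5"})}))
print(belong_together(e1, e2))  # {A5} is a proper subset of {A5, A6}
```

Representing symptoms as frozensets keeps them hashable, so the common-symptom test is a plain set intersection.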
Figure 0.32 illustrates how super events are constructed. There is no time restriction when these rules 
are applied to the event data. When a super event is created, a corresponding super symptom set is also 
created, which starts with just the symptoms of the first event of that super event. As another event is 
added, set intersection is performed between its symptom set and each of the symptom sets already in the 
super event. All intersections are then added to the super event set.
In each of the above stages, statistical analysis and hypothesis testing are performed to validate the 
correlations among members in the formed groups or sets. The super events derived in the final stage can be 
used by service engineers to judge potential failures. This methodology was applied to the on-line error log 
files from two CYBER systems and the results were compared to the log of failures and repairs maintained 
by the system staff. In nearly 85% of the cases, the engineers confirmed that the validated super events 
corresponded to real system problems. The evaluation was made both on the basis of their experience and 
from their field maintenance logs. For the remaining 15% of the cases, the engineers agreed that a problem 
had existed, but that its manifestation was not severe enough to be noticed by their analysis.
Prediction Based on Performance Anomaly
When faults occur in a fault-tolerant system, although the system may recover from the faults, the performance 
of the system may deviate from its normal conditions. Thus, the performance anomaly can be 
used to diagnose faults. This idea has been explored in [Maxion90a], [Maxion90b], and [Maxion93] by 
injecting faults into local computer networks and generating a set of diagnostic decision rules based on the 
information of network traffic anomaly caused by the injected faults.
The study was performed on five Ethernet networks at Carnegie Mellon University [Maxion93]. In total, 
419 client machines and 20 server machines were included in the five networks. These networks were 
representative of a diversity of network traffic characteristics. Five types of faults were selected for this 
study. They were considered to be typical network faults that could impair or disable network performance. 
The five fault types are:
1. Pseudorunt flood—resource contention caused by runt flood. A runt packet is smaller than the 60-byte 
minimum size required by the Ethernet standard.
2. Network paging—memory swaps of a client over the network to a file server, causing significant 
network transmission latency.
3. Bad memory sequence—wrong frame-check sequence on a packet due to corruption or incorrect computation 
by the receiver, which requires a retransmission.
4. Jabber—excessive transmission of oversized packets, i.e., packets longer than the protocol-specified 
1518-byte limit.
5. Broadcast storm—overuse of the Ethernet broadcast caused by flawed protocols or configurations, and 
software errors.
Table 0.24: Measured Network Performance Parameters
Sequence Description
1 Destination Address, Unusual Activity
2 Destination Address, Increased Activity
3 Destination Address, Ceased Activity
4 Destination Address, Sudden Appearance
5 Source Address, Unusual Activity
6 Source Address, Increased Activity
7 Source Address, Ceased Activity




12 Packet Length \ 63
13 Packet Length in 64-127
14 Packet Length 1024
The experimental instrumentation consisted of a software/hardware combined fault injection system and 
two out-of-band hardware monitoring systems. A dedicated machine is used to generate traffic patterns and 
faults, and to inject them into the active network. Of the two hardware monitoring systems, one is used to 
collect statistics for packet traffic, collisions, percent utilization, etc., and the other is used to collect data 
about packet types, lengths, sources and destinations, etc. Because the monitoring instruments do not share 
the same data path, or band, as the measured devices, they do not influence the traffic in the measured 
network.
More than 500 faults were injected into the five networks. For each injected fault, 14 network performance 
parameters were measured, as described in Table 0.24. The 14 parameters, together with some information 
about the fault injection, are called a feature vector (or signature). Each feature vector was assigned a fault 
number from 1 to 5, according to the associated fault. Based on the feature vectors, a set of decision rules 
was derived. The technique used to generate decision rules is called recursive partitioning regression, 
which forms homogeneous groups of vectors by recursively partitioning the dataset. The decision rule can 
be used to discriminate one fault from another on the basis of the features and values contained in the 
associated vectors.
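The idea of recursive partitioning can be illustrated with a small, self-contained sketch. It is not the procedure used in [Maxion93]: the split criterion (Gini impurity), the two-feature vectors, and all names are assumptions chosen for illustration, whereas the study used 14 measured parameters per injected fault.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of fault labels (0 means homogeneous)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) pair that most reduces the weighted
    impurity, or None if no split improves on the unsplit node."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels):
    """Recursively partition the vectors; a leaf is the majority fault label."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    lo = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in lo], [y for _, y in lo]),
            build_tree([r for r, _ in hi], [y for _, y in hi]))


def classify(tree, row):
    """Follow the decision rules from the root down to a leaf label."""
    while isinstance(tree, tuple):
        f, t, lo, hi = tree
        tree = lo if row[f] <= t else hi
    return tree


# Hypothetical two-feature vectors labeled with fault numbers 1, 2, and 4.
rows = [(0.9, 10), (0.8, 12), (0.1, 900), (0.2, 950), (0.15, 11), (0.1, 9)]
labels = [1, 1, 4, 4, 2, 2]
tree = build_tree(rows, labels)
print(classify(tree, (0.85, 11)), classify(tree, (0.12, 920)))
```

Each path from the root to a leaf reads as one decision rule of the form "if feature f is at most t, then ...", which is the sense in which the partitioning yields diagnostic rules.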
By applying the decision rule to the feature vectors of new injected faults, fault detection and diagnostic 
classification accuracy can be determined. The results showed that the fault detection accuracy is 90.6% and 
the overall diagnostic classification accuracy is 86.8%. The identification errors can be attributed to noise 
in the network environments. Since the generation of decision rules and the identification of faults can be 
done on-line and automatically, the approach is robust under a variety of environmental conditions.
0.6 CONCLUSION
In this chapter, we discussed methodologies and advances in the area of the experimental analysis of computer 
system dependability. The discussion covered three fields: simulated fault injection, physical fault injection, 
and measurement-based analysis of operational systems. The approaches used in the three fields are suited, 
respectively, to the dependability evaluation in the three phases of a system’s life: design phase, prototype 
phase, and operational phase. Before discussing these fields, we introduced several statistical techniques used 
in all fields. For each field, we proposed a classification of research approaches or topics. Then we presented 
detailed methodologies and representative studies for each of these approaches or topics.
The statistical techniques introduced included the estimation of parameters and confidence intervals, 
probability distribution characterization, several multivariate analysis methods, and importance sampling. 
For simulated fault injection, we covered electrical-level, logic-level, and function-level simulation approaches 
as well as representative simulation environments, including FOCUS, NEST, REACT, and DEPEND. For 
physical fault injection, we discussed hardware, software, and radiation fault injection methods as well as 
several software and hybrid tools, including FIAT, FERRARI, HYBRID, and DEFINE. For measurement-based 
analysis of operational systems, after an introduction to measurement and data processing techniques, 
we presented methods used and representative studies in basic error characterization, dependency analysis, 
Markov reward modeling, software dependability, and fault diagnosis. The discussion covered several important 
issues previously studied, including workload/failure dependency, correlated failures, and software fault 
tolerance.
Fault injection simulation can be used to investigate the effectiveness of key design features of fault-tolerant 
systems and to provide timely feedback to system designers. Generally, most dependability measures 
(except input parameters such as failure and recovery rates) can be obtained from simulation. However, 
simulation requires accurate input parameters and the validation of output results, which come from physical 
fault injection and measurement-based analysis. Fault injection on real systems can produce information 
about error latency, error detection, error propagation, error recovery, and system reconfiguration, but it can 
only study artificial faults and cannot produce some dependability measures, such as MTBF and availability. 
Measurement-based analysis of operational systems under real workloads can provide valuable information 
on actual failure characteristics and insight into analytical models. This type of analysis provides a means 
to study naturally occurring errors and all measurable dependability metrics, such as failure and recovery 
rates, reliability and availability. However, the analysis is limited to detected errors. Further, conditions in 
the field can vary widely from one system to another, casting doubt on the statistical validity of the results. 
Thus, all three approaches are complementary and essential for accurate dependability analysis.
Significant progress has been made in all three fields over the past 15 years, especially the past 
5 years, during which several dependability analysis tools have been developed. Increasing attention is 
being paid to 1) combining analytical modeling and experimental analysis and 2) combining system design 
and evaluation. In the first aspect, state-of-the-art analytical modeling techniques are being applied to 
real systems to evaluate various dependability and performance characteristics. Results from experimental 
analysis are being used to validate analytical models and to reveal practical issues that analytical modeling
must address to develop more representative models. In the second aspect, dependability analysis tools are 
being combined with each other and with other CAD tools to provide an automatic design environment 
that incorporates multiple levels of joint evaluation of functionality, performance, dependability, and cost. 
Software failure data from testing and operational phases are also providing feedback to the software design 
for improving software reliability. Further interesting studies and advances in this area can be expected in 
the near future.
0.7 QUESTIONS
Question 1
Table 0.25 shows software time to failure (TTF) data measured on a multicomputer system. Table 0.26 
shows the completion time (CT) of a benchmark running on a multiprocessor under different workload 
conditions. Do the following work using the data:
• Construct an empirical distribution for each set of data.
• Fit both empirical distributions to an exponential function, the TTF distribution to a 2-phase hyperexponential 
function, and the CT distribution to a 2-phase hypoexponential function, respectively.
• Do the chi-square and Kolmogorov-Smirnov significance tests for each fitting.
• Give the 95% confidence intervals for the mean and variance of each distribution.
Table 0.25: Software TTF Data from a Multicomputer
TTF (days)   1   2   3   4   5   6   7   8   9  10  11  12  13
Frequency   29   8   6   5   3   3   1   3   1   1   1   2   1
TTF (days)  14  15  16  17  18  19  20  21  22  23  24  25  26
Frequency    0   2   0   1   0   0   0   0   0   0   0   0   1
Table 0.26: Completion Time of a Benchmark
Time* (min.)   1   3   5   7   9  11  13  15  17  19  21  23
Frequency      2  10  12  18  16  13  15   7   6   4   2   6
Time* (min.)  25  27  29  31  33  35  37  39  41  43  45  47
Frequency      4   1   0   1   4   0   0   1   1   0   0   0
Question 2
Table 0.27 gives processor failure data collected from a distributed system consisting of 5 processors 
connected by a local network. Each record in the table has the following format:
processor id failure time recovery time error type
The time unit is the second, and time 0 is 12am, 1/1/1987. The measurement started from 12am, 12/9/1987 
(29,548,800) and ended at 12am, 8/15/1988 (51,148,800), covering 250 days. The error type means the type 
of error that causes a processor failure. Possible error types are CPU, I/O (network or disk problems), 
software, and unknown. Do the following work using the data:
• Obtain failure rate and recovery rate for each processor.
• Assuming failures on different processors are independent, build a Markov model, where S_i represents 
the state with i failed processors, based on the failure and recovery rates obtained in step 1.
• Assuming the modeled system is a 3-out-of-5 system, solve the model (using SHARPE or similar tools) 
to obtain the reliability of the system.
• Build a measurement-based Markov model using the data.
• Solve the measurement-based model (using SHARPE or similar tools) to obtain the reliability of the 
(3-out-of-5) system, and compare the result with that obtained in step 3.
• Construct a data matrix and then use the matrix to calculate the correlation coefficient for each processor pair. 
Question 3
Build a simulation model for the system described in the above question. Assume that: 1) the modeled 
system is a 3-out-of-5 system, and 2) the TBF (time between failures) and TTR (time to recovery) for each 
processor are exponentially distributed. Investigate the following dependability issues through simulation.
• Assume that failures on different processors are independent and that all processors have the same 
failure (recovery) rate which is the average of the five processor failure (recovery) rates obtained in 
Question 2. Construct the system TBF and TTR distributions and evaluate MTBF and unavailability.
• Assume that failures on different processors are related by shared resources and coordinated software. 
Specifically, the joint failure probability is 0.1 for any two processors, 0.03 for any three processors, 
0.01 for any four processors, and 0.005 for all five processors (these numbers are close to the measured 
correlation parameters from the data in Question 2). Evaluate MTBF and unavailability and compare 
them with those obtained in step 1.
• Assume that each processor failure is caused by an error and there is a latency between the error arrival 
and the failure occurrence. The latency follows an exponential distribution with a mean equaling the 
average processor recovery time. The error arrival rate and failure recovery rate for each processor 
are the same as the failure and recovery rates in step 1, respectively. Assume that each error causes a 
failure. Further, errors on different processors are related with the correlation parameters used in step 
2. Evaluate MTBF and unavailability and compare them with the results from step 2.
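The first simulation step (independent processors with a common failure rate and a common recovery rate) can be sketched as follows. This is a minimal sketch under stated assumptions, not the simulation model the exercise has in mind: it uses the memoryless property to simulate only the number of failed processors, it takes a system TBF sample to be the time between two successive system failure instants, and it estimates unavailability as the fraction of simulated time with fewer than three processors up. The function name and its parameters are hypothetical, and the correlated-failure and error-latency steps are not covered.

```python
import random


def simulate_k_out_of_n(lam, mu, horizon, n=5, k=3, seed=0):
    """Simulate n independent processors with exponential TBF and TTR.

    Returns (mtbf, unavailability, tbf_samples, ttr_samples) for a system
    that is up while at least k processors are up.
    """
    rng = random.Random(seed)
    clock = 0.0
    failed = 0                # number of failed processors
    down_time = 0.0           # accumulated system down time
    failure_instants = []     # times at which the system goes down
    ttr_samples = []          # lengths of completed system down periods
    while clock < horizon:
        fail_rate = (n - failed) * lam
        repair_rate = failed * mu
        total_rate = fail_rate + repair_rate
        dt = rng.expovariate(total_rate)
        system_up = (n - failed) >= k
        if not system_up:
            down_time += dt
        clock += dt
        # Exponential times are memoryless, so the next event is a failure
        # with probability fail_rate / total_rate.
        if rng.random() * total_rate < fail_rate:
            failed += 1
            if system_up and (n - failed) < k:
                failure_instants.append(clock)
        else:
            failed -= 1
            if not system_up and (n - failed) >= k:
                ttr_samples.append(clock - failure_instants[-1])
    tbf_samples = [b - a for a, b in
                   zip(failure_instants, failure_instants[1:])]
    mtbf = sum(tbf_samples) / len(tbf_samples) if tbf_samples else float("inf")
    return mtbf, down_time / clock, tbf_samples, ttr_samples
```

The returned tbf_samples and ttr_samples lists can be binned into histograms to obtain the empirical system TBF and TTR distributions. For the data in Question 2, lam and mu would be the averaged per-second rates and horizon a long simulated time in seconds.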
Table 0.27: Processor Failure Data from a 5-Processor Multicomputer System
Pr.-id start-time end-time err-type Pr.-id start-time end-time err-type
1 29604556 29605120 Software 2 29909506 29909806 I/O
1 29704893 29706486 I/O 2 29910884 29911926 I/O
1 29770774 29772334 Software 2 29913095 29913465 I/O
1 29779466 29779946 I/O 2 29914121 29915273 I/O
1 29918361 29919001 I/O 2 29916032 29916705 I/O
1 29938155 29938995 Unknown 2 29917194 29917844 I/O
1 29968514 29969314 I/O 2 29918236 29919428 I/O
1 30204850 30206947 I/O 2 29939825 29940125 Unknown
1 30300136 30300482 I/O 2 29946828 29947128 Unknown
1 30315829 30316496 I/O 2 29949602 29950199 I/O
1 30552159 30555408 Unknown 2 29953804 29954104 I/O
1 30571134 30571529 I/O 2 29957176 29957476 I/O
1 31762359 31762897 I/O 2 29963079 29963379 I/O
1 31830453 31831347 I/O 2 29964579 29966271 I/O
1 31837833 31838304 I/O 2 29967435 29968115 I/O
1 31839160 31839628 I/O 2 30550260 30550560 Unknown
1 31951476 31952241 I/O 2 30550946 30553816 Software
1 32123531 32124925 I/O 2 30660369 30660669 Unknown
1 32126834 32127160 I/O 2 31774557 31774857 I/O
1 32177392 32178646 I/O 2 31775859 31776450 I/O
1 32963455 32964054 Software 2 31781002 31781783 I/O
1 33014870 33015438 I/O 2 31832367 31832781 I/O
1 33152933 33153737 Software 2 31839101 31841182 I/O
1 34616577 34617326 Software 2 32122929 32129036 I/O
1 34770846 34771459 I/O 2 32176731 32177250 Software
1 37813703 37814561 Unknown 2 32178134 32180594 I/O
1 38007128 38049817 I/O 2 32183767 32184321 I/O
1 38955976 38956597 Unknown 2 32449592 32450399 Unknown
1 38961843 38962658 Unknown 2 32962291 32962591 Unknown
1 39465650 39467247 I/O 2 32975370 32975670 I/O
1 39809295 39810575 Unknown 2 32976674 32976974 I/O
1 40069978 40071607 I/O 2 32983427 32983727 I/O
1 41000249 41001273 Software 2 33049654 33050104 I/O
1 41366807 41387784 Software 2 33052795 33053095 I/O
1 41391480 41392113 Software 2 33057253 33057553 I/O
1 42272616 42273174 Software 2 33059795 33060095 I/O
1 42831896 42833058 Software 2 33087233 33087533 I/O
1 43309767 43313204 Software 2 33089396 33089696 I/O
1 43348292 43350322 Software 2 33120965 33121265 Unknown
1 43952410 43953022 Software 2 33148035 33148430 I/O
1 44877091 44877998 Software 2 34011900 34012200 Unknown
1 45841909 45842888 Software 2 34770845 34771810 I/O
1 46961851 46962724 Software 2 34774453 34774753 Unknown
1 48878979 48880349 Unknown 2 34775145 34776270 I/O
1 48888392 48890586 Software 2 34777013 34777313 I/O
2 29570604 29570904 I/O 2 34777969 34778269 I/O
2 29577262 29577562 I/O 2 34780806 34781106 I/O
2 29767256 29767556 I/O 2 35659389 35662510 Unknown
2 29782058 29782358 I/O 2 35668585 35668885 I/O
2 29788920 29789718 I/O 2 35725413 35725713 I/O
2 29886930 29887230 I/O 2 35726840 35727669 I/O
Pr.-id start-time end-time err-type
2 35730910 35731210 I/O
2 35744817 35745117 I/O
2 35753244 35753544 I/O
2 36785044 36785772 Unknown
2 36789532 36789988 Software
2 36796257 36847121 I/O
2 38249305 38249785 Software
2 38251655 38252099 Unknown
2 38744311 38744777 Software
2 38745674 38746341 CPU
2 38955454 38956597 Unknown
2 39805355 39808038 I/O
2 43064089 43065444 Unknown
2 44732995 44733453 Software
2 45221452 45221917 I/O
2 46375146 46375583 Unknown
2 49391794 49392273 Software
2 50068269 50069049 I/O
3 30550976 30554633 Software
3 31760114 31760624 Software
3 32806356 32806844 Software
3 33152933 33153558 Software
3 34370843 34385770 I/O
3 34779280 34779754 Unknown
3 36783938 36786522 I/O
3 36787771 36788262 Software
3 37800626 37811789 I/O
3 37812929 37813548 Software
3 43175442 43175990 Software
3 43326842 43327797 Unknown
3 43330318 43330785 CPU
3 43331708 43332312 CPU
3 43334898 43338477 CPU
3 43338720 43340663 I/O
3 43342983 43345110 CPU
3 43691509 43692018 Software
3 43693033 43693689 Unknown
3 43693580 43694622 I/O
3 43695584 43696455 Unknown
3 43697283 43698493 I/O
3 47651456 47651969 Software
3 49057997 49058626 Software
3 49568366 49568810 Unknown
3 50043031 50043500 Software
3 50697666 50698119 Unknown
3 50930312 50930764 Unknown
4 29939050 29940425 I/O
4 29943202 29948396 I/O
4 30550948 30554287 I/O
4 30762674 30762976 I/O
4 34777878 34778529 I/O
4 37563700 37567137 I/O
4 37811277 37811786 I/O
4 38956002 38963057 I/O
4 39204773 39205074 I/O
4 43175015 43175315 Unknown
4 47239532 47243366 I/O
4 47641698 47642345 I/O
4 48939418 48940072 Unknown
4 49144906 49145430 I/O
4 50662627 50663276 Unknown
4 50668168 50672456 I/O
4 50758111 50758751 I/O
4 50941271 50943381 I/O
5 30545108 30545606 Software
5 30551262 30553813 Unknown
5 31769662 31770215 Software
5 31953896 31954395 Unknown
5 33152335 33153779 Unknown
5 33153616 33156417 Software
5 34774852 34780711 I/O
5 37813181 37813664 Unknown
5 38955806 38962854 Unknown
5 40928410 40934268 Software
5 40936013 40936767 Unknown
5 43189066 43193507 Software
5 43252583 43254201 Unknown
5 43256847 43257890 Unknown
5 43328379 43333113 Software
5 44903066 44903945 Unknown
5 47064423 47069097 I/O
5 49125607 49126676 Software
5 49478273 49478888 Unknown
5 50618397 50623146 Software
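The records in Table 0.27 can be read and turned into a data matrix along the following lines. This is a hedged sketch: the 0/1 coding (a row is a time bin, a column is a processor, and 1 marks a bin that overlaps a failure interval of that processor), the bin width, and all function names are assumptions made for illustration, and the exercise may intend a different definition of the data matrix. The sample rows are copied from the table.

```python
import math


def parse_records(text):
    """Parse lines of the form 'id start end err-type' into tuples."""
    records = []
    for line in text.strip().splitlines():
        fields = line.split()
        records.append((int(fields[0]), int(fields[1]),
                        int(fields[2]), fields[3]))
    return records


def build_data_matrix(records, start, end, bin_seconds,
                      processors=(1, 2, 3, 4, 5)):
    """Return a list of rows (time bins) with one 0/1 entry per processor."""
    n_bins = math.ceil((end - start) / bin_seconds)
    column = {p: j for j, p in enumerate(processors)}
    matrix = [[0] * len(processors) for _ in range(n_bins)]
    for proc, fail, recover, _ in records:
        # Treat each failure as the half-open interval [fail, recover).
        first = max(0, (fail - start) // bin_seconds)
        last = min(n_bins - 1, (recover - start - 1) // bin_seconds)
        for row in range(first, last + 1):
            matrix[row][column[proc]] = 1
    return matrix


def pearson(x, y):
    """Sample correlation coefficient; None if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(matrix, processors=(1, 2, 3, 4, 5)):
    """Correlation coefficient for every pair of processor columns."""
    columns = list(zip(*matrix))
    return {(processors[i], processors[j]): pearson(columns[i], columns[j])
            for i in range(len(processors))
            for j in range(i + 1, len(processors))}


# A few rows copied from Table 0.27 (all five processors fail close together).
SAMPLE = """
1 30552159 30555408 Unknown
2 30550946 30553816 Software
3 30550976 30554633 Software
4 30550948 30554287 I/O
5 30551262 30553813 Unknown
"""

if __name__ == "__main__":
    sample_matrix = build_data_matrix(parse_records(SAMPLE),
                                      30_540_000, 30_560_000, 600)
    for pair, rho in sorted(pairwise_correlations(sample_matrix).items()):
        print(pair, rho)
```

For the full exercise, the whole table would be parsed and the window set to the measurement period (29,548,800 to 51,148,800 seconds). The choice of bin width affects the coefficients, so it is worth trying more than one value.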
References
[Arlat90a] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.C. Fabre, J.C. Laprie, E. Martins, and D. Pow­
ell, “Fault Injection for Dependability Validation: A Methodology and Some Applications,” 
IEEE Trans. Software Engineering, Vol. 16, No. 2, pp. 166-182, February 1990.
[Arlat90b] L. Arlat, K. Kanoun, and J.C. Laprie, “Dependability Modeling and Evaluation of Software 
Fault-Tolerant Systems,” IEEE Trans. Computers, Vol. 39, No. 4, pp. 504-513, April 1990.
[Artis86] H.P. Artis, “Workload Characterization Using SAS PROC FASTCLUS,” Workload Char­
acterization of Computer Systems and Computer Networks, G. Serazzi (editor), Elsevier 
Science Publishers, 1986.
[Aupperle89] B.E. Aupperle, J.F. Meyer and L. Wei, “Evaluation of Fault-Tolerant Systems with Non- 
homogeneous Workloads,” Proc. 19th Int. Symp. Fault-Tolerant Computing, pp. 159-166, 
June 1989.
[Avizienis84] A. Avizienis and J.P.J. Kelly, “Fault Tolerance by Design Diversity: Concepts and Exper­
iments,” IEEE Computer, pp. 67-80, August 1984.
[Banerjee82] P. Banerjee and J.A. Abraham, “Fault Characterization of MOS VLSI Circuits,” Proc. Int. 
Conf. Circuits and Computers, pp. 564-568, 1982.
[Bartlett78] J.F. Bartlett, “A ’Nonstop’ Operating System,” Proc. Int. Hawaii Conf. System Science, 
pp. 103-117, 1978.
[Barton90] J.H. Barton, E.W. Czeck, Z.Z. Segall, and D.P. Siewiorek, “Fault Injection Experiments 
Using FIAT,” IEEE Trans. Computers, Vol. 39, No. 4, pp. 575-582, April 1990.
[Bavuso87] S.J. Bavuso, J.B. Dugan, K.S. Trivedi, E.M. Rothman, and W.E. Smith, “Analysis of 
Typical Fault-Tolerant Architectures using HARP,” IEEE Trans. Reliability, Vol. 36, No. 
2, pp. 176-185, June 1987.
[Beh82] C.C. Beh, K.H. Arya, C.E. Radke, and K.E. Torku, “Do Stuck Fault Models Reflect Man­
ufacturing Defects?” Proc. Int. Test Conf, pp. 35-42, 1982.
[Bishop88] P.G. Bishop and F.D. Pullen, “PODS Revisited—A Study of Software Failure Behavior,” 
Proc. 18th Int. Symp. Fault-Tolerant Computing pp. 2-8, 1988.
[Bryant84] R.E. Bryant, “A Switch-Level Model and Simulator for MOS Digital Systems,” IEEE Trans. 
Computers, Vol. 33, No. 2, pp. 160-177, February 1984.
[Butner80] S.E. Butner and R.K. Iyer, “A Statistical Study of Reliability and System Load at SLAC,” 
Proc. 10th Int. Symp. Fault-Tolerant Computing, pp. 207-209, October 1980.
[Castillo81] X. Castillo and D.P. Siewiorek, “Workload, Performance, and Reliability of Digital Com­
puter Systems,” Proc. 11th Int. Symp. Fault-Tolerant Computing, pp. 84-89, July 1981.
[Castillo82] X. Castillo and D.P. Siewiorek, “A Workload Dependent Software Reliability Prediction 




















R. Chillarege and R. K. Iyer, “Measurement-Based Analysis of Error Latency,” IEEE Trans. 
Computers, Vol. C-36, No. 5, May 1987.
R. Chillarege and N.S. Bowen, “Understanding Large System Failures—A Fault Injection 
Experiment,” Proc. 19th Int. Symp. Fault-Tolerant Computing, pp. 356-363, June 1989.
G.S. Choi, R.K. Iyer and V. Carreno, “FOCUS: An Experimental Environment for Val­
idation of Fault Tolerant Systems: A case study of a Jet Engine Controller,” Int. Conf. 
Computer Design (ICCD), pp. 561-564, October 1989.
G. Choi, R. Iyer, V. Carreno, “Simulated Fault Injection: A Methodology to Evaluate 
Fault Tolerant Microprocessor Architectures,” IEEE Trans, on Reliability, Special Issue on 
Experimental Evaluation, Vol, 39, No. 4, pp. 486-491, October 1990.
G.S. Choi and R.K. Iyer, “FOCUS: An Experimental Environment for Fault Sensitivity 
Analysis,” IEEE Trans. Computers, Vol. 41, No. 12, pp. 1515-1526, December 1992.
G. Choi, R. Iyer, and D. Saab, “Fault Behavior Dictionary for Simulation of Device-Level 
Transients,” Proc. Int. Conf. Computer Aided. Design, November 1993.
J.A. Clark and D.K. Pradhan, “Reliability Analysis of Unidirectional Voting TMR Systems 
through Simulated Fault-Injection,” Proc. 1992 IEEE Workshop Fault-Tolerant Parallel 
and Distributed Systems, pp. 72-81, July 1992.
J.A. Clark and D.K. Pradhan, “REACT: A Synthesis and Evaluation Tool for Fault- 
Tolerant Multiprocessor Architectures,” Proc. Annual Reliability and Maintainability Sym­
posium, pp. 428-435, 1993.
J.A. Clark, Dependability Analysis of Fault-Tolerant Multiprocessor Architectures through 
Simulated Fault-Injection, Ph.D. dissertation, University of Massachusetts at Amherst, 
September 1993.
B. Courtois, “Some Results about the Efficiency of Simple Mechanisms for the Detection of 
Microcomputer Malfunctions,” Proc. 9th Int. Symp. Fault-Tolerant Computing, pp. 71-74, 
June 1979.
J. Cusick, R. Koga, W. Kolasinski, and C. King, “SEU vulnerability of the Zilog Z-80 
and NSC-800 microprocessors,” IEEE Trans. Nuclear Science, Vol. NS-32, pp. 4206-4211, 
December 1985.
E.W. Czeck, On The Prediction of Fault Behavior Based on Workload, Ph.D. Thesis, Dept, 
of Electrical and Computer Engineering, Carnegie Mellon University, April 19, 1991.
E.W. Czeck and D.P. Siewiorek, “Observations on the Effects of Fault Manifestation as a 
Function of Workload,” IEEE Trans. Computers, Vol. 41, No. 5, pp. 559-566, May 1992.
A. Dewey and A.J. de Geus, “VHDL: Toward a Unified View of Design,” IEEE Design and 
Test of Computers, pp. 8-17, June 1992.
W.R. Dillon and M. Goldstein, Multivariate Analysis, John Wiley &; Sons, 1984.
P. Duba and R.K. Iyer, “Transient Fault Behavior in a Microprocessor: A Case Study,” 
Proc. 1988 IEEE Int. Conf. Computer Design: VLSI in Computers & Processors, pp. 272- 
276, October 1988.
J.B. Dugan, “Correlated Hardware Failures in Redundant Systems,” Proc. 2nd IFIP Work­
ing Conf. Dependable Computing for Critical Applications, Tucson, Arizona, February 1991.
J. Dunkel, “On the Modeling of Workload-Dependent Memory Faults,” Proc. 20th. Int. 
Symp. Fault-Tolerant Computing, pp. 348-355, June 1990.
79
[Dupuy90] A. Dupuy, J. Schwartz, Y. Yemini, and D. Bacon, “NEST: A Network Simulation and 
Prototyping Testbed,” Communications of the ACM, Vol. 33, No. 10, pp. 64-74, October 
1990.
[Finelli87] G.B. Finelli, “Characterization of Fault Recovery through Fault Injection on FTMP,” IEEE 
Trans. Reliability, Vol. R-36, No. 2, pp. 164-170, June 1987.
[Goel85] A.L. Goel, “Software Reliability Models: Assumptions, Limitations, and Applicability,” 
IEEE Trans. Software Engineering, vol SE-11, No. 12, pp. 1411-1423, December 1985.
[Goswami90] K.K. Goswami and R.K. Iyer, “DEPEND: A Design Environment for Prediction and Eval­
uation of System Dependability,” Proc. 9th Digital Avionics Systems Conference, October 
1990.
[Goswami91] K.K. Goswami and R.K. Iyer, “A Simulation-Based Study of a Triple Modular Redundant 
System using DEPEND,” Proc. 5th Int. Tests, Diagnosis, Fault Treatment Conf., pp. 300- 
311, September 1991.
[Goswami92] K.K. Goswami and R.K. Iyer, DEPEND: A Simulation-Based Environment for System 
Level Dependability Analysis, Technical Report, CRHC 92-11, University of Illinois at 
Urbana-Champaign, June 1992.
[Goswami93a] K. K. Goswami and R. K. Iyer, “Use of Hybrid and Hierarchical Simulation to Reduce Com­
putation Costs,” Int. Workshop Modeling Analysis & Simulation of Computer & Telecomm. 
Sys., San Diego, CA, pp. 197-202, January 1993.
[Goswami93b] K. K. Goswami, R. K. Iyer and M. Devarakonda, “Prediction-Based Dynamic Load-Sharing 
Heuristics,” IEEE Trans. Parallel and Distributed Computing, Vol. 4, No. 6 , pp. 638-648, 
June 1993.
[Goswami93c] K. K. Goswami and R. K. Iyer, “Simulation of Software Behavior Under Hardware Faults,” 
Proc. 23rd Int. Symp. Fault-tolerant Computing, pp. 218-227, June 1993.
[Goyal87] A. Goyal, S.S. Lavenberg and K.S. Trivedi, “Probabilistic Modeling of Computer System 
Availability,” Annals of Operations Research, No. 8 , pp. 285-306, March 1987.
[Goyal92] A. Goyal, P. Shahabuddin, P. Heidelberger, V.F. Nicola, and P.W. Glynn, “A Unified 
Framework for Simulating Markovian Models of Highly Dependable Systems,” IEEE Trans. 
Computers, Vol. 41, No. 1, pp. 36-51, January 1992.
[Gray90] J. Gray, “A Census of Tandem System Availability Between 1985 and 1990,” IEEE Trans. 
Reliability, Vol. 39, No. 4, pp. 409-418, October 1990.
[Gray91] J. Gray and D.P. Siewiorek, “High-Availability Computer Systems,” IEEE Computers, pp. 
39-48, September 1991.
[Gunneflo89] U. Gunneflo, J. Karlsson, and J. Torin, “Evaluation of Error Detection Schemes Using Fault 
Injection by Heavy-ion Radiation,” Proc. 19th Int. Symp. Fault-Tolerant Computing, pp. 
340-347, June 1989.
[Han93] S. Han, H. Rosenberg, and K. G. Shin, “DOCTOR: An Integrated Software Fault Injection 
Environment,” CSE Technical Report, CSE-TR-192-93, Department of Electrical Engineer­
ing and Computer Science, The University of Michigan, Ann Arbor, MI, December 1993.
[Hansen92] J.P. Hansen and D.P. Siewiorek, “Models for Time Coalescence in Event Logs,” Proc. 22nd 
Int. Symp. Fault-Tolerant Computing, pp. 221-227, July 1992.
[Heimann90] D.I. Heimann, N. Mittal and K.S. Trivedi, “Availability and Reliability Modeling for Com­



















R.V. Hogg and E.A. Tanis, Probability and Statistical Inference, Second Edition, Macmillan 
Publishing Co., Inc., 1983.
R.A. Howard, Dynamic Probabilistic Systems, John Wiley & Sons, Inc., New York, 1971.
M.C. Hsueh and R.K. Iyer, “A Measurement-Based Model of Software Reliability in a Pro­
duction Environment,” Proc. 11th Annual Int. Computer Software & Applications Conft, 
pp. 354-360, October 1987.
M.C. Hsueh, R.K. Iyer, and K.S. Trivedi, “Performability Modeling Based on Real Data: 
A Case Study,” IEEE Trans. Computers, Vol. 37, No.4, pp. 478-484, April 1988.
R.K. Iyer and D.J. Rossetti, “A Statistical Load Dependency Model for CPU Errors at 
SLAC,” Proc. 12th Int. Symp. Fault-Tolerant Computing, pp. 363-372, June 1982.
R.K. Iyer, S.E. Butner, and E.J. McCluskey, “A Statistical Failure/Load Relationship: 
Results of a Multicomputer Study,” IEEE Trans. Computers, Vol. C-31, No. 7, pp. 697- 
705, July 1982.
R.K. Iyer and P. Velardi, “Hardware-Related Software Errors: Measurement and Analysis,” 
IEEE Trans. Software Engineering, Vol. SE-11, No. 2, pp. 223-231, February 1985.
R.K. Iyer and D.J. Rossetti, “Effect of System Workload on Operating System Reliability: 
A Study on IBM 3081,” IEEE Trans. Software Engineering, Vol. SE-11, No. 12, pp. 1438- 
1448, December 1985.
R.K. Iyer, D.J. Rossetti and M.C. Hsueh, “Measurement and Modeling of Computer Reli­
ability as Affected by System Activity,” ACM Trans. Computer Systems, Vol. 4, No. 3, pp. 
214-237, August 1986.
R.K. Iyer, L.T. Young, and P.V.K. Iyer, “Automatic Recognition of Intermittent Failures: 
An Experimental Study of Field Data,” IEEE Trans. Computers, Vol. 39, No. 4, pp. 525- 
537, April 1990.
E. Jenn, J. Arlat, M. Rimen, J. Ohlsson, and J. Karlsson, “Fault Injection into VHDL 
Models: The MEFISTO Tool,” to appear Proc. 2fth Int. Symp. Fault-Tolerant Computing, 
1994.
D. Jewett, “Integrity S2: A Fault-Tolerant Unix Platform,” Proc. 21st Int. Symp. Fault- 
Tolerant Computing, June 1991.
H. Kahn and A. W. Warshall, “Methods of Reducing Sample in Monte Carlo Computa­
tions,” Journal of the Operations Research Society of America, Vol. 1, No. 5, pp. 263-278, 
1953.
G.A. Kanawati, N.A. Kanawati, and J.A. Abraham, “FERRARI: A Tool for the Validation 
of System Dependability Properties,” Proc. 22nd Int. Symp. Fault-To l er ant Computing, pp. 
336-344, July 1992.
G.A. Kanawati, N.A. Kanawati, and J.A. Abraham, “EMAX: An Automatic Extractor of 
High-Level Error Models,” Proc. American Institute of Aeronautics and Astronautics, pp. 
1297-1306, 1993.
W. Kao, R.K. Iyer, and D. Tang, “FINE: A Fault Injection and Monitor Environment 
for Tracing the UNIX System Behavior under Faults,” IEEE Transactions on Software 
Engineering, November 1993.
W. Kao and R.K. Iyer, “DEFINE: A Distributed Fault Injection and Monitor Environ­
ment,” to appear in The 1994 IEEE Workshop on Fault-Tolerant Parallel and Distributed 
Systems, June 1994.
81
[Karlsson89] J. Karlsson, U. Gunneflo, and J. Torin, “The Effects of Heavy-ion Induced Single Event 
Upsets in the MC6809E Microprocessor,” Proc. fth Int. Conf. Fault-Tolerant Computing 
Systems, GI/ITG/GMA, Baden, Germany, 1989.
[Katzman78] J.A. Katzman, “A Fault-Tolerant Computing System,” Proc. Int. Hawaii Conference on 
System Science, pp. 85-102, 1978.
[Kendall77] M.G. Kendall, The Advanced Theory of Statistics, Oxford University Press, 1977.
[Kim88] S. Kim and R. K. Iyer, “Impact of Device Level Faults in a Digital Avionic Processor,” 
Proc. AIAA/IEEE 8th Digital Avionics Systems Conference (DASC), pp. 428-436, October 
1988.
[Kobayashi78] H. Kobayashi, Modeling and Analysis: An Introduction to System Performance Evaluation 
Methodology, Addison-Wesley Publishing Co., 1978.
[Kronenberg86] N.P. Kronenberg, H.M. Levy and W.D. Strecker, “VAXcluster: A Closely-Coupled Dis­
tributed System,” ACM Trans. Computer Systems, Vol. 4, No. 2, pp. 130-146, May 1986.
[Lala83] J. Lala, “Fault Detection, Isolation, and Reconfiguration in FTMP: Methods and Exper­
imental Results,” Proc. 5th AIAA/IEEE Digital Avionics Systems Conference (DASC), 
1983.
[Laprie84] J.C. Laprie, “Dependable Evaluation of Software Systems in Operation,” IEEE Trans. 
Software Engineering, Vol. SE-10, No. 6, pp. 701-714, November 1984.
[Laprie85] J.C. Laprie, “Dependable Computing and Fault Tolerance: Concepts and Terminology,” 
Proc. 15th Int. Symp. Fault-Tolerant Computing, pp. 2-11, June 1985.
[Law82] A. M. Law and W. D. Kelton, Simulation Modeling and Analysis, McGraw Hill Book 
Company, 1982.
[Lee91] I. Lee, R.K. Iyer and D. Tang, “Error/Failure Analysis Using Event Logs from Fault Tol­
erant Systems,” Proc. 21st Int. Symp. Fault-Tolerant Computing, pp. 10-17, June 1991.
[Lee92] I. Lee and R.K. Iyer, “Analysis of Software Halts in Tandem System,” Proc. 3rd Int. Symp. 
Software Reliability Engineering, pp. 227-236, October 1992.
[Lee93a] I. Lee, D. Tang, R.K. Iyer, and M.C. Hsueh, “Measurement-Based Evaluation of Operating 
System Fault Tolerance,” IEEE Transactions on Reliability, Vol. 42, No. 2, pp. 238-249, 
June 1993.
[Lee93b] I. Lee and R.K. Iyer, “Faults, Symptoms, and Software Fault Tolerance in the Tandem 
GUARDIAN90 Operating System,” Proc. 23rd Int. Symp. Fault-Tolerant Computing, pp. 
20-29, June 1993.
[Lewis84] E.E. Lewis and F. Bohm, “Monte Carlo Simulation of Markov Unreliability Models,” Nu­
clear Eng. and Design, Vol. 77, pp. 49-62, 1984.
[Lin90] T.T. Lin and D.P. Siewiorek, “Error Log Analysis: Statistical Modeling and Heuristic Trend 
Analysis,” IEEE Trans. Reliability, Vol. 39, No. 4, pp. 419-432, October 1990.
[Littlewood80] B. Littlewood, “Theories of Software Reliability: How Good Are They and How Can They 
Be Improved?” IEEE Trans. Software Engineering, Vol. SE-6, No. 5, pp. 489-500, Septem­
ber 1980.
[Lomelino86] D. Lomelino and R. Iyer, “Error Propagation in a Digital Avionic Processor: A Simulation- 
Based Study,” Proc. Real Time Systems Symposium, pp. 218-225, December 1986.
[Maxion90a] R.A. Maxion, “Anomaly Detection for Diagnosis,” Proc. 20tli Int. Symp. Fault-Tolerant 
Computing, pp. 20-27, June 1990.
82
[Maxion90b] R.A. Maxion and F.E. Feather, “A Case Study of Ethernet Anomalies in a Distributed 
Computing Environment,” IEEE Trans. Reliability, Vol. 39, No. 4, pp. 433-443, October 
1990.
[Maxion93] R.A. Maxion and R.T. Olszewski, “Detection and Discrimination of Injected Network 
Faults,” Proc. 23rd Int. Symp. Fault-Tolerant Computing, pp. 198-207, June 1993.
[McConnel79] S.R. McConnel, D.P. Siewiorek, and M.M. Tsao, “The Measurement and Analysis of Tran­
sient Errors in Digital Compute Systems,” Proc. 9th Int. Symp. Fault-To l er ant Computing, 
pp. 67-70,1979.
[McGough81] J.G. McGough and F.L. Swern, Measurement of Fault Latency in a Digital Aviomc Mini 
Processor, NASA Contract Report 3462, NASA, Washington, D. C. 1981.
[Meyer88] B. Meyer, Object-oriented Software Construction, Prentice Hall International Series in Com­
puter Science, 1988.
[MeyerJ80] J.F. Meyer, “On Evaluating the Performability of Degradable Computing Systems,” IEEE 
Trans. Computers, Vol. C-29, No. 8, pp. 720-731, August 1980.
[MeyerJ88] J.F. Meyer and L. Wei, “Analysis of Workload Influence on Dependability,” Proc. 18th Int. 
Symp. Fault-Tolerant Computing, pp. 84-89, June 1988.
[MeyerJ92] J.F. Meyer, “Performability: A Retrospective and Some Pointers to the Future,” Perfor­
mance Evaluation, Vol. 14, pp. 139-156, February 1992.
[Migneault85] Migneault, “The Diagnostic Emulation Technique in the Airlab,” Internal Report, NASA- 
Langley Research Center, 1985.
[Mourad87] S. Mourad and D. Andrews, “On the Reliability of the IBM MVS/XA Operating System,” 
IEEE Trans. Software Engineering, Vol. SE-13, No. 10, pp. 1135-1139, October 1987.
[Musa87] J.D. Musa, A. Iannino, and K. Okumoto, Software Reliability: Measurement, Prediction, 
Application, McGraw-Hill Book Company, 1987.
B. Randell, “System Structure for Software Fault Tolerance,” IEEE Trans. Software Engi­
neering, Vol. SE-1, No. 2, June 1975.
A. Reibman, R. Smith, and K. Trivedi, “Markov and Markov Reward Model Transient 
Analysis: An Overview of Numerical Approaches,” European Journal of Operational Re­
search, Vol. 40, pp. 257-267, 1989.
[Rogers85] W. Rogers and J. Abraham, “CHIEFS: A Concurrent Hierarchical and Extensible Fault 
Simulator,” Proc. IEEE Int. Test Conference, pp. 710-716, 1985.
[Rosenberg93] H. A. Rosenberg and K. G. Shin, “Software Fault Injection and its Application in Dis­
tributed Systems,” Proc. 23rd Int. Symp. Fault-Tolerant Computing, pp. 208-217, June 
1993.
[Ross85] S.M. Ross, Introduction to Probability Models, 3rd Edition, Academic Press, Inc., 1985.
[Ruehli83] A.W. Ruehli and G.S. Ditlow, “Circuit Analysis, Logic Simulation, and Design Verification 
for VLSI,” Proc. of the IEEE, Vol. 71, No. 1, pp. 34-48, January 1983.
[Sahner87] R.A. Sahner and K.S. Trivedi, “Reliability Modeling Using SHARPE,” IEEE Trans. Reli­
ability, Vol. R-36, No. 2, pp. 186-193, June 1987.
[Saleh84] R.A. Saleh, “Integrated Timing Analysis and SPLICE1,” Mem. UCB/ERL M84/2, Elec. 























SAS User’s Guide: Basics, SAS Institute, 1985.
•
Z. Segall, D. Vrsalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, R. Dancey, A. 
Robinson, and T. Lin, “FIAT -  Fault Injection Based Automated Testing Environment,” 
Proc. 18th Int. Symp. Fault-Tolerant Computing, pp. 102-107, June 1988.
H. Schwetman, “CSIM: A C-Based Process-Oriented Simulation Language,” Proc. Winter 
Simulation Conf., 1986.
K. Shin and Y. Lee, “Error Detection Process -  Model, Design, and Its Impact on Computer 
Performance,” IEEE Trans. Computers, Vol. C-33, No. 6, pp. 529-540, June 1984.
K.G. Shin and Y.H. Lee, “Measurement and Application of Fault Latency,” IEEE Trans. 
Computers, Vol. C-35, No. 4, pp. 370-375, April 1986.
D.P. Siewiorek, V. Kini, H. Mashburn, S.R. McConnel, and M. Tsao, “A Case Study of
C. mmp, Cm*, and C.vmp: Part I—Experience with Fault Tolerance in Multiprocessor 
Systems,” Proc. of the IEEE, Vol. 66, No. 10, pp. 1178-1199, October 1978.
D. P. Siewiorek and R.W. Swarz, Reliable Computer Systems: Design and Evaluation, Dig­
ital Press, Bedford, Mass., 1992.
M.S. Sullivan and R. Chillarege, “Software Defects and Their Impact on System 
Availability—A Study of Field Failures in Operating Systems,” Proc. 21st Int. Symp. Fault- 
Tolerant Computing, pp. 2-9, June 1991.
M.S. Sullivan and R. Chillarege, “A Comparison of Software Defects in Database Manage­
ment Systems and Operating Systems,” Proc. 22nd Int. Symp. Fault-Tolerant Computing, 
pp. 475-484, July 1992.
D. Tang, R.K. Iyer and Sujatha Subramani, “Failure Analysis and Modeling of a VAXcluster 
System,” Proc. 20th Int. Symp. Fault-Tolerant Computing, pp. 244-251, June 1990.
D. Tang and R. K. Iyer, “Impact of Correlated Failures on Dependability in a VAXcluster 
System,” Proc. 2nd IFIP Working Conf. Dependable Computing for Critical Applications, 
Tucson, Arizona, February 1991.
D. Tang and R.K. Iyer, “Analysis and Modeling of Correlated Failures in Multicomputer 
Systems,” IEEE Trans. Computers, Vol. 41, No. 5, pp. 567-577, May 1992.
D. Tang and R.K. Iyer, “Analysis of the VAX/VMS Error Logs in Multicomputer 
Environments—A Case Study of Software Dependability,” Proc. Third Int. Symp. Software 
Reliability Engineering, Research Triangle Park, North Carolina, pp. 216-226, October 
1992.
D. Tang and R.K. Iyer, “Dependability Measurement and Modeling of a Multicomputer 
Systems,” IEEE Trans. Computers, Vol. 42, No. 1, pp. 62-75, January 1993.
D. Tang and R.K. Iyer, “MEASURE-)— A Measurement-Based Dependability Analysis 
Package,” Proc. ACM SIGMETRICS Conf. Measurement and Modeling of Computer Sys­
tems, Santa Clara, California, pp. 110-121, May J993.
K.S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science 
Applications, Prentice-Hall, Englewood Cliffs, NJ, 1982.
K.S. Trivedi, J.K. Muppala, S.P. Woolet, and B.R. Haverkort, “Composite Performance and 
Dependability Analysis,” Performance Evaluation, Vol. 14, pp. 197-215, February 1992.
84
[Tsao83] M.M. Tsao and D.P. Siewiorek, “Trend Analysis on System Error files,” Proc. 13th Int. 
Symp. Fault-Tolerant Computing, pp. 116-119, June 1983.
[Velardi84] P. Velardi and R.K. Iyer, “A Study of Software Failures and Recovery in the MVS Operating 
System,” IEEE Trans. Computers, Vol. C-33, No. 6, pp. 564-568, June 1984.
[Wein90] A.S. Wein and A. Sathaye, “Validating Complex Computer System Availability Models,” 
IEEE Trans. Reliability, Vol. 39, No. 4, pp. 468-479, October 1990.
[Yang92] F.L. Yang, Simulation of Faults Causing Analog Behavior in Digital Circuits, Ph.D. Thesis, 
Dept, of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 
May 1992.
[Young92] L. Young, R.K. Iyer, K. Goswami, and C. Alonso, “Hybrid Monitor Assisted Fault Injec­
tion Environment,” Proc. Third IFIP Working Conf. Dependable Computing for Critical 
Applications, Mondello, Sicily, Italy, pp. 163-174, September 1992.
85
