Predeployment validation of fault-tolerant systems through software-implemented fault insertion by Segall, Zary Z. et al.
NASA Contractor Report 4244 
Predeployment Validation 
of Fault-Tolerant Systems 
Through Software-Implemented 
Fault Insertion 
Edward W. Czeck, Daniel P. Siewiorek, 
and Zary Z. Segall 
GRANT NAG1-190 
JULY 1989 
NASA 
https://ntrs.nasa.gov/search.jsp?R=19890015446 2020-03-20T01:39:48+00:00Z
Carnegie-Mellon University 
Pittsburgh, Pennsylvania 
Prepared for 
Langley Research Center 
under Grant NAG1-190 
National Aeronautics and 
Space Administration 
Office of Management 
Scientific and Technical 
Information Division 
1989 
Contents 
1 Introduction 
2 Validation 
2.1 System Specifications 
2.2 Validation Methodologies 
3 Faults 
3.1 Origin of Faults 
3.2 Fault Models 
3.3 Fault Insertion Examples 
3.4 Other Fault Insertion Examples 
4 FIAT 
5 The FIAT Process and Its Abstractions 
5.1 Workload Management 
5.2 Fault Classes 
5.3 Experiments 
5.4 Data Collection and Analysis 
6 Implementation 
6.1 Hardware 
6.2 Software 
6.2.1 FIM Software 
6.2.2 FIRE Software 
6.3 Experimental Procedure 
7 Evaluation 
7.1 Experiment 1 
7.2 Experiment 2 
8 Summary 
1 Introduction 
Validation methodologies of ultra-reliable systems have been explored by NASA[47,48]. The developed 
methodologies included fault-free characterization of the system, single processor behavior under faulted 
conditions, and system behavior under faulted conditions. CMU’s efforts have supported this validation 
methodology through empirical studies. Early work by Clune, Feather, Czeck, and Grizzaffi[12,15,16,22] 
determined the value of fault-free characterization. Additionally, Feather[22] developed a synthetic workload 
generator and monitor to aid in the characterization of the system, while Czeck[16] supplemented 
the workload generator to induce faults into the system. Further work at CMU led to the inception of 
the FIAT project. 
This paper describes FIAT, a Fault Injection-based Automated Testing environment, which is used to 
experimentally characterize and evaluate distributed realtime systems under fault-free and faulted con- 
ditions. Section 2 surveys validation methodologies and develops the need for fault insertion. Section 3 
overviews fault origins and fault models. Section 4 suggests some motivations for FIAT and presents 
a high level description of the system. Section 5 discusses the abstractions used in the FIAT system, 
while Section 6 details its implementation. Finally, Section 7 describes two experiments performed on 
the FIAT system, and Section 8 concludes with a summary and observations. 
2 Validation 
Validation is the process of substantiating, through demonstration, that a given system meets its 
specification[4,5]. For highly dependable systems, the specifications contain extreme reliability 
requirements which necessitate the ability to function under faulty conditions. To demonstrate or validate the 
system operation, prediction methods must be used to determine the system “operation point” before 
the system is committed to use. Methods of determining the “operation point”, or the nominal behavior, 
include simulation, modeling, and analysis. Complementary to these methods are experimental methods, 
such as fault insertion, which are well suited for areas in which modeling and analysis fail to capture the 
needed detail. 
2.1 System Specifications 
System specifications can be divided into two domains, with the validation effort directed to 
demonstrate that both are fulfilled[11]. The first domain includes functionality and the second is the bounds 
within which correct functioning must occur. Functionality is by far the easier of the two domains to 
validate; metrics, such as throughput and realtime deadlines, are readily defined. The bounds of correct 
functioning typically are associated with dependable computing and include metrics such as reliability, 
maintainability, and fault tolerance. Although the two domains cannot truly be separated during 
validation, this report shall concentrate on the validation regarding the reliability and fault-tolerance issues. 
Previous reports outlined fault-free validation[12,15,16,22]. 
A typical reliability requirement for a life critical application is 10^-10 failures per hour. The basis for 
this failure rate can be justified through the following life-cycle model. Assume a 30 year life, with 8 
hours operation per day; this yields approximately 100,000 (10^5) operational hours per unit. If 100,000 
(10^5) copies are produced and one failure is acceptable over the life of all copies, then the failure rate 
must be less than 10^-10 failures per hour. This translates to one failure per 1 million years per unit, 
or several orders of magnitude greater than the reliability of today's systems. This stringent reliability 
requirement yields two observations. The first is that non-redundant systems are at least six orders of 
magnitude less reliable than the goal, necessitating the use of redundancy and its ability to function 
correctly with faults present. The second is that life testing (monitoring) for confirmation of reliability 
is impossible, necessitating accelerated testing. These observations formulate the problem 
addressed by this report: how can the reliability specifications be validated? 
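The life-cycle arithmetic above can be checked with a short calculation. The sketch below is illustrative only: the constant and function names are not from the report, and the conversion to calendar years assumes 8,760 hours per year.

```python
# Illustrative check of the life-cycle reliability model described above.
# All names are hypothetical; the input figures come from the text.

HOURS_PER_CALENDAR_YEAR = 24 * 365  # assumption used to express the rate in years


def operational_hours(life_years, hours_per_day):
    """Operational hours accumulated by one unit over its service life."""
    return life_years * 365 * hours_per_day


def required_failure_rate(hours_per_unit, copies, acceptable_failures=1):
    """Maximum failure rate (failures per hour) over the whole population."""
    return acceptable_failures / (hours_per_unit * copies)


exact_hours = operational_hours(30, 8)       # 87,600, which the text rounds to 10^5
rate = required_failure_rate(1e5, 100_000)   # 1 / (10^5 * 10^5)
years_per_failure = 1 / rate / HOURS_PER_CALENDAR_YEAR

print(f"hours per unit (exact): {exact_hours}")
print(f"required failure rate:  {rate:.0e} per hour")
print(f"hours between failures expressed in years: {years_per_failure:,.0f}")
```

With the rounded figure of 10^5 hours per unit, the population accumulates 10^10 operational hours, which is where the 10^-10 requirement and the "1 million years" figure originate.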
The need for a validation environment, such as FIAT (Fault Injection-based Automated Testing 
environment) which is described in this report, stems from system specifications. Two possible methods 
exist for validation of high reliability systems under faulted conditions. The first method, life testing, 
monitors actual running systems awaiting the natural occurrence of faults. The behavior of the system, 
when faults occur, can then be analyzed and used to support validation assumptions or conditions. 
The second method, fault insertion, induces faults into the system and analyzes behavior under these 
conditions. 
Life testing offers realism, but due to the current level of component reliability, faults can be expected 
at a rate of one in 10^3 hours per system. This failure rate is prohibitively slow for the completeness 
required in thorough testing. Fault insertion speeds up the rate at which synthetic faults occur. The 
use of synthetic faults is necessary given the large number of fault types, fault locations, and times of 
occurrence. For example, a small board consisting of 50 packages each with 20 pins, has a fault space of 
1000 pin-level faults without considering any time dependencies. Additionally, software faults must be 
considered as the majority of system complexity moves into software. The software-fault space is also 
large: consider the amount of code present in even the smallest of operating systems. 
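To illustrate how quickly the fault space grows, the sketch below counts the experiments implied by the board example. Only the 50 packages and 20 pins come from the text; the three fault types and the 100 candidate insertion times are assumptions chosen purely for illustration.

```python
from itertools import product

PACKAGES = 50
PINS_PER_PACKAGE = 20

# Assumed for illustration only; the text gives no specific counts here.
FAULT_TYPES = ("stuck-at-0", "stuck-at-1", "inverted")
INSERTION_TIMES = range(100)

# Every (package, pin) pair is one candidate fault location.
pin_sites = [(pkg, pin) for pkg in range(PACKAGES) for pin in range(PINS_PER_PACKAGE)]


def fault_space_size(sites, fault_types=(None,), times=(None,)):
    """Count the distinct (location, type, time) combinations."""
    return sum(1 for _ in product(sites, fault_types, times))


print(fault_space_size(pin_sites))                                # locations only
print(fault_space_size(pin_sites, FAULT_TYPES))                   # locations x types
print(fault_space_size(pin_sites, FAULT_TYPES, INSERTION_TIMES))  # x times
```

Even under these modest assumptions the count rises from 1000 locations to 300,000 distinct experiments, which is why exhaustive coverage by waiting for natural faults is impractical.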
2.2 Validation Methodologies 
Much work has been done in validation methodologies, especially in aerospace and other life-critical ap- 
plications. These methodologies include formal proofs, analyses, and tests to assure the system meets its 
specifications. Although there is no commonly accepted validation methodology, a generalized methodology 
may be extracted from procedures presented in the literature[11,20,27,29,45,47,48,51]. The approach 
is to build confidence in the system by a thorough and systematic methodology of proofs, analyses, and 
tests. Proofs are formal arguments supported by deductive inferences. Analyses employ models of the 
system, and testing uses statistical inference. These three methods are complementary: proofs and 
analyses use abstract models of the systems; testing uses the actual system to substantiate the models 
and results generated in the analysis. 
These three processes (proofs, analyses, and testing) are applied throughout the system development 
as depicted in Figure 1. During the architecture development, proofs are generated which specify the 
conditions necessary to achieve the requirements. Analyses of the architecture include reliability and 
error rate Markov models, while the testing comprises activities such as high level simulations and 
design reviews. At the implementation level, the conditions required in the architecture proofs are 
verified, leading to more conditions for the realization. The analyses include further refinement of 
the Markov models developed in the architecture analysis, and in-depth analyses such as Fault-Tree 
generation. Testing begins to involve concrete methods such as simulation and emulation of the design. 
In the final level, realization, proofs of the hardware and software structure are continued from the 
implementation level. Analyses include exhaustive Failure Modes and Effects Analysis, refinement of 
Fault-Tree analysis, and the inclusion of specific failure rates into the reliability and error rate analyses. 
Testing at the realization stage measures the assumptions and requirements used in the proofs and 
analyses. The assumptions involve error rates, fault latency, and coverage, as well as concrete measurements 
such as throughput, utilization, and error recovery time. 
Development     Proofs                  Analyses                  Tests
Level           (abstract)                                        (concrete)
Architecture    Prove Architecture      Reliability and Error     High Level Simulations,
                Against Requirements    Rate Markov Models        Design Reviews
Implementation  Prove Implementation    Fault Tree Analysis       Simulation and Emulation
                Against Architecture                              of Hardware
Realization     Prove Realization       Failure Modes and         Support Assumptions
                Against Implementation  Effects Analysis          from Analysis,
                                                                  Fault Insertion
Figure 1: Validation Areas Throughout the Design Process 
3 Faults 
The validation of dependable systems requires an understanding of faults. This section discusses faults: 
their origin, models of faults, their basis, and some fault insertion examples. 
3.1 Origin of Faults 
One possible classification of faults is by their origin. Three categories arise: design errors, manufacturing 
introduced faults, and faults occurring in field use. The following paragraphs discuss these three fault 
origins[1]. 
Design errors are caused by improper translation of an idea or concept into an operational realization[56]. 
Gathering information on design errors is difficult because each design error occurs only once per system 
and the types of design errors are usually complex. Design errors have been studied mostly in the software 
realm due to the expanding requirement of reliable software and the fact that all software failures 
reduce to design mistakes. The effects of these errors are difficult to generalize due to their diversity and 
infrequency of occurrence, hence few if any models have been developed to characterize their behavior. 
Manufacturing defects are introduced by improper processing or the use of defective material in the 
fabrication of the system. Examples of manufacturing defects in semiconductors include: cracked dies, 
over or under doping of material, and flaws in the mask. The effects of manufacturing defects on CMOS 
circuit behavior have been characterized in Ferguson[23]. Ferguson applied mask defects to several 
industry circuits and extracted the behavior from the faulted circuit. He noted from the results that 
over 99% of all faults can be characterized as either bridging faults or breaks, while only approximately 
50% of all faults can be modeled as single stuck line faults. 
Field failures are caused by physical processes occurring through normal and abnormal use during the 
life of the system. Examples of MOS device failures are listed below[21,39,46]. 
1. Thin oxide breakdown is a primary mechanism which is due to large electric fields in the insulator, 
   usually in the gate oxide. 
2. Electromigration, the drifting of metal atoms toward the cathode, is a common wear out mechanism 
   influenced by high current densities in the conductor. 
3. “Hot electron” trapping in the gate oxide is caused by high temperature and high electric fields in 
   the channel of a MOS transistor. 
4. Soft or transient errors are produced by alpha-particles and cosmic radiation which create several 
   [illegible in source] affecting stored charges. 
5. Electric overstress due to improper environmental conditions, such as electro-static discharge, may 
   cause multiple physical failures. 
6. Other life failures are caused by an array of sources, ranging from design deficiencies, production 
   techniques, mechanical stress, corrosion, etc. 
Little has been published on the distribution of field failures due to the variation introduced by the 
manufacturing process and operation environment. Additionally, knowledge regarding the effects of field 
failures on circuit behavior is limited, but several sources give isolated information. Lloyd and Knight[38] 
empirically support the classification of shorts and opens due to electromigration into a single failure 
model, but give no information on the resulting behavior. Timoc[63] attempted to map physical failures 
to logical models; several failures resulted in “stuck-at” behavior, while others resulted in parametric 
faults. 
          Manufacturing Stage   Field Life
Oxide     less than 10%         25% - 75%
Metal     30% - 40%             4% - 17%
Figure 2: Fault Effects: Manufacturing Stage vs. Field Life 
It is interesting to contrast faults originating in manufacturing and field operation. In field operation, 
the frequency of failures is highest for gate oxide, and next highest for opens and shorts in metal runs 
(see Figure 2). Oxide failures result in transistors stuck off (for both oxide breakdown and electron trap- 
ping). Metal failures result in open and shorted metal lines, breaks and bridges (for electromigration). 
During manufacturing, on the other hand, Ferguson[23] reports less than 10% of mask defects cause oxide 
problems and 30% to 40% cause extra or missing metal, with an insignificant percent of faults causing 
transistor stuck off faults. 
3.2 Fault Models 
Fault models are abstractions of failure mechanisms to a level of understanding desired by the user. 
Models range from the physical device level to the gate, functional and even architectural level. A wide 
range of models have been developed for test program generation. Figure 3 summarizes fault models 
at different levels of abstraction. Fault models are typically categorized according to the origin of the 
defects: manufacturing or field operation. Most fault models are based on manufacturing defects, which 
are only of concern during a fraction of the system lifetime. Validation is interested in the events occurring 
during the useful life of the system, hence some of the manufacturing fault models are not applicable. 
Switch-level fault models[9] are used for MOS devices where unidirectional logic gate models do not 
adequately detail the bidirectional behavior of such devices under certain fault types (e.g., bridging, stuck 
open and closed transistors.) Switch-level models contain nodes, connected by bi-directional transistors 
(switches); faults are nodes stuck high or low, transistors stuck open or closed, and extra or missing 
transistors. Low level simulations and models are required because the failure modes of devices, especially 
CMOS[25], cannot be modeled at the gate or higher levels. Furthermore, limited access to individual 
devices prohibits empirical observation. Switch-level simulations aid in test program generation and test 
coverage assessment, but the simulation cannot be used for large systems as modeling and simulation 
time become prohibitive. 
Level       Fault Models                       Basis                  Limitations
Network     Communication lost, delayed        Abstraction of         Failure modes are
            or unordered. Lost nodes.          behavior.              unknown and complex.
PMS         Data change. Message or process    Abstraction of         Failure modes are
            lost. Data inconsistent.           behavior.              unknown and complex.
            Time outs.
RT          Data change. Wrong assertion,      Abstraction of         Based on RT models
            source, or destination.            behavior.              not implementation.
Functional  Complement or dual function.       Observation?           Many realizations
            Truth table modification.          Ad hoc.                and fault modes.
Gate        Gate output stuck at 0 or 1,       TTL and PC             Technology outdated.
            Single stuck line.                 board behavior.
Switch      New and missing devices,           Processing             Simulation overhead,
            Shorts and breaks (opens),         defects.               difficult to observe
            Transistors stuck on/off.                                 and fault insert.
Figure 3: Fault Models at Different Abstraction Levels 
Gate-level fault models assume inputs and outputs of gates are stuck at high or low logic values, but 
the gate functions correctly. These faults are based on printed circuit (PC) boards, TTL, and pre-TTL 
logic: circa 1960. The gate-level fault model is not applicable to MOS implementations as failure modes 
are possible which transform a combinational MOS circuit to a sequential circuit[64]¹, and complex MOS 
circuit implementations do not map gate lines to circuit nodes. Beh et al[7] emphasized that fault models 
should be consistent with manufacturing defects and developed a methodology relating TTL processing 
defects to their logic behavior. Beh's work was limited to demonstrating which defects can be modeled 
by gate-level stuck-at faults and did not attempt to develop new fault models. 
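The single stuck line model can be made concrete with a small fault simulation. The sketch below is a minimal illustration, not a tool described in this report: the three-input circuit, the net names, and the helper functions are all hypothetical. It forces one net at a time to a fixed value and checks whether a set of test vectors exposes the fault at the output.

```python
from itertools import product

# Hypothetical circuit: y = (a AND b) OR (NOT c), with internal nets n1 and n2.
NETS = ("a", "b", "c", "n1", "n2", "y")


def evaluate(a, b, c, fault=None):
    """Evaluate the circuit; fault is (net_name, stuck_value) or None."""
    def drive(net, value):
        # A stuck net ignores its driver and keeps the stuck value.
        if fault is not None and fault[0] == net:
            return fault[1]
        return value

    a, b, c = drive("a", a), drive("b", b), drive("c", c)
    n1 = drive("n1", a & b)
    n2 = drive("n2", 1 - c)
    return drive("y", n1 | n2)


# The single stuck line fault list: every net stuck at 0 and stuck at 1.
ALL_FAULTS = [(net, value) for net in NETS for value in (0, 1)]


def coverage(test_vectors):
    """Return (fraction of faults detected, list of detected faults)."""
    detected = [
        fault for fault in ALL_FAULTS
        if any(evaluate(*vec, fault=fault) != evaluate(*vec) for vec in test_vectors)
    ]
    return len(detected) / len(ALL_FAULTS), detected


one_vector = [(1, 1, 1)]
exhaustive = list(product((0, 1), repeat=3))
print(coverage(one_vector)[0])
print(coverage(exhaustive)[0])
```

The same structure extends to other gate-level models (for example, inverted lines) by changing what `drive` returns; it says nothing about bridging faults or breaks, which is the limitation of the gate-level model noted above.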
Ferguson et al[23,54] developed fault models based on processing mask defects and mapped the defects 
to gate-level fault models, illustrating that only 50% of the faults are representable by a gate-level single 
stuck line fault. Marchal[40] challenged the stuck-at models used in functional testing with simulations 
of faulted microprocessor internal buses. Results show fault models should be realization and technology 
dependent, with models updated as technology advances. These approaches require realization details 
and do not further abstract faults to higher levels. 
As integration levels increase further, testing based on implementation faults becomes prohibitive and 
functional testing approaches, either implementation dependent or independent, must be used[28,59]. 
Thatte and Abraham[60,61,62] presented the ground work for functional testing, with fault models 
based on possible scenarios of failures occurring within the control section, data section, or data storage 
of a generalized microprocessor. These models were implementation independent, hence they did not 
embody a complete fault library, but they were used to generate test procedures for microprocessors. 
Silberman and Spillinger[57] present a formalized methodology to define the functional-level fault model, 
given the implementation of a circuit, its related defects, and the input vector. The resultant functional 
fault model is then used in simulations to acquire a better estimate of implementation fault coverage, 
saving the overhead of low level simulation. 
¹Results from Ferguson[23] show the occurrence of these types of faults from manufacturing defects to be less than 2% 
of all faults occurring. 
Architectural-level testing has been proposed and implemented[17,18], with mostly ad hoc fault models. 
Stuck-at faults of input and output nets, mutation operation in the microprocessor (Thatte and 
Abraham model) and in control resources (instruction fetch and decode units) have been included in the 
architectural-level fault model. The evaluation of these models was based on test coverage results from 
simulations at the architectural and gate level, with results of the architectural simulations, at best, 
tracking the coverage of the test programs, and showing which fault models are appropriate. 
Throughout the referenced papers, the goal was to substantiate or formalize the fault models used 
at the higher levels based on known failure modes² or lower level fault models. There are still some 
limitations in predicting higher (component) level fault behavior, even with fault models based on 
manufacturing defect distributions. Using a gate-level emulation of an avionic processor, McGough 
et al[41,42,44] showed an 87% coverage of all gate-level faults contrasted with a 98% coverage for all 
component³ (pin) level faults. Furthermore, these studies concentrated on logic value faults and did not 
consider parametric or other faults. 
Some results from software testing support the concept of functional testing based on abstract fault 
models. One interesting result is the comparison between “black-box” testing, similar to the functional- 
level in hardware, and structural testing, similar to the component level. In “black-box” testing the 
internal structure of the module is unknown. The test input generation proceeds from specifications 
given to the module, with emphasis on boundary values of both the input and output vectors. In 
structural or “white-box” testing, the internal structure of the program is known, and the test generation 
typically attempts to exercise all paths. In a study by Howden[30], “black-box” testing was significantly 
more effective in error detection than structural testing due to subtleties in the data selection. Data 
for “white-box” testing are geared to exercise each path once for some data, but data for “black-box” 
testing are geared to exercise the program for the range of valid data. Also, Lai in his Ph.D. thesis[32] and 
in two conference papers[33,34] showed the merits of functional testing of hardware without the details 
of hardware implementation. 
3.3 Fault Insertion Examples 
The evaluation of fault-tolerant systems includes both functional and reliability measures. Functional 
measures are relatively straightforward, but the analysis of reliability requires measures of such 
quantities as component reliability, fault coverage, error detection coverage, and other difficult to measure 
factors. These unique measures typically require special methods, such as fault insertion[11,47,48,51]. 
Fault insertion has been used extensively for a number of objectives: test coverage evaluation[31], 
the generation of fault dictionaries[26], the study of error propagation and latency[13,24,44,55], the 
evaluation of error detection schemes[14,52,53], and system evaluation[6,36,58]. Figure 4 summarizes 
these studies. By far, the two most common means of fault insertion have been simulation of the hardware 
and physical fault insertion. Fault simulation has occurred at all the levels discussed in Section 3.2 on 
fault models[9,43,49,57]. Physical fault insertion has been used to determine fault coverage of test 
programs[6], fault latency[24,55], and fault detection efficiency[14,36,53]. 
²Most of the failure modes considered were during the manufacturing phase. 
³A component, as used in this study, is defined as a single SSI, MSI or LSI chip. 
Source                          Target                         Method/Level                               Goal
Avizienis '72 [6]               JPL Star breadboard            Permanent and Transient/ Pin (Gate) Level  Estimate of Detection, Recovery Procedures, and Coverage
Goetz '72 [26]                  ESS Microstore                 Simulation/ Gate Level                     Detection and Coverage Measure
Courtois '79 [13]               MC 6800                        Simulation/ op-code (RT) level             Detection Time
Decouty et al '80[19], '82[14]  GORDINI: Fault Tolerant micro  Physical/ Pin (Gate and RT) Level          Tool and Methodology Development
Kurlak '81 [31]                 GE MCP-701 CPU                 Physical Faults with FMEA analysis         Evaluation of Watchdog Timer
McGough & Swern '81 [43]        Bendix Simplex BDX-930         Simulation/ Gate and Pin Level             Coverage Measurement
Lala '83 [36]                   FTMP Engineering Model         Physical/ Pin (Gate) Level                 System and Coverage Evaluation
Yang et al '85[65]              iAPX 432                       Fault Emulation/ Memory Words (RT)         Coverage of TMR
Schuette et al '86[53]          MC 68000                       Transient, Physical/ Bus (RT) Level        Evaluation of Error Detection Techniques
Czeck et al '87 [16]            FTMP Engineering Model         Fault Emulation/ RT Level                  Methodology Study
Finelli '87 [24]                FTMP Engineering Model         Permanent Physical/ Gate Level             Fault Recovery Distributions
Figure 4: Fault Insertion Examples 
Figure 5 presents a matrix of advantages and disadvantages of fault insertion as a function of methods. 
The four methods of fault insertion include: 
1. Software simulation involves fault insertion by code modifications or special functions of the 
   simulation engine. 
2. Hardware emulation uses hardware representative of the system under test, such as an engineering 
   prototype, as a basis for study. 
3. Fault emulation attempts to imitate fault behavior through software control of the hardware or 
   special capabilities built into the hardware. 
4. Physical fault insertion involves the inducement of faults through special hardware to the actual 
   system under test. 
               Software                  Hardware Emulation        Fault                     Physical Fault
               Simulation                (Breadboard)              Emulation                 Insertion
Advantages     Access to system at any   Representative hardware   True hardware and         True hardware and
               level of detail.          with favorable access     software in use.          software in use.
               Fault types and control   and monitoring.
               are unlimited.
Disadvantages  Simulation time           Implementation and other  Fault types are limited.  VLSI limits access and
               explosion.                parameters will change                              monitoring points.
               Lack of tools limit       with deployed system.
               ability.
               Task is difficult.
Figure 5: Fault Insertion Methods 
Physical fault insertion has been used extensively in system validation, with the typical means of fault 
insertion being pin-level stuck-at and inverted faults[6,14,19,36,53,55,58]. With an SSI/MSI realization of 
a system, pin-level stuck-at faults closely represent failures which have been observed to occur in such devices; 
but with LSI and VLSI realizations, failures may be remote from the input/output pins. At these higher 
levels of integration, fault insertion seldom claims to accurately portray physical faults, but the hope 
is that they provide a first approximation to the metrics under study. Palumbo[50] hypothesised that 
pin-level stuck-at faults produce error behavior similar to error behavior caused by internal devices. 
Initial empirical results are inconclusive; the hypothesis holds well for 85% of the data, but other data 
call for rejection. 
Although actual faults may be remote from the pin boundaries, promising results have been reported 
with pin fault insertion of LSI and VLSI devices, and other abstract fault insertion methods. Schuette et 
al[53] inserted transient faults on the data, address, and control lines of an MC68000 bus, representing 
faults within the data and control sections of the processor. With this fault insertion ability, two 
error-detection schemes were evaluated. Yang et al[65] inserted faults into an iAPX 432 to evaluate software 
implemented TMR; the faults were generated by altering bits in the program or data areas in memory 
using the debugger. Czeck et al[16] inserted faults in an FTMP triad by causing one processor to execute 
special code, thus triggering the error detection mechanism. This method was able to duplicate some 
hardware fault insertion results presented by Lala[35,36]. But even with these results, McGough[41,42,44] 
illustrated a distinct gap between gate level and component level fault types as discussed in Section 3.2. 
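As a rough illustration of memory-word fault emulation (altering bits in program or data areas) and of how a TMR arrangement masks a single fault, consider the sketch below. It is not the procedure used by Yang et al or by FIAT; the function names, the 16-byte memory image, and the bitwise voter are assumptions made for illustration.

```python
import random


def flip_bit(memory, byte_index, bit_index):
    """Emulate a fault by inverting one bit of a memory image in place."""
    memory[byte_index] ^= 1 << bit_index


def inject_random_fault(memory, rng):
    """Flip one randomly chosen bit and report where the fault was placed."""
    byte_index = rng.randrange(len(memory))
    bit_index = rng.randrange(8)
    flip_bit(memory, byte_index, bit_index)
    return byte_index, bit_index


def majority_vote(a, b, c):
    """Bitwise two-out-of-three vote over three equally sized memory images."""
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


rng = random.Random(1989)  # fixed seed so the experiment can be repeated
golden = bytes(range(16))
replicas = [bytearray(golden) for _ in range(3)]

site = inject_random_fault(replicas[1], rng)
voted = majority_vote(*replicas)
print("fault site:", site)
print("single fault masked:", voted == golden)

# The same bit faulted in a second replica defeats the voter.
flip_bit(replicas[0], *site)
print("double fault masked:", majority_vote(*replicas) == golden)
```

Seeding the random generator is a small nod to the repeatability requirement: the same fault site can be regenerated when a result needs to be investigated.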
3.4 Other Fault Insertion Examples 
Aside from hardware test generation and verification, fault insertion has been used in other areas. One 
area, which is related to computing systems, is software testing. Two software fault insertion examples 
are error seeding and mutation analysis[2,3,4,10]. While both methods seem similar, there are subtle 
differences. Error seeding is the process of inserting faults (errors) into the software during debugging. 
When the acceptable percentage of the seeded errors are found, the debugging effort ceases. It is 
assumed that the unseeded errors (programmers' errors) are found at the same rate as the seeded errors, 
thus error seeding provides a measure of the number of unfound errors. Mutation analysis is used as 
a measure of test program coverage. Faults (mutations), which are a slight perturbation of a program 
statement, are inserted one at a time into lines of code. Test sets are run to determine if the mutation 
is detected, thus gaining a measure of the test effectiveness. 
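Both techniques reduce to simple ratios, sketched below. The estimator follows directly from the stated assumption that seeded and unseeded errors are found at the same rate; the function names, the example counts, and the three hand-written mutants are hypothetical and serve only to illustrate the idea.

```python
def estimate_unfound_errors(seeded_total, seeded_found, unseeded_found):
    """Estimate remaining unseeded errors, assuming equal detection rates."""
    if seeded_found == 0:
        raise ValueError("no seeded errors found yet; the rate is undefined")
    estimated_total = unseeded_found * seeded_total / seeded_found
    return estimated_total - unseeded_found


# Mutation analysis on a toy function: each mutant perturbs one statement.
def original_max(a, b):
    return a if a > b else b


def mutant_min(a, b):      # ">" mutated to "<"
    return a if a < b else b


def mutant_ge(a, b):       # ">" mutated to ">=" (behaves identically)
    return a if a >= b else b


def mutant_first(a, b):    # final "b" mutated to "a"
    return a if a > b else a


MUTANTS = (mutant_min, mutant_ge, mutant_first)


def mutation_score(original, mutants, test_inputs):
    """Fraction of mutants whose output differs from the original on some test."""
    killed = sum(
        1 for mutant in mutants
        if any(mutant(*args) != original(*args) for args in test_inputs)
    )
    return killed / len(mutants)


print(estimate_unfound_errors(seeded_total=20, seeded_found=15, unseeded_found=30))
print(mutation_score(original_max, MUTANTS, [(2, 1)]))
print(mutation_score(original_max, MUTANTS, [(2, 1), (1, 2)]))
```

Note that `mutant_ge` can never be detected because it computes the same result as the original for every input; such equivalent mutants place a ceiling on the achievable score.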
Within the Sperry Univac 1100/60[8], fault insertion capabilities are built into the system to verify 
the functionality of the fault detection, isolation, and recovery mechanisms. Fault insertion is activated 
during system idle time and can insert faults in the processor, memory, and I/O unit. These fault 
insertion capabilities are under operating system control and require no external hardware. 
4 FIAT 
FIAT, Fault Injection-based Automated Testing environment, is a prototype of an experimental test 
bed used in the exploration of validation processes for fault-tolerant systems. System validation requires 
the ability to monitor the system under test, the ability to control the system to induce faults and 
other operating conditions, and the ability to repeat tests to identify the source of system deficiencies. 
The requirement for experiment repeatability, as well as the need to attack a large fault space, are 
the reasons FIAT provides a test environment capable of automatically inducing faults and monitoring 
system behavior. 
The underlying methodology which guides the FIAT validation process is the following: 
• Specify an architecture through models, emulation, or prototypes. 
• Specify the system software architecture and workload. 
• Profile the fault-free behavior of the system to determine a nominal “operation point”. 
• Select a set of fault classes for fault insertion experiments. 
• Perform fault insertion experiments. 
• Analyze experimental data and use the results to support validation requirements and 
  compare fault-tolerance strategies. 
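The steps above can be sketched as a small experiment loop: profile a fault-free run, insert each fault of a chosen class, and classify the outcome. This is a toy model under stated assumptions, not FIAT's implementation: the checksum-protected workload, the bit-flip fault classes, the outcome labels, and all function names are hypothetical.

```python
from collections import Counter


def make_block(values):
    """Build a data block: payload bytes followed by a modulo-256 checksum."""
    return bytearray(values) + bytearray([sum(values) % 256])


def workload(block):
    """Return the largest payload value; raise if the checksum does not match."""
    payload, checksum = block[:-1], block[-1]
    if sum(payload) % 256 != checksum:
        raise ValueError("checksum mismatch")  # stands in for error detection
    return max(payload)


def run_once(block, fault=()):
    """Run the workload on a copy of the block with the given bit flips applied."""
    faulted = bytearray(block)
    for byte_index, bit_index in fault:
        faulted[byte_index] ^= 1 << bit_index
    try:
        return "completed", workload(faulted)
    except ValueError:
        return "detected", None


def campaign(block, faults):
    """Compare each faulted run against the fault-free result and tally outcomes."""
    _, golden = run_once(block)  # fault-free profile
    outcomes = Counter()
    for fault in faults:
        status, result = run_once(block, fault)
        if status == "detected":
            outcomes["detected"] += 1
        elif result == golden:
            outcomes["no effect"] += 1
        else:
            outcomes["undetected failure"] += 1
    return outcomes


block = make_block([10, 21, 32, 43])
single_bit = [((byte, bit),) for byte in range(len(block)) for bit in range(8)]
double_bit = [((0, 0), (1, 0)), ((2, 0), (3, 0))]  # pairs chosen to cancel in the checksum

print(campaign(block, single_bit))
print(campaign(block, double_bit))
```

In this toy setting every single-bit fault is detected, while the hand-picked double faults slip past the checksum; comparing outcome tallies across fault classes in this way is the kind of analysis the final step calls for.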
Given this desired validation methodology, the goal of the FIAT project is to develop an environment 
for automated software fault/error insertion, error detection, and recovery analysis which can be used to 
perform a more thorough investigation of software implemented fault insertion and how software fails. 
To carry out this methodology, FIAT provides an environment that supports the following
functions, which are described in greater detail in Section 5.
Architecture Development: The development methodology for an architecture can be divided into 
three sections: simulation, emulation, and actual implementation. The hardware and communica- 
tions structure of the FIAT system is general, so it may emulate a variety of architectures under 
evaluation. The ideal architecture for FIAT to emulate is a message-based, replicated structure, 
where messages are passed via the FIAT communication channels and the replicated structure is 
emulated by FIAT hardware and software tasks. This allows the user to design and evaluate a sys- 
tem without customized hardware or software, or a large initial effort. Examples of this flexibility 
can include systems such as a Tandem-like structure employing a primary and a secondary,
a Stratus structure employing duplicate and compare, and a SIFT-like system employing several
redundant processors and voting.
Moreover, the FIAT methodology is not limited to architecture emulation, as the FIAT environment 
can be transferred to the actual implementation. Benchmarks conducted in the emulation phase 
can be redone with the actual implementation to calibrate the system and further experiments 
developed to evaluate the implementation. 
Additionally, FIAT provides several tools which aid in the development of the workload and hence 
the experimental architecture. These tools, available through the experiment interface or directly 
to the user, include editors, compilers, and the ability to provide monitoring “hooks” into the
system.
Fault Insertion: Fault insertion must exercise the error detection and recovery mechanisms (EDRM) 
as well as develop quantitative metrics of the dependability properties of the system under test. 
Using software fault insertion, FIAT can insert faults or the manifestation of faults (i.e., errors) 
by either triggering error detection mechanisms or by seeding errors into memory. Software fault 
insertion was selected for the following reasons: 
o Systems to be validated have a substantial software component. Software fault insertion
  allows penetration into the software portion of the system as well as exploring the
  interaction of software with hardware.
o Software fault insertion is less expensive, in terms of time and effort, than hardware fault
  insertion.
o Software fault insertion is functionally complementary to hardware fault insertion and
  does not exclude it.
o There is a need for a testing methodology to validate software-implemented fault-tolerant
  strategies.
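[Figure 6 table: the columns are Hardware, Operating System, User Application, and System (OS and
Application). Recoverable entries include memory, registers, bus, and I/O structure; code, data, and
registers; and lost or delayed messages, corrupted message, delayed task, abnormal task termination,
timer corruption, and system clock corruption. Some entries are marked as error insertions.]
Figure 6: Faults and Errors Available in FIAT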
Software fault insertion has its limitations, mainly its inability to force low-level errors, such
as a gate output stuck-at. However, designers are interested in the behavior of the whole system
(hardware and software). Furthermore, a large amount of the hardware functionality is visible
through software.
Figure 6 is a table of the faults and errors available and currently under development for software
insertion in FIAT (unless specifically noted, all of the insertions are faults). The use of fault and
error in this figure and throughout this report follows the definitions proposed by Laprie[37]
and Siewiorek[56]. A fault is an erroneous state of hardware or software, whereas an error is the
manifestation of a fault. The duration of inserted faults ranges from transient to pseudo-permanent
(footnote 4).
Automation and Unity: The quality of fault insertion experimentation is a function of the capability 
of the system to  insert (test) as many faults as possible per unit of time and of the fidelity of 
the fault insertion method itself. Automation includes both experiment development time and 
experiment run time processes. 
4 Pseudo-permanent is achieved by repeating a transient fault over a long duration. True permanent faults are difficult
to emulate because software tends to overwrite storage locations.
To be effective, the various components of the system (e.g., workloads, fault classes, experiments,
and data analysis) must be integrated under one comprehensive environment, which supports the
process of preparation, debugging, run time control, and data analysis. Each of these components
is discussed in the following section.
The CMU hardware implementation of FIAT uses four IBM RT PCs connected via a 10 Mbit token
ring (Figure 7). The hardware provides the execution platform for FIAT, without limiting the generality
of the system. Hence a myriad of architectures can be emulated through software without any limitations
imposed by the hardware or communications structure. The implementation is not limited to the RT
and can be ported to other systems.
[Figure 7 diagram: the Fault Injection Manager (FIM) and the fault injection machines, each an
RT PC, connected by a 10 Mbit token ring.]
Figure 7: FIAT Hardware Structure
The FIAT software structure is more complex than the hardware structure and is divided into two
parts: the Fault Injection Manager (FIM) and the Fault Injection Receptor (FIRE). Figure 8(a) shows
the FIM software, which consists mostly of the experiment interface controlling all phases of the
experiment, from design to data analysis. Several intermediate forms are created in the process: the
user-defined workloads, the fault lists, and the experiment descriptions are created and compiled from
user specifications and FIAT libraries.
Figure 8(b) shows the software configuration on the FIREs. The core element is the FIRE Monitor and
Control (FMC), which interfaces the fault list and the experiment description to the workload. The FMC
also monitors the execution of the experiment, sending appropriate information to the data files. Further
details of the hardware and software are presented in Section 5 and Section 6.
As an example of the methodology's versatility, Figure 9 shows a possible emulation of the SIFT 
system using FIAT. The processors are emulated via software, with two SIFT processors running per 
FIRE machine. The scheduler along with SIFT tasks, such as voting, synchronization, sensor input, and 
applications, are run under the virtual SIFT processor. Communications channels emulate the point 
to point communications of SIFT with a dedicated channel per physical SIFT link. Using FIAT with
this application, several experiments may be run, including validation of the voting and interactive
consistency algorithms using fault insertion, measurement of task overhead, and others.
[Figure 8 diagrams: (a) the FIM software structure, showing the experiment interface, user inputs,
libraries, descriptions, and data files; (b) the FIRE software structure.]
Figure 8: FIAT Software Structure
[Figure 9 diagram: the SIFT workload running on FIAT, with point-to-point communications emulated
by communications channels.]
Figure 9: SIFT Emulations on FIAT
5 The FIAT Process and Its Abstractions 
Following the methodology defined in Sections 2 and 3, an experimental validation using FIAT may
progress through stages:
o Fault-free validation of a system: profile the system components’ performance characteristics
  and collect data to determine the system’s “operating point”.
o Fault-free validation of a workload: profile the performance/functionality characteristics of
  the workload and collect data.
o Fault insertion experimentation: insert faults into the profiled workload, then collect histories
  and error records.
To support the fault insertion environment, FIAT provides the abstractions presented in Figure 10,
each described in the following subsections. FIAT also provides the necessary tools to operate on the
abstractions during experiment development and runtime.
[Figure 10 table, as far as it can be recovered:
                   Workload             Fault Class        Experiments              Data Collection
                   Management                                                       and Analysis
                   Task Attributes,     Fault Class,       Experiment Definition    Histories, Error Records,
                   Task Image           Fault Instance     and Script               and Reports
  Operation        Edit, Generate       Edit, Generate     Edit, Generate           Instrumentation,
                   Task Image           Fault Instance     Experiment Script        Database
  Runtime Control  Monitor              Fault Insertion    Script Interpreter,      Data Collection,
                                                           Script Execution         Data Analysis]
5.1 Workload Management 
The workload of a computer is defined as the set of all inputs (programs, data, commands) the system 
receives from its environment. A workload can be classified as natural or synthetic. Natural workloads 
accomplish useful work, while synthetic workloads model natural workloads. The advantage of using a 
synthetic workload to profile a system has been demonstrated by previous CMU work[15,16,22]; hence
FIAT manipulates synthetic workloads (WL). In FIAT, a workload is an observable set of real-time
communicating tasks representing the user’s real-time programs.
A task is an observable (monitorable) unit of computation, defined by the user, within the workload,
which communicates through observable communication media named channels. Note that a workload
could run in one machine or be distributed among a number of computers. Figure 11 shows an example,
in C language source code, of the matrix multiplication task. The create_channel construct establishes
task(argc, argv)
int argc;
char **argv;
{
    create_channel("mat1");
    sensor(11);
    compute();
    sensor(12);
}
compute()
{
    for (i=0; i<3; i++) {
        for (j=0; j<3; j++) {
            for (k=0; k<3; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
}
Figure 11: Typical Task Definition
a communication medium which allows other tasks to communicate with the task (footnote 5). The sensor
construct is a user-defined means of observing the execution of the task through executive function calls
embedded in the workload tasks. Its two basic functions are to provide a mechanism to monitor the
execution of the workload tasks and to synchronize communicating tasks.
For symbolic fault insertion, a number of task attributes must be extracted from the workload. The
term attribute refers to the set of symbolic names identifying the tasks, the code, and the data segments
within each task. This analysis is automatically performed by a tool known as the attribute extractor.
The attribute extractor, after analyzing the workload, provides data constructs (e.g., task tables) known
as domains. These domains are used for the automatic generation of fault lists. Domains include task
and function names, variable names, and locations of code and data within the object file(s).
As depicted in Figure 12, each task is linked with four program attachments which respond to external 
requests. The workload monitor provides observability of execution through the sensors, the fault injector 
performs the fault insertion, the error detection and reporting attachment provides the data  collection 
capability, and the fault-tolerant architecture is a generic capability for implementing software-based 
recovery mechanisms. Figure 13 illustrates the process of attribute extraction and the linking of the 
attachments in the generation of a task. 
5 Text in a typewriter font refers to program constructs or FIAT commands.
[Figure 12 diagram: the user's program (task code) surrounded by four attachments: workload monitor,
fault injector, error detection/reporting, and fault-tolerant architecture.]
Figure 12: Typical FIAT Task Modified by Attachments
[Figure 13 diagram: the attribute extractor and the compile-and-link step that together produce a task.]
Figure 13: Task Generation
5.2 Fault Classes 
A fault class is a template which describes a set of workloads or system modifications representing a
group of physical or logical faults having common properties. This template is similar to an abstract
data type. Figure 14 shows the fault class format, where Field-i is the identifier of Domain-i
(footnote 6) and Method-i is a user-defined procedure for selecting items from the domain (Domain-i)
to be used in fault insertion.
As an example, Figure 15 shows a memory fault class. The line Mechanism: fil shows the fault
insertion method used. In this example, fil is the fault insertion mechanism which inserts faults into
the memory image of the task. Compute: is an identifier given to the first fault insertion of the class;
compute.attributes refers to the domain extracted from the workload, and select-all-by-one refers
to the function used to create the fault list from the domain. This function applies the fault mechanism
to all members of the domain. The creation of the fault list is controlled by the user after the workload
is developed, domains extracted, and fault insertion method(s) defined.
# 
# Explanatory / Identifying Comments 
# 
Mechanism: Fault-Injection-Mechanism-Id 
Field-1: Domain-1 Method-1 
Field-2: Domain-2 Method-2 
Field-n: Domain-n Method-n 
Figure 14: Fault Class Format 
# This is an illustrative example of a Fault Class Definition.
# Comments may be included to describe the Fault Class.
# Generalized Domain Names and Method Names are used.
Mechanism: fil
Compute: compute.attributes select-all-by-one
Task: task.attributes select-all-by-one
Figure 15: Memory Fault Class Example 
As in any abstract type, the fault class can have several instantiations, meaning that each method
(e.g., select-all-by-one) can be applied to a different domain and thus generate a specific fault (fault
instance). This process is done automatically in FIAT by a tool named the Fault Instance Generator.
Figure 16 depicts the fault instance generation process while Figure 17 shows a list of fault instances.
6 The domain is provided by the task attribute extractor, discussed in Section 5.1.
[Figure 16 diagram: the fault class definition feeds the Fault Instance Generator, which produces the
fault list. Fault types shown: memory faults, register faults, communication faults, and error detection
mechanism triggering.]
Figure 16: Fault Instance Generation Process
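The instantiation step can be sketched in C. This is a minimal illustration under stated assumptions, not the Fault Instance Generator itself: the struct layout, the function signature, and the idea of stepping through an element one word at a time are all hypothetical, chosen only to show how a method such as select-all-by-one could turn one domain member into one fault instance per target word.

```c
#include <stddef.h>

/* Hypothetical record for one fault instance. The field names are
 * illustrative and are not taken from the FIAT sources. */
struct fault_instance {
    const char *mech;     /* fault insertion mechanism, e.g. "fil" */
    const char *task;     /* task name within the workload */
    const char *element;  /* domain member to be corrupted */
    size_t size;          /* number of bytes corrupted per insertion */
    size_t position;      /* byte offset of the target inside the element */
};

/* Assumed behavior of a select-all-by-one style method: every whole word
 * of the element becomes a target, and each generated instance corrupts
 * exactly one of them. Returns the number of instances written, which is
 * never more than max_out. */
size_t select_all_by_one(const char *mech, const char *task,
                         const char *element, size_t element_bytes,
                         size_t word_bytes,
                         struct fault_instance *out, size_t max_out)
{
    size_t n = 0;
    size_t pos;

    if (word_bytes == 0)
        return 0;
    for (pos = 0; pos + word_bytes <= element_bytes && n < max_out;
         pos += word_bytes) {
        out[n].mech = mech;
        out[n].task = task;
        out[n].element = element;
        out[n].size = word_bytes;
        out[n].position = pos;
        n++;
    }
    return n;
}
```

Called with a 20-byte element and a 4-byte word, the sketch yields five instances at byte offsets 0 through 16.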
Mech  Fire  Sensor  Task  Element  Size  Position  Mask          Behavior
fil   1     11      mat1  compute  4     0         \c0\00\00\00  xor
fil   1     11      mat1  compute  4     4         \c0\00\00\00  xor
fil   1     11      mat1  compute  4     8         \c0\00\00\00  xor
fil   1     11      mat1  compute  4     12        \c0\00\00\00  xor
fil   1     11      mat1  compute  4     16        \c0\00\00\00  xor
Figure 17: Typical Fault List Example 
The columns of the fault instances shown in Figure 17 are as follows: Mech is the fault insertion
mechanism defined in the Fault Class; Fire is the target hardware described in Section 6.1; Sensor is
the task observability defined at task generation time (Figure 31); Task is the name of the task within
the workload; Element designates the target domain for the fault insertion; Size and Position define
the size of the faulted word and the location for fault insertion within the element; and Mask is the
hexadecimal word(s) used to corrupt the target word with the function named in the Behavior column.
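How a mask and a behavior combine with the target word can be sketched as follows. This is an illustration only: the function name, its signature, and the enum are assumptions, and the two behaviors shown are the ones that appear in this report's figures (xor and and). It is not the fil mechanism itself.

```c
#include <stddef.h>

/* Behaviors that appear in the fault lists and masks of this report. */
enum behavior { FAULT_XOR, FAULT_AND };

/* Corrupt `size` bytes of `element`, starting at byte offset `position`,
 * by combining each target byte with the matching mask byte. Returns 0 on
 * success and -1 if the target would fall outside the element. */
int apply_fault(unsigned char *element, size_t element_bytes,
                size_t position, size_t size,
                const unsigned char *mask, enum behavior how)
{
    size_t i;

    if (position > element_bytes || size > element_bytes - position)
        return -1;
    for (i = 0; i < size; i++) {
        unsigned char *target = &element[position + i];
        if (how == FAULT_XOR)
            *target ^= mask[i];   /* flip the bits selected by the mask */
        else
            *target &= mask[i];   /* clear the bits the mask leaves at 0 */
    }
    return 0;
}
```

With the mask \c0\00\00\00 and the xor behavior, the two high-order bits of the first byte of the target word are flipped and the remaining three bytes are left unchanged.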
5.3 Experiments 
Experiments are defined by experiment descriptions and executed by experiment scripts. The experiment
description is a high-level description of experiment flow, which includes workload management,
fault insertion, and data collection commands. Figure 18 shows the general format of such a description,
while Figure 19 shows an example of an experiment description. Items in the experiment description
include: experiment name, database name, workload and fault class, duration of experiment, and the
number of runs in the experiment.
EXPERIMENT <experiment-id>
DATABASE <database-id>
INITFCD <fcd-id>
LOOP <iteration-cnt>
    WORKLOAD <workload-id>, <run-time>
    FAULT <fcd-id>, <number-of-faults>
    COLLECT
END
Figure 18: Experiment Description Format 
/***
    Experiment Description for Experiment 2
***/
experiment 2
database matmultiplication.db      /* Define database */
initfcd direct                     /* Initialize fault class */
loop 100                           /* 100 Iterations */
    workload matdemo.wl, 30        /* Use matdemo workload, 30 seconds each */
    fault direct, 1                /* Use direct fault class, one fault per run */
    collect                        /* Collect data after each experiment */
end
Figure 19: Experiment Definition Example 
The experiment description is automatically transformed by the Experiment Description Translator
(EDT) into an experiment script. The experiment script contains the low-level command sequence for
controlling the run-time fault insertion. Figure 20 shows a single run of an experiment script generated
from the experiment description in Figure 19. Description of the script is given by in-line comments.
#
# Experiment Script for Matrix with Fault Insertion
#
#
# Workload is matdemo
# Transfer Workload task images to the FIRE machines
#
xfer to fire1 matprm1.ti
xfer to fire2 matprm2.ti
xfer to fire1 mat1.ti
xfer to fire2 mat2.ti
xfer to fire1 compare.ti
#
# Title Experiment
# SYNTAX : fmc MACHINE echo BEGIN run-id
# Typical Command sequence begins here
fmc fire1 echo BEGIN 1
fmc fire2 echo BEGIN 1
#
# Send command to FMC (Fire Monitor and Control)
# to delay fault insertion until sensor 1 occurs
# then fault inject matrix a in task Mat1.
# See Figure 14.
# SYNTAX : fmc MACHINE delay SENSOR fil TASKID BEHAV ELEM POS SIZE MASK
#
fmc fire1 delay 26 fil mat1 xor compute 0 4 \c0\00\00\00
#
# Create Workload on FIRE
# SYNTAX : fmc MACHINE wcreate task-id image-id
#
fmc fire1 wcreate PRM matprm1
fmc fire2 wcreate PRM matprm2
#
# Wait for experiment time out
#
pause 30
#
# Kill Workload
#
fmc fire1 wkill
fmc fire2 wkill
#
# Collect Data
# SYNTAX : fmc MACHINE collect RUN-ID DATAFILE [overwrite]
#
fmc fire1 collect 1 xf-1-2-1.dat overwrite
fmc fire2 collect 1 xf-1-2-2.dat overwrite
#
# Transfer Data to FIM
#
xfer from fire1 xf-1-2-1.dat
xfer from fire2 xf-1-2-2.dat
#
fmc fire1 echo BEGIN 2
Figure 20: Experiment Script Example for One Run
5.4 Data Collection and Analysis 
There are two forms of data collection (DC): histories and error reports. Histories are records of normal 
functions and performance events. Error reports are records of exceptions and abnormal events. Data 
analysis supports three hierarchical levels of experimental data  processing: run, experiment, and multiple 
experiments. A run is a single fault insertion in a specific workload. An experiment is a collection of 
runs, usually from the same fault class. Multiple experiment analysis refers to the ability to correlate a 
number of experiments under a specific analysis. 
In FIAT, two types of data analysis are available: generic and open. The generic data analysis provides
a set of predefined functions, such as workload profiling and error coverage statistics. The open data
analysis is a relational database query language which enables users to define their own analyses.
Examples of the data collection/analysis functionality will be presented in Section 7. 
6 Implementation 
Since the FIAT system is geared toward fault inserting real-time distributed systems, its physical imple- 
mentation is a real-time distributed system. 
6.1 Hardware 
Figure 21 presents the hardware structure of the current FIAT system, as implemented at CMU,
illustrating the distribution of the abstractions between the two components: the Fault Injection Manager
(FIM) and the Fault Injection Receptors (FIRE). The two parts communicate via a local area network
(LAN). The purpose of the FIM is twofold: first, it supports experiment development and data
collection/analysis; second, it provides run time control of the experiment. The FIREs provide the
execution platform for the distributed system under test. Further descriptions of the FIM and FIRE are
in the following section on software.
[Figure 21 diagram: the FIM (Fault Injection Manager), responsible for experiment planning, control,
and data reduction, connected through the LAN to three FIREs (Fault Injection Receptors), each
responsible for workload monitoring.]
Figure 21: FIAT Hardware Architecture
6.2 Software 
The FIAT software is similarly divided into two parts: FIM and FIRE as described in the following two 
subsections. 
6.2.1 FIM Software 
Figure 22 shows a block diagram of the FIM software. The names of the components of the FIM software
are representative of their functionality in the process of generating workloads, fault lists, and experiment
scripts, and their control at run time. The Workload Manager, Fault Manager, Experiment Description
Manager, Data Analysis Manager, and Data Collection Manager support the respective abstractions at
development time, as outlined in the previous sections. The Experiment Manager is responsible for the
run time communication with the FIRE machines. The relational database and its software are central
to the data storage and analysis functions.
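[Figure 22 block diagram: user interaction, the user's programs, fault classes, and experiment
descriptions feed the FIM managers, including the Fault Manager, Experiment Manager, Data Analysis
Manager, and database software. The FIM connects to the network and produces availability measures
(i.e., coverages).]
Figure 22: FIM Software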
The Experiment Interface, depicted in Figure 22, is responsible for the integration of the components
into an interactive, purposeful experimentation system. The Menu Manager, in the experiment interface,
controls the menus for four modes. Figure 23 shows the menu tree. The four modes involved in the
experimentation process are:
o Library Preparation assists preparation of the workload and fault class libraries. The creation
  of the workload library involves two processes: the task development process and the workload
  definition process. The task development process includes task definition (program creation
  and editing), task image generation (compilation and linking), and attribute extraction (for
  use in the fault insertion process). The fault class librarian groups the domains and methods
  used for selecting the fault insertion method during experiment definition.
o Experiment Definition assists in experiment description preparation and experiment script
  generation through the use of a text editor in creating the description and automatic
  generation of experiment scripts.
o Experiment Execution provides a list of prepared experiment scripts and executes the selected
  script.
o Data Analysis provides two types of data analyses: user-defined queries using the Informix
  Structured Query Language (ISQL) and generic FIAT data analysis routines.
[Figure 23 menu tree: the Experiment Interface branches into Library Preparation, Experiment
Definition, Experiment Execution, and Data Analysis. Library Preparation leads to the Workload
Librarian and the Fault Librarian; Experiment Definition to the experiment description and experiment
script; Data Analysis to analyzing and querying the experiment database, with single and multiple
experiment analysis. Lower levels cover task, domain, method, and fault class development.]
Figure 23: Experiment Interface Menu Tree
6.2.2 FIRE Software 
The core of the FIRE software is the FIRE Monitor and Control (FMC) which is centered around 
two components depicted in Figure 24: the user defined workload and the modified operating system 
(OS) kernel. The FMC acts as an executive intermediary between the workload tasks and the operating 
system providing the monitor and control functions needed for experimentation. The Modified OS Kernel 
allows the monitoring and fault insertion of the workload tasks. Figure 25 shows a more detailed view 
of the FIRE software. 
There are four modules in the FMC: the Command Controller, the Workload Monitor Controller, the 
Fault Dispatcher and the Data Collection modules. Their functions are described below. 
o Command Controller (CC): Stores and organizes commands from the Experiment Manager
  into a command queue, and provides synchronization mechanisms to coordinate the commands,
  such as the starting of tasks and fault insertion within the workload control flow.
  It also manages the communication functions required for command transfer and execution.
  Fault insertion commands are relayed to the Fault Dispatcher for follow-on execution.
o Workload Monitor Controller (WMC): Provides the functionality required for workload monitoring
  and control, such as task control, communication, and synchronization. Task control
  includes such commands as task creation, task termination, task suspension, and task resumption.
  Task communication commands include channel creation, send to channel, receive from
  channel, and select channel. Synchronization commands include event signaling and handling.
o Fault Dispatcher (FD): Manages fault insertion of workload tasks, communication, and the
  Operating System. The fault insertion commands are transferred to the appropriate entity
  (e.g., task, OS), where the actual fault insertion is performed. Workload faults include
  communication and memory faults. OS faults include memory and register faults as well as
  the triggering of existing hardware and/or software error detection mechanisms.
o Data Collection (DC): Transfers data records generated by the monitor and error report
  commands, which are user specified, to a local data storage object.
[Figure 24 diagram: the user workload, the Monitor and Controller, and the modified OS on a FIRE,
with a link to the FIM.]
Figure 24: FIRE Software
[Figure 25 diagram: a typical task of the workload, the FMC modules (Command Controller, Workload
Monitor and Controller, Fault Dispatcher, and Data Collection), and the fault-injection-enhanced
kernel with error detection mechanism triggering and reporting, connected to the FIM.]
Figure 25: Detailed view of FIRE Software
6.3 Experimental Procedure 
The experimental procedure ties the various parts of the FIAT system together into a unified environment 
capable of generating workloads, instrumenting tasks, performing fault-free characterization, designing 
fault lists, executing fault insertion experiments, and analyzing collected data. Figure 26 describes this
process, showing the library preparation of the workload (Section 5.1), the preparation of the fault
library (Section 5.2), the experiment definition (Section 5.3), the experiment execution, and the data
analysis (Section 5.4).
[Figure 26 flow diagram with four stages: library preparation (Workload Manager and Fault Manager),
producing the workload and fault libraries; experiment definition (Experiment Description Manager),
producing the workload, fault lists, and script; experiment execution on the FIM/FIRE (Experiment
Manager and Data Collection Manager), producing data; and data analysis.]
Figure 26: Typical Fault Insertion Experiment
7 Evaluation 
The FIAT system has been implemented and evaluated since June 1987. The initial results indicate that 
the system fulfills the stated goals. As evidence of this fulfillment, two experiments performed on the 
FIAT system are presented in the following subsections. 
7.1 Experiment 1 
A distributed checkpointing system, utilizing two FIRE machines, was implemented. Figure 27 shows
the functional block diagram of the fault tolerance strategy involved. There are two computational
engines: the primary and the secondary. The primary, at the start of its computation, informs the
secondary of the task, as well as the time frame for the next interaction. The primary then executes the
computation and the secondary waits for the next interaction. If the time between interactions exceeds
the time frame (i.e., primary failure), the secondary then initiates a recovery action and becomes the
primary. If the primary detects that no secondary exists (i.e., secondary failure), it creates a secondary.
The experimental results of fault inserting this system are shown in Figure 28. The fault class was
a bus fault consisting of a double bit compensating error. These double errors would not be detected
by hardware parity and thus must wait for software error detection. Note the large error latency and
that one quarter of the inserted faults were not detected. To illustrate how fault-tolerance strategies
may be compared, the above fault-tolerance technique was enhanced by adding checksumming to each
computational engine. A checksum is appended to code and data blocks during compilation. The
code is regenerated when reading the blocks from memory during runtime and is compared with the
predetermined code to detect errors. Hence the detection methods are timeout for the first experiment
versus checksums for the second. The improved results are shown in Figure 29. Note the increased fault
detection probability and the decreased time to detect.
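The difference between the two detection methods can be sketched in C. The report does not state which checksum code was used, so the simple additive checksum below is an assumption made only for illustration, and all function names are hypothetical. The sketch shows why a double bit error within one byte leaves parity unchanged while a regenerated checksum no longer matches the stored one.

```c
#include <stddef.h>
#include <stdint.h>

/* Parity of one byte: 1 if an odd number of bits are set, otherwise 0. */
int byte_parity(uint8_t b)
{
    int p = 0;

    while (b) {
        p ^= (b & 1);
        b >>= 1;
    }
    return p;
}

/* Additive checksum over a block. This scheme is an assumption; the
 * report only says that a checksum is appended to each block. */
uint32_t block_checksum(const uint8_t *block, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < len; i++)
        sum += block[i];
    return sum;
}

/* Regenerate the checksum at run time and compare it with the stored
 * value. Returns 1 if an error is detected, 0 otherwise. */
int checksum_detects_error(const uint8_t *block, size_t len, uint32_t stored)
{
    return block_checksum(block, len) != stored;
}
```

Flipping two bits of a single byte, as the mask \c0 does, preserves the byte's parity but always changes its value, so this checksum changes as well. An additive checksum is not exhaustive: changes in different bytes that cancel each other would still go undetected.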
[Figure 27 diagram: the primary on Machine A and the secondary on Machine B; the recoverable labels
include a Receive step.]
Figure 27: Real Time Checkpointing Workload
Experiment = 1a
Detection Statistics ...
Number of Fault Injections ... 206
Number of Fault Detections ... 153
Detection Coverage = 74.271845%
Avg Detection Time = 4.219567 seconds
Min Detection Time = 1.406250 seconds
Max Detection Time = 6.218750 seconds
Figure 28: Experiment 1: Error Detection Statistics (Checkpointing Only) 
Experiment = 1b
Detection Statistics ...
Number of Fault Injections ... 91
Number of Fault Detections ... 86
Detection Coverage = 94.505495%
Avg Detection Time = 3.102834 seconds
Min Detection Time = 1.203125 seconds
Max Detection Time = 4.203125 seconds
END OF METHOD: Detection Statistics
Figure 29: Experiment 1: Error Detection Statistics (Checkpointing with Checksums) 
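The detection coverage reported in Figures 28 and 29 is the ratio of detected to injected faults, expressed as a percentage. A minimal sketch, with a hypothetical function name and an assumed convention of returning zero when nothing was injected:

```c
/* Detection coverage as a percentage of injected faults. Returning 0.0
 * for an empty experiment is a convention chosen for this sketch. */
double detection_coverage(unsigned long detected, unsigned long injected)
{
    if (injected == 0)
        return 0.0;
    return 100.0 * (double)detected / (double)injected;
}
```

For Experiment 1a the inputs are 153 detections out of 206 injections, and for Experiment 1b they are 86 out of 91; both reproduce the coverage values printed in the figures.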
7.2 Experiment 2 
A software duplicate-and-match fault-tolerant strategy was implemented. The outputs of two identical
computational engines, located in different FIREs and performing the same computation (matrix
multiplication), are compared on a third computational engine. Experimentation consisted of fault
inserting one of the identical engines. The experimental results of fault inserting with two different
fault classes, a bus fault and a memory fault, are shown in Figure 30 and Figure 31. Note that while the
error detection coverage varies between experiments (63.0% vs. 54.5%), there is an almost identical
distribution of fault manifestations (e.g., Stop, Invalid Output, Response Too Late, Crash, and Hung).
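The duplicate-and-match idea can be sketched in a few lines of C. This is a single-process illustration under stated assumptions: in the experiment the two engines run on different FIREs and send their outputs to a third engine over channels, whereas here the replicas are plain function calls and the names matmul and outputs_match are hypothetical.

```c
#include <stddef.h>

#define MAT_N 3

/* One computational engine: c = a * b for MAT_N x MAT_N matrices. */
void matmul(int a[MAT_N][MAT_N], int b[MAT_N][MAT_N], int c[MAT_N][MAT_N])
{
    size_t i, j, k;

    for (i = 0; i < MAT_N; i++) {
        for (j = 0; j < MAT_N; j++) {
            c[i][j] = 0;
            for (k = 0; k < MAT_N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    }
}

/* The comparing engine: returns 1 if the two outputs agree element by
 * element, and 0 on the first mismatch (reported as an invalid output). */
int outputs_match(int x[MAT_N][MAT_N], int y[MAT_N][MAT_N])
{
    size_t i, j;

    for (i = 0; i < MAT_N; i++)
        for (j = 0; j < MAT_N; j++)
            if (x[i][j] != y[i][j])
                return 0;
    return 1;
}
```

A comparison of this kind only catches faults that corrupt the output of one engine. The other manifestations listed above (a stopped task, a late response, a crash, or a hung processor) need separate detection, such as a timeout.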
Netmat3 Experiment Set 1 - Two-bit Compensating Errors
440 Faults Injected Per Experiment: 8800 inserted faults

Exp ID   Faults Detected   % Detected   Mask: XOR
 1       287               65.2         \c0\00\00\00
 2       274               62.3         \08\02\00\00
 3       270               61.4         \00\44\00\00
 4       289               65.7         \00\01\01\00
 5       255               58.0         \00\00\00\a0
 6       279               63.4         \10\00\00\01
 7       239               54.3         \0a\00\00\00
 8       250               56.8         \00\0a\00\00
 9       242               55.0         \00\00\0a\00
10       279               63.4         \00\00\80\02
11       279               63.4         \00\02\00\40
12       298               67.7         \00\08\00\04
13       289               65.7         \20\00\40\00
14       297               67.5         \01\00\00\08
15       305               69.3         \01\00\00\40
16       297               67.5         \00\00\03\00
17       259               58.9         \04\00\80\00
18       277               63.0         \02\00\00\04
19       302               68.6         \00\02\00\01
20       278               63.2         \00\20\00\20

Mean: 277.25 detections (63.0% coverage)
Standard Deviation: 19.62 (7.08% of the mean)

Error Detection Mechanisms
Mechanism                    # Detected   % of Total
Abnormal Task Death (STOP)   3124         56.3
Invalid Output               1460         26.3
Response Too Late             879         15.9
Machine Reboot (CRASH)         80          1.4
Processor Halt (HUNG)           2          0.0 *
Figure 30: Error Detection Statistics: Two Bit Compensating Errors 
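As a cross-check on the summary lines of Figure 30, the following sketch recomputes them from the per-experiment counts. The variable names are illustrative. The reported standard deviation is reproduced by the sample form (divisor n - 1), which is what `statistics.stdev` computes; the population form would give a smaller value.

```python
from statistics import mean, stdev

FAULTS_PER_EXPERIMENT = 440

# Faults detected in experiments 1-20, copied from Figure 30.
detected = [287, 274, 270, 289, 255, 279, 239, 250, 242, 279,
            279, 298, 289, 297, 305, 297, 259, 277, 302, 278]

avg = mean(detected)
sd = stdev(detected)  # sample standard deviation (n - 1 divisor)
coverage = 100 * sum(detected) / (FAULTS_PER_EXPERIMENT * len(detected))

print(f"mean={avg:.2f} sd={sd:.2f} "
      f"({100 * sd / avg:.2f}% of mean) coverage={coverage:.1f}%")

# Detections by mechanism, also copied from Figure 30.
mechanisms = {
    "Abnormal Task Death (STOP)": 3124,
    "Invalid Output": 1460,
    "Response Too Late": 879,
    "Machine Reboot (CRASH)": 80,
    "Processor Halt (HUNG)": 2,
}
total = sum(mechanisms.values())  # equals the total number of detections
shares = {name: round(100 * n / total, 1) for name, n in mechanisms.items()}
for name, share in shares.items():
    print(f"{name:28s} {share:5.1f}")
```

The mechanism counts sum to 5545, the same total as the per-experiment detections, so the "% of Total" column is a share of detected faults rather than of all 8800 inserted faults.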
Netmat3 Experiment Set 2 - 'Zero-a-byte' Errors
440 Faults Injected per Experiment: 8800 inserted faults

Exp ID   Faults Detected   % Detected   Mask: AND
   1          258             58.6      \00\ff\ff\ff
   2          229             52.0      \ff\00\ff\ff
   3          240             54.5      \ff\ff\00\ff
   4          217             49.3      \ff\ff\ff\00
   5          239             54.3      \f0\0f\ff\ff
   6          260             59.1      \ff\f0\0f\ff
   7          227             51.6      \ff\ff\f0\0f
   8          244             55.5      \c0\3f\ff\ff
   9          236             53.6      \ff\c0\3f\ff
  10          239             54.3      \ff\ff\c0\3f
  11          258             58.6      \80\7f\ff\ff
  12          213             48.4      \ff\80\7f\ff
  13          243             55.2      \ff\ff\80\7f
  14          245             55.7      \e0\1f\ff\ff
  15          235             53.4      \ff\e0\1f\ff
  16          235             53.4      \ff\ff\e0\1f
  17          249             56.6      \ff\ff\f9\01
  18          219             49.8      \ff\ff\fc\03
  19          260             59.1      \ff\f8\07\ff
  20          249             56.6      \ff\fe\01\ff

Mean: 239.75 detections (54.5% coverage)
Standard Deviation: 13.98 (5.83% of the mean)

Error Detection Mechanisms
Mechanism                      # Detected   % of Total
Abnormal Task Death (STOP)        2748         57.3
Invalid Output                    1249         26.0
Response Too Late                  753         15.7
Machine Reboot (CRASH)              45          0.9
Processor Halt (HUNG)                0          0.0
Figure 31: Error Detection Statistics: Zero-a-byte Errors 
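The masks listed in Figures 30 and 31 can be read as four-byte patterns combined with a target word by XOR (flipping the bits set in the mask) or by AND (clearing the bits that are zero in the mask). The sketch below shows that reading. It is an illustration only: the helper names and the sample word are hypothetical, and the byte order and the exact item the mask is applied to (a memory word or a bus transfer) are assumptions not stated in this section.

```python
def parse_mask(text):
    """Turn a backslash-separated hex mask into raw bytes."""
    return bytes(int(part, 16) for part in text.split("\\") if part)


def apply_mask(word, mask, op):
    """Combine a word with a mask byte by byte using XOR or AND."""
    if op == "XOR":
        return bytes(w ^ m for w, m in zip(word, mask))
    if op == "AND":
        return bytes(w & m for w, m in zip(word, mask))
    raise ValueError(f"unsupported mask operation: {op}")


def bits_set(data):
    """Count the 1 bits in a byte string."""
    return sum(bin(b).count("1") for b in data)


xor_mask = parse_mask(r"\c0\00\00\00")  # Figure 30, experiment 1
and_mask = parse_mask(r"\f0\0f\ff\ff")  # Figure 31, experiment 5

word = bytes.fromhex("deadbeef")  # arbitrary sample value
flipped = apply_mask(word, xor_mask, "XOR")
cleared = apply_mask(word, and_mask, "AND")

print(flipped.hex(), bits_set(xor_mask))        # prints "1eadbeef 2"
print(cleared.hex(), 32 - bits_set(and_mask))   # prints "d00dbeef 8"
```

Under this reading, each XOR mask in Figure 30 flips exactly two bits, and a mask such as `\f0\0f\ff\ff` clears eight adjacent bits that straddle a byte boundary.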
8 Summary 
The goals, concepts, design, implementation, and evaluation of a Fault Injection-based Automated Testing
environment have been presented. The initial evaluation has shown two things. First, the initial
goals have been met. Second, the evaluation process, applied to two experiments, has shown how
measurements can give hints at improving the architecture under evaluation. Currently, the main thrusts of
the project are in the following areas:
• Reducing the complexity of fault insertion on all levels.
• Correlating the fault insertion manifestations to realistic software errors.
• Providing absolute measures of real-time distributed system dependability.
• Applying FIAT in the design process of realistic systems.
• Refining the process of predeployment validation using FIAT.
To the authors, a tool is as relevant as its applicability to either acquiring experimental evidence for 
a new theory or applying it to solve real life problems. The FIAT system is a step toward fulfilling these 
desiderata. The example fault-tolerant strategies evaluated in FIAT provided initial proof of the system's
usefulness in both areas.
Report Documentation Page
1. Report No.
   NASA CR-4244
2. Government Accession No.
3. Recipient's Catalog No.
5. Report Date
   July 1989
6. Performing Organization Code
7. Author(s)
   Edward W. Czeck, Daniel P. Siewiorek, and Zary Z. Segall
8. Performing Organization Report No.
9. Performing Organization Name and Address
   Carnegie-Mellon University
   Electrical and Computer Engineering Department
   Schenley Park
   Pittsburgh, PA 15213
10. Work Unit No.
   505-66-21-01
11. Contract or Grant No.
   NAG1-190
12. Sponsoring Agency Name and Address
   National Aeronautics and Space Administration
   Langley Research Center
   Hampton, VA 23665-5225
13. Type of Report and Period Covered
   Contractor Report
   11/87-11/88
14. Sponsoring Agency Code
15. Supplementary Notes
   Langley Technical Monitor: Peter A. Padilla
   Final Report
17. Key Words (Suggested by Author(s))
   Validation
   Fault Injection
   Fault Tolerance
   Software-Implemented Fault Insertion
   Distributed Systems
18. Distribution Statement
   Unclassified - Unlimited
   Star Category 62
19. Security Classif. (of this report)
   Unclassified
20. Security Classif. (of this page)
   Unclassified
21. No. of pages
   48
22. Price
   A03
