Benchmarking of Fault-Tolerant Systems by Tsai, Timothy K.
UILU-ENG-97-2211 
CRHC-97-07, May 1997
Benchmarking of Fault-Tolerant Systems 
Timothy Tsai
Coordinated Science Laboratory 
1308 West Main Street, Urbana, IL 61801
SF 298 MASTER COPY KEEP THIS COPY FOR REPRODUCTION PURPOSES
REPORT DOCUMENTATION PAGE
Form Approved 
OMB NO. 0704-0188
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE
April 29, 1997
3. REPORT TYPE AND DATES COVERED
4. TITLE AND SUBTITLE
Benchmarking of Fault-Tolerant Systems
5. FUNDING NUMBERS
DABT63-94-C-0045
6. AUTHOR(S)
Timothy Tsai
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
Coordinated Science Laboratory
University of Illinois
1308 W. Main St.
Urbana, IL 61801
8. PERFORMING ORGANIZATION REPORT NUMBER
CRHC-97-07 UILU-ENG-97-2211
9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES)
DARPA/Dept. of the Army
Directorate of Contracting
ATTN: ATZS-DKO-I, Ms. Marilyn Carney C-19
PO Box 748
Ft. Huachuca, AZ 85613-0748
10. SPONSORING / MONITORING AGENCY REPORT NUMBER
11. SUPPLEMENTARY NOTES
The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as 
an official Department of the Army position, policy or decision, unless so designated by other documentation.
12a. DISTRIBUTION / AVAILABILITY STATEMENT 12b. DISTRIBUTION CODE
Approved for public release; distribution unlimited.
13. ABSTRACT (Maximum 200 words)
This thesis presents a benchmark for evaluating fault tolerance. The benchmark is based
on the FTAPE tool, which injects CPU, memory, and disk faults, and generates workloads
with specific amounts of CPU, memory, and disk activity. Two benchmark metrics are
produced: 1) a count of catastrophic incidents and 2) the average performance degradation.
The catastrophic incident count represents the recovery coverage of the system; the
performance degradation reflects the performance of the system in the presence of faults.
The benchmark is fully functional and has been implemented on three Tandem prototype
fault-tolerant machines; the benchmarking results are given in this thesis. Fault injection
plays an important role in the benchmark, because it is the means by which fault-tolerant
activity is generated. Two focused fault injection strategies are presented: stress-based
and path-based.
14. SUBJECT TERMS
fault tolerance, benchmark, fault injection, performance degradation, stress-based, path-based
15. NUMBER OF PAGES
127
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED
18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED
20. LIMITATION OF ABSTRACT: UL
NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18
BENCHMARKING OF FAULT-TOLERANT SYSTEMS
BY
TIMOTHY K. TSAI
B.S., Brigham Young University, 1990 
M.S., University of Illinois, 1994
THESIS
Submitted in partial fulfillment of the requirements 
for the degree of Doctor of Philosophy in Electrical Engineering 
in the Graduate College of the 
University of Illinois at Urbana-Champaign, 1996
Urbana, Illinois
©  Copyright by Timothy K. Tsai, 1996
BENCHMARKING OF FAULT-TOLERANT SYSTEMS 
Timothy K. Tsai, Ph.D.
Department of Electrical and Computer Engineering 
University of Illinois at Urbana-Champaign, 1996 
Ravishankar K. Iyer, Advisor
This thesis presents a benchmark for evaluating fault tolerance. The benchmark is based
on the FTAPE tool, which injects CPU, memory, and disk faults and generates workloads
with specifiable amounts of CPU, memory, and disk activity. Two benchmark metrics are
produced: (1) a count of the number of catastrophic incidents and (2) the average performance
degradation. The catastrophic incident count represents the recovery coverage of the
system, while the performance degradation reflects the performance of the system in the
presence of faults.
The benchmark is fully functional and has been implemented on three Tandem fault-tolerant
machines (Prototypes A, B, and C). The benchmark results show that Prototypes
B and C are more fault-tolerant than Prototype A, in that they suffer fewer catastrophic
incidents under the same workload conditions and fault injection method. Also, Prototype C
suffers less performance degradation in the presence of faults, which might be an important
concern for time-critical applications.
Fault injection plays an important part in the benchmark because it is the means by
which fault-tolerant activity is generated. In order to ensure a high level of fault activation
and error propagation, focused fault injection strategies are used. Two such strategies are
presented in this thesis: stress-based injection and path-based injection.
ACKNOWLEDGEMENTS
Many people deserve credit and recognition for their assistance and encouragement over 
the course of my graduate career. First and foremost, I extend my gratitude to my advisor, 
Professor Ravi Iyer, who has been very patient with me and has provided me with the 
necessary challenges and resources to grow and learn academically and professionally. I 
now realize in hindsight the motivation for many of his assignments and am grateful for the 
experiences I have had at the University of Illinois. I also appreciate the insightful comments 
and guidance of the members of my doctoral committee: Professors Kent Fuchs, Bill Sanders, 
and Roy Campbell.
Many of the practical ideas in my research have been implemented on fault-tolerant 
computers which were provided by Tandem Computers, Inc. I am indebted to Tandem for 
the use of their machines and also a great deal of technical assistance. In particular, I must 
acknowledge the tremendous amount of assistance I have received from Luke Young, both 
from his time as a fellow student and as a Tandem employee. Among many others at
Tandem, Doug Jewett helped to provide input to my FTCS paper and also arranged for
machine time on Tandem prototype computers that were heavily in demand.
My stay at the University of Illinois has been enjoyable in large part due to associations 
with my fellow students. Wei-lun Kao and Steve VanderLeest have given me guidance as
more experienced students. Dane Dwyer has helped me with many technical issues and has 
also been a good friend since we started our studies together six years ago. Of course, there 
are too many others to mention.
Finally I must thank my family for their encouragement and God for the extraordinary 
opportunities I have received.
TABLE OF CONTENTS
1. INTRODUCTION
2. RELATED WORK
2.1 Hardware-Implemented Fault Injection
2.1.1 External Hardware With Contact
2.1.2 External Hardware Without Contact
2.1.3 Internal Hardware
2.1.4 Representative Case Studies
2.1.5 FTMP
2.1.6 MESSALINE
2.1.7 RIFLE
2.1.8 FIST
2.1.9 MARS study
2.2 Software-implemented Fault Injection Tools
2.2.1 FIAT
2.2.2 FERRARI
2.2.3 DEFINE
2.2.4 DOCTOR
2.2.5 Xception
2.2.6 SOFIT
3. DESCRIPTION OF BENCHMARK
3.1 Proposed Benchmark Specification
3.1.1 Phase 1
3.1.2 Phase 2
3.2 FTAPE
3.2.1 Fault injector
3.2.2 Workload Generator
3.2.3 Measure
3.3 Repeatability
3.4 Error Propagation Observability
4. FOCUSED FAULT INJECTION STRATEGIES
4.1 Stress-based Injection
4.1.1 Sensitivity of Workloads to Faults
4.1.2 Stress-based Injection Results
4.2 Path-based Injection
4.2.1 Implementation
4.2.2 Results
5. EXPERIMENTAL RESULTS
5.1 Description of Target Systems
5.1.1 Description of Tandem Integrity
5.1.2 Tandem ServerNet
5.2 Benchmark Results
5.2.1 Phase 1
5.2.2 Phase 2
6. CONCLUSIONS
6.1 Future Directions
REFERENCES
VITA
LIST OF TABLES
Table
1.1: Fault-Tolerance Benchmark Results
2.1: Summary of Fault Injection Methods
2.2: Fault Injection Methods Advantages and Disadvantages
2.3: Comparison of Software-Implemented Fault Injection Tools
2.4: Fault and Error Classes Supported by FERRARI
4.1: Sensitivity of Workloads to Faults
4.2: Stress-based Injection Results For CPU and Memory Fault
4.3: Number of Observed System Crashes
4.4: DAS Event Identifier Meanings
4.5: Input set for compress program
4.6: Description of compress Program
4.7: Random Injections vs. Path-based Injections
4.8: Comparison of Paths for the First 100 Injections
4.9: Measured Values for Cost Graph (Figure 4.4)
5.1: Comparison of Target Systems
5.2: Phase 1 Results
5.3: Demonstration of Disk Bandwidth During Mirror Recovery
5.4: Fault Latencies for Fault Preceding Error Detection (secs)
5.5: Phase 2 Results
5.6: Phase 2 Results for Duplex Prototype C (with no attempted recovery)
5.3: Phase 2 Catastrophic Incident Scenario
1. INTRODUCTION
Fault tolerance is an important issue in many computing applications. For instance, 
mission critical aerospace or railroad control systems need to ensure the safety of passengers, 
and banking and telecommunications systems must offer high availability to prevent financial 
losses. This thesis presents an approach towards benchmarking of fault-tolerant systems, 
proposes a fault tolerance benchmark, and demonstrates the use of that benchmark on several 
fault-tolerant systems. A fault tolerance benchmark¹ is a metric that characterizes the fault
tolerance of a system, along with the specification of the procedure for obtaining that metric.
The specification of the benchmark procedure may be presented as a source code program
or as a description of the procedure. The benchmarking of fault-tolerant systems is the
execution of the procedure for obtaining the benchmark metric. To perform benchmarking,
a benchmark tool must be created. For a fault tolerance benchmark, this tool is used to
obtain the metric by injecting faults throughout a fault-tolerant system and generating a
workload that causes those faults to result in fault-tolerant activity. The significance of the
measured metric must be viewed with consideration for the fault models upon which the
fault injection is based.
¹ The term fault-tolerant benchmark is avoided in this thesis in order to avoid any possible confusion with
a benchmark that has fault-tolerant features.
This thesis describes a fault tolerance benchmark that has been implemented on several 
fault-tolerant systems. The benchmark tool consists of a synthetic workload program that 
generates CPU, memory, and I/O activity based upon user specifications and injects CPU,
memory, and I/O faults according to a specified injection strategy. The program can be
configured to execute with different workload specifications (e.g., CPU-intensive workloads, 
I/O-intensive workloads, etc.) and with different fault injection strategies. The injection 
strategies include random selection of fault parameters (such as time and location) from 
distributions (such as exponential or normal) and selection of fault parameters based on 
workload activity measurements such that faults are injected into components undergoing 
high activity, thus ensuring a high level of fault activation and propagation (presented as 
stress-based injection in [1]). The exact manner in which the benchmark tool is used to 
form the actual benchmark is described in Chapter 3. Two metrics are provided to quantify 
the system fault tolerance using the proposed benchmark: (1) the number of catastrophic 
incidents (which are events which cause the entire computer system to become unusable and 
include operating system panics and hangs) and (2) performance degradation.
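The two kinds of injection strategy described above can be contrasted with a small Python sketch. This is only an illustration under stated assumptions: the component names, the activity measurements, and the function names are hypothetical and are not taken from FTAPE, whose actual selection logic is described in later chapters.

```python
import random

# Hypothetical component names; FTAPE targets CPU, memory, and disk.
COMPONENTS = ("cpu", "memory", "disk")


def random_injection(rng, mean_interval):
    """Random strategy: draw the injection delay from an exponential
    distribution and the fault location uniformly at random."""
    delay = rng.expovariate(1.0 / mean_interval)
    location = rng.choice(COMPONENTS)
    return delay, location


def stress_based_location(activity):
    """Stress-based strategy: inject into the component that currently
    shows the highest measured workload activity."""
    return max(activity, key=activity.get)


if __name__ == "__main__":
    rng = random.Random(0)
    print(random_injection(rng, mean_interval=5.0))
    # Assumed activity measurements, normalized to the range 0..1.
    print(stress_based_location({"cpu": 0.20, "memory": 0.35, "disk": 0.90}))
```

The point of the second function is that a fault placed in a busy component is more likely to be activated and to propagate than one placed at random.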
The proposed benchmark has been fully installed on three Tandem prototype machines, 
which will be referred to in this thesis as (1) TMR Prototype A, (2) TMR Prototype B, 
and (3) Duplex Prototype C. The first two TMR (triple modular redundant) machines are 
based on the Tandem Integrity S2 architecture, and the third machine, Duplex Prototype 
C, is based on the new Tandem ServerNet architecture. Section 5.1 provides a description 
of these machines, including their fault-tolerant features. It should be noted that these
machines are prototypes, and therefore the results presented in this thesis are not necessarily 
representative of production systems. The benchmark results are given in Table 1.1 for a 
mixed (CPU, memory, and I/O activity) workload and a stress-based injection strategy.
Table 1.1 shows benchmark results based on 50 runs of the benchmark program. The
table shows that the same workload and fault injection method produce four catastrophic
incidents in Prototype A, compared with none for Prototypes B and C. This indicates that
Prototype B (a later instantiation of the same architecture as Prototype A) and Prototype C
are indeed more fault-tolerant than Prototype A, which is a very early prototype system.
The performance degradation for Prototype C is lower than for the other two prototypes,
due to a recovery scheme that decreases the impact on performance during recovery.
The performance degradation under faults may be an important consideration for time- 
critical applications. The third row of Table 1.1 shows the average number of faults needed 
to cause a catastrophic incident. These injections were carried out with a series of faults 
that the machine was not specifically designed to tolerate, and thus catastrophic incidents 
ultimately resulted. A detailed examination of these measurements is presented later in 
Chapter 5. However, the numbers in Table 1.1 demonstrate the utility of the benchmark in 
comparing different fault-tolerant machines.
Table 1.1: Fault-Tolerance Benchmark Results

                                                        Prototypes
Measure                                      A           B           C
Catastrophic incidents                       4           0           0
Performance degradation                      0.009836    0.008243    0.002430
Avg. injections to catastrophic incident     10.4        4.1         5.0
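To make the first two rows of Table 1.1 concrete, the following Python sketch shows one way such metrics could be computed from a set of benchmark runs. It rests on assumptions that are not stated in this chapter: the `Run` record and its fields are hypothetical, and performance degradation is taken here to be the relative increase in execution time under faults, averaged over runs that completed. The measure actually used by the benchmark is specified in Chapter 3.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    catastrophic: bool       # did the run end in a system panic or hang?
    time_no_faults: float    # execution time without injection (seconds)
    time_with_faults: float  # execution time with injection (seconds)


def benchmark_metrics(runs):
    """Return (catastrophic incident count, average performance degradation).

    Degradation is an assumed definition: relative slowdown of each
    completed run, averaged over all completed runs.
    """
    incidents = sum(1 for r in runs if r.catastrophic)
    completed = [r for r in runs if not r.catastrophic]
    if not completed:
        return incidents, float("nan")
    degradation = mean(
        (r.time_with_faults - r.time_no_faults) / r.time_no_faults
        for r in completed
    )
    return incidents, degradation
```

For example, two completed runs that slow down by 1% and 3%, plus one run that crashes, give one catastrophic incident and an average degradation of 0.02.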
Fault injection plays an important part in the benchmark because it is the means by 
which fault-tolerant activity is generated. In order to ensure a high level of fault activation 
and error propagation, focused fault injection strategies are used. These strategies utilize 
knowledge of the workload (e.g., program structure, resources used, etc.) to intelligently 
select which faults to inject, thereby avoiding many faults that are not capable of causing 
fault-tolerant activity. Two such strategies are presented in this thesis: stress-based injection 
and path-based injection.
Related work is discussed in Chapter 2. The proposed benchmark is described in Chapter 3.
Chapter 4 describes two focused fault injection strategies for injecting faults in an
intelligent manner. Detailed benchmark results are given in Chapter 5. Conclusions, including
a discussion of the advantages and disadvantages of the benchmark and future directions
for research, are given in Chapter 6.
2. RELATED WORK
Many techniques to perform experimental analysis of computer system dependability 
have been developed, including analytical modeling, simulated fault injection, prototype 
fault injection, and analysis of operational measurements. The goal of these techniques is 
to test, validate, and evaluate the fault-tolerant algorithms and mechanisms of the systems 
under test. Each of these techniques occupies an important position in the design cycle of 
fault-tolerant computers.
In the early design phase, only specifications and high-level architectures are available
for analysis. Analytical modeling is well-suited to providing relatively quick solutions based
upon high-level information. Analytical solutions also have the advantage of being exact
(or of at least having an easily quantifiable error). Unfortunately, the large computational
resources required and the practicality of creating solvable models place constraints on
the complexity of analytical models. Simulation techniques are able to model computer
systems at a wide range of complexity, from the device and circuit levels to the functional
and behavioral levels. Many assumptions inherent in abstracted analytical models can be
eliminated in simulation models because concerns about the solvability of the models are
not present. However, the simulation of detailed models incurs a significant computational
cost, and in contrast to analytical methods, the computation cost is exacerbated by the need
for repeated simulations to attain statistically valid results. In order to make simulation
practical, abstracted or higher-level models are often used, which may decrease the accuracy
of results.
The use of prototypes for dependability analysis is appealing because no abstracted models
are required. In this context, the term “prototype” refers to a completely manufactured
machine, which may either be a laboratory system or a commercial system. Because the full
details of the entire system are available, including complete software applications, evaluation
of prototype systems yields potentially greater confidence in the accuracy of the results.
In addition, prototypes generally operate at full speed and are thus faster than the corresponding
simulation models. Nonetheless, some disadvantages are present. The need for a
functioning physical prototype machine disqualifies prototype analysis for use in the design
phase. Because the insight gained from evaluating prototypes is obtained late in the production
cycle, the results are mainly applicable to future designs and as a means for improving
the general understanding of such systems. Also, the innards of a complete prototype are
usually less accessible for monitoring and modification (as is needed for fault injection) than
simulation models. Thus, the invention of novel methods to inject faults is often required.
Analysis of data collected from operational computer systems is attractive because the 
hardware configuration, software applications, and operating environment are completely 
realistic. This is especially important for commercial manufacturers because operational 
data represents the actual requirements of their customers, in contrast to laboratory results 
which are based on sometimes non-realistic assumptions of hardware and software usage.
The main disadvantage of using operational field data is the extremely long times required 
to collect useful data. Realistic faults occur infrequently, and thus, many months or even 
years are needed to collect a minimal amount of data for analysis. Furthermore, because 
the computer systems are no longer in a controlled laboratory setting, data collection may 
not be complete or may be inaccurate (especially if human observations form part of the 
collected data).
This thesis focuses on the dependability evaluation of prototype computer systems. 
Therefore, this chapter includes a summary of related fault injection tools involving prototype
systems, including specific approaches, advantages and disadvantages, and suggested
applications. A good overview of tools using the other methods described above is provided 
by Tang [3].
A practical classification of fault injection methods can be arrived at based on the type 
of fault injection implementation. Three important characteristics are
1. use of hardware or software to inject faults,
2. existence of direct physical contact with electrical circuitry, and
3. need for hardware that is external to the system under test.
These characteristics lead to the following classification of fault injection methods:
1. External hardware with contact
2. External hardware without contact
3. Internal hardware
4. Software
Note that the terms hardware and software refer to the means of injecting a fault and not to
the fault model. For instance, software methods are able to emulate the effect of hardware
faults. Thus, hardware and software fault injection are equivalent to hardware-implemented
and software-implemented fault injection, respectively.
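The way the three characteristics map onto the four classes can be written as a short decision function. This Python sketch is only an illustration of the classification above; the function and parameter names are hypothetical.

```python
def classify_injection_method(uses_hardware, external, contact):
    """Map the three characteristics onto the four fault injection classes.

    uses_hardware: True if hardware (not software) performs the injection.
    external:      True if the injection hardware is external to the system under test.
    contact:       True if there is direct physical contact with the circuitry.
    """
    if not uses_hardware:
        # For software methods the other two characteristics do not apply.
        return "Software"
    if not external:
        return "Internal hardware"
    if contact:
        return "External hardware with contact"
    return "External hardware without contact"
```

For instance, a pin-level probe is external hardware with contact, while heavy-ion radiation is external hardware without contact.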
The use of external hardware with contact focuses on disturbing the voltages or currents 
on chip pins. These chip pins may contain either logical values or the power supply. When 
external hardware without contact is used, an external source producing heavy-ion radiation 
or electromagnetic fields is used to cause spurious currents inside the targeted chip. Fault 
injection hardware may also reside within a chip (i.e., internal hardware) and may take 
the form of custom logic or scan logic that is already included for testing purposes. The 
three hardware-based fault injection methods share many characteristics and therefore will 
be collectively referred to as hardware fault injection methods. Table 2.1 lists approaches, 
fault models, and representative case studies for each of the fault injection classifications 
described above.
Hardware methods are able to inject faults without disturbing the workload because they
require no software assistance on the target machine for the actual injection¹. In contrast,
software methods rely completely on software, which must reside at least partly on the
target machine. The execution of the injection software necessarily disturbs the workload.
The time resolution for fault triggering and results monitoring is low for software methods,
since the software must also handle its normal tasks² in addition to fault injection. High time
resolution is possible with pin-level methods due to special high-resolution timing circuits
present on the injection hardware. Finally, hardware methods incur a greater financial cost
than software methods because additional hardware is needed. This last reason, coupled
with the greater flexibility of software methods, has led to the choice of software methods
as the fault injection method used in this thesis. Detailed descriptions for the studies and
corresponding tools in Table 2.1 are included later in this chapter.
¹ Software residing on non-target machines might be used for specifying injection parameters and collecting
measurements, but no perturbation to the target machine workload should result.

Table 2.1: Summary of Fault Injection Methods
(columns: type of fault injection instrumentation)

Approaches
  External HW with contact:    Pin-level probes; pin-level sockets; power disturbance
  External HW without contact: Heavy-ion radiation; electromagnetic interference
  Internal HW:                 Scan logic; custom logic
  SW:                          Source code modification; memory corruption; register corruption;
                               communication packet corruption

Fault Models
  External HW with contact:    Stuck-at; open; bridging; bit-flip; change in power supply voltage
  External HW without contact: Spurious currents
  Internal HW:                 Bit corruption of latches and registers; stuck-at; open; bridging; bit-flip
  SW:                          Machine-level or higher level manifestation of software defects;
                               bit/byte corruption of registers, memory, buses, and packets; CPU faults

Case Studies
  External HW with contact:    FTMP [4]; FTMP [5],[6]; FTMP [7]; MESSALINE [8]; RIFLE [9]; FIST [10]
  External HW without contact: Z-80 [11]; FIST [12]; FIST [13]; FIST [10]; MARS [14]
  Internal HW:                 ES/9000 [15]
  SW:                          Accelerated Injection [16]; FIAT [17],[18]; FERRARI [19]; FTAPE [20];
                               DEFINE [21]; DOCTOR [22]; Xception [23]; SOFIT [24]

A comparison of the advantages and disadvantages of the fault injection methods is 
given in Table 2.2. Software methods have become quite popular, especially in the academic 
community, due to their low cost and applicability to a wide range of systems. The risk 
of damaging the target system is minimal, and controllability of the fault location and 
repeatability of experiments are all favorable. However, unavoidable perturbation to the 
target system due to the execution of software routines and the lower time resolution for 
triggering injections are drawbacks.
² The time resolution for software methods can be increased if special hardware timers are available and
software accessible.

Table 2.2: Fault Injection Methods Advantages and Disadvantages

                                        Type of fault injection instrumentation
Category                      External HW     External HW       Internal HW       SW
                              with contact    without contact
Cost                          High            High              Low               Low
Perturbation                  None            None              Low               Low-High
Risk of damage                High            Low               None              None
Monitoring time resolution    High            High              High              Low
Accessibility of fault        Chip pins       Chip internals    Chip internals    Register,
injection points                                                                  memory
Controllability               High            Low               High              High
Trigger                       Yes             No                Yes               Yes
Repeatability                 High            Low               High              High

Hardware methods are most appropriate when high-resolution time measurements are
needed. In contrast to software methods, no target system resources, such as the processor
or memory, are consumed, thus avoiding perturbation in terms of time and disturbances
to the processor and memory states. Also, triggering of injections and time-stamping of
observed events is based on hardware timers. In addition to perturbation and timing issues,
hardware methods are able to access hardware locations inaccessible to software methods.
These locations include chip pins and internal components, such as combinational circuits
and registers that are not software-addressable. Thus, hardware methods are useful for
evaluating low-level error detection and masking mechanisms. Software methods can inject
software faults³, but they cannot inject hardware faults directly. Rather, the effects of
hardware faults are emulated (e.g., instead of creating a spurious electrical current, the
current is assumed to cause a bit-flip in a register, and the register contents are corrupted
directly). Since software methods assume that faults have occurred and directly introduce
the effect of the fault at the register/memory level, some low-level fault-tolerant mechanisms
may not be exercisable. However, if the emulation of the fault effect at the register/memory
level is satisfactory, software methods might be preferable as a low-cost solution.
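The register-level emulation described above amounts to corrupting a stored value directly. The following Python sketch illustrates the idea on a simulated register file; the dictionary of registers, the register names, and the function names are assumptions made for the example and do not correspond to any particular tool.

```python
def flip_bit(word, bit, width=32):
    """Return word with one bit inverted, emulating a single bit-flip fault."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (word ^ (1 << bit)) & mask


def inject_register_fault(registers, name, bit):
    """Corrupt one register of a simulated register file in place.

    The electrical cause of the fault is not modeled; only its assumed
    effect (a flipped bit) is introduced.
    """
    registers[name] = flip_bit(registers[name], bit)
    return registers[name]


if __name__ == "__main__":
    regs = {"r1": 0x00000000, "r2": 0xFFFFFFFF}  # hypothetical register file
    print(hex(inject_register_fault(regs, "r1", 3)))
```

Because only the effect is introduced, any detection mechanism that acts below the register level (for example, on the circuit that would have carried the spurious current) is bypassed, which is the limitation noted in the text.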
³ Whereas hardware fault models represent the actual occurrence of hardware faults, software fault models
are not necessarily as representative of actual fault occurrences.
The next two sections provide detailed descriptions of hardware and software methods
of fault injection. However, a brief mention of related work in the area of benchmarks for
fault tolerance should be made. Relatively little research has been performed in the area
of fault tolerance benchmarks. One area that has received some attention is robustness
testing. Siewiorek et al. [25] addressed the development of a benchmark for measuring 
system robustness. Crashme [26] is a program that executes multiple processes consisting of 
random data masquerading as executable code. Many illegal instruction, segmentation fault, 
and other exceptions are generated, which should be handled by the operating system. The 
objective of crashme is to test whether a user program executing illegal instructions is able 
to crash the operating system. CMU-Crashme [26] is a similar type of program, but instead 
of executing random instruction code, it passes random parameter values to random system 
calls. Dingman [27] builds on the crashme programs to create a robustness benchmark for 
the ASCM fault-tolerant system. The method of testing involves the execution of system 
calls with incorrect parameters, which might represent software faults or data corruption. 
The response of the system (including correct operation and several levels of failure) is then 
measured. This work is valuable because it provides a method for quantifying the system 
robustness, i.e., the ability of the system to tolerate unexpected conditions caused by user 
programs. The method yields numerical results which can be used to compare different 
systems. As the authors point out, the objective is not per se to measure how well a system 
responds to faults that the system is expected to tolerate.
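The parameter-based style of robustness testing described above can be sketched in a few lines of Python. This is a toy illustration only: it exercises an ordinary Python function instead of operating system calls, and the harness, its outcome categories, and the argument generator are all hypothetical. It is not the method of any of the cited tools.

```python
import random


def robustness_trials(func, arg_generators, trials, rng):
    """Call func with randomly generated arguments and tally how each call ends."""
    outcomes = {"completed": 0, "error_reported": 0}
    for _ in range(trials):
        args = [generate(rng) for generate in arg_generators]
        try:
            func(*args)
            outcomes["completed"] += 1
        except Exception:
            # The callee detected the bad input and reported it cleanly.
            outcomes["error_reported"] += 1
    return outcomes


def small_int(rng):
    """Generate a small integer, including the troublesome value 0."""
    return rng.randint(-2, 2)


if __name__ == "__main__":
    # divmod raises ZeroDivisionError whenever the second argument is 0.
    print(robustness_trials(divmod, [small_int, small_int], 100, random.Random(0)))
```

A real robustness benchmark would also need categories for the more severe outcomes mentioned in the text, such as a hung or crashed system, which cannot be observed from inside the process under test.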
2.1 Hardware-Implemented Fault Injection
The fault injection tools for prototype machines can be organized into two main categories:
those using hardware-implemented fault injection (HWIFI) and those using software-implemented
fault injection (SWIFI). HWIFI methods require additional hardware that is
not part of the original prototype in order to inject faults. As shown in Table 2.1, four dif­
ferent hardware-implemented fault injection methods have been used: pin-level, heavy-ion 
bombardment, power supply disturbance, and electromagnetic interference (EMI). The first 
two methods, pin-level and heavy-ion bombardment, have been used more extensively than 
the latter two.
HWIFI methods can be grouped into the three classifications mentioned at the beginning 
of this chapter: (1) external hardware with contact, (2) external hardware without contact, 
and (3) internal hardware. These fault injection methods can be characterized by their 
fault models, triggering, monitoring, and incurred overhead. HWIFI methods are generally 
based on circuit-level fault models: stuck-at, open, bridging, bit-flip, spurious currents and 
voltages. Triggering and monitoring are both implemented with hardware, thus providing 
high time resolution and low perturbation for both triggering and monitoring. Triggering is 
usually performed after a specified time has expired on a hardware timer or an event has been 
detected (such as the detection of a specified address on the address bus). For methods that 
use external hardware without contact, precise triggering can be difficult because the exact 
moment of heavy ion emission or electromagnetic field creation cannot be precisely controlled. 
These methods are well suited for studying dependability characteristics that require high 
time resolution for hardware triggering and monitoring (e.g., fault latency in the CPU) 
or which require access to locations that cannot be easily reached by other fault injection 
methods. The remainder of this section describes in more detail the three classifications for 
HWIFI methods and presents short descriptions of some representative case studies.
2.1.1 External Hardware With Contact
The use of external hardware in direct contact with chip pins is probably the most 
popular method of hardware-implemented fault injection. The method is often called pin- 
level injection. Two main techniques exist for altering electrical currents and voltages on 
chip pins:
1. Active probes: The electrical currents through the chip pins are altered by adding 
current via the probes attached to the chip pins. The types of faults attainable with 
probes are usually limited to stuck-at faults. However, it is also possible to introduce 
bridging faults by placing the probes across multiple chip pins. Care must be taken with 
the use of active probes that force additional current into chip pins because damage 
to the target hardware can result from an inordinate amount of current.
2. Socket insertion: The target chip is removed from the circuit board and replaced 
with a socket, after which the original chip is placed on top of the socket. The special 
socket is able to inject stuck-at and open faults by simply forcing the desired logic 
value onto the targeted chip pin. In addition, more complex logical faults can be 
forced onto these pins. For instance, the pin signals can be inverted, ANDed, or ORed 
with adjacent pin signals or even with previous signals on the same pin.
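The logical fault functions available through socket insertion can be illustrated with a small simulation. The following Python sketch is not taken from any of the tools surveyed in this chapter. The function name `inject_pin_fault`, the fault labels, and the representation of a pin signal as a list of sampled 0/1 values are all assumptions made for illustration, and open faults are not modeled.

```python
from typing import List, Optional

def inject_pin_fault(signal: List[int], fault: str,
                     adjacent: Optional[List[int]] = None) -> List[int]:
    """Return the signal seen by the chip when `fault` is forced onto the pin.

    `signal` and `adjacent` are hypothetical sampled traces of logic values.
    """
    if fault in ("and-adjacent", "or-adjacent") and adjacent is None:
        raise ValueError("this fault needs the adjacent pin's signal")
    faulted = []
    for t, value in enumerate(signal):
        if fault == "stuck-at-0":
            faulted.append(0)
        elif fault == "stuck-at-1":
            faulted.append(1)
        elif fault == "invert":
            faulted.append(1 - value)
        elif fault == "and-adjacent":
            faulted.append(value & adjacent[t])
        elif fault == "or-adjacent":
            faulted.append(value | adjacent[t])
        elif fault == "or-previous":
            # Combine with the previous sample on the same pin
            # (taken as 0 before the first sample).
            previous = signal[t - 1] if t > 0 else 0
            faulted.append(value | previous)
        else:
            raise ValueError(f"unknown fault type: {fault}")
    return faulted

pin = [0, 1, 1, 0, 1]
neighbor = [1, 1, 0, 0, 1]
print(inject_pin_fault(pin, "invert"))                  # [1, 0, 0, 1, 0]
print(inject_pin_fault(pin, "and-adjacent", neighbor))  # [0, 1, 0, 0, 1]
```

In this toy model a permanent fault is applied to every sample. A transient or intermittent fault could be approximated by applying the same function to a slice of the trace.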
Both of these injection methods provide good controllability of fault times and locations 
with no perturbation to the target system. It should be noted that faults are modeled at 
chip pins. Thus the fault models are not identical to traditional stuck-at and bridging fault 
models that generally occur inside a chip. Nonetheless, many of the same effects can be
achieved via these injection methods (e.g., the exercise of error detection mechanisms). A
special case of using active probes occurs when the probes are attached to the power supply 
pins to inject power supply disturbance faults.
2.1.2 External Hardware Without Contact
These methods of fault injection make use of hardware that does not directly contact 
any electrical circuitry on a chip or circuit board. Faults are injected by producing electrical 
currents in chips by heavy-ion radiation as an ion passes through a semiconductor depletion 
region or by induction as the chip or board is placed in or near an electromagnetic field. 
These methods are attractive because they mimic actual physical conditions that might 
cause faults and are able to affect all areas of a chip. However, the temporal and spatial 
controllability is lower than for other methods. Consequently, individual faults are difficult
to repeat.
2.1.3 Internal Hardware
Hardware that is internal to a chip can sometimes be utilized for fault injection. Many 
chips contain some form of scan logic that provides access to registers and other latch points 
within a chip. Although the intended purpose of scan logic is testing and initialization, 
the same circuitry can also serve as a fault injection mechanism to corrupt the contents 
of registers and other latches in the scan chain. Usually the clock signal to the chip is 
temporarily frozen as values are shifted through the scan chain to prevent any unintended 
disturbance to the chip state. Since the scan circuitry is already present, no additional cost 
is incurred for the use of the scan logic for fault injection.
In addition to scan logic, some chips also contain custom logic that is specifically designed 
for fault injection. For instance, any point on a chip may be interrupted with a logic gate 
to emulate a stuck-at or bit-flip fault. More complex circuitry can be used to emulate open 
and bridging faults. The main drawback to this method of fault injection is the additional 
circuitry required, which occupies valuable chip real estate and might affect the timing of 
the chip.
The use of internal hardware for fault injection is similar to SWIFI methods because both 
rely on software to take advantage of existing hardware to inject faults. However, internal 
hardware is generally able to access more internal chip points.
2.1.4 Representative Case Studies
The following case studies are chosen to be representative of the hardware-implemented 
fault injection methods discussed. Some studies, such as the FIST and MARS studies, 
provide comparisons of selected fault injection methods.
2.1.5 FTMP
Several studies centered around the fault-tolerant multiprocessor (FTMP) fault injection 
instrumentation [4], [6], [7]. FTMP is a computer architecture that evolved over a 10-year pe­
riod in connection with several critical aerospace applications. The architecture was designed 
to have a failure rate of the order of 10^-10 per hour. The basic blocks of the architecture
are independent processor-cache modules and memory modules that communicate through 
redundant buses. The modules are dynamically grouped into several TMR triads or assigned
to spare status. Jobs can be scheduled to any processor triad. All transactions between pro­
cessor modules and memory modules in a triad are voted bit-by-bit. When a fault occurs, 
the faulty module is isolated and the faulty triad reconfigured. Fault detection, diagnosis, 
and recovery are handled in such a way that application programs are not involved.
Figure 2.1 shows the diagram of the FTMP fault injection instrumentation developed at 
the Charles Stark Draper Laboratory [4], [7]. In an FTMP computer, there are several line
replaceable units (LRUs), each containing a processor, clock generator, power subsystem, 
and bus interface circuits. LRU #3 is constructed for connection of the fault injector. All 
chips in LRU #3 are connected to sockets that allow them to be removed for insertion of the
fault injection implant. Each fault injection implant contains circuitry that can interrupt
and reconnect the pins in the sockets. Several different types of faults, such as stuck-at-0
and stuck-at-1, can be injected into the pins by the implants. These implants are
controlled by a VAX 11/750 computer. A special version of the System Configuration Control 
(FSCC) program running in the FTMP communicates with the Fault Injection Software 
(FIS) running in the VAX 11/750 through one of the FTMP I/O ports and a 1553/UNIBUS 
data link.
Faults are normally injected on one pin at a time. When an injection occurs, the FIS 
program chooses a fault and a pin, applies the fault to the pin, and records the injection time. 
Once the FTMP detects and identifies the fault and reconfigures the system, it sends this 
information, along with the time of each event, back to FIS. Upon receiving the information, 
FIS removes the fault by restoring the pin to its normal state and notifies the FTMP. The 
FTMP then puts the victim module back into an active state and notifies FIS that it is ready 
for another fault injection. This process is repeated after a random delay.
Figure 2.1: FTMP Fault Injection Environment (diagram labels: FSCC software; (2) FTMP acknowledges; (3) FTMP restores LRU #3; (4) fault injected; (5) data from FTMP)
In the experiments conducted at the Charles Stark Draper Laboratory [4], a total of 
21,055 faults were injected, and 17,418 (83%) were detected. All of the detected faults were
identified correctly, and the system subsequently recovered successfully from each of these
faults by replacing the faulty module. That is, the coverage for detected faults in the FTMP was 100%, which
validated the FTMP architecture and implementation.
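The two figures quoted above measure different things: the fraction of injected faults that were detected, and the fraction of detected faults from which the system recovered. A short calculation makes the distinction explicit. The counts are those reported in the text; the variable names are chosen here for illustration only.

```python
injected = 21_055   # faults injected in the Draper Laboratory experiments
detected = 17_418   # faults detected by the FTMP
recovered = detected  # the text states that every detected fault was recovered

detection_fraction = detected / injected
recovery_coverage = recovered / detected

print(f"detected / injected  = {detection_fraction:.1%}")
print(f"recovered / detected = {recovery_coverage:.0%}")
```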
Another study using the FTMP fault injection instrumentation was reported in [5], with 
emphasis on the investigation of fault latency. Results showed that the hazard rate of fault 
latency is monotonically decreasing. Two distributions with monotonically decreasing hazard 
rates, Weibull and gamma distributions, were then used to fit the experimental results. The 
study also investigated the effect of fault latency on the probability of having multiple faults. 
It was shown that there exists an optimal fault latency that minimizes the multiple-fault
probability.
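As a reminder of why the Weibull distribution can model a monotonically decreasing hazard rate: its hazard function is h(t) = (k/s)(t/s)^(k-1), which decreases in t whenever the shape parameter k is less than 1. The sketch below evaluates this function in Python. The shape and scale values are arbitrary illustrations, not parameters fitted in the study.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Hazard rate h(t) = (k/s) * (t/s)**(k - 1) of a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Illustrative parameters only: shape < 1 gives a decreasing hazard rate.
times = [0.5, 1.0, 2.0, 4.0]
rates = [weibull_hazard(t, shape=0.5, scale=1.0) for t in times]
for t, rate in zip(times, rates):
    print(f"t = {t:>3}: h(t) = {rate:.4f}")
```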
Later, fault injection experiments on the same instrumentation were conducted at the 
NASA Langley Research Center [7] to investigate two issues: fault sampling methods and
fault recovery distributions. For each fault injection, two choices must be made: the fault 
location (pins) and the fault type (e.g., stuck-at-1, stuck-at-0, inverted signal). Thus, the 
possible fault set (the collection of all different injected faults) can be very large. Exhaustive 
fault injection is costly and time consuming. It is necessary to find appropriate sampling 
methods to reduce the time and cost of testing. The study compared the effects (detection 
behavior) of different faults and grouped these faults into several subsets according to the 
similarity in their effects. The results showed that the effects are not homogeneous across 
the fault set. This indicates that stratified sampling methods, based on the fault subsets, 
should be developed for fault injection. The study also showed that the fault recovery time 
is not exponentially distributed.
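The idea of stratified sampling over fault subsets can be sketched as follows. This is a minimal illustration, not the sampling procedure used in the NASA Langley study: the stratum names, the fault lists, the proportional allocation rule, and the function `stratified_sample` are all assumptions.

```python
import math
import random
from typing import Dict, List, Tuple

Fault = Tuple[str, str]  # (pin, fault type)

def stratified_sample(strata: Dict[str, List[Fault]], fraction: float,
                      seed: int = 0) -> Dict[str, List[Fault]]:
    """Draw the same fraction of faults from every stratum (at least one each)."""
    rng = random.Random(seed)
    sample = {}
    for name, faults in strata.items():
        count = max(1, math.ceil(fraction * len(faults)))
        sample[name] = rng.sample(faults, count)  # sampling without replacement
    return sample

# Hypothetical fault subsets, grouped by similarity of detection behavior.
strata = {
    "detected-quickly": [(f"pin{i}", "stuck-at-0") for i in range(40)],
    "detected-slowly": [(f"pin{i}", "stuck-at-1") for i in range(10)],
    "undetected": [(f"pin{i}", "inverted") for i in range(4)],
}
sample = stratified_sample(strata, fraction=0.25)
print({name: len(faults) for name, faults in sample.items()})
```

Sampling each subset separately ensures that small subsets with unusual behavior are represented, which uniform sampling over the whole fault set would not guarantee.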
2.1.6 MESSALINE
MESSALINE [8] is a flexible, pin-level fault injection tool that has been developed at 
LAAS-CNRS in Toulouse, France. The general architecture of MESSALINE and its envi­
ronment is given in Figure 2.2. The injection, activation, and collection modules are im­
plemented in hardware on an Intel 310 microcomputer. The software management module 
resides on a Macintosh II computer, which provides a flexible user interface.
The fault injection mechanism for MESSALINE uses active probes and socket insertion. 
Thus, fault types such as stuck-at, open, bridging, and complex logical functions can be 
injected. Because the duration and frequency of faults can be controlled, the fault injector 
can introduce permanent, transient, and intermittent faults. Signals collected from the target 
system can provide feedback to the injector. Also, a device is associated with each injection
point to sense when and if each fault is activated and produces an error. MESSALINE has
facilities to inject faults at up to 32 injection points simultaneously.
Figure 2.2: General Architecture of MESSALINE (blocks shown: target system; environment simulation, activation, injection, and collection modules; control of the experiment on the Intel 310; management of the test sequence; input files, output files, and operator)
The application of MESSALINE has been shown in two experiments involving 1) a sub­
system of a centralized, computerized interlocking system (called PAI) for railway control 
applications and 2) a distributed system corresponding to an implementation of the depend­
able communication system of the ESPRIT Delta-4 Project.
In the case of the PAI system, permanent stuck-at-0, stuck-at-1, and open circuit faults 
were injected into various memory and CPU chips. The results indicated that CPU errors were
more difficult to detect than memory errors. The error detection mechanisms were analyzed 
individually, and it was discovered that the diagnosis software accounted for most of the error 
coverage. The elimination of hardware detection would have decreased the overall coverage 
by less than 3%.
The distributed communication system was injected with intermittent stuck-at-0 and 
stuck-at-1 faults. The actual faults were injected into the Network Attachment Controllers 
(NAC), which provide the connection for each node to the local area network. Results 
showed that over 67% of all injections caused the injected NAC to be correctly identified and
extracted. Also, 24% of the injections did not cause a detectable error. Thus, in over 91% of the
injections, the distributed system was able to correctly handle the fault. These experiments
demonstrate the utility and flexibility of the MESSALINE fault injection tool.
2.1.7 RIFLE
RIFLE is a general-purpose pin-level fault injection tool developed at the University of 
Coimbra in Portugal. The tool uses the socket insertion technique to inject pin-level faults, 
which are mainly targeted at the processor pins.
The environment for RIFLE is given in Figure 2.3. In addition to the target and host 
systems, three hardware modules are present:
1. Adaptation module: This module contains the socket that is inserted between the processor
and the processor board.
2. Main module: This module monitors the processor bus signals received from the adaptation
module and provides the trigger for each fault. The module also contains a trace
memory which continuously collects information on the processor bus.
3. Interface and counters module: This module is the interface between the host system
and the hardware modules. It contains the circuitry needed to collect latency and binary
results.
The environment is controlled by the Control and Management Software which is used to 
specify the experiment parameters, to control the injection of faults, to validate injected 
faults based upon the observed errors produced, and to collect the results.
Figure 2.3: RIFLE environment
The current implementation of RIFLE allows up to 96 different pins to be injected. 
Faults can be stuck-at-1, stuck-at-0, stuck-at-an-external-value, inversion, logical bridging 
(i.e., logically the same as an adjacent pin), and open circuit. Fault injection is triggered by the
following steps:
1. An Activation Address is compared to the processor address bus.
2. The Activation Address is detected as many times as specified by the Activation Ad­
dress Count.
3. A specified delay (of up to 2^16 bus cycles) is performed.
4. An Activation Pattern is detected.
5. The fault is injected.
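The trigger sequence above behaves like a small state machine, which can be sketched in Python as follows. This is only an illustration of the listed steps, not RIFLE's hardware logic. The function `find_injection_cycle`, the bus trace format of (address, data) pairs, and the choice to match the Activation Pattern against the data value are assumptions.

```python
from typing import Optional, Sequence, Tuple

MAX_DELAY = 2 ** 16  # upper bound on the delay, in bus cycles

def find_injection_cycle(bus_trace: Sequence[Tuple[int, int]],
                         activation_address: int,
                         address_count: int,
                         delay_cycles: int,
                         activation_pattern: int) -> Optional[int]:
    """Return the index of the bus cycle at which the fault would be injected."""
    if address_count < 1 or not 0 <= delay_cycles <= MAX_DELAY:
        raise ValueError("invalid trigger parameters")
    matches = 0
    armed_at = None  # cycle at which the address count was reached
    for cycle, (address, data) in enumerate(bus_trace):
        if armed_at is None:
            # Steps 1 and 2: count occurrences of the Activation Address.
            if address == activation_address:
                matches += 1
                if matches == address_count:
                    armed_at = cycle
        elif cycle > armed_at + delay_cycles and data == activation_pattern:
            # Steps 3 to 5: delay elapsed and pattern seen, so inject here.
            return cycle
    return None  # trigger conditions never satisfied

# Hypothetical trace of (address, data) pairs, one per bus cycle.
trace = [(0x100, 0x00), (0x200, 0xAA), (0x100, 0xAA), (0x300, 0xAA),
         (0x304, 0x55), (0x308, 0xAA), (0x30C, 0x11)]
print(find_injection_cycle(trace, 0x100, 2, 2, 0xAA))  # 5
```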
RIFLE has been implemented on systems based on the Motorola 68000, Zilog Z80, Intel
486DX, and Inmos T800 processors. Experiments were performed on a Motorola 68000-based
computer which contained several methods to detect internal processor errors, memory
access errors, and watchdog timeout errors. The results showed that up to 72.5% of the
errors were detected using internal processor and memory access error detection mechanisms.
Furthermore, less than 10% of all errors produced incorrect results, which suggests that a
traditional computer with simple error detection mechanisms has strong fail-silent properties.
2.1.8 FIST
The FIST (Fault Injection system for Study of Transient fault effects) system was used to
conduct experiments on a MC6809-based microcomputer [13], [12]. The FIST environment 
is shown in Figure 2.4. Two fault injection methods are possible with FIST: heavy-ion 
radiation and power supply disturbance.
Neither pin-level nor software-implemented fault injection has a way to produce transient 
faults at random locations inside ICs. Radiation-induced fault injection provides such a 
capability. One way to do this is to expose the chip to the heavy-ion radiation from a
Californium-252 (Cf-252) source [13], [12]. The heavy ions emitted from the source are capable of
creating transient faults when they pass through a depletion region in the IC. One advantage
of this method is that it can produce transient faults at random locations evenly and can
cause either a single bit flip or multiple bit flips, leading to a large variation in the errors
seen on the output pins of the IC.
Figure 2.4: FIST Diagram
In the fault injection experiments reported in [13] and [12], the Cf-252 method was used to
investigate error coverage and detection latency for error detection schemes for the MC6809E 
8-bit microprocessor. The intention of the experiments was to characterize the effects of tran­
sient faults that originate inside a CPU. The MC6809E is fabricated in NMOS, a technology 
sensitive to heavy-ion radiation. The error detection schemes under study are suitable for
implementation with a watchdog processor that checks the behavior of the main processor 
on the external bus. The developed experimental system is called FIST (Fault Injection 
system for Study of Transient fault effects).
The heavy-ion radiation is implemented using a commercially available 37 × 10^3 becquerel
(1 µCi) source. The source is mounted inside a vacuum chamber together with a small
computer system. One of the system boards is placed on a mechanical fixture movable in 
three dimensions for accurate positioning of the CPU beneath the source. The system has two 
MC6809E CPUs, which operate synchronously using the same clock. One CPU is exposed 
to heavy-ion radiation. The other is used as a reference to detect errors via comparison 
on the output from the two CPUs. When errors are detected by the comparison logic, the 
logic analyzer is triggered to record the external bus signals. The monitoring computer is 
responsible for data acquisition and control of experiments.
A fault injection experiment is conducted in the following way. Before the experiment 
starts, the monitoring computer fetches from the host computer a load file that contains the 
test program to be executed. The test program is then loaded from the monitoring computer 
to the MC6809E system, after which, the test program is started with a “go” command from 
the monitoring computer. When a mismatch is detected, the monitoring computer fetches 
the recorded error data from the logic analyzer and the error flip-flops in the MC6809E 
system and transfers them to the host computer. Finally, the MC6809E system is reset, and 
the test program is reloaded for the next experiment.
It was found from fault injection experiments that 78% of all errors affected control flow 
(i.e., caused the processor to diverge from the correct sequence) and 17% caused errors in 
data. Results also showed that 30% of all errors were multiple bit errors on the output pins,
although the origin of each of these errors was only one single heavy ion. The error recordings 
obtained from the experiments were also used as input to simulation models of different error 
detection mechanisms to evaluate these error detection mechanisms without implementing 
them. The coverage of several detection mechanisms was investigated. It was found that the 
best mechanism was the one that detects access to the memory outside permitted areas and 
that the combination of two mechanisms gave a better coverage than any one mechanism 
alone. It was also found that the type of the test program had a considerable influence on 
the results of error detection mechanisms.
FIST is also capable of injecting faults in the power supply for ICs. Random faults 
inside ICs can be produced by disturbing the power supply voltage. The FIST system used
to perform radiation-induced fault injection [Gunneflo89, Karlsson89] was also used to inject 
faults via power supply disturbances [Karlsson91]. In order to control the power supply 
voltage, a MOS transistor was placed between the power supply and the Vcc pin of the 
MC6809E CPU. Thus, the effective power supply voltage could be decreased below +5 V 
by dropping the gate voltage of the MOS transistor. The amplitude of the voltage drop was 
adjusted by using another adjustable voltage source in parallel connected to the Vcc pin via 
a diode.
Experiments with the FIST system utilized power supply voltage sag pulses, which are 
illustrated in Figure 2.5. The voltage sag pulse had a width, PW, of about 50 ns and a pulse 
amplitude, A, of about -4.2 V. In order to produce errors, the voltage sag pulse had to be 
applied at the edges of the bus clock signals. On average, errors were produced by every 
sixth voltage sag pulse, and the average time between the sag pulse and the error was 1.3 
bus cycles. In contrast to radiation-induced errors, most (almost 80%) of the power supply
disturbance errors were manifested in the CPU control signals and only 1% was manifested 
in the data bus.
2.1.9 MARS study
Hardware-implemented fault injection experiments were conducted on the MARS (Maintainable
Real-time System) architecture developed at the Technical University of Vienna [28]. Three
different fault injection methods were used: heavy-ion radiation, pin-level injection, and 
electromagnetic interference [14]. The heavy-ion radiation and pin-level injection methods 
are similar to those used in other studies.
The fault injection test-bed for injecting electromagnetic interference (EMI) faults is 
shown in Figure 2.6. Faults are applied to a computer circuit board using one of two
methods: (1) the board is placed between two conducting plates that are connected to an
EMI burst generator, or (2) a special probe is positioned near specific components on the
board, thus exposing a smaller part of the board to the electromagnetic disturbances. For 
both methods, small wires functioning as antennas can be attached to specific components
on the board (such as CPU buses) to accentuate the effect of the electromagnetic field on 
those components.
Figure 2.6: EMI Fault Injection Test-bed
Experiments using EMI on the MARS architecture showed that the EMI method mostly 
produced errors that were detected by CPU error detection mechanisms (EDMs). Board- 
level and software-based EDMs were much less exercised. This result suggested that EMI 
is less suitable for a comprehensive evaluation of EDMs than other fault injection methods. 
Almost all of the error detections were caused by spurious interrupts, which are interrupts 
signaled to the CPU but which are not mapped to any device. This shows that the interrupt
lines of a CPU are very sensitive to EMI.
2.2 Software-Implemented Fault Injection Tools
Software-implemented fault injection (SWIFI) tools require no additional hardware and 
rely on special software routines to inject faults. The qualifier “implemented” is needed when
discussing SWIFI tools to distinguish the use of software to inject faults (be they
based on hardware or software fault models) from the injection of faults based on software
fault models. SWIFI is capable of corrupting any software-addressable location. Since SWIFI
only operates at the information level (which contains registers and memory locations), it 
actually injects errors rather than faults. However, since the term “fault injection” is much
more prevalent than “error injection,” this thesis will continue to refer to fault injection, 
even in the case of SWIFI.
In software-implemented fault injection, no extra hardware instrumentation is needed. 
Faults are injected through the use of special software routines, which corrupt registers, 
memory, communications packets, and other software-addressable locations. In addition, 
software faults (which are design or implementation defects, such as an incorrect initialization 
of a variable or a failure to check a boundary condition) can be injected by modification of 
source or machine code. The appropriate injection method is dependent upon the desired 
fault model. Table 3 contains a list of fault models along with SWIFI techniques for injecting 
each type of fault. Memory and network faults are injected by directly altering the contents 
of the target memory location or transmission message. CPU and bus faults are indirectly 
injected by modifying code or data such that the result corresponds to the corruption that 
would have been caused by an actual fault. Software errors are similar to memory faults and 
can be injected by direct modification of the data segment. Software faults are injected by 
corrupting the code segment of a program, while the program resides in secondary storage 
or after it has been loaded into primary memory.
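Direct injection of a memory fault amounts to altering the contents of the target location. The Python sketch below shows a single bit flip. A `bytearray` stands in for the target memory, and the function name `flip_bit` is an assumption. A real SWIFI tool would write to the address space of the target process or to a message buffer instead.

```python
def flip_bit(memory: bytearray, address: int, bit: int) -> None:
    """Invert one bit of the byte at `address` (bit 0 is least significant)."""
    if not 0 <= bit < 8:
        raise ValueError("bit must be in the range 0..7")
    memory[address] ^= 1 << bit

# Hypothetical three-byte memory image.
memory = bytearray(b"\x00\x0f\xff")
flip_bit(memory, address=1, bit=7)  # 0x0f becomes 0x8f
print(memory.hex())                 # 008fff
```

Applying the same flip a second time restores the original contents, which is one simple way to model the removal of a transient fault.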
Several techniques can be used to trigger fault injection. A simple method is to invoke 
a fault injection routine after a specified amount of time has elapsed. The time can be 
measured by a hardware or a software timer. Triggering can also be dependent upon certain 
events. For instance, a software trap instruction can be inserted into the target program to 
invoke the injection routine before a particular instruction in the target program is executed. 
Hardware traps may be used to invoke the injection routine when a hardware-observed event 
occurs (e.g., when a particular memory location is accessed). The fault injection routines
may exist as a part of the target program, as a separate program, or as part of the operating 
system.
Software-implemented fault injection is attractive because the lack of a need for expen­
sive hardware results in much lower costs when compared to hardware-implemented methods. 
Also, unlike hardware-implemented fault injection, which is difficult to gear toward specific 
workload areas, software fault injection can be targeted toward user applications, the oper­
ating system, or both. If the target is a user application, the fault injector is inserted into 
the user application or can be an extra layer between the user application and the operating 
system. If the target is the operating system, the fault injector has to be embedded in the 
operating system, because it is very difficult to add an extra layer between the machine and 
the operating system.
Although the software approach is flexible, it has some restrictions. First, the approach 
cannot inject faults into locations not accessible to software. Research shows that approxi­
mately 1/3 of the errors produced in logic-level fault injections cannot be emulated through 
the software approach [Czeck91]. Second, the software instrumentation may disturb the 
workload running in the target system and even change the structure of original software, 
although careful design of the injection environment can minimize the perturbation to the 
workload. Concern about workload perturbation usually leads to avoiding the use of per­
manent fault models that require frequent invocation of the injection routine. Third, the 
poor time resolution of the approach may cause fidelity problems. For long latency faults, 
such as memory faults, the low time resolution may not be a problem. For the short latency 
faults, such as bus and CPU faults, the approach may fail to capture the error behavior
(e.g., propagation). This problem can be solved by using a hardware monitor, i.e., the hybrid
approach. The hybrid approach combines the versatility of software-implemented fault
injection and the accuracy of hardware monitoring. It is well suited for measuring extremely 
short latencies. However, the hardware monitoring involved in this approach can decrease 
flexibility (e.g., limited observation points and buffer size of the monitor) and increase cost.
In recent years, interest in developing software-implemented fault injection tools has 
increased. Several environments have been published in the literature: FIAT [17], FER­
RARI [19], FTAPE [20], DEFINE [21], DOCTOR [22], Xception [23], and SOFIT [24]. 
Table 2.3 lists features of these tools, which will be discussed in the following subsections.
2.2.1 FIAT
A number of fault injection studies at Carnegie Mellon University have centered around 
FIAT (Fault Injection Automated Testing), a software-implemented fault injection environ­
ment [17], [18], [29]. The FIAT hardware implementation consists of IBM RT PCs, connected 
by a token ring network. The FIAT software structure is divided into two parts: the fault 
injection manager (FIM) and the fault injection receptor (FIRE). FIM is a global control 
program responsible for all phases of the experiment. FIRE, under the control of FIM, col­
lects the experimental results and sends appropriate information to FIM for off-line analysis. 
Figure 2.7 shows the process of a typical fault injection experiment.
FIAT has been used to study the impact of faults on the application workload level [18]. 
Two representative programs, a matrix multiplication task and a selection sort task, were 
chosen as application workloads. To achieve fault tolerance, each task was executed on two 
different processors, and the results were compared.
Table 2.3: Comparison of Software-Implemented Fault Injection Tools
Tool            Target system      Fault type                                   To evaluate
FIAT [17]       PC-RT              CPU, memory, communication                   detection, latency, recovery
FERRARI [19]    SPARC              CPU, memory, bus, control-flow               detection, latency
FTAPE [20]      Tandem             CPU, memory, disk controller                 detection, latency, recovery
DEFINE [21]     Sun                CPU, memory, bus, communication, software    detection, propagation
DOCTOR [22]     HARTS              CPU, memory, communication                   detection, recovery
Xception [23]   Parsytec/PowerPC   CPU, memory, bus                             detection, latency
SOFIT [24]      SPARC              CPU, memory, bus                             detection, latency
Figure 2.7: Typical Fault Injection Experiment in FIAT
Three fault types were injected in the
experiment: zero-a-byte, set-a-byte, and 2-bit compensating. The zero-a-byte and set-a-byte faults
set eight consecutive bits anywhere within a 32-bit word to 0 or 1, respectively. The 2-bit compensating
fault complements two bits in a word such that the parity code would not detect it as an error.
Faults were injected into all locations within a workload, with a total of over 130,000 faults 
injected.
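The three fault types can be expressed as simple bit operations on a 32-bit word. The following Python sketch is an illustration under stated assumptions, not FIAT's implementation: the function names are hypothetical, and a single parity bit computed over the whole word is assumed for the parity check.

```python
MASK32 = 0xFFFFFFFF

def zero_a_byte(word: int, start_bit: int) -> int:
    """Clear eight consecutive bits beginning at `start_bit` (0..24)."""
    if not 0 <= start_bit <= 24:
        raise ValueError("start_bit must be in the range 0..24")
    return word & ~(0xFF << start_bit) & MASK32

def set_a_byte(word: int, start_bit: int) -> int:
    """Set eight consecutive bits beginning at `start_bit` (0..24)."""
    if not 0 <= start_bit <= 24:
        raise ValueError("start_bit must be in the range 0..24")
    return (word | (0xFF << start_bit)) & MASK32

def two_bit_compensating(word: int, bit_a: int, bit_b: int) -> int:
    """Complement two distinct bits, leaving the word's parity unchanged."""
    if bit_a == bit_b or not (0 <= bit_a < 32 and 0 <= bit_b < 32):
        raise ValueError("need two distinct bit positions in 0..31")
    return word ^ ((1 << bit_a) | (1 << bit_b))

def parity(word: int) -> int:
    """Parity of the word: 1 if the number of 1 bits is odd, else 0."""
    return bin(word).count("1") % 2

word = 0x12345678
faulty = two_bit_compensating(word, 0, 5)
print(hex(zero_a_byte(0xFFFFFFFF, 8)))   # 0xffff00ff
print(hex(set_a_byte(0x00000000, 4)))    # 0xff0
print(hex(faulty), parity(word) == parity(faulty))
```

Because each complemented bit toggles the parity once, complementing two bits leaves it unchanged, which is why a parity code cannot detect this fault.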
Results showed that there are a limited number of system-level fault manifestations. The 
mean error detection coverage for different workloads and fault types is approximately 50% to 
60%. Error detection latency was found to follow a normal distribution. This result conflicts 
with those presented in [6], [7], where the latency was shown to follow either gamma, Weibull, 
or log-normal distributions. This difference may be explained by the differences in the 
experimental environment and detection mechanisms. In [6], [7], the hardware-implemented 
fault injection technique is used, and the resolution of detection time is on the order of
milliseconds, while the time resolution of the software-implemented FIAT is on the order of 
seconds, which may skew the results.
2.2.2 FERRARI
FERRARI (Fault and ERRor Automatic Real-time Injector), another software-implemented 
fault injection environment, was recently developed at the University of Texas [19]. The 
purpose of FERRARI is to evaluate complex systems by emulating most hardware faults 
in software. It is implemented on SPARC workstations in an X-window environment. It 
consists of four software modules: the initializer and activator, the user information module,
the fault and error injector, and the collector and analyzer. These four modules are
coordinated by a manager module.
The initialization and activation module prepares the target program for fault injection 
by extracting information, such as the starting address, the program size, the execution 
time, the output of an error-free program, and the addresses used by the program. The user 
information module receives experiment parameters provided by the user. These parameters 
include:
• duration, location, time, and bit position of the fault,
• user-specified or pseudo-random selection of the fault,
• fault type (XOR, set, or reset a bit; zero or set a byte),
• fault and error classes (hardware, control flow, user-defined), and
• dependability properties to measure (coverage, latency).
The fault and error injection module is responsible for injecting different types of transient 
or permanent faults, such as address line faults, data line faults, and faults in condition 
code flags. The data collection and analysis module records experiment results, such as 
information about error detection, error latency, and failures, and it determines statistics of 
these measures at the end of the experiment.
The main fault and error injection mechanism involves using software traps. At the 
appropriate time or program location, the program to be injected is trapped. The selected 
fault or error is then injected. For transient errors, the current instruction is executed and 
then the injected error is removed. For permanent faults, the injected fault is not removed. 
Instead, the program is trapped for the next n instructions, where n is the duration of the 
fault in instruction cycles. Table 2.4 lists the fault and error classes that FERRARI can 
inject.
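The difference between a transient fault and a fault that lasts for n instructions can be shown with a toy model. This sketch does not reproduce FERRARI's trap mechanism; it only assumes that a data-line fault XORs a mask into every operand fetched while the fault is active. The function name and parameters are hypothetical.

```python
def run_with_fault(operands, fault_start, duration, xor_mask):
    """Sum the operands, corrupting those fetched while the fault is active.

    duration == 1 models a transient fault (removed after one instruction);
    duration == n models a fault that persists for the next n instructions.
    """
    total = 0
    for step, value in enumerate(operands):
        if fault_start <= step < fault_start + duration:
            value ^= xor_mask  # the fault is active for this instruction
        total += value
    return total
```

With operands `[1, 2, 3, 4]` and mask `0x8`, a transient fault at step 1 corrupts only the second operand, while a duration of 2 also corrupts the third.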
Table 2.4: Fault and Error Classes Supported by FERRARI
Control Flow:
  Control bit errors
  Data line when opcode fetch
  Instruction type faults
  Control flow errors
  Program counter error
  Illegal branches
  Wrong branches
  Condition code flag
Hardware:
  Address line when opcode is fetched
  Address line when operand stored
  Wrong register
  Data line when loading
  Data line when storing
  Data line when opcode is fetched
  Data byte enable store
  Corrupted register
  Condition code flag
To demonstrate the capabilities of FERRARI and to study the behavior of the target 
system under faulty conditions, over 600,000 fault injection runs were conducted on SUN4 
SPARC workstations under different applications. Results showed that the error coverage
is highly dependent on the fault type. The highest coverage was obtained when errors were 
injected into the task memory image. This is because the injected errors are likely to be 
exercised repeatedly if the corrupted instructions are in a loop. An important finding is that 
a considerable number of undetected errors are those that corrupted input/output routines 
and system libraries. These routines may tend to be ignored when error detection techniques 
are embedded in the user code.
2.2.3 DEFINE
DEFINE is a UNIX-based distributed fault injection environment developed at the University
of Illinois [21]. Its predecessor, FINE [30], is a single-machine fault injection environment.
The significance of DEFINE is twofold. First, it can emulate software faults as well
as hardware errors. Second, it can trace fault propagation through software modules. The
software faults that can be injected by DEFINE include initialization (missing or incorrect),
assignment (missing or incorrect), condition check (missing or incorrect), and function
(incorrect) faults. Injectable hardware errors include CPU (ALU, shifter, opcode decoder, or
registers), memory (text segment or data segment), bus (address lines or data lines), and 
communication errors (missing messages or corrupted messages).
Figure 2.8: The DEFINE Environment
Figure 2.8 shows the DEFINE environment. DEFINE consists of a target system, a fault
injector, a software monitor, a workload generator, a controller, and several analysis utilities.
The target system is a group of connected machines, consisting of servers and clients. The
controller, fault injector, software monitor, and workload generator are running on another
machine (host machine) which is connected to the target system. The local fault injector
and message recorder are embedded in the kernel so that faults can be injected there and
their propagation can be monitored. Fault injection is implemented by modifying the system
trap handling routines and hardware clock interrupt handling routines, so the fault injector 
can be considered an extra layer between the operating system and the machine. The fault 
injector uses hardware clock interrupts to control the time of fault injection and activation, 
and uses software traps to inject all the faults except communication faults and memory 
faults in the data/stack segments. The software monitor traces the execution flow and key 
variables of the kernel. Software probes are inserted into functions in the kernel to record 
the execution flow and the values of arguments and key variables. The synthetic workload 
generator issues various system calls to activate injected faults. The distribution of generated 
system calls can be specified by users to emulate real workloads or to deliberately accelerate 
the activation of injected faults. The controller assigns experiment specifications to the
fault injector and the monitor, and it initiates experiments. The analysis utilities provide 
assistance in analyzing fault propagation. The target of the study is the UNIX kernel, a 
non-stop, highly parameterized, complex service program with high impact and a broad
spectrum of workloads.
Two experiments were conducted by applying DEFINE to investigate fault propagation 
and to evaluate the impact of various types of faults. The first experiment was on SunOS 
4.1.2 (on a SPARCstation IPC). Results showed that memory faults and software faults 
usually have a very long latency, while bus faults and CPU faults tend to crash the system 
immediately. Nearly 90% of detected errors are detected by hardware. About half (47%)
of the detected errors are data errors; these data errors are detected when the system
tries to access an area it has no privilege to access. In software fault propagation,
incorrect control flow is the major impact for the first level of propagation, while data
corruption is the major impact for the subsequent propagation. Analysis of fault propagation 
among the UNIX subsystems revealed that only about 8% of faults propagate to other UNIX 
subsystems. The second experiment was on six Sun workstations (one as server and the 
others as clients). Experimental results show that fault propagation from servers to clients 
occurs more frequently than from clients to servers. The majority of no-impact faults are 
dormant. The fault impact depends on the workload.
2.2.4 DOCTOR
DOCTOR is designed for validating dependability mechanisms on an experimental distributed
real-time system, HARTS [31], [32]. See Figure 2.9. It introduces a new fault type,
intermittent, in addition to permanent and transient faults. The interarrival time between
intermittent faults can be deterministic or can follow a specified exponential distribution.
Injectable errors include memory (code, global variables, or heap), communication (lost,
altered, or delayed messages), and processor errors (adder or multiplier).
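The two interarrival options for intermittent faults can be sketched as follows. This is an illustrative schedule generator, not DOCTOR code; the function name, its parameters, and the use of the mean gap to parameterize the exponential distribution are assumptions.

```python
import random


def injection_times(count, mean_gap, distribution="deterministic", seed=None):
    """Return `count` injection times measured from the start of the experiment.

    "deterministic": every gap equals mean_gap.
    "exponential":   gaps are drawn from an exponential distribution whose
                     mean is mean_gap (rate = 1 / mean_gap).
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(count):
        if distribution == "deterministic":
            gap = mean_gap
        elif distribution == "exponential":
            gap = rng.expovariate(1.0 / mean_gap)
        else:
            raise ValueError("unknown distribution: " + distribution)
        now += gap
        times.append(now)
    return times
```

Passing a fixed seed makes the exponential schedule repeatable, which is convenient when an experiment has to be rerun.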
Figure 2.9: The Relationship of DOCTOR Files
DOCTOR consists of the DOCTOR Experiment Generator (SEG) and the DOCTOR 
Control Modules (SCM). The SEG takes as input a user-supplied experiment description 
to drive fault injection experiments. The SCM consists of fault injection routines that 
will be included into executable files by the SEG. Memory errors are injected by changing 
the contents of the selected address. Communication errors are injected by modifying the 
communication protocols to mimic the desired behavior. Processor errors are injected by 
changing the assembly code during compilation.
Two experiments on HARTS were conducted to investigate the effect of intermittent 
message losses between two adjacent nodes and the effect of routing using failure data. In 
the first experiment, a model of communication between two nodes was developed to predict 
the effect of intermittent message losses. Experimental results showed that the predicted 
values of average round-trip delay, average number of attempts per message, and frequency 
of number of attempts matched the observed values very well. The second experiment 
investigated three routing methods with or without failure information. The first method uses
transmission time of a message on each link only. The second method considers transmission 
time and the average number of timeouts on each link. The third method uses the delivery 
time of test messages that are sent out by each node to its neighbors periodically. Results 
showed that none of the methods is best under poor traffic operating conditions.
2.2.5 Xception
Xception is a software-implemented fault injection environment that has been developed 
at the University of Coimbra in Portugal. Xception takes advantage of advanced debugging
and performance monitoring features that are present in many modern processors to inject
more realistic faults. The target application does not require modification or the insertion
of software traps, and trace mode execution, which decreases the execution speed, is not
necessary. Instead of using software trap instructions to control injection triggering,
hardware exceptions are used. Faults can be injected when the instruction or data at a specific
address is fetched or accessed. Because faults are triggered based upon accesses to specific
addresses, experiments are reproducible, in contrast to experiments in which the timing of
fault injections is based upon waiting a specified amount of time after an event.
The structure of Xception is shown in Figure 2.10. The environment consists of three 
main parts:
1. an injector module, which is linked with the kernel of the target system,
2. a library of functions, which are called by the user application to start fault injection,
and
3. the main module running on a host system, which implements the user interface for
fault definition, automatic fault injection, and collection of results.
Most of the code for Xception is executed on the host machine, which minimizes the
perturbation to the target system.
Figure 2.10: The Structure of Xception
Faults can be injected in the following locations:
• Instruction Execution Control Unit (IECU)
• Integer Unit (IU)
• Floating Point Unit (FPU)
• Memory Management Unit (MMU)
• Internal Data Bus (PDB)
• Internal Address Bus (PAB)
• General Purpose Registers (GPR)
• Condition Code Register (CCR)
• Memory
Fault injection can be triggered by the following events:
• Opcode fetch from a specified address
• Operand load from a specified address
• Operand store to a specified address
• After a specified time since start-up
• A combination of the above fault triggers
For each fault, a fault mask is specified. For bits in the fault mask which are set to ‘1’, 
several bit-level operations can be used: stuck-at-zero, stuck-at-one, bit-flip, and bridging.
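The effect of a fault mask can be sketched with plain bitwise operations. This is an illustration only, not Xception's interface; the function name and the 32-bit default width are assumptions. Bridging is omitted because its exact semantics are not specified in the text above.

```python
def apply_fault(value, mask, operation, width=32):
    """Apply a bit-level fault to the bits of `value` selected by `mask`."""
    limit = (1 << width) - 1
    if operation == "stuck-at-zero":
        result = value & ~mask      # selected bits forced to 0
    elif operation == "stuck-at-one":
        result = value | mask       # selected bits forced to 1
    elif operation == "bit-flip":
        result = value ^ mask       # selected bits complemented
    else:
        raise ValueError("unsupported operation: " + operation)
    return result & limit
```

Bits whose mask position is 0 are never modified, so a single mask can describe single-bit and multiple-bit faults alike.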
Xception has been implemented on a Parsytec parallel machine based on the PowerPC 
601 processor. Experiments showed that, for certain processor functional units, up to 73% of
injected faults produced incorrect results that went undetected.
2.2.6 SOFIT
SOFIT is a software-implemented fault injection tool developed at Texas A&M University 
[24]. The tool is able to inject faults based on many of the same fault models used by 
other software-implemented tools. In contrast to the other tools, SOFIT is object-oriented, 
which facilitates customization of the tool for different fault injection applications. For 
instance, a different set of injection mechanisms can be assembled and used depending on the 
desired measurements. Also, because the tool is constructed in an object-oriented manner, 
portability is enhanced. SOFIT has been implemented on SPARC and NCUBE computers.
3. DESCRIPTION OF BENCHMARK
Currently there is no accepted benchmark for fault tolerance. One of the main goals of 
this research is to develop a benchmark for fault-tolerant systems. Such a benchmark should 
(1) provide a point of comparison among different systems and (2) yield insight into the 
operation of specific fault-tolerant strategies and implementations, as well as the complete 
systems themselves.
Most benchmarks are performance benchmarks, which can be defined as “means of estimating
computer performance by measuring experiments on the computer concerned” [33].
Fault tolerance benchmarks can be defined in a similar manner as means of estimating the
fault tolerance of a computer by measuring experiments on the computer concerned. Common
usage often refers to “the benchmark” as the workload program that is executed for
performance benchmarks. However, a complete benchmark must also include a procedure 
for executing the workload program and measuring results.
In constructing a good general benchmark, several characteristics are essential. A benchmark
must be relatively simple, both in its implementation and execution. The same objectives
(providing means of comparisons and insight) can be achieved through simulation
or analytical methods. However, a benchmark trades off simplicity in the evaluation for
simplicity in the final result. “Simplicity in the final result” in this context refers to the 
use of one number (or a few numbers) that characterizes the overall system in a black-box 
sense. Of course, further study of the simple benchmark numbers may lead to more detailed 
conclusions, but the direct benchmark result is simple and thus facilitates easy comparisons 
of results for different systems.
Since a benchmark is used to compare different systems, the benchmark must be capable 
of being implemented on all systems of interest. Thus, a benchmark must also be portable. 
This concept of portability is related to the concept of simplicity. Assuming that benchmarks 
are specified in terms of functionality, rather than system-specific code, all benchmarks 
are portable, in the sense that all benchmarks can be implemented on different systems 
if sufficient effort is expended. However, true portability refers to the simplicity of the 
required implementation effort. The main obstacle to portability is the differences inherent 
in different computer systems. Indeed, these differences are the very parts of the system
for which benchmarking is desired! For a benchmark of fault tolerance, the differences 
due to dissimilar fault-tolerant mechanisms are especially significant because these fault- 
tolerant mechanisms can vary greatly among systems. Thus, special care needs to be given 
to maximizing the portability of the benchmark.
Simplicity and portability are qualities that are desirable in all benchmarks, be they 
performance or fault tolerance benchmarks. Although there are many similarities between 
performance and fault tolerance benchmarks, some significant differences exist. The most 
obvious is the need to target the fault-tolerant mechanisms in the system under test, since 
the benchmark is intended to evaluate the operation of the fault-tolerant mechanisms and
not the performance of the traditional, non-fault-tolerant portion of the system. Thus, fault
injection is essential.
Fault injection is the insertion of a change into a system in order to emulate the occurrence 
of a fault based upon a specific fault model. A good overview of fault injection methods and 
tools is provided by Tang [3]. Chapter 2 also contains a summary of some fault injection 
methods and tools, especially those that rely on software-implemented fault injection, which 
is the method of fault injection used by the benchmark proposed in this thesis.
Another difference between performance benchmarks and fault tolerance benchmarks is 
the type of measurement that must be made. For performance benchmarks, the measurement 
usually consists of the time needed to execute the workload program or the number of 
iterations of the workload program that can be executed within a given amount of time. 
These types of measurements do not suffice for fault tolerance benchmarks, which must also 
reflect the operation of the fault-tolerant mechanisms. This is perhaps the greatest challenge 
in developing a good fault tolerance benchmark: the selection of appropriate metrics.
This chapter presents the details of the proposed fault tolerance benchmark in Section 3.1. 
The benchmark is based on the FTAPE fault injection program, which is described in
Section 3.2. Section 3.3 discusses the repeatability of the benchmark, and Section 3.4 discusses
methods for observing the extent of the error propagation caused by the benchmark’s fault
injection.
3.1 Proposed Benchmark Specification
The fault tolerance benchmark proposed in this thesis consists of the FTAPE fault injection
tool, which is used to provide the workload and fault injection, and the estimation
of the fault tolerance of a system in terms of two metrics:
1. number of catastrophic incidents, and
2. performance degradation.
Catastrophic incidents are events which cause the entire computer system to become unusable.
Some examples of catastrophic incidents are operating system panics and hangs and
failed error recovery attempts that lead to an unusable system configuration (e.g., the failed
recovery of a second failed CPU in a TMR configuration). Since these incidents reflect a
significant failure of a system’s fault tolerance, the measurement of these incidents is
important. Performance degradation is the amount of additional time required by an application
program due to the presence of faults. Usually, the detection and correction of errors (i.e.,
the effects of faults) incurs a time overhead. This time overhead might result from
1. the overhead of error recovery routines (e.g., the time needed to copy memory
information to a memory which has undergone error recovery) and
2. the loss of resources such as CPUs or disks, which decreases the available compute or
I/O bandwidth.
Note that this performance degradation metric does not include the overhead due to the 
inclusion of additional fault-tolerant hardware, which might increase circuit capacitances 
and lead to longer delay times.
The FTAPE tool is able to generate CPU, memory, and I/O activity based upon user
specifications and inject CPU, memory, and I/O faults according to a specified injection 
strategy. A more detailed description of the FTAPE tool is given in Section 3.2.
Currently the benchmark has been completely ported to three Tandem prototype systems:
(1) TMR Prototype A, (2) TMR Prototype B, and (3) Duplex Prototype C. These
machines are regarded as prototype machines because they are either very early production 
machines (Prototype A) or laboratory validation machines (Prototypes B and C). The first 
two machines are based on the Tandem Integrity S2 architecture, which is summarized in 
Section 5.1.1. The third machine is based on the Tandem ServerNet architecture, which is
summarized in Section 5.1.2.
[Figure 3.1 timeline: a workload run with faults from Tstart1 to Tend1, showing injected
faults, times T1 and T2, and recovery actions R1 and R2; the time to clear latent faults;
and a workload run without faults from Tstart2 to Tend2.]
The complete benchmarking procedure consists of multiple executions of the above scenario.
Figure 3.1: One “run”
The general method for the benchmark is illustrated in Figure 3.1, which shows a single 
“run.” Each run consists of two executions of the same workload. During the first execution 
of the workload, faults are injected, with parameters based on the “stress-based injection 
method” (described in Section 4.1), which injects faults into the system areas (e.g., CPU,
memory, I/O) experiencing the greatest workload activity. For example, Figure 3.1 shows
that faults F1, F2, and F3 are injected as the workload program executes. An error is detected
at time T1, and a recovery action R1 is initiated and completes at time T2. After recovery
succeeds, faults are again injected until another error detection and recovery. This cycle of
faults, detections, and recoveries is repeated until the workload completes. It is possible that
a recovery action is initiated before Tend1 but does not finish until after Tend1; in this case,
Tend1 is still used in the calculation of the performance degradation. After this first execution
of the workload, latent faults are cleared from the system by forcing all components that have 
been injected to undergo recovery. Then, the workload is executed a second time, but this 
time without faults. This dual workload execution allows the measurement of performance 
degradation, which is calculated as
$$\text{Degradation} = \frac{(T_{end1} - T_{start1}) - (T_{end2} - T_{start2})}{n}$$
where n is the number of faults injected. To compare the two workload execution times, 
the time overhead of the fault injector for both runs of the workload program must be 
accounted for. During the second execution of the workload program (i.e., without faults), 
this is accomplished by executing the fault injector without actually injecting faults (i.e., 
this can be viewed as injecting a 0-bit fault). If a catastrophic incident occurs, then the run 
is attempted again to achieve the same number of runs without catastrophic incidents on 
each system from which the performance degradation can be calculated. Note that a second 
execution of the same run may not produce exactly the same result because the faults may
not be injected at exactly the same times relative to the workload, and thus fault activation
and error propagation will be different. This explains why a second execution of the same run
does not necessarily recreate a catastrophic incident. A discussion of repeatability is given
in Section 3.3.
Three different workloads are used: (1) a CPU-intensive workload, (2) an I/O-intensive
workload, and (3) a mixed workload with CPU, memory, and I/O activity. Twenty runs are
executed for each workload. If a catastrophic incident occurs, the run is repeated. The
total number of catastrophic incidents for all runs of the same workload is counted. The
measured performance degradation for all runs of the same workload is averaged.
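The bookkeeping for one workload can be sketched as follows. The record layout and function names are hypothetical, and the per-run degradation is assumed here to be the extra execution time divided by the number of injected faults; that normalization is an assumption of this sketch, not a statement of the benchmark's definition.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    time_with_faults: float      # first execution of the workload
    time_without_faults: float   # second execution, no faults injected
    faults_injected: int
    catastrophic: bool = False   # True if the run ended in a catastrophic incident


def degradation(run):
    # ASSUMPTION: extra execution time per injected fault.
    return (run.time_with_faults - run.time_without_faults) / run.faults_injected


def summarize(runs):
    """Return (catastrophic incident count, mean degradation of completed runs)."""
    incidents = sum(1 for run in runs if run.catastrophic)
    completed = [run for run in runs if not run.catastrophic]
    average = mean(degradation(run) for run in completed) if completed else None
    return incidents, average
```

Runs that end in a catastrophic incident are counted but excluded from the average, mirroring the rule that such runs are repeated until the required number of complete runs is available.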
The proposed benchmarking procedure is comprised of two phases. Phase 1 verifies that 
the machine tolerates expected faults correctly and also evaluates the impact of the fault- 
tolerant implementation on performance by injecting faults that should be tolerated by the 
system. Phase 2 evaluates the reaction of the system to fault conditions that it may not be 
designed specifically to handle and thus demonstrates the degree of fault-tolerance beyond 
what is expected. In general, the difference between the two phases lies in the location of 
injected faults. For phase 1, faults are injected into only one component of a redundant set 
(e.g., only one CPU in a TMR set). For phase 2, faults are injected into all components of 
a redundant set.
It is important to note that the rate of fault injection is much higher than the rate 
of operational fault occurrence. Thus, the fault-tolerant mechanisms for these systems are 
stressed much more than with a more realistic fault rate. However, this higher fault rate is 
applied to all tested machines and is needed to produce results in a reasonable amount of
test time.
3.1.1 Phase 1
For phase 1 injections, three types of results were obtained:
Error/Fault Ratio: The ratio of the number of errors detected to the number of faults 
injected. The number reflects the extent to which the fault-tolerant detection mechanisms 
were exercised. A low number indicates that the injections need to be performed more 
intelligently to obtain more fault-tolerant activity. The recovery due to one error detection 
may correct more than one error, thus lowering the error/fault ratio. For instance, if multiple 
faults are injected into one memory in the TMR set, then the first error detection will cause 
recovery for the entire memory, thus correcting all errors in that memory.
Performance Degradation: The fault injection run is repeated without injecting faults. 
The time for the run with faults (Twf) and the time without faults (Twof) are measured. 
Catastrophic Incidents: The number of catastrophic incidents.
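The way a single recovery can lower the error/fault ratio can be shown with a toy model. The event format and function name are hypothetical; the model only assumes that accessing a module with outstanding faults produces one error detection, and that the resulting recovery corrects every fault in that module.

```python
def error_fault_ratio(events):
    """events: sequence of ("inject", module) or ("access", module) tuples."""
    pending = {}              # module -> number of uncorrected faults
    injected = detected = 0
    for kind, module in events:
        if kind == "inject":
            pending[module] = pending.get(module, 0) + 1
            injected += 1
        elif kind == "access" and pending.get(module, 0) > 0:
            detected += 1         # one error detection ...
            pending[module] = 0   # ... and recovery corrects the whole module
    return detected / injected if injected else 0.0
```

Three faults injected into the same memory before it is accessed give a ratio of 1/3, whereas detecting each fault before the next injection gives a ratio of 1.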
3.1.2 Phase 2
During phase 2, faults are injected which may not necessarily be tolerated by the fault- 
tolerant system. The purpose of these injections is to determine how well the system under 
test can handle faults which it is not specifically designed to tolerate. This is an additional 
measure which can be used in a benchmark comparison of different fault-tolerant systems. 
Also, the study of the reaction to phase 2 faults may yield insights into the tested machines. 
For Prototypes A and B, faults are injected into all three CPUs in the TMR configuration. 
For Prototype C, faults are injected into the multiple ASICs which comprise the high-speed 
router network connecting the processors and I/O devices.
3.2 FTAPE
The FTAPE (Fault Tolerance and Performance Evaluator) fault injection program forms
the core of the proposed benchmark. The FTAPE program generates the workload activity
and fault injection needed to test a fault-tolerant system. The workload consists of a mixture 
of CPU, memory, and/or I/O activity based on user specifications, and the faults are CPU,
memory, and I/O faults which are injected according to a specified injection strategy. A
general block diagram of the program is given in Figure 3.2. The program consists of
three main parts: the fault injector, the workload generator, and Measure (which is used
to implement stress-based injection, described in Section 4.1). The fault injector and
workload generator can be configured with several parameters (as shown in Figure 3.2) to
produce different instantiations of the benchmark program. A detailed description of each
part of the program follows.
[Figure 3.2 contents. Fault injection specs: injection strategy (random or stress-based);
cpu parameters (register set); mem parameters (kernel/user space; text/data/heap/stack
space); io parameters (disk controller error set). Workload specs: composition (relative
mix of cpu, mem, and io functions); level of data flow between functions; intensity of
function. Measure: normalized workload activity.]
Figure 3.2: Benchmark Program Based on FTAPE
3.2.1 Fault injector
The purpose of the fault injector is to inject faults that will stimulate fault-tolerant 
activity. The following important issues concerning the fault injector are addressed:
1. Faults must be injected throughout the entire system under test.
2. Software-implemented fault injection (SWIFI) is used because it permits a great deal of 
controllability over fault parameters and allows a wide variety of faults to be injected. 
Because no additional hardware is required, SWIFI can offer more portability than 
other fault injection methods, although low-level fault injection mechanisms will still 
need to be designed for each system.
3. To inject a fault, the system’s implemented fault tolerance must be bypassed. For 
instance, injections into CPU registers or memory locations which are TMR-replicated 
must corrupt only a single component in the TMR configuration. Most SWIFI tools 
have not been implemented on fault-tolerant systems and therefore do not do this.
4. Portability is required, since the fault injector is part of a benchmark program that 
compares different machines. A large degree of portability is attained by (1) using 
the SWIFI fault injection method and (2) separating the fault injector into a directly 
portable, high-level portion and a small system-dependent, low-level portion (see
Figure 3.3).
Figure 3.3: Fault Injector Portability Interface
Since the benchmark is used to evaluate the overall system fault tolerance, faults are
injected throughout the entire system under test. FTAPE partitions the system into three
main areas: cpu, mem, and io. These same areas are targeted by the workload generator in
order to increase the chance for the injected faults to be propagated by the workload. Each
area requires a different fault injection method; the three methods are described below.
Injection method 1: inject_cpu: The CPU fault models include single/multiple bit-flip
and zero/set faults in CPU registers. Faults are injected into CPU registers, specifically
the saved general purpose and floating point registers (saved registers are those registers
whose values must be preserved across procedure calls), the program counter, the
global pointer, and the stack pointer. These registers were chosen because faults in these
registers have a higher chance of propagation compared to faults in other registers
(e.g., temporary registers). The method of injection involves the use of the memory
fault injector. A copy of the CPU register set is obtained and placed in memory. The
memory fault injector is then used to corrupt the memory location corresponding to a
specific register. The copy of the CPU register set in memory is then loaded back into
the CPU, thus placing the corrupted contents into the appropriate CPU register.
Injection method 2: inject_mem: The memory fault models include single/multiple bit- 
flip and zero/set faults in memory. Faults are targeted at heavily used parts of memory. 
Faults are injected by directly modifying the contents of selected memory locations. 
The method of injection bypasses the normal memory access mechanism. A special 
device driver is created. This device driver is able to corrupt the contents of a specific
memory location based on a given XOR fault mask (i.e., an XOR fault mask of
0x00000001 would flip bit 0 of the memory location). For TMR-replicated memory, 
only one component in the TMR configuration is corrupted.
Injection method 3: inject_io: The I/O fault models include valid SCSI and disk errors.
Faults are injected into a mirrored disk system. The method of injection involves using 
a test portion of the disk driver code that sets error flags for the next driver request. 
Thus, the next request activates the error handler in the driver code, and one half of 
the disk mirror may be disabled.
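The relationship between the memory and CPU injection methods can be sketched in a few lines. This is a user-level model only: memory is a Python list of words and the register set is a dictionary, whereas the real injector works through a dedicated device driver. The `save_area` parameter and the alphabetical register layout are assumptions made for illustration.

```python
def inject_mem(memory, address, xor_mask):
    """Corrupt one memory word by XORing it with the fault mask."""
    memory[address] ^= xor_mask


def inject_cpu(registers, name, xor_mask, memory, save_area=0):
    """Corrupt one register by way of the memory fault injector."""
    names = sorted(registers)
    # 1. Place a copy of the register set in memory.
    for offset, reg in enumerate(names):
        memory[save_area + offset] = registers[reg]
    # 2. Corrupt the memory word that corresponds to the chosen register.
    inject_mem(memory, save_area + names.index(name), xor_mask)
    # 3. Load the (now corrupted) copy back into the register set.
    for offset, reg in enumerate(names):
        registers[reg] = memory[save_area + offset]
```

An XOR mask of `0x00000001` flips bit 0 of the target, and the same mask mechanism serves both injection methods because the CPU method reuses the memory injector.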
Figure 3.4: Prototype C Fault Injection
These three fault-injection methods have been implemented on the three prototype systems,
with the exception of inject_io on Prototype C. The Tandem Duplex Prototype C is a
new system which has been recently released, and for this system, the built-in fault injection
mechanisms (see Figure 3.4) have also been utilized. For the built-in fault injection, faults 
are injected by means of a set of XOR gates that were inserted into the original design. One 
input of each XOR gate is connected to a decoder, which then activates one gate. The logical 
value of the node on which the gate is inserted is then effectively flipped. Thus, activation of 
an XOR gate is accomplished by setting the appropriate value into the fault injection register 
which drives the decoder. Faults are injected into the ASICs that comprise the interconnect
between the processors and the I/O devices. Three types of ASICs exist. The router ASIC
(ASIC R) forms the backbone for the interconnect. ASIC M serves as the interface between
the processors and the routers, and ASIC I serves as the interface between the I/O devices
and the routers.
Portability: Since one of the main purposes for the FTAPE program is the comparison of 
different fault-tolerant machines, portability of the program is an important issue. The 
program software has been written based on the UNIX operating system. Thus, the software 
code should port easily to most UNIX-based systems. The implementations on the prototype 
machines required modifying a total of 672 of the 9172 lines of C-language code. 
However, certain functions, such as the low-level fault injection mechanism, 
will have to be written specifically for each machine. The fault injector software has 
been written to facilitate integration of these functions once they are identified. The 
functions are: (1) a virtual to physical memory address translation function, 
(2) a function returning the virtual addresses for the text and data regions for a specified
process, and (3) a low-level fault injector. Also, the three injection methods 
described in this section refer to specific types of CPU registers (e.g., a global pointer) and I/O 
devices (e.g., SCSI) which might not be defined for all machines. These specifications would 
need to be generalized to cover all machines, but they are sufficient to describe the current 
implementations.
Figure 3.3 shows the fault injector code, which has been separated into a high-level part 
and a system-dependent low-level part. The high-level code, which is directly portable to 
most UNIX systems, selects fault parameters (such as fault time and location) and places 
those values into the fault injector parameter interface block. The low-level fault injector, 
which is implemented by a device driver dedicated to fault injection, uses the values in the 
interface block to inject the actual fault into the appropriate system component. The system 
under test is viewed as being composed of logical components (CPU, memory, and I/O). By 
viewing the system in this way, the same fault sets can be used 
for different systems. This separate view of logical and physical components is supported by 
the separation of the fault injector into high-level and low-level parts, as shown in Figure 3.3. The 
high-level fault injector code injects faults into logical components, while the system-dependent, 
low-level code determines how injections into logical components map to injections into 
physical components.
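The separation into a portable high-level part and a system-dependent low-level part can be sketched as follows. This is only an illustration under stated assumptions: the field layout of the parameter interface block, the names fault_params, machine_port, inject_fault, and demo_low_level, and the use of a function-pointer table are all hypothetical and are not taken from the FTAPE source. In the actual tool the low-level part is a device driver.

```c
#include <stdint.h>
#include <stddef.h>

/* Logical components seen by the high-level injector. */
typedef enum { TARGET_CPU = 0, TARGET_MEM = 1, TARGET_IO = 2, TARGET_COUNT } fault_target;

/* Hypothetical layout of the fault injector parameter interface block. */
typedef struct {
    fault_target target;   /* logical component to corrupt */
    double       time_sec; /* injection time relative to workload start */
    uint64_t     location; /* register number, address, or device id */
    uint32_t     xor_mask; /* bits to flip at the location */
} fault_params;

/* A low-level injector maps a logical fault onto one physical machine. */
typedef int (*low_level_injector)(const fault_params *p);

/* Per-machine table: one entry per logical component, NULL if unsupported. */
typedef struct {
    const char        *machine;
    low_level_injector inject[TARGET_COUNT];
} machine_port;

/* Portable high-level step: fill the block, then hand it to the port. */
int inject_fault(const machine_port *port, fault_target target,
                 double time_sec, uint64_t location, uint32_t xor_mask)
{
    fault_params p = { target, time_sec, location, xor_mask };
    if (port->inject[target] == NULL)
        return -1;                      /* no injector for this component */
    return port->inject[target](&p);
}

/* Stand-in low-level routine; a real port would call its device driver. */
int demo_low_level(const fault_params *p)
{
    return (int)p->target;
}
```

Porting to a new machine would then amount to supplying a new machine_port table, while inject_fault and the fault sets stay unchanged. A NULL entry models a machine on which one injection method is not implemented.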
The device driver injects faults in such a way as to circumvent the normal fault-tolerance mechanisms. 
For example, memory faults in a TMR system are written to a single memory as opposed to 
all three. The functions other than the low-level fault injector should be available for most 
UNIX-based systems and simply need to be identified. These functions may be available
either as an option in the /proc file system or as a privileged function in the kernel, which 
can be accessed through a device driver.
3.2.2 Workload Generator
The main purpose of the workload generator is to provide an easily controllable workload 
that can propagate the faults injected by the fault injector. The workload is synthetic to 
allow easy control of the workload, based on a few parameters. The same areas that are used 
by the fault injector (cpu, mem, and io) are targeted for workload activity. The workload is 
composed of a mixture of the following three functions, each of which exercises one of the 
three main system areas intensively:
Function 1: workload_cpu This function is CPU-intensive. It consists of repeated additions, 
subtractions, multiplications, and divisions for integer and floating point variables. 
These operations are performed in a loop containing conditional branches. Memory 
accesses are limited by using CPU registers as much as possible.
Function 2: workload_mem This function is memory-intensive. A large memory array 
is created, and locations in this array are repeatedly read from and written to in a 
sequential manner. An attempt is made to force accesses to the physical memory, for 
instance, by making the size of the array larger than the size of the data cache.
Function 3: workload_io This function is I/O-intensive. A dummy file system is created 
on a mirrored disk system. Opens, reads, writes, and closes are repeatedly performed.
The three types of workload functions are assembled into a workload process. Figure 3.5 
illustrates how this is done. The sequence of workload functions is randomly chosen. The
Workload process (sequence of functions over time): C M C C I C C
Workload characteristics:
1. Sequence of functions is random
2. Frequency of functions is specified
   (i.e., 90-5-5 means cpu() is called 90% of the time; mem() is called
   5% of the time; etc.)
Functions:
C: cpu() {
       for (i=0; i<maxloops; i++) {
           /* Do CPU-intensive activity */
           /* -- arithmetic operations */
       }
   }
M: mem() {
       for (i=0; i<maxloops; i++) {
           /* Do memory-intensive activity */
           /* -- array accesses */
       }
   }
I: io() {
       for (i=0; i<maxloops; i++) {
           /* Do I/O-intensive activity */
           /* -- dummy file accesses */
       }
   }
Figure 3.5: Assembling a Workload Process from Functions
frequency of each function type is specified before the workload starts. For instance, if 
a CPU-intensive workload is desired, the workload is specified to be 90-5-5, which means 
that cpu() is executing 90% of the time, mem() 5% of the time, and io() 5% of the time. 
The maxloops variable in Figure 3.5 is chosen from a normal distribution with a mean 
of one second. Since the order of functions in the workload process is chosen randomly, 
specifying a different random seed will produce a different workload. However, the 
relative frequencies of the workload functions will remain the same. For example, a 90-5-5 
workload will remain CPU-intensive regardless of the random seed specified.
In practice, each function is usually specified to last the same amount of time (e.g., one 
second). Then the composition of each workload process can be specified to contain a specific 
proportion of each function. For instance, a workload that is CPU-intensive with a small
amount of memory and I/O activity can be specified to contain 90% of the cpu function and 
5% each of the mem and io functions. Such a workload would be said to have a composition 
of 90/5/5. When the workload process is executed, each function will be randomly chosen 
according to the corresponding probabilities.
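The random selection of functions according to a composition can be sketched as follows. This is an illustration only: the generator shown is the linear congruential example from the C standard, used here as a stand-in for whatever generator the workload program bundles, and the names workload_rng, pick_function, and build_sequence are hypothetical.

```c
/* Small portable generator (the LCG from the C standard's rand() example),
   so that a given seed produces the same sequence on every machine. */
typedef struct { unsigned long state; } workload_rng;

unsigned rng_next(workload_rng *r)
{
    r->state = (r->state * 1103515245UL + 12345UL) & 0xFFFFFFFFUL;
    return (unsigned)((r->state >> 16) & 0x7FFFUL);   /* 0..32767 */
}

typedef enum { FN_CPU = 0, FN_MEM = 1, FN_IO = 2 } workload_fn;

/* Pick the next function according to a composition such as 90/5/5.
   The three percentages are assumed to sum to 100. */
workload_fn pick_function(workload_rng *r, int cpu_pct, int mem_pct)
{
    int roll = (int)(rng_next(r) % 100u);             /* 0..99 */
    if (roll < cpu_pct)           return FN_CPU;
    if (roll < cpu_pct + mem_pct) return FN_MEM;
    return FN_IO;
}

/* Build a sequence of n function choices and count each type. */
void build_sequence(unsigned long seed, int cpu_pct, int mem_pct,
                    workload_fn *seq, int n, int counts[3])
{
    workload_rng r = { seed };
    counts[0] = counts[1] = counts[2] = 0;
    for (int i = 0; i < n; i++) {
        seq[i] = pick_function(&r, cpu_pct, mem_pct);
        counts[seq[i]]++;
    }
}
```

Calling build_sequence twice with the same seed yields the same order of functions, while a different seed changes the order but leaves the proportions close to the specified composition.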
Each function also reads and writes data from a special global interdependence array 
which forces data flow among functions. This is necessary to encourage fault propagation 
among functions. Otherwise, a data fault in one function is usually overwritten if the fault 
influences only variables local to that function and the system doesn’t detect the error before 
the end of the function. The amount of data flow among functions via the interdependence 
array can be controlled by resetting parts of the array to default values. This has the effect 
of providing some measure of control of data flow, and therefore fault propagation, through 
global variables.
The intensity is the amount of activity in each function relative to the maximum possible 
activity. The intensity of each function is decreased by substituting calls to the usleep() 
function for code that would otherwise be executed. Thus, a function with an intensity of 100% 
would contain no additional usleep() calls. The ability to control the intensity is useful 
for studying the impact of the workload activity level on fault propagation. For most of the 
workloads used in the experiments in Chapter 5, the intensity is varied from 100% to 20% 
over a period of about nine minutes1. Varying the intensity emphasizes the effect of high 
and low workload activity on the amount of fault propagation.
1This time period needs to be long enough for the Measure tool and fault injector to react to the 
corresponding workload activity.
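One way to realize intensity control is sketched below. It is a minimal illustration, not the thesis implementation: the even spreading of idle slots by an accumulator is an assumption, and run_at_intensity, demo_work, and demo_idle are hypothetical names. The idle step is passed in as a function so that the sketch stays portable; in the workload program described here it would be a call to usleep().

```c
/* Signature shared by the work step and the idle step. */
typedef void (*slot_fn)(void);

/* Run `slots` loop iterations at the given intensity (1..100 percent).
   An accumulator spreads the working slots evenly, so that
   intensity_pct percent of the slots do real work and the rest idle.
   Returns the number of slots that did real work. */
int run_at_intensity(int slots, int intensity_pct, slot_fn work, slot_fn idle)
{
    int acc = 0, worked = 0;
    for (int i = 0; i < slots; i++) {
        acc += intensity_pct;
        if (acc >= 100) {
            acc -= 100;
            work();          /* the activity the function normally performs */
            worked++;
        } else {
            idle();          /* on a UNIX system this would wrap usleep() */
        }
    }
    return worked;
}

/* Stand-ins that only count calls, to keep the sketch observable. */
int work_calls, idle_calls;
void demo_work(void) { work_calls++; }
void demo_idle(void) { idle_calls++; }
```

At an intensity of 100% the idle function is never called, which matches the statement that such a workload contains no additional usleep() calls. At 20%, two of every ten slots perform work.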
Finally, the workload provides the fault injector with information needed to determine 
the location of certain faults, such as which processes are currently executing and what 
portions of memory are being used.
Portability: The workload generator code is easily ported to any UNIX-based system. Some
communication with the fault injector is based on UNIX file systems and pipes. In order 
to help minimize variations among different systems, a random number generator function 
is included. No additional functions need to be identified or written to port the workload 
generator.
3.2.3 Measure
Measure is a tool that monitors the actual workload activity. Although each workload 
function is designed to be very intensive for one system area, each function must necessarily 
cause activity in other system areas. For instance, the io function must also use the CPU 
and perform memory reads and writes as well as accessing the disk. Thus, the Measure tool 
is necessary to measure the actual activity caused by the workload.
Measure returns the level of workload stress for each system area as well as for the system 
as a whole. The stress is the amount of workload activity, especially that which can aid fault 
propagation. As with the fault injector, the methods needed to obtain the stress measures for 
each system area are system dependent to some extent. For each system area, the following 
methods are used to obtain the workload stress:
Method 1: measure_cpu The stress measure is based upon the CPU utilization. On the 
S2, the sar utility returns the CPU utilization.
Method 2: measure_mem The stress measure is based upon the number of reads and 
writes per second to the memory space used by the workload. Since any software 
method of obtaining this information would incur an unacceptable amount of overhead, 
a hardware method is used. A Tektronix DAS 9200 logic analyzer is used to count the 
number of memory accesses. This count is automatically sent to the Measure program 
every 10 seconds. A detailed description of the setup needed to measure mem stress 
can be found in Young[34].
Method 3: measure_io The stress measure is based on the number of disk blocks accessed 
per second. On the S2, the sar utility returns the number of disk blocks accessed per 
second.
Each stress measure is normalized in order to compare the different measures. The 
normalization is performed by running a set of various workloads2 and obtaining a distribution 
of the raw stress measures (i.e., CPU utilization, memory accesses/second, and disk 
blocks/second). Each raw stress measure X_{raw} is normalized to a value between 0 and 1, inclusive, 
based on the following formula, where X_{min} is the 5th percentile value and X_{max} is 
the 95th percentile value in the raw stress distribution:
X_{normal} = \min\left(1, \max\left(0, \frac{X_{raw} - X_{min}}{X_{max} - X_{min}}\right)\right)
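As a minimal sketch, assuming the normalization is a linear rescaling between the 5th and 95th percentile values that is clamped to the range 0 to 1, the computation can be written as follows. The function name normalize_stress is hypothetical, and the sketch assumes x_max is greater than x_min.

```c
/* Normalize a raw stress measure to [0, 1].
   x_min and x_max are the 5th and 95th percentile values of the raw
   stress distribution; values outside that range are clamped.
   Assumes x_max > x_min. */
double normalize_stress(double x_raw, double x_min, double x_max)
{
    double x = (x_raw - x_min) / (x_max - x_min);
    if (x < 0.0) return 0.0;
    if (x > 1.0) return 1.0;
    return x;
}
```

Clamping at the percentile values keeps occasional outliers in the raw measurements from compressing the rest of the scale.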
One disadvantage of the current methods is the relatively long amount of time between 
measurements (about 10 seconds). This is mainly due to the amount of time required by the 
logic analyzer to count memory accesses. However, most of this time is used to set up the
2These workloads had compositions of 33/33/33, 20/20/60, 20/60/20, and 60/20/20.
logic analyzer; the actual count only takes about one second. A newer logic analyzer will be 
used in the future to significantly decrease this setup time.
3.3 Repeatability
An essential ingredient of benchmarks is the repeatability of results. A benchmark must 
produce the same comparisons repeatably under the same conditions in order to be useful 
with a high degree of confidence. However, it is this guarantee of the exact same conditions 
that presents so much difficulty in repeating results exactly. This problem exists for all 
benchmarks, including performance benchmarks, because the operating system, memory, 
and file system state will differ for each invocation of the benchmark. For the proposed fault 
tolerance benchmark, an additional challenge is present: synchronizing the injection of faults 
with the workload program. In other words, it is difficult to guarantee in a repetition of 
the benchmark that each fault is injected at exactly the same time relative to the workload 
program. This difficulty arises from two factors: (1) many faults are injected during 
one execution of the workload program and (2) the time and location of each fault are not 
determined until after the workload program has begun execution. The second factor exists 
because stress-based injection is used to determine the time and location for faults.
To illustrate this difficulty, consider the scenario in Figure 3.6. Fault1 is injected during 
an initial execution of the benchmark, and fault2 is the same fault injected during a 
succeeding execution of the benchmark. The fault is specified to be injected 5 seconds 
after the start of the workload program. However, fault1 is injected before instruction inst3 
in the workload program is executed, and fault2 is injected after inst3. If inst3 is the 
instruction that activates the fault, then fault1 will be activated because it is injected before
inst3 and fault2 will not be activated because it is injected after inst3 has already been 
executed. Thus, a different fault effect results even though both faults were specified to be 
injected at the same time. This discrepancy in the actual injection time occurs due to a 
dependence on a software timer to determine the fault injection time.
[Time axis showing the specified injection point, 5 seconds after start of workload program, and the actual injection points of fault1 and fault2]
Figure 3.6: Scenario Illustrating Difficulty of Repeatability
Fortunately, repeatability of benchmark results is easier to achieve than repeatability of 
all events that occur during the execution of a benchmark. The repeatability of benchmark 
results (i.e., the count of catastrophic incidents and performance degradation) is most easily 
attained by averaging the results of a series of benchmark runs (see Section 3.1). It should be 
noted that many performance benchmarks also employ averaging of multiple runs to increase 
repeatability. For the three machines studied, 12 runs were sufficient to produce statistically 
distinct results. This will be shown in detail in Section 5.2.
The repeatability of all events that occur during the execution of a benchmark is very 
difficult to attain. As stated before, the main obstacle is the non-synchronization between 
fault injections and the execution of the workload program. The following scheme can be 
used to obtain this synchronization. Consider Figure 3.7, which contains the same workload 
program code as Figure 3.5, except that a barrier() function has been placed at the end of
C: cpu() {
       for (i=0; i<maxloops; i++) {
           /* Do CPU-intensive activity */
           /* -- arithmetic operations */
       }
       barrier(1);
   }
M: mem() {
       for (i=0; i<maxloops; i++) {
           /* Do memory-intensive activity */
           /* -- array accesses */
       }
       barrier(2);
   }
I: io() {
       for (i=0; i<maxloops; i++) {
           /* Do I/O-intensive activity */
           /* -- dummy file accesses */
       }
       barrier(3);
   }
Figure 3.7: Scheme for Synchronization Between Faults and the Workload Program
the cpu(), mem(), and io() functions. The parameter in the barrier() function serves to 
identify that function. Faults may only be injected while the workload program is executing 
the barrier() function. The barrier() function serves two purposes:
During the initial benchmark run: Each barrier() function keeps a count of how many 
times that barrier() function has been executed. The time to inject a fault is determined 
by the fault injector. When a fault is injected, the counts for all barrier() 
functions are written to a log file.
During a subsequent repetition of the initial run: The time to inject a fault is determined 
by the barrier() functions. Again, a count is kept of the number of times each 
barrier() function is executed. The log file from the initial run is read to determine 
the time to inject a fault. For instance, if the first fault in the initial run was injected
after barrier(1) had been executed 5 times, barrier(2) 3 times, and barrier(3)
6 times, those counts would have been written to the log file. During the repetition 
run, when the barrier() function counts match that entry in the log file, a signal is 
sent to the fault injector to inject a fault at that time. The barrier() function halts 
execution of the workload program until the fault has been injected.3
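The record and replay roles of the barrier() function can be sketched as follows. This is an illustration under stated assumptions: the log file is replaced by an in-memory structure, the signalling and blocking between the workload and the fault injector are omitted, and the names barrier_counts, barrier_record, log_injection, barrier_replay, and replay_sequence are hypothetical.

```c
#define NUM_BARRIERS 3

/* Counts of how often barrier(1), barrier(2), and barrier(3) have run. */
typedef struct { int count[NUM_BARRIERS]; } barrier_counts;

/* Initial run: count the call; the injector decides when to inject. */
void barrier_record(barrier_counts *c, int id)
{
    c->count[id - 1]++;
}

/* Called when a fault is injected in the initial run: snapshot the counts.
   A returned struct stands in for the entry written to the log file. */
barrier_counts log_injection(const barrier_counts *c)
{
    return *c;
}

/* Repetition run: count the call and report whether the counts now
   match the logged entry, i.e. whether the fault should be injected
   while the workload waits inside this barrier. */
int barrier_replay(barrier_counts *c, int id, const barrier_counts *logged)
{
    c->count[id - 1]++;
    for (int i = 0; i < NUM_BARRIERS; i++)
        if (c->count[i] != logged->count[i])
            return 0;
    return 1;
}

/* Replay a sequence of barrier ids; return the index of the call at
   which the injection is triggered, or -1 if it never is. */
int replay_sequence(const int *ids, int n, const barrier_counts *logged)
{
    barrier_counts c = { {0} };
    for (int i = 0; i < n; i++)
        if (barrier_replay(&c, ids[i], logged))
            return i;
    return -1;
}
```

Because the counts only increase, a logged entry can match at most once per run. The match occurs at the same point of the workload only if the repetition executes the functions in the same order, which requires the same random seed.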
The use of barrier() functions ensures that the injection of faults is synchronized with 
the execution of the workload program. However, there are caveats to this scheme. First, the 
workload program is altered. Since the workload program is already synthetic, however, the addition 
of the barrier() functions does not make the workload less realistic. Second, this scheme 
still does not provide complete repeatability of all events in a benchmark run because the 
operating system, memory, and disk states will still differ for each run. There is no simple 
solution for resetting these parts of the system to a repeatable state. Certain types of events, 
such as operating system panics, will not be repeatable. However, some events, such as error 
propagation and detection that are dependent solely on the workload program execution, will 
be repeatable.
3.4 Error Propagation Observability
Although the main purpose of a benchmark is to provide a convenient means of comparing 
computer systems, a benchmark can also often provide insight into the operation of specific 
fault-tolerant strategies and implementations, as well as the complete systems themselves. 
One approach to gaining this insight is to observe the propagation of errors after a fault 
has been injected. While such observability is readily available in simulation models, it is
3This is why the function is called a “barrier” function.
much more elusive for prototype machines. There are several approaches that can be taken 
to observe error propagation.
One method is to use a logic analyzer that monitors buses of interest, such as the CPU- 
memory bus and the I/O bus. All data flow outside of the processor can be observed 
and collected (i.e., a logic analyzer cannot monitor data flow within a single chip). This 
information can then be compared with the fault-free case to determine the extent of error 
propagation. This approach does not perturb the system and can collect large amounts 
of information in real time. However, propagation within the processor cannot be observed 
directly. Furthermore, the size of the logic analyzer data collection memory limits the amount 
of information that can be collected, which motivates the need for selective collection of data 
on the monitored buses.
A second method is to simply rely on the system’s built-in reporting facilities to determine 
the extent of error propagation by studying the reported error detection. This approach is 
very simple, since no additional hardware or software is needed. Furthermore, it has the 
advantage of focusing on the exercise of the system’s error detection mechanisms, which is 
one of the main reasons for investigating error propagation. However, because this approach uses no 
additional hardware or software, it is not capable of observing propagation between the time 
of fault injection and error detection.
A third method involves the use of a software tracing facility. Most operating systems 
provide support for software debugging functions, which are able to start, stop, and single-step 
through the execution of a program. The UNIX ptrace() facility is one example of a 
software tracing facility. Assuming that the location of fault injection is in the register or 
memory space of the workload program, the debugging functions can be used to single-step
through each instruction of the workload program. After each instruction is executed, an 
analysis of the propagation is conducted. Propagation will occur based on the output of the 
instruction. For instance, a memory store instruction will cause potential propagation only 
to the memory location that is written to. This approach is somewhat similar to simulation 
techniques, in that a great deal of information is available and all steps of propagation can be 
monitored. The significant drawback is the tremendous time overhead of using the software 
debugging functions. In particular, the single-step function incurs the overhead of a system 
call for every instruction in the workload program. Thus, the total execution time of the 
workload program is multiplied by several orders of magnitude. Clearly this time overhead 
is impractical for programs lasting more than a few seconds.
[Corrupted List: reg1, reg2, ..., page1, page2]
Figure 3.8: Workload Program Modifications for Fourth Method
The last method is a development from the third method. Since the main objection to 
the use of debugging functions is the time overhead associated with system calls for each
instruction, the time overhead can be dramatically decreased if the system call for single­
stepping is replaced with an application software routine that is attached to the workload 
program.
Figure 3.8 shows the modifications that must be made to the executable image of the 
workload program. The executable program is on the left. The basic blocks (labeled BB1, 
BB2, BB3, etc.) in the text section of the executable must first be identified. At the beginning of each 
basic block, a piece of “tracing” code (labeled trace_BB1, trace_BB2, trace_BB3, etc.) is 
added that “traces” the propagation based upon the instructions in the corresponding basic 
block. The propagation resulting from that piece of tracing code is marked in the “corrupted 
list”, which contains a list of all registers and memory locations which have been corrupted 
(i.e., have been affected by the error propagation). When a fault is injected, the affected 
register or memory location in the corrupted list is marked. Thus, the corrupted list reflects 
the error propagation that occurs as a result of the fault injection. Since the basic blocks 
in the workload program are being shifted relative to each other, the targets and offsets of 
all branching instructions must be adjusted. In addition, the main and section headers for 
the executable image must be adjusted to account for the increased size of the code section. 
Since the corrupted list must keep track of all registers and memory locations used by the 
workload program, it can be very large. To minimize the size of the corrupted list, the 
marking of each register or memory location can be implemented as a single bit.
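A single-bit corrupted list and the marking performed by tracing code can be sketched as follows. This is only an illustration: the fixed number of tracked locations, the flat numbering of registers and memory locations, and the names corrupted_list, cl_mark, cl_unmark, cl_is_marked, and trace_instruction are hypothetical. The propagation rule is also an assumption: a destination is marked when any source is corrupted and is unmarked when it is overwritten from clean sources.

```c
#include <stdint.h>
#include <string.h>

#define NUM_LOCATIONS 1024   /* hypothetical: registers first, then memory */

/* Corrupted list: one bit per tracked register or memory location. */
typedef struct { uint8_t bits[NUM_LOCATIONS / 8]; } corrupted_list;

void cl_clear_all(corrupted_list *cl)
{
    memset(cl->bits, 0, sizeof cl->bits);
}

void cl_mark(corrupted_list *cl, int loc)
{
    cl->bits[loc / 8] |= (uint8_t)(1u << (loc % 8));
}

void cl_unmark(corrupted_list *cl, int loc)
{
    cl->bits[loc / 8] &= (uint8_t)~(1u << (loc % 8));
}

int cl_is_marked(const corrupted_list *cl, int loc)
{
    return (cl->bits[loc / 8] >> (loc % 8)) & 1;
}

/* Trace one instruction of the form  dst = op(src1, src2).
   Assumed rule: the destination becomes corrupted if either source is
   corrupted; otherwise it is overwritten with a clean value. */
void trace_instruction(corrupted_list *cl, int dst, int src1, int src2)
{
    if (cl_is_marked(cl, src1) || cl_is_marked(cl, src2))
        cl_mark(cl, dst);
    else
        cl_unmark(cl, dst);
}
```

With one bit per location, 1024 locations occupy 128 bytes. Tracing code for a basic block would issue one such update per instruction in the block.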
The workload program is instrumented in the described manner and can be viewed as 
“single-stepping” itself. This approach is similar to that used by profiling programs like 
pixie, which add code to the original program in order to gather information 
during the execution of the program. This approach has access to all the information
available to the third method and reduces the time overhead per instruction to about an order 
of magnitude. Thus, the time overhead is still quite significant, but much more acceptable. 
The main disadvantages are (1) the greatly increased complexity introduced by the need 
to directly modify the executable image of the workload program and (2) the possibility of
perturbing the workload execution.
4. FOCUSED FAULT INJECTION STRATEGIES
The purpose of fault injection is to provide the stimulus for the exercise of fault-tolerant 
mechanisms. Thus, injected faults that do not result in the desired fault-tolerant activity 
do not fulfill their purpose and should be avoided. Although such faults do not necessarily 
decrease the accuracy of results, time and resources may be needlessly wasted. The efficiency 
of the fault injection process can be increased if faults to be injected are selected intelligently.
Focused fault injection strategies take advantage of special knowledge about the system 
under test and the workload program in order to increase the level of fault-tolerant activity 
in the system. Stress-based fault injection is one type of focused fault injection which utilizes 
knowledge about the workload activity at the time of injection to determine a good injection 
time and location. Path-based injection is another type of focused fault injection which 
utilizes knowledge about the workload program’s structure in order to ensure that faults are
always activated.
4.1 Stress-based Injection
It is well known that high stress and complex workloads cause greater propagation of 
faults and detection of errors[35]. By using knowledge of the workload activity, faults can 
be injected to maximize the chance of activation and propagation. One such fault injec­
tion strategy is stress-based injection, which is the process of injecting faults based upon a 
measurement of the current workload activity. Stress in this sense refers to the amount of 
activity caused by the workload which could encourage error propagation. Error propaga­
tion refers to the activation of faults and the subsequent propagation of the fault effects, 
along with the resulting deviation of the system state from the fault-free case. The effect 
of the fault is propagated due to the activity of the workload. The FTAPE tool is ideal 
for implementing stress-based injection because it includes a workload activity measurement 
tool, which provides two values:
1. the level of workload activity in each system component (CPU, memory, and disk), 
which determines the location of injection, and
2. the level of workload activity in the entire system, which determines the time of injec­
tion.
Two sets of experiments are conducted to demonstrate the effectiveness of stress-based 
injection in increasing fault propagation. The first set of experiments, described in Sec­
tion 4.1.1, involves injecting matched faults (i.e., faults that are injected into areas of great­
est workload stress) and unmatched faults (i.e., faults that are injected into areas of least 
workload stress). These experiments illustrate the sensitivity of certain workloads to specific 
faults. The next set of experiments, presented in Section 4.1.2, illustrates the effectiveness
of stress-based injections in increasing fault propagation. The target machine for these ex­
periments is the Tandem Integrity S2 fault-tolerant computer. A brief description of the S2 
is given in Section 5.1.1.
Since the tool characterizes the fault tolerance of the system using a single quantity, a 
metric for that characterization is needed. Several metrics are proposed and measured. The 
ratio of detected errors to injected faults represents the effectiveness of error detection, while 
performance degradation represents the efficiency of error recovery. The number of system 
crashes measures the effectiveness of error recovery.
In order to measure performance degradation, the workload program is executed twice, 
as shown in Figure 3.1. The procedure is identical to that used in the proposed benchmark. 
In addition to performance degradation, the ratio of error detections to fault injections is 
measured for each run. This ratio represents the effectiveness of error detection. Since it 
is usually desirable to detect as many faults as possible, the errors/fault ratio should be 
maximized. It should be noted that multiple injected faults may be concurrently present in 
a system component. When a single error in that component is detected, reintegration of 
that component results in correction and removal of all faults in that component.
Performance degradation and the errors/fault ratio can also be used to measure the level 
of fault propagation on a single machine. Since the detection and recovery mechanisms on 
a machine remain the same from one run to another, variations in these two measures are 
caused by the detection of errors caused by injected faults. The more the faults propagate, 
the more the number of error detections and the errors/fault ratio are likely to increase. A larger 
number of error detections causes more recovery activity and hence increases the performance
degradation.
4.1.1 Sensitivity of Workloads to Faults
Faults are activated and propagated by workloads. The experiments in this section show 
that more fault propagation occurs when the locations of faults and high workload activity 
are the same. The experiments consist of injecting faults into a single system component. 
Two types of workloads are executed along with those fault injections: (1) a workload with 
little activity in that component and (2) a workload with its activity mostly concentrated 
in that component. Thus, for experiment (c) in Table 4.1, faults are injected only into the 
disk. The first row represents a non-disk intensive workload, while the second row represents 
a disk-intensive workload. The injection time was chosen based on an exponential arrival 
distribution with a mean of 20 seconds.
The results are given in Table 4.1. Each row represents seven runs. From the table, 
it can be seen that the errors/fault ratio and the performance degradation are higher for 
the second row of each experiment. This means that the fault propagation is indeed higher 
when the injection location matches the location of high workload activity. For instance, the 
errors/fault ratio for io injections increases from 0.248 to 0.700 when the workload activity 
becomes disk-intensive. Similarly, the performance degradation increases from 0.001257 to 
0.030363.
The increase in the errors/fault ratio occurs because the injected faults are accessed by 
the workload more frequently when the workload activity is concentrated in the injection 
area. Furthermore, the high workload activity causes the accessed fault to produce additional 
errors. For instance, a CPU fault may be a corrupted register. That register may be a pointer 
to a memory location. Each time that corrupted register is referenced by the workload, an
additional memory error is created (i.e., fault propagation). This fault propagation effect is 
increased when the workload causes the register to be used more often.
Table 4.1: Sensitivity of Workloads to Faults

      Injection    Composition
Exp   Location     cpu   mem   io
a     cpu            4    48   48
      cpu           90     5    5
b     mem           48     4   48
      mem            5    90    5
c     io            48    48    4
      io             5     5   90

                                              Execution    Execution
                                              Time with    Time without
      Inj.   Errors     Faults     Errors/    Faults       Faults         Performance
Exp   Loc.   Detected   Injected   Fault      (sec)        (sec)          Degradation
a     cpu       9          61      0.148      1588         1544           0.000467
      cpu      26         101      0.257      2334         2236           0.000434
b     mem       2          87      0.027      1948         1928           0.000119
      mem       3          71      0.038      1558         1537           0.000193
c     io       12          48      0.248      2026         1910           0.001257
      io       26          37      0.700      3347         1583           0.030363
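The two measures reported in Table 4.1 can be computed as in the following sketch. The exact definition of performance degradation is given elsewhere in the thesis. The form used here, the relative increase in execution time divided by the number of faults injected, is an assumption inferred from the later remark that the measure is divided by the number of faults injected. The function names are hypothetical.

```c
/* Ratio of detected errors to injected faults. */
double errors_per_fault(int errors_detected, int faults_injected)
{
    return (double)errors_detected / (double)faults_injected;
}

/* Assumed form: relative slowdown, divided by the number of faults. */
double performance_degradation(double time_with_faults,
                               double time_without_faults,
                               int faults_injected)
{
    double slowdown = (time_with_faults - time_without_faults)
                      / time_without_faults;
    return slowdown / (double)faults_injected;
}
```

Applied to the first row of the table (9 errors, 61 faults, 1588 and 1544 seconds), these functions give approximately 0.148 and 0.000467. Some other rows agree only approximately, presumably because each row reports values combined over seven runs.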
4.1.2 Stress-based Injection Results
Stress-based injection is a method of selecting the time and location for injected faults 
with the goal of producing the greatest fault propagation possible. Injected faults must be 
activated and propagated in order to adequately exercise the error detection and correction 
mechanisms on a fault-tolerant system. Thus, by using stress-based injection, the likelihood 
that the fault tolerance of a system is tested can be increased.
To show that stress-based injection increases fault propagation, experiments were 
performed using five different stress-based injection strategies:
Strategy   Description
lt         Use both location-based stress injection (LSBI) and time-based
           stress injection (TSBI).
l          Use LSBI.
t          Use TSBI.
random     Randomly select injection times from an exponential distribution
           and injection locations from a uniform distribution.
ltLOW      Use both LSBI and TSBI. However, select injection times when the
           composite stress is below a specific threshold, and select the
           injection location (CPU, memory, I/O) with the lowest measured
           stress.
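The five strategies can be summarized as a decision function, sketched below under stated assumptions. The use of a single threshold on the composite stress to decide when stress is high is an assumption; the source names a threshold explicitly only for the low-stress strategy. The strategies without time-based selection are assumed to inject when a randomly drawn timer expires. All identifiers are hypothetical.

```c
typedef enum { LOC_CPU = 0, LOC_MEM = 1, LOC_IO = 2 } location;
typedef enum { STRAT_LT, STRAT_L, STRAT_T, STRAT_RANDOM, STRAT_LT_LOW } strategy;

/* Decision returned to the injector. */
typedef struct {
    int      inject_now;   /* 1 if a fault should be injected at this sample */
    location where;        /* meaningful only when inject_now is 1 */
} decision;

/* Index of the highest (want_max = 1) or lowest (want_max = 0) stress. */
location pick_by_stress(const double stress[3], int want_max)
{
    int best = 0;
    for (int i = 1; i < 3; i++)
        if (want_max ? stress[i] > stress[best] : stress[i] < stress[best])
            best = i;
    return (location)best;
}

/* stress[]      : normalized stress per component from the measurement tool
   composite     : normalized stress of the whole system
   threshold     : hypothetical cut-off separating high from low stress
   timer_expired : 1 when a randomly drawn (exponential) interval has elapsed
   random_loc    : a uniformly drawn location supplied by the caller */
decision decide(strategy s, const double stress[3], double composite,
                double threshold, int timer_expired, location random_loc)
{
    decision d = { 0, LOC_CPU };
    switch (s) {
    case STRAT_LT:      /* location-based and time-based */
        d.inject_now = composite >= threshold;
        d.where = pick_by_stress(stress, 1);
        break;
    case STRAT_L:       /* location-based only */
        d.inject_now = timer_expired;
        d.where = pick_by_stress(stress, 1);
        break;
    case STRAT_T:       /* time-based only */
        d.inject_now = composite >= threshold;
        d.where = random_loc;
        break;
    case STRAT_RANDOM:  /* neither */
        d.inject_now = timer_expired;
        d.where = random_loc;
        break;
    case STRAT_LT_LOW:  /* low stress in both time and location */
        d.inject_now = composite < threshold;
        d.where = pick_by_stress(stress, 0);
        break;
    }
    return d;
}
```

Keeping the random draws outside the function makes each decision reproducible for a given set of measurements.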
The errors/fault ratio and performance degradation for the five injection strategies used
with the same workload are given in Figure 4.1. The figure shows averages based on 19 runs
for each injection strategy. The workload used is a disk-intensive workload. From the figure,
it can be seen that the highest level of fault propagation (as measured by the errors/fault
ratio and performance degradation) is obtained when using both the location-based and
time-based injection strategies (labeled in the graph as “lt”). If only the location-based strategy
(labeled as “l”) is used, then the propagation is lower. However, the location-based strategy
still produces more propagation than using the time-based or random strategies (labeled as
“t” and “random”, respectively). Thus, for this disk-intensive workload, injecting faults into
the disk produces more fault propagation than choosing the injection location randomly. 
However, if additionally the faults are injected only when the dynamic workload activity is 
high, then even more propagation occurs.
The measured performance degradation in Figure 4.1 is small, partly because it is divided
by the number of faults injected. However, the measure is significant because it is intended
to be used as a relative measure. Thus, the importance of the measure is that the combined
location-based and time-based injection strategy produces more performance degradation than
the other strategies.
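To make the normalization concrete, the sketch below computes the two measures from raw counts and run times. The formula for performance degradation is an inference from the values in Table 4.1 rather than a definition quoted from this section: it matches the first three rows of the table to the printed precision and the remaining rows only approximately, presumably because the table reports averages over runs. The function names are hypothetical.

```python
def errors_per_fault(errors_detected, faults_injected):
    """Errors/fault ratio: detected errors divided by injected faults."""
    return errors_detected / faults_injected

def performance_degradation(time_with_faults, time_without_faults, faults_injected):
    """Relative slowdown caused by the faults, divided by the number of faults.

    Assumed form: ((T_with - T_without) / T_without) / faults_injected.
    """
    slowdown = (time_with_faults - time_without_faults) / time_without_faults
    return slowdown / faults_injected

# First row of Table 4.1 (experiment a, cpu/mem/io composition 4/48/48).
print(round(errors_per_fault(9, 61), 3))                    # 0.148
print(round(performance_degradation(1588, 1544, 61), 6))    # 0.000467
```

Dividing by the number of faults is what makes the absolute values small; comparing strategies only requires that the same normalization be applied to each.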
Figure 4.1: Errors/Fault and Performance Degradation
This same effect can be seen for other workloads. Figure 4.2 shows the errors/fault ratio
for several workloads. For each workload, the errors/fault ratio is higher when the
location-based strategy is combined with the time-based strategy. Again, the combined strategy is
labeled as “lt” in the graph, while the location-based strategy alone is labeled as “l”.
As shown in Table 4.1, disk faults have a much higher errors/fault ratio and performance
degradation compared to CPU and memory faults. To ensure that the results of the experiments
are not biased by this, the results were also calculated for the same experiments in
Figure 4.2 ignoring disk faults. The results are given in Table 4.2. Again, the errors/fault
Figure 4.2: Errors/Fault for Several Workloads
Table 4.2: Stress-based Injection Results for CPU and Memory Faults

     Injection  Composition    #     Errors    Faults    Errors/
Exp  Method     cpu  mem  io   Runs  Detected  Injected  Fault
d    lt         90    5    5   19     4         22       0.1749±0.0362
     l          90    5    5   18    13        104       0.1206±0.0147
     t          90    5    5   19     2         17       0.1184±0.0353
     random     90    5    5   19     3         23       0.1170±0.0302
     ltLOW      90    5    5    6    12        169       0.0740±0.0161
e    lt         33   33   33   12    11         66       0.1679±0.0261
     l          33   33   33   17     7         71       0.1007±0.0169
     t          33   33   33   18     3         29       0.1075±0.0264
     random     33   33   33   16     3         28       0.1053±0.0282
     ltLOW      33   33   33    5     4         94       0.0403±0.0177
f    lt         20   20   60   19     7         60       0.1178±0.0187
     l          20   20   60   10     4         52       0.0874±0.0220
     t          20   20   60    9     3         33       0.1003±0.0298
     random     20   20   60   19     4         32       0.1151±0.0254
     ltLOW      20   20   60    6     2         76       0.0263±0.0147
ratio is highest when the location-based and time-based injection strategies (labeled as “lt”)
are combined. The errors/fault ratios for the lt strategies are highlighted in the table. For
the errors/fault ratio, 95% confidence intervals are given.
System Crash Data The results above do not include experiments that resulted in system
crashes. The number of system crashes is given in Table 4.3, classified by the injection strategy
and workload used. Each row represents a different workload, and each column represents the
injection strategy used. For example, the “lt” column of row “e” shows that 3 system crashes
occurred while the combined location-based and time-based injection strategy was used with
workload (e). The table includes data for 272 runs, during which a total of 5 system crashes
occurred. All crashes occurred when the combined location-based and time-based injection
strategy was used. This result is consistent with the results of the other experiments in this thesis,
which show that the combined location-based and time-based strategy seems to produce the
most fault propagation.
Table 4.3: Number of Observed System Crashes

            Injection strategies
Experiment  lt  l  t  random  ltLOW
d            0  0  0  0       0
e            3  0  0  0       0
f            1  0  0  0       0
g            1  0  0  0       0
Total        5  0  0  0       0
4.2 Path-based Injection
Path-based injection is an approach that minimizes non-activated faults through an intelligent
selection of fault parameters. Specifically, fault times and locations are chosen based
upon a pre-injection analysis of the resource usage of a test program. For example, if memory
faults are to be injected, the memory usage of the program is analyzed to determine the set
of memory locations that are used and the times when faults would be activated. Thus, not
only is the set of activatable faults known, but the program inputs that will cause each fault to
be activated are also known. The approach is called “path-based” because the
selection of fault parameters is based on a knowledge of the test program’s execution path
and resource usage, which depends to a large degree upon the program’s control-flow
paths.
Path-based injection is especially useful for testing an embedded system, which executes 
a single program repeatedly. For such a system, an involved effort to thoroughly test the 
program with fault injection is justified. If system validation involves the determination of 
the fault coverage for a large set of faults, path-based injection would be helpful in decreasing 
the number of unactivated faults, which must be ignored in determining the fault coverage. 
Path-based injection ensures that the entire fault set is activated with a low time-cost. The 
number of injected faults that are not activated is minimized to zero if the pre-injection 
analysis is able to find paths that utilize every resource that is to be fault injected.
The main goal of path-based injection is to maximize the level of fault activation.
The level of fault activation is defined as the fraction of injected faults that are
activated. Equation 4.1 shows the calculation of the fault activation level in terms of the
number of faults to be activated.
\[
\text{fault activation level} = \frac{\text{number of faults to be activated}}{\text{number of faults that are injected}} \tag{4.1}
\]
The fault activation level has a maximum level of 1.0, which is attained when every injected 
fault is activated.
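As a small worked illustration of Equation 4.1, the sketch below computes the fault activation level from two counts. The function name and the example counts are hypothetical; only the ratio itself comes from the text.

```python
def fault_activation_level(faults_activated, faults_injected):
    """Fraction of injected faults that were activated (Equation 4.1)."""
    if faults_injected <= 0:
        raise ValueError("at least one fault must be injected")
    if not 0 <= faults_activated <= faults_injected:
        raise ValueError("activated faults must lie between 0 and the number injected")
    return faults_activated / faults_injected

print(fault_activation_level(25, 25))    # 1.0
print(fault_activation_level(25, 100))   # 0.25
```

The first call corresponds to the ideal case in which every injected fault is activated; the second shows how injections that are never activated pull the level down.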
Before discussing path-based injection, a few terms will be defined. The test program
is the program that is executed as the fault is injected. Since fault activation and propagation
are dependent upon the workload, the test program which generates this workload
is important. An input is the set of data that the test program processes and may include
command-line arguments, contents of files, environment variables, and file system state. A
path is the sequence of test program instructions that are executed based upon a given input.
Resources are the system state components that may be injected with a fault. In this discussion,
the system state consists of the contents of CPU registers and memory locations, as
shown in Figure 4.3. The system state is limited to CPU registers and memory locations in
this discussion because the fault injections in the experiments were confined to registers and
memory locations. More system state can be considered if faults are injected elsewhere. A
fault may have several parameters. Two important fault parameters are time and location.
The time can be expressed in terms of the currently executing instruction in the test program,
which is called the stop address¹. The fault location is either the register or memory
location.
In order to explain the details of path-based injection, a comparison with random injection
will be made. In this context, random injection refers to the injection of a fault
¹ This program instruction is referred to as the stop address because the program is stopped at that
address when the fault injector is activated.
[Figure 4.3 layout: the vertical axis lists the resources (Register 1, Register 2, ..., Register m,
Memory location 1, Memory location 2, ..., Memory location n); the horizontal axis is time.
A solid box represents the time when the resource is being used.]
Figure 4.3: Resources Used by the Test Program Over Time
with a randomly selected test program input. In contrast, for path-based injection, a fault
is injected with a test program input that is known to activate that fault, based on a
pre-injection analysis of the test program. For purposes of comparison, if an injected fault is not
activated by random injection, that fault is injected again with a different test program input.
Attempts to activate the fault are made until the surrender threshold is reached, which
is an arbitrary limit placed on the number of attempts. Consider Figure 4.3, which shows
the usage of resources by the test program at each point in time during the test program
execution. Only CPU registers and memory locations are considered in Figure 4.3 because
in this experiment, faults will only be injected into CPU registers and memory locations.
The set of resources can easily be extended to encompass other fault injection targets. In
our case, the resource usage for CPU registers corresponds to those times when each register
contains a valid value, or in compiler terminology, when each register is live. Similarly for
memory locations, the resource usage corresponds to those times when each memory location
contains a valid value; these memory locations will also be called “live.” At any particular
time (e.g., time T), only a few registers and memory locations contain valid values. Only
faults injected into these registers and memory locations can be activated. The other
registers and memory locations are never read by the test program unless they are first
overwritten with an initialization value, in which case the injected fault is overwritten and
is never activated.
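The liveness argument above can be sketched as a lookup over usage intervals of the kind drawn in Figure 4.3. The data structure, the resource names, and the interval values below are hypothetical; the only idea taken from the text is that a fault can be activated only if its location is live at the injection time.

```python
def is_live(live_intervals, resource, time):
    """True if `resource` holds a valid value at `time`.

    live_intervals maps a resource name to a list of (start, end) pairs,
    each meaning that the resource is live for start <= t < end.
    """
    return any(start <= time < end
               for start, end in live_intervals.get(resource, []))

def activatable_locations(live_intervals, time):
    """All resources into which a fault injected at `time` could be activated."""
    return sorted(r for r in live_intervals if is_live(live_intervals, r, time))

# Hypothetical usage data in the spirit of Figure 4.3.
usage = {
    "reg1": [(0, 4), (9, 12)],
    "reg2": [(3, 7)],
    "mem1": [(5, 6)],
}
print(activatable_locations(usage, 5))   # ['mem1', 'reg2']
print(is_live(usage, "reg1", 5))         # False
```

A fault injected into reg1 at time 5 in this example would be overwritten at time 9 before it is ever read, which is the non-activated case that path-based injection tries to avoid.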
Thus, path-based injection guarantees activation of each fault, as long as the fault location 
is in the set of live resources at the fault time for at least one path. By assembling an 
input set that results in paths which cover most of the test program code, a large set of 
interesting activatable faults can be achieved. For instance, if the input set results in paths 
that cover 100% of the test program code, then all control-flow faults for that test program 
are guaranteed to be activatable by path-based injection.
[Graph: time cost plotted against # faults to activate, with Fr marked on the horizontal axis.]
Figure 4.4: Time cost vs. # faults to activate
The desirability of a high fault activation level can be illustrated with Figure 4.4. The
graph shows the cost in time incurred in order to activate a certain number of faults that
are injected. The costs associated with path-based injection are represented by the solid
line, while the dashed line represents the random injection costs. The slope of each line is
determined by the cost to activate one fault, which is equivalent to (fault activation level)^-1.
Obviously, the time-cost increases as more faults are to be activated. However, the time-cost
for path-based injection increases less rapidly than that for random injection because
path-based injection maximizes the fault activation level and thus minimizes the cost to activate
each fault.
Since path-based injection requires a pre-injection analysis, a one-time cost is incurred.
This cost is represented by Cp in Figure 4.4. Cp is dependent on the number of paths to
be analyzed. However, once the number of paths to be analyzed is determined, Cp is fixed
and thus is referred to as the fixed cost. This fixed cost initially increases the time-cost of
the path-based injection approach. However, as more faults are activated, the fixed cost is
amortized, and eventually, at some number of faults, Fr, the total time-cost of path-based
injection becomes lower than that for random injection. This is true regardless of the size of
the fixed cost of the pre-injection analysis, because the cost per activated fault is lower for
path-based injection. In our experiment, Fr turns out to be a fairly low number
for at least one fault injection test-bed. This is shown in Section 4.2.2.
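The break-even point Fr can be worked out under the straight-line model of Figure 4.4. The sketch below assumes that every injection costs the same time and that each line has slope (cost per injection) / (fault activation level); the function names and all numeric values are hypothetical and are not measurements from the experiments.

```python
import math

def total_cost(n_faults, cost_per_injection, activation_level, fixed_cost=0.0):
    """Time to activate n_faults under the linear model of Figure 4.4."""
    return fixed_cost + n_faults * cost_per_injection / activation_level

def break_even_faults(fixed_cost, cost_per_injection, level_path, level_random):
    """Smallest whole number of activated faults at which path-based injection is cheaper."""
    if level_random >= level_path:
        raise ValueError("no break-even point: random activation level is not lower")
    saving_per_fault = cost_per_injection * (1 / level_random - 1 / level_path)
    return math.floor(fixed_cost / saving_per_fault) + 1

# Hypothetical numbers: 600 s of pre-injection analysis (Cp), 30 s per injection,
# activation level 1.0 for path-based injection and 0.25 for random injection.
f_r = break_even_faults(600, 30, 1.0, 0.25)
print(f_r)                                   # 7
print(total_cost(f_r, 30, 1.0, 600))         # 810.0
print(total_cost(f_r, 30, 0.25))             # 840.0
```

Because the saving per activated fault is positive whenever the random activation level is lower, a larger Cp only moves the break-even point to the right; it never removes it.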
Note that although the two lines in Figure 4.4 are straight lines, the actual time-cost for 
each activated fault will vary. Especially for the random injection approach, the time-cost 
for each activated fault is very dependent upon the number of injections needed to activate 
that fault, which may be as low as a single injection or as high as the surrender threshold. 
Since the time-cost per injection can never be lower for random injection, the two lines in 
Figure 4.4 are still accurate in a relative sense.
4.2.1 Implementation
Two major tasks are needed to implement path-based injection:
1. pre-injection analysis, which associates paths with faults, and
2. injection of the fault, including selection of an appropriate input.
These two tasks are described in the following subsections.
Pre-injection Analysis
As mentioned in Section 4.2, a pre-injection analysis must be performed in order to
associate paths with faults. This task of associating paths with faults is the key to
path-based injection. A high fault activation level is only possible because each fault is injected
in conjunction with an input that is guaranteed to activate that fault. The time-cost of the
pre-injection analysis is represented by Cp in Figure 4.4. This process will be automated as much
as possible.
The implementation was performed on a Tandem Integrity S2 fault-tolerant computer,
which is based on the MIPS R3000 microprocessor. The test program is compress, the
standard UNIX compress utility, which is written in the C language. The description of the
implementation in this section refers specifically to the S2 and compress in order to present
a concrete example. However, the same procedure can be performed using any computer
system and any test program.
The following steps are needed to accomplish this task:
1. Derive an input set based upon a knowledge of the test program, including command-line
options, documentation, and a knowledge of the program’s high-level language code.
2. For each input in the input set, determine the associated path. The path is represented
as a list of test program basic blocks² that are executed by the associated input.
3. Determine the faults that can be activated by each path.
Deriving an Input Set The first step, derivation of the input set, is performed manually
and is not included in the fixed cost, Cp. This cost was not included in Cp because (1)
timing a non-automated activity is not simple, (2) the cost is heavily dependent upon the
skills of the person creating the input set, and (3) most importantly, the cost only adds to
the fixed cost, which might increase Fr in Figure 4.4 but does not change the fact that the
time-cost for path-based injection is lower than that for random injection after Fr faults are
activated. (The input set chosen included 39 inputs.)
Determining the Path Associated with an Input The second and third steps are automated,
and their costs constitute the fixed cost, Cp. The second step is the most time-consuming in
the pre-injection analysis. This step involves the discovery of the path associated with each
input in the input set. A list of basic blocks that are executed due to a given input is
produced by the process given in Figure 4.5. Instead of a sequential list of all executed basic
blocks³, a list of basic blocks along with a count of the frequency of execution⁴ is produced.
The sequential list requires a great deal more disk storage than the frequency count list,
while the frequency count list still provides the required information: whether a particular
part of the test program is executed or not. (The compress program
² A basic block is a sequential group of instructions such that every instruction in the basic block is
executed if the first instruction is executed.
³ An example of a sequential list would be BB1 -> BB2 -> BB3 -> BB4 -> BB2 -> BB3 -> BB4 -> BB5.
⁴ An example of a list of basic blocks with frequency counts for the same path would be BB1(1), BB2(2),
BB3(2), BB4(2), BB5(1).
contains a total of 1710 basic blocks, of which 497 basic blocks are associated with user
code⁵.)
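The difference between the two path representations discussed above can be shown with the example path from the footnotes. This is only a sketch of the bookkeeping, written with a hypothetical function name; it is not the code used by the probes.

```python
from collections import Counter

def frequency_list(sequential_path):
    """Collapse a sequential list of executed basic blocks into (block, count) pairs.

    Blocks are reported in order of first execution.
    """
    counts = Counter(sequential_path)
    return [(bb, counts[bb]) for bb in dict.fromkeys(sequential_path)]

path = ["BB1", "BB2", "BB3", "BB4", "BB2", "BB3", "BB4", "BB5"]
print(frequency_list(path))
# [('BB1', 1), ('BB2', 2), ('BB3', 2), ('BB4', 2), ('BB5', 1)]
```

The sequential list grows with the length of the execution, whereas the frequency list is bounded by the number of basic blocks in the program, which is why it needs far less storage for long runs.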
[Flow chart stages: getcfg.c, create_probe_lib.c, probe.c]
Figure 4.5: Flow Chart for Pre-injection Analysis Second Step
As illustrated in the flow chart in Figure 4.5, three programs are used to find the path 
for a given input:
1. getcfg.c,
2. create_probe_lib.c, and
3. probe.c.
getcfg.c derives a control-flow graph for the test program, based on a disassembly of the
test program. create_probe_lib.c then uses the control-flow graph to determine how many
⁵ User code includes all code written by the user. Typically this includes all code except for standard
library routines and extra code inserted by the linker.
basic blocks the test program has and then creates a probe library containing a unique probe
function for each basic block. A probe consists of a C-language function that records each
call to that probe⁶. This probe library is then compiled and linked in with the test program.
At this point, the probe functions are part of the test program executable, but are not
executed along with the original test program code. getcfg.c is used to derive the
control-flow graph once again in order to account for any addresses that might have changed due
to the inclusion of the probe library. probe.c uses this second control-flow graph to alter
the new test program to call a unique probe for each executed basic block. When this is
accomplished, the basic block and probe are “linked.” This linking process is accomplished
by substituting a jump to the associated probe at the beginning of each basic block
and then placing the replaced original basic block instructions at the end of the probe. After
all basic blocks and probes have been linked, the altered program is executed and produces
a list of all executed basic blocks.
Figure 4.6 shows the instructions in the basic block and probe that need to be altered to 
link the basic block and probe. The specific alterations that need to be made are:
1. Insert a jump to the probe at the start of the basic block. Include a nop in the jump 
delay slot. Save the two original basic block instructions that are replaced.
2. Insert sw instructions at the start of the probe to save all registers that are modified
in the probe. (Group A1)
⁶ There is also a special probe that is called at the end of the test program. This probe dumps all call
records to a disk file.
3. Insert lw instructions at the end of the probe to restore all registers that were saved
at the start of the probe. (Group B1)
4. If one of the two original basic block instructions modifies the return address
register, then insert code to load the return address register with the proper value.
Include a nop in the load delay slot. (Group B2)
5. Insert the two original basic block instructions. If the second instruction has an associated
delay slot, then copy the delay slot instruction. (Group B3)
6. If one of the original basic block instructions contains a 16-bit branch offset that has
been modified and can no longer fit within 16 bits, change the branch instruction to
its complement⁷ (Group B3) and place a jump to the corresponding 26-bit absolute
branch target here (Group B4).
7. Insert a jump to the basic block, with a jump target that skips the instructions that 
were copied to the probe. (Group B5)
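The common case of the alterations listed above (steps 1, 2, 3, 5, and 7) can be sketched as a transformation on lists of assembly strings. This is a schematic model rather than the probe.c implementation: the function name, the probe label, the stack offsets used for the save area, and the example instructions are all hypothetical, and the return-address fix-up (Group B2) and the long-branch case (Group B4) are left out.

```python
def link_basic_block(bb_addr, bb_insns, probe_label, saved_regs):
    """Return (patched basic block, probe body) as lists of assembly strings.

    bb_addr:    address of the first instruction of the basic block
    bb_insns:   the block's instructions; the first two are moved into the probe
    saved_regs: registers modified inside the probe
    """
    if len(bb_insns) < 2:
        raise ValueError("single-instruction basic blocks are not linked")
    moved, rest = bb_insns[:2], bb_insns[2:]

    # Step 1: jump to the probe, with a nop in the jump delay slot.
    patched = [f"j {probe_label}", "nop"] + rest

    # Group A1: save registers (offsets are placeholders for a save area).
    probe = [f"sw {reg}, {4 * i}(sp)" for i, reg in enumerate(saved_regs)]
    probe.append("# ... record the call to this probe ...")
    # Group B1: restore the saved registers.
    probe += [f"lw {reg}, {4 * i}(sp)" for i, reg in enumerate(saved_regs)]
    # Group B3: the two original instructions displaced by the jump and nop.
    probe += moved
    # Group B5: return to the block, skipping the two replaced 4-byte words.
    probe += [f"j {bb_addr + 8:#x}", "nop"]
    return patched, probe

block = ["addiu a0, a0, 4", "addiu a1, a1, -1", "addu v0, v0, a0"]
patched, probe = link_basic_block(0x400100, block, "probe_77", ["t0", "t1"])
print(patched)      # ['j probe_77', 'nop', 'addu v0, v0, a0']
print(probe[-2])    # j 0x400108
```

The check at the top reflects the restriction on single-instruction basic blocks: there is no room for both the jump and its delay-slot nop without growing the code.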
A few of the steps listed above are required due to the instruction set architecture of the
MIPS R3000. The R3000 requires a single delay slot following all loads, jumps, and
branches. As a result, the probe associated with each basic block is activated by replacing
the first two instructions of the basic block with a jump to the probe followed by a nop.
Basic blocks that consist of a single instruction cannot be linked to their associated probe
in this manner without creating extra space for the needed nop, which would require an
increase in the size of the test program code. Changing the size of the test program code
⁷ For example, the complement of a “branch if equal” (beq) instruction is a “branch if not equal” (bne)
instruction.
necessitates modifications to the executable image headers and is unnecessarily complicated. 
Thus, single-instruction basic blocks are not linked to their probes. Instead, the altered test
program is executed to produce the list of executed basic blocks, which is used in conjunction
with the control-flow graph to construct a list of executed single-instruction basic blocks.
[Figure 4.6 layout, from the start of the probe to its end:
Save registers (Group A1): sw t0, sw t1, sw t2, sw t3, sw t6, sw t7, sw t8, sw t9
Restore saved registers (Group B1)
Modify return addr register (Group B2)
Original instructions from BB (Group B3)
Long jump when 16-bit branch offset is insufficient (Group B4)
Return to BB (Group B5)
Shaded instructions are present only when needed.]
Figure 4.6: Basic Block and Probe Alterations
Determining Faults Associated with a Path The final step in the pre-injection analysis is to 
determine the faults that can be activated by each path. To simplify this step, we will only 
consider control-flow faults or faults that directly affect the execution of branches and jumps. 
For instance, an example of a control-flow fault would be a CPU register that is corrupted 
and causes a conditional branch to be evaluated incorrectly, thus altering the control-flow of 
the program. With this in mind, all direct-effect control-flow faults occur when branches or 
jumps occur. To simplify things further, we will only inject faults into CPU registers, which 
means that all such control-flow faults occur at conditional branch instructions. Thus, the
faults that are activatable by each path occur when the CPU registers used as operands for 
conditional branches are corrupted.
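Finding the registers whose corruption can produce a control-flow fault amounts to decoding the register fields of each conditional branch. The sketch below does this for 32-bit MIPS I instruction words using the standard opcode and register-field layout; it is an illustration and not the decoder used in the experiments. Only a subset of the conditional branches is covered (the branch-and-link variants are omitted), and the 16-bit offset in the example word is arbitrary.

```python
REG_NAMES = ("zero at v0 v1 a0 a1 a2 a3 t0 t1 t2 t3 t4 t5 t6 t7 "
             "s0 s1 s2 s3 s4 s5 s6 s7 t8 t9 k0 k1 gp sp fp ra").split()

TWO_REG = {4: "beq", 5: "bne"}       # opcode -> mnemonic, compares rs with rt
ONE_REG = {6: "blez", 7: "bgtz"}     # opcode -> mnemonic, tests rs only
REGIMM = {0: "bltz", 1: "bgez"}      # opcode 1: the rt field selects the branch

def branch_operands(word):
    """Return (mnemonic, [register names]) for a conditional branch, else None."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    if opcode in TWO_REG:
        return TWO_REG[opcode], [REG_NAMES[rs], REG_NAMES[rt]]
    if opcode in ONE_REG:
        return ONE_REG[opcode], [REG_NAMES[rs]]
    if opcode == 1 and rt in REGIMM:
        return REGIMM[rt], [REG_NAMES[rs]]
    return None

# beq s2, zero, <offset>: opcode 4, rs = 18 (s2), rt = 0 (zero), arbitrary offset.
word = (4 << 26) | (18 << 21) | (0 << 16) | 0x0010
print(branch_operands(word))          # ('beq', ['s2', 'zero'])
print(branch_operands(0x00851021))    # None (addu v0, a0, a1 is not a branch)
```

Registers returned for a branch, other than zero (which is hard-wired), are the candidate fault locations at that branch's address.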
Comments on Pre-injection Analysis It can be seen that the pre-injection analysis requires
a fair amount of effort, most of which is expended in finding the path (or list of basic
blocks) associated with a given input. Other methods for extracting the same information
exist. For example, for the MIPS R3000, the pixie program can be used to instrument
a program to trace the execution of basic blocks. However, the method described in this
section is a more general solution that requires no hardware or software support and thus is
applicable to systems without a program such as pixie. In addition, although the probes
described in this section were used only to record the execution of a basic block, the probes
can be used for other purposes, such as controlling the exact timing of fault injection.
Once the pre-injection analysis is finished, path-based fault injection can be performed. 
The next section describes the method used to inject faults.
Injection of Fault
The purpose of the fault injection implementation described in this section is to provide a 
test-bed for demonstrating the ability of path-based injection to maximize the fault activation 
level, especially when compared to random injection. The fault injection method used is 
software-implemented fault injection, which uses software routines to modify system state 
(such as that contained in registers and memory locations) in order to emulate the effect 
of lower-level faults. Although software-implemented fault injection can be used to inject
a variety of faults, we will only inject faults into CPU registers, since that is sufficient to 
accomplish our goal of demonstrating the advantages of path-based injection.
The fault injection test-bed is shown in Figure 4.7. The target machine is the machine 
on which the test program is run and faults are injected. A DAS logic analyzer is connected 
to the target machine and monitors memory bus activity in order to detect the activation of 
injected faults. The DAS needs to be reprogrammed before each fault injection to search for 
the activation of that specific fault. The reprogramming is controlled by the control host, 
which in turn communicates with the injection software on the target machine in order to 
know what fault is to be injected.
[Figure 4.7 components: control host (DM, the DAS monitoring program); DAS (state monitoring
program); target machine (test program, inject, dbx, poke). Labeled connections: serial line,
probes, Ethernet.]
Figure 4.7: Setup of DAS, Host, and Target Machine
The injection software on the target machine consists of three programs:
1. inject,
2. dbx, and
3. poke.
The flow chart for these three programs is given in Figure 4.8. inject is the main control 
program. It first initializes the DAS with the needed monitoring software and then reads
in the pre-injection analysis information, which shows which inputs and faults should be
paired up in order to ensure activation of the fault. inject selects a fault to be injected
and an input that will result in the activation of that fault, and then the fault is injected.
For random injection, the input is selected randomly, which may not result in the activation
of the fault. If no activation occurs, then another input is chosen for the same fault, and
the fault is injected again. This process is repeated until either the fault is activated or an
arbitrary threshold is reached. In our experiments, the inputs for random injections were not
chosen completely randomly; instead, the inputs were chosen randomly from the same set
of inputs used for path-based injection. By so doing, the fault activation level is increased
beyond that for a completely random selection of inputs. Yet, the results in Section 4.2.2 will
show that even with this assistance, the fault activation level is still much lower for random
injections.
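The two input-selection policies described above can be contrasted in a few lines. The sketch is a simulation of the selection logic only, with hypothetical function names and a hypothetical surrender threshold of 10; no fault is actually injected. The activating set used in the example is the one reported for basic block #77 in Section 4.2.2.

```python
import random

def inject_path_based(activating_inputs):
    """Path-based: pick an input known to activate the fault; one injection suffices."""
    return min(activating_inputs), 1

def inject_random(activating_inputs, input_set, surrender_threshold, rng):
    """Random: retry with randomly chosen inputs until activation or surrender.

    Returns (activating input or None, number of injections performed).
    """
    for attempt in range(1, surrender_threshold + 1):
        candidate = rng.choice(input_set)
        if candidate in activating_inputs:
            return candidate, attempt
    return None, surrender_threshold

input_set = list(range(39))                                   # inputs 0..38
activating = set(range(13, 16)) | set(range(20, 29)) | set(range(33, 39))

print(inject_path_based(activating))                          # (13, 1)
chosen, injections = inject_random(activating, input_set, 10, random.Random(1))
print(chosen in activating or chosen is None, injections <= 10)   # True True
```

Each unsuccessful random attempt still costs a full injection, which is where the difference in slope between the two lines of Figure 4.4 comes from.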
inject does not perform the actual fault injection. Instead, it calls the dbx program.
dbx is a fairly common UNIX C-language debugger. dbx is used to control the timing of the
fault injection. This is accomplished by setting a breakpoint at the stop address, which is
the current test program address when the fault should be injected. dbx then runs the test
program. If the breakpoint is encountered, the poke program is executed. poke injects the
actual fault and is responsible for communicating with the control host machine to reprogram
the DAS and to obtain the data collected by the DAS.
The data collected by the DAS is of the form shown in Figure 4.9. Each line represents 
an event that the DAS has detected. The leftmost number of each line is the event identifier.
The meanings of the event identifiers are given in Table 4.4. The presence of event identifier
[Figure 4.8 flow chart boxes, in order: Start DM; Init DAS; Read CFG/path info; Select fault;
Select input; Analyze collected DAS data; START dbx; Set breakpoint at stop address;
Run test program; Invalidate stop address; Inject fault; Resume test program, if needed;
Tell dbx to continue; END dbx; Wait for error detection or timeout; Download collected
DAS data; END poke.
Note 1: At breakpoint or end of test program.
Solid lines represent control-flow within a program. Dashed lines represent time control
between programs (i.e., the "Select input" action in the inject program triggers the start of
the dbx program).]
Figure 4.8: Flow Chart for Injection Instrumentation
2 indicates that the fault has been activated. This information is sent by poke to inject to
determine if the fault has been activated.
State  ADR       DATA      W/R  EBI  DMA  1543"  time
#
#ID 61/0 PID: 24814 Vloc 0x0040170c Ploc 0x003ae70c
0      1FCA000C  1960000D  01   111  ...  11111  0.89642522
1      003AE70C  1960000D  01   111  ...  11111  0.00000630
2      003AE70C  1960000D  11   111  ...  11111  0.07827434
2      003AE70C  1960000D  11   111  ...  11111  0.08470108
7      1FC00000  0BF00082  10   011  ...  10111  0.20634250
Figure 4.9: Sample collected DAS data
Table 4.4: DAS Event Identifier Meanings
Identifier Meaning
0 Fault injection routine has been activated
1 Actual fault has been injected
2 Fault has been accessed
7 Error has been detected
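Checking a DAS log for activation reduces to looking for event identifier 2. The sketch below assumes that each event line is whitespace-separated with the identifier as its first field, and that header and comment lines do not start with a digit, as in the sample of Figure 4.9; the function names are hypothetical.

```python
FAULT_ACCESSED = 2      # Table 4.4: fault has been accessed
ERROR_DETECTED = 7      # Table 4.4: error has been detected

def event_ids(das_lines):
    """Return the event identifiers found in DAS output, in order."""
    ids = []
    for line in das_lines:
        fields = line.split()
        if fields and fields[0].isdigit():    # skips the header and '#' lines
            ids.append(int(fields[0]))
    return ids

def fault_activated(das_lines):
    return FAULT_ACCESSED in event_ids(das_lines)

sample = """\
#ID 61/0 PID: 24814 Vloc 0x0040170c Ploc 0x003ae70c
0 1FCA000C 1960000D 01 111 ... 11111 0.89642522
1 003AE70C 1960000D 01 111 ... 11111 0.00000630
2 003AE70C 1960000D 11 111 ... 11111 0.07827434
7 1FC00000 0BF00082 10 011 ... 10111 0.20634250
""".splitlines()

print(event_ids(sample))                    # [0, 1, 2, 7]
print(fault_activated(sample))              # True
print(ERROR_DETECTED in event_ids(sample))  # True
```

A log that contains identifiers 0 and 1 but no 2 would correspond to a fault that was injected but never accessed, which counts as a failed attempt toward the surrender threshold.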
The results from the experiments conducted with the path-based and random fault injection
test-bed described above are presented in the next section.
4.2.2 Results
The purpose of the fault injection implementation described in the previous section was
to demonstrate the advantages of path-based injection over random injection. Specifically,
path-based injection results in a higher fault activation level because faults can be injected
in conjunction with a test program input that is guaranteed to activate that fault. The
test program used to test the implementation described in Section 4.2.1 was the compress
program. In the pre-injection analysis, a set of 39 inputs was chosen manually based upon a
knowledge of the test program functionality and command-line options. The selected inputs 
are listed in Table 4.5.
Table 4.5: Input set for compress program

Input #  Command-line arguments      Input #  Command-line arguments
0        compress data.full          20       uncompress -b -1 data.fullz.Z
1        compress data.empty         21       uncompress -b 0 data.fullz.Z
2        compress data.poof          22       uncompress -b 8 data.fullz.Z
3        compress -h                 23       uncompress -b 9 data.fullz.Z
4        compress -DVvdfnCc          24       uncompress -b 16 data.fullz.Z
5        compress -DVdfnCcq          25       uncompress -b 17 data.fullz.Z
6        compress -b                 26       zcat data.fullz.Z
7        compress -b -1 data.full    27       zcat data.emptyz.Z
8        compress -b 0 data.full     28       zcat data.poof
9        compress -b 8 data.full     29       zcat -h
10       compress -b 9 data.full     30       zcat -DVvdfnCc
11       compress -b 16 data.full    31       zcat -DVdfnCcq
12       compress -b 17 data.full    32       zcat -b
13       uncompress data.fullz.Z     33       zcat -b -1 data.fullz.Z
14       uncompress data.emptyz.Z    34       zcat -b 0 data.fullz.Z
15       uncompress data.poof        35       zcat -b 8 data.fullz.Z
16       uncompress -h               36       zcat -b 9 data.fullz.Z
17       uncompress -DVvdfnCc        37       zcat -b 16 data.fullz.Z
18       uncompress -DVdfnCcq        38       zcat -b 17 data.fullz.Z
19       uncompress -b

data.full is an uncompressed file
data.empty is a zero-length file
data.poof is a non-existent file
data.fullz.Z is a compressed file
data.emptyz.Z is a zero-length file
For each of the inputs in Table 4.5, a file is created that contains the path associated with
that input. This file is called the path-file and contains a list of the basic blocks that are
executed by that input. For instance, the contents of the path-file for input 13 are shown in
Figure 4.10. On each line in the path-file, a basic block that is executed by input 13 is listed,
followed by a list of basic blocks that are executed immediately afterwards. For example,
basic block #294 is executed and immediately followed by basic block #295 132 times and
basic block #1440 21,952 times. From this information the set of resources used by input 13
can be found. Since we are interested in control-flow faults, the locations of the conditional
branches for the basic blocks listed in the path-file for input 13 correspond to the control-flow
fault locations that can be activated by input 13. Furthermore, we are interested in how
errors in CPU registers affect the control-flow of the test program. For each basic block in
the path-file with an associated conditional branch, the register operands can be determined
by decoding the branch instruction.
Figure 4.10: Contents of Path-File for Input 13
BB#0 -> 1(1)
BB#1 -> 10(1)
BB#10 -> 675(1)
BB#11 -> 12(1)
BB#12 -> 675(1)
BB#288 -> 294(22084)
BB#294 -> 295(132) 1440(21952) 
BB#295 -> 1440(132)
BB#296 -> 296(27868) 297(16483) 
BB#297 -> 1669(22084)
BB#298 -> 501(65)
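A path-file in the format of Figure 4.10 can be read into a small transition table. The parser below is a sketch that assumes every line has the form `BB#<block> -> <successor>(<count>) ...`, as in the excerpt; the function name and the regular expressions are not part of the original tooling.

```python
import re

LINE_RE = re.compile(r"BB#(\d+)\s*->\s*(.*)")
SUCC_RE = re.compile(r"(\d+)\((\d+)\)")

def parse_path_file(lines):
    """Map each executed basic block to {successor block: transition count}."""
    path = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            block = int(match.group(1))
            path[block] = {int(succ): int(count)
                           for succ, count in SUCC_RE.findall(match.group(2))}
    return path

lines = [
    "BB#288 -> 294(22084)",
    "BB#294 -> 295(132) 1440(21952)",
    "BB#295 -> 1440(132)",
]
path = parse_path_file(lines)
print(path[294])                  # {295: 132, 1440: 21952}
print(sum(path[294].values()))    # 22084
print(77 in path)                 # False
```

The sum in the second line is a useful consistency check: the transitions out of block #294 add up to the 22,084 transitions into it from block #288. Membership tests such as the last line are what decide whether an input can activate a fault at a given basic block.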
To illustrate the injection of a specific fault, consider the following portion of C-language 
code from the test program:
Line 1: for (fileptr = filelist; *fileptr; fileptr++) {
Line 2:     exit_stat = 0;
Line 3:     if (do_decomp != 0) { /* DECOMPRESSION */
Line 4:         /* Check for .Z suffix */
Line 5:         if (strcmp(*fileptr + strlen(*fileptr) - 2, ".Z") != 0) {
Line 6:             /* No .Z: tack one on */
The code above checks whether each file specified on the command line ends with .Z and appends 
.Z if it is not already present at the end of the filename. This check is performed only if 
decompression is specified. Line 5 corresponds to the stop address of the fault, that is, the currently 
executing instruction when the fault is injected. The conditional branch corresponding to 
line 5 is
beq s2,zero,0x4006d4
which branches to location 0x4006d4 if register s2 equals zero. The control-flow graph for 
the six-line portion of code given above is shown in Figure 4.11, where the boxes are basic 
blocks and the numbers inside the boxes are the basic block numbers. The beq instruction 
corresponding to line 5 is located in basic block #77. If the contents of register s2 equal zero, 
the next basic block is #1402; otherwise, the next basic block is #78. If the path-file for 
an input in the input set contains basic block #77, then that input is capable of activating a 
fault injected into a resource used by basic block #77. Of the 39 inputs listed in Table 4.5, 
18 inputs (i.e., #13-15,20-28,33-38) are capable of activating a fault injected into register s2 
at the beq instruction in basic block #77.
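The register operands of a branch such as `beq s2,zero,0x4006d4` can be recovered from the instruction word. The sketch below assumes the standard MIPS I-type encoding (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit word offset relative to the instruction following the branch). The address of the branch is not given in the text, so `HYPOTHETICAL_PC` and the resulting instruction word are assumptions chosen only so that the target works out to 0x4006d4.

```python
BEQ_OPCODE = 0x04  # MIPS opcode for beq

# Conventional MIPS register names, indexed by register number.
REG_NAMES = [
    "zero", "at", "v0", "v1", "a0", "a1", "a2", "a3",
    "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7",
    "s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7",
    "t8", "t9", "k0", "k1", "gp", "sp", "fp", "ra",
]


def decode_beq(word, pc):
    """Return (rs name, rt name, branch target) for a beq instruction word."""
    opcode = (word >> 26) & 0x3F
    if opcode != BEQ_OPCODE:
        raise ValueError("not a beq instruction")
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    offset = word & 0xFFFF
    if offset & 0x8000:          # sign-extend the 16-bit offset
        offset -= 0x10000
    target = pc + 4 + (offset << 2)
    return REG_NAMES[rs], REG_NAMES[rt], target


HYPOTHETICAL_PC = 0x4006A0       # assumed address of the branch
HYPOTHETICAL_WORD = 0x1240000C   # beq $18, $0, +12 words

rs, rt, target = decode_beq(HYPOTHETICAL_WORD, HYPOTHETICAL_PC)
print(rs, rt, hex(target))
```

Both source registers are candidate fault locations; in this example only s2 is of interest, since the zero register always reads as zero.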
Consider the contrasting effects of path-based and random injection for a fault injected 
into register s2 at the beq instruction in basic block #77. For path-based injection, an 
input is selected that executes basic block #77 and thus activates the fault by accessing the 
corrupted contents of register s2. If register s2 originally contained a 0 and the fault flipped 
a bit and changed the value (for example, to a 1), then the control-flow would be incorrectly 
altered to execute basic block #78 instead of #1402. This incorrect control-flow would add 
an extra .Z to the filename, which would eventually result in an error detection.
Figure 4.11: Part of Control-Flow Graph for compress
For random injection, the resource usage information for the paths described above is 
not available, and thus an input for the given fault in register s2 would have to be chosen 
randomly. For purposes of comparison, our experiment allowed the input to be randomly 
chosen from the 39 inputs in Table 4.5. Since only 18 of the 39 inputs are able to activate 
the fault in register s2, a randomly selected input from the set of 39 would have roughly a 46% (18/39) 
probability of activating the fault. In reality, the probability of activating the fault with a 
random input could be much less, and in any case is never better than that for path-based 
injection.
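The contrast between the two selection strategies can be sketched as follows. The input numbering 1 to 39 is an assumption, and the per-input block sets are reduced to a placeholder that records only whether basic block #77 is executed, using the input ranges quoted above; the function names are illustrative.

```python
from fractions import Fraction


def activating_inputs(paths, block):
    """Inputs whose path contains the given basic block (path-based selection)."""
    return sorted(i for i, blocks in paths.items() if block in blocks)


def expand(ranges):
    """Expand inclusive (low, high) ranges into a set of input numbers."""
    return {i for low, high in ranges for i in range(low, high + 1)}


ALL_INPUTS = range(1, 40)                            # assumed numbering 1..39
USES_BB77 = expand([(13, 15), (20, 28), (33, 38)])   # ranges quoted in the text

# Placeholder paths: only membership of block #77 is modeled here.
paths = {i: ({77} if i in USES_BB77 else set()) for i in ALL_INPUTS}

candidates = activating_inputs(paths, 77)
p_random = Fraction(len(candidates), len(paths))

print(len(candidates))             # inputs usable for path-based injection
print(round(float(p_random), 3))   # chance that a random input activates the fault
```

Path-based injection picks any member of `candidates`, so activation is guaranteed, whereas a uniformly random choice succeeds with probability 18/39.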
The code coverage for this input set in Table 4.5 is given in Table 4.6. Here code coverage 
refers to the percentage of basic blocks that were executed by at least one input in the input 
set. Table 4.6 shows that 37.49% of all basic blocks were covered. However, 58.15% of the 
basic blocks compiled from user code were covered. Higher code coverage for user code is 
to be expected since the input set was selected based upon a knowledge of the user code. 
Because the code coverage is not 100%, faults were limited to control-flow faults that were 
activatable by the given input set.
Test Program
Table 4.6: Description of compress Program
                   #Covered   #Total   %Covered
All basic blocks        641     1710     37.49%
User code               289      497     58.15%
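The percentages in Table 4.6 follow directly from the block counts. A minimal check, with an illustrative helper name:

```python
def coverage_percent(covered, total):
    """Percentage of basic blocks executed by at least one input."""
    return 100.0 * covered / total


all_blocks = coverage_percent(641, 1710)
user_code = coverage_percent(289, 497)
print(f"{all_blocks:.2f}% {user_code:.2f}%")
```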
Path-based injection does indeed result in a higher fault activation level than random 
injection. Table 4.7 shows that the fault activation level for path-based injection is four times 
higher than for random injection. In fact, for path-based injection, all injected faults were 
activated, which is to be expected, since that is the main purpose of path-based injection. 
With random injection, an average of over four injections were needed to activate each fault. 
Thus, the fault activation level is less than one-quarter. Note that even this low level of fault 
activation is achieved with some assistance because the test program inputs for random 
injection were selected from the path-based injection input set. In reality, the path-based 
injection input set would not be available, and thus the true fault activation level for random 
injection would be even lower than that reported in Table 4.7. Also, for the random injections 
in our experiment, fault locations are restricted to the same set of fault locations for path- 
based injection, which eliminates the injection of faults into resources that are never used. 
This further inflates the fault activation level in Table 4.7.
Table 4.7: Random Injections vs. Path-based Injections
                         Random   Path-based
Faults injected             623          149
Faults activated            149          149
Injections/activation      4.18         1.00
Fault activation level     0.24         1.00
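The two derived rows of Table 4.7 are reciprocal views of the same counts. The sketch below recomputes them from the injected and activated totals; the function names are illustrative.

```python
def activation_level(activated, injected):
    """Fraction of injected faults that were activated."""
    return activated / injected


def injections_per_activation(activated, injected):
    """Average number of injections needed to obtain one activated fault."""
    return injected / activated


random_level = activation_level(149, 623)
random_cost = injections_per_activation(149, 623)
path_level = activation_level(149, 149)

print(round(random_level, 2), round(random_cost, 2), round(path_level, 2))
```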
More insight into the difference between path-based and random injection can be obtained 
by investigating the effect of an equal number of injections for both injection methods. 
Table 4.8 contains some measurements for the first 100 injections for both injection methods. 
Since 100 injections were performed, 100 inputs were needed for both methods. For path- 
based injection, all 100 inputs resulted in fault activation, while only 30 of the inputs for 
random injection caused fault activation. This 0.30 fault activation level is slightly higher 
than the 0.24 level in Table 4.7, but is nonetheless still well below the 1.00 level for path- 
based injection. Of the inputs that caused fault activation, 31 unique inputs were used for 
path-based injection and 23 unique inputs were used for random injection. It is interesting 
to note that for path-based injection a set of 31 inputs was able to activate 100 different 
faults, while for random injection a set of 23 inputs was only able to activate 30 different 
faults. Thus, with path-based injection a relatively small input set is able to activate a 
large number of faults. This is advantageous for validation efforts, since the generation and 
management of a larger input set incurs a greater cost.
Table 4.8: Comparison of Paths for the First 100 Injections
Injection method   Total inputs used   Activating inputs   Unique activ. inputs
Path-based                       100                 100                     31
Random                           100                  30                     23
In addition to verifying that path-based injection does indeed result in a higher fault 
activation level, the experiments performed also allow the fixed cost (C_F in Figure 4.4) and 
the per-activated-fault time-cost for path-based and random injection to be measured. With 
these measurements, the minimum number of faults needed to justify path-based injection
(F_R) can be calculated. The values of these parameters are given in Table 4.9. The fixed 
cost of the pre-injection analysis, C_F, is 1357 seconds. This value does not include the time 
required to manually determine an appropriate input set, but that time was small relative 
to the measured C_F. The average time-cost associated with each fault before activation 
occurred was found to be 51 seconds for path-based injection and 213 seconds for random 
injection. Based on these measurements, F_R is calculated to be 8.4 faults. F_R is the minimum 
number of faults that have to be injected before path-based injection incurs a lower time-cost 
than random injection. Thus, path-based injection is justified from a time-cost perspective 
if at least 9 faults are to be activated.
Table 4.9: Measured Values for Cost Graph (Figure 4.4)
Meaning                                            Value
Fixed time-cost of pre-injection analysis (C_F)    1357 s
Startup time-cost of injections                    9 s
PBI time-cost per activated fault                  51 s
Random time-cost per activated fault               213 s
Min. # faults to justify PBI (F_R)                 8.4
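The break-even value in Table 4.9 can be reproduced under a simple linear cost model. The sketch assumes that path-based injection costs the fixed analysis time plus 51 s per activated fault, that random injection costs 213 s per activated fault, and that the 9 s startup cost is common to both methods and therefore cancels. This model is an assumption that happens to reproduce the reported 8.4; the exact form of the cost graph in Figure 4.4 is not restated here.

```python
import math

C_F = 1357.0       # fixed pre-injection analysis cost (s)
T_PBI = 51.0       # path-based time-cost per activated fault (s)
T_RANDOM = 213.0   # random time-cost per activated fault (s)


def cost_path_based(n):
    """Assumed total time to activate n faults with path-based injection."""
    return C_F + T_PBI * n


def cost_random(n):
    """Assumed total time to activate n faults with random injection."""
    return T_RANDOM * n


# Break-even point: C_F + T_PBI * n = T_RANDOM * n
f_r = C_F / (T_RANDOM - T_PBI)
min_faults = math.ceil(f_r)

print(round(f_r, 1), min_faults)
```

With 8 faults random injection is still cheaper under this model (1704 s versus 1765 s), while with 9 faults path-based injection is cheaper (1816 s versus 1917 s).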
5. EXPERIMENTAL RESULTS
The fault tolerance benchmark described in Chapter 3 has been implemented on 
three fault-tolerant computers. A description of these three systems and their main fault- 
tolerant features is presented in Section 5.1. The results of experiments to demonstrate the 
utility of the benchmark are given in Section 5.2.
5.1 Description of Target Systems
The experiments described in this thesis have been conducted on three Tandem fault- 
tolerant computers: (1) TMR Prototype A, (2) TMR Prototype B, and (3) Duplex Prototype 
C. The first two TMR machines are based on the Tandem Integrity S2 architecture, the 
general fault-tolerant features of which are described in Section 5.1.1. The third machine, 
Duplex Prototype C, is based on the new Tandem ServerNet architecture. Section 5.1.2 
summarizes the fault-tolerant features of the ServerNet architecture. A comparison of 
these three systems is given in Table 5.1.
Table 5.1: Comparison of Target Systems
                     TMR                TMR                Duplex
                     Prototype A        Prototype B        Prototype C
CPU                  MIPS R3000         MIPS R4400         MIPS R4400
Architecture         TMR CPUs           TMR CPUs           Pair of lock-stepped CPUs
Memory               8 MB local,        128 MB local,      256 MB
                     32 MB global       96 MB global
CPU-I/O connection   Replicated bus (Prototypes A and B)   High-speed router network
Main fault           Voting on global memory accesses      Self-checking components;
tolerance            and interrupts (Prototypes A and B)   ECC memory
Note: These systems are prototypes that are not necessarily representative of production systems.
5.1.1 Description of Tandem Integrity S2
The Integrity S2 [36] is a fault-tolerant computer designed by Tandem Computers, Inc. 
The core of the S2 is its triple-modular-redundant processors. Each processor includes a 
CPU, a cache, and an 8MB local memory. Although these three processors perform the 
same work, they operate independently of each other until they need to access the doubly- 
replicated global memory. At this point, the duplexed Triple Modular Redundant Controllers 
(TMRCs) vote on the address and data. If an error is found, the faulty processor is shut 
down. After it passes a power-on self-test (POST), it is reintegrated into the system by 
copying the states of the two good processors. Voting also occurs on all I/O and interrupts. 
In addition, the local memory is scrubbed periodically. This architecture ensures that a fault 
that occurs on one processor will not propagate to other system components without being 
caught by the TMRC voting process.
Figure 5.1: Overview of Tandem Integrity S2 Architecture 
5.1.2 Tandem ServerNet
The Tandem ServerNet architecture [2] is a new fault-tolerant architecture that is based 
on the ServerNet System Area Network, which is a packetized, byte-serial multi-stage network 
that supports both I/O and interprocessor traffic. Figure 5.2 depicts a common implementation 
of a ServerNet-based system. Fault tolerance is achieved through duplication of 
CPUs, I/O devices, and paths among CPUs and I/O devices.
The CPUs in Figure 5.2 operate in a “pair-and-spare” configuration. Each CPU (represented 
in Figure 5.2 by a box labeled CPU) includes two microprocessors that are tightly 
synchronized. A divergence on the address or data bus of these two microprocessors on any 
cycle is flagged as an error and results in the disabling of that CPU. Two such CPUs are 
paired together for fault-tolerant operation. The memories and caches in both CPUs are 
identical because the same code is executed on both CPUs and all inbound I/O is forwarded
to both memories. If one CPU is determined to be in an error state and is disabled, the 
other CPU continues processing with no interruption in service.
Each CPU is linked to each I/O device through two distinct paths in the router network 
(represented in Figure 5.2 by circles labeled R). All routers are self-checking. If any router 
is determined to be faulty, that router is disabled, which results in the disabling of the 
entire path that contains that router. The remaining corresponding path then takes over 
the functionality of the disabled path. The I/O devices contain separate controllers for each 
path. In addition, disk devices are usually mirrored.
Figure 5.2: Overview of Tandem ServerNet Architecture 
5.2 Benchmark Results
The benchmark procedure described in Chapter 3 was performed for the three 
systems described in the previous section. Both phase 1 and phase 2 results are presented 
below. Because Prototype C is a new machine, the full fault injector was not implemented.
For that machine, the fault injector can inject CPU and memory faults but no I/O faults. 
However, the built-in hardware fault injection mechanism for Prototype C is utilized for 
phase 2 experiments.
5.2.1 Phase 1
Table 5.2: Phase 1 Results
TMR Prototype A
Workload   Runs   Faults   Errors   E/F     Twf (sec)   Twof (sec)   PD/F                  Catastrophic Incidents
33-33-33   12     66.4     11.5     0.174   1925        1882         .000494 ± .000266     3
5-5-90     19     41.0     27.6     0.673   3411        1596         .028003 ± .001853     1
90-5-5     19     22.4     4.0      0.180   2341        2305         .001012 ± .000361     0

TMR Prototype B
Workload   Runs   Faults   Errors   E/F     Twf (sec)   Twof (sec)   PD/F                  Catastrophic Incidents
33-33-33   12     10.7     7.3      0.680   2377        2194         .006648 ± .001851     0
5-5-90     19     4.1      3.3      0.788   2224        2130         .011394 ± .002108     0
90-5-5     19     6.0      3.7      0.612   2224        2138         .006686 ± .001865     0

Duplex Prototype C
Workload   Runs   Faults   Errors   E/F     Twf (sec)   Twof (sec)   PD/F                  Catastrophic Incidents
33-33-33   12     12.1     6.4      0.529   661         643          .002384 ± .001501     0
5-5-90     19     10.0     5.2      0.517   498         483          .002939 ± .003510     0
90-5-5     19     11.5     6.4      0.557   575         563          .001967 ± .001499     0

Note: All PD/F intervals are 95% confidence intervals for means.
Phase 1 consists of injections that should be tolerated by the system; these injections 
are performed in order to verify that the machine tolerated expected faults correctly and to 
evaluate the impact of the fault-tolerant implementation on performance. The phase 1 results 
were obtained for all three machines. Table 5.2 contains these results. For each machine, the 
results are categorized by the workload used during the injections (5-5-90 is disk intensive, 
90-5-5 is CPU intensive, and 33-33-33 is mixed). The number of faults and errors are an
average of the total number of faults injected and errors detected for each workload type. 
The error/fault ratio is given in the “E/F” column, the performance degradation is given in 
the “PD/F” column, and the number of catastrophic incidents is given in the “Catastrophic 
Incidents” column. Twf is the time in seconds for the workload program to execute with 
faults; Twof is the same measurement without faults. The Twf and Twof values for the three 
machines in Table 5.2 are not identical because the amount of workload activity was adjusted 
to account for different processing speeds and different amounts of allocated machine time. 
The workload composition (relative mix of CPU, memory, and I/O activity) remained the 
same.
Table 5.2 contains some important results. Most important is the benchmark result that 
4 catastrophic incidents occurred for Prototype A and none for Prototypes B and C. This 
is an indication of the high degree of fault tolerance for Prototypes B and C compared to 
Prototype A, which is to be expected, since Prototype A is a very early prototype, while 
Prototype B is a newer design which contains corrections to the problems of Prototype A, 
and Prototype C is based on a much newer architecture. In particular, for Prototype B, 
improvements were made to the CPU and TMRC (see [36]) which affect fault containment 
and isolation.
The probability that a catastrophic incident will occur for a single run for Prototype 
A is 0.08 with a 95% confidence interval of [0.004,0.156]. Since no catastrophic incidents 
were observed for Prototypes B and C, the resulting mean of zero cannot be used as the 
catastrophic incident probability. However, it can be shown that the catastrophic incident 
probabilities for Prototypes B and C are less than for Prototype A at a statistically significant 
level using hypothesis testing. If the mean catastrophic incident rates for Prototypes A, B,
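and C are $\alpha_1 = 0.08$, $\alpha_2 = 0$, and $\alpha_3 = 0$, respectively, then a hypothesis test for

$$H_{0(AB)}: \alpha_1 = \alpha_2, \qquad H_{1(AB)}: \alpha_1 > \alpha_2$$

and

$$H_{0(AC)}: \alpha_1 = \alpha_3, \qquad H_{1(AC)}: \alpha_1 > \alpha_3,$$

both at a 0.05 level of significance, shows that for both $H_{0(AB)}$ and $H_{0(AC)}$,

$$z = \frac{\alpha_1 - \alpha_2}{\sigma_{\alpha_1 - \alpha_2}} = \frac{\alpha_1 - \alpha_2}{\sqrt{\dfrac{\sigma_{\alpha_1}^2}{N_{\alpha_1}} + \dfrac{\sigma_{\alpha_2}^2}{N_{\alpha_2}}}} = 2.064,$$

where $\sigma_{\alpha_1}$ is the estimated standard deviation for $\alpha_1$, 
$\sigma_{\alpha_2}$ is the estimated standard deviation for $\alpha_2$, 
$N_{\alpha_1}$ is the sample size for $\alpha_1$, and 
$N_{\alpha_2}$ is the sample size for $\alpha_2$ (with $\alpha_3$ in place of $\alpha_2$ for the second test).
Since $z > 1.645$, both $H_{0(AB)}$ and $H_{0(AC)}$ can be rejected at the 0.05 significance level, 
indicating that more catastrophic incidents occur for Prototype A compared to the other 
two machines at a statistically significant level.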
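The reported statistic and confidence interval can be reproduced from the counts in Table 5.2. The sketch assumes that each of the 50 runs per machine (12 + 19 + 19) is scored as 0 or 1 catastrophic incident, that Prototype A had 4 such runs, and that the standard deviation is the sample estimate with an n - 1 denominator. These modeling choices are assumptions; they are used here because they match the published values of 2.064 and [0.004, 0.156].

```python
import math


def sample_variance(incidents, runs):
    """Sample variance (n - 1 denominator) of a 0/1 outcome per run."""
    mean = incidents / runs
    squared = incidents * (1 - mean) ** 2 + (runs - incidents) * mean ** 2
    return squared / (runs - 1)


def z_statistic(k1, n1, k2, n2):
    """One-sided two-sample z statistic for the difference of two means."""
    diff = k1 / n1 - k2 / n2
    std_err = math.sqrt(sample_variance(k1, n1) / n1 +
                        sample_variance(k2, n2) / n2)
    return diff / std_err


RUNS = 12 + 19 + 19            # runs per machine, from Table 5.2
INCIDENTS_A = 3 + 1 + 0        # catastrophic incidents for Prototype A

z = z_statistic(INCIDENTS_A, RUNS, 0, RUNS)   # A versus B (and A versus C)

mean_a = INCIDENTS_A / RUNS
half_width = 1.96 * math.sqrt(sample_variance(INCIDENTS_A, RUNS) / RUNS)
interval = (mean_a - half_width, mean_a + half_width)

print(round(z, 3), z > 1.645)
print(round(interval[0], 3), round(interval[1], 3))
```

Because Prototypes B and C both recorded zero incidents in the same number of runs, the same statistic applies to both comparisons.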
The performance degradation numbers are generally higher for Prototypes B and C, 
mainly because these two machines contain more memory and hence require more time for 
recovery. For both machines, recovery entails copying the data in the memories of the good 
CPUs to the bad CPU. The more memory that needs to be copied, the longer recovery takes.
The exception is the performance degradation for the 5-5-90 (or I/O-intensive) workload, 
which is much greater for Prototype A than for Prototype B. The difference here is the 
extent to which recovery of a disk mirror affects the workload execution. Prototype A uses 
an older disk file system, which incurs a greater loss of disk bandwidth during recovery. 
This effect can be easily seen when the 5-5-90 workload is executed while mirror recovery is 
constantly ongoing. Table 5.3 compares the execution times for an I/O-intensive workload 
with simultaneous mirror recovery and without, for Prototypes A and B. The table shows that 
Prototype A suffers a much higher performance hit when mirror recovery occurs. Statistical 
hypothesis testing shows that the mean performance degradation numbers for each type of 
workload in Table 5.2 are statistically different when comparing Prototypes A and B. Indeed, 
the hypothesis that the mean performance degradation for a particular workload is the same 
for both machines can be easily rejected even at a 0.005 level of significance.
Table 5.3: Demonstration of Disk Bandwidth During Mirror Recovery
              Time with         Time without
              mirror recovery   mirror recovery   Ratio
Prototype A   5539 secs         1796 secs         3.08
Prototype B   2310 secs         1878 secs         1.23
For Prototypes A and B, the performance degradation for the 5-5-90 I/O-intensive workload 
is greater than for other workloads. The reason for this is the double negative impact 
of disk faults. In addition to the penalty imposed by the disk recovery procedure, the loss 
of disk data bandwidth that exists until recovery is complete results in an even greater hit 
on performance.
The performance degradation for Prototype C is less than that for Prototype B. The 
main reason for this difference is the way in which a CPU is recovered. For both machines, memory 
and other state information needs to be copied from the good CPUs to the CPU undergoing 
recovery. The Integrity S2 architecture of Prototype B requires a complete, albeit temporary, 
cessation of all normal processing activity during the memory data transfer. For large memory 
configurations, as is the case for Prototype B, this interruption has a significant impact 
on performance, which may be especially detrimental to real-time applications. To avoid 
this interruption of service, the ServerNet architecture of Prototype C uses cycle stealing to 
perform the memory data transfer during periods of low CPU-memory activity. This has 
the result of decreasing the negative impact on performance.
Table 5.4: Fault Latencies for Fault Preceding Error Detection (secs)
            Prototype A            Prototype B            Prototype C
Workload    Avg     Median  Max    Avg     Median  Max    Avg     Median  Max
33-33-33    2.12    1       114    5.23    4       21     8.74    5       57
5-5-90      2.57    0       117    14.31   7       71     7.63    5       42
90-5-5      10.03   1       149    7.98    4       59     7.35    5       73
In addition to the benchmark figures given in Table 5.2, additional insight into the reaction 
of the tested systems to faults can be gained by investigating the latencies of the 
injected faults. Table 5.4 shows the average, median, and maximum latencies for each of 
the three prototype systems. Each row contains results for a particular workload, and each 
column shows the results for a particular machine. Since more than one injected fault may 
be present in the system at any one time, the assumption that the last injected fault causes 
the error detection is made. Although this is difficult to prove, the fact that error detections
usually occur within a few seconds after a fault injection suggests that the error detection 
is caused by the fault that was just injected. A few observations about the latencies in 
Table 5.4 can be made. The median latencies for Prototype A are very short (no more than 
one second). One possible explanation for this effect is the relatively small physical memory 
size of Prototype A. A smaller physical memory size would allow most faults to be accessed 
frequently and thus result in greater chances for error propagation and detection. However, 
the maximum latencies for Prototype A are also longer than for the other two machines. This 
might result from its slower processor speed, which would cause the same memory access 
patterns to be executed more slowly and thus result in slower error propagation. Finally, 
Prototype C exhibited an interesting effect that is not shown in Table 5.4.
5.2.2 Phase 2
Phase 2 consists of injections that may not necessarily be tolerated by the system; these 
injections are performed in order to evaluate the reaction of the system to fault conditions 
that it may not be designed specifically to handle and thus demonstrate the degree of fault 
tolerance beyond what is expected. Table 5.5 gives some phase 2 results for the three systems 
tested. The table contains two types of measures: (1) the average number of faults and time 
in seconds to the initial error detection and to a catastrophic incident, and (2) the type of 
catastrophic incident. For all phase 2 injections, a catastrophic incident is defined as the 
condition which prevents successful completion of the workload and is usually manifested 
as a system panic or hang. The “Others” category in the last row of Table 5.5 includes 
several premature workload process terminations and an automatic system reboot. The 
fault injections for Prototypes A and B consist of bit-flips in CPU registers and memory.
For Prototype C, two types of injections are performed: (1) bit-flips in CPU registers and 
memory in a manner similar to that for Prototypes A and B and (2) injections using the 
fault injection mechanism built into the three ASICs described near the end of Section 3.2.1. 
The results for the first type of injections are given in the Table 5.5 column labeled “C”, 
and the results for the second type of injections are given in the column labeled “C (HW)”.
Table 5.5: Phase 2 Results
                                            Prototype Machines
Measure                                     A       B       C       C (HW)
Avg. injections to initial detection        6.6     3.0     1.6     2.1
Avg. injections to catastrophic incident    10.4    4.1     5.0     15.8
Avg. time to catastrophic incident (sec)    85.8    31.9    198.5   467.5
Time from initial detection
to catastrophic incident (sec)              18.5    28.0    150.5   428.7

Catastrophic incident manifestation         A       B       C       C (HW)
Panics                                      7       7       0       3
Hangs                                       1       3       0       7
Others                                      2       0       10      0
Of the first type of measures (in the upper part of Table 5.5), two are especially significant: 
the average number of fault injections to catastrophic incident and the time from initial 
detection to catastrophic incident. It is interesting to note that all three machines were able 
to continue correct operation past the first few fault injections even though the injections 
were intended to cause catastrophic incidents. Also, all three machines detect their first error 
within an average of one minute (based upon an exponential fault interarrival rate with a 
mean of 20 seconds).
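An injection schedule with exponentially distributed interarrival times and a mean of 20 seconds can be generated as in the following sketch. The function name, the seed, and the use of Python's `random` module are illustrative assumptions and do not describe how FTAPE schedules its injections.

```python
import random


def injection_times(mean_interarrival, count, seed=0):
    """Cumulative injection times with exponential interarrival gaps."""
    rng = random.Random(seed)          # fixed seed for repeatable schedules
    now = 0.0
    times = []
    for _ in range(count):
        # expovariate takes the rate, i.e. 1 / mean
        now += rng.expovariate(1.0 / mean_interarrival)
        times.append(now)
    return times


schedule = injection_times(20.0, 10000, seed=1)
mean_gap = schedule[-1] / len(schedule)
print(round(mean_gap, 1))              # close to 20 for a long schedule
```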
The column labeled “C” in Table 5.5 requires some explanation. The catastrophic incidents 
observed for these experiments were quite different from those for the other experiments. 
Instead of system panics or hangs, the only observed event was the termination of 
the workload program. This phenomenon is to be expected due to the duplex nature of the 
ServerNet CPUs. If faults are present in both CPUs in the duplex pair and a resulting error 
is detected in the first CPU, that CPU is disabled and the second CPU becomes the remaining 
CPU. A latent fault in the lone remaining CPU cannot be detected, since it will neither 
cause a self-check violation nor a duplex mismatch, and thus the CPU continues operation. 
It is possible that the fault might cause a hardware or software exception, but the only likely 
consequence would be the failure of the workload program, which would be manifested as 
a termination of the program or, in the worst case, the production of an incorrect result. 
Note that the FTAPE workload program does not check for the correctness of its result, and 
therefore, an incorrect result would not be detected.
Figure 5.3: Phase 2 Catastrophic Incident Scenario (CPU A, CPU B, and CPU C; Fault 2 is detected during recovery and fails the system)
This effect is in contrast to that for TMR-based architectures, which are capable of 
detecting component state mismatches, even after the first component is disabled. Figure 5.3
depicts a situation where two phase-2 faults have been injected into the memories of CPU 
A and CPU B. The fault in the memory of CPU B is discovered first by the memory voter, 
which then fails CPU B. After CPU B passes a diagnostic hardware check, recovery begins. 
The memories of CPUs A and C are first compared and copied to the memory of CPU B on 
a block-by-block basis. When the block containing Fault 2 is reached, the mismatch between 
CPUs A and C is detected, and the system fails.
However, there is one sequence of events that avoids the described catastrophic incident 
scenario and upon which an alternative, albeit more complex, hardware recovery implementation 
might be developed. A memory scrubber is implemented for both TMR Prototypes 
A and B. When the memory scrubber discovers an error in a CPU, it performs a memory 
block copy, but only for the affected memory block. If another error is present in a different 
memory block in another CPU, that error is not corrected until the scrubber discovers it 
later. Unfortunately, the memory scrubber discovers far fewer errors than the memory voting 
mechanism, because the memory scrubber is specifically configured to slowly traverse memory 
in order to minimize its impact on performance through memory contention. However, 
this type of strategy could be used to improve the recovery of CPUs downed by the memory 
voter. Referring again to Figure 5.3, when Fault 2 is encountered during the recovery process 
and a mismatch between CPUs A and C is detected, the copy of that memory block for CPU 
B can be used to determine that CPU A is in error. CPU A can then be corrected at that 
time, and the recovery for CPU B can continue. This recovery process fails only when the 
memories of all three CPUs disagree at exactly the same memory location, which should be 
extremely unlikely. One drawback to this recovery method is the reliance on data in a failed 
component for recovery.
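The alternative recovery strategy described above can be illustrated with a small simulation. This is a sketch only: memories are modeled as lists of block values, the function name is hypothetical, and the real mechanism would be implemented in hardware rather than software.

```python
def recover_with_vote(mem_a, mem_c, mem_b_failed):
    """Rebuild the failed CPU B block by block from CPUs A and C.

    When A and C disagree on a block, the stale copy held by the failed
    CPU B is used as a tie-breaker, as proposed in the text. Returns the
    repaired (A, C, B) memories, or raises if all three copies differ.
    """
    a, c, b = list(mem_a), list(mem_c), list(mem_b_failed)
    for i in range(len(b)):
        if a[i] == c[i]:
            b[i] = a[i]              # normal case: copy the agreed block
        elif b[i] == c[i]:
            a[i] = c[i]              # A is the odd one out: correct A
        elif b[i] == a[i]:
            c[i] = a[i]              # C is the odd one out: correct C
        else:
            raise RuntimeError(f"three-way disagreement in block {i}")
    return a, c, b


GOOD = [10, 20, 30, 40]
cpu_a = [10, 20, 99, 40]   # Fault 2: latent error in block 2 of CPU A
cpu_b = [77, 20, 30, 40]   # Fault 1: detected first, CPU B was failed
cpu_c = list(GOOD)

print(recover_with_vote(cpu_a, cpu_c, cpu_b))
```

In this example the recovery completes with all three memories equal to the fault-free contents, whereas the recovery procedure described for Figure 5.3 would fail the system at block 2.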
The fourth column of Table 5.5 contains results of injections into Prototype C using 
the built-in fault injection mechanism described in Section 3.2.1. It should be noted that 
these results are based on fault models that differ greatly from those for the results in the 
first three columns, which use SWIFI fault models. Nonetheless, Table 5.5 does show that 
faults injected via the built-in fault injection mechanism are better tolerated by Prototype 
C, because the system is able to tolerate more faults before a catastrophic incident occurs.
Table 5.6: Phase 2 Results for Duplex Prototype C (with no attempted recovery)
                                            ASICs
Measure                                     All 3   M       R       I
Avg. injections to initial detection        2.1     6.0     1.0     5.5
Avg. injections to
catastrophic incident                       15.8    15.1    6.9     6.9
Avg. time to catastrophic
incident (sec)                              467.5   304.8   199.2   139.9
Avg. time from initial detection
to catastrophic incident (sec)              428.7   138.8   194.6   44.4

Catastrophic incident manifestation         All 3   M       R       I
Panics                                      3       9       5       5
Hangs                                       7       0       5       5
Others                                      0       1       0       0
The results in Table 5.5 for the built-in fault injection mechanism are repeated in 
Table 5.6, which also shows the results of injections into different ASICs in Prototype C. For these 
experiments, no recovery was attempted, thus causing the system to fail in every case. The 
fully operational system will, of course, be capable of recovery from errors. As previously 
mentioned in Section 3.2.1, these ASICs comprise the high-speed interconnect between the 
processors and I/O devices. ASIC R is a router chip that provides the backbone for the 
interconnect, ASIC M interfaces the processors to the router ASICs, and ASIC I interfaces
the I/O devices to the router ASICs. These results provide insight into the designs of the 
different types of ASICs and in particular highlight specific areas that should receive more 
attention. For example, ASIC I fails shortly after detecting the first error whereas ASIC M 
is able to forestall a catastrophic incident for a longer period of time. Interestingly, when 
faults are injected into all three types of ASICs, the mean time to catastrophic incident is 
longer than for faults into one specific type of ASIC. However, when such an incident finally 
occurs, more hangs than panics result, indicating that such incidents are more severe.
As a final note, one of the phase 2 injections into Prototype C caused a system hang 
after a single fault had been injected. This discovered a design defect, which illustrates the 
utility of the fault injection method in providing feedback to designers.
6. CONCLUSIONS
This thesis has presented a benchmark for fault-tolerance. The benchmark is based on 
the FTAPE tool, which injects CPU, memory, and disk faults and generates workloads with 
specifiable amounts of CPU, memory, and disk activity. Two benchmark metrics are produced: 
(1) a count of the number of catastrophic incidents and (2) the average performance 
degradation. The catastrophic incident count represents the recovery coverage of the system, 
while the performance degradation reflects the performance of the system in the presence of 
faults.
The benchmark is fully functional and has been implemented on three Tandem fault- 
tolerant machines. Experiments on those machines using the benchmark produce the benchmark 
metrics, as well as additional insight into the reaction of the system to faults through 
measures such as error latency. The experiments performed on these prototype machines 
do not necessarily characterize the operation of production machines, but the experiments 
do demonstrate the utility of this benchmark in comparing the high-level dependability 
of different systems. The results show that Prototypes B and C are more fault-tolerant 
than Prototype A, in that they suffer fewer catastrophic incidents under the same workload
conditions and fault injection method. Furthermore, Prototype C suffers less performance 
degradation in the presence of faults, which might be an important concern for time-critical 
applications.
The area of fault tolerance benchmarks is just emerging. This work represents an effort 
to propose and demonstrate a fault tolerance benchmark and, to the authors’ knowledge, 
is the first to develop a benchmark that represents the overall fault tolerance of a system. 
Because this work represents an initial and groundbreaking effort, there is certainly room 
for improvement in the proposed benchmark (some of these ideas are discussed in the last 
section in this chapter). However, as mentioned before, the fault tolerance benchmark is 
fully functional and does indeed produce the advertised metrics that can be used to compare 
fault-tolerant machines.
Several significant ideas are embodied in our benchmark: The system under test is viewed 
as being composed of logical components (such as CPU, memory, disk, etc.), which not only 
provides targets for fault injection, but also aids portability by defining fault injection targets 
in terms of components that all systems possess. The fault injector and workload generator 
are both modular, which allows the benchmark to be easily reconfigured if the underlying 
fault models to be tested are redefined. Focused fault injection strategies are used to increase 
the level of fault-tolerant activity caused by injected faults. Also, the use of a performance 
degradation measurement is unique.
An important consideration is the applicability of the proposed benchmark to systems 
that are very different. A benchmark consists of a metric and the procedure needed to 
obtain that metric. The benchmark described in this thesis specifies the procedure for 
obtaining the metric in the form of a benchmark program. The benchmark program was
initially implemented on Prototype A. Porting the benchmark to Prototype B was easily 
performed because both systems are based on the same architecture. Thus, the physical 
components, such as the TMR CPUs, that are targeted by the workload generator and 
fault injector are present on both machines. Although systems may differ in the design 
of their physical components, all systems possess certain vital logical components, such 
as a processor, memory, and I/O subsystem. The faults and workloads generated by the 
benchmark program target these logical components. For example, faults are injected into 
the logical processor, whether the actual design is TMR or lock-stepped duplication. In both 
the TMR and lock-stepped designs, a single fault would be injected into one out of either 
two or three duplicated components.
When systems with different architectures are being tested, the benchmarking procedure 
needs to be specified in terms of fault models and workloads targeted at logical components. 
By viewing the system as being composed of logical components, the same workload and fault 
sets can be used for different systems. This separate view of logical and physical components 
is supported by the separation of the fault injector into a high level and low level as shown 
in Figure 3.3. The high-level fault injector code injects faults into logical components, while 
the system-dependent, low-level code determines how injections into logical components map 
to injections into physical components. Prototype C is based on a duplex architecture that 
is very different from the TMR architecture. All of the high-level benchmark code has been 
ported to Prototype C, as well as a portion of the low-level code, which has allowed some 
results to be obtained.
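The split between portable high-level code and system-dependent low-level code can be sketched as follows. This is only an illustration under stated assumptions: the class names, the component names (`cpu0`, `cpu_a`, and so on), and the random choice of one replica are hypothetical and do not reproduce the actual benchmark code.

```python
import random
from abc import ABC, abstractmethod
from typing import List

# Logical components that every system is assumed to possess.
LOGICAL_COMPONENTS = ("cpu", "memory", "io")


class LowLevelInjector(ABC):
    """System-dependent layer: maps logical targets to physical ones."""

    @abstractmethod
    def physical_targets(self, logical: str) -> List[str]:
        ...

    def inject(self, physical: str) -> str:
        # A real port would corrupt hardware or OS state here;
        # this sketch only reports which physical target was chosen.
        return f"fault injected into {physical}"


class TmrInjector(LowLevelInjector):
    def physical_targets(self, logical: str) -> List[str]:
        # Hypothetical TMR machine: three replicated CPUs.
        return ["cpu0", "cpu1", "cpu2"] if logical == "cpu" else [logical]


class DuplexInjector(LowLevelInjector):
    def physical_targets(self, logical: str) -> List[str]:
        # Hypothetical duplex machine: two lock-stepped CPUs.
        return ["cpu_a", "cpu_b"] if logical == "cpu" else [logical]


class HighLevelInjector:
    """Portable layer: knows only about logical components."""

    def __init__(self, low_level: LowLevelInjector, seed: int = 0) -> None:
        self.low_level = low_level
        self.rng = random.Random(seed)

    def inject(self, logical: str) -> str:
        if logical not in LOGICAL_COMPONENTS:
            raise ValueError(f"unknown logical component: {logical}")
        # A single fault goes into one of the duplicated physical components.
        target = self.rng.choice(self.low_level.physical_targets(logical))
        return self.low_level.inject(target)


tmr_result = HighLevelInjector(TmrInjector()).inject("cpu")
duplex_result = HighLevelInjector(DuplexInjector()).inject("cpu")
```

Under this arrangement, porting to a new architecture would mean supplying only a new `LowLevelInjector` subclass, while the fault set expressed in terms of logical components stays unchanged.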
One concern regarding the injection of faults into logical components is fairness. Is the 
injection of a single fault into one TMR CPU comparable to the injection of a single fault
into one CPU of a gracefully degrading 3-CPU processor? Similar questions also arise for 
performance benchmarks. For instance, is the execution of a large program in a small-cache 
system comparable to that for a system with a larger cache? These questions can be 
answered in the affirmative because one design is actually better able to handle the specific 
fault or program size mentioned. Fairness results from properly selecting a set of faults and 
workloads that consider the desirability of each design. In certain cases, the consideration 
is obvious. For example, a larger cache is almost always better for performance than a 
smaller cache. Other cases are more difficult. For example, two-CPU faults injected into the 
gracefully degrading 3-CPU processor result in fewer catastrophic incidents (because only a 
single processor is required for operation) but more performance degradation (because each 
additional failed CPU degrades the overall performance). Thus, care has to be taken in the 
selection of workload and fault inputs for the benchmark program.
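The trade-off in the last example can be made concrete with a toy model. The sketch below assumes, purely for illustration, that faults are simultaneous, that a TMR processor needs a two-out-of-three majority and loses no throughput while a fault is masked, and that the throughput of a gracefully degrading processor scales linearly with the number of surviving CPUs. None of these assumptions is a measured property of the prototype machines.

```python
from typing import Tuple


def tmr_outcome(failed_cpus: int) -> Tuple[bool, float]:
    """Return (catastrophic, relative_throughput) for a TMR processor.

    Assumption: voting needs 2 of 3 healthy CPUs; masking costs nothing.
    """
    if failed_cpus >= 2:
        return True, 0.0
    return False, 1.0


def graceful_outcome(failed_cpus: int, total_cpus: int = 3) -> Tuple[bool, float]:
    """Return (catastrophic, relative_throughput) for a degrading processor.

    Assumption: one surviving CPU suffices; throughput scales linearly.
    """
    remaining = total_cpus - failed_cpus
    if remaining <= 0:
        return True, 0.0
    return False, remaining / total_cpus


for failed in (1, 2):
    print(failed, tmr_outcome(failed), graceful_outcome(failed))
```

In this model a fault set dominated by two-CPU faults favors the gracefully degrading design on the catastrophic incident count, while a set dominated by single-CPU faults favors the TMR design on performance degradation, which is why the mix of fault inputs has to be chosen with care.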
6.1 Future Directions
The fault tolerance benchmark proposed in this thesis serves as a good starting point for 
the development of better benchmarks for fault tolerance. The benchmark and its current 
implementations on the three Tandem machines are fully functional. However, there are 
several directions for further research. Two important areas are the search for better fault 
injections and better workloads. Fault models need to be extended to more comprehensively 
take into account all fault-tolerant mechanisms in the test system. For instance, faults in the 
communications system and software faults should be investigated. As more faults are considered,
the portability of the benchmark should remain a significant concern. The workload
generator in the proposed benchmark uses simple synthetic functions to create an easily
tunable workload. While the ability to easily configure the workload is convenient, more 
realistic workloads should also be studied. In particular, existing performance benchmarks 
might provide good workloads for testing fault tolerance.
Another important direction for future research is the search for improved metrics of 
fault tolerance. Just as the current benchmark offers two separate metrics for evaluating 
recovery coverage and performability, additional metrics can be developed that take into 
account such dependability quantities as error detection and latency.
Distributed computer systems and networks of workstations are becoming more prevalent. 
The current benchmark only considers a single uni-processor machine. A distributed version 
of the benchmark should be developed to take care of distributed, parallel, and other such 
multiprocessor environments.
Finally, a more user-friendly interface should be developed. This would enhance the 
general usability of the benchmark by a wider audience. In addition, such an interface 
might provide a means to display real-time information about events during the benchmark 
execution, such as error propagation.
REFERENCES
[1] T. K. Tsai and R. K. Iyer, “Measuring fault tolerance with the ftape fault injection 
tool,” in Proc. of Performance Tools ’95/MMB ’95, pp. 26-40, Sept. 1995.
[2] W. E. Baker, R. W. Horst, D. P. Sonnier, and W. J. Watson, “A flexible servernet-based
fault-tolerant architecture,” in Proceedings 25th International Symposium on 
Fault-Tolerant Computing, (Pasadena, California), pp. 2-11, June 1995.
[3] D. Tang and R. K. Iyer, Fault-Tolerant Computer System Design, ch. 5. Upper Saddle 
River, NJ: Prentice Hall PTR, 1996.
[4] J. Lala, “Fault detection, isolation, and reconfiguration in ftmp: Methods and experimental
results,” in Proceedings 5th AIAA/IEEE Digital Avionics Systems Conference 
(DASC), 1983.
[5] K. Shin and Y. Lee, “Error detection process - model, design, and its impact on computer
performance,” IEEE Transactions on Computers, vol. C-33, pp. 529-540, June 
1984.
[6] K. Shin and Y. H. Lee, “Measurement and application of fault latency,” IEEE Transactions
on Computers, vol. C-35, pp. 370-375, Apr. 1986.
[7] G. B. Finelli, “Characterization of fault recovery through fault injection on ftmp,” IEEE 
Transactions on Reliability, vol. R-36, pp. 164-170, June 1987.
[8] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J. C. Fabre, J.-C. Laprie, E. Martins, 
and D. Powell, “Fault injection for dependability validation: A methodology and some 
applications,” IEEE Transactions on Software Engineering, vol. 16, pp. 166-182, Feb. 
1990.
[9] H. Madeira, M. Rela, F. Moreira, and J. G. Silva, “Rifle: A general purpose pin-level 
fault injector,” in Proceedings 1st European Dependable Computing Conference, (Berlin, 
Germany), pp. 199-216, Oct. 1994.
[10] J. Karlsson, U. Gunneflo, P. Liden, and J. Torin, “Two fault injection techniques for test 
of fault handling mechanisms,” in Proceedings International Test Conference, pp. 140-149,
1991.
[11] J. Cusick, R. Koga, W. Kolasinski, and C. King, “Seu vulnerability of the zilog z-80
and nsc-800 microprocessors,” IEEE Transactions on Nuclear Science, vol. NS-32, 
pp. 4206-4211, Dec. 1985.
[12] J. Karlsson, U. Gunneflo, and J. Torin, “The effects of heavy-ion induced single event 
upsets in the mc6809e microprocessor,” in Proceedings 4th International Conference on 
Fault-Tolerant Computing Systems, (Baden, Germany), 1989. GI/ITG/GMA.
[13] U. Gunneflo, J. Karlsson, and J. Torin, “Evaluation of error detection schemes using 
fault injection by heavy-ion radiation,” in Proceedings 19th International Symposium 
on Fault-Tolerant Computing, (Chicago, Illinois), pp. 340-347, June 1989.
[14] J. Karlsson, J. Arlat, and G. Leber, “Application of three physical fault injection techniques
to the experimental assessment of the mars architecture,” in Proceedings 5th 
International Working Conference on Dependable Computing for Critical Applications, 
(Urbana, IL), pp. 140-149, Sept. 1995.
[15] A. C. Merenda and E. Merenda, “Recovery/serviceability system test improvements for 
the ibm es/9000 520 based models,” in Proceedings 22nd International Symposium on 
Fault-Tolerant Computing, (Boston, MA), pp. 463-467, July 1992.
[16] R. Chillarege and N. S. Bowen, “Understanding large system failures—a fault injection 
experiment,” in Proceedings 19th International Symposium on Fault-Tolerant Computing,
(Chicago, Illinois), pp. 356-363, June 1989.
[17] Z. Segall, D. Vrsalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, R. Dancey, 
A. Robinson, and T. Lin, “Fiat -  fault injection based automated testing environment,” 
in Proceedings 18th International Symposium on Fault-Tolerant Computing, pp. 102-107,
June 1988.
[18] J. H. Barton, E. W. Czeck, Z. Z. Segall, and D. P. Siewiorek, “Fault injection experiments
using fiat,” IEEE Transactions on Computers, vol. 39, pp. 575-582, Apr. 1990.
[19] G. A. Kanawati, N. A. Kanawati, and J. A. Abraham, “Ferrari: A tool for the validation 
of system dependability properties,” in Proceedings 22nd International Symposium on 
Fault-Tolerant Computing, pp. 336-344, July 1992.
[20] T. K. Tsai, R. K. Iyer, and D. Jewett, “An approach towards benchmarking of fault-tolerant
commercial systems,” in Proceedings 26th International Symposium on Fault-Tolerant
Computing, (Sendai, Japan), pp. 314-323, June 1996.
[21] W.-L. Kao and R. K. Iyer, “Define: A distributed fault injection and monitor environment,”
in The 1994 IEEE Workshop on Fault-Tolerant Parallel and Distributed 
Systems, June 1994.
[22] S. Han, K. G. Shin, and H. A. Rosenberg, “Doctor: An integrated software fault injection
environment for distributed real-time systems,” in International Computer Performance
and Dependability Symposium, pp. 204-213, Apr. 1995.
[23] J. Carreira, H. Madeira, and J. G. Silva, “Xception: Software fault injection and monitoring
in processor functional units,” in Proceedings 5th International Working Conference
on Dependable Computing for Critical Applications, (Urbana, IL), pp. 135-149, 
Sept. 1995.
[24] D. R. Avresky and P. K. Tapadiya, “A framework for developing a software-based 
fault-injection tool for validation of software fault-tolerant techniques under hardware 
faults,” in Proceedings 2nd ISSAT International Conference on Reliability and Quality 
in Design, (Orlando, FL), Mar. 1995.
[25] D. P. Siewiorek, J. J. Hudak, B.-H. Suh, and Z. Segall, “Development of a benchmark 
to measure system robustness,” in Proceedings of the 23rd International Symposium on 
Fault-Tolerant Computing, (Toulouse, France), pp. 88-97, June 1993.
[26] A. Mukherjee and D. Siewiorek, “Measuring software dependability by robustness 
benchmarking,” Tech. Rep. CMU-CS-94-148, Carnegie-Mellon University, 1994.
[27] C. P. Dingman, J. Marshall, and D. P. Siewiorek, “Measuring robustness of a fault tolerant
aerospace system,” in Proceedings 25th International Symposium on Fault-Tolerant 
Computing, (Pasadena, California), pp. 522-527, June 1995.
[28] H. Kopetz, A. Damm, C. Koza, M. Mulazzani, W. Schwabl, C. Senft, and R. Zainlinger, 
“Distributed fault-tolerant real-time system: The mars approach,” IEEE Micro, vol. 9, 
pp. 25-40, Feb. 1989.
[29] E. W. Czeck, On The Prediction of Fault Behavior Based on Workload. PhD thesis, 
Carnegie Mellon University, Apr. 1991.
[30] W.-L. Kao, R. K. Iyer, and D. Tang, “Fine: A fault injection and monitor environment 
for tracing the unix system behavior under faults,” IEEE Transactions on Software 
Engineering, Nov. 1993.
[31] H. A. Rosenberg and K. G. Shin, “Software fault injection and its application in 
distributed systems,” in Proceedings 23rd International Symposium on Fault-Tolerant 
Computing, pp. 208-217, June 1993.
[32] S. Han, H. Rosenberg, and K. G. Shin, “Doctor: An integrated software fault injection 
environment,” CSE Technical Report CSE-TR-192-93, The University of Michigan, Ann 
Arbor, MI, Dec. 1993.
[33] N. Benwell, ed., Benchmarking Computer Evaluation and Measurement. Washington, 
D. C.: Hemisphere Publishing Corporation, 1975.
[34] L. Young and R. K. Iyer, “Error latency measurements in symbolic architectures,” in 
Proceedings AIAA Computing in Aerospace 8, (Baltimore, MD), pp. 786-794, Oct. 1992.
[35] R. K. Iyer, D. J. Rossetti, and M.-C. Hsueh, “Measurement and modeling of computer 
reliability as affected by system activity,” ACM Transactions on Computer Systems, 
vol. 4, pp. 214-237, Aug. 1986.
[36] D. Jewett, “Integrity s2: A fault-tolerant unix platform,” in Proceedings 21st International
Symposium on Fault-Tolerant Computing, June 1991.
VITA
Timothy K. Tsai was born in Taipei, Taiwan on September 6, 1967. He received the B.S. 
(cum laude) degree in electrical engineering from Brigham Young University in 1990 and the 
M.S. degree in electrical engineering from the University of Illinois at Urbana-Champaign in 
1994.
While at the University of Illinois, Mr. Tsai has served as a teaching assistant from 
1990 to 1992 in logic design and VLSI courses and a research assistant from 1991 to 1996 
in the Coordinated Science Laboratory. He also worked at Tandem Computers in Austin, 
Texas during the summers of 1993 and 1994, performing system validation of Tandem Integrity
systems and design verification of ServerNet-based systems. His research interests 
include fault-tolerant system design and evaluation, high-performance computer systems, 
and distributed computer systems.
Mr. Tsai will begin employment with Lucent Bell Labs in September 1996.
