Hierarchical Simulation to Assess Hardware and Software Dependability by Ries, Gregory Lawrence
October 1997 UILU-ENG-97-2230
CRHC-97-17
University of Illinois at Urbana-Champaign
Hierarchical Simulation to Assess Hardware
and Software Dependability
Gregory Lawrence Ries
Coordinated Science Laboratory
1308 West Main Street, Urbana, IL 61801
https://ntrs.nasa.gov/search.jsp?R=19970034703 2020-06-16T01:42:36+00:00Z
REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE
3. REPORT TYPE AND DATES COVERED
4. TITLE AND SUBTITLE
Hierarchical Simulation to Assess Hardware and Software Dependability
5. FUNDING NUMBERS
NASA NAG-1-613
DABT63-94-C-0045
6. AUTHOR(S)
Gregory Lawrence Ries
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
Coordinated Science Laboratory
University of Illinois
1308 W. Main St.
Urbana, IL 61801
8. PERFORMING ORGANIZATION REPORT NUMBER
UILU-ENG-97-2230 (CRHC-97-17)
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
NASA Langley Research Center
Hampton, VA 23681
DARPA/ITO
3701 N. Fairfax Dr.
10. SPONSORING/MONITORING AGENCY REPORT NUMBER
11. SUPPLEMENTARY NOTES
The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as
an official Department of the Army position, policy or decision, unless so designated by other documentation.
12a. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release; distribution unlimited.
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
This thesis presents a method for conducting hierarchical simulations to assess system hardware and software dependability.
The method is intended to model embedded microprocessor systems. A key contribution of the thesis is the idea of using
fault dictionaries to propagate fault effects upward from the level of abstraction where a fault model is assumed to the system
level where the ultimate impact of the fault is observed. A second important contribution is the analysis of the software
behavior under faults as well as the hardware behavior. The simulation method is demonstrated and validated in four case
studies analyzing Myrinet, a commercial, high-speed networking system. One key result from the case studies shows that the
simulation method predicts the same fault impact 87.5% of the time as is obtained by similar fault injections into a real
Myrinet system. Reasons for the remaining discrepancy are examined in the thesis. A second key result shows the reduction
in the number of simulations needed due to the fault dictionary method. In one case study, 500 faults were injected at the
chip level, but only 255 propagated to the system level. Of these 255 faults, 110 shared identical fault dictionary entries at the
system level and so did not need to be resimulated. The necessary number of system-level simulations was therefore reduced
from 500 to 145. Finally, the case studies show how the simulation method can be used to improve the dependability of the
target system. The simulation analysis was used to add recovery to the target software for the most common fault propaga-
tion mechanisms that would cause the software to hang. After the modification, the number of hangs was reduced by 60% for
fault injections into the real system.
14. SUBJECT TERMS
fault dictionary, hierarchical simulation, hardware dependability, software dependability
15. NUMBER OF PAGES
7g
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT
UNCLASSIFIED
18. SECURITY CLASSIFICATION OF THIS PAGE
UNCLASSIFIED
19. SECURITY CLASSIFICATION OF ABSTRACT
UNCLASSIFIED
20. LIMITATION OF ABSTRACT
UL
NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. 239-18
HIERARCHICAL SIMULATION TO ASSESS HARDWARE
AND SOFTWARE DEPENDABILITY
BY
GREGORY LAWRENCE RIES
B.S., Case Western Reserve University, 1992
M.S., University of Illinois, 1995
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 1997
Urbana, Illinois
HIERARCHICAL SIMULATION TO ASSESS HARDWARE
AND SOFTWARE DEPENDABILITY
Gregory Lawrence Ries, Ph.D.
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign, 1997
Ravishankar K. Iyer, Advisor
This thesis presents a method for conducting hierarchical simulations to assess sys-
tem hardware and software dependability. The method is intended to model embedded
microprocessor systems. A key contribution of the thesis is the idea of using fault dictio-
naries to propagate fault effects upward from the level of abstraction where a fault model
is assumed to the system level where the ultimate impact of the fault is observed, and
a second important contribution is the analysis of the software behavior under faults as
well as the hardware behavior.
The simulation method is demonstrated and validated in four case studies that analyze
a commercial, high-speed networking system called Myrinet. One key result from the case
studies shows that the simulation method predicts the same fault impact 87.5% of the
time, as is obtained by similar fault injections into a real Myrinet system. Reasons for
the remaining discrepancy are examined in the thesis. A second key result shows the
reduction in the number of simulations needed due to the fault dictionary method. In
one case study, 500 faults were injected at the chip level, but only 255 propagated to
the system level. Of these 255 faults, 110 shared identical fault dictionary entries at the
system level and so did not need to be resimulated. The necessary number of system-level
simulations was therefore reduced from 500 to 145. Finally, a third result in the case
studies shows how the simulation method can be used to improve the dependability of the
target system. The simulation analysis was used to add recovery to the target software for
the most common fault propagation mechanisms that would cause the software to hang.
After the modification, the number of hangs was reduced by 60% for fault injections into
the real system.
ACKNOWLEDGEMENTS
I would like to thank my advisor, Ravi Iyer, for his guidance throughout this work. I
would also like to acknowledge the contributions of Zbigniew Kalbarczyk, Jagdish Patel,
and Myeong Lee to the hierarchical fault model case study, and those of David Stott
to the validation case study, both of which are presented in this work. I would like to
thank my thesis committee, Bill Sanders, Bharghavan Vaduvur, and Sharad Mehrotra,
for their input into this work. Finally, I would like to thank my parents for their help
and support, especially during the hectic times of my preliminary and final exams.
TABLE OF CONTENTS
1. INTRODUCTION
1.1 Related Work and Motivation
1.2 Contributions
1.3 Additional Background
2. MULTILEVEL SIMULATION VIA FAULT MODEL ABSTRACTION
2.1 Fault Dictionaries
2.1.1 Problems with the fault-dictionary method
2.2 Trace-Driven Execution
2.3 Special Techniques
2.3.1 Process-interaction simulation
2.3.2 Object encapsulation
2.3.3 Operator overloading
2.3.4 Variable reference mapping
2.3.5 Custom pointer class
2.3.6 Backwards and forwards translation
3. BRIEF DESCRIPTION OF MYRINET
3.1 Myrinet Switches
3.2 Myrinet Host Interfaces
3.3 Myrinet Control Program
4. CASE STUDIES
4.1 Modeling a Single Host Interface
4.1.1 System model
4.1.2 Fault model
4.1.3 Results
4.2 Modeling an Entire Myrinet LAN With Validation
4.2.1 System model
4.2.2 Fault model
4.2.3 Results
4.2.4 Discussion
4.3 Inclusion of Recovery to Improve Dependability
4.3.1 Recovery added
4.3.2 Results
4.4 Incorporation of a Device-Level Fault Model
4.4.1 Fault dictionary hierarchy
4.4.2 Results
5. CONCLUSIONS
5.1 Summary
5.2 Future Work
REFERENCES
VITA
LIST OF TABLES
4.1: A sample set of fault dictionary entries
4.2: Number of errors by category for simulation and SWIFI
4.3: Comparison of errors before and after recovery was added
4.4: Breakdown of number of errors by category
4.5: Breakdown of number of errors by category and instruction type
LIST OF FIGURES
2.1: Example of hierarchy of simulation abstraction levels
2.2: Picture of the trace-based method
3.1: Block diagram of host interface
4.1: Picture of the simulated network
4.2: A block diagram of the MCP showing the "send message" module
4.3: Number of faults leading to given number of corrupt words
4.4: Effect on messages sent during fault lifetime
4.5: Target system for fault injection (simulation and SWIFI)
4.6: Diagram of fault injection region
1. INTRODUCTION
This thesis presents a hierarchical simulation method to assess hardware and software
dependability that uses a detailed, low-level fault model while still producing system-level
results. The method is based on the key idea of using fault dictionaries to propagate fault
effects from one level of abstraction to another, from lowest to highest, and on techniques
to model the system software behavior under faults as well as the hardware. Using these
ideas, a fault model may be assumed at a very low level, for instance the transistor level,
but the impact of faults may be evaluated at the system level, including modeling the
change in the system software behavior due to the fault.
The technique is intended to help close the loop in a design-for-dependability process
in computer systems. In such a process, a system design is developed to provide a given
level of dependability as well as performance. Ideally, the performance and dependability
of the design are evaluated before a prototype is constructed in order to correct deficiencies
in the design. Without this early evaluation, costly redesign may be necessary after
a prototype has been built. In order to close this design-evaluate-redesign loop, however,
analytical or simulation techniques are required that can predict system dependability
in the absence of a prototype. Because the technique presented in this work is based on
simulation, it can be applied to dependability evaluation in the design phase.
At this point, the meaning of the term fault dictionary as it is used in this work will
be defined. A fault dictionary details the impact of faults on the behavior of some subset
of the target system in terms of the change in that subsystem's behavior due to the fault
as seen from outside the subsystem. As implied by the word dictionary, a fault dictionary
contains many entries, and each entry details the impact of one fault (or of a set of faults
with identical impacts) on the behavior of the chosen subsystem. The dictionary is used
to model the faulty behavior of the subsystem by considering the subsystem as a black
box and modifying its fault-free behavior according to the dictionary entry for the desired
fault. In this way, the impact of the fault is raised from the detailed level under which it
was described in the subcircuit to the more abstract level in which the subcircuit is only
a component. Further details and examples of the use of fault dictionaries in this work
will be given in Chapter 2.
The remainder of this chapter will discuss related work in this area and provide some
motivation for the development of the method presented here. The next chapter discusses
the method in more detail. Chapter 3 describes Myrinet, which is a commercial network
that is used as a target system in the case studies presented in this work. Chapter
4 discusses those case studies, including demonstrating the use of fault dictionaries,
validating the simulation model of the Myrinet, and showing how the analysis provided
by the tool can be used to improve the target system. Finally, the last chapter
presents some conclusions and future work.
1.1 Related Work and Motivation
There are a number of ways to analyze the dependability of a system. A general
overview of measurement techniques for dependability evaluation, including simulated
fault injection, physical fault injection, and measurement-based analysis, is given in [1].
More detail on physical fault injection can be found in [2], while software techniques for
fault injection are described along with the FERRARI project in [3].
Both physical fault injection and software-implemented techniques, however, require
at least a prototype system to analyze. These methods are therefore not well suited to the
design phase of a product, where such a prototype is not available. Instead, simulation
or analytical techniques are used in this phase.
There are some studies that concentrate primarily on modeling the systems at a
software level. Models at this level are often graph-based, and may use stochastic activity
nets (SANs), similar to [4], or are analytic in nature and are used to predict software
reliability under software faults based on testing data [5]. The difficulty with these
methods is that some high-level fault model must be assumed. For the analytic methods
to predict software reliability from testing data, these fault models come from failures
observed during the test phase. In other cases, these techniques are used after a prototype
of the system is available, and so fault models can be taken from observation of the system
in the field. Such data is not available during the design phase of the system, however.
Both simulation and analytical techniques have their place in dependability analysis.
In this study, however, we concentrate on simulation techniques because of their capability
to handle high levels of complexity, to employ generalized distributions for fault
arrival or other processes, and to express hardware designs and software protocols very
naturally. It should be noted, however, that employing simulated fault injection to study
dependability, as opposed to analytical techniques, means confronting the problems of
repeated experiments, fault coverage, and confidence intervals.
There are several system-level simulation tools that can model the hardware of a
system. REACT is an example of one such tool that operates only at high levels of
abstraction, modeling CPUs and memories as either good or faulty [6]. ADEPT and
MEFISTO are examples of simulation tools that use VHDL to allow analysis from the
logic level to the system level [7, 8]. Some of these tools can also model the software
executing on the system in an abstract way. None of these tools, however, models the
change in the software behavior due to a fault.
Finally, many different algorithms for simulating the impact of faults at the gate level
have been developed. These tools are typically designed to determine the fault detection
coverage of a set of test vectors. Some of the strategies employed in these algorithms
include concurrent fault simulation [9], parallel fault simulation [10], deductive fault
simulation [11], or a combination of the three [12]. While these strategies have been
shown to be very effective in accelerating fault simulation at the gate and switch levels,
they are not easily applied to system-level simulations, particularly when modeling the
detailed behavior of the software as is done in this thesis. However, this thesis does make
use of multiple simulation levels, and where gate-level simulations are done, techniques
such as concurrent or parallel simulation could be applied to that part of the overall
method. An example of combining concurrent fault simulation with a fault dictionary
analysis is described in [13].
The difficulty in modeling software under faults is that the software behavior must
be tied to the hardware architecture. Very abstract views of the software lack these
connections and require the use of abstract fault models. Low-level views of the software,
which allow more detailed fault models, are costly to simulate, however. Thus, a tool
is needed that can simulate a high-level view of the hardware and software in a system
and yet still maintain the connections between that behavior and views of the system at
lower abstraction levels. Through these connections, a detailed fault model that changes
the state of the hardware can be mapped to a change in the state of the software.
1.2 Contributions 
Four key contributions to the state of the art are made in this thesis. The first is
the idea of using fault dictionaries to raise fault abstraction levels from the level where
a fault model is assumed to the system level where the impact of the fault is observed.
The second contribution is a technique to assess the behavior of the system software
under faults as well as assessing the hardware dependability. Third, four case studies are
presented to demonstrate the use of the hierarchical simulation method and particularly
how the fault dictionary idea is applied to a real system. Finally, the fourth contribution
is a validation of the simulation method by comparison to software-implemented fault
injections into a real system.
1.3 Additional Background 
This work builds upon the previous research in the areas of system-level simulation,
multiple simulation levels, and fault dictionaries to provide such a tool that can connect
hardware and software behavior in the presence of low-level faults. Some of the related
work on these basic methods is presented in the following paragraphs.
The system-level simulation environment this thesis makes use of is called DEPEND 
[14]. This simulation environment makes use of a process-oriented simulation strategy and 
the power of C++ to allow a very natural description for computer systems and software. 
DEPEND has a library of high-level system components from which simulations can be 
constructed, such as CPUs or disks. 
There is support within DEPEND for an abstract model of software behavior repre- 
sented by a probabilistic control flow graph [15]. Faults are modeled abstractly according 
to the corruption they cause in memory, but only the good or faulty states of memory 
words are tracked, not their actual values. Such a flow-graph model is therefore best 
suited to abstract simulation of highly reliable systems for long periods of time as opposed 
to the detailed software behavior that is analyzed in this thesis. This thesis therefore ex- 
tends DEPEND with a new technique for simulating software behavior in detail, as well 
as providing the fault dictionary mechanism by which the impact of detailed, low-level 
faults may be propagated up to the system level for evaluation.
The use of multiple simulation levels has been explored for fault simulation at the
logic level. One technique, called mixed-mode simulation, represents a subset of a circuit
where faults will be injected at the electrical level while the outputs from that subset are
propagated at the gate level [16]. Another more efficient technique dynamically switches
the subcircuit where faults are being injected from the logic level to the electrical level
for only the period surrounding fault injection [17].
The term fault dictionary was first coined in the context of raising fault abstraction
levels in a description of a tool called FOCUS [13, 16]. In FOCUS, a fault dictionary
was used to reduce the resource requirements of multilevel simulation and to propagate
fault effects from transistor-level models of CPU subcircuits up to a gate-level model of
the entire CPU. A similar use of fault dictionaries to go between the transistor and gate
levels was made in [18].
One of the major contributions of this work is the extension of the idea of fault dictionaries
to a generalized mechanism to propagate fault effects between any two neighboring
levels of abstraction of a system. In particular, the application of fault dictionaries to
modeling software behavior and the representation of fault dictionaries at the system
level as a list of corrupted software variables and the corrupted program control flow due
to a fault in a given program module is new to this work. Also defined by this thesis
are four simulation levels where the fault dictionary technique can be effectively applied,
allowing fault effects to be propagated upwards from the device-physics level to the transistor
level. Finally, the structure of the fault dictionary between each of the four levels
is also described.
2. MULTILEVEL SIMULATION VIA FAULT MODEL ABSTRACTION
The simulated fault injection method presented here is designed to achieve two key
goals. First, it should simulate the system at a high level of abstraction to provide a
very fast simulation of the system hardware and software at the same time that it injects
detailed faults and obtains the proper response of the system to those faults. This goal
is achieved by conducting multiple levels of simulation and propagating fault effects up
through each simulation level through the use of fault dictionaries. The second goal is
to avoid simulating the system when it is fault free. This goal is met by performing the
simulation in a trace-driven manner where uninteresting portions of system execution can
be skipped by jumping ahead to a new state taken from a trace of the fault-free system.
Both of these aspects of the system will be described in their own sections below.
In addition, a number of techniques are used to allow the real code of the system
software to be used in the highest of the simulation levels. Using the real code allows the
software behavior to be described in a very natural way and ensures that the simulation
takes into account the specific implementation details of the software. It also allows the
software being modeled to go through several versions without painstaking redevelopment
of a new model or extensive modification of the old one. However, because the system
software may rely on direct communication with special-purpose hardware in the real
system (as the software in our case studies does), it is necessary to use a number of
techniques to provide a similar environment in the simulation so that this code can be
used with minimal modification. These techniques are described in the section following
the fault dictionaries and trace-driven execution.
Finally, although the exposition of the fault dictionary method is a major contribution
of this thesis, the real test of the method described here is its application to the analysis
of a real system. For that reason, another important contribution of this thesis is the
presentation of the case studies in Chapter 4 that show the hierarchical simulation
method works on a real system and validate it against fault injections done on real
hardware.
2.1 Fault Dictionaries
A definition for fault dictionary has already been given in the introduction chapter,
but for convenience, that definition is repeated here. A fault dictionary details the impact
of faults on the behavior of some subset of the target system in terms of the change in
that subsystem's behavior due to the fault as seen from outside the subsystem.
The key function of the fault dictionaries is to raise the abstraction level of faults
so that their impact at the system-level can be determined. The abstraction level of a
fault is raised by carefully choosing the subsystems for which a fault dictionary will be
generated. The boundaries of the subsystem are placed so that information specific to
the abstraction level of the subsystem is naturally contained within that subsystem while
the outputs of the subsystem can be easily represented at a higher abstraction level.
As an example, consider a stage in a microprocessor pipeline. The stage consists,
basically, of some latches at the input that contain the outputs of the previous pipe
stage, a block of logic that computes the outputs for the current stage, and latches at the
stage outputs to hold those outputs for use in the next stage. This pipeline stage could
be used to generate a fault dictionary for use between the transistor and gate levels.
Faults would be injected into a transistor-level model of the pipe stage. Within one clock
cycle, the faults would either have made a difference in the value of the output latches
or have died out. The fault dictionary entry for each fault would store which outputs
were different. Because of the latches at the outputs, the timing, voltage, and current
details of the pipeline stage logic block are contained within the subsystem. Outside
the subsystem, only the latched values are important, and these values can be easily
converted to digital logic values and propagated through a gate-level model of the rest
of the system (which is fault free).
There is a second advantage to using a fault dictionary. A fault dictionary can also
help to reduce the number of simulations that must be done. Many of the faults injected
into the transistor-level pipeline stage mentioned above may not propagate to the outputs
and cause a latch error. All of these faults will produce no fault dictionary entry and will
proceed no further up the hierarchy of simulations. As well, many faults may produce
identical patterns of latch errors and, therefore, identical fault dictionary entries. In this
case, it is only necessary to note the number of faults represented by the dictionary entry.
Only one simulation will have to be run at the higher simulation levels in the hierarchy
to determine the impact of all of the represented faults.
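The bookkeeping behind this reduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_fault_dictionary`, the encoding of a latch error pattern as an integer bit mask, and the eight sample results are hypothetical and do not come from the thesis tools.

```python
from collections import Counter

def build_fault_dictionary(latch_error_patterns):
    """Group injected faults by the latch error pattern they produced.

    A pattern of 0 means no output latch was corrupted, so the fault
    produces no dictionary entry and goes no further up the hierarchy.
    """
    return Counter(p for p in latch_error_patterns if p != 0)

# Hypothetical results of eight transistor-level fault injections,
# one latch error bit pattern per injected fault.
observed = [0b0000, 0b0001, 0b0001, 0b0100, 0b0000, 0b0001, 0b0100, 0b0011]

dictionary = build_fault_dictionary(observed)
propagated = sum(dictionary.values())   # faults that reached the latches
higher_level_runs = len(dictionary)     # one run per distinct entry

print(f"{len(observed)} injected, {propagated} propagated, "
      f"{higher_level_runs} higher-level simulations needed")
```

The figures quoted in the abstract follow the same accounting: of 500 injected faults, 255 propagated, and 110 of those shared an entry with another fault, leaving 255 - 110 = 145 system-level simulations.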
An ideal fault dictionary would not lose information in the transition from one abstraction
level to another higher one. In practice, however, there is some information
lost. Consider again the pipeline stage mentioned above. While the latch filters out most
of the timing and voltage information in the transistor-level model, it doesn't filter all of
it. In most cases, the latch will end the cycle with a solid one or zero and no information
will be lost. In some cases, however, the latch may come to some intermediate value
or continue to settle to a one or zero after the latch has closed. In these cases, there
is still some information in the voltage or timing of the latch value that will be lost by
considering the latch voltage as a digital logic value that stays constant for the entire
cycle as is assumed in the gate-level model.
The fault dictionaries are used by conducting multiple independent simulations at the
different abstraction levels. One fault dictionary is generated through repeated simulation
of the chosen subsystem for many different faults. The dictionary is then used to represent
the subsystem's faulty behavior in a simulation at the next higher abstraction level. In the
pipeline example, for instance, assume the microprocessor had five pipeline stages. Each
stage could be considered as a subsystem, and a fault dictionary could be generated
for each, beginning with transistor-level fault models and ending with gate-level latch
errors. These fault dictionaries would then be used in a gate-level simulation of the
entire pipeline. When a fault occurred inside one of the pipe stages, the inputs of the pipe
stage would be examined and a suitable dictionary entry would be chosen representing
the gate-level impact of one fault in that pipe stage. The latch outputs of that pipe stage
would be modified according to the dictionary entry, and the gate-level simulation would
continue. Thus, the simulations that generate a fault dictionary and the simulations that
use it run independently.
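A minimal sketch of the "use" side of this split is given below, assuming a dictionary has already been generated elsewhere. Everything named here is hypothetical: the 4-bit toy stage `increment_stage`, the table `STAGE_DICTIONARY`, and the choice to encode an entry as an XOR mask over the output latches are illustrative assumptions, not the representation used in the thesis.

```python
def increment_stage(inputs):
    """Toy fault-free pipe stage: a 4-bit incrementer (assumption)."""
    return (inputs + 1) & 0xF

# Hypothetical dictionary for this stage, produced by an earlier,
# independent set of lower-level simulations:
# stage input value -> latch error masks observed for that input.
STAGE_DICTIONARY = {
    0b0011: [0b0001, 0b0100],
    0b0101: [0b0010],
}

def faulty_stage_outputs(stage_fn, inputs, dictionary, pick=0):
    """Treat the stage as a black box and corrupt only its output latches."""
    good = stage_fn(inputs)
    entries = dictionary.get(inputs)
    if not entries:
        # No entry: for these inputs the fault never reached the latches.
        return good
    # Flip the latch bits named by the chosen dictionary entry.
    return good ^ entries[pick]

print(bin(faulty_stage_outputs(increment_stage, 0b0011, STAGE_DICTIONARY)))
print(bin(faulty_stage_outputs(increment_stage, 0b0111, STAGE_DICTIONARY)))
```

The higher-level simulation never looks inside the stage; it only needs the fault-free outputs and the dictionary entry, which is what lets the two kinds of simulation run independently.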
In this work, four simulation levels have been defined where the concept of a fault
dictionary can be effectively applied to raise the abstraction level of faults. A diagram of
the hierarchy of these simulation levels is given in Figure 2.1. In the figure, fault models
are represented by the rounded rectangles. The rounded rectangle in the upper left,
labeled "Heavy Ion Particle Impact," represents the assumed fault model of a heavy-ion
particle striking a transistor junction in a microprocessor. The remaining three rounded
rectangles represent fault dictionaries which can be viewed as abstractions of the heavy
ion fault model at higher simulation levels. Each of the three fault dictionaries represents
a translation of fault effects between two simulation levels as denoted by the arrows. For
instance, the "Output Current Surge" dictionary translates between the device-physics-level
representation of faults and the transistor-level representation.
Figure 2.1: Example of hierarchy of simulation abstraction levels. (Panel labels: Gate in Adder; Transistor-Level Adder Circuit; Software Module Execution.)
There are four basic simulation steps depicted in the figure. In the first step, two
simulations, one at the device-physics level and one at the transistor level, are combined
to determine the appropriate current burst at a gate output due to a heavy ion particle
impact. The simulations at this level produce the "Output Current Surge" fault
dictionary representing current surges due to ion impacts in different types or sizes of
gates. Each dictionary entry models one ion impact as a piecewise-linear analog current
waveform recorded at periodic timesteps throughout the fault's lifetime.
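One way to picture such an entry is as a list of (time, current) samples with linear interpolation between them. The sketch below assumes that representation; the function name `current_at`, the sample values, and the convention that the current is zero outside the recorded lifetime are all assumptions made for illustration, and no units are implied.

```python
from bisect import bisect_right

def current_at(waveform, t):
    """Evaluate a piecewise-linear waveform at time t.

    `waveform` is a list of (time, current) samples with strictly
    increasing times. Outside the recorded lifetime the injected
    current is taken to be zero (an assumption of this sketch).
    """
    times = [time for time, _ in waveform]
    if t < times[0] or t > times[-1]:
        return 0.0
    i = bisect_right(times, t)
    if i == len(times):
        # t coincides with the last recorded sample.
        return waveform[-1][1]
    t0, i0 = waveform[i - 1]
    t1, i1 = waveform[i]
    # Linear interpolation between the two neighboring samples.
    return i0 + (i1 - i0) * (t - t0) / (t1 - t0)

# Hypothetical current surge sampled at four periodic timesteps.
surge = [(0.0, 0.0), (1.0, 4.0), (2.0, 2.0), (3.0, 0.0)]
print(current_at(surge, 0.5), current_at(surge, 1.5), current_at(surge, 5.0))
```

A transistor-level simulator would evaluate a function like this at each of its own timesteps to obtain the current to inject at the struck gate output.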
The current burst fault dictionary becomes the fault model for the next step, a
transistor-level simulation of a subsystem of the processor, perhaps the adder. In this
second step, the operation of the adder is simulated for one cycle while the current burst
propagates. During this time, the current burst will either die out or affect one or more
output latches of the circuit. Any errors on the output latches at the end of the cycle are
recorded as the new gate-level fault model for the original particle impact. The resulting
fault dictionary, represented by the "Latch Error Pattern" box in the figure, actually
represents the impact of faults in the subcircuit probabilistically. For a given input com-
bination, the probability of observing a given latch error pattern will be recorded in the
dictionary. Functional decomposition is used to prevent the number of inputs from over-
whelming the simulation effort, and any symmetry present in the subcircuit may be used
as well.
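The probabilistic nature of the "Latch Error Pattern" dictionary can be sketched as follows. The Python fragment assumes a dictionary keyed by input combination, where each value lists latch error patterns with their probabilities; the key format, the patterns, and the probabilities are hypothetical.

```python
import random

# Hypothetical dictionary: input combination -> [(latch error pattern, probability), ...].
# The pattern "00000" stands for "no latch was corrupted".
latch_error_dict = {
    "0000": [("00000", 0.71), ("00001", 0.20), ("00010", 0.03),
             ("00011", 0.01), ("00100", 0.05)],
}

def sample_error_pattern(dictionary, inputs, rng):
    """Pick one latch error pattern for the given inputs, weighted by probability."""
    patterns, weights = zip(*dictionary[inputs])
    return rng.choices(patterns, weights=weights, k=1)[0]

def apply_pattern(outputs, pattern):
    """Flip every output bit whose position is set in the error pattern."""
    return outputs ^ int(pattern, 2)

rng = random.Random(0)                     # fixed seed so a run can be repeated
chosen = sample_error_pattern(latch_error_dict, "0000", rng)
corrupted_outputs = apply_pattern(0b10110, chosen)
```

Treating a pattern as an XOR mask over the latch outputs is one possible encoding; the text does not fix a representation.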
The latch error fault dictionary is used in the third step where the entire micropro-
cessor is simulated running one module of its software in a cycle-accurate simulation.
During the execution of the module, the operation of a microprocessor subcircuit, an
adder in the example thus far, is corrupted by modifying its outputs according to the
latch error pattern selected probabilistically from the appropriate dictionary entry. At
the end of the module's execution, the memory and processor state are then examined for
changes in the software state or control flow. The corrupted memory and control flow are
recorded in the third fault dictionary, represented by "Memory Error Pattern and Errors
in Control Flow" in the figure. In this fault dictionary, the memory locations corrupted
due to the fault are enumerated along with their faulty values, and the next module
that will be executed is also listed in case the control flow between software modules was
changed by the fault.
The corrupted memory and control flow become the fault model for the particle
impact at the system level. During the simulation of the software execution at the system
level, the fault model is injected by modifying the software state and control flow at the
end of the appropriate module (the same module simulated when the dictionary was
computed in the previous step). The errors are then propagated throughout the software
by simulating its execution until the ultimate impact of the fault is known.
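The system-level injection step can be illustrated with a short Python sketch. It assumes that the software state is a mapping from address to value and that a dictionary entry lists the corrupted words plus, optionally, the module that runs next. The addresses, values, and module names are invented for illustration.

```python
def inject_entry(memory, next_module, entry):
    """Return the corrupted memory and the module that will actually run next."""
    corrupted = dict(memory)               # leave the fault-free state untouched
    corrupted.update(entry["memory"])      # address -> faulty value
    return corrupted, entry.get("next_module", next_module)

# Hypothetical fault-free state at the end of a module, and one dictionary entry.
memory = {0x14b10: 0x0, 0x14b14: 0x0, 0x14b18: 0x8}
entry = {"memory": {0x14b14: 0x1}, "next_module": "hostSendDma"}

faulty_memory, module = inject_entry(memory, "event loop", entry)
```

An entry without a `next_module` field leaves the control flow unchanged, which corresponds to a fault that corrupted data only.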
While it makes sense to follow the path of a single fault as was done in the description
above, the actual simulation proceeds differently. Instead of each simulation level being
invoked one at a time, the simulation levels operate concurrently but on different faults.
For example, imagine that one simulation has been conducted at the device-physics level,
resulting in a current burst fault model. That fault model is stored as the first entry in a
fault dictionary between the device-physics-level step and the transistor-level step. Next,
the device-physics simulation may pick up a new fault and begin a second simulation.
At the same time, the transistor-level simulation may pick up that current burst in the
first fault dictionary entry and begin the corresponding transistor-level simulation.
Note that it is not necessary to always begin with a device-level fault model. If less
accuracy is desired, fault injection could begin at levels above the device level, such as
a bit-flip fault model at the logic level. If this is the case, however, then the impact
of implementation details such as supply voltage levels, accurate circuit timing, and
transistor sizing will not be considered in the final results.
2.1.1 Problems with the fault-dictionary method
There is one major problem that must be overcome in applying the fault dictionary
method. That problem is the fact that the entries in the fault dictionary depend not
only on the fault that is being injected but also on the inputs to the chosen subsystem.
For example, the impact of a fault in the instruction decode unit may depend on the
instruction being decoded as well as the fault injected.
At the transistor and gate levels, the solution to this problem is to carefully choose
the subsystem for which a fault dictionary will be generated so as to keep the number
of inputs down to a manageable level. For example, rather than choosing the entire
instruction decode stage of a pipeline as a block for a fault dictionary, it may be more
manageable to look at the instruction decode stage as being a combination of several
circuits. One examines the highest opcode bits to decode the instruction type. Another
examines the middle bits to decode source registers. Finally, a third examines the lowest
bits to determine the result register. Symmetry may be used in addition to the functional
decomposition to further reduce the number of inputs that must be considered.
At the upper simulation levels, though, the different input states that must be con-
sidered are the different states of the software when it enters the chosen module for which
a fault dictionary will be developed. In this case, it is not so clear how to reduce the
number of software states that will be considered in forming the fault dictionary. A
claim is made here that, for the types of systems for which this technique was developed,
embedded microprocessor systems, the state of the software upon entering a block is not
as important in determining the impact of a fault as is the particular fault itself or the
design of the module. In other words, the impact of many faults within the module will
be independent of the software state when entering the module or will be dependent on
only a couple of variables that are major inputs to the module. If such is really the case,
then only a sampling of the software states upon entering a module need be considered
when developing a fault dictionary for that module. This claim will be substantiated in
the second case study.
Finally, note that all fault injection methods, including measurement, must deal with
this problem of the impact of different software states during fault injection. The problem
is not further exacerbated by the use of fault dictionaries. Thus, similar techniques can
be applied with fault dictionaries as are applied elsewhere.
2.2 Trace-Driven Execution
A trace-driven model of the software is used at the system level to allow the simulation
to skip over uninteresting periods of the software's execution (when the system is fault
free). The trace consists of a series of snapshots of the software state at different points
throughout its workload. In the case studies that are presented later, these snapshots
are taken from an actual fault-free run of the software for the network workload that was
used.
The trace is used in the following way. When a fault is to be injected, the state of the
software is initialized to the latest point that the state is known before the fault occurs.
The software code is then simulated from that point up until the fault injection. The
fault is injected by corrupting the software state as dictated by a fault dictionary entry
for the module of the software which is being executed when the fault occurs. Execution
of the program code then continues until the outcome of the fault injection is known
and the software can be said to be in a good state. This may mean that the software is
simulated until it crashes and is reloaded in a good state, or it may mean the simulation
continues until it is feasible to compare the state to that of the good program, and those
states are the same. In either case, the next step is to determine the time of the next
fault and repeat the process of initializing the program state. A picture of the trace-based
simulation method is given in Figure 2.2.
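The trace-driven procedure described above can be sketched in Python as follows. The sketch assumes that the trace is a time-ordered list of (time, state) snapshots and that one call to the step function advances the simulated time by one unit. The toy step, corruption, and goodness functions are hypothetical stand-ins for the real software, the fault dictionary entry, and the comparison with the fault-free program.

```python
from bisect import bisect_right

def latest_snapshot(trace, fault_time):
    """Return the last (time, state) snapshot taken at or before fault_time."""
    times = [t for t, _ in trace]
    i = bisect_right(times, fault_time)
    if i == 0:
        raise ValueError("no snapshot precedes the fault")
    return trace[i - 1]

def run_fault(trace, fault_time, step, corrupt, is_good):
    """Simulate one fault; each call to step() is assumed to advance time by one unit."""
    time, state = latest_snapshot(trace, fault_time)
    state = dict(state)                    # never modify the stored snapshot
    while time < fault_time:               # fault-free execution up to the fault
        state, time = step(state), time + 1
    state = corrupt(state)                 # apply the fault dictionary entry
    while not is_good(state):              # propagate until the outcome is known
        state, time = step(state), time + 1
    return time, state

# Toy stand-ins: a counter plus an error count that clears after three steps.
def toy_step(state):
    return {"count": state["count"] + 1, "err": max(0, state.get("err", 0) - 1)}

def toy_corrupt(state):
    return {**state, "err": 3}

def toy_is_good(state):
    return state.get("err", 0) == 0

trace = [(0, {"count": 0}), (10, {"count": 10}), (20, {"count": 20})]
end_time, end_state = run_fault(trace, 13, toy_step, toy_corrupt, toy_is_good)
```

A real implementation would also need a bound on the second loop, since a fault may leave the software in a state that never becomes good without a reload.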
Figure 2.2: Picture of the trace-based method.
2.3 Special Techniques
All of the special techniques that are described in this section are used only in the
system-level simulation to allow the real code of the software to be used in modeling the
system's software behavior. It is not necessary to use these techniques in any of the lower
simulation levels that were described in the fault dictionary section. The first of these
techniques is provided by using the DEPEND simulation engine and environment. The
middle four techniques are all provided by using an advanced C++ compiler (and thus
are only directly applicable to system software written in C or C++, although it may
be possible to modify them for software in other languages). Finally, the last technique
is provided outside of the simulation environment to transfer system state between
the system-level simulation and the next lower level (behavioral level in the case studies
presented later).
2.3.1 Process-interaction simulation
Process-interaction is a strategy for representing a discrete event simulation [19]. In
this strategy, the discrete events are grouped together in a sequence according to the
process they are part of. For example, a database search could be considered as a
process containing a sequence of many discrete events, each of which accesses a database
to read some keys. All of the processes in a simulation are executed together, much as
a multitasking operating system executes many program tasks together. However, when
a process is waiting for the simulation time that its next event should be executed, it is
temporarily blocked.
This type of representation allows a very natural representation of computer systems
and is one of the core features of the DEPEND simulation engine and environment
used in this work. The strategy was key to allowing the use of the real software code
in modeling the software behavior of the target system. With the process-interaction
strategy, the software behavior could be simply represented as one process which executed
the real program code instrumented with occasional time delay statements (representing
the passage of time as the code was executed).
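The process-interaction strategy can be illustrated with a small scheduler built on Python generators. This is a sketch of the general idea only and does not reproduce the DEPEND engine or its interface. Each process is a generator that runs a piece of code and then yields the simulated delay until its next event; the process names and delays are invented.

```python
import heapq

def run(processes, until):
    """Run generator-based processes; each yield is a delay until the next event."""
    log = []
    # Queue items are (wake-up time, tie-breaker, name, generator).
    queue = [(0.0, i, name, proc) for i, (name, proc) in enumerate(processes)]
    heapq.heapify(queue)
    while queue:
        now, i, name, proc = heapq.heappop(queue)
        if now > until:
            break
        try:
            delay = next(proc)             # execute the process up to its next delay
        except StopIteration:
            continue                       # the process has finished
        log.append((now, name))
        heapq.heappush(queue, (now + delay, i, name, proc))
    return log

def software():
    for _ in range(3):
        # ... a block of the real program code would execute here ...
        yield 2.0                          # time delay statement

def hardware():
    for _ in range(2):
        yield 3.0

events = run([("sw", software()), ("hw", hardware())], until=10.0)
```

Between events a process is simply parked in the queue, which corresponds to the temporarily blocked state described above.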
2.3.2 Object encapsulation
The process-interaction strategy allowed the easy representation of one copy of the
software, but it was not sufficient by itself to allow the representation of multiple copies.
This is because the software may make use of global variables which would be incorrectly
shared by all of the copies of the software executing in the simulation. In order to
avoid this collision for global variables without extensive modification of the software,
the object-oriented nature of C++ was used. A C++ object was created to represent one
copy of the software, and all of the global variables used by the software were allocated
in that object. In this way, each copy of the software had its own copy of the global
variables, and the C++ compiler would automatically ensure that the code associated
with one particular object would access the variables also associated with that object.
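The same idea can be shown with a Python analogue (the work itself used C++). The class and variable names below are hypothetical; the point is only that a variable that would be global in the real code becomes per-object state, so several simulated copies do not interfere.

```python
class SoftwareCopy:
    """One simulated copy of the software; former globals become object attributes."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.messages_sent = 0             # hypothetical former global variable

    def send_message(self):
        self.messages_sent += 1            # touches only this copy's counter
        return self.messages_sent

nodes = [SoftwareCopy(i) for i in range(4)]
nodes[0].send_message()
nodes[0].send_message()
nodes[2].send_message()
counts = [node.messages_sent for node in nodes]
```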
2.3.3 Operator overloading
After using the above two techniques, it is possible to simulate the execution of
multiple copies of the system software. One further problem with using the real code,
however, is the fact that the simulation environment may be different from the environment
in which the code would normally execute. In the real environment, there may be
memory-mapped I/O or other special-purpose hardware. In order to provide a similar
environment in the simulation, extra simulation processes were created to perform the
functions of the missing hardware, and C++ operator overloads were used to direct the
C++ compiler to automatically generate the necessary communication between the real
code and the new processes that simulate the special hardware, avoiding any
modification of the real code.
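A Python analogue of this use of operator overloading is sketched below; the work itself relied on C++ operator overloads, and the class names, the device, and the mapped address are all hypothetical. The indexing operators of the memory object are overloaded so that an ordinary-looking read or write to a mapped address is redirected to a simulated device.

```python
class SimulatedMemory:
    """Memory whose indexing operators route mapped addresses to device objects."""

    def __init__(self, size):
        self.ram = [0] * size
        self.devices = {}                  # address -> object with read()/write(value)

    def map_device(self, address, device):
        self.devices[address] = device

    def __getitem__(self, address):
        if address in self.devices:
            return self.devices[address].read()
        return self.ram[address]

    def __setitem__(self, address, value):
        if address in self.devices:
            self.devices[address].write(value)
        else:
            self.ram[address] = value

class SendRegister:
    """Hypothetical device: records every word written to it."""

    def __init__(self):
        self.sent = []

    def read(self):
        return len(self.sent)              # reads report how many words were sent

    def write(self, value):
        self.sent.append(value)

mem = SimulatedMemory(16)
reg = SendRegister()
mem.map_device(8, reg)
mem[3] = 7                                 # ordinary memory write
mem[8] = 0xAB                              # redirected to the simulated device
mem[8] = 0xCD
```

The code that performs `mem[8] = ...` needs no knowledge of the device, which mirrors the goal of leaving the real code unmodified.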
2.3.4 Variable reference mapping
The remaining techniques are all used to help model the behavior of the software
under a fault. Variable reference mapping is used to arrange the software variables
for one simulation copy of the system software into a memory map that matches the
arrangement for the real system. This arrangement is done by using a feature in C++
called references. A reference associates a new variable name with an already existing
variable. This feature was used to create the desired memory map by first allocating
a large array to model the system memory. Then, the normal global variables of the
program are changed to references which are mapped to the appropriate locations in
the already existing array. In this way, a duplicate of the real system's memory map is
created in the array modeling the system memory.
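Python has no direct equivalent of a C++ reference, but the effect of binding a variable name to a fixed location in the memory array can be approximated with a descriptor. The following sketch is an analogue only; the variable names and word indices are hypothetical.

```python
class MappedWord:
    """Descriptor that binds a variable name to a fixed word index in the memory array."""

    def __init__(self, index):
        self.index = index

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.memory[self.index]

    def __set__(self, obj, value):
        obj.memory[self.index] = value

class Software:
    route_start = MappedWord(2)            # hypothetical variables and locations
    msg_length = MappedWord(5)

    def __init__(self, words):
        self.memory = [0] * words          # array modeling the system memory

sw = Software(8)
sw.route_start = 0x8182                    # lands in word 2 of the memory array
sw.memory[5] = 52                          # a fault written into the array ...
length_seen_by_code = sw.msg_length        # ... is seen through the variable name
```

Because every variable lives in the array, a fault dictionary entry expressed as an address and a value can be applied by writing into the array directly.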
2.3.5 Custom pointer class
Another technique that was used to provide more accurate software execution in the
presence of a fault and an easier translation for fault models was the creation of a custom
class to represent pointers in the simulated system. This custom pointer class acted, as
much as possible, the same as a standard C++ pointer with the exception that it always
pointed within the memory array allocated for the simulated system. This custom pointer
class provided two functions. It helped to prevent the simulated system from overwriting
regions of memory in the simulator that were not part of the memory of the simulated
system. It also allowed the custom pointers to have the same value as the pointer in the
real system would have. For example, if the array modeling the system memory were
allocated at address 500, the value of a normal pointer for the first word of the memory
would be 500, not 0 as it would have been in the real system. By using a custom pointer
class, the class could take care of dealing with this offset of 500, so the custom pointer
value could be stored as 0.
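The offset example can be made concrete with a small Python analogue of such a pointer class. How the original C++ class kept a corrupted pointer inside the array is not stated here; the wrap-around (modulo) rule used below is an assumption, as are the class and method names.

```python
class SimPointer:
    """Pointer whose stored value is the simulated-system address."""

    def __init__(self, memory, value):
        self.memory = memory
        self.value = value % len(memory)   # assumed rule: wrap to stay inside the array

    def __add__(self, offset):
        return SimPointer(self.memory, self.value + offset)

    def load(self):
        return self.memory[self.value]

    def store(self, word):
        self.memory[self.value] = word

    def host_address(self, base):
        """Address a normal pointer would hold if the array were allocated at base."""
        return base + self.value

memory = [0] * 8
first = SimPointer(memory, 0)              # stored value is 0, as in the real system
wild = first + 9                           # a corrupted offset still lands in the array
wild.store(42)
```

With the array allocated at address 500, `first.host_address(500)` gives the value 500 that a normal pointer would hold, while the value stored in the simulated system remains 0.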
2.3.6 Backwards and forwards translation
The last technique, backwards and forwards translation, is used to translate the software
state from the system-level simulation to the next lower level (backwards translation)
or from the lower level to the system level (forwards translation). Providing this
translation allows a state from the system-level simulation to be used for fault dictionary
generation in the next lower level (behavioral level in the case studies). The forward
translation then allows the fault model coming out of that fault dictionary to be included
back in the system-level simulation. Because all of the techniques described above
make the simulation execution environment and the real execution environment
for the system code the same, the two translation steps for the case studies
were as simple as copying the memory array from one model to the other.
3. BRIEF DESCRIPTION OF MYRINET
All of the case studies in the following chapter that demonstrate the simulation
method discussed in this thesis use a Myrinet as the target system. For this reason,
a description of the Myrinet hardware and software is provided below. Additional details
can be found in the references or at Myricom's website (http://www.myricom.com) [20].
Myrinet is a commercial, high-speed, local area network. It is based on packet-switched,
point-to-point communication technology that was first developed for deployment
in system area networks, such as in the Mosaic multicomputer. Information can
flow along the links in a Myrinet at 1.2 Gb/s in both directions, and the total peak
bandwidth available in a Myrinet scales upward with the number of hosts connected to
the network.
A Myrinet is made up of combinations of three key components. The first, a host
interface, is an expansion board that connects a host computer to the network. Second,
the Myrinet Control Program (MCP) is the control software that runs on each host
interface board and performs the network control functions, routing, and message transfer.
The third component, a switch, is used to connect multiple host interfaces (and thus host
computers) together for topologies other than two directly connected hosts. Each of
these components will be described in further detail in the sections below, beginning with
the switches.
3.1 Myrinet Switches
Each of the switches in a Myrinet is a perfect crossbar. At the time of this writing,
there exist Myrinet switches with from 4 to 10 bidirectional ports. That means a 10-port
Myrinet switch really has 10 input ports and 10 output ports and can form any
permutation of connections from input to output with no more than one input connected
to each output.
Switches can be connected to hosts or to other switches in an arbitrary, multilevel
topology. There may be only one route between each pair of hosts, or redundant routes
may be provided by connecting more than the minimally necessary number of switches.
Connections may be added or removed from the Myrinet at any time, and the network
will automatically adapt to the new configuration.
Information flows through a switch in atomic packets. When a packet enters a switch,
the first byte of the packet designates the output port through which the packet should
leave. The switch will attempt to make that connection and will hold it until the entire
packet travels through the switch. After the packet leaves, the switch tears down that
connection and can allow some other input port to access that same output port.
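The connection behavior of a switch can be sketched as a small Python model. It follows the description above in treating the first byte as the output port number. Forwarding the packet without that leading byte is an assumption of this sketch, as are the class and method names.

```python
class Crossbar:
    """Toy crossbar: at most one input may hold each output at a time."""

    def __init__(self, ports):
        self.ports = ports
        self.owner = {}                    # output port -> input port holding it

    def open(self, in_port, packet):
        """Connect in_port to the output named by the packet's first byte."""
        out_port = packet[0]
        if not 0 <= out_port < self.ports:
            raise ValueError("route byte names a port the switch does not have")
        if out_port in self.owner:
            return None                    # blocked: the output is already in use
        self.owner[out_port] = in_port
        return out_port, packet[1:]        # assumed: the route byte is consumed

    def close(self, out_port):
        """Tear down the connection once the whole packet has passed."""
        self.owner.pop(out_port, None)

switch = Crossbar(4)
first = switch.open(0, bytes([2, 0xAA, 0xBB]))     # input 0 takes output 2
blocked = switch.open(1, bytes([2, 0x01]))         # input 1 must wait
switch.close(2)
retry = switch.open(1, bytes([2, 0x01]))           # now succeeds
```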
Each switch provides only enough buffering to store information in transit on an input
port should the selected output port for that packet be blocked. If such is the case, the
switch will send a flow control message back along the reverse channel for the input that
is blocked to inform the transmitting host to stop transmission. Once the selected output
is no longer blocked, another flow control signal will be sent to the transmitting host to
notify it to continue transmission. Messages are therefore buffered in the host interfaces,
not in the switches.
3.2 Myrinet Host Interfaces
Anywhere a host computer is to be connected to the Myrinet, a host interface board
is inserted to form the connection. The host interface board sits on the host computer's
expansion bus (such as a PCI bus) and connects to another host computer or network
switch through a multiconductor cable link. A block diagram of the host interface is
shown in Figure 3.1.
Each host interface board is an embedded computer system. It contains a custom,
32-bit microprocessor, 256K of static RAM, and additional hardware that connects it to
the host computer expansion bus and the network links. The Myrinet Control Program
is executed on this custom processor on each interface board, and its binary image is
stored in the static RAM. The static RAM is also used to buffer incoming and outgoing
messages.
Figure 3.1: Block diagram of host interface.
The custom processor on each interface board is called the LANai processor and was
designed by Myricom. It is a RISC processor with multiple general-purpose registers and
a load/store instruction set. In addition, the processor makes use of memory-mapped I/O
to provide hardware interfaces to the host computer expansion bus, the outgoing network
link, and the incoming network link. Each of these interfaces can operate concurrently
with the processor, allowing the generation of up to five simultaneous memory accesses
(instruction, data, expansion bus, incoming link, outgoing link) of which two can be
satisfied by the memory in one cycle. The processor operates at a speed of 40 MHz, and
each of the hardware interfaces can transmit one 32-bit word on each cycle.
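The figures quoted in this paragraph imply the following rough bandwidths. The short Python calculation below is only arithmetic on the numbers in the text, under the assumption that every memory access moves one 32-bit word.

```python
CLOCK_HZ = 40_000_000                      # processor clock from the text (40 MHz)
WORD_BITS = 32                             # one 32-bit word per interface per cycle

interface_bps = CLOCK_HZ * WORD_BITS       # peak rate of one hardware interface
memory_bps = 2 * interface_bps             # two accesses satisfied per cycle
peak_demand_bps = 5 * interface_bps        # five simultaneous requesters

print(f"one interface : {interface_bps / 1e9:.2f} Gb/s")
print(f"memory supply : {memory_bps / 1e9:.2f} Gb/s")
print(f"peak demand   : {peak_demand_bps / 1e9:.2f} Gb/s")
```

Under these assumptions a single interface can move about 1.28 Gb/s, which is in line with the link rate of roughly 1.2 Gb/s quoted for Myrinet, and the memory can satisfy at most two fifths of the worst-case simultaneous demand.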
The memory on the host interface board is accessible by the host computer as one
contiguous chunk. This feature is used both to allow the host computer to download the
MCP software to the interface board and to allow the host computer to communicate
with the interface by writing signals into the interface's memory.
3.3 Myrinet Control Program
The MCP is the brains of a Myrinet. It performs nearly all of the network control
functions; only simple flow control is performed by the switches. In particular, the MCP's
functions can be broken down into the following three classes: sending messages out onto
the network, receiving messages from the network, and cooperating to map the network
and determine proper routes.
The MCP sends a message by programming the memory-mapped I/O interfaces of
the LANai processor. When the host informs the interface board that a message is
ready to be sent, the MCP will perform the following steps. First, it will allocate a
buffer in its static RAM to hold the message. Next, it will construct the header of the
message, including such details as the message type, length, and destination. The MCP
then initiates a DMA transfer to get the message data from the host computer into its
buffer in static RAM through the host's expansion bus. The correct route will then be
prepended to the message from the routing table. Finally, the memory-mapped interface
for the outgoing network link will be programmed to initiate the transfer of the message
onto the network.
A receive operation contains many similar steps to a send, performed in the reverse
order. The incoming network link is first programmed to accept a message from the
network into a buffer in the static RAM. Next, the message header and CRC are examined
to be sure the message is valid and without error. (Invalid or erroneous messages are
dropped.) A message buffer on the host is then allocated for the received message, and
the expansion bus memory-mapped interface is programmed to transfer the message from
the interface's static RAM to the buffer in the host computer.
The last function of the MCP, cooperating to map the network, is slightly more
complicated and will only be summarized here. Each interface has a unique ID, and
in mapping, this ID is used to select one interface as the mapper. This interface will
initiate network mapping periodically, send mapping requests, collect replies, and
distribute completed maps to all other interfaces. The mapping interface forms its map by
sending "are you there" type messages exhaustively to the possible ports in the network.
Unconnected ports are detected by a lack of response to the inquiry for some timeout
period. The other interfaces in the network will only respond to mapping requests and
periodically compute routing tables when a completed map is received. In the case that
a timeout period occurs with no response from the mapping interface, a new mapper will
be chosen from among the remaining MCPs. With this mapping protocol, a Myrinet can
dynamically adjust to changes in the network topology.
4. CASE STUDIES
Four case studies were done to demonstrate the simulated fault injection method
presented in this thesis. All four of the case studies used a Myrinet as the target system,
but each study had a different focus. In the initial case study, the main goal was to
demonstrate how the fault injection method could be applied to the Myrinet system.
This case study was therefore limited to modeling only a single host interface in detail,
and a behavioral fault model was used. The second study built upon the first by extending
the system model to an entire Myrinet LAN and doing some verification against a real
Myrinet. The third study added some recovery code to the Myrinet MCP software,
based on the results of the second study, in an attempt to reduce the tendency of the
host interface to hang under fault injection. Finally, in the fourth case study, the focus
was on the use of the fault dictionaries to raise the abstraction level of an input fault
model from the device level to the system level.
4.1 Modeling a Single Host Interface
This study was done as an initial demonstration of the fault-dictionary-based simulated
fault injection method. It was therefore limited to modeling only a single interface
in detail and using a behavioral-level fault model to reduce the implementation time of
the study. As a result, the special system-level simulation techniques of object encapsulation
and custom pointer class were not necessary in this study. This case study was
first presented in [21].
In this study, the simulated hardware consists of the LANai processor and 256K
of memory, and the simulated software consists of the majority of the MCP program.
One function of the MCP, the network mapping, is not simulated in this case study
because a full Myrinet is not simulated; a predefined network workload is used rather
than simulating the behavior of multiple nodes on the network in addition to the node
undergoing fault injection. A pictorial representation of the simulated system is given in
Figure 4.1.
Figure 4.1: Picture of the simulated network.
4.1.1 System model
There are two simulation levels in this case study. First, a cycle-accurate simulation of
the LANai chip and memory is used to do detailed simulation of one module of the MCP
for the short period surrounding a fault injection, creating a fault dictionary. Second,
a system-level simulation is done, using the real C++ source code of the software and
additional software objects to emulate LANai-specific hardware, to propagate the errors
from the fault dictionary and determine their ultimate effect under the specified workload.
Only a single node of a Myrinet LAN is simulated in detail. A predefined workload is
injected at the interface boundary between the host interface and the network and also
at the boundary between the host interface and the host computer. These workloads
represent arriving messages that are to be received from the network or sent out to the
network. The effect of a fault is determined by examining the messages output by the
interface to those same boundaries. If one of these messages is corrupted, the proper
response of the network to the corrupted message is recorded (e.g., a message with an
illegal route will be dropped). In determining these responses, a LAN with 4 nodes
connected to a single 4-port switch is assumed, with the node undergoing fault injection
considered as node 0.
Each of the simulated workloads contains a set of messages that arrive at specified
times during the simulation for either receiving or sending, respectively. A pattern similar
to a parallel computation on the LAN is assumed, so the messages follow a repeating
pattern of sending to all nodes and then receiving from all nodes a fixed-length message
with a simulated period of 0.1 seconds. For this study, the data in each message was set
to an arbitrary length of 52 bytes.
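A workload of this shape could be generated as in the following Python sketch. The node count, period, and message length come from the text; the event tuple layout is hypothetical, and placing all events of one round at the start of its period is an assumption, since the exact arrival times within a period are not given here.

```python
def make_workload(rounds, nodes=4, me=0, period=0.1, length=52):
    """Build (time, direction, peer node, data length) events for the node under test."""
    others = [n for n in range(nodes) if n != me]
    events = []
    for k in range(rounds):
        start = k * period                 # assumed: a round begins at the period start
        events += [(start, "send", n, length) for n in others]
        events += [(start, "receive", n, length) for n in others]
    return events

workload = make_workload(rounds=2)
```

The send events would be injected at the boundary with the host computer and the receive events at the boundary with the network.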
4.1.2 Fault model
Finally, in this case study, faults were injected into the LANai processor during only
a single module of the MCP program, although their propagation was unconstrained.
By injecting into only a single module, we were able to minimize the time spent on
implementing the fault dictionary simulation and concentrate instead on modeling issues. In
this study we constructed only a single fault dictionary, corresponding to the effect of a
fault in the processor during its execution of the "send message" module. See Figure 4.2
for a picture of some of the tasks of the "send message" module. Some additional details
of the fault injection are mentioned below.
The module that was selected was the code responsible for sending a message from
the LAN node out onto the LAN. This code had to perform three major functions:
transferring the message from the LAN node to the interface memory, preparing the message
to go out on the network, including generating a correct route, and then transferring the
completed message from the interface memory to the LAN. Execution of this module
took approximately 540 cycles of useful time from the LANai processor, not counting
cycles spent in DMA transfers. For each fault dictionary entry, a single fault was injected
during this time, uniformly distributed across the 540-cycle period. At the end of the
execution of the module, all persistent variables were written out into memory, and the
contents of the memory were compared to those of a good program run in order to find
the corrupted variables.
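The two mechanical parts of this procedure, choosing the injection cycle and comparing the final memory with that of a good run, can be sketched in Python as follows. Memory is modeled as a mapping from address to word, which is an assumption of the sketch, and the sample addresses and values are for illustration only.

```python
import random

MODULE_CYCLES = 540                        # useful cycles of the "send message" module

def pick_injection_cycle(rng):
    """Choose the cycle of the single fault, uniformly over cycles 0..539."""
    return rng.randrange(MODULE_CYCLES)

def diff_memory(good, faulty):
    """Return {address: (corrupted value, good value)} for every word that differs."""
    return {a: (faulty[a], good[a]) for a in sorted(good) if faulty[a] != good[a]}

rng = random.Random(1)
cycle = pick_injection_cycle(rng)

good_run = {0x28f08: 0x8182, 0x28f0c: 0x10000}
faulty_run = {0x28f08: 0x0, 0x28f0c: 0x10000}
corrupted_words = diff_memory(good_run, faulty_run)
```

An empty result from `diff_memory` corresponds to a fault that left no trace in the persistent state of the module.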
Figure 4.2: A block diagram of the MCP showing the "send message" module.
The faults that were injected in this study came from a behavioral-level fault model
similar to that described in [22]. This behavioral-level model was chosen because a gate
or lower-level view of the LANai microprocessor was not available at the time of the
study. The fault model involves changing the assembly instruction executing on the
microprocessor during the fault in a number of different ways to model faults in different
parts of the microprocessor. Some of the possible effects of a fault in this model, selected
with equal likelihood in this study, are: wrong source operand selected, source operand
is corrupted, wrong destination operand is selected, additional destination operand is
selected, and destination operand is corrupted. For additional details on the possible
changes to an instruction to model different processor faults, consult [22].
4.1.3 Results
This section will describe the results of the host interface simulations in two parts:
the fault dictionary generation and the system-level simulation.
Fault dictionary results
The first results we obtained for the host interface came from the behavioral-level
simulations that formed the fault dictionary. These results consisted of a set of zero or
more memory words that were corrupted at the end of the send message code as a result
of a single fault injection.
An example of a small set of dictionary entries pulled out from the 1000 entries that
were computed is given in Table 4.1. Each entry consists of a location where the fault
occurred and a list of memory words that were found to have corrupted values at the
end of the simulation of the given software module. For each of these corrupted words,
the address, corrupted value, and good value are listed (in hex). (For example, in the
first entry, the variable residing at address 0x28f08 was found to have a corrupted value
of 0 rather than the correct value of 0x8182.) This information is used to identify the
corrupted software variables so that the same corruption can be simulated at the software
level as occurred at this behavioral level. In the table shown, the location of the fault is
given in terms of what the program was doing when the fault occurred, and the meaning
of the corrupted memory is given to aid the reader in understanding the character of the
dictionary; normally the location would be in hex just like the memory addresses. The
effects of four faults are depicted in the table. As a single fault may propagate to corrupt
more than one variable, the second and fourth faults have multiple entries, notable by
their multiple lines between the START and END commands.
Table 4.1: A sample set of fault dictionary entries.
Loc: setup route lookup
START
28f08 0 8182            corrupt start of route
END
Loc: find route in route table
START
dd38 3 2                corrupt cached address
28f08 8183 8182         corrupt start of route
END
Loc: copy route to message
START
28f0c 0 10000           corrupt end of route
END
Loc: set DMA parameters
START
28f30 0 54686973        corrupt message data (all of these entries)
28f34 0 20697320
28f38 0 736f6d65
28f3c 0 20646174
28f40 0 6120666f
28f44 0 72206120
28f48 0 6d657373
28f4c 0 61676520
28f50 0 77652077
28f54 0 696c6c20
28f58 0 73656e64
28f5c 0 2e202e20
28f60 0 2e202e00
END
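A reader for entries in the layout of Table 4.1 could look like the following Python sketch. The function name and the returned structure are hypothetical, and the sample text repeats the first two entries of the table with the explanatory annotations omitted.

```python
def parse_dictionary(text):
    """Parse 'Loc:' / START / END blocks into a list of fault dictionary entries."""
    entries, current = [], None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields or fields[0] == "START":
            continue
        if fields[0] == "Loc:":
            current = {"loc": " ".join(fields[1:]), "words": []}
        elif fields[0] == "END":
            if current is not None:
                entries.append(current)
            current = None
        elif current is not None:
            # Columns are address, corrupted value, good value (all hexadecimal).
            address, corrupted, good = (int(f, 16) for f in fields[:3])
            current["words"].append((address, corrupted, good))
    return entries

sample = """Loc: setup route lookup
START
28f08 0 8182
END
Loc: find route in route table
START
dd38 3 2
28f08 8183 8182
END
"""

entries = parse_dictionary(sample)
```

Each parsed entry can then be applied at the software level by writing the corrupted value to the variable that resides at the listed address.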
Following is an incomplete list of possible effects to the software state that were
observed in the generated fault dictionary:
1. Corrupt start of message route
2. Corrupt end of message route/message type
3. Corrupt cached address in routing table
4. Corrupt message data element
5. Corrupt all message data
6. Corrupt interrupt state machine (software object)
7. Corrupt message address
8. Corrupt message channel
9. Corrupt message length
10. Corrupt route in routing table
11. Corrupt pointer to message (move message in memory)
More than one of these effects may occur together from a single fault. For example,
if the message address is corrupted by a fault before the route is looked up, the wrong
route may also be written to the message, causing it to have both a corrupt address and
route.
The effect of a fault as recorded in the dictionary was found to be strongly correlated
with what the software was doing at the time of the fault injection for these behavioral-
level faults. For instance, if the software was searching the routing table to find the
correct destination and route, a fault was very likely to cause an invalid or incorrect
route. This correlation occurred because, even though a fault can cause corruption to
random memory words by corrupting the address of a load or store, the fault is much
more likely to affect the variables that are currently being computed when the fault
occurs due to the many references to these variables. Even if a fault corrupts a load or
store, resulting in corrupting a random part of memory, the variables that were being
computed at the time are also likely to be corrupted because that load or store did not
write the correct value to those variables but wrote to the wrong destination instead.
Figure 4.3: Number of faults leading to given number of corrupt words. (Axes: number
of corrupted words versus number of faults, with the fault count scaled from 0 to 1200.)
A graph of the number of faults which caused a given number of memory words to
be corrupted is shown in Figure 4.3. The x axis of the graph is the number of corrupted
words resulting from a single fault, and the y axis is the number of faults that caused the
given number of corrupted words. As can be seen on the graph, the great majority of
fault injections lead to no change in the program behavior or the corruption of a single
memory word. The next peak in the graph is for faults that caused approximately 11
memory words to be corrupted, and the final peak is for approximately 27 corrupt words.
Some reasons for these peaks are described below.
Just over half of the injected faults lead to no change in the program behavior. This
result is due to a number of different effects. The first is that a fault injected during a
NOP cycle was unlikely to cause any corruption in our model. Because the pipeline in
this processor is software scheduled with respect to dependencies between instructions,
a significant number of NOPs are generated in the compiled code to insert the proper
delays between dependent instructions. Another reason a fault may have no effect is
because the result it corrupts is not used. Sometimes a fault may lead to corrupting a
register that contains a dead variable or corrupting the target address for a branch that
is not taken. Finally, a third reason the program behavior may remain the same with a
fault is that the effect of the fault is functionally masked. This means that the instruction
had the same result with the fault as without. An example would be multiplying the
wrong source register by zero. The result is still zero even though the choice of a source
register was faulty. The combination of these three effects is what leads to so many faults
causing no corruption.
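The functional masking case can be illustrated with a toy example. The register file contents, the register indices, and the helper name below are all hypothetical; the sketch only shows that a bit flip in the source-register field leaves a multiply-by-zero result unchanged.

```python
def execute_mul(regs, src_a, src_b):
    """Multiply two registers selected by index, truncated to 32 bits."""
    return (regs[src_a] * regs[src_b]) & 0xFFFFFFFF

regs = [0, 7, 0, 12345]            # hypothetical register file; regs[2] holds zero

good = execute_mul(regs, src_a=1, src_b=2)             # 7 * 0
faulty_src = 1 ^ (1 << 1)          # single bit flip in the source field: 1 becomes 3
faulty = execute_mul(regs, src_a=faulty_src, src_b=2)  # 12345 * 0

masked = (good == faulty)          # the wrong register was read, but the result is the same
```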
The next largest group of faults led to the corruption of a single word. The size of
this group can be attributed to the fact that a corrupt instruction could directly cause
only a single memory word to be corrupted. For further corruption to occur, the result
of the corrupt instruction had to be used multiple times. Because the send message
code into which we injected was very sequential with almost no loops, most variables were
computed once and then written out to memory without being reused. Thus, the largest
group of faults that lead to a program change were those that corrupted a single memory
variable.
The third and fourth largest peaks accounted for about 11 and 27 words respectively.
These two peaks were caused by faults that corrupted the pointers in the send message
code used in an array computation that wrote to 10 words. The first set is characterized
by two different kinds of faults. The first kind changed a pointer to the input data,
causing a significant number of variables to be miscomputed. The second kind changed
the output pointer, but only slightly, so that the correct results were output to memory
but offset one or two locations, again leading to many corrupt words. The peak at 27
was similar to that second kind of fault in that the output pointer was changed. This
time, though, it was changed to point to a significantly different part of memory. So, not
only were the original 10 array words corrupted, but at least an additional 10 words in
another part of memory were corrupted by writing the output results there instead.
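A toy model helps show why a slightly shifted output pointer corrupts about 11 words while a far-off pointer corrupts at least 20. The memory size, addresses, and values below are invented for illustration and are not taken from the MCP; the sketch does not attempt to reproduce the observed peak at 27.

```python
def run_array_write(out_ptr, size=64, n=10):
    """Write n distinct results starting at out_ptr into an otherwise zeroed memory."""
    mem = [0] * size
    for i in range(n):
        mem[out_ptr + i] = i + 1
    return mem

def corrupted_words(golden, faulty):
    """Count the memory words that differ between the fault-free and faulty runs."""
    return sum(1 for g, f in zip(golden, faulty) if g != f)

golden = run_array_write(out_ptr=8)

# Output pointer off by one: every array word is wrong, plus one word past the end.
shifted = corrupted_words(golden, run_array_write(out_ptr=9))

# Output pointer far away: the 10 array words are never written, and 10 other words are overwritten.
far = corrupted_words(golden, run_array_write(out_ptr=40))
```

In this sketch `shifted` is 11 and `far` is 20, which matches the first peak and the lower bound given for the second.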
The computation of the fault dictionary here was interesting because it showed that
each software module will have its own characteristic faulty behavior, and that faulty
behavior can be characterized by the way in which the variables in that module were
used. The generation of the fault dictionary was only an intermediate step in this analysis,
however. The use of the fault dictionary in a system-level simulation of the host interface
to determine the ultimate effect of the fault is described in the next subsection.
System-level simulation results
Once the fault dictionary was computed, it was used as a fault model at the system
level. Here, the host interface was simulated handling a predefined message workload of
both sends and receives. When it was determined that a transient fault occurred within
the LANai processor, the fault dictionary was used to determine what errors would occur
in the executing software as a result of the fault. These errors were then injected into
the software state, and the execution of the software continued, typically with a change
in behavior because of the fault injection.
At this level, we were interested in the ultimate effect the fault would have on the
software behavior in terms of things visible to the user. For that reason, the results here
are in terms of the effect on the messages that were to be sent or received by the interface.
If all of these messages were sent and received correctly, then the fault was said to have
no effect at this level.
We injected 1000 sets of errors, each set corresponding to the impact of one fault
on the send message module of the MCP, from our fault dictionary into the running
software. Each time we performed an injection, it was assumed that a long enough time
had passed since the last fault that the system was fault free. This assumption seems
reasonable because the host interface regularly updates its persistent variables, such as
network routes, and an interface that doesn't respond for a significant period of time
will be reset. Note, however, that there is nothing preventing us from considering the
impact of near-coincident faults on a software module in forming our fault dictionaries;
they simply weren't considered in this example study. In addition, the period between
network route updates was modeled so that faults that caused changes to network routes
would be able to affect messages only until the next route update.
Figure 4.4: Effect on message sent during fault lifetime. (Pie chart: Unaffected 40%,
Illegal Route 39%, Misrouted 12%, Data Corrupted 5%, Illegal Packet Type 2%,
Wrong Channel 1%, Illegal Channel 1%.)
Figure 4.4 shows a breakdown of the errors seen in the message workload. Of the
faults injected, 40% cause no errors in the transmission of the network workload. Of
those messages that were affected by faults, it can be seen that the majority of them
ended up with illegal routes and were dropped in the LAN. The next largest group of
errant messages were misrouted with a legal route, and the third largest group consisted
of messages with corrupted data. For those 60% of faults that did lead to an error, 95.5%
of them affected only a single message, and the other 0.5% affected multiple messages
(all the messages to the same destination until a mapping update was done).
4.2 Modeling an Entire Myrinet LAN With Validation
This second case study built upon the work in the first with the goals of modeling an
entire Myrinet and verifying the results against a real system. The system model for the
simulation was therefore extended to modeling in detail all of the interfaces in the four-
node LAN from the first study, and the object encapsulation and custom pointer class
special techniques were added. A behavioral-level fault model was still used, however,
to facilitate the fault injections into the real system that would be necessary to do the
validation. This comparison study first appeared in [23].
Before describing the simulation models that were used and the results, however, it is
important to understand the way in which the simulation was to be validated against the
real system and why this way was chosen. The validation was done by choosing a fault
model that could be injected into both systems (real and simulated) and then injecting
identical faults from this model into both systems and comparing the fault impacts. The
same fault model was used in each system to eliminate potential differences in the results
due to differences in the fault models. The impacts from identical faults were compared
in order to avoid the large number of injections that would be necessary to compare
distributions from two different fault lists.
There were two further reasons for comparing the fault impacts on a fault-by-fault
basis. First, the comparison was more precise than just comparing the distributions. For
example, the simulation could show the same number of system reset results as the real
system but not be 100% accurate because some of the reset results were attributed to
different fault injections than in the real system. Thus, some of the faults that caused
resets in the real system didn't cause resets in the simulation. The simulation was then
said to predict the wrong behavior for those faults. Second, the fault-by-fault comparison
allowed the percentage of faults for which the real and simulated systems matched to be
examined by result category. A low percentage of matches between the two methods for
any one category showed a weakness of the simulation at modeling faults with impacts
in that category.
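The difference between comparing distributions and comparing fault by fault can be shown with a small sketch. The outcome lists are invented, and the function name is hypothetical; the point is only that two result sets can have identical category counts while disagreeing on individual faults.

```python
from collections import Counter

def match_by_category(sim, real):
    """For each real-system category, the fraction of those faults the simulation got right."""
    totals, hits = Counter(real), Counter()
    for s, r in zip(sim, real):
        if s == r:
            hits[r] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Invented outcomes for four identical faults injected into both systems.
real = ["reset", "no error", "reset", "no error"]
sim = ["reset", "reset", "no error", "no error"]

same_distribution = Counter(sim) == Counter(real)   # both show two resets, two no-errors
rates = match_by_category(sim, real)                # yet only half of each category matches
```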
In this case study, the two claims that were made about the fault dictionary method in
the second chapter will also be revisited. In particular, the first claim said that the fault
dictionary helped reduce the number of faults that must be considered at higher levels
because many of those faults would result in no error or in errors identical to an entry
already in the fault dictionary. The second claim was that the behavior of the majority
of faults would be independent of the software state when the fault was injected.
4.2.1 System model
As in the first study, the system model consisted of a four-node Myrinet LAN with
faults being injected into the host interface of only one of the nodes. For this study,
however, each of the four host interfaces was modeled in detail, including executing its
own copy of the MCP, and the interfaces communicated through a simulated Myrinet
switch. The full MCP software was modeled in this study, including the cooperative
mapping functions. The host workstations connected to each of the four interfaces were
not modeled in detail. Instead, only the application (a synthetic workload described
below) was modeled.
A real Myrinet testbed was set up, as well, to allow fault injections to be duplicated
on a real Myrinet. Fault injection was done to one of the interfaces on this real network
using Software-Implemented Fault Injection (SWIFI). The testbed and simulation model
were made to be as similar as possible to facilitate the comparison of fault injection
results from the two injection methods. Thus, the testbed also consisted of a four-node
Myrinet connected through a switch. The same addresses and switch ports were assigned
in the simulation and testbed. A second network, an Ethernet that was not undergoing
fault injection, was used in the testbed to provide control and monitoring functions. A
picture of the system configuration is shown in Figure 4.5.
4.2.2 Fault model
A simplified fault model was chosen that would allow fault injections to be easily
duplicated for the two fault injection methods. Faults were injected as transient single
bit flips of one of the instructions to be executed in the "host send" message module. A
diagram of the "host send" message module of the MCP is given in Figure 4.6. (This
module is similar to the send module used in the first study, but due to revisions of the
MCP between studies, the same send module no longer existed.)
In the simulation experiments, faults were injected by corrupting one of the instructions
in the given module at the start of each of the behavioral cycle-accurate simulations.
Figure 4.5: Target system for fault injection (simulation and SWIFI). (Diagram labels:
four host interfaces; Myrinet Switch; Ethernet, used to control and monitor SWIFI
experiments; Ortho, remote workstation for monitoring and data collection in SWIFI.)
Figure 4.6: Diagram of fault injection region. (Diagram label: host send module, fault
injection.)
The execution of the module with the fault was then simulated, and at the end of the
module, a fault dictionary entry was made recording the corrupted control flow and
variables due to the fault. The dictionary was then later used in system-level simulations to
determine the ultimate impact of each fault.
In the SWIFI experiments, faults were injected in a similar way. The execution of the
MCP on the interface undergoing fault injection was paused at the start of the "host send"
module. Then, the selected instruction was corrupted by the interface's host computer.
Note that the same instruction and same bit would be corrupted for the SWIFI as in the
simulation for a duplicate fault. The interface would then continue executing the MCP
module with the fault. At the end of the module, execution would pause again while the
original instruction was restored.
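The inject, run, and restore sequence described above can be sketched as follows. This is a minimal illustration, not the actual SWIFI tooling: the instruction words are made up, and `run_module` is a hypothetical stand-in for whatever executes the module (the cycle simulator or the paused interface).

```python
def flip_bit(word, bit):
    """Flip a single bit of a 32-bit instruction word."""
    return (word ^ (1 << bit)) & 0xFFFFFFFF

def run_with_transient_fault(code, index, bit, run_module):
    """Corrupt one instruction, run the module, then restore the original word."""
    original = code[index]
    code[index] = flip_bit(original, bit)
    try:
        return run_module(code)
    finally:
        code[index] = original      # restore even if the run raises

code = [0x11111111, 0x22222222, 0x33333333]       # made-up instruction words
seen = run_with_transient_fault(code, index=1, bit=4, run_module=lambda c: list(c))
```

Recording the pair (index, bit) is what allows the same fault to be duplicated in both the simulation and the real system.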
Each of the fault injections was repeated 10 times in the SWIFI experiments to
ascertain the repeatability of the result. Faults that showed only one error behavior
and did so for at least 6 of the 10 injections were used in the validation comparison.
Other faults that showed multiple error behaviors for different fault injections or that
only rarely caused errors were thrown out.
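One possible reading of this selection rule is sketched below. The outcome labels and the function name are hypothetical, and treating a fault that never produced an error as a repeatable "no error" result is an assumption made here, not something the text states explicitly.

```python
from collections import Counter

def classify_repeatability(outcomes, min_repeats=6):
    """Return the repeatable outcome of a fault, or None if the fault should be discarded."""
    errors = Counter(o for o in outcomes if o != "no error")
    if not errors:
        return "no error"           # assumption: never erring counts as repeatable
    if len(errors) > 1:
        return None                 # multiple error behaviors across repeats
    (behavior, count), = errors.items()
    return behavior if count >= min_repeats else None   # error seen too rarely

kept = classify_repeatability(["hang"] * 7 + ["no error"] * 3)
mixed = classify_repeatability(["hang"] * 5 + ["restart"] * 5)
rare = classify_repeatability(["hang"] * 2 + ["no error"] * 8)
```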
In this way, faults whose behavior was highly dependent on the software state when
they were injected were identified and removed from the comparison. These faults would
complicate the comparison between the simulation and SWIFI experiments because it
would be difficult to make the software states exactly the same between the two. For the
remaining faults, however, that showed very little dependence on the software state, the
two systems should be easily comparable.
4.2.3 Results
In the comparison study between the simulation and SWIFI on the real system, each
method injected the same set of 500 faults. Some of these injections (4 injections) had
to be discarded because they caused the simulator to crash. Of the remaining faults,
423 were found to cause very repeatable results and so to be fairly independent of the
software state. The breakdown of the comparison for these 423 faults is presented in
Table 4.2.
Table 4.2: Number of errors by category for simulation and SWIFI.
Fault injection result    Simulation    SWIFI    Match
MCP hang                  61            82       62.2% (9.4%)
MCP restart               6             16       37.5% (20.9%)
Message dropped           58            55       94.5% (6.0%)
Data corrupt              19            19       84.2% (16.4%)
No error                  279           251      97.6% (1.9%)
Total                     423           423      87.5% (3.2%)
The leftmost column of Table 4.2 shows the fault injection results. Faults in the
MCP hang category caused one interface's MCP to stop performing some or all of its
functions. Faults in the MCP restart category caused one MCP to reset, momentarily
disrupting communication and causing it to drop all the messages currently in its buffer.
Those in the message dropped category caused one message being transmitted on
the Myrinet to be dropped. The data corrupt category was used when the only impact
of a fault was to corrupt some of the data in one of the messages being transmitted, and
all of the remaining faults fell into the no error category, signifying all messages in the
workload were sent and received correctly, even with the fault.
The second column shows how frequently the selected result category was observed
in the simulations. The SWIFI column shows the frequency with which the selected
result category appeared in the real system. Finally, the match column shows how often
the simulation and the real fault injection results were identical, and the number in
parentheses gives the 95% confidence interval. In computing the value for the match
column, the SWIFI method was considered to give the "gold" result. If for a given
injection, the simulation result agreed with the SWIFI result, a match was said to occur.
The value in the match column is the number of matches that fell into a given category
divided by the total number of SWIFI results in that category. For example, the simulator
and SWIFI results matched for 52 injections in the message dropped category. The
maximum possible number of matches for this case would be 55 (100%). Thus, for the
message dropped category the simulator accuracy was 94.5% (52/55), plus or minus 6.0%
for the 95% confidence interval.
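The worked example for the message dropped category can be reproduced with a short calculation. The text does not say how the confidence interval was computed; the sketch below assumes the usual normal approximation for a proportion, which happens to give the same 6.0% half-width for this category. The function name is hypothetical.

```python
import math

def match_rate_with_ci(matches, gold_total, z=1.96):
    """Match rate against the SWIFI ('gold') count, with a normal-approximation half-width."""
    p = matches / gold_total
    half_width = z * math.sqrt(p * (1 - p) / gold_total)
    return p, half_width

# Message dropped category: 52 matches out of 55 SWIFI results.
rate, half = match_rate_with_ci(52, 55)
```

Here `rate` is about 0.945 and `half` is about 0.060. Other rows of the table may not be reproduced exactly by this approximation.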
Table 4.2 shows that the simulation does extremely well at detecting when no error will
occur (matching SWIFI for over 97% of the faults) and reasonably well at predicting the
less severe injection results (e.g., the simulation correctly identified a dropped message for
about 95% of the faults where SWIFI determined that result). The simulation, however,
has relatively low accuracy in predicting severe fault results, such as host interface hang
where 62% of the injections match. One reason for the low level of accuracy is that
the simulation does not fully model the interaction between the host and the interface.
Factors affecting this accuracy rate are addressed in detail in the following subsection.
The two claims made in the second chapter will now briefly be revisited. The first
claim was that the fault dictionary would help reduce the number of simulations that
were necessary at higher levels by removing many faults that didn't propagate or had
identical error patterns to another fault already considered. For the 500 faults presented
above, injected at the chip level as single bit flips of instructions, 245 did not propagate
to the system level because they caused no change in the software state of the host send
module. The remaining 255 faults can be divided into three categories. One hundred
ten faults had identical error patterns to another fault in the dictionary and could be
discarded. Sixty-five faults had error patterns that were similar to another fault in the
dictionary, but not identical. That is, these similar faults corrupted exactly the same
software variables but the corrupted value for one or more of the variables was different.
With no further analysis, these 65 faults did have to be simulated at higher levels to
be sure of their impact. However, faults with similar error patterns were very likely to
cause identical results, and further analysis may have been able to identify faults with
significant differences from those without. Finally, 80 of the 500 faults caused unique
error patterns. In this study, therefore, one would have had to simulate at the system
level the 80 unique faults plus the 65 similar faults in the worst case, for a total of 145
faults (out of the original 500). With further analysis, it may have been possible to
reduce this number by eliminating some of the 65 similar faults.
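The grouping into non-propagating, identical, similar, and unique error patterns can be sketched as a single pass over the dictionary. The representation of a pattern as a mapping from address to corrupted value, and the rule that the first occurrence of a pattern counts as unique, are assumptions made for illustration.

```python
def classify_patterns(patterns):
    """Count error patterns by how they relate to patterns already in the dictionary."""
    seen_exact, seen_vars = set(), set()
    counts = {"no change": 0, "identical": 0, "similar": 0, "unique": 0}
    for pattern in patterns:
        if not pattern:
            counts["no change"] += 1         # fault did not propagate to the software state
            continue
        exact = frozenset(pattern.items())   # same variables and same corrupted values
        variables = frozenset(pattern)       # same variables, values ignored
        if exact in seen_exact:
            counts["identical"] += 1
        elif variables in seen_vars:
            counts["similar"] += 1
        else:
            counts["unique"] += 1
        seen_exact.add(exact)
        seen_vars.add(variables)
    return counts

def must_simulate(counts):
    """Worst case: every similar and every unique fault goes to the system level."""
    return counts["similar"] + counts["unique"]

# Invented patterns: address -> corrupted value.
demo = classify_patterns([{}, {0x28f08: 0}, {0x28f08: 0}, {0x28f08: 5}, {0x28f30: 1}])
reported = {"no change": 245, "identical": 110, "similar": 65, "unique": 80}
```

With the counts reported in the text, `must_simulate(reported)` gives the 145 faults quoted as the worst case out of the original 500.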
The second claim was that for embedded microprocessor systems with control software
like the one examined in this study, the majority of the fault impacts would be dependent
only on the fault injected and the design of the hardware and software modules, not on
the particular state of the software when the fault was injected. The above case study
supports the claim because of the 500 faults injected, 423 faults had a very repeatable
behavior that was independent or nearly independent of the software state. For these
423 faults, it was only necessary to consider a small set of representative states of the
software to obtain the correct behavior.
4.2.4 Discussion
The initial weeks of our comparison effort turned out to be a learning process. A
number of problems associated with our simulation efforts made it difficult to match the
behavior of the simulation to that of the real device. Those problems are discussed in
this subsection, including limitations in the simulator model, specification problems, and
effects of the simulation environment.
Cycle-Accurate Simulator Limitations
The cycle-accurate simulator had three basic limitations for our purposes. First, the
standard version supplied by Myricom simulates only the CPU core of the LANai chip.
None of the memory-mapped I/O on the real chip is implemented. As a result, we were
restricted from simulating certain regions of the MCP code at the cycle level and so could
not inject faults into these regions. The simulator could be extended to model the entire
LANai chip; however, we chose instead to find an important region of code (the "focused
send" region) that could be simulated in the cycle simulator without modifications. In
this way we were able to make a convincing argument for the validation of the simulation
without bringing in the additional issues of developing and testing new features in the 
cycle simulator. 
The second limitation in the cycle simulator was that it was not guaranteed to match 
the host interface behavior for certain error conditions, such as the response to an illegal
instruction or invalid memory access. One particular difference we noted was that the
simulator did not implement memory protection of the low memory segment, where
the MCP is stored, as was done in the real host interface. Some faults were observed to
attempt writes to this region. While these writes would be discarded in the real interface,
they were allowed to proceed in the cycle simulation. In the examples we observed, the 
lack of memory protection did not appreciably alter the results of the simulation, but 
the potential certainly exists. 
The third limitation applies to cycle-accurate simulation in general. Such simulators
are typically thousands of times slower than the real device they simulate. For most
faults, this simulation time was very short because we could quickly translate the fault
effect to the software level. There were rare faults, however, that would change the
program counter to a random value. In these cases, execution would be outside of typical
program paths, and thus the effect of the fault could not be translated to the software 
level right away. If execution never returned to normal, a hang could be assigned to these 
cases. However, this random code could lead to a reset of the LANai chip or a return 
to normal execution. The only way to be sure of the result was to continue simulation. 
Due to the cost of simulating at this level, however, an arbitrary limit of 80,000 cycles
was set. If the program had not reset or resumed normal execution within this period, 
the result was marked as a hang. 
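The cycle limit amounts to a bounded simulation loop. In the sketch below, `step` is a hypothetical callback that advances the cycle simulator by one cycle and reports "reset", "normal", or "running"; only the 80,000-cycle limit comes from the text.

```python
CYCLE_LIMIT = 80_000   # arbitrary limit quoted in the text

def classify_run(step, limit=CYCLE_LIMIT):
    """Simulate until the program resets or resumes normal execution, else call it a hang."""
    for cycle in range(limit):
        status = step(cycle)
        if status in ("reset", "normal"):
            return status
    return "hang"

never_recovers = classify_run(lambda cycle: "running")
resets_later = classify_run(lambda cycle: "reset" if cycle == 100 else "running")
```

A fault that would have recovered after the limit is misclassified as a hang, which is the cost of bounding the simulation time.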
Specification Problems 
Another problem we encountered in the development of our simulations was that the 
response of the LANai device to various error conditions was very vague or missing in
the official device specification. One example of unspecified behavior was the response 
to an unaligned memory access. A 32-bit memory access has to be aligned to a 32-bit 
boundary (the memory address must be evenly divisible by 4) in the LANai device. This 
information is clearly stated in the specification. However, the behavior when accesses 
are not aligned is not specified. While it is understandable that such information is 
unnecessary for programmers, it is necessary for the proper simulation of the device, 
particularly when such error conditions are likely to occur due to fault injection. 
Some of the unspecified behavior we came across was cleared up through discussion 
with Myricom. In other cases, however, we did not even realize we had overlooked some 
error condition until we observed it to cause a difference between the SWIFI and
simulation experiments. A particular example involved a fault that changed the DMA_DIR
register. This register holds a one-bit value specifying the direction of DMA transfers
between the host (workstation) and LANai chip. In our simulations, we considered only
the lowest bit of this register to be valid, and so writing a five or a one to this register
would give the same result. Experiments on the real device showed, however, that values
other than zero or one written to this register could cause a hang. Experiences like this
one led to a period where we, in essence, trained our simulator by correcting its behavior.
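The DMA_DIR discrepancy comes down to masking versus checking the written value. Both functions below are hypothetical illustrations: the first mirrors the original simulator assumption that only the lowest bit matters, and the second is one way a corrected model might flag out-of-range writes (the text says only that such values could cause a hang on the real device).

```python
def dma_dir_as_simulated(value):
    """Original simulator assumption: only the lowest bit of the register is valid."""
    return value & 1

def dma_dir_strict(value):
    """Hypothetical stricter model: anything other than 0 or 1 is flagged."""
    if value not in (0, 1):
        raise ValueError(f"DMA_DIR written with {value}: real device could hang")
    return value

same_in_simulation = dma_dir_as_simulated(5) == dma_dir_as_simulated(1)
```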
Effects of Simulation Environment
One problem in simulation is deciding where to draw the boundaries of the simulated
system. Interaction between objects inside the boundaries and those outside that have
been abstracted away may cause the simulation to act differently than the real device.
One case where this problem appeared in our study involved a fault in the real host
interface that caused the reset of its host workstation. Because the host was outside
our simulation boundaries, we did not model any of the interaction between host and
interface. Therefore, we were not able to predict this occurrence.
Another effect of the environment was due to our execution of the MCP code natively
on a workstation (for the software-level model). While this approach allowed the
simulation of the MCP to be very fast, it posed a problem. The LANai chip itself has only
a limited memory protection that simply discards illegal accesses, but on a workstation,
illegal memory accesses can cause the termination of an application. While every attempt
was made to avoid crashing of the simulations due to memory access violations, 48 such
crashes still occurred (see Section 6.1).
Finally, because the simulation engine and the MCP shared one user process in our
software-level simulation and had to communicate with each other, it was possible for
an errant MCP to corrupt variables belonging to the simulation engine. In the runs we
recorded, this behavior was not observed, but it was considered in our design. Avoiding
this problem, at least in part, would have meant distributing our simulation among
multiple processes so as to provide the operating system's memory protection to each
simulation process. Such a design was beyond the scope of this study.
4.3 Inclusion of Recovery to Improve Dependability
In the third case study, the results of the validation study were analyzed in an attempt
to improve the behavior of the MCP software under faults. One key result from the
validation study was the high number of fault injections that led to an MCP hang. As
this result category was also the most severe, it was targeted for recovery.
4.3.1 Recovery added
Three of the most common mechanisms that caused an MCP hang were identified
from the simulation results. Here, the behavioral-level fault dictionary was very useful
since it gave a breakdown of how different faults affected the MCP software variables and
control flow. From this dictionary, typical patterns for erroneous variables and control
flow could be identified so that recovery could be added.
The first of the three mechanisms was a hole in the received message acceptance tests
that passed erroneous messages with data lengths of zero bytes. These messages were
invalid, and handling them caused the receiving MCP to hang. The recovery for this
mechanism just plugged the hole in the acceptance tests by adding a check for
zero-length data messages.
The second mechanism that was targeted for recovery involved the state of the send
message buffer called NetSendQueue. A large number of faults could cause the MCP
to erroneously determine that this queue was full when there was still space remaining.
Once the MCP decided the queue was full, it would accept no more messages for sending
until the queue made a transition from full to not full. Because the queue was really
not full already, though, this transition would never occur, and the MCP would send no
more messages. The recovery for this mechanism added a timeout to the wait for the
queue to transition to not full. If the timeout expired without the not full notification,
the MCP would recheck the state of the queue. Typically, this second check would allow
it to make the correct determination and resume normal behavior.
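The timeout recovery can be sketched with a toy model of the queue. This is not the MCP code: the class, its fields, and the tick-based loop are invented for illustration, with `believed_full` standing in for the cached state that a fault can corrupt.

```python
class SendQueue:
    """Toy stand-in for NetSendQueue; believed_full is the MCP's cached view."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.believed_full = False

    def really_full(self):
        return len(self.items) >= self.capacity

def ticks_until_send_resumes(queue, timeout_ticks, max_ticks=1000):
    """Return the tick at which sending resumes, or None if it never does."""
    for tick in range(max_ticks):
        if not queue.believed_full:
            return tick
        if timeout_ticks is not None and tick >= timeout_ticks:
            queue.believed_full = queue.really_full()   # recheck the real state on timeout
    return None

faulty = SendQueue(capacity=8)
faulty.believed_full = True           # fault: queue wrongly marked full while empty
without_recovery = ticks_until_send_resumes(faulty, timeout_ticks=None)

faulty.believed_full = True
with_recovery = ticks_until_send_resumes(faulty, timeout_ticks=5)
```

Without the timeout the wrongly full queue never makes the full to not full transition, so sending never resumes; with it, the recheck clears the stale state shortly after the timeout expires.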
The third mechanism was very similar to the second, but it involved the host to
LANai DMA rather than a queue. Some faults would cause the MCP to erroneously
determine this DMA interface was in use when it was not, and the result was that the
MCP would wait forever for the interface to become free, similar to the situation with
the send queue above. Again, the solution here was to add a timeout to this wait period.
If the timeout expired without a notification the DMA was free, the MCP would check
the state of the DMA again and typically continue with normal operation.
4.3.2 Results
After the MCP code was changed to add the three types of recovery mentioned above,
a similar set of fault injections was run as in the verification study. That is, the target
system model (aside from the software changes) was the same four-node LAN, the same
fault list and workload were used, and experiments were run on both the simulated system
and the real testbed.
The results of the experiments are shown in Table 4.3. The number of faults causing
a given impact are listed for each of the five error categories for the simulation and real
system as before. This time, however, there are two numbers separated by a slash. The
first number gives the old behavior, as in Table 4.2, and the second number is the impact
of the fault with the new recovery routines added.
Table 4.3: Comparison of errors before and after recovery was added.
Fault injection result    Simulation    SWIFI
MCP hang                  61/36         82/32
MCP restart               6/6           16/16
Message dropped           58/63         55/67
Data corrupt              19/19         19/19
No error                  279/299       251/289
Total                     423           423
Experiments were run on the simulated system first to evaluate the effectiveness of
the recovery mechanisms before implementing them on the real system. The results of
the simulated fault injections showed a drop in the number of faults that caused hangs
from 58 to 29, or a 50% decrease. Note that the recovery that was added was all outside
the fault injection region, so the faults that were injected could be made exactly the same
as in the verification study. Of those same faults, 29 no longer caused a hang after the
addition of the recovery code.
Of the remaining faults that did cause a hang, 15 were due to random changes to the
program counter that put it outside normal program control flow paths. These faults
could not be targeted with simple software recovery mechanisms. The remaining faults
included some that used a variation on the three targeted mechanisms that could slip
through recovery, and the rest used a mechanism that was not targeted by the recovery.
The next step was to implement the same recovery features on the real system and
verify the effectiveness. Of particular interest was whether the recovery would be as
effective on the real system given the difficulty the simulation had in predicting hangs on
the real system. While the three mechanisms that were targeted for recovery were valid
mechanisms for which the simulation was known to simulate the correct behavior, it was
unclear whether they would help for the hangs in the real system which the simulation
didn't predict.
The results showed the recovery to be every bit as effective in the real system as the
simulation had predicted. The number of faults that caused a hang result in the real
system dropped from 82 to 32 for the same faults, a 61% decrease.
This study demonstrated the usefulness of the simulation method in determining
failure modes for the system software and identifying recovery mechanisms, even when
the simulation model was not detailed enough to model all of the failure modes present
in the real system. Even though the simulation was not able to identify all of the faults
that would hang the Myrinet system, the simulation analysis was still very useful in
determining what some of the common failure modes were and guiding the development
of recovery for those modes.
4.4 Incorporation of a Device-Level Fault Model
The final case study presented here was designed to demonstrate the use of the entire
fault dictionary hierarchy, from a device-level fault model up to a system-level result. To
accomplish this goal, the same system model and workload were used as in the previous
two studies. That is, multiple host interfaces were simulated in a four-node LAN, and
the interfaces were logically arranged in a circle, with each interface receiving from its
counterclockwise neighbor and sending to its clockwise neighbor. The fault model and
fault dictionary structure were changed, however, from a two-level to a five-level hierarchy.
This case study was first published in [24].
Because of the unavailability of a gate-level or transistor-level model for the LANai
chip, however, a special strategy had to be used to allow a demonstration of the full fault
dictionary hierarchy. The solution was to decide on a subcircuit of the LANai chip that
was frequently used and was visible to the behavioral-level simulator and to generate
transistor-level models for that subcircuit. The chosen subcircuit was the ALU 32-bit
adder. Because the code simulating the adder in the behavioral-level simulation was easily
identified and was a good match to the behavior of the transistor-level description of the
adder, it was easy to propagate the errors from the adder faults directly in the behavioral-level
simulation rather than going through an intermediate gate-level simulation step.
Thus, a complete gate-level model of the LANai chip was not necessary. A full description
of the fault dictionary hierarchy is given below and may better illustrate this strategy.
4.4.1 Fault dictionary hierarchy
The fault dictionaries in this study were designed to begin with a device-level fault
model for a radiation particle impact on the adder circuit and end with the system-level
effect of that event. The steps of this process are described below, from the beginning at
the lowest level until the end when the system level is reached.
Device level
Fault injection began at the device level in a two-level simulation step to determine
an appropriate current burst due to a heavy ion particle impacting the adder circuitry.
The fault event was the impact of a radiation particle with an energy of 8 MeV at a
reverse-biased transistor junction at an angle of 45 degrees to the surface. The device
parameters for this stage were taken from a 0.25-μm CMOS technology.
The DESSIS simulator was the first level of this two-level approach, and it was used
to model the MOS transistor that was hit by the particle. The DESSIS simulator solved
the coupled Poisson, Electron, and Hole equations of the transistor in the transient mode
to calculate the resulting current surge.
The second level of this simulation step was the transistor-level simulator called
HSPICE. HSPICE was used in an iterative manner with the DESSIS simulator to provide
the bias conditions in which the affected transistor was operating.
Together, the two simulators obtained the model for the particle impact in the
following way. First, HSPICE would simulate a short period of time which would
define the original bias conditions around the MOS transistor. Next, DESSIS would
begin simulation of the particle impact for a short period, obtaining the beginning of the
resulting current burst. This current burst was then entered in the HSPICE model, and
simulation continued a timestep further, resulting in a new bias. The new bias was then
put back in DESSIS to generate the next timestep for the current burst. The back and
forth iteration proceeded in this way until the whole particle impact completed, resulting
in a time-varying model for the current burst.
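The alternation between the two simulators can be sketched as a simple loop. The code below does not drive DESSIS or HSPICE. The functions toy_device and toy_circuit are hypothetical stand-ins (an exponentially decaying current and a single discharging node), and the time constant and capacitance are arbitrary illustrative values.

```python
import math
from typing import Callable, List, Tuple

def cosimulate(device_step: Callable[[float, float], float],
               circuit_step: Callable[[float, float, float], float],
               initial_bias: float, dt: float,
               n_steps: int) -> List[Tuple[float, float]]:
    """Alternate device and circuit steps; return the (time, current) waveform."""
    bias = initial_bias          # bias from the initial circuit-level run
    waveform = []
    for k in range(n_steps):
        t = k * dt
        current = device_step(bias, t)            # device level: burst current
        waveform.append((t, current))
        bias = circuit_step(bias, current, dt)    # circuit level: updated bias
    return waveform

def toy_device(bias: float, t: float) -> float:
    """Hypothetical device model: current scales with bias and decays in time."""
    return bias * 1e-3 * math.exp(-t / 1e-10)

def toy_circuit(bias: float, current: float, dt: float) -> float:
    """Hypothetical circuit model: the burst discharges a 100 fF node."""
    return max(0.0, bias - current * dt / 1e-13)

waveform = cosimulate(toy_device, toy_circuit,
                      initial_bias=2.5, dt=1e-11, n_steps=20)
```

The resulting list of (time, current) pairs plays the role of the time-varying current burst model that is passed on to the next level.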
Logic level
The next fault dictionary level was the logic level. Here, the target system was now
the whole adder, not just one transistor. The 32-bit adder was organized as eight 4-bit
Manchester carry adder stages with ripple carry between the stages. Faults, in terms of
the current burst derived in the device-level step, were injected, one per simulation run,
at reverse-biased transistor junctions in the adder circuit as it performed a computation.
The errors latched at the outputs of the adder circuit were then recorded as the logic-level
fault model for the adder.
For each input combination, faults were injected exhaustively at all reverse-biased
transistor junctions, and the error patterns observed along with their frequency were
recorded. Symmetry in the adder, such as the fact that the 32-bit adder was made of
8 identical 4-bit adders, was used to reduce the number of simulations necessary in this
step. At the end of this step, the output was a fault dictionary that listed for each input
combination the likelihood of observing each possible error pattern given the occurrence
of a single radiation particle impact on the circuit.
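The structure of such a dictionary can be illustrated with a small sketch. It is not the transistor-level procedure used in the study: as a simplifying assumption, a fault is modeled as a flip of one internal sum or carry node of a single 4-bit ripple-carry stage, and every injection site is taken to be equally likely. The names add_with_fault, SITES, and build_dictionary are ours.

```python
from collections import Counter
from itertools import product

WIDTH = 4  # one 4-bit stage; the study's adder has eight such stages

def add_with_fault(a, b, site=None):
    """Ripple-carry add; `site` = ("sum" | "carry", bit) flips that node."""
    carry, result = 0, 0
    for i in range(WIDTH):
        x, y = (a >> i) & 1, (b >> i) & 1
        s = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))
        if site == ("sum", i):
            s ^= 1
        if site == ("carry", i):
            carry ^= 1
        result |= s << i
    return result  # the carry out of the stage is dropped

SITES = [(kind, i) for kind in ("sum", "carry") for i in range(WIDTH)]

def build_dictionary():
    """Map each input pair to {error pattern: likelihood given one fault}."""
    dictionary = {}
    for a, b in product(range(1 << WIDTH), repeat=2):
        golden = add_with_fault(a, b)
        counts = Counter(add_with_fault(a, b, s) ^ golden for s in SITES)
        dictionary[(a, b)] = {p: n / len(SITES) for p, n in counts.items()}
    return dictionary

dictionary = build_dictionary()
```

An error pattern is stored as the XOR of the faulty and fault-free outputs, so a pattern of zero records a fault that left no error at the adder outputs.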
Chip level
The logic-level fault dictionary was then used as the fault model in a behavioral-level
simulation of the LANai chip. This step used the same behavioral-level simulator as
was used in the other three case studies, and faults were injected into the same "send
message" module of the MCP. Instead of a single-bit flip of an instruction, however, faults
were now injected as corruption of one of the add operations during the execution of the
module.
More than half of the instructions in the 67-instruction region where faults were
injected used the adder. Add operations were required by load and store instructions to
compute effective memory addresses as well as by add and subtract instructions to compute
arithmetic results. Also note that comparisons are generally performed by subtracting
one of the numbers to be compared from the other, and thus comparisons are coded as
subtracts and also use the adder. Finally, NOP instructions are coded as adding zero to
zero and discarding the result, and so they also use the adder. Discarding the NOPs,
there were 22 instructions that actually relied on the results of the add operation.
Faults were injected exhaustively during the execution of every add operation,
including all possible error patterns in the logic-level fault dictionary, one per simulation
run. At the end of the execution of the module, the software state of the simulated MCP
was examined for erroneous control flow and variables. These errors were recorded in the
chip-level fault dictionary as the chip-level model for the given fault.
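The exhaustive chip-level injection can be pictured with a toy model. The sketch below is not the LANai behavioral simulator. It assumes a hypothetical module made only of register-to-register adds, applies each error pattern as an XOR on one add result per run, and records the registers that differ from the fault-free run.

```python
MASK = 0xFFFFFFFF

def run_module(ops, regs, fault=None):
    """Execute (dst, src1, src2) adds; `fault` = (op index, error pattern)."""
    regs = dict(regs)
    for i, (dst, s1, s2) in enumerate(ops):
        value = (regs[s1] + regs[s2]) & MASK
        if fault is not None and fault[0] == i:
            value ^= fault[1]          # corrupt this one add operation
        regs[dst] = value
    return regs

def chip_level_dictionary(ops, regs, patterns):
    """Record, per (operation, pattern), the registers left in error."""
    golden = run_module(ops, regs)
    entries = {}
    for i in range(len(ops)):
        for p in patterns:
            faulty = run_module(ops, regs, (i, p))
            entries[(i, p)] = {r: v for r, v in faulty.items()
                               if v != golden[r]}
    return entries

# Hypothetical three-instruction module and initial register state.
ops = [("r2", "r0", "r1"), ("r3", "r2", "r2"), ("r2", "r0", "r0")]
regs = {"r0": 1, "r1": 2, "r2": 0, "r3": 0}
entries = chip_level_dictionary(ops, regs, patterns=[1, 4])
```

In this toy module the fault in the first add never appears in r2 at the end, because r2 is overwritten, but it survives in r3, which was computed from the corrupted value.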
4.4.2 Results
Finally, the ultimate impact of each fault was determined by a system-level simulation
that incorporated the chip-level fault dictionary. This system-level step was identical to
that of the previous two case studies, and the same result categories were used. The
overall results of the fault injections are shown in Table 4.4.
Table 4.4: Breakdown of number of errors by category.
Fault injection result    Simulation
MCP hang                         112
MCP restart                       32
Message dropped                  134
Data corrupt                      54
No error                         918
Total                           1250
One number that stands out in the table is the large number of fault injections that
caused no impact. It is important to note that there are many reasons a corrupt add
operation might not impact the host interface's operation. The most common reason is
that the message buffers are being continually reused, and the new message that is being
written into the buffer may share many header items in common with the past message
in that buffer. If so, then a failure to write some information into the buffer (due to
an address miscalculation) may not have any impact because the correct information
is already there. As long as the failed write does not overwrite some other important
data, it has no effect. Another common reason a fault may have no impact occurs when
the adder is used to compare two unequal numbers. In this case, the adder is used to
subtract one number from the other. A zero result indicates equality. However, a single
fault in the adder is unlikely to change a nonzero result into a zero result. The comparison
decision, then, doesn't change, even though the adder result is wrong, because the result
is still nonzero, indicating unequal numbers.
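The asymmetry between equal and unequal comparisons can be made concrete with a short sketch. As a simplifying assumption, the adder fault is modeled here as a flip of a single bit of the 32-bit difference, whereas the fault dictionary in the study may also contain multi-bit error patterns. The function names are ours.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

def compare_equal(a, b, flip_bit=None):
    """Equality test by subtraction; optionally flip one bit of the result."""
    diff = (a - b) & MASK
    if flip_bit is not None:
        diff ^= 1 << flip_bit
    return diff == 0

def decisions_changed(a, b):
    """Count the single-bit flips that change the comparison outcome."""
    golden = compare_equal(a, b)
    return sum(compare_equal(a, b, bit) != golden for bit in range(WIDTH))

print(decisions_changed(7, 7))  # equal operands
print(decisions_changed(7, 5))  # difference is a power of two
print(decisions_changed(7, 4))  # difference has two bits set
```

Under this model every flip changes the outcome of an equal comparison, while an unequal comparison changes only when the difference has exactly one bit set and that bit is the one flipped.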
Breaking up the injection results by the type of instruction affected reveals some more
interesting points. Table 4.5 shows the result categories according to whether the affected
instruction was a load, store, add, or subtract.
Table 4.5: Breakdown of number of errors by category and instruction type.
Fault injection result    Load   Store    Add   Subtract
MCP hang                    29       -     19         64
MCP restart                 26       1      5          -
Message dropped             89      45      -          -
Data corrupt                27       -     27          -
No error                   223     468     62        165
Total                      394     514    113        229
As can be seen from the table, the add operations in loads are more important than
those in stores for the code in the fault injection region. (Stores have only a 9% chance
of causing a user-level error whereas loads have a 45% chance.) One reason for this
behavior has already been given above: a given store may be unnecessary because the
correct information is already in memory. Where writing to a wrong address is not so
disruptive for a store, loading from the wrong address is much more likely to cause an
error. The load result is guaranteed to be used in a future computation, while the bad
address to which the store wrote is only sometimes used. Since a load from a wrong
address almost always loads a wrong value, loads are much more likely than stores to
generate a wrong value that will be reused and eventually cause an error at the user level.
Another interesting point from the table is that loads lead to all of the different types
of result categories. Loads are used to begin computations that affect message headers,
and corrupting these loads can cause message drops. Loads also set pointer values for the
message data, and thus corrupt loads can cause message corruption. Finally, loads are
used in computations to make decisions about the state of shared resources. Corrupting
a load that is determining the state of a shared resource can lead to a deadlock for that
resource or a crash because of an invalid use of the resource.
Almost half of the subtract instructions can be seen to cause hangs in the table,
and subtracts cause no other kind of result. Only hangs occur because the only use of
subtracts in the fault injection region is to make comparison decisions about the state of
shared resources. Making the wrong decision in these cases typically causes the MCP to
hang. Four decision points are affected in the code; two of them are typically unequal
and two equal. As mentioned above, comparisons that would generate an unequal result
in the fault-free case are unlikely to be affected by a single fault in the adder. Equal
comparison results are very likely to be affected, though. As a result, almost half of the
subtracts cause an impact at the user level because half of the comparisons in the fault
injection region would normally be equal in the fault-free case.
5. CONCLUSIONS 
This chapter summarizes the work that was done in this thesis and then describes 
some future work which could be done to improve the method that has been presented. 
5.1 Summary 
This thesis has presented a new method for simulated fault injection analysis of
computer systems. The key points where the method differs from previous work are allowing
detailed fault models, simulating the impact of the faults on system software behavior as
well as hardware behavior, and obtaining the impact of the faults at the user level.
The key technique applied to meet these goals is the use of fault dictionaries to raise
the abstraction level of faults. A fault dictionary for a given component characterizes
that component's behavior for faults occurring within it. The fault dictionary stores the
change in behavior of the component for many different faults. When a fault is to be
injected in a given component, the dictionary is accessed, and the component's behavior
is modified according to one of the appropriate entries in the dictionary.
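A dictionary access of this kind can be sketched as a weighted random choice among the entries stored for the component's current inputs. The dictionary contents below are made-up illustrative values, not data from the case studies, and the function name inject is ours.

```python
import random

# Hypothetical dictionary for a 4-bit adder: for each input pair, the
# likelihood of each output error pattern given one fault (0 = no error).
adder_dictionary = {
    (3, 5): {0b0000: 0.50, 0b0001: 0.25, 0b0110: 0.25},
}

def inject(a, b, dictionary, rng):
    """Return the adder output after applying one sampled dictionary entry."""
    golden = (a + b) & 0xF
    entry = dictionary[(a, b)]
    patterns = list(entry)
    weights = [entry[p] for p in patterns]
    pattern = rng.choices(patterns, weights=weights, k=1)[0]
    return golden ^ pattern

result = inject(3, 5, adder_dictionary, random.Random(0))
```

Passing an explicit random.Random instance keeps an injection campaign repeatable, which matters when the same faults must be replayed against a modified system.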
A fault dictionary used at one level of abstraction is developed through many
simulations of the given component at the next lower level of abstraction. Fault dictionaries
are built from the bottom up. That is, simulations begin at the level of abstraction of
the primary fault model, for instance at the transistor level for a transient current burst
model. From there, multiple simulations are run to build up a fault dictionary which can
be used for fault injections at the next higher level, for instance, the logic level.
Simulation then continues at that higher level, building a new fault dictionary. The upward
propagation continues until fault effects reach the system level and their ultimate impact
on the system, as it is visible to the user, is determined.
Four case studies were presented, demonstrating the use of the method to analyze 
a commercial network system called Myrinet. Each case study focused on presenting a 
different piece of the method in detail. In the first study, the focus was on the use of 
the method to model the behavior of the system software. In the second study, the focus 
was validating the simulation method versus similar fault injections into a real Myrinet. 
The third study showed how the analysis from the method could be used to improve the
behavior of the Myrinet under faults, and the fourth study demonstrated the use of the
complete fault dictionary hierarchy, from the device level to the system level.
5.2 Future Work 
Two ways in which this work could be extended will be discussed in this section. There
are, of course, improvements that could be made to the case studies presented earlier,
but the two ideas presented here are both improvements to the basic fault dictionary
simulation method.
First, the software-level fault dictionary representation could be made more general.
The current fault dictionary records the impact of faults in one module for just
one state of the software. To include additional software states in the study, new fault
dictionaries must be computed, one for each new state. From observations made during
the case studies in this thesis, however, the impact of many faults remains the same for
multiple software states. These observations suggest that the computations for many faults
need not be repeated for many software states.
Instead, the fault dictionary entry could be modified to describe those parts of the
software state that the dictionary entry depends on. If those parts of a new software
state match the state for which the dictionary entry was made, the same entry is used
rather than computing a new entry for the new software state. In this way, a fault in a
module that will cause a system reset independent of the software state when it enters
that module will be computed only once in the fault dictionary computations.
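One possible shape for such a generalized entry is sketched below. It is only an illustration of the idea. It assumes that the lower-level simulation can report which state variables the fault effect depended on, and the class names, fault names, and state variables are all hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

class Entry(NamedTuple):
    relevant_state: Dict[str, int]  # only the state the effect depends on
    effect: str

class GeneralizedDictionary:
    def __init__(self, simulate: Callable[[str, Dict[str, int]],
                                          Tuple[List[str], str]]):
        self._simulate = simulate
        self._entries: Dict[str, List[Entry]] = {}
        self.simulations_run = 0

    def lookup(self, fault_id: str, state: Dict[str, int]) -> str:
        # Reuse an entry whose relevant state matches the new state.
        for entry in self._entries.get(fault_id, []):
            if all(state.get(k) == v
                   for k, v in entry.relevant_state.items()):
                return entry.effect
        # Otherwise simulate once and store the state it depended on.
        keys, effect = self._simulate(fault_id, state)
        self.simulations_run += 1
        relevant = {k: state[k] for k in keys}
        self._entries.setdefault(fault_id, []).append(Entry(relevant, effect))
        return effect

def toy_simulate(fault_id, state):
    """Hypothetical simulator: returns (state variables used, effect)."""
    if fault_id == "pc_corrupt":
        return [], "system reset"            # independent of software state
    dropped = state["queue_len"] > 0
    return ["queue_len"], "message dropped" if dropped else "no error"

d = GeneralizedDictionary(toy_simulate)
s1 = {"queue_len": 1, "seq": 10}
s2 = {"queue_len": 1, "seq": 11}
s3 = {"queue_len": 0, "seq": 12}
results = [d.lookup("pc_corrupt", s) for s in (s1, s2, s3)]
results += [d.lookup("len_corrupt", s) for s in (s1, s2, s3)]
```

In this toy run six lookups need only three simulations: one for the state-independent reset and two for the fault whose effect depends on queue_len alone.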
Second, the demonstrations presented in this thesis and the special techniques that
have been described are all tailored towards simulations with hardware transient faults.
The generalized fault dictionary method could also be applied to other types of faults,
however. One valuable addition to this work would be its extension to other types of
faults, particularly permanent hardware faults.
The difficulty in using this method, as is, to model permanent faults is that permanent
faults do not alter the system state at just one point in time. A permanent fault cannot
be modeled, then, by using a single fault dictionary entry to modify the system state
as required by the fault. The easy solution, repeated application of fault dictionaries
at each successive module of software execution, isn't very tractable, either. At each
successive module, the software state may be further corrupted by previous execution
with the permanent fault. If this corruption were not considered in forming the fault
dictionary entries that were used, the entries may not represent the correct module
behavior at the current point in the software's execution. If the previous corruption
of the software is taken into account, however, the fault dictionary computations may
become unmanageable due to the large number of software states that must be considered,
both fault-free and with corruption due to permanent faults.
A good approach to implementing permanent faults may be to combine the idea of
fault dictionaries with a form of hierarchical simulation. The hierarchical simulation
would differ from the normal fault dictionary method in that the faulty behavior of a
module would not be precomputed (in a dictionary) but instead computed on the fly
by dynamically switching simulation execution to the lower level at which the dictionary
would have been made. Fault dictionaries would still be used, but only at the lower levels
of system abstraction where the software state wouldn't impact the dictionary entry.
For example, consider a permanent fault in an adder. Fault dictionaries could be
computed for this fault up to the behavioral level, where the adder takes two inputs and
outputs the sum. The software state doesn't impact the operation of the adder here, only
the inputs. Thus, a fault dictionary can be made for permanent faults in the adder. If a
software module uses the adder, it will find it has no fault dictionary at the system level.
Execution will then dynamically switch down to the behavioral level where the already
computed fault dictionary for the adder can be used. Additional software modules using
the adder will require continued simulation at the behavioral level, but the hope is that
the permanent fault will be quickly detected, ending the need for further simulation.
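The dynamic switching described above can be sketched as follows. This is a toy model built on several assumptions: the permanent fault is a stuck-at fault on one output bit of the adder, the function faulty_add stands in for the precomputed behavioral-level dictionary, and detection is assumed to happen as soon as an add result is wrong. The module list is hypothetical.

```python
MASK = 0xFFFFFFFF

def faulty_add(a, b, stuck_bit, stuck_value):
    """Behavioral-level adder with one output bit stuck at 0 or 1."""
    result = (a + b) & MASK
    if stuck_value:
        return result | (1 << stuck_bit)
    return result & ~(1 << stuck_bit)

def run_modules(modules, fault=None):
    """Run modules at the system level, dropping down when the adder is used."""
    log = []
    for name, uses_adder, operands in modules:
        if fault is None or not uses_adder:
            log.append((name, "system level"))
            continue
        a, b = operands
        got = faulty_add(a, b, *fault)      # switch to the behavioral level
        log.append((name, "behavioral level"))
        if got != (a + b) & MASK:           # assumed detection mechanism
            log.append((name, "fault detected"))
            break
    return log

modules = [("init", False, None), ("route", True, (2, 1)),
           ("checksum", True, (2, 2)), ("idle", False, None)]
log = run_modules(modules, fault=(0, 1))    # output bit 0 stuck at 1
```

Here the first use of the adder happens to produce a correct sum, so simulation stays at the behavioral level for one more module before the fault is exposed and the run ends.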
REFERENCES 
[1] R. K. Iyer and D. Tang, Fault-Tolerant Computer System Design, Chapter 5,
"Experimental Analysis of Computer System Dependability." Prentice Hall, 1996.
[2] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J. Fabre, J. Laprie, E. Martins, and
D. Powell, "Fault injection for dependability validation: A methodology and some
applications," IEEE Trans. Software Engineering, vol. 16, pp. 166-182, Feb. 1990.
[3] G. Kanawati, N. Kanawati, and J. Abraham, "FERRARI: A tool for the validation
of system dependability properties," in Proceedings 22nd International Symposium
on Fault-Tolerant Computing, July 1992, pp. 336-344.
[4] K. Prodromides and W. H. Sanders, "Performability evaluation of CSMA/CD and
CSMA/DCR protocols under transient fault conditions," IEEE Transactions on
Reliability, vol. 42, pp. 116-127, March 1993.
[5] W. H. Farr, "A survey of software reliability modeling and estimation," Tech. Rep.,
Naval Surface Weapons Center, Sept. 1983.
[6] J. A. Clark and D. K. Pradhan, "REACT: A synthesis and evaluation tool for
fault-tolerant multiprocessor architectures," in Proceedings of the Annual Reliability and
Maintainability Symposium, 1993, pp. 428-435.
[7] A. K. Ghosh and B. W. Johnson, "System-level modeling in the ADEPT
environment of a distributed computer system for real-time applications," in Proceedings
of the IEEE International Computer Performance and Dependability Symposium,
April 1995, pp. 194-203.
[8] E. Jenn, J. Arlat, M. Rimen, J. Ohlsson, and J. Karlsson, "Fault injection into VHDL
models: The MEFISTO tool," in Proceedings of the 24th International Symposium
on Fault-Tolerant Computing, June 1994, pp. 66-75.
[9] E. G. Ulrich and T. Baker, "The concurrent simulation of nearly identical digital
networks," in Proceedings of the 10th Design Automation Workshop, June 1973,
pp. 145-150.
[10] S. Seshu, "On an improved diagnosis program," IEEE Trans. on Electronic
Computers, vol. EC-14, pp. 76-79, Feb. 1965.
[11] D. B. Armstrong, "A deductive method for simulating faults in logic circuits," IEEE
Trans. on Computers, vol. C-21, pp. 424-428, May 1972.
[12] T. M. Niermann, W. T. Cheng, and J. H. Patel, "Proofs: A fast, memory-efficient
sequential circuit fault simulator," IEEE Trans. on Computer-Aided Design, vol. 11,
pp. 198-207, Feb. 1992.
[13] G. L. Ries, G. S. Choi, and R. K. Iyer, "Device-level transient fault modeling,"
in Proceedings of the 24th International Symposium on Fault-Tolerant Computing,
June 1994, pp. 66-75.
[14] K. K. Goswami, R. K. Iyer, and L. Young, "DEPEND: A simulation-based
environment for system level dependability analysis," IEEE Trans. on Computers, vol. 46,
pp. 60-74, Jan. 1997.
[15] K. K. Goswami and R. K. Iyer, "Simulation of software behavior under hardware
faults," in Proceedings of the 23rd International Symposium on Fault-Tolerant
Computing, June 1993, pp. 218-227.
[16] G. Choi and R. K. Iyer, "FOCUS: An experimental environment for fault sensitivity
analysis," IEEE Trans. on Computers, vol. 41, no. 12, pp. 1515-1526, 1992.
[17] F. Yang, "Simulation of faults causing analog behavior in digital circuits," Ph.D.
dissertation, University of Illinois, Urbana, IL, 1992.
[18] G. L. Ries, "Transient fault modeling," Master's thesis, University of Illinois,
Urbana, IL, 1995.
[19] J. D. Barnette, "Acceleration techniques for dependability simulation," Master's
thesis, University of Illinois, Urbana, IL, 1994.
[20] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W.-K.
Su, "Myrinet: A gigabit-per-second local-area network," IEEE Micro, vol. 15-1,
pp. 29-36, Feb. 1995.
[21] G. Ries and R. K. Iyer, "Evaluating the impact of transient faults on software
behavior: Case study of a commercial high-speed network," in Proceedings of the
6th IFIP International Working Conference of Dependable Computers for Critical
Applications (DCCA-6), March 1997. Scheduled to appear.
[22] M. Rimen, J. Ohlsson, and J. Torin, "On microprocessor error behavior modeling,"
in Proceedings of the 24th International Symposium on Fault-Tolerant Computing,
June 1994, pp. 76-85.
[23] D. Stott, G. Ries, M. C. Hsueh, and R. K. Iyer, "Fault injection for high-speed
network dependability analysis," IEEE Transactions on Computers Special Issue on
Dependable Computing, 1997. Scheduled to appear.
[24] Z. Kalbarczyk, R. K. Iyer, G. Ries, J. U. Patel, and M. S. Lee, "Hierarchical approach
to accurate fault modeling for system test and evaluation," in Proceedings of the 3rd
IEEE International On-line Testing Workshop, 1997. Scheduled to appear.
VITA 
Gregory Lawrence Ries was born in  on  He graduated 
from Case Western Reserve University, in Cleveland, Ohio, with a B.S.E.E. in 1992, 
and in January, 1995, obtained an M.S.E.E. from the University of Illinois at Urbana-
Champaign. His awards include National Science Foundation Scholar, Robert C. Byrd
Scholar, Leonard Case Scholar, and CWRU Alumni Scholar. 
Ries was a teaching assistant at the University of Illinois from August, 1992, until 
August of 1993, following which he was a research assistant in the Center for Reliable 
and High-Performance Computing until May, 1997. He also worked at Rockwell Semi-
conductor Systems as an engineer in the advanced DSP architecture group during the 
summer of 1996. Ries is a member of the Tau Beta Pi national honor society. He began 
working toward his Ph.D. in the spring of 1995 under Dr. Ravishankar K. Iyer at the 
University of Illinois at Urbana-Champaign in the area of dependability modeling. 
HIERARCHICAL SIMULATION TO ASSESS HARDWARE
AND SOFTWARE DEPENDABILITY
Gregory Lawrence Ries, Ph.D.
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign, 1997
Ravishankar K. Iyer, Advisor
This thesis presents a method for conducting hierarchical simulations to assess
system hardware and software dependability. The method is intended to model embedded
microprocessor systems. A key contribution of the thesis is the idea of using fault
dictionaries to propagate fault effects upward from the level of abstraction where a fault model
is assumed to the system level where the ultimate impact of the fault is observed, and
a second important contribution is the analysis of the software behavior under faults as
well as the hardware behavior.
The simulation method is demonstrated and validated in four case studies that analyze
a commercial, high-speed networking system called Myrinet. One key result from the case
studies shows that the simulation method predicts the same fault impact 87.5% of the
time, as is obtained by similar fault injections into a real Myrinet system. Reasons for
the remaining discrepancy are examined in the thesis. A second key result shows the
reduction in the number of simulations needed due to the fault dictionary method. In
one case study, 500 faults were injected at the chip level, but only 255 propagated to
the system level. Of these 255 faults, 110 shared identical fault dictionary entries at the
system level and so did not need to be resimulated. The necessary number of system-level
simulations was therefore reduced from 500 to 145. Finally, a third result in the case
studies shows how the simulation method can be used to improve the dependability of the
target system. The simulation analysis was used to add recovery to the target software for
the most common fault propagation mechanisms that would cause the software to hang.
After the modification, the number of hangs was reduced by 60% for fault injections into
the real system.
