Advanced information processing system: Fault injection study and results by Masotto, Thomas K. et al.
NASA Contractor Report-189590
Advanced Information Processing System:
Fault Injection Study and Results
Laura F. Burkhardt
Thomas K. Masotto
Jaynarayan H. Lala
THE CHARLES STARK DRAPER LABORATORY, INC.
CAMBRIDGE, MA 02139
Contract NAS1-18565
May 1992
NASA
National Aeronautics and
Space Administration
Langley Research Center
Hampton, Virginia 23665-5225
(NASA-CR-189590) ADVANCED INFORMATION PROCESSING SYSTEM: FAULT INJECTION STUDY AND
RESULTS Final Report (Draper (Charles Stark) Lab.) 154 p
N92-26105
Unclas
G3/62 0091272
https://ntrs.nasa.gov/search.jsp?R=19920016862 2020-03-17T10:52:26+00:00Z

TABLE OF CONTENTS
LIST OF ILLUSTRATIONS ........................................................................ v
LIST OF TABLES .................................................................................. vii
1.0 APPROACH TO FAULT INJECTION .................................................. 1-1
1.1 Introduction ........................................................................... 1-1
1.2 Test Case Specification .............................................................. 1-3
1.2.1 All Possible Test Cases ..................................................... 1-3
1.2.2 Subset of Actual Test Cases ................................................ 1-8
1.3 Test Case Measurements ............................................................ 1-9
1.4 Test Case Execution ................................................................. 1-10
1.4.1 Experimental Setup ......................................................... 1-10
1.4.2 Test Case Measurements .................................................. 1-11
2.0 THE FAULT INJECTION ENVIRONMENT .......................................... 2-1
2.1 Overview .............................................................................. 2-1
2.1.1 Hardware Fault Injection ................................................... 2-1
2.1.2 Software Fault Injection .................................................... 2-3
2.2 Fault Injector Hardware ............................................................. 2-4
2.3 Fault Injection Software ............................................................ 2-16
2.3.1 The FIS Main Menu ........................................................ 2-16
2.3.1.1 The FIS Edit Faults Menu ...................................... 2-16
2.3.1.2 The FIS Multiplexer Signal Selection Menu ................. 2-18
2.3.1.3 The FIS Boolean Function Selection Menu .................. 2-18
2.3.1.4 The FIS Fault Application Option ............................. 2-19
2.3.1.5 The FIS Save Fault File Option ............................... 2-22
2.3.1.6 The FIS Load Fault File Option ............................... 2-22
2.3.1.7 The FIS Reset Option ........................................... 2-23
2.3.1.8 The FIS Quit Option ............................................ 2-23
2.3.2 Fault Injection Software - Multiple Fault Application .................. 2-23
3.0 APPLYING FAULTS TO THE I/O NETWORK ....................................... 3-1
3.1 Overview of AIPS I/O Network .................................................... 3-1
3.2 Specification of I/O Network Faults ............................................... 3-2
3.2.1 Creating Node and Link Faults ............................................ 3-2
3.2.2 Specification of Test Cases ................................................. 3-3
3.3 Test Results ........................................................................... 3-9
3.3.1 Maximum and Average Times ............................................ 3-10
3.3.2 Frequency Histograms ..................................................... 3-23
3.3.2.1 Variance of the Detection Times ............................... 3-24
3.3.2.2 Variance of the Reconfiguration Times ....................... 3-24
3.3.2.2.1 Reconfiguration Variance-Second Reconfiguration Attempts ........ 3-25
3.3.2.2.2 Reconfiguration Variance-Presumed Reconnection ........ 3-25
3.3.2.2.3 Reconfiguration Variance-Inconclusive Analysis ........ 3-26
3.3.2.2.4 Reconfiguration Variance-Simple and Complex Error Symptoms ........ 3-26
3.3.3 Probability and Cumulative Density Functions ......................... 3-74
3.4 I/O Network Fault Injection Conclusions ........................................ 3-75
4.0 APPLYING FAULTS TO THE CORE FTP ............................................ 4-1
4.1 Overview of AIPS FTP .............................................................. 4-1
4.2 Specification of Core FTP Faults ................................................... 4-3
4.2.1 Specification of Test Cases for Software-Injected Memory Faults .... 4-3
4.2.1.1 Fast FDIR ......................................................... 4-4
4.2.1.2 Watchdog Timer Reset .......................................... 4-5
4.2.1.3 Background Self Test ............................................ 4-5
4.2.1.4 Hardware Exception Handler ................................... 4-6
4.2.2 Specification of Test Cases for Hardware-Injected Faults .............. 4-6
4.2.2.1 Transient FDIR ................................................... 4-6
4.2.2.2 Lost Soul Sync ................................................... 4-8
4.2.2.3 System Restart .................................................... 4-9
4.2.3 Reconfiguration .............................................................. 4-9
4.3 Test Results ........................................................................... 4-9
4.3.1 The Software Fault Injection Plan ......................................... 4-9
4.3.2 The Hardware Fault Injection Plan ....................................... 4-18
4.4 Core FTP Fault Injection Conclusions ............................................ 4-21
4.4.1 Software Fault Injection Test Results .................................... 4-21
4.4.1.1 Maximum and Average Times ................................. 4-21
4.4.1.2 Probability and Cumulative Density Functions .............. 4-32
4.4.2 Hardware Fault Injection Test Results ................................... 4-36
4.4.3 Design Flaws Uncovered by the Fault Injection Tests ................. 4-37
4.5 Core FTP Fault Injection: Conclusions ........................................... 4-38
5.0 CONCLUSIONS ........................................................................... 5-1
6.0 REFERENCES .............................................................................. 6-1
LIST OF ILLUSTRATIONS
Figure Title Page
1-1 Hardware Configurations .......... 1-5
1-2 Software Configurations .......... 1-6
1-3 Fault Injection Configuration .......... 1-7
1-4 Fault Injection Experimental Setup .......... 1-11
2-1 Experimental Setup-Software Fault Injection .......... 2-3
2-2 Fault Injector Logical Organization .......... 2-6
2-3 Insertion of FETs between Socket and Device .......... 2-7
2-4 Fault Injector Implant .......... 2-8
2-5 Fault Injector Hardware .......... 2-9
2-6 Fault Description Word .......... 2-11
2-7 Mux A, B, C Selection Word .......... 2-12
2-8 Boolean Function Generator Data Word .......... 2-13
2-9 The FIS Main Menu .......... 2-16
2-10 The FIS Edit Faults Menu .......... 2-17
2-11 The Fault Direction Options .......... 2-17
2-12 The Fault Type Options .......... 2-18
2-13 The FIS Mux Signals Menu .......... 2-19
2-14 The FIS Boolean Function Menu .......... 2-19
2-15 The Boolean Function Options .......... 2-20
3-1 15 Node I/O Network-No Fault Configuration .......... 3-2
3-2 Test A.1.a .......... 3-27
3-3 Test A.1.b .......... 3-28
3-4 Test A.1.c .......... 3-29
3-5 Test A.2.a .......... 3-30
3-6 Test A.2.b .......... 3-31
3-7 Test A.2.c .......... 3-32
3-8 Test A.3.a .......... 3-33
3-9 Test A.3.b .......... 3-34
3-10 Test A.3.c .......... 3-35
3-11 Test B.1.a .......... 3-36
3-12 Test B.1.b .......... 3-37
3-13 Test B.1.c .......... 3-38
3-14 Test B.2.a .......... 3-39
3-15 Test B.2.b .......... 3-40
3-16 Test B.2.c .......... 3-41
3-17 Test B.3.a .......... 3-42
3-18 Test B.3.b .......... 3-43
3-19 Test B.3.c .......... 3-44
3-20 Test B.4.a .......... 3-45
3-21 Test C.1.a .......... 3-46
3-22 Test C.1.b .......... 3-47
3-23 Test C.1.c .......... 3-48
3-24 Test C.1.d .......... 3-49
3-25 Test C.2.a .......... 3-50
3-26 Test C.2.b .......... 3-51
3-27 Test C.2.c .......... 3-52
3-28 Test C.3.a .......... 3-53
3-29 Test C.3.b .......... 3-54
3-30 Test D.1.a .......... 3-55
3-31 Test D.1.b .......... 3-56
3-32 Test D.1.c .......... 3-57
3-33 Test D.1.d .......... 3-58
3-34 Test D.2.a .......... 3-59
3-35 Test D.2.b .......... 3-60
3-36 Test D.2.c .......... 3-61
3-37 Test D.3.a .......... 3-62
3-38 Test D.3.b .......... 3-63
3-39 Test D.4.a .......... 3-64
3-40 Test D.4.b .......... 3-65
3-41 Test D.5.a .......... 3-66
3-42 Test D.5.b .......... 3-67
3-43 Test D.5.c .......... 3-68
3-44 Test D.5.d .......... 3-69
3-45 Test E.1.a .......... 3-70
3-46 Test F.1.a .......... 3-71
3-47 Test F.2.a .......... 3-72
3-48 Test F.3.a .......... 3-73
3-49 The Probability Density Function for the Detection Times .......... 3-74
3-50 The Cumulative Density Function for the Detection Times .......... 3-75
3-51 The Probability Density Function for the Reconfiguration Times .......... 3-76
3-52 The Cumulative Density Function for the Reconfiguration Times .......... 3-76
4-1 Fault Tolerant Processor: Functional View (One Channel) .......... 4-2
4-2 Data Exchange Network .......... 4-18
4-3 Fault Tolerant Clock Network .......... 4-19
4-4 The Probability Density Function for the Detection Times .......... 4-33
4-5 The Probability Density Function for the Detection Times: Expansion of the 0 to 10,000 ms. Region .......... 4-34
4-6 The Cumulative Density Function for the Detection Times .......... 4-34
4-7 The Probability Density Function for the Reconfiguration Times .......... 4-35
4-8 The Probability Density Function for the Reconfiguration Times: Expansion of Range 0 to 2400 ms .......... 4-35
4-9 The Cumulative Density Function for the Reconfiguration Times .......... 4-36
LIST OF TABLES
Table Title Page
2-1 Fault Injector Address Space .......... 2-10
2-2 Fault Type Selection .......... 2-12
2-3 Mux A, B, C Source Selection .......... 2-13
2-4 Boolean Functions of Two Variables .......... 2-14
2-5 Fault Direction Control .......... 2-15
3-1 Victim Node Components-Schematic Location, Pin Number, and Fault Logic Level .......... 3-4
3-2 Victim Port Components-Schematic Location, Pin Number, and Fault Logic Level .......... 3-5
4-1 Software Fault Injection Plan .......... 4-17
4-2 Hardware Fault Injection Plan .......... 4-20
4-3 Software Fault Injection Results .......... 4-26
4-4 Hardware Fault Injection Results .......... 4-37
ADVANCED INFORMATION PROCESSING SYSTEM:
FAULT INJECTION STUDY AND RESULTS
1.0 APPROACH TO FAULT INJECTION
1.1 Introduction
The overall objective of the Advanced Information Processing System (AIPS) program is
to achieve validated fault tolerant distributed computer system architectures suitable for a
broad range of applications, including those which have a failure probability requirement as
low as 10^-9 at 10 hours. As a part of this process, an AIPS knowledgebase has been
developed. Various domains of the AIPS knowledgebase and a design-for-validation
methodology that uses the knowledgebase to synthesize computer system architectures are
described in [1]. The present report focuses on the fault injection study and its results,
which are a component of the performability knowledgebase.
To configure AIPS building blocks to meet specific application requirements, it is
necessary to characterize performability, i.e., performance and reliability, of building
blocks and of ensembles of building blocks as a function of fundamental architectural
parameters. The performability knowledgebase would eventually consist of all quantifiable
knowledge about the architecture that affects its performability. It is organized as analytical
and empirical relationships between three major domains: performance metrics, reliability
metrics and architectural parameters. The metrics and the AIPS architectural parameters are
described in Section 5 of [1], which also discusses the empirical relationships between
these three domains using the results obtained on the AIPS engineering model.
The requirement of extremely low system failure rates for the AIPS applications (typically
10^-6 to 10^-10 per hour) precludes computer reliability validation exclusively by any single
technique, tool, or approach. A balanced validation plan that uses analytical models,
formal proofs, empirical test and evaluation, and architectural attributes that enhance the
"validatability" of the system [1] can be cost effective and feasible in achieving validated
fault tolerant computer system architectures. The performability knowledgebase is an
important part of the balanced approach to validation. A set of analytical models has been
developed to characterize reliability and availability of the AIPS hardware building blocks.
To appreciate the role of empirical evaluation in constructing the performability
knowledgebase, we quote from [2] as follows:
"Design-for-validation concept consists of ... 1. The system is designed in such a manner
that a complete and accurate reliability model can be constructed. All parameters of the
model which cannot be deduced from the logical design must be measured. All such
parameters must be measurable within a feasible amount of time."
The design of the AIPS building blocks has adhered to this precept of the "design for
validation" methodology. For example, by complying with all the known theoretical
requirements for Byzantine resilience, the reliability of the AIPS Fault Tolerant Processor
or the Inter-Computer communications network can be modeled analytically with just a few
parameters. It is not necessary to exhaustively enumerate failure modes and show that each
mode is covered with the requisite probability [6]. The analytical models are discussed in
Section 4 of [1] in the context of the Advanced Launch System mission requirements. The
models are, however, general enough so that by changing a few parameters one can predict
the reliability and availability for other mission scenarios also. The reliability models use
three types of parameters: component failure rates, fault response times (detection and
reconfiguration times), and fault coverages (detection and reconfiguration coverages). The
component failure rates are estimated using the MIL-HDBK-217E. The other parameters,
however, must be deduced from the design or measured experimentally.
Since engineering models of the AIPS building blocks have been fabricated, it is feasible to
measure system response to faults and measure some of the parameters experimentally.
The analytical modeling and empirical characterization of the AIPS building blocks
complement each other. Analytical models are abstractions of physical reality. Test and
evaluation on the engineering model can help verify model assumptions, determine
unknown parameters and increase overall confidence, and hence claims of validation, in the
system.
Apart from gathering data for reliability parameter estimation, fault injection plays another
important role in the overall system validation. Fault injection can be used to obtain
feedback for fault removal from the design implementation. Again, the role of fault
injection in finding and fixing design errors should be kept in the proper perspective. One
cannot rely solely on fault injection to uncover design, specification and implementation
errors. Fault injection is not a substitute for the design-for-validation methodology.
However, it is a component of the methodology just as specifications, design reviews,
analytical models, and formal methods are.
If the fault injection process does not uncover a single flaw in the system under test, it does
not imply that there are no flaws in the system, only that the system is correct with respect
to the fault set to which it was subjected. But what if some design flaws are uncovered?
Does that mean the exercise was useless? On the contrary, one benefit of the fault injection
technique is that it uncovers shortcomings in the system. One gains a deeper understanding
of the fault tolerance design, a more fundamental appreciation of the cascade of events
triggered by a fault, including complex interactions between hardware and software
elements and the timing relationships between various events.
With the above discussion in mind, the goals of this fault injection study, as stated in the
statement of work, were as follows:
1. To test the system design specification for fault tolerance.
2. To obtain feedback for fault removal from the design implementation.
3. To obtain statistical data regarding fault detection, isolation, and reconfiguration
responses.
4. To obtain data regarding the effects of faults on system performance.
The organization of this report is as follows. The remainder of this section describes the
parameters that must be varied to create a comprehensive set of fault injection tests. The
subset of test cases selected for this study, the test case measurements, and the test case
execution are also described in Section 1. Both pin-level hardware faults using a hardware
fault injector and software-injected memory mutations were used to test the system.
Section 2 provides an overview of the hardware fault injector and the associated software
used to carry out the experiments. Sections 3 and 4 give detailed specifications of faults
and test results for the I/O Network and the AIPS Fault Tolerant Processor, respectively.
Section 5 summarizes the results and gives conclusions of the study.
1.2 Test Case Specification
This section explains how the test cases used in the AIPS Fault Injection Study were
derived. Section 1.2.1 describes the major parameters that could be varied in order to
create a comprehensive set of tests. Section 1.2.2 describes how the parameters actually
were varied in order to select from all the possible tests a limited subset that was executable
within the time and financial constraints of the Fault Injection Study.
1.2.1 All Possible Test Cases
Any given fault that is injected occurs within a particular context. This context consists of
the hardware environment, the software environment, and the fault injection environment.
Thus there are four major parameters that can be varied when creating test cases:
Hardware environment
Software environment
Fault injection environment
The actual fault
Hardware Environment
The hardware environment consists of the hardware building blocks that make up the
system during the particular test. These hardware building blocks may be arranged in
varying configurations. The parameters that describe the possible configurations are:
Redundancy level of the victim FTP
I/O network configuration
IC network configuration
A graphical representation of all combinations of these parameters is shown in Figure 1-1.
For clarity, the variations of the parameters under the top level are shown only once.
Software Environment
Similarly, the software environment consists of the software building blocks that compose
the system during the particular test. These software building blocks may be arranged in
varying configurations. The parameters that describe the possible configurations are:
Combinations of system functions and applications, and support functions
Iteration rates of system functions
Number of application tasks
Computational loads of application tasks
I/O requirements of application tasks
IC requirements of application tasks
A graphical representation of all combinations of these parameters is shown in Figure 1-2.
In this case each parameter could take on multiple values, but the figure arbitrarily shows
only two. For clarity, the variations of the parameters under the top level are shown only
once.
Fault Injection Environment
The fault injection environment consists of attributes of the faults to be injected. These
attributes may be varied and arranged in different configurations. The attributes are:
Number of FTPs monitoring fault injection
Number of FCRs affected
Number of simultaneous faults
Duration of fault
Random placement vs. selected placement
Hardware injection vs. software injection
Scope of fault (i.e., FCR, Board, Chip, Pin)
A graphical representation of all combinations of these parameters is shown in Figure 1-3.
For clarity, the variations of the parameters under the top level are shown only once.
Figure 1-1. Hardware Configurations
Figure 1-2. Software Configurations
Figure 1-3. Fault Injection Configuration
The Actual Fault
Finally, each individual fault to be injected was determined by examining the possible
places at which to inject a fault in each particular fault region (i.e., FCR, board, chip,
pin). The individual faults are discussed in detail in Sections 3 and 4 of this report.
To reiterate, a test case consists of an actual fault within a fault injection
configuration within a software configuration within a hardware configuration.
The total number of possible test cases is the product of the number of hardware
configurations, software configurations, fault injection configurations, and faults, i.e.,
NTot = NHW * NSW * NFI * NF
Each test case may be specified using the assigned numbers from Figures 1-1 through 1-3.
An example test case would be:
H1/2/3 - S1/2 - FI1/2/3/4/5/6/10 - F1
The hardware configuration for this test case consists of a triplex FTP as the victim (H1), a
15-node I/O network (H2), and a 3-layer IC network (H3). The software configuration
includes only Local System Services (S1) and has the FTP RM executing at Rate 1. The
fault injection attributes include fault monitoring on a single FTP (FI1), only 1 FCR
affected by the injection (FI2), only 1 fault injected at a time (FI3), that fault being a hard
fault (FI4) selected for its ability to create a known effect (FI5). The fault is injected by
means of the hardware fault injector (FI6) and is applied at the pin level (FI10). The actual
fault injected is Fault #1 (F1).
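The notation above can be read mechanically. The following is a minimal sketch, assuming the four fields are always separated by dashes and that each field is a letter prefix followed by slash-separated option numbers. The names `CaseSpec` and `parse_case` are hypothetical and are not part of the Fault Injection Software described in this report.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CaseSpec:
    """One test case: option numbers from Figures 1-1 to 1-3 plus a fault number."""
    hardware: Tuple[int, ...]
    software: Tuple[int, ...]
    fault_injection: Tuple[int, ...]
    fault: int


def _option_numbers(field: str, prefix: str) -> Tuple[int, ...]:
    """Strip the letter prefix and return the slash-separated option numbers."""
    if not field.startswith(prefix):
        raise ValueError(f"expected field starting with {prefix!r}, got {field!r}")
    return tuple(int(n) for n in field[len(prefix):].split("/"))


def parse_case(spec: str) -> CaseSpec:
    """Parse a string such as 'H1/2/3 - S1/2 - FI1/2/3/4/5/6/10 - F1'."""
    fields = [f.strip() for f in spec.split("-")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 dash-separated fields, got {len(fields)}")
    hw, sw, fi, fault = fields
    fault_numbers = _option_numbers(fault, "F")
    if len(fault_numbers) != 1:
        raise ValueError("exactly one fault number is expected")
    return CaseSpec(
        hardware=_option_numbers(hw, "H"),
        software=_option_numbers(sw, "S"),
        fault_injection=_option_numbers(fi, "FI"),
        fault=fault_numbers[0],
    )


case = parse_case("H1/2/3 - S1/2 - FI1/2/3/4/5/6/10 - F1")
print(case)
```

The fields are identified by position, so the shared leading letter of the FI and F prefixes causes no ambiguity.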
1.2.2 Subset of Actual Test Cases
If all possible variations of the hardware environment, software environment and fault
injection environment shown in Figures 1-1 through 1-3 were exercised with only one fault
each, this alone would represent approximately 594,000 test cases (8 hardware
configurations, 290 software configurations, 256 fault injection configurations). In order
to limit the test cases to a number consistent with the time and financial constraints of this
study, one hardware configuration, one software configuration, and two fault injection
configurations were chosen. The number of test cases, then, came from the different faults
that were injected within these two environments.
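The test case count quoted above follows from the product formula of Section 1.2.1. The sketch below only checks the arithmetic. The dictionary keys and the function name `total_test_cases` are illustrative assumptions; the three counts are the ones given in the text.

```python
from math import prod

# Configuration counts quoted in the text for Figures 1-1 through 1-3.
CONFIG_COUNTS = {"hardware": 8, "software": 290, "fault_injection": 256}


def total_test_cases(config_counts, faults_per_configuration=1):
    """NTot = NHW * NSW * NFI * NF, with NF defaulting to one fault each."""
    return prod(config_counts.values()) * faults_per_configuration


# All combinations with a single fault each: 593,920, i.e. roughly 594,000.
print(total_test_cases(CONFIG_COUNTS))

# The subset chosen for the study: 1 hardware, 1 software, and 2 fault
# injection configurations, which gives 2 environments.
print(total_test_cases({"hardware": 1, "software": 1, "fault_injection": 2}))
```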
Using the notation given above, the two environments may be specified as
(1) H1/2/3 - S1/67/68/69/70/71/72/73/74/75 - FI1/2/3/4/5/6/10
and
(2) H1/2/3 - S1/67/68/69/70/71/72/73/74/75 - FI1/2/3/4/5/11/14
The hardware configuration in both cases consisted of a triplex FTP as the victim (H1), a
15-node I/O network (H2) and a 3-layer IC network (H3). The software configuration in
both cases included Local System Services, I/O System Services, and IC Communication
Services (S67), FTP RM at 40 Hz (S68), I/O RM every 2 seconds (S69), no I/O
applications (S70, S71, S72), and two IC applications (S73) doing minimal computation
(S74) and moderate IC communication (S75). The fault injection configurations included
fault monitoring on a single FTP (FI1), only 1 FCR affected by the injection (FI2), only 1
fault injected at a time (FI3), that fault being a hard fault (FI4) selected for its ability to
create a known effect (FI5). Faults were injected either at the pin level by the hardware
fault injector (FI6/FI10) or at the chip level by software fault injection (FI11/FI14).
The actual faults that were injected are discussed in detail in Sections 3 and 4.
It should be noted that as an ongoing part of AIPS testing and debugging efforts, faults
have previously been injected at the FCR and card level. In the case of the core FTP, a
channel failure could be simulated by resetting either one processor or the entire channel.
Data exchange faults were simulated by use of specially constructed switches that grounded
power to pertinent parts of the data exchange hardware. In the case of the I/O network,
injected faults included resetting a node, unplugging a node card and unplugging links
between two nodes. These faults were inserted manually and the collection of fault
detection and reconfiguration times was not automated, so that collecting statistics about a
large volume of faults would have been very tedious and time-consuming.
1.3 Test Case Measurements
To achieve the goals stated in Section 1.1, certain information must be collected for
each test case.
Goal 1: to test the system design specification for fault tolerance.
This means it must be determined that an injected fault was actually tolerated, i.e., that the
system continued to perform correctly in the presence of the fault. This requires a record of
system performance before the fault is injected, which can then be compared to the record
of system performance during and after fault detection. To obtain this record of system
performance, all of the tasks executing in the system for an appropriate time period around
the fault injection must be tracked. In addition to the clock time at which each task
suspends and resumes, this record must give some indication of what the task was doing
so as to confirm that it is executing normally and is not in some erroneous state. In the
absence of tools that provide such a record, externally visible manifestations must be relied
on, such as correct functioning of the CRT and MAC displays, continued correct
configuration of I/O nodes, and continued communication between IC applications.
Goal 2: to obtain feedback for fault removal from the design implementation. To do this
the following questions must be answered for each test case.
Was the fault detected?
Was the fault isolated?
Was it correctly isolated?
Did reconfiguration occur?
Was the reconfiguration correct?
Goal 3: to obtain statistical data regarding fault detection, isolation,
and reconfiguration responses. This may be achieved by recording the times when
each of the three events (detection, isolation, reconfiguration) occurs.
Goal 4: to obtain data regarding the effects of faults on system
performance. The information required to answer this question is similar to that
required for Goal 1. A record of system performance before and after fault detection is
required to determine effects of a fault on performance. Short-term effects, i.e., the effect
of extra fault detection and identification overhead on system throughput must be
measured, as well as long-term effects, e.g., extra redundancy management time required
for 2-layer IC messages rather than 3-layer messages.
1.4 Test Case Execution
1.4.1 Experimental Setup
The experimental setup for injecting faults into the AIPS Distributed Engineering Model
and measuring their effects is shown in Figure 1-4. The Engineering Model is shown on
the right and consists of four Fault Tolerant Processors (FTPs), a fault-tolerant Inter-
Computer (IC) Communication Network, and a fault-tolerant Input/Output (I/O) Network.
On the left is the Fault Injection Software (FIS) that resides on a MicroVAX 3900
computer. This software controls the number and type of faults and the time of their
insertion. It also collects time information recorded by the FTP so that it can compute
performance measurements such as fault detection and reconfiguration times.
To create hardware-injected faults, the FIS communicates with a Fault Injector device by
means of a VAX Qbus interface. The Fault Injector interfaces with the Engineering Model
in such a way that a signal from the VAX will cause a fault to occur at any desired pin in
the Engineering Model. To create software-injected faults, the FIS communicates with the
FTP through a shared memory referred to as the Testport Interface. The only type of fault
injected by software is an emulated transient memory fault. The FIS also uses the Testport
Interface to retrieve the timing data recorded by the FTP.
The Fault Injector and the Fault Injection Software are discussed in detail in Section 2.
[Block diagram: the Fault Injection Software on the VAX 3900; the Fault Injector, reached
through the Qbus Interface and connected to the AIPS victim device; the Testport Interface;
and the AIPS Distributed Engineering Model (simplified view) with its FTPs, I/O Network,
and Inter-Computer Network.]
Figure 1-4. Fault Injection Experimental Setup
1.4.2 Obtaining Test Case Measurements
The test case measurements described in Section 1.3 were gathered by the Fault Injection
Software and by visual inspection. This section explains how each type of measurement
was obtained.
One step in testing the system design specification for fault tolerance (Goal 1)
was to capture the record of system performance. This process was neither automated nor
incorporated within the Fault Injection Software for this study. Instead we relied solely on
visual observation of the CRT and MAC displays and configuration of the I/O network to
determine that the system functioned correctly during and after the fault. Since the display
tasks execute at the lowest priority, this ensured that no higher priority task was
monopolizing the system as a result of the fault.
To determine the correctness and completeness of the fault detection and
identification (Goal 2) for each test case, two methods were used. One was visual
inspection of the error logs that are maintained by the core FTP RM and the I/O network
RM processes. These logs indicated whether a particular fault was isolated and, if so,
whether it was isolated correctly. The other method involved setting breakpoints in the
FTP code at the fault isolation and reconfiguration points and having the FIS verify that
these breakpoints were reached. In addition, the FIS verified that the system had recovered
from each fault before injecting the next one.
Statistical data about the fault detection and identification (Goal 3) was
obtained by having the FIS note the time at which the isolation and reconfiguration
breakpoints were reached. In addition, the FTP code logged the time the fault was detected
in the Testport Interface.
The effects of faults on system performance (Goal 4) include both (1) the
additional time required by the particular FDIR process when it is dealing with a fault and
the subsequent scheduling delays incurred by other tasks, and (2) the effects on users of
the faulty component, e.g., an application task that receives erroneous input or cannot
transmit output because an I/O fault has not yet been detected and/or reconfigured around.
The additional time required by the FDIR processes has been measured at other times
during the life of the AIPS project and documented in a previous report [1]. Additional
measurements were also obtained in the I/O network portion of this Fault Injection Study
but not in the core FTP portion.
The scheduling delays incurred by non-FDIR tasks and the effects of a fault on users of the
faulty component require the same instrumentation as for Goal 1. This instrumentation was
not in place at the time of this study, and this data was not collected.
2.0 FAULT INJECTION ENVIRONMENT
2.1 Fault Injection Experiment Setup
2.1.1 Hardware Fault Injection
To inject hardware faults into the AIPS Distributed Engineering Model, a device called the
Fault Injector (FI) was designed and built at the Draper Laboratory [3]. The fault injector
interfaces with the Engineering Model on one end and with the VAX 3900 Qbus on the
other end. The number and type of faults and the time of their insertion are controlled by
the Fault Injection Software (FIS) that is resident on the VAX computer. The VAX and the
Engineering Model are linked together by a shared memory interface that was designed by
Draper Laboratory (referred to as the Testport Interface). This interface is used by the Fault
Injection Software to communicate with the AIPS system software executing on the Model.
As a result, the experimental setup is a closed loop system in which the executor, the FIS,
and the victim device on the Distributed Engineering Model are in constant touch with each
other. This setup, as shall be seen later, makes it possible to automate the fault insertion
process and to collect data that otherwise would not be possible to acquire.
Figure 1-4 is a block diagram of the experimental setup. The victim system is shown on
the right and is composed of four Fault Tolerant Processors (FTPs), a fault-tolerant Inter-
Computer (IC) Communication Network, and a fault-tolerant Input/Output (I/O) Network.
Each of these building blocks was built with wire-wrapped circuit boards and open paneled
chassis. Consequently, their electronic circuitry can be accessed by the Fault Injector's
implants, described in the following paragraph, in a relatively easy manner.
Faults are normally injected one pin at a time. To insert faults, controllable DIP extenders
or implants (part of the fault injector) are plugged into the DIP socket. Each implant
accepts the DIP pins it replaced and contains circuitry which can interrupt (or reconnect)
each DIP pin and each incidental signal line from their socket. Six implants, each of which
handles 8 DIP pins, are provided. Thus, up to 48 pins on one DIP or on a combination of
DIPs may be set up for fault injection at a given time.
The 48 implant pins of the fault injector are individually addressable by the VAX 3900.
Each pin appears as a Qbus address to the Fault Injection Software (FIS). The type of fault
to be produced at any pin is controlled by writing appropriate data to the Qbus address
corresponding to the pin. Once a fault or a set of faults has been defined, they can be
"enabled", that is, inserted into the victim by writing to another Qbus address. The fault
injector hardware listens to this address space, decodes the data, and produces the fault that
is requested. It also enables or clears the fault when appropriate data is written to the
enable/clear address.
It is possible to produce signals other than simply the stuck-at class of faults. Faults that
are boolean functions of signals on other pins can be generated. This can be used to
simulate faults which are rather unlikely but which have been known to happen. For
example, it is possible to turn a NAND gate into a NOR gate. Nonetheless, the main utility
of the Fault Injector lies in its ability to inject faults into tristate signals. For instance, the
data pins of a random access memory (RAM) have signals that are either inputs to or output
from the memory depending on whether memory is being written to or read from. To
inject a fault into such a device pin, the direction of the fault signal should be correct in
order to avoid any possible damage to the device. Such a signal can be produced by
generating the fault as a function of other signals on the device that determine the direction
of the data such as read/write and chip enable signals on the RAM DIP.
The Fault Injection Software has been written to facilitate automatic fault injection by
providing commands that are used to define the victim device, map its pins into implant
pins, specify the type of fault for each pin, and enable and clear faults. The FIS can
execute a series of such commands, making it possible to go through a number of faults
automatically once the victim device has been physically moved to the implants. Another
condition necessary for automatic fault injection is some form of communication between
the FIS and the AIPS system software to indicate whether the Engineering Model is ready
to accept a new fault. Such messages between the FIS and the Model are exchanged using
the Testport Interface. A modified version of the AIPS system software is responsible for
communicating the timing metrics to the Testport's memory. The same data path is used in
reverse to send messages from the FIS to the Distributed Model.
The FIS in conjunction with the AIPS software is able to record two times: the time of error
detection and the time of system reconfiguration. The important metrics, however, are the
time necessary to detect the fault and the time required to reconfigure around it. Since the
time of fault injection is known to the FIS, the difference between the fault injection and
error detection times constitutes the time required to detect the fault. Strictly speaking, this
should be called the error detection time. However, we hypothesize that the error is caused
by a fault. Therefore, the detection of an error is also an indirect indication of the
occurrence of a fault. Since this is the earliest indication of a fault, we will call this the fault
detection time. Also, the difference between the fault detection and the system
reconfiguration times equals the time necessary to reconfigure around the fault.
To communicate the fault detection time to the FIS, the AIPS system software reads the
AIPS real-time clock after the fault was detected and stores the time in the Testport
Interface. To indicate the reconfiguration time, the AIPS system software suspends
execution when it reaches a break-point at the end of the reconfiguration process. A
program, AIPSDEBUG, stores the real-time clock value in the testport interface. When the
FIS notices that the FTP is at the breakpoint, it reads the fault detection time and the
reconfiguration time from the Testport Interface. Subsequently, the FIS calculates the
times required for fault detection and reconfiguration. After these times have been
determined and recorded for later analysis, execution of the AIPS system software is
resumed to permit it to return to a state such that a new fault can be applied.
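The two metrics described above are simple differences between three timestamps. The following is a minimal sketch of that arithmetic, assuming all three times are expressed on one common time base; the class name, field names, units, and sample values are hypothetical and are not taken from the FIS or from any measured data.

```python
from dataclasses import dataclass


@dataclass
class FaultTimes:
    """Timestamps for one injected fault (hypothetical units, one common time base)."""
    injected: int      # known to the FIS when it enables the fault
    detected: int      # stored by the AIPS software in the Testport Interface
    reconfigured: int  # stored when the reconfiguration breakpoint is reached


def detection_time(t: FaultTimes) -> int:
    """Time required to detect the fault: detection minus injection."""
    return t.detected - t.injected


def reconfiguration_time(t: FaultTimes) -> int:
    """Time required to reconfigure around the fault: reconfiguration minus detection."""
    return t.reconfigured - t.detected


# Illustrative values only, not results from the study.
sample = FaultTimes(injected=1_000, detected=1_450, reconfigured=9_450)
print(detection_time(sample), reconfiguration_time(sample))
```

In the actual setup the injection time comes from the VAX side while the other two come from the AIPS real-time clock, so relating them to one time base is a step this sketch takes for granted.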
2.1.2 Software Fault Injection
To inject software faults into the AIPS Distributed Engineering Model, the Fault Injection
Software and Draper Laboratory's Testport Interface are utilized. The FIS inserts faults by
corrupting a channel's program and/or data memory. This is accomplished by sending
commands through the Interface. As with the hardware fault injection, the number and
type of faults and the time of their insertion are controlled by the FIS. After the designated
number of faults have been injected, the FIS employs the Testport Interface to retrieve the
fault detection and reconfiguration metrics.
[Block diagram: the VAX 3900 connected through the Testport Interface to the AIPS victim
device in the AIPS Distributed Engineering Model (simplified view), which comprises FTPs 1
through 4, the I/O Network, and the Inter-Computer Network.]
Figure 2-1. Experimental Setup - Software Fault Injection
A block diagram of the software fault injection setup is shown in Figure 2-1. As can be
seen, this is a subset of the system that is used for hardware fault injection.
The Fault Injection Software has been written to facilitate automatic injection of software
faults by providing commands that allow the user to define the memory region to be
corrupted, the value of the fault, and the FTP channel and processor in which to insert the
fault. As with hardware fault injection, the FIS can also execute a series of such
commands, making it possible to go through a number of faults automatically.
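As an illustration of the parameters listed above, the sketch below models a software fault as a record holding the channel, processor, memory region, and fault value, and applies it to a stand-in memory image. Everything here is hypothetical: the names, the byte-wide fault value, and the use of a Python bytearray in place of FTP memory are assumptions made for illustration and do not describe the actual FIS commands or the Testport Interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryFault:
    """Hypothetical description of one software-injected memory fault."""
    channel: str    # FTP channel to corrupt (label is illustrative)
    processor: str  # processor within the channel (label is illustrative)
    start: int      # first offset of the region to corrupt
    length: int     # number of bytes in the region
    value: int      # value written over the region (assumed byte-wide here)


def corrupt(memory: bytearray, fault: MemoryFault) -> bytes:
    """Overwrite the region with the fault value and return the original bytes."""
    end = fault.start + fault.length
    if fault.start < 0 or fault.length < 0 or end > len(memory):
        raise ValueError("fault region lies outside the memory image")
    original = bytes(memory[fault.start:end])
    memory[fault.start:end] = bytes([fault.value & 0xFF]) * fault.length
    return original


image = bytearray(range(8))  # stand-in for a channel's memory
saved = corrupt(image, MemoryFault("A", "CP", start=2, length=3, value=0xFF))
print(image.hex(), saved.hex())
```

Returning the overwritten bytes is a design choice of the sketch; it makes it easy to restore the image between test cases in a simulation.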
The Fault Injector Hardware and Software are described in more detail in Sections 2.2 and
2.3 respectively.
2.2 Fault Injector Hardware
The fault injector hardware used for this study is described in detail in Reference 3. For the
sake of completeness, a brief summary of the fault injector follows. A functional block
diagram of the fault injector is shown in Figure 2-2. The heart of the fault injector is a pair
of FETs that are interposed between the device pin and the socket pin. By turning on the
FETs, a direct connection is established between the device and the socket. This is the
normal situation when no fault is being injected on the pin. The device-socket connection
can be severed by turning off the FETs. Now any desired signal may be applied to the
device or the socket pin, whichever pin has the input signal (see Figure 2-3). A choice of
eight signals is provided, one of which may be selected by multiplexer M1 for the device
pin and by multiplexer M2 for the socket pin, as shown in Figure 2-2. A set of 48 FET
and mux pairs are provided, one pair for each victim pin. This allows one to extend up to
48 pins on one DIP or a combination of DIPs. The choice of faults for each circuit pin is as
follows.
1. Socket/Device Signal: This provides the original signal to the victim pin. That
                         is, no fault is injected.
2. Mux A:                This signal is the output of multiplexer A as shown in
                         Figure 2-3. The inputs to the multiplexer are the 48
                         signals from the 48 pins that can be extended with the
                         FETs. That is, a signal from any circuit pin or gate may
                         be used as the fault or input signal for the victim pin.
3. Mux B:                This multiplexer has the same function as Mux A.
4. Mux C:                This multiplexer has the same function as Mux A.
5. f(A,B):               This signal is the boolean function of two signals, the
                         outputs of multiplexers A and B. Any of sixteen
                         possible boolean functions may be specified.
6. f(A,B,C):             This is a boolean function of f(A,B) and the output of
                         multiplexer C. Any one of sixteen possible functions
                         may be specified.
7. Ground:               This provides the stuck-at-zero fault.
8. EXT:                  In addition to these seven selections, an externally
                         generated signal may be used as a fault.
Each of the above eight signals may also be inverted before being applied to the victim pin,
thus providing a choice of sixteen faults. The choice of faults thus includes stuck-at-one
and "complemented signal" type of faults.
Multiplexers A, B, and C and the boolean function generators provide an extremely
powerful capability to generate any type of fault. For example, certain faults in integrated
circuits can change a NAND gate into a NOR gate. It is possible with this fault injector to
simulate such a fault by extending all input and output pins of the target gate with FETs,
generating the required boolean function using inputs from the gate inputs, and replacing
the output with this signal. The main utility of this powerful capability, however, lies in
the ability to inject faults on tristate signal lines. The direction of the fault can be made a
function of other signals on the device, signals that determine the state of the tristate pin. It
is thus possible to inject faults into data pins of memory chips and other tristate devices.
The fault injector hardware is physically packaged as shown in Figures 2-4 and 2-5. The
FET pairs are mounted on an implant segment. Two sizes of implants are provided: 4 pin
extenders and 8 pin extenders. An 8 pin implant has 16 FETs mounted on it and can
extend one side of a 16 pin DIP. Dummy extenders that simply connect the socket and
device pins without going through a FET are also provided. These are used to extend those
device pins that are too sensitive to sustain the capacitance and/or time delay of an
intervening FET.
In any event, the FET implants are connected to multiplexer boards through a flat ribbon
cable. As mentioned earlier, the fault injector has the capability of extending 48 device
pins. The signals on each of these pins are controlled by a dedicated pair of multiplexers
M1 and M2 (see Figure 2-2). Thus there is a total of 48 pairs of muxes. These are
packaged on six multiplexer boards as shown in Figure 2-5. Each board controls 8 pins.
One 8 pin implant or two 4 pin implants may be connected to each board. The six boards,
labeled A, B, C, D, E, and F, are identical multiwire boards. Each contains one sixth of
the multiplexers MA, MB, and MC. That is, each of the three 48:1 muxes (A, B, and C) is
logically partitioned into six 8:1 muxes. Since a board handles 8 pins, a signal from one of
these eight pins can be selected through the 8:1 mux (A, B, or C) on that board. The
outputs of six logical parts of each 48:1 mux are OR'ed and distributed to all six circuit
cards via the backplane. All 48 signals are then made available to each board. Each board
also has its own copy of the three boolean function generators as shown in Figure 2-2.
Functions f(A,B), F(A,B,C), and S(A,B,C) can be produced on any board. These
signals, along with the outputs of muxes MA, MB, and MC form the inputs to the muxes
M1 and M2.
[Logical diagram: 48 units (6 implants with 8 on each implant), each with multiplexers M1
(to device pin) and M2 (to socket pin) selecting among the inputs Data, S(ABC), A, B, C,
f(AB), GND, and EXT under fault type and direction control (data fields D(0:3,15) and
D(4:7,15)); three 48:1 multiplexers, one sixth on each card; and 6 units of boolean function
generators, one on each card, producing f(AB), S(ABC), and f(ABC) (data fields D(4:7) and
D(8:11)).]
Figure 2-2. Fault Injector Logical Organization
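[Diagram: the FET implant sits between the device and the socket; the signal to/from the
package and the signal to/from the socket are joined by a direct connection through the
FETs, which are switched by a control line.]
Figure 2-3. Insertion of FETs between Socket and Device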
[Diagram: implant segments (each handles 8 pins) placed between the victim socket and the
victim device, with cables leading to the multiplexer boards.]
Figure 2-4. Fault Injector Implant
[Diagram: 6 implant segments connected to 6 multiplexer boards (each handles 8 pins), which
connect to the Control and Qbus Interface (double height wire-wrap board).]
Figure 2-5. Fault Injector Hardware
Last, but not least, is the selection and control of the FETs, multiplexers, boolean function
generators, and fault enabling and clearing logic. The fault injector has been designed such
that it can be addressed as a Qbus device by a VAX 3900 computer. The data written to the
Qbus address space of the fault injector is used to perform the selection and control
functions. As shown in Figure 2-5, the backplane of the multiplexer boards is connected
by flat-ribbon cables to a control and Qbus interface card. This is a double-height
wire-wrap board that can be plugged into the VAX Qbus. It has the standard Qbus protocol
and address decoding circuitry. The fault injector occupies the address space 3E980 to
3E9FF (hexadecimal). This address space is mapped as shown in Table 2-1.
Address (3E9xx)    Mux Board                  Pin
80-8E              A                          1-8
90-9E              B                          1-8
A0-AE              C                          1-8
B0-BE              D                          1-8
C0-CE              E                          1-8
D0-DE              F                          1-8
E0                 Mux A
E2                 Mux B
E4                 Mux C
E6                 Boolean Function Select
EA                 Execute/Clear Fault
E8 and EC-FF       Unused
Table 2-1. Fault Injector Address Space
Circuitry controlling signals on each of the 48 pins (muxes M1, M2, and FETs) is
addressed individually (addresses 3E980 to 3E9DE, hexadecimal). Data written to these addresses
selects one of the eight inputs to mux M1 or M2 and controls the point of fault insertion
(device or socket) by choosing mux M1 or M2. This is static operation. That is, the data
written to these addresses is latched in the fault injector. The type and direction of the fault
is thus determined, but the signal is not yet applied to the victim. To actually break the
device-socket connection and inject the fault signal, one must write to the Execute/Clear
address 3E9EA (hexadecimal). Writing a "0001" to this address enables the chosen multiplexer M1 or
M2 on the chosen pin. It also turns off the pair of FETs on that pin. Faults on all the
previously "enabled" pins are asserted simultaneously. The most significant bit of the fault
selection data word determines if the pin is enabled. Writing a "0002" to the Execute/Clear
address disables the muxes and turns on the FETs, thus clearing the fault condition.
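The addressing scheme can be sketched as follows. The base address, the Execute/Clear address, and the data values 0001 and 0002 come from the text and Table 2-1; the assumption that pin 1 of each board sits at the lowest address of that board's range, with successive pins two bytes apart, is an inference from the even address ranges and is not stated in the report. The function names are hypothetical, and because the fault injector is accessed over the Qbus, the sketch only builds the list of (address, data) writes rather than performing them.

```python
FI_BASE = 0x3E980        # first address of the fault injector address space
EXECUTE_CLEAR = 0x3E9EA  # Execute/Clear Fault address (Table 2-1)
EXECUTE = 0x0001         # asserts faults on all enabled pins
CLEAR = 0x0002           # disables the muxes and turns the FETs back on


def pin_address(board: str, pin: int) -> int:
    """Qbus address of one implant pin (assumes pin 1 is the lowest address)."""
    if len(board) != 1 or board not in "ABCDEF":
        raise ValueError("board must be one of A-F")
    if not 1 <= pin <= 8:
        raise ValueError("pin must be in 1-8")
    return FI_BASE + 0x10 * "ABCDEF".index(board) + 2 * (pin - 1)


def injection_writes(fault_words: dict) -> list:
    """Ordered (address, data) writes: latch each fault word, then execute."""
    writes = [(pin_address(b, p), w) for (b, p), w in fault_words.items()]
    writes.append((EXECUTE_CLEAR, EXECUTE))
    return writes


for address, data in injection_writes({("B", 3): 0x988E}):
    print(f"{address:05X} <- {data:04X}")
```

The fault word 0x988E is used here only as an opaque sample value. A matching clear step would be the single write (EXECUTE_CLEAR, CLEAR).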
Bit:      15        14   13   12     11-8    7-4    3-0
Field:    EN/DIS    0    0    1      X       Y      Z
Figure 2-6. Fault Description Word
Bits 0-3 of the fault description word select the type of fault going to the device pin, bits 4-
7 select the fault going to the socket, and bits 8-11 determine the direction of the fault (to
device or to socket) as shown in Figure 2-6. Bit 15 enables and disables the pin. A pin
must be enabled before a fault defined on it can be asserted. Bit 15 must be 1 for the pin to
be enabled. Bits 12, 13, and 14 should always be as shown in Figure 2-6.
If the fault direction is 0, the fault as determined by the data bits Y is sent to the socket pin
and data bits Z are ignored. If X is 8, then the fault as determined by Z is sent to the device
pin and Y is ignored. In addition to 0 and 8, there are fourteen other values that can be
assigned to X. Fault direction selected for these values of X is explained later in this
Section (illustrated in Table 2-5).
Y and Z select the fault signal as shown in Table 2-2. If Y/Z is 8, the original signal is
passed through the multiplexer unchanged. Stuck-at-1 and 0 faults can be generated by a
value of 6 and "E", respectively. The signal can be inverted if Y/Z is 0. Other more
complex faults can be chosen as outputs of multiplexers A, B, C, or a boolean function of
their inputs (Y/Z = 1 to 5, 9 to "D").
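A fault description word can be assembled from the fields described above. The sketch below packs bit 15 (enable), the direction field X in bits 8-11, the socket fault Y in bits 4-7, and the device fault Z in bits 0-3, as stated in the text. The fixed pattern for bits 14, 13, and 12 is taken to be 0, 0, 1; that reading of Figure 2-6 is an assumption. The function name is hypothetical.

```python
def fault_word(x: int, y: int, z: int, enable: bool = True) -> int:
    """Pack the fault description word.

    x: fault direction (bits 8-11); 0 = to socket, 8 = to device
    y: fault type sent to the socket pin (bits 4-7)
    z: fault type sent to the device pin (bits 0-3)
    """
    for name, value in (("x", x), ("y", y), ("z", z)):
        if not 0 <= value <= 0xF:
            raise ValueError(f"{name} must fit in four bits")
    fixed = 0b001 << 12  # bits 14-12, assumed to read 0 0 1 in Figure 2-6
    return (int(enable) << 15) | fixed | (x << 8) | (y << 4) | z


# Stuck-at-0 applied to the device pin: X = 8, Z = E (Y is ignored; 8 used here).
print(f"{fault_word(0x8, 0x8, 0xE):04X}")
# Stuck-at-1 applied to the socket pin: X = 0, Y = 6 (Z is ignored; 8 used here).
print(f"{fault_word(0x0, 0x6, 0x8):04X}")
```

Filling the ignored field with 8 (original signal) is a conservative choice made in this sketch, not a requirement stated in the report.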
If multiplexer A, B, or C output is either used directly as a fault or as input to a boolean
function generator, it is necessary to select the multiplexer source. This is done by writing
to the Qbus address of the multiplexer. Data written to the multiplexer address is
interpreted as shown in Figure 2-7. Bits 3-5 select one of six boards A to F. Bits 0-2
select one of eight pins on that board as the mux output. These are shown in Table 2-3.
Y/Z    Fault Signal
0      Inverted Signal
1      F(A,B,C)
2      A
3      B
4      C
5      F(A,B)
6      1
7      EXT
8      Original Signal
9      / F(A,B,C)
A      /A
B      /B
C      /C
D      / F(A,B)
E      0
F      / EXT
Table 2-2. Fault Type Selection
Bits:     15-6        5-3      2-0
Field:    Not Used    Board    Pin
Figure 2-7. Mux A, B, C Selection Word
Data Bits 3 4 5    Board Selected        Bits 0 1 2    Pin Selected
1                  A                     0             1
2                  B                     1             2
3                  C                     2             3
4                  D                     3             4
5                  E                     4             5
6                  F                     5             6
                                         6             7
                                         7             8
Table 2-3. Mux A, B, C Source Selection
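The multiplexer selection word places the board code in bits 3-5 and the pin code in bits 0-2. The sketch below assumes that boards A through F are encoded as 1 through 6 and pins 1 through 8 as 0 through 7; that encoding is a reading of Table 2-3, whose layout is damaged in this copy, and should be checked against Reference 3 before being relied upon. The multiplexer addresses are those listed in Table 2-1, and the function name is hypothetical.

```python
# Qbus addresses of the three source multiplexers (Table 2-1).
MUX_ADDRESS = {"A": 0x3E9E0, "B": 0x3E9E2, "C": 0x3E9E4}


def mux_select_word(board: str, pin: int) -> int:
    """Selection word routing one implant pin to a multiplexer output."""
    if len(board) != 1 or board not in "ABCDEF":
        raise ValueError("board must be one of A-F")
    if not 1 <= pin <= 8:
        raise ValueError("pin must be in 1-8")
    board_code = "ABCDEF".index(board) + 1  # assumed encoding: A=1 ... F=6
    pin_code = pin - 1                      # assumed encoding: pin 1 = 0 ... pin 8 = 7
    return (board_code << 3) | pin_code


# Route board C, pin 5 to multiplexer A.
print(f"{MUX_ADDRESS['A']:05X} <- {mux_select_word('C', 5):04X}")
```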
If a boolean function such as f(A,B) or f(A,B,C) is chosen as the desired fault, then one
must also define the boolean function by writing to the function select address 3E9E6 (hexadecimal).
The data written to this address is interpreted as shown in Figure 2-8.
Bits:     15-12       11-8      7-4       3-0
Field:    Not Used    S(f,C)    F(f,C)    f(A,B)
Figure 2-8. Boolean Function Generator Data Word
Bits 0-3 of the boolean function data word are used to select one of sixteen functions of
signals A and B. Bits 4-7 are used to select f(A,B,C), which is one of the sixteen boolean
functions of f(A,B) and C. Further, bits 8-11 are used to select S(A,B,C), which is also
one of sixteen boolean functions of f(A,B) and C. The sixteen possible boolean functions
of two variables are shown in Table 2-4.
As noted earlier, the fault direction (to device or to socket) is controlled by a 4-bit field X as
shown in Figure 2-6. X can assume one of sixteen possible values. These are interpreted
as shown in Table 2-5.
If the fault direction signal chosen by X is high, the fault is asserted on a socket pin. If it is
low, the fault is asserted on a device pin.
Data    Boolean Function of A, B
0       0
1       /A * /B
2       A * /B
3       /B
4       /A * B
5       /A
6       A XOR B
7       /A + /B
8       A * B
9       /(A XOR B)
10      A
11      A + /B
12      B
13      /A + B
14      A + B
15      1
Table 2-4. Boolean Functions of Two Variables
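The function codes in Table 2-4 can be read as four-bit truth tables: for the AND, OR, NAND, and NOR rows, bit (A + 2B) of the code equals the output of the function for inputs A and B. This is an observation drawn from the table rather than a statement in the report. The sketch below uses it to evaluate a function code and to pack the function select word with f(A,B) in bits 0-3, f(A,B,C) in bits 4-7, and S(A,B,C) in bits 8-11, as described in the text. The function names are hypothetical.

```python
def f2(code: int, a: int, b: int) -> int:
    """Evaluate a Table 2-4 function code, reading it as a 4-bit truth table."""
    if not 0 <= code <= 15:
        raise ValueError("function code must be in 0-15")
    return (code >> (a + 2 * b)) & 1


def function_select_word(f_ab: int, f_fc: int, s_fc: int) -> int:
    """Pack the three 4-bit function codes into the function select word."""
    for code in (f_ab, f_fc, s_fc):
        if not 0 <= code <= 15:
            raise ValueError("function code must be in 0-15")
    return (s_fc << 8) | (f_fc << 4) | f_ab


NAND, NOR = 7, 1  # codes for /A + /B and /A * /B in Table 2-4

# Turning a NAND gate into a NOR gate: compare the two truth tables.
for b in (0, 1):
    for a in (0, 1):
        print(a, b, "NAND:", f2(NAND, a, b), "NOR:", f2(NOR, a, b))
```

For the NAND-to-NOR example in Section 2.2, the gate inputs would be routed to multiplexers A and B, f(A,B) set to code 1, and the gate output replaced with f(A,B).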
For X equal to 0, the fault direction signal is high and the fault is sent to the socket. For X
equal to 8, the fault is applied to the device. The fault direction in these two cases is static.
For other values of X, the fault would be dynamically applied to the socket or device pin
depending upon whether the chosen signal is high or low, respectively. The signals that
can be used for direction control are the outputs of the multiplexers A, B, C, or their
boolean functions f(A,B), S(A,B,C) and their complements. This allows one to
dynamically control fault direction on tristate pins.
As explained earlier, fields Y and Z in the fault description word determine the type of fault
to be applied to socket and device pins, respectively (see Figure 2-6). When X is equal to
0 or 8, only one of these two fields (Y when X is 0 and Z when X is 8) needs to be
defined. Nevertheless, both Y and Z must be defined when X is not 0 or 8. But Y and Z
need not be the same. That is, different faults can be applied to socket and device pins. In
fact, by an appropriate choice of Y and Z, a fault can be applied in one direction while the
original signal is passed through unchanged in the other direction. One may, for example,
wish to insert a fault in a data pin of a memory chip only when data is being read but not
when data is being written into the memory. This can be done by selecting a fault direction
signal that is high during the memory read cycle. The fault selected by Y would be applied
to the socket pin during the read cycle. By choosing Z to be 8, correct data would be
written to the memory during memory write cycle since Z equal to 8 passes the original
signal to the device pin.
X      Fault Direction
0      To Socket
1      /S(A,B,C)
2      /A
3      /B
4      /C
5      /F(A,B)
6      Not Used
7      Not Used
8      To Device
9      S(A,B,C)
A      A
B      B
C      C
D      F(A,B)
E      Not Used
F      Not Used
Table 2-5. Fault Direction Control
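The direction rule can be modeled in a few lines: X equal to 0 or 8 fixes the direction statically, and any other valid X selects a signal whose level steers the fault to the socket (high) or to the device (low). The sketch below is a behavioral model only, assuming the X codes of Table 2-5; the function name and the signal dictionary are hypothetical.

```python
# X code -> (direction signal name, whether the signal is complemented), per Table 2-5.
DYNAMIC = {
    0x1: ("S", True), 0x2: ("A", True), 0x3: ("B", True),
    0x4: ("C", True), 0x5: ("F", True),
    0x9: ("S", False), 0xA: ("A", False), 0xB: ("B", False),
    0xC: ("C", False), 0xD: ("F", False),
}


def fault_side(x: int, signals: dict) -> str:
    """Return 'socket' or 'device' for direction code x and current signal levels."""
    if x == 0x0:
        return "socket"  # static: fault selected by Y goes to the socket
    if x == 0x8:
        return "device"  # static: fault selected by Z goes to the device
    if x not in DYNAMIC:
        raise ValueError("X code is marked Not Used in Table 2-5")
    name, complemented = DYNAMIC[x]
    level = bool(signals[name]) != complemented
    return "socket" if level else "device"


# Memory example: mux A carries a signal assumed high during read cycles, X = A.
print(fault_side(0xA, {"A": 1}))  # read cycle: the Y fault goes to the socket
print(fault_side(0xA, {"A": 0}))  # write cycle: Z applies on the device side
```

With Z set to 8 in the fault description word, the write-cycle case passes the original signal to the device, which is the read-only fault described in the text.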
Multiplexer output selection and boolean function definition are static functions. Data
written to these addresses is latched in the fault injector.
It should be mentioned that the fault injector is a "write-only" device. The state of the fault
injector can not be determined by reading its address space.
It is not necessary to remember various addresses of the fault injector since the fault injector
software maintains these addresses in a database. This software provides appropriate
commands to define fault types and select mux outputs. The next section describes the
fault injector software.
2.3 Fault Injector Software
The fault injection software (FIS) resident on the VAX 3900 provides commands to
perform all functions necessary to inject faults into the AIPS Distributed Engineering Model
and record the results. The FIS is a menu driven program that permits the selection of fault
types and their subsequent insertion into one or more victim devices.
2.3.1 The FIS Main Menu
The main menu of the fault injection software is illustrated in Figure 2-9. As shown, it
permits the fault injection supervisor to edit faults, select input signals for the multiplexers,
choose one or more boolean functions, insert a set of faults, save and load fault files, and
reset the fault injection hardware. The aforementioned options are discussed in the
following sections.
Selection    Fault Injector Main Menu
A            Edit Faults
B            Select Signal for Multiplexers
C            Choose a Boolean Function
D            Apply Faults to Victim
E            Save Fault File
F            Load Fault File
G            Reset
H            Quit
Figure 2-9. The FIS Main Menu
2.3.1.1 The FIS Edit Faults Menu
The Edit Faults menu of the fault injection software is depicted in Figure 2-10. It allows
the creation, deletion, and confirmation of a fault or a set of faults. In order to create a
fault, the selection A must be chosen. Subsequently, the FIS requests the associated
multiplexer board (A-F), the pin (1-8), the fault direction, and fault type. The multiplexer
and pin number identify the desired fault injector mux board, implant segment, and pin
through which to apply the fault (described in Section 2.2 - Fault Injector Hardware). The
fault direction indicates one of twelve options, each of which is presented in Figure 2-11.
Furthermore, one of fourteen faults can be inserted into the victim device. The fourteen
possible fault types are indicated in Figure 2-12.
Selection    Edit Faults Menu
A            Create Fault
B            Delete Fault
C            View Fault Package
D            Return to Main Menu
Figure 2-10. The FIS Edit Faults Menu
If a previously created fault is not desired, it can be removed by selecting option B in the
Edit Faults menu. In order for this extraneous fault type to be deleted, the corresponding
multiplexer and pin must be identified.
The fault set that is currently defined can be viewed by choosing option C. This capability
is incorporated into the FIS to permit the confirmation of a fault suite.
To return to the main FIS menu, option D of the Edit Faults menu must be selected.
Selection    Fault Direction
1            To Socket
2            /S(A,B,C)
3            /A
4            /B
5            /C
6            /F(A,B)
7            To Device
8            S(A,B,C)
9            A
10           B
11           C
12           F(A,B)
Figure 2-11. The Fault Direction Options
Selection    Type of Fault
1            Original Signal
2            Inverted Signal
3            1
4            0
5            F(A,B,C)
6            A
7            B
8            C
9            F(A,B)
10           / F(A,B,C)
11           /A
12           /B
13           /C
14           / F(A,B)
Figure 2-12. The Fault Type Options
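The menu selections in Figures 2-11 and 2-12 use the same names as the hardware codes in Tables 2-5 and 2-2, so a translation between the two can be sketched as a pair of lookups. The report does not describe how the FIS performs this translation; the tables and function names below are hypothetical and simply pair each menu entry with the code of the same name. Note that the menu offers fourteen of the sixteen fault type codes, omitting EXT and its complement.

```python
# Menu order from Figure 2-12 and hardware codes from Table 2-2.
TYPE_MENU = ["Original Signal", "Inverted Signal", "1", "0", "F(A,B,C)", "A", "B",
             "C", "F(A,B)", "/F(A,B,C)", "/A", "/B", "/C", "/F(A,B)"]
TYPE_CODE = {"Inverted Signal": 0x0, "F(A,B,C)": 0x1, "A": 0x2, "B": 0x3,
             "C": 0x4, "F(A,B)": 0x5, "1": 0x6, "EXT": 0x7,
             "Original Signal": 0x8, "/F(A,B,C)": 0x9, "/A": 0xA, "/B": 0xB,
             "/C": 0xC, "/F(A,B)": 0xD, "0": 0xE, "/EXT": 0xF}

# Menu order from Figure 2-11 and hardware codes from Table 2-5.
DIRECTION_MENU = ["To Socket", "/S(A,B,C)", "/A", "/B", "/C", "/F(A,B)",
                  "To Device", "S(A,B,C)", "A", "B", "C", "F(A,B)"]
DIRECTION_CODE = {"To Socket": 0x0, "/S(A,B,C)": 0x1, "/A": 0x2, "/B": 0x3,
                  "/C": 0x4, "/F(A,B)": 0x5, "To Device": 0x8, "S(A,B,C)": 0x9,
                  "A": 0xA, "B": 0xB, "C": 0xC, "F(A,B)": 0xD}


def type_code(selection: int) -> int:
    """Hardware Y/Z code for a fault type menu selection (1-14)."""
    if not 1 <= selection <= len(TYPE_MENU):
        raise ValueError("fault type selection must be in 1-14")
    return TYPE_CODE[TYPE_MENU[selection - 1]]


def direction_code(selection: int) -> int:
    """Hardware X code for a fault direction menu selection (1-12)."""
    if not 1 <= selection <= len(DIRECTION_MENU):
        raise ValueError("fault direction selection must be in 1-12")
    return DIRECTION_CODE[DIRECTION_MENU[selection - 1]]


print(hex(type_code(4)), hex(direction_code(7)))  # stuck-at-0, to device
```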
2.3.1.2 The FIS Multiplexer Signal Selection Menu
The input signal to the fault injector multiplexers A, B, and C can be specified by choosing
option B of the main menu. This selection displays the Multiplexer Signal Selection menu
which is illustrated in Figure 2-13. As expected, this menu has an entry for each of the
three multiplexers. After the desired multiplexer is selected, the FIS requests information
concerning the source of the multiplexer's input. That is, the corresponding multiplexer
board (A-F) and pin (1-8) that should be used as the input signal.
To determine the connectivity between the multiplexers (A, B, and C) and pins (1-8) of the
multiplexer boards (A-F), see Figure 2-2, the Fault Injector Logical Organization.
2.3.1.3 The FIS Boolean Function Selection Menu
It may be desirable to inject a signal into a victim device that is the boolean function of one
or more input signals. The Boolean Function Selection menu, shown in Figure 2-14,
which is entered by selecting option C of the main menu, permits the specification of the
fault injector functions f(A,B), f(f,C), and S(f,C). As described in Section 2.2 and
illustrated in Figure 2-2, f(A,B) is a boolean function of the outputs of multiplexers A and
B. In turn, f(f,C) and S(f,C) are functions of the output of f(A,B) and multiplexer
C.
To define f(A,B), f(f,C), or S(f,C), the associated option of the Boolean Function
Selection menu must be selected. The FI software then prompts the fault injection
supervisor for the corresponding function. The function can be one of the fifteen
possibilities shown in Figure 2-15.
2.3.1.4 The FIS Fault Application Option
To apply a fault to a victim device, a fault set must be created or a previously defined fault
package must be loaded. The creation of a fault is discussed in Section 2.3.1.1, and the
method to load a fault set is described in Section 2.3.1.5. After a fault package has been
devised and downloaded to the fault injector, option D of the main menu should be
selected. This option applies the fault set to the victim device.
Selection    Signal for Mux Menu
A            Select Signal for Mux A
B            Select Signal for Mux B
C            Select Signal for Mux C
D            Return to Main Menu
Figure 2-13. The FIS Mux Signals Menu
Selection    Select Boolean Function Menu
A            Select Function for f(A,B)
B            Select Function for f(f,C)
C            Select Function for S(f,C)
Figure 2-14. The FIS Boolean Function Menu
Selection    Boolean Function of A, B
1            0
2            /A*/B
3            A*/B
4            /B
5            /A*B
6            /A
7            A+B
8            /A+/B
9            A*B
10           /(A + B)
11           A
12           A+/B
13           B
14           /A+B
15           A+B
Figure 2-15. The Boolean Function Options
To inject, observe, vary, and accurately record the fault injection test, the FIS requests the
following information:
1. The fault detection bucket size.
The fault detection times are grouped into "buckets", which are collections
of times that fall in a particular range, to create a frequency histogram.
The size of the bucket is selectable by the fault injection supervisor.
2. The reconfiguration detection bucket size.
Similar to the fault detection times, the reconfiguration metrics are placed
into "buckets". The size of the reconfiguration groupings is selectable by
the fault injection supervisor.
3. Hardware or Software Fault Injection.
The Fault Injection Software can insert hardware faults via the Fault
Injector or software faults by corrupting the FTP's memory. The type of
fault application is selectable by the user. If software faults are desired,
the FIS program requests the channel and memory range to be victimized
and the numerical value of the fault.
4. The channel in which the fault will be inserted.
When faults are applied to the core Fault Tolerant Processor (FTP), the
FIS must know the channel through which the fault is being applied. This
information assists the FIS in re-synchronizing the FTP in preparation for
a successive test.
5. Whether a fixed or random time between successive faults is desired.
The FIS permits the fault injection supervisor to insert a fixed amount of
time between successive tests or to have the FIS randomly select the
interval. If a fixed time is desired, then the supervisor is prompted for the
corresponding number of seconds.
6. Number of times to insert the fault.
The FIS sequentially executes a series of tests by repeatedly injecting the
fault suite into the victim device. The number of times that the fault is
applied is selectable.
7. Number of milliseconds to apply the fault.
The length of time that a fault is inserted is selectable in quanta of 10
milliseconds. This is the least count of the Microvax 3900 clock. Since
the periodicity of the different modules of the Redundancy Management
Software varies considerably, the capability of changing the application
interval is desirable.
8. The "Reconfiguration Time-Out".
The "Reconfiguration Time-Out" is the maximum number of seconds to
wait for the completion of the reconfiguration phase.
It is possible that a fault is not detected (due to a software bug or a "don't
care" condition). Accordingly, it is desirable to have a facility that permits
the series of tests to continue even though one or more faults
were not detected. This FIS facility is the Reconfiguration Time-Out. If it
is exceeded, the FIS logs its occurrence and begins the execution of the
next test. The length of this time-out is specified by the fault injection
supervisor.
9. The "Fault Re-Insertion Time-Out".
The "Fault Re-Insertion Time-Out" is the maximum number of seconds to
wait for the AIPS Engineering Model to return to a known state.
After a fault has been detected and reconfigured around, the state of the
AIPS Engineering Model must be returned to its state prior to the
application of the fault. It is possible that a fault could cause the Model to
enter an "unknown state" (due to a software bug). Accordingly, it is
desirable to have a facility that permits the abortion of the series of tests if
an inconsistent state is entered. This FIS facility is the Fault Re-Insertion
Time-Out. If it is exceeded, then the FIS program logs this occurrence
and stops. The length of this time-out is selected by the fault injection
supervisor.
10. Names of the load modules that are executing in the computational and I/O
processors of the AIPS Engineering Model.
In order to maintain configuration control over the experimental test set
up, the load modules that are involved must be noted. Accordingly, the
FIS prompts the supervisor for the names of these files.
11. Name of the Output File.
Since the AIPS Fault Injection plan involves many diverse tests and each
test will be unique, the results of each test should be stored in a file with a
name that identifies the test. Accordingly, the FIS prompts the fault
injection supervisor for the name of the output file.
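The bucket grouping described in items 1 and 2 can be illustrated with a short sketch. This is not the FIS implementation; the function name `bucket_histogram` and the sample times are hypothetical, and buckets are assumed to start at zero and to be identified by their lower bound.

```python
from collections import Counter


def bucket_histogram(times_ms, bucket_size_ms):
    """Count how many times fall into each bucket of width bucket_size_ms.

    Returns a dict mapping each bucket's lower bound (in ms) to its count.
    Assumption: buckets start at 0 and cover [k*size, (k+1)*size).
    """
    if bucket_size_ms <= 0:
        raise ValueError("bucket size must be positive")
    counts = Counter(int(t // bucket_size_ms) * bucket_size_ms for t in times_ms)
    return dict(sorted(counts.items()))


# Hypothetical detection times in milliseconds (illustrative, not measured data).
sample_times = [120.0, 480.5, 510.0, 990.0, 1530.2, 1999.9]
histogram = bucket_histogram(sample_times, 500)
```

A smaller bucket size gives a finer histogram at the cost of fewer samples per bucket, which matters when a test has only a few dozen iterations.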
2.3.1.5 The FIS Save Fault File Option
It is often desirable to save a fault suite for later use. Option E of the main menu allows the
fault injection supervisor to save a file that was created using option A (described in Section
2.3.1.1). After option E is selected, the supervisor is prompted for the name of the file in
which to store the fault set.
2.3.1.6 The FIS Load Fault File Option
In order to use a previously defined fault package, it must be downloaded to the fault
injector. Option F of the main menu allows the fault injection supervisor to load a fault file
that was saved using option E (described in Section 2.3.1.5) into the FI. After option F is
selected, the supervisor is prompted for the name of the file in which the associated
fault set is stored.
2.3.1.7 The FIS Reset Option
The fault injector can be reset by selecting option G of the main menu. This process
resets Multiplexer boards A - F.
2.3.1.8 The FIS Quit Option
To exit the fault injection software, option H should be selected.
2.3.2 Fault Injection Software - Multiple Fault Application
The process of applying one fault and collecting the detection and reconfiguration times
was described earlier. Before applying another fault, the AIPS software must return to the
state that existed prior to the fault injection. In the context of the I/O Network fault
insertion, this process entails the regrowth of the previous I/O virtual path. As a result, the
configuration of the I/O nodes and ports is identical to that which existed before the
application of the fault. After the I/O network path has been restored, the AIPS software
stops executing. When the FIS recognizes that AIPS has stopped, it resumes the execution
of the AIPS software and subsequently injects the next fault.
Between the injection of faults into the core FTP, the channel that is corrupted must be
brought on-line (from a failed status). This entails realigning its volatile memory to ensure
that all channels' internal state is identical. As with the I/O network fault injection, the
AIPS software stops after the FTP has returned to the state that existed prior to the fault
application and the next fault is inserted.
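The repeated-injection cycle, together with the two time-outs of Section 2.3.1.4, can be sketched as a simple loop. This is only an illustration of the described control flow, not the FIS code: the callables `inject` and `restore` are hypothetical stand-ins for the hardware interaction, and the log format is invented for the example.

```python
def run_fault_series(inject, restore, repetitions,
                     reconfig_timeout_s, reinsertion_timeout_s):
    """Apply one fault repeatedly, following the time-out rules in the text.

    inject()  -> (detect_s, reconfig_s), or None if the fault was never detected
    restore() -> seconds the model needed to return to its pre-fault state
    Both callables are hypothetical placeholders.
    """
    log = []
    for i in range(repetitions):
        result = inject()
        if result is None or result[1] > reconfig_timeout_s:
            # Reconfiguration Time-Out: log it and move on to the next test.
            log.append((i, "reconfiguration time-out"))
            continue
        log.append((i, "ok", result[0], result[1]))
        if restore() > reinsertion_timeout_s:
            # Fault Re-Insertion Time-Out: log it and stop the whole series.
            log.append((i, "re-insertion time-out"))
            break
    return log


# Scripted example: the second fault is never detected, and the model fails
# to return to a known state after the third.
outcomes = iter([(1.0, 0.15), None, (0.8, 0.2)])
restore_times = iter([0.5, 99.0])
series_log = run_fault_series(lambda: next(outcomes), lambda: next(restore_times),
                              repetitions=5, reconfig_timeout_s=10,
                              reinsertion_timeout_s=30)
```

The asymmetry is deliberate: an undetected fault only costs one data point, whereas a model stuck in an unknown state would invalidate every later measurement.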

3.0 Applying Faults to the I/O Network
The I/O Fault Insertion Plan is a set of faults to be injected into the AIPS I/O network
which is attached to the AIPS Distributed Engineering Model. The Plan was designed to
apply various types of faults that would exercise several network configurations. The
following sections describe the AIPS I/O network and the I/O Fault Insertion Plan,
respectively.
3.1 Overview of the AIPS I/O Network
For communication between an AIPS FTP and I/O devices, damage and fault-tolerant
networks are employed. These AIPS I/O networks are designed to provide both high
throughput and high reliability. Each network consists of a number of full duplex links that
are interconnected by circuit switched nodes (shown in Figure 3-1). Sensors and actuators
are attached to these nodes. In steady-state, the circuit switched nodes route information
along a fixed communication path, or "virtual bus", without the delays that are associated
with packet-switched networks. Once the virtual bus is set up within the network, the
protocols and operation of the network are similar to typical multiplex buses. Although the
hardware implementation of this "virtual bus" is a circuit-switched network, the FTP
communication and protocol view it as a conventional bus.
Although the AIPS I/O network performs exactly like a bus, it is far more reliable and damage
tolerant than a linear bus. The network architecture provides coverage for many well
known failure modes that would cause a standard linear bus to either fail completely or
provide service to a reduced set of subscribers. A single fault or limited damage (for
example, caused by weapons, electrical shorts, or overheating) can disable only a small
fraction of the virtual bus, typically a node or a link connecting two nodes. The rest of the
network, and the subscribers on it, can continue to operate normally. If the sensors and
effectors are physically dispersed for damage-tolerance and the damage event does not
affect the inherent capability of the vehicle to continue to fly, then the AIPS I/O system
would continue to function in a normal manner or in some degraded mode as determined by
sensor/effector availability.
The ability of the network to tolerate such faults comes from the design of the node. An
AIPS node has five ports, each of which can be enabled or disabled. When the ports on
either end of a link are enabled, data is routed along that link of the network. Each node in
a properly configured fault-free network receives transmissions on exactly one of its
enabled ports and simultaneously retransmits this data on all of its other enabled ports. The
nodes provide a richness of spare interconnections that can be brought into service after a
hardware fault or damage event occurs.
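The forwarding rule (receive on one enabled port, retransmit on all other enabled ports) and the role of spare links can be sketched as a reachability computation. The five-node topology below is hypothetical and is not the network of Figure 3-1; the function name `reachable_nodes` is likewise an assumption made for illustration.

```python
from collections import deque


def reachable_nodes(enabled_links, root):
    """Return the set of nodes that receive a transmission entering at root.

    enabled_links holds (node, node) pairs whose ports are enabled at both
    ends. Each node forwards what it receives on all other enabled ports.
    """
    neighbors = {}
    for a, b in enabled_links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical topology: four active links form the virtual bus; (4, 5) is a spare.
active = {(1, 2), (1, 3), (2, 4), (3, 5)}
degraded = active - {(2, 4)}        # link 2-4 fails, so node 4 is cut off
repaired = degraded | {(4, 5)}      # enabling the spare link restores node 4
```

Here `reachable_nodes(degraded, 1)` omits node 4, while `reachable_nodes(repaired, 1)` again covers all five nodes, which is the effect that reconfiguration is meant to achieve.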
Figure 3-1. 15 Node I/O Network - No Fault Configuration
3.2 Specification of I/O Network Faults
3.2.1 Creating Node and Link Faults
As discussed in Section 3.1, the AIPS I/O network is comprised of nodes and
interconnecting full duplex links. Accordingly, the I/O Fault Insertion Plan involves the
application of node and link faults.
The AIPS I/O nodes are functionally divided into six sections: Node Sequencer and
Control, Port Enable Register, Message Buffer, Port Activity Register, Transmit FIFO,
and Protocol Encoder. To corrupt the operation of an I/O node, a fault is applied to one or
more of these components. Similarly, each port of the AIPS I/O node is divided into five
functional components: Receiver, Protocol Decoder, Clock Extractor, Signal Regeneration
Logic, and Transmitter. To interrupt the operation of a link, a fault is applied to a
component of the associated node port.
The I/O Fault Insertion Plan applied faults to only some of the node and port subsections,
because it was more concerned with a general characterization of the I/O Management
process' performance than with a comprehensive one. An extensive I/O Plan can be
developed and applied to the AIPS I/O Networks; however, this is not currently being
done.
Four types of node faults were applied, corrupting:
1. The node's control microsequencer, a component of the Node Sequencer and
Control subsection that is responsible for decoding and processing the node's
microinstructions.
2. The Port Enable Register, a buffer that indicates which node ports are enabled.
3. The I/O FIFO, a component of the Transmit FIFO subsection which buffers the
data that is processed and transmitted by the node.
4. The node's address decoding logic, an element of the Node Sequencer and
Control system that is responsible for uniquely identifying the node.
Three types of port faults were inserted, affecting:
1. The port's Protocol Decoder that is partly responsible for correctly receiving the
input data.
2. The port's input buffer, the part of the Receiver logic that regenerates the input
data.
3. The port's input receiver, a component of the Receiver subsection that accepts the
input data.
These node and port faults were selected because of their accessibility, their coverage (50
percent of nodes' and 60 percent of the ports' major functional sections were affected), and
their complexity (simple, intermediate, and complex error symptoms were generated).
Tables 3-1 and 3-2 detail the node and link faults, specifying how the faults affect the
components and the location of the victim hardware elements with respect to the node
schematics. (It should be noted here that the terms "link fault" and "port fault" have been
used in this document interchangeably.)
3.2.2 Specification of Test Cases
The I/O Fault Insertion Plan is comprised of 47 Tests. Each test applies a fault to one or
more nodes of the AIPS 15 Node I/O Network (its topology is illustrated in Figure 3-1).
The tests were designed to apply six basic types of topological faults.
A. Failed Link Causing Disjoint Leaf Node
B. Failed Link Causing Disjoint Branch
C. Failed Leaf Node
D. Failed Node Causing Disjoint Branch
E. Failed Root Node
F. Simultaneous Node Failures
Methods to Insert Generic Node Faults          IC        Pin     Logic  Schematic
                                               Location  Number  Level  Page
1. Corrupt the Node Sequencer                  2357      3,12    H      5/6
   - The node sequencer is reset.
2. Corrupt the Port Enable Register            1130      11      H      5/6
   - The input enable of the Port
     Enable register is corrupted
     such that arbitrary data is read
     into the register.
3. Corrupt the I/O FIFO                        2047      4       H      5/6
   - The decoder that controls the
     resetting of the I/O FIFO is
     corrupted.
4. Corrupt the Address Decoding Logic          1130      14      H      5/6
   - The input enable of the register
     that outputs the node address is
     corrupted such that the node
     address is not placed on the data
     bus at the correct time.
Table 3-1. Victim Node Components - Schematic Location,
Pin Number, and Fault Logic Level
A comprehensive I/O fault plan would apply millions of different I/O faults, causing the
victim I/O network to reconfigure into thousands of different I/O topologies. The I/O Fault
Insertion plan performed by Draper Laboratory only utilized 47 tests, applying faults that
caused the 15 node network to reconfigure into 19 degraded topologies. As mentioned
earlier, this I/O Fault Insertion Plan was more concerned with a general characterization of
the I/O Management process' performance rather than a comprehensive one.
Consequently, only some of the vast number of possible tests were performed. The tests
completed were selected because of their topological coverage and complexity.
The I/O Fault Insertion Plan is segmented into six fault categories corresponding to the six
topological fault types. Each fault category was applied to several network nodes, thereby
exercising different I/O configurations. Accordingly, each category is subdivided into
groups, each group uniquely identifying and depicting a victim link or node. Furthermore,
as previously discussed, each fault type (or fault category) may be instrumented in several
ways. Consequently, each group is further segmented into subsections, each indicating the
specific fault that was applied.
Methods to Insert Generic Link Faults          IC        Pin     Logic  Schematic
                                               Location  Number  Level
1. Corrupt the Node's Protocol Decoder
   - Corruption of the input data
     counter such that it does not
     correctly load data.
     Port 0.                                   1310      7       H      1/6
     Port 1.                                   1110      7       H      1/6
     Port 2.                                   1301      7       H      1/6
     Port 3.                                   1101      7       H      1/6
     Port 4.                                   0901      7       H      1/6
2. Corrupt the Node's Input Buffer
   - Corruption of the input data
     such that the data is stuck-at-one.
     Port 0.                                   0301      3       H      1/6
     Port 1.                                   0101      11      H      1/6
     Port 2.                                   0101      5       H      1/6
     Port 3.                                   0101      13      H      1/6
     Port 4.                                   0101      3       H      1/6
3. Corrupt the Node's Input Receiver
   - The input data FIFO is reset.
     Port 0.                                   2047      11      L      5/6
     Port 1.                                   2047      12      L      5/6
     Port 2.                                   2047      13      L      5/6
     Port 3.                                   2047      14      L      5/6
     Port 4.                                   2047      15      L      5/6
Table 3-2. Victim Port Components - Schematic Location,
Pin Number, and Fault Logic Level
A test is identified by specifying the category, group, and subsection that is associated with
the fault that was applied. For example, "test B.2.a" is the test that corrupts the connection
between nodes 1 and 2 by disrupting the operation of the protocol decoder of port 2 of node 2
(see test outline on the following pages). Alternatively, a test may be identified by
indicating the node, integrated circuit, and pin that was victimized. Consequently, "test
B.2.a" can also be referred to as "node 2, chip 1301, pin 7" (or simply N2_CH1301_P7).
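The node/chip/pin naming convention lends itself to a small parser. The sketch below is an illustration only: the names `VictimPin`, `parse_victim`, and `format_victim` are hypothetical, the IC location is assumed to be four digits, and identifiers that name more than one node (as in the simultaneous-failure tests) are not handled.

```python
import re
from typing import NamedTuple


class VictimPin(NamedTuple):
    node: int
    chip: str  # IC location kept as a string to preserve leading zeros
    pin: int


# Assumed layout: N<node>_CH<four-digit IC location>_P<pin>
_PATTERN = re.compile(r"N(\d+)_CH(\d{4})_P(\d+)")


def parse_victim(identifier):
    """Split an identifier such as 'N2_CH1301_P7' into node, chip, and pin."""
    match = _PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a node/chip/pin identifier: {identifier!r}")
    node, chip, pin = match.groups()
    return VictimPin(int(node), chip, int(pin))


def format_victim(victim):
    """Rebuild the identifier string from its three parts."""
    return f"N{victim.node}_CH{victim.chip}_P{victim.pin}"


victim = parse_victim("N2_CH1301_P7")  # node 2, chip 1301, pin 7
```

Keeping the chip field as text matters for IC locations such as 0101 and 0901, which would lose their leading zero if stored as integers.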
The following paragraphs detail the I/O Fault Insertion Plan.
A. Failed Link - Disjoint Leaf Node
1. Cause Node 15 to be a Disjoint Leaf Node
a. Fail the link between nodes 13 and 15 by corrupting the protocol
decoder of port 3 of node 15.
- N15_CH1101_P7
b. Fail the link between nodes 13 and 15 by corrupting the input data of
port 3 of node 15.
- N15_CH0101_P13
c. Fail the link between nodes 13 and 15 by corrupting the input receiver
of port 3 of node 15.
- N15_CH2047_P14
2. Cause Node 5 to be a Disjoint Leaf Node
a. Fail the link between nodes 3 and 5 by corrupting the protocol decoder
of port 3 of node 5.
- N5_CH1101_P7
b. Fail the link between nodes 3 and 5 by corrupting the input data of port
3 of node 5.
- N5_CH0101_P13
c. Fail the link between nodes 3 and 5 by corrupting the input receiver of
port 3 of node 5.
- N5_CH2047_P14
3. Cause Node 14 to be a Disjoint Leaf Node
a. Fail the link between nodes 12 and 14 by corrupting the protocol
decoder of port 4 of node 14.
- N14_CH0901_P7
b. Fail the link between nodes 12 and 14 by corrupting the input data of
port 4 of node 14.
- N14_CH0101_P3
c. Fail the link between nodes 12 and 14 by corrupting the input receiver
of port 4 of node 14.
- N14_CH2047_P15
B. Failed Link - Disjoint Branch
1. Cause Nodes 7 and 9 to be a Disjoint Branch
a. Fail the link between nodes 6 and 7 by corrupting the protocol decoder
of port 2 of node 7.
- N7_CH1301_P7
b. Fail the link between nodes 6 and 7 by corrupting the input data of port
2 of node 7.
- N7_CH0101_P5
c. Fail the link between nodes 6 and 7 by corrupting the input receiver of
port 2 of node 7.
- N7_CH2047_P13
2. Cause Nodes 2, 4, 8 and 10 to be a Disjoint Branch
a. Fail the link between nodes 1 and 2 by corrupting the protocol decoder
of port 2 of node 2.
- N2_CH1301_P7
b. Fail the link between nodes 1 and 2 by corrupting the input data of port
2 of node 2.
- N2_CH0101_P5
c. Fail the link between nodes 1 and 2 by corrupting the input receiver of
port 2 of node 2.
- N2_CH2047_P13
3. Cause Nodes 3, 5, 12 and 14 to be a Disjoint Branch
a. Fail the link between nodes 1 and 3 by corrupting the protocol decoder
of port 1 of node 3.
- N3_CH1101_P7
b. Fail the link between nodes 1 and 3 by corrupting the input data of port
1 of node 3.
- N3_CH0101_P11
c. Fail the link between nodes 1 and 3 by corrupting the input receiver of
port 1 of node 3.
- N3_CH2047_P12
4. Cause all Nodes to be a Disjoint Branch
a. Fail the link between node 1 and channel A by corrupting the input data
of port 0 of node 1.
- N1_CH0301_P3
C. Failed Leaf Node
1. Fail Node 15 which is a Leaf Node
a. Fail node 15 by corrupting its control sequencer.
- N15_CH2357_P12
b. Fail node 15 by corrupting its port enable register.
- N15_CH1130_P11
c. Fail node 15 by corrupting its I/O FIFO.
- N15_CH2047_P4
d. Fail node 15 by corrupting its node address decoding logic.
- N15_CH1130_P14
2. Fail Node 4 which is a Leaf Node
a. Fail node 4 by corrupting its control sequencer.
- N4_CH2357_P3
b. Fail node 4 by corrupting its I/O FIFO.
- N4_CH2047_P4
c. Fail node 4 by corrupting its node address decoding logic.
- N4_CH1130_P14
3. Fail Node 10 which is a Leaf Node
a. Fail node 10 by corrupting its control sequencer.
- N10_CH2357_P3
b. Fail node 10 by corrupting its I/O FIFO.
- N10_CH2047_P4
D. Failed Node - Disjoint Branch
1. Cause Node 9 to be a Disjoint Branch
a. Fail node 7 by corrupting its control sequencer.
- N7_CH2357_P3
b. Fail node 7 by corrupting its port enable register.
- N7_CH1130_P11
c. Fail node 7 by corrupting its I/O FIFO.
- N7_CH2047_P4
d. Fail node 7 by corrupting its node address decoding logic.
- N7_CH1130_P14
2. Cause Nodes 13 and 15 to be a Disjoint Branch
a. Fail node 11 by corrupting its control sequencer.
- N11_CH2357_P3
b. Fail node 11 by corrupting its port enable register.
- N11_CH1130_P11
c. Fail node 11 by corrupting its I/O FIFO.
- N11_CH2047_P4
3. Cause Nodes 7 and 9 to be a Disjoint Branch
a. Fail node 6 by corrupting its control sequencer.
- N6_CH2357_P3
b. Fail node 6 by corrupting its I/O FIFO.
- N6_CH2047_P4
4. Cause Node 4 to be a Disjoint Branch and Nodes 8 and 10 to be a Disjoint
Branch.
a. Fail node 2 by corrupting its control sequencer.
- N2_CH2357_P3
b. Fail node 2 by corrupting its I/O FIFO.
- N2_CH2047_P4
5. Cause Node 5 to be a Disjoint Branch and Nodes 12 and 14 to be a Disjoint
Branch.
a. Fail node 3 by corrupting its control sequencer.
- N3_CH2357_P3
b. Fail node 3 by corrupting its port enable register.
- N3_CH1130_P11
c. Fail node 3 by corrupting its I/O FIFO.
- N3_CH2047_P4
d. Fail node 3 by corrupting its node address decoding logic.
- N3_CH1130_P14
E. Failed Root Node
1. Cause all Nodes to be Disjoint
a. Fail node 1 by corrupting its control sequencer.
- N1_CH2357_P3
F. Simultaneous Node Failures
1. Simultaneously Fail Nodes 8 and 10
a. Fail nodes 8 and 10 by corrupting their control sequencer.
- N8_10_CH2357_P3
2. Simultaneously Fail Nodes 2 and 8
a. Fail nodes 2 and 8 by corrupting their control sequencer.
- N2_10_CH2357_P3
3. Simultaneously Fail Nodes 4 and 10
a. Fail nodes 4 and 10 by corrupting their control sequencer.
- N4_10_CH2357_P3
3.3 Test Results
As discussed in Section 3.2, the I/O Fault Insertion Plan is comprised of 47 tests. Each
test consists of 25 iterations. An iteration involves the application of a fault, its detection
by the AIPS I/O Network Redundancy Management software, and the associated network
reconfiguration. For each iteration, the fault detection and reconfiguration times were
recorded by the Fault Insertion Software (FIS). Accordingly, 25 sets of data were
recorded for each of the 47 tests (a total of 1175 data sets).
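The test and data-set totals can be cross-checked against the outline in Section 3.2.2. The sketch below simply tallies the number of subsections (a, b, c, ...) listed under each group of each category; the dictionary name `tests_per_group` is illustrative.

```python
# Number of subsections under each group, per category, as listed in the
# Section 3.2.2 outline.
tests_per_group = {
    "A": [3, 3, 3],        # failed link, disjoint leaf node
    "B": [3, 3, 3, 1],     # failed link, disjoint branch
    "C": [4, 3, 2],        # failed leaf node
    "D": [4, 3, 2, 2, 4],  # failed node, disjoint branch
    "E": [1],              # failed root node
    "F": [1, 1, 1],        # simultaneous node failures
}

ITERATIONS_PER_TEST = 25

total_tests = sum(sum(groups) for groups in tests_per_group.values())
total_data_sets = total_tests * ITERATIONS_PER_TEST
```

The tally gives 47 tests and 1175 data sets, in agreement with the figures quoted in the text.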
The detection and reconfiguration times recorded during the I/O Fault Insertion are not
optimal. These "non-optimal" times result because the I/O Network Redundancy
Management software logs debug information during the fault detection, identification, and
reconfiguration process. The overhead due to this logging process is less than 0.5 percent
of the detection time and less than 10 percent of the reconfiguration time. Accordingly, the
times recorded during the I/O Fault Insertion plan are upper bounds.
The I/O Fault Insertion results are segmented into four groups. Section 3.3.1 presents the
maximum and average times for each test. Section 3.3.2 details each test's time/frequency
histogram. Next, Section 3.3.3 illustrates and discusses the I/O Fault Insertion probability
and cumulative density functions. Finally, Section 3.4 provides conclusions for the I/O
Fault Insertion plan.
3.3.1 Maximum and Average Times
A fault is detected by the I/O Network Redundancy Management software by requesting
and analyzing the status of the network nodes. If all nodes correctly respond to this status
query, then the Redundancy Management software assumes that the network is fault-free.
If the status response from one or more nodes has errors, then this software presumes that
a hardware fault exists and begins the fault identification and reconfiguration stage.
To isolate and reconfigure around a hardware fault, the I/O Network Redundancy
Management software examines the status response from the nodes. By analyzing this data
and subsequently requested information, the software can determine the location of the fault
and activate spare network components to bypass it.
In the AIPS Distributed Engineering Model, the I/O Network Management software checks
for the presence of a fault every two seconds (2000 ms.). This fault detection check
requires approximately 75 ms. to execute. Consequently, the typical maximum detection
time will be about 2075 ms. plus any latency from the time the fault is applied to the time
the associated error symptom is generated. (Some exceptions were observed and they are
discussed later in this section.) As a result, the typical average detection time is expected to
be 1040 ms. plus error latency.
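The arithmetic behind these figures can be written out explicitly. The sketch assumes that the moment of fault application is uniformly distributed relative to the periodic check, so that the expected detection time is the midpoint between zero and the maximum; that reading of the 1040 ms figure is an interpretation, not something the report states. The function name `detection_time_bounds` is hypothetical.

```python
def detection_time_bounds(check_period_ms, check_duration_ms):
    """Return (maximum, midpoint) detection time in ms, excluding error latency.

    Assumption: fault application time is uniform relative to the check cycle,
    so the average is taken as half of the maximum.
    """
    maximum = check_period_ms + check_duration_ms
    midpoint = maximum / 2
    return maximum, midpoint


# 2000 ms check period and roughly 75 ms to execute the check, as in the text.
maximum_ms, midpoint_ms = detection_time_bounds(2000, 75)
```

With these inputs the maximum is 2075 ms and the midpoint is 1037.5 ms, which is consistent with the rounded value of about 1040 ms.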
The time necessary to perform the fault isolation and reconfiguration process varies
considerably. Simple faults may be identified and bypassed in 100 ms. Alternatively,
more complex faults require the regrowth of the I/O network, thereby utilizing thousands of
milliseconds to reconfigure the network.
The following paragraphs provide the maximum and average detection and reconfiguration
times for each test. Further, the number of faults detected (with respect to the 25 that were
inserted) is provided.
A. Failed Link - Disjoint Leaf Node
1. Cause Node 15 to be a Disjoint Leaf Node
a. Fail the link between nodes 13 and 15 by corrupting the protocol
decoder of port 3 of node 15.
N15_CH1101_P7
i. Maximum Detection Time (ms.) 2032.4
ii. Average Detection Time (ms.) 996.1
iii. Maximum Reconfiguration Time (ms.) 154.2
iv. Average Reconfiguration Time (ms.) 147.7
v. Number of Faults Detected 25
b. Fail the link between nodes 13 and 15 by corrupting the input data of
port 3 of node 15.
N15_CH0101_P13
i. Maximum Detection Time (ms.) 1942.1
ii. Average Detection Time (ms.) 843.2
iii. Maximum Reconfiguration Time (ms.) 154.7
iv. Average Reconfiguration Time (ms.) 147.9
v. Number of Faults Detected 25
c. Fail the link between nodes 13 and 15 by corrupting the input receiver
of port 3 of node 15.
N15_CH2047_P14
i. Maximum Detection Time (ms.) 2036.6
ii. Average Detection Time (ms.) 1235.5
iii. Maximum Reconfiguration Time (ms.) 153.9
iv. Average Reconfiguration Time (ms.) 145.8
v. Number of Faults Detected 25
2. Cause Node 5 to be a Disjoint Leaf Node
a. Fail the link between nodes 3 and 5 by corrupting the protocol decoder
of port 3 of node 5.
- N5_CH1101_P7
i. Maximum Detection Time (ms.) 2031.6
ii. Average Detection Time (ms.) 1141.0
iii. Maximum Reconfiguration Time (ms.) 155.5
iv. Average Reconfiguration Time (ms.) 151.2
v. Number of Faults Detected 25
b. Fail the link between nodes 3 and 5 by corrupting the input data of port
3 of node 5.
- N5_CH0101_P13
i. Maximum Detection Time (ms.) 1947.1
ii. Average Detection Time (ms.) 1007.4
iii. Maximum Reconfiguration Time (ms.) 155.7
iv. Average Reconfiguration Time (ms.) 149.2
v. Number of Faults Detected 25
c. Fail the link between nodes 3 and 5 by corrupting the input receiver of
port 3 of node 5.
- N5_CH2047_P14
i. Maximum Detection Time (ms.) 2029.2
ii. Average Detection Time (ms.) 946.3
iii. Maximum Reconfiguration Time (ms.) 155.0
iv. Average Reconfiguration Time (ms.) 151.1
v. Number of Faults Detected 25
3. Cause Node 14 to be a Disjoint Leaf Node
a. Fail the link between nodes 12 and 14 by corrupting the protocol
decoder of port 4 of node 14.
N14_CH0901_P7
i. Maximum Detection Time (ms.) 1783.3
ii. Average Detection Time (ms.) 889.3
iii. Maximum Reconfiguration Time (ms.) 154.6
iv. Average Reconfiguration Time (ms.) 150.0
v. Number of Faults Detected 25
b. Fail the link between nodes 12 and 14 by corrupting the input data of
port 4 of node 14.
N14_CH0101_P3
i. Maximum Detection Time (ms.) 2041.5
ii. Average Detection Time (ms.) 1116.3
iii. Maximum Reconfiguration Time (ms.) 176.3
iv. Average Reconfiguration Time (ms.) 142.1
v. Number of Faults Detected 25
c. Fail the link between nodes 12 and 14 by corrupting the input receiver
of port 4 of node 14.
- N14_CH2047_P15
i. Maximum Detection Time (ms.) 2039.5
ii. Average Detection Time (ms.) 982.0
iii. Maximum Reconfiguration Time (ms.) 154.8
iv. Average Reconfiguration Time (ms.) 149.9
v. Number of Faults Detected 25
B. Failed Link - Disjoint Branch
1. Cause Nodes 7 and 9 to be a Disjoint Branch
a. Fail the link between nodes 6 and 7 by corrupting the protocol decoder
of port 2 of node 7.
- N7_CH1301_P7
i. Maximum Detection Time (ms.) 1914.4
ii. Average Detection Time (ms.) 944.7
iii. Maximum Reconfiguration Time (ms.) 130.8
iv. Average Reconfiguration Time (ms.) 126.7
v. Number of Faults Detected 25
b. Fail the link between nodes 6 and 7 by corrupting the input data of port
2 of node 7.
N7_CH0101_P5
i. Maximum Detection Time (ms.) 1466.6
ii. Average Detection Time (ms.) 797.8
iii. Maximum Reconfiguration Time (ms.) 214.4
iv. Average Reconfiguration Time (ms.) 130.3
v. Number of Faults Detected 25
c. Fail the link between nodes 6 and 7 by corrupting the input receiver of
port 2 of node 7.
- N7_CH2047_P13
i. Maximum Detection Time (ms.) 1326.9
ii. Average Detection Time (ms.) 741.3
iii. Maximum Reconfiguration Time (ms.) 186.5
iv. Average Reconfiguration Time (ms.) 183.6
v. Number of Faults Detected 25
2. Cause Nodes 2, 4, 8 and 10 to be a Disjoint Branch
a. Fail the link between nodes 1 and 2 by corrupting the protocol decoder
of port 2 of node 2.
- N2_CH1301_P7
i. Maximum Detection Time (ms.) 2065.5
ii. Average Detection Time (ms.) 1034.5
iii. Maximum Reconfiguration Time (ms.) 196.3
iv. Average Reconfiguration Time (ms.) 187.8
v. Number of Faults Detected 25
b. Fail the link between nodes 1 and 2 by corrupting the input data of port
2 of node 2.
- N2_CH0101_P5
i. Maximum Detection Time (ms.) 1641.5
ii. Average Detection Time (ms.) 811.0
iii. Maximum Reconfiguration Time (ms.) 293.6
iv. Average Reconfiguration Time (ms.) 218.9
v. Number of Faults Detected 25
c. Fail the link between nodes 1 and 2 by corrupting the input receiver of
port 2 of node 2.
- N2_CH2047_P13
i. Maximum Detection Time (ms.) 1863.1
ii. Average Detection Time (ms.) 841.9
iii. Maximum Reconfiguration Time (ms.) 226.6
iv. Average Reconfiguration Time (ms.) 221.8
v. Number of Faults Detected 25
3. Cause Nodes 3, 5, 12 and 14 to be a Disjoint Branch
a. Fail the link between nodes 1 and 3 by corrupting the protocol decoder
of port 1 of node 3.
- N3_CH1101_P7
i. Maximum Detection Time (ms.) 2051.5
ii. Average Detection Time (ms.) 726.2
iii. Maximum Reconfiguration Time (ms.) 145.0
iv. Average Reconfiguration Time (ms.) 137.3
v. Number of Faults Detected 25
b. Fail the link between nodes 1 and 3 by corrupting the input data of port
1 of node 3.
- N3_CH0101_P11
i. Maximum Detection Time (ms.) 2044.4
ii. Average Detection Time (ms.) 1046.6
iii. Maximum Reconfiguration Time (ms.) 298.3
iv. Average Reconfiguration Time (ms.) 220.1
v. Number of Faults Detected 25
c. Fail the link between nodes 1 and 3 by corrupting the input receiver of
port 1 of node 3.
- N3_CH2047_P12
i. Maximum Detection Time (ms.) 1869.4
ii. Average Detection Time (ms.) 758.3
iii. Maximum Reconfiguration Time (ms.) 230.1
iv. Average Reconfiguration Time (ms.) 227.4
v. Number of Faults Detected 25
4. Cause all Nodes to be a Disjoint Branch
a. Fail the link between node 1 and channel A by corrupting the input data
of port 0 of node 1.
- N1_CH0301_P3
i. Maximum Detection Time (ms.) 2060.0
ii. Average Detection Time (ms.) 1339.4
iii. Maximum Reconfiguration Time (ms.) 1090.6
iv. Average Reconfiguration Time (ms.) 1066.8
v. Number of Faults Detected 25
C. Failed Leaf Node
1. Fail Node 15 which is a Leaf Node
a. Fail node 15 by corrupting its control sequencer.
- N15_CH2357_P12
i. Maximum Detection Time (ms.) 2272.7
ii. Average Detection Time (ms.) 1102.8
iii. Maximum Reconfiguration Time (ms.) 216.4
iv. Average Reconfiguration Time (ms.) 191.4
v. Number of Faults Detected 25
The fault that was applied to pin 12 of IC 2357 does not manifest itself
immediately because a relatively large resistor/capacitor network must
first be charged. Consequently, the maximum detection time is
exceeded by a few hundred milliseconds.
b. Fail node 15 by corrupting its port enable register.
- N15_CH1130_P11
i. Maximum Detection Time (ms.) 3094.0
ii. Average Detection Time (ms.) 1142.4
iii. Maximum Reconfiguration Time (ms.) 1303.1
iv. Average Reconfiguration Time (ms.) 271.5
v. Number of Faults Detected 23 (2 Don't Cares)
The fault that was applied to pin 11 of IC 1130 causes the port enable
register to continuously read a byte from the data bus. As a result, at
any given time while the fault is injected, the ports that are enabled
(the ports through which data is transmitted) depend on the value of
the data bus. Consequently, when this port enable information is read
by the control sequencer, it may: (1) correctly depict the ports that
should be enabled, (2) falsely specify that one or more ports are
enabled when they should be disabled and thus cause the
manifestation of a fault, or (3) falsely indicate that disconnected ports
are enabled and thereby create a "don't care" condition (again the fault
does not manifest itself). If situation (1) or (3) occurs during the I/O
Redundancy Management software's initial fault detection check, the
software will not see a fault because the fault does not produce an
error. Since the fault may not cause visible symptoms, the maximum
detection time is not constrained. Furthermore, it is possible that the
applied fault is not detected at all (because the fault is only injected for
approximately 4 seconds). Conversely, if the fault does cause error
symptoms and is detected, the time required to reconfigure the I/O
network may vary considerably because the fault may produce simple,
complex, or dynamic error symptoms.
c. Fail node 15 by corrupting its transmit FIFO.
- N15_CH2047_P4
i. Maximum Detection Time (ms.) 1159.2
ii. Average Detection Time (ms.) 638.9
iii. Maximum Reconfiguration Time (ms.) 211.5
iv. Average Reconfiguration Time (ms.) 208.0
v. Number of Faults Detected 25
d. Fail node 15 by corrupting its node address decoding logic.
- N15_CH1130_P14
i. Maximum Detection Time (ms.) 1885.0
ii. Average Detection Time (ms.) 997.5
iii. Maximum Reconfiguration Time (ms.) 1459.6
iv. Average Reconfiguration Time (ms.) 1051.2
v. Number of Faults Detected 25
2. Fail Node 4 which is a Leaf Node
a. Fail node 4 by corrupting its control sequencer.
- N4_CH2357_P3
i. Maximum Detection Time (ms.) 1839.0
ii. Average Detection Time (ms.) 576.2
iii. Maximum Reconfiguration Time (ms.) 220.0
iv. Average Reconfiguration Time (ms.) 213.2
v. Number of Faults Detected 25
b. Fail node 4 by corrupting its transmit FIFO.
- N4_CH2047_P4
i. Maximum Detection Time (ms.) 1527.3
ii. Average Detection Time (ms.) 682.0
iii. Maximum Reconfiguration Time (ms.) 224.4
iv. Average Reconfiguration Time (ms.) 217.3
v. Number of Faults Detected 25
c. Fail node 4 by corrupting its node address decoding logic.
- N4_CH1130_P14
i. Maximum Detection Time (ms.) 1711.5
ii. Average Detection Time (ms.) 752.1
iii. Maximum Reconfiguration Time (ms.) 484.0
iv. Average Reconfiguration Time (ms.) 228.4
v. Number of Faults Detected 25
3. Fail Node 10 which is a Leaf Node
a. Fail node 10 by corrupting its control sequencer.
- N10_CH2357_P3
i. Maximum Detection Time (ms.) 2051.9
ii. Average Detection Time (ms.) 1076.9
iii. Maximum Reconfiguration Time (ms.) 215.7
iv. Average Reconfiguration Time (ms.) 214.3
v. Number of Faults Detected 25
b. Fail node 10 by corrupting its transmit FIFO.
- N10_CH2047_P4
i. Maximum Detection Time (ms.) 1315.7
ii. Average Detection Time (ms.) 804.8
iii. Maximum Reconfiguration Time (ms.) 491.7
iv. Average Reconfiguration Time (ms.) 227.8
v. Number of Faults Detected 25
D. Failed Node - Disjoint Branch
1. Cause Node 9 to be a Disjoint Branch
a. Fail node 7 by corrupting its control sequencer.
- N7_CH2357_P3
i. Maximum Detection Time (ms.) 1928.7
ii. Average Detection Time (ms.) 718.6
iii. Maximum Reconfiguration Time (ms.) 255.9
iv. Average Reconfiguration Time (ms.) 252.4
v. Number of Faults Detected 25
b. Fail node 7 by corrupting its port enable register.
- N7_CH1130_P11
i. Maximum Detection Time (ms.) 3372.9
ii. Average Detection Time (ms.) 1088.7
iii. Maximum Reconfiguration Time (ms.) 2869.5
iv. Average Reconfiguration Time (ms.) 478.6
v. Number of Faults Detected 22 (3 Don't Cares)
See Test C.1.b for explanation concerning the exceptionally large
maximum detection and reconfiguration times and the "don't care"
conditions.
c. Fail node 7 by corrupting its transmit FIFO.
- N7_CH2047_P4
i. Maximum Detection Time (ms.) 4009.5
ii. Average Detection Time (ms.) 1177.1
iii. Maximum Reconfiguration Time (ms.) 969.4
iv. Average Reconfiguration Time (ms.) 310.5
v. Number of Faults Detected 25
IC 2047 is a decoder that controls the resetting of the node's input receivers and the write
enable for the transmit FIFO. The insertion of the fault into pin 4 is such that the decoder
IC is not enabled and none of its outputs are selected. Therefore no new data can be
written into the transmit FIFO. Nevertheless, the transmit FIFO can still transmit data. If
the data in the transmit FIFO corresponds to a valid node status response, the I/O
Redundancy Management software may not detect the fault during its initial fault check (a
valid response may be transmitted to the FTP). Accordingly, as was observed, the
maximum detection time may exceed the expected maximum time.
d. Fail node 7 by corrupting its node address decoding logic.
- N7_CH1130_P14
i. Maximum Detection Time (ms.) 1873.9
ii. Average Detection Time (ms.) 1131.8
iii. Maximum Reconfiguration Time (ms.) 1239.4
iv. Average Reconfiguration Time (ms.) 1117.9
v. Number of Faults Detected 25
2. Cause Node 13 and 15 to be a Disjoint Branch
a. Fail node 11 by corrupting its control sequencer.
- N11_CH2357_P3
i. Maximum Detection Time (ms.) 2054.0
ii. Average Detection Time (ms.) 983.2
iii. Maximum Reconfiguration Time (ms.) 296.1
iv. Average Reconfiguration Time (ms.) 289.0
v. Number of Faults Detected 25
b. Fail node 11 by corrupting its port enable register.
- N11_CH1130_P11
i. Maximum Detection Time (ms.) 1906.4
ii. Average Detection Time (ms.) 1171.7
iii. Maximum Reconfiguration Time (ms.) 2206.4
iv. Average Reconfiguration Time (ms.) 1037.8
v. Number of Faults Detected 25
See Test C.1.b for explanation concerning the exceptionally large
maximum reconfiguration time.
c. Fail node 11 by corrupting its transmit FIFO.
- N11_CH2047_P4
i. Maximum Detection Time (ms.) 2053.1
ii. Average Detection Time (ms.) 1013.2
iii. Maximum Reconfiguration Time (ms.) 1070.2
iv. Average Reconfiguration Time (ms.) 528.3
v. Number of Faults Detected 25
3. Cause Node 7 and 9 to be a Disjoint Branch
a. Fail node 6 by corrupting its control sequencer.
- N6_CH2357_P3
i. Maximum Detection Time (ms.) 2051.0
ii. Average Detection Time (ms.) 721.9
iii. Maximum Reconfiguration Time (ms.) 378.6
iv. Average Reconfiguration Time (ms.) 295.6
v. Number of Faults Detected 25
b. Fail node 6 by corrupting its transmit FIFO.
- N6_CH2047_P4
i. Maximum Detection Time (ms.) 1954.4
ii. Average Detection Time (ms.) 799.3
iii. Maximum Reconfiguration Time (ms.) 381.7
iv. Average Reconfiguration Time (ms.) 377.1
v. Number of Faults Detected 25
4. Cause Node 4 to be a Disjoint Branch and Nodes 8 and 10 to be a Disjoint
Branch.
a. Fail node 2 by corrupting its control sequencer.
- N2_CH2357_P3
i. Maximum Detection Time (ms.) 1519.8
ii. Average Detection Time (ms.) 727.0
iii. Maximum Reconfiguration Time (ms.) 465.5
iv. Average Reconfiguration Time (ms.) 244.4
v. Number of Faults Detected 25
b. Fail node 2 by corrupting its transmit FIFO.
- N2_CH2047_P4
i. Maximum Detection Time (ms.) 1959.7
ii. Average Detection Time (ms.) 674.1
iii. Maximum Reconfiguration Time (ms.) 292.3
iv. Average Reconfiguration Time (ms.) 288.9
v. Number of Faults Detected 25
5. Cause Node 5 to be a Disjoint Branch and Nodes 12 and 14 to be a Disjoint
Branch.
a. Fail node 3 by corrupting its control sequencer.
- N3_CH2357_P3
i. Maximum Detection Time (ms.) 2063.2
ii. Average Detection Time (ms.) 887.8
iii. Maximum Reconfiguration Time (ms.) 296.4
iv. Average Reconfiguration Time (ms.) 262.5
v. Number of Faults Detected 25
b. Fail node 3 by corrupting its port enable register.
- N3_CH1130_P11
i. Maximum Detection Time (ms.) 2794.4
ii. Average Detection Time (ms.) 1035.3
iii. Maximum Reconfiguration Time (ms.) 2897.3
iv. Average Reconfiguration Time (ms.) 1503.6
v. Number of Faults Detected 25
See Test C.1.b for explanation concerning the exceptionally large
maximum detection and reconfiguration times.
c. Fail node 3 by corrupting its transmit FIFO.
- N3_CH2047_P4
i. Maximum Detection Time (ms.) 2050.2
ii. Average Detection Time (ms.) 739.8
iii. Maximum Reconfiguration Time (ms.) 324.8
iv. Average Reconfiguration Time (ms.) 288.1
v. Number of Faults Detected 25
d. Fail node 3 by corrupting its node address decoding logic.
- N3_CH1130_P14
i. Maximum Detection Time (ms.) 2002.6
ii. Average Detection Time (ms.) 808.8
iii. Maximum Reconfiguration Time (ms.) 557.6
iv. Average Reconfiguration Time (ms.) 353.1
v. Number of Faults Detected 25
E. Failed Root Node
1. Cause all Nodes to be Disjoint
a. Fail node 1 by corrupting its control sequencer.
- N1_CH2357_P3
i. Maximum Detection Time (ms.) 2032.0
ii. Average Detection Time (ms.) 928.5
iii. Maximum Reconfiguration Time (ms.) 400.4
iv. Average Reconfiguration Time (ms.) 390.1
v. Number of Faults Detected 25
F. Simultaneous Node Failures
1. Simultaneously Fail Nodes 8 and 10
a. Fail nodes 8 and 10 by corrupting their control sequencer.
- N8_10_CH2357_P3
i. Maximum Detection Time (ms.) 2046.8
ii. Average Detection Time (ms.) 822.8
iii. Maximum Reconfiguration Time (ms.) 461.2
iv. Average Reconfiguration Time (ms.) 458.8
v. Number of Faults Detected 25
2. Simultaneously Fail Nodes 2 and 8
a. Fail nodes 2 and 8 by corrupting their control sequencer.
- N2_10_CH2357_P3
i. Maximum Detection Time (ms.) 2055.3
ii. Average Detection Time (ms.) 1283.7
iii. Maximum Reconfiguration Time (ms.) 1237.2
iv. Average Reconfiguration Time (ms.) 1078.4
v. Number of Faults Detected 25
3. Simultaneously Fail Nodes 4 and 10
a. Fail nodes 4 and 10 by corrupting their control sequencer.
- N4_10_CH2357_P3
i. Maximum Detection Time (ms.) 2060.5
ii. Average Detection Time (ms.) 1523.9
iii. Maximum Reconfiguration Time (ms.) 997.4
iv. Average Reconfiguration Time (ms.) 992.5
v. Number of Faults Detected 25
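Each test entry above reports the same five statistics over its iterations. The following Python sketch shows one way such a summary could be computed from raw per-iteration times. The function name, the use of None to mark a "don't care" iteration, and the sample values are assumptions made for illustration; they are not taken from the Fault Insertion Software.

```python
from statistics import mean

def summarize(detection_ms, reconfiguration_ms):
    """Summarize one test. None marks a hypothetical "don't care" iteration."""
    det = [t for t in detection_ms if t is not None]
    rec = [t for t in reconfiguration_ms if t is not None]
    return {
        "max_detection_ms": max(det),
        "avg_detection_ms": round(mean(det), 1),
        "max_reconfiguration_ms": max(rec),
        "avg_reconfiguration_ms": round(mean(rec), 1),
        "faults_detected": len(det),
        "dont_cares": len(detection_ms) - len(det),
    }

# Made-up sample data (four iterations, one of them a "don't care").
detection = [500.0, 1500.0, None, 1000.0]
reconfiguration = [200.0, 220.0, None, 210.0]
summary = summarize(detection, reconfiguration)
print(summary)
```

Excluding the "don't care" iterations from the averages mirrors how the entries above report the number of faults detected separately from the "don't care" count.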
3.3.2 Frequency Histograms
Ideally, for each fault insertion test, the fault detection and reconfiguration times for each
iteration of the test should be constant. That is, each inserted fault should be detected in X
ms. and reconfigured around in Y ms. Such consistency will occur only if the I/O fault
detection check is performed frequently and each fault produces invariable error symptoms.
However, because the I/O Redundancy Management software runs only every two seconds
in the Engineering Model, large variations may exist in the detection times. Further, it is
possible, and likely, that multiple insertions of a given fault cause different error
symptoms. Consequently, the time required to reconfigure around one application of the
fault may be Y ms. while the time necessary to bypass another insertion of the same fault
may be Y + Z ms. As a result, histograms of the detection and reconfiguration times are
valuable because they illustrate the repeatability of the I/O Fault Insertion and I/O
Redundancy Management processes.
Figures 3-2 through 3-48 present the frequency histograms for each test. Each illustration
is comprised of two graphs. The upper graph represents the fault detection distribution
while the lower one shows the distribution of the reconfiguration times. The detection and
reconfiguration times were grouped into "buckets" which are collections of times that fall in
a particular range. The times associated with each bucket are indicated on the horizontal
axis of each graph. The "frequency distribution" or the number of entries in each bucket is
depicted on the vertical axis of each graph.
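The bucketing described above can be sketched in a few lines of Python. The bucket width and the sample times below are assumptions chosen for illustration; the report does not state the bucket width used for its figures.

```python
from collections import Counter

def bucketize(times_ms, bucket_width_ms):
    """Map each time to the lower edge of its bucket and count entries per bucket."""
    counts = Counter(int(t // bucket_width_ms) * bucket_width_ms for t in times_ms)
    return dict(sorted(counts.items()))

# Made-up reconfiguration times (ms) and an assumed 50 ms bucket width.
histogram = bucketize([191.4, 205.0, 216.4, 208.3, 480.2], 50)
print(histogram)
```

The keys correspond to the horizontal axis (bucket lower edges in ms) and the values to the vertical axis (frequency).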
The I/O fault detection and reconfiguration histograms are detailed in Sections 3.3.2.1 and
3.3.2.2, respectively.
3.3.2.1 Variance of the Detection Times
As mentioned earlier, the I/O Redundancy Management software performs a fault detection
check every two seconds. The I/O Fault Insertion Software is executed such that the
interval between the successive applications of a given fault varies. Consequently, the time
at which the fault is inserted, with respect to the I/O fault detection check, differs between
successive iterations of the test. If the fault is injected into the I/O network just prior to the
check and it manifests as an error before the detection check, then it will be detected in a
few hundred milliseconds. Alternatively, if the fault is applied just after the completion of
the check and it manifests as an error before the next detection check, then approximately
two seconds will be required to detect it. Faults whose error manifestation latency is
just over 2 seconds or larger would be detected by subsequent detection checks.
Since the detection cycle is two seconds and the relative fault insertion times differ from
iteration to iteration, it was expected that a given test's detection times would vary
considerably. As anticipated, large time variances were observed, and they are portrayed in
the fault detection histograms which are shown in Figures 3-2 through 3-48.
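The dependence of the detection time on the insertion instant can be illustrated with a simplified timing model. This is a sketch under stated assumptions: detection checks are taken to occur at exact multiples of the two-second cycle, the processing time of the check itself is ignored, and the function name and sample values are hypothetical.

```python
import math

CHECK_PERIOD_MS = 2000.0  # detection cycle stated in the text

def detection_time_ms(insert_ms, latency_ms, period_ms=CHECK_PERIOD_MS):
    """Time from fault insertion to the first check at or after the error manifests.

    Assumes checks run at t = 0, period, 2 * period, ... and take no time.
    """
    manifest_ms = insert_ms + latency_ms
    next_check_ms = math.ceil(manifest_ms / period_ms) * period_ms
    return next_check_ms - insert_ms

just_before = detection_time_ms(1900.0, 50.0)    # manifests before the check at 2000 ms
just_after = detection_time_ms(2010.0, 50.0)     # must wait for the check at 4000 ms
slow_manifest = detection_time_ms(1900.0, 300.0) # misses the check at 2000 ms
print(just_before, just_after, slow_manifest)
```

In this model the third case exceeds one full cycle because the error appears only after the nearest check has passed, which is the same mechanism the text describes for faults with a long manifestation latency.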
3.3.2.2 Variance of the Reconfiguration Times
As discussed in Section 3.3.2, the application of a given fault may or may not produce
consistent error symptoms. This inconsistency occurs because the fault insertion times
vary relative to the execution of the detection cycle and the activity of the I/O nodes. If
repeatable error symptoms are generated by the fault, then the reconfiguration times will be
grouped into a few contiguous buckets (for example, as shown in Figure 3-2). If the error
symptoms produced vary, then the histogram may have widely differing times and
accordingly disjoint groupings (for instance, a primary cluster and an exception as depicted
in Figure 3-12). These different reconfiguration times result because the I/O
Reconfiguration Management software traverses different code paths.
The variance in the reconfiguration times, however, may also result if one or more software
errors exist in the I/O Redundancy Management process. As a result, it was desirable to
analyze the results of the I/O fault tests with disjoint buckets in order to confirm the
correctness of the I/O Management software. This analysis was performed by modifying
the Fault Insertion Software (FIS), such that it suspends when a reconfiguration time
significantly differs from the median bucket (for instance, if this difference was greater than
50 ms.). After the FIS stops, the I/O Redundancy trace data, which indicates the path of
the reconfiguration software, of this atypical iteration is compared to that recorded after a
normal reconfiguration. By comparing the traces, the abnormal error signature and
resultant reconfiguration path were determined.
The "normal vs. abnormal" results were carefully examined and in each case, the I/O
Redundancy Management software executed correctly. The atypical reconfiguration times
resulted because of one of four causes: second attempts, inconclusive analysis, presumed
reconnection, and fault signatures of differing complexities. These situations are discussed
in Sections 3.3.2.2.1 through 3.3.2.2.4.
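The check for atypical iterations described above can be sketched as follows. This is only an illustration: it compares each time against the median of the raw times rather than against a median bucket, and the function name, the 50 ms default threshold (taken from the example in the text), and the sample values are assumptions.

```python
from statistics import median

def atypical_iterations(reconfig_ms, threshold_ms=50.0):
    """Return (iteration index, time) pairs that differ from the median by more than the threshold."""
    mid = median(reconfig_ms)
    return [(i, t) for i, t in enumerate(reconfig_ms) if abs(t - mid) > threshold_ms]

# Made-up reconfiguration times (ms); the last one is deliberately atypical.
flagged = atypical_iterations([227.4, 228.0, 229.1, 230.1, 315.0])
print(flagged)
```

An iteration flagged in this way would be the point at which the trace data of the atypical run is compared with that of a normal run.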
3.3.2.2.1 Reconfiguration Variance - Second Reconfiguration Attempts
During the reconfiguration process, typically only one attempt is necessary to activate a
non-faulty path (enable a link between two nodes by actuating the corresponding ports).
Nevertheless, it is possible that a non-faulty path returns an erroneous response to the I/O
network management software due to the breadboard nature of the laboratory nodes.
Therefore, in the process of enabling a link, two attempts are performed. That is, if the
first attempt to enable a non-faulty link fails, then a second attempt is tried. If both attempts
are unsuccessful, then the link is deemed faulty.
Occasionally, during the reconfiguration process, second attempts were performed.
Because of these second attempts, 80 - 90 ms. variations from the median buckets were
observed. The histograms displaying such occurrences are illustrated in Figures 3-9, 3-12,
3-15, 3-18, 3-37, 3-43, and 3-47.
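The two-attempt rule for enabling a link can be sketched as a small retry loop. The callable `try_enable` stands in for whatever operation actuates the ports and checks the response; it and the other names here are hypothetical, not part of the I/O network management software.

```python
def enable_link(try_enable, max_attempts=2):
    """Return True if the link responds correctly within max_attempts, else False (link deemed faulty)."""
    for _ in range(max_attempts):
        if try_enable():
            return True
    return False

# Simulated responses: an erroneous first response followed by a good one.
responses = iter([False, True])
recovered = enable_link(lambda: next(responses))

# Simulated responses: both attempts fail, so the link is deemed faulty.
responses_bad = iter([False, False])
faulty = not enable_link(lambda: next(responses_bad))
print(recovered, faulty)
```

A second attempt costs an extra pass through the enable operation, which is consistent with the offsets from the median bucket reported above.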
3.3.2.2.2 Reconfiguration Variance - Presumed Reconnection
The breadboard nature of the laboratory nodes also contributed to inconsistency in the
"effect" of the applied I/O fault. Specifically, the application of a fault that should
completely fail a node occasionally only failed its inboard port (the active port through
which the node is connected to the I/O virtual path). In such a situation, the I/O
Redundancy Management software reconnected to the node, or the associated disjoint
branch, through a spare link.
If the I/O reconfiguration process reconnects to a node rather than failing it completely
(because the application of the fault does not completely disable the node), then less code is
traversed and accordingly, the reconfiguration time is shorter. Consequently, the
associated reconfiguration histograms are signified by aberrations that are negative offsets
from the median bucket. These histograms are depicted in Figures 3-21, 3-32, 3-34, 3-39,
3-41, 3-43, and 3-45.
3.3.2.2.3 Reconfiguration Variance - Inconclusive Analysis
The I/O reconfiguration process can only isolate a finite number of fault signatures. If an
unidentifiable fault pattern is encountered, then the I/O reconfiguration process regrows the
I/O network using one of several I/O growth algorithms.
During the I/O Fault Insertion Analysis, it was observed that a given fault, that usually
generated a distinguishable error pattern, sometimes produced a fault signature that was
unknown to the I/O Redundancy Management software. When this indeterminate pattern
was encountered, the I/O network was regrown (using an algorithm employing minimal
diagnostics), thus adding approximately 270 - 280 ms. to the reconfiguration time. The
histograms of the tests in which such a regrowth occurred are shown in Figures 3-27,
3-29, 3-39, and 3-44.
3.3.2.2.4 Reconfiguration Variance - Simple and Complex Error
Symptoms
The error symptoms that are produced by a fault may have simple or complex signatures.
Further, the fault pattern may be time-varying (for example, the corruption of the port
enable register). If simple error symptoms occur, then the fault can be isolated and
bypassed quickly. Alternatively, if complex symptoms are produced by the fault, then the
regrowth of the I/O network may be required. In addition, if the symptoms vary with time,
then the I/O redundancy management process may perform multiple regrowth attempts
(because the error signature changes during the growth process).
The histograms of tests in which simple, complex, and time-varying fault signatures occur
are shown in Figures 3-20, 3-22, 3-24, 3-31, 3-32, 3-35, 3-36, and 3-42.
Figure 3-2. Test A.1.a [upper panel "n15_ch1101_p7.det", lower panel "n15_ch1101_p7.rec"; axes: Frequency vs. Time (ms)]
Figure 3-3. Test A.1.b [upper panel "n15_ch0101_p13.det", lower panel "n15_ch0101_p13.rec"; axes: Frequency vs. Time (ms)]
Figure 3-4. Test A.1.c [upper panel "n15_ch2047_p14.det", lower panel "n15_ch2047_p14.rec"; axes: Frequency vs. Time (ms)]
Figure 3-5. Test A.2.a [upper panel "n5_ch1101_p7.det", lower panel "n5_ch1101_p7.rec"; axes: Frequency vs. Time (ms)]
Figure 3-6. Test A.2.b [upper panel "n5_ch0101_p13.det", lower panel "n5_ch0101_p13.rec"; axes: Frequency vs. Time (ms)]
Figure 3-7. Test A.2.c [upper panel "n5_ch2047_p14.det", lower panel "n5_ch2047_p14.rec"; axes: Frequency vs. Time (ms)]
Figure 3-8. Test A.3.a [upper panel "n14_ch0901_p7.det", lower panel "n14_ch0901_p7.rec"; axes: Frequency vs. Time (ms)]
Figure 3-9. Test A.3.b [upper panel "n14_ch0101_p3.det", lower panel "n14_ch0101_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-10. Test A.3.c [upper panel "n14_ch2047_p15.det", lower panel "n14_ch2047_p15.rec"; axes: Frequency vs. Time (ms)]
Figure 3-11. Test B.1.a [upper panel "n7_ch1301_p7.det", lower panel "n7_ch1301_p7.rec"; axes: Frequency vs. Time (ms)]
Figure 3-12. Test B.1.b [upper panel "n7_ch0101_p5.det", lower panel "n7_ch0101_p5.rec"; axes: Frequency vs. Time (ms)]
Figure 3-13. Test B.1.c [upper panel "n7_ch2047_p13.det", lower panel "n7_ch2047_p13.rec"; axes: Frequency vs. Time (ms)]
Figure 3-14. Test B.2.a [upper panel "n2_ch1301_p7.det", lower panel "n2_ch1301_p7.rec"; axes: Frequency vs. Time (ms)]
Figure 3-15. Test B.2.b [upper panel "n2_ch0101_p5.det", lower panel "n2_ch0101_p5.rec"; axes: Frequency vs. Time (ms)]
Figure 3-16. Test B.2.c [upper panel "n2_ch2047_p13.det", lower panel "n2_ch2047_p13.rec"; axes: Frequency vs. Time (ms)]
Figure 3-17. Test B.3.a [upper panel "n3_ch1101_p7.det", lower panel "n3_ch1101_p7.rec"; axes: Frequency vs. Time (ms)]
Figure 3-18. Test B.3.b [upper panel "n3_ch0101_p11.det", lower panel "n3_ch0101_p11.rec"; axes: Frequency vs. Time (ms)]
Figure 3-19. Test B.3.c [upper panel "n3_ch2047_p12.det", lower panel "n3_ch2047_p12.rec"; axes: Frequency vs. Time (ms)]
Figure 3-20. Test B.4.a [upper panel "n1_ch0301_p3.det", lower panel "n1_ch0301_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-21. Test C.1.a [upper panel "n15_ch2357_p12.det", lower panel "n15_ch2357_p12.rec"; axes: Frequency vs. Time (ms)]
Figure 3-22. Test C.1.b [upper panel "n15_ch1130_p11.det", lower panel "n15_ch1130_p11.rec"; axes: Frequency vs. Time (ms)]
Figure 3-23. Test C.1.c [upper panel "n15_ch2047_p4.det", lower panel "n15_ch2047_p4.rec"; axes: Frequency vs. Time (ms)]
Figure 3-24. Test C.1.d [upper panel "n15_ch1130_p14.det", lower panel "n15_ch1130_p14.rec"; axes: Frequency vs. Time (ms)]
Figure 3-25. Test C.2.a [upper panel "n4_ch2357_p3.det", lower panel "n4_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-26. Test C.2.b [upper panel "n4_ch2047_p4.det", lower panel "n4_ch2047_p4.rec"; axes: Frequency vs. Time (ms)]
Figure 3-27. Test C.2.c [upper panel "n4_ch1130_p14.det", lower panel "n4_ch1130_p14.rec"; axes: Frequency vs. Time (ms)]
Figure 3-28. Test C.3.a [upper panel "n10_ch2357_p3.det", lower panel "n10_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-29. Test C.3.b [upper panel "n10_ch2047_p4.det", lower panel "n10_ch2047_p4.rec"; axes: Frequency vs. Time (ms)]
Figure 3-30. Test D.1.a [upper panel "n7_ch2357_p3.det", lower panel "n7_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-31. Test D.1.b [upper panel "n7_ch1130_p11.det", lower panel "n7_ch1130_p11.rec"; axes: Frequency vs. Time (ms)]
Figure 3-32. Test D.1.c [upper panel "n7_ch2047_p4.det", lower panel "n7_ch2047_p4.rec"; axes: Frequency vs. Time (ms)]
Figure 3-33. Test D.1.d [upper panel "n7_ch1130_p14.det", lower panel "n7_ch2047_p13.rec"; axes: Frequency vs. Time (ms)]
Figure 3-34. Test D.2.a [upper panel "n11_ch2357_p3.det", lower panel "n11_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-35. Test D.2.b [upper panel "n11_ch1130_p11.det", lower panel "n11_ch1130_p11.rec"; axes: Frequency vs. Time (ms)]
Figure 3-36. Test D.2.c [upper panel "n11_ch2047_p4.det", lower panel "n11_ch2047_p4.rec"; axes: Frequency vs. Time (ms)]
Figure 3-37. Test D.3.a [upper panel "n6_ch2357_p3.det", lower panel "n6_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-38. Test D.3.b [upper panel "n6_ch2047_p4.det", lower panel "n6_ch2047_p4.rec"; axes: Frequency vs. Time (ms)]
Figure 3-39. Test D.4.a [upper panel "n2_ch2357_p3.det", lower panel "n2_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-40. Test D.4.b [upper panel "n2_ch2047_p4.det", lower panel "n2_ch2047_p4.rec"; axes: Frequency vs. Time (ms)]
Figure 3-41. Test D.5.a [upper panel "n3_ch2357_p3.det", lower panel "n3_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-42. Test D.5.b [upper panel "n3_ch1130_p11.det", lower panel "n3_ch1130_p11.rec"; axes: Frequency vs. Time (ms)]
Figure 3-43. Test D.5.c [upper panel "n3_ch2047_p4.det", lower panel "n3_ch2047_p4.rec"; axes: Frequency vs. Time (ms)]
Figure 3-44. Test D.5.d [upper panel "n3_ch1130_p14.det", lower panel "n3_ch1130_p14.rec"; axes: Frequency vs. Time (ms)]
Figure 3-45. Test E.1.a [upper panel "n1_ch2357_p3.det", lower panel "n1_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-46. Test F.1.a [upper panel "n8_10_ch2357_p3.det", lower panel "n8_10_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-47. Test F.2.a [upper panel "n2_10_ch2357_p3.det", lower panel "n2_10_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-48. Test F.3.a [upper panel "n4_10_ch2357_p3.det", lower panel "n4_10_ch2357_p3.rec"; axes: Frequency vs. Time (ms)]
Figure 3-49. The Probability Density Function for the Detection Times [plot: Percentage vs. Time (ms)]
3.3.3 Probability and Cumulative Density Functions
The I/O Fault Insertion Plan is comprised of 47 tests. Each test consists of 25 iterations.
For each iteration, the fault detection and reconfiguration times were recorded by the Fault
Insertion Software. Accordingly, a total of 1175 (47 times 25) data sets were anticipated.
As described in Section 3.3.1, however, five of the 1175 fault applications generated
"don't care" error symptoms. As a result, 1170 sets of data, or 99.6 percent of the applied
faults, were recorded.
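The counts quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the five "don't care" applications are the two reported for Test C.1.b and the three reported for Test D.1.b in Section 3.3.1.

```python
tests = 47
iterations_per_test = 25
dont_cares = 2 + 3  # Test C.1.b (2) and Test D.1.b (3)

applied = tests * iterations_per_test
recorded = applied - dont_cares
recorded_percent = round(100 * recorded / applied, 1)
print(applied, recorded, recorded_percent)
```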
Probability and cumulative density functions for these data sets were generated to complete
the I/O Network Fault Insertion Analysis; these functions are illustrated in Figures 3-49
through 3-52. The probability density function for the I/O fault detection times depicts the
wide variance in the observed times (discussed in Section 3.3.2.1). In the 0 to 2000 ms.
range, which is the detection cycle for the AIPS Distributed Engineering Model, the
percentage of events in each bucket is relatively consistent, ranging from 2.5 to 8.5
percent. In the 2000 ms. and over range, this percentage decreases to less than one-half
percent. As a result, as shown by the cumulative density function for the detection times,
approximately 97 percent of all faults were detected in 2000 ms. or less.
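As a rough illustration of how percentage-per-bucket and cumulative-percentage curves of this kind can be derived from a list of recorded times, consider the sketch below. The bucket width, the number of buckets, the overflow handling, and the sample times are all assumptions; they do not reproduce the report's data.

```python
def density_functions(times_ms, bucket_width_ms, n_buckets):
    """Return (pdf, cdf) as lists of percentages per bucket; the last bucket collects overflow."""
    counts = [0] * n_buckets
    for t in times_ms:
        idx = min(int(t // bucket_width_ms), n_buckets - 1)
        counts[idx] += 1
    total = len(times_ms)
    pdf = [100.0 * c / total for c in counts]
    cdf = []
    running = 0
    for c in counts:
        running += c
        cdf.append(100.0 * running / total)
    return pdf, cdf

# Made-up detection times (ms), 500 ms buckets, five buckets.
pdf, cdf = density_functions([100, 300, 600, 1100, 2100], 500, 5)
print(pdf)
print(cdf)
```

Reading the cumulative list at the bucket that ends at 2000 ms gives the share of faults detected within one detection cycle, which is how a figure such as the 97 percent quoted above would be read off the curve.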
Figure 3-50. The Cumulative Density Function for the Detection Times [plot: Percentage vs. Time (ms)]
The probability density function for the I/O fault reconfiguration times indicates that a
significant percentage of the faults were bypassed in 100 to 500 ms. while a substantial but
smaller percentage were reconfigured around in 900 to 1100 ms. This latter range
represents the percentage of times that the reconfiguration process employed the network
regrowth algorithm that uses "low-level diagnostics" (several types of growth algorithms
were designed for the AIPS I/O network). Furthermore, as depicted by the cumulative
density function in Figure 3-52, approximately 85 percent of all faults were isolated and
bypassed in 500 ms. or less.
3.4 I/O Network Fault Insertion: Conclusions
To conclude the discussion of the I/O Fault Insertion Plan, the Fault Insertion results are
presented with respect to the goals of the Fault Injection Study. In brief, the objectives of
the I/O analysis were:
1. to test the design specification for fault tolerance,
2. to obtain feedback for fault removal from the design implementation,
3. to obtain statistical data regarding fault detection, isolation, and reconfiguration
(FDIR) responses, and
4. to obtain data regarding the effects of faults on system performance.
Figure 3-51. The Probability Density Function for the Reconfiguration Times [plot: Percentage vs. Time (ms)]
Figure 3-52. The Cumulative Density Function for the Reconfiguration Times [plot: Percentage vs. Time (ms)]
The process of validating the fault tolerant specifications and obtaining fault removal
feedback is concerned with the presence of design flaws and the fault coverage. As
detailed in Section 3.3, the limited I/O Fault Insertion process that was performed did not
find any design errors. This does not necessarily prove that the design of the network is
completely error free; however, the level of confidence in the design is greater now than it
was before the fault injection experiments. The fault detection coverage was 99.6%; 0.4%
of the faults did not produce any detectable error symptoms. The reconfiguration coverage
for detected faults was 100%.
To confirm the coverage of the I/O faults, several devices were utilized. The detection of
the fault was verified using the Fault Insertion Software (FIS) and light emitting diodes
(LEDs). The FIS program records whether or not a fault was detected. Additionally, the
manifestation of a fault is indicated by the I/O network LEDs, which are diodes that depict
the configuration of the I/O network. The FIS logs and the network LEDs were examined
by the I/O fault insertion supervisor to verify that each fault was detected.
The reconfiguration of the network was also validated using the FIS program and the I/O
network LEDs. In addition, the I/O Network Redundancy Management logs were
employed. Similar to the fault detection process, the FIS program was analyzed to
determine if the applied fault was bypassed. Further, since the LEDs indicate the
configuration of the network and the I/O reconfiguration strategy is deterministic, the
expected reconfiguration was calculated by the fault injection supervisor and verified by
examining the LEDs. Moreover, the I/O Redundancy Management logs, which identify the
fault and summarize the reconfiguration process, were periodically examined by the fault
injection supervisor to confirm the correctness of the I/O Network Management algorithm.
The third and fourth objectives of the I/O Fault Insertion study are concerned with
recording data to measure the performance of the FDIR process and quantifying how the
I/O faults affect system performance. As presented in Sections 3.3.1 through 3.3.3, the
I/O Fault Insertion process recorded 1170 sets of data. The maximum and average
detection and reconfiguration times, the corresponding frequency histograms, and the
probability and cumulative distribution functions were calculated. This information
accurately characterizes the performance of the I/O FDIR process on the AIPS Distributed
Engineering Model. Furthermore, the data illustrates the effect that various I/O faults have
on the overall performance of the Model.
As illustrated in Section 3.3.1, the I/O Fault Insertion results typically conformed to the
expected maximum and average times. The anticipated maximum detection time was about
2075 ms. plus the error latency. Accordingly, the expected average time was
approximately 1040 ms. plus the error latency. Due to the number of faults and the
complexity of their error symptoms, the maximum and average reconfiguration times were
not explicitly calculated for each fault (calculations would be numerous and gross
approximations). The worst case reconfiguration time, however, was determined to be
approximately 3500 ms. (fault diagnostics plus the worst case regrowth scenario). As
estimated, all reconfiguration times were less than the worst case. Since only the worst
case reconfiguration time was calculated, the I/O reconfiguration times were primarily used
to characterize the performance of the I/O FDIR process rather than to compare the
measured data to their corresponding expectations.
The performance of the FDIR process significantly improves if the AIPS/ALS technology
projections are considered. The throughput performance of the projected AIPS/ALS Fault
Tolerant Processor is approximately 50 times the current capabilities of the AIPS
Distributed Engineering Model, with respect to the I/O Redundancy Management code.
Consequently, the maximum I/O fault detection time will be reduced from 2075 ms. to
approximately 42 ms., and the corresponding average time will be about 21 ms. (plus error
latency). The reconfiguration times will also be reduced by a factor of 50. For example, a
failed link that causes a disjoint node, which is typically bypassed in about 150 ms. by the
AIPS Engineering Model, can be reconfigured around in 3 ms. with the AIPS/ALS system.
Since the AIPS/ALS Fault Tolerant Processor will significantly decrease the fault detection
and reconfiguration times (with respect to the Engineering Model), the effect that I/O faults
have on the system performance will also be greatly reduced.
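The projected figures quoted above follow from dividing each Engineering Model time by the throughput factor. The short sketch below reproduces that arithmetic; the helper name `projected_ms` and the dictionary labels are illustrative only, and error latency is excluded because it does not scale with processor throughput.

```python
# Throughput factor quoted in the text (approximately 50x).
SPEEDUP = 50


def projected_ms(engineering_model_ms: float, speedup: float = SPEEDUP) -> float:
    """Return the projected AIPS/ALS time in ms, excluding error latency."""
    return engineering_model_ms / speedup


# Engineering Model times taken from the text, in milliseconds.
measured = {
    "maximum detection": 2075,
    "average detection": 1040,
    "disjoint-node bypass": 150,
}

projected = {name: projected_ms(ms) for name, ms in measured.items()}

for name, ms in projected.items():
    print(f"{name}: {measured[name]} ms -> {ms:.1f} ms")
```

The unrounded results (41.5, 20.8 and 3.0 ms) agree with the approximate values of 42, 21 and 3 ms given in the text.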
4.0 APPLYING FAULTS TO THE CORE FTP
This chapter discusses the Core FTP Fault Injection Plan, which is a set of faults that were
injected into the core of an AIPS Fault Tolerant Processor. Section 4.1 describes the AIPS
Fault Tolerant Processor while Section 4.2 summarizes the Fault Detection, Identification
and Reconfiguration algorithm. The test cases comprising the Core FTP Fault Injection
Plan are specified in Section 4.3, and the results of the test cases are presented in Section
4.4. Section 4.5 contains conclusions and summary remarks about the test cases.
4.1 Overview of the AIPS Fault Tolerant Processor
The Fault Tolerant Processor (FTP) consists of a variable number of redundant processing
channels depending on the reliability requirements of the application. The AIPS
Engineering Model FTP is intended to be operated primarily as a triplex, but it provides
fail-stop capability when operated as a duplex. A single channel can also be used for non-
critical operations as a simplex computer.
Each channel of an FTP consists of three sections: a computational section, an input/output
section, and shared resources. The first section contains a Computational Processor (CP),
memory, timers and clocks. The second section contains an Input/Output Processor (IOP),
memory, timers, and clocks. The shared resources include shared memory, data exchange
hardware, timers, and external interface hardware. The redundant processors are tightly
synchronized using a fault-tolerant clock. Data is exchanged among redundant channels on
point-to-point links. The data exchange hardware also performs the bit-for-bit voting, fault
detection and masking functions in a manner that satisfies all the requirements to protect the
FTP from Byzantine failures. Apart from redundancy, there are other features that provide
hardware and software fault tolerance. These include watchdog timers, processor
interlocks, a privileged operating mode, handlers for hardware and software exceptions,
and self tests.
A functional view of one channel of an AIPS FTP is shown in Figure 4-1. The CP and IOP
are identical, conventional processor architectures, and each processor refers to the other as
its companion. Interval timers are used for scheduling tasks and maintaining time-out
limits on applications tasks (task watchdog timers). A hardware watchdog timer is
provided to increase fault coverage and to cause a processor to fail-safe in case of hardware
or software malfunctions. This timer resets the processor and disables all of its outputs, if
it is not reset periodically. The watchdog timer is implemented independently of the basic
processor timing circuitry. A monitor and interlock circuit in each channel provides the
capability to disable the outputs of faulty processors. Any two correctly operating
processors in a triplex FTP can disable the outputs of the third failed processor through this
interlock mechanism. A processor that is failed active is thus prevented from transmitting
erroneous data or commands on I/O networks, IC networks, and local I/O devices.
The CP and IOP share resources through a bus that can be accessed by either processor.
These shared resources include memory; a system timer; the interchannel circuitry for the
data exchange, fault-tolerant clock and monitor interlock; and interfaces to one or more I/O
networks, memory mapped I/O devices, and the IC network.
[Block diagram. Labeled blocks: Computational Processor (Processor, Memory, Interval Timers,
Dedicated I/O); Shared Resources (Shared Memory, System Timer, Interchannel: Data Exchange,
Fault Tolerant Clock, Monitor Interlock; I/O Network; GPC Comm; Inter-Computer Network);
Input/Output Processor (Processor, Memory, Interval Timers, Dedicated I/O).]
Figure 4-1. Fault Tolerant Processor: Functional View (One Channel)
One very important aspect of the FTP architecture is the interconnection hardware between
redundant channels. The interchannel data exchange and voting hardware serves three
purposes: it provides a path for distributing data in one channel to all other channels; it
provides a mechanism for comparing results of the redundant channels; and it provides a
path for distributing and comparing timing and control signals such as the fault tolerant
clock and external interrupts.
The same software executes on a redundant FTP as on a simplex channel and application
code is written as if it were to operate on a simplex computer. All redundant processors
have identical software and execute identical instructions at exactly the same time. This
feature of the architecture is carried out in the data exchange hardware and software as well.
The data exchange hardware is designed such that all redundant processors execute
identical instructions when exchanging data whether it is redundant data to be voted or
simplex data being transmitted from one channel to others. Thus, for example, if a simplex
exchange is to be made from channel A, all three channels write to their FROM_A register.
While the contents of the FROM_A register are transmitted from A, voted, and deposited in
the receive registers of all three processors, the contents of the FROM_A registers in
channels B and C, which are meaningless, are ignored.
On a routine basis, the internally produced data that needs to be exchanged consists of error
information and cross channel comparisons of results for fault detection. These operations
can be easily confined to the program responsible for Fault Detection, Identification, and
Reconfiguration (FDIR). Therefore, the remaining pieces of the Operating System software
and the applications programs need not be aware of the existence of the data exchange
registers.
4.2 FTP Fault Detection, Identification and Reconfiguration
The AIPS FTP uses hardware redundancy with fault detection and masking capabilities to
provide fault tolerance. The fault tolerance provided by the hardware is greatly enhanced by
the Fault Detection, Identification and Reconfiguration (FDIR) functions which are part of
the FTP local operating system. While the hardware alone in a triplex FTP could sustain
one fault, the FDIR software allows it to sustain multiple successive faults and identifies the
fault(s) for an operator, thus making the FTP much more robust and serviceable.
The FDIR software in AIPS has two main functions:
• identifying a failed channel, i.e., detecting a fault, isolating it to a single
channel, masking the faulty channel's inputs, and disabling its outputs.
• recovering a failed channel, i.e., determining that the fault no longer exists,
bringing the channel into line with the two synchronized channels, accepting the
channel's inputs, and enabling its outputs.
These functions are described in more detail below.
4.2.1 Fault Detection and Identification
Fault detection mechanisms are implemented in both hardware and software, while the
identification mechanisms are implemented solely in software. Instruction-level
synchronization together with bit-for-bit comparison of redundant data makes it possible to
isolate a fault to a single channel.
There are four main processes which detect and identify faults:
• Fast FDIR. A periodic, high-priority task which checks for failure of a
companion, an unsynchronized channel, a fault in the data exchange hardware,
and a fault in the fault-tolerant clock;
• Watchdog Timer Reset. A periodic, high-priority task which taps the watchdog
timer within the given time bounds so that the timer does not overrun and cause a
hardware reset;
• Background Selftests. A low-priority task which does tests to uncover latent
faults in memory, voting circuitry and error latches, and the real-time clock; and
• Hardware Exception Handler. A procedure for handling M68010 hardware
exceptions such as an illegal instruction or addressing error.
After a channel is identified as being faulty, the FTP is reconfigured so that the faulty
channel does not affect FTP operation. The errors generated by the channel are masked,
and its outputs are stopped. This is done by a procedure, Reconfigure, that sets a software
variable identifying the channel as failed, disengages the monitor interlock so that outputs
from the faulty channel are disabled, and logs the fault and the reconfiguration for later
examination by an operator.
4.2.1.1 Fast FDIR
Fast FDIR is one of four tasks which detect and isolate errors. It is a high-priority task
which runs every 40 ms on both the CP and IOP. It checks for:
• a fault reported by its companion processor
• an unsynchronized channel
• a fault detected but not reconfigured around by the selftests
• a data exchange fault, i.e., a fault either in the interstage or in any link in the
data exchange hardware
• a fault in the fault-tolerant clock
• a missing companion processor (i.e., a companion that is not executing FDIR at
the prescribed rate).
Error detection is done in the order given above. If any particular test uncovers a fault, the
remaining tests are not done during that iteration of Fast FDIR. When an error is detected,
the error is logged, the FTP is reconfigured to exclude the faulty channel, and the
reconfiguration is logged.
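The ordered, stop-at-first-fault behavior of one Fast FDIR iteration can be sketched as follows. This is an illustration only, not the AIPS Ada implementation: the check functions are hypothetical stubs, the channel labels are assumed, and the log is a plain list.

```python
from typing import Callable, List, Optional, Tuple

# Each check returns the faulty channel ("A", "B" or "C") or None.
Check = Tuple[str, Callable[[], Optional[str]]]


def fast_fdir_iteration(checks: List[Check], log: List[str]) -> Optional[str]:
    """Run the checks in order and stop at the first one that finds a fault."""
    for name, check in checks:
        faulty = check()
        if faulty is not None:
            log.append(f"error: {name} (channel {faulty})")
            log.append(f"reconfigured: channel {faulty} excluded")
            return faulty  # remaining checks are skipped this iteration
    return None


# Hypothetical stubs, listed in the order given in the text.
calls: List[str] = []


def stub(name: str, result: Optional[str]) -> Check:
    def run() -> Optional[str]:
        calls.append(name)
        return result
    return (name, run)


checks = [
    stub("companion-reported fault", None),
    stub("unsynchronized channel", None),
    stub("selftest fault not reconfigured", None),
    stub("data exchange fault", "B"),
    stub("fault-tolerant clock fault", "C"),
    stub("missing companion", None),
]

log: List[str] = []
result = fast_fdir_iteration(checks, log)
```

In this run the data exchange check reports channel B, so the clock and missing-companion checks are not executed until the next 40 ms iteration.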
4.2.1.2 Watchdog Timer Reset
The second fault detection process is the Watchdog Timer Reset process. This process does
not perform fault detection functions in quite the same way, however, as other processes in
this category, i.e., by responding to a specific fault. Rather, the failure of this task to
execute at its scheduled period would indicate a critical fault in either hardware or software
and would cause a hardware reset.
The watchdog timer is a hardware component whose purpose is to prevent infinite software
loops or hardware faults from hanging up the system. After it has been started, the
watchdog must be cleared periodically within a set time window; if it is not cleared within
this window (i.e., either too early or too late) a hardware reset occurs. On the AIPS FTP
this window is 60-120 milliseconds plus or minus 10%. The Watchdog Timer Reset
process performs the function of clearing the watchdog timer.
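A minimal sketch of the window check is given below. It assumes one particular reading of the tolerance: because each edge of the 60-120 ms window may vary by 10%, a clear is treated as guaranteed safe only between 66 ms and 108 ms after the previous clear. The constant and function names are hypothetical.

```python
NOMINAL_EARLIEST_MS = 60
NOMINAL_LATEST_MS = 120
TOLERANCE_PERCENT = 10

# Assumed interpretation: apply the tolerance inward on both edges so that
# the clear is safe regardless of where the hardware window actually falls.
SAFE_EARLIEST_MS = NOMINAL_EARLIEST_MS * (100 + TOLERANCE_PERCENT) // 100
SAFE_LATEST_MS = NOMINAL_LATEST_MS * (100 - TOLERANCE_PERCENT) // 100


def clear_is_safe(ms_since_last_clear: int) -> bool:
    """True if clearing the watchdog now cannot cause a hardware reset."""
    return SAFE_EARLIEST_MS <= ms_since_last_clear <= SAFE_LATEST_MS
```

Under this reading, a task that clears the timer every 80 ms is always safe, while clears at 50 ms (too early) or 130 ms (too late) would cause a reset.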
4.2.1.3 Background Self Test
The third fault detection process is the Selftest task, a task of the lowest priority that runs
when there are no higher priority tasks to be executed. Its job is to uncover latent faults,
that is, faults which exist but which have not yet caused data exchange errors or
desynchronization of a channel. This task tests memory, the voting circuitry and error latches, and the
real-time clock.
The following memory tests are performed:
• PROM sum check. This test verifies that all channels have identical values in
ROM by doing a sum check.
• RAM scrub. This test checks each memory location to ensure that the values are
identical among the three channels.
• RAM pattern test. This test checks the functionality of each location. It tests
each bit's ability to hold both a 1 and a 0 by writing specific patterns to each
word.
• Shared memory scrub. This test checks each memory location in shared
memory to ensure that the values are identical among the three channels.
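As an illustration of the RAM pattern test, the sketch below exercises a simulated word-addressed memory. The 16-bit word size, the complementary patterns 0x5555 and 0xAAAA, and the `SimulatedRam` class with its stuck-at-zero fault model are all assumptions made for the example; the report does not state which patterns the AIPS selftest writes.

```python
from typing import Dict, List, Optional, Sequence

WORD_MASK = 0xFFFF  # assumes 16-bit memory words


class SimulatedRam:
    """Word-addressed RAM model with optional stuck-at-zero bits."""

    def __init__(self, size: int, stuck_at_zero: Optional[Dict[int, int]] = None):
        self.words = [0] * size
        # Maps an address to a mask of bits that always read back as 0.
        self.stuck_at_zero = stuck_at_zero or {}

    def write(self, addr: int, value: int) -> None:
        self.words[addr] = value & WORD_MASK & ~self.stuck_at_zero.get(addr, 0)

    def read(self, addr: int) -> int:
        return self.words[addr]


def ram_pattern_test(ram: SimulatedRam, size: int,
                     patterns: Sequence[int] = (0x5555, 0xAAAA)) -> List[int]:
    """Return the addresses that fail to hold every pattern.

    The two default patterns are complements, so together they check that
    each bit of each word can hold both a 1 and a 0.
    """
    failing = []
    for addr in range(size):
        saved = ram.read(addr)
        ok = True
        for pattern in patterns:
            ram.write(addr, pattern)
            if ram.read(addr) != pattern:
                ok = False
        ram.write(addr, saved)  # restore the original contents
        if not ok:
            failing.append(addr)
    return failing
```

A memory with bit 2 of word 3 stuck at zero fails at address 3 only, because the 0x5555 pattern sets that bit.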
The voter circuitry and error latches are tested by writing normal and faulty patterns of data
to the voters. After these votes, both the resulting values and the error latches are checked
to confirm that all errors were properly latched and corrected.
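The voter and error latch test described above can be sketched in software as a bit-for-bit majority vote with one latch per channel. The test pattern 0xA5A5 and the function names are hypothetical, and the actual voting is performed by the data exchange hardware rather than by code such as this.

```python
from typing import Dict, Tuple


def vote(a: int, b: int, c: int) -> Tuple[int, Dict[str, bool]]:
    """Bit-for-bit majority vote of three words plus per-channel error latches."""
    voted = (a & b) | (a & c) | (b & c)
    latches = {"A": a != voted, "B": b != voted, "C": c != voted}
    return voted, latches


def voter_self_test(good: int = 0xA5A5) -> bool:
    """Write a normal pattern, then one faulty pattern per channel."""
    voted, latches = vote(good, good, good)
    if voted != good or any(latches.values()):
        return False
    for faulty in "ABC":
        inputs = {ch: good for ch in "ABC"}
        inputs[faulty] = good ^ 0xFFFF  # complement the word on one channel
        voted, latches = vote(inputs["A"], inputs["B"], inputs["C"])
        expected = {ch: ch == faulty for ch in "ABC"}
        # The error must be corrected and latched for the right channel only.
        if voted != good or latches != expected:
            return False
    return True
```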
The real-time clock is tested by reading the current value and ensuring that it is identical
among the three channels.
4.2.1.4 Hardware Exception Handler
The exception handler is the fourth fault detection process. It is invoked when there is a
hardware exception such as an illegal instruction or an address error. A presence test is
done to determine if the exception was caused by a hardware error or a generic software
error. If the results of the presence test show that the processor is alone, this implies a
hardware error. If the presence test shows that the processor is with others, this implies a
generic software error. In either case, the exception and the results of the presence tests are
logged in the non-congruent log, and the processor(s) are restarted.
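The decision made by the exception handler can be summarized in a few lines. In this sketch the set of present channels stands in for the result of the presence test, and the function name and log record layout are hypothetical.

```python
from typing import List, Set


def handle_hardware_exception(exception: str, present: Set[str],
                              this_channel: str, log: List[dict]) -> str:
    """Classify an exception from the presence test result and log it."""
    others = present - {this_channel}
    # Alone: only this channel took the exception, so hardware is suspected.
    # With others: all channels took it together, so the software is suspected.
    cause = "generic software error" if others else "hardware error"
    log.append({"exception": exception,
                "present": sorted(present),
                "cause": cause})
    return cause
```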
4.2.2 Channel Recovery
The reliability of the FTP is greatly enhanced if channels previously diagnosed as faulty but
currently operating without faults can be brought back into the FTP configuration. How a
channel is recovered depends on the type of failure, i.e., whether or not the fault has
caused the channel to fall out of sync. A failure in the data exchange hardware does not
desynchronize a channel, while other kinds of failures do. An FTP has recovered from a
fault, therefore, when
• the failed channel can be resynchronized, or
• the failed channel no longer shows errors in the data exchange hardware.
When a channel has been recovered, the FTP must be reconfigured so that the recovered
channel participates in the FTP operation and another fault can be tolerated.
There are three main processes involved in channel recovery. Transient FDIR
distinguishes between transient and hard faults when a channel recovery is being attempted
in order to balance competing system needs. Lost Soul Sync is responsible for
resynchronizing an unsynchronized channel, i.e., synchronizing it to the instruction level
and making its internal state the same as the duplex processors. Finally, the Restart
process is invoked when a second fault or a common-mode failure occurs. These faults
result in a fail-safe condition, which the AIPS FTP responds to with a system restart.
4.2.2.1 Transient FDIR
When recovering a failed channel, system resources are used most efficiently if a
distinction is made between transient faults and hard faults. Transient faults are assumed to
be caused by some temporary environmental condition (e.g., a power surge). By
definition, they are expected to disappear with time. Hard faults, on the other hand, are
caused by breakdowns of the FTP hardware that must be physically repaired.
The attempt to recover a failed channel could be made automatically (i.e., the software
periodically tests the channel to determine its current state) or it could be made solely under
operator direction (i.e., the operator enters a command indicating the channel has been
repaired). The first method satisfies the need to recover the channel as quickly as possible
while the second method satisfies the need to not waste system resources by repeatedly
testing a channel with a hard fault. Transient FDIR strikes a balance between these two
needs by initially assuming that any particular fault is transient (it has been observed that 50
to 80 percent of all faults in computer systems are transient) and automatically attempting a
recovery. As time passes without the channel being recovered, it becomes more likely that
the fault is a hard fault rather than a transient, and Transient FDIR makes the recovery
attempt less often. After a certain period it can reasonably be assumed that the fault is a
hard fault. Then Transient FDIR either waits for an operator signal or, in the case where
there is no operator, tests the channel only at some infrequent interval such as its mean time
to repair.
Additionally, it has been noted that hard faults tend to manifest themselves sporadically. A
channel may be recovered according to the above criteria, but may immediately fail again.
Transient FDIR attempts to prevent this situation by regarding a recovered channel as
recovered only on a trial basis. If the channel passes its trial period without further errors,
it is regarded as fully recovered and can be added back into the FTP configuration.
Intermittent faults which occur at infrequent intervals (i.e., after the trial period has passed)
will not be handled by this scheme, however, but will be regarded as new faults.
This distinction between transient and hard faults thus defines the two functions of
Transient FDIR:
• It decides when it is appropriate to attempt to recover a failed channel.
• Once a channel is seen as fault-free, it monitors its health for a brief probation
period before declaring it fully recovered.
Attempting Channel Recovery
The initial response to all detected faults is to mask the fault and disable all outputs from the
faulty channel. Thereafter, the status of the failed channel is periodically "sampled" to
determine if the fault is transient. Immediately after a failure, a recovery attempt is made
and a sampling of the channel's health is taken. If the attempt fails (i.e., the
unsynchronized channel cannot be found or the data exchange latches still show errors), the
time between successive attempts is doubled, until Mean Time To Repair (MTTR) is
reached. This time delay between successive recovery tries and the samplings of the status
of a failed channel is a function of state variables representing the "health" of the channel.
The "health" variable, in turn, is a function of the error history of the particular channel
with many recent fault observations for the channel indicating "poor" health and declining
fault observations representing "good" health. The time between recovery attempts is
doubled following each status sampling which indicates the fault is still present. This
sampling sequence is repeated until either the fault status changes to indicate the fault is no
longer present or an upper threshold on the retry time is crossed, at which point the fault is
deemed "hard". From this point on recovery will be attempted only when an operator has
signaled that the channel has been repaired or, in the case where there is no operator, when
another MTTR period has passed.
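The doubling retry schedule can be sketched as follows. The initial delay and the MTTR value used in the example are hypothetical, and treating the MTTR itself as the upper threshold on the retry time is an assumption; the report does not give numeric values for either.

```python
from typing import Callable, List, Optional


def recovery_schedule(initial_delay_s: float, mttr_s: float) -> List[float]:
    """Delays between successive recovery attempts, doubled until MTTR is reached."""
    delays = []
    delay = initial_delay_s
    while delay < mttr_s:
        delays.append(delay)
        delay *= 2
    delays.append(mttr_s)  # the final delay is capped at MTTR
    return delays


def attempt_recovery(fault_present_at: Callable[[float], bool],
                     initial_delay_s: float, mttr_s: float) -> Optional[float]:
    """Return the time of the first successful attempt, or None if deemed hard."""
    now = 0.0
    if not fault_present_at(now):  # attempt made immediately after the failure
        return now
    for delay in recovery_schedule(initial_delay_s, mttr_s):
        now += delay
        if not fault_present_at(now):
            return now
    return None  # threshold crossed: the fault is treated as a hard fault
```

With an initial delay of 1 s and an MTTR of 10 s, attempts fall at 0, 1, 3, 7, 15 and 25 s, so a transient that clears after 5 s is recovered at the 7 s attempt.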
Probation Monitoring
After a channel has been recovered, it must undergo a trial period before being declared
fully recovered and functional. The length of this period is a function of the "health" of the
channel and depends on the number and type of faults. A channel with multiple faults in
quick succession (i.e., the channel fails before it has passed its trial period) will have a
longer trial period than if it only had a single fault. Faults that desynchronize a channel
require a longer trial period than data exchange errors.
4.2.2.2 Lost Soul Sync
Lost Soul Sync is the process of attempting to resynchronize a previously failed channel (a
"lost soul") and, if successful, bringing it to the same state as the two good channels. This
process has two main steps:
• resynchronizing the channel, i.e., synchronizing it to the instruction level with
the other two channels, and
• aligning the channel, i.e., making its volatile memory and registers the same as
those of the other two channels. This ensures that after the code execution is
synchronized, only a fault could cause a channel to lose synchronization, rather
than, for example, a memory location that contained an incorrect value.
This task is described in detail in [4], for the reader who is unfamiliar with the operation of
Lost Soul Sync.
4.2.2.3 System Restart
Certain faults which are detected may be of such a magnitude that they are unsustainable
and may be recovered from only by restarting the system. Examples of such faults are a
second fault detected by Fast FDIR and common-mode faults. The restart process
accomplishes the system restart without requiring operator intervention.
4.2.3 Reconfiguration
The reconfiguration process is invoked by the fault detection and identification tasks when
a fault has been identified and by the recovery tasks when a channel has been repaired.
During a reconfiguration a channel is either removed from the configuration because it was
found to be faulty, or added in because the fault no longer exists.
When a channel is identified as being faulty, the errors generated by that channel must be
masked and its outputs must be stopped. This is done by (1) setting a software variable
which identifies the channel as having failed, and (2) disengaging the monitor interlock so
the channel's outputs are disabled.
When a channel has recovered from a failure, its inputs must be accepted and its outputs
enabled. This is done by the reverse process, i.e., (1) setting the software variable to say
that the channel is now functional, and (2) engaging the monitor interlock so the channel's
outputs are enabled.
When a channel has recovered from a failure, it is considered part of the configuration only
by FDIR until its probation period has expired. This is done by setting a software variable
to say that the channel is enabled on a trial basis.
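The three reconfiguration cases above amount to a small state machine per channel. The sketch below models the software variable as an enumeration and the monitor interlock as a boolean; the class and method names are hypothetical, and leaving the interlock unchanged when probation begins is an assumption, since the text does not say when during recovery the interlock is engaged.

```python
from enum import Enum
from typing import Dict, List, Sequence


class ChannelState(Enum):
    FUNCTIONAL = "functional"
    FAILED = "failed"
    TRIAL = "trial"


class FtpConfiguration:
    """Tracks the software state variable and interlock for each channel."""

    def __init__(self, channels: Sequence[str] = ("A", "B", "C")):
        self.state: Dict[str, ChannelState] = {
            ch: ChannelState.FUNCTIONAL for ch in channels}
        self.interlock_engaged: Dict[str, bool] = {ch: True for ch in channels}
        self.log: List[str] = []

    def remove_channel(self, ch: str) -> None:
        """Mask a faulty channel and disable its outputs."""
        self.state[ch] = ChannelState.FAILED
        self.interlock_engaged[ch] = False
        self.log.append(f"channel {ch} removed")

    def begin_probation(self, ch: str) -> None:
        """Mark a recovered channel as enabled on a trial basis."""
        self.state[ch] = ChannelState.TRIAL
        self.log.append(f"channel {ch} on trial")

    def restore_channel(self, ch: str) -> None:
        """Accept the channel's inputs and enable its outputs again."""
        self.state[ch] = ChannelState.FUNCTIONAL
        self.interlock_engaged[ch] = True
        self.log.append(f"channel {ch} restored")

    def outputs_enabled(self, ch: str) -> bool:
        return self.interlock_engaged[ch]
```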
4.3 Specification of Core FTP Faults
Two methods were used to apply faults to the core FTP. The first method used the Fault
Injector Software described in Chapter 2 and simulated memory faults by altering selected
portions of one channel's memory. It is referred to in the following sections as Software
Fault Injection. The second method used both the Fault Injector Hardware and Software
described in Chapter 2. This method, referred to in the following sections as Hardware
Fault Injection, created faults by altering the signals provided to a selected pin.
4.3.1 The Software Fault Injection Plan
The Software Fault Injection Plan inserts defects into a triplex FTP. A simulated memory
fault is inserted by corrupting the memory of one channel of the FTP. The faults applied are
divided into three categories: data, constants, and code. This classification is based on the
type of memory that is altered.
• Data. The data section used by a software module is corrupted. The data
section has three segments: initialized, uninitialized, and debug. Initialized data
are objects that are assigned values when they are declared. In contrast,
uninitialized data are objects that are declared but not assigned; they are initialized
by the program at a later time. Last, debug data is information used by the AIPS
system for debugging.
• Constants. The constants section of a software module is corrupted.
• Code. The instructions in a particular software module are corrupted.
These simulated memory faults will typically cause one or more of the following symptoms:
• Unsynchronized Channel. The corrupted memory causes the channel to go out
of synchronization (because data utilized by the program or the program itself is
altered by the applied defect). This condition is identified by a Presence Test.
The presence test detects an unsynchronized channel by sending a unique pattern
from each channel through the data exchange. If the result read from the data
exchange receiver is not the expected pattern, the channel originating the
exchange is judged not present and therefore out of sync.
• Inconsistent RAM. The corrupted memory differs from its analogous values in
the other channels (the data should be the same). This condition is identified by
the RAM Scrub process. This test checks each memory location to ensure that
the values are identical among the three channels.
• Incorrect PROM sum. The altered memory changes program instructions. This
error symptom is determined by the PROM Sum Check. This test verifies that
all channels have identical values in ROM by doing a sum check. (In the AIPS
Engineering model, code actually resided in RAM but was treated as if it was in
PROM.)
• Unknown DX. The faulty channel temporarily goes out of sync (in particular,
during a data exchange) but is forced back into sync prior to the subsequent
presence tests. This sequence of events is detected when the faulty channel's
data exchange (DX) latches are compared to other channels (to determine if the
DX latches agree).
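The presence test used to identify an unsynchronized channel can be sketched as a comparison of received patterns against expected ones. The pattern values and the dictionary representation of the data exchange receiver are hypothetical; the report does not list the actual patterns.

```python
from typing import Dict, Set

# Hypothetical unique pattern for each channel.
EXPECTED_PATTERN: Dict[str, int] = {"A": 0xA5A5, "B": 0xB6B6, "C": 0xC7C7}


def presence_test(received: Dict[str, int]) -> Set[str]:
    """Return the channels judged present.

    `received` maps each originating channel to the value read from the
    data exchange receiver for that channel's exchange.
    """
    return {ch for ch, pattern in EXPECTED_PATTERN.items()
            if received.get(ch) == pattern}
```

A channel whose exchange does not deliver its expected pattern is judged not present and therefore out of sync.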
The Software Fault Injection Plan is comprised of 178 tests; 117 tests involve the CP while
61 affect the IOP. Each test was deterministically selected; that is, the software module to
be corrupted was chosen by the fault injector supervisor and an address range that would
disrupt the module's execution was determined. (If random fault selection was performed,
the address range would be arbitrarily selected.)
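A deterministic software fault injection of this kind, followed by its detection, can be sketched on simulated memories. The example assumes 16-bit words at even addresses and an inclusive address range; both are assumptions about how the ranges in the plan are to be read. The function names and the background value 0x1234 are hypothetical, while the address range is the one listed for test CP_1.

```python
from typing import Dict


def inject_memory_fault(memory: Dict[int, int], start: int, end: int,
                        faulty_word: int = 0xFFFF) -> int:
    """Overwrite every 16-bit word in [start, end]; return the word count."""
    count = 0
    for addr in range(start, end + 2, 2):
        memory[addr] = faulty_word
        count += 1
    return count


def ram_scrub(channels: Dict[str, Dict[int, int]]) -> Dict[int, str]:
    """Map each inconsistent address to the single channel that disagrees."""
    names = sorted(channels)
    mismatches = {}
    for addr in channels[names[0]]:
        values = [channels[name][addr] for name in names]
        for name, value in zip(names, values):
            # Exactly one channel differs from the two that agree.
            if values.count(value) == 1 and len(set(values)) == 2:
                mismatches[addr] = name
    return mismatches


# Three congruent channel memories covering a small region.
base = {addr: 0x1234 for addr in range(0x1E4F10, 0x1E4F28, 2)}
channels = {"A": dict(base), "B": dict(base), "C": dict(base)}

# Corrupt the CP_1 address range in channel B only.
words_written = inject_memory_fault(channels["B"], 0x1E4F14, 0x1E4F22)
found = ram_scrub(channels)
```

Here eight words are corrupted, and the scrub attributes every mismatch to channel B, which corresponds to the Inconsistent RAM symptom.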
The Core FTP Software Fault Injection Plan is detailed in the following table. The table
presents five sets of information.
• Test Numbers. The tests are divided into two sections: CP and IOP. Each
section is segmented into three subcategories: data, constants, and code. The
tests are identified by the section and subcategory.
- CP_1 to CP_99 affect the CP's data memory;
- CP_100 to CP_199 involve the CP's constant sections;
- CP_200 to CP_299 corrupt the CP's program space;
- IOP_1 to IOP_99 affect the IOP's data memory;
- IOP_100 to IOP_199 involve the IOP's constant sections; and
- IOP_200 to IOP_299 corrupt the IOP's program space.
• Module Name. The name of the software module that is affected by the
application of the defect. The modules selected were chosen from the entire
range of AIPS system software components, including the Ada Run-Time
System, FDIR, the CRT Display tasks, Inter-Computer Communication
Services, and application tasks.
• Data Type. The type of information that is corrupted by the fault: BSS
(uninitialized data), DEBUG (debug log), DATA (initialized data), CONST
(constants region), and CODE (program section).
• Addresses Corrupted. This is the address range that is altered by the Fault
Injection Software.
• Faulty Data. The value of the faulty data written into the address range. This is
typically 0 or FFFF.
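The test numbering scheme can be decoded mechanically. The helper below is a hypothetical convenience, not part of the Fault Injector Software; it accepts an optional letter suffix because the table contains a test numbered CP_28A.

```python
import re
from typing import Tuple


def classify_test(test_id: str) -> Tuple[str, str]:
    """Return (processor, memory category) for a test number such as CP_104."""
    match = re.fullmatch(r"(CP|IOP)_(\d+)[A-Z]?", test_id)
    if match is None:
        raise ValueError(f"unrecognized test number: {test_id}")
    processor, number = match.group(1), int(match.group(2))
    if 1 <= number <= 99:
        category = "data"
    elif 100 <= number <= 199:
        category = "constants"
    elif 200 <= number <= 299:
        category = "code"
    else:
        raise ValueError(f"test number out of range: {test_id}")
    return processor, category
```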
The Software Fault Injection Plan only applied faults to a subset of the AIPS FTP software,
because it was more concerned with a general characterization of the FTP Redundancy
Management process' performance rather than a comprehensive one. Given enough
resources, an extensive Core FTP Fault Injection Plan could be developed and applied to the
AIPS Engineering Model, but this was not done for the present project.
Test No.  Module Name                   Data Type  Addresses Corrupted  Faulty Data
CP_1      CALENDAR_B_K                  BSS        1E4F14 - 1E4F22      FFFF
CP_2      CALENDAR_K                    BSS        1E4058 - 1E4062      FFFF
CP_3      CONFIG_B                      BSS        1E8F74 - 1E8F76      FFFF
CP_4      DEBUG_TRACE_B                 DEBUG      7100 - 7D00          FFFF
CP_5      LSS_CONFIG                    BSS        1E42AC - 1E42F2      FFFF
CP_6      LSS_CONFIG_B                  BSS        1E5070 - 1E5082      FFFF
CP_7      LSS_CLOCK_ERR                 BSS        1E4E84 - 1E4EB6      FFFF
CP_8      LSS_CLOCK_ERR_B               BSS        1E4EB8 - 1E4EE2      FFFF
CP_9      LSS_DX_ERR                    BSS        1E4AD8 - 1E4B26      FFFF
CP_10     LSS_DX_ERR_B                  DATA       1DE280 - 1DE2CE      FFFF
CP_11     LSS_EVENT_CNTL                DATA       1DE010 - 1DE01E      FFFF
CP_12     LSS_EVENT_CNTL_B_K            BSS        1E4DB8 - 1E4DCA      FFFF
CP_13     LSS_FFDI_B                    BSS        1E8AD8 - 1E8B22      FFFF
CP_14     LSS_FFDI                      BSS        1E41D4 - 1E41EA      FFFF
CP_15     LSS_FDIR_GLOBALS              BSS        1E4D60 - 1E4D82      FFFF
CP_16     LSS_GLOBAL_MEM                DATA       1DE31C - 1DE362      FFFF
CP_17     LSS_MEMORY                    BSS        1E4138 - 1E416E      FFFF
CP_18     LSS_NON_CONGRUENT_DATA        BSS        1E42F4 - 1E43F2      FFFF
CP_19     STATUS_DATABASE_MGR_K         BSS        1E43F4 - 1E4426      FFFF
CP_20     STATUS_DATABASE_MSG_B_K       BSS        1E4428 - 1E4AAE      FFFF
CP_21     LSS_SYNC                      BSS        1E4D84 - 1E4D9E      FFFF
CP_22     LSS_SYNC_B                    BSS        1E8A08 - 1E8A16      FFFF
CP_23     LSS_TEST                      BSS        1E5084 - 1E50C2      FFFF
CP_24     LSS_TEST_B                    BSS        1E8A90 - 1E8AD2      FFFF
CP_25     LSS_TEST2                     BSS        1E51C8 - 1E51FE      FFFF
CP_26     LSS_TEST2_B                   DATA       1DE6BC - 1DE88A      FFFF
CP_27     LINK_BLOCK                    BSS        1E41A4 - 1E41AE      FFFF
CP_28     LINK_BLOCK_B                  BSS        1E4238 - 1E425E      FFFF
CP_28A    OS_B                          DATA       1DEE88 - 1DEF3E      FFFF
CP_29     TIMER_SUP                     BSS        1E914C - 1E915E      FFFF
CP_30     TIMER_SUP_B                   DATA       1DF134 - 1DF146      FFFF

Test No.  Module Name                   Data Type  Addresses Corrupted  Faulty Data
CP_31     TS_MD                         BSS        1E91B0 - 1E91C6      FFFF
CP_32     ICCS_CP_IOP_COMMON            DATA       1DE364 - 1DE63E      FFFF
CP_33     TS_SIGNAL_B                   DATA       1DEC58 - 1DEC86      FFFF
CP_34     TS_MD_CLK                     DATA       1DF168 - 1DF172      FFFF
CP_35     SYS_TABLE                     DATA       1DF1EC - 1DF252      FFFF
CP_36     LSS_TIME_MGR                  BSS        1E4EF8 - 1E4F12      FFFF
CP_37     ICCS_USER_SERVICES            BSS        1E5200 - 1E5216      FFFF
CP_38     ICCS_CP_IOP_COMMON            BSS        1E5218 - 1E5500      FFFF
CP_39     LSS_TIME_MGR_B                BSS        1E8A18 - 1E8A3A      FFFF
CP_40     ICDEMO_STATUS_INFO            BSS        1E50C4 - 1E51C6      FFFF
CP_41     ICCS_USER_SERVICES_B          BSS        1E8AD4 - 1E8AD6      FFFF
CP_42     ICDEMO_ST_BRCAST              BSS        1E8B54 - 1E8B6A      FFFF
CP_43     ICDEMO_ST_BRCAST              BSS        1E8C04 - 1E8C1E      FFFF
CP_44     ICCS_DISP_MAIN_CP_B           BSS        1E8D34 - 1E8D3A      FFFF
CP_45     ICCS_TFDI_CP_B                BSS        1E8A3C - 1E8A7A      FFFF
CP_46     ICCS_FDIR_TIME_CP_B           BSS        1E8B44 - 1E8B46      FFFF
CP_47     TS_MD_CLK_B                   BSS        1E9184 - 1E9192      FFFF
CP_100    DEBUG_TRACE_B                 CONST      1B2B80 - 1B2BB2      FFFF
CP_101    ICCS_ICIO_MAIN_PROG_CP        CONST      1C0458 - 1C047A      FFFF
CP_102    CALENDAR_K                    CONST      1B1574 - 1B15A2      FFFF
CP_103    MACHINE_CODE                  CONST      1B1610 - 1B1632      FFFF
CP_104    MACHINE_CODE                  CONST      1B1F1C - 1B1F3E      FFFF
CP_105    SYSTEM                        CONST      1B1634 - 1B1672      FFFF
CP_106    TEXT_IO_K                     CONST      1B1674 - 1B16A6      FFFF
CP_107    SYSTEM_B                      CONST      1B1F40 - 1B1FE2      FFFF
CP_108    LSS_TASK_IDS                  CONST      1B202C - 1B106E      FFFF
CP_109    LSS_MEMORY                    CONST      1B1628 - 1B270E      FFFF
CP_110    LSS_TASK_IDS_B_K              CONST      1B1734 - 1B2866      FFFF
CP_111    ICCS_FDIR_TIME_CP             CONST      1B28E0 - 1B290A      FFFF
CP_112    LSS_EXCHANGE                  CONST      1B2A18 - 1B2A66      FFFF
CP_113    LSS_EVENT_CNTL                CONST      1B2AE4 - 1B2B36      FFFF

Test No.  Module Name                   Data Type  Addresses Corrupted  Faulty Data
CP_114    LSS_SCHEDULER                 CONST      1B35A8 - 1B370E      FFFF
CP_115    LSS_SCHEDULER_B_K             CONST      1B3710 - 1B37E2      FFFF
CP_116    LSS_CONFIG                    CONST      1B3AF4 - 1B3D76      FFFF
CP_117    LSS_NON_CONGRUENT_DATA        CONST      1B3D78 - 1B3FB2      FFFF
CP_118    STATUS_DATABASE_MGR_K         CONST      1B3FD8 - 1B41DA      FFFF
CP_119    STATUS_DATABASE_MGR_B_K       CONST      1B44DC - 1B45DE      FFFF
CP_120    LSS_FDIR_GLOBALS              CONST      1B679C - 1B6846      FFFF
CP_121    LSS_GLOBAL_MEM_UTIL           CONST      1B6B7C - 1B6BBE      FFFF
CP_122    LSS_EVENT_CNTL_B_K            CONST      1B6BC0 - 1B6D12      FFFF
CP_123    LSS_GLOBAL_MEM_UTIL_B_K       CONST      1B6D5C - 1B6D9E      FFFF
CP_124    CALENDAR_B_K                  CONST      1B7B54 - 1B7F26      FFFF
CP_125    LSS_SYNC                      CONST      1B6A44 - 1B6B56      FFFF
CP_126    LSS_TIME_UTIL_B               CONST      1B7F80 - 1B8366      FFFF
CP_127    LSS_GLOBAL_MEM                CONST      1B8728 - 1B874A      FFFF
CP_128    LSS_CONFIG_B                  CONST      1B8924 - 1B8A6E      FFFF
CP_129    LSS_TIME_MGR_B                CONST      1BB634 - 1BB802      FFFF
CP_130    LSS_EXCHANGE_B                CONST      1BBE14 - 1BBED6      FFFF
CP_131    LSS_DX_ERR                    CONST      1B4ADC - 1B4B12      FFFF
CP_132    LSS_DX_ERR_B                  CONST      1B6848 - 1B6A1E      FFFF
CP_133    LSS_CLOCK_ERR                 CONST      1B7740 - 1B77D6      FFFF
CP_134    LSS_CLOCK_ERR_B               CONST      1B77D8 - 1B788A      FFFF
CP_135    LSS_SYNC_B                    CONST      1BB29C - 1BB60E      FFFF
CP_136    ICCS_TFDI_CP_B                CONST      1BBCCC - 1BBD76      FFFF
CP_137    LSS_FFDI_B                    CONST      1BCA4B - 1BCD32      FFFF
CP_138    TS_MD_INT_B                   CONST      1C16F8 - 1C184E      FFFF
CP_139    CONFIG_B                      CONST      1C1850 - 1C18D6      FFFF
CP_140    OS_SUP_B                      CONST      1C1D60 - 1C1E46      FFFF
CP_141    OS_B                          CONST      1C1E48 - 1C21BA      FFFF
CP_142    TIMER_SUP                     CONST      1C2334 - 1C239E      FFFF
CP_143    TIMER_SUP_B                   CONST      1C24A0 - 1C265E      FFFF

Test No.  Module Name                   Data Type  Addresses Corrupted  Faulty Data
CP_200    CALENDAR_B_K                  CODE       10D800 - 10DCFA      FFFF
CP_201    CONFIG_B                      CODE       14191C - 141B08      FFFF
CP_202    DEBUG_TRACE_B                 CODE       101330 - 1013E2      FFFF
CP_203    LSS_CONFIG_B                  CODE       110326 - 110700      FFFF
CP_204    LSS_CLOCK_ERR_B               CODE       10CA00 - 10CAAA      FFFF
CP_205    LSS_DX_ERR_B                  CODE       108EF4 - 109200      FFFF
CP_206    LSS_EVENT_CNTL_B_K            CODE       109C0C - 109F00      FFFF
CP_207    LSS_FFDI_B                    CODE       122524 - 122724      FFFF
CP_208    STATUS_DATABASE_MGR_B_K       CODE       1045C4 - 104A00      FFFF
CP_209    LSS_SYNC_B                    CODE       1189EC - 118AEC      FFFF
CP_210    LSS_TEST_B                    CODE       11E800 - 11F000      FFFF
CP_211    LSS_TEST2_B                   CODE       11FE10 - 120318      FFFF
CP_212    LINK_BLOCK_B                  CODE       10126C - 1012F0      FFFF
CP_213    TIMER_SUP_B                   CODE       146E64 - 14711A      FFFF
CP_214    ICCS_FDIR_TIME_CP_B           CODE       124754 - 124A84      FFFF
CP_215    SYSTEM_B                      CODE       1003DC - 100534      FFFF
CP_216    TS_SIGNAL_B                   CODE       141F64 - 142230      FFFF
CP_217    TS_MD_CLK_B                   CODE       1476F0 - 1478B2      FFFF
CP_218    LSS_TIME_MGR_B                CODE       11AD96 - 11B152      FFFF
CP_219    LSS_USER_SERVICES_B           CODE       121972 - 121C72      FFFF
CP_220    ICDEMO_ST_BRCAST              CODE       12679C - 126DB0      FFFF
CP_221    ICCS_DISP_MAIN_B              CODE       134EF4 - 1351F4      FFFF
CP_222    ICCS_TFDI_CP_B                CODE       11CEFE - 11CFFE      FFFF
CP_223    ICDEMQCCP_APPLIC              CODE       126E98 - 127100      FFFF
CP_224    LSS_WATCHDOG_TIMER_B          CODE       101510 - 1015E8      FFFF
CP_225    LSS_SERIAL_IO_B_K             CODE       105E34 - 106100      FFFF
CP_226    TEXT_IO_B_K                   CODE       106C76 - 106DCE      FFFF
CP_227    LSS_DISP_ROUT_B               CODE       111BA0 - 111D0A      FFFF

Test No.  Module Name                   Data Type  Addresses Corrupted  Faulty Data
IOP_1     DEBUG_TRACE_B                 DEBUG      7100 - 7D00          FFFF
IOP_2     ICCS_CP_IOP_COMMON            DATA       1DE318 - 1DE5F2      FFFF
IOP_3     LSS_TEST2_B                   DATA       1DE708 - 1DE8D6      FFFF
IOP_4     ICCS_MSG_SEND_RCV_B (OT)      DATA       1DF0E4 - 1DF13A      FFFF
IOP_5     TS_EVENT_CONTROL_B            DATA       1DF348 - 1DF35A      FFFF
IOP_6     SYS_TABLE                     DATA       1DF868 - 1DF8CE      FFFF
IOP_7     ICCS_DATA_TYPES               BSS        1E42B0 - 1E42BA      FFFF
IOP_8     STATUS_DB_MGR_B_K             BSS        1E44E4 - 1E4B6A      FFFF
IOP_9     LSS_FDIR_GLOBALS              BSS        1E4E1C - 1E4E3E      FFFF
IOP_10    ICCS_MSG_SEND_RCV             BSS        1E4EAC - 1E5FF2      FFFF
IOP_11    ICCS_ERROR_LOG_B              BSS        1E60B8 - 1E61B2      FFFF
IOP_12    LSS_TEST                      BSS        1E620C - 1E624A      FFFF
IOP_13    LSS_TEST2                     BSS        1E6350 - 1E6386      FFFF
IOP_14    ICCS_CP_IOP_COMMON            BSS        1E63B8 - 1E884E      FFFF
IOP_15    LSS_TIME_MGR_B                BSS        1E9E20 - 1E9E42      FFFF
IOP_16    LSS_TEST_B                    BSS        1E9E6C - 1E9EAE      FFFF
IOP_17    LSS_FFDI_B                    BSS        1E9EB4 - 1E9EFE      FFFF
IOP_18    ICCS_MSG_SEND_RCV_B (OT)      BSS        1EA908 - 1EAB5E      FFFF
IOP_19    TS_KRN_B                      BSS        1EB08C - 1EB14A      FFFF
IOP_20    TS_MD_CLK_B                   BSS        1EB184 - 1EB192      FFFF
IOP_101   DEBUG_TRACE_B                 CONST      1B35A0 - 1B35D2      FFFF
IOP_102   STATUS_DB_MGR_B_K             CONST      1B4FD4 - 1B51D6      FFFF
IOP_103   LSS_SERIAL_IO_B_K             CONST      1B578C - 1B583E      FFFF
IOP_104   LSS_EVENT_CNTL_B_K            CONST      1B777C - 1B78CE      FFFF
IOP_105   ICCS_MSG_SEND_RCV             CONST      1B795C - 1B79EA      FFFF
IOP_106   CALENDAR_B_K                  CONST      1B853C - 1B890E      FFFF
IOP_107   ICDEMO_STATUS_INFO            CONST      1BA344 - 1BA386      FFFF
IOP_108   LSS_TEST2                     CONST      1B94C4 - 1B94E6      FFFF
IOP_109   ICCS_CP_IOP_COMMON            CONST      1B9554 - 1B96AE      FFFF
IOP_110   LSS_TIME_MGR_B                CONST      1BE918 - 1BEAE6      FFFF
IOP_111   LSS_EXCHANGE                  CONST      1BF080 - 1BF142      FFFF
Test
No.
IOP_112
IOP_ll3
IOP_114
IOP_I 15
IOP_116
IOP_117
IOP_118
IOP_119
IOP_201
IOP_202
IOP_203
IOP_204
IOP_205
IOP_206
Module Data Addresses Faulty
Name Type Corrupted Data
LSS_TEST_B CONST 1BF7A0- 1BFA3A FFFF
LSS_TEST2_B CONST 1BFA3C- 1BFEC6 FFFF
ICCS_USER_SERVICES B CONST 1BFEC8- 1BFF5A FFFF
LSS_FFDI_B CONST 1BFF5C- 1C0246 FFFF
ICDEMO ST BRCAST_IOP CONST 1C19B0- 1C1A52 FFFF
ICCS_MESSAGE_SEND_RCV_B (OT) CONST 1C71C0- 1C738A FFFF
ICCS_DISP_MAIN_IOP_B CONST 1C8188 - 1C82F2 FFFF
TIMER_SUP_B CONST 1CF444- 1CF602 FFFF
SYSTEM_B CODE 100418- 10057A 0000
LSS_WATCHDOG_TIMER_B CODE 1016B8- 10180A 0000
LSS_SERIAL IO B K
ICCS_ERROR_LOG
LSS_TIME_MGR_B
ICCS_CP_IOP_COMMON_B
CODE
CODE
CODE
CODE
106180-106lEA
10E034- 10EOD2
123B50- 123BFE
125688- 1256CE
FFFF
FFFF
FFFF
FFFF
IOP_207 LSS_EXCHANGE_B CODE 125FE0- 1260C6 FFFF
IOP_208 LSS_TEST_B CODE 1270CC- 1270E2 0000
IOP_209 LSS_TEST2_B CODE 1286E4- 128792 0000
IOP_210 ICCS_USER_SERVICES_B CODE 12A294- 12A59A 0000
IOP_211 ICDEMO ST BRCAST_IOP CODE 12E230- 12E2D2 0000
IOP_212 ICCS_MSG_SEND_RCV_B (OT) CODE 156708 - 156906 0000
IOP_213 ICCS_MSG_SEND_RCV_S3 CODE 173874- 173B94 0000
IOP_214 ICCS_MSG_SEND_RCV_S2 CODE 173DB0- 173FAC 0000
IOP_215 ICCS_MSG_SEND_RCV_S2 CODE 173FAC- 17429C 0000
IOP_216 ICCS_MSG_SEND_RCV_S 1 CODE 18BC40 - 18BD70 0000
IOP_217 ICCS_MSG SEND_RCV_S1 CODE 18D180- 18D46C 0000
IOP_218 IOSS_NET_MGR_CONFIG_B CODE 188714- 188914 0000
IOP_219 IOSS_NET_MGR_COLLECT_B CODE 192898- 192A98 0000
IOP_220 IOSS_UTLS_B CODE 100FA4- 101043 0000
IOP_221 IOSS_GPC_DATA_B CODE 101300-10138B 0000
IOP_222 IOSS_NET STATUS_B CODE 11F534 - 11F734 0000
Table 4-1. Software Fault Injection Plan
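The kind of memory corruption listed in Table 4-1 can be pictured with a small Python sketch. This is an illustration only, not the Fault Injector Software: the function name `corrupt_region`, the big-endian byte order, the 16-bit write granularity, and the half-open reading of each address range are all assumptions (the shared boundary of IOP_214 and IOP_215 at 173FAC is suggestive of half-open ranges, but the report does not say so here).

```python
def corrupt_region(memory: bytearray, base: int, start: int, end: int,
                   pattern: int = 0xFFFF) -> int:
    """Overwrite every 16-bit word in [start, end) with `pattern`.

    `base` is the address that maps to memory[0]. Returns the number of
    words written. Byte order and word size are assumptions.
    """
    if start < base or end > base + len(memory) or start > end:
        raise ValueError("address range outside the memory image")
    if (end - start) % 2:
        raise ValueError("range is not a whole number of 16-bit words")
    word = pattern.to_bytes(2, "big")
    for addr in range(start, end, 2):
        offset = addr - base
        memory[offset:offset + 2] = word
    return (end - start) // 2


# Toy image standing in for the region around IOP_208 (1270CC - 1270E2, 0000).
base = 0x127000
image = bytearray(b"\x4e\x71" * 0x100)  # arbitrary filler bytes, not real code
words = corrupt_region(image, base, 0x1270CC, 0x1270E2, pattern=0x0000)
print(words)  # number of 16-bit words overwritten
```

Only the words inside the range change; the surrounding filler is left intact, which mirrors the idea of corrupting one module's region at a time.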
4.3.2 The Hardware Fault Injection Plan
The Hardware Fault Injection Plan involved applying faults to links in the data exchange
network and the fault-tolerant clock network. Overviews of these two networks are shown
in Figures 4-2 and 4-3, respectively. Details about the data exchange and fault-tolerant
clock may be obtained from [4] and [5] by the reader who is unfamiliar with their
operation. The fault-tolerance specific hardware of the AIPS FTP was chosen as the target
of hardware fault injection since that is the unique part of the FTP.
Figure 4-2. Data Exchange Network
Figure 4-3. Fault Tolerant Clock Network
The Hardware Fault Injection Plan consisted of twelve tests, which were deterministically
selected; that is, the particular pins whose signals were altered were chosen by the fault
injector supervisor to produce a desired fault. Details of these tests are given in Table 4-2.
Test ID        Component To Be Faulted                      IC         Pin #   Logic   Schematic
                                                            Location           Level   Page
COM2_1356_6    DX: C-to-A transmitter link                  1356       6       0       COM, Pg. 4
COM_1356_11    DX: C-to-B transmitter link                  1356       11      0       COM, Pg. 4
COM_1756_6     DX: B to C's Receiver                        1756       6       0       COM, Pg. 5, 1&2
COM_0748_8_0   DX: C's Bypass Receiver                      0748       8       0       COM, Pg. 4, 1&2
COM_0748_8_1   DX: C's Bypass Receiver                      0748       8       1       COM, Pg. 4, 1&2
COM_0510_2_0   FTC: C's clock element to A's interstage     0510       2       0       COM, Pg. 6
COM_0510_2_1   FTC: C's clock element to A's interstage     0510       2       1       COM, Pg. 6
COM_0510_4_0   FTC: C's clock element to B's interstage     0510       4       0       COM, Pg. 6
COM_0510_6_0   FTC: C's clock element to C's interstage     0510       6       0       COM, Pg. 6
COM_0510_6_1   FTC: C's clock element to C's interstage     0510       6       1       COM, Pg. 6
COM_0510_7_0   FTC: C's clock element to C's interstage     0510       7       0       COM, Pg. 6
COM_0510_7_1   FTC: C's clock element to C's interstage     0510       7       1       COM, Pg. 6
Table 4-2. Hardware Fault Injection Plan
4.4 Core FTP Fault Injection Test Results
4.4.1 Software Fault Injection Test Results
As discussed in Section 4.3.1, the Core FTP Software Fault Injection Plan is comprised of
178 tests. Each test is repeated 5 times. Each iteration of a test involves the application of
a fault, its detection by the FDIR software, the appropriate FTP reconfiguration, and finally
recovery of the faulty channel by the Lost Soul Sync process. For each iteration, the Fault
Injector Software recorded the fault detection and reconfiguration times. It also verified
that the channel had been recovered before injecting the fault again.
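The per-test procedure described above can be sketched as a small loop. The names `Iteration`, `run_test`, and `summarize` are hypothetical and the timings are made up; the sketch only illustrates the bookkeeping (repeat up to five times, stop when the channel is not recovered, then report maximum and average times), not the actual Fault Injector Software.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Iteration:
    detect_ms: float    # fault insertion -> detection by FDIR
    reconfig_ms: float  # detection -> failed channel removed
    recovered: bool     # was the channel resynchronized afterwards?


def run_test(inject_once: Callable[[], Iteration], repeats: int = 5) -> List[Iteration]:
    """Repeat one fault up to `repeats` times, stopping if recovery fails."""
    results: List[Iteration] = []
    for _ in range(repeats):
        outcome = inject_once()
        results.append(outcome)
        if not outcome.recovered:
            break  # the fault is not re-injected into an unrecovered channel
    return results


def summarize(results: List[Iteration]) -> Dict[str, float]:
    """Maximum and average detection and reconfiguration times."""
    detect = [r.detect_ms for r in results]
    reconfig = [r.reconfig_ms for r in results]
    return {
        "max_detect": max(detect),
        "avg_detect": sum(detect) / len(detect),
        "max_reconfig": max(reconfig),
        "avg_reconfig": sum(reconfig) / len(reconfig),
    }


# Stand-in for the real injector: invented timings, channel always recovers.
fake_timings = iter([(20.0, 1.8), (35.0, 1.8), (12.0, 1.7), (28.0, 1.8), (40.0, 1.9)])
results = run_test(lambda: Iteration(*next(fake_timings), recovered=True))
print(summarize(results))
```

When `recovered` is false on the first iteration, the loop yields a single data set for that test.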
The FTP Software Fault Injection results are presented in two sections. First, Section
4.4.1.1 presents the maximum and average times for each test. Next, Section 4.4.1.2
illustrates and discusses the Software Fault Injection probability and cumulative density
functions.
4.4.1.1 Maximum and Average Times
Table 4-3 provides the maximum and average detection and reconfiguration times for each
test. As explained in Section 2.1, detection time is defined as the time elapsed between
insertion of the fault and detection of the fault by the FDIR process and reconfiguration
time is defined as the time elapsed between detection of the fault and removal of the failed
channel from the active configuration. The identifiers in the "Test Numbers" column in
Table 4-3 correspond to those given in Table 4-1. The "Test Results" column indicates
whether or not a fault was detected. If the fault was detected, the process by which it was
found is given. In some cases the fault was detected, but the faulty channel could not be
recovered, i.e., it could not be resynchronized. These cases are so noted and are explained
in following sections. In these cases, however, the FTP as a whole did properly sustain
the fault and continue to operate as a duplex.
Test
No.
CP_I
CP_2
CP_3
CP_4
CP_5
CP_6
CP_7
CP_8
CP_9
CP_10
CP_I 1
CP_12
CP_13
CP_14
CP_15
CP_16
CP_17
CP_18
CP_19
CP_20
CP_21
CP_22
CP_23
CP24
CP_25
CP_26
CP_27
CP_28
Test
Results
Fault Detected;Chan Not Recovered
Detected: RAM Scrub
Detected: RAM Scrub
Detected: Presence Test
Detected: Presence Test
Detected: RAM Scrub
Detected: RAM Scrub
Detected: Presence Test
Detected: RAM Scrub
Detected: RAM Scrub
Detected: RAM Scrub
Detected: RAM Scrub
Detected: Presence Test
Detected: Presence Test
Max
Detect
(ms)
95536.5
95273.0
29.8
19.9
96927.8
95594.2
58.7
96584.0
Max
Reconfig
(ms)
31.8
39.4
1.9
1.8
27.0
35.4
1.8
34.5
Average
Detect
(ms)
191.2
90936.7
81129.8
16.7
12.2
90608.0
93376.7
39.7
89280.0
Average
Reconfil2
(ms)
1.7
19.3
24.0
1.8
1.8
22.8
24.8
1.8
18.4
96250.3 126.4 85340.9 47.8
106816.6 38.4 95036.5 33.1
146641.1 127.1 119054.9 43.6
510.2 1.7 185.7 1.7
28.7 1.7 17.2 1.7
20.3Detected: RAM Scrub 96583.7 27.9 77436.6
Fault Detected;Chan Not Recovered 12.8 1.8
Fault Detected;Chan Not Recovered 58130.1 28.4
118256.2
8.7 1.7
358.2 1.8 201.6 1.8
35.9146297.6
96081.5
22.9
44.7 81916.6 30.7
98451.3 35.6
96156.9 12.0 53652.6 3.8
61.1 1.8 32.0 1.8
87.5 83947.3
28.1
90501.5
93740.5
1.9
96855.2
53.8
96084.1
97651.6
Fault Detected;Chan Not Recovered
Detected: Presence Test
Detected: RAM Scrub
Detected: RAM Scrub
Fault Detected;Chan Not Recovered
Detected: RAM Scrub
Detected: Presence Test
Detected: RAM Scrub
Detected: Presence Test
Detected: RAM Scrub
Detected: RAM Scrub
37.0
88.0
38.6
1.8
22.3
32.4
4-22
Test
No.
Test
Results
CP_29 Detected: Presence Test
CP_30 Detected: RAM Scrub
CP_31 Detected: RAM Scrub
CP_32 Detected: Presence Test
CP_33 Detected: RAM Scrub
CP_34 Detected: RAM Scrub
CP_35 Detected: Presence Test
CP_36 Detected: Unknown DX, (2) RAM
Scrub, (2) Presence Test
Max
Detect
(ms)
32.6
147435.0
146690.3
181.2
Max
Reconfig
(ms)
2559.1
1.8
124.4
34.9
1.8
Average
Detect
(ms)
24.4
140214.0
124763.0
120.9
145787.1 124.2 141029.4
145449.8 171.3 141481.8
1.8 2154.2
16.393983.9 26167.1
Average
Reconfig
(ms)
1.8
37.8
23.1
1.8
53.3
56.5
1.8
6.7
CP_37 Detected: RAM Scrub 96451.0 32.6 95110.9 23.5
CP_38 Detected: Presence Test 419.6 1.8 257.3 1.8
CP_39 Detected: RAMScrub 145522.8 31.3 131949.5 18.8
CP_40 Detected: Unknown DX, 866.8 1.8 462.2 1.8
(4) Presence Test
CP_41 Detected: RAM Scrub 96339.4 36.6 87893.3 22.6
96473.5 31.6 93824.6 20.3
153216.9 34.0 146463.7 22.8
226.2 1.9 101.2 1.8
96657.1 38.7 89099.0 27.7
207.5 1.8 121.9 1.8
66234.286.698961.1
149025.6 34.2 126143.8
146475.7 139417.8
79653.5
126.0
CP_42 Detected: RAM Scrub
CP_43 Detected: RAM Scrub
CP 44 Detected: Presence Test
CP_45 Detected: RAM Scrub
CP_46 Detected: Presence Test
CP_47 Detected: (4) RAM Scrub,
Presence Test
CP_100 Detected: PROM Sum
CP_101 Detected: PROM Sum
CP_102 Detected: PROM Sum
CP_103 Detected: PROM Sum
CP_104 Detected: PROM Sum
CP_105 Detected: PROM Sum
CP_106 Detected: PROM Sum
CP_107 Detected: PROM Sum
86.596552.2
31.7
147230.7
23.1
59.2
32.6
22.495355.4 39.8 88859.0
105570.4 85.8 94836.4 36.6
128520.9 56.4126.3
165.6146347.9
36.7
138544.8
140070.1145507.7
49.0
25.4
4-23
Test
No.
CP_108
CP109
CP_110
CP_I 11
CP_112
Test
Results
Detected: PROM Sum
Detected: Presence Test
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
CP_ll3
CP_ll4
CP_115
CP_I 16
CP_117 Detected: PROM Sum
CP_I 18 Detected: PROM Sum
CP_119 Detected: PROM Sum
CP_120
CP_121
CP_122
CP_123
CP_124
CP_125
CP_126
CP_127
CP_128
CP_129
CP_130
CP_131
CP_132
CP_133
CP_134
CP_135
CP_136
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
Max
Detect
(ms)
146181.2
37.4
144039.0
146472.8
95114.8
95282.8
91444.8
Max
Reconfig
(ms)
87.0
1.7
207.1
127.0
Average
Detect
(ms)
138558.3
24.0
119955.2
139991.4
Average
Reconfi_
(ms)
33.7
1.7
65.2
44.6
86.4 78788.5 36.2
126.7 88119.7 57.8
34.7 81038.1 23.0
103959.8 29.0 93040.1 20.5
149127.6 47.3 140330.2 28.9
112119.6 47.6 94477.6 25.1
95999.6 39.6 79645.4 24.2
96148.4 33.9 86219.3
146386.3 130109.032.6
22.8
27.0
Detected: PROM Sum 95199.9 33.7 92444.6 22.1
Detected: PROM Sum 91539.1 32.3 84941.6 28.9
Detected: PROM Sum 90.1 93879.4 36.5
Detected: PROM Sum
101278.8
214498.1 34.5 136828.9 28.1
148181.3 85.6 143241.3 33.3
150162.1 126.2 144700.0
34.195944.0
111365.6
93968.3
44058.4
96410.629.3
95094.7 39.1 89605.8
92340.8 38.8 86177.6
95953.4
146889.4
88120.033.2
38.9
27.2
143411.4
12.8
23.4
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
Fault Detected;Chart Not Recovered
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
Detected: PROM Sum
33.6
29.7
18.6
33.6 134713.3 26.2
144691.0 89.1 142409.8 30.9
146346.8 26.9
89.5 142848.3147351.0
19.6
39.4
Test      Test                                 Max         Max        Average     Average
No.       Results                              Detect      Reconfig   Detect      Reconfig
                                               (ms)        (ms)       (ms)        (ms)
CP_137    Detected: PROM Sum                   148606.3    38.9       96564.1     28.3
CP_138    Detected: PROM Sum                   96890.8     37.2       93503.4     24.0
CP_139    Detected: PROM Sum                   96786.7     48.2       93405.1     27.5
CP_140    Detected: PROM Sum                   145792.4    31.6       107682.4    21.6
CP_141    Detected: PROM Sum                   94577.3     89.3       88625.5     39.9
CP_142    Detected: Presence Test              644.0       1.8        437.0       1.8
CP_143    Detected: PROM Sum                   96056.9     20.0       82852.3     16.9
CP_200    Detected: Presence Test              168677.5    86.5       106666.7    42.9
CP_201    Detected: Presence Test              28.1        1.7        21.5        1.7
CP_202    Fault Detected; Chan Not Recovered                          36.1        1.7
CP_203    Detected: PROM Sum                   95505.6     34.0       80481.9     30.6
CP_204    Detected: Presence Test              51.5        1.8        34.8        1.8
CP_205    Detected: Presence Test              23.9        1.7        14.0        1.7
CP_206    Detected: Presence Test              477.6       1.8        265.4       1.8
CP_207    Detected: Presence Test              37.9        1.8        26.5        1.8
CP_208    Detected: PROM Sum                   155942.4    85.2       95294.7     31.6
CP_209    Fault Detected; Chan Not Recovered                          60.1        1.8
CP_210    Detected: Presence Test              87.7        1.8        46.6        1.8
CP_211    Detected: Presence Test              61.5        1.8        36.3        1.8
CP_212    Detected: PROM Sum                   191510.9    127.6      145278.9    43.8
CP_213    Fault Detected; Chan Not Recovered                          46.6        1.8
CP_214    Detected: Presence Test              40.0        1.8        31.2        1.8
CP_215    Detected: PROM Sum                   195341.8    126.7      178607.7    46.7
CP_216    Detected: PROM Sum                   192959.1    38.6       189657.3    20.6
CP_217    Detected: Presence Test              25.1        1.7        15.2        1.7
CP_218    Detected: Presence Test              211.7       1.8        113.3       1.8
CP_219    Detected: Presence Test              446.5       1.8        244.4       1.8
CP_220    Detected: Presence Test              431.7       1.8        275.2       1.8
CP_221    Detected: Presence Test              178.7       1.8        134.6       1.8
CP_222    Detected: Presence Test              198.3       1.8        129.2       1.8
CP_223    Detected: Presence Test              203.5       1.8        127.7       1.8
CP_224    Detected: Presence Test              54.5        1.8        39.7        1.8
CP_225    Detected: Presence Test              70.3        1.8        53.7        1.8
CP_226    Detected: PROM Sum                   81303.0     71.7       65041.4     33.4
CP_227    Detected: PROM Sum                   95554.4     36.2       76895.2     26.6
Table 4-3. Software Fault Injection Results (continued)
Test
No.
IOP_I
IOP_2
IOP_3
IOP_4
IOP_5
IOP_6
IOP_7
IOP_8
IOP_9
IOP_IO
IOP_I 1
IOP_12
IOP_13
IOP_14
IOP_ 15
IOP_16
IOP_17
Test
Results
Detected: Presence Test
Detected: Presence Test
Detected: Presence Test
Detected: RAM Scrub
Detected: RAM Scrub
Detected: Presence Test
Detected: Presence Test
Detected: RAM Scrub
Detected: RAM Scrub
Detected: RAM Scrub
Detected: RAM Scrub
Detected: (2) RAM Scrub, (3)
Presence Test
Detected: RAM Scrub
Detected: RAM Scrub
Detected: Presence Test
Detected: RAM Scrub
Detected: (4) Presence Test, RAM
Scrub
Max
Detect
(ms)
Max
Reconfig
(ms)
Average
Detect
(ms)
Average
Reconfig
(ms)
51.7 1.8 17.3 1.8
213.5 1.8 73.8 1.8
82.4
81359.2
123713.3
4351.0
453.9
124721.8
125274.2
122840.8
123551.2
103016.8
97933.4
124456.5
81504.3
1.8
28.6
207.0
1.8
1.8
30.8
126.3
34.4
34.1
206.0
35.6
37.5
86.4
1.8
27.8
43.8
44.1
72260.6
111310.0
2915.1
210.7
107873.3
107683.2
114341.6
119467.1
62119.4
60132.0
112652.0
77742.9
24.6
14011.534048.2
1.8
22.3
59.4
1.8
1.8
22.8
38.2
24.4
23.7
47.3
22.9
26.3
33.0
1.8
7.0
Test      Test                                 Max         Max        Average     Average
No.       Results                              Detect      Reconfig   Detect      Reconfig
                                               (ms)        (ms)       (ms)        (ms)
IOP_18    Detected: Presence Test              451.4       1.9        260.3       1.8
IOP_19    Detected: RAM Scrub                  80649.0     30.4       64473.7     21.5
IOP_20    Detected: RAM Scrub                  82006.5     21.1       78393.8     16.5
IOP_101   Detected: PROM Sum                   112417.8    126.9      84467.5     42.6
IOP_102   Detected: PROM Sum                   81264.9     86.6       70073.0     34.9
IOP_103   Detected: PROM Sum                   81583.4     28.9       67491.2     22.3
IOP_104   Detected: PROM Sum                   124832.0    30.1       99660.3     22.3
IOP_105   Detected: PROM Sum                   124163.1    38.2       90309.8     33.8
IOP_106   Detected: PROM Sum                   81008.4     25.9       66098.0     20.4
IOP_107   Detected: PROM Sum                   81412.8     34.7       77279.2     25.5
IOP_108   Detected: PROM Sum                   81336.4     47.8       79185.6     29.3
IOP_109   Detected: PROM Sum                   81871.0     126.2      77719.7     39.8
IOP_110   Detected: PROM Sum                   81761.3     39.6       78144.3     20.4
IOP_111   Detected: PROM Sum                   125472.4    129.0      101481.2    48.7
IOP_112   Detected: PROM Sum                   123597.6    32.2       120999.4    18.8
IOP_113   Detected: PROM Sum                   76507.7     40.1       52846.2     30.8
IOP_114   Detected: PROM Sum                   87151.9     35.0       58393.2     20.8
IOP_115   Detected: PROM Sum                   79690.6     88.8       51821.7     41.8
IOP_116   Detected: PROM Sum                   71076.4     35.9       48508.4     26.8
IOP_117   Detected: PROM Sum                   81476.4     37.4       77877.6     23.0
IOP_118   Detected: PROM Sum                   124452.0    32.2       86398.3     23.2
IOP_119   Detected: PROM Sum                   94316.9     126.0      71361.5     40.2
IOP_201   Detected: PROM Sum                   80541.2     37.1       60870.8     18.3
IOP_202   Fault Detected; Chan Not Recovered                          34.7        1.7
IOP_203   Detected: (4) PROM Sum,              81364.0     35.2       67697.8     19.0
          Presence Test
IOP_204   Detected: PROM Sum                   122004.9    20.3       114485.8    14.4
IOP_205   Detected: PROM Sum                   81567.6     28.1       72695.6     16.9
IOP_206   Detected: PROM Sum                   82127.9     39.0       81483.0     21.0
IOP_207   Fault Detected; Chan Not Recovered                          20.3        1.8
IOP_208   Detected: Presence Test              47.7        1.8        31.0        1.8
IOP_209   Detected: Presence Test              146.6       1.8        63.0        1.8
IOP_210   Detected: PROM Sum                   125624.5    26.7       122849.6    19.3
IOP_211   Detected: PROM Sum                   138244.3    127.2      123907.0    48.1
IOP_212   Detected: PROM Sum                   124774.5    32.4       120438.3    21.2
IOP_213   Detected: PROM Sum                   124400.9    32.0       105208.4    23.9
IOP_214   Detected: PROM Sum                   81767.7     29.7       78645.9     21.1
IOP_215   Detected: Presence Test              419.6       1.8        269.9       1.8
IOP_216   Detected: Presence Test              377.6       1.8        144.6       1.8
IOP_217   Detected: PROM Sum                   84940.3     38.8       82048.9     35.0
IOP_218   Detected: PROM Sum                   124793.0    35.0       100550.4    23.0
IOP_219   Detected: PROM Sum                   126107.3    38.5       121129.2    25.5
IOP_220   Detected: PROM Sum                   123171.1    88.4       116307.4    39.7
IOP_221   Detected: PROM Sum                   123045.8    20.9       107199.0    16.6
IOP_222   Detected: PROM Sum                   122310.4    33.2       106561.6    23.1
Table 4-3. Software Fault Injection Results (concluded)
Detection Times
Detection times vary because for each test the corrupted area of memory is referenced with
a different frequency. If the corrupted area is commonly used, the fault will most likely be
manifested by the next iteration of Fast FDIR. If the corrupted area is infrequently
referenced, it may not cause a problem until several iterations of Fast have occurred. If the
area is never referenced, the fault will only be detected by the PROM Sum or RAM Scrub
tests in the Background Selftest task, which could take several minutes.
None of the detection times were unexpected. All faults detected by Fast FDIR (Presence
Test) were detected within 40 ms after being injected. Faults detected by the Background
Selftests (RAM Scrub, PROM Sum) took a maximum of 214 seconds (approximately 3-1/2
minutes). This is consistent with the amount of time it takes to do a complete iteration of
the Background Selftests, which is about 4 minutes. The wide range of detection times for
faults detected by the Background Selftests is a result of the position of the tests at the time
when the memory was corrupted. If the corrupted locations had just been examined by the
selftests, the fault would take much longer to detect than if the corrupted locations were just
about to be examined.
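The dependence of self-test detection time on scan position can be sketched as follows. This is a simplified model, not the Background Selftest implementation: it assumes a single cyclic scan at a constant rate, and the function name and the normalised positions are hypothetical.

```python
def selftest_latency_ms(cycle_ms: float, scan_pos: float, fault_pos: float) -> float:
    """Time until a cyclic memory scan next reaches the corrupted location.

    scan_pos and fault_pos are fractions in [0, 1) of the way through one
    complete self-test cycle (a hypothetical normalisation).
    """
    if not (0.0 <= scan_pos < 1.0 and 0.0 <= fault_pos < 1.0):
        raise ValueError("positions must lie in [0, 1)")
    ahead = (fault_pos - scan_pos) % 1.0  # fraction of a cycle still to scan
    return ahead * cycle_ms


CYCLE_MS = 4 * 60 * 1000  # "about 4 minutes" per complete iteration

# Location a quarter cycle ahead of the scan versus a quarter cycle behind it.
print(selftest_latency_ms(CYCLE_MS, 0.50, 0.75))
print(selftest_latency_ms(CYCLE_MS, 0.50, 0.25))
```

Under this model the latency ranges from nearly zero (location about to be examined) to nearly one full cycle (location just examined), which is consistent with the wide spread described above.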
Reconfiguration Times
Reconfiguration times vary depending on the process that detected the fault. When a fault
is detected by Fast FDIR, the reconfiguration takes place immediately (less than 2 ms).
Faults detected by the Background Selftests (RAM Scrub, PROM Sum) can be expected to
have a wide range of reconfiguration times because the Background Selftests do not
perform the reconfiguration themselves, but pass the information to Fast FDIR to act
upon. If Fast FDIR has just finished when the Selftests detect a fault, the
reconfiguration will not take place for at least 40 ms. Additional delays can occur
because the Selftests may be interrupted by a higher priority task between the time that
they detect the fault and the time that they have finished creating the reconfiguration
information for Fast FDIR to act upon.
None of the reconfiguration times were unexpected. All faults detected by Fast FDIR
(Presence Test) were reconfigured immediately (less than 2 ms) after being detected. The
highest reconfiguration times for Selftest-detected faults were 160-170 ms (only 2 cases),
which indicates that the Selftests were suspended for four iterations of Fast. This is not
unreasonable, considering the number of tasks operating in the system.
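The reasoning that 160-170 ms corresponds to four iterations of Fast FDIR can be checked with a short calculation. The 40 ms period is taken from the text above; the function name is hypothetical, and the sample delays are illustrative values of the magnitudes seen in Table 4-3.

```python
import math

FAST_PERIOD_MS = 40.0  # spacing of Fast FDIR iterations implied by the text


def fast_iterations_spanned(reconfig_ms: float, period_ms: float = FAST_PERIOD_MS) -> int:
    """Number of whole Fast FDIR periods contained in a reconfiguration delay."""
    if reconfig_ms < 0 or period_ms <= 0:
        raise ValueError("times must be non-negative and the period positive")
    return math.floor(reconfig_ms / period_ms)


for delay in (1.8, 33.4, 86.5, 126.7, 165.0):
    print(delay, fast_iterations_spanned(delay))
```

A delay under 40 ms means the hand-off was picked up by the next Fast FDIR iteration; larger multiples suggest the Selftest task was preempted before it finished posting its result.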
Faults Detected but Channel Not Recovered
In several cases, the fault was detected but the channel was not recovered, i.e., it could not
be resynchronized. These situations are discussed below.
CP_1: The corrupted memory in this case caused the faulty processor to get
into an infinite loop with interrupts off. The other channels saw the faulty
channel as failing the presence test and disengaged its Monitor Interlock. This
prevented the faulty processor's Watchdog Timers from going off and getting
the processor out of its infinite loop.
CP_16: This test wipes out a section of memory used in doing the CP-IOP
handshake before a lone channel attempts to be picked up. The CP is in an
infinite loop because it has written the handshake word into the wrong location
and is now waiting for the IOP to respond in that location. The fault has
crippled the channel to such an extent that it cannot even attempt to be picked
up.
CP_17: This test wipes out a section of memory used by the Lost Soul Sync
task when a lone channel attempts to be picked up by good channels. The lone
channel cannot function well enough to even attempt to be picked up.
CP_18: This test wipes out a section of memory used in doing the CP-IOP
handshake before a lone channel attempts to be picked up. The CP is in an
infinite loop attempting to complete the handshake. The fault has crippled the
channel to such an extent that it cannot even attempt to be picked up.
CP_128: This test wipes out a section of memory used in doing the CP-IOP
handshake before a lone channel attempts to be picked up. The CP is in an
infinite loop attempting to complete the handshake. The fault has crippled the
channel to such an extent that it cannot even attempt to be picked up.
CP_202: This test wipes out a section of code used by the Lost Soul Sync task
when a lone channel attempts to be picked up by good channels. The CP in the
lone channel is getting repeated hardware exceptions. The fault has crippled the
channel to such an extent that it cannot even attempt to be picked up.
CP_209: This test wipes out a section of memory used in doing the CP-IOP
handshake before a lone channel attempts to be picked up. The CP is in an
infinite loop attempting to complete the handshake. The fault has crippled the
channel to such an extent that it cannot even attempt to be picked up.
CP_213: This test wipes out a section of memory used in doing the CP-IOP
handshake before a lone channel attempts to be picked up. The CP is in an
infinite loop attempting to complete the handshake. The fault has crippled the
channel to such an extent that it cannot even attempt to be picked up.
IOP_207: This test wipes out a section of memory used in doing the CP-IOP
handshake before a lone channel attempts to be picked up. The CP is in an
infinite loop attempting to complete the handshake. The fault has crippled the
channel to such an extent that it cannot even attempt to be picked up.
4.4.1.2 Probability and Cumulative Density Functions
The Core FTP Fault Insertion Plan is comprised of 178 tests. Each test consists of five
iterations. For each iteration, the fault detection and reconfiguration times are recorded by
the Fault Insertion Software. Accordingly, a total of 890 data sets were anticipated.
However, as described in Section 4.4.1.1, in ten tests the fault was detected but the
faulty channel could not be recovered (so only one data set was recorded per test
rather than five). As a result, 850 sets of data, or 95.5 percent of the anticipated
total, were recorded.
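The data-set count follows from simple arithmetic, reproduced here as a check. The variable names are illustrative; the numbers are those stated in the paragraph above.

```python
TESTS = 178             # tests in the Core FTP Software Fault Injection Plan
ITERATIONS = 5          # planned repetitions per test
UNRECOVERED_TESTS = 10  # tests that yielded a single data set

anticipated = TESTS * ITERATIONS
lost = UNRECOVERED_TESTS * (ITERATIONS - 1)
recorded = anticipated - lost
percent = round(100 * recorded / anticipated, 1)
print(anticipated, recorded, percent)
```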
Probability and cumulative density functions for these data sets were generated to complete
the Core FTP Fault Insertion Analysis; these functions are illustrated in Figures 4-4
through 4-9.
The probability density function for the FTP fault detection times, shown in Figure 4-4,
depicts the wide variance in the observed times. As discussed in Section 4.4.1.1, this
variance occurred because some faults were detected by the high priority Fast FDIR task
while other faults were detected by the low priority background Selftest process.
Several distinguishable peaks can be observed (illustrated in Figures 4-4 and 4-5):
• 0 - 40 ms. Approximately 25 percent of the faults were detected in this range.
• 78,000 - 82,000 ms. Ranges from a minimum of 0 to a maximum of 2 percent
of the faults.
• 92,000 - 97,000 ms. Ranges from a minimum of 0 to a maximum of 2 percent
of the faults.
• 120,000 - 125,000 ms. Ranges from a minimum of 0 to a maximum of 1
percent of the faults.
• 140,000 - 147,000 ms. Ranges from a minimum of 0 to a maximum of 1
percent of the faults.
As shown by the cumulative density function for the detection times (Figure 4-6),
approximately 25 percent of the software injected faults were detected in 40 milliseconds or
less, 50 percent in 80 seconds or less, and 75 percent in 100 seconds or less.
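A minimal sketch of how such density and cumulative functions might be computed from recorded times is given below. The helper names are hypothetical, and the input is a toy sample of eight average detection times taken from Table 4-3 (IOP_1, IOP_2, IOP_18, IOP_19, IOP_20, IOP_101, IOP_104, IOP_112); the report's figures were built from the individual iterations, so the percentages here will not match them.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def histogram_percent(samples: Sequence[float], bin_ms: float) -> Dict[float, float]:
    """Percentage of samples falling in each bin [k*bin_ms, (k+1)*bin_ms)."""
    counts: Dict[float, int] = {}
    for t in samples:
        lower = (t // bin_ms) * bin_ms
        counts[lower] = counts.get(lower, 0) + 1
    return {lo: 100.0 * n / len(samples) for lo, n in sorted(counts.items())}


def cumulative_percent(samples: Sequence[float], t_ms: float) -> float:
    """Percentage of samples with value <= t_ms."""
    ordered = sorted(samples)
    return 100.0 * bisect_right(ordered, t_ms) / len(ordered)


# Toy input only: eight average detection times (ms) from Table 4-3.
detect_ms: List[float] = [17.3, 73.8, 260.3, 64473.7, 78393.8, 84467.5, 99660.3, 120999.4]

print(histogram_percent(detect_ms, 40000.0))
print(cumulative_percent(detect_ms, 40.0))
print(cumulative_percent(detect_ms, 80000.0))
```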
The probability density function for the Core FTP fault reconfiguration times (Figures 4-7
and 4-8) indicates that a significant percentage of the faults were bypassed in 0 to 220 ms,
while a substantial but smaller percentage were bypassed in 220 to 1950 ms.
This variance occurred because some faults were detected by the high priority Fast FDIR
task and therefore reconfigured immediately, while other faults that were detected by the
low priority Selftest task took longer to reconfigure.
As depicted by the cumulative density function in Figure 4-9, approximately 94 percent of
all the simulated memory faults were isolated and bypassed in 2000 ms or less.
Figure 4-4. The Probability Density Function for the Detection Times (plot of Percentage versus Time (ms))
Figure 4-5. The Probability Density Function for the Detection Times: Expansion of the 0 to 10,000 ms. Region (plot of Percentage versus Time (ms))
Figure 4-6. The Cumulative Density Function for the Detection Times (plot of Percentage versus Time (ms))
Figure 4-7. The Probability Density Function for the Reconfiguration Times (plot of Percentage versus Time (ms))
Figure 4-8. The Probability Density Function for the Reconfiguration Times: Expansion of Range 0 to 2400 ms. (plot of Percentage versus Time (ms))
Figure 4-9. The Cumulative Density Function for the Reconfiguration Times (plot of Percentage versus Time (ms))
4.4.2 Hardware Fault Injection Test Results
As discussed in Section 4.3.2, the Hardware Fault Injection Plan consists of 12 tests.
Each test is repeated 10 times. Each iteration of a test involves the application of a fault, its
detection by the FDIR software, the associated FTP reconfiguration, and finally recovery
of the faulty channel. For each iteration, the Fault Insertion Software recorded the fault
detection and reconfiguration times. It also verified that the channel had been recovered
before injecting the fault again.
The test results are shown in Table 4-4. In the first three tests, the fault was detected and
correctly identified and the system was reconfigured. In the next two tests, the fault was
detected and identified, but the channel could not be recovered. The hypothesis is that
the injected fault prevented the faulty channel from being resynchronized. In the final
seven tests, the fault was not detected. This was traced to a bug in the FDIR software,
which looks only at a voted value of the FTC error
latches rather than looking at each channel's latch individually. Looking at the voted value
recognizes faults in the entire clock element or entire interstage, but ignores individual link
faults. The particular fault created in all seven test cases was the failure of a link between a
clock element and an interstage. There was no opportunity to rerun the tests with a
corrected version of the software. Since the number of test cases was small, no statistical
analysis was done.
Test ID         Test Results                        Max       Max       Average   Average
                                                    Detect    Reconf    Detect    Reconf
                                                    (ms)      (ms)      (ms)      (ms)
COM2_1356_6     Detected: A-to-C                    39.9      3.4       26.3      3.4
                transmitter link failure
COM_1356_11     Detected: B-to-C                    39.9      3.5       28.4      3.5
                transmitter link failure
COM_1756_6      Detected: A-to-C                    41.1      3.5       24.1      3.4
                transmitter link failure
COM_0748_8_0    Detected: Presence;                                     13.6      1.8
                Channel Not Recovered
COM_0748_9_0    Detected: Presence;                                     14.8      1.8
                Channel Not Recovered
COM_0510_2_0    Not Detected
COM_0510_2_1    Not Detected
COM_0510_4_0    Not Detected
COM_0510_6_0    Not Detected
COM_0510_6_1    Not Detected
COM_0510_7_0    Not Detected
COM_0510_7_1    Not Detected
Table 4-4. Hardware Fault Injection Results
4.4.3 Design Flaws Uncovered by the Fault Injection Tests
During execution of both the Hardware Fault Injection Plan and the Software Fault
Injection Plan, several design flaws were uncovered. These design flaws included both
hardware and software. All software design faults were corrected.
Hardware Design Flaws
1. Disengaging a channel's Monitor Interlock should not disable its Watchdog
Timers. Test CP_1 showed that this can prevent a channel from being
recovered if the fault resulted in a processor being in an infinite loop with
interrupts off.
Software Design Flaws
1. If the faulty channel went out of sync between the presence test and the data
exchange latch analysis in Fast FDIR (refer to Section 4.2.1), a data exchange
fault would be identified and possibly the wrong channel identified as faulty.
This was corrected by exchanging data exchange latches and FTC latches before
the presence test.
2. Occasionally the faulty channel would behave in such a way that it was out of
sync only temporarily (e.g., during a data exchange) but was somehow forced
back into sync and passed subsequent Program Counter checks and presence
tests. However, the data exchange latches had been set by the other two
channels, but not by the faulty channel. As in (1), this led to an erroneous
diagnosis. The solution here was to verify that any reported DX fault was
reported by all channels (except for a faulty interstage-to-receiver link, which is
normally reported by only one channel).
3. A failed link in the FTC network (as opposed to an entire clock element or
interstage) would not be detected by Fast FDIR because it looked only at a
voted value of the FTC error latches, rather than looking at each channel's
latches individually. This was corrected by examining each individual
channel's latches.
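Why a voted latch value hides a single-link fault can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the three-channel majority vote, the bit encoding of the latch word, and the premise that a failed link is reported by only one channel's latch. It is not the FDIR code.

```python
from typing import Dict, List


def voted(latches: List[int]) -> int:
    """Bitwise majority vote over three channels' error-latch words."""
    a, b, c = latches
    return (a & b) | (a & c) | (b & c)


def fault_seen_by_vote(latches: List[int]) -> bool:
    """True if any error bit survives the majority vote."""
    return voted(latches) != 0


def faults_seen_individually(latches: List[int]) -> Dict[int, int]:
    """Map channel index -> latch word, for every channel reporting an error."""
    return {ch: word for ch, word in enumerate(latches) if word != 0}


# Hypothetical encoding: bit 2 set means "error on the input from clock element C".
C_ELEMENT = 0b100

whole_element_fault = [C_ELEMENT, C_ELEMENT, C_ELEMENT]  # every channel reports it
single_link_fault = [C_ELEMENT, 0, 0]                    # only one channel reports it

print(fault_seen_by_vote(whole_element_fault), fault_seen_by_vote(single_link_fault))
print(faults_seen_individually(single_link_fault))
```

In this model the vote passes a fault that all channels report but suppresses one reported by a single channel, while the per-channel inspection catches both.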
4.5 Core FTP Fault Injection: Conclusions
To conclude the discussion of the Core FTP Fault Injection Plan, the Fault Injection
Results are considered with respect to the goals of the Fault Injection Study. In brief, the
objectives were:
1. to test the design specification for fault tolerance,
2. to obtain feedback for fault removal from the design implementation,
3. to obtain statistical data regarding fault detection, isolation, and
reconfiguration responses, and
4. to obtain data regarding the effects of faults on system performance.
To test the system design specification for fault tolerance (Goal 1), we relied solely on
visual observation of the CRT display to determine that the system functioned correctly
during and after the fault. Since the display tasks execute at the lowest priority, this
ensured that no higher priority task was monopolizing the system as a result of the fault. In
all test cases, the system functioned correctly.
To determine the correctness and completeness of the fault detection and identification
(Goal 2) for each test case, two methods were used. One was visual inspection of the error
log that is maintained by the core FTP FDIR process. This log indicated whether a
particular fault was detected and isolated correctly. The second method involved
verification by the FIS that the fault was detected and that reconfiguration took place. As
shown in Table 4-3, all of the test cases in the Software Fault Injection Plan were correctly
detected and isolated. In a small percentage of the cases, the faulty channel could not be
recovered because the memory that was corrupted was memory used by the recovery
process, so that the fault appeared as a hard fault rather than a transient. In one case, the
faulty channel could not be recovered because of a hardware design flaw that prevented the
faulty channel from detecting that it was faulty. In addition, during the initial iterations of
the Software Fault Injection Plan, errors in the FDIR software were detected which were
corrected for subsequent iterations; only the final iteration was presented in this report.
As shown in Table 4-4, five of the twelve test cases in the Hardware Fault Injection Plan
were correctly detected and isolated; the other seven were not detected. The undetected faults
were similar in that they all consisted of the failure of a link in the fault tolerant clock
network. The FDIR software was determined to contain an error that prevented it from
recognizing this type of fault. There was no opportunity to rerun the tests with corrected
software.
Statistical data about the fault detection and identification (Goal 3) was obtained by having
the FDIR software log the detection time in the Testport Interface and by having the Fault
Injector Software note the time at which fault isolation occurred. As presented in Sections
4.4.1 and 4.4.2, the Core FTP Fault Injection Plan recorded 853 sets of data. The
maximum and average times and the probability and cumulative distribution functions were
calculated. The results of the Core FTP Fault Injection test cases conformed to the
expected maximum and average times.
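The statistics named above (maximum, average, and the probability and cumulative distribution functions) can be estimated from a list of recorded times roughly as follows. This is a minimal sketch, not the analysis code used in the study: the function name `summarize`, the bin width, and the sample values are all hypothetical, and the distribution functions are approximated by simple fixed-width binning.

```python
from statistics import mean


def summarize(times_ms, bin_width_ms):
    """Return max, mean, and binned PDF/CDF estimates for measured times.

    times_ms     -- measured detection (or isolation) times in milliseconds
    bin_width_ms -- width of each histogram bin (an assumed analysis choice)
    """
    if not times_ms:
        raise ValueError("need at least one measurement")
    n = len(times_ms)

    # Count how many samples fall into each fixed-width bin.
    counts = {}
    for t in times_ms:
        b = int(t // bin_width_ms)
        counts[b] = counts.get(b, 0) + 1

    pdf = []      # (bin lower edge, fraction of samples in the bin)
    cdf = []      # (bin lower edge, fraction of samples up to and including the bin)
    running = 0
    for b in sorted(counts):
        running += counts[b]
        pdf.append((b * bin_width_ms, counts[b] / n))
        cdf.append((b * bin_width_ms, running / n))

    return {"max": max(times_ms), "mean": mean(times_ms), "pdf": pdf, "cdf": cdf}


# Hypothetical sample values, NOT data from the study.
sample = [12.0, 18.5, 21.0, 33.0, 35.5]
stats = summarize(sample, bin_width_ms=10)
```

The measured maximum and mean can then be compared against the expected values for the FDIR process in question.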
The effects of faults on system performance (Goal 4) include both (1) the additional time
required by the particular FDIR process when it is dealing with a fault and the subsequent
scheduling delays incurred by other tasks, and (2) the effects on users of the faulty
component. The additional time required by the FDIR processes has been measured at
other times during the life of the AIPS project and documented in a previous report [ ].
Measurement of the scheduling delays incurred by non-FDIR tasks and of the effects of a fault
on users of the faulty components requires additional instrumentation for recording
system performance; this instrumentation was not in place at the time of this study.
5.0 CONCLUSIONS
This report has described a plan for systematically injecting large numbers of faults into the
AIPS building blocks and collecting data about the resulting actions of the redundancy
management processes. The goals of this fault injection plan were fourfold:
1. To test the system design specification for fault tolerance.
2. To obtain feedback for fault removal from the design implementation.
3. To obtain statistical data regarding fault detection, isolation, and reconfiguration
responses.
4. To obtain data regarding the effects of faults on system performance.
A comprehensive set of possible fault injection tests was developed; from this a subset of
actual test cases was selected. Both pin-level hardware faults using a hardware fault
injector and software-injected memory mutations were used to test the system. A Fault
Injection Software program was used to facilitate automatic fault injection and collect timing
information.
Two of the AIPS building blocks, the I/O Network and the Core FTP, were chosen as
candidates for extensive fault injection. The I/O Network Fault Injection Plan consisted of
47 different pin-level faults, inserted using the hardware fault injector. Each of these faults
was applied 25 times. The detection coverage for these faults was 99.6%; the other 0.4%
of the faults did not produce any detectable error symptoms. The reconfiguration coverage
for detected faults was 100%. No design errors were found by these tests. This does not
necessarily prove that the design of the network is completely error free, but it does
increase the level of confidence in the design.
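The coverage figures above follow from simple counting. The sketch below works through the arithmetic; the function names are hypothetical, and the count of undetected injections is only what the quoted 0.4% implies, not a number stated in this section.

```python
def injection_count(distinct_faults, repetitions):
    """Total number of injections when every fault is applied the same number of times."""
    return distinct_faults * repetitions


def coverage_percent(detected, injected):
    """Coverage as the percentage of injected faults that were detected."""
    if injected <= 0:
        raise ValueError("injected must be positive")
    return 100.0 * detected / injected


# I/O Network plan: 47 distinct pin-level faults, each applied 25 times.
io_injections = injection_count(47, 25)              # 1175 injections

# Inference only: 0.4% of 1175 is 4.7, i.e. about 5 injections without symptoms.
implied_undetected = round(0.004 * io_injections)

io_coverage = coverage_percent(io_injections - implied_undetected, io_injections)
```

Under that assumption the computed coverage rounds to the reported 99.6%.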
The anticipated maximum detection time for faults injected in the I/O Network was about
2075 ms. plus the error latency, and the expected average time was approximately 1040
ms. plus the error latency. The I/O Fault Injection results typically conformed to these
expected maximum and average times. For the reconfiguration time, the worst case time,
i.e., one including fault diagnostics plus the worst case regrowth scenario, was determined
to be 3500 ms. All of the actual reconfiguration times were less than the worst case.
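A conformance check of this kind can be expressed as a comparison of each measured time against its bound. The following is an illustrative sketch only: the function names and the measured values passed to them are hypothetical, and the error latency, which this section does not quantify, is left as a parameter.

```python
# Bounds quoted in the text for the I/O Network, in milliseconds.
IO_MAX_DETECTION_MS = 2075      # excludes the error latency
IO_WORST_RECONFIG_MS = 3500     # diagnostics plus worst case regrowth


def conforms(measured_ms, bound_ms, error_latency_ms=0.0):
    """True if a measured time does not exceed the bound plus any error latency."""
    return measured_ms <= bound_ms + error_latency_ms


def check_run(detect_ms, reconfig_ms, error_latency_ms):
    """Check one test run against both bounds (inputs are hypothetical measurements)."""
    return {
        "detection_ok": conforms(detect_ms, IO_MAX_DETECTION_MS, error_latency_ms),
        "reconfig_ok": conforms(reconfig_ms, IO_WORST_RECONFIG_MS),
    }


# Hypothetical run: 1900 ms to detect, 2800 ms to reconfigure, 50 ms error latency.
result = check_run(detect_ms=1900, reconfig_ms=2800, error_latency_ms=50)
```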
One problem was encountered when using the hardware fault injector to apply faults to the
I/O Network. Sometimes when a fault injector probe was attached to an I/O node, the
probe caused the node to fail. The fault injector supervisors speculated that this problem
was caused by impedance differences that occurred when the probe was attached. Because
of this problem, some of the originally proposed I/O Network faults were not injected.
The Core FTP Fault Injection Plan consisted of 178 simulated memory faults, inserted
using the Fault Injection Software program, and 12 pin-level faults, inserted using the
hardware fault injector. The memory faults were each applied 5 times; the pin-level faults
were applied 10 times. Four design flaws were uncovered by the Core FTP Fault Injection
Plan. Three of these were software design flaws; one was a hardware design flaw. As
time permitted, the software design flaws were corrected and the test cases rerun. The
hardware design flaw was not corrected.
The detection coverage for the simulated memory faults was 100%; the reconfiguration
coverage was also 100%. The detection and reconfiguration times from each test typically
conformed to the expected maximum and average times. However, in a number of the
simulated memory fault test cases, the fault prevented the channel from being recovered.
Therefore multiple instances of these particular faults could not be injected without
restarting the system. The detection coverage for the pin-level fault test cases was 42%; the
reconfiguration coverage was also 42%. The undetected faults were all of the same type
and were the result of a software error. In two of the cases where the fault was detected,
the faulty channel could not be recovered. The fault injection supervisors speculate that
although the hardware fault injector stopped corrupting the pin after a certain time, the
effect remained, thereby creating a permanent fault rather than a transient one.
A second problem encountered when using the hardware fault injector was that simply
attaching the fault injector probe to a pin caused the channel to fail. The fault injector
supervisors speculate that this was caused by the added distance between the chip and the
card, which could result in either changed timing, added capacitance, or added noise.
Because of this problem, many of the originally proposed Core FTP faults were not
injected.
This study has demonstrated the importance of fault injection in the overall validation of a
system. In addition to providing data for reliability parameter estimation, it can also
provide feedback for fault removal from the design implementation. This does not mean
that fault injection is a substitute for the design-for-validation methodology, but it is a
component of the methodology just as specifications, design reviews, analytical models
and formal methods are. If the fault injection process does not uncover a single flaw in the
system under test, this does not imply that the system is perfect, only that the system is
correct with respect to the fault set to which it was subjected. And if some design flaws are
uncovered, the fault injection process provides a deeper understanding of the fault tolerance
design and a more fundamental appreciation of the cascade of events triggered by a fault,
including complex interactions between hardware and software elements. The Fault
Injection tests performed as part of this study successfully met three of the four originally
stated goals. They helped validate the system design specification for fault tolerance; they
detected faults in the design implementation; and they provided statistical data regarding
fault detection, isolation, and reconfiguration responses.
Recommendations for future work in this area include gathering data on the effects
of faults on system performance, in particular, any delays in execution of time-critical
tasks. The envelope of test cases could also be extended to cover a fuller spectrum of the
fault injection plan that has been described in this report. Another research area is to
understand why certain transient faults behave as permanent faults, i.e., how the errors
produced by a transient fault permanently disable a channel preventing its reintegration in
the FTP.
6.0 REFERENCES
[1] Harper, R.E., Alger, L.S., and Lala, J.H., "Advanced Information Processing System:
Design and Validation Knowledgebase," NASA Contractor Report 187544, September
1991.
[2] Johnson, S.C., and Butler, R.W., "Design for Validation," 10th AIAA/IEEE Digital
Avionics Systems Conference, Los Angeles, CA, October 1991, pp. 487-492.
[3] Lala, J.H., and Smith, T.B., III, "Development and Evaluation of a Fault Tolerant
Multiprocessor (FTMP) Computer, Volume III, FTMP Test and Evaluation," NASA
Contractor Report 166073, May 1983.
[4] Burkhardt, L.F., Alger, L., Whittredge, R., and Stasiowski, P., "Advanced Information
Processing System: Local System Services," NASA Contractor Report 181767, March
1989.
[5] Gauthier, R.J., "The Airlab Fault-Tolerant Processor: Physical Implementation,"
Charles Stark Draper Laboratory Report, CSDL-R-1928, December 1986.
[6] Lala, J.H., Harper, R.E., and Alger, L.S., "A Design Approach for Ultrareliable Real-
Time Systems," IEEE Computer Special Issue on Real-Time Systems, May 1991.
REPORT DOCUMENTATION PAGE
Form Approved, OMB No. 0704-0188
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE
3. REPORT TYPE AND DATES COVERED: Contractor Report
4. TITLE AND SUBTITLE: Advanced Information Processing System: Fault Injection
Study and Results
5. FUNDING NUMBERS: C NAS1-18565, WU 506-59-61-03
6. AUTHOR(S): Laura F. Burkhardt, Thomas K. Masotto, and Jaynarayan H. Lala
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES):
The Charles Stark Draper Laboratory, Inc.
Cambridge, MA 02139
8. PERFORMING ORGANIZATION REPORT NUMBER
9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES):
National Aeronautics and Space Administration
Langley Research Center
Hampton, VA 23665-5225
10. SPONSORING / MONITORING AGENCY REPORT NUMBER: NASA CR-189590
11. SUPPLEMENTARY NOTES:
Langley Technical Monitor: Felix L. Pitts
Final Report - Task 12
12a. DISTRIBUTION / AVAILABILITY STATEMENT:
Unclassified - Unlimited
Subject Category 62
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
The objective of the Advanced Information Processing System (AIPS) program is to achieve a
validated fault-tolerant distributed computer system. The goals of the AIPS fault injection
study were:
1. To present the fault injection study components addressing the AIPS validation objective
2. To obtain feedback for fault removal from the design implementation
3. To obtain statistical data regarding fault detection, isolation, and reconfiguration responses
4. To obtain data regarding the effects of faults on system performance
The organization of this report is as follows. Section 1 describes the parameters that must
be varied to create a comprehensive set of fault injection tests, the subset of test cases
selected for this study, the test case measurements, and the test case execution. Both
pin-level hardware faults using a hardware fault injector and software-injected memory
mutations were used to test the system. Section 2 provides an overview of the hardware fault
injector and the associated software used to carry out the experiments. Sections 3 and 4
give detailed specifications of faults and test results for the I/O Network and the AIPS
Fault Tolerant Processor, respectively. Section 5 summarizes the results and gives
conclusions of the study.
14. SUBJECT TERMS: Fault-tolerant computing, fault injection, distributed fault-tolerant
computers, empirical validation
15. NUMBER OF PAGES: 168
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT
20. LIMITATION OF ABSTRACT
