Abnormal fault-recovery characteristics of the fault-tolerant multiprocessor uncovered using a new fault-injection methodology by Padilla, Peter A.
https://ntrs.nasa.gov/search.jsp?R=19910009034 2020-03-19T19:43:52+00:00Z

NASA Technical Memorandum 4218
Abnormal Fault-Recovery
Characteristics of the
Fault-Tolerant Multiprocessor
Uncovered Using a New
Fault-Injection Methodology
Peter A. Padilla
Langley Research Center
Hampton, Virginia
National Aeronautics and
Space Administration
Office of Management
Scientific and Technical
Information Division
1991

Contents
Summary
Introduction
Background of Fault-Tolerant Multiprocessor
Error-Latch Processing and Fault-Injection Experiments
Error-Latch Processing
Fault-Injection Experiments
Anomalous Behavior: Observations and Preliminary Data
Causes of Anomalous Behavior
Experimental Data
Subsystem Interaction Leading to Anomalous Behavior
Model Predictions
Solution
Implications for Fault-Tolerant Multiprocessor
Hard Faults
Intermittent Faults
Effect of Holding the Bus on the FTMP
Lying Events on a Triple Modular Redundancy System
Implications of New Technology
Concluding Remarks
References
Tables
Table 1. Bus Errors Detected by LRU
Table 2. Variability of N F
Table 3. Comparison of Fault-Injection and Model Experiments
Table 4. Confirmation Event
Table 5. Values for Pn and N Obtained From Fault-Injection Experiments
Table 6. Rate of Failure of TMR System (Lying Event)
Figures
Figure 1. FTMP physical configuration
Figure 2. FTMP architecture
Figure 3. Structure of error-latch procedure
Figure 4. Two typical examples of system bus activity during error-latch read transactions
Figure 5. Time diagram of error-latch read transactions of LRU A
Figure 6. Time diagram of error-latch read transactions of LRU 5
Figure 7. Theoretical sequence of LRU's disabled for a configuration without LRU 9
Figure 8. Sequence of LRU's disabled (with LRU 9 in the configuration)
Figure 9. Confirmation event
Figure 10. The three possible states of fault-management-software execution on the triads
Summary
A new design methodology for a fault-injection
experiment was used to uncover and characterize
problems with the Fault-Tolerant Multiprocessor
(FTMP) error detection and faulty-unit isolation
functions. The observed anomalies were discovered
during hardware fault-injection experiments
performed on the FTMP test-bed at the Avionics
Integration Research Laboratory (AIRLAB) at the
NASA Langley Research Center. During these
experiments, the FTMP disabled a working unit
instead of the faulted unit about once every 500
fault injections, on the average. This behavior
continued until the system crashed after 4000 to 6000
fault-injection tests.
The new fault-injection methodology involves new
criteria for selecting the fault location (high
fan-out signals) and new instrumentation to observe
the behavior of the system in real time. This new
fault-injection methodology can be applied to other
fault-tolerant-system architectures to uncover design
problems. Using this methodology, weaknesses in the
FTMP design were found. The weaknesses increase
the probability that any active fault can exercise a
part of the fault-management software that handles
Byzantine or "lying" faults.
Byzantine faults behave in such a way that the
faulted unit points to a working unit as the source
of errors. When errors generated by the fault occur
in certain time intervals during the detection data
acquisition, the FTMP fault-identification procedure
incorrectly decides that only one line replaceable
unit (LRU) detected the error (the so-called
lone-accuser). This event triggers a software component,
designed to handle Byzantine faults, that keeps track
of the identity of the lone-accuser and disables it
after a second event. In summary, the combined
effects of the design weaknesses of the system result
in a significant probability that a fault-handling error
occurs in such a way that one fault causes two LRU's
to be disabled: a good unit in addition to the faulty
unit.
This paper gives some background material on the
FTMP architecture and then describes a typical
fault-injection experiment and the observed
anomalous behavior. Following these descriptions,
the causes of the erroneous behavior are described.
A simple model was constructed that predicts the
general pattern of these events and their rates based
on measurable system parameters. The paper ends
with calculations of the rate at which an
intermittent/transient fault or a permanent fault might
induce the observed phenomena and with conclusions
reached during this study.
Introduction
The Fault-Tolerant Multiprocessor (FTMP) is a
multicomputer system designed according to fault-
tolerant design principles to survive and continue op-
eration after the occurrence of several nonsimultane-
ous faults. (See ref. 1.) The intended application of
the system design is flight-critical applications where
a high reliability is required to ensure the survival of
the vehicle, crew, and passengers. The often-stated
reliability goal is expressed as a probability of
failure of less than or equal to 10^-9 for a 10-hour
mission. An engineering prototype of the FTMP was
constructed in the late 1970's, and preliminary test-
ing was conducted by the system designers. (See
ref. 2.) Thereafter, the system was delivered to the
NASA Langley Research Center where it has been
subjected to more extensive experimentation. (See
ref. 3.)
Early fault-injection experiments concentrated on
obtaining recovery-time data used to construct a
recovery distribution from which certain parameters
are required inputs for reliability assessment. Also
performed were experiments designed to measure or
estimate fault latency and fault-free performance.
Early fault-injection experiments were straight-
forward: a pin was selected at random from a chip
selected at random from a system board also selected
at random. Other methods of selecting the fault-
injection location, called sampling methods in refer-
ence 3, had been devised and used to design experi-
ments. These methods of selecting a location for fault
injection assured compliance with statistical theory
and thus accuracy in parameter estimation. The data
acquisition requirements for these experiments were
modest and were satisfied with the existing data ac-
quisition system. (See refs. 3 and 4.)
Selection of injection points at random over the
whole system misses many weak points in the system
that can be identified by careful examination and
testing of the design. The common perception is that
in order to explore the system for the existence of
abnormal recovery behavior and single-point failures,
exhaustive or "massive" fault-injection testing must
be performed.
A new methodology for fault-injection experi-
ments is proposed in this paper that enhances the
probability of detecting abnormalities and inadequa-
cies in a system. The results presented in this paper
were obtained using the new methodology. These re-
sults support the contention that the new method
enhances the probability of detecting inadequacies in
fault-tolerant systems.
The new methodology is based on two principles:
1. Faults should be injected mainly in signals
with high fan-outs (e.g., board enables and
other control signals), thus maximizing the
amount of damage (the number of errors
introduced) to the system.
2. The information/data necessary to determine
the system state should be observable at all
times in real time.
These principles, although seemingly simple, re-
quire a completely new experimental environment.
The data acquisition requirements imposed by the
new methodology on the FTMP test environment
overburdened the original data acquisition system.
This system was sufficient for the early work but
completely inadequate for the new methodology and
necessitated the design of a new system with three
orders of magnitude improvement in data acquisi-
tion rates. (See ref. 4.) The application of the new
methodology to fault-injection experiments uncov-
ered an abnormal fault-recovery behavior never ob-
served before.
During some of the initial experiments using the
new methodology, it was noted that line replaceable
units (LRU's) not being faulted were regularly dis-
abled by the fault-management software. Although
the system classified the events as being caused by
real permanent faults that occurred during the fault-
injection experiment, no evidence of these faults
could be found after the fault-injection experiment
was completed. All the supposedly faulty units were
reactivated either manually or by booting the system.
Therefore, it was concluded that these events must
somehow be caused by the faults injected during the
experiments.
Experiments using the new data acquisition sys-
tem were designed to characterize this behavior. Ini-
tially, the error-latch processing integrity was in-
vestigated. The data indicated that on certain oc-
casions only one LRU reported a specific bus er-
ror. All the LRU's that singly reported an error
on an active bus on two separate occasions were
disabled. The authenticity of the reported error
was investigated to determine if the behavior was
caused by the injected fault affecting an LRU across
fault-containment boundaries or by corruption of the
error-latch information by software or noise. The new
data acquisition system was instrumental in deter-
mining the authenticity of the observed error as it is
described in a later section. Once the authenticity
of the reported error was established, the mechanism
was investigated by which the fault coerced a non-
faulted LRU to report an error not visible to others.
Once the cause of the behavior was discovered
and characterized, an analytical model was developed
using easily measurable parameters to estimate the
probability that an abnormal recovery would occur
during the process of recovering from a naturally
occurring fault. Also, the model was used to estimate
the probability of a general triple-modular redundant
system suffering a recovery failure.
Background of Fault-Tolerant
Multiprocessor
Figure 1 shows the physical configuration of the
Langley Avionics Integration Research Laboratory
(AIRLAB) FTMP test-bed. (For a detailed descrip-
tion see ref. 1.) There are 11 LRU's (12 is the max-
imum number of LRU's supported by the FTMP)
that are interconnected by a redundant system bus.
Each LRU contains a processor with cache, a system
bus interface, and a system memory unit. (There
are many other subsystems that are not relevant
to the subject of this paper and are therefore not
mentioned.)
Figure 1. FTMP physical configuration. (LRU B is absent from the configuration.)
At system restart, the processors and the memo-
ries are organized in triads (see fig. 2) to provide the
redundancy required to tolerate a single fault any-
where in the system. The triads are tightly synchro-
nized; i.e., all the processor (memory) components
of a processor (memory) triad execute the same in-
struction (operation) on the same clock cycle. Thus
synchronized, three copies of all input/output (I/O)
data from a triad are presented to the triple redun-
dant system bus. The copies are then voted bit by
bit to mask any errors that might occur because of a
fault on any of the triad components.
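The bit-by-bit vote described above can be illustrated with a minimal sketch. It assumes each copy is available as an integer word; the function names `vote3` and `disagreeing` are hypothetical and are not taken from the FTMP hardware or software.

```python
def vote3(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value
    held by at least two of the three input copies."""
    return (a & b) | (a & c) | (b & c)


def disagreeing(a: int, b: int, c: int) -> list:
    """Return the indices (0, 1, 2) of the copies that differ
    from the voted word. Hypothetical helper for illustration."""
    voted = vote3(a, b, c)
    return [i for i, word in enumerate((a, b, c)) if word != voted]


good = 0x1234
faulty = 0xFFFF                      # one corrupted copy
masked = vote3(good, good, faulty)   # the single error is masked
suspects = disagreeing(good, good, faulty)
```

A single faulty copy never changes the voted word; the vote is defeated only if two copies carry an error in the same bit position.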
The system bus is a composite of four different
redundant serial buses. These buses are the Poll bus
(P bus), the Transmit bus (T bus), the Receive bus
(R bus), and the Clock bus (C bus). There are five
of each P, T, R, and C bus types, of which three are
used at any given time (a bus triad). One exception,
the real-time clocks and the C bus are configured
as quads (four units active) to implement the fault-
tolerant clock. (See ref. 1 for details.)
Figure 2. FTMP architecture. (Diagram: processor triads 1, 2, and 3, made up of PROC 0 to 8, and system memory triads 1 and 2, attached to the system bus.)
The P bus is used by a processor triad to request
control of the T and R buses. The T bus is used by
a processor triad to transmit read/write commands
to system memory triads. The R bus is used by
memory triads to respond to processor triad read
commands transmitted through the T bus triad. (See
ref. 1.) In a typical example, when a processor
triad needs to input or output a data value to a
memory triad, it will first poll for control of the
system bus through the P bus. Each element of the
processor triad polls in synchronism with the other
elements on one of the active P buses. (As mentioned
above, there are three active elements at any time.)
The polling protocol is called the Lanning protocol
and is described in reference 1. After obtaining
control of the system bus, the processor triad issues
read or write commands (three copies) through the
active T buses. If the command is a read command,
then the responses (three copies) from the addressed
memory triad appear in the active R buses after a
short delay. There is a second type of read operation,
called a simplex read, in which there is only one
response transmitted through one of the R buses.
The simplex read is required to read nonredundant
data sources, e.g., an error latch.
The three active buses of each type are compared,
any single error is masked, and the detection is
flagged by setting a bit in an error latch. Each LRU
contains four error latches, one per bus type (i.e., P,
T, R, and C error latches). Each error latch contains
five bits, one for each available bus of the specified
type; e.g., the P latch has five bits, one for each of the
five P buses (P1, P2, P3, P4, and P5) available to the
system. Setting the bit corresponding to the P1 bus,
for example, signifies that an error was detected in
that bus during a previous transaction.
The P bus is monitored continuously by all the
available LRU's; therefore, if a processor fails to cor-
rectly follow the polling protocol, then every LRU
will flag an error in the P bus used by the mis-
behaving processor. Error handling is different in the
T and R buses. Although all the T and R bus op-
erations are voted by all the LRU's, only the LRU's
that initiated the transactions (by having their pro-
cessor units issue a read or a write command) will
log error detection in the T or R error latches. This
is required so that when a simplex read transaction
is performed, the other LRU's not involved in the
transaction (which had not been signaled that a sim-
plex transaction was requested) do not mistake the
simplex data for bus errors.
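The logging rule described above can be summarized in a short sketch. It covers only the P, T, and R buses, because the text does not state the rule for the C bus; the function name and the LRU identifiers used in the example are illustrative assumptions.

```python
def lrus_that_latch(bus_type, all_lrus, initiators):
    """Return the set of LRU's expected to flag an error.

    P bus: every available LRU monitors the bus continuously.
    T and R buses: only the LRU's that initiated the transaction
    log the error, although every LRU votes the data.
    """
    if bus_type == "P":
        return set(all_lrus)
    if bus_type in ("T", "R"):
        return set(initiators)
    raise ValueError("rule not described for bus type " + repr(bus_type))


# Hypothetical configuration: ten available LRU's, with the
# transaction issued by the triad made up of LRU's 3, 4, and 5.
available = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "A"]
triad = ["3", "4", "5"]
p_loggers = lrus_that_latch("P", available, triad)
t_loggers = lrus_that_latch("T", available, triad)
```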
On the test-bed system one processor (processor
A) in FTMP is used as a "master" processor during
system restart. This master processor is used to load
system memory with the application software and
to issue initial configuration commands, e.g., which
processors are going to be in triads 1, 2, and 3. After
system restart, the master processor is used for data
acquisition and software debugging purposes so that
it is never part of a triad or used as a spare. The
system memory unit of LRU A is used as a spare for
a system memory triad.
Error-Latch Processing and
Fault-Injection Experiments
Pin-level fault-injection experiments have been
performed for the past several years on the FTMP
test-bed. (See refs. 2 and 3.) In these experiments,
hardware fault-injection systems use different meth-
ods (see refs. 2 and 5) to induce a temporary malfunc-
tion in one or more components of the target system.
The errors resulting from the induced fault propagate
through the circuits, thus producing erroneous
behavior at the system bus level. Manifestation of
the errors at this level occurs by several "fault-to-
system bus error" paths operating singly or simul-
taneously. For example, a fault can corrupt output
data on cache or disable the system bus interface,
or both; in such a case, the external behavior of the
faulted unit will differ from the "normal," as deter-
mined by voting the data presented to the system
bus by the two nonfaulted triad partners.
Error-Latch Processing
Errors at the system bus level are flagged by
setting the error latches. The latches are read into
system memory and processed every 320 msec by the
fault-management software. (See ref. 6 and table 1.)
Error latches are not redundant; each LRU contains
one of every kind. Therefore, error-latch values are
read with a "simplex read operation"; i.e., there is
only one copy transmitted through one of the active
R buses. During a read operation the 5 error-latch
bits are transmitted as the least significant bits of a
16-bit word, and the 11 most significant bits are all
transmitted as 1's. Therefore, an error latch with no
error signaled will appear as FFE0 (hexadecimal).
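The word format described above can be sketched as follows. The sketch assumes that bit 0 corresponds to bus 1, bit 1 to bus 2, and so on; this agrees with the FFE1 value reported for bus P1 in table 1 but is an assumption for the other buses. The function names are hypothetical.

```python
UPPER_ONES = 0xFFE0   # the 11 most significant bits, all sent as 1's
BUS_BITS = 5          # one error bit per bus of a given type


def encode_latch(buses_in_error):
    """Build the 16-bit word read from an error latch.
    Bus numbers run from 1 to 5 (assumed: bus k uses bit k - 1)."""
    word = UPPER_ONES
    for bus in buses_in_error:
        if not 1 <= bus <= BUS_BITS:
            raise ValueError("bus number out of range: %r" % (bus,))
        word |= 1 << (bus - 1)
    return word


def decode_latch(word):
    """Return the bus numbers whose error bit is set in the word."""
    return [bus for bus in range(1, BUS_BITS + 1)
            if word & (1 << (bus - 1))]


no_error = encode_latch([])     # 0xFFE0
bus1_error = encode_latch([1])  # 0xFFE1
```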
The data shown in table 1, acquired with the high-speed
data acquisition system (ref. 4), are a typical
example of what occurs during a fault injection.
The data indicate that all the LRU's detected a
bus error in the P1 bus. (LRU 3 was enabled
on this bus.) However, the data also show that
LRU's 4 and 5 detected a bus error in the T1 bus.
(LRU 3 was enabled to transmit in that bus.) The
faulted behavior of LRU 3 (for the same fault used
to gather the error-latch data) was studied with
the high-speed data acquisition system in order to
deduce the sequence of events described below. The
observed behavior of LRU 3 under the specific fault
was the complete absence of bus transactions; i.e.,
the fault completely disabled LRU 3 transmissions
on the system bus. Therefore, it follows from the
error-latch data in table 1 that the processor triad
composed of LRU's 3, 4, and 5 issued a read/write
request through the T bus after performing a polling
sequence to acquire control of the bus. The polling
sequence was not executed by LRU 3, which led to
the detection of a bus error in the P1 bus. Even
with LRU 3 out, the two surviving partners of the
triad gained control of the system bus and performed
a read/write transaction on the T bus. Again, on
this transaction LRU 3 was out, which led to the
detection of a bus error on T1. This error should
have been detected by LRU's 3, 4, and 5 (T and
R bus errors are detected only by the triad that
generates the transaction where the error occurs),
but LRU 3 error-latch values were discarded by the
system because the system does not utilize the latch
values from units suspected of being faulty.
Table 1. Bus Errors Detected by LRU

        Raw error-latch data                  Interpreted data
LRU     P.EL(a)   R.EL(b)   T.EL(c)   C.EL(d)   Buses where error was detected
0       FFE1(e)   FFE0      FFE0      FFE0      P1
1       FFE1      FFE0      FFE0      FFE0      P1
2       FFE1      FFE0      FFE0      FFE0      P1
3       FFE1      FFE0      FFE0      FFE0      P1
4       FFE1      FFE0      FFE1      FFE0      P1, T1
5       FFE1      FFE0      FFE1      FFE0      P1, T1
6       FFE1      FFE0      FFE0      FFE0      P1
7       FFE1      FFE0      FFE0      FFE0      P1
8       FFE1      FFE0      FFE0      FFE0      P1
9(f)
A       FFE1      FFE0      FFE0      FFE0      P1
B(f)

(a) P bus error latch.
(b) R bus error latch.
(c) T bus error latch.
(d) C bus error latch.
(e) All error-latch values in hex.
(f) Not in present configuration.
As a result of the simplex nature of the error-
latch data, several tests are performed on the error
data in system memory before it is considered by the
fault-management software. These tests are designed
to filter out corrupted data. First, any "zeroed" bit
found in the 11 most significant bits of each value
indicates that the value was corrupted during the
read operation. Second, if a value showed errors
in an inactive/spare bus, that value is considered
corrupted. Third and last, if the five least significant
bits contained more than one bit set to 1 (errors on
multiple buses), the value is also assumed corrupted.
The R bus from which the corrupted error-latch
values were read is then marked suspect and any
other values read through it are discarded.
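The three filtering tests can be written out as a minimal sketch. It assumes the 16-bit word layout described earlier and the same bit-to-bus mapping (bus k uses bit k - 1, an assumption); the function name `latch_is_good` and the `active_buses` parameter are hypothetical and do not come from the FTMP software.

```python
UPPER_MASK = 0xFFE0  # the 11 most significant bits of the word


def latch_is_good(word, active_buses):
    """Apply the three corruption tests to one error-latch value.

    word         -- 16-bit value read from the latch
    active_buses -- bus numbers (1 to 5) currently active
    """
    # Test 1: a zeroed bit among the 11 most significant bits
    # means the value was corrupted during the read.
    if word & UPPER_MASK != UPPER_MASK:
        return False
    flagged = [bus for bus in range(1, 6) if word & (1 << (bus - 1))]
    # Test 2: an error reported on an inactive or spare bus.
    if any(bus not in active_buses for bus in flagged):
        return False
    # Test 3: errors reported on more than one bus.
    if len(flagged) > 1:
        return False
    return True


active = {1, 2, 3}
results = [latch_is_good(w, active)
           for w in (0xFFE0, 0xFFE1, 0x7FE1, 0xFFE8, 0xFFE3)]
```

In this example only the first two values pass; the remaining three fail tests 1, 2, and 3, respectively.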
A latch value that passes all three tests is
considered good. All the good values are then checked
for consistency; e.g., all the latches indicating an
error in an active P bus must agree on the identity of
the accused P bus (say P1), whereas all the latches
indicating an error on the T bus must agree on the
identity of the accused T bus (say T3). If there is
consistency among values and one of the accused buses
is an R bus, this bus is marked as suspect and all
the error-latch values read through it are discarded.
If they are not consistent, those data are excluded
from further consideration.
The identity of all the buses with errors, as
indicated by all the latch values that survived the
consistency check, is passed to the fault-management
software for isolation of the faulty component(s) (which
could be the bus or any processor or memory unit
enabled on the suspect bus).
A special case occurs when only one good error
latch indicates an error on an active bus. This
situation is a special case of a Byzantine fault, and it
is called the lying-fault syndrome in reference 6. In
such a case the identity of the LRU from which the
error latch was read is stored in system memory. If
the situation repeats with the same LRU, the LRU is
disabled. (Both the processor and the memory unit
of that LRU are disabled.)
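The two-strike handling of a lone accuser can be sketched as below. This is an illustration of the rule as stated in the text, not the FTMP implementation: the class name, the use of a set to remember earlier lone accusers, and the form of the input are all assumptions.

```python
class LoneAccuserTracker:
    """Remember LRU's that were the only unit reporting an error
    and disable an LRU the second time this happens to it."""

    def __init__(self):
        self.remembered = set()  # lone accusers seen once (assumed storage)
        self.disabled = set()    # LRU's disabled after a second event

    def process(self, accusers):
        """accusers: ids of LRU's whose good latches flag an error
        on an active bus in one error-latch processing pass.
        Returns the id of an LRU disabled in this pass, or None."""
        accusers = set(accusers) - self.disabled
        if len(accusers) != 1:
            return None          # zero or several accusers: not a lone accuser
        (lru,) = accusers
        if lru in self.remembered:
            self.remembered.discard(lru)
            self.disabled.add(lru)
            return lru
        self.remembered.add(lru)
        return None


tracker = LoneAccuserTracker()
first = tracker.process(["A"])        # first lone accusation: remembered
shared = tracker.process(["4", "5"])  # two accusers: no action
second = tracker.process(["A"])       # repeat by the same LRU: disabled
```

The sketch makes the hazard visible: nothing in the rule checks whether the lone accuser is actually faulty, so two lone detections of a real error are enough to disable a good unit.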
Fault-Injection Experiments
A typical fault-injection experiment involves the
selection of the board, chip, and pin on the FTMP
LRU 3 where faults of a particular fault type (e.g.,
stuck at one) would be inserted at random times.
The number of faults inserted can be varied and it
normally ranged from 500 to 5000 faults. The fault
detection, fault isolation, and reconfiguration times,
plus the identity of the disabled unit, were acquired
for every fault and were stored in computer disk files.
The instrumentation required by the new
fault-injection methodology allows investigators to capture
the behavior of the FTMP in real time. Two
locations in LRU 3 were selected according to the new
methodology criteria: (1) a control signal in the
central processing unit (CPU) data path, and (2) a
control signal in the cache controller. Fault injections
in these locations have the desired effect in that
abnormal recovery behavior was observed every 500 to
1000 faults.
Anomalous Behavior: Observations and
Preliminary Data
As mentioned above, during the analysis of the
fault-injection data it was noted that the LRU's not
being faulted were regularly disabled by the
fault-management software. Although the system
classified the events as being caused by real permanent
faults that occurred during the fault-injection
experiment, no evidence of these faults could be found
after the fault-injection experiment was completed. All
the supposedly faulty units were reactivated either
manually or by booting the system. Therefore, it was
concluded that these events must be caused somehow
by the faults injected during the experiments.
For experiments that lasted 4000 faults or longer,
4 nonfaulted LRU's were usually disabled before the
system crashed. Obviously, whatever was causing the
LRU's to fail did not disappear after the first LRU
was disabled. Some discernible patterns in the data
were observed. For example, in all the experiments
where LRU 9 was not in the configuration (because
of a real fault), LRU A was disabled first. In approx-
imately 50 percent of the experiments, the second
LRU to be disabled was LRU 5 followed by LRU 8.
In the other 50 percent, LRU 8 was second followed
by LRU 5. The fourth disabled LRU was LRU 7.
With LRU 9 present (after being repaired), usu-
ally five LRU's are disabled before the system
crashes. The first LRU to be disabled alternated ran-
domly between LRU's 5 and A. The second LRU to
be disabled would be either LRU 5 or LRU A, de-
pending on which was first. The next casualty would
then be LRU 9, and thereafter LRU's 8 and 7 would
follow.
To investigate the causes behind these failures,
the high-speed data acquisition system was used to
observe the error-latch values that were read, stored
in system memory, and processed by the error-latch
processing routines. By monitoring this activity dur-
ing a fault-injection experiment, it is possible to ob-
serve the system behavior and detect any processing
error or data corruption.
The acquired data indicated that there were oc-
casions when only one LRU reported a specific bus
error. All the LRU's that singly reported an error
on an active bus on two separate occasions, trigger-
ing the lying-fault syndrome component of the fault-
management software, were disabled. Why is one
specific LRU the only one detecting an error? The
answer to this question is of fundamental importance.
It implies that weaknesses exist in the robustness of
this fault-tolerant architecture and of fault-tolerant
systems in general. This architecture was designed
with the principles of fault-tolerant-systems design
theory in mind. These principles state the number
of fault-containment regions and how they are
to communicate with each other in order to tolerate
n faults, including Byzantine faults. For more details
on fault-tolerant-systems theory and prototypes, see
reference 7. The fact that single faults can cause the
system to misbehave implies that the fault-tolerant
design principles are incomplete and cannot by them-
selves produce a reliable or robust architecture.
First, the authenticity of the reported error must
be considered; i.e., the reported error could be real
(which means that the injected fault somehow pre-
vented other LRU's from seeing it), or the error could
be fictitious (which means that the injected fault
somehow affected the error-latch values of another
LRU across fault-containment regions). Corruption
of the error-latch values could occur by several paths:
(1) by erasing the latches, (2) by corrupting system
memory, or (3) by noise coupling in the system bus.
During fault injections in the FTMP, the corruption
events must occur twice per LRU disabled (to trig-
ger the lying-fault syndrome software). Therefore,
any corruption of the error-latch values must hap-
pen periodically. Consequently, if error-latch data
corruption is the cause of the behavior, an example
should be readily caught.
The data acquisition system monitored the buses
when the fault-management software read the error
latches. The data thus obtained come directly out of
the latches and into the system bus. Subsequently,
the locations where these values are stored in sys-
tem memory are monitored. The data from both
sources can be compared to determine if any value
has changed. In order to reduce the data that will
be stored and analyzed, the acquisition system was
programmed to sift through the acquired error-latch
data and to store only those sets where bus errors are
reported. These data clearly indicated that the error
was authentic; i.e., latches were not being cleared by
faulty software and values were not corrupted during
storage in system memory or during transit in the
system bus. The reported errors occurred on P and
T buses; in particular, the P and T buses where the
faulted LRU was enabled. (The system configuration
data that include what LRU is enabled on each bus
were also acquired in real time.) Lacking any confir-
mation on repetitive error-latch data corruption by
faulty software or noise, the next step is to assume
that the reported error is real.
The question of a fault in one LRU affecting
another, although highly unlikely in this architecture,
was investigated. The only way for signals from one
LRU to reach another LRU is through the system
bus. No other signal path exists except for the
28-V power supply that was checked for noise and/or
glitches during nonfaulted and faulted operation, and
nothing was found. Therefore, for a fault in one
LRU to affect another, the fault-generated errors
must propagate out from the faulted LRU through
the system bus and into another LRU, defeating
the voters that are always enabled. To defeat the
voters the original error must affect at least one other
bus by cross talk so that at least two of the three
copies presented to the voters are in error. If the
buses were that sensitive to cross talk, the effects
would show during normal or nonfaulted operation
and would have been easily detected. The fact that
no such event was observed during the course of the
experiments implies another cause for the observed
behavior.
Causes of Anomalous Behavior
During the analysis of the acquired error-latch
data, a pattern emerged that suggested an explana-
tion of the anomalous behavior. The first hint was
observed in the data sets containing the data present
on the bus as the fault-management software read the
error latches and stored the values in system mem-
ory. The observed interaction between the system ar-
chitecture, the fault-management software, and the
system bus is the clue to the problem. The fault-
management software component that reads the er-
ror latches is structured as shown in figure 3.
A routine written in the Algol Extended for De-
sign (AED) high-level language calls an assembler
subroutine 12 times, once per LRU, to read the er-
ror latches of each LRU one by one into local cache.
Then, the four error-latch values per LRU are sent
to an array in system memory.
Reading the error latches seems like a simple task,
except when the multiprocessing nature of the archi-
tecture and the operation of the system bus controller
are taken into account. First, the system bus con-
troller (SBC) is a direct memory access (DMA) de-
vice. To operate, the SBC must be supplied with
the following: a destination address, a source ad-
dress, a word count (number of words to transfer),
and the type of transaction (normal read or write,
or a simplex read). A DMA operation is very ineffi-
cient at transferring single words or multiple words at
random noncontiguous locations. For a single-word
transfer, 10 instructions must be executed to set the
SBC to read the word to local cache and then to
transmit the word to system memory. For multiple
noncontiguous words, each word must be treated as
a single-word transfer; i.e., the SBC must be set for
every word transferred.
READ ERROR-LATCH PROCEDURE (a)
FOR I = 0 TO 12
    CALL READEL(LRU.ID = I);
EXIT;

READEL PROCEDURE (b)
SET SBC (c) TO READ P.EL OF LRU I INTO CACHE
READ P.EL
SET SBC TO READ R.EL OF LRU I INTO CACHE
READ R.EL
SET SBC TO READ T.EL OF LRU I INTO CACHE
READ T.EL
SET SBC TO READ C.EL OF LRU I INTO CACHE
READ C.EL
WRITE P.EL, R.EL, T.EL, AND C.EL FROM CACHE
INTO ARRAY IN SYSTEM MEMORY
EXIT

(a) The routine is originally coded in the AED language.
(b) The routine is originally coded in CAPS-6 assembler.
(c) System bus controller.

Figure 3. Structure of error-latch procedure.
The SBC can be set to release or maintain con-
trol of the system bus between noncontiguous word
transactions. If the bus is released after every word
transfer, the SBC must execute the polling sequence
to acquire control of the bus before transferring the
next word. If the bus is not released, the poll se-
quence can be avoided and word transfers speeded
up. However, there is a trade-off that must be in-
vestigated: if the number of noncontiguous words to
be transferred is larger than a certain threshold, it is
better to release the bus after each transaction. The
limit in the length of a transaction while holding the
bus is apparent when the multiprocessing character
of the FTMP is taken into account; i.e., there are
two more triads trying to execute other programs
and system bus transactions. If one triad holds the
bus too long, the other triads will have to wait to
finish their tasks, thus increasing the probability of
missing a deadline. Holding the bus could result in an
increased probability of missing interrupts, decreased
system throughput (two triads would not be able to
communicate for that period), undersampling of sen-
sors, and falling under the minimum permissible rate
for updating actuators with dire consequences for the
stability of the controlled system. Measurements ob-
tained with the data acquisition system indicate that
to read 48 error latches without releasing the bus
would take approximately 3.2 msec.
Experimental Data
Releasing the bus after every transaction avoids
the aforementioned problems but it increases the
time required to read the error latches. The time
increase has two components (ref. 4): (1) a poll must
be executed for every error latch read (i.e., 48 latches
× 15 µsec/poll ≈ 0.7 msec), and (2) the triad must
wait when the bus is busy (48 latches × 10 µsec/wait
≈ 0.5 msec). The expected time to read all the error
latches becomes, then, 3.2 + 0.7 + 0.5 = 4.4 msec.
Measurements indicate that depending on the I/O
traffic on the bus, the time to read the error latches
(releasing the bus every time) varies from 4 to 6 msec.
In figure 4, two examples of the bus activity during
error-latch read transactions are shown.
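The estimate above can be reproduced with a few lines of arithmetic. The following is only a sketch: the constants are the figures quoted in the text, while the function and variable names are illustrative and do not come from the FTMP software.

```python
# Estimate of the error-latch read time when the bus is released after
# every transaction. All constants are the figures quoted in the text.
N_LATCHES = 48          # error latches read per pass
HOLD_BUS_MSEC = 3.2     # measured read time without releasing the bus
POLL_USEC = 15.0        # one poll sequence per latch read
WAIT_USEC = 10.0        # wait per latch read when the bus is busy


def read_time_msec(n_latches=N_LATCHES, hold_msec=HOLD_BUS_MSEC,
                   poll_usec=POLL_USEC, wait_usec=WAIT_USEC):
    """Expected time (msec) to read all latches, releasing the bus each time."""
    poll_msec = n_latches * poll_usec / 1000.0
    wait_msec = n_latches * wait_usec / 1000.0
    return hold_msec + poll_msec + wait_msec
```

With the default values the result is about 4.4 msec, which falls inside the measured range of 4 to 6 msec.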
[Figure 4 graphic: time lines of the read error-latch transactions (P, R, T, and C latches of LRU 0, LRU 1, LRU 2, ...) interleaved with other triad transactions. (a) Example 1. (b) Example 2.]
Figure 4. Two typical examples of system bus activity during error-latch read transactions.
The time between reading the C bus error latch
of LRU n and reading the P bus error latch of LRU
n + 1 (this interval is denoted by tτ) varies depending
on the number of transactions by other triads (5 and
7 in the examples in fig. 4) from 277 to over 500 µsec.
The time between reading error latches of the same
LRU (this interval is denoted by tp), e.g., reading P
and R bus error latches of LRU 1, varies with the
I/O traffic from 67 to 250 µsec.
Approximately 45 percent of the triad trans-
actions captured with the data acquisition system
during the error-latch read transaction contained errors.
The external manifestation of the fault was
an inability of LRU 3 to transmit data on the P
and T buses. The average error rate in both P and
T buses as a consequence of the fault injected in
LRU 3 was approximately one error every 210 µsec
(4762 errors/sec). The error rate was approximately
the same on both buses because every word transac-
tion (about 99 percent of all transactions) involves a
poll sequence. Therefore, when the triad with LRU 3
as a member starts a transaction, a bus error will be
generated on both the P and T buses.
Subsystem Interaction Leading to
Anomalous Behavior
The interaction of the error-latch read transac-
tions, the time that the errors occur in the P and
T buses, and the particular details of the system bus
provide the framework on which the solution to the
subsystem anomalous behavior can be found. In fig-
ure 5(a) the entire operation of reading the 12 LRU
error latches is portrayed in a time-line diagram, and
an expanded diagram of the events for LRU's 8, 9,
and A is shown (fig. 5(b)). This figure will be used
to explain the anomalous behavior of LRU A during
fault injections.
[Figure 5 graphic: time lines of the error-latch read transactions.
(a) P, R, T, and C error-latch read transactions for LRU's 0 through B. (LRU B is absent.)
(b) Expanded diagram for LRU 8, LRU 9, and LRU A showing the windows P1 and P2. (Pn denotes interval where a transaction generates the first P bus error.)]
Figure 5. Time diagram of error-latch read transactions of LRU A.
During experiments, faults are injected at random
times during the software processing cycle. There-
fore, the first system bus error generated by the
faulty unit can occur anywhere during the cycle. Tri-
ads do not operate in synchronism with other triads;
therefore, to one triad the I/O transactions of an-
other triad will appear to occur at random times.
Thus, from the point of view of other triads, sub-
sequent errors generated by the faulty unit will ap-
pear to occur at random. The length of the slowest
cycle in FTMP, in which the error latches are read
and processed, is 320 msec. Therefore, for different
fault-injection events, the first and subsequent errors
generated by a fault occur at random times during
this cycle.
In figure 5(b) two transaction time windows, P1
and P2, are shown. Both P1 and P2 represent a time
interval where a transaction that generates the first
P bus error in the cycle could occur. The P1 window
occurs when LRU 9 is not in the configuration. Under
these conditions, a P bus error generated anywhere
after the P bus error latch of LRU 8 is read, but
before the P bus error latch of LRU A is read, would
be signaled only by LRU A. Although the error is
detected by all the LRU's, the software read the
error-latch values of LRU's 0 through 8 before the
error occurred. So when the P bus error latch of
LRU A is read, it is the only one that appears to the
software as having detected an error. The P2 window
occurs when LRU 9 is in the configuration.
Using the average values for tτ (304 µsec) and
tp (71 µsec), it is possible to estimate the probabilities
that the first bus error in the cycle occurs in the
intervals P1 and P2. The total time interval where
a transaction can occur is 320 msec (1 cycle). Assuming
that an error can occur at a random time
within the cycle, then for P1 the event probability
Pe is the length of P1 (i.e., 3tp + tτ + 3tp + tτ ≈
1.034 msec) divided by the total time in a cycle
(320 msec). Therefore, the estimated probability is
Pe ≈ 1/309. For P2 the same analysis gives a probability
Pe = (3tp + tτ)/320 msec ≈ 1/618. The P1 window
is double the size of P2; therefore when LRU 9
is absent, LRU A has double the chance of lying.
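As a quick check of these window probabilities, the calculation can be sketched as follows. The timing values are the averages quoted in the text; the function name and the idea of counting "read slots" spanned by a window are illustrative assumptions.

```python
# Probability that the first bus error of a cycle falls inside a window that
# spans a given number of LRU read slots. One slot lasts 3*tp + t_tau.
T_TAU_USEC = 304.0       # average gap between C latch of LRU n and P latch of LRU n+1
T_P_USEC = 71.0          # average gap between latches of the same LRU
CYCLE_USEC = 320_000.0   # 320-msec fault-management cycle


def window_probability(slots_spanned):
    """Window length divided by cycle length (errors assumed uniform in time)."""
    window_usec = slots_spanned * (3 * T_P_USEC + T_TAU_USEC)
    return window_usec / CYCLE_USEC


pe_p1 = window_probability(2)   # window P1: LRU 9 absent, two slots
pe_p2 = window_probability(1)   # window P2: LRU 9 present, one slot
```

The values come out near 1/309 for P1 and 1/619 for P2, in line with the rounded 1/309 and 1/618 used in the text, and the P1 value is exactly twice the P2 value.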
Using the estimated probabilities it is possible to
estimate the average number of fault injections before
an LRU is disabled by the lying-fault syndrome. The
probability of an LRU lying per 320-msec cycle (T)
is Pe. Two lying events are required for an LRU
to be disabled. Therefore, the average number of
cycles required before two events occur is given by
2/Pe. The average fault is maintained for a period
(Ft) of approximately 372.4 msec. Thus, the number
of faults before an LRU is disabled is given by how
many Ft's can fit into (2/Pe)T (the average time to
wait for two events). Thus, the average number of
faults is given by
$$ N_F = \frac{2T}{P_e F_t} \tag{1} $$
Without LRU 9, Pe = 1/309 for LRU A; therefore
the expected number of faults injected before the
LRU is disabled is NF ≈ 534. With LRU 9, Pe =
1/618 for LRU A; therefore NF ≈ 1067.
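Equation (1) is easy to evaluate directly. The sketch below uses the cycle length and average fault duration quoted in the text; the function name is illustrative.

```python
def expected_faults(pe, cycle_msec=320.0, fault_msec=372.4):
    """Equation (1): average number of injected faults before an LRU is
    disabled, given the per-cycle lying probability pe."""
    return 2.0 * cycle_msec / (pe * fault_msec)


nf_without_lru9 = expected_faults(1 / 309)
nf_with_lru9 = expected_faults(1 / 618)
```

With these inputs the results land within about 1 percent of the quoted 534 and 1067; the small difference presumably reflects rounding of the inputs in the source.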
After LRU A is disabled, the same analysis applies
for the active LRU whose error latches are read last
by the fault-management software; e.g., if LRU A is
the first LRU to be disabled and LRU 9 is not in
the system, then this same analysis applies to LRU 8
(see fig. 5(a)) which results in a probability Pe of
1/618 (NF ≈ 1067) of LRU 8 being the only LRU
signaling a P bus error. A similar analysis applies to
LRU 5. Figure 6 shows the situation for this LRU.
(Remember that T bus errors can be logged only by
the members of the processor triad that initiated the
bus transaction.)
[Figure 6 graphic: time lines of the error-latch read transactions.
(a) P, R, T, and C error-latch read transactions for LRU's 0 through B. (LRU B is absent.)
(b) Expanded diagram for LRU 4, LRU 5, and LRU 6 showing the window T1. (T1 denotes interval where a transaction generates the first T bus error.)]
Figure 6. Time diagram of error-latch read transactions of LRU 5.
During these experiments the processor triads
were initially organized as follows:
(1) Triad 1 = LRU's 0, 1, and 2
(2) Triad 2 = LRU's 3, 4, and 5
(3) Triad 3 = LRU's 6, 7, and 8
The situation for LRU 5 is similar to that for
LRU A. It is the last LRU in the error-latch read
sequence that is capable of detecting a T bus error
generated by its triad partner LRU 3. Therefore, if
the first error-generating event for a cycle (T1) occurs
after the T bus latch of LRU 4 is read but before the
T bus latch of LRU 5 is read, LRU 5 will appear
to the software as the only LRU signaling a T bus
error. The probability of this event can be estimated
as before (i.e., Pe = (3tp + tτ)/320 msec ≈ 1/618),
and results seem to be the same as the P2 event
probability for LRU A. Here the similarities between
LRU's 5 and A stop. When the triad containing the
faulty unit reads the error latches (one-third of the
time), the probability of a lying event is given by the
probability of the first error occurring in the interval
T1. However, the only transaction this triad does on
that interval is read the error latches from LRU 5.
Unfortunately, when triads read simplex data they
disable error detection and voting; therefore during
this cycle no lying event will be generated. If any of
the other two triads is reading the error latches (two-thirds
of the time), the probability of a lying event
is given by the probability that the triad with the
faulted unit makes a transaction in T1. Therefore,
the total probability of a lying event for LRU 5 is
(1/3)(0) + (2/3)(1/618) ≈ 1/927, which implies that
NF = 1600.
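The weighting of the two cases for LRU 5 can be written out with exact fractions. This is a minimal sketch; the function name is illustrative, and the window probability of 1/618 is the rounded value used in the text.

```python
from fractions import Fraction


def lru5_lying_probability(pe_window=Fraction(1, 618)):
    """Total per-cycle lying probability for LRU 5.

    One-third of the time the faulty unit's own triad reads the latches
    (simplex read, error detection off, so no lying event); two-thirds of
    the time another triad reads them and the window probability applies.
    """
    own_triad_reads = Fraction(1, 3) * 0
    other_triad_reads = Fraction(2, 3) * pe_window
    return own_triad_reads + other_triad_reads
```

With the default argument the result is exactly 1/927.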
To account for the variability of the real process,
the sample standard deviations of Ft (σF =
120.3 msec), tτ (στ = 53.5 µsec), and tp (σp =
39.4 µsec) are used to estimate a first-order approximation
to the variabilities of Pe and NF. Equation (3)
defines the standard deviation of Pe (eq. (2))
assuming that the random processes with standard
deviations σp and στ are independent. It should be
noted that if
$$ P_e = a\,t_p + b\,t_\tau $$
then
$$ \sigma_{P_e}^2 = a^2\sigma_p^2 + b^2\sigma_\tau^2 $$
where a and b are arbitrary constants. The resulting
equations are
$$ P_e = \frac{L\,(3t_p + t_\tau)}{T} \tag{2} $$
$$ \sigma_{P_e} = \frac{L\sqrt{9\sigma_p^2 + \sigma_\tau^2}}{T} \tag{3} $$
$$ N_F = \frac{2T}{F_t P_e + 2m\,\sigma_{P_e}F_t + 2n\,\sigma_F P_e + 4mn\,\sigma_F\sigma_{P_e}} \tag{4} $$
where
$$ L = \begin{cases} 1 & \text{(for LRU A with LRU 9 present)} \\ 2 & \text{(for LRU A with LRU 9 absent)} \\ 2/3 & \text{(for LRU 5)} \end{cases} $$
and both m and n are ±1. Equation (4) is obtained
by expanding the expression
$$ N_F = \frac{2T}{(F_t \pm 2\sigma_F)(P_e \pm 2\sigma_{P_e})} \tag{5} $$
Substituting the parameter values in the equations
above makes it possible to estimate the 2σ bounds of
NF (table 2).
From table 2, the lower and upper 2σ bounds of
NF are
400 ≤ NF ≤ 6000 (with LRU 9)
200 ≤ NF ≤ 3000 (without LRU 9)
600 ≤ NF ≤ 9000 (for LRU 5)
The upper bounds obtained indicate a wide distri-
bution and the possibility that the real distribution
of these values might not be the normal distribution.
Because of the relatively small data sample taken, the
real distribution cannot be determined. Therefore, in
order to correct the data for a skewed distribution,
the upper bounds are taken as 2000 with LRU 9, 1000
without LRU 9, and 3000 for LRU 5. Assuming that
the signs of the standard deviations are random and
independent from the other parameters, these upper
bounds will not be exceeded 75 percent of the time.
Table 2. Variability of NF

                       2σ bounds of NF for:
Signs (m, n)     Pe = 1/309     Pe = 1/618     Pe = 1/927
   −, −             3017           6034           9051
   −, +              649           1297           1946
   +, −             1000           2001           3002
   +, +              215            430            645
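The entries of table 2 can be regenerated from equations (2), (3), and (5) with the parameter values quoted in the text. The sketch below assumes that m is the sign applied to the standard deviation of Pe and n the sign applied to the standard deviation of Ft, as in equation (4); the function names are illustrative.

```python
import math

# Parameter values quoted in the text, all converted to msec.
T_MSEC = 320.0
F_T_MSEC, SIGMA_F_MSEC = 372.4, 120.3
T_P_MSEC, SIGMA_P_MSEC = 0.071, 0.0394
T_TAU_MSEC, SIGMA_TAU_MSEC = 0.304, 0.0535

# Sign pairs (m, n): m scales sigma_Pe, n scales sigma_F.
SIGNS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def p_e(L):
    """Equation (2)."""
    return L * (3 * T_P_MSEC + T_TAU_MSEC) / T_MSEC


def sigma_p_e(L):
    """Equation (3), assuming t_p and t_tau vary independently."""
    return L * math.sqrt(9 * SIGMA_P_MSEC**2 + SIGMA_TAU_MSEC**2) / T_MSEC


def n_f_bound(L, m, n):
    """Equation (5) evaluated for one choice of signs."""
    return 2 * T_MSEC / ((F_T_MSEC + 2 * n * SIGMA_F_MSEC)
                         * (p_e(L) + 2 * m * sigma_p_e(L)))


def bounds(L):
    """The four 2-sigma bounds of N_F for a given L, in SIGNS order."""
    return [n_f_bound(L, m, n) for m, n in SIGNS]
```

Calling `bounds(2)`, `bounds(1)`, and `bounds(2 / 3)` gives the columns for Pe = 1/309, 1/618, and 1/927, respectively, to within rounding.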
Model Predictions
Using this model makes it possible to predict the
expected sequence of LRU's that is disabled during
a typical fault-injection experiment. In figure 7 the
sequence of LRU's disabled is derived based only
on two conditions: the absence of LRU 9 and the
expected number of faults N F. Using the average
value of N F to derive the sequence should lead to
the most probable sequence. To derive the sequence
it must be remembered that the LRU number is
not important. The important information is the
position that the LRU adopts in the sequence of
error-latch read operations, i.e., the last active LRU
in the sequence or the last LRU in the sequence
that is a member of the triad containing the faulted
processor.
In figure 7, LRU A (with an average NF of 534)
is disabled first, followed by LRU 5 (NF = 1600).
When LRU 5 fails, its triad partner, the faulted
LRU 3, becomes a spare. The system software then
assigns LRU 3 to a new triad (the triad composed of
LRU's 0, 1, and 2) and swaps one of the processors of
the triad (LRU 0) with LRU 3 so that fault injections
can continue. On this triad (LRU's 1, 2, and 3),
the last LRU capable of detecting T bus errors from
LRU 3 is LRU 3. Therefore, in this case no lying can
occur.
After LRU A is disabled, LRU 8 becomes the
last active LRU in the error-latch read sequence (as
did LRU A). Therefore, when LRU A is disabled,
LRU 8 starts lying; and after the expected 1067 faults
from this moment (1067 + 534 = 1601 faults from
the beginning of the experiment), it is disabled.
The expected number of faults for LRU's 5 and 8
is similar; therefore, it could be expected that the
second and third LRU's to be disabled would flip
between LRU's 5 and 8. The next LRU to be disabled
(after 1067 + 1601 = 2668 faults) would be LRU 7
(because it is the last active LRU in the latch read
sequence). After LRU 7 is disabled, LRU 6 should
follow with 3735 faults injected, but this does not
occur. The triad composed of LRU's 1, 2, and 3 is
the last active triad in the system. (The rest of the
nondisabled LRU's are spares.) This last triad is left
with the faulted processor (LRU 3) and it is supposed
to run all the system work load. Unfortunately,
there are two problems that make this situation
unworkable: first, the work load is too much for one
triad to run on the allotted time, and second, the
fault-management software is effectively disabled by
this particular configuration.
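The cumulative fault counts in this prediction follow from adding the expected NF of each successive "last active" LRU. A minimal sketch for the configuration without LRU 9, using the average values from the text (the variable and function names are illustrative):

```python
from itertools import accumulate

# Last active LRU in the latch-read sequence at each stage (LRU 9 absent)
# and the expected number of faults it survives once it holds that position.
LAST_ACTIVE = ["A", "8", "7", "6"]
N_F_LAST_ACTIVE = [534, 1067, 1067, 1067]
N_F_LRU5 = 1600   # LRU 5 lies on the T bus from the start of the experiment


def predicted_sequence():
    """Return (cumulative faults, LRU) pairs sorted by fault count."""
    events = list(zip(accumulate(N_F_LAST_ACTIVE), LAST_ACTIVE))
    events.append((N_F_LRU5, "5"))
    return sorted(events)
```

The sorted list places LRU A at 534 faults, LRU's 5 and 8 at 1600 and 1601, LRU 7 at 2668, and LRU 6 at 3735. As the text explains, the LRU 6 entry is not observed in practice because the system stops responding first.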
[Figure 7 graphic: LRU's disabled plotted against number of faults injected.
(a) Sequence of LRU's disabled between 0 and 1200 (LRU A).
(b) Sequence of LRU's disabled between 1400 and 2600 (LRU's 5 and 8, LRU 7, system crash).]
Figure 7. Theoretical sequence of LRU's disabled for a configuration without LRU 9.
The first problem (not enough time to run the
work load) causes the system to slip frames so that
all the high-priority tasks are executed first. Un-
fortunately, the fault-management software and the
console-interface task are not among these. The low-
priority tasks are run when there is time available
(which does not occur very often). Consequently,
the schedules for these tasks fall apart, taking 20 to
30 sec to execute each task. The second problem in-
volves a check, performed in software, that prevents
any triad from reconfiguring itself. Therefore, being
the only active triad in the system, the triad of
LRU's 1, 2, and 3 will not reconfigure under any cir-
cumstances, even when LRU 3 is faulted. Ultimately,
the system does not have communication with the
outside world (low priority), and the fault-injection
software running on the minicomputer host indicates
time-outs for fault detection, isolation, and reconfig-
uration. These symptoms are similar to a system
crash and are the end result of almost all the exper-
iments where 4000 or more faults are injected.
On very rare occasions, the system can really
crash. These crashes, when they occur, always occur
when there are only two triads left and a lying LRU
is detected. Data obtained during these situations
indicate that the system software cannot handle the
situation where the faulted unit is isolated and recon-
figured out in the same cycle where a lying unit has
been detected. The system software, in its confusion,
assigns the same R buses to two different processors
on the same triad and tries to pull apart the two
remaining triads (and it succeeds). The complexity
and rarity of the events have prevented a complete
solution on the case of the "confused software."
The model sequence of disabled LRU's (A → 5
(or 8) → 8 (or 5) → 7 → system failure) is the most
commonly observed (approximately 99 percent of the
experiments). Other less probable sequences have
been observed, and all involve permutations of the
A → 5 → 8 → 7 sequence. If the variability of NF
is included, these less probable sequences result from
the overlapping of the NF values for the different
LRU's.
All these numbers are highly uncertain because
of the probabilistic nature of these events, which is
reflected in the variability of the values for the parameters
tτ, tp, στ, and σp and the limited size of
the data sample. In table 3 comparisons between
experimental data and the model results are shown.
General agreement exists between the model and the
experiment sequence of LRU's disabled and the number
of faults injected before each LRU was disabled.
In particular, when the variability of NF was included
(third column in table 3), all the LRU's were
disabled in their predicted 2σ intervals 86 percent of
the time.
Table 3. Comparison of Fault-Injection and Model Experiments

                        Experiment
LRU     Fault injection (a)       Model (b)
A        170 <444> 1090            200 <534> 1000
5        790 <1280> 2290           600 <1600> 3000
8        760 <1370> 1990           600 <1601> 3000
7       2500 <3670> 5100          1000 <2668> 5000

a Lower 2σ limit <average> upper 2σ limit.
b Numbers on right (1000, 3000, 3000, 5000) denote upper
bounds computed using m = 1 and n = -1 in equation (4).
When LRU 9 is present the initial conditions
change. Now, LRU's 5 and A have the same probabil-
ity of being disabled; therefore there are two equally
probable events competing for the first position in
the sequence. The second LRU will depend on the
first; i.e., if LRU A is first, then LRU 5 should be
next and vice versa. The third LRU is LRU 9, inde-
pendent of the first or the second LRU disabled. The
reason for such behavior is apparent if the position
of LRU 9 in the configuration is taken into consid-
eration; i.e., LRU 9 is a spare before the first LRU
is disabled. If the first LRU is 5, then LRU 9 takes
5's position in the triad (with LRU's 3 and 4). This
makes LRU 9 sensitive to a lying event because it is
the last LRU in the error-latch read sequence that
can detect a T bus error from LRU 3. If the first
LRU to be disabled is A (active on a memory triad),
then LRU 9 is the last LRU in the error-latch read
sequence, thus becoming as sensitive to a lying event
(in the P bus) as LRU A was. After both LRU's 5
and A are disabled, LRU 9 occupies both sensitive
positions for detecting P and T bus errors, thus dou-
bling its probability of lying. After LRU 9 is disabled,
the sequence of LRU's disabled is similar to the case
when LRU 9 is absent, with LRU's 8 and 7 failing
before the system fails. The result is not a sequence
but an upside-down tree. (See fig. 8.)
Supporting evidence for this model of the ob-
served behavior includes the fact that physically
interchanging the LRU boxes did not affect the ob-
served patterns as the model predicts. The LRU
identification number refers to the slot, not the box;
therefore by interchanging the identical LRU boxes,
the possibility that a latent fault in one of the boxes
is affecting the behavior of the system during the
fault-injection experiment is eliminated.
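[Figure 8 graphic: inverted tree of LRU's disabled. The first two LRU's disabled are 5 then A, or A then 5; both branches join at LRU 9, followed by LRU 8, LRU 7, and system crash.]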
Figure 8. Sequence of LRU's disabled (with LRU 9 in the
configuration).
Confirmation evidence for this model required the
capture of one event where the error occurred as
described previously. (See figs. 5 and 6 and text.)
Unfortunately, the data acquisition system did not
have enough memory to hold, or enough disk space
to store, an entire fault-injection experiment. (The
acquisition/storage rate is approximately 8MB/sec;
one experiment of 2000 fault injections lasts about
4 hours.) Therefore, the data acquisition system was
used to acquire a data sample and transfer it to the
host minicomputer so that a program could examine
it. If the sample contained an event of interest, the
program would store the data on disk; otherwise it
discarded the data and started the cycle again. After
several months of searching, an event was found. (See
fig. 9 and table 4.) A T bus error occurred after the
LRU 4 and before the LRU 5 error latches were read.
The content of the error latches indicated that only
LRU 5 saw the event, as expected.
Solution
As mentioned previously, it seems possible to pre-
scribe a quick solution to the problem. Keeping
control of the bus while the triad reads the error-
latch information should prevent other triads from
performing transactions during this time, thus pre-
venting an error from occurring between error-latch
readings. Unfortunately, holding the bus does not
solve the problem entirely; i.e., Pe does not become
zero. Any error manifestation of a naturally occur-
ring intermittent or transient fault in an element of
the triad reading the error-latch values could prop-
agate to the system bus at the appropriate time to
cause the lying fault behavior. Such an event will
depend on the error production behavior of the fault
and is therefore difficult to reproduce through fault-
injection experiments. A similar situation exists in
fault-injection experiments: the faults are injected
at random times that produce a nonzero probability
that the fault will be injected at the appropriate mo-
ment when the faulted unit is a member of the triad
reading the latches, thus causing the lying behavior.
The faulted unit is reenabled after each fault, thus
providing many trials on which a lying event can oc-
cur. Experiments were performed in which the sys-
tem was modified to hold the bus during error-latch
transactions and then was subjected to fault injec-
tions. These experiments confirmed that the proba-
bility Pe is not reduced to zero by holding the bus.
Table 4. Confirmation Event

               Raw error-latch data                  Interpreted data:
LRU     P.EL (a)    R.EL (b)    T.EL (c)    C.EL (d)    bus where error was detected
0       FFE1 (e)    FFE0        FFE0        FFE0        P1
1       FFE1        FFE0        FFE0        FFE0        P1
2       FFE1        FFE0        FFE0        FFE0        P1
3       FFE1        FFE0        FFE0        FFE0        P1
4       FFE1        FFE0        FFE0        FFE0        P1
5       FFE1        FFE0        FFE1        FFE0        P1, T1
6       FFE1        FFE0        FFE0        FFE0        P1
7       FFE1        FFE0        FFE0        FFE0        P1
8       FFE1        FFE0        FFE0        FFE0        P1
9 (f)
A       FFE1        FFE0        FFE0        FFE0        P1
B (f)

a P bus error latch.
b R bus error latch.
c T bus error latch.
d C bus error latch.
e All error-latch values in hex.
f Not in present configuration.
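The signature of the confirmation event, a bus error reported by exactly one LRU, can be picked out of the table 4 data with a short script. This is a sketch under stated assumptions: the low-order bit of each latch word is taken to flag an error on bus line 1 (inferred only from the table, where FFE1 is interpreted as P1 or T1 and FFE0 as no error), and the function names are illustrative rather than part of the FTMP software.

```python
# Raw latch words from table 4 (hex), in P, R, T, C order.
# LRU's 9 and B are omitted because they are not in the configuration.
TABLE_4 = {
    "0": (0xFFE1, 0xFFE0, 0xFFE0, 0xFFE0),
    "1": (0xFFE1, 0xFFE0, 0xFFE0, 0xFFE0),
    "2": (0xFFE1, 0xFFE0, 0xFFE0, 0xFFE0),
    "3": (0xFFE1, 0xFFE0, 0xFFE0, 0xFFE0),
    "4": (0xFFE1, 0xFFE0, 0xFFE0, 0xFFE0),
    "5": (0xFFE1, 0xFFE0, 0xFFE1, 0xFFE0),
    "6": (0xFFE1, 0xFFE0, 0xFFE0, 0xFFE0),
    "7": (0xFFE1, 0xFFE0, 0xFFE0, 0xFFE0),
    "8": (0xFFE1, 0xFFE0, 0xFFE0, 0xFFE0),
    "A": (0xFFE1, 0xFFE0, 0xFFE0, 0xFFE0),
}
BUSES = ("P", "R", "T", "C")
ERROR_BIT = 0x0001   # ASSUMPTION: low bit set means an error on bus line 1


def reporters(latches):
    """Map each bus to the LRU's whose latch flags an error on it."""
    result = {bus: [] for bus in BUSES}
    for lru, words in latches.items():
        for bus, word in zip(BUSES, words):
            if word & ERROR_BIT:
                result[bus].append(lru)
    return result


def lone_reporters(latches):
    """Buses on which exactly one LRU reports an error (candidate lying events)."""
    return {bus: lrus[0] for bus, lrus in reporters(latches).items()
            if len(lrus) == 1}
```

Applied to `TABLE_4`, `lone_reporters` returns only the T bus with LRU 5 as the sole reporter, while the P bus error is reported by all ten configured LRU's.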
Elimination of the behavior seems possible by
the following procedure. After reading and analyzing
the error latches, and after determining that there is
a lying event where LRU x is accusing LRU y of
producing an error, the error latches of all the LRU's
read before reading the LRU x latches should be read
again. With this procedure the situation explained
in this section is avoided; i.e., if an error occurs just
before the LRU x latches are read, then reading the
latches of all the LRU's read previous to the error
event would yield the missing data and eliminate
the appearance of a lying event. Unfortunately, this
procedure extends the error-latch processing time by
a factor of 2, which causes frame slippage and other
scheduling problems on the FTMP.
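The re-read procedure described above can be sketched as follows. Everything here is hypothetical scaffolding: `read_latch` stands in for whatever routine reads one bus's error latch on one LRU, and `SimulatedBus` is a toy model in which an error occurs after a fixed number of reads and is then visible in every latch. Neither reflects the actual FTMP code.

```python
def scan_with_reread(read_latch, order):
    """Read one bus's error latch on each LRU in `order`. If exactly one LRU
    reports an error, re-read the LRU's that were read before it."""
    seen = {lru: read_latch(lru) for lru in order}
    flagged = [lru for lru in order if seen[lru]]
    if len(flagged) == 1:
        accuser_position = order.index(flagged[0])
        for lru in order[:accuser_position]:
            seen[lru] = seen[lru] or read_latch(lru)
    return seen


class SimulatedBus:
    """Toy stand-in for the hardware (an assumption for illustration only)."""

    def __init__(self, error_after_reads):
        self.reads = 0
        self.error_after_reads = error_after_reads

    def read_latch(self, lru):
        latched = self.reads >= self.error_after_reads
        self.reads += 1
        return latched


# T bus latches of the triad with LRU's 3, 4, and 5; the error occurs after
# the latches of LRU's 3 and 4 have already been read.
bus = SimulatedBus(error_after_reads=2)
result = scan_with_reread(bus.read_latch, ["3", "4", "5"])
```

In this toy run the first pass sees the error only on LRU 5, and the second pass recovers it on LRU's 3 and 4, at the cost of extra reads. In the worst case nearly every latch is read twice, which corresponds to the factor-of-2 increase in processing time noted in the text.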
The most promising solution would seem to be
redundant latches on the LRU's. Redundant latches
would make all the present latch processing unnec-
essary and, coupled with the above procedure for
dealing with a single LRU reporting an error, would
eliminate the processing overhead and the lying event
problems.
Implications for Fault-Tolerant
Multiprocessor
This section considers the effects of real faults on
FTMP behavior with the lying-fault process taken
into account. The emphasis here is to develop a
model that combines the error-production behavior
of the faults and the system architecture into equa-
tions that can be used to predict the probability of
observing this behavior for real faults. These equa-
tions can then be generalized and applied to more
general architectures.
At the system bus level, a hard fault on the
FTMP behaves as an intermittent fault; i.e., the
fault-generated errors occur only when the faulty unit
is on the bus. A hard fault is defined here as a fault
that causes the affected LRU to generate an error
every time it transmits on the bus. An intermittent
or transient fault, in contrast, is defined here as a
fault that causes the affected LRU to generate an
error in a fraction of its transmissions on the bus.
Therefore, in the FTMP the fault behaviors for hard,
transient, and intermittent faults differ by the error
production rate; i.e., hard faults produce errors at
the same rate that the faulty unit performs I/O
operations, whereas intermittent and transient faults
produce errors at a fraction of the I/O rate.
Hard Faults
For a hard fault to cause a lying event, the as-
sumption of errors occurring at random times during
a cycle, and in particular during the error-latch read
transactions, must apply. Clearly, this assumption is
valid only in: (1) the first cycle when the fault oc-
curs, and (2) subsequent cycles if the faulty unit is
not an element of the triad reading the error latches.
The reasoning behind the previous statement lies in
the characteristics of a hard fault; i.e., a hard fault
affects every operation of the faulty unit. If a fault
occurs at any time during a cycle, it does not matter
on which LRU it resides because the first error will
occur at a random time in this cycle. For subsequent
cycles, if the faulty unit is an element of the triad
reading the error latches, its erroneous output will
be detected from the first transaction (reading the
P latch of LRU 0) by both its partners in the triad,
thus preventing a lying-fault event. However, if the
faulty unit is in another triad (not reading the error
latches), the errors produced will appear random to
the triad that is reading the latches.
[Figure 9 graphic: time line of the P, R, T, and C error-latch reads for LRU 4, LRU 5, and LRU 6, with the transaction Tr marked.]
Figure 9. Confirmation event. (Tr denotes transaction that generates the first T bus error.)
For hard faults there is only one chance for lying
events to occur: the events must occur before the
faulty unit is isolated and reconfigured out. Hard
faults could be isolated in 1 or more cycles. If it takes
only 1 cycle to isolate the fault, the fault has only one
chance of causing a lying event. If the faulty LRU
is disabled during the first cycle, the error-producing
hard fault is taken out of commission before it generates
another lying event. If the system takes longer
than 1 cycle, a chance exists that two lying events
are generated in consecutive cycles. The parameters
pn (the probability that n cycles are required to recover
from a fault) and N (the maximum number of
cycles required to recover from a fault) can be estimated
from fault-injection data for both hard and
intermittent faults. The results are shown in table 5.
Table 5. Values for pn and N Obtained From
Fault-Injection Experiments

                              Parameter pn for:
Number of cycles (n)     Hard faults     Intermittent faults (a)
1                        0.3525          0
2                         .6187           .1277
3                         .0288           .8298
4                        0                .0425
N                        3               4

a For stuck-at-one faults injected at random times and lasting
for 1 msec.
The rate at which a hard fault generates two lying
events is given by
$$ R = 3\lambda P_e^2\,[2p_2 + p_3(1 - p_2P_e)] \tag{6} $$
where
R     rate at which a lying LRU is disabled per hour
Pe    estimated probability of a lying LRU per cycle
λ     hard failure rate for an LRU
pn    probability that n cycles are required to isolate the fault (n = 2 or 3 for hard faults)
To derive equation (6), the fact that the fault-
management software moves from one triad to an-
other on every cycle must be taken into account. (See
fig. 10.)
In figure 10 the fault-management execution cycle
is depicted. The states represent the location of
the fault-management software at any given time.
Through time the states form an infinite sequence.
For analysis, a starting point can be selected as the
current sequence when a fault first occurred; e.g., a
sequence order could be S2 → S3 → S1 → S2 → ...
In the first cycle where the fault produces errors,
the faulty LRU can be on any triad and the system
can be in any state. Therefore, the probability of
a lying event in this cycle is 9λPe. The probability
that a second cycle is needed to recover from the
fault is p2, and the faulty LRU has a probability of
two-thirds of not being in the triad reading the error
data. Therefore, the probability of a lying event in
the second cycle is (2/3)Pe p2. As shown in table 5 for
hard faults, a maximum of 3 cycles could be needed
to recover from the fault (with a probability of p3).
After 3 cycles the fault-management software has
moved through the three triads. As a result, if the
faulty unit and the fault-management task fall within
the same triad during the second or third cycles, the
hard fault would be detected before causing a lying
fault (as discussed previously). Therefore, of the
three state sequences that end in a 3-cycle recovery,
two have a nonzero probability of causing a lying
event. These results are derived below.
Fault-management-software execution states

State     Triad 1     Triad 2     Triad 3
S1        FMS         T/I         T/I
S2        T/I         FMS         T/I
S3        T/I         T/I         FMS
Figure 10. The three possible states of fault-management-
software execution on the triads. (Sx denotes state x;
FMS denotes fault-management software; T/I denotes
task/idle.)
In this discussion assume that the fault occurred
in a component of triad 2. In one sequence, S3 →
S1 → S2, the fault occurs in a cycle (S3) where the
faulty unit is not reading the error latches. In the
second cycle (S1), the faulty unit and the software
again do not "meet" on the same triad. Therefore,
during the second cycle, the probability of a lying
event is (1/3)Pe p2. The 1/3 factor is required to
account for the fact that this is one of the three
possible sequences of 3 cycles. For the sequence S1 →
S2 → S3, the probability of a lying event in the second
cycle is zero because during this cycle the FMS and
the faulty unit reside on the same triad. The second
sequence of 3 cycles with a nonzero probability of a
lying event is S2 → S3 → S1, where the fault and the
software meet on the same triad in the first cycle.
The fault arrival is random, and therefore there is a
probability Pe that it will arrive at the proper time to
cause a lying event. For this sequence the probability
of two lying events by the second or third cycle is
(1/3)[p2Pe + (1 − p2Pe)p3Pe]. The 1/3 accounts for
the probability that the fault arrives at S2. The total
rate then is
$$ R = 9\lambda P_e\left[\tfrac{1}{3}p_2P_e + \tfrac{1}{3}p_2P_e + \tfrac{1}{3}(1 - p_2P_e)\,p_3P_e\right] $$
Grouping common terms gives
$$ R = 9\lambda P_e\left\{\tfrac{1}{3}P_e\,[2p_2 + p_3(1 - p_2P_e)]\right\} $$
and simplifying further gives
$$ R = 3\lambda P_e^2\,[2p_2 + p_3(1 - p_2P_e)] $$
which is equation (6). Although this equation is
derived for a faulty member of triad 2, it is still a
general result as can be demonstrated by deriving
the equation for the other two triads.
After the lying LRU is disabled, the faulty LRU
is most likely disabled; therefore R also gives the rate
at which two LRU's are disabled, the lying and the
faulty LRU's.
Computing R (using the values for a hard fault in
table 5) and assuming a failure rate of λ = 10^-4 per
hour (see ref. 8) and Pe = 1/618 gives R ≈ 10^-9 per
hour; i.e., the rate at which two LRU's are disabled
(the first one due to a lying event and the second to
the real hard fault) is approximately 10^-9 per hour.
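The numerical value of R can be checked by evaluating equation (6) with the hard-fault column of table 5. The sketch below also evaluates the ungrouped three-term sum from the derivation to confirm that the two forms agree; the function names are illustrative.

```python
# Hard-fault recovery probabilities p_n from table 5.
P_HARD = {1: 0.3525, 2: 0.6187, 3: 0.0288}


def hard_fault_rate(lam=1e-4, pe=1 / 618, p=P_HARD):
    """Equation (6): rate (per hour) at which a hard fault causes two lying events."""
    return 3 * lam * pe**2 * (2 * p[2] + p[3] * (1 - p[2] * pe))


def hard_fault_rate_ungrouped(lam=1e-4, pe=1 / 618, p=P_HARD):
    """The same rate written as the sum over the two contributing sequences."""
    seq_s3_s1_s2 = p[2] * pe / 3
    seq_s2_s3_s1 = (p[2] * pe + (1 - p[2] * pe) * p[3] * pe) / 3
    return 9 * lam * pe * (seq_s3_s1_s2 + seq_s2_s3_s1)
```

With the default arguments both functions give a rate just under 10^-9 per hour, consistent with the value quoted in the text.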
Intermittent Faults
The error-production behavior of intermittent
faults is assumed to be random in time (although de-
pending on where the fault is located and the utiliza-
tion rate and input signals of the faulty hardware);
therefore no consideration is necessary for the case
where the faulty unit is part of the triad reading the
error latches. The rate for two lying events is then
the combinatorial probability of two events occurring
in 4 cycles (N = 4 in table 5). The probability
of a lying event in 1 cycle is different for intermittent
faults (Pi) and hard faults (Pe). The difference lies
in the fault behavior. If an intermittent fault gener-
ates faults at random times in a cycle, then Pi equals
the probability of errors occurring at the appropriate
intervals. The observed rate at which a good LRU
would be disabled by intermittent faults is given by
o,]R = 9A E pap2(1 - Pi)n-22!(n - 2)! (7)
n=2
For intermittent faults it usually takes more than
2 cycles to isolate the faulty unit, as shown in table 5.
Computing R (using the value for Pi as an order
of magnitude less than Pe for hard faults), using
λ = 10^-3, and using the values given in table 5 for
pn for intermittent faults gives R ≈ 6.7 × 10^-10 per
hour. For a rate of R < 1 × 10^-10 per hour, the
intermittent failure rate must be λ < 1.5 × 10^-4 per
hour.
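The following sketch shows one way equation (7) could be evaluated and how the bound on the intermittent failure rate follows from the linearity of R in λ. The function names are illustrative, and the distribution p_n is a hypothetical placeholder (the measured distribution is in table 5), so the first printed rate is not expected to reproduce the value quoted above. The rescaling step uses only numbers from the text.

```python
from math import comb


def lying_rate_intermittent(lam, p_i, p_n):
    """Equation (7): R = 9*lam * sum_n p_n * C(n, 2) * Pi^2 * (1 - Pi)^(n - 2).

    p_n maps a cycle count n (n >= 2) to the probability that isolating
    the faulty unit takes n cycles.
    """
    total = sum(
        p * comb(n, 2) * p_i**2 * (1.0 - p_i) ** (n - 2)
        for n, p in p_n.items()
        if n >= 2
    )
    return 9.0 * lam * total


def max_failure_rate(r_known, lam_known, r_target):
    """R is linear in lam, so a known (lam, R) pair can be rescaled to a target R."""
    return lam_known * r_target / r_known


# HYPOTHETICAL distribution of cycles to isolation; table 5 holds the measured one.
P_N = {2: 0.4, 3: 0.3, 4: 0.3}
P_I = (1.0 / 618.0) / 10.0  # one order of magnitude below Pe, as in the text

rate = lying_rate_intermittent(1.0e-3, P_I, P_N)
bound = max_failure_rate(6.7e-10, 1.0e-3, 1.0e-10)
print(f"R with placeholder p_n: {rate:.2e} per hour")
print(f"failure-rate bound:     {bound:.2e} per hour")
```

Rescaling the quoted pair (λ = 10^-3, R = 6.7 × 10^-10) to a target of 10^-10 per hour gives a bound of about 1.5 × 10^-4 per hour, in agreement with the text.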
For environment-related intermittent and transient
errors, λ depends on the external environment
where the system is located. Therefore, a value for
the failure rate cannot be readily estimated from
laboratory tests or general theory. Also, the pn
parameter depends on the displayed error behavior of the
underlying error-producing phenomena, e.g.,
electromagnetic interference. Environment-caused errors do
not stop after some units are disabled; therefore they
can cause simple or multiple simultaneous errors and
corruption of the simplex error-latch data until the
system crashes.
Effect of Holding the Bus on the FTMP
The multiprocessing character of the FTMP provides
enough redundancy for the FTMP to survive
these events by degrading to two triads. Keeping
control of the bus during the error-latch reading will
not help avoid lying faults. By holding the system
bus it is possible to keep triads from executing
transactions when the error latches are read, but if an
element of the triad performing the latch reading is
affected by an intermittent fault, there is a chance
for the fault to generate an error at the appropriate
times to generate a lying event. Under the same
assumptions as before, the rate of a lying event caused
by an intermittent fault is given by

$$R = 3\lambda p_4 P_e^2 \tag{8}$$

Equation (8) is derived as follows: there are three
active triads, and the rate that a fault occurs in any
triad is 9λ. The probability that this triad is reading
the error latches is one-third, at which time a lying
event could be generated with probability Pe. After
4 cycles the faulty unit and the FMS come full-circle,
at which time a second lying event could be generated
with probability Pe. The probability that the fault
lasts 4 cycles before the faulty unit is disabled is
p4. Therefore, the rate at which an intermittent
fault generates two lying events is given by R =
(9λ)(1/3)(Pe)(p4)(Pe), which becomes equation (8).
The value of Pe when the system holds the bus
differs from the previous value used. Assuming that
a random intermittent fault will produce errors at
any random time within a cycle of length T, it is
possible to approximate Pe as before; i.e., Pe = δt/T
where δt denotes the time window in which the
system is susceptible to a lying event. While holding
the bus there is no need to perform a bus polling
sequence; therefore tp ≈ (71 - 15) = 56 μsec and
t7 ≈ (304 - 15) = 289 μsec. These values for
tp and t7 account for the time savings achieved by
not performing a poll sequence (≈ 15 μsec). The
expression for Pe is

$$P_e = \frac{3 t_p + t_7}{T} \tag{9}$$

where T = 320 msec. Evaluating equation (9) gives
Pe = 1/700. Evaluating equation (8) with this value
of Pe, with λ = 10^-4 and with the appropriate values
of pn, results in R = 2.6 × 10^-11 per hour as the rate
of losing an LRU to a lying event.
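The arithmetic behind equations (8) and (9) can be sketched as follows. The timing values, the cycle length, and λ are taken from the text; the function and variable names are illustrative, the second timing symbol is written t_7 as it appears in the text, and the value used for p4 is a hypothetical placeholder because the measured value is in table 5.

```python
def lying_event_probability(t_p_us, t_7_us, cycle_us):
    """Equation (9): Pe = (3*t_p + t_7) / T, with all times in microseconds."""
    return (3.0 * t_p_us + t_7_us) / cycle_us


def lying_rate_bus_held(lam, p4, pe):
    """Equation (8): R = 3*lam*p4*Pe^2, i.e. (9*lam)*(1/3)*Pe*p4*Pe."""
    return 3.0 * lam * p4 * pe**2


POLL_US = 15.0        # time saved by skipping the bus polling sequence
CYCLE_US = 320_000.0  # T = 320 msec expressed in microseconds

pe_hold = lying_event_probability(71.0 - POLL_US, 304.0 - POLL_US, CYCLE_US)
print(f"Pe = 1/{1.0 / pe_hold:.1f}")

P4 = 0.04  # HYPOTHETICAL placeholder; the measured p4 is given in table 5
rate = lying_rate_bus_held(1.0e-4, P4, pe_hold)
print(f"R with placeholder p4: {rate:.2e} per hour")
```

The first line evaluates (3 × 56 + 289)/320 000 = 457/320 000, which is the 1/700 quoted in the text. The second line only illustrates the scaling of equation (8), since p4 is assumed.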
Lying Events on a Triple Modular
Redundancy System
For a fault-tolerant system based on adaptive-voting
triple modular redundancy (TMR) principles
(ref. 7), the probability of failure is greater than or
equal to the probability of losing a good LRU while
there is a faulty LRU active. For this situation (after
the good LRU is disabled, leaving the faulty one to
produce more errors), the system has enough redundancy
to detect errors, but not enough to mask them. Thus,
it is assumed that the next error causes a system
failure. The rate for a lying event in a TMR system
is given by

$$R = 3\lambda \sum_{n=2}^{\infty} p_n \frac{n!}{2!\,(n-2)!} P_i^2 (1 - P_i)^{n-2} \tag{10}$$

Equation (10) is obtained from equation (7) by
considering that in a TMR system there is only
one triad that reads the error data. The system
failure rate (R) versus the probability of a lying
event (Pi) for a TMR system is shown in table 6
for different ranges of Pi and assuming that λ =
1.0 × 10^-3. It is assumed here that the values for
the probability of requiring n cycles to isolate and
reconfigure from a fault (pn), obtained from
fault-injection experiments on the FTMP, apply to the
hypothetical TMR system.
Table 6. Rate of Failure of TMR System (Lying Event)

    Pi              R (from eq. (10)) (a)
    1.0 × 10^-5     8.6 × 10^-13
    5.0 × 10^-5     2.2 × 10^-11
    1.0 × 10^-4     8.6 × 10^-11
    5.0 × 10^-4     2.2 × 10^-9
    1.0 × 10^-3     8.6 × 10^-9
    5.0 × 10^-3     2.1 × 10^-7
    1.0 × 10^-2     8.5 × 10^-7
    5.0 × 10^-2     2.0 × 10^-5

(a) R and λ have the same units in equation (10).
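The near-quadratic dependence of R on Pi in table 6 can be checked with a short sketch. When Pi is small the factor (1 - Pi)^(n-2) in equation (10) is close to 1, so R is approximately 3λK Pi^2, where K is the sum of p_n n!/(2!(n-2)!) over n. The constant K is not stated in the text; here it is inferred from the first table row, which is an assumption of this sketch and not a value from table 5.

```python
LAMBDA = 1.0e-3  # failure rate assumed for table 6

# (Pi, R) pairs as listed in table 6
TABLE_6 = {
    1.0e-5: 8.6e-13,
    5.0e-5: 2.2e-11,
    1.0e-4: 8.6e-11,
    5.0e-4: 2.2e-9,
    1.0e-3: 8.6e-9,
    5.0e-3: 2.1e-7,
    1.0e-2: 8.5e-7,
    5.0e-2: 2.0e-5,
}

# INFERRED constant: K = sum_n p_n * C(n, 2), backed out of the smallest-Pi row
pi_min = min(TABLE_6)
K = TABLE_6[pi_min] / (3.0 * LAMBDA * pi_min**2)


def approx_rate(p_i):
    """Small-Pi approximation of equation (10): R ~ 3*lam*K*Pi^2."""
    return 3.0 * LAMBDA * K * p_i**2


for p_i, r_table in TABLE_6.items():
    ratio = approx_rate(p_i) / r_table
    print(f"Pi = {p_i:.1e}  table R = {r_table:.1e}  "
          f"approx R = {approx_rate(p_i):.1e}  ratio = {ratio:.3f}")
```

The approximation stays within a few percent of the tabulated values for small Pi and overestimates R slightly at the largest Pi, which is the expected effect of the neglected (1 - Pi)^(n-2) factor.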
Implications of New Technology
Although the large-scale integration of circuitry
in a single chip makes pin-level hardware fault
injection less effective, new technologies have evolved
in this field. Utilizing lasers to induce a charge in a
circuit node of a chip has been reported by many
researchers at different laboratories. This method uses
a laser of the appropriate wavelength to excite the
valence electrons of the bulk semiconductor and
electron donor impurities. The valence electrons absorb
the laser energy and jump to the conduction band.
The added electrons in the conduction band create
a voltage in the node. This voltage perturbation is
the desired effect in all hardware fault-injection
techniques. Also, a new fault-injection technique solely
based in software has been developed by
Carnegie-Mellon University (ref. 9). All these techniques have
limitations. It is up to the investigators to determine
the most suitable technique to use for the system
under test.
Concluding Remarks
A new fault-injection methodology was used
successfully to investigate abnormal fault-recovery
behavior of the Fault-Tolerant Multiprocessor (FTMP).
With new instrumentation and fault-location selection
criteria it was possible to observe and study
the abnormal fault-recovery behavior of the FTMP.
This behavior involved the reconfiguration of working
units while the faulted unit was kept in the active
configuration. The causes of the abnormal behavior
were found and explained. Using a simple model of
the behavior of the faulty system and data from
fault-injection experiments to estimate the model
parameters, the FTMP probability of losing two line
replaceable units (LRU's) by a single fault was estimated.
The model results indicate that the probability rate
of losing a good unit to a lying fault is approximately
10^-9 per hour.
Applying the model to a generic triplex system, in
which a lying event represents a system failure, gives
the system failure rate as approximately 10^-9 + δ per
hour, where δ represents the contribution of any
other failure mechanism to the failure rate. Therefore,
the failure rate for the example triplex system
is an order of magnitude greater than the desired
rate for ultrareliable systems (10^-10 per hour). This
implies that a fault-tolerant system must either
incorporate features to correctly handle this behavior
or incorporate massive redundancy so that it can
tolerate the loss of working units at the calculated rate.
It is evident from the discussion in this paper that
the simplex sourcing of error-latch data is a weak
link for systems depending on these data for fault re-
covery in a noisy environment or with design flaws.
The unpredicted interaction between hardware, soft-
ware, and faults can bring forth new, and most likely
anomalous, behavior from the systems.
It has been shown that fault injection can help
detect and analyze the behavior of a system in the
ultrareliable regime. Although fault-injection test-
ing cannot be exhaustive, it has been demonstrated
here that it provides a unique capability to unmask
problems in the system fault-management software
and the software interaction with the system hard-
ware. In view of the results obtained for the FTMP,
it can be concluded that fault-injection experiments
can provide very useful characterization data on the
behavior of fault-tolerant systems.
NASA Langley Research Center
Hampton, VA 23665-5225
January 25, 1991
References
1. Smith, T. Basil, III; and Lala, Jaynarayan H.:
Development and Evaluation of a Fault-Tolerant Multiprocessor
(FTMP) Computer. Volume I--FTMP Principles of
Operation. NASA CR-166071, 1983.
2. Lala, Jaynarayan H.; and Smith, T. Basil, III:
Development and Evaluation of a Fault-Tolerant Multiprocessor
(FTMP) Computer. Volume III--FTMP Test and
Evaluation. NASA CR-166073, 1983.
3. Finelli, George B.: Characterization of Fault Recovery
Through Fault Injection on FTMP. IEEE Trans. Reliab.,
vol. R-36, no. 2, June 1987, pp. 164-170.
4. Padilla, Peter A.: FTMP Data Acquisition Environment.
NASA TM-100636, 1988.
5. Padilla, Peter A.: In-Circuit Fault Injector User's Guide.
NASA TM-100478, 1987.
6. Lala, Jaynarayan H.; and Smith, T. Basil, III:
Development and Evaluation of a Fault-Tolerant Multiprocessor
(FTMP) Computer. Volume II--FTMP Software. NASA
CR-166072, 1983.
7. Siewiorek, Daniel P.; and Swarz, Robert S.: The Theory
and Practice of Reliable System Design. Digital Press,
c.1982.
8. Hopkins, Albert L., Jr.; Smith, T. Basil, III; and Lala,
Jaynarayan H.: FTMP--A Highly Reliable Fault-Tolerant
Multiprocessor for Aircraft. Proc. IEEE, vol. 66, no. 10,
Oct. 1978, pp. 1221-1239.
9. Czeck, Edward W.; Siewiorek, Daniel P.; and Segall,
Zary Z.: Predeployment Validation of Fault-Tolerant
Systems Through Software-Implemented Fault Insertion.
NASA CR-4244, 1989.

Report Documentation Page
National Aeronautics and Space Administration
1. Report No.: NASA TM-4218
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: Abnormal Fault-Recovery Characteristics of the
Fault-Tolerant Multiprocessor Uncovered Using a New Fault-Injection
Methodology
5. Report Date: March 1991
6. Performing Organization Code:
7. Author(s): Peter A. Padilla
8. Performing Organization Report No.: L-16630
9. Performing Organization Name and Address: NASA Langley Research
Center, Hampton, VA 23665-5225
10. Work Unit No.: 506-46-21-05
11. Contract or Grant No.:
12. Sponsoring Agency Name and Address: National Aeronautics and Space
Administration, Washington, DC 20546-0001
13. Type of Report and Period Covered: Technical Memorandum
14. Sponsoring Agency Code:
15. Supplementary Notes:
16. Abstract:
An investigation was made in the Langley Avionics Integration Research Laboratory (AIRLAB)
of the fault-handling performance of the Fault-Tolerant Multiprocessor (FTMP). Fault-handling
errors detected during fault-injection experiments were characterized. In these fault-injection
experiments, the FTMP disabled a working unit instead of the faulted unit once every 500 faults,
on the average. System design weaknesses allow active faults to exercise a part of the
fault-management software that handles Byzantine or "lying" faults. Byzantine faults behave in
such a way that the faulted unit points to a working unit as the source of errors. The design
problems involve: (1) the design and interface between the simplex error-detection hardware and
the error-processing software, (2) the functional capabilities of the FTMP system bus, and (3) the
communication requirements of a multiprocessor architecture. These weak areas in the FTMP
design increase the probability that for any hardware fault, a good line replaceable unit (LRU) is
mistakenly disabled by the fault-management software.
17. Key Words (Suggested by Author(s)): Fault-tolerant multiprocessor;
Fault management; Reconfiguration
18. Distribution Statement: Unclassified--Unlimited. Subject Category 33
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages:
22. Price: A03
NASA FORM 1626 OCT 86. NASA-Langley, 1991
For sale by the National Technical Information Service, Springfield, Virginia 22161-2171

