Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer.  Volume 3:  FTMP test and evaluation by Lala, J. H. & Smith, T. B., III
NASA Contractor Report 166073 
NASA-CR-166073 
19850022395 
DEVELOPMENT AND EVALUATION 
OF A FAULT-TOLERANT 
MULTIPROCESSOR (FTMP) COMPUTER 
Volume III 
FTMP Test and Evaluation 
Jaynarayan H. Lala and T. Basil Smith, III 
THE CHARLES STARK DRAPER LABORATORY, INC. 
555 Technology Square 
Cambridge, Massachusetts 02139 
CONTRACT NAS1-15336 
MAY 1983 
FOR EARLY DOMESTIC DISSEMINATION

Because of its significant early commercial potential, this information, which has been developed under a U.S. Government program, is being disseminated within the United States in advance of general publication. This information may be duplicated and used by the recipient with the express limitation that it not be published. Release of this information to other domestic parties by the recipient shall be made subject to these limitations. Foreign release may be made only with prior NASA approval and appropriate export licenses. This legend shall be marked on any reproduction of this information in whole or in part.
Review for general release May, 1985 
NASA
National Aeronautics and 
Space Administration 
Langley Research Center 
Hampton, Virginia 23665 
FOREWORD 
This report was authored by Dr. Jaynarayan H. Lala. Dr. T. Basil 
Smith was the project engineer. Mr. Charles Meisner was the NASA 
technical monitor for the period January-December 1982, and Mr. Nicholas 
Murray was the technical monitor from August 1978 to December 1981. 
Following are some of the people who contributed to the success of this 
project. 
Draper Laboratory 
Dr. Albert Hopkins 
Mr. Jack McKenna 
Ms. Linda Alger 
Mr. Kevin Koch 
Mr. Alan Wimmergren 
Mr. Robert Scott 
Mr. Joseph Marino 
Mr. David Hauger 
Mr. Mario Santarelli 
Collins Avionics 
Mr. Ron Coffin 
Mr. Charles Schulz 
TABLE OF CONTENTS

Chapter                                                              Page

1    INTRODUCTION ..................................................    1

2    EXPERIMENTAL TECHNIQUES .......................................    5
     2.1  Overall Experimental Set-Up ..............................    5
     2.2  Fault Injector Hardware ..................................    9
     2.3  Fault Injection Software .................................   23
     2.4  FTMP System Configuration Controller .....................   26

3    RESULTS .......................................................   31
     3.1  General Observations .....................................   31
     3.2  Average and Maximum Times ................................   34
     3.3  Frequency Distributions ..................................   45
     3.4  Actual Failures ..........................................  105

4    SUMMARY AND CONCLUSIONS .......................................  107

REFERENCES .........................................................  109
LIST OF FIGURES

Figure                                                               Page

1      Fault Injection Experimental Set-Up .........................    6
2      Fault Injector Logical Organization .........................   10
3      Insertion of FETs between Socket and Device .................   11
4      Fault Injector Hardware .....................................   14
5      Fault Description Word ......................................   17
6      Mux A, B, C Selection Word ..................................   19
7      Boolean Function Generator Data Word ........................   20
8-13   CPUD Frequency Distributions ................................   46
14-18  CPUC Frequency Distributions ................................   52
19-25  PROM Frequency Distributions ................................   57
26-31  Cache Controller Frequency Distributions ....................   64
32-36  BGUA Frequency Distributions ................................   70
37-40  BIT Frequency Distributions .................................   75
41-46  BIPC Frequency Distributions ................................   79
47-52  SBC Frequency Distributions .................................   85
53-58  All Faults Frequency Distributions ..........................   91
59-62  All (Except BGU) Faults Frequency Distributions .............   97
LIST OF TABLES

Table                                                                Page

1      Fault Injector Address Space ................................   16
2      Fault Type Selection ........................................   18
3      Mux A, B, C Source Selection ................................   19
4      Boolean Functions of Two Variables ..........................   21
5      Fault Direction Control .....................................   22
6      FIS-FSCC Data Exchange Block ................................   30
7      Average FDIR Times ..........................................   36
8      Maximum FDIR Times ..........................................   37
CHAPTER 1 
INTRODUCTION 
This report is Volume III of a multi-volume report on the Fault-Tolerant Multiprocessor (FTMP) project sponsored by the Langley Research Center of the National Aeronautics and Space Administration under Contract NAS1-15336. The major topic covered by this volume is the test and evaluation of the FTMP. A prerequisite for understanding this report is some knowledge of the FTMP architecture and its principles of operation described in Volume I and of the FTMP Executive software described in Volume II.

The reliability, performance, and availability of the FTMP have been modeled extensively (1,2). A number of assumptions were made about various FTMP characteristics to arrive at these models. Some of these assumptions, such as the mean time between failures of a line replaceable unit (LRU), can only be verified by fielding the equipment and observing its failure rate in its real operating environment. Other assumptions, though, are much more easily verified in a laboratory environment. Examples of these are the mean and distribution of the time to recover from faults and the resiliency to single-point failures. FTMP response to faults can be observed and measured much more accurately under controlled laboratory conditions than in the field. This is one of the motivating factors that led to a series of experiments in which the FTMP was subjected to numerous artificially created faults while operating routinely in a simulated aircraft environment, and its response in each case was observed and recorded.
Apart from verifying modeling assumptions, there are a number of other important reasons for experimental test and evaluation. These include expanding the validation envelope, building confidence in the system, and revealing any weaknesses in the architectural concepts and/or their execution in hardware and software. Other benefits of the test and evaluation exercise include a general stressing and shake-out of the hardware as well as the software, in particular the fault detection hardware and the fault identification and system configuration control software. The results of these experiments, therefore, include not only hard data, such as fault detection and identification times, but a number of intangibles as well. These are discussed in Chapter 4.
The goal of the fault-injection experiments was to inject at least the stuck-at-0 and stuck-at-1 class of faults on every circuit pin of one LRU and measure the FTMP response. It will be recalled here that all ten LRUs in the FTMP are identical to each other in hardware. In addition, due to the symmetric architecture of the multiprocessor and the Executive, the functions performed by one LRU over a period of time are no different from those performed by any other LRU that is operational and active. Therefore, one can be fairly confident in assuming that the results obtained by subjecting only one out of ten LRUs in the system to faults are representative of what would be observed if the faults were distributed amongst all ten hardware units. The choice of the stuck-at class of faults was necessitated by limited time and resources rather than any deficiency in the experimental set-up. The fault injector, to be described in the next chapter, is in fact fully capable of generating and injecting a wide variety of faults, including externally supplied signals. The fault injector can produce 48 fault signals simultaneously, each of which could conceivably be applied to a different circuit pin. Once again, due to the previously mentioned limitations and the astronomically large number of combinations of multiple faults (even just double faults), as well as the extremely low probability of such events ever happening in real life, it was decided to limit the experiments to a single fault at a time.
The FTMP response was measured in terms of fault detection, isolation, and recovery times. The identity of the faulty unit, as determined by the multiprocessor, was also recorded.

It was determined early in the experiments that once the fault had been detected it took a deterministic amount of time to identify the faulty unit and to reconfigure the system such that the faulty unit was no longer active. The fault detection time, on the other hand, was found to be quite variable for a given fault on a given pin. This variation was dependent on when the fault was injected with respect to the internal FTMP frames. To reflect this variation in the experimental data, as well as to gain a measure of the repeatability of FTMP performance, each fault on each pin was repeated five times. The moment at which each fault was inserted was randomized with respect to the basic FTMP software cycle.
The next chapter describes the customized fault-injection hardware and software and the experimental set-up. Results of the experiments are discussed in Chapter 3, and the last chapter summarizes the conclusions.
CHAPTER 2 
EXPERIMENTAL TECHNIQUES 
2.1 Overall Experimental Set-Up 
To inject faults into the FTMP, a device called the fault injector (FI) was designed and built at CSDL. The fault injector interfaces with the FTMP on one end and with the PDP-11/60 Unibus on the other. The number and type of faults and the time of their insertion are controlled by the Fault-Injection Software (FIS) that is resident in the PDP computer. The PDP-11 and the FTMP are linked together by a MIL-STD-1553 bus. This data bus is used by the Fault-Injection Software to communicate with the System Configuration Control (SCC) task in the FTMP. This makes the experimental set-up a closed-loop system in which the executor, that is, the FIS, and the victim, that is, the FTMP, are in constant touch with each other. This, as shall be seen later, makes it possible to automate the fault-injection process and to collect data that otherwise would not be possible to acquire.

Figure 1. Fault Injection Experimental Set-Up

Figure 1 shows a block diagram of the experimental set-up. The victim LRU is in the upper right hand corner of the FTMP cabinet. This is LRU 3. Access to the electronic components in this unit is provided by opening the swing-down door on the FTMP cabinet. This exposes the LRU
circuit boards, which may then be extended for fault insertion. Faults are normally injected on one pin at a time. To insert faults, controllable DIP extenders or implants (part of the fault injector) are plugged into the DIP socket. Each implant accepts the DIP pins it replaced and contains circuitry which can interrupt and/or reconnect each DIP pin and each incident signal line from the socket. Six implants, each of which handles 8 DIP pins, are provided. Thus up to 48 pins on one DIP or on a combination of DIPs may be set up for fault injection at a given time. The standard circuit boards in the FTMP are multi-layer printed circuit boards on which the DIPs have been soldered. However, to facilitate removal of DIPs for fault injection, one complete set of circuit boards for one LRU has been furnished with DIP sockets.
The 48 implant pins of the fault injector are individually addressable by the PDP-11. Each pin appears as a Unibus address to the fault-injection software. The type of fault to be produced at any pin is controlled by writing appropriate data to the Unibus address corresponding to this pin. Once a fault or faults have been defined, they can be 'enabled,' that is, inserted into the victim, by writing to another Unibus address. The fault injector hardware listens to this address space, decodes the data, and produces the fault that is called for. It also enables or clears the fault when appropriate data is written to the enable/clear address.

It is possible to produce signals other than simply the stuck-at class of faults. Faults that are boolean functions of signals on other pins can be generated. This can be used to simulate faults which are rather unlikely but which have been known to happen. For example, it is possible to turn a NAND gate into a NOR gate. But the main utility of this capability lies in being able to inject faults into tristate signals. For example, the data pins of a random access memory have signals that are either inputs to or outputs from the memory depending on whether the memory is being written to or read from. To inject a fault into such a device pin, the direction of the fault signal should be correct in order to avoid any possible damage to the device. Such a signal can be produced by generating the fault as a function of other signals on the device that determine the direction of the data, such as the read/write and chip enable signals on the RAM DIP.
The fault injection software has been written to facilitate automatic fault injection by providing commands that are used to define the victim device, map its pins into implant pins, define the type of fault for each pin, and enable and clear faults. The FIS can execute a string of such commands, making it possible to go through a number of faults automatically once the victim device has been physically moved to the implants. A second condition necessary for automatic fault injection is some form of communication between FIS and the FTMP to indicate whether the FTMP is ready to accept a new fault. Messages between FIS and the multiprocessor are exchanged over a 1553 data bus. A modified version of the system configuration control program, called the FSCC, is responsible on the FTMP side for this protocol. Messages from FSCC are sent through I/O port 0 or 1 over the 1553 bus to a 1553 remote terminal simulator. The RT DMAs the messages into the PDP-11 memory, which then can be
accessed by the FIS program. The same data path is used in reverse to send messages from FIS to FSCC. The value of the FTMP real-time clock at the time of fault detection, identification, and recovery is recorded in the FTMP and sent to FIS. Since the time of fault injection is known to FIS, the difference between the times of fault detection and injection constitutes the time taken to detect the fault. This, together with the identification and reconfiguration times and their sum, is recorded in the PDP-11 for later analysis.
The fault injector hardware, the fault injector software, and FSCC are described in the next three subsections.
2.2 Fault Injector Hardware 
A functional block diagram of the fault injector is shown in Figure 2. The heart of the fault injector is a pair of FETs that is interposed between the device pin and the socket pin. By turning the FETs on, a direct connection is established between the device and the socket. This is the normal situation when no fault is being injected on the pin. The device-socket connection can be severed by turning the FETs off. Now any desired signal may be applied to the device or the socket pin, whichever pin has the input signal (see Figure 3). A choice of eight signals is provided, one of which may be selected by multiplexer M1 for the device pin and by multiplexer M2 for the socket pin, as shown in Figure 2. A set of 48 FET and mux pairs is provided, one pair for each victim pin. This allows one to extend up to 48 pins on one DIP or a combination of DIPs. The choice of faults for each circuit pin is as follows.
Figure 2. Fault Injector Logical Organization
Figure 3. Insertion of FETs between Socket and Device
1. Socket/Device Signal: This provides the original signal to the victim pin. That is, no fault is injected.

2. Mux A: This signal is the output of multiplexer A as shown in Figure 2. The inputs to the multiplexer are the 48 signals from the 48 pins that can be extended with the FETs. That is, a signal from any circuit pin or gate may be used as the fault or input signal for the victim pin.

3. Mux B: This multiplexer has the same function as Mux A.

4. Mux C: This multiplexer also has the same function as Mux A.

5. f(A,B): This signal is a boolean function of two signals, the outputs of multiplexers A and B. Any one of sixteen possible boolean functions may be specified.

6. F(A,B,C): This is a boolean function of f(A,B) and the output of Mux C. Any one of sixteen possible boolean functions may be specified.

7. Ground: This provides the stuck-at-0 fault.

8. EXT: In addition to these seven choices, an externally generated signal may be used as a fault.

Each of the above eight signals may also be inverted before being applied to the victim pin, thus providing a choice of sixteen faults. The choice of faults thus includes stuck-at-1 and 'complemented signal' types of faults.

Multiplexers A, B, and C and the boolean function generators provide an extremely powerful capability to generate any type of fault. For example, certain faults in integrated circuits can change a NAND gate
into a NOR gate. It is possible with this fault injector to simulate such a fault by extending all the input and output pins of the target gate with the FETs, generating the required boolean function using inputs from the gate inputs, and replacing the gate output with this signal. The main utility of this powerful capability, however, lies in the ability to inject faults on tristate signal lines. The direction of the fault can be made a function of other signals on the device, signals that determine the state of the tristate pin. It is thus possible to inject faults into the data pins of memory chips and other tristate devices.
The fault injector hardware is physically packaged as shown in Figure 4. The FET pairs are mounted on an implant segment. Two sizes of implants are provided: 4-pin extenders and 8-pin extenders. An 8-pin implant has 16 FETs mounted on it and can extend one side of a 16-pin DIP. Dummy extenders that simply connect socket and device pins without going through a FET are also provided. These are used to extend those device pins that are too sensitive to sustain the capacitance and/or time delay of an intervening FET.

Figure 4. Fault Injector Hardware

In any event, the FET implants are connected to the multiplexer boards through a flat ribbon cable. As mentioned earlier, the fault injector has the capability of extending 48 device pins. The signal on each of these pins is controlled by a dedicated pair of multiplexers M1 and M2 (see Figure 2). Thus there are a total of 48 pairs of muxes. These are packaged on six multiplexer boards as shown in Figure 4. Each board controls 8 pins. One 8-pin implant or two 4-pin implants may be connected to each board. The six boards, labeled A, B, C, D, E, and F, are
identical multiwire boards. Each of them also contains one sixth of the multiplexers MA, MB, and MC. That is, each of the three 48:1 muxes (A, B, and C) is logically partitioned into six 8:1 muxes. Since a board handles 8 pins, a signal from these eight pins can be selected through the 8:1 mux (A, B, or C) on that board. The outputs of the six logical parts of each 48:1 mux are OR'ed and distributed to all six circuit cards via the backplane. All 48 signals are then made available to each board. Each board also has its own copy of the three boolean function generators shown in Figure 2. Functions f(A,B), F(A,B,C), and S(A,B,C) can be produced on any board. These signals, along with the outputs of muxes MA, MB, and MC, form the inputs to the muxes M1 and M2.
Last, but not least, is the selection and control of the FETs, multiplexers, and boolean function generators, and the enabling and clearing of faults. The fault injector has been designed such that it can be addressed as a Unibus device by a PDP-11 or VAX-11 computer. The data written to the Unibus address space of the fault injector is used to perform the selection and control functions. As shown in Figure 4, the backplane of the multiplexer boards is connected by four flat-ribbon cables to a control and Unibus interface card. This is a double-height wire-wrap board that can be plugged into the PDP-11 Unibus. It has the standard Unibus protocol and address decoding circuitry. The fault injector occupies the address space 764600-764777 (octal). This address space is mapped as shown in Table 1.

Circuitry controlling the signals on each of the 48 pins (muxes M1, M2, and FETs) is addressed individually (addresses 764600 to 764736).
Table 1. Fault Injector Address Space

     Address (764xxx, octal)     Function                     Pin
     600-616                     Board A pin circuitry        1 to 8
     620-636                     Board B pin circuitry        1 to 8
     640-656                     Board C pin circuitry        1 to 8
     660-676                     Board D pin circuitry        1 to 8
     700-716                     Board E pin circuitry        1 to 8
     720-736                     Board F pin circuitry        1 to 8
     740                         Mux A select
     742                         Mux B select
     744                         Mux C select
     746                         Boolean Function Select
     752                         Execute/Clear Fault
     750, 754-777                Unused
Data written to these addresses selects one of eight inputs to mux M1 or M2 and controls the point of fault insertion (device or socket) by choosing mux M1 or M2. This is a static operation. That is, the data written to these addresses is latched in the fault injector. The type and direction of the fault signal is thus determined, but the signal is not yet applied to the victim. To actually break the device-socket connection and inject the fault signal, one must write to the Execute/Clear address 764752.
Writing a "0001" to this address enables the chosen multiplexer M1 or M2 on the chosen pin and turns off the pair of FETs on that pin. Faults on all the previously "enabled" pins are asserted simultaneously. The most significant bit of the fault selection data word determines whether a pin is enabled. Writing a "0002" to the Execute/Clear address disables the muxes and turns the FETs on, thus clearing the fault condition.
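The address map and the Execute/Clear mechanism can be summarized with a short sketch. The fragment below is shown in C purely for illustration; in the actual set-up these writes were performed on the PDP-11 by an assembly-language routine called from the FORTRAN FIS program, and the helper names here are invented.

    #include <stdint.h>

    /* Octal addresses from Table 1. */
    #define FI_BASE        0764600u   /* board A, pin 1                      */
    #define FI_EXEC_CLEAR  0764752u   /* Execute/Clear Fault address         */

    #define FI_ASSERT      1u         /* "0001": turn FETs off, apply faults */
    #define FI_CLEAR       2u         /* "0002": restore direct connection   */

    /* Boards A-F occupy consecutive groups of eight word addresses. */
    static uint32_t fi_pin_address(int board /* 0=A .. 5=F */, int pin /* 1..8 */)
    {
        return FI_BASE + (uint32_t)(board * 8 + (pin - 1)) * 2u;
    }

    /* Placeholder for a memory-mapped 16-bit Unibus write to 'addr'. */
    static void unibus_write(uint32_t addr, uint16_t data)
    {
        (void)addr; (void)data;
    }

    /* Latch a fault description word (Figure 5) for one pin, assert it on
       all enabled pins, then clear it again.                               */
    void inject_once(int board, int pin, uint16_t fault_word)
    {
        unibus_write(fi_pin_address(board, pin), fault_word); /* latched only  */
        unibus_write(FI_EXEC_CLEAR, FI_ASSERT);               /* fault applied */
        /* ... wait for the FTMP to detect, identify, and recover ...         */
        unibus_write(FI_EXEC_CLEAR, FI_CLEAR);                /* fault removed */
    }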
Bits 0-3 of the data word select the type of fault going to the 
device pin, bits 4-7 select the fault going to the socket pin, and bits 
8-11 determine the direction of the fault (to device or to socket) as 
shown below in Figure 5. 
     bit:    15       14  13  12      11 - 8      7 - 4      3 - 0
           enable      0   0   1         X           Y          Z

                      Figure 5. Fault Description Word
Bit 15 enables/disables the pin. A pin must be enabled before a fault defined on it can be asserted. Bit 15 must be 1 for the pin to be enabled. Bits 12, 13, and 14 should always be as shown in Figure 5.
If the fault direction field X is 0, the fault as determined by data bits Y is sent to the socket pin and data bits Z are ignored. If X is 8, the fault as determined by Z is sent to the device pin and Y is ignored. In addition to 0 and 8, there are fourteen other values that can be assigned to X. The fault direction selected for these values of X is explained later in this section.
Table 2. Fault Type Selection
(a prime denotes the inverted signal)

     Y/Z     Fault Signal
      0      Inverted Signal
      1      F(A,B,C)'
      2      A'
      3      B'
      4      C'
      5      f(A,B)'
      6      1  (stuck-at-1)
      7      EXT'
      8      Original Signal
      9      F(A,B,C)
      A      A
      B      B
      C      C
      D      f(A,B)
      E      0  (stuck-at-0)
      F      EXT
Y and Z select the fault signal as shown in Table 2. If Y/Z is 8, the original signal is passed through the multiplexer unchanged. Stuck-at-1 and stuck-at-0 faults can be generated by values of 6 and E, respectively. The signal can be inverted if Y/Z is 0. Other, more complex faults can be chosen as the outputs of multiplexers A, B, C or a boolean function of their outputs (Y/Z = 1 to 5, 9 to D).
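A minimal sketch of the field packing may help tie Figure 5 and Table 2 together. The fragment below is illustrative C; the fixed pattern in bits 12-14 follows the reconstruction of Figure 5 above, and the symbolic names are invented.

    #include <stdint.h>

    /* Field positions from Figure 5: Z = bits 0-3 (fault to device pin),
       Y = bits 4-7 (fault to socket pin), X = bits 8-11 (direction),
       bits 12-14 fixed, bit 15 = pin enable.                              */
    #define FD_ENABLE      0x8000u
    #define FD_FIXED       0x1000u   /* bits 14,13,12 = 0,0,1 per Figure 5 */

    /* A few fault-signal codes from Table 2 (Y/Z field values). */
    #define SIG_ORIGINAL   0x8u      /* pass the original signal through   */
    #define SIG_STUCK_AT_1 0x6u
    #define SIG_STUCK_AT_0 0xEu

    /* Static direction codes for the X field, as described above. */
    #define DIR_TO_SOCKET  0x0u
    #define DIR_TO_DEVICE  0x8u

    static uint16_t fault_word(unsigned x_dir, unsigned y_socket, unsigned z_device)
    {
        return (uint16_t)(FD_FIXED | ((x_dir    & 0xF) << 8)
                                   | ((y_socket & 0xF) << 4)
                                   |  (z_device & 0xF));
    }

    /* Example: a stuck-at-1 fault applied to the device pin (Y is ignored
       when X = 8); OR in FD_ENABLE before asserting, as the ENABLE command
       described later does:
           uint16_t w = fault_word(DIR_TO_DEVICE, 0, SIG_STUCK_AT_1) | FD_ENABLE; */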
If a multiplexer A, B, or C output is either used directly as a fault or as an input to a boolean function generator, it is necessary to select the multiplexer source. This is done by writing to the Unibus address of the multiplexer. Data written to a multiplexer address is interpreted as shown in Figure 6.
     bit:    15 - 6        5  4  3        2  1  0
            not used        board           pin

              Figure 6. Mux A, B, C Selection Word
Bits 3, 4, and 5 select one of the six boards A to F. Bits 0, 1, and 2 select one of the eight pins on that board as the mux output. These are shown in Table 3.
Table 3. Mux A, B, C Source Selection

     Bits 3-5     Board Selected          Bits 0-2     Pin Selected
        1               A                     0              1
        2               B                     1              2
        3               C                     2              3
        4               D                     3              4
        5               E                     4              5
        6               F                     5              6
                                              6              7
                                              7              8
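The encoding of Figure 6 and Table 3 is compact enough to capture in a few lines. The following C helper is only an illustration (the name is invented):

    #include <stdint.h>

    /* Figure 6 / Table 3: bits 3-5 select the board (1=A ... 6=F), and
       bits 0-2 select the pin on that board (value 0 selects pin 1, value
       7 selects pin 8).                                                  */
    static uint16_t mux_select_word(unsigned board /* 1=A..6=F */,
                                    unsigned pin   /* 1..8     */)
    {
        return (uint16_t)(((board & 0x7u) << 3) | ((pin - 1u) & 0x7u));
    }

    /* Example: routing board B, pin 5 to a 48:1 multiplexer means writing
       mux_select_word(2, 5) to that multiplexer's Unibus address (764740,
       764742, or 764744 octal for Mux A, B, or C).                       */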
If a boolean function such as f(A,B) or F(A,B,C) is chosen as the desired fault, then one must also define the boolean function by writing to the function select address 764746 (octal). The data written to this address is interpreted as shown in Figure 7.
     bit:    15 - 12      11 - 8       7 - 4       3 - 0
             unused      S(f,C)       F(f,C)      f(A,B)

           Figure 7. Boolean Function Generator Data Word
Bits 0-3 are used to select one of sixteen boolean functions of signals A and B. Bits 4-7 are used to select F(A,B,C), which is one of sixteen boolean functions of f(A,B) and C. Bits 8-11 are used to select S(A,B,C), which is also one of sixteen boolean functions of f(A,B) and C. The sixteen possible boolean functions of two variables are shown in Table 4.
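Since each selector picks one of the sixteen boolean functions of two inputs, the 4-bit codes in Table 4 (which follows) can be read as truth tables. The short C sketch below illustrates that reading; the specific assignment of selector bits to input combinations is an inference from the table, not a documented property of the fault injector hardware.

    #include <stdbool.h>

    /* One consistent reading of Table 4: bit 0 of the selector gives the
       output for A=0,B=0; bit 1 for A=1,B=0; bit 2 for A=0,B=1; and bit 3
       for A=1,B=1.                                                        */
    static bool boolean_func(unsigned selector, bool a, bool b)
    {
        unsigned index = (a ? 1u : 0u) | (b ? 2u : 0u);
        return (selector >> index) & 1u;
    }

    /* With this reading, selector 8 is AB, 14 is A+B, 6 is the exclusive-OR,
       0 is constant 0, and 15 is constant 1, in agreement with Table 4.    */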
Table 4. Boolean Functions of Two Variables
(a prime denotes the complemented signal)

     Data     Boolean Function of A, B
      0       0
      1       A'B'            (NOR)
      2       A'B
      3       B'
      4       AB'
      5       A'
      6       A XOR B
      7       A' + B'         (NAND)
      8       AB              (AND)
      9       (A XOR B)'
     10       A
     11       A + B'
     12       B
     13       A' + B
     14       A + B           (OR)
     15       1

As noted earlier, the fault direction (to device or to socket) is controlled by the 4-bit field X shown in Figure 5. X can assume one of 16 possible values. These are interpreted as shown in Table 5. If the fault direction signal chosen by X is high, the fault is asserted on the socket pin; if it is low, the fault is asserted on the device pin. For X equal to 0 the fault direction signal is high and the fault is sent to the socket. For X equal to 8 the fault is applied to the device. The fault direction in these two cases is static. For other values of X the fault would be dynamically applied to the socket or device pin depending upon whether the chosen signal is high or low, respectively. The signals that can be used for direction control are the outputs of multiplexers A, B, and C, or their boolean functions f(A,B) and S(A,B,C), and their complements. This allows one to dynamically control fault direction on tristate pins.
Table 5. Fault Direction Control
(a prime denotes the complemented signal)

     X      Fault Direction
     0      TO SOCKET (always)
     1      S(A,B,C)'
     2      A'
     3      B'
     4      C'
     5      f(A,B)'
     6      NOT USED
     7      NOT USED
     8      TO DEVICE (always)
     9      S(A,B,C)
     A      A
     B      B
     C      C
     D      f(A,B)
     E      NOT USED
     F      NOT USED

As explained earlier, fields Y and Z in the fault description word determine the type of fault to be applied to the socket and device pins, respectively (see Figure 5). When X is equal to 0 or 8, only one of these two fields (Y when X is 0 and Z when X is 8) need be defined. However, both Y and Z must be defined when X is not 0 or 8. But Y and Z need not be the same. That is, different faults can be applied to the socket and device pins. In fact, by an appropriate choice of Y and Z, a fault can be applied in one direction while the original signal is passed through unchanged in the other direction. One may, for example, wish to insert a fault in a data pin of a memory chip only when data is being read out but not when data is being written into the memory. This can be done by selecting a fault direction signal that is high during the memory read
cycle. The fault selected by Y would be applied to the socket pin during the read cycle. By choosing Z to be 8, correct data would be written to the memory during the memory write cycle, since Z = 8 passes the original signal to the device pin.

Multiplexer output selection and boolean function definition are static functions. Data written to these addresses is latched in the fault injector.
It should be mentioned here that the fault injector is a 'write-only' device. The state of the fault injector cannot be determined by reading its address space.

It is not necessary to remember the various addresses of the fault injector, since the fault injector software maintains these tables as a data base. FIS provides appropriate commands to define fault types and select mux outputs. The next subsection describes the fault injection software.
2.3 Fault Injection Software 
The fault injection software (FIS) package resident on the PDP-11 provides commands at a PDP-11 terminal to perform all the functions necessary to inject faults into LRU 3 of the FTMP and to observe and record the results. The FIS program is invoked by the command FIS. Valid FIS commands and their functions are as follows:

DEFINE Unn M: This command defines an M-pin IC package whose location on the FTMP circuit board is Unn. The last package so defined becomes the 'active' package.
MAP n A m ℓ: This maps pin n of the active package into pin m of multiplexer board A of the fault injector. The ℓ-1 subsequent device pins are mapped to the ℓ-1 subsequent board A pins. Device pins may be similarly mapped into pins of boards B, C, D, E, and F by substituting the appropriate letter in place of A in this command. This mapping allows one to reference device pins directly in subsequent commands. The mapping is stored as one of the FIS data bases.
DESCRIBE n abcd: This command defines the fault (abcd) to be injected into pin n of the active package. abcd is a 16-bit hexadecimal number that defines the fault as shown earlier in Figure 5 and Table 2. No mnemonics are provided to define the fault type, and one must consult these tables to create the fault selection data word. The FIS program converts the device pin number into the implant address using the previously defined pin mapping data base and the fault injector address data base. The data word abcd is then written to this Unibus address. The data is latched in the fault injector hardware, but the selected fault is not yet asserted.
SELECT Packagename: Subsequent MAP, DESCRIBE, and ENABLE commands 
refer to the selected package. 
MUX n Unn m: This command is used to select pin m of package Unn as the output of multiplexer A, B, or C, depending on whether n is 1, 2, or 3. Valid values for m are 1 to 48. The FIS program maps the package pin in question into a board and pin number and formats an appropriate data word as defined in Figure 6 and Table 3 of the previous section. This data word is then written to the Unibus address corresponding to the selected multiplexer.
FUNC abcd: This command is used to select the boolean function. One must consult Figure 7 and Table 4 to construct the function select word abcd. This command simply writes this word into the function select address.
ENABLE n: This enables or selects pin n of the active package. A pin must be enabled before a fault can be asserted on it. The FIS program enables the pin by writing the fault selection data word (previously defined for this pin) OR'ed with "8000" (hex), that is, with the enable/disable bit turned on. It will be recalled here that the fault injector hardware is a 'write-only' device. Therefore, a shadow copy of all faults previously defined by DESCRIBE commands is maintained as an FIS data base.
DISABLE n: This disables or deletes pin n of the active package. This is done by writing the fault selection data word previously defined for this pin with the enable/disable bit turned off.
DUMP: This command is used to dump to the terminal the fault description, mapping, and enable/disable status of each of the 48 pins of the fault injector.
EXEC: This command actually injects or asserts faults on those pins that have been enabled. This is done by writing to the Execute/Clear address. Ten seconds later the fault condition is cleared by writing 2 to the same address.
AUTO n: This command repeats the EXEC function n times. However, before injecting a fault, a 'Get Ready' command is sent by FIS to the FSCC program in the FTMP. The system configuration controller, in response to the command, checks the status of LRU 3 and brings its units on-line if they are not already active. An 'I am Ready' signal is sent back by FSCC to FIS. The FIS program waits for a random time between 0 and 999 msec before inserting the fault. This allows the fault insertion time to be sufficiently randomized with respect to the FSCC task, which is also responsible for detecting faults in the FTMP.
OUTPUT filename: This command saves the results of the fault injection experiments in the specified file. The results consist of the fault detection, isolation, and reconfiguration times and the total recovery time, that is, the sum of the FDIR times.
The core of the FIS program is written in FORTRAN IV PLUS. It uses the line parser provided by the RSX-11M operating system to interpret the commands described above. Once a valid command has been identified, appropriate subroutines are called to perform the required function. This may involve updating its data base, such as that required by the DEFINE and MAP commands, or it may require computing a Unibus address by consulting its data base and writing data to this address. An assembly language subroutine actually does the I/O. The FIS program also communicates with the FSCC task in the FTMP in response to the AUTO command. The FIS-FSCC protocol is described in the next section.
EXIT: This command is used to exit from the FIS program.
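As an illustration, a hypothetical command sequence for repeatedly injecting a single fault on pin 4 of a 16-pin package at location U23 might look like the following. The package location, pin numbers, data word, and file name are invented for this example; the fault description word follows Figure 5 and Table 2, and the MAP argument list follows the reconstruction given above.

    FIS
    DEFINE U23 16
    MAP 1 A 1 16
    DESCRIBE 4 1806
    ENABLE 4
    OUTPUT U23P4.DAT
    AUTO 5
    EXIT

With AUTO 5, the fault defined on pin 4 is asserted five times, each time after the Get Ready/I am Ready handshake with FSCC and a random delay, and the resulting FDIR times are saved in the specified file.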
2.4 FSCC 
FSCC is a version of the System Configuration Control (SCC) task in the FTMP that has been specifically modified to work with the FIS program in the PDP-11. It is assumed here that the reader is familiar with the contents of Volume II, which describes the basic SCC program in detail.
There are two major differences between SCC and FSCC. First, FSCC does not cycle spare processors, memories, or buses into the active state. It maintains a fixed system configuration under normal circumstances. Of course, if it detects a fault it would try to identify the faulty unit and reconfigure it out of the system. Second, by communicating with FIS it ensures that the victim LRU, that is, LRU 3, is active before FIS inserts a fault into one of the LRU pins. The FSCC-FIS protocol works as follows:
When FIS is ready to inject a fault, it sends a 'Get Ready' message or command word to FSCC. FSCC looks at this word in its normal mode. If it is true, the FSCC state is changed to 'Reconfigure' and the reconfiguration state is initialized to 13. Recall that SCC state 13 corresponds to cycling spare units. In FSCC, spares are not cycled. Instead, in this state the status of processor 3 and memory 3 is checked. If they are failed, they are repaired by changing their status in the system configuration tables. The reconfiguration state is changed to 100 so that on the subsequent FSCC pass the spare units, viz. processor and memory 3, can be assigned to shadow active triads. If the units were not failed, the state is changed to 14. In this state, swap commands are issued to swap processor and memory 3 into active members of their parent triads. The state is changed to 15. Also, a signal called 'Acknowledge Get Ready' is sent to FIS, acknowledging that the Get Ready command has been received and acted upon by FSCC. FIS then clears Get Ready. Clearing the command prevents FSCC from needlessly checking the status of LRU 3 repeatedly. FSCC stays in state 15 until the swap commands have been executed. It then sends an 'I am Ready' message to FIS indicating that the LRU 3 components have been repaired and are in the active state. The detect, identify, and reconfiguration times are simultaneously cleared to zero. FSCC then resumes its normal state. In this state it reads error latches and does fault detection.
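The reconfiguration states used by this protocol can be summarized schematically. The C fragment below is only an illustration of the state sequence; the structure, field names, and the transition out of state 100 are invented for the sketch, and the actual FSCC is an FTMP Executive task as described in Volume II.

    #include <stdbool.h>

    struct fscc {
        int  state;          /* current reconfiguration state                 */
        bool lru3_failed;    /* processor 3 or memory 3 marked failed?        */
        bool swaps_done;     /* have the swap commands been executed?         */
        bool ack_get_ready;  /* 'Acknowledge Get Ready' sent to FIS           */
        bool i_am_ready;     /* 'I am Ready' sent to FIS                      */
    };

    static void fscc_pass(struct fscc *f)        /* one pass of the R1 task   */
    {
        switch (f->state) {
        case 13:                      /* spare-cycling state, reused here     */
            if (f->lru3_failed) {
                f->lru3_failed = false;  /* repair status in config tables    */
                f->state = 100;          /* next pass: assign units as shadows */
            } else {
                f->state = 14;
            }
            break;
        case 100:                     /* shadows assigned; proceed to swaps   */
            f->state = 14;            /* (assumed transition)                 */
            break;
        case 14:                      /* issue swap commands for LRU 3 units  */
            f->ack_get_ready = true;  /* acknowledge the Get Ready command    */
            f->state = 15;
            break;
        case 15:                      /* wait for the swaps to execute        */
            if (f->swaps_done)
                f->i_am_ready = true; /* FDIR times cleared; resume normal mode */
            break;
        default:
            break;                    /* normal mode: read error latches      */
        }
    }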
After receiving the 'I am Ready' message, FIS waits for a random length of time that is uniformly distributed between 0 and 999 milliseconds. This corresponds to between 0 and 3 cycles of the FSCC task. This random wait assures that the fault is not always injected at the same time with respect to the execution of the fault detection program in the FTMP.
When FSCC detects the fault it notes the value of the FTMP Real Time Clock. The clock values at the instants of fault identification and system reconfiguration are also recorded. FSCC thus has all the information to compute the time intervals between fault detection and fault identification as well as that between identification and system recovery. The identification and recovery time intervals can be computed with an accuracy equal to the least count of the Real Time Clock, which is 1/4 millisecond. However, FSCC cannot compute the fault detection time, since it does not know when the fault was injected. To compute the detection time, the FTMP time base, that is, the Real Time Clock, is sent to FIS every R4 frame. Typically, the R4 rate group iteration period is 40 milliseconds. Therefore the FIS program knows the FTMP time of fault injection to within 40 milliseconds. Although this biases the detection time on the average by 20 milliseconds (towards higher values), as will be
seen in the next section, this is not a significant amount of error in the overall detection time distribution. At any rate, the Real Time Clock is sent to FIS every R4 frame. The value of the RT Clock at the times the fault is detected, identified, and recovered is also sent to FIS. FIS then computes the detection, identification, and recovery time intervals and records them in a file.
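A sketch of the interval computation may make the bookkeeping clearer. The names below are invented; the essential points from the text are the 1/4-millisecond least count of the Real Time Clock and the fact that the injection instant is known to FIS only through the RTC value relayed each R4 frame, which biases the detection time high by up to one frame.

    #include <stdint.h>

    /* Clock values for one fault, as relayed to FIS by FSCC.  Each value is
       a count of Real Time Clock ticks (least count 1/4 millisecond); the
       field names are invented for this sketch.                             */
    struct fault_times {
        uint32_t rtc_last_relayed;   /* last RTC value sent before injection  */
        uint32_t rtc_detect;         /* RTC value when the fault was detected */
        uint32_t rtc_identify;       /* RTC value at fault identification     */
        uint32_t rtc_recover;        /* RTC value at system reconfiguration   */
    };

    #define RTC_TICK_MS 0.25         /* least count of the FTMP Real Time Clock */

    /* Detection time: biased high by 0 to 40 ms (one R4 frame) because the
       injection instant is known to FIS only through the last relayed RTC.  */
    static double detect_ms(const struct fault_times *t)
    {
        return (t->rtc_detect - t->rtc_last_relayed) * RTC_TICK_MS;
    }

    /* Identification and recovery intervals are exact to the 1/4-ms count.  */
    static double identify_ms(const struct fault_times *t)
    {
        return (t->rtc_identify - t->rtc_detect) * RTC_TICK_MS;
    }

    static double recover_ms(const struct fault_times *t)
    {
        return (t->rtc_recover - t->rtc_identify) * RTC_TICK_MS;
    }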
The fault condition is cleared as soon as the FTMP has recovered from the fault. FIS keeps track of the FTMP's progress in recovering from the fault by monitoring the FDIR times being sent to it. Recall that these locations are cleared to zero before a fault is injected. Therefore, as each of these words assumes a non-zero value it shows the FTMP's progress through the various stages of system recovery. To assure that there is no deadlock in the FSCC-FIS protocol, a number of time-out conditions are provided. If after a predetermined time the FTMP has not detected the fault, the fault signal is removed and the FIS program proceeds to the next command line. Similar timeouts are provided for the identification and recovery phases. The length of these timeouts can be chosen when the FIS program is initially invoked.
The block of data exchanged between FIS and FSCC is as shown in Table 6. Note that the Real Time Clock as well as all other time values are each two 16-bit words.
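The exchange block of Table 6 can be pictured as a small record. The following C layout is a sketch only; the field names are invented, and the ordering of the halves within each two-word value is not specified in the text.

    #include <stdint.h>

    struct fis_to_fscc {
        uint16_t get_ready;           /* request that LRU 3 be made active      */
    };

    struct fscc_to_fis {
        uint16_t real_time_clock[2];  /* FTMP RTC, relayed every R4 frame       */
        uint16_t detect_time[2];      /* value recorded at fault detection      */
        uint16_t identify_time[2];    /* value recorded at fault identification */
        uint16_t recover_time[2];     /* value recorded at system recovery      */
        uint16_t faulty_unit;         /* identity of the unit found faulty      */
        uint16_t reason_code;
        uint16_t ack_get_ready;       /* 'Acknowledge Get Ready' handshake word */
        uint16_t i_am_ready;          /* 'I am Ready' handshake word            */
    };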
Table 6. FIS-FSCC Data Exchange Block

     Data                       No. of Words

     FIS to FSCC
       Get Ready                      1

     FSCC to FIS
       Real Time Clock                2
       Detect Time                    2
       Identify Time                  2
       Recover Time                   2
       Faulty Unit                    1
       Reason Code                    1
       Ack. Get Ready                 1
       I am Ready                     1

With the exception of the cycling of spares and the changes in the system configuration controller described here, the rest of the software being executed by the machine while undergoing fault injection is that described in Volume II of this report. This consists of the Executive, the self-test programs, the console display, the autopilot, and the other applications code that normally runs on the FTMP.

The next chapter describes the results of the fault injection experiments.
CHAPTER 3 
RESULTS 
3.1 General Observations 
Faults were injected in pins of eight circuit boards. These boards are the CPU Data Path, CPU Control Path, Processor Read Only Memory, Processor Cache Controller, Bus Guardian Unit (A), Bus Interface (Transmit Bus), Bus Interface (Poll and Clock Buses), and System Bus Controller. Although the overall process of physically setting up each device for fault injection, selecting 'safe' pins as targets, running the experiments, and acquiring data was quite tedious and time-consuming, it went rather smoothly. There were some minor difficulties encountered with some devices and circuit boards, but once past the initial learning curve these were overcome quickly. One of the irritating factors was the extreme sensitivity of some devices to being extended on an implant. The parent module of such a device would not function correctly in the presence of an intervening pair of FETs and would be discarded by the system immediately. One obviously had the choice of ignoring that device for the purposes of fault injection and moving on to another circuit. In fact, since the correct functioning of the parent module apparently is so dependent upon that device, it is evident that a fault in the target
device would be detected immediately. One may therefore not worry too much about not being able to subject such a sensitive circuit to artificially created faults. However, as it turns out, the sensitivity of a device to being extended through FETs is usually limited to only one or two pins. Once these pins have been identified (a rather tedious procedure), they can be extended with dummy implants while the remaining pins on that package are extended through FETs. This procedure was followed for most of the sensitive packages. Since no data was acquired on sensitive pins, these pins are not included in the data analysis.
Some packages were only marginally unhappy about being extended. That is, LRU 3 would work correctly with such a device moved to an implant most of the time, but not all the time. The result was that the unit would occasionally be declared failed by the FTMP even before a fault was injected. This obviously produced a negative fault detection time. This, however, happened very infrequently, and the results presented here, of course, exclude negative detection times.

One other practical problem that prevented subjecting some boards to fault injection was the extreme caution required in handling CMOS circuit devices. The memory chips on the processor cache RAM and system memory boards are all CMOS type.

A few faults were injected in the cache RAM board, but soon the socketed circuit board stopped working, most likely due to inadequate care exercised in removing and inserting CMOS memory chips. The cache RAM and the two system memory cards in each LRU are all identical, and only one socketed circuit board was provided for all three. No useful data was acquired for any of the three applications of this card. The two BGU cards also contain a lot of CMOS circuitry. Only 294 faults were injected in the BGU card before it too ceased to operate correctly.
Despite all the practical problems encountered, over 20,000 faults were injected into LRU 3 of the FTMP and the results recorded. Most of the faults were concentrated in the processor region of the LRU: the CPU data and control cards, the cache controller, and the PROM. However, a number of faults (over a thousand) were also injected into the error detection and masking circuitry as well as the redundancy management hardware. The hardware voters, disagreement detectors, error decode ROM, and error latches for the Poll, Transmit, and Clock buses were subjected to faults, as were the enable/disable discretes in the Bus Guardian Unit. Parts of the System Bus Controller were also targeted for fault insertion.

Of the 21,055 faults injected in the FTMP, 17,418 were detected. That is, 3,637 or 17.3 percent of the faults went undetected. Although these results would seem to imply that the fault detection coverage in the FTMP is only 0.83, this is not necessarily so, for converting the fraction of faults undetected directly into lack of coverage is not correct. One must exclude from this total those undetected faults that 'do not matter.' There are a number of faults that obviously belong to this class. For instance, if only three gates from a quad NAND package are actually used on a card, whether the fourth unused gate operates correctly or not is quite irrelevant. Faults in this gate would not be detected but do not contribute to lack of coverage. Unused gates are easy to trace. Unused signals, on the other hand, are not. Faults on these signal pins would also go undetected but once again do not really
affect coverage. The CAPS-6 processor microcode in the FTMP, for example, does not utilize all the outputs of the AMD 2901 Arithmetic Logic Unit (ALU). This can be ascertained only by an exhaustive search of each and every microinstruction to make sure that the output in question is not looked at. Such a study is outside the scope of this project. Approximately 80 percent of all undetected faults, or about 3,000 faults, were either on unused gates or on signals that are always low or always high under normal circumstances. Of the remaining 20 percent of undetected faults, a few were analyzed in depth and all were found to belong to the 'don't care' class. Since each pin fault is repeated five times, the number of pins in question is about 60. However, a much more thorough analysis of all the undetected faults is required before a definitive statement can be made about fault detection coverage. Further discussion here is limited only to the faults that were detected.
3.2 Average and Maximum Times 
As mentioned earlier, 17,418 faults were detected. All of these faults were identified correctly, and the system successfully recovered from each of them by purging the faulty module and replacing it with a spare or by gracefully downgrading the system when no spare was available. Based on these results one could conceivably argue that the fault identification and recovery coverages are each one hundred percent as far as the detected faults are concerned. It is, of course, not possible to extrapolate this perfect record to faults that were not detected and to LRU pins that were not subjected to faults during these experiments. As mentioned earlier, detection, identification, and reconfiguration times were computed for each fault. The three phases of
recovery were also summed to give the total recovery time for each fault. These results are summarized in Tables 7 and 8. The first of these tables lists the average detection, identification, reconfiguration, and total recovery times in milliseconds for each of the eight cards. The last column in this table shows the average FDIR times for all 17,418 faults. Table 8 shows the maximum times recorded in each category for each card, also in milliseconds.
There are certain obvious conclusions that can be drawn from the figures in these tables. Let us start with the last phase of the recovery procedure first, that is, the system reconfiguration phase. This phase begins as soon as the identity of the faulty module is known. At this point in time, the System Configuration Control (SCC) task is being executed. It will be recalled here that this task runs at the lowest frequency, or R1 rate group (3.125 Hz). It passes the identity of the faulty unit on to the R4 dispatcher. The R4 dispatcher, running at 25 Hz, issues appropriate reconfiguration commands in its prolog to remove the faulty unit from the system. The reconfiguration phase is complete as soon as the faulty unit is replaced with a spare or the system is gracefully degraded in the absence of a spare. The average reconfiguration time, as seen in Table 7, is between 46 (SBC) and 113 (PROM) milliseconds depending upon the type of card. That is, on the average it takes between two and three R4 frames to reconfigure the system. The average reconfiguration time for all the faults is 82 msec, or two passes of the R4 dispatcher. The overall average is weighted heavily by the processor region, which was the subject of most faults. The average reconfiguration times for the processor region are higher than those for all the other cards, though the variation from card to card is quite small.
Table 7. Average Times (Milliseconds)

                      CPUD   CPUC   PROM     CC    BGUA    BIT   BIPC    SBC     ALL   ALL EXCEPT BGUA

  # FAULTS DETECTED   7266   4761    783   3508     294    214    235    357   17418        17124

  DETECT               312    349    589    314   36554   1920   1361    678     988          378
  IDENT                 82     99     59     59     133    147    229    263      88           88
  RECONF                80     83    113     88      47     53     71     46      82           82
  TOTAL                474    532    763    462   36735   2121   1662    988    1160          549


Table 8. Maximum Times (Milliseconds)

                      CPUD   CPUC   PROM     CC    BGUA    BIT   BIPC    SBC     ALL

  # FAULTS            7266   4761    783   3508     294    214    235    357   17418

  DETECT              9137  15817  21614   8122  118437  11592   4818  17056  118437
  IDENT               1009    780    810   1204     993    813    931   1625    1625
  RECONF               289    190    242    546     115    198    243    195     546
  TOTAL               9223  16231  21757   8182  118843  11707   4887  17604  118843
The maximum reconfiguration times are also higher for the processor region (CPUC, CPUD, CC, and PROM) than for all others, as seen in Table 8. How long it takes to replace a faulty module with a spare is, of course, dependent on the instantaneous system configuration. For instance, if the triad containing the failed processor is being shadowed by a spare processor, the reconfiguration will be done simply by swapping the failed and shadow processors on the bus lines. This takes only one pass of the R4 dispatcher. On the other hand, if the spare is shadowing another triad, it would be necessary to retire the target triad and synchronize the spare with the target triad members. This obviously takes much longer, since the target triad must complete all tasks in progress before retiring. In any event, the reconfiguration process is deterministic and bounded. The maximum time in the table, 546 msec (CC), corresponds to the scenario just described.

The recovery phase just preceding system reconfiguration is fault identification. This phase begins as soon as a fault is detected. This happens in the SCC task. It terminates as soon as the fault source is located. This also happens in the SCC task. The interval between these two events is the fault identification time. Faults may be identified simultaneously with their detection in some cases. This usually occurs when a self-test program uncovers the fault. Since diagnostic programs know which region is being tested, they can usually identify the faulty module immediately. In other cases several reconfigurations may be required to sort out the fault symptoms. The average identification time
is seen to vary from 59 (PROM and CC) to 263 (SBC) milliseconds, with the system-wide average being 88 milliseconds. Since one R1 frame is 320 milliseconds, it may be concluded that most faults are identified immediately. Indeed, the average time for the processor region cards is between 59 and 99 milliseconds. This is because the symptoms of a failed processor appear on two buses simultaneously, the Poll bus and the Transmit bus. In most cases this combination is uniquely associated with only a single processor in the system. Most processor faults are therefore identified immediately. The question may therefore be asked as to why it even takes 60 to 90 milliseconds to look up the bus assignment tables. Actually it does not really take that long to consult the appropriate data base in the shared memory. What happens in fact is that the SCC task, being the lowest priority, is interrupted by higher priority tasks. It will be recalled here that the R4 rate group tasks are executed eight times and the R3 rate group tasks four times for every iteration of the R1 tasks (SCC). Hence the identification program can be interrupted many times between start and finish. The identification time is measured as the total elapsed time and not as the length of time the program is active. This is, of course, as it should be.

The maximum identification times are seen to vary between 780 (CPUC) and 1625 (SBC) milliseconds, with the maximum for the processor region being 1204. This parameter, like the reconfiguration time, is deterministic and bounded. The worst case scenario here is a fault on a bus that has four memory units enabled on it. If the bus itself is
faulty, it would take four diagnostic reconfigurations to isolate the bus from all other suspects. This translates into five passes of the SCC program and corresponds to the maximum time observed during the course of the experiments.

The fault detection phase is what starts the recovery process. This is more complex than the other parts of the recovery procedure. Once a fault is uncovered, the ensuing processes are quite mechanical. The uncovering of a fault is, however, considerably more involved. The clock for the detection phase starts ticking as soon as the fault is injected under the command of the fault injection software running in the PDP-11/60. Faults are usually manifested as disagreements on one or more buses. These are recorded in error latches, which are read by SCC every 320 milliseconds. The detection phase terminates when SCC digests the error latch outputs and determines that they indicate an 'unexpected' bus error. Recall that some bus errors may always exist, such as those on an unused clock bus or on a failed bus and so on. In any event, this time interval is the detection time. As explained in Chapter 2, the time of fault injection is not known to SCC and is known to the Fault Injection Software only as the most recent value of the FTMP Real Time Clock, which is sent to FIS every 40 milliseconds. Therefore the fault detection time as recorded in the experiments is higher than the real value by anywhere from 0 to 40 milliseconds, or an average of 20 milliseconds. The average detection time for all the faults, from Table 7, is seen to be 988 milliseconds. Therefore, the error is only about 2 percent.
The average detection time is seen to vary from 312 milliseconds for the CPU data card to over 36 seconds for the BGU card. Now if a fault manifested itself as an error on the bus soon after it was injected, the detection time would mostly consist of the latency in reading the error latches. Since error latches are read every 320 milliseconds, on the average this latency should only be 160 milliseconds. The average fault detection time for the processor region is around 300 milliseconds for the CPUD, CPUC, and cache controller cards and 589 milliseconds for the PROM card. This implies that on the average there is considerable latency between fault injection and error manifestation. This is mostly due to the fact that not all parts of the processor region hardware are being used all the time. This is quite obviously true of the Read Only Memory. There is a considerable fraction of the PROM that contains programs that are invoked only when an error is detected. Faults in this region of the memory would not be uncovered until another fault manifests itself. The PROM is therefore tested periodically by a check-sum program. The average latency of half a second in uncovering PROM faults is simply a reflection of how frequently the check-sum program is executed. The maximum detection time for PROM faults is over 21 seconds and is a direct function of the repetition rate of the self-test program.

The average detection times for the BIT, BIPC, and SBC cards are much higher than those for the processor region because faults in the bus interface cards were concentrated mainly in the error detection and masking hardware. Faults in most of this region would not manifest themselves as bus errors under routine operation. Some faults in the
voter circuitry, for instance, are highly latent since the voter output 
1S the same as its inputs as long as the three inputs are the same. Such 
a fault can only be uncovered by feed1ng d1sagreeing input streams to the 
voter. This is done by a self-test program. It 1S seen that only 200 to 
300 faults were 1n]ected 1n the bus 1nterface cards. All of these faults 
were purposefully concentrated in the error detection region to uncover 
any weaknesses in this area since the correct functioning of the FTMP is 
so critically dependent upon this hardware. The results Ifor these cards 
are therefore b1ased towards higher values. When the remaining random 
logic on these cards is subjected to faults, the averages would tend to 
move down because faults in the random log1c would be uncovered by 
routine operation without self-test programs. 
Finally, it is quite evident from Tables 7 and 8 that the average as well as the maximum detection times for the Bus Guardian Unit are an order of magnitude higher than even those for the error detection circuitry. The reason is as follows. The BGU card contains the redundancy management hardware. Faults were injected in the enable/disable discretes that control whether a unit is enabled or disabled on a bus. Some of these faults, such as the ones that disable a unit from its active bus, would be detected immediately by routine operation, since a single BGU can disable a unit by itself and a lack of transmission from a unit would immediately cause errors on that bus. But most other faults, such as those that enable a unit on other buses or disable a unit from buses on which it is not supposed to transmit anyway, would only be detected either by a self-test program that exercises these discretes or over the long term by routine system reconfiguration. No self-test programs have been written for the Bus Guardian Unit. Therefore almost all BGU faults were uncovered by the rotation of processors and memories on different buses and by the swapping of active and spare buses. While the self-test programs complete a cycle every 13 seconds, complete cycling of all spares takes 6 minutes. This is why the maximum detection time for BGU faults is almost 2 minutes. It would be even higher if the timeout limit for these experiments were increased beyond 2 minutes. Faults that may have been detected with a higher time-out limit are treated in the data analysis as undetected faults.
The high detection times for BGU faults have a tremendous impact on the system-wide average. Table 7 shows that the overall average detection time is 988 milliseconds, or about one second. If the average were computed for all faults except BGU faults, it would be only 378 milliseconds. The BGU fault detection times can be reduced by an order of magnitude by writing diagnostic programs for it.
It should be mentioned here for the sake of clarity that although routine system reconfiguration was suppressed for faults on all other cards to facilitate a reasonable FIS-SCC protocol, it was allowed for the BGU card, since this was the only way of detecting BGU faults and it did not interfere with the protocol in this case.
The sum of times for the three phases constitutes the total recovery time. This is the time from the moment the fault is injected to the point in time when the system has completely recovered. Times for the three recovery phases were summed for each fault, and then the sums were averaged over all faults for each card. These averages are shown in Table 6 under the heading 'TOTAL.' The total average recovery time for a card should obviously equal the sum of the average times for the detection, identification, and reconfiguration phases. In other words, the average of the sums should equal the sum of the averages. This is true for all cards to within 1-2 milliseconds, which is the truncation error. The maximum total recovery time for a card, on the other hand, is not necessarily the sum of the maximums for the individual phases, since the maximum detection time need not belong to a fault that also takes the maximum time to be identified. The maximum recovery time is therefore simply the maximum of all the sums.
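The aggregation just described can be sketched in a few lines. The per-fault records below are hypothetical; the sketch only illustrates why the average of the sums equals the sum of the averages while the maximum of the sums generally does not equal the sum of the maximums:

```python
from statistics import mean

# Hypothetical per-fault records for one card: (detect, identify, reconfigure) in ms.
faults = [
    (310, 80, 80),
    (450, 360, 50),
    (9137, 100, 100),   # a long-latency fault dominates the maximum
]

totals = [sum(phases) for phases in faults]          # per-fault total recovery time

# Average of the sums equals the sum of the per-phase averages (to truncation error).
assert abs(mean(totals) - sum(mean(col) for col in zip(*faults))) < 1e-9

# The maximum total is the maximum of the sums, which is generally NOT the sum
# of the per-phase maximums (those maximums may come from different faults).
print("average total :", mean(totals))
print("maximum total :", max(totals))
print("sum of maxima :", sum(max(col) for col in zip(*faults)))
```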
It is quite evident from the data presented in Tables 7 and 8, and from the discussion so far, that the recovery time is dominated by the detection time, for each card as well as for the system as a whole. Even in the processor region, which seems to react the fastest to faults, about 65 percent of the recovery time is spent uncovering a fault. Therefore the recovery time characteristics are very much like those of the detection time. In particular, if the average recovery time is computed for all faults except the BGU faults, it is found to be 549 milliseconds, or about half a second, compared to about a second if it is averaged over all the faults. The meaning of this is quite clear: the FTMP response to faults can be improved twofold simply by writing a few diagnostic programs.
When the FTMP reliability was computed, it was assumed that the R4, R3, and R1 rate groups would execute at 40, 20, and 5 Hz rather than the 25, 12.5, and 3.125 Hz used in the experiments. The fault injection data presented so far shows a strong correlation between the detection, identification, and reconfiguration times and the execution frequencies of the SCC, dispatcher, and self-test programs. It may be concluded, therefore, that the average recovery time could be reduced by 37.5 percent by increasing the repetition rates to the original goals.

The two changes suggested here would bring the average recovery time down to 343 milliseconds, which is quite close to the value (250 msec) assumed in the reliability models.
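A quick check of that arithmetic (illustrative only; the 0.625 scale factor simply follows from the ratio of the two sets of rates):

```python
# The experimental rates (25, 12.5, 3.125 Hz) are 0.625 times the design rates
# (40, 20, 5 Hz), so frame periods -- and the rate-dependent recovery latencies --
# would shrink by the same factor, i.e., a 37.5 percent reduction.

scale = 25 / 40                                     # = 0.625 for every rate group
avg_recovery_non_bgu_ms = 549                       # average excluding BGU faults

print(f"reduction      : {1 - scale:.1%}")                            # 37.5%
print(f"scaled average : {avg_recovery_non_bgu_ms * scale:.0f} ms")   # about 343 ms
```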
Of course, what affects the actual reliability is not only the average recovery time but also its distribution and the LRU mean time between failures (MTBF). These are discussed next.
3.3 Frequency Distributions
The fault injection data was analyzed to compute the probability density function (pdf) of the detection, identification, reconfiguration, and total recovery times for each card separately and for the total ensemble of 17,418 faults. Estimates of the pdf's are plotted as histograms in Figures 8 to 62.
A few comments regarding the organization and plotting of the data are in order here. These figures are organized by card in the same order as the numerical results in Tables 7 and 8. Figures 8 to 13 are for the CPU data card, 14 to 18 for the CPU control card, 19 to 25 for the PROM card, and so on. A different scale is used for each parameter to show as much detail as possible. All identification time histograms use a bucket size of 100 milliseconds, and all reconfiguration plots use a bucket size of 50 milliseconds. A common scale for all detection time distributions that accommodated the maximum detection times and yet showed the details was not as easy to choose.
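For reference, the fixed-bucket histogram estimate used in these plots amounts to the following computation (a minimal sketch, not the actual plotting software; the sample data and function name are ours):

```python
from collections import Counter

def histogram(times_ms, bucket_ms):
    """Fraction of observations falling in each bucket of width bucket_ms."""
    counts = Counter(t // bucket_ms for t in times_ms)
    n = len(times_ms)
    return {int(b * bucket_ms): counts[b] / n for b in sorted(counts)}

# Made-up identification times (ms); the report uses 100 ms buckets for this parameter.
sample = [60, 75, 90, 95, 340, 355, 660]
for start, frac in histogram(sample, 100).items():
    print(f"{start:4d}-{start + 99:4d} ms : {frac:.0%}")
```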
(Figures 8 through 62 are histograms; only their captions and summary statistics are reproduced here.)
Figure 8. Detection time histogram, CPU data card (7266 faults; average 312 ms, maximum 9137 ms).
Figure 9. Detection time histogram, CPU data card, expanded scale (7266 faults; average 312 ms, maximum 9137 ms).
Figure 10. Identification time histogram, CPU data card (7266 faults; average 82 ms, maximum 1009 ms).
Figure 11. Reconfiguration time histogram, CPU data card (7266 faults; average 80 ms, maximum 289 ms).
Figure 12. Total recovery time histogram, CPU data card (7266 faults; average 474 ms, maximum 9223 ms).
Figure 13. Total recovery time histogram, CPU data card, expanded scale (7266 faults; average 474 ms, maximum 9223 ms).
Figure 14. Detection time histogram, CPU control card (4761 faults; average 349 ms, maximum 15817 ms).
Figure 15. Detection time histogram, CPU control card, expanded scale (4761 faults; average 349 ms, maximum 15817 ms).
Figure 16. Identification time histogram, CPU control card (4761 faults; average 99 ms, maximum 780 ms).
Figure 17. Reconfiguration time histogram, CPU control card (4761 faults; average 83 ms, maximum 190 ms).
Figure 18. Total recovery time histogram, CPU control card (4761 faults; average 532 ms, maximum 16231 ms).
Figure 19. Detection time histogram, PROM card (783 faults; average 589 ms, maximum 21614 ms).
Figure 20. Detection time histogram, PROM card, expanded scale (783 faults; average 589 ms, maximum 21614 ms).
Figure 21. Identification time histogram, PROM card (783 faults; average 59 ms, maximum 810 ms).
Figure 22. Reconfiguration time histogram, PROM card (783 faults; average 113 ms, maximum 242 ms).
Figure 23. Total recovery time histogram, PROM card (783 faults; average 763 ms, maximum 21757 ms).
Figure 24. Total recovery time histogram, PROM card, expanded scale (783 faults; average 763 ms, maximum 21757 ms).
Figure 25. Total recovery time histogram, PROM card, expanded scale (783 faults; average 763 ms, maximum 21757 ms).
Figure 26. Detection time histogram, cache controller card (3508 faults; average 314 ms, maximum 8122 ms).
Figure 27. Detection time histogram, cache controller card, expanded scale (3508 faults; average 314 ms, maximum 8122 ms).
Figure 28. Identification time histogram, cache controller card (3508 faults; average 59 ms, maximum 1204 ms).
Figure 29. Reconfiguration time histogram, cache controller card (3508 faults; average 88 ms, maximum 546 ms).
Figure 30. Total recovery time histogram, cache controller card (3508 faults; average 462 ms, maximum 8182 ms).
Figure 31. Total recovery time histogram, cache controller card, expanded scale (3508 faults; average 462 ms, maximum 8182 ms).
Figure 32. Detection time histogram, BGU card (294 faults; average 36554 ms, maximum 118437 ms).
Figure 33. Detection time histogram, BGU card, expanded scale (294 faults; average 36554 ms, maximum 118437 ms).
Figure 34. Identification time histogram, BGU card (294 faults; average 133 ms, maximum 993 ms).
Figure 35. Reconfiguration time histogram, BGU card (294 faults; average 47 ms, maximum 115 ms).
Figure 36. Total recovery time histogram, BGU card (294 faults; average 36735 ms, maximum 118843 ms).
Figure 37. Detection time histogram, BIT card (214 faults; average 1920 ms, maximum 11592 ms).
Figure 38. Identification time histogram, BIT card (214 faults; average 147 ms, maximum 813 ms).
Figure 39. Reconfiguration time histogram, BIT card (214 faults; average 53 ms, maximum 198 ms).
Figure 40. Total recovery time histogram, BIT card (214 faults; average 2121 ms, maximum 11707 ms).
Figure 41. Detection time histogram, BIPC card (235 faults; average 1361 ms, maximum 4818 ms).
Figure 42. Detection time histogram, BIPC card, expanded scale (235 faults; average 1361 ms, maximum 4818 ms).
Figure 43. Identification time histogram, BIPC card (235 faults; average 229 ms, maximum 931 ms).
Figure 44. Reconfiguration time histogram, BIPC card (235 faults; average 71 ms, maximum 243 ms).
Figure 45. Total recovery time histogram, BIPC card (235 faults; average 1662 ms, maximum 4887 ms).
Figure 46. Total recovery time histogram, BIPC card, expanded scale (235 faults; average 1662 ms, maximum 4887 ms).
Figure 47. Detection time histogram, SBC card (357 faults; average 678 ms, maximum 17056 ms).
Figure 48. Detection time histogram, SBC card, expanded scale (357 faults; average 678 ms, maximum 17056 ms).
Figure 49. Identification time histogram, SBC card (357 faults; average 263 ms, maximum 1625 ms).
Figure 50. Reconfiguration time histogram, SBC card (357 faults; average 46 ms, maximum 195 ms).
Figure 51. Total recovery time histogram, SBC card (357 faults; average 988 ms, maximum 17604 ms).
Figure 52. Total recovery time histogram, SBC card, expanded scale (357 faults; average 988 ms, maximum 17604 ms).
Figure 53. Detection time histogram, all cards (17418 faults; average 988 ms, maximum 118437 ms).
Figure 54. Detection time histogram, all cards, expanded scale (17418 faults; average 988 ms, maximum 118437 ms).
Figure 55. Identification time histogram, all cards (17418 faults; average 88 ms, maximum 1625 ms).
Figure 56. Reconfiguration time histogram, all cards (17418 faults; average 82 ms, maximum 546 ms).
Figure 57. Total recovery time histogram, all cards (17418 faults; average 1160 ms, maximum 118843 ms).
Figure 58. Total recovery time histogram, all cards, expanded scale (17418 faults; average 1160 ms, maximum 118843 ms).
Figure 59. Detection time histogram, all cards except BGU (17124 faults; average 378 ms, maximum 21614 ms).
Figure 60. Detection time histogram, all cards except BGU, expanded scale (17124 faults; average 378 ms, maximum 21614 ms).
Figure 61. Total recovery time histogram, all cards except BGU (17124 faults; average 549 ms, maximum 21757 ms).
Figure 62. Total recovery time histogram, all cards except BGU, expanded scale (17124 faults; average 549 ms, maximum 21757 ms).
Detection times for most cards are therefore plotted two times. For instance, Figure 8 shows this variable for CPUD using a bucket size of 1 second. This scale allows the maximum detection time (9.1 seconds for this card) to be accommodated, but since 98 percent of all faults are uncovered within a second, a great deal of information is lost. Therefore this same data is replotted in Figure 9 using a bucket size of 100 milliseconds, with all detection times longer than a second lumped together at the end of the plot. This figure shows in much greater detail the distribution of 98 percent of the detection times. Finally, the total recovery time is also plotted several times for each card using different scales. For example, Figures 12 and 13 show the probability density function of this variable for CPUD with scales of 1 second and 0.1 second, respectively. The contents of Figures 8-62 will be discussed next.
It may be observed that there is a great variation in the pdf's of the detection, identification, and reconfiguration times, but there is not much variation amongst cards for any given parameter. For instance, the identification time probability density function for CPUD (Figure 10) is very similar to that for PROM (Figure 21) or BGU (Figure 34), but it is quite different from the detection time pdf for the same card (Figures 8, 9). Characteristics of the pdf's for each of the four parameters in general, rather than for each card, will therefore be discussed next. Exceptions will be pointed out where appropriate.
Probability density functions of the detection time reveal the complexity of the detection phase. Over 95 percent of the faults for almost all the cards are detected within 600 milliseconds, or two R1 frames. For the CPU data (Figures 8, 9) and control cards (Figures 14, 15), this figure rises to almost 99 percent. The latency in reading error latches varies from 0 to 320 milliseconds (one R1 frame), depending on when the fault was injected with respect to the beginning of the SCC task. Evidently, not all faults manifest themselves as bus errors right away. For instance, about 20 percent of the faults injected in the CPU data card are uncovered between 400 and 600 milliseconds. Evidently it took these faults between 100 and 300 msec to cause erroneous data to appear on the buses. But these faults are uncovered by routine programs.
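The bounds implied by this mechanism can be sketched directly (illustrative only; the decomposition into manifestation delay plus latch-read latency follows the description above, and the helper name is ours):

```python
# Illustrative bounds only: detection time ~ manifestation delay + latch-read latency,
# where the latch-read latency lies between 0 and one R1 frame (about 320 ms).

FRAME_MS = 320

def detection_window(manifest_min_ms, manifest_max_ms):
    """Earliest/latest possible detection for a given manifestation-delay range."""
    return manifest_min_ms, manifest_max_ms + FRAME_MS

print("instantly manifested fault :", detection_window(0, 0))       # (0, 320)
print("100-300 ms to manifest     :", detection_window(100, 300))   # (100, 620)
```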
Beyond this initial impulse, there is a long tail in the detection time pdf that goes out to about 20 seconds. This corresponds to faults that are only uncovered by self-test programs. The fraction of faults that falls under this long tail is only about 2 to 4 percent for the processor region cards (CPUD, Figure 8; CPUC, Figure 14; PROM, Figure 20; cache controller, Figure 27). For the bus interface cards this figure is much higher, as seen in the pdf plots for BIT (Figure 37), BIPC (Figure 41), and SBC (Figure 48). This is due to the fact that faults on these boards were deliberately concentrated in the error detection and masking circuitry. Most of these faults require self-test programs to be uncovered. There is a lot of other random logic on these boards that was not subjected to faults, and it is most likely that faults in these circuits would be detected by routine program execution. As can be seen from the number of faults injected into each board, the processor region cards have been much more thoroughly tested than the bus interface and control cards. (Only about 1,000 out of 21,000 faults were injected into the latter.) If they were tested as completely as the processor region, their detection time distribution would tend to be closer to that for the processor region.
The detection time pdf for the BGU (Figures 32, 33) is totally different from all the others, with a very high number of faults being detected between 40 and 50 seconds and the maximum going out to 2 minutes. This, as explained earlier, is due to the fact that BGU faults were detected by normal system reconfiguration, a complete cycle of which takes 6 minutes. Self-test programs for this part of the FTMP would decrease the BGU detection times by an order of magnitude.
The detection time pdf for all 17,418 faults is shown in Figures 53 and 54. About 96 percent of all faults are detected in 600 milliseconds or less. The detection time distribution for all faults except the BGU faults, shown in Figures 59 and 60, looks very much the same. The only difference appears in the average detection time, which drops from 988 to only 378 milliseconds. This latter figure is more representative of what may be expected as the FTMP response if a reasonable set of diagnostic programs had been completed.
The next parameter is the identification time. As discussed earlier in this chapter, fault identification is a deterministic phase of the recovery procedure. For a given pin fault and a given system configuration, one can say with certainty how many passes of the identification program are required to isolate the faulty unit. This is borne out by the probability density functions for the identification time. As seen in Figure 10, about 85 percent of CPUD faults are identified between 0 and 100 milliseconds. Most of the remaining faults are identified between 300 and 400 milliseconds. It will be recalled that the identification program runs every 320 milliseconds. What this pdf implies is that 85 percent of the faults are identified during the first pass of the program, and the remaining ones are identified during the second pass, after one diagnostic reconfiguration. The identification time pdf is very similar for the other cards as well, with impulses of decreasing magnitude at times corresponding to 1, 2, 3, and 4 passes of the SCC program. This density function for all the faults is shown in Figure 55.
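The impulse structure can be made concrete with a small sketch (illustrative; the 320-millisecond SCC period is from the text, while the helper name is ours):

```python
import math

SCC_PERIOD_MS = 320   # the identification program runs once per SCC pass

def scc_passes(identification_ms):
    """Number of SCC passes implied by an observed identification time."""
    return max(1, math.ceil(identification_ms / SCC_PERIOD_MS))

# Times near 0-100 ms imply identification on the first pass; times near
# 300-400 ms imply one diagnostic reconfiguration and a second pass, and so on.
for t in (80, 350, 660, 990):
    print(f"{t:4d} ms -> pass {scc_passes(t)}")
```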
The last recovery phase, system reconfiguration, is also deterministic in nature. For a given faulty module and a given system configuration, a fixed amount of time is required to replace the faulty module. The reconfiguration time pdf for all the faults is shown in Figure 56. It is seen that almost all the faulty modules are removed within 200 milliseconds.
Figure 57 shows the pdf of the total recovery time (the sum of the detect, identify, and reconfigure times) for all the faults. The FTMP recovers from almost 95 percent of the faults within a second. Figure 58 shows an exploded view of the distribution from 0 to 1 second. If the BGU faults are excluded from the ensemble, which is reasonable to do since no self-test programs were written for the BGU, the resulting distribution appears as shown in Figures 61 and 62. This is almost totally identical to the pdf with the BGU faults, the only exception being the average total recovery time, which is seen to drop from 1.16 to 0.549 second. In any event, the density function of the recovery time does not appear to be exponential, as assumed in reliability modeling. As seen in Figure 62, it may be characterized as a Gaussian density function from 0 to 1 second with an asymmetric tail going out to about 22 seconds. Whether this pdf is better or worse than the exponential pdf from the viewpoint of its impact on system reliability can only be determined through mathematical modeling. However, it is encouraging to note that a very high fraction of all incidences (about 95 percent) lie in a narrow time band around the average value. This can only have a favorable impact on the system reliability.

3.4 Actual Failures
The mean time between failures of an LRU was assumed in the reliability models to be 2,600 hours. Based on the failure rate observed during the course of 18 months of routine FTMP operation, the LRU MTBF can be estimated to be at least 10,000 hours. Over 130,000 LRU operating hours were accumulated during this time, and only 12 failures were observed. Of course, this experience has been obtained in a laboratory environment which is not subject to the temperature variations and the shock and vibration induced by turbulence, landings, and take-offs. In that regard the laboratory environment is certainly more benign. However, the equipment was subjected to substantial power cycling, much more than might be expected in the field. Also, the electronic components were going through their burn-in period, during which they are known to have a higher failure rate. In fact, almost all the failures observed can be attributed to burn-in.
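The point estimate behind the 10,000-hour figure is simple arithmetic (a sketch only; the report quotes the totals, not this calculation):

```python
# Observed MTBF point estimate from roughly 18 months of routine operation.
lru_operating_hours = 130_000
failures_observed = 12

mtbf_hours = lru_operating_hours / failures_observed
print(f"estimated LRU MTBF: about {mtbf_hours:,.0f} hours")   # ~10,800 hours
# Consistent with the report's statement that the LRU MTBF is at least
# 10,000 hours, versus the 2,600 hours assumed in the reliability models.
```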
Two types of components accounted for almost all the failures. One was the Harris random access memory chip; there were four RAM chip failures. The second component that failed was a new 1553 LSI chip; six of these failures were observed. In addition, two diodes in two LRU power converters failed, shorting two power buses together, although they did not impact LRU or system operation.
Finally, an actual single-point fault occurred in one LRU that resulted in a total system failure. The faulty component was a voltage regulator in the recharging circuit for the battery that provides LRU backup power. This backup battery power is used to hold the contents of the CMOS circuitry, including the configuration control registers in the BGUs, in the absence of primary power.
When the voltage regulator failed, the output voltage of the backup power going to the BGU registers exceeded the safe high limit. This caused the LRU enable registers in both BGU cards to behave erratically and enabled the subject LRU on multiple system buses simultaneously. This, in turn, made the system buses useless, leading to the system failure.
This failure mode, which is basically a common failure mode of the two bus guardian units, was not overlooked in the specification process. As a matter of fact, anticipating such common failure modes, the FTMP design specification called for undervoltage and overvoltage protection circuits on the individual BGU cards. Unfortunately, during detailed circuit-level design of the bus guardian units, the overvoltage protection circuitry was omitted from the design. This omission was not caught during subsequent design reviews.
CHAPTER 4 
SUMMARY AND CONCLUSIONS 
A total of 21,055 pin-level faults were injected into the FTMP. Of these, 17,418, or 83 percent, were detected. Of the 3,637 undetected faults, at least 80 percent were estimated to be on unused gates and pins. A few of the remaining undetected faults were analyzed and found to belong to the 'don't care' class. Further analysis of the undetected faults is required to arrive at a definitive detection coverage value. Identification and reconfiguration coverages, on the other hand, were found to be perfect for the detected faults. The system identified all detected faults correctly and successfully recovered in each case.

The total time to recover from a fault was dominated by the time spent in the detection phase. The time to identify a fault and reconfigure the system was found to be deterministic and bounded, as expected. Average identification and reconfiguration times were found to be 88 and 82 milliseconds, respectively. The total recovery time averaged over all 17,418 faults was found to be 1.16 seconds, although it would be only 549 milliseconds if BGU faults were excluded. In the absence of BGU self-test programs, which was the case here, BGU faults were uncovered solely by very low frequency routine system reconfigurations.
The distribution of the total recovery time does not appear to be exponential. However, it is encouraging to note that over 95 percent of all faults are recovered from in a second or less.

In addition to this hard data, a number of very important though intangible results were obtained as well. The hardware and software in general, and the fault detection hardware and the fault identification and system configuration control software in particular, performed extremely well under the stress of thousands of faults. In a sense, the FTMP architecture, the hardware, and the software have been validated informally.

The test and evaluation experiments, their positive results, and the 100 percent availability of the FTMP during 13,000 hours of routine operation at the Draper Laboratory have all substantially bolstered confidence in the FTMP concept as well as its realization in hardware and software.
REPORT DOCUMENTATION PAGE

1. Report No.: NASA CR-166073
4. Title and Subtitle: Development and Evaluation of a Fault-Tolerant Multiprocessor (FTMP) Computer. Volume III - FTMP Test and Evaluation
5. Report Date: May 1983
7. Author(s): J. H. Lala and T. B. Smith III
8. Performing Organization Report No.: CSDL-R-1602
9. Performing Organization Name and Address: The Charles Stark Draper Laboratory, Inc., 555 Technology Square, Cambridge, Massachusetts 02139
11. Contract or Grant No.: NAS1-15336
12. Sponsoring Agency Name and Address: National Aeronautics and Space Administration, Washington, DC 20546
13. Type of Report and Period Covered: Contractor Report
15. Supplementary Notes: Langley Technical Monitor: Charles W. Meissner, Jr. Final Report.
16. Abstract: This report is Volume III of a four-volume final report on the Fault-Tolerant Multiprocessor (FTMP) project. It covers in detail the experimental test and evaluation of the FTMP. Major objectives of this exercise include expanding the validation envelope, building confidence in the system, revealing any weaknesses in the architectural concepts and in their execution in hardware and software, and, in general, stressing the hardware and software. To this end, pin-level faults were injected into one LRU of the FTMP, and the FTMP response was measured in terms of fault detection, isolation, and recovery times. A total of 21,055 'stuck-at-0', 'stuck-at-1', and 'invert-signal' faults were injected in the CPU, memory, bus interface circuits, Bus Guardian Units, and voters and error latches. Of these, 17,418 were detected. At least 80 percent of the undetected faults are estimated to be on unused pins. The multiprocessor identified all detected faults correctly and recovered successfully in each case. Total recovery time for all faults averaged a little over one second. This can be reduced to half a second by including appropriate self-tests.
17. Key Words (Suggested by Author(s)): Fault-Tolerance, Multiprocessor, Synchronous, Reconfigurable, Fault Injection
18. Distribution Statement: Subject Category 62
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages: 115
22. Price: Available from NASA's Industrial Applications Centers
End of Document 
