A preliminary transient-fault experiment on the SIFT computer system by Butler, Ricky W. & Elks, Carl R.
NASA Technical Memorandum 89058 
A Preliminary Transient-Fault Experiment 
on the SIFT Computer System 
Ricky W. Butler and Carl R. Elks 
February 1987 
National Aeronautics and 
Space Administration 
Langley Research Center
Hampton, Virginia 23665
https://ntrs.nasa.gov/search.jsp?R=19870011337 2020-03-20T11:54:18+00:00Z
A reconfigurable computer system must distinguish between transient
faults and permanent faults. A transient fault usually causes incorrect
behavior only temporarily, and consequently the operating system should not
permanently remove the affected component via reconfiguration. Unfortunately,
transient faults appear to occur more frequently than permanent faults. The
available empirical data show that transients occur about 10 times more
frequently than permanent faults. (See ref. 1.) Thus, if an operating system
removes too many processors affected by transient faults, then the reliability
will be seriously compromised. The development of an effective
transient/permanent fault discrimination algorithm is a critical problem for
fault-tolerant computer system designers.

The objective of this experiment is threefold:

1. To gain some fundamental information concerning error latency and the
error propagation process in the presence of injected transient faults
2. To obtain the necessary data to perform a reliability analysis of the
SIFT computer system (ref. 2) including the effects of permanent and transient
faults
3. To determine the effectiveness of the operating system's ability to
discriminate between transient and permanent faults

Only a small number of injections have been performed; therefore,
statistically significant conclusions cannot yet be drawn. The purpose of
this paper is to present the experimental approach and data analysis
techniques in detail.
$W$              random variable representing the duration of transient faults
$Z'$             random variable representing the elapsed time from fault injection
                 until last error appears
$R'$             random variable representing the elapsed time from fault injection
                 until the system reconfigures
$Z$              random variable representing the elapsed time from fault injection
                 until last error appears given that reconfiguration does not occur
$R$              random variable representing the elapsed time from fault injection
                 until the system reconfigures given that reconfiguration occurs
$\gamma$         arrival rate of transient faults
$\lambda$        arrival rate of permanent faults
$F_W(w)$         distribution of $W$
$F_{Z'}(z)$      distribution of $Z'$
$F_Z(z)$         distribution of $Z$
$F_{R'}(r)$      distribution of $R'$
$F_R(r)$         distribution of $R$
$F_L(t)$         distribution of fault latency
$F_P(t)$         distribution of permanent fault reconfiguration time
$F_{R|W}(r,w)$   conditional distribution of $R$ given $W$
$F_{Z|W}(z,w)$   conditional distribution of $Z$ given $W$
$E[\,\cdot\,]$   expected value operator
$\mu(\cdot)$     mean of a distribution
$\sigma^2(\cdot)$  variance of a distribution
In this experiment, transient faults with a particularly simple waveform
are injected: the fault is held active (either stuck-at-1 or stuck-at-0) for
$W$ microseconds. Clearly, there are an infinite number of possible transient
waveforms. Since nobody knows what the characteristics of transient waveforms
are in nature, we are beginning with this simple waveform.

A transient fault may or may not generate errors which are detectable by
the operating system's voters. The following two time-graphs illustrate the
two possible effects of a transient fault:
Case 1: Reconfiguration does not occur

    s    e1  e2  e3  e4  e5  e6  ...  en              c
    |----|---|---|---|---|---|--------|---------------|---> t

Case 2: Reconfiguration occurs

    s    e1  e2  e3  e4  e5  e6  ...  en   r
    |----|---|---|---|---|---|--------|----|---> t

where

    $s$   = time fault injection initiated
    $e_i$ = the time of detection of the ith error ($1 \le i \le n$)
    $r$   = time operating system reconfigures
    $c$   = censoring point (i.e. point where experimental observation is
            terminated)
    $Z$   = $e_n - s$
    $R$   = $r - s$
These two cases represent the outcome of two competing processes: the
disappearance of the transient and the reconfiguration process of the
operating system. $Z$ is a random variable which represents the duration of
transient errors given that reconfiguration does not occur (the first case),
and $R$ is a random variable which represents the reconfiguration time given
that reconfiguration occurs (the second case). Since the operating system does
not record error detections after the reconfiguration process, the time of
disappearance of transient errors can only be observed when reconfiguration
does not occur. Similarly, no information is available about the
reconfiguration process when the errors disappear first and no reconfiguration
takes place. Thus, although one can postulate the existence of some
theoretical underlying competing distributions, say $F_{R'}(r)$ and
$F_{Z'}(z)$, only the conditional distributions

$$F_R(r) = \mathrm{Prob}[R < r] = \mathrm{Prob}[R' < r \mid R' < \infty]$$

$$F_Z(z) = \mathrm{Prob}[Z < z] = \mathrm{Prob}[Z' < z \mid R' = \infty]$$

can be directly observed. The notation $R' = \infty$ indicates the case where
reconfiguration does not occur. Furthermore, it has been shown that it is
impossible to identify the underlying distributions given the conditional
distributions and that although the actual underlying distributions may not be
independent, the stochastic behavior can be accurately modeled as competing
independent processes. (See ref. 3.) Therefore, the use of a semi-Markov
model to describe this phenomenon is justified. If no errors are
detected in an injection, $Z$ is defined to be zero. The results of each
injection can only be observed for a finite time. The censoring point $C$ of
this preliminary experiment was two minutes. Consequently, the effects of a
fault with extremely long latency periods would be missed.

The following model describes the response of the operating system to
transient faults with exponential arrival rate $\gamma$:
Since we are dealing with experimental data that is conditional, the
nonexponential transitions will be labeled with the conditional distributions.
In the above model, the transition from (0) to (1) is the arrival of a
transient fault with exponential rate $\gamma$. The transition from (1) to (0)
is the disappearance of the transient errors. Given that this transition
occurs, the total elapsed time of the transition is a sample from the
distribution $F_Z(z)$. The transition from (1) to (2) is the removal of the
faulty processor via reconfiguration. The distribution of reconfiguration time
is $F_R(r)$. The probability $P_R = \mathrm{Prob}[R' < \infty]$ is the
probability that the transition (1) to (2) occurs. (The probability that the
transition (1) to (0) occurs is $1 - P_R$.)

As mentioned earlier, each transient fault injection is performed by
physically holding a fault active for a predetermined duration $W$. Since this
can only be done for a finite set of predetermined durations, we can only
observe the random variables $R$ and $Z$ in response to particular transient
fault durations $(w_1, w_2, w_3, \ldots, w_k)$. Thus, we actually observe
samples from the conditional distributions

    $F_{R|W}(t \mid w_i)$   (i.e. distribution of recovery time given that the
                            transient fault is active for duration $w_i$)

    $F_{Z|W}(t \mid w_i)$   (i.e. distribution of the time of disappearance of
                            errors given that the transient fault is active for
                            duration $w_i$)

The probability $P_R(w) = \mathrm{Prob}[R' < \infty \mid W = w]$
corresponds to the fraction of times the system reconfigures in the presence
of a transient fault of duration $w$. Mathematically,

$$P_R = \int_0^\infty P_R(w)\,dF_W(w)$$

where $F_W(w)$ is the distribution of transient fault durations. The
motivation for performing the experiment in this manner is that the
distribution of transient fault durations $F_W(w)$ is unknown. If experimental
data were available for $F_W(w)$, then the transient fault durations could be
sampled randomly and $Z$ and $R$ could be measured directly. This indirect
method enables us to construct $F_Z(z)$ and $F_R(r)$ under various assumptions
about $F_W(w)$.
FAULT INJECTION METHOD AND DATA GATHERING
It is impossible to perform transient fault injections at every pin in a
processor for all possible transient fault durations. Thus, the fault
injection locations were chosen randomly, weighted according to the chip
failure rates, and a small set of transient-fault durations (to be injected at
every randomly-selected pin) was predetermined. The chip failure rates were
determined using MIL-STD-217D. A list of the failure rates used for the chips
in the SIFT processors is provided in Appendix A. The fault durations
were not chosen to be equally far apart (i.e. equal successive differences).
This is impractical since the fault latency (i.e. time from injection until
first error detection) is several orders of magnitude longer for some pins
than for others. A spacing appropriate for one pin location in the processor
would not be appropriate for another. Consequently, the natural logarithms of
the fault durations were chosen to be equally far apart. The following
injection durations were used:

1 μs, 3.16 μs, 10 μs, 31.62 μs, 100 μs, 316.22 μs, 1 ms, 3.162 ms,
10 ms, 31.62 ms, 100 ms, 316.22 ms, 1 s.
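The logarithmic spacing of the durations can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name and its parameters are hypothetical, and the endpoints (1 μs and 1 s) and the count of 13 are taken from the list above.

```python
import math

def log_spaced_durations_us(first_us=1.0, last_us=1_000_000.0, count=13):
    """Return `count` durations (in microseconds) whose natural logarithms
    are equally far apart, running from first_us to last_us."""
    step = (math.log(last_us) - math.log(first_us)) / (count - 1)
    return [first_us * math.exp(i * step) for i in range(count)]

durations = log_spaced_durations_us()
for d in durations:
    print(f"{d:12.2f} us")
```

Each duration is larger than the previous one by the same factor (the square root of 10), which is what equal spacing on a logarithmic scale means.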
The SIFT operating system was instrumented to obtain the time of each error
detection on the non-injected processors. This time was obtained on each
processor from a global clock with millisecond resolution. Since error
detection is accomplished by voting, error detection is possible only in
subframes where voting occurs. The SIFT schedule table including the number
of variables voted per subframe is shown below:
    subframe   clock tic   task
        1          0       (illegible)
        2          2       ICT1
        3          6       ICT2
        4          9       ICT3
        5         14       MLS
        6         16       GUIDA
        7         18       PITCH
        8         20       LATER
        9         22       ERRTA
       10         24       NULLT
       11         26       ICT1
       12         30       ICT2
       13         33       ICT3
       14         38       MLS
       15         40       GUIDA
       16         42       PITCH
       17         44       LATER
       18         46       FAULT
       19         49       NULLT
       20         51       ICT1
       21         55       ICT2
       22         58       ICT3
       23         63       MLS
       24         65       GUIDA
       25         67       PITCH
       26         69       LATER
       27         71       RECFT

    # variables voted (13 entries; row alignment not recoverable):
    3, 1, 3, 6, 4, 2, 3, 1, 3, 6, 4, 2, 2

The SIFT scheduler repeatedly executes the sequence of tasks enumerated in the
table above. Each execution of these tasks constitutes a global frame.
The error task ERRTA counts the number of vote errors since its last
execution. If this count exceeds the value of parameter THRESHOLD
(arbitrarily set to 3) then it sets an error flag ERR[p] indicating that it
has diagnosed processor p as faulty during this global frame. The
fault-isolation task FAULT retrieves a voted version of ERR[p]. If ERR[p] is
true for a processor p for K (arbitrarily set to 2) consecutive global
frames then the fault-isolation task tells the reconfiguration task RECFT to
remove processor p. In the following diagram, E represents the ERRTA
task, F represents the fault-isolation task FAULT and R represents the
reconfiguration task RECFT:

    |---E------F------R---|---E------F------R---|---E------F------R---|--->
    |<- frame length (G) ->|                          |<-   theta   ->|

If a fault generates errors at a rate greater than THRESHOLD/G then the
reconfiguration time will vary between $(K-1)G + \theta$ and $KG + \theta$.
The value of $\theta$ could be reduced by moving the ERRTA and FAULT tasks
immediately before the RECFT task. This would reduce the mean reconfiguration
time in SIFT. The following illustrates a typical sequence of errors detected
by the SIFT processors when a fault is injected on processor 1:

    GLOBAL FRAME 27

    SF    P1    P3         P4         P5         P6
                109(2)     109(2)     109(2)     109(2)
                112(3)     112(3)     112(3)     112(3)
                115(2)     115(2)     115(2)     115(2)
                147(3)     147(3)     147(3)     147(3)
                150(2)     150(2)     150(2)     150(2)
                155(2)     155(2)     155(2)     155(2)
                181(2)     181(2)     181(2)     181(2)
                187(2)!    187(2)!    187(2)!    187(2)!
    RECONFIGURATION
                187        187        187        187
The first column is the subframe; the remaining columns contain the
error-detection times observed on each processor. For example, the column of
numbers under the header P4 contains the times that processor P4 detected
faults on processor P1. The number in parentheses following each
error-detection time is the number of vote errors at that time.

A summary of the results of the 297 transient fault injections is given
in Appendix B.
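The diagnosis rule applied by the ERRTA and FAULT tasks, as described earlier in this section, can be sketched in Python as follows. This is a simplified illustration, not the SIFT code: the function name and the per-frame list of vote-error counts are hypothetical, and the voting on ERR[p] across processors is omitted. THRESHOLD = 3 and K = 2 are the values quoted in the text.

```python
THRESHOLD = 3  # vote-error count that must be exceeded in one global frame
K = 2          # number of consecutive flagged global frames required

def frame_of_removal(errors_per_frame, threshold=THRESHOLD, k=K):
    """Return the 1-based global frame in which removal of the processor
    would be requested, or None if it is never requested.

    errors_per_frame holds the number of vote errors ERRTA counts against
    one processor in each successive global frame (hypothetical input).
    """
    consecutive = 0
    for frame, count in enumerate(errors_per_frame, start=1):
        if count > threshold:      # ERRTA sets ERR[p] for this frame
            consecutive += 1
        else:                      # flag not set, so the run is broken
            consecutive = 0
        if consecutive >= k:       # FAULT tells RECFT to remove processor p
            return frame
    return None

print(frame_of_removal([5, 0, 4, 6]))  # frames 3 and 4 are both flagged
print(frame_of_removal([5, 0, 5, 0]))  # flagged frames are never consecutive
print(frame_of_removal([3, 3, 3]))     # 3 does not exceed THRESHOLD
```

The second call shows why a short burst of errors does not cause a reconfiguration under this rule: the flag must be set in K consecutive global frames.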
TRANSIENT FAULT CLASSIFICATION 
The errors produced by the injected transient faults fell into the 
following classes: 
Transient Null - the injected fault produced no errors

Transient Benign - the injected fault produced a finite sequence of
                   errors, followed by correct operation of the
                   processor

Transient Persistent - the injected fault produced a non-terminating
                       sequence of errors (until reconfiguration).

The transient persistent fault's behavior is indistinguishable from a
permanent fault while the system is operating. However, when the injected
processor is manually restarted, it operates properly. One way that a
transient fault can disable a processor is by crashing the microcode.
Although the physical cause of the fault is temporary, the effect is
permanent. Transient persistent faults should be diagnosed as permanent by
the operating system and removed.

Because the error generation process cannot be observed indefinitely, it
is impossible to exactly differentiate between the benign and persistent
class. Furthermore, since the SIFT operating system reconfigures in the
presence of errors (terminating the error observation process), the problem of
distinguishing between the two classes is further complicated. In the section
entitled "Future Experimental Directions" a new experimental approach to this
problem is described.

In the following summary tables, the following assumptions were made:

(1) If the system reconfigured and errors persisted up to the
reconfiguration point, it is assumed that the operating system
properly diagnosed the fault as persistent.
(2) If the system reconfigured, but errors disappeared at least 5 ms
prior to the reconfiguration, it is assumed that the operating system
improperly diagnosed a benign fault as persistent.
(3) If the fault produced errors but the system did not reconfigure, then
it is assumed that the fault was benign.
(4) If a fault had not generated an error within the censoring time of
the experiment (i.e., 1 minute), the fault was null.
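Assumptions (1) through (4) amount to a small decision procedure, sketched below in Python. The function name, argument names and return format are hypothetical and chosen only for illustration; the 5 ms gap is the value stated in assumption (2).

```python
def classify_injection(error_times_ms, reconfig_time_ms=None, gap_ms=5.0):
    """Classify one injection according to assumptions (1)-(4).

    error_times_ms:   error-detection times (ms) seen before the censoring point
    reconfig_time_ms: time of reconfiguration (ms), or None if none occurred
    gap_ms:           error-free gap before reconfiguration that marks a
                      fault as benign
    Returns (fault_class, diagnosis_assumed_correct).
    """
    if not error_times_ms:
        return ("null", True)                    # assumption (4)
    if reconfig_time_ms is None:
        return ("benign", True)                  # assumption (3)
    gap = reconfig_time_ms - max(error_times_ms) # last error to reconfiguration
    if gap >= gap_ms:
        return ("benign", False)                 # assumption (2)
    return ("persistent", True)                  # assumption (1)

# Error times taken from the sample listing in the previous section.
print(classify_injection([109, 112, 115, 147, 150, 155, 181, 187], 187))
```

For the sample listing, errors continue right up to the reconfiguration at 187 ms, so the fault is classed as persistent and the diagnosis is taken to be correct.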
CPU

    W (μsec)            |  Null  | Benign | Persistent
    0 - 10              |  .75   |  .07   |    .18
    10 - 100            |  .36   |  .15   |    .49
    100 - 1000          |  .31   |  .15   |    .54
    1000 - 10000        |  .00   |  .11   |    .89
    10000 - 100000      |  .00   |  .21   |    .79
    100000 - 1000000    |  .00   |  .05   |    .95

Memory

    W (μsec)            |  Null  | Benign | Persistent
    0 - 10              |  .60   |  .03   |    .37
    10 - 100            |  .30   |  .10   |    .60
    100 - 1000          |  .17   |  .11   |    .72
    1000 - 10000        |  .00   |  .00   |   1.00
RELIABILITY ANALYSIS OF SIFT

In this section a methodology for performing a reliability analysis of
the SIFT computer system subject to transient and permanent faults will be
presented. The methodology will be illustrated by application to a
4-processor SIFT system. The following assumptions govern this model:

1. The system initially consists of four statistically-independent
processors which fail permanently at constant failure rate $\lambda$ and
transiently at constant rate $\gamma$.

2. Each processor executes exactly the same program on exactly the same
inputs so that all non-faulty processors produce exactly the same
output. The system "votes" the outputs prior to external use. Thus,
so long as a majority of the processors are non-faulty, any erroneous
values are "masked".

3. The system removes the faulty processors via reconfiguration. The
first reconfiguration reduces the system to a triplex configuration.
A second reconfiguration reduces the system to a simplex.

4. The distribution of reconfiguration time $F_R(r)$ is unknown and must
be determined experimentally.

5. The distributions of transient error duration $F_Z(z)$ and transient
fault duration $F_W(w)$ are unknown. No experimental data is available
(nor does this experiment provide any) for these distributions.

The computation of the probability of system failure based on this model
will be performed using the Semi-Markov Unreliability Range Evaluator (SURE)
program. (See ref. 4.) A key advantage of the SURE program lies in its use
of means and variances of the unknown distributions. It is not necessary to
assume some family of underlying distribution and perform distribution-fitting
procedures.
The SURE input file describing this model is:

    LAMBDA = 2E-4;          (* permanent fault arrival rate *)
    GAMMA = 10*LAMBDA;      (* transient fault arrival rate *)
    MU_R =                  (* mean reconfiguration time mu(F_R) *)
    SIGMA_R =               (* stan. dev. of reconf. time sigma(F_R) *)
    P_R =                   (* probability of reconfiguration *)
    MU_Z =                  (* mean last error time mu(F_Z) *)
    SIGMA_Z =               (* stan. dev. of last error sigma(F_Z) *)
    MU_P =                  (* mean permanent reconf. time mu(F_P) *)
    SIGMA_P =               (* stan. dev. perm. reconf. time sigma(F_P) *)

    1,2 = 4*GAMMA;
    2,3 = 3*GAMMA + 3*LAMBDA;
    1,4 = 4*LAMBDA;
    4,5 = 3*GAMMA;
    2,5 = 3*LAMBDA;
    4,6 = 3*LAMBDA + 3*GAMMA;
    2,7 = <MU_R, SIGMA_R, P_R>;
    2,1 = <MU_Z, SIGMA_Z, 1-P_R>;
    7,8 = 3*GAMMA;
    8,9 = 2*GAMMA + 2*LAMBDA;
    7,10 = 3*LAMBDA;
    4,7 = <MU_P, SIGMA_P>;
    10,12 = 2*LAMBDA + 2*GAMMA;
    10,11 = 2*GAMMA;
    8,12 = 2*LAMBDA;
    8,13 = <MU_R, SIGMA_R, P_R>;
    10,13 = <MU_P, SIGMA_P>;
    13,14 = GAMMA + LAMBDA;
    8,7 = <MU_Z, SIGMA_Z, 1-P_R>;
The graphical display of this model in figure 1 was generated by the SURE
program. The complete input file to the SURE program including the
calculation of the means and variances is given in Appendix C.

Figure 1.- Reliability model of 4-processor SIFT.
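The following non-parametric statistics must be estimated from the
experimental data:

$$\sigma^2(F_R) = E[R^2] - (E[R])^2 = \int_0^\infty E[R^2 \mid W = w]\,dF_W(w) - [\mu(R)]^2$$

$$\sigma^2(F_Z) = E[Z^2] - (E[Z])^2 = \int_0^\infty E[Z^2 \mid W = w]\,dF_W(w) - [\mu(Z)]^2$$

Since we only have measurements of $E[R \mid W = w_i]$, $E[Z \mid W = w_i]$,
$E[R^2 \mid W = w_i]$ and $E[Z^2 \mid W = w_i]$ for a few values of $w_i$, we
are forced to approximate the integral with a numerical method. For the
probability of reconfiguration, with $P_R(w)$ denoting the probability of
reconfiguration given a transient fault of duration $w$:

$$P_R = \int_0^\infty P_R(w)\,dF_W(w) = \sum_{i=1}^{k+1} \int_{w_{i-1}}^{w_i} P_R(w)\,dF_W(w)$$

where $w_0 = 0$ and $w_{k+1} = \infty$. The other integrals are treated
similarly.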
The first and second moments of the distribution of reconfiguration time
in the presence of permanent faults $F_P$ can be estimated using the following
simple unbiased estimators:

$$\hat{E}[R] = \frac{1}{n}\sum_{i=1}^{n} r_i, \qquad \hat{E}[R^2] = \frac{1}{n}\sum_{i=1}^{n} r_i^2$$

where $(r_1, r_2, \ldots, r_n)$ is a random sample obtained via permanent fault
injection. A histogram of this sample is given in figure 2. The mean
reconfiguration time is 272.6 ms and the standard deviation is 121.5 ms.
Figure 2.- Reconfiguration time histogram (permanent faults). 
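Estimating the first and second moments from a sample of reconfiguration times, and deriving the standard deviation that a program such as SURE takes as input, can be sketched as follows. The function name is hypothetical and the sample values are invented for illustration; they are not data from the experiment.

```python
import math

def moment_estimates(sample_ms):
    """Return (mean, second moment, standard deviation) for a sample of times."""
    n = len(sample_ms)
    m1 = sum(sample_ms) / n                    # estimate of E[R]
    m2 = sum(r * r for r in sample_ms) / n     # estimate of E[R^2]
    sigma = math.sqrt(max(m2 - m1 * m1, 0.0))  # std. dev. implied by the moments
    return m1, m2, sigma

# Hypothetical reconfiguration times in ms (NOT measured values).
sample = [150.0, 250.0, 300.0, 400.0]
m1, m2, sigma = moment_estimates(sample)
print(m1, m2, round(sigma, 1))
```

Working with the raw second moment, rather than the variance directly, matches the way the per-duration quantities are later combined: conditional first and second moments can be averaged over the fault-duration distribution, and the variance is formed only at the end.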
The following values of $\hat{P}_R(w_i)$, $\hat{E}[Z \mid W = w_i]$,
$\hat{E}[R \mid W = w_i]$, $\hat{E}[Z^2 \mid W = w_i]$ and
$\hat{E}[R^2 \mid W = w_i]$ were observed (times in ms; "--" marks durations
with no observation of $Z$):

    w_i         P_R(w_i)   E[Z|W=w_i]   E[R|W=w_i]   E[Z^2|W=w_i]   E[R^2|W=w_i]
    1 μs          .17         0.1         239.0           0.16        57282.60
    3.16 μs       .30        66.3         350.9       92402.34       181682.00
    10 μs         .37         0.0         239.4           0.00        58599.27
    31.62 μs      .53        47.3         255.9       15875.00        68679.12
    100 μs        .69         0.0         247.1           0.00        62566.75
    316.22 μs     .54         2.5         238.3          40.00        57448.20
    1 ms          .86         0.0         284.3           0.00       108444.48
    3.162 ms     1.00         --          256.9           --          66597.96
    10 ms        1.00         --          249.1           --          62921.78
    31.62 ms      .95       282.0         248.4       79524.00        62708.11
    100 ms        .90        99.0         255.7        9801.00        66016.66
    316.22 ms    1.00         --          251.5           --          64170.50
    1 s          1.00         --          236.7           --          56681.30
Since the distribution of transient fault durations is unknown and no
experimental data is available, sensitivity analysis will be performed under
the assumption of three different families of distributions. The exponential,
uniform and Weibull distributions will be analyzed. If experimental
data were available for $F_W(w)$, then the transient fault durations could be
sampled randomly and $Z$ and $R$ could be measured directly. In that case,
this indirect calculation would be unnecessary.
Analysis Assuming Exponential Transient Duration 
In this section the reliability analysis is performed under the 
assumption that the distribution of the duration of transient faults is 
exponentially distributed. Thus, 
$$F_W(w) = 1 - e^{-\phi w}$$

for some $\phi$. The probability of system failure as a function of
$\mu(F_W) = 1/\phi$ is given in figure 3.

Figure 3.- Prob. of failure vs. $\mu(F_W)$ for exponential $F_W$.
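As a rough illustration of how per-duration measurements can be combined under the exponential assumption, the sketch below averages the observed reconfiguration fractions over an exponential fault-duration distribution. The quadrature rule (treating the fraction as constant on each interval between injected durations, and equal to the last observed value beyond 1 s) is an assumption made here for illustration; the rule used in the report is not legible in this copy. The function names are hypothetical. The fractions are the reconfiguration-fraction column of the table of observed values earlier in this section.

```python
import math

# Injected durations (microseconds) and observed reconfiguration fractions.
W_US = [1, 3.16, 10, 31.62, 100, 316.22, 1e3, 3.162e3,
        1e4, 3.162e4, 1e5, 3.1622e5, 1e6]
P_HAT = [.17, .30, .37, .53, .69, .54, .86, 1.00, 1.00, .95, .90, 1.00, 1.00]

def exp_cdf(w, mean):
    """Exponential distribution function with the given mean."""
    return 1.0 - math.exp(-w / mean)

def mix_over_duration(values, mean_us, grid=W_US):
    """Average per-duration values over an exponential duration distribution.

    Assumes the quantity equals values[i] on the interval ending at grid[i],
    and values[-1] for durations beyond the last grid point.
    """
    total, prev_cdf = 0.0, 0.0
    for w, v in zip(grid, values):
        cur_cdf = exp_cdf(w, mean_us)
        total += v * (cur_cdf - prev_cdf)   # probability mass of this interval
        prev_cdf = cur_cdf
    total += values[-1] * (1.0 - prev_cdf)  # tail beyond the longest injection
    return total

for mean_us in (1.0, 100.0, 1e4, 1e6):
    print(f"mean {mean_us:>9.0f} us -> P_R approx "
          f"{mix_over_duration(P_HAT, mean_us):.3f}")
```

The same helper could be applied to the conditional moments of R; the conditional moments of Z need extra care because several durations have no observation of Z.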
Analysis Assuming Uniform Transient Duration

In this section the reliability analysis is performed under the
assumption that the distribution of the duration of transient faults is
uniformly distributed. Thus,

$$F_W(w) = w/\beta, \qquad 0 \le w \le \beta$$

for some $\beta$. The probability of system failure as a function of
$\mu(F_W) = \beta/2$ is given in figure 4.

Figure 4.- Prob. of failure vs. $\mu(F_W)$ for uniform $F_W$.
Analysis Assuming Weibull Transient Duration

In this section the reliability analysis is performed under the
assumption that the distribution of the duration of transient faults is
Weibull:

$$F_W(w) = 1 - e^{-\phi w^{\alpha}}$$

for some $\phi$ and $\alpha$. The probability of system failure as a function of
$\mu(F_W) = (1/\phi)^{1/\alpha}\,\Gamma(1 + 1/\alpha)$ for $\alpha = 2$ and
$\alpha = 1/2$ is given in figure 5.

Figure 5.- Prob. of failure vs. $\mu(F_W)$ for Weibull $F_W$
(curves for $\alpha = 2.0$ and $\alpha = 0.5$).
Sensitivity to $F_W(w)$

A comparison of figures 3, 4 and 5 reveals that the probability of system
failure is only moderately sensitive to the different shapes of the
distributions. However, the unreliability varies over two orders of magnitude
depending upon the mean $\mu(F_W)$. Once again the reader is cautioned that
this observation is based on a very small sample and the assumption that $F_W$
comes from one of these three families of distributions.
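The three duration distributions and their means, as parameterized in the preceding subsections, can be written down directly. The sketch below uses hypothetical function names and includes a crude numerical check of the mean formulas; it says nothing about the failure probabilities plotted in the figures.

```python
import math

def exponential_cdf(w, phi):
    return 1.0 - math.exp(-phi * w)

def uniform_cdf(w, beta):
    return min(max(w / beta, 0.0), 1.0)

def weibull_cdf(w, phi, alpha):
    return 1.0 - math.exp(-phi * w ** alpha)

def exponential_mean(phi):
    return 1.0 / phi

def uniform_mean(beta):
    return beta / 2.0

def weibull_mean(phi, alpha):
    return (1.0 / phi) ** (1.0 / alpha) * math.gamma(1.0 + 1.0 / alpha)

def numeric_mean(cdf, upper, steps=100_000):
    """Mean of a nonnegative variable as the integral of 1 - F(w) on
    [0, upper], using the midpoint rule."""
    h = upper / steps
    return sum(1.0 - cdf((i + 0.5) * h) for i in range(steps)) * h

print(exponential_mean(2.0), numeric_mean(lambda w: exponential_cdf(w, 2.0), 50.0))
print(uniform_mean(4.0), numeric_mean(lambda w: uniform_cdf(w, 4.0), 4.0))
print(weibull_mean(1.0, 2.0), numeric_mean(lambda w: weibull_cdf(w, 1.0, 2.0), 10.0))
```

Fixing the mean and varying only the family (or the Weibull shape parameter) is what allows the shape sensitivity to be separated from the sensitivity to the mean.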
EFFECTIVENESS OF SIFT'S TRANSIENT/PERMANENT FAULT DISCRIMINATOR

In this section some preliminary observations are made about the
effectiveness of the SIFT transient/permanent fault discrimination algorithm.
There are two ways the operating system can incorrectly diagnose a fault:

(1) A permanent fault generates intermittent errors that are
indistinguishable from the errors produced by two or more transient
faults and thus is not reconfigured.

(2) The time that the operating system waits to see if a fault is
transient is not long enough to recognize a particularly long
transient and consequently reconfigures before the fault disappears.

The first case was not observed during the experiment (i.e., all
permanent faults were successfully reconfigured). There were many cases where
the transient injection resulted in the injected processor being reconfigured
out of the system. However, whether this was a correct decision depends on
whether the fault was transient benign or transient persistent. If the fault
was transient persistent, the decision was correct. If the fault was transient
benign, the decision was incorrect. Because the error generation process was
not observed after a reconfiguration, the type of fault could not be
ascertained with 100% confidence from the experimental data. Thus, it was
impossible to determine if at some time subsequent to the reconfiguration the
errors of a transient benign fault would disappear. Also, if the errors had
disappeared shortly before the reconfiguration (suggesting that the fault was
transient benign), there was no way to determine if this was merely a
temporary lapse in the error sequence of a transient persistent fault. In the
section entitled "FUTURE EXPERIMENTAL DIRECTIONS" a modification to the
experimental approach is presented which facilitates the process of
distinguishing transient benign faults from transient persistent faults. In
the rest of the section a simple analysis of the data is given which gives
some indication of the operating system's ability to distinguish these types
of faults.

Let $S$ be the time between the last error detection and the
reconfiguration. If a fault generates errors up to the time of
reconfiguration (i.e. $S = 0$), then the diagnosis as permanent is probably
correct. However, if $S$ is large, then most likely the fault was improperly
diagnosed. The distribution of $S$ is given in figure 6.

Figure 6.- Histogram of $S = R - Z^*$ (horizontal axis: time in msec).
There were 190 fault injections which resulted in a reconfiguration. Of
these, 167 had a value of $S$ less than one clock tick (1.6 ms). In the
remaining injections there were only 2 injections with a value of $S$ less
than 31 ms as revealed in the following table:

    S            | # of occurrences
    0 - 1 ms     |      167
    2 - 5 ms     |        0
    6 ms         |        2
    7 - 31 ms    |        0
    > 31 ms      |       21

Exactly where the division should be made between the transient benign and
transient persistent class is not obvious. Since there were very few
injections with values of $S$ between 2 ms and 31 ms, any choice in this range
would yield essentially the same results. In the following tables, the
division is made at 5 ms. Therefore all faults with a value of $S$ less than
5 ms are assumed to be transient persistent and those with a value of $S$
greater than 5 ms are assumed to be transient benign. Using this
classification scheme, the percentage of improperly reconfigured faults was:

$$\%\ \mathrm{error} = \frac{\#\ \text{transient benign faults reconfigured}}{\#\ \text{reconfigurations}} \times 100\% = \frac{23}{190} \times 100\% = 12.1\%$$

There were 9 injections which produced errors but did not lead to a
reconfiguration. Assuming that these faults were correctly diagnosed as
transient benign, the percentage of transient benign faults which were
improperly diagnosed was:

$$\%\ \mathrm{error} = \frac{\#\ \text{transient benign faults reconfigured}}{\#\ \text{transient benign faults}} \times 100\% = \frac{23}{32} \times 100\% = 71.9\%$$

Since most of the faults injected were transient persistent, the
percentage of improperly reconfigured faults was small (12.1%). However, of
the faults that should not have been reconfigured (transient benign), 71.9%
were improperly reconfigured.
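The two percentages follow from the table of S values and the count of 9 non-reconfigured injections with errors. The arithmetic can be checked with the short sketch below; the variable names are hypothetical, and the bin labels mirror the table.

```python
# Number of reconfigured injections in each bin of S (from the table above).
occurrences = {"0-1 ms": 167, "2-5 ms": 0, "6 ms": 2, "7-31 ms": 0, ">31 ms": 21}

# Bins lying above the 5 ms division are counted as transient benign.
benign_bins = ("6 ms", "7-31 ms", ">31 ms")

reconfigurations = sum(occurrences.values())
benign_reconfigured = sum(occurrences[b] for b in benign_bins)
benign_not_reconfigured = 9  # injections with errors but no reconfiguration

pct_of_reconfigurations = 100.0 * benign_reconfigured / reconfigurations
pct_of_benign = 100.0 * benign_reconfigured / (
    benign_reconfigured + benign_not_reconfigured)

print(f"{benign_reconfigured}/{reconfigurations} = {pct_of_reconfigurations:.1f}%")
print(f"{benign_reconfigured}/{benign_reconfigured + benign_not_reconfigured}"
      f" = {pct_of_benign:.1f}%")
```

The two figures answer different questions: the first is the chance that a given reconfiguration was unnecessary, the second the chance that a benign fault was nevertheless removed.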
DISTRIBUTION OF FAULT LATENCY

The data obtained in this experiment is sufficient to determine fault
latency in SIFT using the methodology developed by the University of Michigan.
(See ref. 5.) Consider the following graph of the fault propagation process:

    fault arrival        error generated        error detected
         |                     |                      |
    -----|---------------------|----------------------|-----> t
         |<-- fault latency -->|<-- error latency --->|

Let $L$ be a random variable representing the fault latency with
distribution function $F_L(l)$ and let $n_i(w)$ represent the total number of
transient fault injections at pin $i$ with duration $w$. If $D_i(w)$ is a
random variable representing the number of injections which result in at least
one error detection, then, under the assumption that errors generated by the
injected fault are propagated and detected before the censoring point of the
experiment:

$$E[\,D_i(w)/n_i(w)\,] = F_L(w)$$

If $d_i(w)$ detections are observed in response to $n_i(w)$ injections, the
following estimator of $F_L(w)$ is unbiased:

$$\hat{F}_L(w) = d_i(w)/n_i(w)$$

The following measurements were obtained:

    w_i          d_i(w_i)   n_i(w_i)   F_L(w_i) estimate
    1 μs             6         30           .20
    3.16 μs         10         30           .33
    10 μs           11         30           .37
    31.62 μs        19         30           .63
    100 μs          20         29           .69
    316.22 μs       17         28           .61
    1 ms            25         29           .86
    3.162 ms        24         24          1.00
    10 ms           18         18          1.00
    31.62 ms        19         19          1.00
    100 ms          10         10          1.00
    316.22 ms       10         10          1.00
    1 s             10         10          1.00

Although $F_L(w)$ must be monotonic increasing, the estimates
$\hat{F}_L(w_i)$ may not be. Clearly, a statistical method of estimating the
$F_L(w_i)$ under a constraint of monotonicity is needed.
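One standard way to impose such a monotonicity constraint is the pool-adjacent-violators technique from isotonic regression. The sketch below applies it to the detection and injection counts in the table above. It is offered only as an illustration of the kind of method the text calls for; the report does not name a specific method, and the function name is hypothetical.

```python
def monotone_fractions(detections, injections):
    """Pool adjacent violators so the estimated fractions never decrease.

    Each block holds [sum of detections, sum of injections, number of
    durations pooled]; a pooled block's estimate is its combined fraction.
    """
    blocks = []
    for d, n in zip(detections, injections):
        blocks.append([d, n, 1])
        # Merge while the previous block's fraction exceeds the newest one.
        # d1/n1 > d2/n2 is tested as d1*n2 > d2*n1 to avoid division.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            d2, n2, c2 = blocks.pop()
            blocks[-1][0] += d2
            blocks[-1][1] += n2
            blocks[-1][2] += c2
    fitted = []
    for d, n, count in blocks:
        fitted.extend([d / n] * count)
    return fitted

D = [6, 10, 11, 19, 20, 17, 25, 24, 18, 19, 10, 10, 10]
N = [30, 30, 30, 30, 29, 28, 29, 24, 18, 19, 10, 10, 10]
fitted = monotone_fractions(D, N)
print([round(f, 3) for f in fitted])
```

With these counts the only violation is between 100 μs and 316.22 μs; pooling the two gives (20 + 17)/(29 + 28) for both durations and leaves the other estimates unchanged.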
Theoretically: 
The following are approximations to the mean and variance: 
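The following values of the estimated mean $\hat{\mu}$ and standard
deviation $\hat{\sigma}$ of the fault latency were obtained:

    $\hat{\mu}$ = 0.216 ms
    $\hat{\sigma}$ = 0.451 ms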
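The approximation formulas themselves are not legible in this copy. As one possible illustration, the mean of a nonnegative random variable equals the integral of $1 - F_L(w)$, and a simple quadrature over the injected durations is sketched below. The choice of quadrature rule (evaluating the estimate at the right end of each interval) is an assumption made here, not a statement of the authors' method, and the function name is hypothetical. Applied to the rounded estimates in the table above, it gives about 0.216 ms, which agrees with the reported mean.

```python
# Injected durations in microseconds and the rounded estimates of F_L.
W_US = [1, 3.16, 10, 31.62, 100, 316.22, 1000, 3162,
        10000, 31620, 100000, 316220, 1000000]
F_HAT = [.20, .33, .37, .63, .69, .61, .86, 1.00,
         1.00, 1.00, 1.00, 1.00, 1.00]

def mean_latency_us(durations_us, cdf_estimates):
    """Approximate the mean latency as the sum over intervals of
    (w_i - w_{i-1}) * (1 - F(w_i)), with w_0 = 0."""
    total, prev = 0.0, 0.0
    for w, f in zip(durations_us, cdf_estimates):
        total += (w - prev) * (1.0 - f)
        prev = w
    return total

mean_us = mean_latency_us(W_US, F_HAT)
print(f"{mean_us / 1000.0:.3f} ms")
```

Because the estimate reaches 1.00 at 3.162 ms, the longer durations contribute nothing to the sum under this rule.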
FUTURE EXPERIMENTAL DIRECTIONS

Improvements in Measuring the Effectiveness of the
Operating System's Transient/Permanent Fault Discriminator

A key factor in evaluating the effectiveness of the operating system's
ability to discriminate between transient and permanent faults is determining
whether an injected transient fault is transient benign or transient
persistent. In this section a simple modification to the experimental
approach is described which enables a more accurate determination of the type
of the fault.

The recommended change in the experimental method is:

(1) disable the reconfiguration process of the SIFT operating system so
that reconfiguration does not occur.

(2) instrument the operating system to record the time that
reconfiguration would normally have occurred.

In this way the error propagation process can be observed for a greater
amount of time. The possible results of an injection are now:

case 1: no reconfiguration

    s    e1  e2  e3  e4  e5  e6  ...  en              c
    |----|---|---|---|---|---|--------|---------------|---> t
    |<------------- Z' -------------->|

case 2: reconfiguration occurs

    s    e1  e2  e3  e4  e5   r   e6  e7  ...  en     c
    |----|---|---|---|---|----|---|---|--------|------|---> t

where

    $s$   = time injection begins
    $e_i$ = the time of detection of the ith error ($1 \le i \le n$)
    $r$   = time operating system reconfigures
    $c$   = censoring point of the experiment
    $Z'$  = $e_n - s$
    $R$   = $r - s$

By observing the error generation process after the reconfiguration time
$r$ until the censoring point $c$, we obtain the unconditional $Z'$ directly.
Therefore, the incorrect diagnosis of a transient fault as a permanent can be
more accurately discerned. If the errors disappear at some point after the
reconfiguration point, then the diagnosis that the fault was permanent was
wrong. Similarly, the classification of the transient faults into
transient-benign and transient-persistent is simplified.
Measuring $F_Z(z)$ and the Consequent Refined Analysis Method

In this section, a method of measuring the duration of natural (i.e.
non-injected) transient errors $F_Z(z)$ is introduced. The implications of
such an experiment are far-reaching. First, the need to assume a simple
stuck-at-1 pin-level fault has been removed. Second, the need for assuming
some underlying distribution $F_W(w)$ for the duration of the faults is
eliminated. The response of the operating system to transient faults of
varying durations created by physical pin-level injections no longer has to be
measured. The observed response of the operating system to the natural
transient faults can be directly entered into the reliability model. The
effectiveness of any modifications to the transient/permanent fault
discrimination algorithm can be measured by artificially introducing error
detections. These artificial error detections can be introduced by changing
memory locations in the SIFT processors while they are executing. The patterns
of error detections to be introduced artificially can be inferred from the
sequences of error detections observed from natural transient faults.

The following approach is suggested for measurement of $F_Z(z)$. Disengage
SIFT's reconfiguration algorithm and let it run continuously for many years.
Instrument the operating system with the same data gathering code as described
in the section "EXPERIMENTAL APPROACH" and collect as many natural transient
faults as possible. (Note: the distribution $F_Y$ can be determined from the
distributions $F_L$ and $F_Z$ using the above relationship between the random
variables.) If the transient fault arrival rate is $5 \times 10^{-3}$/hour,
about 40 transient faults should be observed in a year.
ACKNOWLEDGEMENT

The authors are grateful to Quyen Duong Cleary for her help in performing
this experiment. Without her patient performance of hundreds of physical
fault injections this paper would only be a theoretical discussion.

A detailed description of the preliminary transient fault experiment
along with the results from 297 transient injections is given. Although not
enough data was obtained to draw statistically significant conclusions, the
foundation has been laid for a large-scale transient fault experiment.
Several changes in the experimental procedure are recommended for the
large-scale experiment in order to increase the usefulness of the experiment.
The sensitivity of the probability of system failure to the mean duration of
the transient faults reveals the pressing need for credible measurements of
transient fault behavior.
APPENDIX A 
SIFT Chip Failure Rates ( x hrl ) 
Chip Type     # pins   Rate/Chip     Chip Type     # pins   Rate/Chip
54S30           14     0.1654        54S20           14     0.1913
5440            14     0.2132        54S244          20     0.3099
54LS30          12     0.1870        54LS20          12     0.1913
54LS21          12     0.1913        4001B           14     0.2387
4093B           14     0.2388        5410            14     0.2397
54LS11          14     0.2404        54LS10          14     0.2404
54LS27          14     0.2404        54S10           14     0.2406
54126           14     0.2424        5438            14     0.2424
54125           14     0.2424        54LS86          14     0.2432
54LS02          14     0.2432        54LS09          14     0.2432
54LS08          14     0.2432        54LS33          14     0.2432
54LS32          14     0.2432        54LS125         14     0.2432
54LS00          14     0.2432        54LS126         14     0.2432
54S37           14     0.2437        54S86           14     0.2437
54S08           14     0.2437        54S32           14     0.2437
54S02           14     0.2437        54S00           14     0.2437
54LS122         12     0.2100        54LS53          14     0.2456
54155           16     0.2815        5404            14     0.2472
54LS51          14     0.2479        54LS04          14     0.2479
54S04           14     0.2495        54S51           14     0.2495
70C96           16     0.2899        5437            16     0.2916
5474            14     0.2580        54LS74          14     0.2584
54LS74A         14     0.2584        7837            16     0.2964
54S74           14     0.2615        54C175          16     0.3001
54C174          16     0.3010        54LS93          10     0.1883
54LS279         16     0.3012        54LS367         16     0.3012
54LS368         16     0.3012        54LS113         14     0.2643
54LS92          10     0.1895        7835            16     0.3047
DS1651          16     0.3070        54S113          14     0.2702
7603.2          16     0.3098        54S288          16     0.3098
5331            16     0.3098        Hm7603          16     0.3098
HD6440A. 2      18     0.3484        54LS288         16     0.3119
54LS158         16     0.3121        54LS155         16     0.3121
54LS157         16     0.3121        54LS257         16     0.3121
54156           16     0.3123        54LS138         16     0.3135
54LS153         16     0.3135        54LS253         16     0.3135
54LS112         16     0.3135        54LS109         16     0.3135
54LS352         16     0.3135        54LS151         16     0.3149
54LS251         16     0.3149        54LS139         16     0.3163
a 9 0 2         16     0.3176        2902            16     0.3176
54182           16     0.3189        54LS123         16     0.3189
54S112          16     0.3194        54S153          16     0.3194
54S253          16     0.3194        54S151          16     0.3217
54LS175         16     0.3240        LM119D          14     0.2859
54175           16     0.3271        54LS148         16     0.3301
54LS148         16     0.3301        54LS164         14     0.2891
54LS241         20     0.4129        54LS240         20     0.4129
54LS244         20     0.4129        29LS18          16     0.3313
54S240          20     0.4179        54LS174         16     0.3383
54S175          16     0.3385        DM7136          16     0.3392
5485            16     0.3392        7136            16     0.3392
54LS245         20     0.4242        25LS2518        16     0.3405
7611.2          16     0.3456        54LS393         14     0.3049
54LS194         16     0.3507        5496            16     0.3542
54LS298         16     0.3552        7131            16     0.3583
54LS161A        16     0.3620        54LS161         16     0.3620
25LS2537        20     0.4530        54LS163         16     0.3632
9410            18     0.4093        54LS191         16     0.3643
54LS259         16     0.3643        54LS390         16     0.3654
54LS169         16     0.3654        75107B          14     0.3209
MHQ3467         14     0.3209        54LS290         16     0.3677
54LS165         16     0.3677        54LS273         20     0.4621
54LS377         20     0.4632        54LS374         20     0.4711
25LS2536        20     0.4803        54273           20     0.4859
25LS377         20     0.4873        SE555F           8     0.1953
554715           8     0.1953        54S471          20     0.4918
AM25LS2569      20     0.4969        54LS381         20     0.4981
75109A          14     0.3516        25LS2517        20     0.5030
CA3039          12     0.3059        54LS299         20     0.5169
75109           14     0.3809        2911            20     0.5504
9407            24     0.6663        7641.2          23     0.6538
54S472          20     0.5771        "7643           18     0.5216
AM2940Dm        28     0.8395        HM6514.2        18     0.6013
2914            40     1.3708        AM2812          28     1.0408
2901A           40     1.4945        93L422          22     0.9420
2901A           40     1.8233        AM2901A         40     1.8233
m 9 3            8     0.3770        2716            24     1.4030
mK4114.3        18     1.1236        LM741            7     0.4906
LF156H           8     0.7382        QT6m             4     0.4906
m2OH. 5          3     0.4348
APPENDIX B
Summary of Fault Injections
This section includes a summary of the 279 transient fault injections in
tabular form. The information included under each heading is:
INJ      - injection number
RECON    - the time in milliseconds from the injection until the system
           reconfigured.
TWIDTH   - the duration of the injection in milliseconds.
FIRSTE   - the time in milliseconds from the injection until the first error
           detection on another processor. The notation ... indicates that
           no error was detected.
LASTE    - the time in milliseconds from the injection until the last error
           detection on another processor.
           ==>  indicates last error same as reconfiguration time.
           ...  indicates no errors detected
           -x+R indicates the last error was x milliseconds before
                reconfiguration
TYP      - the type of fault: SA1 = +5 volts, SA0 = -5 volts.
LOCATION - the processor, board, chip and pin where fault was injected.
INJ   RECON   TWIDTH   FIRSTE   LASTE    TYP   LOCATION
  1    244     0.001     20     ==>      SA1   P1 CPU U35 2
  2    ...     0.001    ...     ...      SA1   P1 CPU U35 2
  3    ...     0.001    ...     ...      SA1   P1 CPU U35 2
  4    ...     0.001    ...     ...      SA1   P1 CPU U35 2
  5    236     0.001     12     ==>      SA1   P1 CPU U35 2
  6    ...     0.001    ...     ...      SA0   P1 CPU U35 2
  7    ...     0.001    ...     ...      SA0   P1 CPU U35 2
  8    ...     0.001    ...     ...      SA0   P1 CPU U35 2
  9    216     0.001      2     ==>      SA0   P1 CPU U35 2
 10    ...     0.001    ...     ...      SA0   P1 CPU U35 2
 11    ...     0.003    ...     ...      SA1   P1 CPU U35 2
 12    284     0.003     23     ==>      SA1   P1 CPU U35 2
 13    ...     0.003    ...     ...      SA1   P1 CPU U35 2
 14    ...     0.003    ...     ...      SA1   P1 CPU U35 2
 15    574     0.003     35     -40+R    SA1   P1 CPU U35 2
 16    ...     0.003    ...     ...      SA0   P1 CPU U35 2
17 
18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 
39 
40 
41 
42 
44 
45 
46 
47 
48 
49 
50 
51 
52 
53 
54 
55 
56 
57 
58 
60 
61 
62 
63 
64 
65 
71 
72 
73 
74 
75 
81 
82 
83 
... ... 
216 
282 
259 
193 
214 
... 
... 
... ... ... ... ... ... 
253 
255 
239 
250 
... 
... ... 
274 
190 
241 
221 
284 
237 
287 
246 
188 
253 
... 
... 
... ... 
261 
221 
248 
231 
184 
215 
216 
215 
227 
26 4 
286 
252 
244 
278 
280 
262 
... 
... 
... 
0.003 
0.003 
0.003 
0.003 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.316 
0.316 
0.316 
0.316 
0.316 
0.316 
0.316 
0.316 
0.316 
1.000 
1.000 
1.000 
1.000 
1.000 
3.160 
3.160 
3.160 
3.160 
3.160 
1.000 
1.000 
1.000 
... ... ... ... 
1 -=> 
21 -1+R 
4 -108+R 
5 ==> 
25 ==> 
... ... 
... ... 
... ... ... ... 
L . .  ... ... ... ... ... 
129 279 
2 ==> 
3 ==> 
15 -73+R 
26 ==> 
... ... 
... ... ... ... 
13 -1+R 
1 -1+R 
17 -1+R 
104 -> 
23 -6+R 
13 -1+R 
26 -1+R 
21 -1+R 
3 ==> 
1 -1+R 
8 18 
... ... 
... ... 
... ... 
3 -33+R 
3 -=> 
24 -107+R 
114 -> 
2 -72+R 
26 -1+R 
2 ==> 
3 ==> 
3 -=> 
25 -> 
1 -32+R 
20 -1- 
17 ==> 
19 ==> 
4 -> 
... ... 
... ... 
26 -1+R 
... ... 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SA1 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 r n T  
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
84 
85 
86 
87 
88 
89 
90 
91 
92 
93 
94 
95 
96 
97 
98 
100 
101 
102 
103 
104 
105 
106 
107 
108 
109 
110 
111 
112 
113 
114 
115 
116 
117 
118 
119 
120 
121 
122 
123 
124 
125 
126 
127 
128 
129 
130 
131 
132 
133 
134 
135 
136 
137 
138 
139 
263 
257 
262 
253 
281 
280 
220 
271 
219 
285 
216 
247 
244 
278 
197 
279 
261 
224 
215 
278 
251 
284 
281 
283 
243 
248 
229 
254 
272 
246 
291 
260 
290 
246 
214 
266 
249 
287 
253 
209 
247 
274 
188 
288 
254 
242 
248 
245 
209 
282 
214 
239 
251 
186 
... 
1.000 
1.000 
3.162 
3.162 
3.162 
3.162 
3.162 
10.000 
10.000 
10.000 
10.000 
10.000 
10.000 
10.000 
10.000 
10.000 
31.620 
31.620 
31.620 
31.620 
31.620 
31.620 
31.620 
31.620 
31.620 
31.620 
100.000 
100.000 
100.000 
100.000 
100.000 
100.000 
100.000 
100.000 
100.000 
100.000 
316.220 
316.220 
316.220 
316.220 
316.220 
316.220 
316.220 
316.220 
316.220 
316.220 
1000.000 
1000.000 
1000.000 
1000.000 
1000.000 
1000.000 
1000.000 
1000.000 
1000.000 
2 
2 
4 
2 
20 
19 
3 
10 
2 
24 
1 
23 
20 
17 
8 
19 
4 
107 
1 
17 
27 
23 
128 
22 
19 
24 
5 
2 
20 
11 
22 
4 
2 
3 
22 
25 
5 
25 
26 
2 
20 
23 
14 
3 
2 
3 
18 
24 
21 
==> 
-47+R 
-1+R 
-=> 
-> 
-1+R 
==> 
-> 
==> 
-=> 
-> 
-1+R 
-1+R 
-> 
==> 
==> 
=-> 
==> 
-> 
=-> 
==> 
-> 
-81+R 
=> 
==> 
-I+R 
=-> 
-1+R 
99 
-47+R 
-1+R 
-108+R 
-> 
-1+R 
=> 
=-> 
==> 
-1+R 
==> 
==> 
-> 
5)  
-> 
=-> 
-> 
==> 
-> 
==> 
-1+R 
20 -=> 
21 -72+R 
25 ==> 
15 ==> 
27 ==> 
3 -1+R 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
skl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 L W  
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SA1 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA1 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
v35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
u35 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
2 
140 
161 
162 
163 
164 
165 
166 
167 
168 
169 
170 
171 
172 
173 
174 
175 
176 
177 
178 
179 
180 
181 
182 
183 
184 
185 
186 
187 
188 
189 
190 
191 
192 
193 
194 
195 
196 
197 
198 
199 
200 
201 
202 
203 
204 
205 
206 
207 
208 
209 
210 
211 
212 
213 
214 
251 ... ... ... 
246 ... ... ... ... ... ... ... ... ... ... 
188 ... ... ... ... ... 
234 
274 ... ... ... ... ... ... ... ... 
201 
234 
197 
279 ... ... ... ... ... ... 
192 
272 
289 
243 
214 
236 ... ... ... ... ... 
257 
27 3 
... 
1000.000 
0.001 
0.001 
0.001 
0.001 
0.001 
0.001 
0.001 
0.001 
0.001 
0.001 
0.003 
0.003 
0.003 
0.003 
0.003 
0.003 
0.003 
0.003 
0.003 
0.003 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.316 
0.316 
0.316 
0.316 
27 ... ... 
2 
22 ... ... ... ... ... ... ... ... ... ... 
2 ... ... ... ... ... 
10 
13 ... ... ... ... ... ... ... ... 
12 
9 
8 
18 ... ... ... 
3 ... ... 
3 
11 
106 
19 
132 
12 ... ... ... ... ... 
3 
13 
... 
-1+R ... ... 
2 
-> ... ... ... ... ... ... ... ... ... ... 
-33+R ... ... ... ... ... 
-> 
-> ... ... ... ... ... ... ... ... 
-1+R 
-108+R 
-1+R 
-108+R ... ... ... 
3 ... ... 
=> 
==> 
-1+R 
-1+R 
=-> 
-1+R ... ... ... .... ... 
-> 
=> 
... 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA1 P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA1 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
S A l  P1 CPU 
u35 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
2 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
215 
216 
217 
218 
219 
220 
221 
222 
223 
224 
225 
226 
227 
228 
229 
230 
231 
232 
233 
234 
235 
236 
237 
2 38 
239 
240 
241 
242 
244 
245 
246 
247 
248 
249 
250 
251 
252 
253 
254 
256 
257 
2 58 
259 
260 
261 
262 
26 3 
264 
265 
266 
267 
268 
269 
270 
271 
186 ... ... ... ... ... 
226 
217 
290 
241 
261 
282 
253 
288 
207 
255 
288 
254 
261 
267 
289 
257 
282 
272 
204 
250 
260 
286 
284 
254 
2 37 
269 
204 
235 
284 
187 
227 
184 
249 
250 
288 
253 
... 
... 
... 
... 
... ... ... ... ... ... ... ... ... 
0.316 
0.316 
0.316 
0.316 
0.316 
0.316 
1.000 
1.000 
1.000 
1.000 
1.000 
1.000 
1.000 
1.000 
1.000 
1.000 
3.160 
3.160 
3.160 
3.160 
3.160 
3.160 
3.160 
3.160 
3.160 
3.160 
10.000 
10.000 
10.000 
10.000 
10.000 
10.000 
10.000 
10.000 
10.000 
31.620 
31.620 
31.620 
31.620 
31.620 
31.620 
31.620 
31.620 
31.620 
0.001 
0.001 
0.001 
0.001 
0.001 
0.001 
0.001 
0.001 
0.001 
0.001 
0.003 
3 ... ... .*. 
14 
2 
2 
0 
16 
4 
21 
2 
2 
18 
4 
1 
2 
4 
6 
2 
3 
128 
11 
15 
26 
2 
26 
23 
3 
12 
7 
122 
10 
25 
23 
4 
3 
2 
25 
26 
1 
106 
.*. 
*.. 
... 
... 
... ... ... ... ... ... ... ... ... 
-1+R ... ... ... 
14 
-> 
-1+R 
==> 
-1+R 
==> 
==> 
==> 
==> 
==> 
===> 
a=> 
-1+R 
==> 
-1+R 
-1+R 
===> 
==> 
=> 
=a> 
-> 
-I+R 
-81+R 
-6+R 
==> 
-1+R 
-34+R 
-1+R 
-1+R 
282 
=-> 
-1+R 
==> 
==> 
-72+R 
==> 
-> 
-> 
... 
... 
... 
... 
... ... ... ... ... ... ... ... ... 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA1 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 C W  
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 C W  
SAl P1 CPU 
SAl P1 CPU 
m PI C W  
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SAl P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SA0 P1 CPU 
SAl P1 MPM2 
SAl P1 MPM2 
SAl P1 mM2 
SAl P1 MPM2 
SAl P1 MPM2 
SA0 P1 MPM2 
SA0 P1 MPM2 
SA0 P1 mM2 
SA0 P1 mM2 
SA0 P1 MPM2 
SA1 P1 MPM2 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
E38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
U38 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
27 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
272 
273 
274 
275 
276 
277 
278 
279 
280 
281 
282 
284 
285 
286 
287 
288 
289 
290 
291 
292 
294 
295 
296 
297 
298 
299 
300 
301 
302 
303 
304 
305 
306 
307 
308 
309 
31 0 
311 
312 
313 
314 
315 
316 
318 
319 
321 
322 
323 
324 
325 
326 
327 
329 
330 
331 
__ 
... ... 
253 
216 
244 
961 
222 ... ... 
209 
108 
274 
215 
212 
... 
... 
... ... ... 
388 
222 
392 
246 
216 
... 
258 
... ... 
210 
254 
2 52 
213 
356 
220 
280 
255 
... 
... 
... ... 
219 
241 
257 
235 
270 
251 
250 
254 
268 
1091 
252 
236 
254 
... 
... 
0.003 
0.003 
0.003 
0.003 
0.003 
0.003 
0.003 
0.003 
0.003 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.010 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.032 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0.100 
0 100 
0.316 
0.316 
0.316 
0.316 
0.316 
0.316 
0.316 
0.316 
1.000 
1.000 
1.000 
0.010 
0.316 
1.000 
1.000 
1.000 
1.000 
1.000 
... ... 
2 
1 
20 
772 
5 
36 
34 
2 
13 
26 
23 
... 
... 
... 
... ... ... 
199 
302 
1 
203 
22 
3 
1 ... ... 
1 
3 
0 
24 
167 
3 
19 
3 
... 
... 
... ... 
4 
17 
3 
11 
9 
27 
4 
3 
6 
9 
1 
12 
3 
... 
... 
... ... 
-> 
==> 
-> 
==> 
=> 
1393 ... 
-1+R 
-1+R 
-> 
-1+R 
-> 
... 
... 
... ... ... 
==> 
380 
==> 
-> 
==> 
-1+R 
=> ... ... 
=-> 
-> 
==> 
-107+R 
==> 
-> 
-1+R 
-1+R 
... 
... 
... ... 
-1+R 
-1+R 
==> 
-> 
=> 
==> 
==> 
=> 
-48+R 
==> 
-> 
-72+R 
a=> 
... 
... 
SA1 
SA1 
SA1 
SAl 
SA0 
SA0 
SA0 
SA0 
SA0 
SA1 
SA1 
SA1 
SA1 
SA0 
SA0 
SA0 
SA0 
SA0 
SA1 
SA1 
SA1 
SA1 
SA0 
SA0 
SA0 
SA0 
SA0 
SA1 
SA1 
SA1 
SA1 
SA1 
SA0 
SA0 
SA0 
SA0 
SA0 
SA1 
SA1 
SA1 
SAl 
SA1 
SA0 
SA0 
SA0 
SA1 
SA1 
SA1 
SA1 
SA0 
SAl 
SA1 
SA0 
SA0 
SA0 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
P1 
MPM2 
MPM2 
MPM2 
MPM2 
M P M 2  
MPM2 
MPM2 
mM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
M P M 2  
MPM2 
MPM2 
MPM2 
MPM2 
M P M 2  
MPM2 
MPM2 
MPM2 
mM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
M P M 2  
MPM2 
MPM2 
mM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
M P M 2  
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
MPM2 
mM2 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
u34 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
3 
333    243     1.000     19     ==>      SA0   P1 MPM2 U34 3
334    227     3.162      3     -1+R     SA1   P1 MPM2 U34 3
335    237     3.162     13     ==>      SA1   P1 MPM2 U34 3
336    199     3.162     10     ==>      SA1   P1 MPM2 U34 3
337    251     3.162     27     -1+R     SA1   P1 MPM2 U34 3
355    ...     0.032    ...     ...      SA1   P1 MPM2 U34 3
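The column conventions defined at the start of this appendix can be applied mechanically. The sketch below is illustrative only: it assumes one whitespace-separated row per injection, laid out as in the intact rows at the end of the table, and the names Injection and parse_row are hypothetical, not part of the original data-gathering software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Injection:
    inj: int
    recon_ms: Optional[float]      # None when the system never reconfigured
    twidth_ms: float
    first_err_ms: Optional[float]  # None when no error was detected
    last_err_ms: Optional[float]   # always expressed as ms after injection
    fault_type: str                # "SA1" or "SA0"
    location: str                  # processor, board, chip and pin


def _num(tok: str) -> Optional[float]:
    """Map the '...' placeholder to None, anything else to a float."""
    return None if tok == "..." else float(tok)


def _last_error(tok: str, recon: Optional[float]) -> Optional[float]:
    if tok == "...":
        return None                      # no errors detected
    if tok == "==>":
        return recon                     # same as the reconfiguration time
    if tok.endswith("+R"):
        return recon - float(tok[1:-2])  # "-x+R": x ms before reconfiguration
    return float(tok)                    # plain number: ms after injection


def parse_row(line: str) -> Injection:
    f = line.split()
    recon = _num(f[1])
    return Injection(int(f[0]), recon, float(f[2]), _num(f[3]),
                     _last_error(f[4], recon), f[5], " ".join(f[6:]))


if __name__ == "__main__":
    print(parse_row("334 227 3.162 3 -1+R SA1 P1 MPM2 U34 3"))
```

Normalizing LASTE to an absolute time makes it directly comparable with FIRSTE and with the reconfiguration time when computing latencies.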
APPENDIX C
SURE Model
LAMBDA = 1E-4;
K = 10.0;
GAMMA = K*LAMBDA;
MU_W = 1E-9 TO* 1E-1 BY 10;
BETA = 2*MU_W;

W0 = 0.0;          W1 = 1E-6;         W2 = 3.16E-6;      W3 = 1E-5;
W4 = 31.62E-6;     W5 = 100.0E-6;     W6 = 316.22E-6;    W7 = 1.0E-3;
W8 = 3.162E-3;     W9 = 10E-3;        W10 = 31.62E-3;    W11 = 100E-3;
W12 = 316.22E-3;   W13 = 1000.0E-3;

ERW0 = 239.0;      ERW1 = 239.0;      ERW2 = 350.889;    ERW3 = 239.445;
ERW4 = 255.875;    ERW5 = 247.150;    ERW6 = 238.333;    ERW7 = 284.320;
ERW8 = 256.916;    ERW9 = 249.111;    ERW10 = 248.444;   ERW11 = 255.778;
ERW12 = 251.500;   ERW13 = 236.700;

ER2W0 = 57282.6;   ER2W1 = 57282.602; ER2W2 = 181682.0;  ER2W3 = 58599.2;
ER2W4 = 68679.1;   ER2W5 = 62566.7;   ER2W6 = 57448.1;   ER2W7 = 108444.4;
ER2W8 = 69975.8;   ER2W9 = 62921.8;   ER2W10 = 62708.1;  ER2W11 = 66016.7;
ER2W12 = 64170.5;  ER2W13 = 56681.3;

EZW0 = 0.0;        EZW1 = 0.1;        EZW2 = 66.333;     EZW3 = 0.0;
EZW4 = 47.286;     EZW5 = 0.0;        EZW6 = 2.462;      EZW7 = 0.0;
EZW8 = 0.0;        EZW9 = 0.0;        EZW10 = 282.000;   EZW11 = 99.000;
EZW12 = 0.0;       EZW13 = 0.0;

EZ2W0 = .160;      EZ2W1 = .160;      EZ2W2 = 92402.3;   EZ2W3 = 0.0;
EZ2W4 = 15875.0;   EZ2W5 = 0.0;       EZ2W6 = 40.00;     EZ2W7 = 0.0;
EZ2W8 = 0.0;       EZ2W9 = 0.0;       EZ2W10 = 79524.0;  EZ2W11 = 9801.0;
EZ2W12 = 0.0;      EZ2W13 = 0.0;

PRW0 = 0.0;        PRW1 = .17;        PRW2 = .30;        PRW3 = .37;
PRW4 = .53;        PRW5 = .69;        PRW6 = .54;        PRW7 = .86;
PRW8 = 1.00;       PRW9 = 1.00;       PRW10 = .95;       PRW11 = .90;
PRW12 = 1.00;      PRW13 = 1.00;

FW0 = 1;   IF W0 < BETA THEN FW0 = W0/BETA;
FW1 = 1;   IF W1 < BETA THEN FW1 = W1/BETA;
FW2 = 1;   IF W2 < BETA THEN FW2 = W2/BETA;
FW3 = 1;   IF W3 < BETA THEN FW3 = W3/BETA;
FW4 = 1;   IF W4 < BETA THEN FW4 = W4/BETA;
FW5 = 1;   IF W5 < BETA THEN FW5 = W5/BETA;
FW6 = 1;   IF W6 < BETA THEN FW6 = W6/BETA;
FW7 = 1;   IF W7 < BETA THEN FW7 = W7/BETA;
FW8 = 1;   IF W8 < BETA THEN FW8 = W8/BETA;
FW9 = 1;   IF W9 < BETA THEN FW9 = W9/BETA;
FW10 = 1;  IF W10 < BETA THEN FW10 = W10/BETA;
FW11 = 1;  IF W11 < BETA THEN FW11 = W11/BETA;
FW12 = 1;  IF W12 < BETA THEN FW12 = W12/BETA;
FW13 = 1;  IF W13 < BETA THEN FW13 = W13/BETA;

MU_R =
   (FW1 - FW0)*ERW1 + (FW2 - FW1)*ERW2 +
   (FW3 - FW2)*ERW3 + (FW4 - FW3)*ERW4 +
   (FW5 - FW4)*ERW5 + (FW6 - FW5)*ERW6 +
   (FW7 - FW6)*ERW7 + (FW8 - FW7)*ERW8 +
   (FW9 - FW8)*ERW9 + (FW10 - FW9)*ERW10 +
   (FW11 - FW10)*ERW11 + (FW12 - FW11)*ERW12 +
   (FW13 - FW12)*ERW13;

SIGMA_R = SQRT(
   (FW1 - FW0)*ER2W1 + (FW2 - FW1)*ER2W2 +
   (FW3 - FW2)*ER2W3 + (FW4 - FW3)*ER2W4 +
   (FW5 - FW4)*ER2W5 + (FW6 - FW5)*ER2W6 +
   (FW7 - FW6)*ER2W7 + (FW8 - FW7)*ER2W8 +
   (FW9 - FW8)*ER2W9 + (FW10 - FW9)*ER2W10 +
   (FW11 - FW10)*ER2W11 + (FW12 - FW11)*ER2W12 +
   (FW13 - FW12)*ER2W13 - MU_R*MU_R);

MU_Z =
   (FW1 - FW0)*EZW1 + (FW2 - FW1)*EZW2 +
   (FW3 - FW2)*EZW3 + (FW4 - FW3)*EZW4 +
   (FW5 - FW4)*EZW5 + (FW6 - FW5)*EZW6 +
   (FW7 - FW6)*EZW7 + (FW8 - FW7)*EZW8 +
   (FW9 - FW8)*EZW9 + (FW10 - FW9)*EZW10 +
   (FW11 - FW10)*EZW11 + (FW12 - FW11)*EZW12 +
   (FW13 - FW12)*EZW13;

SIGMA_Z = SQRT(
   (FW1 - FW0)*EZ2W1 + (FW2 - FW1)*EZ2W2 +
   (FW3 - FW2)*EZ2W3 + (FW4 - FW3)*EZ2W4 +
   (FW5 - FW4)*EZ2W5 + (FW6 - FW5)*EZ2W6 +
   (FW7 - FW6)*EZ2W7 + (FW8 - FW7)*EZ2W8 +
   (FW9 - FW8)*EZ2W9 + (FW10 - FW9)*EZ2W10 +
   (FW11 - FW10)*EZ2W11 + (FW12 - FW11)*EZ2W12 +
   (FW13 - FW12)*EZ2W13 - MU_Z*MU_Z);

P_R =
   (FW1 - FW0)*PRW1 + (FW2 - FW1)*PRW2 +
   (FW3 - FW2)*PRW3 + (FW4 - FW3)*PRW4 +
   (FW5 - FW4)*PRW5 + (FW6 - FW5)*PRW6 +
   (FW7 - FW6)*PRW7 + (FW8 - FW7)*PRW8 +
   (FW9 - FW8)*PRW9 + (FW10 - FW9)*PRW10 +
   (FW11 - FW10)*PRW11 + (FW12 - FW11)*PRW12 +
   (FW13 - FW12)*PRW13;

MU_RP = MU_R;
SIGMA_RP = SIGMA_R;

(* convert to hours *)
MS_PER_HOUR = 1E3*60*60;
MU_R = MU_R/MS_PER_HOUR;
SIGMA_R = SIGMA_R/MS_PER_HOUR;
MU_Z = MU_Z/MS_PER_HOUR;
SIGMA_Z = SIGMA_Z/MS_PER_HOUR;
MU_RP = MU_RP/MS_PER_HOUR;
SIGMA_RP = SIGMA_RP/MS_PER_HOUR;
SHOW MU_R, SIGMA_R, MU_Z, SIGMA_Z, MU_RP, SIGMA_RP;

1,2 = 4*GAMMA;
2,3 = 3*GAMMA + 3*LAMBDA;
1,4 = 4*LAMBDA;
4,5 = 3*GAMMA;
2,5 = 3*LAMBDA;
4,6 = 3*LAMBDA + 3*GAMMA;
2,7 = <MU_R, SIGMA_R, P_R>;
2,1 = <MU_Z, SIGMA_Z, 1-P_R>;
4,7 = <MU_RP, SIGMA_RP>;
7,8 = 3*GAMMA;
8,9 = 2*GAMMA + 2*LAMBDA;
7,10 = 3*LAMBDA;
10,12 = 2*LAMBDA + 2*GAMMA;
10,11 = 2*GAMMA;
8,12 = 2*LAMBDA;
8,13 = <MU_R, SIGMA_R, P_R>;
10,13 = <MU_RP, SIGMA_RP>;
8,7 = <MU_Z, SIGMA_Z, 1-P_R>;
13,14 = [illegible in source] + LAMBDA;
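The way the listing averages the per-duration measurements over the transient-duration distribution can be sketched in Python. This is a minimal illustration, not the SURE program itself: it assumes the FW terms are the CDF of a duration uniformly distributed on [0, BETA] with BETA = 2*MU_W, evaluated at the grid points W0..W13, and the table values are transcribed from the listing as best they can be read. The function names are hypothetical.

```python
import math

# Duration grid W0..W13 (same units as MU_W) and per-duration measurements.
W = [0.0, 1e-6, 3.16e-6, 1e-5, 31.62e-6, 100e-6, 316.22e-6,
     1e-3, 3.162e-3, 10e-3, 31.62e-3, 100e-3, 316.22e-3, 1.0]
ERW = [239.0, 239.0, 350.889, 239.445, 255.875, 247.150, 238.333,
       284.320, 256.916, 249.111, 248.444, 255.778, 251.500, 236.700]
ER2W = [57282.6, 57282.602, 181682.0, 58599.2, 68679.1, 62566.7, 57448.1,
        108444.4, 69975.8, 62921.8, 62708.1, 66016.7, 64170.5, 56681.3]
PRW = [0.0, .17, .30, .37, .53, .69, .54, .86, 1.00, 1.00, .95, .90, 1.00, 1.00]


def duration_cdf(mu_w):
    """FW_i = W_i/BETA if W_i < BETA, else 1 (uniform on [0, BETA])."""
    beta = 2.0 * mu_w
    return [w / beta if w < beta else 1.0 for w in W]


def mix(mu_w, table):
    """Sum of (FW_i - FW_{i-1}) * table_i for i = 1..13, as in the listing."""
    f = duration_cdf(mu_w)
    return sum((f[i] - f[i - 1]) * table[i] for i in range(1, len(W)))


def reconfiguration_stats(mu_w):
    """Return (MU_R in ms, SIGMA_R in ms, P_R) for a mean duration mu_w."""
    mu_r = mix(mu_w, ERW)
    # The max() guard against round-off is an addition, not in the listing.
    sigma_r = math.sqrt(max(mix(mu_w, ER2W) - mu_r ** 2, 0.0))
    return mu_r, sigma_r, mix(mu_w, PRW)


if __name__ == "__main__":
    for mu_w in (1e-9, 1e-6, 1e-3, 1e-1):
        print(mu_w, reconfiguration_stats(mu_w))
```

Sweeping mu_w over the same decades as the MU_W range in the listing shows how the reconfiguration probability P_R varies with the assumed mean transient duration.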
REFERENCES
1. McConnel, Stephen R.; Siewiorek, Daniel P.; Tsao, Michael M.: The
Measurement and Analysis of Transient Errors in Digital Computer Systems.
The Ninth Annual International Symposium on Fault-Tolerant Computing,
June 20-22, 1979.
2. Goldberg, Jack; Kautz, William H.; Melliar-Smith, P. Michael; Green,
Milton W.; Levitt, Karl N.; Schwartz, Richard L.; and Weinstock,
Charles B.: Development and Analysis of the Software Implemented Fault
Tolerance (SIFT) Computer. NASA CR-172146, 1984.
3. Miller, Douglas R.: A Note on the Independence of Multivariate Lifetimes
in Competing Risk Models. Annals of Statistics, 1977, Vol. 5, No. 3,
pp. 576-579.
4. Butler, Ricky W.: The Semi-Markov Unreliability Range Evaluator (SURE)
Program. NASA TM-86261, 1984.
5. Shin, Kang G.; Woodbury, Michael W.; and Lee, Yang-Hang: Modeling and
Measurement of Fault-Tolerant Multiprocessors. NASA CR-3920, August 1985.
6. Bavuso, S. J.; and Petersen, P. L.: CARE III Model Overview and User's
Guide (First Revision). NASA TM-86404, 1985.
7. Lala, Jaynarayan H.; and Smith, T. Basil, III: Development and Evaluation
of a Fault-Tolerant Multiprocessor (FTMP) Computer, Volume III - FTMP Test
and Evaluation. NASA CR-166073, 1983.
Standard Bibliographic Page
1. Report No.
NASA TM-89058
2. Government Accession No.
3. Recipient's Catalog No.
5. Report Date
February 1987
6. Performing Organization Code
505-66-21-01
8. Performing Organization Report No.
9. Performing Organization Name and Address
NASA Langley Research Center
Hampton, VA 23665-5225
10. Work Unit No.
11. Contract or Grant No.
12. Sponsoring Agency Name and Address
National Aeronautics and Space Administration
Washington, DC 20546-0001
13. Type of Report and Period Covered
Technical Memorandum
14. Sponsoring Agency Code
15. Supplementary Notes
Ricky W. Butler, Langley Research Center, Hampton, Virginia.
Carl R. Elks, PRC Kentron, Inc., Hampton, Virginia (presently with U.S. Army
Aerostructures Directorate, USAARTA (AVSCOM), Langley Research Center,
Hampton, Virginia).
16. Abstract
This paper presents the results of a preliminary experiment to study the
effectiveness of a fault-tolerant system's ability to handle transient faults.
The primary goal of the experiment was to develop the techniques to measure
the parameters needed for a reliability analysis of the SIFT computer system
which includes the effects of transient faults. A key aspect of such an
analysis is the determination of the effectiveness of the operating system's
ability to discriminate between transient and permanent faults. A detailed
description of the preliminary transient fault experiment along with the
results from 297 transient fault injections is given. Although not enough
data was obtained to draw statistically significant conclusions, the
foundation has been laid for a large-scale transient fault experiment.
17. Key Words (Suggested by Author(s))
Transient Faults
Fault Tolerance
Error
Fault Latency
Fault Injection
18. Distribution Statement
Unclassified - Unlimited
Star Category 62
19. Security Classif. (of this report)
Unclassified
20. Security Classif. (of this page)
Unclassified
21. No. of Pages
42
22. Price
A03
For sale by the National Technical Information Service, Springfield, Virginia 22161
NASA Langley Form 63 (June 1985)