Analysis of the impact of error detection on computer performance by Lee, Y. H. & Shin, K. C.
Produced by the NASA Center for Aerospace Information (CASI) 
https://ntrs.nasa.gov/search.jsp?R=19850005168 2020-03-20T20:02:22+00:00Z
ANALYSIS OF THE IMPACT OF ERROR DETECTION ON COMPUTER PERFORMANCE
Kang G. Shin and Yann-Hang Lee
Department of Electrical and Computer Engineering
The University of Michigan
Ann Arbor, Michigan 48109, U. S. A.
(NASA-CR-17?167) ANALYSIS OF THE IMPACT OF ERROR DETECTION ON COMPUTER PERFORMANCE
Progress Report (Michigan Univ.) 24 p HC A02/MF A01 CSCL 09B
N85-13417 Unclas G3/60 12496
ABSTRACT
Conventionally, reliability analyses either assume that a fault/error is
detected immediately following its occurrence, or neglect damages caused by
latent errors. Though unrealistic, this assumption has been imposed in order to
avoid the difficulty of determining the respective probabilities that a fault
induces an error and the error is then detected in a random amount of time
after its occurrence.
As a remedy for this problem, in this paper a model is proposed to analyze
the impact of error detection on computer performance under moderate
assumptions. Error latency - the time interval between occurrence of an error
and the moment of error detection - is used to measure the effectiveness of a
detection mechanism. We have used this model to (1) predict the probability of
producing an unreliable result, and (2) estimate the loss of computation due to
fault and/or error.
The work reported here is supported in part by NASA grant No. NAG 1-298. All correspondence
should be addressed to Prof. K. G. Shin, ECE Dept., The University of Michigan, Ann Arbor, MI 48109.
1. INTRODUCTION
During the past decade, many reliability-related models for fault-tolerant
computers have been developed. Based on system structures and operation
strategies, these models predict various measures such as reliability,
computation capacity, performability, etc. Usually, in these models, a
probability distribution function is used to describe the occurrence of
component or system failure. The results represent the time-varying
characteristics of a computer system. Since only the occurrence of failure is
included in these models, they fail to cover the following two aspects. One is
the existence of a latent fault, in which case a fault is present but no
erroneous state is induced. The other is the possible error latency, because
the error may not be detected immediately following its occurrence.
Consider the property of a fault. An input signal to a computer may cause
the fault to induce some erroneous states, or it may simply pass through this
fault and produce a correct output. The fault is said to be latent if it does not
harm normal operations. Bavuso, et al., investigated the problem of latent faults
and proposed experiments to measure fault latency [1]. Their studies indicate
that a significant proportion of faults remained latent after many repetitions of
a program. This fault latency has an important impact on an ultra-reliable
system since it may cause a catastrophe if more than one latent fault becomes
active.
It is desired that the error detection mechanisms associated with the system
identify an error immediately upon its generation. In fact, some errors may
not be captured by error detection mechanisms when they occur and then spread
as a result of the subsequent flow of information. Thus the damage caused by the
error will be propagated until it is identified. The delay between the occurrence
of an error and the moment of detection, called error latency, is important to
damage assessment, error recovery, and confidence in computed results. The same
notion has been defined by Courtois as detection time [2,3] and by Shedletsky as
latency difference [4]. Courtois also showed the results of on-line testing of the
M6800 microprocessor and presented the distributions of detection time for
certain detection mechanisms. Shedletsky proposed a technique to evaluate the
error latency based on the "fault set" philosophy and the probability distribution
of input signals. The resultant error latency was used to decide the required
rollback distance for successful data restoration. Both of these are confined to
the study of error detection capability and are not extended to include the
impacts of error detection on the system performance.
In reconfigurable fault-tolerant systems, a task executed by processors can
be recovered through various recovery methods if one of the resident processors
fails. Thus these systems are considered to have failed only when all
resources are exhausted or the system fails to reconfigure. In practice, in
addition to the probability of system failure, one may question what would happen
if the system cannot respond to a fault/error immediately following its
occurrence. With the existence of error latency, the system may send out some
erroneous computation results if it is still unaware of the error at the output
phase. On the other hand, even if the system has detected the error before it is
propagated, the computation achieved during error latency is useless and the
whole system suffers from the delay caused by error latency and recovery. So
the total cost induced by fault and/or error consists of two parts: one is the
computation loss, which includes error detection overhead, error latency, and
recovery overhead; the other is the relative cost increase due to delayed
response. To quantify these effects of error latency, the probability of having an
unreliable result and the computation loss have to be evaluated.
Various error detection techniques can be used to reduce the computation
loss and enhance the reliability of computation results, for instance,
enhancement of self-checking capability so that most errors can be detected
immediately, limitation of error contamination to reduce error latency and
recovery overhead, and periodic diagnostics which can seize faults before they
induce errors. Each of these techniques by itself may not provide acceptable
solutions to the reliability problem without high cost or overhead. Instead, a
combination of these techniques must be employed to obtain good, reliable
performance at a reasonable cost.
In this paper, a model is proposed to describe error detection processes
and to estimate their influences on system performance. Because intermittent
faults can seriously degrade performance and can cause a large fraction of all
errors [5], the model is intended to study their impact. In the following section,
the classification and the properties of error detection mechanisms are
discussed first. The model is developed in Section 3. Section 4 presents the
evaluation of the probability of having an unreliable result and the estimation of
average computation loss. The optimal strategy of periodic diagnostics is also
discussed in that section. A brief conclusion follows in Section 5.
Note that in this paper we consider faults in hardware components which
may cause a transition to erroneous states during normal operation. We
also assume that there is no design fault in the system. An error is defined to be
the consequent erroneous information/data caused by fault(s).
2. CLASSIFICATION OF ERROR DETECTION MECHANISMS
There are various error detection mechanisms which can be incorporated
in a computer system. The basic principle of these mechanisms is the use of
redundancy in devices, information, or time. Based on where these mechanisms
are employed and their respective performance measures, they are divided into
the following three classes.
1. Signal level detection mechanisms
Usually, the mechanisms in this category are implemented by built-in
self-checking circuits. Whenever an error is caused by a predescribed fault, these
circuits detect the malfunction immediately even if the erroneous signal does
not have any logical meaning. Typical methods include error detection codes,
duplicated complementary circuits, matchers, etc. Since the error is induced
only when an input signal falls into the corresponding fault set of the fault, the
fault latency will depend on the type of fault and the distribution of input signals.
On the other hand, the error is detected immediately whenever it is generated.
It is difficult for these detection mechanisms to achieve complete detectability
for all kinds of error because (1) it is prohibitively expensive to design detection
mechanisms which cover all types of faults, and (2) physical dependence
between function units and detection mechanisms cannot be totally avoided. The
performance of these detection mechanisms is measured by "coverage", which
is the probability of detecting an arbitrary fault.
2. Function level detection mechanisms
The detection mechanisms in this level are intended to check out unacceptable
activities or information at a higher level than the previous category. These
detection mechanisms can be imagined as "barriers" around the normal
operations. After an error is generated by a fault, the resulting abnormality may
grow very quickly, which is called the "snow ball effect" [3] or "error rate
phenomenon" [6], until it hits these barriers. We can apply several software and
hardware techniques such as capability checking, acceptance checking, invalid
op-code, timeout, and the like. Compared with the mechanisms in the first
category, these detection mechanisms are more flexible and inexpensive but the
error latency tends to increase. The effectiveness of these detection mechanisms
is very difficult to evaluate since it depends not only on the program executed
and the current system states, but also on the type of error.
3. Periodic diagnostic
This method is usually referred to as off-line testing because the computation
unit cannot perform any useful task while it is applied. It is composed of a
diagnostic program which imitates inputs such that all existing faults are
activated and thus produce errors. Several theoretical approaches exist to
determine the probability of finding an error after a certain amount of test time
(equivalent to the probability of detecting a fault in this case) [7,8]. Tasar also
provided a simulation to show the coverage of a self-testing program [9]. All
these results indicate that the effectiveness of the present category is a
monotonically increasing function of run time. Since the time required for complete
testing is too long, it is impractical to apply this method frequently during
normal operation. An alternative is to perform an imperfect diagnostic periodically
during normal operation or perform a thorough diagnostic when the system is
idle.
3. MODEL DEVELOPMENT
We have developed a model for describing error detection mechanisms as in
Figure 2. The model consists of three parts: the occurrence of a fault, the
consequent generation of an error, and the detection of that error. Since the
probability of having a double fault at a time is negligible, the case of multiple
faults is excluded from this model. There are six states in the model as follows:
1). NF (non-faulty): In this state no fault exists in the system.
2). F (faulty): There is a fault which is active and capable of inducing
errors.
3). FB (fault-benign): There is an inactive intermittent fault.
4). E (error): There is at least one error in the system and the fault which
has yielded erroneous information is still present.
5). ENF (error-non-faulty): In this state the intermittent fault has become
inactive after it induced some errors.
6). D (detection): In this state, the detection mechanisms have
identified some errors in the system.
Usually, the occurrence of a fault is regarded as a Poisson process with
rate λ. Since the system may contain an inactive intermittent fault, a benign
state has to be included in the model. Several models of intermittent faults were
proposed and used for testing and reliability evaluation [10 - 15]. In our model,
the transitions between states NF, F, FB, and between states E, ENF are used to
describe the behavior of faults.
Suppose there exists a process for generating errors by a given fault. With
the assumption that the signal patterns in successive inputs are independent,
the period of fault latency can be considered to be a random variable with a
hyperexponential distribution (or composite geometric distribution for discrete
inputs or cycles [16]). Using the concepts of information theory, Agrawal
presented a formula to estimate the probability of inducing error [16]. In fact,
because of the memory of sequential circuits and the dependence of execution
sequence, the assumption of independent successive inputs is invalid. In our
model, an exponentially distributed fault latency with rate α is assumed for
simplicity. Since fault latency is generally much smaller than the life cycle, this
assumption would not degrade the accuracy of the whole model. Before an
error is induced by a fault, the system may transfer immediately into state D if
signal level detection mechanisms cover this fault with probability one.
Otherwise, the system enters state E. Another reason for direct transition from
state F to state D is the execution of a diagnostic program. The transition duration
from state F to state D is assumed to be exponentially distributed with parameter
ω while the diagnostic program is running.
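
To make the simplification concrete, the following minimal Python sketch compares a hyperexponential fault latency with a single exponential of the same mean. The mixture weights and rates are hypothetical values chosen only for illustration; they are not taken from the paper, and the function names are likewise illustrative.

```python
import math

def hyperexp_survival(t, weights, rates):
    """P(fault latency > t) for a mixture of exponentials."""
    return sum(w * math.exp(-r * t) for w, r in zip(weights, rates))

def hyperexp_mean(weights, rates):
    """Mean latency of the mixture."""
    return sum(w / r for w, r in zip(weights, rates))

def exp_survival(t, alpha):
    """P(fault latency > t) for a single exponential with rate alpha."""
    return math.exp(-alpha * t)

# Hypothetical mixture: 70% of faults are excited quickly, 30% only rarely.
weights = [0.7, 0.3]
rates = [1.0, 0.05]

# Single rate alpha chosen so that both models have the same mean latency.
alpha = 1.0 / hyperexp_mean(weights, rates)

for t in (0.0, 1.0, 5.0, 20.0):
    print(t, hyperexp_survival(t, weights, rates), exp_survival(t, alpha))
```

With these hypothetical numbers the matched exponential overstates the chance that a fault is still latent shortly after it occurs and understates the long tail, which is the kind of error the simplifying assumption accepts.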
Once the system enters state E, the erroneous information starts to spread
until function level detection mechanisms identify any unacceptable result.
There are two paths to indicate this detection, which have transition rates β and
γ, respectively. β should be greater than γ since the existing fault in state E
could induce more errors which may spread with high probability. In addition,
the execution of a diagnostic program can also uncover the fault in state E.
The model as described above is very general for covering the processes of
error detection. The transition rates are dependent upon 1). the error detection
mechanisms employed in the system, 2). the operations executed in the system,
and 3). the characteristics of the concerned physical devices.
4. EVALUATION OF THE IMPACT OF ERROR LATENCY
4.1. Formulation of detection processes
Let a computer system incorporate the three types of error detection
mechanisms discussed above. We are interested in both the useful computation
time before the detection of error and the consequence of error latency. The
diagnostic program is executed for period t_p after every normal operation
period t_n as shown in Figure 3. Thus the coverage of a single diagnostic, denoted
as ξ, is equal to 1 − e^{−ω t_p} for each diagnostic period. The overhead for swapping
between the normal operation and the diagnostic is denoted by t_v. The signal
level detection provides a coverage c to detect error immediately. If the function
level detection mechanism finds an error, the system may apply one of various
recovery methods to rescue the contaminated message and computation.
The recovery overhead is assumed to be a function of error latency, denoted as
R(t_el) where t_el is error latency.
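
As a small worked illustration of the coverage relation ξ = 1 − e^{−ω t_p}, the Python sketch below computes the coverage for a given diagnostic run time and, conversely, the run time needed for a target coverage. The function names are illustrative, and the value ω = 50 is borrowed from the parameter lists in the figure captions; the time unit is whatever unit ω is expressed in.

```python
import math

def diagnostic_coverage(omega, t_p):
    """Coverage of one diagnostic run of length t_p: 1 - exp(-omega * t_p)."""
    return 1.0 - math.exp(-omega * t_p)

def run_time_for_coverage(omega, xi):
    """Diagnostic run time needed to reach coverage xi (0 <= xi < 1)."""
    if not 0.0 <= xi < 1.0:
        raise ValueError("coverage must lie in [0, 1)")
    return -math.log(1.0 - xi) / omega

omega = 50.0
t_p = run_time_for_coverage(omega, 0.9)
print(t_p, diagnostic_coverage(omega, t_p))
```

Because coverage saturates exponentially, each additional increment of coverage costs more diagnostic time than the previous one, which is why an imperfect but short periodic diagnostic can be attractive.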
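Since a latent fault only has the potential to harm system behavior, we deal
only with the error latency instead of the fault latency. Note that there is an
absorbing state (Detection state) in our Markov model. To distinguish whether
the error latency exists or not, we divide the state D into state D_1 and state D_2,
where the transition to state D_1 has to go through state E, and state D_2 is
reachable directly from state F if the fault is captured before the occurrence of
error or an error is detected immediately when it occurs. For convenience let us
number the states NF, F, FB, E, ENF, D_1, D_2 with i for i = 1, 2, ..., 7 and define the
transition matrix H_{7×7}(t) as follows:
\[
H_{7\times 7}(t) =
\begin{bmatrix}
-\lambda & \dfrac{\lambda\nu}{\mu+\nu} & \dfrac{\lambda\mu}{\mu+\nu} & 0 & 0 & 0 & 0 \\
0 & -(\mu + a_1(t) + a_2(t)) & \mu & a_1(t) & 0 & 0 & a_2(t) \\
0 & \nu & -\nu & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -(\mu + \beta(t)) & \mu & \beta(t) & 0 \\
0 & 0 & 0 & \nu & -(\nu + \gamma(t)) & \gamma(t) & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\]
Here the fifth row (state ENF) follows from the transitions described in Section 3:
the inactive fault becomes active again at rate ν, and the error is detected at
rate γ(t).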
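Since the diagnostics are invoked periodically, transition rates a_1(t), a_2(t), β(t),
and γ(t) are functions of time which are defined as follows:
\[
a_1(t) = \begin{cases} (1-c)\alpha & \text{if } n(t_n+t_p+t_v) < t \le n(t_n+t_p+t_v)+t_n \\ 0 & \text{otherwise} \end{cases}
\]
\[
a_2(t) = \begin{cases} c\alpha & \text{if } n(t_n+t_p+t_v) < t \le n(t_n+t_p+t_v)+t_n \\ \omega & \text{otherwise} \end{cases}
\]
\[
\beta(t) = \begin{cases} \beta & \text{if } n(t_n+t_p+t_v) < t \le n(t_n+t_p+t_v)+t_n \\ \omega & \text{otherwise} \end{cases}
\]
\[
\gamma(t) = \begin{cases} \gamma & \text{if } n(t_n+t_p+t_v) < t \le n(t_n+t_p+t_v)+t_n \\ 0 & \text{otherwise} \end{cases}
\]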
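
The piecewise rates can be sketched in Python as follows. This is only an illustration under one reading of the definitions above: during a normal operation interval the rates are a_1 = (1−c)α, a_2 = cα, β(t) = β, γ(t) = γ, and at any other time a_1 = 0, a_2 = ω, β(t) = ω, γ(t) = 0. The function names and the numerical values of t_n, t_p, t_v in the usage lines are hypothetical.

```python
import math

def in_normal_operation(t, t_n, t_p, t_v):
    """True if n*cycle < t <= n*cycle + t_n for some integer n >= 0."""
    cycle = t_n + t_p + t_v
    phase = math.fmod(t, cycle)
    return 0.0 < phase <= t_n

def transition_rates(t, t_n, t_p, t_v, alpha, beta, gamma, omega, c):
    """Time-dependent rates a1(t), a2(t), beta(t), gamma(t) as a dict."""
    if in_normal_operation(t, t_n, t_p, t_v):
        return {"a1": (1.0 - c) * alpha, "a2": c * alpha,
                "beta": beta, "gamma": gamma}
    # Outside normal operation only the diagnostic can detect anything.
    return {"a1": 0.0, "a2": omega, "beta": omega, "gamma": 0.0}

# Hypothetical schedule: t_n = 10, t_p = 0.05, t_v = 0.01.
normal = transition_rates(5.0, 10.0, 0.05, 0.01, 0.2, 0.5, 0.1, 50.0, 0.6)
diag = transition_rates(10.03, 10.0, 0.05, 0.01, 0.2, 0.5, 0.1, 50.0, 0.6)
print(normal)
print(diag)
```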
Thus the transition probability matrix P(t) = [p_ij(t)] can be solved by the
forward Chapman-Kolmogorov equation [17]:
\[
dP(t) = P(t)\, H(t)\, dt
\]
where p_ij(t) is the probability that the system is in state j at time t given that
the initial state is i. For the state probabilities, π(t) = [π_1(t), π_2(t), ..., π_7(t)], we
have the differential equation:
\[
d\pi(t) = \pi(t)\, H(t)\, dt
\]
where π_i(t) is the probability that the system is in state i at time t given π(0).
Because of the absorbing property of states D_1 and D_2, π_6(∞) + π_7(∞) = 1.
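
One way to obtain the state probabilities numerically is to integrate the differential equation for π(t) step by step. The sketch below uses a fixed-step fourth-order Runge-Kutta scheme, which is a choice made here for illustration and not a method named in the paper. Several things are assumptions: the ENF row of the rate matrix (reactivation at rate ν, detection at rate γ(t)), the schedule values t_n, t_p, t_v, and the choice to start in state F so that the detection behavior after a fault is visible. The remaining rates are borrowed from the Figure 5 caption.

```python
import math

N = 7  # state order: NF, F, FB, E, ENF, D1, D2

def generator(t, p):
    """Rate matrix H(t) as a list of rows; every row sums to zero."""
    cycle = p["t_n"] + p["t_p"] + p["t_v"]
    normal = 0.0 < math.fmod(t, cycle) <= p["t_n"]
    lam, mu, nu = p["lam"], p["mu"], p["nu"]
    if normal:
        a1, a2 = (1.0 - p["c"]) * p["alpha"], p["c"] * p["alpha"]
        b, g = p["beta"], p["gamma"]
    else:
        a1, a2, b, g = 0.0, p["omega"], p["omega"], 0.0
    return [
        [-lam, lam * nu / (mu + nu), lam * mu / (mu + nu), 0.0, 0.0, 0.0, 0.0],
        [0.0, -(mu + a1 + a2), mu, a1, 0.0, 0.0, a2],
        [0.0, nu, -nu, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, -(mu + b), mu, b, 0.0],
        [0.0, 0.0, 0.0, nu, -(nu + g), g, 0.0],  # ENF row: assumed form
        [0.0] * N,
        [0.0] * N,
    ]

def derivative(pi, t, p):
    """d(pi)/dt = pi * H(t), with pi a row vector."""
    H = generator(t, p)
    return [sum(pi[i] * H[i][j] for i in range(N)) for j in range(N)]

def rk4_step(pi, t, h, p):
    k1 = derivative(pi, t, p)
    k2 = derivative([x + 0.5 * h * k for x, k in zip(pi, k1)], t + 0.5 * h, p)
    k3 = derivative([x + 0.5 * h * k for x, k in zip(pi, k2)], t + 0.5 * h, p)
    k4 = derivative([x + h * k for x, k in zip(pi, k3)], t + h, p)
    return [x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(pi, k1, k2, k3, k4)]

def integrate(pi0, t_end, h, p):
    """Integrate from t = 0 to t_end with fixed step h."""
    pi = list(pi0)
    for n in range(int(round(t_end / h))):
        pi = rk4_step(pi, n * h, h, p)
    return pi

params = {"lam": 1e-8, "mu": 0.1, "nu": 0.2, "alpha": 0.2, "beta": 0.5,
          "gamma": 0.1, "omega": 50.0, "c": 0.6,
          "t_n": 10.0, "t_p": 0.046, "t_v": 0.01}  # schedule is hypothetical
pi0 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # illustrative start in state F
pi = integrate(pi0, 30.0, 0.005, params)
print("pi_E + pi_ENF =", pi[3] + pi[4])
print("pi_D1, pi_D2  =", pi[5], pi[6])
```

The step size must stay small relative to 1/ω for the explicit scheme to remain stable. Because the rates jump at the phase boundaries, a production solver would restart the integration at each boundary rather than step across it.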
4.2. Estimation of the probability of having an unreliable result
The execution of a task consists of parallel and/or serial execution of
processes. We can always partition the task such that every process sends the
computation result to its successors at the end of its execution and receives all
the input data at the beginning of execution. Thus, each process can be
considered as an atomic action [18]. Since an atomic action can be recovered very
easily, we are not interested in the possible faults/errors within the atomic
action if these faults/errors are detected. The more serious situation, namely
the propagation of erroneous information through the system, occurs if the
error cannot be discovered by the end of execution. Let the probability that the
system has at least one error at the output phase be p_e. Since the computation
result may or may not be contaminated by the errors, we claim that p_e is also
the probability of having an unreliable result.
Without periodic diagnostic, p_e can be represented easily by the error
latency t_el and the process of error occurrence. Let f_el(t) and f_eo(t) be the
probability density functions of t_el and the time between two successive error
occurrences induced by different types of fault. Then, the probability of having
an unreliable result, p_e, is given by
\[
p_e = (1-c) \int_0^{T_k} f_{eo}(t) \left[ 1 - \int_0^{T_k - t} f_{el}(\tau)\, d\tau \right] dt
\]
where T_k is the process execution time. It is obvious that both the reduction of
error latency and the avoidance of error can improve the final figure. Both
density functions can be obtained from the forward Chapman-Kolmogorov equation,
which becomes homogeneous in this case.
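
As a hedged numerical illustration, the sketch below evaluates p_e under one reading of the expression above, namely p_e = (1 − c) ∫_0^{T_k} f_eo(t) [1 − F_el(T_k − t)] dt, where F_el is the cumulative distribution of the error latency. Both densities are assumed to be exponential here purely so that a closed form is available for comparison; in the paper they come from the Markov model. The rates `lam_e` and `beta`, the execution time `T_k`, and the function name are hypothetical.

```python
import math

def p_unreliable(T_k, c, f_eo, F_el, n=2000):
    """Midpoint-rule estimate of (1-c) * int_0^T f_eo(t) * (1 - F_el(T-t)) dt."""
    h = T_k / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        # An error occurring at time t is still undetected at T_k
        # when its latency exceeds T_k - t.
        total += f_eo(t) * (1.0 - F_el(T_k - t))
    return (1.0 - c) * total * h

# Hypothetical exponential models for error occurrence and error latency.
lam_e, beta = 1e-4, 0.5
T_k, c = 20.0, 0.6

def f_eo(t):
    return lam_e * math.exp(-lam_e * t)

def F_el(x):
    return 1.0 - math.exp(-beta * x)

numeric = p_unreliable(T_k, c, f_eo, F_el)
# Closed form for the exponential case, used only as a cross-check.
closed = (1.0 - c) * lam_e * (math.exp(-lam_e * T_k)
                              - math.exp(-beta * T_k)) / (beta - lam_e)
print(numeric, closed)
```

In this exponential case the result is proportional to (1 − c) and falls as the detection rate β grows, which matches the qualitative statement that reducing error latency and avoiding errors both lower p_e.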
When a scheduled periodic diagnostic is implemented for the process, the
resultant p_e becomes a function of the time interval between the output
moment and the previous diagnostic. The shorter this time interval, the more
reliable the computation result. Because of the uncertainty of the task execution
time, it is difficult to schedule a periodic diagnostic such that the system is
tested just before the process moves into the output phase. Here, using the
proposed model, we evaluate the maximum p_e, denoted by max(p_e), at which the
time interval between task completion and the last diagnostic is t_n. For a
process which sends out its result at T_k, max(p_e) is the probability that the system
is in erroneous states (E or ENF) at time T_k, i.e. max(p_e) = π_E(T_k) + π_ENF(T_k). In
fact, because of the Markovian property in each transition, max(p_e) is almost
independent of the execution time T_k when T_k is much less than 1/λ. The
simulation results are graphed in Figures 4 and 5. In Figure 4, max(p_e) starts to
decrease only when each diagnostic has a higher coverage. Note that Tasar
showed that running a diagnostic program for the first 150 µs can cover 98.46% of
the faults of a processor [9]. In Figure 5, we compare four different cases: 1). without
diagnostic, 2). with periodic diagnostics and c = 0.6, 3). with periodic diagnostics
and c = 0.8, and 4). with periodic diagnostics and doubling the detection rate of
function level detection mechanisms. It is noted that max(p_e) is linearly related to
the coverage of signal level detection and changes exponentially with respect
to the detection capability of function level detection mechanisms.
4.3. Calculation of computation loss
Given the characteristics of the signal and function level detection mechanisms
which are incorporated in the system, a designer may question how much
time is lost due to faults/errors and how much periodic diagnostics can
improve the reliability and performance. Intuitively, periodic diagnostics can
decrease the probability of having errors and can thus avoid the crashes, but they
certainly waste useful processing time. The following example is used to
show the variations related to different parameters. If an error is detected after
execution interval T_d, the average computation loss due to fault and/or error,
CL, is given by:
\[
CL = \frac{t_p + t_v}{t_n + t_p + t_v}\, E(T_d) + \left( E(t_{el}) + E(R(t_{el})) \right) p_d
\]
where p_d is the probability of detecting error by function level detection
mechanisms, and E(t_el) is the mean value of error latency. E(T_d) can be
expressed as p_d E(T_d1) + (1 − p_d) E(T_d2), where T_d1 and T_d2 are the amounts of time
spent before the system is absorbed into states D_1 and D_2, respectively. T_d1 and
T_d2 are random variables with pdf p'_16(t)/p_16(∞) and p'_17(t)/p_17(∞), respectively,
where ' denotes the derivative with respect to time. The error latency is also a
random variable with pdf p'_46(t). Finally, the percentage of average computation
loss is given by:
\[
\frac{CL}{E(T_d)} = \frac{t_p + t_v}{t_n + t_p + t_v} + \frac{\left( E(t_{el}) + E(R(t_{el})) \right) p_{16}(\infty)}{p_{16}(\infty)\, E(T_{d1}) + p_{17}(\infty)\, E(T_{d2})}
\]
The above equation indicates that the time wasted for executing periodic
diagnostics is a dominating factor in the total computation loss when the system
is highly reliable (i.e. the system has a small λ). However, only the task
currently being executed suffers from the delay due to the error latency and
recovery overhead. This delay in response may cause some serious damage to
the system if execution of the task is time-critical. With λ = 10^-8, α = 0.02, β = 0.2,
γ = 0.1, ω = 50, the simulation results of the computation loss and the response
delay due to error latency versus the period of a diagnostic cycle are plotted in
Figure 6. Once the cost function of response time for a task and the recovery
overhead are given, we can easily calculate the total loss and then decide the
optimal diagnostic schedule, which consists of the time interval between two
successive diagnostics, t_n, and the coverage of each diagnostic, ξ.
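
The fractional computation loss discussed in this subsection can be evaluated with a few lines of Python. The sketch assumes the loss is read as the diagnostic share of each cycle, (t_p + t_v)/(t_n + t_p + t_v), plus the latency and recovery term (E(t_el) + E(R(t_el))) p_16(∞) divided by E(T_d) = p_16(∞) E(T_d1) + p_17(∞) E(T_d2). The function name and all input numbers are hypothetical placeholders; in practice the absorption probabilities and mean times would come from solving the Markov model.

```python
def fractional_computation_loss(t_n, t_p, t_v, mean_latency, mean_recovery,
                                p16_inf, p17_inf, mean_Td1, mean_Td2):
    """Return (diagnostic share, latency share, total) of computation loss."""
    # Share of every cycle spent on the diagnostic and on swapping.
    diag_share = (t_p + t_v) / (t_n + t_p + t_v)
    # Mean time until absorption in D1 or D2.
    mean_Td = p16_inf * mean_Td1 + p17_inf * mean_Td2
    # Loss from error latency and recovery, incurred only on the D1 path.
    latency_share = (mean_latency + mean_recovery) * p16_inf / mean_Td
    return diag_share, latency_share, diag_share + latency_share

# Hypothetical inputs for a highly reliable system (very long time to detection).
diag_share, latency_share, total = fractional_computation_loss(
    t_n=10.0, t_p=0.05, t_v=0.01,
    mean_latency=4.0, mean_recovery=2.0,
    p16_inf=0.3, p17_inf=0.7,
    mean_Td1=1e8, mean_Td2=1e8)
print(diag_share, latency_share, total)
```

With inputs of this kind the diagnostic share is larger than the latency share by several orders of magnitude, in line with the observation that diagnostic time dominates the total loss when λ is small.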
Figure 7 presents the response delay due to error latency for different
combinations of intermittent and permanent faults where greater error latency
occurs when most faults have a short active time. The improvement in error
latency by diagnostic appears notable only if the cycle time of diagnostic is not
much greater than the fault's active time. However, the computation time is
also wasted in this case. No ideal method so far has been established to
diagnose the intermittent faults. Many computers are able to retry instructions
whenever an error is detected. This method is useful to make the system
survive intermittent faults, especially for reading or writing a tape or disk. Once an
error can be detected immediately after its occurrence, the instruction retry
method can also be applied in other parts of the system. This implies that signal
level detection mechanisms should play an important role in fault-tolerant sys-
tems.
5. CONCLUSION
In this paper, we have presented two performance-related evaluations for
fault-tolerant computers. These two are not usually included in the traditional
reliability models since such models do not deal with the process of error
detection. The first evaluation, the probability of having an unreliable result,
indicates the degree of confidence in the computation result. The suspicion in the
computation result is totally due to the deficiency of the error detection process.
Unfortunately, this deficiency cannot be eliminated completely from any practical
error detection mechanism. In the second evaluation, we take into account a
more detailed computation loss resulting from the occurrence of error, its
detection, and the subsequent recovery. For many cases where a system
requires high overhead for error recovery or suffers from an erroneous output,
the reliability analysis has to quantify these kinds of loss and has to provide a
good method for estimating the total loss.
Though there are several assumptions to be justified through experiments,
the model developed in this paper is general enough to include all aspects from
fault occurrence to error detection and also various detection mechanisms. As
shown in both evaluations, the model has systematically dealt with various
aspects of error detection mechanisms. The results obtained here have high
potential use in decision making during the design or operation phase. The results
also show favorable strategies of periodic diagnostics: 1). for time-critical tasks,
one can derive an optimal diagnostic cycle to minimize the computation loss and
the penalty of delayed response, 2). for noncritical tasks, the diagnostic is
executed only when the system is idle.
ACKNOWLEDGMENT
The authors are grateful to Ricky Butler and Milton Holt at NASA Langley
Research Center and C. M. Krishna at The University of Michigan for technical
discussions and assistance.
REFERENCES
1. S. J. Bavuso et al., "Latent Fault Modeling and Measurement Methodology
for Application to Digital Flight Control", Advanced Flight Control
Symposium, USAF Academy, 1981.
2. B. Courtois, "Some Results about the Efficiency of Simple Mechanisms
for the Detection of Microcomputer Malfunction", Proc. of the 9th Annual
Intl Symp. on Fault-Tolerant Computing, 1979, pp. 71-7?.
3. B. Courtois, "A Methodology for On-line Testing on Microprocessors",
Proc. of the 11th Annual Intl Symp. on Fault-Tolerant Computing,
1981, pp. 272-274.
4. J. J. Shedletsky, "A Rollback Interval for Networks with an Imperfect
Self-Checking Property", IEEE Trans. on Computers, Vol. C-27, No. 6,
June 1978, pp. 500-508.
5. M. Ball and F. Hardie, "Effects and Detection of Intermittent Failures in
Digital Systems," AFIPS Conf. Proc., Fall 1969, pp. 329-335.
6. S. Osder, "The DC-9-80 Digital Flight Guidance System's Monitoring
Techniques", Proc. of the AIAA Guidance and Control Conf., 1979, pp. 6?-79.
7. N. L. Gunther and W. C. Carter, "Remarks on the Probability of Detecting
Faults", Proc. of the 10th Annual Intl Symp. on Fault-Tolerant Computing,
1980, pp. 213-215.
8. J. J. Shedletsky, "Random Testing: Practicality vs. Verified Effectiveness",
Proc. of the 7th Annual Intl Symp. on Fault-Tolerant Computing,
1977, pp. 175-179.
9. V. Tasar, "Analysis of Fault-Detection Coverage of a Self-Test Software
Program", Proc. of the 8th Annual Intl Symp. on Fault-Tolerant Computing,
1978, pp. 65-74.
10. M. A. Breuer, "Testing for Intermittent Faults in Digital Circuits," IEEE
Trans. on Computers, Vol. C-22, No. 3, March 1973, pp. 241-246.
11. I. Koren and S. Y. H. Su, "Reliability Analysis of N-modular Redundancy
Systems with Intermittent and Permanent Faults", IEEE Trans. on Computers,
Vol. C-28, No. 7, July 1979, pp. 514-520.
12. Y. K. Malaiya and S. Y. H. Su, "Reliability Measure of Hardware Redundancy
Fault-Tolerant Digital Systems with Intermittent Faults", IEEE
Trans. on Computers, Vol. C-30, No. 8, August 1981, pp. 600-604.
13. Y. W. Ng and A. A. Avizienis, "A Unified Reliability Model for
Fault-Tolerant Computers," IEEE Trans. on Computers, Vol. C-29, No. 11, Nov.
1980, pp. 1002-1011.
14. J. H. Lala and A. L. Hopkins, Jr., "Survival and Dispatch Probability
Models of the FTMP," Proc. of the 8th Annual Intl Symp. on
Fault-Tolerant Computing, 1978, pp. 37-43.
15. K. S. Trivedi and R. M. Geist, "A Tutorial on the CARE III Approach to
Reliability Modeling", NASA Report, No. 3468, 1981.
16. V. D. Agrawal, "An Information Theoretic Approach to Digital Fault
Testing", IEEE Trans. on Computers, Vol. C-30, No. 8, August 1981, pp. 582-587.
17. L. Kleinrock, Queueing Systems, Volume 1: Theory, John Wiley & Sons
Inc, 1975.
18. D. B. Lomet, "Process Structuring, Synchronization, and Recovery Using
Atomic Actions," Proc. ACM Conf. Language Design for Reliable
Software, SIGPLAN Notices 12, no. 3, March 1977, pp. 128-137.
Duration A: A fault exists and is active in the system.
Duration B: The fault becomes inactive.
Duration C: Errors exist in the system.
Figure 1.	 The error detection process. (Timeline labels: durations A, B, C; error detected; error latency; time.)
Figure 2.	 The model for error detection process.
Figure 3.	 A cycle of periodic diagnostic. (Labels along the process time axis: normal operation, t_n; swap, t_v; diagnostic, t_p.)
Case 1: t_n = 5.0
Case 2: t_n = 10.0
Case 3: t_n = 20.0
(Horizontal axis: coverage of single diagnostic.)
Figure 4.	 Probability of having an unreliable result versus coverage of
single diagnostic.
( λ=10^-8, µ=0.1, ν=0.2, α=0.2, β=0.5, γ=0.1, c=0.6 )
Case 1: no diagnostic
Case 2: c = 0.3
Case 3: c = 0.8
Case 4: c = 0.6 and β, γ are doubled
(Horizontal axis: period of diagnostic cycle.)
Figure 5.	 Probability of having an unreliable result versus period of
diagnostic cycle.
( λ=10^-8, µ=0.1, ν=0.2, α=0.2, β=0.5, γ=0.1, ξ=0.6 )
(Horizontal axis: period of diagnostic cycle.)
Figure 6.	 Percentage of total loss (case 1) and average loss due to error
latency (case 2) versus period of diagnostic cycle.
( λ=10^-8, µ=ν=0.2, α=0.05, β=0.2, γ=0.1, ω=50.0, ξ=0.9, c=0.6 )
Case 1: µ=0.2, ν=0.04
Case 2: µ=0.2, ν=0.2
Case 3: µ=0.04, ν=0.2
(Horizontal axis: period of diagnostic cycle.)
Figure 7.	 Average loss due to error latency versus period of diagnostic
cycle.
( λ=10^-8, α=0.05, β=0.2, γ=0.1, ω=50.0, ξ=0.9, c=0.6 )
