Reliability and coverage analysis of non-repairable fault-tolerant memory systems by Cox, G. W. & Carroll, B. D.
Produced by the NASA Center for Aerospace Information (CASI) 
https://ntrs.nasa.gov/search.jsp?R=19760020797 2020-03-22T15:00:16+00:00Z
(NASA-CR-149S42) RELIABILITY AND COVERAGE ANALYSIS OF NON-REPAIRABLE FAULT-TOLERANT MEMORY SYSTEMS Final Technical Report (Auburn Univ.) 106 p HC $5.50 CSCL 09E
N76-27885
Unclas
G3/60 45852

ELECTRICAL ENGINEERING
ENGINEERING EXPERIMENT STATION
AUBURN UNIVERSITY
AUBURN, ALABAMA
FINAL TECHNICAL REPORT

RELIABILITY AND COVERAGE ANALYSES
OF NON-REPAIRABLE FAULT-TOLERANT
MEMORY SYSTEMS

Prepared by
G. W. Cox and B. D. Carroll

Electrical Engineering Department
Auburn University
Auburn, Alabama 36830

Contract NAS8-26930
George C. Marshall Space Flight Center
National Aeronautics and Space Administration
Marshall Space Flight Center, Alabama 35812

July 1, 1976
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
I. INTRODUCTION
    Background
II. FAULT TOLERANT MEMORY DESCRIPTION
    Basic System
    Alternate Designs
III. RELIABILITY MODEL DEVELOPMENT
    General Techniques
    Reliability Equations
    Coverage Equations
    Computer Evaluation
IV. RELIABILITY EQUATIONS FOR ALTERNATE SYSTEMS
    Non-Spared System
    TMR System
    Duplicated System
    Double-Error-Correcting (DEC) System
V. ANALYSIS RESULTS
VI. CONCLUSION
REFERENCES
APPENDIX A
APPENDIX B

ABSTRACT
A method was developed for the construction of probabilistic
state-space models for non-repairable systems. This method allows
the construction of system models with considerably fewer states
than the model resulting from more traditional approaches. Models
were developed for several systems which achieved reliability
improvement by means of error-coding, modularized sparing, massive
replication and other fault-tolerant techniques.
From the models developed, sets of reliability and coverage
equations for the systems were developed. Comparative analyses of the
systems were performed using these equation sets. In addition, the
effects of varying subunit reliabilities on system reliability and
coverage were described. The results of these analyses indicated
that a significant gain in system reliability may be achieved by use of
combinations of modularized sparing, error coding and software error
control. For sufficiently reliable system subunits, this gain may far
exceed the reliability gain achieved by use of massive-replication
techniques, yet result in a considerable saving in system cost.

LIST OF FIGURES
1. Functional Representation of Basic System
2. Functional Orientation of a Typical Memory Word
3. Functional Representation of TMR System
4. State Diagram and Transition Probabilities for Example Device
5. State Diagram for Basic System
6. T-Matrix for Double-Error-Correcting System
7. Reliability of Subject Systems
8. Reliability of Basic System for Various Number of Spares
9. Reliability of Double-Error-Correcting System Algorithm Failure Rate
10. Reliability of Double-Error-Correcting System vs. Algorithm Failure Rate
11. Reliability of Double-Error-Correcting System vs. Detector Failure Rate
12. Reliability of Double-Error-Correcting System vs. Memory Size
13. Coverage for Basic System
14. Flowchart for Reliability Computations by Method 2
15. Flowchart for Reliability Computations by Method 3

LIST OF TABLES
1. Definition of Notation
2. Definition of Reliability Symbols for Basic System
3. Percentage of Memory Word FPV's Correctable for the Double-Error-Correcting System (22, 16) Code
4. State Configurations for Double-Error-Correcting System
5. Base Values for System Variables
6. Events and Subevents for Double-Error-Correcting System

I. INTRODUCTION
As the field of computing system design has developed, the need
for reliable computers has become crucial. Advances in the aerospace
area in particular have necessitated the design of computing systems that
are highly reliable and capable of operation in a non-repairable
environment. In many other system applications, while repair may be possible,
an interruption in system operation is unacceptable.
Due to the large number of components which it contains, the main
memory has typically been the most unreliable subunit of the computing
system [1]. Since this subunit contributes a high percentage of total
system size and weight and many systems must operate within limitations
in these areas, massive replication techniques for memory reliability
improvement are often not applicable. Thus, much research has been
performed to find methods of memory reliability improvement by other means.
Several methods of improvement have been utilized. One such
method is the development of error-control codes for use in the memory
array. Also, modular memory organizations have been designed in an
attempt to limit the possible ways that stored-word errors can occur
and to ease system reconfiguration problems. The example systems of
this paper utilize both coding and modular design for improved system
reliability. These systems are described in Chapter II.
A method is presented in this paper for calculating the reliability
Replication on the memory system level [2, 3] has been used as a
solution for the ultra-reliable memory problem. Substantial increase
in memory reliability has resulted in many cases. System cost, however,
has increased linearly with the number of duplicated systems. Other
limiting factors, such as system weight and size, have prevented the use
of massive replication techniques in many applications.
A number of proposed and actual systems [1, 4, 5, 6, 7, 8] have
utilized a modular concept of memory arrangement, usually in conjunction
with error coding. In addition, a number [9, 10, 11, 12] of burst-error
correcting codes have been developed. These codes are well suited for use
in word-slice oriented memories in which a majority of the word errors
may be expected to occur within groups of word bits.
Several articles [13, 14, 15] have developed reliability calculation
procedures for the fault-tolerant memory problem. Many others [16, 17, 18]
have shown calculation procedures for fault-tolerant systems in general.
When a state-space approach to system modeling has been taken, the time
allowed for state transitions to occur is generally $\Delta t$. Typically,
only one system event is allowed to occur in this transition interval.
Multiple states are then necessary to represent all possible combinations
of conditions of system subunits, resulting in large numbers of states for
highly complex systems.

II. FAULT TOLERANT MEMORY DESCRIPTION
In this chapter, several fault-tolerant memory systems are
described. The first section describes a system which is taken as
a basis for the comparison of related systems. Several related systems
are described in the second section. Reliability and coverage
computations for these systems will be examined in following chapters.

Basic System
The basic computer system to be analyzed has been designed for use
in extended aerospace missions. It was desirable to implement the
computer memory in a manner so as to be within weight, size, and
economic limitations, yet be highly fault-tolerant.
A modular design approach has been undertaken in which the memory
array is made up of memory slices, each of which contains the same bit
location of all memory words. If n words are contained in the memory
and each word is k bits long, then there must be k memory modules and
each module must contain n bits. These modules will be referred to as
on-line bit planes.
In addition to the bit planes already discussed, the system contains
identically-sized spare bit planes which may be switched to replace any
failed on-line bit plane. The arrangement of on-line and spare bit
planes is shown in Figure 1. The functional orientation of memory words
is shown in Figure 2.
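The bit-plane organization described above can be illustrated with a short Python sketch. The function names and the small 4-bit example words are illustrative assumptions and do not come from the original design; the sketch only shows that k bit planes of n bits each hold the same information as n words of k bits.

```python
def words_to_bit_planes(words, k):
    """Slice n k-bit words into k bit planes of n bits each.

    Plane b holds bit position b of every word (hypothetical layout).
    """
    return [[(word >> bit) & 1 for word in words] for bit in range(k)]


def bit_planes_to_words(planes):
    """Reassemble the memory words from their bit planes."""
    n = len(planes[0])
    words = []
    for address in range(n):
        word = 0
        for bit, plane in enumerate(planes):
            word |= plane[address] << bit
        words.append(word)
    return words


# Three 4-bit words become four planes of three bits each.
example_words = [0b1010, 0b0111, 0b0001]
example_planes = words_to_bit_planes(example_words, k=4)
```

In this layout the loss of one plane corrupts at most one bit of every word, which is the situation a single-error-correcting code is suited to handle.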
A single-error-correcting/double-error-detecting (22, 16) code [19]
is used for memory data word encoding. This code has the property that
any odd number of errors in a codeword will produce an odd-weighted
error syndrome. Double errors will produce a non-zero even-weighted error
syndrome and higher numbers of even errors will produce even-weighted
(including all zero) error syndromes. These features of the code
will be further discussed in a later section.
External to the memory, data words are encoded using only 2 byte
parity bits. For this reason, circuitry which translates between the two
codes is necessary for use in memory write and read cycles. This
function is performed by the memory translator. In addition, the translator
contains circuitry for the correction of single bit errors and detection
of multiple bit errors in memory words, and control of the
reconfiguration switching circuitry which directs each word bit to the appropriate
bit plane. These functions will now be examined.
For a memory write operation, the translator accepts a byte-parity
encoded word from the CPU-memory bus. The byte parity bits are saved
and the check bits for the SEC/DED code are generated. A validity
check is then made by a comparison of the saved byte parity bits with
the generated check bits. If no error is found, the data word with SEC/DED
check bits appended is stored in the memory. If an error is found, a
program interrupt is sent to the CPU.
For a memory read operation, the requested encoded word is read
from the memory array and placed in the storage data register (SDR). The
error syndrome for the word is formed from the encoded word and if a zero
(no error) syndrome is signaled, the byte parity bits for the data word
are formed and the word is transmitted on the data bus.
An odd-weight (odd error) syndrome signal causes a bit inversion to
be made by the single error correction circuitry. The error syndrome for
the corrected word is then generated. If no error is signaled, then
it is assumed that there was a single error in the encoded word. The
byte parity bits are generated and the word is transmitted on the data
bus. If an error is signaled, a program interrupt is generated.
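The read-cycle decisions just described may be sketched in Python as follows. This is an illustrative model only: the function names are hypothetical, syndromes are represented as plain integers, and treating a non-zero even-weight syndrome as an immediate program interrupt is an assumption based on the double-error-detecting property of the code rather than a statement from the text.

```python
def syndrome_weight(syndrome):
    """Number of 1 bits in the error syndrome."""
    return bin(syndrome).count("1")


def read_cycle_action(syndrome, syndrome_after_correction=0):
    """Return the translator action for one memory read (hypothetical model).

    syndrome_after_correction is the syndrome recomputed after the
    single-bit inversion; it is only consulted for odd-weight syndromes.
    """
    if syndrome == 0:
        # No error signaled: form byte parity and transmit the word.
        return "transmit"
    if syndrome_weight(syndrome) % 2 == 1:
        # Odd-weight syndrome: invert one bit, then re-check the word.
        if syndrome_after_correction == 0:
            return "correct_and_transmit"
        return "interrupt"
    # Assumption: a non-zero even-weight syndrome is a detected
    # multiple error and leads directly to a program interrupt.
    return "interrupt"
```

The second argument stands in for the re-generated syndrome, so the sketch does not need the actual (22, 16) parity-check matrix, which is not given in this section.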
When the translator receives the information that a certain
designated spare bit plane is to replace an on-line bit plane, it must
reconfigure the memory array input and output switching to reflect this
change. Memory input switching is reconfigured first. Each memory word
is then read from the on-line array, corrected if necessary, and
re-written in the on-line array with the spare bit plane replacing the
designated on-line bit plane. After all memory words have been read and
restored, the memory array output switching is reconfigured appropriately.
The decision to replace an on-line bit plane may be arrived at by
use of various switching strategies. It is assumed for the basic system
that the reconfiguration signal is issued by the CPU as a result of
error signals received from the translator. It is also assumed that the
switching strategy is to replace a bit plane as soon as it is detected
that the bit plane contains an error. Another switching strategy will
be discussed in a following section.
In the basic system, there is assumed to be no facility available
for the correction of multiple errors. If system failure is defined to
be the occurrence of a non-correctable error, then the occurrence of
more than one error in a single memory word will constitute failure for
this system. For purposes of system modeling, the occurrence of
simultaneous failures in multiple bit planes is assumed to be
equivalent to the occurrence of multiple errors in a single memory word.
System failure, then, will occur when more than one on-line bit plane
has failed.
Spare bit planes are assumed to operate in a mode identical to the
on-line bit planes prior to their insertion into the on-line array.
Spare bit planes, then, fail with the same characteristics as the
on-line units. It is also assumed that after a bit plane has been
removed from the on-line array, it is never re-inserted. A bit plane
which has been replaced is called an unavailable spare. A spare bit
plane which has not been inserted into the on-line array and which may
or may not be failed is an available spare.
The system, then, may be divided into subsystems by function. These
subsystems are:
1) The on-line memory array consisting of a number of bit
planes,
2) The spare bit plane array including both available and
unavailable spares,
3) The error detection circuitry of the translator,
4) The error correction circuitry of the translator,
5) The reconfiguration switching array, and
6) The encoding and decoding subsystems of the translator.
References will be made to these subsystems in following sections.

Alternate Designs
Several fault-tolerant memory systems which are related to the
basic system have been studied. Four of these systems will be described
in this section.
The non-spared system is identical to the basic system except that
no spare bit planes are provided. In addition, no reconfiguration switch-
ing circuitry is included, since such circuitry would have no use in this
system. Comparisons made between this system and the basic system will
show the relative improvement to be gained by the use of the spare bit
plane approach.
The TMR system consists of three systems of the non-spared type
in a triple modular redundant configuration. The functional operation
of this system may be described as follows:
1) For a memory write operation, the SEC/DED-encoded word is
stored in the same logical location in all three memories.
2) For a memory read operation, the requested memory location
is read in all three memories. Single error correction
is performed independently by the systems and byte parity
bits are generated in each case. The three byte-parity
encoded words are then voted on by majority logic in a
bit-by-bit fashion. The output word is constructed by
using the majority vote for each bit. If the constructed
word is still a codeword, it is transmitted on the data bus.
If it is not a codeword, an error program interrupt is
generated.
This system, then, will produce the correct output word as long as
at least two of the three memories can construct the correct word. A
functional depiction of this system is shown in Figure 3.
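The bit-by-bit majority vote used in the TMR read operation can be sketched in Python. The function name and the 4-bit example words are illustrative assumptions; words are modeled as integers, and the subsequent codeword check is not shown.

```python
def majority_vote(a, b, c):
    """Bit-by-bit majority of three words held as integers.

    An output bit is 1 exactly when at least two of the three
    corresponding input bits are 1.
    """
    return (a & b) | (a & c) | (b & c)


# Two memories agree on 0b1010; the third has two erroneous bits.
voted_word = majority_vote(0b1010, 0b1010, 0b0110)
```

Because the vote is taken per bit, the output can be correct even when each of the three words is wrong in a different bit position.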
The duplicated system is composed of two identical non-spared
subsystems. Data to be loaded is stored in the same logical location in
both subsystems. Data read from the system is read from only one memory.
If a non-correctable error is signaled by the on-line unit, output
bussing is switched to the other unit and the data is read from the same
location. If both subsystems signal a non-correctable error in the
same memory word, an error program interrupt is generated.
The double-error-correcting system is a modified version of the
basic system which will correct double errors and detect a triple error
which produces a single error syndrome. The additional features are
achieved by the use of software routines [20] which are CPU implemented.
Since double errors are correctable in this system, a reconfiguration
switching strategy is assumed in which an on-line bit plane is replaced
only if it contains an erroneous bit position of a word which has two
or more errors. This system will be more fully discussed in a later
chapter.

III. RELIABILITY MODEL DEVELOPMENT
In this chapter, a generalized method for the computation of
reliability, the probability of satisfactory operation, and coverage,
the probability of recovery if a failure occurs, for a system is
described. This method is applied to form sets of reliability and
coverage equations for the basic system described in the preceding
chapter. Computer implementation of these equations is examined in the
last section.
General Techniques
Prior to the development, it is appropriate that certain notation
be defined. A listing of notation used is shown in Table 1.
For the purpose of reliability computation, the performance of a
device may often be represented as a set of states and state transitions.
Suppose, for example, that a certain non-repairable device has three
possible modes of operation:
1) Satisfactory operation,
2) Degraded operation caused by event A which occurred while
the device was operating satisfactorily, and
3) Unsatisfactory operation caused by event B which occurred
while the device was operating satisfactorily or by event
C which occurred while the device was operating in its
degraded mode.
These three modes of operation form three natural states for the device.

TABLE 1. Definition of Notation

Notation                      Meaning
$P(x)$                        Probability of the occurrence of event x
$P(x,r)$                      Probability of the occurrence of events x and r
$P(x \text{ or } r)$          Probability of the occurrence of event x or event r
                              or both
$P(x/r)$                      Probability of the occurrence of event x given that
                              event r has occurred
$P_x(t,\Delta t)$             Probability of the occurrence of event x in the time
                              period from $t$ to $t + \Delta t$
$P_{i,j}(t,\Delta t/i)$ or    Probability of the occurrence of a transition from state
$P_{i,j}(t,\Delta t)$         i to state j in the time period from $t$ to $t + \Delta t$
                              given that the state at time t is i
$P_i(t)$                      Probability that the system is in state i at time t
$P_i(t + \Delta t, j)$        Probability that the system is in state i at time
                              $t + \Delta t$ and that it was in state j at time t
$P_i(t + \Delta t/j)$         Probability that the system is in state i at time
                              $t + \Delta t$ given that it was in state j at time t
$r_j(t)$                      Probability that component j is non-failed at time t
$r(t)$                        Probability that a generalized component is non-failed
                              at time t

If the assumption is made that the system is operating in state 1
(satisfactory operation) at time $t$, then the probability that the system
will be in state 2 (degraded operation) at time $t + \Delta t$, a small interval
of time later, is the probability that event A occurred in the time
period from $t$ to $t + \Delta t$. In equation form,
\[ P_{1,2}(t,\Delta t/1) = P_A(t,\Delta t) \]
where $P_{1,2}(t,\Delta t/1)$ is the probability of a transition from state 1 to
state 2 in time $t$ to $t + \Delta t$ given that the system was in state 1 at
time $t$ and $P_A(t,\Delta t)$ is the probability of the occurrence of event A
in the same time period.
In a similar manner, the transition probabilities into state 3
(the failed state) are
\[ P_{1,3}(t,\Delta t/1) = P_B(t,\Delta t), \quad \text{and} \]
\[ P_{2,3}(t,\Delta t/2) = P_C(t,\Delta t). \]
This state model can be described graphically by a state diagram as shown
in Figure 4.
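An equivalent form of device state representation is a matrix $T$
which has as its $i,j$ entry $P_{i,j}(t,\Delta t/i)$ for $i \neq j$ and
$1 - \sum_{k=1,\, k \neq i}^{N} T_{i,k}$ for $i = j$, where $N$ is the number of
device states. The T-matrix for the example device is given below,
with rows and columns labeled by state.
\[
T =
\begin{array}{c|ccc}
  & 1 & 2 & 3 \\ \hline
1 & 1 - P_A(t,\Delta t) - P_B(t,\Delta t) & P_A(t,\Delta t) & P_B(t,\Delta t) \\
2 & 0 & 1 - P_C(t,\Delta t) & P_C(t,\Delta t) \\
3 & 0 & 0 & 1
\end{array}
\]
Deleting the $(t,\Delta t)$ subscripts yields
\[
T =
\begin{array}{c|ccc}
  & 1 & 2 & 3 \\ \hline
1 & 1 - P_A - P_B & P_A & P_B \\
2 & 0 & 1 - P_C & P_C \\
3 & 0 & 0 & 1
\end{array}
\]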
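As a minimal sketch, the T-matrix of the three-state example device can be built in Python and checked for the property that each row sums to one. The function name and the numerical transition probabilities are illustrative assumptions only.

```python
import math


def example_t_matrix(p_a, p_b, p_c):
    """T-matrix of the example device as nested lists (rows = from-state).

    p_a, p_b, p_c are the probabilities of events A, B and C in one
    interval; the values used below are hypothetical.
    """
    return [
        [1.0 - p_a - p_b, p_a, p_b],  # state 1: stay, degrade (A), fail (B)
        [0.0, 1.0 - p_c, p_c],        # state 2: stay, fail (C)
        [0.0, 0.0, 1.0],              # state 3: failed state is absorbing
    ]


T_example = example_t_matrix(0.10, 0.05, 0.20)
rows_sum_to_one = all(math.isclose(sum(row), 1.0) for row in T_example)
```

The row-sum check follows directly from the definition of the diagonal entries as one minus the sum of the off-diagonal entries in the same row.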
The probability that the system is in any given state at time
$t + \Delta t$ may be expressed in terms of the transition probabilities and
the distribution of state probabilities at time $t$. These equations may
be obtained by assuming that the system is operating in a state i at time $t$
and by computing the probability of the occurrence of the transition
event to state j in the time period from $t$ to $t + \Delta t$.
For this development, the following notational convention will be
used.
P(system operating in state i at $t + \Delta t$ given that the
system was in state j at time $t$) = $P_i(t + \Delta t/j)$.
To obtain the equation for $P_1(t + \Delta t/1)$, it must be considered
that for the system to be in state 1 at time $t + \Delta t$, no state transition
out of state 1 may occur between $t$ and $t + \Delta t$. Then the complement of
the two state transitions out of state 1 must be combined as follows:
\[ P_1(t + \Delta t/1) = (1 - P_{1,2}(t,\Delta t/1))(1 - P_{1,3}(t,\Delta t/1)) \]
\[ = 1 - P_{1,2}(t,\Delta t/1) - P_{1,3}(t,\Delta t/1) + P_{1,2}(t,\Delta t/1)\,P_{1,3}(t,\Delta t/1). \]
If $\Delta t$ is defined to be a period of time which is small enough to allow
only one state transition to take place, the last term in this equation
becomes negligible since it defines the probability of more than one state
transition occurring in time $t$ to $t + \Delta t$. Then,
\[ P_1(t + \Delta t/1) = 1 - P_{1,2}(t,\Delta t/1) - P_{1,3}(t,\Delta t/1) \]
\[ = 1 - P_A(t,\Delta t) - P_B(t,\Delta t). \]
Recalling that
\[ P(X/Y) = \frac{P(X,Y)}{P(Y)} \quad [25], \]
then
\[ \frac{P_1(t + \Delta t, 1)}{P_1(t)} = 1 - P_A(t,\Delta t) - P_B(t,\Delta t) \]
\[ P_1(t + \Delta t, 1) = P_1(t)\,(1 - P_A(t,\Delta t) - P_B(t,\Delta t)). \]
Since there are no transition paths into state 1, the event "the
system is in state 1 at $t + \Delta t$" implies the event "the system is in state
1 at time $t$." Then,
\[ P_1(t + \Delta t, 1) = P_1(t + \Delta t). \]
So,
\[ P_1(t + \Delta t) = P_1(t)\,(1 - P_A(t,\Delta t) - P_B(t,\Delta t)). \]
There are two ways for the system to be in state 2 at time $t + \Delta t$.
Either the system was operating in state 1 at time $t$ and the transition
from state 1 to state 2 occurred in the time period from $t$ to $t + \Delta t$,
or the system was operating in state 2 at time $t$ and no transition event
out of state 2 occurred in $t$ to $t + \Delta t$.
The equation for $P_2(t + \Delta t)$ may then be formed as follows:
\[ P_2(t + \Delta t) = P_2(t + \Delta t, 1) + P_2(t + \Delta t, 2) \]
\[ = P_{1,2}(t,\Delta t/1)\,P_1(t) + (1 - P_{2,3}(t,\Delta t/2))\,P_2(t) \]
\[ = P_A(t,\Delta t)\,P_1(t) + (1 - P_C(t,\Delta t))\,P_2(t). \]
By similar reasoning,
\[ P_3(t + \Delta t) = P_3(t + \Delta t, 1) + P_3(t + \Delta t, 2) + P_3(t + \Delta t, 3) \]
\[ = P_{1,3}(t,\Delta t/1)\,P_1(t) + P_{2,3}(t,\Delta t/2)\,P_2(t) + P_3(t + \Delta t, 3) \]
\[ = P_B(t,\Delta t)\,P_1(t) + P_C(t,\Delta t)\,P_2(t) + P_3(t + \Delta t, 3). \]
Since there are no transition paths out of state 3, the probability that
the system is in state 3 at $t + \Delta t$ and that it was in state 3 at time $t$ is
the probability of the latter condition, or
\[ P_3(t + \Delta t, 3) = P_3(t). \]
By substitution, the equation for $P_3(t + \Delta t)$ becomes
\[ P_3(t + \Delta t) = P_B(t,\Delta t)\,P_1(t) + P_C(t,\Delta t)\,P_2(t) + P_3(t). \]
In general, the state probability equation for state i is
\[ P_i(t + \Delta t) = \sum_{j=1,\, j \neq i}^{n} P_{j,i}(t,\Delta t/j)\,P_j(t)
   + \Bigl(1 - \sum_{k=1,\, k \neq i}^{n} P_{i,k}(t,\Delta t/i)\Bigr) P_i(t) \]
where n is the number of system states. The first summation in this
equation represents the sum of the probabilities of all possible
transitions into state i from another state. The coefficient of
$P_i(t)$ is the probability that no transition out of state i will occur
in $t$ to $t + \Delta t$ given that the system was in state i at time $t$.
Since for each term of the form $P_{u,v}(t,\Delta t/u)$, the u inside the
parentheses is redundant, this probability may be represented as
$P_{u,v}(t,\Delta t)$ where the deleted u is understood.
The general state probability equation then becomes
\[ P_i(t + \Delta t) = \sum_{j=1,\, j \neq i}^{n} P_{j,i}(t,\Delta t)\,P_j(t)
   + \Bigl(1 - \sum_{k=1,\, k \neq i}^{n} P_{i,k}(t,\Delta t)\Bigr) P_i(t). \qquad (3\text{-}1) \]
If vectors $P(t + \Delta t)$ and $P(t)$ are defined by
\[ P(t + \Delta t) = \begin{bmatrix} P_1(t + \Delta t) \\ P_2(t + \Delta t) \\ \vdots \\ P_n(t + \Delta t) \end{bmatrix},
   \qquad
   P(t) = \begin{bmatrix} P_1(t) \\ P_2(t) \\ \vdots \\ P_n(t) \end{bmatrix} \]
then equation (3-1) may be represented in matrix form as
\[ P(t + \Delta t) = T^{T} \times P(t) \qquad (3\text{-}2) \]
where $T$ is previously defined and $T^{T}$ is the transpose of $T$.
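Equation (3-2) lends itself to direct step-by-step evaluation. The following Python sketch applies it repeatedly to the three-state example device. It assumes, purely for illustration, that the transition probabilities are the same in every interval; the function names and numerical values are hypothetical.

```python
import math


def step(T, P):
    """One application of equation (3-2): P(t + dt) = T^T x P(t)."""
    n = len(P)
    return [sum(T[j][i] * P[j] for j in range(n)) for i in range(n)]


def propagate(T, P0, steps):
    """Apply equation (3-2) for the given number of intervals."""
    P = list(P0)
    for _ in range(steps):
        P = step(T, P)
    return P


# Example device with hypothetical P_A = 0.10, P_B = 0.05, P_C = 0.20.
# Assumption: T is held constant over all intervals.
T_device = [
    [0.85, 0.10, 0.05],
    [0.00, 0.80, 0.20],
    [0.00, 0.00, 1.00],
]
P_start = [1.0, 0.0, 0.0]          # system starts in state 1
P_after_two = propagate(T_device, P_start, 2)
```

In general the entries of T depend on t, in which case T would be recomputed before each call to `step` rather than held fixed as in this sketch.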
In a complex system, the events which cause state transitions may be
composed of many subevents which must occur for the transition event to
occur. It may be more desirable to work with these subevent probabilities
than to attempt to determine the probability of the overall event. For
this reason, it is necessary to analyze the possible types of subevents
and to be able to calculate the probability of occurrence of each type.
For any transition event i,j with probability of occurrence $P_{i,j}(t,\Delta t)$
it is possible to place any subevent in exactly one of the following six
event classes:
1. The failure event of a system component or component group
prior to time $t + \Delta t$.
2. The non-failure event of a system component or component
group prior to time $t + \Delta t$.
3. The failure event of a system component or component group
in the time period from $t$ to $t + \Delta t$.
4. The non-failure event of a system component or component
group in the time period from $t$ to $t + \Delta t$.
5. The failure event of a system component or component
group in the time period from $t$ to $t + \Delta t$ given its
non-failure prior to $t$.
6. The non-failure event of a system component or component
group in the time period from $t$ to $t + \Delta t$ given its
non-failure prior to $t$.
In order to compute the probability of events in each of these
classes, it is necessary to first examine the basis for the computation
of failure probabilities.
Each system component or component group has associated with it a
failure probability density function, $f(t)$. In the general case,
\[ f(t) = -\frac{dr(t)}{dt} \quad \text{and} \quad \int_0^{\infty} f(t)\,dt = 1. \quad [23] \]
The a priori probability of component (group) failure in the time period
from $t_1$ to $t_2$ may be expressed as
[...]
which is the probability that the component (group) will fail in the
interval from $t$ to $t + \Delta t$ given that it is non-failed prior to time $t$.
By use of these concepts, the subevent probabilities for each class
may now be computed as follows:
Class 1.
\[ P_1 = \int_0^{t+\Delta t} f(t)\,dt = 1 - \int_{t+\Delta t}^{\infty} f(t)\,dt = 1 - r(t + \Delta t). \]
Class 2.
\[ P_2 = 1 - \int_0^{t+\Delta t} f(t)\,dt = 1 - P_1 = r(t + \Delta t). \]
Class 3.
\[ P_3 = \int_t^{t+\Delta t} f(t)\,dt = \int_t^{\infty} f(t)\,dt - \int_{t+\Delta t}^{\infty} f(t)\,dt = r(t) - r(t + \Delta t). \]
Class 4.
\[ P_4 = 1 - \int_t^{t+\Delta t} f(t)\,dt = 1 - P_3 = 1 - r(t) + r(t + \Delta t). \]
Class 5.
\[ P_5 = \int_0^{\Delta t} f(t)\,dt = 1 - r(\Delta t). \]
Class 6.
\[ P_6 = 1 - \int_0^{\Delta t} f(t)\,dt = 1 - P_5 = r(\Delta t). \]
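The six class probabilities can be evaluated numerically once a reliability function r(t) is chosen. The Python sketch below assumes an exponential reliability law, r(t) = exp(-lambda t), which is an illustrative assumption and is not stated in this section; the function names and the failure rate are likewise hypothetical.

```python
import math


def exponential_reliability(failure_rate):
    """Return r(t) = exp(-failure_rate * t) (assumed reliability law)."""
    return lambda t: math.exp(-failure_rate * t)


def subevent_probabilities(r, t, dt):
    """Probabilities of subevent classes 1 to 6 for reliability function r."""
    return {
        1: 1.0 - r(t + dt),         # failure prior to t + dt
        2: r(t + dt),               # non-failure prior to t + dt
        3: r(t) - r(t + dt),        # failure between t and t + dt
        4: 1.0 - r(t) + r(t + dt),  # non-failure between t and t + dt
        5: 1.0 - r(dt),             # failure in interval, given non-failed at t
        6: r(dt),                   # non-failure in interval, given non-failed at t
    }


r_example = exponential_reliability(1.0e-3)   # hypothetical failure rate
classes = subevent_probabilities(r_example, t=100.0, dt=1.0)
```

Under the assumed exponential law, the class 3 probability divided by r(t) equals the class 5 probability, which is consistent with reading class 5 as the conditional form of class 3.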
To completely specify the state probabilities, it is necessary to
select a base time, $t_{base}$. In general, $t_{base}$ may be any time at which
all state probabilities are known. The following discussion will
assume that $t_{base}$ is 0. It is common to denote one system state, m, as
the starting state and assume that
\[ P_m(t = t_{base} = 0) = 1, \quad \text{and} \]
\[ P_n(t = t_{base} = 0) = 0 \quad \text{for all } n \neq m. \]
The state probabilities may be computed for any t > 0 if:
1. All state transition equations are known, and
2. All system component (group) reliability equations are
known.
To obtain a closed-form solution for each probability equation,
it is common to rearrange each equation into its differential form and
solve the equation set simultaneously. By making simplifying assumptions,
the equation set may be approximated by a set of linear differential
equations. For systems with a large number of states, however, the
simultaneous solution problem may become quite involved. In addition,
if the analysis of a related system is desired, only a slight difference
in architecture or operation may necessitate the re-derivation of all
state equations.
If computer evaluation of state probabilities is possible, however,
the open form of the state probability equations may yield satisfactory
results at considerable savings in effort. In addition, no simplifying
assumptions need be made to assure equation linearity. State probability
equations to be derived in this paper will remain in this open form.
Reliability Equations
For the basic memory system, the insertion of each spare bit plane
on-line performs a natural partitioning of system states. By determining
the number of available spares it is possible to define the state of
the system. If the basic system has k bits per memory word and s spare
bit planes initially available, the system state diagram may be constructed
as shown in Figure 5.
For each state i ($1 \le i \le s+1$) in this diagram, the system is
operating with exactly $s - i + 1$ spare bit planes available, and no
failed bits in any word (no failed bit planes on-line). In state s+2,
the system has suffered a single bit plane failure but there are no
available spare bit planes to replace the failed on-line bit plane.
The system must use the single-error-correction circuitry to correct
one error in each memory word in this state. The FAIL state is the
system state when an uncorrectable error has occurred.
The development of transition and state probability equations for
this system will now be shown.

For the transition event to occur from state i to state i+1
($1 \le i \le s$) in the time period from $t$ to $t + \Delta t$, exactly four subevents
must occur. These subevents are:
$E_1$: The failure of exactly one on-line bit plane in the
time period from $t$ to $t + \Delta t$ given the non-failure
of all on-line bit planes prior to $t$,
$E_2$: The non-failure of the system error detector (group)
prior to time $t + \Delta t$,
$E_3$: The non-failure of the system reconfiguration switching
circuitry (group) prior to time $t + \Delta t$, and
$E_4$: The non-failure of at least one available spare bit plane
prior to time $t + \Delta t$.
These subevents belong to classes 5, 2, 2, and 2, respectively.
The subevent probabilities may be computed as:
\[ P_{E_1}(t,\Delta t) = \binom{k}{1}\,(r(\Delta t))^{k-1}\,(1 - r(\Delta t)) = k\,(r(\Delta t))^{k-1}\,(1 - r(\Delta t)) \]
\[ P_{E_2}(t,\Delta t) = r_d(t + \Delta t) \]
\[ P_{E_3}(t,\Delta t) = r_s(t + \Delta t) \]
\[ P_{E_4}(t,\Delta t) = 1 - (1 - r(t + \Delta t))^{s-i+1} \]
where all symbols are as defined in Tables 1 and 2.
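The four subevent probabilities for the transition from state i to state i+1 can be evaluated and combined as in the Python sketch below. It assumes that the subevents are independent, so that their probabilities multiply, and that all units follow an exponential reliability law. Both are illustrative assumptions, as are the function names and the failure rates.

```python
import math


def exp_reliability(failure_rate):
    """Assumed exponential reliability law r(t) = exp(-failure_rate * t)."""
    return lambda t: math.exp(-failure_rate * t)


def p_state_i_to_next(i, t, dt, k, s, r, r_d, r_s):
    """Probability of the transition from state i to state i+1.

    k is the number of on-line bit planes and s the number of spares
    initially available; r, r_d, r_s are reliability functions.
    """
    p_e1 = k * r(dt) ** (k - 1) * (1.0 - r(dt))        # exactly one plane fails
    p_e2 = r_d(t + dt)                                 # detector non-failed
    p_e3 = r_s(t + dt)                                 # switching non-failed
    p_e4 = 1.0 - (1.0 - r(t + dt)) ** (s - i + 1)      # some spare non-failed
    # Assumption: independent subevents, so the probabilities multiply.
    return p_e1 * p_e2 * p_e3 * p_e4


# Hypothetical failure rates per hour for a 22-plane memory with 2 spares.
r_plane = exp_reliability(1.0e-6)
r_detector = exp_reliability(1.0e-7)
r_switch = exp_reliability(1.0e-7)
p_12 = p_state_i_to_next(1, 1000.0, 1.0, 22, 2, r_plane, r_detector, r_switch)
```

Evaluating the function with i = s + 1 gives zero, since no available spare exists in that state and the exponent s - i + 1 is then zero.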
TABLE 2. Definition of Reliability Symbols
for Basic System

Symbol      Meaning
$r(t)$      Reliability of an on-line or available spare
            bit plane at time t.
$r_d(t)$    Reliability of the system error detector (group)
            at time t.
$r_c(t)$    Reliability of the system error corrector (group)
            at time t.
$r_s(t)$    Reliability of the system reconfiguration switching
            circuitry (group) at time t.

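$E_6$: The failure of all $s - i + 1$ available spare bit planes
prior to time $t + \Delta t$.
$E_7$: The failure of the system reconfiguration switching
circuitry (group) prior to time $t + \Delta t$.
The state transition probability may now be formed as
\[ P_{i,s+2}(t,\Delta t) = k\,(r(\Delta t))^{k-1}\,(1 - r(\Delta t))\; r_d'\, r_c'
   \bigl[ (1 - r')^{s-i+1} + \bigl(1 - (1 - r')^{s-i+1}\bigr)(1 - r_s') \bigr] \]
which reduces to
\[ P_{i,s+2}(t,\Delta t) = k\,(r(\Delta t))^{k-1}\,(1 - r(\Delta t))\; r_d'\, r_c'
   \bigl[ 1 - r_s' \bigl(1 - (1 - r')^{s-i+1}\bigr) \bigr] \]
for $1 \le i \le s+1$. (Primed symbols denote reliabilities evaluated at
$t + \Delta t$, e.g. $r' = r(t + \Delta t)$.)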
Define for each state i in the system state diagram a probability
\[ P_{i,i}(t,\Delta t) \]
which is the probability that, if the system is in state i at time $t$,
no transition out of state i will occur before time $t + \Delta t$. Then, for
the basic system,
\[ P_{i,i}(t,\Delta t) + P_{i,i+1}(t,\Delta t) + P_{i,s+2}(t,\Delta t) + P_{i,FAIL}(t,\Delta t) = 1 \]
for $1 \le i \le s$,
\[ P_{s+1,s+1}(t,\Delta t) + P_{s+1,s+2}(t,\Delta t) + P_{s+1,FAIL}(t,\Delta t) = 1, \]
\[ P_{s+2,s+2}(t,\Delta t) + P_{s+2,FAIL}(t,\Delta t) = 1, \quad \text{and} \]
\[ P_{FAIL,FAIL}(t,\Delta t) = 1. \]
The formulation of the equation for P_{i,i}(t,Δt), then, uniquely
specifies the equation for P_{i,FAIL}(t,Δt). Since, for this system, the
non-transition event involves fewer subevents than the transition
event to the failed state, these non-transition equations will be developed.
For states 1 through s+1, the only event occurrence which is
necessary for the non-transition event to occur in time t to t + Δt
is the non-failure of the k on-line bit planes in the same time interval
given that all were non-failed at time t. Then
\[ P_{i,i}(t,\Delta t) = (r(\Delta t))^{k} \quad \text{for } 1 \le i \le s+1. \]
The operation of the system error detector and corrector is required
for the system to be in state s+2 at time t. The non-transition event
for this state, then, contains the subevents
E8: The non-failure of the system error detector (group)
    in the time period from t to t + Δt, given its non-failure
    prior to t, and
E9: The non-failure of the system error corrector (group)
    in the time period from t to t + Δt, given its non-failure
    prior to t.
In addition, none of the k-1 on-line operating bit planes may fail from
t to t + Δt. Then
\[ P_{s+2,s+2}(t,\Delta t) = (r_d(\Delta t))\,(r_c(\Delta t))\,(r(\Delta t))^{k-1}. \]
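For any state i (1 ≤ i ≤ s), P_{i,FAIL}(t,Δt) may now be computed as
\[ P_{i,FAIL}(t,\Delta t) = 1 - P_{i,i}(t,\Delta t) - P_{i,s+2}(t,\Delta t) - P_{i,i+1}(t,\Delta t) \]
\[ = 1 - (r(\Delta t))^{k} - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\, r_d' r_c' \left[ 1 - r_s'\left(1 - (1-r')^{s-i+1}\right) \right] - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\, r_d' r_s' \left(1 - (1-r')^{s-i+1}\right) \]
\[ = 1 - (r(\Delta t))^{k} - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t)) \left[ r_d' r_c' - r_d' r_c' r_s' \left(1 - (1-r')^{s-i+1}\right) + r_d' r_s' \left(1 - (1-r')^{s-i+1}\right) \right] \]
\[ = 1 - (r(\Delta t))^{k} - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\, r_d' \left[ r_c' + r_s' (1 - r_c') \left(1 - (1-r')^{s-i+1}\right) \right] \]
for 1 ≤ i ≤ s.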
The state probability equation for state 1 may be obtained by
use of equation (3-1), the general probability equation. The resultant
equation is
\[ P_1(t + \Delta t) = \left(1 - P_{1,2}(t,\Delta t) - P_{1,s+2}(t,\Delta t) - P_{1,FAIL}(t,\Delta t)\right) P_1(t) = P_{1,1}(t,\Delta t)\, P_1(t) = (r(\Delta t))^{k} P_1(t). \tag{3-3} \]
The state probability equation for states 2 through s may be obtained as
\[ P_i(t + \Delta t) = P_{i-1,i}(t,\Delta t)\, P_{i-1}(t) + \left(1 - P_{i,i+1}(t,\Delta t) - P_{i,s+2}(t,\Delta t) - P_{i,FAIL}(t,\Delta t)\right) P_i(t) \]
\[ = P_{i-1,i}(t,\Delta t)\, P_{i-1}(t) + P_{i,i}(t,\Delta t)\, P_i(t) \]
\[ = k \left[ (r(\Delta t))^{k-1} (1 - r(\Delta t))\, r_d' r_s' \left(1 - (1-r')^{s-i+2}\right) \right] P_{i-1}(t) + (r(\Delta t))^{k} P_i(t) \]
for 2 ≤ i ≤ s.
For state s+1, the state probability equation is
\[ P_{s+1}(t + \Delta t) = P_{s,s+1}(t,\Delta t)\, P_s(t) + P_{s+1,s+1}(t,\Delta t)\, P_{s+1}(t) \]
\[ = k \left[ (r(\Delta t))^{k-1} (1 - r(\Delta t))\, r_d' r_s' \left(1 - (1-r')\right) \right] P_s(t) + (r(\Delta t))^{k} P_{s+1}(t). \]
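The state probability equation for the failed state is as follows, where the summation index is written as i to avoid confusion with k, the number of on-line bit planes:
\[ P_{FAIL}(t + \Delta t) = \sum_{i=1}^{s+2} P_{i,FAIL}(t,\Delta t)\, P_i(t) + P_{FAIL}(t) \]
\[ = \sum_{i=1}^{s} \left[ 1 - (r(\Delta t))^{k} - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\, r_d' \left[ r_c' + (1 - r_c')\, r_s' \left(1 - (1-r')^{s-i+1}\right) \right] \right] P_i(t) \]
\[ \quad + \left[ 1 - (r(\Delta t))^{k} - k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\, r_d' r_c' \right] P_{s+1}(t) + \left[ 1 - r_d(\Delta t)\, r_c(\Delta t)\, (r(\Delta t))^{k-1} \right] P_{s+2}(t) + P_{FAIL}(t) \]
\[ = \sum_{i=1}^{s} P_i(t) + P_{s+1}(t) + P_{s+2}(t) + P_{FAIL}(t) \]
\[ \quad - \sum_{i=1}^{s+1} \left[ (r(\Delta t))^{k} + k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\, r_d' \left[ r_c' + (1 - r_c')\, r_s' \left(1 - (1-r')^{s-i+1}\right) \right] \right] P_i(t) - \left[ r_d(\Delta t)\, r_c(\Delta t)\, (r(\Delta t))^{k-1} \right] P_{s+2}(t) \]
\[ = 1 - \sum_{i=1}^{s+1} \left[ (r(\Delta t))^{k} + k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\, r_d' \left[ r_c' + (1 - r_c')\, r_s' \left(1 - (1-r')^{s-i+1}\right) \right] \right] P_i(t) - \left[ r_d(\Delta t)\, r_c(\Delta t)\, (r(\Delta t))^{k-1} \right] P_{s+2}(t). \tag{3-6} \]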
The system reliability may now be computed as the summation of
probabilities of being in any state other than the failed state. Then:
\[ R(t) = \sum_{i=1}^{s+2} P_i(t) = 1 - P_{FAIL}(t) \tag{3-7} \]
where the P_i's are obtained from equations (3-3) through (3-6).
It is only necessary, then, to compute the probability of the occurrence
of state FAIL at any time t to determine the system reliability at that
time.
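The recursion defined by the transition probabilities of the basic system can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program listed in [27]: it assumes exponential failure laws, takes the primed quantities (r', rd', rc', rs') to be reliabilities evaluated at the absolute time t + Δt, and starts the system in state 1 at t = 0. The function name, argument names, and the failure rates used in the usage note are hypothetical.

```python
import math


def basic_system_reliability(k, s, lam_bp, lam_d, lam_c, lam_s, dt, n_steps):
    """Step the basic-system state probabilities forward n_steps intervals.

    States 1..s+2 are stored at list indices 0..s+1; FAIL is index s+2.
    Assumes exponential reliabilities r(t) = exp(-lambda * t) (assumption).
    Returns (R, p) where R = 1 - P_FAIL and p is the state probability list.
    """
    fail = s + 2
    p = [0.0] * (s + 3)
    p[0] = 1.0                                   # start in state 1
    r_dt = math.exp(-lam_bp * dt)                # r(dt) for an on-line plane
    rd_dt = math.exp(-lam_d * dt)                # rd(dt)
    rc_dt = math.exp(-lam_c * dt)                # rc(dt)
    one_fail = k * r_dt ** (k - 1) * (1.0 - r_dt)
    stay = r_dt ** k                             # P_{i,i} for states 1..s+1
    stay_s2 = rd_dt * rc_dt * r_dt ** (k - 1)    # P_{s+2,s+2}

    for step in range(n_steps):
        t1 = (step + 1) * dt                     # absolute time t + dt
        r1 = math.exp(-lam_bp * t1)              # r'
        rd1 = math.exp(-lam_d * t1)              # rd'
        rc1 = math.exp(-lam_c * t1)              # rc'
        rs1 = math.exp(-lam_s * t1)              # rs'
        new = [0.0] * (s + 3)
        for i in range(1, s + 2):                # states 1..s+1
            spare_ok = 1.0 - (1.0 - r1) ** (s - i + 1)
            to_next = one_fail * rd1 * rs1 * spare_ok          # P_{i,i+1}
            to_s2 = one_fail * rd1 * rc1 * (1.0 - rs1 * spare_ok)  # P_{i,s+2}
            mass = p[i - 1]
            new[i - 1] += stay * mass
            new[i] += to_next * mass             # zero when i = s+1
            new[s + 1] += to_s2 * mass
            new[fail] += (1.0 - stay - to_next - to_s2) * mass
        new[s + 1] += stay_s2 * p[s + 1]
        new[fail] += (1.0 - stay_s2) * p[s + 1] + p[fail]
        p = new
    return 1.0 - p[fail], p
```

A call such as `basic_system_reliability(22, 4, 2.6e-6, 0.9e-6, 0.03e-6, 0.6e-6, 24.0, 1000)` steps the system through 1000 intervals of 24 hours; the rates in this call are placeholders chosen for illustration rather than values taken from the report.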
Coverage Equations
System coverage (C) is defined to be the probability that the system
will recover given that a failure has occurred [21]. This probability
is useful in reliability calculations and provides an indication of the
effectiveness of a fault-tolerant system. Hence, a derivation of
coverage equations for the basic system will now be shown.
If the system's states are examined, it is evident that a failure
in the time period from t to t + Δt may be grouped into one of three
classes, dependent upon the failure's effect on the system state at
time t + Δt. These classes are
1. The failure causes no change in system state,
2. The failure causes a transition to another system state
   which is not the failed state, and
3. The failure causes a transition to the failed state.
The occurrence of class 1 and 2 failures contributes to system coverage
while the occurrence of class 3 failures does not. Denoting the
probability of the occurrence of class L-type failures, given that a
failure has occurred in the time period from t to t + Δt, by P(L), then
\[ C(t) = P(1) + P(2). \]
But
\[ P(1) + P(2) + P(3) = 1, \]
so
\[ C(t) = 1 - P(3) = 1 - P(\text{class 3 failure} / \text{a failure has occurred in } t \text{ to } t + \Delta t). \]
In general, however, the subevents which constitute the class 3
failure event are dependent on the current system configuration (or state).
To overcome this difficulty, the state coverage C_i(t) (system coverage
given that the system state at time t is i) is introduced, where
\[ C_i(t) = 1 - P(\text{state } i\text{, class 3 failure} / \text{a failure has occurred in } t \text{ to } t + \Delta t \text{ where the system is in state } i \text{ at time } t), \]
and a state i, class 3 failure is a component failure which causes a
transition from state i to the failed state.
Now, by Bayes' Theorem,
\[ P(A_j / B) = \frac{P(B / A_j)\, P(A_j)}{P(B / A_1)\, P(A_1) + \cdots + P(B / A_n)\, P(A_n)} \]
where
\[ P(A_j \text{ and } A_r) = 0 \quad \text{for } 1 \le j, r \le n,\ j \ne r, \]
and
\[ P(A_1 \text{ or } A_2 \text{ or } \ldots \text{ or } A_n) = 1. \]
The following events are considered:
A1: No failure has occurred in t to t + Δt;
A2: Occurrence of a state i, class 1 failure in t to t + Δt;
A3: Occurrence of a state i, class 2 failure in t to t + Δt;
A4: Occurrence of a state i, class 3 failure in t to t + Δt;
B:  Occurrence of a failure in t to t + Δt where the system
    state at time t is i.
Then
\[ C_i(t) = 1 - P(A_4 / B) = 1 - \frac{P(B / A_4)\, P(A_4)}{P(B / A_1)\, P(A_1) + P(B / A_2)\, P(A_2) + P(B / A_3)\, P(A_3) + P(B / A_4)\, P(A_4)}. \]
But P(B/A2) = P(B/A3) = P(B/A4) = 1, and P(B/A1) = 0, so
\[ C_i(t) = 1 - \frac{P(A_4)}{P(A_2) + P(A_3) + P(A_4)} = 1 - \frac{P(\text{occurrence of a state } i\text{, class 3 failure in } t \text{ to } t + \Delta t)}{P(\text{occurrence of a failure in } t \text{ to } t + \Delta t / \text{state at } t \text{ is } i)}. \]
Since each occurrence of a state i, class 3 failure results in a
transition from state i to the failed state and no other conditions
cause this transition, it follows that
\[ P(\text{occurrence of a state } i\text{, class 3 failure in } t \text{ to } t + \Delta t) = P(\text{transition from state } i \text{ to the failed state in } t \text{ to } t + \Delta t) = P_{i,FAIL}(t,\Delta t). \]
To compute the probability of a failure in the time period from
t to t + Δt, a hypothetical series system S, which contains all system
components for state i, may be constructed.
If the reliability, R_s(t), of this system is computed, then the
failure density function of the system may be obtained as
\[ f_s(t) = -\frac{d R_s(t)}{dt}. \]
The probability of system S failure in the time period from t to t + Δt
is
\[ P(\text{failure of } S \text{ in } t \text{ to } t + \Delta t) = \int_{t}^{t+\Delta t} f_s(\tau)\, d\tau = R_s(t) - R_s(t + \Delta t) \]
as was shown in a previous section.
The reliability of a series system is the product of all system
component reliabilities, so
\[ P(\text{failure of } S \text{ in } t \text{ to } t + \Delta t) = \prod_{j=1}^{n} r_j(t) - \prod_{j=1}^{n} r_j(t + \Delta t), \]
where n is the number of components in S and r_j(t) is the reliability
of the jth system component at time t.
Since the failure event for a series system occurs when any system
component or combination of components fails and since S contains all
components of interest for state i of the original system, then
\[ P(\text{failure of } S \text{ in } t \text{ to } t + \Delta t) = P(\text{occurrence of a failure in } t \text{ to } t + \Delta t / \text{state at } t \text{ is } i) = \prod_{j=1}^{n_i} r_j(t) - \prod_{j=1}^{n_i} r_j(t + \Delta t), \]
where n_i is the number of components in state i of the original system.
As was shown previously,
\[ r_j(t) = 1 \quad \text{and} \quad r_j(t + \Delta t) = r_j(\Delta t) \]
for a system component j which is required for operation in state i at
time t. If the number of these components is m_i, then
\[ P(\text{occurrence of a failure in } t \text{ to } t + \Delta t / \text{state at } t \text{ is } i) = \prod_{j=1}^{n_i - m_i} r_j(t) - \prod_{q=1}^{m_i} r_q(\Delta t) \prod_{j=1}^{n_i - m_i} r_j(t + \Delta t). \]
For state i (1 ≤ i ≤ s+1) of the basic system, this probability is
\[ r_d\, r_c\, r_s\, r^{\,s-i+1} - (r(\Delta t))^{k}\, r_d'\, r_c'\, r_s'\, (r')^{s-i+1}, \]
where all symbols have been previously defined.
Then
\[ C_i(t) = 1 - \frac{P_{i,FAIL}(t,\Delta t)}{r_d\, r_c\, r_s\, r^{\,s-i+1} - (r(\Delta t))^{k}\, r_d'\, r_c'\, r_s'\, (r')^{s-i+1}} \tag{3-8} \]
for 1 ≤ i ≤ s+1.
For state s+2, C_{s+2}(t) may be obtained as
\[ C_{s+2}(t) = 1 - \frac{P_{s+2,FAIL}(t,\Delta t)}{r_s - (r(\Delta t))^{k-1}\, r_d(\Delta t)\, r_c(\Delta t)\, r_s'}. \tag{3-9} \]
Recalling that
\[ C_i(t) = P(\text{system will recover} / \text{a failure occurs in } t \text{ to } t + \Delta t \text{ where the system is in state } i \text{ at time } t), \]
then
\[ P[(\text{system will recover} / \text{a failure occurs in } t \text{ to } t + \Delta t) \text{ and the system is non-failed at time } t] = \sum_{i=1}^{s+2} C_i(t)\, P_i(t). \]
Since, for a non-repairable system, it is meaningless to compute
coverage for the system after it has failed, the total system coverage
may be considered to be
\[ P[(\text{system will recover} / \text{a failure occurs in } t \text{ to } t + \Delta t) / \text{the system is non-failed at time } t]. \]
This is of the form
\[ P[A / B] \]
whereas the previously derived equation is of the form
\[ P[A \text{ and } B]. \]
Since
\[ P[A / B] = \frac{P[A \text{ and } B]}{P[B]}, \]
then
\[ C(t) = \text{Total system coverage} = \frac{\sum_{i=1}^{s+2} C_i(t)\, P_i(t)}{\sum_{i=1}^{s+2} P_i(t)} = \frac{\sum_{i=1}^{s+2} C_i(t)\, P_i(t)}{R(t)}, \tag{3-10} \]
where the C_i's are obtained from equations (3-8) and (3-9), the P_i's
from equations (3-3) through (3-6) and R from equation (3-7).
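As a small illustration of the total coverage expression, the following Python sketch forms the probability-weighted average of the state coverages over the non-failed states. The function and argument names are hypothetical, and the two lists are assumed to hold C_i(t) and P_i(t) for states 1 through s+2 only, so that their probabilities sum to R(t).

```python
def total_coverage(state_coverages, state_probs):
    """Weighted average of state coverages over the non-failed states.

    state_coverages[i] and state_probs[i] refer to the same non-failed
    state; the FAIL state must not be included in either list.
    """
    if len(state_coverages) != len(state_probs):
        raise ValueError("one coverage value is needed per state probability")
    reliability = sum(state_probs)            # R(t) = sum of non-failed P_i(t)
    if reliability <= 0.0:
        raise ValueError("coverage is undefined once the system has failed")
    weighted = sum(c * p for c, p in zip(state_coverages, state_probs))
    return weighted / reliability
```

For example, two non-failed states with coverages 1.0 and 0.5 and probabilities 0.6 and 0.2 give (0.6 + 0.1) / 0.8 = 0.875.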
Computer Evaluation
Three approaches to computer evaluation of equations of the type
presented will be described in this section. These methods are:
1) Manual substitution of transition probability equations
   into the general state probability equation and evaluation
   of the state probability equations each Δt,
2) Evaluation of the transition probability equations and
   substitution of the results into the general state
   probability equation each Δt, and
3) Evaluation of a product of a T-type matrix and a T
   matrix which is updated each Δt.
Methods 1 and 2 are straightforward. Method 3 will now be discussed.
It was shown in a preceding section that
\[ P(t + \Delta t) = T^{T} \times P(t) \tag{3-2} \]
where P(t + Δt) and P(t) are state probability vectors and T contains
P_{i,j}(t,Δt) in its i,j location.
Then
\[ P(t + 2\Delta t) = T_1^{T} \times P(t + \Delta t) \]
where T_1 is T evaluated at time t + Δt. By substitution,
\[ P(t + 2\Delta t) = T_1^{T} \times [T^{T} \times P(t)] = [T \times T_1]^{T} \times P(t). \]
In general,
\[ P(t + n\Delta t) = [T_{n-1}^{T} \times T_{n-2}^{T} \times \cdots \times T_1^{T} \times T^{T}]\, P(t) = [T \times T_1 \times \cdots \times T_{n-2} \times T_{n-1}]^{T}\, P(t) = (T_n^{*})^{T}\, P(t), \]
where
\[ T_n^{*} = [T \times T_1 \times \cdots \times T_{n-2} \times T_{n-1}]. \tag{3-11} \]
Thus, to evaluate P(t + nΔt) when P(t) is known, the following algorithm
may be used.
1. Evaluate T at time t, set T_1* = T, i = 1.
2. Evaluate T at time t + iΔt to obtain T_i.
3. T_{i+1}* = T_i* × T_i.
4. If i < n then i = i + 1, go to 2. Otherwise, P(t + nΔt) =
   (T_n*)^T × P(t), stop.
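The matrix-product evaluation (method 3) can be sketched as follows. This is a minimal illustration, not the listing from [27]: the callable `transition_at`, which returns the T matrix for an interval starting at a given time, is a hypothetical interface, and plain nested lists are used so that no external library is assumed. The loop forms exactly the product T × T_1 × ... × T_{n-1} of equation (3-11) and then applies its transpose to P(t).

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def transpose_apply(matrix, vector):
    """Return (matrix transposed) times vector."""
    n = len(vector)
    return [sum(matrix[i][j] * vector[i] for i in range(n)) for j in range(n)]


def propagate(transition_at, p0, t, dt, n):
    """Evaluate P(t + n*dt) from P(t) = p0 using the accumulated product.

    transition_at(time) is assumed to return the T matrix whose (i, j)
    entry is the probability of moving from state i to state j during
    the interval that starts at the given time.
    """
    if n < 1:
        return list(p0)
    t_star = transition_at(t)                     # T_1* = T
    for i in range(1, n):
        t_star = mat_mul(t_star, transition_at(t + i * dt))
    return transpose_apply(t_star, p0)
```

With a constant two-state matrix [[0.9, 0.1], [0.0, 1.0]] and p0 = [1, 0], two steps give approximately [0.81, 0.19], which matches applying equation (3-2) twice.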
For a system with a small number of states and state transitions,
method 1 is manageable. For systems with a large number of states,
however, either method 2 or 3 is more expedient. Example flowcharts for
methods 2 and 3 are presented in Appendix A. Program listings may be
found in [27].
The selection of a suitable Δt for use in the computer evaluation
of these equations is a difficult task. This problem will now be
discussed.
The time period Δt was originally defined to be a time period in
which no more than one state transition is likely to occur. Since
the probability of more than one state transition occurring may be
represented as a product of state transition probabilities, the
monitoring of these products during execution will give an indication
of the appropriateness of the selected Δt.
By specifying a maximum allowable probability, pmax, for the
occurrence of two state transitions in a Δt, and reducing Δt when
this probability is exceeded, the computational error may be reduced.
The following algorithm will implement this self-monitoring control
for a method 3-type evaluation.
1.  Evaluate T at time t, set T_1* = T, i = 1.
1a. Specify initial Δt, pmax.
2.  Evaluate T at time t + iΔt to obtain T_i.
2a. For each non-diagonal entry T_i(j,k) compute T_i(j,k)·T_i(k,m)
    for each m.
2b. If any of these products is greater than pmax, reduce Δt
    and go to 2.
3.  T_{i+1}* = T_i* × T_i.
4.  If i < n then i = i + 1, go to 2. Otherwise P(t + nΔt) =
    (T_n*)^T × P(t), stop.
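Steps 2a and 2b of the self-monitoring algorithm can be sketched in Python as shown below. Several choices here are assumptions rather than details taken from the report: the product is formed only for m different from k (so that remaining in state k is not counted as a second transition), Δt is reduced by halving, and `transition_at(t, dt)` is a hypothetical callable that returns the T matrix for the interval from t to t + dt.

```python
def max_two_transition_product(T):
    """Largest T(j,k) * T(k,m) with j != k and m != k (assumed reading of 2a)."""
    n = len(T)
    worst = 0.0
    for j in range(n):
        for k in range(n):
            if j == k:
                continue                  # only non-diagonal first transitions
            for m in range(n):
                if m == k:
                    continue              # staying in k is not a transition
                worst = max(worst, T[j][k] * T[k][m])
    return worst


def choose_dt(transition_at, t, dt, p_max, shrink=0.5, min_dt=1e-12):
    """Reduce dt until no two-transition product exceeds p_max."""
    while dt > min_dt:
        if max_two_transition_product(transition_at(t, dt)) <= p_max:
            return dt
        dt *= shrink                      # halving is an illustrative choice
    raise ValueError("no acceptable dt found above min_dt")
```

The halving factor trades run time against accuracy; a gentler reduction would keep Δt closer to the largest value that satisfies pmax.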
In general, the value selected for pmax is dependent on the
subsystem failure rates and the computational accuracy of the
computing system used. For the computations of this paper, satisfactory
results were obtained by the use of pmax in the range from .0001 to
.000001.
The magnitude of the computational error accumulated at time t may
be approximated by determining the magnitude of the difference of the
sum of all state probabilities and 1. In equational form,
\[ |e(t)| = \left| 1 - \sum_{i=1}^{N} P_i(t) \right| \]
where N is the number of system states.
The percent error in system reliability may be approximated by
\[ e(t)\% = \frac{|e(t)|}{R(t)} \times 100\%. \]
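The accumulated-error check described above can be written as a pair of small Python helpers. The names are hypothetical, and the state probability list is assumed to include the FAIL state at a known index so that R(t) can be formed as the sum of the remaining entries.

```python
def accumulated_error(state_probs):
    """Magnitude of the deviation of the state probability sum from 1."""
    return abs(1.0 - sum(state_probs))


def percent_reliability_error(state_probs, fail_index):
    """Approximate percent error in R(t) given all state probabilities."""
    reliability = sum(p for i, p in enumerate(state_probs) if i != fail_index)
    if reliability <= 0.0:
        raise ValueError("reliability must be positive to form a percent error")
    return 100.0 * accumulated_error(state_probs) / reliability
```

For state probabilities [0.7, 0.2, 0.0999] with FAIL at index 2, the accumulated error is about 0.0001 and the percent error in reliability is about 0.011%.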
IV. RELIABILITY EQUATIONS FOR ALTERNATE SYSTEMS
This chapter will show equational developments for the reliability
of the non-spared, TMR, duplicated and double-error-correcting systems.
A method will also be shown which allows the computation of the
probability of various memory word fault patterns and the effects of
these patterns on system reliability.
Non-Spared System
The non-spared system is capable of operation in only 3 states.
These states correspond to states 1, s+2, and FAIL in the basic system.
By substitution of 0 for s in the equations for the basic system, the
state probability equations for states 1, 2, and FAIL of the non-spared
system are obtained as follows:
\[ P_{NS_1}(t + \Delta t) = (r(\Delta t))^{k}\, P_{NS_1}(t) \tag{4-1} \]
\[ P_{NS_2}(t + \Delta t) = k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\, r_d' r_c'\, P_{NS_1}(t) + (r_d(\Delta t))(r_c(\Delta t))(r(\Delta t))^{k-1}\, P_{NS_2}(t) \tag{4-2} \]
\[ P_{NS_{FAIL}}(t + \Delta t) = 1 - \left[ (r(\Delta t))^{k} + k\,(r(\Delta t))^{k-1}(1 - r(\Delta t))\, r_d' r_c' \right] P_{NS_1}(t) - \left[ r_d(\Delta t)\, r_c(\Delta t)\, (r(\Delta t))^{k-1} \right] P_{NS_2}(t) \tag{4-3} \]
\[ R_{NS}(t) = P_{NS_1}(t) + P_{NS_2}(t) = 1 - P_{NS_{FAIL}}(t) \tag{4-4} \]
where the P_{NS_i}'s are obtained from equations (4-1) through (4-3).
TMR System
The reliability of the TMR system may be approximated from the
reliability of the non-spared system by application of the classical
TMR equation. From [24], this equation is
\[ R_{TMR}(t) = \left[ 3(R(t))^{2} - 2(R(t))^{3} \right] r_{VT}(t), \]
where R(t) is the unreplicated unit reliability, and r_VT(t) is the
reliability of the voting and codeword testing circuitry.
Then
\[ R_{TMR}(t) = \left[ 3(R_{NS}(t))^{2} - 2(R_{NS}(t))^{3} \right] r_{VT}(t), \tag{4-5} \]
where R_NS(t) is obtained by use of equation (4-4).
Duplicated System
The reliability of the duplicated system may be computed by
determining the probability of the various operational modes of the
system. These modes are:
1. Both non-spared units operate correctly,
2. the unit currently on-line fails, and the sense switching
   circuitry switches the system output to the other unit
   which is non-failed, and
3. the unit currently off-line fails.
The reliability of this system, then, is
\[ R_D(t) = [R_{NS}(t)]^{2} + [1 - R_{NS}(t)]\, R_{NS}(t)\, r_{ss}(t) + R_{NS}(t)\, [1 - R_{NS}(t)] \]
\[ = R_{NS}(t) + [1 - R_{NS}(t)]\, R_{NS}(t)\, r_{ss}(t) \tag{4-6} \]
where r_ss(t) is the reliability of the sense-switching circuitry and
R_NS is obtained by use of equation (4-4).
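Equations (4-5) and (4-6) are simple enough to check numerically. The sketch below, with hypothetical function names, evaluates both from a given non-spared reliability; the inputs are assumed to be probabilities in the range 0 to 1 that have already been computed elsewhere, for instance from equation (4-4).

```python
def tmr_reliability(r_unit, r_vt):
    """Classical TMR equation: [3R^2 - 2R^3] * r_VT."""
    return (3.0 * r_unit ** 2 - 2.0 * r_unit ** 3) * r_vt


def duplicated_reliability(r_ns, r_ss):
    """Duplicated system: R_NS + (1 - R_NS) * R_NS * r_ss."""
    return r_ns + (1.0 - r_ns) * r_ns * r_ss
```

With a unit reliability of 0.9 and perfect voting and switching circuitry, these give 0.972 for the TMR system and 0.99 for the duplicated system. For unit reliabilities below 0.5 the TMR expression falls below the reliability of a single unit, which is consistent with the rapid decrease noted for long missions.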
Double-Error-Correcting (DEC) System
Carter and McCarthy [20] have described a fault-tolerant memory
system of the double-error-correcting type which utilizes a software
implementable double-error-correction algorithm. The algorithm is based
on a concept of memory word error modeling which will now be described.
The non-operational modes of a memory word bit cell are assumed
to be:
1) Stuck-at-one (s-a-1), and
2) Stuck-at-zero (s-a-0).
The occurrence of either of these modes is termed a fault.
The class of all faults may be partitioned into two subclasses
by the effect of each fault on the correct memory word bit. If the
fault is of the s-a-x type and the correct memory word bit for that
location is x, then no effect on the memory bit occurs. Faults of this
subclass are termed failures. If the fault is s-a-x and the correct
bit is $\bar{x}$, then the fault causes an incorrect response on a memory
read operation. Faults of this type are called errors.
The weight of a binary word is defined to be the number of
binary digits (bits) in the word which are logic 1. By analysis of
the words of a particular code, the sum of all codeword weights, W, may
be obtained. An average codeword weight, $\bar{w}$, is computed by
\[ \bar{w} = \frac{W}{V}, \]
where V is the total number of codewords. If $\bar{w}$ is divided by N, the
length in bits of each codeword, an approximation to the statistical
probability of any given bit of a word being a logic 1 is obtained.
In equational form,
\[ P(\text{word bit} = 1) = P_{w1} = \frac{\bar{w}}{N}, \text{ and} \]
\[ P(\text{word bit} = 0) = P_{w0} = 1 - P_{w1} = 1 - \frac{\bar{w}}{N}. \]
A statistical analysis of faults for a memory system should isolate
the following probabilities for the bit locations of a data word:
\[ P(\text{bit location s-a-1} / \text{location faulted}) = P_{s1}, \text{ and} \]
\[ P(\text{bit location s-a-0} / \text{location faulted}) = P_{s0}. \]
It is now possible to obtain the probability of a failure when it
is known that a single word fault has occurred. This probability is
\[ P(\text{failure} / 1 \text{ fault}) = P[(\text{bit location s-a-1} / \text{location faulted}) \text{ and word bit} = 1] + P[(\text{bit location s-a-0} / \text{location faulted}) \text{ and word bit} = 0] = P_{s1} P_{w1} + P_{s0} P_{w0}. \]
In a similar manner,
\[ P(\text{error} / 1 \text{ fault}) = P_{s1} P_{w0} + P_{s0} P_{w1}. \]
Since
\[ P(\text{failure} / 1 \text{ fault}) + P(\text{error} / 1 \text{ fault}) = P_{s1} P_{w1} + P_{s0} P_{w0} + P_{s0} P_{w1} + P_{s1} P_{w0} = (P_{s1} + P_{s0})(P_{w1} + P_{w0}) = 1, \]
the binomial probability distribution may be used to compute the probability
of any combination of errors and failures in a word given that a certain
number of faults has occurred.
Then
\[ P(n \text{ failures and } m \text{ errors} / n + m \text{ faults}) = \binom{n+m}{n} (P_{s1} P_{w1} + P_{s0} P_{w0})^{n} (P_{s1} P_{w0} + P_{s0} P_{w1})^{m}. \]
If the binomial distribution is also used to compute the probability
of n+m faults, then
\[ P(n \text{ failures and } m \text{ errors},\ n + m \text{ faults in } b \text{ bits}) = \binom{n+m}{n} (P_{s1} P_{w1} + P_{s0} P_{w0})^{n} (P_{s1} P_{w0} + P_{s0} P_{w1})^{m} \cdot \binom{b}{n+m}\, r^{\,b-(n+m)} (1 - r)^{n+m} \]
where r is the reliability of a memory word bit location.
Since $\binom{b}{n+m}$ is the number of (n+m)-fault words which may occur and
$\binom{n+m}{n}$ is the number of ways that exactly n failures may be ordered among
n+m faults, then the number of distinct (n+m)-fault words with n failures
is $\binom{n+m}{n}\binom{b}{n+m}$. The total number of distinct (with regard to number and
order of failures) (n+m)-fault words is then
\[ \sum_{i=0}^{n+m} \binom{n+m}{i} \binom{b}{n+m}. \]
These numbers may now be used to obtain the percentage of f-fault
words which contain a given number of failures. For example, the
percentage of f-fault words of b bits which contain f failures is
\[ \frac{\binom{f}{f}\binom{b}{f}}{\sum_{i=0}^{f} \left[ \binom{f}{i}\binom{b}{f} \right]} \times 100\% = \frac{1}{\sum_{i=0}^{f} \binom{f}{i}} \times 100\%, \]
a useful figure, since an f-fault word with f failures is error-free.
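The binomial expressions for failures and errors can be evaluated with a short Python sketch. The function names are hypothetical, and one assumption is made beyond the text: since stuck-at-one and stuck-at-zero are the only fault modes, P_s0 is taken as 1 - P_s1 and P_w0 as 1 - P_w1.

```python
from math import comb


def p_failure_given_fault(p_s1, p_w1):
    """P(failure / 1 fault) = Ps1*Pw1 + Ps0*Pw0, with Ps0 = 1-Ps1, Pw0 = 1-Pw1."""
    return p_s1 * p_w1 + (1.0 - p_s1) * (1.0 - p_w1)


def p_failures_and_errors(n, m, p_s1, p_w1):
    """P(n failures and m errors / n + m faults) from the binomial distribution."""
    p_fail = p_failure_given_fault(p_s1, p_w1)
    p_err = 1.0 - p_fail                  # P(error / 1 fault)
    return comb(n + m, n) * p_fail ** n * p_err ** m


def p_pattern_in_word(n, m, b, r, p_s1, p_w1):
    """P(n failures and m errors, n + m faults in b bits), r = bit reliability."""
    f = n + m
    p_faults = comb(b, f) * r ** (b - f) * (1.0 - r) ** f
    return p_failures_and_errors(n, m, p_s1, p_w1) * p_faults
```

With P_s1 = P_w1 = 0.5, a 3-fault word contains three failures (and so is error-free) with probability 0.125, and the probabilities over all splits of a fixed number of faults sum to 1.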
The application of these concepts to the double-error-correcting
system will be shown following a discussion of correctable error types
for the system.
A fault pattern vector for a memory word is defined as
FPV = (he jf, ge nf)
where h and g are the numbers of errors in the memory word data and
check bits, respectively, and j and n are the numbers of failures.
The double-error-correction algorithm discussed will always
produce a valid correction when presented with memory words with FPV's
of certain forms. These forms, from [20], are as follows.
(2e 0f, 0e 0f); (1e 0f, 1e 0f); (0e 0f, 2e 0f);
(2e 1f, 0e 0f); (2e 0f, 0e 1f); (0e 0f, 2e 1f);
(2e 1f, 0e 1f); (2e 0f, 0e 2f); (0e 0f, 2e 2f);
(0e 0f, 4e 0f).
For memory words with FPV's of the following forms, correction may or
may not be attempted and results may be invalid [20]:
(1e 0f, 1e 1f); (0e 1f, 2e 0f);
(1e 0f, 1e 2f); (0e 1f, 2e 1f).
No error correction is attempted in the following cases [20]:
(1e 1f, 1e 0f);
(2e 2f, 0e 0f); (1e 2f, 1e 0f); (1e 1f, 1e 1f);
(4e 0f, 0e 0f); (3e 0f, 1e 0f); (2e 0f, 2e 0f).
It should be noted that the preceding FPV's listed all contain an
even number of errors and will produce error syndrome vectors of even
weight. The computation of a syndrome of this type by the memory
translator causes the invocation of this algorithm.
A second algorithm has been designed to attempt data reconstruction
when an odd-weight error syndrome is computed. Since many triple-error
patterns produce a single-error syndrome and a high percentage of these
syndromes imply an error in a valid bit, a critical function of this
algorithm is to distinguish between single and triple word errors.
This algorithm is capable of reconstructing all memory words with
FPV's containing exactly one error and two or fewer failures. In
addition, all memory words with FPV's containing one error and three
failures are corrected with the exception of the FPV
(0e 3f, 1e 0f)
for which no reconstruction is attempted [20].
Valid results [20] are also produced for
(0e 0f, 3e 0f) and (0e 0f, 3e 1f).
Correction results are variable [20] for memory words with the
following FPV's:
(2e 0f, 1e 0f); (1e 0f, 2e 0f);
(2e 0f, 1e 1f); (1e 0f, 2e 1f).
No correction is attempted [20] for the case listed above and the
cases
(3e 0f, 0e 0f); (3e 0f, 0e 1f).
The listings above show that any combination of two or fewer
faults in a memory word will be algorithmically corrected. For words
with three faults, the percentage of words which are corrected may be
computed as follows.
The number of ways in which three faults may appear in a word with
k bits is
(The number of ways 3 failures can appear) +
(The number of ways 2 failures and 1 error can appear) +
(The number of ways 1 failure and 2 errors can appear) +
(The number of ways 3 errors can appear)
\[ = \binom{k}{3} + 3\binom{k}{3} + 3\binom{k}{3} + \binom{k}{3} = 8\binom{k}{3}. \]
The first term of this sum represents all 3-fault words with no errors.
No correction is required for these words. In addition, the triple
error algorithm will correct all $3\binom{k}{3}$ three-fault words with only one
error.
A 3-fault word containing 2 errors will not be corrected if the FPV
is of the form
(1e 1f, 1e 0f).
If the number of data bits in the word is D and the number of check bits
is C, then the number of 3-fault patterns of this form is
\[ \binom{D}{2}\binom{2}{1}\binom{C}{1} = 2C\binom{D}{2}. \]
The number of 3-fault words with 2 errors for which correction is
uncertain is
\[ \binom{D}{1}\binom{2}{1}\binom{C}{2} + \binom{D}{1}\binom{C}{2} = 2D\binom{C}{2} + D\binom{C}{2} = 3D\binom{C}{2}. \]
A 3-fault word with three errors will be corrected if the FPV is
of the form
(0e 0f, 3e 0f).
The number of patterns of this form is
\[ \binom{C}{3}. \]
The number of 3-fault words with three errors for which correction is
uncertain is
\[ \left[ \binom{D}{2}\binom{C}{1} + \binom{D}{1}\binom{C}{2} \right] = \left[ C\binom{D}{2} + D\binom{C}{2} \right]. \]
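The total number, T3, of 3-fault words which are correctable is then
bounded as follows:
\[ \left[ \binom{k}{3} + 3\binom{k}{3} + 3\binom{k}{3} - 2C\binom{D}{2} + \binom{C}{3} \right] \le T_3 \le \left[ \binom{k}{3} + 3\binom{k}{3} + 3\binom{k}{3} - 2C\binom{D}{2} + \binom{C}{3} + 3D\binom{C}{2} + C\binom{D}{2} + D\binom{C}{2} \right] \]
\[ \left[ 7\binom{k}{3} - 2C\binom{D}{2} + \binom{C}{3} \right] \le T_3 \le \left[ 7\binom{k}{3} - C\binom{D}{2} + \binom{C}{3} + 4D\binom{C}{2} \right]. \]
Since there are $8\binom{k}{3}$ possible ways that 3 faults can occur, the
percentage, u3, of 3-fault words that can be corrected is
\[ u_3 = \frac{T_3}{8\binom{k}{3}} \times 100\%. \]
For the (22, 16) code of the basic system, u3 may be computed as:
75.96% ≤ u3 ≤ 89.45%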
A breakdown of double-error-correcting system correction percentages
by the number of memory word faults is shown in Table 3. In this
table, u_{m,n} denotes the fraction of m-fault words which contain
n errors and are system correctable. u_m denotes the total
fraction of m-fault FPV's which are correctable.

TABLE 3. Percentage of Memory Word FPV's Correctable for
the Double-Error-Correcting System (22, 16) Code

u_{F,e} = (% correctable/100%) x (% of F-fault words with e errors/100%)

F            e
# FAULTS     # ERRORS     u_{F,e}
0            0            u0,0 = 1
0            0            u0 = 1
1            0            u1,0 = .5
1            1            u1,1 = .5
1            0,1          u1 = 1
2            0            u2,0 = .25
2            1            u2,1 = .5
2            2            u2,2 = .25
2            0,1,2        u2 = 1
3            0            u3,0 = .125
3            1            u3,1 = .375
3            2            .258 ≤ u3,2 ≤ .317
3            3            .0016 ≤ u3,3 ≤ .078
3            0,1,2,3      .7596 ≤ u3 ≤ .8945
4            0            u4,0 = .0625
4            1            u4,1 = .25
4            2            .102 ≤ u4,2 ≤ .119
4            3            .0005 ≤ u4,3 ≤ .0395
4            4            u4,4 = .0001
4            0,1,2,3,4    .4151 ≤ u4 ≤ .4711

The switching strategy assumed for the double-error-correcting
system is as follows:
1) If a memory word is detected to have a single error, the
   single error correction procedure is performed.
2) If the word has two errors, one of the faulty on-line
   bit planes is switched out and replaced with a spare.
   Error correction is attempted by use of the double-error
   correction procedure.
3) If the word has either three or four errors, a correction
   is attempted. If the correction is successful, faulty
   on-line bit planes are replaced with spare bit planes until
   either all available spares are exhausted or only one
   faulty bit plane remains on-line.
A T matrix may be constructed as shown in Figure 6 with the system
configuration in each state as shown in Table 4.
Appendix B shows the derivation of the state transition probability
equations for this system. If the notational simplifications 	
fA `
(u k-x+yak-x^y)(y)(r(ot)(x-y)(1-r(At))y
,
	 = D(x,y,
2	 (s—x+2—k)	 k _	 _
k^0
(s-x+
 
k ) (l wr )	 (r) - E(x,y), and r(At) - r
are made and the reliability of the algorithmic correction procedure
is denoted by rA, then the transition equations appear as follows:
\[ P_{1,2}(t,\Delta t) = D(k,1)\, r_d' r_c' r_A'. \]
\[ P_{1,3}(t,\Delta t) = D(k,2)\, r_d' r_c' r_A' r_s' (1 - E(2,0)). \]
\[ P_{1,4}(t,\Delta t) = D(k,3)\, r_d' r_c' r_A' r_s' (1 - E(2,1)). \]
\[ P_{1,5}(t,\Delta t) = D(k,4)\, r_d' r_c' r_A' r_s' (1 - E(2,2)). \]
\[ P_{1,s+3}(t,\Delta t) = D(k,2)\, r_d' r_c' r_A' (1 - r_s' + r_s' E(2,0)) + D(k,3)\, r_d' r_c' r_A' r_s' (E(2,1) - E(2,0)) + D(k,4)\, r_d' r_c' r_A' r_s' (E(2,2) - E(2,1)). \]
\[ P_{1,s+4}(t,\Delta t) = D(k,3)\, r_d' r_c' r_A' (1 - r_s' + r_s' E(2,0)) + D(k,4)\, r_d' r_c' r_A' r_s' (E(2,1) - E(2,0)). \]
\[ P_{1,s+5}(t,\Delta t) = D(k,4)\, r_d' r_c' r_A' (1 - r_s' + r_s' E(2,0)). \]
FIGURE 6. T-Matrix for Double-Error-Correcting System

TABLE 4. State Configurations for Double-Error-Correcting System

State            Configuration
1                K good bit planes on-line, S available spares
2 ≤ i ≤ s + 2    K-1 good bit planes on-line, s - i + 2 available
                 spares
s + 3            K-2 good bit planes on-line
s + 4            K-3 good bit planes on-line
s + 5            K-4 good bit planes on-line
FAIL             An uncorrectable word error exists

P1,1 (t,At)	 = D(k,a).
'
Pi,.+1(t,At) = D(k-1,1) rd*rc*rA*r$ ' (1-E(i3O)}
for 2	 i <s+1.
Pi,i+2(t,At) = D(k-1 9 2) rd*rc*rA*rs ' {'-E(i,1)}
t
.. for 2 < i < s.
1	
y
i
_.. Pi,i+3(t,At) = D(k-1 ,3) rd*ro*rA*r 5 ' (1-E(i52))
for 2<i <s-1.
i-J
Pi,s+3(t,At) = D(k-1,1) rd*ro*rA*(i-rs ' + rs ' E(i3O)11
•- + D(k-1,2) rd*ro*rA*rs - (`E(1.,1)-E(i,D))
IF
.}
+ D(k-1,3) rd*rC*rA*rs ' (E(i,2)-E(i,l))
for 2<i <s.
.x Pi,s+4(t,At) = D(k-1,2) rd*rC*rA* (1--rs ' + rs ' E(i 3 O))	 i
+ D(k-1,3) rd*ro*rA*r$ ' (E(i,l)-E(i3O))	 i
for 2<i <s+1.
Pi,5+5 ( t ,At) = D( k-1,3) rd*ro*rA* (1-rs ' + rs ' E(1,0))	 i
! for 2 < i < s+1.
Pi (t,At) =
,
D(k-1,0) rd*rc*rA*
s
for 2 < i < s+2.
Fs+1,^+3(L,At) = R(k T,1) rd*ro*rA* (1 -r5 ' + rs ' E(s+1,0)).
+ D(k-1,2) rd 	 *r *r " ((s+1,1}-E(s+l,d}}d	 c	 A	 s
r^
r-,
fk 1t
60
Ps+2 ,	 j(t;At) = D(k- 1,j -2) rd*ro*rA*
for 3 < j < S.
Ps+3,s
+j(t,At) = D(k- 2,j -3) rd*rc*rA*
for 4<j<5.
Ps+4,s+5 (t,At) = D(k-3,1)rd*re*rA*•
Ps+j,s+j(t,At) - D(k--j+l.,Q). r *rc*rA* Ar
for 3<j<5.
Pi,FAIL(t,d) = 1-D(k,Q)-rd 'rG 'rA ' «^ i	 D(k,j). --	 -	 _.
f	 3 s
Pi,FAIL(t,ot) - 1--rd*re*rA*	 D(k-1 ,j) G:j=D
for 2 < i < s+2.
5-q-
Ps+q,FAIL(t At) = 1-rd*ro*rA*	 E	 D(k-q+l,j).j=Q -
`r	
!
for 3<q<5.
	
E ;
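The state probability equations for this system are also derived
in Appendix B. The resultant equations are given below. In the sums, the
index of summation is written as i or j to avoid confusion with k, the
number of on-line bit planes.
\[ P_1(t + \Delta t) = D(k,0)\, P_1(t). \tag{4-7} \]
\[ P_2(t + \Delta t) = D(k,1)\, r_d' r_c' r_A'\, P_1(t) + D(k-1,0)\, r_d^{*} r_c^{*} r_A^{*}\, P_2(t). \tag{4-8} \]
\[ P_3(t + \Delta t) = D(k,2)\, r_d' r_c' r_A' r_s' (1 - E(2,0))\, P_1(t) + r_d^{*} r_c^{*} r_A^{*} \left[ D(k-1,1)\, r_s' (1 - E(2,0))\, P_2(t) + D(k-1,0)\, P_3(t) \right]. \tag{4-9} \]
\[ P_4(t + \Delta t) = D(k,3)\, r_d' r_c' r_A' r_s' (1 - E(2,1))\, P_1(t) + r_d^{*} r_c^{*} r_A^{*} \left[ D(k-1,2)\, r_s' (1 - E(2,1))\, P_2(t) + D(k-1,1)\, r_s' (1 - E(3,0))\, P_3(t) + D(k-1,0)\, P_4(t) \right]. \tag{4-10} \]
\[ P_5(t + \Delta t) = D(k,4)\, r_d' r_c' r_A' r_s' (1 - E(2,2))\, P_1(t) + r_d^{*} r_c^{*} r_A^{*} \left[ D(k-1,3)\, r_s' (1 - E(2,2))\, P_2(t) + D(k-1,2)\, r_s' (1 - E(3,1))\, P_3(t) + D(k-1,1)\, r_s' (1 - E(4,0))\, P_4(t) + D(k-1,0)\, P_5(t) \right]. \tag{4-11} \]
\[ P_i(t + \Delta t) = r_d^{*} r_c^{*} r_A^{*} \left[ D(k-1,0)\, P_i(t) + r_s' \sum_{j=1}^{3} D(k-1,j)\, (1 - E(i-j,\, j-1))\, P_{i-j}(t) \right] \quad \text{for } 6 \le i \le s+2. \tag{4-12} \]
\[ P_{s+3}(t + \Delta t) = r_d' r_c' r_A' \left[ D(k,2)\, (1 - r_s' + r_s' E(2,0)) + \sum_{j=3}^{4} D(k,j)\, r_s' (E(2,j-2) - E(2,j-3)) \right] P_1(t) \]
\[ \quad + r_d^{*} r_c^{*} r_A^{*} \Big\{ \sum_{i=2}^{s} \left[ D(k-1,1)\, (1 - r_s' + r_s' E(i,0)) + D(k-1,2)\, r_s' (E(i,1) - E(i,0)) + D(k-1,3)\, r_s' (E(i,2) - E(i,1)) \right] P_i(t) \]
\[ \quad + \left[ D(k-1,1)\, (1 - r_s' + r_s' E(s+1,0)) + D(k-1,2)\, r_s' (E(s+1,1) - E(s+1,0)) \right] P_{s+1}(t) + D(k-1,1)\, P_{s+2}(t) + D(k-2,0)\, P_{s+3}(t) \Big\}. \tag{4-13} \]
\[ P_{s+4}(t + \Delta t) = r_d' r_c' r_A' \left[ D(k,3)\, (1 - r_s' + r_s' E(2,0)) + D(k,4)\, r_s' (E(2,1) - E(2,0)) \right] P_1(t) \]
\[ \quad + r_d^{*} r_c^{*} r_A^{*} \Big\{ \sum_{j=2}^{s+1} \left[ D(k-1,2)\, (1 - r_s' + r_s' E(j,0)) + D(k-1,3)\, r_s' (E(j,1) - E(j,0)) \right] P_j(t) + D(k-1,2)\, P_{s+2}(t) + D(k-2,1)\, P_{s+3}(t) + D(k-3,0)\, P_{s+4}(t) \Big\}. \tag{4-14} \]
\[ P_{s+5}(t + \Delta t) = D(k,4)\, r_d' r_c' r_A' (1 - r_s' + r_s' E(2,0))\, P_1(t) \]
\[ \quad + r_d^{*} r_c^{*} r_A^{*} \Big\{ \sum_{j=2}^{s+1} D(k-1,3)\, (1 - r_s' + r_s' E(j,0))\, P_j(t) + D(k-1,3)\, P_{s+2}(t) + D(k-2,2)\, P_{s+3}(t) + D(k-3,1)\, P_{s+4}(t) + D(k-4,0)\, P_{s+5}(t) \Big\}. \tag{4-15} \]
The final term of equation (4-15) is the non-transition term for state s+5, which follows from the expression for P_{s+j,s+j}(t,Δt) with j = 5 and from the j = 5 term of equation (4-16).
\[ P_{FAIL}(t + \Delta t) = 1 - \left[ D(k,0) + r_d' r_c' r_A' \sum_{j=1}^{4} D(k,j) \right] P_1(t) - r_d^{*} r_c^{*} r_A^{*} \left[ \sum_{i=2}^{s+2} \sum_{m=0}^{3} D(k-1,m)\, P_i(t) + \sum_{n=3}^{5} \sum_{q=0}^{5-n} D(k-n+1,q)\, P_{s+n}(t) \right]. \tag{4-16} \]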
It should be noted that these equations are developed for a
double-error-correcting system with s+2 greater than 5 (equivalently,
more than 3 spare bit planes). If s+2 = i where 2 ≤ i ≤ 5, then the
equations involving state j where i < j ≤ 5 should be modified to delete
this state. This modification will involve only the deletion of the
appropriate equations.

V. ANALYSIS RESULTS
In this chapter, typical results of analyses performed on the five
systems previously described will be discussed. Comparative reliabilities
of each system are shown and the effect of varying several system
parameters is described.
The base variable values assumed [22] for the system analyses are
shown in Table 5. For each analysis performed, the system variables
are fixed at the base value unless otherwise noted.
A comparative reliability analysis of the five subject systems was
performed by use of equations (3-7), (4-4), (4-5), (4-6), and (4-17).
The results of this analysis are displayed in Figure 7. This figure
shows the reliability of the TMR, non-spared, duplicated, basic, and
double-error-correcting systems for mission lengths of four years or
less. Also shown is the reliability of a simplex system with no error
detecting or correcting capabilities. This system consists of 16 on-line
bit planes and has reliability $(e^{-\lambda_{BP} t})^{16}$, where $\lambda_{BP}$ is obtained from
Table 5. It may be seen from this figure that, for missions of 1/2 year
or less, all of the systems except the non-spared and simplex systems
have reliability greater than .99. For greater mission lengths, however,
the reliability of the non-spared, duplicated, and TMR systems decreases
rapidly. For a 3-year mission, probably only the basic or double-error-
correcting systems would be acceptable.
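The simplex reference curve is the only one in this comparison that has a closed form, so it is easy to evaluate directly. The Python sketch below uses the bit plane failure rate as it is printed in Table 5 (2.6384 per 10^6 hours); the conversion of 8760 hours per year and the function name are assumptions made for illustration.

```python
import math

HOURS_PER_YEAR = 8760.0   # assumed conversion; the report does not state one


def simplex_reliability(years, lam_bp_per_hour=2.6384e-6, planes=16):
    """Reliability of `planes` on-line bit planes in series with no correction."""
    hours = years * HOURS_PER_YEAR
    return math.exp(-lam_bp_per_hour * hours) ** planes
```

Under these assumptions the simplex reliability at a half-year mission is about 0.83, well below the .99 level quoted for the fault-tolerant configurations.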
TABLE 5. Base Values for System Variables

Variable                                        Base Value
# On-line Bit Planes                            22
# Spare Bit Planes                              4
Bit Plane Failure Rate                          2.6384/10^6 hr
Detector Failure Rate                           900/10^6 hr
Reconfiguration Switch Failure Rate             .583/10^6 hr
Corrector Failure Rate                          .027/10^6 hr
DEC Algorithm Failure Rate                      0
Mission Length                                  3 years
Memory Size                                     16K words
Failure Distribution                            Exponential
4K-Bit Subplane Failure Rate                    .5596/10^6 hr
Peripheral Bit Plane Circuitry Failure Rate     .3/10^6 hr

FIGURE 7. Reliability of Subject Systems (reliability vs. time in years)
Comparison of the curves for the double-error-correcting and basic
systems shows the reliability improvement to be expected from the use of
the software algorithms of the double-error-correcting system. For
1/2-year missions, this improvement is negligible. For missions of
greater lengths, however, the reliability improvement gained by the use
of this system becomes important.
It is interesting to note that, while the duplicated and TMR
systems represent a doubling and tripling of memory bit planes over the
non-spared system, the basic and double-error-correcting systems result
in much higher system reliabilities with an addition of only 4 bit planes
to the non-spared system.
Figure 8 shows the results of a reliability analysis performed on
the basic system for various numbers of spare bit planes. The
corresponding curves for the double-error-correcting system are shown
in Figure 9. Comparison of these two figures shows that the same degree
of reliability achieved by the basic system with 4 spare bit planes may
be reached by a double-error-correcting system with 3 spares and a
sufficiently reliable double-error-correction algorithm. The need for
one spare bit plane may thus be alleviated by the use of software
error correction.
The reliability of the software error correction algorithms used in
the double-error-correcting system is highly important to system success.
The effect on double-error-correcting system reliability of
varying a hypothetical failure rate for the CPU hardware which implements
these algorithms is shown in Figure 10.
FIGURE 8. Reliability of Basic System for Various Numbers of Spares (reliability vs. time in years)
FIGURE 9. Reliability of Double-Error-Correcting System for Various Numbers of Spares (reliability vs. time in years)
FIGURE 10. Reliability of Double-Error-Correcting System vs. Algorithm Failure Rate (reliability vs. algorithm failure rate, f/hr)
Also essential to overall system reliability is the failure rate of
the detector. The effects of varying this failure rate are shown
in Figure 11.
The reliability of double-error-correcting systems with various
memory capacities is shown in Figure 12. The major effect of memory size
on the reliability of a system of this type is in the bit plane
failure rate. Failure rates of memory size-related components such as
address decoder circuits are also affected; however, only the bit
plane failure rates are considered in this figure. The failure
rates used were obtained by assuming that each bit plane is composed
of 4K-bit sub-planes and peripheral circuitry, each with a failure
rate as shown in Table 5.
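A rough illustration of how a bit plane failure rate might scale with memory size is sketched below. It assumes, beyond what the text states, that a bit plane for an N-word memory contains N/4K sub-planes, that one set of peripheral circuitry serves the plane, and that these failure rates simply add (a series model). Under that assumption a 16K-word plane comes out near 2.54 per 10^6 hours, close to but not identical with the bit plane rate listed in Table 5, so the report's exact composition rule may differ.

```python
SUBPLANE_RATE = 0.5596e-6     # 4K-bit sub-plane failure rate per hour (Table 5)
PERIPHERAL_RATE = 0.3e-6      # peripheral circuitry failure rate per hour (Table 5)
SUBPLANE_BITS = 4096          # bits per sub-plane


def bit_plane_failure_rate(memory_words):
    """Additive (series) estimate: one bit per word, so N words need N/4K sub-planes."""
    n_subplanes = memory_words // SUBPLANE_BITS
    return n_subplanes * SUBPLANE_RATE + PERIPHERAL_RATE


if __name__ == "__main__":
    for k_words in (4, 8, 16, 32, 64):
        rate = bit_plane_failure_rate(k_words * 1024)
        print(f"{k_words:>2}K words: {rate * 1e6:.4f} failures per 10^6 hr")
```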
The results of the memory capacity analysis show that for
missions of 1 year or less, double-error-correcting type memories
containing up to 64K words will achieve high reliability. Greater mission
lengths show a reliability decrease for the larger capacity memories, with
a dramatic decrease for memories larger than 32K words and a three-year
mission length.
The coverage of the basic system for various numbers of spare bit
planes is shown in Figure 13. Coverage may be defined as the probability
that the system will continue to function given that a failure occurs.
As such, the coverage of a system is useful in analyzing the system's
behavior after component failures of a nature not predictable by system
failure rates.
FIGURE 11. Reliability of Double-Error-Correcting System vs. Detector Failure Rate
FIGURE 12. Reliability of Double-Error-Correcting System vs. Memory Size (curves for 1-, 2-, and 3-year missions; memory sizes of 4K, 8K, 16K, 32K, and 64K words)
FIGURE 13. Coverage for Ba[sic System] (coverage vs. time in years)
It may be seen from this figure that a basic system with no
spares is highly vulnerable to system component failures. As the number
of spare bit planes increases, however, this vulnerability decreases
rapidly until, in the system with 4 spare bit planes, there is a
probability of .96 or greater of successful operation after a failure
for missions of 3 years or less.
Overall results of the analyses performed show that a high degree
of system reliability may be obtained by a judicious combination of
coding, modular sparing, and software error correction. Substantial
reliability improvement over massive replication techniques is achieved
with relatively low cost. While some sensitivity is shown to the
reliability of system control components, fault-tolerant techniques
applied to these components should assure high system reliability.
VI. CONCLUSION
A technique for the development of reliability and coverage
equations for a class of non-repairable fault-tolerant memory systems
has been presented. The methods discussed have been applied to several
systems and typical results have been shown.
The basic and double-error-correcting fault-tolerant memory systems
have been shown to achieve high reliability at minimal cost. These
systems make efficient use of the spare bit planes provided and the
error-correction capabilities of the code. By use of software correction,
the double-error-correcting system adds an additional level of error
control and may reduce the need for one of the spare bit planes.
A major advantage of the calculation methods presented here over
more traditional reliability calculation methods is the allowance of a
finite Δt for state transition occurrence. The use of this finite time
increment allows multiple system events to occur during any state
transition. The need for separate states to represent these events is
then diminished. The result is a state diagram with a reduced number of
states and with probability equations that are easily computer-implemented.
A disadvantage of this method is the lack of a closed-form solution,
which is easily obtainable by use of other methods. Because of the
dependency of the state probabilities at time t + Δt on the conditions
at time t, small errors in computation at one time may cause large
errors at succeeding times. A closed-form solution should eliminate
this problem.
Further work in this area could include the following:
1. Development of a closed-form solution from the equations
of this method,
2. Research into the effect of unpowered spares on system
modeling, and
3. Application of these methods to the repairable system
problem.
REFERENCES
[1] Goldberg, J., Levitt, K. N., and Wensley, J. H., "An Organization
for a Highly Survivable Memory," IEEE Trans. on Computers, Vol. C-23,
July 1974, pp. 693-705.
[2] Downing, R. W., Nowak, J. S., and Tuomenoksa, L. S., "No. 1 ESS
Maintenance Plan," Bell System Tech. J., Vol. 43, September 1964,
pp. 1961-2019.
[3] Dickinson, et al., "Saturn V Launch Vehicle Digital Computer and
Data Adapter," 1964 Fall Joint Comput. Conf., AFIPS Conf. Proc.,
1964, Vol. 26, pp. 501-516.
[4] McCarthy, C. E., Carter, W. C., and White, J. B., "A Memory System
Which Can Tolerate Multiple Storage Array Faults," Proceedings of
1975 Southeastern Symposium on System Theory, Auburn, Alabama,
pp. 172-178, March 20 - 27, 1975.
[5] Patel, A. M. and Hsiao, M. Y., "An Adaptive Error Correction Scheme
for Computer Memory System," 1972 Fall Joint Computer Conf., AFIPS
Conf. Proc., Vol. 41, 1972, pp. 83-85.
[6] Szygenda, S. A. and Flynn, M. J., "Coding Techniques for Failure
Recovery in a Distributive Modular Memory Organization," 1971
Spring Joint Computer Conf., AFIPS Conf. Proc., Vol. 38, 1971,
pp. 459-466.
[7] Szygenda, S. A. and Flynn, M. J., "Failure Analysis of Memory
Organizations for Utilization in a Self Repair Memory System," IEEE
Trans. on Reliability, Vol. R-20, No. 2, May 1971, pp. 64-70.
[8] Szygenda, S. A. and Flynn, M. J., "Self-Diagnosis and Self-Repair
in Memory: An Integrated System Approach," IEEE Trans. on
Reliability, Vol. R-22, No. 1, April 1973, pp. 2-12.
[9] Abramson, N. M., "A Class of Systematic Codes for Non-Independent
Errors," IRE Transactions on Information Theory, IT-5, No. 4,
December 1959, pp. 150-157.
[10] Elspas, B. and Short, R. A., "A Note on Optimum Burst Error-Correcting
Codes," IRE Trans. on Info. Theory, IT-8, No. 1, January 1962,
pp. 39-42.
[11] Srinivasan, C. V., "Codes for Error Correction in High-Speed
Memory Systems: Part II: Correction of Temporary and Catastrophic
Errors," IEEE Trans. on Computers, Vol. C-20, No. 12, Dec. 1971,
pp. 1514-1520.
[12] Bossen, D. C., "b-Adjacent Error Correction," IBM J. Res. Develop.,
Vol. 14, July 1970, pp. 402-408.
[13] Graham, M., "Error Correction in Batch-Fabricated Memories,"
IEEE Trans. on Comps. (Corresp.), Vol. C-18, No. 6, June 1969,
pp. 566-567.
[14] Rao, T. R. N., "Use of Error Correcting Codes on Memory Words for
Improved Reliability," IEEE Trans. on Reliability, Vol. R-17, No. 2,
June 1968, pp. 91-96.
[15] Bouricius, W. G., et al., "Modeling of a Bubble Memory Organization
with Self-Checking Translators to Achieve High Reliability," IEEE
Trans. on Comps., Vol. C-22, No. 3, March 1973, pp. 269-275.
[16] Bricker, J. L., "A Unified Method for Analyzing Mission Reliability
for Fault Tolerant Computer Systems," IEEE Trans. on Rel., Vol. R-22,
No. 2, June 1973, pp. 72-77.
[17] Jagannathan, T., "General Expressions for Reliability of Redundant
Systems," IEEE Trans. on Rel., Vol. R-21, No. 2, May 1972, p. 119.
[18] Benning, C. J., "Reliability Prediction Formulas for Stand By
Redundant Structures," (Corresp.) IEEE Trans. on Rel., Vol. R-16,
No. 3, December 1967, pp. 136-137.
[19] Hsiao, M. Y., "A Class of Optimal Minimum Odd-Weight-Column SEC-DED
Codes," IBM J. Res. Dev., Vol. 14, July 1970, pp. 395-401.
[20] Carter, W. C. and McCarthy, C. E., "Implementation of an Experimental
Fault Tolerant Memory System," IBM Research RC5514 (#23976), July
1975.
[21] Arnold, T. F., "The Concept of Coverage and Its Effect on the
Reliability Model of a Repairable System," IEEE Trans. Comps., Vol. C-22,
No. 3, March 1973, pp. 251-254.
[22] "Solar Electric Propulsion Stage (SEPS) Command Computer Subsystem
Utilizing Space Ultrareliable Modular Computer (SUMC), Vol. 1,
Technical," IBM no. 73W-00260, August 1973, pp. 4, 17-4, 19.
[23] Bazovsky, I., Reliability Theory and Practice, Prentice-Hall, Inc.,
Englewood Cliffs, New Jersey, 1961.
[24] Bouricius, W., et al., "Reliability Modeling for Fault-Tolerant
Computers," IEEE Trans. Comps., Vol. C-20, No. 11, November 1971,
pp. 1306-131 .
[25] Papoulis, A., Probability, Random Variables, and Stochastic
Processes, McGraw-Hill, New York, New York, 1965, p. 37.
[26] Bazovsky, I., Reliability Theory and Practice, Prentice-Hall, Inc.,
Englewood Cliffs, New Jersey, 1961, pp. 43-49.
[27] Cox, G. W., "Reliability and Coverage Analyses of Non-Repairable
Fault-Tolerant Memory Systems," Master's Thesis, Auburn University,
Auburn, AL, 1976.
APPENDIX A
Flowcharts for Computational Algorithms
Three methods for computer evaluation of the equations of this
paper were outlined in Chapter III. Flowcharts for evaluation by use of
Methods 2 and 3 are shown here.
Figure 14 shows a typical implementation of evaluation Method 2.
For this flowchart, tBASE is selected to be 0 and the system starting
state is state 1. TMAX is the mission length of interest.
After initialization, all transition probabilities are calculated
for the current time (T) and Δt. Where a two-state transition is
possible, the product of the two single-state transitions involved is
formed. If this product is greater than PMAX, the maximum allowable
two-state transition probability, Δt is reduced.
The amount of this reduction is arbitrary. If Δt and T have units
of hours, then a convenient method of reduction is to multiply Δt by
.9 and set the new Δt equal to the greatest integral number of hours
less than this number. When this method is used, however, a test must
be performed to assure that Δt is not 0, since this condition would
prevent any further processing.
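The reduction rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: Δt is held as a whole number of hours, the "greatest integral number of hours" is taken to mean truncation of 0.9 Δt, and the function name `reduce_dt` is hypothetical.

```python
def reduce_dt(dt_hours: int) -> int:
    """Return the truncated value of 0.9 * dt_hours, refusing to reach zero.

    Integer arithmetic (9 * dt // 10) is used instead of 0.9 * dt so that
    floating-point rounding cannot change the truncated result.
    """
    new_dt = (9 * dt_hours) // 10
    if new_dt == 0:
        # A zero time increment would stall the computation, so signal it.
        raise ValueError("time increment reduced to 0; cannot continue")
    return new_dt


if __name__ == "__main__":
    dt = 100
    while dt > 1:
        dt = reduce_dt(dt)
        print(dt)
```

Raising an error on a zero increment plays the role of the test for Δt = 0 mentioned above; a caller could instead stop and report the last valid result.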
If all the two-state transition probabilities are less than PMAX,
the state probabilities for time T + Δt are computed by substitution
of the transition probabilities and state probabilities for time T into
equation (3-1), the general state probability equation. If T is less
than TMAX, T is incremented by Δt and processing continues. Otherwise,
the system reliability is formed as a suitable sum of state probabilities,
results are output, and processing terminates.

FIGURE 14. Flowchart for Reliability Computations by Method 2. The
flowchart steps are:
1. Start; enter Δt and PMAX.
2. Set P_1(0) = 1, P_i(0) = 0 for i ≠ 1, and t = 0.
3. Compute the P_{i,j}(t, Δt)'s.
4. If the product of two single-state transition probabilities exceeds
PMAX, set Δt = Δt × .9; if Δt is then 0, stop; otherwise return to step 3.
5. Substitute the P_{i,j}(t, Δt)'s and P_i(t)'s into the general state
probability equation.
6. If t < TMAX, set t = t + Δt and return to step 3; otherwise output
results and stop.
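The Method 2 procedure can be sketched as a short program. The example below is not the report's implementation: it uses a hypothetical three-state chain (state 0, state 1, and an absorbing failed state) with an assumed constant failure rate, and it interprets a "two-state transition" as any path i to j to k through distinct states. The state update is the sum over source states of transition probability times current state probability.

```python
import math


def step_matrix(t, dt, lam=1e-4):
    """Hypothetical chain 0 -> 1 -> 2 (failed); lam is an assumed rate per hour."""
    r = math.exp(-lam * dt)
    return [[r, 1.0 - r, 0.0],
            [0.0, r, 1.0 - r],
            [0.0, 0.0, 1.0]]


def max_two_state_product(m):
    """Largest product m[i][j] * m[j][k] over distinct states i, j, k."""
    n = len(m)
    best = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) == 3:
                    best = max(best, m[i][j] * m[j][k])
    return best


def method2(dt, pmax, tmax, matrix_fn=step_matrix):
    """Propagate state probabilities from t = 0 until t >= tmax."""
    p = [1.0, 0.0, 0.0]          # system starts in the first state
    t = 0
    while t < tmax:
        m = matrix_fn(t, dt)
        while max_two_state_product(m) > pmax:
            dt = (9 * dt) // 10  # reduce the increment to truncated 0.9 * dt
            if dt == 0:
                raise ValueError("time increment reduced to 0")
            m = matrix_fn(t, dt)
        n = len(p)
        p = [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]
        t += dt
    return t, p, dt


if __name__ == "__main__":
    t_end, probs, dt_used = method2(dt=100, pmax=1e-3, tmax=10000)
    reliability = probs[0] + probs[1]   # sum over the non-failed states
    print(t_end, dt_used, round(reliability, 6))
```

Here the reliability is formed as the sum of the two non-failed state probabilities, which corresponds to the "suitable sum of state probabilities" in the text.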
Figure 15 shows an implementation of a Method 3 evaluation. This
flowchart follows the steps outlined in the second computational
algorithm of Chapter III.
It should be noted from equation (3-11) that if the base computation
time is 0 and the system starting state is state i, so that P_i(0) = 1,
then T^n contains the state probability for state j in its (i,j)
location. For this case, then, the multiplication by P(t) to obtain
P(t + nΔt) is unnecessary since the state probabilities may be
determined directly.
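The observation about reading state probabilities directly from a power of the transition matrix can be illustrated as follows. This is a sketch under assumptions: the 3-state matrix is hypothetical, the transition probabilities are taken as constant over every Δt, and repeated squaring is used only as a convenient way to form the n-th power; the report's own algorithm is the one given in Chapter III.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """m raised to the non-negative integer power n by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical one-step transition matrix; each row sums to 1.
T = [[0.99, 0.01, 0.00],
     [0.00, 0.99, 0.01],
     [0.00, 0.00, 1.00]]

if __name__ == "__main__":
    Tn = mat_pow(T, 50)
    # Starting in the first state with probability 1, the first row of T^n
    # already holds the state probabilities after n increments.
    print([round(x, 6) for x in Tn[0]])
```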
FIGURE 15. Flowchart for Reliability Computations by Method 3
(legible boxes: START; ENTER Δt, PMAX, TMAX; T = 0; COMPUTE T)
APPENDIX B
Development of Equations for the Double-Error-Correcting System
A listing of transition events and subevents causing the transitions
is shown in Table 6. In this table, the success of the detector,
corrector, correction algorithm, and switch prior to time t + Δt is
represented by D', C', A', and W', respectively. Success in the time
interval from t to t + Δt is denoted by a * superscript. The
non-success event is denoted by a subtraction of the appropriate symbol
from 1.
For the derivation of the transition and state probability equations
the following notation will be used.

D(x,y) = P(y correctable on-line bit plane errors out of x
possible on-line bit planes given all were good at time t)

\[ D(x,y) = \binom{x}{y}\,\bigl(r(\Delta t)\bigr)^{x-y}\,\bigl(1 - r(\Delta t)\bigr)^{y} \]

E(x,y) = P(y or fewer good spare bit planes out of s - x + 2
available)

\[ E(x,y) = \sum_{k=0}^{y} \binom{s-x+2}{k}\,(1 - r')^{\,s-x+2-k}\,(r')^{k} \]

\[ r_m^{*} = r_m(\Delta t) \]
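The two auxiliary functions can be written out as code for checking purposes. This is a minimal sketch: the argument names are illustrative, `r` is read as the reliability of one on-line bit plane over Δt, and `r_spare` is read as the r' appearing in E(x,y); both readings are assumptions drawn from the definitions above.

```python
from math import comb


def D(x, y, r):
    """P(exactly y of x good on-line bit planes fail during the increment)."""
    return comb(x, y) * r ** (x - y) * (1.0 - r) ** y


def E(x, y, s, r_spare):
    """P(y or fewer good spares among the s - x + 2 spares available)."""
    n = s - x + 2
    upper = min(y, n)   # no more than n spares can be good
    return sum(comb(n, k) * (1.0 - r_spare) ** (n - k) * r_spare ** k
               for k in range(upper + 1))


if __name__ == "__main__":
    s = 4
    print(D(22, 1, 0.999))       # one failure among 22 on-line planes
    print(E(2, 0, s, 0.98))      # no good spares out of 4
    print(E(s, 2, s, 0.98))      # 2 or fewer good out of 2 available
```

When y equals the number of spares available, the sum runs over the whole binomial distribution and E equals 1, which is the property used later in the appendix for E(s,2) and E(s+1,1).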
TABLE 6. Subevents Causing Transition
(Errors column: # on-line correctable bit errors / # possible bits)

Transition            Errors     Other Subevents
1,2                   1/K        D'C'A'
1,3                   2/K        D'C'A'W'(at least 1 good spare)
1,4                   3/K        D'C'A'W'(at least 2 good spares)
1,5                   4/K        D'C'A'W'(at least 3 good spares)
1,s+3                 2/K        D'C'A'((1-W') or W'(no good spares))
                   or 3/K        D'C'A'W'(exactly 1 good spare)
                   or 4/K        D'C'A'W'(exactly 2 good spares)
1,s+4                 3/K        D'C'A'((1-W') or W'(no good spares))
                   or 4/K        D'C'A'W'(exactly 1 good spare)
1,s+5                 4/K        D'C'A'((1-W') or W'(no good spares))
1,1                   0/K        -
i,i+1                 1/K-1      D*C*A*W'(at least 1 good spare)
  2 ≤ i ≤ s+1
i,i+2                 2/K-1      D*C*A*W'(at least 2 good spares)
  2 ≤ i ≤ s
i,i+3                 3/K-1      D*C*A*W'(at least 3 good spares)
  2 ≤ i ≤ s-1
i,s+3                 1/K-1      D*C*A*((1-W') or W'(no good spares))
  2 ≤ i ≤ s        or 2/K-1      D*C*A*W'(exactly 1 good spare)
                   or 3/K-1      D*C*A*W'(exactly 2 good spares)
i,s+4                 2/K-1      D*C*A*((1-W') or W'(no good spares))
  2 ≤ i ≤ s+1      or 3/K-1      D*C*A*W'(exactly 1 good spare)
i,s+5                 3/K-1      D*C*A*((1-W') or W'(no good spares))
  2 ≤ i ≤ s+1
i,i                   0/K-1      D*C*A*
  2 ≤ i ≤ s+2
s+1,s+3               1/K-1      D*C*A*((1-W') or W'(no good spares))
                   or 2/K-1      D*C*A*W'(exactly 1 good spare)
s+2,s+j               j-2/K-1    D*C*A*
  3 ≤ j ≤ 5
s+3,s+j               j-3/K-2    D*C*A*
  4 ≤ j ≤ 5
s+4,s+5               1/K-3      D*C*A*
s+j,s+j               0/K-j+1    D*C*A*
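The double-error-correcting system transition probability equations
may now be specified as:

\[
\begin{aligned}
P_{1,2}(t,\Delta t) &= D(k,1)\, r_d' r_c' r_A'. \\
P_{1,3}(t,\Delta t) &= D(k,2)\, r_d' r_c' r_A' r_s'\,(1 - E(2,0)). \\
P_{1,4}(t,\Delta t) &= D(k,3)\, r_d' r_c' r_A' r_s'\,(1 - E(2,1)). \\
P_{1,5}(t,\Delta t) &= D(k,4)\, r_d' r_c' r_A' r_s'\,(1 - E(2,2)). \\
P_{1,s+3}(t,\Delta t) &= D(k,2)\, r_d' r_c' r_A'\,(1 - r_s' + r_s' E(2,0)) \\
 &\quad + D(k,3)\, r_d' r_c' r_A' r_s'\,(E(2,1) - E(2,0)) \\
 &\quad + D(k,4)\, r_d' r_c' r_A' r_s'\,(E(2,2) - E(2,1)). \\
P_{1,s+4}(t,\Delta t) &= D(k,3)\, r_d' r_c' r_A'\,(1 - r_s' + r_s' E(2,0)) \\
 &\quad + D(k,4)\, r_d' r_c' r_A' r_s'\,(E(2,1) - E(2,0)). \\
P_{1,s+5}(t,\Delta t) &= D(k,4)\, r_d' r_c' r_A'\,(1 - r_s' + r_s' E(2,0)). \\
P_{1,1}(t,\Delta t) &= D(k,0).
\end{aligned}
\]

\[
\begin{aligned}
P_{i,i+1}(t,\Delta t) &= D(k-1,1)\, r_d^* r_c^* r_A^* r_s'\,(1 - E(i,0)) && \text{for } 2 \le i \le s+1. \\
P_{i,i+2}(t,\Delta t) &= D(k-1,2)\, r_d^* r_c^* r_A^* r_s'\,(1 - E(i,1)) && \text{for } 2 \le i \le s. \\
P_{i,i+3}(t,\Delta t) &= D(k-1,3)\, r_d^* r_c^* r_A^* r_s'\,(1 - E(i,2)) && \text{for } 2 \le i \le s-1. \\
P_{i,s+3}(t,\Delta t) &= D(k-1,1)\, r_d^* r_c^* r_A^*\,(1 - r_s' + r_s' E(i,0)) \\
 &\quad + D(k-1,2)\, r_d^* r_c^* r_A^* r_s'\,(E(i,1) - E(i,0)) \\
 &\quad + D(k-1,3)\, r_d^* r_c^* r_A^* r_s'\,(E(i,2) - E(i,1)) && \text{for } 2 \le i \le s. \\
P_{i,s+4}(t,\Delta t) &= D(k-1,2)\, r_d^* r_c^* r_A^*\,(1 - r_s' + r_s' E(i,0)) \\
 &\quad + D(k-1,3)\, r_d^* r_c^* r_A^* r_s'\,(E(i,1) - E(i,0)) && \text{for } 2 \le i \le s+1. \\
P_{i,s+5}(t,\Delta t) &= D(k-1,3)\, r_d^* r_c^* r_A^*\,(1 - r_s' + r_s' E(i,0)) && \text{for } 2 \le i \le s+1. \\
P_{i,i}(t,\Delta t) &= D(k-1,0)\, r_d^* r_c^* r_A^* && \text{for } 2 \le i \le s+2.
\end{aligned}
\]

\[
\begin{aligned}
P_{s+1,s+3}(t,\Delta t) &= D(k-1,1)\, r_d^* r_c^* r_A^*\,(1 - r_s' + r_s' E(s+1,0)) \\
 &\quad + D(k-1,2)\, r_d^* r_c^* r_A^* r_s'\,(E(s+1,1) - E(s+1,0)). \\
P_{s+2,s+j}(t,\Delta t) &= D(k-1,j-2)\, r_d^* r_c^* r_A^* && \text{for } 3 \le j \le 5. \\
P_{s+3,s+j}(t,\Delta t) &= D(k-2,j-3)\, r_d^* r_c^* r_A^* && \text{for } 4 \le j \le 5. \\
P_{s+4,s+5}(t,\Delta t) &= D(k-3,1)\, r_d^* r_c^* r_A^*. \\
P_{s+j,s+j}(t,\Delta t) &= D(k-j+1,0)\, r_d^* r_c^* r_A^* && \text{for } 3 \le j \le 5.
\end{aligned}
\]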
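Since

\[ P_{i,FAIL}(t,\Delta t) = 1 - \sum_{j=1}^{s+5} P_{i,j}(t,\Delta t), \]

the following equations may be developed:

\[
\begin{aligned}
P_{1,FAIL}(t,\Delta t) &= 1 - P_{1,2}(t,\Delta t) - P_{1,3}(t,\Delta t) - P_{1,4}(t,\Delta t) - P_{1,5}(t,\Delta t) \\
 &\quad - P_{1,s+3}(t,\Delta t) - P_{1,s+4}(t,\Delta t) - P_{1,s+5}(t,\Delta t) - P_{1,1}(t,\Delta t) \\
 &= 1 - D(k,0) - D(k,1)\, r_d' r_c' r_A' \\
 &\quad - D(k,2)\, r_d' r_c' r_A' \bigl[ r_s'(1 - E(2,0)) + 1 - r_s' + r_s' E(2,0) \bigr] \\
 &\quad - D(k,3)\, r_d' r_c' r_A' \bigl[ r_s'(1 - E(2,1)) + r_s'(E(2,1) - E(2,0)) + 1 - r_s' + r_s' E(2,0) \bigr] \\
 &\quad - D(k,4)\, r_d' r_c' r_A' \bigl[ r_s'(1 - E(2,2)) + r_s'(E(2,2) - E(2,1)) \\
 &\qquad\qquad + r_s'(E(2,1) - E(2,0)) + 1 - r_s' + r_s' E(2,0) \bigr] \\
 &= 1 - D(k,0) - D(k,1)\, r_d' r_c' r_A' - D(k,2)\, r_d' r_c' r_A' - D(k,3)\, r_d' r_c' r_A' - D(k,4)\, r_d' r_c' r_A' \\
 &= 1 - D(k,0) - r_d' r_c' r_A' \sum_{j=1}^{4} D(k,j).
\end{aligned}
\]

For 1 < i ≤ s-1,

\[
\begin{aligned}
P_{i,FAIL}(t,\Delta t) &= 1 - P_{i,i+1}(t,\Delta t) - P_{i,i+2}(t,\Delta t) - P_{i,i+3}(t,\Delta t) \\
 &\quad - P_{i,s+3}(t,\Delta t) - P_{i,s+4}(t,\Delta t) - P_{i,s+5}(t,\Delta t) - P_{i,i}(t,\Delta t) \\
 &= 1 - D(k-1,0)\, r_d^* r_c^* r_A^* \\
 &\quad - D(k-1,1)\, r_d^* r_c^* r_A^* \bigl[ r_s'(1 - E(i,0)) + 1 - r_s' + r_s' E(i,0) \bigr] \\
 &\quad - D(k-1,2)\, r_d^* r_c^* r_A^* \bigl[ r_s'(1 - E(i,1)) + r_s'(E(i,1) - E(i,0)) + 1 - r_s' + r_s' E(i,0) \bigr] \\
 &\quad - D(k-1,3)\, r_d^* r_c^* r_A^* \bigl[ r_s'(1 - E(i,2)) + r_s'(E(i,2) - E(i,1)) \\
 &\qquad\qquad + 1 - r_s' + r_s' E(i,0) + r_s'(E(i,1) - E(i,0)) \bigr] \\
 &= 1 - r_d^* r_c^* r_A^* \bigl[ D(k-1,0) + D(k-1,1) + D(k-1,2) + D(k-1,3) \bigr].
\end{aligned}
\]

\[
\begin{aligned}
P_{s,FAIL}(t,\Delta t) &= 1 - P_{s,s+1}(t,\Delta t) - P_{s,s+2}(t,\Delta t) - P_{s,s+3}(t,\Delta t) \\
 &\quad - P_{s,s+4}(t,\Delta t) - P_{s,s+5}(t,\Delta t) - P_{s,s}(t,\Delta t) \\
 &= 1 - r_d^* r_c^* r_A^* \Bigl\{ D(k-1,0) + D(k-1,1) \bigl[ r_s'(1 - E(s,0)) + 1 - r_s' + r_s' E(s,0) \bigr] \\
 &\quad + D(k-1,2) \bigl[ r_s'(1 - E(s,1)) + r_s'(E(s,1) - E(s,0)) + 1 - r_s' + r_s' E(s,0) \bigr] \\
 &\quad + D(k-1,3) \bigl[ r_s'(E(s,2) - E(s,1)) + r_s'(E(s,1) - E(s,0)) + 1 - r_s' + r_s' E(s,0) \bigr] \Bigr\} \\
 &= 1 - r_d^* r_c^* r_A^* \Bigl\{ D(k-1,0) + D(k-1,1) + D(k-1,2) + D(k-1,3) \bigl[ r_s' E(s,2) + 1 - r_s' \bigr] \Bigr\}.
\end{aligned}
\]

But

\[ E(s,2) = \sum_{q=0}^{2} \binom{2}{q} (1 - r')^{2-q} (r')^{q} = 1. \]

So

\[ P_{s,FAIL}(t,\Delta t) = 1 - r_d^* r_c^* r_A^* \bigl[ D(k-1,0) + D(k-1,1) + D(k-1,2) + D(k-1,3) \bigr]. \]

\[
\begin{aligned}
P_{s+1,FAIL}(t,\Delta t) &= 1 - P_{s+1,s+2}(t,\Delta t) - P_{s+1,s+3}(t,\Delta t) - P_{s+1,s+4}(t,\Delta t) \\
 &\quad - P_{s+1,s+5}(t,\Delta t) - P_{s+1,s+1}(t,\Delta t) \\
 &= 1 - r_d^* r_c^* r_A^* \Bigl\{ D(k-1,0) + D(k-1,1) \bigl[ r_s'(1 - E(s+1,0)) + 1 - r_s' + r_s' E(s+1,0) \bigr] \\
 &\quad + D(k-1,2) \bigl[ 1 - r_s' + r_s' E(s+1,0) + r_s'(E(s+1,1) - E(s+1,0)) \bigr] \\
 &\quad + D(k-1,3) \bigl[ r_s'(E(s+1,1) - E(s+1,0)) + 1 - r_s' + r_s' E(s+1,0) \bigr] \Bigr\} \\
 &= 1 - r_d^* r_c^* r_A^* \Bigl\{ D(k-1,0) + D(k-1,1) + D(k-1,2) \bigl[ 1 - r_s' + r_s' E(s+1,1) \bigr] \\
 &\quad + D(k-1,3) \bigl[ r_s' E(s+1,1) + 1 - r_s' \bigr] \Bigr\}.
\end{aligned}
\]

But

\[ E(s+1,1) = \sum_{q=0}^{1} \binom{1}{q} (1 - r')^{1-q} (r')^{q} = 1. \]

So

\[ P_{s+1,FAIL}(t,\Delta t) = 1 - r_d^* r_c^* r_A^* \bigl[ D(k-1,0) + D(k-1,1) + D(k-1,2) + D(k-1,3) \bigr]. \]

\[
\begin{aligned}
P_{s+2,FAIL}(t,\Delta t) &= 1 - P_{s+2,s+2}(t,\Delta t) - P_{s+2,s+3}(t,\Delta t) - P_{s+2,s+4}(t,\Delta t) - P_{s+2,s+5}(t,\Delta t) \\
 &= 1 - r_d^* r_c^* r_A^* \bigl[ D(k-1,0) + D(k-1,1) + D(k-1,2) + D(k-1,3) \bigr]. \\
P_{s+3,FAIL}(t,\Delta t) &= 1 - P_{s+3,s+4}(t,\Delta t) - P_{s+3,s+5}(t,\Delta t) - P_{s+3,s+3}(t,\Delta t) \\
 &= 1 - r_d^* r_c^* r_A^* \bigl[ D(k-2,0) + D(k-2,1) + D(k-2,2) \bigr]. \\
P_{s+4,FAIL}(t,\Delta t) &= 1 - P_{s+4,s+4}(t,\Delta t) - P_{s+4,s+5}(t,\Delta t) \\
 &= 1 - r_d^* r_c^* r_A^* \bigl[ D(k-3,0) + D(k-3,1) \bigr]. \\
P_{s+5,FAIL}(t,\Delta t) &= 1 - P_{s+5,s+5}(t,\Delta t) = 1 - r_d^* r_c^* r_A^*\, D(k-4,0).
\end{aligned}
\]

So

\[
\begin{aligned}
P_{1,FAIL}(t,\Delta t) &= 1 - D(k,0) - r_d' r_c' r_A' \sum_{j=1}^{4} D(k,j). \\
P_{i,FAIL}(t,\Delta t) &= 1 - r_d^* r_c^* r_A^* \sum_{j=0}^{3} D(k-1,j) && \text{for } 2 \le i \le s+2. \\
P_{s+j,FAIL}(t,\Delta t) &= 1 - r_d^* r_c^* r_A^* \sum_{q=0}^{5-j} D(k-j+1,q) && \text{for } 3 \le j \le 5.
\end{aligned}
\]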
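The collapse of the state 1 failure probability to its short closed form can be checked numerically. The sketch below is only a sanity check under assumed values: k = 22, s = 4, and all reliabilities are hypothetical numbers chosen for illustration; the helper names are not from the report.

```python
from math import comb, isclose


def D(x, y, r):
    """Binomial probability of exactly y failures among x on-line planes."""
    return comb(x, y) * r ** (x - y) * (1.0 - r) ** y


def E(x, y, s, rp):
    """Probability of y or fewer good spares among s - x + 2 available."""
    n = s - x + 2
    return sum(comb(n, q) * (1.0 - rp) ** (n - q) * rp ** q
               for q in range(min(y, n) + 1))


def p1_fail_by_sum(k, s, r, rp, rd, rc, rA, rs):
    """1 minus the sum of every transition probability out of state 1."""
    g = rd * rc * rA
    e0, e1, e2 = (E(2, y, s, rp) for y in range(3))
    no_switch = 1.0 - rs + rs * e0      # switch fails, or no good spare
    total = D(k, 0, r)                                      # P(1,1)
    total += D(k, 1, r) * g                                 # P(1,2)
    total += D(k, 2, r) * g * rs * (1.0 - e0)               # P(1,3)
    total += D(k, 3, r) * g * rs * (1.0 - e1)               # P(1,4)
    total += D(k, 4, r) * g * rs * (1.0 - e2)               # P(1,5)
    total += g * (D(k, 2, r) * no_switch                    # P(1,s+3)
                  + D(k, 3, r) * rs * (e1 - e0)
                  + D(k, 4, r) * rs * (e2 - e1))
    total += g * (D(k, 3, r) * no_switch                    # P(1,s+4)
                  + D(k, 4, r) * rs * (e1 - e0))
    total += g * D(k, 4, r) * no_switch                     # P(1,s+5)
    return 1.0 - total


def p1_fail_closed(k, r, rd, rc, rA):
    """Closed form: 1 - D(k,0) - r_d' r_c' r_A' * sum of D(k,j), j = 1..4."""
    return 1.0 - D(k, 0, r) - rd * rc * rA * sum(D(k, j, r) for j in range(1, 5))


if __name__ == "__main__":
    args = dict(k=22, s=4, r=0.999, rp=0.98, rd=0.9999, rc=0.99999, rA=1.0, rs=0.9995)
    print(p1_fail_by_sum(**args))
    print(p1_fail_closed(22, 0.999, 0.9999, 0.99999, 1.0))
```

The switch and spare terms cancel because, for each number of errors, the listed destinations exhaust every switch and spare outcome, so the bracketed factors each sum to 1.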
By substitution of the transition probability equations into the
general state probability equation, the state probability equations
for the double-error-correcting system are obtained as follows:

\[
\begin{aligned}
P_1(t + \Delta t) &= P_{1,1}(t,\Delta t)\, P_1(t) \\
 &= D(k,0)\, P_1(t). \\
P_2(t + \Delta t) &= P_{1,2}(t,\Delta t)\, P_1(t) + P_{2,2}(t,\Delta t)\, P_2(t) \\
 &= D(k,1)\, r_d' r_c' r_A'\, P_1(t) + D(k-1,0)\, r_d^* r_c^* r_A^*\, P_2(t). \\
P_3(t + \Delta t) &= P_{1,3}(t,\Delta t)\, P_1(t) + P_{2,3}(t,\Delta t)\, P_2(t) + P_{3,3}(t,\Delta t)\, P_3(t) \\
 &= D(k,2)\, r_d' r_c' r_A' r_s'\,(1 - E(2,0))\, P_1(t) \\
 &\quad + r_d^* r_c^* r_A^* \bigl[ D(k-1,1)\, r_s'(1 - E(2,0))\, P_2(t) + D(k-1,0)\, P_3(t) \bigr]. \\
P_4(t + \Delta t) &= P_{1,4}(t,\Delta t)\, P_1(t) + P_{2,4}(t,\Delta t)\, P_2(t) + P_{3,4}(t,\Delta t)\, P_3(t) + P_{4,4}(t,\Delta t)\, P_4(t) \\
 &= D(k,3)\, r_d' r_c' r_A' r_s'\,(1 - E(2,1))\, P_1(t) \\
 &\quad + r_d^* r_c^* r_A^* \bigl[ D(k-1,2)\, r_s'(1 - E(2,1))\, P_2(t) \\
 &\qquad + D(k-1,1)\, r_s'(1 - E(3,0))\, P_3(t) + D(k-1,0)\, P_4(t) \bigr]. \\
P_5(t + \Delta t) &= P_{1,5}(t,\Delta t)\, P_1(t) + P_{2,5}(t,\Delta t)\, P_2(t) + P_{3,5}(t,\Delta t)\, P_3(t) \\
 &\quad + P_{4,5}(t,\Delta t)\, P_4(t) + P_{5,5}(t,\Delta t)\, P_5(t) \\
 &= D(k,4)\, r_d' r_c' r_A' r_s'\,(1 - E(2,2))\, P_1(t) \\
 &\quad + r_d^* r_c^* r_A^* \bigl[ D(k-1,3)\, r_s'(1 - E(2,2))\, P_2(t) + D(k-1,2)\, r_s'(1 - E(3,1))\, P_3(t) \\
 &\qquad + D(k-1,1)\, r_s'(1 - E(4,0))\, P_4(t) + D(k-1,0)\, P_5(t) \bigr]. \\
P_i(t + \Delta t) &= P_{i-3,i}(t,\Delta t)\, P_{i-3}(t) + P_{i-2,i}(t,\Delta t)\, P_{i-2}(t) \\
 &\quad + P_{i-1,i}(t,\Delta t)\, P_{i-1}(t) + P_{i,i}(t,\Delta t)\, P_i(t) \\
 &= r_d^* r_c^* r_A^* \bigl[ D(k-1,3)\, r_s'(1 - E(i-3,2))\, P_{i-3}(t) + D(k-1,2)\, r_s'(1 - E(i-2,1))\, P_{i-2}(t) \\
 &\qquad + D(k-1,1)\, r_s'(1 - E(i-1,0))\, P_{i-1}(t) + D(k-1,0)\, P_i(t) \bigr].
\end{aligned}
\]

\[
P_i(t + \Delta t) = r_d^* r_c^* r_A^* \Bigl[ D(k-1,0)\, P_i(t) + r_s' \sum_{j=1}^{3} D(k-1,j)\,(1 - E(i-j,j-1))\, P_{i-j}(t) \Bigr]
\quad \text{for } 6 \le i \le s+2.
\]
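\[
\begin{aligned}
P_{s+3}(t + \Delta t) &= \sum_{j=1}^{s+2} P_{j,s+3}(t,\Delta t)\, P_j(t) + P_{s+3,s+3}(t,\Delta t)\, P_{s+3}(t) \\
 &= r_d' r_c' r_A' \bigl[ D(k,2)(1 - r_s' + r_s' E(2,0)) + D(k,3)\, r_s'(E(2,1) - E(2,0)) \\
 &\qquad + D(k,4)\, r_s'(E(2,2) - E(2,1)) \bigr] P_1(t) \\
 &\quad + r_d^* r_c^* r_A^* \Bigl\{ \sum_{j=2}^{s} \bigl[ D(k-1,1)(1 - r_s' + r_s' E(j,0)) + D(k-1,2)\, r_s'(E(j,1) - E(j,0)) \\
 &\qquad\qquad + D(k-1,3)\, r_s'(E(j,2) - E(j,1)) \bigr] P_j(t) \\
 &\qquad + \bigl[ D(k-1,1)(1 - r_s' + r_s' E(s+1,0)) + D(k-1,2)\, r_s'(E(s+1,1) - E(s+1,0)) \bigr] P_{s+1}(t) \\
 &\qquad + D(k-1,1)\, P_{s+2}(t) + D(k-2,0)\, P_{s+3}(t) \Bigr\}.
\end{aligned}
\]

\[
\begin{aligned}
P_{s+4}(t + \Delta t) &= \sum_{j=1}^{s+3} P_{j,s+4}(t,\Delta t)\, P_j(t) + P_{s+4,s+4}(t,\Delta t)\, P_{s+4}(t) \\
 &= r_d' r_c' r_A' \bigl[ D(k,3)(1 - r_s' + r_s' E(2,0)) + D(k,4)\, r_s'(E(2,1) - E(2,0)) \bigr] P_1(t) \\
 &\quad + r_d^* r_c^* r_A^* \Bigl\{ \sum_{j=2}^{s+1} \bigl[ D(k-1,2)(1 - r_s' + r_s' E(j,0)) + D(k-1,3)\, r_s'(E(j,1) - E(j,0)) \bigr] P_j(t) \\
 &\qquad + D(k-1,2)\, P_{s+2}(t) + D(k-2,1)\, P_{s+3}(t) + D(k-3,0)\, P_{s+4}(t) \Bigr\}.
\end{aligned}
\]

\[
\begin{aligned}
P_{s+5}(t + \Delta t) &= \sum_{j=1}^{s+4} P_{j,s+5}(t,\Delta t)\, P_j(t) + P_{s+5,s+5}(t,\Delta t)\, P_{s+5}(t) \\
 &= D(k,4)\, r_d' r_c' r_A'\,(1 - r_s' + r_s' E(2,0))\, P_1(t) \\
 &\quad + r_d^* r_c^* r_A^* \Bigl\{ \sum_{j=2}^{s+1} D(k-1,3)(1 - r_s' + r_s' E(j,0))\, P_j(t) \\
 &\qquad + D(k-1,3)\, P_{s+2}(t) + D(k-2,2)\, P_{s+3}(t) + D(k-3,1)\, P_{s+4}(t) + D(k-4,0)\, P_{s+5}(t) \Bigr\}.
\end{aligned}
\]

\[
\begin{aligned}
P_{FAIL}(t + \Delta t) &= \sum_{j=1}^{s+5} P_{j,FAIL}(t,\Delta t)\, P_j(t) + P_{FAIL}(t) \\
 &= \Bigl[ 1 - D(k,0) - r_d' r_c' r_A' \sum_{j=1}^{4} D(k,j) \Bigr] P_1(t) \\
 &\quad + \sum_{q=2}^{s+2} \Bigl[ 1 - r_d^* r_c^* r_A^* \sum_{m=0}^{3} D(k-1,m) \Bigr] P_q(t) \\
 &\quad + \sum_{n=3}^{5} \Bigl[ 1 - r_d^* r_c^* r_A^* \sum_{q=0}^{5-n} D(k-n+1,q) \Bigr] P_{s+n}(t) + P_{FAIL}(t) \\
 &= \sum_{r=1}^{s+5} P_r(t) + P_{FAIL}(t) - \Bigl[ D(k,0) + r_d' r_c' r_A' \sum_{j=1}^{4} D(k,j) \Bigr] P_1(t) \\
 &\quad - r_d^* r_c^* r_A^* \Bigl[ \sum_{q=2}^{s+2} \sum_{m=0}^{3} D(k-1,m)\, P_q(t) + \sum_{n=3}^{5} \sum_{q=0}^{5-n} D(k-n+1,q)\, P_{s+n}(t) \Bigr].
\end{aligned}
\]

But

\[ \sum_{r=1}^{s+5} P_r(t) + P_{FAIL}(t) = 1 \]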

