The reliability of single-error protected computer memories by Blaum, Mario et al.
IEEE TRANSACTIONS ON COMPUTERS, VOL. 37, NO. 1, JANUARY 1988
The Reliability of Single-Error Protected Computer Memories 
MARIO BLAUM, RODNEY GOODMAN, AND 
ROBERT MCELIECE 
Abstract: In this paper we study the lifetimes of computer memories
which are protected with single-error correcting double-error detecting 
(SEC-DED) codes. We assume that there are five possible types of 
memory chip failures (single-cell, row, column, row-column, and whole 
chip) and, after making a simplifying assumption (the Poisson assump- 
tion) that has been substantiated experimentally, we derive a simple 
closed-form expression for the system reliability function. Using this 
formula, and chip reliability data from tables, we are easily able to 
compute the mean time to failure for realistic memory systems. 
Index Terms: Error-correcting codes, RAMs, reliability, SEC-DED
codes, semiconductor memory systems.
I. INTRODUCTION 
All modern computers have memories built from VLSI RAM 
chips. Individually, these devices are highly reliable, and any single 
chip may function for decades before failing. However, when many 
chips are combined in a single large computer memory, the expected 
wait until at least one of them fails can be as small as a few hours, or 
even less. For this reason, almost all large computer memories are 
protected by single-error correcting, double-error detecting (SEC- 
DED) codes. In the usual jargon of coding theory, a SEC-DED code 
is just a shortened d = 4 Hamming code; the shortening is usually 
done in a hardware-efficient manner devised by Hsiao [7]. A recent
paper by Chen and Hsiao [4] gives a good survey of how SEC-DED 
coding is actually implemented in computer memories, but we shall 
give a brief overview here. 
Most often, the memory is organized as an M × n rectangular
array of chips as shown in Fig. 1. The first k chips in each row are
used to store information, and the remaining r = n - k chips are
parity-check chips, needed for the SEC-DED coding. As we shall
see, this means that an (n, k) d = 4 Hamming code is being used to
protect the memory. A typical small example is the standard one- 
megabyte board sold for VAX computers, which is organized as M 
= 4 rows of 64K RAM chips, with each row containing k = 32 data 
chips and r = 7 parity chips. The corresponding SEC-DED code is 
then a (39, 32) d = 4 shortened Hamming code. 
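The numbers in this example can be checked with a short sketch. It assumes the standard fact that a d = 4 (extended) Hamming code with r check bits has length at most 2^(r-1), so a shortened (k + r, k) code exists as soon as k + r <= 2^(r-1). The function names are illustrative and do not come from the paper.

```python
def sec_ded_check_bits(k: int) -> int:
    """Smallest r for which a shortened (k + r, k) d = 4 Hamming code exists.

    Assumes the length bound k + r <= 2**(r - 1) of the extended Hamming code.
    """
    if k < 1:
        raise ValueError("k must be a positive number of data bits")
    r = 2
    while k + r > 2 ** (r - 1):
        r += 1
    return r


def board_data_bytes(rows: int, data_chips_per_row: int, bits_per_chip: int) -> int:
    """Data capacity in bytes of an M x n board built from one-bit-wide chips."""
    return rows * data_chips_per_row * bits_per_chip // 8


r = sec_ded_check_bits(32)                  # check bits for 32 data bits
capacity = board_data_bytes(4, 32, 65536)   # M = 4 rows of 64K chips
print(f"(n, k) = ({32 + r}, 32), capacity = {capacity} bytes")
```

With k = 32 the search stops at r = 7, which is the (39, 32) code of the text, and four rows of 32 data chips of 64K bits each hold one megabyte of data.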
We assume that each chip has a one-bit wide external organization
but is organized internally as an I × I square array of bits. For
example, the standard 4164 64K chips have an external organization
of 65536 single-bit locations, with I = 256. We also assume that the
n bits stored in corresponding cells in one row of the memory form a
codeword in the (n, k) code, as shown in Fig. 2.
We now discuss chip failures. By a chip “failure” we mean any
situation in which one or more of the bits written on a chip cannot be
reliably read. These failures are traditionally classified as either
hard, meaning that the affected memory cells are permanently
damaged, or soft, meaning that the damage is only temporary.
Laboratory observation of real memories [1], [12], [13] shows that
by far the most common type of chip failure is a soft error of a single
cell on a chip. Such errors are caused by stray alpha particles which
can, under certain circumstances, change a stored “1” to a “0”
without otherwise damaging the chip [14]. However, several kinds of
hard failure have been observed. A single-cell failure, for example,
can also occur as a hard error. There are also several kinds of hard
failures which cause multiple cell errors. A row failure, which can
be caused by a failure of one of the chip’s row drivers, causes all I
cells in one row of the affected chip to fail. Similarly, a column
failure, which can be caused by a failure of one of the chip’s column
sense amplifiers or column decoders, causes all I cells in a column of
the chip to fail. A short circuit at a memory cell can cause a row-column
failure. Finally, a catastrophic whole chip failure may
occur, in which all the cells of a chip fail. All five of these failure
types are illustrated in Fig. 3. (The letters A, B, C, D, and F will be
referred to in Section II.)
The organization of the SEC-DED code (assuming “by-one”
memory chips) guarantees that no chip failure, however catastrophic,
can cause two errors in any n bit codeword, and so the memory
system will survive any single-chip failure. In fact there are many
possible combinations of multiple chip failures that can also be
tolerated. Eventually, however, we expect that so many chip failures
will have occurred that some one of the individual n bit codewords
will have suffered two or more errors. When this happens, we declare
a memory system failure.
If we start with a brand new memory system of the kind we have
been describing, the time until a memory system failure occurs will
be a random variable. In this paper we will derive accurate and easily
evaluated estimates for two of the most important quantitative
measures of this random variable, the system reliability function
$R_{\mathrm{sys}}(t)$ and the mean time to failure (MTTF). The reliability
function $R_{\mathrm{sys}}(t)$ represents the probability that the system will not
have failed after t hours, and the MTTF represents the average length
of time the system will function before a memory system failure
occurs.
In the next section (Section II) we will describe the probabilistic
model which is commonly used to describe the occurrences of the
various types of chip failures, and see how it leads, via a Poisson
approximation, to a reasonably simple formula for $R_{\mathrm{sys}}(t)$ and
MTTF. In Section III we will give some useful approximations to the
exact formula found in Section II. In Section IV we will make
numerical comparisons of the predictions of our formulas to the
results of computer simulation. The analytic predictions will be seen
to be in very close agreement with the simulations, thereby justifying
our confidence in the accuracy of the Poisson approximation made in
Section II. Finally, in Section V, we will compare our results to those
which have already appeared in the literature.
Fig. 1. The organization of memory chips.
Fig. 2. Each bit in each codeword resides on a different chip.
Fig. 3. Five types of chip failure.
Manuscript received June 5, 1984; revised June 10, 1985. R. Goodman was
supported by Caltech’s Program in Advanced Technologies, sponsored by
Aerojet General, General Motors, GTE, and TRW. R. McEliece was
supported by a grant from IBM.
M. Blaum is with the IBM Almaden Research Laboratory, San Jose, CA.
R. Goodman and R. McEliece are with California Institute of Technology,
Pasadena, CA 91125.
IEEE Log Number 8715312.
0018-9340/88/0100-0114$01.00 © 1988 IEEE
II. MODELS: FORMULAS FOR R(t) AND MTTF
Our basic quantitative assumption about individual chip failures is
that chip lifetimes are exponentially distributed. This means that the
reliability of a given chip, i.e., the probability that it has not failed
after t hours, is equal to $e^{-\lambda t}$, where $\lambda$ is a constant that must be found
experimentally [11]. We will need to distinguish between the five
types of chip errors depicted in Fig. 3, and so for future reference, we
will use the following notation.
A: row failure
B: column failure
C: single-cell failure
D: row-column failure
F: whole chip failure.
We assume that, if a given chip fails, the conditional probabilities
that the failure will be of type A, B, C, D, or F are a, b, c, d, and f,
respectively (so that a + b + c + d + f = 1). The probabilities a, b, c, d,
and f also have to be determined experimentally. We further assume
that failures on one chip are independent of failures on all other chips.
Given all these assumptions, it is in principle possible to calculate
the row reliability function R(t), defined as follows.
$$R(t) = \Pr\{\text{an uncorrectable combination of chip failures has not yet occurred in the } i\text{th row of chips at time } t\}. \tag{1}$$
For example, if the only kind of chip failures were whole chip
failures, we would have a = b = c = d = 0, f = 1 and
$$R(t) = e^{-\lambda n t} + n\left(1 - e^{-\lambda t}\right)e^{-\lambda (n-1) t}, \tag{2}$$
which is just the probability that the given row has suffered either
zero or one whole chip failure after t hours [10]. Since the M rows
are assumed to fail independently, the reliability of the entire system
of Mn chips is
$$R_{\mathrm{sys}}(t) = R(t)^{M}. \tag{3}$$
The MTTF of a system whose reliability function is r(t) is well known
to be given by the formula
$$\mathrm{MTTF} = \int_0^\infty r(t)\,dt, \tag{4}$$
and so for a computer memory system of the kind we are considering,
$$\mathrm{MTTF} = \int_0^\infty R(t)^{M}\,dt. \tag{5}$$
Thus, everything we are interested in depends in a simple manner on 
the row-reliability function R ( t ) .  
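As an illustration of how the MTTF follows from the row-reliability function, the sketch below evaluates the whole-chip-only case numerically: the probability of zero or one whole chip failures among the n chips of a row, raised to the Mth power for the system, and integrated over time. The failure rate `lam` is a hypothetical value chosen only for illustration, and Simpson's rule is our choice of quadrature, not something prescribed by the paper.

```python
import math


def row_reliability_whole_chip(t, n, lam):
    """Probability of zero or one whole chip failures in a row of n chips."""
    none_failed = math.exp(-lam * n * t)
    one_failed = n * (1.0 - math.exp(-lam * t)) * math.exp(-lam * (n - 1) * t)
    return none_failed + one_failed


def integrate(func, upper, steps=20000):
    """Composite Simpson's rule on [0, upper]; steps must be even."""
    h = upper / steps
    total = func(0.0) + func(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(i * h)
    return total * h / 3.0


n, M = 39, 4     # the VAX board described in the Introduction
lam = 1.0e-6     # hypothetical chip failure rate per hour (assumption)

# MTTF of one row, and of the whole system of M independent rows.
mttf_row = integrate(lambda t: row_reliability_whole_chip(t, n, lam), 2.0e6)
mttf_sys = integrate(lambda t: row_reliability_whole_chip(t, n, lam) ** M, 2.0e6)
print(f"row MTTF    = {mttf_row:.0f} hours")
print(f"system MTTF = {mttf_sys:.0f} hours")
```

For a single row the integral can also be done by hand and gives (1/lam)(1/n + 1/(n - 1)), which is a convenient check on the quadrature.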
Unfortunately, however, an exact formula for R ( t )  proves to be 
extremely complicated. (For example, Mikhail et al. [15] give a
recursive method for computing it when errors of types A, B, C, and
F are present.) This difficulty has led us to make the following
simplifying assumption. We no longer view a row of chips as 
consisting of n separate chips, but “end on” so that the failures on all 
n chips are superimposed onto a single “protochip.” We also make 
an important assumption about how protochip failures are distributed, 
which we call the Poisson assumption:
In each protochip, the failures of types A, B, C, D, and F form
independent Poisson processes of intensities $a\lambda n$, $b\lambda n$, $c\lambda n$,
$d\lambda n$, and $f\lambda n$, respectively.
Under this assumption, the row reliability function R(t) is just the
probability that at time t no cell on the protochip has suffered two or
more errors. As we will see, this assumption greatly simplifies the
formulas for R(t) without introducing significant inaccuracies. For
example, if we again consider a situation in which only whole chip
failures occur, then under the Poisson assumption the number of
whole chip failures in a given row is a Poisson process of intensity
$\lambda n$, and so the row-reliability function, i.e., the probability of zero or
one whole chip failures after t hours, is given by
$$R(t) = e^{-\lambda n t}(1 + \lambda n t). \tag{6}$$
Formulas (6) and (2) are very similar, and for example if we compute
the MTTF of a single row of chips using (2) and (4) we obtain
$$\mathrm{MTTF} = \frac{1}{\lambda}\left(\frac{1}{n} + \frac{1}{n-1}\right),$$
whereas, if we use the Poisson assumption via (6) and (4) we get
instead
$$\mathrm{MTTF} = \frac{2}{\lambda n}.$$
In Section IV we will give further comparisons between exact
MTTF’s and those obtained by the Poisson protochip method.
Using the Poisson protochip, we can now derive a formula for the
row-reliability function R(t), the probability that a row of the memory
system is still working after t hours. It will be convenient to classify the
various tolerable combinations of protochip failures into the following
four categories.
E1: only row and single-cell failures
E2: only column and single-cell failures
E3: one row-column failure and single-cell failures
E4: one whole chip failure.
These “tolerable failure configurations” are shown in Fig. 4.
Fig. 4. The four tolerable failure configurations on the Poisson protochip.
We note that the configurations E1 and E2 are not disjoint, so that
we also introduce E12, defined as
E12 = E1 ∩ E2: only single-cell failures.
Then the row-reliability function defined by (1) is
$$R(t) = \Pr\{E_1 \cup E_2 \cup E_3 \cup E_4\} = \Pr\{E_1\} + \Pr\{E_2\} - \Pr\{E_{12}\} + \Pr\{E_3\} + \Pr\{E_4\}. \tag{7}$$
We now proceed to calculate the five probabilities in (7).
We begin by calculating Pr{E1}, which is the probability that the
protochip has suffered no errors of type B, D, or F, and that the
errors of type A and C are “tolerable,” i.e., that no cell on the
protochip has suffered two or more errors. To do this, we focus our
attention on a single row of cells on the protochip, say row i. Since
each protochip has I rows, our Poisson assumption implies that the
row failures in this particular row form a Poisson process of intensity
$a\lambda n/I$. Therefore, the probability that there have been no row failures
in row i at time t is
$$\exp(-a\lambda n t/I),$$
and the probability that there has been exactly one row failure in this
row is
$$\exp(-a\lambda n t/I)\,(a\lambda n t/I).$$
To simplify the notation, from now on we will use the parameter x
defined by
$$x = \lambda n t, \tag{8}$$
so that the probabilities of zero and one row failures in row i at time t are
given by
$$e^{-ax/I} \quad\text{and}\quad e^{-ax/I}\,(ax/I),$$
respectively. Next we focus on a particular cell within the ith row,
say cell (i, j). Since each chip contains $I^2$ cells, our Poisson
assumption implies that the single-cell failures in this particular cell
form a Poisson process with intensity $c\lambda n/I^2$. Therefore, the
probabilities that the (i, j)th cell has suffered zero or one single-cell
errors at time t are
$$e^{-cx/I^2} \quad\text{and}\quad e^{-cx/I^2}\,(cx/I^2),$$
respectively. It follows that the probability that at time t no cell in row
i has suffered a single-cell error is
$$e^{-cx/I},$$
while the probability that no cell in row i has suffered more than one
single-cell error is
$$e^{-cx/I}\left(1 + \frac{cx}{I^2}\right)^{I}.$$
We now wish to calculate the probability that no cell within the ith
row has suffered two or more errors of type A or C. This probability
is the sum of the following two probabilities:
The row has not failed and there is at most one single-cell
error in each of the I cells.
The row has failed (exactly one row failure) and there are no
single-cell errors in any of the I cells.
This probability is, therefore,
$$e^{-(a+c)x/I}\left(\left(1 + \frac{cx}{I^2}\right)^{I} + \frac{ax}{I}\right). \tag{9}$$
Finally, the probability Pr{E1} is the probability that the type A and
C errors in all I rows of the protochip will be tolerable, which is just
the Ith power of (9), multiplied by the probability that there are no
errors of type B, D, or F, which is exp(-(b + d + f)x). After
some algebra we find that this product is
$$\Pr\{E_1\} = e^{-x}\left(\left(1 + \frac{cx}{I^2}\right)^{I} + \frac{ax}{I}\right)^{I}. \tag{10}$$
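By replacing “rows” with “columns” in the preceding argument, we
find that
$$\Pr\{E_2\} = e^{-x}\left(\left(1 + \frac{cx}{I^2}\right)^{I} + \frac{bx}{I}\right)^{I}. \tag{11}$$
To compute Pr{E12}, we note that this case corresponds to no errors
of types A, B, D, or F, and at most one error in each cell of the
protochip. Thus,
$$\Pr\{E_{12}\} = e^{-(a+b+d+f)x}\left(e^{-cx/I^2}\left(1 + \frac{cx}{I^2}\right)\right)^{I^2} = e^{-x}\left(1 + \frac{cx}{I^2}\right)^{I^2}. \tag{12}$$
In order to calculate Pr{E3}, note that if there is a row-column
failure and if there are no failures of types A, B, or F, we can tolerate
zero or one single-cell failures in the unaffected $(I-1)^2$ cells but no
errors in the $2I - 1$ cells affected by the row-column failure. Hence,
$$\Pr\{E_3\} = e^{-(a+b+f)x}\cdot e^{-dx}(dx)\cdot\left(e^{-cx/I^2}\right)^{2I-1}\cdot\left(e^{-cx/I^2}\left(1 + \frac{cx}{I^2}\right)\right)^{(I-1)^2} = e^{-x}\,(dx)\left(1 + \frac{cx}{I^2}\right)^{(I-1)^2}. \tag{13}$$
To calculate Pr{E4}, we note that if there is a whole chip failure,
there can be no further failures of any kind. Therefore,
$$\Pr\{E_4\} = e^{-(a+b+c+d)x}\cdot e^{-fx}\,fx = e^{-x}\,fx. \tag{14}$$
Finally, to compute the row-reliability function R(t), we combine (7)
with (10), (11), (12), (13), and (14), and obtain the following
somewhat intimidating expression.
$$R(t) = e^{-x}\left[\left(\left(1 + \frac{cx}{I^2}\right)^{I} + \frac{ax}{I}\right)^{I} + \left(\left(1 + \frac{cx}{I^2}\right)^{I} + \frac{bx}{I}\right)^{I} - \left(1 + \frac{cx}{I^2}\right)^{I^2} + dx\left(1 + \frac{cx}{I^2}\right)^{(I-1)^2} + fx\right]. \tag{15}$$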
This expression is our main result. The rest of the paper will be 
devoted to exploring its consequences. 
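A minimal sketch of how (15) might be programmed is given below. It takes x = λnt as its argument, and the five summands correspond to Pr{E1}, Pr{E2}, -Pr{E12}, Pr{E3}, and Pr{E4} with the common factor e^{-x} pulled out. The function name and argument order are our own choices, and the direct use of powers is only adequate for moderate x (for very large x the bracket can overflow a double).

```python
import math


def row_reliability(x, a, b, c, d, f, I):
    """Row reliability R of (15) as a function of x = lambda * n * t."""
    q = 1.0 + c * x / I**2                 # per-cell factor (1 + c x / I^2)
    q_row = q ** I                         # (1 + c x / I^2)^I
    bracket = (
        (q_row + a * x / I) ** I           # E1: row and single-cell failures
        + (q_row + b * x / I) ** I         # E2: column and single-cell failures
        - q ** (I * I)                     # E12 is contained in both E1 and E2
        + d * x * q ** ((I - 1) ** 2)      # E3: one row-column failure
        + f * x                            # E4: one whole chip failure
    )
    return math.exp(-x) * bracket


# Illustrative evaluation with a = 0.12, b = 0.18, c = 0.35, d = 0, f = 0.35, I = 64.
for x in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(x, row_reliability(x, 0.12, 0.18, 0.35, 0.0, 0.35, 64))
```

Two quick sanity checks follow from the formula itself: with f = 1 the function reduces to e^{-x}(1 + x), which is (6), and at x = 0 it equals 1.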
III. ASYMPTOTIC APPROXIMATIONS AND SPECIAL CASES
Although the expression (15) for R(t) is simple enough for
numerical work, it is possible to give approximations to it that will
yield additional insight into the problem. For example, in most
modern chips the number I (the number of storage cells per row or
column) is quite large, and this suggests that the limiting behavior of
(15) as $I \to \infty$ may be interesting to consider. In fact, this limit is
given by the formula
$$R_\infty(t) = e^{-x}\left(e^{(a+c)x} + e^{(b+c)x} + e^{cx}(dx - 1) + fx\right). \tag{16}$$
This formula is simple enough to integrate explicitly, and so by (5)
we find that the MTTF for one row of chips is approximately
$$\mathrm{MTTF} \approx \frac{1}{\lambda n}\left(\frac{1}{1-a-c} + \frac{1}{1-b-c} + \frac{d}{(1-c)^2} - \frac{1}{1-c} + f\right). \tag{17}$$
Experimentation with (16) and (17) indicates that these approximations
are quite accurate if max(a + c, b + c) is not too close to 1.
Such a restriction is understandable, since if, e.g., c = 1, an $I = \infty$
chip would have an infinite MTTF, since the probability of any given 
cell position being hit twice would be zero. 
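The large-I limit lends itself to a very short sketch. The code below assumes the limiting reliability e^{-x}(e^{(a+c)x} + e^{(b+c)x} + e^{cx}(dx - 1) + fx) and integrates it term by term over x from 0 to infinity, which requires a + c < 1 and b + c < 1. The result is the one-row METF, i.e., the MTTF in units of 1/(λn). The function names are ours, and the parameter sets in the usage lines are the ones tabulated in Section IV.

```python
import math


def row_reliability_limit(x, a, b, c, d, f):
    """The I -> infinity limit of the row reliability, as a function of x."""
    return math.exp(-x) * (
        math.exp((a + c) * x)
        + math.exp((b + c) * x)
        + math.exp(c * x) * (d * x - 1.0)
        + f * x
    )


def row_metf_limit(a, b, c, d, f):
    """Integral of the limit over x in [0, inf); needs a + c < 1 and b + c < 1.

    Multiply by 1 / (lambda * n) to obtain the one-row MTTF.
    """
    return (
        1.0 / (1.0 - a - c)
        + 1.0 / (1.0 - b - c)
        + d / (1.0 - c) ** 2
        - 1.0 / (1.0 - c)
        + f
    )


print(row_metf_limit(0.01646, 0.01646, 0.85343, 0.0, 0.11365))  # parameters of [9]
print(row_metf_limit(0.047, 0.047, 0.893, 0.013, 0.0))          # parameters of [11]
print(row_metf_limit(0.12, 0.18, 0.35, 0.0, 0.35))              # parameters of [15]
```

These three values can be compared with the M = 1 entries of the "I = ∞" columns in Figs. 5-7 of Section IV.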
Another interesting case to consider is the case of a large number 
of rows. (A CRAY-1, for example, has M = 1024 rows, each 
consisting of 72 1K ECL RAM chips.) In this case we can exploit the 
classic theory of asymptotic analysis [3] and find an asymptotic 
formula for the integral in (5). Omitting the details, we find that the 
result is of the form 
$$\mathrm{MTTF} \sim \frac{1}{\lambda n M}\left(\sqrt{M}\,K_1 + K_2\right). \tag{18}$$
In (18) the number $1/\lambda n M$ represents the mean time between chip
failures (there are nM chips and each one has an MTTF of $1/\lambda$). The
other term, viz., $\sqrt{M}\,K_1 + K_2$, represents the asymptotic value of the
mean number of events to failure (METF), which is the average
number of chip failures which occur before a system failure occurs.
The constants $K_1$ and $K_2$ in (18) are determined as follows. If we call
the bracketed term in (15) r(x) and expand it as a polynomial in x up
to terms of degree 3, we find that
$$r(x) = 1 + x + r_2 x^2 + r_3 x^3 + \cdots, \tag{19}$$
where
$$r_2 = \frac{c^2}{2}\cdot\frac{I^2-1}{I^2} + (a+b)c\cdot\frac{I-1}{I} + \frac{a^2+b^2}{2}\cdot\frac{I-1}{I} + cd\cdot\frac{I^2-2I+1}{I^2}$$
and
$$r_3 = \frac{c^3}{6}\cdot\frac{I^4-3I^2+2}{I^4} + \frac{(a+b)c^2}{2}\cdot\frac{I^3-2I^2+1}{I^3} + \frac{(a^2+b^2)c}{2}\cdot\frac{I^2-3I+2}{I^2} + \frac{a^3+b^3}{6}\cdot\frac{I^2-3I+2}{I^2} + \frac{c^2 d}{2}\cdot\frac{I^3-4I^2+5I-2}{I^3}.$$
The constants $K_1$ and $K_2$ in (18) are then given by
$$K_1 = \sqrt{\frac{\pi}{2(1-2r_2)}}, \qquad K_2 = \frac{2(r_3 - r_2) + \frac{2}{3}}{(1-2r_2)^2}.$$
The difference between the asymptotic formula (18) and the true
value (5) is guaranteed to go to zero, if a, b, c, d, f, and I are fixed
and $M \to \infty$. However, (18) usually gives remarkably accurate
answers for small values of M, often even for M = 1 ,  as we shall see 
in the next section. 
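The two-term asymptotic estimate is easy to program. The sketch below computes the coefficients r2 and r3 of the expansion of the bracketed term in (15), then K1 = sqrt(π / (2(1 - 2 r2))) and K2 = (2(r3 - r2) + 2/3) / (1 - 2 r2)^2, and finally the METF estimate sqrt(M) K1 + K2. The function names are ours. Note that f does not appear, since it enters r(x) only through the linear term.

```python
import math


def series_coefficients(a, b, c, d, I):
    """Coefficients r2 and r3 of r(x) = 1 + x + r2 x^2 + r3 x^3 + ..."""
    r2 = (
        (c**2 / 2) * (I**2 - 1) / I**2
        + (a + b) * c * (I - 1) / I
        + ((a**2 + b**2) / 2) * (I - 1) / I
        + c * d * (I**2 - 2 * I + 1) / I**2
    )
    r3 = (
        (c**3 / 6) * (I**4 - 3 * I**2 + 2) / I**4
        + ((a + b) * c**2 / 2) * (I**3 - 2 * I**2 + 1) / I**3
        + ((a**2 + b**2) * c / 2) * (I**2 - 3 * I + 2) / I**2
        + ((a**3 + b**3) / 6) * (I**2 - 3 * I + 2) / I**2
        + (c**2 * d / 2) * (I**3 - 4 * I**2 + 5 * I - 2) / I**3
    )
    return r2, r3


def metf_asymptotic(M, a, b, c, d, I):
    """Two-term estimate sqrt(M) * K1 + K2 of the mean number of events to failure."""
    r2, r3 = series_coefficients(a, b, c, d, I)
    k1 = math.sqrt(math.pi / (2.0 * (1.0 - 2.0 * r2)))
    k2 = (2.0 * (r3 - r2) + 2.0 / 3.0) / (1.0 - 2.0 * r2) ** 2
    return math.sqrt(M) * k1 + k2


# Parameters of [9] from Section IV: a = b = 0.01646, c = 0.85343, d = 0, I = 128.
for M in (1, 2, 4, 8, 16, 32):
    print(M, round(metf_asymptotic(M, 0.01646, 0.01646, 0.85343, 0.0, 128), 3))
```

Dividing the result by λnM gives the corresponding MTTF estimate of (18).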
We now consider two special cases, viz., f = 1 and c = 1. When f
= 1 (all failures are whole chip failures), the reliability function R(t)
in (15) becomes simply
$$R(t) = e^{-x}(1 + x),$$
which is exactly the same as the $I = \infty$ approximation (16).
Therefore, the MTTF for one row of chips is given exactly by (17),
i.e.,
$$\mathrm{MTTF} = \frac{2}{\lambda n}.$$
Since there are n chips, each with MTTF equal to $1/\lambda$, this formula
simply reflects the fact that the system will fail as soon as two whole
chip failures have occurred. (We already observed this in Section II.)
More interesting is what happens when the number of rows M is
large. In this case the system will fail as soon as one of the M chip
rows has suffered two errors. If we interpret the occurrence of a chip
failure in the ith row as the arrival of a “person” whose “birthday”
occurs on the ith day of an M day year, we see that the expected
number of chip failures needed to cause a system failure is the same
as the expected number of people we need to interview before we find
two with the same birthday. It is known that this “birthday surprise”
number is given by [8]
$$B(M) = M\int_0^\infty \left(e^{-x}(1 + x)\right)^{M} dx. \tag{21}$$
This is in agreement with the theory developed in Section II, since
with f = 1 the formula (15) simplifies to $R(t) = e^{-x}(1 + x)$, and so
by (5) the MTTF is
$$\mathrm{MTTF} = \frac{1}{\lambda n M}\,B(M) \sim \frac{1}{\lambda n M}\left(\sqrt{\frac{\pi M}{2}} + \frac{2}{3}\right), \tag{22}$$
where the asymptotic form follows from (18) with $r_2 = r_3 = 0$, i.e.,
$K_1 = \sqrt{\pi/2}$ and $K_2 = 2/3$.
What this means is that the term $\sqrt{\pi M/2} + 2/3$ is an approximation
to the METF, which in this case is simply the mean number of whole
chip failures before system failure, as well as the “birthday surprise”
number B(M):
$$B(M) \approx \sqrt{\frac{\pi M}{2}} + \frac{2}{3}. \tag{23}$$
The approximation given in (23) is very accurate, too, as the
following table shows.
M      Exact B(M)      Approx. B(M)
       [from (21)]     [from (23)]
1      2.000           1.920
2      2.500           2.439
4      3.219           3.173
8      4.245           4.212
16     5.704           5.680
32     7.774           7.756
365    24.616          24.611
The M = 365 entry shows that on a planet with a 365-day year, one
needs to interview between 24 and 25 people, on the average, before
finding two with the same birthday. Alternatively, in a 365-row
computer memory in which only whole chip failures occur, and in
which each row is SEC-DED protected, the memory will tolerate
between 24 and 25 chip failures, on the average, before failing.
A similar “birthday analysis” can be made for the case c = 1. In
this case only single-cell failures occur; it is as if there were $I^2 M$
independent rows of 1 × 1 chips. The number of single-cell failures
which can be tolerated before a system failure occurs is thus the
“birthday number” $B(I^2 M)$, and the MTTF is $1/\lambda n M$ times this
number. This too is consistent with our
theory, since with c = 1, the formula (15) becomes
$$R(t) = e^{-x}\left(1 + \frac{x}{I^2}\right)^{I^2},$$
and so by (5)
$$\mathrm{MTTF} = \frac{1}{\lambda n}\cdot I^2\cdot\int_0^\infty \left(e^{-x}(1 + x)\right)^{I^2 M} dx = \frac{1}{\lambda n M}\,B(I^2 M).$$
The asymptotic formula (18) is even more accurate in this case, since
with $r(x) = (1 + x/I^2)^{I^2}$ we have
$$r_2 = \frac{1}{2}\cdot\frac{I^2 - 1}{I^2}, \qquad r_3 = \frac{1}{6}\cdot\frac{I^4 - 3I^2 + 2}{I^4},$$
which means that the asymptotic formula (18) works out to be
$$\mathrm{MTTF} \sim \frac{1}{\lambda n M}\left(\sqrt{\frac{\pi I^2 M}{2}} + \frac{2}{3}\right). \tag{24}$$
[...] the approximation (24) provides the only reliable way of obtaining
accurate values for the MTTF.
The relationship between MTTF’s and “birthday surprises” was
originally noted in [6] and [14].
IV. NUMERICAL EVALUATION OF THE MTTF FORMULAS
In this section we will illustrate our results numerically. The plan is
to take three sets of values for the parameters a, b, c, d, f, and I, as
reported in the literature, and for various values of M to compute the
METF (which differs from the MTTF by the factor $1/\lambda M n$, as we
explained in the last section) using four methods. The first method is
direct Monte Carlo simulation. This method is very slow, but it
makes no use of our Poisson assumption and provides a valuable
check of the accuracy of our other methods. The second method is
direct integration of the expression (15):
$$\mathrm{METF} = M\int_0^\infty \left(R(x)\right)^{M} dx.$$
The third method is direct integration of the $I = \infty$ approximation of
the row-reliability function, given in (16):
$$\mathrm{METF} = M\int_0^\infty \left(R_\infty(x)\right)^{M} dx.$$
The final method is to use the two-term asymptotic approximation
given by (18), viz.,
$$\mathrm{METF} = \sqrt{M}\,K_1 + K_2.$$
The sets of parameters are taken from [9], [11], and [15], as
summarized in the following table.
        a          b          c          d        f          I
[9]     0.01646    0.01646    0.85343    0        0.11365    128
[11]    0.047      0.047      0.893      0.013    0          128
[15]    0.12       0.18       0.35       0        0.35       64
Our numerical results for the corresponding METF’s are given in 
Figs. 5-7. Each of the numbers in the “simulation” columns 
represents the average number of (simulated) chip failures before 
(simulated) system failure for 40 000 Monte Carlo trials, reported to
the nearest tenth. 
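The idea behind the simulation columns can be sketched for the simplest case only. The code below is not the simulation used for the figures: it is restricted to whole chip failures (f = 1), and it treats every failure that lands in an already-hit row as fatal, which is the protochip ("birthday") view rather than a chip-by-chip model. The function name, the seed, and the trial counts are illustrative assumptions.

```python
import random


def simulate_metf_whole_chip(M, trials, seed=0):
    """Average number of whole chip failures until some row is hit twice.

    Each failure lands in one of the M rows uniformly at random; the count
    includes the failure that brings the system down.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        hit_rows = set()
        failures = 0
        while True:
            failures += 1
            row = rng.randrange(M)
            if row in hit_rows:
                break          # second failure in this row: system failure
            hit_rows.add(row)
        total += failures
    return total / trials


for M in (1, 4, 32, 365):
    print(M, simulate_metf_whole_chip(M, trials=20000, seed=1))
```

With enough trials the averages should settle near the birthday numbers B(M) tabulated in Section III. A simulation of all five failure types would additionally have to track which cells of which codewords are in error.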
          Simulation   exact (15)   I = ∞ (16)   M large (18)
M = 1     8.3          8.458        8.662        5.142
M = 2     8.9          8.900        9.023        6.260
M = 4     9.8          9.710        9.783        7.842
M = 8     11.4         11.283       11.328       10.079
M = 16    14.1         13.997       14.032       13.243
M = 32    18.4         18.200       18.234       17.717
Fig. 5. Numerical estimates for METF, using data from [9].
          Simulation   exact (15)   I = ∞ (16)   M large (18)
M = 1     20.9         20.774       25.122       20.367
M = 2     26.3         26.286       30.770       25.905
M = 4     33.8         34.058       39.145       33.737
M = 8     45.2         45.067       51.263       44.813
M = 16    60.9         60.671       68.589       60.477
M = 32    83.9         82.773       93.224       82.630
Fig. 6. Numerical estimates for METF, using data from [11].
          Simulation   exact (15)   I = ∞ (16)   M large (18)
M = 1     2.8          2.793        2.826        2.506
M = 2     3.4          3.359        3.384        3.163
M = 4     4.2          4.225        4.248        4.092
M = 8     5.5          5.496        5.521        5.406
M = 16    7.3          7.326        7.356        7.263
M = 32    9.9          9.934        9.972        9.890
Fig. 7. Numerical estimates for METF, using data from [15].
We see in every case that our “exact” expression (15) gives results
that are indistinguishable from the simulation results. The $I = \infty$
approximation is good, but not as good, and is consistently high. This
is because with $I = \infty$ single-cell errors become negligible, as do, for
example, pairs of row errors. The asymptotic estimate is always low,
because the complete asymptotic expansion of the METF involves
positive terms which we have neglected.
We feel that these results justify our confidence in the accuracy of 
the Poisson approximation. An independent verification of the 
accuracy of the Poisson approximation, based on a mathematical 
analysis, was given in [2]. 
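The second method of this section, direct integration of (15), can be sketched as follows. The code assumes the form of (15) used throughout, with x = λnt, and evaluates METF = M ∫ R(x)^M dx by composite Simpson's rule on a truncated interval. The truncation point `upper`, the step count, and the function names are our own choices and would need adjusting for parameter sets with a much slower decay.

```python
import math


def row_reliability(x, a, b, c, d, f, I):
    """Row reliability R of (15) as a function of x = lambda * n * t."""
    q = 1.0 + c * x / I**2
    q_row = q ** I
    bracket = (
        (q_row + a * x / I) ** I
        + (q_row + b * x / I) ** I
        - q ** (I * I)
        + d * x * q ** ((I - 1) ** 2)
        + f * x
    )
    return math.exp(-x) * bracket


def metf_exact(M, a, b, c, d, f, I, upper=200.0, steps=20000):
    """M times the integral of R(x)^M over [0, upper]; steps must be even."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        if i == 0 or i == steps:
            weight = 1
        else:
            weight = 4 if i % 2 else 2
        total += weight * row_reliability(i * h, a, b, c, d, f, I) ** M
    return M * total * h / 3.0


# Parameters of [9]: a = b = 0.01646, c = 0.85343, d = 0, f = 0.11365, I = 128.
for M in (1, 2, 4):
    print(M, round(metf_exact(M, 0.01646, 0.01646, 0.85343, 0.0, 0.11365, 128), 3))
```

The printed values can be compared with the "exact (15)" column of Fig. 5, and setting f = 1 reproduces the birthday numbers of Section III.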
V. A SURVEY OF RELATED WORK: CONCLUSION 
The literature contains many papers devoted in whole or part to the 
subject of this paper, including two survey papers ([9] and [16]). In 
this section we will attempt to describe how our work adds to what is 
already known. 
The earliest work on ECC memory reliability [10] deals only with
type F chip failures, i.e., whole chip failures. Later models,
including those in [9] and [11], extended the types of failure modes to
include types A, B, C, and D, but as pointed out in [16], it is
implicitly assumed in these models that the failure types are 
“nested.” That is, there is a hierarchy of failure types, such that each 
type is a subset of the previous type. For example, single-cell, row, 
and whole chip failures are nested, but no nested hierarchy can 
contain both row and column failures. Since one row and one column 
failure in a row of chips will cause a memory system failure, it is 
important to have a model that handles “crossed” failure types, e.g., 
failure types A and B simultaneously, as is done in [15]. However,
[15] does not consider failure type D. Indeed, our model is to our
knowledge the only one that handles all five failure types A, B, C, D,
and F simultaneously.
We believe that the key innovation of our paper, however, is the 
introduction of the Poisson approximation. As we have seen, this 
approximation allows us to obtain simple formulas for the system 
reliability without sacrificing significant accuracy. And although our 
main formula (15) may seem excessively complex, when compared to
the corresponding formulas in [9], [11], and [15], it is very simple
indeed. As we have demonstrated in Section IV, it can be easily
programmed to give fast and accurate reliability estimates that can be
used by memory system designers.
REFERENCES 
[1] J. M. Ayache and M. Diaz, “A reliability model for error corrective
memory system,” IEEE Trans. Reliability, vol. R-28, Oct. 1979.
[2] M. Blaum, Caltech Ph.D. thesis, 1985, appendix 2.
[3] N. G. de Bruijn, Asymptotic Methods in Analysis. Amsterdam,
The Netherlands: North-Holland, 1961.
[4] C. L. Chen and M. Y. Hsiao, “Error-correcting codes for semiconductor
memory applications: A state-of-the-art review,” IBM J. Res.
Develop., vol. 28, pp. 124-134, Mar. 1984.
[5] R. M. F. Goodman and R. J. McEliece, “Lifetime analyses of error-control
coded semiconductor RAM systems,” Proc. IEE, part E, vol.
3, pp. 81-85, 1982.
[6] R. M. F. Goodman and R. J. McEliece, “Hamming codes, computer
memories, and the birthday surprise,” in Proc. 20th Allerton Conf.
Commun., Control, Comput., 1982, pp. 672-679.
[7] M. Y. Hsiao, “A class of optimal minimum odd-weight-column SEC-DED
codes,” IBM J. Res. Develop., vol. 14, pp. 395-401, July 1970.
[8] M. S. Klamkin and D. J. Newman, “Extensions of the birthday
surprise,” J. Comb. Theory, vol. 3, pp. 279-282.
[9] D. Y. Koo and H. B. Chenowith, “Choosing a practical MTTF model
for ECC memory chip,” in Proc. 1984 IEEE Reliability Maintainability
Symp., pp. 255-261.
[10] L. Levine and W. Meyers, “Semiconductor memory reliability with
error detecting and correcting code,” Computer, vol. 9, pp. 43-50,
Oct. 1976.
[11] D. Marston, “Memory system reliability with ECC,” Intel Appl. Note
AP-73, Intel Corp., 1980.
[12] T. C. May, “Soft errors in VLSI: Present and future,” IEEE Trans.
Components, Hybrids, Manuf., Technol., vol. CHMT-2, pp. 377-387,
Dec. 1979.
[13] T. C. May and M. H. Woods, “A new physical mechanism for soft
errors in dynamic memories,” in Proc. 1978 Int. Reliability Phy.
Symp., Apr. 1978, pp. 33-40.
[14] R. J. McEliece, “The reliability of computer memories,” Sci.
Amer., vol. 252, pp. 88-95, Jan. 1985.
[15] W. F. Mikhail, R. W. Bartoldus, and R. A. Rutledge, “The reliability
of memory with single-error correction,” IEEE Trans. Comput., vol.
C-31, pp. 560-564, 1982.
[16] R. A. Rutledge, “Models for the reliability of memory with ECC,” in
Proc. 1985 IEEE Reliability Maintainability Symp., pp. 57-62.
