I. INTRODUCTION
All modern computers have memories built from VLSI RAM chips. Individually, these devices are highly reliable, and any single chip may function for decades before failing. However, when many chips are combined in a single large computer memory, the expected wait until at least one of them fails can be as small as a few hours, or even less. For this reason, almost all large computer memories are protected by single-error correcting, double-error detecting (SEC-DED) codes. In the usual jargon of coding theory, a SEC-DED code is just a shortened $d = 4$ Hamming code; the shortening is usually done in a hardware-efficient manner devised by Hsiao [7]. A recent paper by Chen and Hsiao [4] gives a good survey of how SEC-DED coding is actually implemented in computer memories, but we shall give a brief overview here.
Most often, the memory is organized as an $M \times n$ rectangular array of chips, as shown in Fig. 1. The first $k$ chips in each row are used to store information, and the remaining $r = n - k$ chips are parity-check chips, needed for the SEC-DED coding. As we shall see, this means that an $(n, k)$ $d = 4$ Hamming code is being used to protect the memory. A typical small example is the standard one-megabyte board sold for VAX computers, which is organized as $M = 4$ rows of 64K RAM chips, with each row containing $k = 32$ data chips and $r = 7$ parity chips. The corresponding SEC-DED code is then a $(39, 32)$ $d = 4$ shortened Hamming code.
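The check-chip count in this example can be sanity-checked against the textbook counting bound for extended Hamming codes: a SEC-DED code with $k$ data bits needs the smallest $r$ with $2^{r-1} \ge k + r$. The sketch below assumes that standard bound (it is not taken from this paper), and the function name `secded_check_bits` is purely illustrative.

```python
def secded_check_bits(k: int) -> int:
    """Smallest r with 2**(r-1) >= k + r (assumed textbook bound for SEC-DED)."""
    r = 2
    while 2 ** (r - 1) < k + r:
        r += 1
    return r


if __name__ == "__main__":
    # Print the (n, k) parameters for a few common data widths.
    for k in (8, 16, 32, 64):
        r = secded_check_bits(k)
        print(f"k={k:3d}  r={r}  code=({k + r}, {k})")
```

For $k = 32$ this gives $r = 7$, in agreement with the $(39, 32)$ code described above.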
We assume that each chip has a one-bit wide external organization but is organized internally as an $l \times l$ square array of bits. For example, the standard 4164 64K chips have an external organization of 65536 single-bit locations, with $l = 256$. We also assume that the $n$ bits stored in corresponding cells in one row of the memory form a codeword in the $(n, k)$ code, as shown in Fig. 2.
We now discuss chip failures. By a chip "failure" we mean any [...] [13] shows that by far the most common type of chip failure is a soft error of a single cell on a chip. Such errors are caused by stray alpha particles which can, under certain circumstances, change a stored "1" to a "0" without otherwise damaging the chip [14]. However, several kinds of hard failure have been observed. A single-cell failure, for example, can also occur as a hard error. There are also several kinds of hard failures which cause multiple cell errors. A row failure, which can be caused by a failure of one of the chip's row drivers, causes all $l$ cells in one row of the affected chip to fail. Similarly, a column failure, which can be caused by a failure of one of the chip's column sense amplifiers or column decoders, causes all $l$ cells in a column of the chip to fail. A short circuit at a memory cell can cause a row-column failure. Finally, a catastrophic whole chip failure may occur, in which all the cells of a chip fail. All five of these failure types are illustrated in Fig. 3. (The letters A, B, C, D, and F will be referred to in Section II.)

The organization of the SEC-DED code (assuming "by-one" memory chips) guarantees that no chip failure, however catastrophic, can cause two errors in any $n$-bit codeword, and so the memory system will survive any single-chip failure. In fact there are many possible combinations of multiple chip failures that can also be tolerated. Eventually, however, we expect that so many chip failures will have occurred that some one of the individual $n$-bit codewords will have suffered two or more errors. When this happens, we declare a memory system failure.

If we start with a brand new memory system of the kind we have been describing, the time until a memory system failure occurs will be a random variable. In this paper we will derive accurate and easily evaluated estimates for two of the most important quantitative measures of this random variable, the system reliability function $R_{\text{sys}}(t)$ and the mean time to failure (MTTF). The reliability function $R_{\text{sys}}(t)$ represents the probability that the system will not have failed after $t$ hours, and the MTTF represents the average length of time the system will function before a memory system failure occurs.

In the next section (Section II) we will describe the probabilistic model which is commonly used to describe the occurrences of the various types of chip failures, and see how it leads, via a Poisson approximation, to a reasonably simple formula for $R_{\text{sys}}(t)$ and for MTTF. In Section III we will give some useful approximations to the exact formula found in Section II. In Section IV we will make numerical comparisons of the predictions of our formulas to the results of computer simulation. The analytic predictions will be seen to be in very close agreement with the simulations, thereby justifying our confidence in the accuracy of the Poisson approximation made in Section II. Finally, in Section V, we will compare our results to those which have already appeared in the literature.

Our basic quantitative assumption about individual chip failures is that they are exponentially distributed. This means that the reliability of a given chip, i.e., the probability that it has not failed after $t$ hours, is equal to $e^{-\lambda t}$, where $\lambda$ is a constant that must be found experimentally [11]. We will need to distinguish between the five types of chip errors depicted in Fig. 3, and so for future reference, we will use the following notation.

A: row failure.
B: column failure.
C: single-cell failure.
D: row-column failure.
F: whole chip failure.

We assume that, if a given chip fails, the conditional probabilities that the failure will be of type A, B, C, D, or F are $a$, $b$, $c$, $d$, and $f$, respectively. The probabilities $a$, $b$, $c$, $d$, and $f$ also have to be determined experimentally. We further assume that failures on one chip are independent of failures on all other chips.
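Given all these assumptions, it is in principle possible to calculate the row reliability function $R(t)$, defined as follows.

$$R(t) = \Pr\{\text{no codeword in a given row of chips has suffered two or more errors after } t \text{ hours}\}. \tag{1}$$

For example, if only whole chip failures occur, then

$$R(t) = e^{-n\lambda t} + n e^{-(n-1)\lambda t}\left(1 - e^{-\lambda t}\right), \tag{2}$$

which is just the probability that the given row has suffered either zero or one whole chip failure after $t$ hours [10]. Since the $M$ rows are assumed to fail independently, the reliability of the entire system of $Mn$ chips is

$$R_{\text{sys}}(t) = R(t)^M. \tag{3}$$

The MTTF of a system whose reliability function is $r(t)$ is well known to be given by the formula

$$\text{MTTF} = \int_0^\infty r(t)\,dt, \tag{4}$$

and so for a computer memory system of the kind we are considering,

$$\text{MTTF} = \int_0^\infty R(t)^M\,dt. \tag{5}$$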
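To make the relation between the row reliability and the system MTTF concrete, the sketch below integrates $R(t)^M$ numerically. It assumes the whole-chip-only case, where a row survives exactly when at most one of its $n$ independent chips has failed. The failure rate `lam`, the integration cutoff, and the step count are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
import math


def row_reliability_whole_chip(t, n, lam):
    """P(at most one of n independent exponential chips has failed by time t)."""
    p_alive = math.exp(-lam * t)
    return p_alive ** n + n * p_alive ** (n - 1) * (1.0 - p_alive)


def system_mttf(row_reliability, M, t_max, steps=20000):
    """Composite Simpson estimate of the integral of R(t)**M over [0, t_max]."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = t_max / steps
    total = row_reliability(0.0) ** M + row_reliability(t_max) ** M
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * row_reliability(i * h) ** M
    return total * h / 3.0


if __name__ == "__main__":
    n, lam = 39, 1e-6  # 39 chips per row; lam is an assumed failure rate per hour
    R = lambda t: row_reliability_whole_chip(t, n, lam)
    t_max = 60.0 / (n * lam)  # the integrand is negligible beyond this point
    for M in (1, 4, 64):
        print(M, system_mttf(R, M, t_max))
```

For $M = 1$ the estimate should agree with the closed form $1/(n\lambda) + 1/((n-1)\lambda)$: the mean wait for the first failure among $n$ chips plus the mean wait for the next failure among the remaining $n - 1$.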
Thus, everything we are interested in depends in a simple manner on the row-reliability function $R(t)$. Unfortunately, however, an exact formula for $R(t)$ proves to be extremely complicated. (For example, Mikhail et al. [15] give a recursive method for computing it when errors of types A, B, C, and F are present.) This difficulty has led us to make the following simplifying assumption. We no longer view a row of chips as consisting of $n$ separate chips, but "end on," so that the failures on all $n$ chips are superimposed onto a single "protochip." We also make an important assumption about how protochip failures are distributed, which we call the Poisson assumption:

In each protochip, the failures of types A, B, C, D, and F form independent Poisson processes of intensities $an\lambda$, $bn\lambda$, $cn\lambda$, $dn\lambda$, and $fn\lambda$, respectively.

Under this assumption, the row reliability function $R(t)$ is just the probability that at time $t$ no cell on the protochip has suffered two or more errors. As we will see, this assumption greatly simplifies the formulas for $R(t)$ without introducing significant inaccuracies. For example, if we again consider a situation in which only whole chip failures occur, then under the Poisson assumption the number of whole chip failures in a given row is a Poisson process of intensity $\lambda n$, and so the row-reliability function, i.e., the probability of zero or one whole chip failures after $t$ hours, is given by

$$R(t) = e^{-\lambda n t}(1 + \lambda n t). \tag{6}$$

Formulas (6) and (2) are very similar, and for example if we compute the MTTF of a single row of chips using (2) and (4) we obtain

$$\text{MTTF} = \frac{1}{n\lambda} + \frac{1}{(n-1)\lambda},$$

whereas, if we use the Poisson assumption via (6) and (4) we get instead

$$\text{MTTF} = \frac{2}{n\lambda}.$$

In Section IV we will give further comparisons between exact MTTF's and those obtained by the Poisson protochip method.

Using the Poisson protochip, we can now derive a formula for the row-reliability function $R(t)$, the probability that the memory system is still working after $t$ hours. It will be convenient to classify the various tolerable combinations of protochip failures into the following four categories.

$E_1$: only row and single-cell failures.
$E_2$: only column and single-cell failures.
$E_3$: one row-column failure and single-cell failures.
$E_4$: one whole chip failure.

These "tolerable failure configurations" are shown in Fig. 4. We note that the configurations $E_1$ and $E_2$ are not disjoint, so we also introduce $E_{12}$, defined as

$E_{12} = E_1 \cap E_2$: only single-cell failures.

Then the row-reliability function defined by (1) is

$$R(t) = \Pr\{E_1\} + \Pr\{E_2\} - \Pr\{E_{12}\} + \Pr\{E_3\} + \Pr\{E_4\}. \tag{7}$$

We now proceed to calculate the five probabilities in (7).

We begin by calculating $\Pr\{E_1\}$, which is the probability that the protochip has suffered no errors of type B, D, or F, and that the errors of type A and C are "tolerable," i.e., that no cell on the protochip has suffered two or more errors. To do this, we focus our attention on a single row of cells on the protochip, say row $i$. Since each protochip has $l$ rows, our Poisson assumption implies that the row failures in this particular row form a Poisson process of intensity $an\lambda/l$. Therefore, the probability that there have been no row failures in row $i$ at time $t$ is $\exp(-an\lambda t/l)$, and the probability that there has been exactly one row failure in this row is $\exp(-an\lambda t/l)\,(an\lambda t/l)$.

To simplify the notation, from now on we will use the parameter $x$ defined by

$$x = n\lambda t, \tag{8}$$

so that the probabilities of zero and one failures in row $i$ at time $t$ are given by $e^{-ax/l}$ and $e^{-ax/l}(ax/l)$, respectively. Next we focus on a particular cell within the $i$th row, say cell $(i, j)$. Since each chip contains $l^2$ cells, our Poisson assumption implies that the single-cell failures in this particular cell form a Poisson process with intensity $cn\lambda/l^2$. Therefore, the probabilities that the $(i, j)$th cell has suffered zero or one single-cell errors at time $t$ are $e^{-cx/l^2}$ and $e^{-cx/l^2}(cx/l^2)$, respectively. It follows that the probability that at time $t$ no cell in row $i$ has suffered a single-cell error is

$$e^{-cx/l},$$

while the probability that no cell in row $i$ has suffered more than one single-cell error is

$$e^{-cx/l}\left(1 + \frac{cx}{l^2}\right)^{l}.$$

We now wish to calculate the probability that no cell within the $i$th row has suffered two or more errors of type A or C. This probability is the sum of the following two probabilities:

The row has not failed and there is at most one single-cell error in each of the $l$ cells.
The row has failed and there are no single-cell errors in any of the $l$ cells.

This probability is, therefore,

$$e^{-(a+c)x/l}\left(\left(1 + \frac{cx}{l^2}\right)^{l} + \frac{ax}{l}\right). \tag{9}$$

Finally, the probability $\Pr\{E_1\}$ is the probability that the type A and C errors in all $l$ rows of the protochip will be tolerable, which is just the $l$th power of (9), multiplied by the probability that there are no errors of type B, D, or F, which is $\exp(-(b + d + f)x)$. After some algebra we find that this product is

$$\Pr\{E_1\} = e^{-x}\left(\left(1 + \frac{cx}{l^2}\right)^{l} + \frac{ax}{l}\right)^{l}. \tag{10}$$

By replacing "rows" with "columns" in the preceding argument, we find that

$$\Pr\{E_2\} = e^{-x}\left(\left(1 + \frac{cx}{l^2}\right)^{l} + \frac{bx}{l}\right)^{l}. \tag{11}$$

[...]

Finally, to compute the row-reliability function $R(t)$, we combine (7) with (10), (11), (12), (13), and (14), and obtain the following somewhat intimidating expression.
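Under the Poisson assumption, the three remaining probabilities and the combined expression can be worked out along the same lines as $\Pr\{E_1\}$. The display below is a reconstruction from the definitions of $E_{12}$, $E_3$, and $E_4$ together with the inclusion-exclusion sum over the five events, not a quotation; in particular, assigning the numbers (12) to (15) to these lines is an assumption. For $E_3$ it assumes exactly one row-column failure, no single-cell error in the $2l - 1$ cells that failure covers, and at most one single-cell error in each of the remaining $(l - 1)^2$ cells.

```latex
\begin{align}
\Pr\{E_{12}\} &= e^{-x}\left(1+\frac{cx}{l^{2}}\right)^{l^{2}}, \tag{12}\\
\Pr\{E_{3}\}  &= e^{-x}\,dx\left(1+\frac{cx}{l^{2}}\right)^{(l-1)^{2}}, \tag{13}\\
\Pr\{E_{4}\}  &= e^{-x}\,fx, \tag{14}\\
R(t) &= e^{-x}\Biggl[
   \left(\left(1+\frac{cx}{l^{2}}\right)^{l}+\frac{ax}{l}\right)^{l}
 + \left(\left(1+\frac{cx}{l^{2}}\right)^{l}+\frac{bx}{l}\right)^{l}
 - \left(1+\frac{cx}{l^{2}}\right)^{l^{2}} \notag\\
 &\qquad\quad
 + dx\left(1+\frac{cx}{l^{2}}\right)^{(l-1)^{2}} + fx \Biggr]. \tag{15}
\end{align}
```

Two checks support this form: setting $f = 1$ reduces it to $e^{-x}(1 + x)$, and letting $l \to \infty$ gives $e^{-x}\left(e^{(a+c)x} + e^{(b+c)x} + e^{cx}(dx - 1) + fx\right)$.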
This expression is our main result. The rest of the paper will be devoted to exploring its consequences.
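The row reliability can be evaluated directly by summing the five event probabilities of the Poisson protochip model. The sketch below is a reconstruction under that model, not the authors' program: the closed forms used for $\Pr\{E_{12}\}$, $\Pr\{E_3\}$, and $\Pr\{E_4\}$ are derived from the event definitions, and the function names and the failure-type mix in the demo are hypothetical.

```python
import math


def row_reliability(x, l, a, b, c, d, f):
    """Row reliability at x = n*lambda*t for an l-by-l protochip (reconstructed form)."""
    g = 1.0 + c * x / l ** 2             # per-cell factor: zero or one single-cell error
    p_e1 = (g ** l + a * x / l) ** l     # only row and single-cell failures
    p_e2 = (g ** l + b * x / l) ** l     # only column and single-cell failures
    p_e12 = g ** (l * l)                 # only single-cell failures
    p_e3 = d * x * g ** ((l - 1) ** 2)   # one row-column failure plus single cells
    p_e4 = f * x                         # one whole chip failure
    return math.exp(-x) * (p_e1 + p_e2 - p_e12 + p_e3 + p_e4)


def row_reliability_limit(x, a, b, c, d, f):
    """The l -> infinity limit of the expression above."""
    return math.exp(-x) * (
        math.exp((a + c) * x)
        + math.exp((b + c) * x)
        + math.exp(c * x) * (d * x - 1.0)
        + f * x
    )


if __name__ == "__main__":
    mix = dict(a=0.1, b=0.1, c=0.6, d=0.1, f=0.1)  # assumed mix, sums to 1
    for x in (0.1, 0.5, 1.0, 2.0):
        print(x, row_reliability(x, 256, **mix), row_reliability_limit(x, **mix))
```

With $l = 256$ the two columns should differ only slightly, which illustrates why the large-$l$ limit is a useful simplification.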
III. ASYMPTOTIC APPROXIMATIONS AND SPECIAL CASES

Although the expression (15) for $R(t)$ is simple enough for numerical work, it is possible to give approximations to it that will yield additional insight into the problem. For example, in most modern chips the number $l$ (the number of storage cells per row or column) is quite large, and this suggests that the limiting behavior of (15) as $l \to \infty$ may be interesting to consider. In fact, this limit is given by the formula

$$R_\infty(t) = e^{-x}\left(e^{(a+c)x} + e^{(b+c)x} + e^{cx}(dx - 1) + fx\right). \tag{16}$$

This formula is simple enough to integrate explicitly, and so by (5) [...]

Such a restriction is understandable, since if, e.g., $c = 1$, an $l = \infty$ chip would have an infinite MTTF, since the probability of any given cell position being hit twice would be zero.
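The explicit integration can be carried out term by term. The following is a worked reconstruction for a single row ($M = 1$), using $x = n\lambda t$ and $a + b + c + d + f = 1$; it assumes $b + d + f > 0$, $a + d + f > 0$, and $c < 1$ so that every integral converges, and its layout may differ from the paper's own (17).

```latex
\begin{align*}
\int_0^\infty R_\infty(t)\,dt
 &= \frac{1}{n\lambda}\int_0^\infty
    \left(e^{-(b+d+f)x} + e^{-(a+d+f)x} + e^{-(1-c)x}(dx-1) + fx\,e^{-x}\right)dx \\
 &= \frac{1}{n\lambda}\left(
    \frac{1}{b+d+f} + \frac{1}{a+d+f} + \frac{d}{(1-c)^{2}} - \frac{1}{1-c} + f\right).
\end{align*}
```

As a check, $f = 1$ gives $(1 + 1 + 0 - 1 + 1)/(n\lambda) = 2/(n\lambda)$, the whole-chip-only value.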
Another interesting case to consider is the case of a large number of rows. (A CRAY-1 [...]

The constants $K_1$ and $K_2$ in (18) [...]

We now consider two special cases, viz., $f = 1$ and $c = 1$. When $f = 1$ (all failures are whole chip failures), the reliability function $R(t)$ in (15) becomes simply $R(t) = e^{-x}(1 + x)$, which is exactly the same as the $l = \infty$ approximation (16). Therefore, the MTTF for one row of chips is given exactly by (17), i.e.,

$$\text{MTTF} = \frac{2}{\lambda n}.$$
Since there are $n$ chips, each with MTTF equal to $1/\lambda$, this formula simply reflects the fact that the system will fail as soon as two whole chip failures have occurred. (We already observed this in Section II.) More interesting is what happens when the number of rows $M$ is large. In this case the system will fail as soon as one of the $M$ chip rows has suffered two errors. If we interpret the occurrence of a chip failure in the $i$th row as the arrival of a "person" whose "birthday" occurs on the $i$th day of an $M$-day year, we see that the expected number of chip failures needed to cause a system failure is the same as the expected number of people we need to interview before we find two with the same birthday. It is known that this "birthday surprise" number is given by [8] [...] and so by (5) [...]

The asymptotic formula (18) [...]

What this means is that the term $\sqrt{\pi M/2} + 2/3$ is an approximation to the METF, which in this case is simply the mean number of whole chip failures before system failure, as well as the "birthday surprise" number $B(M)$:

$$B(M) \approx \sqrt{\frac{\pi M}{2}} + \frac{2}{3}. \tag{23}$$
The approximation given in (23) is very accurate, too, as the following table shows. [...]

The $M = 365$ entry shows that on a planet with a 365-day year, one needs to interview between 24 and 25 people, on the average, before finding two with the same birthday. Alternatively, in a 365-row computer memory in which only whole chip failures occur, and in which each row is SEC-DED protected, the memory will tolerate between 24 and 25 chip failures, on the average, before failing.

A similar "birthday analysis" can be made for the case $c = 1$. In this case only single-cell failures occur; it is as if there were $l^2 M$ independent rows of $1 \times 1$ chips. The mean number of single-cell failures which can be tolerated before a system failure occurs is thus the "birthday number" $B(l^2 M)$, and the MTTF is $1/(\lambda n M)$ times this number.

This too is consistent with our theory, since with $c = 1$, the formula (15) becomes

$$R(t) = e^{-x}\left(1 + \frac{x}{l^2}\right)^{l^2}.$$
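The birthday numbers quoted above are easy to reproduce. The sketch below computes the expected number of arrivals until two share one of $M$ equally likely days, using the standard identity that an expectation equals the sum over $k \ge 0$ of the probability that the first $k$ arrivals are all distinct, and compares it with the approximation $\sqrt{\pi M/2} + 2/3$. The function names are illustrative, and the values of $M$ in the demo are arbitrary choices rather than the entries of the paper's table.

```python
import math


def birthday_number(M: int) -> float:
    """Expected number of arrivals until two share one of M equally likely days."""
    total, p_distinct = 0.0, 1.0
    for k in range(M + 1):
        total += p_distinct        # P(first k arrivals are all distinct)
        p_distinct *= (M - k) / M  # extend the distinct run to k + 1 arrivals
    return total


def birthday_approx(M: int) -> float:
    """Two-term approximation sqrt(pi*M/2) + 2/3."""
    return math.sqrt(math.pi * M / 2.0) + 2.0 / 3.0


if __name__ == "__main__":
    for M in (1, 10, 100, 365, 1000):
        print(M, birthday_number(M), birthday_approx(M))
```

For whole-chip-only failures, dividing `birthday_number(M)` by $\lambda n M$ converts the mean number of tolerated failures into an MTTF, since chip failures arrive at total rate $\lambda n M$.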
The final method is to use the two-term asymptotic approximation given by (18), viz., [...]

We feel that these results justify our confidence in the accuracy of the Poisson approximation. An independent verification of the accuracy of the Poisson approximation, based on a mathematical analysis, was given in [2].

V. A SURVEY OF RELATED WORK: CONCLUSION

The literature contains many papers devoted in whole or part to the subject of this paper, including two survey papers ([9] and [16]). In this section we will attempt to describe how our work adds to what is already known.
The earliest work on ECC memory reliability [10] deals only with type F chip failures, i.e., whole chip failures. Later models, including those in [9] and [11], extended the types of failure modes to include types A, B, C, and D, but as pointed out in [16], it is implicitly assumed in these models that the failure types are "nested." That is, there is a hierarchy of failure types, such that each type is a subset of the previous type. For example, single-cell, row, and whole chip failures are nested, but no nested hierarchy can contain both row and column failures. Since one row and one column failure in a row of chips will cause a memory system failure, it is important to have a model that handles "crossed" failure types, e.g., failure types A and B simultaneously, as is done in [15]. However, [15] does not consider failure type D. Indeed, our model is to our knowledge the only one that handles all five failure types A, B, C, D, and F simultaneously.
We believe that the key innovation of our paper, however, is the introduction of the Poisson approximation. As we have seen, this approximation allows us to obtain simple formulas for the system reliability without sacrificing significant accuracy. And although our main formula (15) may seem excessively complex, when compared to the corresponding formulas in [9], [11], and [15], it is very simple indeed. As we have demonstrated in Section IV, it can be easily programmed to give fast and accurate reliability estimates that can be used by memory system designers.
