Lifetime analyses of error-control coded semiconductor RAM systems by Goodman, R. M. F. & McEliece, R. J.
R.M.F. Goodman, B.Sc, Ph.D., C.Eng., and Prof. R.J. McEliece, B.Sc, Ph.D.
Indexing terms: Codes, Computer applications, Memory systems
Abstract: The paper is concerned with developing quantitative results on the lifetime of coded random-access
semiconductor memory systems. Although individual RAM chips are highly reliable, when large numbers of
chips are combined to form a large memory system, the reliability may not be sufficiently high for the given
application. In this case, error-correction coding is used to improve the reliability and hence the lifetime of
the system. Formulas are developed which will enable the system designer to calculate the improvement in
lifetime (over an uncoded system) for any particular coding scheme and size of memory. This will enable the
designer to see if a particular memory system gives the required reliability, in terms of hours of lifetime, for
the particular application. In addition, the designer will be able to calculate the percentage of identical
systems that will, on average, last a given length of time.
List of principal symbols
$\lambda$ = chip failure rate
$R$ = chip reliability, the probability of correct operation of a chip
$Q$ = probability of chip failure
$R_R$ = row reliability
$R_S$ = system reliability
$\alpha$ = probability of system operating correctly, equal to $R_S$
$n$ = number of chips in a coded row
$k$ = number of data-carrying chips in a row
$r$ = error correction power of code
$m$ = number of chip rows in memory system
$T_r(\alpha)$ = system lifetime to probability level $\alpha$, with $r$-bit error correction
$T_r(1/2)$ = median time to failure (mTTF)
$\mu_r(\gamma)$ = a solution of the Poisson distribution
$C_r(\alpha)$ = coding gain
1 Introduction
The continued decrease in cost of semiconductor random
access memory (RAM) chips makes the construction of very
large memory arrays an economic possibility. Not only can
such arrays be used to form the basic core memory of large
computers, but also small size microcomputers can benefit
from large memory arrays for applications such as speech
processing, picture processing, intelligent terminals and data
bases etc. If large memory systems such as these are to be
increasingly used, it is essential that they should be reliable,
and not require frequent servicing. Although an individual
RAM chip may have a quoted failure rate of better than $10^{-6}$
failures/h, when large numbers of these chips are combined
to form a total system the reliability of the system falls
exponentially with the number of chips. In these cases, it is
essential that some form of error correction coding (ECC) be
used to protect against data loss.
LSI dynamic RAMs generally have a low failure rate of the
order of $10^{-7}$ to $10^{-6}$ failures/h. In addition, there is a
'learning curve' for devices as they appear on the market,
which means that the larger RAMs have a lower reliability
than the smaller devices. For example, the current industry
standard 16k-by-1 bit dynamic RAM has a quoted failure
rate of approximately $3 \times 10^{-7}$ failures/h [1], whereas the
64k RAMs just appearing will initially have a failure rate
much worse than this. Thus, although system reliability is
increased by using larger RAMs, the need for ECC remains.
Paper 1633E, first received 16th March and in final form 26th August
1981
Dr. Goodman is with the Department of Electronic Engineering,
University of Hull, Hull, England. Prof. McEliece is with the
Coordinated Science Laboratory, University of Illinois,
Urbana-Champaign, USA
The need for improved system reliability is of particular
concern if a large memory system is to be mass produced.
For example, if a memory system has a 10% probability of
failure after one year, we may turn the argument 'around'
and say that on average 10% of the manufacturer's memory
systems are only going to last one year. This would clearly
be unacceptable in many applications.
In this paper, we assess the improvement in memory-system
lifetime that can be obtained by using error-control
coding. We are not concerned here with the particular form of
coding used, as this is dealt with in the literature [2-6].
We derive general results that will enable a system designer
to get an accurate impression of what coding will do for any
particular memory system. Firstly, we consider chip failures
and the need for ECC; next, system lifetime is defined and
a general formula for the improvement in lifetime of a coded
system is derived. This is then developed into a formula for
the median time to failure (mTTF). We then consider asymptotic
results for the case of large memory systems. Finally,
we develop a formula for calculating the time at which any
given probability of system failure exists.
2 Chip failures and need for ECC
The predominant failure mode within a RAM chip is a 'stuck-at'
fault. In this mode, either an individual cell, or a whole row
or column within the X-Y memory array, appears to be
stuck at a particular value, 0 or 1, on read. Alternatively, the
chip can fail catastrophically, so that every location appears
'stuck'. Thus errors are stationary in time and chips do not
'repair themselves', in the sense that the error condition does
not pass.
Let us assume that the chip failure rate is given by $\lambda$ (e.g.
$\lambda = 10^{-6}$ failures/h), where a failure is any 'stuck-at' fault
which prevents the chip operating correctly. If $t = 0$ is a
point in time at which the chip is functioning correctly, then,
assuming constant failure rates [2], the probability of correct
operation at time $t$ is given by
$$R = e^{-\lambda t} \tag{1}$$
where $R$ is called the 'chip reliability'. The probability of
chip failure is given by
$$Q = 1 - R = 1 - e^{-\lambda t} \tag{2}$$
If we now consider an uncoded memory with a total of $D$
chips, then the probability of correct operation of the system
is $R^D$, and the probability of system failure at time $t$ is
$(1 - R^D)$.
IEE PROC., Vol. 129, Pt. E, No. 3, MAY 1982 0143-7062/82/030081 + 05 $01.50/0
If we assume that memory chips fail independently at
random, then the system mean time to failure [3] (MTTF) is
$$(\mathrm{MTTF}) = \frac{1}{\lambda D} \tag{3}$$
Let us insert some numbers to get accustomed to these
equations. Consider a 4-megaword RAM system operating with
a 16-bit microcomputer. Assuming that the memory is built
out of industry standard 4116 type 16k-by-1 bit dynamic
RAMs, then the system takes the form of an array with
16 columns and $m = 4096/16 = 256$ chip rows, and $D = 4096$
devices. Given a device failure rate of $\lambda = 3 \times 10^{-7}$ failures/h,
then from eqn. 3, the system MTTF is 813 h or approximately
one month. Alternatively, from eqn. 2 we find that the
probability of system failure, $1 - R^D$, at the 48 h point is 6%.
This is clearly unacceptable for many applications, and implies
that, on average, 6% of these systems will function correctly
for no more than 48 h.
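The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch rather than part of the original analysis; the variable names are illustrative, and the 48 h failure probability is evaluated as $1 - R^D$ with $R$ taken from eqn. 1.

```python
import math

lam = 3e-7   # chip failure rate in failures/h, as quoted in the text
D = 4096     # number of chips in the uncoded array (16 columns x 256 rows)

# Eqn. 3: mean time to failure of the uncoded system
mttf = 1.0 / (lam * D)

# Probability of system failure at t = 48 h: 1 - R^D with R = exp(-lam * t)
p_fail_48h = 1.0 - math.exp(-lam * D * 48.0)

print(f"MTTF = {mttf:.1f} h")
print(f"P(system failure by 48 h) = {p_fail_48h:.3f}")
```

The two printed values correspond to the 'approximately one month' and '6%' figures quoted above.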
3 System lifetimes
There are several ways to assess the improvement in 'lifetime'
of a memory system due to coding. First, it is possible to
calculate the mean time to failure (MTTF) for both coded
and uncoded systems. We have done this, but the calculation
is lengthy, and we prefer to omit it, as not much insight into
the problem is gained. Alternatively, mean time between
failures (MTBF) can be calculated if renewal times can be
assumed.
In this paper, however, we consider the lifetime $T_r(\alpha)$ of
the system to be the time at which the probability of system
failure equals some value $(1 - \alpha)$. A special case is the time at
which the probability of system failure is 1/2, and this time is
the median time to failure (mTTF).
4 General analysis of system lifetime
Consider a memory system to be composed of an array of n
columns by m rows of RAM chips, as shown in Fig. 1. The
RAM chips are assumed to be 1-bit wide, so that the failure
of a single chip only affects 1 bit in a horizontal word. Fur-
thermore, we assume that the first k columns contain the
data bits, where k is equal to the computer word size. The
remaining $(n - k)$ columns contain the parity check bits.
[Fig. 1 RAM organisation: array of RAM chips (D), n chip columns by m chip rows, with the n - k check columns marked]
The efficiency (or rate) of the coded system is therefore
$k/n$, and its redundancy is $(n - k)/n$. Note that an uncoded
memory has $(n - k) = 0$.
Let us assume that each chip row of the above memory is
coded by using a binary block code of length $n$ bits, which is
capable of correcting any combination of $r$ errors amongst
the $n$ bits. The probability of correct operation of the row
(or row reliability) is given by
$$R_R = \sum_{i=0}^{r} \binom{n}{i} (1 - R)^i R^{n-i} \tag{4}$$
If the memory system is composed of $m$ rows, then the
probability of correct operation of the system (system
reliability) is the probability that all $m$ rows operate correctly.
Conversely, the probability of system failure is the probability
that one or more chip rows fails to operate correctly, because
$r + 1$ or more errors have occurred in a single row. The system
reliability is therefore given by
$$R_S = (R_R)^m \tag{5}$$
The problem we wish to consider is that of inverting the
equation
$$R_S = \alpha \tag{6}$$
i.e. inverting the equation
$$R_R = \alpha^{1/m} \tag{7}$$
to produce the solution $t = T_r(\alpha)$, which is the lifetime for
the given level of performance $\alpha$.
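We now use an approximation to eqn. 4 to give an approximate
solution to eqn. 7. Consider eqn. 4; if the quantities
$r/n$ and $r(1 - R)$ are smaller than 1, as is certainly the case in
this application, then the row reliability is well approximated
by the Poisson distribution
$$R_R \simeq e^{-\mu} \sum_{i=0}^{r} \frac{\mu^i}{i!}, \quad \text{where } \mu = n(1 - R) \tag{8}$$
Let us define $\mu_r(\gamma)$ to be the solution to the equation
$$e^{-\mu} \sum_{i=0}^{r} \frac{\mu^i}{i!} = \gamma \tag{9}$$
Then the solution to eqn. 7 is well approximated by
$$n(1 - R) = \mu_r(\alpha^{1/m}) \tag{10}$$
Thus, from eqns. 2 and 10,
$$1 - e^{-\lambda t} = \frac{1}{n}\, \mu_r(\alpha^{1/m})$$
That is, the system lifetime is the solution
$$T_r(\alpha) = -\frac{1}{\lambda} \log\left[1 - \frac{\mu_r(\alpha^{1/m})}{n}\right] \tag{11}$$
Now, in all cases of interest, the ratio $\mu_r(\alpha^{1/m})/n$ will be small
enough for the approximation
$$-\log\left[1 - \frac{\mu_r(\alpha^{1/m})}{n}\right] \simeq \frac{\mu_r(\alpha^{1/m})}{n}$$
to be very accurate. Hence, eqn. 11 becomes
$$T_r(\alpha) \simeq \frac{1}{\lambda n}\, \mu_r(\alpha^{1/m}) \tag{12}$$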
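For an uncoded memory, $n = k$, $r = 0$, and we may define the
uncoded system lifetime as $T_0(\alpha)$, where, from eqn. 12:
$$T_0(\alpha) = \frac{1}{\lambda k}\, \mu_0(\alpha^{1/m}) = \frac{1}{\lambda k m} \log(\alpha^{-1}) \tag{13}$$
In this case, the solution $\mu_0(\alpha^{1/m}) = \log(\alpha^{-1/m})$ is exact and
can be derived directly.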
In order to compare equivalent coded and uncoded
systems, let us define the 'coding gain' of the coded system
$C_r(\alpha)$ to be the ratio of the system lifetime with coding, to
that of the system without coding. Thus, from eqns. 12 and
13, we have:
$$C_r(\alpha) = \frac{T_r(\alpha)}{T_0(\alpha)} \simeq \frac{k}{n} \cdot \frac{m\, \mu_r(\alpha^{1/m})}{\log(\alpha^{-1})} \tag{14}$$
From now on, the use of the $\simeq$ sign is discontinued and it
must be inferred that our lifetimes and coding gains are
approximations. It can be shown by direct numerical solution,
however, that such approximations are very accurate for
memory systems of practical interest.
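The accuracy claim can be explored numerically. The following is a minimal sketch, not the authors' procedure: it inverts eqn. 7 once with the exact binomial row reliability of eqn. 4 and once with the Poisson-based lifetime of eqn. 12, using simple bisection. The function names and the parameter values ($n = 21$, $r = 1$, $m = 64$, $\alpha = 1/2$, $\lambda = 10^{-6}$ failures/h) are assumptions chosen for illustration.

```python
import math

def row_reliability_exact(n, r, q):
    """Eqn. 4: probability that at most r of the n chips in a row have failed."""
    return sum(math.comb(n, i) * q**i * (1.0 - q)**(n - i) for i in range(r + 1))

def row_reliability_poisson(r, mu):
    """Eqn. 8: Poisson approximation with mu = n * (1 - R)."""
    return math.exp(-mu) * sum(mu**i / math.factorial(i) for i in range(r + 1))

def solve_decreasing(f, target, lo, hi, iters=200):
    """Bisection for f(x) = target, where f decreases between lo and hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lifetime_exact(n, r, m, alpha, lam):
    """Invert eqn. 7 using the exact row reliability of eqn. 4."""
    q = solve_decreasing(lambda q: row_reliability_exact(n, r, q),
                         alpha ** (1.0 / m), 0.0, 1.0)
    return -math.log(1.0 - q) / lam   # from Q = 1 - exp(-lam * t), eqn. 2

def lifetime_approx(n, r, m, alpha, lam):
    """Eqn. 12: T_r(alpha) = mu_r(alpha^(1/m)) / (lam * n)."""
    mu = solve_decreasing(lambda mu: row_reliability_poisson(r, mu),
                          alpha ** (1.0 / m), 0.0, 10.0 * (r + 5))
    return mu / (lam * n)

# Illustrative parameters (assumed for this sketch)
n, r, m, alpha, lam = 21, 1, 64, 0.5, 1e-6
t_exact = lifetime_exact(n, r, m, alpha, lam)
t_approx = lifetime_approx(n, r, m, alpha, lam)
print(f"exact binomial lifetime: {t_exact:.0f} h")
print(f"eqn. 12 approximation:   {t_approx:.0f} h")
```

For these parameters the two lifetimes agree to within a few per cent; other choices of $n$, $r$ and $m$ can be tried by changing the last block.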
5 Median time to failure (mTTF)
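We may consider the median time to failure (mTTF) to be
truly representative of the system lifetime in the general
sense. In this case, $\alpha = 1/2$, and eqns. 12-14 become
$$(\mathrm{mTTF})_r = T_r(1/2) = \frac{1}{\lambda n}\, \mu_r(2^{-1/m}) \tag{15}$$
$$(\mathrm{mTTF})_0 = T_0(1/2) = \frac{1}{\lambda k}\, \mu_0(2^{-1/m}) \tag{16}$$
$$C_r(1/2) = \frac{k}{n} \cdot \frac{\mu_r(2^{-1/m})}{\mu_0(2^{-1/m})} \tag{17}$$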
Let us now consider some special cases.
5.1 Uncoded memory
First, the uncoded case, i.e. $r = 0$. Clearly then, $\mu_0(2^{-1/m}) = (1/m) \log 2$
from eqn. 13, and eqn. 16 becomes
$$(\mathrm{mTTF})_0 = T_0(1/2) = \frac{1}{\lambda k m} \log 2 = \frac{0.693}{\lambda k m} \tag{18}$$
This is, in fact, the exact value of mTTF for this case, as can
be verified directly.
5.2 Single-row, single-error correction
Next consider $r = 1$, $m = 1$, i.e. a single-chip row with
single-error correction. One can verify numerically that
$\mu_1(1/2) = 1.678$, so that by eqn. 15:
$$(\mathrm{mTTF})_1 = T_1(1/2) = \frac{1.678}{\lambda n} \tag{19}$$
and the coding gain is
$$C_1(1/2) = \frac{1.678}{\log 2} \cdot \frac{k}{n} = 2.42\, \frac{k}{n} \tag{20}$$
Let us apply eqn. 20 to the case of a $k = 16$-bit computer
word encoded with the $n = 21$ Hamming single-error-correcting
code. The rate of this code is $k/n = 0.762$ and
eqn. 20 becomes
$$C_1(1/2) = 1.84 \tag{21}$$
Quantitatively, this means that a single-chip row of memory
will have a lifetime of approximately 1.8 times that of the
equivalent uncoded memory.
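The value $\mu_1(1/2) = 1.678$ and the resulting coding gain can be verified numerically. The sketch below solves eqn. 9 by bisection; the helper names `poisson_cdf` and `mu_r` are introduced here for illustration only.

```python
import math

def poisson_cdf(r, mu):
    """Left-hand side of eqn. 9: exp(-mu) * sum_{i=0..r} mu^i / i!."""
    return math.exp(-mu) * sum(mu**i / math.factorial(i) for i in range(r + 1))

def mu_r(r, gamma, iters=200):
    """Solve eqn. 9 by bisection; the left-hand side decreases as mu grows."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(r, hi) > gamma:   # widen the bracket until it holds the root
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(r, mid) > gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

k, n = 16, 21                           # (21, 16) Hamming code from the text
mu1 = mu_r(1, 0.5)
gain = (k / n) * mu1 / math.log(2.0)    # eqn. 20 with m = 1
print(f"mu_1(1/2) = {mu1:.3f}")         # prints 1.678
print(f"C_1(1/2)  = {gain:.2f}")        # prints 1.84
```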
5.3 Single-row, multiple-error correction
Consider now $m = 1$, and increasing values of $r$. Table 1 shows
$r$ against $\mu_r(1/2)$ and the coding gain $C_r(1/2)$.

Table 1: Coding gain against r
r     $\mu_r(1/2)$    $C_r(1/2)$
0     0.6931          1 x (k/n)
1     1.678           2.4 x (k/n)
2     2.674           3.9 x (k/n)
3     3.672           5.3 x (k/n)
4     4.671           6.7 x (k/n)
5     5.6702          8.2 x (k/n)
6     6.6696          9.6 x (k/n)
7     7.66925         11.1 x (k/n)
8     8.66895         12.5 x (k/n)
9     9.668715        13.9 x (k/n)
10    10.66852        15.4 x (k/n)

From Table 1, it seems clear that $\mu_r(1/2) \approx r + 2/3$, and in fact
it can be shown (Appendix 10) that
$$\mu_r(1/2) = r + \frac{2}{3} + \frac{8}{405 r} + O(r^{-2}) \tag{22}$$
The coding gain for large $r$ is therefore approximated by
$$C_r(1/2) = (1.44 r + 0.96)\, \frac{k}{n} \tag{23}$$
It is interesting to note that the relative improvement in
coding gain decreases rapidly with increasing $r$, so that the
benefits of coding are subject to rapidly 'diminishing returns'.
For example, in Section 5.2 we saw that, for $k = 16$, a
single-error-correcting code increases the lifetime of a single word
by a factor of 1.84. If an $n = 26$ double-error-correcting code is
used, the single-word lifetime is increased by a factor of
$3.9 \times (16/26) = 2.4$ over uncoded, which is only a factor of 1.3
better than the single-error-correcting code.
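The entries of Table 1 and the 'diminishing returns' comparison can be regenerated with a short script. This is an illustrative sketch under the same Poisson model as eqn. 9; the helper names are assumptions, and the printed layout only loosely mirrors the table.

```python
import math

def poisson_cdf(r, mu):
    """Left-hand side of eqn. 9."""
    return math.exp(-mu) * sum(mu**i / math.factorial(i) for i in range(r + 1))

def mu_r(r, gamma, iters=200):
    """Bisection solution of eqn. 9 for mu_r(gamma)."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(r, hi) > gamma:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(r, mid) > gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Columns: r, mu_r(1/2), the estimate r + 2/3, and the factor multiplying k/n
for r in range(11):
    mu = mu_r(r, 0.5)
    factor = mu / math.log(2.0)
    print(f"{r:2d}  {mu:9.5f}  {r + 2 / 3:9.5f}  {factor:5.1f} x (k/n)")

# Single-row comparison for k = 16: (21, 16) SEC code against an n = 26 DEC code
gain_sec = (16 / 21) * mu_r(1, 0.5) / math.log(2.0)
gain_dec = (16 / 26) * mu_r(2, 0.5) / math.log(2.0)
ratio = gain_dec / gain_sec
print(f"SEC gain {gain_sec:.2f}, DEC gain {gain_dec:.2f}, ratio {ratio:.2f}")
```

The ratio printed on the last line is the 'factor of 1.3' quoted in the text.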
6 Case of memory with many chip rows
It is clearly possible to extend the analysis of Section 5.3
to the case of multiple-chip rows, i.e. $m = 2, 3$ etc. However,
we now consider the case of $m$ large, i.e. a computer memory
with many chip rows. In this case, $\mu_r(\alpha^{1/m})$ can be
approximated by noting that
$$\alpha^{1/m} \simeq 1 + \frac{1}{m} \log \alpha \tag{24}$$
and that, for small values of $\mu$,
$$e^{-\mu} \sum_{i=0}^{r} \frac{\mu^i}{i!} \simeq 1 - \frac{\mu^{r+1}}{(r+1)!} \tag{25}$$
Thus, for large $m$, we have the approximation:
$$\mu_r(\alpha^{1/m}) \simeq \left[\frac{(r+1)!\, \log(\alpha^{-1})}{m}\right]^{1/(r+1)} \tag{26}$$
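The system lifetime is then given by eqn. 12 as
$$T_r(\alpha) = \frac{1}{\lambda n}\, \mu_r(\alpha^{1/m}) \tag{27}$$
where $\mu_r(\alpha^{1/m})$ is taken from eqn. 26.
The coding gain from eqns. 14 and 26 is then
$$C_r(\alpha) = \frac{k}{n}\, [(r+1)!]^{1/(r+1)} \left[\frac{m}{\log(\alpha^{-1})}\right]^{r/(r+1)} \tag{28}$$
For $r \gg 1$, eqn. 28 can be approximated via Stirling's formula
for $n!$ to be:
$$C_r(\alpha) \simeq \frac{k}{n} \cdot \frac{r+1}{e} \left[\frac{m}{\log(\alpha^{-1})}\right]^{r/(r+1)} \tag{29}$$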
6.1 Median time to failure with m large
The coding gain for $m$ large can be found by putting $\alpha = 1/2$
into eqns. 28 and 29. Table 2 shows the coding gain against
$r$. Table 2 (or eqn. 28) can be used to give an accurate estimate
of the median time to failure $T_r(1/2)$ for any given memory
system.

Table 2: Coding gain against r
r           $C_r(1/2)$
1           $1.699 \times (k/n) \times m^{1/2}$
2           $2.32 \times (k/n) \times m^{2/3}$
3           $2.91 \times (k/n) \times m^{3/4}$
4           $3.49 \times (k/n) \times m^{4/5}$
5           $4.06 \times (k/n) \times m^{5/6}$
$r \gg 1$   $0.53\,(r + 1) \times (k/n) \times m^{r/(r+1)}$
For example, consider a 16-bit computer word, and a
memory with 64 rows of chips. If the memory is coded with
the $n = 21$ single-error-correcting Hamming code, and the
chip failure rate is $10^{-6}$ failures/h, then eqn. 18 gives the
uncoded mTTF as 677 h. From Table 2, the coding gain is 10.36,
giving a coded mTTF of $10.36 \times 677 = 7011$ h. It is
interesting to note that an exact solution to this example,
obtained via eqn. 14, gives a coding gain of 10.87 and an
mTTF of 7359 h. This shows that the above approximations
are reasonably accurate.
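The comparison in this example can be reproduced as follows. The sketch evaluates the large-$m$ approximation of eqn. 26 and, separately, a bisection solution of eqn. 9, then forms the coding gain of eqn. 14 in both cases. Function and variable names are illustrative assumptions.

```python
import math

def poisson_cdf(r, mu):
    """Left-hand side of eqn. 9."""
    return math.exp(-mu) * sum(mu**i / math.factorial(i) for i in range(r + 1))

def mu_exact(r, gamma, iters=200):
    """Bisection solution of eqn. 9."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(r, hi) > gamma:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(r, mid) > gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mu_large_m(r, alpha, m):
    """Eqn. 26: large-m approximation to mu_r(alpha^(1/m))."""
    return (math.factorial(r + 1) * math.log(1.0 / alpha) / m) ** (1.0 / (r + 1))

k, n, r, m = 16, 21, 1, 64      # example of Section 6.1
lam, alpha = 1e-6, 0.5

t0 = math.log(1.0 / alpha) / (lam * k * m)             # eqn. 13 (uncoded mTTF)
scale = (k / n) * m / math.log(1.0 / alpha)            # common factor in eqn. 14
gain_approx = scale * mu_large_m(r, alpha, m)          # via eqn. 26
gain_exact = scale * mu_exact(r, alpha ** (1.0 / m))   # via eqn. 9

print(f"uncoded mTTF      = {t0:.0f} h")
print(f"gain (large m)    = {gain_approx:.2f}, mTTF = {gain_approx * t0:.0f} h")
print(f"gain (eqn. 9/14)  = {gain_exact:.2f}, mTTF = {gain_exact * t0:.0f} h")
```

The same functions accept other values of `alpha`, so lifetimes at probability levels other than the median can be estimated in the same way.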
6.2 Case of $\alpha$ close to 1
We may wish to know the lifetime of a coded or uncoded
system to a point in time at which $\alpha = 0.9, 0.99, 0.999$ etc.
That is, the time at which the probability of system failure is
10%, 1%, 0.1% etc. This result can be approximated by noting
that $\log(\alpha^{-1}) \simeq (1 - \alpha)$, for $\alpha$ close to 1. In this case, the
uncoded memory has, from eqn. 13, a lifetime of
$$T_0(\alpha) = \frac{1}{\lambda k m}\,(1 - \alpha) \tag{30}$$
and eqn. 28 becomes
$$C_r(\alpha) = \frac{k}{n}\, [(r+1)!]^{1/(r+1)} \left[\frac{m}{1 - \alpha}\right]^{r/(r+1)} \tag{31}$$
Let us use this formula in an example. Consider the 16-bit
memory system defined in Section 6.1. Taking $\alpha = 0.99$, we
wish to find the time at which the system has a 1% failure
probability. The uncoded system has a 1% failure probability
at a time given by eqn. 30, as $T_0(0.99) = (1/\lambda k m)(0.01) = 9.76$ h.
From eqn. 31, the coding gain for the coded memory
is $C_1(0.99) = (16/21) \times \sqrt{2} \times 8 \times \sqrt{100} = 86.2$, which gives a
coded system lifetime of 841 h (or approximately one month)
at the 1% point. Again, the argument may be turned around,
to say that, on average, 1% of these systems will last no more
than about one month.
7 Conclusions
In this paper, formulas are developed which will enable a
system designer to calculate the improvement in reliability
that can be obtained by applying coding to a semiconductor
memory system. The designer first calculates the lifetime of
the uncoded memory system, and then simply multiplies
this by the coding gain factor to yield the coded system
lifetime.
In addition, the designer can use the formulas to calculate
what percentage of a particular mass-produced memory
system will last a given length of time.
By using these formulas, the system designer can assess
whether or not a particular coding scheme (including no
coding) will give the memory system its required reliability.
8 Acknowledgments
The authors would like to acknowledge partial financial
support from the UK Science and Engineering Research
Council and the Joint Services Electronics Program (USA),
contract N00014-78-C-0424.
9 References
1 EUZENT, B.: 'Intel 2116 N-channel silicon gate 16 K dynamic
RAM'. Reliability report RR-16, Intel Corporation, 1977
2 ALNETHER, J.: 'Error detecting and correcting codes part 1'.
Applications note AP146, Intel Corporation, 1979
3 LEVINE, L., and MEYERS, W.: 'Semiconductor memory reliability
with error detecting and correcting codes', Computer, 1976, 9,
pp. 43-50
4 WALKER, W.K.S., SUNDBERG, C.W., and BLACK, C.J.: 'A
reliable spaceborne memory with a single error and erasure
correction scheme'. IEEE Trans., 1979, C-28, pp. 493-500
5 CARTER, W.C., and McCARTHY, C.E.: 'Implementation of an
experimental fault-tolerant memory system', ibid., 1976, C-25,
pp. 557-567
6 GOODMAN, R.M.F.: 'Error correction coding for VLSI memories'.
IEE Colloquium Digest 1980/41, 1980, pp. 119-122
7 HOEL, P.G., PORT, S.C., and STONE, C.J.: 'Introduction to
probability theory' (Houghton-Mifflin, Boston, 1971)
8 FISHER, R.A., and CORNISH, E.A.: 'The percentile points of
distributions having known cumulants', Technometrics, 1960,
2, pp. 209-225
9 ABRAMOWITZ, M., and STEGUN, I.A. (Eds.): 'Handbook of
mathematical functions' (Dover, New York, 1965)
10 Appendix: Asymptotic expression for $\mu_r(\gamma)$
The object is to obtain an asymptotic approximation for
$\mu_r(\gamma)$, defined as the solution to the equation
$$e^{-\mu} \sum_{i=0}^{r} \frac{\mu^i}{i!} = \gamma$$
which is valid for fixed $\gamma$ and large $r$ and, in particular, to
obtain the approximation (eqn. 22) for $\gamma = 1/2$. Our result
depends on the following well known facts from probability
theory:
(a) If $X$ is a Poisson random variable with mean $\mu$, and if
$Y$ is a random variable with the gamma density function
$[\Gamma(n)]^{-1} e^{-t} t^{n-1}$, $t > 0$, then $\Pr\{X \le n - 1\} = \Pr\{Y > \mu\}$.*
Thus, if $F_n(y)$ denotes the cumulative distribution function
of $Y$, $\mu_{n-1}(\gamma)$ is the solution to the equation $F_n(\mu) = 1 - \gamma$.
(b) If $Y_1, Y_2, \ldots, Y_n$ are independent, identically
distributed random variables, each with the exponential density
function $e^{-t}$, $t > 0$, the sum $Y = Y_1 + \ldots + Y_n$ has the
gamma distribution cited above.†
(c) If $Y = Y_1 + Y_2 + \ldots + Y_n$ is a sum of independent,
identically distributed random variables, and if $F_n(y)$ is its
cumulative distribution function, there is an asymptotic
expression (with respect to $n$) for the solution $y(\gamma, n)$ to the
equation $F_n(y) = 1 - \gamma$. This expression, the Cornish-Fisher
expansion [8], can be viewed as a generalisation and inversion
of the central limit theorem. It depends on the moments of
the $Y_i$ and the solution $x$ of the equation
$$\frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt = \gamma$$
In the special case, where each $Y_i$ has the exponential density
$e^{-t}$, a straightforward application of the formulas given in
Reference 8 yields the following expression for $y(\gamma, n)$,
which is, by our preceding remarks, also equal to $\mu_{n-1}(\gamma)$:
$$y(\gamma, n) = \mu_{n-1}(\gamma) = n + n^{1/2} x + \frac{1}{3}(x^2 - 1) + n^{-1/2}\, \frac{x^3 - 7x}{36} - n^{-1}\, \frac{3x^4 + 7x^2 - 16}{810} + n^{-3/2}\, \frac{9x^5 + 256x^3 - 433x}{38880} + n^{-2}\, \frac{12x^6 - 243x^4 - 923x^2 + 1472}{204120} + \ldots$$
* op. cit., Reference 7, section 5.3.3.
† ibid., Chap. 6, theorem 5
This formula gives very good results even for modest values
of $n$. For example, with $n = 11$, $\gamma = 0.1$, we find (e.g. from
Table 26.5 of Reference 9) that $x = 1.28155$, and the first
four terms of the above expression give $\mu_{10}(0.1) = 15.40704$,
whereas the actual value is 15.40664.
In the special case $\gamma = 1/2$, we have $x = 0$, and the preceding
asymptotic expansion gives
$$\mu_{n-1}(1/2) = n - \frac{1}{3} + \frac{8}{405 n} + \frac{184}{25515 n^2} + \ldots$$
On replacing $n$ by $n + 1$, we obtain the advertised expansion
(eqn. 22), namely,
$$\mu_n(1/2) = n + \frac{2}{3} + \frac{8}{405 n} + O(n^{-2})$$
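The numerical check quoted in this Appendix ($n = 11$, $\gamma = 0.1$) can be repeated with the sketch below. It is illustrative only: the normal percentage point $x$ is obtained from Python's `statistics.NormalDist` rather than from tables, the series coefficients are those of the Cornish-Fisher expansion for a sum of unit exponentials up to the $n^{-1}$ term (treated here as an assumption), and the 'actual' value is found by bisection on the Poisson sum. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def poisson_cdf(r, mu):
    """exp(-mu) * sum_{i=0..r} mu^i / i!, the defining sum for mu_r(gamma)."""
    return math.exp(-mu) * sum(mu**i / math.factorial(i) for i in range(r + 1))

def mu_r(r, gamma, iters=200):
    """Bisection solution of poisson_cdf(r, mu) = gamma."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(r, hi) > gamma:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(r, mid) > gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def cornish_fisher(n, gamma, terms=4):
    """Partial sum of the expansion for mu_{n-1}(gamma), up to five terms."""
    x = NormalDist().inv_cdf(1.0 - gamma)   # upper-tail point: Pr{Z > x} = gamma
    series = [
        n,
        math.sqrt(n) * x,
        (x**2 - 1.0) / 3.0,
        (x**3 - 7.0 * x) / (36.0 * math.sqrt(n)),
        -(3.0 * x**4 + 7.0 * x**2 - 16.0) / (810.0 * n),
    ]
    return sum(series[:terms])

approx4 = cornish_fisher(11, 0.1, terms=4)
actual = mu_r(10, 0.1)
print(f"four-term expansion: {approx4:.5f}")
print(f"bisection solution:  {actual:.5f}")
```

Setting `gamma = 0.5` makes $x = 0$, so the same function reduces to the median expansion $n - 1/3 + 8/(405n)$ when five terms are kept.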
