The reliability of semiconductor RAM memories with on-chip error-correction coding by Goodman, Rodney M. & Sayano, Masahiro
884 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 37, NO. 3, MAY 1991
The Reliability of Semiconductor RAM Memories 
with On-Chip Error-Correction Coding 
Rodney M. Goodman and Masahiro Sayano 
Abstract - The mean lifetimes of semiconductor memories that have been
encoded with an on-chip single error-correcting code along each row of
memory cells are studied. Specifically, the effects of single-cell soft
errors and various hardware failures (single-cell, row, column, row-column,
and entire chip) in the presence of soft-error scrubbing are examined. An
expression is presented for computing the mean time to failure of such
memories in the presence of these types of errors using the Poisson
approximation; the expression has been confirmed experimentally to
accurately model the mean time to failure of memories protected by single
error-correcting codes. These analyses will enable the system designer to
accurately assess the improvement in mean time to failure (MTTF) bought by
the use of error-control coding.
Index Terms - Error-correction coding, random access memory, soft-error
scrubbing, mean time to failure.
I. INTRODUCTION 
A typical N × 1 semiconductor RAM memory is composed of
a two-dimensional array of N memory cells with word lines 
along the rows and bit lines along the columns. When a bit is 
accessed, the word line is activated, allowing the entire row of 
memory cells to be read by the bit lines. One of these bits is 
then chosen by the column selection circuitry, and that bit is 
then outputted. (See Fig. 1.) 
These N × 1 chips are organized on boards to create byte-wide
memory systems. Typically, for SIMM-type memory modules,
eight such chips are aligned to provide an 8-bit byte output, as
shown in Fig. 2(a). The bits from the same addresses on each
chip compose a byte; thus, each chip provides one bit of the
byte. On a larger system, the board may be composed of many
rows of chips. Fig. 2(b) shows a case where rows of N × 1 chips
are used to form a multipage memory board. Often, in a large
multipage memory, the rows of the memory are encoded with an
(n, k) error-correcting code (information chips are shown white;
parity chips are shown shaded). This is an example of board-level
error-correcting coding [1], [2]. Systems of this form are the most
common; they are based on N × 1 memory chips and therefore
most often use Hamming codes. Other schemes are used when
the chips are byte-wide chips [3].
Memory chips are subject to several types of failure modes. 
The two main classes of errors are hard and soft errors. Soft 
errors, induced most commonly by alpha particle radiation and 
noise, can affect the content of a cell temporarily by upsetting 
the charge stored in the cell; this occurs for both static and 
dynamic memory cell types. The effect is not permanent since 
no physical damage is done to the chip [4]. As memory cells 
decrease in size, they individually become more susceptible to 
this type of damage; furthermore, one alpha particle can affect 
more than one cell, resulting in clusters of errors [5], as shown in 
Fig. 3. 
Manuscript received November 27, 1990. This work was supported in 
part by NSF Grant MIP8711568. This work was presented in part at the 
International Symposium on Information Theory and Its Applications, 
Waikiki, HI, November 27-30, 1990. 
The authors are with the California Institute of Technology, Mail 
Code 116-81, Pasadena, CA 91125. 
IEEE Log Number 9143291. 
0018-9448/91/0500-0884$01.00 © 1991 IEEE
Fig. 1. Random access memory structure showing one word line chosen, allowing access to the entire row. (Blocks shown: memory cell array, column selection circuits, input/output.)
There are five common types of hard error or hardware 
failure. The most common type by far is a single cell hard 
failure, where a defect occurs in a single cell, making it no 
longer able to reliably hold data [6]. There are also other less
common failure modes. Failures of row selection circuitry (row 
failure) and column selection circuitry (column failure) cause an 
entire row or column to be unreliable. Shorting of row select 
and column sense lines (row-column failure) can cause a row 
and a column to fail simultaneously. Finally, the entire chip can 
fail if a chip selection or power circuit fails [6]. These types are 
shown in Fig. 4. Other techniques such as row and column 
replacement are used to increase the reliability of chips at the 
time of manufacture [7] and may be better suited to deal with 
row, column, and row-column failures; we do not address these 
techniques here. Also, we do not deal explicitly with failures of 
the addressing mechanisms here, since these are prone to cause 
multiple-row or multiple-column failures [8]. 
Additional hard-error types may occur from layout strategies 
used on large chips. For example, long selection lines cause the 
chip to be slow and error-prone due to the large capacitance these
lines have in relation to the cell capacitance or driving ability. 
Thus, large memory chips are broken into blocks, with each 
block having its own selection circuitry. There may be as few as 
two or as many as sixteen blocks of this type; large chips may 
have even more. This makes possible single and multiple block 
failures and full and partial row and column failures in addition 
to single cell, row-column, and entire chip failures as shown in 
Fig. 5. The number of types of errors increases as more blocks 
are employed since there are more failure modes possible. 
As memory capacity per chip increases, the effect of these 
errors becomes too great for board-level coding to handle. In
particular, soft errors can drastically reduce the effective mean
time to failure of an individual chip. It is thus natural to move to 
on-chip ECC (error correction coding) in order to get higher 
chip reliability. In addition, it may be necessary to combine chip 
and board coding to maintain the required reliability [9]. The 
most natural architecture for on-chip ECC is to place a single 
error correcting code along each row of the memory. These are 
most typically SEC (single error-correcting) Hamming codes that 
are shortened to have an information length of some integer 
power of two. Thus, when a row is accessed, the entire code- 
word can be simultaneously read, then corrected if necessary, 
prior to outputting the single bit required by the column selec- 
tion. If no more than a single error occurs along each row, the 
memory will remain functional and will reliably hold data. 
Furthermore, since soft errors can be removed by reading, 
correcting, and rewriting the data at regular intervals (soft-error 
scrubbing), repeated soft errors on a codeword will not cause 
memory failure unless more than one error occurs before the 
memory can be scrubbed. 
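As a rough illustration of the shortened SEC Hamming codes described above, the sketch below computes the smallest number of check bits that lets a single error-correcting code protect k information bits. It assumes the standard Hamming condition 2^r >= k + r + 1; the function name `sec_hamming_params` is hypothetical, and any extra check bit added for double-error detection is ignored.

```python
def sec_hamming_params(k):
    """Return (n, k) for the shortest SEC Hamming code with k information bits.

    Assumes the standard Hamming condition: r check bits must distinguish
    "no error" from an error in any of the n = k + r positions, so
    2**r >= k + r + 1.
    """
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return k + r, k


if __name__ == "__main__":
    # Information lengths that are integer powers of two, as in the text.
    for k in (8, 16, 32, 64, 128):
        n, _ = sec_hamming_params(k)
        print(f"k = {k:3d}: ({n}, {k}) code, redundancy {(n - k) / n:.1%}")
```

For k = 8 the sketch returns the parameters of a (12,8) code, which is the code size cited later in this section for a production chip.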
Soft errors, however, can be clustered, as shown in Fig. 3, and 
therefore there must be some means to handle clustered errors. 
One way is to use burst error-correcting codes, but these tend to
have higher redundancy and greater complexity than Hamming codes,
making encoding and decoding more difficult and time-consuming.
These are not desirable characteristics. Instead, this
effect is most commonly minimized by employing good layout
format and spatially interleaved cells so that cells which belong
to the same codeword are spaced apart, as shown in Fig. 6. This
allows the use of single error-correcting codes but requires that
each row employ more than one codeword [5], [10]. Note also
that by placing more than one codeword for each row and 
modeling alpha particles as induced clusters of errors, there 
arises the possibility that two soft-error strikes, even if not in the 
same codeword, can cause failure. A simple model assumes that 
each alpha particle effectively affects only a single cell and that 
two alpha particles must be in the same row to affect the same 
codeword; this assumption will be used here. 
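The effect of spatial interleaving can be sketched in a few lines. The code below assumes a cyclic layout along a row (a, b, c, a, b, c, ... for interleave depth 3, as in the Fig. 6 caption) and a cluster confined to contiguous cells of one row; both are simplifying assumptions, and the function names are hypothetical.

```python
def codeword_index(column, depth):
    """Codeword that owns a column, assuming a cyclic a, b, c, a, b, c, ... layout."""
    return column % depth


def worst_hits_in_one_codeword(start, length, depth):
    """Largest number of cells that a contiguous cluster places in any one codeword."""
    hits = [0] * depth
    for column in range(start, start + length):
        hits[codeword_index(column, depth)] += 1
    return max(hits)


if __name__ == "__main__":
    depth = 3
    for length in (1, 2, 3, 4):
        worst = worst_hits_in_one_codeword(start=5, length=length, depth=depth)
        status = "correctable by SEC" if worst <= 1 else "exceeds SEC"
        print(f"cluster of {length} cells: at most {worst} per codeword ({status})")
```

Under these assumptions a cluster no wider than the interleave depth touches each codeword at most once, so a single error-correcting code per codeword suffices.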
Hard errors, in contrast, cannot be removed, only corrected, 
by this method. Thus, one single-cell hard error in a codeword 
will put the memory in a state where the next error of any type 
in that codeword will cause memory failure. Likewise, one 
column failure in the chip will also cause the memory to fail 
with the next error of any type in any codeword in the same 
block, since this error causes a single error in every codeword in 
the memory block. Other types of hard errors will overwhelm 
the error-correcting code and cause memory failure with their 
first occurrence. Note also that placing more than one codeword 
for each row creates the possibility that multiple single cell or 
column hard errors may not cause failure. 
Several memory chips which employ on-chip error-correcting
codes have been fabricated. A production chip fabricated by
Micron Technologies employs a (12,8) Hamming code, interleave
depth 4, on a 256 Kbit × 1 RAM [10]. Asakura et al. have
employed a (40,32) SEC-DED (single error-correcting, double
error-detecting) modified Hamming code on a 1 Mbit cache
DRAM [11]. Horiguchi et al. have also constructed a RAM with
coding, although this is a multilevel, 4 bits/cell, 4 Mbit DRAM
[5]. The code used is still a single error-correcting code, though
it uses a (131,128) cyclic 16-ary code with interleave depth 8
to correct single-cell errors, each the equivalent of four bit errors.
Chiueh, Goodman, and Sayano [12] have constructed a 2 Kbit × 1
static RAM chip with a double error-correcting linear sum code
proposed by Fuja, Heegard, and Goodman [13]. Most codes
used in practice have been simple binary Hamming-type codes,
in order to reduce the complexity of the on-chip decoder.
Most previous MTTF calculations were based entirely or in
part on the binomial distribution [1], [14]-[17]. The results
tended to be complex, and some were simplified by taking the
Poisson approximation to the binomial distribution at one point
or another. Using the Poisson distribution from the start, first
done in [18], provides a far less complex yet accurate result. We
will again use the Poisson distribution in this analysis. Furthermore,
row-column type hard errors were first accurately modeled
in [19]; previous papers did not address this failure mode
and therefore inaccurately modeled the MTTF. We now extend
our work to include the effect of soft-error scrubbing in
semiconductor random access memories coded with on-chip single
error-correcting codes and subject to both hard and soft errors.
Fig. 2. Board-level organization of N × 1 memory chips: (a) SIMM module; (b) multipage memory board with ECC. Each
chip contributes one bit to the output byte or error-correction codeword.
Fig. 3. Cluster error patterns created by alpha particle strikes (soft
errors). Affected cells are darkened.
Analytical models such as the one presented here are becom- 
ing of increasing interest to system designers, as they enable an 
accurate assessment of the improvement in mean time to failure 
due to error control coding, without the need for lengthy and 
complex simulations, which are often prone to error. These 
analyses are particularly important for memory systems that 
must operate autonomously in harsh environments, such as 
those in oceanic probes, satellites, and spacecraft. Many such 
memory systems utilize similar coding schemes to those specifi- 
cally analyzed here. In some cases, however, more complex 
coding schemes such as spares swapping, or concatenated on- 
chip and board-level coding, have been proposed. The tech- 
niques and models presented here should enable the coding 
theorist to extend our work to assess the performance of most of 
these current and proposed coded memory systems in an effi- 
cient manner. 
In this correspondence, Section II presents the Poisson failure
model which will be used in subsequent sections as the
mathematical representation of the chip. Section III then develops the
model for both hard and soft single-cell failures with soft-error
scrubbing. Characteristics of the model will be explored for
various limits of parameters. Section IV then expands the model
to include multiple-cell hard failures (such as row, column,
row-column, and entire chip failures). In addition, the case
where chips are composed of multiple blocks of memory cell
arrays will be addressed there. Section V presents experimental
results from computer simulations, which help confirm the
accuracy of the model for varying parameter values.
II. POISSON FAILURE MODEL
We would like to create an accurate model of the mean time 
to failure of a semiconductor memory with a single error-cor- 
recting code placed along each row. However, the model must 
also be computationally simple enough to ease its determina- 
tion. The mathematical model used here will be the Poisson 
distribution for determination of probability of error in a given 
codeword. The accuracy of this model has been justified in our 
previous work [18]-[21]. A short review of the model is provided 
next. 
The Poisson distribution gives the probability that there are
no events (failures) in time $t$ as
$$P_0(t) = e^{-\lambda t},$$
where $\lambda$ is the mean event arrival rate, and the probability that
there is exactly one event (failure) in time $t$ as
$$P_1(t) = \lambda t\, e^{-\lambda t}.$$
The row reliability function is the probability that the row has
not failed in time $t$; for a codeword protected by a single
error-correcting code and error arrival rate per row $\lambda$, this
becomes [21]
$$R(t) = P(\text{0 or 1 error}) = (1 + \lambda t)\, e^{-\lambda t}.$$
By assuming independence among the rows, the chip reliability
function, i.e., the probability that the entire chip has not yet
failed by time $t$, is given by
$$R_{\text{chip}}(t) = R^M(t),$$
where there are $M$ codewords in a chip.
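The reliability functions of this section translate directly into code. The following is a minimal sketch, assuming a per-codeword Poisson rate `lam` and independent codewords; the function names are hypothetical and not taken from the paper.

```python
import math


def p_events(k, lam, t):
    """Poisson probability of exactly k events in time t at arrival rate lam."""
    return (lam * t) ** k * math.exp(-lam * t) / math.factorial(k)


def row_reliability(lam, t):
    """R(t): probability of at most one error in a SEC-protected codeword."""
    return p_events(0, lam, t) + p_events(1, lam, t)


def chip_reliability(lam, t, M):
    """R_chip(t) = R(t)**M, assuming the M codewords fail independently."""
    return row_reliability(lam, t) ** M


if __name__ == "__main__":
    lam, M = 1e-3, 1024   # arbitrary illustrative values, not from the paper
    for t in (0.0, 10.0, 100.0):
        print(t, row_reliability(lam, t), chip_reliability(lam, t, M))
```

Even when a single codeword is very likely to survive, raising its reliability to the power M makes the chip-level value fall off much faster.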
The mean time to failure is
$$\mathrm{MTTF} = \int_0^\infty R^M(t)\, dt$$
as shown in [21]. Modeling such a function may be time-intensive,
so for simulation purposes we computed the mean events to
failure (METF). This is the average number of events which
must occur before memory failure occurs; if this number is
large, then by Wald's identity (and Little's Law, which states
that, for Poisson processes, the mean waiting time, in this case
for failure, is the product of the mean number in the system
and the inverse of the mean arrival rate) the approximation
$$\mathrm{MTTF} \approx \frac{1}{\lambda}\,\mathrm{METF} \tag{1}$$
can be used, because $1/\lambda$ represents the mean time between
events (the inverse of the mean rate of errors) in the Poisson distribution [19].
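The relation between MTTF and METF can be checked numerically on the basic model. The sketch below integrates R^M(t) with the trapezoidal rule and, separately, counts random events until some codeword receives its second error. It assumes no scrubbing, that every event is a single-cell error landing on a uniformly chosen codeword, and that the total chip event rate is `lam * M`; all names and parameter values are illustrative.

```python
import math
import random


def mttf_numeric(lam, M, steps=200_000):
    """Trapezoidal estimate of the integral of [(1 + lam*t) * exp(-lam*t)]**M over t >= 0."""
    t_max = 50.0 / lam  # the integrand is below 1e-19 beyond this point
    h = t_max / steps
    total = 0.0
    for j in range(steps + 1):
        t = j * h
        value = ((1.0 + lam * t) * math.exp(-lam * t)) ** M
        total += value if 0 < j < steps else 0.5 * value
    return total * h


def events_to_failure(M, rng):
    """Count events until some codeword receives its second error."""
    hit = [False] * M
    count = 0
    while True:
        count += 1
        word = rng.randrange(M)
        if hit[word]:
            return count
        hit[word] = True


def metf_monte_carlo(M, trials=20_000, seed=1):
    """Average events to failure over independent simulated chips."""
    rng = random.Random(seed)
    return sum(events_to_failure(M, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    lam, M = 2.0, 16       # per-codeword rate and codeword count (arbitrary)
    total_rate = lam * M   # events arrive at this rate over the whole chip
    print("integral  :", mttf_numeric(lam, M))
    print("METF/rate :", metf_monte_carlo(M) / total_rate)
```

Counting events avoids simulating arrival times altogether, which is the practical appeal of METF for simulation.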
III. ANALYSIS OF SINGLE-CELL FAILURES
A. The MTTF Calculation 
Initially the case of only single-cell hard and soft errors 
attacking the memory will be discussed. Furthermore, soft errors 
are confined to affecting only one cell. The symbols used are the 
following. 
$\lambda_h$  Single-cell hard-error arrival rate per cell.
$\lambda_s$  Single-cell soft-error arrival rate per cell.
$\lambda_{sc}$  $\lambda_h + \lambda_s$ = total single-cell error arrival rate per cell.
$t_s$  Soft-error-scrubbing period.
$n$  Number of bits in a codeword.
$M$  Number of codewords on a chip.
Fig. 4. Types of hard errors: single-cell failure, column failure, row failure, row-column failure, and entire chip failure.
Fig. 5. Two-block RAM chip and some associated types of hard errors: single-cell failure, half-column failure, column failure, row failure, row-half-column failure, row-column failure, block failure, and entire chip failure.
Fig. 6. Spatial interleaving allowing use of SEC codes. In this case the interleave depth is 3 (three codewords per row, shown
as a, b, and c).
Soft-error scrubbing will be conducted. This means that at 
periodic intervals $t_s$, the entire chip will be subject to internal
error correction where all codewords are read, corrected, and 
rewritten. This cannot remove hard errors, but it can remove 
soft errors, so long as no more than one error occurs in a single 
codeword. Note that no restriction is placed here on the number 
of codewords in each word line (in each row). Therefore, the 
word line may contain more than one codeword, such as in 
spatial interleaving. Thus, the number of codewords does not 
necessarily equal the number of rows on the chip. Also, there 
may be multiple blocks on the chip in a manner similar to that 
shown in Fig. 5. 
The following two assumptions will be made.
Assumption 1: The scrub cycle must be small compared to the
mean time to failure:
$$t_s \ll \mathrm{MTTF}.$$
Since the scrub cycle (which, in dynamic RAMs, occurs in
conjunction with the refresh cycle) is rarely longer than 100
seconds, a system with a mean time to failure which is not far
greater than this is not reliable enough for use.
Assumption 2: The hard-error rate must be small compared to
the soft-error rate:
$$\lambda_h \ll \lambda_s.$$
Hard errors cannot be removed by scrubbing and therefore
accumulate; soft errors, which occur more frequently [6], can be
removed. If hard errors occur with greater frequency than soft
errors, then either soft-error scrubbing is useless ($\lambda_h \gg \lambda_s$ and $\lambda_s$ is
small enough that scrubbing need not be done) or the chip is
too unreliable for use ($\lambda_h \gg \lambda_s$ and $\lambda_h$ is too large for the chip to be
used reliably). These assumptions are true for all practical
memory chips; a rigorous argument for justifying these assumptions
in the model, along with a more accurate but complex
analysis, is presented in Appendix B. No mention is made of the
reading and writing mechanism here because if scrubbing is
done, each word is accessed within the period of one scrub
cycle. If failure is declared only after a word is accessed and
found to be in error, then since Assumption 1 holds, the
increased time before failure is declared is negligible compared to
the case where failure is declared immediately following two
failures in any codeword.
The situation is therefore this: While there are no hard errors
in a codeword, the memory does not fail so long as no more
than one soft error occurs in each time interval $t_s$ in each
codeword. If a hard error does occur, then no further errors can
be tolerated in that same codeword. The two cases, when a hard
error has not occurred and when exactly one hard error has occurred
by time $t$, will be treated separately.
The probability of codeword success at time $t$ with no hard
errors occurring is the probability that no hard errors have
occurred and that only zero or one soft error has occurred in
any time segment $t_s$ from time 0 to $t$. Thus,
$$\begin{aligned}
R_0(t) &= P(\text{no hard errors})\,P(\text{0 or 1 soft error}) \\
&= \left[e^{-\lambda_h n t}\right]\left[e^{-\lambda_s n t_s} + \lambda_s n t_s\, e^{-\lambda_s n t_s}\right]^{t/t_s} \\
&= e^{-\lambda_{sc} n t}\,(1 + \lambda_s n t_s)^{t/t_s}.
\end{aligned} \tag{2}$$
The probability of codeword success at time $t$ with the occurrence
of exactly one hard error before time $t$ is more complex.
First, assume that the hard error occurred at time $\tau$.
Then the probability of success is the joint probability that
there were no codeword failures given that no hard failures
occurred up to time $\tau$, that exactly one hard error occurred
between time $\tau$ and $\tau + d\tau$, and that neither soft nor hard
errors occurred from time $\tau$ to $t$. This is then integrated over $\tau$
from 0 to $t$ to consider all possible times at which the hard error
could have occurred.
$$\begin{aligned}
R_1(t) &= \int_0^t R_0(\tau)\,P(\text{1 hard error in } d\tau)\,P(\text{no errors in } \tau \text{ to } t) \\
&= \int_0^t R_0(\tau)\left[\lambda_h n\, d\tau\right]\left[e^{-\lambda_{sc} n (t-\tau)}\right] \\
&= e^{-\lambda_{sc} n t}\,\lambda_h n \int_0^t (1 + \lambda_s n t_s)^{\tau/t_s}\, d\tau \\
&= \frac{e^{-\lambda_{sc} n t}\,\lambda_h n t_s}{\ln(1 + \lambda_s n t_s)}\left[(1 + \lambda_s n t_s)^{t/t_s} - 1\right].
\end{aligned} \tag{3}$$
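Therefore, the reliability function for each codeword is now
given by combining (2) and (3).
$$\begin{aligned}
R(t) &= R_0(t) + R_1(t) \\
&= e^{-\lambda_{sc} n t}(1 + \lambda_s n t_s)^{t/t_s} + \frac{e^{-\lambda_{sc} n t}\,\lambda_h n t_s}{\ln(1 + \lambda_s n t_s)}\left[(1 + \lambda_s n t_s)^{t/t_s} - 1\right].
\end{aligned} \tag{4}$$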
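The codeword reliability with scrubbing can be evaluated directly. The sketch below assumes R(t) is the sum of a "no hard error" term, exp(-lam_sc*n*t) * (1 + lam_s*n*t_s)**(t/t_s), and an "exactly one hard error" term obtained by integrating over the time of the hard error, as derived in this subsection. The function name and the parameter values are hypothetical, and `lam_s` must be positive.

```python
import math


def codeword_reliability(t, lam_h, lam_s, n, t_s):
    """R(t) = R0(t) + R1(t) for one codeword under periodic scrubbing."""
    lam_sc = lam_h + lam_s
    log_a = math.log1p(lam_s * n * t_s)   # ln(1 + lam_s*n*t_s)
    growth = math.exp(t / t_s * log_a)    # (1 + lam_s*n*t_s)**(t/t_s)
    decay = math.exp(-lam_sc * n * t)
    r0 = decay * growth                                     # no hard error so far
    r1 = decay * lam_h * n * t_s / log_a * (growth - 1.0)   # exactly one hard error
    return r0 + r1


if __name__ == "__main__":
    # Arbitrary illustrative rates (per cell, per second), not from the paper.
    lam_h, lam_s, n, t_s = 1e-9, 1e-7, 12, 10.0
    for t in (0.0, 1e3, 1e6, 1e8):
        print(t, codeword_reliability(t, lam_h, lam_s, n, t_s))
```

With `lam_h = 0` and `t = t_s` the function reduces to the probability of zero or one soft error in a single scrub interval, which is a convenient sanity check.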
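The mean time to failure (MTTF) for a chip with $M$ codewords
is then
$$\mathrm{MTTF} = \int_0^\infty R^M(t)\,dt = \int_0^\infty e^{-\lambda_{sc} n M t}\left[\left(1 + \frac{\lambda_h n t_s}{\ln(1+\lambda_s n t_s)}\right)\exp\left[\frac{t}{t_s}\ln(1+\lambda_s n t_s)\right] - \frac{\lambda_h n t_s}{\ln(1+\lambda_s n t_s)}\right]^M dt. \tag{5}$$
As shown in Appendix A, this becomes
$$\mathrm{MTTF} = \sum_{i=0}^{M}\binom{M}{i}\frac{\left(1+\dfrac{\lambda_h n t_s}{\ln(1+\lambda_s n t_s)}\right)^{i}\left(-\dfrac{\lambda_h n t_s}{\ln(1+\lambda_s n t_s)}\right)^{M-i}}{\lambda_{sc} n M - \dfrac{i}{t_s}\ln(1+\lambda_s n t_s)}. \tag{6}$$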
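Alternatively, by making the substitutions
$$y = \lambda_s n t_s, \qquad z = \frac{\lambda_h}{\lambda_s},$$
the mean time to failure can be represented in a clearer form:
$$\mathrm{MTTF} = \frac{1}{\lambda_s n}\sum_{i=0}^{M}\binom{M}{i}\frac{\left(1+\dfrac{zy}{\ln(1+y)}\right)^{i}\left(-\dfrac{zy}{\ln(1+y)}\right)^{M-i}}{M(1+z) - \dfrac{i\ln(1+y)}{y}}. \tag{7}$$
The coding gain, which is the advantage of the mean time to
failure over the uncoded case, is therefore
$$\mathrm{CG} = \frac{\mathrm{MTTF}_{\text{coded}}}{\mathrm{MTTF}_{\text{uncoded}}} = \frac{k}{n}\sum_{i=0}^{M}\binom{M}{i}\frac{\left(1+\dfrac{zy}{\ln(1+y)}\right)^{i}\left(-\dfrac{zy}{\ln(1+y)}\right)^{M-i}}{1 - \dfrac{i\ln(1+y)}{M(1+z)\,y}}, \tag{8}$$
where $(n, k)$ codes are used.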
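A binomial-sum expression of this kind can be cross-checked against direct numerical integration of R^M(t). The sketch below assumes y = lam_s*n*t_s, z = lam_h/lam_s, and R(t) = exp(-lam_sc*n*t) * [(1 + c) * (1 + y)**(t/t_s) - c] with c = z*y/ln(1+y); expanding the M-th power binomially and integrating term by term gives the sum in `mttf_sum`. Function names are hypothetical, and the floating-point alternating sum is only trustworthy for small M.

```python
import math


def mttf_sum(y, z, M, lam_s, n):
    """Binomial-sum MTTF in terms of y = lam_s*n*t_s and z = lam_h/lam_s."""
    log_ratio = math.log1p(y) / y   # ln(1 + y) / y
    c = z / log_ratio               # z*y / ln(1 + y)
    total = 0.0
    for i in range(M + 1):
        numerator = math.comb(M, i) * (1.0 + c) ** i * (-c) ** (M - i)
        total += numerator / (M * (1.0 + z) - i * log_ratio)
    return total / (lam_s * n)


def mttf_integral(y, z, M, lam_s, n, steps=200_000, u_max=200.0):
    """Trapezoidal integral of R(t)**M with time rescaled to u = lam_s*n*t.

    u_max must be large enough for the integrand to have decayed; the default
    suits the moderate y and z used below, not very small values.
    """
    log_ratio = math.log1p(y) / y
    c = z / log_ratio

    def r(u):
        return math.exp(-(1.0 + z) * u) * ((1.0 + c) * math.exp(u * log_ratio) - c)

    h = u_max / steps
    total = 0.5 * (r(0.0) ** M + r(u_max) ** M)
    for j in range(1, steps):
        total += r(j * h) ** M
    return total * h / (lam_s * n)


if __name__ == "__main__":
    y, z, M = 0.5, 0.2, 3   # deliberately large values so the integral converges quickly
    print(mttf_sum(y, z, M, lam_s=1.0, n=1), mttf_integral(y, z, M, lam_s=1.0, n=1))
```

For realistic chip sizes the terms of the alternating sum grow large and cancel, so exact or extended-precision arithmetic would be a safer choice than plain floats.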
The mean time to failure, therefore, is effectively a function
of three parameters, $y$, $z$, and $M$. The parameter $y$ represents
the relation between the scrub interval and the soft-error rate; it
is the mean number of soft errors that strike each codeword in a
scrub period. The parameter $z$ is the ratio of the hard-error rate
to the soft-error rate. Note that if $\lambda_s$, $\lambda_h$, and $t_s$ are adjusted such that $y$ and $z$
are kept constant, then the coding gain remains unchanged: The
parameter $y$ can be kept constant by maintaining a constant
$\lambda_s t_s$; the parameter $z$ can be kept constant by altering $\lambda_s$ and $\lambda_h$
proportionally. Such a dependence implies that time is being
scaled.
Note that (7) appears similar to a binomial expansion of order
$M$, a consequence of expanding and integrating $R^M(t)$. Also, if
$y$ and $z$ are small, then the denominator term will be small, and
therefore each term of the sum will be large. This means that if
the hard-error rate is small compared to the soft-error rate, and
if soft errors are scrubbed out before they are allowed to
accumulate, MTTF will be large, a conclusion which confirms
intuition.
Subsequent subsections will explore various special cases of
(7). Specifically, of interest are the limit of very fast scrubbing
($t_s \to 0$) and the limit of no hard errors ($\lambda_h \to 0$). In addition, an
analysis will be made for cases when Assumptions 1 and 2 are
violated. These cases will be examined for three reasons. First,
these cases, as will be shown, provide a connection to
the results of previous work (most notably, [19]-[21]). Second,
we wish to check whether the model breaks down predictably, based on
intuition, simulation, and previous work, if the assumptions are
violated. Third, the special cases are examined for completeness.
B. MTTF with Fast Scrubbing 
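If the interval between each soft-error scrub is allowed to
decrease to zero, that is, scrubbing is done continuously, there is
a maximum mean time to failure which cannot be exceeded. In
this limit $t_s \to 0$, or equivalently $y \to 0$, (7) becomes
$$\mathrm{MTTF} = \frac{1}{\lambda_{sc} n M}\sum_{i=0}^{M}\binom{M}{i}\frac{(1+z)^{i}(-z)^{M-i}}{1-\dfrac{i}{M(z+1)}}. \tag{9}$$
The coding gain is therefore
$$\mathrm{CG} = \frac{k}{n}\sum_{i=0}^{M}\binom{M}{i}\frac{(1+\lambda_h/\lambda_s)^{i}(-\lambda_h/\lambda_s)^{M-i}}{1-\dfrac{i}{M(\lambda_h/\lambda_s+1)}}. \tag{10}$$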
Equations (9) and (10) are simpler expressions to evaluate
than (7) and (8), and they show the dependence of MTTF on
$\lambda_h/\lambda_s$. Thus, if $t_s \to 0$, the ratio between the hard- and soft-error
rates becomes important.
Note that if $\lambda_h/\lambda_s$ is extremely small, the last term of the
summation in (9) is the dominant term:
$$\mathrm{MTTF} \approx \frac{1}{\lambda_h n M}, \tag{11}$$
$$\mathrm{CG} \approx \frac{k}{n}\left(1 + \frac{\lambda_s}{\lambda_h}\right). \tag{12}$$
This means that if the soft errors occur very quickly compared 
to the hard error rate but slow enough to be scrubbed out 
before more than one can accumulate in a single codeword, 
then the MTTF approaches that of the unprotected chip with 
only hard errors. This is reasonable, since when a hard error 
occurs, a soft error has a high probability of occurring relatively 
immediately after it in the same codeword, thus causing a 
failure. However, since the scrub interval is short enough, soft 
errors without a hard error in the same codeword never accumulate
fast enough to cause failure. Thus, the model behaves as
expected for $t_s \to 0$. Note that this provides an upper bound on
performance: Given error arrival rates $\lambda_h$ and $\lambda_s$, the
MTTF cannot exceed (9) even with extremely fast soft-error
scrubbing.
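The continuous-scrubbing limit can be explored with exact rational arithmetic, which sidesteps cancellation in the alternating sum. The sketch below assumes the normalized form sum_i C(M,i) (1+z)^i (-z)^(M-i) / (1 - i/(M(1+z))), where multiplying by 1/(lam_sc*n*M) gives the MTTF and z = lam_h/lam_s. It compares that sum with (1+z)/z, the same normalization applied to a hard-error-only MTTF of 1/(lam_h*n*M). Function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def fast_scrub_sum(z, M):
    """Exact value of sum_i C(M,i) (1+z)^i (-z)^(M-i) / (1 - i/(M(1+z))).

    Multiplying the result by 1/(lam_sc*n*M) gives the continuous-scrubbing MTTF.
    """
    z = Fraction(z)
    total = Fraction(0)
    for i in range(M + 1):
        numerator = comb(M, i) * (1 + z) ** i * (-z) ** (M - i)
        total += numerator / (1 - Fraction(i) / (M * (1 + z)))
    return total


def hard_only_sum(z):
    """Same normalization for MTTF = 1/(lam_h*n*M): lam_sc/lam_h = (1 + z)/z."""
    z = Fraction(z)
    return (1 + z) / z


if __name__ == "__main__":
    M = 8
    for z in (Fraction(1, 10), Fraction(1, 1000), Fraction(1, 100000)):
        ratio = fast_scrub_sum(z, M) / hard_only_sum(z)
        print(f"z = {float(z):.0e}: exact / hard-only approximation = {float(ratio):.6f}")
```

As z shrinks, the ratio approaches 1, which is the sense in which the chip behaves like an unprotected memory subject to hard errors only.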
C. Soft Errors as the Dominant Error Type 
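For the case of only soft errors occurring, i.e., $\lambda_h \to 0$ (equivalently,
$z \to 0$), the chip can fail only if two or more soft errors occur in
the same codeword and in the same scrub interval. We expect,
therefore, that the mean time to failure will be
$$\begin{aligned}
\mathrm{MTTF} &= \int_0^\infty e^{-\lambda_s n M t}(1 + \lambda_s n t_s)^{Mt/t_s}\,dt \\
&= \int_0^\infty \exp\left[-\lambda_s n M t + \frac{Mt}{t_s}\ln(1 + \lambda_s n t_s)\right]dt \\
&= \frac{1}{\lambda_s n M - \dfrac{M}{t_s}\ln(1 + \lambda_s n t_s)}.
\end{aligned} \tag{13}$$
If the model is accurate, then the same result should be achieved
when (7) is used. By taking $z = 0$ in (7), we obtain
$$\mathrm{MTTF} = \frac{1}{\lambda_s n M}\left[1 - \frac{\ln(1+y)}{y}\right]^{-1}. \tag{14}$$
The coding gain is
$$\mathrm{CG} = \frac{k}{n}\left[1 - \frac{\ln(1+y)}{y}\right]^{-1}. \tag{15}$$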
Equation (14) is the same as (13). Note that if $\lambda_s n t_s$ is held
constant, then the coding gain remains constant, as expected:
Time scaling should not affect the coding gain. Also note that
forcing $t_s \to 0$ now yields $\mathrm{MTTF} \to \infty$, as there are no hard
errors to build up on the chip to cause two or more errors in a
codeword during any single scrub cycle. The asymptotic behavior
of this case can be determined by taking the Taylor series of
$\ln(1+y)$:
$$\mathrm{MTTF} = \frac{1}{\lambda_s n M}\left[1 - \frac{\ln(1+y)}{y}\right]^{-1} \approx \frac{1}{\lambda_s n M}\left[\frac{2}{y}\right], \tag{16}$$
$$\mathrm{CG} \approx \frac{k}{n}\left[\frac{2}{y}\right], \tag{17}$$
for small $y$. Thus, the MTTF and the scrub interval are inversely
related.
This analysis has given the maximum possible MTTF given $t_s$
and $\lambda_s$ for a highly reliable chip; the upper bound on performance
for a given $t_s$ and $\lambda_s$ is therefore easily found using (13).
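The soft-error-only result and its small-y approximation are simple enough to compare numerically. The sketch below assumes MTTF = 1 / (lam_s*n*M*(1 - ln(1+y)/y)) with y = lam_s*n*t_s, and the leading Taylor term MTTF ~ 2 / (y*lam_s*n*M); names and parameter values are illustrative only.

```python
import math


def mttf_soft_only(y, M, lam_s, n):
    """MTTF with no hard errors: 1 / (lam_s*n*M * (1 - ln(1+y)/y))."""
    return 1.0 / (lam_s * n * M * (1.0 - math.log1p(y) / y))


def mttf_soft_only_small_y(y, M, lam_s, n):
    """Leading Taylor term: ln(1+y) ~ y - y**2/2 gives MTTF ~ 2 / (y*lam_s*n*M)."""
    return 2.0 / (y * lam_s * n * M)


if __name__ == "__main__":
    lam_s, n, M = 1e-6, 12, 1024   # arbitrary illustrative values
    for t_s in (1.0, 10.0, 100.0):
        y = lam_s * n * t_s
        print(t_s, mttf_soft_only(y, M, lam_s, n), mttf_soft_only_small_y(y, M, lam_s, n))
```

Each tenfold increase in the scrub period cuts the MTTF by roughly a factor of ten, which is the inverse relation stated above.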
D. Very Slow Soft Scrubbing 
For the case of very slow soft-error scrubbing, Assumption 1 
is violated, and (7) does not apply. The analysis must be recon- 
ducted to develop a new representation. Violation of Assump- 
tion 1 without violation of Assumption 2, for typical values for 
the other parameters, corresponds to having a very long scrub 
period. This case is not realistic: Having a long scrub cycle is 
useless, since if many errors are allowed to occur in a single 
codeword before soft-error scrubbing is conducted the error 
correcting power of the code is more likely to be overwhelmed. 
Though this case is not expected to occur in reality, it will be 
studied to determine how chip performance can be expected to 
degrade. In this case, instead of the continuous scrub period 
analysis conducted thus far, a discrete scrub period analysis 
must be made. In previous sections, since Assumption 1 assured 
the passing of many scrub periods before a chip failure was 
expected to occur, a continuous scrub period analysis, as done in 
(2), was possible. The equivalent analysis using discrete scrub 
performance intervals is much more complex. (This analysis is 
covered briefly in Appendix B, where its complexity becomes 
evident.) 
Instead, an analysis of the special case $t_s \to \infty$ will be conducted.
This will provide a lower bound and an intuitive feel for
the behavior of the model. The model must be reconstructed
from the reliability function, where
$$R_0(t) = e^{-\lambda_{sc} n t}\left[1 + \lambda_s n t\right] \tag{18}$$
and
$$R_1(t) = \lambda_h n t\, e^{-\lambda_{sc} n t} \tag{19}$$
are now used to determine the reliability function instead of (2)
and (3). $R_0(t)$ represents the probability that the codeword has
yet to fail given that no hard errors have occurred. In this case,
$R_0(t)$ is the probability that zero or one soft error and no hard
errors have occurred. $R_1(t)$ is the probability that the codeword
has yet to fail given that exactly one hard error has occurred.
Thus, the new reliability function is the sum of (18) and (19):
$$R(t) = e^{-\lambda_{sc} n t}\left[1 + \lambda_{sc} n t\right]. \tag{20}$$
Therefore, the mean time to failure becomes
$$\mathrm{MTTF} = \int_0^\infty e^{-\lambda_{sc} n M t}\left[1 + \lambda_{sc} n t\right]^M dt = \frac{1}{\lambda_{sc} n M}\sum_{i=0}^{M}\binom{M}{i}\frac{i!}{M^{i}}. \tag{21}$$
Thus, we have a factor of
$$B(M) = \sum_{i=0}^{M}\binom{M}{i}\frac{i!}{M^{i}} \tag{22}$$
in the equation that is only a function of the number of codewords
in the chip. Note that the MTTF is greater than the
uncoded mean time to failure by the factor $B(M)$. This factor
$B(M)$ was analyzed as an extension of the Birthday Surprise
problem in [19] and [20] and has been shown to be, for large $M$,
$$B(M) = \sqrt{\frac{\pi M}{2}} + \frac{2}{3} + O(M^{-1/2}). \tag{23}$$
Note that (23) is equivalent to the solution found via a different
analysis in [22].
The coding gain, therefore, is
$$\mathrm{CG} = \frac{k}{n}B(M) \tag{24}$$
and is no longer a function of the error arrival rates. This is
reasonable, as the lack of soft-error scrubbing removes the time
dependence. Only the error-correcting code's correctional power
is important, as reflected by $B(M)$. Some values of $B(M)$
are listed in Table I. Note that for typical memory chips of the
order of 1 Mbit and larger, the error between the actual value of
$B(M)$ and the value found by (23) is much less than 1%.
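The factor B(M) and its large-M approximation are easy to evaluate. The sketch below assumes B(M) is the sum over i from 0 to M of C(M,i)*i!/M^i, accumulated through the ratio of consecutive terms, and that the approximation is sqrt(pi*M/2) + 2/3. The function names and the early-exit threshold are choices made for this illustration.

```python
import math


def b_exact(M):
    """B(M) = sum_{i=0}^{M} C(M, i) * i! / M**i, accumulated term by term."""
    term = 1.0   # i = 0 term
    total = 1.0
    for i in range(1, M + 1):
        term *= (M - i + 1) / M   # ratio of consecutive terms
        total += term
        if term < 1e-18:          # remaining terms no longer change the sum
            break
    return total


def b_approx(M):
    """Large-M approximation sqrt(pi*M/2) + 2/3."""
    return math.sqrt(math.pi * M / 2.0) + 2.0 / 3.0


if __name__ == "__main__":
    for exponent in (0, 1, 2, 10, 20):
        M = 2 ** exponent
        exact, approx = b_exact(M), b_approx(M)
        print(M, round(exact, 2), f"{100.0 * (approx - exact) / exact:.2e} %")
```

For M = 4 the exact sum is 3.21875, in line with the value 3.22 listed for M = 2^2 in Table I.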
TABLE I
SOME VALUES OF FACTOR B(M)*

M       B(M)      % Error            M       B(M)      % Error
1       2.00      -4.00              2^13    114.1     -1.01 × 10^-3
2       2.50      -2.44              2^14    161.1     -5.05 × 10^-4
2^2     3.22      -1.41              2^15    227.5     -2.53 × 10^-4
2^3     4.25      -7.87 × 10^-1      2^16    321.52    -1.27 × 10^-4
2^4     5.70      -4.27 × 10^-1      2^17    454.42    -6.34 × 10^-5
2^5     7.77      -2.26 × 10^-1      2^18    642.36    -3.17 × 10^-5
2^6     10.71     -1.18 × 10^-1      2^19    908.16    -1.59 × 10^-5
2^7     14.86     -6.06 × 10^-2      2^20    1284.06   -7.93 × 10^-6
2^8     20.73     -3.09 × 10^-2      2^21    1815.66   -3.96 × 10^-6
2^9     29.03     -1.57 × 10^-2      2^22    2567.45   -1.98 × 10^-6
2^10    40.78     -7.93 × 10^-3      2^23    3630.65   -9.87 × 10^-7
2^11    57.39     -4.00 × 10^-3      2^24    5134.24   -4.90 × 10^-7
2^12    80.88     -2.01 × 10^-3

*Also tabled are errors between approximation (23) and the
exact values.

E. Hard-Error Dominance
For the case where hard errors dominate, Assumption 2 does
not hold. A new model will be constructed in this section for this
case. Violation of Assumption 2 corresponds to having hard
errors as the dominant error type. This is not a realistic situation:
Hard errors are much less common than soft errors [6], and
if they were not, either the chip itself would be very unreliable and
therefore not suitable for use ($\lambda_h$ is too large) or soft-error
scrubbing would not be needed ($\lambda_s$ is very small). In both cases,
soft-error scrubbing is useless because in the former, most of the
errors cannot be removed by scrubbing, and in the latter, there
are no soft errors to be scrubbed. However, for completeness we
do consider this case.
For simplicity and for an intuitive feel of the behavior of the
chip for very large $\lambda_h/\lambda_s$, the limiting case
$\lambda_h/\lambda_s \to \infty$ will be examined. The reliability
function can be reconstructed from the basic Poisson model, or it
can be adapted from (4); both give equivalent models. This occurs
because if the equations given in Appendix B, the correct complete
analysis, were carried out, the range of $t$ where $R(t)$ is
significant would be such that $\lfloor t/t_s \rfloor < 1$.
This is true because if hard errors are more common than soft
errors, then they are likely to cause failure by themselves far
more quickly than soft errors. Since soft errors are assumed to
occur at a rate slow enough to be scrubbed out, any arrival of
errors much faster than this implies that Assumption 1, that the
scrub interval is much less than the MTTF, is violated. Completing
this analysis yields (4) as the reliability function.
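Using the substitution $z = \lambda_h/\lambda_s$, the reliability function (4)
becomes
$$R(t) = e^{-\lambda_h n t (1 + 1/z)} \left[ \left(1 + \frac{\lambda_h n t_s}{z}\right)^{t/t_s} \left(1 + \frac{t_s \lambda_h n}{\ln(1 + \lambda_h n t_s / z)}\right) - \frac{t_s \lambda_h n}{\ln(1 + \lambda_h n t_s / z)} \right]. \qquad (25)$$
By letting $z \to \infty$ and using
$$\ln(1 + x) \approx x \quad \text{as } x \to 0$$
and
$$e^{x} \approx 1 + x \quad \text{as } x \to 0,$$
(25) becomes
$$R(t) = e^{-\lambda_h n t} + e^{-\lambda_h n t}\,\lambda_h n t$$
$$= e^{-\lambda_h n t}\,[1 + \lambda_h n t]. \qquad (26)$$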
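The limit taken above can be checked numerically. The sketch below evaluates a codeword reliability of the form (4), written with the constants $c_1$, $c_2$, $c_3$ of Appendix A, and compares it with (26) as $z = \lambda_h/\lambda_s$ grows. The function names and the parameter values are illustrative assumptions, not values from the paper.

```python
import math

def reliability_eq4(t, lam_h, lam_s, n, ts):
    """Form (4): exp(-(lam_h + lam_s) n t) * [exp(c1 t) c2 + c3]."""
    log_term = math.log1p(lam_s * n * ts)      # ln(1 + lam_s n t_s)
    c1 = log_term / ts
    c2 = 1.0 + ts * lam_h * n / log_term
    # c2 + c3 = 1, so exp(c1 t) c2 + c3 = 1 + (exp(c1 t) - 1) c2.
    # Writing it this way avoids cancellation when c2 is very large.
    bracket = 1.0 + math.expm1(c1 * t) * c2
    return math.exp(-(lam_h + lam_s) * n * t) * bracket

def reliability_eq26(t, lam_h, n):
    """Limiting form (26): exp(-lam_h n t) * [1 + lam_h n t]."""
    x = lam_h * n * t
    return math.exp(-x) * (1.0 + x)

if __name__ == "__main__":
    lam_h, n, ts, t = 1e-3, 10, 1.0, 50.0      # hypothetical values
    for z in (1e1, 1e3, 1e6):
        gap = abs(reliability_eq4(t, lam_h, lam_h / z, n, ts)
                  - reliability_eq26(t, lam_h, n))
        print(f"z = {z:8.0e}  |R(4) - R(26)| = {gap:.3e}")
```

The gap between the two expressions should shrink as $z$ increases, which is the behavior the limiting argument relies on.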
Note that (26) is the same form as (20); therefore
$$\mathrm{MTTF} = \frac{1}{\lambda_h n M} \sum_{i=0}^{M} \binom{M}{i} \frac{i!}{M^i}. \qquad (27)$$
Again, the factor given in (22),
$$B(M) = \sum_{i=0}^{M} \binom{M}{i} \frac{i!}{M^i},$$
appears in the equation. The coding gain is again given by (24),
$$CG = \frac{k}{n} B(M).$$
Equations (21) and (27), the results of Section III-D and this
section, respectively, are identical except for the value of $\lambda$ they
employ; (21) uses $\lambda_{sc}$, while (27) uses $\lambda_h$. The coding gain,
however, is identical whenever either Assumption 1 or 2 is
violated.
IV. THE GENERAL CASE: MULTIPLE TYPES
OF HARD ERRORS
A. Multiple Hard-Error Types with One Block per Chip
The preceding section dealt with single-cell hard and soft
errors in an M-codeword memory encoded along the rows with
a single error-correcting code. Here the situation will be the
same, except that other types of hard errors, namely row failures,
column failures, row-column failures, and entire chip
failures will also be considered. Analysis will again be done via
the Poisson distribution. In this analysis, coding will be restricted to
one codeword per row for the sake of simplicity, and the chip is
assumed to be composed of one block.
Since the memory is encoded with codewords along rows, the
entire chip can survive exactly one column failure. However, the
subsequent appearance of a single error of any type anywhere
in the chip will then cause a failure. Conversely, any other type
of multiple-cell hard error (row, row-column, and entire chip
failures) at any time will cause the chip to fail immediately.
These latter types of errors will be designated catastrophic
failures.
The symbols used are the following.
$\lambda_h$      Single-cell hard-error arrival rate per cell.
$\lambda_s$      Single-cell soft-error arrival rate per cell.
$\lambda_{sc}$   $\lambda_h + \lambda_s$ = total single-cell error-arrival rate per cell.
$\lambda_c$      Column-failure rate per chip.
$\lambda_r$      Row-failure rate per chip.
$\lambda_{rc}$   Row-column-failure rate per chip.
$\lambda_{ch}$   Whole-chip failure rate per chip.
$\lambda_{cat}$  $\lambda_r + \lambda_{rc} + \lambda_{ch}$ = catastrophic-failure rate per chip.
$\lambda_m$      $\lambda_c + \lambda_{rc} + \lambda_r + \lambda_{ch}$ = multiple-cell error-arrival rate
                 per chip.
$\lambda$        $\lambda_m + Mn\lambda_{sc}$ = total error-arrival rate per chip.
$t_s$            Soft-error scrubbing period.
$n$              Number of bits in a codeword.
$M$              Number of codewords on a chip.
The analysis here is for when Assumptions 1 and 2 hold.
Therefore, the codeword-level reliability function remains
unchanged from (4) in the range where the two assumptions are
valid; however, the chip-level reliability function is no longer the
row reliability function raised to the power of $M$. The probability
of chip success at time $t$ is controlled by the following: 1)
There must be no chip-level failures (multicell hard errors) or
codeword-level failures that cause the memory to fail; 2) if there
is a column failure (a chip-level failure) at time $\tau$, there must be
no further errors of any type from time $\tau$ to $t$. This gives the
following:
$$R_c(t) = P(\text{no multicell failures})\,P(\text{no single-cell failures}) + \int_0^t P(\text{no chip failures in 0 to } \tau)\,P(\text{1 column failure in } d\tau)\,P(\text{no errors in } \tau \text{ to } t)$$
$$= e^{-\lambda_m t} R^M(t) + \int_0^t \left[e^{-\lambda_m \tau} R_0^M(\tau)\right]\left[\lambda_c\, d\tau\right]\left[e^{-\lambda (t - \tau)}\right]$$
$$= e^{-\lambda t}\left[e^{c_1 t} c_2 + c_3\right]^M + \lambda_c e^{-\lambda t} \int_0^t e^{c_1 M \tau}\, d\tau, \qquad (28)$$
where, as shown in Appendix A,
$$c_1 = \frac{1}{t_s} \ln(1 + \lambda_s n t_s),$$
$$c_2 = 1 + \frac{t_s \lambda_h n}{\ln(1 + \lambda_s n t_s)},$$
$$c_3 = -\frac{t_s \lambda_h n}{\ln(1 + \lambda_s n t_s)}.$$
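Integration of (28) yields a chip reliability function of
$$R_c(t) = e^{-\lambda t}\left[e^{c_1 t} c_2 + c_3\right]^M + \frac{\lambda_c}{c_1 M}\left(e^{-\lambda t + c_1 M t} - e^{-\lambda t}\right).$$
The mean time to failure (MTTF) is now
$$\mathrm{MTTF} = \int_0^\infty R_c(t)\, dt$$
$$= \left[\int_0^\infty e^{-\lambda t}\left[e^{c_1 t} c_2 + c_3\right]^M dt\right] + \left[\int_0^\infty \frac{\lambda_c}{c_1 M}\left(\exp[-\lambda t + c_1 M t] - \exp[-\lambda t]\right) dt\right]$$
$$= \sum_{i=0}^{M} \binom{M}{i} \frac{c_2^i\, c_3^{M-i}}{\lambda - c_1 i} + \frac{\lambda_c}{c_1 M}\left[\frac{1}{\lambda - c_1 M} - \frac{1}{\lambda}\right]. \qquad (29)$$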
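The two parts of the MTTF above, the binomial summation and the additive column-failure term, can be evaluated directly. This is a minimal sketch under stated assumptions: the function and parameter names are illustrative, the rates in the usage lines are hypothetical, and the alternating sum is evaluated in plain floating point, so it is only suitable for modest $M$.

```python
import math

def chip_mttf_terms(lam_h, lam_s, n, M, ts, lam_m, lam_c):
    """Return (summation, additive); their sum is an MTTF of the form (29).

    lam_h, lam_s : single-cell hard/soft error rates per cell
    lam_m        : multiple-cell error rate per chip (includes lam_c)
    lam_c        : column-failure rate per chip
    """
    log_term = math.log1p(lam_s * n * ts)       # ln(1 + lam_s n t_s)
    y = ts * lam_h * n / log_term
    c1, c2, c3 = log_term / ts, 1.0 + y, -y
    lam = lam_m + M * n * (lam_h + lam_s)       # total error rate per chip
    if lam - c1 * M <= 0.0:
        raise ValueError("integral does not converge for these rates")
    summation = sum(
        math.comb(M, i) * c2**i * c3**(M - i) / (lam - c1 * i)
        for i in range(M + 1)
    )
    # (lam_c / (c1 M)) * [1/(lam - c1 M) - 1/lam] simplifies to:
    additive = lam_c / (lam * (lam - c1 * M))
    return summation, additive

if __name__ == "__main__":
    # Hypothetical rates, chosen only to exercise the formula.
    s, a = chip_mttf_terms(1e-3, 1e-2, n=8, M=4, ts=1.0, lam_m=1e-3, lam_c=5e-4)
    print(f"summation = {s:.4f}, additive = {a:.4f}, MTTF = {s + a:.4f}")
```

For large $M$ the terms of the summation alternate in sign and the binomial coefficients grow quickly, so an exact-arithmetic or rescaled evaluation would be a safer design choice than this direct sum.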
The coding gain is 
t s A h n  
ln(:C:nts))'( - In(l+A,nt,) 
A - - In (1 + A p t , )  i 
t S  
k M  
where ( n ,  k )  codes are used. 
Thus, the mean time to failure in (29) is modified from (6)
by two factors. First, instead of the total single-cell error
rate $Mn\lambda_{sc}$ in the denominator of the summation, the total
error rate $\lambda$ is used. Since $\lambda$ is slightly larger than $Mn\lambda_{sc}$, the
denominator has increased slightly; thus, each term of the
summation in (29) is slightly smaller than in (6), leading to a
smaller overall sum in (29) than in (6). Second, an additive term
which increases the MTTF is present. This second term in (29)
can be approximated as follows:
$$\frac{\lambda_c}{c_1 M}\left[\frac{1}{\lambda - c_1 M} - \frac{1}{\lambda}\right] = \frac{\lambda_c}{\lambda}\left[\lambda - c_1 M\right]^{-1} \approx \frac{\lambda_c}{\lambda}\left[\lambda_m + Mn\lambda_h\right]^{-1}. \qquad (31)$$
The additive term, therefore, is shown by (31) as being small
compared to the summation term in (30). In typical memory
chips, $\lambda_c$, the rate of column failure, is less than 10% of
$\lambda_m + Mn\lambda_h$, the total hardware failure rate [21]. Therefore, the
additive term in (30) is typically less than one-tenth the mean
time to failure for an unprotected chip. A protected chip can be
expected to have a much longer MTTF than an unprotected one
(i.e., $CG \gg 1$), so the additive term is insignificant in most cases
to the total MTTF.
B. Multiple Hard-Error Types with Multiple Blocks per Chip 
If the chip is subdivided into several identical blocks as shown
in Fig. 5, some assumptions need to be made regarding the
structure of the chip and its failure modes. The simplest case to
analyze (and the design with the simplest and most dense
construction) is when, as shown in [6], row, row-column, and
column errors occur across the entire chip, so that the only
additional hard-error type beyond the five discussed in Section
IV-A is block failures. (Chips of this type are those with only
one set of selection circuitry instead of separate selection
circuits for each block, and with read-write circuits impervious to
failure.) There can be multiple-block failures, but since any type of
block failure, as well as row and row-column failures, is
catastrophic, the error-arrival rates of all of these errors can be
lumped together, with
$\lambda_{block}$ = arrival rate of all types of block failures.
Only the catastrophic-failure arrival rate is altered, and the
equation for mean time to failure remains unchanged from (29).
For other cases, such as completely independent selection and
read-write circuitry for each block, the analysis becomes
complex. Since single-cell failures are the most common type of
hardware failure, and since complex models do not provide the
insight of simpler ones, analysis of these complex cases will
not be addressed here.
V. SIMULATION RESULTS 
Theoretical expectations were calculated and compared with 
simulation results. Testing the single-cell simple result (6) was 
done because it is simpler than testing the multiple hard-cell 
failure-mode model (29) and because (29) is based on (6). Since 
mean time to failure simulations are very time-consuming, the 
mean events to failure (METF) was computed. MTTF can be
determined from METF and the mean error-arrival rate $\lambda$ via
(1):
$$\mathrm{MTTF} = \frac{1}{\lambda}\,\mathrm{METF}.$$
The determination of the METF was obtained by first assuming
the following: Since typical chips have much higher probabilities
of single-cell errors, both hard and soft, than other types of
errors, the error rate used was $\lambda_{sc}$, the single-cell error-arrival
rate, not $\lambda$, the total arrival rate. Also, in most realistic cases,
Assumptions 1 and 2 hold, so the hard-error arrival rate is much 
lower than the soft-error arrival rate, and the refresh cycle 
length is shorter than the mean time between soft-error arrivals. 
Using these two assumptions, the METF was computed as 
follows. An error of either type (hard or soft) was assumed to 
occur, with any number of additional errors determined ran- 
domly with a Poisson distribution, in a single scrub cycle time 
period. The type of error was determined randomly so that the
average ratio of hard to soft errors was $\lambda_h/\lambda_s$. These errors
were allowed to occur in any one of $M$ codewords. If more than one error
occurred in any codeword, a failure was declared. Scrubbing was 
conducted after these error insertions; all soft errors were 
cleared while hard errors remained. 
The number of errors required to result in a chip failure was
recorded and averaged; this provided the mean events to failure.
This procedure is justified by the following. Since errors occur
only occasionally in each scrub cycle, those scrub cycles where
no error occurs can be ignored; this means that all scrub cycles
where at least one error of either type occurs are examined but
all other scrub cycles are ignored. Since the mean events to
failure is typically large (the effect of soft errors is diminished
by scrubbing and hard errors are rare), the mean time to failure
can be approximated by the product of the mean error
interarrival time $1/\lambda_{sc}$ and the mean events to failure.
Results were obtained by averaging over 2000 tries; some are 
plotted next. All parameters are per-chip: Codewords per chip, 
hard-error arrival rates per chip, and soft-error arrival rates per 
chip. Thus, there is no dependence on the length of the code- 
word n. The parameters used here are defined as follows: 
$\lambda'_h = \lambda_h n M$ = single-cell hard-error arrival rate per chip;
$\lambda'_s = \lambda_s n M$ = single-cell soft-error arrival rate per chip.
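The METF procedure described above can be sketched as a small Monte Carlo program. This is an illustrative reading of the procedure, not the authors' simulator: the function names are hypothetical, and the mean of the Poisson count of additional errors per examined scrub cycle is assumed here to be $(\lambda'_h + \lambda'_s)\,t_s$, a detail the text does not specify.

```python
import math
import random

def poisson(rng, mu):
    """Poisson sample by Knuth's multiplication method (fine for small mu)."""
    if mu <= 0.0:
        return 0
    limit = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def events_to_failure(rng, M, rate_hard, rate_soft, ts, max_events=10**7):
    """Count error events until some codeword holds more than one error."""
    p_hard = rate_hard / (rate_hard + rate_soft)
    mu = (rate_hard + rate_soft) * ts   # assumed mean of additional errors
    hard = set()                        # codewords holding a hard error
    events = 0
    while events < max_events:
        soft = set()                    # soft errors alive in this scrub cycle
        for _ in range(1 + poisson(rng, mu)):
            events += 1
            w = rng.randrange(M)        # error lands in a random codeword
            if w in hard or w in soft:
                return events           # second error in a codeword: failure
            (hard if rng.random() < p_hard else soft).add(w)
        # Scrubbing: the soft set is discarded; hard errors persist.
    return events                       # cap reached without failure

def mean_events_to_failure(M, rate_hard, rate_soft, ts, trials=2000, seed=0):
    rng = random.Random(seed)
    total = sum(events_to_failure(rng, M, rate_hard, rate_soft, ts)
                for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    metf = mean_events_to_failure(256, rate_hard=1e-7, rate_soft=0.0, ts=1.0)
    print(f"METF with hard errors only, M = 256: {metf:.2f}")
```

With hard errors only, the procedure reduces to the birthday problem, so the printed mean should lie close to $B(256) = 20.73$. An MTTF estimate then follows by dividing the METF by the per-chip single-cell error rate, as described in the text.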
A. Varying the Scrub Interval 
Fig. 7 shows the effect of keeping the error rates constant 
while the scrub cycle time was varied. This log plot shows that 
there is a threshold characteristic for soft error scrubbing; if the 
chip is scrubbed faster than some rate, there is little increase in 
MTTF. Also, if the chip is scrubbed much slower, the MTTF 
does not decrease much beyond some other level. This transition
is directly linked to the failure mode of the chip: When the
scrub cycle time is long, most of the chip failures are of the
soft-soft type, where two soft errors cause chip failure, because
$\lambda'_s \gg \lambda'_h$. These failure modes were confirmed through
simulations. When the chip is scrubbed at a faster rate, the failures are
of the soft-hard type, where one soft and one hard error in a
codeword cause chip failure.
Fig. 7. MTTF with $\lambda'_h = 10^{-7}$ sec$^{-1}$, $\lambda'_s$ = [illegible] sec$^{-1}$, $M = 256$
codewords. Model (7) in dashed thin line; model (21) in dot-dashed thin line;
simulation results in bold solid line.
Fig. 8. MTTF with $\lambda'_h$ = [illegible] sec$^{-1}$, $t_s$ = [illegible] sec, $M = 256$
codewords. Model (7) in dashed thin line; model (27) in dot-dashed thin line;
simulation results in bold solid line.
Note that the model (7) matches the simulation results closely
except for large values of scrub cycle time, where model (21)
holds. The data smoothly switches from one model to the other.
For small $t_s$, the MTTF should be as given in (9) or (11); this is
shown in Fig. 7. In the case plotted, since $\lambda_h \ll \lambda_s$, (11) should
hold:
$$\mathrm{MTTF} = \frac{1}{\lambda_h n M}.$$
This is indeed the case, as $\lambda'_h = 10^{-7}$ sec$^{-1}$, and the MTTF
levels off at about $10^7$ sec. For large $t_s$, we expect the mean time
to failure to be as given in (21):
$$\mathrm{MTTF} = \frac{1}{\lambda_{sc} n M}\, B(M),$$
and for $M = 256$ we have
$$B(M) = \sum_{i=0}^{M} \binom{M}{i} \frac{i!}{M^i} = 20.73.$$
Again the model matches simulation, as $\lambda'_s$ = [illegible] sec$^{-1}$ and
the MTTF levels off at $2.1 \times 10^5$ sec.
Thus, there are two flat regions for the MTTF. The MTTF is
expected to be near the reciprocal of the hard-error rate if
scrubbing is fast, because soft errors do not have time to build up
and hard errors remain. Failure primarily occurs from having a soft
error occurring in the same codeword as a hard error (which is not
removed by scrubbing). Since the hard-error rate is much slower than the
soft-error rate, the relative time period between a hard-error
strike and a subsequent soft-error strike in the same codeword is
small compared to the time period from start to first hard-error
strike. Therefore, the MTTF approaches the MTTF with no
protection and only hard errors, as expected by (11):
$$\mathrm{MTTF} = \frac{1}{\lambda_h n M}.$$
When scrubbing is slow, the model described by (21) in
Section III-D applies, and this predicts a plateau at
$$\mathrm{MTTF} = \frac{1}{\lambda_{sc} n M}\, B(M)$$
when $t_s \to \infty$. Note that in this case, soft-error scrubbing is not
effective; therefore, the MTTF is determined mainly by the
number of codewords and not the arrival times or scrub periods.
A point of interest to system designers would be the knee of
the curve, where the MTTF drops from its higher plateau. In
Fig. 7 this occurs at about $10^4$ seconds, when the MTTF drops
lower than $10^7$ seconds. This is the maximum scrub interval
permissible before system performance degrades significantly.
Taking the point when the MTTF is $1/\sqrt{2}$ that of its plateau
value as the knee, experimentally we have found that
$$\lambda_s n t_s = a\,\frac{\lambda_h}{\lambda_s}, \qquad (32)$$
where $0.8 < a < 1$ gives the location of this corner point. The
parameter $a$ for relatively small $t_s$ remains at about 0.83; for
larger $t_s$, $a \to 1$. Since the knee was defined at an arbitrary
point, this relation provides only a rule-of-thumb maximum
for $t_s$:
$$t_s < a\,\frac{\lambda_h}{\lambda_s^2\, n}. \qquad (33)$$
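The rule of thumb (33) is simple enough to wrap in a helper. The sketch below is illustrative only; the function name, the default $a = 0.83$ (the value quoted above for relatively small $t_s$), and the per-cell rates in the usage lines are assumptions, not values taken from the paper's figures.

```python
def max_scrub_interval(lam_h, lam_s, n, a=0.83):
    """Rule-of-thumb upper limit on t_s from (33): a * lam_h / (lam_s**2 * n).

    lam_h, lam_s : single-cell hard/soft error rates per cell (1/sec)
    n            : number of bits in a codeword
    a            : empirical knee parameter, reported to lie in (0.8, 1)
    """
    if not 0.8 < a < 1.0:
        raise ValueError("a is expected to lie between 0.8 and 1")
    if lam_h <= 0.0 or lam_s <= 0.0 or n <= 0:
        raise ValueError("rates and codeword length must be positive")
    return a * lam_h / (lam_s ** 2 * n)

if __name__ == "__main__":
    # Hypothetical per-cell rates and codeword length.
    ts_max = max_scrub_interval(lam_h=1e-12, lam_s=1e-9, n=64)
    print(f"scrub at least every {ts_max:.0f} sec")
```

Because the knee itself was defined at an arbitrary point, the returned value is best read as an order-of-magnitude guide rather than a hard limit.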
B. Varying the Soft-Error Rate 
If the soft-error rate is altered, then as shown in Fig. 8, there
are two plateaus in the MTTF. Again, model and simulation are
close. If $\lambda'_s$ is very large, then most of the failures are of the
soft-soft type, and soft-error scrubbing is not removing soft
errors fast enough. Thus, as the soft-error arrival rate increases,
MTTF decreases. The decrease is linear on the log-log plot,
and this confirms (14), the case where soft errors dominate, or
$\lambda_h \to 0$ in relation to $\lambda_s$:
$$\mathrm{MTTF} = \frac{1}{\lambda_s n M}\left[1 - \frac{\ln(1 + \lambda_s n t_s)}{\lambda_s n t_s}\right]^{-1}$$
$$\propto \frac{1}{\lambda_s n M}.$$
Fig. 9. MTTF (sec) versus $\lambda'_h$ (sec$^{-1}$) with $\lambda'_s$ = [illegible] sec$^{-1}$,
$t_s$ = [illegible] sec, $M = 256$ codewords. Model (7) in dashed thin line;
model (27) in dot-dashed thin line; simulation results in bold solid line.
In the flat center region, most of the failures are the soft-hard
type; the MTTF remains relatively unchanged. Here, scrubbing
removes soft errors quickly enough, not allowing more than one
soft error per codeword before wiping them clean. Hard errors,
however, remain and effectively neutralize the advantage of
coding; only then can a single soft error cause a failure. The
MTTF in this region is approximately the reciprocal of the
hard-error rate, $1/\lambda'_h$.
Since the hard-error rate is constant, the MTTF also is slow to
change. But as $\lambda'_s$ decreases further, it takes longer for the next
soft strike to hit a codeword already containing a hard strike, so
the MTTF increases, until the hard-hard failure mode dominates
for $\lambda'_s \ll \lambda'_h$. This is the case where soft errors are practically
nonexistent and therefore hard errors are dominant: Assumption 2
is no longer valid, thus requiring the analysis given
in Section III-E and the use of (27). The MTTF levels off at
approximately $B(M)$ times the flat center region (soft-hard
failure mode) as expected, and since $\lambda'_h$ is kept constant, the
curve is flat.
Note that the lower knee of this curve, as the failure mode
shifts from soft-hard to soft-soft, occurs at the same point
found by (32) in Section V-A,
$$\lambda_s n t_s = a\,\frac{\lambda_h}{\lambda_s},$$
where $0.8 < a < 1$. In Fig. 8 this occurs at $\lambda_s = 0.1$ sec$^{-1}$. Solving
(32) for $\lambda_s$ yields
$$\lambda_s = \sqrt{\frac{a\,\lambda_h}{n\, t_s}}. \qquad (34)$$
C. Varying the Hard-Error Rate 
The effect of altering the hard-error rate $\lambda'_h$ is shown in Fig.
9. For the left half of the curve, the failure modes are primarily
soft-hard. This means that soft errors are scrubbed out fast
enough to prevent accumulation in a codeword during a single
scrub cycle. Also, hard errors are much less frequent than soft
errors, so the codeword fails primarily when a soft error strikes
a codeword whose error-correcting power has been negated
previously by a hard error. This is the case described by (11),
$$\mathrm{MTTF} = \frac{1}{\lambda_h n M}.$$
Fig. 10. MTTF versus number of codewords with $\lambda'_h$ = [illegible] sec$^{-1}$,
$\lambda'_s$ = [illegible] sec$^{-1}$, $t_s$ = [illegible] sec.
Model (7) in dashed thin line; simulation results in bold solid line.
Fig. 9 is roughly linear with a negative slope in this region,
therefore confirming the inverse relation given by (11). Note
again that the model tracks the simulation closely.
Had the failure mode been soft-soft, that is, the no-hard-error
case discussed in Section III-C, the MTTF would have
taken the form given by (14),
$$\mathrm{MTTF} = \frac{1}{\lambda_s n M}\left[1 - \frac{\ln(1 + \lambda_s n t_s)}{\lambda_s n t_s}\right]^{-1},$$
and would therefore be inversely related to $\lambda_s$, not $\lambda_h$. This
failure mode is not visible in Fig. 9. For the right half of the
curve, the failure mode is hard-hard; there Assumption 2 is not
valid, so the model developed in Section III-E, (27),
must be used. This is also plotted in Fig. 9, and is shown to
track the experimental data well. Equation (27) is inversely
proportional to $\lambda'_h$, but since (27) is similar to (11) and is only
scaled by the factor $B(M)$, an increase in MTTF is expected;
this is shown in the slight upward perturbation in the plot
between the regions where the two models (11) and (27) meet.
D. Varying the Number of Codewords 
Fig. 10 shows the effect on the MTTF of altering the number 
of codewords M while the per-chip error-arrival rates remain 
constant. This is the case where the chip is coded with varying 
size codewords while the number of cells remains constant. As 
shown in Fig. 10, the increase in MTTF from increasing the 
number of codewords is small; in previous plots, the increases 
have been over several orders of magnitude. Here, the range is 
linear, and the MTTF is barely doubled by squaring the number 
of codewords in the chip. This confirms the relation found in 
[21]: 
Again, the model is close to the simulation results. 
VI. CONCLUSION 
We have presented an accurate model for the mean time to 
failure of a semiconductor RAM protected with a single error- 
correcting code along its word lines. The model takes into 
account all types of hard errors, single-cell soft errors, and 
soft-error scrubbing to derive the mean time to failure, thus 
providing a complete picture of the gain obtained by coding a 
memory chip with a single error-correcting code along the rows. 
The equation for MTTF presented can be directly used by the
system designer wishing to assess the benefits of coding for
memory protection.
VII. ACKNOWLEDGMENT 
The authors would like to thank Robert J. McEliece for 
comments and suggestions. 
APPENDIX A 
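Here we solve the integral for MTTF for the simple case of
only single-cell hard and soft failures. From (5), the mean time
to failure for a chip with $M$ codewords is given by
$$\mathrm{MTTF} = \int_0^\infty e^{-\lambda_{sc} n M t}\left[(1 + \lambda_s n t_s)^{t/t_s}\left(1 + \frac{t_s \lambda_h n}{\ln(1 + \lambda_s n t_s)}\right) - \frac{t_s \lambda_h n}{\ln(1 + \lambda_s n t_s)}\right]^M dt. \qquad (35)$$
This is of the form
$$\mathrm{MTTF} = \int_0^\infty e^{c_0 t}\left[e^{c_1 t} c_2 + c_3\right]^M dt, \qquad (36)$$
where
$$c_0 = -\lambda_{sc} n M,$$
$$c_1 = \frac{1}{t_s} \ln(1 + \lambda_s n t_s),$$
$$c_2 = 1 + \frac{t_s \lambda_h n}{\ln(1 + \lambda_s n t_s)},$$
$$c_3 = -\frac{t_s \lambda_h n}{\ln(1 + \lambda_s n t_s)}.$$
Equation (36) can be evaluated by making the substitution
$y = e^{c_1 t}$:
$$\mathrm{MTTF} = \frac{1}{c_1}\int_1^\infty y^{c_0/c_1 - 1}\left[c_2 y + c_3\right]^M dy = \frac{1}{c_1}\sum_{i=0}^{M}\binom{M}{i} c_2^i\, c_3^{M-i}\int_1^\infty y^{c_0/c_1 + i - 1}\, dy,$$
which does not converge unless
$$\frac{c_0}{c_1} + i < 0,$$
which means that $c_0 + c_1 i < 0$. Since $c_0 < 0$ and $c_1 > 0$, the
condition may be violated when $i$ is as large as possible, i.e.,
when $i = M$. But $\ln(1 + x) < x$, so
$$c_0 + c_1 M = -\lambda_{sc} n M + \frac{M}{t_s} \ln(1 + \lambda_s n t_s)$$
$$< -\lambda_{sc} n M + \lambda_s n M$$
$$= -\lambda_h n M$$
$$< 0.$$
Also, the denominator cannot be zero for any $i$, $0 \le i \le M$:
$$\lambda_{sc} n M \ne \frac{i}{t_s} \ln(1 + t_s n \lambda_s), \quad \text{for } 0 \le i \le M. \qquad (37)$$
Using $\ln(1 + x) < x$ again,
$$\lambda_s n i \ge \frac{i}{t_s} \ln(1 + \lambda_s n t_s),$$
with equality only for $i = 0$.
Since $\lambda_{sc} > \lambda_s$, and since $i \le M$, the condition set forth in (37) is
satisfied.
The equation for mean time to failure becomes
$$\mathrm{MTTF} = \sum_{i=0}^{M} \binom{M}{i} \frac{\left[1 + \dfrac{t_s \lambda_h n}{\ln(1 + \lambda_s n t_s)}\right]^i \left[-\dfrac{t_s \lambda_h n}{\ln(1 + \lambda_s n t_s)}\right]^{M-i}}{\lambda_{sc} n M - \dfrac{i}{t_s} \ln(1 + \lambda_s n t_s)}. \qquad (6)$$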
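As a sanity check on the derivation, the closed form (6) can be compared with a direct numerical quadrature of (36). The sketch below is illustrative: the function names and the rates used are assumptions, and the trapezoidal rule with a fixed upper limit stands in for the infinite integral.

```python
import math

def coefficients(lam_h, lam_s, n, M, ts):
    """c0, c1, c2, c3 as defined for (36)."""
    log_term = math.log1p(lam_s * n * ts)      # ln(1 + lam_s n t_s)
    y = ts * lam_h * n / log_term
    return -(lam_h + lam_s) * n * M, log_term / ts, 1.0 + y, -y

def mttf_closed_form(lam_h, lam_s, n, M, ts):
    """Binomial sum (6); each term requires c0 + c1 * i < 0."""
    c0, c1, c2, c3 = coefficients(lam_h, lam_s, n, M, ts)
    if c0 + c1 * M >= 0.0:
        raise ValueError("convergence condition c0 + c1*M < 0 is violated")
    return sum(math.comb(M, i) * c2**i * c3**(M - i) / -(c0 + c1 * i)
               for i in range(M + 1))

def mttf_quadrature(lam_h, lam_s, n, M, ts, t_max, steps=200_000):
    """Trapezoidal estimate of (36) on [0, t_max]."""
    c0, c1, c2, c3 = coefficients(lam_h, lam_s, n, M, ts)

    def integrand(t):
        # [exp(c0 t / M) * (exp(c1 t) c2 + c3)]**M equals exp(c0 t) [...]**M
        per_word = math.exp(c0 * t / M) * (math.exp(c1 * t) * c2 + c3)
        return per_word ** M

    dt = t_max / steps
    inner = sum(integrand(k * dt) for k in range(1, steps))
    return dt * (0.5 * (integrand(0.0) + integrand(t_max)) + inner)

if __name__ == "__main__":
    args = dict(lam_h=1e-3, lam_s=1e-2, n=8, M=4, ts=1.0)  # hypothetical
    print("closed form:", mttf_closed_form(**args))
    print("quadrature :", mttf_quadrature(t_max=600.0, **args))
```

The two printed values should agree to several digits when `t_max` is large compared with $1/|c_0 + c_1 M|$, the slowest decay time in the integrand.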
APPENDIX B 
In Section III-A, Assumptions 1 and 2 were presented and
justified as reasonable characteristics of real chips.
Assumption 1: The scrub cycle must be small compared to the
mean time to failure:
$$t_s \ll \mathrm{MTTF}.$$
Assumption 2: The hard-error rate must be small compared to
the soft-error rate:
$$\lambda_h \ll \lambda_s.$$
These assumptions must hold in the model described by (6) as
well for the following reasons. In (2) the probability of correct
operation given that no hard errors occurred was presented as
$$R_0(t) = e^{-\lambda_{sc} n t}\,(1 + \lambda_s n t_s)^{t/t_s}.$$
This was derived by using the approximation that a smooth,
continuous function that equals another piecewise continuous
function at regular intervals can be used to approximate that
piecewise function. That piecewise function arises from the
discrete time analysis of the soft-error scrubbing effect, and it is,
for the case of having no hard errors in the codeword,
$$R'_0(t) = \left[e^{-\lambda_h n t}\right] P(\text{0 or 1 soft error per scrub cycle in time 0 to } \lfloor t/t_s \rfloor t_s) \cdot P(\text{0 or 1 soft error in time } \lfloor t/t_s \rfloor t_s \text{ to } t)$$
$$= e^{-\lambda_{sc} n t}\,(1 + \lambda_s n t_s)^{\lfloor t/t_s \rfloor}\left[1 + \lambda_s n t_s\,(t/t_s - \lfloor t/t_s \rfloor)\right], \qquad (38)$$
where $\lfloor x \rfloor$ is the largest integer not exceeding $x$.
$R'_0$ is equal in value to $R_0$ when $t/t_s$ is an integer; that is,
$t/t_s = \lfloor t/t_s \rfloor$. In between these points, when $t/t_s$ is not an
integer, $\lfloor t/t_s \rfloor$ is constant at the last integer value. $R_0$ and $R'_0$
are of the form
$$R_0(t) = e^{-\lambda_{sc} n t}\,(1 + \lambda_s n t_s)^{t/t_s} = e^{-\lambda_{sc} n t}\, Q(t),$$
$$R'_0(t) = e^{-\lambda_{sc} n t}\,(1 + \lambda_s n t_s)^{\lfloor t/t_s \rfloor}\left[1 + \lambda_s n t_s\,(t/t_s - \lfloor t/t_s \rfloor)\right] = e^{-\lambda_{sc} n t}\, Q'(t).$$
Fig. 11. Difference between $R'_0(t)$ and $R_0(t)$ versus $t$ (sec). Parameters
were $\lambda_h$ = [illegible] sec$^{-1}$, $\lambda_s$ = [illegible] sec$^{-1}$, and $n = 1024$.
Dashed line, $t_s = 1000$ sec; dot-dashed line, $t_s = 10$ sec; dotted line,
$t_s = 2$ sec; solid line, $t_s = 1$ sec; bold solid line, $t_s = 0.5$ sec. Note
that as $t_s$ decreases the maximum difference between $R'_0(t)$ and $R_0(t)$
diminishes.
As can be seen, $Q'(t)$ is a piecewise linear approximation of
$Q(t)$, a monotonically increasing convex function. Therefore,
$Q'(t) \ge Q(t)$, and thus, $R'_0 \ge R_0$. Thus, (2) is a lower bound of the exact
solution (38) using the Poisson approximation. However, $R_0 = R'_0$
at values of $t$ determined by $t_s$, and $R_0$ does not vary far from
$R'_0$ when $t_s$ is sufficiently small. $t_s$ is sufficiently small when
$t_s \ll \mathrm{MTTF}$. This confirms the need for Assumption 1, as otherwise
$R_0(t)$ will not be a good approximation for the more
accurate $R'_0(t)$. (See Fig. 11.)
Note that the discrete scrub period analysis is much more
complex than the continuous scrub period case, as is evident by
the complexity of the reliability function. Carrying through this
analysis will yield a mean time to failure model that is more
complex to derive and interpret than the model presented in
Section III-A, and is therefore unsuitable for use.
In (3), another assumption was made, that Assumption 2 must
hold as well as Assumption 1. This assumption assures that
$R_1(t)$ is a good approximation of $R'_1(t)$ in the following manner.
$R_1(t)$ is the integral over $\tau$ of the product of three terms: First,
the probability that no failure had occurred given no hard error
had occurred from 0 to $\tau$; second, the probability that a hard
error occurred in time $d\tau$; third, the probability that no error,
hard or soft, occurred between $\tau$ and $t$. Thus, the first term is
$R_0(\tau)$. This by itself justifies the use of Assumption 1 here.
In addition, the following analysis is needed. Without soft
errors, the peak of $R_1(t)$ occurs at $1/\lambda_h$. This peak will be
shifted toward zero with the occurrence of soft errors. Since
$R_0(t)$ is not accurate for large $t_s$ (and equivalently small $\lambda_s$,
due to the time scaling property of the model as described in (7)),
there must be sufficiently many scrub cycles before $t = 1/\lambda_h$.
This means that $1/\lambda_h \gg t_s$. Now $\lambda_s n t_s \sim 1$ so that soft errors do
not pile up in any one codeword during one scrub cycle. Also, if
$\lambda_s n t_s \gg 1$, there will be a high probability that two or more
errors will occur in a single codeword within a scrub cycle, so in
this case fewer errors (and therefore fewer scrub cycles) are
needed to cause a failure. Thus, $\lambda_s n t_s \sim 1$ must be maintained to
avoid violation of Assumption 1. This leads to
$$\frac{1}{\lambda_h} \gg t_s \sim \frac{1}{\lambda_s}$$
or
$$\frac{\lambda_h}{\lambda_s} \ll 1.$$
Thus, Assumption 2 must hold.
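The gap between the smooth $R_0(t)$ of (2) and the piecewise $R'_0(t)$ of (38), which motivates Assumption 1 above, can be explored numerically. The following sketch is illustrative: the function names are hypothetical, and the rates are placeholders chosen only to give a visible effect with $n = 1024$.

```python
import math

def r0_smooth(t, lam_h, lam_s, n, ts):
    """Smooth approximation (2): exp(-lam_sc n t) * (1 + lam_s n ts)**(t/ts)."""
    x = lam_s * n * ts
    return math.exp(-(lam_h + lam_s) * n * t) * (1.0 + x) ** (t / ts)

def r0_piecewise(t, lam_h, lam_s, n, ts):
    """Discrete-scrub form (38), linear within each scrub cycle."""
    x = lam_s * n * ts
    cycles = math.floor(t / ts)            # completed scrub cycles
    frac = t / ts - cycles                 # fraction of the current cycle
    return (math.exp(-(lam_h + lam_s) * n * t)
            * (1.0 + x) ** cycles * (1.0 + x * frac))

def max_gap(lam_h, lam_s, n, ts, t_max=10.0, steps=10_000):
    """Largest value of R0' - R0 on a uniform grid over [0, t_max]."""
    grid = (t_max * k / steps for k in range(steps + 1))
    return max(r0_piecewise(t, lam_h, lam_s, n, ts)
               - r0_smooth(t, lam_h, lam_s, n, ts) for t in grid)

if __name__ == "__main__":
    lam_h, lam_s, n = 1e-7, 1e-4, 1024     # hypothetical rates
    for ts in (10.0, 2.0, 1.0, 0.5):
        print(f"t_s = {ts:4.1f} sec  max gap = {max_gap(lam_h, lam_s, n, ts):.3e}")
```

The printed gaps should shrink as $t_s$ decreases, in line with the trend described for Fig. 11, and the difference should never be negative, since the chord of a convex function lies on or above it.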
REFERENCES 
[1] D. Y. Koo and H. B. Chenoweth, “Choosing a practical model for
ECC memory chip,” in Proc. 1984 IEEE Reliability and Maint.
Symp., 1984, pp. 255-261.
[2] C. L. Chen and M. Y. Hsiao, “Error-correcting codes for
semiconductor memory applications: A state-of-the-art review,” IBM J. Res.
Develop., vol. 28, pp. 124-134, Mar. 1984.
[3] T. Fuja and C. Heegard, “Focused codes for channels with skewed
errors,” IEEE Trans. Inform. Theory, vol. 36, no. 4, pp. 773-783,
July 1990.
[4] T. C. May and M. H. Woods, “Alpha-particle-induced soft errors in
dynamic memories,” IEEE Trans. Electron Devices, vol. ED-26, pp.
2-9, Jan. 1979.
[5] M. Horiguchi et al., “An experimental large-capacity
semiconductor file memory using 16-levels/cell storage,” IEEE J. Solid-State
Circuits, vol. SC-23, pp. 27-33, Feb. 1988.
[6] D. Marston, “Memory system reliability with ECC,” Intel
Application Note AP-73, Intel Corp., 1980.
[7] T. Fuja and C. Heegard, “Row/column replacement for the
control of hard defects in semiconductor RAM's,” IEEE Trans.
Comput., vol. C-35, pp. 996-1000, Nov. 1986.
[8] T. Fuja, “Coding for the address-defect channel,” in 24th Ann.
Conf. Inform. Sci. Syst., Princeton, NJ, Mar. 21-23, 1990.
[9] T. Fuja, “The performance of random access memory systems
employing on-chip and board-level error control,” in IEEE Int. Symp.
Inform. Theory, Kobe, Japan, 1988.
[10] Micron Technology Inc., “Effect of on-chip ECC on system soft
errors,” Micron Technology Inc. MT1256 Data Sheet.
[11] M. Asakura et al., “An experimental 1-Mbit cache DRAM with
ECC,” IEEE J. Solid-State Circuits, vol. SC-25, pp. 5-10, Feb. 1990.
[12] T. Chiueh, R. M. F. Goodman, and M. Sayano, “A 2K×1 static
RAM chip with on-chip error correction,” IEEE J. Solid-State
Circuits, vol. 25, pp. 1290-1294, Oct. 1990.
[13] T. Fuja, C. Heegard, and R. M. F. Goodman, “Linear sum codes
for random access memories,” IEEE Trans. Comput., vol. C-37, pp.
1030-1042, Sept. 1988.
[14] L. Levine and W. Meyers, “Semiconductor memory reliability with
error-detecting and correcting code,” Comput., vol. 9, pp. 43-50,
Oct. 1976.
[15] W. F. Mikhail, R. W. Bartoldus, and R. A. Rutledge, “The
reliability of memory with single-error correction,” IEEE Trans. Comput.,
vol. C-31, pp. 560-564, June 1982.
[16] R. A. Rutledge, “Models for the reliability of memory with ECC,”
in Proc. 1985 IEEE Reliability, Maint. Symp., pp. 57-62, 1985.
[17] H. Vinck and K. Post, “On the influence of coding on the mean
time to failure for degrading memories with defects,” IEEE Trans.
Inform. Theory, vol. 35, no. 4, pp. 902-906, July 1989.
[18] R. M. F. Goodman and R. J. McEliece, “Lifetime analyses of
error-control coded semiconductor RAM systems,” IEE Proc., vol.
129E, pp. 81-85, May 1982.
[19] R. M. F. Goodman and R. J. McEliece, “Hamming codes, computer
memories, and the birthday surprise,” in Proc. 20th Allerton Conf.
Commun., Contr., Comput., 1982.
[20] M. Blaum, “Error-correcting codes for computer memories,” Ph.D.
thesis, California Inst. of Technol., Pasadena, CA, 1985.
[21] M. Blaum, R. M. F. Goodman, and R. J. McEliece, “The reliability
of single-error protected computer memories,” IEEE Trans.
Comput., vol. C-37, pp. 114-119, Jan. 1988.
[22] R. Krishnamoorthy and C. Heegard, “Reliability and yield Error
control in semiconductor RAMs,” IEEE Trans. Inform. Theory,
preprint, School of Elect. Eng., Cornell Univ., Ithaca, NY, May
1990.
