Abstract-The reliability of memory systems exposed to soft errors has been studied in the past with the aim of deriving the Mean Time to Failure (MTTF) and the probability of failure in a given time interval. In those studies, soft errors were assumed to arrive following a Poisson process and to be single uncorrelated events (each event causes only one soft error). Recent studies suggest that Multiple Bit Upsets (MBUs) are a significant part of the error events in advanced memory technologies and that their share will continue to grow in the next technology nodes. The errors in an MBU are normally caused by the same physical event and therefore affect memory cells that are close together. This poses a major problem for memories that are protected with per-word Single Error Correction codes, as an MBU is likely to affect two or more bits in the same word, causing an uncorrectable error. To avoid that problem, interleaving is used to ensure that cells that are physically close together belong to different logical words, so that the errors in an MBU are distributed over a number of words, each suffering only one error. Although some works have characterized memories under radiation tests, no mathematical model of the effect of MBUs on the reliability of a memory has been proposed in the literature, to the best of the authors' knowledge. Therefore, in this paper, the reliability of memories suffering MBUs is analyzed in detail. The fundamental result of that analysis is that the MTTF of a memory exposed to MBUs can be approximated using the existing results for single event upsets by adjusting the error arrival rate.
I. INTRODUCTION
MEMORIES are present in most digital systems. From generic computers to specific embedded applications and field-programmable gate arrays, all need storage devices with an increasing capacity. Therefore, from a practical point of view, the reliability of memories is important to guarantee the correct operation of the system [1]-[3]. This has led to several studies [4]-[7] that discuss various reliability models.
Although reliability has long been studied [8], [9], new sources of errors are arising in addition to the traditional ones, which makes the probability of failure increasingly higher. This is particularly visible in hostile environments where there are physical phenomena that affect semiconductors in a negative way. Radiation [10]-[12] is one of these factors and its influence on errors has been reported many times [13], [14]. It is associated with many fields, like the medical and military industries, but space is one of the areas where more research is being devoted to this problem [15]. Space applications are especially critical, since systems are not easily accessible, and therefore, errors may produce the complete failure of a mission [16]-[18]. Because of all these problems, memories are usually protected to make them as fault-tolerant as possible. One of the most used mechanisms is single error correction (SEC) and double error detection codes, which can be implemented using Hamming codes [19]. These codes add a level of redundancy at the expense of using some of the memory capacity to store the extra information needed for error detection and correction. In this way, isolated Single Event Upsets (SEUs) [20]-[23], which produce a single error in a given memory word, can be automatically corrected, as long as no more than one affects the same memory word at the same time. The correction is achieved with the so-called scrubbing mechanism: a scrubbing period, $t_s$, is defined, which triggers a rewrite process in the memory, updating the wrong words with their right values (using the SEC codes).

Manuscript received June 11, 2007; revised September 7, 2007. This work was supported by the Spanish Ministry of Science and Education under Grant ESP-2006-04163.
The authors are with Universidad Antonio de Nebrija, 28040 Madrid, Spain (e-mail: previrie@nebrija.es; jmaestro@nebrija.es; ccervant@nebrija.es).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TDMR.2007.910443
However, it is possible that two (or more) independent SEUs can strike on the same word within the same scrubbing period. If this happens, the errors would be uncorrectable, leading to a failure of the system. Many models are described in the literature that address this scenario and calculate the Mean Time to Failure (MTTF) and reliability of the system [24] .
However, other phenomena do not induce a single SEU in the system but multiple simultaneous errors, which are known as Multiple Bit Upsets (MBUs) [25]. This may happen, for example, when a highly charged particle strikes the device and, due to its energy or incidence angle, affects not only an isolated transistor but a larger area, disturbing several memory cells. As the integration level grows, these memory cells become smaller, and the probability of MBUs increases. The importance of MBUs has recently been addressed in several papers [25]-[28], which conclude that a growing share of errors is due to MBUs.
One of the most direct mechanisms to mitigate the effect of MBUs is the use of an interleaving scheme. This mechanism spreads the bits in a logical word into different physical words, following a constant pattern (i.e., all the bits in the logical word are separated by the same distance). Therefore, the bits physically close belong to different logical words, and since MBUs affect a reduced area in the memory, the induced errors will be correctable by the SEC codes.
However, MBUs can produce failures, and therefore have an effect on the reliability of the system, which has been proved by recent studies. An SEU followed by an MBU (or vice versa) or a combination of MBUs, may produce double errors in the same logical word, leading to the failure of the system [25] .
Although some works have been done to characterize the effects of MBUs in memories from a physical point of view (see [29] - [31] for an analysis of error patterns and effects), no mathematical model for reliability has been considered, in part due to the complexity of the formulation and to a nearly exclusive focus on SEUs. In this paper, a detailed analysis of how MBUs affect the system is presented. Several effects will be considered, such as the use of scrubbing and the spatial correlation of the MBUs. Finally, several results through simulation will be offered, which prove the conclusions presented in this paper.
II. MODELING THE EFFECTS OF MBUS
To model the effects of MBUs, a first step would be to specify the problem assumptions, which are similar to other models in the literature.
The first assumption is that the memory is protected using SEC codes. This means that there is extra hardware in the memory so that single errors in a given word can be detected and corrected. This is an initial protection level that forces the occurrence of two or more errors in the same word to produce a failure.
The second assumption is that the memory is implemented using a physical interleaving organization. With this, the different bits that form a logical word are physically distributed in the memory following a certain pattern. In other words, the bits that are physically close in memory belong to different logical words. This gives a second protection level, since physically adjacent errors induced by an event will likely not affect the same logical word, and therefore will be handled by the SEC codes. More precisely, this assumption implies that the interleaving is such that the physical distance between the bits of a logical word is always larger than the maximum physical distance between two bits in any of the possible MBU patterns.
The next assumption is that the event arrival rate for the entire memory is λ.¹ In this case, unlike in the SEU study, the difference between the number of events (g) and the number of errors (m) has to be taken into account. For SEUs, g = m, since each event produces exactly one error. However, when MBUs are considered, g ≤ m, with g < m as soon as one event produces more than one error.

¹ In this paper, a constant λ is assumed. If λ varies due to the environmental conditions, a worst case analysis could be done by choosing an upper bound for this parameter.

Let us define the errors-per-event set, Q, as

$$Q = \{q_1, q_2, \ldots, q_g\}$$

where $q_i$ is the number of errors produced by event i. All the elements in Q are independent and identically distributed random variables.

This leads to the final assumption, which is that the events follow a Poisson distribution. This describes how events occur in the system, but it is not sufficient by itself, since failures are caused by errors, not by events. To account for this, and since an event can produce several errors, a probability distribution function of errors per event (that is, the distribution of each $q_i$), P, has to be defined as

$$P = \{p(n) : n = 1, 2, \ldots\}$$

where p(n) indicates the probability that a certain event produces n errors.
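With this distribution function, it can be proven that the number of errors, m, in a given time interval, t, follows a compound Poisson process [32], where p(n) is the compounding distribution. The probability of m errors in time t is given by

$$P(m,t) = \sum_{g=0}^{\infty} e^{-\lambda t}\,\frac{(\lambda t)^g}{g!}\; p^{*g}(m) \tag{1}$$

where $p^{*g}(m)$ is the g-fold convolution of p(n) evaluated at m, that is, the probability that g events produce m errors in total (with $p^{*0}(0) = 1$).

For example, in [27], it is proposed that the number of errors in an MBU can be modeled as a geometric distribution such that p(n) takes the form

$$p(n) = (1-r)\, r^{\,n-1}, \quad n \ge 1 \tag{2}$$

and (1) can be computed in this particular case as

$$P(m,t) = \sum_{g=1}^{m} e^{-\lambda t}\,\frac{(\lambda t)^g}{g!}\binom{m-1}{g-1}(1-r)^g\, r^{\,m-g}, \quad m \ge 1 \tag{3}$$

with $P(0,t) = e^{-\lambda t}$.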
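The compound Poisson probabilities can also be evaluated numerically for any errors-per-event distribution. The following is a minimal sketch, not the authors' code: the function names `compound_poisson_pmf` and `geometric_closed_form` are hypothetical, p(n) is passed as a dictionary with n ≥ 1, and the closed form used for the geometric case is the standard negative-binomial convolution, which is assumed here to match the paper's expression for that case.

```python
import math


def compound_poisson_pmf(m_max, lam_t, p):
    """Return [P(0,t), ..., P(m_max,t)] for the number of errors.

    lam_t is lambda * t; p maps n (errors per event, n >= 1) to p(n).
    """
    # conv[m] is the g-fold convolution of p at m; for g = 0 it is a point mass at 0.
    conv = [1.0] + [0.0] * m_max
    pmf = [0.0] * (m_max + 1)
    # Every event produces at least one error, so g <= m_max is enough.
    for g in range(m_max + 1):
        weight = math.exp(-lam_t) * lam_t ** g / math.factorial(g)
        for m in range(m_max + 1):
            pmf[m] += weight * conv[m]
        # Convolve once more with p to go from g to g + 1 events.
        nxt = [0.0] * (m_max + 1)
        for m, c in enumerate(conv):
            if c:
                for n, pn in p.items():
                    if m + n <= m_max:
                        nxt[m + n] += c * pn
        conv = nxt
    return pmf


def geometric_closed_form(m, lam_t, r):
    """P(m,t) when p(n) = (1 - r) * r**(n - 1), using the negative-binomial convolution."""
    if m == 0:
        return math.exp(-lam_t)
    return sum(
        math.exp(-lam_t) * lam_t ** g / math.factorial(g)
        * math.comb(m - 1, g - 1) * (1 - r) ** g * r ** (m - g)
        for g in range(1, m + 1)
    )
```

With `p = {1: 1.0}` the first function reduces to the ordinary Poisson probabilities of the SEU-only model, which is a convenient sanity check.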
A. Nonscrubbing
Let us consider first the case in which scrubbing is not used, so that errors accumulate in the memory over time. Using P(m, t), we can derive the reliability function R(t) as follows:

$$R(t) = 1 - \sum_{m=1}^{\infty} P(m,t)\, P_f(m) \tag{4}$$

where $P_f(m)$ is the probability of failing given m errors, and therefore, the term in the summation is the probability that m errors happen in t and that a failure is produced by those m errors. Let us explore how $P_f$ should be defined in the current problem. For the sake of simplicity, the case of single error events will be considered first. Assuming a memory with M words, each of them protected with SEC codes (which implies that single errors in a memory position do not produce a failure), then for single error events $P_f(m)$ takes the form

$$P_f(m) = \begin{cases} \displaystyle\sum_{j=2}^{m}\left[\prod_{i=1}^{j-1}\left(1-\frac{i-1}{M}\right)\right]\frac{j-1}{M}, & m \ge 2 \\[2ex] 0, & m < 2 \end{cases} \tag{5}$$

The second equation is due to the SEC codes, which make the probability of failure with an isolated error null.

The product term is the probability that the first j − 1 errors have not produced a failure (since they have affected different memory positions, and therefore they can be corrected by the SEC logic). In the same way, (j − 1)/M denotes the probability that the jth error strikes one of the j − 1 (out of M) previously affected memory positions, therefore producing a failure.
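The structure described above (a product term for "no failure so far" multiplied by the probability that the jth error lands on an already affected word) can be checked numerically. The sketch below is an illustration under the stated assumption that single errors fall uniformly and independently on the M words; the function names are hypothetical. It computes the failure probability both as the sum over j and as the complement of the "all errors on distinct words" probability, which should agree.

```python
def pf_sum_form(m, M):
    """Sum over j of P(first j-1 errors on distinct words) * (j-1)/M."""
    total = 0.0
    survive = 1.0  # probability that the first j-1 errors hit distinct words
    for j in range(2, m + 1):
        total += survive * (j - 1) / M
        survive *= 1.0 - (j - 1) / M
    return total


def pf_product_form(m, M):
    """1 - P(all m errors hit distinct words)."""
    no_failure = 1.0
    for i in range(m):
        no_failure *= max(0.0, 1.0 - i / M)
    return 1.0 - no_failure
```

For example, `pf_product_form(2, 8)` is 1/8: the second error fails only if it lands on the word hit by the first one.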
Unfortunately, in the case of MBUs, following a per-event distribution p(n) makes the derivation of $P_f(m)$ complex, as the errors within each event are assumed (by using interleaving) to fall on different words.
To see this complexity, let us examine an example where p(n) = 1 for n = 2, and 0 elsewhere (each event always produces two errors)
where j′ equals j for j odd and j − 1 for j even. In this case, what distorts the results of expression (5) is that the first two errors (both produced by the first event) can never produce a failure, because according to the assumptions, they will be physically close but logically distant. Therefore, no failure may happen until the third error arrives (which is the first of the second event) and eventually strikes the same word that the first or second error did. A similar situation happens with the fourth error, where a failure can occur together with the first or second error, but never with the third one (both the third and the fourth are produced by the second event).
Through this example, it can be seen that the probabilities of failure are affected by how errors are distributed within MBUs, or in other words, by p(n).
For an arbitrary distribution p(n), the computation of $P_f(m)$ becomes even more complex. However, from the example above it can be seen that, given a certain number of errors m, the probability of failure $P_f(m)$ when these errors come grouped in MBUs is never higher than when they arrive as individual SEUs. This is due to the constraint explained before, by which the errors forming the same MBU cannot produce a failure by themselves, and therefore there are fewer combinations that lead to a failure. With this consideration, the single error event case is an upper bound for the more general MBU case

$$P_f^{\mathrm{MBU}}(m) \le P_f^{\mathrm{SEU}}(m) \tag{7}$$
This conclusion is not intuitive, since it may seem strange that the probability of failure for MBUs is lower than for SEUs. To avoid misleading interpretations, it has to be noticed that this result only states that the probability of failing, given m errors, is equal or lower if the errors occur grouped in MBUs rather than in isolated SEUs. This is a direct consequence of the assumption that errors within an MBU cannot occur in the same logical memory word, and therefore cannot produce a double (uncorrectable) error.
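The inequality can be verified exactly for the two-errors-per-event example. The sketch below assumes the uncorrelated placement used in this section: each double MBU hits two distinct words chosen uniformly among the M words, and SEUs fall uniformly and independently. The function names are hypothetical.

```python
from math import comb


def pf_seu(m, M):
    """Failure probability after m single-error events."""
    ok = 1.0
    for i in range(m):
        ok *= max(0.0, 1.0 - i / M)
    return 1.0 - ok


def pf_double_mbu(m, M):
    """Failure probability after m errors arriving as m/2 double MBUs (m even)."""
    ok = 1.0
    for e in range(m // 2):
        # The (e+1)-th event must place both errors outside the 2e words already hit.
        free = max(M - 2 * e, 0)
        ok *= comb(free, 2) / comb(M, 2)
    return 1.0 - ok
```

For M = 16 and m = 4, this gives about 0.24 for the grouped case against about 0.33 for four independent SEUs, in line with the bound.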
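Likewise, the reliability function can be lower bounded as

$$R^{\mathrm{MBU}}(t) \ge 1 - \sum_{m=1}^{\infty} P(m,t)\, P_f^{\mathrm{SEU}}(m) \tag{8}$$

where the $P_f(m)$ associated with the single error event case is used in the right term. The MTTF can also be lower bounded as

$$\mathrm{MTTF}^{\mathrm{MBU}} = \int_0^{\infty} R^{\mathrm{MBU}}(t)\, dt \;\ge\; \int_0^{\infty}\left[1 - \sum_{m=1}^{\infty} P(m,t)\, P_f^{\mathrm{SEU}}(m)\right] dt \tag{9}$$

Since the previous expressions can become quite complex, a simpler bound will be presented next to simplify calculations.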
Let us define m as the random variable that denotes the number of errors producing a memory failure. Let us also define the random variable $m_{ac}$ as the number of errors present in the system when a failure happens. For the case of SEUs only, $m_{ac}$ and m are obviously the same variable. In the case of MBUs, the following relationship between them holds:

$$m_{ac} \ge m \tag{10}$$

This is due to the fact that errors come grouped into MBUs, e.g., if three errors are to produce a failure and the first MBUs to arrive happen to induce two errors each, then two MBUs will be needed to reach the three overall errors (m = 3), but in fact four errors ($m_{ac} = 4$) will have occurred in the system (two errors after the first MBU and four after the second one; the value of three cannot be directly reached).

Let us define the random variable g as the number of events to failure. Since these g events have produced all the errors in the system until it fails, $m_{ac}$ can be written as

$$m_{ac} = \sum_{i=1}^{g} q_i \tag{11}$$

where $q_i$ are the independent random variables defined before. Taking the mathematical expectation of (11), the following is obtained:

$$E[m_{ac}] = E\left[\sum_{i=1}^{g} q_i\right] \tag{12}$$

Applying Wald's identity (since the $q_i$ are independent and identically distributed) to the rightmost member of (12), the following expression is obtained:

$$E\left[\sum_{i=1}^{g} q_i\right] = E[g]\cdot E[q_i] \tag{13}$$

Let us define $Q_{\text{per event}}$ as the expected value of $q_i$, which is determined by p(n) in the following way:

$$Q_{\text{per event}} = E[q_i] = \sum_{n=1}^{\infty} n\, p(n) \tag{14}$$

Then, combining (12) and (13)

$$E[m_{ac}] = E[g]\cdot Q_{\text{per event}} \tag{15}$$

Now, considering the inequality (10) and (15)

$$E[g]\cdot Q_{\text{per event}} \ge E[m] \tag{16}$$

Or, in other words

$$E[g] \ge \frac{E[m]}{Q_{\text{per event}}} \tag{17}$$
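Let us consider now the well-known relationship between the MTTF and the Mean Events To Failure (METF) for Poisson arrivals [4] (see the Appendix for a demonstration that it also applies to the current compound Poisson case)

$$\mathrm{MTTF} = \frac{\mathrm{METF}}{\lambda} = \frac{E[g]}{\lambda} \tag{18}$$

As seen in expression (17)

$$\mathrm{MTTF}^{\mathrm{MBU}} = \frac{E[g]}{\lambda} \;\ge\; \frac{E[m^{\mathrm{MBU}}]}{\lambda\, Q_{\text{per event}}} \;\ge\; \frac{E[m^{\mathrm{SEU}}]}{\lambda\, Q_{\text{per event}}} \tag{19}$$

where the rightmost inequality in (19) stems from $P_f(m)$ being lower in the MBU case as discussed before. But for single error events, the number of events is identical to the number of errors (one error per event), and therefore, $E[m^{\mathrm{SEU}}]$ is the definition of METF for the SEU case. In this way

$$\frac{E[m^{\mathrm{SEU}}]}{\lambda\, Q_{\text{per event}}} = \frac{\mathrm{METF}^{\mathrm{SEU}}}{\lambda'} = \mathrm{MTTF}^{\mathrm{SEU}}(\lambda') \tag{20}$$

where λ′ is defined as

$$\lambda' = \lambda \cdot Q_{\text{per event}} \tag{21}$$

Expression (20) represents the MTTF for single error events, with a modified event arrival rate, λ′.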
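In practice, the rate adjustment only requires the mean number of errors per event. The following minimal sketch shows the computation for a distribution given as a dictionary; the helper names `q_per_event`, `adjusted_rate` and `events_lower_bound` are hypothetical and not taken from the paper.

```python
def q_per_event(p):
    """Mean number of errors per event for p = {n: p(n)}."""
    return sum(n * pn for n, pn in p.items())


def adjusted_rate(lam, p):
    """Modified arrival rate: the event rate scaled by the mean errors per event."""
    return lam * q_per_event(p)


def events_lower_bound(expected_errors_to_failure, p):
    """Lower bound on the mean number of events to failure, E[m] / Q_per_event."""
    return expected_errors_to_failure / q_per_event(p)
```

For the distribution p(1) = p(2) = 0.5 used later in the simulations, `q_per_event` returns 1.5, so an event rate of 0.01 becomes an adjusted rate of 0.015.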
Therefore, through (19) and (21), the following inequality is obtained:

$$\mathrm{MTTF}^{\mathrm{MBU}}(\lambda) \;\ge\; \mathrm{MTTF}^{\mathrm{SEU}}(\lambda') \tag{22}$$

What this expression means is that, in order to study the reliability of a memory affected by MBUs, the simpler case of single error events can be studied instead, with λ increased by the factor given in (21). The results obtained can be extrapolated to the MBU case as a lower bound of the MTTF, which simplifies the process compared to the calculation given by (9). This is an important result, since this lower bound (the MTTF for the SEU case) can be easily calculated, and therefore the application of (22) is straightforward. For example, for large values of M, the approximation (23) presented in [33] could be used to quickly evaluate the lower bound.
It is also important to note that the derivation of this bound does not rely on the fact that $P_f(m)$ is lower in the MBU case: even if E[m] were the same for SEUs and MBUs, expression (22) would still be valid.
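The SEU-based lower bound can be evaluated exactly for moderate M. The sketch below assumes uniformly placed single errors, so that the expected number of errors to failure is the classical "first collision" expectation; the function names are hypothetical. The large-M formula sqrt(πM/2) + 2/3 used for comparison is the standard birthday-problem asymptote and is given here only as an assumption; it is not necessarily the expression (23) of [33].

```python
import math


def expected_errors_to_failure(M):
    """E[m] for single-error events: sum over k >= 0 of P(no failure after k errors)."""
    total = 0.0
    survive = 1.0  # P(first k errors hit k distinct words), starting at k = 0
    for k in range(M + 1):
        total += survive
        survive *= 1.0 - k / M
    return total


def mttf_lower_bound(M, lam, q):
    """SEU-only MTTF with the event rate lam scaled by the mean errors per event q."""
    return expected_errors_to_failure(M) / (lam * q)


def birthday_asymptote(M):
    """Standard large-M approximation of the expected number of errors to failure."""
    return math.sqrt(math.pi * M / 2.0) + 2.0 / 3.0
```

For M = 365 the exact value is about 24.62 errors, against about 24.61 from the asymptote, so the closed form is already tight for small memories.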
As a summary of the present section, it has been shown that the MTTF in the case of MBUs can be lower bounded with the SEU-only case based on two observations: 1) the errors within an MBU cannot occur in the same word and 2) the number of errors present in the system ($m_{ac}$) when a failure happens on error m is larger in the MBU case, as the errors arrive in groups.
Another important point is that as M (the memory size) grows, the approximation in (22) gets better. This is because, as the probability of failure decreases, most of the failures will occur for large values of m such that $m/m_{ac}$ is close to 1 (and therefore expressions (10), (17), and (22) tend to become identities). As an example, if p(n) is 0 for n > j, then the worst case for $m_{ac}$ would be (m + j − 1), and as m grows, the expression $m/m_{ac} = m/(m + j − 1)$ gets closer to 1. A similar reasoning applies to (7), where if the value of m that causes failure grows, the impact of errors in an MBU falling on different words becomes smaller.
B. Scrubbing
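Another technique that is normally used in conjunction with SEC is scrubbing, where the memory positions are read and rewritten periodically such that single errors are removed. In that case, the derivations are based on calculating the probability of failure $P_f$ in a given scrubbing interval $t_s$. If this scrubbing interval is short enough compared with the event arrival rate, $\lambda t_s \ll 1$, as is normally the case, then most of the failures will be caused by just two events, so that $P_f$ can be easily approximated. For example, if p(n) is such that $p(1) = p_1$, $p(2) = p_2$ and p(n) = 0 for n > 2, then $P_f$ takes the following form:

$$P_f \approx \frac{(\lambda t_s)^2}{2}\, e^{-\lambda t_s}\left[\, p_1^2\,\frac{1}{M} + p_1 p_2\,\frac{1}{M} + p_1 p_2\left(1-\frac{1}{M}\right)\frac{1}{M-1} + p_2 p_1\,\frac{2}{M} + p_2^2\,\frac{2}{M} + p_2^2\left(1-\frac{2}{M}\right)\frac{2}{M-1} \right] \tag{24}$$

The first term corresponds to the probability of two independent SEUs, the next three to the probability of an SEU followed by an MBU (or the other way around) and the last two to the probability of two consecutive MBUs.
In case the SEU occurs first, the MBU can be seen as two SEUs, the first one with probability 1/M of failure, while for the second one the probability is that of not failing with the first SEU, 1 − (1/M), but failing with the second one, 1/(M − 1) (by our assumptions, the second SEU of an MBU can only fall in M − 1 memory positions, excluding the register where the first SEU of the MBU occurred).
A similar analysis is done for the case of two MBUs, considering the two SEUs within the second MBU. Regrouping terms and approximating (M − 2)/(M − 1) as 1 we get

$$P_f \approx \frac{(\lambda t_s)^2}{2M}\, e^{-\lambda t_s}\left(p_1^2 + 4\,p_1 p_2 + 4\,p_2^2\right) \tag{25}$$

Assuming an arbitrary p(n) such that p(n) = 0 for n > m and also that $M \gg m$, then following a similar derivation $P_f$ can be approximated as follows:

$$P_f \approx \frac{(\lambda t_s)^2}{2M}\, e^{-\lambda t_s}\sum_{i=1}^{m}\sum_{j=1}^{m} i\, j\, p(i)\, p(j) \tag{26}$$

This expression can be rewritten as

$$P_f \approx \frac{(\lambda t_s)^2}{2M}\, e^{-\lambda t_s}\left(\sum_{n=1}^{m} n\, p(n)\right)^{2} = \frac{(\lambda\, t_s\, Q_{\text{per event}})^2}{2M}\, e^{-\lambda t_s} \tag{27}$$

In the case of SEUs only, $P_f$ takes the following form:

$$P_f^{\mathrm{SEU}}(\lambda) \approx \frac{(\lambda t_s)^2}{2M}\, e^{-\lambda t_s} \tag{28}$$

It can be seen that the MBU case in (27) can be approximated using (28) by modifying the arrival rate in the same way as defined in (21)

$$\lambda' = \lambda \cdot Q_{\text{per event}} \tag{29}$$

and also assuming that

$$e^{-\lambda t_s} \approx e^{-\lambda' t_s} \tag{30}$$

Therefore, combining (27) through (30)

$$P_f^{\mathrm{MBU}}(\lambda) \approx P_f^{\mathrm{SEU}}(\lambda') = \frac{(\lambda' t_s)^2}{2M}\, e^{-\lambda' t_s} \tag{31}$$

With (30), $P_f$ is being lower bounded. However, this can be compensated if needed by multiplying (31) by $e^{-(\lambda-\lambda')\, t_s}$, making the approximation better.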
Finally, based on the above and following [5] , the MTTF can be approximated as
As a conclusion, and in line with the previous section, it has been shown that weighting the event arrival rate λ with the $Q_{\text{per event}}$ factor allows the use of the simpler SEU-only parameters ($P_f$, MTTF) to approximate the MBU case with scrubbing. This simplifies the problem, allowing designers to account for the effects of MBUs in memories without a high calculation overhead.
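The scrubbing approximations can be compared numerically. The sketch below rests on two assumptions that should be read as such: failures within a scrubbing interval come from exactly two events (valid when the product of the event rate and the scrubbing interval is much smaller than 1), and the MTTF is taken as the scrubbing interval divided by the per-interval failure probability, a common simplification that is assumed here rather than quoted from [5]. All function names are hypothetical.

```python
import math


def pf_seu_scrub(lam, ts, M):
    """P(exactly two events in ts) times P(the second hits the word of the first)."""
    x = lam * ts
    return math.exp(-x) * x ** 2 / 2.0 / M


def pf_mbu_scrub(lam, ts, M, p):
    """Two-event approximation with MBUs: the SEU value scaled by the squared mean errors per event."""
    q = sum(n * pn for n, pn in p.items())
    x = lam * ts
    return math.exp(-x) * (x * q) ** 2 / 2.0 / M


def mttf_scrub(ts, pf):
    """Assumed approximation: mean number of intervals to failure (1 / pf) times ts."""
    return ts / pf
```

With p(1) = p(2) = 0.5, `pf_mbu_scrub` is 2.25 times `pf_seu_scrub` at the same rate, and evaluating `pf_seu_scrub` at the adjusted rate and multiplying by the compensation factor recovers the MBU value exactly.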
C. Effect of MBU Spatial Error Correlation on Reliability
In the previous analysis, it has been assumed that the errors in an MBU occur in different memory positions and that those positions are uncorrelated. This means that the first memory position is chosen randomly among all M memory positions, the second one is chosen randomly among the remaining M − 1 memory positions and so on. This modeling enables the derivations presented in previous sections and can be justified on the assumption that the use of interleaving will ensure that errors fall on different memory positions and will randomize their distribution. While the former is true in most cases (except for MBUs with a very large number of errors), the latter is not strictly true. This is because the interleaving mechanism spreads the MBU errors that are physically close among different logical words separated following a fixed pattern (not a random one). For example, the different bits forming a logical word are separated L physical positions. In other words, logical bit 1 will be placed in physical position 1, but logical bit 2 will be placed in physical position L + 1, bit 3 in position 2·L + 1, etc. (see Fig. 1 ).
It is important to notice that the choice of L is critical in this process, since it has to guarantee that logical words are distributed away from the MBU action range. If, for example, a vertical four-error MBU hits the memory in Fig. 1 , bits in the same logical word would be affected. This can be avoided by carefully choosing L with respect to the memory width.
Let us consider an example where bits are located in this way and where the distribution of errors per event is as follows: p(1) = 0.5 and p(2) = 0.5. Suppose a case in which two MBUs (with two errors each) occur in the memory. As mentioned before, for simplicity, the errors of an MBU are assumed to occur in contiguous bits of the same physical memory position (note that although in a real situation the correlation pattern would be more complex, as shown in [25] , this would not affect the validity of the reasoning). Since these bits belong to different logical words, the errors will be corrected by the SEC mechanism, and therefore, a single MBU will not produce a failure.
Once the first MBU has happened, let us analyze the possibilities of the second MBU producing a failure: the first error of the second MBU strikes the same bit affected by the first (case 1) or second (case 2) error of the first MBU; or the second error of the second MBU strikes the same bit affected by the first (case 3) or second (case 4) error of the first MBU. That, assuming $M \gg 1$ as before, makes four cases producing failure out of M (memory size). This reasoning, as mentioned before, applies if errors are distributed randomly, but it does not hold if the interleaving distribution pattern is considered. Let us assume (see Fig. 2) that the first MBU affects adjacent physical bits 2 and 3 (which belong to different logical words). Studying the previous scenarios where the second MBU produced a failure, it can be seen that cases 2 and 3 are independent and distinct (case 2 would imply that the second MBU affects bits 3 and 4; and case 3 that it affects bits 1 and 2). But cases 1 and 4 are exactly the same one, because the two errors are not random or independent but physically grouped by the MBU: both cases imply that bits 2 and 3 are affected simultaneously.
As a conclusion, there are three cases producing failure out of M, not four as mentioned before. This has a clear effect on the probability of failure, $P_f$, which would be reduced compared to the result of (27), as follows:
If the spatial correlation is not considered, the failure cases if two SEUs arrive are 1/M; if an SEU comes followed by a two-error MBU, 2/M (one case per each of the MBU errors); again 2/M for an MBU followed by an SEU; and 4/M for two MBUs, as discussed before. Since p(1) = 0.5 and p(2) = 0.5, each of the previous four cases has a probability of 0.25. Therefore, $P_f = (1 + 2 + 2 + 4)\,0.25/M = 2.25/M$. This is the same value obtained through expression (27). Now, if the spatial correlation is taken into account as described before, the failure cases of two MBUs are three, not four, and therefore,

$$P_f = (1 + 2 + 2 + 3)\,0.25/M = 2/M$$

This produces a reduction of $P_f$ by a factor of 2/2.25 in this case. Let us study a more complex case to check that the conclusions can be extrapolated. In this case, p(1) = 0.5, p(3) = 0.5 and 0 elsewhere.
If the spatial correlation is not considered and assuming $M \gg 1$ as before, the failure cases if two SEUs arrive are 1/M; if an SEU comes followed by a three-error MBU, 3/M (one case per each of the MBU errors); again 3/M for an MBU followed by an SEU; and 9/M for two triple MBUs. Since p(1) = 0.5 and p(3) = 0.5, each of the previous four cases has a probability of 0.25. Therefore, $P_f = (1 + 3 + 3 + 9)\,0.25/M = 4/M$.
If the spatial correlation is considered, the probability of failure for the case of two triple MBUs changes: instead of 9/M, the new value is 5/M. This can be easily proved with the previous spatial considerations. Therefore, $P_f = (1 + 3 + 3 + 5)\,0.25/M = 3/M$. The reduction of $P_f$ is now by a factor of 3/4.
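The case counts used in these examples can be reproduced by brute-force enumeration. The sketch below adopts the simplified picture of this section as an explicit assumption: each event affects a run of contiguous positions on a line, a failure case is a placement of the second event that shares at least one position with the first, and edge effects are ignored (M much larger than 1). The function names are hypothetical, and the returned values are the failure probability multiplied by M, with the common two-event factor left out.

```python
def overlapping_placements(k1, k2):
    """Number of start positions of a k2-long event that overlap a k1-long event."""
    first = set(range(k1))  # the first event occupies positions 0 .. k1-1
    count = 0
    for start in range(-k2, k1 + k2):  # wide enough to cover every possible overlap
        second = set(range(start, start + k2))
        if first & second:
            count += 1
    return count


def pf_times_m(p, correlated):
    """Average number of failure cases over all pairs of event sizes."""
    total = 0.0
    for k1, p1 in p.items():
        for k2, p2 in p.items():
            cases = overlapping_placements(k1, k2) if correlated else k1 * k2
            total += p1 * p2 * cases
    return total
```

Two contiguous events of sizes k1 and k2 overlap in k1 + k2 − 1 placements, which gives 3 for two double MBUs and 5 for two triple MBUs, against 4 and 9 when the errors are treated as independent.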
In a general case, where the correlation pattern of the errors will be more complex as mentioned before, the effect would be the same: the correlation of errors tends to produce multiple simultaneous failures and to reduce $P_f$, hence resulting in an overall increase of the MTTF. This means that the previous derivations are still lower bounds for the general MBU case.
III. SIMULATION RESULTS
In the previous section, the reliability of a memory exposed to MBUs has been analyzed deriving a lower bound for a general case without scrubbing and an approximation for the scrubbing case. In this section, simulation results are presented to illustrate the previous models.
For the nonscrubbing case, some results are summarized in Table I with word size N = 12; λ = 1/100 per word and different memory sizes (M ). The distribution of the number of errors in an MBU is p(1) = 0.5, p(2) = 0.5 and zero elsewhere. The information listed on the table is the following.
1) SEUs only with increased rate. This corresponds to the MTTF calculated using the SEU model, with λ increased as per (21). This should be a worst case (lower bound) for the MTTF.
2) MBU independent errors. The errors in an MBU may affect any logical word, since they are considered to be independent. Therefore, they could affect the same logical word simultaneously.
3) MBU errors on different registers (not correlated). The errors in an MBU affect different logical words (due to the effect of interleaving), but no spatial correlation among them is considered.
4) MBU errors on different registers (correlated). Same case as the previous one, but considering the spatial MBU correlation explained in Section II-C.

The results are the average of 50 000 simulations. It can be noted that the SEU configuration with increased arrival rate is a worst case as predicted, and that as M grows, the difference between the first three cases becomes smaller, as expected (they become almost equal for M = 8192 or higher).
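A simulation of this kind can be sketched in a few lines. The code below is an illustrative Monte Carlo under stated assumptions, not the simulator used for the tables: events arrive with exponential inter-arrival times at a total rate `lam`, the errors of one event fall on distinct words chosen uniformly at random (case 3, no spatial correlation), and the memory fails as soon as a word receives a second error. The function name and parameters are hypothetical.

```python
import random


def simulate_mttf(M, lam, p, trials, seed=0):
    """Average time to failure without scrubbing for p = {n: p(n)}."""
    rng = random.Random(seed)
    sizes, weights = zip(*p.items())
    total_time = 0.0
    for _ in range(trials):
        hit = set()  # words that already hold one error
        t = 0.0
        failed = False
        while not failed:
            t += rng.expovariate(lam)            # Poisson event arrivals
            n = rng.choices(sizes, weights)[0]   # number of errors in this event
            for w in rng.sample(range(M), n):    # distinct words within one event
                if w in hit:
                    failed = True                # second error in the same word
                hit.add(w)
        total_time += t
    return total_time / trials
```

Case 1 corresponds to calling the same function with `p = {1: 1.0}` and the rate multiplied by the mean number of errors per event, for example `simulate_mttf(64, 1.5, {1: 1.0}, 4000)` against `simulate_mttf(64, 1.0, {1: 0.5, 2: 0.5}, 4000)`; the second value is expected to be somewhat larger.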
However, this is not true for the case of spatial correlation caused by interleaving (last column). This is because the effect always reduces $P_f$ (for all values of m), and therefore there will always be a difference with respect to the SEU-only case. For example, let us consider a situation with n double MBUs and n SEUs, n being an arbitrarily high number. That makes a total number of errors present in the system of 3n, assuming $M \gg n$. If another double MBU arrives afterward, the probability of failure if no spatial correlation is considered would be 6n/M (3n/M for each of the two errors in the MBU). However, if the spatial correlation is considered, there will be a probability of 2n/M of failure with the errors produced by the n SEUs, but only a probability of 3n/M with the n MBUs (not 4n/M). The reasoning behind this can be found in Section II-C. Combining both cases, a probability of 5n/M will be obtained. Therefore, if n tends to infinity, the probability ratio of both cases (with and without correlation) will remain constant at 5/6, making the MTTF of the correlated case higher.
In Fig. 3 , the MTTF ratio of the different models with respect to the base case of SEUs with increased rate is presented. It can be seen that these ratios tend to 1, as predicted before, when M grows. It can also be seen that this is not true for the case of the spatial correlation, due to the reason previously explained.
The following simulation studies the effect of scrubbing in the model. The parameters that have been used for the simulation are: $t_s = 0.1$, N = 12, λ = 1/100 per word and different memory sizes (M). The distribution of the number of errors in an MBU is p(1) = 0.5, p(2) = 0.5 and zero elsewhere.
The listed information in Table II is similar to the one presented in Table I , but in the third column, the approximation given by (33) is included.
These results are the average of 10 000 simulations and are in line with the derivations, as the MTTF for the SEU only case (with increased arrival rate) is close to the MBU case with uncorrelated errors and to the approximation given by (33) . If the correlation of the errors in an MBU is also considered (fifth column), then the MTTF increases as predicted. The ratios of the MTTFs for MBUs with correlated errors versus MBUs with uncorrelated errors (fifth versus fourth columns) are, for each value of M : 1.110, 1.113, 1.130 and 1.127. These are close to the predicted ratio of 2.25/2 = 1.1250.
The different MTTF ratios of the models versus the base case are presented in Fig. 4 . Again, it can be seen that these ratios tend to one, except for the spatial correlation case. Another set of simulations has been conducted for the scrubbing case, in order to make sure that the results are coherent. This time, the following parameters have been used: p(1) = 0.5, p(3) = 0.5 and zero elsewhere, and λ = 1/1000 per word.
In Fig. 5 , the different MTTF ratios are depicted. Again, the conclusions are similar to the ones presented in the previous simulations.
The results are depicted in Table III, which corroborates the same conclusions extracted from Table II. For smaller values, the ratio is also smaller due to the approximation made in deriving (27).
TABLE III
MTTF FOR THE SCRUBBING CASE (IN SECONDS)

Finally, it is also worth considering a realistic case to evaluate the effects of MBUs in existing memories. In [31], a 4-Mb SRAM manufactured in a 90-nm process is characterized, showing that MBUs account for only a few percent of error events. It is also shown that events with a greater number of errors are less likely to happen. Taking all this into account, we propose to use a per event error distribution that follows expression (2), with r = 0.05. This results in an average number of errors per event of 1.0526. This is the factor that must be considered according to the presented models to compute λ′, which in turn is used to get the MTTF through (23) and (33), for the nonscrubbing and scrubbing cases, respectively.
To show the impact of MBUs, we can compare the results given by (23) and (33) when:
1) λ is used and therefore multiple errors are not accounted for (the MBUs are modeled as SEUs).
2) λ′ is used and therefore the MBUs are modeled in a conservative way.
In the nonscrubbing case, the ratio of the MTTFs would be 0.95, which is a 5% decrease of the MTTF when MBUs are considered. In the scrubbing case, the ratio would be 0.90, or a 10% decrease of the MTTF. Taking into account that (23) and (33) give conservative estimates for the MTTF, we can conclude that in the worst case, the impact of MBUs on the MTTF would be below 10% for this given memory.
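The figures quoted for this example follow from simple arithmetic, sketched below. Two scaling assumptions are made explicit: the nonscrubbing MTTF is taken as inversely proportional to the adjusted rate, and the scrubbing MTTF as inversely proportional to its square (which neglects the exponential term, acceptable when the product of rate and scrubbing interval is small). The function names are hypothetical.

```python
def geometric_mean_errors(r):
    """Mean of p(n) = (1 - r) * r**(n - 1), n >= 1."""
    return 1.0 / (1.0 - r)


def mttf_ratios(q):
    """MTTF with the adjusted rate divided by MTTF with the original rate.

    Returns (nonscrubbing ratio, scrubbing ratio) under the scaling assumptions above.
    """
    return 1.0 / q, 1.0 / q ** 2
```

For r = 0.05 the mean is about 1.0526 errors per event, and the ratios are 0.95 and about 0.9025, matching the 5% and 10% decreases quoted above.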
This example shows that MBUs still have a limited impact on current memories. However, as technology advances to smaller geometries, the proportion of MBUs will increase [26] and will become a more important factor for memory reliability.
IV. CONCLUSION AND FUTURE WORK
In this paper, an analysis of the effects of MBUs on the reliability of memories protected with SEC codes and interleaving has been presented. The main contributions of this paper are: first, a general expression for the MTTF of a memory exposed to MBUs has been derived; and second, some approximations have been presented that enable the designer to evaluate the MTTF of a memory exposed to MBUs with the existing expressions used to analyze the case of SEUs. The proposed approximations are also lower bounds to the MTTF, and therefore they can be safely used in the evaluation of memories.
As part of this paper, some interesting effects of MBUs when compared to SEUs have been studied: first, the fact that errors in an MBU come grouped; second, the errors within an MBU are assumed to fall on different logical words; and finally, the correlation introduced by the interleaving pattern. All these have been shown to have an effect on the reliability of the memory.
For future work, evaluating the proposed approximations on real memories is the natural extension of this paper. This will require irradiating memory samples using a particle accelerator (or an alternative technology, e.g., a laser beam). The experiments should first characterize p(n) for the memory under test and the physical event causing the upsets, and then compute the theoretical approximations. Finally, controlling the event rate and monitoring the faults would provide the results for the comparison. This will validate the presented models and will be useful to introduce corrections if needed. These experiments would also help to illustrate how the MBU spatial correlation (discussed for a simplified case in this paper) would affect the MTTF in a real case.
APPENDIX
In this Appendix, it is shown that the relation between MTTF and METF in (18) is valid in this case.
Let us define the random variable TTF that denotes the time to failure, which can be expressed as

$$\mathrm{TTF} = \sum_{i=1}^{g} x_i$$

where $x_i$ are the random variables that denote the time between events. Since the events arrive following a Poisson process, the $x_i$ are independent and identically distributed random variables, and Wald's identity can be applied to get

$$E[\mathrm{TTF}] = E[g]\cdot E[x_i]$$

Considering that the mathematical expectation of TTF is MTTF, the expectation of the number of events (g) to failure is METF, and the expectation of the time between events is 1/λ, the traditional relationship between METF and MTTF is obtained

$$\mathrm{MTTF} = \frac{\mathrm{METF}}{\lambda}$$
