Introduction
The problem
The designer of a computer memory system can choose from a wide variety of techniques for achieving fault-tolerance. Error-correcting codes have been standard in large memories for years [1], and page deallocation is commonly used to tolerate uncorrectable errors until a repair can be made. The discovery of a new physical mechanism for soft errors in dynamic memories [2] led to new error-correction techniques to handle the new error types [3]. More recently, fault-alignment exclusion [4] and dynamic spare switching [5] have been proposed as ways to quickly repair a memory without bringing the system down for component replacement.
These methods offer significant opportunities to improve the reliability of memory systems. In order to choose among these alternatives, however, the designer must be able to evaluate the impact of each technique, or combination of techniques, on reliability. This is not an easy task. In a system without fault-tolerance, each component failure causes a system failure; therefore, the system-failure rate is simply the sum of the component failure rates. In a fault-tolerant memory, on the other hand, the effect of any single component failure will, in general, depend on which other components have already failed. Therefore, the system failure rate, which in a memory is the uncorrectable error (UE) rate, is a very complicated function of the component failure rates, failure modes, system design, etc.
Background
The calculation of R(t), the probability that no UE occurs before time t, for a memory with single-error correction is difficult but not impossible. A number of similar equations have previously been given for R(t). The main difference among these is the particular failure modes which are assumed for the array chip (i.e., single cell, row or column of cells, whole chip). If the memory is replaced with a "good-as-new" memory each time a UE occurs, then the sequence of UE times forms a renewal process [12]; and UE(t), the expected number of UEs in the first t hours of system life, is the unique solution of the renewal equation,
$$UE(t) = \int_0^t [1 + UE(t - s)]\, f(s)\, ds,$$
where $f(t) = -dR(t)/dt$
is the probability density function of the time-to-first-UE. For a small memory contained on a single array card this would be a reasonable assumption; the only replacement possible is the entire memory. If soft errors as well as hard failures are considered, it is still possible to derive an exact analytic expression for the UE rate [13], as long as the entire memory is replaced after each hard UE.
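As a rough numerical illustration (not part of FTMS), the renewal equation can be solved on a time grid. The sketch below assumes a hypothetical exponential time-to-first-UE, $R(t) = e^{-\lambda t}$, for which the exact solution is $UE(t) = \lambda t$. The function name `solve_renewal`, the rate, and the step size are illustrative choices only, and the discretization is a simple first-order one.

```python
import math

def solve_renewal(F, t_end, steps):
    """Approximate UE(t) on a uniform grid by a first-order discretization of
    UE(t) = integral_0^t [1 + UE(t - s)] f(s) ds, with f(s) ds ~ F(s_j) - F(s_{j-1}).
    F is the distribution function of the time-to-first-UE, F(t) = 1 - R(t)."""
    h = t_end / steps
    cdf = [F(k * h) for k in range(steps + 1)]
    ue = [0.0] * (steps + 1)
    for k in range(1, steps + 1):
        acc = cdf[k]  # contribution of the first UE itself
        for j in range(1, k + 1):
            # UEs that follow a renewal occurring in the j-th grid interval
            acc += ue[k - j] * (cdf[j] - cdf[j - 1])
        ue[k] = acc
    return ue

lam = 1.0e-3  # hypothetical UE rate per hour
ue = solve_renewal(lambda t: 1.0 - math.exp(-lam * t), t_end=1000.0, steps=1000)
print(ue[-1])  # close to lam * t_end for this exponential case
```

For a general R(t) the same routine applies unchanged; only the function passed as `F` differs.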
For larger memories which are contained on multiple array cards, the required calculations are far more difficult; the state of the memory after repair is in general very different from the state of a brand-new memory. It may contain cards of different ages (and therefore different failure rates) and may also contain some correctable hard failures which were left in at the time of repair. The future of such a system is by no means independent of its past, and renewal theory does not apply. The situation becomes even more complex when we consider double-error correction, page deallocation, address permutation, spare switching, etc.
Copyright 1984 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the ...
FTMS
This paper describes the Fault-Tolerant Memory Simulator (FTMS), an interactive APL program written to fill that need. FTMS is designed to evaluate several reliability parameters for memories which employ single- or double-error correction, page deallocation, address permutation, and spare switching; and which are subject to hard failures and soft errors.
Method of simulation
FTMS estimates memory reliability parameters by simulating the life history of many systems, counting UEs (and other parameters) on each simulated system, and averaging the results. For each individual system this involves simulating random failure times for each component, checking to see whether UEs occur, and simulating the appropriate maintenance action (e.g., card replacement).
The first obstacle in writing such a simulator is to account for the enormous number of possible failures which can occur (e.g., 18 million different cell failures for a 2-megabyte system). This problem was originally solved for a precursor to FTMS when an algorithm was derived which generates random failure times in order of occurrence [17, 18]. The next choice is how to represent the state of the memory, and what information to keep. The representation scheme described here is compact while preserving all needed information, and has proven to be readily adaptable to new designs and fault-tolerant techniques. One of the fundamental principles of Monte Carlo simulation is to replace estimates with exact values wherever possible [19]. This principle was used to good advantage when soft-error capability was added to FTMS. FTMS estimates the soft-error rate without simulating any soft errors, which simultaneously reduces the cost of simulation and improves the accuracy of the estimate.
The law of large numbers guarantees that the estimates from a Monte Carlo program will be close to the true values for sufficiently large sample sizes; the user wants to know how close and how large. FTMS provides confidence limits on all of its estimates, but these are not available until after the program has been run. Before running the simulation the user has no way of knowing how many samples will be needed to produce the accuracy he requires. Too few samples mean he must rerun the job; too many waste computer resources. Therefore FTMS incorporates a sequential stopping rule [20] which continues to sample until each parameter has been estimated with the accuracy and confidence level specified by the user.
In the following section we describe the model used by FTMS to represent memory systems. Section 3 describes some of the internal structure of the FTMS programs and the algorithms used to generate random failure times, estimate the soft UE rate, and decide on an optimal stopping time. Finally, in Section 4, we show by example how FTMS can be used to evaluate the impact of various memory design and maintenance strategies on reliability.
Memory model
The memory model consists of four parts: architecture, failure modes, maintenance strategy, and reliability parameters. The architecture defines the logical and physical configuration of the memory. The failure modes and rates define the types and frequencies of component failures which can cause the memory to malfunction. The maintenance strategy defines the actions which are taken to repair the memory when it has failed. Finally, reliability parameters are defined to quantify the effects of architecture, failure rates, and maintenance strategy on the frequency of UEs, the amount of degradation, and the cost of service.
Architecture
The memory architecture consists of the logical structure of the memory and the logical-to-physical mapping of the bits in the memory. The logical structure includes the number and size of data words and pages, the error-correcting capability of the ECC, and any built-in fault-tolerance features. The physical structure includes the number and arrangement of cells per array chip, and chips per array card.
The memory consists of F identical array cards, each logically subdivided into C bit-planes. Each bit-plane is an A × B matrix of array chips, and each chip is an X × Y matrix of cells. A data word consists of FC bits, one from the same address in each bit-plane. The memory has either single-error correction (SEC) or double-error correction (DEC) capability for each data word. The number of errors which are correctable is denoted NEC. Figure 1 shows the structure of a 2-megabyte memory containing 16K-bit chips on 18 array cards.
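The relationships among these dimensions can be checked with a few lines of arithmetic. The sketch below uses F = 18 and 16K-bit chips from the text; the values of C, A, B, X, Y, and the number of data bits per word are assumptions chosen only to be consistent with a 2-megabyte memory, since the actual layout is given in Figure 1.

```python
# Assumed values, chosen only to be consistent with the 2-megabyte example.
F = 18            # array cards (from the text)
C = 4             # bit-planes per card (assumption: 72-bit word over 18 cards)
A, B = 4, 4       # chips per bit-plane, rows x columns (assumption: A * B = 16)
X, Y = 128, 128   # cells per chip, rows x columns (assumption: 16K-bit chip)
DATA_BITS = 64    # data bits per word (assumption: the rest are check bits)

word_bits = F * C                     # one bit from each bit-plane
words = A * B * X * Y                 # one word per address in a bit-plane
chips = F * C * A * B
cells = chips * X * Y
data_bytes = words * DATA_BITS // 8
print(word_bits, words, chips, cells, data_bytes)
```

Under these assumptions the cell count comes to roughly 18 million, which agrees with the number of possible cell failures quoted earlier for a 2-megabyte system.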
Fault Alignment Exclusion
The use of Fault Alignment Exclusion (FAE) to remove UEs by address permutation is discussed more fully elsewhere in this issue [4]. We briefly describe how FAE is incorporated into the FTMS model.
A word address consists of a chip address (row and column within a bit-plane) and a cell address (row and column within a chip). The cell and chip-column portions of the address are fixed physical locations, but the chip-row address is logically determined by the contents of a control register. That is, the logical chip-row address is some permutation of the physical chip-row address. This permutation capability is included to allow us to misalign failures in the memory so that all resultant data words contain only correctable errors.
The address permutation capability is defined by the permutations which can be represented, the algorithm used to choose a permutation, and the information available to that algorithm. The logical chip-row address consists of N = (log A)/(log 2) bits used to select one of A chips in a column. The physical address is obtained by forming the exclusive-or of R of these N bits with the bits in a control register, and leaving the remaining N - R bits unchanged. Each column of chips has its own control register and is independently permutable. Thus there are a total of $2^R$ permutations possible for each column, and $2^{RFCB}$ possibilities for the entire memory. Various algorithms have been written to choose a set of permutations which misalign errors so that no data words contain uncorrectable errors. The algorithm used is part of the memory architecture, and the name of the algorithm is denoted ALGO (i.e., if the algorithm named ALGJC is to be used, ALGO = "ALGJC"). The input to the algorithm is a fault map which assigns each chip to a category which depends on the worst hard failures on the chip. A five-category map tells whether failures on a chip affect single cells only, rows only, columns only, rows and columns, or the entire chip. In a three-category map the middle three categories are combined. The number of categories in the fault map is denoted M.
FTMS has several built-in FAE algorithms, and also accepts any user-designed algorithm. All the user has to do is copy his algorithm into the APL workspace containing FTMS and then set ALGO equal to the name of his algorithm.
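The exclusive-or mapping described above can be sketched as follows. This is an illustration only: the function name `physical_chip_row` is hypothetical, and the choice of the low-order R bits as the permutable ones is an assumption, since the text does not say which R of the N bits are used.

```python
def physical_chip_row(logical_row, control, r_bits):
    """Map a logical chip-row address to a physical one by XOR-ing its
    low-order r_bits with the control register (assumed bit placement)."""
    mask = (1 << r_bits) - 1
    return logical_row ^ (control & mask)

A = 32   # chips per column, so N = log2(A) = 5 address bits
R = 3    # permutable bits (hypothetical choice, R <= N)

mappings = set()
for control in range(1 << R):
    rows = tuple(physical_chip_row(a, control, R) for a in range(A))
    assert sorted(rows) == list(range(A))   # each setting is a permutation
    mappings.add(rows)
print(len(mappings))  # number of distinct permutations for one column
```

Each control-register setting yields a distinct permutation of the chip rows in a column, which is why the count per column is $2^R$.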
Page deallocation
The ABXY words in the memory are subdivided into P pages, each containing (ABXY)/P data words. Each page consists of an A' × B' matrix of sub-chips from each bit-plane, and each sub-chip is an X' × Y' matrix of cells on a chip. If a data word contains an uncorrectable error, the page containing that word may be deallocated. This allows the memory to recover quickly, but at a somewhat reduced capacity, from the effects of a failure.
Spare switching
Each array card contains S additional chips which can be logically switched in to replace one of the other CAB chips on the card. Any spare can replace any other chip, but no more than one spare can be used in any spare domain. There are D spare domains on a card, each consisting of A/D consecutive rows of chips in each bit-plane. Figure 3 shows the spare domains from one card of the memory in Fig. 1, assuming two spares and four spare domains.
Failure modes and rates
The components which make up the memory are subject to hard failures and soft errors.
Hard failures
A hard failure is a permanent inability of a component to reliably store data in one or more cells. An array chip which consists of a rectangular array of cells may fail in a number of different ways, including single cell, row or column of cells, and entire chip failure [7, 10, 21, 22]. In addition, we consider failures in address lines, data buses or registers, or decoding logic, which can disable groups of chips. The failure is described logically in terms of the cells which cannot be reliably written to or read from. In these terms, there are nine different hard-failure modes. Each failure mode has a corresponding failure rate function which is defined as a step function. Figure 4 shows a typical failure rate curve for a 16K-bit array chip. The rates for the first four modes are expressed as a fixed percentage of the total chip failure rate, and the remaining rates are defined by separate curves similar to Fig. 4. Failures occur at random times in accordance with these rates and at random locations in the memory. When failures affect more than NEC bits in the same word, an uncorrectable error (UE) occurs immediately.
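To make the step-function failure rate concrete, the following sketch turns such a rate into a time-to-failure distribution and its inverse. The table `STEPS` is entirely hypothetical (it is not the curve of Fig. 4); it holds, for a single failure mode, interval end points in hours and rates in percent per 1000 hours per component, assumed positive. The function names are illustrative.

```python
import math

# Hypothetical table for one failure mode:
# (end of interval in hours, rate in percent per 1000 hours per component)
STEPS = [(1000.0, 0.10), (10000.0, 0.04), (100000.0, 0.02)]

def cumulative_hazard(t, steps=STEPS):
    """Integrate the step-function failure rate from 0 to t (t within the table)."""
    h, start = 0.0, 0.0
    for end, pct_per_khr in steps:
        rate = pct_per_khr * 1.0e-5          # failures per hour per component
        h += rate * (min(t, end) - start)
        if t <= end:
            return h
        start = end
    raise ValueError("t lies beyond the last interval of the table")

def failure_cdf(t, steps=STEPS):
    """F(t) = Prob[T <= t] for a component with the given rate function."""
    return 1.0 - math.exp(-cumulative_hazard(t, steps))

def failure_time(v, steps=STEPS):
    """Invert F: return t with F(t) = v, or None if v exceeds F at the table end."""
    target, h, start = -math.log(1.0 - v), 0.0, 0.0
    for end, pct_per_khr in steps:
        rate = pct_per_khr * 1.0e-5
        if h + rate * (end - start) >= target:
            return start + (target - h) / rate
        h += rate * (end - start)
        start = end
    return None

print(failure_cdf(5000.0), failure_time(failure_cdf(5000.0)))
```

A piecewise-constant rate keeps both F and its inverse in closed form, which is convenient when failure times are obtained by inverting F.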
The hard-failure rates are specified in a matrix, denoted HARD, which contains in column one the end points of the time intervals and in columns two through ten the failure rates for the nine failure modes over the corresponding time intervals. The failure rates are in units of percent per 1000 hours per component.
Soft errors
A cell which has suffered a soft error is restored when a new bit is written into it and thereafter has no greater chance of failure than any other cell. We assume that a soft error affects a single cell. If a soft error occurs in a word which already contains NEC errors, a soft UE (SUE) occurs, and is immediately repaired by rewriting with good data. If a soft error occurs in a word with fewer than NEC errors, it is immediately corrected by rewriting with good data. The soft error rate is assumed to be a constant, denoted SOFT, and is expressed in units of percent per 1000 hours per chip.
This model of soft errors neglects some realities which should be considered when applying the model. In the first place, some array chips, especially charge-coupled devices, may be subject to multi-bit soft errors. Secondly, there are algorithms available to correct a combination of hard and soft errors even though the total number of errors exceeds NEC [1, 3], so a SUE may not really be uncorrectable. Finally, if good data is not written frequently enough, soft errors may accumulate to line up with other soft errors.
Maintenance strategy
The maintenance strategy is a set of rules which prescribe what action will be taken when various events occur.
Events
There are three types of events which can be used to trigger a maintenance action. If a UE occurs, some action must be taken because a memory containing a UE is considered to be nonfunctioning. If a hard failure occurs which does not cause a UE, maintenance may be specified to reduce the risk of future UEs. Finally, scheduled maintenance may be performed at a fixed time independent of failures occurring in the memory.
Actions
The actions which can be taken are card replacement, page deallocation, address permutation, and spare switching. The "card replacement" action causes removal of the minimum number of cards to achieve one of the following: no UEs, at most x bad bits per card, or at most x bad bits in the memory.
Page deallocation deallocates all pages containing UEs, but no more than x pages in total may be deallocated. Address permutation attempts to produce a configuration with at most x (≤ NEC) bad bits per word. Spare switching attempts to replace a chip-kill with a spare chip. In each case x is a user-specified constant.
A complete strategy is a set of event-action pairs which may be applied sequentially. For example,
1. At 200 hours, replace any card with more than two bad bits.
2. At a chip-kill, switch in a spare.
3. At a UE, deallocate up to 32 pages.
4. At a UE, permute addresses to remove UEs.
5. At a UE, replace cards to remove all UEs.
Action 4 would be taken only if action 3 were unsuccessful, and action 5 only if action 4 were unsuccessful. Action 5 is included as the last resort in any maintenance strategy: it cannot fail to be successful and UEs cannot be left in the memory.
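The sequential application of event-action pairs can be pictured as trying the actions registered for an event in order until one succeeds. The sketch below is a hypothetical illustration of that fallback logic; the dictionary used for the memory state and the three stand-in actions are assumptions and do not reflect how FTMS represents a strategy internally.

```python
def handle_event(event, strategy, memory):
    """Try, in order, the actions registered for this event; stop at the first
    one that reports success. Returns the name of that action, or None."""
    for trigger, name, action in strategy:
        if trigger == event and action(memory):
            return name
    return None

# Hypothetical stand-ins for the real actions; each returns True on success.
def deallocate(memory):
    if memory["deallocated"] + memory["ue_pages"] > 32:
        return False                      # threshold exceeded, fall through
    memory["deallocated"] += memory["ue_pages"]
    return True

def permute(memory):
    return memory["permutable"]

def replace_cards(memory):
    return True                           # last resort, always succeeds

STRATEGY = [("UE", "deallocate", deallocate),
            ("UE", "permute", permute),
            ("UE", "replace cards", replace_cards)]

memory = {"deallocated": 31, "ue_pages": 2, "permutable": False}
print(handle_event("UE", STRATEGY, memory))
```

Placing card replacement last mirrors the rule that the final action for a UE must always succeed.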
Reliability parameters
The reliability of a fault-tolerant memory can be measured in terms of service cost, frequency of interruptions, and amount of degradation. Service cost is measured in terms of card replacements (CRs) and repair actions (RAs). An RA is defined to be an unscheduled action which requires the intervention of a service person. Specifically, an RA is the replacement of one or more cards due to a hard failure (scheduled maintenance is not included). Frequency of interruption is measured by the rate of hard and soft UEs. Soft UEs are measured separately because they have a much smaller impact on the system (component replacement is not required), and because they can in some cases be corrected [1, 3]. Degradation is measured by the average number of pages deallocated and the average number of words containing one or (if DEC is used) two bad bits. A large number of bad bits may cause performance degradation even if all errors are correctable, because of time taken to perform the corrections. The following specific parameters are defined:
UE(t) = Expected number of UEs in (0,t).
SUE(t) = Expected number of SUEs in (0,t).
CR(t) = Expected number of cards replaced in (0,t).
RA(t) = Expected number of RAs in (0,t).
$B_1(t)$ = Expected average number of words with one bad bit during (0,t).
$B_2(t)$ = Expected average number of words with two bad bits during (0,t).
DP(t) = Expected average number of pages deallocated during (0,t).
The first four parameters are cumulative counts, while the last three are time averages of states of the memory.
:HEN AND R. A. RUTLEDGE
The FTMS program
FTMS is an interactive menu-driven APL program. The main menu has four selections: INPUT, SIMULATE, REPORT, and END (Figure 5). Each selection leads to a sub-menu. The INPUT menu allows the user to enter or change model parameters. The SIMULATE menu sets the conditions of simulation and starts the simulation. The REPORT menu allows the user to print inputs and outputs in various formats. The END menu can be used to save a compacted set of information for future use and to exit from the program.
INPUT
The input menu has six selections (Figure 6). Each selection invokes a program for inputting model parameters. ARCHITECT, FAILRATES, and STRATEGY prompt the user to provide the parameters described above. TIMES asks for the total lifetime of systems to be simulated, and the intermediate time points for which the user requires outputs. NAME asks for a job name and other descriptive information to be printed on reports. All inputs are checked for validity as they are entered. CHECK does an overall consistency check on the data and either returns to the main menu or informs the user of inconsistencies.
SIMULATE
The SIMULATE menu is shown in Figure 7. The first three selections allow the user to control the stopping rules for the simulation. These are discussed later in detail. SEED allows the user to control the random seed used for starting the simulation. GO begins the actual simulation of memory systems and returns control to SIMULATE when the stopping criteria are satisfied. The results of each simulation run (each invocation of GO) are automatically combined with all previous results since the last model change made using the INPUT menu. If this is not desired, the selection NEW will wipe out all previous results.
Stopping rules
The objective of FTMS is to estimate the reliability parameters defined above with reasonable accuracy at reasonable cost. The accuracy of the results is quantified in terms of confidence limits on the proportional error in any estimate. Let X represent any of the random variables defined above for some specific time point [e.g., UE(t) at t = 10 000 hours], and let $X_1, X_2, \ldots, X_n$ be sample values of X obtained by simulating n memory systems. The mean and variance of X are denoted m and $\sigma^2$ respectively, and the sample mean and variance are given by
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$
We wish to find a confidence interval of prescribed proportional accuracy p and confidence level C for m. By the central limit theorem, $\sqrt{n}(\bar{X} - m)/S$ is approximately standard normal for large n. Let z denote the value for which a standard normal variable lies in (-z, z) with probability C. Thus, for n sufficiently large, the proportional error in $\bar{X}$ (as an estimator of m) is less than $p = zS/(\bar{X}\sqrt{n})$ with probability C.
Conversely, in order to achieve proportional accuracy p, one should choose a sample size equal to the smallest n which is greater than or equal to $(\sigma z/(mp))^2$.
Before running the simulation, however, the user does not generally know $\sigma/m$ and cannot solve for n. Therefore FTMS contains a sequential stopping rule which continues to simulate until each parameter has been estimated with the proportional accuracy and confidence level specified by the user. The rule is simply to stop sampling as soon as n is greater than or equal to $(Sz/(\bar{X}p))^2$. This stopping rule has been studied by Nádas [20], who proved that it is asymptotically consistent, and asymptotically efficient in the sense that the expected sample size approaches the fixed sample size which would be used if $\sigma/m$ were known in advance.
If the required accuracy is too tight, the program may run for an excessive time. The TIME selection on the SIMULATE menu can be used to put upper bounds on the CPU time and elapsed time of any run of GO. Finally, the COUNT selection in SIMULATE allows the user to set a maximum and minimum number of systems to simulate.
If some simulations have already been run for a given model, it is possible to estimate the relationship of time and accuracy to sample size. In this case FTMS will provide the user with an estimate of the running time, proportional accuracy, and number of simulations for the stopping rules he has chosen.
Simulation runs
The flow of simulation is shown in Figure 8. A memory is "built" at random, and then the life history of that memory is simulated by computing the effects of each failure and the results of each maintenance action. At the end of system life, the results of the current system are added to the cumulative results from previous systems simulated. Then the stopping rules are checked and the program either begins another system simulation or returns control to the main menu.
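The sequential stopping rule for proportional accuracy can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the FTMS code: `run_until_accurate` and its arguments are hypothetical names, `n_min` and `n_max` stand in for the limits set through COUNT, the sample variance uses the divisor n - 1, and an exponential variate stands in for the result of one simulated system.

```python
import math
import random
from statistics import NormalDist

def run_until_accurate(simulate_one, p, c, n_min=30, n_max=1_000_000):
    """Sample until n >= (z * S / (mean * p))**2, within [n_min, n_max].
    simulate_one() returns the parameter value (e.g. a UE count) for one system."""
    z = NormalDist().inv_cdf((1.0 + c) / 2.0)  # two-sided normal quantile
    n, total, total_sq = 0, 0.0, 0.0
    while n < n_max:
        x = simulate_one()
        n, total, total_sq = n + 1, total + x, total_sq + x * x
        if n < n_min:
            continue
        mean = total / n
        var = max(total_sq - n * mean * mean, 0.0) / (n - 1)
        if mean > 0 and n >= (z * math.sqrt(var) / (mean * p)) ** 2:
            break
    return n, total / n

rng = random.Random(1984)
# Stand-in for one simulated system: an exponential variate with mean 1.
n, estimate = run_until_accurate(lambda: rng.expovariate(1.0), p=0.05, c=0.95)
print(n, estimate)
```

Only the running sum and sum of squares are needed, so the check after each simulated system costs almost nothing compared with the simulation itself.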
Build memory
The physical state of the memory at any time is determined by the type and location of the hard failures on each card. This information is stored in a matrix called FAIL which contains one row corresponding to each hard failure. The matrix FAIL could be generated sequentially by selecting a random time-to-next-failure after dealing with each simulated event. It is more convenient, however, to choose all random failure times, up to the specified end of system life, TMAX hours, at the beginning. Therefore, at time zero FAIL contains all failures which will occur before TMAX (card replacement is discussed later), and the actual state of the system at time t is that part of FAIL with times less than or equal to t.
The straightforward way to generate FAIL would be to simulate a random failure time for each component in the system, and then select only those times less than TMAX. The problem with this approach is the astronomical number of components which would have to be simulated. For the example shown in Fig. 1 it would be necessary to select 18 million random failure times for cell failures alone.
Fortunately there is another way. FTMS uses a simple algorithm to generate random failure times in order of occurrence up to TMAX. For a given failure mode with cumulative distribution function of time to failure, $F(t) = \mathrm{Prob}[T \le t]$, we require the first r order statistics $T_1 \le T_2 \le \cdots \le T_r$ from a sample of size n from F(t), where n is the number of components which are subject to that failure mode, and r is the largest integer for which $T_r \le$ TMAX. If V denotes a random variable uniformly distributed on (0,1), it is well known that $F^{-1}(V)$ is distributed like T, and $F(T)$ is distributed like V.
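One standard way to obtain uniform order statistics directly in increasing order, without drawing and sorting all n values, uses the fact that the smallest of k uniform variables on (0,1) has survival function $(1-v)^k$. The sketch below illustrates that idea and is not claimed to be the algorithm of [17, 18]; the function name, the exponential lifetime (constant rate, assumed so that F inverts in closed form), and the numerical values are all hypothetical.

```python
import math
import random

def failure_times_in_order(n, rate, tmax, rng):
    """Return the failure times <= tmax, in increasing order, among n components
    with exponential lifetimes (constant rate per hour, assumed for simplicity)."""
    times = []
    survive = 1.0                      # 1 - V_i, where V_i is the i-th order statistic
    for i in range(n):
        u = 1.0 - rng.random()         # uniform on (0, 1]
        survive *= u ** (1.0 / (n - i))
        t = -math.log(survive) / rate  # T_i = F^{-1}(V_i) with F(t) = 1 - exp(-rate * t)
        if t > tmax:
            break                      # all remaining failures fall after tmax
        times.append(t)
    return times

rng = random.Random(17)
times = failure_times_in_order(n=18_000_000, rate=1.0e-9, tmax=1.0e5, rng=rng)
print(len(times))  # only the failures before tmax were generated
```

The loop stops at the first time beyond `tmax`, so the work is proportional to the number of failures that actually occur rather than to the number of components.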
Therefore, if we can find uniform order statistics $\{V_i\}$, we can find the required $\{T_i\}$ from $T_i = F^{-1}(V_i)$.
Time of next event
The time of the next event is the time of the next scheduled maintenance or the time of the next hard failure, whichever comes first. If the event is a hard failure, it is necessary to determine whether a UE has occurred. Some failure modes cause a UE independently of other failures, e.g., an entire card failure if a card contributes more than NEC bits per data word. For other failure modes, a UE occurs if any bit affected by the new failure lines up with a different bit in the same word which was affected by a previous failure. The convention for representing failure locations is quite convenient for determining whether two failures line up to cause a double-bit error in any data word. If we define the logic function
it is easy to see that two failures represented by (f, c, a, b, x, y) and (f', c', a', b', x', y') line up to cause a double-bit error in at least one word if and only if the logical expression is true. When all double-bit errors have been located, it is even simpler to find the triple-bit errors: Three failures line up to cause a triple-bit error if and only if each pair of two causes a double-bit error.
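A line-up test of this kind can be sketched under an assumed encoding. The convention used below is a guess made for illustration and is not taken from the paper: coordinates are numbered from 1, the value 0 in a chip or cell coordinate means "every value" (so a row failure leaves its column coordinate at 0), and each failure affects a single bit-plane identified by (f, c). The function names are hypothetical.

```python
from itertools import combinations

def coords_match(u, v):
    """True if two coordinate values can refer to the same position.
    Assumed convention: 0 is a wildcard meaning 'every value'."""
    return u == v or u == 0 or v == 0

def lines_up(p, q):
    """True if failures p and q, each given as (f, c, a, b, x, y), put bad bits
    from two different bit-planes into at least one common word."""
    different_bit = (p[0], p[1]) != (q[0], q[1])
    return different_bit and all(coords_match(u, v) for u, v in zip(p[2:], q[2:]))

def triple_bit_error(p, q, r):
    """Three failures share a word exactly when every pair lines up."""
    return all(lines_up(s, t) for s, t in combinations((p, q, r), 2))

cell = (1, 1, 3, 2, 17, 40)    # single cell on card 1, bit-plane 1
row = (2, 1, 3, 2, 17, 0)      # whole row 17 at the same chip position, card 2
chip = (3, 4, 3, 2, 0, 0)      # whole chip at the same chip position, card 3
other = (4, 2, 5, 2, 17, 40)   # single cell in a different chip row
print(lines_up(cell, row), lines_up(cell, other), triple_bit_error(cell, row, chip))
```

With this kind of encoding, checking a new failure against the existing ones needs only coordinate comparisons, never an enumeration of the affected cells.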
Next action
Maintenance actions are effected by making the appropriate changes in FAIL and SPARE. Card replacement is done by removing from FAIL and SPARE the rows corresponding to the cards to be removed, and adding new rows for the failures in the new cards. Address permutation is accomplished by simply applying a permutation to the numbers located in column 5 (the chip-row location) in FAIL. Spare switching involves switching the appropriate rows between FAIL and SPARE. Pages are deallocated by recording the page addresses in a separate matrix called DEAL. If the total number of pages exceeds the specified threshold, deallocation fails and another action must be taken to remove the UE. When a new UE occurs, it is compared to the pages in DEAL to verify that it is not in a word already deallocated.
Record keeping
Two types of records are kept for each simulated lifetime. The matrix EVENTS contains one record for each event-action pair (e.g., UE-card replacement), including the type of event and action, the time of occurrence, and other pertinent information such as the number of cards replaced, the number of pages deallocated, etc. The matrix STATES contains one record for each change of state, including the time of change and the new state. The state is defined to be the number of words containing one or (if NEC = 2) two bad bits, and the number of pages deallocated. An event may cause several actions at the same time, but the memory state does not change between events. Therefore EVENTS is updated after each action, while STATES is updated only once for each event (Fig. 8).
After each system is simulated, the results are added to the cumulative results from previous systems. The records updated consist of NS, the cumulative number of systems simulated, and four matrices: EV, EV2, ST, and ST2. EV and EV2 contain one row for each time point for which output is required, and one column for each event-count type of parameter, i.e., UEs, CRs, and RAs. The (i,j)th elements of EV and EV2 contain
$$\sum_{k=1}^{NS} X_{ijk} \quad \text{and} \quad \sum_{k=1}^{NS} X_{ijk}^2,$$
respectively, where $X_{ijk}$ equals the total count for the jth parameter, summed over the time interval from zero to the ith time point, for the kth system simulated. ST and ST2 contain similar information for the state type parameters $B_1$, $B_2$, and DP. This is the minimum information required to compute the estimates and confidence intervals for each parameter. For cumulative intervals the proportional accuracy confidence limits described above may be printed.
REPORT
Soft errors
The outputs include estimates and confidence limits for SUE, the soft UE rate, even though soft errors are not simulated. An SUE will occur whenever a soft error occurs in a word which contains NEC bad bits. Thus the SUE rate at any time is simply the soft error rate times the probability that a randomly chosen bit is a good bit in a word containing NEC bad bits. This is equal to SOFT × (number of chips in the memory) × (fraction of words containing NEC bad bits) × (fraction of good bits in words with NEC bad bits).
The number of chips is FCAB, the total number of words is ABXY, and the fraction of bits good is (FC - NEC)/FC.
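The product above is simple enough to evaluate directly. The sketch below is illustrative only: the function name and the memory state (1024 words with one bad bit, SOFT = 0.1 percent per 1000 hours per chip) are hypothetical, and the dimensions are assumed values for an SEC memory of 18 cards.

```python
def soft_ue_rate(soft, F, C, A, B, X, Y, nec, words_with_nec_bad_bits):
    """SUE rate, in the same units as soft (percent per 1000 hours), for a
    memory state in which the given number of words hold exactly nec bad bits."""
    chips = F * C * A * B
    total_words = A * B * X * Y
    word_bits = F * C
    fraction_words = words_with_nec_bad_bits / total_words
    fraction_good = (word_bits - nec) / word_bits
    return soft * chips * fraction_words * fraction_good

# Hypothetical state: 1024 words with one bad bit in an SEC memory of 18 cards.
rate = soft_ue_rate(soft=0.1, F=18, C=4, A=32, B=1, X=128, Y=128,
                    nec=1, words_with_nec_bad_bits=1024)
print(rate)
```

Because the rate depends only on the count of words holding NEC bad bits, it can be computed exactly from the recorded memory states, with no soft errors simulated.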
END
The END selection on the main menu allows the user to exit from FTMS. At this point the APL workspace contains the inputs as well as the simulation results for the model most recently run. The user can save these results and return later to add more runs simply by invoking FTMS again. In order to allow the user to save more than one model without wasting too much disk space, END provides the option of expunging everything from the workspace except the model inputs and simulation results. The user can save this reduced workspace and later copy it into a full FTMS workspace to continue running the same model.
Design tradeoffs
FTMS can be used to estimate, to any desired degree of accuracy, the defined reliability parameters for any memory system which fits the model given in Section 2. This gives us the ability to predict the effect of design options on system reliability. The following example illustrates the use of FTMS to compare the use of page deallocation, chip sparing, and FAE as well as single- or double-error correction for a particular system. We consider a sample memory system of 4 megabytes. The system consists of a number of cards, each containing a 32 × 4 array of 16K-bit chips. There are 128 bit-lines and 128 word-lines in a chip. The memory is organized as 1-bit-per-chip with respect to the ECC. At the card level, 4-bits-per-card is assumed. We consider both a (72,64) SEC-DED code and an (80,64) DEC-TED code. Thus, the system consists of 18 cards for the SEC-DED code, and 20 cards for the DEC-TED code.
The failure rate of the memory chip is assumed to follow the step function shown in Fig. 4. The average failure rate over 100 kPOH is 0.02 percent per kPOH. The piece part failure distribution within the chip is 35 percent for cells, 12 percent for word-lines, 18 percent for bit-lines, and 35 percent for chip-kills (same as [11, 21]). In addition, the support logic of a card is assumed to fail at the same rate as that of a chip.
We assume that a service maintenance is scheduled at 200 power-on hours. The maintenance is to clean up the cards so that each card contains no more than two cell fails at the scheduled time. If a card has to be replaced in order to fix a UE, the rule is to replace the card that participates in the UE and has the largest number of defective cells.
A memory page is assumed to contain 2 kilobytes. Consider the memory as a chip array of 32 rows. A page of data resides in a single chip-row. It occupies 2 word-lines and 128 bit-lines within a chip. If page deallocation is used to fix a UE, the threshold of pages that can be deallocated is 32.
Two other features for the system considered are FAE and chip spare. For FAE, 5 bits of address permutation is assumed. The 5-category fault map is also assumed [4] . For chip spare, it is assumed that each card has one spare chip. Whenever there is a chip-kill, the faulty chip is replaced by the spare chip on the card that contains the faulty chip. Unless specifically stated, the strategies described henceforth do not involve chip spare.
For the system using a (72,64) SEC-DED code, we have simulated the following strategies in fixing UEs:
PLAIN: Simply replace a card to fix a UE.
PAGE: To fix a UE, the memory page that contains the UE is deallocated. If the number of pages deallocated exceeds the threshold (32 pages), a card is replaced.
SPARE/PAGE: The spare chip is used to fix a chip-kill on the card, and page deallocation is used to fix a UE.
PAGE/FAE: To fix a UE, the memory page that contains the UE is deallocated. If the number of pages deallocated exceeds the threshold, FAE is performed. If FAE fails to fix the UE, a card is replaced.
The results of the simulations are shown in Figures 9 and 10 in terms of the rates of repair action and card replacement. The results clearly indicate that page deallocation, FAE, and chip spare can be used to reduce the frequency of repair as well as the number of cards replaced.
For the system using an (80,64) DEC-TED code, we have simulated the strategies of PLAIN, PAGE, and PAGE/FAE. The rates of repair action and card replacement obtained from the simulations are shown in Figures 11 and 12. Again, the results show that page deallocation and FAE can be used to increase reliability and decrease maintenance cost.
To show the effectiveness of double-error correction over single-error correction, the repair action rates for the SEC-DED code and the DEC-TED code with PLAIN strategy are plotted in Figure 13. At 40 kPOH, there is a slightly greater than 5 times improvement of double-error correction over single-error correction in the rate of repair action. The improvement factor is even higher at the early life of the memory system. A similar conclusion can be drawn for the rate of card replacement.
Figure 13 Average RA rates for 4-megabyte memory with SEC and DEC. (Horizontal axis: time in field, h.)
Conclusion
FTMS was written to provide memory designers with a flexible tool with which to evaluate the various techniques for fault-tolerance which can be built into computer memory systems. A wide variety of design options, including options for system architecture, failure modes and rates, and maintenance strategy, can be evaluated simultaneously. The output of FTMS gives the frequency of uncorrectable errors of both types (hard and soft), and also the amount of degradation due to page deallocation and the need to correct bad bits, and the service cost parameters of repair actions and cards replaced. An optimal sequential stopping rule is used to estimate all of these parameters with prescribed accuracy and confidence level, without any prior knowledge of the variance of the estimates. This program has been successfully applied to evaluate alternative design proposals over the past several years. During that time it has evolved by adding to the list of options which can be evaluated. As future options are proposed, FTMS will be modified to give a quick and accurate prediction of the impact of such proposals on memory reliability.
