Advances in semiconductor memory technology towards higher-density and higher-performance memory chips have created new reliability challenges for the memory system designer. An example is the multiple-bit-per-chip organization with the chip outputs used in the same word. This design structure is prone to uncorrectable errors with conventionally implemented single-error-correcting double-error-detecting codes. With these newer chips, memory system designers will have to give special attention not only to the types of failures but also to ways of minimizing the system impact of reliability defects. In this paper, a number of design approaches are presented for minimizing the effects of chip failures through the use of organizational techniques and through enhancements to conventional error checking and correction facilities. The fault-tolerant design techniques described are compatible with most existing memory designs. An evaluative comparison of these techniques is included, and their application and utility are discussed.
Introduction
Computer memory chips containing 65 536 (64K) bits are now quite common, and chips of even greater bit densities are becoming available. In addition, each new computer system generation has seen a substantial increase in the number of memory chips used with a corresponding significant increase in memory capacity. However, larger-capacity memory systems utilizing higher-density memory chips are more susceptible to failures. This paper describes several of the most effective fault-tolerant design techniques useful in minimizing the consequences of these failures upon using systems. The primary objectives are to significantly reduce the sensitivity to defects (by minimizing the probability of their accumulation into failures, which can become uncorrectable errors), and to provide mechanisms for keeping the memory system operating once the failures exceed the capabilities of conventional single-error-correcting double-error-detecting (SEC-DED) error checking and correction (or error-correcting code, ECC) facilities [1].
The defect types that can occur for random-access memories of the dynamic MOSFET one-device-cell type [2, 3] can greatly influence the types of error control code selected as well as the amount of memory affected by these failures. The most common types of defect faults include the single-cell, word-line, bit-line, and chipfail categories. In addition to these hard faults, this type of memory has been susceptible to soft failures caused by alpha-particle radiation [4], with a failure probability higher than the basic intrinsic chip failure rate. In order to minimize the consequences of these hard and soft error mechanisms, designers must take into account the interaction between the using system, the error checking and correction facilities used, and the chip configuration and associated memory organization. The incorporation of ECC logic for improving product reliability has been commonplace since the introduction of the IBM System/370 computers. Increased chip densities and multiple-bit-per-chip organizations have resulted in more complex designs, increasing the challenge to the designer [5]. Special attention has been placed on adapting serial coding techniques (e.g., Fire codes [6]) to random-access memories to help improve error control capabilities [7, 8]. The particular system maintenance strategy used can play an important role in fault tolerance because it can allow the physical replacement of failures to be deferred and to accumulate to a selected threshold. To minimize the system sensitivity to uncorrectable errors (UEs) when soft error rates are high, memory systems employ "scrubbing" [9, 10] of detected errors by correcting and rewriting into the same location. Scrubbing consists basically of the periodic reading and correction, if required, of the data stored at all memory addresses. (Scrubbing is discussed subsequently in a later section.)
Once a specified error-rate threshold has been exceeded, the using system can invoke reconfiguration and deallocation algorithms [9-12] to remove memory space from program use. It should be noted, however, that the deallocation of memory space reduces the memory capacity on line, with corresponding potential for reducing overall system performance. The simplest type of fault-tolerant memory system is that shown in Figure 1, which incorporates a conventional ECC facility between the memory arrays and the using-system interface. Such a configuration enables the correction and detection of simple errors (e.g., a single defect) and the reporting of status information (i.e., No Error, Single Error Correction (SEC), and Multiple Error Detection (MED)). The effectiveness of the ECC facility will depend on the particular memory chip structure chosen as well as on the corresponding organization of how the data are assembled and sent to the using system. This paper describes techniques designed to improve the effectiveness of conventional ECC by using data organizational schemes and by providing enhancements to existing ECC facilities to achieve improved fault tolerance. These techniques are compatible with most existing designs, do not require any using-system intervention, and are self-contained within the memory system. The design techniques that we shall consider include bit scattering, sparing, complement/recomplement, consecutive correction, and prestorage protection.
The first two are organizational schemes and the latter three involve ECC facility enhancements. Techniques based on more complex multi-bit-correcting ECC code are not addressed since they typically impose increased system performance overhead.
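The status reporting of a conventional SEC-DED facility can be illustrated with a minimal sketch. The code below is an assumption made for illustration only: it uses a small (8,4) extended Hamming code rather than the code of any actual memory system, and the names `encode` and `decode` are hypothetical.

```python
# Illustrative (8,4) extended Hamming code; not the code used in the paper.
# Position 0 holds overall parity, positions 1, 2, 4 hold Hamming check bits,
# and positions 3, 5, 6, 7 hold the four data bits.
DATA_POSITIONS = (3, 5, 6, 7)

def encode(nibble):
    """Return an 8-bit ECC word (list of 0/1) for a 4-bit data value."""
    word = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> i) & 1
    for p in (1, 2, 4):
        word[p] = sum(word[i] for i in range(1, 8) if i & p and i != p) % 2
    word[0] = sum(word[1:]) % 2  # make the overall parity even
    return word

def decode(word):
    """Return (status, data) where status is 'No Error', 'SEC', or 'MED'."""
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i  # XOR of the positions of all set bits
    overall = sum(word) % 2
    fixed = list(word)
    if syndrome == 0 and overall == 0:
        status = "No Error"
    elif overall == 1:
        fixed[syndrome] ^= 1  # single error; syndrome 0 means the parity bit
        status = "SEC"
    else:
        status = "MED"  # even number of errors: detected, not corrected
    data = sum(fixed[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, data
```

A single flipped bit is reported as SEC and the data are recovered; a second flipped bit in the same word is reported as MED, which is the situation the techniques in this paper aim to avoid or recover from.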
The following sections of this paper describe the organizational techniques and the ECC enhancements. Presented first is the bit scattering technique, which includes data redistribution and address selection. That is followed by a description of how sparing can be used for arrays as well as for arrays and support logic. Subsequent sections deal with ECC enhancements based upon recovery by error erasure (complement/recomplement), knowledge of previous defect locations (consecutive correction), and the biasing of data words to conceal defects (prestorage protection). Additional sections deal with application and utility and include a summary and conclusion.
Bit scattering
Bit scattering is a design technique that minimizes the effect of chip defects by either distributing bits across different ECC words or by concentrating the failures within the smallest addressable section of memory. Bit scattering occurs in two forms: data steering (i.e., redistribution or fault alignment exclusion), and address selection.
Redistribution or data steering (also referred to as fault alignment exclusion [13]) is a buffering scheme used for multiple-bit-per-chip organizations that distributes the chip outputs across multiple ECC words [14]. Figure 2 illustrates two embodiments of this buffering: shift registers and gated latches. In both examples, a group of bits from a chip is buffered, with no more than a single bit position allocated to any ECC word. This minimizes the effects of multiple-bit chipfail types of failures.
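The effect of data steering can be sketched by counting how many bits of each ECC word come from a failed chip. The geometry below (8 chips, 4 bits per chip, 8-bit ECC words) and the function names are assumptions chosen for illustration; they are not taken from Figure 2.

```python
from collections import Counter

CHIPS = 8          # assumed number of chips read per access
BITS_PER_CHIP = 4  # assumed multiple-bit-per-chip organization
WORD_BITS = 8      # assumed ECC word width for this toy example

def unscattered(chip, bit):
    """ECC word index when all bits of a chip are used in the same word."""
    return (chip * BITS_PER_CHIP + bit) // WORD_BITS

def scattered(chip, bit):
    """ECC word index when bit k of every chip is steered to ECC word k."""
    return bit

def bad_bits_per_word(layout, failed_chip):
    """Count, per ECC word, the bits supplied by the failed chip."""
    return Counter(layout(failed_chip, b) for b in range(BITS_PER_CHIP))

worst_unscattered = max(bad_bits_per_word(unscattered, 3).values())
worst_scattered = max(bad_bits_per_word(scattered, 3).values())
```

In this sketch a chipfail places four bad bits in one ECC word without steering, which exceeds SEC-DED capability, but only one bad bit in each of four ECC words with steering, which each word can correct.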
Address selection is a technique used to minimize the size of the failure (i.e., the number of pages affected) caused by word-line and bit-line failures. The address-selection technique is most effective for block-transfer-type memory applications. An example is a memory paging application which requires 32 iterative array selects from a 64K-bit memory array chip (see Figure 3). In this example, a page consists of 32 × 4 = 128 bits on a chip. Assume that the selected array is a 16K × 4-bit chip partitioned into two separate groups, each with its own support circuits. Each group consists of four identical sections, and each section comprises 64 word lines and 128 bit lines. Therefore, depending upon the method of data placement and subsequent retrieval for the 32 iterative selects, by word line or by bit line, the amount of defect contamination will be different. The reason is that the impact of a defect depends both on its type (i.e., bit-line or word-line) and on the number of pages that reside in the defect region. Figure 3 illustrates the results of contiguous selection by word line, by bit line, and by intermixing between four and eight groups of bit lines and word lines to demonstrate the extent of memory space affected when the 32 iterative selects are completed for each page.
As shown, by proper design choice it is possible to minimize the effects of defects due to word-line or bit-line failures. The particular choice depends upon application requirements.
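The trade-off can be made concrete with a simplified model of one section of 64 word lines and 128 bit lines. The tiling model below is an assumption made for illustration and does not reproduce Figure 3: each page is taken to occupy a rectangular tile of `wl_span` word lines by `bl_span` bit lines, with one cell per select, so that the tile contains 32 cells.

```python
WORD_LINES = 64
BIT_LINES = 128
SELECTS_PER_PAGE = 32

def page_of(word_line, bit_line, wl_span, bl_span):
    """Page index of a cell when pages are wl_span x bl_span tiles."""
    tiles_per_row = BIT_LINES // bl_span
    return (word_line // wl_span) * tiles_per_row + bit_line // bl_span

def pages_hit(wl_span, bl_span):
    """Pages contaminated by one word-line failure and one bit-line failure."""
    if wl_span * bl_span != SELECTS_PER_PAGE:
        raise ValueError("tile must hold exactly one page of selects")
    by_word_line = {page_of(0, b, wl_span, bl_span) for b in range(BIT_LINES)}
    by_bit_line = {page_of(w, 0, wl_span, bl_span) for w in range(WORD_LINES)}
    return len(by_word_line), len(by_bit_line)

along_word_line = pages_hit(1, 32)  # page laid out along one word line
along_bit_line = pages_hit(32, 1)   # page laid out along one bit line
intermixed = pages_hit(4, 8)        # page spread over 4 word lines x 8 bit lines
```

Under these assumptions, placement along a word line limits a word-line failure to 4 pages but lets a bit-line failure touch 64; placement along a bit line reverses the exposure (128 and 2); the intermixed placement balances the two at 16 pages each.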
Sparing
Sparing techniques are used to replace a defective component in an operating memory without requiring manual intervention [15]. The sparing concept can be used for arrays as well as for arrays plus supports. Figure 4 depicts a selection partition suitable for simple spares. As shown, there is a group of memory arrays with a spare provided for appropriate activation. Any chip that fails in the memory array group can be substituted for (i.e., electronically replaced) by the spare chip. The substitution is accomplished by personalizing the selection logic via the data bus. When the high-order address bit selects the defective chip, the personalized selection logic performs the substitution. The sparing concept can be extended to cover both arrays and support circuits by an appropriate memory organization. Figure 5 shows a memory organization consisting of a group of 16 array cards or FRUs (field-replaceable units), each supplying an ECC word during a selection interval (i.e., a group of 16 ECC words are clocked and generated sequentially, one from each of the 16 array cards). If, for example, array card 3 (FRU 3) is defective, then when the defective card is to be clocked it is suppressed and an alternate or spare is substituted. As shown, by the addition of an alternate or spare FRU, sparing can be used to cover arrays as well as their support circuitry.
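The personalized selection logic can be sketched as a small lookup that redirects selects of the defective unit to the spare. The class and method names below are hypothetical, the spare is assumed to occupy the slot after the regular units, and the copying of data into the spare is not modeled.

```python
class SparedGroup:
    """Toy model of a group of units (chips or FRUs) with one spare."""

    def __init__(self, n_units):
        self.n_units = n_units
        self.spare_index = n_units  # assumption: spare sits after the group
        self.replaced = None        # unit currently mapped to the spare

    def personalize(self, defective_unit):
        """Record which unit the spare replaces (done via the data bus)."""
        if not 0 <= defective_unit < self.n_units:
            raise ValueError("no such unit in this group")
        if self.replaced is not None:
            raise RuntimeError("spare already in use")
        self.replaced = defective_unit

    def select(self, unit):
        """Return the physical unit that is actually clocked."""
        return self.spare_index if unit == self.replaced else unit

group = SparedGroup(16)
before = group.select(3)
group.personalize(3)
after = group.select(3)
```

Here a select of unit 3 is suppressed after personalization and the spare (slot 16 in this sketch) is clocked instead, while all other units are selected as before.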
Note that, in order for the spare to be deployed, provision must be allowed for the shifting of data from the defective or failing unit into the spare unit. In addition, space must be available to accommodate the additional logic and spare arrays required, and the system must be capable of tolerating a performance overhead.
Complement/recomplement
A rewriting of the data after the complement/recomplement procedure provides a way of eliminating, or "scrubbing," soft errors [3, 10].
Consecutive correction
Consecutive correction is a design technique that increases the correction capabilities beyond conventional SEC-DED codes by modifying the structure of the ECC facility [17]. The principle of operation is based on the maintenance of a history of hard correctable errors, so that, when they accumulate into uncorrectable errors, the history information can be used to erase the original error and to correct the subsequent error. This operation is achieved by storing the syndromes of the initial single error in an array for subsequent use. When an uncorrectable error is detected in an ECC word, the prior correctable syndrome is used to erase the initial error, and the modified data are then passed through the ECC facility for correction of the new error in the ECC word. Figure 7 illustrates the structure of a typical conventional ECC facility for read operations. Syndrome bits resulting from the comparison of the generated and received check bits are used by the error classifier to determine error conditions, while the error-bit locator decodes the syndrome to the defective location. Figure 8 depicts the structure of a modified error-correction facility, which has added a correctable-bit-locator array, with its output coupled to the error-bit locator, and a feedback path from the data bit modifier for erasure of the original defect so that the new defect can be corrected. Consecutive correction is controlled by the error classifier, which uses the output of the correctable-bit locator when uncorrectable errors are detected (i.e., when there is an "even" output from the error classifier).
Prestorage protection
Prestorage protection is a design technique based on biasing the data in an ECC word to conceal stuck bits. Making the stuck bit appear as a hidden fault enables the ECC to correct additional defects once they occur within the ECC word.
The operation is accomplished by providing true and complement paths within the ECC facility based on the property that odd-weighted codes produce the same check bits for either true or complement data bits. Figure 9 depicts, in block diagram form, the modification to the ECC facility. The control or path selection is determined on the basis of the status bits generated from both paths. These status bits (MED, SEC, and No Error) determine which format the ECC word is in, true or complement.
The principle of operation is based upon biasing of the bits in an ECC word so that defects can be hidden (i.e., bit stuck at a value which is correct for the ECC word). As a result, depending upon the ECC word format, the code assignment (i.e., true or complement) can be selected to conceal hard errors. The following formats are available with an odd-weighted ECC code (i.e., check bits assigned as odd-weighted parity):
ECC word format Resultant characteristics
True store, the data are stored in complemented form and left in that format if no errors are found. Figure 10 illustrates this simplified post-write procedure. Subsequent memory fetches with this ECC facility require that the appropriate path, true or complement, be selected. This is accomplished via the six true and complement status bits that are tabulated in Table 1.
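The biasing idea can be sketched as a post-write check: the word is written in true form and read back, and if a stuck cell has altered it, the complemented form is tried instead. This is a simplified assumption about the procedure, not a reproduction of Figure 10; the function names and the stuck-cell map are hypothetical, and the selection of the true or complement path on fetch via status bits is not modeled.

```python
def write_cells(word, stuck):
    """Return what the cells hold after a write; stuck maps bit -> stuck value."""
    for pos, val in stuck.items():
        word = word & ~(1 << pos) | (val << pos)
    return word

def prestore(data, stuck, width):
    """Choose true or complement format so that stuck bits are concealed.

    Returns (format, stored_word, hidden), where hidden is True when the
    stored word reads back exactly as written.
    """
    mask = (1 << width) - 1
    true_word = data & mask
    if write_cells(true_word, stuck) == true_word:
        return "true", true_word, True
    comp_word = ~data & mask
    stored = write_cells(comp_word, stuck)
    return "complement", stored, stored == comp_word

# Bits 1 and 2 are stuck at 0; the data value has 1s in both positions.
result = prestore(0xA6, {1: 0, 2: 0}, 8)
```

In this example the true form disagrees with both stuck cells, whereas the complemented form agrees with them, so the defects are hidden and the full SEC-DED capability remains available for later errors in that word.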
Application and utility
The fault-tolerant design techniques just described satisfy a broad range of applications. The purpose of these schemes is to minimize the accumulation of errors from semiconductor memory defects so that the probability of exceeding the capabilities of the error-correction facility is minimized. The design techniques discussed all involve interaction between the actual memory circuits, the organization of the computer memory, the maintenance strategy, and the ECC facility. Two of these techniques are based on the chip structure and the memory organizational requirements of the using system, while the remaining three deal with enhancement of the ECC facility for specific situations. Table 3 summarizes the particular characteristics of the enhanced-ECC-correction techniques, including a relative comparison in performance and hardware. The complement/recomplement procedure can be used not only as a recovery scheme but also to identify stuck bits by comparison (exclusive-or) of the read data with the corrected data. These locations can be used as the basis for forming a memory fault map. The method of consecutive correction does not require multiple memory array cycles but rather uses multiple passes through the modified ECC structure. By proper design choice and implementation, this technique can provide a fast correction (i.e., at logic speeds) of double errors. The prestorage protection technique is most suitable for regions of memory that are designated as read mostly; otherwise performance can suffer. The regions best suited are the "core" or nucleus areas of operating systems and the source tables used in address translation applications. All of these techniques help to minimize both uncorrectable errors and memory deallocation.
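The multiple-pass operation of consecutive correction can be sketched as follows. The sketch is an illustration under stated assumptions: it uses a small (8,4) extended Hamming code rather than the code of an actual ECC facility, it keeps the history in a dictionary keyed by address in place of the correctable-bit-locator array, and it records every single error rather than distinguishing hard from soft errors. All names are hypothetical.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # positions 0, 1, 2, 4 hold check bits

def encode(nibble):
    """Return an 8-bit extended Hamming word (list of 0/1) for 4 data bits."""
    word = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> i) & 1
    for p in (1, 2, 4):
        word[p] = sum(word[i] for i in range(1, 8) if i & p and i != p) % 2
    word[0] = sum(word[1:]) % 2
    return word

def classify(word):
    """Return (status, position); position is meaningful only for 'SEC'."""
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        return "No Error", None
    if overall == 1:
        return "SEC", syndrome
    return "MED", None

def fetch(word, history, address):
    """One fetch through the modified facility; history maps address -> position."""
    status, pos = classify(word)
    fixed = list(word)
    if status == "SEC":
        fixed[pos] ^= 1
        history[address] = pos            # remember the correctable error
    elif status == "MED" and address in history:
        fixed[history[address]] ^= 1      # first pass: erase the known error
        status2, pos2 = classify(fixed)   # second pass through the ECC logic
        if status2 == "SEC":
            fixed[pos2] ^= 1
            status = "SEC after erasure"
    return status, fixed

good = encode(0b0110)
history = {}
one_error = list(good)
one_error[3] ^= 1
first = fetch(one_error, history, address=0)
two_errors = list(one_error)
two_errors[6] ^= 1
second = fetch(two_errors, history, address=0)
```

The first fetch corrects the single error and records its position; when a second error later appears in the same word, the recorded position is used to erase the original error so that the remaining error becomes correctable, without any additional memory array cycle.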
Summary and conclusion
Semiconductor memory technology has advanced from LSI to VLSI, and this progress will continue. The accelerated progress in memory chip density coupled with larger-capacity memory applications will result in requirements for greater chip reliability and for greater system tolerance to errors. Fault-tolerant design techniques can be used to minimize the effects of failures in memory systems. As described, some techniques are suitable for organization and address selection, and others can be used to enhance the ECC capabilities of a given code. The appropriate application of these design techniques can reduce the number of uncorrectable errors and minimize the number of replacement components necessary. These design approaches can be used either individually or collectively, and the suitability of each approach depends upon the specific application requirements. The using system can benefit not only from fewer errors but also from better performance, because less memory has to be deallocated.
