This paper presents a state-of-the-art review of error-correcting codes for computer semiconductor memory applications. The construction of four classes of error-correcting codes appropriate for semiconductor memory designs is described, and for each class of codes the number of check bits required for commonly used data lengths is provided. The implementation aspects of error correction and error detection are also discussed, and certain algorithms useful in extending the error-correcting capability for the correction of soft errors such as α-particle-induced errors are examined in some detail.
Introduction
In recent years error-correcting codes (ECCs) have been used increasingly to enhance the system reliability and the data integrity of computer semiconductor memory subsystems. As the trend in semiconductor memory design continues toward higher chip density and larger storage capacity, ECCs are becoming a more cost-effective means of maintaining a high level of system reliability [1-4]. A memory system can be made fault tolerant with the application of an error-correcting code; i.e., the mean time between "failures" of a properly designed memory system can be significantly increased with ECC. In this context, a system "fails" only when the errors exceed the error-correcting capability of the code. Also, in order to optimize data integrity, the ECC should have the capability of detecting the most likely of the errors that are uncorrectable.
Error-correcting codes used in early computer memory systems were of the class of single-error-correcting and double-error-detecting (SEC-DED) codes invented by R. W. Hamming [5]. A SEC-DED code is capable of correcting one error and detecting two errors in a codeword. The double-error-detecting capability serves to guard against data loss. In 1970, a new class of SEC-DED codes called odd-weight-column codes was published by Hsiao [6]. With the same coding efficiency, the odd-weight-column codes provide improvements over the Hamming codes in speed, cost, and reliability of the decoding logic. As a result, odd-weight-column codes have been widely implemented by IBM and the computer industry worldwide. Examples of systems which incorporate these codes are the IBM 158, 168, 303X, 308X, and 4300 series, Cray 1, Tandem, etc. There are also various standard part numbers of these codes offered by many semiconductor manufacturers [11] (for example, the AM2960 and AMZ8160 of Advanced Micro Devices, the MC68540 of Motorola, the MB1412A of Fujitsu, and the SN54/74 LS630, LS631 of Texas Instruments).
The number of errors generated in the failure of a memory chip is largely dependent on the chip failure type. For example, a cell failure may cause one error, while a line failure or a total chip failure in general causes more than one error. For ECC applications, the memory array chips are usually organized so that the errors generated in a chip failure can be corrected by the ECC. In the case of SEC-DED codes, the one-bit-per-chip organization is the most effective design. In this organization, each bit of a codeword is stored in a different chip; thus, any type of failure in a chip can corrupt, at most, one bit of the codeword. As long as the errors do not line up in the same codeword, multiple errors in the memory are correctable.
Memory array modules are generally packaged on printed-circuit cards with current semiconductor memory technology, and usually a group of bits from the same card form a portion of an ECC codeword, as illustrated in Figure 1. With this multiple-bit-per-card type of organization, a failure at the card-support-circuit level would result in a byte error, where the size of the byte is the number of bits feeding from the card to a codeword. In this type of configuration, it is important for data integrity that the ECC be able to detect byte errors [12]. A SEC-DED code is in general not capable of detecting all single-byte errors. However, a class of SEC-DED codes capable of detecting all single-byte errors can be constructed [13, 14]. These are called single-error-correcting double-error-detecting single-byte-error-detecting (SEC-DED-SBD) codes.
There are certain design applications where the memory array cannot be organized in one-bit-per-chip fashion because of cost or other reasons such as system granularity or power restrictions. As chip density increases, it becomes more difficult to design a one-bit-per-chip memory system. For a multiple-bit-per-chip type of memory organization, a single-byte-error-correcting double-byte-error-detecting (SBC-DBD) code [15-20] would be more effective in error correction and error detection.
System reliability generally tends to decrease as the capacity of a memory system increases. To maintain the same high level of reliability, a double-error-correcting triple-error-detecting (DEC-TED) code may be used. However, this type of code requires a larger number of check bits than a SEC-DED code and more complex hardware to implement the functions of error correction and error detection [8, 15, 16].
An error-correcting code can be used to correct "soft" errors as well as hard errors. Soft errors are temporary errors such as α-particle-induced errors that disappear during the next memory write operation. With a maintenance strategy that allows the accumulation of hard errors, a high soft error rate would cause a high uncorrectable error (UE) rate. To reduce the UE rate that involves soft errors, a SEC-DED code can be modified to correct two hard errors or a combination of one hard and one soft error [21-25].
In this paper we review the current status of error-correcting codes for semiconductor memory applications and present the state of the art by describing the construction of four classes of error-correcting codes suitable for this type of design application. These four classes are SEC-DED codes, SEC-DED-SBD codes, SBC-DBD codes, and DEC-TED codes. For each class of code we provide the number of check bits required for commonly used data lengths, information that is particularly useful to designers for system planning. We also discuss the implementation aspects of error correction and error detection for these classes of error control codes. In addition, we describe a number of algorithms useful in extending the error-correcting capability of codes for the correction of soft errors such as α-particle-induced errors and other temporary errors.
Binary linear block codes
A binary $(n, k)$ linear block code is a $k$-dimensional subspace of a binary $n$-dimensional vector space [8, 15, 16]. An $n$-bit codeword of the code contains $k$ data bits and $r = n - k$ check bits. An $r \times n$ parity check matrix $H$ is used to describe the code. Let $V = (v_1, v_2, \ldots, v_n)$ be an $n$-bit vector. Then $V$ is a codeword if and only if
$$H \cdot V^t = 0 \pmod 2, \qquad (1)$$
where $V^t$ denotes the transpose of $V$, and all additions are performed modulo 2.
The encoding process of a code consists of generating $r$ check bits for a set of $k$ data bits. To facilitate encoding, the $H$ matrix is expressed as
$$H = [P, I_r], \qquad (2)$$
where $P$ is an $r \times k$ binary matrix and $I_r$ is the $r \times r$ identity matrix. Then the first $k$ bits of a codeword can be designated as the data bits, and the last $r$ bits can be designated as the check bits. Furthermore, the $i$th check bit can be explicitly calculated from the $i$th equation of the set of $r$ equations in (1). A code specified by an $H$ matrix of (2) is called a systematic code.
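As a minimal sketch of systematic encoding, assuming the form $H = [P, I_r]$ with data bits first and check bits last, the following Python computes each check bit as the modulo-2 sum of the data bits selected by one row of $P$. The matrix `P` is a hypothetical (7, 4) example chosen only for illustration (it is not taken from the figures of this paper), and the names `encode` and `is_codeword` are likewise illustrative.

```python
from typing import List

# Hypothetical P matrix (r = 3, k = 4), for illustration only.
# Together with I_3 it forms H = [P, I_3] of a (7, 4) code.
P = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
]


def encode(data: List[int], p: List[List[int]]) -> List[int]:
    """Append r check bits so that H * V^t = 0 (mod 2) for H = [P, I_r]."""
    checks = [sum(pij & dj for pij, dj in zip(row, data)) % 2 for row in p]
    return list(data) + checks


def is_codeword(word: List[int], p: List[List[int]]) -> bool:
    """Evaluate every parity equation of H = [P, I_r] modulo 2."""
    k = len(p[0])
    return all(
        (sum(pij & wj for pij, wj in zip(row, word[:k])) + word[k + i]) % 2 == 0
        for i, row in enumerate(p)
    )


codeword = encode([1, 0, 1, 1], P)
print(codeword, is_codeword(codeword, P))
```

In hardware each row of `P` corresponds to one XOR tree over the selected data bits, so all check bits are produced in parallel.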
Any binary $r \times n$ matrix $H$ of rank $r$ can always be transformed into the systematic form of (2). Since the rank of $H$ is $r$, there exists a set of $r$ linearly independent columns. The columns of the matrix can be reordered so that the rightmost $r$ columns are linearly independent. Applying elementary row operations [16] on the resultant matrix, a matrix of (2) is obtained. The systematic code obtained is equivalent to the code defined by the original $H$ matrix. Figure 2(a) is an example of the parity check matrix of a (26, 20) code in a nonsystematic form. Note that the last six columns of the matrix are linearly independent. The submatrix of the six columns can be inverted. The multiplication of the inverse of the submatrix and the parity check matrix results in a matrix of systematic form shown in Figure 2(b).
The decoding process consists of determining whether $U$, the word read from the memory, contains errors and determining the error vector. To determine whether $U$ is in error, an $r$-bit syndrome $S$ is calculated as follows:
$$S = H \cdot U^t. \qquad (3)$$
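The syndrome computation can be illustrated with a short Python sketch. It assumes the syndrome is the product $S = H \cdot U^t$ taken modulo 2 and that the fetched word is $U = V + E$ for a stored codeword $V$ and an error vector $E$; under those assumptions the syndrome depends only on $E$. The matrix `H` below is a hypothetical (7, 4) example, not one of the codes shown in the figures.

```python
# Hypothetical systematic H = [P, I_3] of a (7, 4) code, for illustration.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def syndrome(h, word):
    """Return S = H * word^t with all additions performed modulo 2."""
    return [sum(a & b for a, b in zip(row, word)) % 2 for row in h]


V = [1, 0, 1, 1, 0, 1, 0]              # a codeword: syndrome is all zeros
E = [0, 0, 1, 0, 0, 0, 0]              # single error in the third bit
U = [v ^ e for v, e in zip(V, E)]      # word fetched from memory

print(syndrome(H, V))                  # zero syndrome for the codeword
print(syndrome(H, U), syndrome(H, E))  # identical: S depends only on E
```

For a single error the syndrome equals the column of `H` at the erroneous bit position, which is what the parallel decoding logic matches against.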
The error-correcting capability of a code is closely related to the minimum distance of the code. The weight of a codeword is the number of nonzero components in the codeword. The distance between two codewords is the number of components in which the two codewords differ. The minimum distance $d$ of the code is the minimum of the distances of all pairs of codewords. For a linear code, the minimum distance of the code is equal to the minimum of the weights of all nonzero codewords [8].
In semiconductor memory applications, the encoding and the decoding of a code are implemented in a parallel manner. In encoding, the check bits are generated simultaneously by processing the data bits in parallel. In decoding, the syndrome is generated using the same hardware for the generation of the check bits. The error vector is then generated by decoding the syndrome bits in parallel. Finally, the errors are corrected by subtracting the error vector from the fetched word. The subtraction is accomplished by the bit-by-bit exclusive-or (XOR) of the components of the two vectors.
The reliability function of a memory system that employs an error-correcting code can be handled either analytically or through Monte Carlo methods [1-4, 26-28]. For a system with a simple architecture, an analytical approach may be possible. However, for a memory system consisting of hierarchical arrays, the memory reliability function is too intractable to handle analytically. Monte Carlo methods are considered a general approach to study the effectiveness of error-correcting codes and other fault-tolerant schemes [27, 28].
To demonstrate the reliability improvement obtainable with ECC, we consider three memory systems of four megabytes. The first system consists of eight memory cards and is designed with a parity check on each set of eight data bits. The second system consists of 18 memory cards and is designed with a (72, 64) SEC-DED code. The last system consists of 20 memory cards and is designed with an (80, 64) DEC-TED code. The memory chips for the systems are 16K-bit chips with 128 bit lines and 128 word lines in each chip. Each memory card contains an array of $32 \times 9$ chips for the first system, and an array of $32 \times 4$ chips for the other two systems. The failure rates of the chips and the card-support circuits are assumed to be the same as those described in [27]. When a UE occurs, the strategy is to replace the card that contains the UE and that has the largest number of defective cells.
The modeling tool of [27] is used to simulate the reliability of the three memory systems. The results of the simulation
SEC-DED codes
The minimum distance of a single-error-correcting and double-error-detecting (SEC-DED) code is greater than or equal to four. Since an $n$-tuple of weight three or less is not a codeword, from Eq. (1) the sum of a set of three or fewer columns of the $H$ matrix must be nonzero. In other words,
A1. The column vectors of the $H$ matrix are nonzero and are distinct.
A2. The sum of two columns of the $H$ matrix is nonzero and is not equal to a third column of the $H$ matrix.
Note that the sum of two odd-weight $r$-tuples is an even-weight $r$-tuple. A SEC-DED code with $r$ check bits can be constructed with its $H$ matrix consisting of distinct nonzero $r$-tuples of odd weights. This is an odd-weight-column code of Hsiao [6].
The maximum code length of an odd-weight-column code with $r$ check bits is $2^{r-1}$, for there are $2^{r-1}$ possible distinct odd-weight $r$-tuples. This maximum code length is the same as that of a SEC-DED Hamming code. The maximum number of data bits $k$ of a SEC-DED code must satisfy $k \le 2^{r-1} - r$. Table 2 lists the number of check bits required for a set of data bits. Figure 3 shows examples of SEC-DED codes used in some IBM systems.
Most of the SEC-DED codes for semiconductor memory applications are shortened codes in that the code length is less than the maximum for a given number of check bits. There are various ways of shortening a maximum-length SEC-DED code. Usually a code designer constructs a shortened code to meet certain objectives for a particular application. These objectives may include the minimization of the number of circuits, the amount of logic delay, the number of part numbers, or the probability of miscorrecting triple errors [6].
In a write operation, check bits are generated simultaneously by processing the data bits in a parallel manner according to Eqs. (1) and (2). In a read operation, syndrome bits are generated simultaneously from the word read according to Eq. (3). Typically the same XOR tree is used to generate both the check bits and the syndrome bits (see Figure 4).
An algorithm for correcting single errors and detecting multiple errors is described as follows:
1. Test whether $S$ is 0. If $S$ is 0, the word is assumed to be error-free.
2. If $S \ne 0$, try to find a perfect match between $S$ and a column of the $H$ matrix. The match can be implemented in $n$ $r$-way AND gates.
3. If $S$ is the same as the $i$th column of $H$, the $i$th bit of the word is in error.
4. If $S$ is not equal to any column of $H$, the errors are detected as uncorrectable (UE).
This algorithm applied to a SEC-DED code corrects all single errors and detects all double errors. Multiple-bit errors may be detected or falsely corrected. The extent of multiple errors detected depends on the structure of the code.
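The odd-weight-column construction and the four-step decoding algorithm can be sketched in Python as follows. This is a hedged illustration rather than the logic of any particular product: it builds the full-length code with $r = 4$ check bits (an (8, 4) code) by taking every odd-weight 4-tuple as a column of $H$, and the names `odd_weight_columns`, `syndrome`, and `decode` are hypothetical.

```python
from itertools import product


def odd_weight_columns(r):
    """All r-bit tuples of odd weight; there are 2**(r - 1) of them."""
    return [c for c in product((0, 1), repeat=r) if sum(c) % 2 == 1]


def syndrome(columns, word):
    """S = H * U^t (mod 2), with H given as the list of its columns."""
    s = [0] * len(columns[0])
    for col, bit in zip(columns, word):
        if bit:
            s = [a ^ b for a, b in zip(s, col)]
    return tuple(s)


def decode(columns, word):
    """Return (status, word) with status 'ok', 'corrected', or 'UE'."""
    s = syndrome(columns, word)
    if not any(s):                      # step 1: zero syndrome
        return "ok", list(word)
    if s in columns:                    # steps 2-3: S matches column i
        fixed = list(word)
        fixed[columns.index(s)] ^= 1
        return "corrected", fixed
    return "UE", list(word)             # step 4: no matching column


COLUMNS = odd_weight_columns(4)         # full-length code for r = 4
print(len(COLUMNS), 2 ** (4 - 1) - 4)   # code length and maximum data bits
```

Because the sum of two odd-weight columns has even weight, the syndrome of any double error is nonzero and matches no column, so the sketch reports it as uncorrectable. The membership test `s in columns` plays the role of the $n$ $r$-way AND gates.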
As shown in Figure 5, hardware implementation of the error correction and detection mainly consists of an r-way OR gate for testing nonzero syndrome, n r-way AND gates for decoding syndromes, an n-way NOR gate for generating UE
The failure of a common logic support in the memory may result in an all-ones or an all-zeros pattern in a codeword. In this case, the error vector in general contains a multiple number of errors that are not detectable by a SEC-DED code.
To prevent this kind of data loss, the code can be constructed or modified so that an all-ones or an all-zeros n-tuple is not a codeword. For example, if the check bits are inverted before the codeword is written into the memory, then all the codewords stored in the memory are nonzero. In general, the detection of all-ones and all-zeros errors can be achieved by inverting a subset of the check bits [9] .
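A small Python sketch may make the effect of check-bit inversion concrete. It assumes a hypothetical systematic (8, 4) odd-weight-column code and inverts all four check bits on both write and read; the choice of which check bits to invert is an assumption made for this example, since the text only states that a suitable subset exists.

```python
# Hypothetical systematic (8, 4) odd-weight-column code, for illustration:
# weight-3 columns for the data bits, identity columns for the check bits.
P_COLUMNS = [(1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1)]
I_COLUMNS = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
COLUMNS = P_COLUMNS + I_COLUMNS
K = 4  # number of data bits


def syndrome(word):
    """S = H * word^t (mod 2), with H given as the list of its columns."""
    s = [0, 0, 0, 0]
    for col, bit in zip(COLUMNS, word):
        if bit:
            s = [a ^ b for a, b in zip(s, col)]
    return tuple(s)


def fetch_syndrome(raw, invert_checks):
    """Syndrome of a raw fetched word, undoing the check-bit inversion if used."""
    word = list(raw)
    if invert_checks:
        word[K:] = [b ^ 1 for b in word[K:]]
    return syndrome(word)


for pattern in ([0] * 8, [1] * 8):
    print(pattern, fetch_syndrome(pattern, False), fetch_syndrome(pattern, True))
```

Without inversion both patterns give a zero syndrome and would pass as valid codewords of this particular code. With inversion both give the even-weight syndrome (1, 1, 1, 1), which matches no column and is therefore flagged as uncorrectable.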
SEC-DED-SBD codes
In some applications it is required that the memory array chips be packaged in a b-bits-per-chip organization. A chip failure or a word-line failure in this case would result in a byte-oriented error that contains from 1 to b erroneous bits. Byte errors can also be caused by the failures of the supporting modules at the memory card level. The class of SEC-DED codes that are capable of detecting all single-byte errors (SEC-DED-SBD codes) may be used to maintain data integrity in these applications.
bit positions of the original code. Since the same encoding and decoding hardware can be used, no additional hardware is required if a SEC-DED code can be reconfigured for single-byte error detection.
Table 3.
2 2 3 3 3 3 3
5 6 7 8 9 10 11
10 12 15 16 18 20 22
21 26 31 36 41 46 51
42 52 63 12 82 92 102
85 106 127 148 169 190
Table 4 All binary 3-tuples expressed as elements of GF(8).
SBC-DBD codes
For a memory system packaged in a b-bits-per-chip organization, the reliability provided by a SEC-DED code may not be acceptable. To increase the reliability, a byte-oriented error-correcting code may be used [15-20, 29]. In this section, we discuss the construction and implementation of single-byte-error-correcting and double-byte-error-detecting (SBC-DBD) codes.
A codeword of a SBC-DBD code consists of $N$ $b$-bit bytes. A binary $b$-tuple is considered an element of the finite field GF($2^b$) of $2^b$ elements [8, 15, 16]. For example, all binary 3-tuples can be assigned as the elements of GF(8), as shown in Table 4. In the finite-field representation of $b$-tuples, the sum of two elements is the bit-by-bit XOR of the two associated $b$-tuples. The product of two elements $X^i$ and $X^j$ is $X^k$ with $k = i + j \bmod (2^b - 1)$. For example, $X^3 + X^6 = (1\,1\,0) + (1\,0\,1) = (0\,1\,1) = X^4$, and $X^3 \cdot X^6 = X^2$ from Table 4.
With the finite-field representation, an SBC-DBD code is a linear code over GF($2^b$) with a minimum distance $d \ge 4$. The code can also be defined by the parity check matrix $H$ of (1) and (2), with the components of the matrices and vectors considered elements of GF($2^b$). Let $h_i$, $1 \le i \le N$, be the column vectors of the $H$ matrix. The SBC-DBD code must satisfy the following conditions:
B2. $h_i + X_1 \cdot h_j \ne X_2 \cdot h_k$ for distinct $i, j, k$ and $X_1, X_2 \in$ GF($2^b$).
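The finite-field arithmetic on $b$-tuples described in this section can be checked with a short Python sketch. It assumes that GF(8) is generated by the primitive polynomial $x^3 + x + 1$ and that a 3-tuple lists the coefficients of $1$, $X$, $X^2$ in that order; this assumption reproduces the worked example $X^3 + X^6 = X^4$, but the actual assignment in Table 4 is not reproduced here. The helper names are hypothetical.

```python
B = 3                   # byte size b
PRIM_POLY = 0b1011      # x^3 + x + 1, an assumed primitive polynomial


def build_powers(b, poly):
    """Return [X^0, ..., X^(2^b - 2)] as integers; bit j holds the x^j coefficient."""
    powers, value = [], 1
    for _ in range(2 ** b - 1):
        powers.append(value)
        value <<= 1              # multiply by X
        if value >> b:           # reduce modulo the primitive polynomial
            value ^= poly
    return powers


POWERS = build_powers(B, PRIM_POLY)
LOG = {v: i for i, v in enumerate(POWERS)}


def gf_add(x, y):
    """Sum of two field elements: bit-by-bit XOR of their b-tuples."""
    return x ^ y


def gf_mul(x, y):
    """Product X^i * X^j = X^((i + j) mod (2^b - 1)); zero times anything is zero."""
    if x == 0 or y == 0:
        return 0
    return POWERS[(LOG[x] + LOG[y]) % (2 ** B - 1)]


def as_tuple(x):
    """Coefficients of 1, X, X^2, ... as a tuple of bits."""
    return tuple((x >> j) & 1 for j in range(B))


print(as_tuple(POWERS[3]), as_tuple(POWERS[6]))
print(LOG[gf_add(POWERS[3], POWERS[6])], LOG[gf_mul(POWERS[3], POWERS[6])])
```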
Let $r$ be the number of check bytes of an SBC-DBD code over GF($2^b$). For $r = 3$, a code of length $N = 2 + 2^b$ bytes can be constructed by extending a Reed-Solomon code of length $2^b - 1$.
Using the $H$ matrix of Eq. (4), the last three column positions of $H$ can be designated as the positions of check bytes and the other column positions of $H$ can be designated as data byte positions. The check bytes can be generated with an XOR tree just as in the case of SEC-DED codes. The syndrome can also be generated with the same XOR tree. For decoding, the syndrome $S$ is divided into three parts, $S_1$, $S_2$, $S_3$. Each $S_i$ consists of $b$ bits and represents the parity check equations for the $i$th row of (4). From (3), if $E$ is a single-byte error pattern at data byte position $i$, then $E$ is a unique solution to the following three equations:
$$E = S_1^t, \qquad T^i \cdot E^t = S_2, \qquad T^{2i} \cdot E^t = S_3.$$
On the other hand, if $E$ is a byte error pattern at check byte position $i$, where $i$ = 1, 2, or 3, then $E = S_i^t$ and the other two subsyndromes are zeros. The following steps can be taken to find the correctable single-byte error patterns and to detect multiple uncorrectable byte errors.
1. If $S$ is a zero vector, assume that there is no error. If $S$ is nonzero, go to step 2.
2. If one of the subsyndromes $S_i \ne 0$, $i$ = 1, 2, 3, and the other two subsyndromes are zero, the check byte position $i$ with error pattern $S_i$ is assumed. Otherwise, go to step 3.
3. Assume that $E = S_1^t$. Find $i$ that satisfies $0 \le i \le N - 4$, $T^i \cdot E^t = S_2$, and $T^{2i} \cdot E^t = S_3$. If $i$ has a solution, the byte error with pattern $E$ at data byte position $i$ is assumed. If $i$ has no solution, then an uncorrectable error is detected.
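The three decoding steps can be sketched in Python for $b = 3$. Several assumptions are made here and are not confirmed by the text above: the data column at position $i$ of $H$ is taken to be $(1, X^i, X^{2i})$ over GF(8), the three check columns form an identity, multiplication by $T^i$ is represented as field multiplication by $X^i$, and GF(8) is generated by $x^3 + x + 1$. All function names are hypothetical.

```python
B = 3                      # byte size b; bytes are elements of GF(2^3)
PRIM_POLY = 0b1011         # x^3 + x + 1, an assumed primitive polynomial
Q1 = 2 ** B - 1            # number of data bytes, N - 3 = 7


def build_powers():
    powers, value = [], 1
    for _ in range(Q1):
        powers.append(value)
        value <<= 1
        if value >> B:
            value ^= PRIM_POLY
    return powers


POWERS = build_powers()
LOG = {v: i for i, v in enumerate(POWERS)}


def gf_mul(x, y):
    if x == 0 or y == 0:
        return 0
    return POWERS[(LOG[x] + LOG[y]) % Q1]


def check_bytes(data):
    """Three check bytes for the assumed columns (1, X^i, X^(2i))."""
    c1 = c2 = c3 = 0
    for i, d in enumerate(data):
        c1 ^= d
        c2 ^= gf_mul(POWERS[i], d)
        c3 ^= gf_mul(POWERS[(2 * i) % Q1], d)
    return [c1, c2, c3]


def encode(data):
    return list(data) + check_bytes(data)


def decode(word):
    """Return (status, word) with status 'ok', 'corrected', or 'UE'."""
    word = list(word)
    s = [a ^ b for a, b in zip(check_bytes(word[:Q1]), word[Q1:])]
    nonzero = [j for j, sj in enumerate(s) if sj]
    if not nonzero:                              # step 1: zero syndrome
        return "ok", word
    if len(nonzero) == 1:                        # step 2: check byte in error
        j = nonzero[0]
        word[Q1 + j] ^= s[j]
        return "corrected", word
    for i in range(Q1):                          # step 3: search data position i
        if (gf_mul(POWERS[i], s[0]) == s[1]
                and gf_mul(POWERS[(2 * i) % Q1], s[0]) == s[2]):
            word[i] ^= s[0]                      # error pattern E = S_1
            return "corrected", word
    return "UE", word


codeword = encode([1, 2, 3, 4, 5, 6, 7])
damaged = list(codeword)
damaged[4] ^= 0b101                              # byte error at data position 4
print(decode(damaged) == ("corrected", codeword))
```

In hardware the search over `i` is done in parallel, one comparison circuit per data byte position, rather than by a loop.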
A block diagram for the generation of the error pointers for the code of Fig. 7 is shown in Figure 8 .
The extended Reed-Solomon codes defined in Eq. (4) are optimal in that no other SBC-DBD codes with three check bytes contain more data bytes. However, there exists only one code for a given byte size b. When b is small, the code may be too short for memory applications. For example, the code for b = 2 can only accommodate six data bits. This code certainly is not practical for most applications. In order to increase the code length for a given b, additional check bits are required.
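The dependence of code size on the byte size can be tabulated with a few lines of Python. The sketch assumes only what is stated above for three check bytes: a code length of $N = 2 + 2^b$ bytes, of which three are check bytes. The function name is hypothetical.

```python
def sbc_dbd_parameters(b):
    """Return (code length in bytes, data bytes, data bits) for byte size b."""
    n_bytes = 2 ** b + 2          # N = 2 + 2^b
    data_bytes = n_bytes - 3      # three check bytes
    return n_bytes, data_bytes, data_bytes * b


for b in (2, 3, 4, 8):
    print(b, sbc_dbd_parameters(b))
```

For $b = 2$ this gives six data bits, in agreement with the example in the text.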
Techniques for the construction of SBC-DBD codes for r > 3 can be found in [15, 16, 30, 31].
DEC-TED codes
A memory system may require a double-error-correcting triple-error-detecting (DEC-TED) code to meet its reliability requirements. A DEC-TED code is also attractive for a memory with a high soft error rate. Although there are schemes [21-25], to be discussed in a subsequent section, for a SEC-DED code to correct hard-hard and hard-soft types of double errors, these schemes cannot correct double soft errors and they require the interruption of a normal memory read operation. With a DEC-TED code, any combination of hard and soft double errors, including double soft errors, can be corrected automatically without system interruption. The minimum distance of a DEC-TED code is at least equal to six. The parity check matrix H of a DEC-TED code must have the property that any linear combination of five or fewer columns of H is not an all-zeros vector.
A class of DEC-TED binary linear block codes can be constructed according to the theory of BCH codes [8, 15, 16, 32, 33]. Let X be a root of a primitive binary polynomial P(X) of degree m.
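The powers of $X$ can be considered elements of GF($N$), $N = 2^m$, and can be expressed as binary $m$-tuples. A binary code defined by (1) with the following parity check matrix is a DEC-TED code:
$$H = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & X & X^2 & \cdots & X^{N-2} \\ 1 & X^3 & X^6 & \cdots & X^{3(N-2)} \end{bmatrix}. \qquad (5)$$
parity check matrix of a (31, 20) code constructed from Eq. (5).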
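The (31, 20) example can be checked numerically. The Python sketch below assumes the usual BCH form in which column $i$ of $H$ stacks $1$, $X^i$, and $X^{3i}$, and it assumes $x^5 + x^2 + 1$ as the primitive polynomial for $m = 5$; both are assumptions made for this illustration. It then verifies by exhaustive search that $H$ has rank 11 and that no set of five or fewer columns sums to zero.

```python
from itertools import combinations

M = 5
PRIM_POLY = 0b100101            # x^5 + x^2 + 1, an assumed primitive polynomial
N = 2 ** M - 1                  # code length 31


def build_powers(m, poly):
    """Return [X^0, ..., X^(2^m - 2)] as m-bit integers."""
    out, value = [], 1
    for _ in range(2 ** m - 1):
        out.append(value)
        value <<= 1
        if value >> m:
            value ^= poly
    return out


X = build_powers(M, PRIM_POLY)

# Column i packs (1, X^i, X^(3i)) into a single (2m + 1)-bit integer.
COLUMNS = [(1 << (2 * M)) | (X[i] << M) | X[(3 * i) % N] for i in range(N)]


def rank(vectors):
    """Rank over GF(2) by elimination on the leading bit."""
    pivots = {}
    for v in vectors:
        while v:
            lead = v.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = v
                break
            v ^= pivots[lead]
    return len(pivots)


def has_zero_sum(columns, max_cols):
    """True if some set of 1..max_cols columns XORs to zero."""
    for t in range(1, max_cols + 1):
        for combo in combinations(columns, t):
            acc = 0
            for c in combo:
                acc ^= c
            if acc == 0:
                return True
    return False


r = rank(COLUMNS)
print(N, N - r, has_zero_sum(COLUMNS, 5))
```

A rank of 11 leaves 20 data bits, and the absence of a zero sum among five or fewer columns is the minimum-distance-six property stated for DEC-TED codes.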
A full-length BCH code can be shortened by deleting a number of columns from its $H$ matrix. The shortened code has a minimum distance at least as large as the original code. The number of check bits of the shortened code may be less than the original code when proper bit positions are deleted [34, 35]. In particular, let $Y$ be a row vector in the space generated by the row vectors of $H$. Deleting the column positions of $H$ where the corresponding positions of $Y$ are ones, then the shortened $H$ matrix has one fewer linearly independent row vector and the shortened code has one fewer check bit than the original code. Table 6 presents a list of the number of check bits required for some DEC-TED BCH codes.
Table 6 Number of check bits required for DEC-TED BCH codes.
Data bits    Check bits
The $H$ matrix defined by (5) can be transformed into the systematic form of (2) for the generation of check bits (see Fig. 9 for example). Let $H_1$ be the parity check matrix in systematic form, and $T$ be an $r \times r$ transformation matrix that satisfies
$$H = T \cdot H_1. \qquad (6)$$
The generation of check bits from matrix $H_1$ can be implemented with an XOR tree. For decoding, it is convenient to define the syndrome $S$ from (3) with the $H$ matrix instead of the $H_1$ matrix. The syndrome can be generated using an XOR tree associated with the $H$ matrix. Thus, two separate XOR trees are used to generate check bits and syndrome bits. The syndrome can also be generated by first generating $S_1$ from Eq. (3) with the $H_1$ matrix, then multiplying matrix $T$ by $S_1$. Using this approach, the same XOR tree can be used to generate check bits and $S_1$. The validity of this procedure follows directly from Eq. (6). There are various schemes for solving Eq.
Table 7 Example of locating erasures.
Extended error correction
Errors in semiconductor memory can be broadly divided into hard errors and soft errors [24, 25]. Hard errors are caused by stuck faults or by permanent physical damage to the memory devices. Soft errors are temporary errors or α-particle-induced errors that will be erased during the next data storage operation. For this discussion, the errors that will stay in their locations during the next few write cycles are considered hard errors.
Error-correcting codes can be used to correct hard as well as soft errors. However, the maintenance strategy for a system may allow the hard errors to accumulate. The presence of errors in the memory increases the probability of uncorrectable errors (UE) due to the lineup of multiple errors in a codeword. The UE rate can be reduced by repair service scheduled periodically. It can also be reduced by extending the conventional error correction to some of the otherwise uncorrectable errors. The latter approach is especially attractive when the soft error rates are high, because it does not require the replacement of memory components. The extended error-correction schemes are discussed in this section. 
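An erasure is an error whose location is known. A code with minimum distance $d$ is capable of correcting $t$ random errors and $e$ erasures provided that
$$2t + e < d. \qquad (8)$$
For example, a SEC-DED code is capable of correcting one random error and one erasure.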
In memory applications, the hard errors can be considered erasures if their locations can be identified. To locate the erasures of a particular word in the memory, we may apply some test patterns to the memory. Assume that any binary pattern can be written into the memory. An example is shown in Table 7 for finding the locations of erasures with two test patterns, $T_1$ and $T_2$, of length 8, where $T_2$ is the complement of $T_1$. Before the test patterns are written into and read out of the memory, the word originally stored in the memory is read out and stored in a temporary storage. The erasure vector is obtained by the complement of $T_1(\mathrm{READ}) + T_2(\mathrm{READ})$. The locations of the erasures are indicated by the ones in the erasure vector. Since $T_1$ can be arbitrarily chosen, we may also use the word originally stored in the memory as $T_1$. This approach for locating the erasures, known as the double complement algorithm, saves one write and one read operation. An example of the algorithm is shown in Table 8.
Some system designs permit only the codewords to be written into the memory [21, 22, 25]. If the complement of a codeword is not a codeword, then the approaches just described for the identification of erasures are not applicable. In this case, one solution is to design codes with some special properties [21, 22]. Another solution is to employ three test patterns in locating the erasures [25]. The test patterns are chosen in such a way that they contain at least one 1 and one 0 in every bit position. It can be shown that three test patterns are sufficient to satisfy this condition for any linear code.
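Locating erasures with a pair of complementary test patterns can be sketched in Python. The `StuckWord` class is a hypothetical model of one memory word in which some cells are stuck at fixed values; it stands in for real memory and is not part of any system described here.

```python
class StuckWord:
    """Hypothetical model of one memory word with stuck-at cells."""

    def __init__(self, width, stuck):
        self.stuck = dict(stuck)          # bit index -> stuck value
        self.cells = [0] * width
        self.write([0] * width)

    def write(self, bits):
        # A stuck cell ignores the written value.
        self.cells = [self.stuck.get(i, b) for i, b in enumerate(bits)]

    def read(self):
        return list(self.cells)


def locate_erasures(mem, t1):
    """Erasure vector = complement of T1(READ) + T2(READ), with T2 = complement of T1."""
    t2 = [b ^ 1 for b in t1]
    mem.write(t1)
    r1 = mem.read()
    mem.write(t2)
    r2 = mem.read()
    return [1 ^ (a ^ b) for a, b in zip(r1, r2)]


mem = StuckWord(8, {2: 1, 5: 0})
print(locate_erasures(mem, [0, 1, 1, 0, 1, 0, 0, 1]))
```

A healthy cell returns opposite values for the two patterns, so the XOR is 1 and its complement is 0; a stuck cell returns the same value twice and is marked with a 1.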
Once the locations of the erasures are identified, algorithms can be designed to correct the hard and soft errors, provided that the number of errors satisfies Eq. (8) . Assume that the double complement algorithm is applicable for locating the erasures. The following procedure can be used to correct up to two hard errors or a combination of one hard and one soft error for a SEC-DED code:
1. Read word $T_1$ from a memory location.
2. If a single error in $T_1$ is detected by the ECC logic, the error in the word is corrected, and the corrected codeword is sent out to its destination.
3. If uncorrectable errors in $T_1$ are detected by the ECC logic, the complement of $T_1$ is written into the same memory location. Then the word from the same memory location is read and complemented. Let the resultant word be $T_3$ (see Table 8).
4. If a single error in $T_3$ is detected by the ECC logic, the error is corrected. The corrected word is sent out to its destination and is also written into the same memory location.
5. If no error is detected by the ECC logic, $T_3$ is assumed error free. $T_3$ is sent out to its destination and is also written into the same memory location.
6. If uncorrectable errors are detected by the ECC logic, the original word is declared uncorrectable.
Note that double soft errors are not correctable by this procedure. All single errors are corrected at the normal speed. The correction of hard-hard and hard-soft types of double errors takes more time because additional write and read operations are involved. The procedure can be modified or refined to correct additional multiple hard errors [21, 24] at the expense of speed and cost. The procedure can also be extended to correct multiple errors beyond the random error-correcting capability of SBC-DBD codes and DEC-TED codes.
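The six-step procedure can be simulated with a small Python sketch. It assumes a hypothetical (8, 4) odd-weight-column SEC-DED code, chosen because the complement of each of its codewords is again a codeword, and a hypothetical `StuckWord` model of a memory word with stuck cells; neither is taken from a specific system.

```python
from itertools import product

# Hypothetical (8, 4) SEC-DED code: columns of H are all odd-weight 4-tuples.
COLUMNS = [c for c in product((0, 1), repeat=4) if sum(c) % 2 == 1]


def decode(word):
    """SEC-DED decoding: returns ('ok' | 'corrected' | 'UE', word)."""
    s = [0, 0, 0, 0]
    for col, bit in zip(COLUMNS, word):
        if bit:
            s = [a ^ b for a, b in zip(s, col)]
    s = tuple(s)
    if not any(s):
        return "ok", list(word)
    if s in COLUMNS:
        fixed = list(word)
        fixed[COLUMNS.index(s)] ^= 1
        return "corrected", fixed
    return "UE", list(word)


class StuckWord:
    """Hypothetical memory word; stuck cells ignore writes."""

    def __init__(self, stuck, initial):
        self.stuck = dict(stuck)
        self.write(initial)

    def write(self, bits):
        self.cells = [self.stuck.get(i, b) for i, b in enumerate(bits)]

    def read(self):
        return list(self.cells)


def read_with_double_complement(mem):
    t1 = mem.read()                          # step 1
    status, word = decode(t1)
    if status != "UE":                       # step 2 (or no error at all)
        return status, word
    mem.write([b ^ 1 for b in t1])           # step 3: write complement of T1,
    t3 = [b ^ 1 for b in mem.read()]         # read it back and complement
    status, word = decode(t3)                # steps 4 and 5
    if status == "UE":                       # step 6
        return "UE", t1
    mem.write(word)                          # restore the recovered word
    return "recovered", word


V = [1, 1, 1, 1, 0, 0, 0, 0]                 # a codeword of the assumed code
hard_hard = StuckWord({0: 0, 5: 1}, V)       # two stuck cells, both in error
print(read_with_double_complement(hard_hard))
```

In this model a stuck cell that disagrees with the data is repaired by the write-complement, read, and complement sequence, while a soft error survives it unchanged and is left for the SEC-DED logic to correct. Two soft errors therefore remain uncorrectable, as noted in the text.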
The procedure just described derives the information on erasures at the time when the double error occurs. A different method is to store the information on the erasure errors in a table [22] . This approach increases the speed of correcting double errors. However, the table has to be constantly updated to reflect the true status of the erasures in the memory.
There are other schemes for the correction of multiple erasures [39-41]. These schemes involve the design of codes [40].
Conclusions
Advances in semiconductor technology have brought about very high levels of integration, especially in the memory area where circuit densities are up to 256K bits per chip. In VLSI memory, higher density usually means a reduced signal-to-noise margin. It also increases the likelihood of soft errors due to radiation and other sources. Error-correcting codes have provided a very effective solution to these problems. They have become an essential part of modern memory design. In the future, the ECC could even be an integral part of the memory chips that manufacturers would offer.
In this paper, we have described the essentials of the principal error-correcting codes used in semiconductor memory design applications. The class of SEC-DED codes is currently most widely used throughout the industry. However, more powerful codes such as SBC-DBD and DEC-TED codes are quite likely to be used in future commercial systems.
