Abstract-As VLSI technology is inching forward to the ultimate limits of physical dimensions, memory manufacturers are striving to integrate more memory cells in a chip by employing innovative three-dimensional cell topographies. Most of the current-generation multimegabit dynamic random-access memory (DRAM) chips use three-dimensional storage capacitors where the charge is stored on a vertically integrated trench-type structure. It has been experimentally verified that these memory cells have poor reliability, because they are highly vulnerable to alpha particles, which frequently create plasma shorts between two adjoining trench capacitors on the same word line, resulting in uncorrectable double-bit soft errors. The conventional on-chip error-correcting codes (ECCs) cannot correct such double-bit/word-line soft errors. The paper presents a systematic study of soft-error-related problems and discusses the methodologies to correct single-bit and double-bit memory-cell upsets by using on-chip ECC circuits. Conventional double-error-correcting (DEC) codes used in digital communications are known to be inadequate for this application. By modifying the product code, an effective coding scheme has been designed in this paper that can be integrated within a DRAM chip to correct double-bit errors. The paper demonstrates that the reliability of a memory chip can be improved by a factor of several million by integrating the proposed circuit. The area and timing overheads have been calculated and compared with those of memory chips without any ECC and of chips with single-error-correcting (SEC) codes. The ability of the circuit to correct soft errors in the presence of multiple-bit errors has also been analyzed by combinatorial enumeration.
I. INTRODUCTION
As the feature width of very-large-scale-integration (VLSI) technology is rapidly approaching its physical limits of about 0.35 µm [1], the dynamic random-access memory (DRAM) size is quadrupling every three years or so to grow to its estimated size of 4 Gb [2]. Already many DRAM manufacturers [3], [6] have experimented with 64-Mb three-dimensional DRAMs with vertically mounted, trench-type storage capacitors. This enormous prospect of gargantuan, high-density memory has posed two formidable challenges to the memory designers: a) how to test memory chips economically during fabrication, and b) how to ensure high reliability of the stored data during the operational life cycle of a memory chip. A number of researchers [7]-[9] have addressed the first issue by proposing efficient design-for-testability DRAM architectures where a group of cells is tested in a single memory cycle. The main objective of this paper is to address the second issue, namely how to concurrently detect and correct the soft errors by employing an on-chip error-correcting circuit. The soft errors are predominantly caused by alpha particles, and sometimes they result from transients like power-supply voltage spikes, thermal effects, and man-made static. These errors are called soft because they do not damage the physical function of a cell permanently, and they can be easily corrected by inverting the data in the faulty cells. In contrast, the errors that result from common functional faults such as stuck-at, coupling, and pattern-sensitive faults are classified as hard and medium. The permanent faults, e.g., stuck-at faults, result in hard errors, while the coupling faults and the pattern-sensitive faults usually result in medium errors, because they are difficult to correct [10]. Comprehensive parallel test algorithms proposed in [11] can efficiently detect all hard and medium errors in high-density DRAM.
It is well known that if alpha particles are incident on the intervening space between two adjoining trench capacitors, the resulting plasma discharge may destroy the data in both capacitors. Chern et al. [12] have done extensive Monte Carlo simulation to study the charge-sharing mechanism caused by an alpha-particle-induced plasma short between capacitors. Their study showed that trench capacitors are likely to cause double-bit upsets. These double-bit errors cannot be corrected by the conventional single-error-correcting (SEC) coding circuits. A novel fault-tolerant DRAM design is, therefore, proposed to correct double-bit errors on every word line within the chip. Thus, in an $n$-bit DRAM organized into $s$ square subarrays, each of size $\sqrt{n/s} \times \sqrt{n/s}$, the proposed design can correct as many as $2\sqrt{sn}$ errors (two on each of the $\sqrt{sn}$ word lines). The improvement in soft-error rate (SER) as a result of the proposed code is found to be better than $10^6$, and thereby the reliability of the memory improves considerably.
0018-9340/93$03.00 © 1993 IEEE
II. CHARACTERIZATION OF ALPHA-PARTICLE-INDUCED SOFT ERRORS
In a DRAM chip, more than 98% of the failures that occur during normal operation are radiation-induced soft errors [13]-[15]. On-line hard and medium errors are relatively rare in a well-designed chip. A small fraction of failures come from transients like voltage spikes and man-made static. By introducing suitable filtering circuits, these sources of transient errors can be sufficiently suppressed. But the alpha-particle-induced soft errors are becoming more and more critical as the cell dimension shrinks. The capacitor value and the cell topography play an important role in deciding the soft-error rate (SER). In this section, a detailed study is made to identify the sources of alpha particles, their effects on three-dimensional DRAM with trench-type capacitors, and how to minimize these effects by using appropriate shielding mechanisms.

There are three radiation sources that cause soft errors in a DRAM chip. The cosmic rays in the atmosphere may strike the chip with sufficient impinging velocity to generate electron-hole pairs that may contribute to the soft errors. For space and avionics applications, this cosmic radiation is a major concern, and suitable radiation-hardened protective measures are adopted to minimize these effects. The second source of alpha particles is the radioactive decay of uranium and thorium, contained in minute proportions within the packaging materials [16], [17]. This is a large source, but it can also be reduced sufficiently by coating the chip with a radiation-hardened film that shields the chip from the packaging. The third source of alpha particles is the radioactive impurities in the materials within the chip itself. Small traces of thorium and uranium are found in the metal and in the silicon substrate. The alpha-particle emission from these residual radioactive impurities frequently causes soft errors in a memory chip. Unlike the first two sources of alpha particles, the soft errors from residual alpha activity cannot be effectively controlled by using protective films. Several studies are being made on the cell topography [18] and bit-line design [19] that will improve the SER. But so far no effective way has been found to eliminate this residual alpha activity.

The memory plane of a DRAM chip is the part most sensitive to upset by alpha-particle hits. The peripheral logic and decoder are usually robust to alpha particles, and very rarely contribute to soft errors. Moreover, such errors can be easily corrected by data retry, since the transient faults in such circuits are usually combinational. The two most sensitive structures in a memory plane are the bit lines and the storage cells. The failure mechanisms in these structures are conceptually different, and they will be called here the memory-cell mode upset and the bit-line mode upset. In Section II-B, it is pointed out that the memory-cell mode upset, caused by noise electrons flowing into one or more storage capacitors, usually results in a sequential error, which cannot be corrected by data retry. Suitable error-correcting coding circuits are designed in this paper to eliminate these soft errors. The bit-line mode upset is caused by electron-hole pairs impinging on a floating bit line in a read cycle. Such a failure mode usually results in a combinational fault, where a read operation is faulty because of an incorrect charge level on a bit line. The SER from the bit-line failure mode is inversely proportional to the read cycle time, but the SER due to the memory-cell mode failure is independent of the memory cycle time.

A. Bit-Line Mode Soft Error

A functional knowledge of the memory cycle is needed to understand the bit-line mode soft error. A typical organization of a DRAM chip utilizes differential amplifiers for sensing and the partitioning of each array into two identical subarrays, as shown in Fig. 1. The bit line $B_i$ is split into a right half $B_i^R$ and a left half $B_i^L$. Data are stored in capacitors at all the crosspoints of the different bit lines and word lines. A memory cell typically consists of a storage capacitor in series with an access transistor, which is selected by the word line connected to its gate. The content of a memory cell can be read by sensing its stored charge through the bit line and the differential amplifier. A read cycle consists of two distinct phases: the precharge phase and the sensing phase. During the precharge phase, both the left and right halves of the bit lines are charged to a predetermined level, and after precharging is over both bit lines remain in a floating state. During the sensing phase, charge sharing occurs between the selected cell and the bit line, and the sense amplifier differentiates the voltage level between the two bit lines to determine whether the selected cell contains a 0 or a 1.

In Fig. 1, in the precharge phase, both the bit-line halves $B_i^R$ and $B_i^L$ are charged to a predetermined value $V_p$, which may typically be near half the supply voltage. Each half of the bit line contains a reference cell having a fixed charge $Q_R$. A cell is said to contain a 1 if its capacitor stores a charge $Q_C \ge Q_R$, and it is said to contain a 0 if $Q_C < Q_R$. When a cell on $B_i^R$ ($B_i^L$) is selected for a READ operation, the word line $W_L$ ($W_R$) is activated simultaneously to select the reference cell on $B_i^L$ ($B_i^R$). In the sensing phase, when the reference cell is selected, charge sharing occurs between $B_i^L$ ($B_i^R$) and the reference cell, and the new $B_i^L$ ($B_i^R$) voltage is $V^* \approx V_p$. Similarly, in the right-half (left-half) bit line, charge sharing occurs between the selected cell and $B_i^R$ ($B_i^L$), which consequently assumes a new voltage $V_R$ ($V_L$). The differential amplifier senses the voltage difference $\Delta V = V_R - V^*$ ($V_L - V^*$) to read the value of the selected cell. If $\Delta V$ is positive (negative) and its magnitude is greater than the differential threshold voltage $V_{th}$, then the selected cell is recognized to hold a 1 (0). This is illustrated in Fig. 2(a). If the magnitude of $\Delta V$ is smaller than $V_{th}$, the selected cell may be read incorrectly by the sense amplifier.

The bit line is vulnerable to soft errors during the sensing phase, particularly in the interval from the start of the sensing phase to sense-amplifier latch-up, when the bit line remains floating. During the sensing phase, if an alpha particle strikes the bit line containing the reference cell, its charge may reduce, resulting in a new reference voltage $V^{**} \ll V^*$, as shown in Fig. 2(c). If the selected cell contains a 0, the magnitude of $\Delta V$ may then become less than $V_{th}$, resulting in a faulty read operation. On the other hand, during the sensing phase, if the alpha particle strikes the bit line containing the selected cell and that cell contains a 1, the bit-line voltage may degrade such that $\Delta V < V_{th}$. The resulting read operation may be erroneous, as shown in Fig. 2(b).

The cell topography plays an important role in deciding how many cells can fail from the incidence of a single alpha particle. In a planar implementation of the storage capacitor, the track of an alpha particle is usually restricted to the boundary of a single cell, and the resulting fault causes a single-event upset. Toyabe et al. [19] solved three-dimensional diffusion equations for alpha-particle-induced electrons to analyze the SER for the memory-cell mode upset. This SER is directly proportional to the effective area ($\sigma$) of the memory cell and the incident alpha flux density ($\phi$).
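The sensing decision of Section II-A can be illustrated with a minimal Python sketch. The function name `sense` and all voltage values are hypothetical and chosen only for illustration; the sketch simply compares the differential voltage $\Delta V$ against the threshold $V_{th}$ and flags reads for which the margin is lost, as may happen after an alpha strike on a floating bit line.

```python
def sense(v_cell_line: float, v_ref_line: float, v_th: float) -> str:
    """Classify a read from the two bit-line half voltages.

    v_cell_line: voltage on the half containing the selected cell (V_R or V_L)
    v_ref_line:  voltage on the half containing the reference cell (V*)
    v_th:        differential threshold of the sense amplifier
    """
    delta_v = v_cell_line - v_ref_line
    if delta_v > v_th:
        return "1"
    if delta_v < -v_th:
        return "0"
    # |delta_v| is below the threshold: the read may be erroneous
    return "unreliable"


# Hypothetical voltages (in volts), not taken from the paper.
V_STAR = 2.5   # reference-side voltage, close to the precharge level V_p
V_TH = 0.1     # differential threshold

print(sense(2.7, V_STAR, V_TH))   # undisturbed cell storing a 1
print(sense(2.3, V_STAR, V_TH))   # undisturbed cell storing a 0
print(sense(2.55, V_STAR, V_TH))  # stored 1, alpha strike on the selected-cell line
print(sense(2.3, 2.35, V_TH))     # stored 0, alpha strike lowers the reference line
```

The last two calls correspond to the situations of Fig. 2(b) and Fig. 2(c), respectively: in both, the differential voltage no longer exceeds the threshold.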
B. Memory-Cell Mode Soft Error
As the memory size is quadrupling, the cell dimension and the storage capacitor value are reducing by a factor of 2 [20]. The present three-dimensional DRAM employs a deep trench capacitor that extends from the planarized surface through the n-well into the p+ substrate, as shown in Fig. 3. The capacitance is typically 30 fF, and the capacitor is highly susceptible to being discharged by noise electrons. The memory-cell mode soft errors can be principally classified into two types: single-cell upsets and double-cell upsets. In a single-cell upset, an alpha particle strikes a single capacitor, discharging it alone. On the other hand, if the alpha particle strikes the intervening space between two adjoining capacitors, a plasma discharge may occur between the capacitors, altering the charge level in both of them. Such soft errors are known as double-cell upsets, and are very common in a three-dimensional DRAM chip. If $\sigma$ is the effective area of a cell and $\delta$ is the intercellular distance, then the probability of an alpha particle striking the intervening space is proportional to $\delta/\sigma$. Thus a large number of alpha particles may cause double-bit errors if they have sufficient kinetic energy to discharge both capacitors. The Monte Carlo simulation done by Sai-Halasz et al. [15] has revealed that, as the feature width and the critical charge in the storage capacitor decrease, the double-bit soft errors dominate over the single-bit errors. It may be emphasized that the earlier soft-error analysis was based on a single-event upset where the storage capacitors were planarized. The conventional on-chip and system-level error-correcting circuits are, therefore, of the SEC/DED type, which fail to correct the above double-cell upsets.
The objective of this paper is to design a new double-error-correcting (DEC) coding circuit that can be integrated in a DRAM chip to reduce the SER and to improve the reliability of the memory system. It may be noted that in order for an error-correcting circuit (ECC) to be used for soft-error correction in DRAM cells with trench-type capacitors, it should satisfy the following requirements:
It [21] organized the cells in a DRAM chip into a number of blocks, and compared the block parities to correct a single-bit soft error. Mazumder and Patel [22] used a similar technique over a two-level memory system, and showed that the soft-error rate improved by a factor of $10^6$. They used a parallel-signature analyzer (PSA) to test the chip in parallel, and reconfigured the PSA to detect the soft error. Nippon Telegraph and Telephone (NTT) [3] used a product code to correct a single-bit error in their experimental 16-Mb DRAM chip.
The main limitation of the product and Hamming codes is that they fail to correct double-bit errors, and in a three-dimensional DRAM where the double-bit soft errors are relatively common, these codes are inadequate for on-chip ECC applications. The conventional double-bit error-correcting codes such as Bose-Chaudhuri-Hocquenghem (BCH) [23], Reed-Solomon [24], and Golay [23] cannot be readily applied to correct double-bit errors in a DRAM chip. These codes are frequently used in digital communications to correct $t$-bit ($t \ge 1$) errors. The encoding and decoding circuits of these codes employ multibit linear feedback shift registers (LFSRs) which, if used in a DRAM chip, will introduce very high access delay. By concatenating finite projective geometry codes (PGC) [23], a linear code can be constructed that can correct double-bit soft errors in a memory chip [25]. The main problem with this coding technique is that it divides an $n$-bit memory into $n^{1/3}$ subarrays, i.e., a 16-Mb DRAM will be organized into 256 subarrays (as discussed in Appendix A). This is an unrealistic scheme because it will introduce very high decoder routing complexity, and it will also increase the chip area considerably. The practical 16-Mb DRAM manufactured by NTT employs only 8 partitions. Another problem with this code is that it can correct only two errors in the entire memory, and cannot correct faults such as a bit line that is stuck-at or shorted/open because of a defective sense amplifier.
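As a quick check of the partition count quoted above, taking a 16-Mb memory as $2^{24}$ bits:

```latex
\[
n = 16\ \text{Mb} = 2^{24}\ \text{bits}, \qquad
n^{1/3} = 2^{24/3} = 2^{8} = 256 \ \text{subarrays},
\]
```

which is 32 times the 8 partitions used in the NTT chip mentioned in the text.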
An efficient code is proposed in this section that can be easily implemented within the framework of high-density memories. One of the constraints on the code is that its encoding and decoding circuit should be compatible with the low intercellular pitch width. A rectangular product code, which can correct only a single-bit error, is known to satisfy these constraints, and many commercial chip manufacturers have successfully integrated the ECC based on the product code in 4-Mb and 16-Mb DRAM chips [26]. In this section, a coding scheme, called here an augmented product code, is constructed by adding a set of diagonal parity bits to the conventional product code, which uses vertical and horizontal parity bits. Like the product code, the proposed code has simple encoding and decoding circuits that match the pitch-width constraint of a DRAM chip; it automatically corrects the selected cell if it is faulty, and it generates error flags for diagnosing the second faulty bit in case a double-bit soft error occurs. The rest of the section describes an effective way of organizing the product code in each word line within the DRAM chip, and then how to construct an augmented product code from the rectangular organization of the product code.

¹The time between the occurrence of a fault and its manifestation as a malfunction is defined as the fault latency.
A. The Conventional Rectangular Product Code
Let a DRAM chip with $n$ $(= sm^2)$ information or data bits be organized into $s$ subarrays, each of size $m_1 \times m_2$ (i.e., each subarray has $m_2$ memory cells in each of its $m_1$ word lines). The DRAM is called nonredundant if $m_1 m_2 = m^2$. ... equality is tested; if the second equality is not true for $i = t$, then $c^k_{tq+j}$ is complemented to correct the single-bit error. Similarly, if only the second equality is not true, then the first equality is tested for $0 \le j \le q-1$; $c^k_{iq+t}$ is complemented if the first equality is not true for $j = t$. Thus all single-bit errors can be corrected. If there are two errors in a word line, then the above scheme can incorrectly complement a cell. Hence, the product code fails to perform if a double-bit memory-cell upset occurs. In order to write data into the cell $c^k_{iq+j}$, its content is first checked by a read operation and compared with the input data.

A typical implementation of a product code, PC(9, 7), is shown in Fig. 5, where to each word line, having 9 information bits, 7 parity bits are added so that the overall word-line size is 16 bits. These 16 bits are organized into a logical $4 \times 4$ array. The parity bit of the $i$th row is stored in the cell denoted by $H_i$, and the parity bit of the $j$th column is stored in the cell denoted by $V_j$. The overall parity of the word line is stored
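The behavior of the rectangular product code can be illustrated with a minimal Python sketch of PC(9, 7). It is not the read-time procedure of the paper: it decodes a whole word line at once, the function names are hypothetical, and the overall parity is assumed to cover the nine information bits. It shows that a single flipped data bit is located by the one failing row check and the one failing column check, whereas two different double-bit errors can produce identical failing checks and therefore cannot be told apart.

```python
from typing import Dict, List, Tuple

P = Q = 3  # 3 x 3 information bits per word line, as in PC(9, 7)


def encode(data: List[List[int]]) -> Dict[str, object]:
    """Row parities H_i, column parities V_j, and an overall parity bit."""
    h = [sum(row) % 2 for row in data]
    v = [sum(data[i][j] for i in range(P)) % 2 for j in range(Q)]
    overall = sum(h) % 2  # assumed: parity over all information bits
    return {"H": h, "V": v, "T": overall}


def failing_checks(data: List[List[int]], parity) -> Tuple[List[int], List[int]]:
    """Indices of the row checks and column checks that do not hold."""
    rows = [i for i in range(P) if sum(data[i]) % 2 != parity["H"][i]]
    cols = [j for j in range(Q)
            if sum(data[i][j] for i in range(P)) % 2 != parity["V"][j]]
    return rows, cols


def correct_single(data: List[List[int]], parity) -> bool:
    """Flip the bit at the intersection when exactly one row and one column fail."""
    rows, cols = failing_checks(data, parity)
    if len(rows) == 1 and len(cols) == 1:
        data[rows[0]][cols[0]] ^= 1
        return True
    return False


original = [[1, 0, 1], [0, 1, 1], [0, 0, 1]]
parity = encode(original)

single = [row[:] for row in original]
single[1][2] ^= 1                       # one soft error
print(correct_single(single, parity), single == original)

double_a = [row[:] for row in original]
double_a[0][0] ^= 1
double_a[1][1] ^= 1                     # two errors on one diagonal
double_b = [row[:] for row in original]
double_b[0][1] ^= 1
double_b[1][0] ^= 1                     # two errors on the other diagonal
print(failing_checks(double_a, parity) == failing_checks(double_b, parity))
```

The ambiguity in the last line is the reason additional, independent parity groups are needed to correct double-bit errors.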
B. The Proposed Code: An Augmented Product Code
In order to correct double errors in a codeword, an augmented product code (APC) is constructed by adding a set of $p$ diagonal parity bits to the $p + q + 1$ rectangular parity bits of the product code. It must be emphasized that the proposed code can detect and correct all single- and double-bit errors in the parity bits and information bits from the error syndrome $(H(i), V(j), D(t), \Pi)$, where the symbols represent error flags for the horizontal, vertical, diagonal, and overall groups of parities and information bits, and will be explained later. The overall parity bit $\pi$ is computed over the $pq$ information bits as $\pi = \bigoplus_{i=0}^{p-1} \bigoplus_{j=0}^{q-1} b(i,j)$. The diagonal parity bit corresponding to the data bit $b(i,j)$ is computed by reading out the data bits along a cyclic path through the array.

For an $n$-bit DRAM organized into $s$ square subarrays, the proposed code adds $2p + q + 1$ parity bits to the $pq$ information bits of each word line. Since there are four parity check bits corresponding to each data bit, the APC($pq$, $2p + q + 1$) is a distance-five code, and can detect and/or correct all double-bit errors in the codeword, including the parity bits. The different error syndromes are listed in Table I, from which it can be seen that double errors can be corrected from the error patterns.
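The distance-five property of the augmented product code can be checked numerically for the $3 \times 3$ case. The following Python sketch is an illustration under stated assumptions, not the decoding circuit of the paper: the diagonal group of $b(i,j)$ is assumed to be $(j - i) \bmod p$, the overall parity is taken over the information bits only, and double errors are corrected with a brute-force syndrome table rather than with the error flags of Table I. All function names are hypothetical.

```python
from itertools import combinations, product

P = 3                      # square array, p = q = 3
K = P * P                  # 9 information bits
N = K + 3 * P + 1          # 19-bit codeword: data, H, V, D, overall parity


def encode(data):
    """Return the codeword for 9 row-major data bits."""
    b = [data[i * P:(i + 1) * P] for i in range(P)]
    h = [sum(b[i]) % 2 for i in range(P)]                        # row parities
    v = [sum(b[i][j] for i in range(P)) % 2 for j in range(P)]   # column parities
    d = [sum(b[i][j] for i in range(P) for j in range(P)
             if (j - i) % P == t) % 2 for t in range(P)]         # assumed diagonals
    pi = sum(data) % 2                                           # overall parity
    return tuple(data) + tuple(h) + tuple(v) + tuple(d) + (pi,)


def syndrome(word):
    """XOR of the stored parity bits with parities recomputed from the data."""
    recomputed = encode(word[:K])
    return tuple(x ^ y for x, y in zip(recomputed[K:], word[K:]))


def min_distance():
    """Minimum weight over all nonzero codewords (the code is linear)."""
    return min(sum(encode(d)) for d in product((0, 1), repeat=K) if any(d))


def build_table():
    """Map the syndrome of every single- and double-bit error to its positions."""
    table = {}
    for weight in (1, 2):
        for positions in combinations(range(N), weight):
            err = [0] * N
            for k in positions:
                err[k] = 1
            s = syndrome(tuple(err))
            assert s not in table   # holds because the distance is five
            table[s] = positions
    return table


def correct(word, table):
    """Correct up to two bit errors anywhere in the 19-bit word."""
    s = syndrome(word)
    if not any(s):
        return tuple(word)
    positions = table.get(s)
    if positions is None:
        raise ValueError("more than two errors: not correctable by this table")
    fixed = list(word)
    for k in positions:
        fixed[k] ^= 1
    return tuple(fixed)


print(min_distance())
```

With the table built once, `correct(word, build_table())` restores any codeword in which at most two of the 19 bits, data or parity, have been flipped.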

IV. AN IMPLEMENTATION OF THE AUGMENTED PRODUCT CODE IN A DRAM CHIP
The APC can be easily implemented within a DRAM chip. A typical implementation of a DRAM chip with $m$-bit-wide word lines will contain $3\sqrt{m} + 1$ redundant bits per word line.
A DRAM where each word line has 9 data bits (organized into a $3 \times 3$ matrix) and 10 parity bits is shown in Fig. 6. From Table I it is evident that the different error lines can be expressed as Boolean functions of the parity bits.

One of the primary concerns the memory manufacturers have is that on-chip ECC tends to increase the memory cycle time and requires extra silicon area; in addition, the ECC circuit introduces a severe layout problem since it is less regular than the memory plane. In order to justify the use of an on-chip ECC circuit, it is necessary to examine the overhead in silicon area and timing, as opposed to the payoffs in increased reliability. In this section a detailed analysis is done to estimate the area and timing overhead, and in the next section the improvements in soft-error rate (SER) and mean time to failure (MTTF) are calculated to show the effectiveness of the proposed ECC.
Complement $b(i,j)$, $H(i)$, $V(j)$, $D(t)$, and $\Pi$.
A. Area Overhead
The proposed ECC circuit consists of multiplexers, parity trees, additional sense amplifiers, and encoders. Assume that the $k$ information bits per word line in a DRAM chip are organized into a $\sqrt{k} \times \sqrt{k}$ array with an additional $3\sqrt{k} + 1$ bits corresponding to the horizontal, vertical, and diagonal parity bits, and an overall parity bit. Each word line, therefore, consists of $(k + 3\sqrt{k} + 1)$ bits, which can be grouped into three categories as follows. It may be noted that $2\sqrt{k}$ multiplexers are required for category G1, and one each for categories G2 and G3, and also for calculating X as shown in Fig. 8. Similarly, altogether $\sqrt{k} + 2$ parity trees are required to calculate the parity bits: $\sqrt{k}$ for category G1, and one each for calculating Y and D (none for G2 and G3).
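The word-line organization described above can be tabulated with a short Python sketch. The function name and the returned field names are hypothetical; the only formula used is the $k + 3\sqrt{k} + 1$ word-line width stated in the text, with $k$ assumed to be a perfect square.

```python
import math


def word_line_layout(k: int) -> dict:
    """Bits per word line for k information bits arranged as sqrt(k) x sqrt(k)."""
    root = math.isqrt(k)
    if root * root != k:
        raise ValueError("k must be a perfect square")
    parity_bits = 3 * root + 1          # horizontal + vertical + diagonal + overall
    return {
        "array_side": root,
        "parity_bits": parity_bits,
        "total_bits": k + parity_bits,
        "redundancy": parity_bits / k,  # parity bits per information bit
    }


for k in (9, 64, 1024):
    print(k, word_line_layout(k))
```

For $k = 9$ this reproduces the 9 data bits and 10 parity bits of the example word line, and it shows how the relative redundancy falls as the word line becomes wider.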
It may be noted that each multiplexer is of the $(\sqrt{k} + 1)$-to-1 type, which can be constructed hierarchically by using only 2-to-1 muxes, each occupying a fixed unit area. If a multiplexer tree is built using 2-to-1 multiplexers, then the area $A_{mux}$ of each $(\sqrt{k} + 1)$-to-1 mux follows from the number of 2-to-1 multiplexers in the tree.

In addition to muxes and parity trees, additional memory cells for the parity bits and the associated sense amplifiers will be required. In addition to the circuitry discussed above, two encoders are required for encoding $(\sqrt{k} + 1)$ bits into $\log(\sqrt{k} + 1)$ bits, which locates the bit line containing an erroneous bit if a double-bit error occurs. Each of these encoders can be assumed to have an area of $a_{enc}$.

One factor that has not been considered in the above discussion is the increase in the area of the bit-line decoder. Since the number of bits per word has increased, the area of the bit-line decoder is also greater, but this increase in area is not very significant. Thus, the total increase in area, denoted by $A_{overhead}$, is given by the sum of the increases in area due to the various factors mentioned above.
The area required by a DRAM of $k_w$ words of $k$ bits of information each, without any error-correcting circuitry, is given by
In both expressions given above the area required by the row and column decoding circuitry should be added.
Using the above equations, the area overhead for DRAM chips with ECC that can correct double errors has been calculated for different chip sizes. In Fig. 8, the overhead of a DRAM with the proposed ECC is compared with that of chips with no ECC and with SEC-type ECC. It can be seen that the proposed DEC-type code requires about twice the area of the SEC-type product code when the chip size is 4 Mb or more, but this can be justified by the improvement (over SEC) in storage reliability by an order of magnitude, as discussed in Section VI-A. For a 16-Mb DRAM chip with 16 partitions, the chip area overhead due to the proposed augmented code is about 8% of the overall chip area. This is somewhat close to the values obtained by Yamada [3], who designed 16-Mb DRAMs with an on-chip single-error-correcting circuit. The ECC layout should be carefully done to optimize the DRAM access delay and ECC area. Access delay can be reduced by designing a selector-merged ECC circuit where transmission parity checkers and selectors, which select data from cells belonging to the appropriate parity groups, are arranged in column circuits without long bus lines.
B. Timing Overhead
The presence of ECC and the procedures P1 and P2 increase the length of average memory cycle. The various steps in reading a cell and the corresponding time delays involved are given below.
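Procedure Read:
1) Decode the bit line: $t_{\text{decode-bit}}$.
2) Precharge the bit line: $t_{\text{precharge}}$.
3) Decode the word line: $t_{\text{decode-word}}$.

The remaining steps of the procedure account for the delays $t_{\text{word-enable}}$, $t_{\text{read-cell}}$, $t_{\text{mux}}$, $t_{\text{xor-tree}}$, and $4t_{\text{gate}}$, where $t_{\text{mux}}$ is the delay involved in passing the signal through a multiplexer, $t_{\text{xor-tree}}$ is the delay involved in passing the signal through the EXOR trees, and $4t_{\text{gate}}$ is the delay corresponding to the circuit in Fig. XX. Assume that $T_r$ is the time required for reading a cell using the procedure READ. If $w$ is the probability of a write operation, then the average memory cycle time will be given by $T_{ECC} = T_r + w\,t_{\text{write-cell}}$, where $T_r = (t_{\text{decode-bit}} + t_{\text{precharge}} + t_{\text{decode-word}} + t_{\text{word-enable}}) + t_{\text{read-cell}} + t_{\text{mux}} + t_{\text{xor-tree}} + 4t_{\text{gate}}$.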
The average memory cycle time required for a memory without any error-correction circuitry is $T = (t_{\text{decode-bit}} + t_{\text{decode-word}} + t_{\text{word-enable}}) + \cdots$
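The cycle-time expression for the memory with ECC can be evaluated with a small Python sketch. The dictionary keys and all delay values below are hypothetical placeholders, not figures from the paper; the sketch only adds up the terms of $T_r$ named in the text (with the gate term taken as $4t_{\text{gate}}$) and weights the write delay by the write probability $w$.

```python
def read_time(d: dict) -> float:
    """T_r: time to read a cell through the ECC path."""
    return (d["decode_bit"] + d["precharge"] + d["decode_word"] + d["word_enable"]
            + d["read_cell"] + d["mux"] + d["xor_tree"] + 4 * d["gate"])


def average_cycle_with_ecc(d: dict, w: float) -> float:
    """T_ECC = T_r + w * t_write_cell, since every write is preceded by a read."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w is a probability and must lie in [0, 1]")
    return read_time(d) + w * d["write_cell"]


# Hypothetical delays in nanoseconds, for illustration only.
delays = {
    "decode_bit": 2, "precharge": 1, "decode_word": 2, "word_enable": 3,
    "read_cell": 10, "mux": 2, "xor_tree": 3, "gate": 1, "write_cell": 12,
}

for w in (0.2, 0.5, 0.8):
    print(w, average_cycle_with_ecc(delays, w))
```

Because the read path is traversed on every access, only the last term changes with the write probability.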
The detailed computations of the time required for various steps in reading a cell are shown below.
1) Delay in decoding the word-line address, $t_{\text{decode-word}}$: here the structure constant has the value 2.5 for an inverter ratio of 4, $k_w$ is the number of words, $\tau$ is a gate delay, $f$ is the stage factor for the driving inverters, and $R_a$ and $C_a$ represent the resistance and capacitance across the address decoder for word-line decoding, respectively.

2) Delay in decoding a bit line, $t_{\text{decode-bit}}$: here $k_b$ is the number of bit lines, and the resistance and capacitance are those across the address decoder for bit-line decoding.

3) Delay in precharging a bit line, $t_{\text{precharge}}$: here $R_o$ is the output resistance of the driver for the bit line and $C_{\text{bit-line}}$ is the capacitance of the bit line. This delay is very small.

4) Word-line enable delay, $t_{\text{word-enable}}$: this depends on the resistance and the capacitance of the word line and on the resistance of the transistor driving the word line.

5) Delay in reading a selected cell, $t_{\text{read-cell}}$: here $R_t$ is the resistance of the transistor in a memory cell through which $C_{\text{cell}}$ is charged. Other terms are as explained above.

6) Delay in writing into a cell, $t_{\text{write-cell}}$: the expression is the same as that for $t_{\text{read-cell}}$, except that the output resistance $R_o$ of the driver for a bit line is added to the resistance.

7) Delay in passing through a multiplexer, $t_{\text{mux}}$: the unit delay is that of a 2-to-1 multiplexer, which is equivalent to the delay through two NAND gates. Here $k$ is the number of information bits per word, and the number of bit lines $k_b$ is given by $k_b = k + 3\sqrt{k} + 1$.

8) Delay in passing through the EX-OR parity tree, $t_{\text{xor-tree}}$: the unit delay is that of passing through an EX-OR gate.

For different memory sizes, the timing overhead of the proposed ECC is compared in Fig. 9 with that of an ECC that can correct a single error (SEC). For multimegabit DRAMs, this overhead is between 14% and 18% of the access time required by DRAM chips with SEC-type ECC, and it is virtually independent of the write probability if $0.2 \le w \le 0.8$. The time overhead of the proposed ECC is also compared with nonredundant DRAMs of various sizes. It may be noted that in the proposed architecture, the read-before-write operation and the additional ECC circuit increase the memory cycle time, but the increase is within 80% for a 16-Mb DRAM, as shown in Fig. 10. A typical delay of a 16-Mb DRAM with the proposed ECC was observed to be 63 ns, as opposed to about 35 ns without any ECC.
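The 80% figure follows directly from the two delays quoted for the 16-Mb DRAM; the symbols $T_{\mathrm{ECC}}$ and $T_{\mathrm{no\,ECC}}$ below are labels introduced here for those two delays:

```latex
\[
\frac{T_{\mathrm{ECC}} - T_{\mathrm{no\,ECC}}}{T_{\mathrm{no\,ECC}}}
  = \frac{63\ \text{ns} - 35\ \text{ns}}{35\ \text{ns}}
  = \frac{28}{35} = 0.8 .
\]
```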
VI. RELIABILITY AND SOFT-ERROR PATTERN ANALYSIS
A. Reliability Modeling of the APC
In this section, the alpha-particle-induced soft-error rate (SER) in a DRAM chip with the proposed error-correcting mechanism has been analyzed and compared with that of a nonredundant DRAM without any error correction mechanism. In order to compute SER, the following assumptions are made.
Assumption 1: In practice, the intermittent faults in a DRAM result from alpha particles and sporadic process-related leakage currents that vary with temperature, static noise, noise pulses on the power supply, the data pattern in the memory, and so on. These faults manifest themselves randomly and can be represented by Poisson statistics with a mean rate of $\lambda = \lambda_\alpha + \lambda'$, where $\lambda_\alpha$ is caused by alpha particles, and $\lambda'$ results from other intermittent faults. In a well-designed memory the soft errors are usually predominantly caused by alpha particles and, therefore, $\lambda_\alpha \gg \lambda'$.

Assumption 4: The kinetic energy of the alpha particle is always sufficient to generate a single-bit error or double-bit soft errors. If the track of an alpha particle is limited to a single cell, a single-cell upset always occurs. If the track of an alpha particle permeates over two cells, or it hits between two cells, a double-cell upset occurs.
Let $p(t)$ be the probability that no soft error occurs in a memory cell within a time interval of $[0, t]$. The reliability of an $n$ $(= sm^2)$-bit DRAM chip that can tolerate at most a double-bit failure in a word line can be expressed in terms of $p(t)$. The reliability (Fig. 11) and the SER of the proposed scheme are plotted, and they are compared with those of a simplex system without any error-correcting mechanism. From the graph of SER versus alpha-particle flux density shown in Fig. 11, it can be seen that the SER improvement factor is more than $10^6$ for a square memory array of size 4 Mb when the alpha-particle flux density is 1/cm²/h.
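A reliability model of this kind can be sketched in Python as follows. This is an assumed, simplified model and not necessarily the expression used in the paper: cells are taken to fail independently with $p(t) = e^{-\lambda t}$ (which ignores the correlation between the two cells of a double-cell upset), each of the $sm$ word lines holds $m$ information bits plus $3\sqrt{m} + 1$ parity bits, and a word line survives if at most two of its cells are in error. The function names and the numerical parameters are hypothetical.

```python
import math


def word_line_reliability(p_cell: float, width: int, max_errors: int = 2) -> float:
    """Probability that at most max_errors of the width cells are in error."""
    q = 1.0 - p_cell
    return sum(math.comb(width, i) * p_cell ** (width - i) * q ** i
               for i in range(max_errors + 1))


def chip_reliability(lam: float, t: float, s: int, m: int, ecc: bool = True) -> float:
    """Reliability of s subarrays, each with m word lines of m information bits.

    lam: per-cell soft-error rate (Poisson mean rate), t: elapsed time.
    """
    p = math.exp(-lam * t)              # probability a cell sees no soft error
    if ecc:
        root = math.isqrt(m)
        if root * root != m:
            raise ValueError("m must be a perfect square")
        r_line = word_line_reliability(p, m + 3 * root + 1, max_errors=2)
    else:
        r_line = p ** m                 # every cell must be error-free
    return r_line ** (s * m)


# Hypothetical parameters, for illustration only.
print(chip_reliability(1e-6, 1000.0, s=4, m=16, ecc=False))
print(chip_reliability(1e-6, 1000.0, s=4, m=16, ecc=True))
```

Even in this simplified form, the model shows the qualitative effect reported in Fig. 11: the chip without ECC fails when any cell is upset, whereas the chip with the double-error-correcting code fails only when three or more upsets fall on the same word line.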
B. Error-Pattern Analysis
In addition to correcting soft errors, the built-in ECC is capable of masking the fabrication-related hard faults, and many researchers [3] have analyzed the manufacturing yield improvement obtained by using ECC circuits. The resulting chip becomes partially nonredundant or fully nonredundant depending on how many hard faults are reconfigured during fabrication time. The potential problem of using such double-bit error-correcting circuits for improving the yield is that they cannot tolerate more than two hard errors per memory word line. Thus these error-correcting techniques are not adequate for common fabrication faults such as a faulty word-line driver, where the entire word line may be defective. The conventional row and column redundancy can be very effectively used to improve the yield for the fault-tolerant memory. The overhead associated with the error-correcting circuit is so high that it will be grossly underutilized if it is used exclusively to improve the yield. The resulting nonredundant DRAM will be intolerant of soft errors, and its access time will be larger than that of the normal redundant DRAM.
It may be noted that the proposed error-correcting code can tolerate as many as two errors per word line. Thus in an $n$-bit DRAM organized into $s$ square subarrays, the code can correct as many as $2\sqrt{sn}$ errors if no more than two errors occur per word line. Although in this paper we have considered up to two errors per word line, in high-density DRAM multiple-bit errors may occur with very low probability. By using Monte Carlo simulation, Sai-Halasz et al. [15] showed that a large number of soft errors in the memory are one or two bits, and in a few instances a number of bits may be erroneous. It may be pointed out that the multiple errors occur randomly in the event of a cosmic shower, and the proposed error-correcting circuit will be able to correct the multiple soft errors as long as no more than two bits are erroneous in a word line. The probability of correcting the different multiple-bit soft errors can be estimated by combinatorial analysis.

Three soft errors, for example, can be distributed over the word lines in two correctable error patterns, and the number of ways in which each pattern can occur is a product of binomial coefficients. The probability of a correctable three-bit error is the ratio of this count to the total number of ways three soft errors may occur in the memory, and the probability of a correctable four-bit error is obtained in the same way. Finally, in order to find the utility of the code, it is necessary to determine the probability that the code can correct all the errors knowing that at most $k$ errors have occurred. For $k$ equal to twice the number of word lines, this probability can be shown to satisfy an inequality.
In Fig. 12, the probability that the proposed code will be able to correct multiple-bit soft errors is plotted for different numbers $k$ of erroneous bits.
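The combinatorial estimate can be cross-checked numerically. The Python sketch below is illustrative only: it assumes a hypothetical array of `n1` word lines with `n2` bits each (names chosen here) and treats an error pattern as correctable when no word line holds more than two errors, which is the condition stated in the text. It does not attempt to reproduce the curves of Fig. 12.

```python
from itertools import combinations
from math import comb


def correctable_count(n1, n2, k):
    """Count k-error patterns with at most two errors on every word line.

    This is the coefficient of x^k in (1 + n2*x + C(n2, 2)*x^2)^n1.
    """
    per_line = [1, n2, comb(n2, 2)]  # ways to place 0, 1 or 2 errors on one line
    ways = [1] + [0] * k             # ways[j]: patterns with j errors so far
    for _ in range(n1):
        new = [0] * (k + 1)
        for j, w in enumerate(ways):
            if w == 0:
                continue
            for e, c in enumerate(per_line):
                if j + e <= k:
                    new[j + e] += w * c
        ways = new
    return ways[k]


def correctable_probability(n1, n2, k):
    """Fraction of all k-error patterns that the assumed rule can correct."""
    return correctable_count(n1, n2, k) / comb(n1 * n2, k)


def brute_force_count(n1, n2, k):
    """Enumerate every k-error pattern (only feasible for tiny arrays)."""
    cells = [(row, col) for row in range(n1) for col in range(n2)]
    count = 0
    for pattern in combinations(cells, k):
        per_row = [0] * n1
        for row, _ in pattern:
            per_row[row] += 1
        if max(per_row) <= 2:
            count += 1
    return count


if __name__ == "__main__":
    for k in range(2, 7):
        print(k, correctable_probability(64, 64, k))
```

For small arrays the closed-form count can be compared against `brute_force_count`; for realistic sizes only `correctable_probability` is practical.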
VII. CONCLUSIONS
This paper discussed the problems of error correction in multimegabit dynamic random-access memory (DRAM) chips. A large number of alpha-particle-induced soft errors manifest themselves as double-bit errors when alpha particles hit the space between two vertically mounted trench capacitors. Conventional SEC/DED codes, such as the product and cubic codes, cannot correct these double-bit/word-line faults. A new augmented product code that employs diagonal parities in addition to horizontal and vertical parities is proposed in this paper to correct the double-bit errors. The proposed code has been compared with the projective geometry code (PGC), which can also correct two-bit errors (as discussed in Appendix A). But unlike the proposed code, the PGC cannot be easily implemented in a multimegabit memory: it requires special decoders that compute over Galois fields, and its double-error-correcting logic is very complex. The proposed code can tolerate up to $2\sqrt{ns}$ soft errors (two per word line in an $n$-bit DRAM organized into $s$ square subarrays), and thereby it can correct the sense amplifier faults. In Table II, the proposed coding scheme is compared with the product code and the projective geometry code. An error-pattern analysis has been done to find the probability of correcting multiple errors that may occur in a DRAM chip. In addition to detecting on-line errors, the error-coding circuit can be utilized to improve the fabrication yield. In a defective memory chip, the faulty cells can be automatically bypassed by the error-correction circuit, and the resulting memory can be used as a nonredundant and non-fault-tolerant memory. But, in practice, it will be a better idea to employ a few extra rows and columns to bypass the fabrication defects, and to utilize the error-correcting circuit exclusively for correcting the field failures and soft errors.
APPENDIX A FINITE PROJECTIVE GEOMETRY CODE
The finite projective geometry code is derived from the concept of projective geometry, which is obtained by adding a hyperplane at infinity to Euclidean geometry [27].
Definition 4: An m-dimensional finite projective geometry PG(m, q) described over the elements of the Galois field GF(q)³ consists of a set of $(q^{m+1} - 1)/(q - 1)$ points together with a set of an equal number of lines such that each line passes over $q + 1$ points. Each point is denoted by a nonzero $(m + 1)$-tuple $(a_0, a_1, \ldots, a_m)$, where $a_i \in GF(q)$, with the rule that $(a_0, a_1, \ldots, a_m)$ and $(\lambda a_0, \lambda a_1, \ldots, \lambda a_m)$ denote the same point for any nonzero $\lambda \in GF(q)$.

It may be noted that in block design, a projective plane is equivalent to a Steiner system $S(2, n + 1, n^2 + n + 1)$, and an affine plane to an $S(2, n, n^2)$, where $n \ge 2$. Before it is explained how these geometries can be utilized to construct double-bit error-correcting codes, Definitions 4 and 5 are illustrated below with an example. A two-dimensional Euclidean geometry EG(2, 2) consists of four points and six lines, as shown in Fig. 13(a), and a projective geometry PG(2, 2) can be obtained from it by adding a hyperplane, the line [100] over the points (001), (011), and (010). The resulting projective geometry has seven points and seven lines, each line passing through three points, as shown in Fig. 13(b). The parallel lines in the Euclidean geometry connecting the points (00) with (01) and (10) with (11) meet at infinity at the point (001). Similarly, the parallel lines connecting the points (10) with (01) and (00) with (11) meet at infinity at the point (011).

³ A finite field of q elements is called a Galois field (after the mathematician Évariste Galois); such a field exists if and only if q is a power of a prime, and for prime q its addition and multiplication are carried out modulo q. For example, in GF(2), the operations correspond to binary EXCLUSIVE OR and AND.
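The PG(2, 2) example can be reproduced with a few lines of Python. This is a minimal sketch under one stated assumption: a line written as [c0 c1 c2] is taken to contain exactly the points (a0 a1 a2) with c0·a0 + c1·a1 + c2·a2 = 0 over GF(2). That convention is not spelled out in the text, but it agrees with the line [100] passing over (001), (011), and (010).

```python
from itertools import product


def pg2_2():
    """Return the points and lines of PG(2, 2) as binary 3-tuples."""
    # Over GF(2) the only nonzero scalar is 1, so every nonzero triple
    # is a distinct projective point.
    points = [p for p in product((0, 1), repeat=3) if any(p)]
    # Assumed convention: line c holds the points p with c . p = 0 (mod 2).
    lines = {
        c: [p for p in points if sum(ci * ai for ci, ai in zip(c, p)) % 2 == 0]
        for c in points
    }
    return points, lines


if __name__ == "__main__":
    points, lines = pg2_2()
    print(len(points), "points,", len(lines), "lines")
    for coeffs, members in lines.items():
        print(coeffs, "->", members)
```

Running the script lists seven lines of three points each, matching the counts quoted for Fig. 13(b).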
Definition 6: A linear concatenation of $q + 1$ projective geometry codes PGC(m, q) is defined as a set of $(q + 1)(q^m + \cdots + 1) = q^{m+1} + 2q^m + \cdots + 1$ symbols from GF(q). The resulting code, denoted by FPH($q^{m+1}$, $2q^m + 2q^{m-1} + \cdots + 1$), is constructed from the connectivity of PG(m, q). The projective geometry PG(m, q) is represented as a bipartite graph $G(P, L, E)$, where $P$ is the set of points in PG(m, q), $L$ is the set of lines in PG(m, q), and $E = \{e = (p, l) \mid p \in P,\ l \in L \text{ such that } p \text{ lies on } l\}$. Each symbol in the codeword is a distinct edge in $E$ of the graph $G$, which is commonly known as the field-plane hexagon [28].
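As an illustration of Definition 6, the sketch below builds the bipartite point-line graph for PG(2, 2) and measures its shortest cycle. It is not the authors' construction: the incidence rule (point p lies on line l when their dot product is zero over GF(2)) and all function names are assumptions made for this example.

```python
from collections import deque
from itertools import product


def build_incidence_graph():
    """Bipartite point-line incidence graph G(P, L, E) of PG(2, 2)."""
    triples = [t for t in product((0, 1), repeat=3) if any(t)]
    edges = [
        (("P", p), ("L", l))
        for p in triples
        for l in triples
        if sum(a * c for a, c in zip(p, l)) % 2 == 0  # assumed: p lies on l
    ]
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return edges, adjacency


def girth(adjacency):
    """Length of the shortest cycle, found by a BFS from every node."""
    best = None
    for root in adjacency:
        dist = {root: 0}
        parent = {root: None}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    parent[v] = u
                    queue.append(v)
                elif parent[u] != v:
                    # Non-tree edge closes a cycle through the root's BFS tree.
                    cycle = dist[u] + dist[v] + 1
                    if best is None or cycle < best:
                        best = cycle
    return best


if __name__ == "__main__":
    edges, adjacency = build_incidence_graph()
    print("edges:", len(edges))
    print("degrees:", sorted({len(n) for n in adjacency.values()}))
    print("shortest cycle:", girth(adjacency))
```

Each of the 21 edges corresponds to one codeword symbol, and the shortest cycle in the graph has length six, which is the hexagon property the code relies on.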
For example, PG(2, 2) in Fig. 13(b) can be represented as a field-plane hexagon of diameter (for the maximum cycle) six, as shown in Fig. 14. The sets of lines and points in Fig. 13(b) are represented by the square and circular nodes, respectively. Every incidence of a point with a line is represented by an edge, and each node has degree three, since in Fig. 13(b) each line passes through three points and three lines meet at each point. In a two-dimensional projective geometry with q symbols, PG(2, q), each node will have a degree of $q + 1$, corresponding to the $q + 1$ points that lie on a line and the $q + 1$ lines that meet at each point. The resulting field-plane hexagon is thus always of diameter six [29]. The 21 edges in Fig. 14 correspond to the linear concatenation of three PGC(2, 2)'s, and the codeword of FPH(8, 13) has 8 data bits and 13 parity check bits. Assuming that all the edges drawn in dark lines represent information bits and the other edges correspond to parity bits, it can be seen that for each information bit there exists a distinct cycle of size six consisting of five parity bits. Thus, FPH(8, 13) represents a code of distance six, and thereby it can correct two errors (or detect three errors) in the codeword. In general, an FPH(2, q) has $m = q^3$ information bits and $2q^2 + 2q + 1$ check bits, and it can correct two errors. The coding efficiency of FPH(2, q), taken as the ratio of check bits to information bits, is $(2q^2 + 2q + 1)/q^3 \approx 2/q + 2/q^2$. In order to find out whether an information bit is erroneous, altogether $2q - 1$ bits are scanned to estimate two parity bits, and by comparing the estimated values with those of the actual bits in the codeword, a single error can be detected and corrected. In order to understand how projective-geometry codes can be utilized for double-error correction in a DRAM, it is necessary to show that any arbitrary information bit and its associated parity bits can be accessed in one memory cycle. In this paper, we discuss a feasible organization in which an information bit can be read in a single memory cycle and corrected if it is erroneous. The entire $m$ information bits in a DRAM will be organized into $(m^{1/3} + 1)$ separate PGC(2, $m^{1/3}$)'s, where each projective geometry will be composed of $m^{1/3} \times m^{1/3}$ information bits and $m^{2/3} + m^{1/3} + 1$ parity bits. Each PGC will be organized into a separate subarray, and altogether $m^{1/3} + 1$ subarrays will be needed. In each subarray $m^{1/3}$ information bits will be accessed together, because they will be on a single word line, and the first parity bit $p_1$ will be generated and compared with the corresponding bit in the codeword. In order to detect whether a particular information bit in the selected word line is faulty, $m^{1/3}$ information bits are selected from the different subarrays, one bit each. These information bits will be identified from the bipartite graph (field-plane hexagon) described by the points and lines of PG(2, $m^{1/3}$).
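The parameter formulas quoted for the FPH code are easy to tabulate. The helper below is a hypothetical convenience, not part of the paper; it follows Definition 6 for the symbol counts and reports the exact ratio of check bits to information bits, which exceeds $2/q + 2/q^2$ by the small term $1/q^3$.

```python
from fractions import Fraction


def fph_parameters(q, m=2):
    """Return (information symbols, check symbols) of the FPH code.

    Information symbols: q^(m+1).
    Check symbols: 2q^m + 2q^(m-1) + ... + 2q + 1.
    """
    info = q ** (m + 1)
    check = 1 + sum(2 * q ** i for i in range(1, m + 1))
    return info, check


def check_to_info_ratio(q):
    """Exact check-bit overhead of FPH(2, q) as a fraction."""
    info, check = fph_parameters(q)
    return Fraction(check, info)


if __name__ == "__main__":
    for q in (2, 4, 8, 16):
        info, check = fph_parameters(q)
        print(q, info, check, float(check_to_info_ratio(q)))
```

For q = 2 the helper returns 8 information bits and 13 check bits, the FPH(8, 13) code of Fig. 14, and the two counts always add up to $(q + 1)(q^2 + q + 1)$.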
A second parity bit $q_1$ will be computed from these $m^{1/3}$ bits and compared with the corresponding bit in the codeword. If both $p_1$ and $q_1$ are erroneous, then the information bit is faulty and will be automatically corrected. The set of $m^{1/3}$ information bits needed to compute $q_1$ is located on different word lines in different subarrays, and thereby $m^{1/3}$ partitions become mandatory in an m-bit (nonredundant only) memory. Each subarray should have a special decoder that will compute over GF($m^{1/3}$) to identify these sets of $m^{1/3}$ bits. In Fig. 15, it is shown how the 64 information bits are organized into four subarrays, and how they are grouped into 16 sets such that cells numbered identically are selected in a single memory cycle for computing the parity bit. The main problem with this coding technique is that it divides an n-bit memory into $n^{1/3}$ subarrays, i.e., a 16-Mb DRAM will be organized into 256 subarrays. This is unrealistic because it will introduce very high decoder routing complexity and will also increase the chip area considerably. The practical 16-Mb DRAM manufactured by the "IT employs only eight partitions. Another problem with this design is that it can correct only two errors in the entire memory, and cannot correct common faults such as a bit line stuck-at or short/open caused by a defective sense amplifier. In Appendix B, a new type of code is proposed to circumvent the above limitations of the projective geometry code.

Fig. 15. Total number of information bits = 64. Total number of parity bits = 49 (not shown).
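The decision rule described above (flip the information bit only when both of its parity checks fail) can be sketched as follows. This is a toy model rather than the PGC decoder: the two index groups, the stored parities, and the function names are all hypothetical, and the groups are assumed to share only the target bit.

```python
def parity(bits):
    """XOR of a sequence of 0/1 values."""
    p = 0
    for b in bits:
        p ^= b
    return p


def correct_bit(data, target, group_p, group_q, stored_p, stored_q):
    """Flip data[target] when both parity checks covering it fail.

    group_p and group_q are index lists that both contain `target`;
    stored_p and stored_q are the parity bits kept in the codeword.
    Returns True if a correction was made.
    """
    p_fails = parity(data[i] for i in group_p) != stored_p
    q_fails = parity(data[i] for i in group_q) != stored_q
    if p_fails and q_fails:
        data[target] ^= 1
        return True
    return False


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0]
    group_p, group_q = [0, 1, 2], [0, 3, 4]
    stored_p = parity(data[i] for i in group_p)
    stored_q = parity(data[i] for i in group_q)
    data[0] ^= 1  # inject a soft error in the shared bit
    print(correct_bit(data, 0, group_p, group_q, stored_p, stored_q), data)
```

An error in a bit covered by only one of the two groups makes a single check fail, so the target bit is left untouched in that case.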
