Abstract - An efficient strategy to utilize a parallel signature analyzer (PSA) for concurrent soft-error correction in DRAMs is described. For a two-level w-bit, n-word memory system, the proposed technique needs only one additional chip, as opposed to $\log_2 w + 2$ in the conventional Hamming code. Such an error-correction circuit (ECC) significantly improves the reliability of the memory system.
I. INTRODUCTION
The concept of fast data compaction by using a parallel signature analyzer (PSA) was originally proposed by Benowitz et al. [1]. Sridhar [2] designed a testable memory architecture incorporating the PSA within the DRAM chip to test several cells on a row (word line) in a single memory cycle, and he demonstrated how to speed up the quadratic run time of the Walking 1's and 0's test procedure to linear run time. In order to test the memory cells in parallel, a test vector was first sequentially scanned into the PSA. The content of the PSA was then written in parallel to multiple cells in the selected word line, and subsequently, when these cells were read in parallel, a signature was generated. To determine whether a DRAM chip is fault-free, the scan-out pin of the PSA (quotient bit) was continuously monitored, and the final signature at the end of the test procedure was verified. Using this test strategy, we have examined several other memory test algorithms and noted that most of the functional test procedures, except Marching algorithms, can be substantially accelerated (as shown in Table I).

The objective of this paper is to demonstrate how to utilize the presence of an on-chip PSA for correcting a single-bit soft error during the normal use of the memory chip. Several strategies [3]-[6] have been proposed in the past to correct soft errors in a memory system using an on-chip error-correction circuit (ECC). This paper demonstrates how to construct an on-chip ECC by reconfiguring the PSA during normal operation into a parity generator which detects the occurrence of a single-bit error.
Manuscript received September 30, 1988; revised December 11, 1989.

The proposed scheme utilizes the two-level organization of a hierarchical memory system, and it requires one parity bit for each row (word line) in the DRAM chip to detect whether any error has occurred within the chip. A single-bit error in the memory system can be corrected by the proposed scheme by adding one extra chip containing the parity information for all memory words. Conventional memory systems use the Hamming code to correct a single-bit error and to detect a double-bit error (SEC/DED). In a two-level memory system with n memory words having w bits/word, altogether w RAM chips of n × 1 b each are used. Thus, the Hamming code requires $\log_2 w + 2$ additional chips to store the error-checking bits in a code word. In a cost-efficient memory system design, the proposed error-correcting scheme saves $\log_2 w + 1$ DRAM chips for a codeword with w information bits. Moreover, in the proposed scheme, whenever a w-bit memory word is read out, $w\sqrt{n}$ DRAM cells in the memory system are sensitized to detect the occurrence of memory-cell upsets, while in the conventional Hamming code technique only w cells in the memory system are sensitized. By sensitizing $w(\sqrt{n} - 1)$ extra memory cells, the proposed technique improves the reliability (MTBF) of the memory system considerably.
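The chip counts and sensitized-cell counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original design: the function names are hypothetical, w is assumed to be a power of two, and each n-bit chip is assumed to be a square array with √n cells per word line.

```python
import math


def hamming_extra_chips(w):
    """Extra chips for SEC/DED Hamming over w data bits: log2(w) + 2.

    Assumes w is a power of two, so that log2(w) is an integer.
    """
    if w < 1 or w & (w - 1):
        raise ValueError("w must be a power of two")
    return (w.bit_length() - 1) + 2


def proposed_extra_chips():
    """The proposed scheme stores all word parities in one extra chip."""
    return 1


def cells_sensitized(w, n):
    """Return (proposed, hamming) cells checked on each w-bit word read.

    Assumes each n-bit chip is a square sqrt(n) x sqrt(n) array, so one
    word line holds sqrt(n) cells.
    """
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    return w * side, w


if __name__ == "__main__":
    w, n = 16, 1 << 20  # 16-bit words, 1-Mb chips (illustrative values)
    saved = hamming_extra_chips(w) - proposed_extra_chips()
    print("chips saved:", saved)
    print("cells sensitized (proposed, Hamming):", cells_sensitized(w, n))
```

For these illustrative values the saving is $\log_2 w + 1$ chips, and the difference between the two sensitized-cell counts is $w(\sqrt{n} - 1)$.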
THE PROPOSED ERROR-CORRECTION TECHNIQUE
A two-level memory organization of a w-bit, n-word memory system is shown in Fig. 1. Table II shows all possible outcomes and the conditions for the occurrence of different types of errors. It can be seen that in row 2, even though the word is error-free, a single-bit soft error is detected, and it should be diagnosed immediately to reduce the soft-error rate. In the conventional system-level Hamming code, this soft error remains latent until the faulty bit is addressed for a READ operation. If, before this faulty bit is read, one more memory cell at the same location in another chip becomes faulty, then the Hamming code will not be able to correct the faulty bit, and the reliability of the memory system will suffer. The proposed scheme can detect two single-bit errors, and it can automatically correct the addressed bit if it is faulty, as illustrated in row 4 of Table II.

[...] line, shown by the dashed line in Fig. 3. Thus, each time a memory reference is made, $(w + 1) \times \sqrt{n}$ cells will be sensitized.
These cells will comprise a plane (X, y, Z) as shown in Fig. 3. If only one chip is completely faulty due to catastrophic failure or a defective chip-select line, then such a fault, denoted by the (X, Y, z) plane, will be detected by the proposed scheme. If the word-line driver within a chip is faulty, then such a fault can be detected by the proposed scheme as long as the shaded plane contains no more than one such defective row. If the sense amplifier or bit line within a chip is defective, then such a fault, denoted by the line (x, Y, z), can also be detected by the proposed scheme.
In the READ mode, the parity bit of the selected level-1 word line is generated and checked with the content of the parity bit cell. Thus the PSA hardware has two functionalities: 1) updating the parity bit and 2) generation of the parity bit, as discussed below.
A. Updating the Parity Bit
The whole memory is first initialized to zero, and subsequently, whenever a transition write is made in the DRAM chip, the parity bit is complemented. In order to ascertain whether a WRITE operation changes the content of a memory cell (i.e., is a transition write), the selected cell is first read and stored in the data-out buffer. The data to be written are available in the data-in buffer and are XORed with the value in the data-out buffer to determine whether the WRITE operation is a transition write. While reading the content of the desired memory cell, the parity bit can be read simultaneously and stored in an additional buffer. The parity bit is toggled whenever a transition write is made. Because of the extra READ operation prior to each WRITE operation, the performance of the memory will be slightly degraded.
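The read-before-write update described above can be illustrated with a small behavioral model. This is a sketch under stated assumptions rather than the circuit itself: the class `ParityRow` and its attribute names are hypothetical, and one object stands for a single word line together with its parity cell.

```python
class ParityRow:
    """Behavioral model of one word line plus its parity cell.

    All cells, including the parity cell, start at zero, matching the
    initialization step described in the text.
    """

    def __init__(self, width):
        self.cells = [0] * width
        self.parity = 0

    def write(self, col, data_in):
        """Write one bit and toggle the parity bit on a transition write."""
        data_out = self.cells[col]        # extra READ into the data-out buffer
        transition = data_in ^ data_out   # 1 only if the stored value changes
        self.cells[col] = data_in
        self.parity ^= transition         # complement parity on a transition
        return transition


if __name__ == "__main__":
    row = ParityRow(8)
    row.write(3, 1)   # transition write: parity toggles
    row.write(3, 1)   # same value again: parity unchanged
    row.write(5, 1)   # second transition: parity toggles back
    print(row.cells, row.parity)
```

Because the parity cell is only toggled on transitions, it always equals the XOR of the stored bits of the row, without the row ever being recomputed on a write.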
B. Generation of the Parity Bit
The parity bit can be generated by utilizing the XOR gates available in the PSA (shown in Fig. 4). In a DRAM, when a memory cell is read, all the bit lines are precharged and then the word line containing the desired cell is selected. The contents of all the cells in the selected word line can be read simultaneously from their respective bit lines, but only the content of the selected bit line is transferred to the data-output buffer. Thus, in the normal mode of operation of the memory, the content of a complete word line can be accessed by the PSA. It can be seen from Fig. 4 that each bit line is connected directly to one input of an XOR gate, while the other input of the XOR gate (in the signature mode) is connected to the output of the preceding XOR gate through the flip-flop of the preceding stage (and, in some cases, through an additional XOR gate which implements the feedback polynomial). In order to generate the parity bit for the word line, the flip-flops and the XOR gates in the feedback path in Fig. 4 should be bypassed.
In order to bypass the XOR gates in the feedback path, points A and B in Fig. 4 are disconnected and an additional switch is introduced, shown by the dotted box. The switch is driven by a signal called TEST. During the test mode, TEST = 1 and points A and B are connected through the switch, as shown in Fig. 4 . During normal operation, when the PSA is reconfigured into ECC, TEST = 0 and the switch connects point B to ground; therefore all the XOR gates in the feedback path will be bypassed.
In order to understand how the flip-flops are bypassed, it is necessary to understand how the PSA circuit is implemented in the memory. Fig. 5 shows a typical MOS implementation of the jth flip-flop stage of the PSA in Fig. 4. The PSA is usually implemented in dynamic logic, where $\phi_1$ and $\phi_2$ are used as two nonoverlapping clocks. During the test mode of the PSA, the signal TEST = 1, and the PSA can be operated in READ or signature mode by setting WRITE = 0 and MODE = 1, and in WRITE mode by setting WRITE = 1 and MODE = 0. In the scan mode, when the test data are serially loaded into the PSA, the signal lines are set to MODE = 0 and WRITE = 0. The flip-flop consists of two inverters, G2 and G3, which are connected back to back in the test mode (when TEST = 1) by the transistor Q7 when clock $\phi_1 = 1$. In the normal mode, the feedback is removed by the signal TEST = 0 and the flip-flop degenerates into a cascade of inverters.

In the signature mode, the signal at bit line $B_j$ is XORed with the content of the preceding flip-flop stage, $FF_{j-1}$, by the pass transistors Q1 and Q2. This value is buffered into inverter G1, and when $\phi_2 = 1$, it is forwarded to the input of the flip-flop for storage. In the scan mode, the value of the preceding flip-flop stage $FF_{j-1}$ is passed directly through the transistor Q4 and stored at the flip-flop $FF_j$. In the WRITE mode of the PSA, the flip-flop is isolated by the transistor Q6, which remains cut off. Transistor Q8 turns on and the value of $FF_j$ is written to a memory cell, routed through the bit line $B_j$.

In the normal mode of DRAM operation, when the content of a memory cell is read, the bit values of all the cells in the corresponding word line of the DRAM appear at the inputs of the PSA. The inverters G2 and G3 simply forward the output $B_j \oplus B_{j-1}$ to the next stage. Thus the PSA forms an XOR cascade and thereby a parity generator, as shown in Fig. 6(a).
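The behavior of the reconfigured PSA in normal mode can be mimicked in software as a chain of XOR operations over the bit lines of the selected word line. The sketch below is a behavioral illustration only: the function names are hypothetical, and the comparison against the stored parity cell follows the READ-mode check described earlier rather than any specific gate-level detail of Fig. 5.

```python
def cascade_parity(bit_lines):
    """XOR cascade: each stage XORs its bit line with the previous output."""
    acc = 0
    for b in bit_lines:
        acc ^= b
    return acc


def row_check(bit_lines, parity_cell):
    """Return 0 if the row agrees with its stored parity, 1 otherwise.

    A result of 1 indicates an odd number of upset cells on the word line
    (including a possible upset of the parity cell itself).
    """
    return cascade_parity(bit_lines) ^ parity_cell


if __name__ == "__main__":
    row = [1, 0, 1, 1, 0, 0, 1, 0]
    stored_parity = cascade_parity(row)   # parity as maintained by writes
    print(row_check(row, stored_parity))  # consistent row

    row[2] ^= 1                           # inject a single-bit soft error
    print(row_check(row, stored_parity))  # the upset is flagged
```

The loop has one XOR per bit line, which is why the delay of the cascade grows with the length of the word line.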
In an n-bit RAM, the parity generator takes $O(\sqrt{n})$ time to detect a single-bit memory upset. This long delay may reduce the effective use of memory cycles. The delay can be improved to $O(\log_2 n)$ by adding extra XOR gates to form a parity tree, as shown in Fig. 6(b). The delay can be further reduced by a constant factor by bypassing transistors Q5 and Q6 and inverters G1 and G2 of Fig. 5 with an extra pass transistor during the normal mode of DRAM operation.

The last bit of the PSA is connected to the bit line of the parity bits. Thus the quotient bit of the PSA, which is available at the scan-out pin of the testable DRAM, indicates whether a single-bit error has occurred within the memory. While reading the content of a memory cell, if all the memory cells on the corresponding word line in the DRAM are correct, then the scan-out pin is at low voltage. If, on the contrary, any of those memory cells is upset, the scan-out pin will be at high voltage. While reading any memory location inside a DRAM, if the scan-out pin voltage is high and the parity bit in level-2 memory indicates an error, then it is known that the addressed memory location is faulty and its bit value should be complemented for error correction. On the other hand, if only the scan-out pin indicates an error and the parity bit of level-2 memory does not indicate any error, then the faulty cell lies elsewhere on the corresponding word line.

If an error is detected and not corrected because it is not known which memory cell is faulty, then the faulty cell must be located either by hardware or by software. Product codes [5], [7] with orthogonal parity, or row and diagonal parity schemes applied over multiple word lines, are not suitable for memory applications.
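The read-time decision that combines the chip's scan-out pin with the level-2 parity can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the function names and the returned labels are hypothetical, and the case in which only the level-2 parity flags an error is left unclassified because the text above does not describe how it is handled.

```python
def classify_read(scan_out_high, level2_parity_error):
    """Classify one read from the two indicators described in the text."""
    if scan_out_high and level2_parity_error:
        # Both levels flag an error: the addressed bit itself is faulty.
        return "addressed_bit_faulty"
    if scan_out_high:
        # Only the row parity fails: the upset is elsewhere on the word line.
        return "other_cell_on_word_line_faulty"
    if level2_parity_error:
        # Not covered by the passage above; left unclassified in this sketch.
        return "unclassified"
    return "no_error"


def corrected_bit(bit, scan_out_high, level2_parity_error):
    """Complement the addressed bit only when it is known to be faulty."""
    if classify_read(scan_out_high, level2_parity_error) == "addressed_bit_faulty":
        return bit ^ 1
    return bit


if __name__ == "__main__":
    print(classify_read(True, True), corrected_bit(0, True, True))
    print(classify_read(True, False), corrected_bit(0, True, False))
    print(classify_read(False, False), corrected_bit(1, False, False))
```

In the second case the data bit is delivered unchanged, and the faulty cell still has to be located on the flagged word line by a separate step.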
A bidirectional parity scheme [3], where all the information and parity bits in a rectangular code are stored on a single word line, can be used, but this will need about $2n^{1/4} \times n^{1/2}$ extra memory bits within each DRAM, since each row of $n^{1/2}$ bits will be organized as a square containing $n^{1/4}$ horizontal parities and $n^{1/4}$ vertical parities. So the strategy used here is to locate the faulty cell (if it cannot be readily located by the two-level parities) by sequentially reading the memory cells on the defective word line. Since the error rates are very low in DRAMs with an α-particle protective film, this on-line periodic removal of faulty bits ensures that no double error occurs on a word line, and thus the proposed scheme maintains high reliability. It may be noted that in conventional Hamming coding this cannot be done, because it sensitizes only the cell which is read out of the chip, and not the whole row (word line) in the chip.

FINAL REMARKS

This paper demonstrates how to reconfigure the on-chip testability logic, such as the PSA, to correct a single-bit soft error in a two-level RAM system. The PSA can be integrated within a high-density DRAM to augment the testability and reduce the testing cost of the memory. The proposed scheme uses only one extra DRAM chip to store the parity bits of the system words, and it has very little overhead. The idea of reconfiguring testable hardware into an ECC further reduces the additional chip area. By a simple reliability analysis, it can be shown that the MTBF for a DRAM with the proposed error-correction scheme is given by $1.25/(\lambda n^{3/4})$, where n is the number of bits in a chip and $\lambda$ is the average chip failure rate. For a memory without any ECC, the MTBF is given by $1/(\lambda n)$. Thus, the improvement in reliability due to the proposed error-correction scheme, defined by the ratio of these two MTBFs, is RIF = $0.8n^{1/4}$, which increases monotonically with the size of the memory array.

I. INTRODUCTION
Fully complementary CMOS static circuits dissipate negligible dc power, can operate asynchronously, and do not require the routing of clock signals [1]. However, static circuits are generally slower than dynamic circuits. One way of overcoming this deficiency is to trade off standby power consumption for speed. Johnson [2] recently presented a novel CMOS NOR gate using inverters with their outputs shorted together. The design of such NORs involves transistor ratioing to set appropriate high and low output levels.
The static power-speed trade-off is acceptable in localized applications, where there is a demonstrated need for a special function to operate particularly quickly. The static power dissipation is thus kept physically isolated to a few locations on an IC, and may in fact be insignificant in an environment where conventional static circuits operate near their maximum frequency, and thereupon dissipate considerable dynamic power. It may be noted, moreover, that the local nature of the concept does not preclude the use of other speed-enhancing techniques, and thus its use can provide the incremental delay improvement which may make a design feasible.
This correspondence extends the concepts in [2] to include buffering of the shorted or "ganged" node, thereby allowing the realization of more complex gates; the resulting technique is called "ganged-CMOS logic" (GCMOS). A number of sample circuits are presented. In particular, two novel adders are described and compared with an accepted conventional implementation. The ganged-CMOS adders provide lower input capacitance and faster carry propagation for equally sized layouts.
GANGED CMOS
By buffering the ganged node with a simple CMOS inverter, a number of advantages are obtained. First of all, the ganged node is effectively isolated from external circuitry: its value is neither transmitted on long interconnect wires and corrupted by noise, nor does it drive complex gates, where any voltage exceeding a transistor threshold can cause a logic error. Essentially, one can tolerate much lower noise margins on a local node than on a global node. This benefit is enhanced by the inverter's inherent encoding action: its high gain results in a sharp distinction between low and high inputs. Furthermore, the inverter's switching point, while dependent on the square root of the p-n ratio, can be varied adequately by adjusting transistor geometries. Fig. 1 shows the same circuit topology repeated three times; transistor widths are changed to realize three different functions. The first, in Fig. 1(a), implements an OR gate; only one of the three inputs need be high for the ganged node G to be forced well below the switching point of the buffer inverter. Similarly, the dimensions shown in Fig. 1(b) implement an AND gate. The circuit in Fig. 1(c) implements the logic function A * B + C; the inverter driven by C is essentially "twice" as strong as those driven by A and B.
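The three functions of Fig. 1 can be reproduced with a purely logical abstraction in which each input inverter contributes a relative strength and the buffer inverter switches once enough strength pulls the ganged node low. This is a threshold-logic sketch under loud assumptions, not a circuit model: the integer weights and thresholds below are illustrative stand-ins for the transistor widths, the function name `ganged_gate` is hypothetical, and analog effects such as ratioed output levels and noise margins are ignored.

```python
from itertools import product


def ganged_gate(inputs, weights, threshold):
    """Threshold abstraction of a buffered ganged node.

    Each high input turns on the pull-down of its inverter with relative
    strength `weight`. The buffered output is taken as high when the total
    pull-down strength reaches `threshold` (an assumed switching point).
    """
    pull_down = sum(w for x, w in zip(inputs, weights) if x)
    return int(pull_down >= threshold)


# Illustrative (assumed) strengths for the three cases of Fig. 1.
OR3 = dict(weights=(1, 1, 1), threshold=1)        # any one input suffices
AND3 = dict(weights=(1, 1, 1), threshold=3)       # all three inputs needed
AB_PLUS_C = dict(weights=(1, 1, 2), threshold=2)  # C is "twice" as strong

if __name__ == "__main__":
    for a, b, c in product((0, 1), repeat=3):
        print(a, b, c,
              ganged_gate((a, b, c), **OR3),
              ganged_gate((a, b, c), **AND3),
              ganged_gate((a, b, c), **AB_PLUS_C))
```

The same three-inverter topology yields all three truth tables; only the relative strengths and the assumed switching point change, which mirrors the role of transistor sizing in the text.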
GCMOS results in a lower transistor count for more complex functions, as demonstrated in the examples that follow. Although it is true that the exclusive use of inverters limits the area-saving parallel and series layout of transistors, the lower transistor count overcomes this area deficiency. For further area savings, it is possible to group n-channel and p-channel transistors in common tubs, such that the inverter's two transistors are not constrained to be physically adjacent. It is also possible to alter p- and n-area requirements by enhancing the encoding-inverter threshold (for example, using narrower p-transistors for both input and encoding inverters, if the encoding-inverter switching point is lowered).
