Abstract-If a VLSI chip is partitioned into functional units (FU's) and redundant FU's are added, error correcting codes may be employed to increase the yield and/or reliability of the chip. Acceptable testing is defined to be testing the chip with the error corrector functioning, thus obtaining the maximum increase in yield afforded by the error correction. The acceptable testing theorem shows that the use of coding and error correction in conjunction with acceptable testing can significantly increase the yield of VLSI chips without seriously compromising their reliability.
I. INTRODUCTION
VLSI chips which contain error correctors may be tested to be perfect (that is, tested with the error correction mechanisms either bypassed or disabled), or they may be tested to be acceptable (that is, tested with the error correction mechanisms functioning). The first case clearly produces chips of the highest reliability since the entire capability of the codes is used to correct for failures which occur after testing. The second case is of special interest, however, because one has a much higher percentage of acceptable chips than one does of perfect chips. Of course one pays for the increased yield of acceptable chips over the yield of perfect chips with a decrease in reliability. What is significant is that the reliability penalty required to obtain significant yield increase need not be severe.
It is the purpose of this paper to quantify the reliability penalty which results from employing acceptable testing of VLSI chips. A theorem is developed which relates the reliability of chips which are tested to be acceptable to the reliability of untested chips. The reliability of untested chips may be computed from standard IC fault models.
The theorem requires no assumptions about either the nature of the statistics of the defects which occur on the chip, or about the method of partitioning the chips into functional units (FU's), or about the coding and error correction methods.
Whereas previous fault-tolerant-VLSI work has emphasized memory arrays on silicon wafers, acceptable testing, as presented in this paper, is applicable to arbitrarily complex and/or irregular logic and any manufacturing technology.
Preparatory to discussing the acceptable testing theorem, a brief review of the current state of the art in fault-tolerant computing as it applies to VLSI is given. This is followed by a discussion of IC fault models. It is demonstrated that consideration of statistically independent random defects only is appropriate to the discussion at hand.
For the purpose of obtaining a qualitative feel for the yield and reliability implications of acceptable testing, two specific examples are worked out. These examples demonstrate that acceptable testing can indeed be a viable technique for increasing yield without unduly sacrificing reliability.
II. FAULT-TOLERANT COMPUTING AS IT APPLIES TO VLSI
There is a large amount of literature on fault-tolerant computing, beginning with von Neumann [1] and Moore and Shannon [2]. By 1962, symposia were being held on the subject (Wilcox and Mann [3]). There are early books on the subject by Winograd and Cowan [4] and Pierce [5]. A particularly good review of the state of the art as of 1976 was given by Avizienis [6].
Of the recent literature on fault-tolerant computing, a large number of papers have focussed on fault-tolerant memory. Goldberg et al. [7] state that this is so because computer main memory has historically been a major cost item of a computer system: it is the most unreliable part of a system (except for mechanical peripherals), and it is simultaneously the most amenable to fault-tolerance techniques because of its inherent regularity and its exceedingly large number of logic elements. Early papers dealt with core memory, although some of the techniques are apropos any memory technology. More recently, the emphasis has shifted to semiconductor memory. Reviews of semiconductor memory have been given by Eimbinder [8], Riley [9], and Leuke et al. [10].
Two interesting papers from the standpoint of semiconductor memory were written by Srinivasan [11], [12]. One of his approaches combines triple-modular-redundancy (TMR) with matching the address decoder to the particular faulty array. The technique is very powerful; unfortunately, its difficult mechanization makes it unlikely to be an economically viable solution. Srinivasan also considers the application of generalized Hamming codes over GF(2^b), where b is the number of bits per byte, as described by Peterson and Weldon [13]. Since the usual methods of correction would be too slow in comparison to the inherent speed of semiconductor memories, he mentions the parallel method described by Bossen [14], and concludes that an even faster approach must be found. Srinivasan's solution is an ingenious scheme for interlacing a single-step majority-decodable code [15], [16] in such a way that it can be used to correct burst errors.
Hsiao [17] describes the system used on the IBM 7030 and the IBM System/360 Model 85 main memories. A modification of the translator for the System/360 is delineated by Carter et al. [18] who succeed in attaining higher speed. In a related paper, Carter et al. [19] emphasize the importance of self-testable checker hardware.
Allen [20] stresses the bit-per-basic-operating-memory (bit/BOM) organization as a means of combatting faults in address and drive circuitry. This allows correction to be made with SEC codes rather than burst error correcting codes. Szygenda and Flynn [21]-[23] describe an unusual core switching array useful in constructing a fault-tolerant magnetic memory.
More apropos large-scale integrated (LSI) semiconductor memories, Siewiorek and McCluskey [24]-[27], and more recently Ingle and Siewiorek [28], discuss hybrid redundancy. This technique consists of a synergism of n-fold-modular-redundancy (NMR) and standby sparing. They describe an approach employing automatically switched spares wherein the switching mechanism is driven by a set of "disagreement detectors."
Taylor [29] , [30] took a theoretical tack. He defined the computational capacity of a system to be the limit of the amount of computation divided by the complexity. He gives existence proofs showing that it is possible to construct a stored program computer (including memory) with nonzero capacity from faulty components.
Rao [31] described a system in which coded data passes between the memory and the processor. In the example he considered, the encoders and decoders are located in the processor. An extensive analysis of the cost, both in time and components, for three memory coding schemes was done by Varanasi (a student of Rao) [32]. He analyzed the SEC Hamming code, the single-step majority-decodable code, and the two-dimensional iterated parity (cross parity) code.
Switching in of spare LSI chips is proposed by Goldberg et al. [33]. They describe a technique using a regular switching network which can be embedded in the LSI chips. On the other hand, Wensley et al. [34] treat byte error correcting codes. They develop a theory of framed burst error correcting codes, noticing that if errors are constrained to a single byte, then fewer syndromes are required than for bursts of the same length in arbitrary positions. The system they propose has the decoders on separate LSI chips, but they foresee including the decoders on the same chip as the memory elements. Burst error codes which they considered include the Hamming code over GF(2^b) [13] and Abramson codes [35]. They recommend an LSI realization of the decoder after a method of Levitt and Kautz [36]. They further consider the replacement of failed memory frames with spares and derive the optimum frame width from this standpoint.
In still another paper, Neumann et al. [37] consider the tradeoff between interlaced parity codes and the burst error correcting arithmetic codes proposed by Neumann and Rao [38]; the redundancy of the interlaced parity and arithmetic codes is compared with that of the Hong-Patel codes [39]. Tammaru and Angell [47] and Petritz [48] showed that limited discretionary wiring in combination with spares could increase memory chip yield. Schuster [49] suggested using latches, laser customizing, or an on-chip EPROM to select the spares. A total system, consisting of chips with on-board spares, automatic test equipment, and automatic laser customizing, has been reported by Cenker et al. [50]. The author of [54] showed that both yield and reliability could be simultaneously enhanced by the use of on-chip error correction. He also demonstrated that the application of concatenated codes [55], [56] was attractive. He described a memory organization using both coded chips and an overall error correction scheme. Truly remarkable performance can be achieved with this technique, so good, in fact, that one can envision constructing a usable memory system from untested chips.
III. FAULT MODELS FOR VLSI
For over 15 years there has been a controversy over the proper technique to use for modeling IC faults. In 1964, Murphy [57] stated that area defects would be caught at slice test, line defects (such as scratches) would not exist in a well controlled process, and that spot defects would limit IC yield. Referring to spot defects, he said, "this is the predominant type of defect causing losses at line test and is responsible for the dependence of yield on area." He also pointed out that chip-to-chip or slice-to-slice variation of the defect density would make the model pessimistic. Tammaru and Angell [47] agreed in 1967 that in a refined process random defects would control yield, but that a model based on random defects would be pessimistic if the defects were in reality nonrandom. Seeds [58] did an empirical analysis in 1967 which seemed to show that the random model was pessimistic.
Uniformly distributed random defects should give a yield versus area relationship of the form

Y = Y0^(A/A0); (1)

however, in 1970, Moore [59] stated that experience at Intel showed that

Y = Y0^((A/A0)^(1/2)) (2)

was closer to reality.
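To make the difference between these two laws concrete, the following minimal Python sketch (my illustration, not the paper's; the reference yield Y0 = 0.5 at area A0 is an arbitrary assumption) tabulates (1) and (2) for several area ratios. Equation (1) falls off far faster with area, which is why a purely random-defect model can look pessimistic next to Moore's observation.

```python
# Illustrative only: comparing the random-defect yield law (1) with
# Moore's empirical law (2).  Y0 is an assumed yield at reference area A0.
Y0 = 0.5

for ratio in (1, 2, 4, 8, 16):            # chip area A in units of A0
    y_random = Y0 ** ratio                # equation (1): Y = Y0^(A/A0)
    y_moore = Y0 ** (ratio ** 0.5)        # equation (2): Y = Y0^((A/A0)^(1/2))
    print(f"A/A0 = {ratio:2d}:  eq. (1) gives {y_random:.5f},  eq. (2) gives {y_moore:.5f}")
```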
Warner [60] showed in 1974 that he could fit Moore's data with two random distributions, each covering half the area of the slice. Warner also stated, "the investigator noting a yield affected by line or area defects and assuming that he faced only point defects, would tend to produce pessimistic projections concerning the effect of increasing IC area." He went on to say, "virtually all the good IC's come from the lowest density subarea, within which point defects are randomly distributed." In other words, once the really bad areas are removed from consideration, the mean density of defects which would be inferred from yield data is realistically low and (1) is a good fit to experiment. Warner also pointed out that the effective "cost" of a line defect varies as A^(1/2) (the number of chips a line can intersect varies as L/A^(1/2)). It is possible that Intel's experience was colored by a large number of line defects.
Numerous attempts have been made to explain why observed variation of yield versus area does not agree with (1): Ansley, 1968 [61]; Yanagawa, 1972 [62]; Gupta and Lathrop, 1972 [63]; Gupta et al., 1974 [64]; and Muehldorf, 1975 [65]. An explanation put forth by Price in 1970 [67] was that the workers in this field were incorrectly applying Boltzmann statistics when, in fact, they should have been using Bose-Einstein statistics. Murphy refuted this assertion in 1971 [68].
For the purposes of the analyses presented in Sections VI and VII of this paper, it will be assumed that defects are indeed randomly distributed and statistically independent. This is justified as follows. 1) Area defects: according to Schuster [49], "gross imperfections cause large areas of the chip to be bad and are not affected by redundancy techniques." 2) Line defects: these should not occur in a well controlled process (Murphy [57]). 3) Clusters of spot defects: Muehldorf [65] wrote, "a cluster has, in essence, the same damaging effect for a chip as a solitary fault, and hence for refined chip yield predictions, a cluster should be counted like a single fault." This is indeed true for uncoded IC's. However, if the FU's are appreciably smaller than a cluster, or if the FU's are interwoven so that a cluster typically hits several FU's (each in a different word), then the behavior is closer to the case of independent defects. If, on the other hand, FU's are significantly larger than a cluster, then there is a certain probability that an FU will be faulty (it is irrelevant whether the fault is caused by a single defect or a cluster) and the FU's will be effectively statistically independent.
Furthermore, since it has not been unambiguously shown that clustering of spot defects is the cause of the observed deviations from (1), there seems to be no valid reason to unduly complicate the mathematics. It should also be pointed out that the results of this paper are not confined to present day technology. Clustering may or may not occur in tomorrow's technology. Another point to be made is that the object of this paper is to demonstrate the utility of acceptable testing; the fact that the model used might be pessimistic can only make that case stronger.
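The interweaving argument in point 3) above can be illustrated with a toy layout (entirely my construction; the sizes are arbitrary and chosen only for readability). A cluster of four physically adjacent faulty FU's destroys a contiguously laid-out word, but when the words are interleaved the same cluster deposits at most one fault per word, which a single-error corrector can absorb.

```python
# A toy illustration (not from the paper) of why interweaving FU's makes
# a defect cluster behave like independent single faults.
n_words, word_len = 8, 4                 # assumed layout: 32 FU's in all
cluster = set(range(12, 16))             # four physically adjacent faulty FU's

def faults_per_word(word_of):            # word_of: FU index -> word index
    counts = [0] * n_words
    for fu in cluster:
        counts[word_of(fu)] += 1
    return counts

# Contiguous words: FU's 12-15 all land in word 3, which is uncorrectable.
print("contiguous :", faults_per_word(lambda fu: fu // word_len))
# Interleaved words: the cluster spreads one fault each over words 4-7.
print("interleaved:", faults_per_word(lambda fu: fu % n_words))
```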
With respect to failures during use (in contrast to defects discovered at initial test) the same arguments apply. Schnable et al. [71] state that the effect of complexity on reliability is controversial. Data presented by Koppel [72] show that in the case of Intersil 16K RAM's, 89.6 percent of operational failures involve a single bit. This is taken as evidence for statistically independent failures. Therefore random failures, as well as random manufacturing defects, will be assumed in the calculations.
Data on failure rate versus time is not given. We assume a constant failure rate for the calculations. This is no doubt unrealistic, since a classical "bathtub" curve could be expected.

IV. VLSI CHIP MODEL AND DEFINITIONS

For the purposes of this paper, a VLSI chip is modeled as a complex nonrepairable entity which may be accessed only at its input and output terminals. The chip is assumed to contain redundant hardware and one or more error correction mechanisms which function by the application of error correcting codes. The details of the error correction mechanisms and the choice of coding techniques are outside the scope of this paper.
The VLSI chip is assumed to contain multiple FU's, each of unit complexity. It is assumed that redundancy is added in the form of additional FU's, again each of unit complexity. It is further assumed that either the total complexity of all the correction mechanisms is significantly less than the complexity of the aggregate of the FU's, or that the complexity of the correction mechanisms increases no worse than linearly with the number of FU's. This means that either the correction mechanisms are a "hard core" whose complexity can be ignored for the purpose of determining an upper limit to the reliability of the chip, or the complexity of the correction mechanisms can be subsumed into that of the FU's. Observe that an FU may be as simple as a single memory cell. Alternatively, an FU could be as complex as, or even more complex than, an entire microcomputer (complete with processor, memory, and I/O ports).
Let us assume that the FU's on a chip are organized into w "words," each of which contains k nonredundant FU's (required for the intended function of the chip) and n-k redundant FU's (required for error correction). It is further assumed that the error correction mechanisms function on these "words." These need not be "words" in the conventional sense. The only requirement is that at each time the outputs of the FU's are sampled, the outputs of the FU's of a given "word" relate in such a way that the error correction mechanism can function.
The strategies to be used in subdividing a complex VLSI chip into "words" and in subdividing these "words" into FU's depend on the level of complexity obtainable on a single chip and upon the type of functions to be performed. There is much room for research on this subject. It is the purpose of this paper to demonstrate the performance enhancement that can be obtained once this subdivision has been accomplished; partitioning per se will not be covered, except to note that NMR at the FU level can certainly be used, and that the recent work by Davies and Wakerly [73] is relevant here.

Let us define q*(t) to be the probability that a chip is acceptable at time t. Let us further stipulate that testing is performed at t = 0. We must consider uncoded chips, both untested and tested perfect. We must also consider coded chips in the untested, tested perfect, and tested acceptable cases. For the most part, we will be concerned with a c-fold multiple-error-correcting code, which we shall call cMEC for convenience. (Frequently, the letter t is used to denote the number of errors which can be corrected. We choose to use c instead, because in this paper t is used to denote time.) Consideration will also be given to single-error-correcting codes (the special case c = 1), which we shall call SEC. Table I lists the various subscripts which will be appended to q*(t) to specify which case we are considering. Observe that q*(0)_{cMEC·UT} is the yield of good chips obtained at testing (t = 0), and that q*(0)_{cMEC·TP} = 1 and q*(0)_{cMEC·TA} = 1, because only perfect or acceptable chips, respectively, are retained after testing.
V. DEFINITION AND PROOF OF THE ACCEPTABLE TESTING THEOREM
If a coded chip is tested perfect, then the entire capability of the code is available to correct faults which accumulate during life. If, on the other hand, a chip is tested acceptable, then there may be faulty FU's present at time t = 0. These faulty FU's will use up some of the error correction capability of the code. Accordingly, we expect that the probability versus time of having an acceptable word should lie somewhere between that for an untested chip and that for a tested perfect chip. We shall now explore in detail the nature of the probability versus time of having an acceptable chip.
We will first state the acceptable testing theorem for cMEC·TA chips and briefly discuss its import. Subsequently, a proof of its validity will be given. For the purpose of calculation of the examples, we assume that the defects on the chip are statistically independent. As we pointed out in Section III, this leads to possibly pessimistic results; however, it greatly simplifies the calculations, because if the probability that a given FU is good is q, then the probability that j FU's are all simultaneously good is q^j [74].
Let us define the probability versus time that a given FU functions properly to be q(t). Then the probability versus time that a given FU has failed is

p(t) = 1 - q(t). (9)

Let A be the event that a chip is acceptable at time t, and let B be the event that the same chip is acceptable at time t = 0, that is, the event that it passes acceptable testing. In the notation of Table I,

P[A] = q*(t)_{cMEC·UT} (4)

P[B] = q*(0)_{cMEC·UT} (5)

P[A|B] = q*(t)_{cMEC·TA}. (6)

The acceptable testing theorem asserts that

q*(t)_{cMEC·TA} = q*(t)_{cMEC·UT} / q*(0)_{cMEC·UT};

that is, the probability that a tested acceptable chip is still acceptable at time t is the probability that an untested chip is acceptable at time t, normalized by the yield at the time of testing. To prove this, we begin with Bayes' rule:

P[A|B] = P[B|A] P[A] / P[B]. (10)

However, since A implies B (faults only accumulate, so a chip which is acceptable at time t must have been acceptable at t = 0), it follows that P[B|A] = 1. Therefore,

P[A|B] = P[A] / P[B].
Now we substitute (4)-(6) into (10).
Q.E.D.
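As a sanity check on the theorem, the following Monte Carlo sketch simulates chips directly. Every parameter here (w, n, c, q0, qt, and the use of independent FU failures) is an assumption of mine made only to run the experiment; the theorem itself requires none of them. The measured conditional probability should match the ratio computed from (14) and (18) to within sampling error.

```python
# Monte Carlo check of the acceptable testing theorem (a sketch with
# assumed parameters).  A chip of w words, each with n FU's, is acceptable
# when every word contains at most c faulty FU's.
import random
from math import comb

w, n, c = 16, 7, 1            # assumed: sixteen (7,4) Hamming-coded words
q0, qt = 0.99, 0.98           # q(0) = q0; q(t) = q0*qt as in equation (12)
trials = 200_000
random.seed(1)

def word_ok(q):               # equation (18): P[word acceptable]
    return sum(comb(n, j) * q**(n - j) * (1 - q)**j for j in range(c + 1))

n_acc0 = n_acct = 0
for _ in range(trials):
    good0 = [random.random() < q0 for _ in range(w * n)]     # state at t = 0
    goodt = [g and random.random() < qt for g in good0]      # faults only accumulate
    words0 = [good0[i:i + n] for i in range(0, w * n, n)]
    wordst = [goodt[i:i + n] for i in range(0, w * n, n)]
    if all(wd.count(False) <= c for wd in words0):           # passes acceptable test
        n_acc0 += 1
        if all(wd.count(False) <= c for wd in wordst):       # still acceptable at t
            n_acct += 1

print("measured  q*(t)_TA           :", n_acct / n_acc0)
print("predicted q*(t)_UT / q*(0)_UT:", word_ok(q0 * qt)**w / word_ok(q0)**w)
```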
Notice that the acceptable testing theorem is valid regardless of whether or not the faults are correlated, and furthermore, that no assumptions need be made regarding the method of partitioning the chip into FU's or the method of error correction. However, all of these considerations do affect q*(t)_{cMEC·UT} and, hence, the final results obtained from coding.
VI. DISCUSSION OF THE THEOREM WITH EXAMPLES
The acceptable testing theorem gives us a way to obtain quantitative information on the probability versus time that a tested acceptable chip will continue to function. Let us now consider some specific examples in order to gain some qualitative insight into the utility of acceptable testing.
It is convenient to factor q(t) into the product

q(t) = q0 qt, (12)

where q0 is the value of q(t) at time t = 0 (the time at which we test the chip), and qt = 1 at time t = 0 and decreases monotonically thereafter. We shall find one more definition to be useful:

pt = 1 - qt. (13)

Let us define the probability versus time that a particular word of the chip is acceptable (functions properly as seen at the output of the correction mechanism) to be W(t). Then, given W(t), we can easily determine the probability q*(t) that an entire chip is acceptable. For a chip containing w words,

q*(t) = [W(t)]^w. (14)

In the case of an uncoded (nonredundant) chip, there are k FU's per word. Therefore,

W(t)_{UNC·UT} = [q(t)]^k. (15)

If the chip is tested perfect (there is no meaning to acceptable testing of nonredundant chips), then q(t) = qt, and

W(t)_{UNC·TP} = [qt]^k. (16)

For coded chips, however, W(t) depends on the characteristics of the code being used and the manner in which the chip was tested (whether the chip is untested, tested acceptable, or tested perfect).
Let us employ a cMEC code. If a word contains c or fewer faulty FU's, then the output of the corrector will be correct and the word will be acceptable. Standard probability theory [74] tells us that the probability of exactly j faulty FU's among the n FU's of a word of an untested chip is

Pj(t) = C(n,j) [q(t)]^(n-j) [p(t)]^j, (17)

where C(n,j) denotes the binomial coefficient. Standard probability theory also tells us that the probability of the union of mutually exclusive events is the sum of the probabilities of the individual events. Therefore, the probability that a given word of an untested chip is acceptable is found by summing Pj(t) for 0 <= j <= c:

W(t)_{cMEC·UT} = sum_{j=0 to c} C(n,j) [q(t)]^(n-j) [p(t)]^j. (18)

An equivalent form which we will frequently find to be more useful is obtained by factoring out [q(t)]^n. This form appears as (19):

W(t)_{cMEC·UT} = [q(t)]^n sum_{j=0 to c} C(n,j) [p(t)/q(t)]^j. (19)
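Equations (15), (18), and (19) transcribe directly into a few lines of Python. The sketch below (mine; the value q = 0.95 is arbitrary) also confirms numerically that the factored form (19) equals the direct sum (18).

```python
# Word acceptability per equations (15), (18), and (19); q plays the role
# of the FU probability q(t), and the test value below is an assumption.
from math import comb

def w_unc(q, k):                       # equation (15): uncoded word of k FU's
    return q**k

def w_cmec_18(q, n, c):                # equation (18): direct sum
    p = 1 - q
    return sum(comb(n, j) * q**(n - j) * p**j for j in range(c + 1))

def w_cmec_19(q, n, c):                # equation (19): with q**n factored out
    p = 1 - q
    return q**n * sum(comb(n, j) * (p / q)**j for j in range(c + 1))

q = 0.95
print(w_unc(q, 4))                     # uncoded, k = 4
print(w_cmec_18(q, 7, 1))              # (7,4) Hamming word, c = 1
print(w_cmec_19(q, 7, 1))              # same value from the factored form
```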
Equations (17)-(19) apply to the coded untested case. Fig. 1 shows how W(t) varies as a function of q(t) for two SEC codes, each with k = 4. These codes are the (7,4) Hamming code and an unspecified (8,4) code which is implemented so as to only correct single errors. The uncoded case and triple modular redundancy on each bit, (TMR)^4, are also included for comparison.
The graph uses logarithmic coordinates; therefore, the uncoded case plots as a straight line. Notice that all the coded cases start out being better than the uncoded case, but that eventually they all become worse than the uncoded case. It can be seen that the comparative performance of a code which has an information rate only slightly less than that of a perfect code [here, the (8,4) code versus the (7,4) Hamming code or (TMR)^4] degrades quite rapidly compared to that of the perfect code as q(t) decreases. The reason, of course, is that the extra check cells associated with a nonperfect code mean increased probability of encountering a fault, with no corresponding increase in code capability.
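The crossover behavior just described is easy to reproduce numerically. The sketch below (my own; it assumes, as stated above, that the (8,4) code is used purely as a single error corrector) tabulates the word acceptability of the four Fig. 1 cases as q(t) decreases. Near q = 1 every coded case beats the uncoded word; at low q the extra check cells drag all the coded cases below it, the nonperfect (8,4) code fastest of all.

```python
# Reproducing the qualitative content of Fig. 1 (parameters assumed).
def w_sec(q, n):                       # SEC word of n FU's: 0 or 1 fault
    return q**n + n * q**(n - 1) * (1 - q)

for q in (0.999, 0.99, 0.9, 0.5, 0.3):
    print(f"q = {q:5}:  uncoded {q**4:.4f}   (7,4) {w_sec(q, 7):.4f}   "
          f"(8,4) {w_sec(q, 8):.4f}   (TMR)^4 {w_sec(q, 3)**4:.4f}")
```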
Let us now compare five cases. The first is the uncoded tested perfect case. The second is the uncoded untested case (this case is not normally encountered, but is included here for comparison).
The third is the coded tested perfect case. In this case all the capability of the code is used for reliability improvement, and no yield improvement is obtained. This case is an upper limit to the reliability improvement which can be obtained. The fourth case, coded tested acceptable, is the one in which we are most interested. It allows us to obtain both yield improvement and reliability improvement. The fifth case, coded untested, is of less interest in itself, but it is included because the coded tested acceptable case is derived from it.
The probability W(t) of an acceptable word is plotted in Fig. 2 versus normalized time rt for each of the five cases mentioned in the preceding paragraph. In this specific example, the (7,4) Hamming code was used. The dotted straight line starting at the upper left-hand corner of the figure applies to the uncoded tested perfect case. The uppermost curve applies to the coded tested perfect case. Observe that for quite some time the curve for the coded tested acceptable case lies in between those of the two tested perfect cases. In this interval, tested acceptable chips are more likely to be functional than uncoded tested perfect chips.
Note that in Fig. 2, the curve for the coded untested case always lies below that for the uncoded tested perfect case. This would not normally be true; it happens here because an abnormally low value of q0 has been chosen for the example. This choice was made for the purpose of spreading the curves apart to more vividly demonstrate the differences between cases. If the value of q0 used in computing the curves of the figure had a more typical value, say 0.99, then the three curves for the three coded cases would be nearly identical and they would all lie above the line for the uncoded tested perfect case throughout most of their length.
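The shape of Fig. 2 can be regenerated from the formulas above. In the sketch below (assumptions mine: a single (7,4) Hamming-coded word, a deliberately low q0 = 0.9 in the spirit of the figure, and the constant failure rate of Section III so that qt = exp(-rt)), the tested acceptable curve is produced from the untested curve by the acceptable testing theorem, applied at the word level, where the same conditional probability argument holds.

```python
# Sketch of the Fig. 2 comparison for one (7,4) Hamming word (assumed
# parameters; the real figure plots whole chips over a wider time span).
from math import comb, exp

n, k, c, q0 = 7, 4, 1, 0.9            # q0 deliberately low, as in Fig. 2

def w_cmec(q):                        # equation (18) with c = 1
    return sum(comb(n, j) * q**(n - j) * (1 - q)**j for j in range(c + 1))

for rt in (0.0, 0.02, 0.05, 0.1, 0.2):
    qt = exp(-rt)                     # constant failure rate r
    unc_tp = qt**k                    # equation (16)
    cod_ut = w_cmec(q0 * qt)          # equation (18) with q(t) = q0*qt
    cod_tp = w_cmec(qt)               # perfect at t = 0, so q0 drops out
    cod_ta = cod_ut / w_cmec(q0)      # acceptable testing theorem
    print(f"rt = {rt:4}:  UNC-TP {unc_tp:.4f}  cMEC-TP {cod_tp:.4f}  "
          f"cMEC-TA {cod_ta:.4f}  cMEC-UT {cod_ut:.4f}")
```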
Let us now consider, as a second example, a chip which uses TMR on the single bit outputs of each of K = 1000 FU's.

VII. CONCLUSION

There are two conclusions to be drawn from the foregoing analysis. First, when a chip has been tested perfect (uncoded or coded), its probability as a function of time of being acceptable does not depend at all on the yield obtained at the time of testing; it depends only on qt. This is because there are no faulty FU's in a tested perfect chip. The second conclusion is that although the yield at the time of testing does affect the probability as a function of time of being acceptable for coded tested acceptable chips, this dependence is not particularly strong, even when the yield is relatively bad. We shall now consider why this should be so.
Let us examine, for instance, the last case tabulated in the previous example (Table III). There will be, on the average, 25 faulty FU's per uncoded chip (Kp0 = 1000 x 0.025 = 25). Since the statistical uncertainty associated with observing N random events is N^(1/2), almost every chip will have between 20 and 30 faults. The yield of uncoded chips is abysmally bad, there being only 1 in 10^11 which contains no faults.
Coded chips, in contrast, will on the average have 75 faults each (nKp0 = 3 x 1000 x 0.025 = 75). Therefore, almost every chip will have between 66 and 84 faults. However, the yield of acceptable coded chips is 16 percent, which is not bad when compared to 10^-11. The important point is that the remaining coded chips after acceptable testing will each contain between 66 and 84 faults. The testing simply eliminates all chips in which any word contains more than one fault.
If any of the words which already contains a fault from manufacturing accumulates another fault due to aging, then the chip will fail. However, since only about 8 percent of the words on the chip suffer this difficulty, the effect on the probability that the chip is acceptable is quite mild. Even in this extreme case, a tested acceptable chip has a probability of 0.864 of being acceptable at the time when an uncoded tested perfect chip has a probability of 0.368 of being acceptable. Of course, at this same time, a coded tested perfect chip would have probability 0.997 of being acceptable, but the yield of such chips is essentially zero (only one out of 10^33 coded chips is perfect in this example)!
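The arithmetic of this example is easy to check. The short sketch below regenerates every number quoted above; the only quantity I have back-calculated is the normalized time rt = 0.001, chosen so that the uncoded tested perfect probability comes out at the quoted 0.368.

```python
# Verifying the K = 1000 FU, TMR (n = 3, c = 1) example of Table III.
from math import exp, sqrt

K, n = 1000, 3
q0 = 0.975                                   # p0 = 0.025 faulty at test time

def w(q):                                    # TMR word: at most one fault
    return q**3 + 3 * q**2 * (1 - q)

print("mean faults, uncoded :", K * 0.025, "+/-", sqrt(K * 0.025))  # 25 +/- 5
print("mean faults, coded   :", n * K * 0.025)                      # 75
print("uncoded perfect yield:", q0**K)                              # ~1e-11
print("coded perfect yield  :", q0**(n * K))                        # ~1e-33
print("acceptable yield     :", w(q0)**K)                           # ~0.16

rt = 0.001                                   # assumed so that exp(-K*rt) = 0.368
qt = exp(-rt)
print("uncoded TP at time t :", qt**K)                              # 0.368
print("coded TP at time t   :", w(qt)**K)                           # ~0.997
print("coded TA at time t   :", (w(q0 * qt) / w(q0))**K)            # ~0.864
```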
In conclusion, the acceptable testing theorem shows that the use of coding and error correction in conjunction with acceptable testing can significantly increase the yield of VLSI chips without seriously compromising their reliability.

ACKNOWLEDGMENT

The author is indebted to Dr. T. R. N. Rao who originally suggested the line of research which has evolved into the results presented here. The author also wishes to thank the reviewers for numerous helpful suggestions which have greatly improved this paper. In particular, one of the reviewers suggested the proof of the acceptable testing theorem which was presented in the paper. The author's original proof was excessively long and complicated, and it was valid only for the case of statistically independent defects.
