Abstract-The incorporation of different forms of redundancy has been recently proposed for various VLSI and WSI designs. These include regular architectures, built by interconnecting a large number of a few types of system elements on a single chip or wafer. The motivation for introducing fault-tolerance (redundancy) into these architectures is two-fold: yield enhancement and performance (like computational availability) improvement.
I. INTRODUCTION
IMPORTANT innovations are likely to occur in two VLSI-based areas, namely, wafer-scale integrated architectures, and single VLSI chip/multielement architectures. The former has the potential for a major breakthrough with its ability to realize a complete multiprocessing system on a single wafer. This will eliminate the expensive steps required to dice the wafer into individual chips and bond their pads to external pins. In addition, internal connections between chips on the same wafer are more reliable and have a smaller propagation delay than external connections. The latter makes it possible to build a high-speed processor on a single chip, designed by interconnecting a large number of simple processing elements, memory modules and the like. These architectures have already captured the imagination of several computer manufacturers and researchers alike.
Much recent research has focused on these new architectural innovations, especially those created by interconnecting a large number of elements such as processors, memories, switches, communication links, etc., all on a single chip or wafer. Concerns about fault tolerance in such VLSI-based systems stem from two key factors: performance enhancement and yield enhancement.
Manuscript received August 1, 1985; revised February 4, 1986 and June 10, 1986. This work was supported in part by the
This has been demonstrated in practice for high density memory chips (e.g., [1]) and should be extended to other types of VLSI circuits. In general, yield may be enhanced because the circuit can be accepted, in spite of some manufacturing defects, by means of restructuring, as opposed to having to discard the faulty chip.
Achieving reliable operation also becomes increasingly difficult with the growing number of interconnected elements and hence, the increased likelihood that faults can occur. Here too, redundant elements which are ready to replace faulty ones when the system is in operation, can increase the reliability and other performance measures like computational availability.
In summary, the justification for introducing fault tolerance (redundancy) into the architecture of VLSI-based systems is two-fold. One is to deal with manufacturing flaws and increase the yield. The other is to deal with operational faults and enhance the performance availability.
Our objective in this paper is to formulate analytical models that will enable us to analyze the effectiveness of a given fault-tolerance technique in increasing yield and improving performance, or find the tradeoff between the two. These models will also allow us to compare various fault-tolerance techniques, examine different system topologies and determine the optimal amount of redundancy to be added.
In the next section, the aspects that have to be considered when evaluating a fault-tolerance strategy are detailed. In Section III, expressions for the actual and apparent yield of a VLSI chip with added redundancy are derived. In Section IV we present models that allow us to compute various measures of combined performance and reliability. Then, an example of a VLSI-based system with redundancy is analyzed in Section V and final conclusions are presented in Section VI.
different schemes might be cost-effective in different situations and for different objective functions.
Several aspects have to be considered when evaluating a fault-tolerance strategy for multielement systems. The first is the type of failures to be dealt with. There are two distinct types of failures with which fault-tolerance strategies can be designed to deal. These are production defects and operational faults. A relatively large number of defects is expected when manufacturing a silicon wafer in the current technology.
Normally, all chips with production flaws are discarded leading to a low yield.
Operational faults have, in comparison, a considerably lower probability of occurrence; the difference may be orders of magnitude. Improvements in solid-state technology and maturity of the fabrication processes have reduced the failure rate of a single component within a VLSI chip. However, the exponential increase in the component count per VLSI chip has more than offset the increase in reliability of a single component. Thus, operational faults cannot be ignored although they have a substantially lower probability of occurrence compared to production defects. Consequently, a fault-tolerance strategy that enables the system to continue processing, even in the presence of operational faults, can be beneficial.
The two types of failures, manufacturing defects and operational faults, also differ in the costs associated with them. Defects are tested for before the IC's are assembled into a system and therefore, they contribute only to the production costs of the IC's. In contrast, faults occur after the system has been assembled and is already operational. Hence
[2], [18], [15] and/or redundant processors, communication links, or other system elements [3], [10], [6]. When carrying out such an analysis we have to take into account the relative hardware complexity (silicon area) of all system elements, and their susceptibility to failures (manufacturing defects or operational faults).
Processing elements (PE's) are traditionally considered the most important system resource; hence, achieving 100 percent utilization of them is often attempted. For example, in [2], [15], and [18] switching elements are added between processors to assist in achieving this goal. In [3] and [10] connecting tracks are added on the wafer to be used in bypassing the defective PE's when connecting the fault-free ones. However, the silicon area that needs to be devoted to switching elements (e.g., switches capable of interconnecting 4 to 8 separate parallel busses [18]) or to additional communication links cannot be ignored. Consequently, such schemes might be beneficial only for PE's which are substantially larger than the switches and the additional links (e.g., [13]). Also, the addition of switching elements and especially the longer interconnections between active processors result in longer delays affecting the throughput of the system. To overcome this performance penalty, it has been suggested in [9] to add registers for bypassing faulty processors. The effect of this is to introduce extra stages in the pipeline, thus increasing the latency of the pipeline without reducing its throughput.
In the above-mentioned schemes, one of the underlying assumptions is that the extra circuitry (e.g., switching elements, communication links or registers) is failure free and only processors can fail. However, larger silicon areas devoted to those elements increase their susceptibility to defects or faults; as a result, the above-mentioned assumption might no longer be valid.
In general, there are several alternative ways for introducing redundancy into the system. Redundancy can be introduced into the architecture at the basic element level and/or at the system level. In the case of system level redundancy, spare elements are added to the original design and they will be used to replace any faulty system element. In the case of element level redundancy, each element has some internal redundancy allowing it to remain operational even in the presence of certain internal faults (with possibly a lower computational capacity). Note that both element level and system level redundancies can be incorporated into the same system. Several forms of redundancy can be used to handle manufacturing defects to increase the yield. The defective elements are configured out and the good ones are interconnected to form an operational system. Once this procedure is completed, the system goes into operation and it has to handle from this point on only operational faults. At this point the fault-tolerance capacity of the system is used to improve its performance availability. First, the remaining redundant elements (if any) can be used as spares and then, the system is gracefully degraded. We
[11] and [7]. A more general expression for the yield was proposed in [8]. In what follows we modify the latter to include some simplified yield models (as used in [3] and [10]) and to take into account the effect of incomplete testing on the yield.
The yield of any VLSI chip depends on the types of defects which may occur during the manufacturing process and their distribution. The majority of fabrication defects can be classified as random spot defects [20] caused by minute particles deposited on the wafer. The area of the system elements that we will be considering here and for which we will have spare ones (e.g., processors, memories, busses, etc.) is substantially larger than the expected area of a spot defect. Consequently, we assume in what follows that each spot defect affects only a single element.
For the statistics of the fabrication defects we can adopt one of the models suggested in the literature, such as Poisson, binomial, or generalized negative binomial statistics. Under proper assumptions each one of these statistics can be used, and the "correct" one is the one that fits the data best [20]. One model which has been shown to agree with experimental results is the generalized negative binomial distribution [19]. Its attractiveness stems from the fact that it does not assume that all defects are evenly distributed throughout the wafer but rather allows defects to cluster. We adopt here this distribution although all our
where $n$ is the total number of different types of elements in the chip. Note that in (3.2) we implicitly assume that all manufacturing defects result in logical faults, which in turn cause erroneous behavior of the chip. Certain defects may, however, produce no faults at all. For example, a defect in the outer area of the chip, which is usually occupied by bonding pads, may be harmless to the electrical performance of the circuit. To consider logical faults instead of fabrication defects we will, as a first approximation, multiply the defect density $d$ by the probability that at least one fault is caused by a defect. If, for example, we adopt the assumption made in [16] that the number of logical faults corresponding to a single defect follows a Poisson distribution with mean $c$, we should then multiply $d$ by $(1 - e^{-c})$. For convenience, we will in what follows still refer to manufacturing defects (rather than logical faults) with average density $d$, which equals the original defect density multiplied by the probability that a defect results in a logical fault.
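The defect statistics described above can be illustrated numerically. The following Python sketch assumes the commonly quoted form of the negative binomial distribution, $\Pr\{X = x\} = \frac{\Gamma(\alpha + x)}{x!\,\Gamma(\alpha)} \frac{(Ad/\alpha)^x}{(1 + Ad/\alpha)^{\alpha + x}}$, with mean $Ad$ and clustering parameter $\alpha$; whether this matches the exact form of (3.1) is an assumption, and the function names and numeric values are hypothetical. It also applies the scaling of the defect density by $(1 - e^{-c})$.

```python
import math

def neg_binomial_pmf(x, mean_defects, alpha):
    """Pr{X = x} for an assumed negative binomial defect model.

    mean_defects is the expected number of defects (A * d) and alpha is
    the clustering parameter (smaller alpha means stronger clustering).
    """
    if mean_defects == 0.0:
        return 1.0 if x == 0 else 0.0
    ratio = mean_defects / alpha
    # Work in log space so that large x does not overflow.
    log_p = (math.lgamma(alpha + x) - math.lgamma(x + 1) - math.lgamma(alpha)
             + x * math.log(ratio) - (alpha + x) * math.log1p(ratio))
    return math.exp(log_p)

def effective_density(d, c):
    """Defect density scaled by the probability that a defect causes
    at least one logical fault (Poisson number of faults with mean c)."""
    return d * (1.0 - math.exp(-c))

def simplex_yield(area, d, alpha, c=None):
    """Probability of zero defects (or zero faults, if c is given)."""
    density = d if c is None else effective_density(d, c)
    return neg_binomial_pmf(0, area * density, alpha)
```

For instance, `simplex_yield(1.0, 1.5, 0.3)` evaluates the zero-defect probability for an illustrative chip with $Ad = 1.5$ and $\alpha = 0.3$, and passing `c=0.5` shows how discounting harmless defects raises the computed yield.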
The Yield of a Chip with Added Redundancy
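Suppose now that redundancy is added to a chip at the system level so that $s_i$ defective elements of type $i$ can be tolerated (i.e., substituted by good spares), and denote by $N_i$ the total number of elements (including the spares) of type $i$ in a chip. Then, the chip is acceptable with any number of manufacturing defects in type $i$ elements as long as all of them are restricted to at most $s_i$ elements. The yield, which is now the probability of a chip being acceptable, is given by

$$Y = \prod_{i=1}^{n} \Pr\{\text{there are defects in at most } s_i \text{ elements of type } i\} = \prod_{i=1}^{n} \sum_{j=0}^{s_i} a_j^{(i)} \qquad (3.4)$$

where $a_j^{(i)}$ denotes the probability that exactly $j$ of the $N_i$ elements of type $i$ are defective. We may now obtain an explicit expression for the probability $a_j^{(i)}$ in two different ways. One is to follow [8] and define $Q_{x,j}^{(N_i)}$ as the probability that $x$ defects, distributed among the $N_i$ elements, fall in exactly $j$ of them:

$$Q_{x,j}^{(N_i)} = \binom{N_i}{j} \sum_{k=0}^{j} (-1)^k \binom{j}{k} \frac{(j-k)^x}{N_i^x} \qquad (3.5)$$

This expression was derived assuming that the defects are distinguishable, i.e., Boltzmann statistics are followed [20]. If we select Bose-Einstein statistics, the defects are indistinguishable and the resulting expression is

$$Q_{x,j}^{(N_i)} = \frac{\binom{N_i}{j}\binom{x-1}{j-1}}{\binom{N_i+x-1}{x}} \qquad (3.6)$$

In either case,

$$a_j^{(i)} = \sum_{x=j}^{\infty} Q_{x,j}^{(N_i)} \Pr\{X_i = x\} \qquad (3.7)$$

The last term in (3.7) is $\Pr\{X_i = x\}$ and we may substitute it by (3.1) or a similar expression for any other defect distribution.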
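The first approach can be sketched in Python. The sketch assumes that $Q$ denotes the probability that $x$ defects, spread over $N$ elements, land in exactly $j$ of them, and it uses the standard occupancy-problem results for distinguishable (Boltzmann) and indistinguishable (Bose-Einstein) defects. The function names are hypothetical, and a Poisson defect distribution stands in for (3.1) purely for illustration.

```python
from math import comb, exp, lgamma, log

def q_boltzmann(x, j, n):
    """x distinguishable defects, each equally likely to hit any of n
    elements: probability that exactly j elements are hit."""
    if x == 0:
        return 1.0 if j == 0 else 0.0
    if j < 1 or j > min(x, n):
        return 0.0
    # Inclusion-exclusion count of assignments covering j given elements.
    onto = sum((-1) ** k * comb(j, k) * (j - k) ** x for k in range(j + 1))
    return comb(n, j) * onto / n ** x

def q_bose_einstein(x, j, n):
    """x indistinguishable defects, every occupancy pattern equally
    likely: probability that exactly j elements are hit."""
    if x == 0:
        return 1.0 if j == 0 else 0.0
    if j < 1 or j > min(x, n):
        return 0.0
    return comb(n, j) * comb(x - 1, j - 1) / comb(n + x - 1, x)

def poisson_pmf(x, mean):
    """Stand-in defect distribution (an assumption, not the paper's (3.1))."""
    return exp(-mean + x * log(mean) - lgamma(x + 1))

def prob_j_defective(j, n, defect_pmf, q=q_boltzmann, x_max=60):
    """Probability that exactly j of n elements are defective, obtained
    by weighting Q with the distribution of the number of defects."""
    return sum(q(x, j, n) * defect_pmf(x) for x in range(j, x_max + 1))
```

A call such as `prob_j_defective(1, 6, lambda x: poisson_pmf(x, 1.5))` gives the probability of exactly one defective element out of six under these assumptions; passing `q=q_bose_einstein` switches the statistics.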
In this first approach to the derivation of $a_j^{(i)}$ we have considered the entire chip as the basic unit of silicon in which defects occur, and then we have distributed these defects uniformly among the individual elements. In the second approach to deriving an expression for $a_j^{(i)}$, we consider the single element as the basic silicon unit, out of which larger area chips are constructed. Let $Y_i$ denote the yield (probability of zero defects) of a single element of type $i$; then the appropriate expression for $a_j^{(i)}$ in this case is

$$a_j^{(i)} = \binom{N_i}{j} Y_i^{N_i - j} (1 - Y_i)^j \qquad (3.8)$$

The assumption here is that each element of type $i$ may be defective with an independent probability $(1 - Y_i)$. This approach has been adopted, for example, in [3] and [10].
When setting the parameters for the yield of a single element $Y_i$, we may require that the expected value and variance of the number of defects in the total chip area will be the same as in the first approach. Before comparing these two alternatives for calculating the probability $a_j^{(i)}$ (given by (3.7) and (3.8)), we return to the general equation for the yield, i.e., (3.4). This equation can be multiplied by a "bypass coverage probability" [11], which is the conditional probability that an element can be bypassed (isolated) given that it is faulty. By adding this probability one may consider less than perfect procedures for locating faulty elements and reconfiguring them out of the system.
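The second approach lends itself to a very short calculation. The Python sketch below assumes that each element of a given type is defective independently with probability $1 - Y_i$, so that the number of defective elements is binomially distributed, and that a chip is acceptable when every type has at most $s_i$ defective elements. The function names and the example numbers are hypothetical.

```python
from math import comb

def prob_j_defective_binomial(j, n_i, y_i):
    """Probability that exactly j of n_i elements are defective when each
    element is good independently with probability y_i."""
    return comb(n_i, j) * y_i ** (n_i - j) * (1.0 - y_i) ** j

def chip_yield(element_types):
    """Yield of a chip with system-level redundancy.

    element_types is a list of (N_i, s_i, Y_i) tuples: total number of
    elements of the type (spares included), number of tolerable
    defective elements, and single-element yield.
    """
    total = 1.0
    for n_i, s_i, y_i in element_types:
        # Chip is acceptable if at most s_i elements of this type fail.
        total *= sum(prob_j_defective_binomial(j, n_i, y_i)
                     for j in range(s_i + 1))
    return total
```

With an illustrative single-element yield of 0.9, `chip_yield([(4, 0, 0.9)])` is the yield of four elements without spares, while `chip_yield([(5, 1, 0.9)])` adds one spare to the same design.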
To tolerate si defective elements of type i, at least si redundant ones are needed. However, the exact amount of required redundancy depends upon the specific static or dynamic reconfiguration scheme used. This in turn, determines the increase in chip area which must be taken into account when calculating the yield, since a larger number of defects is expected now.
Let $\gamma_{s_i}$ denote the increase in the area $A_i$ (due to the addition of redundancy) needed to tolerate these $s_i$ faulty elements of type $i$. Let $\gamma_s$ denote the increase in total chip area that is required to tolerate all $s = (s_1, s_2, \ldots, s_n)$ faulty elements.
The factor $\gamma$ is called the redundancy factor [7] and it depends on the system topology and the reconfiguration strategy. It assumes its lowest possible value when only $s_i$ redundant elements are included in the total of $N_i$ elements of type $i$, i.e., $\gamma_{s_i} \ge N_i/(N_i - s_i)$, with the corresponding bound on $\gamma_s$ given by a sum over the $n$ element types. (3.10)
The larger chip area results in an increased expected number of defects. We should therefore multiply the average number of defects $A_i d_i$ in (3.1) by $\gamma_{s_i}$. If we insist on having the same expected value and variance of the number of defects in the total silicon area (independently of how it is partitioned into chips or elements) then, as was shown above, the clustering parameter $\alpha_i$ should also be multiplied by $\gamma_{s_i}$.
In addition, the increase in chip area reduces the number of chips that will fit into the same wafer. Hence, instead of calculating the yield, which is the probability that a single chip is acceptable, one has to calculate the expected number of acceptable chips out of a given wafer. This expression, which we call wafer-equivalent yield, is obtained from (3.4) after dividing it by $\gamma_s$.
By comparing the wafer-equivalent yield of the fault-tolerant chip and the yield of the simplex one (with no fault-tolerance features), we can determine whether it is beneficial, when yield is considered, to have built-in fault tolerance and how many redundant elements we should add. This comparison can be done for various system topologies and different reconfiguration algorithms.
An analysis along these lines has been done in [11] and in [7] . In both it has been observed that the improvement in yield saturates above some amount of redundancy. This indicates that there is an optimal amount of redundancy that should be added.
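The saturation effect can be reproduced with a small experiment. The Python sketch below assumes the binomial, per-element yield model, a single element type, and the lowest possible redundancy factor, i.e., the chip area grows only by the area of the spares themselves. The function names and parameter values are hypothetical.

```python
from math import comb

def yield_with_spares(n_required, spares, element_yield):
    """Probability that at most `spares` of the n_required + spares
    elements are defective (independent element failures assumed)."""
    n_total = n_required + spares
    return sum(comb(n_total, j)
               * element_yield ** (n_total - j)
               * (1.0 - element_yield) ** j
               for j in range(spares + 1))

def wafer_equivalent_yield(n_required, spares, element_yield):
    """Chip yield divided by the redundancy factor, which accounts for
    fewer (larger) chips fitting on the same wafer."""
    gamma = (n_required + spares) / n_required  # assumed minimal area increase
    return yield_with_spares(n_required, spares, element_yield) / gamma

def best_spares(n_required, element_yield, max_spares=20):
    """Number of spares maximizing the wafer-equivalent yield."""
    return max(range(max_spares + 1),
               key=lambda s: wafer_equivalent_yield(n_required, s, element_yield))
```

Sweeping `spares` for fixed `n_required` and `element_yield` shows the wafer-equivalent yield rising at first and then falling once the area penalty outweighs the gain; a real design would replace the minimal `gamma` with the factor implied by its reconfiguration scheme.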
Still, the exact value of this optimal amount of redundancy does depend upon the expression we adopt for the wafer-equivalent yield of a fault-tolerant chip. To illustrate this, we compare in the following example the optimal amounts of redundancy when using the above two alternative schemes for calculating $a_j^{(i)}$ (defined by (3.7) and (3.8), respectively).
Example
assumes that the applied production testing procedure is perfect and we accept only chips which satisfy the above requirements. However, testing is never perfect and consequently, chips having more than the allowed number of defective elements will be declared good, resulting in a higher "apparent" yield [22], [16] defined by $Y_{\text{apparent}} = Y + Y_{bg}$, where $Y_{bg}$ is the yield of bad (defective) chips that are tested as good. This yield depends on the probability that an element fails the testing procedure given that the element is defective. We denote this probability, which is usually called fault coverage probability, by $f_c$.
If the number $f_i$ becomes too high, we may reach a point at which our system cannot do useful computation any longer.
Let $m_i$ denote the maximum allowed number of elements that could become faulty if $s_i$ ones were already defective at $t = 0$. Therefore, $f_i \le s_i + m_i < N_i$. (4.2) This inequality means that if fewer than $s_i$ elements were defective at $t = 0$, the system will endure the failure of more than $m_i$ elements at $t > 0$.
An example of the suggested Markov model for a chip with two types of elements that can fail is depicted in Fig. 1.
If, however, the number of active elements of type 1 satisfies $u_1 < N_1 - f_1$, then the appropriate expression for the transition rate is $u_1 \lambda_1 p_1 + (N_1 - f_1 - u_1) \lambda_1$. This is based on the assumption that upon a failure of a nonactive element, the system will recover successfully with probability 1. The above expression is not always well-defined since in the general case (for complex system topologies and restructuring schemes), $u_1$ may be a function of $f_1$ and $f_2$, and may depend on the exact positions of these faulty elements as well. Therefore, the value for $u_1$ to be used in the above expression for the transition rate must be obtained according to some empirical rule. Several such rules can be envisioned, for example, the average over all possible positions of the faulty elements, the worst-case one, or the most probable one.
There are cases in which $u_i$ depends only on the value of $f_i$ $(i = 1, 2, \ldots, n)$. In these cases, not only is the above expression for the transition rate accurate, but the entire model can be simplified by partitioning the Markov chain into $n$ independent chains. Each will then be solved separately and the final results will be combined to obtain the required performance measures. This case is demonstrated in the next section.
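When the chain for one element type can be solved on its own, the computation is small enough to sketch directly. The Python code below is a minimal illustration under several assumptions that are not taken from the paper: the number of active elements follows the hypothetical rule $u = \min(\text{n\_active}, N - f)$, a failure of an active element is recovered with probability $p$ and is otherwise fatal, a failure of a nonactive element is always recovered, and once the maximum tolerable number of faults `f_max` has been reached every further failure is fatal. The state equations are integrated with a plain fourth-order Runge-Kutta scheme.

```python
def transition_rates(n_total, n_active, f, lam, p):
    """Rates out of the state with f faulty elements: (recovered, fatal)."""
    u = min(n_active, n_total - f)            # assumed rule for active elements
    recovered = u * lam * p + (n_total - f - u) * lam
    fatal = u * lam * (1.0 - p)
    return recovered, fatal

def state_probabilities(n_total, n_active, f_max, lam, p, t, k0=0, steps=2000):
    """Probabilities of the operational states 0..f_max at time t,
    starting with k0 defective elements at t = 0."""
    probs = [0.0] * (f_max + 1)
    probs[k0] = 1.0

    def deriv(pr):
        d = [0.0] * len(pr)
        for f, pf in enumerate(pr):
            ok, fatal = transition_rates(n_total, n_active, f, lam, p)
            if f == f_max:
                # No further faults can be tolerated in the last state.
                fatal, ok = ok + fatal, 0.0
            d[f] -= (ok + fatal) * pf
            if f < f_max:
                d[f + 1] += ok * pf
        return d

    h = t / steps
    for _ in range(steps):
        k1 = deriv(probs)
        k2 = deriv([x + 0.5 * h * k for x, k in zip(probs, k1)])
        k3 = deriv([x + 0.5 * h * k for x, k in zip(probs, k2)])
        k4 = deriv([x + h * k for x, k in zip(probs, k3)])
        probs = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(probs, k1, k2, k3, k4)]
    return probs

def reliability(n_total, n_active, f_max, lam, p, t, k0=0):
    """Probability that the system is still in an operational state at t."""
    return sum(state_probabilities(n_total, n_active, f_max, lam, p, t, k0))
```

Under the independence condition described above, the reliability of a system with several element types would be the product of the values returned by `reliability` for each type.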
State $(0, 0)$ of the Markov model in Fig. 1 is the initial state of the system if no defects occurred while the chip was manufactured. If there were $k_1$ and $k_2$ defective elements of type 1 and 2, respectively $(0 \le k_i \le s_i,\ i = 1, 2)$, then $(k_1, k_2)$ will be the initial state. The probability of this event is

$$a_{k_1}^{(1)} a_{k_2}^{(2)} \qquad (0 \le k_i \le s_i,\ i = 1, 2) \qquad (4.3)$$

since the probabilities of defects in different types of elements are independent [20]. We
probability of each state being an initial state results in the apparent yield as given by (3.12).
A state like $(s_1 + m_1, f_2)$ in Fig. 1
(4.8) We denote by $V_b$ any vertex (state) at level $b$ in Fig. 1, i.e., any state $(i, j)$ satisfying $i + j = b$. Thus, the sequence $(k_1, k_2), V_{k_1+k_2+1}, \ldots, V_{i+j-1}, (i, j)$ corresponds to a path in Fig. 1
[7]. Let $R_{k_1,k_2}(t)$ denote the reliability of a system (i.e., the probability that it operates correctly in the time interval [
This will allow us to determine whether it is beneficial, when reliability is considered, to introduce redundancy into the architecture of the system and how many redundant elements of each type we should include.
Example: The wafer-level reliability as a function of the
[16] and $p = 0.9$. The wafer-equivalent yield of this chip is maximized when four redundant elements are added to the simplex chip. The improvement in wafer-level reliability also saturates above some amount of redundancy. The optimal amount of redundancy that maximizes the wafer-level reliability depends on the mission time of the system. Fig. 2 depicts the wafer-level reliability and wafer-equivalent yield as functions of the number of spare elements with the mission time as parameter. For a low value of $t$ (time is measured in $1/\lambda$ units), the optimal amount of redundancy is $s_{\text{opt}} = 0$. For $t = 0.25/\lambda$, $s_{\text{opt}} = 3$, and for $t = 0.35/\lambda$, $s_{\text{opt}} = 6$. The tradeoff between yield enhancement and reliability improvement depends, therefore, on the mission time. Graphs like the one shown in Fig. 2
The average mean time to failure can be defined similarly to (4.12).
A Model for a System with Element-Level Redundancy
The Markov model depicted in Fig. 1
Fig. 4 where each global bus can connect any processor to any memory module. Following the notation used in [12] and [21], we refer to this multiprocessor as a P * M * B system.
There are several ways to characterize the behavior of a P * M * B system. Our purpose in this section is only to illustrate the application of the model presented in the previous section. We therefore adopt a relatively simple characterization of the system which is based on two parameters. One is the time between two consecutive memory references and the second is the connection time between processor and memory in a single memory access. Both parameters are random variables and are assumed to be exponentially distributed with mean $1/\delta$ (processing time) and $1/\mu$ (connection time).
We wish to find an expression for the computational capacity of a P * M * B system, denoted by $C_{P,M,B}$. Such an expression will allow us to determine the optimal number of spare processors, spare memory modules and spare interconnection busses to be designed into the VLSI chip so that yield and/or performance are maximized.
We may define the computational capacity of the P * M * B system as the expected number of active processors, i.e., processors which are executing their task and not idle while waiting to access a common memory module. This performance index is known as processing power. Other performance indexes like the average cycle time and the instruction execution rate can be simply derived from the processing power index [12]. To calculate the processing power index we may construct a queuing network model. The computational complexity of this model increases very rapidly with system size. Fortunately, as has been shown in [12] and [21], approximate models with reasonably small errors in the final results can be employed. These are derived by lumping "equivalent" states of the model to obtain a Markov chain of substantially smaller size.
An example of such a model is shown in Fig. 5 . In it, at state (P -i) there are P -i processors which are executing their tasks while the remaining i processors are idle being serviced or waiting to be serviced by a memory module.
At a rate of $\beta(i)\mu$, one of the $i$ idle processors will complete its service, increasing the number of active ones to $P - i + 1$. $\beta(i)$ is the average number of processors, out of the $i$ idle ones, that are serviced at a given time instant. Similarly, at a rate of $(P - i)\delta$, an active processor will generate a memory request and join the idle processors, reducing the number of active ones by 1. To derive an expression for $\beta(i)$, we assume (as in [12] and [21]) that processors request service from the different memory modules with equal probabilities. Hence, the probability that all $i$ requests of the idle processors will be directed to exactly $j$ out of the $M$ memory modules is $Q_{i,j}^{(M)}$, which is defined in (3.5). This probability has to be multiplied by min
The Markov chain in Fig. 5 is a birth and death one, whose solution is easily obtained. Let
To calculate the computational availability we also need the state probabilities of a Markov model similar to the one shown in Fig. 1, with three types of elements, namely processors, memory modules, and busses. Fortunately, in the multiple bus multiprocessor system, the number of active elements of any type depends only on the number of faulty elements of this type and is independent of faulty elements of the other two types. Consequently, the Markov model (like the one in Fig. 1) may be partitioned into three independent ones, each solved separately. The state probability which is needed for the calculation of the computational availability (and the system reliability as well) is equal to the product of the state probabilities obtained separately from the three simpler Markov models for the processors, memory modules, and busses.
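The birth and death chain for the processing power can be solved in a few lines. The Python sketch below assumes that $\beta(i)$ is the expected value of $\min(j, B)$, where $j$ is the number of distinct memory modules addressed by the $i$ outstanding requests and $B$ is the number of busses; this completion of the truncated expression is an assumption, as are the function names. Requests are treated as distinguishable and equally likely to address any module.

```python
from math import comb

def q_exactly_j(i, j, m):
    """Probability that i equally likely requests address exactly j of
    the m memory modules (distinguishable requests)."""
    if j < 1 or j > min(i, m):
        return 0.0
    onto = sum((-1) ** k * comb(j, k) * (j - k) ** i for k in range(j + 1))
    return comb(m, j) * onto / m ** i

def beta(i, m, b):
    """Assumed average number of requests served at once: each addressed
    module serves one request, limited by the number of busses b."""
    return sum(q_exactly_j(i, j, m) * min(j, b)
               for j in range(1, min(i, m) + 1))

def processing_power(p, m, b, delta_over_mu):
    """Expected number of active processors in steady state.

    State i is the number of idle processors; the chain moves from i to
    i + 1 at rate (p - i) * delta and from i to i - 1 at rate beta(i) * mu.
    """
    weights = [1.0]                      # unnormalized steady-state weights
    for i in range(p):
        up = (p - i) * delta_over_mu     # rate i -> i + 1, in units of mu
        down = beta(i + 1, m, b)         # rate i + 1 -> i, in units of mu
        weights.append(weights[-1] * up / down)
    total = sum(weights)
    return sum((p - i) * w for i, w in enumerate(weights)) / total
```

For example, `processing_power(8, 8, 4, 0.91)` evaluates the sketch for eight processors, eight memory modules, four busses and $\delta/\mu = 0.91$; reducing `p`, `m`, or `b` mimics the loss of elements to faults.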
This calculation was done for a multiple bus multiprocessor system with the following parameters:
1) Eight processors with $\alpha = 0.3$, $Ad = 1.5$, $p = 0.9$, and $\lambda = \lambda_0$ (time will be measured in $1/\lambda_0$ units). In addition, $m = 2$, and $\delta/\mu = 0.91$.
2) Eight memory modules with $\alpha = 0.2$, $Ad = 1.2$, $p = 0.92$, $\lambda = \lambda_0$, and $m = 1$.
3) Four busses with $\alpha = 0.12$, $Ad = 0.9$, $p = 0.95$, $\lambda = 0.75\lambda_0$, and $m = 1$.
For these system parameters three sets of values for the number of spare processors, spare memory modules and spare busses were obtained. For maximum wafer-equivalent yield 4, 3, and 2 spare processors, memories, and busses, respectively, are required. In this calculation we have assumed that $f_c = 1$ and therefore, the apparent yield equals the actual one. For maximum wafer-level reliability (1, 3, 1) spares (processors, memories and busses, respectively) are required for a mission time of $t = 0.2/\lambda_0$. For the same mission time, the maximum area utilization (wafer-level processing power availability) is achieved for (2, 3, 1) spares (processors, memories, and busses, respectively).
A useful application for this model might be the analysis of the relative importance of spares for the three types of system elements. To perform this kind of analysis we can, for example, set the number of spare memory modules and spare busses at some fixed values (e.g., their optimal values for a desired mission time) and then, observe the dependency of the area utilization on the number of spare processors. Such an analysis has been done for the above system parameters, and the results are illustrated in Fig. 6 .
In this figure, the area utilization is shown as a function of the number of spares (spare processors or spare memories or spare busses). The notation (-, 3, 1) means that the numbers of spare memories and busses were fixed at 3 and 1, respectively, and the different values for the number of spare processors appear on the horizontal axis.
One conclusion that might be drawn from this figure is that the area utilization measure is more sensitive to the number of spare memory modules, than it is to the other two types of elements.
Another interesting phenomenon was observed while performing this analysis. The optimal number of spares of any type in Fig. 1 is independent of the numbers of spares of the other two types of elements. For example, the curves (2, -, 1) and (0, -, 0) have their maximum at exactly the same value of 3 spare memories. The same phenomenon was observed for a mission time of $t = 0.3/\lambda_0$. Here, the optimal values of spare numbers are: (4, 3, 2) for maximum wafer-equivalent yield (as before), (3, 5, 2) for maximum wafer-level reliability, and (4, 5, 2) for maximum area utilization. However, for a longer mission time (e.g., $t > 0.4/\lambda_0$) where the optimal values of the spare numbers are higher, the above independence is not preserved.
VI. CONCLUSIONS
VLSI and WSI architectures that use redundancy for yield and performance improvement have been considered. The available redundancy on the chip or wafer is primarily limited by the size of the chip or wafer; hence, it is imperative to find a method by which one can optimally share the available redundancy between yield enhancement and performance improvement.
We have developed in this paper analytical models for the evaluation of performance and yield improvement through redundancy. The models proposed can be used to study the effect of sharing element level and system level redundancy, between these two somewhat competing requirements.
