As computer technologies advance to achieve higher performance and density, intermittent failures become more dominant than solid failures, with the result that the effectiveness of any diagnostic procedure which relies on reproducing failures is greatly reduced. This problem is solved at the system level by a new strategy of dynamic error detection and fault isolation based on error checking and analysis of captured information. The model developed in this paper allows the system designer to project the dynamic error-detection and fault-isolation coverages of the system as a function of the failure rates of components and the types and placement of error checkers, which has resulted in significant improvements to both detection and isolation in the IBM 3081 Processor Unit. The model has also resulted in new probabilistic isolation strategies based on the likelihood of failures. Our experiences with this model on several IBM products, including the 3081, show good correlation between the model and practical experiments.
Introduction
Traditionally, failure (fault) isolation in a digital system has been dealt with in terms of testing, with testing procedures or at least specific test patterns generally produced only after the hardware to be tested is designed. Theoretical investigations of the fault-isolation problem have therefore concentrated on methods for producing "efficient" diagnostic test sets [1, 2], including the proposed measure of the diagnosability of systems [3]. An important exception to this testing approach is the use of syndromes produced by error-checking circuits during normal machine operation for fault isolation. This general idea is obvious and has been previously discussed [4], although no follow-on work concerning the effectiveness or coverage of such an approach has been reported.
In the investigations of fault isolation via testing, the assumed starting point is a data matrix $D = [d_{ij}]$, which has one row for every fault $f_i$ ($1 \le i \le n$) and one column for every test $t_j$ ($1 \le j \le m$). An entry $d_{ij}$ in $D$ is 1 if fault $f_i$ is detected by test $t_j$, and 0 otherwise. Based on the characteristics of this matrix, statements can be made about fault distinguishability, fault coverage, and efficiency.
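As a small illustration of the data-matrix view, the following Python sketch uses a hypothetical matrix of four faults and three tests. The values and the helper names `fault_coverage` and `indistinguishable_groups` are invented here for illustration and do not come from the cited work. It computes the fault coverage and the groups of detected faults that no test can tell apart.

```python
from collections import defaultdict

# Hypothetical data matrix D: rows are faults f_1..f_4, columns are tests t_1..t_3.
# D[i][j] == 1 means fault i is detected by test j. Values are illustrative only.
D = [
    [1, 0, 1],  # f_1
    [1, 0, 1],  # f_2 (same row as f_1, so the tests cannot distinguish them)
    [0, 1, 0],  # f_3
    [0, 0, 0],  # f_4 (detected by no test)
]


def fault_coverage(matrix):
    """Fraction of faults detected by at least one test."""
    detected = sum(1 for row in matrix if any(row))
    return detected / len(matrix)


def indistinguishable_groups(matrix):
    """Group detected faults (0-based row indices) whose rows are identical."""
    groups = defaultdict(list)
    for i, row in enumerate(matrix):
        if any(row):
            groups[tuple(row)].append(i)
    return [g for g in groups.values() if len(g) > 1]


coverage = fault_coverage(D)
groups = indistinguishable_groups(D)
print(coverage, groups)
```

Two faults with identical rows produce the same test outcomes, which is why they fall into one group here.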
Problems amenable to solution using such a model generally have the implicit assumption that there are more tests available than actually needed. This is rarely the case in practice. One solution has been the development of procedures either for eliminating redundant tests or for selecting tests which isolate faults to a replaceable package level rather than to the particular circuit [1]. Additional work in testing efficiency has resulted in a procedure for selecting the best testing order to achieve the lowest average test-sequence length [2].
Faults in logic are generally assumed to be of the single stuck-at-0/stuck-at-1 (s-a-0/s-a-1) variety, although other models such as short-circuit faults in LSI and their associated test-generation procedures have been proposed [5]. From a practical standpoint, a fundamental shortcoming with these approaches is that they are based on testing procedures which rely on the presence of permanent faults, whereas in actual practice, most faults are intermittent or transient. It is a well-documented fact that these faults cause most of the serious isolation problems in the field [6-9]. Some work has been reported on the problem of detecting and isolating intermittent faults [10-15] that involves the use of test procedures which attempt to reproduce the assumed randomly occurring intermittent faults. Since the fault may not be present when the test is applied, the approach taken is to repeat the tests many times. There are various suggestions for computing how much repetition is necessary to ensure a specified probability that a fault is or is not present. However, diagnostic time in a customer's office, even for a single pass-through set of tests, is costly; hence, approaches requiring large repetitions of tests are generally unacceptable.
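To give a feel for why repetition becomes expensive, the following Python sketch uses a deliberately simple model that is an assumption of this example, not one of the cited methods: the intermittent fault is active, independently on each test pass, with probability `p_active`, and a pass exposes the fault whenever it is active. The function name `passes_needed` is hypothetical.

```python
import math


def passes_needed(p_active, confidence):
    """Smallest n such that 1 - (1 - p_active)**n >= confidence.

    Assumes (for illustration only) that the fault is active independently
    on each pass with probability p_active.
    """
    if not (0.0 < p_active < 1.0) or not (0.0 < confidence < 1.0):
        raise ValueError("p_active and confidence must lie strictly between 0 and 1")
    n = max(1, math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_active)))
    # Guard against floating-point rounding pushing the estimate one pass too high.
    while n > 1 and 1.0 - (1.0 - p_active) ** (n - 1) >= confidence:
        n -= 1
    return n


for p in (0.5, 0.1, 0.01):
    print(p, passes_needed(p, 0.99))
```

Under this model the required number of passes grows roughly in inverse proportion to `p_active`, so rarely active faults demand hundreds of passes.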
The approach to fault isolation described in this paper is to capture and interpret the syndromes resulting from error checkers that are active during normal machine operation. The advantage of such an approach is that error checkers can detect all classes of errors which exist during each checking cycle. For example, these errors may be caused by faults of the following types: stuck-at, intermittent, single or multiple, transient, externally (environmentally) induced, pattern-sensitive, or any other fault which causes the computer's state to be outside the code space of the appropriate detection mechanism or error checker.
In order to analyze the fault detection and isolation coverage, we introduce a dynamic fault-probability model for projecting the ability of a system to detect and isolate failures, including intermittent failures, in an operational environment. In fact, this is a fundamental hardware design and diagnostic requirement for the 3081 Processor Unit maintenance strategy [16]. Relevant figures-of-merit are defined and computed using the model. The model is an integral part of an overall design evaluation procedure to enhance a product's reliability, availability, and serviceability. By using the model, it has been possible to define probabilistic repair strategies and to project their effectiveness in terms of parts replaced per repair action and the probability of correct repair. Such probabilistic repair strategies are extensively used in the 3081 [16]. Further use of this model in the 3081 design has resulted in improved error-checker placement to enhance fault isolation, and the inclusion of improved error-detection mechanisms to cover otherwise weakly checked areas.
The remaining sections of this paper cover a system design strategy for intermittent and solid faults, a new concept of probabilistic fault isolation, the mathematical model for numerically characterizing the dynamic error-detection and fault-isolation probabilities of a system, detection and isolation coverages, the procedure and ground rules for obtaining data for the model, and our experiences with IBM products using the model.
System design and maintenance strategy
Dynamic error-detection mechanisms are usually implemented in various combinations of hardware, microcode, and software, and have had as their historic raison d'etre the detection of errors during normal operation so that recovery procedures may be initiated (i.e., retry). A recommended system strategy is to take advantage of and, where necessary, to enhance these capabilities to achieve efficient and effective fault isolation.
To take advantage of these mechanisms for the isolation of intermittent faults, the system architecture must provide for the automatic capture of all necessary error information as it is detected, and before it is destroyed, altered, or overwritten by the recovery process. Thus, such an architecture would allow for what is called error logging. The concept of error logging was first introduced in the IBM System/360 [17, 18], although it has not been extensively used as an isolation tool, largely because it was not possible to project the error-detection or fault-isolation coverage. A second requirement is a procedure, preferably automated, for analyzing the error log to determine the physical location of the fault. Ideally, a set of error log analysis routines would replace the conventional set of diagnostic tests as the primary fault-isolation tool. A third requirement is a design-review procedure and fault-probability model to allow advance projections to be made of the dynamic error-detection and fault-isolation probabilities. This is crucial because a strategy based on dynamic error detection generally requires changes to hardware functions (e.g., more error checkers or better placement of them) that would be practically impossible to make after an LSI design was finalized.
The importance of a timely evaluation of any aspect of the system reliability or fault-tolerant coverage cannot be overstressed. Reference [19] graphically illustrates the current lack of rigorous state-of-the-art methods for the projection of fault-tolerant coverage. Error detection and fault isolation are important aspects of fault-tolerant coverage.
With this recommended system strategy, the fault-isolation characteristics of a system are not treated as an "add-on" feature provided by diagnostic tests; rather, they are a direct result of the hardware functions of error detection and are, thus, the primary responsibility of the hardware designer. Reference [4(b)] gives design information for error-detection circuits and methods.
Dynamic fault-probability model for detection and isolation
Coverage figures-of-merit for fault-tolerant systems
The proposed strategy requires several figures-of-merit for the coverage in order to properly characterize the fault-tolerant performance of the system and to provide useful feedback for making design improvements. The first is the error-detection coverage or percentage (referred to as ED).
Another, the fault-isolation coverage (FI), is more complex to describe. First, it is possible to characterize a system by its fault-isolation distribution $(I_1, I_2, \ldots, I_n)$, where $I_1$ is the portion of all faults which implicate a single field-replaceable unit (FRU). In general, $I_i$ is the portion of all faults which implicate exactly $i$ FRUs. This is an unambiguous way of specifying the fault-isolation coverage which is independent of isolation strategy. (Other figures-of-merit for fault-isolation coverage are presented in a later section, along with a discussion of isolation strategies.) The total system contains $N^* = \sum_i n_i$ faults, where $n_i$ is the number of faults on FRU$_i$. Associated with each FRU$_i$ is a fault probability $p_i$, which is the sum of the fault probabilities for faults contained on FRU$_i$. That is,
$$p_i = \sum_{f_j^* \in \mathrm{FRU}_i} p_j^*,$$
where $p_j^*$ is the probability of fault $f_j^*$. For practical systems, the size of the fault set $N^*$ is unmanageably large. For a FRU containing $m$ lines, $N^*$ will be $3^m - 1$; therefore, in order to analyze any practical systems of even medium size, it is mandatory that procedures for grouping faults be developed.
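The growth of the fault set can be made concrete with a few lines of Python. The sketch assumes the usual multiple stuck-at reading of the count: each of the $m$ lines is either fault-free, stuck-at-0, or stuck-at-1, and the single all-fault-free combination is excluded. The function name is hypothetical.

```python
def multiple_stuck_at_fault_count(m):
    """Number of distinct single or multiple stuck-at faults on m lines.

    Each line is fault-free, s-a-0, or s-a-1 (3 choices); the one
    combination in which every line is fault-free is not a fault.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    return 3 ** m - 1


for m in (1, 2, 10, 20, 100):
    print(m, multiple_stuck_at_fault_count(m))
```

Even for 20 lines the count exceeds three billion, which is the practical motivation for grouping faults by function.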
Based on the use of error checkers and captured machine-state information, a system contains $M$ nonzero, mutually exclusive syndromes, $S_1, S_2, \ldots, S_M$, together with the null syndrome $S_{\mathrm{null}}$. A syndrome is the logical product (Boolean "AND") of identifiers of an error checker and its active source at the time the error is captured; thus, it is the result of a dynamically detected error. (Examples of syndromes are given in the following section.)
The dynamic error-detection characteristics of the system are represented by a fault-conditional-probability data matrix and a vector of fault probabilities. Figure 1 shows the full fault-conditional-probability data matrix $D^*$ of dimension $N^* \times M$ and the full fault-probability vector $P^*$ of length $N^*$. $D^*$ has a row for each fault and a column for each syndrome. The matrix entry $d_{jk}^*$ is the conditional probability that $S_k$ occurs, given fault $f_j^*$. It is convenient to deal with this set of information in a compressed form, i.e., in the form of a compressed fault-conditional-probability data matrix $D$, which has dimension $N \times M$ and has one row for each FRU$_i$ and one column for each syndrome $S_j$.
The compressed fault-probability vector $P$ has one entry $p_i$ for each FRU$_i$, equal to the sum of all fault probabilities on FRU$_i$ (see Fig. 2). The entry $d_{ij}$ of $D$ is simply obtained as the sum of products:
$$d_{ij} = \Pr(S_j \mid \mathrm{FRU}_i \text{ fails}) = \frac{1}{p_i} \sum_{f_k^* \in \mathrm{FRU}_i} p_k^* \, d_{kj}^*,$$
where $\Pr(S_j \mid \mathrm{FRU}_i \text{ fails})$ is the probability of $S_j$ due to a failure on FRU$_i$.
[Figure 2: Compressed fault-probability vector P and compressed fault-conditional-probability data matrix D.]
The computational algorithms all have as their basis the compressed matrix $D$, the compressed fault-probability vector $P = (p_1, p_2, \ldots, p_N)$, and the joint compressed-probability matrix $Q$ with entries $q_{ij} = p_i d_{ij}$ (see Fig. 3). Matrix $Q$ is useful since it contains the maximum information; i.e., the sum of each row is equal to the detected fault probability for FRU$_i$, while the sum of each column is equal to the syndrome-occurrence probability for syndrome $S_j$. Each "normalized" (for the total number of detected faults) column equals the distribution of likelihoods for FRU failure candidates, given a particular syndrome occurrence. The total sum of all the entries in $Q$ is the error-detection probability (ED); thus $\Pr(S_{\mathrm{null}}) = 1 - \mathrm{ED}$. The error-detection probability for FRU$_i$ equals the sum of the row divided by $p_i$. Since the fault isolation (FI) is normally defined with respect to detected errors only, for purposes of computing isolation-coverage figures-of-merit, the matrix $Q$ is normalized by dividing it by $[1 - \Pr(S_{\mathrm{null}})]$.
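The compression step and the quantities derived from $Q$ can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the fault list is hypothetical, the joint entries are taken to be $q_{ij} = p_i d_{ij}$ (inferred from the row-sum and column-sum properties described above), and exact fractions are used so the arithmetic can be checked by hand. The names `FAULTS` and `compress` are invented for this example.

```python
from fractions import Fraction as F

# Hypothetical fault list: (FRU index, fault probability, row of D* for that fault).
# Each row of D* gives Pr(S_j | fault) for the nonzero syndromes S_1, S_2;
# whatever is left over is the probability of the null syndrome.
FAULTS = [
    (0, F(1, 10), [F(1), F(0)]),
    (0, F(3, 10), [F(1, 2), F(0)]),
    (1, F(6, 10), [F(1, 4), F(1, 2)]),
]
N_FRUS, N_SYNDROMES = 2, 2


def compress(faults, n_frus, n_syndromes):
    """Return (P, D, Q): per-FRU fault probabilities, compressed conditional
    matrix, and joint-probability matrix. Assumes every FRU has at least one
    fault with nonzero probability."""
    P = [F(0)] * n_frus
    Q = [[F(0)] * n_syndromes for _ in range(n_frus)]
    for fru, p_fault, d_row in faults:
        P[fru] += p_fault
        for j, d in enumerate(d_row):
            Q[fru][j] += p_fault * d  # joint probability of (fault on FRU, S_j)
    D = [[q / P[i] for q in Q[i]] for i in range(n_frus)]
    return P, D, Q


P, D, Q = compress(FAULTS, N_FRUS, N_SYNDROMES)
ED = sum(sum(row) for row in Q)               # error-detection probability
Q_norm = [[q / ED for q in row] for row in Q]  # normalized by 1 - Pr(S_null)
print(P, D, ED, Q_norm)
```

In this toy system the second fault on FRU 0 produces the null syndrome half of the time, which is what pulls ED below 1.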
Fault-isolation strategies
Before computing figures-of-merit for dynamic fault isolation of a system, it is useful to introduce the concept of isolation strategy. A strategy is a definition, for each syndrome $S_j$, of the specific repair action to be taken given the occurrence of $S_j$. The action or strategy is based on the particular FRUs implicated by $S_j$. In a system perfectly designed for fault isolation, every syndrome implicates a single FRU, and a discussion of strategy is unnecessary. However, this is not generally the case, and when an ambiguous syndrome occurs a choice must be made between replacing either all of the implicated FRUs or some subset of them. For the latter case, there is a risk involved since the FRU containing the actual fault may not be replaced; thus a subsequent occurrence of the fault may arise, necessitating another repair call. Based on the syndrome-probability matrix, all important factors in such a trade-off can be quantified.
Examples of various strategies include the following. For first-call repairs we use the deterministic isolation procedure (DIP) strategy; all implicated FRUs are replaced. For second-call repairs, the action can be either (a) to replace the single most likely FRU on the first call and replace the remainder on a subsequent call; or (b) on the first call, to replace a subset of FRUs that is sufficient to account for a threshold portion (e.g., 90%) of the failure likelihood. The latter is a limited-risk strategy. For Nth-call repairs, we use the sequential isolation procedure (SIP) strategy; the action taken depends on the number of calls. On the first call, the most likely FRU is replaced; on the second call, the next most likely FRU is replaced, etc., until on the Nth call, the least likely FRU is replaced. The choice among these strategies reflects trade-offs among the costs for parts, labor, and customer outage associated with multiple calls to fix a problem.
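The three kinds of strategy can be sketched as functions of one syndrome's likelihood column. In the Python below, the column values, the FRU names, and the function names `dip`, `sip_order`, and `limited_risk` are all hypothetical; the 90% threshold is the example value mentioned in the text.

```python
def dip(column):
    """Deterministic isolation: replace every implicated FRU."""
    return [fru for fru, q in column.items() if q > 0]


def sip_order(column):
    """Sequential isolation: implicated FRUs from most to least likely,
    one replaced per service call."""
    return sorted(dip(column), key=lambda fru: column[fru], reverse=True)


def limited_risk(column, threshold=0.9):
    """Replace the smallest most-likely-first subset whose combined
    likelihood reaches the threshold share of the column total."""
    total = sum(column.values())
    chosen, covered = [], 0.0
    for fru in sip_order(column):
        chosen.append(fru)
        covered += column[fru]
        if covered >= threshold * total:
            break
    return chosen


# Hypothetical likelihoods of each FRU given that one particular syndrome occurred.
column = {"card_A": 0.60, "cable": 0.35, "card_B": 0.05, "card_C": 0.0}

print(dip(column))
print(sip_order(column))
print(limited_risk(column))
```

Here the limited-risk choice replaces two FRUs on the first call and accepts a small chance that the fault is actually on the third.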
Figures-of-merit for fault-isolation coverage
It was previously noted that the fault-isolation distribution characterizes fault-isolation coverage independently of the fault-isolation strategy. However, in order to reflect the very real practical differences between possible isolation strategies, additional figures-of-merit for the coverage are required. Two useful measures are the average number of FRUs replaced per detected fault (NFRU) and the average number of service calls per detected fault (NSC). Having calculated these values, a direct estimate of the cost of each strategy over the life of the system or product can be computed. NFRU and NSC can be computed directly from the normalized Q, P, and the previously stated strategy definitions.
For each syndrome $S_j$, $d_j$ is defined as the number of FRUs implicated by syndrome $S_j$. Based on the syndrome probability matrix $Q$, $d_j$ is the number of nonzero entries in column $j$ of $Q$. The probability of isolating to one FRU, given a system failure, is computed by summing all entries in the columns of $Q$ which have $d_j = 1$. Using DIP strategy, this probability is given by
$$\sum_{j:\, d_j = 1} \; \sum_{i=1}^{N} q_{ij}.$$
The remainder of the fault-isolation distribution can be computed using the general relation of the probability of isolating to $K$ FRUs,
$$\sum_{j:\, d_j = K} \; \sum_{i=1}^{N} q_{ij}.$$
The sequential isolation bound reflects the maintenance strategy of selecting for replacement the most likely FRU implicated by an ambiguous syndrome. An ambiguous syndrome is represented in the matrix $Q$ by a column containing more than one nonzero entry. Choosing the most likely FRU for such a syndrome corresponds to choosing the FRU associated with the maximum entry in each column. Let $\max_i (q_{ij})$ be the largest entry in column $j$ of $Q$. Then for the sequential isolation, the probability of isolating to one FRU is given by
$$\sum_{j=1}^{M} \max_i (q_{ij}).$$
Note that this will include all the entries summed for the deterministic isolation bound.
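Both bounds can be computed directly from a normalized joint-probability matrix. The following Python sketch uses a hypothetical matrix of three FRUs (rows) and three syndromes (columns) whose entries sum to 1; the function names are invented for this example.

```python
from fractions import Fraction as F

# Hypothetical normalized joint-probability matrix Q (rows: FRUs, columns: syndromes).
Q = [
    [F(3, 10), F(1, 10), F(0)],
    [F(0),     F(2, 10), F(1, 10)],
    [F(0),     F(0),     F(3, 10)],
]


def implicated_counts(Q):
    """d_j: number of nonzero entries in column j."""
    return [sum(1 for row in Q if row[j] != 0) for j in range(len(Q[0]))]


def isolation_distribution(Q):
    """Map K -> probability of isolating to exactly K FRUs (deterministic)."""
    dist = {}
    for j, d_j in enumerate(implicated_counts(Q)):
        column_sum = sum(row[j] for row in Q)
        dist[d_j] = dist.get(d_j, 0) + column_sum
    return dist


def sequential_bound(Q):
    """Probability that the most likely FRU of the observed syndrome is the faulty one."""
    return sum(max(row[j] for row in Q) for j in range(len(Q[0])))


print(implicated_counts(Q))
print(isolation_distribution(Q))
print(sequential_bound(Q))
```

In this example only the first syndrome is unambiguous, so the deterministic single-FRU probability is 3/10, while picking the most likely candidate for every syndrome raises it to 4/5.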
NFRU using DIP and SIP strategies
With $d_j$ as defined above, NFRU can be computed using DIP strategy with the relation
$$\mathrm{NFRU(DIP)} = \sum_{j=1}^{M} d_j \sum_{i=1}^{N} q_{ij}.$$
Average number of service calls per fault (NSC)
For the DIP strategy, the number of service calls per failure is equal to one, since this strategy specifies that all implicated FRUs will be replaced on the first call. In the case of SIP strategy, the average number of service calls is equal to the average number of FRUs replaced:
NSC(SIP) = NFRU(SIP).
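A short Python sketch of these two figures-of-merit follows. The DIP computation weights each syndrome's probability by the number of FRUs it implicates. The SIP computation is an assumption of this example rather than a relation quoted from the text: it charges one replacement (and one call) per candidate tried, in decreasing order of likelihood, until the faulty FRU is reached. The matrix values and function names are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical normalized joint-probability matrix Q (rows: FRUs, columns: syndromes).
Q = [
    [F(3, 10), F(1, 10), F(0)],
    [F(0),     F(2, 10), F(1, 10)],
    [F(0),     F(0),     F(3, 10)],
]


def nfru_dip(Q):
    """Average FRUs replaced per detected fault when all implicated FRUs are replaced."""
    total = 0
    for j in range(len(Q[0])):
        column = [row[j] for row in Q]
        d_j = sum(1 for q in column if q != 0)
        total += d_j * sum(column)
    return total


def nfru_sip(Q):
    """Average FRUs replaced per detected fault when candidates are replaced
    one at a time, most likely first (assumed model, see lead-in)."""
    total = 0
    for j in range(len(Q[0])):
        ranked = sorted((row[j] for row in Q), reverse=True)
        # If the k-th most likely candidate holds the fault, k FRUs get replaced.
        total += sum(k * q for k, q in enumerate(ranked, start=1))
    return total


nsc_dip = 1              # all implicated FRUs go in on the first call
nsc_sip = nfru_sip(Q)    # one call per FRU replaced
print(nfru_dip(Q), nfru_sip(Q), nsc_dip, nsc_sip)
```

The comparison shows the trade-off in miniature: DIP uses more parts per fault but always one call, while SIP uses fewer parts at the cost of additional calls.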
The ED/FI process: obtaining data for the fault-conditional data matrices
It has been found practical to obtain data for the fault-probability models where circuits have been grouped into functions. This corresponds to a matrix $D^*$, which is somewhat larger than the compressed matrix $D$. It has further been found practical to have the designer provide data for each FRU in terms of a data set called a basic information table. Guidelines provided to designers suggest that entries should correspond to functions such as registers, decoders, parity checkers, arrays, ALUs, etc., each of which may have hundreds of circuits, and represent hundreds of faults in $D^*$. The association of a failing function with the error checker that detects its failure (according to the system design) and also with the syndrome (checker combined with machine state) is the designer's job.
Where more than one nonzero syndrome is possible, for example due to varying path usage or instruction mix, the relative likelihood is also estimated by the designer. Many circuit groups performing a function, such as a register or decoder, are such that their faults are not uniform in detection probability. For example, a single failure in circuits in gating and clock distribution of an LSI register causes multiple bit errors and hence causes the null syndrome with probability 0.5. On the other hand, the latches will not cause the null syndrome.
This gives rise to the concept of checker effectiveness, which represents the weighted average across the failures in a representative function of the detection probabilities. Although it is called checker effectiveness, it actually is associated with the function being checked. Its utility is in avoiding repeated detailed circuit input to the basic information table when a common design function is repetitively used.
By using the concept of checker effectiveness in this manner, a designer can provide as input to the evaluation an additional table called the checker information table, consisting of a row for each identified checker in the system, each row containing the checker name and the effectiveness of the checker. The references to checker C in the "Syndrome of function" column will automatically cause a cross reference to the checker information table to obtain the detection probability or effectiveness of 0.9, a value previously determined.
A limitation on the use of "checker effectiveness" arises in a design when a particular checker checks information from more than one source and the two sources have different fault distributions. For example, a parity checker checks a storage array as well as a data register, and the fault distributions are different. In this case, the basic information table entries must list the detailed fault distributions of each function.
Figure 4 shows an LSI parity-checked register chip along with its basic information table. The relative failure rates within the chip of the three categories, logic, mechanical interconnections (pads), and power distribution within the chip, are provided by the technology developers using a projection methodology analogous to the one described in [20]. Circuit and pad counts are provided by the logic designer. Performing the arithmetic to compute an overall detection probability for this chip gives 0.8875. If this same chip part number is used elsewhere in the system, it is worthwhile to use the checker effectiveness idea (checker information table) to avoid repeated detailed data entry.
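The arithmetic behind an overall detection probability is a failure-rate-weighted average, which the following Python sketch illustrates. The rows below are invented for illustration and are not the data of Fig. 4, so the result differs from the 0.8875 quoted above; the table layout, the checker name `parity_check_A`, and the function names are likewise hypothetical.

```python
# Hypothetical basic-information-table rows for one chip:
# (function, relative failure rate, detection probability).
CHIP_ROWS = [
    ("latches", 0.50, 1.00),
    ("gating and clock distribution", 0.25, 0.50),
    ("pads and power distribution", 0.25, 0.75),
]


def checker_effectiveness(rows):
    """Failure-rate-weighted average of the detection probabilities."""
    total_rate = sum(rate for _, rate, _ in rows)
    detected = sum(rate * p_detect for _, rate, p_detect in rows)
    return detected / total_rate


# Hypothetical checker information table: checker name -> effectiveness.
# Computing the value once lets every reuse of the same chip refer to it by name.
CHECKER_TABLE = {"parity_check_A": checker_effectiveness(CHIP_ROWS)}


def detection_probability(checker_name):
    """Look up a previously determined effectiveness."""
    return CHECKER_TABLE[checker_name]


print(detection_probability("parity_check_A"))
```

The lookup table only pays off when the same function and fault distribution recur; a checker fed by sources with different distributions needs its own detailed rows, as noted above.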
Enhancing parity-check detection effectiveness
Detection probability of 0.5 is assigned to those circuit faults which can cause multiple bits to be in error in a parity-checked field, for example, gate and clock circuits. If the gate or clock circuit is actually controlling data, e.g., from two bytes instead of from one, the detection probability for this class of circuit is $1 - \frac{1}{4} = 0.75$. This fact is useful in the implementation of multi-byte data flows, in order to increase the parity-check-detection probability by packaging. Depending on the ratio of such common logic faults to those faults which cause only single bits in error, such packaging variations can have a pronounced effect on the overall detection probability. Figure 5 shows four packaging schemes for parity-checked LSI registers. A common failure of a single byte is now divided into multiple parity-check cases; therefore, the average detection probabilities differ among these four schemes. Some of these variations for the 3081 processor technologies have been evaluated and tabulated as design guidelines in Table 1.
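The effect of spreading a common fault across several parity-checked fields can be sketched in a few lines of Python. The sketch assumes that each affected field independently escapes its parity check with probability 1/2, and that faults causing only a single bit error are always caught by parity; both function names and the 20% common-fault share are hypothetical.

```python
def common_fault_detection(n_fields):
    """Detection probability for a common (gate or clock) fault that disturbs
    n_fields independently parity-checked fields, each missed with probability 1/2."""
    if n_fields < 1:
        raise ValueError("n_fields must be at least 1")
    return 1.0 - 0.5 ** n_fields


def overall_detection(common_fraction, n_fields):
    """Blend of single-bit faults (assumed always detected) and common faults."""
    single_bit_fraction = 1.0 - common_fraction
    return single_bit_fraction * 1.0 + common_fraction * common_fault_detection(n_fields)


for n in (1, 2, 4):
    print(n, common_fault_detection(n), overall_detection(0.2, n))
```

With one field the common fault is caught half the time; with two fields the figure rises to 0.75, matching the value in the text, and the overall figure moves accordingly.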
Other detection mechanisms: detection effectiveness
Parity checking has been emphasized in the preceding examples because, in most digital systems, parity checking accounts for 70 to 80% of the detection coverage. Control logic which consists of counters and registers may also be parity checked. There is a portion of control logic, however, for which parity is not appropriate in general. For example, decoders have the natural characteristic that a one-and-only-one check may be performed. Since this type of check amounts to a practical duplication of the logic, it is generally [...] be evaluated by considering, as the set of all faults, each gate stuck-at-zero (s-a-0) and stuck-at-one (s-a-1). When the check circuit itself is included in the function, the overall "effectiveness" is shown in Fig. 6. The complexity of such an analysis shows that the detection effectiveness concept applied to commonly repeated portions of a design is very useful in practice in reducing the amount of the input required.
D. C. Bossen and M. Y. Hsiao
Model input
Figure 7 shows data flow where a downstream checker C checks three sources on three different FRUs. The implicated FRU set for the condition "C active" would be all three sources as well as the receiver, or four FRUs. The addition of active source identifier logic to capture the name of the active path will improve isolation. The input for this situation is also shown in Fig. 7, where it is seen that isolation is now to two FRUs.
One important point which this example brings out is the fact that isolation is generally improved by including machine-state information with the error checker in order to define the syndrome. In other words, a single physical error checker can give rise to a number of unique syndromes, each with its own set of implicated failures, when machine state is included together with checker output. This is a very useful observation when the design changes are being considered in order to improve isolation coverage. This fact also illustrates the importance of error logging for producing good isolation. Exactly what gets logged, in addition to error check outputs, will have a tremendous impact on the isolation coverage in general. Using the ED/FI model, it is possible to rapidly assess the impact on fault-isolation coverage of proposed design changes, as well as enhanced machine state error logging.
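The effect of adding machine state to the syndrome can be sketched as a lookup from syndrome to implicated FRU set. The Python below models the situation described for Fig. 7 with hypothetical names (`source_1`, `FRU_1`, and so on); the actual figure data are not reproduced here.

```python
# Hypothetical layout: three sources on three FRUs feed one receiver FRU,
# where a single downstream checker sits.
SOURCES = {"source_1": "FRU_1", "source_2": "FRU_2", "source_3": "FRU_3"}
RECEIVER_FRU = "FRU_4"


def implicated_frus(checker_active, active_source=None):
    """FRU set implicated by a syndrome.

    Without a logged source identifier the syndrome is just "checker active",
    so every source and the receiver are suspects. With the active source
    logged, the syndrome is (checker, source) and only that path is suspect.
    """
    if not checker_active:
        return set()
    if active_source is None:
        return set(SOURCES.values()) | {RECEIVER_FRU}
    return {SOURCES[active_source], RECEIVER_FRU}


without_state = implicated_frus(True)
with_state = implicated_frus(True, active_source="source_2")
print(len(without_state), sorted(with_state))
```

One physical checker thus yields three distinct syndromes once the active source is logged, each implicating two FRUs instead of four.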
Illustrative example
The system under consideration is shown in Fig. 8, along with its basic information table. It consists of two logic cards (FRUs) and an interconnecting cable, also a FRU. The denominator of the fraction shown with each item is the failure rate expressed in some consistent units. The numerator is obtained as the product of the failure rate times the error-detection probability for the item, determined according to the ground rules previously stated.
The checker information table is not used since functions checked by checker C have different detection probabilities. The compressed fault-probability vector $P$, the compressed fault-conditional-probability matrix $D$, and the compressed joint-probability matrix $Q$ are shown in Fig. 9. Dividing $Q$ by $[1 - \Pr(S_{\mathrm{null}})]$ gives the "normalized" joint-probability matrix $Q$ of Fig. 9, where the set of $d_j$ giving FRUs per syndrome is also indicated. The error-detection probability equals [...]
ED/FI experience with IBM products
The ED/FI model and evaluation procedures described in this paper were developed to meet a practical and real need, and they have been used throughout IBM since 1975 for assessing error-detection and fault-isolation coverages of numerous product designs. The definitions and ground rules have been extended beyond the logic and electronic areas into the electromechanical products such as printers and disk drives. The initial experience in developing the model came about in the early 3081 processor development, where the system designers, diagnostic developers, and field engineers agreed that the change in technology required a diagnostic strategy to handle intermittent errors. Therefore, there was a strong need to assess the error-detection and fault-isolation coverage. There were at that time, however, no definitions, ground rules, or procedures for projecting such coverage within IBM or reported in the literature [19].
Experience has shown the value of using the ED/FI evaluation procedure repetitively as the design progresses. Early evaluation of 3081 system error-detection coverage yielded about 60% for the CPU. The current level of better than 90% was achieved by dedicated concentration on adding checking to the initially weak areas, as shown by the model. Because designers became educated and sensitive to the need for good error-detection coverage, a number of innovations in error checking were made. These were primarily in the control areas of the machine, and included decoder checks, illegal pattern checking, encoder checks, and the application of parity to address and control fields as standard practice. Packaging arrangements to enhance parity-error detection, as pointed out in Table 1, were extensively used. Fault-isolation coverage was enhanced by better placement of checkers and by the identification of machine-state information to produce better syndromes. Some sample output reports are shown in Figs. 10 and 11 for a representative large system. Reports are typically produced on a per-FRU basis for designer feedback.
Independent verification of the projections, by hardware bugging using a limited sample size on the 3081, shows good correlation to the coverage values projected by the model. This model and the ED/FI process have been extended and used in IBM to project isolation coverage for error-recreation diagnostic programs. The circuit and fault coverages of individual tests are determined by test generation and simulation programs. Each test is treated as an additional checker added to the hardware checker lists and the same ED/FI analysis procedure follows to compute an ED/FI percentage for the solid-failure case using diagnostics. This allows a complete evaluation of the service plan for a system in both the operational environment (error checkers and log analysis only) and the maintenance environment (where error re-creation may be used).
Summary
This paper describes a model for projecting error-detection and fault-isolation coverages. Implicit in interpreting the results of the model is a maintenance strategy of fault isolation based on dynamically detected errors. The model has been applied in many practical design cases, and evaluation results have suggested weak areas in both error-detection and fault-isolation coverage, with improvements having been made accordingly. Designers are given specific ground rules for generating inputs to the model, and are provided error-detection coverage numbers and computational guidelines for most error checkers in use.
The following lists some of the ED/FI evaluation benefits:
1. Product designers are aware of the need for error detection and FRU isolation early in the design phases. The evaluation procedure gives early feedback regarding areas needing improvement. (It is noteworthy that inventions are often made in response to such needs.) Improvements in error detection made in this iterative fashion, especially in LSI technologies, have had minimum impact on product cost, performance, and schedule.
2. By placing primary emphasis on error checkers which can detect errors from all causes, including intermittent errors, problems associated with a maintenance strategy of reproducing errors with diagnostic programs or procedures are eliminated. This benefit results in greatly reduced mean-time-to-repair, as well as reduced parts costs.
3. Data integrity, which depends first and foremost on the detection of errors, can be designed directly into the hardware. Using the model for error-detection coverage, quality can be projected in advance.
