A technique for evaluating the application of the pin-level stuck-at fault model to VLSI circuits by Finelli, George B. & Palumbo, Daniel L.
September 1987 
https://ntrs.nasa.gov/search.jsp?R=19870018592 2020-03-20T10:09:29+00:00Z
NASA 
Technical 
Paper 
2738 
1987 
National Aeronautics 
and Space Administration 
Scientific and Technical 
Information Office 
A Technique for Evaluating
the Application of the
Pin-Level Stuck-At Fault
Model to VLSI Circuits
Daniel L. Palumbo 
and George B. Finelli 
Langley Research Center 
Hampton, Virginia
Identification of commercial products in this report is 
included only to adequately describe the equipment and does 
not constitute an official endorsement, either expressed or 
implied, of such products by the National Aeronautics and 
Space Administration. 
Contents 
Summary
Introduction
Background
The detection process
Reconfiguration
The validation methodology
Objective and Approach
Experiment Definition
Experiment Data and Analysis
Static error behavior
Dynamic error behavior
Wilcoxon’s rank-sum test
Assumptions
VLSI circuit model: the virtual microprocessor
Fault model: the stuck-at
Data model
The error model
Results
Results of Static Error Analysis
Results of Dynamic Error Analysis
Comparison of Internal and Pin-Level Fault Behavior
Discussion
Conclusion
Appendix A: Model of Failure-Recovery Process
Appendix B: Experiment Configuration and Procedure
The Vmp
The Fibonacci Series
The Fault Injector
The Logic Analyzer
The Experiment Configuration
The Experiment Procedure
Appendix C: Fibonacci Program Listing
References
Tables
Figures
Summary 
This paper describes a technique by which a researcher
can evaluate the capability of the pin-level
stuck-at fault model to represent true error behavior
in very large scale integrated (VLSI) digital circuits.
Accurate fault models are required to conduct the
experimentation recommended by an earlier study
of proposed validation methodologies for highly
reliable fault-tolerant computers (e.g., computers with
a probability of failure of $10^{-9}$ for a 10-hour
mission). The validation experiments are designed to
measure the recovery process parameters of
fault-tolerant computers.
The quality of the pin-level stuck-at fault model
is assessed by comparing the error behavior which
results from faults applied at the pins of a VLSI
circuit with the error behavior produced by faults
applied to gates internal to the VLSI circuit. The
internal, gate-level faults are assumed to produce “true”
error behavior. Error behavior is observed at the pins
of the VLSI circuit, in this case a “virtual
microprocessor.” In the presence of internal faults, the error
behavior at the pins of the virtual microprocessor is
more dynamic than static.
To study the dynamic error behavior, a hypothesis
is put forth that pin-level stuck-at faults produce
error behavior that is similar to the error behavior
caused by internal device failures. The hypothesis
is tested with Wilcoxon’s rank-sum test. It is found
that this technique is tractable and has the potential
to produce meaningful results. Using a sample
data set, a set of bar charts is derived which shows
the tendency to reject the hypothesis. Many of the
pin-level faults exhibit very little rejection. However,
in some test cases, the results strongly suggest
rejection, especially in the modeling of the duration of
errors. A firm conclusion cannot be drawn because
of the preliminary nature of the sample data. The
results do imply that the application of the pin-level
stuck-at fault model requires careful consideration.
Additional experimentation is needed to confirm the
use of the model before it can be used with confidence
when validating highly reliable digital systems.
Introduction 
The ability to simulate faults in digital circuitry 
is an important issue in a proposed methodology 
for validating the failure recovery process in
fault-tolerant computers (ref. 1). A common method of
simulating a fault in a physical circuit is fault injec- 
tion on the pins of the circuit components (ref. 2). 
Typically, the injected fault takes the form of a 
“stuck-at” fault (ref. 3). The accuracy with which 
an injected fault models its physical counterpart is 
important if confidence in the validation process is 
to be established. When high reliability is specified, 
such as a probability of failure of $10^{-9}$ for a 10-hour
mission, high confidence levels and, therefore, accu- 
rate fault models are required. Reference 1 cautions 
that the pin-level stuck-at fault model may not be 
adequate for circuits composed of very large scale in- 
tegrated (VLSI) components. A technique by which 
the capability of the pin-level stuck-at fault model
can be evaluated is the subject of this paper. 
As background, the fault recovery process is de- 
fined and validation methods for the recovery process 
are described. The objective and approach of this 
work are then stated, followed by a detailed descrip- 
tion of the experiment, the results, and conclusions. 
Background 
Fault-tolerant computers which are based solely 
on redundant hardware and voting cannot meet 
high-reliability requirements unless the failure rates 
currently obtainable are reduced by an order of 
magnitude (ref. 4). Today’s highly reliable fault- 
tolerant system must incorporate a recovery process 
to remove and, possibly, replace faulty components. 
There are two important aspects of the recovery pro- 
cess of a fault-tolerant system that can be exercised 
by fault injection: the ability to detect the fault and 
the ability to reconfigure successfully after the fault 
has been detected. 
The detection process. The presence of a fault in a 
system is detected when erroneous system behavior 
is recognized by a system monitor (voter, watchdog 
timer, etc.). However, the existence of a fault does 
not guarantee that the system will exhibit erroneous 
behavior. The errors produced by the faulty com- 
ponent must be propagated to the system monitor 
level. 
Two additional factors complicate fault detection: 
different components may produce similar error be- 
havior; and detected errors may be transient, that is, 
not the result of hard faults. When presented with 
ambiguous error behavior, isolation procedures (such 
as a specific series of tests) locate the faulty compo- 
nent. Verification procedures filter out transient er- 
rors by logging the errors until, based on some heuris- 
tic, a hard fault is declared. In this study, isolation 
and verification are considered part of the detection 
process. Both isolation and verification require extra 
data, and therefore time, before the recovery process 
can begin a reconfiguration. 
The detection process can thus be divided into 
three tasks: error propagation, fault isolation, and 
fault-type verification. The time spent performing 
each of these tasks depends a great deal on which 
component is faulty in the circuit. Yet these times 
are more directly dependent on the error behavior 
observed by the monitor. The error behavior is 
not solely a product of hardware characteristics but 
depends to a large degree on the input presented to 
the hardware (e.g., software). 
Reconfiguration. A successful recovery results 
in the logical restructuring of the system around 
the fault, that is, reconfiguration. Previous studies 
have shown that the reconfiguration time is not as 
dependent on the fault present in the circuit as is 
the detection time (ref. 2).  
Reconfiguration can be accomplished in hardware 
or software. With hardware reconfiguration, extra 
switching circuitry is added to enable the removal 
and replacement of elements which contain faulty 
components. The switching circuitry can be in- 
stalled at different levels. The Fault-Tolerant Multi- 
Processor (FTMP), for example, reconfigures at the 
processor-memory-bus interfaces (ref. 4). The Fault- 
Tolerant Processor (FTP) switches entire computers 
into and out of the computer interstage (ref. 5). 
Software reconfiguration is accomplished by 
rewriting system data structures which control com- 
munication and scheduling procedures. The faulty 
processors remain connected to the intercomputer 
network, but their outputs are ignored during com- 
munication and voting and tasks are not allocated 
to them during scheduling. A good example of this 
paradigm is the Software Implemented Fault Toler- 
ance (SIFT) computer (ref. 6). 
The validation methodology. Two parameters can 
be used to characterize the recovery process. They 
are the percentage of faults for which the recovery 
process performs the correct action (in the presence 
of the fault and its resulting errors) and the time 
taken to complete the process. The first parameter 
will be referred to as the coverage of the recovery 
process; the second will be referred to as the total
recovery time. 
In an effort to define a validation methodology 
which could determine whether a candidate fault- 
tolerant computer meets the requirement for life- 
critical digital avionics (i.e., with a probability of 
catastrophic failure of $10^{-9}$ for a 10-hour mission),
a previous study has suggested that recovery param- 
eters can be obtained by applying pin-level stuck-at 
faults to the fault-tolerant computer during carefully 
controlled experiments (ref. 1). The parameters are 
then inserted into models of the failure-recovery pro- 
cess (Markov models or Markov model derivatives) to 
calculate the probability of failure. One such model 
is described in appendix A. 
With the assumption that the processor fail- 
ure rate can be obtained from sources such as 
MIL-HDBK-217D (ref. 7), all that remains to complete
the model is to derive a value for the recovery
rate. As mentioned above, it has been suggested that 
the recovery rate be obtained through experimental 
measurements. 
The recovery rate measurement must be accurate 
enough to establish confidence in the computation of 
the probability of failure. As explained previously, 
the detection time is a part of the recovery time 
and depends on the error behavior generated by the 
fault. Therefore, the degree of experimental accuracy 
will depend on how well the fault injection stimulus 
reproduces true erroneous behavior in the target 
system. 
Injecting a fault at the pin level simulates a condi- 
tion in which the error has already propagated from 
the lower-level device. Because of this, pin-level fault 
injections are not suitable for measuring the error 
propagation time component of the detection pro- 
cess. Thus, it is assumed that pin-level fault injec- 
tions will be used primarily in experiments designed 
to measure the remaining time components of the de- 
tection process. These processes (i.e., the fault isola- 
tion and verification processes) have error behavior 
symptoms as their input. The fault-injection model 
must produce error symptoms similar to true error 
behavior to be effective. To determine the applica- 
bility of the pin-level fault model, the technique de- 
scribed in this paper produces a measure of the simi- 
larity between true error behavior and error behavior 
created by fault injection. To demonstrate the tech- 
nique, it is assumed that if dissimilar error behavior 
is observed on fewer than 10 percent of the pins of 
the VLSI circuit under test, then acceptable exper- 
imental accuracy will result. The 10-percent limit 
is inferred from the discussion in appendix A, which 
concludes that it is necessary to measure experimen- 
tal parameters to within an order of magnitude of 
the correct value. 
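The 10-percent acceptance rule stated above can be written as a simple check. The following is a minimal sketch only: the function name and the per-pin boolean flags are hypothetical, and how a pin comes to be judged "dissimilar" is left to the hypothesis test described later in the paper.

```python
from typing import Sequence


def meets_pin_criterion(dissimilar_by_pin: Sequence[bool],
                        limit_percent: int = 10) -> bool:
    """True when strictly fewer than limit_percent of the pins are dissimilar.

    dissimilar_by_pin holds one flag per pin of the circuit under test
    (hypothetical input; True means dissimilar error behavior was observed).
    """
    dissimilar = sum(bool(flag) for flag in dissimilar_by_pin)
    # Integer comparison avoids floating-point rounding at the boundary.
    return dissimilar * 100 < limit_percent * len(dissimilar_by_pin)


# Example with a 48-pin device: 4 dissimilar pins pass, 5 do not.
four_bad = [True] * 4 + [False] * 44
five_bad = [True] * 5 + [False] * 43
print(meets_pin_criterion(four_bad))   # 4/48 is about 8.3 percent
print(meets_pin_criterion(five_bad))   # 5/48 is about 10.4 percent
```

For a 48-pin device the rule therefore tolerates at most 4 dissimilar pins.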
Objective and Approach 
The objective of this work is to define a technique 
by which insight can be gained into the ability of the 
pin-level stuck-at fault model to simulate true error 
behavior in VLSI digital circuits. To this end, the 
error behavior caused by pin-level stuck-at faults will 
be compared with true erroneous behavior in VLSI 
digital circuits. The stuck-at fault model has been 
accepted as a good model of gate-level fault behavior 
(ref. 3). Transferring the stuck-at model to the pins 
of an integrated circuit presents little difficulty for 
small-scale integrated (SSI) circuits and for some 
medium-scale integrated (MSI) circuits because of 
the proximity of the gates to the pins. However, 
the pin-level stuck-at model may not apply well to 
VLSI circuits. With a VLSI circuit, the pins of 
the integrated circuit are more at the level of the 
system monitor than at the level of the faulty device 
(consider the self-checking dual processor in fig. 1). 
At the pins of a VLSI circuit, the objective is to 
model error behavior rather than fault behavior. 
To compare the error behavior generated with 
pin-level stuck-at faults to true error behavior, it 
is necessary to obtain samples of both. Obtaining 
samples of a VLSI chip error behavior when subjected 
to pin-level faults is straightforward. Acquiring the 
response of the same chip to internal faults is more 
difficult and requires some type of simulation of the 
chip. The approach used in this experiment was to 
create a “virtual microprocessor” from an existing 
SSI-MSI processor by drawing a boundary around 
the components that would be found inside a chip 
if indeed the processor were a chip. The resulting 
virtual microprocessor (Vmp) consists of 48 pins of
data, address, and control. The Vmp allows faults to 
be injected on its internal devices while its behavior 
is recorded at its boundaries.
Once samples of the error behavior are obtained, 
two types of analysis are performed. First, the 
tendency towards static error behavior is derived. 
Static error behavior is defined as a condition where 
a pin does not change state because of a fault within 
the circuit. This condition closely resembles a pin- 
level stuck-at fault. The second analysis examines 
the dynamic error behavior of the pins (i.e., errors are 
present and the pin is changing state). The dynamic 
error behavior of internal and pin-level faults is
compared by formulating the hypothesis that their 
respective data samples are taken from the same 
general error behavior distribution. If this hypothesis 
is rejected, then it will be concluded that pin-level 
stuck-at faults are “not very good” at simulating true 
error behavior in VLSI circuits. 
Experiment Definition 
The definition of the experiment is divided into 
three sections. The first section describes the data 
that were gathered and the analysis that was per- 
formed. The second section defines assumptions 
made which support the experiment approach and 
analysis. The third section contains the details of 
the experiment configuration and procedure and is 
found in appendix B. 
Experiment Data and Analysis 
To produce data descriptive of error behavior, a 
faulted system is compared with a fault-free version 
of the system. In this investigation, two types of
error behavior are analyzed: static and dynamic.
Static error behavior. If a pin does not change 
state during a test, it is said to exhibit static behav- 
ior. If the test is conducted in a fault-free condition, 
then this is normal static behavior. If a pin that was 
not normally static becomes static during a fault in- 
jection test, or if a pin that was normally static re- 
mains static but in an inverted state, the pin is said 
to exhibit static error behavior. A pin-level stuck-at 
fault of the correct polarity is indiscernible from a 
pin exhibiting static error behavior. 
One evaluator of static error behavior is the per- 
centage of pins of the Vmp that exhibit static error 
behavior during faulted runs. A large percentage of 
pins exhibiting static error behavior would indicate 
that the pin-level stuck-at model has some justifica- 
tion for VLSI circuits. However, a small percentage 
of pins exhibiting static error behavior is not sufficient
evidence to conclude that the pin-level stuck-at
model is inaccurate. The dynamic error behavior
must be considered also. 
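The static-error definition and its percentage evaluator can be sketched as follows. This is an illustrative reading of the definitions above, not the analysis program used in the experiment; the function names and the representation of a pin trace as a list of 0/1 samples are assumptions.

```python
from typing import Optional, Sequence


def static_level(trace: Sequence[int]) -> Optional[int]:
    """Return the constant level (0 or 1) if the pin never changes state, else None."""
    first = trace[0]
    return first if all(sample == first for sample in trace) else None


def has_static_error(fault_free: Sequence[int], faulted: Sequence[int]) -> bool:
    """Apply the static-error definition to one pin-test."""
    normal = static_level(fault_free)
    observed = static_level(faulted)
    if observed is None:
        return False               # pin still toggles under the fault: not static
    if normal is None:
        return True                # normally dynamic pin became static
    return observed != normal      # normally static pin held in the inverted state


def static_error_percentage(fault_free_pins: Sequence[Sequence[int]],
                            faulted_runs: Sequence[Sequence[Sequence[int]]]) -> float:
    """Percentage of pin-tests (pins x faulted runs) showing static error behavior."""
    total = 0
    hits = 0
    for run in faulted_runs:
        for good, bad in zip(fault_free_pins, run):
            total += 1
            hits += has_static_error(good, bad)
    return 100.0 * hits / total


# Three pins: one normally dynamic, one normally static high, one normally static low.
fault_free_pins = [[0, 1, 0, 1], [1, 1, 1, 1], [0, 0, 0, 0]]
faulted_runs = [
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],   # two static errors
    [[0, 1, 1, 0], [1, 1, 1, 1], [0, 1, 0, 0]],   # no static errors
]
print(static_error_percentage(fault_free_pins, faulted_runs))
```

In the second faulted run the third pin, normally static, begins to toggle. That is erroneous behavior, but it is not static error behavior and so is not counted here.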
Dynamic error behavior. Dynamic error behavior 
is defined by two values, the time between errors and 
the duration of errors. Figure 2 illustrates how these 
values are derived. The first trace in figure 2 is of a 
Vmp pin during a fault-free test. The second trace is 
of the same pin, but now the effect of an internal fault 
is manifested as different behavior. The third trace is 
derived as the exclusive-or of the first two traces and 
represents the error behavior of the second trace. In 
other words, when the error trace is “high” the value 
of the pin during the faulted run is opposite that of 
the fault-free run. The time between errors (TBE) 
is defined as the time elapsed between rising edges 
of error pulses. The duration of errors (DE) is the 
length of the error pulses. 
A pin can exhibit both static and dynamic error 
behavior. If an induced stuck-at fault is applied to a 
pin which behaves dynamically during the fault-free 
run, it will behave statically (i.e., it will not change 
state during the test). However, from the definition 
of dynamic errors, TBE and DE characteristics can 
be observed. 
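The derivation of TBE and DE from a pair of traces can be sketched as follows. This is a minimal illustration, assuming each trace is a list of 0/1 samples taken once per cycle; the function names are hypothetical, and the handling of a pulse that is still high when the record ends (its observed length is kept) is an assumption rather than something stated in the paper.

```python
from typing import List, Sequence, Tuple


def error_trace(fault_free: Sequence[int], faulted: Sequence[int]) -> List[int]:
    """Exclusive-or of the two traces: 1 wherever the faulted pin is wrong."""
    return [a ^ b for a, b in zip(fault_free, faulted)]


def tbe_and_de(err: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (TBE samples, DE samples), both measured in cycles.

    TBE is the spacing between successive rising edges of the error trace;
    DE is the length of each error pulse.
    """
    rising: List[int] = []
    durations: List[int] = []
    prev = 0
    run = 0
    for t, e in enumerate(err):
        if e and not prev:
            rising.append(t)       # start of a new error pulse
            run = 1
        elif e:
            run += 1               # pulse continues
        elif prev:
            durations.append(run)  # pulse just ended
        prev = e
    if prev:
        durations.append(run)      # assumption: keep a pulse cut off by end of record
    tbe = [later - earlier for earlier, later in zip(rising, rising[1:])]
    return tbe, durations


# A normally dynamic pin with a stuck-at-zero fault applied: the pin is
# static, yet the error trace still yields TBE and DE samples.
fault_free = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
faulted = [0] * len(fault_free)
tbe, de = tbe_and_de(error_trace(fault_free, faulted))
print(tbe)  # [2, 4, 2]
print(de)   # [1, 2, 1, 1]
```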
Any single internal fault which exhibits dynamic 
error behavior will produce samples of TBE and DE. 
The TBE and DE data can be thought of as samples 
from an error behavior population. Thus, both the 
pin-level stuck-at faults and the internal faults can 
be associated with general, but possibly different, 
error behavior populations. If the error behavior 
population associated with pin-level stuck-at faults 
can be shown to be different from the error behavior
population of the internal faults, one could conclude 
that the pin-level stuck-at fault does not produce 
error behavior similar to that produced by internal 
faults. 
Statistical hypothesis testing can be used to de- 
termine whether two samples were taken from the 
same population. The strength of hypothesis testing 
is in rejecting the hypothesis; this is called a significant
result. If the hypothesis is not rejected, all that
can be said is that for this particular case the null 
hypothesis holds. If the hypothesis of identical pop- 
ulations is rejected, it can be concluded that the null 
hypothesis does not hold for this particular case and, 
therefore, does not hold in general. 
Wilcoxon’s rank-sum test. Wilcoxon’s rank-sum
test (also known as the Mann-Whitney test) can be
used to test the hypothesis of identical populations
(ref. 8). Given two data samples, $x_i$ and $y_j$, the
rank-sum method tests for a shift in the samples
which would indicate that the samples came from
different populations. In our case, for example,
the $x_i$ are samples of error behavior taken during
internal fault tests and the $y_j$ are samples taken from
pin-level fault tests. A test statistic $w$ is derived
for $x_i$ and $y_j$ by summing the ranks assigned to
the members of each sample after $x_i$ and $y_j$ are
merged and ordered. A function $W$ is calculated by
constructing the distribution of all possible orderings
of two samples. Values for $W$ are found in textbooks.
(See ref. 8, table A.5 for an example.) The decision
rule for the test can be stated as follows. Reject the
null hypothesis ($H_0$) at level of significance $\alpha$ if the
test statistic $w$ is greater than the $(1 - \alpha)/2$ quantile
of $W$. For example, if a significance level of 0.95 is
desired and the two samples have sizes of 3 and 6,
the $(1 - \alpha)/2$ quantile of $W$ is 22. (See table A.5 of
ref. 8.) If $w > 23$, then $H_0$ is rejected. The IMSL
subroutine NRWST is used to perform this test on
the data (ref. 9).
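The rank-sum statistic and its null distribution can be illustrated with a short sketch. This is not the IMSL routine named above; it is a standard-library illustration under two labeled assumptions: the statistic is taken as the sum of the ranks of the second sample (a common textbook convention), and the exact distribution is built by enumeration, which is only practical for small samples without ties.

```python
from fractions import Fraction
from itertools import combinations
from typing import Dict, List, Sequence


def midranks(values: Sequence[float]) -> List[float]:
    """Ranks 1..N of values; tied observations share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of the ranks held by y after x and y are merged and ordered."""
    ranks = midranks(list(x) + list(y))
    return sum(ranks[len(x):])


def upper_tail(m: int, n: int) -> Dict[int, Fraction]:
    """Exact P(W >= w) under the null hypothesis, no ties.

    W is the rank sum of the sample of size n; every choice of n ranks
    out of m + n is equally likely when both samples share one population.
    """
    counts: Dict[int, int] = {}
    for combo in combinations(range(1, m + n + 1), n):
        total = sum(combo)
        counts[total] = counts.get(total, 0) + 1
    arrangements = sum(counts.values())
    tail: Dict[int, Fraction] = {}
    running = 0
    for w in sorted(counts, reverse=True):
        running += counts[w]
        tail[w] = Fraction(running, arrangements)
    return tail


# Sample sizes 6 and 3, as in the example above (84 equally likely arrangements).
tail = upper_tail(6, 3)
for w in (24, 23, 22, 21):
    print(w, tail[w], float(tail[w]))
```

For these sample sizes the enumeration gives upper-tail probabilities of 1/84, 2/84, and 4/84 at rank sums of 24, 23, and 22, which can be compared with the tabulated values in a textbook table.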
Assumptions 
The assumptions stated in the following sections 
describe models of behavior which together comprise 
the foundation for the experimental approach and 
analysis. Two models are used during the experi- 
mentation. One simulates VLSI circuit function and 
the other fault behavior. A third model describes 
the characteristics of the data to which the Wilcoxon 
test is applied. A final model defines what is meant
by an error.
VLSI circuit model: the virtual microprocessor. 
As previously stated, the purpose of the experimen- 
tation is to obtain the error behavior characteristics, 
that is, samples of time between error (TBE) and 
duration of error (DE) at the pins of a VLSI circuit. 
For the best results, the errors should be produced 
by the types of faults which would occur naturally 
within the VLSI circuit. Except for the case wherein 
hundreds of VLSI circuits are available for destruc- 
tive fault-injection testing, the VLSI circuit must be 
simulated at a level consistent with the fault model
that will be applied to the circuit. 
A VLSI circuit simulation was created by considering
a Bendix BDX-930 processor as a 48-pin virtual
microprocessor (Vmp). (See section entitled The
Vmp in appendix B.) The BDX-930 is a 16-bit processor
constructed mostly of bipolar SSI and MSI logic.
The 48 pins of the Vmp are mapped to 48 signals 
within the BDX-930 processor. The 48 signals cor- 
respond to 16 memory address lines (the MAR bus), 
16 data lines (the DAT bus), and 16 miscellaneous 
signals. 
Because the circuit simulation is a processor, the 
simulation must be executing a program to produce
results and, therefore, errors. The program chosen
produces a Fibonacci series. (See section entitled The
Fibonacci Series in appendix B.)
The following assumptions support the use of the 
Vmp as a VLSI circuit simulator. 
1. The BDX-930 is a modern processor of adequate
complexity so that if it were produced as a
single chip, it would be considered to be a VLSI
device.
In support of this assumption, consider that the 
BDX-930 has a microprogrammed pipeline architec- 
ture and contains 5000 to 6000 gates. Although the 
BDX-930 is not a state-of-the-art processor and the 
gate count is less than the 10000 usually attributed 
to VLSI circuits, the authors believe this to be a fair
assumption. 
2. The Fibonacci series program represents typical
program execution.
Fibonacci series calculation has been used in past 
studies (refs. 10 and 11) to represent typical program 
behavior. The program used in this test was written 
to reflect as wide an instruction mix as possible. (See 
appendix B.) If the data gathered with the Fibonacci 
program cause the $H_0$ hypothesis to be rejected, it
can be expected that $H_0$ will be rejected for most
programs. This is believed to be true because of the
simplicity and directness of the Fibonacci program.
However, the converse is definitely not true. If
$H_0$ is not rejected, very little can be said about
the behavior expected from other programs. (This
assumption may be weak. See section entitled The
Fibonacci Series for a discussion.)
3. The Vmp allows sufficient access to its internal
devices for fault injection.
The Vmp contains approximately 100 integrated cir- 
cuits, most with either 14 or 16 pins. If the pins of the 
integrated circuits are considered to be connected to  
devices that would actually be inside the Vmp, then 
there is access to about 30 percent of the devices in- 
ternal to the Vmp (1800 pins out of an estimated 
5400 gates). The implication of the assumption is 
that access to this subset of devices is adequate if 
Ho is rejected. During actual testing, a small sample 
was taken from this subset. 
Fault model: the stuck-at. The stuck-at fault 
model is derived from the tendency of a switch to  
fail either stuck open or stuck closed (a transistor 
in a digital circuit acts as a switch). Other failure 
modes can be modeled as special cases of stuck-ats. 
For example, an intermittent contact can be modeled 
as a recurring, short-term stuck open failure. The 
emphasis here is placed on the binary nature of the 
device and the fact that the device has a tendency to  
fail to one state or the other. 
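The stuck-at model and its intermittent special case can be illustrated on a sampled signal. This is a minimal sketch: the function names and the window representation are hypothetical, and the code only overrides the observed value of one signal; it does not model how the fault propagates through the rest of a circuit.

```python
from typing import List, Sequence, Tuple


def stuck_at(trace: Sequence[int], level: int) -> List[int]:
    """Permanent stuck-at fault: the signal is held at level for the whole test."""
    return [level] * len(trace)


def intermittent_stuck_at(trace: Sequence[int], level: int,
                          windows: Sequence[Tuple[int, int]]) -> List[int]:
    """Recurring, short-term stuck-at fault.

    The signal is forced to level only inside each half-open window
    [start, stop) of sample indices; elsewhere it follows the fault-free trace.
    """
    out = list(trace)
    for start, stop in windows:
        for t in range(max(start, 0), min(stop, len(out))):
            out[t] = level
    return out


fault_free = [0, 1, 1, 0, 1, 1]
print(stuck_at(fault_free, 1))                                  # [1, 1, 1, 1, 1, 1]
print(intermittent_stuck_at(fault_free, 0, [(1, 3), (4, 5)]))   # [0, 0, 0, 0, 0, 1]
```

A permanent stuck-at fault is the limiting case of a single window covering the whole record.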
As mentioned above, there is access to approxi- 
mately 1800 devices within the Vmp. The stuck-at 
fault model was chosen to represent the failure modes 
of those devices. The decision to use the stuck-at 
model is based on the following assumptions. 
4. The accessible devices of the Vmp are logic
gates.
5. The stuck-at fault model is adequate for gate-level
injection.
It may at first seem unwise to test the validity of the 
pin-level stuck-at fault model under an assumption 
that the stuck-at model is adequate. However, the 
important issue is the scope of the stuck-at model. 
Given that the stuck-at model faithfully represents 
the failure modes of a logic gate, the question is 
whether the model can be extended to cover the error 
behavior at the boundary of a large circuit of gates. 
Data model. The Wilcoxon test compares two 
data samples to test the hypothesis that the samples
were taken from the same population. One set of 
samples, taken as the true error behavior, is acquired 
from the 48 pins of the Vmp with internal stuck-at 
faults. The second set of samples is taken with one or 
more pins of the Vmp stuck at zero or stuck at one. 
The true error behavior samples are compared with 
samples for both pin-level stuck at one and stuck at 
zero to test whether they came from the same general 
error behavior population. 
Given that two data sets are obtained with m 
samples in x and n samples in y (x being the true 
error behavior and y being the pin-level stuck-at 
behavior), application of the Wilcoxon test to x and
y implies the following assumptions. 
6. $x_i = e_i$ $(i = 1, 2, \ldots, m)$ and $y_j = e_j + D$
$(j = m + 1, m + 2, \ldots, m + n)$, where
x and y are observable, the $e_j$ are unobservable, and D is
an unknown shift in y.
7. The N e-values (N = m + n) are mutually
independent.
8. The e-values are sampled from the same continuous
population.
With respect to this model, the hypotheses of the
Wilcoxon test can be stated as
$H_0$: $D = 0$
$H_1$: $D \neq 0$
Assumption 6 proposes that x and y differ by a
constant offset D. If $H_0$ is rejected (i.e., it is found
that $D \neq 0$), then a shift will be found in the data
and we will know that x and y were not taken from
the same population.
Assumption 7 may be used with confidence. Pre- 
cautions were taken to ensure that the system was 
reloaded and initialized to the same state prior to
each fault-injection test. (See section entitled The
Experiment Procedure in appendix B.) However,
attempts at a complete system initialization may have
fallen short. See section entitled The Fibonacci
Series in appendix B for a discussion.
The error model. To correctly interpret an anal- 
ysis of error behavior, it is necessary to know what 
is meant by an error. Consider the block diagram 
in figure 1. The system described by the block diagram
can be considered to be a tightly coupled
self-checking dual processor. The system monitor
(comparator) compares the pin-level state of processor A
to that of processor B. If either processor deviates 
from the other, the monitor signals an error. 
The block diagram can also be used to describe 
the error analysis procedure used in this study. For 
example, let the block labeled processor A represent 
data acquired during the fault-free run and the block 
labeled processor B be data acquired during fault- 
injection tests. Thus, the system monitor becomes 
the analysis procedure which steps through the data 
from the fault-free and faulted files and signals an 
error upon miscomparison. However, the process of 
comparing these two files is complicated by two fac- 
tors. First, the processor does not gate data onto its 
buses every clock cycle. Between bus transactions 
the state of the bus is undefined, and therefore a 
comparison is meaningless. Second, the faulted pro- 
cessor, given that it might follow a different instruc- 
tion path, might not gate data onto the bus on the 
same cycle as in the fault-free run. The following two 
assumptions serve to eliminate this ambiguity. 
9. The state of the pins is observed when the
fault-free run has data enabled onto its buses.
10. An error has occurred when the state of a pin
in a faulted run is observed to differ from the state of
the corresponding pin in the fault-free run.
The scope of the error analysis has been limited to 
the behavior of the memory and data buses. 
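Assumptions 9 and 10 can be read as a masked comparison in which the fault-free run decides when the pins are observed. The following is a minimal sketch under assumed data shapes: each run is a list of integer bus words, one per cycle, and `bus_enabled` is a hypothetical per-cycle flag marking the cycles in which the fault-free run has data enabled onto its buses. It is not the analysis program used in the study.

```python
from typing import List, Optional, Sequence


def compare_runs(fault_free: Sequence[int], faulted: Sequence[int],
                 bus_enabled: Sequence[bool]) -> List[Optional[int]]:
    """Per-cycle error word, or None where no comparison is made.

    Assumption 9: pins are observed only when the fault-free run has data
    enabled. Assumption 10: an error is any pin that differs, so the
    exclusive-or of the two words has a 1 in each erroneous pin position.
    """
    result: List[Optional[int]] = []
    for good, bad, enabled in zip(fault_free, faulted, bus_enabled):
        result.append((good ^ bad) if enabled else None)
    return result


def pin_error_samples(error_words: Sequence[Optional[int]], pin: int) -> List[int]:
    """Error trace of one pin, restricted to the observed cycles."""
    return [(word >> pin) & 1 for word in error_words if word is not None]


fault_free = [0x0010, 0xFFFF, 0x0011, 0x1234]
faulted = [0x0010, 0x0000, 0x0010, 0x1235]
bus_enabled = [True, False, True, True]   # cycle 1: bus undefined, so skipped

errors = compare_runs(fault_free, faulted, bus_enabled)
print(errors)                        # [0, None, 1, 1]
print(pin_error_samples(errors, 0))  # [0, 1, 1]
print(pin_error_samples(errors, 4))  # [0, 0, 0]
```

Because only the fault-free run gates the comparison, a faulted processor that drives its bus on a different cycle is still compared on the fault-free schedule.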
Results 
A total of 136 fault injections were applied:
112 to signals internal to the Vmp and 24 to signals
at its pin-level boundary. The faulted behavior of
the Vmp was observed at the pin-level boundary and
recorded in what will be referred to as the “results” 
files. (See appendix B.) The 68 signals to which the 
stuck-at one and stuck-at zero faults were applied 
were chosen from a signal list of the BDX-930. Care 
was taken to ensure that injections were applied
to each chip in the BDX-930. Of the 136 fault 
injections, 92 produced error behavior. The following 
sections present the static and dynamic analysis of 
these errors. 
Results of Static Error Analysis 
As explained in the section entitled Static Error 
Behavior, there are two kinds of static pin behavior: 
pins which are normally static in the fault-free run 
and pins which become static due to the presence of
a fault. A total of 15 pins were found to be normally 
static in the fault-free run. 
A total of 92 faults which were applied to the 
Vmp produced errors. During each fault, all 48 pins 
of the Vmp were tested, yielding 4416 pin-tests 
(48 pins × 92 tests). Of the 92 faults, 62 created static
error behavior among 1 or more of the 48 pins. 
In 4416 pin-tests, only 204 pins (approximately 5 percent)
demonstrated static error behavior. However, 219 normally
static pin-tests became nonstatic. The application 
of stuck-at faults actually decreased static behavior. 
This finding, while not totally unexpected, leads to 
the analysis of the dynamic error behavior. 
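The counts quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the variable names are not from the report.

```python
# Counts quoted in the static error analysis.
PINS = 48                     # pins observed at the Vmp boundary
FAULTS_WITH_ERRORS = 92       # fault injections that produced errors
STATIC_ERROR_PIN_TESTS = 204  # pin-tests that showed static error behavior
BECAME_NONSTATIC = 219        # normally static pin-tests that became nonstatic

pin_tests = PINS * FAULTS_WITH_ERRORS
static_error_percent = 100 * STATIC_ERROR_PIN_TESTS / pin_tests
net_change = STATIC_ERROR_PIN_TESTS - BECAME_NONSTATIC

print(pin_tests)                       # 4416
print(round(static_error_percent, 1))  # 4.6
print(net_change)                      # -15, a net decrease in static behavior
```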
Results of Dynamic Error Analysis 
To perform the dynamic error analysis, the 
48 pins of the Vmp are broken down into 3 groups 
of 16: the MAR bus, the DAT bus, and the signals. 
Each pin-test yielded time-between-errors (TBE) and 
duration-of-errors (DE) data. These data, which 
quantify dynamic error behavior, are derived from 
an analysis that is based on the definition of an error 
as stated in assumptions 9 and 10. A program was 
written which cycled through the fault-free and re- 
sults files as if they were a tightly synchronized self- 
checking pair. However the fault-free run was the 
master. Values from the fault-free and results files 
were compared only when a bus transaction was ini- 
tiated in the fault-free run. The analysis presented 
will be limited to the MAR and DAT groups because 
the two buses represent general behavior rather than 
unique functionality, represented by the signals. 
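The comparison described above might be sketched as follows for a single pin. This is a minimal sketch and not the program used in the experiment. It assumes that each run is a list of per-cycle pin values, that `bus_active` flags the cycles on which the fault-free run initiates a bus transaction, that DE is the number of cycles from the first mismatching comparison to the next matching one, and that TBE is the number of cycles from the end of one error to the start of the next. The function name and these working definitions are assumptions made for illustration.

```python
def tbe_de(fault_free, faulted, bus_active):
    """Return (tbe, de) lists, in clock cycles, for one pin.

    Values are compared only on cycles where the fault-free run is
    active on the bus, so the fault-free run acts as the master.
    """
    tbe, de = [], []
    in_error = False
    start = None     # cycle at which the current error began
    last_end = None  # cycle at which the previous error ended
    for cycle, (ref, obs, active) in enumerate(
            zip(fault_free, faulted, bus_active)):
        if not active:
            continue  # no comparison on this cycle
        mismatch = ref != obs
        if mismatch and not in_error:
            in_error, start = True, cycle
            if last_end is not None:
                tbe.append(cycle - last_end)
        elif not mismatch and in_error:
            in_error, last_end = False, cycle
            de.append(cycle - start)
    return tbe, de


# Hypothetical eight-cycle traces for one pin.
fault_free = [0, 1, 1, 0, 1, 0, 0, 1]
faulted = [0, 0, 0, 0, 1, 1, 0, 1]
tbe, de = tbe_de(fault_free, faulted, [True] * 8)
print(tbe, de)  # [2] [2, 1]
```

Multiplying the cycle counts by 125 ns converts them to time.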
Figures 3 to 6 are histograms of the TBE and 
DE for the MAR bus and the DAT bus. Each 
histogram shows data summed over all 16 pins in a 
group for all 92 fault-injection tests which produced 
error behavior, thus representing 1472 pin-tests. The 
number of data points in each histogram depends on 
the error activity; that is, more samples of TBE and 
DE are obtained if, on average, the TBE and DE are 
small compared with the sample period. Figures 3 
and 4 show the number of errors that produced 
TBE’s ranging from 1 to 350 cycles (1 cycle is 125 ns);
figures 5 and 6 show similar information for the DE
data. For example, in figure 3 a peak of 1900 errors
occurs with a TBE of approximately 15 cycles.
If the data are further broken down according to 
the type of operation (input, output, or instruction 
fetch) taking place on the bus at the time of the er-
ror, different error behavior can occur. Input and
instruction fetch operations were found to be similar 
to each other and to the combined TBE data as plot- 
ted in figure 4. Output operations have a different 
characteristic, as shown in figure 7. The plots of the 
TBE during output operations demonstrate the type 
of error behavior that a monitor which votes output 
data will encounter. 
The data presented so far illustrate gross error 
behavior in that the samples taken on all pins for 
all fault-injection tests were summed. These data 
cannot be subjected to a hypothesis test without ad- 
ditional assumptions which state that the individual 
samples are from the same population and, therefore, 
can be summed. Figure 8 is an example of data from 
a single internal fault-injection test. The TBE sam- 
ples for each pin on the DAT bus are represented in 
“box plot” format. 
In the box plot format, the width of the box is 
proportional to the number of points in each sample.
The lower perimeter of the box marks the 25th per- 
centile; the upper perimeter is the 75th percentile. 
The line through the box is the median. The bars 
off the ends of each box mark the farthest points 
within the fences, which are 1.5 box lengths below 
the 25th and above the 75th percentiles. The crosses 
mark points outside the fences (called outliers). Fi-
nally, the little squares with the x’s in them are
the means. As can be seen, the box plot format con- 
denses the error behavior of an entire group into a 
single plot without loss of information. 
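The quantities drawn in a box plot can be computed directly from a sample. The sketch below follows the description above (quartiles, fences at 1.5 box lengths, outliers, mean). The quartile interpolation rule is an assumption, since the report does not state which rule its plotting software used, and the function name and sample values are hypothetical.

```python
import statistics


def box_plot_summary(sample):
    """Summarize one TBE or DE sample the way a box plot draws it."""
    # Quartile rule is an assumption; other rules give slightly different boxes.
    q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    box_length = q3 - q1
    lower_fence = q1 - 1.5 * box_length
    upper_fence = q3 + 1.5 * box_length
    inside = [x for x in sample if lower_fence <= x <= upper_fence]
    return {
        "count": len(sample),  # sets the width of the box
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),  # farthest points within fences
        "outliers": [x for x in sample if x < lower_fence or x > upper_fence],
        "mean": statistics.mean(sample),
    }


# Hypothetical TBE sample, in clock cycles.
summary = box_plot_summary([10, 12, 13, 15, 15, 16, 18, 20, 60])
print(summary["q1"], summary["median"], summary["q3"])
print(summary["whiskers"], summary["outliers"])
```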
Comparison of Internal and Pin-Level Fault 
Behavior 
The error behavior of the system in response to 
internal faults is compared with the error behavior 
in response to pin-level faults. In the case of the 
Vmp, pin-level faults are considered to be those faults 
which occur at pins located at the boundary of the
Vmp (more specifically, the DAT and MAR buses).
Figure 9 is the box plot of the TBE on the DAT bus
which occurred in response to a stuck-at one fault
applied to the pin-level signal DAT00.
Comparison of figures 8 and 9 shows that pin- 
level data (fig. 9) can produce error behavior similar 
to that which occurs during internal faults (fig. 8). 
This is not always the case. Figure 10 is an exam- 
ple of different error behavior caused by an internal 
fault. The Wilcoxon rank-sum test is used to obtain 
a measure of how well the pin-level fault data match 
the internal fault data. For each pin, two samples 
are compared, and the hypothesis of identical popu- 
lations is tested. For any single fault-injection test, 
32 hypothesis tests are performed on both the TBE 
and DE data. (There are 16 pins each on the DAT 
and MAR buses.) A summary measure of the ability 
of the pin-level fault to model the internal fault is 
presumed to be the percentage of pins which reject 
the hypothesis of identical populations. A large per- 
centage of rejection indicates a small likelihood that 
the internal faults are modeled well. 
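The summary measure described above can be illustrated with a textbook form of the rank-sum test. The sketch below uses the large-sample normal approximation with a tie correction and no continuity correction. It is not the library routine used for the reported results, and the significance level `alpha` is an assumed value that is not taken from this section.

```python
import math


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n, m = len(x), len(y)
    total = n + m
    ranks = [0.0] * total
    tie_term = 0
    i = 0
    while i < total:
        j = i
        while j < total and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    w = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    mean = n * (total + 1) / 2
    variance = n * m / 12 * ((total + 1) - tie_term / (total * (total - 1)))
    if variance == 0:
        return 1.0  # all observations equal: no evidence of a difference
    z = (w - mean) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def percent_rejected(pin_level_samples, internal_samples, alpha=0.05):
    """Percentage of pins whose samples reject identical populations."""
    p_values = [rank_sum_p_value(a, b)
                for a, b in zip(pin_level_samples, internal_samples)]
    return 100 * sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical samples for two pins: one matching pair, one shifted pair.
base = list(range(1, 11))
shifted = list(range(101, 111))
print(percent_rejected([base, base], [base, shifted]))  # 50.0
```

In the analysis described above, the two lists would each hold 32 samples per fault pair, one for each pin on the DAT and MAR buses.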
Figures 11 to 20 are bar charts summarizing the 
results of the hypothesis testing for the TBE and DE 
data which resulted from stuck-at faults applied to signals
DAT00, DAT07, DAT15, MAR00, and MAR07. The
results of stuck-at one and stuck-at zero faults were
found to be similar and therefore are combined. A
total of 224 tests are possible, resulting from a stuck-
at one pin-level fault versus 112 internal faults plus
a stuck-at zero on the same pin versus 112 internal
faults. Because some of the internal faults produced 
insufficient data for a test, the bar charts display the 
results of about 150 tests. These displays can be best 
described by considering a specific example. 
Figure 11 shows the results of comparing the 
stuck-at faults injected on DAT00 to all internal
faults. The first bar of figure 11 represents the 
number of tests which rejected the hypothesis on less 
than 10 percent of the pins. Remember, there are 
16 data bus pins and 16 memory address bus pins. 
According to figure 11, about 115 of the tests resulted
in rejection of the null hypothesis on 0 to 10 percent 
of the 32 pins. 
All the TBE results are alike, with most falling 
into the 10-percent rejection region. The DE results 
are more weighted at 40 percent and above than the 
TBE results. A noteworthy example of this is the 
DAT07 DE graphs (fig. 17). 
Discussion 
The dynamic error behavior resulting from in- 
ternal and pin-level stuck-at faults has been char- 
acterized by the time between errors (TBE) and the 
duration of errors (DE). The similarity of the two 
sample types (pin-level vs internal faults) was tested 
with Wilcoxon’s rank-sum test. It has been found 
that single, pin-level stuck-at faults produce dynamic 
error behavior characteristics which are similar to 
single internal faults in many cases. A closer look 
into the underlying mechanism which causes the er- 
ror behavior provides a plausible explanation for this 
similarity. 
Two types of aberrant processor behavior are 
defined. A crash is said to occur when the processor 
jumps outside the program limits. Skewing occurs
when one or more instructions take one or more clock
cycles longer than normal to execute. When skewing
occurs, even though the processor may follow normal 
program flow, error behavior similar to a crash may 
be observed. This is because the comparison of the 
two runs (faulted and fault free) which produces 
the errors is synchronized by the fault-free run and, 
therefore, a single clock cycle of skew in the faulted 
run produces a great deal of “crashlike” errors. 
Out of the 80 internal faults which produced 
errors, 34 caused a crash. Of the 46 remaining 
noncrashed runs, 30 were skewed. Thus, 80 percent 
of the internal faults either crashed or were skewed 
and are therefore likely to exhibit the same error 
behavior. Of the 10 pin-level faults tested which 
produced errors, 100 percent either crashed or were 
skewed. This puts all the tests of pin-level faults into 
the same behavioral category as a large majority of 
internal faults. 
Another interesting observation is that the TBE 
samples are more likely to be similar than the DE 
samples. This may be due to a correlation between 
the TBE and the average instruction execution time. 
The TBE histograms (figs. 3 and 4) show that most 
TBE’s lie between 10 and 20 clock cycles (125 ns 
per cycle). This matches well the average instruction 
execution time of the Vmp of 1 to 2 μs and implies
that once an error goes away it is likely to return in 
one or two instructions. The DE histogram in figure 5 
does not exhibit this characteristic. The histogram 
peaks at 1 or 2 cycles and drops to zero at 30 cycles.
Besides the suggested correlation of TBE to in- 
struction execution times, the similarity in behavior 
of the TBE samples is related to three of the assumptions.
Assumption 2 states that the Fibonacci program rep- 
resents typical program behavior. This is true until 
the processor crashes. From that point on, the be-
havior is related not to the Fibonacci program but to
some pattern in a large block of undefined memory.
Consistent error behavior may be related to the pres- 
ence of a large block of undefined memory through 
one of two mechanisms. The randomness of the un- 
defined memory may yield consistent error behavior. 
Given a completely random memory pattern, similar 
behavior would result no matter where the processor 
executed in memory. Alternatively, after a few fault 
injections, the undefined block of memory may tend 
toward the same constant pattern. If, for example, a 
crash tends to write zeros in memory (which are no- 
operation instructions in the Vmp), then, after a few 
crashes, the once undefined block of memory is now 
well defined as a long string of no-operation instruc- 
tions. This mechanism violates the independence as- 
sumption. If the test program had been larger, fewer 
jumps outside its range could have occurred. If the 
program resided in read only memory, modification 
of executable code could not occur. These additional 
factors might have produced entirely different behav- 
ior than that observed with the Fibonacci program. 
Assumption 9 states that the comparison of the 
test and fault-free files is to be synchronized with 
the execution of the fault-free file. Thus, an error 
cannot occur unless the fault-free run initiates a bus 
transaction. This type of analysis selectively deletes 
a great deal of error behavior. Performing a com- 
parison whenever either of the two runs initiated a 
transaction may have produced different error behav- 
ior than that observed. 
The end result of this abundance of similar error 
behavior is that the test hypothesis of identical pop- 
ulations is not overwhelmingly rejected. With the 
DAT00 data used as an example and the assumption
of a desired rejection limit of 10 percent or less, con- 
sider that less than 25 percent of the fault-injection 
tests analyzed for TBE exceed the 10-percent limit. 
Analysis of the DE data, often a better discrimina- 
tor, results in about 55 percent of the fault-injection 
tests exceeding the 10-percent limit. The pin-level 
fault model might be rejected based on the poor per-
formance exhibited by the DAT00 DE data alone.
However, when the results of testing pin MAR07 
are considered, both TBE and DE analysis have less 
than 15 percent of the tests exceeding the 10-percent 
rejection limit. Clearly the pin-level stuck-at fault 
model has some modeling capability. 
Because of the specific attributes of this experi- 
ment, the finding of similar error behavior between 
pin-level and internal faults cannot be applied in gen- 
eral. The amount of rejection observed is adequate to 
raise serious doubts as to the usefulness of pin-level 
fault injection in the validation of highly reliable sys- 
tems where accurate fault models are necessary. 
Conclusion 
A technique has been described by which the 
modeling capability of the pin-level stuck-at fault 
can be assessed. The technique has been shown 
to be tractable and has the potential to produce 
meaningful results. The technique was applied to 
data acquired during a small-scale fault-injection 
experiment. Conclusions drawn from these data are 
thus limited in scope. However, it was found that 
a pin-level stuck-at fault can generate error behavior 
very similar to that produced by internal faults. This 
occurs mainly when the faults (both internal and pin- 
level) result in a processor crash. Thus it seems that 
the pin-level stuck-at fault may be used to study a 
system’s response to a large class of faults which are
known to cause a processor to crash. However, this 
leaves many other fault classes which are not modeled 
well by the pin-level fault. It is recommended that, if 
the pin-level stuck-at fault model is being considered, 
additional experimentation be performed to establish 
when it can be used with confidence. In particular, 
these experiments should precede any attempt to 
use the model when validating highly reliable digital 
systems. 
The value of this work can be seen as twofold. It 
provides an analysis of the pin-level stuck-at fault 
modeling capability in very large scale integrated 
(VLSI) circuitry. If a researcher requires a high 
degree of accuracy over all fault classes, the pin- 
level stuck-at fault may be discarded based on these 
results. However, given the need for a more definitive 
result, this work also establishes an experimental 
procedure and analysis techniques to be used in 
future experimental efforts of this kind. 
NASA Langley Research Center 
Hampton, Virginia 23665-5225 
June 25, 1987
Appendix A 
Model of Failure-Recovery Process 
Consider the model of a failure arrival-recovery 
process shown in figure 21. The recovery process, 
which is assumed to have perfect coverage, is modeled 
with three types of states. State G is the good, or 
nonfaulty, state. A single processor fails with rate λ
to state A, or the active fault state. From state A,
the process continues to state R, the recovered state.
The rate S at which the system executes the recovery
process is the inverse of the total recovery time. The
values G, A, and R denote the number of processors
in each state. The state of the process as a whole is
defined by the state vector (G, A, R).
System failure is calculated by summing the prob-
abilities of the following two events:
1. All processors have failed, G = 0 (lack of
spares).
2. The voter has been defeated, A > G (recovery
too slow).
The accuracy required in estimating recovery rate
can be illustrated by the following example. Given a
quad redundant computer (i.e., a four-processor con-
figuration) which executes the recovery process de-
scribed by the model in figure 21, the following list of
probabilities and associated recovery rates has been
derived with the SURE modeling tool (ref. 12). Each
processor is assumed to have a mean time between
failures (MTBF) of 20000 hours and, therefore, a
failure rate λ of 5.0 x 10^-5 per hour.
100.0 3.00 x 10- 
1000.0 
10 000.0 
It can be seen from these data that the recovery rate
must be known to be within an order of magnitude of
1000 per hour to give reasonable confidence that the
system exhibits a probability of failure on the order
of
If an experimental procedure is to be used to ob-
tain the recovery process coverage parameter, a much
higher degree of accuracy must be achieved. It can be
shown through use of a sensitivity analysis similar to
the one above that the coverage parameter must be
greater than 0.9999999. This value is equivalent to a
coverage failure once every 10^7 trials. Observing such
a rare event is beyond the capability of most exper-
imental procedures. Fault-injection experiments will
most likely not be used to obtain this parameter.
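Two of the figures in this appendix follow from simple reciprocals: the failure rate from the MTBF, and the number of trials per coverage failure from the coverage parameter. The sketch below uses exact fractions to avoid rounding; the variable names are illustrative only.

```python
from fractions import Fraction

MTBF_HOURS = 20000

# Failure rate is the reciprocal of the mean time between failures.
failure_rate_per_hour = Fraction(1, MTBF_HOURS)
print(float(failure_rate_per_hour))  # 5e-05

# A coverage of 0.9999999 leaves one uncovered fault in 10**7.
coverage = 1 - Fraction(1, 10**7)
trials_per_coverage_failure = 1 / (1 - coverage)
print(trials_per_coverage_failure)  # 10000000
```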
Appendix B 
Experiment Configuration and Procedure 
The experiment procedure can be divided into the
following four steps:
1. Initialize system.
2. Observe system state.
3. Stimulate system.
4. Record system response.
This series of steps is repeated for each run. System 
initialization prior to each run ensures independence 
as required by assumption 7. The system is the 
Vmp performing a Fibonacci series calculation. The 
stimulus is provided by fault injection. The state of 
the Vmp is observed at its 48-pin boundary. 
The system configuration upon which the exper- 
iment was performed consists of the following four 
components: 
1. A system emulation-simulation (the Vmp)
2. A representative work load (the Fibonacci
series)
3. A stimulus (fault injection)
4. A system monitor (a logic analyzer)
The following sections describe these four compo- 
nents, the combined configuration, and the procedure 
which controlled the experiment. 
The Vmp 
As mentioned in the section entitled VLSI Cir- 
cuit Model: The Virtual Microprocessor, the Vmp is 
constructed from a Bendix BDX-930 processor. The 
BDX-930 is a 16-bit pipelined processor with a min- 
imum instruction execution time of 250 ns obtained 
with the nominal 16-MHz oscillator. At the heart
of the BDX-930 are four bit-slice processors. The 
bit-slice processors contain the arithmetic and logic 
unit (ALU) and 16 general purpose registers. Sur- 
rounding the bit-slice processor chips are approxi- 
mately 100 SSI and MSI circuits. These peripheral 
circuits perform the following functions (see fig. 22): 
1. Micro code sequence control 
2. Pipelined instruction decoding
3. Address processing 
4. Data path buffering 
5. Status registers 
6. Timing and control 
The signals chosen to represent the 48 pins of 
the Vmp are listed in table I along with their as- 
sociated BDX-930 chip and pin number designation. 
The 48 signals are divided into 16 data bus sig-
nals (DAT00 to DAT15), 16 memory address sig-
nals (MAR00* to MAR15*), and 16 miscellaneous
signals. The 16 miscellaneous signals can be classi-
fied as follows:
Clock: ABUF*
Power-on reset: POS and PON
CPU status: FOV, IND, LINK, PFEIN, FLAG1, and FLAG2
Memory arbitration: MM*, MEM1, MEM2, WE1, WE2, CDE01, and CDE02
The BDX-930 used in the experiment is part of 
the Software Implemented Fault Tolerance (SIFT) 
computer system, which is currently undergoing eval- 
uation in the Langley Avionics Integration Research 
Laboratory (AIRLAB). Access to and control of the 
BDX-930 was obtained through the SIFT interface 
available in AIRLAB. Further information on the in- 
terface of SIFT, BDX-930, and AIRLAB can be ob- 
tained in reference 13. 
The Fibonacci Series 
A program which produces a Fibonacci series was 
chosen to exercise the processor during the fault-
injection tests. The elements $f_i$ of a Fibonacci series
are computed by the following sum:
$f_i = f_{i-1} + f_{i-2}$
where initially
This algorithm can be coded efficiently in assembler 
code. However, what is desired is not an efficient, 
compact program, but a program which represents 
a good instruction mix. For example, to introduce 
stack manipulation instructions, the algorithm was 
coded as a recursive Pascal procedure. The Pascal 
program was compiled into BDX-930 assembler lan-
guage (see appendix C) using SIFT's Pascal cross
assembler. A halt instruction was inserted in the
assembler code between calls to the Fibonacci pro-
cedure to provide a checkpoint in the test when the
logic analyzer's limited memory could be stored in a
data set. A data set would then contain an inte- 
gral number of iterations. In the final program, the 
results of one iteration fit in each data set. The re- 
mainder of the data set was filled with the results 
of executing a sequence of instructions which con- 
tained operations such as shifting, integer multipli- 
cation and division, and testing and branching. 
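A recursive formulation of the kind described above might look like the following. This is an illustrative Python sketch and not the Pascal source, which is not reproduced in this report. The seed values of 1 and 1, the array name `data`, and the argument name `num` are assumptions.

```python
def fib(data, num):
    """Fill data[2..num] recursively; data[0] and data[1] hold the seeds."""
    if num < 2:
        return  # seeds are already in place
    fib(data, num - 1)  # recursion exercises call and stack instructions
    data[num] = data[num - 1] + data[num - 2]


# Hypothetical 10-element series with assumed seeds of 1 and 1.
series = [1, 1] + [0] * 8
fib(series, 9)
print(series)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```

The recursion adds nothing to the arithmetic; its purpose in the test program is to bring procedure calls and stack manipulation into the instruction mix.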
Because the error behavior exhibited by a faulty 
processor is the product of the physical nature of the 
fault and the instruction codes executed by the pro- 
cessor, the results of this study may be critically de- 
pendent on the test program. If the processor takes 
a wild branch out of the expected instruction stream 
because of the fault, the error behavior becomes de- 
pendent on the state of accessible memory. In this 
study, the Fibonacci program used about 100 words 
of the 32 K words available in the BDX-930. Al- 
though the program was reloaded for each test, no 
provision was made to set the unused memory to an 
initial state. This oversight may contribute to a lack 
of independence between each test. 
The Fault Injector 
The In-Circuit Fault Injector (ICFI) available in 
AIRLAB was used to apply the stuck-at one and 
stuck-at zero faults to the pins of the integrated 
circuits within the Vmp. The ICFI can be connected 
directly to the subject pin without extending the 
integrated circuit from the printed circuit board. The 
ICFI can be controlled manually from a front panel 
or remotely as part of the SIFT AIRLAB interface. 
The ICFI also has an external trigger available. The 
external trigger provides an experimenter with the 
flexibility of being able to set the fault parameters 
from the host computer while fault activation is 
controlled by an external signal from the experiment 
apparatus. This was the mode used during the fault- 
injection tests with the external trigger controlled by 
the logic analyzer. 
The Logic Analyzer 
The logic analyzer has 6 channels of 25-MHz ac-
quisition and 16 channels of pattern generation data.
All 48 pins of the Vmp were acquired (see table I),
plus the 7 test signals shown in table II.
Signal A* is an 8-MHz clock which was used as the
logic analyzer external clock. The 512-word memory
capacity of the logic analyzer thus provided a 64-μs
data acquisition window.
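The length of the acquisition window follows from the sample clock and the memory depth. A short check, with illustrative variable names:

```python
SAMPLE_CLOCK_HZ = 8_000_000  # signal A*, used as the external clock
MEMORY_WORDS = 512           # logic analyzer acquisition memory depth

sample_period_ns = 1_000_000_000 // SAMPLE_CLOCK_HZ
window_us = MEMORY_WORDS * sample_period_ns / 1000

print(sample_period_ns)  # 125
print(window_us)         # 64.0
```

The 125-ns sample period is the same as the cycle time used for the TBE and DE data.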
The signals TDR*, TIB*, and EOUT* define the 
operation occurring on the DAT bus. The TDR* is 
asserted during data input, TIB* is asserted during 
instruction input, and EOUT* is asserted during 
data output. 
Signal HLTLP, which indicates whether the 
BDX-930 is running or halted, was used to activate 
a procedure in the logic analyzer pattern generator 
upon BDX-930 start-up. The procedure consisted of 
a wait-for-interrupt loop in the mainline and a simple 
interrupt handler which asserted one bit of the pat- 
tern generator output. This bit was used to trigger 
the ICFI. 
The remaining signals were unused. 
The logic analyzer provides local offline storage of 
its setup on cassette tape, thus ensuring an identical 
setup for each test session. 
The Experiment Configuration 
Figure 23 illustrates the experiment configura- 
tion. Of the 55 signals acquired by the logic analyzer, 
all but 2 (MAR15* and LINK) are available on the 
BDX-930 backplane. A total of 38 signals (MAR00*
to MAR15*, DAT00 to DAT15, FOV, IND, LINK,
PFEIN, FLAG1, and FLAG2) are located on the
BDX-930 CPU board. The remaining 10 Vmp signals 
are found on the timing and control board. All seven 
test signals are on the CPU board. The BDX-930 
was removed from the SIFT test stand to allow ac- 
cess to the backplane. The BDX-930 was still con- 
trollable from the SIFT host environment through an 
extension cable which connected the BDX-930 to the 
SIFT test stand. A separate adapter provided power 
(28 V dc and 110 V ac at 400 Hz). 
One bit of the logic analyzer pattern generator 
output was connected to the external trigger of the 
ICFI. An interrupt procedure in the pattern gen- 
erator which asserted this bit was enabled by the 
BDX-930 HLTLP signal, that is, when the BDX-930 
was started. The pattern generator signal activated 
the fault, which had been programmed remotely into 
the ICFI. The ICFI fault-injection probe was in turn 
connected to a pin of the BDX-930. Table III is a list
of the pins tested during this experiment. 
Finally, the logic analyzer connector was con- 
nected to a port on the host computer. The logic an- 
alyzer was controlled entirely through the host com- 
puter. Acquisition memory was recovered over the 
RS232 link and saved in individual data files. 
The Experiment Procedure 
With the assumption that the experiment con- 
figuration is as described in the previous section, 
the operator begins the experiment by entering the 
SIFT environment on the host computer and defin- 
ing the fault model for this test session with the SIFT 
FAULT command. Under the control of a command 
procedure, the test session occurs in three phases. 
The first phase is initialization. The operator
is asked for the fault number with which to start
the test session. Of approximately 1800 pins in the
BDX-930, 68 were chosen for testing and placed in a
fault definition file. (See table III.) An effort was
made to sample pins from most of the integrated
circuits of the BDX-930. The ordering of the pins
in the table gives each fault an implied number.
Notice in table III that the odd-numbered faults are
stuck-at one faults and the even-numbered faults are
stuck-at zero faults. During testing, only one fault
type is used at a time; that is, all the stuck-at one
faults are done, followed by the stuck-at zero faults. 
This is done to reduce operator input and, therefore, 
the chance of error. If the testing occurred with 
alternating stuck-at one and stuck-at zero faults, the 
operator would have to modify the fault model with 
the SIFT FAULT command before each test. 
Once the fault number is entered, the second 
phase of the test session begins. The ICFI is remotely 
programmed according to the fault model definition. 
The fault description (signal name, board name, chip 
number, and pin number) associated with the fault 
number is displayed on the operator’s terminal. The 
operator then places the fault-injection probe on this 
pin. 
When the probe is set, the operator resumes the 
procedure by entering the name of the results file. 
File names are constructed according to the following 
convention:
/Board Name/Chip Number/Pin Number/Fault Type/.DAT. 
For example, the results of testing fault number 1
(SRAM15, stuck-at one) will be stored in file
CPU1107S1.DAT. (See table III.) The results of fault
number 86 (QII, stuck-at zero) will be stored in file
TC3206S0.DAT. The term CPU stands for the cen-
tral processing unit board and TC stands for the tim-
ing and control board.
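The naming convention can be expressed as a small helper. This is a sketch only: the function name is hypothetical, and the two-digit zero padding of the chip and pin numbers is inferred from the two example file names rather than stated in the text.

```python
def results_file_name(board, chip, pin, stuck_at):
    """Build a results file name: board, chip, pin, fault type, .DAT."""
    # Two-digit fields are an assumption based on the example names.
    return f"{board}{chip:02d}{pin:02d}S{stuck_at}.DAT"


print(results_file_name("CPU", 11, 7, 1))  # CPU1107S1.DAT
print(results_file_name("TC", 32, 6, 0))   # TC3206S0.DAT
```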
This completes the operator input. Operator
intervention has been required for 4 items: entering
the fault type, fault number, and results file name,
and positioning the probe.
The second phase of the command procedure ends
with the following sequence: The Fibonacci program
is loaded into the test processor and started. The
processor initializes and comes to a programmed halt
just prior to entering the first Fibonacci iteration.
(See appendix C for program listing.) A remote halt
command ensures that the processor does not restart.
In the final phase of the test session, the processor
executes a series of Fibonacci iterations. The data
from each iteration is captured, compared with the
fault-free file, and stored. If no errors are found,
another iteration is performed. Up to 20 iterations
are processed.
Appendix C 
Fibonacci Program Listing 
PROGRAM FIB 
LOC OBJ MREF STMT 
0100 
0100 0424 
0101 9000 1027 
1000 
1000 0000 
1001 1002 
1002 
1004 
1005 
1006 
1007 
1008 
1009 
lOOA 
100B 
7F03420F 
OOOF 
59FA 
5AFB 
8F2F 
46F8 1000 
1408 101 1 
1402 1ooc 
1406 1011 
lOOC 7F01421F 
lOOE FF0094F3 1001 
1010 8FFE 
1011 0812 
1012 5E00 
1013 8 F l l  
1014 5F00 
1015 0823 
1016 8F1F 
1017 6E00 
1018 8F12 
1019 6E00 
101A OOFO 
lOlB 7F03410F 
lOlD FF001200 
101F 0100 
1020 2048 
- - - - - -  
I *  
2 *  
3 *  
4 
5 
6 
7 
8 
9 *  
10 * 
11 A$2 
12 A$5 
13 * 
14 FIB 
15 
16 
17 
18 
19 
20 
21 
22 
23 * 
24 A$l 
25 
26 
27 * 
28 A$O 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 
39 
40 * 
ABS 
ORG lOOH 
CONT ER,lS 
J U  MAIN$ 
ORG lOOOH 
FIX 
LINK 
PUSHM 
TRA 
LOAD 
LOAD 
IAR 
CMP 
JU 
JU 
JU 
0 
FIBER 
0,3 
0,15 
1,-6,0 
2,-5,0 
2,-1 
2,A$2 
A$O 
A$l 
A$O 
PUSHM 1,2 
JSS* A$5 
IAR 1 5 . ~ 2  
ADDR 1,2 
LOAD 2,011 
IAR 111 
LOAD 3,011 
ADDR 2,3 
IAR 1,-1 
STO 2,031 
IAk 112 
STO 2,0,1 
TRA 15,O 
POPM 013 
RPS 0 
41 * Size: 47 
42 * 
43 * 
44 * 
45 A$6 LINK STAC$ 
46 A$8 LINK C1 
FIBONACCI PROCEDURE 
=DATA 
=NUM 
= I  
=FIB 
= ADDRESS DATA + NUM -1 
= DATA [NUM] 
= DATA [NUM+l] 
= -+ DATA[NUM] 
= + DATA[NUM+l] 
LOC OBJ MREF STMT SOURCE STATEMENT 
- - - _ - - - _ _ - -  _ _ _ _ - _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  
1021 0001 
1022 1047 
1023 0001 
1024 1002 
1025 77FB 
1026 2049 
1027 54F8 
1028 OOFO 
1029 56F8 
102A 55F8 
102B BE00 
102C 6E01 
102D 1418 
102E 7F00421F 
1030 57F3 
103 1 7F00423F 
1033 FF0094F'l 
1035 0061 
1036 7D741D00 
1038 0267 
1039 0667 
103A 8084 
103B OB68 
103C OF86 
103D OC9B 
103E 8291 
103F OCBB 
1040 0561 
1041 0000 
1042 FF0021AC 
1044 0813 
1045 FCOO 
1046 14E8 
0100 
1047 0000 
2047 
2048 
2049 
204A 
47 
48 
49 
50 
51 
52 
53 
54 
lOlF 55 
56 
1021 57 
1022 58 
59 
60 
1045 61 
62 
1023 63 
64 
1024 65 
66 
67 
68 
69 
70 
7.1 
72 
73 
74 
1040 75 
76 
1042 77 
78 
79 
80 
81 
102E 82 
83 
84 
85 
86 
87 
88 
89 
90 
91 
A19 
A$lO 
A$11 
A$12 
A$7 
A$13 * 
MAIN$ 
LOOP 
* 
A$25 
A$26 
A$27 
* 
FIX 
LINK 
FIX 
LINK 
LINK 
LINK 
ENTRY 
LOAD 
TRA 
LOAD 
LOAD 
STO 
STO 
JU 
PUSHM 
LOAD 
PUSHM 
JSS* 
IAR 
TRA 
LDM 
LCM 
AND 
SLSA 
MPY 
DIV 
SUBR 
SKCT 
CLAO 
DECNE 
NOP 
DACM 
ADDR 
HALT 
J U  
* Size: 26 
STAC$ EQU 
DATA BSZ 
I RES 
C1 RES 
C2 RES 
END 
* 
1 
DATA 
1 
FIB 
30715 
C2 
MAIN$ 
O,A$6 
15,O 
2,A$9 
1,A$10 
2,0,1 
2,1,1 
A$27 
1,1 
3,A$11 
3,3 
A$12 
15,-2 
6 1  
7 , l  l , O , l  
6,7 
6,7 
834 
6 8  
836 
9,11 
9,A$25 
11,11 
6,A$26 
10,12 
173 
LOOP 
256 
4096 
1 
1 
1 
MAINLINE AND INITIALIZATION 
=STAC$ 
=I 
=DATA i 
= DATA + 1 
BEGINNING OF ITERATION 
I 
I 
=4 1 
1 
1 =f1b0 
SEQUENCE OF INSTRUCTIONS ~ 
END OF 1 ITERATION 
References 
1. Gault, James W.; Trivedi, Kishor S.; and Clary,
James B., eds.: Validation Methods Research for Fault-
Tolerant Avionics and Control Systems-Working Group
Meeting II. NASA CP-2130, 1980.
2. Lala, Jaynarayan H.; and Smith, T. Basil, III: Develop-
ment and Evaluation of a Fault-Tolerant Multiprocessor
(FTMP) Computer. Volume III-FTMP Test and Eval-
uation. NASA CR-166073, 1983.
3. Siewiorek, Daniel P.; and Lai, Larry Kwok-Woo: Test-
ing of Digital Systems. Proc. IEEE, vol. 69, no. 10,
Oct. 1981, pp. 1321-1333.
4. Hopkins, Albert L., Jr.; Smith, T. Basil, III; and
Lala, Jaynarayan H.: FTMP-A Highly Reliable Fault-
Tolerant Multiprocessor for Aircraft. Proc. IEEE,
vol. 66, no. 10, Oct. 1978, pp. 1221-1239.
5. Smith, T. Basil: Fault Tolerant Processor Concepts
and Operation. The Fourteenth International Conference
on Fault-Tolerant Computing-Digest of Papers, IEEE
Catalog No. 84CH2050-3, IEEE Computer Soc. Press,
c.1984, pp. 158-163.
6. Goldberg, Jack; Kautz, William H.; Melliar-Smith,
P. Michael; Green, Milton W.; Levitt, Karl N.; Schwartz,
Richard L.; and Weinstock, Charles B.: Development
and Analysis of the Software Implemented Fault-Tolerance
(SIFT) Computer. NASA CR-172146, 1984.
7. Military Handbook-Reliability Prediction of Electronic
Equipment. MIL-HDBK-217D, U.S. Dep. of Defense,
Jan. 15, 1982. (Supersedes MIL-HDBK-217C, Apr. 9,
1979.)
8. Hollander, Myles; and Wolfe, Douglas A.: Nonparamet-
ric Statistical Methods. John Wiley & Sons, Inc., c.1973.
9. User’s Manual. IMSL Library-Problem-Solving
Software System for Mathematical and Statistical
FORTRAN Programming, Volume 3, Edition 9.2, IMSL
LIB-0009, IMSL, Inc., c.1984.
10. McGough, John G.; and Swern, Fred L.: Measurement of
Fault Latency in a Digital Avionic Mini Processor. NASA
CR-3462, 1981.
11. Nagel, Phyllis M.: Modeling of a Latent Fault Detector
in a Digital System. NASA CR-145371, 1978.
12. Butler, Ricky W.: The SURE Reliability Analysis Pro-
gram. NASA TM-87593, 1986.
13. Green, David F., Jr.; Palumbo, Daniel L.; and Baltrus,
Daniel W.: Software Implemented Fault-Tolerant (SIFT)
User’s Guide. NASA TM-86289, 1984.
Table I. The 48 Pins of Virtual Microprocessor
[An asterisk indicates active when signal low]
[Boards listed: CPU, TC; the row at which the board changes is not recoverable]

Signal   Micro pin   Chip      Edge connector
MAR00*   1           U43 p20   J10 p7B
MAR01*   2           U43 p18   J10 p6B
MAR02*   3           U43 p16   J10 p5B
MAR03*   4           U43 p14   J10 p4B
MAR04*   5           U42 p20   J10 p17B
MAR05*   6           U42 p18   J10 p16B
MAR06*   7           U42 p16   J10 p15B
MAR07*   8           U42 p14   J10 p13B
MAR08*   9           U40 p20   J10 p39B
MAR09*   10          U40 p18   J10 p38B
MAR10*   11          U40 p16   J10 p36B
MAR11*   12          U40 p14   J10 p35B
MAR12*   13          U39 p20   J10 p56B
MAR13*   14          U39 p18   J10 p54B
MAR14*   15          U39 p16   J10 p53B
MAR15*   16          U39 p14   J10 p52B
DAT00    17          U37 p18   J10 p27B
DAT01    18          U37 p17   J10 p9A
DAT02    19          U37 p14   J10 p22B
DAT03    20          U37 p13   J10 p21B
DAT04    21          U37 p8    J10 p11A
DAT05    22          U37 p7    J10 p12A
DAT06    23          U37 p4    J10 p39A
DAT07    24          U37 p3    J10 p13A
DAT08    25          U36 p18   J10 p34B
DAT09    26          U36 p17   J10 p33B
DAT10    27          U36 p14   J10 p34A
DAT11    28          U36 p13   J10 p31B
DAT12    29          U36 p8    J10 p31A
DAT13    30          U36 p7    J10 p33A
DAT14    31          U36 p4    J10 p36A
DAT15    32          U36 p3    J10 p38A
FOV      33          U6 p6     J10 p35A
IND      34          U13 p6    J10 p1A
LINK     35          U13 p8    J10 p2B
PFEIN    36          U6 p8     J10 p5A
FLAG1    37          U7 p6     J10 p8A
FLAG2    38          U7 p8     J10 p10A
POS      39          U20 p6    J9 p21A
PON      40          U41 p6    J9 p26C
MM*      41          U58 p9    J9 p27C
MEM1     42          U58 p10   J9 p52B
MEM2     43          U58 p7    J9 p52C
WE1      44          U58 p6    J9 p44A
WE2      45          U46 p10   J9 p54C
CDE01    46          U46 p9    J9 p49C
CDE02    47          U14 p8    J9 p48C
ABUF*    48          U19 p12   J9 p5B
Table II. Test Signals
[An asterisk indicates active when signal low]
[Of the Signal and Board columns only the entries A* and CPU survive; their row positions are not recoverable]

Pin         Use
J9 p24B     External clock
J10 p11B    Clock
J10 p44A    Clock
J10 p6A     Trigger injection
J10 p19B    Input data
J10 p18B    Instruction
J10 p29B    Output data
Table III. List of Pins Tested
[An asterisk indicates active when signal low]
[Board: CPU; Processor: Proc7]

No.   Signal    Chip   Pin   Fault type
1     SRAM15    U11    p7    SA1
2     SRAM15    U11    p7    SA0
3     SQ00      U11    p9    SA1
4     SQ00      U11    p9    SA0
5     LINK      U13    p8    SA1
6     LINK      U13    p8    SA0
7     IND       U13    p6    SA1
8     IND       U13    p6    SA0
9     RPTOV*    U28    p15   SA1
10    RPTOV*    U28    p15   SA0
11    QBIT      U12    p6    SA1
12    QBIT      U12    p6    SA0
13    CON       U12    p8    SA1
14    CON       U12    p8    SA0
15    TIB*      U20    p6    SA1
16    TIB*      U20    p6    SA0
17    TDR*      U20    p11   SA1
18    TDR*      U20    p11   SA0
19    IR01      U35    p3    SA1
20    IR01      U35    p3    SA0
21    SRAM00    U35    p16   SA1
22    SRAM00    U35    p16   SA0
23    COUT      U35    p33   SA1
24    COUT      U35    p33   SA0
25    SPA0      U35    p4    SA1
26    SPA0      U35    p4    SA0
27    U10       U35    p12   SA1
28    U10       U35    p12   SA0
29    DAT15     U30    p2    SA1
30    DAT15     U30    p2    SA0
31    DAT07     U31    p2    SA1
32    DAT07     U31    p2    SA0
33    DAT00     U31    p19   SA1
34    DAT00     U31    p19   SA0
35    Y00       U34    p19   SA1
36    Y00       U34    p19   SA0
37    D00       U37    p19   SA1
38    D00       U37    p19   SA0
39    MAR15*    U39    p14   SA1
40    MAR15*    U39    p14   SA0
41    MAR07*    U42    p14   SA1
42    MAR07*    U42    p14   SA0
43    MAR00*    U43    p20   SA1
44    MAR00*    U43    p20   SA0
45    UMA1      U45    p7    SA1
46    UMA1      U45    p7    SA0
Table III. Continued
[Processor: Proc7; the Board and Pin columns are not recoverable]

No.   Signal    Chip   Fault type
47    UMA0      U45    SA1
48    UMA0      U45    SA0
49    Y15       U33    SA1
50    Y15       U33    SA0
51    ESTRT*    U70    SA1
52    ESTRT*    U70    SA0
53    EMSB*     U70    SA1
54    EMSB*     U70    SA0
55    ESPC*     U70    SA1
56    ESPC*     U70    SA0
57    EBCH*     U70    SA1
58    EBCH*     U70    SA0
59    ETIR*     U70    SA1
60    ETIR*     U70    SA0
61    ELSB*     U61    SA1
62    ELSB*     U61    SA0
63    HALTM*    U64    SA1
64    HALTM*    U64    SA0
65    IAM*      U62    SA1
66    IAM*      U62    SA0
67    IRS       U09    SA1
68    IRS       U09    SA0
69    HLTLP     U65    SA1
70    HLTLP     U65    SA0
71    U30       U67    SA1
72    U30       U67    SA0
73    U54       U68    SA1
74    U54       U68    SA0
75    FOV       U06    SA1
76    FOV       U06    SA0
77    FLAG2     U71    SA1
78    FLAG2     U71    SA0
79    FLAG1     U71    SA1
80    FLAG1     U71    SA0
81    SPB0      U27    SA1
82    SPB0      U27    SA0
83    QIO*      U32    SA1
84    QIO*      U32    SA0
85    QII*      U32    SA1
86    QII*      U32    SA0
87    NORM*     U32    SA1
88    NORM*     U32    SA0
89    MEM*      U32    SA1
90    MEM*      U32    SA0
91    MM*       U44    SA1
92    MM*       U44    SA0
93    QIOIN*    U05    SA1
94    QIOIN*    U05    SA0
95    MI*       U35    SA1
96    MI*       U35    SA0
Table III. Concluded
[Board: TC; Processor: Proc7; the Pin column is not recoverable]

No.   Signal    Chip   Fault type
97    PST       U13    SA1
98    PST       U13    SA0
99    BOSC1     U14    SA1
100   BOSC1     U14    SA0
101   BOSC*     U14    SA1
102   BOSC*     U14    SA0
103   ACK*      U14    SA1
104   ACK*      U14    SA0
105   PON       U19    SA1
106   PON       U19    SA0
107   FWOT      U20    SA1
108   FWOT      U20    SA0
109   EOUT*     U04    SA1
110   EOUT*     U04    SA0
111   POS       U20    SA1
112   POS       U20    SA0
113   CDE02*    U46    SA1
114   CDE02*    U46    SA0
115   CDE01*    U46    SA1
116   CDE01*    U46    SA0
117   MMI1*     U46    SA1
118   MMI1*     U46    SA0
119   MMI2*     U46    SA1
120   MMI2*     U46    SA0
121   MEM1*     U58    SA1
122   MEM1*     U58    SA0
123   MEM2*     U58    SA1
124   MEM2*     U58    SA0
125   WE1       U61    SA1
126   WE1       U61    SA0
127   WE2       U61    SA1
128   WE2       U61    SA0
129   QMI*      U38    SA1
130   QMI*      U38    SA0
131   AI*       U08    SA1
132   AI*       U08    SA0
133   B*        U08    SA1
134   B*        U08    SA0
135   BBUF      U14    SA1
136   BBUF      U14    SA0
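The numbering pattern of Table III, in which each tested signal receives a stuck-at-1 (SA1) injection followed by a stuck-at-0 (SA0) injection, can be illustrated with a short Python sketch. The function names `build_test_list` and `apply_stuck_at` are hypothetical, and modeling a stuck-at fault as a pin trace held at one constant logic level is a simplifying assumption, not a description of the injection hardware used in the experiment.

```python
from typing import List, Tuple


def build_test_list(signals: List[str]) -> List[Tuple[int, str, str]]:
    """Number the injections so that each signal gets SA1, then SA0."""
    tests = []
    for signal in signals:
        for fault in ("SA1", "SA0"):
            # Test numbers start at 1 and increase by one per injection.
            tests.append((len(tests) + 1, signal, fault))
    return tests


def apply_stuck_at(trace: List[int], fault: str) -> List[int]:
    """Assumed fault model: every sample of the pin is forced to one level."""
    level = {"SA1": 1, "SA0": 0}[fault]
    return [level for _ in trace]


# First three signals of Table III, used here only as sample input.
tests = build_test_list(["SRAM15", "SQ00", "LINK"])
forced = apply_stuck_at([0, 1, 1, 0], "SA0")
```

Under this numbering, odd test numbers are SA1 injections and even test numbers are SA0 injections, so 68 signal entries yield the 136 tests listed.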
[Diagram: Processor A and Processor B feed a comparator, which produces an error signal.]
Figure 1. Self-checking dual processor.
[Diagram: logic-level (L/H) traces of a fault-free pin and of the same pin with a fault present, with TBE and DE marked.]
Figure 2. Dynamic error behavior definition.
TBE = time between errors
DE = duration of errors
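As a minimal sketch of the two quantities defined in Figure 2, the Python fragment below extracts DE and TBE from a pair of sampled logic traces. Several things here are assumptions made for illustration: the traces are lists of 0/1 samples taken at a fixed interval, an error is any run of samples where the faulted pin disagrees with the fault-free pin, and TBE is measured from the end of one error to the start of the next. The function names are hypothetical.

```python
from typing import List, Tuple


def error_intervals(good: List[int], faulty: List[int]) -> List[Tuple[int, int]]:
    """Return half-open [start, end) sample ranges where the traces disagree."""
    if len(good) != len(faulty):
        raise ValueError("traces must have the same length")
    intervals = []
    start = None
    for i, (g, f) in enumerate(zip(good, faulty)):
        if g != f and start is None:
            start = i                      # an error begins
        elif g == f and start is not None:
            intervals.append((start, i))   # the error ends
            start = None
    if start is not None:                  # error still active at end of trace
        intervals.append((start, len(good)))
    return intervals


def de_and_tbe(good: List[int], faulty: List[int]) -> Tuple[List[int], List[int]]:
    """DE: length of each error. TBE (assumed): gap between successive errors."""
    iv = error_intervals(good, faulty)
    de = [end - start for start, end in iv]
    tbe = [iv[k + 1][0] - iv[k][1] for k in range(len(iv) - 1)]
    return de, tbe


good = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]      # fault-free pin
faulty = [1] * len(good)                   # same pin stuck at 1
de, tbe = de_and_tbe(good, faulty)
```

With these sample traces the pin is in error whenever the fault-free value is 0, giving `de == [1, 2, 1, 1]` and `tbe == [2, 1, 2]` in units of samples.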
State = (G,A,R) 
G = Processors without faults 
A = Processors with faults 
R = Processors reconfigured 
δ = Recovery process rate
λ = Rate of failure to A
Figure 21. Model of failure arrival-recovery process. 
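The state notation of Figure 21 can be made concrete with a small Python sketch that lists the transitions leaving a state (G, A, R). The legend gives only the two rates, so the transition structure below is an assumption: each fault-free processor is taken to fail independently at the failure rate (`lam`), and each faulty processor is taken to be reconfigured out at the recovery rate (`delta`). The function name `transitions` is hypothetical.

```python
from typing import Dict, Tuple

State = Tuple[int, int, int]  # (G, A, R) as defined in the Figure 21 legend


def transitions(state: State, lam: float, delta: float) -> Dict[State, float]:
    """Map each successor state to its assumed transition rate."""
    g, a, r = state
    out: Dict[State, float] = {}
    if g > 0:
        # Assumption: a failure moves one processor from G to A at rate g * lam.
        out[(g - 1, a + 1, r)] = g * lam
    if a > 0:
        # Assumption: a recovery moves one processor from A to R at rate a * delta.
        out[(g, a - 1, r + 1)] = a * delta
    return out


# Example with arbitrary illustrative rates (not values from the paper).
first = transitions((3, 0, 0), lam=0.5, delta=2.0)
second = transitions((2, 1, 0), lam=0.5, delta=2.0)
```

In this sketch a state with no faulty processors has only a failure transition, while a state with one faulty processor has both a further-failure transition and a recovery transition competing with it.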
[Block diagram; legible labels include: data register, memory buffer, shift in logic, carry in, repeat counter, instruction, main register, processor logic, micro-address control logic, register, timing and control signals.]
Figure 22. Block diagram of the BDX-930.
[Diagram; legible labels: extension cable, BDX-930, data acquisition probes.]
Figure 23. Experiment configuration.
Report Documentation Page
1. Report No.: NASA TP-2738
2. Government Accession No.:
3. Recipient’s Catalog No.:
4. Title and Subtitle: A Technique for Evaluating the Application of the Pin-Level Stuck-At Fault Model to VLSI Circuits
5. Report Date: September 1987
6. Performing Organization Code:
7. Author(s): Daniel L. Palumbo and George B. Finelli
8. Performing Organization Report No.: L-16269
9. Performing Organization Name and Address: NASA Langley Research Center, Hampton, VA 23665-5225
10. Work Unit No.: 505-66-21-01
11. Contract or Grant No.:
12. Sponsoring Agency Name and Address: National Aeronautics and Space Administration, Washington, DC 20546-0001
13. Type of Report and Period Covered: Technical Paper
14. Sponsoring Agency Code:
15. Supplementary Notes:
16. Abstract
Accurate fault models are required to conduct the experiments defined in validation methodologies
for highly reliable fault-tolerant computers (e.g., computers with a probability of failure of 10^-9
for a 10-hour mission). This paper describes a technique by which a researcher can evaluate the
capability of the pin-level stuck-at fault model to simulate true error behavior symptoms in very
large scale integrated (VLSI) digital circuits. The technique is based on a statistical comparison of
the error behavior resulting from faults applied at the pin level of, and internal to, a VLSI circuit.
As an example of an application of the technique, the error behavior of a microprocessor simulation
subjected to internal stuck-at faults is compared with the error behavior that results from pin-level
stuck-at faults. The error behavior is characterized by the time between errors and the duration of
errors. Based on this example data, the pin-level stuck-at fault model is found to deliver less than
ideal performance. However, with respect to the class of faults that cause a system “crash,” the
pin-level stuck-at fault model is found to provide a good modeling capability.
17. Key Words (Suggested by Author(s)): Fault tolerance; Fault injection; Recovery mechanisms; Fault models
18. Distribution Statement: Unclassified-Unlimited. Subject Category 38
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages: 44
22. Price: A03
NASA FORM 1626 OCT 86
NASA-Langley, 1987
For sale by the National Technical Information Service, Springfield, Virginia 22161-2171
