Measurement of fault latency in a digital avionic mini processor, part 2 by Mcgough, J. & Swern, F.
NASA Contractor Report 3651
Measurement of Fault Latency in a
Digital Avionic Mini Processor
Part II
John G. McGough and Fred L. Swern
CONTRACT NAS1-15946
JANUARY 1983 
https://ntrs.nasa.gov/search.jsp?R=19830008826 2020-03-21T05:56:37+00:00Z
Bendix Corporation 
Teterboro, New Jersey 
Prepared for 
Langley Research Center 
under Contract NAS1-15946
National Aeronautics 
and Space Administration 
Scientific and Technical 
Information Branch 
1983 
TABLE OF CONTENTS
1.0 INTRODUCTION
    1.1 Background
    1.2 Objectives of the Study
    1.3 Foreword
2.0 SUMMARY
3.0 FAULT MODELLING AND SELECTION
    3.1 Fault Model
    3.2 Method of Selecting Faults
4.0 DESCRIPTION OF EXPERIMENTS
    4.1 Definition of Failure Detection
    4.2 Definition of Failure Detection Coverage
    4.3 Indistinguishable Faults and Effects on Coverage
    4.4 Objectives of Experiments
    4.5 Experiments
        4.5.1 Search and Compute (SERCOM)
        4.5.2 Linear Convergence (LINCON)
        4.5.3 Quadratic (QUAD)
        4.5.4 Flight Control System (FCS)
5.0 RESULTS OF EXPERIMENTS
    5.1 Distribution of Faults
    5.2 Experiments
        5.2.1 SERCOM Experiment
        5.2.2 LINCON Experiment
        5.2.3 QUAD Experiment
        5.2.4 FCS Experiments (Quasi-Repetitions)
        5.2.5 FCS Experiments (True-Repetitions)
    5.3 Urn Model Parameters
    5.4 Accuracy and Confidence of Results
6.0 SUMMARY OF EXPERIMENTS
    6.1 Latency
    6.2 Urn Model
7.0 SELF-TEST DESIGN AND VALIDATION
    7.1 Initial Self-Test Program
    7.2 Principal Tests
    7.3 Self-Test Results
        7.3.1 Indistinguishable Faults
    7.4 Summary and Conclusions
8.0 URN MODEL
    8.1 Urn Model Description
9.0 ESTIMATORS
    9.1 Estimators for Self-Test Coverage
    9.2 Estimators for Latency
    9.3 Estimators for Urn Model Parameters
    9.4 Accuracy and Confidence of Coverage Estimates
        9.4.1 Self-Test Coverage
        9.4.2 Latency Estimate
        9.4.3 Urn Model Parameter Estimates
10.0 EMULATION CHARACTERISTICS
    10.1 BDX-930 Architecture
    10.2 Description of the Emulator
11.0 CONCLUSIONS
12.0 REFERENCES
LIST OF ILLUSTRATIONS
FIGURE  TITLE
1       Flow Diagram for LINCON
2A      SERCOM Combined Gate-Level Faults
2B      SERCOM Combined Component-Level Faults
3A      LINCON Combined Gate-Level Faults
3B      LINCON Combined Component-Level Faults
4A      QUAD Combined Gate-Level Faults
4B      QUAD Combined Component-Level Faults
5A      Flight Control Computations Quasi-Repetitions Combined Gate-Level Faults
5B      Flight Control Computations Quasi-Repetitions Combined Component-Level Faults
6       Flight Control Computations True Repetitions Combined Gate-Level Faults
7       Urn Model Distribution Flight Control Computations True Repetitions Combined Gate-Level Faults
8       Comparison of Fault Latency Distributions
9       Markov Model Representation of the Urn Model
10      BDX-930 Processor
11A     Non-Faulted "And" Gate
11B     Fault Model of "And" Gate
LIST OF TABLES
TABLE   TITLE
1       Failure Rates by Partition
2       Number of Faults Injected
3A      Number of Gate-Level Faults Injected by Partitions
3B      Number of Component-Level Faults Injected by Partitions
4A      SERCOM Latency Data
4B      SERCOM Latency Data
5A      LINCON Latency Data
5B      LINCON Latency Data
6A      QUAD Latency Data
6B      QUAD Latency Data
7A      Flight Control Computations Latency Data (Quasi-Repetitions)
7B      Flight Control Computations Latency Data
8       Flight Control Computations Latency Data
9       Urn Model Parameter Estimates for Gate-Level Faults
10      Comparison of Gate Versus Component-Level Coverage
11      Urn Model Parameters Combined, Gate-Level Faults
12      CPU Self-Test Revisions Gate-Level Coverage
13      Final Self-Test Gate-Level Coverage by Partitions
14      Initial Self-Test Gate-Level Coverage by Partitions
15      Final Self-Test Component-Level Coverage by Partitions
16      Initial Self-Test Component-Level Coverage by Partitions
17      Proportion of Indistinguishable Faults ($\gamma$)
18      Error Ellipse for a Confidence Level of $\gamma$ = .95
19      Maximum Error Versus Sample Size and Confidence Level
20      Components of the BDX-930 CPU
21      Microcircuits and Equivalent Gate Count
1.0 INTRODUCTION 
1.1 Background 
This study is a follow-on of an earlier study entitled: 
"Measurement of Fault Latency in a Digital Avionic 
Mini Processor" (ref. 1). 
To place the present study in perspective we include a brief summary
of the results of the earlier study:
Summary of Earlier Study
• A gate-level emulation of the Bendix BDX-930 digital computer was
developed for the purpose of analyzing failure modes and effects in
digital systems. When hosted on a VAX 11/780 the emulator ran 7000
times slower than the BDX-930.
• Six software programs were emulated and faults were injected at both
the gate-level and pin-level (i.e., component-level). The resultant
computed outputs were compared with those of a non-faulted computer
executing the same program. A fault was considered detected when
these outputs differed. The results showed that:
    • Most detected faults are detected in the first repetition.
      Subsequent repetitions do not appreciably increase the
      proportion of detected faults.
    • A large proportion of faults remained undetected after as many
      as 8 repetitions of the program, e.g., 60% at the gate-level.
    • Component-level faults are easier to detect than gate-level
      faults. For example, after 8 repetitions, the proportions of
      undetected faults were

      PROGRAM    GATE-LEVEL    COMPONENT-LEVEL
      FETSTO     61.7%         35.5%
      FIB        58.2%         28%
      ADDSUB     59.5%         32.3%

To corroborate the findings of a pilot study (ref. 2) the instruction
repertoire of the BDX-930 was limited to the following instructions:
      Load
      Store
      Add
      Subtract
      Branch
      Transfer
      Clear
• The results of the study corroborated the findings of the pilot study
(ref. 2). This was surprising considering that the pilot study
used an emulation of a very simple processor. As an illustration,
the pilot study indicated that, after 8 repetitions, the proportions
of undetected faults were 64.4%, 53.7% and 44.9% for FETSTO, FIB and
ADDSUB, respectively.
• The Urn Model, for forecasting fault latency, produced distributions
that were in close agreement with the empirical distributions.
• A self-test program of 2000 executable instructions was expressly
designed for the study. The designer was given the single require-
ment that fault coverage should be at least 95%. The resultant test
consisted of 241 separate subtests for the purpose of exercising the
entire instruction set of the BDX-930.
The results indicated that there is a significant difference in cover-
age of gate-level versus component-level faults. For example,
      gate-level coverage = 86.5%
      component-level coverage = 97.9%
• Only 48% of all detected faults were detected by a subtest. The
remaining detected faults were detected because the first subtest
was not computed.
• Most of the subtests were redundant, i.e., only 46 of 241 subtests
actually detected a fault.
• 62% of all detected faults were detected by the first 23 subtests.
• A large proportion (23.7%) of the injected faults were "don't care"
(i.e., indistinguishable) faults. These proved to be exceedingly
difficult to identify.
• The micromemory PROM contained the largest proportion of undetected
faults.
To conserve space this report omits certain details which were contained 
in the earlier report, notably in the areas of statistical analyses and 
descriptions of the emulator. However, the present text is self-contained; 
whenever comparisons are necessary the pertinent data from the earlier report 
is duplicated. 
1.2 Objectives of the Study 
The poor coverage of comparison-monitoring demonstrated by the earlier
study could have been due to the limited repertoire of the instruction
set used. As a consequence, it was decided to reprogram SERCOM, LINCON and QUAD,
this time expanding the instruction set to capitalize on the full power of
the BDX-930. As a final demonstration of failure coverage an extensive, 3-axis,
high performance flight control computation was added.
As the summary of the earlier study indicates, failure detection cov- 
erage of the target self-test program was a disappointing 86.5% for gate-level 
faults. As a consequence, Bendix conducted a development program (independent 
of the present contract) for the purpose of upgrading the self-test to approx- 
imately 95% coverage while minimizing the number of instructions and run-time. 
The initial self-test was used as a baseline. The successive self-test programs 
and their resultant coverages are described in Section 7.0. 
1.3 Foreword
The use of trade names of manufacturers in this report does not con- 
stitute an official endorsement of such products or manufacturers, either 
expressed or implied by the National Aeronautics and Space Administration. 
2.0 SUMMARY
1. A gate-level emulation of the Bendix BDX-930 digital computer was used
to perform fault injection experiments to determine a program's ability
to detect faults. The emulator was hosted on a VAX-11/780. The resul-
tant run-time was 7000 times slower than the BDX-930.
2. Four software programs were emulated and faults were injected at both
the gate-level and pin-level (i.e., component-level). The resultant
computed outputs were compared with those of a non-faulted computer
executing the same program. A fault was considered detected when these
outputs differed.
3. The present study was a follow-on to a previous study in which the pro-
grams were limited to a simple instruction set. The four programs of
the present study used the full power of the BDX-930 instruction set.
One of the four programs included a 3-axis, high performance flight
control system of approximately 2200 executable instructions. The
objective was to corroborate the conclusions of the previous study.
4. The results corroborated those of the previous study. In particular:
    • Most detected faults are detected in the first repetition of
      a program. Subsequent repetitions do not appreciably increase
      the proportion of detected faults.
    • Short programs have a tendency to benefit more from subsequent
      repetitions than lengthier programs.
    • A large proportion of gate-level faults remained undetected
      after as many as 8 repetitions of the program. For example,
      in the flight control computation, 21% of distinguishable
      (i.e., "CARE") faults remained undetected after 8 repetitions.
    • Pin-level faults are easier to detect than gate-level faults.
5. A self-test program should be designed to capitalize on the hardware
mechanization of the CPU. A self-test designed to exercise instructions
without regard for the hardware mechanization tends to be inefficient
in real-time and memory.
6. The Urn Model can characterize fault latency distributions. It is doubt-
ful, however, that the Urn Model parameters can be predicted on the
basis of a program's length and instruction mix.
3.0 FAULT MODELLING AND SELECTION 
3.1 Fault Model 
At the present time there is little or no data available regarding
either the mode or frequency of failures of MSI and LSI devices. Despite this
deficiency of data, failure modes and effects analyses are regularly performed
for avionics and flight control systems. The conventional approach is to
assume a set of failure modes for each device. These are usually restricted
to faults at single pins although, occasionally, multiple faults may be con-
sidered. In most cases the failure rate of a device is assumed to be equally
distributed over the pins or over the set of postulated failure modes. Except
for special devices, faults are assumed to be static, being either stuck-at-0
(S-a-0) or stuck-at-1 (S-a-1).
The point to be made here is that failure modes and their rate of occur- 
rence are necessarily conjectural and the credibility of the present study 
suffers no less from this deficiency of data than the conventional analysis. 
The authors emphasize that the emulation approach does not solve this problem. 
In the present study the following assumptions are made regarding failure
modes:
• Every device can be represented, from the standpoint of performance
  and failure modes, by the manufacturer-supplied, gate-level equivalent
  circuit.
• Every fault can be represented as either an S-a-0 or S-a-1 fault at a
  gate node.
• The failure rate of the device is equally distributed over the gates
  of the equivalent circuit.
• The failure rate of a gate is equally distributed over the nodes of
  the gate.
• S-a-0 and S-a-1 faults are equally likely.
• Memory faults are exclusively faults of single bits.
• A memory fault is the complement of its non-faulted state.
Faults are injected into all devices except the main memory. In the 
case of the microprogram memory, which is emulated at the functional level, 
faults are injected into the memory cells where they remain active for the 
duration of the test. Faults are injected at an input or output gate node, 
and also remain active for the duration of the test. When a fault is injected 
at an output node it is allowed to propagate to all nodes and devices that are 
physically connected to the failed node. When a fault is injected at an input 
node it does not propagate back to the driving node. This strategy provides 
a wider variety of failure modes than would otherwise be possible if propaga- 
tion were allowed. The fault model, although conjectural at the present time, 
can be updated as fault data becomes available. The proposed model provides 
a simple, automatic and consistent method of generating faults. The resultant 
fault set includes a rich assortment of static and dynamic (i.e., data-dependent) 
faults. 
3.2 Method of Selecting Faults 
The method of selecting faults is implicit in the fault model.
Explicitly,
• Each device is assigned a failure rate.
• The failure rate is equally distributed over the gates of the gate-
  level representation.
• The failure rate of each gate is equally distributed over the nodes
  of the gate.
• The failure rate of each node is equally distributed over S-a-0 and
  S-a-1 faults.
• As a result of this procedure, each S-a-0 and S-a-1 fault is assigned
  a probability of occurrence proportional to the prescribed failure
  rate. The resultant fault set is then randomly sampled with each
  fault weighted by its probability of occurrence. It is noted that,
  according to this procedure, faults in devices with high failure rates
  will be selected more frequently than faults in devices with lower
  failure rates.
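The selection procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the emulator's actual code: the device names, failure rates and gate/node counts are hypothetical, and sampling with replacement (`random.choices`) is assumed because the procedure does not say whether faults were drawn with or without replacement.

```python
import random


def build_fault_set(devices):
    """Enumerate stuck-at faults and their weights.

    devices: {name: (failure_rate, [nodes_in_gate_0, nodes_in_gate_1, ...])}
    (hypothetical input format, chosen only for this sketch).
    """
    faults, weights = [], []
    for name, (rate, gates) in devices.items():
        gate_rate = rate / len(gates)          # rate split equally over gates
        for gate_index, node_count in enumerate(gates):
            node_rate = gate_rate / node_count  # then equally over nodes
            for node_index in range(node_count):
                for stuck_value in (0, 1):      # then equally over S-a-0 / S-a-1
                    faults.append((name, gate_index, node_index, stuck_value))
                    weights.append(node_rate / 2)
    return faults, weights


def sample_faults(faults, weights, count, seed=0):
    """Draw `count` faults, each weighted by its probability of occurrence.

    Assumption: draws are independent (with replacement).
    """
    rng = random.Random(seed)
    return rng.choices(faults, weights=weights, k=count)
```

With this weighting the total weight of all faults in a device equals that device's failure rate, so devices with higher failure rates are selected more often.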
The above procedure does not distinguish between gate-level and component
(i.e., pin)-level faults except by probability of occurrence; the method auto-
matically assigns failure rates to pins. However, a different selection
procedure was employed for component-level faults. For these faults it was
assumed that:
l The failure rate of each device is equally distributed over the pins. 
While this assumption violates the prescribed fault model it is consis- 
tent with the conventional method of estimating fault detection coverage by 
simulating faults in actual hardware. As a consequence, all component-level 
detection estimates obtained in the study are estimates that would be obtained 
by proponents of this approach. 
4.0 DESCRIPTION OF EXPERIMENTS 
4.1 Definition of Failure Detection 
In the present study, fault coverage and latency estimates are obtained 
by employing two conventional techniques of failure detection: comparison- 
monitoring and self-test. 
In comparison-monitoring a set of computed variables is compared with a 
corresponding set computed in another processor. If it is arranged that both 
processors operate on identical inputs and are closely synchronized, then any 
difference in a computed variable signifies that one of the processors has 
failed. In practice each processor executes an algorithm which compares the 
appropriate variables and signals a discrepancy when such exists. In the 
present study this algorithm was omitted; a fault is considered to be detected 
if a difference between corresponding variables exists irrespective of the 
ability of either processor to recognize the difference or signal the discre- 
pancy. Thus, the fault coverage obtained from the study is somewhat more 
optimistic than would be obtained in practice. 
In self-test, on the other hand, each component of the processor is exer- 
cised by a set of computations designed specifically to test that component. 
The results of each computational set are compared with pre-stored values and 
any difference signifies that the fault was detected. In practice, and in 
the study, the processor increments a register after the successful comple- 
tion of each test and before proceeding to the next test. If the test is not 
successful the program exits. After an interval of time equal to the maximum
time to complete the program, the contents of the counter are decoded. If the
value exactly equals the total number of tests, the fault was not detected.
Otherwise the fault was detected.
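The self-test decision rule just described can be sketched as follows. This is only an illustration of the counting logic: the subtests are modelled as hypothetical Python callables that return True on success, and the timing aspect (waiting for the maximum program time before decoding the counter) is not modelled.

```python
def run_self_test(subtests):
    """Run subtests in order, incrementing a counter after each success.

    The program exits at the first unsuccessful subtest, so the counter
    holds the number of subtests completed before the exit.
    """
    counter = 0
    for subtest in subtests:
        if not subtest():
            break
        counter += 1
    return counter


def fault_detected(counter, total_tests):
    """A fault is detected unless the counter equals the total number of tests."""
    return counter != total_tests
```

Note that under this rule a fault that prevents the program from completing is also counted as detected, since the counter never reaches the total.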
It is emphasized that "failure detection", as it is used in the present
study, means almost exactly what it means in an actual airborne avionic sys-
tem. This is in marked contrast to the commonly employed alternate approach
of assuming that a failure is detected whenever the effect of the failure
reaches an accessible bus or register, even though the program may not be
interrogating these devices at that time.
In the following paragraphs a description is given of the actual compu- 
tations involved in the experiment with particular emphasis on the explicit 
definition of "failure detection" in each instance. 
4.2 Definition of Failure Detection Coverage 
We assume that a test procedure is given for detecting failures of a 
component, C. Each failure mode of C will require a non-zero time for detec- 
tion. By considering all failures of C and all combinations of inputs and 
internal states of C, we obtain in principle, if not in practice, a probabil- 
ity density function for time-to-detect, which is measured from the onset of 
the failure to the time of detection. 
Denoting this density by $\mathrm{pdf}(\tau)$ where
$\tau$ = time-to-detect = latency time
we define
Test Coverage
1) $1 - \alpha(\tau) = \int_{0}^{\tau} \mathrm{pdf}(x)\,dx$
= probability of detecting a failure of C in
the interval $0 \le t \le \tau$.
Observe that, according to this definition, test coverage is a function
of latency time. The definition can be extended to all devices of the com-
puter as follows:
Subdivide the computer into mutually exclusive components $C_1, C_2, \ldots, C_k$
with failure rates $\lambda_1, \lambda_2, \ldots, \lambda_k$ and test coverages
$1 - \alpha_1(\tau), 1 - \alpha_2(\tau), \ldots, 1 - \alpha_k(\tau)$, respectively.
Set $\mathrm{pdf}_i(\tau)$ = probability density for time-to-detect failures of
$C_i$, $i = 1, 2, \ldots, k$.
Then the pdf for all failures of the computer is
2) $\mathrm{PDF}(\tau) = \sum_{i=1}^{k} \frac{\lambda_i}{\lambda}\,\mathrm{pdf}_i(\tau)$
where $\lambda = \lambda_1 + \lambda_2 + \cdots + \lambda_k$.
Test coverage of the whole computer is then
3) $1 - \alpha(\tau) = \sum_{i=1}^{k} \frac{\lambda_i}{\lambda}\,[1 - \alpha_i(\tau)]$.
The method of selecting faults, described in Section 3, is consistent with
this definition.
From (3) we obtain
4) $\alpha(\tau) = \sum_{i=1}^{k} \frac{\lambda_i}{\lambda}\,\alpha_i(\tau)$, as expected.
One of the objectives of the present study is to obtain estimates of the
probability density function, $\mathrm{pdf}(\tau)$. These estimates are presented in
Section 5.
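As an illustration of how coverage as a function of latency can be tabulated from experimental data, the sketch below computes a discrete analogue of the definition above. It is not the estimator used in the report (the report's estimators are given in Section 9); it assumes that latency is measured in whole program repetitions and that a latency of 0 marks a fault that was never detected, which is the convention of the experiments in Section 4.5. The function name and the sample data are hypothetical.

```python
from collections import Counter


def empirical_coverage(latencies, max_repetitions):
    """Return (pdf, coverage) lists indexed by repetition 1..max_repetitions.

    pdf[k-1]      = fraction of all injected faults first detected in repetition k
    coverage[k-1] = fraction of all injected faults detected within k repetitions
    A latency of 0 means the fault was not detected at all.
    """
    total = len(latencies)
    counts = Counter(latencies)
    pdf = [counts.get(k, 0) / total for k in range(1, max_repetitions + 1)]
    coverage, running = [], 0.0
    for p in pdf:
        running += p          # cumulative sum plays the role of the integral
        coverage.append(running)
    return pdf, coverage
```

For example, with latencies `[1, 1, 2, 0, 0, 3, 1, 0]` and eight repetitions, three of eight faults are detected in the first repetition and the coverage levels off at five of eight.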
4.3 Indistinguishable Faults and Effects on Coverage
During the development of the emulator it became apparent that a signi-
ficant proportion of components had no effect whatsoever on the digital pro-
cess. For the most part, these components are associated with unused pins,
e.g., a complementary output of a flip-flop. However, there are other com-
ponents whose lack of effect is not as obvious, for example, a component
that only affects the process when it is faulted. Certain micromemory bits
are in this category. In order to distinguish between these categories of
faults we are led to the following informal definitions:
A fault that has no effect on the computational process is
indistinguishable. All other faults are distinguishable.
We note that a distinguishable fault has the property that there exists 
a software program the output of which differs from that of the same program 
executed by an identical but non-faulted processor. 
Effects on Coverage 
The presence of indistinguishable faults can lead to erroneous and mis-
leading estimates of coverage. In theory, indistinguishable faults should be
disqualified from the emulation or from the fault selection process. This is
consistent with the definition of coverage which implicitly assumes that
faults are distinguishable. Unfortunately, in order to disqualify indistin-
guishable faults from the emulation or from the fault selection process they
must first be identified, and this is a non-trivial task because of the large
number of possible faults. The approach taken in this study was to select
faults irrespective of their distinguishability properties and analyze only
those faults that were undetected by Self-Test. The proportion of indistin-
guishable faults from this set was then used as an estimate over all faults.
We now indicate, briefly, how indistinguishable faults affect coverage.
If
$\gamma$ = proportion of components yielding indistinguishable faults
and
$1 - \alpha$ = coverage of distinguishable faults,
then
$1 - \alpha$ = desired coverage
and
5) $(1 - \alpha)(1 - \gamma)$ = coverage when indistinguishable faults are
counted as undetected.
We note, incidentally, that
6) $(1 - \alpha)(1 - \gamma) + \gamma$ = coverage when indistinguishable faults
are counted as detected.
The estimate of (5) will be obtained if indistinguishable faults are not
disqualified. Then, coverage estimates will be in error by the factor $1 - \gamma$.
In the more general case it may be more convenient to estimate the pro-
portion of indistinguishable faults by partition since the effect on coverage
is a function of the relative failure rate of the partition.
Let $\lambda_i$ = failure rate of Partition #i, $i = 1, 2, \ldots, 6$.
$\gamma_i$ = proportion of indistinguishable faults in Partition #i.
$1 - \alpha_i$ = coverage of distinguishable faults in Partition #i.
$\lambda = \lambda_1 + \lambda_2 + \cdots + \lambda_6$ = total failure rate.
From the previous section, if all faults are distinguishable then
coverage is given by
7) $1 - \alpha = \sum_{i=1}^{6} \frac{\lambda_i}{\lambda}\,(1 - \alpha_i)$
If, however, indistinguishable faults are counted as undetected then the
coverage actually obtained is
8) $1 - \alpha = \sum_{i=1}^{6} \frac{\lambda_i}{\lambda}\,(1 - \alpha_i)(1 - \gamma_i)$
We note that, if indistinguishable faults are disqualified, the true
coverage is
9) $1 - \alpha = \dfrac{\sum_{i=1}^{6} \lambda_i\,(1 - \alpha_i)(1 - \gamma_i)}{\sum_{i=1}^{6} \lambda_i\,(1 - \gamma_i)}$.
From (8) it can be seen that the required accuracy of an estimate of $\gamma_i$
depends upon the relative failure rate, $\lambda_i/\lambda$. If $\lambda_i$ is sufficiently small
then the effect of an inaccurate estimate of $\gamma_i$ is negligible.
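The two partition-level coverage expressions, (8) and (9), can be compared numerically with a short sketch. The function names are hypothetical and the denominator of the "true coverage" function follows the reading that the distinguishable failure rate of partition i is its failure rate times one minus its proportion of indistinguishable faults; the inputs in the example are made-up values, not data from the study.

```python
def coverage_indistinguishable_as_undetected(rates, alphas, gammas):
    """Equation (8): indistinguishable faults are counted as undetected."""
    total_rate = sum(rates)
    return sum(
        rate * (1 - alpha) * (1 - gamma)
        for rate, alpha, gamma in zip(rates, alphas, gammas)
    ) / total_rate


def true_coverage(rates, alphas, gammas):
    """Equation (9): indistinguishable faults are disqualified."""
    detected = sum(
        rate * (1 - alpha) * (1 - gamma)
        for rate, alpha, gamma in zip(rates, alphas, gammas)
    )
    distinguishable = sum(
        rate * (1 - gamma) for rate, gamma in zip(rates, gammas)
    )
    return detected / distinguishable


# Hypothetical two-partition example.
rates, alphas, gammas = [2.0, 1.0], [0.1, 0.3], [0.2, 0.0]
measured = coverage_indistinguishable_as_undetected(rates, alphas, gammas)
corrected = true_coverage(rates, alphas, gammas)
```

In the single-partition case the second function returns the coverage of distinguishable faults exactly, while the first is smaller by the factor one minus the proportion of indistinguishable faults, in agreement with (5).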
4.4 Objectives of Experiments 
Most airborne systems, present and projected, employ comparison-monitoring, 
self-test or a combination of both to achieve the requisite detection and 
isolation capability. One of the problems of fault detection, by either 
method, is that a fault may not manifest itself at either a comparison- 
monitored variable or at an accessible output of self-test until the faulted 
component is suitably exercised. As a consequence, faults can remain latent 
for long periods of time. This is the significance of latency time, $\tau$, in
the definition of test coverage of Section 4.2.
One of the objectives of the experiments is to estimate $\tau$ for the test
programs described in Section 4.5. Using comparison-monitoring the probabil-
ity distribution of $\tau$ will be estimated for each of the four programs and
the interdependence of these distributions and the number and type of instruc-
tions will be ascertained.
4.5 Experiments 
The fault injection experiments were conducted using four programs, each 
of which was coded in the assembly language of the BDX-930. 
In the following descriptions only the set of computations labelled 
"compute" were performed by the target BDX-930 CPU; all other computations, 
selections, comparisons, etc. were performed by the emulation host computer 
Executive. Needless to say, there were no failures in these latter computations. 
When the non-failed processor completed a computation* and before the 
start of the next computation the Executive recomputed all initializing vari- 
ables and stored them in the appropriate locations of the scratchpad memory. 
In the parallel mode of operation, when 32 computers are simultaneously
being emulated, the initializing variables are stored simultaneously in the
32 copies of the scratchpad memories.
4.5.1 Search and Compute (SERCOM)
a. Procedure
T0) Select 8 sets of integers, (Ak, Bk, Ck), at random, each component
from the interval
For each fault:
T1) Preset the program counter to the address of the first instruction.
T2) Store the (Ak, Bk, Ck) in successive locations of memory.
T3) Compute and store in successive locations of memory
      S1k = Bk + Ck,  S2k = Bk         if Bk ≤ Ak
      S1k = Bk + Ck,  S2k = Bk - Ck    if Ak < Bk and Ck ≤ Ak
      S1k = Bk - Ck,  S2k = Bk x Ck    if Ak < Bk and Ak < Ck.
T4) When the non-failed processor completes its last instruction compare
S1k, S2k term by term, in both the non-failed and failed processors.
If S1k or S2k is the first variable to miscompare set L = K. If all
S1k, S2k compare set L = 0 (L = Latency Period).
* In the parallel mode of operation one of the emulated processors is non-
faulted and, as a consequence, the end of its computation cycle can be
determined from its program counter.
b. Instruction Set
During a typical computation the following instructions were executed:
INSTRUCTION       FREQUENCY
LOAD/STORE        11
STACK OPS         4
ADD/SUBTRACT      4
MULTIPLY          1
BRANCH            12
TRANSFER          8
MISCELLANEOUS
4.5.2 Linear Convergence (LINCON)
a. Procedure
T0) Select the following integers from the indicated intervals:
      M0,  -8 ≤ M0 ≤ 8
      Y0,
      X1, X2, ..., X8
Assume that X1 < X2 < ... < X8.
For each fault:
T1) Preset the program counter to the address of the first instruction.
T2) Store M0, Y0, X1, X2, ..., X8 in successive locations of memory.
T3) Compute Mk, Yk, for K = 1, 2, ..., 8 as specified in the flow
diagram of Figure 1, and store in successive locations of memory.
T4) When the non-failed processor completes its last instruction compare
M1, Y1, ..., M8, Y8, term by term, in both the non-failed and failed
processors. If Mk or Yk is the first variable to miscompare set
L = K. If all Mk, Yk compare set L = 0.
b. Instruction Set 
During a typical computation the following instructions were executed: 
INSTRUCTION FREQUENCY 
LOAD/STORE 
STACK OPS 
ADD/SUBTRACT 
MULTIPLY 
BRANCH 
TRANSFER 
MISCELLANEOUS 
69 
0 
15 
3: 
0 
5 
124 
4.5.3 Quadratic (QUAD) 
a. Procedure 
T0) Select 8 sets of integers, (Ak, Bk, Ck, Xk), at random from the
indicated intervals:
      Ak, Bk, Ck:  0 ≤ X ≤ 2^15 - 1
      Xk:  -10 ≤ Xk ≤ 10.
For each fault:
T1) Preset the program counter to the address of the first instruction.
T2) Store the (Ak, Bk, Ck, Xk) in successive locations of memory.
T3) Compute and store in successive locations of memory (overflows are
ignored):
      Sk = (Ak Xk) Xk - Bk Xk - Ck,    K = 1, 2, ..., 8.
T4) When the non-failed processor completes its last instruction, com-
pare S1, S2, ..., S8, term by term, in both the non-failed and
failed processors. If Sk is the first variable to miscompare set
L = K. If all Sk compare set L = 0.
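The QUAD computation and the rule for assigning the latency period L can be sketched as follows. Several points are assumptions made only for this illustration: the quadratic is taken to be Sk = Ak·Xk² − Bk·Xk − Ck, "overflows are ignored" is modelled as wrap-around to a 16-bit two's complement word, and the function names are hypothetical. The actual programs were written in BDX-930 assembly language, not Python.

```python
def to_int16(value):
    """Wrap an integer to a 16-bit two's complement value (assumed word size)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def quad(a, b, c, x):
    """Assumed QUAD output: A*X*X - B*X - C with overflow ignored."""
    return to_int16(a * x * x - b * x - c)


def latency(reference, observed):
    """Return K for the first miscomparing term (1-based), or 0 if all compare.

    reference: outputs of the non-failed processor, in order of computation
    observed:  outputs of the failed processor, in the same order
    """
    for k, (good, bad) in enumerate(zip(reference, observed), start=1):
        if good != bad:
            return k
    return 0
```

The same `latency` rule applies to SERCOM and LINCON if each repetition's pair of outputs is compared as a unit.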
b. Instruction Set
During a typical computation the following instructions were executed:
INSTRUCTION       FREQUENCY
LOAD/STORE        11
STACK OPS         4
ADD/SUBTRACT      3
MULTIPLY          3
BRANCH            6
TRANSFER          9
MISCELLANEOUS     5
                  --
                  41
4.5.4 Flight Control System (FCS) 
FCS was an existing 3-axis, high performance flight control computation 
for an advanced aircraft. The program consisted of seven modules: 
1. Pitch axis control law.
2. Left horizontal tail cmd (TLCMD).
3. Right horizontal tail cmd (TRCMD).
4. Yaw axis control law and rudder cmd (RCMD).
5. Roll axis control law.
6. Left flaperon cmd (FLCMD).
7. Right flaperon cmd (FRCMD).
All integrators were initialized to zero for each run and each sensor 
input was selected at random for each pass through the program. The compara- 
tors were located at the actuator commands, i.e., at TLCMD, TRCMD, RCMD, FLCMD 
and FRCMD. 
Because of the extreme length of the FCS program (approximately 2,200
instructions, 13,729 microcycles) it was not possible to run the entire
program for eight repetitions for each of 1,000 faults, as was initially
intended. Instead, a compromise was reached in which the program would be
run for a single repetition for each of 1,000 faults and for 8 repetitions
for each of 200 faults.
• Single Repetition Experiments
In order to introduce latency during a single repetition the program
was executed in parts, with each part designated as a "quasi-
repetition". These quasi-repetitions were:
Quasi-Repetition #1
    • Pitch control law.
    • Left horizontal tail cmd (TLCMD).
Quasi-Repetition #2
    • Retaining pitch control law computations, right horizontal tail
      cmd (TRCMD).
Quasi-Repetition #3
    • Yaw axis control law.
    • Rudder cmd (RCMD).
Quasi-Repetition #4
    • Roll control law.
    • Left flaperon cmd (FLCMD).
Quasi-Repetition #5
    • Retaining roll control law computations, right flaperon cmd (FRCMD).
• Eight Repetition Experiments
In these experiments the seven modules (i.e., the five quasi-repetitions)
were executed for eight complete repetitions for each injected fault.
In all of these experiments when the non-failed processor completes its 
last instruction the resultant commands are compared in both the non-failed 
and failed processors. Any discrepancy between corresponding commands desig- 
nates a detected fault. When a fault is detected the preprocessor ignores 
detection in subsequent repetitions. 
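The detection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the emulator's actual logic: the function name and the representation of each repetition's actuator commands as a tuple are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

Commands = Tuple[float, ...]  # hypothetical: (TLCMD, TRCMD, RCMD, FLCMD, FRCMD)

def repetitions_to_detection(good_runs: Sequence[Commands],
                             faulty_runs: Sequence[Commands]) -> Optional[int]:
    """Return the 1-based repetition at which the commands first differ.

    A discrepancy between corresponding commands designates a detected
    fault; once detected, later repetitions are ignored. None means the
    fault stayed latent for every repetition supplied.
    """
    for rep, (good, bad) in enumerate(zip(good_runs, faulty_runs), start=1):
        if tuple(good) != tuple(bad):
            return rep  # first detection; subsequent repetitions are ignored
    return None

# Example: the fault is latent in repetition 1 and detected in repetition 2.
good = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
bad = [(0.1, 0.2), (0.3, 0.9), (0.0, 0.0)]
latency = repetitions_to_detection(good, bad)
```

Counting each fault only at its first detection is what makes the resulting histograms latency distributions rather than per-repetition detection rates.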
FIGURE 1 FLOW DIAGRAM FOR LINCON
5.0 RESULTS OF EXPERIMENTS 
In this section the data from the experiments is presented concisely 
and with a minimum of commentary. A detailed analysis of the results is 
given in the next section. 
5.1 Distribution of Faults 
As indicated previously the selection of faults was random, with each
device weighted in proportion to its failure rate. The failure rates of
individual devices are given in Table 20. The failure rates of each
partition of the CPU are given in Table 1.
The number of faults injected during each experiment is given in Table 2
for each of the four programs. The number of faults injected in each
partition is given in Table 3. Once selected, the same faults were used in all
experiments.
5.2 Experiments 
To simplify the presentation, graphs and latency distributions
will only be given for combined faults, irrespective of partition. However,
the distributions for S-a-0 and S-a-1 faults, by partitions, are given in tabular
form.
For the purpose of comparison the latency distributions obtained from
(ref. 1) are superimposed on the corresponding histograms obtained under the
conditions of the present study. Also shown is the proportion of undetected
faults after eight repetitions, corrected for indistinguishable faults. It
is noted in Section 7.0 that, based on an analysis of 6600 faults, the
proportion of indistinguishable gate faults is 16.5%.
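The correction method itself is described in Section 4.3. As a rough check, the corrected figures quoted in Section 5.2 are reproduced by removing the indistinguishable proportion from both the undetected faults and the total. The Python sketch below assumes that form of the correction and assumes the 16.5% value applies to every experiment; the function name is hypothetical.

```python
def corrected_undetected(undetected: float, indistinguishable: float = 0.165) -> float:
    """Proportion of distinguishable faults left undetected.

    Assumption: every indistinguishable fault is counted among the
    undetected faults, so it is removed from numerator and denominator.
    The result is clamped at zero.
    """
    remaining = undetected - indistinguishable
    return max(0.0, remaining / (1.0 - indistinguishable))

# Undetected proportions after 8 repetitions (or a complete pass).
sercom_gate = corrected_undetected(0.45)        # about 0.341
quad_gate = corrected_undetected(0.413)         # about 0.297
sercom_component = corrected_undetected(0.192)  # about 0.032
fcs_component = corrected_undetected(0.158)     # clamps to 0.0
```

The clamp matters for the FCS component-level case, where the raw undetected proportion (15.8%) is smaller than the assumed indistinguishable proportion.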
5.2.1 SERCOM Experiment 
After each injected fault SERCOM was executed for 8 repetitions. The 
resultant histograms of detected faults versus repetitions to detection are 
shown in Figures 2a, 2b. Tabular results are given in Table 4. 
Figure 2a, Summarized
(Combined, gate-level faults)
• 53.6% undetected after a single repetition.
• 46.4% detected in the 1st repetition.
• 45% undetected after 8 repetitions.
• 34.1% undetected after 8 repetitions when corrected for
  indistinguishable faults.
Figure 2b, Summarized
(Combined, component-level faults)
• 31.8% undetected after a single repetition.
• 68.2% detected in the 1st repetition.
• 19.2% undetected after 8 repetitions.
• 3.2% undetected after 8 repetitions when corrected for
  indistinguishable faults.
5.2.2 LINCON Experiment 
After each injected fault LINCON was executed for 8 repetitions. The 
resultant histograms of detected faults versus repetitions to detection are 
shown in Figures 3a, 3b. Tabular results are given in Table 5. 
Figure 3a, Summarized
(Combined, gate-level faults)
• 46.5% undetected after a single repetition.
• 53.5% detected after a single repetition.
• 44.9% undetected after 8 repetitions.
• 34% undetected after 8 repetitions when corrected for
  indistinguishable faults.
Figure 3b, Summarized
(Combined, component-level faults)
• 19.2% undetected after a single repetition.
• 80.8% detected after a single repetition.
• 18.7% undetected after 8 repetitions.
• 2.6% undetected after 8 repetitions when corrected for
  indistinguishable faults.
5.2.3 QUAD Experiment 
After each injected fault QUAD was executed for 8 repetitions. The 
resultant histograms of detected faults versus repetitions to detection are 
shown in Figures 4a, 4b. Tabular results are given in Table 6. 
Figure 4a, Summarized
(Combined, gate-level faults)
• 49.3% undetected after a single repetition.
• 50.7% detected after a single repetition.
• 41.3% undetected after 8 repetitions.
• 29.7% undetected after 8 repetitions when corrected
  for indistinguishable faults.
Figure 4b, Summarized
(Combined, component-level faults)
• 23.6% undetected after a single repetition.
• 76.4% detected after a single repetition.
• 17.1% undetected after 8 repetitions.
• 0.72% undetected after 8 repetitions when corrected
  for indistinguishable faults.
5.2.4 FCS Experiments (Quasi-Repetitions) 
After each injected fault FCS was executed for a single repetition.
However, as described in Section 4.5.4, the program was executed in 5 parts,
designated as quasi-repetitions. The resultant histograms are shown in
Figures 5a, 5b. Tabular results are given in Table 7.
Figure 5a, Summarized
(Combined, gate-level faults)
• 43% undetected after quasi-repetition #1.
• 57% detected after quasi-repetition #1.
• 41.9% undetected after a complete pass.
• 30.4% undetected after a complete pass when corrected
  for indistinguishable faults.
Figure 5b, Summarized
(Combined, component-level faults)
• 16% undetected after quasi-repetition #1.
• 84% detected after quasi-repetition #1.
• 15.8% undetected after a complete pass.
• 0% undetected after a complete pass when corrected for
  indistinguishable faults.
It is noted that, once a fault has been detected, the preprocessor ignores 
detection in subsequent repetitions. This is the reason, for example, that 
Quasi-Repetitions #2, #3, #4 and #5 show poor coverage relative to Quasi- 
Repetition #l in Figure 5, even though the computations are similar. 
5.2.5 FCS Experiments (True-Repetitions) 
After each injected fault FCS was executed for 8 repetitions. The re- 
sultant histogram is shown in Figure 6. Tabular results are given in 
Table 8. 
Figure 6, Summarized
(Combined, gate-level faults)
• 37.5% undetected after a single repetition.
• 62.5% detected after a single repetition.
• 34% undetected after 8 repetitions.
• 21% undetected after 8 repetitions when corrected
  for indistinguishable faults.
5.3 Urn Model Parameters 
The parameters of the Urn Model were estimated for SERCOM, LINCON, QUAD
and FCS, using the estimators defined in Section 9.3 for combined, S-A-0 and
S-A-1 faults. The resultant estimates of a, P, P0, as defined in Sections
8.0 and 9.3, are given in Table 9. All estimates were obtained using
8 repetitions of each program.
As an illustration of the Urn Model "fit" Figure 7 shows the resultant
Urn Model distribution superimposed on the empirical distribution for FCS.
5.4 Accuracy and Confidence of Results 
The accuracy of coverage estimates will be given for combined, gate-level 
faults, only. The estimates are based on the total set of faults irrespective 
of their distinguishability. 
SERCOM (1000 Faults)
After 8 repetitions 55% of all faults were detected. The error, at
the 95% confidence level, is
$E = 1.96 \sqrt{\frac{(.55)(.45)}{1000}} = .0308$ (3.08%)
LINCON (1000 Faults)
After 8 repetitions 55.1% of all faults were detected. The error, at
the 95% confidence level, is
$E = 1.96 \sqrt{\frac{(.551)(.449)}{1000}} = .0308$ (3.08%)
QUAD (1000 Faults)
After 8 repetitions 58.7% of all faults were detected. The error, at the
95% confidence level, is
$E = 1.96 \sqrt{\frac{(.587)(.413)}{1000}} = .0305$ (3.05%)
FCS (200 Faults)
After 8 repetitions 66% of all faults were detected. The error, at
the 95% confidence level, is
$E = 1.96 \sqrt{\frac{(.66)(.34)}{200}} = .066$ (6.6%)
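The four error figures above are consistent with the usual normal-approximation error of a binomial proportion, 1.96 times the square root of p(1-p)/n. The short Python sketch below assumes that formula (the function name is hypothetical) and recomputes each value from the stated coverage and sample size.

```python
import math

def error_95(coverage: float, faults: int) -> float:
    """Half-width of the 95% confidence interval for a coverage estimate,
    using the normal approximation to the binomial distribution."""
    return 1.96 * math.sqrt(coverage * (1.0 - coverage) / faults)

errors = {
    "SERCOM": error_95(0.55, 1000),
    "LINCON": error_95(0.551, 1000),
    "QUAD": error_95(0.587, 1000),
    "FCS": error_95(0.66, 200),
}
for program, err in errors.items():
    print(f"{program}: {err:.4f}")
```

Because the error scales with the inverse square root of the number of faults, the 200-fault FCS experiment carries roughly twice the uncertainty of the 1000-fault experiments.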
Urn Model Parameters
The accuracy is illustrated for the QUAD Program. There
$P \approx .864$
$P_0 \approx .587$
$a \approx .667$
From Section 9.4.3 the errors at the 95% confidence levels are
$\epsilon_P = 1.96 \sqrt{\frac{(.864)(.136)}{1000 \times .587}} = .028$ (2.8%)
$\epsilon_{P_0} = 1.96 \sqrt{\frac{(.587)(.413)}{1000}} = .031$ (3.1%)
$\epsilon_a = 1.96 \sqrt{\frac{(.667)^2 (.333)}{1000 \times .587 \times .136}} = .084$ (8.4%)
TABLE 1
FAILURE RATES BY PARTITION
PARTITION   FAILURE RATE (MIL-HDBK-2175)   PROPORTION OF TOTAL
1           7.1014 x 10^-6 /HR             .186
2           5.8223 x 10^-6 /HR             .153
3           7.4706 x 10^-6 /HR             .196
4           9.4863 x 10^-6 /HR             .249
5           7.056 x 10^-6 /HR              .185
6           1.1867 x 10^-6 /HR             .031
TOTALS      38.1233 x 10^-6 /HR            1.0
TABLE 2
NUMBER OF FAULTS INJECTED
EXPERIMENT   GATE-LEVEL   COMPONENT LEVEL
SERCOM       1000         1000
LINCON       1000         1000
QUAD         1000         1000
FCS #1       1000         500
FCS #2       200          X
TABLE 3A
NUMBER OF GATE-LEVEL FAULTS INJECTED BY PARTITIONS
PROGRAMS: SERCOM, LINCON, QUAD
PARTITION   S-A-0   S-A-1   COMBINED
1           90      92      182
2           89      83      172
3           117     105     222
4           111     120     231
5           79      79      158
6           18      17      35
TOTAL       504     496     1000
PROGRAM: FCS (1000 FAULT CASE)
PARTITION   S-A-0   S-A-1   COMBINED
1           80      89      169
2           77      76      153
3           106     106     212
4           126     101     227
5           102     103     205
6           18      16      34
TOTAL       509     491     1000
TABLE 3B
NUMBER OF COMPONENT-LEVEL FAULTS INJECTED BY PARTITIONS
PROGRAMS: SERCOM, LINCON, QUAD
PARTITION   S-A-0   S-A-1   COMBINED
1           122     124     246
2           104     104     208
3           147     138     285
4           141     120     261
TOTAL       514     486     1000
PROGRAM: FCS (500 FAULT CASE)
PARTITION   S-A-0   S-A-1   COMBINED
1           55      48      103
2           62      52      114
3           67      73      140
4           74      69      143
TOTAL       258     242     500
TABLE 4a SERCOM LATENCY DATA 
GATE-LEVEL FAULTS 
DETECTED FAULTS 
FAULTS 
INJECTED 
PARTITION I , 
)‘l t$ M2 N; M3 N3 tj4 N4 M5 N5 M6 N6 M7 N7 M8 N8 M N 
p1 60 68 2 1 0 2 0 0 0 010 00 
p2 48 62 10 3 0 0 000~000 00 0 0 89 8: 
I 
p3 67 61 6 12 0 0 0 0 5 100 00 0 0 117 105 
p4 31 62 14 6 0 0 0 0 6 820 001 1 111 12c I 
p5 2 21 10 0 001 0 0 0 0,o 0 0 79 75 . 
'6 
1 01 10 0 0 0 0 000 
1 
00 0 0 17 1 1E 
TOTAL 
209 255 34 24 0 2 0 0 12 930 00 1 1 503 49; 
L 
Mi = Detected S-a-O Faults, ith Cell 
N.. 
1 
= Detected S-a-l Faults, ith Cell 
TABLE 4b SERCOM LATENCY DATA 
COMPONENT LEVEL FAULTS 
FAULTS 
INJECTED 
M N 
DETECTED FAULTS 
PARTITIOM 
“l N3 “4 N4 ‘8 
M2 N2 M3 
79 
3 8 0 
14 6 0 
14 4 0 
69 25 14 16 
349 56 32 16 
“6 N6 
0 0 
0 0 
0 0 
0 0 
% *5 
0 0 
0 0 
0 0 
0 0 
0 0 
M7 N7 93 
00 0 
00 0 
00 0 
00 3 
00 3 
p1 94 3 0 0 0 126 -.i 
p2 60 0 0 0 0 104 
p3 110 0 2 3 2 153 
p4 69 9 0 0 0 105 ‘:, 1 
‘6 
TOTAL 2 488 333 12 3 2 512 
Mi = Detected S-a-O Faults, ith Cell 
Ni 
= Detected S-a-l Faults, ith Cell 
TABLE 5a LINCON LATENCY DATA 
GATE-LEVEL FAULTS 
DETECTED FAULTS 
FAULTS 
INJECTED 
PARTITION a 
)‘l 
N1 M2 N2 M3 N3 t14 N4 M5 N5 )I6 N6 MI N7 1'8 N8 M N 
pi 68 70 0 0 0 0 01 0 0 000 00 0 0 90 92 
p2 66 65 0 0 0 0 C 0 0 000 00 0 0 89 83 
p3 66 69 0 0 0 0 C 00 000 01 0 0 117 105 
p4 57 68 1 2 1 0 1 0 0 201 12 1 1 111 120 
p5 2 31 00 0 0 0 0 000 01 0 0 79 79 
* 
‘6 100 00 0 I- O 0 0 0 0 0 0 0 0 17 18 
TOTAL 
260 275 2 2 1 0 1 0 0 2 0 1 
I 
1 4 1 1 503 497 
& 
i 
Mi 
= Detected S-a-O Faults, ith Cell 
Ni = Detected S-a-l Faults, ith Cell 
TABLE 5b LINCON LATENCY DATA
COMPONENT LEVEL FAULTS
            DETECTED FAULTS                                         FAULTS INJECTED
PARTITION   M1   N1   M2 N2  M3 N3  M4 N4  M5 N5  M6 N6  M7 N7  M8 N8   M    N
P1          105  119  0  0   0  0   0  0   0  0   0  0   0  0   0  0    120  126
P2          83   85   0  0   0  0   0  0   0  0   0  0   0  0   0  0    104  104
P3          96   96   0  0   0  0   0  0   0  0   0  0   0  0   0  0    147  138
P4          118  106  3  0   0  0   0  0   0  0   1  1   0  0   0  0    142  119
TOTAL       402  406  3  0   0  0   0  0   0  0   1  1   0  0   0  0    513  487
Mi = Detected S-a-0 Faults, ith Cell
Ni = Detected S-a-1 Faults, ith Cell
TABLE 6a QUAD LATENCY DATA 
GATE LEVEL FAULTS 
i 
i 
DETECTED FAULTS FAULTS 
PARTITION s 
,INJECTED 
)‘l t$ M2 N2 Id3 N3 t14 N4 M6 N5 M6 N6 M7 N7 1’8 t48 t" td N  d W 
p1 58 63 3 1 0 3 1 1 0 0 0 0 0 0 2 1 90 92 
p2 58 61 10 4 2 1 C 0 0 000 00 0 0 89 83 
1 
p3 70 67 8 5 0 0 0 0 0 000 00 0 0 117 105 
p4 54 70 14 9 5 3 n 2 0 000 00 0 0 111 120 
p5 2 22 20 0 C 0 0 000 00 00 79 79 
‘6 0 
I TOTAL I,,:I ,,; 311 2d 1, 1, 
C 
l/ 1, 1, i 1, 1, 11 : 1 :I : ,511, ,::I 
Mi = Detected S-a-O Faults, ith Cell 
Ni 
= Detected S-a-l Faults, ith Cell 
TABLE 6b QUAD LATENCY DATA 
COMPONENT LEVEL FAULTS 
DETECTED FAULTS I FAULTS INJECTED 
p4 126 105 8 2 0 1 0 1 0 000 00 0 0 141 120 
p5 
‘6 
TOTAL 
384 380 37 15 3 5 o 1 o 000 00 4 0 514 486 
Mi 
= Detected S-a-O Faults, ith Cell 
Ni = Detected S-a-l Faults, ith Cell 
TABLE 7a FLIGHT CONTROL COMPUTATIONS LATENCY DATA
(QUASI-REPETITIONS)
GATE LEVEL FAULTS
            DETECTED FAULTS                                         FAULTS INJECTED
PARTITION   M1   N1   M2 N2  M3 N3  M4 N4  M5 N5  M6 N6  M7 N7  M8 N8   M    N
P1          67   77   0  0   0  0   0  0   0  0   0  0   0  0   0  0    80   89
P2          61   74   0  0   0  0   0  0   0  0   0  0   0  0   0  0    77   76
P3          71   70   0  0   0  0   0  0   0  0   0  0   0  0   0  0    106  106
P4          61   64   3  0   6  1   0  0   0  0   0  0   0  0   0  0    126  101
P5          11   12   0  0   0  0   0  0   0  0   0  0   0  0   0  0    102  103
P6          1    1    1  0   0  0   0  0   0  0   0  0   0  0   0  0    18   16
TOTAL       272  298  4  0   6  1   0  0   0  0   0  0   0  0   0  0    509  491
Mi = Detected S-a-0 Faults, ith Cell
Ni = Detected S-a-1 Faults, ith Cell
TABLE 7b FLIGHT CONTROL COMPUTATIONS LATENCY DATA
(QUASI-REPETITIONS)
COMPONENT LEVEL FAULTS
            DETECTED FAULTS                                         FAULTS INJECTED
PARTITION   M1   N1   M2 N2  M3 N3  M4 N4  M5 N5  M6 N6  M7 N7  M8 N8   M    N
P1          51   43   0  0   0  0   0  0   0  0   0  0   0  0   0  0    55   48
P2          54   47   0  0   0  0   0  0   0  0   0  0   0  0   0  0    62   52
P3          44   55   0  0   0  0   0  0   0  0   0  0   0  0   0  0    67   73
P4          65   61   0  0   1  0   0  0   0  0   0  0   0  0   0  0    74   69
TOTAL       214  206  0  0   1  0   0  0   0  0   0  0   0  0   0  0    258  242
Mi = Detected S-a-0 Faults, ith Cell
Ni = Detected S-a-1 Faults, ith Cell
TABLE 8 FLIGHT CONTROL COMPUTATION LATENCY DATA 
(TRUE REPETITIONS) 
GATE LEVEL FAULTS 
1 
PARTITIOFI 
p1 
p2 
p3 
p4 
p5 
'6 
TOTAL 
Mi = Detected S-a-O Faults, 
DETECTED FAULTS 
. 
M3 N3 t14 N4 M5 N5 . 
0 0 000 00 
0 0 000 00 
0 0 000 00 
ith Cell 
N6 M7 
0 0 
0 0 
0 0 
0 0 
0 0 
0; 0 
0 0 
N7 
I 
0 
0 
0 
0 
0 
0 
0 
1 FAULTS 
I JNJECTED 
0 0 17 
0 0 11 
0 0 22 
0 0 29 
010 17 
00 4 
oi 0 100 
Ir 
N 
15 
17 
21 
27 
Ni = Detected S-a-l Faults, ith Cell 
TABLE 9
URN MODEL PARAMETER ESTIMATES FOR GATE-LEVEL FAULTS
                      a        P        P0
SERCOM   COMBINED    .4914    .8436    .550
         S-A-0       .325     .807     .515
         S-A-1       .507     .876     .586
LINCON   COMBINED    .2424    .971     .551
         S-A-0       .3       .9774    .5288
         S-A-1       .2173    .9649    .5734
QUAD     COMBINED    .667     .8637    .587
         S-A-0       .6956    .835     .577
         S-A-1       .6274    .892     .597
FCS      COMBINED    .875     .947     .66
         S-A-0       .857     .906     .64
         S-A-1       1.0      .985     .68
FIGURE 2A. SERCOM, combined gate-level faults. Axis: time to detect (repetitions). Distribution from (ref. 1) superimposed; undetected proportion marked, with the value corrected for indistinguishable faults.
FIGURE 2B. SERCOM, combined component-level faults. Axis: time to detect (repetitions). Distribution from (ref. 1) superimposed. 19.2% undetected (3.2% undetected, corrected for indistinguishable faults).
FIGURE 3A. LINCON, combined gate-level faults. Axis: time to detect (repetitions). Distribution from (ref. 1) superimposed (51.7%). 44.9% undetected.
FIGURE 3B. LINCON, combined component-level faults. Axis: time to detect (repetitions). Distribution from (ref. 1) superimposed (76.6%). 18.7% undetected (2.6% undetected, corrected for indistinguishable faults).
FIGURE 4A. QUAD, combined gate-level faults. Axis: time to detect (repetitions). Distribution from (ref. 1) superimposed (43.2%). 41.3% undetected (29.7% undetected, corrected for indistinguishable faults).
FIGURE 4B. QUAD, combined component-level faults. Axis: time to detect (repetitions). Distribution from (ref. 1) superimposed (71.8%). 17.1% undetected (0.72% undetected, corrected for indistinguishable faults).
FIGURE 5A. Flight control computations, quasi-repetitions, combined gate-level faults. Axis: time to detect (quasi-repetitions). 41.9% undetected (30.4% undetected, corrected for indistinguishable faults).
FIGURE 5B. Flight control computations, quasi-repetitions, combined component-level faults. Axis: time to detect (quasi-repetitions). 84% detected in the first quasi-repetition. 15.8% undetected (0% undetected, corrected for indistinguishable faults).
FIGURE 6. Flight control computations, true repetitions, combined gate-level faults. Axis: time to detect (repetitions). 62.5% detected in the first repetition. 34% undetected (21.0% undetected, corrected for indistinguishable faults).
FIGURE 7. Urn Model distribution, flight control computations, true repetitions, combined gate-level faults. Axis: time to detect (repetitions). 62.5% detected in the first repetition.
6.0 SUMMARY OF EXPERIMENTS 
6.1 Latency 
• Most detected faults are detected in the first repetition. Subsequent
  repetitions do not appreciably increase the proportion of detected
  faults. However, it appears that short programs have a tendency to
  benefit more from subsequent repetitions than lengthier programs
  (e.g., FCS). It is conjectured that shorter programs rely more
  heavily on inputs for failure mode excitation than their lengthier
  counterparts.
• S-a-1 faults are easier to detect than S-a-0 faults.
• The micromemory (i.e., Partition #5) contains a large proportion of
  undetected faults.
• A large proportion of faults remain undetected after as many as
  8 repetitions. For example, in a flight control computation of 2200
  instructions 21% of distinguishable gate-level faults remained
  undetected.
• Component-level faults are easier to detect than gate-level faults.
  As an example, Table 10 summarizes detection coverage of
  distinguishable gate-level faults for 8 repetitions of SERCOM, LINCON, QUAD and
  FCS. Also shown are the respective coverages of the final self-test
  program described in Section 7.0.
• The latency distributions for SERCOM, LINCON, QUAD and FCS are
  uncorrected for indistinguishable faults. A detailed analysis of
  6600 faults (see Section 7.0) indicates that the proportion of
  indistinguishable (i.e., "don't care") faults in the BDX-930 is 16.5%.
  The appropriate correction factors can be obtained by the method
  described in Section 4.3. Table 10 summarizes detection coverage of
  distinguishable faults.
• Based on our experience with self-test (see Section 7.0) it may be
  concluded that a program's ability to detect faults cannot be
  characterized by the number of instructions or the instruction mix.
  As a consequence, it is not surprising that a short program such as
  QUAD (41 instructions) has a coverage of 70.3%, while a lengthy
  program such as FCS (2200 instructions) has a coverage of only 79.0%.
• The distributions for LINCON are unique in that a very small
  proportion of faults are detected in the second and subsequent repetitions.
  In this respect it is similar to the distributions for the flight
  control system when the latter was subdivided into quasi-repetitions
  (see Figures 5a, 5b). In these experiments each program was executed,
  effectively, for a single repetition. The other programs, on the
  other hand, were executed repeatedly and in their entirety with a
  different set of inputs for each pass. As a consequence, we believe
  that the distributions for LINCON and the "QUASI FCS" are not
  representative.
• Based upon these results for LINCON and QUASI FCS it appears that the
  excitation supplied by inputs accounts for most of the coverage in
  the second and subsequent repetitions.
• Detection coverage between the second and last repetitions varied
  between 3.5% (FCS) and 8.6% (SERCOM). For the programs of the
  previous study these coverages were:
  FETSTO: 8.4%
  FIB: 7.0%
  ADDSUB: 7.0%
• It is interesting to compare the gate-level latency distributions for
  FETSTO, FIB and ADDSUB, obtained from the previous study, with those
  for SERCOM, QUAD and FCS of the present study. The number of
  executable instructions in each program is:
  FETSTO: 6
  FIB: 11
  ADDSUB: 11
  SERCOM: 44
  QUAD: 41
  FCS: 2200
  The first three programs were limited to a simple instruction set,
  whereas the last three used a variety of "high powered" instructions.
  The respective latency distributions are shown in Figure 8. From the
  figure it can be seen that the distributions are qualitatively similar
  despite the dissimilarity of their programs. The major difference is
  the distribution between coverage of the first repetition and total
  coverage. It appears that lengthier programs yield a greater coverage
  in the first repetition than shorter programs.
6.2 Urn Model 
• The Urn Model can, at least qualitatively, characterize the shape of
  a latency distribution. This can be attributed to (1) the monotonic
  decreasing property of the empirical distribution and (2) the three
  degrees-of-freedom which the Urn Model provides for a best fit.
• It is doubtful that the Urn Model parameters can be predicted for a
  program on the basis of length or instruction mix.
• Table 11 summarizes the Urn Model parameters for combined, gate-level
  faults for all of the programs of this and the previous study. Based
  on these results, we make the following observations:
  - The Urn Model parameters are in general agreement with the
    empirical distributions.
  - In every case, P0, the probability that a fault is detected
    eventually, coincided with coverage after 8 repetitions.
  - In every case, P1, the probability that a fault is detected in
    the first repetition, coincided with the empirical distribution.
  - In every case, P0(1-P), the probability that a fault is
    detected in subsequent repetitions, was in close agreement
    with the empirical distribution.
• Based upon the results from the LINCON and QUASI FCS experiments we
  conjecture that, if the inputs are invariant, almost all detected
  faults are detected in the first repetition and coverage during
  subsequent repetitions will be negligible. As a consequence, the observed
  actual coverage during subsequent repetitions must be due to varying
  inputs.
• The parameter, a, which gives the Urn Model distribution its
  exponential character, varies widely. It appears that "a" is a function
  primarily of input excitation and is a measure of the effectiveness
  of this excitation in fault detection.
FIGURE 8. COMPARISON OF FAULT LATENCY DISTRIBUTIONS
TABLE 10
COMPARISON OF GATE VERSUS COMPONENT-LEVEL COVERAGE *
Program           Gate-Level Coverage**   Component-Level Coverage**
SERCOM            .659                    .968
LINCON            .66                     .974
QUAD              .703                    .9928
FCS               .79                     1.0
Final Self-Test   .92                     .976
* Coverages have been corrected for indistinguishable faults
** After 8 repetitions
TABLE 11
URN MODEL PARAMETERS
COMBINED, GATE-LEVEL FAULTS
Program   # Words   a       P       P0      P0P*   P0(1-P)**
FETSTO    6         .4386   .7776   .3845   .3     .086
FIB       11        .6823   .8366   .4184   .35    .0684
ADDSUB    11        .4691   .8254   .4058   .33    .071
SERCOM    44        .4914   .8436   .55     .46    .086
LINCON    l-24      .2424   .971    .551    .535   .016
QUAD      41        .667    .8637   .587    .51    .08
FCS       2200      .875    .947    .66     .625   .04
* P0P = P1 = proportion of faults detected in the first repetition
** P0(1-P) = proportion of faults detected in subsequent repetitions
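The last two columns of Table 11 follow directly from P and P0. The Python sketch below recomputes them as a consistency check; the dictionary layout and function name are assumptions made for illustration, and the inputs are the tabulated P and P0 values.

```python
# (P, P0) for combined, gate-level faults, as tabulated in Table 11.
TABLE_11 = {
    "FETSTO": (0.7776, 0.3845),
    "FIB": (0.8366, 0.4184),
    "ADDSUB": (0.8254, 0.4058),
    "SERCOM": (0.8436, 0.55),
    "LINCON": (0.971, 0.551),
    "QUAD": (0.8637, 0.587),
    "FCS": (0.947, 0.66),
}

def split_coverage(p: float, p0: float) -> tuple:
    """Split eventual coverage P0 into first-repetition coverage P0*P
    and coverage accumulated in subsequent repetitions P0*(1-P)."""
    return p0 * p, p0 * (1.0 - p)

derived = {name: split_coverage(p, p0) for name, (p, p0) in TABLE_11.items()}
for name, (first, later) in derived.items():
    print(f"{name}: first repetition {first:.3f}, subsequent {later:.3f}")
```

The recomputed values agree with the tabulated columns to within rounding, which suggests those columns were derived from P and P0 rather than measured separately.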
7.0 SELF-TEST DESIGN AND VALIDATION 
7.1 Initial Self-Test Program
The initial self-test program was based on the belief that coverage could 
be achieved by exercising a sufficiently large set of instructions irrespective 
of their mechanization in hardware. As subsequent events proved, however, this 
approach resulted in an inefficient self-test in that it tended to exercise some 
hardware repeatedly while omitting to exercise a substantial proportion of the 
remainder. Undoubtedly, the addition of more instructions would eventually
have achieved the desired coverage but at the cost of run-time and memory.
As a consequence, it was decided to redesign the self-test program.
7.2 Principal Tests
Based on an analysis of undetected faults in the initial self-test the major
effort was directed at the following hardware elements:
• Scratch Pad of 2901 (i.e., 16 accumulators).
• 9407 Address Processor.
• Arithmetic Logic Unit of 2901.
• Micromemory.
The gate-level equivalent circuits of these elements were analyzed to 
determine which instructions and instruction sequences were most effective in 
exercising the component gates. Since the test sequences were hardware inten- 
sive and directed specifically at the BDX-930 computer a description of each 
test would not be very illuminating to the reader, and would, moreover, 
require a detailed analysis of the data paths exercised by each BDX-930 
instruction. However, because the micromemory is generic to all computers 
and presented the greatest challenge to fault detection, a brief description 
of the micromemory test will be given. 
Micromemory Test 
The microprogram memory consists of 512 56-bit microinstructions.
However, only 382 microinstructions are used. Moreover, the last three bits
of each word are also unused. Thus, the proportion of indistinguishable
faults is at least 29.4%.
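The 29.4% lower bound can be checked with a short calculation. The Python sketch below assumes that faults are spread uniformly over all memory cells and that a fault in an unused word or unused bit is indistinguishable; the variable names are illustrative.

```python
WORDS_TOTAL = 512    # microinstruction words in the micromemory
BITS_PER_WORD = 56
WORDS_USED = 382     # words actually used by the microprogram
UNUSED_BITS = 3      # trailing bits of each word that are unused

total_cells = WORDS_TOTAL * BITS_PER_WORD                 # 28672
used_cells = WORDS_USED * (BITS_PER_WORD - UNUSED_BITS)   # 20246

# A fault in any cell outside the used set cannot affect operation.
indistinguishable_floor = 1.0 - used_cells / total_cells
print(f"{indistinguishable_floor:.3f}")  # 0.294
```

This is only a floor: nominally "don't care" bits inside used words add further indistinguishable faults.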
All of the microinstructions used in previous tests were analyzed to
determine which microinstructions remained unexercised. Additional instructions
were added to increase microinstruction coverage. The final instruction
set exercised 45.25% of the microinstructions. However, this did
not mean that an equivalent proportion of faults would be detected. The
problem here was that many microinstructions contain branching conditions
which could only be exercised by multiple calls of the microinstruction.
In connection with micromemory fault detection two problems were
paramount:
1. High coverage required the execution of a large number of
   microinstructions with an attendant increase in memory and real time.
2. It was extremely difficult to identify indistinguishable faults.
With no failure present it was relatively easy to identify micromemory bits
that, by design, had no effect on operations. However, this was not the
case when a nominally "don't care" bit was in a failed state. The
identification process required the expertise of a computer designer, and even so,
it was a time consuming process.
7.3 Self-Test Results 
The design of the self-test program was an evolutionary process. After
each update the resultant coverage was estimated via the emulator. Undetected
faults were analysed and the test again updated.
The initial self-test consisted of 1,100 words and required 7,777
microcycles to complete. The test was reviewed to determine the most effective
tests. These were retained and the rest discarded. The first, modified test
consisted of 192 words and required 974 microcycles. After each revision a
new fault set was selected and faults were injected at the gate-level. All
successive revisions merely added new tests to those of their predecessors.
As a consequence, coverage could only improve with each revision.
The successive revisions and their corresponding coverages are tabulated
in Table 12. The development process was terminated following Revision 8.
The non-monotonic coverage is the result of statistical fluctuation. Table 13
shows the gate-level coverage of Revision 8 by partitions. It is interesting
to compare these results with those of the initial self-test, which are given
in Table 14. The accuracy for 3,000 faults is ±1% with 95% confidence.
Table 15 shows the component-level coverage of Revision 8 by partitions. The
corresponding results of the initial self-test are given in Table 16.
Examination of coverage by partitions (Table 13) indicates that the
worst coverage was in the micromemory and control proms. It is clear that
the only way to improve coverage of these devices is to increase the number
of microinstructions executed by self-test. Since the conservation of memory
was an important design goal this approach was rejected.
Another reason for poor coverage of the micromemory was the way that the
unit was emulated, i.e., at the cell-level, with each cell assumed to be
independent of other cells. The failure rate of the device was then equally
distributed over the cells. A more realistic approach, and a less conservative
one from the standpoint of coverage, would have been to emulate the buffers,
column and row decoders, as well as the memory cells. From "MIL-HDBK-217C,
Notice 1, 72°C, uninhabited aircraft environment" the proportion of failure
rate for these components (54S5472) is
0.326 for the memory cells
0.674 for the buffer and decoders, etc.
If it is assumed that detection of faults in the buffers and decoders is 100%
then the 114 undetected faults in Partition 5 (of Table 13) become
0.326 x 114 = 37
and the resultant number of detected faults would be
219 instead of 142.
The resultant coverage would then be
2409/2534 = 0.951 (95.1%).
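The re-weighting argument above can be followed step by step in Python. This sketch uses only the Table 13 totals and the 0.326 cell share quoted in the text; the variable names are illustrative, and rounding the remaining undetected faults to a whole number is an assumption that matches the figure of 37.

```python
CELL_SHARE = 0.326            # share of device failure rate in the memory cells
P5_DETECTED = 142             # Partition 5 faults detected (Table 13)
P5_UNDETECTED = 114           # Partition 5 distinguishable faults not detected
TOTAL_DETECTED = 2332         # all partitions (Table 13)
TOTAL_DISTINGUISHABLE = 2534  # detected plus distinguishable undetected

# Only the cell share of the undetected faults is assumed to stay undetected;
# the buffer and decoder share is assumed to be detected with certainty.
still_undetected = round(CELL_SHARE * P5_UNDETECTED)   # 37
newly_detected = P5_UNDETECTED - still_undetected      # 77

p5_detected_new = P5_DETECTED + newly_detected         # 219
coverage_new = (TOTAL_DETECTED + newly_detected) / TOTAL_DISTINGUISHABLE
print(p5_detected_new, round(coverage_new, 3))
```

The estimate is optimistic by construction, since it treats every buffer and decoder fault as detected.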
7.3.1 Indistinguishable Faults 
Based on a sample of 6,600 gate-level faults the proportion of
indistinguishable faults is given in Table 17, by partitions. The large proportion
of indistinguishable faults (i.e., 16.5%) and the necessity to identify each
one in the estimation of coverage were the major obstacles in the self-test
design and validation process. We note that the proportion of
indistinguishable faults was estimated to be 23.7% in the earlier study, based on a sample
of 300 gate-level faults.
7.4 Summary and Conclusions 
• Emulation appears to be an indispensable tool in the design and
  validation of an efficient self-test program.
• Self-test should be designed to capitalize on the hardware mechanization
  of the instruction set. Emphasis should be placed on detecting faults
  in the least reliable component.
• Pin-level faults are easier to detect than gate-level faults. The
  latter tend to be highly data-dependent. Coverage of pin-level
  faults did not change significantly between the initial self-test
  and the final revision.
• The "box" score for the initial and final self-test program is

            Coverage
            Gate-Level   Component-Level   Words of Memory   Cycle Time
  Initial   86.5%        97.9%             1100              7777
  Final     92.0%        97.6%             346               2062

• The worst coverage was in the micromemory. A more realistic emulation
  of these devices, which included buffers, column and row decoders,
  would have yielded a significant improvement in total coverage, e.g.,
  to 95.1%.
• Based on our experience and observations, thus far, we conjecture
  that virtually any self-test program of 200 words or more, in which
  even a modest effort was taken to exercise the major hardware components,
  will yield a gate-level fault coverage of 85%.
  Coverages greater than 90%, however, are a different matter, as
  evidenced by the successive self-tests of Table 12. The situation
  could be improved significantly if processors incorporated a direct
  means of testing the micromemory and control proms either through
  a parity checker or, preferably, by making the contents of the
  memories directly accessible to the programmer.
TABLE 12
CPU SELF-TEST REVISIONS
GATE-LEVEL COVERAGE
      REVISION      WORDS   MICROCYCLES   FAULTS   1 - a (%)
1     0 (Initial)   1100    7777          300      86.4
2     1             192     974           300      88.0
3     2             202     1017          300      88.6
4     3             230     1430          300      87.9
5     4             260     1698          300      92.3
6     5             295     1802          500      93.3
7     6             330     1992          1000     94.4
8     7             334     2043          1000     93.4
9     8 (Final)     346     2062          1000     91.9
10    8 (Final)     346     2062          1000     92.5
11    8 (Final)     346     2062          1000     91.5
1 - a = coverage when indistinguishable faults are disqualified
TABLE 13
FINAL SELF-TEST
GATE-LEVEL COVERAGE BY PARTITIONS
            FAULTS     NOT DETECTED
PARTITION   INJECTED   DIST.   INDIST.   DETECTED   1 - a (%)
1           553        22      30        501        95.8
2           475        12      37        426        97.3
3           639        11      75        553        98.0
4           741        14      32        695        98.0
5           501        114     245       142        55.5
6           91         29      47        15         34.1
TOTAL       3000       202     466       2332
1 - a = 2332/2534 = .92 (92.%)
TABLE 14
INITIAL SELF-TEST
GATE-LEVEL COVERAGE BY PARTITIONS
            TOTAL      UNDETECTED
PARTITION   INJECTED   DIST.   INDIST.   DETECTED
1           34         4       2         28
2           74         7       9         58
3           55         0       16        39
4           74         9       6         59
5           50         8       33        9
6           13         3       5         5
TOTAL       300        31      71        198
1 - a = 198/229 = .865 (86.5%)
TABLE 15
FINAL SELF-TEST
COMPONENT-LEVEL COVERAGE BY PARTITIONS
            FAULTS     NOT DETECTED
PARTITION   INJECTED   DIST.   INDIST.   DETECTED   1 - a (%)
1           76         3       1         72         96.0
2           100        1       9         90         99.0
3           106        0       14        92         100
4           118        5       0         113        95.8
TOTAL       400        9       24        367
1 - a = 367/376 = .976 (97.6%)
TABLE 16
INITIAL SELF-TEST
COMPONENT-LEVEL COVERAGE BY PARTITIONS
            FAULTS
PARTITION   INJECTED   NOT DETECTED   DETECTED
1           35         1              34
2           73         1              72
3           43         2              41
4           38         0              38
TOTAL       189*       4              185
1 - a = 185/189 = .979 (97.9%)
* 11 faults were disqualified as indistinguishable.
TABLE 17
PROPORTION OF INDISTINGUISHABLE FAULTS (γ)
PARTITION   TOTAL # INJECTED FAULTS   INDISTINGUISHABLE   γ
1           1174                      57                  .049
2           1150                      88                  .077
3           1352                      186                 .138
4           1633                      72                  .044
5           1100                      589                 .535
6           191                       96                  .503
TOTAL PROPORTION OF INDISTINGUISHABLE FAULTS = 16.5%
Several models have been investigated in an attempt to characterize the
dynamics of fault propagation in a digital computer. Although simplistic in
their assumptions, these models may, nevertheless, provide insight into this
undoubtedly complex process. It has been conjectured (ref. 2) that the
distribution of latency can be modelled by analogy with balls in an urn. We
prefer to employ a different analogy although the resultant distributions are
the same.
8.0 URN MODEL 
8.1 Urn Model Description 
We postulate that the computer can be subdivided into three mutually 
exclusive sets of components C1, C2, C3 such that 

    C1 = Set of components randomly exercised by the program 
    C2 = Set of components continually exercised by the program 
    C3 = Set of components never exercised by the program. 

We make the further assumption that a fault is detected if and only if 
the faulted component is exercised. The scenario is that of an avionics 
computer executing two software programs, one of which is executed full-time 
and the other part-time. The components that are exercised by the full-time 
mode are denoted by C2 and those exercised by the part-time mode by C1. 
Neither the full-time nor the part-time mode exercises the components of C3. 

We assume that the part-time mode is exercised randomly. If the unit of 
time is a repetition of the full-time program then we postulate that the 
excitation is Poisson-distributed in time with a = probability that the 
part-time mode is exercised in a repetition of the full-time program. 

    Let $\lambda_1$ = Failure rate of C1 (Failures/hour) 
        $\lambda_2$ = Failure rate of C2 (Failures/hour) 
        $\lambda_3$ = Failure rate of C3 (Failures/hour) 
        $\lambda = \lambda_1 + \lambda_2 + \lambda_3$ (Failures/hour) 
We now derive the latency distribution given that a fault has just 
occurred. The distribution is defined in terms of three parameters, $a$, $P$ 
and $Q_0$, where 

    $P$   = Probability that the fault is detected in the first repetition 
            given that it occurred in sets C1 or C2 
    $Q_0$ = Probability that the fault is never detected. 

It is easy to derive the following relationships: 

$$ P_0 = 1 - Q_0 = \frac{\lambda_1}{\lambda} + \frac{\lambda_2}{\lambda}, \qquad Q_0 = \frac{\lambda_3}{\lambda} \tag{1} $$

$$ P = \frac{\lambda_2 + a\,\lambda_1}{\lambda P_0} = \frac{\lambda_2 + a\,\lambda_1}{\lambda_1 + \lambda_2} \tag{2} $$

where $P_0$ = Probability that the fault is detected eventually. 

If 
    $p_k$     = probability that the fault is detected in the k-th repetition 
                and not detected in a previous repetition, k = 1, 2, 3, ..., n, 
    $q_{n+1}$ = probability that the fault is not detected in the previous 
                n repetitions, 
then 

$$
\begin{aligned}
p_1 &= P_0 P = \frac{\lambda_2}{\lambda} + a\,\frac{\lambda_1}{\lambda} \\
p_2 &= (1 - P)\, a\, P_0 = a (1 - a) \frac{\lambda_1}{\lambda} \\
    &\;\;\vdots \\
p_n &= (1 - P)(1 - a)^{n-2}\, a\, P_0 = a (1 - a)^{n-1} \frac{\lambda_1}{\lambda}, \qquad n = 2, 3, \ldots \\
q_{n+1} &= Q_0 + \sum_{k=n+1}^{\infty} p_k = Q_0 + (1 - P) P_0 (1 - a)^{n-1}
         = \frac{\lambda_3}{\lambda} + (1 - a)^{n} \frac{\lambda_1}{\lambda}, \qquad n = 1, 2, 3, \ldots
\end{aligned}
\tag{3}
$$

Observe that 

$$ q_{n+1} + \sum_{k=1}^{n} p_k = 1, $$

as expected. 
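
The relationships (1) to (3) can be checked numerically. The following is a minimal sketch, assuming illustrative failure rates and an illustrative value of a (none of these numbers come from the study); it evaluates p_1 ... p_n and q_{n+1} and confirms that they sum to one.

```python
def urn_latency_distribution(lam1, lam2, lam3, a, n):
    """Return ([p_1, ..., p_n], q_{n+1}) for the Urn Model.

    lam1, lam2, lam3 -- failure rates of C1, C2, C3 (any common unit)
    a                -- probability that the part-time mode is exercised
                        in one repetition of the full-time program
    n                -- number of repetitions considered
    """
    lam = lam1 + lam2 + lam3
    p0 = (lam1 + lam2) / lam                    # detected eventually, eq. (1)
    q0 = lam3 / lam                             # never detected, eq. (1)
    big_p = (lam2 + a * lam1) / (lam1 + lam2)   # eq. (2)

    p = [p0 * big_p]                            # p_1
    for k in range(2, n + 1):                   # p_2 ... p_n, eq. (3)
        p.append((1 - big_p) * (1 - a) ** (k - 2) * a * p0)
    q_next = q0 + (1 - big_p) * p0 * (1 - a) ** (n - 1)
    return p, q_next


# Hypothetical parameter values, chosen only to exercise the formulas.
p, q9 = urn_latency_distribution(lam1=2.0, lam2=5.0, lam3=1.0, a=0.3, n=8)
print([round(v, 4) for v in p], round(q9, 4))
print(sum(p) + q9)  # should be 1 up to rounding error
```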
In estimating the above distribution the number of repetitions will be 
limited to eight. The study will then estimate the quantities 
$p_1, p_2, \ldots, p_8$ and $q_9$ for S-a-1, S-a-0 and combined faults. 
It is easy to show that the Urn Model can be represented as a Markov 
Model, as shown in Figure 9. 
FIGURE 9 MARKOV MODEL REPRESENTATION OF THE URN MODEL 
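
The model can also be exercised by direct simulation. The sketch below is an assumption-laden illustration, not the report's Markov diagram: a fault lands in C1, C2 or C3 with probability proportional to the failure rates, a C2 fault is detected in the first repetition, a C1 fault is detected in each repetition with probability a, and a C3 fault is never detected. All parameter values are hypothetical.

```python
import random


def simulate_latency(lam1, lam2, lam3, a, n, trials, seed=0):
    """Monte Carlo estimate of [p_1, ..., p_n, q_{n+1}] for the Urn Model."""
    rng = random.Random(seed)
    lam = lam1 + lam2 + lam3
    counts = [0] * (n + 1)   # counts[k-1]: detected in repetition k
                             # counts[n]:   not detected within n repetitions
    for _ in range(trials):
        u = rng.random() * lam
        if u < lam2:                         # fault in C2: always exercised
            counts[0] += 1
        elif u < lam2 + lam1:                # fault in C1: exercised w.p. a
            for k in range(n):
                if rng.random() < a:
                    counts[k] += 1
                    break
            else:
                counts[n] += 1               # still latent after n repetitions
        else:                                # fault in C3: never exercised
            counts[n] += 1
    return [c / trials for c in counts]


freq = simulate_latency(2.0, 5.0, 1.0, a=0.3, n=8, trials=100_000)
print([round(f, 3) for f in freq])
```

With these inputs the first entry should be close to the analytic value p_1 = 0.7 and the last close to q_9 = 0.125 + 0.25 (0.7)^8, subject to sampling error.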
9.0 ESTIMATORS 
As indicated previously, the principal objective of the study is to 
obtain estimates of fault coverage and fault latency in a typical avionics 
miniprocessor. Although the statistical experiments were carefully designed 
to yield high accuracy and confidence for the least cost the estimates should 
not be taken too literally. The reader is advised to exercise engineering 
judgement in interpreting the results especially when inferring conclusions 
that depend upon small differences in the estimates. The reason for caution 
is the uncertainty in the assumptions underlying the study - assumptions which 
may, if incorrect or inaccurate, contribute a far greater uncertainty to the 
results than the statistical analysis would imply. 
For the record, the critical assumptions of the study are: 
  • From the standpoint of failure modes and effects every device can be 
    represented by the manufacturer-supplied gate-level, equivalent 
    circuit. 
  • Every fault can be represented as either a S-a-0 or S-a-1 at a gate 
    node. 
  • The failure rate of each device is equally distributed over the gates 
    of the gate-level equivalent circuit. 
  • The failure rate of each gate is equally distributed over the nodes 
    of the gate. 
  • Memory failures are exclusively faults of single bits. 
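
The two "equally distributed" assumptions imply a simple per-node failure rate. The sketch below illustrates the arithmetic for one device; the device rate and gate count are the 2901A values listed in Tables 20 and 21, while the number of nodes per gate is a hypothetical figure chosen only for the example.

```python
def node_failure_rate(device_rate_ppmh, gate_count, nodes_per_gate):
    """Failure rate of a single gate node under the equal-distribution
    assumptions: the device rate is split evenly over its gates, and each
    gate's share is split evenly over that gate's nodes."""
    gate_rate = device_rate_ppmh / gate_count
    return gate_rate / nodes_per_gate


# 2901A: 2.1656 PPMH, 798 equivalent gates.
# nodes_per_gate=3 (e.g. two inputs and one output) is an assumption.
rate = node_failure_rate(2.1656, 798, nodes_per_gate=3)
print(rate)
```

Under these assumptions, selecting a device with probability proportional to its failure rate and then a gate and node uniformly is equivalent to selecting a node with probability proportional to its individual rate.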
9.1 Estimators for Self-Test Coverage 
The estimators for x, y and z are 

$$ x^* = \frac{m_d}{m} \tag{1} $$

$$ y^* = \frac{n_d}{n} \tag{2} $$

$$ z^* = \frac{m_d + n_d}{m + n} \tag{3} $$

where 

    x, y, z   = probability that a S-a-0, S-a-1, combined fault is detected; 
    m_d, n_d  = number of S-a-0, S-a-1 faults detected; 
    m, n      = number of S-a-0, S-a-1 faults injected. 
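
Estimators (1) to (3) amount to three ratios of counts. A minimal sketch follows; the counts used in the example are hypothetical and are not results of the study.

```python
def coverage_estimates(m_d, m, n_d, n):
    """Return (x*, y*, z*): estimated detection probability for S-a-0,
    S-a-1 and combined faults.

    m_d, n_d -- number of S-a-0, S-a-1 faults detected
    m, n     -- number of S-a-0, S-a-1 faults injected
    """
    x_star = m_d / m
    y_star = n_d / n
    z_star = (m_d + n_d) / (m + n)
    return x_star, y_star, z_star


# Hypothetical counts: 150 faults of each type injected.
x_star, y_star, z_star = coverage_estimates(m_d=120, m=150, n_d=135, n=150)
print(x_star, y_star, z_star)
```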
A more accurate estimate of z can be obtained if stratified sampling is 
employed. For example, let 

    $a_x$ = proportion of S-a-0 faults in the fault set of the processor 
    $a_y$ = proportion of S-a-1 faults in the fault set of the processor 

where $a_x + a_y = 1$. If m and n are selected such that 

$$ m = a_x N, \qquad n = a_y N $$

where 
    N = total number of faults injected, 
then 

$$ z^* = a_x x^* + a_y y^* $$

is more accurate than (3) if $x \neq y$. Although stratified sampling was not 
intentionally employed in the study the actual selection resulted in an 
almost equal number of S-a-0 and S-a-1 faults.(*) 

(*) In the selection process $a_x = a_y = 0.5$, i.e., S-a-0 and S-a-1 faults 
were equally likely. 

9.2 Estimators for Latency 
The estimators for $x_k$, $y_k$ and $z_k$ are 

$$ x_k^* = \frac{m_k}{m}, \qquad y_k^* = \frac{n_k}{n}, \qquad z_k^* = \frac{m_k + n_k}{m + n}, \qquad k = 1, 2, 3, \ldots, 8, \tag{4} $$

where 

    $x_k$, $y_k$, $z_k$ = probability that a S-a-0, S-a-1, combined fault is 
                          detected in the k-th repetition; 
    $m_k$, $n_k$        = number of S-a-0, S-a-1 faults detected in the k-th 
                          repetition. 
With some abuse of terminology we define 

    $x_9$, $y_9$, $z_9$ = probability that a S-a-0, S-a-1, combined fault is 
                          not detected in the previous 8 repetitions. 

We note that $x_9$ corresponds to $q_9$ of Section 8. The estimators for 
$x_9$, $y_9$ and $z_9$ are 

$$
\begin{aligned}
x_9^* &= \frac{m - m_1 - m_2 - \ldots - m_8}{m} = 1 - x_1^* - x_2^* - \ldots - x_8^* \\
y_9^* &= \frac{n - n_1 - n_2 - \ldots - n_8}{n} = 1 - y_1^* - y_2^* - \ldots - y_8^* \\
z_9^* &= \frac{m\,x_9^* + n\,y_9^*}{m + n} = 1 - z_1^* - z_2^* - \ldots - z_8^*
\end{aligned}
\tag{5}
$$
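
The latency estimators of this section reduce to per-repetition ratios plus a remainder term for faults still undetected after 8 repetitions. The sketch below uses hypothetical detection counts to show the bookkeeping, including the identity between the two forms of the combined remainder.

```python
def latency_estimates(m_k, m, n_k, n):
    """Return (x, y, z), each a list of 9 values: entries 0..7 estimate the
    probability of detection in repetitions 1..8, entry 8 estimates the
    probability of no detection in the first 8 repetitions.

    m_k, n_k -- lists of 8 counts of S-a-0, S-a-1 faults detected per repetition
    m, n     -- number of S-a-0, S-a-1 faults injected
    """
    x = [c / m for c in m_k]
    y = [c / n for c in n_k]
    z = [(cm + cn) / (m + n) for cm, cn in zip(m_k, n_k)]
    x9 = 1 - sum(x)
    y9 = 1 - sum(y)
    z9 = (m * x9 + n * y9) / (m + n)
    return x + [x9], y + [y9], z + [z9]


# Hypothetical counts for illustration only.
x, y, z = latency_estimates(
    m_k=[60, 20, 10, 5, 3, 1, 1, 0], m=120,
    n_k=[50, 25, 10, 5, 5, 3, 1, 1], n=110,
)
print(round(x[8], 4), round(y[8], 4), round(z[8], 4))
```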
9.3 Estimators for Urn Model Parameters 
The method of estimation will be described for S-a-0 latency 
distributions. With an obvious change in parameters, e.g., $m_k$, the 
estimates can be applied to S-a-1 and combined latency distributions, as well. 

The method is based on the principle of maximum likelihood. We note 
that $m_k$ S-a-0 faults are detected in the k-th repetition. Accordingly, we 
seek Urn Model parameters $a$, $P$ and $P_0$ that maximize the likelihood 
function 

$$ L = p_1^{m_1}\, p_2^{m_2} \cdots p_8^{m_8}\, q_9^{m_9} $$

where 

$$
\begin{aligned}
p_1 &= P_0 P \\
p_2 &= (1 - P)\, a\, P_0 \\
p_3 &= (1 - P)\, a\, P_0 (1 - a) \\
    &\;\;\vdots \\
p_8 &= (1 - P)\, a\, P_0 (1 - a)^6 \\
q_9 &= Q_0 + (1 - P) P_0 (1 - a)^7
\end{aligned}
\tag{6}
$$

and $m_9 = m - m_1 - m_2 - \ldots - m_8$. 
(See Section 8.1 for a definition of the Urn Model.) 

The maximum likelihood estimators for $a$, $P$ and $P_0$ are obtained as the 
solution of 

$$ \frac{\partial L}{\partial a} = 0, \qquad \frac{\partial L}{\partial P} = 0, \qquad \frac{\partial L}{\partial P_0} = 0. $$

Instead of solving these equations for the maximum likelihood estimators, 
we will employ an approximation that was suggested in (ref. 2). There, 
it was assumed that 

$$ q_9 = 1 - P_0 = Q_0. $$

In other words, detectable faults are always detected in the first 8 
repetitions. From (6) this is equivalent to the approximation 

$$ (1 - P) P_0 (1 - a)^7 = 0. \tag{7} $$

If this substitution is made in the likelihood function, L, then the resultant 
estimates are, for S-a-0 faults, 

$$
\begin{aligned}
P_0^* &= \frac{\sum_{i=1}^{8} m_i}{m} \\
P^*   &= \frac{m_1}{\sum_{i=1}^{8} m_i} \\
a^*   &= \frac{\sum_{i=1}^{8} m_i - m_1}{\sum_{i=1}^{8} i\, m_i - \sum_{i=1}^{8} m_i}
\end{aligned}
\tag{8}
$$

The results of (ref. 1) confirmed the accuracy of these approximations. 
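
The approximate estimators of (8) are closed-form functions of the detection counts. The sketch below evaluates them for a hypothetical set of counts (not data from the study); it assumes at least one fault is detected after the first repetition, since otherwise the denominator of a* is zero.

```python
def urn_parameter_estimates(m_k, m):
    """Approximate maximum likelihood estimates (a*, P*, P0*).

    m_k -- list of 8 counts: faults first detected in repetitions 1..8
    m   -- total number of faults injected
    """
    detected = sum(m_k)
    weighted = sum(i * count for i, count in enumerate(m_k, start=1))
    p0_star = detected / m
    p_star = m_k[0] / detected
    # Undefined (division by zero) if every detection is in repetition 1.
    a_star = (detected - m_k[0]) / (weighted - detected)
    return a_star, p_star, p0_star


# Hypothetical counts for illustration only.
a_star, p_star, p0_star = urn_parameter_estimates(
    [60, 20, 10, 5, 3, 1, 1, 0], m=120)
print(round(a_star, 4), round(p_star, 4), round(p0_star, 4))
```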
9.4 Accuracy and Confidence of Coverage Estimates 
9.4.1 Self-Test Coverage 
It can be shown (ref. 3) that 

$$ E(x^*) = x, \qquad E(y^*) = y, \qquad E(z^*) = z \tag{9} $$

and 

$$
\begin{aligned}
E\big((x - x^*)^2\big) &= \frac{x(1 - x)}{m} \\
E\big((y - y^*)^2\big) &= \frac{y(1 - y)}{n} \\
E\big((z - z^*)^2\big) &= \frac{z(1 - z)}{m + n}
\end{aligned}
\tag{10}
$$

where 
    E(.) = expected value of (.). 

For m, n and N sufficiently large the estimators x*, y* and z* are 
approximately Gaussian with means and variances given by (9) and (10), 
respectively. 

The following derivation of accuracy and confidence is general and 
applies to any quantity, x, estimated by the method of Section 9.1. As before, 

    x* = estimate of x 
    m  = sample size. 

It is well-known (see (ref. 4), for example) that the probability that 
x lies between the limits 

$$ x^* \pm \lambda \sqrt{\frac{x(1 - x)}{m}} $$

or, equivalently, that x* lies between the limits 

$$ x \pm \lambda \sqrt{\frac{x(1 - x)}{m}} \tag{11} $$

is equal to $\gamma$, where $\gamma$ is the area of the standard Gaussian 
distribution between $-\lambda$ and $\lambda$. From (10) we may say that the 
error in the estimate, x*, is 

$$ \epsilon = \lambda \sqrt{\frac{x(1 - x)}{m}} \tag{12} $$

with a confidence level of $\gamma$. 

Equation (12) is an ellipse in x and $\epsilon$. Table 18 gives a tabulation 
of $\epsilon\sqrt{m}$ versus x for a confidence level of $\gamma$ = .95. 

It is often convenient to obtain error estimates that are independent of 
x. From (12) it can be seen that the maximum error occurs when x = 1/2. 
Table 19 gives a tabulation of this maximum error versus sample size and 
confidence level. It is noted that the maximum error can be extremely 
conservative. 
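
The error bound of (12) is straightforward to evaluate. The sketch below uses Python's standard `statistics.NormalDist` to obtain the Gaussian multiplier for a given confidence level, then reproduces entries of Table 18 by setting the sample size to 1. The function names are illustrative, and the sample size of 200 in the last line is an assumption chosen for the example rather than a value quoted from the report.

```python
from math import sqrt
from statistics import NormalDist


def gaussian_multiplier(confidence):
    """Half-width, in standard deviations, of the central interval of the
    standard Gaussian that contains the given probability mass."""
    return NormalDist().inv_cdf((1 + confidence) / 2)


def estimate_error(x, sample_size, confidence):
    """Error of the estimate per equation (12)."""
    return gaussian_multiplier(confidence) * sqrt(x * (1 - x) / sample_size)


def worst_case_error(sample_size, confidence):
    """Maximum of (12) over x, attained at x = 1/2."""
    return estimate_error(0.5, sample_size, confidence)


# With sample_size=1 the result is error * sqrt(m), as tabulated in Table 18.
print(round(estimate_error(0.05, 1, 0.95), 3))
print(round(estimate_error(0.50, 1, 0.95), 3))
# Worst-case error for a hypothetical sample of 200 faults at 95% confidence.
print(round(worst_case_error(200, 0.95), 3))
```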
9.4.2 Latency Estimate 
For the latency distributions the estimate of most interest is the 
coverage after 8 repetitions. The accuracy and confidence of these estimates 
are obtained exactly as for self-test coverage estimates. Thus, if 
z* = estimated coverage of combined faults after 8 repetitions, then 

$$ E\big((z - z^*)^2\big) = \frac{z(1 - z)}{m + n}. $$

9.4.3 Urn Model Parameter Estimates 
It was shown in (ref. 1) that, using the estimators of (8), we obtain 

$$
\begin{aligned}
E\big((P - P^*)^2\big)     &= \frac{P(1 - P)}{m P_0} \\
E\big((P_0 - P_0^*)^2\big) &= \frac{P_0(1 - P_0)}{m} \\
E\big((a - a^*)^2\big)     &= \frac{a^2 (1 - a)}{m P_0 (1 - P)}
\end{aligned}
$$

and the cross-covariances vanish. Thus the estimates are independent and, at 
a confidence level of $\gamma$, the errors are, for $P$, $P_0$, $a$, 
respectively, 

$$
\epsilon_P = \lambda \sqrt{\frac{P(1 - P)}{m P_0}}, \qquad
\epsilon_{P_0} = \lambda \sqrt{\frac{P_0(1 - P_0)}{m}}, \qquad
\epsilon_a = \lambda \sqrt{\frac{a^2(1 - a)}{m P_0 (1 - P)}}
$$

where $\lambda$ is the Gaussian multiplier defined in Section 9.4.1. 
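
The parameter errors can be evaluated in the same way as the coverage errors. The sketch below assumes the variances P(1-P)/(m P0), P0(1-P0)/m and a^2(1-a)/(m P0 (1-P)) for P*, P0* and a* respectively; the first and third forms are assumptions about the intended expressions, and all parameter values in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def urn_parameter_errors(a, p, p0, m, confidence):
    """Return (err_P, err_P0, err_a) at the given confidence level.

    a, p, p0 -- Urn Model parameters (or their estimates)
    m        -- number of faults injected
    """
    mult = NormalDist().inv_cdf((1 + confidence) / 2)
    err_p = mult * sqrt(p * (1 - p) / (m * p0))
    err_p0 = mult * sqrt(p0 * (1 - p0) / m)
    err_a = mult * sqrt(a * a * (1 - a) / (m * p0 * (1 - p)))
    return err_p, err_p0, err_a


# Hypothetical parameter values for illustration only.
print(urn_parameter_errors(a=0.5, p=0.6, p0=0.8, m=500, confidence=0.95))
```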
TABLE 18 
ERROR FOR A CONFIDENCE LEVEL OF $\gamma$ = .95 

$\epsilon\sqrt{m}$      x 
   0.0       0 
   .427     .05 
   .588     .1 
   .70      .15 
   .784     .2 
   .849     .25 
   .898     .3 
   .935     .35 
   .960     .4 
   .975     .45 
   .98      .5 
   .975     .55 
   .96      .6 
   .935     .65 
   .898     .7 
   .849     .75 
   .784     .8 
   .7       .85 
   .588     .9 
   .427     .95 
   0.0      1.0 
TABLE 19 
WORST CASE GAUSSIAN ERROR VERSUS 
SAMPLE SIZE AND CONFIDENCE LEVEL 

 .6     .03     .025    .021    .017    .013 
 .7     .037    .03     .026    .021    .017 
 .8     .046    .038    .033    .027    .021 
 .9     .058    .048    .041    .034    .026 
 .95    .069 
10.0 EMULATION CHARACTERISTICS 
10.1 BDX-930 Architecture 
The BDX-930 Digital Processor is a microprogrammed, pipelined machine 
designed around the AMD2901A four bit microprocessor slice. The machine 
contains sixteen general purpose registers of which four registers may be 
loaded directly from memory and two registers may be used as base registers. 
One register is used as a stack pointer. 
The program counter and memory address register are contained in the 
9407, a chip designed to perform memory address arithmetic. Along with a 
temporary register contained on the same chip, the BDX-930 is able to perform 
four basic addressing modes involving three registers and various instruction 
fields. 
The machine contains three memory interface data registers which are 
used to input and output memory data. There are also a number of one-bit 
status flag registers that can be manipulated under program control. These 
include the F1 and F2 registers, which are hardware flags, and the interrupt 
enable and overflow status registers. There also exist the indirect and link 
registers used by the microcode for branching. 
The microcode is contained in seven proms and a pipeline register is 
included for simultaneous microcode fetch and decoding. Various internal 
and external conditions can affect microcode branching as selected by the 
microcode itself and a microcode control prom. In addition to a rich 
instruction set which includes 16 and 32 bit fixed point operations, there 
is a test set interface in the microcode. A selectable saturate mode is 
available which limits the results of arithmetic operations when overflow or 
underflow occur. 
For simulation purposes, the computer has been divided into six 
partitions, consisting of the following principal devices: 

Partition 1 - Address Processor 
  • 4 - 9407 Memory Address Processor Equivalent Circuit 
  • Selector Chips to Multiplex Memory Address Source 
      • 4 - 54LS352 4:1 
      • 2 - 54LS158 2:1 

Partition 2 - Data and Status Registers 
  • 2 - 54LS374 Memory Input Buffer Register 
  • 2 - 54LS374 Memory Output Buffer Register 
  • 2 - 54LS374 Next Instruction Register 
  • 3 - 54LS113 Single Bit Registers for 
      • overflow 
      • indirect addressing 
      • link (bit carry for divide) 
      • interrupt mode 
      • F1 and F2 
  • 2 - 54LS153 Select Overflow, Link, and Indirect Bit Sources 
  • 2 - 54LS245 Octal Bus Transceivers 

Partition 3 - Microcontroller 
  • Pipeline Register 
      • 4 - 54LS273 Octal Latch 
      • 4 - 54LS175 Quad Latch 
      • 1 - 54LS374 Octal Latch With Tri-State 
  • 1 - 54LS273 External Signal Synchronizer 
  • 3 - 54LS151 Selectors 8:1 for Branch Conditions 
  • 1 - 54LS169 Counter for Shift and Multiply Instructions 
  • 1 - 54LS169 Counter for Multiple Register Load-Store Instructions 
  • 1 - 54LS377 Instruction Register 
  • 1 - 54LS253 Microcode Branch Selector 

Partition 4 - Execute 
  • 4 - AMD2901A 4 Bit Slice ALU 
  • 1 - AMD2902 Lookahead Carry 
  • 2 - 54LS153 Selector 4:1 Register Selectors 
  • 1 - 54LS253 Selector 4:1 Shift Bit Selector 

Partition 5 - Microcode 
  • 7 - 54S472 Proms with 56 Bit Wide Microcode 

Partition 6 - Control Proms 
  • 1 - 54S472 Prom Microcode Start Address for Macroinstructions 
  • 1 - 54S288 Prom Control for Microcode Branch 
Instruction execution is accomplished by a pipelined architecture; various 
stages of execution occur simultaneously for a sequence of instructions. 
Consider, for instance, four instructions, A, B, C, D, to be executed in 
sequence. During the same clock cycle it is possible for the program counter 
to be incremented to point to instruction D, while instruction C is being 
fetched, instruction B is being decoded and instruction A is being executed. 
With this level of parallelism, it will be noted that when the execution 
phase of an instruction is one clock cycle, the average time to perform the 
entire instruction will be one clock cycle. 
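
The overlap described above can be laid out as a simple schedule. The sketch below assumes an idealized four-stage pipe (address, fetch, decode, execute) with one clock cycle per stage and no stalls or branches; it is an illustration of the timing argument, not a model of the BDX-930 microcode.

```python
STAGES = ["address", "fetch", "decode", "execute"]


def pipeline_schedule(instructions):
    """Return {cycle: {stage: instruction}} for an ideal pipe in which
    instruction i enters stage s during clock cycle i + s."""
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            schedule.setdefault(i + s, {})[stage] = instr
    return schedule


sched = pipeline_schedule(["A", "B", "C", "D"])
for cycle in sorted(sched):
    print(cycle, sched[cycle])
```

In this idealized schedule, N instructions complete in N + 3 cycles, so the average time per instruction approaches one clock cycle as N grows.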
It should also be noted that the partitioning of the BDX-930 roughly 
follows the stages of the pipe: address, fetch, decode, and execute. 
These stages of the pipe are joined by various buses throughout the CPU. 
These buses are formed from tri-state logic and some are bidirectional. An 
enumeration of the major buses includes 
  • Y - Connects the output of the ALU (AMD2901A) to the address processor 
    and the output register. In addition, it connects the output of the 
    next instruction buffer to the start address register and instruction 
    register. 
  • D - Connects the memory data register and the program counter to the 
    input of the ALU. 
  • DAT - Bidirectional bus connecting memory and I/O to the memory data 
    register and output register. 
  • M - Bidirectional memory data bus 
  • MAR - Memory Address Bus 
  • u - Microcode Bus 
  • IR - Instruction Register 
A list of the devices used in the BDX-930 and their failure rates is 
given in Table 20. The data was obtained from MIL-HDBK-217B Notice 2. 
10.2 Description of the Emulator 
The emulation includes the components of the CPU (Central Processor 
Unit), scratchpad memory and those portions of the program memory containing 
the target programs and the target self-test program. The emulation is 
derived from the circuit schematics. Each device is represented by a 
gate-level equivalent circuit supplied by the chip manufacturer. It was found 
that six types of gates were sufficient to represent any device, namely NAND, 
AND, OR, NOT, NOR and EXCLUSIVE OR. Table 21 gives the number of equivalent 
gates in each device of the CPU. In all, 5,100 gates were required. In the 
interests of reducing execution time, it was not expedient to emulate all 
components at the gate-level. The following elements are represented at the 
functional-level: 
  • program memory 
  • scratchpad memory 
  • microprogram and control memories 
  • 16 general purpose arithmetic registers. 
The emulation did not include the direct memory access unit (DMA) or any 
of the devices of the I/O. The emulated devices of the CPU are shown in 
Figure 10. 
Faults were injected into all devices except the program and scratchpad 
memories. Because the program memory is "read-only", no processor, faulted 
or not, is permitted to write into this memory. However, even though the 
scratchpad memory is never faulted, a faulty processor can write into it. As 
a consequence, in the parallel mode of operation where 32 processors are 
simultaneously emulated, the corresponding 32 scratchpad memories are also 
emulated. 
No delay has been simulated between logic gates. It is assumed that all 
combinatorial logic is stable at the output the instant an input pattern is 
applied to it. This means that each time the input is changed, the network 
need only be evaluated once to supply the correct output pattern. Operating 
in this manner is very time efficient, but puts stringent requirements on the 
order of evaluation of the gates. To be able to meet these requirements, the 
logic is levelized, i.e., placed in groups or levels that represent the proper 
order of evaluation. 
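
Levelizing can be illustrated on a toy netlist. The sketch below is a generic zero-delay evaluator, assuming two-input gates and a purely combinational network; the netlist, signal names and data layout are hypothetical and are not taken from the emulator.

```python
OPS = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "NAND": lambda x, y: 1 ^ (x & y),
    "NOR": lambda x, y: 1 ^ (x | y),
    "XOR": lambda x, y: x ^ y,
}


def levelize(gates, primary_inputs):
    """Group gates into levels so that every gate's inputs are produced by
    primary inputs or by gates in earlier levels.

    gates -- {output_name: (op, [input_names])}
    """
    known = set(primary_inputs)
    remaining = dict(gates)
    levels = []
    while remaining:
        ready = sorted(out for out, (_, ins) in remaining.items()
                       if all(i in known for i in ins))
        if not ready:
            raise ValueError("feedback loop or undefined input in netlist")
        levels.append(ready)
        known.update(ready)
        for out in ready:
            del remaining[out]
    return levels


def evaluate(gates, levels, inputs):
    """Evaluate each gate exactly once, level by level (zero gate delay)."""
    values = dict(inputs)
    for level in levels:
        for out in level:
            op, ins = gates[out]
            values[out] = OPS[op](*(values[i] for i in ins))
    return values


# Hypothetical three-gate network.
NETLIST = {
    "n1": ("NAND", ["a", "b"]),
    "n2": ("OR", ["b", "c"]),
    "out": ("AND", ["n1", "n2"]),
}
LEVELS = levelize(NETLIST, ["a", "b", "c"])
print(LEVELS)
print(evaluate(NETLIST, LEVELS, {"a": 0, "b": 1, "c": 0})["out"])
```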
The emulator utilizes the parallel method of logic simulation and was 
hosted on a VAX-11/780. The data word of a VAX-11 contains 32 bits; each bit 
position is used to represent a different machine. The simplest gate opera- 
tions are represented by a single Boolean instruction; when the two inputs 
occupy the same bit positions in their respective words, the output also 
occupies this bit position. The advantage of this technique is execution time 
savings. Typically, the amount of code necessary to simulate 32 machines is 
of the same order as the amount of code necessary to simulate only one machine. 
The BDX-930 description is contained in compiled code, rather than in tables, 
which was also done for speed. 
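
The parallel method can be sketched with ordinary integer operations: each signal is held in one 32-bit word, bit position i carrying the value of that signal in machine i, so one Boolean operation evaluates a gate for all 32 machines. The helper names below are illustrative and Python integers stand in for VAX-11 data words.

```python
MASK32 = 0xFFFFFFFF  # one bit per emulated machine


def and32(x, y):
    """AND gate evaluated for 32 machines at once."""
    return x & y


def nand32(x, y):
    """NAND gate evaluated for 32 machines at once."""
    return ~(x & y) & MASK32


def machine_bit(word, machine):
    """Value of the signal in one particular machine (0..31)."""
    return (word >> machine) & 1


a = 0xFFFF0000  # signal a is 1 in machines 16..31
b = 0xFF00FF00  # signal b is 1 in machines 8..15 and 24..31
out = nand32(a, b)
print(hex(out), machine_bit(out, 31), machine_bit(out, 0))
```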
Certain portions of the machine, notably the memory elements, were 
represented at a functional level rather than a gate level. For microprogram 
memory, two words of VAX-11 storage contain 56 bits of microstore; at micro 
memory fetch time, these bits are retrieved from the proper address for each 
of the simulated machines and combined to form suitable words to interface 
the gate portion of the emulation. The ROM portion of main memory is handled 
in the same manner. Writable store contains a routine to translate the gate 
inputs into consecutive VAX-11 storage words so that there is one copy of 
writable storage for each machine being emulated. On reading this storage, 
the process is reversed. 
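
The translation between the gate-level representation (one word per signal, one bit per machine) and per-machine storage words is essentially a bit-matrix transpose. The sketch below shows one way such a routine could look; the 16-bit data width and the function names are assumptions made for illustration.

```python
def planes_to_words(planes, machines=32):
    """Convert gate-level bit planes into one data word per machine.

    planes[b] is a word whose bit i holds data bit b as seen by machine i.
    """
    words = []
    for mach in range(machines):
        value = 0
        for bit, plane in enumerate(planes):
            value |= ((plane >> mach) & 1) << bit
        words.append(value)
    return words


def words_to_planes(words, width=16):
    """Inverse operation, used when the storage is read back."""
    planes = []
    for bit in range(width):
        plane = 0
        for mach, value in enumerate(words):
            plane |= ((value >> bit) & 1) << mach
        planes.append(plane)
    return planes


# Round trip with arbitrary 16-bit contents for each of 32 machines.
stored = [(i * 2654435761) & 0xFFFF for i in range(32)]
assert planes_to_words(words_to_planes(stored)) == stored
```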
In a typical run of the emulator, 32 different machines are exercised: 
31 faulted machines and one good machine. Each faulted machine is assumed to 
have a single hard fault at one node, either stuck-at-one (S-a-1) or 
stuck-at-zero (S-a-0). The faults are injected by defining extra gates at 
each node, an AND gate for stuck-at-zero and an OR gate for stuck-at-one. A 
typical AND gate using this technique is shown in Figure 11. 
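
In the bit-parallel representation, the extra AND and OR gates become two mask words per node. The sketch below assumes machine 0 is the good machine and uses hypothetical mask names: a 0 bit in the S-a-0 mask forces the node low in that machine, and a 1 bit in the S-a-1 mask forces it high.

```python
MASK32 = 0xFFFFFFFF


def inject(node_word, sa0_mask, sa1_mask):
    """Apply the fault gates at one node for all 32 machines.

    sa0_mask -- 0 in the bit positions of machines with a S-a-0 fault here
    sa1_mask -- 1 in the bit positions of machines with a S-a-1 fault here
    """
    return (node_word & sa0_mask) | sa1_mask


def differing_machines(word, good=0):
    """Machines whose value at this node differs from the good machine."""
    reference = MASK32 if (word >> good) & 1 else 0
    diff = word ^ reference
    return [m for m in range(32) if (diff >> m) & 1]


# Machine 1 has S-a-0 and machine 2 has S-a-1 at the output of an AND gate.
SA0 = MASK32 & ~(1 << 1)
SA1 = 1 << 2

out_high = inject(MASK32 & MASK32, SA0, SA1)   # both inputs 1 in every machine
out_low = inject(0 & MASK32, SA0, SA1)         # one input 0 in every machine
print(differing_machines(out_high), differing_machines(out_low))
```

With both inputs high only the S-a-0 machine differs from the good machine, and with an input low only the S-a-1 machine differs, which is why a fault can remain latent until the program exercises the node with the right value.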
When the entire emulation is executed for true values, the ratio of 
VAX-11 time to BDX-930 time is 5000:1; with faults injected in one partition, 
the number is 7000:1. 
TABLE 20 
COMPONENTS OF THE BDX-930 CPU 

                            FAILURE RATE PER UNIT 
DEVICE                      (PPMH) 
9407                        1.3931 
2901A                       2.1656 
2902                        0.3898 
5440                        0.0653 
54125                       0.0855 
54S00                       0.0855 
54S04                       0.1003 
54S10                       0.0764 
54S20                       0.0654 
54S32                       0.2138 
54S288 (32x8 prom)          0.1787 
54S472 (512x8 proms)        1.008 
54LS00                      0.084 
54LS02                      0.084 
54LS04                      0.0983 
54LS08                      0.0752 
54LS11                      0.084 
54LS86                      0.084 
54LS113                     0.1447 
54LS151                     0.1483 
54LS153                     0.1447 
54LS158                     0.1410 
54LS169                     0.6603 
54LS175                     0.1703 
54LS245                     0.3792 
54LS253                     0.1447 
54LS273                     0.6882 
54LS352                     0.3117 
54LS367                     0.1100 
54LS374                     0.7234 
54LS377                     0.7148 
TABLE 21 
MICROCIRCUITS AND EQUIVALENT GATE COUNT 

DEVICE      EQUIVALENT GATES 
2901A            798 
2902              19 
54113              8 
54151             17 
54153             16 
54158             15 
54169             58 
54175             22 
54245             18 
54253             16 
54273             34 
54352             16 
54374             26 
54377             35 
9407             143 
FIGURE 10 BDX-930 PROCESSOR (I = PARTITION NUMBER) 

FIGURE 11A NON-FAULTED "AND" GATE 

FIGURE 11B FAULT MODEL OF "AND" GATE (labels: ORIGINAL GATE; INPUT OR 
OUTPUT S-a-0 FAULT; OUTPUT S-a-1 FAULT) 
11.0 CONCLUSIONS 
On the basis of the study we conclude: 
  • The present study substantiates the results of the previous study. 
    The only difference was in the conjecture that detection is a linear 
    function of the number of instructions. The present study 
    demonstrates that coverage is independent of the length of the program. 
  • Emulation is a practicable approach to failure modes and effects 
    analysis of a digital processor. 
  • The run time of the emulated processor on a VAX-11/780 host computer 
    is only 5000 to 7000 times slower than the actual processor. As a 
    consequence, large numbers of faults can be studied at relatively 
    little cost and in a timely manner. 
  • The fault model, although somewhat arbitrary, can be updated as more 
    data becomes available. 
  • Gate-level faults are more difficult to detect than component-level 
    faults. As a consequence, coverage requirements should be explicit 
    as to the types of faults to be covered. 
  • In a comparison-monitored system the accumulation of latent faults 
    can be significant. For example, in a flight control program of 
    2200 instructions, 21% of all distinguishable faults remained 
    undetected after 8 repetitions. The impact of this accumulation on 
    aircraft survivability has yet to be determined. 
  • Self-test should be designed to capitalize on the hardware 
    mechanization of the CPU. 
  • It is relatively easy to generate a self-test with a gate-level 
    coverage between 85% and 90%. To obtain a coverage of 95% is difficult. 
  • It is relatively easy to obtain component-level coverage in excess of 
    95%. 
  • Faults in the micromemory are difficult to detect. This situation 
    could be improved if future processors incorporated a direct means of 
    testing, either by a parity check or, preferably, by making the 
    contents of the micromemory accessible to the programmer. 
  • A large proportion of faults, i.e., 16.5%, were indistinguishable 
    (i.e., "don't care"). It was extremely difficult to identify these 
    faults. 
  • The Urn Model can characterize the shape of the latency distribution. 
    This can be attributed to: 
    1) The monotonic, decreasing property of the empirical distribution 
    2) The 3 degrees-of-freedom which the model provides for a best fit. 
  • It is doubtful that the Urn Model parameters can be predicted for a 
    program on the basis of length or instruction mix. 
12.0 REFERENCES 
1. McGough, J., Swern, F., "Measurement of Fault Latency 
   in a Digital Avionic Mini Processor", NASA CR-3462, 
   NASA Langley Research Center, Hampton, VA, October 1981. 
2. Nagel, P., "Modeling of a Latent Fault Detector in a 
   Digital System", NASA CR-145371, 1978. 
3. McFarlane Mood, A., "Introduction to the Theory of 
   Statistics", McGraw-Hill, New York, 1950. 
4. Cramer, H., "Mathematical Methods of Statistics", Princeton 
   University Press, Princeton, 1958. 
1. Report No.: NASA CR-3651 
4. Title and Subtitle: MEASUREMENT OF FAULT LATENCY IN A DIGITAL AVIONIC 
   MINI PROCESSOR - PART II 
5. Report Date: January 1983 
6. Performing Organization Code: 
7. Author(s): John G. McGough and Fred Swern 
8. Performing Organization Report No.: 
9. Performing Organization Name and Address: 
   Flight Systems Division 
   Bendix Corporation 
   Teterboro, N.J. 07608 
10. Work Unit No.: 
11. Contract or Grant No.: NAS1-15946 
12. Sponsoring Agency Name and Address: 
    National Aeronautics and Space Administration 
    Washington, D.C. 20546 
13. Type of Report and Period Covered: Contractor Report 
14. Sponsoring Agency Code: 
15. Supplementary Notes: 
    Langley NASA Project Engineer: Salvatore J. Bavuso 
    Progress Report 
16. Abstract: 
    This report describes the results of fault injection experiments utilizing 
    a gate-level emulation of the central processor unit of the Bendix BDX-930 
    digital computer. The study is an extension of a previous study: 
    Measurement of Fault Latency in a Digital Avionic Mini Processor, 
    NASA CR-3462, October 1981. 
    The poor coverage of comparison-monitoring, which the earlier study 
    demonstrated, could have been due to the limited repertoire of the 
    instruction set used. As a consequence, it was decided to reprogram 
    several earlier programs but this time expanding the instruction set to 
    capitalize on the full power of the BDX-930 computer. As a final 
    demonstration of fault coverage an extensive, 3-axis, high performance 
    flight control computation was added. 
    A secondary objective of the study was to demonstrate the stages in the 
    development of a CPU self-test program emphasizing the relationship 
    between fault coverage, speed and quantity of instructions. 
17. Key Words (Suggested by Author(s)): 
    Emulation, Self-Test, Gate-Level, Fault Detection, Fault Latency, 
    Comparison-Monitoring 
18. Distribution Statement: 
    Unclassified - Unlimited 
    Subject Category 59 
19. Security Classif. (of this report): Unclassified 
20. Security Classif. (of this page): Unclassified 
21. No. of Pages: 94 
22. Price: A05 
For sale by the National Technical Information Service, Springfield, Virginia 22161 
NASA-Langley, 1983 