Impact of device level faults in a digital avionic processor by Suk, Ho Kim
January 1989 UILU-ENG-89-2210
CSG-99
COORDINATED SCIENCE LABORATORY
College of Engineering
IMPACT OF 
DEVICE LEVEL 
FAULTS IN 
A DIGITAL 
AVIONIC 
PROCESSOR 
Suk Ho Kim 
(NASA-CR-184783) IMPACT OF DEVICE LEVEL FAULTS IN A DIGITAL AVIONIC PROCESSOR
(Illinois Univ.) 55 p CSCL 09B N89- 1 E C 46
Unclas
G3/60 0190107
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
Approved for Public Release. Distribution Unlimited. 
https://ntrs.nasa.gov/search.jsp?R=19890008675 2020-03-20T03:26:05+00:00Z
UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE
REPORT DOCUMENTATION PAGE
1a. REPORT SECURITY CLASSIFICATION: Unclassified
1b. RESTRICTIVE MARKINGS: None
2a. SECURITY CLASSIFICATION AUTHORITY:
3. DISTRIBUTION / AVAILABILITY OF REPORT: Approved for public release; distribution unlimited
5. MONITORING ORGANIZATION REPORT NUMBER(S):
6a. NAME OF PERFORMING ORGANIZATION: Coordinated Science Lab, University of Illinois
6b. OFFICE SYMBOL (If applicable): N/A
7a. NAME OF MONITORING ORGANIZATION: NASA
7b. ADDRESS (City, State, and ZIP Code): NASA Ames Research Center, Moffett Field, CA 94035
8a. NAME OF FUNDING / SPONSORING ORGANIZATION: NASA
8b. OFFICE SYMBOL (If applicable):
8c. ADDRESS (City, State, and ZIP Code): (see 7b.)
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER: NASA: NAG 1-602
PROGRAM ELEMENT NO., PROJECT NO., TASK NO., WORK UNIT ACCESSION NO.:
11. TITLE: Impact of Device Level Faults in a Digital Avionic Processor
12. PERSONAL AUTHOR(S): Suk Ho Kim
13a. TYPE OF REPORT: Technical
13b. TIME COVERED: FROM TO
14. DATE OF REPORT (Year, Month, Day): December 1988
16. SUPPLEMENTARY NOTATION:
17. COSATI CODES: FIELD, GROUP, SUB-GROUP
18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number):
fault injection, fault propagation, device level fault, mixed mode simulation,
Mean Error Durations, Mean Time Between Errors, near-coincident errors.
20. DISTRIBUTION / AVAILABILITY OF ABSTRACT: UNCLASSIFIED/UNLIMITED, SAME AS RPT., DTIC USERS
21. ABSTRACT SECURITY CLASSIFICATION: Unclassified
22a. NAME OF RESPONSIBLE INDIVIDUAL:
22b. TELEPHONE (Include Area Code): 22c. OFFICE SYMBOL:
This study describes an experimental analysis of the impact of gate and device-level faults in the
processor of a Bendix BDX-930 flight control system. Via mixed-mode simulation, faults were
injected at both the gate (stuck-at) and the transistor levels, and their propagation through the chip to
the output pins was measured. The results show that there is little correspondence between a stuck-at
and a device-level fault model as far as error activity or detection within a functional unit is concerned.
Insofar as error activity outside the injected unit and at the output pins is concerned, the stuck-at and
device models track each other. The stuck-at model, however, overestimates, by over one hundred
percent, the probability of fault propagation to the output pins. An evaluation of the Mean Error
Durations and the Mean Time Between Errors at the output pins shows that the stuck-at model
significantly underestimates (by 62%) the impact of an internal chip fault on the output pins. Finally,
the study also quantifies the impact of device faults by location, both internally and at the output pins.
IMPACT OF DEVICE LEVEL FAULTS 
IN A DIGITAL AVIONIC PROCESSOR 
BY 
SUK HO KIM 
B.S., University of Illinois, 1986 
THESIS 
Submitted in partial fulfillment of the requirements 
for the degree of Master of Science in Electrical Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 1988
Urbana, Illinois
Acknowledgment: This research was supported by the National Aeronautics and Space
Administration (NASA) under Contract NASA NAG 1-602.
ABSTRACT 
This study describes an experimental analysis of the impact of gate and device-level faults in the
processor of a Bendix BDX-930 flight control system. Via mixed-mode simulation, faults were
injected at both the gate (stuck-at) and the transistor levels, and their propagation through the chip to
the output pins was measured. The results show that there is little correspondence between a stuck-at
and a device-level fault model as far as error activity or detection within a functional unit is concerned.
Insofar as error activity outside the injected unit and at the output pins is concerned, the stuck-at and
device models track each other. The stuck-at model, however, overestimates, by over one hundred
percent, the probability of fault propagation to the output pins. An evaluation of the Mean Error
Durations and the Mean Time Between Errors at the output pins shows that the stuck-at model
significantly underestimates (by 62%) the impact of an internal chip fault on the output pins. Finally,
the study also quantifies the impact of device faults by location, both internally and at the output pins.
ACKNOWLEDGMENTS 
I wish to thank my advisor, Professor Ravi Iyer, for his guidance and encouragement. His support
will always be appreciated. I also thank the researchers at the NASA Langley Research Center
(AIRLAB) for many useful discussions. In particular, I thank Bernice Becker for providing insight into
the BDX-930 simulator. Thanks are also due to G. Choi, J. Singh, R. Llames, Luke Young and Jenny
Marcinkiewic for their careful reading of an early draft of this thesis.
TABLE OF CONTENTS
CHAPTER
1. INTRODUCTION
1.1. Related Research
2. THE EXPERIMENT
2.1. Mixed-Mode Simulation
2.2. Fault Injection
3. MEASUREMENTS
3.1. Data Collection
4. COMPARISON OF PHYSICAL AND STUCK-AT FAULT INJECTIONS
5. EFFECT OF FAULT PLACEMENT
5.1. Comparison of Fault Distributions of Different Material Failures
6. CHARACTERIZATION OF ERRORS ON OUTPUT PINS
6.1. Probability of Pin Errors
6.2. Mean Time Between Errors (MTBE) and Mean Error Durations (MED)
6.3. Near-Coincident Errors
7. INSTRUCTION/MICROINSTRUCTION ANALYSIS
8. CONCLUSIONS
APPENDIX A. COMPARISONS BETWEEN STUCK-AT AND DEVICE
A.1. Comparison of Percentage of Faults Detected
A.2. Comparison of Percentage of Faults Detected at Output Pins
A.3. Comparison of Propagation Factor at Output Pins
APPENDIX B. INSTRUCTION/MICROINSTRUCTION COMPARISONS
B.1. Comparison of Gate Activity
B.2. Comparison of Error Probability based on Instruction/Microinstruction
REFERENCES
LIST OF TABLES
TABLE 1: Number of Gates and Transistors in AMD 2901
TABLE 2: Sample of Error File
TABLE 3: The Propagation Factors
TABLE 4: MTBE and MED for Stuck-at and Device Faults
TABLE 5: Probability of Units Affected
TABLE 6: The Effect of Fault in Oxide and Metal
TABLE 7: Pins of AMD 2901
TABLE 8: Probability of Pin Errors
TABLE 9: Mean Time Between Errors
TABLE 10: Mean Error Durations
TABLE 11: Probability of Fault Detection for Instruction Executed
TABLE 12: Probability of Fault Detection for Microinstruction
TABLE A.1: The Propagation Factors to Output Pins
TABLE B.1: Comparison of Error Probability for Instructions
TABLE B.2: Comparison of Error Probability for Microinstructions
LIST OF FIGURES
Figure 1: Percentages of Faults Detected That Remained in the Unit
Figure 2: Comparison of Fault Propagation Going Outside of the Unit
Figure 3: Activity Comparisons Between Materials
Figure 4: Mean Number of Coincident Errors
Figure 5: Probability of Near-Coincident Errors
Figure 6: Gate Activity for Device Level Fault
Figure A.1: Comparison of Percentages of Faults Detected
Figure A.2: Comparison of Percentages of Faults Detected at Output Pins
Figure B.1: Comparison between Device and Gate Level Faults
CHAPTER 1 
INTRODUCTION 
A study of fault propagation and its impact is important for the effective design of reliable and
fault-tolerant systems. Such a study, however, is difficult because the mechanisms involved are complex
and hence not easily amenable to analytical modeling. In these circumstances an experimental study can
not only provide valuable insight into the issues of fault occurrence and propagation, but also help
develop a structured basis for future analytical work.
This thesis describes an experimental analysis of fault propagation and fault sensitivity in the
processor of a Bendix BDX-930 flight control system. The processor was simulated using an event-driven,
gate-level logic simulator, developed at NASA Langley Research Center, interfaced with a device-level
circuit simulator (SPICE) [1]. Via mixed-mode simulation, faults were injected at both the gate (stuck-at)
and the transistor levels, and their propagation through the chip to the output pins was measured. The
nature and extent of the dependency of fault propagation on the type of instruction/microinstruction
executed were also measured.
The results showed that, for device-level faults, in 5.1% of the cases errors were detected within
the injected unit, and in 20.9% of the cases errors were detected outside the unit (including 12.7% at the
output pins); 74% remained undetected. For the stuck-at model, in 1.5% of the cases errors were
detected within the injected unit, and in 41.8% of the cases errors were detected outside the unit (26.9% at
the output pins); 56.7% remained undetected. The results also showed that there was little correspondence
between a stuck-at and a device-level fault model insofar as error activity or detection within a
functional unit is concerned. As far as error activity outside the injected unit and at the output pins is
concerned, the stuck-at and device models tracked each other, although the stuck-at model overestimated,
by over one hundred percent, the probability of fault propagation to the output pins. An evaluation of the
Mean Error Durations and the Mean Time Between Errors at the output pins showed that the stuck-at
model will significantly underestimate (by 62%) the impact of an internal chip fault on the output pins.
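The "over one hundred percent" figure can be checked directly from the pin-level percentages quoted above. The short sketch below does that arithmetic and also confirms that each model's three outcome percentages add up to 100; the variable and function names are ours, chosen for illustration only.

```python
# Outcome percentages quoted in the text: (within unit, outside unit, undetected).
device_outcomes = (5.1, 20.9, 74.0)
stuck_at_outcomes = (1.5, 41.8, 56.7)

# Percentage of injected faults whose errors reached the output pins.
device_pin_pct = 12.7
stuck_at_pin_pct = 26.9


def relative_overestimate(model_pct: float, reference_pct: float) -> float:
    """Return by how much model_pct exceeds reference_pct, in percent of the reference."""
    return (model_pct - reference_pct) / reference_pct * 100.0


overestimate = relative_overestimate(stuck_at_pin_pct, device_pin_pct)
print(f"device outcomes sum to {sum(device_outcomes):.1f}%")
print(f"stuck-at outcomes sum to {sum(stuck_at_outcomes):.1f}%")
print(f"stuck-at overestimates pin-level propagation by about {overestimate:.0f}%")
```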
Measurement of error activity at the output pins showed that faults in different functional units
affect the output pins to varying degrees, and that each unit had a distinct probability of affecting the
output pins. This result suggests that by injecting pin errors with the measured probabilities we can
easily emulate within-chip faults for integrated system testing.
Chapters 2 and 3 contain a detailed description of the experimental procedure and measurements.
Chapters 4, 5, 6, and 7 show the experimental results and their analysis. Chapter 4 compares the results
from the gate-level and the device-level simulations. In Chapter 5, the effect of fault placement in the
AMD 2901 chip is defined and quantified. Chapter 6 shows the error characteristics at the output pins for
device-level faults. Chapter 7 describes the analysis of fault propagation according to the instructions and
microinstructions executed, for gate-level and device-level faults. The final chapter highlights
the important results and makes suggestions for future research.
1.1. Related Research
In recent years, there has been considerable research in the area of error and failure analysis of
computer systems. In [2,3,4], automatically collected error data from several general-purpose computers
are analyzed. By jointly analyzing the performance and error data on several machines, valuable insight
into error manifestation and discovery in large systems is provided. A series of experiments focusing on
error analysis through fault insertion was conducted by several investigators at the NASA AIRLAB
test-bed facility. A summary of these experiments is given in [5]. In [6,7,8], the evaluation and modeling of
fault latency in digital avionic systems is investigated by determining the degree of fault latency in a
redundant flight control system. In [9,10], further experiments to study fault and error latency
distributions under varying workload conditions are discussed.
A detailed simulation experiment to study error propagation within a chip is discussed in [11]. The
study develops a systematic experimental methodology to quantify error propagation via gate-level
simulation. To characterize the error propagation within the chip, distributions of error activity within the
chip and at the output pins are generated. Based on these distributions, measures of error propagation and
severity are defined. The analysis quantifies the dependency of the measured error propagation on the
location of the fault. The study also shows the nature and extent of the dependency of error propagation
upon the type of microinstruction and assembly-level instruction executed.
Our experience with large circuits has shown that only certain sections or paths require simulation
at the highest level of detail, while the simulation accuracy for the rest of the circuit is less critical.
To optimize the cost-accuracy tradeoff, one should be able to specify the level of detail required by
selecting the simulation mode for each module. The current effort, based on mixed-mode simulation, was
initiated to meet this need. To date, there has been no research investigating fault propagation
from the device to the pin level. This information is crucial for flight-critical digital systems, and
additionally would allow the determination of the effect of placement on fault propagation.
CHAPTER 2
THE EXPERIMENT 
The system targeted for this study is the CPU in the Bendix BDX-930, which is a digital avionic
miniprocessor. The BDX-930 is used in a number of flight control avionic systems, e.g., in SIFT [12]
and in the AFTI F-16 [a. Fault tolerance is achieved by replication of the processing and voting in software.
The BDX-930 consists of 86 microcircuits printed on one circuit board [13]. The processor is designed
around the AMD 2901 four-bit microprocessor slice [a. In our experiments, the processor was simulated
using an event-driven, gate-level logic simulator (developed at NASA Langley) interfaced with a
device-level circuit simulator (SPICE). Since the AMD 2901 is the most complex chip in the BDX-930, it was
used for fault injection and error data collection. In the simulations, fault propagation data were collected
at the device and gate levels as well as at the output pins. From the data provided by the simulations, issues
relating to fault propagation and fault sensitivity of the chip architecture were addressed.
2.1. Mixed-Mode Simulation
The simulator [14], designed at NASA AIRLAB, is an experimental tool for fault simulation and
reliability checking of the Bendix BDX-930. This simulator is an event-driven, gate-level, unit-delay logic
simulator, and includes the CPU with its instruction set, the memory, and sections of the program memory
containing six application programs and a self-test program. The simulation model is based on the circuit
schematic of the AMD 2901 and includes all of the devices identified in those schematics. Each device is
represented by a gate-level equivalent circuit supplied by the chip manufacturer. Six gate types are used
to represent devices, i.e., NAND, AND, OR, NOT, NOR, and Exclusive OR. Unit delay is assumed between
logic gates.
Although this simulator was quite accurate for gate-level simulation, it could not simulate faults
occurring at the transistor level. By interfacing the gate-level simulation with a circuit-level simulator,
SPICE2 [1], a method was implemented that permitted the injection of transistor-level faults and the
observation of fault propagation at the gate or module level. Thus, by using a combination of circuit and
gate-level simulation, we could achieve the accuracy of circuit-level fault injection and the speed of
gate-level simulation.
An important issue in any mixed-mode simulation is accurate analog-to-digital signal conversion.
In our simulator, once the gate into which the fault is going to be injected is chosen, the SPICE simulator
runs only for the faulty gate, according to the fault model inside the gate. SPICE generates an analog
output which ranges from zero to five volts. Because logic values (one or zero) are required for the
gate-level simulator, a subroutine is needed to convert the analog voltages to logic values. In order to get
proper logic values, the analog voltages are sampled and averaged within a scanning window along the
time axis. An averaged voltage higher than 4.2 volts is taken as a logic one, and one
lower than 0.8 volt as a logic zero. After acquiring the
values from the SPICE run, the rest of the simulation occurs at the gate level.
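The conversion step described above can be sketched as follows. This is a minimal illustration and not the thesis's Fortran subroutine: the function names, the fixed window length, and the choice to return None for an average that falls between the two thresholds are all our assumptions, since the text does not say how that case is handled.

```python
from statistics import mean
from typing import List, Optional, Sequence

LOGIC_ONE_MIN_V = 4.2   # an averaged voltage above this is read as logic one
LOGIC_ZERO_MAX_V = 0.8  # an averaged voltage below this is read as logic zero


def window_to_logic(samples: Sequence[float]) -> Optional[int]:
    """Average the samples of one scanning window and map the result to 1, 0, or None.

    None marks an average between the two thresholds (assumed handling).
    """
    if not samples:
        raise ValueError("a scanning window must contain at least one sample")
    average = mean(samples)
    if average > LOGIC_ONE_MIN_V:
        return 1
    if average < LOGIC_ZERO_MAX_V:
        return 0
    return None


def waveform_to_logic(voltages: Sequence[float], window: int) -> List[Optional[int]]:
    """Slide a non-overlapping window along the time axis and convert each window."""
    return [
        window_to_logic(voltages[start:start + window])
        for start in range(0, len(voltages), window)
    ]
```

For example, `waveform_to_logic([5.0, 4.9, 0.1, 0.2], window=2)` yields one logic value per two-sample window.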
2.2. Fault Injection 
Based on previously published results [15], physical failures may generally be divided into two
categories, device failures and interconnection failures. In this study we consider only device failures. In
[16], it is reported that, of the device faults, the most likely are oxide-level faults and metal faults.
Typically, according to [16], 68% are oxide faults and the remaining 32% are metal faults. We used these
percentages for determining the types of fault to inject in the simulations.
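One way to apply the 68%/32% split when choosing fault types is a weighted random draw. The sketch below is only an illustration under that assumption; the text does not say whether the split was drawn at random or fixed in advance, and the names and the seed are hypothetical.

```python
import random
from collections import Counter
from typing import List

# Relative likelihood of each device fault type, as reported in the text.
FAULT_TYPE_WEIGHTS = {"oxide": 0.68, "metal": 0.32}


def draw_fault_types(n_faults: int, seed: int = 0) -> List[str]:
    """Draw a fault type for each injection, weighted by the reported percentages."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    kinds = list(FAULT_TYPE_WEIGHTS)
    weights = [FAULT_TYPE_WEIGHTS[kind] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=n_faults)


fault_types = draw_fault_types(700)  # 700 device-level injections, as in the study
counts = Counter(fault_types)
print(counts)
```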
Two hundred forty-five gates corresponding to a Bendix circuit diagram for the AMD 2901 were
selected for fault injection. In order to have consistent statistical results, 700 device faults and 300
gate-level faults were injected. The target gates were randomly selected¹ among the twelve functional units of
the AMD 2901 except for the Q Register. The Q Register was excluded to avoid effects of the fault
latency, which have been studied elsewhere [6,8]. The units into which faults were injected were RAM
Shift, Q Shift, Multiplexor, Arithmetic Logic Unit, RAM Control, Output, Output Data Select Unit,
Destination Control, ALU Control, and Source Control. Table 1 shows the number of gates and transistors in
these units.
TABLE 1: Number of Gates and Transistors in AMD 2901

SECNUM   SECNAME         NUMOFGATE   NUMOFTRANS
1        Q Register      0           0
2        Ram             12          64
3        QShift          16          100
4        Source Contl    4           8
5        ALU Contrl      6           24
6        Output          6           38
7        Output Select   9           50
8        Ram Shift       16          100
9        Ram Control     83          572
10       Dest Control    10          36
11       ALU             51          326
12       Mux             32          224

¹ While sequential injection such as used in [11] is exhaustive, it is not practical in modeling the behavior of physical faults
since the numbers can be very large.
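As a consistency check, the gate counts of Table 1 add up to the 245 gates mentioned in the text. The sketch below verifies this and shows one possible way to pick target units at random. Weighting each unit by its gate count (that is, choosing uniformly among gates) is our assumption; the text only states that target gates were randomly selected.

```python
import random
from typing import List

# (unit name, number of gates, number of transistors), transcribed from Table 1.
TABLE_1 = [
    ("Q Register", 0, 0),
    ("Ram", 12, 64),
    ("QShift", 16, 100),
    ("Source Contl", 4, 8),
    ("ALU Contrl", 6, 24),
    ("Output", 6, 38),
    ("Output Select", 9, 50),
    ("Ram Shift", 16, 100),
    ("Ram Control", 83, 572),
    ("Dest Control", 10, 36),
    ("ALU", 51, 326),
    ("Mux", 32, 224),
]

total_gates = sum(gates for _, gates, _ in TABLE_1)
total_transistors = sum(trans for _, _, trans in TABLE_1)


def pick_target_units(n_faults: int, seed: int = 0) -> List[str]:
    """Pick a target unit per fault, weighted by gate count (assumed scheme).

    Units with zero gates, such as the excluded Q Register, are never chosen.
    """
    rng = random.Random(seed)
    candidates = [(name, gates) for name, gates, _ in TABLE_1 if gates > 0]
    names = [name for name, _ in candidates]
    weights = [gates for _, gates in candidates]
    return rng.choices(names, weights=weights, k=n_faults)


print(total_gates, total_transistors)
```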
CHAPTER 3
MEASUREMENTS 
The simulator and the fault injection facility programs were written in Fortran 77 and in the
VAX/VMS system language (Digital Command Language) [17]. Based on the random fault injections, a
total of 700 device-level and 300 stuck-at fault simulations were performed to obtain experimental data
on the fault propagation characteristics. Each simulation corresponded to executing the sequences of
microinstructions of a CPU-test program². First, a "gold" or unfaulted simulation run was performed.
Next, one thousand simulation runs, each containing an injected fault, were performed. For each faulted
simulation, a comparison was made with the gold simulation to generate the error data for subsequent
fault propagation analysis. An error was defined as follows:
1) A gate activated in the gold simulation but not activated in the faulted simulation. 
2) A gate activated in the faulted simulation but not activated in the gold simulation. 
3) A gate activated in both simulations for the same time slice but with different logic values. 
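The three-part error definition above can be sketched as a comparison of two activity traces. The trace representation used here, a dictionary keyed by (time slice, gate name) holding the logic value of each activation, is an assumption made for illustration; the thesis's actual file format is not reproduced.

```python
from typing import Dict, Tuple

# Assumed representation: (time_slice, gate_name) -> logic value of an activated gate.
Trace = Dict[Tuple[int, str], int]


def find_errors(gold: Trace, faulted: Trace) -> Dict[Tuple[int, str], int]:
    """Return the error case (1, 2, or 3) for each activation that differs."""
    errors = {}
    for key in gold.keys() | faulted.keys():
        in_gold = key in gold
        in_faulted = key in faulted
        if in_gold and not in_faulted:
            errors[key] = 1  # activated in the gold run only
        elif in_faulted and not in_gold:
            errors[key] = 2  # activated in the faulted run only
        elif gold[key] != faulted[key]:
            errors[key] = 3  # activated in both, but with different logic values
    return errors


# Hypothetical gate names and values, for illustration only.
gold_run = {(1, "G1"): 1, (2, "G2"): 0, (3, "G3"): 1, (4, "G4"): 1}
faulted_run = {(2, "G2"): 0, (3, "G3"): 0, (4, "G4"): 1, (5, "G5"): 1}
print(find_errors(gold_run, faulted_run))
```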
The results of these measurements enabled us to analyze the dependency of the error propagation
on the location of the fault and the type of instruction/microinstruction executed, and to characterize the
error activity at the output pins.
3.1. Data Collection 
After each simulation run and prior to the next run, the output, consisting of time stamps, a trace of
gate activity, and logic values, was appended to the existing output file. The output file created by each
run was then compared with fault-free data from the gold simulation to generate an "Error Data File."
Table 2 shows a sample of an Error Data File. To provide detailed information in the data set, many
independent variables were included in the raw data, as shown in the table. The first column indicates the
simulation number. Columns 2, 3 and 4 show the gate name, the unit name and the unit type, respectively,
where the fault injections were made. Column 5 specifies the gate type into which the fault was
injected. The type of the injected fault is shown in columns 6 and 7. For example, in simulation 1, a
fault corresponding to an oxide breakdown was injected at the device level. Column 8 shows whether or not
an error was detected in that simulation run. Column 9 shows the time slice during which an error
resulting from the injected fault was detected. The remaining columns show the number of gates affected by
the injected fault in each unit (for brevity, not all the units are shown in the table). For example, in
simulation 5, at time 887, eight gates in the RAM, one gate in the Q-Shift and four gates in the MUX were
affected due to the fault in the gate GAMULT6CPU32.

² There are four individual subsets within the self-test program, i.e., the cyclic RAM test, the CPU test, the ALU test, and the
memory address processor test.
To identify the fault propagation through the chip as well as to the output pins, information about the
time, the number of faults and the name of each affected unit during the propagation was traced from the
fault injection point to all other units in the chip and to the output pins. These measurements were used for
determining the percentage of the faults which propagated out of the unit and the percentage of the other
units affected, given that the fault did propagate out of the unit.
Because the study of fault distribution to the output pins is of great practical significance, obtaining
precise output data from the simulation was crucial for analyzing and evaluating a system at the output
pin level. The output pin data were collected in order to observe how each pin behaved in the event of a
fault condition in the system. In particular, simultaneous occurrences of errors and the probability of
near-coincidence of pin errors were investigated. Also, these data were used to characterize each pin
based on the mean time between errors and mean error durations.
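The two pin-level measures named above can be illustrated with a small sketch. The precise definitions used in the thesis are not given in this chapter, so the ones below are assumptions: an error burst is a run of consecutive erroneous time slices on a pin, the mean error duration (MED) is the average burst length, and the mean time between errors (MTBE) is the average spacing between the starts of successive bursts.

```python
from itertools import groupby
from statistics import mean
from typing import List, Optional, Sequence, Tuple


def error_bursts(error_flags: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of consecutive erroneous time slices."""
    bursts = []
    index = 0
    for is_error, run in groupby(error_flags):
        length = len(list(run))
        if is_error:
            bursts.append((index, length))
        index += length
    return bursts


def mean_error_duration(error_flags: Sequence[int]) -> Optional[float]:
    """Average burst length in time slices, or None if the pin never erred."""
    lengths = [length for _, length in error_bursts(error_flags)]
    return mean(lengths) if lengths else None


def mean_time_between_errors(error_flags: Sequence[int]) -> Optional[float]:
    """Average spacing between burst starts, or None with fewer than two bursts."""
    starts = [start for start, _ in error_bursts(error_flags)]
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return mean(gaps) if gaps else None


# Hypothetical per-time-slice error flags for one output pin (1 = pin value in error).
pin_flags = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1]
print(mean_error_duration(pin_flags), mean_time_between_errors(pin_flags))
```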
TABLE 2: Sample of Error File
Information concerning the instruction and microcode activity was collected concurrently with the
gate activity data. By knowing which microaddress had been accessed, the executed microinstruction was
uniquely identified. Finally, by examining the sequence of microinstructions, the macro (or assembly)
level instruction that was executed was determined.
CHAPTER 4 
COMPARISON OF PHYSICAL AND STUCK-AT FAULT INJECTIONS 
With the increasing complexity of VLSI circuits, there is a growing concern that fault simulation 
based on stuck-at faults is not adequate. In this section, the similarities and differences between the 
results of gate-level fault injections and device-level fault injections are discussed. The comparisons are 
based on the measured error activity resulting from gate-level and device-level fault injections into the 
same functional unit. 
Four different types of comparisons are made. The first comparison is based on the detectability of
the injected faults inside the injected units. The second is based on the percentages of faults that were
detected outside the injected unit (i.e., the measured fault propagation). The third comparison is based on
the extent of error activity outside the injected unit. Finally, the pin-level error activity resulting from
gate-level and device-level fault injections is compared.
Figure 1 shows the percentages of injected faults detected within the injected unit. Percentages for
both stuck-at and device-level fault injections are shown. The vertical axis indicates the location of the
fault, and the horizontal axis indicates the percentage of the injected faults detected within the unit. A
number of observations can be made from this figure. First, device faults have higher percentages of
being detected within the unit as compared to gate-level faults. This is reasonable because there are
fewer levels of signal transitions at the gate level than at the device level. For example, a given
device-level fault may propagate through 25 transistors before getting outside the unit, while a gate-level fault
may only have two or three logic levels to go through. In comparing the relative behavior of stuck-at and
device-level faults across the functional units in Figure 1, we see that they do not track each other. Thus,
the results show that there is little correspondence between the behaviors of device and stuck-at faults.
Figure 1: Percentages of Faults Detected That Remained in the Unit
(Bars for stuck-at and device faults for each unit: RAM, Q SHIFT, S CONT, ALU CONT, OUTPUT, OUT SEL,
RAM SHIFT, RAM CONT, DES CONT, ALU, MUX; horizontal axis from 0 to 30 percent.)
Figure 2 shows a comparison of the percentages of faults detected outside of the injected unit, for
stuck-at and device faults. Again, the vertical axis indicates the location of the fault, and the horizontal
axis is the percentage. The figure shows that although the stuck-at and the device-level fault behaviors
track each other, their percentages are quite different. On average, the stuck-at faults tend to
propagate outside the unit approximately twice as frequently as device faults. Thus, by
assuming a stuck-at fault model, although the relative impact of a fault on other units may reasonably reflect the
physical failures, the results are likely to be considerably pessimistic.
In the above-mentioned case, we were concerned with whether or not error activity due to a fault is
detected outside of the unit in which a fault is injected. The next comparison is based on the "extent" of
measured error activity outside the injected unit. For example, if a fault injected in the ALU unit is
detected in four other functional units, the impact of the fault may be quadrupled due to propagation. To
quantify this effect, a new measure, "the propagation factor," is defined. The propagation factor is defined
as the average number of external functional units affected due to the fault in a specified functional unit.
This factor can be calculated by dividing the sum of error activity outside the injected unit by the
measured error activity inside the injected unit. Table 3 shows the propagation factors for the stuck-at and
the device-level fault injections in each unit.
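The propagation factor defined above can be sketched in a few lines. This is only an illustration of the stated ratio: the function name, the dictionary layout, and the activity counts are hypothetical and are not data from the study.

```python
def propagation_factor(outside_activity, inside_activity):
    """Sum of error activity outside the injected unit divided by the
    measured error activity inside the injected unit."""
    if inside_activity <= 0:
        raise ValueError("inside_activity must be positive")
    return sum(outside_activity.values()) / inside_activity


# Hypothetical error-activity counts for faults injected into one unit.
outside = {"RAM": 30, "Q REG": 45, "OUTPUT": 75}
factor = propagation_factor(outside, inside_activity=25)
print(factor)
```

With these made-up counts the total outside activity is 150 against 25 inside, giving a factor of 6.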
Figure 2: Comparison of Fault Propagation Going Outside of the Unit
(Bar chart. Vertical axis: injected unit (RAM, Q SHIFT, S CONT, ALU CONT, OUTPUT, OUT SEL, RAM SHIFT,
RAM CONT, DES CONT, ALU, MUX). Horizontal axis: percentage, 0 to 100. Series: Stuck-At, Device.)
TABLE 3: The Propagation Factors
UNIT        STUCK-AT   DEVICE
DES CONT    7.7        7.5
ALU         6.0        6.1
MUX         6.4        6.8
In Table 3, the first column shows the units in which the fault injections were made. The next two
columns are the propagation factors for stuck-at faults and device faults, respectively. Here again there is
no clear correlation across all units between device and stuck-at faults. There is also a higher variability
(2.92 vs. 2.01) in fault propagation with stuck-at faults. An examination of the table shows that for most
of the units (except the microinstruction decode units, which are the Source Control, the ALU Control,
and the Destination Control) the stuck-at faults had a smaller propagation factor than device faults, i.e.,
functional units are more sensitive to the device faults than to the stuck-at faults. The only exceptions are
the faults in the microinstruction decode units, which have the opposite effect.
Finally, the impact of gate and device-level faults on the output pins was compared. A comparison
of the percentages of faults which affected the pins (similar to Fig. 2) and the propagation factors (similar
to Table 3) showed that the stuck-at and the device faults did track each other, as shown in Appendix A.
Comparisons were also performed based on the Mean Time Between Errors (MTBE) and Mean Error
Durations (MED) at the pins. The MTBE is obtained by computing the average time interval between
two consecutive errors on the pin. The MED indicates the holding time of the error at the specified pins;
it is calculated by averaging the time between the instant of error occurrence and disappearance. By
examining these values, the impact of an internal fault on the external environment can be estimated.
The MTBE and the MED on the pins for both simulations are shown in Table 4.
TABLE 4: MTBE and MED for Stuck-at and Device Faults
              STUCK-AT            DEVICE
PIN NUM.    MTBE     MED       MTBE     MED
PIN1        345.1    45.9      303.8    47.8
PIN5        215.0    31.9      185.6    41.6
PIN6        465.6    16.9      369.4    17.5
PIN7        184.7    67.1      147.6    67.7
PIN8        166.8    62.5      141.1    66.9
PIN9        169.8    59.8      144.4    61.7
PIN10       176.6    60.2      148.1    60.6
The first column shows the pin numbers. The next two columns are the MTBE and the MED for
the gate-level simulation and the device-level simulation. The MTBEs for the stuck-at faults are longer,
and the MEDs for the stuck-at faults are shorter than those for the device faults. Typically, the MTBEs
of Pin 4 and Pin 6 for the stuck-at faults were 100 time steps longer than those for the device faults. The
shorter the MTBE, the more likely it was that the error would propagate, although it was also easier to
detect the error outside the chip. The shorter the MED, the less likely it was that the error would
propagate outside the chip. Note that the MTBE for the stuck-at faults was larger (72% - 88%) and the
MED was shorter (1% - 30%) than the corresponding values for device faults. Thus, stuck-at faults were
less likely to propagate outside the chip. Since device faults had the longer duration, they were more
likely to exert an impact external to the chip. Thus, assuming a stuck-at model for failures may
underestimate the fault propagation characteristics external to the chip.
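As an illustration of how the two pin-level measures defined earlier can be computed, the following sketch assumes each pin error is recorded as a (start, end) pair in time steps, and that the "time interval between two consecutive errors" is measured from one error onset to the next. Both the record format and that reading of the definition are assumptions; the function names and sample data are hypothetical.

```python
from statistics import mean


def mtbe(errors):
    """Mean Time Between Errors: average onset-to-onset interval.
    Returns None when fewer than two errors exist (no interval to measure)."""
    starts = sorted(start for start, _ in errors)
    if len(starts) < 2:
        return None
    return mean(later - earlier for earlier, later in zip(starts, starts[1:]))


def med(errors):
    """Mean Error Duration: average time between occurrence and disappearance."""
    if not errors:
        return None
    return mean(end - start for start, end in errors)


# Hypothetical error record for one pin: (time of occurrence, time of disappearance).
pin_errors = [(10, 30), (200, 260), (410, 450)]
print(mtbe(pin_errors), med(pin_errors))
```

For the sample record, the onset gaps are 190 and 210 time steps and the durations are 20, 60 and 40 time steps, so the sketch yields an MTBE of 200 and an MED of 40.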
In summary, results of the measurements show that 5.1% of the device faults are detected within
the injected unit, 20.9% of faults are detected outside the unit (which includes 12.7% at the output
pins), and 74% remain undetected. In the stuck-at case, 1.5% of faults are detected within the unit,
41.8% of faults are seen outside the unit (26.9% at the output pins), and 56.7% are not detected.
Results show that there is little correspondence between stuck-at and device-level fault models as far as
error activity or detection within a functional unit is concerned. Insofar as error activity outside the
injected unit and at the output pins is concerned, the stuck-at and device models closely track each other,
although the stuck-at model overestimates fault propagation in the chip by approximately one hundred
percent. At the pin level, although the percentages of errors for stuck-at and device faults do track each
other, an evaluation of the Mean Error Durations and the Mean Time Between Errors shows that the
stuck-at model will significantly underestimate (by 62%) the impact of an internal chip fault on the
external environment.
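The "approximately one hundred percent" overestimate follows directly from the percentages quoted in this summary. A short check, using only those figures (the variable and function names are illustrative):

```python
# Percentages of injected faults, as quoted in the summary above.
device = {"within": 5.1, "outside": 20.9, "pins": 12.7, "undetected": 74.0}
stuck_at = {"within": 1.5, "outside": 41.8, "pins": 26.9, "undetected": 56.7}


def overestimate_percent(stuck, dev):
    """Relative amount by which the stuck-at figure exceeds the device figure."""
    return (stuck / dev - 1.0) * 100.0


outside_over = overestimate_percent(stuck_at["outside"], device["outside"])
pins_over = overestimate_percent(stuck_at["pins"], device["pins"])
print(f"outside the unit: {outside_over:.1f}%")
print(f"at the output pins: {pins_over:.1f}%")
```

The stuck-at figure for propagation outside the unit is twice the device figure (41.8 against 20.9), and the pin-level figure is slightly more than twice (26.9 against 12.7).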
The comparisons between the gate-level simulation and the device-level simulation based on the
instruction/microinstruction executed are shown in Appendix B. Since it is clear that the device-level
injection is more accurate and realistic, we consider only the device-level fault in the remainder of this
study.
CHAPTER 5 
EFFECT OF FAULT PLACEMENT 
This section discusses the effect of fault location on propagation through the chip. Recall that
Figure 2 shows the percentages of the faults which propagate outside the injected unit. For example, among
all the faults injected in the ALU unit, only 25.8% of those faults were detected outside the ALU. The
functional unit with the highest percentage of external propagation was the RAM Control (44.8%), and
the units with the lowest level of propagation were the output units, which include the Output and the
Output Select. The results for the RAM Control are not surprising because this unit has a large fan-in
and fan-out (it controls many input and output paths around the RAM unit), and is the most complex unit
in the system. Therefore, more faults in this unit tend to propagate to other functional units. The results
also show that faults in the Output Select and the Output units are least likely to propagate to other
functional units. This result is intuitive since there is little feedback from these units to the other units.
However, faults in these units will almost certainly affect the output pins, as will be shown in Chapter 6.
A relatively high percentage of faults in the microinstruction decode units (Source Control, ALU Control
and Destination Control) tend to travel out of that unit. These results also seem reasonable because the
microinstruction decode units are extremely important in the correct operation of the processor. Further,
the faults in the ALU behave somewhat similarly to those in the MUX because the MUX outputs feed
directly into the ALU. A very low percentage of faults in the RAM Shift and the Q Shift units propagate
outside. One explanation is that the RAM Shift and Q Shift units are used primarily for the multiplication
and division instructions. These instructions are not highly used in the self-test program employed in
this study.
Given that a fault propagates out of the injected units, the probability that the injected fault affects
the other units is shown in Table 5. The first column shows the units in which the faults were injected.
The remaining columns show the other units in which the faults may be detected. Each entry shows the
probability with which the injected fault affects each unit.
By examining Figure 2 and Table 5 together, we can get a clear picture of fault propagation in the
chip. For example, in the ALU 25.8% of injected faults propagated outside of the ALU (refer to Fig. 2),
and, as shown in Table 5, 32.5% of these faults affected the RAM unit, 90% affected the Q-Register,
and so on. As expected, the ALU is strongly affected by faults in other units. The converse, however, is
not true, e.g., faults in the ALU do not affect the microinstruction decode units (Source Control, ALU
Control, and Destination Control). They do, however, propagate to the output. Although the Output
Select is not very likely to impact the other units (Fig. 2), when it does, several other units are uniformly
affected. As explained earlier, this is most likely due to the fact that the Output Select unit has several
data paths which feed back to many other functional units. Because the ALU and MUX are closely
located, the faults from these units act similarly. Faults in the microinstruction decode units (Source
Control, ALU Control and Destination Control) have a high probability of fault propagation, and given
that propagation occurs the other units are uniformly affected. The table also shows the Output unit is
severely affected by faults originating in most of the functional units. Thus, given that a fault propagates
outside the injected unit, it is very likely to affect the output pins.
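Reading Figure 2 and Table 5 together amounts to multiplying a propagation probability by a conditional probability. The sketch below works through the ALU example using only the numbers quoted in the text; the variable names are illustrative, and treating the two percentages as a simple product is an assumption about how they combine.

```python
# Probability that a fault injected in the ALU propagates outside it (Fig. 2).
p_propagate_alu = 0.258

# Probability that a propagated ALU fault affects a given unit (Table 5),
# limited to the two entries quoted in the text.
p_affects_given_propagated = {"RAM": 0.325, "Q-Register": 0.90}

# Unconditional probability that an injected ALU fault reaches each unit.
p_reaches = {
    unit: p_propagate_alu * p for unit, p in p_affects_given_propagated.items()
}
for unit, p in p_reaches.items():
    print(f"{unit}: {p:.3f}")
```

Under that assumption, roughly 8% of all faults injected in the ALU reach the RAM unit and roughly 23% reach the Q-Register.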
In summary, the results demonstrate that many faults, about 25% of the injected faults in the ALU 
and the MUX, tend to propagate and also that these units are strongly affected by faults in other units. 
(Around 60% of faults in other units affected these units.) Faults from the microinstruction decode units 
have uniform impact on the system overall. Given that faults propagate outside the injected unit, the 
faults severely affect the Output unit, i.e., the output pins are most likely to be affected by the faults. 
5.1. Comparison of Fault Distributions of Different Material Failures
Figure 3 shows the frequency distributions of faults occurring based on the different materials
(oxide or metal). The vertical axis represents the frequencies of faults detected at a specific time over the
entire simulation. The horizontal axis has the time steps in one clock cycle (with one clock cycle
consisting of 70 time steps). In the plot the solid line depicts the overall fault distribution, the dashed line
is the distribution of the oxide fault, and the dotted line represents the distribution of the metal fault. The
plot is generated by overlaying the 50 clock cycles in the sample and plotting the frequency of upset in the
system for each of the 70 time steps. Mostly, there was little activity beyond 30 and below 5 time steps
per clock cycle.
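The overlay described above can be sketched as folding every detection time onto its position within the clock cycle. The sketch assumes detection times are integer time steps counted from zero; the function name and the sample times are hypothetical.

```python
STEPS_PER_CYCLE = 70  # one clock cycle consists of 70 time steps


def fold_into_cycle(detection_times, steps_per_cycle=STEPS_PER_CYCLE):
    """Overlay all clock cycles: count detections at each step within a cycle."""
    histogram = [0] * steps_per_cycle
    for t in detection_times:
        histogram[t % steps_per_cycle] += 1
    return histogram


# Hypothetical detection times spread over several clock cycles.
times = [12, 82, 152, 20, 90, 3499]
histogram = fold_into_cycle(times)
print(histogram[12], histogram[20], histogram[69])
```

Times 12, 82 and 152 all fall on step 12 of their respective cycles, so they accumulate in the same bin, which is what produces a per-cycle activity profile from a 50-cycle run.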
Because device-level fault injection was performed in this study, two materials, the oxide
and the metal, were involved in the fault injection. In the previous chapter, 68% of the faults were
injected into the oxide, and the remaining 32% of the faults were injected into the metal. Numerically,
more than twice as many fault injections were performed in the oxide material. In Figure 3, the fault
distribution of the oxide is closer to the overall distribution than that of the metal. Surprisingly, the
frequency distribution of the oxide was not twice that of the metal, as we would have expected. From this
result it appears that faults in the metal can more actively affect the system than faults in the oxide
material.
Figure 3: Activity Comparisons Between Materials
(Line plot. Vertical axis: frequency of faults detected, 0 to 100. Horizontal axis: time step in clock
cycle. Series: Overall (solid), Oxide (dashed), Metal (dotted).)
Table 6 shows the impact of the metal and oxide faults as a function of the unit into which the
fault was injected.
TABLE 6: The Effect of Fault in Oxide and Metal
SEC          Fault in Metal   Fault in Oxide
RAM          46.4 %           58.4 %
Q SHIFT      66.7 %           59.6 %
S CONT       24.6 %           25.5 %
ALU CONT     18.8 %           19.7 %
OUTPUT       63.8 %           63.5 %
OUT SEL      63.8 %           59.8 %
RAM SHIFT    63.8 %           59.8 %
RAM CONT     63.8 %           59.8 %
DES CONT     13.9 %
ALU          65.2 %           63.5 %
MUX          40.6 %           59.1 %
The first column shows the unit into which fault injection occurred. The percentages of the
injected faults which resulted in some error activity are shown in the next columns. These two columns
of numbers show a distinct similarity. Even though fewer faults were injected into the metal, the
percentages of faults that affected the unit are about the same; consequently, we can conclude that the
faults in the metal actively affected the units.
CHAPTER 6
CHARACTERIZATION OF ERRORS ON OUTPUT PINS
Previous sections of this paper have defined and measured fault propagation throughout the AMD
2901 bit slice processor. In this section, characteristics of the error activity at the output pins are studied.
Our previous results show that the output unit is very sensitive to the location of the fault within the chip.
The output pins are also the place at which a fault can affect the external environment. Therefore,
appropriate measurements and accurate evaluations of the output pins are absolutely necessary for
evaluating the performance of a system. The AMD 2901 has 40 I/O pins around the body of the chip.
Of these pins, ten are used for the output lines of the chip, and the others are used for inputs, power lines,
tri-states, etc. Because individual pin data are obtained from the simulation and analyzed for the
characteristics of each pin, the functions of each pin are worth considering carefully. Instead of using real
pin numbers, ten numbers from one to ten are used for convenience. Table 7 shows the names of the ten
pins and short descriptions of them.
TABLE 7: Pins of AMD 2901
PIN NUM   PIN NAME      DESCRIPTION
PIN1      GCN4CPUIC32   carry out
PIN2      GF3BCPUIC32   Bit7 ALU out
The first column contains the pin numbers used in this study, and the second column shows the pin
names which are used for the simulation. Pin 1, Pin 4 and Pin 6 are all used for the carryouts of the
arithmetic functional result from the ALU. The differences are that Pin 1 is a carryout line of the usual
full adder, and Pin 4 and Pin 6 are the carry propagate and generate outputs of the internal ALU (used in
the carry lookahead). Pin 2 is the most significant ALU output bit. Pin 3 indicates whether the result of
an ALU operation is zero or not. The overflow signal is shown at Pin 5. Pins 7, 8, 9 and 10 are the four
outputs of the ALU or the data of the register stack, as determined by the destination decoder [18].
6.1. Probability of Pin Errors 
As shown in Chapter 4, on the average, 12.7% of the device-level faults are detected at the output
pins (while 5.1% are detected within the unit and 20.9% are outside the unit). Table 8 shows the impact
of faults in the specified functional units on the output pins. The first column shows the units in which
the faults were injected. The remaining columns identify the specific output pins in which the faults may
be detected. Each entry shows the probability with which an injected fault affects the specified pin.
TABLE 8: Probability of Pin Errors
Section   PIN1    PIN2    PIN3    PIN4    PIN5    PIN6    PIN7    PIN8    PIN9    PIN10
RAM       0.650   0.875   0.875   0.700   0.700   0.650   0.875   0.875   0.600   0.600
          0       0       0       0       0       0       0       0       0
ALU       0.700   0.800   0.700   0.750   0.700   0.850   0.750   0.750   0.650   0.600
MUX       0.556   0.611   0.778   0.870   0.556   0.444   0.611   0.611   0.722   0.556
Recall that in the previous chapter the faults in most units have a high probability of affecting the
Output unit (which includes the output pins). Table 8 not only confirms this observation but also shows
that the impact is rather uniform across all output pins. Different units, however, affect the pins to
varying degrees. This is due to the fact that the characteristics of functional units differ due to the combined
effect resulting from their different functional operations, locations and structural complexities. For
example, while 100% of the faults in the ALU Control affect all pins, faults in the Source and Destination
Control units do not affect the output pins at all. As expected, most of the faults in the Output Select
readily affect all pins. The results in this table also show that each unit has a distinct probability of
affecting the output pins. This result is significant for integrated system testing because it suggests that
by injecting pin errors with the measured distinct probabilities, we can emulate within-chip faults.
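One way to picture the suggested emulation is to draw, for each emulated fault, a set of erroneous pins using the per-pin probabilities of the injected unit. The sketch below uses the ALU row of Table 8 and assumes the pins are affected independently of one another; Table 8 gives only per-pin probabilities, so independence, the function name, and the trial count are all assumptions made for illustration.

```python
import random

# ALU row of Table 8: probability that an ALU fault affects PIN1..PIN10.
ALU_PIN_PROB = [0.700, 0.800, 0.700, 0.750, 0.700,
                0.850, 0.750, 0.750, 0.650, 0.600]


def sample_pin_errors(pin_probs, rng):
    """Return the pin numbers (1-based) hit by one emulated within-chip fault."""
    return [pin for pin, p in enumerate(pin_probs, start=1) if rng.random() < p]


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10_000
counts = [0] * len(ALU_PIN_PROB)
for _ in range(trials):
    for pin in sample_pin_errors(ALU_PIN_PROB, rng):
        counts[pin - 1] += 1

observed = [c / trials for c in counts]
print([round(x, 2) for x in observed])
```

Over many trials the observed per-pin error rates approach the table entries, which is the property an integrated-system test harness would rely on.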
6.2. Mean Time Between Errors (MTBE) and Mean Error Durations (MED)
Recall that in Chapter 4 we calculated the MTBE and the MED of pin errors. Clearly, the longer
the MTBE, the less the impact of the error on the external environment. It also means, however, that it is
harder to detect the errors. Similarly, the shorter the error duration, the less the impact external to
the chip. Tables 9 and 10 show the MTBE and the MED for the different functional units into which
faults are injected.
TABLE 9: Mean Time Between Errors
UNIT      PIN1    PIN2    PIN3    PIN4    PIN5    PIN6    PIN7    PIN8    PIN9    PIN10
ALU       356.2   166.9   177.0   345.3   203.8   367.2   122.9   141.5   128.5   120.6
MUX       347.4   180.6   201.4   329.9   206.4   505.8   164.2   158.8   144.1   130.5
in chip   303.8   189.1   189.6   296.7   185.6   369.4   147.6   141.1   144.4   148.1
In Table 9, the first column shows the units in which the fault injections were initiated. The figures
in the next ten columns are the MTBE on each output pin. The dots in the table indicate that computing
values on that unit is not possible due to insufficient data. For instance, if there is only one error
detected during the simulation, it is impossible to obtain the time interval to the next error.
Referring to the table, faults in the ALU and the MUX resulted in very similar distributions of the
MTBE across all the output pins, as discussed in the previous chapter. The shortest MTBEs are shown
when faults are in output units, i.e., there is high external detectability of these faults. Faults in the
Q-Shift and the RAM Shift units have long MTBE to the pins. This is due to low utilization of these units
resulting in fewer faults propagating to the pins. Since Pins 7, 8, 9 and 10 are fed out from the same unit
and operated by the same functions, the values of the MTBE of these pins are almost identical. Pins for
carryout (Pin 1, Pin 4 and Pin 6) had relatively long MTBEs. This is due to the fact that the carryout
operations are used only for the event of generating carries during or after arithmetic operation in the
ALU. These pins, therefore, are not utilized as often as the other pins.
Table 10 shows the Mean Error Durations. The units of fault injection are shown in the first
column. The rest of the columns show the MED for the individual pins. The dots in the entries indicate
that the data are not sufficient to perform proper computation for the corresponding unit.
TABLE 10: Mean Error Durations
OUTPUT      58.1   88.7   78.3   91.6   71.8   23.2   89.0   81.6   83.3   89.1
OUT SEL     56.2   85.7   74.2   88.5   70.0   21.7   87.9   79.8   81.5   88.7
RAM SHIFT   .      68.3   33.5   .      .      68.3   68.3
RAM CONT    46.5   23.7   46.5   46.6   30.4   16.7   58.4   40.3   49.2   45.4
By examining values of the MTBE and the MED together, the impact on the output pins is more
clearly shown than by either approach separately. From Table 10, faults in the Q-Shift unit have the
shortest duration. Recall that they also have a long MTBE. Thus, these faults are likely to be hard to
detect external to the chip. As expected, the faults in the output units have a long duration. Thus, both
from the error frequency and error duration perspectives these faults are easily detectable. Faults in the
Q-Shift and the RAM Shift have behaved very similarly during their propagation in the system due to their
functional similarity. The MEDs of these units, however, exhibited large differences. One possible
explanation for this phenomenon is that the time duration of holding an error on the pin is determined not
only by the functional operation of the unit, but also by multiple effects of the data propagated from
several different signal paths. Because their effects by functional operation or the geographical location
are almost identical, Pins 7, 8, 9, and 10 have similar MEDs. These pins tend to have longer
errors than any of the other pins. The shortest MED is shown on Pin 6, showing that this pin clears
errors very quickly.
Table 9 and Table 10 also show that Pin 6 (carry generate) is least affected by the faults regardless
of the injected unit (the longest MTBE and the shortest MED). The impact of faults on the data pins
(Pins 7, 8, 9 and 10) is considerable, as shown by the short MTBE and long MED. Faults in the output
units severely affect the output pins. Thus, data faults are expected to be easily detected outside while
carry faults appear to be more insidious.
6.3. Near-Coincident Errors
It is well known that fault tolerant systems are highly vulnerable to near-coincident faults [19,20].
In this section we investigate the likelihood of near-coincident errors at the output pins resulting from an
injected device fault. Generally, an injected fault may sensitize many other data paths simultaneously
during the propagation. Additionally, the output pins may have multiple errors at the same time or in a
short span of time due to a propagated fault. Near-coincident errors are defined as the errors which are
observed within a short time span. In order to properly measure the number of near-coincident errors, an
appropriate size of time window is chosen; then, the window is moved over the total simulation time.
Each time, the number of errors discovered within each time window is observed and recorded as
near-coincident errors. In this study, the number and probability of near-coincident errors was averaged over
the entire simulation measurement according to the given window size.
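The windowing procedure above can be sketched as follows. The sketch assumes error observations are available as a list of time steps, that a window covers the half-open range [start, start + window), and that the window advances by a fixed stride; the stride, the function names, and the sample data are hypothetical choices, not details given in the study.

```python
from bisect import bisect_left


def window_counts(error_times, window, total_time, stride=1):
    """Number of errors seen in each position of a window moved over the run."""
    if window > total_time:
        raise ValueError("window must not exceed total_time")
    times = sorted(error_times)
    counts = []
    for start in range(0, total_time - window + 1, stride):
        first = bisect_left(times, start)           # first error at or after start
        last = bisect_left(times, start + window)   # first error at or after the end
        counts.append(last - first)
    return counts


def mean_near_coincident(error_times, window, total_time, stride=1):
    """Average number of errors per window position."""
    counts = window_counts(error_times, window, total_time, stride)
    return sum(counts) / len(counts)


# Hypothetical pin-error times over a 50-step run, with non-overlapping windows.
errors = [5, 8, 40]
print(window_counts(errors, window=10, total_time=50, stride=10))
print(mean_near_coincident(errors, window=10, total_time=50, stride=10))
```

In the sample run the first window holds two errors, the last holds one, and the rest are empty, giving a mean of 0.6 errors per window. Repeating the calculation for a range of window sizes gives a curve of mean count against window size.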
Figure 4 shows the effect of varying the time-window size on the mean number of near-coincident
errors. As expected, the mean number of near-coincident errors increases steadily as a function of the
window size. The rate of increase in the mean number of near-coincident errors is seen to be lower both
for large (more than 2700 time steps) and medium (500-1300 time steps) window sizes. The reason for
this finding is that for the smaller window size, fewer errors are detected until the window size is large
enough to hold these errors actively. For the larger window sizes, since a large number of faults have
already been observed, the results are not greatly changed by further increasing the size of the window.
Figure 4: Mean Number of Coincident Errors
(Line plot. Vertical axis: mean number of coincident errors. Horizontal axis: window size in time steps,
0 to 3500.)
Figure 5 shows the probability of near-coincident errors. The probability is obtained from the ratio
of the total number of errors occurring in that time window to the total number of errors injected. The
probability increased rapidly up to a window size of 1700 time steps as a function of the window size.
The rate of increase, however, is much less for a window size greater than 1700 time steps.
From a practical viewpoint, however, it can be seen that given a fault, there is a relatively high
likelihood of encountering two output pin errors within less than seven clock cycles (500 time steps). This
is explicitly shown in Fig. 5, where the probability of near-coincident errors is plotted as a function of
window size. The figure also shows that, given a fault, there is approximately a 15 percent chance of a
multiple error within 7 clock cycles.
Figure 5: Probability of Near-Coincident Errors
(Line plot. Vertical axis: probability of near-coincident errors. Horizontal axis: window size.)
CHAPTER 7 
INSTRUCTION/MICROINSTRUCTION ANALYSIS 
Fault propagation in the chip is highly dependent upon the assembly-level instruction and
microinstruction under execution. That is, fault propagation is influenced not only by the amount of
nonfaulted gate activity but also by the interaction between gate activity and type of instruction. In a
previous work [11], error propagation influenced by similar and dissimilar instructions was studied. The
work also examined the influence of the type of microinstruction executed on error propagation. In this
study, analysis for the device-level simulation was performed based on the instruction/microinstruction
executed.
Figure 6 shows the total gate activity as a function of the time (in clock cycles) for a device-level
fault. The vertical axis is the sum of the gate activities over all the injected faults. The horizontal axis
indicates the time in clock cycles. The instruction executed is also labeled across the horizontal axis. In
the graph, the peak gate activity occurs during an instruction prefetch or some other memory access
because of the concurrent activity in the processor. The Store (sto), the Load (ldm) and the Subtract
instruction (pre-subr) show similarities in their gate activities. Low gate activities are seen when the jump
instruction (ju-ind) is executed. The instruction for store multiple registers from memory (stm) shows the
highest gate activity because of the high frequency of register transfers.
Figure 6: Gate Activity for Device Level Fault
(Line plot. Vertical axis: gate activity, 0 to 100000. Horizontal axis: clock cycle (1 clock cycle = 70
time steps), labeled with the instruction executed: sto, pre-subr, ldm, pre-subr.)
Table 11 shows the error probabilities by different instruction types. These results are the averages
over the entire fault set. The highest probability of detection is shown in the jump instruction and the
lowest is in the store instruction. Referring to Fig. 6, it appears that there is little relationship between
the amount of gate activity and the probability of error occurrence. For example, while the amount of gate
activity is the highest during the stm instruction execution and the lowest during the jump instruction, the
stm instruction has the lowest error probability and the jump instruction has the highest. The reason for the
differences in the measured error probabilities between the different instruction types can be explained by
investigating the relationship between the error activity and the microinstructions.
TABLE 11: Probability of Fault Detection for Instruction Executed
INSTRUCTION   ERROR PROB.
ju-ind        0.26
sto           0.14
pre-subr      0.21
ldm           0.16
stm           0.19
Toward this end, these microinstructions were classified according to the type of activity contained
in each microinstruction. The classifications include the register transfer, the memory access, logic
computation, arithmetic computation and conditional/unconditional branch. Due to parallelism, one
microinstruction may involve more than one classified function. The fault activity determined at the
microinstruction level can be used to explain the fault propagation at the instruction level, because the
microinstruction is the building block of the assembly instruction. Table 12 shows the probabilities of fault
detection in the device-level simulation according to the microinstruction executed. The function of each
bit of microcode is indicated as follows:
if bit4 = 1, then a register transfer,
if bit3 = 1, then a memory access,
if bit2 = 1, then a logical computation,
if bit1 = 1, then an arithmetic computation,
if bit0 = 1, then a conditional branch.
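The bit assignments listed above can be read as a five-bit mask. A minimal sketch of decoding such a classification code follows; the function name and the use of binary strings as input are illustrative assumptions, and only the bit meanings come from the list above.

```python
# Meaning of each classification bit, as listed above (bit4 is the leftmost digit).
CLASS_BITS = {
    4: "register transfer",
    3: "memory access",
    2: "logical computation",
    1: "arithmetic computation",
    0: "conditional branch",
}


def decode_microinstruction_class(code):
    """Return the activity classes named by a 5-bit code such as '11001'."""
    if len(code) != 5 or set(code) - {"0", "1"}:
        raise ValueError("code must be a 5-character binary string")
    value = int(code, 2)
    return [name for bit, name in sorted(CLASS_BITS.items(), reverse=True)
            if (value >> bit) & 1]


print(decode_microinstruction_class("11001"))
print(decode_microinstruction_class("10001"))
```

A code of 11001 therefore denotes a microinstruction that combines a register transfer, a memory access and a conditional branch, which reflects the remark that one microinstruction may involve more than one classified function.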
TABLE 12: Probability of Fault Detection for Microinstruction
MICROINSTRUCTION   10001   11000   11001
PROB. OF DET.      0.04   0.18   0.21   0.18   0.23   0.17   0.20
As shown in the table, the probability of detecting a fault is generally increased when a conditional
branch operation (bit0 = 1) is involved. As expected, the microinstruction for the branch operation that is
used for the jump instruction has a high probability of detection, while the microinstruction for the store
and load instructions, which include the register transfer operations, has a low probability of detection.
CHAPTER 8 
CONCLUSIONS 
This thesis has described a systematic experimental study of fault propagation in the Bendix
BDX-930, a digital avionic miniprocessor. Error activity was investigated by comparing the gold (unfaulted)
simulation run with each faulted simulation run. The simulations were performed only on the bit-slice
processor, AMD 2901. In the simulations, fault propagation data were collected at device and gate levels,
as well as at the output pins. The results provided by these data allowed us not only to analyze the
dependency of error propagation on the location of the fault and on the type of instruction and
microinstructions executed, but also to compare the accuracy of the stuck-at fault model with the more
realistic physical failure model for permanent faults.
Results show that assuming a stuck-at model can overestimate the probability of fault propagation
to the output pins by over one hundred percent. The Mean Time Between Errors for the stuck-at faults
was longer, and the Mean Error Durations shorter, than those for the device faults. Thus, assuming a
stuck-at model for physical failures may overestimate the fault propagation characteristics within the chip
and underestimate the impact external to the chip.
Measurement of error activity at the output pins showed that faults in different functional units
affect the output pins to varying degrees and that each unit has a distinct probability of affecting the
output pins. This result suggests that by injecting pin errors with the measured distinct probabilities we can
easily emulate within-chip faults for integrated system testing.
The Mean Time Between Errors and the Mean Error Duration at the output pins were also
evaluated. Among the ten output pins, the carry generate pin had the longest MTBE and the shortest
MED. The data pins (Pins 7, 8, 9 and 10) had relatively short time between errors and relatively long
error durations.
Thus, the current work has shown that a wide variety of fault propagation behavior can result from 
device failures. Further research is in progress to use the results of such analyses in identifying the 
"weak" links in a system, from a fault tolerance viewpoint, in the design stage itself, so as to make design 
improvements in a cost-effective manner. 
APPENDIX A 
COMPARISONS BETWEEN STUCK-AT AND DEVICE 
A.l. Comparison of Percentage of Fault Detected 
[Figure: horizontal bar chart with one pair of bars (Stuck-at, Device) per functional unit: RAM, Q SHIFT,
S CONT, ALU CONT, OUTPUT, OUT SEL, RAM SHIFT, RAM CONT, DES CONT, ALU, MUX. Horizontal axis: 0 to 100.]
Figure A.1: Comparison of Percentages of Faults Detected
A.2. Comparison of Percentage of Faults Detected at Output Pins 
[Figure: horizontal bar chart with one pair of bars (Stuck-at, Device) per functional unit: RAM, Q SHIFT,
ALU CONT, OUTPUT, OUT SEL, RAM SHIFT, RAM CONT, DES CONT, ALU, MUX. Horizontal axis: 0 to 80.]
Figure A.2: Comparison of Percentages of Faults Detected at Output Pins
A.3. Comparison of Propagation Factor at Output Pins
TABLE A.1: The Propagation Factors to Output Pins
Section      [two column headings illegible]
S CONT       1.8     2.7
ALU CONT     9.3     8.6
OUTPUT       9.1    10.3
OUT SEL      9.9    10.5
RAM SHIFT    4.1     8.2
RAM CONT     2.2     4.9
DES CONT     1.7     2.1
ALU          8.0     9.1
MUX          7.2     8.3
APPENDIX B 
INSTRUCTION/MICROINSTRUCTION COMPARISONS 
B.1. Comparison of Gate Activity
[Figure: gate activity with fault (vertical axis, up to 100000) plotted against clock cycle (horizontal
axis, 0 to 50) for device level and gate level faults. Instruction labels shown on the plot: power-on,
ju-ind, sto, prtsubr, ldm, prtsubr, stm.]
Figure B.1: Comparison between Device and Gate Level Faults
PROB. OF DET. 
in GATE LEVEL 
0.33 
0.19 
0.25 
0.22 
0.24 
B.2. Comparison of Error Probability based on Instruction/Microinstruction
PROB. OF DET. 
in DEVICE LEVEL 
0.26 
0.14 
0.21 
0.16 
0.19 
TABLE B.1: Comparison of Error Probability for Instructions 
power-on
ju-ind
sto
pro-subr
ldm
p-subr
stm
INSTRUCTION
ju-ind
pmsubr
stm
0.15 
0.09 
0.23 
0.29 
0.25 
0.32 
0.22 
TABLE B.2 Comparison of Error Probability for Microinstructions 
INSTRUCTION | PROB. OF DET. in GATE LEVEL | PROB. OF DET. in DEVICE LEVEL
0.18 
0.17 
