Software implemented fault tolerance for microprocessor controllers: fault tolerance for microprocessor controllers by Wingate, Guy A.S.
Durham E-Theses




Wingate, Guy A.S. (1992) Software implemented fault tolerance for microprocessor controllers: fault
tolerance for microprocessor controllers, Durham theses, Durham University. Available at Durham
E-Theses Online: http://etheses.dur.ac.uk/5811/
Use policy
The full-text may be used and/or reproduced, and given to third parties in any format or medium, without prior permission or
charge, for personal research or study, educational, or not-for-profit purposes provided that:
• a full bibliographic reference is made to the original source
• a link is made to the metadata record in Durham E-Theses
• the full-text is not changed in any way
The full-text must not be sold in any format or medium without the formal permission of the copyright holders.
Please consult the full Durham E-Theses policy for further details.
Academic Support Office, Durham University, University Office, Old Elvet, Durham DH1 3HP
e-mail: e-theses.admin@dur.ac.uk Tel: +44 0191 334 6107
http://etheses.dur.ac.uk
The copyright of this thesis rests with the author. 
No quotation from it should be published without 
his prior written consent and information derived 
from it should be acknowledged. 
SOFTWARE IMPLEMENTED 
FAULT TOLERANCE 
FOR MICROPROCESSOR CONTROLLERS 
by 
Guy A.S. Wingate, B.Sc.(Hons), M.Sc. 
A Thesis submitted in fulfilment 
of the requirement for the degree of 
Doctor of Philosophy 
Engineering Science 
The University of Durham 
1992 
DECLARATION
None of the work contained within this thesis has previously been submitted for 
a degree at this or any other university. The work contained in this thesis is not part 
of a joint research project. 
Copyright ©1992 Guy A.S. Wingate 
The copyright of this thesis rests with the author. No quotation from it should be 
published without Guy A.S. Wingate's prior written consent, and information derived 
from it should be acknowledged. 
SOFTWARE IMPLEMENTED FAULT TOLERANCE
FOR MICROPROCESSOR CONTROLLERS
Guy A.S. Wingate, B.Sc.(Hons), M.Sc. 
ABSTRACT
It is generally accepted that transient faults are a major cause of failure in micro-
processor systems. Industrial controllers with embedded microprocessors are partic-
ularly at risk from this type of failure because their working environments are prone 
to transient disturbances which can generate transient faults. 
In order to improve the reliability of processor systems for industrial applica-
tions within a limited budget, fault tolerant techniques for uniprocessors are imple-
mented. These techniques aim to identify characteristics of processor operation which 
are attributed to erroneous behaviour. Once detection is achieved, a programme of 
restoration activity can be initiated. 
This thesis initially develops a previous model of erroneous microprocessor be-
haviour from which characteristics particular to mal-operation are identified. A new 
technique is proposed, based on software implemented fault tolerance which, by rec-
ognizing a particular behavioural characteristic, facilitates the self-detection of er-
roneous execution. The technique involves inserting detection mechanisms into the 
target software. This can be quite a complex process and so a prototype software 
tool called Post-programming Automated Recovery UTility (PARUT) is developed 
to automate the technique's application. The utility can be used to apply the pro-
posed behavioural fault tolerant technique for a selection of target processors. Fault 
injection and emulation experiments assess the effectiveness of the proposed fault 
tolerant technique for three application programs implemented on 8-, 16-, and 32-bit
processors respectively. The modified application programs are shown to have an
improved detection capability and hence reliability when the proposed fault tolerant 
technique is applied. General assessment of the technique cannot be made, however, 
because its effectiveness is application specific. 
The thesis concludes by considering methods of generating non-hazardous appli-
cation programs at the compilation stage, and design features for incorporation into 
the architecture of a microprocessor which inherently reduce the hazard, and increase 
the detection capability of the target software. Particular suggestions are made to 
add a 'PARUT' phase to the translation process, and to orientate microprocessor 
design towards the instruction opcode map. 
ACKNOWLEDGEMENTS
Firstly, I would like to thank my supervisor Dr. Clive Preece for his tremendous 
support and encouragement throughout the duration of the research project. I am 
also grateful to British Gas plc for funding the research; in particular I would like to
thank Dr. Ken Jenkins and colleagues at the Engineering Research Station, Killing-
worth. In addition, I owe thanks to: Prof. Ed Czeck (Computer Science, Carnegie-
Mellon University, USA) for his comments on the work contained within Chapters 5
and 7; Mr. Alan Timothy (Microprocessor Centre, University of Durham) for comments
on the design and implementation of an Advanced Micro Devices Am29000
processor system C compiler and its implications on the work presented in Chapter 
8; Prof. Dan Siewiorek (Carnegie-Mellon University, USA), Dr. Janusz Sosnowski 
(Warsaw Technical University, Poland), and Prof. Hermann Kopetz (Vienna Tech-
nical University, Austria) for the provision of details of their related research; and 
finally, Mr. Jim Roper (Computer Science, University of Durham) and Dr. Per 
Nylen (University of Stockholm, Sweden) for access to an Intel 80386 processor and 
Motorola 68(7)05 processor emulator respectively. 
My time at Durham has not been all work! I would like to thank my friends 
over the years for making my stay so enjoyable: especially the 'tea room boys' -
Sabah, Lee, Ken, Jean, and Norman; Hatfield College Middle Common Room; and 
the university squash and badminton teams for helping me release tension. Thanks 
also to Grandma for constant provision of orange cake. 
Whilst writing this thesis I have phoned home many times to report "It's finished!".
The natural reply soon became "Except for ... ?". Lastly then, I would like
to thank my parents and my fiancée Sarah for their patient support and to assure
them that I really have finished. 
(Submitted November 1990) 
The examination of this thesis was delayed by illness. I would like to thank the 
staff of Dryburn Hospital (County Durham), and R.N.H. Haslar (Gosport, Hants) 
for their care in the intervening period. 
(Examined April 1992) 
"Bloody instructions, which, being learned, return to plague 
the inventor." 
'Macbeth': Act 1, Scene 7, Lines 8-10 
by William Shakespeare. 




LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
LIST OF SYMBOLS
Chapter 1
RELIABILITY AND MICROPROCESSOR-BASED CONTROLLERS
1.1. Introduction
1.2. Microprocessor-Based Control Systems
1.3. Faults, Errors, and Failures in Electronic Systems
1.4. Engineering Reliability Through Design
1.4.1. Reliable Hardware
1.4.2. Reliable Software
1.5. Evaluating Controller Reliability
1.5.1. Hardware Reliability
1.5.2. Software Reliability
1.5.3. Interface Reliability
1.6. Low-Cost Enhancement of Controller Reliability
1.7. Thesis Preview
Chapter 2
TEMPORARY FAULTS: GENERATION, IMPLICATION, & DETECTION
2.1. Introduction
2.2. Faults and Their Implication on Microprocessor System Reliability
2.3. Erroneous Behaviour of Microprocessor Systems
2.3.1. Data Flow Errors
2.3.2. Program Flow Errors
2.4. Assessing Error Detection Techniques for Microprocessor Systems
2.4.1. Watchdog Timers
2.4.2. Capability Checking
2.4.3. Program Flow Monitoring
2.4.4. Hazards Associated with Error Detection Techniques
2.4.5. A Novel Error Detection Technique
2.5. Reliability Evaluation
2.6. Summary and Conclusions
Chapter 3
MODELLING ERRONEOUS MICROPROCESSOR BEHAVIOUR
3.1. Introduction
3.2. Initiating Erroneous Microprocessor Behaviour
3.3. Erroneous Behaviour
3.4. Erroneous Execution
3.5. Halse Execution Model
3.6. Hybrid Execution Model
3.6.1. Linear Erroneous Execution
3.6.2. Propagating Further Periods of Erroneous Execution
3.6.3. Detection of Erroneous Execution
3.6.4. Erroneous Execution Stall
3.7. Reliability Analysis
3.7.1. Failure Rate
3.7.2. Probability of an Event Leading to Failure
3.7.3. Reliability Evaluation
3.7.4. Mean Time To Failure
3.8. Availability Analysis
3.9. Summary
Chapter 4
EVALUATING MICROPROCESSOR BEHAVIOUR
4.1. Introduction
4.2. Instruction Mix Analysis
4.3. Architecture Parameters for the Microprocessor Model
4.3.1. Built-in Microprocessor Detection Capability
4.3.2. Modelling the Microprocessor Program Counter
4.3.3. Instruction Processing Exceptions
4.4. Evaluating Microprocessor Models of Erroneous Behaviour
4.4.1. 8-Bit Processor Evaluations
4.4.2. 16-Bit Processor Evaluations
4.4.3. 32-Bit Processor Evaluations
4.5. Catastrophic Failure Analysis
4.6. Recovery Through The Detection of Erroneous Execution
4.7. Evaluating Microprocessor Reliability
4.8. Evaluating Microprocessor Availability
4.9. Conclusions
Chapter 5
DETECTING ERRONEOUS MICROPROCESSOR BEHAVIOUR
5.1. Introduction
5.2. Address Space Allocation
5.3. Erroneous Execution in the Unused Area of the Address Space
5.3.1. The Initial Erroneous Jump Characteristic
5.3.2. Detecting Erroneous Execution
5.3.2.1. A Software Based Technique
5.3.2.2. Watchdog Timers and Smart Watchdogs
5.3.2.3. The Access Guardian Proposal
5.4. Erroneous Execution in the Used Area of the Address Space
5.4.1. The Subsequent Erroneous Jump Characteristic
5.4.2. Detection Using Software Implemented Fault Tolerance
5.4.2.1. Program Areas
5.4.2.2. Data Areas and Reserved Input/Output Areas
5.5. The Overheads of Implementing Fault Tolerance
5.5.1. Hardware Fault Tolerance
5.5.2. Software Implemented Fault Tolerance
5.6. A Fault Tolerant Strategy for Microprocessor Controllers
5.7. Summary
Chapter 6
POST-PROGRAMMING, AUTOMATED, RECOVERY UTILITY
(PARUT)
6.1. Introduction
6.2. Design and Development Objectives for the PARUT Prototype
6.3. A Functional Overview of PARUT
6.4. Design Features Incorporated within the PARUT Prototype
6.4.1. Programming Language
6.4.2. Programming Style
6.4.3. The Diagnostic Facility
6.4.4. Target Software
6.4.5. Target Processors
6.5. A Description of the PARUT Prototype Operation
6.5.1. Data Code Analysis
6.5.2. Program Code Analysis
6.5.3. The 'Seeding' Algorithm
6.6. PARUT: A Review of the Prototype
6.7. PARUT: Developing a Standard Programming Tool
6.8. Summary
Chapter 7
ASSESSING FAULT TOLERANCE
7.1. Introduction
7.2. Assessing the Fault Tolerance of a Microprocessor System
7.2.1. Assessment Parameters
7.2.2. Parameter Evaluation
7.2.3. Internal Microprocessor Faults
7.2.4. Assessment Dependence on Application Software
7.2.5. Behavioural Observations
7.3. Single-Bit Fault Injection Experiment
7.3.1. Fault Injection Experiment
7.3.2. Microprocessor Application Under Investigation
7.3.3. Programme of Injected Faults
7.3.4. Selected Single-Bit Faults
7.3.5. Decoupling the Microprocessor Detection Mechanisms
7.3.6. Performance Evaluation
7.4. Multiple-Bit Fault Emulation Experiment
7.4.1. Emulation and Fault Investigation
7.4.2. Microprocessor Applications Under Investigation
7.4.3. Programme of Emulated Faults
7.4.4. Behavioural Analysis
7.4.5. Identified Phases of Erroneous Execution
7.4.5.1. The Initial Erroneous Jump Phase
7.4.5.2. The Subsequent Erroneous Jump Phase
7.4.6. Analysing Detection Capability
7.4.7. Critical Hazards of Erroneous Behaviour
7.4.7.1. Cessation of Processing
7.4.7.2. Infinite Execution Loops
7.4.7.3. Placement Deadlock
7.4.8. Re-synchronized Erroneous Execution
7.5. Summary and Conclusions
Chapter 8
GENERATING NON-HAZARDOUS SOFTWARE
8.1. Introduction
8.2. Bridging the Semantic Gap
8.3. The Risks of Erroneous Execution
8.3.1. Catastrophic Processing Failures
8.3.2. Critical Hazard Coverage
8.4. Non-Hazardous Program Area Code
8.4.1. Hazardous Instruction Formats
8.4.2. Hazardous Opcodes
8.4.3. Hazardous Operands
8.4.3.1. Prevention of Address Mode Hazards
8.4.3.2. Inherent Addressing
8.4.3.3. Manipulating Direct Addressing
8.4.3.4. Manipulating Immediate Addressing
8.4.3.5. Manipulating Indirect Addressing
8.4.3.6. Manipulating Indexed Addressing
8.4.3.7. Manipulating Register-Indexed Addressing
8.4.3.8. Manipulating Relative Addressing
8.5. Influencing Translator Practices
8.5.1. Instruction Selection
8.5.2. Coupling and Cohesion
8.5.3. Macros
8.5.4. Peephole Optimization
8.6. Non-Hazardous Data Area Code
8.7. Non-Hazardous Input/Output Reserved Area Code
8.8. Influence of the Instruction Set
8.8.1. Undefined Instructions
8.8.2. Restart Instructions
8.8.3. Stop/Wait and Return Instructions
8.8.4. Unspecified Jump Instructions
8.9. Conclusions
Chapter 9
MICROPROCESSOR DESIGN FOR FAULT TOLERANCE
9.1. Introduction
9.2. The Effectiveness of Fault Tolerance
9.3. Implementing Fault Tolerance
9.4. Influences on Microprocessor Design
9.5. Instruction Set Architectures
9.5.1. Instruction Set Mix
9.5.2. Opcode Maps
9.5.3. Operand Requirements and Specification
9.6. Input/Output Communication Ports
9.7. Memory Organization
9.7.1. Memory Alignment
9.7.2. Defining Memory Utilization
9.8. Monitoring Branch Activity
9.9. Conclusion
Chapter 10
CONCLUSION
10.1. Microprocessor Controllers for Industrial Applications
10.2. Reliable Microprocessor Controllers
10.3. Modelling Erroneous Microprocessor Behaviour
10.4. Detecting Erroneous Microprocessor Execution
10.5. Evaluating Fault Tolerance
10.6. Generating Non-Hazardous Software
10.7. Microprocessor Design for Fault Tolerance
10.8. Summary
Bibliography & References
Appendix A
INSTRUCTION SET PARAMETERS
A.1. Introduction
A.2. Instructions Influencing Program Flow
A.3. Microprocessor Jump Type Instruction Data
Appendix B
THE DESIGN OF AN ACCESS GUARDIAN
B.1. Introduction
B.2. An Access Guardian Design
B.3. The Address Decoder
B.4. The Restart Generator
B.5. The Timer Unit
B.6. Design Simulation
B.7. The Design's Hardware Requirement
B.8. Summary
Appendix C
PARUT AND OTHER RELATED CODE LISTINGS
C.1. Introduction
C.2. PARUT Listing
C.3. Microprocessor Description File, MICRO-FILE
C.4. Target Software, CODE-FILE
C.5. Target Software with Fault Tolerance, RESULT-FILE
C.6. PARUT Report File, ANALYSIS-FILE
C.7. PARUT Diagnostics, TRACE-FILE
Appendix D
EXAMPLE PROGRAMS
D.1. Introduction
D.2. Program 'A' Targeting the Motorola 68000 Microprocessor
D.3. Program 'B' Targeting the Motorola 68(7)05 Microprocessor
D.4. Program 'C' Targeting the Intel 80386 Microprocessor
Appendix E
PUBLICATIONS
E.1. Introduction
E.2. EUROMICRO '88 Paper
E.3. EUROMICRO '89 Paper
E.4. IEE '89 Paper
E.5. IEE '90 Paper
E.6. EUROMICRO '91 Paper
FIGURES
Figure 2.1. : Fault and Error Latencies
Figure 3.1. : Microprocessor Erroneous Behaviour
Figure 3.2. : Erroneous Execution Model
Figure 3.3. : Reliability Model
Figure 4.1. : 8-Bit Microprocessor Linear Erroneous Execution
Figure 4.2. : 16-Bit Microprocessor Linear Erroneous Execution
Figure 4.3. : 32-Bit Microprocessor Linear Erroneous Execution
Figure 4.4. : Catastrophic Failure - Instruction Mix Analysis
Figure 4.5. : Recovery Through Detection - Instruction Mix Analysis
Figure 4.6. : Microprocessor Reliability - Instruction Mix Analysis
Figure 4.7. : Microprocessor Availability - Instruction Mix Analysis
Figure 5.1. : Functional Address Space Allocation
Figure 5.2. : The IEJ Characteristic
Figure 5.3. : Embedded Access Guardian
Figure 5.4. : Access Guardian
Figure 5.5. : The HIT Function
Figure 5.6. : The SEJ Characteristic
Figure 5.7. : Detection Mechanism Constructions
Figure 5.8. : Detection Mechanism Placement
Figure 5.9. : Erroneous Execution Model : Enhanced Fault Recovery
Figure 5.10. : Microprocessor Reliability with Access Guardian
Figure 5.11. : Enhanced Microprocessor Reliability
Figure 6.1. : PARUT Overview
Figure 6.2. : Program 'MAIN' Call Chart
Figure 6.3. : Screen Dump of PARUT User Interface
Figure 6.4. : Algorithmic Processing of Invalid Branches
Figure 6.5. : A Complex Example of Algorithmic Processing
Figure 7.1. : Basic Microprocessor Topology
Figure 7.2. : Microprocessor Laboratory System Topology
Figure 7.3. : Nature of Fault Injection Outcomes
Figure 7.4. : Program 'A' IEJ Execution Phase
Figure 7.5. : Program 'A' SEJ Execution Phase
Figure 7.6. : Detection Capabilities of Example Programs
Figure 9.1. : Microprocessor Opcode Map
Figure 9.2. : Input/Output Location Content
Figure 9.3. : SAFE ROM
Figure 9.4. : Memory with Utilization Assignment
Figure B.1. : Embedded Access Guardian
Figure B.2. : Access Guardian
Figure B.3. : Address Decoder
Figure B.4. : Restart Generator
Figure B.5. : Timer Unit
Figure B.6. : Ripple Counter Unit
Figure B.7. : Access Guardian Circuit Description
Figure B.8. : Access Guardian Timing Simulation
TABLES
Table 2.1. : Observed Temporary & Permanent Errors
Table 2.2. : Fault Emulation/Simulation Experiments
Table 2.3. : Fault Injection Experiments
Table 2.4. : Capability Checks
Table 4.1. : Microprocessor Instruction Set Evaluation
Table 4.2. : Microprocessor Undefined Instruction Evaluation
Table 4.3. : Random Data Instruction Interpretation
Table 4.4. : Microprocessor 'Mean Time To Failure' Evaluation
Table 5.1. : MTTF with Unused Area Detection
Table 5.2. : MTTF Enhancement with Used Area Detection
Table 7.1. : Single-Bit Fault Injection Programme
Table 7.2. : Fault Injection Outcomes
Table 7.3. : Nature of Fault Injection Outcomes
Table 7.4. : Observed Behaviour of Program 'A'
Table 7.5. : Observed Behaviour of Program 'B'
Table 7.6. : Observed Behaviour of Program 'C'
Table 7.7. : Erroneous Infinite Execution Loop
Table 7.8. : Placement Deadlock
Table A.1. : MC 6800 Microprocessor Instruction Set Evaluation
Table A.2. : Intel 8048 Microprocessor Instruction Set Evaluation
Table A.3. : Intel 8085 Microprocessor Instruction Set Evaluation
Table A.4. : Intel 8086 Microprocessor Instruction Set Evaluation
Table A.5. : MC 68000 Microprocessor Instruction Set Evaluation
Table A.6. : MC 68010 Microprocessor Instruction Set Evaluation
Table A.7. : AMD 29000 Microprocessor Instruction Set Evaluation
Table A.8. : MC 68020 Microprocessor Instruction Set Evaluation
Table A.9. : Intel 80386 Microprocessor Instruction Set Evaluation
Table A.10. : MC 6800 Jump Instructions
Table A.11. : Intel 8048 Jump Instructions
Table A.12. : Intel 8085 Jump Instructions
Table A.13. : Intel 8086 Jump Instructions
Table A.14. : MC 68000, MC 68010, and MC 68020 Jump Instructions
Table A.15. : AMD 29000 Jump Instructions
Table A.16. : Intel 80386 Jump Instructions
Table B.1. : Truth Table for SRFF Control Logic
Table B.2. : Set-Reset Flip Flop Transition Table
Table B.3. : Access Guardian Parts List
Table C.1. : Chronological Order of Functions in PARUT Listing
LIST OF ABBREVIATIONS
ADB UNIX 'A DeBugger' Facility
ALU Arithmetic Logic Unit
AMD Advanced Micro Devices
BG-Rig British Gas Rig (experimental)
BIC Bus Interface Circuitry
Cdf Cumulative Density Function
CISC Complex Instruction Set Computer
CLASSIC Custom Logic Analysis Simulation System for Integrated Circuits
Cm* Carnegie-Mellon Multi-processor
CMOS Complementary Metal-Oxide Semiconductor
CMUA Carnegie-Mellon University 'A' File System
CMU-AFS Carnegie-Mellon University Andrew File System
C.vmp Carnegie-Mellon Voting Multi-processor
DIS UNIX 'DIS-assembler' Facility
ECL Emitter-Coupled Logic
EMI Electro-Magnetic Interference
ESD Electro-Static Discharge
FMECA Failure-Mode, Effects and Criticality Analysis
FORTRAN 'Formula Translation' Programming Language
FTA Fault Tree Analysis
FTMP Fault Tolerant Multi-Processor
IBM International Business Machines
IC Integrated Circuit
IEE Institution of Electrical Engineers
IEEE Institute of Electrical and Electronics Engineers
IEJ Initial Erroneous Jump
IPQ Instruction Prefetch Queue
HDL Hardware Description Language
JPI Jensen and Partners International
LP Low Pressure
MC Motorola Corporation
MTTF Mean Time To Failure
MTTR Mean Time To Repair
NASA National Aeronautics and Space Administration
NMOS N-Type Metal-Oxide Semiconductor
PARUT Post-programming Automated Recovery UTility
PC Program Counter
PCU Program Control Unit
pdf Probability Density Function
RAM Random Access Memory
RISC Reduced Instruction Set Computer
RFI Radio-Frequency Interference
ROM Read Only Memory
SEJ Subsequent Erroneous Jump
SLAC Stanford Linear Accelerator Centre
SRFF Set-Reset Flip Flop
TMR Triple Modular Redundancy
TTL Transistor-Transistor Logic
VLSI Very Large Scale Integration
LIST OF SYMBOLS
$A$ Set of Faults Detected by Access Guardian
$A_d$ Physical Address in the Address Space
$A_v$ Availability
$C_n$ Ripple Counter Base
$D(k)$ Probability of Detecting Erroneous Execution
$D_{IEJ}(k)$ Probability of Detecting Erroneous Execution Following an IEJ
$D_{SEJ}(k)$ Probability of Detecting Erroneous Execution Following an SEJ
$E$ Error Detection Coverage
$E_f$ Event Leading to Failure
$E_r$ Event Leading to Recovery
$E_s$ Set of Events Disrupting Processor Execution
$f(t)$ Probability Density Function
$F$ Set of Injected Faults
$F(t)$ Cumulative Density Function
$H(l,j)$ HIT Function: Probability of an SEJ Target in Used Area
$i$ Number of Defined Opcodes within Instruction Set
$I$ Mean Period of Linear Erroneous Execution, (No. Instructions)
/ ; Microprocessor Instruction Frequency, (No. Instructions/s) 
$j$ Jump Type Instruction
$J$ Set of Jump Type Instructions
$J'$ Return or Unspecified Jump Outcome
$k$ Instruction Index During Linear Erroneous Execution
$l$ Jump Type Instruction Location in Used Area
$L$ Set of Address Space Locations
$L_d$ Error Detection Latency (No. Instructions)
$L_s$ Stalling Latency (No. Instructions)
$m$ Integer
$M$ Set of Faults Detected by Hardware
$n$ Number of Bits in Opcode Format
$N_{eJ}$ Effective Number of 'Jump' Outcome Instructions
$N_{eRN}$ Effective Number of 'Return' Outcome Instructions
$N_{eRT}$ Effective Number of 'Restart' Outcome Instructions
$N_{eSW}$ Effective Number of 'Stop/Wait' Outcome Instructions
$N_{eUJ}$ Effective Number of 'Unspecified Jump' Outcome Instructions
$N_i$ Number of Locations Open to Instruction Interpretation
$N_R$ Number of Instructions in Recovery Routine
$P$ Set of Faults Detected by Software
$P_c(t)$ Probability of Controlled State
$P_{IEJ}(\text{Used Area})$ Probability of IEJ Target in Used Area
$P_{IEJ}(\text{Unused Area})$ Probability of IEJ Target in Unused Area
$P_J$ Probability of a 'Jump' Instruction Outcome
$P_J(k)$ Functional Probability of a 'Jump' Outcome
Pj Probability of a 'Jump' Instruction Outcome 
$P_{J'}(k)$ Functional Probability of a 'J'' Outcome
Pj,(k) Functional Probability of a ' J " Outcome Following an IEJ or SEJ 
Pi (k) Functional Probability of Linear Erroneous Execution 
$P_{NJ}$ Probability of a 'Non-Jump' Instruction Outcome
$P_{NJ}(k)$ Functional Probability of a 'Non-Jump' Instruction Outcome
$P_{RN}(k)$ Functional Probability of a 'Return' Instruction Outcome
$P_{RT}(k)$ Functional Probability of a 'Restart' Instruction Outcome
$P_{SEJ}(\text{Used Area})$ Probability of SEJ Target in Used Area
$P_{SEJ}(\text{Unused Area})$ Probability of SEJ Target in Unused Area
$P_{SW}(k)$ Functional Probability of a 'Stop/Wait' Instruction Outcome
$P_{UJ}(k)$ Functional Probability of an 'Unspecified Jump' Instruction Outcome
$P_u(t)$ Probability of Uncontrolled State
$P_x(k)$ Functional Probability of an 'x' Outcome
$P_x^{IEJ}(k)$ $P_x(k)$ following an IEJ
$P_x^{SEJ}(k)$ $P_x(k)$ following an SEJ
$Pr\{\}$ Probability Function
$q$ Event Rate, (per hour)
$R$ Set of Faults Generating Re-synchronization
$R()$ Reliability Function
$R_F()$ Reliability of Signature Monitoring Hardware
$R_s()$ Reliability of Microprocessor Employing Signature Monitoring
$RN$ Return Outcome
$RT$ Restart Outcome
$s$ Number of Bytes in Memory Fetch
$S(k)$ Probability of Erroneous Execution Stall
$S_{IEJ}(k)$ Probability of Erroneous Execution Stall following an IEJ
$S_{SEJ}(k)$ Probability of Erroneous Execution Stall following an SEJ
$SW$ Stop/Wait Outcome
$t$ Time, (s)
$T$ Set of Locations Addressed by Jump Type Instruction
$UJ$ Unspecified Jump Outcome
$x$ Member of the Set $\{RN, RT, SW, UJ\}$
$y$ Member of the Set $\{IEJ, SEJ\}$
$Z(t)$ Hazard Rate, (per hour)
$\beta$ Probability Hardware Exception: 'Restart' Outcome
$\gamma$ Probability Software Exception: 'Restart' Outcome
$\lambda$ Constant Failure Rate, (per hour)
$\lambda(t)$ Failure Rate, (per hour)
^ Recursive Element of Function D(k) and S(k) 
CHAPTER ONE
RELIABILITY AND MICROPROCESSOR-BASED CONTROLLERS
1.1. Introduction
The continuing technological evolution of microprocessor design has resulted in a
large number of commercially available devices with a wide variety of characteristics.
Many microprocessors are embedded within control systems to automatically operate,
monitor, or control a physical process. Applications range from relatively simple
control of domestic appliances such as toasters or washing machines, to the complex
control of industrial plant such as power stations or chemical works. An important
feature of all these control systems is their reliability, that is, the probability that
the system will perform its function under stated environmental conditions, without
malfunction, over a specified period of time or operational duration (adapted from
[Bennetts, 1979]).
British Gas use microprocessor systems to manage individual governors controlling
the transmission of gas to industrial and public customers. Such microprocessor
controllers require high reliability because they are installed, along with the
hazard of gas, in close proximity to the general public.
The United Kingdom gas distribution system involves the transmission of gas
at high, medium, and low pressures between the off-shore gas fields, local consumer
districts, and end-customers respectively. The low pressure (LP) system dates back
to the production of town gas in the 19th century and serves local districts and
individual users. The LP network has traditionally been controlled by independent
pneumatic governors. Governors are devices which control, through a valve action,
the gas pressure in a pipeline. The governors implement a clocking mechanism which
alters the gas pressure depending on predicted daily fluctuations in demand. Gas
pressures that fall below a critical level allow air to enter the gas pipework and
can produce potentially explosive mixtures. To prevent this situation arising the
governor system implements a shut-down operation, called 'slam-open', when the
gas pressure falls at an excessive rate or when the gas pressure falls beneath the
statutory safety limit of 12.3 mbar. The slam-open operation involves completely
opening the governor valve so that adjacent normal and hazardous gas pressures
across the valve are equalized. Slam-open operations may cascade through several
consecutive governor systems before the mean pressure along a section of pipeline is
acceptable.
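As an illustration only, the two slam-open trigger conditions described above can be expressed as a simple check. The sketch below is written in Python for readability and is not taken from any governor implementation. The 12.3 mbar limit is the statutory figure quoted in the text; the rate threshold is a hypothetical parameter, since the text specifies only 'an excessive rate'.

```python
from dataclasses import dataclass

STATUTORY_LIMIT_MBAR = 12.3  # statutory safety limit quoted in the text


@dataclass
class PressureSample:
    time_s: float         # sample time in seconds
    pressure_mbar: float  # measured gas pressure in mbar


def should_slam_open(previous, current, max_fall_rate_mbar_per_s):
    """Return True if either slam-open condition holds.

    max_fall_rate_mbar_per_s is a hypothetical threshold chosen by the
    caller; the source text gives no figure for 'an excessive rate'.
    """
    # Condition 1: pressure beneath the statutory safety limit.
    if current.pressure_mbar < STATUTORY_LIMIT_MBAR:
        return True
    # Condition 2: pressure falling faster than the permitted rate.
    elapsed_s = current.time_s - previous.time_s
    if elapsed_s <= 0:
        raise ValueError("samples must be in increasing time order")
    fall_rate = (previous.pressure_mbar - current.pressure_mbar) / elapsed_s
    return fall_rate > max_fall_rate_mbar_per_s
```

A caller would evaluate the check on each new pressure sample, for example `should_slam_open(PressureSample(0.0, 30.0), PressureSample(1.0, 20.0), 5.0)`, which reports a slam-open condition because the pressure is falling at 10 mbar/s against the assumed limit of 5 mbar/s.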
In order to increase management efficiency of the LP network it is necessary to
improve control of the gas distribution. The small value of gas handled by each
governor system in the gas network means that a low-cost upgrade is required. To
this end, microprocessors have been embedded within the governor control systems
[Clark et al, 1987] to facilitate more effective and integrated control of the LP gas
distribution system [Wynne et al, 1988]. The distribution of gas can now be managed
in an efficient and interactive manner. Seasonal and diurnal load variations can
be monitored and appropriate responsive action taken automatically in a real-time
environment. Stringent safety regulations of the gas industry require that failure of
the electronic systems be backed up by the traditional pneumatic slam-open operation.
British Gas predict failures of the microprocessor assisted governors to occur
once every 10 years, a failure rate approximately one thousandth of that of the
original pneumatic governor systems [Clark et al, 1987]. The slam-open back-up ensures
that microprocessor controller failures are not catastrophic. Nevertheless, loss of the
governor management function incurs a financial penalty: reduced gas distribution
efficiency, and repair of the controller which may reside at a remote site. An overall
improvement in the reliability of the microprocessor controller is required. British
Gas are particularly interested in the effect of software, 'which may become highly
unpredictable' under the influence of hardware faults, on system reliability [Clark et
al, 1987].
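To give a feel for the quoted failure figure, the sketch below converts a mean time to failure into a survival probability. It assumes a constant failure rate, i.e. the exponential model R(t) = exp(-λt), and reads 'once every 10 years' as a mean time to failure of 10 years. Both are assumptions made here for illustration and are not statements taken from [Clark et al, 1987].

```python
import math


def reliability(t_years, mttf_years):
    """Probability of surviving t_years without failure.

    Assumes a constant failure rate (exponential model), so that the
    failure rate is simply the reciprocal of the mean time to failure.
    """
    failure_rate = 1.0 / mttf_years  # failures per year
    return math.exp(-failure_rate * t_years)
```

Under these assumptions `reliability(1, 10)` is exp(-0.1), roughly 0.905, and `reliability(10, 10)` is exp(-1), roughly 0.368: a controller with a ten-year mean time to failure still has a substantial chance of failing well within ten years.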
This thesis presents techniques for enhancing the operational reliability of
microprocessor-based controllers without unduly increasing their cost. These techniques
are applicable to any microprocessor system, including those used by British Gas to
manage governors.
1.2. Microprocessor-Based Control Systems
Many control systems have their designs based on complex logic circuits incorporating
flip-flops, analogue-to-digital converters, shift-registers, and other logic gate
structures. In these cases it is often convenient to supplement, or replace, such
circuitry with a dedicated microprocessor and its support chips. The control system
behaviour can then to a large extent be governed by the software stored on a
microprocessor memory chip. Software can be maintained without physically altering the
system hardware. Such flexibility can be valuable, as in a recently reported incident
when a car manufacturer discovered a design error in a fuel injection system [IEEE,
1989]. Replacing the system with an alternative would have been extremely expensive.
However, replacement was not necessary: the system was controlled by a microprocessor
which was re-programmed to compensate for, and effectively mask, the design error. The
problem was rectified at little cost. In such instances the maintenance engineers must
be careful not to introduce new errors (the 'Software Death Cycle', [Rigby & Norris,
1990]).
In recent years digital techniques have become so powerful that tasks well suited 
to analogue systems are often partially or totally controlled by digital systems. For 
example, a temperature meter based on a thermocouple or thermistor might incor-
porate a microprocessor and memory in order to improve accuracy by compensating 
for the instrument's departure from perfect linearity. 
Although ever more powerful microprocessors are being developed, most controllers
do not require further advanced processing capabilities. A recent Japanese
survey [Fujimura, 1989] reported that approximately 80% of microprocessor-based
controllers incorporated either an 8 or 16-bit machine. Despite the commercial
availability of 32-bit processors for over five years, they were only used in approximately
11% of the controllers. The remaining 9% of controllers had embedded 4-bit
microprocessors. Obviously the 4-bit microprocessor-based controllers had a very simple
function.
1.3. Faults, Errors, and Failures in Electronic Systems
A system failure is said to occur when the behaviour of the system first deviates
from that required by the specification of the system (as defined by [Anderson & Lee,
1982]). System failures are caused by the external exposure of a defective internal
state. Deficiencies in the internal state of a system, referred to as 'errors', can exist
without the generation of a failure.
A system consists of a set of components (or sub-systems) which interact under
the control of a design. Errors originate from the activation of defective system
components. Defective components are referred to as 'faults'.
A fault in a digital electronic system is characterized by its nature, extent, and 
duration [Avizienis, 1976]. The nature of the fault can be classified as either logical 
or non-logical. A logical fault causes the logic value at a point in the digital circuit 
to become opposite to the specified value. Non-logical faults include the remaining 
faults such as a malfunction of the clock signal. The extent of a fault specifies whether 
the effect of the fault is localized or distributed in the digital system. Finally,
the duration of a fault refers to whether the fault is permanent or temporary. 
McCluskey & Wakerly [1981] distinguish between two classes of temporary fault,
transient and intermittent. Transient faults are non-recurring temporary faults which
are caused by environmental influences. They are not repairable because there is 
no physical damage to the hardware. Intermittent faults are recurring temporary 
faults caused by deteriorating or ageing hardware. Intermittent faults may eventually 
become permanent and can be repaired. 
1.4. Engineering Reliability Through Design
Reliability can be engineered into a digital system by implementing a disciplined
design process. The reliability of a microprocessor-based system depends on its
hardware and software design; deficiencies in either are expensive to correct. It is
therefore prudent, when developing reliable systems, for the design to be fault-free or
fault-tolerant.
1.4.1. Reliable Hardware
Poor specification, design, and manufacture can individually or collectively introduce
faults within digital electronic systems. Specifications are normally written in
natural language which makes their integrity extremely difficult to check. Specifications
can be written using Formal Methods, enabling designs to be proven to comply
with their specification, but this technique does not ensure that the specification itself
is defect-free [Cullyer, 1988].
The integrity of manufacture can be validated using 'black-box' tests. Digital
systems, however, can be complex, and comprehensive black-box testing extremely
expensive. Intel test only 98% of the logic nodes for faults in each manufactured
80486 microprocessor (even though untested nodes may be faulty) because, as for
many other digital systems, complete testing is considered prohibitively expensive
[IEEE, 1990].
The methods outlined above for the procurement of reliable hardware are all 
fault avoidance techniques. A complementary approach involves tolerating faults 
through the implementation of special design features. Fault tolerant techniques can 
be divided into those that detect faults and initiate recovery such as parity checking 
and watchdogs, and those that mask faults such as Triple Modular Redundancy 
(TMR) and error-correcting codes [Carter, 1985].
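The distinction between detecting and masking can be illustrated with a minimal sketch. It assumes an 8-bit data word, an even-parity convention, and a bitwise majority voter; the function names and the example word are illustrative and are not taken from the cited work.

```python
def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total number of 1s even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Detect (but not correct) any odd number of flipped bits."""
    return even_parity_bit(word) == stored_parity


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three module outputs; masks one faulty module."""
    return (a & b) | (a & c) | (b & c)


correct = 0b10110010                 # hypothetical data word
parity = even_parity_bit(correct)    # stored alongside the word
corrupted = correct ^ 0b00001000     # single-bit temporary fault

detected = not parity_ok(corrupted, parity)      # detection: recovery must follow
masked = tmr_vote(correct, corrupted, correct)   # masking: no recovery needed
```

Parity only signals that something is wrong and leaves recovery to another mechanism, whereas the voter delivers the correct word at the price of triplicated hardware.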
1.4.2. Reliable Software
Software faults (commonly called 'bugs') can arise from the specification, design,
or coding process. Typically more than half the faults which are recorded during
software development originate in the specification [O'Connor, 1985]. This is
mainly due to the use of natural language for documenting the 'non-technical' user
requirements specification [Hitt & Webb, 1985]. Engineering principles are being
proposed [Sommerville, 1985] to enable defect-free development and maintenance of
software.
Software verification involves semantic and syntactic checks on the program code
for programmer error, and structured walk-through checks for functional correctness.
Black-box tests can be used to identify faults but the complexity of software often
prevents exhaustive checking due to prohibitive costs. The complexity of software
testing can be reduced by adopting a modular code structure. Many methods have
been proposed to assess acceptable test-set coverage for software [Musa et al, 1987]
but they are all subject to the Dijkstra maxim 'testing reveals the presence of faults,
not their absence' [Dijkstra, 1972].
Software can be manipulated to tolerate faults. Two well known approaches to
fault tolerant software are N-Version Programming [Chen & Avizienis, 1978], and
Recovery Blocks [Randall, 1975]. Both techniques rely on design diversity, the
availability of multiple implementations of a specification, to tolerate faults. N-Version
Programming requires the independent implementation of multiple, 'N', versions of
the specification. These versions are processed in parallel with the same inputs. A
voter collects the outputs and a majority decision is made to select the perceived correct
output. Theory implies high reliability for this method, but in practice the multiple
program versions can share common mode failures [Eckhardt & Lee, 1988].
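A minimal sketch of the voting step follows. The three 'versions' are hypothetical stand-ins for independently written implementations of one specification (squaring an integer), they are run sequentially here rather than in parallel, and the deliberately planted defect is an assumption made only to exercise the voter.

```python
from collections import Counter
from typing import Callable, Sequence


def n_version_execute(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input and return the majority output."""
    outputs = [version(x) for version in versions]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among version outputs")
    return value


def version_a(x: int) -> int:
    return x * x


def version_b(x: int) -> int:
    return x ** 2


def version_c(x: int) -> int:
    # Hypothetical design fault: wrong result for one particular input.
    return x * x + (1 if x == 3 else 0)


def version_d(x: int) -> int:
    # A second version that happens to share the same design fault.
    return x * x + (1 if x == 3 else 0)


single_fault = n_version_execute([version_a, version_b, version_c], 3)
common_mode = n_version_execute([version_a, version_c, version_d], 3)
```

With one faulty version the voter returns 9, but when two versions share the same defect the majority is wrong and the voter returns 10, which is the common mode hazard noted above.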
Recovery Blocks consist of a primary routine, which normally performs a task; an 
acceptance test which checks the primary routine result; and an alternative routine 
which is executed if the check fails. Unlike N-Version Programming where routine
independence is assumed, Recovery Blocks require ensured independence between 
the primary routine, the acceptance test, and the alternative routine. Application 
of recovery blocks improves reliability. The degree of reliability can be enhanced by 
extending the number of independent alternative routines ensuring each acceptance 
test is also wholly independent. 
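The structure of a recovery block can be sketched as below. The square-root routines, the tolerance in the acceptance test, and the planted defect in the primary routine are all assumptions chosen for illustration; a full implementation would also restore the program state saved on entry before trying an alternative, which is omitted here.

```python
import math
from typing import Callable, Sequence


def recovery_block(
    routines: Sequence[Callable[[float], float]],
    acceptance_test: Callable[[float, float], bool],
    x: float,
) -> float:
    """Try the primary routine, then each alternative, until one is accepted."""
    for routine in routines:
        try:
            result = routine(x)
        except Exception:
            continue  # an exception counts as a failed attempt
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all routines failed the acceptance test")


def primary_sqrt(x: float) -> float:
    # Hypothetical defect: wrong answer for inputs above 100.
    return x / 2 if x > 100 else math.sqrt(x)


def alternate_sqrt(x: float) -> float:
    # Independent implementation using Newton's iteration.
    guess = x if x > 1 else 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess


def accept(x: float, result: float) -> bool:
    # The check does not reuse either routine: it squares the result.
    return abs(result * result - x) <= 1e-6 * max(1.0, x)


normal_case = recovery_block([primary_sqrt, alternate_sqrt], accept, 16.0)
recovered_case = recovery_block([primary_sqrt, alternate_sqrt], accept, 144.0)
```

For the input 144.0 the primary result is rejected by the acceptance test and the alternative routine supplies the value instead.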
Software data structures can also be manipulated so that they can tolerate the 
presence of faults. Taylor et al [1980] briefly outline this topic and propose fault 
tolerant structures for linear lists and binary trees. 
1.5. Evaluating Controller Reliability
To evaluate the reliability of a microprocessor-based controller it is necessary to
apply a 'systems approach'. The systems approach involves integrating the
interdependencies of all sub-systems constituting a whole system. Microprocessor-based
controllers consist of two entities: hardware and software. Many authors, including
Hitt & Webb [1985] and Ferrara et al [1989], integrate calculations for hardware
and software reliability. However such reliability assessments do not involve any
allowance for the internal effect of hardware faults on software. Internal
hardware/software interaction occurs across what is referred to as the 'interface'. To
determine system reliability more accurately it is necessary to integrate assessments
of hardware reliability, software reliability, and interface reliability.
1.5.1. Hardware Reliability
It is valuable to calculate the reliability of a hardware product for the duration of
its 'useful' lifetime. Historical failure data which takes into account benign operating
conditions and general age degradation is used to assess the expected lifetime of the
hardware. Popular compilations of such data include the United States Air Force
'Reliability Prediction for Electronic Systems' (MIL-HDBK-217), and the United
Kingdom British Telecom 'Handbook of Reliability Data' (HRD-4). Techniques for
manipulating this data to reflect hardware architecture are well understood [Lala,
1985].
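As an indication of how such data is typically combined, the sketch below assumes a constant failure rate for each component (an exponential lifetime), a series architecture in which any component failure fails the system, and simple duplication for the redundant case. The failure rates are invented for illustration and are not taken from MIL-HDBK-217 or HRD-4.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def series_failure_rate(rates: list[float]) -> float:
    """In a series system the component failure rates add."""
    return sum(rates)


def parallel_reliability(unit_reliability: float, copies: int) -> float:
    """The redundant system fails only if every copy fails."""
    return 1.0 - (1.0 - unit_reliability) ** copies


# Hypothetical failure rates (failures per hour): processor, memory, I/O.
rates = [2.0e-6, 1.0e-6, 0.5e-6]

total_rate = series_failure_rate(rates)
mttf_hours = 1.0 / total_rate
r_year = reliability(total_rate, 8760.0)      # one year of continuous operation
r_year_duplicated = parallel_reliability(r_year, 2)
```

The duplicated arrangement is more reliable than the single controller, but only by doubling the hardware, which is the cost that the low-cost techniques discussed later seek to avoid.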
1.5.2. Software Reliability
Methods of establishing the reliability of software are still under development.
Although many techniques have been proposed none have had the widespread accep-
tance given to the corresponding assessment of hardware reliability. 
Assessments of software reliability usually involve the prediction of errors existing
in the software. However, the reliability of the software depends not only on the
existence of a fault but also its activation. Many authors have used Markov processes
to model software reliability [Musa, 1987]. There are two types of Markovian software
reliability model widely used: Poisson and binomial. The Poisson models assume an
infinite number of faults in the software, whilst the binomial models assume a fixed
number of faults. Both model types assume faults exist randomly within the software.
Musa [1975] refined the basic Poisson model so that fault activation is a function of
the time for which the software is executed. In reality, however, fault exposure is
dependent on the fault location within the software and its associated probability of
activation by program execution. Littlewood [1981] attempts to model this situation
with a binomial model that weights software faults according to the probability of
their execution. Trachtenberg [1990] has recently reviewed and suggested a general
theory of software reliability models based on Markovian processes.
The Markov software reliability models provide a valuable indication of the likelihood
that a fault will be exposed during software execution. The hazard attributed
to fault exposure can be further estimated by using ad hoc methods such as 'Fault
Tree Analysis' (FTA) or 'Failure-Mode, Effects, and Criticality Analysis' (FMECA).
1.5.3. Interface Reliability
In software controlled digital systems, failures can occur which are difficult to
diagnose as being due to the exposure of a hardware fault or software error. The
distinction is usually unclear because the system's internal hardware/software interface
has not been defined. The interface occurs within electronic devices such as processors
and memories. For example, a fault in an individual cell on a memory device
holding a program can cause what appears to be a software error. Memory devices
are sometimes referred to as firmware to reflect their hardware/software interface.
Other faults may occur on a data bus line with similar effect.
Permanent faults relating to interface reliability should be identified by the burn-in
procurement of the hardware. However, because of their limited duration, temporary
faults are rarely located during the burn-in process. Assessment of the interface
reliability requires knowledge of the occurrence of faults and the errors they induce.
1.6. Low-Cost Enhancement of Controller Reliability
The reliability of microprocessor controllers can be enhanced by addressing the 
problem of faults introduced during procurement and operation. Techniques for 
procuring reliability have been briefly outlined. Operational faults are generated 
by component aging and transient disturbances in the working environment such 
as power supply fluctuations, electro-magnetic interference (EMI), and electro-static
discharge (ESD) [Siewiorek & Swarz, 1982]. 
Transient disturbances are associated with temporary faults in digital systems.
Unlike analogue or mechanical systems which tend to pass the effect of a transient
disturbance as a temporary signal discrepancy whilst retaining overall function, digital
systems are susceptible to malfunction in the presence of temporary faults because
of their discrete state nature. Indeed it is becoming established that, even in 'benign'
working environments, about 90% of microprocessor system failures can be attributed
to temporary faults [Siewiorek & Swarz, 1982].
Control systems are often required to operate in 'harsh' industrial environments
liable to produce transient disturbances. Although shielding can be employed to reduce
the effects of transient disturbances on digital systems, their elimination is rarely
possible [Horowitz & Hill, 1986]. The benefit of tolerating temporary faults induced
by transient disturbances can be considerable. Industrial microprocessor controllers,
however, are often developed within a limited budget which cannot support the
redundancy incurred by many established fault tolerant techniques. This thesis
approaches the topic of interface reliability, proposing a low-cost software-implemented
fault tolerant technique for temporary hardware faults.
1.7. Thesis Preview
The topic of reliability for microprocessor-based controllers has been introduced
with respect to the requirement for low-cost fault tolerance (the objective of the
research presented in this thesis). Chapter 2 surveys literature investigating the
fault-error-failure mechanism in microprocessor systems. The failure process is identified
with malfunction, particular hazard being associated with the corruption of program
flow. Current techniques to detect this class of fault are reviewed, but many require
considerable expense to implement.
As a first step to developing new and more cost-effective techniques to detect
program flow corruption, it is useful to consider the character of associated erroneous
microprocessor behaviour. Chapter 3 presents a model for erroneous microprocessor
execution. Performance parameters are evolved to show the benefit of implementing
a detection capability together with a recovery mechanism. These parameters
include detection latency, reliability, Mean Time To Failure (MTTF), and availability.
The model is applied to a selection of microprocessors commonly embedded within
controllers; results are discussed in Chapter 4.
Microprocessor software displays various characteristics depending on the func-
tion of its implementation. Functional sections of code include program areas, data 
areas, and reserved memory mapped input/output areas. In addition, the micro-
processor may have a proportion of its address space which is not populated by 
software. Whilst correct microprocessor operation executes instructions within the 
program area in a predictable manner, erroneous execution can invalidly interpret 
an instruction anywhere in the microprocessor address space in an unpredictable 
manner. Fault tolerant techniques whose implementation is based on software for 
detecting erroneous execution within each functional area of the microprocessor ad-
dress space are presented in Chapter 5. In particular a novel technique for detecting 
erroneous program flow is proposed. 
An algorithm for implementing the proposed fault tolerant technique in the pro-
gram area is presented in Chapter 6. The technique involves manipulating the pro-
gram code in order to strategically insert detection mechanisms. The mechanisms 
are designed to detect erroneous execution by identifying beforehand particular cor-
rupted execution routes in the software. The algorithm is implemented in a software 
utility so that program code can automatically be given the detection capability. The 
software utility is called Post-programming Automated Recovery UTility (PARUT). 
PARUT is designed to be flexible, allowing the generation of fault tolerant code for
a selection of microprocessors. 
Several example programs are processed by the PARUT algorithm in Chapter 7 
so that the performance of the fault tolerant technique can be assessed. Enhanced 
program code is evaluated by emulation and fault injection programmes. These 
experiments monitor the behaviour associated with corrupted patterns of erroneous 
execution. The information obtained from the experiments enables the detection 
performance of individual programs to be assessed. 
Translators are used to generate significant amounts of software for microproces-
sor based controllers. The programmer has no control over the nature of the translator 
generated code. Chapter 8 identifies critical hazards which are not covered by the 
fault tolerant techniques outlined in Chapter 5. These hazards are associated with 
the catastrophic failures of cessation of processing and infinite execution loops. Tech-
niques are proposed for the translator code generation process so that critical hazards 
are not produced. These techniques are influenced by the nature and structure of the 
target microprocessor instruction set. 
Fault tolerant microprocessor design features are proposed in Chapter 9. These 
features facilitate rapid detection of corrupted program flow. 
The final chapter reviews the thesis and draws conclusions on the research. Five 
appendices provide details of: microprocessor parameters applied in Chapter 4 to 
the model presented in Chapter 3; the design of a hardware unit associated with 
the software implemented fault tolerant technique proposed in Chapter 5; a code 
listing of the PARUT tool described in Chapter 6; the example programs evaluated




TEMPORARY FAULTS:
GENERATION, IMPLICATION, AND DETECTION
2.1. Introduction
2.2. Faults and Their Implication on Microprocessor System Reliability
2.3. Erroneous Behaviour of Microprocessor Systems
2.3.1. Data Flow Errors
2.3.2. Program Flow Errors
2.4. Assessing Error Detection Techniques for Microprocessor Systems
2.4.1. Watchdog Timers
2.4.2. Capability Checking
2.4.3. Program Flow Monitoring
2.4.4. Hazards Associated with Error Detection Techniques
2.4.5. A Novel Error Detection Technique
2.5. Reliability Evaluation
2.6. Summary and Conclusions
CHAPTER TWO
TEMPORARY FAULTS: GENERATION, IMPLICATION, AND DETECTION
(A LITERATURE REVIEW)
2.1. Introduction
An industrial environment can be less than ideal for microprocessor-based con-
trollers. In particular externally generated transient events can disrupt microproces-
sor operation. Erroneous microprocessor behaviour is associated with a degraded or 
lost control function, and the mal-operation of any equipment under the micropro-
cessor based controller's supervision. Mal-operation of controlled equipment can be 
extremely hazardous because of the unpredictable nature of erroneous microprocessor 
behaviour. 
This chapter discusses transient events that lead to temporary corruption of a 
microprocessor bus, register, or memory. Such corruption incurs no permanent hardware
damage. The limited duration of temporary faults inhibits their detection.
Without detection, and exercised by circuit action, temporary faults generate errors 
which can spawn other errors, the process terminating as either benign activity or 
catastrophic failure. The fault-error-failure mechanism is explored using the results 
of fault observations in real processor systems, fault simulations, and physical fault 
injection experiments. 
Prolonged periods of erroneous behaviour increase the likelihood of generating 
a catastrophic failure. In order to prevent catastrophic failure, erroneous opera-
tion must be detected and appropriate recovery action initiated. Error detection is 
achieved by identifying characteristics of erroneous behaviour. Several techniques 
are reviewed which offer fault tolerance suitable for low-cost microprocessor based 
systems. The performance of the techniques is discussed in relation to the benefit of 
rapid error detection. Finally, assessing the reliability of a microprocessor system
adopting a fault tolerant technique is considered.
2.2. Faults and Their Implication on Microprocessor System Reliability
Assessing the reliability of a microprocessor based system involves evaluating the 
probability of failure which in turn is dependent on the fault-error-failure mechanism. 
As defined in Chapter 1, a fault is a physical defect, an error is an activated fault, and
a failure is classified as the deviation of system behaviour from that expected. The 
time interval between the occurrence of a fault and its manifestation as an error is 
called the fault latency. Similarly, the time interval between the occurrence of an error 
and the generation of a failure is called the error latency. Fault and error latencies 
are shown in Figure 2.1. The relationship between faults, errors, and failures is now 
explored. 
Over the last decade, computer failure data has been collected for several contin-
uously operational computer systems. Diagnosis of the data reveals temporary faults 
to be a significant cause of microprocessor failure. Collated computer failure data, 
see Table 2.1., shows temporary faults to account for between 93% and 98% of the in-
duced computer system failures, the remaining failures being due to permanent faults. 
Furthermore, within the selection of computer systems, temporary faults have been 
observed to occur approximately every 40 to 330 hours during continuous operation. 
Temporary faults in digital devices are associated with electro-magnetic interference 
(EMI) [Liu & Whalen, 1988], electro-static discharge (ESD) [Bhar & Mahon, 1983],
electrical noise [Shoji, 1987], ionizing radiation [Amerasekera & Campbell, 1987], and 
power supply fluctuations [Cortes et al, 1986]. 
The manifestation of a temporary fault within a microprocessor based system is 
dependent on the susceptibility of its digital circuitry. Ball & Hardie [1969] report 
experiments which suggest that the probability of logic malfunction is dependent 
on the duration of the temporary fault. Typically, temporary faults only had a 
significant effect when they existed in excess of five clock cycles. Sequential logic 
was more susceptible than combinational logic, its probability of malfunction being 
in excess of 90% for temporary faults of 100 clock cycle duration. 
Additionally, the miniaturisation of digital integrated circuits (scaling) makes
ing is associated with two operational characteristics on VLSI devices; lower operat-
ing power and higher processing speed. Lower operating power means that smaller 
power variations can generate a signal ambiguity or fault. Hence the severity of a 
transient disturbance required to induce a temporary fault is reduced. Higher pro-
cessing frequencies enable smaller duration temporary faults to induce a logic fault. 
A fault may have various effects depending on the state of the microprocessor 
system at the time of the fault and the duration of the fault. Errors are not generated 
when the fault duration is less than the fault latency. A primary error is produced 
when the duration of the fault is equal to the fault latency, the associated event 
probability being denoted by Pr{Error | Fault}. Subsequent errors, referred to as 
secondary errors, are generated in numbers that increase with the fault duration 
beyond the fault latency [Damm, 1988]. 
Errors can influence a microprocessor system in several ways. Errors can spawn 
further errors as modelled by Stiffler [1980] and Kopetz [1982]. Each of these errors
can cause passive or active failure depending on whether or not they are dormant. 
Dormant errors, e.g. memory errors, have an error latency dependent on the access 
frequency of the corrupted memory locations. Hence, the microprocessor system 
application can have a major influence on determining whether or not an error leads 
to failure. Indeed Iyer & Rossetti [1986] and Woodbury & Shin [1990] both report 
evidence that a processor's workload can affect fault latency. Furthermore, the same 
error may have various effects on the microprocessor system depending on the timing 
of the error manifestation. The probability that an error generates a failure is denoted 
by Pr{Failure | Error}. 
Reliability is defined in Chapter 1 as the probability of operating without failure, 
and can be expressed as, 
R = 1 - Pr{Failure}. (2.1.) 
Inserting conditional probabilities, associated with the fault and error latencies, rep-
resenting the fault-error-failure mechanism yields, 
R = 1 - Pr{Failure | Fault}.Pr{Fault}. (2.2.)
Techniques to improve microprocessor system reliability can be broadly divided into
two groups: those that endeavour to prevent or reduce the generation of faults
(fault prevention, to reduce Pr{Fault}), and those which attempt to intervene
and prevent generated faults from causing system failure (fault tolerance, to reduce
Pr{Failure | Error} and Pr{Error | Fault}).
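The effect of the two groups of technique on Equation (2.2.) can be shown with a small worked example. It assumes that a failure can only arise through an error, so that Pr{Failure | Fault} = Pr{Failure | Error}.Pr{Error | Fault}; this decomposition and all of the probability values below are assumptions made for illustration.

```python
def reliability(
    p_fault: float,
    p_error_given_fault: float,
    p_failure_given_error: float,
) -> float:
    """R = 1 - Pr{Failure | Error}.Pr{Error | Fault}.Pr{Fault}."""
    p_failure_given_fault = p_failure_given_error * p_error_given_fault
    return 1.0 - p_failure_given_fault * p_fault


# Hypothetical probabilities over some fixed period of operation.
baseline = reliability(0.10, 0.5, 0.4)

# Fault prevention (e.g. shielding) lowers Pr{Fault}.
with_prevention = reliability(0.05, 0.5, 0.4)

# Fault tolerance (e.g. error detection and recovery) lowers Pr{Failure | Error}.
with_tolerance = reliability(0.10, 0.5, 0.1)
```

Under these invented figures the baseline reliability of 0.98 rises to 0.99 when the fault probability is halved, and to 0.995 when three quarters of the errors are prevented from becoming failures.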
Fault prevention can involve the strategic implementation of two types of techniques
[Anderson & Lee, 1982]. Firstly, fault avoidance can be employed to protect
the controller from transient disturbances. This commonly involves 'shielding' the 
controller to obstruct the effects of a transient disturbance. Secondly, fault removal 
can be applied to identify weak design or components within the microprocessor-
based controller. This process is commonly referred to as 'screening'. Rectifying the 
weaknesses should reduce the susceptibility of the controller to the effects of transient 
disturbances. 
Anderson & Lee [1982] identified four operations that must be completed in order for a
system to be fault tolerant:
• error detection,
• damage assessment,
• error recovery, and
• fault treatment and continued system service.
Once an error is detected it is necessary to identify and isolate the damage incurred. 
Then the system must be restored to a valid operational state in order to prevent re-
curring errors evoking system failure. Finally, any damage must be repaired and the 
original system operation re-initiated. Temporary faults do not incur any physical 
damage and hence on completion of the fault tolerant process, assuming the tempo-
rary fault has terminated its existence, the system is returned to complete working 
order. 
The duration of the error until its detection is called error-detection latency. The 
sum of the fault latency and error-detection latency is referred to as the fault detection 
latency [Damm, 1988], see Figure 2.1. 
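The relationship between these latencies can be written down directly. In the sketch below the class name, field names, and the example instants (expressed in clock cycles) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class FaultTimeline:
    fault_time: int       # instant at which the fault occurs
    error_time: int       # instant at which the fault first manifests as an error
    detection_time: int   # instant at which the error is detected

    @property
    def fault_latency(self) -> int:
        return self.error_time - self.fault_time

    @property
    def error_detection_latency(self) -> int:
        return self.detection_time - self.error_time

    @property
    def fault_detection_latency(self) -> int:
        # Sum of the fault latency and the error-detection latency.
        return self.fault_latency + self.error_detection_latency


timeline = FaultTimeline(fault_time=100, error_time=140, detection_time=165)
```

Here the fault latency is 40 cycles, the error-detection latency is 25 cycles, and the fault detection latency is therefore 65 cycles.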
Equation (2.2.) denotes two levels at which fault tolerant techniques can be applied:
the circuit (or gate) level, Pr{Error | Fault}, and the functional (or component)
level, Pr{Failure | Error}. Temporary faults are extremely difficult to detect
because of their short duration. The overhead in providing a detection capability 
for individual faults in a VLSI device at circuit level is considered by Mahmood & 
McCluskey [1985] to be prohibitive. Some of the faults generated will in any case 
be benign and hence not require detection. Nevertheless, other faults, exercised by 
circuit operation, can generate errors. In order for a system to tolerate such faults 
it is necessary to detect their associated errors before they in turn induce system 
failure. Fortunately, there appears to be a good correlation between the circuit level 
and behaviour level reaction of common VLSI design elements, such as Arithmetic 
Logic Units (ALU) and multiplexers, to the effects of faults [Chakraborty & Ghosh, 
1988] which suggests that functional level fault tolerance will be acceptable for most 
fault tolerant systems. 
2.3. Erroneous Operation of Microprocessor Systems
The impact of faults upon microprocessor operation has been the subject of 
much research. Iyer uses fault simulation experiments, see Table 2.2., to investigate 
the susceptibility of generating and the likelihood of propagating functional element 
errors. Memory, arithmetic logic units (ALU), and multiplexers are found to be, in 
descending order, the most susceptible functional elements to error generation and 
propagation. The remaining fault simulation experiments in Table 2.2., together with 
the physical fault injection experiments listed in Table 2.3., investigate the effect of 
such errors on software operation. 
Software execution errors generated by faults are diverse, but can be divided into 
two general groups: data flow errors and program flow errors. Both are affected by 
fault latencies with a bimodal distribution, i.e. there are two or more distinct classes
as stack corruption, are dormant requiring particular processing for their activation 
whilst other faults generate 'fast' failures. Nineteen percent of faults injected into 
a Motorola 6809 microprocessor system through ion bombardment [Gunneflo et al, 
1989], and 22% of faults injected into an IBM 3081 processor system through memory 
corruption [Chillarege & Bowan, 1989], produce dormant faults. 
Gunneflo's experiment also reports 78%, 17%, and 5% of the injected faults to 
generate program flow errors, data flow errors, and other consequences respectively 
[Gunneflo et al, 1989]. Experiments that physically inject faults on processor package 
pins, Schmid et al [1982] and Schuette & Shen [1986], support these results with 63% 
and 78% of faults generating program flow errors in Zilog Z80 and Motorola 68000 
microprocessor systems respectively. Further, McGough & Swern [1981] report over 
half the logical faults inserted in a simulation of an AMD 2901 processor system to
generate 'wild' branches, i.e. program flow errors. Caution should be exercised when 
comparing the significance of data and program flow errors in the fault insertion 
experiments because the experiments use different microprocessors, different fault 
insertion methods, and different fault locations. 
2.3.1. Data Flow Errors 
Data flow errors are generated by corruption or incorrect processing of data ele-
ments resident in data structures or instruction operands. Erroneous data can pro-
duce 'unreasonable' as well as failure conditions [Damm, 1988]. Data flow errors do 
not disturb the program flow and hence acceptance tests can be embedded within 
the program to check for bad data. 
2.3.2. Program Flow Errors 
Program flow errors can be initiated in microprocessor systems by the incorrect 
identification of a memory location as containing an instruction [Marchal & Courtois, 
1982], or corruption of an instruction itself [Carpenter, 1989]. There are three main 
mechanisms by which faults can generate erroneous program flow. Firstly, corrupted 
opcodes may have a different instruction length. Carpenter [1989] demonstrates this 
effect for the Motorola 68000 microprocessor. A change in the instruction length 
forces incorrect interpretation of the memory location for the next instruction opcode. 
Secondly, corruption of a branch or jump instruction operand will alter the location 
of the next instruction opcode to be executed. Finally, registers in the microprocessor 
specifying the address of the next instruction opcode may be directly corrupted. A 
SBR9000 processor fault simulation, [Li et al, 1984], reports 73% of program flow 
errors to be generated in this way. Any of these three mechanisms effectively 
initiates execution of an unknown program. The nature of ensuing execution is 
dependent on the code content of memory under interrogation. The character of this 
execution is haphazard and not usually in sympathy with the organized execution 
associated with the application program. This behaviour leads to the malfunction 
of the controller. Particular hazard is attributed to such occurrences because of the 
implications of their unpredictable effect on controller operation. 
Once incorrect memory locations are accessed for instruction opcodes then there 
is a high probability that following instruction opcodes will also be incorrectly iden-
tified [Marchal & Courtois, 1982]. As erroneous program flow progresses, processing 
may further corrupt memory containing the original software. Erroneous program 
flow can terminate naturally via re-synchronization. Re-synchronization involves 
re-establishing identification of instruction opcodes within the original software 
application program. A physical fault injection experiment reports that 75%, 6%, and 
19% of program flow errors diverge permanently from the correct program (program crash), 
diverge temporarily from the program flow (re-synchronization), and are dormant 
(stack errors, etc.) respectively. Tests cannot be embedded within the application 
program to identify opcode corruption because the application program is no longer 
executed. 
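The loss and recovery of opcode identification can be illustrated with a small sketch. The three-opcode instruction set below is purely hypothetical (it is not the Motorola 68000 encoding examined by Carpenter [1989]), and unknown bytes are assumed to decode as one-byte instructions. Decoding is started either at the true entry point or at an erroneous address, and the first address at which the two opcode streams coincide is taken as the point of re-synchronization.

```python
# Hypothetical toy instruction set: opcode byte -> instruction length in bytes.
LENGTHS = {0x01: 1, 0x02: 2, 0x03: 3}

def opcode_addresses(memory, start):
    """Return the addresses treated as opcodes when decoding from `start`."""
    addresses = []
    pc = start
    while pc < len(memory):
        addresses.append(pc)
        # Assumption: any unknown byte behaves as a one-byte instruction.
        pc += LENGTHS.get(memory[pc], 1)
    return addresses

def resync_address(memory, wrong_start):
    """First address at which erroneous decoding rejoins the correct stream."""
    correct = set(opcode_addresses(memory, 0))
    for addr in opcode_addresses(memory, wrong_start):
        if addr in correct:
            return addr
    return None  # never re-synchronized within this memory image

# Correct decoding from address 0: 3-byte, 2-byte, 1-byte, 3-byte, 1-byte.
program = [0x03, 0x02, 0x01, 0x02, 0x03, 0x01, 0x03, 0x01, 0x01, 0x01]

print(opcode_addresses(program, 0))
print(resync_address(program, 1), resync_address(program, 4))
```

In this toy image an entry at address 1 rejoins the correct stream after one misread instruction, whereas an entry at address 4 misreads several operand bytes as opcodes before rejoining; a real instruction set may never re-synchronize.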
Sosnowski [1986a] considers steady state outcomes of erroneous behaviour asso-
ciated with microprocessor failure. He develops models for false loops, traps, and 
deadlocks. False loops and deadlocks are essentially infinite execution loop phenomena 
which have also been modelled by Halse & Preece [1987]. Deadlocks involve the 
termination of execution when the processor enters a 'wait' state. Halse & Preece 
[1985] and Sosnowski [1986b] investigate characteristics of erroneous behaviour re-
lating to the influence of different microprocessor instruction sets and address space 
utilization. 
2.4. Assessing Error Detection Techniques for Microprocessor Systems 
Hardware and software fault tolerant techniques are briefly outlined in Chapter 1. 
Many techniques, however, may be unsuitable for microprocessor controllers because 
the costs of their application exceed the controller's budget. This section reviews 
low-cost fault tolerant techniques for microprocessor controllers requiring high reliability. 
2.4.1. Watchdog Timers 
One of the most basic techniques for checking the operation of a microprocessor-
based system is the use of a watchdog timer [Connet et al, 1972], [Ornstein et al, 1975]. 
The system is designed such that, under normal operation, program execution signals 
the watchdog timer within a specified time interval. The signal presets the timer to 
its initial value. The timer generates an error if no preset signal is forthcoming 
during the specified time interval. On receiving the error signal from the watchdog, 
the system initiates suitable recovery action. Typically this involves re-establishing 
a correct set of operating parameters. Watchdogs incur a small switching overhead 
and no performance degradation is incurred directly upon the executing software. 
Watchdog timers, however, may have a hazardous error-detection latency. Con-
sider a processor operating at 8 MHz, whose mean instruction processing time is 100 
clock cycles, implementing a watchdog with a 100 ms interval. If a malfunction occurs 
in the middle of this interval then approximately 4000 instructions can be processed 
erroneously before the malfunction is detected. During the malfunction, processor 
operation is uncontrolled and may have hazardous implications for the processor 
activity. Therefore, other techniques providing fault detection are being developed. 
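The latency figure quoted above can be checked with a short calculation. The function and parameter names below are illustrative only; the sketch simply assumes that the malfunction occurs a given fraction of the way through the watchdog interval and that the processor continues to execute at its mean instruction rate until the timer expires.

```python
def instructions_before_detection(clock_hz, cycles_per_instruction,
                                  watchdog_interval_s, fault_fraction=0.5):
    """Instructions executed between a malfunction and the watchdog timeout.

    fault_fraction is the point within the watchdog interval at which the
    malfunction occurs (0.5 = middle of the interval).
    """
    instructions_per_second = clock_hz / cycles_per_instruction
    remaining_time_s = watchdog_interval_s * (1.0 - fault_fraction)
    return instructions_per_second * remaining_time_s

# 8 MHz clock, 100 cycles per instruction, 100 ms watchdog interval.
latency = instructions_before_detection(8e6, 100, 0.100)
print(round(latency))
```

With a malfunction at the very start of the interval (fault_fraction=0.0) the same parameters give roughly twice as many erroneously processed instructions.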
2.4.2. Capability Checking 
Lu [1980] was one of the first to propose what is now commonly referred to as 
the 'smart' watchdog. These units are based on an additional processor to provide 
a monitoring capability, facilitating faster detection without the high cost associated 
with fault masking techniques. Mahmood et al [1983] proposed a smart watchdog 
to check algorithm-level assertions about executing software. Namjoo & McCluskey 
[1982] suggested a scheme called 'capability checking' implemented by a smart watch-
dog that identifies illegal operations and memory accesses. Such a unit would detect 
system malfunctions as well as prevent memory mutilation by erroneous behaviour. 
Marchal & Courtois [1982] applied a selection of capability checks, whilst Schmid et 
al [1982], referring to capability checking as 'abstraction verification', extend the test 
set and estimate fault detection through direct fault simulation. Smart watchdogs 
can be applied to modern microprocessors that implement co-processors and caches 
[Saxena & McCluskey, 1990]. Mahmood & McCluskey [1988] survey the use of smart 
watchdogs. 
Capability checks, see Table 2.4., implemented together create reliable computer 
systems as demonstrated by Schmid et al [1982], Gunneflo et al [1989], and Madeira 
et al [1990] with 88%, 79%, and 75% fault detection respectively. The variation in 
fault detection is due to different selection of the capability checks implemented by 
each system, and the method of fault insertion when evaluating the microprocessor 
system's fault tolerance. Collectively applying capability checks provides both 
program and data flow error detection. However, implementing all these techniques 
can be complex so another simpler alternative is being explored by researchers. It 
involves detecting data flow errors by placing reasonableness checks in the software 
[Damm, 1988] (this includes Recovery Blocks, and N-Version Programming), and im-
plementing a monitoring scheme to directly verify the program flow of the application 
processor. 
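A coarse sketch of how a monitoring unit might apply some of the memory-related capability checks of Table 2.4. is given below. It is not the design of any of the cited systems: the class names, the region table, and the memory map are all hypothetical, and the checks are made at the level of whole memory regions rather than individual opcode addresses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    start: int        # first address in the region
    end: int          # last address in the region (inclusive)
    readable: bool
    writable: bool
    executable: bool

class CapabilityChecker:
    """Checks each bus access against a table of permitted regions."""

    def __init__(self, regions):
        self.regions = list(regions)

    def _find(self, address):
        for region in self.regions:
            if region.start <= address <= region.end:
                return region
        return None

    def check(self, address, kind):
        """Return None if the access is legal, else a description of the violation.

        kind is 'read', 'write', or 'fetch' (opcode fetch).
        """
        region = self._find(address)
        if region is None:
            return "access outside permitted memory area"
        if kind == "fetch" and not region.executable:
            return "opcode fetch from a non-opcode address"
        if kind == "read" and not region.readable:
            return "invalid read within permitted memory"
        if kind == "write" and not region.writable:
            return "invalid write within permitted memory"
        return None

# Hypothetical memory map: 32 KiB of ROM code followed by 8 KiB of RAM data.
checker = CapabilityChecker([
    Region(0x0000, 0x7FFF, readable=True, writable=False, executable=True),
    Region(0x8000, 0x9FFF, readable=True, writable=True, executable=False),
])

print(checker.check(0x8000, "fetch"))
```

Checks on instruction sequencing and branch destinations would need knowledge of the program structure and are not attempted in this sketch.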
2.4.3. Program Flow Monitoring 
Program flow monitoring schemes are typically based, to some extent, on control 
flow graphs. The graphs consist of linked nodes. Each node represents a sequence 
of instructions performing some task, and each link represents the transition conditions, 
i.e. status information. Lu [1982] proposed a scheme called 'structural integrity 
checking' involving the generation of a tag for each task. These tags are checked during 
execution to verify correct operation. Lu does not check transition conditions. 
Yau & Chen [1980] ensure each task does not have any inherent loops, hence preventing 
the possibility of infinite erroneous execution loops without potential detection. 
They also implement verification of transition conditions between tasks. 

Capability Checks 
a) incorrect sequence of instructions 
b) branch to invalid destination 
c) fetch illegal instruction 
d) fetch an opcode from a non-opcode address 
e) invalid read within permitted memory 
f) invalid write within permitted memory 
g) access to memory outside permitted memory area 
h) watchdog timer 
Table 2.4. : Capability Checks 

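The tag-based checking of task sequences described for program flow monitoring can be sketched as follows. The graph, the task tags, and the function name are hypothetical and are not taken from Lu [1982] or Yau & Chen [1980]; the sketch verifies only that each observed tag is a permitted successor of the previous one, and does not verify the transition conditions themselves.

```python
# Hypothetical control flow graph: task tag -> set of tags that may follow it.
CONTROL_FLOW_GRAPH = {
    "INIT": {"READ_SENSORS"},
    "READ_SENSORS": {"COMPUTE"},
    "COMPUTE": {"WRITE_OUTPUTS", "READ_SENSORS"},
    "WRITE_OUTPUTS": {"READ_SENSORS"},
}

def first_illegal_transition(trace, graph=CONTROL_FLOW_GRAPH, entry="INIT"):
    """Return the index of the first tag that violates the graph, or None."""
    if not trace:
        return None
    if trace[0] != entry:
        return 0
    for index in range(1, len(trace)):
        previous, current = trace[index - 1], trace[index]
        if current not in graph.get(previous, set()):
            return index
    return None

good = ["INIT", "READ_SENSORS", "COMPUTE", "WRITE_OUTPUTS", "READ_SENSORS"]
bad = ["INIT", "READ_SENSORS", "WRITE_OUTPUTS"]   # COMPUTE task skipped
print(first_illegal_transition(good), first_illegal_transition(bad))
```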
Task tags assigned values based on cyclic encoding of their instruction sequences 
are called 'signatures'. Two techniques implementing signatures, Path Signature 
Analysis [Namjoo, 1982] and Signatured Instruction Streams [Shen & Schuette, 1983] 
embed precomputed signatures into the application program. During program execu-
tion, special circuitry re-computes the signature and compares it with the embedded 
code signature, any ambiguity signalling detection of erroneous program flow. Both 
techniques impose a performance and code overhead. Schuette & Shen [1986] have 
implemented an embedded signature technique. The dedicated circuitry required 
3947 gates and 5435 bytes of memory, a hardware overhead of approximately 38% 
compared to the gate count of the host Motorola 68000 application processor. The 
memory overhead is incurred by embedded tags in the application program: typi-
cal overhead estimates range between 10 and 20% [Wilken & Shen 1987]. Finally, 
pseudo-branches, required by the implementation so that correct execution by-passes 
embedded signatures, are estimated to reduce application program performance by 
10%. The technique implemented in a Motorola 68000 processor system is reported 
to have a mean detection latency of less than 100 μs, the maximum latency being 
3.8 ms. This is a considerable improvement on the detection latency expected 
from watchdogs. Schuette & Shen [1986] and Segall et al [1988] report a 98% and 
94% coverage of program flow errors respectively. Wilken & Shen [1987] review the 
mechanism of several signature monitoring techniques that embed signatures in the 
application software. 
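The principle of embedded signature monitoring can be sketched in software. The example below is an assumption-laden illustration rather than the circuitry of any cited scheme: a 32-bit CRC from the Python standard library stands in for the cyclic code, the instruction bytes are arbitrary, and the 'monitor' simply recomputes the signature of each block as fetched and compares it with the value computed in advance.

```python
import zlib

def block_signature(instruction_bytes):
    """Signature of a branch-free block, here a 32-bit CRC of its bytes."""
    return zlib.crc32(bytes(instruction_bytes)) & 0xFFFFFFFF

def embed_signatures(blocks):
    """'Compile time': pair each block with its precomputed signature."""
    return [(list(block), block_signature(block)) for block in blocks]

def monitor(executed_blocks, program):
    """'Run time': recompute each signature and compare with the embedded one.

    Returns the index of the first mismatching block, or None if all agree.
    """
    for index, (fetched, (_, embedded)) in enumerate(zip(executed_blocks, program)):
        if block_signature(fetched) != embedded:
            return index
    return None

blocks = [[0x4E, 0x71, 0x30, 0x3C], [0x52, 0x40, 0x4E, 0x75]]  # arbitrary bytes
program = embed_signatures(blocks)

corrupted = [list(block) for block in blocks]
corrupted[1][0] ^= 0x04   # single bit flip in the second block

print(monitor(blocks, program), monitor(corrupted, program))
```

In the hardware schemes the comparison is made at the end of each block during execution, so the detection latency is bounded by the block length rather than by a timer interval.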
Namjoo [1983] proposes a smart watchdog to compute run-time signatures inde-
pendently to verify application processor behaviour. This technique does not incur 
the performance or code overhead associated with earlier schemes. Eifert & Schuette 
[1984] refine the technique, replacing the smart watchdog with dedicated circuitry. 
These techniques, whilst not degrading system performance or requiring extra memory, 
do require an additional hardware unit. Smart watchdogs introduce approximately 
100% redundancy by duplicating the number of processors. Replacing the 
watchdog with dedicated circuitry implies that the design is not directly applicable 
for use with different microprocessor types. 
2.4.4. Hazards Associated with Error Detection Techniques 
It should be noted that those fault tolerant techniques implementing hardware 
redundancy will also be susceptible to the effects of faults induced by the environ-
ment. In particular, hazard is associated with those detection techniques such as 
the smart watchdogs that use a microprocessor to monitor a microprocessor. Duba 
& Iyer [1988] and Choi et al [1989] in their fault simulations identify the watchdog 
element of their microprocessor system to be significantly vulnerable to temporary 
fault generation, and diagnose a critical fault propagation path between the control 
unit and the watchdog. The reliability of a microprocessor-based controller imple-
menting a watchdog device can be seriously undermined if the detection capability 
of the watchdog is lost. Damm [1988] refers to this occurrence as the 'doomsday' 
syndrome. 
2.4.5. A Novel Error Detection Technique 
This thesis proposes an alternative technique based on the self-detection of pro-
gram flow errors by erroneous execution. Potential program flow errors within the 
software are identified and the code structure enhanced by the strategic placement 
of detection mechanisms. These mechanisms can only be activated by erroneous pro-
gram flow. The technique does not inherently require additional hardware. The only 
overhead is the extended code requirement, and the associated additional execution 
to by-pass the inserted detection mechanisms. The code extension, and the degrada-
tion to application program performance by by-pass operations is comparable with 
that required by the embedded tags and pseudo-branches reported for an implementation 
of the embedded signature technique [Schuette & Shen, 1986], [Wilken & Shen, 1987]. 
2.5. Reliability Evaluation 
The reliability of microprocessor systems implementing fault masking can be 
assessed using architectural analysis. However, this method cannot be used in mi-
croprocessor systems implementing other fault tolerant techniques, such as those 
discussed in this chapter because of the uncertainty of fault detection. 
An alternative reliability evaluation which can be adapted to the fault tolerant 
techniques discussed in this chapter is given by Schuette & Shen [1986]. They pro-
pose the following estimation for reliability of a microprocessor-based system, Rs(t), 
employing signature monitoring, 
$$R_s(t) = [R(t) \cdot R_F(t)] + [(1 - R(t)) \cdot R_F(t) \cdot E] \qquad (2.3.)$$
where R(t), the reliability of the microprocessor, is the expectation of no 
error occurring in the microprocessor before time t, $R_F(t)$ is the expectation of no 
error occurring in the additional circuitry required by the fault tolerant mechanism 
before time t, and E is the error coverage of the employed detection mechanism. 
Re-arranging equation (2.3.) gives, 
$$R_s(t) = R_F(t) \cdot [R(t) + (1 - R(t)) \cdot E] \qquad (2.4.)$$
The correct operation of the system is dependent on avoiding the 'doomsday' syn-
drome discussed earlier, for the detection mechanism. Therefore the reliability of the 
system, Rs(t), is directly dependent on the reliability of the detection mechanism, 
RF(t). An additional constraint on system reliability is the sum of the probability 
that the processor is working correctly, R(t), with the probability of fault coverage by 
the detection mechanism when the processor is not working correctly, (1 — R(t)).E. 
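As a numerical illustration, equation (2.3.) and the form obtained by taking the detection mechanism reliability outside the brackets can be evaluated for a single instant t. The function names and the numerical values are illustrative assumptions and are not taken from any of the cited experiments.

```python
def system_reliability(r_processor, r_detector, coverage):
    """Equation (2.3.): R_s = R.R_F + (1 - R).R_F.E at one instant t."""
    return r_processor * r_detector + (1.0 - r_processor) * r_detector * coverage

def system_reliability_factored(r_processor, r_detector, coverage):
    """Equation (2.3.) with R_F factored out: R_s = R_F.[R + (1 - R).E]."""
    return r_detector * (r_processor + (1.0 - r_processor) * coverage)

# Illustrative values only.
r, r_f, e = 0.90, 0.99, 0.98

print(system_reliability(r, r_f, e))
print(system_reliability(r, r_f, 0.0))   # no coverage: reduces to R.R_F
```

With zero coverage the detection circuitry can only lower the system reliability, since it adds a further unit that must itself remain fault free.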
Reliability engineers commonly use the complementary reliability parameter Mean 
Time To Failure (MTTF) for failure rates; increases in the event rate reduce the 
MTTF expectations. Event rates are, as discussed earlier in this chapter, dependent 
on the occurrence of transient disturbances and the susceptibility of a system to this 
disturbance. It cannot be assumed that environmental conditions are stable - the 
mean rate of occurrence of transient disturbances may approximate to a constant, 
but it should be recognized that occurrences may cluster. Clustered occurrences are 
known as 'bursts', and within the C.vmp microprocessor system approximately 25% 
of observed fault events were bursts [McConnel & Siewiorek, 1978]. 
Variations in the event rate will alter the expected reliability for individual ap-
plications, and hence limit the usefulness of equations like (2.3.) when evaluating 
system reliability. However, equation (2.3.) can be applied to many other fault 
tolerant techniques, implemented in a simple single processor system, to provide a 
comparative index of their effectiveness. 
2.6. Summary and Conclusions 
Temporary faults have been diagnosed as causing between 15 and 50 times 
more failures in microprocessor systems than permanent faults. Reliability engineers 
attribute the generation of many temporary faults to the occurrence of transient 
environmental disturbances. 
Microprocessor-based controllers are often required to operate in harsh environ-
ments where transient disturbances are a regular hazard. Prevention techniques in-
volving screening and shielding can be applied in an attempt to eradicate the effects 
of transient disturbances on microprocessor controllers. In practice these techniques, 
however, only reduce the problem. 
Temporary faults are difficult to detect because of their limited duration. Fur-
thermore, the errors they generate can induce microprocessor malfunction and this 
may lead to erroneous operation of equipment under directives from the controller. 
Equipment operation may be haphazard and pose a danger in particular applications. 
Microprocessor malfunction should be detected rapidly to reduce the hazard of 
erroneous equipment operation. It is with this purpose that fault tolerant techniques 
have been developed. Fault tolerant techniques enable digital systems to isolate 
and repair the effects of temporary faults before restoring the application function. 
Program flow errors are identified as having particular influence on the character of 
microprocessor malfunction. A selection of techniques applicable to microprocessor-
based controllers are reviewed. In particular, techniques are discussed which incur a 
low system overhead. 
Assessment of the fault tolerant techniques reviewed in this chapter involves 
inserting faults into a microprocessor system and monitoring its response. These 
assessments only reflect the efficiency of the implemented fault tolerant techniques. 
A large number of observed computer system failures are attributed to temporary 
faults. Assessments of microprocessor system reliability should, therefore, take this 
class of fault into account. Reliability assessments should involve information includ-
ing knowledge of the susceptibility of the processor to transient disturbances, and 
the likelihood of such disturbances in the application environment. 
In summary, temporary faults can be responsible for a significant number of 
microprocessor system failures. Fault prevention techniques cannot guarantee the 
eradication of all faults. It is therefore pertinent to incorporate fault tolerant features 
into the system design. Effective fault tolerance can be implemented at low-cost. For 
microprocessor systems with safety applications, developed within limited financial 
budgets, these techniques can provide highly beneficial and cost-effective reliability. 
Chapter 3 
MODELLING ERRONEOUS MICROPROCESSOR BEHAVIOUR 
3.1. Introduction 
3.2. Initiating Erroneous Microprocessor Behaviour 
3.3. Erroneous Behaviour 
3.4. Erroneous Execution 
3.5. Halse Execution Model 
3.6. Hybrid Execution Model 
3.6.1. Linear Erroneous Execution 
3.6.2. Propagating Further Periods of Erroneous Execution 
3.6.3. Detection of Erroneous Execution 
3.6.4. Erroneous Execution Stall 
3.7. Reliability Analysis 
3.7.1. Failure Rate 
3.7.2. Probability of an Event Leading to Failure 
3.7.3. Reliability Evaluation 
3.7.4. Mean Time To Failure 
3.8. Availability 
3.9. Summary 
CHAPTER THREE 
MODELLING ERRONEOUS MICROPROCESSOR BEHAVIOUR 
3.1. Introduction 
This chapter investigates microprocessor behaviour with particular regard to fault 
conditions. Temporary hardware faults may disrupt software processing and induce 
erroneous execution. The event initiating erroneous behaviour is defined. A model is 
proposed to simulate erroneous microprocessor behaviour. This model is developed 
for the von Neumann class of microprocessor which has dominated processor design 
over the last thirty years. Erroneous behaviour is investigated. A facility for detecting 
erroneous execution is introduced into the model. The efficiency of detection is 
examined with respect to the latency between initiation and detection of erroneous 
behaviour. A stochastic reliability model is proposed to assess the effect of software 
disruption on microprocessor performance. A method is developed for calculating 
Mean Time To Failure (MTTF) of the microprocessor system. Availability is also 
determined under the assumption that the processor has a resident recovery routine in 
its memory. MTTF and availability are common engineering measures of reliability. 
Hence the modelled reliability for a microprocessor can be compared to other device 
reliabilities. 
The models presented are robust, not relying on the features of any 
microprocessor(s). Model application allows the analysis and comparison of a wide selection of 
microprocessors. 
3.2. Initiating Erroneous Microprocessor Behaviour 
Erroneous microprocessor behaviour is considered to occur when a temporary 
fault manifests itself as a control flow failure. Loss of the correctly operating control 
flow will cause the microprocessor to mis-interpret its software with the hazard of 
malfunction. The ensuing behaviour comprises propagating erroneous execution 
with a progressively increasing likelihood of catastrophic failure. 
Any control flow failure will be reflected by an erroneous target entry in the 
microprocessor's program counter, and hence, the following assumption is made. 
Assumption (1) : The event initiating erroneous micro-
processor behaviour is that of program 
counter corruption. 
The program counter is considered as a single register used to locate instructions 
through the whole address space. The nature of the program counter corruption 
is not known. The event initiating erroneous behaviour may have had a variety of 
sources including stack pointer corruption, bus-line transients, and memory bit-flips. 
It is assumed that all bits in the program counter are equally susceptible to error. 
Assumption (2) : The contents of the microprocessor pro-
gram counter are corrupted randomly by 
the event initiating erroneous behaviour. 
These assumptions enable a mathematical model, based on probability theory, 
to be developed for erroneous microprocessor behaviour. 
3.3. Erroneous Behaviour 
Consider erroneous behaviour to be initiated by random corruption of the pro-
gram counter. This event produces a jump in the control flow of the existing software 
to a random location in the address space of the microprocessor. This random jump 
is termed the Initial Erroneous Jump (IEJ). Erroneous execution then commences. 
The data contents of the memory at the new location will be executed as if they were 
instruction codes. Erroneous execution will take place in a linear fashion until the 
execution of a further jump instruction causes a Subsequent Erroneous Jump (SEJ). 
Repeated periods of linear erroneous execution interspersed by SEJs follow until ter-
minated either by catastrophic failure or system recovery. This process is shown in 
Figure 3.1., where the execution flow through the address space is shown as a stream 









3.4. Erroneous Execution 
Erroneous execution consists of a sequence of execution states, each state rep-
resenting the operation of an instruction. Execution states can be categorized with 
respect to the nature of their outcome. Halse [1984] identified the state outcomes 
listed below. 
Non-Jump : leads to the program counter pointing to 
the next instruction in the address space. 
Restart : leads to a jump to a predefined location 
in the address space. 
Unspecified Jump : leads to a jump to a new location in the 
address space determined by volatile mem-
ory contents. 
Return : leads to a jump to an address held in a 
stack. 
Stop/Wait : leads to a cessation of processing, and 
requires an interrupt or hardware reset 
to exit from this state. 
Restart outcomes are usually generated by interrupts or exceptions. The restart out-
come vectors execution to a location predefined by the microprocessor architecture. 
A recovery routine can be placed at the restart outcome vector target. Hence for 
controlled recovery, a restart outcome defines erroneous behaviour detection and an 
ordered return to a recovery routine. 
A model for erroneous execution is shown in Figure 3.2. The model shows erro-
neous microprocessor behaviour being entered by program counter corruption (IEJ). 
The cascading sequence of state outcomes throughout erroneous execution can be 
traced. Successive 'non-jump' state outcomes will produce a period of linear erro-
neous execution. A restart outcome can be used to provide detection of erroneous 
execution. Any of the remaining state outcomes, including a stop/wait outcome, are 
defined as generating a further erroneous jump (SEJ) in the erroneous execution. 
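The sequence of state outcomes described by the model can be traced with a small simulation. This is a sketch under stated assumptions, not the model developed in this chapter: each address is simply labelled with one outcome, the destination of every erroneous jump is assumed to be uniformly random, the outcome mix is arbitrary, and execution is capped at a maximum number of instructions.

```python
import random

OUTCOMES = ("non_jump", "restart", "unspecified_jump", "return", "stop_wait")

def simulate(memory, rng, max_instructions=100_000):
    """Trace erroneous execution from an IEJ until a restart outcome.

    memory is a list of outcome names, one per address. Returns a tuple
    (instructions_executed, subsequent_erroneous_jumps, detected).
    """
    size = len(memory)
    pc = rng.randrange(size)              # Initial Erroneous Jump (IEJ)
    executed = 0
    sejs = 0
    while executed < max_instructions:
        outcome = memory[pc]
        executed += 1
        if outcome == "non_jump":
            pc = (pc + 1) % size          # linear erroneous execution
        elif outcome == "restart":
            return executed, sejs, True   # detection: vector to recovery
        else:
            # Unspecified jump, return, or stop/wait: treated here as a
            # Subsequent Erroneous Jump (SEJ) to a random address.
            pc = rng.randrange(size)
            sejs += 1
    return executed, sejs, False

# Hypothetical memory image: outcome mix chosen for illustration only.
rng = random.Random(1)
weights = [0.80, 0.05, 0.08, 0.05, 0.02]
memory = rng.choices(OUTCOMES, weights=weights, k=4096)
executed, sejs, detected = simulate(memory, rng)
print(executed, sejs, detected)
```

The number of instructions executed before the restart outcome corresponds to the detection latency examined later in the chapter.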
Figure 3.2. : Erroneous Execution Model 
(* manifested transient fault) 

In a particular make of microprocessor, not all possible instruction bit patterns, 
when executed, result in one of the state outcomes defined above. The actual state 
outcome depends on the particular microprocessor die, as manufacturers are not 
obliged to ensure that every batch produces the same operation. In some machines, 
such as the Motorola 68000 and Intel 80386, the execution of all undefined instructions 
is specified as an exception (software interrupt) and hence will produce a restart state 
outcome. 
3.5. Halse Execution Model 
This section briefly reviews the foundations of a model for microprocessor op-
eration proposed by Halse [1984]. The model analyses erroneous microprocessor 
operation. 
A statistical model of erroneous execution is made using the assumption that 
the memory contents, throughout the address space, have a distribution that does 
not change. This clearly does not reflect the memory content distribution for real 
microprocessor based systems. Their distribution will vary through the memory map 
dependent on the utilization of locations. Nevertheless, the model does enable the 
identification of some general characteristics of erroneous execution. 
As a result of erroneous behaviour, erroneous execution will interpret some lo-
cation in the address space as an instruction. This results in one of two outcomes. 
Either an erroneous jump is generated which transfers control to another part of the 
address space; or no jump occurs and control passes onto the next logical location. 
Let the probability of an instruction execution yielding a 'jump' or 'non-jump' 
outcome be $P_J$ and $P_{NJ}$ respectively. Hence by definition, 
$$P_J = 1 - P_{NJ} \qquad (3.1.)$$
It follows that the probability of terminating a period of linear execution on the kth 
instruction (i.e. generating an erroneous jump) is given by 
$$P_J(k) = P_{NJ}^{(k-1)} \cdot P_J, \quad \text{where } k \geq 1. \qquad (3.2.)$$
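Equations (3.1.) and (3.2.) describe a geometric distribution, which can be checked numerically. The sketch below assumes that the exponent in equation (3.2.) is (k - 1) with k starting at 1; the jump probability of 0.2 is an illustrative value only.

```python
def p_jump_at(k, p_j):
    """Equation (3.2.): probability that linear execution ends on instruction k."""
    if k < 1:
        return 0.0
    p_nj = 1.0 - p_j                      # equation (3.1.)
    return p_nj ** (k - 1) * p_j

p_j = 0.2                                 # illustrative value only
total = sum(p_jump_at(k, p_j) for k in range(1, 200))
mean_length = sum(k * p_jump_at(k, p_j) for k in range(1, 200))
print(total, mean_length)
```

The probabilities sum to one (to within the truncation of the series) and the mean length of a period of linear execution approaches 1/P_J instructions.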
When evaluating a microprocessor's erroneous behaviour, it is more realistic to 
use effective instruction outcome distributions rather than instruction outcome distri-
butions based upon instruction set definitions. It is recognized that some instructions 
have different outcomes dependent on some conditional test. In particular it is noted 
that conditional branch instructions can be paired, such that groups of two instructions 
cover a condition and its complement. Hence each pair of conditional branch 
instructions, such as a 'branch if zero' and 'branch if not zero', can be treated as 
if it were a single jump instruction and a single non-jump instruction. Halse [1984] 
assumes conditional instructions to have a 50% chance of occurring. Although this 
is strictly not true for individual conditions, the overall treatment of conditional 
instructions in this manner is considered valid. 
Let $N_{eJ}$ be the effective number of address space locations that, when interpreted 
as an instruction, generate a 'jump' outcome. Let $N_I$ be the number of 
address space locations that could be interpreted as an instruction. 
Then: 
$$P_J = \frac{N_{eJ}}{N_I} \qquad (3.3.)$$
and, 
$$N_{eJ} = N_{eRN} + N_{eRT} + N_{eS/W} + N_{eUJ} \qquad (3.4.)$$
where $N_{eRN}$, $N_{eRT}$, $N_{eS/W}$, and $N_{eUJ}$ are the effective numbers of 'return', 'restart', 
'stop/wait', and 'unspecified jump' outcome instructions in the available address 
space. 
The probability that termination of a period of linear execution results in a 
particular outcome, $P_x(k)$, is dependent on the proportion of that type of instruction 
in the set of 'jump' instructions. Collecting equations (3.2.), (3.3.), and (3.4.) gives, 
$$P_x(k) = \begin{cases} 0, & \text{when } k = 0, \\ \dfrac{N_{ex}}{N_{eJ}} \cdot P_J(k), & \text{when } k \geq 1, \end{cases} \qquad (3.5.)$$
where the subscript x denotes the respective jump outcomes; $x \in \{RT, UJ, RN, S/W\}$ 
represents restart, unspecified jump, return, and stop/wait. 
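The calculation of outcome probabilities from effective instruction counts can be sketched as follows. The sketch assumes that the jump probability is the ratio of the effective number of jump locations to the number of locations interpretable as instructions, and that each outcome takes a share of P_J(k) in proportion to its effective count; the counts themselves are invented for illustration, with a conditional branch contributing 0.5 under the 50% assumption.

```python
def outcome_probabilities(k, counts, n_locations):
    """Probability of each jump outcome on the k-th instruction.

    counts maps 'RT', 'UJ', 'RN', 'S/W' to effective numbers of locations;
    n_locations is the number of locations interpretable as instructions.
    """
    n_ej = sum(counts.values())                  # equation (3.4.)
    p_j = n_ej / n_locations                     # assumed jump probability
    if k < 1 or n_ej == 0:
        return {x: 0.0 for x in counts}
    p_j_k = (1.0 - p_j) ** (k - 1) * p_j         # equation (3.2.)
    return {x: (n_ex / n_ej) * p_j_k for x, n_ex in counts.items()}

# Illustrative effective counts for a 256-pattern opcode map.
counts = {"RT": 8.0, "UJ": 20.5, "RN": 6.0, "S/W": 2.0}
probs = outcome_probabilities(1, counts, 256)
print(probs)
```

For k = 1 each probability reduces to the effective count divided by the number of locations, so the restart outcome above has probability 8/256.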
3.6. Hybrid Execution Model 
A hybrid model is proposed here that facilitates further investigation of erroneous 
microprocessor behaviour. Erroneous behaviour has two phases of execution; lin-
ear execution following an IEJ, and linear execution following an SEJ. The hybrid 
model enables the examination of the characteristics associated with each of the two 
patterns. In particular, three mechanisms have been identified that terminate linear 
erroneous execution. 
a) Another period of linear erroneous execution is initiated: an 'un-
specified jump' or 'return' state is entered. 
b) Processing stalls: a 'stop/wait' state is entered. 
c) Detection of erroneous execution: a 'restart' state is entered. 
This section initially models the periods of linear erroneous execution associated 
with each of the two patterns of behaviour. These are then developed to investigate 
the probability of further periods of linear erroneous execution being initiated, stalled, 
or detected. 
3.6.1. Linear Erroneous Execution 
Both restart and stop/wait erroneous jump outcomes terminate erroneous execu-
tion. Unspecified jump and return outcomes initiate a new period of linear erroneous 
execution. Let $P_{J'}(k)$ represent the probability of the kth instruction processed yielding 
a jump outcome, other than a restart or stop/wait, following an erroneous jump. 
The subscript J' represents the jump outcomes; unspecified jump, and return. 
$$P_{J'}(k) = \sum_{x \in \{UJ, RN\}} P_x(k), \quad \text{where } k \geq 0. \qquad (3.6.)$$
Logical processing errors such as 'divide by zero' can cause premature completion 
of an instruction's execution, generating an otherwise unexpected restart outcome 
within a fraction of the clock cycles normally required by the instruction. These 
errors have different influences on each of the two phases of erroneous behaviour. An 
IEJ may not initiate erroneous execution but fire what is considered by the model 
to be an immediate restart outcome. The same assumption is made for an SEJ such 
that the SEJ instruction is considered to fire a restart rather than a jump outcome. 
Let $\beta$ and $\gamma$ be the respective proportions of IEJs and SEJs firing a restart outcome 
in this way, such that $0 \leq \beta \leq 1$ and $0 \leq \gamma \leq 1$. 
Consider the execution model probability for the outcome of the kth processed 
instruction, equation (3.5.). Let $P_x^{IEJ}(k)$ and $P_x^{SEJ}(k)$ be the probability density 
functions for the execution outcome of the kth processed instruction. The subscript 
x denotes the class of outcome; $x \in \{RT, UJ, RN, S/W\}$ representing restart, unspecified 
jump, return, and stop/wait outcomes respectively. The superscript IEJ 
or SEJ denotes execution following an IEJ or SEJ respectively. 
Evaluating erroneous execution following an IEJ when k = 0, 
$$P_x^{IEJ}(k) = \begin{cases} 0, & \text{where } x \in \{UJ, RN, S/W\}, \\ \beta, & \text{where } x \in \{RT\}, \end{cases} \qquad (3.7a.)$$
and when $k \geq 1$, 
$$P_x^{IEJ}(k) = \begin{cases} (1 - \beta) \cdot [(1 - \gamma) \cdot P_x(k)], & \text{where } x \in \{UJ, RN, S/W\}, \\ (1 - \beta) \cdot [P_x(k) + \gamma \cdot P_{J'}(k)], & \text{where } x \in \{RT\}. \end{cases} \qquad (3.7b.)$$
Evaluating erroneous execution following an SEJ when k = 0, 
$$P_x^{SEJ}(k) = 0, \quad \text{where } x \in \{RT, UJ, RN, S/W\}, \qquad (3.8a.)$$
and when $k \geq 1$, 
$$P_x^{SEJ}(k) = \begin{cases} (1 - \gamma) \cdot P_x(k), & \text{where } x \in \{UJ, RN, S/W\}, \\ P_x(k) + \gamma \cdot P_{J'}(k), & \text{where } x \in \{RT\}. \end{cases} \qquad (3.8b.)$$
Let $P_J^y(k)$ be the probability of terminating a period of linear erroneous execution, 
where the subscript x denotes the class of outcome; $x \in \{RT, UJ, RN, S/W\}$ representing 
restart, unspecified jump, return, and stop/wait respectively. The superscript 
y denotes execution following an erroneous jump; $y \in \{IEJ, SEJ\}$ representing IEJ 
and SEJ respectively. 
$$P_J^y(k) = \sum_{x} P_x^y(k). \qquad (3.9.)$$
The probability that k instructions have been linearly processed in the current
phase of linear execution is evolved from equation (3.9.),
$$P_L^y(k) = 1 - \sum_{i=0}^{k} P_J^y(i), \quad \text{where } k \ge 0, \text{ and } y \in \{IEJ, SEJ\}. \tag{3.10.}$$
The mean number of instructions expected to be executed during each phase of
erroneous behaviour, $I$, is given by,
$$I = E[K] = \sum_k k \cdot P_J^y(k), \quad \text{where } k \ge 0, \text{ and } y \in \{IEJ, SEJ\}, \tag{3.11.}$$
where K is the random variable of the probability density function, defined by
equation (3.9.), $P_J^y(k)$.
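Equations (3.10.) and (3.11.) amount to a running sum and an expectation over the
termination probabilities. A minimal sketch follows; the geometric termination law
and its per-instruction jump probability of 0.2 are assumptions made purely for
illustration and are not taken from the model parameters.

```python
def linear_execution_stats(p_term):
    """p_term[k] is the probability that linear erroneous execution
    terminates on the kth processed instruction (equation 3.9.)."""
    cumulative = 0.0
    p_continue = []     # equation (3.10.): still executing linearly after k
    mean_length = 0.0   # equation (3.11.): E[K]
    for k, p in enumerate(p_term):
        cumulative += p
        p_continue.append(1.0 - cumulative)
        mean_length += k * p
    return p_continue, mean_length


# Hypothetical termination law: no termination at k = 0, then geometric.
p_jump = 0.2
p_term = [0.0] + [p_jump * (1 - p_jump) ** (k - 1) for k in range(1, 400)]
p_continue, mean_length = linear_execution_stats(p_term)
```

For this assumed law the mean length approaches the reciprocal of the jump
probability, which is consistent with the remark that a zero jump probability gives
an unbounded period of linear erroneous execution.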
If the probability of a jump in either of the patterns of erroneous behaviour is zero
then the number of instructions processed during linear erroneous execution is infinite.
This, of course, assumes that repeated passes of execution through the address space
due to program counter overflow, are considered as a single period of linear erroneous
execution.
3.6.2. Propagating Further Periods of Linear Erroneous Execution
Periods of linear erroneous execution are propagated when erroneous behaviour
generates a SEJ. The probability of a SEJ outcome on the kth processed instruction
during either of the two patterns of linear erroneous execution is developed from
equation (3.6.),
$$P_{SEJ}^y(k) = \sum_x P_x^y(k), \quad \text{where } k \ge 0, \tag{3.12.}$$
where the subscript x denotes the class of outcome; $x \in \{UJ, RN\}$ representing
unspecified jump, and return respectively. The superscript y denotes execution following
an erroneous jump; $y \in \{IEJ, SEJ\}$ representing IEJ and SEJ respectively.
3.6.3. Detection of Erroneous Execution
Error detection latency is defined as the time between the initiation of erroneous
behaviour and its detection. This parameter is an important performance
characteristic when evaluating detection techniques [Blough & Masson, 1987]. The discrete
state nature of the microprocessor model presented in this chapter means that error
detection latency is determined as a function of the number of erroneously processed
instructions during erroneous behaviour.
Detection of erroneous behaviour can be provided by implementation of restart 
outcomes. Restart outcomes take execution to an address space location predefined 
by the processor architecture. A recovery routine can be placed at this address. Now 
restart outcomes produce controlled return to the recovery routine. Hence erroneous 
behaviour is detected by a restart outcome. In order to remove any ambiguity, recov-
ery routines are only placed for restart outcomes generated by erroneous behaviour. 
Detection of erroneous behaviour may occur during the linear erroneous execution
following an IEJ, or one or more SEJs. The probability of detection D(k) on the kth
processed instruction of erroneous execution is given by,
$$D(k) = D_{IEJ}(k) + D_{SEJ}(k), \quad \text{where } k \ge 0, \tag{3.13.}$$
where $D_{IEJ}(k)$ and $D_{SEJ}(k)$ respectively represent the detection coefficients of
erroneous execution following either an IEJ or SEJ.
The detection coefficients are derived using the execution characteristics of
erroneous behaviour during processing following an IEJ and SEJ. Let $P_{RT}^{IEJ}(k)$ and
$P_{RT}^{SEJ}(k)$ represent the probability of the kth instruction processed yielding detection
(restart outcome) of erroneous behaviour following an IEJ and SEJ respectively.
The detection coefficient $D_{IEJ}(k)$ is the probability of detecting erroneous
execution on the kth instruction before a SEJ occurs,
$$D_{IEJ}(k) = P_{RT}^{IEJ}(k), \quad \text{where } k \ge 0. \tag{3.14.}$$
The detection coefficient $D_{SEJ}(k)$ sums the probability of all possible execution paths
resulting in detection after one or more SEJs. Every such execution route requires at
least one SEJ, other than a stop/wait or restart outcome representing a processing
stall and error detection respectively, after the IEJ commencing erroneous behaviour.
$$D_{SEJ}(k) = \sum_{m=0}^{k} P_{SEJ}^{IEJ}(m) \cdot \sum_{n=0}^{k-m} V(n) \cdot P_{RT}^{SEJ}(k-m-n), \tag{3.15a.}$$
$$V(n) = \begin{cases} 1, & \text{when } n = 0, \\ \sum_{z=1}^{n} P_{SEJ}^{SEJ}(z) \cdot V(n-z), & \text{when } n \ge 1. \end{cases} \tag{3.15b.}$$
Equation (3.15b.) calculates the probability associated with periods of erroneous
execution initiated by and terminating with an unspecified jump or return outcome.
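To demonstrate the function of equation (3.13.) consider a simple example to
determine the probability of detecting erroneous execution before four instructions
have been erroneously processed. That is, evaluate D(k) when k = 3. From equation
(3.13.),
$$D(3) = D_{IEJ}(3) + D_{SEJ}(3). \tag{3.16.}$$
Equation (3.14.) yields the probability of detecting erroneous execution when no
SEJs occur,
$$D_{IEJ}(3) = P_{RT}^{IEJ}(3). \tag{3.17.}$$
Equation (3.15.) yields the probability of detecting erroneous execution when one or
more SEJs occur,
$$\begin{aligned} D_{SEJ}(3) = {} & P_{SEJ}^{IEJ}(0) \cdot 1 \cdot P_{RT}^{SEJ}(3) \; + \\ & P_{SEJ}^{IEJ}(0) \cdot P_{SEJ}^{SEJ}(1) \cdot 1 \cdot P_{RT}^{SEJ}(2) \; + \\ & P_{SEJ}^{IEJ}(0) \cdot P_{SEJ}^{SEJ}(1) \cdot P_{SEJ}^{SEJ}(1) \cdot 1 \cdot P_{RT}^{SEJ}(1) \; + \\ & P_{SEJ}^{IEJ}(0) \cdot P_{SEJ}^{SEJ}(2) \cdot 1 \cdot P_{RT}^{SEJ}(1) \; + \\ & P_{SEJ}^{IEJ}(0) \cdot P_{SEJ}^{SEJ}(1) \cdot P_{SEJ}^{SEJ}(1) \cdot P_{SEJ}^{SEJ}(1) \cdot 1 \cdot P_{RT}^{SEJ}(0) \; + \\ & P_{SEJ}^{IEJ}(0) \cdot P_{SEJ}^{SEJ}(1) \cdot P_{SEJ}^{SEJ}(2) \cdot 1 \cdot P_{RT}^{SEJ}(0) \; + \\ & P_{SEJ}^{IEJ}(0) \cdot P_{SEJ}^{SEJ}(2) \cdot P_{SEJ}^{SEJ}(1) \cdot 1 \cdot P_{RT}^{SEJ}(0) \; + \\ & P_{SEJ}^{IEJ}(0) \cdot P_{SEJ}^{SEJ}(3) \cdot 1 \cdot P_{RT}^{SEJ}(0) \; + \\ & P_{SEJ}^{IEJ}(1) \cdot 1 \cdot P_{RT}^{SEJ}(2) \; + \\ & P_{SEJ}^{IEJ}(1) \cdot P_{SEJ}^{SEJ}(1) \cdot 1 \cdot P_{RT}^{SEJ}(1) \; + \\ & P_{SEJ}^{IEJ}(1) \cdot P_{SEJ}^{SEJ}(1) \cdot P_{SEJ}^{SEJ}(1) \cdot 1 \cdot P_{RT}^{SEJ}(0) \; + \\ & P_{SEJ}^{IEJ}(1) \cdot P_{SEJ}^{SEJ}(2) \cdot 1 \cdot P_{RT}^{SEJ}(0) \; + \\ & P_{SEJ}^{IEJ}(2) \cdot 1 \cdot P_{RT}^{SEJ}(1) \; + \\ & P_{SEJ}^{IEJ}(2) \cdot P_{SEJ}^{SEJ}(1) \cdot 1 \cdot P_{RT}^{SEJ}(0) \; + \\ & P_{SEJ}^{IEJ}(3) \cdot 1 \cdot P_{RT}^{SEJ}(0), \end{aligned} \tag{3.18.}$$
where equation (3.7.) and equation (3.8.) define $P_{SEJ}^{IEJ}(0) = 0$, $P_{SEJ}^{SEJ}(0) = 0$, and
$P_{RT}^{SEJ}(0) = 0$.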
Substituting equation (3.17.) and equation (3.18.) into equation (3.16.) gives,
$$\begin{aligned} D(3) = {} & P_{RT}^{IEJ}(3) \; + \\ & P_{SEJ}^{IEJ}(1) \cdot 1 \cdot P_{RT}^{SEJ}(2) \; + \\ & P_{SEJ}^{IEJ}(1) \cdot P_{SEJ}^{SEJ}(1) \cdot 1 \cdot P_{RT}^{SEJ}(1) \; + \\ & P_{SEJ}^{IEJ}(2) \cdot 1 \cdot P_{RT}^{SEJ}(1). \end{aligned} \tag{3.19.}$$
This simple example demonstrates the function of equation (3.13.). Although
complex, the equation does simplify to produce an almost intuitive result.
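The worked example can be checked numerically. The sketch below is one possible
reading of equations (3.13.) to (3.15.), with the double sum inferred from the terms
of the expansion above; the probability tables are hypothetical values chosen only
for illustration, with the k = 0 entries set as equations (3.7a.) and (3.8a.) require.

```python
from functools import lru_cache


def make_detection(p_sej_iej, p_sej_sej, p_rt_iej, p_rt_sej):
    """Build D(k) from four functions mapping k to a probability."""

    @lru_cache(maxsize=None)
    def v(n):
        # Equation (3.15b.): chains of further SEJs spanning n instructions.
        if n == 0:
            return 1.0
        return sum(p_sej_sej(z) * v(n - z) for z in range(1, n + 1))

    def d_sej(k):
        # Equation (3.15a.): first SEJ at m, SEJ chain of length n, then restart.
        return sum(
            p_sej_iej(m) * v(n) * p_rt_sej(k - m - n)
            for m in range(k + 1)
            for n in range(k - m + 1)
        )

    def d(k):
        # Equation (3.13.): detection before any SEJ plus detection after SEJs.
        return p_rt_iej(k) + d_sej(k)

    return d


# Hypothetical probabilities indexed by k (illustration only).
SEJ_IEJ = [0.0, 0.10, 0.08, 0.06]
SEJ_SEJ = [0.0, 0.12, 0.09, 0.07]
RT_IEJ = [0.5, 0.02, 0.015, 0.01]
RT_SEJ = [0.0, 0.03, 0.02, 0.015]

detect = make_detection(SEJ_IEJ.__getitem__, SEJ_SEJ.__getitem__,
                        RT_IEJ.__getitem__, RT_SEJ.__getitem__)
d3 = detect(3)
```

With these assumed tables, the value returned for k = 3 agrees with the four-term
result of equation (3.19.), since every other path contains a factor that is zero at
k = 0.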
Error detection latency $L_d$ can be determined for the microprocessor model by
calculating the expectation of detecting erroneous behaviour:
$$L_d = E[K] = \sum_k k \cdot D(k), \tag{3.20.}$$
where K is the random variable of the detection function D(k).
This formula determines the mean number of instructions expected to be processed
before detection.
3.6.4. Erroneous Execution Stall
A processing stall is considered to occur when the microprocessor enters a stop/wait
state. Execution stalls require an external hardware interrupt to facilitate state
exit and continuing execution, but the occurrence of such events during erroneous
execution is unknown. It is therefore important to predict the significance of this
eventuality. The following evaluations determine the probability of a processing stall
during erroneous execution.
Erroneous execution may stall during the linear erroneous execution following
an IEJ, or one or more SEJs. The probability of stalling S(k) on the kth processed
instruction of erroneous execution is given by,
$$S(k) = S_{IEJ}(k) + S_{SEJ}(k), \quad \text{where } k \ge 0, \tag{3.21.}$$
where $S_{IEJ}(k)$ and $S_{SEJ}(k)$ respectively represent the stalling coefficients of erroneous
execution following either an IEJ or SEJ.
The stalling coefficients are derived using the execution characteristics of
erroneous behaviour during processing following an IEJ and SEJ. Let $P_{S/W}^{IEJ}(k)$ and
$P_{S/W}^{SEJ}(k)$ represent the probability of the kth instruction processed stalling (stop/wait
outcome) erroneous behaviour following an IEJ and SEJ respectively.
The stalling coefficient $S_{IEJ}(k)$ is given by,
$$S_{IEJ}(k) = P_{S/W}^{IEJ}(k), \quad \text{where } k \ge 0, \tag{3.22.}$$
which is the probability of execution following an IEJ terminating with a stall outcome
before a SEJ occurs.
The stalling coefficient $S_{SEJ}(k)$ represents all possible execution routes to a stall
incorporating one or more SEJs. Every such execution route requires at least one
SEJ, other than a stop/wait or restart outcome representing $S_{IEJ}(k)$ and detection
respectively, after the IEJ commencing erroneous behaviour ($P_{SEJ}^{IEJ}$). The stalling
execution path may or may not include SEJs propagating further periods of linear
erroneous execution ($V$). Finally, an execution path generating a processing stall
must terminate with a stop/wait outcome ($P_{S/W}^{SEJ}$). Hence,
$$S_{SEJ}(k) = \sum_{m=0}^{k} P_{SEJ}^{IEJ}(m) \cdot \sum_{n=0}^{k-m} V(n) \cdot P_{S/W}^{SEJ}(k-m-n), \tag{3.23.}$$
where $V(n)$ is defined by equation (3.15b.).
The recursive nature of equation (3.15.) is similar to the above equation, and its
functional description can be shared.
The stalling latency $L_s$ can be determined for the microprocessor model by
calculating the expectation of stalling erroneous behaviour:
$$L_s = E[K] = \sum_k k \cdot S(k), \tag{3.24.}$$
where K is the random variable of the stalling function S(k).
3.7. Reliability Analysis
The reliability model proposed here defines a microprocessor system failure as
hazardous behaviour rather than loss of function. Hazardous behaviour is
unpredictable and may mutilate system integrity and/or lead to catastrophic failure. The
model outlined below describes how loss of function may not immediately induce
hazardous behaviour if there is automatic repair.
The reliability of a microprocessor system can be analysed using a state/time random
variable stochastic model. Let a microprocessor occupy one of two behavioural
states: controlled (C), and uncontrolled (U). The processor remains in a controlled
state until the occurrence of an event induces a transition to the uncontrolled state.
Such transitions are considered system failures and mark the initiation of hazardous
microprocessor operation. The probability of a transition in time $\delta t$ is $\lambda(t) \cdot \delta t$, where
$\lambda(t)$ is the failure rate. The reliability model is shown in Figure 3.3. The model is
called a Markov Process because of its discrete-state, continuous-time nature.
3.7.1. Failure Rate, $\lambda(t)$
Let the sample space $E_s$ comprise a set of events corresponding to initiated
erroneous behaviour, that is, IEJs. Let $E_r \in E_s$ where $E_r$ is an event leading to
recovery, and $E_f \in E_s$ where $E_f$ is an event leading to failure. Within the sample
space the conditions $E_r \cup E_f = E_s$, and $E_r \cap E_f = \emptyset$ exist.
Let the probability of the event $E_r$ and $E_f$ be $P(E_r)$ and $P(E_f)$ respectively,
which leads to
$$P(E_r) + P(E_f) = 1. \tag{3.25.}$$
Now, assuming the event $E_s$ occurs randomly at a rate of q events per hour, then the
failure rate of the microprocessor is given by
$$\lambda(t) = q \cdot P(E_f). \tag{3.26.}$$
The failure rate is not time dependent giving,
$$\lambda(t) = \text{constant} = \lambda, \tag{3.27.}$$
so the Markov Process is termed homogeneous.
Equation (3.25.) shows that the probability of an event leading to failure is dependent
on the event not leading to recovery. The probability of recovery is itself dependent
on the detection of erroneous behaviour.
[Figure 3.3.: Reliability Model. C: Controlled Behaviour State; U: Uncontrolled Behaviour State; $\lambda(t)$: Failure Rate.]
3.7.2. Probability of an Event E Leading to Failure, $P(E_f)$
A stringent specification of failure requires the uncontrolled (erroneous)
execution of one or more instructions to be complete. The failure event is therefore any
outcome, other than a restart, generated on or before completion of the first
erroneously processed instruction following an IEJ. The detection capability of a restart
outcome means that its operation is controlled. The probability of a failure event is
expressed as,
$$P(E_f) = 1 - \sum_k D(k), \quad \text{where } k \le 1. \tag{3.28.}$$
The cumulative density of the detection function incorporates the probability of
detection through the two basic phases of erroneous execution: execution following
an IEJ and SEJ. Properties of these phases are now given in respect of the
microprocessor model. A jump outcome, other than restart, following an IEJ or
SEJ can only occur when an instruction has completed its processing. Equations
(3.7a.) and (3.8a.) yield $P_{SEJ}^{IEJ}(0)$ and $P_{SEJ}^{SEJ}(0)$ with nil probabilities. Equation
(3.8a.) also yields $P_{RT}^{SEJ}(0)$ a nil probability. Evaluating the effective instruction
distribution in the address space given by equation (3.7.) yields $P_{RT}^{IEJ}(0) = \beta$, and
$P_{RT}^{IEJ}(1) = (1-\beta) \cdot \left[P_{RT}(1) + \gamma \cdot [P_{UJ}(1) + P_{RN}(1) + P_{S/W}(1)]\right]$. Substituting equation
(3.13.) into equation (3.28.) and applying these conditions gives,
$$P(E_f) = 1 - \left\{\beta + (1-\beta) \cdot \left(P_{RT}(1) + \gamma \cdot [P_{UJ}(1) + P_{RN}(1) + P_{S/W}(1)]\right)\right\}, \tag{3.29.}$$
substituting equations (3.2.) and (3.6.), 
P(Ef) = 1 - Ne., 
' Neyj + NeRN + Nes/w 
Nej 
substituting equation (3.4.), 
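$$P(E_f) = 1 - \left\{\beta + \frac{(1-\beta)}{N} \cdot \left(Ne_{RT} + \gamma \cdot [Ne_{UJ} + Ne_{RN} + Ne_{S/W}]\right)\right\}, \tag{3.31.}$$
$$P(E_f) = (1-\beta) \cdot \left\{1 - \frac{Ne_{RT} + \gamma \cdot [Ne_{UJ} + Ne_{RN} + Ne_{S/W}]}{N}\right\}, \tag{3.32.}$$
and from equation (3.25.),
$$P(E_r) = \beta + \frac{(1-\beta)}{N} \cdot \left(Ne_{RT} + \gamma \cdot [Ne_{UJ} + Ne_{RN} + Ne_{S/W}]\right). \tag{3.33.}$$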
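Equations (3.26.), (3.32.) and (3.33.) reduce to simple arithmetic on the instruction
counts. The sketch below is illustrative only: the function names are hypothetical,
the denominator is taken to be the total number of instructions counted in the
effective instruction set, and the counts and event rate are invented values rather
than data for any of the processors studied.

```python
def failure_probability(n_total, ne_rt, ne_uj, ne_rn, ne_sw, beta=0.0, gamma=0.0):
    """Equation (3.32.): probability that an IEJ leads to failure."""
    detected = ne_rt + gamma * (ne_uj + ne_rn + ne_sw)
    return (1.0 - beta) * (1.0 - detected / n_total)


def recovery_probability(n_total, ne_rt, ne_uj, ne_rn, ne_sw, beta=0.0, gamma=0.0):
    """Equation (3.33.): probability that an IEJ leads to recovery."""
    detected = ne_rt + gamma * (ne_uj + ne_rn + ne_sw)
    return beta + (1.0 - beta) * detected / n_total


def failure_rate(q, p_fail):
    """Equation (3.26.): failures per hour for q IEJ events per hour."""
    return q * p_fail


# Hypothetical instruction counts (illustration only).
counts = dict(n_total=256, ne_rt=2, ne_uj=20, ne_rn=6, ne_sw=2)
p_f = failure_probability(beta=0.5, gamma=0.0, **counts)
p_r = recovery_probability(beta=0.5, gamma=0.0, **counts)
lam = failure_rate(q=10.0, p_fail=p_f)
```

The two probabilities always sum to one, as equation (3.25.) requires, so either
function could be derived from the other.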
3.7.3. Reliability Evaluation
Consider the respective probabilities, for the reliability model, of being in a
controlled state or uncontrolled state at time $t + \delta t$.
$$P_C(t + \delta t) = [1 - \lambda(t) \cdot \delta t] \cdot P_C(t), \tag{3.34.}$$
$$P_U(t + \delta t) = [\lambda(t) \cdot \delta t] \cdot P_C(t) + 1 \cdot P_U(t). \tag{3.35.}$$
Substituting equation (3.27.) and re-arranging equations (3.34.) and (3.35.) gives,
$$\frac{P_C(t + \delta t) - P_C(t)}{\delta t} = -\lambda \cdot P_C(t), \tag{3.36.}$$
$$\frac{P_U(t + \delta t) - P_U(t)}{\delta t} = \lambda \cdot P_C(t), \tag{3.37.}$$
and passing to a limit as $\delta t \to 0$ yields,
$$\frac{d\{P_C(t)\}}{dt} = -\lambda \cdot P_C(t), \tag{3.38.}$$
$$\frac{d\{P_U(t)\}}{dt} = \lambda \cdot P_C(t). \tag{3.39.}$$
Re-arranging and integrating equation (3.38.),
$$\int \frac{d\{P_C(t)\}}{P_C(t)} = -\lambda \int dt, \tag{3.40.}$$
$$\ln\{P_C(t)\} = -\lambda t + C_1, \tag{3.41.}$$
$$P_C(t) = \exp\{-\lambda t + C_1\}. \tag{3.42.}$$
Applying the initial conditions; when t = 0, then $P_C(t) = 1$ and $P_U(t) = 0$ giving
$C_1 = 0$. Hence equation (3.42.) becomes,
$$P_C(t) = \exp\{-\lambda t\}. \tag{3.43.}$$
Now re-arranging and integrating equation (3.39.) gives,
$$\int d\{P_U(t)\} = \lambda \int P_C(t) \, dt, \tag{3.44.}$$
$$\int d\{P_U(t)\} = \lambda \int e^{-\lambda t} \, dt, \tag{3.45.}$$
$$P_U(t) = -\exp\{-\lambda t\} + C_2. \tag{3.46.}$$
Again applying the initial conditions; when t = 0, then $P_C(t) = 1$ and $P_U(t) = 0$
giving $C_2 = 1$. Hence equation (3.46.) becomes,
$$P_U(t) = 1 - \exp\{-\lambda t\}. \tag{3.47.}$$
The reliability of the microprocessor system in the model is given by the
probability of the system remaining in a controlled state. That is,
$$R(t) = P_C(t), \tag{3.48.}$$
$$R(t) = \exp\{-\lambda t\}, \tag{3.49.}$$
substituting equations (3.26.), (3.27.) and (3.32.) gives
$$R(t) = \exp\left\{-q \cdot (1-\beta) \cdot \left(1 - \frac{Ne_{RT} + \gamma \cdot [Ne_{UJ} + Ne_{RN} + Ne_{S/W}]}{N}\right) \cdot t\right\}. \tag{3.50.}$$
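The closed form of equation (3.49.) can be cross-checked by stepping the difference
equations (3.34.) and (3.35.) directly with a small time increment. The sketch below
does this; the failure rate and time horizon are arbitrary illustrative values.

```python
import math


def simulate_two_state(lam, t_end, steps):
    """Step equations (3.34.) and (3.35.) forward from a controlled start."""
    dt = t_end / steps
    p_c, p_u = 1.0, 0.0            # initial conditions at t = 0
    for _ in range(steps):
        moved = lam * dt * p_c     # probability leaving the controlled state
        p_c -= moved
        p_u += moved
    return p_c, p_u


# Hypothetical failure rate (per hour) and horizon (hours).
lam, t_end = 1e-3, 500.0
p_c, p_u = simulate_two_state(lam, t_end, steps=100_000)
closed_form = math.exp(-lam * t_end)   # equation (3.49.)
```

As the time increment shrinks, the stepped controlled-state probability converges
on the exponential, and the two state probabilities continue to sum to one.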
3.7.4. Mean Time To Failure
The concept of Mean Time To Failure (MTTF), used in hardware reliability
calculations, can be adapted for this work. It provides a method of comparing hardware
and software reliability.
MTTF is defined as,
$$MTTF = \int_0^{\infty} R(t) \, dt, \tag{3.51.}$$
substituting (3.49.) gives,
$$MTTF = \int_0^{\infty} \exp\{-\lambda t\} \, dt = \frac{1}{\lambda}, \tag{3.54.}$$
and, with $\lambda = q \cdot P(E_f)$ from equation (3.26.), substituting equation (3.32.) gives,
$$MTTF = \frac{N}{q \cdot (1-\beta) \cdot \left(N - Ne_{RT} - \gamma \cdot [Ne_{UJ} + Ne_{RN} + Ne_{S/W}]\right)}. \tag{3.56.}$$
3.8. Availability
The availability of a microprocessor is the proportion of time for which the
microprocessor is fully operational. An inherent assumption made when calculating
availability is that the target system is maintained, i.e. the system has its operation
restored after failure. Within the microprocessor model presented in this chapter,
restoration is provided by the automatic execution of a recovery routine when a
restart outcome is generated during erroneous execution. Availability $A_v$ is
dependent on Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR),
$$A_v = \frac{MTTF}{MTTF + MTTR}. \tag{3.57.}$$
The Mean Time To Failure (MTTF) is defined by equation (3.54.). The Mean
Time To Repair (MTTR) includes all processing before the microprocessor is restored
to its fully operational state. In order to facilitate repair the microprocessor model
must allow for the implementation of a recovery routine. Mean Time To Repair can
be estimated using the following equation,
$$MTTR = \left(\frac{L_d + N_R}{I_f}\right), \tag{3.58.}$$
where $L_d$ is the mean number of instructions processed erroneously before detection
of erroneous behaviour (error latency from equation 3.20.), $N_R$ is the mean number
of instructions executed after detection by the recovery routine, and $I_f$ is the mean
number of instructions executed per hour.
Substituting equations (3.54.) and (3.58.) into (3.57.) gives an estimate for the
availability of a microprocessor system that employs coverage for the erroneous behaviour
described in this chapter.
$$A_v = \frac{1/\lambda}{1/\lambda + (L_d + N_R)/I_f}, \tag{3.59.}$$
$$A_v = \left[1 + \lambda \cdot \left(\frac{L_d + N_R}{I_f}\right)\right]^{-1}. \tag{3.60.}$$
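The availability estimate can be sketched as a direct chain of equations (3.54.),
(3.58.) and (3.57.). The parameter values below are hypothetical and serve only to
show the units involved: the failure rate is per hour and the instruction rate is
instructions per hour, so both MTTF and MTTR come out in hours.

```python
def availability(lam, l_d, n_r, i_f):
    """Availability from failure rate, detection latency, recovery length
    and instruction rate (instructions per hour)."""
    mttf = 1.0 / lam              # equation (3.54.), hours
    mttr = (l_d + n_r) / i_f      # equation (3.58.), hours
    return mttf / (mttf + mttr)   # equation (3.57.)


# Hypothetical values: 5 instructions to detection, a 200 instruction
# recovery routine, one million instructions per second.
lam = 1e-4
l_d, n_r = 5.0, 200.0
i_f = 1_000_000 * 3600.0
a_v = availability(lam, l_d, n_r, i_f)
```

Because the repair time is measured in instructions, the result is extremely close
to one for any realistic instruction rate, which is why the comparison between
processors is more sensitive to the failure rate than to the recovery routine length.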
3.9. Summary
The event, induced by a temporary fault, initiating erroneous microprocessor
behaviour is defined as an Initial Erroneous Jump (IEJ). Erroneous behaviour is
characterized by periods of linear erroneous execution interspersed by erroneous jumps.
The characteristics of erroneous execution following an IEJ or SEJ can be statistically
modelled. Error latency is derived from detection capabilities in the microprocessor
model. Failure mode analysis is used within a Markov Model to determine functions
of reliability and Mean Time To Failure (MTTF). Availability of the microprocessor
system by the model is estimated under the assumption that a recovery routine is
implemented. These functions allow the comparative assessment of recovery
techniques to be made for software disrupted by temporary faults in a form which can be
related to calculations for permanent faults in digital systems.
Chapter 4
EVALUATING MICROPROCESSOR BEHAVIOUR
4.1. Introduction
4.2. Instruction Mix Analysis
4.3. Architecture Parameters for the Microprocessor Model
4.3.1. Built-in Microprocessor Detection Capability
4.3.2. Modelling the Microprocessor Program Counter
4.3.3. Instruction Processing Exceptions
4.4. Evaluating Microprocessor Models of Erroneous Behaviour
4.4.1. 8-Bit Processor Evaluations
4.4.2. 16-Bit Processor Evaluations
4.4.3. 32-Bit Processor Evaluations
4.5. Catastrophic Failure Analysis
4.6. Recovery Through The Detection of Erroneous Execution
4.7. Evaluating Microprocessor Reliability
4.8. Evaluating Microprocessor Availability
4.9. Conclusions
4.1. Introduction
A model of erroneous microprocessor behaviour is presented in the previous
chapter. This chapter applies the model to a selection of target processors which include
8, 16, and 32-bit architectures using instruction mix analysis.
The chapter commences with the derivation of parameter values required by the
model for the microprocessors under investigation. The content of the address space
is assumed to be random for the purpose of statistical analysis. Characteristics of
erroneous execution are described for each of the target processors. In particular
the possibilities of catastrophic failure and recovery are investigated because of their
influence on the dependability of a microprocessor based system.
Finally, the reliability of a microprocessor system is considered. A comparison
is made between the microprocessors modelled using the reliability parameter Mean
Time To Failure (MTTF). Reliability calculations assume that the host processor
has no recovery capability. Many of the microprocessors investigated, however, do
have a recovery capability provided by the detection attribute of instructions that
develop a restart outcome. In order to assess the performance of such maintained
microprocessor systems, the availability of these systems is evaluated.
4.2. Instruction Mix Analysis
The model of erroneous microprocessor behaviour presented in Chapter 3 is
evaluated using instruction mix analysis. This involves determining the mean instruction
distribution for a section of memory and modelling the expected behavioural
characteristics. Such modelling is abstracted from actual microprocessor behaviour which
is dependent on instruction sequences. Nevertheless, instruction mix analysis does
provide a valuable method for indicating the nature of erroneous microprocessor
behaviour and its variation between target processors.
This chapter evaluates the erroneous behaviour of a selection of microprocessors. 
The processors considered are: Motorola 6800, Intel 8048, Intel 8085, Intel 8086, 
Motorola 68000, Motorola 68010, AMD Am29000, Motorola 68020, and Intel 80386. 
The microprocessors are chosen to include common application examples of 8, 16, 
and 32-bit architectures. In addition, these processors implement various design fea-
tures including reduced instruction sets, ROM instruction decoders, and instruction 
processing exceptions. 
The instruction mix of the target processors is shown in Table 4.1., data being
collated from Appendix A. Each instruction set is divided into instruction state
outcomes: non-jump, restart, undefined jump, return, and stop/wait. The instruction
mix of the undefined instructions within the processor instruction sets is detailed
in Table 4.2. Some of these instruction sets contain unspecified instructions which
have been defined through experiment [Halse, 1984].
4.3. Architecture Parameters For The Microprocessor Model 
This section determines the parameter values for the model of erroneous micro-
processor behaviour which are dependent on the processor architecture. 
4.3.1. Built-in Microprocessor Detection Capability 
The Motorola 68000 family of microprocessors executes instruction op-codes
residing at an even byte boundary location in the address space. If an attempt is made
to process an instruction op-code at an odd byte boundary location in the address
space, then an immediate 'restart' outcome is entered. The outcome is assumed to
be immediate because the 'odd byte address' exception does not process the
instruction op-code concerned but rather, after a few clock cycles, determines that an illegal
odd address has been accessed. The $\beta$ parameter defined in equations (3.7.) and (3.8.)
will therefore have an inherent value of 0.5 for this family of microprocessors.
The remaining microprocessors evaluated within this chapter do not have this or 
a similar method of generating a restart outcome during instruction processing. The 






























































4.3.2. Modelling the Microprocessor Program Counter 
Most of the microprocessors evaluated in this chapter have a single program
counter which is capable of specifying every location in the address space. The Intel
8086 and 80386 microprocessors, however, have an internal address bus smaller than
their external address bus. To derive a location in its address space the 8086 uses two
program counters. The address put on the external address bus is the sum of the
16-bit Instruction Pointer and the 16-bit Code Segment Register which has already
been left shifted four bits. Hence a 20-bit location is put on the external address bus.
This method of deriving the address bus value means that corruption of either the
Instruction Pointer or the Code Segment Register corrupts the microprocessor's
effective program counter. The model considers the generation of an erroneous jump
as corruption of the single effective program counter.
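The address derivation described above can be written out as a short sketch. The
function name is hypothetical; the arithmetic is the standard 8086 real-mode
calculation, in which the segment register (CS) is shifted left four bits, added to
the Instruction Pointer, and the sum is truncated to 20 bits.

```python
def physical_address(cs, ip):
    """Effective 20-bit program counter of the 8086."""
    segment_base = (cs & 0xFFFF) * 16            # left shift by four bits
    return (segment_base + (ip & 0xFFFF)) & 0xFFFFF


# A single corrupted bit in either register moves the effective program counter.
original = physical_address(0x1234, 0x0010)
ip_flipped = physical_address(0x1234, 0x0010 ^ 0x0001)
cs_flipped = physical_address(0x1234 ^ 0x0001, 0x0010)
```

Flipping the lowest bit of the Instruction Pointer moves the address by one byte,
whereas the same flip in the segment register moves it by sixteen bytes, which
illustrates why the model treats the pair as one effective program counter.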
4.3.3. Instruction Processing Exceptions 
Some microprocessor instruction sets include instructions which generate a restart 
outcome when an abnormal processing condition is identified. A good example com-
mon to most microprocessors is the 'divide by zero' processing exception. Processing 
exceptions should not be confused with conditional instruction outcomes where a test 
is incorporated into the instruction operation in order to determine whether or not 
a task is performed, e.g. conditional branch. For the purpose of statistical analysis 
within this chapter, instruction processing exceptions are considered not to occur. 
The $\gamma$ parameter used in equations (3.7.) and (3.8.) is therefore zero.
4.4. Evaluating Microprocessor Models of Erroneous Behaviour 
Erroneous execution within the used area is initially modelled by execution 
through a memory of random content. The two main behavioural characteristics, 
linear erroneous execution and erroneous jumps, are investigated. 
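Execution through a memory of random content can also be approximated by simulation.
The sketch below is a simplification made for illustration: it draws one instruction
outcome at a time from a hypothetical instruction mix (not the data of Table 4.1.),
ignores instruction length, and treats the immediate-restart proportions of equations
(3.7.) and (3.8.) as zero.

```python
import random

# Hypothetical proportions of each state outcome; 'NJ' is non-jump.
MIX = {"NJ": 0.90, "RT": 0.02, "UJ": 0.05, "RN": 0.02, "S/W": 0.01}


def simulate_linear_run(mix, rng):
    """Process random instructions until one terminates linear execution.
    Returns (instructions processed, terminating outcome)."""
    outcomes = list(mix)
    weights = [mix[o] for o in outcomes]
    k = 0
    while True:
        k += 1
        outcome = rng.choices(outcomes, weights=weights)[0]
        if outcome != "NJ":
            return k, outcome


def estimate(mix, trials=20000, seed=1):
    """Mean run length and share of each terminating outcome."""
    rng = random.Random(seed)
    total_length = 0
    counts = {o: 0 for o in mix if o != "NJ"}
    for _ in range(trials):
        k, outcome = simulate_linear_run(mix, rng)
        total_length += k
        counts[outcome] += 1
    shares = {o: c / trials for o, c in counts.items()}
    return total_length / trials, shares


mean_length, shares = estimate(MIX)
```

For this assumed mix the mean run length should lie close to ten instructions, the
reciprocal of the combined jump proportion, with the terminating outcomes shared in
proportion to their individual weights.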
A selection of microprocessor instruction sets are examined in Table 4.3. The 
distribution of instructions through the address space of random content for many 









































































































































































































due to the processor implementing a ROM instruction decoder, within its
architecture, which specifies an instruction for all possible opcode bit formats. The Intel 8086
and 80386, however, require their instruction set mix to be manipulated to reflect a
random data instruction mix.
Linear erroneous execution state outcomes are plotted as cumulative functions 
using equation (3.9.) in Figure 4.1. (8-bit processors), Figure 4.2. (16-bit processors), 
and Figure 4.3. (32-bit processors). The cumulative probability that linear erroneous 
execution has a particular state outcome after processing a number of instructions, 
indicated by the instruction index, is shown by the vertical width of the labelled area 
at that point. The features of each plot to notice are: 
i) Continued Linear Erroneous Execution 
This is the vertical width of the area labelled 'linear erroneous ex-
ecution' at each instruction index. The larger the vertical width, 
the greater the probability that linear erroneous execution has 
continued through the number of instructions indicated by the 
instruction index. 
ii) Termination by Erroneous Jump 
This is the vertical distance of the combined areas beneath the 
area labelled 'linear erroneous execution' at each instruction in-
dex. The larger the vertical width, the greater the probabil-
ity that linear erroneous execution has been terminated by the 
present (or any preceding) processed instruction indicated by the 
instruction index. 
iii) Stop/Wait Outcome 
This is the vertical width of the area labelled 'stop/wait' at each 
instruction index. This outcome represents a catastrophic failure 
involving the termination of processor activity until the appro-




























of such a restoring event occurring is unpredictable. The larger the 
vertical width of this area, the higher the probability of this outcome. 
iv) Restart Outcome 
This is the vertical width of the area labelled 'restart' at each
instruction index. The restart outcome is the only one that when
executed erroneously is considered to generate a controlled
outcome. It is for this reason that it will be used for detecting
erroneous execution. Hence the 'inherent' detection capability of
a microprocessor may be observed by thickness of the 'restart'
area. The larger the vertical width, the greater the probability
that linear erroneous execution has been detected, and hence
terminated, by the present (or any preceding) processed instruction
indicated by the instruction index.
The investigation of linear erroneous execution and erroneous jumps gives an indica-
tion of the character and attributes of erroneous behaviour. 
4.4.1. 8-Bit Processor Evaluations 
All the 8-bit microprocessors evaluated exhibit a high probability of periods of 
linear erroneous execution exceeding ten instructions, see Figure 4.1. In particular 
the Intel 8048 and 8085 processor models suggest 31% and 27%, respectively, of the 
periods of linear erroneous execution are expected to terminate within ten instruc-
tions. The Motorola 6800 is about twice as likely to terminate a period of linear 
erroneous execution within ten instructions. 
A SEJ terminates all periods of linear erroneous execution in the 8048
microprocessor system. However, for the other two 8-bit processors some periods of linear
erroneous execution are terminated by recovery or catastrophic failure.
Approximately 79% of the periods of linear erroneous execution are terminated with a SEJ
for the Motorola 6800 processor; a similar value of 65% is modelled for the Intel 8085
microprocessor. Although the Intel 8048 processor will never catastrophically fail in
the model, it will also never inherently recover. Within the Motorola 6800 processor
model, non-SEJ terminations of linear erroneous execution by catastrophic failure are
expected to occur four times more often than recovery. The Intel 8085 processor has
the converse relationship, recovery being expected to occur six times more often than
catastrophic failure.
In summary, the model for the Motorola 6800 microprocessor suggests periods of
linear erroneous execution of approximately ten instructions which are approximately
five times more likely to terminate in an SEJ than failure, and the chance of recovery
is small. The Intel 8048 processor model predicts much longer periods of linear
erroneous execution which will always terminate with an SEJ; no catastrophic failure
or recovery is possible. Although the Intel 8048 processor will never catastrophically
fail in the model, failure is implied by the fact that erroneous execution never ceases.
Within the model for the Intel 8085 microprocessor periods of linear erroneous
execution are expected to be of a similar length to those evaluated for the Intel 8048
processor, of which approximately one third of terminations are expected to generate
recovery, the vast majority of the remaining terminations producing a SEJ.
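The proportions quoted in this summary follow from simple arithmetic on the figures
given above. The sketch below assumes that "four times" and "six times" describe
the ratio of catastrophic failure to recovery among the non-SEJ terminations; the
function name is hypothetical.

```python
def split_terminations(p_sej, failure_to_recovery):
    """Split the non-SEJ share of terminations into failure and recovery."""
    non_sej = 1.0 - p_sej
    recovery = non_sej / (1.0 + failure_to_recovery)
    failure = non_sej - recovery
    return failure, recovery


# Motorola 6800: 79% SEJ, failure four times as frequent as recovery.
fail_6800, rec_6800 = split_terminations(0.79, 4.0)
sej_to_failure_6800 = 0.79 / fail_6800

# Intel 8085: 65% SEJ, recovery six times as frequent as failure.
fail_8085, rec_8085 = split_terminations(0.65, 1.0 / 6.0)
```

On these assumptions the Motorola 6800 figures give roughly 17% failure and 4%
recovery, so an SEJ is a little under five times as likely as failure, and the
Intel 8085 figures give 30% recovery, in line with the "approximately one third"
stated in the summary.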
4.4.2. 16-Bit Processor Evaluations 
The instruction mix analysis of erroneous execution presented for the 16-bit mi-
croprocessors in Figure 4.2. suggests that these processors have shorter periods of 
linear erroneous execution than those modelled for the 8-bit processors. The model 
predicts that in excess of 80% of linear erroneous execution periods will terminate 
before their tenth processed instruction. The Intel 8086 processor has a mean ex-
pected period of linear erroneous execution longer than that for the Motorola 68000 
and 68010 microprocessors. This is due to the influence of the instruction set and 
architecture. The 68000 detection capability is considerably influenced by the 'odd 
address exception' processor facility. This exception yields a restart outcome for any 
access to an instruction located at an odd byte address in the memory map. 
Within the Intel 8086 processor model there is a 90% probability that a period 
of linear erroneous execution is terminated by a SEJ. This is much larger than that 
for the Motorola 68000 and 68010 processors whose model suggests the likelihood of 
65 
the same outcome as less than 3%. Hence, not only are the periods of linear erro-
neous execution expected to be shorter for the Motorola processors than the Intel, 
but also the Motorola processors will have fewer periods of linear erroneous execution 
before either catastrophic failure or recovery is attained. The Intel 8086 processor 
model has a similar probability of linear erroneous execution being terminated by 
recovery through a restart outcome or catastrophic failure through a stop/wait
outcome. The Motorola 68000 processors yield very different results. The probability
of a stop/wait outcome for the Motorola 68000 processors is too small to be shown
on Figure 4.2(b & c) whilst the probability of a restart outcome and hence recovery 
is approximately 93% after just four processed instructions during linear erroneous 
execution. 
4.4.3. 32-Bit Processor Evaluations 
The result of model application for a selection of 32-bit processors is shown in 
Figure 4.3. The microprocessors evaluated are the Advanced Micro Devices Am29000 
(Version D), the Motorola 68020, and the Intel 80386. 
The influence in the model of the 'odd byte address' exception of the Motorola 
68020 microprocessor is clearly seen as the 50% intercept in Figure 4.3(b). This 
feature greatly reduces the mean expected period of linear erroneous execution as 
previously described for the Motorola 68000 and 68010 processors. The Advanced 
Micro Devices Am29000 achieves similar periods of erroneous execution without this 
architectural feature. Its performance relies totally on its instruction set attributes. 
Both the AMD Am29000 and Motorola 68020 models predict that approximately 90%
of linear erroneous execution periods terminate within five instructions. This
characteristic is not shared by the Intel 80386 processor model, in which approximately
63% of linear erroneous execution periods are expected to terminate within five
instructions.
Termination of linear erroneous execution for the AMD Am29000 and Motorola
68020 processor models has about a 90% expectation of generating a restart
outcome and hence detection and recovery. The Intel 80386 does not compare so
favourably, with a 37% chance of a restart terminating linear erroneous execution. The
Intel 80386 restart probability is, however, higher than all 8-bit and 16-bit processors 
considered earlier with the exception of the Motorola 68000 microprocessor family. 
The probability for the Motorola 68020, as with other members of the Motorola 
68000 processor family considered earlier, of catastrophic failure through a stop/wait 
outcome is too small to be shown in Figure 4.3(b). At 0.0015% it is very small when 
compared with the 8 and 16-bit processor evaluations. Although the AMD Am29000 
and Intel 80386 processors have a larger probability of 0.391% of a stop/wait outcome 
which is visible in Figure 4.3(a & c), it is still small in relation to other processor 
model evaluations. 
4.5. Catastrophic Failure Analysis 
A stop/wait outcome is identified as a catastrophic failure.
Such a processing outcome stalls execution until an external interrupt generates a 
restart outcome and hence initiates recovery. The occurrence of a stop/wait outcome 
during periods of linear erroneous execution has been investigated for a selection of 
8, 16, and 32-bit microprocessors. It is valuable to further consider the probability of
catastrophic failure as a function of general erroneous execution. Figure 4.4. shows 
the predicted chance of catastrophic failure using instruction mix analysis for a se-
lection of processors. The graph is developed using equation (3.21) and data from 
Table 4.3. 
The Motorola 6800 microprocessor has a significantly higher probability of pro-
ducing a catastrophic failure than the other evaluated processors. The Intel 8086 has 
an 8% probability of catastrophic failure after ten instructions have been processed 
during erroneous execution which is just under half that expected for the Motorola 
6800. The remaining processors are plotted as two groups in Figure 4.4. One group, 
comprising the Intel 8085, AMD Am29000, and Intel 80386, has twice the expectation
of catastrophic failure of the other group of the Motorola 68000 family processors.
This can be illustrated by a comparison between the Intel 8085 and AMD Am29000
which have the respective likelihoods of 4% and 2% for catastrophic failure before
Figure 4.4. (horizontal axis: instructions executed)
4.6. Recovery Through The Detection of Erroneous Execution 
The probability of detecting erroneous execution is dependent on the generation 
of a restart outcome. This can be achieved through either an instruction's natu-
ral outcome, or a hardware induced outcome where a processing exception occurs. 
Although the probability of a restart outcome has been considered during periods 
of linear erroneous execution, it is valuable to evaluate the general function of this 
outcome during erroneous behaviour. Figure 4.5. shows the probability of a restart 
outcome during erroneous execution for a selection of processors. The graph is plotted 
using equation (3.13) and data from Table 4.3. 
Both the Motorola 6800 and Intel 8086 have very low probabilities of generating 
a restart outcome, the former having less than half the expectation of the other. The 
Intel 8085 shows a significant improvement with a 32% chance of initiating recovery
after ten instructions are erroneously processed. This represents over a four-fold
improvement on the Intel 8086 processor. The Intel 80386 also shows an enhanced
performance with approximately 63% probability that erroneous execution is detected 
after ten processed instructions. The remaining processors, the AMD Am29000 and 
Motorola 68000 processor family, have a much better performance. Their instruction 
mix models suggest the likelihood of erroneous execution being detected after ten er-
roneously processed instructions is in excess of 97%. The 'odd byte address' exception 
is a major contributory factor for the performance of the Motorola 68000 processor 
family. This influence of the exception is shown as the 50% intercept value for the 
Motorola 68000 processor plots. The performance of the AMD Am29000 processor 
is dependent solely on its instruction set mix. 
4.7. Evaluating Microprocessor Reliability 
Microprocessor reliability is calculated under the 'worst case' assumption that 
any processor activity other than immediate detection is considered as system fail-
ure. Detection is provided by those instructions which generate a restart outcome. 
Immediate detection requires a restart outcome to be attained before the second
Figure 4.5.: Recovery Through Detection
The reliability of the microprocessors modelled, R(t), is assessed using equation
(3.49.) with the substitution of equations (3.27.), (3.26.), and (3.32.), to yield,
\[ R(t) = \exp\left\{ -q\,t\,(1 - \theta)\left( 1 - \frac{Ne_{RST} + \gamma\,[\,Ne_{UJ} + Ne_{RN} + Ne_{SW}\,]}{N_L} \right) \right\} \qquad (4.1.) \]
where $q$ is the event rate initiating erroneous behaviour, $\theta$ is the probability of
hardware detection of erroneous execution, $\gamma$ is the probability of a processing exception,
$Ne_{RST}$, $Ne_{UJ}$, $Ne_{RN}$, and $Ne_{SW}$ are the effective numbers of restart, unspecified
jump, return and stop/wait outcome generating instructions within the instruction
mix, and finally, $N_L$ is the number of instructions in the instruction mix.
Details of the instruction mix for reliability evaluation are shown in Table 4.3. The
parameter $\gamma$ is defined in section 4.3.3., for the purpose of statistical analysis, to be
zero so equation (4.1.) becomes,
\[ R(t) = \exp\left\{ -q\,t\,(1 - \theta)\left( 1 - \frac{Ne_{RST}}{N_L} \right) \right\} \qquad (4.2.) \]
The parameter $\theta$ is set to zero except for the Motorola 68000 microprocessor family
where it is set to 50% to represent the 'odd byte address' exception facility. The
determination of processor parameters is discussed in section 4.3.
The reliability curves evaluated for the selection of 8, 16, and 32-bit processors
considered in this chapter are shown in Figure 4.6. Unfortunately some processor
evaluations are so similar that their individual reliability curves cannot be
distinguished. In particular this occurs for the Motorola 68000 and 68010 processors, and
the Motorola 6800, Intel 8048, and Intel 8086 processors. The variation in the
respective processor reliability for these clusters of curves is indicated by the Mean Time
To Failure (MTTF) calculations shown in Table 4.4. These calculations assume the
events which initiate erroneous behaviour to occur once per month or every 714 hours.
As discussed earlier the Intel 8048 has no detection capability for erroneous
execution and hence its MTTF is equivalent to the inter-arrival event period. The
remaining microprocessors have some degree of detection capability depending on instruction mix
and architectural influences. This detection capability has little effect on the MTTF
for the Motorola 6800, Intel 8086, Intel 8085, and Intel 80386 processors. A more
significant improvement is shown by the AMD Am29000 processor which has an
MTTF approximately double the inter-arrival event period. Finally, the Motorola
68000 processors have the best MTTF values, which are about three times the
inter-arrival event period. The performance of the Motorola processors can be largely
attributed to their 'odd byte address' exception facility which will, under the model
conditions, immediately detect half the initiated periods of erroneous microprocessor behaviour.
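The influence of hardware detection on the MTTF can be illustrated numerically. The sketch below is not the thesis calculation: it assumes that reliability decays exponentially at the rate of events which are not immediately detected, taken here as q(1 - theta)(1 - d), where d is the fraction of the instruction mix yielding a restart outcome, so that the MTTF is the reciprocal of that rate. The function names and the argument `restart_fraction` are illustrative assumptions, and no Table 4.3. data is used.

```python
import math


def undetected_rate(event_rate, theta=0.0, restart_fraction=0.0):
    """Rate (per hour) of initiating events that are not immediately detected.

    Assumed form: q * (1 - theta) * (1 - d), with d the restart fraction.
    """
    if not (0.0 <= theta <= 1.0 and 0.0 <= restart_fraction <= 1.0):
        raise ValueError("theta and restart_fraction must lie in [0, 1]")
    return event_rate * (1.0 - theta) * (1.0 - restart_fraction)


def reliability(t_hours, event_rate, theta=0.0, restart_fraction=0.0):
    """R(t) for an exponential process with the assumed undetected-event rate."""
    return math.exp(-undetected_rate(event_rate, theta, restart_fraction) * t_hours)


def mttf(event_rate, theta=0.0, restart_fraction=0.0):
    """Mean time to failure: the reciprocal of the undetected-event rate."""
    rate = undetected_rate(event_rate, theta, restart_fraction)
    return math.inf if rate == 0.0 else 1.0 / rate


if __name__ == "__main__":
    q = 1.0 / 714.0  # one initiating event every 714 hours, as in the text
    for theta in (0.0, 0.5):
        print(f"theta={theta:.1f}  MTTF={mttf(q, theta):.0f} hours")
```

Under this assumed form, a hardware detection probability of 50% on its own doubles the MTTF relative to the inter-arrival event period; any further gain has to come from restart outcome instructions in the mix.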
4.8. Evaluating Microprocessor Availability
The availability of a microprocessor system can only be evaluated if the system
is maintained, i.e. manual or automatic repair is facilitated. The model presented
for the erroneous behaviour of a microprocessor in Chapter 3 assumes that restart
outcome generating instructions can be used to provide a detection capability for
erroneous execution. Hence equation (3.60.) can be used to calculate availability,
\[ A_v = \frac{1}{1 + \frac{\lambda}{I_f}\,(L_d + N_R)} \qquad (4.3.) \]
where $\lambda$ is the rate of failure events (per hour) that initiate erroneous behaviour, $I_f$
is the mean number of instructions executed per hour by the processor, $L_d$ is the
error detection latency, and $N_R$ is the number of instructions to be processed in the
recovery routine.
A three dimensional availability plot shown in Figure 4.7. describes how
availability varies with different event rates and error detection latencies. The solid plane
denotes the use of a recovery routine with 50 instructions, whilst the dashed plane
shows the effect of a larger recovery routine of 1000 instructions.
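The shape of the two surfaces can be explored numerically. The sketch below assumes availability takes the form 1 / (1 + r (L + N)), where r is the event rate per instruction, L the error detection latency and N the recovery routine length, both in instructions; this is the form evaluated in the worked example of equation (4.5.). The function name and the sampled values of r and L are illustrative only.

```python
def availability(events_per_instruction, detection_latency, recovery_instructions):
    """Availability under the assumed form 1 / (1 + r * (L + N))."""
    if min(events_per_instruction, detection_latency, recovery_instructions) < 0:
        raise ValueError("all parameters must be non-negative")
    lost = events_per_instruction * (detection_latency + recovery_instructions)
    return 1.0 / (1.0 + lost)


if __name__ == "__main__":
    # Two recovery routine sizes, matching the two planes described in the text.
    for recovery in (50, 1000):
        for rate in (1e-5, 1e-4, 1e-3):
            for latency in (10, 100):
                a = availability(rate, latency, recovery)
                print(f"N_R={recovery:5d}  rate={rate:.0e}  L_d={latency:4d}  "
                      f"availability={a:.4f}")
```

For a fixed event rate and latency the larger recovery routine always gives the lower availability, which is why the dashed surface lies beneath the solid one.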
In order to aid comprehension of the equation (4.3.) plot, consider the following
example. A microprocessor operates at 0.4 MHz and has a mean instruction processing
time of 400 cycles. Failure events initiating erroneous microprocessor behaviour
are expected to occur once per second. Hence,
\[ \frac{\lambda}{I_f} = \frac{1}{(0.4 \times 10^{6}) / 400} = 0.001 \qquad (4.4.) \]
Figure 4.7.: Microprocessor Availability (model performance; dashed surface: version 1, bold surface: version 2)
The mean error detection latency ($L_d$) is ten instructions, and the recovery routine
consists of 50 instructions ($N_R$). The availability of the system can now be determined
using equation (4.3.):
\[ A_v = \frac{1}{1 + 0.001\,(10 + 50)} \qquad (4.5.) \]
\[ A_v = 94.3\% \qquad (4.6.) \]
The calculated availability can be located on Figure 4.7. by mapping the event
rate (per instruction) $\lambda / I_f$ and error latency (number of instructions) $L_d$ on the solid
surface representing 50 instructions in the recovery routine $N_R$. The actual operating
frequency of a processor may be higher and the mean instruction cycle time shorter
than that used in this example. These parameter values would increase the calculated
value of system availability.
4.9. Conclusion
A model based on probability theory is used to predict microprocessor erroneous
behaviour. Characteristics of linear erroneous execution and their termination are
compared for a selection of 8, 16, and 32-bit microprocessors using instruction mix
analysis. Whilst this method of analysis does not reflect the time dependency of
erroneous behaviour on the instruction sequence, it does provide a valuable insight
into the patterns of erroneous behaviour.
Two important processing outcomes are studied: catastrophic failure and recovery.
Catastrophic failure is defined as the entry into a stop/wait
state where normal instruction activity ceases. This state can only be exited by
the generation of an external interrupt. Such reset signals cannot be relied upon
because they have not been incorporated into the microprocessor system design. The
other important processing outcome is recovery. This is attained through entry into
a restart state which by its definition establishes controlled microprocessor activity.
The evaluation of recovery leads to the determination of reliability for the
microprocessor system. The selection of target processors is compared using the reliability
parameter Mean Time to Failure (MTTF). Many processors show little improvement
in the MTTF with respect to the inter-arrival event period. The best results
are obtained from the microprocessor models for the AMD Am29000 and Motorola
68000 family, which show approximately two-fold and three-fold increases respectively
in their MTTF compared with the inter-arrival event period.
Finally, the chapter concludes with an investigation of microprocessor system
availability. A general plot is shown to indicate the availability of a processor system
implementing recovery routines of two sizes, and an example is discussed. Availability
calculations are only made possible where a microprocessor system has an automatic
detection and recovery capability for erroneous behaviour.
Chapter 5
DETECTING ERRONEOUS MICROPROCESSOR BEHAVIOUR
5.1. Introduction
5.2. Address Space Allocation
5.3. Erroneous Execution in the Unused Area of the Address Space
5.3.1. The Initial Erroneous Jump Characteristic
5.3.2. Detecting Erroneous Execution
5.3.2.1. A Software-Based Technique
5.3.2.2. Watchdog Timers and Smart Watchdogs
5.3.2.3. The Access Guardian Proposal
5.4. Erroneous Execution in the Used Area of the Address Space
5.4.1. The Subsequent Erroneous Jump Characteristic
5.4.2. Detection Using Software Implemented Fault Tolerance
5.4.2.1. Program Areas
5.4.2.2. Data Areas and Reserved Input/Output Areas
5.5. The Overheads of Implementing Fault Tolerance
5.5.1. Hardware Fault Tolerance
5.5.2. Software Implemented Fault Tolerance
5.6. A Fault Tolerant Strategy for Microprocessor Controllers
5.7. Summary
5.1. Introduction
The previous chapters of this thesis have proposed and evaluated a model of
erroneous behaviour for a selection of microprocessors. This chapter considers methods
of exploiting characteristics of erroneous execution as part of a detection strategy.
In particular the characteristics of Initial Erroneous Jump (IEJ) and Subsequent
Erroneous Jump (SEJ) are identified for this purpose.
The chapter commences by defining the functional allocation of a microprocessor's
address space: the used area consisting of program, data, and reserved input/output
areas; the unused area consisting of physically implemented and non-existent
memory. Detection techniques are then considered for these functional address space
allocations. Particular proposals are made using software techniques for the program
and physically implemented unused areas of the address space. In instances where
a microprocessor does not have its address space totally implemented with physical
memory, a proposed hardware unit called an Access Guardian can be implemented
to provide detection of unused area access.
The application of fault tolerance incurs a physical and/or performance overhead
to the target microprocessor system. Each of the detection techniques considered
within this chapter has its required overhead evaluated.
5.2. Address Space Allocation
Within the system memory, areas can be defined which have different execution
characteristics dependent on the defined utilization of that memory area. For the
purpose of statistical analysis, the memory is divided into functional areas. Each
functional area will exhibit a particular instruction distribution.
Initially the address space is divided into two distinct areas, used area and unused
area, such that
\[ \text{Total Address Space (bytes)} = \text{Used Area (bytes)} + \text{Unused Area (bytes)} \qquad (5.1.) \]
The unused area is considered to include those address space locations not
reserved for, or required by, the processor during correct operation. This area can be
subdivided depending on the implementation of the resident address space,
o physically implemented memory, and
o non-existent memory.
The used area contains locations reserved for external communication, and
locations for instructions and the data they require for correct operation. The used area
can be functionally subdivided into,
o program area,
o data area, and
o input/output reserved area.
The program area is considered to contain all software instructions, opcodes and
operands. The data area is considered to contain any information required by the
software, i.e. data structures including stacks and linked lists. The input/output
reserved area contains those locations specified as reserved for communication ports
and exception targets.
The functional allocation of the address space into its constituent areas is shown
in Figure 5.1. It should be realized that for microprocessor systems, allocation of the
address space rarely involves contiguous functional areas.
5.3. Erroneous Execution in the Unused Area of the Address Space
This section initially describes the character of modelled erroneous execution in
the unused area of the address space. In particular the Initial Erroneous Jump (IEJ),
first described in Chapter 3, is identified with erroneous execution in the unused
area. Software and hardware detection techniques are presented which exploit this
characteristic. Finally, the design of the hardware technique is detailed.
Figure 5.1.: Functional Address Space Allocation (used area: input/output reserved area, program area, data area)
5.3.1. The Initial Erroneous Jump Characteristic
The event associated with the initiation of erroneous behaviour is the Initial
Erroneous Jump (IEJ). The destination of an IEJ has been assumed to be a random
location in the address space in the model of erroneous execution presented in
Chapter 3. Hence the probability that the destination of an IEJ will be in the unused
area, $P_{IEJ}(\text{Unused Area})$, is dependent on the proportion of the total address space
occupied by the used area. That is,
\[ P_{IEJ}(\text{Unused Area}) = \frac{\text{Unused Area (bytes)}}{\text{Total Address Space (bytes)}} \qquad (5.2.) \]
and
\[ P_{IEJ}(\text{Unused Area}) = 1 - P_{IEJ}(\text{Used Area}) \qquad (5.3.) \]
A linear relationship between $P_{IEJ}(\text{Used Area})$ and Used Area for a selection of
address space sizes is shown in Figure 5.2. In particular the graph describes the IEJ
characteristic exhibited by the Motorola 6800, Intel 8048, and Intel 8085
microprocessors which have a 64 KByte address space, the Intel 8086 microprocessor which has
a 1 MByte address space, and finally the Motorola 68000 and 68010 microprocessors
which have a 16 MByte address space.
Figure 5.2.: The IEJ Characteristic (horizontal axis: used area, bytes x 10^5)
Consider a particular software application on two microprocessors whose only
difference is the size of their respective address space. It is evident from equation
(5.1.) that the microprocessor with the larger address space will have an unused
area occupying a greater proportion of the address space. The IEJ characteristic
of equation (5.2.) highlights the profitability of detecting processor execution in the
unused area, which by definition is erroneous, particular benefit being offered by those
microprocessors with a larger address space.
5.3.2. Detecting Erroneous Execution
Access to the unused area of the address space is indicative of erroneous execution.
Therefore a technique to detect such access is required to prevent erroneous execution.
The unused area may partially or entirely include physical memory locations. These
locations may occur as a single contiguous block or as a collection of blocks distributed
throughout the address space.
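Since any access to the unused area is erroneous, equation (5.2.) also gives the share of initial erroneous jumps that unused area detection could catch. The sketch below evaluates it for the three address space sizes named in the text; the 32 KByte used area and the function name are assumptions chosen for illustration.

```python
def p_iej_unused(used_bytes, total_bytes):
    """Probability that a uniformly random IEJ destination lies in the unused area."""
    if total_bytes <= 0 or not 0 <= used_bytes <= total_bytes:
        raise ValueError("need 0 <= used_bytes <= total_bytes and total_bytes > 0")
    return (total_bytes - used_bytes) / total_bytes


# Address space sizes quoted in the text for each processor group.
ADDRESS_SPACES = {
    "64 KByte (Motorola 6800, Intel 8048, Intel 8085)": 2 ** 16,
    "1 MByte (Intel 8086)": 2 ** 20,
    "16 MByte (Motorola 68000, 68010)": 2 ** 24,
}

if __name__ == "__main__":
    used = 32 * 1024  # hypothetical application occupying 32 KByte
    for name, size in ADDRESS_SPACES.items():
        print(f"{name}: P_IEJ(unused) = {p_iej_unused(used, size):.4f}")
```

The same application leaves half of a 64 KByte address space unused but over 99% of a 16 MByte one, which is the benefit of a larger address space noted above.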
5.3.2.1. A Software-Based Technique 
Software constructs can be placed in the blocks of physically implemented unused
address space to provide detection of erroneous execution. The principles of their
structure are as follows:
i) All instructions within the construct should be without an operand
requirement in order that there is only one possible program flow
path through the memory.
ii) The program flow path of the software construct must have one
or more termination points where recovery action is initiated,
otherwise no recovery is possible.
The software construct adopted by particular processor systems is influenced by the
availability of instructions in their respective instruction sets.
Error latency can be minimized by placing restart instructions without an
operand requirement at every location in a block of unused physical memory.
Processing any one of these restart instructions initiates execution of a recovery routine
for the application software. However, not all microprocessors possess such a restart
instruction, e.g. the Intel 8048 processor. For microprocessors like these, a software
construct called the 'snake' can be used [Pearson, 1983]. The snake construct involves
placing a chain of 'no-operation' instructions, without an operand requirement, at
each location except the last in a block of unused memory. The terminating location
in the block holds a jump instruction. The action of the snake is to 'slide' erroneous
execution initiated on it to the jump, which then directs execution to the recovery
routine. The error latency associated with the processing time required to 'slide'
down the snake can be reduced by placing intermediate jumps, whose destination is
the recovery routine, within the blocks. However, in order to preserve detection of
erroneous execution at every location in a block of unused physical memory, such
intermediate jumps should not require any operands.
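The snake construct can be sketched as a byte image for one block of unused memory. This is a minimal illustration, not the construct of [Pearson, 1983]: the opcode values, the three-byte jump with a 16-bit low-byte-first address, and both function names are placeholder assumptions rather than the encoding of any particular processor.

```python
NOP = 0x00        # placeholder one-byte 'no-operation' opcode
JMP = 0xC3        # placeholder opcode for a jump taking a 16-bit address
JMP_LENGTH = 3    # opcode byte plus two address bytes


def build_snake(block_size, recovery_address):
    """Return the byte image for one block of unused physical memory."""
    if block_size < JMP_LENGTH:
        raise ValueError("block too small to hold the terminating jump")
    if not 0 <= recovery_address <= 0xFFFF:
        raise ValueError("recovery address must fit in 16 bits")
    body = bytes([NOP]) * (block_size - JMP_LENGTH)
    # Address stored low byte first (an assumption of this sketch).
    tail = bytes([JMP, recovery_address & 0xFF, recovery_address >> 8])
    return body + tail


def instructions_until_recovery(snake, entry_offset):
    """Count instructions processed from an erroneous entry point to recovery.

    Entry into the two operand bytes of the terminating jump is not modelled.
    """
    jump_offset = len(snake) - JMP_LENGTH
    if not 0 <= entry_offset <= jump_offset:
        raise ValueError("entry point outside the modelled part of the snake")
    return (jump_offset - entry_offset) + 1  # NOPs slid over, plus the jump


if __name__ == "__main__":
    snake = build_snake(16, 0x0040)
    print(snake.hex(" "))
    print(instructions_until_recovery(snake, 0))
```

The error latency grows with the distance between the entry point and the terminating jump, which is the motivation for the intermediate jumps described above. The unmodelled entry into the jump's operand bytes is exactly the case that principle (i) is intended to exclude.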
Many microprocessor systems do not implement their entire unused area in physi-
cal memory, and hence a complementary or alternative detection technique is required 
for non-existent memory in the address space. 
5.3.2.2. Watchdog Timers and Smart Watchdogs
As described in Chapter 2, watchdog timers and smart watchdogs can be
implemented in a microprocessor system to detect access to the unused areas of the
processor's address space.
IBM [1986] describe an analogue watchdog timer which they have developed for
microprocessor controllers. It was developed because automatic reset is required when
an embedded microprocessor in a controller improperly executes code. Watchdog
timers require the application software to correctly reset the watchdog timer. Hence
the programmer requires a prerequisite knowledge of the target processor system in
order to satisfy the watchdog timer requirements. Such programming practice is much
slower than code generation for microprocessor systems employing a transparent fault
tolerant technique.
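The dependence of a watchdog timer on the application software can be seen in a small simulation. The sketch below is a generic countdown watchdog, not the analogue design of IBM [1986]; the class, the `run` helper, and the tick values are all illustrative assumptions.

```python
class WatchdogTimer:
    """Countdown timer that requests a reset unless it is kicked in time."""

    def __init__(self, timeout_ticks):
        if timeout_ticks <= 0:
            raise ValueError("timeout must be positive")
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def kick(self):
        """Called by correctly executing application software."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one time unit; return True when a reset is requested."""
        self.remaining -= 1
        return self.remaining <= 0


def run(ticks, kick_every, timeout_ticks, fault_at=None):
    """Return the tick at which the watchdog fires, or None if it never does.

    From tick `fault_at` onwards the software is assumed to execute
    erroneously and therefore stops kicking the watchdog.
    """
    dog = WatchdogTimer(timeout_ticks)
    for now in range(1, ticks + 1):
        healthy = fault_at is None or now < fault_at
        if healthy and now % kick_every == 0:
            dog.kick()
        if dog.tick():
            return now
    return None


if __name__ == "__main__":
    print(run(100, kick_every=3, timeout_ticks=5))
    print(run(100, kick_every=3, timeout_ticks=5, fault_at=20))
```

The kick interval must be chosen with knowledge of the timeout, which is the prerequisite knowledge of the target system referred to above.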
Namjoo & McCluskey [1982] propose a smart watchdog, based on an additional
processor, to detect (in a transparent fashion to the application software) unused
area access. The watchdog monitors access to the unused area. On an invalid access,
the watchdog signals a hardware interrupt to the host microprocessor indicating
detection of erroneous behaviour. The host machine then processes the interrupt in a
manner that will initiate recovery. It should be realized that the additional processor
implemented by the smart watchdog is susceptible to error generation in the same way as
the microprocessor it is protecting.
5.3.2.3. The Access Guardian Proposal
The redundancy of the smart watchdog can be reduced by implementing its
detection function in a dedicated hardware unit. The design of such a unit, referred to
as an 'Access Guardian', is proposed here. The Access Guardian provides
independent on-line monitoring and detection of unused area access by the microprocessor.
The topology of a microprocessor system incorporating an Access Guardian is shown
in Figure 5.3.
The Access Guardian detects whether or not invalid address lines are activated,
and if so, asserts an interrupt signal to the microprocessor. This induces a restart
state outcome within the microprocessor which then directs execution to a recovery
routine for the software application. Implementation of an Access Guardian requires
a prerequisite knowledge of the residence of the used area within the microprocessor
address space. This implies that Access Guardians can only be used in dedicated
microprocessor systems.
The general function of the Access Guardian is shown in Figure 5.4. The
Access Guardian takes the system address bus as input to its 'address decoder' which
generates a signal when invalid address lines are activated. This signal is then
processed with Access Guardian status information by the 'restart generator' to produce
an interrupt signal 'RESTART' for the application processor. The interrupt signal
must exist slightly in excess of the microprocessor interrupt latency. The interrupt
latency is the length of time an interrupt must exist to guarantee processing by the
microprocessor. Assuming the interrupt is given highest priority the processor will
detect it following the execution of the present instruction. The interrupt signal must
therefore be just longer than the longest execution time required by any instruction.
A 'timer unit' holds a set interrupt signal for the required period. The detailed design
of an Access Guardian is presented in Appendix B.
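The behaviour of the address decoder and timer unit can be modelled in software. The sketch below is a behavioural illustration only and does not reproduce the Appendix B design: the memory map, the cycle count, and the class interface are hypothetical, and the used ranges are assumed not to overlap.

```python
from bisect import bisect_right


class AccessGuardian:
    """Behavioural model: flag any address outside the known used ranges."""

    def __init__(self, used_ranges, longest_instruction_cycles):
        # used_ranges: non-overlapping half-open (start, end) address ranges.
        self.ranges = sorted(used_ranges)
        self.starts = [start for start, _ in self.ranges]
        # Hold RESTART just longer than the longest instruction execution time.
        self.hold_cycles = longest_instruction_cycles + 1

    def is_valid(self, address):
        """Address decoder: True when the address lies inside the used area."""
        index = bisect_right(self.starts, address) - 1
        return index >= 0 and address < self.ranges[index][1]

    def observe(self, address):
        """Return the number of cycles RESTART is held (0 for a valid access)."""
        return 0 if self.is_valid(address) else self.hold_cycles


if __name__ == "__main__":
    # Hypothetical dedicated system: program, data, and input/output areas.
    guardian = AccessGuardian(
        used_ranges=[(0x0000, 0x2000), (0x8000, 0x8400), (0xFF00, 0x10000)],
        longest_instruction_cycles=12,
    )
    for address in (0x1FFF, 0x2000, 0x8000, 0x9000):
        print(f"{address:#06x}: RESTART held for {guardian.observe(address)} cycles")
```

The fixed table of used ranges reflects the point made above: the residence of the used area must be known in advance, so the technique suits dedicated systems.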
The effectiveness of detecting unused area access has been investigated using
fault insertion experiments. Gunneflo et al [1989] report 60% of faults inserted into
an operating Motorola 6809 microprocessor system as generating access to an unused
area occupying 88% of the address space. They expect unused area access to fall and
rise with decreasing and increasing proportions of the address space occupied by
the unused area respectively. This suggestion is supported by Schmid et al [1982]
who report only 10% of faults inserted into a Zilog Z80 microprocessor system as
Figure 5.4.: Access Guardian
5.4. Erroneous Execution in the Used Area of the Address Space
A technique has been proposed so that erroneous execution in the unused area 
will be detected. There still remains, however, the possibility of erroneous execution 
in the used area. This involves periods of linear erroneous execution interspersed 
by Subsequent Erroneous Jumps (SEJs) within the used area. This characteristic 
is investigated by tracing successive SEJ destinations, and several microprocessors 
are evaluated using the model of erroneous microprocessor behaviour proposed in 
Chapter 3. Techniques are proposed for detecting erroneous execution using software 
implemented fault tolerance. 
5.4.1. The Subsequent Erroneous Jump Characteristic 
The probability of an SEJ whose generator and destination are both in the used
area can be determined as follows. Let the set {L} contain every location in the used
area, and $N_L$ be the number of items in the set {L}. Let the set {J} contain every
jump type instruction in the microprocessor instruction set, and $N_J$ be the number
of items in the set {J}. Let l be a location in the used area, and j be a particular
jump type instruction in the instruction set. Hence l ∈ {L}, and j ∈ {J}.
Let the function H(l,j) represent the percentage of possible destinations
generated by a particular jump type instruction (j) at a used area location (l), that reside
within the used area. This function is referred to as the 'hit' function. Let the set
{T} contain all the locations that can be addressed by a jump type instruction (j)
residing at location (l). Then
\[ H(l,j) = \frac{\Pr(\{T\} \cap \{L\})}{\Pr(\{L\})} \quad \text{where } \Pr(\{L\}) = 1 \qquad (5.4.) \]
which can be expressed as a conditional probability,
\[ H(l,j) = \Pr(\{T\} \mid \{L\}) \qquad (5.5.) \]
provided the boundary condition $\Pr(\{L\}) > 0$ is satisfied.
Typically jump type instructions can generate destinations within either a local
$2^{8}$, $2^{16}$, or $2^{32}$ byte range. Consider a jump type instruction employing relative
addressing with a byte operand specifying the displacement. Various operand values
alter the destination of the jump type instruction. If this jump type instruction
resides in the middle of a used area of > $2^{8}$ bytes then it is guaranteed to generate
all its possible destinations within the used area: that is H(l,j) = 1. However the
same jump type instruction residing in the middle of a used area of $2^{7}$ bytes only has
half its possible destinations within the used area: that is H(l,j) = 0.5. Figure 5.5.
is provided to further aid comprehension of the HIT function.
Let $P_{SEJ}(\text{Used Area})$ be the probability of a jump type instruction (j) selected
at random from the microprocessor instruction set, residing at a random location in
the used area (l), generating a destination which is also within the used area.
\[ P_{SEJ}(\text{Used Area}) = \frac{1}{N_L\,N_J} \sum_{l \in \{L\}} \sum_{j \in \{J\}} H(l,j) \qquad (5.6.) \]
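The HIT function and its average over the used area can be evaluated directly for simple cases. The sketch below makes several simplifying assumptions that are not taken from the thesis: the used area is one contiguous half-open range, a relative jump reaches a window centred on its own location, an absolute jump reaches the whole address space uniformly, and destinations beyond the address space simply count as misses. All names are illustrative.

```python
def overlap(lo, hi, start, end):
    """Size of the intersection of half-open ranges [lo, hi) and [start, end)."""
    return max(0, min(hi, end) - max(lo, start))


def hit_relative(location, used_start, used_end, displacement_bits=8):
    """H(l, j) for a relative jump whose reach is centred on its location."""
    reach = 1 << displacement_bits
    lo = location - reach // 2
    return overlap(lo, lo + reach, used_start, used_end) / reach


def hit_absolute(used_start, used_end, address_space_size):
    """H(l, j) for an absolute jump: independent of the jump's location."""
    return (used_end - used_start) / address_space_size


def p_sej_used(used_start, used_end, address_space_size, jump_types):
    """Average H(l, j) over every used location and every listed jump type.

    jump_types holds ("absolute", None) or ("relative", displacement_bits).
    """
    total = 0.0
    for location in range(used_start, used_end):
        for kind, bits in jump_types:
            if kind == "absolute":
                total += hit_absolute(used_start, used_end, address_space_size)
            elif kind == "relative":
                total += hit_relative(location, used_start, used_end, bits)
            else:
                raise ValueError(f"unknown jump type: {kind}")
    return total / ((used_end - used_start) * len(jump_types))


if __name__ == "__main__":
    # The two cases from the text: middle of a 2^8 and of a 2^7 byte used area.
    print(hit_relative(0x1080, 0x1000, 0x1100))  # 1.0
    print(hit_relative(0x1040, 0x1000, 0x1080))  # 0.5
    # Half of a 64 KByte address space used, absolute jumps only.
    print(p_sej_used(0, 0x8000, 0x10000, [("absolute", None)]))
```

Adding a relative jump type to the list raises the average above the purely linear absolute-only value, since relative jumps placed away from the edges of the used area nearly always land inside it.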
The relationship between $P_{SEJ}(\text{Used Area})$ and Used Area for a range of 8, 16
and 32-bit microprocessors is investigated in Figure 5.6. All jump type instructions
within the 8-bit Intel 8085 microprocessor instruction set specify absolute target
addresses and hence its SEJ characteristic is linear. When half the Intel 8085 processor's
address space is used, there is a 50% expectation that an SEJ in the used area will
return to the used area. For the same address space utilization, however, the 8-bit
Motorola 6800 microprocessor has an 80% probability that an SEJ's generator and
destination lie within the used area. This is because the Motorola 6800 processor
instruction set does have some relative addressing mode jump instructions. Similar
results are obtained for the 16-bit processors evaluated. The Motorola 68000
microprocessor has a much higher probability of a used area SEJ targeting the used
area again than the Intel 8086 because it has a far higher proportion of opcode
formats within its instruction set dedicated to relative addressing. For example, with a 300
KByte used area the Motorola 68000 and Intel 8086 have respective probabilities
of approximately 95% and 70% that an SEJ generated in the used area will also target
the used area. The 32-bit microprocessors evaluated show that for a 300 KByte used
area of random data, the AMD Am29000, Intel 80386, and Motorola 68020
microprocessors are expected to generate a target within the used area with probabilities of
approximately 25%, 60%, and 95% respectively. The variation in the HIT function
evaluation for the processors is due to the proportion of relative jump opcode formats
in their instruction sets.
Figure 5.5. (labels: locations addressed by the jump type instruction; used area; address space; location)
Let 'y' be the number of locations addressed by jump type instruction 'j'.
Let 'x' be the number of locations 'y' that fall within the used area when the jump
type instruction 'j' resides at location 'l' in the used area.
The HIT function of the jump type instruction at location 'l' in the used area is then
defined as H(l,j) = x / y.
Figure 5.6. (vertical axis: probability of SEJ into used area)
A high probability of an SEJ in the used area generating a target address which is
also in the used area suggests extended periods of erroneous execution in the used area.
Such behaviour can be extremely hazardous, not only involving mal-operation but 
also program area corruption from invalid data manipulation. Therefore, a method of 
detection is required within the used area to prevent extended erroneous execution. 
5.4.2. Detection Using Software Implemented Fault Tolerance 
The following section describes techniques which can be implemented in the software
to detect erroneous execution in the used area. The techniques utilize instructions
that generate a restart outcome. Such instructions direct execution to a predefined
location in memory at which a recovery routine resides. The recovery routine
restores operation as required by the application software; two possible recovery
strategies are reset and rollback.
5.4.2.1. Program Areas 
A software detection technique is proposed which exploits the SEJ characteris-
tic. In particular, erroneous jumps are considered which are generated by invalid 
interpretation of an operand as an opcode. Such erroneous jumps are referred to as 
invalid branches. Detection mechanisms are strategically inserted at each identifiable 
invalid branch destination. 
Detection Mechanism Construction 
The actual construction of a detection mechanism will vary in detail for different 
microprocessors. A detection mechanism consists of an initial relative branch instruc-
tion over the remainder of the detection mechanism so that logical control flow of 
the correctly executing program is not interrupted. The remainder of the detection 
mechanism consists of a number of 'seed' instructions. 
The instructions used to construct a detection mechanism should not require any
operands; however, this may not always be possible. In such instances care must be
taken to avoid the use of operand bit formats which, when erroneously interpreted as
an opcode, generate erroneous jumps. Failure to ensure that detection mechanisms
do not themselves generate erroneous jumps leads to successive detection mechanism
placements with no guarantee of placement completion.
A detection mechanism seed instruction is a software exception without operands 
which, through a restart outcome, directs execution flow to a recovery routine. The 
number of seeds required depends on the individual placement of a detection mech-
anism. 
General examples of detection mechanism structure are shown in Figure 5.7.
The Motorola 68000 microprocessor facilitates detection mechanism constructs like
that in Figure 5.7(a): the jump over the mechanism is provided by the operandless
hexadecimal instruction 600X, where 'X' is the necessary relative displacement, and
the seed is provided by the hexadecimal instruction 6001, which is a branch that
always generates an 'odd address' exception (restart) and again does not require any
operands. Within the Intel 8048 microprocessor there are no restart instructions or
jump instructions without an operand requirement. The detection mechanism in this
case is like that in Figure 5.7(b). The mechanism uses the hexadecimal instruction
04XX for the jump over the mechanism, the hexadecimal no-operation instruction 00
for the snake, and the hexadecimal instruction 04XX to jump to the recovery routine
(mimicking the restart function), where 'XX' specifies the hexadecimal representation
of the jump destination location.
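As an illustration of the two constructs just described, the following Python sketch assembles the byte patterns of a detection mechanism. It is a sketch only: the function names are hypothetical, the 68000 branch displacement is assumed to be counted from the end of the two-byte branch word, and the 8048 variant assumes that both the mechanism and the recovery routine lie in the first 256 bytes of memory so that the 04XX form applies.

```python
M68K_SEED = bytes([0x60, 0x01])  # branch to an odd address: 'odd address' exception


def m68k_mechanism(seeds):
    """Branch-over word followed by `seeds` seed words (Figure 5.7(a) style)."""
    if not 1 <= seeds <= 63:
        raise ValueError("short branch can only skip 1 to 63 seed words")
    displacement = 2 * seeds  # bytes of seed code to jump over
    return bytes([0x60, displacement]) + M68K_SEED * seeds


def i8048_mechanism(base, nops, recovery):
    """Jump-over, run of no-operation bytes, jump to recovery (Figure 5.7(b) style).

    `base` is the address at which the mechanism is placed and `recovery` is the
    address of the recovery routine; both are assumptions of this sketch.
    """
    resume = base + 2 + nops + 2  # first address after the mechanism
    if base < 0 or resume > 0xFF or not 0 <= recovery <= 0xFF:
        raise ValueError("sketch only covers addresses 0x00 to 0xFF")
    return bytes([0x04, resume]) + bytes(nops) + bytes([0x04, recovery])


if __name__ == "__main__":
    print(m68k_mechanism(2).hex())
    print(i8048_mechanism(0x10, 3, 0x80).hex())
```

The number of seeds (or no-operation bytes) is left as a parameter because, as the text notes, it depends on the individual placement.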
Detection Mechanism Placement 
Figure 5.8. shows the two basic methods of inserting a detection mechanism 
within the target software. Forward invalid branch destinations are covered by 


















Figure 5.7.: Detection Mechanism Constructions
(diagram labels: 'to recovery code'; 'operand may be required')
a) Instruction set includes restart instructions that do not require operands. The
instruction set includes jump instructions.
b) Instruction set does not have any restart instructions, but has 'no-operation'
instructions that do not require operands. The instruction set includes jump
instructions.

Figure 5.8. (diagram not recoverable; legible labels: instruction 1, instruction 2,
detection mechanism)
Detection mechanisms are inserted to cover backward invalid branches at the location 
of their respective following instruction. Each detection mechanism requires sufficient 
seeds to ensure that the invalid branch destination has a resident seed. The maximum 
number of seeds required will be equivalent to the byte length of the largest instruction 
construct in the instruction set. 
The insertion of a detection mechanism may itself alter the destination of an in-
valid branch. This situation arises when the invalid branch generator has a specified 
displacement contained in the byte locations directly following the host instruction 
and a detection mechanism is inserted immediately after the host instruction. A 
good example of this is provided by the Motorola 68(7)05 microprocessor where the 
'test and branch' instruction requires two operands, the final operand containing
the relative displacement. This instruction has the same size as the maximum length 
instruction within the instruction set; therefore, whenever an invalid branch has this 
operation, its destination information will always be contained outside the generating 
instruction. If a detection mechanism is inserted immediately after the generating 
instruction then the destination for the invalid branch is altered by the change in the 
information content at the location specifying the relative displacement. Successive 
invalid branches can be generated in this manner. Application of detection mech-
anism placement should test for this situation and take evasive action to prevent 
changing the destination of an invalid branch. 
The Detection Capability of the Mechanism 
The mechanisms inserted within the program code provide a detection capability 
which has two methods of activation. The first method of detection is associated 
with an invalid branch generated by erroneous execution. Detection mechanisms are 
placed in order to detect such SEJs. Secondly, ensuing linear erroneous execution 
processing through a placed mechanism is detected when a seed is interpreted as 
an opcode. This second method of detection is guaranteed to be successful if the 
detection mechanism has a number of seeds equivalent to the byte length of the 
longest instruction within the application processor's instruction set.
Detection Mechanism Placement Deadlock 
Detection mechanisms cannot be placed where the generator and destination of an 
invalid branch are both operands of a single instruction. This occurrence is referred 
to as placement deadlock. There are two particular types of placement deadlock, 
those where the invalid branch has a forward direction and those with a backward 
direction. In addition, placement deadlock occurs when the destination of an invalid
branch is displaced forward of its generating instruction by the equivalent or fewer 
bytes than that required by the detection mechanism's jump instruction. Placement 
deadlock with a backward invalid branch is critical because an infinite execution loop 
may be created. This processing outcome is classified as a failure and has a particular 
hazard in real-time systems. Placement deadlock involving a forward branch does not 
share this hazard, erroneous execution continuing unhindered through the associated 
code. 
5.4.2.2. Data and Reserved Input/Output Areas 
Data areas and areas reserved for input/output communication can be manipu-
lated using similar techniques to facilitate a detection capability of erroneous execu-
tion. The information content of both area types cannot be changed, but the method 
and location of storage can be altered. 
Halse [1984] investigates methods of inserting special sections of code within the
data area. Erroneous execution in the data area is detected when it flows through one
of these sections of inserted code. The efficiency of various sizes of code insertions
with different dispersions throughout the data area is analysed by Halse. This
technique is not transferable to the reserved input/output area because the locations
of this area are fixed and no code insertion is therefore possible.
An alternative technique involves utilizing particular bits of each memory element
to specify an opcode restart operation. The remainder of the memory element
is free for information storage. This technique can best be explained using the
Motorola 68000 as an example. Within this processor the hexadecimal opcode format
FXXX can be used, where 'X' denotes an unspecified content. This opcode format
is not necessarily available in upwardly compatible members of the Motorola 68000
family, e.g. the Motorola 68020 reserves this format for a co-processor. The execution
of a Motorola 68000 FXXX opcode is defined to be illegal and to generate an
exception (restart outcome), and hence can be used to detect erroneous execution.
The unspecified opcode bits can be used to contain useful data area or input/output
reserved area information. Implementation of this technique in the data area and
input/output reserved areas will require the application software to extract only
the least significant 12 bits of information from each memory location. In addition,
the input/output locations should have the most significant 4 bits hard-wired for the
restart opcode format. This technique incurs redundancy, which may be substantial
in some microprocessor applications. Indeed, the technique may not be feasible for
some microprocessor systems.
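The packing implied by this technique can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes 16-bit memory elements whose most significant 4 bits carry the FXXX restart format while the remaining 12 bits carry data.

```python
RESTART_FORMAT = 0xF000  # top 4 bits of the FXXX opcode format
DATA_MASK = 0x0FFF       # least significant 12 bits hold the stored information


def pack(value):
    """Store a 12-bit value in a word that reads as an FXXX opcode."""
    if not 0 <= value <= DATA_MASK:
        raise ValueError("only 12 bits of data fit in each 16-bit element")
    return RESTART_FORMAT | value


def unpack(word):
    """Recover the 12 data bits, as the application software would have to."""
    return word & DATA_MASK


def reads_as_restart(word):
    """True if the word, fetched erroneously as an opcode, has the FXXX format."""
    return (word & RESTART_FORMAT) == RESTART_FORMAT


if __name__ == "__main__":
    word = pack(0x123)
    print(hex(word), hex(unpack(word)), reads_as_restart(word))
```

The sketch makes the redundancy visible: a quarter of every memory element is given over to detection, and every access needs a mask operation.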
A third technique involves moving the location of data areas in the address space
so that their address specification within an operand has the format of a restart
instruction. This ensures that when erroneous execution in the program area generates
an SEJ destination in the data area, erroneous execution is detected. Application of
this technique is highly dependent on the host microprocessor instruction set. The
best results are obtained for those instruction sets with a larger number of restart
generating opcode formats. Small numbers of restart opcode formats restrict the
number of locations available for positioning sections of data area. This technique
is investigated further by Halse [1984]. Again, however, this technique cannot be
applied to the input/output reserved area because its locations are fixed, although
some locations may be available that, by coincidence, map restart opcode formats.
5.5. The Overheads of Implementing Fault Tolerance
5.5.1. Hardware Fault Tolerant Techniques 
The Access Guardian and smart watchdog have an error latency much smaller 
than that typically exhibited by watchdog timers. In addition their application, 
unlike the watchdog timer, is transparent to the application program. The use of an 
Access Guardian or smart watchdog, therefore, releases the application programmer 
from requiring a knowledge of the target processor system architecture in order to 
produce dedicated software. 
The design of the Access Guardian can be applied to most microprocessor sys-
tems. The complexity of the 'address decoder' is dependent on the nature of the 
address space, and the bus architecture employed by the microprocessor. The 'timer 
unit' will also vary in size depending on the microprocessor interrupt latency. The 
'restart generator' has a fixed size. The Access Guardian designed in Appendix B
requires 60 logic gates, representing 17 standard TTL IC parts. This represents a
significant reduction in the gate overhead introduced by a smart watchdog which 
typically has thousands of gates. 
The Access Guardian acts in parallel with the microprocessor and does not impose
a performance overhead on the system during correct operation. Its reduced size in
comparison with a smart watchdog also implies a smaller chance of the 'doomsday'
syndrome, in which the hardware detection unit itself fails [Damm, 1988].
5.5.2. Software Implemented Fault Tolerant Techniques 
The overheads of used area software enhancement are an additional memory
requirement and, during correct operation, the extra processing needed to execute the
jumps over detection mechanisms. Modification of the unused physical memory
locations does not incur any overhead to the microprocessor system because, during
the course of correct operation, this section of the address space is totally
independent of processor action.
The memory extension required by the software implemented fault tolerant tech-
nique proposed for the program area can be reduced by generating optimum size de-
tection mechanisms at each placement rather than a default maximum size. Whilst 
this reduces the memory overhead, it also decreases the effectiveness of the mechanism 
to detect linear erroneous execution. The relative cost of a byte of physical memory 
has decreased over recent decades [Freer, 1987] and, therefore, the memory overhead 
is not predicted to be a major system constraint. Nevertheless, in those systems with 
a limited memory, optimum size detection mechanisms can be implemented. 
A memory overhead is also produced by the insertion of software in the data area 
to detect erroneous execution. Further details of the expected overhead for particular 
processor systems can be found in Halse [1984]. This memory overhead does not have 
an associated performance overhead. 
The extra processing requirement of detection mechanism jumps during correct 
operation of program code may prove critical for some stringent real-time and parallel 
processing applications. This processing overhead is not influenced by the placement 
of optimum or default size detection mechanisms in the target code. However, for 
the majority of applications this overhead is considered to be acceptable. 
The technique proposed for the data and reserved input/output area, by which
the data content of each memory location is reduced in order to give that location
a detection capability, also generates a processing overhead. Data transfers and
memory requirement may be increased. The magnitude of this overhead is application
dependent. The architecture of a microprocessor could incorporate this technique in
order to reduce lost operational performance.
5.6. A Fault Tolerant Strategy for Microprocessor Controllers 
The detection techniques described in this chapter cover attributes of erroneous 
execution associated with the model of erroneous microprocessor behaviour presented 
in Chapter 3. Individual application of one of the techniques will improve the re-
liability of the processor system in relation to this mode of failure. However, the 
reliability of the system can be further improved by the selection of techniques for 
collective implementation. Such techniques should be complementary and feasible for 
incorporation into a particular microprocessor system. Hence, the selection of fault 
tolerant techniques is termed 'strategic'. 
The model of erroneous microprocessor behaviour described in Chapter 3 is
summarized in Figure 3.2. This figure can now be modified to include the detection
capability of unused area access via the Access Guardian, and invalid used area
operation via the detection mechanisms planted in the software. The model of erroneous
behaviour in a microprocessor implementing such fault tolerant techniques is shown
in Figure 5.9.


























Figure 5.9.: Erroneous Execution Model: Enhanced Transient Fault Recovery
(* manifested transient fault)
The reliability of a selection of microprocessors implementing an Access Guardian
is shown in Figure 5.10. The evaluations assume a worst case of no recovery capability
exhibited by the used area, so equation (3.49.) with equation (3.26.) substituted
becomes
$$R(t) = \exp\{-q \cdot P(E_f) \cdot t\}, \tag{5.7}$$
where
$$P(E_f) = \left\{\frac{\text{Used Area (bytes)}}{\text{Address Space (bytes)}}\right\}, \tag{5.8}$$
and q is the event rate, P(E_f) the probability that the event initiates erroneous
execution, and t is time.
Figure 5.10. has a normalized time base. Reliability can also be expressed as MTTF
using the following equation for the microprocessor system described above,
$$\text{MTTF} = \frac{1}{q} \cdot \left\{\frac{\text{Address Space (bytes)}}{\text{Used Area (bytes)}}\right\}.$$
Table 5.1. shows the MTTF calculations corresponding to Figure 5.10., where events
initiating erroneous microprocessor behaviour are taken to occur once a month (every
714 hours).
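The arithmetic behind these expressions can be checked with a small Python sketch. The function names are illustrative, and the sketch uses only the plain ratio of used area to address space together with the event interval of 714 hours quoted above; under those assumptions it reproduces the 8-bit and Intel 8086 entries of Table 5.1.

```python
import math


def p_ef(used_area_bytes, address_space_bytes):
    """Probability that an event initiates erroneous execution, as a plain ratio."""
    return used_area_bytes / address_space_bytes


def reliability(q, p, t):
    """R(t) = exp(-q * P(Ef) * t), with q the event rate and t the time."""
    return math.exp(-q * p * t)


def mttf(q, p):
    """Mean time to failure, 1 / (q * P(Ef)), in the time unit of 1/q."""
    return 1.0 / (q * p)


if __name__ == "__main__":
    q = 1.0 / 714.0  # one event every 714 hours
    kbyte = 1024
    for name, space in [("8-bit, 64 KByte", 64 * kbyte), ("8086, 1 MByte", 1024 * kbyte)]:
        p = p_ef(48 * kbyte, space)
        print(name, round(p, 6), round(mttf(q, p)), "hrs")
```

For a 48 KByte used area this gives 952 hours for a 64 KByte address space and 15232 hours for a 1 MByte address space.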
The complementary application of fault tolerance in the used area enables the 
reliability of the microprocessor system to be improved. Future chapters investigate 
the recovery capability enhancement realized by particular microprocessor systems. 
Meanwhile it is sufficient to demonstrate the benefit of detecting erroneous micro-
processor behaviour in the used area of the address space. Consider a Motorola 
68000 microprocessor system with 48 KBytes used area and implementing an Access 
Guardian. The used area is initially assumed not to have a recovery capability. For 






Figure 5.10.: Microprocessor Reliability with Access Guardian
(Used Area = 48 KBytes)
(plot not recoverable; horizontal axis: number of events, 0 to 36; legible curve
label: 6800/8048/8085)
Microprocessor   Class    P(Ef)      MTTF
6800             8-Bit    0.750000   952 hrs = 39 days
8048             8-Bit    0.750000   952 hrs = 39 days
8085             8-Bit    0.750000   952 hrs = 39 days
8086             16-Bit   0.046870   15232 hrs = 21.6 months
68000            16-Bit   0.001465   487567 hrs = 56 yrs
68010            16-Bit   0.001465   487567 hrs = 56 yrs
68020            32-Bit   0.000006   119047619 hrs = 13590 yrs
Table 5.1.: MTTF with Unused Area Detection
Program/Data   P(Ef)         MTTF
48K/0K         (illegible)   487567 hrs = 56 yrs
40K/8K         (illegible)   516476 hrs = 59 yrs
8K/40K         (illegible)   678979 hrs = 78 yrs
Table 5.2.: MTTF Enhancement with Used Area Detection
data areas with no and total recovery capability respectively. Reliability enhancement
is now dependent on the proportion of the used area occupied by the data area.
Table 5.2. notes 8 KByte and 40 KByte data areas and calculates their respective
MTTF influence on the microprocessor system. These calculations are represented
as reliability curves in Figure 5.11. Of course the program area can have a recovery
capability too: the above calculations are given purely as an example.
The strategic selection of fault tolerant techniques is not limited to those specialized
techniques presented in this chapter for particular modes of failure attributed to
erroneous microprocessor behaviour described in Chapter 3. Additional techniques
such as Recovery Blocks for programs, and parity bit checking for physical memory
(both described in Chapter 1), can be incorporated to further enhance reliability
through the coverage of other modes of processor system failure.
5.7. Summary
Operational characteristics have been identified in the model of erroneous
microprocessor behaviour proposed in Chapter 3. In particular the characteristics of
Initial Erroneous Jump (IEJ) and Subsequent Erroneous Jump (SEJ) are associated
with erroneous execution within the unused and used areas of the processor address
space respectively. The characteristics are modelled for a selection of 8-bit, 16-bit
and 32-bit processors, and variations are observed with differences in processor
architecture and instruction sets. Detection techniques are proposed which exploit these
characteristics in order to provide fault tolerance and hence increased reliability to
the microprocessor system.
A detection capability can be provided for the unused area using either a software
based technique for physically implemented memory, and/or an Access Guardian
which provides an additional detection capability for non-existent memory.
The used area incorporates functional areas for the program, data, and input/output
communications, and can be implemented in volatile and non-volatile memory.
A detection capability is provided for the program area by the application of a











Figure 5.11.: Enhanced Microprocessor Reliability with Access Guardian and Used
Area Detection Capability
(plot not recoverable; vertical axis: reliability; horizontal axis: number of events,
E(t), 0 to 36; legible curve label: M68000, 48K-PROG)
(SEJ) destinations to detect erroneous execution. The application of this technique
and its performance evaluation are covered by future chapters. Other techniques are
discussed in relation to providing a detection capability for the data and reserved
input/output areas of the used area.
Application of fault tolerance in the used area involves an additional memory
overhead. When inserted within the program area, this memory overhead also inflicts
a processing performance overhead, namely the jumps over detection mechanisms which
keep the correct program flow from being corrupted. The magnitude of these overheads
is evaluated for particular applications in the following chapters.
Fault tolerant techniques can be strategically selected for collective application
in order to achieve high reliability. The selection criteria used depend on the fault
classes which require detection and the feasibility of implementing fault tolerant
techniques within particular microprocessor systems.
Chapter 6
POST-PROGRAMMING, AUTOMATED, RECOVERY UTILITY (PARUT)
6.1. Introduction
6.2. Design and Development Objectives for the PARUT Prototype
6.3. A Functional Overview of PARUT
6.4. Design Features Incorporated within the PARUT Prototype
6.4.1. Programming Language
6.4.2. Programming Style
6.4.3. The Diagnostic Facility
6.4.4. Target Software
6.4.5. Target Processors
6.5. A Description of the PARUT Prototype's Operation
6.5.1. Data Code Analysis
6.5.2. Program Code Analysis
6.5.3. The 'Seeding' Algorithm
6.6. PARUT: A Review of the Prototype
6.7. PARUT: Developing a Standard Programming Tool
6.8. Summary
CHAPTER SIX
POST-PROGRAMMING, AUTOMATED, RECOVERY UTILITY (PARUT)
6.1. Introduction
A prototype software utility has been designed to automatically apply the software
implemented fault tolerant technique, proposed in the previous chapter, on
program code. This utility, called PARUT (Post-programming Automated Recovery
UTility), can also apply other software based fault tolerant techniques.
This chapter initially outlines the objectives of PARUT, and these are appraised
at the end of the chapter to assess the success of the prototype. PARUT, its function
and structure, are described in overview; a description of the physical mechanics
of the code can be found by examining the annotated utility listing in Appendix C.
Finally, enhancements for the PARUT prototype are suggested, and a proposal for the
development and application of PARUT as a standard programming tool is discussed.
6.2. Design and Development Objectives for the PARUT Prototype
The software implemented fault tolerant technique proposed in Chapter 5 involves
inserting detection mechanisms at machine code level to cover invalid branches
associated with erroneous microprocessor behaviour. Manual application of the technique
can be complex, especially for large target programs. The PARUT prototype is
designed to automate the technique's application.
As a prototype, PARUT does not have a rigorous specification but rather a set of
objectives. In order to facilitate a wider application of PARUT, its initial objective
is broadened to include those listed below.
o Apply and assess software implemented fault tolerant techniques.
o Process software for a variety of target microprocessors.
o Facilitate application to any software whose host processor is supported.
o Produce a report assessing utility activity.
In addition to the design objectives it is worthwhile specifying some development
objectives for the production of this prototype utility. In particular the prototype
program should exhibit qualities facilitating programmer/analyst comprehension and
modification of module mechanics. These qualities are especially important in the
production of a prototype because alterations are commonplace.
6.3. A Functional Overview of PARUT
PARUT has two input requirements: a copy of the software and a description
of the microprocessor on which it resides. The microprocessor description input
to PARUT is a file, referred to as MICRO_FILE, containing the target processor
instruction set. The file lists the defined instructions within the instruction set,
specifying each instruction opcode. The software presented to PARUT for processing
is in machine code format because the detection capability assessment and fault
tolerant technique application require knowledge of the opcode and operand usage
on the target processor. The target software is held in a file called CODE_FILE.
The execution of PARUT generates a report file, referred to as ANALYSIS_FILE,
which documents the detection capability assessment of the software under
investigation. The detection capability is evaluated by determining the proportion of
invalid branches that are detected during erroneous execution. PARUT also produces a
file called RESULT_FILE, containing the fault tolerant version of CODE_FILE, when
a software implemented fault tolerant technique is applied to the target software
represented in CODE_FILE. The format of this file can be tailored to meet specific
requirements. PARUT currently outputs the enhanced software in a format which
facilitates easy user interpretation of the utility action at machine code level.
An overview of the PARUT program is shown in Figure 6.1. Examples of the
two input files, MICRO_FILE (Motorola 68000 microprocessor) and CODE_FILE,
and the two output files, RESULT_FILE and ANALYSIS_FILE, associated with the










6.4. Design Features Incorporated into the PARUT Prototype
Particular design features can be incorporated into the PARUT program to realize
prototype objectives, outlined in section 6.2., or to underpin the procedural
activity of PARUT which is described in the next section of this chapter. This section
presents such design features and describes the criteria for their application.
6.4.1. Programming Language
PARUT is implemented in the Pascal programming language. This language was
selected for the prototype because of its structural constructs and readability. An
alternative considered was the programming language C, but it was rejected because
language constructs such as those involving linked lists are difficult to understand
when the reader is not proficient in the language. An important feature of a prototype
program is readability. A future development of PARUT might involve the translation
of the present code into another language deemed more appropriate. It is considered
that Pascal is relatively easy to translate.
6.4.2. Programming Style
The utility program has been written implementing 'good' programming practice
[Sommerville, 1985]. This involves developing code in concise modules (a few tens
of lines) which exhibit low coupling and high cohesion. Coupling and cohesion refer
to the required passing of external parameters to a module, and unity of operation
respectively. Such programming practice facilitates easy modification or replacement
of modules without disruptive consequences for the remainder of the utility program.
Furthermore, 'good' programming practice also encourages the production of readable
code.
6.4.3. The Diagnostics Facility
A diagnostics facility is provided to aid understanding of the function of PARUT.
The facility is activated by setting the 'DIAGNOSTIC' variable at the beginning
of the PARUT listing to 'true'. When active, the facility generates a file called
TRACE_FILE in which all functions and procedures accessed by the code operation
insert an entry specifying their name. Nested functions appear in the TRACE_FILE
as indented entries. This file can be accessed by the analyst to monitor the
chronological activity of PARUT. The last enclosure in Appendix C is a typical
TRACE_FILE.
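The behaviour of such a trace facility can be mimicked in Python as follows. This is only an illustrative sketch and not the Pascal mechanism used in PARUT: the decorator, the in-memory list standing in for TRACE_FILE, and the example procedure names are all hypothetical.

```python
import functools

DIAGNOSTIC = True  # plays the role of the 'DIAGNOSTIC' variable
_depth = 0         # current nesting level of traced calls


def traced(trace_lines):
    """Decorator: on each call, append the routine name, indented by call depth."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            global _depth
            if not DIAGNOSTIC:
                return func(*args, **kwargs)
            trace_lines.append("  " * _depth + func.__name__)
            _depth += 1
            try:
                return func(*args, **kwargs)
            finally:
                _depth -= 1  # restore the depth even if the routine fails
        return wrapper
    return decorate


if __name__ == "__main__":
    trace = []  # stands in for TRACE_FILE

    @traced(trace)
    def read_code_file():
        pass

    @traced(trace)
    def main():
        read_code_file()

    main()
    print("\n".join(trace))
```

Reading the resulting entries from top to bottom gives the chronological order of calls, with indentation showing which routine called which.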
6.4.4. Target Software
PARUT processes target software at machine code level in order to apply and/or
analyse its fault tolerance. The data structure chosen to represent the machine code is
a linked list of records. This data structure is used because it requires no predefinition
of dimensions and can easily be manipulated when inserting records representing code
associated with the application of fault tolerance.
Each record within the linked list contains information describing the characteristic
of a machine code element (usually 8 or 16 bits). The contents of a record
can be found at the beginning of the PARUT listing in Appendix
C under the 'TYPE' declaration. The items within each record are reviewed below:
next_address & last_address
- are pointers connecting adjacent records in the linked list. 'next_address'
and 'last_address' are set to nil in the last and first records in the
linked list respectively, denoting the list's termination.
op, optype, & address
- specify the absolute value of the machine code element (typically
8 or 16 bits), whether it is an opcode or operand, and its resident
location in the address space.
offset
- is a parameter used in address processing.
seeded
- specifies the status of an operand identified as a potential invalid
branch; 'true' and 'false' signify the presence and absence of fault
tolerance respectively.
jump_type, jump_too & jump_address
- specify the type of jump instruction that the item 'op' generates
under erroneous execution as an opcode (details are shown under
the 'TYPE' declaration at the beginning of the PARUT listing),
a pointer to the target location, and the target address respectively.
Non-jump instructions set 'jump_type', 'jump_too', and
'jump_address' to '0', 'nil', and '0' respectively.
jump_from
- is a pointer to a record containing machine code identified as
potentially generating an invalid branch whose target is this record;
the pointer has the default setting of 'nil'.
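The record layout itemized above can be pictured with the following Python sketch. It mirrors the field names given in the text, but the types, the default values, and the helper functions are assumptions made for illustration; the actual Pascal 'TYPE' declaration is in Appendix C and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(eq=False)  # compare records by identity, not by (cyclic) field values
class CodeRecord:
    op: int                       # value of the machine code element
    optype: str                   # 'opcode' or 'operand' (assumed encoding)
    address: int                  # resident location in the address space
    offset: int = 0               # parameter used in address processing
    seeded: bool = False          # True once fault tolerance covers this element
    jump_type: int = 0            # 0 for non-jump elements
    jump_too: Optional["CodeRecord"] = None
    jump_address: int = 0
    jump_from: Optional["CodeRecord"] = None
    next_address: Optional["CodeRecord"] = None  # None plays the role of nil
    last_address: Optional["CodeRecord"] = None


def insert_after(record: CodeRecord, new: CodeRecord) -> CodeRecord:
    """Link `new` into the list directly after `record`."""
    new.last_address = record
    new.next_address = record.next_address
    if record.next_address is not None:
        record.next_address.last_address = new
    record.next_address = new
    return new


def build(values, start=0) -> Optional[CodeRecord]:
    """Build a list from raw element values; every element starts as an operand."""
    head = tail = None
    for i, value in enumerate(values):
        record = CodeRecord(op=value, optype="operand", address=start + i)
        if tail is None:
            head = record
        else:
            insert_after(tail, record)
        tail = record
    return head


def walk(head: Optional[CodeRecord]) -> Iterator[CodeRecord]:
    """Yield records from `head` to the end of the list."""
    while head is not None:
        yield head
        head = head.next_address
```

Because records are linked rather than stored in a fixed-size array, inserting the records that represent a detection mechanism needs only a few pointer updates, which is the property the text gives as the reason for choosing this structure.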
6.4.5. Target Processors 
Target software is input to the utility via a file called CODE_FILE. PARUT processes 
the software at source code level. The source code on most microprocessor 
systems is not directly readable and hence an indirect method of input for the code 
is required. CODE_FILE contains data generated by UNIX 'adb' (a debugger). Particular 
details of CODE_FILE can be found in Appendix C. In summary, the file 
contains two sections: a memory dump of the resident source code, and a list of 
opcode addresses within the source code. 
6.5. A Description of the PARUT Prototype's Operation 
This section briefly describes the procedural activity of, and user interaction with, 
PARUT. An annotated listing of the PARUT code is held in Appendix C. 
Within the program structure defined by Pascal, the root module is a function 
called 'MAIN'. A call chart of functions and procedures used by MAIN is shown in 
Figure 6.2. The chart depicts module operations in rectangular boxes, and functions 
and procedures in rectangular boxes with duplicated vertical bars. 
The initial job of module MAIN is to initialize variables and prepare all files 
required by PARUT. After this is completed, the user is required to respond to a 
then whether a program area requires analysis. If neither option is requested, the 
prompts are repeatedly presented until the user chooses an option. After completion 
of the requested analysis PARUT terminates activity and returns the user to the host 
environment. 
Data and program code analysis implement a common multi-stage processing 
approach. The approach involves the following sequence of activity, 
i) Generating a linked list to represent the machine code of the software 
under investigation. This includes the insertion of information 
within each record of the linked list detailing program flow 
associated with valid and invalid interpretation of the machine 
code. 
ii) Applying fault tolerance by manipulating the linked list, ensuring 
that the valid program flow is preserved (program code only). 
iii) Producing a report documenting fault tolerant analysis of the machine 
code and, in the case of program code, detailing the enhancement 
provided by the application of software implemented fault 
tolerant techniques. 
6.5.1. Data Code Analysis 
Analysis of data code involves four basic operations, each operation being contained 
within a function or procedure. Initially the user may request by prompt to 
construct a data structure either from actual data code (procedure BUILD_ADB), or 
from pseudo-random generated code (procedure BUILD_RNG). Actual data code was 
received in a UNIX 'DIS' format in an early version of the PARUT prototype, and 
for this the procedure BUILD_DIS was written. Now data code is received in a UNIX 
'ADB' format and hence procedure BUILD_ADB is used; however, BUILD_DIS remains 
available for future use if required. After generating the data structure for the 
code, procedure BUILD_JUMPS is used to derive the program flow through the data 
area if it was incorrectly interpreted as program code. The penultimate operation 
of data area analysis is the output of the data code (primarily of use when the code 
is generated by PARUT) using the procedure PRINT_LIST. Finally, the data code 
is analysed to determine the hazard of interpreting it as program code, and this is 
achieved by the procedure DATA_ANALYSIS. 
6.5.2. Program Code Analysis 
Program area analysis calls more procedures than data area analysis but retains 
the same basic approach. Initially a data structure is constructed by procedure 
BUILD_ADB to represent input program code. The user is then requested by a 
sequence of prompts to select a combination of software implemented fault tolerant 
techniques to be applied to the code including the technique proposed in Chapter 5. 
Selection of any of the fault tolerant techniques offered to the user requires a 
duplicate copy of the linked list representing the target program code and this is 
provided by activating the procedure COPY_LIST. This copy is then processed by 
the procedure BUILD_JUMPS so that the necessary program flow information associated 
with both valid and invalid interpretation of the program code is incorporated. 
Then depending on the fault tolerant techniques chosen by the user, the procedures 
SEED_LIST (technique proposed in Chapter 5), STREAM_LIST (signature analysis), 
and RELOCATE_LIST (an alternative technique now discarded due to poor results) 
are executed. Other techniques can be added to those offered by PARUT, and would 
be included here in the structure of PARUT. 
Once the selected techniques have been implemented on the copies of the linked 
list representing the original target program code, two 'housekeeping' operations are 
required. Firstly, procedure ALIGN_LIST is called which resets any disturbed absolute 
branches in the program code. In this manner PARUT does not compromise the 
transparent application feature of the software implemented fault tolerant technique 
(proposed in Chapter 5) upon the target software. Secondly, procedure TIDY_LIST 
is executed to remove any redundancy in the placed detection mechanisms. When 
the housekeeping is complete, the resultant code enhancement is output by procedure 
PRINT_LIST. 
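The need to reset disturbed absolute branches can be illustrated with a small sketch. It is not the ALIGN_LIST procedure from the PARUT listing. It assumes a simplified model in which every code element is 2 bytes wide, an absolute branch operand simply holds the address of its target, and each such operand keeps a reference to its target element (in the spirit of the 'jump_too' pointer). The names Word, readdress and align are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Word:
    """A code element in a simplified, assumed model of the program."""
    value: int
    address: int = 0
    target: Optional["Word"] = None   # set only on absolute-branch operands


def readdress(code: List[Word], base: int, width: int = 2) -> None:
    """Recompute every address after elements have been inserted."""
    for i, word in enumerate(code):
        word.address = base + i * width


def align(code: List[Word]) -> None:
    """Rewrite absolute-branch operands so they point at the moved targets."""
    for word in code:
        if word.target is not None:
            word.value = word.target.address


# Four elements; element 1 is an absolute-branch operand aimed at element 3.
program = [Word(0x0001), Word(0), Word(0x0002), Word(0x0003)]
program[1].target = program[3]
readdress(program, 0x1000)
align(program)                      # operand now holds 0x1006

program.insert(2, Word(0xFFFF))     # place a (dummy) detection mechanism
readdress(program, 0x1000)
align(program)                      # operand follows its target to 0x1008
```

Because the operand refers to its target by reference rather than by number, the inserted element does not change where the branch lands, which is the sense in which the enhancement stays transparent to the target software.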
The operation of program code analysis terminates with a prompt to the user requesting 
a choice whether or not program flow analysis is required. If it is not required 
then the action of this section of code is complete, otherwise the procedure ANALYSIS 
is activated. As a prerequisite to executing ANALYSIS the original program 
code data structure must be prepared for comparison with the enhanced program code 
version(s). This is achieved by the activation of the procedure BUILD_JUMPS. On 
returning from ANALYSIS this section of the PARUT code completes its operation. 
ANALYSIS reports instances of placement deadlock when the software implemented 
fault tolerant technique proposed in Chapter 5 is applied. A screen dump of the user 
interface for program code analysis is shown in Figure 6.3. 
6.5.3. The 'Seeding' Algorithm 
This section describes the algorithm used by PARUT to apply the fault tolerant 
technique proposed in Chapter 5. The algorithm is implemented by a procedure 
called SEED_LIST. 
It is necessary before the algorithm is described to introduce some basic terminology 
and observations concerning the structure of invalid branches within machine 
code. The 'range' of an invalid branch describes the machine code locations lying 
between its generating location and target address. The range of an invalid branch 
has a 'level' which denotes how many target addresses of other invalid branches lie 
within its range. Within a section of machine code, invalid branches can 'group' 
incorporating features of four identified basic constructs. These constructs are shown 
in Figure 6.4 and are: 
a) Non-Intersecting Invalid Branches 
The generating locations and target addresses of the individual invalid 
branches range over independent areas of the machine code. 
b) Intersecting Invalid Branches 
The generating locations and target addresses of the individual 
invalid branches range over areas of the machine code that overlap. 
The range of one invalid branch contains the target address 
but not the generating location of the remaining invalid branch. 
[Figure 6.3: Screen dump of the user interface for program code analysis] 
#Execution begins 
PARUT : TRANSIENT FAULT RECOVERY TOOL 
INFORMATION INPUT :- Please type appropriate response 
Data or Program Area (DATA/PROGRAM)? 
PROGRAM 
Detection Mechanism Placement (YES/NO)? 
YES 
=> Optimise Placement (YES/NO)? 
YES 
Boundary Relocation (YES/NO)? 
NO 
Signature Placement (YES/NO)? 
NO 
<<< Original Code (for comparison) being prepared >>> 
Analysis Required (YES/NO)? 
YES 
#Execution terminated 
c) Coupled Invalid Branches 
Intersecting invalid branches except that ranges of both invalid 
branches contain the target address, but not the generating location, 
of the respective remaining invalid branch. 
d) Embedded Invalid Branches 
The generating location and target address of one invalid branch 
both lie within the range of the remaining invalid branch. 
Of course there may be more complex situations of invalid branch interaction in the 
machine code under investigation, but such situations are constructs of the primitives 
listed above. The 'seeding' algorithm ensures that all invalid branches are resolved 
by 'seeding' except where placement deadlock is identified. 
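The definitions of 'range', 'level' and the four basic constructs can be made concrete with a short sketch. It is an illustration only and is not drawn from the PARUT listing. It assumes that an invalid branch is modelled as a pair (generating location, target address) and that the range excludes both end points; the function names, and the extra label 'unclassified' for pairs that match none of the four definitions, are hypothetical.

```python
def inside(addr, branch):
    """True if addr lies strictly between the two ends of the branch."""
    lo, hi = min(branch), max(branch)
    return lo < addr < hi


def level(branch, others):
    """Number of target addresses of other invalid branches within the range."""
    return sum(1 for other in others if inside(other[1], branch))


def classify(a, b):
    """Classify a pair of invalid branches, each (generating, target)."""
    a_has_src, a_has_tgt = inside(b[0], a), inside(b[1], a)
    b_has_src, b_has_tgt = inside(a[0], b), inside(a[1], b)
    if (a_has_src and a_has_tgt) or (b_has_src and b_has_tgt):
        return "embedded"            # one branch lies wholly inside the other
    if a_has_tgt and b_has_tgt:
        return "coupled"             # each range holds the other's target
    if a_has_tgt or b_has_tgt:
        return "intersecting"        # only one range holds the other's target
    if not (a_has_src or b_has_src):
        return "non-intersecting"    # independent areas of the code
    return "unclassified"            # overlap not covered by the four cases


print(classify((0, 4), (6, 10)))     # non-intersecting
print(classify((0, 10), (5, 15)))    # intersecting
print(classify((0, 10), (15, 5)))    # coupled
print(classify((0, 20), (5, 10)))    # embedded
print(level((0, 20), [(5, 10), (30, 15)]))   # 2
```

Forward and backward branches are treated alike here because only the interval between the two ends matters for the classification.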
The Algorithm 
Stage 1. Check whether or not there remain any unresolved invalid branches 
within the machine code. If not go to stage 8 of the algorithm. 
Stage 2. 'Seeding' required. Investigate the machine code resolving invalid 
branches at level zero unless this is not the first pass of the code, 
in which case, increment the level to be investigated by 1. 
Stage 3. Search the machine code until an unresolved invalid branch with 
the same level as that under investigation is found, or the end of 
the machine code is located. Searching commences initially from 
the start of the machine code. However, if an invalid branch has 
been resolved in the current code pass then the search commences 
at the location following the last address of the group in which 
that invalid branch was a member. 
Stage 4. Resolve the invalid branch, unless the end of the machine code 
was located, in which case go to stage 6 of the algorithm. 
Stage 5. Addresses are updated and valid program flow re-established for 
the machine code due to the insertion of a detection mechanism. 
Go to stage 3 of the algorithm. 
Stage 6. Remove complex groups of invalid branches. If the present level of 
investigation is greater than zero then recursively apply the 'seeding' 
algorithm from stage 3 incrementing the level of investigation 
from zero to one lower than the present level. 
Stage 7. If there remain unresolved invalid branches at the level of investigation 
after the code pass then return to stage 3 of the algorithm 
and start a new pass of the code, otherwise go to stage 1. 
Stage 8. 'Seeding' complete. 
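The control flow of the eight stages can be sketched as follows. This is a deliberately simplified toy model and not the SEED_LIST procedure: it assumes each invalid branch is a pair (generating location, target address), that a resolved branch no longer contributes to the level of the others, and that resolving a branch merely records its name. Address updating (stage 5), the group-based restart point of stage 3 and placement deadlock are not modelled, and the function name seed_order is hypothetical.

```python
def seed_order(branches):
    """Return the order in which a toy 'seeding' pass resolves branches.

    branches maps a name to (generating_location, target_address).
    """
    unresolved = dict(branches)
    order = []

    def level(name):
        src, tgt = unresolved[name]
        lo, hi = min(src, tgt), max(src, tgt)
        return sum(1 for other, (_, t) in unresolved.items()
                   if other != name and lo < t < hi)

    def code_pass(lvl):
        # Stages 3 to 5: one pass over the code in address order.
        for name in sorted(unresolved, key=lambda n: min(unresolved[n])):
            if level(name) == lvl:
                order.append(name)       # stage 4: resolve the branch
                del unresolved[name]

    def place(lvl):
        while True:
            code_pass(lvl)
            for lower in range(lvl):     # stage 6: recurse on lower levels
                place(lower)
            if not any(level(n) == lvl for n in unresolved):
                return                   # stage 7: nothing left at this level

    lvl = 0
    while unresolved:                    # stage 1: any unresolved branches?
        place(lvl)                       # stage 2: level 0 first, then +1
        lvl += 1
    return order                         # stage 8: 'seeding' complete


print(seed_order({"A": (0, 20), "B": (5, 10), "C": (30, 15)}))
# ['B', 'A', 'C']
```

In this toy input, branch B starts at level 0 and is resolved first; removing it lowers A to level 1, and resolving A in turn leaves C at level 0 for the recursive call. The figures and trace in the text refer to a different, more complex group.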
The function of the 'seeding' algorithm is now demonstrated with the example 
of a complex invalid branch group shown in Figure 6.5. Noted below are the stages 
processed by the algorithm with status comments. The example should be examined 
in association with the 'seeding' algorithm. 
Stage 1 : Unresolved invalid branches. 
Stage 2 : 'Seeding' required. Level = 0. 
Stage 3 : Start pass of linked list. 
Stage 4 : Pass of linked list completed. 
Stage 6 : No recursive call. 
Stage 7 : No invalid branches at this level. 
Stage 1 : Unresolved invalid branches. 
Stage 2 : 'Seeding' required. Level = 1. 
Stage 3 : Start pass of linked list. 
Stage 4 : Invalid branch 'B' identified. 
Stage 5 : Resolve coupled invalid branch. 
Stage 3 : Complete pass of linked list. 
Stage 4 : Pass of linked list completed. 
Stage 6 : Recursive call to stage 3 with level = 0. 
Stage 4 : No invalid branches at this level. 
Stage 6 : Recursive call completed. 
[Figure 6.5: A Complex Example of Algorithmic Processing. Letters = Invalid Branch Reference; numbers = Resolving Order of Algorithm] 
Stage 3 : Start pass of linked list. 
Stage 4 : Invalid branch 'C' identified. 
Stage 5 : Resolve coupled invalid branch. 
Stage 3 : Complete pass of linked list. 
Stage 4 : Pass of linked list completed. 
Stage 6 : Recursive call to stage 3 with level = 0. 
Stage 3 : Start pass of linked list. 
Stage 4 : Invalid branch 'D' identified. 
Stage 5 : Resolve embedded invalid branch. 
Stage 3 : Continue pass of linked list. 
Stage 4 : Invalid branch 'A' identified. 
Stage 5 : Resolve non-intersecting invalid branch. 
Stage 3 : Continue pass of linked list. 
Stage 4 : Pass of linked list completed. 
Stage 6 : Recursive call completed. 
Stage 7 : Level 1 invalid branches resolved. 
Stage 1 : All invalid branches resolved. 
Stage 8 : 'Seeding' complete. 
The procedural implementation of this algorithm is now briefly reviewed. Stages 
1, 2, and 8 of the algorithm are implemented directly by procedure SEED_LIST, 
whilst the remaining stages are controlled by the called procedure SEED_PLACEMENT. 
Procedure SEED_PLACEMENT manages three procedures and recursive activation 
of itself. SEED_PLACEMENT initially executes procedure SEED_LOCATION 
to achieve stages 3 and 4 of the algorithm. This routine activates five other procedures. 
Initially JUMP_DIRECTION is used to determine the forward or backward 
nature of the invalid branch, then INTERVAL evaluates the level of the invalid branch 
and procedure TEST_SEED checks whether the invalid branch is already resolved by 
another detection mechanism placement. If the present invalid branch can be resolved 
then procedures SEED_DETAILS and PLACE_SEED construct and insert the detection 
mechanism into the machine code. SEED_DETAILS can place a default size or 
optimum size detection mechanism, depending on user criteria passed by the root 
procedure MAIN. After executing SEED_LOCATION, SEED_PLACEMENT processes 
stage 5 of the 'seeding' algorithm by activating procedures ADDRESS_LIST 
and BUILD_JUMPS. These procedures are also directly used by the root procedure 
MAIN and are described in the following sections of this chapter. Finally, 
SEED_PLACEMENT implements repetitive and recursive calls to stages 3-7 of the 
'seeding' algorithm in order to resolve complex invalid branch groups. 
6.6. PARUT: A Review of the Prototype 
The PARUT prototype successfully applies the software implemented fault tolerant 
technique proposed in Chapter 5. The 'seeding' algorithm employed by PARUT 
appears from experience to be efficient, but no quantitative assessment of its performance 
has been attempted. The algorithm is based on solving constructs of invalid 
branches: non-intersecting, intersecting, coupled, and embedded. The algorithm is 
also validated for an example of complex invalid branch group structure in machine 
code. In addition PARUT is designed to enable the simple inclusion of other software 
implemented fault tolerant techniques. In particular the prototype currently 
implements a simulation of the signature analysis technique. Further techniques can 
be included as required during any future development. 
It is important that the operation of the prototype can be easily understood. A 
diagnostic facility is built into PARUT enabling the generation of a procedure call 
list referred to as TRACE_FILE. This list enables the operation of the utility to be 
monitored and hence aid comprehension of operation. Furthermore, programming 
language and style are adopted to facilitate understanding of the PARUT program. 
These qualities of the prototype have also proved valuable during the utility development, 
facilitating easy code manipulation without the disruption often associated 
with prototype development of similar sized programs. 
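A procedure call list of the kind described above can be produced with very little machinery. The sketch below is an illustration in Python and is not the TRACE_FILE facility itself: the helper make_tracer is hypothetical, the call list is held in memory rather than written to a file, and the two traced procedures are empty stand-ins that only borrow names from the text.

```python
import functools


def make_tracer():
    """Return a decorator and the list that records each procedure call."""
    calls = []

    def traced(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append(func.__name__)   # log on entry, in call order
            return func(*args, **kwargs)
        return wrapper

    return traced, calls


traced, trace_list = make_tracer()


@traced
def build_jumps():
    """Stand-in body; the real procedure is in the PARUT listing."""
    return None


@traced
def seed_list():
    """Stand-in body that calls another traced procedure."""
    build_jumps()


seed_list()
print(trace_list)   # ['seed_list', 'build_jumps']
```

Reading the list top to bottom gives the order in which procedures were entered, which is the information needed to follow the operation of the utility.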
A linked list is employed to represent the input machine code for processing by 
PARUT. This data structure is not dimensioned and does not itself restrict the size 
of machine code input. Equally, modules processing the linked list are designed not 
to impose a dimension restriction. However, there will be a constraint on the size 
of the input code due to general limitations of the host system environment, e.g. 
a maximum size of CODE_FILE generation by UNIX 'adb'. Such restrictions are 
outside the scope of the PARUT development programme. 
One of the most difficult objectives to achieve is the use of PARUT with a 
range of target processor types. Software manipulation required for the application 
of software implemented fault tolerance uses the pseudo-compiler action of procedure 
BUILD_JUMPS on the source code of the target processor. Such activity 
implies knowledge of the processor's instruction set, currently input to PARUT in 
MICRO_FILE. A robust version of PARUT should incorporate design features which 
facilitate a complete specification of a microprocessor type within MICRO_FILE to be 
processed by BUILD_JUMPS, whose activity is independent of microprocessor architecture. 
Implementing such a robust specification is complex; therefore, for simplicity 
PARUT was developed to target only one processor type: the Motorola 68000 family. 
This family of microprocessors has a fixed size instruction set and extensions to the 
used instruction set are upwardly compatible. Hence, although only one processor 
type is made available by PARUT, the utility in reality can be used with a selection 
of processors within the Motorola 68000 family. This gives PARUT a base selection 
of target processors. 
6.7. PARUT: Developing a Standard Programming Tool 
The PARUT prototype extensively realizes its design and development objectives. 
It therefore appears feasible to further develop PARUT into a standard programming 
tool. Such a tool is valuable when implementing and assessing fault tolerance associated 
with the characteristics of erroneous behaviour described in Chapter 3. 
A standard programming tool based on the prototype PARUT should adopt the 
following recommended enhancements. Firstly, the range of target processors should 
be extended. This is possibly the most complex modification of PARUT involving 
the integrated development of a robust BUILD_JUMPS module and general purpose 
format for MICRO_FILE. Secondly, PARUT should be extended to implement (rather 
than simulate) other software implemented fault tolerant techniques. The program 
structure of the PARUT program has been demonstrated to provide easy inclusion 
of new techniques. Thirdly, the analysis techniques used to assess the effectiveness 
of the fault tolerant techniques implemented could, in addition to static analysis, 
provide dynamic analysis via emulation/simulation. 
As a standard programming tool PARUT might be incorporated into a compiler. 
This would remove the necessity of generating and processing CODE_FILE because 
all the required information on the target software is inherently available from the 
translation process of the compiler. 
6.8. Summary 
This chapter describes the design and development of the prototype programming 
tool PARUT. Design objectives are successfully attained. In particular a selection of 
software implemented fault tolerant techniques, including that proposed in Chapter 
5, are facilitated for a variety of target processors. This is achieved without undue 
restrictions on the size of the target software. Additionally the fault tolerance of 
the software can be assessed in respect of the hazard of erroneous microprocessor 
behaviour modelled in Chapter 3. 
The structured programming language Pascal is used in conjunction with 'good' 
programming practice to generate readable code and hence ease comprehension and 
modification. A diagnostic facility which monitors procedure access by PARUT op-
eration is also provided to aid understanding of the utility function. All these design 
features have proved valuable during the development of the PARUT prototype. 
The success of PARUT leads to the proposal that further development be initiated 
in order to produce a standard programming tool. Enhancements to PARUT 
for this purpose are outlined. The post-programming nature of the software implemented 
fault tolerant techniques applied by the PARUT function suggests its possible 
inclusion within a compiler, providing an additional code enhancement stage at the 
end of the translation process. 
Chapter 7 
ASSESSING FAULT TOLERANCE 
7.1. Introduction 
7.2. Assessing the Fault Tolerance of a Microprocessor System 
7.2.1. Assessment Parameters 
7.2.2. Parameter Evaluation 
7.2.3. Internal Microprocessor Faults 
7.2.4. Assessment Dependence on Application Software 
7.2.5. Behavioural Observations 
7.3. Single-Bit Fault Injection Experiment 
7.3.1. Fault Injection Experiment 
7.3.2. Microprocessor Application Under Investigation 
7.3.3. Programme of Injected Faults 
7.3.4. Selected Single-Bit Faults 
7.3.5. Decoupling the Microprocessor Detection Mechanisms 
7.3.6. Performance Evaluation 
7.4. Multiple-Bit Fault Emulation Experiment 
7.4.1. Emulation and Fault Investigation 
7.4.2. Microprocessor Applications under Investigation 
7.4.3. Programme of Emulated Faults 
7.4.4. Behavioural Analysis 
7.4.5. Identified Phases of Erroneous Execution 
7.4.5.1. The Initial Erroneous Jump Phase 
7.4.5.2. The Subsequent Erroneous Jump Phase 
7.4.6. Analysing Detection Capability 
7.4.7. Critical Hazards of Erroneous Behaviour 
7.4.7.1. Cessation of Processing 
7.4.7.2. Execution Loops 
7.4.7.3. Placement Deadlock 
7.4.8. Re-synchronizing Erroneous Execution 
7.5. Summary and Conclusions 
7.1. Introduction 
The effectiveness of software implemented fault tolerance in a microprocessor 
system is difficult to assess. Microprocessors and their memories are VLSI devices 
which have a huge number of potential logical fault sites. In addition these faults will 
not always be activated because of time dependent circuit operation. This chapter 
reports two fault insertion experiments initiated in order to investigate temporary 
faults within a microprocessor system implementing the software based fault tolerance 
technique proposed in Chapter 5. 
The fault insertion experiments investigate the effect of single-bit and multiple-bit 
faults on program behaviour. The faults are inserted into memory and the program 
counter, and the response of the microprocessor system tracked instruction by in-
struction. 
The first set of experiments involves injecting single-bit faults into a microprocessor 
system. Five classes of single-bit fault are injected in order to model faults at 
various locations in the microprocessor and memory. The faults include line-errors 
on the address and data bus during instruction and data cycles, and program counter 
faults. The response of the microprocessor system once the faults have been injected 
is monitored. A detailed analysis of the system response gives an important insight 
into the character and nature of erroneous microprocessor behaviour. 
Within microprocessor based systems faults are often observed to affect multiple-bit 
as well as single-bit locations. This occurrence is investigated by the second set 
of fault injection experiments. The fault class selected for multiple-bit fault injection 
is program counter corruption. This fault is selected because it is observed in many 
of the single-bit fault injection experiments, and it is relatively simple to inject. The 
fault response of three microprocessor systems (an 8, 16, and 32-bit machine) is 
analysed. 
Finally, the chapter concludes with a summary of experimental observations relat-
ing to the effectiveness of the software detection mechanisms and the Access Guardian 
function. In particular, detection latency and the hazard of re-synchronization are 
discussed. 
7.2. Assessing the Fault Tolerance of a Microprocessor System 
7.2.1. Assessment Parameters 
Techniques providing tolerance of temporary faults in VLSI devices are difficult 
to assess. This is particularly true for microprocessor systems. In order to assess the 
effectiveness of a fault tolerant technique it is necessary to evaluate several perfor-
mance parameters. Two performance parameters are particularly important: detec-
tion latency, and error coverage. Error coverage is the percentage of error conditions 
that can be detected by the technique, and the time taken between the activation 
of a fault as an error and its detection is referred to as its detection latency. It is 
also important to quantify the performance overhead imposed by the technique. The 
overhead comprises processing degradation and additional hardware requirement. 
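The two principal parameters can be written down directly from their definitions. The sketch below is illustrative only: it assumes each experiment is recorded as a pair (activation time, detection time), with None standing for an error that was never detected, and it leaves the time unit open (instructions or clock cycles, for example). The function names and the sample data are hypothetical and are not results from this thesis.

```python
def error_coverage(outcomes):
    """Percentage of activated errors that were detected."""
    if not outcomes:
        raise ValueError("no experiments recorded")
    detected = sum(1 for _, found in outcomes if found is not None)
    return 100.0 * detected / len(outcomes)


def detection_latencies(outcomes):
    """Time from activation of a fault as an error to its detection."""
    return [found - activated
            for activated, found in outcomes
            if found is not None]


# Made-up records: (activation time, detection time or None).
runs = [(0, 3), (10, None), (5, 6), (7, 11)]
print(error_coverage(runs))        # 75.0
print(detection_latencies(runs))   # [3, 1, 4]
```

Undetected errors lower the coverage but contribute no latency value, so the two parameters need to be reported together.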
7.2.2. Parameter Evaluation 
Analytic assessment of fault tolerant microprocessors is a difficult task, because 
in most applications it is impossible to determine appropriate error classes and the 
distribution of errors amongst these classes. Although models have been developed 
to investigate the effect of temporary hardware faults on executing microprocessor 
systems, their analysis is limited by assumptions and imposed restrictions. Mahmood 
& McCluskey [1985], and Namjoo [1982] have modelled the error coverage of signature 
analysis techniques but their investigation was limited to the effect of control flow 
errors, and their estimates proved slightly optimistic compared with actual results 
presented by Schuette & Shen [1986]. Nevertheless, the model did give a valuable 
indication of the effectiveness of signature analysis. 
Experimental evaluation by fault injection into actual hardware in many cases is 
the only way to estimate fault tolerance effectiveness successfully. In such experiments 
the selection of the fault injection method is crucial. Ideally a fault should be capable 
of injection at random and specific VLSI device locations, and at a certain point in 
time for a controlled period. Executing software is dynamic and temporary faults can 
have different outcomes depending on the activation of circuitry by microprocessing 
within the processor. This is referred to as the instruction sequence dependency. 
Higher processor workloads, in a multi-tasking environment, may also increase the 
probability of activating a fault. 
Temporary faults that occur in microprocessor devices are difficult to mimic in 
the laboratory. Initially fault injection experiments, see Table 2.3., inserted faults 
via the hardware pins of a device. More recently Chillarege & Bowen [1989] inserted 
faults into a microprocessor's memory. Whilst these faults model internal faults, faults 
are not actually injected within the device. Although there are methods of injecting 
internal faults to a microprocessor, notably Damm [1988] by power line fluctuations, 
and Gunneflo et al [1989] by ion radiation, these methods generate multiple faults at 
uncontrolled locations. 
Controlled fault generation can be provided by using microprocessor emulators 
and simulators, see Table 2.2., but analysis of the microprocessor response may be 
limited by the tool's sophistication. Armstrong & Devlin [1981] suggest that gate-level 
microprocessor simulators are prohibitively expensive for fault injection experiments. 
Therefore, they used a microprocessor simulator based on a functional model 
[Li et al, 1984]. More sophisticated emulators have become available to researchers, 
whereby gate-level descriptions are incorporated into functional models. Czeck & 
Siewiorek [1990] recently used such a sophisticated simulator. However, it must be 
realized that as the simulators increase in complexity so they become increasingly 
susceptible themselves to development errors. A simulator was not available to the 
author of this thesis so an alternative method of assessing fault tolerance is utilized. 
As mentioned above, an accepted method used by many researchers to obtain 
assessments of fault tolerant techniques is fault injection. The method is popular 
because it gives actual error coverage analysis for the injected faults. A limitation 
of the method is the validity of the result for the whole device. Locations for fault 
injection are usually chosen by the experiment designer following some predefined 
selection criteria. The representativeness of these points for the remainder of the device 
should always be critically examined. This method appeared to be the best approach 
to analyse the software based fault tolerant technique proposed in Chapter 5. 
7.2.3. Internal Microprocessor Faults 
Faults are selected for insertion in order to model actual faults in a microprocessor 
system. The inserted faults are only valid when modelling processor and memory 
faults; faults in the peripheral devices are not accurately modelled. Before comparing 
inserted and actual faults, it is necessary to describe the basic structure of the tested 
microprocessor system. 
The microprocessor system under test is considered as the integration of a data 
path (consisting of an ALU and data registers), and a control path (consisting of 
an opcode decoder, controller, and program control unit). The program control unit 
(PCU) contains bus interface circuitry (BIC), program counter (PC), instruction 
prefetch queue (IPQ), and a controller. The PCU is responsible for address calculations, 
and it is assumed to be responsible for the reading and writing of opcodes and 
operands. Interrupt handling and bus arbitration circuitry will not be considered 
because the inserted faults will not closely model faults within these units. A general 
microprocessor topology is shown in Figure 7.1. The microprocessor system under 
test does not have the following features: co-processor, memory cache, instruction 
pipeline (beyond the prefetch queue), and multiplexed busses. 
7.2.4. Assessment Dependence on Application Software 
The effectiveness of software based fault tolerant techniques is difficult to assess. 
Techniques such as recovery blocks, signature analysis, and the technique proposed 
in Chapter 5 are all dependent on the application program for their performance. 
It is therefore important to select a representative application program when assessing 
the effectiveness of a fault tolerant technique. Czeck & Siewiorek [1990] and 
in order to provide an assessment benchmark. However, particular applications for 
implementation can have quite different characteristics from any of the benchmark 
software. Hence the value of a benchmark program is limited to giving an indication 
of the effectiveness of a fault tolerant technique implemented on a similar application 
program. 
7.2.5. Behavioural Observations 
Recently reported fault injection experiments (see Table 2.2.) measure the detec-
tion latency of fault tolerant techniques. Such experiments do not enable the mech-
anism of spawning errors, at a functional level, to be observed. The fault injection 
experiments described within this chapter involve tracing the instruction sequence 
of microprocessor operation from the activation of the fault to its detection or de-
activation. These experiments provide an interesting insight into the mechanisms, 
modelled in Chapter 3, of processing failures induced by temporary faults. 
7.3. Single-Bit Fault In jec t ion Experiment 
7.3.1. Fault In jec t ion Experiment 
The experimental set-up of the fault injection programme is shown in Figure 7.2. 
Within the Engineering Department at the University of Durham there is a Vittese 5 
computer system based on the Motorola 68020 microprocessor which services many 
devices, including terminals independently connected through a dedicated Motorola 
68000 microprocessor system, in a multi-user environment. Programs written for 
implementation on a dedicated Motorola 68000 system are coded on the Vittese in 
Assembler before being assembled. The object code can then be down-loaded onto 
an MC68000 microprocessor system if one is attached to the terminal in use. The 
MC68000 microprocessor system used was developed by the Microprocessor Centre 
at the University of Durham, and has a special 'monitor' program which provides, 




o read/write memory locations.
o read/write register contents.
o trace execution (instruction by instruction).
The fault injection programme implemented involves injecting faults into the 
microprocessor memory or program counter. These faults are activated by the appli-
cation program's execution, and their effects are monitored by using the trace facility 
and interrogating the register contents to ascertain the processor's status. 
7.3.2. Microprocessor Application Under Investigation
For the purpose of this section, a single microprocessor application is analysed. 
The system chosen has many typical characteristics of an industrial microprocessor 
based controller such as the monitoring and control of equipment to perform an 
on-going task or process. 
The application system monitors the water level in two connected reservoirs, 
one higher than the other. If the higher reservoir level falls beneath a minimum 
marker (solenoid float) then water is pumped from the lower to the higher reservoir. 
Similarly, if the higher reservoir level goes above a maximum marker (solenoid float) 
then water is drained from the higher to the lower reservoir. 
The controller is based on a Motorola 68000 microprocessor operating at 8 MHz
although a lower operating frequency would be acceptable for this application. The 
microprocessor executes software stored within 64 KBytes of memory. The actual 
program size is initially 381 Bytes. The application program is processed by the 
PARUT tool, described in Chapter 6, in order to strategically place software detection 
mechanisms within the code. These mechanisms, designed to provide fault tolerance
for the application program under investigation, required an additional 108 Bytes of
memory. Annotated listings of the original application program, and of the same
program after processing by PARUT, can be found in Appendix D.
The Motorola 68000 microprocessor system has several detection mechanisms:
bus errors (access to unavailable address space) are detected by logic external to the
processor whilst address errors (invalid specification of memory locations, e.g. odd
byte address exception) and illegal opcodes are inherently detected by the processor.
Collectively these mechanisms shall be referred to as the MC68000 detection facility. 
The bus and address errors within the MC68000 detection facility duplicate 
the function of the Access Guardian proposed in Chapter 5. Therefore an Access 
Guardian unit is not attached to the microprocessor system. The collected results 
from the fault injection experiments are processed in order to de-couple the Access 
Guardian function from the MC68000 detection facility. 
7.3.3. Programme of Injected Faults 
The fault insertion programme was based on that used by Schuette & Shen [1986]. 
They injected faults by temporarily altering pin logic values on an embedded Mo-
torola 68000 microprocessor. The faults in this insertion programme are injected by 
corrupting the microprocessor memory. The experiment models five classes of fault 
within the microprocessor system as described below. 
A. Instruction Cycle : Data Bus Faults 
These faults are inserted by corrupting bit positions of an opcode stored in mem-
ory. The fault appears on the data bus when the instruction opcode is read. This 
class of fault models the following situations. 
o Memory degradation or data bus line-errors external to the microprocessor.
o Errors in the bus interface circuitry or internal data bus line-errors.
o Errors in the Opcode Decoder.
o Program Counter faults or Address Calculating Circuitry errors, arising either
from an incorrect branch address being determined (special MC68000 case where
the displacement is held in the opcode), or from a corrupted opcode or Opcode
Decoder error causing the wrong number of operands to be read, so that an
incorrect location is accessed for the next opcode.
B. Data Cycle : Data Bus Faults 
These faults are inserted by corrupting bit positions of an operand in memory, 
the fault appearing on the data bus when the operand is accessed. This class of fault 
models the following situations. 
o Memory degradation or data bus line-errors external to the microprocessor.
o Errors in the bus interface circuitry or internal data bus line-errors.
o Faults in the data registers when operands are moved by them.
o Arithmetic Logic Unit (ALU) errors if the operands are processed.
o Program Counter faults or Address Calculating Circuitry errors if the operand
is used in determining a branch address.
C. Instruction Cycle : Address Bus Faults 
These faults are inserted by replacing an opcode in memory with data from 
another location in the address space. The fault is activated when the opcode is 
accessed and an alternative opcode value is put on the data bus. This class of fault 
models the following situations. 
o Memory degradation or data bus line-errors external to the microprocessor.
o Errors in the bus interface circuitry or internal data bus line-errors.
o Faults in the stack pointer (when retrieving the next opcode location), program
counter faults, or errors in the Address Calculating Circuitry.
o Multiple faults in the Opcode Decoder which cause severe malfunction and
have an effect similar to multiple line-errors on either the internal or external
data bus, and burst faults in memory.
D. Data Cycle : Address Bus Faults 
These faults are inserted by replacing an operand in memory with data from 
another location in the address space. The fault is activated when the operand is 
accessed and an alternative operand value is put on the data bus. This class of fault 
models the following situations. 
o Memory degradation or data bus line-errors external to the microprocessor.
o Errors in the bus interface circuitry or internal data bus line-errors.
o Errors in the Address Calculating Circuitry.
o Alternatively this fault class can mimic multiple faults in the data register,
or ALU malfunction, or multiple line-errors on the internal or external data
bus, or burst faults in memory.
E. Program Counter Faults
These faults are inserted by corrupting the contents of the program counter using 
the status facility in the MC68000 board monitor program. The fault becomes active 
when the next opcode is accessed and processing is forced to deviate from its intended 
path. This class of fault models the following situations. 
o Line-errors on the internal address bus. 
o Opcode Decoder faults initiating a branch. 
o Corruption of the program counter, errors in the Address Calculating Cir-
cuitry, or stack pointer faults which lead to an incorrect branch. 
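The five injection classes above can be sketched as three small primitives acting on a memory image. This is a minimal illustration only, not the monitor program used in the experiment: the dictionary memory model, the function names, and the choice to read an unmapped location as zero are all assumptions made for the example.

```python
"""Toy model of memory-based fault injection for fault classes A to E."""


def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (a single-bit fault)."""
    return value ^ (1 << bit)


def inject_data_bus_fault(memory: dict, address: int, bit: int) -> None:
    """Classes A and B: corrupt one bit of the opcode or operand stored at address."""
    memory[address] = flip_bit(memory[address], bit)


def inject_address_bus_fault(memory: dict, address: int, bit: int) -> None:
    """Classes C and D: replace the word with the contents of the location
    reached by inverting one address bit (assumed to read as 0 if unmapped)."""
    memory[address] = memory.get(flip_bit(address, bit), 0)


def inject_pc_fault(pc: int, bit: int) -> int:
    """Class E: corrupt one bit of the program counter."""
    return flip_bit(pc, bit)
```

For example, with a hypothetical image `{0x100: 0x4E71, 0x102: 0x1234}`, `inject_data_bus_fault(memory, 0x100, 0)` leaves `0x4E70` at `0x100`, and `inject_pc_fault(0x0040, 12)` returns `0x1040`.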
Review 
A summary of the injected fault programme and modelled fault situations can 
be found in Table 7.1. The injected faults are all single-bit; multiple bit faults were 
not injected in this experiment. The modelled faults are single-bit, or simple errors, 
except where stated. 
7.3.4. Selected Single-Bit Faults 
The faults injected for the instruction and data cycle address and data bus ex-
periments, and the corrupted program counter experiment are single-bit corruptions. 
Bit faults on the address bus and program counter affect address bit positions 1, 4, 
8, and 12. Bit fault positions on the data bus are 0, 7, 8, and 15. These bit positions
are used by Schuette & Shen [1986] and Damm [1988] in their fault injection
experiments. The application program resides in low memory and hence address bus
faults on line 12 are typically detected by the Access Guardian function.

Fault Modelled                             Fault Class Injected
                                         A        B        C        D        E
Memory Bit Faults                        X        X        X(6)     X(6)
Memory Select Circuitry Error                              X        X
Line Errors : Internal Data Bus          X        X        X(6)     X(6)
              External Data Bus          X        X        X(6)     X(6)
              Internal Address Bus                         X        X        X
              External Address Bus                         X        X
Bus Interface Circuitry Errors           X        X        X        X
Faults : Data Registers                           X                 X(6)
         ALU                                      X                 X(5)
         Opcode Decoder                  X                 X(5)              X
Program Counter Faults                   X(2,3)   X(1)     X                 X
Address Calculating Circuitry Errors     X(2,3)   X(1)     X        X        X
Stack Pointer Faults                                       X(4)              X

Fault Class Injected:
A: Instruction Cycle: Data Bus Faults
B: Data Cycle: Data Bus Faults
C: Instruction Cycle: Address Bus Faults
D: Data Cycle: Address Bus Faults
E: Program Counter Faults
Notes:
(1) Only if the operand is used in determining a branch address.
(2) Special MC68000 case (1): displacement held in opcode.
(3) Corruption of an opcode or a fault in the Opcode Decoder can result in the wrong
number of operands being read for an instruction and hence a program counter
or address calculating circuitry error when accessing the next opcode.
(4) When retrieving next opcode address from stack.
(5) Multiple faults causing severe malfunction.
(6) Multiple faults causing severe mis-interpretation of opcode/operand.
Table 7.1. : Single-Bit Fault Injection Programme
In total 2136 faults were injected during the single-bit fault experiment. These
faults disrupted program and data flow. A feature of the fault injection programme
is that results are dependent on the instruction sequence and not the instruction
mix: actual execution paths are traced instruction by instruction. This is important
because faults can have different effects when the microprocessor system is in
various run-time conditions.
Each instruction cycle fault class had 472 faults injected, as did the program
counter fault experiment. The data cycle fault classes comprise 360 faults each.
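As a quick consistency check, the per-class counts quoted above can be summed to recover the stated total. The dictionary labels are shorthand introduced here for the example; only the counts come from the text.

```python
# Faults injected per fault class, as quoted in the text.
FAULTS_PER_CLASS = {
    "A: instruction cycle, data bus": 472,
    "B: data cycle, data bus": 360,
    "C: instruction cycle, address bus": 472,
    "D: data cycle, address bus": 360,
    "E: program counter": 472,
}

total_faults = sum(FAULTS_PER_CLASS.values())

# Sub-totals used later when outcomes are quoted per group of classes.
data_cycle_faults = sum(
    count for name, count in FAULTS_PER_CLASS.items() if "data cycle" in name
)
instruction_cycle_faults = sum(
    count for name, count in FAULTS_PER_CLASS.items() if "instruction cycle" in name
)
```

The three instruction-type experiments contribute 3 x 472 = 1416 faults and the two data cycle experiments 2 x 360 = 720, giving the 2136 faults reported.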
7.3.5. Decoupling the Microprocessor Detection Mechanisms 
The microprocessor system under investigation has two sources of detection mech-
anisms: those implemented by the software based fault tolerant techniques proposed 
in Chapter 5, and those inherently present in the embedded MC68000 microprocessor. 
Let the sample space F contain all the faults injected into the microprocessor 
system. Let the sets M and P be the faults covered by the MC68000 detection 
facility and the software detection mechanisms planted in the application program 
respectively. Now, 
$(M \cap P) = \emptyset$, (7.1.)
but unfortunately M and P do not describe the whole fault set F:
$(M \cup P) \subset F$. (7.2.)
This is because the errors generated by some faults cause erroneous execution to
re-synchronize and hence avoid detection.
Let R be the set of faults that generate a re-synchronized outcome, and therefore
the injected fault outcomes can be described as follows.
$(M \cup P) \cup R = F$, (7.3.)
$(M \cup P) \cap R = \emptyset$. (7.4.)
The MC68000 detection facility is observed to detect fault outcomes which would 
otherwise have been diagnosed by an Access Guardian. For the purpose of analysis 
it is useful to de-couple the Access Guardian function from the inherent MC68000 
detection mechanisms. Let set A contain all the faults that are detectable by the 
Access Guardian function. Then
$(M \cap A) = A$, (7.5.)
$(M \cup A) = M$. (7.6.)
The faults detected by the MC68000 detection facility without the Access Guardian
function are therefore given by $(M \cap \bar{A})$.
7.3.6. Performance Evaluation
The detection mechanisms provided fault tolerance for approximately 60% of the
faults injected into the system. The outcome of the faults not detected is observed
as re-synchronization. The mean latency to either detection or re-synchronization is
1.2 processed instructions. The implications of this latency are discussed below. The
overheads associated with the implementation of the software based fault tolerant
technique have been estimated in Chapter 5. The hardware overhead attributed to
the Access Guardian is approximately 5% of the MC68000 transistor count. The
additional memory required by the fault tolerant software is approximately 30% of
the original application program size.
A summary of the processing outcome after each single-bit fault is injected is
shown in Table 7.2. The processing outcomes are classified as re-synchronization,
detection by a software mechanism, detection by the MC68000 microprocessor
excluding the Access Guardian function, and finally, detection by the Access Guardian
function. The processing response en route to the monitored outcomes is detailed in
Table 7.3. and shown in Figure 7.3.
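The set relations (7.1.) to (7.6.) and the four outcome classes just listed can be illustrated with ordinary set operations. The sketch below is only an illustration: the function name and the fault identifiers in the usage note are hypothetical, and no experimental data are implied.

```python
def classify_outcomes(F: set, M: set, P: set, A: set) -> dict:
    """Split the injected fault set F into the four monitored outcome classes.

    M: faults covered by the MC68000 detection facility
    P: faults covered by the software detection mechanisms
    A: faults detectable by the Access Guardian function (a subset of M)
    """
    if M & P:
        raise ValueError("(7.1.) violated: M and P must be disjoint")
    if not A <= M:
        raise ValueError("(7.5.)/(7.6.) violated: A must be contained in M")
    if not (M | P) <= F:
        raise ValueError("detected faults must belong to F")
    R = F - (M | P)  # (7.3.) and (7.4.): the undetected faults re-synchronize
    return {
        "re-synchronization": R,
        "software mechanism": P,
        "MC68000 excluding Access Guardian": M - A,
        "Access Guardian": A,
    }
```

With hypothetical identifiers `F = set(range(10))`, `P = {0, 1}`, `A = {2, 3, 4}` and `M = A | {5}`, the four classes are `{6, 7, 8, 9}`, `{0, 1}`, `{5}` and `{2, 3, 4}`; they are pairwise disjoint and their union is F.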
Re-synchronization involves the program flow, which has already diverged from 
the specified control path, rejoining the specified control path. Over 40% of the 
injected faults lead to re-synchronization. This is not surprising, as a significant 
proportion of the faults injected during the experiment corrupted the instruction 
without initially affecting the program counter, but corrupted the program counter 
following the completed execution of the first instruction because the wrong number 
of operands were interpreted as belonging to the initial opcode. This scenario initiates 
irregular processing through the coded area before processing once more aligns itself 
with the original program flow. Gunneflo et al [1989] recorded 5% of their injected faults
leading to re-synchronization; the lower value of this figure can be attributed to the
nature of their fault experiments. Arlat et al [1989] and Chillarege & Bowen [1989]
report 46% and 42% of their fault injections leading to undiagnosed errors which
did not prevent continued processing although the function may have been slightly
corrupted. These faults together with similar faults, referred to as overwritten faults
and collated by Czeck & Siewiorek [1990] (59% [Schuette & Shen, 1986], 60-70%
[Czeck & Siewiorek, 1990], and 77% [Choi et al, 1989]), may to some degree be due
to the re-synchronization phenomenon.
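The misalignment mechanism behind re-synchronization can be illustrated with a toy instruction stream. The encoding below is purely hypothetical (the two low bits of an opcode word give the number of operand words that follow) and is not the MC68000 encoding; the sketch only tracks instruction boundaries and ignores any side effects of the misinterpreted instructions.

```python
def instruction_length(word: int) -> int:
    """Toy encoding: one opcode word plus (word & 3) operand words."""
    return 1 + (word & 0x3)


def true_boundaries(memory: list) -> set:
    """Addresses at which instructions start in the fault-free program."""
    boundaries, pc = set(), 0
    while pc < len(memory):
        boundaries.add(pc)
        pc += instruction_length(memory[pc])
    boundaries.add(pc)
    return boundaries


def resync_latency(memory: list, fault_addr: int, bit: int):
    """Count instructions processed from the corrupted opcode until the
    program counter again lands on a fault-free instruction boundary.
    Returns None if execution runs off the end of memory first."""
    good = true_boundaries(memory)
    faulty = list(memory)
    faulty[fault_addr] ^= 1 << bit  # single-bit corruption of the opcode
    pc, processed = fault_addr, 0
    while pc < len(faulty):
        pc += instruction_length(faulty[pc])
        processed += 1
        if pc in good:
            return processed
    return None
```

For the hypothetical stream `[0x10, 0x21, 0x05, 0x30, 0x42, 0x07, 0x08, 0x50]`, flipping bit 0 of the first opcode makes it absorb the next opcode as an operand; the operand `0x05` is then executed as an opcode, after which the program counter is back on a genuine instruction boundary, so `resync_latency(stream, 0, 0)` is 2.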
It is clear that re-synchronization is an important class of fault outcome. In the 
experiment re-synchronization occurred within five erroneously processed instruc-
tions, and had a mean occurrence latency of 1.2 instructions. It is interesting to 
note that 29 of the injected faults classified as re-synchronization, had no effect on 
execution - the induced erroneous behaviour being totally benign as far as the system 
status is concerned. For example, in the data cycle with address bus fault, the incor-
rectly accessed operand address may contain the same operand value as the correct 
operand address. These occurrences are shown in Table 7.3. where re-synchronization 






A particular hazard associated with re-synchronization is that from a user per-
spective there is a small and perhaps un-noticeable processing glitch which can involve 
corruption of the microprocessor stack or stack pointer. Approximately 3% of the 
injected faults re-synchronize with a corrupted stack. A similar result is reported 
by Damm [1988] who diagnosed 3.6% of his injected faults to cause this error. This 
hazard may prove critical at a much later processing stage when the return from a 
subroutine or other stack access occurs. This phenomenon of a sleeping fault, called
a potential hazard by Chillarege & Bowen [1989], being awakened at some future
period could explain the system crash data presented by Czeck & Siewiorek [1990]
where high corruption of system integrity is experienced.
Those faults injected to disrupt the data cycle have a 56% probability (approxi-
mately) of generating re-synchronization, which is twice the expectation for instruc-
tion cycle faults. This is not an unexpected result because corrupted operands will 
typically alter the result of the function but not the function itself whilst corrupted 
opcodes will alter the function and interpretation of operands. It is interesting to 
note that the program counter faults, which are intuitively more allied to instruc-
tion cycle faults, generate re-synchronization for 30% of their outcomes which is very 
similar to the data cycle fault experiment. 
In the experiment non-re-synchronized erroneous execution was detected, by ei-
ther a software detection mechanism, Access Guardian function, or a non-Access 
Guardian function of the MC68000 detection facility, with a mean latency of 1.2 in-
structions. This is extremely rapid and highly desirable for reliable systems. The 
faults trapped with greatest latency (6 instructions) were detected by the software 
detection mechanism. 
Just over one quarter of the faults which do not re-synchronize, are caught by the 
software detection mechanisms. Only 8, or 1%, of the data cycle faults are detected 
in this way, compared with 20% of the instruction cycle and program counter faults. 
The software detection mechanisms have caught erroneous execution as late as the 
sixth processed instruction, and have a mean detection latency of 1.6 instructions in 
the experiment. 
The Access Guardian function was very successful in detecting injected faults. It 
detected 40% of the faults inserted with a mean detection latency of one processed 
instruction. The experimental result reported here compares favourably with other 
documented versions of this detection function; 60% [Gunneflo et al, 1989] where the 
used memory filled 12% of the Motorola 6809 microprocessor address space, and 58% 
[Schmid et al, 1982] where the used area filled 90% of the Zilog Z80 microprocessor 
address space. As noted by Gunneflo et al [1989], the effectiveness of this detection 
mechanism will increase as the percentage of used memory in the microprocessor 
address space decreases. 
The remaining injected faults were detected by the illegal opcode facility of the 
MC68000 microprocessor. These accounted for almost 5% of the detected faults. 
Early microprocessor designs did not incorporate this facility and under these fault 
injection experiments would have a reduced reliability. Schmid et al [1982] and Gun-
neflo et al [1989] attached an illegal opcode detector to their respective Z80 and 
MC6809 microprocessor systems; the facility detected approximately 35% and 30% 
of the injected faults. Most modern microprocessor designs incorporate this detec-
tion capability. The effectiveness of this mechanism is dependent on the number of 
illegal opcodes in the instruction (opcode) map, and the data diversity within the 
microprocessor used memory which is application dependent. It is therefore difficult 
to quantify the usefulness of this utility, but it can considerably improve a micropro-
cessor system's reliability. 
7.4. Multiple-Bit Fault Emulation Experiments 
7.4.1. Emulation and Fault Investigation 
Emulators are software tools that mimic the register action of a target micropro-
cessor. As such the injection of a fault will not be as accurately modelled as in a 
simulator which models the functional/gate activity of a microprocessor. However, 
emulators, unlike simulators, are commonly available and inexpensive. Indeed many 
modern microprocessor systems are provided with an emulator within a debugging 
facility. 
The register selected for fault investigation is the program counter. Other register 
faults would only investigate data type errors, whilst program counter faults are in-
dicative of instruction type faults. Gunneflo et al [1989], who injected faults internally 
at random via ion radiation, found that 77% of faults were instruction type. Other 
fault injection experiments support this finding: 77% [Schuette & Shen, 1986] and 
60% (experienced in the single-bit experiment documented within this chapter). It 
therefore seems reasonable to conduct experiments that investigate instruction type 
faults. 
7.4.2. Microprocessor Applications Under Investigation 
Three application programs have been selected for the multiple-bit fault injection 
experiment. The first program ('A'), is the same program used for the single-bit 
fault injection experiment. That is, a slurry pump control application involving the 
monitoring and control of a reservoir system. Program A is written in assembler for 
the Motorola 68000 microprocessor. The second program ('B'), is written in assembler 
for a Motorola 68(7)05 microprocessor based system and is concerned primarily with 
data movement using the processor input/output ports. The third program ('C'), 
unlike the other programs, is not an application program but rather a section of high 
level code written in C. The purpose of this program is to investigate the hidden
hazards that can be generated when high level language programs are translated 
to machine code. Program C is translated into machine code for the Intel 80386 
microprocessor. 
The three programs were selected to provide a diverse variety of application 
processor types and sizes; the Motorola 68(7)05, Motorola 68000, and Intel 80386
are 8, 16, and 32-bit machines respectively. As for the single-bit fault injection 
experiment, the programs are prepared by applying the software based fault tolerant 
technique proposed in Chapter 5. Annotated assembler/ machine code listings of the 
original application programs and the programs with strategically placed software 
detection mechanisms are shown in Appendix D. 
7.4.3. Programme of Emulated Faults 
The faults injected into the program counter are single and multiple-bit corrup-
tions. All possible program counter corruption patterns representing execution in the 
used area of the microprocessor address space are analysed through fault injection. 
The remaining program counter corruption patterns generate detection by the Access 
Guardian and are evaluated. Hence this class of fault is comprehensively analysed. 
High order bit faults in the program counter typically lead to detection by the Access
Guardian feature because the application program resides in low memory.
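The way a corrupted program counter value is sorted into "detected immediately" or "needs emulation" can be sketched as follows for the MC68000 case. This is a simplified illustration limited to single-bit corruptions: the function names, the assumption that the program occupies one contiguous block starting at address 0, and the order in which the two checks are applied are all choices made for the example.

```python
from collections import Counter


def classify_pc_target(target: int, used_start: int, used_size: int) -> str:
    """Decide how a corrupted program counter value would be handled."""
    if target % 2:
        return "address error"        # odd byte address, trapped by the MC68000
    if not used_start <= target < used_start + used_size:
        return "access guardian"      # lands in the unused address space
    return "emulate"                  # lands in used memory: trace the execution


def single_bit_pc_faults(pc: int, used_start: int, used_size: int,
                         address_bits: int = 24) -> Counter:
    """Classify every single-bit corruption of pc over the address width."""
    return Counter(
        classify_pc_target(pc ^ (1 << bit), used_start, used_size)
        for bit in range(address_bits)
    )
```

For a program of 308 bytes placed at address 0 and a program counter of `0x0040`, bit 0 yields an odd address, bits 1 to 7 keep execution inside the used area, and bits 8 to 23 all land in unused memory, which is consistent with the observation that high order bit faults are typically caught by the Access Guardian.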
The programs A, B, and C are investigated to assess their respective fault tolerance
with and without the software based fault tolerant technique proposed in
Chapter 5. Each program is evaluated:
Version 1 : without the software technique applied,
Version 2 : with the software technique applied (default detection mechanism size),
Version 3 : with the software technique applied (optimum detection mechanism size).
Program A : {68000}
The size of this program is 308 bytes, increasing to 500 and 416 bytes when default 
and optimum size software detection mechanisms are inserted respectively. Faults 
are emulated to analyse the microprocessor response to even byte program counter 
corruption covering every location in the program for each of the three program 
versions; a total of 612 fault runs. Odd byte program counter corruptions are detected 
automatically by the inherent MC68000 detection facility. 
Program B : {68(7)05} 
The size of this program is 53 bytes, increasing to 86 and 80 bytes when default 
and optimum size software detection mechanisms are inserted respectively. Faults 
are emulated to analyse the microprocessor response to program counter corruption 
covering every location in the program for each of the three program versions; a total 
of 219 fault runs. 
Program C : {80386} 
The size of this program is 293 bytes, increasing to 365 and 323 bytes when default 
and optimum size software detection mechanisms are inserted respectively. Faults 
are emulated to analyse the microprocessor response to program counter corruption 
covering every location in the program for each of the three program versions; a total 
of 981 fault runs. 
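The quoted numbers of fault runs follow from the program sizes if one run is made per candidate program counter value in each version, with only even byte addresses emulated for program A. That one-run-per-location reading is an assumption made here, although it reproduces all three totals.

```python
# Program sizes in bytes for versions 1, 2 and 3, as quoted in the text.
PROGRAM_SIZES = {
    "A": (308, 500, 416),
    "B": (53, 86, 80),
    "C": (293, 365, 323),
}

# Spacing between emulated program counter values: program A is emulated at
# even byte addresses only, programs B and C at every byte address.
ADDRESS_STEP = {"A": 2, "B": 1, "C": 1}


def fault_runs(program: str) -> int:
    """Total fault runs over the three versions of a program."""
    step = ADDRESS_STEP[program]
    return sum(size // step for size in PROGRAM_SIZES[program])
```

Program A gives 154 + 250 + 208 = 612 runs, program B gives 53 + 86 + 80 = 219 runs, and program C gives 293 + 365 + 323 = 981 runs.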
7.4.4. Behavioural Analysis 
The performance evaluation in the preceding section, concerning the fault in-
jection experiment, and other fault tolerance evaluations (see Tables 2.2. and 2.3.) 
involving fault injection, simulation, or emulation, provide static analysis. They do 
not investigate the processing behaviour associated with the latency of the spawning 
errors generated by the injected fault, and hence cannot identify dangers or assets of 
the techniques under evaluation. 
The emulation experiment reported here involves tracing erroneous execution 
instruction by instruction. In this way, the character of erroneous microprocessor 
behaviour can be investigated. This study validates the erroneous microprocessor be-
haviour model presented in Chapter 3, demonstrates the effectiveness of the software 
based fault tolerant technique proposed in Chapter 5, and re-iterates the importance 
of the re-synchronization phenomenon. 
7.4.5. Identified Phases of Erroneous Execution 
The erroneous microprocessor behaviour model assumes two phases of erroneous 
execution: that following an Initial Erroneous Jump (IEJ), and that following a Sub-
sequent Erroneous Jump (SEJ). The fault programme primarily investigates the SEJ 
phase, but the results can be extended to investigate the IEJ phase under the as-
sumption that an Access Guardian is embedded within the microprocessor system 
being evaluated. Execution within each phase can generate either a restart, unspec-
ified jump, return, or stop/wait outcome as described in Chapter 3. Restart out-
comes signify detection of erroneous execution, whilst stop/wait outcomes signify a 
cessation of execution. Only unspecified jump and return outcomes lead to another 
SEJ phase of erroneous execution. 
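The phase structure just described can be written down as a small transition rule. The sketch below is a simplified reading of the model: the names are introduced here for illustration, and it does not represent re-synchronization or the probabilities attached to each outcome in Chapter 3.

```python
from enum import Enum


class Outcome(Enum):
    RESTART = "RT"            # erroneous execution detected
    UNSPECIFIED_JUMP = "UJ"
    RETURN = "RN"
    STOP_WAIT = "SW"          # execution ceases


def next_state(outcome: Outcome) -> str:
    """State entered after a phase ends with the given outcome."""
    if outcome in (Outcome.UNSPECIFIED_JUMP, Outcome.RETURN):
        return "SEJ"          # only these outcomes start another SEJ phase
    if outcome is Outcome.RESTART:
        return "detected"
    return "halted"


def trace_phases(outcomes) -> list:
    """Follow a sequence of phase outcomes, starting in the IEJ phase and
    stopping at the first terminal state."""
    states = ["IEJ"]
    for outcome in outcomes:
        states.append(next_state(outcome))
        if states[-1] != "SEJ":
            break
    return states
```

A run whose phases end in an unspecified jump, a return, and then a restart is traced as `["IEJ", "SEJ", "SEJ", "detected"]`.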
The fault emulation results for program A, shown in Table 7.4., are analysed 
in detail in order to validate the erroneous microprocessor behaviour model. Erro-
neous execution can be detected in the MC68000 microprocessor system by either 
the Access Guardian, the inherent MC68000 detection mechanisms, or the software 
detection mechanisms. The Access Guardian detects access to the unused address 
space. Inherent MC68000 processor detection mechanisms include the odd byte ad-
dress exception for the program counter, and the illegal opcode exception. The 
software detection mechanisms are designed to detect SEJs. 
7.4.5.1. The Initial Erroneous Jump Phase 
The purpose of Figure 7.4. is to show the behavioural phase of program execution 
following an Initial Erroneous Jump (IEJ). The IEJ is generated by corrupting the
program counter. Some IEJ destinations are detected immediately, such as target 
locations in the unused area and odd byte addresses, and these are represented by 
the ordinate intercept in Figure 7.4. During erroneous execution such detection 
coerces the outcome of an instruction to a restart. The ordinate intercept is lower 
for version 2 because the injected software mechanisms increase the used memory 
requirement, which reduces the initial effectiveness of the Access Guardian. The odd 
byte address exception facility in the MC68000 microprocessor detects all program 
counter corruptions with an odd byte address, and hence will always have the same 
detection capability. 
During the IEJ phase, detection is provided by the software detection mechanisms
or by instructions generating conditions that are detected by the Access Guardian
or an inherent MC68000 detection mechanism. Version 1 does not have any inserted
software detection mechanism, whilst version 2 does. Not all return instructions in
program A produce a return outcome; some are liable to generate conditions which
are detected by the Access Guardian or an inherent MC68000 detection mechanism
and generate a restart outcome. The effect of the software detection mechanisms is
clearly observed in version 2 in which approximately half of the unspecified jump
outcomes now generate a restart outcome, the remaining unspecified jumps occurring
within re-synchronization.

(a) Version 1 : Original Program
Jump                   Number of Instructions Processed
Outcome      0     1     2     3     4     5     6     7     8     9    10
RT           0    13     7     4     5     4   4.5   2.5     3     0     0
UJ           0  16.5    14    14  16.5    12   6.5   4.5     4   2.5     1
RN           0     2     2     3   3.5     2   1.5     1   1.5     1   0.5
SW           0     0     0     0     0     0     0     0     0     0     0
(b) Version 2 : Insertion of Default Size Detection Mechanisms
Jump                   Number of Instructions Processed
Outcome      0     1     2     3     4     5     6     7     8     9    10
RT           0    93    17     5     1     0     0     0     0     0     0
UJ           0  32.5  37.5    33    15   8.5     4   0.5     0     0     0
RN           0     2     2   0.5     1   0.5     0     0     0     0     0
SW           0     0     0     0     0     0     0     0     0     0     0
(c) Version 3 : Insertion of Optimum Size Detection Mechanisms
Jump                   Number of Instructions Processed
Outcome      0     1     2     3     4     5     6     7     8     9    10
RT           0    51    16     5     0     0     0     0     0     0     0
UJ           0  32.5  35.5    33    15   8.5     4   0.5     0     0     0
RN           0     2     2   0.5     1   0.5     0     0     0     0     0
SW           0     0     0     0     0     0     0     0     0     0     0
Notes:
i) RT - Restart Outcome
ii) UJ - Unspecified Jump Outcome
iii) RN - Return Outcome
iv) SW - Stop/Wait Outcome
Jump Outcome Statistics
i) There are a total of 32 potential jump instructions open to interpretation (version 1).
ii) 15 valid branches are specified within the original program.
iii) None of the invalid branches can be detected by an Access Guardian.
iv) 17 invalid branches can be detected by the insertion of 16 software detection mechanisms
(version 2 and version 3).
Table 7.4. : Observed Behaviour of Program 'A'
7.4.5.2. The Subsequent Erroneous Jump Phase 
The observed Subsequent Erroneous Jump (SEJ) phase of erroneous execution
for versions 1 and 2 of program A is shown in Figure 7.5. The observations made
for the SEJ phase are the same as those for the IEJ phase except that the ordinate
intercept is the origin. This is because the detection capabilities of the Access Guardian
and the MC68000 odd byte address exception are taken into account by the detection
outcome of their generating instructions in the previous phase of erroneous execution.
The period of linear erroneous execution following an SEJ in version 1 is typically
longer than that for version 2. This is denoted in Figure 7.5 by the combined cumulative
jump outcomes for version 1 reaching approximately 100% after 10 instructions
processed, compared to 6 instructions processed for version 2. Furthermore, the
percentage of SEJ phases terminated by a restart outcome, representing detection, 
is greater for version 2 than version 1. This observation clearly demonstrates the 
enhanced detection capability, provided by the insertion of software detection mech-
anisms, in the SEJ phase. 
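The two observations above can be reproduced, at least approximately, from the entries of Table 7.4. The sketch below rests on assumptions that the text does not confirm: that each table entry counts the erroneous runs ending with that outcome after the given number of instructions, that these are the data underlying Figure 7.5, and that "approximately 100%" can be read as a 99.5% threshold.

```python
# Entries of Table 7.4 for versions 1 and 2 (columns: 0 to 10 instructions).
TABLE_7_4 = {
    "version 1": {
        "RT": [0, 13, 7, 4, 5, 4, 4.5, 2.5, 3, 0, 0],
        "UJ": [0, 16.5, 14, 14, 16.5, 12, 6.5, 4.5, 4, 2.5, 1],
        "RN": [0, 2, 2, 3, 3.5, 2, 1.5, 1, 1.5, 1, 0.5],
        "SW": [0] * 11,
    },
    "version 2": {
        "RT": [0, 93, 17, 5, 1, 0, 0, 0, 0, 0, 0],
        "UJ": [0, 32.5, 37.5, 33, 15, 8.5, 4, 0.5, 0, 0, 0],
        "RN": [0, 2, 2, 0.5, 1, 0.5, 0, 0, 0, 0, 0],
        "SW": [0] * 11,
    },
}


def restart_share(rows: dict) -> float:
    """Fraction of all recorded outcomes that are restarts (detections)."""
    total = sum(sum(row) for row in rows.values())
    return sum(rows["RT"]) / total


def instructions_to_reach(rows: dict, fraction: float = 0.995):
    """Smallest instruction count at which the combined cumulative outcomes
    reach the given fraction of the total (assumed threshold)."""
    total = sum(sum(row) for row in rows.values())
    running = 0.0
    for n in range(11):
        running += sum(row[n] for row in rows.values())
        if running >= fraction * total:
            return n
    return None
```

Under these assumptions the combined outcomes reach the threshold after 10 instructions for version 1 and after 6 for version 2, and the restart share rises from roughly 28% to roughly 46%.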
7.4.6. Analysing Detection Capability 
The data collected for the two phases of erroneous execution are inserted into 
equation (3.13.) in order to determine the dynamic detection capability of the software
based fault tolerant technique. The fault emulation results for programs A, B,
and C, shown in Tables 7.4., 7.5., and 7.6., are processed by equation (3.13.) to produce
Figure 7.6(a, b, c) respectively. Each figure shows the detection capability for
versions 1, 2, and 3 of the program. The programs are assumed to be implemented on
a microprocessor system with an embedded Access Guardian. This assumption facil-
itates evaluation of program counter faults covering the whole address space without 








[Figure: plots of PROBABILITY OF STATE; graphic not reproduced]
(a) Version 1 : Original Program

Jump       Number of Instructions Processed
Outcome    0     1     2     3     4     5     6     7     8     9     10
RT         0     2.5   1.5   0.5   0     0     0     0     0     0     0
UJ         0     15    10    6.5   4.5   5     3     1.5   0.5   0.5   0
RN         0     2     0     0     0     0     0     0     0     0     0
SW         0     0     0     0     0     0     0     0     0     0     0

(b) Version 2 : Insertion of Default Size Detection Mechanisms

Jump       Number of Instructions Processed
Outcome    0     1     2     3     4     5     6     7     8     9     10
RT         0     20.5  13.5  6.5   2.5   1.0   0     0     0     0     0
UJ         0     19.5  8     5.5   3     3.5   2     0.5   0     0     0
RN         0     2     0     0     0     0     0     0     0     0     0
SW         0     0     0     0     0     0     0     0     0     0     0

(c) Version 3 : Insertion of Optimum Size Detection Mechanisms

Jump       Number of Instructions Processed
Outcome    0     1     2     3     4     5     6     7     8     9     10
RT         0     14.5  13.5  6.5   2.5   1.0   0     0     0     0     0
UJ         0     19.5  8     5.5   3     3.5   2     0.5   0     0     0
RN         0     2     0     0     0     0     0     0     0     0     0
SW         0     0     0     0     0     0     0     0     0     0     0

Notes: 
i) RT - Restart Outcome 
ii) UJ - Unspecified Jump Outcome 
iii) RN - Return Outcome 
iv) SW - Stop/Wait Outcome 
Jump Outcome Statistics 
i) There are a total of 22 potential jump instructions open to interpretation (version 1). 
ii) 7 valid branches are specified within the original program. 
iii) 5 invalid branches can be detected by an Access Guardian (version 1). 
iv) 9 invalid branches can be detected by the insertion of 9 software detection mechanisms; the 
remaining invalid branches created by the inserted software detection mechanisms can be detected 
by the Access Guardian (version 2 and version 3). 
Table 7.5. Observed Behaviour of Program 'B' 
(a) Version 1 : Original Program

Jump       Number of Instructions Processed
Outcome    0     1     2     3     4     5     6     7     8     9     10
RT         0     10    7     2     1     0     0     0     0     0     0
UJ         0     32.5  39.5  45    34    21    16.5  14.5  10    11    7
RN         0     3     3.5   2     0     0     0     0     0     0     0
SW         0     0     0     0     0     0     0     0     0     0     0

(b) Version 2 : Insertion of Default Size Detection Mechanisms

Jump       Number of Instructions Processed
Outcome    0     1     2     3     4     5     6     7     8     9     10
RT         0     76    7     2     1     0     0     0     0     0     0
UJ         0     32.5  42.5  47.5  34.5  19    15.5  14.5  10    11    7
RN         0     3     3.5   2     0     0     0     0     0     0     0
SW         0     0     0     0     0     0     0     0     0     0     0

(c) Version 3 : Insertion of Optimum Size Detection Mechanisms

Jump       Number of Instructions Processed
Outcome    0     1     2     3     4     5     6     7     8     9     10
RT         0     19    7     2     1     0     0     0     0     0     0
UJ         0     32.5  42.5  47.5  34.5  19    15.5  14.5  10    11    7
RN         0     3     3.5   2     0     0     0     0     0     0     0
SW         0     0     0     0     0     0     0     0     0     0     0

Notes: 
i) RT - Restart Outcome 
ii) UJ - Unspecified Jump Outcome 
iii) RN - Return Outcome 
iv) SW - Stop/Wait Outcome 
Jump Outcome Statistics 
i) There are a total of 35 potential jump instructions open to interpretation (version 1). 
ii) 27 valid branches are specified within the original program. 
iii) None of the invalid branches can be detected by an Access Guardian. 
iv) 7 invalid branches can be detected by the insertion of 6 software detection mechanisms (version 
2 and version 3). 
v) There is 1 non-critical placement deadlock. 
Table 7.6. Observed Behaviour of Program 'C' 
[Figure: plots of PROBABILITY OF RESTART; graphic not reproduced]
The relationship between the three versions of each of the programs is very similar. 
There is a detection capability enhancement shown by versions 2 and 3 over version 
1, and version 3 over version 2. However, there is an initial performance overhead 
associated with inserting software detection mechanisms. This is clearly seen in 
Figures 7.6(a, b, c), where for the first one or two erroneously processed instructions 
the detection capability of versions 2 and 3 is lower than that for version 1 (the 
original) of the program. This performance overhead is due to the increased program 
size and hence the higher probability that the corruption of the program counter, 
initiating the IEJ phase of erroneous execution, takes a value corresponding to a 
location within the program. The performance overhead is quickly over-ridden by 
the enhanced detection capability provided by the software detection mechanisms. 
The relative overhead of version 2 compared with version 1 depends on the 
number and default size of the detection mechanisms. The number of detection 
mechanisms that can be placed is dependent on the code content of the application 
program. The default size of detection mechanisms is dependent on the applica-
tion processor's instruction set, and the optimum size required for each placement 
is dependent on the local code structure. The overhead associated with version 3 is 
dependent on the memory saving facilitated by using the optimum size of detection 
mechanism on each placement. There is little difference in versions 2 and 3 for pro-
gram B because the mean optimum detection mechanism size is about the same as 
the default detection mechanism size. There is a larger difference in program A, and 
larger again in program C, because the mean optimum detection mechanism size is 
smaller than the default detection mechanism size. 
The effectiveness of the software detection mechanisms is dependent on their 
number and distribution within the application program. Program B yields some 
of the best results due to a large number of software detection mechanisms spread 
evenly throughout the code. Program A obtains similar benefits from the insertion of 
software detection mechanisms. Program C results are poorer because fewer software 
detection mechanisms could be placed, and there is a contiguous block the size of 
half the application program which is void of mechanism placements. The effect on 
program C is to greatly slow down the detection capability. 
It will also be noticed that the version 3 results, whilst having a smaller overhead, 
have a slightly reduced detection capability, compared with version 2, after several 
instructions have been erroneously processed. This is clearly seen for programs A and 
C in Figure 7.5. The variation in detection capabilities is due to erroneous execution 
flow, other than a SEJ, being detected by the software detection mechanisms. The 
default size detection mechanisms detect all non-re-synchronized linear erroneous 
execution because they have the same number of seed bytes as bytes in the longest 
instruction construct for the application processor. Hence, it is guaranteed that 
during linear erroneous execution a seed will be interpreted as an opcode, generating 
a restart outcome and detection. Optimum size detection mechanisms will not detect 
all linear erroneous execution passing through the detection mechanism because some 
erroneously interpreted opcodes will consider all the seeds as operands of the current 
instruction. 
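The guarantee claimed for default size detection mechanisms can be checked with a small enumeration. The following Python sketch is illustrative only: the function name, the model of a detection mechanism as a contiguous run of seed bytes, and the example longest instruction length of 4 bytes are assumptions made for the example and are not taken from the experimental software.

```python
def seed_always_hit(num_seeds: int, max_instr_len: int) -> bool:
    """Return True if linear erroneous execution entering a block of seed
    bytes is guaranteed to fetch one of the seeds as an opcode.

    Seeds occupy offsets 0 .. num_seeds-1. A misaligned instruction whose
    opcode lies `back` bytes before the block (back >= 1) and whose total
    length is `length` swallows (length - back) seed bytes as operands.
    """
    for length in range(1, max_instr_len + 1):
        for back in range(1, length + 1):
            next_fetch = length - back  # offset of the next opcode fetch
            if next_fetch >= num_seeds:
                return False  # every seed was consumed as an operand
    return True


# Assumed longest instruction construct of 4 bytes (hypothetical value).
print(seed_always_hit(num_seeds=4, max_instr_len=4))  # default size
print(seed_always_hit(num_seeds=3, max_instr_len=4))  # a smaller placement
```

Under this model a block with as many seeds as the longest instruction is always hit, whereas a shorter block can be passed over by a single long erroneously interpreted instruction, which is the behaviour described for optimum size mechanisms.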
Re-synchronization of erroneous execution within the experiment generates a de-
tection capability ceiling. This ceiling is rapidly approached by programs A and B, 
due to the large number of inserted software detection mechanisms. Program C did 
not facilitate the same potential for software detection mechanism insertion because 
of the limited number of invalid branches, and hence the detection capability en-
hancement is not so rapid. Nevertheless, all three programs show improved detection 
capability. For highly reliable systems, additional techniques should be employed to 
cover the probability of re-synchronized erroneous execution. 
7.4.7. Critical Hazards of Erroneous Behaviour 
Three main classes of critical hazard are identified: cessation of processing (de-
scribed in Chapter 3), placement deadlock (described in Chapter 5), and infinite 
execution loops. The observations and implications of these hazards within the ex-
ample programs A, B, and C are now discussed. The next chapter in this thesis 
considers a technique for removing these hazards. 
7.4.7.1. Cessation of Processing 
None of the example programs included code which could be interpreted during 
erroneous execution as a stop/wait outcome and hence cause microprocessor opera-
tion to cease. Program B does, however, include instruction sequences which require 
user input. These instruction sequences could generate an apparent cessation of pro-
cessing if erroneous execution re-synchronizes at one of these instruction sequences -
the user not being aware of the required input. Such occurrences are very difficult 
to prevent and an additional fault tolerant methodology is required to remove the 
hazard. 
7.4.7.2. Infinite Execution Loops 
These hazards involve the creation of an infinite loop by erroneous jumps. Once 
erroneous execution enters such a structure, the correct function of the program is 
permanently lost. The hazard is removed by implementing the software based fault 
tolerant technique proposed in the thesis, involving the placement of software detec-
tion mechanisms and the attachment of an Access Guardian to the microprocessor 
system as required. A good example of this hazard is demonstrated in program C. 
The assembler listing of the program can be found in Appendix D. An infinite loop 
is generated by the erroneous jump at location 1D (hexadecimal) in the 'getvalue' routine, as 
shown in Table 7.7. 
7.4.7.3. Placement Deadlock 
The potential hazard of a placement deadlock occurred once during the applica-
tion of software implemented fault tolerance documented in Appendix D. Placement 
deadlock describes the situation where the generator and destination of an erroneous 
jump are operands within the same instruction. The identified placement deadlock, 
shown in Table 7.8., is at location 2E (hexadecimal) within the 'main' routine in version 1 of 
program C (listed in Appendix D). Fortunately this potential erroneous jump is not 
backward, and hence there is no danger of an erroneous infinite execution loop. 
Placement deadlocks can, however, be hazardous and it is pertinent to develop 
techniques for manipulating code with this attribute. 
Intended Execution     Program Segment     Erroneous Execution
                       Address   Code
addl $8,%esp           001B      83
operand                001C      C4
operand                001D      08        OR and <- JNE at 001F
movl -4(%ebp),%eax     001E      FF        operand
operand                001F      75        JNE -> 001D
operand                0020      FC        operand
Table 7.7. Erroneous Infinite Execution Loop 
Intended Execution     Program Segment     Erroneous Execution
                       Address   Code
call swap              002D      E8
operand                002E      7A        JP -> 002F
operand                002F      FF        operand and <- JP at 002E
operand                0030      FF
operand                0031      FF
operand                0032      FF
Table 7.8. Placement Deadlock 
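The jump destinations shown in Tables 7.7 and 7.8 can be reproduced from the code bytes. The sketch below assumes the Intel 80386 short conditional jump format (a one-byte opcode followed by a signed 8-bit displacement taken relative to the next instruction) and 16-bit addresses as printed in the tables; the function name is hypothetical.

```python
def rel8_target(opcode_addr: int, displacement: int) -> int:
    """Destination of a two-byte short jump located at opcode_addr.

    The 8-bit displacement is sign-extended and added to the address of
    the instruction following the jump (opcode_addr + 2).
    """
    signed = displacement - 0x100 if displacement & 0x80 else displacement
    return (opcode_addr + 2 + signed) & 0xFFFF


# Table 7.7: byte 75 at 001F interpreted as JNE with displacement FC.
loop_target = rel8_target(0x001F, 0xFC)
# Table 7.8: byte 7A at 002E interpreted as JP with displacement FF.
deadlock_target = rel8_target(0x002E, 0xFF)
print(f"{loop_target:04X} {deadlock_target:04X}")  # prints "001D 002F"
```

The first result is a backward jump to a location that executes forward into the same jump again, giving the loop; the second lands on the operand byte of the generating jump itself, which is the placement deadlock.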
7.4.8. Re-synchronizing Erroneous Execution 
The re-synchronization experienced by version 1 of the example programs is in-
fluenced by the relative number of opcodes to operands. Lower ratios encourage 
re-synchronization, because there are more opcodes available for interpretation as 
such in the range of locations open to instruction translation during erroneous execu-
tion. Version 1 of programs C, A, and B have ascending speeds of re-synchronization, 
with descending opcode/operand ratios respectively. The opcode/operand ratio is dependent 
on the instruction mix within the application program, and the instruction 
constructions within an application processor's instruction set. 
7.5. Summary and Conclusions 
This chapter documents the results of two experimental programmes. The first 
involved injecting 2136 faults into a MC68000 microprocessor-based system and mon-
itoring the system's response. The second involved emulating the response of a 
MC68(7)05, an MC68000, and an Intel 80386 microprocessor-based system with a 
combined total of 1812 emulated faults. 
The faults selected for the experiments were single and multiple bit. Several 
microprocessor systems with various detection facilities are monitored in order to 
evaluate their response to each fault insertion. The behavioural observations highlight 
the re-synchronization phenomenon whereby erroneous execution, diverted by the 
activation of an inserted fault, returns to a valid program path. In particular two 
types of hazard are associated with re-synchronization: placement deadlock, and 
sleeping corruption of the processor stack. Access to a corrupted stack re-initiates 
erroneous execution and detection is facilitated by the fault tolerant mechanisms 
resident in the experimental system. Cessation of processing and placement deadlock 
are identified as not being covered by the implemented fault tolerant mechanisms, 
and require other techniques to remove their hazards. Such techniques are proposed 
and discussed in the next chapter. 
Fault injection experiments and fault emulations have indicated the effectiveness 
of the software based fault tolerant technique proposed in Chapter 5. Performance 
parameters used to assess the technique are largely dependent on the nature of the 
application program to be implemented with the microprocessor system under eval-
uation. The only overhead which can be accurately stated is the hardware overhead 
attributed to the design of the Access Guardian, and this is discussed in Chapter 
5. Therefore although figures for error coverage, detection latency, and processing 
overhead are derived for application programs, their replication for other program 
implementations is not assured. This is a point not sufficiently stressed by other 
reports of fault injection, simulation, or emulation experiments. 
Nevertheless, the benefits of implementing the software based fault tolerant tech-
nique have been clearly demonstrated for those microprocessor systems under evalu-
ation. In addition, the behaviour of a microprocessor's erroneous execution has been 
observed. These observations have shown the importance of re-synchronizing erro-
neous execution, and have validated the erroneous microprocessor behaviour model 
proposed in Chapter 3. 
Chapter 8 
GENERATING NON-HAZARDOUS SOFTWARE 
8.1. Introduction 
8.2. Bridging the Semantic Gap 
8.3. The Risks of Erroneous Execution 
8.3.1. Catastrophic Processing Failures 
8.3.2. Critical Hazard Coverage 
8.4. Non-Hazardous Program Area Code 
8.4.1. Hazardous Instruction Formats 
8.4.2. Hazardous Opcodes 
8.4.3. Hazardous Operands 
8.4.3.1. Prevention of Addressing Mode Hazards 
8.4.3.2. Inherent Addressing 
8.4.3.3. Manipulating Direct Addressing 
8.4.3.4. Manipulating Immediate Addressing 
8.4.3.5. Manipulating Indirect Addressing 
8.4.3.6. Manipulating Indexed Addressing 
8.4.3.7. Manipulating Register-Indexed Addressing 
8.4.3.8. Manipulating Relative Addressing 
8.5. Influencing Translator Practices 
8.5.1. Instruction Selection 
8.5.2. Coupling and Cohesion 
8.5.3. Macros 
8.5.4. Peephole Optimization 
8.6. Non-Hazardous Data Area Code 
8.7. Non-Hazardous Input/Output Reserved Area Code 
8.8. Influence of the Instruction Set 
8.8.1. Undefined Instructions 
8.8.2. Restart Instructions 
8.8.3. Stop/Wait and Return Instructions 
8.8.4. Unspecified Jump Instructions 
8.9. Conclusions 
CHAPTER EIGHT 
GENERATING NON-HAZARDOUS SOFTWARE 
8.1. Introduction 
Previous chapters have identified the characteristics of microprocessor behaviour 
following an event disrupting software execution. A fault tolerant technique has been 
developed to detect erroneous behaviour and initiate recovery. There still, however, 
remains the uncertainty of the nature of any erroneous behaviour. 
Erroneous microprocessor execution is dependent upon the operation outcomes 
of interpreted instructions through the address space. There may be risks associated 
with erroneous execution which adversely affect system integrity. In particular, 
erroneous execution may process critical hazards which are indicative of catastrophic 
processing failures. 
Translators can have considerable influence on the nature of target machine code 
generated to implement a high-level language program [Ciminiera & Valenzano, 1987]. 
Machine code is produced primarily on the criteria of performance and efficiency; 
however, such code may incorporate critical hazards as observed in the previous 
chapter. Microprocessor reliability can be considerably improved if hazardous code 
is not generated. 
This chapter investigates the generation of software by two types of translator: 
compilers and assemblers. Generated software consists of machine code for program, 
data, and input/output reserved areas of the address space. Techniques are proposed 
for each of these areas to manipulate the machine code in order to prevent the gener-
ation of critical hazards within the code released by a translator. To facilitate these 
techniques, it is necessary to encourage and discourage particular translator practices. 
These practices are influenced by the characteristics of the target microprocessor. A 
selection of 8, 16, and 32-bit microprocessors are examined including the Intel 8086 
and Motorola 68000 families, and their implications for translator action discussed. 
8.2. Bridging the Semantic Gap 
Microprocessor design is advancing so rapidly that design-oriented software can 
quickly become obsolete. It is therefore important that software should be written 
in a manner that facilitates implementation on a variety of processors. Even with 
upwardly compatible microprocessor designs, implementation of an original piece of 
software inefficiently uses the capability of the advanced design. 
High-level languages have been developed which are abstracted from processor 
architectures and configurations to the extent that the language is independent of 
the target machine. Examples of high-level languages include FORTRAN, Pascal, 
and C. They are considered portable languages because they allow software to be 
implemented on a range of processor types. 
The difference between a high-level language and its target microprocessor imple-
mentation is known as the semantic gap. Bridging this gap has a major influence on 
the overall execution efficiency and software reliability of the microprocessor appli-
cation. The tool used to bridge the gap is the translator. Translators automatically 
process the high-level software down through the levels of abstraction until code is 
generated which is executable on the target processor. The nature of translation 
means that the programmer has no control over the generation of the machine code 
and any hazards it may contain. It is important that translators are designed to 
generate non-hazardous machine code. 
8.3. The Risks of Erroneous Execution 
8.3.1. Catastrophic Processing Failures 
Two particular hazards identified with catastrophic failure are the possibilities of 
cessation of processing, and infinite processing loops during erroneous execution. 
Both occurrences prevent the possible detection of erroneous microprocessor be-
haviour by software techniques: an external hardware reset being necessary to restore 
the microprocessor system. 
Processing ceases when an instruction execution outcome is a stop/wait state. 
Exit from this state requires external intervention. The probability of process cessa-
tion can be predicted by examining individual address space elements for stop/wait 
instruction codes. However, it must be recognized that the processing action of the 
processor with volatile memory can introduce further occurrences of this hazard which 
cannot be predicted. 
Infinite execution loops are instruction sequences that are continuously executed 
in a cyclic fashion. There is no processing exit from these loops except through some 
external intervention. Such loops may incorporate a chain of erroneous jumps, and 
operate through a combination of different functional areas in the address space (de-
scribed in Chapter 5). Identification of all infinite execution loops for application 
software would involve tracing the execution path for all possible erroneous execu-
tion variants following an Initial Erroneous Jump (IEJ). In reality, execution paths 
may change. A loop might include a conditional branch instruction. Erroneous ex-
ecution might change the tested condition codes and hence the loop would not be 
infinite. Prediction of infinite execution loops would also have to allow for a section 
of address space being implemented on volatile or non-volatile memory. Obviously 
volatile memory is subject to manipulation through processor activity such as stack 
operations. 
8.3.2. Critical Hazard Coverage 
In order to prevent the occurrence of catastrophic processing failures, it is nec-
essary to provide coverage for all associated critical hazards. The two identified 
hazards of catastrophic failure, cessation of processing and infinite execution loops, 
rely on the action of erroneous jumps. It has been shown in the previous chapter 
that translator generated software can include hazardous code which if interpreted 
as an opcode produces an erroneous jump. The generation and eventuality of execut-
ing such hazardous code is covered to improve the reliability of the microprocessor 
system. 
A fault tolerant technique based on software enhancement has been proposed 
in Chapter 5. This technique provides rapid termination of erroneous behaviour by 
detecting the execution of erroneous jumps. The technique implements an Access 
Guardian which provides immediate detection of erroneous execution in the unused 
area of the address space. The effectiveness of the technique in the used area depends 
on the number of erroneous jumps within the machine code. However, there are 
three particular erroneous jump constructs which cannot be covered by the fault 
tolerant technique: stop/wait erroneous jump outcomes, return erroneous jump 
outcomes, and placement deadlock. 
The interval between the initiation of erroneous execution and its detection is 
called error latency. During this period there is a risk of a critical hazard not covered 
by the fault tolerant technique being executed. Such processing may generate a 
catastrophic failure. 
Clearly there is a need to develop a translator which avoids generating these 
classes of erroneous jump hazard. Such a translator would reduce the inherent hazard 
associated with the code without reducing the detection improvement offered by 
application of the fault tolerant technique. Indeed the ability to manipulate used 
area code enables not just the prevention of hazards, but also the introduction of 
attributes facilitating detection of erroneous execution. 
8.4. Non-Hazardous Program Area Code 
8.4.1. Hazardous Instruction Formats 
The program area consists of instruction sequences generated by the translator. 
Each instruction has a conceptual structure consisting of a descriptor and an address 
field. The descriptor specifies the instruction operation. The address field specifies 
the information to be processed by the instruction. The manner in which the address 
field is interpreted is known as the addressing mode of the instruction. The content 
of the address field is referred to as the stated address. The address of the referenced 
memory location, containing the information to be processed, is referred to as the 
effective address. 
The physical implementation of an instruction consists of opcodes representing 
the descriptor, and operand(s) if required by the addressing mode representing the 
stated address. Opcodes are defined in the microprocessor's instruction set. Operands 
are specified by the addressing modes implemented within the microprocessor's architecture. 
The most common addressing modes are 

Inherent Addressing :           No effective address required. 
Direct Addressing :             The stated address is the effective address 
                                of a register. 
Immediate Addressing :          The effective address is the location immediately 
                                following the opcode. 
Indexed Addressing :            The effective address is the stated address 
                                added to the contents of a specific register 
                                (index register). 
Indirect Addressing :           The stated address serves as a pointer to 
                                the location at which the effective address 
                                resides. 
Register-Indirect Addressing :  The stated address serves as a pointer to 
                                a specified register which contains the 
                                effective address. 
Relative Addressing :           The stated address is an offset which is 
                                added to the contents of a location inherently 
                                selected by the opcode to form the 
                                effective address. 
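
The definitions above can be summarised as a single resolution function. The following Python sketch is a simplified model and not the behaviour of any particular processor: the mode names, the keyword parameters, the one-byte opcode assumed for immediate addressing, and the use of dictionaries for memory and registers are all assumptions made for illustration.

```python
from typing import Mapping, Optional


def effective_address(mode: str, stated: int = 0, *,
                      opcode_addr: int = 0,
                      index_reg: int = 0,
                      base: int = 0,
                      memory: Optional[Mapping[int, int]] = None,
                      registers: Optional[Mapping[int, int]] = None
                      ) -> Optional[int]:
    """Resolve the effective address for one of the listed addressing modes."""
    if mode == "inherent":
        return None                   # no effective address required
    if mode == "direct":
        return stated                 # stated address used as given
    if mode == "immediate":
        return opcode_addr + 1        # location following a one-byte opcode
    if mode == "indexed":
        return stated + index_reg     # stated address plus index register
    if mode == "indirect":
        return memory[stated]         # stated address points at the address
    if mode == "register-indirect":
        return registers[stated]      # stated address selects a register
    if mode == "relative":
        return base + stated          # offset from an implied location
    raise ValueError(f"unknown addressing mode: {mode}")


print(hex(effective_address("indexed", 0x10, index_reg=0x200)))
print(hex(effective_address("relative", -4, base=0x21)))
```

The modes differ in how many operand bytes they place in the program area and in how freely those bytes can be chosen, which is what determines the scope for manipulation.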
The hazard identified with catastrophic processing failures, the erroneous jump, 
can be generated by opcodes and operands. The addressing mode adopted by an 
instruction influences the requirement and nature of operands. Techniques are pro-
posed for the selection of instructions and their manipulation to prevent the genera-
tion of critical hazards. To this end, certain translator practices can be encouraged 
or discouraged. Several practices are identified and discussed. 
8.4.2. Hazardous Opcodes 
Early microprocessor designs such as the Intel 8080 and Motorola 6800 implement 
an instruction set in which all operations can be uniquely identified by a single-byte 
opcode. Hence these opcodes have no associated hazard. 
Later processor designs extended the instruction set to improve flexibility and 
performance. Examples of this type of machine include the Motorola 68000 and In-
tel 8086. Such microprocessors are known as Complex Instruction Set Computers 
(CISC). Processor designs that extend the instruction set beyond 256 unique oper-
ations require a multiple-byte opcode. However, the second and subsequent opcode 
bytes are liable to have a hazardous format susceptible to erroneous execution. The 
nature of this hazard can be determined by mapping the opcode bytes onto the 
first-byte opcode map. 
The Intel 8086 microprocessor family have instructions which require an addi-
tional opcode byte to further specify the operation code of the first opcode byte. 
Hazardous bytes can be identified and suitable translator action taken. 
The Intel 80286 microprocessor instruction set has an instruction 'SETNP' whose 
second byte, if interpreted as an opcode, generates a stop/wait outcome. This is a 
critical hazard associated with catastrophic processing failure. There is no method 
of manipulating the second-byte of the opcode because its format is completely de-
fined. The selection of this and other instructions that have hazardous opcode bytes 
that cannot be manipulated should be avoided in order to prevent the possibility of 
catastrophic failure during erroneous execution. 
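The mapping of a second opcode byte onto the first-byte opcode map can be sketched as a table lookup. The Python fragment below is a minimal illustration: the table covers only a handful of Intel 8086 family first-byte opcodes chosen for the example, the outcome labels follow the terminology of this thesis, and the function name is hypothetical. The encoding 0F 9B for 'SETNP' is that of the Intel 80386 instruction set.

```python
# Partial first-byte opcode map (an assumed subset, not a complete map).
FIRST_BYTE_OUTCOME = {
    0xF4: "stop/wait",          # HLT
    0x9B: "stop/wait",          # WAIT
    0xC3: "return",             # RET (near)
    0xCB: "return",             # RET (far)
    0xCF: "return",             # IRET
    0xEB: "unspecified jump",   # JMP short
}
# Short conditional jumps occupy 70..7F in the first-byte map.
FIRST_BYTE_OUTCOME.update({op: "unspecified jump" for op in range(0x70, 0x80)})


def second_byte_hazard(opcode: bytes) -> str:
    """Classify the second byte of a two-byte opcode as if it were fetched
    as a first-byte opcode during erroneous execution."""
    return FIRST_BYTE_OUTCOME.get(opcode[1], "other")


print(second_byte_hazard(bytes([0x0F, 0x9B])))  # SETNP -> "stop/wait"
```

A translator could apply such a lookup to each candidate instruction and prefer an alternative whenever the result is a stop/wait or return outcome.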
All instructions in the Motorola 68000 microprocessor family instruction sets 
have a two-byte opcode. Furthermore, these microprocessors implement memory 
organization which specifies that instructions reside at even-byte boundary addresses. 
Interpretation of an instruction at an odd-byte boundary address generates a software 
exception and hence a restart outcome. Hence there is no hazard associated with the 
second opcode byte because of the over-ride action of the odd-address exception. 
Some modern microprocessor designs have returned to a smaller instruction set, 
discarding under-utilized instructions. The core of the instruction set facilitates effi-
cient processing within a simplified machine architecture. The processors are known 
as Reduced Instruction Set Computers (RISC). An example of this processor type is 
the AMD Am29000. This microprocessor specifies its instructions to have a single-
byte opcode and a three-byte operand, with the notable exception of the load and 
store instructions where there is a two-byte opcode and a two-byte operand. Single-
byte opcodes have no hazard whilst two-byte opcodes may have a hazardous second 
byte. 
8.4.3. Hazardous Operands 
8.4.3.1. Prevention of Addressing Mode Hazards 
In general, instructions have operands which are susceptible to interpretation as 
a first-byte opcode during erroneous execution. The addressing mode adopted by an 
instruction influences the requirement and nature of these operands. Instructions may 
be selected on the basis of the non-hazardous operands associated with addressing 
mode and their ease of manipulation. 
An exception to the rule is the AMD Am29000 microprocessor. This machine 
locates instructions at every other even-byte address. An exception facility can be enabled 
within this processor such that a program counter, containing the address of an 
operand byte, is masked to access the opcode address. Hence the operands may have 
a hazardous format but the microprocessor architecture prevents their interpretation 
during erroneous execution. No operands, therefore, require manipulation to remove 
any hazard. 
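The effect of masking the program counter can be shown with a one-line calculation. This sketch assumes four-byte instructions located on four-byte boundaries, which is how "every other even-byte address" is read here; the constant and function names are hypothetical.

```python
INSTRUCTION_SIZE = 4  # bytes; assumed instruction length and alignment


def mask_program_counter(pc: int) -> int:
    """Clear the low-order address bits so that a program counter pointing
    at an operand byte is redirected to the opcode address of the same
    instruction."""
    return pc & ~(INSTRUCTION_SIZE - 1)


for corrupted in (0x1000, 0x1001, 0x1002, 0x1003):
    print(hex(corrupted), "->", hex(mask_program_counter(corrupted)))
```

All four corrupted values resolve to 0x1000, so under this assumption no operand byte is ever fetched as an opcode.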
The Intel 8086 microprocessor family architecture implements segmentation. Every 
memory access requires the specification of a segment register in conjunction with 
an addressing mode. The segment register and the addressing mode implement the 
most and least significant address bits respectively. Hence all addressing modes facili-
tated by these processors are pseudo-relative. The allocation of segment registers can 
be automatic which eases the complexity for low-level programmers and translators 
alike. 
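As an illustration of segmented addressing, the sketch below forms a physical address in the manner of the original Intel 8086, where the segment register value is shifted left by four bits and added to the 16-bit offset produced by the addressing mode. Later family members in protected mode use a different translation, and in the 8086 itself the two contributions overlap rather than occupying strictly separate bit fields; the function name is hypothetical.

```python
def physical_address(segment: int, offset: int) -> int:
    """20-bit physical address formed from a segment register value and
    the 16-bit offset supplied by the addressing mode (8086 scheme)."""
    return ((segment << 4) + (offset & 0xFFFF)) & 0xFFFFF


# The same offset gives different locations under different segments,
# which is why every addressing mode behaves as relative to the segment.
print(hex(physical_address(0x1234, 0x0010)))  # 0x12350
print(hex(physical_address(0x2000, 0x0010)))  # 0x20010
```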
8.4.3.2. Inherent Addressing 
In this addressing mode there is no effective address requirement. Inherent ad-
dressing instructions have all necessary information for processing within the instruc-
tion opcode. No operands are needed. A common inherent addressing instruction to 
many microprocessor instruction sets is 'no-operation'. 
8.4.3.3. Manipulating Direct Addressing 
The number of operands required by an instruction implementing direct address-
ing to specify the effective address is dependent upon the size of the microprocessor 
address space and the size of each operand. For instance, a 4 GByte address space 
requires 4 bytes to specify an absolute address, whilst a 64 KByte address space 
requires only 2 bytes. Operands typically take a one-byte structure. However, larger 
byte structures are implemented by particular microprocessors, notably the Motorola 
68000 family which have two-byte operands. Each operand used to represent the ef-
fective address may have a hazardous format. 
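The relationship between address space size and the number of operand bytes can be expressed as a short calculation. The function name below is hypothetical, and the sketch assumes that an absolute address is stored in the smallest whole number of bytes able to represent every location.

```python
def address_operand_bytes(address_space_bytes: int) -> int:
    """Number of bytes needed to hold any absolute address in a space of
    the given size (in bytes)."""
    bits = (address_space_bytes - 1).bit_length()
    return (bits + 7) // 8


print(address_operand_bytes(4 * 2**30))  # 4 GByte  -> 4
print(address_operand_bytes(64 * 2**10))  # 64 KByte -> 2
```

Each of these bytes is a separate candidate for interpretation as a first-byte opcode, so a larger address space gives a directly addressed instruction more potentially hazardous operands.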
The hazard associated with an operand that partially or fully represents an ef-
fective address is dependent on that address. Accessed memory can be relocated so 
that the operands specifying the effective address take non-hazardous code formats. 
To give the operands of an instruction implementing direct addressing a restart 
format, it is necessary to have groups of consecutive restart opcodes within the mi-
croprocessor instruction set. Larger groups of restart opcodes reduce the complexity 
of movement of the used area when covering each directly addressed location. The 
Intel 8086 microprocessor, like the AMD Am29000, has an instruction set which has 
few useful restart instruction formats. The advanced Intel 80286 and 80386 micro-
processors, which are upwardly compatible with the 8086, have no additional restart 
instructions of a useful format. The Motorola 68000 microprocessor family have a 
large number of useful restart instruction formats. Their instruction sets of 65536 
instructions comprise approximately 2% defined restart instructions, and 30% undefined 
instructions specified to generate a software exception which is a restart outcome. 
Within these restart formats are two large blocks taking the hexadecimal values 
AXXX and FXXX, where 'X' is a 'don't care' hexadecimal value. These represent 
undefined instructions which are reserved for implementation by later releases of the 
microprocessor family. In particular, the Motorola 68030 microprocessor facilitates 
the use of some FXXX hexadecimal format instructions for co-processor operation. 
However, if the microprocessor system does not incorporate a co-processor then these 
instructions revert to generating a restart outcome. Each of the two blocks of restart
instructions, when converted to an address range, covers 4 KBytes. The remainder of
the instructions generating a restart outcome are scattered about the opcode map. 
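The two large restart blocks can be illustrated with a short sketch. It assumes, as stated above, that every 16-bit code of the form AXXX or FXXX generates a restart outcome on a Motorola 68000 system without a co-processor. The function name and the sample values are illustrative only and do not come from the thesis.

```python
# Sketch: test whether a 16-bit code lies in the AXXX or FXXX restart blocks.
# Assumption: both blocks generate a restart outcome (no co-processor fitted).

RESTART_NIBBLES = (0xA, 0xF)  # most-significant hexadecimal digit of each block


def in_restart_block(word):
    """Return True if the 16-bit word has the format AXXX or FXXX."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must fit in 16 bits")
    return (word >> 12) in RESTART_NIBBLES


# Each block spans 0x1000 consecutive values, i.e. 4 KBytes as an address range.
BLOCK_SIZE = 0xAFFF - 0xA000 + 1

if __name__ == "__main__":
    for sample in (0xA000, 0xF123, 0x4E71):
        print(hex(sample), in_restart_block(sample))
```

A directly addressed location whose address words all pass this test would give the operands a restart format; locations that fail it are candidates for relocation.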
8.4.3.4. Manipulating Immediate Addressing
In the immediate addressing mode, the effective address is, by default, the in-
struction operand(s). The operands contain data to be processed by the instruction. 
The operands are by their nature application specific. Their hazards are depen-
dent on the host microprocessor instruction set. Although the data stored cannot be 
altered, the format of its storage can be manipulated. 
A technique is proposed for microprocessors whose architecture organizes memory 
on a double byte addressing scheme such as the Motorola 68000 family. Operands 
therefore have a two-byte format. For single byte data, one byte holds the data, while 
the remaining byte can be set to a value such that the resultant operand code is non-
hazardous. The choice of operand values that do not exhibit a hazard is dependent 
on the target instruction set. 
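A minimal sketch of this padding idea follows. It assumes that the data byte occupies the low-order byte of the two-byte operand and that a caller-supplied predicate decides whether a 16-bit code is hazardous. The predicate and the toy hazard table are hypothetical stand-ins for a table derived from the target instruction set.

```python
def pad_immediate(data_byte, is_hazardous):
    """Return a two-byte operand holding data_byte with a non-hazardous filler.

    Filler bytes are tried from 0x00 upwards. None is returned when every
    filler value yields a hazardous operand.
    """
    if not 0 <= data_byte <= 0xFF:
        raise ValueError("data_byte must fit in 8 bits")
    for filler in range(0x100):
        operand = (filler << 8) | data_byte
        if not is_hazardous(operand):
            return operand
    return None


# Hypothetical hazard table: pretend that only these 16-bit codes are hazardous.
TOY_HAZARDS = {0x0072, 0x0172}


def toy_is_hazardous(word):
    """Toy predicate used for illustration only."""
    return word in TOY_HAZARDS


if __name__ == "__main__":
    print(hex(pad_immediate(0x72, toy_is_hazardous)))
```

The data byte is unchanged in every candidate operand; only the otherwise unused byte is varied.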
8.4.3.5. Manipulating Indexed Addressing
This mode of addressing requires operands to represent the stated address and 
the index register. This addressing mode is particularly useful when implementing a 
stack structure in memory for microprocessor operation. 
Some microprocessor architectures, including the Motorola 68000 family, specify 
register use within the instruction opcode. Other architectures, such as the Intel 8086 
family require an operand to specify register usage. In these instances it is suggested 
that some registers may be preferred. The generation of operand code specifying 
particular registers and which has a hazardous format should be avoided. 
The operand(s) representing the stated address can be manipulated using the 
same techniques proposed for direct addressing: operands specifying absolute ad-
dresses for direct addressing should not have a hazardous bit format. 
The Intel 8086 and Motorola 68000 microprocessor families implement additional 
variants of the indexed addressing mode. These involve a base address displacement 
which is added to the effective address. Such extensions to the indexed addressing 
mode facilitate enhanced data processing techniques. 
8.4.3.6. Manipulating Indirect Addressing
The stated address contained in the operands attached to an instruction imple-
menting indirect addressing is an intermediate location in the address space. The 
content of this location specifies the effective address of the data to be processed by 
the instruction. 
It is proposed that an intermediate location is selected so that the operands 
specifying its address do not take a hazardous value. However, the intermediate 
locations available may be restricted by the residence of used areas in the address 
space. Movement of the used area can release useful locations. 
It is valuable when manipulating indirect addressing, to have a large number 
of scattered opcode formats for restart instructions. This reduces the likelihood of 
having to move sections of the used area in order to release addresses that mimic a 
restart opcode format. 
Neither the Intel 8086 nor the Motorola 68000 microprocessor family implements
indirect addressing, preferring to use the equivalent register-indirect addressing mode.
8.4.3.7. Manipulating Register-Indirect Addressing
This method of addressing is closely allied to the indirect addressing mode. The 
effective address is contained within a register specified by the stated address. 
As identified for the indexed addressing mode, some microprocessor architectures
require the specification of a register within an operand. These operands can be
manipulated in the same way to remove any hazard.
This addressing mode is implemented with particular effect in respect to hazard 
generation in the Motorola 68000 microprocessor family. These machines implement 
an instruction using register-indirect addressing without an operand requirement, 
i.e. the opcode specifies the register. Opcodes for these processors do not have a 
hazard, so instructions implementing this addressing mode do not introduce hazards 
and should be encouraged by translator code generation. 
8.4.3.8. Manipulating Relative Addressing
The inherent register specified by relative addressing serves as a pointer to a 
memory location. The stated address acts as a displacement to the pointer. The 
result is a method of referencing memory in the vicinity of the address contained in 
the register. 
Branches are an important class of instruction implementing relative addressing.
These facilitate changes in the otherwise sequential control flow of program execu-
tion. Relative addressing is used in conjunction with the microprocessor's dedicated 
register, the program counter, which serves as a pointer to the next instruction to be 
processed. 
The displacement required by this addressing mode is stored in one or more 
operands. A large proportion of relative addressing displacements are local. Operands 
specifying local forward and backward displacements have low and high hexadecimal 
values respectively. The hazards associated with such operand values are dependent 
on the host microprocessor instruction set. Displacements represented by operand 
code of a hazardous format should not be used. Operand manipulation is only possible 
where the displacement value is independent of run-time conditions. 
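The low and high operand values produced by local displacements can be seen in a short sketch. It assumes a two-byte operand holding the displacement in two's-complement form; the function names are illustrative only.

```python
def encode_displacement(disp):
    """Encode a signed displacement as a 16-bit two's-complement word."""
    if not -0x8000 <= disp <= 0x7FFF:
        raise ValueError("displacement does not fit in 16 bits")
    return disp & 0xFFFF


def operand_bytes(disp):
    """Return (most-significant byte, least-significant byte) of the operand."""
    word = encode_displacement(disp)
    return word >> 8, word & 0xFF


if __name__ == "__main__":
    for disp in (6, -6):
        print(disp, format(encode_displacement(disp), "04X"))
```

A forward displacement of 6 encodes as 0006 and a backward displacement of 6 as FFFA, so local branches concentrate operand values at the two ends of the range.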
Branch offsets tend to be local, a tendency known as the principle of program
locality [Ciminiera & Valenzano, 1987]. On this assumption, deductions can be made
for different microprocessor implementations. Two popular microprocessor families are
investigated with respect to local relative addressing hazards.
Firstly, consider a Motorola 68000 target machine. Local backward branches 
will produce an operand of the hexadecimal format FFXX, an undefined instruction, 
which, when executed as an opcode, generates a restart outcome and hence no hazard. 
Local forward branches produce an operand of the format 00XX which, when executed 
as an opcode, is highly likely to produce an OR instruction. Although there is 
no immediate hazard associated with an OR instruction it does continue erroneous 
execution, and may corrupt system data. Implementation on the compatible but 
extended instruction set of the Motorola 68020 realizes usage of some of the FFXX 
instructions for co-processor operation. When a co-processor is incorporated into 
the microprocessor system, local backward jumps have a probability of producing 
an erroneous instruction execution hazard. However, if a co-processor is not present 
then these instructions continue to generate a restart outcome. 
Secondly, consider the Intel 8086 microprocessor family. Local backward branches
tend to produce FF hexadecimal values for the most-significant, and high hexadec-
imal values for the least-significant operand bytes. Equally, local forward branches 
tend to produce 00 hexadecimal values for the most-significant, and low hexadecimal 
values for the least-significant operand bytes. The byte FF hexadecimal value, when 
erroneously interpreted as a first-byte opcode specifies a multi-byte opcode, which 
can take a format associated with critical hazards. The byte 00 hexadecimal value, 
when mapped onto the first-byte opcode map, reveals that erroneous execution will 
generate an ADD instruction. Whilst this is not classified as a critical hazard, it 
still allows erroneous execution to propagate and perhaps activate a critical hazard 
elsewhere. Low and high hexadecimal values can thus generate both detection of 
erroneous execution and catastrophic processing failures. Hence local branches are 
fraught with danger for translators generating machine code for the Intel 8086
microprocessor family. Translators can move sections of code within the address space
to ensure that bytes specifying local branches do not take formats defining critical
hazards. The intermediate vacant slots of memory between relocated sections of code
can be filled with single-byte 'no-operation' instructions. Alternatively, the sections of
code can be linked by a branch instruction and the intermediate locations filled with
single-byte restart outcome instructions, yielding a detection capability. This proposed manipulation of
the program area will introduce a processing performance overhead to the correctly 
executing program. However, this overhead is deemed to be acceptable for most 
microprocessor applications. 
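The relocation idea can be sketched as a search for the smallest number of filler bytes that makes a forward branch displacement safe. The sketch assumes a two-byte two's-complement displacement, that each filler byte inserted between the branch and its target lengthens the displacement by one, and that the set of critical byte values is supplied by the caller. The example set below uses F4 and C3, the 8086 HLT and near RET opcodes, purely for illustration; a real table would come from a hazard analysis of the target instruction set.

```python
def split16(disp):
    """Return the (high, low) bytes of a 16-bit two's-complement displacement."""
    word = disp & 0xFFFF
    return word >> 8, word & 0xFF


def padding_for_safe_branch(disp, critical_bytes, max_pad=16):
    """Return the smallest padding that keeps both operand bytes non-critical.

    disp is a forward (non-negative) displacement. None is returned when no
    padding of at most max_pad bytes removes the critical byte values.
    """
    if disp < 0:
        raise ValueError("only forward displacements are handled in this sketch")
    for pad in range(max_pad + 1):
        if disp + pad > 0x7FFF:
            break  # displacement no longer fits in a signed 16-bit operand
        high, low = split16(disp + pad)
        if high not in critical_bytes and low not in critical_bytes:
            return pad
    return None


# Illustrative critical byte values only (assumption, see lead-in).
EXAMPLE_CRITICAL = {0xF4, 0xC3}

if __name__ == "__main__":
    print(padding_for_safe_branch(0x00C3, EXAMPLE_CRITICAL))
```

The returned count is the number of vacant locations that would then be filled with 'no-operation' or restart outcome instructions, which is the source of the performance overhead noted above.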
8.5. Influencing Translator Practices
8.5.1. Instruction Selection
It is desirable that a translator uses a microprocessor instruction set efficiently 
during its code generation phase. That is, program code is produced for optimal 
performance. Ineffective use of the instruction set could lead to unnecessary increases 
in the processing time and memory space required by the program code. 
Selection of instructions which process registers has a particular benefit. Ad-
dressing modes using registers do not require a memory fetch and hence are more 
efficient than addressing modes using direct addressing. Instructions which process 
registers thus enhance program code performance. 
An example of translator inefficiency is demonstrated by the UNIX 'cc' compiler. 
The target machine is the Motorola 68000. The instruction set for this microproces-
sor includes a branch instruction with an embedded byte displacement, for relative 
addressing, in the opcode. This compiler, however, generates a branch instruction 
with a two-byte displacement for a branch requiring a single byte displacement. The 
implemented instruction requires an otherwise unnecessary operand. Resultant pro-
gram code requires extra memory, and has a slower execution for this instruction. 
Whilst generated code should be efficient, it should also not exhibit any critical 
hazards. Manipulation of the code to remove these hazards may degrade program 
performance. Particular translator actions can be encouraged or discouraged so that 
the generated code is acceptable. 
8.5.2. Coupling and Cohesion
Selected instructions should exhibit the characteristics of coupling and cohesion. 
These characteristics are more commonly associated with high-level language software
engineering [Sommerville, 1985]. Coupling is a measure of the program code's
dependence on referenced parameters. Cohesion is a measure of the functionality
(unity of operation) of the program code. Translators should generate program code 
that exhibits both high coupling and cohesion characteristics. 
Low coupling indicates the use of specified parameters rather than referenced 
parameters. Specified parameters give operand formats which may have an associated 
hazard. Program code may replicate use of the parameter and hence the hazard is 
propagated. An example of this is demonstrated by the UNIX 'cc' translator on 
the SUN machines whose target processor is the Motorola 68020. The translator 
generates instructions with an immediate addressing mode to access repeated data 
values, and hence any hazard associated with the data value is replicated. Translators can
avoid this problem by implementing reference parameters. Reference parameters are 
independent of the specified parameter value. The reference can be altered so that 
its operand format does not have an associated hazard. Such manipulation does not 
affect the specified value of the referenced parameter. High coupling is, therefore, a 
useful attribute when manipulating the program code to remove hazards. 
Cohesion reflects the efficiency of the generated program code. Low cohesion 
indicates unnecessary control flow in the generated instruction sequences. Loop in-
variants are a source of inefficient control flow [Aho & Ullman, 1977]. They involve a 
computation that produces the same result at each cycle of a loop. The computation 
can be moved to a point just before the loop is entered. In general, loops are a major 
source of program inefficiency. High cohesion is an attribute of effective program 
code generation. 
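The loop invariant case can be shown with a small high-level example. The two functions below are hypothetical and compute the same result; the second moves the invariant computation to a point just before the loop is entered.

```python
def scale_naive(values, gain, offset):
    """Recompute the invariant factor on every cycle of the loop."""
    out = []
    for v in values:
        factor = gain * 2 + offset  # same result at each cycle: a loop invariant
        out.append(v * factor)
    return out


def scale_hoisted(values, gain, offset):
    """Compute the invariant factor once, before the loop is entered."""
    factor = gain * 2 + offset
    out = []
    for v in values:
        out.append(v * factor)
    return out


if __name__ == "__main__":
    print(scale_naive([1, 2, 3], 2, 1))
    print(scale_hoisted([1, 2, 3], 2, 1))
```

A translator performing this movement produces a shorter loop body without changing the values computed.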
8.5.3. Macros 
A macro is a collection of target machine instructions which perform some oper-
ation not directly facilitated by the microprocessor instruction set. Macros appear in 
the intermediate code generated by a translator. The translator replaces the macro 
with its defined target instruction sequence when producing target machine code. A 
macro should not be confused with a procedure. Macros are called and expanded by 
translator action whereas procedures are called and processed by program execution. 
The specification of macros should ensure no resident hazards. Removal of haz-
ards can be achieved using the code manipulation techniques outlined in this chap-
ter. The importance of macros increases with processor architectures implementing 
smaller instruction sets which provide fewer directly executable operations. This is 
particularly so in RISC processors such as the AMD Am29000. 
8.5.4. Peephole Optimization
Translator generated code usually goes through the process of peephole optimiza-
tion before release. This is designed to enhance the performance of the program code 
by increasing the efficiency of small sections of code. The JPI TopSpeed (version 1) 
Modula-2 translator produces code for the Intel 80286 microprocessor which is particularly
effective in this respect. However, optimization may involve manipulation
of instruction sequences such that hazards are re-introduced. Nevertheless, peephole 
optimization is very valuable in generating efficient program code. Therefore, this 
process cannot be abandoned, but rather it must endeavour not to create any new 
hazards. 
8.6. Non-Hazardous Data Area 
These areas can include both volatile and non-volatile memory. Non-volatile 
memory is used to implement static data structures such as 'look-up' tables. Volatile 
memory can implement dynamic structures such as stacks, as well as static data 
structures. 
Halse [1984] suggests many variants of a software seeding strategy to implement 
detection of erroneous behaviour in this area. Experiments demonstrate the effec-
tiveness of seeding. However, erroneous execution in this area can have unpredictable 
behaviour before detection, depending on the data content. In some instances the 
content may change through the operation of the microprocessor system. 
The AMD Am29000 microprocessor has separate instruction and data channels;
instruction fetches from the data area are therefore impossible. Interpretation of a
data area as a program area generates an exception and hence a restart outcome 
which denotes detection of erroneous behaviour: the data area within this processor 
is therefore intrinsically non-hazardous. 
8.7. Non-Hazardous Input/Output Reserved Area
These memory mapped areas contain locations reserved by the microprocessor 
architecture for communication with external devices. Used locations will contain 
time dependent values related to the requests to, and responses from external de-
vices. Unused locations with an attached communication link can have an undefined 
content. 
All memory mapped I/O locations are susceptible to erroneous interpretation as 
an opcode. Such erroneous behaviour may be hazardous. Manipulating the content 
of the I/O reserved locations can remove any associated hazard. The proposed
alterations of the microprocessor system require knowledge of the I/O communication of
the application software. 
The values resident at used locations can be manipulated using the same tech-
nique proposed for immediate addressing. Alternatively, values sent and received 
from any external device could be defined to take non-hazardous formats. The com-
munication links for the unused locations can be tied high or low, corresponding 
to logic 1 and logic 0 respectively. In this manner, unused memory mapped I/O
addresses can be set to hold non-hazardous values. 
8.8. Influence of the Instruction Set
Manipulation techniques for machine code in the used area of the address space 
have been proposed, based on microprocessor architecture. Specific implementation is 
dependent upon the instruction set of the host microprocessor. This section identifies 
instruction set characteristics, including content and format, which facilitate or hinder 
the manipulation techniques. 
8.8.1. Undefined Instructions
Some microprocessors do not declare the operation of undefined instructions 
within their instruction set. The physical implementation of such instructions de-
pends on the manufacturer. A manufacturer will, typically, introduce useful and 
varied operations for these instructions. However, there are no standard operations 
for these instructions and their function can vary between manufacturers. Further-
more, the manufacturer is under no obligation to keep a particular function for differ-
ent fabrications of the same microprocessor. Modern microprocessors have generally 
avoided this ambiguity by declaring their undefined instructions to generate a soft-
ware exception - a restart outcome. Undefined instructions have therefore been given 
a detection attribute. Translators should encourage the use of these codes so that 
the detection capability of the generated code is enhanced. Modern microprocessors 
generally declare the operation of all possible instruction opcodes even though the 
use of some is undefined. 
8.8.2. Restart Instructions
Instructions with a restart outcome are used to detect erroneous behaviour.
Where possible, code should be manipulated to attain this detection capability. The
ability of a translator to endow code with this attribute is influenced by the number
of restart instructions and their position within the target processor's instruction set.
The proportion of instructions in a microprocessor instruction set that generate a
restart outcome is typically small, see Appendix A. However, in some microprocessors
the undefined instructions are specified to generate a restart outcome. Undefined
instructions also usually occur as groups within the instruction set, being reserved
for future instruction set extensions. There can be substantial numbers of undefined
instructions within an instruction set.
8.8.3. Stop/Wait and Return Instructions
Hazards associated with code which reflects the format of a stop/wait instruction
are critical. They are synonymous with cessation of processing: a catastrophic failure.
Similarly, code with a return instruction format is deemed a critical hazard because
it can create an infinite execution loop during erroneous execution. Such erroneous
processing is also associated with catastrophic failure.
Fewer stop/wait and return instructions in the instruction set discourage the
generation of critical hazards. This assumes that the translation of target machine
code can be broadly considered as a pseudo-random process.
Additionally, the principle of locality, associated with relative addressing, identifies
that high and low value operands occur more regularly than other formats. To
further reduce the probability of hazard generation, it will be an advantage if the
stop/wait and return instructions take mid-range operand values.
The generation of hazards will also be influenced by the application code access
to input/output reserved areas. Specific, utilized locations in this area may have
an address which has a hazardous format corresponding to a stop/wait or return
instruction. The translator generation of these hazards is application specific. For
this reason, microprocessors are preferred whose stop/wait and return instructions
do not reflect any reserved input/output locations.
8.8.4. Unspecified Jump Instructions
Code which has the format of an unspecified jump instruction is considered hazardous
in the same manner as code with a return instruction format, the hazard being
indicative of an infinite execution loop and hence catastrophic failure. Although such
code has an embedded critical hazard, it is an asset rather than a liability.
The fault tolerance technique proposed in this thesis uses the occurrence of code
with unspecified jump instruction formats to detect erroneous execution. Hence the
presence of this code hazard is an attribute which indirectly enhances the software
detection capability.
8.9. Conclusion
A significant amount of software produced for microprocessor applications is written
in a high-level language which is independent of the target processor. This software
is converted by a translator to an equivalent machine code implementation. The
programmer, therefore, does not have any control over the nature of the machine code
generated. This code could contain embedded hazards which cause catastrophic failures
during erroneous execution.
It has been shown that translators can implement techniques to manipulate machine
code. Such manipulation can prevent the release of hazardous code by the
translator. The techniques involve encouraging and discouraging particular translator
practices. Within program areas, instructions are selected with respect to their
opcode and operand. Both may have a hazardous format. Equivalent operations
can be chosen to provide alternative opcodes, and operands can be manipulated by
implementing different addressing modes so that no critical hazards are generated.
In addition, the configuration of the address space can be altered to cover certain
hazards. The manipulation techniques proposed to prevent hazardous code generation
in the data and input/output reserved areas are dependent to a large extent on
the data structures implemented by the microprocessor. Applying the collection of
manipulation techniques can prevent the generation of critical hazards.
The proposed translator techniques are influenced by the format and content of
the target processor instruction set. General inferences are made for instruction set
characteristics.
The ability to manipulate translation of target machine code also facilitates further
working of the code to introduce a detection capability for erroneous execution.
This is achieved by encouraging the generation of code with a restart instruction
format. Such enhancement of the code can further improve the reliability of a
microprocessor application by reducing error latency.
Chapter 9
MICROPROCESSOR DESIGN FOR FAULT TOLERANCE
9.1. Introduction
9.2. The Effectiveness of Fault Tolerance
9.3. Implementing Fault Tolerance
9.4. Influences on Microprocessor Design
9.5. Instruction Set Architecture
9.5.1. Instruction Set Mix
9.5.2. Opcode Maps
9.5.3. Operand Requirements and Specification
9.6. Input/Output Communication Ports
9.7. Memory Organization
9.7.1. Memory Alignment
9.7.2. Defining Memory Utilization
9.8. Monitoring Branch Activity
9.9. Conclusion
CHAPTER NINE
MICROPROCESSOR DESIGN FOR FAULT TOLERANCE
9.1. Introduction
The hazard of erroneous microprocessor behaviour is modelled and evaluated in
the early chapters of this thesis. A technique is proposed in Chapter 5 using software
implemented fault tolerance to provide detection of erroneous execution. A utility
has been built which automatically applies the proposed software based technique
to target code, with the performance of the enhanced code being analysed and the
benefit of its improved detection capability demonstrated. Furthermore, the previous
chapter discusses methods of software translation which reduce the hazard of the
target source code in respect of erroneous execution. This chapter concludes the research
presented in this thesis by considering design features that can be incorporated
within the architecture of a microprocessor to provide fault tolerance.
The design features presented identify particular characteristics of erroneous execution.
Once erroneous execution is distinguished from valid processing, appropriate
recovery action can be initiated to restore the integrity of the microprocessor system.
The performance of the implemented design features and their attributed architecture
overhead for a microprocessor is discussed.
9.2. The Effectiveness of Fault Tolerance
As discussed in Chapter 1, the implementation of fault tolerance at the logic level
within a microprocessor's architecture incurs large overheads; in particular, a large
extension in the number of gates is required. Chakraborty & Ghosh [1988] show that
logic faults have a good correlation to functional errors in processor operation. It is
therefore feasible to implement techniques facilitating functional-level fault tolerance.
These techniques require less gate overhead than logic fault tolerance because they
only attempt to detect particular characteristics of erroneous behaviour rather than
all possible failure patterns.
Established functional-level fault tolerant techniques for microprocessor systems,
called capability checks, are discussed in Chapter 2. Application of these techniques
to microprocessor systems involves the addition of dedicated hardware units and/or
software manipulation. Fault tolerant techniques which operate on functional aspects
of erroneous execution are extremely difficult to assess. Erroneous behaviour will vary
from application to application, depending on the target software and processor, and
hence the detection capability of the applied technique will vary too. This is known
as the 'instruction sequence' dependency. Nevertheless, inclusion of capability checks
within microprocessor systems can be beneficial where high reliability is required.
9.3. Implementing Fault Tolerance
The implementation of fault tolerance as described in Chapter 2 involves the
co-ordinated detection of an error and restoration of the microprocessor system
integrity. This thesis considers the first stage of fault tolerance, i.e. the detection of
erroneous microprocessor behaviour as described in Chapter 3. It is proposed that
detection can be achieved through the generation of a restart outcome during erroneous
execution. Such a processing outcome re-establishes control of the program
flow by directing execution to a predefined location in the address space. In order to
complete recovery, a restart outcome must initiate execution of a module of code that
restores system integrity. The predefined location in memory, accessed as a result of a
restart outcome, is therefore hardwired within the microprocessor architecture to be
a section of Read Only Memory (ROM). The micro-code in the ROM is programmed
with a recovery routine for the application software.
Microprocessor design features are proposed in this chapter which facilitate an 
increased probability of erroneous execution generating a restart outcome, without 
the need for peripheral circuitry or software manipulation. 
9.4. Influences on Microprocessor Design
It is very difficult to present a list of influences on microprocessor architecture design
because of the diversity of application requirements between different processors.
For instance, the Intel 4004 is a 4-bit microprocessor designed for simple arithmetical
functions, while the Motorola 68(7)05 is an 8-bit processor specifically designed for
programmable controllers, and the Advanced Micro Devices Am29000 processor is a
32-bit processor designed for real-time systems requiring large amounts of processing.
The architecture of these microprocessors satisfies many design requirements including
processing performance, processing capability, and cost. This is achieved through
the implementation of various design features: for example, the AMD Am29000
implements a Reduced Instruction Set Computer (RISC) architecture incorporating a
cache memory, instruction pipeline, and reduced instruction set in order to improve
its run-time processing capability.
The following sections of this chapter consider design features which can be in-
corporated into the microprocessor architecture to facilitate the self-detection of erro-
neous behaviour. In particular, design features are proposed which facilitate detection 
of the erroneous execution modelled in Chapter 3, and the hardware representation 
of the software implemented fault tolerant technique presented in Chapter 5. 
9.5. Instruction Set Architecture
All programs which are executable on a target microprocessor consist of a sequence
of machine instructions. Typical instruction formats contain the following
basic elements:
• an opcode specifying the operation,
• addressing mode specification for each of the input operands and
result, and
• addressing mode data, e.g. immediate data for direct addressing.
An instruction's addressing mode is designated by either the opcode or a reserved
tag within the operand. The Motorola 68000 processor family defines the addressing
mode of instruction operands within the opcode, whilst the Intel 8086 processor
implements a tag within the operand bit format which is decoded to ascertain the
operand's addressing mode.
The architecture and organization of the processor instruction set can be ma-
nipulated to enhance the inherent detection capability of erroneous execution by the 
microprocessor. Techniques considered within this section involve the instruction set 
mix, the opcode map, and the instruction operand requirements. 
9.5.1. Instruction Set Mix
The selection of instruction operations for inclusion in an instruction set is the 
subject of much research [Tanenbaum, 1990]. Instructions can be separated into 
two broad groups: general purpose and specialized instructions. General purpose 
instructions include data movement operations that are needed by almost every
application. Specialized instructions are only useful in specific applications, e.g. the
Motorola 68000 instruction MOVEP takes the content of the D register and stores
it in alternate bytes and is intended for communication with specific 8-bit peripheral
devices.
Studies of instruction set usage have led to the development of Reduced Instruction
Set Computers (RISC) such as the Advanced Micro Devices Am29000 processor.
These processors incorporate only the most used and general purpose instructions,
unlike Complex Instruction Set Computers (CISC) which have large instruction sets
that are often extended when the processor is upgraded through an upwardly compatible
revision. Within the RISC processor, particular tasks that can be achieved
by a single sophisticated CISC instruction may require several of the more basic
instructions provided in its instruction set. Although the application software for a
particular task is larger on a RISC compared to a CISC processor, its simpler data
path processing (partially facilitated by the RISC instruction set) improves overall
performance.
The instruction mix analysis of Chapter 4, which is used to evaluate the model
of erroneous microprocessor behaviour, highlights the advantages and disadvantages
associated with the inclusion of restart and stop/wait outcome generating operations,
respectively, within the processor instruction set. Stop/wait outcome instructions
initiate catastrophic failure during erroneous execution because all independent
operation is lost, and restart outcome instructions provide a controlled route for erroneous
execution to a recovery routine which restores processor integrity. Obviously the 
number of stop/wait outcome generating instructions in the instruction set should be 
minimized, although the degree of their influence on erroneous execution is dependent 
on their activation rather than their occurrence in an instruction mix. 
9.5.2. Opcode Maps
An instruction opcode is designed to have sufficient bits to identify all facilitated
unique operations. The number of bits, n, used to specify an opcode format should be
an integral multiple, m, of the processor's memory element length b (usually a byte)
in order to simplify data flow processing. Additionally, the number of bits specified
for an opcode should be kept as small as possible in order to reduce the memory
requirement of software. Hence opcodes which specify i instruction operations are
designed with a bit length n where,
$$ i \le 2^{n}. \tag{9.1.} $$
Re-arranging gives,
$$ n \ge \frac{\log i}{\log 2}, \tag{9.2.} $$
$$ n \ge \log_{2} i, \tag{9.3.} $$
and,
$$ n = m \cdot b. \tag{9.4.} $$
The specified opcodes form an instruction set, and the range of opcode formats is
usually referred to as the processor's opcode map.
Equation (9.1.) notes that there may be some redundancy in the opcode map. 
This redundancy is acceptable in many architectures because the processing ben-
efits of fixed-length instruction implementation within the overall processor design 
outweigh the sacrifice of larger average code size [Hennessy & Patterson, 1990]. 
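As an illustration of Equations (9.1.) to (9.4.), the following Python sketch computes the smallest opcode length n that is a whole multiple of the element length b and still distinguishes i operations, together with the number of redundant opcode formats that remain. The function name `opcode_length` and the default of b = 8 bits are assumptions made for this example only.

```python
def opcode_length(i: int, b: int = 8) -> tuple[int, int]:
    """Return (n, redundant) for i operations and an element length of b bits.

    n is the smallest multiple of b satisfying i <= 2**n, and redundant is
    the number of opcode formats left unused in the opcode map.
    """
    if i < 1 or b < 1:
        raise ValueError("i and b must be positive")
    n_min = max(1, (i - 1).bit_length())  # smallest n with 2**n >= i
    m = -(-n_min // b)                    # ceiling division gives the multiple m
    n = m * b
    return n, 2**n - i


print(opcode_length(200))  # fits in one byte, leaving 56 redundant formats
print(opcode_length(300))  # needs two bytes, leaving a far larger redundancy
```

The second call shows how the requirement n = m.b can introduce substantial redundancy once the instruction count exceeds a power of two aligned with the element length.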
All redundant opcode formats in the opcode map should be defined to generate
a restart outcome. In this manner, execution of an undeclared operation, which is
synonymous with erroneous microprocessor behaviour, is detected.
The opcode map can also be designed to eliminate the hazard of instruction
operands associated with a local displacement by a relative branch. Program flow
typically follows the 'principle of locality', well known by memory systems' designers,
and can be explained by using a rule of thumb:
'A program executes about 90% of its instructions in about
10% of its code'.
The implication is that the majority of program flow involves branches whose target
destination is in the local vicinity of the location of the generating instruction in
the program code. Hennessy & Patterson [1990] describe the analysis of benchmark
programs which support this assertion.
Many processor architectures define local branch displacement information to be
contained within a branch instruction's operands. These operands are susceptible to
interpretation as an opcode during periods of erroneous execution. It is therefore
proposed that the processor instruction set reserves areas representing low and high
values in the opcode map for restart generating instructions, see Figure 9.1. Such
opcode map organization ensures that local branch instruction operands interpreted
as an opcode generate a restart outcome and hence detect erroneous execution.
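The proposal can be illustrated with a minimal Python sketch. It assumes a single-byte opcode map and two's-complement encoding of a signed 8-bit displacement, so that short forward branches yield low byte values and short backward branches yield high byte values. The sizes of the reserved zones (32 formats each) and all names are hypothetical choices for the example.

```python
RESTART = "restart"
USED = "used"


def build_opcode_map(low_reserved: int, high_reserved: int) -> list[str]:
    """Build a 256-entry map whose lowest and highest formats generate a restart."""
    return [
        RESTART if value < low_reserved or value >= 256 - high_reserved else USED
        for value in range(256)
    ]


def displacement_byte(displacement: int) -> int:
    """Encode a signed 8-bit displacement as a two's-complement byte (assumed encoding)."""
    if not -128 <= displacement <= 127:
        raise ValueError("displacement does not fit in one byte")
    return displacement & 0xFF


def is_detected(opcode_map: list[str], displacement: int) -> bool:
    """True when the displacement byte, executed as an opcode, generates a restart."""
    return opcode_map[displacement_byte(displacement)] == RESTART


opcode_map = build_opcode_map(low_reserved=32, high_reserved=32)
print(is_detected(opcode_map, 5))    # short forward branch: byte 0x05 is in the low zone
print(is_detected(opcode_map, -5))   # short backward branch: byte 0xFB is in the high zone
print(is_detected(opcode_map, 100))  # longer branch: byte 0x64 is a used opcode format
```

Under these assumptions only displacements outside the reserved zones escape detection, which is consistent with the locality argument: most branch displacements are small in magnitude.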
A particular hazard associated with branch locality is observed within the Motorola
68(7)05 microprocessor instruction set. Here, branch instructions specify their
displacement in the attached operands. Low order relative displacements for forward
branch instructions which specify byte operands with the most significant four bits
set to '0000' or '0010' are potential invalid branches.
The Motorola 68000 microprocessor reserves high value opcode formats for future
upwardly compatible processor extensions. Execution of a memory element with such
an opcode format results in a restart outcome via a software exception. This attribute
of the Motorola 68000 instruction set facilitates a detection capability where relative
branch instructions specify an operand to contain a local backward displacement.
Figure 9.1.: Microprocessor Opcode Map
(Figure labels: Least Significant Bits; Used Opcode Formats; Unused Opcode Formats.)
The unused opcode bit formats represent high
and low values. These opcodes can be set to
generate a restart outcome. Relative branch
instructions specify a local displacement.
Operands containing such displacements which
are erroneously interpreted as an opcode will
now generate a restart outcome and hence
detect erroneous execution.
9.5.3. Operand Requirements and Specification
Operands have a bit size equivalent to an opcode in order to preserve the memory
and data path organization associated with opcode processing. Microprocessor
architects aim to keep the number of operands as small as possible in order to reduce
instruction decoding and hence improve performance [Ciminiera & Valenzano,
1987]. Additionally, implementing fewer operands reduces the memory requirement
of a program. The number of operands required depends on the amount of data to
be processed and the addressing mode used.
The code extension required by the application of the software implemented fault
tolerant techniques proposed in Chapter 5 is largely influenced by the operand requirements
of instructions within the processor instruction set. Particular influence is
observed where default size mechanisms are inserted within target software; the default
size is equivalent to the maximum number of operands required by an instruction in
the instruction set. Optimum size detection mechanisms may require less additional
code, depending on the placement conditions at each insertion. Further details of
detection mechanism construction can be found in Chapter 5. In order to reduce the
code extension required when applying the software implemented fault tolerant technique
proposed in Chapter 5, the instruction architecture should specify instructions
to have fewer operands.
Register oriented addressing modes can be encouraged to reduce the operand
requirements of an instruction. Such addressing modes remove the necessity for
operand specification of absolute addresses or immediate data. Details of the registers
to be processed can be specified within the opcode bit format. Where this information
cannot be incorporated within the opcode bit format, a single operand can be specified
to contain the register allocation information. The bit format of the operand is
designated to represent a restart generating opcode. In many instances a single
operand will be shorter than the number of operands required to represent an absolute
address.
9.6. Input/Output Communication Ports
Chapter 5 discusses two attributes of a microprocessor's architecture which facilitate
an erroneous execution detection capability for the reserved input/output port
locations in the address space. These attributes can be incorporated into the design
specification of a processor.
Firstly, instructions that communicate with external devices may specify access
to an input/output port within an operand. In particular, individual instruction
operands may be set to contain the whole or partial address of the communication
port. The location of these communication ports in the memory map can be defined
to take bit formats which represent restart generating instructions in the opcode map.
Hence, erroneous execution of an input/output port address as an opcode results in
detection, and recovery can be initiated.
Secondly, erroneous execution may itself interpret an input/output port location
as an opcode. The microprocessor design can incorporate the hardwiring of particular
bits in the input/output port such that the bit format represents a restart instruction,
as shown in Figure 9.2. Erroneous execution of this location as an opcode is self-detected.
This method, of course, incurs an overhead in that the data transfer capability to
an external device is reduced because of the redundant bits reserved for erroneous
execution detection.
9.7. Memory Organization
This section discusses techniques which involve the organization and implemen-
tation of memory used by a microprocessor. 
9.7.1. Memory Alignment
Some microprocessors require elements (e.g. byte, double-byte, or quad-byte) in 
memory to be aligned according to the memory organization. A memory element of 
size s bytes resident at location Ad in the address space is aligned when, 

Hence, byte memory elements in a byte orientated memory organization will always
be aligned. Byte memory elements in a double-byte orientated memory are aligned on
even byte boundary addresses, but mis-aligned on odd byte boundary addresses.
Mis-aligned element access specified by an instruction requires multiple physical memory
accesses, whilst aligned memory access requires only one physical memory access by
the microprocessor hardware. From a performance perspective, therefore, elements
of memory should be aligned rather than mis-aligned.
The memory organization employed in a microprocessor architecture has particular
implications for the software implemented fault tolerant techniques proposed in
this thesis. Those processors that implement a byte memory organization for the program
area allow any instruction operand to be processed erroneously as an opcode.
Hence erroneous execution in the program area is not hindered by the memory organization.
Other memory organizations allow the possibility of mis-aligned
opcode access. Mis-aligned opcode access is a characteristic of erroneous behaviour.
Fault tolerant techniques can be implemented to detect this error.
The Advanced Micro Devices Am29000 microprocessor and the Motorola 68000
microprocessor family implement a similar approach to mis-aligned opcode access
for their respective quad-byte and double-byte memory organization. Access to
mis-aligned double-byte elements in memory within the Motorola 68000 microprocessor
architecture generates an exception which, naturally, is called the 'odd byte address'
exception. This method of imposing a double-byte organization in memory to improve
the operational performance of the processor also makes possible a detection
capability to prevent erroneous execution. The Advanced Micro Devices Am29000
processor has a similar function. Its exception generation can, however, be disabled
by a special status register. Enabling the mis-alignment exception results in the
two least significant bits of the program counter being masked and hence an opcode
address is generated. Erroneous execution is, therefore, re-synchronized.
The design feature described in this section can be incorporated into the archi-
tecture of a microprocessor. Digital circuitry can generate a restart outcome when 
it detects mis-aligned memory access. Such an outcome can be used to initiate a
recovery procedure and hence improve the reliability of the microprocessor system. 
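The two treatments of a mis-aligned opcode address described in this section can be contrasted in a short Python sketch. It is a behavioural illustration only: the exception class and function names are hypothetical, and the code does not model the registers or exception vectors of either processor.

```python
class MisalignedOpcodeAccess(Exception):
    """Stands in for a hardware exception that leads to a restart outcome."""


def check_double_byte_fetch(program_counter: int) -> int:
    """Double-byte organization: an odd opcode address raises an exception."""
    if program_counter % 2 != 0:
        raise MisalignedOpcodeAccess(hex(program_counter))
    return program_counter


def mask_quad_byte_fetch(program_counter: int) -> int:
    """Quad-byte organization: clear the two least significant bits of the address."""
    return program_counter & ~0b11


print(hex(mask_quad_byte_fetch(0x1003)))  # forced back onto a quad-byte boundary
try:
    check_double_byte_fetch(0x1001)
except MisalignedOpcodeAccess as error:
    print("restart outcome for opcode address", error)
```

The first approach detects the erroneous access and allows recovery to be initiated, whereas masking silently re-synchronizes execution on an opcode boundary without reporting the error.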
9.7.2. Defining Memory Utilization
The address space of a microprocessor can be divided according to its functional
allocation. Chapter 5 initially divides the address space into used and unused areas
and discusses the implications of the division with respect to erroneous behaviour.
Moreover, the said chapter describes the digital implementation of a hardware unit
called an Access Guardian which detects the unused area access characteristic of erroneous
behaviour. The design of the Access Guardian is not complex and can be
incorporated into the architecture of a microprocessor.
There, nevertheless, remains the problem of detecting erroneous execution in the
used area of the address space. Glaser & Masson [1982] present a hardware unit
referred to as a 'SAFE ROM' which has been implemented within a microprocessor
system by Li et al [1984]. The SAFE ROM is a one-bit wide memory which is attached
to each memory element (in this case byte) of Read Only Memory (ROM) to signify
its usage as either opcode or operand, see Figure 9.3. Invalid opcode address access
is identified by detection circuitry which determines whether or not the location in
question has its SAFE ROM bit set to opcode or operand.
A similar approach is proposed here which can be applied to all address space
locations regardless of their implementation in either Read Only Memory or Random
Access Memory (RAM). Instead of implementing additional memory, a bit of each
memory element is reserved to perform the SAFE ROM function. Hence, there is not
a memory overhead as such but rather memory redundancy associated with the reserved
bit, see Figure 9.4. This method provides detection of erroneous execution for
all physically implemented memory. The memory redundancy incurred by applying
this technique can be calculated, thus:
$$ \text{Memory Redundancy (\%)} = \frac{100}{\text{memory element length in bits}}, \tag{9.6.} $$
The memory redundancy associated with this technique may be considered sig-
nificant for some processor applications. For example, microprocessors whose mem-
ories are organized as byte or double-byte structures have a 12.5% and 6.25% re-
dundancy in their memory respectively. Nevertheless, the technique is extremely 
effective, providing detection of erroneous execution except in circumstances involv-
ing re-synchronization. Detection capability is important for systems requiring high 
reliability and in these applications the memory redundancy is expected to be ac-
ceptable. 
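A minimal Python sketch of the reserved 'usage' bit is given below. It assumes byte memory elements, that the most significant bit is the reserved bit, and that a set bit marks an opcode; the thesis text does not fix either the bit position or its polarity, so both are labelled assumptions, as are all names used.

```python
USAGE_BIT = 0x80  # assumed position of the reserved bit within a byte element


class InvalidOpcodeAccess(Exception):
    """Stands in for the detection circuitry signalling erroneous execution."""


def store(value: int, is_opcode: bool) -> int:
    """Pack a 7-bit value with the usage bit (set = opcode, clear = operand)."""
    if not 0 <= value < 0x80:
        raise ValueError("only seven bits remain for the stored value")
    return value | USAGE_BIT if is_opcode else value


def fetch_opcode(memory: list[int], address: int) -> int:
    """Return the opcode at address, or signal detection if it is an operand."""
    element = memory[address]
    if not element & USAGE_BIT:
        raise InvalidOpcodeAccess(f"operand fetched as opcode at {address}")
    return element & 0x7F


def memory_redundancy_percent(element_bits: int) -> float:
    """One reserved bit per memory element, expressed as a percentage."""
    return 100.0 / element_bits


memory = [store(0x12, True), store(0x05, False)]
print(fetch_opcode(memory, 0))
print(memory_redundancy_percent(8), memory_redundancy_percent(16))
```

The redundancy function reproduces the 12.5% and 6.25% figures quoted for byte and double-byte organizations respectively.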
9.8. Monitoring Branch Activity
The architecture of a microprocessor can be extended to incorporate design fea-
tures facilitating the recognition of branch operations similar to that provided by 
the Motorola 68030 microprocessor. Such a facility could be used to activate spe-
cial circuitry dedicated to determining whether an invalid or valid branch is being 
processed. 
Two methods of determining invalid branch activity are suggested here. Firstly, 
the software implemented fault tolerant technique proposed in Chapter 5 can be 
incorporated into the microprocessor architecture. The required digital circuitry is 
based on a logical AND function using, as inputs, the recognition of branch activity 
and the opcode 'usage' bit proposed in the previous section. The outcome is detection 
of all invalid branches regardless of whether or not their destination leads to the re-
synchronization of erroneous execution. 
The second technique is based on verifying branch activity, and is commonly re-
ferred to as 'signature analysis'. The theory supporting the technique is described 
in Chapter 2. Schuette & Shen [1986] have implemented the technique using additional
digital circuitry in a Motorola 68000 microprocessor based system, which
incurred a 17% gate overhead compared with the number of gates in the processor. 
More recently, the technique has been incorporated within the architecture of devel-
opment processors where it had a 13% chip area overhead [Leveugle et al, 1990]. The 
technique detects erroneous execution by failing to verify the occurrence of a valid 
branch. 
On identifying an invalid branch, erroneous execution is detected and appropriate
hardwired recovery action can be initiated. Although the techniques described above
are only activated when branch activity is recognized, they do detect re-synchronized
erroneous execution. This contrasts with the technique based on defining memory
utilization which is activated more regularly because of its on-line monitoring process,
but which cannot detect re-synchronized erroneous execution.
9.9. Conclusion 
Design features which can be incorporated into the architecture of a microproces-
sor to provide a self-detection capability for erroneous behaviour have been proposed. 
These include suggestions for the instruction set architecture and memory organization.
The benefits of including these techniques in the microprocessor hardware
(enhanced detection capability) are dependent on the particular instruction sequences
of erroneous execution for a target processor system. Nevertheless, capability checks
are being included in microprocessor designs, notably the mis-aligned opcode address
exception. Further capability checks, including those proposed in this chapter,
may be implemented within commercial microprocessor designs in the future.
Chapter 10
CONCLUSION
10.1. Microprocessor Controllers for Industrial Applications
10.2. Reliable Microprocessor Controllers
10.3. Modelling Erroneous Microprocessor Behaviour
10.4. Detecting Erroneous Microprocessor Execution
10.5. Evaluating Fault Tolerance
10.6. Generating Non-Hazardous Software
10.7. Microprocessor Design for Fault Tolerance
10.8. Summary
CHAPTER TEN
CONCLUSION
10.1. Microprocessor Controllers For Industrial Applications 
In recent years it has become popular practice to implement industrial control sys-
tems using digital circuitry with an embedded microprocessor rather than analogue 
systems. Microprocessor controllers provide a flexible design approach, the nature 
of their operation being easily tailored to particular tasks through the alteration of 
control software. Digital systems, however, are more susceptible to transient dis-
turbances, common within industrial environments, than similar analogue systems. 
Analogue systems tend to pass the effects of a transient disturbance as a temporary 
processing discrepancy, whilst digital systems, because of their discrete state nature, 
can have their operation disrupted. This thesis addresses the problem of improving 
the reliability of low budget microprocessor systems where fault masking is considered 
too expensive. 
10.2. Reliable Microprocessor Controllers 
The failure process of a digital system involves the manifestation of a logic fault, 
its activation as an error, and finally error spawn until a fatal operating condition 
is generated. Faults can be manifested as either temporary or permanent logic cor-
ruption. Temporary faults have been observed to cause a significant proportion of 
microprocessor system failures. Studies, reviewed in Chapter 2, suggest that in excess 
of 90% of processor failures are generated by temporary faults rather than permanent 
faults. 
Temporary faults, attributed to disturbances in the operating environment of 
a microprocessor based controller, are called transient faults. Environmental dis-
turbances can involve electro-magnetic interference (EMI), electro-static discharge 
(ESD), electronic noise, ionizing radiation, and power supply fluctuations. The 
operating environment can often be harsh. In order to achieve high reliability, 
microprocessor controllers can implement fault tolerance. Fault tolerant techniques 
implement four tasks in sequence: 
• error detection,
• damage assessment,
• error recovery, and
• system restoration.
Although this thesis focuses on error detection, the remaining fault tolerant tasks 
are equally important and should be considered when implementing a fault tolerant 
system. 
Additional circuitry to detect individual logic faults can be prohibitive within 
a low budget microprocessor architecture. An alternative technique for detecting 
logic faults is based on the premise that logic faults are complemented by processing 
errors. The technique is referred to as functional fault tolerance because it relies on 
distinguishing between valid and invalid characteristics of microprocessor execution. 
Functional fault tolerance must ensure that erroneous microprocessor execution 
does not generate a fatal error and hence catastrophic failure. To achieve this, detec-
tion techniques can be applied to recognize different attributes of erroneous execution. 
These techniques are collectively known as 'capability checks'. In order to design and 
evaluate the effectiveness of capability checks it is necessary to model erroneous
microprocessor behaviour.
10.3. Modelling Erroneous Microprocessor Behaviour 
Erroneous microprocessor behaviour involves either erroneous data or program 
flow within executing software. The 'reasonableness' of data flow can be checked 
by the operating software. Erroneous program flow, however, cannot be verified in 
this manner because predictable operation of the software is lost. This has serious 
implications for industrial applications where microprocessors are responsible for the 
monitoring, process, or control of equipment. Unpredictable processor action can 
command equipment to malfunction, the hazard of this situation being dependent on 
the equipment's task.
A model has been developed which investigates the program flow associated with 
erroneous microprocessor behaviour. Erroneous execution is defined to be initiated 
by a temporary fault generating an Initial Erroneous Jump (IEJ) through corruption 
of the processor's program counter. Ensuing erroneous microprocessor behaviour 
is characterized by periods of linear erroneous execution interspersed with further 
erroneous jumps called Subsequent Erroneous Jumps (SEJs). In addition, the model 
allows consideration of particular processing outcomes associated with catastrophic 
failure and recovery. Recovery is achieved through the processing of an instruction 
developing a restart outcome, which directs execution to a pre-defined location in 
memory where a recovery routine resides. The recovery routine is programmed to 
fulfil the restoration requirements of the application software; two possible recovery
strategies are reset and roll-back.
The model of erroneous behaviour is applied to a selection of 8, 16, and 32-bit
processors. The following microprocessors are assessed: (8-bit) Motorola 6800,
Intel 8048, and Intel 8085; (16-bit) Intel 8086, Motorola 68000, and Motorola 68010;
(32-bit) Advanced Micro Devices Am29000, Motorola 68020, and Intel 80386. All
processors are assumed to have a random content address space. Erroneous execution
is evaluated using instruction mix analysis to predict the mean expected operation.
The character of erroneous jumps is studied. The model assumes that IEJs have a 
random target in the address space. Such erroneous jumps have particular significance 
in relation to microprocessor reliability when their destination is in the unused area 
whose code attributes are unknown. Where an IEJ directs erroneous execution to 
the used area, further erroneous jumps (SEJs) occur as a result of invalid software 
processing. The model suggests that SEJs typically generate a local target and hence 
erroneous execution initiated within the used area is likely to remain there for a 
significant period. This is a hazardous situation because not only is the processor 
out of control but its erroneous activity could be mutilating system integrity, making 
restoration of the microprocessor system more difficult. 
A Markov process is used to predict the reliability of a microprocessor. In particular,
the Mean Time To Failure (MTTF) parameter is used because it has more intuitive
meaning to reliability engineers. The instruction mix analysis for the selection
of microprocessors described above suggests that both the instruction distribution
and the processor architecture can have a significant influence on reliability. The
Motorola 68000 microprocessor family has the highest reliability with an MTTF
three times the mean inter-arrival time of events that initiate erroneous execution.
These processors have instruction sets which define the execution of an undeclared
opcode and mis-aligned memory access to generate a restart outcome. The Advanced
Micro Devices Am29000 processor has an MTTF prediction twice that of the mean
inter-arrival event period which is due solely to undeclared opcodes generating a
restart outcome. The remaining microprocessor reliability models predict an MTTF
of similar magnitude to the mean inter-arrival event period. These processors do not
define their undeclared opcodes to have a restart outcome, and implement a byte
memory organization and hence mis-aligned memory access is impossible.
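The relationship between restart outcomes and MTTF can be illustrated with a deliberately simplified calculation, which is an assumption of this sketch and not the Markov model evaluated in this thesis. Suppose each period of erroneous execution ends independently in a restart outcome with probability p and otherwise in catastrophic failure. The number of events up to and including the failure is then geometrically distributed, so the MTTF is the mean inter-arrival period divided by (1 - p).

```python
from fractions import Fraction


def mean_time_to_failure(mean_inter_arrival, p_restart):
    """MTTF when each erroneous event is recovered with probability p_restart.

    Simplifying assumption: events are independent and recovery is instantaneous,
    so the expected number of events until failure is 1 / (1 - p_restart).
    """
    p_failure = 1 - p_restart
    if not 0 < p_failure <= 1:
        raise ValueError("p_restart must lie in the range [0, 1)")
    return mean_inter_arrival / p_failure


period = Fraction(1)  # MTTF is expressed in units of the mean inter-arrival period
print(mean_time_to_failure(period, Fraction(2, 3)))  # MTTF of three periods
print(mean_time_to_failure(period, Fraction(1, 2)))  # MTTF of two periods
print(mean_time_to_failure(period, Fraction(0)))     # MTTF of one period
```

Under this assumption an MTTF of three, two, and one inter-arrival periods would correspond to restart probabilities of two thirds, one half, and zero respectively; the probabilities are illustrative values, not results reported here.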
The availability of a microprocessor system is dependent on the detection of an 
error and time taken to restore the system integrity. A general model is presented: 
the influences on availability are the mean inter-arrival event period, the processor 
operational frequency, and the size of the routine required to restore system integrity. 
The latter two factors are processor and application dependent. Microprocessor ar-
chitectures that require little restoration activity and simple application software can 
reduce the size of the recovery routine. These, together with high processor operating 
speeds, facilitate higher availability. 
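The three influences on availability can be combined in a short Python sketch. The formula used, the ratio of operating time to operating time plus restoration time, is a common textbook definition adopted here as an assumption; it is not necessarily the general model presented in this thesis. It also assumes that every event is detected and recovered, and the parameter names and cycles-per-instruction figure are hypothetical.

```python
def restoration_time(routine_instructions: int,
                     cycles_per_instruction: float,
                     clock_hz: float) -> float:
    """Seconds needed to execute the recovery routine once."""
    return routine_instructions * cycles_per_instruction / clock_hz


def availability(mean_inter_arrival_s: float, restore_s: float) -> float:
    """Fraction of time the system is operational between disruptive events."""
    return mean_inter_arrival_s / (mean_inter_arrival_s + restore_s)


restore = restoration_time(routine_instructions=1000,
                           cycles_per_instruction=4,
                           clock_hz=1_000_000)
print(restore)                     # restoration time in seconds
print(availability(1.0, restore))  # availability for one event per second
```

Halving the recovery routine size or doubling the clock frequency halves the restoration time, which shows why small recovery routines and high operating speeds favour availability.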
10.4. Detecting Erroneous Microprocessor Execution 
This thesis proposes a new capability check for detecting erroneous microproces-
sor execution. The technique is based on software implemented fault tolerance. 
The aim of the technique is to identify, through static analysis, the potential tar-
gets of erroneous jumps, referred to as invalid branches, and to place at these locations 
a software detection mechanism which is activated by the erroneous execution. In-
valid branches are unsynchronized erroneous jumps and should not be confused with 
synchronized erroneous jumps. Interpretation of a memory location containing an 
opcode is termed 'synchronized', whilst interpretation of any other content is termed 
'unsynchronized'. 
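The distinction between synchronized and unsynchronized interpretation can be sketched in Python using a toy instruction set. The opcode values and instruction lengths in `LENGTHS` are hypothetical, and the linear sweep assumes that the program area contains contiguous instructions with no embedded data.

```python
# Hypothetical toy instruction set: opcode byte -> total instruction length in bytes.
LENGTHS = {0x01: 1, 0x02: 2, 0x03: 3}


def opcode_addresses(program: bytes, start: int = 0) -> set[int]:
    """Walk the program from its entry point and collect every opcode address."""
    addresses = set()
    pc = start
    while pc < len(program):
        addresses.add(pc)
        pc += LENGTHS[program[pc]]  # skip over the operands of this instruction
    return addresses


def classify(program: bytes, address: int) -> str:
    """Label execution starting at address as synchronized or unsynchronized."""
    if address in opcode_addresses(program):
        return "synchronized"
    return "unsynchronized"


program = bytes([0x02, 0x01, 0x03, 0x02, 0x01, 0x01])
print(sorted(opcode_addresses(program)))  # addresses holding genuine opcodes
print(classify(program, 1))               # operand of the first instruction
print(classify(program, 2))               # opcode of the second instruction
```

In this example the byte at address 1 has the value of a valid opcode but is an operand, so an erroneous jump to it would be interpreted in an unsynchronized manner.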
Different approaches are required when applying software implemented fault tol-
erance to the used and unused areas of the address space respectively. The first 
approach considers the unused area, which can consist of physical and non-existent 
memory. Physical memory is filled with restart generating instructions which do not 
have an operand requirement. In this way, erroneous execution at any location will 
develop a restart outcome and hence detection of erroneous behaviour. Non-existent 
memory requires a hardware solution, and hence a unit called an Access Guardian is 
designed which detects memory access by monitoring the processor address bus. The 
complexity of the Access Guardian is dependent on the contiguity of non-existent 
memory locations, and whether or not the processor has a multiplexed address/data 
bus such as the Intel 8086 microprocessor. 
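Both parts of the unused area approach can be sketched in Python. The restart opcode value 0xFF is a hypothetical single-element instruction without operands, and the Access Guardian is modelled as a simple range check on the address bus for one contiguous block of physical memory; a real unit would be implemented in hardware.

```python
RESTART_OPCODE = 0xFF  # hypothetical single-element restart instruction, no operands


def fill_unused(image: bytearray, used: range) -> None:
    """Fill every physical location outside the used area with restart opcodes."""
    for address in range(len(image)):
        if address not in used:
            image[address] = RESTART_OPCODE


class AccessGuardian:
    """Monitors bus addresses and flags access to non-existent memory."""

    def __init__(self, physical: range) -> None:
        self.physical = physical

    def detects(self, bus_address: int) -> bool:
        return bus_address not in self.physical


image = bytearray(16)            # sixteen physical locations, four of them used
fill_unused(image, used=range(0, 4))
guardian = AccessGuardian(physical=range(0, 16))
print(image.hex())
print(guardian.detects(0x20), guardian.detects(0x05))
```

Because the restart instruction occupies a single element, erroneous execution entering the filled region at any address meets a restart opcode immediately.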
Secondly, within the used area, detection mechanisms are inserted within the 
software at invalid branch targets. Some manipulation of the application software 
may be necessary so that its function is not disturbed by the placement of detection
mechanisms. Construction principles for the detection mechanisms and an
algorithm for their placement within the application program are presented. An important
limitation of the technique is that of placement deadlock. This describes a
situation where an invalid branch destination cannot be covered by a detection mech-
anism placement because the generator and destination of the invalid branch are both 
within the same instruction, or the destination resides at the location immediately 
following the generating instruction. The former can generate a catastrophic failure 
if an infinite execution loop is created. 
A software tool, called the Post-programming Automated Recovery UTility 
(PARUT), has been developed as a prototype in order to assess the capability of 
the software implemented fault tolerant technique and to assess the feasibility of 
developing a standard software tool to apply the technique. The structure and orga-
nization of the prototype is described. The tool is designed to be robust, capable of 
generating enhanced program code for a variety of target processors. The future of 
PARUT appears to be its incorporation, as a processing option, within a translator.
This is because PARUT uses much of the information inherently required within the 
translation process, and its phase of activity immediately follows translation. 
10.5. Evaluating Fault Tolerance 
The dynamics of erroneous execution can only be evaluated through instruction 
sequence analysis which involves tracing the execution attributed with each initiated 
period of erroneous behaviour. 
The effectiveness of the software implemented fault tolerant technique proposed 
in this thesis is evaluated by investigating the erroneous behaviour of application 
software before and after application of the technique. Erroneous microprocessor be-
haviour is investigated using fault emulation and fault injection experiments. The 
fault injection experiment involves physically inserting faults on the address bus,
data bus, and program counter during instruction and data cycles. The fault emulation
experiment involves inserting faults within a register model of a processor.
Almost 4000 faults are investigated for a selection of three programs, each with a 
different application processor. 
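The idea of fault emulation on a register model can be illustrated with a small Python sketch that applies single-bit faults to a program counter value. The single-bit fault model, the register width, and the function names are assumptions for this example; they do not describe the experimental apparatus used in this thesis.

```python
def flip_bit(value: int, bit: int, width: int = 16) -> int:
    """Return value with one bit inverted, within a register of the given width."""
    if not 0 <= bit < width:
        raise ValueError("bit position lies outside the register")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def single_bit_targets(program_counter: int, width: int = 16) -> list[int]:
    """Every erroneous jump target reachable by corrupting one bit of the counter."""
    return [flip_bit(program_counter, bit, width) for bit in range(width)]


print(hex(flip_bit(0x1000, 0)))            # fault in the least significant bit
print(single_bit_targets(0x2, width=4))    # all targets for a 4-bit counter
```

Each target can then be traced through the program to observe whether erroneous execution is detected, re-synchronized, or leads to failure.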
Improved performance is observed in the processor systems when they employ the 
software implemented fault tolerant technique. The degree of improvement is related 
to the number of detection mechanism placements in the application software. The 
effectiveness of the software technique is clearly demonstrated. The memory overhead
and performance of the software technique are, however, application specific. Within
the example programs, the application of the technique required an approximate 
software extension of 20% to 30% for 16-bit and 32-bit microprocessors. 
10.6. Generating Non-Hazardous Software 
Many microprocessor systems utilize application code written in a high level lan-
guage which is independent of the target processor architecture, e.g. the programming 
language C. In these instances a translator is used to convert the source code through 
levels of abstraction to the object code. This process can be influenced so that target 
code is produced without the hazards associated with catastrophic failure, and with 
a high inherent detection capability against erroneous execution. 
Five tasks of the translation process are identified which can influence the production
of reliable code. Firstly, opcodes with more than one addressable memory element
can be hazardous on a mis-aligned opcode access. The selection of opcodes during
translation should avoid identified hazardous opcodes, their function being implemented
by other equivalent instructions. Secondly, the addressing mode selected for an
instruction should not generate hazardous operands. Thirdly, macros used to implement
high level language constructs should not incorporate instruction sequences
plement high level language constructs should not incorporate instruction sequences 
with a hazard. Fourthly, peephole optimization should not create new code hazards. 
Finally, address space allocation for the object code should not introduce hazards, 
for instance through relative address operands. 
These translator proposals have not been implemented. An alternative method of 
generating non-hazardous software is to design a microprocessor architecture which 
inherently defines code with a detection capability against erroneous execution. 
10.7. Microprocessor Design for Fault Tolerance 
Fault tolerant techniques implemented as additional hardware circuitry, with or 
without software manipulation, can be incorporated within the architecture of a 
microprocessor. Many modern microprocessors incorporate a mis-aligned memory 
access exception, and other prototype processors implement signature analysis. The 
techniques proposed in this thesis concerning software implemented fault tolerance 
can also be embedded within the design of a microprocessor. These techniques use 
an Access Guardian to detect all unused area access, and software detection mech-
anisms to detect invalid branches regardless of their re-synchronization. Additional 
techniques involve influencing the design of the processor instruction opcode map 
and operand requirements, reserved input/output ports, and memory organization. 
10.8. Summary 
Microprocessor systems are incorporated within many industrial control systems. 
Such applications are often required to be highly reliable. Working environments can, 
however, be harsh and microprocessor systems are prone to disruption from transient 
disturbances. It is therefore necessary to apply fault tolerance to the microprocessor 
system in order to improve its reliability. The sophistication of the fault tolerance 
may be limited by budget constraints which prevent fault masking. 
The solution is the application of capability checks within a uniprocessor con-
troller. The capability checks identify particular characteristics of erroneous pro-
cessor behaviour and initiate recovery. This thesis models erroneous microprocessor 
behaviour and proposes a new low-cost software-implemented capability check involv-
ing the recognition of invalid branches. The effectiveness of applying the capability 
check is demonstrated; however, a general assessment cannot be made because the 
technique's action is application specific. The error detection capability can be further
improved by strategically selecting several capability checks for collective application. 
It should be realized that these techniques cannot guarantee enhanced reliability be-
cause they are reliant on particular attributes of erroneous execution being exhibited. 
It is pertinent for the reliability engineer to incorporate a back-up detection facility 
into the system design, such as a watchdog timer as well as a fail-safe action, in order 






[British Telecom, 1986] 
British Telecom, Handbook of Reliability Data for Components Used in
Telecommunication Systems (HRD4), 1986.
[Fontaine & Barrand, 1989] 
Fontaine, A.B. & Barrand, I., 80286 and 80386 Microprocessors : New PC
Architectures., Macmillan Education, 1989. 
[Intel, 1982] 
Intel, 8086/8088/8087/80186/80188 Programmer's Pocket Reference Guide., 
Intel Corporation, 1982. 
[Intel, 1988] 
Intel, 80386 Programmer's Reference Manual, Intel Corporation, 1988. 
[Johnson, 1989] 
Johnson, M. (ed), Am29000 Users Manual, Advanced Micro Devices, 1989. 
[Klaassen & van Peppen, 1989] 
Klaassen, K.B. & van Peppen, J.C.L., System Reliability : Concepts and Appli-
cations., Hodder & Stoughton (Edward Arnold Division), 1989. 
[Lewis, 1987] 
Lewis, E.E., System Safety Analysis : Human Error., in 'Introduction to Relia-
bility Engineering', John Wiley & Sons, New York, 1987. 
[Motorola, 1983] 
Motorola, M6805 HMOS, M146805 CMOS Family : Microcomputer/Microprocessor
User's Manual, Motorola Inc., 1983.
[Motorola, 1984] 
Motorola, M68000 16/32-Bit Microprocessor : Programmer's Reference Manual, 
Motorola Inc., 1984. 
[Motorola, 1987] 




[Aho & Ullman, 1977] 
Aho A.V. & Ullman, J.D., Principles of Compiler Design., Addison-Wesley Pub-
lishing Company, 1977. 
[Amerasekera & Campbell, 1987] 
Amerasekera, E.A., & Campbell, D.S., Failure Mechanisms in Semiconductor 
Devices., John Wiley & Sons, 1987. 
[Anderson & Lee, 1982] 
Anderson, T., & Lee, P.A., Fault Tolerant Terminology Proposals., Proc. FTCS-
12, 1982, pp 29-33. 
[Arlat et al, 1990] 
Arlat, J., Aguera, M., Amat, L., Crouzet, Y., Fabre, J.C., Laprie, J.C., Martins, 
E., & Powell, D., Fault Injection for Dependability Validation : A Methodology 
and Some Applications., IEEE Trans. Soft. Engineering, Vol. 16, No. 2, 1990, 
pp 166-181. 
[Armstrong & Devlin, 1981] 
Armstrong, J.R., & Devlin, D.E., GSP - A Logic Simulator for LSI., Proc. 18th
Annual Automation Conf., 1981, pp 518-524. 
[Avizienis, 1976] 
Avizienis, A., Fault Tolerant Systems., IEEE Trans. Computing, 1976, pp 1304-
1311. 
[Ball & Hardie, 1969] 
Ball, M., & Hardie, F., Effects and Detection of Intermittent Failures in Digital
Systems., AFIPS Proc. Fall Joint Computer Conf., Vol. 35, 1969, pp 329-335. 
[Bennetts, 1979] 
Bennetts, R.G., Reliability of Engineering Systems., Lecture Notes (University of 
Southampton, England.), 1979. 
[Bhar & McMahon, 1983] 
Bhar, T.N., & McMahon, E.J., Electronic Discharge Control., Hayden Book Com-
pany, 1983. 
[Blough & Masson, 1987] 
Blough, M.B. & Masson, G.M., Performance Analysis of a Generalized Upset
Detection Procedure., FTCS-17, 1987, pp 218-223. 
[Bourne, 1982] 
Bourne, S.R., The UNIX System., Addison-Wesley Publishing Company, 1982. 
[Carpenter, 1989] 
Carpenter, G.F., Database for Investigating the Effects of Induced Faults., Micro-
processors and Microsystems, Vol. 13, No. 10, 1989, pp 627-636. 
[Carter, 1985] 
Carter, W.C., Hardware Fault Tolerance., Resilient Computer Systems (ed. An-
derson, T.), Collins Publishers, 1985, pp 11-63. 
[Chakraborty & Ghosh, 1988] 
Chakraborty, T., & Ghosh, S., On Behaviour Fault Modelling for Combinational 
Digital Designs., Proc. Int. Test Conf., 1988, pp 593-600. 
[Chen & Avizienis, 1978] 
Chen, L., & Avizienis, A., N- Version Programming : A Fault Tolerant Approach 
to Reliability of Software., Proc. Int. Symp. Fault-Tolerant Computing, 1978, 
pp 3-9. 
[Chillarege & Bowen, 1989] 
Chillarege, R., & Bowen, N.S., Understanding Large System Failures - A Fault 
Injection Experiment., Int. Symp. on Fault Tolerant Computing, 1989, pp 355-363.
[Choi et al, 1989] 
Choi, G., Iyer, R., Saleh, R., & Carreno, V., A Fault Behaviour Model for an 
Avionic Microprocessor : A Case Study., Proc. Int. Working Conf. on Depend-
able Computing for Critical Applications, 1989, pp 71-77. 
[Ciminiera & Valemzano, 1987] 
Ciminiera, L. & Valemzano, A., Advanced Microprocessor Architectures., Addison-
Wesley Publishing Company, 1987. 
[Clark et al, 1987] 
Clark, A.S., Jenkins, K., & Lipscombe, J.A., Low-Cost Electronic District Con-
trol., Institution of Gas Engineers, 53rd Autumn Meeting, Communication 1353, 
1987. 
[Connet et al, 1972] 
Connet, J.R., Pasternak, E.J., Wagner, B.D., Software Defences in Real-Time 
Control Systems., Dig. Int. Symp. on Fault Tolerant Computing, 1972, pp 
94-99. 
[Cortes et al, 1986] 
Cortes, M.L., McCluskey, E.J., Wagner, K.D., & Lu, D.J., Modelling Power 
Supply Disturbances in Digital Circuits., IEEE Int. Solid State Circuit Conf. 
(Anahem, C.A.), 1986, pp 164-165. 
[Crouzet & Decouty, 1982] 
Crouzet, Y., & Decouty, B., Measurement of Fault Detection Mechanisms Effi-
ciency : Results., Proc. FTCS-12, 1982, pp 373-376. 
[Cullyer, 1988] 
Cullyer, W.J., Implementing High Integrity Systems : the VIPER microproces-
sor., IEEE Computer Assurance Conference (COMPASS), 1988, pp 56-66. 
[Cusick et al, 1985] 
Cusick, J., Koga, R., Kolasinski, W.A., & King, C., SEU Vulnerability of the
Zilog Z-80 and NSC-800 Microprocessors., IEEE Trans. Nuclear Science, Vol. 
32, No. 6, 1985, pp 4206-4211. 
[Czeck & Siewiorek, 1990]
Czeck, E. W., & Siewiorek, D.P., Effect of Transient Gate-Level Faults on Pro-
gram Behaviour., FTCS-20, 1990, pp 236-243. 
[Damm, 1988] 
Damm, A., Experimental Evaluation of Error Detection and Self-Checking Cover-
age of Components of a Distributed Real-Time System., Ph.D. Thesis, Technical 
University of Vienna, Austria, 1988. 
[Dijkstra, 1972] 
Dijkstra, Notes on Structured Programming., 'Structured Programming', Aca-
demic Press Inc., New York, 1972. 
[Duba & Iyer, 1988]
Duba, P., & Iyer, P.K., Transient Fault Behaviour of a Microprocessor : A Case
Study., Proc IEEE/ICCD, 1988, pp 272-276. 
[Eckhardt & Lee, 1988]
Eckhardt, D.E. & Lee, L.D., Fundamental Differences in the Reliability of
N-Modular Redundancy and N-Version Programming., Journal of Systems and
Software, Vol. 8, 1988, pp 313-318.
[Eifert & Schuette, 1984]
Eifert, J.B., & Schuette, J.P., Processor Monitoring Using Asynchronous
Signature Instruction Streams., Proc. FTCS, 1984, pp 394-399.
[Ferrara et al, 1989] 
Ferrara, K.C., Keene, S.J., & Lane, C., Software Reliability from a System
Perspective., Proc. Annual Reliability and Maintainability Symposium, IEEE 1989,
pp 332-336. 
[Freer, 1987] 
Freer, J.R., System Design with Advanced Microprocessors., Pitman Publishers, 
1987. 
[Fujimura, 1989] 
Fujimura, N., Software Productivity in Built-in Microprocessors., Microprocessing 
and Microprogramming, Vol. 28, 1989, pp 169-172.
[Fuller & Harbison, 1978]
Fuller, S.H. & Harbison, S.P., The C.mmp Multiprocessor., Technical Report
CMU-CS-78-146, Carnegie-Mellon University, October 1978. 
[Glaser & Masson, 1982]
Glaser, R.E. & Masson, G.M., The Containment Set Approach to Crash-Proof
Microprocessor Controller Design., IEEE Trans. Computers, Vol. 31, No. 7, 
1982, pp 689-692. 
[Gunneflo et al, 1989] 
Gunneflo, U., Karlsson, J., & Torin, J., Evaluation of Error Detection Schemes
Using Fault Injection by Heavy-Ion Radiation., Int. Symp. on Fault Tolerant 
Computing, 1989, pp 340-347. 
[Halse, 1984] 
Halse, R.G., Software Fault Tolerance for Small Digital Controllers., Ph.D. The-
sis, University of Durham (England), 1984. 
[Halse & Preece, 1985]
Halse, R.G., & Preece, C., Erroneous Execution and Recovery in Microprocessor
Systems., Software and Microsystems, Vol. 4, No. 3., 1985, pp 63-70. 
[Halse & Preece, 1987]
Halse, R.G., & Preece, C., Recovery Assessment After Microprocessor System
Transient Disturbance., System Fault Diagnosis: Reliability and Related 
Knowledge-Based Approaches, (ed. Tzafestas, S.), Reidell Publishing Company, 
Vol. 2, 1987, pp 38^-397. 
[Hennessy & Patterson, 1990]
Hennessy, J.L. & Patterson, D.A., Computer Architecture : A Quantitative
Approach., Morgan-Kaufmann Publishers Inc., 1990.
[Hitt & Webb, 1985]
Hitt, E.F. & Webb, J.J., A Fault Tolerant Software Strategy for Digital Systems.,
Proc. AIAA/IEEE 6th Digital Avionics Systems Conference, 1985, pp 211-216. 
[Horowitz & Hill, 1986]
Horowitz, & Hill, Eliminating Interference., 'The Art of Electronics', Cambridge
University Press, 1986, p.307. 
[IBM, 1986] 
IBM, Microprocessor Monitor and Reset Circuitry., IBM Technical Disclosure 
Bulletin (USA), Vol. 29, No. 2, 1986, p 611. 
[IEEE, 1989] 
IEEE, Logic Comes to the Rescue., 'Faults and Failures', IEEE Spectrum, April 
1989, p 18.
[IEEE, 1990] 
IEEE, What About Those Intel i486 Bugs?., 'Faults and Failures', IEEE
Spectrum, April 1990, pp 8-9.
[Iyer & Rossetti, 1982]
Iyer, R.K. & Rossetti, D.J., A Statistical Dependency of CPU Errors at SLAC,
FTCS-12 (Santa Monica, CA.), 1982, pp 363-372. 
[Iyer & Rossetti, 1986]
Iyer, R.K. & Rossetti, D.J., A Measurement Based Model for Workload
Dependence of CPU Errors., IEEE Trans. Computers, Vol. 35, No. 6, 1986, pp
511-519. 
[Kim & Iyer, 1988]
Kim, S., & Iyer, R.K., Impact of Device Level faults in a Digital Avionic
Processor., AIAA/IEEE 8th Digital Avionic Systems Conf., 1988, pp 428-435.
[Kopetz, 1982] 
Kopetz, H., The Failure-Fault (FF) Model, Proc. FTCS-12, 1982, pp 14-17. 
[Lala, 1983] 
Lala, J., Fault Detection and Reconfiguration in FTMP : Methods and
Experimental Results., Proc. 5th IEEE/AIAA Digital Avionics Systems Conf. (DASC),
1983, pp 21.3.1-21.3.9. 
[Lala, 1985] 
Lala, P.K., Fault Tolerant and Fault Testable Hardware Design., Prentice-Hall, 
1985. 
[Leveugle et al, 1990] 
Leveugle, R., Michel, T. & Saucier, G., Design of Microprocessor with Built-in 
On-line Test, Proc. FTCS, 1990, pp 450-456. 
[Li et al, 1984] 
Li, K.W., Armstrong, J.R., & Tront, J.G., An HDL Simulation of the Effects
of Single Event Upsets on Microprocessor Program Flow., IEEE Trans. Nuclear 
Science, Vol. 31, No. 6, 1984, pp 1139-1144. 
[Lin, 1988] 
Lin, T-T, K., Design and Evaluation of an On-Line Predictive Diagnostic Sys-
tem., Ph.D. Dissertation, Carnegie-Mellon University, Pittsburgh, PA, 1988. 
[Lin & Siewiorek, 1990] 
Lin, T-T. Y. & Siewiorek, D.P., Error Log Analysis : Statistical Modelling and 
Heuristic Trend Analysis., IEEE Trans. Reliability, Vol. 39, No. 4, 1990, pp 
419-432. 
[Littlewood, 1981] 
Littlewood, B., Stochastic Reliability Growth : A Model for Fault Removal in 
Computer-Programs and Hardware-Design., IEEE Trans. Reliability, Vol. 30, 
No. 4, 1981, pp 313-320. 
[Liu, 1982] 
Liu, B., Soft Failure Detection and Correction in Microprocessor Characteriza-
tion., Proc. FTCS-12, 1982, pp 458-460. 
[Liu & Whalen, 1988] 
Liu, K., & Whalen, J.J., Electromagnetic Interference in CMOS Circuits., IEEE 
Int. Symp. On Electromagnetic Compatibility, 1988, pp 471-472. 
[Lomelino & Iyer, 1986] 
Lomelino, D., & Iyer, R., Error Propagation in a Digital Avionic Processor : A
Simulation Based Study., NASA CR-176501, Univ. of Illinois, 1986. 
[Lu, 1980] 
Lu, D.J., Watchdog Processors and VLSI., Proc. Nat. Electronics Conf. (Chicago), 
Vol. 34, 1980, pp 240-245. 
[Lu, 1982] 
Lu, D.J., Watchdog Processors and Structural Integrity Checking., IEEE Trans. 
Computers, Vol. 31, No. 7, 1982, pp 240-245. 
[Madeira et al, 1990] 
Madeira, H., Quadros, G., & Silva, J.G., Experimental Evaluation of a Set of 
Simple Error Detection Mechanisms., Microprocessing and Microprogramming, 
Vol. 30, 1990, pp 513-520. 
[Mahmood & McCluskey, 1985] 
Mahmood, A., & McCluskey, E.J., Watchdog Processors : Error Coverage and 
Overhead., Proc. 15th Fault Tolerant Computing Symp., 1985, pp 214-219. 
[Mahmood & McCluskey, 1988] 
Mahmood, A., & McCluskey, E.J., Concurrent Error Detection Using Watchdog 
Processors - A Survey., IEEE Trans. Computers, Vol. 37, No. 2, 1988, pp 
160-174. 
[Mahmood et al, 1983] 
Mahmood, A., McCluskey, E.J., & Lu, D.J., Concurrent Fault Detection Using 
Watchdog Processors and Assertions., IEEE Proc. Int. Test. Conf., 1983, pp 
622-628. 
[Marchal & Courtois, 1982] 
Marchal, P. & Courtois, B., On Detecting the Hardware Failures Disrupting
Programs in Microprocessors., FTCS-12, 1982, pp 249-256.
[McCluskey & Wakerly, 1981] 
McCluskey, E.J. & Wakerly, J.F., A Circuit for Detecting and Analysing
Temporary Faults., Proc. COMPCON, 1981, pp 317-321.
[McConnel, 1981] 
McConnel, S.R., Analysis and Modelling of Transient Errors in Digital Comput-
ers., Ph.D. Dissertation, Carnegie-Mellon University, Pittsburgh, PA, 1981. 
[McConnel & Siewiorek, 1978] 
McConnel, S.R. & Siewiorek, D.P., C.vmp : The Implementation, Performance, 
and Reliability of a Fault Tolerant Multiprocessor., Interim Report, Carnegie-
Mellon University, Computer Science Dept., Pittsburgh, PA 15213, USA., March 
1978. 
[McGough & Swern, 1981]
McGough, J.G., & Swern, F., Measurement of Fault Latency in Digital Avionic 
Mini-Processor., NASA, Bendix Corp., Part I (Oct. 1981). 
[McGough & Swern, 1983] 
McGough, J.G., & Swern, F., Measurement of Fault Latency in Digital Avionic 
Mini-Processor., NASA, Bendix Corp., Part II (Jan 1983).
[Millman, 1979] 
Millman, J., Microelectronics : Digital and Analogue Circuits and Systems., 
McGraw-Hill Inc., 1979. 
[Morganti et al, 1978] 
Morganti, M., Coppadoro, G., & Ceru, S., UDET 7116 - Common Control for
PCM Telephone Exchange : Diagnostic Software Design and Availability Eval-




[Musa, 1975]
Musa, J.D., A Theory of Software Reliability and its Application., IEEE Trans.
Software Engineering, Vol. 1, No. 3, 1975, pp 312-327. 
[Musa et al, 1987] 
Musa, J.D., Iannino, A., & Okumoto, K., Software Reliability : Measurement,
Prediction, Application., McGraw-Hill, 1987. 
[Namjoo, 1982] 
Namjoo, M. Techniques for Concurrent Testing of VLSI Processor Operation., 
IEEE Int. Test Conf., 1982, pp 461-468. 
[Namjoo, 1983] 
Namjoo, M. CERBERUS-16: An Architecture for a General Purpose Watchdog 
Processor., Proc. FTCS-13, 1983, pp 216-219. 
[Namjoo & McCluskey, 1982]
Namjoo, M., & McCluskey, E.J., Watchdog Processors and Capability Checking.,
Proc. FTCS-12, 1982, pp 245-248. 
[O'Connor, 1985] 
O'Connor, D.T., Practical Reliability Engineering (Second Edition)., John Wiley
& Sons, 1985.
[Ornstein et al, 1975] 
Ornstein, S.M., Crowther, W.R., Kraley, M.F., Bressier, R.D., Michel, A., &
Heart, F.E., PLURIBUS - A Reliable Multiprocessor., AFIPS Int. Computer
Conf., 1975, pp 551-559.
[Pearson, 1983] 
Pearson, J.C., Reliability of Small Digital Controllers., PhD. Thesis, University 
of Durham (England), 1983. 
[Randall, 1975] 
Randall, B., System Structure for Software Fault Tolerance., IEEE Trans. Soft. 
Eng., Vol. 1., 1975, pp 220-232. 
[Rigby & Norris, 1990]
Rigby, P., & Norris, M., The Software Death Cycle., IEE UK IT Conference 1990,
University of Southampton, March 1990. 
[Russel & Sayers, 1989]
Russel, G. & Sayers, I.L., Advanced Simulation and Test Methodologies for VLSI
Design., Van Nostrand Reinhold, 1989. 
[Saxena & McCluskey, 1990]
Saxena, N.R., & McCluskey, E.J., Control-Flow Checking Using Watchdog Assists
and Extended-Precision Checksums., IEEE Trans. Computers, Vol. 39, No. 4, 
1990, pp 554-559. 
[Schmid et al, 1982] 
Schmid, M.E., Trapp, R.L., Davidoff, A.E., & Masson, G.M., Upset Exposure by
Means of Abstraction Verification., FTCS-12, 1982, pp 237-244. 
[Schuette & Shen, 1986] 
Schuette, M.A., & Shen, J.P., Processor Control Flow Monitoring Using Signa-
tured Instruction Streams., IEEE Trans. Computers, Vol. 36, No. 3, 1986, pp 
264-276. 
[Segall et al, 1988] 
Segall, Z., Vrsalovic, D., Siewiorek, D., Yaskin, D., Kownacki, J., Barton, J., 
Dancy, R., Robinson, A., & Lin, T., FIAT - Fault Tolerant Based Automated
Testing Environment., Proc. FTCS-18, 1988, pp 102-107. 
[Shen & Schuette, 1983] 
Shen, J.P., & Schuette, M.A., On-Line Monitoring Using Signatured Instruction 
Streams., IEEE Proc. Int. Test. Conf., 1983, pp 275-282. 
[Shoji, 1987] 
Shoji, M., CMOS Digital Circuit Technology., Prentice-Hall, 1987.
[Siewiorek & Swarz, 1982] 
Siewiorek, D.P. & Swarz, R.S., The Theory and Practice of Reliable Systems.,
Digital Press (Bedford, MA.), 1982. 
[Siewiorek et al, 1978a] 
Siewiorek, D.P., Kini, V., Masburn, H., McConnel, S., & Tsao, M., A Case Study
of C.mmp, Cm*, and C.vmp : Part 1 - Experiences with Fault Tolerance in 
Multiprocessor Systems., Proc. IEEE, Vol. 66, No. 10, 1978, pp 1178-1199. 
[Siewiorek et al, 1978b] 
Siewiorek, D.P., Kini, V., Joobbani, R., & Bellis, H., A Case Study of C.mmp, 
Cm*, and C.vmp : Part 2 - Predicting and Calibrating Reliability of Micropro-
cessor Systems., Proc. IEEE, Vol. 66, No. 10, 1978, pp 1200-1220. 
[Sommerville, 1985]
Sommerville, I., Software Engineering., Addison-Wesley Publishing Company,
1985. 
[Sosnowski, 1986a] 
Sosnowski, J., Evaluation of Transient Hazards in Microprocessor Controllers., 
Proc. FTCS-16, 1986, pp 364-369. 
[Sosnowski, 1986b] 
Sosnowski, J., Transient Fault Effects in Microprocessor Controllers., Reliability 
Technology - Theory and Applications, (eds. Moltoft, J., & Jensen, F.), North-
Holland Press, 1986, pp 329-348. 
[Stiffler, 1980] 
Stiffler, J.I., Robust Detection of Intermittent Faults., IEEE Proc. Int. Symp. on 
Fault Tolerant Computing, 1980, pp 216-218. 
[Tanenbaum, 1990] 
Tanenbaum, A.S., Structured Computer Organization., Prentice-Hall
International, 1990.
[Taylor et al, 1980] 
Taylor, D.J., Morgan, D.E., & Black, J.P., Redundancy in Data Structures :
Improving Software Fault Tolerance., IEEE Trans. Software Engineering, Vol. 6, 
No. 6, 1980, pp 585-594. 
[Trachtenberg, 1990]
Trachtenberg, M., A General Theory of Software Reliability Modelling., IEEE 
Trans. Reliability, Vol. 39, No. 1, 1990, pp 92-96. 
[Wilken & Shen, 1987]
Wilken, K.D., & Shen, J.P., Embedded Signature Monitoring Analysis and
Technique., IEEE Proc. Int. Test. Conf., 1987, pp 324-333.
[Wingate & Preece, 1991] [Copy in Appendix E]
Wingate, G.A.S. & Preece, C., Analysis of Failure Data Collected From a TMR
Microprocessor Controller., Microprocessing & Microprogramming, Vol. 32, 1991,
pp 861-868. 
[Woodbury & Shin, 1990]
Woodbury, M.H., & Shin, K.G., Measurement and Analysis of Workload Effects
on Fault Latency in Real-Time Systems., IEEE Trans. Soft. Engineering, Vol. 
16, No. 2, 1990, pp 212-216. 
[Wynne et al, 1988] 
Wynne, R.J., Parkinson, J.S., & Lees, A., The Design and Implementation of a
Control System for a Multi-Feed Gas Network., IEE Control Conference (Oxford), 
1988, pp 70-75. 
[Yang et al, 1985] 
Yang, X., York, G., Birmingham, W., & Siewiorek, D., Fault Recovery of
Triplicated Software on the Intel iAPX 432., Distributed Computing Systems, 1985,
pp 138-143. 
[Yau & Chen, 1980]
Yau, S.S., & Chen, F.C., An Approach to Concurrent Control Flow Checking.,
IEEE Trans. Soft. Eng., Vol. 6, No. 2, 1980, pp 126-137. 
Appendix A
INSTRUCTION SET PARAMETERS
A.1. Introduction 214
A.2. Instructions Influencing Program Flow 214
A.3. Microprocessor Jump Type Instruction Data 215
APPENDIX A
INSTRUCTION SET PARAMETERS
A.1. Introduction
The model of erroneous microprocessor behaviour described in Chapter 3 is based 
on the instruction mix of the processor software. The evaluation of the model for 
a selection of 8, 16, and 32-bit microprocessors, presented in Chapter 4, requires 
the mix of their respective instruction sets. This appendix contains details of the 
instruction set parameters used in Chapter 4. The microprocessors considered are: 
(8-bit) MC 6800, Intel 8048, and Intel 8085 ; (16-bit) Intel 8086, MC 68000, and MC 
68010 ; (32-bit) AMD 29000-D, MC 68020, and Intel 80386. The notations MC and 
AMD specify 'Motorola Corporation' and 'Advanced Micro Devices' respectively. 
Data for the 8-bit processors is taken from Halse [1984]. Data parameters for the
16 and 32-bit microprocessors have been evaluated from the appropriate manufacturers'
manuals listed in the bibliography.
A.2. Instructions Influencing Program Flow
Program flow through software is determined by the content of the microproces-
sor program counter. The instruction set of a microprocessor contains three types of 
instruction, each affecting the program counter content in a different way. Firstly, 
'non-jump' instructions are classified as those instructions that perform some
operation and increment the program counter to the next logical instruction location
in the address space. Secondly, 'unconditional jump' instructions load the program
counter with the address of the next instruction to be executed, which need not be
the next logical instruction in memory. Finally, 'conditional jump' instructions test
a specified parameter value and, if the test is successful, generate a branch like the
unconditional jump instruction. If the test is unsuccessful then the program counter
increments to the next logical instruction address like the non-jump instruction.
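The three instruction classes can be illustrated with a minimal sketch of how each one updates the program counter. The names `Kind` and `next_pc`, and the idea of passing the instruction length and branch target as arguments, are assumptions made for illustration only and do not come from the thesis.

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical labels for the three instruction classes described above."""
    NON_JUMP = "non-jump"
    UNCONDITIONAL_JUMP = "unconditional jump"
    CONDITIONAL_JUMP = "conditional jump"


def next_pc(pc, length, kind, target=None, test_passed=False):
    """Return the program counter content after one instruction.

    pc          -- address of the current instruction
    length      -- number of address locations the instruction occupies
    target      -- branch destination (needed for the jump classes)
    test_passed -- outcome of the test made by a conditional jump
    """
    takes_branch = (
        kind is Kind.UNCONDITIONAL_JUMP
        or (kind is Kind.CONDITIONAL_JUMP and test_passed)
    )
    if takes_branch:
        if target is None:
            raise ValueError("a taken branch needs a target address")
        return target
    # Non-jump, or conditional jump whose test failed: fall through.
    return pc + length
```

A conditional jump therefore behaves like an unconditional jump when its test succeeds and like a non-jump when it fails, which is the distinction the model of erroneous behaviour relies on.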
Microprocessor architectures implementing a ROM opcode decoder generate operations
for all possible instruction opcode formats, i.e. 2^n instructions where the
opcode has n bits. ROM decoders often have some redundancy; for instance the MC
68000, MC 68010, and MC 68020, where n = 16, have more ROM opcode values than
implemented operations and hence there is a large number of undeclared instructions.
The AMD Am29000 is a Reduced Instruction Set Computer (RISC) and implements
a ROM decoder for opcodes of 8 bits, leaving fewer unused opcode formats (undeclared
instructions). Processors not implementing a ROM decoder directly interpret the
opcode through digital circuitry (examples include the Intel 8048, Intel 8085, Intel
8086, and Intel 80386). These microprocessors have all their instructions hardwire
defined, although some instructions may appear undeclared because the manufacturer
withholds information. Undeclared instructions can be non-jumps, conditional
jumps, or unconditional jumps, and knowledge of these instructions may be of benefit
to the programmer. Tables A.1 to A.9 summarise details of the instruction sets for
a variety of processors.
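The relationship between opcode width and undeclared instructions can be sketched as simple arithmetic. The function names and the declared-instruction count used in the usage note are hypothetical; only the opcode widths (16 bits for the MC 68000 family, 8 bits for the Am29000) come from the text.

```python
def opcode_space(n_bits):
    """Number of distinct opcode formats for an n-bit opcode (2^n)."""
    if n_bits < 0:
        raise ValueError("opcode width cannot be negative")
    return 2 ** n_bits


def undeclared_count(n_bits, declared):
    """Opcode formats left over once the declared instructions are counted.

    'declared' is whatever count the manufacturer documents; it is supplied
    by the caller and is not taken from the thesis tables.
    """
    total = opcode_space(n_bits)
    if not 0 <= declared <= total:
        raise ValueError("declared count must lie between 0 and 2^n")
    return total - declared
```

For example, `opcode_space(16)` gives 65536 formats and `opcode_space(8)` gives 256, so a 16-bit opcode leaves far more room for undeclared instructions than an 8-bit one for any given declared count.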
A.3. Microprocessor Jump Type Instruction Data
A fundamental characteristic associated with the program flow during erroneous 
microprocessor behaviour is the 'erroneous jump'. Jump type instructions can be 
categorised in terms of the nature of their branch operation. Chapter 3 defines the
following categories of jump instruction:
Restart (RT) : Leads to a jump to a predefined location in the address space.
Return (RN) : Leads to a jump to an address held in a stack.
Stop/Wait (SW) : Leads to cessation of processing and requires an interrupt
or hardware reset to exit this state.
Unspecified Jump (UJ) : Leads to a jump to a new location in the address space,
determined by volatile memory content.
Deriving the distribution of restart, return, stop/wait, and unspecified jump 
instructions in an instruction set requires evaluation of all undeclared instructions. 
Within the selection of microprocessors analysed in this appendix, the MC 68000, 
MC 68010, AMD Am29000-D, MC 68020, and Intel 80386 specify their undeclared 
instructions to generate a restart. Investigation has shown that the undeclared in-
structions in the remaining processors can be restart, return, stop/wait, or unspecified 
jump [Halse, 1984]. 
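Deriving such a distribution amounts to classifying every jump type opcode, declared or undeclared, and counting each category. The sketch below assumes a hypothetical opcode map supplied by the caller; the names `JumpType` and `jump_distribution` are illustrative and the thesis does not prescribe this representation.

```python
from collections import Counter
from enum import Enum


class JumpType(Enum):
    """The four jump categories defined in Chapter 3."""
    RESTART = "RT"
    RETURN = "RN"
    STOP_WAIT = "SW"
    UNSPECIFIED_JUMP = "UJ"


def jump_distribution(opcode_map):
    """Return the fraction of jump type opcodes falling in each category.

    opcode_map maps each opcode value to a JumpType, or to None when the
    opcode is not a jump type instruction. Undeclared opcodes must be
    classified by the caller before they are passed in.
    """
    counts = Counter(jt for jt in opcode_map.values() if jt is not None)
    total = sum(counts.values())
    return {jt: (counts[jt] / total if total else 0.0) for jt in JumpType}
```

A processor whose undeclared instructions all generate a restart would simply contribute every undeclared opcode to the `RESTART` count.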
The distributions of the different jump type instructions for the MC 6800, Intel 8048,
Intel 8085, and Intel 8086 microprocessors are shown in Tables A.10 to A.13
respectively. Undeclared instructions with a jump type operation are identified in the
Tables by the mnemonic '***'.
The MC 68000, MC 68010, and MC 68020 instruction sets are upwardly compatible;
the distribution of their jump type instructions is collated in Table A.14.
Similarly, the jump type instruction distributions for the AMD Am29000 and Intel
80386 microprocessors are shown in Table A.15 and Table A.16 respectively.
[Tables A.1 to A.9: instruction set details for the processors considered. Tables A.10 to A.13: jump type instruction distributions for the MC 6800, Intel 8048, Intel 8085, and Intel 8086. Table A.14: MC 68000, MC 68010, and MC 68020. Table A.15: AMD Am29000. Table A.16: Intel 80386.]
Appendix B
THE DESIGN OF AN ACCESS GUARDIAN
B.1. Introduction 222
B.2. An Access Guardian Design 222
B.3. The Address Decoder 225
B.4. The Restart Generator 225
B.5. The Timer Unit 228
B.6. Design Simulation 231
B.7. The Design's Hardware Requirement 235
B.8. Summary 235
APPENDIX B
THE DESIGN OF AN ACCESS GUARDIAN
B.1. Introduction
This appendix describes the design of an Access Guardian proposed in Chapter
5. The Access Guardian detects whether or not invalid address lines are activated by
the microprocessor whose operation it is monitoring, and if so, impresses an interrupt
signal on the microprocessor. The design is validated through a gate-level simulation,
and the hardware requirement is listed. The topology of a microprocessor system
incorporating an Access Guardian is shown in Figure B.1.
B.2. An Access Guardian Design
The Access Guardian design presented here monitors a dedicated address bus for 
access outside a contiguous 16 MByte block of memory, and has a required inter-
rupt latency of ten clock cycles. Whilst particular microprocessor applications are 
expected to have a more complex Access Guardian specification, the requirements 
used here are sufficient to indicate design implications. 
The Access Guardian design is based on the interaction of three functional units: 
the 'address decoder', the 'restart generator', and the 'timer unit'. The general func-
tion of the Access Guardian is shown in Figure B.2. The 'address decoder' generates 
a signal when invalid address lines are activated. This signal is then processed with 
'timer unit' status information by the 'restart generator' to produce an interrupt sig-
nal for the application processor. The interrupt signal must exist slightly in excess of 
the microprocessor interrupt latency. The interrupt latency is the length of time an 
interrupt must exist to guarantee processing by the microprocessor. Assuming the 
interrupt is given highest priority the processor will detect it following the execution 
of the present instruction. The interrupt signal must therefore be just longer than 
the longest execution time required by any instruction. 
B.3. The Address Decoder 
The 'address decoder' determines whether or not invalid address lines have been
activated. An example decoder is shown in Figure B.3. for a Motorola 68000 address
bus (address lines 'A0', 'A1', ..., 'A23' specifying a 16 MByte address space) where only
the least significant 16 MByte of memory is used. A simple OR function for the
four most significant address lines ('A23', 'A22', 'A21', and 'A20') determines an invalid
access and generates a signal 'A'. The 'address decoder', however, will be more
complex if sections of the used area are dispersed across the address space, or if the
address bus is multiplexed with the data bus as in the Intel 8086 microprocessor.
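The OR function described above can be sketched in software as follows. This is only an illustrative model, not part of the original design: the function name `address_decoder` is an assumption, and the address is taken to be a plain 24-bit integer on lines A0 to A23. Note that an OR over A23 to A20 flags every address from 0x100000 upwards.

```python
def address_decoder(address: int) -> int:
    """Return signal 'A' (1 = invalid access) for a 24-bit address.

    Hypothetical software model of the example decoder: 'A' is the
    logical OR of the four most significant address lines A23..A20.
    """
    if not 0 <= address < (1 << 24):
        raise ValueError("address must fit on address lines A0..A23")
    a23 = (address >> 23) & 1
    a22 = (address >> 22) & 1
    a21 = (address >> 21) & 1
    a20 = (address >> 20) & 1
    return a23 | a22 | a21 | a20


if __name__ == "__main__":
    for addr in (0x000000, 0x0FFFFF, 0x100000, 0xFFFFFF):
        print(f"{addr:06X} -> A = {address_decoder(addr)}")
```

A decoder for a dispersed used area, or for a multiplexed bus, would need a more elaborate predicate than this single OR.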
B.4. The Restart Generator 
The 'restart generator' consists of control logic driving a Set-Reset flip-flop (SRFF).
The control circuitry determines the logic values for the SRFF depending on the Access
Guardian operating conditions. The controller takes the input 'A' (from the
address decoder), the manual reset line ('MRESET'), and the feedback signal 'FB'
(from the timer). The SRFF has inputs 'S' and 'R', and output 'Q'. Table B.1. shows
the truth table for the control logic to drive the SRFF. The following expressions for
'S' and 'R' are developed from the truth table.
$S = A \cdot \overline{MRESET}$ (B.1.)
$R = MRESET + FB \cdot \overline{A} \cdot \overline{MRESET}$ (B.2.)
Applying De Morgan's theorem to equation (B.1.) yields
$S = \overline{\overline{A \cdot \overline{MRESET}}}$ (B.3.)
$S = \overline{\overline{A} + MRESET}$ (B.4.)




A  MRESET  FB  S  R
0  0       0   0  0
0  0       1   0  1
0  1       0   0  1
0  1       1   0  1
1  0       0   1  0
1  0       1   1  0
1  1       0   0  1
1  1       1   0  1
Table B.1.: Truth Table for SRFF Control Logic in Restart Generator.
Sn  Rn  Qn+1
0   0   Qn
0   1   0
1   0   1
1   1   ?
Table B.2.: Set-Reset Flip Flop Transition Table
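$R = \overline{\overline{MRESET + FB \cdot \overline{A} \cdot \overline{MRESET}}}$ (B.5.)
$R = \overline{\overline{MRESET} \cdot \overline{FB \cdot \overline{A} \cdot \overline{MRESET}}}$ (B.6.)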
The SRFF is enabled by an input clock signal ('CLK'), and its output 'Q' is dependent
upon the input control signals 'S' and 'R'. The relationship between 'Q', 'S', and
'R' is shown in the transition table, Table B.2. The output 'Q' is set high
when an invalid address is decoded and there is no manual reset and no feedback
signal indicating the continuing processing of a previously identified invalid address
line activation. The output 'Q' remains at the same logic value after being set. The
SRFF control circuitry resets the output 'Q' low when either the manual reset
is asserted, or the feedback signals that the Access Guardian has completed processing and
there is no current address line discrepancy detected by the address decoder. The
logic design for the 'restart generator' implementing equations (B.3.) and (B.5.) to
drive the SRFF is shown in Figure B.4.
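As a cross-check of the control logic, the short sketch below compares three views of it: the rows of Table B.1, a sum-of-products reading of the table (S = A AND NOT MRESET, R = MRESET OR (FB AND NOT A AND NOT MRESET)), and a NOR/NAND gate form of the kind produced by De Morgan's theorem. The function names and the particular gate arrangement are assumptions made for illustration, and the flip-flop model simply follows Table B.2.

```python
from itertools import product


def inv(x: int) -> int:
    return 1 - x


def nand(*xs: int) -> int:
    return int(not all(xs))


def nor(*xs: int) -> int:
    return int(not any(xs))


# Table B.1 keyed by (A, MRESET, FB) -> (S, R)
TRUTH_TABLE = {
    (0, 0, 0): (0, 0), (0, 0, 1): (0, 1), (0, 1, 0): (0, 1), (0, 1, 1): (0, 1),
    (1, 0, 0): (1, 0), (1, 0, 1): (1, 0), (1, 1, 0): (0, 1), (1, 1, 1): (0, 1),
}


def control_sop(a: int, mreset: int, fb: int) -> tuple:
    """Sum-of-products reading of the truth table."""
    s = a & inv(mreset)
    r = mreset | (fb & inv(a) & inv(mreset))
    return s, r


def control_gates(a: int, mreset: int, fb: int) -> tuple:
    """Equivalent NOR/NAND form (assumed gate arrangement)."""
    s = nor(inv(a), mreset)
    r = nand(inv(mreset), nand(fb, inv(a), inv(mreset)))
    return s, r


def sr_next(s: int, r: int, q: int):
    """Next SRFF state per Table B.2; None marks the indeterminate S = R = 1 case."""
    if s and r:
        return None
    if s:
        return 1
    if r:
        return 0
    return q


if __name__ == "__main__":
    for inputs in product((0, 1), repeat=3):
        assert control_sop(*inputs) == control_gates(*inputs) == TRUTH_TABLE[inputs]
    print("both forms agree with Table B.1 for all eight input combinations")
```

Because 'S' and 'R' are never both high in Table B.1, the indeterminate row of the flip-flop transition table is never exercised by this control logic.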
B.5. The Timer Unit
A 'timer unit' is used to hold the interrupt signal for the required interrupt latency
period and is shown in Figure B.5. This unit takes as inputs the 'RESTART'
interrupt signal, the manual reset signal ('MRESET'), and the clock signal ('CLK').
The 'timer unit' generates a feedback signal ('FB') which is used by the 'restart
generator'. Initially the clock, restart interrupt, and inverse manual reset signal are
put through a logic AND gate to produce a control signal 'PULSE'. The 'PULSE'
line is used to drive, together with the clock signal 'CLK', the ripple counter. The
ripple counter consists of a series of master-slave JK flip-flops active on the negative
edge of the clock signal ('CLK'). The 'PULSE' signal provides the clocking signal
to the first master-slave JK flip-flop to generate an output 'Q1'. The output 'Q1' is
used as the clocking signal for the second master-slave JK flip-flop. This method of











The 'J' and 'K' inputs to the flip-flops are set high. A 'CLEAR' signal is used to
reset the 'Qn' outputs of the flip-flops to logic 0. The ripple counter architecture is
shown in Figure B.6. [Millman, 1979].
The ripple counter is set to count to a specific number 'Cn' by taking the flip-flop
outputs 'Q1', 'Q2', ..., 'Qn' and applying them to a NAND logic gate as required. The
example ripple counter is a base 10 counter, the NAND gate taking the binary inputs
representing decimal 10, 'Q2' and 'Q4'. The NAND gate generates the 'COUNT'
signal which indicates a necessary reset of the ripple counter. This signal line could
be connected directly to the 'CLEAR' line but there may be timing difficulties due
to the unequal internal delays within the ripple counter flip-flops. These timing
difficulties are removed by inserting a latch between the 'COUNT' and 'CLEAR'
lines.
The latch unit takes the additional input of the clock signal ( 'CLK') and the 
manual reset line ('MRESET'). The latch is now reset by the positive edge of the 
'COUNT' signal to set the ripple counter low. The ripple counter itself is clocked on 
the negative edge of the 'PULSE' signal. There are now no timing difficulties. The 
manual reset line ('MRESET') is used to set the ripple counter outputs low. 
The 'Q1', 'Q2', ..., 'Qn' outputs of the ripple counter are taken to a feedback unit
which consists of an AND logic gate. The outputs of 'Qn' represent the binary of (Cn - 1),
where the ripple counter is base Cn. In the example, the ripple counter outputs
representing decimal 9 ('Q1' and 'Q4') are used, see Figure B.6. The ripple counter
can, however, be extended as required to produce the interrupt signal 'RESTART' of 
necessary duration depending on the microprocessor interrupt latency. The feedback 
unit generates an automatic reset signal ('FB') to the 'restart generator'. 
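The counting and feedback behaviour of the example base 10 timer can be sketched behaviourally as below. This is a simplified model under stated assumptions: gate delays, the latch between 'COUNT' and 'CLEAR', and the clocking detail of the JK flip-flops are all ignored, and the function names are illustrative only.

```python
def counter_bits(count: int) -> tuple:
    """Return (Q1, Q2, Q3, Q4) with Q1 the least significant bit."""
    return tuple((count >> i) & 1 for i in range(4))


def step(count: int) -> int:
    """Advance the 4-bit counter by one 'PULSE' edge.

    The NAND of Q2 and Q4 ('COUNT') goes low when decimal 10 is decoded,
    at which point the counter is cleared (idealised as instantaneous).
    """
    count = (count + 1) & 0xF
    _, q2, _, q4 = counter_bits(count)
    count_signal = int(not (q2 and q4))
    if count_signal == 0:
        count = 0
    return count


def feedback(count: int) -> int:
    """'FB' is the AND of Q1 and Q4, i.e. high at decimal 9 (Cn - 1)."""
    q1, _, _, q4 = counter_bits(count)
    return q1 & q4


def pulses_until_feedback() -> int:
    """Count the 'PULSE' edges needed, from a cleared counter, to raise 'FB'."""
    count, pulses = 0, 0
    while not feedback(count):
        count = step(count)
        pulses += 1
    return pulses


if __name__ == "__main__":
    print("pulses before FB is raised:", pulses_until_feedback())
```

Extending the counter for a longer interrupt latency would amount to adding bits and choosing the decoded values for Cn and Cn - 1 accordingly.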
B.6. Design Simulation
The gate-level design for the Access Guardian was simulated using a CLASSIC
(a trademark of Plessey plc.) Gate Array Simulator. The digital circuit description
used as input to the simulator, and describing the design presented in this appendix,
is shown in Figure B.8. The output of the simulator is shown in Figure B.7. The




Figure B.7.: Access Guardian Timing Simulation (simulator waveforms for signals including A23, A22, MRESET, S, R, RESTART, COUNT and CLEAR)
UNITS - 10 pS
SUHH
ARRAYSIZE - 640
CIRCUIT
GUARDIAN[RESTART:A23,A22,A21,A20,CLK,MRESET:]
SUPPLIES/+VDD,-VSS
DECODER1:NOR4 [D1,A23,A22,A21,A20]
DECODER2:2INV [A,D2,D1,A]
RESETSR1:NOR2 [S,D2,MRESET]
RESETSR2:NAND3 [E1,,FB,D2,RESETBAR,-]
RESETSR3:NAND2 [R,RESETBAR,E1]
SR1FF1:NAND2 [S1,S,CLK]
SR1FF2:NAND2 [R1,R,CLK]
SR1FF3:2INV [S2,R2,S1,R1]
SR1FF4:NOR2 [SR1QBAR,S2,Q]
SR1FF5:NOR2 [Q,R2,SR1QBAR]
GATE2:NAND3 [PBAR1,PULSE,Q,CLK,RESETBAR,PBAR1]
JK1FF1:NAND3 [J11,,+,PULSE,JK1QBAR,-]
JK1FF2:NAND3 [K11,,+,PULSE,Q1,-]
JK1FF3:NAND2 [J12,J11,K12]
JK1FF4:NAND3 [K12,,K11,J12,CLEAR,-]
JK1FF5:NAND2 [J13,J12,PBAR1]
JK1FF6:NAND2 [K13,K12,PBAR1]
JK1FF7:NAND2 [Q1,J13,JK1QBAR]
JK1FF8:NAND2 [JK1QBAR,K13,Q1]
JK2FF1:NAND3 [J21,,+,Q1,JK2QBAR,-]
JK2FF2:NAND3 [K21,,+,Q1,Q2,-]
JK2FF3:NAND2 [J22,J21,K22]
JK2FF4:NAND3 [K22,PBAR2,K21,J22,CLEAR,Q1]
JK2FF5:NAND2 [J23,J22,PBAR2]
JK2FF6:NAND2 [K23,K22,PBAR2]
JK2FF7:NAND2 [Q2,J23,JK2QBAR]
JK2FF8:NAND2 [JK2QBAR,K23,Q2]
JK3FF1:NAND3 [J31,,+,Q2,JK3QBAR,-]
JK3FF2:NAND3 [K31,,+,Q2,Q3,-]
JK3FF3:NAND2 [J32,J31,K32]
JK3FF4:NAND3 [K32,PBAR3,K31,J32,CLEAR,Q2]
JK3FF5:NAND2 [J33,J32,PBAR3]
JK3FF6:NAND2 [K33,K32,PBAR3]
JK3FF7:NAND2 [Q3,J33,JK3QBAR]
JK3FF8:NAND2 [JK3QBAR,K33,Q3]
JK4FF1:NAND3 [J41,,+,Q3,JK4QBAR,-]
JK4FF2:NAND3 [K41,,+,Q3,Q4,-]
JK4FF3:NAND2 [J42,J41,K42]
JK4FF4:NAND3 [K42,PBAR4,K41,J42,CLEAR,Q3]
JK4FF5:NAND2 [J43,J42,PBAR4]
JK4FF6:NAND2 [K43,K42,PBAR4]
JK4FF7:NAND2 [Q4,J43,JK4QBAR]
JK4FF8:NAND2 [JK4QBAR,K43,Q4]
COUNTER:NAND2 [COUNT,Q2,Q4]
LATCH1:NAND2 [P,COUNT,RST]
LATCH2:NAND2 [RST,P,CLKBAR]
LATCH3:2INV [CLKBAR,,CLK,-]
RST1:NAND2 [CLEARBAR,RST,RESETBAR]
RST2:2INV [CLEAR,RESETBAR,CLEARBAR,MRESET]
FEEDBACK1:NAND2 [FB1,Q1,Q4]
FEEDBACK2:2INV [FB,,FB1,-]
RESULT:2INV [RESTART,QN,QN,Q]
END
Figure B.8.: Access Guardian Circuit Description
B.7. The Design's Hardware Requirement 
The hardware requirement for the Access Guardian design is shown in Table B.3.
The design specification requires only 60 logic gates, which can be implemented by
17 standard TTL IC parts.
B.8. Summary 
The Access Guardian proposed in Chapter 5 has been designed and its operation
verified through a gate-level simulation. The design implemented is simple; other
designs may be more appropriate in particular applications. The design chosen here
operates independently of the processor whose bus activity it is monitoring.
The hardware requirement of the Access Guardian design presented here for a 
contiguous 16 MByte of used memory, dedicated address bus, and ten cycle interrupt 
latency, can increase with more complex microprocessor systems. The complexity 
of the address decoder will increase with a non-contiguous used area of memory and 
multiplexed busses. The timer unit may be slightly more complicated with particular 

















































































































APPENDIX C
PARUT AND OTHER RELATED CODE LISTINGS
C.1. Introduction
C.2. PARUT Listing
C.3. Microprocessor Description File, MICRO_FILE
C.4. Target Software, CODE_FILE
C.5. Target Software with Fault Tolerance, RESULT_FILE
C.6. PARUT Report File, ANALYSIS_FILE
C.7. PARUT Diagnostics, TRACE_FILE

C.1. Introduction
This appendix is designed to be read in conjunction with Chapter 6, which describes
the Post-programming Automated Recovery Utility (PARUT). The appendix
holds enclosures of the PARUT program listing, and copies of typical input and
output files processed by PARUT during its operation. These files, referred to as
MICRO_FILE, CODE_FILE, RESULT_FILE, ANALYSIS_FILE, and TRACE_FILE
within Chapter 6, are presented respectively with covering notes.
C.2. PARUT Listing
The first enclosure in this appendix is that of the PARUT program. The utility
code is written in Pascal. The listing is annotated to aid comprehension of program
activity. Table C.1. details, in chronological order, the utility functions and pro-









































































































































C.3. Microprocessor Description File, MICRO_FILE
The following enclosure is a copy of MICRO_FILE, which describes the defined
instructions within the Motorola 68000 microprocessor instruction set. Similar versions
of MICRO_FILE can be designed for the Motorola 68010 and 68020 microprocessor
instruction sets.
The first line of the file contains the number of remaining lines in the file, in this
case 359. Each of the remaining lines of this file contains a 16 character code which
may or may not be followed by a comment string. For instance, consider the second
line of the MICRO_FILE example,
1100XX110000XXXX    ABCD
16 character code   comment string
The 16 character code represents a 16-bit opcode value, where the most significant
bit is leftmost (or the first character read). Characters '1', '0', and 'X' denote logic
values '1', '0', and 'don't care' (i.e. either logic value) respectively. The arbitrary
logic representation allows multiple opcode values to be represented by a single entry
in MICRO_FILE. This is particularly beneficial in the case of the Motorola 68000
microprocessor instruction set which has 43342 defined instructions. Comment strings
following character codes describe the instruction represented. Some instructions
require several character codes to describe their opcode values, in which case a comment
string is only attached to the first entry associated with that instruction.
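The don't-care matching described above can be sketched as follows. This is an illustrative sketch, not the PARUT implementation (which is written in Pascal): the function names are assumptions, and an entry is assumed to be a 16 character code optionally followed by whitespace and a comment.

```python
def parse_entry(line: str) -> tuple:
    """Split a MICRO_FILE-style line into (code, comment)."""
    code, comment = line[:16], line[16:].strip()
    if len(code) != 16 or set(code) - set("01X"):
        raise ValueError(f"not a 16 character code: {code!r}")
    return code, comment


def matches(code: str, opcode: int) -> bool:
    """True if the 16-bit opcode fits the code, treating 'X' as don't care."""
    if not 0 <= opcode <= 0xFFFF:
        raise ValueError("opcode must be a 16-bit value")
    bits = format(opcode, "016b")  # most significant bit first
    return all(c in ("X", b) for c, b in zip(code, bits))


def opcode_count(code: str) -> int:
    """Number of distinct opcode values one entry represents."""
    return 2 ** code.count("X")


if __name__ == "__main__":
    code, comment = parse_entry("1100XX110000XXXX ABCD")
    print(comment, "covers", opcode_count(code), "opcode values")
    print(matches(code, 0xC300), matches(code, 0xC000))
```

Each 'X' doubles the number of opcode values covered by an entry, which is why a few hundred lines can describe an instruction set with tens of thousands of defined opcodes.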
C.4. Target Software, CODE_FILE
As described in Chapter 6, CODE_FILE contains the target software to be processed
by PARUT. The software is presented to PARUT in a machine code representation.
The example of CODE_FILE in this enclosure is generated by the UNIX
'adb' facility.
CODE_FILE can be split into three sections. The first three lines of the file
contain redundant information which is ignored when the file is processed by PARUT.
The remaining two sections are each generated by an 'adb' command of the form,
<address>, <count> <request> <modifier>
where <address> specifies the location from which processing commences, <count>
specifies the number of consecutive locations to be processed, and <request>
specifies the output of the operation specified by <modifier>. Further details of
the UNIX 'adb' facility can be found in Bourne [1982].
The first of these sections contains a source code dump and is produced
by the command,
lstart,180?i
The second section contains information on the location of opcodes within the source
code and is generated by the command,
lstart,180?x
In both 'adb' commands, 'lstart' is a label in the source code denoting the start
of the information to be retrieved, '180' is the hexadecimal number of code lines
to be extracted, '?' specifies output to the file system file 'a.out' (later renamed
CODE_FILE), and 'i' and 'x' specify code dump and opcode information respectively.
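To make the field structure of these commands concrete, the sketch below splits a command string into address, count, request, and modifier. It is a hypothetical helper for illustration only and is not part of 'adb' or PARUT; following the text, the count is interpreted as hexadecimal.

```python
import re
from typing import NamedTuple


class AdbCommand(NamedTuple):
    address: str   # label or location where processing commences
    count: int     # number of consecutive locations
    request: str   # request character, e.g. '?'
    modifier: str  # format modifier, e.g. 'i' or 'x'


# Assumed shape: <address>,<count><request><modifier>
_COMMAND = re.compile(
    r"^(?P<address>[^,\s]+),\s*(?P<count>[0-9a-fA-F]+)\s*"
    r"(?P<request>[?/=])\s*(?P<modifier>\w+)$"
)


def parse_command(command: str) -> AdbCommand:
    """Split a command such as 'lstart,180?i' into its four fields."""
    match = _COMMAND.match(command.strip())
    if match is None:
        raise ValueError(f"unrecognised command: {command!r}")
    return AdbCommand(
        address=match["address"],
        count=int(match["count"], 16),  # the text gives the count in hexadecimal
        request=match["request"],
        modifier=match["modifier"],
    )


if __name__ == "__main__":
    print(parse_command("lstart,180?i"))
    print(parse_command("lstart,180?x"))
```

Under this reading, '180' corresponds to 384 consecutive locations in decimal.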
261 
f l i i i l l i n i l i i i ! p i i ! ! ! i i ! ! ! i ! I t f M s I 
• i H i f f l i i f f l i l i IMiiUMiM \ h \ \ m : ! 
_Bii5iffiifiifSJ|^ :illllllllllll„5iJ^ l 9 J 
3 us* I? 
M S 1 li 
1 1 13 
1 PlUiliiiiMit iililii in plliiiillliiipll I! ili!!!!ti!iiiir i f|s 
2". 2l | i 3 ;3 | ! i i f ; i ^ i i i m i m i i ; : 
!!S!!!5Si!K SHI i l s?if f fiisiiill s! 
• i r i m 
i l t iHiHifi j»»3ji '» ' i i i i i i i i iSi i i j insi i i imftm 
262 
C.5. Target Software with Fault Tolerance, RESULT_FILE
The application of software implemented fault tolerance can be complex. This 
enclosure contains an example of the enhanced code generated by PARUT when it 
applies fault tolerance to the target software. The format of the file is tailored to the 
development needs of PARUT. 
RESULT_FILE contains a listing representing the target machine code which
has five columns of information. The first column denotes the decimal address of
an opcode or operand. The second and third columns denote whether the location
content is an opcode ('true') or an operand ('false'), and the decimal value of that
content respectively. The fourth column, by default, is set to 'true'. Positions marked
'false' describe identified potential invalid branches that require resolving by the
insertion of a detection mechanism. Finally, the fifth column contains the relative
target destination displacement of jump related instructions which have a valid or
invalid interpretation. This column may also contain entries marked 'exception' which
describe the action of a seed within a placed detection mechanism.
Within the listing of RESULT_FILE, the locations of detection mechanisms
inserted within the target software are highlighted by a surrounding box, and potential
invalid branches have arrows drawn to emphasise their location and action.
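The column layout described above can be pictured as a small record type. The sketch below is illustrative only: it assumes a whitespace-separated line with the last column optional, which the text does not confirm, and the names ResultRow and parse_row are hypothetical rather than taken from PARUT.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResultRow:
    address: int                        # decimal address of the location
    is_opcode: bool                     # 'true' = opcode, 'false' = operand
    value: int                          # decimal value of the content
    branch_ok: bool                     # 'false' = potential invalid branch
    displacement: Optional[int] = None  # relative jump displacement, if any
    is_exception: bool = False          # last column marked 'exception'


def _parse_flag(token: str) -> bool:
    if token not in ("true", "false"):
        raise ValueError(f"expected 'true' or 'false', got {token!r}")
    return token == "true"


def parse_row(line: str) -> ResultRow:
    """Parse one row; the whitespace-separated layout is an assumption."""
    fields = line.split()
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 fields, got {len(fields)}")
    displacement = None
    is_exception = False
    if len(fields) == 5:
        if fields[4] == "exception":
            is_exception = True
        else:
            displacement = int(fields[4])
    return ResultRow(
        address=int(fields[0]),
        is_opcode=_parse_flag(fields[1]),
        value=int(fields[2]),
        branch_ok=_parse_flag(fields[3]),
        displacement=displacement,
        is_exception=is_exception,
    )
```

With this representation, the rows that require a detection mechanism are simply those whose branch_ok field is False.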
[RESULT_FILE listing: not recoverable from the scanned source.]
C.6. PARUT Report File, ANALYSIS-FILE
This enclosure contains the tables of information generated by PARUT when it
assesses the fault tolerance of target software. The first table provides a summary of
hazards posed by invalid branches and, where appropriate, details of
the fault tolerance achieved by applying a software implemented fault tolerant
technique. The second table collates information regarding the distribution and action
of jump related instructions during erroneous execution.
The information contained within ANALYSIS-FILE provides an indication of
the target software fault tolerance. In particular, it details the recovery capability
of erroneous jumps within target software. The recovery performance of the target
software is dependent on the instruction sequence, a dynamic process, and hence the
static analysis contained within the file has limited application. Chapter 7 describes







[ANALYSIS-FILE tables: values not recoverable from the scanned source. Two tables
survive in outline, headed ORIGINAL CODE and DETECTION MECHANISM PLACEMENT. Each
tabulates OPCODE and OPERAND entries under the column headings OUT, IN, ODD, WITHIN,
NO SEED and TOTAL. The rows are CHK, DIVS/DIVU, ILLEGAL, RESET, TRAP, TRAPV and
undefined; UNSPECIFIED JUMP (BCC/BRA/BSR, JMP, JSR); RETURN (RTE, RTR, RTS); and
STOP/WAIT (STOP), with a TOTAL row after each group.]
C.7. PARUT Diagnostics, TRACE_FILE
The final enclosure of this appendix contains an example of TRACE_FILE. This 
file is generated when the 'diagnostic' facility within PARUT is activated. The 
file contains, in chronological order, a list of procedures and functions operated by 
PARUT. Indented entries in the file denote nested module calls. TRACE_FILE is 
intended to aid analyst/programmer comprehension of the PARUT function. Exam-
ination of this file should be made in association with Chapter 6. 
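The indentation convention described above, in which nested module calls appear as indented entries in chronological order, can be sketched as follows. This is an illustration of the idea only, not PARUT's own code; the Tracer class and the three traced function names are hypothetical.

```python
import functools


class Tracer:
    """Records called functions in order, indenting nested calls."""

    def __init__(self, indent: str = "  "):
        self.lines = []
        self._depth = 0
        self._indent = indent

    def traced(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Log the entry at the current nesting depth, then go one deeper.
            self.lines.append(self._indent * self._depth + func.__name__)
            self._depth += 1
            try:
                return func(*args, **kwargs)
            finally:
                self._depth -= 1
        return wrapper


tracer = Tracer()


@tracer.traced
def read_code_file():          # hypothetical module name
    pass


@tracer.traced
def place_detection_mechanisms():  # hypothetical module name
    pass


@tracer.traced
def analyse_target():          # hypothetical top-level module
    read_code_file()
    place_detection_mechanisms()


analyse_target()
print("\n".join(tracer.lines))
```

Here the two inner calls are logged one level deeper than analyse_target, mirroring the way indented entries in TRACE_FILE denote nested module calls.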
[TRACE_FILE listing: not recoverable from the scanned source.]
Appendix D
EXAMPLE PROGRAMS
D.1. Introduction
D.2. Program 'A' Targeting the Motorola 68000 Microprocessor
D.3. Program 'B' Targeting the Motorola 68(7)05 Microprocessor
D.4. Program 'C' Targeting the Intel 80386 Microprocessor
APPENDIX D
EXAMPLE PROGRAMS
D.1. Introduction
This appendix has three main enclosures, each containing an example of the
application of the software implemented fault tolerant technique proposed in this thesis.
Each enclosure has a suite of program listings, each suite having a different target
processor for program implementation. The first and second enclosures each contain three
program code listings, concerning Program A and Program B respectively. In each of
these enclosures, the first listing details the original code written in Assembler. The
second and third listings show the insertion, at Assembly code level, of the detection
mechanisms planted by the software implemented fault tolerant technique. The second
listing shows the insertion of default size detection mechanisms, and the third listing
shows the insertion of an optimum size detection mechanism for each placement. The
final enclosure contains Assembler code listings, like the previous enclosures, except
that there is an initial high-level language listing of the original program before its
compilation to Assembler code.
To aid comprehension of the listings, two types of feature are marked. Firstly,
erroneous jumps are highlighted by connecting the generator and destination code,
which are circled, with an arrow denoting the branch direction. Secondly, inserted
detection mechanisms are distinguished from the program code by their encapsulation
in rectangular boxes. The detection capability of the inserted mechanisms is
shown where the destination of an erroneous jump lies within a detection mechanism.
Erroneous jumps whose destination is outside the memory occupied by the software
are represented by arrows which terminate with a 'star' symbol denoting detection
by an Access Guardian.
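The three outcomes marked on the listings can be summarised in a short sketch. It assumes that the memory occupied by the software and each inserted detection mechanism can be described as half-open address ranges; the function name classify_jump and the example addresses are hypothetical and are not taken from the listings.

```python
from typing import Iterable, Tuple

Range = Tuple[int, int]  # (start, end) with end exclusive; an assumption


def classify_jump(destination: int,
                  program: Range,
                  mechanisms: Iterable[Range]) -> str:
    """Say where the destination of an erroneous jump lies."""
    start, end = program
    if not (start <= destination < end):
        # Outside the memory occupied by the software: the 'star' case.
        return "access guardian"
    for m_start, m_end in mechanisms:
        if m_start <= destination < m_end:
            # Inside a boxed region of the listing.
            return "detection mechanism"
    # Inside the software but not within any inserted mechanism.
    return "program code"


# Hypothetical layout: software at 0x1000-0x10FF, one mechanism at 0x1040.
program = (0x1000, 0x1100)
mechanisms = [(0x1040, 0x1048)]
print(classify_jump(0x1042, program, mechanisms))
print(classify_jump(0x2000, program, mechanisms))
print(classify_jump(0x1010, program, mechanisms))
```

The third outcome, a destination within the program code itself, corresponds to an arrow on the listings that ends neither in a box nor at a star.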
D.2. Program 'A' Targeting the Motorola 68000 Microprocessor
The original version of Program 'A' is written in Assembler for the 16-bit
Motorola 68000 microprocessor and is shown as the first listing in this enclosure. The
program is written as an example for the Engineering Microprocessor Laboratory at the
University of Durham. It monitors two water reservoirs and controls the level of one by
pumping water from or draining water to the remaining reservoir.
The second and third listings show the insertion of detection mechanisms by the
software implementation of fault tolerance proposed in this thesis. The second and
third listings respectively detail default size and optimum size detection mechanism
placements.
[Program 'A' listings (original code, default size detection mechanisms, optimum size
detection mechanisms): not recoverable from the scanned source.]
D.3. Program 'B' Targeting the Motorola 6805 Microprocessor
Program 'B' is written in Assembler code targeted at the 8-bit Motorola 6805 
microprocessor. The first listing in this enclosure details the program code. The 
program demonstrates the SC687 development system for MC68(7)05 software. It 
simply sends a sequence of user inputs to an output device. 
The insertion of detection mechanisms by the software implemented fault toler-
ant technique, proposed in this thesis, is shown in the second and third listings of 
this enclosure. Default size and optimum size detection mechanism placements are 
detailed in the second and third listings respectively. 
[Program 'B' listings (original code, default size detection mechanisms, optimum size
detection mechanisms): not recoverable from the scanned source.]





D.4. Program 'C' Targeting the Intel 80386 Microprocessor
The original version of Program 'C' is written in the high-level programming
language known as C. High-level language programs can be transported for application
on many target processors because they describe a function in terms totally
abstracted from the architectural influences of any one microprocessor system. The
first listing in this enclosure details the original program. The program has no
particular function: its purpose is to demonstrate the potential hazards of machine code
which are transparent to the high-level language. Functions within the program use
local or passed parameters.
The target processor selected for Program 'C' is the 32-bit Intel 80386. This
microprocessor has been chosen so that the software implemented fault tolerant
technique proposed in this thesis is demonstrated on a variety of microprocessors. In
order to generate code tailored for application on the Intel 80386 processor, Program
'C' is compiled and an Assembler code representation of the source code is shown as
the second listing in this enclosure.
Application of the software implemented fault tolerant technique proposed in 
this thesis is shown within this enclosure at Assembler code level. The third and 









[Program 'C' listings: not recoverable from the scanned source.]




E.1. Introduction
E.2. Paper 1 : EUROMICRO '88 Paper
E.3. Paper 2 : EUROMICRO '89 Paper
E.4. Paper 3 : IEE '89 Paper
E.5. Paper 4 : IEE '90 Paper
E.6. Paper 5 : EUROMICRO '91 Paper
APPENDIX E
PUBLICATIONS
E.1. Introduction
To date five papers have been published in connection with the research presented
in this thesis. They are as follows,
Paper 1 : Wingate, G.A.S. & Preece, C., Transient Fault Recovery Assessment
in 8 and 16 Bit Microprocessor Based Controllers in Embedded
Systems., Microprocessing and Microprogramming, Vol. 24,
pp 775-782, 1988.
Paper 2 : Wingate, G.A.S. & Preece, C., Performance Evaluation of a New
Design-Tool for Microprocessor Transient Fault Recovery.,
Microprocessing and Microprogramming, Vol. 27, pp 801-808, 1989.
Paper 3 : Wingate, G.A.S. & Preece, C., Fault Tolerance for Microprocessor-
Based Controllers Susceptible to Transient Disturbances., IEE
Digest 1989/111, pp 3/1-3, 1989.
Paper 4 : Wingate, G.A.S. & Preece, C., Fault Tolerance for Uniprocessor
Systems., IEE Digest 1990/176, pp 4/1-5, 1990.
Paper 5 : Wingate, G.A.S. & Preece, C., Analysis of Failure Data Collected
From a TMR Microprocessor Controller., Microprocessing and
Microprogramming, Vol. 32, pp 861-868, 1991.
The first, second, and last paper listed were presented at the international
EUROMICRO '88, EUROMICRO '89, and EUROMICRO '91 conferences held in Zurich
(Switzerland), Koln (West Germany), and Vienna (Austria) respectively. The third
paper was presented at the IEE Colloquium "Control Systems Software Reliability for
Industrial Applications." organized by the Automation and Control Systems Group
C13 in London, October 1989. The fourth paper was presented at the IEE
Colloquium "System Architectures for Failure Management." organized by the Control
Techniques and Applications Group C9 in London, December 1990.
298 






North-Holland
Microprocessing and Microprogramming 24 (1988) 775-782
TRANSIENT FAULT RECOVERY ASSESSMENT IN 8 AND 16 BIT MICROPROCESSOR
BASED CONTROLLERS IN EMBEDDED SYSTEMS
G.A.S. Wingate and C. Preece
School of Engineering and Applied Science
University of Durham
United Kingdom
Keywords : Transient Fault Tolerance, Fault Recovery, Microprocessors, Industrial
Controllers, Embedded Systems.
Microprocessors used in embedded systems for industrial control applications are often
subject to transient disturbances. This can cause system failure unless fault tolerance
can be introduced into the design. This paper discusses software design techniques for
enhancing fault tolerance in small digital systems. A metric is proposed for assessing
different designs, and its influence on MTTF is illustrated.
1. INTRODUCTION

Modern industrial control systems are increasingly based on digital circuits
incorporating microprocessors. In particular, the use of microprocessor based digital
controllers in embedded systems provides an example of a low cost application where
simple architectures are preferred. However, when replacing analogue circuitry with
digital systems, it is important to note that the digital replacements may be more
susceptible to catastrophic failure from transient disturbances.

Microprocessor based industrial controllers can enter a state of erroneous execution
due to the effects of a transient disturbance. The subsequent execution history depends
on the particular microprocessor architecture and the system configuration. In many
embedded industrial applications there is a requirement for high reliability, and this
can be enhanced by attention to software design. Further improvement can be achieved by
integration of hardware and software methods. This problem has traditionally been
tackled by electrical shielding techniques. The methods presented here provide added
protection by increasing the probability of recovery once a transient fault has occurred.

Studies have been published elsewhere, and are referenced below, showing how the
probability of recovery can be determined and subsequently enhanced, for systems based
on 8-bit microprocessors. Techniques have been proposed for assessing the probability
of fault recovery following certain classes of transient disturbance. In this paper this
approach is extended to consider 16-bit processors, and in particular the Motorola
M68000 series. In order to compare different processors, an overall measurement
parameter or metric is defined as the probability of executing an instruction which
will cause an ordered re-entry to the program. This metric is used to discuss the
implications of processor choice for digital controller design.

This paper shows how the inherent properties of 16-bit microprocessors can be utilised
to improve the probability of recovery from a transient fault.

The concept of Mean Time To Failure for a system, commonly used for hardware failure
estimation, is adapted to provide a measure of transient fault recovery capability.
2. CLASSES OF TRANSIENT DISTURBANCE

The industrial environment is a source of transient disturbances, many of which are
derived from power supply transients and electromagnetic radiation. It is critical that
industrial digital systems in applications such as real-time monitoring and control
have a minimal possibility of complete system failure from transient disturbances.
Digital systems are more prone to such a failure than analogue systems, which tend to
filter the disturbances.

Practical observations and experiments have shown that transient disturbances can cause
an erroneous jump in the execution of a program, due to corruption of the program
counter. This may be caused by direct corruption of the program counter, of the bus
signals, or by data errors which lead to corruption of stored addresses. The statistical
calculation of the probability of recovery following an erroneous jump provides a method
of quantifying fault tolerance, and leads to the definition of a recovery metric, which
enables recovery strategies to be assessed qualitatively.

Execution at any address results in the processor entering one of a fixed number of
classified states. The probability of reaching a particular state may be calculated,
based on the proportion of the instruction types in the instruction set, and their
distribution in the microprocessor. Some of these states lead to further erroneous
execution, whereas others allow an ordered recovery to take place.

Within the system memory, areas can be defined which have different characteristics
dependent on the defined utilisation of that memory area. For the purpose of statistical
analysis, the memory is divided into distinctive areas, and models are derived for
erroneous execution in these areas. The models for 16-bit microprocessors follow the
same principles as those developed for 8-bit microprocessors. The differences are due to
the inherent architecture and instruction word-length of the 16-bit machines. The common
model allows comparisons between machine types to be made. Once figures for recovery
probabilities have been calculated for a particular design, techniques for improving the
metric can be proposed. The aim of the design technique is to maximise the probability
of an ordered recovery after an erroneous jump.
Figure 1. 8-Bit Microprocessor Erroneous Execution Model.
( * manifested transient fault )
[Diagram labels: ERRONEOUS EVENT *, Erroneous Jump, RESTART, STOP/WAIT, NORMAL
PROCESSING IN ABNORMAL CONTEXT, RETURN, UNSPECIFIED JUMP]
3. REVIEW OF 8-BIT CONCEPTS

The concepts used in the study of 8-bit machines are reviewed here, in order to
illustrate the comparison between 8- and 16-bit processors. A full description will be
found in Reference 1.

Take the M6800 microprocessor, which has an addressing range of 64K bytes using a 16-bit
address bus. An erroneous jump from a running program to a random address in the address
space results in an entry to one of five states as illustrated in Figure 1, the state
reached being determined by the value of the data at the particular address. The
probability of reaching one of the five states can be calculated knowing the proportion
of particular instructions within the instruction set.

Of the 256 possible op-codes in the M6800 microprocessor, not all are defined. Those
that are undefined have various state outcomes. The probabilities of entering each state
after an erroneous jump are shown in Figure 2 for the M6800 when the contents of the
memory area are assumed to be random. The probabilities of entering each state change as
further instructions are executed following the original jump, as shown in the figure;
the plotted lines represent the boundaries between states.

The initial probability of entering a particular state, P_S, is given by

P_S = \frac{N_S}{N_T}    (1)

where N_S is the number of op-codes corresponding to this state, and N_T is the total
number of possible op-codes.

The probability that K instructions will be executed before a state is reached where a
jump out of ordered processing will occur is given by P_J(K) where

P_J(K) = (1 - P_J)^{K-1} P_J    (2)

from Reference 2.
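
Equations 1 and 2 can be illustrated numerically. The following is a minimal sketch rather than code from the paper: the function names are hypothetical, the op-code count of 13 out of 256 is an assumed value chosen only for illustration, and Equation 2 is read as a geometric distribution in which the K-th instruction executed is the first to cause a jump.

```python
def state_probability(n_state: int, n_total: int) -> float:
    """Equation 1: P_S = N_S / N_T, assuming op-codes are equally likely."""
    if n_total <= 0 or not 0 <= n_state <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_state <= n_total")
    return n_state / n_total


def jump_after_k(p_jump: float, k: int) -> float:
    """Equation 2: P_J(K) = (1 - P_J)**(K - 1) * P_J for K = 1, 2, ..."""
    if k < 1:
        raise ValueError("K counts executed instructions and starts at 1")
    return (1.0 - p_jump) ** (k - 1) * p_jump


# Assumed example: 13 of the 256 op-codes lead to a jump state.
p_j = state_probability(13, 256)
for k in range(1, 6):
    print(k, round(jump_after_k(p_j, k), 4))
```

Summing `jump_after_k` over all K gives 1, which is a quick consistency check on the reconstruction of Equation 2.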
An ordered recovery following a transient event implies the execution of a restart
instruction. The probability of executing such an instruction, illustrated in Figure 2,
is low because of the small proportion of restart op-codes in the M6800 instruction set.
Similar results loc cit for other 8-bit processors serve to confirm this conclusion for
this class of processor.

Techniques for improving the probability of recovery in 8-bit systems have been reported
in Reference 3. It is shown below that these methods are much more effective when
applied to 16-bit processors.

Figure 2. 8-Bit Model Erroneous Execution State Outcome.
[Plot: state probabilities against INSTRUCTIONS EXECUTED]
4. COMPARISON WITH THE M68000 SERIES

4.1 State and Type definitions

The microprocessors considered here in the Motorola M68000 family are the M68000,
M68010, and the M68020. These microprocessors have addressing capabilities of between 24
and 32 bits, and have 16 bit data buses. The instruction set is based upon a 16-bit
instruction word. This gives the microprocessors a possible instruction set of 65536
instructions.

The M68000 family of microprocessors are micro-coded. That is, the 16-bit op-code is
presented to an execution unit. This unit is effectively a ROM to which the 16-bit
op-code is an address. The ROM then releases, for every possible address variant, an
appropriate sequence of micro-codes which will carry out the requested operation.
Illegal and undefined instructions are treated in exactly the same way as legal and
defined instructions. Illegal and undefined instructions have a specified operation: an
exception call.

This fact is particularly significant in terms of fault recovery, as execution of any
invalid op-code leads to an exception which, in turn, can lead to an ordered recovery
through an exception service routine.

The numbers of defined and undefined instructions for the M68000 series are shown in
Table 1, where they are compared with equivalent figures for the M6800.
Table 1. Number of defined and undefined instructions for the MC6800 and M68000
microprocessor family.

Microprocessor    No. of instructions
                  defined    undefined
MC6800                197           59
M68000              43342        22194
M68010              43521        22015
M68020              46595        18941
The instruction types can be classified as shown in Table 2.

Table 2. Instruction Type Classification

Non-Jump
Restart ( software interrupt, software exception )
Return
Stop ( wait )
Undefined Instruction
Unspecified Jump
We consider an "op-code state" to be the state resulting from the interpretation of one
of the instruction types as follows:
Non-Jump - leads to the program counter pointing to the location following a valid
single or multi-byte instruction.
Restart - leads to a jump to a predefined location in the memory map (by exception).
Return - leads to a jump to an address held in a stack.
Stop/Wait - leads to cessation of processing, and requires an interrupt or hardware
reset to exit from this state.
Undefined Instruction - leads to a restart in the M68000 series because of the undefined
instruction exception feature.
Unspecified Jump - leads to a jump to a new location determined by local memory
contents.
Of these states, only two are capable of providing an ordered restart: the restart and
the undefined instruction exception. A diagram showing the possible states following an
erroneous jump in a 16-bit microprocessor is shown in Figure 3. This diagram can be
compared to the corresponding diagram for 8-bit microprocessors shown in Figure 1.
Figure 3. 16-Bit Microprocessor (M68000 Family) Erroneous Execution Model.
( * manifested transient fault )
[Diagram labels: ERRONEOUS EVENT *, Erroneous Jump, RESTART, NORMAL PROCESSING IN
ABNORMAL CONTEXT, STOP/WAIT, RETURN, UNDEFINED INSTRUCTION RESTART, UNSPECIFIED JUMP]
4.2 Microprocessor Model

In developing the model of microprocessor operation subsequent to a transient event
leading to an erroneous jump, the assumption is made that there is an equal probability
of the program counter containing any address within the memory map.

The memory map is divided into notional areas. Assumptions will be made later about the
properties of particular areas which will modify the statistical evaluation of state
outcome. The categories are shown in Table 3.

Table 3. Memory Map Categories

Memory Map Category
Input/Output Reserved Area.
Program Area.
Data Area.
Unused Area.
If an initial assumption is made that the data area of the memory contains random data,
then comparison can be made between the 8-bit and 16-bit machines, by calculating the
probability of entering a particular state as before. However, an important feature of
the M68000 leads to an addition to the equation. Any attempt to fetch an instruction
from an odd address leads to an exception. Assuming that the value of the corrupted
program counter is random, then the probability that the program counter holds an odd
address is 0.5.

All probability calculations for the M68000 series therefore apply to even addresses;
recovery is guaranteed for all odd address references. A diagram of state probability
for the M68000 following an erroneous jump to a random data area is shown in Figure 4.
Comparison of this diagram with Figure 2 shows the enhanced probability of restart in
the M68000. The percentages of different instruction types in the microprocessor
instruction sets are given in Table 4.
Table 4. Instruction Type Percentages in Microprocessor Instruction Sets.

Instruction Type         M68000    M68010    M68020    M6800
Non-Jump                 57.825    58.095    59.327    91.600
Restart                   1.970     1.970     3.683     0.300
Return                    0.003     0.006     0.006     1.300
Stop                      0.002     0.002     0.002     1.600
Undefined Instruction    33.865    33.592    28.902        *
Unspecified Jump          6.335     6.335     8.080     5.200

* The M6800 has 59 undefined instructions, but unlike the MC68000 family of
microprocessors, these instructions do not lead to exception and recovery. They have
therefore been grouped with the other instruction types dependent on their effective
action.
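
The Table 4 percentages can be combined with the odd-address exception to estimate the chance that the first fetch after an erroneous jump into random data gives an ordered restart. The sketch below is illustrative only: the dictionary and function names are hypothetical, and it considers just the first instruction fetched, whereas the paper's model also follows subsequent instructions.

```python
# Percentages of the 65536 possible op-codes, copied from Table 4.
TABLE_4 = {
    "M68000": {"non_jump": 57.825, "restart": 1.970, "return": 0.003,
               "stop": 0.002, "undefined": 33.865, "unspecified_jump": 6.335},
    "M68010": {"non_jump": 58.095, "restart": 1.970, "return": 0.006,
               "stop": 0.002, "undefined": 33.592, "unspecified_jump": 6.335},
    "M68020": {"non_jump": 59.327, "restart": 3.683, "return": 0.006,
               "stop": 0.002, "undefined": 28.902, "unspecified_jump": 8.080},
}


def recovery_at_even_address(cpu: str) -> float:
    """Restart and undefined op-codes both give an ordered restart."""
    row = TABLE_4[cpu]
    return (row["restart"] + row["undefined"]) / 100.0


def recovery_at_random_address(cpu: str) -> float:
    """Odd addresses (probability 0.5) always trap; even ones use Table 4."""
    return 0.5 + 0.5 * recovery_at_even_address(cpu)


for cpu in TABLE_4:
    print(cpu, round(recovery_at_even_address(cpu), 5),
          round(recovery_at_random_address(cpu), 5))
```

For the M68000 the even-address figure is the sum of the Restart and Undefined Instruction rows, 1.970% + 33.865% = 35.835%.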
5. EXECUTION OF A PROGRAM IN AN EMBEDDED SYSTEM

For many industrial control and monitoring applications, the program may occupy only a
small proportion of the addressable memory map. When a transient event occurs, the
corrupted program counter may point to any location, whether used or not, in the
addressing range. We will refer to the area occupied by the program, data storage, and
memory mapped I/O as the used area, and the remainder of the map as the unused area. If
the unused area is filled with data which are interpreted as restart instructions, or
other instructions which would generate exceptions, then the probability of recovery is
further improved. Methods of achieving this have been discussed in Reference 3, the most
powerful of which is bus-biassing. This consists of external circuitry which asserts a
bit pattern on the data bus when any unused memory address is accessed during an
instruction fetch. If the bit pattern is chosen to force an exception, then all
references to unused areas provide recovery.

Figure 4. 16-Bit Model Erroneous Execution State Outcome.
[Plot: state probabilities against INSTRUCTIONS EXECUTED; labelled regions include
RESTART and ODD ADDRESS RESTART]
In this situation the probability of recovery depends not only on the proportions of
restart instructions in the instruction set, but also on the ratio of used to unused
memory in the whole addressable memory area.

The concept of Mean Time To Failure, MTTF, used in hardware reliability calculations,
can be adapted for this work. It provides a method of comparing the improvement in
reliability brought about by fault recovery with other hardware and software methods in
embedded systems. It also enables comparisons between designs using different
microprocessors to be made.
6. MTTF FOR A SYSTEM SUBJECT TO TRANSIENT EVENTS

Let the sample space E comprise a set of events corresponding to erroneous jumps in the
running program. Let E_r \in E where E_r is an event leading to a recovery, and
E_f \in E where E_f is an event leading to a failure. We can also state that
E_r \cup E_f = E and E_r \cap E_f = \emptyset.

Let the probability of event E_r be P(E_r) and the probability of event E_f be P(E_f),
which leads to

P(E_r) + P(E_f) = 1    (3)

If we assume that transient events occur at a rate of k events per hour, then the rate
of failures per hour is given by \lambda where

\lambda = k \cdot P(E_f)    (4)

Assuming an exponential probability distribution function f(t), then

f(t) = e^{-\lambda t}    (5)

[Equations 6 and 7 are not legible in the source.]

Equation 7 gives a value of MTTF for the system when subjected to transient events at
the rate of k per hour, and where the probability of recovery from any single event is
P(E_r). It allows a comparative assessment to be made for software recovery techniques
in a form which can be related to MTTF calculations for [remainder of sentence not
legible in the source].

Figure 5. Microprocessor Reliability.
( for key see Table 5 )
[Plot: reliability against NUMBER OF EVENTS, E(t)]
Let us take an example of a system having a used area of 48K bytes. If we assume, as a
worst case, that any erroneous jump into the 48K byte area will result in system
failure, but any jump into the unused area is recoverable, then the effect of different
processors on the reliability can be seen in Figure 5. These curves are drawn with a
normalised time base.

It is helpful to consider a numerical example to illustrate the point. If we assume that
the system is subjected to transient events at a rate of, say, 1.4 × 10^-3 per hour, or
approximately once per month, Table 5 shows the effect on MTTF of variation in P(E_r).
The MTTF for this event rate with no recovery (i.e. P(E_r) = 0) is 714 hours.
Table 5. MTTF for different microprocessors.

Processor    P(E_r)      MTTF
M6800        0.250000    952 hrs = 39 days
8086 *       0.250000    952 hrs = 39 days
8086 **      0.999969    23405714 hrs = 2672 yrs
8086 ***     0.953125    15238 hrs = 21 mths
M68000       0.998535    487567 hrs = 56 yrs
M68010       0.998535    487567 hrs = 56 yrs
M68020       0.999994    119047619 hrs = 13590 yrs

* assuming corruption of the program counter only, and that the program is contiguous.
** assuming corruption of the segment register only, and that the program is contiguous.
*** assuming corruption of the program counter and segment register are taken together
as a single register.
In the case of the 8086, two registers are involved in specifying the address of any
instruction. Corruption of each register has been treated separately. A third case has
been considered ( assuming that the 8086 could be considered to have a single rather
than a multiple program counter ) so that a comparison for a microprocessor with the
same addressing range as the 8086 can be made. Table 5 illustrates the probability of
recovery and the MTTF for a range of microprocessors.

It can be seen that once the value of P(E_r) approaches a value of 0.99 or better, a
small increase in value can bring a large improvement in MTTF.
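
The Table 5 figures can be checked with a short calculation. This is a hedged sketch, not the authors' code: the form MTTF = 1 / (k (1 - P(E_r))) is an assumption inferred from Equation 4 and from the quoted no-recovery value of 714 hours, and the address-space sizes (64K bytes for the M6800, 2^24 bytes for the M68000, 2^32 bytes for the M68020) and function names are likewise assumptions. The worst-case rule from the text is used: a jump into the 48K used area fails and a jump into the unused area recovers.

```python
def recovery_probability(used_bytes: int, total_bytes: int,
                         odd_address_trap: bool = False) -> float:
    """Worst case: used area always fails, unused area always recovers."""
    p_fail = used_bytes / total_bytes
    if odd_address_trap:
        # Half of all random addresses are odd and trap immediately.
        p_fail *= 0.5
    return 1.0 - p_fail


def mttf_hours(k_per_hour: float, p_recovery: float) -> float:
    """Assumed form of the MTTF: 1 / (k * (1 - P(E_r)))."""
    p_fail = 1.0 - p_recovery
    if p_fail <= 0.0:
        return float("inf")
    return 1.0 / (k_per_hour * p_fail)


K = 1.4e-3            # transient events per hour, as in the text
USED = 48 * 1024      # used area in bytes

for name, total, trap in [("M6800", 64 * 1024, False),
                          ("M68000", 2 ** 24, True),
                          ("M68020", 2 ** 32, True)]:
    p = recovery_probability(USED, total, trap)
    print(f"{name}: P(E_r) = {p:.6f}, MTTF = {mttf_hours(K, p):.0f} hrs")
```

The MTTF entries in Table 5 appear to have been computed from P(E_r) rounded to six decimal places, so the unrounded values printed here differ slightly for the 16-bit processors.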
As the proportion of used area increases, the change in MTTF can be calculated as, in
general,

MTTF = \frac{(\text{total area in bytes})}{k \cdot (\text{used area in bytes})}    (8)

This calculation assumes that any entry into the used area leads to failure. However,
this is not actually the case. As mentioned above, different areas of the memory map
have different properties, and more accurate figures for recovery probability can be
determined for these areas.
7. MODIFICATION TO MTTF ESTIMATE BY MEMORY CATEGORIES

Within the used area of the memory map, that is the area containing the program with its
data area and mapped I/O addresses, distinctly different properties of the contents of
these areas can be defined.

The program area contains program code consisting of op-codes and operands. As a first
approximation we may assume that there are no restart codes in the program area. This
means that any erroneous jump to a program area results in P(E_r) = 0.

The data area, on the other hand, consists of numerical data which bear no relation to
an ordered sequence of instructions. It is therefore possible to assume a random
distribution of data values. We here include I/O mapped registers in the data area. For
the M68000 this gives a value of P(E_r) = 0.35835 within the data area for all even
addresses.
Figure 6. Improved M68000 Microprocessor Reliability.
[Plot: reliability against NUMBER OF EVENTS, E(t); curves labelled M68020 48K-PROG,
M68000 8K-PROG/40K-DATA, M68000 40K-PROG/8K-DATA, M68000 48K-PROG]
The effect of this different treatment for different used areas of memory is shown in
the examples given in Figure 6. A typical program area of 48K is assumed in one case to
comprise 40K of program and 8K of data; in the second case the proportions are reversed.
These are compared with the case of a 48K program with no data area. If we assume an
event rate of approximately 1 per month as in the examples above, then the MTTF for
these applications can be evaluated. The results for implementation on the M68000 are
shown in Table 6.
Table 6. MTTF for various Program/Data ratios in M68000

Program/Data    P(E_r)    MTTF
48K / 0K                  487567 hrs = 56 yrs
40K / 8K                  516476 hrs = 59 yrs
8K / 40K                  678979 hrs = 78 yrs

[The P(E_r) column is not legible in the source.]
Any pattern of memory use can be analysed to give a more accurate value of MTTF for a
particular application program.
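
The per-category treatment can be sketched as a weighted average over the memory map. This is a simplified illustration under stated assumptions, not the calculation behind Table 6: it considers only the first instruction fetched after the erroneous jump, takes the unused area as always recovering, uses a 2^24 byte address space, and the function and parameter names are hypothetical. It is therefore not expected to reproduce the Table 6 MTTF values exactly.

```python
def recovery_probability_by_area(program_bytes: int, data_bytes: int,
                                 total_bytes: int,
                                 p_data_even: float = 0.35835,
                                 p_program_even: float = 0.0) -> float:
    """First-fetch recovery probability for an M68000-style memory map."""
    unused_bytes = total_bytes - program_bytes - data_bytes
    if unused_bytes < 0:
        raise ValueError("program and data exceed the address space")
    # Recovery probability at an even address, weighted by area size.
    p_even = (program_bytes * p_program_even
              + data_bytes * p_data_even
              + unused_bytes * 1.0) / total_bytes
    # Odd addresses (probability 0.5) always raise an exception.
    return 0.5 + 0.5 * p_even


KB = 1024
TOTAL = 2 ** 24  # assumed 24-bit address space
for prog_k, data_k in [(48, 0), (40, 8), (8, 40)]:
    p = recovery_probability_by_area(prog_k * KB, data_k * KB, TOTAL)
    print(f"{prog_k}K / {data_k}K: P(E_r) = {p:.6f}")
```

With no data area the function reduces to the worst-case figure used for the M68000 in Table 5, and moving bytes from program to data raises P(E_r), which matches the ordering of the rows in Table 6.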
8. ENHANCEMENT OF RECOVERY PROBABILITY BY DESIGN

A number of design options present themselves as candidates for improving the recovery
metric P(E_r). It is clear that the aim is to increase the number of codes in the memory
which produce vectored restarts.

The treatment of unused memory has already been referred to. All unused locations should
be filled with restart or exception codes, either by bus-biassing, or by special EPROMs.
Partial decoding of the EPROMs can reduce the number required in a particular system.
This technique is universally applicable, and does not depend on the detail of the
program or the application; it may be particularly advantageous where the microprocessor
has a multiplexed bus.

The detailed treatment of data and program areas of the memory is dependent on the
properties of the code for a particular application. Some general rules can, however, be
formulated.

As a first approximation, if data areas can be considered to contain random numbers,
then the proportions of codes given by Table 4 apply. The inherent metric is a function
of the restart and undefined instructions. However, over 50% of the instruction types
are in the non-jump category, that is, execution proceeds beyond them in sequence. This
observation introduces the possibility of utilising this property to force further
restarts by seeding the data areas with recovery traps, sequences of codes spread
throughout the data area. As well as providing direct recovery if an erroneous jump
lands on a trap, this also enhances the probability of recovery after execution of a
non-jump instruction.

The treatment of program areas is quite different. Here the codes are valid
instructions, and entry to any program area has a high probability of resuming valid
code execution (Reference 1). Seeding the program area is application dependent, and the
programmer needs to have this in mind when writing the code. Alternatively this function
might be implemented by a high level compiler. Two examples of programming techniques
involving operands in the M68000 series will be sufficient to illustrate the point.
i) A proportion of a program area may contain codes representing addresses of operands.
These addresses point to items of data storage. If the data is stored at addresses which
themselves represent invalid instruction codes, then erroneous execution of any of these
operands in the program area will force restarts.
ii) An alternative form of addressing can also produce the same result. If data is
referenced using backward relative addressing, then execution of the operands containing
the negative data offset will cause invalid op-code exceptions in the M68000 and M68010.
This is because setting the four most significant bits in the word produces a code which
is interpreted as an exception in these processors.
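
The bit pattern in example ii) can be checked with a few lines of arithmetic. The sketch below assumes the displacement is stored as a 16-bit two's complement word; the function names are hypothetical. It shows only which offsets have their four most significant bits set, which is the condition the text describes, and does not model the processor itself.

```python
def offset_word(offset: int) -> int:
    """Encode a signed displacement as a 16-bit two's complement word."""
    if not -0x8000 <= offset <= 0x7FFF:
        raise ValueError("offset does not fit in 16 bits")
    return offset & 0xFFFF


def top_four_bits_set(word: int) -> bool:
    """True when bits 15 to 12 of the word are all ones (1111 xxxx xxxx xxxx)."""
    return (word & 0xF000) == 0xF000


for offset in (-16, -4096, -4097, 16):
    word = offset_word(offset)
    print(f"{offset:6d} -> 0x{word:04X} top four bits set: {top_four_bits_set(word)}")
```

Under this encoding, backward offsets from -1 to -4096 all have the top four bits set, so the technique covers data placed within 4K bytes behind the referencing instruction.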
9. DISCUSSION

Transient faults can cause a microprocessor system to experience corruption of the
program counter, causing erroneous jumps to random locations in memory. This brings a
loss of control unless some mechanism for recovery is present. The probability of
recovery can be enhanced if a restart can be initiated after the fault. The inherently
larger addressing space of 16-bit microprocessors provides a major improvement in the
recovery metric, as long as all the unused memory is designed to contain restart
instructions, or to initiate exceptions (software interrupts).

The concept of Mean Time To Failure can usefully be applied to systems subjected to
transient disturbances. The definition of MTTF incorporates the probability of achieving
a restart state at any event, P(E_r), and also the rate of events, k per hour.

The method enables the reliability of microprocessor based embedded systems to be
assessed, and different designs to be compared. It also provides a statistical basis for
developing the techniques for enhancing the fault tolerance of small systems which are
described in Reference 3.

The results suggest that software recovery for transient events can provide an
additional means of protection for microprocessor based systems in addition to that
traditionally provided by external hardware watchdog circuits. In selected 16-bit
machines the figures for MTTF can be high even for onerous transient event rates.
ACKNOWLEDGEMENTS

The authors wish to acknowledge the support of the UK Science and Engineering Research
Council and the British Gas Engineering Research Station, Killingworth, Newcastle upon
Tyne. They would also like to thank Dr. R.G. Halse, of Westinghouse Signals Ltd.,
Chippenham, Wilts, for his suggestions and helpful comments on the paper.
REFERENCES

[1] Halse R.G. Fault tolerance in digital controllers using software techniques. Ph.D.
Thesis. University of Durham, England. 1984.
[2] Halse R.G. and Preece C. Erroneous execution and recovery in microprocessor systems.
Software and Microsystems 4 No. 3. June 1985. pp 63-70.
[3] Halse R.G. and Preece C. Recovery assessment after microprocessor transient
disturbances, in System Fault Diagnostics and Related Knowledge-Based Approaches Vol 2.
pp 383-397. S. Tzafestas et al. (eds) 1987 D. Reidel Publishing Co.
North-Holland
Microprocessing and Microprogramming 27 (1989) 801-808
PERFORMANCE EVALUATION OF A NEW DESIGN-TOOL FOR
MICROPROCESSOR TRANSIENT FAULT RECOVERY
G.A.S. Wingate (UK) & C. Preece (UK)
School of Engineering and Applied Science,
University of Durham, DH1 3LE. England.
Keywords : Transient Fault Tolerance, Fault Recovery, Design Tool, Embedded Systems,
Microprocessor Controllers.
Approach : Evaluation.
A model of microprocessor erroneous behaviour has led to the development of a new
design tool to automate the introduction of transient fault tolerance into program code.
The design tool, PARUT ( Post-programming Automated Recovery UTility ), provides a method
of enhancing existing program code to optimise the recovery capability following a
transient disturbance. The tool can be used to implement a number of different recovery
strategies, some of which may involve additional hardware. The paper examines the
performance of the design tool for a range of techniques.
1. INTRODUCTION

Modern designs of industrial control systems incorporate microprocessors for control and
monitoring purposes. The versatility that microprocessor based controllers offer to the
designer, and the flexibility that customised software provides, make them attractive
in many industrial situations.

However, industrial environments are often harsh, falling short of the ideal for
computer systems. In particular, microprocessor operation can be corrupted by externally
generated transient events such as electrical power transients [ 1 ] and
electro-magnetic radiation [ 2, 3, 4 ]. Even in 'benign' operating conditions transients
have been observed to cause between 80% and 90% of digital system failures
[ 5, 6, 7, 8 ].

In analogue systems these transients go through a 'filtering' process which generally
means that the control function is not lost. Digital systems are much more liable to
lose all control function following a transient disturbance. It is important that
digital design should incorporate mechanisms for recovery in the event of erroneous
execution.

Transient events considered in this paper are those which lead to corruption of data on
the bus, or in the memory, or registers of a microcomputer system. While such corruption
can lead to erroneous behaviour of the system, no permanent hardware damage is incurred,
and if control of the computational process can be re-established, then this permits the
possibility of overall recovery of the control system.

A number of well known techniques are available to reduce the probability of failure
from transients. These include both hardware and software enhancement and usually
incorporate some form of redundancy. A technique such as hardware modular redundancy
with voting will prevent many transient failures but involves considerable hardware
overhead [ 9 ]. Watchdog circuits [ 10 ], although commonly used, themselves suffer from
transients. They also introduce a performance overhead by constantly interrupting
microprocessor operation.

This paper includes a comparison of two other recovery techniques which aim to improve
transient fault tolerance by the inclusion of both hardware and software enhancements.
These have been applied to 8-bit, 16-bit, and 32-bit microprocessors [ 11, 12, 13 ]. The
hardware modification is simple with a very low overhead. The software enhancement has
involved up to 60% increase in memory requirement, but this figure is very much code
dependent. A post-programming utility has been built which automates the method of
enhancing software to improve transient fault tolerance. The utility is described, and
its performance with transient fault tolerant techniques is evaluated.
2. ERRONEOUS BEHAVIOUR
We consider erroneous behaviour to be initiated by corruption of the microprocessor's
internal state which leads to the corruption of the program counter. It has been suggested
that 25% of transients affecting digital systems will corrupt the internal state [6]. The
erroneous behaviour produced by a microprocessor is characterised as a sequence of
erroneous execution states terminated either by recovery or catastrophic failure.
802 G. Wingate, C. Preece / Performance Evaluation of a New Design Tool
Execution states can be defined in terms of the outcome of each operation. The possible
states are defined as follows:
Non-Jump - leads to the program counter pointing to the location following a
    valid instruction.
Restart - leads to a jump to a predefined location in the address space.
Unspecified Jump - leads to a jump to a new location in the address space
    determined by local memory contents.
Return - leads to a jump to an address held in a stack.
Stop/Wait - leads to a cessation of processing; and requires an interrupt or
    hardware reset to exit from this state.
Not all possible instruction bit patterns in a microprocessor are necessarily defined. In
such cases these undefined instructions can, when executed, result in any of the defined
states. It should be noted, however, that because no specification is available for these
undefined instructions, manufacturers are not obliged to ensure that every die batch
produces the same operation for each of the undefined instructions. However, the problem
does not arise in all microprocessors. In the case of the Motorola 68000 family, for
example, the execution of all undefined instructions results in an 'exception' (or software
interrupt) leading to the 'restart' state as defined above.
For purposes of discussion, we assume that a transient fault will cause random corruption
of the program counter contents. The instruction following the transient event will be
fetched from a location pointed to by the corrupted contents of the program counter,
resulting in a jump in the control flow of the executing software. This random jump is
termed the Initial Erroneous Jump (IEJ). We define execution following the IEJ as
'erroneous execution'. The content of the memory at the new execution location is fetched
and executed as if it were a valid instruction. The outcome of this erroneous execution
will result in a new behavioural state. Execution of a non-jump type instruction continues
linear erroneous execution. If the new state is a jump type (but not a 'restart'), then a
Subsequent Erroneous Jump (SEJ) will occur. In order to achieve controlled recovery it is
necessary to maximise the probability of a 'restart' type instruction being executed as
soon as possible after the inception of erroneous execution.
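The sequence of IEJ, linear erroneous execution, SEJ and restart described above can be traced on a toy memory map. The following is a minimal sketch in Python, not the authors' model: the labelled memory, the one-location instruction size, and the rule that unlisted locations give a restart are all assumptions made for illustration.

```python
from typing import Dict, Tuple, Union

# Each location is labelled with the outcome of executing it:
# "non-jump", "restart", "stop", or ("jump", target_address).
Outcome = Union[str, Tuple[str, int]]


def trace_erroneous_execution(memory: Dict[int, Outcome], iej_target: int,
                              max_steps: int = 100) -> Tuple[str, int, int]:
    """Follow execution from the IEJ target.

    Returns (final state, instructions executed, number of SEJs).
    """
    pc, sejs = iej_target, 0
    for executed in range(1, max_steps + 1):
        # Assumption: locations not listed are unused and give a restart.
        outcome = memory.get(pc, "restart")
        if outcome == "restart":
            return "restart", executed, sejs
        if outcome == "stop":
            return "stop/wait", executed, sejs
        if isinstance(outcome, tuple):   # unspecified jump or return: an SEJ
            pc, sejs = outcome[1], sejs + 1
        else:                            # non-jump: linear erroneous execution
            pc += 1                      # assumption: one location per instruction
    return "unresolved", max_steps, sejs


# Hypothetical used area: two non-jumps, a jump to 10, then a restart at 11.
toy_memory = {0: "non-jump", 1: "non-jump", 2: ("jump", 10),
              10: "non-jump", 11: "restart"}
```

Tracing from location 0 passes through one SEJ before the restart is reached, while an IEJ into an unlisted location restarts on the first instruction.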
3. MODEL
A model of erroneous execution has been presented elsewhere [11, 12, 13] and is reviewed
briefly here.
The probability of a jump type outcome after k instructions have been executed in a random
area [11] is given by
    P_J(k) = P_{NJ}^{k-1} \cdot P_J    (1)
where P_J and P_{NJ} are the initial probabilities of a jump and non-jump outcome. If the
simplifying assumption is made that the content of the memory at the jump target is random,
then the probabilities for individual outcomes can be calculated from the distribution of
their associated instructions within the instruction set, multiplied by P_J(k).
Improvement in the probability of recovery following an initial erroneous jump (IEJ) has
already been discussed in a previous paper [12], where unused address space is filled with
restart type instructions. Major improvements were shown for a range of 8, 16, and 32 bit
microprocessors. For erroneous execution in the 'used areas' of memory, the probability of
a number of random subsequent erroneous jumps (SEJ) landing within the used area is given by
    P_{SEJ}(\text{Used Area}) = \frac{1}{N_J N_L} \sum^{J} \sum^{L} \left( \frac{\text{No. addressed bytes in used area}}{\text{No. bytes in address range}} \right)    (2)
where the summations to J and L map each jump type instruction and every location in the
used area respectively. N_J is the number of jump type instructions in the instruction set.
N_L is the number of locations in the used area.
Figure 1: SEJ Characteristic. x-axis: used area (x 10^5 bytes).
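Equation (1) has the form of a geometric distribution: k - 1 non-jump outcomes followed by one jump outcome. A minimal Python sketch of that reading follows; the function name is illustrative and the probabilities in the usage note are hypothetical, not values from the paper.

```python
def jump_probability(k: int, p_jump: float, p_non_jump: float) -> float:
    """P_J(k) = P_NJ**(k - 1) * P_J, as in equation (1).

    p_jump and p_non_jump are the initial outcome probabilities; they need
    not sum to one, since other outcomes (e.g. stop/wait) may exist.
    """
    if k < 1:
        raise ValueError("k counts executed instructions and must be >= 1")
    return p_non_jump ** (k - 1) * p_jump
```

With hypothetical values P_J = 0.25 and P_NJ = 0.5, `jump_probability(3, 0.25, 0.5)` evaluates 0.5 squared times 0.25.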
The SEJ characteristic for the Motorola 68000 and Intel 8086 is shown in Figure 1. This
graph emphasises the importance of implementing a recovery technique within the used area.
If no recovery technique is implemented then there is a high probability of an extended
period of erroneous behaviour consisting of many SEJs. A detailed analysis of this
characteristic is given in Reference 13.
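The dependence of the SEJ characteristic on the size of the used area can be illustrated with a deliberately simplified calculation. The sketch below assumes every jump destination is uniformly random over the whole address space, which is a coarser assumption than the per-instruction summation of the model; the function name is hypothetical.

```python
def used_area_landing_probability(used_bytes: int, address_space_bytes: int,
                                  jumps: int = 1) -> float:
    """Probability that `jumps` successive random jumps all land in the used area.

    Simplifying assumption: each destination is uniform over the address space.
    """
    if not 0 <= used_bytes <= address_space_bytes:
        raise ValueError("used area must fit inside the address space")
    return (used_bytes / address_space_bytes) ** jumps
```

For example, a used area of 2**19 bytes (the upper end of the Figure 1 axis) in a 2**24 byte address space, the range of the 68000's 24-bit address bus, gives 1/32 for a single jump under this assumption.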
4. PARUT (Post-programming Automated Recovery Utility)
PARUT has been designed to accomplish the following:
i) To analyse the original code and report the IEJ recovery capability inherent in the
original code, and the SEJ recovery capability.
ii) To enhance the fault tolerance of the original code by a selection of methods. This
usually involves inserting some redundancy into the code. When this is completed, the
utility re-aligns the original software control flow, which will have been offset at the
machine code level by the introduction of redundancy.
iii) To analyse the enhanced code and report the improvement in the IEJ recovery
capability, the improvement in the SEJ recovery capability, and the coded area extension
overhead required.
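The re-alignment step in item ii) amounts to mapping every original address to its new position once redundancy has been inserted. The following Python sketch shows one way such a mapping could be built; it is not PARUT's implementation, and the function names and the convention that inserted bytes sit immediately before the stated address are assumptions.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Callable, List, Tuple


def make_relocator(insertions: List[Tuple[int, int]]) -> Callable[[int], int]:
    """Build an old-address to new-address mapping.

    insertions: (original address, inserted byte count) pairs; the inserted
    bytes are assumed to be placed immediately before that address.
    """
    ordered = sorted(insertions)
    addresses = [address for address, _ in ordered]
    shifts = list(accumulate(size for _, size in ordered))

    def relocate(old_address: int) -> int:
        # Count the insertions at or before this address and add their total size.
        count = bisect_right(addresses, old_address)
        return old_address + (shifts[count - 1] if count else 0)

    return relocate


# Hypothetical example: two 12-byte mechanisms inserted before 0x10 and 0x20.
relocate = make_relocator([(0x20, 12), (0x10, 12)])
```

A branch at address b with target t would then have its displacement recomputed from `relocate(t)` and `relocate(b)`, which is the control flow re-alignment the text refers to.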
The PARUT program has two input requirements: a description of the microprocessor on which
the code is to reside, and a copy of the code. Processing this information results in two
output streams: a report on the original and the enhanced codes, and a copy of the enhanced
codes. An overview of the PARUT program is shown in Figure 2.
In the present version of the PARUT program, detection mechanisms cannot be inserted for
SEJs that originate within, and whose destinations lie within, the same instruction [13].
The utility does, however, report these occurrences.
5. APPLICATION: Motorola 68000
Studies of a range of microprocessor types have suggested that while many features of
erroneous behaviour are common to all [11, 12], detailed analysis requires that each type
is considered separately. We consider here the Motorola 68000 microprocessor as a typical
example of commonly used 16/32 bit microprocessors. The analysis produced for this
microprocessor is specific, but the general approach and results are valid for a range of
microprocessors and architectures.
The model for the Motorola 68000 follows from the description of microprocessor behaviour.
A particular feature of the Motorola 68000 family of microprocessors is the 'odd address
exception'. The handling routine for this exception can be written so that it directs
execution to the recovery routine. With unused even addresses also giving a restart (via a
hardware address access guardian), the microprocessor gives an enhanced probability












Figure 2: PARUT Overview.
Figure 3: Example of Detection Mechanism Placement. (a) Original Code. PARUT Enhanced Code:
a seed in the placed detection mechanism will catch a potential erroneous jump.
Key: DM - Detection Mechanism; SEED - Restart (e.g. 6001).
To improve the recovery capability of the microprocessor still further, PARUT can implement
detection mechanism placement. The detection mechanism used for the Motorola 68000 is shown
in Figure 3. It consists of an initial one-word relative branch instruction over the
remainder of the mechanism, so that the logical control flow of the correctly executing
program is not interrupted. This does, however, incur an additional processing overhead.
The remainder of the detection mechanism consists of five one-word seed instructions. A
seed instruction is a software exception instruction which directs execution flow to the
recovery routine. Five seed instructions are necessary because the maximum length of an
instruction in the 68000 microprocessor is five words. Detection mechanisms must be placed
so as not to disrupt correct execution flow. The rule for placement is that where possible
each SEJ destination becomes a seed in the detection mechanism, thus ensuring a restart
during the next execution cycle, and an increased probability of recovery following an SEJ.
Figure 4: Erroneous Execution Model.
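The layout of the detection mechanism (one branch word followed by five seed words) can be written out as machine words. This Python sketch is illustrative only: the seed value 0x6001 is the example quoted in the Figure 3 key, while the branch word is an assumption based on the 68000 short branch encoding (0x60 followed by an 8-bit displacement counted from the word after the opcode).

```python
from typing import List

SEED_WORD = 0x6001  # seed value quoted in the Figure 3 key ("e.g. 6001")


def detection_mechanism(seed: int = SEED_WORD, seed_count: int = 5) -> List[int]:
    """Return the 16-bit words of one mechanism: a short branch, then the seeds."""
    skip_bytes = 2 * seed_count  # each seed is one word (two bytes)
    if not 0 < skip_bytes < 0x80:
        raise ValueError("skip distance must fit a positive 8-bit displacement")
    # Assumed encoding: BRA.S with the displacement in the low byte.
    branch = 0x6000 | skip_bytes
    return [branch] + [seed] * seed_count
```

During correct execution the branch word skips the ten bytes of seeds; an erroneous jump landing on any of the five seed words would execute a seed instead. As far as the encoding suggests, 0x6001 is itself a short branch with an odd displacement, which on the 68000 would trigger the odd address exception mentioned above.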
To illustrate the versatility of the PARUT method, a second technique has been implemented,
based on proposals in a paper by Schutte and Shen [14] for control flow monitoring using
'Signatured Instruction Streams'. For the purposes of comparison, PARUT has been modified
to produce code with embedded signatures. The simulation inserts a single random word
immediately fol-









Figure 5: Execution History Following IEJ (Original Code) †. Outcome bands: unspecified
jump, odd address restart, unused area restart. x-axis: instructions executed (0 to 10).
6. PERFORMANCE EVALUATION
The results presented here show the effect of the PARUT utility on a particular piece of
code. The results are entirely code dependent, and will vary greatly from one program to
another. An assessment is made of the improvement in recovery capability achieved by
detection mechanism placement.
The performance evaluation of original and PARUT enhanced codings requires examination of
the execution histories belonging to each of the two phases of erroneous behaviour. The
first phase is defined as that of erroneous execution following an IEJ. The second phase
consists of the erroneous execution following each of a number of SEJs until either
recovery or catastrophic failure occurs. The example coding is taken from a short Motorola
68000 control program written in assembler code.
Figure 6: Execution History Following IEJ (Code with Detection Mechanisms) †. Outcome
bands: odd address restart, unused area restart. x-axis: instructions executed.
6.1 Phase 1 Observations
The execution histories for the original code, the code with detection mechanism
placement, and the embedded signature code are shown in Figures 5, 6, and 7 respectively.
These results are for execution histories following an Initial Erroneous Jump.
Both enhanced codings exhibit a lower initial probability of restart. This overhead is due
to the extended memory required for the inserted detection mechanisms and embedded
signatures.
The code with detection mechanism placement has a smaller probability that the outcome
state will be a jump (without restart) compared to the original code. This is due to the
probability of an initial SEJ destination being a detection mechanism seed delivering a
restart outcome.
The signatured code also shows a decrease in the probability of a jump (without restart)
outcome state. This is because the signature process delivers a restart outcome for any
initial SEJ which is not synchronised with valid program flow.
Figure 7: Execution History Following IEJ (Signatured Code) †.
The behaviour of this first phase of erroneous execution, if restart is not achieved, is
characterised by a short period of linear execution followed by a further erroneous jump.
Figure 6 suggests that the enhanced code with embedded signatures will have the longest
period of linear execution, followed by the code with detection mechanism placement, and
then the original code. If the first phase of erroneous behaviour is longer than three
instructions then the performance of the enhanced code need not necessarily be reduced.
However, it is only when the effects of SEJs are taken into account that the overall
improvement is clearly apparent.
6.2 Phase 2 Observations
The execution histories for the original code, the code with detection mechanism
placement, and the embedded signature code are shown in Figures 8, 9, and 10 respectively.
These results are for execution histories following a Subsequent Erroneous Jump.
Some SEJ destinations cannot be determined due to their use of data which is only specified
at run-time. In such instances, for analysis purposes, the destinations are assumed to be
random. This is important in estimating the proportion of SEJ destinations which lie
outside the 'used area', or at an odd address, and which affect the probability of restart.
Figure 8: Execution History Following SEJ (Original Code) †. Outcome bands: unspecified
jump, restart. x-axis: instructions executed (0 to 10).
Figure 9: Execution History Following SEJ (Code with Detection Mechanisms) †.
The effect of placing detection mechanisms is clearly demonstrated by the increased
probability of restart shown in Figure 9 compared to the original code shown in Figure 8.
In the case of the signature method, an SEJ by an attempted execution of an operand will
not complete its operation because of the hardware detection circuitry. On the other hand,
an SEJ by an attempted execution of an op-code results in re-synchronisation with the
program flow.
6.3 Recovery Performance
Recovery is assumed to be achieved through a restart type instruction directing the program
flow to a recovery routine.
The absolute probability of restart can be calculated from the information held in both the
execution histories for erroneous execution following an IEJ and SEJ. Let P_R(k) be the
absolute probability of restart after k instructions processed following an IEJ. Let
I_R(k) and I_J(k) be the probabilities of restart and jump without restart after k
instructions processed following an IEJ. Let S_R(k) and S_J(k) be the probabilities of
restart and jump without restart after k instructions processed following an SEJ. Then
    P_R(w) = I_R(w) + \sum \sum \sum \{ I_J(x) \cdot \prod_{n} S_J(y_n) \cdot S_R(z) \}    (3)
where x + \sum_{n} y_n + z = w, and n, x, y, z > 0.
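One way to read equation (3) is as a convolution: a restart at instruction w either follows the IEJ directly, or follows a chain of jumps whose segment lengths add up to w. The Python sketch below implements that reading under stated assumptions: each input list gives the probability that the outcome occurs exactly at instruction k of its phase (index 0 unused), and chains with no intermediate SEJ segment are included. It is an interpretation for illustration, not the authors' calculation.

```python
from typing import List, Sequence


def restart_profile(i_r: Sequence[float], i_j: Sequence[float],
                    s_r: Sequence[float], s_j: Sequence[float],
                    w_max: int) -> List[float]:
    """Probability of restart occurring exactly at instruction w, for w = 0..w_max.

    i_r, i_j: restart / jump-without-restart probabilities following an IEJ.
    s_r, s_j: the same following an SEJ. Index k means "at instruction k".
    """
    def at(seq: Sequence[float], k: int) -> float:
        return seq[k] if 0 <= k < len(seq) else 0.0

    jump = [0.0] * (w_max + 1)     # an erroneous jump occurs exactly at w
    restart = [0.0] * (w_max + 1)  # a restart occurs exactly at w
    for w in range(1, w_max + 1):
        jump[w] = at(i_j, w)
        restart[w] = at(i_r, w)
        for step in range(1, w):   # last segment ran for `step` instructions
            jump[w] += jump[w - step] * at(s_j, step)
            restart[w] += jump[w - step] * at(s_r, step)
    return restart
```

Summing the returned list up to w gives the cumulative probability of restart within w instructions under the same assumptions.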
The absolute probability of restart for the different enhanced versions of the program is
shown in Figure 11. The performance of the PARUT generated code shows improvement over the
original code after one or more instructions have been erroneously executed. The period of
erroneous behaviour is reduced, leading to a smaller probability of data corruption. This
in turn will improve availability and reduce the chance of catastrophic failure.
The absolute probability of restart tends to a value less than 100% because of the
probability of resynchronisation with the program flow. If resynchronisation occurs,
complementary recovery techniques are required [15, 16].
6.4 Overheads
Detection mechanism placement incurs a memory overhead. This overhead is clearly
demonstrated in the reduced initial probability of recovery through restart for erroneous
behaviour following an IEJ, see Figure 6. Examination of Figure 11 reveals how this
initially reduces the absolute probability of restart of the PARUT enhanced code compared
to the original code. However, we note in Figure 11 that the performance of the enhanced
code improves due to the increased probability of restart following an SEJ produced by
detection mechanism placement, see Figure 9.
The execution of the detection mechanism jumps over the seeds, during correct program flow,
will incur a small processing overhead. The influence of this on the
Figure 10: Execution History Following SEJ (Signatured Code) †.
Figure 11: Absolute Probability of Restart During Erroneous Execution. Key: original,
detection mechanism, signature. x-axis: instructions executed.
7. DISCUSSION
The method described in this paper for improving 
the transient fault recovery capability is valuable for ap-
plications which require a high degree of dependability 
or which are safety critical. 
The microprocessor model described above can be 
extended to other types of microprocessors [ 11, 12 ]. 
The design tool PARUT is based on this model and is
therefore widely applicable. 
The recovery technique used for erroneous execution 
following an IEJ requires the unused area to have a 100%
detection capability during the processing of the initial 
instruction. In the case of the Motorola 68000 micro-
processor, this can be achieved by additional circuitry 
that detects whether invalid address lines have been ac-
tivated, and if so, impresses a bus error exception signal 
to the microprocessor. This is acceptable if the used 
area forms a contiguous block in the memory map. If 
it does not, then bus-biassing can be used so that the 
quiescent bus value is loaded to represent an exception 
instruction format. 
One of the major advantages of the detection mech-
anism placement is seen in the software development 
cycle. As a post-programming technique, this approach 
does not constrain the initial software, and the program-
mer need not be aware of its subsequent application. 
There is no pre-requisite programming requirement, and 
the original language of the software is immaterial because
the utility is applied to the machine code. The
method has wide application: existing software can be
processed as a maintenance up-grade, or new software
processed for immediate enhancement.
The use of the utility is proposed as part of an overall
fault-tolerant strategy, as an addition to, and not a
replacement for, other software and hardware techniques.
Watchdogs alone would permit recovery after some in-
terrupt interval, but the intervening period could con-
sist of prolonged erroneous behaviour. The method pro-
posed here reduces the mean period of erroneous be-
haviour and hence decreases the probability of catas-
trophic failure. 
8. CONCLUSIONS
It has been shown that the injection of exception 
generating mechanisms into the machine code of a pro-
gram can enhance the probability of recovery following a 
transient disturbance. This technique provides coverage 
for transient events which cause erroneous jumps into 
the program code. A particular program example shows 
performance improvement without the need for complex 
additional hardware. The technique is implemented by 
a software utility, PARUT, applied to existing program
code. The method is therefore transparent to the pro-
grammer. The utility can be used to investigate other 
techniques of fault coverage, and can form part of an 
overall design strategy for reliable digital controllers. 
ACKNOWLEDGEMENTS
The authors would like to acknowledge the support
of the UK Science and Engineering Research Council
and the British Gas Engineering Research Station,
Killingworth, Newcastle upon Tyne, England.
REFERENCES
[1] Buschke, H.A., A Practical Approach to Testing Electronic Equipment for Susceptability
to AC Line Transients. IEEE Trans. Reliability, Vol. 37, No. 4, pp 355-359, 1988.
[2] Burton, P., Designing Microprocessor-Based Equipment for Immunity from Electrical
Interference. Microprocessors and Microsystems, Vol. 12, No. 6, pp 309-316, 1988.
[3] Kotheimer, W.C., The Source and Nature of Transient Surges. IEEE Trans. Industrial
Applications, Vol. IA-13, No. 6, pp 501-503, 1977.
[4] Thurlow, M., Susceptability Characterization of Microprocessor and LSI Technology.
Microprocessors and Microsystems, Vol. 12, No. 6, pp 317-322, 1988.
[5] Ball, M. & Hardie, F., Effects and Detection of Intermittent Failures in Digital
Systems. AFIPS Conference Proceedings, Fall Joint Computer Conference, Vol. 35,
pp 329-335, 1969.
[6] McConnel, S.R. & Siewiorek, D.P., C.vmp: The Implementation, Performance, and
Reliability of a Fault Tolerant Multiprocessor. Interim Report, Carnegie-Mellon
University, Computer Science Department, Pittsburgh, PA 15213, USA, March 1978.
[7] Siewiorek, D.P. & Swarz, R.S., The Theory and Practice of Reliable System Design.
Digital Press, Bedford, MA, 1982.
[8] Iyer, R.K. & Rossetti, D.J., A Statistical Dependancy of CPU Errors at SLAC. Proc.
FTCS-12, Santa Monica, CA, 1982.
[9] Lala, P.K., Fault Tolerant and Fault Testable Hardware Design. Prentice Hall
International, New York, 1985.
[10] Lu, D.J., Watchdog Processors and Structural Integrity Checking. IEEE Trans.
Computers, C-31, No. 7, pp 681-685, 1982.
[11] Halse, R., Fault Tolerance In Digital Controllers Using Software Techniques. Ph.D.
Thesis, University of Durham, England, 1984.
[12] Wingate, G.A.S. & Preece, C., Transient Fault Recovery Assessment In 8 and 16 Bit
Microprocessor Based Controllers In Embedded Systems. Microprocessing and
Microprogramming, Vol. 24, pp 775-782, 1988.
[13] Wingate, G.A.S. & Preece, C., Enhanced Dependability for Microprocessor Based
Controllers Susceptable to Transient Disturbances (to be published).
[14] Schutte, M.A. & Shen, J.P., Processor Control Flow Monitoring Using Signatured
Instruction Streams. IEEE Trans. on Computing, C-36, No. 3, pp 264-276, 1987.
[15] Randell, B., System Structure for Software Fault-Tolerance. IEEE Software
Engineering, Vol. 1, pp 220-232, 1975.
[16] Sosnowski, J., Transient Fault Tolerance In Microprocessor Controllers. IFIP/IFAC
Working Conference on Hardware and Software for Real Time Process Control (Warsaw) 1988,
eds Zalewski, J. & Ehrenberger, W., North Holland Press, pp 189-195, 1989.
† Special Note: The probability of outcome for a particular state is represented by the
vertical distance of the labelled outcome band shown on the graph.
FAULT TOLERANCE FOR MICROPROCESSOR-BASED
CONTROLLERS SUSCEPTABLE TO TRANSIENT DISTURBANCES
G.A.S. Wingate & C. Preece 
Abstract 
This paper outlines the design of a microprocessor based controller with fault tolerance from 
transient disturbances. Such events can cause erroneous jumps within executing software. Fault 
tolerance is achieved through automatic enhancement of the application software. Performance 
issues are briefly discussed. 
1. Introduction
Industrial controllers for monitoring and control are often based on microprocessors. The
versatility they offer in both hardware and software design makes them attractive in many
industrial situations.
Operating conditions within industrial environments are often harsh. Transient disturbances 
such as mains power fluctuations [1] and electro-magnetic radiation [2] may occur. Analogue
systems effectively 'filter' these events without losing their control function by passing the
transient event as a temporary signal discrepancy. Digital systems, having a discrete nature, are
much more liable to lose all control function. Transients have been observed to cause between 80
and 90% of digital system failures [3,4]. It is important that digital systems should incorporate
mechanisms for recovery from transient failures.
This paper considers those transient events which lead to corruption of bus information, 
memory contents, and register contents of a microprocessor system. Such corruption can induce 
erroneous behaviour whilst no permanent hardware damage occurs. If control of the micropro-
cessor can be restored then overall system recovery is attainable. 
Most techniques available replicate hardware and/or software, involve particular program-
ming style, or require complex dedicated hardware. All these techniques are expensive in design 
and/or construction and are generally tailored to individual applications. The technique de-
scribed here consists of automated software enhancement transparent to the programmer. The 
technique is directly applicable to a range of microprocessor systems and involves the self-
detection of erroneous behaviour. 
2. Erroneous Behaviour 
We consider erroneous behaviour to be initiated by the corruption of the microprocessor's 
program counter. The erroneous behaviour produced by the microprocessor is characterised as 
a sequence of erroneous states terminated either by catastrophic failure or system recovery. 
Execution states can be defined in terms of the outcome of each operation [5]. The possible 
states are defined below. 
Erroneous execution may be within the used (program, data, or I/O reserved) area, or the 
unused area of the microprocessor address space. Recovery is attained through a restart state 
which vectors execution to a predefined memory location which holds the recovery routine. 
School of Engineering and Applied Science,
University of Durham, DH1 3LE, England.
Non-Jump - leads to the program counter pointing to the location following a 
valid instruction. 
Restart - leads to a jump to a predefined location in the address space. 
Unspecified Jump - leads to a jump to a new location in the address space 
determined by local memory contents. 
Return - leads to a jump to an address held in a stack. 
Stop/Wait - leads to a cessation of processing ; and requires an interrupt or 
hardware reset to exit from this state. 
3. Detection Technique 
Erroneous execution can flow through both used and unused regions of the address space 
of a microprocessor system. Detection is based on the occurrence of a restart state during 
erroneous execution. Techniques for used and unused areas will now be presented. 
3.1. Unused Area Detection 
All execution within the unused area is defined to be erroneous and hence total detection
coverage is required. Detection is achieved by ensuring all memory locations have the capability
of generating a restart state when accessed. Unused address space consisting of memory elements
has every location filled with a restart outcome instruction [6]. Where there are no memory
elements, a restart outcome is achieved through a simple hardware unit. This unit monitors the
address bus and any illegal access results in the unit developing an external reset for the
microprocessor. The microprocessor will treat the external reset as a restart state.
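The two unused-area measures described above can be sketched in a few lines of Python. This is an illustrative sketch only: the image representation (a list of words with `None` marking unused locations), the half-open address ranges, and both function names are assumptions, and the restart word must be chosen for the target microprocessor.

```python
from typing import List, Optional, Sequence, Tuple


def fill_unused_words(image: Sequence[Optional[int]], restart_word: int) -> List[int]:
    """Replace every unused location (None) with a restart outcome instruction."""
    return [restart_word if word is None else word for word in image]


def is_illegal_access(address: int, populated: Sequence[Tuple[int, int]]) -> bool:
    """Model of the hardware unit: flag any address outside the populated ranges.

    populated: (start, end) pairs, end exclusive, covering fitted memory and I/O.
    """
    return not any(start <= address < end for start, end in populated)
```

For instance, `fill_unused_words([0x4E71, None, None, 0x4E75], 0x6001)` leaves the two program words in place and seeds the gap, and an access at 0x9000 with memory fitted only at 0x0000 to 0x7FFF would be flagged for an external reset.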
3.2. Used Area Detection
Erroneous execution within the used area may flow through program, data, or I/O
reserved areas. Each will now be considered.
Program areas consist of opcodes and operands. Erroneous processing of an opcode will 
follow a valid execution path which is out of phase with the desired execution flow. Fault 
tolerance can be introduced by implementing software techniques such as recovery blocks or 
assertion tests [7]. 
Erroneous processing of an operand as an instruction will lead to erroneous execution depen-
dent upon local memory contents. Such erroneous execution will follow paths, unpredictable 
to the programmer, through memory, leading to a danger of catastrophic failure. To detect 
this mode of erroneous behaviour, mechanisms are placed within memory so that any operand 
processed as a jump instruction will have at its destination a restart outcome instruction. This 
means that any operand being interpreted as an instruction and developing an erroneous jump 
will be followed by a restart and hence recovery. A design tool PARUT (Post-programming 
Automated Recovery UTility) automates this process [8].
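Finding where an operand would send execution if it were processed as a jump instruction is the first step of such placement. The Python sketch below takes the Motorola 68000 as an assumed target and recognises only its short branch family (top nibble 0x6 with a non-zero 8-bit displacement counted from the word after the opcode); other jump type instructions are ignored, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def short_branch_destinations(words: Sequence[int], base: int) -> List[Tuple[int, int]]:
    """Return (address, destination) for each 16-bit word that would decode
    as a 68000 short branch, whether it is really an opcode or an operand."""
    found = []
    for index, word in enumerate(words):
        displacement = word & 0xFF
        # A zero low byte selects a 16-bit displacement, not handled here.
        if word >> 12 == 0x6 and displacement != 0x00:
            if displacement >= 0x80:
                displacement -= 0x100  # sign-extend the 8-bit displacement
            address = base + 2 * index
            found.append((address, address + 2 + displacement))
    return found
```

Each destination returned is a candidate location for a restart outcome instruction; an odd destination would already give a restart on a 68000 through the odd address exception.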
Data and I/O reserved areas cannot have their structure altered in the same manner as the
program areas. Erroneous execution within these areas may be detected through a watchdog 
timer. 
4. Post-programming Automated Recovery UTility (PARUT)
A software tool called PARUT has been built to implement code enhancement. PARUT can 
provide automated code enhancement for a range of microprocessors. The tool works on machine 
code. There is no dependency on the original program language or pre-requisite programming 
style, hence the enhancement is transparent to the programmer. Enhancement may be provided 
before software release or as a maintenance up-grade. 
The performance improvement can be quantified with the application of PARUT. This is 
described in Reference 8. Many other fault tolerant techniques have been suggested but few 
offer analyses of their performance. 
5. System Recovery 
System recovery is attained through execution of the recovery routine accessed by the de-
tection restart states. This routine may implement a number of recovery mechanisms including 
roll-back, roll-forward, or cold-start. The choice of recovery method will often be determined 
by the specific application of the microprocessor based controller. 
6. Discussion 
Detection is statistically very rapid. Early results show that detection within 500ns (10 
instructions) has a likelihood of 99.9% [8]. A watchdog interval of 100ms in a similar processor
would permit approximately 2000 instructions to be erroneously processed before detection.
Extended periods of erroneous execution increase the probability of catastrophic system mal-
function. 
Traditional methods of fault tolerance for this class of failure have involved large redundancy, 
or complex dedicated hardware, both very expensive and specific to a particular microprocessor. 
The technique presented here is easily transferable to other microprocessors, and its implemen-
tation, based on software, is automated. 
7. Acknowledgements 
The authors would like to thank the Science and Engineering Research Council, and British
Gas plc (Killingworth, Newcastle-upon-Tyne) for sponsoring this work.
8. References 
[1] Buscke, H.A., A Practical Approach to Testing Electronic Equipment for Susceptibility to
AC Line Transients. IEEE Trans. Reliability, Vol. 37, No. 4, pp 355-359, 1988.
[2] Thurlow, M., Susceptibility Characterisation of Microprocessor and LSI Technology.
Microprocessors and Microsystems, Vol. 12, No. 6, pp 317-322, 1988.
[3] Siewiorek, D.P. & Swarz, R.S., The Theory and Practice of Reliable Systems Design. Digital
Press, Bedford, MA, 1982.
[4] Iyer, R.K. & Rossetti, D.J., Statistical Dependency of CPU Errors at SLAC. Proc. FTCS-12,
Santa Monica, CA, pp 363-372, 1982.
[5] Halse, R., Fault Tolerance in Digital Controllers Using Software Techniques. Ph.D. Thesis,
University of Durham, England, 1984.
[6] Wingate, G.A.S. & Preece, C., Transient Fault Recovery Assessment In 8 and 16 Bit
Microprocessor Based Controllers In Embedded Systems. Microprocessing and Microprogramming,
Vol. 24, pp 775-782, 1988.
[7] Horning, J.J. et al, A Program Structure for Error Detection and Recovery. Lecture Notes in
Computer Science, Vol. 16, (eds) Gelenbe, E. & Kaiser, C., Springer-Verlag, pp 171-187,
1974.
[8] Wingate, G.A.S. & Preece, C., Performance Evaluation of a New Design-Tool for
Microprocessor Transient Fault Recovery. Microprocessing and Microprogramming, Vol. 27, pp
801-808, 1989.
FAULT TOLERANCE FOR UNIPROCESSOR SYSTEMS
Wingate, G.A.S. & Preece, C.
School of Engineering and Applied Science
University of Durham
1. INTRODUCTION
Fault tolerance for system architectures is typically associated with multiple levels of
redundancy; common examples are duplex, triplex, and quadruplex. Such system architectures
are sometimes referred to as NMR (N-Modular Redundancy). Duplex systems can identify
the module in error and switch to continue processing on the standby module. NMR systems
of a higher order than duplex mask errant modules. Whilst these architectures offer very high
reliability, their application also incurs in excess of 100% redundancy. The associated cost of
this overhead may be significant and perhaps unacceptable for low budget systems. In such
situations a uniprocessor fault tolerant approach may be appropriate.
2. FAULTS AND FAILURES
A significant hazard for all processor systems is that of temporary fault generation. Tem-
porary faults, unlike permanent faults, incur no physical damage and have a limited duration 
and hence system recovery is possible without physical repair. Studies have suggested that 
temporary faults cause between 10 and 50 times more processor system failures than perma-
nent faults. 
Temporary faults can be classified as transient or intermittent. Transient faults occur
unpredictably and are generated by environmental influences on the processor system such as
electro-magnetic radiation, alpha-particles, power supply disturbances, and radio-frequency 
interference. Intermittent faults are recurring temporary faults and are indicative of imminent 
permanent fault generation. 
3. CAPABILITY CHECKING
Namjoo & McCluskey [1982] first used the term 'capability checking' to describe a
self-detection scheme that could be implemented by a uniprocessor system to identify errant
processing. Since then, it has become evident in the literature that the collective application
of a selection of capability checks provides the best method of achieving a highly reliable 
uniprocessor system. 
The fault tolerant techniques proposed for capability checking in uniprocessor systems 
employ various strategies to detect erroneous processing. The strategies are based on the 
identification of different processing characteristics associated with erroneous behaviour. Im-
plementation of these techniques incurs an overhead, physical (software and/or hardware) 
and/or processing (time). The various capability checks reported in the literature are
summarised below.
Clive Preece is, and Guy Wingate was formerly, with the School of Engineering and Applied
Science, University of Durham. Guy Wingate is now with ICI Engineering (Improved
Manufacturing Systems Technology Group), Chilton House, Billingham, Cleveland.
A. Watchdog Timer
This is one of the most basic fault tolerant techniques, and involves a dedicated hardware
unit called a 'watchdog timer'. The watchdog timer is incorporated into the processor system
in such a way that failure of the processor to reset the watchdog timer periodically will result in
the watchdog sending an alarm signal, representing identified failure, back to the processor.
Watchdog timers incur a small switching overhead. A disadvantage of the watchdog timer as
a stand-alone technique is that the timer interval to detection can allow many hundreds of
instructions to be processed haphazardly. The length and effect of this period of malfunction
will vary between individual processor types and their applications.
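The behaviour of a watchdog timer can be illustrated with a small software model. This is a sketch under stated assumptions rather than a description of any particular hardware unit: time is counted in abstract ticks, and the class, function, and parameter names are hypothetical.

```python
from typing import Optional


class WatchdogTimer:
    """Software model of a watchdog: alarm if not reset within `interval` ticks."""

    def __init__(self, interval: int) -> None:
        self.interval = interval
        self.counter = 0
        self.alarm = False

    def reset(self) -> None:
        """Called periodically by a correctly running processor."""
        self.counter = 0

    def tick(self) -> None:
        """Advance one tick and raise the alarm once the interval has expired."""
        self.counter += 1
        if self.counter >= self.interval:
            self.alarm = True


def run(ticks: int, fault_at: int, reset_period: int, interval: int) -> Optional[int]:
    """Return the tick at which the alarm is raised, or None if it never is.

    The processor resets the watchdog every `reset_period` ticks until the
    fault occurs at tick `fault_at`, after which it stops resetting.
    """
    watchdog = WatchdogTimer(interval)
    for t in range(ticks):
        if t < fault_at and t % reset_period == 0:
            watchdog.reset()
        watchdog.tick()
        if watchdog.alarm:
            return t
    return None


detected_at = run(ticks=1000, fault_at=300, reset_period=50, interval=100)
latency_ticks = detected_at - 300  # ticks of unchecked execution after the fault
print(detected_at, latency_ticks)
```

In this model the detection latency depends on where in the reset cycle the fault occurs, and it is bounded above by the watchdog interval; every tick in that window corresponds to instructions processed without supervision.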
B. Fetch Invalid Instruction
Most processor architectures have defined and undefined instruction opcodes. Some
instruction sets, however, do not specify the action of their undefined instructions, which may or
may not have an operation. Two good examples are the Motorola 68000, in which all possible
instruction opcodes have a specified operation, and the Intel 8085, which does not specify all its
opcodes, although some undefined opcodes have useful operations [Dehnhardt & Sorensen, 1979].
In order to prevent the execution of an opcode of unknown operation, only defined instruction
fetches should be allowed - all illegal instruction fetches should be detected. This technique
requires additional decoder circuitry to be added to the uniprocessor system.
C. Invalid Opcode Address
Glaser & Masson [1982] proposed a SAFE ROM whereby an extra memory-bit is attached
to each memory unit (usually a byte) to signify usage as an opcode or operand. Interpretation
of an instruction activates decoding of the 'usage' bit and if the location is not specified as an
opcode then erroneous processing is assumed to have been identified. The technique incurs
a memory overhead and additional circuitry to decode the extra memory-bit. Furthermore,
the technique cannot be implemented in those locations in the address space implementing
Random Access Memory (RAM) or non-existent memory.
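The 'usage' bit idea can be sketched in software as follows. This is an illustrative model only, not the SAFE ROM circuitry itself: the class and function names are hypothetical, and the three-byte 8085 fragment is an invented example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RomCell:
    value: int        # the stored byte
    is_opcode: bool   # the extra 'usage' bit: True for opcode, False for operand


class InvalidOpcodeAddress(Exception):
    """Raised when an opcode fetch is made from a location tagged as an operand."""


def fetch_opcode(rom: List[RomCell], address: int) -> int:
    """Return the opcode at `address`, checking the usage bit first."""
    cell = rom[address]
    if not cell.is_opcode:
        raise InvalidOpcodeAddress(f"opcode fetch from operand location {address:#06x}")
    return cell.value


# Hypothetical 8085 fragment: MVI A,76h (opcode 3Eh plus one operand byte), then HLT (76h).
# The operand byte happens to equal the HLT opcode, so an erroneous jump to
# address 1 would halt the processor if the usage bit were not checked.
rom = [RomCell(0x3E, True), RomCell(0x76, False), RomCell(0x76, True)]

print(hex(fetch_opcode(rom, 0)))
```

The one extra bit per byte is the memory overhead referred to above; the check itself corresponds to the additional decoding circuitry.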
D. Invalid Read/Write Within Permitted Memory
A dedicated hardware unit can be embedded within the processor system to ensure that
a read is not made from a write only address, e.g. a specified output port, or that a write is
not made to a read only address, e.g. a Read Only Memory (ROM) location.
E. Unused Memory Access & Non-Existent Memory Access
A commonly used technique involves additional circuitry to check that the address bus 
does not carry signals accessing unused locations of physical memory or addresses without 
resident physical memory in the address space. 
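Checks D and E both amount to validating each bus access against a memory map. The sketch below models that validation in software; the memory map, region names, and function names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Region(Enum):
    ROM = "read only"
    RAM = "read/write"
    OUTPUT_PORT = "write only"
    UNUSED = "unused physical memory"
    ABSENT = "no physical memory"


# Hypothetical map of a 64 KiB address space: (start, end inclusive, region).
MEMORY_MAP = [
    (0x0000, 0x1FFF, Region.ROM),
    (0x2000, 0x27FF, Region.RAM),
    (0x2800, 0x2FFF, Region.UNUSED),
    (0x3000, 0x3000, Region.OUTPUT_PORT),
]


def region_of(address: int) -> Region:
    """Look up the region containing `address`; unmapped addresses are ABSENT."""
    for start, end, region in MEMORY_MAP:
        if start <= address <= end:
            return region
    return Region.ABSENT


def access_is_valid(address: int, is_write: bool) -> bool:
    """Apply check E (unused or absent memory) and check D (read/write direction)."""
    region = region_of(address)
    if region in (Region.UNUSED, Region.ABSENT):
        return False
    if region is Region.ROM:
        return not is_write
    if region is Region.OUTPUT_PORT:
        return is_write
    return True


print(access_is_valid(0x0100, is_write=True))   # write to ROM
print(access_is_valid(0xF000, is_write=False))  # read from absent memory
```

A hardware implementation would perform the same comparison on the address bus and read/write control line for every bus cycle.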
F. Invalid Branch
An invalid branch involves the incorrect interpretation of an instruction as a branch and
should not be confused with interpreting a valid branch instruction whose destination is
incorrect. A technique has been proposed by Wingate & Preece [1989] in which software
detection blocks are strategically placed within the application code at locations identified as
destinations of invalid branch instructions. A complementary hardware unit called an Access
Guardian detects unpermitted memory access in parts of the address space not implemented
in physical memory.
G. Incorrect Sequence of Instructions
This method, commonly referred to as 'signature analysis', assigns tag values based on a cyclic
encoding of instruction sequences, which are inserted within the software before implementation.
Additional circuitry monitors the tags in the software, comparing each tag with a hardware
generated tag. A favourable comparison verifies execution, whilst a mis-match of tag values
signifies the identification of erroneous behaviour. A good review of this approach can be
found in Mahmood & McCluskey [1988].
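The tag comparison can be sketched as follows. CRC-32 from the Python standard library stands in here for the 'cyclic encoding'; the encodings used by the cited schemes may differ, and the function names and the four-byte 8085 block are hypothetical.

```python
import zlib


def block_signature(opcodes: bytes) -> int:
    """Tag for an instruction sequence (CRC-32 used as a stand-in cyclic code)."""
    return zlib.crc32(opcodes)


def verify_block(executed: bytes, embedded_tag: int) -> bool:
    """Compare the tag regenerated at run time with the tag embedded before release."""
    return block_signature(executed) == embedded_tag


# Hypothetical 8085 block: MVI A,01h ; INR A ; HLT
block = bytes([0x3E, 0x01, 0x3C, 0x76])
tag = block_signature(block)          # computed once, before implementation

# A single bit flip turns INR A (3Ch) into DCR A (3Dh).
corrupted = bytes([0x3E, 0x01, 0x3D, 0x76])

print(verify_block(block, tag), verify_block(corrupted, tag))
```

In the schemes reviewed, the regeneration and comparison are carried out by additional circuitry concurrently with execution rather than in software.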
4. EFFECTIVENESS
The effectiveness of the fault tolerant techniques can be assessed using two parameters, 
fault coverage and fault latency. Fault coverage is derived from fault insertion experiments 
(injection, emulation, or simulation) expressing the percentage of faults detected. Fault latency 
describes the time interval that passes between the fault insertion and its detection. 
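Both parameters can be computed directly from the log of a fault insertion experiment. The sketch below assumes a hypothetical log format in which each inserted fault is recorded with its detection latency in microseconds, or None if it was never detected; the figures are invented for illustration.

```python
from statistics import mean
from typing import List, Optional


def fault_coverage(latencies: List[Optional[float]]) -> float:
    """Percentage of inserted faults that were detected."""
    detected = [x for x in latencies if x is not None]
    return 100.0 * len(detected) / len(latencies)


def mean_fault_latency(latencies: List[Optional[float]]) -> float:
    """Mean time from insertion to detection, over detected faults only."""
    return mean(x for x in latencies if x is not None)


# Hypothetical experiment: five inserted faults, two never detected.
results = [4.0, 12.0, None, 8.0, None]

coverage_percent = fault_coverage(results)
latency_us = mean_fault_latency(results)
print(coverage_percent, latency_us)
```

Note that undetected faults lower the coverage but do not enter the latency figure, so the two parameters need to be read together.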
4.1. Fault Coverage
Schmid et al [1982] identified erroneous program flow as the most prominent exposure 
feature of uniprocessor failure induced by a fault. They evaluated the individual performance 
of some of the techniques outlined above. The three most successful techniques, during fault 
simulations on the Z80 microprocessor, with 03%, 58%, and 56% fault coverage respectively 
were those based on detecting incorrect sequences of instructions, invalid opcodes, and unused 
memory access. Similar results are reported by Gunneflo et al [1989] and Li et al [1984] for
the Z80 and SBR9000 processors respectively.
The reliability of a uniprocessor system can be improved by collectively applying several
techniques. Table 1 collates data from three experiments evaluating different combinations
of uniprocessor fault tolerant techniques. The fault coverage of the combinations is only an
indication of the performance. There will be some variation in the results due to different
methods of fault injection employed by the authors.
Source                   Techniques Employed   Fault Coverage
Schmid et al [1982]
Gunneflo et al [1989]
Madeira et al [1990]
D, C, D, E, G
C, D, E, G
Table 1: Collective Application of Capability Checks
The overall fault coverage might appear low, but some account must be made for benign 
faults and those faults which generate data value errors and do not disrupt normal program 
action. 
4.2. Fault Latency 
Arlat et al [1990] describe a bimodal distribution of fault latencies, i.e. there are two or
more distinct classes of error manifestation. Chillarege & Bowen [1989] identify some faults to
be dormant, such as stack corruption requiring particular processor activity to exercise the
fault, whilst other faults induce 'fast failure'. It is therefore important that error detection be
provided with a minimal fault latency in order to detect faults that would otherwise generate
a 'fast failure'.
The fault tolerant techniques outlined above have a short fault latency, e.g. the incorrect
instruction sequence detection reported by Schmid et al [1982] occurred with a mean latency
of 8 µs. The precise latencies for each fault tolerant technique will vary depending on their
host uniprocessor.
The collective application of a selection of fault tolerant techniques, whilst improving the
fault coverage, incurs a higher mean detection latency. Assessment of the detection latency is
application specific; further details can be found in the references given in Table 1.
5. FUTURE WORK
Future work is needed to compare the individual and collective effectiveness of all the 
capability checks listed in this paper. This will facilitate the assessment of the benefits of 
strategically applying particular selections of capability checks to a uniprocessor application 
and the overheads they incur. 
6. CONCLUSION
Processes or equipment that require low budget and yet reliable control can utilise fault 
tolerant uniprocessor systems. Such systems have high reliability without the order of mag-
nitude redundancy associated with NMR architectures. Reliability can be further improved 
by implementing fault tolerant techniques, but this has the effect of increasing fault latency.
Nevertheless, fault latency is much less than that introduced by stand-alone watchdog timers 
traditionally associated with uniprocessor systems. In many industrial applications fault la-
tency is not a critical factor because the equipment or process under control has a response 
time much longer than the fault latency. 
REFERENCES
Arlat, J., Aguera, M., Amat, L., Crouzet, Y., Fabre, J.C., Laprie, J.C., Martins, E.
& Powell, D., Fault Injection for Dependability Validation: A Methodology and Some
Applications., IEEE Trans. Soft. Engineering, Vol. 16, No. 2, 1990, pp 166-181.
Chillarege, R. & Bowen, N.S., Understanding Large System Failures - A Fault Injection
Experiment., Int. Symp. on Fault Tolerant Computing, 1989, pp 94-99.
Dehnhardt, W. & Sorensen, V.M., Unspecified 8085 Op-Codes Enhance Programming.,
Electronics, 1979, pp 144-145.
Glaser, R.E. & Masson, G.M., The Containment Set Approach to Crash-Proof Microprocessor
Controller Design., IEEE Trans. Computers, Vol. 31, No. 7, 1982, pp 689-692.
Gunneflo, U., Karlsson, J. & Torin, J., Evaluation of Error Detection Schemes Using Fault
Injection by Heavy-Ion Radiation., Int. Symp. on Fault Tolerant Computing, 1989, pp
340-347.
Li, K.W., Armstrong, J.R. & Tront, J.G., An HDL Simulation of the Effects of a Single
Event Upset on Microprocessor Program Flow., IEEE Trans. Nuclear Science, Vol. 31,
No. 6, 1984, pp 1139-1144.
Madeira, H., Quadros, G. & Silva, J.G., Experimental Evaluation of a Set of Simple Error
Detection Mechanisms., Microprocessing and Microprogramming, Vol. 30, 1990, pp 513-520.
Mahmood, A. & McCluskey, E.J., Concurrent Error Detection Using Watchdog Processors
: A Survey., IEEE Trans. Computers, Vol. 37, No. 2, 1988, pp 160-174.
Namjoo, M. & McCluskey, E.J., Watchdog Processors and Capability Checking., Proc.
FTCS-12, 1982, pp 245-248.
Schmid, M.E., Trapp, R.L., Davidoff, A.E. & Masson, G.M., Upset Exposure by Means of
Abstraction Verification., Proc. FTCS-12, 1982, pp 237-244.
Wingate, G.A.S. & Preece, C., Performance Evaluation of a New Design-Tool for
Microprocessor Transient Fault Recovery., Microprocessing and Microprogramming, Vol. 27,
1989, pp 801-808.
Microprocessing and Microprogramming 32 (1991) 861-868
North-Holland 
ANALYSIS OF FAILURE DATA COLLECTED FROM A
TMR MICROPROCESSOR CONTROLLER
Guy A.S. Wingate* and Clive Preece
School of Engineering and Applied Science
University of Durham, England
KEYWORDS: Microprocessor System Reliability, Failure Analysis, Real Time Systems, Temporary Faults.
Experimental failure data has been collected over two operational periods, each in excess
of one year's duration, from a Triple Modular Redundancy (TMR) microprocessor controller
based on the Intel 8085. Failures of embedded microprocessors are diagnosed as due to either
temporary or permanent faults. There are few published studies covering temporary fault
analysis of microprocessor failures; the research reported here is a valuable addition. Failures
attributed to temporary faults are observed to occur approximately 40 times more frequently
than those attributed to permanent faults. Further analysis of each data set reveals a very good
correlation (0.992 and 0.995) to a constant failure rate, which is associated with an exponential
inter-arrival distribution. This paper will be of interest to reliability engineers considering
aspects of operational microprocessor system reliability.
1. INTRODUCTION
Microprocessor failures can be diagnosed as due to either permanent or temporary faults [9].
Permanent faults are physical defects, whilst temporary faults have a limited duration and do
not incur physical damage. Temporary faults are often divided within the literature into
transient and intermittent classes. Transient faults occur unpredictably and are attributed to
temporary environmental conditions such as electrical power disturbances, electro-magnetic
interference, radio-frequency interference, electro-static discharge, and alpha-particle
radiation. Intermittent faults are recurring temporary faults and are associated with the
imminent creation of a permanent fault (wear-out phenomena), or are faults whose activation
is pattern sensitive.
* G.A.S. Wingate is now with ICI Engineering (Computer Aided Production), Chilton House,
Billingham, Cleveland, U.K.
Reliability engineers have observed electrical systems to exhibit a time dependent failure
rate, referred to as the hazard rate, Z(t). The Weibull function is widely used to describe the
hazard rate as it varies during a system's lifetime (known as the 'Bathtub Curve'). The
function is
$Z(t) = \frac{\beta}{\alpha} \, t^{\beta - 1}$ (1.)
where α and β are known as the scale and shape parameters respectively.
The Bathtub Curve divides the lifetime of an electronic system into three phases. Firstly,
the 'burn-in' phase involves the identification of premature faults and is modelled by
β < 1 : hazard rate decreases. Secondly, the 'useful period' is marked by unpredictable faults,
induced by component ageing and/or environmental stress, and is modelled by β = 1 : constant
hazard rate. A constant failure rate is a special case of the Weibull function implying an
exponential distribution of inter-arrival failure times. Finally, the 'wear-out' phase occurs
when faults due to component degradation become increasingly significant compared with useful
period faults, and is modelled by β > 1 : hazard rate increases.
Temporary faults have a significant influence on microprocessor system reliability. A
selection of monitored processor systems have shown that temporary faults are responsible for
between 90 and 98% of system failures, see Table 1. In addition, temporary faults have been
observed to occur as frequently as once every 100 hours during continuous operation.
Microprocessor failure data collated by McConnel [6] and reported by Lin & Siewiorek [5]
exhibits a Weibull inter-arrival distribution with a decreasing hazard rate for failures
attributed to temporary faults.

CMUA      Fuller & Harbison [2]   PDP-10       ECL    Parity        5700        44        800-1600   94.8 - 97.3
Cm*       Siewiorek et al [10]    LSI-11       NMOS   Diagnostics   15000       128       4200       97.0
C.vmp     Siewiorek et al [11]    TMR LSI-11   NMOS   Crash         15000       97-328    4900       93.7 - 98.1
Telettra  Morganti et al [7]      UDET 7116    TTL    Mismatch      N/A         80-170*   1300       88.4 - 94.2
SLAC      Iyer & Rossetti [3]     N/A          N/A    Diagnostics   26000       58        2300       97.5**
CMU-AFS   Lin & Siewiorek [5]     MC68010/20   NMOS   Diagnosis     212800***   201       6552       97.3
Notes: *   Reported by McConnel [6]
       **  From which 85% were recovered
       *** 13 applications monitored over 22 months
Table 1. : Observed Temporary & Permanent Faults

2. DESCRIPTION OF MICROPROCESSOR CONTROLLER
Failure data was collected from a TMR microprocessor system, based on the Intel 8085, used
to control a gas governor system. The microprocessor system, see Figure 1., was designed and
implemented by Pearson [8], and is briefly reviewed below.
The TMR architecture provides fault tolerance for embedded processor failures. A voter is
implemented which compares thirty bus channel signals from each processor every 3 µs. The
voter outputs the majority agreed value for each signal. Hence, the correct output is
guaranteed when not more than one processor fails. The voter implemented by Pearson has
additional circuitry which identifies single errant processors, the diagnosis being output on
two error flag signals. When more than one processor fails concurrently, the majority decision
process breaks down, and the voter outputs a signal to indicate voting failure. Some processor
failures require multiple reset attempts (every 3 µs); however, if more than 100 reset attempts
are required, a steady state failure is assumed to have occurred. The voter can operate at a
maximum speed of 3 MHz and is the primary constraint on the microprocessor controller's
operational speed, the Intel 8085 microprocessor being capable of operating at 8 MHz.
Microprocessor '1' (Output 30 Signals), Microprocessor '2' (Output 30 Signals), Microprocessor '3'
Figure 1. : TMR Controller
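The voting and diagnosis behaviour described for the controller can be sketched in software. This is an illustrative model only, not Pearson's circuit: each processor's thirty signals are packed into one integer, whole-word comparison is used for diagnosis, and all names are hypothetical.

```python
from typing import Optional

SIGNAL_MASK = (1 << 30) - 1  # thirty bus channel signals per processor


class VotingFailure(Exception):
    """Raised when no two processors agree, so no single errant unit can be named."""


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of the three processors' signal words."""
    return ((a & b) | (a & c) | (b & c)) & SIGNAL_MASK


def errant_processor(a: int, b: int, c: int) -> Optional[int]:
    """Return 1, 2 or 3 for a single disagreeing processor, or None if all agree."""
    a, b, c = a & SIGNAL_MASK, b & SIGNAL_MASK, c & SIGNAL_MASK
    if a == b == c:
        return None
    if a == b:
        return 3
    if a == c:
        return 2
    if b == c:
        return 1
    raise VotingFailure("more than one processor disagrees")


# Processor '3' outputs a corrupted word; the voter masks it and flags it.
voted = majority_vote(0b1010, 0b1010, 0b0110)
suspect = errant_processor(0b1010, 0b1010, 0b0110)
print(bin(voted), suspect)
```

The voter output is correct whenever at most one processor is in error, which is the guarantee stated for the TMR architecture; the exception models the voting failure signal.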
In addition, the microprocessor controller implements self-synchronising clock signals for
the embedded microprocessors, a common reset signal, Random Access Memory (RAM) with code
protection, Read Only Memory (ROM) duplicated - the reserve copy being selected





DATE 07:06   HR 15:03
VECTORED RECOVERY
SYNO=F8   RETRIES=03
SYNC ERR CHANNEL 02
Annotations: Detection; Further Diagnosis Information; Time (24 Hours); Number of Retries;
Identified Errant Processor
Figure 2. : Example of Diagnosis Printout
3. DATA COLLECTION
The microprocessor controller was in continuous operation from October 1983 to February
1985, and from June 1989 to July 1990. During these periods the controller retained operation
through the automatic activation of fault tolerant mechanisms implemented by the system.
Failure of the fault tolerant mechanisms resulted in a crash outcome. Instances of automatic
recovery are assumed to restore a temporary fault condition because the failure incurred no
physical damage. Crashes require operator intervention for recovery, via manual reset or
repair, and hence because of their steady state failure condition are attributed to permanent
faults.
The controller self-diagnoses failures from which system integrity can automatically be
restored. Self-diagnosis information is output by the controller on an attached dedicated
printer. A typical print-out is shown in Figure 2. The print-out shows the date and time of the
incident, the identified errant processor, and the number of sequential reset attempts required
to secure restored system operation.
The collected 1983/85 failure data was recorded over 12299 hours, noted 3 permanent faults
and 79 temporary faults, and is referred to as data set 'A'. Similarly, the 1989/90 failure data
is referred to as data set 'B' and recorded 3 permanent and 80 temporary faults inducing failure
over 9360 hours. In both failure data sets, temporary faults are diagnosed as responsible for
96.3% of controller failures.
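The quoted share of temporary faults can be checked from the fault counts given above. The short sketch below uses only those counts; the function name is hypothetical.

```python
def temporary_share(temporary: int, permanent: int) -> float:
    """Percentage of failures attributed to temporary faults."""
    return 100.0 * temporary / (temporary + permanent)


share_a = temporary_share(temporary=79, permanent=3)  # data set 'A'
share_b = temporary_share(temporary=80, permanent=3)  # data set 'B'
print(f"{share_a:.2f}% {share_b:.2f}%")
```

Both data sets come out at a little over 96%, in line with the figure quoted in the text.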
4. DATA ANALYSIS
The analysis of temporary fault failure sets 'A' and 'B' is shown in Figures 3. and 4.
respectively. Each figure has three graphs: (a) showing the cumulative number of events against
operational time; (b) the failure time of each event as they occurred; and (c) the
inter-arrival failure time distribution of the events.
(a) Linear Regression: TIME (HOURS); BEST FIT CORR. 0.992, S.E. (EST.) 2.865, INTERCEPT 2.000, SLOPE 0.00634
(b) Event Occurrence: SEQUENCE OF EVENTS; MEAN 153.6, STD. DEV. 129.9
(c) Event Distribution: TIME TO EVENT (HOURS); BAR WIDTH 80 HRS, EXPONENTIAL CURVE CORR. 0.969
Figure 3. : Collected Failure Data Set 'A'
4.1. Failure Occurrence
The observed failures caused by temporary faults are shown in Figures 3(a). and 4(a). for
data sets 'A' and 'B' respectively. It is interesting to notice the step function in these
figures, particularly of data set 'A'. The three steps of higher failure rate occurrence in data
set 'A' are recorded for the months of December 1983, July 1984, and December 1984. The steps
are harder to distinguish for data set 'B', but occur at the similar months of December 1989
and May 1990. Other processor systems have been observed to have a workload dependent failure
rate [3, 4]. This, however, does not explain the observed failure rate step function for the
TMR controller because its workload is deemed to be constant. The method of Lin and
Siewiorek [5] applies Dispersion Frame Technique (DFT) to processor system failure data and
observes an increased hazard rate associated with an intermittent fault. Such an observation
does not fit the TMR controller data because
the steps of increased failure rate do not terminate with a permanent fault. An alternative
explanation is that the failure rate is dependent on some external environmental influence,
but this suggestion cannot be substantiated.
Analysing linear regression on the whole of each failure data set still yields a very good
correlation to a constant failure rate, λ_m, despite the observed step function. Failure data
sets 'A' and 'B' yield Mean Time To Failure (MTTF) parameters, 1/λ_m, of similar magnitude,
157.7 hours and 112.0 hours respectively. This compares favourably with the sample data MTTF
of 147.8 hours and 117.6 hours for data sets 'A' and 'B'. Table 2 summarises the linear
regression analysis on the whole of each failure data set.
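The relationship between the regression slope and the MTTF can be illustrated as follows. The least-squares routine is a generic sketch, not the authors' analysis tool, and the evenly spaced failure times are invented purely to exercise it; only the two slopes are taken from the paper.

```python
from typing import Sequence


def least_squares_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def mttf_from_rate(rate_per_hour: float) -> float:
    """For a constant failure rate, MTTF is the reciprocal of the rate."""
    return 1.0 / rate_per_hour


# Slopes (failures per hour) reported for data sets 'A' and 'B'.
mttf_a = mttf_from_rate(0.00634)
mttf_b = mttf_from_rate(0.00893)
print(round(mttf_a, 1), round(mttf_b, 1))

# Hypothetical data: cumulative failure count against time, one failure every 150 hours.
times = [150.0 * k for k in range(1, 11)]
counts = [float(k) for k in range(1, 11)]
slope = least_squares_slope(times, counts)
print(round(mttf_from_rate(slope), 1))
```

The slope of cumulative failures against operating time estimates the failure rate, so its reciprocal reproduces the regression MTTF values quoted above.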
4.2. MTTF Distribution
The inter-arrival times of failures diagnosed as due to temporary faults are shown in their
sequential order of occurrence for failure data sets 'A' and 'B' in Figures 3(b). and 4(b).
respectively. The plots suggest a memoryless failure mechanism because the scatter of
inter-arrival failure times appears unchanged throughout the observed periods of controller
operation.
(a) Linear Regression: BEST FIT CORR. = 0.990, S.E. (EST.) = 2.298, SLOPE = 0.00893
(b) Event Occurrence: MEAN = 118.5, STD. DEV. = 123.1
(c) Event Distribution: TIME TO EVENT (HOURS); BAR WIDTH = 70 HRS, EXPONENTIAL CURVE CORR. = 0.956
Figure 4. : Collected Failure Data Set 'B'
A histogram distribution of the inter-arrival times for failures attributed to temporary
faults for data sets 'A' and 'B' is shown in Figures 3(c). and 4(c). respectively. Histogram
bars represent the number of failures that occur within sequential time intervals following the
last failure event. The intervals, Δ, for the histogram distribution plot are calculated using
the following equation from Lewis [4]:
$\Delta = r \, [1 + 3.3 \log_{10}(N_i)]^{-1}$ (2.)
where Δ is considered to be a 'reasonable' interval, r is the range of values taken by the
data, and N_i is the number of data items under analysis.
As Δ is not usually a convenient value for plotting a histogram, an approximate value Δ' is
chosen. Table 3. summarises the Δ derivations and the choice of Δ'. The histogram interval for
data set 'A' and 'B' is 80 and 70 hours respectively.
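The interval calculation can be reproduced numerically. The sketch below evaluates equation (2.) for the range and item count of each data set as listed in Table 3; the function name is hypothetical.

```python
import math


def histogram_interval(data_range: float, n_items: int) -> float:
    """Equation (2.): interval = r / (1 + 3.3 * log10(N))."""
    return data_range / (1.0 + 3.3 * math.log10(n_items))


delta_a = histogram_interval(data_range=600, n_items=79)  # data set 'A'
delta_b = histogram_interval(data_range=500, n_items=80)  # data set 'B'
print(round(delta_a, 1), round(delta_b, 1))
```

The results agree with the tabulated intervals, which are then rounded to the more convenient bar widths of 80 and 70 hours.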
The distributions for data sets 'A' and 'B', yielding a correlation of 0.969 and 0.956
respectively with the Draper & Smith [1] non-linear regression method, are modelled by,
$\gamma \, \exp\{-\lambda_{LR} \, t\}$ (3.)
where γ is given by,
$\gamma = \lambda_{LR} \, N_i \, \Delta'$ (4.)
The exponential distribution of the failure data validates the memoryless characteristic
observation made earlier for the inter-arrival failure times.
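The histogram model can be evaluated numerically as a sanity check. The sketch below reads equation (4.) as the product of the regression failure rate, the number of data items, and the chosen bar width, which is an interpretation of the printed formula; it also assumes that t is measured from the start of the histogram, and the function name is hypothetical.

```python
import math


def expected_bar_height(t: float, rate: float, n_items: int, bar_width: float) -> float:
    """Equations (3.) and (4.): gamma * exp(-rate * t), with gamma = rate * N * bar width."""
    gamma = rate * n_items * bar_width
    return gamma * math.exp(-rate * t)


# Data set 'A': rate 0.00634 per hour, 79 failures, 80 hour bars.
height_a_start = expected_bar_height(0.0, 0.00634, 79, 80.0)
# Data set 'B': rate 0.00893 per hour, 80 failures, 70 hour bars.
height_b_start = expected_bar_height(0.0, 0.00893, 80, 70.0)
print(round(height_a_start, 2), round(height_b_start, 2))
```

Under these assumptions γ is the modelled failure count at the origin of the histogram, and later bars fall away exponentially at the regression failure rate.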
4.3. Controller Unavailability
Some detected failures were only successfully recovered after multiple attempts (every 3 µs)
to restore controller operation. Table 4 gives the down time distribution for the failures from
which automatic recovery was achieved. Multiple attempts to restore the system integrity
implies that the temporary fault causing the failure was still active when recovery action was
initiated. Alternatively, a burst of temporary faults may have occurred. The nature of the
collected failure data prevents further analysis or postulation.
Failure     r (hours)   N_i   Δ (hours)   Δ' (hours)
Data Set
A           600         79    82.6        80
B           500         80    68.7        70
Table 3. : Histogram Distribution Interval

Down Time   Number of Failures
(µs)        Data Set 'A'   Data Set 'B'
0 - 3       59             61
3 - 6       20             17
6 - 9       0              1
9 - 12      0              0
12 - 15     0              0
15 - 18     0              1
18 - ∞      0              0
Table 4. : Failure Induced Unavailability

Failure     Linear Regression                             Data Sample
Data Set    λ_m (hrs^-1)   MTTF (hrs)   Correlation      λ_s (hrs^-1)   MTTF (hrs)
A           0.00634        157.7        0.992            0.00676        147.8
B           0.00893        112.0        0.995            0.00850        117.6
Table 2. : Linear Regression Analysis












TIME (HOURS), 0 to 600
Solid line : from Data Set 'A'
Broken line : from Data Set 'B'
Figure 5 : Probability Density Function
5. RELIABILITY ASSESSMENT
Siewiorek & Swarz [9] describe four evaluation parameters and their inter-relationship for
failure distribution analysis: probability density function (pdf), cumulative density function
(Cdf), reliability, and hazard function. Pdf, 'f(t)', defines the probability of a failure
occurring at a specific time. Cdf, 'F(t)', defines the probability of failure occurring at or
before a specific duration of operation. Reliability, 'R(t)', is the probability of not
observing a failure before a specific duration of operation. Finally, the hazard function,
'Z(t)', is defined as the time dependent failure rate. Within this experiment the hazard
function appears time independent, denoted by the constant λ, hence equation (1.) becomes,
$Z(t) = \lambda$ (5.)
Basic reliability theory gives the following parameter relationship,












Z(t ) = 
TIME (HOURS), 0 to 600
Solid line : from Data Set 'A'
Broken line : from Data Set 'B'
Figure 6 : Cumulative Density Function




The probability density function is given by converting the time intervals, used by the
experimental results histogram, in equation (3.) into a general time parameter,
$f(t) = \lambda_{LR} \, \exp\{-\lambda_{LR} \, t\}$ (8.)
Reliability is evaluated using equation (7.),
$R(t) = \exp\{-\lambda_{LR} \, t\}$ (9.)
and the cumulative density function is evaluated from equation (6.),
$F(t) = 1 - \exp\{-\lambda_{LR} \, t\}$ (10.)
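The three functions for a constant failure rate can be evaluated directly. The sketch below uses the regression rates reported for data sets 'A' and 'B'; the function names and the sample times are chosen for illustration only.

```python
import math


def pdf(t: float, rate: float) -> float:
    """Probability density function: rate * exp(-rate * t)."""
    return rate * math.exp(-rate * t)


def reliability(t: float, rate: float) -> float:
    """Probability of no failure before time t: exp(-rate * t)."""
    return math.exp(-rate * t)


def cdf(t: float, rate: float) -> float:
    """Probability of failure at or before time t: 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t)


RATE_A = 0.00634  # failures per hour, data set 'A'
RATE_B = 0.00893  # failures per hour, data set 'B'

for hours in (0, 100, 200, 400, 600):
    print(hours,
          round(reliability(hours, RATE_A), 3),
          round(reliability(hours, RATE_B), 3))
```

Two standard identities hold for these functions: reliability and Cdf sum to one, and the ratio of pdf to reliability equals the constant hazard rate, which is consistent with equation (5.).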
The performance parameters of pdf, Cdf, and reliability are evaluated for failures attributed
to temporary faults and are plotted in Figures 5., 6., and 7. respectively. In each graph,
data set 'A' evaluations are shown as solid lines and data set 'B' as broken lines. The pdf
plot presents the inter-arrival failure times modelled for the two data sets in Figure 3(c).
and 4(c). Data set 'B' exhibits a higher failure rate than data set 'A', which may be due to
the aging of the controller or varying working conditions. Figure 6. plots the Cdf for each
data set, i.e. the running sum of the pdf coefficients, and clearly shows the higher failure
rate observed for data set 'B'. The effect of the observed failure rates on the controller
reliability for each data set is shown in Figure 7. Data set 'A' with the lower failure rate
has a higher reliability.
TIME (HOURS), 0 to 600
Solid line : from Data Set 'A'
Broken line : from Data Set 'B'
Figure 7 : Reliability
6. DISCUSSION
The aim of the work was to observe the effects of 
temporary faults on a real control system. The results 
presented add to the limited body of published data in 
this area. The failure history of a TMR microprocessor 
controller has been collected over a 17 and a 13 month 
period. During each period approximately 80 failures 
occurred, of which 96% were diagnosed as due to tem-
porary rather than permanent faults. The inter-arrival 
time of the failures, attributed to temporary faults, fol-
lows an exponential distribution. This observation sug-
gests that the controller was operated during its use-
ful period as specified by the Bathtub Curve commonly 
used by reliability engineers. These results demonstrate 
the validity of modelling the effects of temporary faults 
using techniques developed for permanent fault mod-
elling, with constant hazard rates. 
ACKNOWLEDGEMENTS
The authors wish to record their debt to former re-
search students Dr. J.C. Pearson and Dr. R.G. Halse 
who helped to gather the data on which the results in 
this paper are based. Continued support from the Sci-
ence and Engineering Research Council and from British 
Gas is also acknowledged. 
REFERENCES 
[1] Draper, N.R. & Smith, H., Applied Linear Regression, 
Wiley and Sons, 1981. 
[2] Fuller, S.H. & Harbison, S.P., The C.mmp Multiprocessor, 
Technical Report CMU-CS-78-146, 
Carnegie-Mellon University, October 1978. 
[3] Iyer, R.K. & Rossetti, D.J., A Statistical Dependency 
of CPU Errors at SLAC, Proc. FTCS-12 
(Santa Monica, CA), 1982, pp 363-372. 
[4] Lewis, E.E., Introduction to Reliability Engineering, 
John Wiley & Sons, New York, 1987. 
[5] Lin, T-T.K. & Siewiorek, D.P., Error Log Analysis: 
Statistical Modelling and Heuristic Trend Analysis, 
IEEE Trans. Reliability, Vol. 39, No. 4, 1990, pp 
419-432. 
[6] McConnel, S.R., Analysis and Modelling of Transient 
Errors in Digital Computers, Ph.D. Dissertation, 
Carnegie-Mellon University, Pittsburgh, PA, 
1981. 
[7] Morganti, M., Coppadoro, G. & Ceru, S., UDET 
7116 - Common Control for PCM Telephone Exchange: 
Diagnostic Software Design and Availability 
Evaluation, Digest 8th Int. Conf. on Fault Tolerant 
Computing, 1978, pp 16-23. 
[8] Pearson, J.C., Reliability of Small Digital Controllers, 
Ph.D. Thesis, University of Durham, England, 
1983. 
[9] Siewiorek, D.P. & Swarz, R.S., The Theory and 
Practice of Reliable Systems, Digital Press (Bedford, 
MA), 1982. 
868 G.A.S. Wingate, C. Preece 
[10] Siewiorek, D.P., Kini, V., Joobani, R. & Bellis, H., 
A Case Study of C.mmp, Cm*, and C.vmp: Part 1 - 
Experiences with Fault Tolerance in Multiprocessor 
Systems, Proc. IEEE, Vol. 66, No. 10, 1978, pp 
1178-1199. 
[11] Siewiorek, D.P., Kini, V., Joobani, R. & Bellis, H., 
A Case Study of C.mmp, Cm*, and C.vmp: Part 2 
- Predicting and Calibrating Reliability of Microprocessor 
Systems, Proc. IEEE, Vol. 66, No. 10, 1978, 
pp 1200-1220. 
[12] Woodbury, M.H. & Shin, K.G., Measurement and 
Analysis of Workload Effects on Fault Latency in 
Real-Time Systems, IEEE Trans. Soft. Engineering, 
Vol. 16, No. 2, 1990, pp 212-216. 
ERRATA: 
Three typographical errors appear in the above paper. The Mean and Standard Deviation 
are incorrectly marked on Figures 4b and 5b: the values given there should be ignored. 
Table 3 should mark as an integer. Correlation in Figure 4a should read '0.995'. 
