In this paper we consider a new approach to the modeling and analysis of transient faults in microprocessor based controllers. An instruction level perspective of performance is taken, which is the basis of a useful high level program state description of the controller. A transition matrix is defined which describes the response to transient fault arrivals. This approach permits theoretical transient fault analysis of a practical, useful nature. The requirements can be met with the addition of a small amount of hardware. An experimental system is described which has been implemented to demonstrate the feasibility of our approach.
Introduction
The essence of our approach to transient faults is to view the ramifications of the fault at an operational level in a digital system which has been specifically designed so that the extent of these ramifications is contained within a classifiable and observable set of states. In order to describe and demonstrate the feasibility of this approach, we will restrict our discussion in this paper to microprocessor controllers.
Microprocessor Controllers
Microprocessor controllers consist of: the central processor unit (CPU); a read-only memory (ROM), used for program storage; a read/write memory (RAM), used for data storage; input/output (I/O) devices; and support logic.
Controller software can be characterized as consisting of scanning routines for reading input lines, and a relatively simple data processing section followed by output routines. Inputs can be used to select between several otherwise independent function subroutines. Considerable use is made of lookup tables to compensate for relatively slow execution speed and limited instruction sets. The software requirements are such that there is typically a "large" ROM of several thousand bytes, and a "small" RAM of several hundred bytes.
It is evident that a thorough understanding of fault phenomena in these controllers is important. In particular, it has been shown that transients comprise a significant fault category [1,2].
Faults and Upsets
Transient faults can be divided into two categories: internal and external. Numerous studies have been reported in the literature [...]. Such an environmental condition induces noise on power supply and signal lines. As before, useful data relative to operational implications regarding these external causes of transient faults is generally not available.
*This research was supported by NASA Grant NSG1442.
If fault data were available, it would most probably still be of questionable value in studying transient faults in microprocessor controllers. The perturbations are usually analog variations that seldom resemble digital signals. The actual fault caused can be an internal logical signal modification satisfying neither the logic high nor low conditions. In such a case it would not be possible to predict how a digital circuit would react. Moreover, while the microprocessor is a clocked, synchronous digital system, actual faults are seldom "well-behaved," such that the fault is only synchronously active or inactive with the system clock. Hence, this low level approach to transients is generally not applicable for practical use, and this leads us to a higher level perspective.
At the upset level, the operation of the digital system is completely described in terms of a finite set of mutually exclusive system states. This set must include all possible valid system states, but, as will be discussed in more detail in the next section, this set must also include invalid states, not explicitly designed into the system but into which the system can nevertheless be driven by a transient. In effect, the universe of system operation (valid and invalid) must be contained within this set of states. Hence, we will refer to such a set of states as the containment set. Clearly, the details of a containment set are dependent upon the system and its application. System modifications will usually be necessary to ensure a finite, observable set. We will see that this is not an overly restrictive requirement for microprocessor controllers.
Upset Characterization
We are naturally led to an upset model that involves the program being executed. For microprocessor controllers, programs can be classified as one of two types: those which exit after execution, and those which "continuously" loop. An exiting program performs a calculation or function and then terminates. Loop programs jump back to their start after each task execution. Programs so classified are not required to loop constantly, but only for a long time relative to an exiting program. In the following, only loop programs will be considered. This is not particularly restrictive, since controller programs are most often loops [5], and a number of exiting programs executed in sequence can also form a single loop. A typical loop program monitors inputs, processes the input data, transfers results to the outputs, and branches back to the start.
The program memory contains all of the loop programs normally executed. Large loops can be broken down into several smaller loop programs between which the processor can jump, dependent upon input conditions, but at a slow rate relative to the execution times of these small loop programs.
Within this framework of a microprocessor controller, the loop programs are natural choices for high level program states. The operational status of the controller can then be given as a specification of the particular loop program currently under execution. In addition to valid loop programs that were explicitly written for the application, there can be invalid embedded loop programs into which the system can be driven by a transient fault.
Our perspective on transient phenomena can be summarized as follows: an actual transient fault causes an error which results in an upset of one of three types: a data change, a program bump, or a program transition. The data change and program bump upsets are the least significant of the three upset types, because their effects on the controller's operation are temporary, and the controller either continues or quickly returns to the execution of the proper loop program. In either case, standard techniques such as voting [6] or error correcting codes [7] can rectify the situation. In many control applications, such as those in which a mechanical device is being operated, temporary data changes or program bumps occur at a rate exceeding the device's capacity to respond, while a program transition upset resulting from a transient fault is a steady state operational deviation which can have most serious consequences. Indeed, a transition into an invalid embedded loop program is usually referred to as a system crash. Hence, in the following we will concentrate on these steady state upsets that are produced by transient faults.
Containment Set Transitions
Since our concern is with transitions among the loop programs in the containment set, we can use a transition matrix $T$ [...]
[...] is the more transient fault tolerant [...]
[...] $T^k$ for large $k$ becomes more important than $T$.
Interestingly, we see that comparison of $\lim_{k \to \infty} (T^*)^k$ with $\lim_{k \to \infty} (T^{**})^k$ shows that $T^*$ is the more fault tolerant implementation for multiple upsets.
This shows that knowledge of the transition matrix for containment set transitions is a very useful analysis, evaluation, and validation tool.
In the following section, we will discuss the hardware/software requirements imposed upon the controller such that this upset level perspective can be exploited.
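The role of powers of the transition matrix under repeated upsets can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-state containment set (state 0 a valid loop program, state 1 an invalid embedded loop program), the row convention (entry `[i][j]` is the probability that one upset moves the controller from state `i` to state `j`), and the two hypothetical implementations `T_A` and `T_B`, whose numbers are invented and are not measurements from this work.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(t, k):
    """Return t raised to the k-th power by repeated multiplication."""
    n = len(t)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, t)
    return result


def valid_probability(t, k, start=0, valid=0):
    """Probability of being in the valid state after k upsets, starting from `start`."""
    return mat_pow(t, k)[start][valid]


# Hypothetical implementations (illustrative numbers only).
# Row i, column j: probability that one upset moves state i to state j.
T_A = [[0.90, 0.10],
       [0.50, 0.50]]
T_B = [[0.95, 0.05],
       [0.10, 0.90]]

if __name__ == "__main__":
    for k in (1, 5, 200):
        print(k, valid_probability(T_A, k), valid_probability(T_B, k))
```

With these made-up numbers, `T_B` keeps the controller in the valid loop more often after a single upset, yet `T_A` does better after many upsets because it leaves the invalid loop more readily. This is the kind of reversal that makes the large-$k$ behavior of $T^k$ worth examining alongside $T$ itself.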
Controller Requirements
The realization of a microprocessor controller in which upsets can be characterized as loop program transitions requires that some specific, but not excessive, attributes be designed into the system. It is necessary that program execution be forced to be only from ROM. This ROM restriction guarantees that the set of loop programs remains fixed during system operation, as long as no permanent faults occur. If control states such as HALT or HOLD are permissible in the CPU, then these must be included in the loop program set {L} as special cases, or access to these states must be restricted. Programs must be written such that invalid data values cannot produce loops in programs which normally are exiting routines. The structure of program flow should not be affected by out-of-domain inputs.
The entire processor address space must be defined. Normally, most of this space will not be utilized. When detection hardware flags a bad state, a nonmaskable interrupt or reset can be used to cause error recovery. It should be appreciated that the additional hardware sections require on the order of two to three integrated circuits.
The processor must be executing instructions, and those instructions are contained in ROM. The loop programs provide a true containment set, because they are mutually exclusive and complete. It is then guaranteed that, while a transient fault can arrive and perturb the controller, when the fault disappears, the controller will be executing one of the predefined loop programs (valid or invalid), and is therefore in one of the containment states. The controller can then be considered at the upset level where the upsets will at most move the system from one containment state to another.
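The containment argument can be pictured with a deliberately simplified toy model, offered here only as a sketch. It assumes fixed inputs and treats the ROM as a deterministic successor map in which every address has exactly one next address; real programs branch on data and inputs, so this is an abstraction and not the controller described here. The eight-address map `NEXT` and the function names are hypothetical. Under these assumptions, following the map from any address must end in a cycle, and the distinct cycles play the role of the mutually exclusive loop programs, valid or invalid.

```python
def find_loop(next_addr, start):
    """Follow the successor map from `start`; return the cycle reached, as a frozenset."""
    first_seen = {}   # address -> position in `path` where it was first visited
    path = []
    addr = start
    while addr not in first_seen:
        first_seen[addr] = len(path)
        path.append(addr)
        # A KeyError here would mean an undefined address, which the
        # "entire address space must be defined" requirement rules out.
        addr = next_addr[addr]
    return frozenset(path[first_seen[addr]:])


def containment_set(next_addr):
    """Collect every distinct cycle reachable from any address."""
    return {find_loop(next_addr, a) for a in next_addr}


# Hypothetical, fully defined 8-address "ROM" (illustrative only).
NEXT = {
    0: 1, 1: 2, 2: 0,   # a loop, e.g. one written for the application
    3: 4, 4: 3,         # a second, possibly unintended, embedded loop
    5: 0, 6: 3,         # addresses that fall into one of the loops
    7: 7,               # a self-loop, standing in for a HALT-like state
}

if __name__ == "__main__":
    for loop in sorted(containment_set(NEXT), key=min):
        print(sorted(loop))
```

Wherever a transient leaves the program counter in this model, execution settles into exactly one of the three cycles, which mirrors the claim that the loop programs are mutually exclusive and complete.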
Experimental System
To demonstrate the feasibility of an upset level perspective on transient faults in microprocessor controllers, an experimental system has been developed, consisting of a gold controller, a faulty controller, and a monitor processor. The controllers use the Intel 8080 microprocessor. The gold and the faulty controllers are programmed to synchronously run identical programs under fault-free conditions. However, the faulty controller has random noise superimposed on the DC level of its power supply. This choice of a transient fault source was made because it is a realistic fault phenomenon that can exist in a system for many different reasons. The controller buses are compared, and differences logged by the monitor processor. With this experimental system, the transition matrix can be measured.
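One plausible way to turn logged observations into a transition matrix estimate is sketched below. It assumes, hypothetically, that the monitor's log has already been reduced to pairs `(state_before, state_after)` of containment state indices, one pair per observed upset; the log format, the function name, and the sample data in `LOG` are all invented for illustration and are not results from the experimental system.

```python
from collections import Counter


def estimate_transition_matrix(observations, n_states):
    """Estimate row-stochastic transition probabilities from (before, after) pairs.

    Rows with no observations are returned as lists of None, since no
    estimate can be formed for them.
    """
    counts = Counter(observations)
    matrix = []
    for i in range(n_states):
        row_total = sum(counts[(i, j)] for j in range(n_states))
        if row_total == 0:
            matrix.append([None] * n_states)
        else:
            matrix.append([counts[(i, j)] / row_total for j in range(n_states)])
    return matrix


# Made-up log: state 0 = valid loop program, state 1 = invalid embedded loop.
LOG = [(0, 0), (0, 0), (0, 1), (0, 0), (1, 1), (1, 0)]

if __name__ == "__main__":
    for row in estimate_transition_matrix(LOG, 2):
        print(row)
```

Relative frequencies are the simplest estimator; with few observations per row the estimates are coarse, so in practice many upsets per starting state would be needed.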
Conclusions
It can be shown that digital system fault phenomena can be viewed through a high level upset perspective. The concept of a containment state has been introduced to provide a base from which upsets can be explored in a usable format. The transition matrix can probabilistically describe the operational nature of the controller/program/fault complex, and it is seen to be a practical tool for a variety of fault tolerant uses. Specific application to an 8080 based controller has been made. For a given program, the loop program set has been generated, and transient induced jumps to each possible loop program have been observed experimentally. No other steady state conditions have been found. The framework for consideration of more complex systems has been established, for which the principal extension requirement is the definition and containment of operational states.
