This paper presents a high-level design for a reliable computing platform for real-time control applications. The design tradeoffs and analyses related to the development of a formally verified reliable computing platform are discussed. The design strategy advocated in this paper requires the use of techniques that can be completely characterized mathematically as opposed to more powerful or more flexible algorithms whose performance properties can only be analyzed by simulation and testing. The need for accurate reliability models that can be related to the behavior models is also stressed. Tradeoffs between reliability and voting complexity are explored. In particular, the transient recovery properties of the system are found to be fundamental to both the reliability analysis and the "correctness" models.
A Science of Reliable Design

Introduction
Researchers at NASA Langley Research Center (LaRC) have initiated a major research effort towards the development of a practical validation and verification methodology for digital fly-by-wire control systems. The validation process for such systems must demonstrate that these systems meet stringent reliability requirements. Flight critical components of commercial aircraft should have a probability of failure of at most 10^-9 for a 10 hour mission [1]. Under such severe reliability requirements, design errors, also referred to in the literature as generic errors, cannot be tolerated. Thus, the validation problem for life-critical systems can be decomposed into two major tasks:
1. Quantifying the probability of system failure due to physical failure.
2. Establishing that design errors are not present.
Since current technology cannot support the manufacturing of electronic devices with failure rates low enough to meet the reliability requirements directly, fault-tolerance strategies must be utilized that enable the continued operation of the system in the presence of component failures. The first ... Mathematical reliability models provide the foundation for a scientific approach to fault-tolerant system design. Using these models, the impact of architectural design decisions on system reliability can be analytically evaluated. Reliability models are constructed that abstractly account for all possible physical failures and all system recovery processes. In the analysis, physical failures must be enumerated and their failure rates determined. The fault arrival rates for physical hardware devices are available from field data or empirical models [4]. The fault recovery behavior of a system is dependent upon the particular fault-tolerant system architecture and must be determined by experimentation or by formal analysis.
The justification for building ultra-reliable systems from replicated resources rests on an assumption of failure independence among redundant units. This is a reasonable assumption when the redundant units are electrically isolated (i.e. located in separate chassis and using different power supplies). The alternative approach of modeling and experimentally measuring the degree of dependence is infeasible, see [SI. The unreliability of a system of replicated components with independent probabilities of failure can easily be calculated by multiplying the individual probabilities. Thus, the independence assumption provides the means to obtain ultra-reliable designs using moderately reliable parts. Complex systems constructed from components with interdependencies (e.g. due to shared memories, shared power supplies, etc.) can be modeled (assuming perfect knowledge about the failure dependencies) and the system reliability can still be computed. Of course, the reliability models can become very complex and the analysis intractable.
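The multiplication rule for independent failures can be illustrated with a minimal sketch. The per-unit failure probability used here is a hypothetical value chosen for illustration, and the function names are not taken from the paper.

```python
from math import comb

def all_fail(p: float, n: int) -> float:
    """Probability that all n independent units fail, each with probability p."""
    return p ** n

def majority_lost(p: float, n: int) -> float:
    """Probability that the working units no longer form a strict majority of n."""
    # A strict majority is lost once the number of failed units f
    # satisfies n - f <= n / 2; the smallest such f is n - n // 2.
    min_failed = n - n // 2
    return sum(
        comb(n, f) * p ** f * (1.0 - p) ** (n - f)
        for f in range(min_failed, n + 1)
    )

if __name__ == "__main__":
    p_unit = 1e-3  # hypothetical per-unit failure probability for one mission
    for n in (1, 3, 5):
        print(n, all_fail(p_unit, n), majority_lost(p_unit, n))
```

Both functions rely entirely on the independence assumption; with correlated failures neither product form is valid.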
The validity of a reliability analysis depends critically upon the accuracy of the reliability model. If the reliability model omits certain failure mechanisms or the representation of the recovery behavior is overly optimistic, the predicted probability of failure is inaccurate. This might occur, for example, if there are errors in the logical design or in the implementation of the fault-recovery strategy. Any validation methodology must address the "correctness" of the reliability model with respect to the actual implementation. Ultimately, a mathematical mapping between the implementation and the model must be constructed. Thus, the two validation tasks are essentially demonstrations of "correctness". Although the quantification task involves reliability models, experimental data, and numerical calculation, model correctness must also be established.
The Role of Formal Methods
A major difference between the development effort presented in this paper and most other efforts is the use of formal methods. This approach is born from the belief that the successful engineering of complex computing systems will require the application of mathematically based analysis analogous to the structural analysis performed before a bridge or airplane wing is built. The applied mathematics for the design of digital systems is logic, just as calculus and differential equations provide the mathematical tools used in other engineering fields.
It is often assumed that the application of formal methods is an "all or nothing" affair. This is not the case. Different levels of application are both possible and recommended. The following is a useful taxonomy of the degrees of rigor in applying formal methods:
Level 0: No application of formal methods.
Level 1: Formal specification of the system.
Level 2: Paper and pencil proof of correctness.
Level 3: Formal proof checked by mechanical theorem prover.
Significant gains in assurance are possible in existing design methodologies by formalizing the operating assumptions and constraints, the specification, and the implementation of a system in some formal mathematical notation. Experience shows that application of formal methods to level 1 alone often reveals inconsistencies and subtle errors that might not be caught until much later in the development process, if at all.
The use of paper and pencil proof in the design process adds a second level of assurance in design correctness. The ...
A View of Digital Flight Control Systems
The control system architecture for aerospace vehicles can be viewed as hierarchical, as shown in figure 1. Each level in the hierarchy represents a different aspect of the design process and entails different validation and verification issues. The top level represents the aerodynamic properties of a rigid body controlled by maneuverable surfaces. The second level represents the continuous-time feedback-control functions that operate on the aerodynamic vehicle. The third level represents the block-diagram specification of the control laws. The fourth level represents the implementation of the control laws in an executable programming language. The fifth level describes the system that dispatches the control-law code on a set of redundant hardware in a manner that provides fault tolerance. The sixth level represents the hardware components of the system. In this project, the design and verification issues at the bottom two levels of the hierarchy are being explored.
Figure 2 illustrates how the hierarchy above can be further refined (see footnote 2).
Traversing the horizontal hierarchy at the coarsest level of abstraction reveals the control application domain, which is built on the reliable computing platform. These in turn view the state of the aircraft through the sensor/actuator network. Each of these abstractions is decomposed into sublevels discussed below. The rationale for choosing the major system interfaces at the points noted in figure 2 is based on notions of reusability, a partitioning of the areas of technical expertise, and the interfaces found in most computing systems in use today.
The control application domain abstraction isolates one of the two main application-specific aspects of the control system. The most abstract view at this level might be a system of continuous differential equations modeling the control surfaces and aerodynamic properties of the aircraft. Abstractions below this level include the block-diagram specification of the control laws and, at the lowest level, implementation of the control laws in an executable programming language on the underlying reliable computing platform. Obviously, correctness at each level is as important as the correctness of the computing platform. Formal methods can have an impact on correctness in areas in the control application domain; however, these issues are not addressed here.
The reliable computing platform dispatches the control-law code for execution on the underlying hardware and provides the interface to the network of sensors and actuators. Traversing the hierarchy within the reliable computing platform abstraction reveals two boxes, one representing the operating system and the other representing the underlying replicated processors. The operating system provides the interface to the bottom level of the control application domain, the application code. The replicated processor level provides the physical interface to the sensor/actuator network.
The third component of the control system is the network of sensors and actuators. Like the control application domain, the sensor/actuator network is highly application dependent. Because of the application-specific nature of this part of the system, we consider this component to be outside of the reliable computing platform and do not specifically address it here, although attributes of the sensor/actuator network must be included in any overall system reliability model.
Requirements for a Reliable Computing Platform
The interface between the application code and operating system levels determines the functional requirements for the reliable computing platform. The reliability requirements for aircraft applications have been determined by the regulatory agencies.
Footnote 2: Boxes contained within another box denote horizontal hierarchy or system interfaces. The dependence of an interface on a resource is indicated by placing the dependent box above the box denoting the resource. Adjacent boxes at the same level within the horizontal hierarchy indicate independent resources. Thus, in figure 2 the operating system is dependent on the replicated processors for implementation; however, the individual processors are not dependent on one another. Nested blocks denote vertical hierarchy or successive levels of abstraction.
We will not explicitly address performance requirements here although they are a critical aspect for the success of the system. Most of the functionality supporting the system's fault tolerance will be implemented in hardware to avoid the performance overhead suffered by the implementation of SIFT [SI.
Functional Requirements
The following is a summary of the most important requirements generated by typical aircraft control-law application tasks:
o Hard deadlines
o Multi-rate cyclic scheduling
o Upper bound on task execution time
o Intertask communication
The hard-deadline requirement means that a task must be dispatched and complete within a strict time boundary. In particular, the time delay between reading a sensor and sending a signal to an actuator, the transport delay, must be strictly less than a predetermined value. The required periods of execution are different for different tasks. Thus, the system must perform multi-rate scheduling. Associated with each task is an upper bound on execution time. If a task receives input from another task that has the same execution period, the receiving task must execute after the source task. Thus, within a "period-class", there is a precedence ordering on the tasks. The relationship between different tasks with different execution periods is not constrained.
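The precedence ordering within a period-class can be pictured as a topological ordering of the tasks by their data dependencies. The following is a minimal sketch using Python's standard graphlib module; the task names and dependencies are hypothetical and are not taken from the paper.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks in one period-class. Each key maps to the set of
# tasks whose output it reads, so those tasks must execute first.
READS_FROM = {
    "read_sensors": set(),
    "pitch_law": {"read_sensors"},
    "roll_law": {"read_sensors"},
    "mix_commands": {"pitch_law", "roll_law"},
}

def dispatch_order(reads_from):
    """Return one execution order in which every task runs after its sources."""
    return list(TopologicalSorter(reads_from).static_order())

order = dispatch_order(READS_FROM)

if __name__ == "__main__":
    print(order)
```

Several valid orders may exist; any of them satisfies the requirement that a receiving task executes after its source task. A cyclic dependency would raise graphlib.CycleError, signalling that no valid ordering exists.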
Reliability Requirements
Fault-tolerant architectures use replicated hardware resources and majority voting to enable continued operation of the system in the presence of component failures due to physical faults. The operating system provides the applications software developer a reliable mechanism for dispatching periodic tasks on a fault-tolerant computing base that appears to him as a single ultra-reliable processor.
We are concerned with the most general type of faulty behaviors: Byzantine or malicious faults in which a producer can exhibit arbitrary behavior, or lie, to its consumers, sending each different information. In our case, producers are processors or sensors and consumers are other processors or actuators.
Figure 2: Digital Flight Control System Architecture
There ... down upon failure. In our system, there is no fault detection function since it is non-reconfigurable. Therefore a minimal-voting strategy is used to flush the effects of transient faults. In other systems where internal voting is used for fault detection (and reconfiguration), the minimal-voting strategy employed here may not be appropriate. However, if a fail-stop strategy such as self-checking pairs is used for fault detection, the minimal-voting approach may still be useful and efficient.
Field data indicates that transient faults are significantly more likely than permanent faults [9]. If all faults are considered to be permanent, voting need only occur at the actuators to mask faults. Similarly, if all portions of the dynamic state of the system were recoverable from sensor inputs then, eventually, the effects of transient faults would be flushed from the system. Typically, most of the state of the system is held in the aircraft itself; however, there are data that cannot be regenerated from sensor inputs. For example, in a frame synchronous scheduling regime, the operating system must keep track of which frame is scheduled for execution next. This is critical data that must be stored in volatile memory and cannot be recovered from sensor inputs.
Balancing the Requirements
The drive for increased functionality is often pursued without regard to its impact on system reliability. The failure probability of the system has two contributors: (1) physical failure and (2) design flaws (see footnote 3). The graph in figure 3 shows the conjectured failure probability due to each of these contributors as a function of system complexity. The top curve represents the total probability of failure. We have opted for a less complex system in order to produce the best reliability.
Footnote 3: Although it is infeasible to measure the contribution of the design flaws in the ultrareliable regime, its effect can be discussed theoretically.
Previous Efforts
Many techniques for implementing fault-tolerance through redundancy have been developed over the past decade, e.g. ... In FTMP, for example, the unit of reconfiguration is a memory module or a CPU module. In SIFT, FTP and MAFT, the unit of reconfiguration is an entire processor. In a reconfigurable system, voting can be used to detect faults. In the architecture considered here it is assumed that faulty processors are not removed until after the mission is over. The operating system does not utilize error reports from the voter. However, it may be desirable to store these reports in memory for later use by ground maintenance personnel.
Differences between previously developed systems naturally arose from different design decisions. However, an often overlooked but significant factor in the development process is the approach to system verification. In SIFT and MAFT, serious consideration was given to the need to mathematically reason about the system. In FTMP and FTP, the verification concept was almost exclusively based on empirical testing. Obviously, the approach advocated here is one of formal rigor in specification and verification of the system.
Although several fault-tolerant real-time computing bases have been designed for control applications [6, 10, 11, 12], only the SIFT project attempted to use formal methods. Although many positive theoretical advances were made, the SIFT operating system was never completely verified [13]. On the positive side, the concept of Byzantine Generals algorithms was developed [14], as was the first fault-tolerant clock synchronization algorithm with a mathematical performance proof [15].
Unlike the SIFT models, which did not present an operational view of the scheduling function of the system, the models described in [2, 3] deal with this functionality in some detail. The SIFT specification was given from the perspective of an individual task. The specification defined the behavior of a task given inputs from other tasks. However, it did not describe the required behavior of the scheduling system. It roughly stated that, if a task were executed and given stable inputs, the output would be correct as long as the system had enough non-faulty hardware. Although there was an abstract notion of execution windows for the tasks, there was no specification of the requirement that the operating system must dispatch tasks according to this schedule. Thus, the specification approach was lacking in some important ways. Nevertheless, many of the design/verification concepts used in the SIFT project have been adopted in this project.
Design of the Reliable Computing Platform
Management of the replicated resources that implement the required fault tolerance is a complex systems problem. The fundamental problem is the elimination of all single-point failures. Clearly, a shared voter is insufficient. The voter itself must be distributed! A second difficulty arises from the fact that a distributed voter can only mask errors ... algorithms can be incorporated into the fabric of a distributed system is at the heart of fault-tolerant system design.
Traditionally, the operating system has been implemented as an executive (or main program) that invokes subroutines implementing the application tasks. Communication between the tasks has been accomplished by use of shared memory. This strategy is effective for systems with nominal reliability requirements where a single processor can be used. For ultrareliable systems, the additional responsibility of providing fault tolerance makes this approach untenable.
The operating system and replicated computer architecture are designed together so that they mutually support the goals of the reliable computing platform. A four-level hierarchical decomposition of the reliable computing platform is shown in figure 4.
The design philosophy advocated in this paper is to design the system in a manner that minimizes the amount of experimental testing required to validate the system reliability models and maximizes the ability to mathematically reason about correctness. Ultimately, the quantification of system reliability must be made on the basis of a mathematical model of the system and the correctness of the model must be demonstrated. The complexity and number of parameters that must be measured should be minimized in order to reduce the cost of the verification and validation process. The following design decisions have been made for the initial version of the system toward that end:
o the system is non-reconfigurable
o the system is frame-synchronous
o the scheduling is static, non-preemptive
o internal voting is used to recover the state of a processor affected by a transient fault
Discussion of each point is deferred to following sections.
Frame synchronous systems are common in aircraft control applications with hard real-time deadlines, as is static non-preemptive scheduling.
The Uniprocessor Model
The top level of the hierarchy describes the operating system as a function that sequentially invokes application tasks. It extends the executive model by supporting a more sophisticated model of inter-task communication. This view of the operating system will be referred to as the uniprocessor model. The uniprocessor model is formalized as a state transition system and provides the most abstract specification of the operating system.
There are two major design issues at this level: the choice of the scheduling strategy and the choice of intertask communication strategy. There are many theoretical approaches to scheduling multi-rate periodic tasks. Scheduling can be classified as either (1) preemptive or non-preemptive or (2) dynamic or static. Unfortunately, the theoretical results cannot guarantee that the hard deadlines will be met for any of the non-static or preemptive algorithms capable of scheduling the real-time control application tasks [17]. Consequently, all commercial aircraft control systems have been implemented using a static, non-preemptive schedule table. The intertask communications problem is simplified by the fact that tasks need only receive data produced by other tasks after they have terminated. This can be implemented by use of data buffers managed by the operating system.
The non-preemptive, static approach simplifies the design and verification of the operating system. In some ways, this merely transfers the burden of efficient scheduling to the designer of the schedule table. However, there are many ways to automate the generation of static schedule tables. It is envisioned that an off-line schedule generation program would be developed and formally verified. The generated schedule table resides in the memory of the processors in the system. It is the responsibility of the operating system to dispatch the tasks in accordance with the static tables.
The static table consists of a sequence of "frames". Each frame contains a set of tasks which must be executed. The complete sequence of frames is referred to as a "cycle" or a "major frame". This cycle is repeatedly executed in response to clock interrupts. Multi-rate scheduling is accomplished by placing a task in the table in multiple places. This is illustrated in figure 5.
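The structure of a static schedule table can be sketched as follows. This is an illustrative assumption rather than the schedule generator envisioned in the paper: the task names, the periods expressed in frames, and the placement of every task at offset zero are all hypothetical, and the sketch does not check execution-time bounds.

```python
def build_schedule_table(periods, frames_per_cycle):
    """Build a static table: one list of task names per frame.

    periods maps a task name to its period measured in frames. A task
    with period k is placed in every k-th frame, which is how a single
    table yields multi-rate scheduling.
    """
    table = [[] for _ in range(frames_per_cycle)]
    for task, period in periods.items():
        if period <= 0 or frames_per_cycle % period != 0:
            raise ValueError(f"period of {task} must divide the cycle length")
        for frame in range(0, frames_per_cycle, period):
            table[frame].append(task)
    return table

def dispatch(table, tick):
    """Return the tasks to run at a clock interrupt; the cycle repeats."""
    return table[tick % len(table)]

if __name__ == "__main__":
    # Hypothetical task set: periods of 1, 2 and 4 frames in a 4-frame cycle.
    table = build_schedule_table(
        {"inner_loop": 1, "outer_loop": 2, "monitor": 4}, frames_per_cycle=4
    )
    for tick in range(8):
        print(tick, dispatch(table, tick))
```

Because the table is fixed before flight, the dispatcher reduces to an index computation, which is part of what makes the static approach easier to reason about formally.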
The Synchronous Replicated Model
The second level in the hierarchy describes the operating system as a synchronous replicated system where each processor executes the same application tasks. The existence of a global time base, an interactive consistency mechanism and a reliable voting mechanism are assumed at this level. The formal details of the model, specified as a state transition system, are described in [2]. Also at this level, a model of processor faults is developed. Suffice it to say here that the fault model is a worst case model in which nothing is known about any faulty processor.
The replicated synchronous model implements the uniprocessor model by voting results computed on the replicated processors. The correctness notion is based on majority. As long as a majority of the processors are working and a majority of them have been working since the start of the computation, then the replicated machine will produce the same results as the uniprocessor model.
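The majority notion can be made concrete with a minimal voter sketch. It assumes that replicated results can be compared for exact equality; the function name and the sample values are hypothetical.

```python
from collections import Counter

def majority_vote(values):
    """Return the value reported by a strict majority of replicas, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    # A strict majority means more than half of all replicas agree.
    return value if count > len(values) / 2 else None

if __name__ == "__main__":
    print(majority_vote([7, 7, 7, 3]))   # one faulty replica is outvoted
    print(majority_vote([1, 2, 1, 2]))   # no strict majority exists
```

With four replicas, one arbitrary faulty value is masked, while two disagreeing faulty values leave no strict majority, which matches the condition that a majority of the processors must be working.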
The primary design decisions at this level are whether the system is reconfigurable and where in the data path voting is to occur.
There is ample evidence that robust implementation of online processor reconfiguration is an extremely difficult problem. The Fault-Tolerant Processor (FTP) [ ... A design flaw has been discovered in both FTMP and FTP which leads to the removal of a good processor rather than the faulty processor in the presence of a single injected fault [19, 18]. The FTP and FTMP are both highly respected and successful research efforts that have pushed the state-of-the-art in fault-tolerant system design. These errors point to the fact that experienced computer architects, with expertise specifically in areas of fault-tolerant system design, are not immune to the problem of design flaws (see footnote 4). From these experiences we conclude that the online fault-diagnosis and reconfiguration problem is ripe for the application of formal methods and we intend to pursue this avenue in future research efforts. However, for the initial effort reported on here we have chosen not to address reconfiguration.
Voting can take place at a number of locations in the system, and associated with each choice are various tradeoffs.
Footnote 4: It should be pointed out that CSDL never claimed to produce error-free software. In fact, the Draper team specifically concentrated on the physical failure problem. CSDL is aware of the design flaw problem and has also become interested in pursuing formal methods.
Voting is dependent upon two system activities: (1) the redundant processing sites must synchronize for the vote and (2) single source input data must be sent to the redundant sites using interactive consistency algorithms to ensure that each processor uses the same inputs for performing the same computations. As mentioned above, both these activities are assumed at this level of abstraction.
Voting can take place at different locations along the data path with differing impacts on the level of clock synchronization required. If voting takes place at the instruction level, synchronization must be very tight. If outputs are voted only after task execution is complete, loose synchronization is possible, lessening the computational burden required for clock synchronization. Thus, the design decisions made at this level impact the implementation at lower levels of abstraction.
If voting occurs only at the actuators and the internal state of the system (contained in volatile memory) is never subjected to a vote, a single transient fault can permanently corrupt the state of a good processor. This is an unacceptable approach since field data indicates that transient faults are significantly more likely than permanent faults [9]. An alternative voting strategy is to vote the entire system state. This approach purges the effects of transient faults from the system; however, the computational overhead for this approach may be prohibitive. We observe that voting need only occur for system state that is not recoverable from sensor inputs. This approach accomplishes recovery from the effects of transient faults at greatly reduced overhead, but involves increased design complexity.
The formal models presented in [2] provide a precise characterization of the minimum voting requirements for a fault-tolerant system that purges the effects of transient faults.
There is a trade-off between the rate of recovery from transient faults and the frequency of voting. The more frequent the voting, the faster the recovery from transients, but at the price of increased computational overhead.
Asynchronous Replicated System
Fault tolerance is achieved by voting results computed by the replicated processors operating on the same inputs. Interactive consistency checks on sensor inputs and voting of actuator outputs require synchronization of the replicated processors. This implies the existence of a global time base. In the absence of technology supporting manufacture of ultra-reliable clocks, electrically isolated processors cannot share a single clock. Thus, a fault-tolerant implementation of the uniprocessor model must ultimately be an asynchronous distributed system.
Reasoning about asynchronous distributed systems is notoriously difficult (see footnote 5). Serious validation problems have appeared in previous efforts due to the decision to deal with the inherent asynchrony at the application level. The AFTI F16 provides a good example of the problems that can arise when asynchrony is present at the application level. There was a significant problem with false alarms caused by design oversights traced to the asynchronous computer operation [21]. Also, the ability to set effective thresholds for the redundant sensor selection algorithms was seriously hampered. Thresholds should be tight to filter the effects of failed sensors. Unfortunately, the thresholds had to be set at 15% to eliminate false alarms due to the asynchrony. But with such a large threshold, a single channel failure can cause large aircraft transients. Thus, it is advantageous to deal with the complexities due to asynchrony at the lowest possible level in the system. This isolates the difficulties to a single clock synchronization function. With a fault-tolerant clock synchronization algorithm at the base of the operating system, the rest of the operating system can be designed in a synchronous manner. The advantages of this approach are discussed in [22].
Footnote 5: In fact, Lehman and Shelah [20] claim the analysis of such systems is an order of magnitude more difficult than reasoning about simply sequential systems.
At the asynchronous replicated system level, the assumptions of the synchronous model must be discharged. In [23], Rushby and von Henke report on the formal verification of Lamport and Melliar-Smith's [15] interactive-convergence clock synchronization algorithm. This algorithm can serve as a foundation for the implementation of the replicated system as a collection of asynchronously operating processors. Elaboration of the asynchronous layer design will be carried out in Phase 2 of the research effort.
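The averaging idea behind interactive convergence can be sketched in a few lines. This is a simplified illustration only: it assumes that each processor can read every clock exactly and instantaneously, it ignores read error and message delay, and the clock values and the threshold delta are hypothetical.

```python
def convergence_step(readings, own_index, delta):
    """Return the corrected clock value for the processor at own_index.

    Each reading is compared with the processor's own clock. A reading
    that differs by more than delta is treated as untrustworthy and is
    replaced by the processor's own value (a difference of zero) before
    the differences are averaged.
    """
    own = readings[own_index]
    total = 0.0
    for value in readings:
        diff = value - own
        if abs(diff) > delta:
            diff = 0.0
        total += diff
    return own + total / len(readings)

if __name__ == "__main__":
    # Three good clocks and one wildly wrong clock (hypothetical values).
    readings = [100.0, 100.2, 99.9, 250.0]
    corrected = [convergence_step(readings, i, delta=1.0) for i in range(3)]
    print(corrected)
    print(max(corrected) - min(corrected))
```

In this run the spread among the three good clocks shrinks after one round even though the faulty clock is far off, because its reading is discarded by the threshold test. A Byzantine clock could report different values to different processors, a case this sketch does not exercise.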
Hardware/Software Implementation
Final realization of the reliable computing platform is the subject of the Phase 3 effort. The research activity will culminate in a detailed design and prototype implementation. The hardware architecture assumed for the implementation of the replicated system is an N-modular redundant (NMR) system with a small number N of processors. Single-source sensor inputs are distributed by special purpose hardware executing a Byzantine agreement algorithm. Replicated actuator outputs are all delivered in parallel to the actuators, where force-sum voting occurs. Interprocessor communication links allow replicated processors to exchange and vote on the results of task computations. This is illustrated in figure 6.
Overview of the Verification
In [2, 3] we provide the details of the formal verification of the reliable computing platform. The proof establishes that the I/O behavior of the replicated model is identical to the uniprocessor model. Our approach is based on state machine concepts of behavioral equivalence, specialized for this application. All of the proofs are accomplished for all possible processor failures as long as a majority of them are working at all times.
The major property that must be established in order to prove that the replicated processor mimics the I/O behavior of a uniprocessor is that the dynamic state of the system is recovered after a transient fault within a bounded amount of time.
The Reliability Models
Since reliability is a driving influence on the system design, it is essential that the design be faithfully captured in the reliability model. The reliability analysis must be sound and the parameters of the model must be measurable.
Three validation tasks are eliminated by not using reconfiguration. First, it is not necessary to perform fault-injection experiments to measure recovery time distributions for non-reconfigurable systems. Second, fault latency is not a concern since it does not occur as a parameter in the reliability model. Fault latency is only a concern when one is trying to detect and remove a faulty component. In a reconfigurable system, non-correlated latent faults increase recovery time and correlated latent faults (in the worst case) reduce the reliability of a reconfigurable system to that of a non-reconfigurable system. Finally, the complexity of the model is greatly reduced, e.g., no reconfiguration process, and the interface to the sensors and actuators is static as opposed to dynamic.
Although the architecture presented here is parameterized for an arbitrary number of replicated processors, interactive consistency requires at least four processors to tolerate a single fault. Thus, a quadruplex is the minimum system configuration. A simplified reliability model for a quadruplex version of the system architecture is shown in figure 7.
The horizontal transitions represent transient fault arrivals. The vertical transitions represent permanent fault arrivals. These arrive at rate λ_T and λ_p respectively. The backwards arc represents the disappearance of the transient fault and all errors produced by it. This is accomplished by voting of internal state. The presence of this transition depends upon the proper design of the operating system so that it can recover the state of a processor that has been affected ...
The model was solved using the STEM reliability analysis program [24] for the following parameter values: λ_p = 10^-4/hour, λ_T = 10^-3/hour and mission time T = 10 hours.
The plot in figure 8 shows the probability of failure curve for three values of N.
Surprisingly, the inflection points of the curve do not vary significantly for the different values of N. Consequently, the optimal value of p does not vary much as a function of N.
Footnote 6: To simplify this discussion, the arrival of a second transient before the disappearance of the first transient has not been included in the model. A complete reliability analysis will include such events.
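The way a recovery rate enters such a model can be explored with a small numerical sketch. The three-state Markov model below is a hypothetical stand-in and is not the model of figure 7: the state structure, the recovery rate rho, and the rule that a second concurrent fault in a quadruplex causes system failure are assumptions made for illustration. A second transient arriving while the first persists is ignored, in line with the simplification noted in the text. The permanent and transient rates and the mission time are the values quoted in the text.

```python
N = 4  # quadruplex; this sketch assumes a second concurrent fault is fatal

def failure_probability(lam_p, lam_t, rho, mission_hours, steps=20000):
    """Integrate the assumed model with fixed-step RK4; return P(failed).

    Operational states: 0 = no faults, 1 = one transient fault present,
    2 = one permanent fault present. Probability leaving these states
    flows to an absorbing failure state.
    """
    def deriv(p):
        p0, p1, p2 = p
        d0 = -N * (lam_t + lam_p) * p0 + rho * p1
        d1 = N * lam_t * p0 - (rho + N * lam_p) * p1
        d2 = N * lam_p * p0 + lam_p * p1 - (N - 1) * (lam_t + lam_p) * p2
        return (d0, d1, d2)

    def axpy(p, k, scale):
        return tuple(x + scale * y for x, y in zip(p, k))

    h = mission_hours / steps
    p = (1.0, 0.0, 0.0)
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(axpy(p, k1, 0.5 * h))
        k3 = deriv(axpy(p, k2, 0.5 * h))
        k4 = deriv(axpy(p, k3, h))
        p = tuple(
            x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        )
    return 1.0 - sum(p)

if __name__ == "__main__":
    for rho in (0.01, 1.0, 100.0):  # hypothetical recovery rates per hour
        print(rho, failure_probability(1e-4, 1e-3, rho, 10.0))
```

Under these assumptions, faster recovery from transients lowers the computed failure probability, which is the qualitative effect the backwards arc is meant to capture. The fixed step size limits the sketch to moderate recovery rates; a stiff solver would be needed for very large values of rho.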
A Philosophical Point
The concept of system design driven by quantitative models is certainly not new [9]. However, there is an important difference between the use of reliability models to predict ultra-reliability and other quantitative modeling techniques. The definition of qualitative probability terms in [1] [Par. 9, sec. e] is:
Extremely Improbable failure conditions are those so unlikely that they are not anticipated to occur during the entire operational life of all airplanes of one type.
By this definition, such events should never be observed. Consequently it is impossible to test the robustness of these models against real empirical data. Some confusion arises because empirical data are used to measure some of the parameters of the reliability model. This is not the same thing as an "end-to-end" test. In order to test the accuracy of the reliability model itself, system failure times would have to be collected and compared against the predicted reliability. Unfortunately, one would have to wait virtually forever to collect this data.
Although relatively simple performance models can often be shown empirically to reasonably predict system performance, there is no such luxury in the ultra-reliability business. Reliability models must be conservative. This cannot be established empirically, so it must be established by formal reasoning and mathematical analysis.
