The Navy's Advanced Avionics Subsystem Technology (AAST) Fault Tolerant program is clarifying the Navy's fault tolerant avionics specifications, methods and acceptance tests. The goal of the program is to clarify the Specification and Statement of Work language needed in future procurements and to demonstrate fault tolerant validation tools on an avionics design. A set of tool features will then be developed that spans the needs of fault tolerant computer system design from early concept studies to full scale production and operational support, both hardware and software. The paper gives an overview of the AAST Fault Tolerant Demonstration and focuses on two tools that are being used in the demonstration: FERRARI, a software fault injector that will be used to validate the fault tolerance of the Common Integrated Processor (CIP), the F-22 Mission Processor; and GRIND, a concept evaluation tool that will be used to evaluate the overall CIP architecture.
Introduction
The DoD is entering a radically new era of systems procurement. The Cold War is over, and new scenarios and radical force structure changes are changing the requirements of the next generation of weapon systems. Other factors in the "new world order" of weapon system procurement are: budgets are decreasing, development cycles are being stretched out, prototypes and demonstrations are being emphasized, open systems standards will dominate, and the next generation of systems will be intensively modeled and simulated before any hardware is built. Future complex weapon systems will increasingly rely on digital systems, and the dependability of these digital components will play a critical role in the effectiveness of those systems in the field.
The key issue that the AAST Fault Tolerant Demonstration is addressing in this new era of defense procurement is: how can the Navy manage and procure dependable and cost effective, computer-based weapon systems? The demonstration program will investigate the timely and practical application of fault tolerant technology early in the design cycle, before major resources are committed to a particular design. This application of fault tolerant technology will be balanced against the extreme time pressures of modern avionics system development.
The Advanced Avionics Subsystem Technology Fault Tolerant Demonstration
Funds for the research, development, transition and insertion of new technologies into the fleet are divided into 6.1, 6.2, 6.3A and 6.4 funds. The 6.1 and 6.2 funds are focused on exploring the feasibility of new technologies. The 6.3A funds are aimed at demonstrating those technologies so that program offices can specify them with confidence. In 1990, ONR's 6.1 research started the Ultradependable Multicomputers and Electronic Systems Research Initiative. This research initiative addresses a wide-ranging set of fault tolerance topics including measurement and modeling of expected system fault modes, fault injection, simulation and modeling techniques, software fault tolerance approaches, as well as compiler, algorithm and hardware-based fault tolerance techniques. ONT's 6.2 exploratory development computer block has an effort called the "Engineering of Complex Systems Technology" whose aim is to explore the entire design and development of advanced real-time systems. The fault tolerance portion of the block plan is aimed at integrating fault tolerance into the design process of complex systems. The Advanced Avionics Subsystem Technology (AAST) Fault Tolerance Demonstration is a 6.3A project that takes the 6.2 Engineering of Complex Systems effort the next step and demonstrates the fault tolerance metrics and acceptance tests at each stage of an evolving contractor's design. The AAST work will also transition some of the ONR 6.1 developed tools (fault injection, fault tolerance benchmarks and fault tolerance simulation techniques). The goal of the AAST Fault Tolerance Demonstration is to demonstrate the necessary and sufficient dependability metrics and validation techniques of a fault tolerant system. These requirements will be documented so that program offices can use them in their specifications and SOW packages according to their various fault tolerance and dependability needs.
Specification and SOW Language
The two key thrusts of the program are the Specification language and the SOW requirements. Legal, formal requirements are the only way the government can define computer performance and dependability requirements. More precise fault tolerant requirements would specify the system's fault containment regions, the specific faults the system will guard against, and the types of analysis and fault injection testing that shall be done at each stage of the system design.
The SOW states the requirements for the contractor design team to fulfill at each stage of the system design. Generally, the SOW should require that the error handling features of the system be validated at each stage of the system's evolution. The validation should be a functional fault analysis which maps the specified fault set onto each identified fault containment region and then identifies the fault detection, isolation, removal and recovery mechanisms that will enable the system to maintain the mission services in the presence of faults. This validation should be demonstrated with fault injection techniques on the current simulation or breadboard of the evolving design.
This precise legal language, dealing with dependability and fault tolerance, should include clear and quantifiable validation techniques to be performed at each stage of the system's design. These will allow the Navy to be an informed customer, able to quantify a design's dependability and reasonably ensure that the evolving system will be a dependable system for the Navy to own and operate. Thus, the Navy needs to support the development of fault tolerant validation tools. The following two sections describe two of the tools to be used in the laboratory demonstration of the AAST Fault Tolerant program.
Fault Injection Tools Supporting the AAST Fault Tolerant Demonstration
Two tools are being used to support the AAST Fault Tolerant Demonstration. The first is DEPEND, a simulation environment tool that is used in the design phase of a system before any actual hardware has been built. The second is FERRARI, a fault and error injection tool that injects errors into a prototype of a system to measure its ability to detect, locate and recover from errors while it is executing real applications.
DEPEND/GRIND Overview
Commercial systems for air, ground, and space applications require innovative solutions to dependability problems. To meet these needs, we have developed a highly instrumented, simulation-based CAD environment, called DEPEND, which allows designers to study a system in detail. The CAD tool provides an object-oriented framework that allows the evaluation of highly dependable systems. The tool provides facilities to rapidly model components typically found in fault-tolerant systems. It provides an extensive, automated fault injection facility which can simulate realistic fault scenarios. For example, the tool can inject correlated and latent errors, and it can vary the injection rate based on the workload on the system. In addition, it provides several key features that are necessary for fault simulation:
It provides ways to signal a change in the status of the components due to a failure, so that remedial actions can be simulated.
It provides mechanisms to halt on-going processes due to faults/errors/failures. This is an essential feature for fault simulations. It is also useful for incorporating importance sampling methods.
It has the capability to model the intercomponent dependencies under fault conditions. For example, a failed server may not be able to initiate re-integration without control from a healthy control server. Such dependencies can be easily modeled with DEPEND.
It provides several automatic fault statistics collection facilities that can provide measures such as MTTF and availability. They can also provide a detailed list of every fault injected, repair action attempted and their status.
DEPEND is a powerful tool capable of modeling complex systems, but using it may be difficult for those who are new to the tool or who are unfamiliar with C++. Though the object library reduces the amount of programming that the user has to perform, models often turn out to be hundreds of lines of code. GRIND, a GRaphical INterface for DEPEND, provides an alternative to coding C++ directly. GRIND is a menu-driven X-Windows application which facilitates the creation of DEPEND models. With this interface, one is able to visualize the architecture of the system being modeled and how it functions. Hardware components are represented using icons while the software aspects of the system are specified using a graphical flow-chart representation. Development of models can also be performed more quickly because GRIND's menu structure presents most of DEPEND's features to the user, so that less time is spent referring to the manual and debugging typos. While a graphical interface provides a quicker and more intuitive way of entering models, much of DEPEND's power cannot be harnessed graphically, which means that direct C++ coding must be used to create especially complex models.
However, since GRIND's output is a file containing a well-formatted C++ program, one can speed up the creation of a complex model by first using GRIND to create a simpler, more abstract model, and then extending it by jumping directly into the C++ code generated by GRIND.
Example Application
As an example application of GRIND, this section presents a model of a system similar to a processing module found in the Hughes Common Integrated Processor (CIP). The system is a fault-tolerant element consisting of four processing elements (PEs), two redundant network interfaces (NIs), a global memory (GM), and a control processor (CP).
Refer to Figure 1. Either NI can be used by any of the PEs in order to send data to or get data from the outside world. We will assume that one NI has sufficient bandwidth to support the system, so that if one fails the system can continue to function. The CP is responsible for distributing tasks among the four PEs, which communicate with each other using the GM. The reason for having multiple processing elements is fault tolerance as well as increased computing power. Let's say that three of the four processors are needed to maintain the minimum throughput requirements. Since tasks are likely to be running on a PE when it fails, the process of reconfiguring to use only three PEs is likely to be complex and itself prone to failure. Thus, the model will include a reconfiguration coverage for the processing elements. Given the failure rates of each of the subcomponents, an interesting analysis would be to see how sensitive the reliability of the module as a whole is to this reconfiguration coverage. This would give engineers an idea of how much effort needs to be invested in designing a robust reconfiguration process.
Constructing this model using GRIND was a straightforward process. The first step was to create a derived class for each of the different types of subcomponents in the model. GRIND allows the user to create derived objects from the classes within the DEPEND object library so that more specialized functionality can be added to the default objects. Once a derived class is created, one can create variables and methods for that class in addition to those inherited from the parent class. In this example model, a PE class was derived from the FTAofn object. Because FTAofn is the parent of PE, PE inherits all of FTAofn's functionality, making it able to model the processing-element k-out-of-n system. Similarly, a GM class was derived from an FTmemory object, a CP class from an FTserver2 object, and an NI class from an FTlink2 object. Since no workload (such as processor utilization, message passing, memory access, etc.) is incorporated into this model, there was no need to further specialize the derived classes. The derived classes were added here to demonstrate that this model can be readily extended within the GRIND environment; a sketch of this derivation structure appears below. Once the derived classes were established and objects from these classes were added to the model, the initialization methods of these objects were set through the Set Inits menu. Through this menu, the fault injection rates were set as well as some of the configuration parameters. Through these configuration parameters, we specified that the proc_elem object was to be a 3-out-of-4 system with a 96% reconfiguration coverage and the net_interface object was to have one spare.
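The fragment below sketches that derivation structure. The base-class names (FTAofn, FTmemory, FTserver2, FTlink2) come from the model described above, but the minimal stand-in declarations and empty derived bodies are assumptions made only so the sketch compiles on its own; DEPEND's real object library has far richer interfaces.

```cpp
// Stand-in base classes (assumptions); DEPEND's real library supplies these.
struct FTAofn    { virtual ~FTAofn()    = default; /* k-out-of-n redundant group   */ };
struct FTmemory  { virtual ~FTmemory()  = default; /* memory with fault injection  */ };
struct FTserver2 { virtual ~FTserver2() = default; /* server with standby sparing  */ };
struct FTlink2   { virtual ~FTlink2()   = default; /* link/interface with one spare*/ };

// Derived classes would add model-specific behavior (workload, repair policy, ...)
// on top of the inherited fault-tolerance mechanisms.
class PE : public FTAofn    { };   // processing element, modeled as 3-out-of-4
class GM : public FTmemory  { };   // global memory
class CP : public FTserver2 { };   // control processor
class NI : public FTlink2   { };   // network interface

int main() {
    PE proc_elem; GM global_memory; CP control_processor; NI net_interface;
    (void)proc_elem; (void)global_memory; (void)control_processor; (void)net_interface;
    return 0;
}
```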
See Figure 2 for a GRIND display showing the objects of the model and a listing of the initialization methods of proc_elem, an object of the class PE. The GRIND environment also allowed us to specify that the executable was to run the simulation one thousand times, outputting the time-to-failure after each run. With this information, GRIND was able to create a C++ program which could be compiled and then run. The output of the executable is simply a listing of one thousand times to failure; so with the aid of a standard statistical analysis package, one could generate a curve estimating the reliability of the system. The parameter for the reconfiguration coverage can easily be changed within GRIND so that multiple scenarios may be simulated and a sensitivity analysis can be performed.
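To give a feel for the analysis GRIND automates, the following stand-alone C++ sketch draws failure times for the subcomponents, applies a 3-out-of-4 reconfiguration coverage to the processing elements, and estimates reliability from the resulting times-to-failure. The failure rates are illustrative assumptions, and this is not code generated by GRIND or part of DEPEND.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937_64 rng(42);
    // Assumed per-component failure rates (per hour); not the CIP's actual values.
    const double lambda_pe = 1e-4, lambda_ni = 5e-5, lambda_gm = 2e-5, lambda_cp = 2e-5;
    const double coverage = 0.96;              // probability a PE failure is reconfigured around
    std::exponential_distribution<double> pe(lambda_pe), ni(lambda_ni), gm(lambda_gm), cp(lambda_cp);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    std::vector<double> ttf;
    for (int run = 0; run < 1000; ++run) {
        double pe_t[4], ni_t[2];
        for (double& t : pe_t) t = pe(rng);
        for (double& t : ni_t) t = ni(rng);
        std::sort(pe_t, pe_t + 4);
        // The module fails at the first uncovered PE failure, at the second PE failure
        // (only 3-out-of-4 is tolerated), at the second NI failure, or when GM or CP fails.
        double pe_fail = (u(rng) < coverage) ? pe_t[1] : pe_t[0];
        double ni_fail = std::max(ni_t[0], ni_t[1]);
        ttf.push_back(std::min({pe_fail, ni_fail, gm(rng), cp(rng)}));
    }
    std::sort(ttf.begin(), ttf.end());
    for (std::size_t i = 0; i < ttf.size(); i += 100)   // empirical reliability: fraction still alive at t
        std::printf("t = %10.1f   R(t) = %.3f\n", ttf[i], 1.0 - double(i) / ttf.size());
    return 0;
}
```

Sweeping the coverage value and rerunning reproduces, in miniature, the sensitivity analysis described above.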
The next tool that will be discussed is FERRARI.
6 Design and Implementation of FERRARI for the Hughes Common Integrated Processor (CIP) Module
Fault and error injection has been recognized as a powerful technique which allows the evaluation of a prototype system under faults, in particular, the measurement of the effectiveness of its error detection and correction capabilities. Another advantage of this technique is that the effects of faults in the system can be studied while it is executing realistic programs.
Hardware and software techniques have been proposed for fault injection. The motivation behind our work was the development of a flexible and automated fault and error injection system. We concluded that hardware fault injection would be cumbersome and would not allow us to inject faults and errors inside chips, for example, changing a bit or bits in an internal register of a processor. On the other hand, it was clear that simulation would be too time consuming. Our approach, therefore, is to emulate hardware faults and errors through software, by corrupting the program execution state while it is executing, so that the behavior of the system would be the same as if the internal fault had been present.
Our studies showed that the behavior of a system varies with the type of faults and errors injected. (This is described in more detail in Section 8.) We also wanted the ability to inject faults while executing a variety of applications or system functions. The injection techniques described in this paper provide the necessary flexibility.
These techniques have been incorporated into FERRARI, a Fault and ERRor Automatic Real-time Injector. The main contributions of FERRARI are its ability to inject transient errors as well as permanent faults, so that it can be used to test the effectiveness of concurrent error detection and correction mechanisms, and its capability to perform the injection on object code. The current version of FERRARI is implemented to emulate a large number of faults and errors in the CPU circuitry, in memory, and in peripheral drivers. Hardware faults as well as control flow errors (including bus errors, memory errors and processor control line errors) are emulated through software. FERRARI allows control over the time, location, type and duration of the fault or error. It can measure coverage and latency (in instruction cycles or microseconds) and is able to locate the source of a detected error or of one which caused a failure. It is able to automatically control a large number of experimental runs.
The current version of the CIP module was not built to be fault tolerant. Our aim in injecting faults into this module is not to evaluate its dependability properties but rather to test the capability of FERRARI to inject faults and errors on the hardware prototype. This will give us the ability to add new features to FERRARI that will be used on future versions of the CIP module which are built to be fault tolerant. Figure 3 depicts the hardware configuration of the fault and error injection process. In this figure, a MicroVAX workstation is connected to the CIP through the high speed data link. The MicroVAX is configured as the host machine where the fault and error tool is executing.
The CIP contains the general purpose processing elements (GPPEs), the special signal processing elements (SPEs), and the global memory. The GPPEs execute the Ada applications while the SPEs execute the signal processing applications. Communication between the GPPEs and the SPEs is attained through the global memory shown in Figure 3. Each GPPE has its own local memory which contains the program that is running on that GPPE and the data needed for the successful execution of the program. The "interface module" is special purpose hardware that controls the communication between the MicroVAX and the CIP module.
As mentioned earlier, the fault and error injection tool (FERRARI) is executing on the host MicroVAX workstation, which is running the VMS operating system. The procedure for injecting errors into the CIP module is depicted in Figure 4. The "user console" is a program that can access the low level functionality of each component residing on the CIP module. It can insert software breakpoints, access any of the internal registers of any of the GPPEs, and read/write into the local memory of the GPPEs. FERRARI can inject faults and errors in the CIP module by sending a sequence of commands to the "user console" program that is running concurrently with FERRARI. This communication is attained through the use of mailboxes. After receiving the sequence of commands from FERRARI, the "user console" program carries them out on the CIP module.

1 More rigorous studies to obtain the actual high-level fault/error models of a particular processor should be conducted prior to using FERRARI. An example is the study undertaken in [5], where a technique for mapping real hardware failures to high-level error models is presented.
2 Mailboxes are used to send streams of data from one process to another.
Initialization and Activation Module
This module prepares the test program for fault and error injection. The tasks of this module are: 1) it parses the test program executable file to determine the starting address location and size of the text and data segments of the file, and 2) it extracts the execution behavior of an error-free run of the test program. Extracted information includes the output of the program (referred to later as the reference), the execution time, and the address space traversed during program execution.
User Information Module
This module obtains experiment parameters supplied by the user, which include: 1) the type of the dependability measurement (coverage vs. latency), 2) the duration of the fault/error (transient error vs. permanent fault), 3) the mode of the experiment, which selects between user vs. random selection of the address/bit position and the time instance at which the fault/error is injected, 4) the fault/error type, which selects one of the five types supported by FERRARI (these are XORing a bit with logic "1", resetting a bit, setting a bit, resetting a byte, and setting a byte), 5) the fault and error model (explained in the next subsection), and 6) the method of fault/error injection (also explained in the next subsection). At the end of the fault injection experiment, the CIP module operating system will send error messages through the high speed link to the "user console" program running on the MicroVAX workstation which, in turn, transfers the messages to FERRARI. This error message contains information that describes the execution state of the application running on the CIP module when the error was injected. FERRARI uses this information to evaluate the final coverage of the system.
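A hypothetical data structure summarizing these user-supplied parameters is sketched below; the field names and enumerators are illustrative assumptions and do not reflect FERRARI's actual interface.

```cpp
#include <cstdint>

enum class Measurement { Coverage, Latency };
enum class Duration    { TransientError, PermanentFault };
enum class Selection   { UserSpecified, Random };
enum class ErrorType   { XorBit, ResetBit, SetBit, ResetByte, SetByte };
enum class Method      { MemoryCorruption, Spatial, Temporal };

struct ExperimentParams {
    Measurement measurement;   // 1) coverage vs. latency
    Duration    duration;      // 2) transient error vs. permanent fault
    Selection   selection;     // 3) user vs. random choice of address/bit and time
    ErrorType   error_type;    // 4) one of the five supported mutation types
    int         fault_model;   // 5) fault/error model (e.g., AddIF, AddOF, DataOF, ...)
    Method      method;        // 6) memory corruption, spatial, or temporal injection
    std::uint64_t address;     // injection address (when user-specified)
    int           bit_position;// bit to mutate (when user-specified)
    std::uint32_t runs;        // number of automated injection runs
};
```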
Fault and Error Injection Module
FERRARI supports the injection of both permanent faults and transient errors. The mechanisms for fault and error injection are identical, and the only difference is the duration of the injected fault and error. For transient error injection, the duration is defined to be one instruction cycle. On the other hand, the duration of permanent faults may be several instruction cycles, or may span the entire execution interval of the application.
One of the design features of FERRARI is its capability to inject a variety of fault/error models. This feature was established when we observed the target system responding differently when different error models were injected in the system. (An example of this behavior is shown in Figure 13 .) In addition, the design of FERRARI allows the inclusion of other models to its set of fault/error models.
A detailed discussion of the mechanism of some of the transient errors and permanent faults that are injected through FERRARI follows in the next two subsections.
Transient Errors
FERRARI supports three methods of transient fault/error injection: Memory Corruption, Spatial, and Temporal. In the "Memory Corruption" method, a fault is injected in the task memory image before program execution starts. In the "Spatial" method, the fault/error injection is triggered after N occurrences of a randomly selected address line, where the value of N is defined by the user. Once the program uses the erroneous value, the error is removed. Finally, in the "Temporal" method, the execution of an application is interrupted at a randomly selected instant of time and the value of the next element fetched from or stored into memory is modified. An experiment comparing the three different methods is presented later in Section 8.

3 We refer to the program whose code is mutated by FERRARI during the course of the fault/error injection process as the "test program".
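As a concrete illustration of the "Memory Corruption" method, the following self-contained sketch flips one bit of a traced program's memory image before letting it run. It uses Linux ptrace on a local process, whereas FERRARI drives the CIP remotely through the "user console" program, so the mechanism shown, and the choice of address and bit, are assumptions for illustration only.

```cpp
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
    if (argc < 3) { std::fprintf(stderr, "usage: %s <addr> <prog> [args...]\n", argv[0]); return 1; }
    unsigned long addr = std::strtoul(argv[1], nullptr, 0);  // text/data address to corrupt
    pid_t pid = fork();
    if (pid == 0) {                                   // child: become traceable, run test program
        ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
        execvp(argv[2], &argv[2]);
        _exit(127);
    }
    int status;
    waitpid(pid, &status, 0);                         // child is stopped at exec, before running any code
    long word   = ptrace(PTRACE_PEEKTEXT, pid, (void*)addr, nullptr);
    long faulty = word ^ (1L << 5);                   // flip one bit (bit position chosen arbitrarily here)
    ptrace(PTRACE_POKETEXT, pid, (void*)addr, (void*)faulty);
    ptrace(PTRACE_CONT, pid, nullptr, nullptr);       // let the corrupted program run to completion
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))                          // terminating condition classifies the run
        std::printf("detected: killed by signal %d\n", WTERMSIG(status));
    else
        std::printf("exit status %d (compare output against reference)\n", WEXITSTATUS(status));
    return 0;
}
```

The terminating condition of the child (signal, abnormal exit, or normal exit whose output must be compared against the reference) corresponds to the classification performed by the data collection module described later.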
Figure 5: Address line error when an instruction is fetched

Table 1: Selected Transient Error Models
Some of the transient errors supported by FERRARI are presented in Table 1 and are emulated as follows. When the execution reaches a specified address, the program is trapped. A selected error is injected and the current instruction is executed. The injected error is then removed and the program is allowed to resume execution. The reason for this procedure is to avoid injecting the error more than once if the selected address is in a loop. Of course, if the single execution of the instruction under the error results in a change of internal state, this erroneous state remains and may cause other execution errors subsequently. This section presents the mechanism for injecting two of the error models presented in Table 1, AddIF and AddOF; a sketch of the trap-and-restore sequence appears after the two descriptions below. The mechanisms for injecting the other error models in Table 1 were presented in [1].

- Address line error while the processor is fetching an instruction (AddIF). Figure 5 illustrates the mechanism used to inject this error. The processor is interrupted when it reaches the selected address to be modified. The next instruction to be executed is fetched from the address pointed to by the current program counter, which has one of its bits (bytes) modified. After the execution of the wrong instruction, the program is trapped on the following instruction. The previous program counter value is restored before the program is allowed to proceed.

- Address line error when the program is fetching an operand (AddOF). When execution reaches the address where the error is to be injected, the program is trapped (Figure 6). In SPARC machines, for example, only load and store instructions access memory; all the other instructions use registers as their source and destination operands. The load and store instructions use registers to access memory. The effective address of the operand, for the selected load/store instruction, is modified. After the execution of the faulty instruction, the program is trapped again to restore the content of the previous program counter.

In addition to the error models listed in Table 1, FERRARI allows the user to mutate the contents of an internal register. The user may also select a combination of some of the transient errors in Table 1. For example, it was found in [5] that a significant percentage of the injected faults inside a sample processor are manifested as "address and data line errors while fetching an operand".
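The stand-alone sketch below mirrors the trap / inject / execute-one-instruction / remove sequence described above. It is written against Linux ptrace on x86_64 purely for illustration; FERRARI's implementation targets the SPARC (and, for the CIP, goes through the "user console"), so the register names, the single-stepping calls, and the way the trap address is reached are assumptions, not FERRARI's actual code.

```cpp
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
    if (argc < 4) { std::fprintf(stderr, "usage: %s <trap-addr> <bit> <prog>\n", argv[0]); return 1; }
    unsigned long trap_addr = std::strtoul(argv[1], nullptr, 0);
    int bit = std::atoi(argv[2]);
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
        execvp(argv[3], &argv[3]);
        _exit(127);
    }
    int status;
    waitpid(pid, &status, 0);
    struct user_regs_struct regs, saved;
    // Single-step until the program counter reaches the selected address
    // (a real injector would plant a breakpoint; stepping keeps the sketch short).
    while (true) {
        ptrace(PTRACE_GETREGS, pid, nullptr, &regs);
        if (regs.rip == trap_addr) break;
        ptrace(PTRACE_SINGLESTEP, pid, nullptr, nullptr);
        waitpid(pid, &status, 0);
        if (WIFEXITED(status)) return 0;               // the address was never reached
    }
    saved = regs;
    regs.rip ^= (1UL << bit);                          // corrupt the instruction-fetch address
    ptrace(PTRACE_SETREGS, pid, nullptr, &regs);
    ptrace(PTRACE_SINGLESTEP, pid, nullptr, nullptr);  // execute exactly one "wrong" instruction
    waitpid(pid, &status, 0);
    if (!WIFEXITED(status) && !WIFSIGNALED(status)) {
        ptrace(PTRACE_GETREGS, pid, nullptr, &regs);
        regs.rip = saved.rip;                          // remove the error: restore the program counter;
        ptrace(PTRACE_SETREGS, pid, nullptr, &regs);   // any corrupted internal state persists
        ptrace(PTRACE_CONT, pid, nullptr, nullptr);
        waitpid(pid, &status, 0);
    }
    return 0;
}
```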
6.3.2 Permanent Faults

Permanent faults supported by FERRARI are: 1) address line faults; 2) data line faults; and 3) faults in the condition code flags. These faults are emulated as follows:
- Address line fault. When the program execution reaches the address of the selected instruction, a bit/byte in the program counter is modified. The instruction at the modified address is executed. If the executed instruction is a branch instruction, the value of the program counter becomes the target address of the branch instruction; otherwise its value is the previous program counter, before fault injection, incremented by four (Figure 7). This procedure is repeated N times, where N is a number that determines the duration of the fault in instruction cycles. If any of the executed instructions accesses memory (load/store instructions), the effective address of the operand may be modified in the same bit/byte position used to mutate the address of the executed instruction.
Data Collection and Analysis Module
This module measures and records the response of the system for every injected fault/error. For each run, the location of the fault/error (virtual address), the affected bit, and the affected register, if any, are recorded. Terminating conditions for every run are also appended to the log file. Terminating conditions indicate whether the resulting error: 1) was dormant (did not lead to a failure and did not produce wrong output during the lifetime of program execution), 2) had led to a failure (the test program either produced wrong output or was terminated after it timed out), or 3) was detected. Terminating conditions that result in detected errors are due to: 1) executing "exit(cause)" statements in the test program code, where cause is a value that either indicates the nature of an error detected by a user detection mechanism or signals the end of the program execution, or 2) an error triggering one of the built-in error detection mechanisms of the system. Note that when the test program terminates abnormally, it returns to its parent (in this case the fault/error injection process) a flag indicating the nature of the error that caused the system to abort the execution of the test program.
The data analysis module also records the identity of the error detection mechanism and the error detection latency if the error was detected. For non-deterministic applications, for example, the compare step would check whether the difference between corresponding elements from the two output files is within a specified limit. FERRARI may also be used to test systems that utilize spatial redundancy techniques. For example, in a TMR system, FERRARI can inject faults/errors into one of the TMR modules and the output of the voter is considered to be the compare function output.
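A minimal sketch of such a compare step is shown below, assuming the reference and faulted-run outputs are whitespace-separated numbers and that the tolerance is supplied by the experimenter; both the file format and the limit value are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstdio>
#include <fstream>

// Corresponding values from the reference run and the faulted run must agree
// within the specified limit; missing or extra output also counts as a mismatch.
bool outputs_match(const char* reference_file, const char* run_file, double limit) {
    std::ifstream ref(reference_file), run(run_file);
    double a, b;
    while (ref >> a) {
        if (!(run >> b) || std::fabs(a - b) > limit)
            return false;
    }
    return !(run >> b);
}

int main(int argc, char* argv[]) {
    if (argc < 3) return 2;
    bool ok = outputs_match(argv[1], argv[2], 1e-6);
    std::puts(ok ? "outputs match reference" : "undetected error: wrong output");
    return ok ? 0 : 1;
}
```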
At the end of the experiment, the collection and analysis module collects these results along with the associated status flags and calculates percentages with respect to coverage, latency, and type of error detection mechanism for each experiment.
Experiment Description
Experiments presented in this paper were conducted on SUN SPARC workstations running SunOS 4.1. These experiments provide insight into the types of faults and errors that can be injected into the CIP module. They were selected to demonstrate the capabilities of FERRARI, as well as to study the behavior of the target system when injected with faults and errors. In addition, a variety of faults and errors were injected to measure and compare the effectiveness of several of the error detection and correction techniques that are either built into the operating system or embedded into the test programs.
For every experiment, the user selects the fault type and fault model to be injected and the number of injection runs. In each run, the bit position(s), the selected register to be faulted (if any), and either the location (an address inside the program code, including library code) or the instant at which a fault/error is injected (during the course of execution of the test program) were randomly selected.
Results for over one million runs are presented in this paper. The criterion adopted when selecting the number of runs per experiment was based on obtaining a consistent average behavioral response of the system (e.g. percentages of "No Error", "Undetected Errors", and "Error Coverage", including the distribution of the contribution of each of the error detection mechanisms). For some of the conducted experiments, the response of the system became consistent at 10,000 runs, while for others, the behavior of the system became consistent at 20,000 runs.
The guideline followed when selecting test programs was to maintain an automated injection environment while making fault/error injection runs. This feature resulted in conducting a large number of runs for each experiment, thus providing confidence in the measurement of the response of the system. Another factor considered when selecting test programs was to evaluate several of the concurrent error detection and correction techniques that were embedded at the application-level code. As a result, test programs used in our experiments were application-level programs.
An advantage of injecting faults/errors in the system at this level is that many of these faults/errors generate traps (error conditions and exceptions) which are later detected by the built-in error detection mechanisms of the system (e.g. detecting an "illegal instruction"). Once these errors are detected, the system aborts the execution of the program and returns to its parent shell, which is in this case FERRARI.
When those same errors, on the other hand, were trapped while the system was executing in the supervisor (kernel) mode, the processor entered either an endless "wait" state or a "diagnostic" state. In both cases automating the fault/error injection process becomes impossible, since the system has to be reset manually in order to continue the fault/error injection experiments. The study in [6] has shown that injecting faults/errors in the system while executing in either mode generates the same traps but differs in the action taken once the error is detected. Note that injecting the system with faults/errors while it is running in the kernel mode is accomplished by running FERRARI as a daemon process in the supervisor mode. As a result, FERRARI will have access to supervisor allocation tables and will thus be able to modify the processor state while it is executing operating system code.
5 The SPARC processor supports multi-tasking, which requires the processor to be able to operate in two modes: a supervisor and a user mode.

6 In this state, the system dumps all its memory to a swap area and later reboots the machine.
7 In Unix, a daemon process is one which is running silently until awakened by a request.
The procedure of fault/error injection in FERRARI is shown in Figure 8. In each experiment conducted, a selected application is first run without injecting an error in the system. The output, named reference, is written to a file for future comparisons. A fault or error is then injected into the system while running the application. If no error detection mechanism is triggered, the output of the current run is compared to reference. A difference between the two outputs indicates that an error has resulted in a wrong output and that the error was not detected by any of the error detection mechanisms; this contributes to the lack of coverage of the mechanisms. The test programs used in the experiments were: 1) quicksort embedded with assertions; 2) matrix multiplication using the checksum technique; and 3) robust data structures applied to modular robust binary (MRB) trees [7].
Empirical Results
In this section we present the results of fault/error injection on SUN SPARC1 workstations while running the above applications. Both permanent faults and transient errors were injected in our extensive studies. The results of the experiments presented in this paper, however, concentrate on transient errors, since the results were more interesting, and since we found that errors due to faults which were active for more than a few instruction cycles were very likely to be detected by one of the error detection mechanisms. This is consistent with previous work done on permanent faults [10].
The first experiment was designed to evaluate the coverage of error detection mechanisms under different types of faults and errors. The next experiment attempted to determine the percentage of injected errors which remained latent for a particular application. The coverage sensitivity to different error models and the effect of system error detection mechanisms were studied in the next two experiments. Finally, the tradeoffs between error detection capability and performance overhead were measured.
Effect of different fault/error injection methods
Experiments were conducted to compare three different methods of fault and error injection. These are: 1) corrupting the task memory image, 2) spatial transient error injection, and 3) temporal transient error injection. In these experiments we injected over 15,000 errors in the system while running the quicksort application for two different data sizes. The purpose of these experiments was to study the variation of error coverage with data size when using these three injection techniques. Figure 9 presents coverages for the three experiments. In the first experiment, a fault was injected in the task memory image before program execution started. This was referred to previously as the "Memory Corruption" method. The injected fault, as explained earlier, remained throughout the execution sequence of the program and the faulty value was potentially used many times. This is referred to as a "memory error" in the figure. In the second experiment, an address line was randomly selected before program execution started. Once the program used the erroneous value (in this experiment a data line error), the error was removed. This is referred to as a "spatial transient error" in the figure. The "temporal transient error" injection method was used in the third experiment. In this experiment, the error injection was accomplished by interrupting the execution of an application at a randomly selected instant of time, and changing the value of the next element fetched from memory.
Results presented in Figure 9 demonstrate the variability of the error coverage of the detection mechanisms under different injection methods. From these results we deduce the following.
- Errors injected in the task memory image exhibited the highest coverages since the error, as argued before, was exercised more than once if the corrupted instruction was in a loop.
- The coverage was insensitive to the size of the sorted data when faulting the memory image of the test program and when using the spatial injection method. The coverage, on the other hand, increased as the size of the sorted data increased from 100 to 1000 elements for the temporal error injection method. This behavior becomes clear when one realizes that the quicksort application was embedded with "assertions", a data value error detection mechanism; a sketch of such an assertion appears below. The quicksort application is composed of three segments: 1) initialization, 2) data loop, and 3) file closing. In the "data loop" segment, the program reads and sorts data elements in ascending order. When the size of the sorted data increases, the execution time inside only the data loop segment is increased, whereas the execution time inside the other two segments is not affected. As a result, the probability that a data error is injected inside the loop and consequently detected by the "assertion" mechanism increases when the "temporal" injection method is employed. This behavior, however, was not observed in the other two injection methods. In those methods, the data error is injected at the first incident when the value of the "program counter" matches the value of the preselected address location, including any address inside the data loop segment. Consequently, the response of the system for the memory corruption and spatial injection methods is independent of the execution time of the application. Therefore, the temporal injection method challenges the error detection mechanisms to a higher degree.
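The sketch below illustrates the kind of application-level assertion described for the quicksort test program: after sorting, the program checks that the elements are monotonically increasing and reports an error code otherwise. It is a generic reconstruction, not the actual test program used in these experiments.

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

static void quicksort(std::vector<int>& v, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = v[(lo + hi) / 2], i = lo, j = hi;
    while (i <= j) {
        while (v[i] < pivot) ++i;
        while (v[j] > pivot) --j;
        if (i <= j) std::swap(v[i++], v[j--]);
    }
    quicksort(v, lo, j);
    quicksort(v, i, hi);
}

int main() {
    std::vector<int> data(1000);
    for (int& x : data) x = std::rand();
    quicksort(data, 0, (int)data.size() - 1);
    // Assertion applied near the end of execution: detect corrupted results.
    for (std::size_t k = 1; k < data.size(); ++k)
        if (data[k - 1] > data[k]) {
            std::fprintf(stderr, "assertion failed: output not sorted\n");
            std::exit(3);                        // exit(cause) signals the detected error to the injector
        }
    for (int x : data) std::printf("%d\n", x);   // compared against the reference output
    return 0;
}
```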
t L e "data loop" segment, the program reads and sorts Figure 10 shows the result for the matrix multiplication application usin the checksum technique for error detection. Errors from every error model shown in Table 1 were injected into the system while running the matrix multiplication of two matrices each has 20 by 20 elements. For this experiment, we first injected the transient errors at all legal addresses traversed during the course of execution of the test program. Later, we utilized the pseudo-random number generator to select the address at which an error will be injected. We started with 1000 runs and kept incrementing the number of runs by 1000. After 10,000 runs, the system exhibited the same behavior (measured in terms of the distribution of the responses of the system, as explained later) to that observed when the test program was injected exhaustively. Figure 11 , we list a segment of code, shown in both C++ and the corresponding lowlevel SPARC assembly language, mutated to emulate the injection of an error. In the C++ code, the algorithm checks whether the value of the variable "argc" is less than 4. In this example, the injected error modified the "immediate" operand of the instruction to the value of zero. When the "cmp" instruction is executed, the value of the internal register "00" was greater than zero and the program branched to location "main + 120". Thus the injected fault did not cause the program to deviate from its normal execution. ...
Effect of latent errors

Id [%ip + 0x44],4cd)
Figure 11: An example of a latent error 43% of the errors were detected by the built-in error detection and protection mechanisms in the SPARC system. These mechanisms, which are triggered before any application-level error detection techniques, trap ille a1 instructions, bus errors, segmentation faults, b a b system calls, interrupts, arithmetic exceptions, etc. In Section 8.4 we present the contribution of these built-in error detection mechanisms.
Of the remaining errors, 5.5% were detected by the checksum technique and 3.0% by program exit conditions ("Prog Exit" in the figure). Program exit conditions were features added to increase program robustness, such as checking the status of I/O operations when opening and closing files, and were found to enhance the error detection capabilities of the system. The "Time Out" value in Figure 10 and in all other figures indicates the percentage of errors which caused the system to enter a wait state. This case was observed when the injected fault caused the program to start executing a system wait call. Apparently the arguments passed to this system call (wait) were set to produce an infinite loop. For such cases, a timer was used to abort program execution. In this particular application, 6.9% of the errors were not detected ("Undetected Error") and produced incorrect results. Undetected errors are errors that are not detected by the application program error detection mechanisms, yet they do not cause the execution of the application to terminate prematurely. The nature of such errors and techniques to prevent them will be described later in the paper.
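The exact checksum scheme embedded in the matrix multiplication test program is not spelled out here, so the sketch below uses a simple column-checksum (ABFT-style) check as an assumed stand-in: the checksum row of A times B predicts the column sums of C, and a mismatch after the multiplication flags an error.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n = 20;
    std::vector<std::vector<double>> A(n, std::vector<double>(n)), B = A, C = A;  // C starts at zero
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) { A[i][j] = i + j; B[i][j] = i - j; }

    std::vector<double> a_cs(n, 0.0), c_cs(n, 0.0);   // checksum row of A; predicted column sums of C
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i) a_cs[k] += A[i][k];
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k) c_cs[j] += a_cs[k] * B[k][j];

    for (int i = 0; i < n; ++i)                        // C = A * B (the computation being checked)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k) C[i][j] += A[i][k] * B[k][j];

    for (int j = 0; j < n; ++j) {                      // concurrent detection: column sums must match
        double col = 0.0;
        for (int i = 0; i < n; ++i) col += C[i][j];
        if (std::fabs(col - c_cs[j]) > 1e-6) { std::puts("checksum error detected"); return 3; }
    }
    std::puts("checksum verified");
    return 0;
}
```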
Note, however, that the system responses obtained through our experiments do not include the category "system crash", which is a very common error manifestation reported in other studies. As argued previously, we decided to avoid system crashes (and consequently the time delay to reboot the machine once it crashes) by injecting faults/errors while the processor is in the "user" mode.

In order to concentrate on detectable errors, the rest of the figures in this paper show the coverages of errors when latent errors are excluded from the calculations. Figure 12 shows the distribution of coverages for matrix multiplication using checksum when the latent errors are excluded.
Coverage sensitivity to different error models
The purpose of this experiment was to study the response of the system when injected with faults/errors that are representative of failures in the real hardware [5]. Figure 13 shows the system behavior when the transient errors presented in Table 1 were injected into the system while running the MRB tree test program.
Over 60% of the detected errors were trapped by the built-in system error detection mechanisms. In Figure 13, the highest coverage was obtained when address line errors were injected while loading/storing operands (AddOF and AddOS), where the system detection techniques (elaborated in Section 8.4), the assertion technique, and other program robustness techniques contributed to the overall coverage. The lowest coverage was obtained when data line errors were injected (models DataIF, DataOF, and DataOS).
Note that although the MRB tree error detection mechanism is a data value checking technique, it was still effective for other errors caused by the injected errors. The effectiveness of the MRB tree error detection, however, was the highest when errors were injected into the data bus when operands were either fetched or stored (error models DataOF and DataOS).
Effect of system error detection mechanisms
The SPARC system has built-in error detection mechanisms that monitor the address, data and control buses (Figure 14). In this experiment, the system was injected with the eight different error models (see Table 1), while running the matrix multiplication application using checksums. These coverages were found to be comparable to those obtained with the quicksort and robust data structures applications. In this figure, most of the illegal instructions produced were caused by injecting transient errors from models AddIF, AddIF2, DataIF (when the processor is fetching different instructions) and model CndCR. Segmentation faults were triggered by memory access exceptions, which occur when a data memory access or an instruction prefetch fails to complete normally. Although this response is anticipated when address line errors were injected, it is less intuitive when data errors were injected (models DataOF, DataOS). Further analysis showed that the majority of these errors were modifying operand values that specify the memory address of a data element directly or indirectly (an example is the case when the contents of the destination register for a "load" instruction are mutated). In Figure 14, "Bus error" is due to a memory misaligned access, which occurs when a load, store, or exchange instruction attempts to access a memory address that is not consistent with the size of the access. An example of a misaligned access is when a half-word access is attempted to an odd byte address.
Measurement of error detection latency
Figure 15 shows the average user, system, and overall error detection latency for the quicksort application. In this figure, latency is shown for selected transient error models presented in Table 1. The sorted list for this experiment consisted of ten integer elements. As shown in the figure, the latency for the user detection mechanism was several thousand instruction execution times, compared to a few hundred for the system error detection mechanisms. The reason is that user assertions (elements are monotonically increasing) are applied near the end of the execution of the program, whereas the system error detection mechanisms are triggered at every instruction. A timeout interval of 5000 instruction cycles was set at the beginning of the experiment. If an error was not detected during this interval, it was labeled as an "Undetected Error".
As shown in the figure, latencies for address line errors were in general smaller than those for other injected error models. This behavior is expected since most of the injected address line errors (80%), as was shown in the previous experiment, were detected by the built-in system error detection mechanisms such as "segmentation faults", "illegal instructions", and "bus errors". The system error detection mechanisms trap these illegal conditions in one or a few instruction cycles. Note that since the quicksort application occupies a small address space, the probability of inducing a "segmentation fault" exception increases, and it is detected by the memory management unit within a few cycles. Similar arguments hold for "bus error" and "illegal instruction" exceptions. Also note in the figure that the latency for the AddIF2 model is smaller than that for AddIF, since more than one bit of the instruction address is mutated in the AddIF2 error model.
Regarding data line errors when operands are accessed, DataOS has a higher latency than DataOF, as shown in Figure 15. A principal reason for this behavior is that some of the corrupted data operands are used in the address calculation of other operands. For the DataOF model, these corrupted data are immediately used to fetch other data, whereas for the DataOS model, some of these corrupted data are only used at a later stage during the course of execution of the test program, depending on the control flow of the program.
Figure 15: Error detection latencies for matrix multiplication utilizing checksum techniques (10,000 runs).
Conclusion
This paper outlined the Navy's AAST Fault Tolerant Demonstration program and described two tools being used in the demonstration. This program will clarify the precise legal language needed in future Specifications and SOW of complex, computer based weapon systems and what validation techniques are needed to support that contract language.
During the Gulf War, Naval aircraft sustained high Full Mission Capable (FMC) rates (in the high 90% range), but these high FMC rates were sustained at the cost of high Maintenance Man Hours per Flight Hour (MMH/FH) (30 to 65 MMH/FH), high spares usage, and high false alarm rates (consistently in the 30% to 35% range). Future complex weapon systems will increasingly rely on digital sub-systems, and the dependability of these digital designs will play a critical role in the effectiveness of those systems in the field. In future combat situations, the rate of spares usage, the time to isolate and repair the equipment, and the false alarm rates will play a critical role in the effectiveness of those systems. As budgets decrease, fewer systems are purchased, and commercial parts are used, the dependability of those systems will be a critical factor in the effectiveness of the next generation of computer based combat systems.
