This paper presents a dependability study of high-speed, switched Local Area Networks (LANs) using Myrinet as an example testbed (with theoretical speeds of 2.56 Gbps). The study uses results of two fault injection methods, simulated fault injection and software-implemented fault injection (SWIFI), to analyze the application-level impact of transient faults injected into the network interface hardware. These results include a number of errors such as dropped or corrupt messages, host interface or host resets, and local or remote host interface hangs. The paper presents the study in two parts. First, the results from the SWIFI method in the real system are used as a basis to validate the simulation and identify the major factors leading to differences between the methods. A comparison between the two injection methods shows that they agree for 83% of the fault injections. The results, however, vary greatly depending on the fault type considered. Second, the study presents an analysis of the effects of varying workload intensity, host platform, and interface function targeted by the injection. For example, this analysis shows that the function targeted has a significant impact on the fault activation rate. Finally, the study identifies two mechanisms by which faults may propagate from the interface to other parts of the network; in one example, this propagation caused the interface's host computer to reboot, while in another it caused a remote interface in the network to hang.
Introduction
This paper presents a fault injection based dependability study of a high-speed local area network (LAN). The network analyzed in the study is a Myrinet [1], from Myricom, Inc. Rather than focusing on communication protocols or network links, the study focused on faults in the network interface hardware and their impact in broad terms such as "dropped message" or "hung interface."
Myrinet was chosen, not because of any special fault tolerant design features, but for its open architecture and for the availability of its specifications and source code. We feel that many of the problems addressed in injecting faults into the Myrinet apply to any microprocessor/DMA-based network system. As a result, these issues should be considered in any highly dependable network design.
The study uses the results of two fault injection methods, simulated fault injection and software-implemented fault injection (SWIFI), to analyze the application-level impact of transient faults injected into the network interface hardware. The first method, applicable in the product design phase, is a simulation-based fault injection approach; the other is a SWIFI method that can be used only after a working prototype exists.
By combining both methods in our study, we were able to validate our simulation methodology. In particular, a comparison between the simulation and SWIFI methods is presented in the results section, and reasons for the differences between the two methods are treated in the discussion section. The study also examines the effects of certain parameters, such as level of network workload intensity and choice of platform for the host undergoing fault injection.
We would like to emphasize the following three points about this study. First, all of the fault injections in this study are implemented by modifying the control software running on the host interface board. Second, the overall fault model for these injections is a temporary single bit-flip in an instruction in this control software, approximating a transient fault in the host interface hardware during that instruction's execution. Third, where the term SWIFI is used throughout the paper, it refers to software-implemented fault injections on the real system.
The remainder of the paper is organized as follows. Section 2 discusses related work, focusing on injection-based dependability studies. Section 3 gives a brief overview of the Myrinet LAN and the configuration used for the study. Section 4 describes our experimental approach, and Section 5 describes the fault model and the fault injection scheme used. Section 6 presents the results of the experiment. Examples of the results of specific fault injections are given in Section 7, and Section 8 contains a discussion of the overall experiment. Section 9 summarizes the work and suggests future considerations.
Related Work
Dependability evaluation involves the study of faults, errors, failures, and recovery [2, 3]. Fault injection plays an important role in such studies. However, because of the interfering, sometimes destructive, nature of fault injection, it is normally conducted in simulated environments [4, 5, 6] or in experimental testbeds [7, 8, 9, 10] and very rarely in operational production environments.
Most injection-based dependability studies in the literature focus on hardware faults or software defects. Only a few studies [4, 7] have concentrated on communication faults, such as corrupted or lost messages. NEST [7] simulated the effects of connection faults in messages. Orchestra [11] injected faults between protocol layers to study their effect on outgoing and incoming messages. One study [12] measured the effect of communication faults by corrupting message headers. An issue that has not been addressed is how to determine a set of communication faults that represents those that can result from real faults. Very little has been done in the area of network dependability studies that involves looking into the particular network components of the interface hardware, the control software, or the software that performs data transfer.
There have been a few studies that have compared different fault injection methods. For example, one study [13] compared the results obtained using three different physical fault injection methods and a SWIFI implementation on a processor in a system area network. To our knowledge, none of these papers has addressed the comparison between simulation-based fault injection and real fault injection.
An additional issue involving the use of simulated fault injection to analyze software behavior is that it is difficult to modify software behavior based on hardware fault models. A number of simulation tools can model system hardware and, to some degree, the software executing on the system. REACT, MEFISTO, and ADEPT are examples of these tools [14, 15, 16]. To our knowledge, however, none of these tools can explicitly simulate the change in software behavior.
Our experiment, then, differs from previous work in two main ways. First, we focus our analysis and fault injections on the network interface component, in particular its controlling software, rather than on the communication protocol or network links. Second, we provide results from both actual fault injections into the real system and from simulations, and we compare these results to point out the strengths and weaknesses of each method.
Experimental Network
This study is based on a Myrinet local-area network (LAN). A Myrinet LAN connects nodes (host computers with interface cards) and multiple-port switches. Full-duplex links supporting synchronous transmission rates of 160 megabytes of data per second connect these components arranged in an arbitrary topology.
Since the Myrinet has an open architecture, we had access to the source code and compiler for the key software element (called the Myrinet Control Program or MCP). This level of access was essential for the fault injections performed in this study. It is often difficult to obtain this level of detailed information for other commercial networks. Myricom, Inc. also provided us with a cycle-accurate simulator, which aided in the development and testing for this study and was used in the fault simulator.
Myrinet provides some level of error detection and fault tolerance. The data messages are protected by a CRC byte, and the network will automatically remap itself to changes in the network topology. These two features were not tested in this study. We are unaware of any fault tolerant features in the interface hardware. All of the Myrinet's fault tolerant features assume that the processor executes error-free.
Host Interface
Figure 1 shows a diagram of the LANai chip, a custom-VLSI RISC microprocessor, which is the main component on the host interface board that connects the host computer to the network. The LANai includes interfaces to on-board static RAM, the I/O bus, and the physical network (also shown in the diagram). The host interface board is plugged into the host I/O bus and treated as an external I/O device. The Myrinet Control Program (MCP) running on the LANai performs most of the host interface's functions. The on-board static RAM stores the MCP and buffers network messages. In this study, all the fault injections into the LANai processor were performed via the on-board SRAM.
Network Layout
In our study we used an actual 4-node Myrinet network and a similarly configured simulated network. An overview of the target network is shown later in Figure 3. The network consisted of two Ultra SPARC workstations (with 167MHz Ultra SPARC processor running Solaris 2.5), two PCs (one with 200MHz Pentium Pro and one with 133MHz Pentium, each running Linux 2.0.0), and an 8-port Myrinet switch. Each host had a Myrinet interface card and was connected to the switch via a 1.28 + 1.28 Gbps (gigabits per second) duplex link. All nodes were also connected via Ethernet for monitoring and control from a remote machine.
The simulation tool DEPEND [17] was used to model the simulated network after the real network, as described above. The model simulated the four host interface cards (containing the LANai chip) in detail, including the behavior of their controlling software. The only activity on the host computer at each node that the simulator modeled was the synthetic application's workload. This workload was considered in terms of its message traffic and API calls; neither operating system activities nor expansion bus activity were modeled. The network links and switch were modeled at the packet transfer level. The simulation ensured correct routing of packets through the switch but only considered approximate transmission delays and contention for links. Furthermore, the simulation assumed that flow control and other link-level protocols work correctly and, hence, did not model them. Despite these abstractions, the level of detail that the simulation provided at the host interface level was sufficient to serve as a basis for comparing the simulated results with the actual injections.
Dependability Experiments
The objectives of the experiments were to compare simulation-based fault injection with SWIFI on a real system, and to explore how the choice of experimental parameters (e.g., network workload) affects the results of fault injection. This section first presents an overview of the study and then gives a detailed description of the experiments used to meet these objectives.
We use the following terminology to distinguish faults from fault injections. Each entry in the fault list is called a fault. In the simulated network, each fault is injected once, but in the SWIFI experiment, there are 10 fault injections for each fault. The faults are classified in the results section by fault rather than by injection. We will first describe the experimental parameters that define each experiment. Once the parameters and their possible values are given, we will describe the experiments performed to meet our goals.
Experiment Procedure Overview
The overall experimental approach is shown in Figure 2. As the figure indicates, three parameters define each experiment. First, a set of general parameters (e.g., network workload) must be chosen. The next parameter is the fault list. The final experimental parameter is the fault injection method (either simulated fault injection or SWIFI).
The diagram shows two basic flows. The upper flow is for the SWIFI experiments, and the lower flow is for the simulated experiments. As the diagram indicates, the inputs to both flows are identical.
General Parameters
To define an experiment, values for the following three parameters must first be chosen: network workload, host platform, and code selection. The following paragraphs give a specific definition of each parameter as it applies to this study, and the values it may assume.
The network workload specifies the sequence of messages and the destination node for each node in the network during one experiment. This parameter has only two values for the experiments we performed, high and low. For both the high and low workloads, the same pattern of messages is sent, but the rate at which messages are transmitted is varied. The difference between the two workloads is that the high workload sends messages at the maximum rate the receiving node can accept (the sending interface's message buffers are kept full), while the low workload specifies a much lower rate, which typically uses only one of the message buffers of the sending interface at a time.
The host platform parameter refers to the type of host computer connected to the interface undergoing fault injection. This parameter is only applicable to the SWIFI method because the simulation does not model the host computer in enough detail to differentiate between different host computer types. The host platform parameter could take one of two values: an interface is either connected to a SPARC Ultra or to an Intel PC.
The code selection parameter defines the region of code in the MCP where faults are injected. Our fault model is a transient hardware fault injected while the interface executes a given region of code. This parameter takes one of three values. The first value, called "focused send," includes one function from the MCP host send code, which corresponds to 64 assembly instructions. The second value, called "broad send," includes all of the host send code (317 instructions). Finally, the last value, called "broad receive," includes all of the MCP host receive code (922 instructions). The entire MCP contains about 18400 instructions, but the target regions of code represent a larger share of the dynamic execution of the code.
The fault list enumerates all the faults to be injected into the target interface. Note that the code selection parameter affects which faults may be in the fault list. Each fault is identified by giving the assembly instruction in the MCP code that will be modified and the bit within that instruction that will be flipped. (The following section on fault injection has more detail on the fault model.) Each of these faults was injected, one at a time, into the target system. After injecting each fault, the fault injector waits for the monitor to observe the effect of that fault before it injects the next fault (the observation period is about two seconds).
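As a rough illustration of how such a fault list could be built, the sketch below draws (instruction address, bit position) pairs at random from a code region. The region start address, the fixed 4-byte instruction width, and the names `Fault` and `make_fault_list` are assumptions made for illustration; they are not taken from the MCP or from the injector used in the study.

```python
import random
from typing import List, NamedTuple


class Fault(NamedTuple):
    address: int  # address of the instruction to corrupt
    bit: int      # bit position to flip (0 = least significant)


def make_fault_list(region_start: int, num_instructions: int, num_faults: int,
                    seed: int = 0, instr_bytes: int = 4) -> List[Fault]:
    """Pick faults uniformly at random over static instructions and bits.

    Assumes fixed-width instructions (instr_bytes each). The selection
    ignores how often each instruction is actually executed.
    """
    rng = random.Random(seed)
    faults = []
    for _ in range(num_faults):
        index = rng.randrange(num_instructions)
        bit = rng.randrange(instr_bytes * 8)
        faults.append(Fault(region_start + index * instr_bytes, bit))
    return faults


# Hypothetical region: 64 instructions starting at address 0x2200.
faults = make_fault_list(region_start=0x2200, num_instructions=64, num_faults=500)
```

Fixing the seed makes the list reproducible, so the same faults can be handed to both injection methods.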
Repeatability of Experiments
On the real system, we were unable to control all variables that might affect the repeatability of results. For example, operating system activity can affect the timing of events, or the value of "dead" variables (those memory locations of data that are not read during the normal execution of the program) can be different between successive instantiations of the same fault. Nevertheless, we were interested in determining whether the results are repeatable.
To do so, each fault from the fault list was injected at least ten times. That is, the process of injecting the fault and observing the impact of the fault on the system was repeated several times. Section 6 provides more details on how the results of the ten injections are classified for each different fault.
Detailed Description of Experiment Runs
We defined eight specific experiments to meet our objectives. These experiments are represented as letters in Table 1. For example, Experiment C used SWIFI to inject 400 faults (4000 injections) into the "broad send" region of the MCP on an Ultra while running a high workload. The simulator was only able to execute the focused section of code. Since the focused send experiment is included in the broad send, this experiment was only performed for the base conditions (using a high workload on a SPARC platform). To meet the first goal, comparing simulation and SWIFI on the real system, we used the results of experiments A and B. These experiments consisted of 500 faults randomly chosen from within the "focused send" code selection region under a high workload using the Ultra SPARC as the host platform.
To meet the second goal, examining the impact of workload intensity, host platform, and code selection on the results of fault injection, all seven experiments using the SWIFI method were used. The effect of the network workload intensity was explored by comparing the results of experiments C and D to those from E and F. This comparison, then, involved fault injections into both the "broad send" and "broad receive" regions on a SPARC Ultra, with either a high or a low workload level. The effect of the choice of target platform was explored by comparing the results of experiments C and D to those of G and H. These experiments represented fault injections into both the "broad send" and "broad receive" regions under a high workload on either a SPARC Ultra or Intel PC. Finally, the effect of code selection was examined by comparing the results of experiments B, C, and D with each other. Note that the three experiments were performed on a SPARC Ultra with a high workload, but each looked at a different code region for fault injection.
Figure 3 gives an overview of the fault injection environment. The experiment specifications determine the faults to be injected. Each node independently generates a network workload and monitors its own network activity. The remainder of this section describes the fault model, the workload generator, the monitor, the fault injector, and descriptions of simulation and SWIFI injections.
Fault Model
We injected faults into a selected network node (specifically into the MCP running on the node's host interface board). An ideal fault model would include any type of transient fault that could affect the processor. But, since internal parts of the processor are not readily available, we instead injected faults into the instruction that the processor was executing. Such a fault could occur if a transient fault affected the instruction decode section of the processor or the instruction memory. Since a fault of this type has the capability of affecting other parts of the processor (such as the register file or the contents of memory), it is a useful fault model. More specifically, we used a single bit-flip in the instruction as our fault model. For dependability experiments, the single bit-flip is a commonly used model [9, 18, 19]. Each fault is defined by the address of the MCP instruction that it corrupts and by the position of the bit in the instruction that is flipped. The MCP address is selected at random from the instructions in the target region of code, regardless of the dynamic execution of code. Taking the original instruction and flipping one bit at random from that instruction determines the corrupt instruction. After the MCP executes the section of code containing the corrupt instruction, the fault injector resets the flipped bit to its original value. For example, a fault might flip the fifth least significant bit of instruction 22f8₁₆. Such a fault would change the instruction from f70427cc₁₆ (or ld M[127cc], %r14) to f70427dc₁₆ (or ld M[127dc], %r14), which copies the contents of memory location 127dc₁₆ instead of 127cc₁₆ to register 14. Instead of trapping illegal instructions, the LANai processor treats illegal instructions as no-op instructions.
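The single bit-flip in the example above amounts to an exclusive-or with a one-bit mask. The following minimal sketch reproduces that arithmetic for the example instruction word; the helper name `flip_bit` and the 32-bit width check are illustrative assumptions, not part of the actual injector.

```python
def flip_bit(instruction: int, bit: int) -> int:
    """Return the 32-bit instruction word with one bit inverted (bit 0 = LSB)."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in 0..31 for a 32-bit instruction")
    return instruction ^ (1 << bit)


original = 0xF70427CC
corrupt = flip_bit(original, 4)    # fifth least significant bit
restored = flip_bit(corrupt, 4)    # flipping the same bit again undoes the fault
print(f"{original:08x} -> {corrupt:08x} -> {restored:08x}")
# prints "f70427cc -> f70427dc -> f70427cc"
```

Because the operation is its own inverse, restoring the instruction after the fault has been exercised needs no saved copy beyond the address and bit position.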
Workload Generation
Each node in the network has a workload generator that creates messages, sends them over the network, and reads incoming messages. The user may specify both the rate at which the workload generator sends messages and the destination of each message. In these experiments, each node sent data to a different node in a circular fashion; that is, each node sent data to another node in a ring. Each data message includes a header with two bytes to identify the protocol and a two-byte message count followed by data from a test pattern. The message pattern is created using a simple algorithm that generates a sequence of messages that allows the receiver to verify that the contents of each message are correct. To avoid overhead in retransmitting messages and for simplicity in the protocol, corrupt messages may be detected and dropped, but not re-sent. This protocol is similar to what one might use for a video-on-demand (VOD) application. The experiments in this study used two workload intensity specifications: a high workload, since it is expected that a high workload will exercise more errors than a low workload [20, 21], and a low workload.
For the high workload, the generator sends packets at the maximum rate that each platform can send without dropping messages due to overflowing receive buffers. The maximum message-sending rate for each host machine is determined by trial and error. The simulation approximates this workload by filling all of the message buffers. On the Ultra SPARCs this rate was determined to be about 2500 messages per second and on the PCs about 1500 messages per second. For the low workload level, the generator sends messages at a much lower rate (with average transfer rate of about 50 messages per second or 2-3 Mbps).
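To make the message layout concrete, here is a minimal sketch of a generator and a receiver-side check. Only the two-byte protocol identifier and the two-byte message count come from the description above; the byte order, the payload length, the protocol value, and the pattern function (each byte derived from the message count) are hypothetical stand-ins, since the actual test-pattern algorithm is not specified here.

```python
import struct

PROTOCOL_ID = 0x0001   # hypothetical protocol identifier
PAYLOAD_LEN = 64       # hypothetical payload size in bytes


def pattern(count: int, length: int = PAYLOAD_LEN) -> bytes:
    """Assumed test pattern: byte i of message `count` is (count + i) mod 256."""
    return bytes((count + i) & 0xFF for i in range(length))


def build_message(count: int) -> bytes:
    """Two-byte protocol id, two-byte message count, then the test pattern."""
    header = struct.pack(">HH", PROTOCOL_ID, count & 0xFFFF)
    return header + pattern(count)


def check_message(msg: bytes, expected_count: int) -> str:
    """Classify a received message without needing a stored gold copy."""
    if len(msg) < 4:
        return "data corrupt"
    protocol, count = struct.unpack(">HH", msg[:4])
    if protocol != PROTOCOL_ID or msg[4:] != pattern(count):
        return "data corrupt"
    if count != expected_count & 0xFFFF:
        return "message dropped"   # a well-formed message arrived out of sequence
    return "ok"
```

Because the payload is a function of the message count, the receiver can regenerate the expected contents on the fly, which is what lets corruption and loss be told apart in this sketch.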
Monitor
The host computer at each node of the network runs a monitor process to examine the messages it receives and the activities on its own host interface. The monitor is integrated into the workload-generating program. Since the messages are created such that the monitor always knows the sequence of messages in a good run (one without faults), there is no need to have a gold run for each message. When the monitor detects an error, it logs an error message and outputs the message to the users. If a fault causes the MCP to reset, the monitor detects the reset. Certain errors (such as MCP restart) are detected at the node on which the error occurs. Other errors (such as data corrupt) are detected by the monitor of the receiving node. When a node of the network is in a hang state, the monitor on the node detects the problem on the link when it fails to receive any more messages for a given period of time.
Fault Injectors
The fault injectors for SWIFI and the simulation method operated in a similar fashion. Faults were exercised when the MCP executed an instruction the host computer had corrupted to simulate a real fault. In both methods, the MCP executed exactly the same code. To implement the SWIFI fault injection, the MCP was modified slightly to allow the MCP to synchronize with the host computer. Though the synchronization was not modeled by the simulation, the extra code was modeled. In the SWIFI injection, the fault injection was performed in real time at the user's request. For the simulation, the MCP executed for an arbitrary number of cycles and then its state was recorded. All faults were injected into the MCP from this state. After the MCP executed the target section of code, the corrupt instruction was reset to its good value.
SWIFI Fault Injector
To inject a fault into the SWIFI testbed, the user simply makes a request (from a remote computer via TCP/IP over a safe Ethernet) to the host computer to inject a fault at a given address, flipping a specific bit. Each time the MCP running on the LANai processor executes the target section of code, it checks to see if there is a fault injection pending. If not, it executes the code section. If so, then the program signals the host computer to write the corrupt instruction into the LANai's memory space (the instruction memory is protected from the MCP so that programming errors do not overwrite code; thus, only the host computer can write to the instruction memory). After the host computer corrupts the memory, the MCP executes the entire code section (including the corrupt instruction) without interruption. The SWIFI fault injector was implemented in this manner to minimize the disturbance on the system when injecting faults. Figure 4 shows the block diagram and the flow of control in the fault injector. To allow for runtime injection, the following event flags (readable and writable by both the MCP and the host computer) are used to control fault injections:
FI Request (request for fault injection): The fault injector sets this flag to notify the MCP that the user wants to inject a fault. Each code region has its own request flag.
FI Done (fault injection completed): After injecting the fault, the fault injector sets this flag to inform the MCP that the fault has been injected and that the MCP should continue to execute the code region.
FI Exercised (exercised the fault): The MCP sets this flag when it completes the code region to inform the fault injector that the fault has been exercised.
Synchronizing the fault injector with the MCP requires minor modifications of the MCP. To allow fault injection into a specified region of code, these modifications include adding two functions to the MCP. The first function, inserted before the region of code, checks the FI Request flag each time the MCP calls the specified region. If the fault injector has requested a fault for that region, this function acknowledges the request by setting FI Ready and waiting for the fault injector to inject the fault and set FI Done. The MCP then executes the region of code containing the corrupt instruction. The second function is called when the MCP completes executing the region of code. It asserts FI Exercised so that the fault injector can restore the original instruction. This sequence of events completes a single fault injection.
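The handshake described above can be sketched with two threads and four event flags. This is a loose model under stated assumptions, not the MCP code: the MCP and the host are ordinary Python threads, instruction memory is a dictionary, "executing the region" is reduced to reading the target instruction word, and all function and class names are hypothetical. FI Ready is modeled as a fourth flag because the text names it as the MCP's acknowledgement.

```python
import threading
import time


class Flags:
    """Event flags shared by the MCP and the host-side fault injector."""
    def __init__(self):
        self.request = threading.Event()    # FI Request
        self.ready = threading.Event()      # FI Ready
        self.done = threading.Event()       # FI Done
        self.exercised = threading.Event()  # FI Exercised


def mcp_run_region(flags, memory, address, log):
    """One pass of the MCP through the instrumented code region."""
    injecting = flags.request.is_set()
    if injecting:                  # first added function (before the region)
        flags.request.clear()
        flags.ready.set()          # acknowledge the request
        flags.done.wait()          # wait until the host has written the fault
        flags.done.clear()
    log.append(memory[address])    # stand-in for executing the region
    if injecting:                  # second added function (after the region)
        flags.exercised.set()


def inject_fault(flags, memory, address, bit):
    """Host-side injector: corrupt one bit, then restore it once exercised."""
    original = memory[address]
    flags.request.set()
    flags.ready.wait()
    flags.ready.clear()
    memory[address] = original ^ (1 << bit)
    flags.done.set()
    flags.exercised.wait()
    flags.exercised.clear()
    memory[address] = original


memory = {0x22F8: 0xF70427CC}
flags, log = Flags(), []
host = threading.Thread(target=inject_fault, args=(flags, memory, 0x22F8, 4))
host.start()
while host.is_alive():             # the MCP keeps calling the region
    mcp_run_region(flags, memory, 0x22F8, log)
    time.sleep(0.001)
host.join()
```

In this model the MCP only stalls when a request is pending, which mirrors the stated goal of minimizing disturbance to the running system.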
Simulated Fault Injection
Our simulation method was designed to allow rapid simulation while retaining the ties between the hardware and software architecture of the target system. The fault model used in this study, a single bit-flip in an instruction, requires the ability to simulate the LANai chip in the host interface at the instruction-set level. Propagating the impact of the fault in such a simulation, however, would be very slow if the entire system, including all four MCP programs running on the four nodes of the simulated network, were modeled at this level. For this reason, a second level of abstraction was added to the simulation to propagate the injected faults. Details of the two-level simulation and of how this method was first used to model a single host interface can be found in [6].
Faults are initially injected into a cycle-accurate, instruction-set simulation of the LANai chip provided by Myricom, Inc. This simulator models the processor pipeline, the instruction set, on-board memory, and all programmer-visible registers for one host interface. The DMA unit on the LANai chip is not modeled. Presumably, any section of the MCP could run on the cycle simulator if the interaction between the LANai chip, the system bus, and the DMA were modeled. Lacking this level of detail, however, constrains the regions of the MCP that we can simulate and, thus, also constrains fault injection to the "focused send" region. Only the MCP code region selected for fault injection is simulated on this processor. At the completion of a simulation, a given fault will have propagated to errors in the software variables of the simulated MCP, and so it is possible to translate the effect of the fault to the software level. This set of errors in the data memory and register file is stored as a fault dictionary entry to represent the given fault. Note that the fault dictionary entries for the abstract level are determined by the cycle-accurate simulation. A typical cycle-accurate simulation takes less than 10 seconds, including time for setup and loading and saving data.
Next, the analysis moves to a software-level simulation model, realized in C++, using the dependability simulation tool DEPEND [17]. This model contains the real code of the MCP for the four host interfaces simulated, additional software objects modeling the special-purpose hardware of the LANai chip, a simplified model of the network links and switch, and four application workloads (one per host interface). In this simulation, the four copies of the MCP actually run in the native instruction set of the workstation rather than that of the LANai chip. The errors recorded in each dictionary entry (changes in the data memory or register file) are injected at the appropriate simulation time (i.e., when the high-level simulation completes the code simulated at the lower level) into one copy of the MCP software, and the behavior of the simulated network is observed to determine the ultimate effect of the fault.
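One way to picture a fault dictionary entry is as the set of state differences between a fault-free run and a faulty run of the code region, which the software-level model later applies to its own copy of the MCP state. The sketch below illustrates only that idea; the data layout, the variable names, and the functions `record_entry` and `apply_entry` are assumptions for illustration and do not reflect how DEPEND or the Myricom simulator represent state.

```python
from typing import Dict, Tuple

Fault = Tuple[int, int]        # (instruction address, bit position)
StateDelta = Dict[str, int]    # changed variable or register -> corrupted value


def record_entry(dictionary: Dict[Fault, StateDelta], fault: Fault,
                 golden_state: Dict[str, int],
                 faulty_state: Dict[str, int]) -> None:
    """Low level: store only the locations whose values differ from the golden run."""
    dictionary[fault] = {name: value for name, value in faulty_state.items()
                         if golden_state.get(name) != value}


def apply_entry(dictionary: Dict[Fault, StateDelta], fault: Fault,
                mcp_state: Dict[str, int]) -> None:
    """High level: inject the recorded errors into one software-level MCP copy."""
    mcp_state.update(dictionary[fault])


# Hypothetical state after the region runs with and without the fault.
golden = {"r14": 0x1000, "send_ptr": 0x40, "msg_len": 128}
faulty = {"r14": 0x2000, "send_ptr": 0x40, "msg_len": 128}

fault_dictionary: Dict[Fault, StateDelta] = {}
record_entry(fault_dictionary, (0x22F8, 4), golden, faulty)

software_mcp = dict(golden)    # state of the software-level MCP at injection time
apply_entry(fault_dictionary, (0x22F8, 4), software_mcp)
```

A fault whose entry is empty leaves the software-level state untouched, which corresponds to a fault that produced no visible error at the end of the region.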
The two levels of simulation can proceed almost independently. The cycle-accurate simulator repeatedly simulates a specified module of the code, injecting faults and recording the results in the fault dictionary. The fault dictionary abstracts the effect of each fault from the level of the cycle-accurate simulator to a level at which the software simulator can model the fault occurrence. The software-level simulator simulates the actions of the four MCPs and applications. The only interaction between the two simulations is through the fault dictionary. Thus, a fault cannot be simulated at the software level until it has been considered at the cycle-accurate level; otherwise, the two parts can proceed independently.
Results
Table 2 shows the categories used to classify the results of each fault injection. The first five categories are considered severe. MCP restart is somewhat less severe, since it requires no action from the application to recover (device drivers can hide the effects of a restart from the application) and only the current messages in the buffer are lost. Categories such as message dropped and data corrupt can be easily remedied by network protocols (by resending corrupt messages). The fault injection on the simulation, unlike the real system, is deterministic (each unique fault would produce the same results for every injection); it does not have the categories of hang sometimes and multiple manifestations.
Faults can cause no error for several reasons. First, the fault injection cannot guarantee that all corrupt instructions execute, since the fault address is determined without knowledge of branching. Second, many faults change a nop instruction into another instruction that also performs no operation. Furthermore, many instructions can perform a similar function even when corrupted. For example, if an instruction compares a register value to some other value but the fault causes a different register to be used, there is a good chance that the same condition code may be set with or without the fault. Also, we cannot guarantee that the MCP will execute every instruction in the target code section. Section 7 illustrates several detailed examples of faults causing many of these error result categories.
Table 2: Categorization of Fault Injection Results
Fault injection result | Characterization
MCP hang | The host interface hung in at least half the experiments.
MCP hang sometimes | The host interface hung in some, but less than half, of the experiments.
Hang remote MCP | A fault on one host interface caused a different host interface to hang.
Host computer hang | The host computer either hung or rebooted itself.
Multiple manifestations | Some combination of the above results was seen.
MCP restart | The MCP on the host interface restarted.
Message dropped | Some message was not received.
Data corrupt | A valid message was received, but its contents had been modified.
Other errors | Some other error occurred.
No error | No error was detected.
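The categories in Table 2 summarize the outcomes of the repeated injections of one fault. The sketch below shows one possible reading of those rules. The hang thresholds come from the table; the handling of everything else (ignoring "No error" outcomes when a single kind of error also appears, and reporting any other mixture as multiple manifestations) is an assumption, since the text does not spell out those cases.

```python
from typing import Sequence


def classify_fault(outcomes: Sequence[str]) -> str:
    """Reduce the per-injection outcomes of one fault to a single category."""
    if not outcomes:
        raise ValueError("at least one injection outcome is required")
    hangs = sum(1 for outcome in outcomes if outcome == "MCP hang")
    if hangs * 2 >= len(outcomes):
        return "MCP hang"                # hung in at least half the injections
    if hangs > 0:
        return "MCP hang sometimes"      # hung in some, but fewer than half
    kinds = set(outcomes) - {"No error"}
    if not kinds:
        return "No error"
    if len(kinds) == 1:
        return kinds.pop()               # assumed: one kind of error dominates
    return "Multiple manifestations"     # assumed: any other mixture


print(classify_fault(["MCP hang"] * 5 + ["No error"] * 5))
# prints "MCP hang"
print(classify_fault(["MCP hang"] * 2 + ["No error"] * 8))
# prints "MCP hang sometimes"
```

With ten injections per fault, the boundary between the two hang categories falls at five hangs.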
Comparison of SWIFI and Simulated Fault Injection Results
Recall that in the comparison study between the simulation and SWIFI on the real system, each method used the same fault list to inject the same 500 faults. Some of these injections (48 injections) had to be discarded because they caused the simulator to crash. Thirty-six (73%) of these faults that crashed the simulator also caused MCP hang, MCP hang sometimes, or MCP restart in the SWIFI method. Also, 67 injections resulted in MCP hang sometimes or multiple manifestations for the SWIFI and had to be discarded because the simulation did not include these categories. The breakdown for the remaining 385 faults is presented in Table 3.
The leftmost column of Table 3 shows the fault injection results (based on the categories in Table 2). The next column gives the number of faults resulting in each category observed in the simulations. The SWIFI column shows the number of faults resulting in each selected category in the real system. Finally, the match column shows how often the simulation and the real fault injection results were identical. In computing the value for the match column, the SWIFI method was considered to give the "gold" result. If, for a given injection, the simulation result agreed with the SWIFI result, a match was said to occur. The value in the match column is the number of matches that fell into a given category divided by the total number of SWIFI results in that category. For example, the simulator and SWIFI results matched for 51 injections in the message dropped category. The maximum possible number of matches for this case would be 54 (100%). Thus, for the message dropped category the simulator accuracy was 94.4% (51/54); for the remaining 7 faults where the simulator detected a drop, the SWIFI determined a different error category.
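The match computation can be written out in a few lines. The sketch below uses synthetic (simulation, SWIFI) result pairs built only to reproduce the message dropped counts quoted above (51 matches, 54 SWIFI results, 7 additional simulator-only drops); the categories assigned to the non-matching pairs are arbitrary placeholders, not data from Table 3.

```python
from typing import List, Tuple


def match_rate(pairs: List[Tuple[str, str]], category: str) -> float:
    """Fraction of SWIFI results in `category` that the simulation reproduced.

    Each pair is (simulation result, SWIFI result); SWIFI is the gold result.
    """
    swifi_total = sum(1 for _, swifi in pairs if swifi == category)
    if swifi_total == 0:
        raise ValueError(f"no SWIFI results in category {category!r}")
    matches = sum(1 for sim, swifi in pairs if sim == swifi == category)
    return matches / swifi_total


pairs = (
    [("Message dropped", "Message dropped")] * 51   # both methods agree
    + [("No error", "Message dropped")] * 3         # placeholder simulator results
    + [("Message dropped", "Data corrupt")] * 7     # placeholder SWIFI results
)
print(f"{match_rate(pairs, 'Message dropped'):.1%}")
# prints "94.4%"
```

Note that the seven simulator-only drops do not lower this figure, because the denominator counts SWIFI results only.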
Table 3 shows that the simulation does extremely well at detecting when no error will occur (matching SWIFI for over 99% of the faults) and reasonably well at predicting the less severe injection results (e.g., the simulation correctly identified a dropped message for about 95% of the faults where SWIFI determined that result). The simulation, however, has relatively low accuracy in predicting severe fault results, such as host interface hang, where about 20% of the injections match. One reason for the low level of accuracy is that the simulation does not fully model the interaction between the host and the interface. Several factors affecting the accuracy level are addressed later in the discussion section.
Effect of Code Selection
Table 4 gives the combined results for injecting into different regions of code from the SWIFI experiments. The table is divided into three sets of data, one for each code-selection parameter value used in the experiment. For each parameter, the table shows (a) the number of times fault results fell into each category and (b) the ratio of the number in (a) to the number of injections resulting in errors. There are two major points from this data that should be noted.
First, the activation rate for the focused region is much greater than that of the broad region. For the focused region, nearly half of the faults produced an error, but in the broad region only 43 out of 400 (11%) of the faults produced an error. One reason for this difference in activation rate is that in the focused region each corrupt instruction gets executed, while in the broad region, due to control flow branches, the corrupt instruction might not be executed. The second point is that the distribution of the results of those faults that caused errors is very similar between the two broad regions. The percentage of MCP hang sometimes is a bit higher in the focused region. In the remaining categories, either the numbers are too small or the differences are too small to be statistically significant.
The results indicate that, despite the similarities, different scopes of injection should be used in dependability analyses.
Effect of Workload Levels
In Table 5, the number of fault results for each category is given for both workload levels. The right column shows, for each category, the total number of faults that had the same result for each workload level. The heavy workload sends messages at an average rate of 150 Mbps; the rate for the low workload is about 2-3 Mbps. The table shows that the results from injecting faults into the interface of an Ultra with a very heavy workload are similar to the same injections with a light workload for the less severe categories.
Although we expected more hangs using the heavy workload, the results for MCP hangs are in fact the opposite. We were unable to find the exact reason for this pattern. However, the MCP hang sometimes category offers some insight. This category almost disappears for the light workload. The light workload does not exercise certain parts of the program (e.g., code to handle a full buffer might never execute). Apparently, the conditions needed for such a fault to cause an error only occur when the queue lengths are large or when parts of the MCP compete for resources. Since there are some errors that only appear in the heavy workload and others that only appear in the light workload, multiple workloads should be considered when high fault coverage is needed.
Effect of Choice of Target Platform
Result                Ultra   PC    Matches
MCP hang              23      20    18
MCP hang sometimes    10
Total                 800     800   773
Table 6: Breakdown of Fault Injection Results for Different Platforms
The results of the fault injection to each platform are given in Table 6. The rightmost column gives the number of faults for each category that matched. Since the faults are injected to the LANai processor independently of the host CPU, we expect that most faults should produce the same result on each platform. Each platform, however, has its own I/O bus and operating system, which could produce different results.
The network needs one node, which is automatically selected, to perform extra functions to maintain the routing information. In our experiments, a PC was selected to perform these functions. These extra functions may be another source of differences in the results.
As in the previous cases, the number of errors is small relative to the number of injections. The relative number of matches is lower than in previous cases. For example, taking the MCP hang and MCP hang sometimes categories together, the percentage of matches is less than 70%. It appears from our results that the type of host platform makes a significant difference in the dependability behavior of the network. Also, and perhaps more important, obtaining results for one platform type does not necessarily predict the behavior on another platform type. Further studies would be needed to verify this result.
Severe Fault Effects and Suggestions for Improving Dependability
Considering our SWIFI results as a whole, we found three mechanisms that led to major (severe) host or remote node failures, two of which also led to fault propagation outside of an interface. If additional error detection and recovery features were to be added to the system, these cases would be the best candidates because of their severity. The three mechanisms and some suggestions for improving the dependability for each are described in this section.
In the first mechanism, a fault in a host interface caused its host computer to crash. The fault propagated to the host by corrupting the EAR register in the host interface before a DMA transfer of data to the host. (The EAR holds the address in the host's physical memory to which data will be transferred.) The DMA transfer then corrupted vital data in the host computer, causing a crash.
An invalid address could be written into the EAR register for several reasons. It could be the result of (i) a natural phenomenon such as an alpha particle causing a bit-flip (something like our fault injections), (ii) programming errors (a user modifies the MCP for some application and introduces a bug), or (iii) a malicious user running a program on the host machine. The danger of the first two cases could be reduced by adding some hardware to the host or interface to prevent the DMA from accessing illegal locations. The third case is due primarily to Myricom's open architecture; currently any user may edit the MCP. User access to this memory is a feature allowing users to develop new network protocols. If a high level of dependability (or security) is required for the network, however, access to the MCP could be restricted.
In the second mechanism, sending a message with a valid header but zero bytes of data to another interface caused that interface to hang. This is the only error we found that propagated from one node to another. This mechanism could easily be avoided by adding a check for data messages with zero bytes of data in the function that tests the validity of incoming messages. This example shows how fault injection experiments can uncover gaps in the fault containment regions implemented in dependable systems.
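The suggested check can be illustrated with a small sketch. It is not the MCP's actual validity function: the Message type, its fields, and is_valid_incoming are hypothetical names introduced here only to show where a zero-length test would sit.

```python
from dataclasses import dataclass

@dataclass
class Message:
    # Hypothetical message model; the real MCP structures are not shown in the text.
    header_valid: bool  # result of whatever header checks already exist
    is_data: bool       # True for data messages (as opposed to control messages)
    payload: bytes

def is_valid_incoming(msg: Message) -> bool:
    """Reject messages before they reach the receive path."""
    if not msg.header_valid:
        return False
    # Added check: a data message with a valid header but zero bytes of data
    # is rejected here instead of being handed to the receive code.
    if msg.is_data and len(msg.payload) == 0:
        return False
    return True

print(is_valid_incoming(Message(True, True, b"")))       # rejected
print(is_valid_incoming(Message(True, True, b"hello")))  # accepted
```

Rejecting the message at the validity test keeps the fault inside the sending node's containment region, which is the point the paragraph above makes.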
The last severe error that we uncovered was MCP hangs. There seem to be several causes for such failures, but many were the result of upsetting the state machine used by the MCP's scheduler. Such hangs affect the MCP's ability to perform certain basic functions, such as sending messages. If these failures can be detected, they can be fixed either by simply restarting the MCP or by reloading the MCP if there is a danger that instruction memory is corrupt (or to contain the error). These actions can be implemented using a watchdog to detect the MCP's failure to send messages.
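A watchdog of the kind suggested here might look like the following sketch. The class name, the timeout value, and the restart callback are assumptions made for illustration; the text does not specify how send progress would be observed or how the MCP would be restarted or reloaded.

```python
class SendWatchdog:
    """Detects a stalled sender: work is pending but nothing has been sent."""

    def __init__(self, timeout, restart):
        self.timeout = timeout      # seconds without progress before recovery
        self.restart = restart      # hypothetical callback: restart or reload the MCP
        self.last_progress = 0.0
        self.pending = False

    def message_queued(self, now):
        # Start timing only when the queue goes from empty to non-empty.
        if not self.pending:
            self.pending = True
            self.last_progress = now

    def message_sent(self, now, queue_empty):
        # Any completed send counts as progress.
        self.last_progress = now
        self.pending = not queue_empty

    def poll(self, now):
        """Call periodically; returns True if recovery was triggered."""
        if self.pending and now - self.last_progress > self.timeout:
            self.restart()
            self.pending = False
            return True
        return False

restarts = []
wd = SendWatchdog(timeout=1.0, restart=lambda: restarts.append("restart"))
wd.message_queued(now=0.0)
print(wd.poll(now=0.5))  # still within the timeout
print(wd.poll(now=2.0))  # no send completed in time, recovery triggered
```

Timestamps are passed in explicitly so the logic can be exercised without real delays; an idle interface (nothing pending) never triggers recovery.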
Illustrative Sample Faults and Errors
This section gives detailed explanations of some specific faults that were used in the experiment and produced different errors.
Data Corrupt
In the "receive code," one fault changed ld M[127fc], %r14 to ld M[127cc], %r14. The instruction should load the contents of memory address 127fc_16 into register 14. The data structure HostReceiveBuffer is stored in memory location 127fc_16. This is supposed to be a pointer to the beginning of an incoming message. Changing the address of the pointer causes the data to appear corrupted. In particular, the application receiving the message finds that the beginning of the message does not match the expected header.
Host Computer Hang, Example 1
In the "receive code," one fault changed ld M[16b30], %r11 to ld M[17b30], %r11. The contents of memory location 16b30_16 is a pointer to the address in the DMA (using the host computer's physical address space) where the incoming message is supposed to be written. The new memory location 17b30_16 is apparently unallocated and set to zero. The instruction following the faulty one stores the value of register 11 into 3c_16 (in the LANai's memory space, the upper bits are ignored), which is the memory-mapped register EAR containing the base address in the host computer's physical address space of the destination of the memory transfer. The DMA engine on the interface will then write to the memory location with physical address equal to EAR. Writing to this memory location resulted in a system crash on both the Ultra and the PC. The error can be recreated by writing 0 into the EAR and performing a DMA write.
Host Computer Hang, Example 2
In the "send code," one fault changed st [%r12], %r9 to st [%r12+152], %r9; %r12 <- %r12+152. Register 12 is supposed to hold the HostSendItem structure. In subsequent instructions, the value of register 12 is used to find the buffer holding the data that is about to be sent. The DMA engine attempts to read the physical address given by the faulty address. This causes a system failure on each platform.
MCP Hang
In the "receive code," one fault changed bt 22c4 to bt 2244, where bt is the mnemonic for "branch true" (or unconditional branch). The address of this instruction is 22b4_16. Thus, this fault changes the instruction from a forward branch to a backward branch. In this case, the fault creates a loop of 16 instructions in the program's control flow. Since the fault injector waits for the MCP to leave the target segment of code before reverting the faulty instruction, the program is caught in a loop. Hence, the fault becomes a permanent fault, and the MCP is stuck in the loop. This is considered an MCP hang by the categories above, since the MCP appears to the API to have hung.
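The address arithmetic behind this example can be checked directly. The sketch below compares only the branch targets as printed in the text; whether the LANai instruction encoding stores an absolute target or a relative offset is not stated here, so the helper names and the bit-level view are illustrative assumptions.

```python
BRANCH_ADDR = 0x22B4      # address of the bt instruction
ORIGINAL_TARGET = 0x22C4  # fault-free branch target
FAULTY_TARGET = 0x2244    # branch target after the injected fault

def differing_bits(a, b):
    """Bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(max(a, b).bit_length()) if (diff >> bit) & 1]

def direction(branch_addr, target):
    """Classify a branch as forward or backward relative to its own address."""
    return "forward" if target > branch_addr else "backward"

print(differing_bits(ORIGINAL_TARGET, FAULTY_TARGET))  # [7]
print(direction(BRANCH_ADDR, ORIGINAL_TARGET))         # forward
print(direction(BRANCH_ADDR, FAULTY_TARGET))           # backward
```

The two printed targets differ in a single bit, yet that one bit is enough to turn a short forward skip into a jump back over code that leads to the same branch again.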
Multiple Manifestations
In the "send code," one fault changed ld [%r14+24], %r14 to ld [%r14+88], %r14 (the offsets for this example are given by decimal values instead of hexadecimal). This instruction is called inside the NetSendQueue.putPeek() call. The value at 24 + register 14 should be *buffer[putNext]. Instead, the faulty instruction gets the buffer address from beyond the bounds of the buffer array. The result of such an operation depends on which buffer in the circular queue is being used. This fault could result in MCP hangs, MCP resets, or dropped or corrupted messages.
Message Dropped, Example 1
In the "receive code," one fault changed sub.f %r14, 0, %r0 to sub.f %r14, 128, %r0. The original instruction subtracts (compares) r14 - 0, setting condition flags, and ignores the result (r0 is specified as the destination, but it is hardwired to zero). At this point in the code, register 14 holds the value of HostReceiveChannel->receiveAck.full(), which is true only if the receive queue is full (which should not happen in the lab setup since the program continually empties the receive queue). The next instruction (branch not equal, bne 2170) after the faulty one branches, if the ZERO flag is clear, over a segment of code that handles a full receive queue. The statement should return 0 since the queue rarely fills in the experimental setup. Thus, the compare statement should set the ZERO (or EQUAL) condition flag. Then, the branch statement should skip past the code to handle a full receive queue. Instead, the faulted compare statement becomes 0 - 128, which will not set the ZERO flag, and the branch is not executed. The MCP then assumes the receive queue is full and drops the incoming message. This fault produced a dropped message in the real system.
Message Dropped, Example 2
In the "send code," one fault changed ld M[%r12+140], %r14 to ld M[%r12+140], %r30. Register 30 is not used in supervisor mode (which we use exclusively), so we are not concerned about the contents of register 30. Here, register 12 holds the HostSendItem structure, and HostSendItem+140_16 is the channel number of the current message as set by the API. The value of register 14 is supposed to have been used by the MCP to set the message channel. Since register 14 is not set with the faulty instruction, the MCP either drops the message or sends it to the wrong channel, where it is dropped. SWIFI detected a dropped message for this fault.
Discussion
The initial weeks of our comparison effort turned out to be a learning process. A number of problems associated with our simulation efforts made it difficult to match the behavior of the simulation to that of the real device. Those problems are discussed in this section, including limitations in the simulator model, specification problems, and effects of the simulation environment.
Cycle-Accurate Simulator Limitations
The cycle-accurate simulator had three basic limitations for our purposes. First, the standard version supplied by Myricom simulates only the CPU core of the LANai chip. None of the memory-mapped I/O on the real chip is implemented. As a result, we were restricted from simulating certain regions of the MCP code at the cycle level and so could not inject faults into these regions. The simulator could be extended to model the entire LANai chip; however, we chose instead to find an important region of code (the "focused send" region) that could be simulated in the cycle simulator without modifications. In this way we were able to make a convincing argument for the validation of the simulation without bringing in the additional issues of developing and testing new features in the cycle simulator.
The second limitation in the cycle simulator was that it was not guaranteed to match the host interface behavior for certain error conditions, such as the response to an illegal instruction or invalid memory access. One particular difference we noted was that the simulator did not implement memory protection of the low memory segment, where the MCP is stored, as was done in the real host interface. Some faults were observed to attempt writes to this region. While these writes would be discarded in the real interface, they were allowed to proceed in the cycle simulation. In the examples we observed, the lack of memory protection did not appreciably alter the results of the simulation, but the potential certainly exists.
The third limitation applies to cycle-accurate simulation in general. Such simulators are typically thousands of times slower than the real device they simulate. For most faults, this simulation time was very short because we could quickly translate the fault effect to the software level. There were rare faults, however, that would change the program counter to a random value. In these cases, execution would be outside of typical program paths, and thus the effect of the fault could not be translated to the software level right away. If execution never returned to normal, a hang could be assigned to these cases. However, this random code could lead to a reset of the LANai chip or a return to normal execution. The only way to be sure of the result was to continue simulation. Due to the cost of simulating at this level, however, an arbitrary limit of 80,000 cycles was set. If the program had not reset or resumed normal execution within this period, the result was marked as a hang.
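The cycle budget described above amounts to a bounded simulation loop. The following sketch assumes a hypothetical step() callback that advances the cycle simulator by one cycle and reports its state; only the 80,000-cycle limit and the three outcomes come from the text.

```python
CYCLE_LIMIT = 80_000  # arbitrary budget used in the study

def classify_wild_jump(step, cycle_limit=CYCLE_LIMIT):
    """Run the simulator after a corrupted program counter and classify the outcome.

    `step` is a hypothetical callable that simulates one cycle and returns
    "running", "reset" (the LANai chip reset), or "normal" (execution
    returned to a typical program path).
    """
    for _ in range(cycle_limit):
        state = step()
        if state in ("reset", "normal"):
            return state
    # Budget exhausted without a reset or a return to normal execution.
    return "hang"

# Toy simulator that recovers on its third cycle.
states = iter(["running", "running", "normal"])
print(classify_wild_jump(lambda: next(states)))  # normal

# Toy simulator that never recovers within a small budget.
print(classify_wild_jump(lambda: "running", cycle_limit=100))  # hang
```

A fault that would have recovered just after the budget expires is still recorded as a hang, which is one way the simulated and SWIFI results can diverge.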
Specification Problems
Another problem we encountered in the development of our simulations was that the response of the LANai device to various error conditions was very vague or missing in the official device specification. One example of unspecified behavior was the response to an unaligned memory access. A 32-bit memory access has to be aligned to a 32-bit boundary (the memory address must be evenly divisible by 4) in the LANai device. This information is clearly stated in the specification. However, the behavior when accesses are not aligned is not specified. While it is understandable that such information is unnecessary for programmers, it is necessary for the proper simulation of the device, particularly when such error conditions are likely to occur due to fault injection.
Some of the unspecified behavior we came across was cleared up through discussion with Myricom. In other cases, however, we did not even realize we had overlooked some error condition until we observed it to cause a difference between the SWIFI and simulation experiments. A particular example involved a fault that changed the DMA DIR register. This register holds a one-bit value specifying the direction of DMA transfers between the host (workstation) and LANai chip. In our simulations, we considered only the lowest bit of this register to be valid, and so writing a five or a one to this register would give the same result. Experiments on the real device showed, however, that values other than zero or one written to this register could cause a hang. The results presented in this study were taken before training the simulator.
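The DMA DIR discrepancy comes down to how out-of-range register values are interpreted. The sketch below contrasts the two readings; the function names are hypothetical, and the "possible hang" label simply restates the observation above rather than modeling the real device.

```python
def simulated_dma_dir(value):
    """Original simulation assumption: only the lowest bit of the register matters."""
    return value & 1

def observed_dma_dir(value):
    """Behavior suggested by the SWIFI runs: values other than 0 or 1 may hang."""
    return value if value in (0, 1) else "possible hang"

for written in (0, 1, 5):
    print(written, simulated_dma_dir(written), observed_dma_dir(written))
```

Under the masking assumption a write of 5 is indistinguishable from a write of 1, so any fault that sets higher bits of the register is invisible to the simulation even though it can hang the real interface.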
Effects of Simulation Environment
One problem in simulation is deciding where to draw the boundaries of the simulated system. Interaction between objects inside the boundaries and those outside that have been abstracted away may cause the simulation to act differently than the real device. One case where this problem appeared in our study involved a fault in the real host interface that caused the reset of its host workstation. Because the host was outside our simulation boundaries, we did not model any of the interaction between host and interface. Therefore, we were not able to predict this occurrence.
Another effect of the environment was due to our execution of the MCP code natively on a workstation (for the software-level model). While this approach allowed the simulation of the MCP to be very fast, it posed a problem. The LANai chip itself has only a limited memory protection that simply discards illegal accesses, but on a workstation, illegal memory accesses can cause the termination of an application. While every attempt was made to avoid crashing the simulations due to memory access violations, 48 such crashes still occurred (see Section 6.1). For example, the pointer type needed to be overloaded so that a fault causing the MCP to access an illegal memory location would still access memory in the LANai's memory space and not memory in the simulator's memory space.
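The idea of routing every MCP memory access through a guarded memory model can be sketched as follows. This is a loose Python analogy, not the pointer overloading used in the study: the class name, the byte-level interface, and the choice to return 0 for a discarded load are all assumptions.

```python
class LanaiMemory:
    """Simulated interface memory that discards illegal accesses
    instead of letting them touch the simulator's own memory."""

    def __init__(self, size):
        self.size = size
        self.cells = bytearray(size)
        self.discarded = 0  # count of illegal accesses, useful for diagnostics

    def load(self, addr):
        if 0 <= addr < self.size:
            return self.cells[addr]
        self.discarded += 1
        return 0  # assumption: a discarded load reads as zero

    def store(self, addr, value):
        if 0 <= addr < self.size:
            self.cells[addr] = value & 0xFF
        else:
            self.discarded += 1  # illegal write is dropped, simulation continues

mem = LanaiMemory(size=16)
mem.store(4, 0xAB)
mem.store(1_000_000, 0xFF)  # faulty address: discarded rather than crashing
print(hex(mem.load(4)), mem.load(1_000_000), mem.discarded)
```

An alternative consistent with the earlier remark that upper address bits are ignored would be to wrap the address into the valid range rather than discard the access; which behavior matches the real chip depends on the address region and is not settled by the text.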
Finally, because the simulation engine and the MCP shared one user process in our software-level simulation and had to communicate with each other, it was possible for an errant MCP to corrupt variables belonging to the simulation engine. In the runs we recorded, this behavior was not observed, but it was considered in our design. Avoiding this problem, at least in part, would have meant distributing our simulation among multiple processes so as to provide the operating system's memory protection to each simulation process. Such a design was beyond the scope of this study.
Conclusion and Future Work
This paper has presented a dependability study of a Myrinet LAN based on fault injection into the network interface hardware. The study explored the user-level impact of these faults in terms such as dropped messages, interface hangs, or host system hangs. Furthermore, the study was conducted using two methods of analysis, simulated fault injection and software-implemented fault injection (SWIFI), and presented a comparison of the results of each. This comparison was used to validate the simulation results against the SWIFI results and to explore the major reasons for differences between the results of the two methods.
Some results of the study underlined the dangers of unprotected DMA accesses into the host and unprotected access to the interface control program. The unprotected DMA access allowed an errant control program to corrupt its host computer, causing that computer to reboot. Because this behavior could be replicated by malicious users as well as by faults in the interface, allowing unrestricted access to the control program could pose a security risk. While these results were found through fault injections on a Myrinet, other networks that use similar mechanisms could be prone to the same behaviors. Thus, these issues could be of concern in any highly dependable networking environment.
The comparison of the two methods showed the simulations to be accurate at predicting the errors of moderate severity but not very accurate at predicting more severe errors. In particular, the simulator had an average accuracy of 97% for the moderate categories of no error, corrupt messages, and dropped messages, while averaging a 21% accuracy for the severe cases of MCP restarts or MCP hangs. The reasons for this result were the combination of limitations in the simulator, problems with the LANai specification, and interactions between the simulation and its execution environment. Overall, the simulator obtained 83% accuracy due to the relative rarity of the severe errors.
One conclusion from the comparison that should be emphasized is that dependability specifications should be part of the design process from the start. The major reason for mismatches between the simulation and SWIFI methods was the lack of a clear, complete, and correct specification for the error behaviors of the target system. If these error behaviors are not set down from the beginning of a design, obtaining such a specification can be very difficult, impeding the dependability analysis process.
Finally, SWIFI experiments examined the effect of workload and host platform on the manifestation of errors. While a high workload level is more likely to create errors than a low level, experimenting with multiple workloads is required for high fault coverage. The choice of host platform had little bearing on the effect of common errors such as dropped messages, but it could be important when faults propagate to the system bus, particularly for catastrophic errors.
To gain better insight into the propagation of faults from the network to the host, a detailed study of the interaction with the system host is required. If the simulation were extended to include more of these features, its results might match the real system more closely. Analysis of a larger section of code or use of a different fault model may lead to more complete results.
