Abstract: Intel Corporation developed the Teraflops supercomputer for the US Department of Energy (DOE) as part of the Accelerated Strategic Computing Initiative (ASCI). This is the most powerful computing machine available today, performing over two trillion floating point operations per second with the aid of more than 9,000 Intel processors. The Teraflops machine employs complex hardware and software fault/error handling mechanisms for complying with DOE's reliability requirements. This paper gives a brief description of the system architecture and presents the validation of the fault tolerance mechanisms. Physical fault injection at the IC pin level was used for validation purposes. An original approach was developed for assessing signal sensitivity to transient faults and the effectiveness of the fault/error handling mechanisms. Dependency between fault/error detection coverage and fault duration was also determined. Fault injection experiments unveiled several malfunctions at the hardware, firmware, and software levels. The supercomputer performed according to the DOE requirements after corrective actions were implemented. The fault injection approach presented in this paper can be used for validation of any fault-tolerant or highly available computing system.
INTRODUCTION

The Teraflops supercomputer was developed by Intel Corporation for the US Department of Energy's Accelerated Strategic Computing Initiative (ASCI) program.
In June 1997 and October 1998, the Teraflops machine set new world records of 1.338 and 2.38 tera floating point operations per second, respectively, on the Multi-Processor Linpack (MP Linpack) benchmark. Consisting of more than 9,000 Pentium Pro processors, over half a terabyte of RAM, and more than two terabytes of disk storage, the Teraflops machine is not only the fastest but also the world's largest supercomputer. Currently, Sandia National Laboratories is using this supercomputer for simulating the aging process of nuclear weapons (eliminating physical tests as a means for maintaining the integrity of US nuclear stockpile), modeling ballistic weapon systems, and running other complex simulation programs.
Reliability requirements for the Teraflops supercomputer (four weeks MTBF with at least 97 percent of the system resources available), the extremely large size of the machine, and the fact that commercial off-the-shelf (COTS) components were widely used demanded a well-defined strategy for designing, implementing, and validating the fault tolerance mechanisms. This paper provides a brief description of the architecture of the Teraflops system and its fault/error handling mechanisms and presents the fault injection experiments performed to validate these mechanisms.
Fault/error injection is a common practice for validating fault tolerance mechanisms and estimating coverage probability (conditional probability that a system recovers given that a fault occurs). A wide variety of physical fault injection, software-implemented fault injection (SWIFI), and simulated methods have been developed over the last two decades. Physical fault injection and SWIFI are usually performed on the target system. The most widely used methods for performing physical fault injection are: injection at the IC pin level [1], [2], [8], [20], [25], heavy-ion radiation [20], [21], electromagnetic interference [20], and laser-based fault injection [29]. Several SWIFI experiments [3], [5], [32] and tools, such as FIAT [31], EFA [11], DOCTOR [15], FERRARI [19], and Xception [4], were presented in the literature. Alternatively, simulation models can be used. ADEPT [13], DEPEND [14], and MEFISTO [18] are among the most well-known fault injection simulation tools. Combined implementations, based on hardware, software, and simulated fault injection, were employed in [2], [6], [30], [35]. Simulated and scan chain implemented fault injection (SCIFI) were compared in [12]. A review of the main fault injection tools was provided in [16].
An enhanced version of physical fault injection at the IC pin level was used for validating the fault/error handling mechanisms of the Teraflops supercomputer. The experiments were performed in a three-dimensional space of events, the fault location, injection time, and fault duration being randomly selected. The use of multidimensional spaces had been initially proposed for deriving coverage probabilities [10] and employed in simulated fault injection [9] . Our experiments show that this approach provides two major advantages. First, fault injection is performed in a more realistic environment. Second, it allows for the accurate assessment of signal sensitivity to transient faults and evaluation of the effectiveness of the fault tolerance mechanisms.
The fault injection experiments performed on the Teraflops supercomputer unveiled several hardware, firmware, and software malfunctions. After corrective actions had been undertaken, the machine performed according to the specification.
The paper is organized as follows: Section 2 presents the architecture of the Teraflops supercomputer and its main fault tolerance mechanisms. Physical fault injection in a three-dimensional space of events is described in Section 3. Section 4 presents the experimental setup. The results of the fault injection experiments are discussed in Section 5. Confidence intervals of the fault/error detection coverage are derived, as a function of time, and guidelines for assessing signal sensitivity to transient faults are provided. Conclusions are given in Section 6.
ARCHITECTURE OF THE TERAFLOPS SUPERCOMPUTER
System nodes support system booting and operation of the Scalable Platform Services (SPS) stations. Eagle boards host the system nodes. Two SPS stations are used for system booting, configuring, monitoring, and fault/error reporting. When faults occur, SPS stations control system reconfiguration and recovery. Additional details about the architecture of the Teraflops supercomputer are given in [26].
Teraflops software evolved from the Intel Paragon supercomputer software and provides a scalable message passing environment. The service, I/O, and system nodes run the Teraflops Operating System (TOS). This is a distributed Unix operating system and consists of a microkernel and an OSF server. TOS presents a single system image to the user. A second operating system, PUMA (Performance-oriented User-managed Messaging Architecture), runs on the compute nodes. PUMA was developed at Sandia National Laboratories and the University of New Mexico. It uses a message passing mechanism and consists of a Quintessential Kernel (QK) and the Process Control Thread (PCT). QK provides computation and communication facilities and address protection. PCT ensures priority based process management and communication capabilities. Readers can find details about Teraflops software in [26], [36].
Monitoring and Recovery Subsystem
The Monitoring and Recovery Subsystem (MRS) provides the capability to detect and recover from faults/errors experienced at the Field Replaceable Unit (FRU) level of the Teraflops system. The fault/error handling process consists of one or more of the following steps: detection, isolation, and diagnosis of the faulty FRU, reintegration if the fault/error is transient, replacement by a spare if a permanent or intermittent fault is detected and spares are available, and graceful degradation if no spares are available. The threshold for intermittent faults (number of occurrences before replacement of a FRU) is manually set by the operator. Spare compute, service, I/O, and system nodes are allocated when the system is configured. The ICF can degrade gracefully by isolating the failed MRCs and rerouting the traffic. Redundant Arrays of Inexpensive Disks (RAID) are used for ensuring integrity of the stored data. Redundant power supplies, fans, and blowers are also provided.
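The recovery choices listed above can be read as a small decision rule. The Python sketch below is only an illustration of that reading: the names `Action` and `choose_action` are hypothetical, and treating a FRU whose error count reaches the operator-set threshold as intermittently faulty is an assumption about how the threshold is applied, not a description of the MRS software.

```python
from enum import Enum


class Action(Enum):
    REINTEGRATE = "reintegrate FRU"
    REPLACE_WITH_SPARE = "replace FRU by a spare"
    DEGRADE = "graceful degradation"


def choose_action(permanent, occurrences, intermittent_threshold, spares_available):
    """Pick a recovery action for a FRU on which a fault/error was detected.

    permanent              -- True if diagnosis found a permanent fault
    occurrences            -- errors logged so far for this FRU (assumed counter)
    intermittent_threshold -- operator-set number of occurrences before replacement
    spares_available       -- number of spare FRUs of the same kind
    """
    # Assumption: reaching the threshold reclassifies the fault as intermittent.
    needs_replacement = permanent or occurrences >= intermittent_threshold
    if not needs_replacement:
        return Action.REINTEGRATE          # transient fault: put the FRU back in service
    if spares_available > 0:
        return Action.REPLACE_WITH_SPARE   # permanent/intermittent fault, spare on hand
    return Action.DEGRADE                  # no spare left: run with fewer resources
```

For example, `choose_action(False, 1, 3, 2)` selects reintegration, while the same FRU with no spares left and a third logged error falls back to graceful degradation.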
All major components of the machine, e.g., compute nodes, ICF backplanes, power supplies, disks, are field replaceable without powering down the system (hot swappable). Although the physical replacement of the failed FRU is performed manually, precise hot swapping procedures are provided by MRS. In this way, the probability of human errors is significantly diminished.
Fault/error handling at the FRU level is ensured by the following mechanisms: error detection and correction codes, protocol checking, watch-dog timers, and memory scrubbing, for all nodes of the system, and cyclic redundancy check, parity, checking for loss of frame, and inbound/outbound stall, for the ICF.
MRS consists of SPS stations and Patch Support Boards (PSBs), as shown in Fig. 1. Each Teraflops cabinet accommodates four cardcages and each cardcage has two backplanes. The backplanes belong to either plane A or plane B of the ICF. The X, Y, and Z arrows represent the three dimensions of the ICF. Four Kestrel or Eagle boards are connected to each backplane. Two clock boards, primary and backup, are available per cabinet. Each cardcage is monitored by one PSB. Redundant interconnects are provided between PSBs and cardcages over boundary scan test ports (JTAG) and dedicated serial ports. PSBs report the faults and errors detected at the FRU level to the SPS over private Ethernet links.
The Teraflops machine can be partitioned into smaller subsystems, called compute partitions. Partitions can include spare nodes. Each partition is managed by MRS and can independently execute its own application programs. Recovery from failures induced by transient faults can be achieved by implementing checkpointing at the application level. Both hardware and software failures are confined to the partition in which they occur. Teraflops architecture also allows for simultaneous execution of critical applications on two independent compute partitions (space redundancy).
FAULT INJECTION STRATEGY
Physical fault injection at the IC pin level was employed for validating the fault/error handling mechanisms of the Teraflops supercomputer. A three-dimensional space of events, consisting of fault location, time of occurrence, and duration of the faults, was considered [9], [10]. Fault locations were represented by the signals affected by the transient faults. Transients were injected in randomly selected signals, at random time instances. The duration of the faults was also randomly selected.
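A minimal Python sketch of drawing experiments from such a three-dimensional space is given below. The function name `draw_experiments`, the uniform sampling of the injection time, and the choice of durations from a fixed candidate list are illustrative assumptions; the paper does not state which distributions were used.

```python
import random


def draw_experiments(signals, window_s, durations_ns, count, seed=None):
    """Return `count` randomly drawn (location, time, duration) fault descriptions.

    signals      -- candidate fault locations (signal names)
    window_s     -- length of the run during which a fault may be injected, in seconds
    durations_ns -- candidate transient durations, in nanoseconds
    """
    rng = random.Random(seed)  # a seed makes the plan reproducible
    plan = []
    for _ in range(count):
        plan.append({
            "signal": rng.choice(signals),              # dimension 1: fault location
            "inject_at_s": rng.uniform(0.0, window_s),  # dimension 2: time of occurrence
            "duration_ns": rng.choice(durations_ns),    # dimension 3: fault duration
        })
    return plan
```

Fixing the seed also supports the deterministic replay of a particular experiment when a malfunction has to be traced back to its root cause.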
Practically, the selected signal was overdriven to 0 or 1 ("0" or "1" type of fault, respectively), using a fault injection probe. The probe was designed to minimize intrusiveness by complying with signal DC and AC specifications. Different probes were used for injecting faults in GTL and CMOS signals. The induced errors were similar to those generated by many internal faults experienced by the ICs and emulated the effects of environment perturbations (e.g., power glitches, electromagnetic interference). Our experiments have shown that some of the intermittent faults generated by poor chip manufacturing processes can also be emulated.
By changing the fault duration, we were also able to assess signal sensitivity to transients. Simulated fault injection was used to determine the effectiveness of candidate fault/error handling mechanisms, employed to protect the most sensitive signals.
In a few cases, fault injection was performed simultaneously in several signals in order to assess the impact of multipoint faults. Because of practical considerations, multipoint fault injection was used only when the experiments were focused on the evaluation of error detection/correction mechanisms used for protecting a limited number of signals (e.g., data and address signals).
It has to be noted that deterministic fault injection in a three-dimensional space can also be employed for finding the root cause of malfunctions unveiled by the random experiments. In that case, faults are injected in deterministically selected signals at a given time and have a well-determined duration.
We chose to inject transient faults because these are the faults most frequently experienced by modern computers [17], [22], [24], [33]. The effect of permanent faults can be assessed by extending the duration of the transients.
Fig. 2 shows the block diagram of a generic fault injection experiment, performed on the Front Side Bus (FSB) of a compute node. The chip set connects FSB to the node memory and I/O devices. Transistors graphically suggest that faults can occur anywhere inside the ICs, perturbing the operation of one or more signals, while the crooked arrows represent the environment induced errors.
EXPERIMENTAL SETUP
Fault injection was performed on an experimental setup consisting of three Teraflops cabinets equipped with 59 Kestrel boards (118 compute nodes), one Eagle board (one I/O node), the appropriate clock and PSB boards, and disk subsystems. One SPS station was available. Current Teraflops operating systems and PSB and SPS software packages ran on this setup.
In order to assess the impact of faults on a large system, multiple applications were executed on independent computing partitions. The MP Linpack benchmark was executed on the compute partition which included the injected node. MP Linpack performs matrix computations and also verifies the correctness of the numerical results by deriving the residues, at the end of every iteration. This particular feature allowed us to check for possible numerical errors, undetected by the system (silent data corruption). In addition, a Molecular Dynamics (MD) application was executed on an independent partition.
The compute partitions are shown in Fig. 3. The first one, consisting of 36 Kestrel boards (72 compute nodes), ran the MP Linpack benchmark. Kestrel boards belonging to that partition are designated by letter a. The second partition, consisting of 23 Kestrels (46 compute nodes), marked by letter b, executed the MD application. The shape of the partitions was chosen so that heavy message traffic could be carried over the ICF. One Eagle board, B, was used for booting the system and performing disk I/O operations.
Transient faults were injected in data, address, command, and miscellaneous signals of a compute node. The injected board is marked by an arrow in Fig. 3. Four boards were removed to provide physical access to the injected node, as shown by the empty space. Faults were also injected in the I/O node, primary clock board, and ICF. In this way, the correct operation of the hardware, firmware, and software fault/error handling mechanisms, at the board level, and of the error detection, isolation, logging, and recovery, at the PSB and SPS levels, was evaluated. The use of two independent compute partitions, running different applications, allowed us to assess the error containment capabilities of the machine.
Over 3,000 faults were randomly injected in order to validate the fault/error handling mechanisms of the Teraflops machine and more than 2,800 for assessing signal sensitivity to transient faults. The outcome of the experiments is divided into five categories: no errors, corrected errors, locally detected errors, uncovered errors, and application hangs. Names of the first, second, and fourth categories are self-explanatory. Errors which were locally detected at the FRU level (e.g., nonmaskable interrupt asserted because of parity error on the address bus of a compute node or MRC detected parity error on a backplane) represent the third category. In general, application hangs (the fifth category) affected only the compute partition which included the failed board. In this section, we present the most significant results of the experiments.
Malfunctions Unveiled by Fault Injection
For the sake of brevity, three cases of malfunctions unveiled by the fault injection experiments are discussed herein.
Transient faults injected in a compute node running the MP Linpack benchmark led to application hangs in the second partition, which was executing the MD application. This situation occurred because the error induced by the fault was not properly confined to the node level and propagated over the ICF, affecting the adjacent compute partition. Error reporting to the SPS station was also hampered, backplane failures being logged instead of node failures. Both problems were solved by enabling and properly adjusting the node watch-dog timer.
The residue checking feature of the MP Linpack benchmark unveiled several numerical errors, unreported by the error detection mechanisms of the machine. This type of error was experienced by both Kestrel and Eagle boards. Disabled processor ECC was the root cause. Proper operation was observed after a firmware bug was corrected.
Fault injection performed on ICF (NICs on Kestrel and Eagle boards, MRCs and logic circuitry on backplanes, and ICF links) produced the expected results, i.e., detected errors and application hangs. Initially, reporting of backplane errors by the SPS station was inaccurate. Correct error reporting was observed after several SPS software bugs were fixed.
Assessing Signal Sensitivity to Transient Faults
A new approach was required to evaluate signal sensitivity to transient faults and determine the effectiveness of the mechanisms designed to eliminate or decrease the probability of occurrence of uncovered errors. Our method consists of repeatedly injecting transient faults into a given signal at randomly selected time instances. The fault duration is deterministically selected and varied over a large time interval. In the case of the Teraflops supercomputer, between 12 and 15 fault durations were selected for each signal. Thirty faults were injected for each duration. Confidence intervals of the fault/error detection coverage (conditional probability that a fault/error is detected given that a fault occurs) were derived. The remainder of this section gives a few examples of signal sensitivity analysis. For the sake of clarity, only the seven most significant durations are shown for each signal. Further details on analysis of signal sensitivity to transient faults are provided in [6], [7].
Fig. 4 shows the impact of "0" faults injected in signal A (active low). The experiments started by injecting thirty 10 ns transient faults (not shown in Fig. 4). No errors were induced. Then, the fault duration was increased to 25 ns. Sixty percent of the thirty 25 ns faults induced uncovered errors. The remaining 40 percent of the faults produced no errors. The percentage of uncovered errors increased to 70 percent for 50 ns faults. The 100 ns transients generated uncovered errors in 93 percent of the cases. All 2 μs faults induced uncovered errors. The percentage of uncovered errors decreased as the fault duration was increased further. Eight μs faults led to 80 percent uncovered errors, 3 percent detected errors, and 17 percent application hangs. Thirty-two μs transients induced 50 percent uncovered errors, 3 percent detected errors, and 47 percent application hangs. In the case of 120 μs faults, 3 percent of the errors were uncovered, 3 percent were detected, and 94 percent of the faults led to application hangs. Longer transients induced application hangs. Uncovered errors occurred when signal A was falsely asserted because of the injected faults. No errors occurred when the fault overlapped a "legitimate" active signal.
The effect of transient faults on signal B (active low) is presented in Fig. 5. In this experiment, type "1" faults were injected, i.e., the injected signal was pulled up, preventing normal assertion. The 10 ns faults induced no errors (not shown in Fig. 5), while the 25 ns transients induced 10 percent uncovered errors. Ninety percent of the 25 ns faults produced no errors. The percentage of uncovered errors increased to 27 percent and 77 percent for 50 ns and 100 ns transients, respectively. Seven percent of the errors induced by the 100 ns faults were detected. Two μs faults led to 97 percent uncovered errors and 3 percent application hangs. The percentage of uncovered errors slightly decreased to 93 percent for 8 μs faults. The remaining 7 percent were application hangs. The percentage of uncovered errors continued to decrease as the duration of the faults was increased. Conversely, the number of detected errors and application hangs increased. Fifty-three percent of the 32 μs faults induced uncovered errors, while 10 percent of the errors were detected. Thirty-seven percent of the faults led to application hangs. In the case of 120 μs faults, the outcome of the fault injection was 10 percent, 10 percent, and 80 percent uncovered errors, detected errors, and application hangs, respectively. Transients longer than 120 μs led to application hangs.
Fig. 6 shows the results of simultaneous fault injection in signals A ("0" type faults) and B ("1" type faults). Uncovered errors were the dominating outcome of the experiment. They represented 30 percent even for very short transients (10 ns). The percentage of uncovered errors increased with duration, reaching 73 percent and 93 percent for 25 ns and 50 ns transients, respectively. The percentage of detected errors was 3 percent for faults in the 25 ns-50 ns range. All 2 μs faults induced uncovered errors. The percentage of uncovered errors decreased to 80 percent as fault duration approached 8 μs. The remaining 20 percent were application hangs. This trend continued with the increase of the fault duration. In the case of 32 μs faults, the outcome of the experiments consisted of 67 percent uncovered errors and 33 percent application hangs. Application hangs became predominant, i.e., 90 percent, as duration of the transients approached 120 μs. Faults longer than 120 μs led to application hangs.
The high percentage of uncovered errors and wide range of the duration, from 25 ns to 120 μs, proved that signals A and B were extremely sensitive to transient faults. The simultaneous occurrence of transient faults on both signals led to a higher percentage of uncovered errors, even for very short transients (e.g., 10 ns).
Statistical inference is commonly used for estimating the fault/error detection coverage probabilities. Table 1 provides the 90 percent confidence intervals of the detection coverage for signals A and B. Student t and chi-square (χ²) probability distributions were used because the number of faults injected for each duration was 30 (normal distribution can be considered if at least 50 faults are injected). Student t distribution was employed when estimated values of the coverage were in the 0.1-0.9 range. The χ² distribution was used for deriving one-sided intervals, for coverage values < 0.1 and > 0.9. Details on computing confidence intervals of the fault/error detection coverage are provided in [6].
Coverage probability takes low values, in the (0, 0.08), (0, 0.013), and (0, 0.18) confidence intervals, for most of the transients in the 10 ns-8 μs range. Application hangs impacted only one compute partition and were detected and properly reported at the system level (SPS station). As a consequence, both locally and system detected faults/errors were considered covered. Table 1 clearly shows that coverage probability is a function of fault duration. Transients in the 10 ns-8 μs range were the most difficult to detect. Coverage probability improved with fault duration. For long transients, i.e., 120 μs, the 90 percent confidence interval of the detection coverage was in the (0.8, 0.99) and (0.87, 1.0) ranges, especially because of the effectiveness of error confinement and system level detection. The impact of long transients was similar to that of permanent faults.
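As a rough illustration of a two-sided Student t interval for a coverage estimated from 30 injected faults, consider the Python sketch below. It uses a textbook approximation and is not necessarily the exact procedure of [6]: the function name `coverage_interval_t`, the use of n - 1 in the variance term, and the clipping to [0, 1] are assumptions. The constant 1.699 is the tabulated two-sided 90 percent critical value of the Student t distribution with 29 degrees of freedom, so the default is valid only for 30 injections.

```python
import math

# Two-sided 90 percent critical value of Student t with 29 degrees of freedom
# (30 injected faults per duration). Use another table entry for other counts.
T_CRIT_90_DF29 = 1.699


def coverage_interval_t(covered, injected, t_crit=T_CRIT_90_DF29):
    """Approximate two-sided confidence interval of the detection coverage.

    covered  -- faults whose outcome counts as covered (every outcome except
                an uncovered error)
    injected -- total number of injected faults for this duration
    """
    c = covered / injected
    # Assumed textbook form: sample variance of a proportion with n - 1.
    half_width = t_crit * math.sqrt(c * (1.0 - c) / (injected - 1))
    return max(0.0, c - half_width), min(1.0, c + half_width)


low, high = coverage_interval_t(27, 30)  # 10 percent uncovered errors out of 30
print(f"({low:.2f}, {high:.2f})")
```

With 27 covered outcomes out of 30 the sketch yields roughly (0.81, 0.99). Estimates below 0.1 or above 0.9 fall outside the range in which the paper applies the Student t distribution and would need the one-sided treatment instead.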
Fault injection experiments showed that the fault/error handling capabilities had to be improved. As a consequence, simulated fault injection experiments were performed in order to assess the effectiveness of additional detection and recovery mechanisms. Traces collected from the real machine were used as the input of the simulation engine. Intel Hardware Description Language (iHDL) was employed for simulating the fault/error handling process. A lower percentage of uncovered errors was observed when a specially designed state machine was used to control signals A and B. These experiments furnished information on detection latency, too. The methodology used for performing simulated fault injection experiments is presented in [7] .
Fault injection has provided vital information about the impact of transient faults. When uncovered errors are observed, it is necessary to carry out an extensive analysis of signal sensitivity to transients. The duration of the injected faults has to be varied over a large time interval. The experiment has to start with very short transients which, in general, do not induce errors. The fault duration has to be gradually increased until detected errors and/or application hangs become predominant. Both the outcome of the experiment and the width of the time window which has to be covered are strongly influenced by the signal type (e.g., address, data, clock) and technology (e.g., GTL, CMOS). Similar fault injection experiments have to be performed, preferably in a simulated environment, for assessing the effectiveness of the candidate fault/error detection and recovery mechanisms, designed to handle transients. Finally, it should be stressed that injecting permanent faults only leads to unrealistically high values of the fault/error detection coverage probability.
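The sweep described above can be sketched in Python as follows. The callback `inject` stands for whatever performs one physical or simulated injection and returns an outcome label; it is hypothetical, as are the outcome names and the 90 percent stopping threshold used here to decide that detected errors and application hangs have become "predominant".

```python
from collections import Counter


def sweep_durations(inject, durations_ns, faults_per_duration=30, stop_fraction=0.9):
    """Inject a batch of faults at each duration, from shortest to longest.

    inject -- hypothetical callable taking a duration in ns and returning one of
              "no_error", "corrected", "detected", "uncovered", "hang"
    Returns a dict mapping each tested duration to a Counter of outcomes.
    """
    results = {}
    for duration in sorted(durations_ns):
        tally = Counter(inject(duration) for _ in range(faults_per_duration))
        results[duration] = tally
        # Stop once detected errors and hangs dominate (assumed threshold).
        loud = tally["detected"] + tally["hang"]
        if loud / faults_per_duration >= stop_fraction:
            break
    return results
```

The per-duration tallies are the raw material for both the outcome percentages and the coverage confidence intervals discussed earlier in this section.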
Summary of the Fault Injection Experiments
A summary of the fault injection experiments performed for validating the fault tolerance mechanisms of the Teraflops supercomputer is given in Table 2 . Numbers in this table reflect the behavior of the machine after corrective actions were implemented.
In the case of the compute node, a significant percentage of the errors induced by the injected faults were corrected (24 percent). This shows the effectiveness of the front side bus and memory data ECC. Fourteen percent of the faults led to properly detected and reported errors at the board level. The percentage of uncovered errors was 18 percent. A similar number of experiments led to application hangs.
Results of the fault injection experiments on the I/O node are summarized by the second row of Table 2; the percentage of FRU-detected errors was smaller than for the compute node.
The third row summarizes the fault injection experiments on the ICF. A significant number of faults, 66 percent, induced no errors, while 34 percent led to application hangs. The cause of the application hangs (e.g., MRC parity error) was properly identified and reported by SPS. Failures of the ICF required the reset of the fabric.
The majority of faults injected in signals of the clock board induced no errors (91 percent). Application hangs and detected errors were observed in the remaining 9 percent of the cases. It should be noted that both uncovered errors and application hangs were confined to the compute partition which experienced the fault. For critical applications, the impact of uncovered errors can be mitigated by using redundant compute partitions.
In the case of compute and I/O nodes (the only FRUs which experienced uncovered errors), the 90 percent confidence intervals of the fault/error detection coverage probability are (0.80, 0.84) and (0.79, 0.91), respectively. Normal distribution was considered for deriving the confidence intervals [34]. Multistage and stratified sampling have to be employed if system coverage probability is required [9], [10]. Alternatively, sampling in partitioned and nonpartitioned spaces can be used [27], [28].
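For larger samples, a normal-approximation interval can be sketched in Python as shown below. This is the standard textbook formula, offered as an illustration rather than the exact derivation of [34]; the function name `coverage_interval_normal` is hypothetical, and the counts in the usage line (820 covered outcomes out of 1,000 injections) are invented to show the arithmetic, since the per-FRU injection counts are not given in the text.

```python
import math
from statistics import NormalDist


def coverage_interval_normal(covered, injected, confidence=0.90):
    """Two-sided normal-approximation interval for a coverage estimate."""
    c = covered / injected
    # Quantile of the standard normal for a two-sided interval (1.645 for 90 percent).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * math.sqrt(c * (1.0 - c) / injected)
    return max(0.0, c - half_width), min(1.0, c + half_width)


# Hypothetical counts, chosen only to illustrate the calculation.
low, high = coverage_interval_normal(820, 1000)
print(f"({low:.2f}, {high:.2f})")
```

The interval narrows with the square root of the number of injections, which is why the 30-fault batches used for the sensitivity analysis call for small-sample distributions instead.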
CONCLUSIONS
The complexity of the Teraflops supercomputer required extensive fault injection experiments for validation of the fault tolerance mechanisms. Physical fault injection at the IC pin level was employed for this purpose. Transient faults were injected in all major components of the machine: compute and I/O nodes, internode communication fabric, and clock boards. Compute partitions, running independent applications, were used for validation of the error confinement capabilities of the supercomputer. Fault injection was carried out in a three-dimensional space of events, consisting of fault location, time of fault occurrence, and fault duration. In this way, the effect of a variety of internal IC faults and environmental induced errors was accounted for. This approach proved to be a powerful tool for validating fault/error handling mechanisms. Several hardware, firmware, and software malfunctions were unveiled at both the field replaceable unit and monitoring/recovery subsystem levels. After corrective actions had been taken, the supercomputer performed according to the specification.
Fault injection experiments taught us an important lesson. As voltages get lower, noise margins decrease, leading to increased signal sensitivity to transients. As a consequence, enhanced fault tolerance mechanisms have to be devised. The method presented in this paper allowed us to assess signal sensitivity to transient faults. For the first time, confidence intervals of the detection coverage were derived as a function of fault duration. Similar fault injection experiments, performed in a simulated environment, provided information on the effectiveness of new detection mechanisms, designed to reduce the number of uncovered errors.
Signal sensitivity analysis showed that fault/error detection coverage is a function of fault duration. Short transients were more difficult to detect, while longer faults led to application hangs. It should be noted that injection of permanent faults can only lead to significant overestimation of the coverage.
Fault injection provided valuable data about the behavior of computer systems in presence of faults. Many of the successful fault/error detection, isolation, and recovery strategies tested on the Teraflops supercomputer are becoming common features of new commercial machines.
FUTURE WORK
Technology advancements, especially smaller transistor dimensions, lower voltages, and higher frequencies, have a significant influence on dependability of computing systems (dependability subsumes reliability, availability, maintainability, and safety [23]). Lower rates of occurrence, observed for permanent faults, represent the positive impact of the newest computer manufacturing technologies. By contrast, the rate of transient faults is increasing. As a consequence, commercial machines employ more sophisticated fault/error handling mechanisms at the IC, FRU, and system levels. Validation of these mechanisms requires extensive fault injection experiments.
The fault injection approach presented in this work is currently used for validation of highly available enterprise class servers. Future work in this area includes simulated and physical fault injection for estimating the impact of neutrons and alpha particles on sub-0.25 μm ICs. Enhanced methodologies are under development for assessing the impact of transients on parallel multidrop and point-to-point interconnects and serial links. SWIFI is being increasingly used for validation of fault/error handling software, from regular error reporting to complex fail-over mechanisms, employed by clusters of servers.
