
    Software fault tolerance in computer operating systems

    This chapter provides data and an analysis of the dependability and fault tolerance of three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors, which causes the backup execution (the processor state and the sequence of events occurring) to differ from the original execution, is a major reason for the measured software fault tolerance. The IBM/MVS system's fault tolerance almost doubles when recovery routines are provided, compared with the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.
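
    The mechanism the chapter credits for the measured software fault tolerance — a backup process resuming from a checkpoint in an execution environment that differs from the primary's, so a timing-dependent fault often does not recur — can be sketched as follows. This is a minimal illustration, not the Tandem implementation; `run_request` and the failure rate are hypothetical:

```python
import random

def run_request(state, request, transient_bug_rate=0.3):
    """Process one request; may hit a timing-dependent ('Heisen') bug."""
    if random.random() < transient_bug_rate:
        raise RuntimeError("transient software fault")
    return state + request

def process_pair(requests):
    """Primary executes; on failure the backup retries from the last
    checkpoint. A different execution environment on the backup is
    modeled here as a zero recurrence rate for the transient fault."""
    checkpoint = 0                      # state shipped to the backup
    for req in requests:
        try:
            checkpoint = run_request(checkpoint, req)   # primary attempt
        except RuntimeError:
            # backup takes over from the checkpointed state and retries
            checkpoint = run_request(checkpoint, req, transient_bug_rate=0.0)
    return checkpoint

print(process_pair([1, 2, 3, 4]))  # 10
```

    Because each request is retried from the last consistent checkpoint, the final state is the same whether or not the primary ever failed.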

    An implementation and performance measurement of the progressive retry technique

    This paper describes a recovery technique called progressive retry for bypassing software faults in message-passing applications. The technique is implemented as reusable modules that provide application-level software fault tolerance. The paper describes the implementation of the technique and presents results from applying progressive retry to two telecommunications systems. The results show that the technique helps reduce the total recovery time for message-passing applications.
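
    The escalation idea behind progressive retry — attempt cheap, local recovery steps first and widen the scope only when a step fails to bypass the fault — can be sketched as follows. The level names are illustrative, loosely following the abstract, not the paper's actual API:

```python
def progressive_retry(steps):
    """Attempt recovery steps in order of increasing scope and cost;
    stop at the first step that succeeds."""
    for name, attempt in steps:
        if attempt():
            return name
    return "cold restart"  # all retry levels exhausted

# Hypothetical recovery levels, cheapest first; each callable stands in
# for one retry attempt and reports whether the fault was bypassed.
levels = [
    ("replay logged messages", lambda: False),   # same ordering: bug recurs
    ("reorder incoming messages", lambda: True), # changed ordering bypasses it
    ("rollback sender and resend", lambda: True),
]
print(progressive_retry(levels))  # reorder incoming messages
```

    Because nondeterminism in message ordering is often what triggers the fault, a step that perturbs the replay is frequently enough, which is why total recovery time drops relative to always restarting cold.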

    Experimental analysis of computer system dependability

    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by a discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
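
    Importance sampling, mentioned above as a way to accelerate Monte Carlo simulation, biases the sampling distribution toward the rare event and reweights each sample by the likelihood ratio so the estimator stays unbiased. A small self-contained illustration (a generic textbook example, not tied to the tools surveyed): estimating the tail probability P(Z > 4) of a standard normal, which naive Monte Carlo would almost never hit:

```python
import math
import random

def importance_sample_tail(threshold, n=100_000, seed=1):
    """Estimate P(Z > threshold) for Z ~ N(0,1) by drawing from the
    proposal N(threshold, 1), centred on the rare region, and
    reweighting each hit by the likelihood ratio p(x)/q(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)          # biased (accelerated) draw
        if x > threshold:
            # likelihood ratio phi(x)/phi(x - threshold)
            #   = exp(threshold^2/2 - threshold*x)
            total += math.exp(threshold * threshold / 2 - threshold * x)
    return total / n

est = importance_sample_tail(4.0)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2))    # about 3.17e-5
print(est, exact)
```

    With 100,000 naive N(0,1) draws one would expect only about three samples past 4, so the weighted estimator converges orders of magnitude faster for the same budget.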

    Analysis of safety systems with on-demand and dynamic failure modes

    An approach to the reliability analysis of systems with on-demand and dynamic failure modes is presented. Safety systems, such as sprinkler systems or other protection systems, are characterized by such failure behavior. They have support subsystems to start the system on demand, and once they start running, they are prone to dynamic failure. Failure on demand requires an availability analysis of the components (typically electromechanical components) that are required to start or support the safety system. Once the safety system is started, it is often reasonable to assume that these support components do not fail while running. Further, these support components may be tested and maintained periodically while not in active use. Dynamic failure refers to the failure while running (once started) of the active components of the safety system. These active components may be fault tolerant and utilize spares or other forms of redundancy, but are not maintainable while in use. In this paper we describe a simple yet powerful approach to combining the availability analysis of the static components with a reliability analysis of the dynamic components. This approach is explained using a hypothetical example sprinkler system, and applied to a water deluge system taken from the offshore industry. The approach is implemented in the fault tree analysis software package Galileo.

    Dependability analysis of systems with on-demand and active failure modes, using dynamic fault trees

    Safety systems and protection systems can experience two phases of operation (standby and active); an accurate dependability analysis must combine an analysis of both phases. The standby mode can last for a long time, during which the safety system is periodically tested and maintained. Once a demand occurs, the safety system must operate successfully for the length of the demand. The failure characteristics of the system are different in the two phases, and the system can fail in two ways: 1) it can fail to start (fail on demand), or 2) it can fail while in active mode. Failure on demand requires an availability analysis of the components (typically electromechanical components) that are required to start or support the safety system. These support components are usually maintained periodically while not in active use. Active failure refers to the failure while running (once started) of the active components of the safety system. These active components can be fault tolerant and use spares or other forms of redundancy, but are not maintainable while in use. The approach in this paper automatically combines the availability analysis of the system in standby mode with the reliability analysis of the system in its active mode. The general approach uses an availability analysis of the standby phase to determine the initial state probabilities for a Markov model of the demand phase. A detailed method is presented in terms of a dynamic fault-tree model. A new dynamic fault-tree construct captures the dependency of the demand components on the support systems, which are required to detect the demand or to start the demand system. The method is discussed using a simple example sprinkler system and then applied to a more complete system taken from the offshore industry.
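
    In the simplest case, the combination described above — standby-phase availability fixing the starting conditions of the demand-phase reliability analysis — reduces to the total probability formula below. This is a deliberately scalar sketch with hypothetical numbers; the paper's actual method feeds the standby availability into the initial state probabilities of a Markov model rather than a single number:

```python
import math

def mission_failure_prob(demand_unavail, active_rate, mission_hours):
    """Total failure probability = P(fail on demand)
       + P(start OK) * P(fail while running the mission).
    Assumes an exponential active-phase failure law (illustrative only)."""
    q_demand = demand_unavail                          # from availability analysis
    q_active = 1 - math.exp(-active_rate * mission_hours)
    return q_demand + (1 - q_demand) * q_active

# Hypothetical numbers: 2% chance the support system fails on demand,
# active failure rate 1e-3/h over a 10 h demand.
p = mission_failure_prob(0.02, 1e-3, 10)
print(round(p, 4))  # 0.0298
```

    Even this toy version shows why both phases must be analyzed together: with these numbers the on-demand unavailability, not the active-phase failure rate, dominates the total.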

    Software-based methods for operating system dependability

    Guaranteeing correct system behaviour in modern computer systems has become essential, in particular for safety-critical computer-based systems. However, all modern systems are susceptible to transient faults that can disrupt their intended operation and function. To evaluate the sensitivity of such systems, different methods have been developed; among them, fault injection is considered a valid and widely adopted approach. This document presents a fault injection tool, called Kernel-based Fault-Injection Tool Open-source (KITO), to analyze the effects of faults in memory elements containing kernel data structures belonging to a Unix-based operating system and, in particular, elements involved in resource synchronization. This tool was evaluated at different stages of its development through experimental analyses in which faults were injected into the operating system while it was subjected to stress from benchmark programs exercising different elements of the Linux kernel. The results showed that KITO was capable of generating faults in different elements of the operating system with limited intrusiveness, and that the data structures belonging to synchronization aspects of the kernel are susceptible to an appreciable set of possible errors, ranging from performance degradation to complete system failure that prevents benchmark applications from performing their tasks. Finally, aiming to overcome the vulnerabilities discovered with KITO, a couple of solutions have been proposed, consisting of the implementation of hardening techniques in the source code of the Linux kernel, such as Triple Modular Redundancy and Error Detection And Correction codes. An experimental fault injection analysis has been conducted to evaluate the effectiveness of the proposed solutions. Results have shown that it is possible to successfully detect and correct the noxious effects generated by single faults in kernel data structures of the Linux kernel, with a limited performance overhead.
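
    Triple Modular Redundancy, one of the hardening techniques mentioned, keeps three copies of a protected value and masks a single corruption by majority vote. A minimal sketch (in Python for brevity; the thesis applies the technique inside the Linux kernel's C source):

```python
def tmr_read(copies):
    """Majority-vote read over three replicas of a protected value;
    detects and corrects a single corrupted copy."""
    a, b, c = copies
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("uncorrectable: all three copies disagree")

replicas = [42, 42, 42]
replicas[1] ^= 0x4         # single bit flip injected into one copy
print(tmr_read(replicas))  # 42 (fault masked by majority vote)
```

    The trade-off the thesis measures follows directly from this structure: every read costs extra comparisons and every write must update three copies, which is the source of the reported performance overhead.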

    Integrating Defect Data, Code Review Data, and Version Control Data for Defect Analysis and Prediction

    In this thesis, we present a new approach to integrating software system defect data: defect reports, code reviews, and code commits. We propose to infer defect types from keywords. We index defect reports into groups by the keywords found in the descriptions of those reports, and study the properties of each group by leveraging code reviews and code commits. Our approach is more scalable than previous studies that classify defects by manual inspection, because indexing is automatic and can be applied uniformly to large defect datasets. Our approach can also analyze defects ranging from programming errors and performance issues to high-level design and user-interface defects, a more comprehensive variety than previous studies based on static program analysis. By applying our approach to Honeywell Automation and Control Solutions (ACS) projects, with roughly 700 defects, we found that some defect types could be five times more frequent than others, which gave clues to the dominant root causes of the defects. We found that certain defect types clustered in certain source files, and that 20%-50% of the files usually contained more than 80% of the defects. Finally, we applied a known defect-prediction algorithm to predict the hot files for the defect types of interest, achieving a defect hit rate of 50%-90%.
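
    The keyword-based indexing step can be sketched as follows. The taxonomy and keywords here are invented for illustration; the thesis's actual keyword lists are not given in the abstract:

```python
# Hypothetical keyword taxonomy mapping a defect type to trigger words.
TAXONOMY = {
    "performance": {"slow", "latency", "timeout", "memory"},
    "ui": {"button", "dialog", "display", "layout"},
    "logic": {"crash", "null", "exception", "incorrect"},
}

def index_defects(reports):
    """Group defect reports by the first taxonomy entry whose keywords
    appear in the description. Fully automatic, so it scales to large
    datasets, unlike manual classification."""
    groups = {dtype: [] for dtype in TAXONOMY}
    groups["other"] = []
    for report_id, text in reports:
        words = set(text.lower().split())
        for dtype, keywords in TAXONOMY.items():
            if words & keywords:            # any keyword present
                groups[dtype].append(report_id)
                break
        else:
            groups["other"].append(report_id)
    return groups

reports = [(1, "app crash on null pointer"), (2, "timeout loading page"),
           (3, "misaligned dialog layout")]
print(index_defects(reports))
# {'performance': [2], 'ui': [3], 'logic': [1], 'other': []}
```

    Once reports are grouped this way, each group's size and the files touched by its linked commits give the type frequencies and per-file clustering the thesis reports.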