14 research outputs found
Fault diversity among off-the-shelf SQL database servers
Fault tolerance is often the only viable way of obtaining the required system dependability from systems built out of "off-the-shelf" (OTS) products. We have studied a sample of bug reports from four off-the-shelf SQL servers so as to estimate the possible advantages of software fault tolerance, in the form of modular redundancy with diversity, in complex off-the-shelf software. We checked whether these bugs would cause coincident failures in more than one of the servers. We found that very few bugs affected two of the four servers, and none caused failures in more than two. We also found that only four of these bugs would cause identical, undetectable failures in two servers. Therefore, a fault-tolerant server, built with diverse off-the-shelf servers, seems to have a good chance of delivering improvements in availability and failure rates compared with the individual off-the-shelf servers or their replicated, nondiverse configurations.
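The modular-redundancy-with-diversity scheme the abstract describes can be sketched as an adjudicator that sends the same query to several independently implemented servers and votes on their answers. This is a minimal illustration, not the paper's implementation; the server names and result values are invented.

```python
# Minimal sketch of diverse modular redundancy: the same query is issued
# to several diverse SQL servers and their answers are adjudicated by
# majority vote, masking a single wrong (noncrash) answer.
from collections import Counter

def adjudicate(results):
    """Return the majority answer across servers, or None if no majority exists."""
    tally = Counter(results.values())
    answer, votes = tally.most_common(1)[0]
    if votes > len(results) // 2:
        return answer
    return None  # disagreement detected, but no majority to mask it

# Simulated responses from four diverse servers; one returns an incorrect
# (noncrash) result, which the voter masks.
responses = {
    "server_a": ("alice", 3),
    "server_b": ("alice", 3),
    "server_c": ("alice", 7),   # wrong answer, not a crash
    "server_d": ("alice", 3),
}
print(adjudicate(responses))  # -> ('alice', 3)
```

Because the study found that very few bugs affect two servers on the same demand, a voter like this would mask most individual-server failures.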
Software fault tolerance in computer operating systems
This chapter provides data and analysis of the dependability and fault tolerance for three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events occurring) being different from the original execution, is a major reason for the measured software fault tolerance. The IBM/MVS system fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.
An implementation and performance measurement of the progressive retry technique
This paper describes a recovery technique called progressive retry for bypassing software faults in message-passing applications. The technique is implemented as reusable modules to provide application-level software fault tolerance. The paper describes the implementation of the technique and presents results from the application of progressive retry to two telecommunications systems. The results presented show that the technique is helpful in reducing the total recovery time for message-passing applications.
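The essence of progressive retry is that recovery is attempted at increasing scope, so cheap local retries are tried before expensive global rollbacks. The sketch below is an illustration of that escalation pattern, not the paper's actual module API; the step names and recovery actions are invented placeholders.

```python
# Illustrative sketch of progressive retry: each recovery step widens the
# scope (and cost) of recovery; we stop at the first step that succeeds.
def progressive_retry(steps):
    """steps: ordered list of (name, recovery_fn); each fn returns True on success."""
    for attempt, (name, recover) in enumerate(steps, start=1):
        if recover():
            return f"recovered at step {attempt}: {name}"
    return "escalate to full restart"

# Hypothetical recovery actions for a message-passing process, ordered
# from cheapest (local replay) to most expensive (multi-process rollback).
steps = [
    ("replay logged messages locally", lambda: False),
    ("replay messages in a different order", lambda: False),
    ("ask senders to retransmit", lambda: True),
    ("roll back communicating processes", lambda: True),
]
print(progressive_retry(steps))  # -> recovered at step 3: ask senders to retransmit
```

Stopping at the first successful step is what keeps total recovery time low when faults can be bypassed locally.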
Fault tolerance via diversity for off-the-shelf products: A study with SQL database servers
If an off-the-shelf software product exhibits poor dependability due to design faults, then software fault tolerance is often the only way available to users and system integrators to alleviate the problem. Thanks to low acquisition costs, even using multiple versions of software in a parallel architecture, a scheme formerly reserved for a few highly critical applications, may become viable for many applications. We have studied the potential dependability gains from these solutions for off-the-shelf database servers. We based the study on the bug reports available for four off-the-shelf SQL servers plus later releases of two of them. We found that many of these faults cause systematic noncrash failures, a category ignored by most studies and standard implementations of fault tolerance for databases. Our observations suggest that diverse redundancy would be effective for tolerating design faults in this category of products. Only in very few cases would demands that triggered a bug in one server cause failures in another one, and there were no coincident failures in more than two of the servers. Use of different releases of the same product would also tolerate a significant fraction of the faults. We report our results and discuss their implications, the architectural options available for exploiting them, and the difficulties that they may present.
Experimental analysis of computer system dependability
This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
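A toy example may help make "software fault injection" concrete: corrupt one bit of a target's state and check whether the computation still produces the golden result. This is a self-contained sketch of the idea only; the target (a checksum over a buffer) and the trial counts are invented for illustration.

```python
# Minimal software-implemented fault injection campaign: flip a random
# bit in a buffer and observe whether the computed result deviates from
# the fault-free ("golden") reference.
import random

def checksum(buf):
    return sum(buf) % 256

def inject_bit_flip(buf, byte_idx, bit_idx):
    corrupted = bytearray(buf)
    corrupted[byte_idx] ^= 1 << bit_idx   # single-bit fault model
    return corrupted

random.seed(0)
golden = bytearray(b"dependability")
reference = checksum(golden)

failures = 0
trials = 100
for _ in range(trials):
    faulty = inject_bit_flip(golden, random.randrange(len(golden)),
                             random.randrange(8))
    if checksum(faulty) != reference:
        failures += 1
print(f"{failures}/{trials} injections caused an observable failure")
```

Real campaigns with tools like FIAT or FINE target registers, memory, and kernel state rather than a toy checksum, and classify outcomes more finely (masked, detected, silent data corruption, crash), but the inject-run-compare loop is the same.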
Analysis of safety systems with on-demand and dynamic failure modes
An approach for the reliability analysis of systems with on-demand and dynamic failure modes is presented. Safety systems such as sprinkler systems or other protection systems are characterized by such failure behavior. They have support subsystems to start up the system on demand, and once they start running, they are prone to dynamic failure. Failure on demand requires an availability analysis of components (typically electromechanical components) which are required to start or support the safety system. Once the safety system is started, it is often reasonable to assume that these support components do not fail while running. Further, these support components may be tested and maintained periodically while not in active use. Dynamic failure refers to the failure while running (once started) of the active components of the safety system. These active components may be fault tolerant and utilize spares or other forms of redundancy, but are not maintainable while in use. In this paper we describe a simple yet powerful approach to combining the availability analysis of the static components with a reliability analysis of the dynamic components. This approach is explained using a hypothetical example sprinkler system, and applied to a water deluge system taken from the offshore industry. The approach is implemented in the fault tree analysis software package Galileo.
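Numerically, combining the two analyses amounts to multiplying the probability that the support components start the system on demand (an availability figure) by the probability that the active components then run failure-free for the demand duration (a reliability figure). The numbers below are illustrative only, not taken from the paper.

```python
# Hedged numeric sketch of the combined on-demand/dynamic analysis,
# assuming a constant failure rate for the active components.
import math

A_demand = 0.98   # availability of start-up/support components at the demand
lam = 1e-3        # per-hour failure rate of the active components (assumed)
t = 24.0          # demand duration in hours (assumed)

R_active = math.exp(-lam * t)      # reliability while running
P_mission = A_demand * R_active    # P(starts AND runs for the whole demand)
print(f"P(start and run for {t:.0f} h) = {P_mission:.4f}")
```

The paper's contribution is doing this combination systematically inside a fault tree model rather than by hand, but the product structure above is the underlying idea.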
Dependability analysis of systems with on-demand and active failure modes, using dynamic fault trees
Safety systems and protection systems can experience two phases of operation (standby and active); an accurate dependability analysis must combine an analysis of both phases. The standby mode can last for a long time, during which the safety system is periodically tested and maintained. Once a demand occurs, the safety system must operate successfully for the length of the demand. The failure characteristics of the system are different in the two phases, and the system can fail in two ways:
1) it can fail to start (fail on demand), or
2) it can fail while in active mode.
Failure on demand requires an availability analysis of components (typically electromechanical components) which are required to start or support the safety system. These support components are usually maintained periodically while not in active use. Active failure refers to the failure while running (once started) of the active components of the safety system. These active components can be fault tolerant and use spares or other forms of redundancy, but are not maintainable while in use.
The approach in this paper automatically combines the availability analysis of the system in standby mode with the reliability analysis of the system in its active mode. The general approach uses an availability analysis of the standby phase to determine the initial state probabilities for a Markov model of the demand phase. A detailed method is presented in terms of a dynamic fault-tree model. A new dynamic fault-tree construct captures the dependency of the demand components on the support systems, which are required to detect the demand or to start the demand system. The method is discussed using a simple example sprinkler system and then applied to a more complete system taken from the offshore industry.
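The key step, using standby-phase availability results as the initial state probabilities of a demand-phase Markov model, can be sketched with a tiny continuous-time Markov chain. The states, rates, and initial probabilities below are invented for a hypothetical two-pump active system with one spare; this is an illustration of the general approach, not the paper's model.

```python
# Sketch: a standby-phase availability analysis supplies the state
# distribution at the moment of demand; the demand phase is then a small
# Markov model integrated forward in time (simple Euler integration).

# Initial distribution at the demand, as a standby analysis might produce it.
p = {"both_up": 0.95, "one_up": 0.04, "failed": 0.01}

lam = 2e-3    # per-hour failure rate of each running pump (assumed)
dt = 0.01     # Euler time step, hours
t_end = 24.0  # demand duration, hours

for _ in range(int(t_end / dt)):
    d_both = -2 * lam * p["both_up"]                    # either pump fails
    d_one = 2 * lam * p["both_up"] - lam * p["one_up"]  # spare in use
    d_fail = lam * p["one_up"]                          # last pump fails
    p["both_up"] += d_both * dt
    p["one_up"] += d_one * dt
    p["failed"] += d_fail * dt

print(f"P(system failed by {t_end:.0f} h of demand) = {p['failed']:.4f}")
```

Note how the nonzero initial probability of the "failed" state (failure on demand) and the accumulation of active-phase failures both contribute to the final unreliability, which is exactly the combination the abstract describes.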
Software-based methods for operating system dependability
Guaranteeing correct system behaviour in modern computer systems has become essential, in particular for safety-critical computer-based systems. However, all modern systems are susceptible to transient faults that can disrupt their intended operation and function. To evaluate the sensitivity of such systems, different methods have been developed; among them, fault injection is considered a valid and widely adopted approach.
This document presents a fault injection tool, called Kernel-based Fault-Injection Tool Open-source (KITO), to analyze the effects of faults in memory elements containing kernel data structures belonging to a Unix-based operating system and, in particular, elements involved in resource synchronization. The tool was evaluated at different stages of its development through experimental analyses that performed fault injections in the operating system while it was under stress from benchmark programs exercising different elements of the Linux kernel. The results showed that KITO was capable of injecting faults into different elements of the operating system with limited intrusiveness, and that the kernel data structures involved in synchronization are susceptible to an appreciable range of errors, from performance degradation to complete system failure that prevents benchmark applications from performing their task.
Finally, aiming to overcome the vulnerabilities discovered with KITO, two solutions have been proposed, consisting of hardening techniques implemented in the source code of the Linux kernel: Triple Modular Redundancy and Error Detection And Correction codes. An experimental fault injection analysis was conducted to evaluate the effectiveness of the proposed solutions. Results show that the noxious effects generated by single faults in kernel data structures of the Linux kernel can be successfully detected and corrected with limited performance overhead.
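The triple-modular-redundancy idea applied to a data field can be shown in a few lines: every write updates three replicas, and every read returns the majority value, masking a single corrupted copy. This is a plain Python illustration of the voting principle, not the thesis's kernel C code; the field name and fault are invented.

```python
# Hedged sketch of TMR on a single data field: three copies, majority
# vote on read, so one transient bit flip in a copy is masked.
class TMRField:
    def __init__(self, value):
        self.copies = [value, value, value]

    def write(self, value):
        self.copies = [value, value, value]

    def read(self):
        a, b, c = self.copies
        # Majority vote: any two agreeing copies outvote a corrupted one.
        if a == b or a == c:
            return a
        if b == c:
            return b
        raise RuntimeError("uncorrectable: all three copies disagree")

lock_count = TMRField(0)            # e.g. a synchronization counter
lock_count.copies[1] ^= 0x40        # simulate a transient bit flip in one copy
print(lock_count.read())  # -> 0 (the fault is masked)
```

The cost is the tripled storage and the vote on every read, which is the "limited performance overhead" trade-off the abstract refers to; EDAC codes achieve similar protection with less storage but more computation per access.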
Integrating Defect Data, Code Review Data, and Version Control Data for Defect Analysis and Prediction
In this thesis, we present a new approach to integrating software system defect data: defect reports, code reviews, and code commits. We propose to infer defect types from keywords. We index defect reports into groups by the keywords found in the descriptions of those reports, and study the properties of each group by leveraging code reviews and code commits. Our approach is more scalable than previous studies that classify defects by manual inspection, because indexing is automatic and can be applied uniformly to large defect datasets. Our approach can also analyze defects ranging from programming errors and performance issues to high-level design and user-interface problems, a more comprehensive variety than previous studies using static program analysis. By applying our approach to Honeywell Automation and Control Solutions (ACS) projects, with roughly 700 defects, we found that some defect types occurred up to five times more often than others, which gave clues to the dominant root causes of the defects. We found that certain defect types clustered in certain source files, and that 20%-50% of the files usually contained more than 80% of the defects. Finally, we applied a known defect prediction algorithm to predict the hot files for the defect types of interest, achieving a defect hit rate of 50%-90%.
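The automatic keyword indexing the thesis proposes can be sketched as a simple dictionary of defect-type keywords matched against report descriptions. The groups, keywords, and reports below are invented for illustration; the thesis's actual keyword sets and dataset are not shown here.

```python
# Illustrative sketch of keyword-based defect indexing: each report is
# assigned to the first group whose keywords appear in its description,
# making classification automatic rather than manual.
KEYWORDS = {
    "memory": ["leak", "null", "segfault"],
    "performance": ["slow", "timeout", "latency"],
    "ui": ["button", "dialog", "render"],
}

def index_defect(description):
    words = description.lower().split()
    for group, keys in KEYWORDS.items():
        if any(k in words for k in keys):
            return group
    return "other"

reports = [
    "crash with null pointer in parser",
    "dialog fails to render on resize",
    "request timeout under heavy load",
]
for r in reports:
    print(index_defect(r), "-", r)
```

Once reports are grouped this way, each group can be joined with its code reviews and commits to find which files concentrate which defect types, which is the input the defect prediction step needs.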