18 research outputs found
Software dependability in the Tandem GUARDIAN system
Based on extensive field failure data for Tandem's GUARDIAN operating system, this paper discusses evaluation of the dependability of operational software. Software faults considered are major defects that result in processor failures and invoke backup processes to take over. The paper categorizes the underlying causes of software failures and evaluates the effectiveness of the process pair technique in tolerating software faults. A model to describe the impact of software faults on the reliability of an overall system is proposed. The model is used to evaluate the significance of key factors that determine software dependability and to identify areas for improvement. An analysis of the data shows that about 77% of processor failures that are initially attributed to software are confirmed as software problems. The analysis shows that the use of process pairs to provide checkpointing and restart (originally intended for tolerating hardware faults) allows the system to tolerate about 75% of reported software faults that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events) being different from the original execution, is a major reason for the measured software fault tolerance. Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Modeling, based on the data, shows that, in addition to reducing the number of software faults, software dependability can be enhanced by reducing the recurrence rate.
Dependability in Federated Cloud Environments
Cloud Computing has emerged as a large-scale distributed system model for utility computing, whereby services are supplied on-demand. It has been proposed that Clouds are in the process of evolving from single, monolithic Clouds such as EC2 or Microsoft Azure serving many consumers to a federation of autonomous Clouds. However, a number of research challenges remain in building dependable and robust Clouds, a critical research problem that has yet to be fully understood. This paper discusses the issues and challenges surrounding Cloud dependability, and outlines research areas of opportunity for improving the dependability and robustness of federated Clouds.
Fault tolerance via diversity for off-the-shelf products: A study with SQL database servers
If an off-the-shelf software product exhibits poor dependability due to design faults, then software fault tolerance is often the only way available to users and system integrators to alleviate the problem. Thanks to low acquisition costs, even using multiple versions of software in a parallel architecture, a scheme formerly reserved for a few highly critical applications, may become viable for many applications. We have studied the potential dependability gains from these solutions for off-the-shelf database servers. We based the study on the bug reports available for four off-the-shelf SQL servers plus later releases of two of them. We found that many of these faults cause systematic noncrash failures, which is a category ignored by most studies and standard implementations of fault tolerance for databases. Our observations suggest that diverse redundancy would be effective for tolerating design faults in this category of products. Only in very few cases would demands that triggered a bug in one server cause failures in another one, and there were no coincident failures in more than two of the servers. Use of different releases of the same product would also tolerate a significant fraction of the faults. We report our results and discuss their implications, the architectural options available for exploiting them, and the difficulties that they may present.
Modeling software design diversity
Design diversity has been used for many years now as a means of achieving a degree of fault tolerance in software-based systems. Whilst there is clear evidence that the approach can be expected to deliver some increase in reliability compared with a single version, there is no agreement about the extent of this increase. More importantly, it remains difficult to evaluate exactly how reliable a particular diverse fault-tolerant system is. This difficulty arises because assumptions of independence of failures between different versions have been shown not to be tenable: assessment of the actual level of dependence present is therefore needed, and this is hard. In this tutorial we survey the modelling issues here, with an emphasis upon the impact these have upon the problem of assessing the reliability of fault-tolerant systems. The intended audience is one of designers, assessors and project managers with only a basic knowledge of probabilities, as well as reliability experts without detailed knowledge of software, who seek an introduction to the probabilistic issues in decisions about design diversity.
Reproducibility of environment-dependent software failures: An experience report
We investigate the dependence of software failure reproducibility on the environment in which the software is executed. The existence of such dependence is established in the literature, but so far it has not been fully characterized. In this paper we pinpoint some of the environmental components that can affect the reproducibility of a failure and show this influence through an experimental campaign conducted on the MySQL Server software system. The set of failures of interest is drawn from MySQL's failure reports database and an experiment is designed for each of these failures. The experiments expose the influence of disk usage and level of concurrency on MySQL failure reproducibility. Furthermore, the results show that high levels of these factors increase the probability that a failure is reproduced.
Automated Derivation of Application-Aware Error Detectors Using Compiler Analysis
Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Science Foundation / NSF ACI CNS-040634 and NSF CNS 05-24695; Gigascale Systems Research Center; Motorola Corp
Safety Implications of Robotic Surgery: A Study of 13 Years of FDA Data on da Vinci Surgical Systems
Robotic surgical systems are intended to enable surgeons to perform minimally invasive operations with increased vision, precision, dexterity, and control, and to reduce the rate of injuries, blood loss, length of hospital stay, and post-operative complications. Recently, concerns regarding the safety and effectiveness of robot-assisted surgeries have heightened as an increased number of adverse events associated with the surgical robots have been reported to the U.S. Food and Drug Administration (FDA). Our study focuses on the analysis of the adverse events and recalls of da Vinci surgical systems, collected by the FDA over a period of 13 years from 2000 to 2012. We use the data on deaths, injuries, and robot malfunctions, combined with the technical problems and corresponding recovery actions taken by the company (provided by the recalls), together with systematic accident analysis using a tool called CAST. Using an automated natural language parsing tool trained with domain-specific dictionaries and part-of-speech and negation taggers, we extracted valuable information on the potential causes of robotic accidents in order to understand the effectiveness of using robotic devices for different minimally invasive procedures. We found that despite the increasing number of procedures being done with the da Vinci surgical system, a significant number of malfunctions and system downtimes with potentially adverse impacts on patients are being experienced. We provide insights on the use of existing state-of-the-art technologies for enhancing safety in future robotic surgical systems. National Science Foundation (NSF CNS10-18503 CISE); IBM Corporation; Infosys Ltd Ope