93 research outputs found

    Experimental analysis of computer system dependability

    Get PDF
    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software-dependability, and fault diagnosis. The discussion involves several important issues studies in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance

    Measurement-based reliability prediction methodology

    Get PDF
    In the past, analytical and measurement based models were developed to characterize computer system behavior. An open issue is how these models can be used, if at all, for system design improvement. The issue is addressed here. A combined statistical/analytical approach to use measurements from one environment to model the system failure behavior in a new environment is proposed. A comparison of the predicted results with the actual data from the new environment shows a close correspondence

    Software fault tolerance in computer operating systems

    Get PDF
    This chapter provides data and analysis of the dependability and fault tolerance for three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors which results in the backup execution (the processor state and the sequence of events occurring) being different from the original execution is a major reason for the measured software fault tolerance. The IBM/MVS system fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved

    Measurement and Analysis of Operating System Fault Tolerance

    Get PDF
    Coordinated Science Laboratory was formerly known as Control Systems LaboratoryONR / N00014-91-J-1116NASA / NAG-1-61

    Report of the IEEE Workshop on Measurement and Modeling of Computer Dependability

    Get PDF
    Coordinated Science Laboratory was formerly known as Control Systems LaboratoryNASA Langley Research Center / NASA NAG-1-602 and NASA NAG-1-613ONR / N00014-85-K-000

    A real-time diagnostic and performance monitor for UNIX

    Get PDF
    There are now over one million UNIX sites and the pace at which new installations are added is steadily increasing. Along with this increase, comes a need to develop simple efficient, effective and adaptable ways of simultaneously collecting real-time diagnostic and performance data. This need exists because distributed systems can give rise to complex failure situations that are often un-identifiable with single-machine diagnostic software. The simultaneous collection of error and performance data is also important for research in failure prediction and error/performance studies. This paper introduces a portable method to concurrently collect real-time diagnostic and performance data on a distributed UNIX system. The combined diagnostic/performance data collection is implemented on a distributed multi-computer system using SUN4's as servers. The approach uses existing UNIX system facilities to gather system dependability information such as error and crash reports. In addition, performance data such as CPU utilization, disk usage, I/O transfer rate and network contention is also collected. In the future, the collected data will be used to identify dependability bottlenecks and to analyze the impact of failures on system performance

    DEPEND: A Simulation-Based Environment for System Level Dependability Analysis

    Get PDF
    Coordinated Science Laboratory was formerly known as Control Systems LaboratoryNational Aeronautics and Space Administration / NASA NAG-1-613 and NASA NGT-5083

    DEPEND: A Design Environment for Prediction and Evaluation of System Dependability

    Get PDF
    Coordinated Science Laboratory was formerly known as Control Systems LaboratoryJoint Services Electronics Program / N00014-90-J-1270National Aeronautics and Space Administration (NASA) / NAG-1-61

    Design for dependability: A simulation-based approach

    Get PDF
    This research addresses issues in simulation-based system level dependability analysis of fault-tolerant computer systems. The issues and difficulties of providing a general simulation-based approach for system level analysis are discussed and a methodology that address and tackle these issues is presented. The proposed methodology is designed to permit the study of a wide variety of architectures under various fault conditions. It permits detailed functional modeling of architectural features such as sparing policies, repair schemes, routing algorithms as well as other fault-tolerant mechanisms, and it allows the execution of actual application software. One key benefit of this approach is that the behavior of a system under faults does not have to be pre-defined as it is normally done. Instead, a system can be simulated in detail and injected with faults to determine its failure modes. The thesis describes how object-oriented design is used to incorporate this methodology into a general purpose design and fault injection package called DEPEND. A software model is presented that uses abstractions of application programs to study the behavior and effect of software on hardware faults in the early design stage when actual code is not available. Finally, an acceleration technique that combines hierarchical simulation, time acceleration algorithms and hybrid simulation to reduce simulation time is introduced
    corecore