2,443 research outputs found
Experimental analysis of computer system dependability
This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software-dependability, and fault diagnosis. The discussion involves several important issues studies in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance
Recommended from our members
Assessing Asymmetric Fault-Tolerant Software
The most popular forms of fault tolerance against design faults use "asymmetric" architectures in which a "primary" part performs the computation and a "secondary" part is in charge of detecting errors and performing some kind of error processing and recovery. In contrast, the most studied forms of software fault tolerance are "symmetric" ones, e.g. N-version programming. The latter are often controversial, the former are not. We discuss how to assess the dependability gains achieved by these methods. Substantial difficulties have been shown to exist for symmetric schemes, but we show that the same difficulties affect asymmetric schemes. Indeed, the latter present somewhat subtler problems. In both cases, to predict the dependability of the fault-tolerant system it is not enough to know the dependability of the individual components. We extend to asymmetric architectures the style of probabilistic modeling that has been useful for describing the dependability of "symmetric" architectures, to highlight factors that complicate the assessment. In the light of these models, we finally discuss fault injection approaches to estimating coverage factors. We highlight the limits of what can be predicted and some useful research directions towards clarifying and extending the range of situations in which estimates of coverage of fault tolerance mechanisms can be trusted
Improving reconfigurable systems reliability by combining periodical test and redundancy techniques: a case study
This paper revises and introduces to the field of reconfigurable computer systems, some traditional techniques used in the fields of fault-tolerance and testing of digital circuits. The target area is that of on-board spacecraft electronics, as this class of application is a good candidate for the use of reconfigurable computing technology. Fault tolerant strategies are used in order for the system to adapt itself to the severe conditions found in space. In addition, the paper describes some problems and possible solutions for the use of reconfigurable components, based on programmable logic, in space applications
Towards Accurate Estimation of Error Sensitivity in Computer Systems
Fault injection is an increasingly important method for assessing, measuringand observing the system-level impact of hardware and software faults in computer systems. This thesis presents the results of a series of experimental studies in which fault injection was used to investigate the impact of bit-flip errors on program execution. The studies were motivated by the fact that transient hardware faults in microprocessors can cause bit-flip errors that can propagate to the microprocessors instruction set architecture registers and main memory. As the rate of such hardware faults is expected to increase with technology scaling, there is a need to better understand how these errors (known as ‘soft errors’) influence program execution, especially in safety-critical systems.Using ISA-level fault injection, we investigate how five aspects, or factors, influence the error sensitivity of a program. We define error sensitivity as the conditional probability that a bit-flip error in live data in an ISA-register or main-memory word will cause a program to produce silent data corruption (SDC; i.e., an erroneous result). We also consider the estimation of a measure called SDC count, which represents the number of ISA-level bit flips that cause an SDC.The five factors addressed are (a) the inputs processed by a program, (b) the level of compiler optimization, (c) the implementation of the program in the source code, (d) the fault model (single bit flips vs double bit flips) and (e)the fault-injection technique (inject-on-write vs inject-on-read). Our results show that these factors affect the error sensitivity in many ways; some factors strongly impact the error sensitivity or SDC count whereas others show a weaker impact. For example, our experiments show that single bit flips tend to cause SDCs more than double bit flips; compiler optimization positively impacts the SDC count but not necessarily the error sensitivity; the error sensitivity varies between 20% and 50% among the programs we tested; and variations in input affect the error sensitivity significantly for most of the tested programs
Rigorously assessing software reliability and safety
This paper summarises the state of the art in the assessment of software reliability and safety ("dependability"), and describes some promising developments. A sound demonstration of very high dependability is still impossible before operation of the software; but research is finding ways to make rigorous assessment increasingly feasible. While refined mathematical techniques cannot take the place of factual knowledge, they can allow the decision-maker to draw more accurate conclusions from the knowledge that is available
Software reliability and dependability: a roadmap
Shifting the focus from software reliability to user-centred measures of dependability in complete software-based systems. Influencing design practice to facilitate dependability assessment. Propagating awareness of dependability issues and the use of existing, useful methods. Injecting some rigour in the use of process-related evidence for dependability assessment. Better understanding issues of diversity and variation as drivers of dependability. Bev Littlewood is founder-Director of the Centre for Software Reliability, and Professor of Software Engineering at City University, London. Prof Littlewood has worked for many years on problems associated with the modelling and evaluation of the dependability of software-based systems; he has published many papers in international journals and conference proceedings and has edited several books. Much of this work has been carried out in collaborative projects, including the successful EC-funded projects SHIP, PDCS, PDCS2, DeVa. He has been employed as a consultant t
- …