Software-implemented fault insertion: An FTMP example

Author: Czeck Edward W.
Segall Zary Z.
Siewiorek Daniel P.

This report presents a model for fault insertion through software; describes its implementation on a fault-tolerant computer, FTMP; presents a summary of fault detection, identification, and reconfiguration data collected with software-implemented fault insertion; and compares the results to hardware fault insertion data. Experimental results show detection time to be a function of time of insertion and system workload. For the fault detection time, there is no correlation between software-inserted faults and hardware-inserted faults; this is because hardware-inserted faults must manifest as errors before detection, whereas software-inserted faults immediately exercise the error detection mechanisms. In summary, software-implemented fault insertion can be used to evaluate a system's fault-handling capabilities in fault detection, identification, and recovery. Although the software-inserted faults do not map directly to hardware-inserted faults, experiments show that software-implemented fault insertion is capable of emulating hardware fault insertion, with greater ease and automation.

NASA Technical Reports Server

Experimental analysis of computer system dependability

Author: Iyer Ravishankar K.
Tang Dong

This paper reviews an area that has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.

NASA Technical Reports Server

DeSyRe: on-Demand System Reliability

Author: Armato Antonino
Bouganis Christos-Savvas
Falsafi Babak
Gaydadjiev Georgi
Isaza Sebastian
Malek Alirad
Mariani Riccardo
Pnevmatikatos Dionisios N
Pradhan Dhiraj K
Rauwerda Gerard
Seepers Robert
Shafik Rishad Ahmed
Sourdis Ioannis
Strydis Christos
Sunesen Kim
Theodoropoulos Dimitris
Tzilis Stavros
Vavouras Michail
Publication venue: 'Elsevier BV'
Publication date: 01/01/2013

The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-free and fault-free system would heavily, even prohibitively, impact the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.

Southampton (e-Prints Soton)

EUR Research Repository

Chalmers Research

Chalmers Publication Library

Explore Bristol Research

Transient fault behavior in a microprocessor: A case study

Author: Duba Patrick

An experimental analysis is described that studies the susceptibility of a microprocessor-based jet engine controller to upsets caused by current and voltage transients. A design automation environment that allows the run-time injection of transients and the tracing of their impact from the device level to the pin level is described. The resulting error data are categorized by the charge levels of the injected transients, by location, and by their potential to cause logic upsets, latched errors, and pin errors. The results show a 3 picocoulomb threshold, below which the transients have little impact. An Arithmetic and Logic Unit transient is most likely to result in logic upsets and pin errors (i.e., impact the external environment). The transients in the countdown unit are potentially serious since they can result in latched errors, thus causing latent faults. Suggestions to protect the processor against these errors, by incorporating internal error detection and transient suppression techniques, are also made.

NASA Technical Reports Server

An integration of case-based and model-based reasoning and its application to physical system faults

Author: Karamouzis Stamos T.
Publication venue: W&M ScholarWorks
Publication date: 01/01/1993

Case-Based Reasoning (CBR) systems solve new problems by finding stored instances of problems similar to the current one, and by adapting previous solutions to fit the current problem, taking into consideration any differences between the current and previous situations. CBR has been proposed as a more robust and plausible model of expert reasoning than the better-known rule-based systems. Current CBR systems have been used in planning, engineering design, and memory organization. There has been minimal work, however, in the area of reasoning about physical systems. This type of reasoning is a difficult task, and every attempt to automate the process must overcome the problems of modeling normal behavior, diagnosing faults, and predicting future behavior. CBR systems are currently quite difficult to compare and evaluate, because there is no common mathematical framework in which the systems can be described. The only avenue available at present for comparison and evaluation of CBR systems requires an intellectual synthesis of the semantics of the program sources. Important constraints on the operation of a CBR system are often hidden in obscure programming tricks in the system's source code. This thesis presents a hybrid methodology for reasoning about physical systems in operation. This methodology is based on retrieval and adaptation of previously experienced problems similar to the problem at hand. In this methodology, the ability of a CBR system to reason about a physical system is significantly enhanced by the addition to the Case-Based Reasoner of a model of the physical system. The model describes the physical system's structural, functional, and causal behavior. Additionally, this thesis presents a mathematical formalization of the case-based reasoning paradigm and a formal specification of the interaction of the CBR component with the model-based component of a case-based system. To prove the feasibility and the merit of such a methodology, a prototypical system for dealing with the faults of a physical system has been designed and implemented. Testing has shown that this hybrid methodology allows the generation of diagnoses and prognoses that are beyond the capabilities of current reasoning systems.

College of William & Mary: W&M Publish

Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

Author: Garraghan P
McKee D
Ouyang X
Xu J
Yang R
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 20/09/2016

Increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena substantially affecting overall system performance. One such phenomenon is known as the “Long Tail”, whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate approximately 5% of task stragglers impact 50% of total jobs for batch processes, and 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution patterns modeling and online analytic agents to monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11% into their execution lifecycle with 95% accuracy for short duration jobs.

Crossref

Lancaster E-Prints

White Rose Research Online

Aerospace medicine and biology. A continuing bibliography with indexes, supplement 195

This bibliography lists 148 reports, articles, and other documents introduced into the NASA scientific and technical information system in June 1979.

NASA Technical Reports Server