10,222 research outputs found

    Experimental analysis of computer system dependability

    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
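
The abstract above mentions importance sampling as a way to accelerate Monte Carlo simulation. As a minimal sketch of the general idea (not taken from the paper), the Python below estimates a rare tail probability of a standard normal variable by sampling from a proposal shifted onto the rare region and reweighting each hit by the likelihood ratio. The function name, the choice of a mean-shifted normal proposal, and the threshold used in the usage lines are all assumptions made for illustration.

```python
import math
import random


def rare_tail_probability(threshold: float, n_samples: int, seed: int = 0) -> float:
    """Estimate P(X > threshold) for X ~ N(0, 1) by importance sampling.

    Illustrative assumption: samples are drawn from the shifted proposal
    N(threshold, 1), so about half of them land in the rare region.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # draw from the proposal, not from N(0, 1)
        if x > threshold:
            # Likelihood ratio of the target density to the proposal density:
            # exp(-x^2 / 2) / exp(-(x - t)^2 / 2) = exp(-t * x + t^2 / 2)
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


# Usage: compare against the closed-form tail probability.
estimate = rare_tail_probability(4.0, 100_000)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
```

Sampling directly from N(0, 1) would need on the order of tens of thousands of draws to see a single event beyond 4.0, which is why a shifted proposal is a common choice for rare-event estimates of this kind.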

    Analyzing the effects of transient faults into applications

    As chip implementation technologies evolve to deliver more performance, chips use smaller components, with higher transistor density and lower supply voltages. All these factors make chips less robust and increase the probability of transient faults. A transient fault may occur once and never happen again in the same way during a computer system's lifetime. A transient fault can have distinct consequences: the operating system might abort execution if the change produced by the fault is detected through the application's bad behavior, but the biggest risk is that the fault produces an undetected data corruption that modifies the application's final result without warning (for example, a bit flip in some crucial data). To research transient faults in a computer system's processor registers and memory, we have developed an extension of COTSon, HP's and AMD's joint full-system simulation environment. This extension allows the injection of faults that change a single bit in the processor registers and memory of the simulated computer. The developed fault injection system makes it possible to evaluate the effects of single-bit-flip transient faults on an application, analyze an application's robustness against single-bit-flip transient faults, and validate fault detection mechanisms and strategies.
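
To make the single-bit-flip fault model concrete, here is a minimal Python sketch. It does not use or imitate the COTSon extension's interface; every function name below is hypothetical, and the "register" and "memory" are plain Python values standing in for simulated state.

```python
import struct


def flip_bit_in_word(value: int, bit: int, width: int = 64) -> int:
    """Return `value` with one bit inverted, treating it as a `width`-bit register."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside register width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def flip_bit_in_memory(memory: bytearray, byte_index: int, bit: int) -> None:
    """Invert one bit of one byte in a mutable memory image, in place."""
    if not 0 <= bit < 8:
        raise ValueError("bit index must be in 0..7")
    memory[byte_index] ^= 1 << bit


def flip_bit_in_double(x: float, bit: int) -> float:
    """Invert one bit of the IEEE 754 binary64 encoding of `x`."""
    if not 0 <= bit < 64:
        raise ValueError("bit index must be in 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


# Usage: the same fault model can be harmless or severe depending on the bit.
low_bit = flip_bit_in_double(1.0, 0)    # least significant mantissa bit
sign_bit = flip_bit_in_double(1.0, 63)  # sign bit
top_exp = flip_bit_in_double(1.0, 62)   # most significant exponent bit
```

Flipping the lowest mantissa bit changes the value only in its last digit, while flipping the sign or top exponent bit changes the result drastically, which illustrates why undetected corruption of crucial data is the main risk the abstract describes.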

    Analog-digital simulation of transient-induced logic errors and upset susceptibility of an advanced control system

    A simulation study is described which predicts the susceptibility of an advanced control system to electrical transients resulting in logic errors, latched errors, error propagation, and digital upset. The system is based on a custom-designed microprocessor and incorporates fault-tolerant techniques. The system under test and the method used to perform the transient injection experiment are described. Results for 2100 transient injections are analyzed and classified according to charge level, type of error, and location of injection.

    A Fault Injection Environment for Microprocessor-based Board

    Evaluating the faulty behaviour of low-cost microprocessor-based boards is an increasingly important issue, due to their usage in many safety-critical systems. To address this issue, the paper describes a software-implemented fault injection system based on the trace exception mode available in most microprocessors. The architecture of the complete fault injection environment is proposed, integrating modules for generating a fault list, for injecting the faults, and for gathering the results. Data gathered from some sample benchmark applications are presented. The main advantages of the approach are low cost, good portability, and high efficiency.

    Design for diagnostics and prognostics: a physical-functional approach


    On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation

    Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructured data, which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology node scaling below 10nm, hardware accelerators become more susceptible to faults, which in turn can impact the NN accuracy. In this paper, we study the resilience aspects of the Register-Transfer Level (RTL) model of NN accelerators, in particular fault characterization and mitigation. Following a High-Level Synthesis (HLS) approach, we first characterize the vulnerability of various components of the RTL NN. We observe that the severity of faults depends on both i) application-level specifications, i.e., NN data (inputs, weights, or intermediate), NN layers, and NN activation functions, and ii) architectural-level specifications, i.e., the data representation model and the parallelism degree of the underlying accelerator. Second, motivated by the characterization results, we present a low-overhead fault mitigation technique that can efficiently correct bit flips, 47.3% better than state-of-the-art methods. Comment: 8 pages, 6 figures

    AES-EPO study program, volume I Final study report

    Conceptual study of possible solutions to long-term and time-critical reliability problems affecting the Apollo command module guidance and control computer.

    Time domain analysis of switching transient fields in high voltage substations

    Switching operations of circuit breakers and disconnect switches generate transient currents that propagate along the substation busbars. At the moment of switching, the busbars temporarily act as antennae, radiating transient electromagnetic fields within the substation. The radiated fields may interfere with and disrupt the normal operation of electronic equipment used within the substation for measurement, control and communication purposes. Hence there is a need to fully characterise the substation electromagnetic environment as early as the design stage of substation planning and operation, to ensure safe operation of the electronic equipment. This paper deals with the computation of transient electromagnetic fields due to switching within a high voltage air-insulated substation (AIS) using the finite difference time domain (FDTD) method.

    Design for dependability: A simulation-based approach

    This research addresses issues in simulation-based system-level dependability analysis of fault-tolerant computer systems. The issues and difficulties of providing a general simulation-based approach for system-level analysis are discussed, and a methodology that addresses these issues is presented. The proposed methodology is designed to permit the study of a wide variety of architectures under various fault conditions. It permits detailed functional modeling of architectural features such as sparing policies, repair schemes, and routing algorithms, as well as other fault-tolerant mechanisms, and it allows the execution of actual application software. One key benefit of this approach is that the behavior of a system under faults does not have to be pre-defined, as is normally done. Instead, a system can be simulated in detail and injected with faults to determine its failure modes. The thesis describes how object-oriented design is used to incorporate this methodology into a general-purpose design and fault injection package called DEPEND. A software model is presented that uses abstractions of application programs to study the behavior and effect of software on hardware faults in the early design stage, when actual code is not available. Finally, an acceleration technique that combines hierarchical simulation, time acceleration algorithms and hybrid simulation to reduce simulation time is introduced.