3,041 research outputs found

    Cross-layer system reliability assessment framework for hardware faults

    System reliability estimation during early design phases facilitates informed decisions about integrating effective protection mechanisms against different classes of hardware faults. When such an estimation model does not factor in all system abstraction layers (technology, circuit, microarchitecture, software), the resulting reliability reports can be excessively pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and a supporting suite of tools for accurate yet fast estimation of computing-system reliability. The backbone of the methodology is a component-based Bayesian model, which calculates system reliability from the masking probabilities of individual hardware and software components, taking their complex interactions into account. Our detailed experimental evaluation across different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.
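The role of per-layer masking probabilities in the abstract above can be illustrated with a deliberately simplified sketch. It is not the authors' Bayesian model: it assumes that masking at each layer is independent, and the function name, layer names, and numbers are hypothetical.

```python
def effective_fit(raw_fit: float, masking_probs: dict) -> float:
    """Derate a raw FIT rate by per-layer masking probabilities.

    Assumption (not from the paper): layers mask faults independently,
    so a fault becomes a failure only if it escapes masking at every layer.
    """
    p_escape = 1.0
    for layer, p_mask in masking_probs.items():
        if not 0.0 <= p_mask <= 1.0:
            raise ValueError(f"masking probability for {layer!r} must be in [0, 1]")
        p_escape *= 1.0 - p_mask
    return raw_fit * p_escape


# Hypothetical example values, chosen only for illustration.
layers = {"circuit": 0.5, "microarchitecture": 0.5, "software": 0.0}
print(effective_fit(1000.0, layers))
```

A model that ignores some of these layers effectively sets their masking probability to zero, which is one way to see why the estimate becomes pessimistic.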

    Systematic Model-based Design Assurance and Property-based Fault Injection for Safety Critical Digital Systems

    With advances in sensing, wireless communications, computing, control, and automation technologies, we are witnessing the rapid uptake of Cyber-Physical Systems (CPSs) across many applications, including connected vehicles, healthcare, energy, manufacturing, and smart homes. Many of these applications are safety-critical and depend on the correct and safe execution of software and hardware that are intrinsically subject to faults. These faults can be design faults (software faults, specification faults, etc.) or physically occurring faults (hardware failures, single-event upsets, etc.). Both types of faults must be addressed during the design and development of these critical systems. Several safety-critical industries have widely adopted Model-Based Engineering paradigms to manage the design assurance processes of these complex CPSs. This thesis studies the application of an IEC 61508 compliant model-based design assurance methodology to a representative safety-critical digital architecture targeted at nuclear power generation facilities. The study presents detailed experiences and results to demonstrate the benefits of model testing in finding design flaws and its relevance to subsequent verification steps in the workflow. Additionally, to study the impact of physical faults on the digital architecture, we develop a novel property-based fault injection method that overcomes a few deficiencies of traditional fault injection methods. The model-based fault injection approach presented here guarantees high efficiency and near-exhaustive input/state/fault space coverage by using formal model checking principles to identify fault activation conditions and prove the fault tolerance features. The fault injection framework facilitates automated integration of fault saboteurs throughout the model to enable exhaustive fault location coverage in the model.

    Fast and Accurate Error Simulation for CNNs Against Soft Errors

    The push to adopt AI-based computation in safety- and mission-critical applications motivates interest in methods for assessing the robustness of an application with respect not only to its training and tuning but also to errors caused by faults, in particular soft errors, affecting the underlying hardware. Two strategies exist: architecture-level fault injection and application-level functional error simulation. We present a framework for the reliability analysis of Convolutional Neural Networks (CNNs) via an error simulation engine that exploits a set of validated error models extracted from a detailed fault injection campaign. These error models are defined based on the corruption patterns that faults induce in the output of the CNN operators; they bridge the gap between fault injection and error simulation, exploiting the advantages of both approaches. We compared our methodology against SASSIFI for the accuracy of functional error simulation w.r.t. fault injection, and against TensorFI in terms of speedup for the error simulation strategy. Experimental results show that our methodology achieves about 99% accuracy of the fault effects w.r.t. SASSIFI, and a speedup ranging from 44x up to 63x w.r.t. TensorFI, which implements only a limited set of error models.

    Design for dependability: A simulation-based approach

    This research addresses issues in simulation-based system-level dependability analysis of fault-tolerant computer systems. The issues and difficulties of providing a general simulation-based approach for system-level analysis are discussed, and a methodology that addresses and tackles these issues is presented. The proposed methodology is designed to permit the study of a wide variety of architectures under various fault conditions. It permits detailed functional modeling of architectural features such as sparing policies, repair schemes, and routing algorithms, as well as other fault-tolerant mechanisms, and it allows the execution of actual application software. One key benefit of this approach is that the behavior of a system under faults does not have to be predefined, as is normally done. Instead, a system can be simulated in detail and injected with faults to determine its failure modes. The thesis describes how object-oriented design is used to incorporate this methodology into a general-purpose design and fault injection package called DEPEND. A software model is presented that uses abstractions of application programs to study the behavior and effect of software on hardware faults in the early design stage, when actual code is not available. Finally, an acceleration technique that combines hierarchical simulation, time acceleration algorithms, and hybrid simulation to reduce simulation time is introduced.

    Analyzing the effects of transient faults into applications

    As chip implementation technologies evolve to deliver more performance, computer chips use smaller components with a higher density of transistors and operate at lower supply voltages. All these factors make chips less robust and increase the probability of transient faults. A transient fault may occur once and never recur in the same way during a computer system's lifetime. A transient fault can have distinct consequences: the operating system might abort the execution if the change produced by the fault is detected through abnormal behavior of the application, but the biggest risk is that the fault produces an undetected data corruption that modifies the application's final result without warning (for example, a bit flip in some crucial data). To study transient faults in a computer system's processor registers and memory, we have developed an extension of COTSon, the full-system simulation environment developed jointly by HP and AMD. This extension allows the injection of faults that change a single bit in the processor registers and memory of the simulated computer. The developed fault injection system makes it possible to evaluate the effects of single-bit-flip transient faults on an application, analyze an application's robustness against single-bit-flip transient faults, and validate fault detection mechanisms and strategies.
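The single-bit-flip fault model described in the abstract above can be sketched in a few lines. This is only an illustration of the fault model, not the COTSon extension itself: the function names, the 64-bit register width, and the use of a byte buffer as "memory" are assumptions.

```python
import random
from typing import Tuple


def flip_bit(value: int, bit: int, width: int = 64) -> int:
    """Return `value` with one bit inverted, emulating a register bit flip.

    The 64-bit default width is an assumption for illustration.
    """
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def inject_memory_fault(memory: bytearray, rng: random.Random) -> Tuple[int, int]:
    """Flip one randomly chosen bit in place; return (byte offset, bit index)."""
    offset = rng.randrange(len(memory))
    bit = rng.randrange(8)
    memory[offset] ^= 1 << bit
    return offset, bit


# Compare a fault-free ("golden") buffer with a faulty copy.
golden = bytearray(b"crucial data")
faulty = bytearray(golden)
where = inject_memory_fault(faulty, random.Random(0))
print("fault injected at", where, "corrupted:", faulty != golden)
```

In a real campaign the comparison would be made on the application's final output rather than on the raw buffer, since a flipped bit that is never read or is overwritten leaves the result unchanged.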

    Comparison of Fault Simulation Over Custom Kernel Module Using Various Techniques

    To test the behavior of Linux kernel modules, device drivers, and file systems under faulty conditions, researchers have injected faults in various artificial environments. Because such faults are rare and unpredictable, localizing and detecting errors in Linux kernel, device driver, and file system modules is extremely difficult. Artificially introducing random faults during normal tests is the only known approach to such problems. A standard method for performing such experiments is to generate synthetic faults and study their effects. Various fault injection frameworks for the Linux kernel have been analyzed to simulate such detection. This paper compares different approaches and techniques for fault injection used to test Linux kernel modules, including simulating low-resource conditions and detecting memory leaks. The frameworks chosen for these experiments are: Linux Test Project (LTP), KEDR, Linux Fault-Injection (LFI), and SCSI.