36,045 research outputs found

    Experimental analysis of computer system dependability

    Get PDF
    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software-dependability, and fault diagnosis. The discussion involves several important issues studies in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance

    Advanced information processing system: Fault injection study and results

    Get PDF
    The objective of the AIPS program is to achieve a validated fault tolerant distributed computer system. The goals of the AIPS fault injection study were: (1) to present the fault injection study components addressing the AIPS validation objective; (2) to obtain feedback for fault removal from the design implementation; (3) to obtain statistical data regarding fault detection, isolation, and reconfiguration responses; and (4) to obtain data regarding the effects of faults on system performance. The parameters are described that must be varied to create a comprehensive set of fault injection tests, the subset of test cases selected, the test case measurements, and the test case execution. Both pin level hardware faults using a hardware fault injector and software injected memory mutations were used to test the system. An overview is provided of the hardware fault injector and the associated software used to carry out the experiments. Detailed specifications are given of fault and test results for the I/O Network and the AIPS Fault Tolerant Processor, respectively. The results are summarized and conclusions are given

    Study of the effects of SEU-induced faults on a pipeline protected microprocessor

    Get PDF
    This paper presents a detailed analysis of the behavior of a novel fault-tolerant 32-bit embedded CPU as compared to a default (non-fault-tolerant) implementation of the same processor during a fault injection campaign of single and double faults. The fault-tolerant processor tested is characterized by per-cycle voting of microarchitectural and the flop-based architectural states, redundancy at the pipeline level, and a distributed voting scheme. Its fault-tolerant behavior is characterized for three different workloads from the automotive application domain. The study proposes statistical methods for both the single and dual fault injection campaigns and demonstrates the fault-tolerant capability of both processors in terms of fault latencies, the probability of fault manifestation, and the behavior of latent faults

    A small-scale testbed for large-scale reliable computing

    Get PDF
    High performance computing (HPC) systems frequently suffer errors and failures from hardware components that negatively impact the performance of jobs run on these systems. We analyzed system logs from two HPC systems at Purdue University and created statistical models for memory and hard disk errors. We created a small-scale error injection testbed—using a customized QEMU build, libvirt, and Python—for HPC application programmers to test and debug their programs in a faulty environment so that programmers can write more robust and resilient programs before deploying them on an actual HPC system. The deliverables for this project are the fault injection program, the modified QEMU source code, and the statistical models used for driving the injection

    The Art of Fault Injection

    Get PDF
    Classical greek philosopher considered the foremost virtues to be temperance, justice, courage, and prudence. In this paper we relate these cardinal virtues to the correct methodological approaches that researchers should follow when setting up a fault injection experiment. With this work we try to understand where the "straightforward pathway" lies, in order to highlight those common methodological errors that deeply influence the coherency and the meaningfulness of fault injection experiments. Fault injection is like an art, where the success of the experiments depends on a very delicate balance between modeling, creativity, statistics, and patience

    On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation

    Get PDF
    Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructured data which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology node scaling below 10nm, hardware accelerators become more susceptible to faults, which in turn can impact the NN accuracy. In this paper, we study the resilience aspects of Register-Transfer Level (RTL) model of NN accelerators, in particular, fault characterization and mitigation. By following a High-Level Synthesis (HLS) approach, first, we characterize the vulnerability of various components of RTL NN. We observed that the severity of faults depends on both i) application-level specifications, i.e., NN data (inputs, weights, or intermediate), NN layers, and NN activation functions, and ii) architectural-level specifications, i.e., data representation model and the parallelism degree of the underlying accelerator. Second, motivated by characterization results, we present a low-overhead fault mitigation technique that can efficiently correct bit flips, by 47.3% better than state-of-the-art methods.Comment: 8 pages, 6 figure

    Cross-layer system reliability assessment framework for hardware faults

    Get PDF
    System reliability estimation during early design phases facilitates informed decisions for the integration of effective protection mechanisms against different classes of hardware faults. When not all system abstraction layers (technology, circuit, microarchitecture, software) are factored in such an estimation model, the delivered reliability reports must be excessively pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and supporting suite of tools for accurate but fast estimations of computing systems reliability. The backbone of the methodology is a component-based Bayesian model, which effectively calculates system reliability based on the masking probabilities of individual hardware and software components considering their complex interactions. Our detailed experimental evaluation for different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.Peer ReviewedPostprint (author's final draft
    • …
    corecore