
    Cross-layer system reliability assessment framework for hardware faults

    System reliability estimation during early design phases facilitates informed decisions about the integration of effective protection mechanisms against different classes of hardware faults. When not all system abstraction layers (technology, circuit, microarchitecture, software) are factored into such an estimation model, the delivered reliability reports are bound to be excessively pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and a supporting suite of tools for accurate but fast estimation of computing system reliability. The backbone of the methodology is a component-based Bayesian model, which calculates system reliability from the masking probabilities of individual hardware and software components while considering their complex interactions. Our detailed experimental evaluation for different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.
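    The abstract leaves the model's mathematics unspecified, so the following is only a minimal sketch of how per-layer masking probabilities might combine into a system-level FIT rate. The component names, FIT values, masking probabilities, and the independence assumption across layers are all illustrative; the paper's Bayesian model additionally captures the complex interactions between components that this simplification ignores.

```python
# Minimal illustrative sketch (not the paper's model): combine hypothetical
# per-layer masking probabilities into a system FIT estimate, assuming
# layers mask faults independently of one another.

# Hypothetical raw FIT rates per component (failures per 10^9 device-hours).
raw_fit = {"register_file": 120.0, "alu": 45.0, "cache": 300.0}

# Hypothetical masking probabilities per abstraction layer and component.
masking = {
    "register_file": {"circuit": 0.20, "microarch": 0.55, "software": 0.40},
    "alu":           {"circuit": 0.15, "microarch": 0.70, "software": 0.60},
    "cache":         {"circuit": 0.10, "microarch": 0.80, "software": 0.50},
}

def effective_fit(component: str) -> float:
    """A fault contributes to system failure only if no layer masks it."""
    propagate = 1.0
    for p_mask in masking[component].values():
        propagate *= 1.0 - p_mask  # independence assumption across layers
    return raw_fit[component] * propagate

system_fit = sum(effective_fit(c) for c in raw_fit)
print(f"Estimated system FIT: {system_fit:.1f}")
```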

    On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation

    Machine Learning (ML) is making a strong resurgence, in step with the massive generation of unstructured data, which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology nodes scaling below 10nm, hardware accelerators become more susceptible to faults, which in turn can impact NN accuracy. In this paper, we study the resilience of Register-Transfer Level (RTL) models of NN accelerators, in particular fault characterization and mitigation. Following a High-Level Synthesis (HLS) approach, we first characterize the vulnerability of the various components of an RTL NN. We observe that the severity of faults depends on both i) application-level specifications, i.e., NN data (inputs, weights, or intermediate values), NN layers, and NN activation functions, and ii) architectural-level specifications, i.e., the data representation model and the parallelism degree of the underlying accelerator. Second, motivated by the characterization results, we present a low-overhead fault mitigation technique that corrects bit flips 47.3% more effectively than state-of-the-art methods.
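    The characterization runs at RTL and cannot be reproduced from the abstract alone; the sketch below only illustrates the application-level effect being characterized: a single bit flip in a quantized weight perturbing a layer's output, with severity depending on which bit is hit. The int8 representation, the tiny fully connected layer, and the fault site are all illustrative assumptions, not the paper's setup.

```python
# Illustrative single bit-flip injection into a quantized weight matrix.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)  # int8 weights
inputs = rng.integers(0, 128, size=8, dtype=np.int8)           # int8 activations

def flip_bit(w: int, bit: int) -> int:
    """Flip one bit of an 8-bit two's-complement value."""
    u = (int(w) & 0xFF) ^ (1 << bit)
    return u - 256 if u >= 128 else u

# Golden (fault-free) output of one tiny fully connected layer.
golden = weights.astype(np.int32) @ inputs.astype(np.int32)

# Inject a single bit flip at an arbitrary fault site and re-run.
r, c, bit = 2, 5, 6
faulty = weights.copy()
faulty[r, c] = flip_bit(faulty[r, c], bit)
out = faulty.astype(np.int32) @ inputs.astype(np.int32)

print("golden output:", golden)
print("faulty output:", out)
print("max deviation:", np.max(np.abs(out - golden)))  # grows with bit position
```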

    Modeling of a latent fault detector in a digital system

    Methods of modeling the detection time, or latency period, of a hardware fault in a digital system are proposed that explain how a computer detects faults in a computational mode. The objectives were to study how software reacts to a fault, to account for as many variables as possible affecting detection, and to forecast a given program's detecting ability prior to computation. A series of experiments was conducted on a small emulated microprocessor with fault injection capability. Results indicate that the detecting capability of a program depends largely on the instruction subset used during computation and the frequency of its use, and has little direct dependence on such variables as fault mode, number set, degree of branching, and program length. A model is discussed which employs a balls-in-an-urn analogy to explain the rate at which subsequent repetitions of an instruction or instruction set detect a given fault.
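    The balls-in-an-urn analogy suggests a simple baseline rate argument: if each repetition of an instruction that can exercise the fault detects it independently with probability p, detection latency is geometric with mean 1/p. The sketch below simulates only that baseline; the value of p is an illustrative assumption, and the paper's model is more detailed.

```python
# Geometric-latency baseline implied by the urn analogy (illustrative only).
import random

def detection_latency(p: float) -> int:
    """Repetitions until the fault is first detected (geometric draw)."""
    n = 1
    while random.random() >= p:
        n += 1
    return n

p = 0.05  # hypothetical per-repetition detection probability
trials = [detection_latency(p) for _ in range(10_000)]
mean_latency = sum(trials) / len(trials)
print(f"empirical mean latency: {mean_latency:.1f} repetitions (theory: {1 / p:.1f})")
```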

    Probabilistic Monte-Carlo method for modelling and prediction of electronics component life

    Power electronics are widely used in electric vehicles, railway locomotives, and new-generation aircraft. The reliability of these components directly affects the reliability and performance of these vehicular platforms. In recent years, extensive research on reliability, failure modes, and aging analysis has been carried out, and there is a need for an efficient algorithm able to predict the life of a power electronics component. In this paper, a probabilistic Monte-Carlo framework is developed and applied to predict the remaining useful life of a component. Probability distributions are used to model the component's degradation process, with the model parameters learned using Maximum Likelihood Estimation. The prognostics are carried out by means of simulation: Monte-Carlo simulation propagates multiple possible degradation paths from the current health state of the component, and the remaining useful life and its confidence bounds are calculated by estimating the mean, median, and percentile statistics of the simulated degradation paths. Results from different probabilistic models are compared and their prognostic performance is evaluated.
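    The abstract does not fix a particular degradation distribution, so the skeleton below assumes, purely for illustration, Gaussian degradation increments with parameters from a simple MLE and failure at a fixed threshold; only the simulate-many-paths-then-take-percentiles structure reflects the described approach.

```python
# Monte-Carlo remaining-useful-life (RUL) skeleton under assumed Gaussian increments.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observed degradation increments per duty cycle.
observed = np.array([0.8, 1.1, 0.9, 1.3, 1.0, 1.2])
mu, sigma = observed.mean(), observed.std()  # Gaussian MLE of the increments

current_level, threshold = 40.0, 100.0       # current health state / failure limit
n_paths, horizon = 5_000, 500                # simulated paths, max cycles per path

cycles_to_failure = []
for _ in range(n_paths):
    level, t = current_level, 0
    while level < threshold and t < horizon:
        level += rng.normal(mu, sigma)       # propagate one degradation step
        t += 1
    cycles_to_failure.append(t)

rul = np.array(cycles_to_failure)
print("median RUL:", np.median(rul), "cycles")
print("5th-95th percentile bounds:", np.percentile(rul, [5, 95]))
```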

    Autonomous fault emulation: a new FPGA-based acceleration system for hardness evaluation

    The advent of nanometer technologies has significantly increased the sensitivity of integrated circuits to radiation, making soft errors much more frequent, not only in applications working in harsh environments, like aerospace circuits, but also in applications operating at the Earth's surface. Hardened circuits are therefore now demanded in many applications where fault tolerance was not a concern until very recently. To this end, efficient hardness evaluation solutions are required to deal with the increasing size and complexity of modern VLSI circuits. In this paper, a very fast and cost-effective solution for SEU sensitivity evaluation is presented. The proposed approach uses FPGA emulation in an autonomous manner to fully exploit FPGA emulation speed. Three different techniques to implement it are proposed and analyzed. Experimental results show that the proposed Autonomous Emulation approach can reach execution rates higher than one million faults per second, a performance improvement of two orders of magnitude with respect to previous approaches. These rates make it feasible to run very large fault injection campaigns that were not possible in the past. This work was supported by the Directorate of Research of the Madrid Community Government, Spain (Code 07/0052/2003 2), and by the European Commission and Spanish Government under the MEDEA+ Project (PARACHUTE-2A701) and PROFIT Project (CIRCE-FIT-330100-2005-60).
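    The emulation itself runs on an FPGA and cannot be shown here; the toy loop below only illustrates the structure of an exhaustive SEU campaign: one injection per (storage element, clock cycle) pair, each run classified against a golden reference. The 4-bit shift register stands in for a real netlist and is purely hypothetical; note how upsets shifted out of the register before the end of the run are logically masked.

```python
# Toy exhaustive SEU injection campaign over a 4-bit shift register.
def run(cycles: int, fault: tuple[int, int] | None = None) -> int:
    """Simulate the circuit; optionally flip one state bit at a given cycle."""
    state = 0
    for cycle in range(cycles):
        state = ((state << 1) | (cycle & 1)) & 0xF  # toy next-state logic
        if fault is not None and cycle == fault[0]:
            state ^= 1 << fault[1]                  # single-event upset
    return state

CYCLES, BITS = 16, 4
golden = run(CYCLES)                                # fault-free reference run

failures = 0
for cycle in range(CYCLES):                         # every injection instant...
    for bit in range(BITS):                         # ...in every storage element
        if run(CYCLES, fault=(cycle, bit)) != golden:
            failures += 1                           # observable error
        # otherwise the upset was shifted out: logically masked

total = CYCLES * BITS
print(f"{failures}/{total} injections produced an observable error")
```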
