
    Fault tolerance core: a framework for application-aware reliability

    As processor manufacturers keep pushing the limits of the transistor, the reliability of computer systems has become an increasing concern. Various fault tolerance techniques have been developed in an effort to provide reliable computing in the presence of faults. These approaches suffer from either a high resource cost or a high performance overhead. This thesis presents a design for a Fault Tolerance Core (FTC) that uses configurable application-aware hardware modules to improve reliability. Application-aware fault tolerance is achieved by detecting perturbations in application execution through the monitoring of processor pipeline signals. This approach uses hardware resources more efficiently than replication. The FTC achieves low overhead by placing the fault tolerance hardware separately from the processing core, minimizing the data collection hardware inside the processor, and performing fault detection in the background. This thesis presents the work completed so far towards an FTC: a hardware-assisted incremental checkpoint, an application hang detector, and a preliminary FTC framework for integrating these into a Leon3 microprocessor. All modules have been implemented and tested on a Leon3 synthesized atop a Stratix III FPGA running a Linux environment. A hardware fault injector capable of modifying 9 distinct processor pipeline signals has been implemented for performing validation experiments on the modules.
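
    The abstract does not describe how the incremental checkpoint works internally. As a purely software illustration of the general idea behind incremental checkpointing (save only the state that changed since the last checkpoint), a minimal sketch follows. The class name, the page granularity, and the rollback policy are all assumptions made for illustration and are not taken from the thesis, where the mechanism is hardware assisted.

```python
class IncrementalCheckpointer:
    """Toy memory with dirty-page tracking (hypothetical, not the FTC design)."""

    def __init__(self, num_pages):
        self.memory = [0] * num_pages         # live state, one value per "page"
        self.checkpoint = list(self.memory)   # last saved copy of the state
        self.dirty = set()                    # pages written since last checkpoint

    def write(self, page, value):
        """Update live state and remember that this page changed."""
        self.memory[page] = value
        self.dirty.add(page)

    def take_checkpoint(self):
        """Copy only the dirty pages into the checkpoint; return how many."""
        copied = len(self.dirty)
        for page in self.dirty:
            self.checkpoint[page] = self.memory[page]
        self.dirty.clear()
        return copied

    def rollback(self):
        """Undo all writes made since the last checkpoint."""
        for page in self.dirty:
            self.memory[page] = self.checkpoint[page]
        self.dirty.clear()


cp = IncrementalCheckpointer(num_pages=8)
cp.write(2, 5)
cp.write(3, 7)
pages_copied = cp.take_checkpoint()   # only pages 2 and 3 are copied
cp.write(2, 99)                       # a later, possibly faulty, write
cp.rollback()                         # page 2 returns to its checkpointed value
```

    The cost of each checkpoint in this sketch scales with the number of modified pages rather than with the total state size, which is the usual motivation for the incremental variant.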

    OS-Level Hang Detection in Complex Software Systems

    Many critical services are nowadays provided by large and complex software systems. However, the increasing complexity introduces several sources of non-determinism, which may lead to hang failures: the system appears to be running, but some of its services are perceived as unresponsive. Online monitoring is the only way to detect such failures and react to them promptly. However, when dealing with systems built from off-the-shelf components, online detection can be difficult, since instrumentation and log data collection may not be feasible in practice. In this paper, a detection framework to cope with software hangs is proposed. The framework enables the non-intrusive monitoring of complex systems, based on multiple sources of data gathered at the operating system (OS) level. The collected data are then combined to reveal hang failures. The framework is evaluated through a fault injection campaign on two complex systems from the air traffic management (ATM) domain. Results show that combining several monitors at the OS level detects hang failures effectively, in terms of both coverage and false positives, with a negligible impact on performance.
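
    The abstract states that data from several OS-level monitors are combined, but it does not give the combination rule. One simple possibility is a weighted vote over per-monitor alarms, sketched below. The monitor names, the weights, and the threshold are hypothetical and are not taken from the paper.

```python
def detect_hang(alarms, weights, threshold=0.5):
    """Return True if the weighted share of firing monitors reaches threshold.

    alarms:  dict mapping monitor name -> bool (did this monitor raise an alarm?)
    weights: dict mapping monitor name -> non-negative weight (assumed trust level)
    """
    total = sum(weights[name] for name in alarms)
    if total == 0:
        return False  # no usable monitors, so nothing can be concluded
    score = sum(weights[name] for name, fired in alarms.items() if fired)
    return score / total >= threshold


# Hypothetical monitors; the names are illustrative only.
weights = {"syscall_timeout": 5, "io_stall": 3, "sched_wait": 2}

two_strong_alarms = detect_hang(
    {"syscall_timeout": True, "io_stall": True, "sched_wait": False}, weights
)
one_weak_alarm = detect_hang(
    {"syscall_timeout": False, "io_stall": False, "sched_wait": True}, weights
)
```

    Requiring agreement among monitors, rather than trusting any single one, is one way to trade a little coverage for fewer false positives, which are the two metrics the abstract reports.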
