2,344 research outputs found

    Study of fault-tolerant software technology

    Get PDF
    Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance

    Collecting and Analyzing Failure Data of Bluetooth Personal Area Networks

    Get PDF
    This work presents a failure data analysis campaign on Bluetooth Personal Area Networks (PANs) conducted on two kind of heterogeneous testbeds (working for more than one year). The obtained results reveal how failures distribution are characterized and suggest how to improve the dependability of Bluetooth PANs. Specically, we dene the failure model and we then identify the most effective recovery actions and masking strategies that can be adopted for each failure. We then integrate the discovered recovery actions and masking strategies in our testbeds, improving the availability and the reliability of 3.64% (up to 36.6%) and 202% (referred to the Mean Time To Failure), respectively

    A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

    Get PDF
    High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive. They may take hours, days or even weeks to complete execution. For example, some of the traditional HPC systems computations run on 100,000 processors for weeks. Consequently traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed in traditional HPC systems can now be executed in the cloud. Cloud computing price model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to checkpoint and redundancy techniques used in traditional HPC systems

    An automated wrapper-based approach to the design of dependable software

    Get PDF
    The design of dependable software systems invariably comprises two main activities: (i) the design of dependability mechanisms, and (ii) the location of dependability mechanisms. It has been shown that these activities are intrinsically difficult. In this paper we propose an automated wrapper-based methodology to circumvent the problems associated with the design and location of dependability mechanisms. To achieve this we replicate important variables so that they can be used as part of standard, efficient dependability mechanisms. These well-understood mechanisms are then deployed in all relevant locations. To validate the proposed methodology we apply it to three complex software systems, evaluating the dependability enhancement and execution overhead in each case. The results generated demonstrate that the system failure rate of a wrapped software system can be several orders of magnitude lower than that of an unwrapped equivalent

    Separation kernel robustness testing : the xtratum case study

    Get PDF
    With time and space partitioned architectures becoming increasingly appealing to the European space sector, the dependability of separation kernel technology is a key factor to its applicability in European Space Agency projects. This paper explores the potential of the data type fault model, which injects faults through the Application Program Interface, in separation kernel robustness testing. This fault injection methodology has been tailored to investigate its relevance in uncovering vulnerabilities within separation kernels and potentially contributing towards fault removal campaigns within this domain. This is demonstrated through a robustness testing case study of the XtratuM separation kernel for SPARC LEON3 processors. The robustness campaign exposed a number of vulnerabilities in XtratuM, exhibiting the potential benefits of using such a methodology for the robustness assessment of separation kernels.peer-reviewe
    corecore