347 research outputs found

    An implementation and performance measurement of the progressive retry technique

    This paper describes a recovery technique called progressive retry for bypassing software faults in message-passing applications. The technique is implemented as reusable modules to provide application-level software fault tolerance. The paper describes the implementation of the technique and presents results from the application of progressive retry to two telecommunications systems. The results presented show that the technique is helpful in reducing the total recovery time for message-passing applications.

    An automated wrapper-based approach to the design of dependable software

    The design of dependable software systems invariably comprises two main activities: (i) the design of dependability mechanisms, and (ii) the location of dependability mechanisms. It has been shown that these activities are intrinsically difficult. In this paper we propose an automated wrapper-based methodology to circumvent the problems associated with the design and location of dependability mechanisms. To achieve this we replicate important variables so that they can be used as part of standard, efficient dependability mechanisms. These well-understood mechanisms are then deployed in all relevant locations. To validate the proposed methodology we apply it to three complex software systems, evaluating the dependability enhancement and execution overhead in each case. The results generated demonstrate that the system failure rate of a wrapped software system can be several orders of magnitude lower than that of an unwrapped equivalent

    Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems

    Checkpointing and rollback recovery are techniques that can provide efficient recovery from transient process failures. In a message-passing system, the rollback of a message sender may cause the rollback of the corresponding receiver, and the system needs to roll back to a consistent set of checkpoints called the recovery line. If the processes are allowed to take uncoordinated checkpoints, this rollback propagation may result in the domino effect, which prevents recovery line progression. Traditionally, only obsolete checkpoints before the global recovery line can be discarded, and the necessary and sufficient condition for identifying all garbage checkpoints has remained an open problem. A necessary and sufficient condition for achieving optimal garbage collection is derived, and it is proved that the number of useful checkpoints is bounded by N(N+1)/2, where N is the number of processes. The approach is based on the maximum-sized antichain model of consistent global checkpoints and the technique of recovery line transformation and decomposition. It is also shown that, for systems requiring message logging to record in-transit messages, the same approach can be used to achieve optimal message log reclamation. As a final topic, a unifying framework is described by considering checkpoint coordination and exploiting piecewise determinism as mechanisms for bounding rollback propagation, and the applicability of the optimal garbage collection algorithm to domino-free recovery protocols is demonstrated.
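To give a feel for the N(N+1)/2 bound quoted in this abstract, the short sketch below evaluates it for a few process counts. The function name `max_useful_checkpoints` and the sample values of N are illustrative assumptions; only the formula itself comes from the abstract.

```python
def max_useful_checkpoints(num_processes: int) -> int:
    """Evaluate the bound N(N+1)/2 stated in the abstract.

    The function name is hypothetical; only the formula is from the source.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    # N(N+1) is always even, so integer division is exact.
    return num_processes * (num_processes + 1) // 2


if __name__ == "__main__":
    # Sample process counts chosen only for illustration.
    for n in (2, 4, 8, 16):
        print(f"N={n}: at most {max_useful_checkpoints(n)} useful checkpoints")
```

The bound grows quadratically in the number of processes and does not depend on how long the system runs.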

    Software fault tolerance in computer operating systems

    This chapter provides data and analysis of the dependability and fault tolerance for three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors which results in the backup execution (the processor state and the sequence of events occurring) being different from the original execution is a major reason for the measured software fault tolerance. The IBM/MVS system fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved

    Hijacker: Efficient static software instrumentation with applications in high performance computing: Poster paper

    Static Binary Instrumentation is a technique that allows compile-time program manipulation. In particular, by relying on ad-hoc tools, the end user is able to alter the program's execution flow without affecting its overall semantics. This technique has been effectively used, e.g., to support code profiling, performance analysis, error detection, attack detection, or behavior monitoring. Nevertheless, efficiently relying on static instrumentation to produce executables that can be deployed without affecting the overall performance of the application still presents technical and methodological issues. In this paper, we present Hijacker, an open-source customizable static binary instrumentation tool that is able to alter a program's execution flow according to user-specified rules while limiting the execution overhead due to the code snippets inserted in the original program, thus enabling its exploitation in high performance computing. The tool is highly modular and works on an internal representation of the program that allows complex instrumentation tasks to be performed efficiently, and it can additionally be extended to support different instruction sets and executable formats without any need to modify the instrumentation engine. We additionally present an experimental assessment of the overhead induced by the injected code in real HPC applications. © 2013 IEEE

    Automatic Software Repair: a Bibliography

    This article presents a survey on automatic software repair. Automatic software repair consists of automatically finding a solution to software bugs without human intervention. This article considers all kinds of repairs. First, it discusses behavioral repair where test suites, contracts, models, and crashing inputs are taken as oracle. Second, it discusses state repair, also known as runtime repair or runtime recovery, with techniques such as checkpoint and restart, reconfiguration, and invariant restoration. The uniqueness of this article is that it spans the research communities that contribute to this body of knowledge: software engineering, dependability, operating systems, programming languages, and security. It provides a novel and structured overview of the diversity of bug oracles and repair operators used in the literature

    Acting, Planning, and Learning Using Hierarchical Operational Models

    The most common representation formalisms for planning are descriptive models that abstractly describe what the actions do and are tailored for efficiently computing the next state(s) in a state-transition system. However, real-world acting requires operational models that describe how to do things, with rich control structures for closed-loop online decision-making in a dynamic environment. Use of a different action model for planning than the one used for acting causes problems with combining acting and planning, in particular for the development and consistency verification of the different models. As an alternative, this dissertation defines and implements an integrated acting-and-planning system in which both planning and acting use the same operational models, which are written in a general-purpose hierarchical task-oriented language offering rich control structures. The acting component, called Reactive Acting Engine (RAE), is inspired by the well-known PRS system, except that instead of being purely reactive, it can get advice from a planner. The dissertation also describes three planning algorithms which plan by doing several Monte Carlo rollouts in the space of operational models. The best of these three planners, Plan-with-UPOM, uses a UCT-like Monte Carlo Tree Search procedure called UPOM (UCT Procedure for Operational Models), whose rollouts are simulated executions of the actor's operational models. The dissertation also presents learning strategies for use with RAE and UPOM that acquire, from online acting experiences and/or simulated planning results, a mapping from decision contexts to method instances as well as a heuristic function to guide UPOM. The experimental results show that Plan-with-UPOM and the learning strategies significantly improve the acting efficiency and robustness of RAE. It can be proved that UPOM converges asymptotically by mapping its search space to an MDP. The dissertation also describes a real-world prototype of RAE and Plan-with-UPOM to defend software-defined networks, a relatively new network management architecture, against incoming attacks.

    Identification of Crash Fault & Value Fault for Random Network in Dynamic Environment

    During the past few years, distributed systems have been the focus of considerable research in computer science. Fault tolerance in distributed systems is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. Fault tolerance is the ability of a system to perform its function correctly even in the presence of internal faults. An extensive methodology has been developed in this field over the past few years, and a number of fault-tolerant machines have been developed, but most deal with random hardware faults, while a smaller number deal with software, design, and operator faults to varying degrees. Our work focuses on simulating a system that deals with software faults, that is, faults that occur because of a failure or error in an internal software component. Our work is restricted to distributed diagnosis in a dynamic fault environment. We created different not-completely connected random networks with the number of nodes ranging from 8 to 256. We then induced faults in these networks dynamically using a Poisson distribution. Three different algorithms were implemented to detect the faults, and the comparison among these algorithms, based on delay latency and number of message exchanges, is presented graphically. The software faults we dealt with are crash faults and value faults in a distributed system (a not-completely connected network). Although much research has been done in the crash fault area, very little work has been done on diagnosing value faults in a dynamic fault environment.
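The simulation setup described in this abstract might be sketched roughly as follows: build a connected random network that is not completely connected, then schedule crash and value faults over time. This is not the authors' code. The function names, the edge probability, the fault rate, the uniform choice between fault types, and the reading of "Poisson distribution" as a Poisson arrival process (exponential inter-arrival times) are all assumptions made for illustration.

```python
import random
from collections import deque


def random_connected_network(n, extra_edge_prob, rng):
    """Return an adjacency dict for a connected but incomplete graph on n nodes.

    Hypothetical helper: the construction (random tree plus random extra
    edges) is an assumption, not the method used in the paper.
    """
    if n < 3:
        raise ValueError("need at least 3 nodes for a connected, incomplete graph")
    order = list(range(n))
    rng.shuffle(order)
    edges = set()
    # A random spanning tree guarantees connectivity.
    for i in range(1, n):
        edges.add(frozenset((order[i], order[rng.randrange(i)])))
    # Add extra edges independently with the given probability.
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < extra_edge_prob:
                edges.add(frozenset((a, b)))
    # If the graph ended up complete, drop one edge; for n >= 3 it stays connected.
    if len(edges) == n * (n - 1) // 2:
        edges.discard(frozenset((0, 1)))
    adjacency = {v: set() for v in range(n)}
    for edge in edges:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def is_connected(adjacency):
    """Breadth-first search check that every node is reachable from node 0."""
    seen, queue = {0}, deque([0])
    while queue:
        for neighbour in adjacency[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)


def poisson_fault_schedule(n, rate, horizon, rng):
    """Return (time, node, fault_type) events with Poisson-process arrivals.

    Assumption: each fault hits a previously fault-free node, and the
    fault type is chosen uniformly between "crash" and "value".
    """
    events = []
    healthy = list(range(n))
    t = rng.expovariate(rate)  # exponential gaps give Poisson arrivals
    while t < horizon and healthy:
        node = healthy.pop(rng.randrange(len(healthy)))
        events.append((t, node, rng.choice(("crash", "value"))))
        t += rng.expovariate(rate)
    return events


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only so runs are repeatable
    net = random_connected_network(8, extra_edge_prob=0.2, rng=rng)
    print("connected:", is_connected(net))
    for time, node, kind in poisson_fault_schedule(8, rate=0.5, horizon=20.0, rng=rng):
        print(f"t={time:.2f} node={node} fault={kind}")
```

A diagnosis algorithm could then be run against the resulting event list, recording detection latency and message counts, which are the two comparison metrics the abstract names.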