408 research outputs found

    Multi-perspective Evaluation of Self-Healing Systems Using Simple Probabilistic Models

    Get PDF
    Quantifying the efficacy of self-healing systems is a challenging but important task, which has implications for increasing designer, operator and end-user confidence in these systems. During design system architects benefit from tools and techniques that enhance their understanding of the system, allowing them to reason about the tradeoffs of proposed or existing self-healing mechanisms and the overall effectiveness of the system as a result of different mechanism-compositions. At deployment time, system integrators and operators need to understand how the selfhealing mechanisms work and how their operation impacts the system's reliability, availability and serviceability (RAS) in order to cope with any limitations of these mechanisms when the system is placed into production. In this paper we construct an evaluation framework for selfhealing systems around simple, yet powerful, probabilistic models that capture the behavior of the system's selfhealing mechanisms from multiple perspectives (designer, operator, and end-user). We combine these analytical models with runtime fault-injection to study the operation of VM-Rejuv — a virtual machine based rejuvenation scheme for web-application servers. We use the results from the fault-injection experiments and model-analysis to reason about the efficacy of VM-Rejuv, its limitations and strategies for managing/mitigating these limitations in system deployments. Whereas we use VM-Rejuv as the subject of our evaluation in this paper, our main contribution is a practical evaluation approach that can be generalized to other self-healing systems

    Automatic Software Repair: a Bibliography

    Get PDF
    This article presents a survey on automatic software repair. Automatic software repair consists of automatically finding a solution to software bugs without human intervention. This article considers all kinds of repairs. First, it discusses behavioral repair where test suites, contracts, models, and crashing inputs are taken as oracle. Second, it discusses state repair, also known as runtime repair or runtime recovery, with techniques such as checkpoint and restart, reconfiguration, and invariant restoration. The uniqueness of this article is that it spans the research communities that contribute to this body of knowledge: software engineering, dependability, operating systems, programming languages, and security. It provides a novel and structured overview of the diversity of bug oracles and repair operators used in the literature

    A Pattern Language for High-Performance Computing Resilience

    Full text link
    High-performance computing systems (HPC) provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance of the order of quadrillion floating-point arithmetic calculations per second by aggregating the power of millions of compute, memory, networking and storage components. With the rapidly growing scale and complexity of HPC systems for achieving even greater performance, ensuring their reliable operation in the face of system degradations and failures is a critical challenge. System fault events often lead the scientific applications to produce incorrect results, or may even cause their untimely termination. The sheer number of components in modern extreme-scale HPC systems and the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment makes the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a methodology for designing HPC resilience solutions using design patterns. We codified the well-known techniques for handling faults, errors and failures that have been devised, applied and improved upon over the past three decades in the form of design patterns. In this paper, we present a pattern language to enable a structured approach to the development of HPC resilience solutions. The pattern language reveals the relations among the resilience patterns and provides the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack.Comment: Proceedings of the 22nd European Conference on Pattern Languages of Program

    Achieving Cost-Effective Software Reliability Through Self-Healing

    Get PDF
    Heterogeneity, mobility, complexity and new application domains raise new software reliability issues that cannot be met cost-effectively only with classic software engineering approaches. Self-healing systems can successfully address these problems, thus increasing software reliability while reducing maintenance costs. Self-healing systems must be able to automatically identify runtime failures, locate faults, and find a way to bring the system back to an acceptable behavior. This paper discusses the challenges underlying the construction of self-healing systems with particular focus on functional failures, and presents a set of techniques to build software systems that can automatically heal such failures. It introduces techniques to automatically derive assertions to effectively detect functional failures, locate the faults underlying the failures, and identify sequences of actions alternative to the failing sequence to bring the system back to an acceptable behavior

    Modern software cybernetics: new trends

    Get PDF
    Software cybernetics research is to apply a variety of techniques from cybernetics research to software engineering research. For more than fifteen years since 2001, there has been a dramatic increase in work relating to software cybernetics. From cybernetics viewpoint, the work is mainly on the first-order level, namely, the software under observation and control. Beyond the first-order cybernetics, the software, developers/users, and running environments influence each other and thus create feedback to form more complicated systems. We classify software cybernetics as Software Cybernetics I based on the first-order cybernetics, and as Software Cybernetics II based on the higher order cybernetics. This paper provides a review of the literature on software cybernetics, particularly focusing on the transition from Software Cybernetics I to Software Cybernetics II. The results of the survey indicate that some new research areas such as Internet of Things, big data, cloud computing, cyber-physical systems, and even creative computing are related to Software Cybernetics II. The paper identifies the relationships between the techniques of Software Cybernetics II applied and the new research areas to which they have been applied, formulates research problems and challenges of software cybernetics with the application of principles of Phase II of software cybernetics; identifies and highlights new research trends of software cybernetic for further research
    • …
    corecore