17,238 research outputs found

    Adaptive Optimal Control of MapReduce Performance, Availability and Costs

    No full text
    International audienceMapReduce is a popular programming model for distributed data processing and Big Data applications running on clouds. Extensive research has been conducted either to improve the dependability or to increase performance of MapReduce, ranging from adaptive and on-demand fault-tolerance solutions, adaptive task scheduling techniques to optimized job execution mechanisms. This paper investigates an optimization-based solution to control MapReduce systems in order to provide guarantees in terms of both performance and availability while reducing utilization costs. We follow a control theoretical approach for MapReduce cluster scaling and admission control. Moreover, we aim to be robust to changes in MapRe-duce and in it's environment by adapting the controller online to those changes. This paper highlights the major challenges of combining system adaptation and optimal control to take the best of both approaches. CCS Concepts • Networks → Cloud computing; • Software and its engineering → Software configuration management and version control systems; • Computer systems organization → Dependable and fault-tolerant systems and networks

    Survivable algorithms and redundancy management in NASA's distributed computing systems

    Get PDF
    The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given

    Reliable fault-tolerant model predictive control of drinking water transport networks

    Get PDF
    This paper proposes a reliable fault-tolerant model predictive control applied to drinking water transport networks. After a fault has occurred, the predictive controller should be redesigned to cope with the fault effect. Before starting to apply the fault-tolerant control strategy, it should be evaluated whether the predictive controller will be able to continue operating after the fault appearance. This is done by means of a structural analysis to determine loss of controllability after the fault complemented with feasibility analysis of the optimization problem related to the predictive controller design, so as to consider the fault effect in actuator constraints. Moreover, by evaluating the admissibility of the different actuator-fault configurations, critical actuators regarding fault tolerance can be identified considering structural, feasibility, performance and reliability analyses. On the other hand, the proposed approach allows a degradation analysis of the system to be performed. As a result of these analyses, the predictive controller design can be modified by adapting constraints such that the best achievable performance with some pre-established level of reliability will be achieved. The proposed approach is tested on the Barcelona drinking water transport network.Postprint (author's final draft

    Architecting fault-tolerant software systems

    Get PDF
    The increasing size and complexity of software systems makes it hard to prevent or remove all possible faults. Faults that remain in the system can eventually lead to a system failure. Fault tolerance techniques are introduced for enabling systems to recover and continue operation when they are subject to faults. Many fault tolerance techniques are available but incorporating them in a system is not always trivial. We consider the following problems in designing a fault-tolerant system. First, existing reliability analysis techniques generally do not prioritize potential failures from the end-user perspective and accordingly do not identify sensitivity points of a system. \ud Second, existing architecture styles are not well-suited for specifying, communicating and analyzing design decisions that are particularly related to the fault-tolerant aspects of a system. Third, there are no adequate analysis techniques that evaluate the impact of fault tolerance techniques on the functional decomposition of software architecture. Fourth, realizing a fault-tolerant design usually requires a substantial development and maintenance effort. \ud To tackle the first problem, we propose a scenario-based software architecture reliability analysis method, called SARAH that benefits from mature reliability engineering techniques (i.e. FMEA, FTA) to provide an early reliability analysis of the software architecture design. SARAH evaluates potential failures from the end-user perspective to identify sensitive points of a system without requiring an implementation. \ud As a new architectural style, we introduce Recovery Style for specifying fault-tolerant aspects of software architecture. Recovery Style is used for communicating and analyzing architectural design decisions and for supporting detailed design with respect to recovery. \ud As a solution for the third problem, we propose a systematic method for optimizing the decomposition of software architecture for local recovery, which is an effective fault tolerance technique to attain high system availability. To support the method, we have developed an integrated set of tools that employ optimization techniques, state-based analytical models (i.e. CTMCs) and dynamic analysis on the system. The method enables the following: i ) modeling the design space of the possible decomposition alternatives, ii ) reducing the design space with respect to domain and stakeholder constraints and iii ) making the desired trade-off between availability and performance metrics. \ud To reduce the development and maintenance effort, we propose a framework, FLORA that supports the decomposition and implementation of software architecture for local recovery. The framework provides reusable abstractions for defining recoverable units and for incorporating the necessary coordination and communication protocols for recovery

    A synthesis of logic and bio-inspired techniques in the design of dependable systems

    Get PDF
    Much of the development of model-based design and dependability analysis in the design of dependable systems, including software intensive systems, can be attributed to the application of advances in formal logic and its application to fault forecasting and verification of systems. In parallel, work on bio-inspired technologies has shown potential for the evolutionary design of engineering systems via automated exploration of potentially large design spaces. We have not yet seen the emergence of a design paradigm that effectively combines these two techniques, schematically founded on the two pillars of formal logic and biology, from the early stages of, and throughout, the design lifecycle. Such a design paradigm would apply these techniques synergistically and systematically to enable optimal refinement of new designs which can be driven effectively by dependability requirements. The paper sketches such a model-centric paradigm for the design of dependable systems, presented in the scope of the HiP-HOPS tool and technique, that brings these technologies together to realise their combined potential benefits. The paper begins by identifying current challenges in model-based safety assessment and then overviews the use of meta-heuristics at various stages of the design lifecycle covering topics that span from allocation of dependability requirements, through dependability analysis, to multi-objective optimisation of system architectures and maintenance schedules
    • …
    corecore