
    A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

    Full text link
    Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques. Comment: 11 pages
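
    As a hedged illustration of one representative technique that surveys of this kind cover, the sketch below shows application-level checkpoint/restart in Python: the computation periodically persists its state so that, after a component failure, it can resume from the last checkpoint instead of starting over. The file name, state layout, and checkpoint interval are assumptions made for the example, not anything prescribed by the survey.

    import json
    import os

    CHECKPOINT_FILE = "checkpoint.json"   # hypothetical path, for illustration only

    def save_checkpoint(step, partial_result):
        """Persist enough state to resume after a crash."""
        tmp = CHECKPOINT_FILE + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"step": step, "partial_result": partial_result}, f)
        os.replace(tmp, CHECKPOINT_FILE)  # atomic rename: a crash never leaves a torn file

    def load_checkpoint():
        """Return the last saved state, or a fresh one if no checkpoint exists."""
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as f:
                return json.load(f)
        return {"step": 0, "partial_result": 0}

    def run(total_steps=1_000_000, checkpoint_every=10_000):
        state = load_checkpoint()             # resume from the last checkpoint, if any
        acc = state["partial_result"]
        for step in range(state["step"], total_steps):
            acc += step                       # stand-in for the real computation
            if (step + 1) % checkpoint_every == 0:
                save_checkpoint(step + 1, acc)
        return acc

    if __name__ == "__main__":
        print(run())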

    Reducing the cost of group communication with semantic view synchrony

    Get PDF
    View Synchrony (VS) is a powerful abstraction in the design and implementation of dependable distributed systems. By ensuring that processes deliver the same set of messages in each view, it allows them to maintain consistency across membership changes. However, experience indicates that it is hard to combine strong reliability guarantees as offered by VS with stable high performance. In this paper we propose a novel abstraction, Semantic View Synchrony (SVS), that exploits the application's semantics to cope with high throughput applications. This is achieved by allowing some messages to be dropped while still preserving consistency when new views are installed. Thus, SVS inherits the elegance of view synchronous communication. The paper describes how SVS can be implemented and illustrates its usefulness in the context of distributed multi-player games.
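
    A minimal sketch of the idea behind SVS's semantic message dropping, assuming a game-style workload in which only the newest update per semantic key (for example, a player's position) matters; the Update type, the obsolescence rule, and install_view are illustrative names, not the paper's protocol or API. Every surviving process prunes its buffer the same way before the new view is installed, so all of them deliver the same reduced set of messages.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Update:
        key: str        # semantic key, e.g. a player id
        seqno: int      # per-key sequence number
        payload: dict   # e.g. the player's latest position

    def prune_obsolete(buffered):
        """Keep only the newest update per semantic key (the obsolescence relation)."""
        latest = {}
        for u in buffered:
            if u.key not in latest or u.seqno > latest[u.key].seqno:
                latest[u.key] = u
        return sorted(latest.values(), key=lambda u: (u.key, u.seqno))

    def install_view(buffered, new_members):
        """On a membership change, deliver only the pruned set before switching views."""
        to_deliver = prune_obsolete(buffered)
        return {"view": sorted(new_members), "deliver": to_deliver}

    if __name__ == "__main__":
        msgs = [Update("p1", 1, {"x": 0}), Update("p1", 2, {"x": 5}), Update("p2", 1, {"x": 9})]
        print(install_view(msgs, {"node-a", "node-b"}))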

    MSF-Model: Modeling Metastable Failures in Replicated Storage Systems

    Full text link
    Metastable failure is a recent abstraction of a pattern of failures that occurs frequently in real-world distributed storage systems. In this paper, we propose a formal analysis and modeling of metastable failures in replicated storage systems. We focus on a foundational problem in distributed systems -- the problem of consensus -- to have an impact on a large class of systems. Our main contribution is the development of a queuing-based analytical model, MSF-Model, that can be used to characterize and predict metastable failures. MSF-Model integrates novel modeling concepts that allow modeling metastable failures, which was intractable prior to our work. We also perform real experiments to reproduce and validate our model. These experiments show that MSF-Model predicts metastable failures with high accuracy when its predictions are compared against the observed behavior of the real system.
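
    The sketch below is not MSF-Model itself; it is a toy discrete-time queue, with parameters chosen purely for illustration, that shows the kind of dynamics a queuing-based model of metastability has to capture: a short overload pulse builds a backlog, queueing delay then exceeds the client timeout, and the resulting retries keep the offered load above capacity even after the pulse ends.

    def simulate(steps=400, arrival=90, capacity=100, timeout=5,
                 pulse_start=100, pulse_end=130, pulse_arrival=200):
        backlog = 0
        for t in range(steps):
            originals = pulse_arrival if pulse_start <= t < pulse_end else arrival
            delay = backlog / capacity                       # rough queueing delay, in steps
            retries = originals if delay > timeout else 0    # each original also retried once
            backlog = max(0, backlog + originals + retries - capacity)
            if t % 50 == 0:
                print(f"t={t:3d} offered={originals + retries:4d} backlog={backlog:6d}")
        return backlog

    if __name__ == "__main__":
        final = simulate()
        print("final backlog:", final, "- still nonzero long after the pulse")

    Without the retries, the post-pulse load (90) is below capacity (100) and the backlog would drain; with them, the sustaining load (180) keeps the backlog growing even though the original trigger is gone, which is the signature of a metastable failure.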

    Achieving Reliable Parallel Performance in a VoD Storage Server Using Randomization and Replication

    Full text link

    Using Virtualization to Improve Software Rejuvenation

    Full text link
    In this paper, we present an approach for software rejuvenation based on automated self-healing techniques that can be easily applied to off-the-shelf application servers and Internet sites. Software aging and transient failures are detected through continuous monitoring of system data and performability metrics of the application server. If anomalous behavior is identified, the system triggers an automatic rejuvenation action. This self-healing scheme is meant to be as minimally disruptive as possible for the running service and to achieve zero downtime in most cases. In our scheme, we exploit virtualization to optimize the self-recovery actions. The techniques described in this paper have been tested with a set of open-source Linux tools and the Xen virtualization middleware. We conducted an experimental study with two application benchmarks (Tomcat/Axis and TPC-W). Our results demonstrate that virtualization can be extremely helpful for software rejuvenation and fail-over in the presence of transient application failures and software aging.
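
    A minimal sketch of the threshold-driven, virtualization-assisted rejuvenation loop described above, under stated assumptions: the memory probe is simulated, the threshold is arbitrary, and the load-balancer and VM actions are placeholder prints rather than the paper's actual Xen tooling. The point is the control flow: monitor an aging indicator, fail traffic over to a clean standby VM when it degrades, then restart the aged instance.

    import itertools
    import time

    MEMORY_LIMIT_MB = 1500        # assumed aging threshold, for illustration
    CHECK_INTERVAL_S = 0.1        # shortened so the demo finishes quickly

    def simulated_memory_mb(counter=itertools.count(800, 100)):
        """Stand-in for a real probe (e.g. the server's own metrics or psutil)."""
        return next(counter)      # pretends the server leaks ~100 MB per check

    def switch_traffic_to(standby):
        print(f"[rejuvenation] pointing the load balancer at {standby}")  # placeholder action

    def restart_vm(name):
        print(f"[rejuvenation] restarting virtual machine {name}")        # placeholder action

    def monitor(active="vm-app-1", standby="vm-app-2", checks=12):
        for _ in range(checks):
            used = simulated_memory_mb()
            print(f"{active}: {used} MB in use")
            if used > MEMORY_LIMIT_MB:
                switch_traffic_to(standby)          # keep serving requests during cleanup
                restart_vm(active)                  # rejuvenate the aged instance
                active, standby = standby, active   # roles swap for the next cycle
                break
            time.sleep(CHECK_INTERVAL_S)

    if __name__ == "__main__":
        monitor()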

    Proactive software rejuvenation solution for web environments on virtualized platforms

    Get PDF
    The availability of information technologies for everything, from everywhere, at all times is a growing requirement. We use information technologies for everything from everyday and social tasks to critical tasks such as managing nuclear power plants or even the International Space Station (ISS). However, the availability of IT infrastructures is still a huge challenge nowadays. A quick look at the news turns up reports of corporate outages affecting millions of users and hurting companies' revenue and image. It is well known that computer system outages are currently more often due to software faults than to hardware faults. Several studies have reported that one of the causes of unplanned software outages is the software aging phenomenon. This term refers to the accumulation of errors, usually causing resource contention, during long-running application executions, such as web applications, which normally causes applications or systems to hang or crash. Gradual performance degradation can also accompany software aging. The phenomenon is often related to memory bloating/leaks, unterminated threads, data corruption, unreleased file locks or overruns, and there are several examples of software aging in industry. The work presented in this thesis aims to offer a proactive and predictive software rejuvenation solution for Internet services against software aging caused by resource exhaustion. To this end, we first present a threshold-based proactive rejuvenation approach to avoid the consequences of software aging. This first approach has some limitations, the most important being the need to know a priori the resource or resources involved in the crash and their critical values; moreover, some expertise is needed to set the threshold value that triggers the rejuvenation action. Due to these limitations, we have evaluated the use of Machine Learning to overcome the weaknesses of our first approach and obtain a proactive and predictive solution. Finally, the current and increasing tendency to use virtualization technologies to improve resource utilization has turned traditional data centers into virtualized data centers or platforms. We have used a Mathematical Programming approach to virtual machine allocation and migration to optimize resources, accepting as many services as possible on the platform while, at the same time, guaranteeing the availability of the deployed services (via our software rejuvenation proposal) against the software aging phenomenon. The thesis is supported by an exhaustive experimental evaluation that proves the effectiveness and feasibility of our proposals for current systems.
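
    A minimal sketch of the step from a fixed threshold to a predictive trigger, as described above; the resource trace, limits, and window sizes are assumptions for the example, and the simple linear-trend extrapolation stands in for the thesis's actual Machine Learning models. The idea is to project when the monitored resource will be exhausted and rejuvenate only when that projected time falls below a safety margin.

    def linear_trend(samples):
        """Least-squares slope and intercept over equally spaced samples."""
        n = len(samples)
        xs = range(n)
        mean_x = sum(xs) / n
        mean_y = sum(samples) / n
        denom = sum((x - mean_x) ** 2 for x in xs)
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) / denom
        return slope, mean_y - slope * mean_x

    def predicted_steps_to_exhaustion(samples, limit):
        slope, intercept = linear_trend(samples)
        if slope <= 0:
            return float("inf")               # no upward trend: no exhaustion predicted
        current = intercept + slope * (len(samples) - 1)
        return (limit - current) / slope

    def should_rejuvenate(samples, limit_mb=2048, safety_window=10):
        return predicted_steps_to_exhaustion(samples, limit_mb) < safety_window

    if __name__ == "__main__":
        leaking = [900 + 40 * i for i in range(20)]     # simulated leak: +40 MB per check
        print("steps left:", round(predicted_steps_to_exhaustion(leaking, 2048), 1))
        print("rejuvenate now?", should_rejuvenate(leaking))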

    Error management in ATLAS TDAQ: an intelligent systems approach

    Get PDF
    This thesis is concerned with the use of intelligent system techniques (IST) within a large distributed software system, specifically the ATLAS TDAQ system, which has been developed and is currently in use at the European Laboratory for Particle Physics (CERN). The overall aim is to investigate and evaluate a range of IST approaches in order to improve the error management system (EMS) currently used within the TDAQ system via error detection and classification. The thesis work provides a reference for future research and development of such methods in the TDAQ system. The thesis begins by describing the TDAQ system and the existing EMS, with a focus on the underlying expert system approach, in order to identify areas where improvements can be made using IST techniques. It then discusses measures for evaluating error detection and classification techniques and the factors specific to the TDAQ system. Error conditions are then simulated in a controlled manner using an experimental setup, and datasets are gathered from two different sources. Analysis and processing of the datasets using statistical and IST techniques shows that clusters exist in the data corresponding to the different simulated errors. Different IST techniques are applied to the gathered datasets in order to realise an error detection model. These techniques include Artificial Neural Networks (ANNs), Support Vector Machines (SVMs) and Cartesian Genetic Programming (CGP), and a comparison of their respective advantages and disadvantages is made. The principal conclusion from this work is that IST can be successfully used to detect errors in the ATLAS TDAQ system and thus can provide a tool to improve the overall error management system. It is of particular importance that IST can be used without detailed knowledge of the system, as the ATLAS TDAQ system is too complex for a single person to understand completely. The results of this research will benefit researchers developing and evaluating IST techniques in similar large-scale distributed systems.
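
    A minimal sketch of ML-based error classification in the spirit of the thesis, with loud assumptions: the three "monitoring metrics" and the error classes are synthetic stand-ins, not TDAQ data, and a scikit-learn SVM is used here as one of the technique families the thesis compares. It trains on simulated normal and faulty operating points and reports held-out accuracy.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    def synth(n, cpu, queue_len, drop_rate, label):
        """Noisy feature vectors around a nominal operating point (synthetic, for illustration)."""
        feats = rng.normal([cpu, queue_len, drop_rate], [5.0, 10.0, 0.02], size=(n, 3))
        return feats, np.full(n, label)

    parts = [
        synth(300, cpu=40, queue_len=50, drop_rate=0.01, label=0),    # normal operation
        synth(300, cpu=90, queue_len=400, drop_rate=0.02, label=1),   # overload-style error
        synth(300, cpu=35, queue_len=60, drop_rate=0.30, label=2),    # packet-loss-style error
    ]
    X = np.vstack([feats for feats, _ in parts])
    y = np.concatenate([lab for _, lab in parts])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))   # scale features, then classify
    clf.fit(X_train, y_train)
    print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")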

    Automated Service-Oriented Impact Analysis and Recovery Alternative Selection

    Get PDF