2,156 research outputs found

    Dynamic FTSS in Asynchronous Systems: the Case of Unison

    Full text link
    Distributed fault-tolerance can mask the effect of a limited number of permanent faults, while self-stabilization provides forward recovery after an arbitrary number of transient fault hit the system. FTSS protocols combine the best of both worlds since they are simultaneously fault-tolerant and self-stabilizing. To date, FTSS solutions either consider static (i.e. fixed point) tasks, or assume synchronous scheduling of the system components. In this paper, we present the first study of dynamic tasks in asynchronous systems, considering the unison problem as a benchmark. Unison can be seen as a local clock synchronization problem as neighbors must maintain digital clocks at most one time unit away from each other, and increment their own clock value infinitely often. We present many impossibility results for this difficult problem and propose a FTSS solution when the problem is solvable that exhibits optimal fault containment

    Stabilizing Byzantine-Fault Tolerant Storage

    Get PDF
    Distributed storage service is one of the main abstractions provided to developers of distributed applications due to its ability to hide the complexity generated by the various messages exchanged between processes. Many protocols have been proposed to build Byzantine-fault-tolerant (BFT) storage services on top of a message-passing system but none of them considers the possibility that well-behaving processes (i.e. correct processes) may experience transient failures due to, say, isolated errors during computation or bit alteration during message transfer. This paper proposes a stabilizing Byzantine-tolerant algorithm for emulating a multi-writer multi-reader regular register abstraction on top of a message passing system with n > 5f servers, which we prove to be the minimal possible number of servers for stabilizing and tolerating f Byzantine servers. That is, each read operation returns the value written by the most recent write and write operations are totally ordered with respect to the happened before relation. Our algorithm is particularly appealing for cloud computing architectures where both processors and memory contents (including stale messages in transit) are prone to errors, faults and malicious behaviors. The proposed implementation extends previous BFT implementations in two ways. First, the algorithm works even when the local memory of processors and the content of the communication channels are initially corrupted in an arbitrary manner. Second, unlike previous solutions, our algorithm uses bounded logical timestamps, a feature difficult to achieve in the presence of transient errors

    Self-stabilization Overhead: an Experimental Case Study on Coded Atomic Storage

    Full text link
    Shared memory emulation can be used as a fault-tolerant and highly available distributed storage solution or as a low-level synchronization primitive. Attiya, Bar-Noy, and Dolev were the first to propose a single-writer, multi-reader linearizable register emulation where the register is replicated to all servers. Recently, Cadambe et al. proposed the Coded Atomic Storage (CAS) algorithm, which uses erasure coding for achieving data redundancy with much lower communication cost than previous algorithmic solutions. Although CAS can tolerate server crashes, it was not designed to recover from unexpected, transient faults, without the need of external (human) intervention. In this respect, Dolev, Petig, and Schiller have recently developed a self-stabilizing version of CAS, which we call CASSS. As one would expect, self-stabilization does not come as a free lunch; it introduces, mainly, communication overhead for detecting inconsistencies and stale information. So, one would wonder whether the overhead introduced by self-stabilization would nullify the gain of erasure coding. To answer this question, we have implemented and experimentally evaluated the CASSS algorithm on PlanetLab; a planetary scale distributed infrastructure. The evaluation shows that our implementation of CASSS scales very well in terms of the number of servers, the number of concurrent clients, as well as the size of the replicated object. More importantly, it shows (a) to have only a constant overhead compared to the traditional CAS algorithm (which we also implement) and (b) the recovery period (after the last occurrence of a transient fault) is as fast as a few client (read/write) operations. Our results suggest that CASSS does not significantly impact efficiency while dealing with automatic recovery from transient faults and bounded size of needed resources

    Eventual election of multiple leaders for solving consensus in anonymous systems

    Get PDF
    In classical distributed systems, each process has a unique identity. Today, new distributed systems have emerged where a unique identity is not always possible to be assigned to each process. For example, in many sensor networks a unique identity is not possible to be included in each device due to its small storage capacity, reduced computational power, or the huge number of devices to be identified. In these cases, we have to work with anonymous distributed systems where processes cannot be identified. Consensus cannot be solved in classical and anonymous asynchronous distributed systems where processes can crash. To bypass this impossibility result, failure detectors are added to these systems. It is known that ? is the weakest failure detector class for solving consensus in classical asynchronous systems when amajority of processes never crashes. Although A? was introduced as an anonymous version of ?, to find the weakest failure detector in anonymous systems to solve consensus when amajority of processes never crashes is nowadays an open question. Furthermore, A? has the important drawback that it is not implementable. Very recently, A? has been introduced as a counterpart of ? for anonymous systems. In this paper, we show that the A? failure detector class is strictly weaker than A? (i.e., A? provides less information about process crashes than A?). We also present in this paper the first implementation of A? (hence, we also show that A? is implementable), and, finally, we include the first implementation of consensus in anonymous asynchronous systems augmented with A? and where a majority of processes does not crash

    Putting Teeth into Open Architectures: Infrastructure for Reducing the Need for Retesting

    Get PDF
    Proceedings Paper (for Acquisition Research Program)The Navy is currently implementing the open-architecture framework for developing joint interoperable systems that adapt and exploit open-system design principles and architectures. This raises concerns about how to practically achieve dependability in software-intensive systems with many possible configurations when: 1) the actual configuration of the system is subject to frequent and possibly rapid change, and 2) the environment of typical reusable subsystems is variable and unpredictable. Our preliminary investigations indicate that current methods for achieving dependability in open architectures are insufficient. Conventional methods for testing are suited for stovepipe systems and depend strongly on the assumptions that the environment of a typical system is fixed and known in detail to the quality-assurance team at test and evaluation time. This paper outlines new approaches to quality assurance and testing that are better suited for providing affordable reliability in open architectures, and explains some of the additional technical features that an Open Architecture must have in order to become a Dependable Open Architecture.Naval Postgraduate School Acquisition Research ProgramApproved for public release; distribution is unlimited

    Dynamic FTSS in Asynchronous Systems: the Case of Unison

    Get PDF
    Distributed fault-tolerance can mask the effect of a limited number of permanent faults, while self-stabilization provides forward recovery after an arbitrary number of transient fault hit the system. FTSS protocols combine the best of both worlds since they are simultaneously fault-tolerant and self-stabilizing. To date, FTSS solutions either consider static (i.e. fixed point) tasks, or assume synchronous scheduling of the system components. In this paper, we present the first study of dynamic tasks in asynchronous systems, considering the unison problem as a benchmark. Unison can be seen as a local clock synchronization problem as neighbors must maintain digital clocks at most one time unit away from each other, and increment their own clock value infinitely often. We present many impossibility results for this difficult problem and propose a FTSS solution when the problem is solvable that exhibits optimal fault containment

    NASA space station automation: AI-based technology review

    Get PDF
    Research and Development projects in automation for the Space Station are discussed. Artificial Intelligence (AI) based automation technologies are planned to enhance crew safety through reduced need for EVA, increase crew productivity through the reduction of routine operations, increase space station autonomy, and augment space station capability through the use of teleoperation and robotics. AI technology will also be developed for the servicing of satellites at the Space Station, system monitoring and diagnosis, space manufacturing, and the assembly of large space structures

    Dagstuhl News January - December 2000

    Get PDF
    "Dagstuhl News" is a publication edited especially for the members of the Foundation "Informatikzentrum Schloss Dagstuhl" to thank them for their support. The News give a summary of the scientific work being done in Dagstuhl. Each Dagstuhl Seminar is presented by a small abstract describing the contents and scientific highlights of the seminar as well as the perspectives or challenges of the research topic

    An Adaptive Resonance Theory Neural Network (ART NN)-based fault diagnosis system: A Case Study of gas turbine system in Resak Development Platform

    Get PDF
    The project introduces a case study of a real gas turbine system in Resak Development Platform. There are two main objectives of this project. The first objective is aimed to achieve an online fault diagnosis model using Adaptive Resonance Theorem (ART) as a considered option to avoid potential faults happen during plant system and process. The second objective is focused on a solution to improve the maintenance plan for the gas turbine system to be more economical yet still maintaining its safety level
    • …
    corecore