1,687 research outputs found

    Fault-Tolerant Distributed Services in Message-Passing Systems

    Get PDF
    Distributed systems ranging from small local area networks to large wide area networks like the Internet composed of static and/or mobile users have become increasingly popular. A desirable property for any distributed service is fault-tolerance, which means the service remains uninterrupted even if some components in the network fail. This dissertation considers weak distributed models to find either algorithms to solve certain problems or impossibility proofs to show that a problem is unsolvable. These are the main contributions of this dissertation: • Failure detectors are used as a service to solve consensus (agreement among nodes) which is otherwise impossible in failure-prone asynchronous systems. We find an algorithm for crash-failure detection that uses bounded size messages in an arbitrary, partitionable network composed of badly- behaved channels that can lose and reorder messages. • Registers are a fundamental building block for shared memory emulations on top of message passing systems. The problem has been extensively studied in static systems. However, register emulation in dynamic systems with faulty nodes is still quite hard and there are impossibility proofs that point out scenarios where change in the system composition due to nodes entering and leaving (also called churn) makes the problem unsolvable. We propose the first emulation of a crash-fault tolerant register in a system with continuous churn where consensus is unsolvable, the size of the system can grow without bound and at most a constant fraction of the number of nodes in the system can fail by crashing. We prove a lower bound that states that fault-tolerance for dynamic systems with churn is inherently lower than in static systems. • We then extend the results in the crash-fault tolerant case to a dynamic system with continuous churn and nodes that can be Byzantine faulty. It is the first emulation of an atomic register in a system that can withstand nodes continually entering and leaving, imposes no upper bound on the system size and can tolerate Byzantine nodes. However, the number of Byzantine faulty nodes that can be tolerated is upper bounded by a constant number. Although the algorithm requires that there be a constant known upper bound on the number of Byzantine nodes, this restriction is unavoidable, as we show that it is impossible to emulate an atomic register if the system size and maximum number of servers that can be Byzantine in the system is unknown

    Kompics: a message-passing component model for building distributed systems

    Get PDF
    The Kompics component model and programming framework was designedto simplify the development of increasingly complex distributed systems. Systems built with Kompics leverage multi-core machines out of the box and they can be dynamically reconfigured to support hot software upgrades. A simulation framework enables deterministic debugging and reproducible performance evaluation of unmodified Kompics distributed systems. We describe the component model and show how to program and compose event-based distributed systems. We present the architectural patterns and abstractions that Kompics facilitates and we highlight a case study of a complex distributed middleware that we have built with Kompics. We show how our approach enables systematic development and evaluation of large-scale and dynamic distributed systems

    Detecting Failures in an Asynchronous System That Never Stops Changing

    Get PDF
    This thesis presents an algorithm for detecting failures in dynamic asynchronous distributed systems or environments in which new participants may continually join the system and old participants may continually leave the system (a phenomenon called churn), and active participants may fail. Such behavior is exhibited by many dynamic modern networks, for example, peer-to-peer networks. Devices are continually joining and leaving, and peers often remain in the network only long enough to retrieve the data they require. Another example would be mobile networks. Devices are constantly on the move, resulting in a continual change in participants. In these types of networks, the set of participants is rarely stable for very long and is dynamically changing. Many problems are not solvable if the fraction of participants that are crashed is too large. Yet the participants will continue to leave the system or crash. To avoid crossing the threshold where too many participants are crashed, it is of paramount importance to detect crashed participants and remove them from the system. The problem of detecting failures has been solved in static and synchronous distributed systems. However, since processes in an asynchronous dynamic distributed systems possess no global clock or synchronized logical clocks or timing information, detecting failures is a hard problem to solve in such systems. We propose a failure detector for an asynchronous system with churn by exploiting the dynamic nature of the system to estimate elapsed time. We design an algorithm to detect failed processes in such an asynchronous system and prove that if a process is identified as crashed by our failure detector, it has indeed failed. Additionally, we also prove that if the churn continues forever, then under certain circumstances every failed process is eventually identified as crashed

    Implementing a Register in a Dynamic Distributed System

    Get PDF
    Providing distributed processes with concurrent objects is a fundamental service that has to be offered by any distributed system. The classical shared read/write register is one of the most basic ones. Several protocols have been proposed that build an atomic register on top of an asynchronous messagepassing system prone to process crashes. In the same spirit, this paper addresses the implementation of a regular register (a weakened form of an atomic register) in an asynchronous dynamic message-passing system. The aim is here to cope with the net effect of the adversaries that are asynchrony and dynamicity (the fact that processes can enter and leave the system). The paper focuses on the class of dynamic systems the churn rate c of which is constant. It presents two protocols, one applicable to synchronous dynamic message passing systems, the other one to asynchronous dynamic systems. Both protocols rely on an appropriate broadcast communication service (similar to a reliable brodcast). Each requires a specific constraint on the churn rate c. Both protocols are first presented in an as intuitive as possible way, and are then proved correct

    Recovering Shared Objects Without Stable Storage

    Get PDF
    This paper considers the problem of building fault-tolerant shared objects when processes can crash and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model matches the way many long-lived systems are built. We show that it presents new challenges, as operations that are recorded at a quorum may not persist after some of the processes in that quorum crash and then recover. To address this problem, we introduce the notion of crash-consistent quorums, where no recoveries happen during the quorum responses. We show that relying on crash-consistent quorums enables a recovery procedure that can recover all operations that successfully finished. Crash-consistent quorums can be easily identified using a mechanism we term the crash vector, which tracks the causal relationship between crashes, recoveries, and other operations. We apply crash-consistent quorums and crash vectors to build two storage primitives. We give a new algorithm for multi-writer, multi-reader atomic registers in the DCR model that guarantees safety under all conditions and termination under a natural condition. It improves on the best prior protocol for this problem by requiring fewer rounds, fewer nodes to participate in the quorum, and a less restrictive liveness condition. We also present a more efficient single-writer, single-reader atomic set - a virtual stable storage abstraction. It can be used to lift any existing algorithm from the traditional Crash-Recovery model to the DCR model. We examine a specific application, state machine replication, and show that existing diskless protocols can violate their correctness guarantees, while ours offers a general and correct solution

    Average Case Analysis of a Shared Register Emulation Algorithm

    Get PDF
    Distributed algorithms are important for managing systems with multiple networked components, where each component operates independently but coordinates to achieve a common goal. Previous theoretical research has produced numerous distributed algorithms but primarily uses mathematical proofs to yield theoretical results, such as worst-case runtime complexity. However, less research has been done on how these algorithms behave in practice, such as average-case runtime complexity. This paper will describe the empirical behavior of the distributed algorithm CCReg in a realistic environment, using the language DistAlgo to implement said algorithm. CCReg emulates a shared read/write register using a message-passing system. In particular, CCReg allows the underlying message-passing system to experience continuous changes to the set of components present, and tolerates crash failures of components. When the rate of component change and the fraction of crash failures are bounded, CCReg is proven to work correctly. The original paper specifies bounds for both that are guaranteed to work, and gives proof for those bounds. However, these bounds are restrictive and do not allow for much component change or many crash failures. Thus the goal of our implementation is to determine if CCReg's theoretical bounds can be relaxed in practice. We focus on CCReg's safety and liveness conditions: the algorithm eventually terminates, and a consistency condition called linearizability is maintained. We use a general method developed by Gibbons and Korach for determining if any ordering of operations satisfying linearizability exists. We find that, for executions where operations are randomly invoked, the algorithm does not exhibit any adverse behaviors. Each execution we tested terminates in finite time and has an order of operations satisfying linearizability. We discuss these findings, as well as future approaches and methodology for testing the theoretical boundaries in practice
    • …
    corecore