413 research outputs found

    The Failure Detector Abstraction

    Full text link
    This paper surveys the failure detector concept through two dimensions. First we study failure detectors as building blocks to simplify the design of reliable distributed algorithms. More specifically, we illustrate how failure detectors can factor out timing assumptions to detect failures in distributed agreement algorithms. Second, we study failure detectors as computability benchmarks. That is, we survey the weakest failure detector question and illustrate how failure detectors can be used to classify problems. We also highlights some limitations of the failure detector abstraction along each of the dimensions

    Fairness Properties of the Trusting Failure Detector

    Get PDF
    In 1985 it was shown by Fischer et al. that consensus, a fundamental problem in distributed computing, was impossible in asynchronous distributed systems in the presence of even just one process failure. This result prompted a search for alternative system models that were capable of solving such problems and culminated in the development of two helpful constructs: partially synchronous system models and failure detectors. Partially synchronous system models seek to solve the problem of identifying process crashes by constraining the real-time behavior of the underlying system. In the resulting models, crashed processes can be detected indirectly through the use of timeouts. Failure detectors, on the other hand, address process crashes by directly providing (potentially inaccurate) information on failures. As a result, failure detectors were viewed as abstractions of real-time information. Pike et al. proposed a different perspective on failure detectors; as abstracting fairness properties. Fairness in a system imposes bounds on the relative frequencies of communication and execution between processes in a system, and it was shown that four frequently-used failure detectors from the Chandra-Toueg hierarchy (P, ♱P, S, ♱S) encapsulate these fairness properties. This discovery suggests that failure detectors may be better understood as abstractions of fairness rather than real-time properties as well as demonstrates the possibility to communicate results between systems augmented with failure detectors and partially synchronous system models. In this thesis, we will be discussing an extension of the Pike et al. result to the trusting failure detector. The trusting failure detector is the weakest failure detector to implement the problem of fault-tolerant mutual exclusion: a fundamental primitive for distributed computing

    A Look at Basics of Distributed Computing *

    Get PDF
    International audienceThis paper presents concepts and basics of distributed computing which are important (at least from the author's point of view), and should be known and mastered by Master students and engineers. Those include: (a) a characterization of distributed computing (which is too much often confused with parallel computing); (b) the notion of a synchronous system and its associated notions of a local algorithm and message adversaries; (c) the notion of an asynchronous shared memory system and its associated notions of universality and progress conditions; and (d) the notion of an asynchronous message-passing system with its associated broadcast and agreement abstractions, its impossibility results, and approaches to circumvent them. Hence, the paper can be seen as a guided tour to key elements that constitute basics of distributed computing

    Optimal Resilience in Systems That Mix Shared Memory and Message Passing

    Get PDF
    We investigate the minimal number of failures that can partition a system where processes communicate both through shared memory and by message passing. We prove that this number precisely captures the resilience that can be achieved by algorithms that implement a variety of shared objects, like registers and atomic snapshots, and solve common tasks, like randomized consensus, approximate agreement and renaming. This has implications for the m&m-model of [Aguilera et al., 2018] and for the hybrid, cluster-based model of [Damien Imbs and Michel Raynal, 2013; Michel Raynal and Jiannong Cao, 2019]

    Randomized protocols for asynchronous consensus

    Full text link
    The famous Fischer, Lynch, and Paterson impossibility proof shows that it is impossible to solve the consensus problem in a natural model of an asynchronous distributed system if even a single process can fail. Since its publication, two decades of work on fault-tolerant asynchronous consensus algorithms have evaded this impossibility result by using extended models that provide (a) randomization, (b) additional timing assumptions, (c) failure detectors, or (d) stronger synchronization mechanisms than are available in the basic model. Concentrating on the first of these approaches, we illustrate the history and structure of randomized asynchronous consensus protocols by giving detailed descriptions of several such protocols.Comment: 29 pages; survey paper written for PODC 20th anniversary issue of Distributed Computin

    The Failure Detector Abstraction

    Get PDF
    A failure detector is a fundamental abstraction in distributed computing. This paper surveys this abstraction through two dimensions. First we study failure detectors as building blocks to simplify the design of reliable distributed algorithms. In particular, we illustrate how failure detectors can factor out timing assumptions to detect failures in distributed agreement algorithms. Second, we study failure detectors as computability benchmarks. That is, we survey the weakest failure detector question and illustrate how failure detectors can be used to classify problems. We also highlight some limitations of the failure detector abstraction along each of the dimensions

    Interactive Consistency in practical, mostly-asynchronous systems

    Full text link
    Interactive consistency is the problem in which n nodes, where up to t may be byzantine, each with its own private value, run an algorithm that allows all non-faulty nodes to infer the values of each other node. This problem is relevant to critical applications that rely on the combination of the opinions of multiple peers to provide a service. Examples include monitoring a content source to prevent equivocation or to track variability in the content provided, and resolving divergent state amongst the nodes of a distributed system. Previous works assume a fully synchronous system, where one can make strong assumptions such as negligible message delivery delays and/or detection of absent messages. However, practical, real-world systems are mostly asynchronous, i.e., they exhibit only some periods of synchrony during which message delivery is timely, thus requiring a different approach. In this paper, we present a thorough study on practical interactive consistency. We leverage the vast prior work on broadcast and byzantine consensus algorithms to design, implement and evaluate a set of algorithms, with varying timing assumptions and message complexity, that can be used to achieve interactive consistency in real-world distributed systems. We provide a complete, open-source implementation of each proposed interactive consistency algorithm by building a multi-layered stack of protocols that include several broadcast protocols, as well as a binary and a multi-valued consensus protocol. Most of these protocols have never been implemented and evaluated in a real system before. We analyze the performance of our suite of algorithms experimentally by engaging in both single instance and multiple parallel instances of each alternative.Comment: 13 pages, 10 figure

    Information Infrastructures in Distributed Environments: Algorithms for Mobile Networks and Resource Allocation

    Get PDF
    A distributed system is a collection of computing entities that communicate with each other to solve some problem. Distributed systems impact almost every aspect of daily life (e.g., cellular networks and the Internet); however, it is hard to develop services on top of distributed systems due to the unreliable nature of computing entities and communication. As handheld devices with wireless communication capabilities become increasingly popular, the task of providing services becomes even more challenging since dynamics, such as mobility, may cause the network topology to change frequently. One way to ease this task is to develop collections of information infrastructures which can serve as building blocks to design more complicated services and can be analyzed independently. The first part of the dissertation considers the dining philosophers problem (a generalization of the mutual exclusion problem) in static networks. A solution to the dining philosophers problem can be utilized when there is a need to prevent multiple nodes from accessing some shared resource simultaneously. We present two algorithms that solve the dining philosophers problem. The first algorithm considers an asynchronous message-passing model while the second one considers an asynchronous shared-memory model. Both algorithms are crash fault-tolerant in the sense that a node crash only affects its local neighborhood in the network. We utilize failure detectors (system services that provide some information about crash failures in the system) to achieve such crash fault-tolerance. In addition to crash fault-tolerance, the first algorithm provides fairness in accessing shared resources and the second algorithm tolerates transient failures (unexpected corruptions to the system state). Considering the message-passing model, we also provide a reduction such that given a crash fault-tolerant solution to our dining philosophers problem, we implement the failure detector that we have utilized to solve our dining philosophers problem. This reduction serves as the first step towards identifying the minimum information regarding crash failures that is required to solve the dining philosophers problem at hand. In the second part of this dissertation, we present information infrastructures for mobile ad hoc networks. In particular, we present solutions to the following problems in mobile ad hoc environments: (1) maintaining neighbor knowledge, (2) neighbor detection, and (3) leader election. The solutions to (1) and (3) consider a system with perfectly synchronized clocks while the solution to (2) considers a system with bounded clock drift. Services such as neighbor detection and maintaining neighbor knowledge can serve as a building block for applications that require point-to-point communication. A solution to the leader election problem can be used whenever there is a need for a unique coordinator in the system to perform a special task

    A Timing Assumption and two tt-Resilient Protocols for Implementing an Eventual Leader Service in Asynchronous Shared Memory Systems

    Get PDF
    This paper considers the problem of electing an eventual leader in an asynchronous shared memory system. While this problem has received a lot of attention in message-passing systems, very few solutions have been proposed for shared memory systems. As an eventual leader cannot be elected in a pure asynchronous system prone to process crashes, the paper first proposes to enrich the asynchronous system model with an additional assumption. That assumption (denoted AWB\mathit{AWB}) is particularly weak. It is made up of two complementary parts. More precisely, it requires that, after some time, (1) there is a process whose write accesses to some shared variables be timely, and (2) the timers of (t−f)(t-f) other processes be asymptotically well-behaved (tt denotes the maximal number of processes that may crash, and ff the actual number of process crashes in a run). The {\it asymptotically well-behaved} timer notion is a new notion that generalizes and weakens the traditional notion of timers whose durations are required to monotonically increase when the values they are set to increase (a timer works incorrectly when it expires at arbitrary times, i.e., independently of the value it has been set to). The paper then focuses on the design of tt-resilient AWB\mathit{AWB}-based eventual leader protocols. ``tt-resilient'' means that each protocol can cope with up to tt process crashes (taking t=n−1t=n-1 provides wait-free protocols, i.e., protocols that can cope with any number of process failures). Two protocols are presented. The first enjoys the following noteworthy properties: after some time only the elected leader has to write the shared memory, and all but one shared variables have a bounded domain, be the execution finite or infinite. This protocol is consequently optimal with respect to the number of processes that have to write the shared memory. The second protocol guarantees that all the shared variables have a bounded domain. This is obtained at the following additional price: all the processes are required to forever write the shared memory. A theorem is proved which states that this price has to be paid by any protocol that elects an eventual leader in a bounded shared memory model. This second protocol is consequently optimal with respect to the number of processes that have to write in such a constrained memory model. In a very interesting way, these protocols show an inherent tradeoff relating the number of processes that have to write the shared memory and the bounded/unbounded attribute of that memory
    • 

    corecore