Search CORE

28,759 research outputs found

Distributed protocols as behaviours in Erlang

Author: Demicoli Darren
Francalanza Adrian
Publication venue: University of Malta. Faculty of Information and Communication Technology
Publication date: 01/01/2010
Field of study

We investigate the implementation of standard algorithms for three classes of Distributed Agreement problems in Erlang, an industry-strength language for programming fault tolerant distributed systems. We develop a framework to bridge the gap between the assumptions of these standard algorithm and the network abstraction provided by Erlang, and structure our implementations as reusable behaviours within this framework.peer-reviewe

OAR@UM

Which Broadcast Abstraction Captures k-Set Agreement?

Author: Imbs Damien
Perrin Matthieu
Raynal Michel
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st International Symposium on Distributed Computing (DISC 2017)
Publication date: 01/01/2017
Field of study

It is well-known that consensus (one-set agreement) and total order broadcast are equivalent in asynchronous systems prone to process crash failures. Considering wait-free systems, this article addresses and answers the following question: which is the communication abstraction that "captures" k-set agreement? To this end, it introduces a new broadcast communication abstraction, called k-BO-Broadcast, which restricts the disagreement on the local deliveries of the messages that have been broadcast (1-BO-Broadcast boils down to total order broadcast). Hence, in this context, k=1 is not a special number, but only the first integer in an increasing integer sequence. This establishes a new "correspondence" between distributed agreement problems and communication abstractions, which enriches our understanding of the relations linking fundamental issues of fault-tolerant distributed computing

arXiv.org e-Print Archive

HAL-CentraleSupelec

HAL AMU

INRIA a CCSD electronic archive server

HAL Descartes

Dagstuhl Research Online Publication Server

Hal-Diderot

HAL-Rennes 1

Comparison of Failure Detectors and Group Membership: Performance Study of Two Atomic Broadcast Algorithms

Author: Schiper André
Shnayderman Ilya
Urbán Péter
Publication venue
Publication date: 13/07/2005
Field of study

Protocols that solve agreement problems are essential building blocks for fault tolerant distributed systems. While many protocols have been published, little has been done to analyze their performance, especially the performance of their fault tolerance mechanisms. In this paper, we present a performance evaluation methodology that can be generalized to analyze many kinds of fault-tolerant algorithms. We use the methodology to compare two atomic broadcast algorithms with different fault tolerance mechanisms: unreliable failure detectors and group membership. We evaluated the steady state latency in (1) runs with no crashes and no suspicions, (2) runs with crashes and (3) runs with no crashes in which correct processes are wrongly suspected to have crashed, as well as (4) the transient latency after a crash. We found that the two algorithms have the same performance in Scenario 1, and that the group membership based algorithm has an advantage in terms of performance and resiliency in Scenario 2, whereas the failure detector based algorithm offers better performance in the other scenarios. We discuss the implications of our results to the design of fault tolerant distributed systems

Infoscience - École polytechnique fédérale de Lausanne

Comparison of Failure Detectors and Group Membership: Performance Study of Two Atomic Broadcast Algorithms (extended version)

Author: Schiper André
Shnayderman Ilya
Urbán Péter
Publication venue
Publication date: 13/07/2005
Field of study

Protocols that solve agreement problems are essential building blocks for fault tolerant distributed systems. While many protocols have been published, little has been done to analyze their performance, especially the performance of their fault tolerance mechanisms. In this paper, we present a performance evaluation methodology that can be generalized to analyze many kinds of fault-tolerant algorithms. We use the methodology to compare two atomic broadcast algorithms with different fault tolerance mechanisms: unreliable failure detectors and group membership. We evaluated the steady state latency in (1) runs with neither crashes nor suspicions, (2) runs with crashes and (3) runs with no crashes in which correct processes are wrongly suspected to have crashed, as well as (4) the transient latency after a crash. We found that the two algorithms have the same performance in Scenario 1, and that the group membership based algorithm has an advantage in terms of performance and resiliency in Scenario 2, whereas the failure detector based algorithm offers better performance in the other scenarios. We discuss the implications of our results to the design of fault tolerant distributed systems

Infoscience - École polytechnique fédérale de Lausanne

Agreement in epidemic information dissemination

Author: Ayiad Mosab
Di Fatta Giuseppe
Katti Amogh
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Consensus is one of the fundamental problems in multi-agent systems and distributed computing, in which agents or processing nodes are required to reach global agreement on some data value, decision, action, or synchronisation. In the absence of centralised coordination, achieving global consensus is challenging especially in dynamic and large-scale distributed systems with faulty processes. This paper presents a fully decentralised phase transition protocol to achieve global consensus on the convergence of an underlying information dissemination process. The proposed approach is based on Epidemic protocols, which are a randomised communication and computation paradigm and provide excellent scalability and fault-tolerant properties. The experimental analysis is based on simulations of a large-scale information dissemination process and the results show that global agreement can be achieved without deterministic and global communication patterns, such as those based on centralised coordination

Central Archive at the University of Reading

Failure Detectors: implementation issues and impact on consensus performance

Author: Defago Xavier
Schiper Andre
Sergent Nicole
Publication venue
Publication date: 13/07/2005
Field of study

Due to their nature, distributed systems are vulnerable to failures of some of their parts. Conversely, distribution also provides a way to increase the fault tolerance of the overall system. However, achieving fault tolerance is not a simple problem and requires complex techniques. An agreement problem known as the problem of consensus is at the heart of most problems encountered during the design of a fault tolerant system. This problem is however not solvable in the asynchronous system model, unless the model is augmented with adequate failure detectors. The resulting system model is a time-free model since all timing issues are abstracted by the characteristics of the failure detectors. It is sometimes claimed that time-based system models are more realistic than time-free models for solving distributed agreement problems. The goal of this paper is to show that solving consensus in the asynchronous system model augmented with failure detectors does not prevent from considering timing issues. We consider the consensus algorithm with various implementations of failure detectors, and we analyse their impact on the termination time of the consensus algorithm. This study shows that the design of fault-tolerant distributed algorithms in the asynchronous system model augmented with failure detectors is orthogonal to the issue of implementing the actual failure detectors. This nicely decouples logical issues (proof of safety and liveness of an algorithm) from engineering issues (e.g., performance and timing constraints)

Infoscience - École polytechnique fédérale de Lausanne

Fault-Tolerant Distributed Services in Message-Passing Systems

Author: Kumar Saptaparni
Publication venue
Publication date: 20/11/2019
Field of study

Distributed systems ranging from small local area networks to large wide area networks like the Internet composed of static and/or mobile users have become increasingly popular. A desirable property for any distributed service is fault-tolerance, which means the service remains uninterrupted even if some components in the network fail. This dissertation considers weak distributed models to find either algorithms to solve certain problems or impossibility proofs to show that a problem is unsolvable. These are the main contributions of this dissertation: • Failure detectors are used as a service to solve consensus (agreement among nodes) which is otherwise impossible in failure-prone asynchronous systems. We find an algorithm for crash-failure detection that uses bounded size messages in an arbitrary, partitionable network composed of badly- behaved channels that can lose and reorder messages. • Registers are a fundamental building block for shared memory emulations on top of message passing systems. The problem has been extensively studied in static systems. However, register emulation in dynamic systems with faulty nodes is still quite hard and there are impossibility proofs that point out scenarios where change in the system composition due to nodes entering and leaving (also called churn) makes the problem unsolvable. We propose the first emulation of a crash-fault tolerant register in a system with continuous churn where consensus is unsolvable, the size of the system can grow without bound and at most a constant fraction of the number of nodes in the system can fail by crashing. We prove a lower bound that states that fault-tolerance for dynamic systems with churn is inherently lower than in static systems. • We then extend the results in the crash-fault tolerant case to a dynamic system with continuous churn and nodes that can be Byzantine faulty. It is the first emulation of an atomic register in a system that can withstand nodes continually entering and leaving, imposes no upper bound on the system size and can tolerate Byzantine nodes. However, the number of Byzantine faulty nodes that can be tolerated is upper bounded by a constant number. Although the algorithm requires that there be a constant known upper bound on the number of Byzantine nodes, this restriction is unavoidable, as we show that it is impossible to emulate an atomic register if the system size and maximum number of servers that can be Byzantine in the system is unknown

Replica determinism and flexible scheduling in hard real-time dependable systems

Author: Barrett P
Burns A
Poledna S
Wellings A
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2000
Field of study

Fault-tolerant real-time systems are typically based on active replication where replicated entities are required to deliver their outputs in an identical order within a given time interval. Distributed scheduling of replicated tasks, however, violates this requirement if on-line scheduling, preemptive scheduling, or scheduling of dissimilar replicated task sets is employed. This problem of inconsistent task outputs has been solved previously by coordinating the decisions of the local schedulers such that replicated tasks are executed in an identical order. Global coordination results either in an extremely high communication effort to agree on each schedule decision or in an overly restrictive execution model where on-line scheduling, arbitrary preemptions, and nonidentically replicated task sets are not allowed. To overcome these restrictions, a new method, called timed messages, is introduced. Timed messages guarantee deterministic operation by presenting consistent message versions to the replicated tasks. This approach is based on simulated common knowledge and a sparse time base. Timed messages are very effective since they neither require communication between the local scheduler nor do they restrict usage of on-line flexible scheduling, preemptions and nonidentically replicated task sets

CiteSeerX

Crossref

White Rose Research Online