
9,802 research outputs found

Unreliable Failure Detectors via Operational Semantics

Authors: Fuzzati Rachele; Nestmann Uwe
Publication venue: Springer Science and Business Media LLC
Publication date: 23/11/2005

The concept of unreliable failure detectors for reliable distributed systems was introduced by Chandra and Toueg as a fine-grained means to add weak forms of synchrony to asynchronous systems. Various kinds of such failure detectors have been identified, each being the weakest that solves some specific distributed programming problem. In this paper, we provide a fresh look at failure detectors from the point of view of programming languages, more precisely using the formal tool of operational semantics. Inspired by this, we propose a new failure detector model that we consider easier to understand, easier to work with, and more natural. Using operational semantics, we prove formally that representations of failure detectors in the new model are equivalent to their original representations within the model used by Chandra and Toueg.
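As a rough illustration of what an unreliable failure detector does, the following minimal sketch shows a heartbeat-based detector that may wrongly suspect a slow process and later revise its opinion. It is not the operational-semantics model proposed in the paper; the class name `HeartbeatDetector`, its parameters, and the additive timeout back-off are assumptions made for this example, and time is passed in explicitly rather than read from a clock.

```python
class HeartbeatDetector:
    """Suspects peers whose heartbeat is overdue (hypothetical sketch)."""

    def __init__(self, peers, initial_timeout=1.0, backoff=1.0):
        # Per-peer timeout, so one slow peer does not slow detection of others.
        self.timeout = {p: initial_timeout for p in peers}
        self.last_seen = {p: 0.0 for p in peers}
        self.suspected = set()
        self.backoff = backoff

    def on_heartbeat(self, peer, now):
        """Record a heartbeat; undo a suspicion that turned out to be wrong."""
        self.last_seen[peer] = now
        if peer in self.suspected:
            # False suspicion: trust the peer again and wait longer next time.
            self.suspected.discard(peer)
            self.timeout[peer] += self.backoff

    def check(self, now):
        """Return the set of peers currently suspected at time `now`."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)
```

The detector is "unreliable" in the sense that `check` can report a live but slow peer; growing the timeout after each mistake is one common way to make such mistakes eventually stop once message delays become bounded.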

Infoscience - École polytechnique fédérale de Lausanne

Fault-Tolerant Distributed Systems: a Modular Approach to the Non-Blocking Atomic Commitment Problem

Author: Raynal Michel
Publication venue: HAL CCSD
Publication date: 01/01/1996

Agreement problems allow a set of processes to agree on a common output value. These problems are of primary importance in distributed systems and are difficult to solve in the presence of failures. This paper considers one of these problems whose practical interest is well known, namely the Non-Blocking Atomic Commitment Problem. First, a generic protocol solving this problem is given, and then instantiations of its generic statements are provided for both synchronous and asynchronous distributed systems. These instantiations use timeout mechanisms, reliable multicast primitives, and unreliable failure detectors as basic components. Incidentally, this paper can also be considered an introduction to state-of-the-art concepts and mechanisms of distributed fault tolerance.

INRIA a CCSD electronic archive server

Enhanced Failure Detection Mechanism in MapReduce

Authors: Antoniu Gabriel; Memishi Bunjamin; Pérez Hernández María de los Santos
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2012

The popularity of the MapReduce programming model has increased interest in the research community in improving it. Among other directions, fault tolerance, and specifically failure detection, appears to be crucial but has not yet reached a satisfactory level. Motivated by this, I decided to devote my main research during this period to building a prototype system architecture for the MapReduce framework with a new failure detection service, comprising both an analytical (theoretical) part and an implementation part. I am confident that this work should lead the way for further contributions to failure detection in NoSQL application frameworks and in cloud storage systems in general.

HAL-CentraleSupelec

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

INRIA a CCSD electronic archive server

Hal-Diderot

Archivo Digital UPM

HAL-Rennes 1

Lifeguard: Local Health Awareness for More Accurate Failure Detection

Authors: Currey Jon; Dadgar Armon; Phillips James
Publication date: 03/04/2018

SWIM is a peer-to-peer group membership protocol with attractive scaling and robustness properties. However, slow message processing can cause SWIM to mark healthy members as failed (so-called false positive failure detection), despite the inclusion of a mechanism to avoid this. We identify the properties of SWIM that lead to the problem and propose Lifeguard, a set of extensions to SWIM that consider that the local failure detector module may be at fault, via the concept of local health. We evaluate this approach in a precisely controlled environment and validate it in a real-world scenario, showing that it drastically reduces the rate of false positives. The false positive rate and the detection time for true failures can be reduced simultaneously, compared to the baseline levels of SWIM.
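The idea that a node should doubt itself before blaming a peer can be sketched loosely as follows. This is only an illustration of the local health concept described in the abstract, not the algorithm from the paper: the class `LocalHealth`, the saturating counter, the default `max_score`, and the linear timeout scaling are all assumptions chosen for the example.

```python
class LocalHealth:
    """Saturating counter: 0 means healthy, higher means the local node looks slow."""

    def __init__(self, max_score=8):
        # max_score is an illustrative default (assumption), not a cited value.
        self.max_score = max_score
        self.score = 0

    def probe_succeeded(self):
        """A timely reply is evidence that the local node keeps up."""
        self.score = max(0, self.score - 1)

    def probe_failed(self):
        """A missed reply may mean the local node itself is processing slowly."""
        self.score = min(self.max_score, self.score + 1)

    def scaled_timeout(self, base_timeout):
        """Wait longer before suspecting a peer when local health is poor."""
        return base_timeout * (1 + self.score)


health = LocalHealth()
health.probe_failed()
print(health.scaled_timeout(0.5))  # timeout grows after a failed probe
```

Under these assumptions, a node that has recently missed replies gives its peers more time before marking them as failed, which is one way slow local message processing could be kept from producing false positives.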

arXiv.org e-Print Archive

Crossref