Fast causal multicast

Authors: Birman Kenneth, Schiper Andre, Stephenson Pat
Publication venue
Publication date

A new protocol is presented that efficiently implements a reliable, causally ordered multicast primitive and is easily extended into a totally ordered one. Intended for use in the ISIS toolkit, it offers a way to bypass the most costly aspects of ISIS while benefiting from virtual synchrony. The facility scales with bounded overhead. Measured speedups of more than an order of magnitude were obtained when the protocol was implemented within ISIS. One conclusion is that systems such as ISIS can achieve performance competitive with the best existing multicast facilities, a finding that contradicts the widespread concern that fault tolerance may be unacceptably costly.

NASA Technical Reports Server

Replication for send-deterministic MPI HPC applications

Authors: Lefray Arnaud, Ropars Thomas, Schiper André
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2013

Replication has recently gained attention in the context of fault tolerance for large-scale MPI HPC applications. Existing implementations try to cover all MPI codes and to be independent of the underlying library. In this paper, we evaluate the advantages of adopting a different approach. First, we try to take advantage of a communication property common to many MPI HPC applications, namely send-determinism. Second, we choose to implement replication inside the MPI library. The main advantage of our approach is simplicity. While being only a small patch to the Open MPI library, our solution, called SDR-MPI, supports most main features of the MPI standard, including all collectives and group operations. SDR-MPI additionally achieves good performance: experiments run with HPC benchmarks and applications show that its overhead remains below 5%.

HAL-ENS-LYON

Infoscience - École polytechnique fédérale de Lausanne

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot

Practical impact of group communication theory

Author: Schiper A.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 20/05/2005

Group communication is an important topic in fault-tolerant distributed applications. The paper summarizes the main contributions of practical importance that have shaped our current understanding of group communication. These contributions are classified into "abstractions" and "specifications", "paradigms", "system models", "algorithms", and "theoretical results". Some open issues are discussed at the end of the paper.

Infoscience - École polytechnique fédérale de Lausanne

Failure Detection vs. Group Membership in Fault-Tolerant Distributed Systems: Hidden Trade-Offs

Author: Schiper A.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 20/05/2005

Failure detection and group membership are two important components of fault-tolerant distributed systems. Understanding their role is essential when developing efficient solutions, not only in failure-free runs, but also in runs in which processes do crash. While group membership provides consistent information about the status of processes in the system, failure detectors provide inconsistent information. This paper discusses the trade-offs related to the use of these two components, and clarifies their roles using three examples. The first example shows a case where group membership may favourably be replaced by a failure detection mechanism. The second example illustrates a case where group membership is mandatory. Finally, the third example shows a case where neither group membership nor failure detectors are needed (they may be replaced by weak ordering oracles).

Infoscience - École polytechnique fédérale de Lausanne

Dependable Systems

Author: Schiper André
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 15/02/2012

Improving the dependability of computer systems is a critical and essential task. In this context, the paper surveys techniques for achieving fault tolerance in distributed systems through replication. The main replication techniques are explained first. Then group communication is introduced as the communication infrastructure that allows the different replication techniques to be implemented. Finally, the difficulty of implementing group communication is discussed, and the most important algorithms are presented.

Infoscience - École polytechnique fédérale de Lausanne

Group Communication: From Practice to Theory

Author: Schiper André
Publication venue
Publication date: 26/05/2008

Infoscience - École polytechnique fédérale de Lausanne

Model Checking of Consensus Algorithms

Authors: Schiper André, Tsuchiya Tatsuhiro
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/05/2008

Infoscience - École polytechnique fédérale de Lausanne

Crossref