Search CORE

1,123 research outputs found

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Author: Treaster Michael
Publication venue
Publication date: 31/12/2004
Field of study

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.Comment: 11 page

arXiv.org e-Print Archive

CiteSeerX

Hosting Byzantine Fault Tolerant Services on a Chord Ring

Author: Dearle Alan
Kirby Graham
Norcross Stuart
Publication venue
Publication date: 17/06/2010
Field of study

In this paper we demonstrate how stateful Byzantine Fault Tolerant services may be hosted on a Chord ring. The strategy presented is fourfold: firstly a replication scheme that dissociates the maintenance of replicated service state from ring recovery is developed. Secondly, clients of the ring based services are made replication aware. Thirdly, a consensus protocol is introduced that supports the serialization of updates. Finally Byzantine fault tolerant replication protocols are developed that ensure the integrity of service data hosted on the ring.Comment: Submitted to DSN 2007 Workshop on Architecting Dependable System

arXiv.org e-Print Archive

St Andrews Research Repository

LIPIcs

Author: Dragoi Cezara
Henzinger Thomas A
Zufferey Damien
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2015
Field of study

Fault-tolerant distributed algorithms play an important role in many critical/high-availability applications. These algorithms are notoriously difficult to implement correctly, due to asynchronous communication and the occurrence of faults, such as the network dropping messages or computers crashing. Nonetheless there is surprisingly little language and verification support to build distributed systems based on fault-tolerant algorithms. In this paper, we present some of the challenges that a designer has to overcome to implement a fault-tolerant distributed system. Then we review different models that have been proposed to reason about distributed algorithms and sketch how such a model can form the basis for a domain-specific programming language. Adopting a high-level programming model can simplify the programmer's life and make the code amenable to automated verification, while still compiling to efficiently executable code. We conclude by summarizing the current status of an ongoing language design and implementation project that is based on this idea

INRIA a CCSD electronic archive server

Dagstuhl Research Online Publication Server

IST Austria: PubRep (Institute of Science and Technology)

Hal-Diderot

Diverse intrusion-tolerant database replication

Author: Ferreira Paulo Jorge Botelho
Publication venue
Publication date: 01/01/2012
Field of study

Tese de mestrado em Segurança Informática, apresentada à Universidade de Lisboa, através da Faculdade de Ciências, 2012A combinação da replicação de bases de dados com mecanismos de tolerância a falhas bizantinas ainda é um campo de pesquisa recente com projetos a surgirem nestes últimos anos. No entanto, a maioria dos protótipos desenvolvidos ou se focam em problemas muito específicos, ou são baseados em suposições que são muito difíceis de garantir numa situação do mundo real, como por exemplo ter um componente confiável. Nesta tese apresentamos DivDB, um sistema de replicação de bases de dados diverso e tolerante a intrusões. O sistema está desenhado para ser incorporado dentro de um driver JDBC, o qual irá abstrair o utilizador de qualquer complexidade adicional dos mecanismos de tolerância a falhas bizantinas. O DivDB baseia-se na combinação de máquinas de estados replicadas com um algoritmo de processamento de transações, a fim de melhorar o seu desempenho. Para além disso, no DivDB é possível ligar cada réplica a um sistema de gestão de base de dados diferente, proporcionando assim diversidade ao sistema. Propusemos, resolvemos e implementamos três problemas em aberto, existentes na conceção de um sistema de gestão de base de dados replicado: autenticação, processamento de transações e transferência de estado. Estas características torna o DivDB exclusivo, pois é o único sistema que compreende essas três funcionalidades implementadas num sistema de base de dados replicado. A nossa implementação é suficientemente robusta para funcionar de forma segura num simples sistema de processamento de transações online. Para testar isso, utilizou-se o TPC-C, uma ferramenta de benchmarking que simula esse tipo de ambientes.The combination of database replication with Byzantine fault tolerance mechanism is a recent field of research with projects appearing in the last few years. However most of the prototypes produced are either focused on very specific problems or are based on assumptions that are very hard to accomplish in a real world scenario (e.g., trusted component). In this thesis we present DivDB, a Diverse Intrusion-Tolerant Database Replication system. It is designed to be incorporated inside a JDBC driver so that it abstracts the user from any added complexity from Byzantine Fault Tolerance mechanism. DivDB is based in State Machine Replication combined with a transaction handling algorithm in order to enhance its performance. DivDB is also able to have different database systems connected at each replica, enabling to achieve diversity. We proposed, solved and implemented three open problems in the design of a replicated database system: authentication, transaction handling and state-transfer. This makes DivDB unique since it is the only system that comprises all these three features in a single database replication system. Our implementation is robust enough to operate reliably in a simple Online Transaction Processing system. To test that, we used TPC-C, a benchmark tool that simulates that kind of environments

Universidade de Lisboa: Repositório.UL