9 research outputs found

    Performance Engineering of a Lightweight Fault Tolerance Framework

    Get PDF
    It is well-known that the Paxos algorithm can be used to build provably correct practical fault tolerant systems. In this thesis, a lightweight consensus framework - Paxos-Based Fault Tolerance (PFT) framework and its practical implementation is presented. It also includes how the system tolerates faults under practical conditions where the replicas might not be strictly homogeneous due to the asynchrony of their deployment environment. A comprehensive performance evaluation study is performed on the PFT framework. The approaches that can optimize the fault tolerance mechanisms under various practical scenarios are also discusse

    Performance Engineering of a Lightweight Fault Tolerance Framework

    Get PDF
    It is well-known that the Paxos algorithm can be used to build provably correct practical fault tolerant systems. In this thesis, a lightweight consensus framework - Paxos-Based Fault Tolerance (PFT) framework and its practical implementation is presented. It also includes how the system tolerates faults under practical conditions where the replicas might not be strictly homogeneous due to the asynchrony of their deployment environment. A comprehensive performance evaluation study is performed on the PFT framework. The approaches that can optimize the fault tolerance mechanisms under various practical scenarios are also discusse

    Fast Genuine Generalized Consensus

    Get PDF
    International audienceConsensus (agreeing on a sequence of commands) is central to the operation and performance of distributed systems. A well-known solution to consensus is Fast Paxos. In a recent paper, Lamport enhances Fast Paxos by leveraging the commutativity of concurrent commands. The new primitive, called Generalized Paxos, reduces the collision rate, and thus the latency of Fast Paxos. However if a collision occurs, Generalized Paxos needs four communication steps to recover, which is slower than Fast Paxos. This paper presents FGGC, a novel consensus algorithm that reduces recovery delay when a collision occurs to one. FGGC tolerates f < n/2 replicas crashes, and during failure-free runs, processes learn commands in two steps if all commands commute, and three steps otherwise; this is optimal. Moreover, as long as no fault occurs, FGGC needs only f + 1 replicas to progress

    Aurora: Leaderless State-Machine Replication with High Throughput

    Get PDF
    State-machine replication (SMR) allows a state machine to be replicated across a set of replicas and handle clients\u27 requests as a single machine. Most existing SMR protocols are leader-based, i.e., requiring a leader to order requests and coordinate the protocol. This design places a disproportionately high load on the leader, inevitably impairing the scalability. If the leader fails, a complex and bug-prone fail-over protocol is needed to switch to a new leader. An adversary can also exploit the fail-over protocol to slow down the protocol. In this paper, we propose a crash-fault tolerant SMR named Aurora, with the following properties: • Leaderless: it does not require a leader, hence completely get rid of the fail-over protocol. • Scalable: it can scale to a large number of replicas. • Robust: it behaves well even under a poor network connection. We provide a full-fledged implementation of Aurora and systematically evaluate its performance. Our benchmark results show that Aurora achieves a peak throughput of around two million TPS, up to 8.7×\times higher than the state-of-the-art leaderless SMR

    A software architecture for consensus based replication

    Get PDF
    Orientador: Luiz Eduardo BuzatoTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Esta tese explora uma das ferramentas fundamentais para construção de sistemas distribuídos: a replicação de componentes de software. Especificamente, procuramos resolver o problema de como simplificar a construção de aplicações replicadas que combinem alto grau de disponibilidade e desempenho. Como ferramenta principal para alcançar o objetivo deste trabalho de pesquisa desenvolvemos Treplica, uma biblioteca de replicação voltada para construção de aplicações distribuídas, porém com semântica de aplicações centralizadas. Treplica apresenta ao programador uma interface simples baseada em uma especificação orientada a objetos de replicação ativa. A conclusão que defendemos nesta tese é que é possível desenvolver um suporte modular e de uso simples para replicação que exibe alto desempenho, baixa latência e que permite recuperação eficiente em caso de falhas. Acreditamos que a arquitetura de software proposta tem aplicabilidade em qualquer sistema distribuído, mas é de especial interesse para sistemas que não são distribuídos pela ausência de uma forma simples, eficiente e confiável de replicá-losAbstract: This thesis explores one of the fundamental tools for the construction of distributed systems: the replication of software components. Specifically, we attempted to solve the problem of simplifying the construction of high-performance and high-availability replicated applications. We have developed Treplica, a replication library, as the main tool to reach this research objective. Treplica allows the construction of distributed applications that behave as centralized applications, presenting the programmer a simple interface based on an object-oriented specification for active replication. The conclusion we reach in this thesis is that it is possible to create a modular and simple to use support for replication, providing high performance, low latency and fast recovery in the presence of failures. We believe our proposed software architecture is applicable to any distributed system, but it is particularly interesting to systems that remain centralized due to the lack of a simple, efficient and reliable replication mechanismDoutoradoSistemas de ComputaçãoDoutor em Ciência da Computaçã

    Deferred-update database replication:theory and algorithms

    Get PDF
    This thesis is about the design of high-performance fault-tolerant computer systems. More specifically, it focuses on how to develop database systems that behave correctly and with good performance even in the event of failures. Both performance and dependability can be improved by means of the same technique, namely replication. If several database replicas are available, performance can be improved by distributing the load among them. Moreover, if one of the replicas cannot be accessed due to failures, users can still rely on the other ones. However, providing the interface of a single database system out of several replicas is not an easy task since one has to ensure they are always consistent with each other. Allowing replicas to diverge would easily break the illusion of having a single high-performance fault-tolerant database system. Although we would like to have replicas as independent of each other as possible for performance and dependability reasons, we must keep them synchronized if we want to provide a consistent interface to users. In this work, we study how we can balance this trade-off to provide good performance and fault-tolerance without compromising consistency. Our basis is a widely used technique for database replication known as the deferred update technique. In this technique, transactions are initially executed in a single replica. Passive transactions, which do not change the state of the database, can commit locally to the replica they execute. Active transactions, which change the database state, must be synchronized with the transactions running on other replicas. This thesis makes four major contributions. First, we introduce an abstract specification that generalizes the deferred update technique. This specification provides a strong model to prove lower bounds on replication algorithms, design new correct-by-construction protocols tailor-made for specific settings, and prove existing protocols correct more easily, in a standard way. Using this model, we show that the problem of termination of active transactions in deferred-update protocols is highly related to the problem of sequence agreement among a set of processes. In this context, we study the problem of implementing latency-optimal fault-tolerant solutions to sequence agreement and present a novel, highly-dynamic, algorithm that can quickly adapt to system changes in order to preserve its optimal latency. Our algorithm is based on a new agreement problem we introduce that seems to be more suitable to solve problems like sequence agreement than previously used abstractions. Our last two contributions are in the context of specific deferred-update algorithms, where we present two new fault-tolerant protocols derived from our general abstraction. The first algorithm uses no extra assumptions about database replicas. Yet, it has very little overhead associated with the termination of active transactions, propagating only strictly necessary information to replicas. Our second protocol uses strong assumptions about the concurrency control mechanism used by database replicas to reduce even more the latency and the burden associated with transaction termination. These algorithms are good examples of how our general abstraction can be extended to create new protocols and prove them correct

    Multicoordinated agreement problems and the log service

    No full text
    Orientador: Edmundo R. M. Madeira, Fernando PedoneTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Problemas de acordo, como Consenso, Terminação Atômica e Difusão Atômica, são abstrações comuns em sistemas distribuídos. Eles ocorrem quando os componentes do sistema precisam concordar em reconfigurações, mudanças de estado ou em linhas de ação em geral. Nesta tese, investigamos estes problemas no contexto do ambiente e aplicações em que serão utilizados. O modelo geral é o assíncrono sujeito a quebras com possível posterior recuperação. Nossa meta é desenvolver protocolos que explorem esta informação contextual para prover maior disponibilidade, e que se mantenham corretos mesmo que algumas das prerrogativas do contexto tornem-se inválidas. Na primeira parte da tese, exploramos a seguinte propriedade: mensagens difundidas em pequenas redes tendem a ser entregues ordenada e confiavelmente. Nós fazemos três contribuições nesta parte da tese. A primeira é a transformação de algoritmos conhecidos para o modelo quebra-e-pára, que utilizam a propriedade de ordenação mencionada, em protocolos práticos. Isto é, protocolos que toleram perda de mensagens e recuperação após a quebra. Nossos protocolos garantem progresso na presença de falhas, contanto que mensagens sejam espontaneamente ordenadas freqüentemente. Na ausência de ordenação expontânea, outras prerrogativas são necessárias para contornar falhas. A segunda contribuição é a generalização de um dos algoritmos citados acima em um modo de execução "multi-coordenado" em um protocolo híbrido de consenso, que usa ou ordenação expontânea ou detecção de falhas para progredir. Em comparação a outros protocolos, o nosso provê maior disponibilidade sem comprometer resiliência. A terceira contribuição é a utilização do modo multi-coordenado para resolver Consenso Generalizado, um problema que generaliza uma série de outros e que, portanto, é de grande interesse prático. Além disso, fizemos diversas considerações sobre aspectos práticos da utilização deste protocolo. Como resultado, nosso protocolo perde desempenho gradualmente no caso de condições desfavoráveis, permite o balanceamento de carga sobre os coordenadores, e acessa a memória estável parcimoniosamente. Na segunda parte da tese, consideramos problemas de acordo no contexto de redes organizadas hierarquicamente. Em específico, nós consideramos uma topologia usada nos data centers de grandes cooporações: grupos de máquinas conectadas internamente por links de baixa latência, mas por links mais lentos entre grupos. Em tais cenários, latência é claramente um fator importante e reconfigurações, onerosas aos protocolos, devem ser evitadas tanto quanto possível. Nossa contribuição neste tópico está em evitar reconfigurações e melhorar a disponibilidade de um protocolo de acordo que é rápido a despeito de colisões. Isto é, um protocolo que consegue chegar a uma decisão em dois passos inter-grupos mesmo quando várias propostas são feitas concorrentementes. Além do uso da técnica de multicoordenação, nós usamos primitivas de multicast e consenso para conter algumas reconfigurações dentro dos grupos, onde seus custos são menores. Na última parte da tese nós estudamos o problema de terminação de transações distribuídas. O problema consiste em garantir que os vários participantes da transação concordem em aplicar ou cancelar de forma consistente as suas operações no contexto da transação. Além disso, é necessário garantir a durabilidade das alterações feitas por transações terminadas com sucesso. Nossa contribuição neste tópico é um serviço de log que abstrai e desassocia a terminação de transações dos processos que executam tais transações. O serviço funciona como uma caixa preta e permite que resource managers lentos ou falhos sejam reiniciados em servidores diferentes, sem dependências na memória estável do servidor em que executava anteriormente. Nós apresentamos e avaliamos experimentalmente duas implementações do serviço.Abstract: Agreement problems are a common abstraction in distributed systems. They appear when the components of the system must concur on reconfigurations, changes of state, or in lines of action in general. Examples of agreement problems are Consensus, Atomic Commitment, and Atomic Broadcast. In this thesis we investigate these abstractions in the context of the environment in which they will run and the applications that they will serve; in general, we consider the asynchronous crash-recovery model. The goal is to devise protocols that explore the contextual information to deliver improved availability. The correctness of our protocols holds even when the extra assumptions do not. In the first part of this thesis we explore the following property: messages broadcast in small networks tend to be delivered in order and reliably. We make three contributions in this part. The first contribution is to turn known Consensus algorithms that harness this ordering property to reach agreement in the crash-stop model into practical protocols. That is, protocols that tolerate message losses and recovery after crashes, efficiently. Our protocols ensure progress even in the presence of failures, if spontaneous ordering holds frequently. In the absence of spontaneous ordering, some other assumption is required to cope with failures. The second contribution of this thesis is to generalize one of our crash-recovery consensus protocols as a "multicoordinated" mode of a hybrid Consensus protocol, that may use spontaneous ordering or failure detection to progress. Compared to other protocols, ours provide improved availability with no price in resilience. The third contribution is to employ this new mode to solve Generalized Consensus, a problem that generalizes a series of other agreement problems and, hence, is of much practical interest. Moreover, we considered several aspects of solving this problem in practice, which had not been considered before. As a result, our Generalized Consensus protocol features graceful degradation, load balancing, and is parsimonious in accessing stable storage. In the second part of this thesis we have considered agreement problems in wide area networks organized hierarchically. More specifically, we considered a topology that is commonplace in the data centers of large corporations: groups of nodes, with large-bandwidth low-latency links connecting the nodes in the same group, and slow and limited links connecting nodes in different groups. In such environments, latency is clearly a major concern and reconfiguration procedures that render the agreement protocol momentarily unavailable must be avoided as much as possible. Our contribution here is in avoiding reconfigurations and improving the availability of a collision fast agreement protocol. That is, a protocol that can reach agreement in two intergroup communication steps, irrespectively to concurrent proposals. Besides the use of a multicoordinated approach, we employed multicast primitives and consensus to restrict some reconfigurations to within groups, where they are less expensive. In the last part of this thesis we study the problem of terminating distributed transactions. The problem consists of enforcing agreement among the parties on whether to commit or rollback the transaction and ensuring the durability of committed transactions. Our contribution in this topic is an abstract log service that detaches the termination problem from the processes actually performing the transactions. The service works as a black box and abstracts its implementation details from the application utilizing it. Moreover, it allows slow and failed resource managers be re-started on different hosts without relying on the stable storage of the previous host. We provide two implementations of the service, which we evaluated experimentally.DoutoradoDoutor em Ciência da Computaçã

    Multicoordinated agreement protocols and the log service

    Get PDF
    Agreement problems are a common abstraction in distributed systems. They appear when the components of the system must concur on reconfigurations, changes of state, or in lines of action in general. Examples of agreement problems are Consensus, Atomic Commitment, and Atomic Broadcast. In this thesis we investigate these abstractions in the context of the environment in which they will run and the applications that they will serve; in general, we consider the asynchronous crash-recovery model. The goal is to devise protocols that explore the contextual information to deliver improved availability. The correctness of our protocols holds even when the extra assumptions do not. In the first part of this thesis we explore the following property: messages broadcast in small networks tend to be delivered in order and reliably. We make three contributions in this part. The first contribution is to turn known Consensus algorithms that harness this ordering property to reach agreement in the crash-stop model into practical protocols. That is, protocols that tolerate message losses and recovery after crashes, efficiently. Our protocols ensure progress even in the presence of failures, if spontaneous ordering holds frequently. In the absence of spontaneous ordering, some other assumption is required to cope with failures. The second contribution of this thesis is to generalize one of our crash-recovery consensus protocols as a ``multicoordinated'' mode of a hybrid Consensus protocol, that may use spontaneous ordering or failure detection to progress. Compared to other protocols, ours provide improved availability with no price in resilience. The third contribution is to employ this new mode to solve Generalized Consensus, a problem that generalizes a series of other agreement problems and, hence, is of much practical interest. Moreover, we considered several aspects of solving this problem in practice, which had not been considered before. As a result, our Generalized Consensus protocol features graceful degradation, load balancing, and is parsimonious in accessing stable storage. In the second part of this thesis we have considered agreement problems in wide area networks organized hierarchically. More specifically, we considered a topology that is commonplace in the data centers of large corporations: groups of nodes, with large-bandwidth low-latency links connecting the nodes in the same group, and slow and limited links connecting nodes in different groups. In such environments, latency is clearly a major concern and reconfiguration procedures that render the agreement protocol momentarily unavailable must be avoided as much as possible. Our contribution here is in avoiding reconfigurations and improving the availability of a collision fast agreement protocol. That is, a protocol that can reach agreement in two intergroup communication steps, irrespectively to concurrent proposals. Besides the use of a multicoordinated approach, we employed multicast primitives and consensus to restrict some reconfigurations to within groups, where they are less expensive. In the last part of this thesis we study the problem of terminating distributed transactions. The problem consists of enforcing agreement among the parties on whether to commit or rollback the transaction and ensuring the durability of committed transactions. Our contribution in this topic is an abstract log service that detaches the termination problem from the processes actually performing the transactions. The service works as a black box and abstracts its implementation details from the application utilizing it. Moreover, it allows slow and failed resource managers be re-started on different hosts without relying on the stable storage of the previous host. We provide two implementations of the service, which we evaluated experimentally
    corecore