
    Incremental Consistency Guarantees for Replicated Objects

    Programming with replicated objects is difficult. Developers must face the fundamental trade-off between consistency and performance head-on, while struggling with the complexity of distributed storage stacks. We introduce Correctables, a novel abstraction that hides most of this complexity, allowing developers to focus on the task of balancing consistency and performance. To aid developers with this task, Correctables provide incremental consistency guarantees, which capture successive refinements of the result of an ongoing operation on a replicated object. In short, applications receive both a preliminary---fast, possibly inconsistent---result and a final---consistent---result that arrives later. We show how to leverage incremental consistency guarantees by speculating on preliminary values, trading throughput and bandwidth for improved latency. We experiment with two popular storage systems (Cassandra and ZooKeeper) and three applications: a Twissandra-based microblogging service, an ad serving system, and a ticket selling system. Our evaluation on the Amazon EC2 platform with YCSB workloads A, B, and C shows that we can reduce the latency of strongly consistent operations by up to 40% (from 100ms to 60ms) at little cost (10% bandwidth increase, 6% throughput drop) in the ad system. Even if the preliminary result is frequently inconsistent (25% of accesses), incremental consistency incurs a bandwidth overhead of only 27%. Comment: 16 total pages, 12 figures. OSDI'16 (to appear).
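    The abstraction described above can be sketched minimally: an operation delivers a fast preliminary value first and a final consistent value later, and the application speculates on the preliminary one. This is a hypothetical illustration, not the paper's API; the class name `Correctable` and the helper functions are assumptions.

```python
import threading

class Correctable:
    """Hypothetical sketch of an incremental-consistency handle: an ongoing
    read delivers a fast preliminary value first, then a final one."""

    def __init__(self):
        self._final_arrived = threading.Event()
        self.preliminary = None
        self.final = None

    def deliver(self, value, is_final):
        # Called by the storage layer for each refinement of the result.
        if is_final:
            self.final = value
            self._final_arrived.set()
        else:
            self.preliminary = value

    def speculate_then_confirm(self, work):
        # Start work on the preliminary value; keep the result only if the
        # final value confirms the speculation, otherwise redo the work.
        result = work(self.preliminary)
        self._final_arrived.wait()
        if self.final == self.preliminary:
            return result           # speculation paid off: latency hidden
        return work(self.final)     # mis-speculation: fall back

def weakly_consistent_read():    # stands in for a fast, possibly stale read
    return "v1"

def strongly_consistent_read():  # stands in for a slower quorum read
    return "v1"

c = Correctable()
c.deliver(weakly_consistent_read(), is_final=False)
c.deliver(strongly_consistent_read(), is_final=True)
answer = c.speculate_then_confirm(lambda v: v.upper())
print(answer)  # "V1": both reads agreed, so the speculative work is kept
```

    In the mis-speculation case the handle simply repeats the work on the final value, which models the paper's trade: extra throughput and bandwidth spent in exchange for lower latency on the common path.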

    Optimistic Parallel State-Machine Replication

    State-machine replication, a fundamental approach to fault tolerance, requires replicas to execute commands deterministically, which usually results in sequential execution of commands. Sequential execution limits performance and underuses servers, which are increasingly parallel (i.e., multicore). To narrow the gap between state-machine replication requirements and the characteristics of modern servers, researchers have recently come up with alternative execution models. This paper surveys existing approaches to parallel state-machine replication and proposes a novel optimistic protocol that inherits the scalable features of previous techniques. Using a replicated B+-tree service, we demonstrate that our protocol outperforms the most efficient existing techniques by a factor of 2.4.
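    The core idea behind parallel state-machine replication can be illustrated with a conservative (non-optimistic) variant: commands on disjoint keys are independent, so a replica may execute them on separate threads while keeping same-key commands in the agreed delivery order. This is an assumption-laden sketch of the general technique, not the paper's optimistic protocol.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def execute_batch(state, delivered):
    # `delivered` is the total order produced by consensus. Determinism is
    # preserved because the final state depends only on per-key order,
    # which every replica derives identically from the delivery order.
    per_key = defaultdict(list)
    for cmd in delivered:
        per_key[cmd["key"]].append(cmd)

    def apply_key(key):
        for cmd in per_key[key]:      # same-key commands stay sequential
            if cmd["op"] == "put":
                state[key] = cmd["value"]
            elif cmd["op"] == "add":
                state[key] = state.get(key, 0) + cmd["value"]

    with ThreadPoolExecutor() as pool:  # disjoint keys run in parallel
        list(pool.map(apply_key, per_key))
    return state

state = execute_batch({}, [
    {"op": "put", "key": "a", "value": 1},
    {"op": "add", "key": "b", "value": 5},
    {"op": "add", "key": "a", "value": 2},
])
print(state)  # {'a': 3, 'b': 5}
```

    An optimistic protocol, by contrast, would execute commands in parallel before their dependencies are fully known and roll back on conflicts; the conservative grouping above shows only where the parallelism comes from.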

    Distributed replicated macro-components

    Dissertation for the degree of Master in Informatics Engineering. In recent years, several approaches have been proposed for improving application performance on multi-core machines. However, exploiting the power of multi-core processors remains complex for most programmers. A Macro-component is an abstraction that tackles this problem by allowing programs to exploit the power of multi-core machines without requiring changes to the programs themselves. A Macro-component encapsulates several diverse implementations of the same specification. This makes it possible to obtain the best performance for each operation and/or to distribute load among replicas, while keeping contention and synchronization overhead to a minimum. In real-world applications, relying on only one server to provide a service leads to limited fault-tolerance and scalability. To address this problem, it is common to replicate services across multiple machines. This work addresses the problem of supporting such a replication solution while exploiting the power of multi-core machines. To this end, we propose to support the replication of Macro-components in a cluster of machines. In this dissertation we present the design of a middleware solution for achieving this goal. Using the implemented replication middleware, we have successfully deployed a replicated Macro-component of in-memory databases, which are known to have scalability problems on multi-core machines. The proposed solution combines multi-master replication across nodes with primary-secondary replication within a node, where several instances of the database run on a single machine. This approach deals with the lack of scalability of databases on multi-core systems while minimizing communication costs, which ultimately results in an overall improvement of the service. Results show that the proposed solution scales as the number of nodes and clients increases, and that it takes advantage of multi-core architectures. RepComp project (PTDC/EIAEIA/108963/2008).
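    The Macro-component idea of encapsulating several diverse implementations of one specification can be sketched as follows. The class and method names are hypothetical, and the routing policy (a fixed per-operation preference) is a deliberate simplification of whatever the dissertation's middleware actually uses.

```python
class MacroComponent:
    """Hypothetical sketch: one specification, several inner replicas.
    Updates go to every replica so they stay in sync; reads are routed to
    whichever implementation performs best for that operation."""

    def __init__(self, implementations, read_preference):
        self.impls = implementations
        self.read_preference = read_preference  # op name -> replica index

    def update(self, op, *args):
        for impl in self.impls:                 # keep all replicas converged
            getattr(impl, op)(*args)

    def read(self, op, *args):
        best = self.read_preference.get(op, 0)  # pick the fastest replica
        return getattr(self.impls[best], op)(*args)

# Two diverse implementations of the same "set" specification.
class ListSet:
    def __init__(self): self.items = []
    def add(self, x):
        if x not in self.items: self.items.append(x)
    def contains(self, x): return x in self.items

class HashSet:
    def __init__(self): self.items = set()
    def add(self, x): self.items.add(x)
    def contains(self, x): return x in self.items

mc = MacroComponent([ListSet(), HashSet()],
                    read_preference={"contains": 1})  # hashing wins lookups
mc.update("add", 7)
print(mc.read("contains", 7))  # True, served by the HashSet replica
```

    The same shape scales up to the dissertation's setting: replace the inner sets with database engines, the `update` broadcast with multi-master replication across nodes, and the read routing with primary-secondary selection within a node.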

    High-performance state-machine replication

    Replication, a common approach to protecting applications against failures, refers to maintaining several copies of a service on independent machines (replicas). Unlike a stand-alone service, a replicated service remains available to its clients despite the failure of some of its copies. Consistency among replicas is an immediate concern raised by replication. In effect, an important factor for providing the illusion of an uninterrupted service to clients is to preserve consistency among the multiple copies. State-machine replication is a popular replication technique that ensures consistency by ordering client requests and making all the replicas execute them deterministically and sequentially. The overhead of ordering the requests, and the sequentiality of request execution, the two essential requirements in realizing state-machine replication, are also the two major obstacles that prevent the performance of state-machine replication from scaling. In this thesis we concentrate on the performance of state-machine replication and enhance it by overcoming the two aforementioned bottlenecks, the overhead of ordering and the overhead of sequentially executing commands. To realize a truly scalable system, one must iteratively examine and analyze all the layers and components of a system and avoid or eliminate potential performance obstructions and congestion points. In this dissertation, we iterate between optimizing the ordering of requests and the strategies of replicas at request execution, in order to stretch the performance boundaries of state-machine replication. To eliminate the negative implications of the ordering layer on performance, we devise and implement several novel and highly efficient ordering protocols. Our proposals are based on practical observations we make after closely assessing and identifying the shortcomings of existing approaches. 
Communication is one of the most important components of any distributed system, and selecting efficient communication patterns is thus essential to designing scalable systems. We base our protocols on the most suitable communication patterns and extend their design with additional features that together realize their high efficiency. The outcome of this phase is the design and implementation of the Ring Paxos family of protocols. According to our evaluations, these protocols are highly scalable and efficient. We then assess the performance ramifications of sequential request execution on the replicas of state-machine replication. We use known techniques such as state partitioning and speculative execution, and thoroughly examine their advantages when combined with our ordering protocols. We then exploit the features of multicore hardware and propose our final solution: a parallelized form of state-machine replication, built on top of the Ring Paxos protocols, that is capable of significantly higher performance. Given the popularity of state-machine replication in designing fault-tolerant systems, we hope this thesis provides useful and practical guidelines for enhancing existing fault-tolerant systems and designing future ones that share similar performance goals.
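    The communication-pattern argument can be made concrete with a toy model of ring-based dissemination, the pattern that Ring Paxos-style protocols exploit: instead of a coordinator sending a value to all acceptors (n messages at one node), each process forwards to its successor on a ring, so every node sends and receives a constant number of messages per value, and a value is decided once a majority has seen it. This is a hedged sketch of the pattern only, not of the Ring Paxos protocol itself.

```python
def ring_disseminate(processes, coordinator, value):
    # Toy model: the value travels hop by hop around the ring, starting at
    # the coordinator, until a majority quorum of processes has accepted it.
    n = len(processes)
    quorum = n // 2 + 1
    visited = []                           # who accepted, in ring order
    start = processes.index(coordinator)
    for hop in range(quorum):              # traverse a majority of the ring
        p = processes[(start + hop) % n]
        visited.append(p)                  # p accepts, forwards to successor
    return {"decided": value, "quorum": visited}

out = ring_disseminate(["p0", "p1", "p2", "p3", "p4"], "p2", "cmd-7")
print(out)  # decided after visiting p2, p3, p4 -- a majority of 5
```

    The design point is throughput: with point-to-point links, a star pattern concentrates n-1 sends at the coordinator, while the ring spreads that load evenly, which is one reason ring-based ordering protocols scale.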

    Reducing the Latency of Dependent Operations in Large-Scale Geo-Distributed Systems

    Many applications rely on large-scale distributed systems for data management and computation. These distributed systems are complex and built from different networked services. Dependencies between these services can create a chain of dependent network I/O operations that have to be executed sequentially. This can result in high service latencies, especially when the chain consists of inter-datacenter operations. To address the latency problem of executing dependent network I/O operations, this thesis introduces new approaches and techniques to reduce the required number of operations that have to be executed sequentially for three system types. First, it addresses the high transaction completion time in geo-distributed database systems that have data sharded and replicated across different geographical regions. For a single transaction, most existing systems sequentially execute reads, writes, 2PC, and a replication protocol because of dependencies between these parts. This thesis looks at using a more restrictive transaction model in order to break dependencies and allow different parts to execute in parallel. Second, dependent network I/O operations also lead to high latency for performing leader-based state machine replication across a wide-area network. Fast Paxos introduces a fast path that bypasses the leader for request ordering. However, when concurrent requests arrive at replicas in different orders, the fast path may fail, and Fast Paxos has to fall back to a slow path. This thesis explores the use of network measurements to establish a global order for requests across replicas, allowing Fast Paxos to be effective for concurrent requests. Finally, this thesis proposes a general solution to reduce the latency impact of dependent operations in distributed systems through the use of speculative execution. 
For many systems, domain knowledge can be used to predict an operation’s result and speculatively execute subsequent operations, potentially allowing a chain of dependent operations to execute in parallel. This thesis introduces a framework that provides system-level support for performing speculative network I/O operations. These three approaches reduce the number of sequentially performed network I/O operations in different domains. Our performance evaluation shows that they can significantly reduce the latency of critical infrastructure services, allowing these services to be used by latency-sensitive applications.
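    The speculation technique from the final contribution can be sketched generically: predict the result of a slow dependent operation, start the follow-up immediately, and keep the speculative result only if the prediction was right. The function names and the "last cached value" predictor are illustrative assumptions, not the thesis framework's API.

```python
import concurrent.futures as cf

def speculate_chain(first_op, predict, second_op):
    # first_op and second_op are dependent: second_op needs first_op's
    # result. Speculation overlaps them by feeding second_op a prediction.
    with cf.ThreadPoolExecutor() as pool:
        actual = pool.submit(first_op)               # slow network I/O
        guess = predict()                            # domain-knowledge guess
        speculative = pool.submit(second_op, guess)  # overlaps with first_op
        if actual.result() == guess:
            return speculative.result()  # chain effectively ran in parallel
        return second_op(actual.result())            # mis-prediction: redo

result = speculate_chain(
    first_op=lambda: 42,        # stands in for a remote read
    predict=lambda: 42,         # e.g. the last cached value
    second_op=lambda v: v * 2,  # depends on first_op's result
)
print(result)  # 84
```

    When predictions are accurate, the chain's latency drops from the sum of the two operations to roughly the longer one; a mis-prediction costs one wasted execution of the follow-up, which is the usual speculation trade-off.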

    SoK: Understanding BFT Consensus in the Age of Blockchains

    Blockchain, as an enabler of modern Internet infrastructure, provides many unique features and has pushed distributed systems into a new era. Its decentralization, immutability, and transparency have attracted many applications to adopt the design philosophy of blockchain and to customize various replicated solutions. Under the hood of blockchain, consensus protocols play the most important role in achieving distributed replication. The distributed systems community has extensively studied the technical components of consensus for reaching agreement among a group of nodes. Because of trust issues, it is hard to design a resilient system for practical settings in which various faults may occur. Byzantine fault-tolerant (BFT) state machine replication (SMR) is regarded as an ideal candidate because it can tolerate arbitrary faulty behaviors. However, the inherent complexity of BFT consensus protocols and their rapid evolution make them hard to adapt to practical application domains. Many excellent Byzantine-based replicated solutions and ideas have been contributed to improve performance, availability, or resource efficiency. This paper conducts a systematic and comprehensive study of BFT consensus protocols, with a specific focus on the blockchain era. We explore both general principles and practical schemes for achieving consensus under Byzantine settings. We then survey, compare, and categorize state-of-the-art solutions to understand BFT consensus in detail. For each representative protocol, we conduct an in-depth discussion of its most important architectural building blocks as well as the key techniques it uses. We aim to provide system researchers and developers with a concrete view of the current design landscape and to help them find solutions to concrete problems. Finally, we present several critical challenges and potential research directions to advance research on BFT consensus protocols in the age of blockchains.
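    The resilience arithmetic that classical BFT protocols in this space build on is worth making explicit: with n = 3f + 1 replicas, quorums of size 2f + 1 are both available (they exclude at most the f faulty replicas) and safe (any two quorums intersect in at least f + 1 replicas, hence in at least one correct replica). A small check of that arithmetic:

```python
def bft_parameters(f):
    # Standard sizing for Byzantine fault tolerance with f faulty replicas.
    n = 3 * f + 1
    quorum = 2 * f + 1
    overlap = 2 * quorum - n        # minimum intersection of two quorums
    assert overlap >= f + 1         # at least one correct replica in common
    return {"n": n, "quorum": quorum, "min_overlap": overlap}

print(bft_parameters(1))  # {'n': 4, 'quorum': 3, 'min_overlap': 2}
```

    The intersection bound is why two conflicting values can never both gather a quorum: the correct replica in the overlap would have had to vote for both, which correct replicas do not do.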

    Intrusion tolerance in large scale networks

    Doctoral thesis, Informatics (Informatics Engineering), Universidade de Lisboa, Faculdade de Ciências, 2010. The growing reliance on wide-area services demands highly available systems that provide a correct and uninterrupted service. Therefore, Byzantine fault-tolerant (BFT) algorithms have received considerable attention in recent years. A service is replicated over several servers and can survive even in the presence of a bounded number of Byzantine failures. The main motivation for this thesis is that for a replicated service to be fault-tolerant, common-mode failures have to be avoided. More specifically, the thesis is concerned with common-mode failures caused by natural disasters, power outages, and physical attacks, which have to be prevented by scattering replicas geographically. This requires the sites where the replicas reside to be connected by a wide-area network (WAN) such as the Internet. Unfortunately, when the replicas are distributed geographically, the performance of current BFT algorithms suffers from the lower bandwidths and the higher, more heterogeneous network latencies. To deal with these limitations, this thesis introduces novel BFT algorithms that are simultaneously efficient and secure. Some algorithms in this thesis are based on a hybrid fault model, i.e., they consider that a part of the system is secure by construction. A notable contribution of this thesis is the definition and implementation of a minimal trusted service: the Unique Sequential Identifier Generator (USIG). The thesis describes how to implement a 2f+1 Byzantine consensus algorithm using a 2f+1 reliable multicast algorithm that requires a trusted service, an abstract version of the USIG. The USIG service and the reliable multicast primitive are then applied as core components to implement two novel BFT algorithms introduced in this thesis: MinBFT and MinZyzzyva. These BFT algorithms are minimal in terms of the number of replicas, the complexity of the trusted service used, and the number of communication steps. To mitigate performance-degradation attacks, this thesis proposes the use of a rotating primary, defining a novel BFT algorithm, Spinning, that is less vulnerable to attacks caused by a faulty primary and attains a throughput similar to the baseline algorithm in the area. Finally, the mechanisms and techniques developed in this thesis are combined to define EBAWA, a novel BFT algorithm suitable for supporting the execution of wide-area replicated services. Programme ALBAN; Fundação para a Ciência e a Tecnologia - Portugal
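    The USIG's essential property is that it binds each outgoing message to a unique, monotonically increasing, authenticated counter, so even a Byzantine replica cannot assign the same identifier to two different messages; this is what lets MinBFT-style protocols run with 2f+1 replicas instead of 3f+1. A hedged sketch (the class and method names are illustrative, and a real USIG lives inside a trusted component with an attested key):

```python
import hmac
import hashlib

class USIG:
    """Hypothetical sketch of a Unique Sequential Identifier Generator:
    a monotonic counter plus a MAC over (counter, message)."""

    def __init__(self, key):
        self.key = key         # in a real USIG, held inside the trusted TCB
        self.counter = 0

    def create_ui(self, message):
        self.counter += 1      # strictly monotonic, never reused or skipped
        tag = hmac.new(self.key, f"{self.counter}|{message}".encode(),
                       hashlib.sha256).hexdigest()
        return self.counter, tag

    def verify_ui(self, counter, tag, message):
        expected = hmac.new(self.key, f"{counter}|{message}".encode(),
                            hashlib.sha256).hexdigest()
        return hmac.compare_digest(tag, expected)

usig = USIG(b"secret")
c1, t1 = usig.create_ui("PREPARE m1")
c2, t2 = usig.create_ui("PREPARE m2")
print(c1, c2)                                # 1 2: counters are sequential
print(usig.verify_ui(c1, t1, "PREPARE m1"))  # True
print(usig.verify_ui(c1, t1, "PREPARE m2"))  # False: a UI cannot be reused
```

    Because a faulty replica cannot produce two valid identifiers with the same counter for different messages, equivocation (telling different replicas different stories) is detectable, which removes the need for the extra f replicas that classical BFT uses to mask it.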