15 research outputs found

    The Alpha of Indulgent Consensus

    This paper presents a simple framework unifying a family of consensus algorithms that can tolerate process crash failures and asynchronous periods of the network, also called indulgent consensus algorithms. Key to the framework is a new abstraction we introduce here, called Alpha, which precisely captures consensus safety. Implementations of Alpha in shared memory, storage area network, message passing and active disk systems are presented, leading to directly derived consensus algorithms suited to these communication media. The paper also considers the case where the number of processes is unknown and can be arbitrarily large.

    The Information Structure of Indulgent Consensus

    Round-Based Consensus Algorithms, Predicate Implementations and Quantitative Analysis

    Fault-tolerant computing is the art and science of building computer systems that continue to operate normally in the presence of faults. The fault tolerance field covers a wide spectrum of research areas ranging from computer hardware to computer software. A common approach to obtaining a fault-tolerant system is software replication. However, keeping the state of the replicas consistent is not an easy task, even though the understanding of the problems related to replication has evolved significantly over the past thirty years. Consensus is a fundamental building block for providing consistency in any fault-tolerant distributed system. A large number of algorithms have been proposed to solve the consensus problem in different systems. The efficiency of several consensus algorithms has been studied theoretically and practically. A common metric for evaluating the performance of consensus algorithms is the number of communication steps, or the number of rounds (in round-based algorithms), needed to decide. Many improvements to consensus algorithms have been proposed to reduce this number under different assumptions, e.g., nice runs. However, efficiency expressed as a number of rounds does not predict the time it takes to decide (including the time needed by the system to stabilize or not). Following this idea, the thesis investigates the round model abstraction to represent consensus algorithms, with benign and Byzantine faults, in a concise and modular way. The goal of the thesis is first to decouple the consensus algorithm from irrelevant implementation details, such as synchronization, then to study different possible implementations for a given consensus algorithm, and finally to propose a more general analytical evaluation of different consensus algorithms. The first part of the thesis considers round-based consensus algorithms with benign faults. 
In this context, the round model allowed us to separate the consensus algorithms from the round implementation, to propose different round implementations, to improve existing round implementations by making them swift, and to provide a quantitative analysis of different algorithms. The second part of the thesis considers round-based consensus algorithms with Byzantine faults. In this context, there is a gap between theoretical consensus algorithms and practical Byzantine fault-tolerant protocols. The round model allowed us to fill the gap by better understanding existing protocols, and enabled us to express existing protocols in a simple and modular way, to obtain simplified proofs, to discover new protocols such as decentralized (non-leader-based) algorithms, and finally to perform precise timing analysis to compare different algorithms. The last part of the thesis shows, as an example, how a round-based consensus algorithm that tolerates benign faults can be extended to wireless mobile ad hoc networks using an adequate communication layer. We have validated our implementation by running simulations in single-hop and multi-hop wireless networks.

    SoK: Understanding BFT Consensus in the Age of Blockchains

    Blockchain, as an enabler of current Internet infrastructure, has provided many unique features and moved distributed systems into a new era. Its decentralization, immutability, and transparency have led many applications to adopt the design philosophy of blockchain and customize various replicated solutions. Under the hood of blockchain, consensus protocols play the most important role in achieving distributed replication. The distributed systems community has extensively studied the technical components of consensus needed to reach agreement among a group of nodes. Because of trust issues and the variety of faults that can occur, it is hard to design a resilient system in practical situations. Byzantine fault-tolerant (BFT) state machine replication (SMR) is regarded as an ideal candidate that can tolerate arbitrary faulty behaviors. However, the inherent complexity of BFT consensus protocols and their rapid evolution make them hard to adapt to application domains in practice. Many excellent Byzantine-based replicated solutions and ideas have been contributed to improve performance, availability, or resource efficiency. This paper conducts a systematic and comprehensive study of BFT consensus protocols with a specific focus on the blockchain era. We explore both general principles and practical schemes to achieve consensus under Byzantine settings. We then survey, compare, and categorize the state-of-the-art solutions to understand BFT consensus in detail. For each representative protocol, we conduct an in-depth discussion of its most important architectural building blocks as well as the key techniques it uses. We hope this paper gives system researchers and developers a concrete view of the current design landscape and helps them find solutions to concrete problems. 
Finally, we present several critical challenges and some potential research directions to advance research on BFT consensus protocols in the age of blockchains.

    Invalidation-based protocols for replicated datastores

    Distributed in-memory datastores underpin cloud applications that run within a datacenter and demand high performance, strong consistency, and availability. A key feature of datastores is data replication. The data are replicated across servers because a single server often cannot handle the request load. Replication is also necessary to guarantee that a server or link failure does not render a portion of the dataset inaccessible. A replication protocol is responsible for ensuring strong consistency between the replicas of a datastore, even when faults occur, by determining the actions necessary to access and manipulate the data. Consequently, a replication protocol also drives the datastore's performance. Existing strongly consistent replication protocols deliver fault tolerance but fall short in terms of performance. Meanwhile, the opposite occurs in the world of multiprocessors, where data are replicated across the private caches of different cores. The multiprocessor regime uses invalidations to afford strongly consistent replication with high performance but neglects fault tolerance. Although handling failures in the datacenter is critical for data availability, we observe that fault-free operation is the common case and far exceeds operation during faults. In other words, the common operating environment inside a datacenter closely resembles that of a multiprocessor. Based on this insight, we draw inspiration from the multiprocessor for high-performance, strongly consistent replication in the datacenter. The primary contribution of this thesis is in adapting invalidating protocols to the nuances of replicated datastores, which include skewed data accesses, fault tolerance, and distributed transactions.

    State Machine Replication: from Analytical Evaluation to High-Performance Paxos

    Since their invention more than half a century ago, computers have gone from being just a handful of expensive machines, each filling an entire room, to being an integral part of almost every aspect of modern life. Nowadays computers are everywhere: in our planes, in our cars, on our desks, in our home appliances, and even in our pockets. This widespread adoption has had a profound impact on our world and our lives, so much so that we now rely on computers for many important aspects of everyday life, including work, communication, travel, entertainment, and even managing our money. Given our increased reliance on computers, their continuous and correct operation has become essential for modern society. However, individual computers can fail for a variety of reasons and, if nothing is done about it, these failures can easily disrupt the service provided by a computer system. The field of fault tolerance studies this problem; more precisely, it studies how to enable a computer system to continue operating in spite of the failure of individual components. One of the most popular techniques for achieving fault tolerance is software replication, where a service is replicated on an ensemble of machines (replicas) such that if some of these machines fail, the others continue providing the service. Software replication is widely used because of its generality (it can be applied to most services) and its low cost (it can use off-the-shelf hardware). This thesis studies a form of software replication, namely state machine replication, where the service is modeled as a deterministic state machine whose state transitions consist of the execution of client requests. Although state machine replication was first proposed almost 30 years ago, the proliferation of online services in recent years has led to renewed interest. 
Online services must be highly available, and for that they frequently rely on state machine replication as part of their fault tolerance mechanisms. However, the unprecedented scale of these services, which frequently have hundreds of thousands or even millions of users, leads to a new set of performance requirements on state machine replication. This thesis is organized in two parts. The goal of the first part is to study, from a theoretical perspective, the performance characteristics of the algorithms behind state machine replication and to propose improved variants of such algorithms. The second part looks at the problem from a practical perspective, proposing new techniques to achieve high throughput and scalability. In the first part, we start with an analytical study of the performance of two consensus algorithms, one leader-free (an adaptation of the fast round of Fast Paxos) and another leader-based (an adaptation of classical Paxos). We express these algorithms in the Heard-Of round model and show that, using this model, it is fairly easy to determine analytically several interesting performance metrics. We then study the performance of round models in general. Round models are perceived as inefficient because, in their typical implementation, the real-time duration of rounds is proportional to the (pessimistic) timeouts used on the underlying system. This contrasts with the failure detector or the partially synchronous system models, where algorithms usually progress at the speed of message reception. We show that there is no inherent gap in performance between the models, by proposing a round implementation that during stable periods advances at the speed of message reception. We conclude the first part by presenting a new leader election algorithm that chooses as leader a well-connected process, that is, a process whose time needed to perform a one-to-majority communication round is among the lowest in the system. 
This is useful mainly in systems where the latency between processes is not homogeneous, because the performance of leader-based algorithms is particularly sensitive to the performance and connectivity of the process acting as a leader. The second part of the thesis studies different approaches to achieving high throughput with state machine replication. To support the experimental work done in this part, we have developed JPaxos, a fully featured implementation of Paxos in Java. We start by looking at how to tune the batching and pipelining optimizations of Paxos; using an analytical model of the performance of Paxos, we show how to derive good values for the bounds on the batch size and number of parallel instances. We then propose an architecture for implementing replicated state machines that is capable of leveraging multi-core CPUs to achieve very high levels of performance. The final contribution of this thesis is based on the observation that most implementations of state machine replication have an unbalanced division of work among threads, with one replica, the leader, having a significantly higher workload than the other replicas. Naturally, the leader becomes the bottleneck of the system, while other replicas are only lightly loaded. We propose and evaluate S-Paxos, which evenly balances the workload among all replicas, and thus overcomes the leader bottleneck. The benefits are two-fold: S-Paxos achieves a higher throughput for a given number of replicas and its performance increases with the number of replicas (up to a reasonable number).
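
The "well-connected" criterion in the abstract above can be illustrated with a small sketch. This is not the thesis's leader election algorithm, which is distributed; it is a centralized illustration that assumes a fully known round-trip-time matrix. The names `rtt`, `majority_round_time` and `pick_well_connected_leader` are hypothetical.

```python
def majority_round_time(latencies_from_p):
    """Time for a process to hear back from a majority of processes.

    `latencies_from_p` holds the round-trip times from one process to every
    process, including itself (entry 0.0). Assumption: a one-to-majority
    round completes when the majority-th fastest reply arrives.
    """
    n = len(latencies_from_p)
    majority = n // 2 + 1
    return sorted(latencies_from_p)[majority - 1]


def pick_well_connected_leader(rtt):
    """Return the index of the process with the lowest one-to-majority time.

    `rtt[p][q]` is the (assumed known) round-trip time between p and q.
    Ties are broken in favor of the lowest index.
    """
    times = [majority_round_time(row) for row in rtt]
    return min(range(len(rtt)), key=times.__getitem__)


# Hypothetical latencies in milliseconds for five processes.
rtt = [
    [0, 10, 50, 50, 50],
    [10, 0, 12, 15, 40],
    [50, 12, 0, 60, 60],
    [50, 15, 60, 0, 70],
    [50, 40, 60, 70, 0],
]
leader = pick_well_connected_leader(rtt)
```

With these made-up numbers, process 1 reaches a majority (itself plus two others) faster than any other process, so it is the one selected.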

    Byzantine fault-tolerant vote collection for D-DEMOS, a distributed e-voting system

    E-voting systems are a powerful technology for improving democracy by reducing election cost, increasing voter participation, and even allowing voters to directly verify the entire election procedure. Unfortunately, prior internet voting systems have single points of failure, which may result in the compromise of availability, voter secrecy, or integrity of the election results. In this thesis, we consider increasing the fault tolerance of voting systems by introducing distributed components. This is non-trivial because, besides integrity and availability, voting also requires safeguarding confidentiality against a malicious adversary. 
We focus on the vote collection phase of the voting system, which is a crucial part of the election process. We use the state-of-the-art but centralized DEMOS voting system as the basis for our study. This system uses vote codes to represent voters' choices, an Election Authority to set up the election and handle vote collection and result production, and a Bulletin Board for storing the election transcript for the long term. We extract the vote collection mechanism from the centralized Election Authority component of the original DEMOS system, and replace it with a distributed system that handles vote collection in a Byzantine fault-tolerant manner. In this thesis, we present the design, security analysis, prototype implementation and experimental evaluation of this vote collection component. We present two versions of this component: one completely asynchronous and one with minimal timing assumptions but better performance. Both versions provide immediate assurance to the voter that her vote was recorded as cast, without requiring cryptographic operations on behalf of the voter. This way, a voter may cast her vote using an untrusted computer or network, and still be assured her vote was recorded as cast. For example, she may vote via a public web terminal, or by sending an SMS from a mobile phone. Even in these cases, the voter's privacy is still preserved. We provide a model and security analysis of the systems we present. We implement prototypes of the complete systems, we measure their performance experimentally, and we demonstrate their ability to handle large-scale elections. Finally, we demonstrate the performance trade-offs between the two versions of the system. We believe the vote collection components we introduce are applicable to any voting system that uses the code-voting technique.

    Abstractions for Solving Consensus and Related Problems with Byzantine Faults

    We are becoming increasingly dependent on online services; therefore, their availability and correct behavior are becoming increasingly important. Software replication is a popular technique for ensuring that computer systems continue to provide a correct service even when some of their components fail. By replicating a service on multiple servers, clients are guaranteed that even if some replica fails, the service is still available. At the core of software replication is the consensus problem, where a set of processes has to agree on a single value. A large number of consensus algorithms for different system models have been proposed. The most general system models (for which consensus is solvable) do not make strong assumptions on synchrony (they allow periods of asynchrony) and assume that a subset of processes can fail completely arbitrarily (Byzantine faults). However, solving consensus in the presence of arbitrary faults and asynchrony is hard and demands sophisticated algorithms. Most of the existing consensus algorithms that deal with arbitrary faults are monolithic and developed from scratch, or by modifying existing algorithms in a non-modular manner. As a consequence, these algorithms are rather complex and hard to understand. We attribute this complexity to the lack of adequate abstractions. The motivation of this thesis is to propose abstractions that simplify the understanding of existing consensus algorithms with arbitrary faults and allow modular design of novel algorithms. The thesis also aims to clarify relations between consensus and the total-order broadcast problem in the presence of arbitrary faults. In the context of the consensus problem with arbitrary process faults, the literature distinguishes (1) authenticated Byzantine faults, where messages can be signed by the sending process, and (2) Byzantine faults, where there is no mechanism for signatures. 
Consensus protocols that assume Byzantine faults (without authentication) are harder to develop and prove correct than algorithms that consider authenticated Byzantine faults, even when they are based on the same idea. We propose an abstraction called weak interactive consistency (or WIC), that allows us to design consensus algorithms that can be instantiated into algorithms for authenticated Byzantine faults (signed messages) and algorithms for Byzantine faults. In other words, WIC unifies Byzantine consensus algorithms with and without signatures. This is illustrated on two seminal Byzantine consensus algorithms: the Castro-Liskov PBFT algorithm (no signatures) and the Martin-Alvisi FaB Paxos algorithm (signatures). WIC allows a very concise expression of these two algorithms. Furthermore, WIC turns out to be a fundamental abstraction for solving consensus in the transmission fault model. The transmission fault model captures faults without blaming a specific component for the fault, and it is well adapted to dynamic and transient faults. Using WIC, we designed a consensus algorithm that overcomes limitations of all existing solutions to consensus in this model, which assume the synchronous system model, or require strong conditions for termination that exclude the case where all messages of a process can be corrupted. Then we go one step further in unifying consensus algorithms by proposing a generic consensus algorithm that highlights, through well-chosen parameters, the core mechanisms of a number of well-known consensus algorithms including Paxos, OneThirdRule, PBFT and FaB Paxos. Interestingly, the generic algorithm allows us to identify a new Byzantine consensus algorithm that requires n > 4b, in-between the requirement n > 5b of FaB Paxos and n > 3b of PBFT (b is the maximum number of Byzantine processes). Afterwards, we study the relation between consensus and total-order broadcast in the presence of Byzantine faults. 
Total-order broadcast is defined for a set of processes, where each process can broadcast messages, with the guarantee that all processes in this set see the same sequence of messages. Among the several definitions of Byzantine consensus that differ only by their validity property, we identify those equivalent to total-order broadcast. We also give the first deterministic total-order broadcast reduction to consensus with constant time complexity with respect to consensus. Finally, we consider state-machine replication (SMR) with Byzantine faults. State-machine replication is a general approach for replicating services that can be modeled as a state machine. The key idea of this approach is to guarantee that all replicas start in the same state and then apply requests from clients in the same order, thereby guaranteeing that the replica states do not diverge. Recent studies have shown that most BFT-SMR algorithms do not actually perform well under performance attacks by Byzantine processes. We propose a new BFT-SMR algorithm, called BFT-Mencius, that guarantees, assuming a partially synchronous system model, that the latency of updates of correct processes is eventually upper-bounded, even under performance attacks by Byzantine processes. BFT-Mencius is a modular, signature-free algorithm based on a new communication primitive called Abortable Timely Announced Broadcast (ATAB). We evaluate the performance of BFT-Mencius in cluster settings, and show that it performs comparably to state-of-the-art algorithms such as PBFT and Spinning in fault-free configurations and outperforms these algorithms under performance attacks by Byzantine processes.
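
The resilience requirements quoted in the abstract above (n > 3b for PBFT, n > 4b for the newly identified algorithm, n > 5b for FaB Paxos) can be compared with a little arithmetic. The sketch below only evaluates those three inequalities; the function names and the label "n > 4b algorithm" are hypothetical, and nothing here models the protocols themselves.

```python
# Multiplier k in the requirement n > k * b, as stated in the abstract.
REQUIREMENTS = {
    "PBFT": 3,
    "n > 4b algorithm": 4,
    "FaB Paxos": 5,
}


def max_byzantine_faults(n, k):
    """Largest integer b such that n > k * b holds."""
    return (n - 1) // k


def min_processes(b, k):
    """Smallest integer n such that n > k * b holds."""
    return k * b + 1


def compare(n):
    """Map each requirement to the number of Byzantine processes n tolerates."""
    return {name: max_byzantine_faults(n, k) for name, k in REQUIREMENTS.items()}


tolerated_with_11 = compare(11)
needed_for_one_fault = {name: min_processes(1, k) for name, k in REQUIREMENTS.items()}
```

For a single Byzantine process, the three requirements translate into at least 4, 5 and 6 processes respectively, which shows how the n > 4b algorithm sits between the other two.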

    Resilience-Building Technologies: State of Knowledge -- ReSIST NoE Deliverable D12

    This document is the first product of work package WP2, "Resilience-building and -scaling technologies", in the programme of jointly executed research (JER) of the ReSIST Network of Excellence