
    Multi-Shot Distributed Transaction Commit

    The Atomic Commit Problem (ACP) is a single-shot agreement problem, similar to consensus, meant to model the properties of transaction commit protocols in fault-prone distributed systems. We argue that ACP is too restrictive to capture the complexities of modern transactional data stores, where commit protocols are integrated with concurrency control and their executions for different transactions are interdependent. As an alternative, we introduce the Transaction Certification Service (TCS), a new formal problem that captures the safety guarantees of multi-shot transaction commit protocols with integrated concurrency control. TCS is parameterized by a certification function that can be instantiated to support common isolation levels, such as serializability and snapshot isolation. We then derive a provably correct crash-resilient protocol implementing TCS through successive refinement. Our protocol achieves better time complexity than mainstream approaches that layer two-phase commit on top of Paxos-style replication.
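
    For context on the baseline mentioned above, here is a minimal sketch of the conventional layering of two-phase commit (2PC) over Paxos-replicated shards. The class and method names are illustrative, not from the paper, and the Paxos round is stubbed out:

```python
# Sketch of the conventional layering: two-phase commit across shards,
# where each shard is itself replicated with a Paxos-style protocol.
# Illustrative only -- the Paxos round is a stub.

class PaxosShard:
    def __init__(self, name):
        self.name = name
        self.log = []

    def replicate(self, entry):
        # Stand-in for a real Paxos/Raft round: in practice this costs
        # a round trip to a majority of replicas before returning.
        self.log.append(entry)
        return True

    def prepare(self, txn_id):
        # Vote yes only after the PREPARED record is durably replicated.
        return self.replicate(("PREPARED", txn_id))

    def commit(self, txn_id):
        self.replicate(("COMMITTED", txn_id))


def two_phase_commit(coordinator_log, shards, txn_id):
    """Classic 2PC in which every step is itself replicated; counting
    the sequential replication rounds (prepare, decision, commit) shows
    where the layered approach loses latency."""
    # Phase 1: every participant replicates a PREPARED record and votes.
    if not all(shard.prepare(txn_id) for shard in shards):
        coordinator_log.replicate(("ABORTED", txn_id))
        return "abort"
    # The coordinator durably records the decision ...
    coordinator_log.replicate(("COMMIT-DECISION", txn_id))
    # ... then Phase 2: participants replicate the commit record.
    for shard in shards:
        shard.commit(txn_id)
    return "commit"


shards = [PaxosShard("users"), PaxosShard("orders")]
print(two_phase_commit(PaxosShard("coord"), shards, txn_id=42))  # commit
```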

    The Weakest Failure Detector for Eventual Consistency

    In its classical form, a consistent replicated service requires all replicas to witness the same evolution of the service state. Assuming a message-passing environment with a majority of correct processes, the necessary and sufficient information about failures for implementing a general state machine replication scheme ensuring consistency is captured by the Ω failure detector. This paper shows that in such a message-passing environment, Ω is also the weakest failure detector to implement an eventually consistent replicated service, where replicas are expected to agree on the evolution of the service state only after some (a priori unknown) time. In fact, we show that Ω is the weakest to implement eventual consistency in any message-passing environment, i.e., under any assumption on when and where failures might occur. Ensuring (strong) consistency in any environment requires, in addition to Ω, the quorum failure detector Σ. Our paper thus captures, for the first time, an exact computational difference between building a replicated state machine that ensures consistency and one that only ensures eventual consistency.
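
    As a reminder of the abstraction at stake, Ω is the eventual leader election failure detector: queried repeatedly, it eventually returns the same correct process at every correct process. Below is a minimal heartbeat-based sketch, assuming partial synchrony; this is one standard way to implement Ω, not taken from the paper:

```python
import time

class OmegaFailureDetector:
    """Illustrative heartbeat-based Omega: eventually outputs the same
    correct process at every correct process. Assumes partial synchrony;
    not taken from the paper."""

    def __init__(self, process_ids, timeout=2.0):
        self.timeout = timeout
        # Last time a heartbeat was received from each process.
        self.last_heard = {pid: time.monotonic() for pid in process_ids}

    def on_heartbeat(self, pid):
        self.last_heard[pid] = time.monotonic()

    def leader(self):
        # Trust every process with a recent heartbeat and pick the
        # smallest id among them. Once timeouts stop firing spuriously,
        # all correct processes converge on the same correct leader.
        now = time.monotonic()
        alive = [p for p, t in self.last_heard.items()
                 if now - t < self.timeout]
        return min(alive) if alive else None


fd = OmegaFailureDetector(process_ids=[1, 2, 3])
fd.on_heartbeat(2)
print(fd.leader())  # 1, until process 1's heartbeats time out
```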

    Replicated Data and Partition Failures

    In a distributed database system, data is often replicated to improve performance and availability. By storing copies of shared data on processors where it is frequently accessed, the need for expensive remote read accesses is decreased. By storing copies of critical data on processors with independent failure modes, the probability that at least one copy of the data will be accessible increases. In theory, data replication makes it possible to provide arbitrarily high data availability.

    In practice, realizing the benefits of data replication is difficult, since the correctness of the data must be maintained. One important aspect of correctness with replicated data is mutual consistency: all copies of the same logical data-item must agree on exactly one current value for the data-item, and this value should make sense in terms of the transactions executed on copies of the data-item. When communication fails between sites containing copies of the same logical data-item, mutual consistency between copies becomes complicated to ensure. The most disruptive of these communication failures are partition failures, which fragment the network into isolated subnetworks called partitions. Unless partition failures are detected and recognized by all affected processors, independent and uncoordinated updates may be applied to different copies of the data, thereby compromising their correctness.

    Consider, for example, an Airline Reservation System implemented by a distributed database which splits into two partitions when the communication network fails. If, at the time of the failure, all the nodes have one seat remaining for PAN AM 537, reservations could be made in both partitions. This would violate correctness: who should get the last seat? There should not be more seats reserved for a flight than physically exist on the plane. (Some airlines do not implement this constraint and allow overbookings.)

    The design of a replicated data management algorithm tolerating partition failures (or partition processing strategy) is a notoriously hard problem. Typically, the cause or extent of a partition failure cannot be discerned by the processors themselves. At best, a processor may be able to identify the other processors in its partition; but it cannot distinguish between processors outside its partition that are simply isolated from it and processors that are down. In addition, slow responses can cause the network to appear partitioned even when it is not, further complicating the design of a fault-tolerant algorithm.
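
    The abstract poses the problem rather than a solution; one classic partition processing strategy (not proposed here) is majority quorum voting, under which conflicting updates in disjoint partitions are impossible because at most one partition can contain a majority of the copies. A minimal sketch:

```python
class QuorumReplica:
    """Toy majority-quorum write for one logical data-item (e.g., the
    seat count for PAN AM 537). Illustrative, not from the paper."""

    def __init__(self, total_copies):
        self.total = total_copies
        self.majority = total_copies // 2 + 1

    def write(self, reachable_copies, value):
        # A write succeeds only if a majority of copies acknowledge it.
        # Two disjoint partitions cannot both contain a majority, so at
        # most one side of a partition can update the data-item.
        if len(reachable_copies) >= self.majority:
            for copy in reachable_copies:
                copy[0] = value          # apply to each reachable copy
            return True
        return False                     # minority partition: refuse


# Example: 5 copies split 3/2 by a partition; only the 3-copy side can
# reserve the last seat, so overbooking cannot occur.
copies = [[1] for _ in range(5)]         # one seat left everywhere
item = QuorumReplica(total_copies=5)
assert item.write(copies[:3], 0) is True     # majority side succeeds
assert item.write(copies[3:], 0) is False    # minority side is blocked
```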

    Dependable Systems

    Improving the dependability of computer systems is a critical and essential task. In this context, the paper surveys techniques for achieving fault tolerance in distributed systems through replication. The main replication techniques are first explained. Then group communication is introduced as the communication infrastructure that allows the different replication techniques to be implemented. Finally, the difficulty of implementing group communication is discussed, and the most important algorithms are presented.
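
    To make the survey's central point concrete: active (state machine) replication is typically layered on a group communication primitive such as total-order broadcast, so that every replica applies the same commands in the same order. A toy sketch of that layering, with the hard part (agreeing on the delivery order, which requires consensus) stubbed out:

```python
class TotalOrderBroadcast:
    """Toy stand-in for a group communication layer. In a real system
    fixing one delivery order across members is the hard part; here a
    single loop plays that role."""

    def __init__(self):
        self.members = []

    def join(self, replica):
        self.members.append(replica)

    def broadcast(self, command):
        # Deliver to every member in the same order.
        for replica in self.members:
            replica.deliver(command)


class Replica:
    """Active replication: each replica applies the same deterministic
    commands in the same total order, so replica states never diverge."""

    def __init__(self):
        self.state = {}

    def deliver(self, command):
        key, value = command
        self.state[key] = value


group = TotalOrderBroadcast()
r1, r2 = Replica(), Replica()
group.join(r1)
group.join(r2)
group.broadcast(("x", 1))
group.broadcast(("x", 2))
assert r1.state == r2.state == {"x": 2}
```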

    Designing the replication layer of a general-purpose datacenter key-value store

    Online services and cloud applications such as graph applications, messaging systems, coordination services, HPC applications, social networks and deep learning rely on key-value stores (KVSes) in order to reliably store and quickly retrieve data. KVSes are NoSQL databases with a read/write/read-modify-write API. KVSes replicate their dataset across a few servers, so that the KVS can continue operating in the presence of faults (availability). To allow programmers to reason about replication, KVSes specify a set of rules (consistency), which are enforced through the use of replication protocols. These rules must be intuitive to facilitate programmer productivity (programmability). A general-purpose KVS must maximize the number of operations executed per unit of time within a predetermined latency (performance) without compromising on consistency, availability or programmability. However, all three of these guarantees are at odds with performance. In this thesis, we explore the design of the replication layer of a general-purpose KVS, which is responsible for navigating this trade-off by specifying and enforcing the consistency and availability guarantees of the KVS.

    We start the exploration by observing that modern, server-grade hardware, with manycore servers and RDMA-capable networks, challenges conventional wisdom in protocol design. In order to investigate the impact of these advances on protocols and their design, we first create an informal taxonomy of strongly-consistent replication protocols. We focus on strong consistency semantics because they are necessary for a general-purpose KVS and they are at odds with performance. Based on this taxonomy, we carefully select 10 protocols for analysis. Secondly, we present Odyssey, a framework tailored to protocol implementation for multi-threaded, RDMA-enabled, in-memory, replicated KVSes. Using Odyssey, we characterize the design space of strongly-consistent replication protocols by building, evaluating and comparing the 10 protocols. Our evaluation demonstrates that some protocols that were efficient on yesterday's hardware are not so today, because they cannot take advantage of the abundant parallelism and fast networking present in modern hardware. Conversely, some protocols that were inefficient on yesterday's hardware are very attractive today. We distil our findings into a concise set of general guidelines and recommendations for protocol selection and design in the era of modern hardware.

    The second step of our exploration focuses on the tension between consistency and performance. The problem is that expensive strongly-consistent primitives are necessary to achieve synchronization, but in typical applications only a small fraction of accesses is actually used for synchronization. To navigate this trade-off, we advocate the adoption of Release Consistency (RC) for KVSes. We argue that RC's one-sided barriers are ideal for capturing the ordering relationship between synchronization and non-synchronization accesses while enabling high performance. We present Kite, a general-purpose, replicated KVS that enforces RC through a novel fast/slow path mechanism, which leverages the absence of failures in the typical case to maximize performance while relying on the slow path for progress. In addition, Kite leverages our study of replication protocols to select the most suitable protocols for its primitives, and is implemented over Odyssey to make the most of modern hardware. Finally, Kite does not compromise on consistency, availability or programmability: it provides sufficient primitives to implement any algorithm (consistency), does not interrupt its operation on a failure (availability), and offers the RC API that programmers are already familiar with (programmability).
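
    To illustrate the programming model advocated above: under Release Consistency only release and acquire accesses carry ordering obligations, so ordinary reads and writes can execute cheaply. Below is a hedged sketch of a flag-based handoff against a read/write KVS API; the method names and the single-node ToyKVS are hypothetical stand-ins, not Kite's actual interface:

```python
class ToyKVS:
    """Single-node stand-in for a replicated KVS. Here every operation
    is trivially ordered; the point of RC is that in a replicated store
    only the release/acquire pair needs expensive ordering."""

    def __init__(self):
        self.store = {}

    def write_relaxed(self, key, value):
        # Cheap, weakly ordered write.
        self.store[key] = value

    def write_release(self, key, value):
        # One-sided barrier: must not become visible before the relaxed
        # writes that precede it.
        self.store[key] = value

    def read_relaxed(self, key):
        return self.store.get(key)

    def read_acquire(self, key):
        # One-sided barrier: orders subsequent reads after this one.
        return self.store.get(key)


def producer(kvs):
    kvs.write_relaxed("data/a", 1)
    kvs.write_relaxed("data/b", 2)
    kvs.write_release("ready", True)   # publishes data/a and data/b


def consumer(kvs):
    if kvs.read_acquire("ready"):
        # Guaranteed to observe all writes preceding the release.
        return kvs.read_relaxed("data/a"), kvs.read_relaxed("data/b")
    return None


kvs = ToyKVS()
producer(kvs)
assert consumer(kvs) == (1, 2)
```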