    Optimal atomic broadcast and multicast algorithms for wide area networks

    In this paper, we study the atomic broadcast and multicast problems, two fundamental abstractions for building fault-tolerant systems. As opposed to atomic broadcast, atomic multicast allows messages to be addressed to a subset of the processes in the system, each message possibly being multicast to a different subset. We require atomic multicast algorithms to be genuine, i.e., only processes addressed by a multicast message are involved in the protocol. Our study focuses on wide area networks where groups of processes, i.e., processes physically close to each other, are interconnected through high-latency communication links. In this context, we capture the cost of an algorithm, denoted its latency degree, as the number of inter-group message delays between the broadcasting (multicasting) of a message and its delivery. We present an atomic multicast algorithm with a latency degree of two and show that it is optimal. We then present the first fault-tolerant atomic broadcast algorithm with a latency degree of one. To achieve such a low latency, the algorithm is proactive, i.e., it may take actions even though no messages are broadcast. Nevertheless, it is quiescent: provided that the number of broadcast messages is finite, the algorithm eventually ceases its operation. As a consequence, in runs where the algorithm becomes quiescent too early, its latency degree is two. We show that this is unavoidable and establish a lower bound on the quiescence of atomic broadcast algorithms. These two lower bound results stem from a common cause, namely the reactiveness of the processes at the time when the message is cast (broadcast or multicast). This reveals an interesting link between the quiescence of total order algorithms and the genuineness of atomic multicast, two problems that seemed unrelated at first sight.
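
    As a rough illustration of the latency-degree-two pattern, the sketch below shows a Skeen-style timestamp exchange in Python. It is not the paper's algorithm: the names are invented, each group is collapsed into a single object, and the intra-group agreement needed for fault tolerance (e.g., consensus among a group's processes) is elided.

```python
# Minimal sketch: genuine atomic multicast with latency degree two.
from dataclasses import dataclass, field

@dataclass
class Group:
    name: str
    clock: int = 0                                 # local logical clock
    proposals: dict = field(default_factory=dict)  # msg_id -> timestamps

    def propose(self, msg_id):
        # Each addressed group assigns the message a local timestamp.
        self.clock += 1
        return self.clock

    def collect(self, msg_id, ts, expected):
        # Gather the proposals of all addressed groups for this message.
        self.proposals.setdefault(msg_id, []).append(ts)
        if len(self.proposals[msg_id]) == expected:
            # Final timestamp is the max proposal; messages are
            # delivered in final-timestamp order.
            final = max(self.proposals[msg_id])
            print(f"{self.name} delivers {msg_id} at timestamp {final}")

def multicast(msg_id, dest_groups):
    # First inter-group delay: ship the message to every addressed group.
    proposals = [g.propose(msg_id) for g in dest_groups]
    # Second inter-group delay: the addressed groups exchange proposals.
    # No other group participates, which is what "genuine" means.
    for g in dest_groups:
        for ts in proposals:
            g.collect(msg_id, ts, expected=len(dest_groups))

g1, g2 = Group("g1"), Group("g2")
multicast("m1", [g1, g2])
```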

    A robust and lightweight stable leader election service for dynamic systems

    We describe the implementation and experimental evaluation of a fault-tolerant leader election service for dynamic systems. Intuitively, distributed applications can use this service to elect and maintain an operational leader for any group of processes, which may change dynamically. If the leader of a group crashes, is temporarily disconnected, or voluntarily leaves the group, the service automatically elects a new group leader. The current version of the service implements two recent leader election algorithms, and users can select the one that better fits their system. Both algorithms ensure leader stability, a desirable feature that some other algorithms lack, but one is more robust in the face of extreme network disruptions, while the other is more scalable. The leader election service is flexible and easy to use. By using a stochastic failure detector [5] and a link quality estimator, it provides some degree of QoS control and adapts to changing network conditions. Our experimental evaluation indicates that it is also highly robust and inexpensive to run in practice.
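
    The service's interface is not spelled out above, so the following sketch only illustrates the stability property with a hypothetical heartbeat-based failure detector and a deterministic fallback rule; a fixed timeout stands in for the stochastic failure detector [5] and link quality estimator.

```python
import time

class FailureDetector:
    """Heartbeat-based liveness estimate (illustrative only)."""
    def __init__(self, timeout=2.0):
        self.timeout = timeout      # a real service would adapt this
        self.last_heartbeat = {}    # pid -> time of last heartbeat

    def heartbeat(self, pid):
        self.last_heartbeat[pid] = time.monotonic()

    def alive(self):
        now = time.monotonic()
        return {p for p, t in self.last_heartbeat.items()
                if now - t < self.timeout}

def elect(alive_ids, incumbent):
    # Stability: keep a live incumbent rather than switching to a
    # "better" candidate; re-elect only when the leader is gone.
    if incumbent in alive_ids:
        return incumbent
    return min(alive_ids) if alive_ids else None

fd = FailureDetector()
fd.heartbeat("node-b"); fd.heartbeat("node-a")
leader = elect(fd.alive(), incumbent=None)    # -> "node-a"
leader = elect(fd.alive(), incumbent=leader)  # stays "node-a"
print(leader)
```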

    Fast, flexible, and highly resilient genuine fifo and causal multicast algorithms

    We study the FIFO and causal multicast problems, two group-communication abstractions that deliver messages in an order consistent with their context. With FIFO multicast, the context of a message m at a process p is the set of messages previously multicast by m's sender and addressed to p. Causal multicast extends the notion of context to all messages that are causally linked to m by a chain of multicast and delivery events. We propose multicast algorithms for systems composed of a set of disjoint groups of processes, such as server racks or data centers. These algorithms offer several desirable properties: (i) the protocols are latency-optimal; (ii) to deliver a message m, only m's sender and addressees communicate; (iii) messages can be addressed to any subset of groups; and (iv) the algorithms are highly resilient: an arbitrary number of process failures is tolerated, and we only require the network to be quasi-reliable, i.e., a message m is guaranteed to be received only if m's sender and receiver are always up. To the best of our knowledge, these are the first multicast protocols to offer all of these properties at the same time.
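
    As a minimal sketch of the FIFO delivery rule this definition of context implies, the Python below buffers out-of-order messages and delivers them per sender in sequence order, assuming each sender keeps a separate counter per addressee. The names are illustrative, and the resilience machinery (quasi-reliable links, failure handling) is omitted.

```python
from collections import defaultdict

class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(int)   # per-sender expected seqno
        self.pending = defaultdict(dict)   # per-sender out-of-order buffer

    def on_receive(self, sender, seq, payload):
        self.pending[sender][seq] = payload
        # Deliver in the sender's order; buffer gaps until they fill.
        while self.next_seq[sender] in self.pending[sender]:
            msg = self.pending[sender].pop(self.next_seq[sender])
            print(f"deliver {msg} from {sender}")
            self.next_seq[sender] += 1

r = FifoReceiver()
r.on_receive("p", 1, "m2")   # arrives early: buffered
r.on_receive("p", 0, "m1")   # fills the gap: m1 then m2 delivered
```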

    Optimistic algorithms for partial database replication

    In this paper, we study the problem of partial database replication. Numerous previous works have investigated database replication; however, most of them focus on full replication. We are here interested in genuine partial replication protocols, which require replicas to permanently store only information about the data items they replicate. We define two properties to characterize partial replication. The first one, Quasi-Genuine Partial Replication, captures the above idea; the second one, Non-Trivial Certification, rules out solutions that would abort transactions unnecessarily in an attempt to ensure the first property. We also present two algorithms that extend the Database State Machine [8] to partial replication and guarantee the two aforementioned properties. Our algorithms compare favorably to existing solutions in terms of both the number of messages and the number of communication steps.
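
    To make the certification idea concrete, here is a hedged sketch of the read-write conflict test at the heart of DBSM-style certification [8]. The data structures are a simplification: they ignore how the log is disseminated and truncated, which is precisely where the two properties above come into play.

```python
def certify(tx_readset, tx_snapshot, committed_log):
    # committed_log: list of (commit_version, writeset) pairs.
    # T aborts if a transaction that committed after T's snapshot
    # wrote an item that T read (a read-write conflict).
    for version, writeset in committed_log:
        if version > tx_snapshot and writeset & tx_readset:
            return "ABORT"
    return "COMMIT"

log = [(1, {"x"}), (2, {"y"})]
print(certify({"x"}, tx_snapshot=1, committed_log=log))  # COMMIT
print(certify({"y"}, tx_snapshot=1, committed_log=log))  # ABORT
```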

    P-Store: genuine partial replication in wide area networks

    Partial replication is a way to increase the scalability of replicated systems, since updates only need to be applied to a subset of the system's sites, thus allowing replicas to handle independent parts of the workload in parallel. In this paper, we propose P-Store, a partial database replication protocol for wide area networks. In P-Store, each transaction T optimistically executes on one or more sites and is then certified to guarantee serializability of the execution. The certification protocol is genuine: it only involves sites that replicate data items read or written by T, and it incorporates a mechanism to minimize a convoy effect. P-Store makes thrifty use of an atomic multicast service to guarantee correctness: no messages need to be multicast during T's execution, and a single message is multicast to certify T. This is in contrast to previously proposed solutions that either require multiple atomic multicast messages to execute T, are non-genuine, or do not allow transactions to execute on multiple sites. Experimental evaluations reveal that the convoy effect plays an important role even when only one percent of the transactions are global, that is, involve multiple sites. We also compare the scalability of our approach to that of a fully replicated solution when the proportion of global transactions and the number of sites vary.
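
    The "single message multicast to certify T" pattern can be sketched as follows; `placement` and `amcast` are stand-ins for the real data placement and atomic multicast service, and the certification vote itself is omitted.

```python
from dataclasses import dataclass

@dataclass
class Tx:
    id: str
    readset: set
    writeset: set

placement = {"x": "g1", "y": "g2"}   # data item -> replicating group

def amcast(groups, msg):             # stand-in for atomic multicast
    print("amcast to", sorted(groups), ":", msg[0], msg[1])

def request_commit(tx):
    dests = {placement[i] for i in tx.readset | tx.writeset}
    # Genuine: only groups replicating items T read or wrote take part,
    # and a single message is multicast to certify T.
    amcast(dests, ("CERTIFY", tx.id, tx.readset, tx.writeset))

request_commit(Tx("t1", readset={"x"}, writeset={"y"}))
```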

    Genuine versus non-genuine atomic multicast protocols

    In this paper, we study atomic multicast, a fundamental abstraction for building fault-tolerant systems. We suppose a system composed of a few data centers, or groups, each hosting many processes connected through high-end local links; the groups themselves are interconnected through high-latency communication links. In this context, a recent paper has shown that no multicast protocol can both deliver messages addressed to multiple groups in one inter-group delay and be genuine, i.e., involve only the addressees of a message m in delivering m. We first survey and analytically compare existing multicast algorithms to identify latency-optimal ones. We then propose a non-genuine multicast protocol that may deliver messages addressed to multiple groups in one inter-group delay. Experimental comparisons against a latency-optimal genuine protocol show that the non-genuine protocol offers better performance in all considered scenarios except in large and highly loaded systems. To complete our study, we also evaluate a latency-optimal protocol that tolerates disasters, i.e., group crashes.
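
    The trade-off at stake can be stated in a few lines of illustrative Python: a genuine protocol involves exactly the addressee groups of each message, whereas a non-genuine one may enlist every group, e.g., as a global ordering layer, which is what lets it save an inter-group delay at the price of extra load on uninvolved groups.

```python
ALL_GROUPS = {"g1", "g2", "g3", "g4"}

def participants_genuine(dest_groups):
    # Only the addressed groups take part in ordering the message.
    return set(dest_groups)

def participants_non_genuine(dest_groups):
    # Every group participates, e.g., in a system-wide ordering round.
    return ALL_GROUPS

print(participants_genuine({"g1", "g2"}))       # {'g1', 'g2'}
print(participants_non_genuine({"g1", "g2"}))   # all four groups
```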

    On multicast primitives in large networks and partial replication protocols

    Recent years have seen the rapid growth of web-based applications such as search engines, social networks, and e-commerce platforms. As a consequence, our daily life activities rely on computers more and more each day. Designing reliable computer systems has thus become of prime importance. Reliability alone is not sufficient, however: these systems must support high loads of client requests and, as a result, scalability is highly valued as well. In this thesis, we address the design of fault-tolerant computer systems. More precisely, we investigate the feasibility of designing scalable database systems that offer the illusion of accessing a single copy of a database, despite failures. This study is carried out in the context of large networks composed of several groups of machines located in the same geographical region. Groups may be data centers, each located in a local area network, connected through high-latency links. In these settings, the goal is to minimize the use of inter-group links. We mask failures using data replication: if one copy of the data is not available, a replica is accessed instead. Guaranteeing data consistency in the presence of failures while offering good performance constitutes the main challenge of this thesis. To reach this goal, we first study fault-tolerant multicast communication primitives that offer various message ordering guarantees. We then rely on these multicast abstractions to propose replication protocols in which machines hold a subset of the application's data, denoted as partial replication. In contrast to full replication, partial replication may offer better scalability since updates need not be applied to every machine in the system.

    More specifically, this thesis makes contributions in the distributed systems domain and in the database domain. In the distributed systems domain, we first devise FIFO and causal multicast algorithms, primitives that ease the design of replicated data management protocols, as we will show. We then study atomic multicast, a basic building block for synchronous database replication. Two failure models are considered: one in which groups are correct, i.e., each group contains at least one process that is always up, and one in which groups may fail entirely. We show a tight lower bound on the minimum number of inter-group message delays required for atomic multicast in the first failure model. When an arbitrary number of processes may fail and process failures cannot be predicted, we demonstrate that erroneous process failure suspicions cannot be tolerated. We then present atomic multicast protocols for the case of correct and faulty groups and empirically compare their performance. The majority of the proposed algorithms are latency-optimal.

    In the database domain, we extend the database state machine (DBSM), a previously proposed full replication technique, to partial replication. In the DBSM, transactions are executed locally at one database site according to the strict two-phase locking policy. To ensure global data consistency, a certification protocol is triggered at the end of each transaction. We present three certification protocols that differ in the communication primitives they use and in the amount of transaction-related information they store. The first two algorithms are tailored for local area networks and ensure that sites unrelated to a transaction T permanently store only the identifier of T. The third protocol is more generic since it is not customized for any type of network. Furthermore, with this protocol, only sites that replicate data items read or updated by T are involved in T's certification.
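
    A hedged sketch of the "identifier only" idea behind the first two certification protocols: a site that replicates none of the items T accessed discards T's read and write sets after certification and keeps just T's identifier. The names are illustrative, and the protocols themselves define exactly when this pruning is safe.

```python
class Site:
    def __init__(self, replicated_items):
        self.items = replicated_items
        self.log = {}                 # tx id -> retained certification data

    def record(self, tx_id, readset, writeset):
        if self.items & (readset | writeset):
            self.log[tx_id] = (readset, writeset)  # related: keep full sets
        else:
            self.log[tx_id] = None                 # unrelated: id only

s = Site({"x"})
s.record("t1", {"x"}, set())      # related to item x: full sets kept
s.record("t2", {"y"}, {"z"})      # unrelated: only the identifier
print(s.log)
```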