18 research outputs found
Rethinking State-Machine Replication for Parallelism
State-machine replication, a fundamental approach to designing fault-tolerant
services, requires commands to be executed in the same order by all replicas.
Moreover, command execution must be deterministic: each replica must produce
the same output upon executing the same sequence of commands. These
requirements usually result in single-threaded replicas, which hinders service
performance. This paper introduces Parallel State-Machine Replication (P-SMR),
a new approach to parallelism in state-machine replication. P-SMR scales better
than previous proposals since no component plays a centralizing role in the
execution of independent commands---those that can be executed concurrently, as
defined by the service. The paper introduces P-SMR, describes a "commodified
architecture" to implement it, and compares its performance to other proposals
using a key-value store and a networked file system
Dynamic group communication
Group communication is the basic infrastructure for implementing fault-tolerant replicated servers. While group communication is well understood in the context of static groups (in which the membership does not change), current specifications of dynamic group communication (in which processes can join and leave groups during the computation) have not yet reached the same level of maturity. The paper proposes new specifications - in the primary partition model - for dynamic reliable broadcast (simply called "reliable multicastâ), dynamic atomic broadcast (simply called "atomic multicastâ) and group membership. In the special case of a static system, the new specifications are identical to the well known static specifications. Interestingly, not only are these new specifications "syntacticallyâ close to the static specifications, but they are also "semanticallyâ close to the dynamic specifications proposed in the literature. We believe that this should contribute to clarify a topic that has always been difficult to understand by outsiders. Finally, the paper shows how to solve atomic multicast, group membership and reliable broadcast. The solution of atomic multicast is close to the (static) atomic broadcast solution based on reduction to consensus. Group membership is solved using atomic multicast. Reliable multicast can be efficiently solved by relying on a thrifty generic multicast algorith
Dynamic Group Communication
Group communication is the basic infrastructure for implementing fault-tolerant replicated servers. While group communication is well understood in the context of static systems (in which all processes are created at the start of the computation), current specifications of dynamic group communication (in which processes can be added and removed during the computation) are not satisfactory. The paper proposes new specifications for dynamic reliable broadcast (which we call reliable multicast), dynamic atomic broadcast (which we call atomic multicast) and group membership. In the special case of a static system, our specifications are identical to the well known static specifications. The specification of group membership is derived from the specification of atomic multicast. The paper also shows how to solve atomic multicast, group membership and reliable broadcast. The solution of atomic multicast is close to the (static) atomic broadcast solution based on reduction to consensus. Group membership is solved using atomic multicast. In the context of reliable multicast, we introduce the notion of thrifty solution, and show that such a solution can be obtained by relying on a thrifty generic multicast algorithm
Scalable service-oriented replication with flexible consistency guarantee in the cloud
Replication techniques are widely applied in and for cloud to improve scalability and availability. In such context, the well-understood problem is how to guarantee consistency amongst different replicas and govern the trade-off between consistency and scalability requirements. Such requirements are often related to specific services and can vary considerably in the cloud. However, a major drawback of existing service-oriented replication approaches is that they only allow either restricted consistency or none at all. Consequently, service-oriented systems based on such replication techniques may violate consistency requirements or not scale well. In this paper, we present a Scalable Service Oriented Replication (SSOR) solution, a middleware that is capable of satisfying applicationsâ consistency requirements when replicating cloud-based services. We introduce new formalism for describing services in service-oriented replication. We propose the notion of consistency regions and relevant service oriented requirements policies, by which trading between consistency and scalability requirements can be handled within regions. We solve the associated sub-problem of atomic broadcasting by introducing a Multi-fixed Sequencers Protocol (MSP), which is a requirements aware variation of the traditional fixed sequencer approach. We also present a Region-based Election Protocol (REP) that elastically balances the workload amongst sequencers. Finally, we experimentally evaluate our approach under different loads, to show that the proposed approach achieves better scalability with more flexible consistency constraints when compared with the state-of-the-art replication technique
Solving atomic multicast when groups crash
In this paper, we study the atomic multicast problem, a fundamental abstraction for building faulttolerant systems. In the atomic multicast problem, the system is divided into non-empty and disjoint groups of processes. Multicast messages may be addressed to any subset of groups, each message possibly being multicast to a different subset. Several papers previously studied this problem either in local area networks [3, 9, 20] or wide area networks [13, 21]. However, none of them considered atomic multicast when groups may crash. We present two atomic multicast algorithms that tolerate the crash of groups. The first algorithm tolerates an arbitrary number of failures, is genuine (i.e., to deliver a message m, only addressees of m are involved in the protocol), and uses the perfect failures detector P. We show that among realistic failure detectors, i.e., those that do not predict the future, P is necessary to solve genuine atomic multicast if we do not bound the number of processes that may fail. Thus, P is the weakest realistic failure detector for solving genuine atomic multicast when an arbitrary number of processes may crash. Our second algorithm is non-genuine and less resilient to process failures than the first algorithm but has several advantages: (i) it requires perfect failure detection within groups only, and not across the system, (ii) as we show in the paper it can be modified to rely on unreliable failure detection at the cost of a weaker liveness guarantee, and (iii) it is fast, messages addressed to multiple groups may be delivered within two inter-group message delays only
Hermes: a Fast, Fault-Tolerant and Linearizable Replication Protocol
Today's datacenter applications are underpinned by datastores that are
responsible for providing availability, consistency, and performance. For high
availability in the presence of failures, these datastores replicate data
across several nodes. This is accomplished with the help of a reliable
replication protocol that is responsible for maintaining the replicas
strongly-consistent even when faults occur. Strong consistency is preferred to
weaker consistency models that cannot guarantee an intuitive behavior for the
clients. Furthermore, to accommodate high demand at real-time latencies,
datastores must deliver high throughput and low latency.
This work introduces Hermes, a broadcast-based reliable replication protocol
for in-memory datastores that provides both high throughput and low latency by
enabling local reads and fully-concurrent fast writes at all replicas. Hermes
couples logical timestamps with cache-coherence-inspired invalidations to
guarantee linearizability, avoid write serialization at a centralized ordering
point, resolve write conflicts locally at each replica (hence ensuring that
writes never abort) and provide fault-tolerance via replayable writes. Our
implementation of Hermes over an RDMA-enabled reliable datastore with five
replicas shows that Hermes consistently achieves higher throughput than
state-of-the-art RDMA-based reliable protocols (ZAB and CRAQ) across all write
ratios while also significantly reducing tail latency. At 5% writes, the tail
latency of Hermes is 3.6X lower than that of CRAQ and ZAB.Comment: Accepted in ASPLOS 202
Solving Agreement Problems with Weak Ordering Oracles
Agreement problems, such as consensus, atomic broadcast, and group membership, are central to the implementation of fault-tolerant distributed systems. Despite the diversity of algorithms that have been proposed for solving agreement problems in the past years, almost all solutions are crash detection based (CDB). We say that an algorithm is CDB if it uses some information about the status crashed/not crashed of processes. Randomized consensus algorithms are rare exceptions non-CDB algorithms. In this paper, we revisit the issue of non-CDB algorithms. Instead of randomization, we consider ordering oracles. Ordering oracles have a theoretical interest (e.g., they extend the state of the art of non-CDB algorithms) as well as a practical interest (e.g., they remove altogether the burden involved in tuning timeout mechanisms). To illustrate their use, we present solutions to consensus and atomic broadcast, and evaluate the performance of the atomic broadcast algorithm in a cluster of workstations
Dependable Systems
Improving the dependability of computer systems is a critical and essential task. In this context, the paper surveys techniques that allow to achieve fault tolerance in distributed systems by replication. The main replication techniques are first explained. Then group communication is introduced as the communication infrastructure that allows the implementation of the different replication techniques. Finally the difficulty of implementing group communication is discussed, and the most important algorithms are presented