106,925 research outputs found

    High-performance state-machine replication

    Get PDF
    Replication, a common approach to protecting applications against failures, refers to maintaining several copies of a service on independent machines (replicas). Unlike a stand-alone service, a replicated service remains available to its clients despite the failure of some of its copies. Consistency among replicas is an immediate concern raised by replication. In effect, an important factor for providing the illusion of an uninterrupted service to clients is to preserve consistency among the multiple copies. State-machine replication is a popular replication technique that ensures consistency by ordering client requests and making all the replicas execute them deterministically and sequentially. The overhead of ordering the requests, and the sequentiality of request execution, the two essential requirements in realizing state-machine replication, are also the two major obstacles that prevent the performance of state-machine replication from scaling. In this thesis we concentrate on the performance of state-machine replication and enhance it by overcoming the two aforementioned bottlenecks, the overhead of ordering and the overhead of sequentially executing commands. To realize a truly scalable system, one must iteratively examine and analyze all the layers and components of a system and avoid or eliminate potential performance obstructions and congestion points. In this dissertation, we iterate between optimizing the ordering of requests and the strategies of replicas at request execution, in order to stretch the performance boundaries of state-machine replication. To eliminate the negative implications of the ordering layer on performance, we devise and implement several novel and highly efficient ordering protocols. Our proposals are based on practical observations we make after closely assessing and identifying the shortcomings of existing approaches. Communication is one of the most important components of any distributed system and thus selecting efficient communication patterns is a must in designing scalable systems. We base our protocols on the most suitable communication patterns and extend their design with additional features that altogether realize our protocol's high efficiency. The outcome of this phase is the design and implementation of the Ring Paxos family of protocols. According to our evaluations these protocols are highly scalable and efficient. We then assess the performance ramifications of sequential execution of requests on the replicas of state-machine replication. We use some known techniques such as state-partitioning and speculative execution, and thoroughly examine their advantages when combined with our ordering protocols. We then exploit the features of multicore hardware and propose our final solution as a parallelized form of state-machine replication, built on top of Ring Paxos protocols, that is capable of accomplishing significantly high performance. Given the popularity of state-machine replication in designing fault-tolerant systems, we hope this thesis provides useful and practical guidelines for the enhancement of the existing and the design of future fault-tolerant systems that share similar performance goals

    Mandator and Sporades: Robust Wide-Area Consensus with Efficient Request Dissemination

    Full text link
    Consensus algorithms are deployed in the wide area to achieve high availability for geographically replicated applications. Wide-area consensus is challenging due to two main reasons: (1) low throughput due to the high latency overhead of client request dissemination and (2) network asynchrony that causes consensus protocols to lose liveness. In this paper, we propose Mandator and Sporades, a modular state machine replication algorithm that enables high performance and resiliency in the wide-area setting. To address the high client request dissemination overhead challenge, we propose Mandator, a novel consensus-agnostic asynchronous dissemination layer. Mandator separates client request dissemination from the critical path of consensus to obtain high performance. Composing Mandator with Multi-Paxos (Mandator-Paxos) delivers significantly high throughput under synchronous networks. However, under asynchronous network conditions, Mandator-Paxos loses liveness which results in high latency. To achieve low latency and robustness under asynchrony, we propose Sporades, a novel omission fault-tolerant consensus algorithm. Sporades consists of two modes of operations -- synchronous and asynchronous -- that always ensure liveness. The combination of Mandator and Sporades (Mandator-Sporades) provides a robust and high-performing state machine replication system. We implement and evaluate Mandator-Sporades in a wide-area deployment running on Amazon EC2. Our evaluation shows that in the synchronous execution, Mandator-Sporades achieves 300k tx/sec throughput in less than 900ms latency, outperforming Multi-Paxos, EPaxos and Rabia by 650\% in throughput, at a modest expense of latency. Furthermore, we show that Mandator-Sporades outperforms Mandator-Paxos, Multi-Paxos, and EPaxos in the face of targeted distributed denial-of-service attacks

    Executing requests concurrently in state machine replication

    Get PDF
    State machine replication is one of the most popular ways to achieve fault tolerance. In a nutshell, the state machine replication approach maintains multiple replicas that both store a copy of the system’s data and execute operations on that data. When requests to execute operations arrive, an “agree-execute” protocol keeps replicas synchronized: they first agree on an order to execute the incoming operations, and then execute the operations one at a time in the agreed upon order, so that every replica reaches the same final state. Multi-core processors are the norm, but taking advantage of the available processor cores to execute operations simultaneously is at odds with the “agree-execute” protocol: simultaneous execution is inherently unpredictable, so in the end replicas may arrive at different final states and the system becomes inconsistent. On one hand, we want to take advantage of the available processor cores to execute operations simultaneously and improve performance. But on the other hand, replicas must abide by the operation order that they agreed upon for the system to remain consistent. This dissertation proposes a solution to this dilemma. At a high level, we propose to use speculative execution techniques to execute operations simultaneously while nonetheless ensuring that their execution is equivalent to having executed the operations sequentially in the order the replicas agreed upon. To achieve this, we: (1) propose to execute operations as serializable transactions, and (2) develop a new concurrency control protocol that ensures that the concurrent execution of a set of transactions respects the serialization order the replicas agreed upon. Since speculation is only effective if it is successful, we also (3) propose a modification to the typical API to declare transactions, which allows transactions to execute their logic over an abstract replica state, resulting in fewer conflicts between transactions and thus improving the effectiveness of the speculative executions. An experimental evaluation shows that the contributions in this dissertation can improve the performance of a state-machine-replicated server up to 4 , reaching up to 75% the performance of a concurrent fault-prone server

    Rethinking State-Machine Replication for Parallelism

    Full text link
    State-machine replication, a fundamental approach to designing fault-tolerant services, requires commands to be executed in the same order by all replicas. Moreover, command execution must be deterministic: each replica must produce the same output upon executing the same sequence of commands. These requirements usually result in single-threaded replicas, which hinders service performance. This paper introduces Parallel State-Machine Replication (P-SMR), a new approach to parallelism in state-machine replication. P-SMR scales better than previous proposals since no component plays a centralizing role in the execution of independent commands---those that can be executed concurrently, as defined by the service. The paper introduces P-SMR, describes a "commodified architecture" to implement it, and compares its performance to other proposals using a key-value store and a networked file system

    MOON: MapReduce On Opportunistic eNvironments

    Get PDF
    Abstract—MapReduce offers a flexible programming model for processing and generating large data sets on dedicated resources, where only a small fraction of such resources are every unavailable at any given time. In contrast, when MapReduce is run on volunteer computing systems, which opportunistically harness idle desktop computers via frameworks like Condor, it results in poor performance due to the volatility of the resources, in particular, the high rate of node unavailability. Specifically, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. The adaptive task and data scheduling algorithms in MOON distinguish between (1) different types of MapReduce data and (2) different types of node outages in order to strategically place tasks and data on both volatile and dedicated nodes. Our tests demonstrate that MOON can deliver a 3-fold performance improvement to Hadoop in volatile, volunteer computing environments

    State Machine Replication:from Analytical Evaluation to High-Performance Paxos

    Get PDF
    Since their invention more than half a century ago, computers have gone from being just an handful of expensive machines each filling an entire room, to being an integral part of almost every aspect of modern life. Nowadays computers are everywhere: in our planes, in our cars, on our desks, in our home appliances, and even in our pockets. This widespread adoption had a profound impact in our world and in our lives, so much that now we rely on them for many important aspects of everyday life, including work, communication, travel, entertainment, and even managing our money. Given our increased reliance on computers, their continuous and correct operation has become essential for modern society. However, individual computers can fail due to a variety of causes and, if nothing is done about it, these failures can easily lead to a disruption of the service provided by computer system. The field of fault tolerance studies this problem, more precisely, it studies how to enable a computer system to continue operation in spite of the failure of individual components. One of the most popular techniques of achieving fault tolerance is software replication, where a service is replicated on an ensemble of machines (replicas) such that if some of these machines fail, the others will continue providing the service. Software replication is widely used because of its generality (can be applied to most services) and its low cost (can use off-the-shelf hardware). This thesis studies a form of software replication, namely, state machine replication, where the service is modeled as a deterministic state machine whose state transitions consist of the execution of client requests. Although state machine replication was first proposed almost 30 years ago, the proliferation of online services during the last years has led to a renewed interest. Online services must be highly available and for that they frequently rely on state machine replication as part of their fault tolerance mechanisms. However, the unprecedented scale of these services, which frequently have hundreds of thousands or even millions of users, leads to a new set performance requirements on state machine replication. This thesis is organized in two parts. The goal of the first part is to study from a theoretical perspective the performance characteristics of the algorithms behind state machine replication and to propose improved variants of such algorithms. The second part looks at the problem from a practical perspective, proposing new techniques to achieve high-throughput and scalability. In the first part, we start with an analytical analysis of the performance of two consensus algorithms, one leader-free (an adaptation of the fast round of Fast Paxos) and another leader-based (an adaptation of classical Paxos). We express these algorithms in the Heard-Of round model and show that using this model it is fairly easy to determine analytically several interesting performance metrics. We then study the performance of round models in general. Round models are perceived as inefficient because in their typical implementation, the real-time duration of rounds is proportional to the (pessimistic) timeouts used on the underlying system. This contrasts with the failure detector or the partial synchronous system models, where algorithms usually progress at the speed of message reception. We show that there is no inherent gap in performance between the models, by proposing a round implementation that during stable periods advances at the speed of message reception. We conclude the first part by presenting a new leader election algorithm that chooses as leader a well-connected process, that is, a process whose time needed to perform a one-to-majority communication round is among the lowest in the system. This is useful mainly in systems where the latency between processes is not homogeneous, because the performance of leader-based algorithms is particularly sensitive to the performance and connectivity of the process acting as a leader. The second part of the thesis studies different approaches to achieve high-throughput with state machine replication. To support the experimental work done in this part, we have developed JPaxos, a fully-featured implementation of Paxos in Java. We start by looking at how to tune the batching and pipelining optimizations of Paxos; using an analytical model of the performance of Paxos we show how to derive good values for the bounds on the batch size and number of parallel instances. We then propose an architecture for implementing replicated state machines that is capable of leveraging multi-core CPUs to achieve very high-levels of performance. The final contribution of this thesis is based on the observation that most implementations of state machine replication have an unbalanced division of work among threads, with one replica, the leader, having a significantly higher workload than the other replicas. Naturally, the leader becomes the bottleneck of the system, while other replicas are only lightly loaded. We propose and evaluate S-Paxos, which evenly balances the workload among all replicas, and thus overcomes the leader bottleneck. The benefits are two-fold: S-Paxos achieves a higher throughput for a given number of replicas and its performance increases with the number of replicas (up to a reasonable number)

    Optimistic Parallel State-Machine Replication

    Full text link
    State-machine replication, a fundamental approach to fault tolerance, requires replicas to execute commands deterministically, which usually results in sequential execution of commands. Sequential execution limits performance and underuses servers, which are increasingly parallel (i.e., multicore). To narrow the gap between state-machine replication requirements and the characteristics of modern servers, researchers have recently come up with alternative execution models. This paper surveys existing approaches to parallel state-machine replication and proposes a novel optimistic protocol that inherits the scalable features of previous techniques. Using a replicated B+-tree service, we demonstrate in the paper that our protocol outperforms the most efficient techniques by a factor of 2.4 times
    • …
    corecore