Incremental Consistency Guarantees for Replicated Objects
Programming with replicated objects is difficult. Developers must face the
fundamental trade-off between consistency and performance head-on, while
struggling with the complexity of distributed storage stacks. We introduce
Correctables, a novel abstraction that hides most of this complexity, allowing
developers to focus on the task of balancing consistency and performance. To
aid developers with this task, Correctables provide incremental consistency
guarantees, which capture successive refinements on the result of an ongoing
operation on a replicated object. In short, applications receive both a
preliminary---fast, possibly inconsistent---result, as well as a
final---consistent---result that arrives later.
We show how to leverage incremental consistency guarantees by speculating on
preliminary values, trading throughput and bandwidth for improved latency. We
experiment with two popular storage systems (Cassandra and ZooKeeper) and three
applications: a Twissandra-based microblogging service, an ad serving system,
and a ticket selling system. Our evaluation on the Amazon EC2 platform with
YCSB workloads A, B, and C shows that we can reduce the latency of strongly
consistent operations by up to 40% (from 100ms to 60ms) at little cost (10%
bandwidth increase, 6% throughput drop) in the ad system. Even if the
preliminary result is frequently inconsistent (25% of accesses), incremental
consistency incurs a bandwidth overhead of only 27%.
Comment: 16 total pages, 12 figures. To appear in OSDI'16.
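The incremental consistency guarantees described above can be pictured as a promise-like object that fires once with a fast, possibly stale value and again with the final, consistent one. The sketch below is illustrative only, not the paper's actual Correctables API; the two stores stand in for weak and strong reads of a replicated object.

```python
import threading

class Correctable:
    """Holds successive refinements of one read: first a preliminary
    (fast, possibly inconsistent) value, then the final (consistent) one."""
    def __init__(self):
        self._callbacks = []
        self.updates = []               # delivered (value, is_final) pairs
        self._done = threading.Event()

    def on_update(self, cb):
        self._callbacks.append(cb)      # cb(value, is_final)

    def deliver(self, value, final):
        self.updates.append((value, final))
        for cb in self._callbacks:
            cb(value, final)
        if final:
            self._done.set()

    def result(self, timeout=None):
        """Block until the final, consistent value is available."""
        self._done.wait(timeout)
        return self.updates[-1][0]

def incremental_read(key, weak_store, strong_store):
    """Issue a weak (fast) read and a strong (slow) read of the same key."""
    c = Correctable()
    c.deliver(weak_store.get(key), final=False)   # preliminary result
    c.deliver(strong_store.get(key), final=True)  # final result
    return c
```

An application can speculate on the preliminary value (e.g. start assembling a response) and commit only once the final value confirms it, which is how the paper trades some bandwidth and throughput for latency.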
Optimistic Parallel State-Machine Replication
State-machine replication, a fundamental approach to fault tolerance,
requires replicas to execute commands deterministically, which usually results
in sequential execution of commands. Sequential execution limits performance
and underuses servers, which are increasingly parallel (i.e., multicore). To
narrow the gap between state-machine replication requirements and the
characteristics of modern servers, researchers have recently come up with
alternative execution models. This paper surveys existing approaches to
parallel state-machine replication and proposes a novel optimistic protocol
that inherits the scalable features of previous techniques. Using a replicated
B+-tree service, we demonstrate in the paper that our protocol outperforms the
most efficient techniques by a factor of 2.4.
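A core idea in parallel state-machine replication is that commands touching disjoint state can execute concurrently, as long as conflicting commands still respect the agreed order. The scheduler below is my own illustrative sketch, not the surveyed protocols or the paper's optimistic one; each command declares the keys it touches.

```python
def parallel_batches(commands):
    """Group an ordered command log into batches whose members may run
    in parallel: two commands conflict when they touch a common key,
    and a command must run after every earlier conflicting command."""
    batches = []              # each inner list may execute concurrently
    last_batch_of_key = {}    # key -> index of the last batch touching it
    for cmd, keys in commands:
        # Earliest batch that comes after all batches touching our keys.
        earliest = max((last_batch_of_key.get(k, -1) for k in keys),
                       default=-1) + 1
        if earliest == len(batches):
            batches.append([])
        batches[earliest].append(cmd)
        for k in keys:
            last_batch_of_key[k] = earliest
    return batches
```

Non-conflicting commands collapse into the same batch, so a workload of mostly independent commands keeps every core busy, while fully conflicting workloads degrade gracefully to sequential execution.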
Replicating multithreaded services
For the last 40 years, the systems community has invested a lot of effort in designing techniques for building fault-tolerant distributed systems and services. This effort has produced a massive list of results: the literature describes how to design replication protocols that tolerate a wide range of failures (from simple crashes to malicious "Byzantine" failures) in a wide range of settings (e.g., synchronous or asynchronous communication, with or without stable storage), optimizing various metrics (e.g., number of messages, latency, throughput).
These techniques have their roots in ideas, such as the abstraction of State Machine Replication and the Paxos protocol, that were conceived when computing was very different from what it is today: computers had a single core; all processing was done using a single thread of control, handling requests sequentially; and a collection of 20 nodes was considered a large distributed system. In the last decade, however, computing has gone through some major paradigm shifts, with the advent of multicore architectures and large cloud infrastructures. This dissertation explains how these profound changes impact the practical usefulness of traditional fault-tolerant techniques and proposes new ways to architect these solutions to fit the new paradigms.
Distributed replicated macro-components
Dissertation for the degree of Master in Computer Engineering (Engenharia Informática).
In recent years, several approaches have been proposed for improving application
performance on multi-core machines. However, exploring the power of multi-core processors
remains complex for most programmers. A Macro-component is an abstraction
that tries to tackle this problem by allowing programmers to exploit the power of
multi-core machines without requiring changes to their programs. A Macro-component
encapsulates several diverse implementations of the same specification. This makes it
possible to obtain the best performance for each operation and/or to distribute load
among replicas, while keeping contention and synchronization overhead to a minimum.
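The Macro-component idea can be sketched as a thin wrapper that forwards updates to every internal replica (keeping them equivalent) and routes each read-only operation to the implementation registered as best for it. The class, method names, and routing table below are hypothetical, for illustration only.

```python
class MacroComponent:
    """Wraps several implementations of one specification.

    Updates are applied to every replica so all stay equivalent; each
    read-only operation is dispatched to the replica registered as the
    fastest for that operation."""
    def __init__(self, replicas, best_for):
        self.replicas = replicas    # diverse implementations, same spec
        self.best_for = best_for    # op name -> index of fastest replica

    def update(self, op, *args):
        # Apply the mutation to all replicas; return one result.
        results = [getattr(r, op)(*args) for r in self.replicas]
        return results[0]

    def read(self, op, *args):
        # Route the read to the implementation best suited for this op.
        r = self.replicas[self.best_for.get(op, 0)]
        return getattr(r, op)(*args)
```

For example, wrapping two dictionary replicas keeps them in sync on writes while letting reads go to whichever implementation is registered for `get`.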
In real-world applications, relying on only one server to provide a service leads to
limited fault-tolerance and scalability. To address this problem, it is common to replicate
services on multiple machines. This work addresses the problem of supporting such
a replication solution, while exploring the power of multi-core machines.
To this end, we propose to support the replication of Macro-components in a cluster of
machines. In this dissertation we present the design of a middleware solution for achieving
this goal. Using the implemented replication middleware, we have successfully deployed
a replicated Macro-component of in-memory databases, which are known to have scalability
problems on multi-core machines. The proposed solution combines multi-master
replication across nodes with primary-secondary replication within a node, where several
instances of the database run on a single machine. This approach deals with
the lack of scalability of databases on multi-core systems while minimizing communication
costs, which ultimately results in an overall improvement of the service. Results show
that the proposed solution is able to scale as the number of nodes and clients increases.
They also show that the solution is able to take advantage of multi-core architectures.
RepComp project (PTDC/EIAEIA/108963/2008)
High-performance state-machine replication
Replication, a common approach to protecting applications against failures, refers to maintaining several copies of a service on independent machines (replicas). Unlike a stand-alone service, a replicated service remains available to its clients despite the failure of some of its copies. Consistency among replicas is an immediate concern raised by replication: an important factor in providing the illusion of an uninterrupted service to clients is to preserve consistency among the multiple copies. State-machine replication is a popular replication technique that ensures consistency by ordering client requests and making all the replicas execute them deterministically and sequentially. The overhead of ordering the requests and the sequentiality of request execution, the two essential requirements in realizing state-machine replication, are also the two major obstacles that prevent the performance of state-machine replication from scaling.
In this thesis we concentrate on the performance of state-machine replication and enhance it by overcoming the two aforementioned bottlenecks: the overhead of ordering and the overhead of sequentially executing commands. To realize a truly scalable system, one must iteratively examine and analyze all the layers and components of a system and avoid or eliminate potential performance obstructions and congestion points. In this dissertation, we iterate between optimizing the ordering of requests and the strategies of replicas at request execution, in order to stretch the performance boundaries of state-machine replication. To eliminate the negative implications of the ordering layer on performance, we devise and implement several novel and highly efficient ordering protocols. Our proposals are based on practical observations we make after closely assessing and identifying the shortcomings of existing approaches.
Communication is one of the most important components of any distributed system, and thus selecting efficient communication patterns is a must in designing scalable systems. We base our protocols on the most suitable communication patterns and extend their design with additional features that altogether realize our protocols' high efficiency. The outcome of this phase is the design and implementation of the Ring Paxos family of protocols. According to our evaluations, these protocols are highly scalable and efficient.
We then assess the performance ramifications of sequential execution of requests on the replicas of state-machine replication. We use some known techniques, such as state partitioning and speculative execution, and thoroughly examine their advantages when combined with our ordering protocols. We then exploit the features of multicore hardware and propose our final solution as a parallelized form of state-machine replication, built on top of the Ring Paxos protocols, that is capable of accomplishing significantly high performance. Given the popularity of state-machine replication in designing fault-tolerant systems, we hope this thesis provides useful and practical guidelines for the enhancement of existing fault-tolerant systems and the design of future ones that share similar performance goals.
Reducing the Latency of Dependent Operations in Large-Scale Geo-Distributed Systems
Many applications rely on large-scale distributed systems for data management and computation. These distributed systems are complex and built from different networked services. Dependencies between these services can create a chain of dependent network I/O operations that have to be executed sequentially. This can result in high service latencies, especially when the chain consists of inter-datacenter operations.
To address the latency problem of executing dependent network I/O operations, this thesis introduces new approaches and techniques to reduce the required number of operations that have to be executed sequentially for three system types. First, it addresses the high transaction completion time in geo-distributed database systems that have data sharded and replicated across different geographical regions. For a single transaction, most existing systems sequentially execute reads, writes, 2PC, and a replication protocol because of dependencies between these parts. This thesis looks at using a more restrictive transaction model in order to break dependencies and allow different parts to execute in parallel.
Second, dependent network I/O operations also lead to high latency for performing leader-based state machine replication across a wide-area network. Fast Paxos introduces a fast path that bypasses the leader for request ordering. However, when concurrent requests arrive at replicas in different orders, the fast path may fail, and Fast Paxos has to fall back to a slow path. This thesis explores the use of network measurements to establish a global order for requests across replicas, allowing Fast Paxos to be effective for concurrent requests.
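The measurement-based ordering idea can be illustrated simply: if replicas order concurrent requests by an estimated global send time rather than by local arrival order, they converge on one sequence, and the leaderless fast path stops failing for concurrent requests. The function and tuple layout below are my own toy sketch, not the thesis's actual mechanism.

```python
def global_order(arrivals):
    """Order requests by (estimated send timestamp, request id) instead
    of local arrival order. With accurate network measurements supplying
    the timestamps, replicas that receive the same concurrent requests
    in different orders still agree on a single sequence, so the Fast
    Paxos fast path can succeed without falling back to the leader.

    `arrivals` is a list of (timestamp, request_id) pairs; the id breaks
    ties deterministically."""
    return sorted(arrivals, key=lambda r: (r[0], r[1]))
```

Two replicas seeing `r1` and `r2` in opposite arrival orders would both deliver them in timestamp order, which is the agreement the fast path needs.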
Finally, this thesis proposes a general solution to reduce the latency impact of dependent operations in distributed systems through the use of speculative execution. For many systems, domain knowledge can be used to predict an operation’s result and speculatively execute subsequent operations, potentially allowing a chain of dependent operations to execute in parallel. This thesis introduces a framework that provides system-level support for performing speculative network I/O operations.
These three approaches reduce the number of sequentially performed network I/O operations in different domains. Our performance evaluation shows that they can significantly reduce the latency of critical infrastructure services, allowing these services to be used by latency-sensitive applications.
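The speculative-execution approach for dependent operations can be sketched as follows: predict the first operation's result from domain knowledge, start the dependent operation on the prediction in parallel, and fall back to sequential re-execution on a misprediction. The helper below is a hypothetical illustration, not the thesis framework's real API.

```python
from concurrent.futures import ThreadPoolExecutor

def speculate(first_op, predict, second_op):
    """Run second_op(first_op()) with speculation: second_op starts on a
    predicted result while first_op is still in flight. On a correct
    prediction the two overlap (one round trip instead of two); on a
    misprediction we discard the speculative work and redo sequentially."""
    guess = predict()
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(first_op)
        f2 = pool.submit(second_op, guess)   # speculative, runs on the guess
        actual = f1.result()
        if actual == guess:
            return f2.result()               # speculation hit
    return second_op(actual)                 # speculation miss: re-execute
```

The win depends entirely on prediction accuracy: a hit hides one operation's latency behind the other, while a miss costs the same as sequential execution plus the wasted speculative work.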
SoK: Understanding BFT Consensus in the Age of Blockchains
Blockchain, as an enabler of current Internet infrastructure, has provided many unique features and revolutionized distributed systems into a new era. Its decentralization, immutability, and transparency have attracted many applications to adopt the design philosophy of blockchain and customize various replicated solutions. Under the hood of blockchain, consensus protocols play the most important role in achieving distributed replication systems. The distributed-systems community has extensively studied the technical components of consensus for reaching agreement among a group of nodes. Due to trust issues, it is hard to design a resilient system in practical situations because of the existence of various faults. Byzantine fault-tolerant (BFT) state machine replication (SMR) is regarded as an ideal candidate that can tolerate arbitrary faulty behaviors. However, the inherent complexity of BFT consensus protocols and their rapid evolution make it hard to adapt them to application domains in practice. Many excellent Byzantine-based replicated solutions and ideas have been contributed to improving performance, availability, or resource efficiency. This paper conducts a systematic and comprehensive study of BFT consensus protocols, with a specific focus on the blockchain era. We explore both general principles and practical schemes for achieving consensus under Byzantine settings. We then survey, compare, and categorize the state-of-the-art solutions to understand BFT consensus in detail. For each representative protocol, we conduct an in-depth discussion of its most important architectural building blocks as well as the key techniques it uses. We aim for this paper to provide system researchers and developers a concrete view of the current design landscape and to help them find solutions to concrete problems.
Finally, we present several critical challenges and some potential research directions to advance research on BFT consensus protocols in the age of blockchains.
Intrusion tolerance in large scale networks
Doctoral thesis, Informatics (Computer Engineering), Universidade de Lisboa, Faculdade de Ciências, 2010.
The growing reliance on wide-area services demands highly available systems that provide
a correct and uninterrupted service. Therefore, Byzantine Fault-Tolerant (BFT) algorithms have
received considerable attention in recent years. A service is replicated over several servers and
can survive even in the presence of a bounded number of Byzantine failures.
The main motivation for this thesis is that for a replicated service to be fault-tolerant,
common mode failures have to be avoided. More specifically, the thesis is concerned with
common mode failures caused by natural disasters, power outages and physical attacks, which
have to be prevented by scattering replicas geographically. This requires the sites where the
replicas reside to be connected by a wide-area network (WAN) like the Internet.
Unfortunately, when the replicas are distributed geographically the performance of current
BFT algorithms is affected by the lower bandwidths, and the higher and more heterogeneous
network latencies. In order to deal with these limitations this thesis introduces novel BFT
algorithms that are simultaneously efficient and secure. Some algorithms of this thesis are based
on a hybrid fault model, i.e., considering that a part of the system is secure by construction. A
notable contribution of this thesis is the definition and implementation of a minimal trusted
service: the Unique Sequential Identifier Generator (USIG).
The thesis describes how to implement a 2f+1 Byzantine consensus algorithm using a
2f+1 reliable multicast algorithm that requires a trusted service, an abstract version
of the USIG. Then, the USIG service and the reliable multicast primitive are applied as a core
component to implement two novel BFT algorithms introduced in this thesis: MinBFT and
MinZyzzyva. These BFT algorithms are minimal in terms of number of replicas, complexity of
the trusted service used, and number of communication steps. In order to mitigate performance
degradation attacks, this thesis proposes the use of a rotating primary defining a novel BFT
algorithm, Spinning, that is less vulnerable to attacks caused by a faulty primary and attains a
throughput similar to the baseline algorithm in the area. Finally, the mechanisms and techniques developed in this thesis are combined in order to
define EBAWA, a novel BFT algorithm that is suitable for supporting the execution of wide-area
replicated services.
Programme ALBAN; Fundação para a Ciência e a Tecnologia - Portugal
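The USIG service at the heart of MinBFT and MinZyzzyva can be understood as a monotonic counter bound to each message with a MAC, so a faulty replica cannot assign the same identifier to two different messages or skip identifiers. The sketch below is illustrative only: the class and method names are mine, and a real USIG lives inside a trusted, tamper-proof component, whereas here the verifier simply shares the secret key.

```python
import hashlib
import hmac

class USIG:
    """Sketch of a Unique Sequential Identifier Generator: a trusted
    component holding a secret key and a monotonically increasing
    counter. Each certificate binds the next counter value to one
    message, giving unique, sequential, verifiable identifiers."""
    def __init__(self, key: bytes):
        self._key = key
        self._counter = 0      # never decreases, never repeats

    def create_ui(self, message: bytes):
        """Assign the next counter value to `message` and certify it."""
        self._counter += 1
        mac = hmac.new(self._key, b"%d|" % self._counter + message,
                       hashlib.sha256).hexdigest()
        return self._counter, mac

    def verify_ui(self, counter: int, mac: str, message: bytes) -> bool:
        """Check that (counter, message) was certified by the USIG."""
        expected = hmac.new(self._key, b"%d|" % counter + message,
                            hashlib.sha256).hexdigest()
        return hmac.compare_digest(mac, expected)
```

Because every certified message carries a fresh sequential identifier, equivocation by a faulty primary becomes detectable, which is what lets these protocols run with only 2f+1 replicas instead of the usual 3f+1.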