CATS: linearizability and partition tolerance in scalable and self-organizing key-value stores
Distributed key-value stores provide scalable, fault-tolerant, and self-organizing
storage services, but fall short of guaranteeing linearizable consistency
in partially synchronous, lossy, partitionable, and dynamic networks, when data
is distributed and replicated automatically by the principle of consistent hashing.
This paper introduces consistent quorums as a solution for achieving atomic
consistency. We present the design and implementation of CATS, a distributed
key-value store which uses consistent quorums to guarantee linearizability and partition tolerance in such adverse and dynamic network conditions. CATS is
scalable, elastic, and self-organizing, which are key properties for modern cloud storage
middleware. Our system shows that consistency can be achieved with practical
performance and modest throughput overhead (5%) for read-intensive workloads.
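The abstract names consistent quorums without defining them. As a minimal sketch, assuming a consistent quorum is a majority of replicas that all report the same view of the replication group, the check could look like the following. The function name, the `replies` mapping, and the view identifiers are hypothetical and not taken from CATS.

```python
from collections import Counter


def consistent_quorum(replies, group_size):
    """Return the view id shared by a majority of the group, or None.

    replies: dict mapping replica id -> view id that replica reported.
    group_size: number of replicas in the replication group.
    Assumption: "consistent" means a majority agrees on one view.
    """
    if not replies:
        return None
    majority = group_size // 2 + 1
    # Find the most frequently reported view among the replies.
    view, count = Counter(replies.values()).most_common(1)[0]
    return view if count >= majority else None
```

Under this assumption, replies of views 1, 1, 2 from a group of three yield view 1, while two replies that disagree yield no quorum and the operation would have to be retried.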
Planet-scale leaderless consensus
Doctoral Programme in Informatics of the Universities of Minho, Aveiro and Porto.
Modern web applications replicate their data across the globe and require strong consistency guarantees
for their most critical data. These guarantees are usually provided via state-machine replication
(SMR). Recent advances in SMR have focused on leaderless protocols, which improve the performance and
availability of traditional Paxos-based solutions. Although leaderless protocols have shown great promise,
they are poorly suited to planet-scale systems as they leverage large quorums, offer unpredictable performance
and have complex recovery mechanisms. In this thesis we propose two leaderless protocols,
Atlas and Tempo, tailored to planet-scale systems. Atlas minimizes the size of its quorums by making
use of the observation that concurrent data center failures are rare. It also processes a high percentage
of accesses in a single round trip, even when these conflict. Atlas achieves this while having a recovery
mechanism that is significantly simpler than that of previous leaderless protocols. Tempo builds upon
Atlas, but achieves superior throughput and offers predictable performance even in contended workloads.
To achieve these benefits, Tempo timestamps each application command and executes it only
after the timestamp becomes stable, i.e., all commands with a lower timestamp are known. Both the
timestamping and stability detection mechanisms are fully decentralized, thus obviating the need for a
leader replica. We evaluate Atlas and Tempo in both real and simulated geo-distributed environments
and demonstrate that they outperform state-of-the-art alternatives.
This work was partially supported by an FCT (Fundação para a Ciência e Tecnologia) PhD Fellowship (PD/BD/142927/2018).
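The stability rule described in the thesis abstract (execute a command only once all commands with a lower timestamp are known) can be illustrated with a small sketch. It is a simplification and not Tempo's actual protocol: it assumes each replica reports a watermark meaning "every command I timestamped at or below this value has been announced", and it treats the minimum watermark across all replicas as the stable timestamp. The class and method names are hypothetical.

```python
import heapq


class StableExecutor:
    """Executes committed commands in timestamp order once they are stable."""

    def __init__(self, replicas):
        # Assumed per-replica watermark; 0 means nothing reported yet.
        self.watermarks = {r: 0 for r in replicas}
        self.pending = []   # min-heap of (timestamp, command_id, command)
        self.executed = []  # commands in the order they were executed

    def commit(self, timestamp, command_id, command):
        # command_id breaks ties between equal timestamps.
        heapq.heappush(self.pending, (timestamp, command_id, command))

    def report_watermark(self, replica, watermark):
        # Watermarks only move forward.
        self.watermarks[replica] = max(self.watermarks[replica], watermark)

    def stable_timestamp(self):
        # Simplifying assumption: stable once every replica has passed it.
        return min(self.watermarks.values())

    def execute_stable(self):
        stable = self.stable_timestamp()
        while self.pending and self.pending[0][0] <= stable:
            _, _, command = heapq.heappop(self.pending)
            self.executed.append(command)
        return self.executed
```

No replica plays a special role here: any replica can raise its own watermark, and execution order depends only on the timestamps, which matches the decentralized flavor the abstract describes.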
State-machine replication for planet-scale systems
Online applications now routinely replicate their data at multiple sites around the world. In this paper we present Atlas, the first state-machine replication protocol tailored for such planet-scale systems. Atlas does not rely on a distinguished leader, so clients enjoy the same quality of service regardless of their geographical location. Furthermore, client-perceived latency improves as we add sites closer to clients. To achieve this, Atlas minimizes the size of its quorums using the observation that concurrent data center failures are rare. It also processes a high percentage of accesses in a single round trip, even when these conflict. We experimentally demonstrate that Atlas consistently outperforms state-of-the-art protocols in planet-scale scenarios. In particular, Atlas is up to two times faster than Flexible Paxos with identical failure assumptions, and more than doubles the performance of Egalitarian Paxos in the YCSB benchmark.
H2020 - Horizon 2020 Framework Programme (825184
Distributed Multi-writer Multi-reader Atomic Register with Optimistically Fast Read and Write
A distributed multi-writer multi-reader (MWMR) atomic register is an
important primitive that enables a wide range of distributed algorithms. Hence,
improving its performance can have large-scale consequences. Since the seminal
ABD emulation in message-passing networks [JACM '95], many
researchers have studied fast implementations of atomic registers under various
conditions. "Fast" means that a read or a write can be completed with 1
round-trip time (RTT), by contacting a simple majority. In this work, we
explore an atomic register with optimal resilience and "optimistically fast"
read and write operations. That is, both operations can be fast if there is no
concurrent write.
This paper has three contributions: (i) We present Gus, the emulation of an
MWMR atomic register with optimal resilience and optimistically fast reads and
writes when there are up to 5 nodes; (ii) We show that when there are > 5
nodes, it is impossible to emulate an MWMR atomic register with both
properties; and (iii) We implement Gus in the framework of EPaxos and Gryff,
and show that Gus provides lower tail latency than state-of-the-art systems
such as EPaxos, Gryff, Giza, and Tempo under various workloads in the context
of geo-replicated object storage systems.
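As a rough illustration of the "simple majority" contact pattern mentioned above, the sketch below computes the majority size and picks the freshest value from a set of replies, in the style of an ABD read. It is not Gus: the tag layout `(timestamp, writer_id)`, the function names, and the reading of "optimal resilience" as tolerating fewer than half the nodes crashing are assumptions made for the example.

```python
def majority(n):
    """Smallest number of nodes that forms a simple majority of n."""
    return n // 2 + 1


def max_tolerated_crashes(n):
    """Assumed resilience bound: a majority must remain reachable."""
    return (n - 1) // 2


def latest_value(replies, n):
    """Pick the value with the highest tag among replies from a majority.

    replies: list of ((timestamp, writer_id), value), one per replica.
    Tags compare lexicographically, so writer_id breaks timestamp ties.
    """
    if len(replies) < majority(n):
        raise ValueError("need replies from a majority of nodes")
    tag, value = max(replies, key=lambda reply: reply[0])
    return tag, value
```

With five nodes, three replies suffice, and a reader returns the value carrying the largest tag it saw. Whether the operation can stop after this single round trip is exactly what the "fast" property in the abstract is about.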
- …