81 research outputs found

    CATS: linearizability and partition tolerance in scalable and self-organizing key-value stores

    Distributed key-value stores provide scalable, fault-tolerant, and self-organizing storage services, but fall short of guaranteeing linearizable consistency in partially synchronous, lossy, partitionable, and dynamic networks when data is distributed and replicated automatically by the principle of consistent hashing. This paper introduces consistent quorums as a solution for achieving atomic consistency. We present the design and implementation of CATS, a distributed key-value store which uses consistent quorums to guarantee linearizability and partition tolerance in such adverse and dynamic network conditions. CATS is scalable, elastic, and self-organizing, which are key properties for modern cloud storage middleware. Our system shows that consistency can be achieved with practical performance and modest throughput overhead (5%) for read-intensive workloads.

    Planet-scale leaderless consensus

    Doctoral Programme in Informatics of the Universities of Minho, Aveiro and Porto.
    Modern web applications replicate their data across the globe and require strong consistency guarantees for their most critical data. These guarantees are usually provided via state-machine replication (SMR). Recent advances in SMR have focused on leaderless protocols, which improve the performance and availability of traditional Paxos-based solutions. Although leaderless protocols have shown great promise, they are poorly suited to planet-scale systems because they rely on large quorums, offer unpredictable performance, and have complex recovery mechanisms. In this thesis we propose two leaderless protocols, Atlas and Tempo, tailored to planet-scale systems. Atlas minimizes the size of its quorums by making use of the observation that concurrent data center failures are rare. It also processes a high percentage of accesses in a single round trip, even when these conflict. Atlas achieves this while having a recovery mechanism that is significantly simpler than that of previous leaderless protocols. Tempo builds upon Atlas, but achieves superior throughput and offers predictable performance even in contended workloads. To achieve these benefits, Tempo timestamps each application command and executes it only after the timestamp becomes stable, i.e., all commands with a lower timestamp are known. Both the timestamping and stability detection mechanisms are fully decentralized, thus obviating the need for a leader replica. We evaluate Atlas and Tempo in both real and simulated geo-distributed environments and demonstrate that they outperform state-of-the-art alternatives.
    This work was partially supported by an FCT (Fundação para a Ciência e Tecnologia) PhD Fellowship (PD/BD/142927/2018).
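The execution rule described in the Tempo abstract (run a command only once its timestamp is stable) can be illustrated with a minimal sketch. This is not Tempo's actual stability-detection mechanism, which the abstract does not specify. It assumes a hypothetical simplification in which every replica reports a watermark, meaning it will issue no further commands at or below that timestamp, and a timestamp counts as stable once it is at or below every watermark. The names `stable_timestamp`, `run_stable`, and the tuple layout are illustrative only.

```python
def stable_timestamp(watermarks):
    """Highest timestamp known to be stable.

    ASSUMPTION (hypothetical rule, not taken from the source): each replica
    reports a watermark, and a timestamp is stable once every replica's
    watermark has reached it.
    """
    return min(watermarks.values())


def run_stable(pending, watermarks):
    """Split pending commands into those safe to execute and those that wait.

    Each command is a tuple (timestamp, replica_id, payload). Sorting the
    tuples gives timestamp order, with replica_id breaking ties.
    """
    limit = stable_timestamp(watermarks)
    ready = sorted(c for c in pending if c[0] <= limit)
    remaining = [c for c in pending if c[0] > limit]
    return ready, remaining


pending = [(3, "b", "set y"), (1, "a", "set x"), (5, "a", "del x")]
watermarks = {"a": 4, "b": 3}
ready, remaining = run_stable(pending, watermarks)
```

Here the command with timestamp 5 has to wait because replica "b" has only confirmed timestamps up to 3, so a command with a lower timestamp could still appear.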

    State-machine replication for planet-scale systems

    Online applications now routinely replicate their data at multiple sites around the world. In this paper we present Atlas, the first state-machine replication protocol tailored for such planet-scale systems. Atlas does not rely on a distinguished leader, so clients enjoy the same quality of service independently of their geographical locations. Furthermore, client-perceived latency improves as we add sites closer to clients. To achieve this, Atlas minimizes the size of its quorums using the observation that concurrent data center failures are rare. It also processes a high percentage of accesses in a single round trip, even when these conflict. We experimentally demonstrate that Atlas consistently outperforms state-of-the-art protocols in planet-scale scenarios. In particular, Atlas is up to two times faster than Flexible Paxos with identical failure assumptions, and more than doubles the performance of Egalitarian Paxos in the YCSB benchmark.
    Funding: H2020 - Horizon 2020 Framework Programme (825184)

    Distributed Multi-writer Multi-reader Atomic Register with Optimistically Fast Read and Write

    A distributed multi-writer multi-reader (MWMR) atomic register is an important primitive that enables a wide range of distributed algorithms. Hence, improving its performance can have large-scale consequences. Since the seminal work on ABD emulation in message-passing networks [JACM '95], many researchers have studied fast implementations of atomic registers under various conditions. "Fast" means that a read or a write can be completed in 1 round-trip time (RTT) by contacting a simple majority. In this work, we explore an atomic register with optimal resilience and "optimistically fast" read and write operations. That is, both operations can be fast if there is no concurrent write. This paper has three contributions: (i) We present Gus, the emulation of an MWMR atomic register with optimal resilience and optimistically fast reads and writes when there are up to 5 nodes; (ii) We show that when there are more than 5 nodes, it is impossible to emulate an MWMR atomic register with both properties; and (iii) We implement Gus in the framework of EPaxos and Gryff, and show that Gus provides lower tail latency than state-of-the-art systems such as EPaxos, Gryff, Giza, and Tempo under various workloads in the context of geo-replicated object storage systems.
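The "simple majority" in the abstract above relies on a standard property: any two majorities of the same node set share at least one node. The sketch below is only an illustration of that arithmetic, not code from Gus or ABD. It computes the majority size and checks the intersection property by brute force for small clusters. The function names are hypothetical.

```python
from itertools import combinations


def majority(n):
    """Size of a simple majority among n nodes."""
    return n // 2 + 1


def all_majorities_intersect(n):
    """Brute-force check that every pair of majority quorums shares a node."""
    quorums = [set(c) for c in combinations(range(n), majority(n))]
    return all(a & b for a, b in combinations(quorums, 2))


sizes = {n: majority(n) for n in range(1, 6)}
overlap_ok = all(all_majorities_intersect(n) for n in range(1, 6))
```

For 5 nodes a majority is 3, so an operation that reaches 3 nodes tolerates 2 failures while still overlapping with every other majority.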