7 research outputs found

    Towards a generic group communication service

    View synchronous group communication is a mature technology that greatly eases the development of reliable distributed applications by enforcing precise message delivery semantics, especially in the face of faults. It is therefore found at the core of multiple widely deployed middleware products. Although implementing a group communication system is a complex task, application developers benefit from the fact that multiple group communication toolkits are currently available and supported. Unfortunately, each toolkit has a different interface that differs from every other in subtle syntactic and semantic aspects. This hinders the design, implementation and maintenance of applications using group communication and forces developers to commit beforehand to a single toolkit, thus imposing a significant hurdle to portability. In this paper we propose jGCS, a generic group communication service for Java, that specifies an interface as well as minimum semantics that allow application portability. This interface accommodates existing group communication services, enabling implementation independence. Furthermore, it provides support for the latest state-of-the-art mechanisms that have been proposed to improve the performance of group-based applications. To support our claims, we present and experimentally evaluate implementations of jGCS for several major group communication systems, namely Appia, Spread/FlushSpread and JGroups, and describe the port of a large middleware product to jGCS. This work was partially supported by the IST project GORDA (FP6-IST2-004758

    High performance deferred update replication

    Replication is a well-known approach to implementing storage systems that can tolerate failures. Replicated storage systems are designed such that the state of the system is kept at several replicas. A replication protocol ensures that the failure of a replica is masked by the rest of the system, in a way that is transparent to its users. Replicated storage systems are among the most important building blocks in the design of large scale applications. Applications at scale are often deployed on top of commodity hardware, store a vast amount of data, and serve a large number of users. The larger the system, the higher its vulnerability to failures. The ability to tolerate failures is not the only desirable feature in a replicated system. Storage systems need to be efficient in order to accommodate requests from a large user base while achieving low response times. In that respect, replication can leverage multiple replicas to parallelize the execution of user requests. This thesis focuses on Deferred Update Replication (DUR), a well-established database replication approach. It provides high availability in that every replica can execute client transactions. In terms of performance, it is better than other replication techniques in that only one replica executes a given transaction while the other replicas only apply state changes. However, DUR suffers from the following drawback: each replica stores a full copy of the database, which has consequences in terms of performance. The first consequence is that DUR cannot take advantage of the aggregated memory available to the replicas. Our first contribution is a distributed caching mechanism that addresses the problem. It makes efficient use of the main memory of an entire cluster of machines, while guaranteeing strong consistency. The second consequence is that DUR cannot scale with the number of replicas. 
The throughput of a fully replicated system is inherently limited by the number of transactions that a single replica can apply to its local storage. We propose a scalable version of the DUR approach in which the system state is partitioned into smaller replica sets, so that transactions accessing disjoint partitions can proceed in parallel. The last part of the thesis focuses on latency. We show that the scalable DUR-based approach may have detrimental effects on response time, especially when replicas are geographically distributed. The thesis considers different deployments and their implications for latency. We propose optimizations that provide substantial gains in geographically distributed environments
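
    The execution model described in this abstract (one replica executes a transaction, every replica applies the resulting state changes) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The `Replica` class, the per-key version counters, the version-based certification test and the `commit` helper are all assumptions introduced for illustration, and a plain loop stands in for the ordered broadcast a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica holding a full copy of the database (hypothetical model)."""
    store: dict = field(default_factory=dict)     # key -> value
    versions: dict = field(default_factory=dict)  # key -> version counter

    def execute(self, reads, compute):
        """Run a transaction locally; return its readset versions and writeset."""
        snapshot = {k: self.store.get(k) for k in reads}
        readset = {k: self.versions.get(k, 0) for k in reads}
        writeset = compute(snapshot)
        return readset, writeset

    def certify_and_apply(self, readset, writeset):
        """Apply the writeset only if nothing the transaction read has changed."""
        if any(self.versions.get(k, 0) != v for k, v in readset.items()):
            return False  # stale read: the transaction aborts
        for k, value in writeset.items():
            self.store[k] = value
            self.versions[k] = self.versions.get(k, 0) + 1
        return True


def commit(replicas, readset, writeset):
    """Stand-in for an ordered broadcast: all replicas see updates in one order."""
    outcomes = [r.certify_and_apply(readset, writeset) for r in replicas]
    return all(outcomes)


replicas = [Replica(), Replica(), Replica()]
# Two transactions execute concurrently at different replicas, both reading "x".
t1 = replicas[0].execute(["x"], lambda s: {"x": (s["x"] or 0) + 1})
t2 = replicas[1].execute(["x"], lambda s: {"x": (s["x"] or 0) + 10})
t1_committed = commit(replicas, *t1)  # first in the order: commits everywhere
t2_committed = commit(replicas, *t2)  # read a stale version of "x": aborts
```

    In this sketch each transaction is executed by a single replica, while the other replicas only run the cheap certify-and-apply step. Because every replica still applies every committed writeset, the toy model also shows why adding fully replicated nodes does not raise update throughput.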

    Scalable coordination of distributed in-memory transactions

    PhD thesis. Coordinating transactions involves ensuring serializability in the presence of concurrent data accesses. Accomplishing this in a scalable manner for distributed in-memory transactions is the aim of this thesis. To this end, the work makes three contributions. First, it experimentally demonstrates that transaction latency and throughput scale considerably well when an atomic multicast service is offered to transaction nodes by a crash-tolerant ensemble of dedicated nodes, and that using such a service is the most scalable approach compared to practices advocated in the literature. Second, we design, implement and evaluate a crash-tolerant and non-blocking atomic broadcast protocol, called ABcast, which is then used as the foundation for building the aforementioned multicast service. ABcast is a hybrid protocol consisting of a primary and a backup protocol executing in parallel. The primary protocol is a deterministic atomic broadcast protocol that provides high performance when node crashes are absent, but blocks in their presence until a group membership service detects such failures. The backup protocol, Aramis, is a probabilistic protocol that does not block in the event of node crashes and allows message delivery to continue post-crash until the primary protocol is able to resume. The design of Aramis avoids blocking by assuming that message delays remain within a known bound with a high probability that can be estimated in advance, provided that recent delay estimates are used to (i) continually adjust that bound and (ii) regulate flow control. Aramis's delivery of broadcasts preserves total order with a probability that can be tuned to be close to 1. Comprehensive evaluations show that this probability can be 99.99% or more. Finally, we assess the effect of low-probability order violations on implementing various isolation levels commonly considered in transaction systems.
These three contributions together advance the state of the art in two major ways: (i) identifying a service-based approach to transactional scalability and (ii) establishing a practical alternative to the complex Paxos-style approach to building such a service, by using novel but simple protocols and open-source software frameworks. Red Ha

    On the many faces of atomic multicast

    Many current online services need to serve clients distributed across geographic areas. Coordinating highly available and scalable geographically distributed replicas, however, is challenging. While State Machine Replication is the most direct way of achieving availability, the traditional approach offers no scalability. Typically, scalability is obtained by partitioning the original application state among groups of servers, which leads to further challenges. Atomic multicast is a group communication abstraction that organizes processes into groups, provides reliability and ordering guarantees, and can be exploited to offer partially replicated applications a scalable and consistent alternative. This work confronts the challenges of providing practical group communication abstractions for crash fault-tolerant and Byzantine fault-tolerant (BFT) models. Although there are plenty of atomic multicast algorithms that tolerate crash failures, they suffer from two major issues: (a) high latency for messages addressed to multiple groups, and (b) low performance when the proportion of messages addressed to multiple groups is high. To solve the first problem and reduce the latency of multi-group messages, this work presents FastCast, an algorithm that delivers such messages in an unprecedented four communication delays. The second problem can be addressed by maximizing the proportion of single-group messages and eliminating additional communication among groups to execute operations. In this direction, this document introduces GeoPaxos, a protocol that partitions the ordering of operations like atomic multicast while still keeping the state fully replicated. In the BFT model, the task is more challenging, since servers can behave arbitrarily. This thesis presents ByzCast, the first atomic multicast algorithm that tolerates Byzantine failures. ByzCast is hierarchical and introduces a new class of atomic multicast defined as partially genuine.
Lastly, since a consensus protocol resides at the very core of most strongly consistent replicated systems, the thesis concludes with Kernel Paxos, a Paxos implementation provided as a loadable kernel module that delivers high performance while abstracting ordering from the application execution
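
    The distinction this abstract draws between single-group and multi-group messages can be made concrete with a short Python sketch. It does not reproduce FastCast, GeoPaxos or ByzCast. The number of groups, the hash-based placement in `group_of`, and the helper names `destinations` and `multi_group_fraction` are assumptions chosen only to show how a command's destination set follows from the keys it touches.

```python
import zlib

NUM_GROUPS = 3  # hypothetical number of server groups (partitions)


def group_of(key: str) -> int:
    """Map a key to a group with a stable hash (crc32, chosen for determinism)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_GROUPS


def destinations(keys) -> frozenset:
    """Groups that must receive a command and agree on its order."""
    return frozenset(group_of(k) for k in keys)


def multi_group_fraction(commands) -> float:
    """Share of commands addressed to more than one group."""
    multi = sum(1 for keys in commands if len(destinations(keys)) > 1)
    return multi / len(commands)
```

    A command whose keys all map to one group is ordered inside that group alone, while a command spanning several groups needs cross-group coordination. A placement that keeps `multi_group_fraction` low for a given workload therefore reduces how often that cost is paid.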

    Optimistic Atomic Multicast

    Message ordering is one of the cornerstones of reliable distributed systems. However, some ordering guarantees, such as atomic order, are expensive to implement in terms of message delays. This paper presents Optimistic Atomic Multicast, a protocol that combines reduced latency with increased throughput. Messages can be delivered optimistically in a single communication step and conservatively in three communication steps. Unlike previous optimistic group communication protocols, Optimistic Atomic Multicast does not rely on spontaneous message ordering for fast delivery. In addition to presenting Optimistic Atomic Multicast, we provide detailed performance results comparing it to other ordering protocols in both local-area and wide-area networks

    Replicação escalável de máquina de estados

    Redundancy provides fault tolerance. A service can run on multiple servers that replicate each other, in order to keep the service available even in the case of crashes. One way to implement such a replicated service is state machine replication (SMR). SMR provides fault tolerance while being linearizable, that is, clients cannot distinguish the behaviour of the replicated system from that of a single-site, unreplicated one. However, having a fully replicated, linearizable system comes at a cost, namely scalability. By scalability we mean that adding servers will always increase the maximum system throughput, at least for some workloads. Even with a careful setup and optimizations that keep servers from taking unnecessary redundant actions, at some point the throughput of a system replicated with SMR cannot be increased by additional servers; in fact, adding replicas may even degrade performance. A way to achieve scalability is by partitioning the service state and then allowing partitions to work independently. Having a partitioned, yet linearizable and reasonably performant service is not trivial, and this is the topic of research addressed here. To allow systems to scale, while at the same time ensuring linearizability, we propose and implement the following ideas: (i) Scalable State Machine Replication (S-SMR), (ii) Optimistic Atomic Multicast (Opt-amcast), and (iii) Fast S-SMR (Fast-SSMR). S-SMR is an execution model that allows the throughput of the system to scale linearly with the number of servers without sacrificing consistency. To provide faster responses for commands, we developed Opt-amcast, which allows messages to be delivered twice: one delivery guarantees atomic order (conservative delivery), while the other is fast but does not always guarantee atomic order (optimistic delivery). The implementation of Opt-amcast that we propose is called Ridge, a protocol that combines low latency with high throughput.
Fast-SSMR is an extension of S-SMR that uses the optimistic delivery of Opt-amcast: while a command is atomically ordered, some precomputation can be done based on its fast, optimistically ordered delivery, improving response time.
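
    The idea of precomputing on the optimistic delivery and confirming on the conservative one can be sketched in Python as follows. This is a toy model, not Ridge or the Fast-SSMR implementation. The `SpeculativeReplica` class, its integer state, and the rule of discarding all pending speculation on the first order mismatch are assumptions made for illustration.

```python
class SpeculativeReplica:
    """Toy replica that precomputes on optimistic delivery (hypothetical model)."""

    def __init__(self):
        self.state = 0           # confirmed state, follows the conservative order
        self.spec_state = 0      # speculative state, follows the optimistic order
        self.pending = []        # command ids delivered optimistically, unconfirmed
        self.spec_result = {}    # command id -> state computed speculatively
        self.confirmed = set()   # command ids already delivered conservatively
        self.fallback_executions = 0  # commands executed without usable speculation

    def opt_deliver(self, cmd_id, op):
        """Fast delivery: execute speculatively in the optimistic order."""
        if cmd_id in self.confirmed:
            return  # the conservative delivery already happened; nothing to gain
        self.spec_state = op(self.spec_state)
        self.pending.append(cmd_id)
        self.spec_result[cmd_id] = self.spec_state

    def cons_deliver(self, cmd_id, op):
        """Atomic-order delivery: reuse the speculation if the orders agree."""
        self.confirmed.add(cmd_id)
        if self.pending and self.pending[0] == cmd_id:
            # Optimistic order matched: the precomputed result is valid.
            self.pending.pop(0)
            self.state = self.spec_result.pop(cmd_id)
        else:
            # Mismatch or no speculation: execute now and drop stale speculation.
            self.state = op(self.state)
            self.fallback_executions += 1
            self.pending.clear()
            self.spec_result.clear()
            self.spec_state = self.state
        return self.state


def add_one(s):
    return s + 1


def double(s):
    return s * 2


# Optimistic and conservative orders agree: both results are reused.
fast = SpeculativeReplica()
fast.opt_deliver("a", add_one)
fast.opt_deliver("b", double)
fast.cons_deliver("a", add_one)
fast_final = fast.cons_deliver("b", double)

# Orders disagree: the replica falls back to executing in conservative order.
slow = SpeculativeReplica()
slow.opt_deliver("a", add_one)
slow.opt_deliver("b", double)
slow.cons_deliver("b", double)
slow_final = slow.cons_deliver("a", add_one)
```

    In both runs the confirmed state depends only on the conservative order, so a wrong optimistic order costs time but never correctness. The response-time benefit comes from the first case, where the result is already available when the conservative delivery arrives.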
