65 research outputs found

    Byzantine Fault Tolerance for Distributed Systems

    The growing reliance on online services imposes high dependability requirements on the computer systems that provide them. Byzantine fault tolerance (BFT) is a promising technology for hardening such systems to achieve the needed dependability. BFT employs redundant copies of the servers and ensures that a replicated system continues to provide correct service despite attacks on a small portion of the system. In this dissertation research, I developed novel algorithms and mechanisms to control various types of application nondeterminism and to ensure the long-term reliability of BFT systems via a migration-based proactive recovery scheme. I also investigated a new approach to significantly improving overall system throughput by enabling concurrent processing with Software Transactional Memory (STM). Controlling application nondeterminism is essential for strong replica consistency because BFT is based on state-machine replication, which requires deterministic operation of each replica. Proactive recovery is necessary to ensure that the fundamental assumption of BFT is not violated over the long term, i.e., that fewer than one-third of the replicas are faulty. Without proactive recovery, more and more replicas would be compromised under continuous attacks, rendering BFT ineffective. STM-based concurrent processing maximizes system throughput by exploiting multi-core CPUs while preserving strong replica consistency.
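
    To see why nondeterminism must be controlled under state-machine replication, consider a request whose result depends on the local clock: if every replica read its own clock, correct replicas would diverge. Below is a minimal Python sketch of one common remedy, in which the primary proposes the nondeterministic value and all replicas execute with it; the names and structure are hypothetical illustrations, not the dissertation's actual mechanisms.

        import time

        class Replica:
            """State-machine replica: every correct replica must apply the
            same operations in the same order with the same inputs."""
            def __init__(self):
                self.state = {}

            def execute(self, op, args, proposed):
                # Nondeterministic inputs (here, the wall-clock time) are not
                # read locally; each replica uses the value agreed during
                # ordering, so all correct replicas reach identical states.
                if op == "put":
                    key, value = args
                    self.state[key] = (value, proposed["timestamp"])
                    return "ok"
                if op == "get":
                    return self.state.get(args[0])

        def primary_propose():
            # Only the primary samples the nondeterministic value; the
            # proposal is then totally ordered together with the request.
            return {"timestamp": time.time()}

        # One ordered request applied at three replicas yields one state.
        replicas = [Replica() for _ in range(3)]
        proposal = primary_propose()
        for r in replicas:
            r.execute("put", ("x", 42), proposal)
        assert all(r.state == replicas[0].state for r in replicas)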

    Practical database replication

    Doctoral thesis in Informatics. Software-based replication is a cost-effective approach to fault tolerance when combined with commodity hardware. In particular, shared-nothing database clusters built from commodity machines and synchronized through eager software-based replication protocols have been driven by the distributed systems community over the last decade. Efforts on eager database replication, however, date back to the late 1970s, with the initial proposals designed by the database community: distributed locking and atomic commitment protocols. Briefly, before a data item is updated, all copies are locked through a distributed lock, and upon commit an atomic commitment protocol guarantees that the transaction's changes are written to non-volatile storage at all replicas before the transaction commits. Both processes contribute to poor performance. The distributed systems community improved on them by reducing the number of interactions among replicas through group communication and by relaxing the durability requirements imposed by the atomic commitment protocol. The resulting approach requires at most two interactions among replicas and disseminates updates without necessarily applying them before committing a transaction; it relies on a large number of machines to reduce the likelihood of failures and to ensure data resilience. Clearly, the availability of commodity machines and their increasing processing power make this feasible. Proving the feasibility of this approach requires building several prototypes and evaluating them under different workloads and scenarios. Although simulation environments are a good starting point, especially those that combine real code (e.g., replication protocols, group communication) with simulated components (e.g., database, network), full-fledged implementations must be developed and tested. Unfortunately, database vendors usually do not provide native support for the development of third-party replication protocols, forcing protocol developers either to change the database engine, when the source code is available, or otherwise to build middleware wrappers that intercept client requests. The former solution is hard to maintain as new database releases are constantly produced, whereas the latter represents a strenuous development effort, as it requires rebuilding several database features in the middleware. Moreover, the group-based replication protocols proposed so far, optimistic or conservative, have drawbacks that present a major hurdle to their practicability: optimistic protocols make it difficult to commit transactions in the presence of hot-spots, whereas conservative protocols perform poorly due to concurrency issues. In this thesis, we propose a generic architecture and programming interface, named GAPI, to facilitate the development of different replication strategies. The idea is to provide key extensions to multiple DBMSs (Database Management Systems), enabling a replication strategy to be developed once and tested on several databases that have such extensions, i.e., those that are replication-friendly. To tackle the aforementioned problems in group-based replication protocols, we propose a novel protocol, named AKARA. AKARA guarantees fairness, so that all transactions have a chance to commit, and achieves high performance by exploiting the parallelism provided by local database engines. Finally, we outline a simple but comprehensive set of components for building group-based replication protocols and discuss key points in their design and implementation.
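
    The optimistic group-based approach described above can be made concrete with a small sketch: writesets are delivered to every replica in the same total order by group communication, and each replica runs the same deterministic certification test, so no distributed locking or atomic commitment round is needed. The classes and certification rule below are illustrative assumptions, not the thesis's GAPI interface or the AKARA protocol.

        class CertifyingReplica:
            """Certification-based replica: writesets arrive in the same
            total order at every replica (via group communication), and the
            commit/abort test is deterministic, so all replicas decide
            identically without any atomic commitment round."""
            def __init__(self):
                self.committed = []      # list of (version, writeset)
                self.version = 0

            def certify(self, start_version, writeset, readset):
                # Abort if a transaction that committed after we started
                # wrote something we read (a read-write conflict).
                for v, ws in self.committed:
                    if v > start_version and ws & readset:
                        return False
                self.version += 1
                self.committed.append((self.version, writeset))
                return True

        # Two replicas deliver the same transactions in the same order.
        a, b = CertifyingReplica(), CertifyingReplica()
        t1 = (0, {"x"}, {"y"})           # start_version, writeset, readset
        t2 = (0, {"y"}, {"x"})           # reads "x", which t1 wrote
        for r in (a, b):
            assert r.certify(*t1) is True
            assert r.certify(*t2) is False   # conflict -> abort everywhere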

    Conflict-Free Replicated Data Types in Dynamic Environments

    Over the years, mobile devices have become increasingly popular and have gained computation capabilities that allow them to perform more complex tasks, such as running collaborative applications. Given the weak connectivity characteristic of mobile networks, which are highly dynamic environments where users may experience regular involuntary disconnection periods, the question arises of how to maintain data consistency. The issue is most pronounced in collaborative environments where multiple users interact with each other, sharing a replicated state that may diverge due to concurrency conflicts and lost updates. One of today's best solutions is Conflict-Free Replicated Data Types (CRDTs), which ensure low latency and automatic conflict resolution, guaranteeing eventual consistency of the shared data. However, a limitation often found in CRDTs and the systems that employ them is the need to know which replicas state changes must be disseminated to. This is a problem, since such knowledge cannot realistically be maintained in an environment where clients may join and leave at any time and may be disconnected by unreliable mobile network communications. In this thesis, we study and extend the CRDT concept to dynamic environments by introducing the P/S-CRDTs model, which couples CRDTs with the publish/subscribe interaction scheme and adds mechanisms that let users cooperate and maintain consistency despite the volatile behavior of mobile networks. The experimental results show that, in volatile disconnection scenarios, mobile users in collaborative activity maintain consistency among themselves, and that, compared to other available CRDT models, the P/S-CRDTs model removes the need to know to whom updates must be disseminated while keeping network traffic at appropriate levels.
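
    As a concrete illustration of the decoupling the abstract describes, the sketch below pairs a state-based grow-only counter CRDT with a toy publish/subscribe topic: a publisher merges its state into whichever replicas happen to be subscribed, without ever naming them. This is a hypothetical miniature, not the P/S-CRDTs model itself.

        class GCounter:
            """State-based grow-only counter CRDT: merge takes a pairwise
            maximum, so replicas converge even if updates are lost,
            reordered, or delivered repeatedly."""
            def __init__(self, replica_id):
                self.id = replica_id
                self.counts = {}

            def increment(self):
                self.counts[self.id] = self.counts.get(self.id, 0) + 1

            def value(self):
                return sum(self.counts.values())

            def merge(self, other_counts):
                for rid, n in other_counts.items():
                    self.counts[rid] = max(self.counts.get(rid, 0), n)

        class Topic:
            """Toy publish/subscribe channel: a publisher pushes its state
            to whoever is subscribed, never naming individual replicas."""
            def __init__(self):
                self.subscribers = []
            def subscribe(self, crdt):
                self.subscribers.append(crdt)
            def publish(self, state):
                for s in self.subscribers:
                    s.merge(state)

        topic = Topic()
        a, b = GCounter("a"), GCounter("b")
        topic.subscribe(a)
        topic.subscribe(b)
        a.increment()
        topic.publish(a.counts)          # "a" never enumerates its peers
        b.increment()
        topic.publish(b.counts)
        assert a.value() == b.value() == 2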

    Analysis of the Matrix Event Graph Replicated Data Type

    Matrix is a new kind of decentralized, topic-based publish-subscribe middleware for communication and data storage that is becoming particularly popular as a basis for secure instant messaging. Compared with traditional decentralized communication systems, Matrix replaces pure message passing with a replicated data structure. This data structure, which we extract and call the Matrix Event Graph (MEG), captures the causal history of messages. We show that the MEG represents an interesting and important replicated data type for decentralized applications that are based on causal histories of publish-subscribe events. First, we prove that the MEG is a Conflict-Free Replicated Data Type for causal histories and thus provides Strong Eventual Consistency (SEC). With SEC being among the best achievable trade-offs in the scope of the well-known CAP theorem, the MEG provides a powerful consistency guarantee while remaining available during network partitions. Second, we discuss the implications of Byzantine attackers for the data type's properties. We note that the MEG, as it does not strive for consensus or strong consistency, can cope with n > f environments with n participants, of which f are Byzantine. Third, we analyze scalability: using Markov chains, we study the number of forward extremities of the MEG over time and observe an almost optimal evolution; we conjecture that this property is inherent to the underlying spatially inhomogeneous random walk. With these properties, the MEG is a promising element in the set of data structures for decentralized applications, with distinct trade-offs compared to traditional blockchains and distributed ledger technologies.
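
    The forward-extremity mechanism can be sketched briefly: each new event references the current forward extremities (events without children), so concurrent appends create extra extremities that the next event merges again. The Python sketch below is a simplified model of this idea under stated assumptions, not Matrix's actual event format.

        class EventGraph:
            """Simplified MEG: each event references the current forward
            extremities (events with no children yet), so concurrent
            appends create extra extremities that a later event merges."""
            def __init__(self):
                self.parents = {}        # event -> tuple of parent events
                self.extremities = set()

            def append(self, event, parents=None):
                # By default an event points at all current extremities.
                refs = tuple(sorted(parents if parents is not None
                                    else self.extremities))
                self.parents[event] = refs
                self.extremities -= set(refs)
                self.extremities.add(event)

        g = EventGraph()
        g.append("e1")
        g.append("e2", parents=["e1"])   # two replicas append concurrently,
        g.append("e3", parents=["e1"])   # leaving two forward extremities
        assert g.extremities == {"e2", "e3"}
        g.append("e4")                   # the next event merges both branches
        assert g.parents["e4"] == ("e2", "e3")
        assert g.extremities == {"e4"}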

    Fast and transparent recovery for continuous availability of cluster-based servers

    A software architecture for consensus based replication

    Advisor: Luiz Eduardo Buzato. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação. This thesis explores one of the fundamental tools for the construction of distributed systems: the replication of software components. Specifically, we attempt to solve the problem of simplifying the construction of high-performance, high-availability replicated applications. We have developed Treplica, a replication library, as the main tool to reach this research objective. Treplica allows the construction of distributed applications that behave as centralized applications, presenting the programmer a simple interface based on an object-oriented specification of active replication. The conclusion we reach in this thesis is that it is possible to create modular and simple-to-use support for replication that provides high performance, low latency, and fast recovery in the presence of failures. We believe our proposed software architecture is applicable to any distributed system, but it is particularly interesting for systems that remain centralized due to the lack of a simple, efficient, and reliable replication mechanism. Doctorate in Computing Systems; Doctor of Computer Science.
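
    The active-replication style described above can be sketched as a deterministic state machine fed by a totally ordered action log: every replica replays the same actions in the same order, and recovery is simply replaying the log. The names below are hypothetical illustrations, not Treplica's actual API.

        class Action:
            """A deterministic, self-contained state transition, in the
            spirit of an object-oriented active-replication interface."""
            def __init__(self, key, value):
                self.key, self.value = key, value
            def execute(self, state):
                state[self.key] = self.value

        class ReplicatedStateMachine:
            def __init__(self, ordered_log):
                self.log = ordered_log   # stand-in for a consensus-ordered log
                self.state = {}
                self.applied = 0

            def submit(self, action):
                # A real system would agree on the log position via
                # consensus (e.g., Paxos); a shared list plays that role here.
                self.log.append(action)

            def catch_up(self):
                # Replaying the log is also how a recovering replica
                # rebuilds its state after a failure.
                while self.applied < len(self.log):
                    self.log[self.applied].execute(self.state)
                    self.applied += 1

        log = []
        r1, r2 = ReplicatedStateMachine(log), ReplicatedStateMachine(log)
        r1.submit(Action("x", 1))
        r1.catch_up()
        r2.catch_up()
        assert r1.state == r2.state == {"x": 1}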