1,408 research outputs found

    Contributions on agreement in dynamic distributed systems

    Get PDF
    139 p.This Ph.D. thesis studies the agreement problem in dynamic distributed systems by integrating both the classical fault-tolerance perspective and the more recent formalism based on evolving graphs. First, we developed a common framework that allows to analyze and compare models of dynamic distributed systems for eventual leader election. The framework extends a previous proposal by Baldoni et al. by including new dimensions and levels of dynamicity. Also, we extend the Time-Varying Graph (TVG) formalism by introducing the necessary timeliness assumptions and the minimal conditions to solve agreement problems. We provide a hierarchy of time-bounded, TVG-based, connectivity classes with increasingly stronger assumptions and specify an implementation of Terminating Reliable Broadcast for each class. Then we define an Omega failure detector, W, for the eventual leader election in dynamic distributed systems, together with a system model, , which is compatible with the timebounded TVG classes. We implement an algorithm that satisfy the properties of W in M. According to our common framework, M results to be weaker than the previous proposed dynamic distributed system models for eventual leader election. Additionally we use simulations to illustrate this fact and show that our leader election algorithm tolerates more general (i.e., dynamic) behaviors, and hence it is of application in a wider range of practical scenarios at the cost of a moderate overhead on stabilization times

    A Prescription for Partial Synchrony

    Get PDF
    Algorithms in message-passing distributed systems often require partial synchrony to tolerate crash failures. Informally, partial synchrony refers to systems where timing bounds on communication and computation may exist, but the knowledge of such bounds is limited. Traditionally, the foundation for the theory of partial synchrony has been real time: a time base measured by counting events external to the system, like the vibrations of Cesium atoms or piezoelectric crystals. Unfortunately, algorithms that are correct relative to many real-time based models of partial synchrony may not behave correctly in empirical distributed systems. For example, a set of popular theoretical models, which we call M_*, assume (eventual) upper bounds on message delay and relative process speeds, regardless of message size and absolute process speeds. Empirical systems with bounded channel capacity and bandwidth cannot realize such assumptions either natively, or through algorithmic constructions. Consequently, empirical deployment of the many M_*-based algorithms risks anomalous behavior. As a result, we argue that real time is the wrong basis for such a theory. Instead, the appropriate foundation for partial synchrony is fairness: a time base measured by counting events internal to the system, like the steps executed by the processes. By way of example, we redefine M_* models with fairness-based bounds and provide algorithmic techniques to implement fairness-based M_* models on a significant subset of the empirical systems. The proposed techniques use failure detectors — system services that provide hints about process crashes — as intermediaries that preserve the fairness constraints native to empirical systems. In effect, algorithms that are correct in M_* models are now proved correct in such empirical systems as well. Demonstrating our results requires solving three open problems. (1) We propose the first unified mathematical framework based on Timed I/O Automata to specify empirical systems, partially synchronous systems, and algorithms that execute within the aforementioned systems. (2) We show that crash tolerance capabilities of popular distributed systems can be denominated exclusively through fairness constraints. (3) We specify exemplar system models that identify the set of weakest system models to implement popular failure detectors

    A simple proof of the necessity of the failure detector Σ\Sigma to implement a register in asynchronous message-passing systems

    Get PDF
    This paper presents a simple proof that the quorum failure detector class (denoted ) is the weakest failure detector class required to implement an atomic read/write register in an asynchronous message-passing system prone to an arbitrary number of process crashes

    On the Resilience of Intrusion-Tolerant Distributed Systems

    Get PDF
    The paper starts by introducing a new dimension along which distributed systems resilience may be evaluated - exhaustion-safety. A node-exhaustion-safe intrusion-tolerant distributed system is a system that assuredly does not suffer more than the assumed number of node failures (e.g., crash, Byzantine). We show that it is not possible to build this kind of systems under the asynchronous model. This result follows from the fact that in an asynchronous environment one cannot guarantee that the system terminates its execution before the occurrence of more than the assumed number of faults. After introducing exhaustion-safety, the paper proposes a new paradigm - proactive resilience - to build intrusion-tolerant distributed systems. Proactive resilience is based on architectural hybridization and hybrid distributed system modeling. The Proactive Resilience Model (PRM) is presented and shown to be a way of building node-exhaustion-safe intrusion-tolerant systems. Finally, the paper describes the design of a secret sharing system built according to the PRM. A proof-of-concept prototype of this system is shown to be highly resilient under different attack scenarios

    An eventually perfect failure detector in a high-availability scenario

    Get PDF
    Modern-day distributed systems have been increasing in complexity and dynamism due to the heterogeneity of the system execution environment, different network technologies, online repairs, frequent updates and upgrades, and the addition or removal of system components. Such complexity has elevated the operational and maintenance costs and triggered efforts to reduce it while improving its reliability. Availability is the ratio of uptime to total time of a system. A High Available system, or systems with at least 99.999% of Availability, imposes a challenge to maintain such levels of uptime. Prior work shows that by using system state monitoring and fault management with failure detectors it is possible to increase system availability. The main objective of this work is to develop an Eventually Perfect Failure Detector to improve a database system Availability through fault-tolerance methods. Such a system was developed and tested in a proposed High-Availability database access infrastructure. Final results have shown that is possible to achieve performance and availability improvements by using, respectively, replication and a failure detector.Os Sistemas distribuídos modernos têm aumentando em dinamismo e complexidade devido à heterogeneidade do ambiente de execução, diferentes tecnologias de rede, manutenção online, atualizações frequentes e a adição ou remoção de componentes do sistema. Esta complexidade tem elevado os custos operacionais e de manutenção, incentivando o desenvolvimento de soluções para reduzir a manutenção dos sistemas enquanto melhora sua confiabilidade. Disponibilidade é a razão do tempo de atividade sobre um intervalo de tempo total. Sistemas de Alta Disponibilidade, ou seja, que possuem pelo menos 99.9999% de Disponibilidade, representam um grande desafio para manter tais níveis de operacionalidade. Trabalhos anteriores mostram que é possível melhorar a Disponibilidade do sistema utilizando o monitoramento de estados do sistema e o gerenciamento de falhas com detectores. O objetivo principal deste trabalho é desenvolver um Detector de Falhas Eventualmente Perfeito que pode melhorar a Disponibilidade de um sistema de base de dados através de uma arquitetura de Alta Disponibilidade. Os resultados finais mostram que é possível ter ganhos de desempenho e disponibilidade utilizando, respectivamente, métodos como replicação e detecção de falhas

    Byzantine fault-tolerant agreement protocols for wireless Ad hoc networks

    Get PDF
    Tese de doutoramento, Informática (Ciências da Computação), Universidade de Lisboa, Faculdade de Ciências, 2010.The thesis investigates the problem of fault- and intrusion-tolerant consensus in resource-constrained wireless ad hoc networks. This is a fundamental problem in distributed computing because it abstracts the need to coordinate activities among various nodes. It has been shown to be a building block for several other important distributed computing problems like state-machine replication and atomic broadcast. The thesis begins by making a thorough performance assessment of existing intrusion-tolerant consensus protocols, which shows that the performance bottlenecks of current solutions are in part related to their system modeling assumptions. Based on these results, the communication failure model is identified as a model that simultaneously captures the reality of wireless ad hoc networks and allows the design of efficient protocols. Unfortunately, the model is subject to an impossibility result stating that there is no deterministic algorithm that allows n nodes to reach agreement if more than n2 omission transmission failures can occur in a communication step. This result is valid even under strict timing assumptions (i.e., a synchronous system). The thesis applies randomization techniques in increasingly weaker variants of this model, until an efficient intrusion-tolerant consensus protocol is achieved. The first variant simplifies the problem by restricting the number of nodes that may be at the source of a transmission failure at each communication step. An algorithm is designed that tolerates f dynamic nodes at the source of faulty transmissions in a system with a total of n 3f + 1 nodes. The second variant imposes no restrictions on the pattern of transmission failures. The proposed algorithm effectively circumvents the Santoro- Widmayer impossibility result for the first time. It allows k out of n nodes to decide despite dn 2 e(nk)+k2 omission failures per communication step. This algorithm also has the interesting property of guaranteeing safety during arbitrary periods of unrestricted message loss. The final variant shares the same properties of the previous one, but relaxes the model in the sense that the system is asynchronous and that a static subset of nodes may be malicious. The obtained algorithm, called Turquois, admits f < n 3 malicious nodes, and ensures progress in communication steps where dnf 2 e(n k f) + k 2. The algorithm is subject to a comparative performance evaluation against other intrusiontolerant protocols. The results show that, as the system scales, Turquois outperforms the other protocols by more than an order of magnitude.Esta tese investiga o problema do consenso tolerante a faltas acidentais e maliciosas em redes ad hoc sem fios. Trata-se de um problema fundamental que captura a essência da coordenação em actividades envolvendo vários nós de um sistema, sendo um bloco construtor de outros importantes problemas dos sistemas distribuídos como a replicação de máquina de estados ou a difusão atómica. A tese começa por efectuar uma avaliação de desempenho a protocolos tolerantes a intrusões já existentes na literatura. Os resultados mostram que as limitações de desempenho das soluções existentes estão em parte relacionadas com o seu modelo de sistema. Baseado nestes resultados, é identificado o modelo de falhas de comunicação como um modelo que simultaneamente permite capturar o ambiente das redes ad hoc sem fios e projectar protocolos eficientes. Todavia, o modelo é restrito por um resultado de impossibilidade que afirma não existir algoritmo algum que permita a n nós chegaram a acordo num sistema que admita mais do que n2 transmissões omissas num dado passo de comunicação. Este resultado é válido mesmo sob fortes hipóteses temporais (i.e., em sistemas síncronos) A tese aplica técnicas de aleatoriedade em variantes progressivamente mais fracas do modelo até ser alcançado um protocolo eficiente e tolerante a intrusões. A primeira variante do modelo, de forma a simplificar o problema, restringe o número de nós que estão na origem de transmissões faltosas. É apresentado um algoritmo que tolera f nós dinâmicos na origem de transmissões faltosas em sistemas com um total de n 3f + 1 nós. A segunda variante do modelo não impõe quaisquer restrições no padrão de transmissões faltosas. É apresentado um algoritmo que contorna efectivamente o resultado de impossibilidade Santoro-Widmayer pela primeira vez e que permite a k de n nós efectuarem progresso nos passos de comunicação em que o número de transmissões omissas seja dn 2 e(n k) + k 2. O algoritmo possui ainda a interessante propriedade de tolerar períodos arbitrários em que o número de transmissões omissas seja superior a . A última variante do modelo partilha das mesmas características da variante anterior, mas com pressupostos mais fracos sobre o sistema. Em particular, assume-se que o sistema é assíncrono e que um subconjunto estático dos nós pode ser malicioso. O algoritmo apresentado, denominado Turquois, admite f < n 3 nós maliciosos e assegura progresso nos passos de comunicação em que dnf 2 e(n k f) + k 2. O algoritmo é sujeito a uma análise de desempenho comparativa com outros protocolos na literatura. Os resultados demonstram que, à medida que o número de nós no sistema aumenta, o desempenho do protocolo Turquois ultrapassa os restantes em mais do que uma ordem de magnitude.FC

    Proactive resilience

    Get PDF
    Tese de doutoramento em Informática (Ciências da Computação), apresentada à Universidade de Lisboa através da Faculdade de Ciências, 2007Disponível no document
    • …
    corecore