6,866 research outputs found

    Verifying Safety of Fault-Tolerant Distributed Components -- Extended Version

    We show how to ensure correctness and fault tolerance of distributed components by behavioural specification. We specify a system combining a simple distributed component application and a fault-tolerance mechanism. We choose to encode the most general and most demanding kind of faults, Byzantine failures, but only for some of the components of our system. With Byzantine failures a faulty process can exhibit any behaviour, so replication is the only convenient classical solution; this greatly increases the size of the system and makes model-checking a challenge. Despite the simplicity of our application, a full study of the overall behaviour of the combined system requires us to put together the specifications of many features required by either the distributed application or the fault-tolerant protocol: our system encodes a hierarchical component structure, asynchronous communication with futures, replication, group communication, an agreement protocol, and faulty components. The system we obtain is huge, and we have proved its correctness by combining data abstraction, compositional minimization, and distributed model-checking.

    Rapid Recovery for Systems with Scarce Faults

    Our goal is to achieve a high degree of fault tolerance through the control of safety-critical systems. This reduces to solving a game between a malicious environment that injects failures and a controller that tries to establish correct behavior. We suggest a new control objective for such systems that offers a better balance between complexity and precision: we seek systems that are k-resilient. To be k-resilient, a system must be able to recover rapidly from a small number, up to k, of local faults infinitely many times, provided that blocks of up to k faults are separated by short recovery periods in which no fault occurs. k-resilience is a simple but powerful abstraction from the precise distribution of local faults, yet much more refined than the traditional objective of maximizing the number of local faults. We argue why we believe this to be the right level of abstraction for safety-critical systems when local faults are few and far between. We show that the computational complexity of constructing optimal control with respect to resilience is low, and we demonstrate feasibility through an implementation and experimental results. Comment: In Proceedings GandALF 2012, arXiv:1210.202
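The fault pattern described in this abstract (blocks of at most k faults separated by fault-free recovery periods) can be illustrated with a minimal sketch. The function name, the boolean trace encoding, and the `recovery` parameter are assumptions made for illustration; the paper's formal definition may differ.

```python
from typing import Sequence


def admissible_fault_pattern(trace: Sequence[bool], k: int, recovery: int) -> bool:
    """Check a fault trace against a k-resilience style assumption.

    trace[i] is True when a local fault is injected at step i.
    Hypothetical encoding: a block of faults ends once `recovery`
    consecutive fault-free steps have been observed.
    """
    faults_in_block = 0  # faults seen since the last completed recovery period
    quiet_steps = 0      # consecutive fault-free steps seen so far

    for fault in trace:
        if fault:
            faults_in_block += 1
            quiet_steps = 0
            if faults_in_block > k:
                return False  # more than k faults without a recovery period
        else:
            quiet_steps += 1
            if quiet_steps >= recovery:
                faults_in_block = 0  # recovery period completed, start a new block
    return True
```

For example, with k = 2 the trace fault, fault, quiet, quiet, fault, fault is admissible when a recovery period lasts two steps, but not when it must last three.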

    Verification and Synthesis of Symmetric Uni-Rings for Leads-To Properties

    This paper investigates the verification and synthesis of parameterized protocols that satisfy leads-to properties $R \leadsto Q$ on symmetric unidirectional rings (a.k.a. uni-rings) of deterministic and constant-space processes under no fairness and interleaving semantics, where $R$ and $Q$ are global state predicates. First, we show that verifying $R \leadsto Q$ for parameterized protocols on symmetric uni-rings is undecidable, even for deterministic and constant-space processes and conjunctive state predicates. Then, we show that, surprisingly, synthesizing symmetric uni-ring protocols that satisfy $R \leadsto Q$ is actually decidable. We identify necessary and sufficient conditions for the decidability of synthesis, based on which we devise a sound and complete polynomial-time algorithm that takes the predicates $R$ and $Q$ and automatically generates a parameterized protocol that satisfies $R \leadsto Q$ for unbounded (but finite) ring sizes. Moreover, we present some decidability results for cases where leads-to is required from multiple distinct $R$ predicates to different $Q$ predicates. To demonstrate the practicality of our synthesis method, we synthesize some parameterized protocols, including agreement and parity protocols.

    A Short Counterexample Property for Safety and Liveness Verification of Fault-tolerant Distributed Algorithms

    Distributed algorithms have many mission-critical applications ranging from embedded systems and replicated databases to cloud computing. Due to asynchronous communication, process faults, or network failures, these algorithms are difficult to design and verify. Many algorithms achieve fault tolerance by using threshold guards that, for instance, ensure that a process waits until it has received an acknowledgment from a majority of its peers. Consequently, domain-specific languages for fault-tolerant distributed systems offer language support for threshold guards. We introduce an automated method for model checking of safety and liveness of threshold-guarded distributed algorithms in systems where the number of processes and the fraction of faulty processes are parameters. Our method is based on a short counterexample property: if a distributed algorithm violates a temporal specification (in a fragment of LTL), then there is a counterexample whose length is bounded and independent of the parameters. We prove this property by (i) characterizing executions depending on the structure of the temporal formula, and (ii) using commutativity of transitions to accelerate and shorten executions. We extended the ByMC toolset (Byzantine Model Checker) with our technique, and verified liveness and safety of 10 prominent fault-tolerant distributed algorithms, most of which were out of reach for existing techniques. Comment: 16 pages, 11 pages appendi
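The majority-acknowledgment guard mentioned in this abstract can be sketched as follows. This is an illustrative sketch only: the class `AckCollector` and its method names are hypothetical and are not taken from ByMC or from the paper.

```python
class AckCollector:
    """Tracks acknowledgments and evaluates a majority threshold guard.

    Hypothetical helper for illustration; n_peers is the total number
    of peers the process is waiting on.
    """

    def __init__(self, n_peers: int) -> None:
        self.n_peers = n_peers
        self._acks = set()  # identifiers of distinct peers that acknowledged

    def receive_ack(self, sender: str) -> None:
        # A set ensures that repeated acks from one peer are counted once.
        self._acks.add(sender)

    def guard_enabled(self) -> bool:
        # Threshold guard: strictly more than half of the peers have acknowledged.
        return len(self._acks) > self.n_peers // 2


collector = AckCollector(n_peers=5)
for peer in ["p1", "p2", "p2"]:  # the duplicate from p2 does not count twice
    collector.receive_ack(peer)
blocked = collector.guard_enabled()   # two distinct acks out of five: guard is false
collector.receive_ack("p3")
released = collector.guard_enabled()  # three distinct acks out of five: guard is true
```

Counting distinct senders rather than messages is a design choice of this sketch: it keeps a single peer, faulty or not, from satisfying the guard by resending its acknowledgment.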