222 research outputs found

    Rapid Recovery for Systems with Scarce Faults

    Our goal is to achieve a high degree of fault tolerance through the control of safety-critical systems. This reduces to solving a game between a malicious environment that injects failures and a controller who tries to establish correct behavior. We suggest a new control objective for such systems that offers a better balance between complexity and precision: we seek systems that are k-resilient. In order to be k-resilient, a system needs to be able to rapidly recover from a small number, up to k, of local faults infinitely many times, provided that blocks of up to k faults are separated by short recovery periods in which no fault occurs. k-resilience is a simple but powerful abstraction from the precise distribution of local faults, yet much more refined than the traditional objective of maximizing the number of local faults a system can tolerate. We argue why we believe this to be the right level of abstraction for safety-critical systems when local faults are few and far between. We show that the computational complexity of constructing optimal control with respect to resilience is low and demonstrate the feasibility through an implementation and experimental results. Comment: In Proceedings GandALF 2012, arXiv:1210.202
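    As a concrete illustration of the k-resilience condition just described (my sketch, not the paper's construction), the following check tests whether a recorded trace of fault times arrives in blocks of at most k faults separated by fault-free recovery periods; the trace format and the `recovery` parameter are assumptions of this sketch.

```python
# Illustrative check of the k-resilience pattern on a recorded fault trace:
# faults must arrive in blocks of at most k, and consecutive blocks must be
# separated by at least `recovery` fault-free time steps. Parameter names and
# the trace format are assumptions of this sketch, not the paper's notation.

def is_k_resilient_trace(fault_times, k, recovery):
    """fault_times: sorted integer time steps at which a local fault occurred."""
    block_size = 0
    last_fault = None
    for t in fault_times:
        if last_fault is not None and t - last_fault > recovery:
            block_size = 0          # a recovery period elapsed: start a new block
        block_size += 1
        if block_size > k:
            return False            # more than k faults without an intervening recovery
        last_fault = t
    return True

# Two blocks of 2 faults separated by a quiet period are fine for k = 2:
print(is_k_resilient_trace([3, 4, 20, 21], k=2, recovery=5))   # True
print(is_k_resilient_trace([3, 4, 5], k=2, recovery=5))        # False
```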

    An Optimal Self-Stabilizing Firing Squad

    Consider a fully connected network where up to t processes may crash, and all processes start in an arbitrary memory state. The self-stabilizing firing squad problem consists of eventually guaranteeing simultaneous response to an external input. This is modeled by requiring that the non-crashed processes "fire" simultaneously if some correct process received an external "GO" input, and that they only fire as a response to some process receiving such an input. This paper presents FireAlg, the first self-stabilizing firing squad algorithm. The FireAlg algorithm is optimal in two respects: (a) Once the algorithm is in a safe state, it fires in response to a GO input as fast as any other algorithm does, and (b) Starting from an arbitrary state, it converges to a safe state as fast as any other algorithm does. Comment: Shorter version to appear in SSS0
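    To make the simultaneity requirement concrete, here is a toy round-based simulation in which a GO received by one process is relayed and all processes fire in the same pre-agreed round; it is only an illustration of the problem statement, not of FireAlg, and it ignores crashes and arbitrary initial states.

```python
# Toy round-based illustration of the firing-squad requirements only (this is not
# FireAlg): a process that receives GO relays it to everyone, and all processes
# fire in the same, pre-agreed round. Crashes, arbitrary initial memory states and
# the optimality properties discussed above are deliberately left out.

N = 5  # processes in a fully connected, synchronous network

def simulate(go_round, total_rounds=10):
    learned = {}                         # process -> round in which it learned of GO
    fire_round = None
    for r in range(total_rounds):
        if go_round is not None and r == go_round:
            learned[0] = r               # say process 0 receives the external GO
            fire_round = r + 2           # common firing round: receipt + relay + fire
        if go_round is not None and r == go_round + 1:
            for p in range(1, N):
                learned[p] = r           # relay delivered to everyone one round later
        if fire_round is not None and r == fire_round:
            print(f"round {r}: all {N} processes fire simultaneously")

simulate(go_round=3)                     # round 5: all 5 processes fire simultaneously
simulate(go_round=None)                  # no GO input, so nothing ever fires
```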

    Asynchronous neighborhood task synchronization

    Faults are likely to occur in distributed systems. The motivation for designing self-stabilizing systems is to be able to automatically recover from a faulty state. As per Dijkstra's definition, a system is self-stabilizing if it converges to a desired state from an arbitrary state in a finite number of steps. The paradigm of self-stabilization is considered to be the most unified approach to designing fault-tolerant systems. Any type of fault, e.g., transient faults, process crashes and restarts, link failures and recoveries, and Byzantine faults, can be handled by a self-stabilizing system. Many applications in distributed systems involve multiple phases. Solving these applications requires some degree of synchronization of phases. In this thesis research, we introduce a new problem, called asynchronous neighborhood task synchronization (NTS). In this problem, processes execute infinite instances of tasks, where a task consists of a set of steps. There are several requirements for this problem. Simultaneous execution of steps by neighbors is allowed only if the steps are different. Every neighborhood is synchronized in the sense that all neighboring processes execute the same instance of a task. Although the NTS problem is applicable in nonfaulty environments, it is more challenging to solve this problem considering various types of faults. In this research, we will present a self-stabilizing solution to the NTS problem. The proposed solution is space optimal, fault containing, fully localized, and fully distributed. One of the most desirable properties of our algorithm is that it works under any (including unfair) daemon. We will discuss various applications of the NTS problem.
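    The two requirements above can be stated as a simple invariant on configurations. The sketch below (hypothetical, not the thesis algorithm) checks a configuration against them, assuming each process exposes a (task instance, step) pair and the network is given as an adjacency map.

```python
# Hypothetical check of the two NTS requirements stated above (same task instance
# within every neighborhood; simultaneous neighbor steps only if the steps differ).
# This validates configurations; it is not the thesis's self-stabilizing algorithm.

def nts_invariants_hold(graph, state, executing):
    """
    graph:     dict node -> set of neighbor nodes
    state:     dict node -> (task_instance, step) the node is currently working on
    executing: set of nodes taking a step in this configuration
    """
    for u, neighbors in graph.items():
        for v in neighbors:
            if state[u][0] != state[v][0]:
                return False     # neighbors must run the same task instance
            if u in executing and v in executing and state[u][1] == state[v][1]:
                return False     # simultaneous neighbor steps must be different
    return True

graph = {0: {1}, 1: {0, 2}, 2: {1}}
state = {0: (7, "a"), 1: (7, "b"), 2: (7, "b")}       # everyone on instance 7
print(nts_invariants_hold(graph, state, {0, 1}))      # True: steps "a" and "b" differ
print(nts_invariants_hold(graph, state, {1, 2}))      # False: neighbors 1 and 2 both on "b"
```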

    Exploiting replication in distributed systems

    Techniques are examined for replicating data and execution in directly distributed systems: systems in which multiple processes interact directly with one another while continuously respecting constraints on their joint behavior. Directly distributed systems are often required to solve difficult problems, ranging from management of replicated data to dynamic reconfiguration in response to failures. It is shown that these problems reduce to more primitive, order-based consistency problems, which can be solved using primitives such as reliable broadcast protocols. Moreover, given a system that implements reliable broadcast primitives, a flexible set of high-level tools can be provided for building a wide variety of directly distributed application programs.
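    As background for the reduction mentioned above, a minimal crash-tolerant reliable broadcast can be sketched as "relay on first receipt"; the class below is an illustrative sketch under assumed crash faults and reliable point-to-point links, not the toolkit the text describes.

```python
# Minimal sketch of the classic "relay on first receipt" reliable broadcast that
# order-based tools of this kind can be layered on. Assumptions made here: crash
# failures only and reliable point-to-point links between correct processes.

class ReliableBroadcast:
    def __init__(self, pid, peers, send):
        self.pid = pid            # this process's identifier
        self.peers = peers        # identifiers of all processes (including self)
        self.send = send          # send(dst, msg): point-to-point send primitive
        self.delivered = set()    # message ids already delivered locally

    def broadcast(self, msg_id, payload):
        for p in self.peers:
            self.send(p, (msg_id, payload))

    def on_receive(self, msg):
        msg_id, payload = msg
        if msg_id in self.delivered:
            return None                    # duplicate: already handled
        self.delivered.add(msg_id)
        # Relay before delivering, so that if the sender crashed mid-broadcast,
        # every correct process that saw the message passes it on to the rest.
        for p in self.peers:
            if p != self.pid:
                self.send(p, msg)
        return payload                     # hand the payload to the application
```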

    Secure time information in the internet key exchange protocol

    Many network services and protocols can work correctly only when freshness of messages sent between participants is assured and when the protocol parties’ internal clocks are adjusted. In this paper we present a novel, secure and fast procedure which can be used to ensure data freshness and clock synchronization between two communicating parties. Next, we show how this solution can be used in other cryptographic protocols. As an example of application we apply our approach to the Internet Key Exchange (IKE) protocol family.
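    For illustration only (this is not the paper's procedure), a nonce-plus-timestamp exchange of the kind hinted at above can check freshness and estimate the clock offset between two parties; the pre-shared key, HMAC choice and freshness window below are assumptions of the sketch.

```python
# Generic nonce-plus-timestamp exchange for freshness checking and rough clock-offset
# estimation, in the spirit of the problem statement above; it is not the paper's
# procedure. A pre-shared key with HMAC-SHA256 stands in for the authentication that
# a real protocol run (e.g., inside IKE) would provide.

import hashlib
import hmac
import os
import time

KEY = os.urandom(32)   # assumed pre-shared key between the two parties

def mac(data: bytes) -> bytes:
    return hmac.new(KEY, data, hashlib.sha256).digest()

def client_request():
    return os.urandom(16), time.time()                  # fresh nonce, client send time t1

def server_reply(nonce, t2):
    return t2, mac(nonce + str(t2).encode())             # server time t2, authenticated

def client_check(nonce, t1, t2, tag, max_age=5.0):
    t3 = time.time()                                      # client receive time
    if not hmac.compare_digest(tag, mac(nonce + str(t2).encode())):
        raise ValueError("reply is not authentic")
    if t3 - t1 > max_age:
        raise ValueError("stale reply: freshness window exceeded")
    return t2 - (t1 + t3) / 2                             # rough clock-offset estimate

nonce, t1 = client_request()
t2, tag = server_reply(nonce, time.time())                # pretend this ran on the server
print("estimated clock offset (s):", client_check(nonce, t1, t2, tag))
```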

    Distributed Key Generation for the Internet

    Although distributed key generation (DKG) has been studied for some time, it has never been examined outside of the synchronous setting. We present the first realistic DKG architecture for use over the Internet. We propose a practical system model and define an efficient verifiable secret sharing scheme in it. We observe the necessity of Byzantine agreement for asynchronous DKG and analyze the difficulty of using a randomized protocol for it. Using our verifiable secret sharing scheme and a leader-based agreement protocol, we then design a DKG protocol for public-key cryptography. Finally, along with traditional proactive security, we also introduce group modification primitives in our system.
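    As background, the sketch below shows plain (non-verifiable) Shamir secret sharing over a toy prime field, the substrate on which verifiable secret sharing and hence DKG are built; the verification commitments, complaint handling and agreement machinery the paper adds are omitted.

```python
# Background sketch only: plain Shamir secret sharing over a toy prime field. The
# commitments, complaint handling and Byzantine agreement that the asynchronous
# Internet setting requires are omitted here.

import random

P = 2**127 - 1   # a Mersenne prime; fine for a toy example

def share(secret, n, t):
    """Split `secret` into n shares so that any t+1 of them reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]
    poly = lambda x: sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = share(secret=123456789, n=5, t=2)
print(reconstruct(shares[:3]))   # any 3 of the 5 shares recover 123456789
```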

    Architecture, Services and Protocols for CRUTIAL

    This document describes the complete specification of the architecture, services and protocols of the CRUTIAL project. The CRUTIAL Architecture intends to reply to a grand challenge of computer science and control engineering: how to achieve resilience of critical information infrastructures (CII), in particular in the electrical sector. In general lines, the document starts by presenting the main architectural options and components of the architecture, with a special emphasis on a protection device called the CRUTIAL Information Switch (CIS). Given the various criticality levels of the equipment that has to be protected, and the cost of using a replicated device, we define a hierarchy of incrementally more resilient CIS designs. The different CIS designs offer various trade-offs in terms of capabilities to prevent and tolerate intrusions, both in the device itself and in the information infrastructure.

    The Middleware Services, APIs and Protocols chapter describes our approach to intrusion-tolerant middleware. The CRUTIAL middleware comprises several building blocks that are organized into a set of layers. The Multipoint Network layer is the lowest layer of the middleware, and features an abstraction of basic communication services, such as those provided by standard protocols like IP, IPsec, UDP, TCP and SSL/TLS. The Communication Support layer features three important building blocks: the Randomized Intrusion-Tolerant Services (RITAS), the CIS Communication service and the Fosel service for mitigating DoS attacks. The Activity Support layer comprises the CIS Protection service and the Access Control and Authorization service. The Access Control and Authorization service is implemented through PolyOrBAC, which defines the rules for information exchange and collaboration between sub-modules of the architecture, corresponding in fact to different facilities of the CII’s organizations. The Monitoring and Failure Detection layer contains a definition of the services devoted to monitoring and failure detection activities.

    The Runtime Support Services, APIs, and Protocols chapter features, as its main component, the Proactive-Reactive Recovery service, whose aim is to guarantee perpetual correct execution of any components it protects. Project co-funded by the European Commission within the Sixth Framework Programme (2002-2006).
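    To illustrate the idea behind the Proactive-Reactive Recovery service (the names and parameters below are mine, not CRUTIAL's), a recovery scheduler can combine periodic rejuvenation with immediate recovery of replicas flagged by a detector:

```python
# Illustrative scheduling logic for a proactive-reactive recovery service of the kind
# described above: every protected replica is rejuvenated periodically (proactive),
# and a replica flagged by a detector is rejuvenated immediately (reactive). The
# period, the detector interface and the rejuvenate() hook are assumptions of this
# sketch, not part of the CRUTIAL specification.

import time

class RecoveryScheduler:
    def __init__(self, replicas, period, rejuvenate, now=time.monotonic):
        self.period = period                    # proactive recovery interval (seconds)
        self.rejuvenate = rejuvenate            # callback that restores a clean replica image
        self.now = now
        self.last_recovery = {r: self.now() for r in replicas}

    def report_suspicious(self, replica):
        """Reactive path: a monitor flagged this replica as possibly compromised."""
        self._recover(replica)

    def tick(self):
        """Proactive path: invoke periodically from a timer."""
        for replica, last in list(self.last_recovery.items()):
            if self.now() - last >= self.period:
                self._recover(replica)

    def _recover(self, replica):
        self.rejuvenate(replica)
        self.last_recovery[replica] = self.now()

scheduler = RecoveryScheduler(["cis-1", "cis-2"], period=3600.0,
                              rejuvenate=lambda r: print("rejuvenating", r))
scheduler.report_suspicious("cis-2")            # reactive recovery right away
```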

    Preliminary Specification of Services and Protocols

    This document describes the preliminary specification of services and protocols for the Crutial Architecture. The Crutial Architecture definition, first addressed in Crutial Project Technical Report D4 (January 2007), intends to reply to a grand challenge of computer science and control engineering: how to achieve resilience of critical information infrastructures, in particular in the electrical sector. The definitions herein elaborate on the major architectural options and components established in the Preliminary Architecture Specification (D4), with special relevance to the Crutial middleware building blocks, and are based on the fault, synchrony and topological models defined in the same document. The document, in general lines, describes the Runtime Support Services and APIs, and the Middleware Services and APIs. Then, it delves into the protocols, describing the Runtime Support Protocols and the Middleware Services Protocols.

    The Runtime Support Services and APIs chapter features, as its main component, the Proactive-Reactive Recovery Service, whose aim is to guarantee perpetual execution of any components it protects. The Middleware Services and APIs chapter describes our approach to intrusion-tolerant middleware. The middleware comprises several layers. The Multipoint Network layer is the lowest layer of CRUTIAL's middleware, and features an abstraction of basic communication services, such as those provided by standard protocols like IP, IPsec, UDP, TCP and SSL/TLS. The Communication Support Services feature two important building blocks: the Randomized Intrusion-Tolerant Services (RITAS), and the Overlay Protection Layer (OPL) against DoS attacks. The Activity Support Services currently defined comprise the CIS Protection service and the Access Control and Authorization service. Protection as described in this report is implemented by mechanisms and protocols residing on a device called the Crutial Information Switch (CIS). The Access Control and Authorization service is implemented through PolyOrBAC, which defines the rules for information exchange and collaboration between sub-modules of the architecture, corresponding in fact to different facilities of the CII's organizations. The Monitoring and Failure Detection layer contains a preliminary definition of the middleware services devoted to monitoring and failure detection activities.

    The remaining chapters describe the protocols implementing the above-mentioned services: the Runtime Support Protocols and the Middleware Services Protocols.
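    As a generic illustration of the monitoring and failure detection services defined above (the interface and timeout below are assumed, not part of the deliverable), a heartbeat-timeout detector might look like this:

```python
# Generic heartbeat-timeout sketch of a monitoring / failure detection service of the
# kind the layer above only defines abstractly; the timeout value, the interface and
# the clock source are assumptions made here, not part of the CRUTIAL deliverable.

import time

class HeartbeatDetector:
    def __init__(self, monitored, timeout, now=time.monotonic):
        self.timeout = timeout
        self.now = now
        self.last_seen = {m: self.now() for m in monitored}

    def heartbeat(self, member):
        """Record that `member` is alive (call when its heartbeat message arrives)."""
        self.last_seen[member] = self.now()

    def suspected(self):
        """Members whose heartbeat has not been seen within the timeout window."""
        cutoff = self.now() - self.timeout
        return [m for m, seen in self.last_seen.items() if seen < cutoff]

detector = HeartbeatDetector(["station-a", "station-b"], timeout=2.0)
detector.heartbeat("station-a")
print(detector.suspected())      # [] while both members are within the timeout window
```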

    Byzantine Fault Tolerance for Distributed Systems

    The growing reliance on online services imposes a high dependability requirement on the computer systems that provide these services. Byzantine fault tolerance (BFT) is a promising technology to solidify such systems for the much-needed high dependability. BFT employs redundant copies of the servers and ensures that a replicated system continues providing correct services despite attacks on a small portion of the system. In this dissertation research, I developed novel algorithms and mechanisms to control various types of application nondeterminism and to ensure the long-term reliability of BFT systems via a migration-based proactive recovery scheme. I also investigated a new approach to significantly improve the overall system throughput by enabling concurrent processing using Software Transactional Memory (STM). Controlling application nondeterminism is essential to achieve strong replica consistency because the BFT technology is based on state-machine replication, which requires deterministic operation of each replica. Proactive recovery is necessary to ensure that the fundamental assumption of the BFT technology is not violated over the long term, i.e., that fewer than one-third of the replicas become faulty. Without proactive recovery, more and more replicas will be compromised under continuous attacks, which would render BFT ineffective. STM-based concurrent processing maximizes the system throughput by utilizing the power of multi-core CPUs while preserving strong replica consistency.
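    One common way to control such nondeterminism, shown below as a hedged sketch rather than the dissertation's own mechanism, is to have the primary propose the nondeterministic inputs (timestamp, random seed) with each ordered request so that all replicas execute with identical values.

```python
# Sketch of one standard way to control replica nondeterminism in state-machine
# replication: the primary proposes the nondeterministic inputs (timestamp, RNG seed)
# together with the ordered request, so every replica executes with identical values.
# This illustrates the problem discussed above, not the dissertation's algorithms.

import random
import time

def primary_order(request):
    """Primary attaches the nondeterministic inputs to the ordered request."""
    return {"request": request,
            "timestamp": time.time(),
            "seed": random.getrandbits(64)}

def replica_execute(ordered, state):
    """Every replica runs this with the same ordered record, so all results agree."""
    rng = random.Random(ordered["seed"])        # replica-local RNG seeded identically
    state["last_ts"] = ordered["timestamp"]     # use the proposed time, not the local clock
    state["lottery"] = rng.randrange(100)       # same 'random' outcome on every replica
    return state

ordered = primary_order("place-bid")
print(replica_execute(ordered, {}))             # identical output...
print(replica_execute(ordered, {}))             # ...on every replica
```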