35 research outputs found

    Self-Stabilizing Byzantine Asynchronous Unison

    We explore asynchronous unison in the presence of systemic transient faults and permanent Byzantine faults in shared memory. We observe that the problem is not solvable under any scheduler weaker than a strongly fair one, or for system topologies with a maximum node degree greater than two. We present a self-stabilizing Byzantine-tolerant solution to asynchronous unison for chain and ring topologies. Our algorithm has the minimum possible containment radius and optimal stabilization time.

    Dynamic FTSS in Asynchronous Systems: the Case of Unison

    Distributed fault-tolerance can mask the effect of a limited number of permanent faults, while self-stabilization provides forward recovery after an arbitrary number of transient faults hit the system. FTSS protocols combine the best of both worlds since they are simultaneously fault-tolerant and self-stabilizing. To date, FTSS solutions either consider static (i.e., fixed-point) tasks or assume synchronous scheduling of the system components. In this paper, we present the first study of dynamic tasks in asynchronous systems, considering the unison problem as a benchmark. Unison can be seen as a local clock synchronization problem: neighbors must maintain digital clocks at most one time unit away from each other and increment their own clock values infinitely often. We present several impossibility results for this difficult problem and, for the cases where it is solvable, propose an FTSS solution that exhibits optimal fault containment.
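    To make the unison specification above concrete, the following minimal sketch (illustrative only, not the paper's algorithm) checks the safety condition that neighboring clocks differ by at most one, and performs the liveness action of incrementing a clock when no neighbor lags behind; the graph encoding and node names are assumptions made for this example.

```python
# Illustrative sketch of the (unbounded-clock) unison specification:
# safety: neighboring clocks differ by at most one time unit;
# liveness: a node increments when no neighbor lags behind it.
# Graph encoding and node names are assumptions for this example.

from typing import Dict, List

def safe(clock: Dict[str, int], neighbors: Dict[str, List[str]]) -> bool:
    """All neighboring clocks are at most one time unit apart."""
    return all(abs(clock[u] - clock[v]) <= 1
               for u in neighbors for v in neighbors[u])

def enabled(u: str, clock: Dict[str, int], neighbors: Dict[str, List[str]]) -> bool:
    """Node u may tick only if no neighbor is behind it."""
    return all(clock[v] >= clock[u] for v in neighbors[u])

def tick(u: str, clock: Dict[str, int], neighbors: Dict[str, List[str]]) -> None:
    if enabled(u, clock, neighbors):
        clock[u] += 1

# Tiny chain a-b-c: repeatedly ticking enabled nodes preserves safety
# while every clock keeps increasing.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
clock = {"a": 0, "b": 0, "c": 0}
for _ in range(6):
    for u in neighbors:
        tick(u, clock, neighbors)
    assert safe(clock, neighbors)
```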

    Auditable Restoration of Distributed Programs

    We focus on a protocol for auditable restoration of distributed systems. The need for such a protocol arises from conflicting requirements (e.g., access to the system should be restricted, but emergency access should be provided). One can design such systems with a tamper-detection approach (based on the intuition of "break the glass door"). However, in a distributed system, such tampering, denoted as an auditable event, is visible only to a single node. This is unacceptable, since the actions processes take in these situations can differ from those taken in normal mode. Moreover, the auditable event eventually needs to be cleared so that the system resumes normal operation. With this motivation, in this paper we present a protocol for auditable restoration, where any process can potentially identify an auditable event. Whenever a new auditable event occurs, the system must reach an "auditable state" in which every process is aware of the auditable event. Only after the system reaches an auditable state can it begin the operation of restoration. Although any process can observe an auditable event, we require that only "authorized" processes can begin the task of restoration. Moreover, these processes can begin the restoration only when the system is in an auditable state. Our protocol is self-stabilizing and has bounded state space. It can effectively handle the case where faults or auditable events occur during the restoration protocol. Moreover, it can be used to provide auditable restoration to other distributed protocols.
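    A highly simplified sketch of the mode discipline described above (not the paper's protocol): any process can flag an auditable event, the event propagates to every process, and restoration may start only from an authorized process once all processes are in the auditable state. The process set, mode names, and the authorized flag are assumptions made for illustration.

```python
# Minimal sketch of the auditable-restoration mode discipline described above.
# Modes, process names, and the "authorized" flag are illustrative assumptions.

from enum import Enum

class Mode(Enum):
    NORMAL = "normal"
    AUDITABLE = "auditable"
    RESTORING = "restoring"

class Process:
    def __init__(self, name: str, authorized: bool = False):
        self.name = name
        self.authorized = authorized
        self.mode = Mode.NORMAL

def observe_auditable_event(processes, witness):
    """Any process can observe the event; it then propagates to everyone."""
    witness.mode = Mode.AUDITABLE
    for p in processes:              # stand-in for the distributed propagation
        p.mode = Mode.AUDITABLE

def try_start_restoration(processes, initiator) -> bool:
    """Restoration starts only from an authorized process, and only once
    every process has reached the auditable state."""
    if not initiator.authorized:
        return False
    if any(p.mode != Mode.AUDITABLE for p in processes):
        return False
    for p in processes:
        p.mode = Mode.RESTORING
    return True

procs = [Process("p1", authorized=True), Process("p2"), Process("p3")]
observe_auditable_event(procs, procs[2])
assert try_start_restoration(procs, procs[0])
```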

    Efficient Counting with Optimal Resilience

    Consider a complete communication network of n nodes that receive a common clock pulse. We study the synchronous c-counting problem: given any starting state and up to f faulty nodes with arbitrary behavior, the task is to have all correct nodes eventually label the pulses, in agreement, with values that increase modulo c. Thus, we are considering algorithms that are self-stabilizing despite Byzantine failures. In this work, we give new algorithms for the synchronous counting problem that (1) are deterministic, (2) have optimal resilience, (3) have a stabilization time linear in f (asymptotically optimal), (4) use a small number of states, and, consequently, (5) communicate a small number of bits per round. Prior algorithms either resort to randomization, use a large number of states and need high communication bandwidth, or have suboptimal resilience. In particular, we achieve an exponential improvement in both state complexity and message size for deterministic algorithms. Moreover, we present two complementary approaches for reducing the number of bits communicated during and after stabilization.
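    The sketch below states the c-counting correctness condition only (it is not the paper's resilient algorithm): after stabilization, all correct nodes output the same label at each pulse, and labels advance modulo c. The value of c, the node names, and the traces are assumptions made for illustration.

```python
# Sketch of the synchronous c-counting specification: after stabilization,
# all correct nodes agree on a label each pulse, and labels advance mod c.
# The traces below are made up for illustration; this is not the algorithm.

def counts_correctly(outputs, correct_nodes, c, stabilization_pulse):
    """outputs[t][v] is node v's label at pulse t."""
    for t in range(stabilization_pulse, len(outputs) - 1):
        labels = {outputs[t][v] for v in correct_nodes}
        if len(labels) != 1:                      # agreement at pulse t
            return False
        (label,) = labels
        nxt = {outputs[t + 1][v] for v in correct_nodes}
        if nxt != {(label + 1) % c}:              # increment modulo c
            return False
    return True

c = 4
correct = {"v1", "v2", "v3"}            # a Byzantine node "v4" is ignored
trace = [
    {"v1": 3, "v2": 0, "v3": 1, "v4": 9},   # pre-stabilization: arbitrary
    {"v1": 2, "v2": 2, "v3": 2, "v4": 0},   # stabilized from here on
    {"v1": 3, "v2": 3, "v3": 3, "v4": 7},
    {"v1": 0, "v2": 0, "v3": 0, "v4": 1},
]
assert counts_correctly(trace, correct, c, stabilization_pulse=1)
```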

    Asynchronous neighborhood task synchronization

    Faults are likely to occur in distributed systems. The motivation for designing self-stabilizing systems is the ability to automatically recover from a faulty state. According to Dijkstra's definition, a system is self-stabilizing if it converges to a desired state from an arbitrary state in a finite number of steps. The paradigm of self-stabilization is considered the most unified approach to designing fault-tolerant systems: any type of fault, e.g., transient faults, process crashes and restarts, link failures and recoveries, and Byzantine faults, can be handled by a self-stabilizing system. Many applications in distributed systems involve multiple phases, and solving them requires some degree of synchronization between phases. In this thesis research, we introduce a new problem, called asynchronous neighborhood task synchronization (NTS). In this problem, processes execute infinite instances of tasks, where a task consists of a set of steps. There are several requirements for this problem. Simultaneous execution of steps by neighbors is allowed only if the steps are different. Every neighborhood is synchronized in the sense that all neighboring processes execute the same instance of a task. Although the NTS problem is applicable in non-faulty environments, it is more challenging to solve in the presence of various types of faults. In this research, we present a self-stabilizing solution to the NTS problem. The proposed solution is space optimal, fault containing, fully localized, and fully distributed. One of the most desirable properties of our algorithm is that it works under any (including unfair) daemon. We also discuss various applications of the NTS problem.
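    As a loose illustration of the two NTS requirements stated above (not the thesis's algorithm): neighbors may execute steps at the same time only if the steps differ, and every process must be on the same task instance as all of its neighbors. The state layout and names below are assumptions made for illustration.

```python
# Sketch of the two NTS requirements: neighbors may act simultaneously only
# on different steps, and each neighborhood stays on one task instance.
# State layout and names are assumptions made for illustration.

def legal_simultaneous_steps(steps, neighbors):
    """steps[p] is the step p executes in this round (or None if idle).
    Neighbors may act simultaneously only on different steps."""
    return all(steps[u] is None or steps[v] is None or steps[u] != steps[v]
               for u in neighbors for v in neighbors[u])

def neighborhood_synchronized(instance, neighbors):
    """instance[p] is the task instance p is currently executing;
    every process must agree with all of its neighbors."""
    return all(instance[u] == instance[v]
               for u in neighbors for v in neighbors[u])

neighbors = {"p": ["q"], "q": ["p", "r"], "r": ["q"]}
steps = {"p": "step1", "q": "step2", "r": "step1"}   # p and r are not neighbors
instance = {"p": 5, "q": 5, "r": 5}
assert legal_simultaneous_steps(steps, neighbors)
assert neighborhood_synchronized(instance, neighbors)
```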

    A Taxonomy of Daemons in Self-stabilization

    We survey existing scheduling hypotheses made in the literature in self-stabilization, commonly referred to under the notion of daemon. We show that four main characteristics (distribution, fairness, boundedness, and enabledness) are enough to encapsulate the various differences presented in existing work. Our naming scheme makes it easy to compare daemons of particular classes, and to extend existing possibility or impossibility results to new daemons. We further examine existing daemon transformer schemes and provide the exact transformed characteristics of those transformers in our taxonomy.
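    The four characteristics named in the abstract can be pictured as a small record; the field domains and example values below are assumptions made for illustration, not the paper's formal definitions.

```python
# Sketch of the four daemon characteristics named in the abstract
# (distribution, fairness, boundedness, enabledness) as a small record.
# Field domains and example values are illustrative assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Daemon:
    distribution: str           # e.g. "central", "locally central", "distributed"
    fairness: str               # e.g. "unfair", "weakly fair", "strongly fair"
    bound: Optional[int]        # k-bounded, or None if unbounded
    enabledness: Optional[int]  # k-enabled, or None if unconstrained

# Two commonly assumed schedulers, expressed as records in this style:
unfair_distributed = Daemon("distributed", "unfair", None, None)
strongly_fair_central = Daemon("central", "strongly fair", None, None)
```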

    The Firing Squad Problem Revisited

    In the classical firing squad problem, an unknown number of nodes represented by identical finite state machines are arranged on a line, and in each time unit each node may change its state according to its neighbors' states. Initially all nodes are passive, except one specific node located at an end of the line, which issues a fire command. This command needs to be propagated to all other nodes, so that eventually all nodes simultaneously enter some designated "firing" state. A natural extension of the firing squad problem, introduced in this paper, allows each node to postpone its participation in the squad for an arbitrary time, possibly forever, and firing is allowed only after all nodes have decided to participate. This variant is highly relevant in the context of decentralized distributed computing, where processes have to coordinate to initiate various tasks simultaneously. The main goal of this paper is to study the above variant of the firing squad problem under the assumptions that the nodes are infinite state machines and that the inter-node communication links can change arbitrarily in each time unit, i.e., are defined by a dynamic graph. In this setting, we study the following fundamental question: what connectivity requirements enable a solution to the firing squad problem? Our main result is an exact characterization of the dynamic graphs for which the firing squad problem can be solved. When restricted to static directed graphs, this characterization implies that the problem can be solved if and only if the graph is strongly connected. We also discuss how information on the number of nodes or on the diameter of the network, and the use of randomization, can improve the solutions to the problem.
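    The paper's contribution is the dynamic-graph characterization; as a much simpler illustration of simultaneous firing, the sketch below simulates the folklore approach for a static strongly connected digraph when an upper bound D on the initiator's distance to every node is known: the fire command is flooded, and all nodes fire at round D. The graph and the bound D are assumptions made for illustration; this is not the paper's algorithm.

```python
# Illustrative sketch only: the classical trick for static strongly connected
# graphs when an upper bound D on the distance from the initiator is known.
# The initiator floods "fire at round D"; every node hears it by round D,
# so all nodes fire in the same round. Graph and D are assumptions.

def fire_rounds(succ, initiator, D):
    """succ[v] = out-neighbors of v. Returns the round in which each node fires."""
    informed = {initiator}
    for _ in range(D):                 # rounds 1..D: flood the fire command
        informed |= {w for v in informed for w in succ[v]}
    assert informed == set(succ), "D must bound the distance from the initiator"
    return {v: D for v in succ}        # everyone fires simultaneously at round D

succ = {"a": ["b"], "b": ["c"], "c": ["a"]}   # directed ring (strongly connected)
rounds = fire_rounds(succ, "a", D=2)
assert len(set(rounds.values())) == 1          # simultaneity
```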

    Extending Byzantine Fault Tolerance to Replicated Clients

    Byzantine agreement protocols for replicated deterministic state machines guarantee that externally requested operations continue to execute correctly even if a bounded number of replicas fail in arbitrary ways. The state machines are passive, with clients responsible for any active ongoing application behavior. However, the clients are unreplicated and outside the fault-tolerance boundary. Consequently, agreement protocols for replicated state machines do not guarantee continued correct execution of long-running client applications. Building on the Castro and Liskov Byzantine Fault Tolerance protocol for unreplicated clients (CLBFT), we present a practical algorithm for Byzantine fault-tolerant execution of long-running distributed applications in which replicated deterministic clients invoke operations on replicated deterministic servers. The algorithm scales well to large replica groups, with roughly double the latency and message count compared to CLBFT, which supports only unreplicated clients. The algorithm supports both synchronous and asynchronous clients, provides fault isolation between client and server groups with respect to both correctness and performance, and uses a novel architecture that accommodates externally requested software upgrades for long-running evolvable client applications.
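    Very loosely, the client-replication idea above can be pictured as a server group accepting an operation only when matching requests arrive from a quorum of client replicas, so a bounded number of faulty client replicas can neither forge nor block an operation. The quorum size, message shape, and names below are assumptions made for illustration, not the paper's algorithm.

```python
# Loose illustration (not the paper's protocol): accept an operation only
# after matching requests from a quorum of client replicas, so up to f
# faulty client replicas cannot forge or block it. Quorum size and the
# message shape are illustrative assumptions.

def accept_operation(requests, n_client_replicas, f):
    """requests: list of (client_replica_id, operation). Returns the operation
    vouched for by at least n - f distinct client replicas, or None."""
    quorum = n_client_replicas - f
    votes = {}
    for replica_id, op in requests:
        votes.setdefault(op, set()).add(replica_id)   # count each replica once
    for op, replica_ids in votes.items():
        if len(replica_ids) >= quorum:
            return op
    return None

# 4 client replicas, at most f = 1 faulty: 3 matching requests suffice.
reqs = [("c1", "transfer(42)"), ("c2", "transfer(42)"),
        ("c3", "transfer(42)"), ("c4", "transfer(99)")]   # c4 is faulty
assert accept_operation(reqs, n_client_replicas=4, f=1) == "transfer(42)"
```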