
    Phase Clocks for Transient Fault Repair

    Phase clocks are synchronization tools that implement a form of logical time in distributed systems. For systems tolerating transient faults by self-repair of damaged data, phase clocks can enable reasoning about the progress of distributed repair procedures. This paper presents a phase clock algorithm suited to the model of transient memory faults in asynchronous systems with read/write registers. The algorithm is self-stabilizing and guarantees accuracy of phase clocks within O(k) time following an initial state that is k-faulty. Composition theorems show how the algorithm can be used for the timing of distributed procedures that repair system outputs. Comment: 22 pages, LaTe

    About logical clocks for distributed systems


    Causality Diagrams using Hybrid Vector Clocks

    Causality in distributed systems has long been studied, and many approaches use it to trace distributed system execution. Traditional approaches usually relied on system profiling, while newer approaches profile the clocks of systems to detect failures and reconstruct the timelines that led to them. Since the advent of logical clocks, these profiles have become increasingly accurate, with ways to characterize concurrency and distribution and with accurate diagrams of message passing. Vector clocks address the shortcomings of traditional logical clocks by also storing information about the other processes in the system. Hybrid vector clocks are a novel approach in which a clock need not store information about every process; instead, it stores information only for processes within an acceptable skew of the focused process. This gives an efficient way of profiling at substantially reduced cost to the system. Building on this idea, we propose building causal traces from the information generated by the hybrid vector clock. The hybrid vector clock provides a strong sense of concurrency and distribution, and we theorize that the information it generates is sufficient to develop a causal trace for debugging. We post-process and parse the clocks generated from an execution trace to build a swimlane diagram on a web interface that traces the points of failure of a distributed system. We also provide an API for reusing this concept in any generic distributed system framework.
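For readers unfamiliar with the baseline this abstract builds on, the following is a minimal sketch of the classic vector clock rules (tick, merge on receive, and the happened-before test). It is not the paper's hybrid vector clock, which additionally prunes entries outside an acceptable skew; the function names and the dictionary representation are assumptions made for illustration only.

```python
from typing import Dict

# Hypothetical representation: process id -> event counter (missing entry means 0).
VectorClock = Dict[str, int]


def tick(clock: VectorClock, pid: str) -> VectorClock:
    """Return a copy of `clock` with the entry of process `pid` incremented."""
    updated = dict(clock)
    updated[pid] = updated.get(pid, 0) + 1
    return updated


def merge(local: VectorClock, received: VectorClock, pid: str) -> VectorClock:
    """Receive rule: take the entrywise maximum, then tick the receiver's entry."""
    keys = set(local) | set(received)
    merged = {k: max(local.get(k, 0), received.get(k, 0)) for k in keys}
    return tick(merged, pid)


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if a <= b in every entry and a < b in at least one entry."""
    keys = set(a) | set(b)
    no_entry_greater = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    some_entry_less = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return no_entry_greater and some_entry_less


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True if the clocks are incomparable: each is ahead of the other somewhere."""
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return a_ahead and b_ahead


# Two independent local events, then q receives a message stamped with p's clock.
p_event = tick({}, "p")
q_event = tick({}, "q")
q_after_receive = merge(q_event, p_event, "q")
```

In this sketch `p_event` and `q_event` are concurrent, while `p_event` happened before `q_after_receive`. A causal trace of the kind the abstract describes could be ordered by such comparisons, though how the hybrid clock decides which entries to keep is not stated here.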

    Fault Tolerant Gradient Clock Synchronization

    Synchronizing clocks in distributed systems is well-understood, both in terms of fault-tolerance in fully connected systems and the dependence of local and global worst-case skews (i.e., maximum clock difference between neighbors and arbitrary pairs of nodes, respectively) on the diameter of fault-free systems. However, so far nothing non-trivial is known about the local skew that can be achieved in topologies that are not fully connected even under a single Byzantine fault. Put simply, in this work we show that the most powerful known techniques for fault-tolerant and gradient clock synchronization are compatible, in the sense that the best of both worlds can be achieved simultaneously. Concretely, we combine the Lynch-Welch algorithm [Welch1988] for synchronizing a clique of $n$ nodes despite up to $f<n/3$ Byzantine faults with the gradient clock synchronization (GCS) algorithm by Lenzen et al. [Lenzen2010] in order to render the latter resilient to faults. As this is not possible on general graphs, we augment an input graph $\mathcal{G}$ by replacing each node by $3f+1$ fully connected copies, which execute an instance of the Lynch-Welch algorithm. We then interpret these clusters as supernodes executing the GCS algorithm, where for each cluster its correct nodes' Lynch-Welch clocks provide estimates of the logical clock of the supernode in the GCS algorithm. By connecting clusters corresponding to neighbors in $\mathcal{G}$ in a fully bipartite manner, supernodes can inform each other about (estimates of) their logical clock values. This way, we achieve asymptotically optimal local skew, granted that no cluster contains more than $f$ faulty nodes, at factor $O(f)$ and $O(f^2)$ overheads in terms of nodes and edges, respectively. Note that tolerating $f$ faulty neighbors trivially requires degree larger than $f$, so this is asymptotically optimal as well.
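The graph augmentation described in this abstract can be sketched as follows. This is only an illustration of the topology construction (each node becomes a clique of 3f+1 copies, and neighboring clusters are joined in a fully bipartite manner); it does not implement the Lynch-Welch or GCS algorithms, and the function name `augment_graph` and the `(node, index)` labelling of copies are assumptions, not taken from the paper.

```python
from itertools import combinations
from typing import FrozenSet, Hashable, Iterable, List, Set, Tuple

Copy = Tuple[Hashable, int]  # hypothetical label: (original node, copy index)


def augment_graph(
    nodes: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
    f: int,
) -> Tuple[List[Copy], Set[FrozenSet[Copy]]]:
    """Replace each node by a clique of 3f+1 copies and each edge by a
    complete bipartite connection between the two corresponding clusters."""
    cluster_size = 3 * f + 1
    clusters = {v: [(v, i) for i in range(cluster_size)] for v in nodes}

    new_edges: Set[FrozenSet[Copy]] = set()

    # Intra-cluster edges: every pair of copies of the same node is connected.
    for members in clusters.values():
        for a, b in combinations(members, 2):
            new_edges.add(frozenset((a, b)))

    # Inter-cluster edges: every copy of u is connected to every copy of v.
    for u, v in edges:
        for a in clusters[u]:
            for b in clusters[v]:
                new_edges.add(frozenset((a, b)))

    new_nodes = [copy for members in clusters.values() for copy in members]
    return new_nodes, new_edges


# Path graph 0 - 1 - 2, tolerating f = 1 faulty node per cluster.
aug_nodes, aug_edges = augment_graph([0, 1, 2], [(0, 1), (1, 2)], f=1)
```

For this path graph with f = 1, each of the 3 nodes becomes 4 copies (12 nodes in total), each cluster contributes 6 internal edges, and each original edge becomes 16 edges, for 50 edges overall. That matches the stated factor $O(f)$ overhead in nodes and $O(f^2)$ overhead in edges.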

    Preserving Stabilization while Practically Bounding State Space

    Stabilization is a key dependability property for dealing with unanticipated transient faults, as it guarantees that even in the presence of such faults, the system will recover to states where it satisfies its specification. One of the desirable attributes of stabilization is the use of bounded space for each variable. In this paper, we present an algorithm that transforms a stabilizing program that uses variables with unbounded domain into a stabilizing program that uses bounded variables and (practically bounded) physical time. While non-stabilizing programs (that do not handle transient faults) can deal with unbounded variables by assigning large enough but bounded space, stabilizing programs that need to deal with arbitrary transient faults cannot do the same since a transient fault may corrupt the variable to its maximum value. We show that our transformation algorithm is applicable to several problems including logical clocks, vector clocks, mutual exclusion, leader election, diffusing computations, Paxos based consensus, and so on. Moreover, our approach can also be used to bound counters used in an earlier work by Katz and Perry for adding stabilization to a non-stabilizing program. By combining our algorithm with that earlier work by Katz and Perry, it would be possible to provide stabilization for a rich class of problems, by assigning large enough but bounded space for variables. Comment: Moved some content from the Appendix to the main paper, added some details to the transformation algorithm and to its descriptio

    Non-intrusive on-the-fly data race detection using execution replay

    This paper presents a practical solution for detecting data races in parallel programs. The solution combines execution replay (RecPlay) with automatic on-the-fly data race detection. This combination enables us to perform the data race detection on an unaltered execution (almost no probe effect). Furthermore, the use of multilevel bitmaps and snooped matrix clocks limits the amount of memory used. As the record phase of RecPlay is highly efficient, there is no need to switch it off, thereby eliminating the possibility of Heisenbugs because tracing can be left on all the time. Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003

    Execution replay and debugging

    As most parallel and distributed programs are internally non-deterministic (consecutive runs with the same input might result in a different program flow), plain cyclic debugging techniques are useless on their own. In order to use cyclic debugging tools, we need a tool that records information about an execution so that it can be replayed for debugging. Because recording information interferes with the execution, we must limit the amount of information and keep the processing of the information fast. This paper contains a survey of existing execution replay techniques and tools. Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003