Phase Clocks for Transient Fault Repair
Phase clocks are synchronization tools that implement a form of logical time
in distributed systems. For systems tolerating transient faults by self-repair
of damaged data, phase clocks can enable reasoning about the progress of
distributed repair procedures. This paper presents a phase clock algorithm
suited to the model of transient memory faults in asynchronous systems with
read/write registers. The algorithm is self-stabilizing and guarantees accuracy
of phase clocks within O(k) time following an initial state that is k-faulty.
Composition theorems show how the algorithm can be used for the timing of
distributed procedures that repair system outputs.
Comment: 22 pages, LaTeX
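For intuition only, here is a minimal Python sketch of a classic unison-style phase clock rule. This is not the paper's algorithm (which works with asynchronous read/write registers and gives the O(k) repair-time guarantee), but it illustrates how purely local steps re-synchronize phase clocks after a transient fault.

```python
def step(clocks, neighbors, i):
    """One step of process i: move to one past the minimum clock in
    its closed neighborhood. The rule is purely local, so it applies
    from any (transiently corrupted) starting state."""
    view = [clocks[j] for j in neighbors[i]] + [clocks[i]]
    clocks[i] = min(view) + 1

# Example: three processes on a line, starting from an arbitrary
# corrupted (k-faulty) state; under a round-robin schedule the
# clocks converge until neighboring values differ by at most 1.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
clocks = {0: 7, 1: 0, 2: 42}
for _ in range(10):
    for i in sorted(clocks):
        step(clocks, neighbors, i)
print(clocks)  # {0: 10, 1: 10, 2: 11}
```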
Causality Diagrams using Hybrid Vector Clocks
Causality in distributed systems has long been studied, and numerous
approaches use it to trace distributed system executions. Traditional
approaches relied on system profiling, while newer ones profile the clocks of
processes to detect failures and reconstruct the timelines that led to them.
Since the advent of logical clocks, these profiles have become increasingly
accurate at characterizing concurrency and message passing. Vector clocks
addressed the shortcomings of traditional logical clocks by additionally
storing information about the other processes in the system.
Hybrid vector clocks are a novel approach in which a clock need not store
information about every process; rather, it stores entries only for processes
within an acceptable skew of the focused process. This yields an efficient
way of profiling at a substantially reduced cost to the system. Building on
this idea, we propose constructing causal traces from the information
generated by the hybrid vector clock. The hybrid vector clock provides a
strong notion of concurrency and distribution, and we theorize that the
information it generates is sufficient to build a causal trace for debugging.
We post-process and parse the clocks generated from an execution trace to
render a swimlane diagram on a web interface that traces the points of
failure of a distributed system. We also provide an API for reusing this
approach with any generic distributed system framework.
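To make the pruning idea concrete, here is a minimal, hypothetical Python sketch of a skew-pruned vector clock. The class name, the EPSILON parameter, and the use of bare physical timestamps are our simplifications, not the paper's clock structure (hybrid clocks typically combine physical and logical components).

```python
import time

EPSILON = 0.1  # assumed acceptable skew in seconds; illustrative only

class HybridVectorClock:
    """Sketch: a vector clock that drops entries more than EPSILON
    behind the local physical clock, so only 'nearby' processes are
    tracked and the vector stays small."""

    def __init__(self, pid):
        self.pid = pid
        self.entries = {pid: time.time()}

    def _prune(self, now):
        # Keep only processes whose last-known time is within EPSILON.
        self.entries = {p: t for p, t in self.entries.items()
                        if now - t <= EPSILON}

    def tick(self):
        now = time.time()
        self.entries[self.pid] = now
        self._prune(now)
        return dict(self.entries)

    def merge(self, other_entries):
        # On message receipt: elementwise max, then advance and prune.
        for p, t in other_entries.items():
            self.entries[p] = max(self.entries.get(p, t), t)
        return self.tick()

a, b = HybridVectorClock("A"), HybridVectorClock("B")
b.merge(a.tick())  # B receives a message from A and merges its clock
```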
Fault Tolerant Gradient Clock Synchronization
Synchronizing clocks in distributed systems is well-understood, both in terms
of fault-tolerance in fully connected systems and the dependence of local and
global worst-case skews (i.e., maximum clock difference between neighbors and
arbitrary pairs of nodes, respectively) on the diameter of fault-free systems.
However, so far nothing non-trivial is known about the local skew that can be
achieved in topologies that are not fully connected even under a single
Byzantine fault. Put simply, in this work we show that the most powerful known
techniques for fault-tolerant and gradient clock synchronization are
compatible, in the sense that the best of both worlds can be achieved
simultaneously.
Concretely, we combine the Lynch-Welch algorithm [Welch1988] for
synchronizing a clique of n nodes despite up to f < n/3 Byzantine faults with
the gradient clock synchronization (GCS) algorithm by Lenzen et al.
[Lenzen2010] in order to render the latter resilient to faults. As this is
not possible on general graphs, we augment an input graph G by replacing each
node by 3f+1 fully connected copies, which execute an instance of the
Lynch-Welch algorithm. We then interpret these clusters as supernodes
executing the GCS algorithm, where for each cluster its correct nodes'
Lynch-Welch clocks provide estimates of the logical clock of the supernode in
the GCS algorithm. By connecting clusters corresponding to neighbors in G in
a fully bipartite manner, supernodes can inform each other about (estimates
of) their logical clock values. This way, we achieve asymptotically optimal
local skew, granted that no cluster contains more than f faulty nodes, at
factor O(f) and O(f^2) overheads in terms of nodes and edges, respectively.
Note that tolerating f faulty neighbors trivially requires degree larger than
f, so this is asymptotically optimal as well.
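As an illustration of the construction, here is a minimal Python sketch (function and variable names are our own, not the paper's): each node of the input graph becomes a clique of 3f+1 copies that would run Lynch-Welch internally, and clusters of neighboring nodes are wired completely bipartitely so that supernodes can exchange clock estimates for the GCS layer.

```python
import itertools

def augment(graph, f):
    """Build the augmented graph: `graph` is an adjacency dict; each
    node v becomes a cluster of 3f+1 copies (v, 0), ..., (v, 3f).
    Edges (undirected, stored as frozensets) form a clique inside each
    cluster and a complete bipartite graph between neighboring ones."""
    k = 3 * f + 1
    cluster = {v: [(v, i) for i in range(k)] for v in graph}
    edges = set()
    for v, nbrs in graph.items():
        # Clique inside the cluster: one Lynch-Welch instance per node v.
        edges.update(frozenset(p) for p in
                     itertools.combinations(cluster[v], 2))
        # Complete bipartite links between neighboring clusters,
        # over which supernodes exchange logical clock estimates.
        for u in nbrs:
            edges.update(frozenset(p) for p in
                         itertools.product(cluster[v], cluster[u]))
    return cluster, edges

# Example: a path a-b-c with f = 1 gives clusters of 3f+1 = 4 nodes;
# node count grows by a factor of O(f), edge count by O(f^2).
cluster, edges = augment({"a": ["b"], "b": ["a", "c"], "c": ["b"]}, f=1)
print(len(cluster["a"]), len(edges))  # 4 50
```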
Preserving Stabilization while Practically Bounding State Space
Stabilization is a key dependability property for dealing with unanticipated
transient faults, as it guarantees that even in the presence of such faults,
the system will recover to states where it satisfies its specification. One of
the desirable attributes of stabilization is the use of bounded space for each
variable. In this paper, we present an algorithm that transforms a stabilizing
program that uses variables with unbounded domain into a stabilizing program
that uses bounded variables and (practically bounded) physical time. While
non-stabilizing programs (that do not handle transient faults) can deal with
unbounded variables by assigning large enough but bounded space, stabilizing
programs that need to deal with arbitrary transient faults cannot do the same
since a transient fault may corrupt the variable to its maximum value. We show
that our transformation algorithm is applicable to several problems including
logical clocks, vector clocks, mutual exclusion, leader election, diffusing
computations, Paxos-based consensus, and so on. Moreover, our approach can be
used to bound the counters used in earlier work by Katz and Perry for adding
stabilization to a non-stabilizing program. By combining our algorithm with
that work, it becomes possible to provide stabilization for a rich class of
problems by assigning large enough but bounded space for variables.
Comment: Moved some content from the Appendix to the main paper, added some
details to the transformation algorithm and to its description
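The baseline observation in the abstract (that non-stabilizing programs can get away with "large enough but bounded" counters) can be illustrated with standard wraparound sequence-number arithmetic. The sketch below is ours, not the paper's transformation; as the abstract points out, a transient fault can still corrupt such a counter to an arbitrary value, which is exactly what the paper's use of physical time addresses.

```python
M = 2**32          # assumed bounded domain size; illustrative only
WINDOW = M // 2    # comparisons are valid only within half the domain

def later(a, b):
    """Compare two bounded, wrapping timestamps a, b in [0, M).
    Standard sequence-number arithmetic: correct as long as live
    values stay within WINDOW of each other, e.g. when physical-time
    bounds limit how long old values survive."""
    return (a - b) % M < WINDOW and a != b

def bump(a):
    return (a + 1) % M

# Example: wraparound near the top of the domain.
a = M - 2
b = bump(bump(a))   # wraps around to 0
print(later(b, a))  # True: b is later despite being numerically smaller
```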
Non-intrusive on-the-fly data race detection using execution replay
This paper presents a practical solution for detecting data races in parallel
programs. The solution consists of a combination of execution replay (RecPlay)
with automatic on-the-fly data race detection. This combination enables us to
perform the data race detection on an unaltered execution (almost no probe
effect). Furthermore, the usage of multilevel bitmaps and snooped matrix clocks
limits the amount of memory used. As the record phase of RecPlay is highly
efficient, there is no need to switch it off, thereby eliminating the
possibility of Heisenbugs, because tracing can be left on all the time.
Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop
on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003
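For background, this is what an on-the-fly happens-before race check looks like with plain vector clocks. It is a generic sketch of the underlying technique, not RecPlay's implementation, which uses multilevel bitmaps and snooped matrix clocks to bound memory.

```python
def leq(vc1, vc2):
    """vc1 happens-before-or-equals vc2 (componentwise <=,
    with missing entries treated as 0)."""
    return all(vc1.get(p, 0) <= vc2.get(p, 0) for p in vc1)

def is_race(access1, access2):
    """Two accesses to the same location race if neither is ordered
    before the other and at least one of them is a write."""
    (vc1, is_write1), (vc2, is_write2) = access1, access2
    concurrent = not leq(vc1, vc2) and not leq(vc2, vc1)
    return concurrent and (is_write1 or is_write2)

# Example: two unordered accesses, one of them a write -> a race.
w = ({"p0": 1}, True)   # write on process p0
r = ({"p1": 1}, False)  # read on process p1, concurrent with the write
print(is_race(w, r))    # True
```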
Execution replay and debugging
As most parallel and distributed programs are internally non-deterministic --
consecutive runs with the same input might result in a different program flow
-- vanilla cyclic debugging techniques as such are useless. In order to use
cyclic debugging tools, we need a tool that records information about an
execution so that it can be replayed for debugging. Because recording
information interferes with the execution, we must limit the amount of
information and keep its processing fast. This paper contains a survey of
existing execution replay techniques and tools.
Comment: In M. Ducasse (ed), proceedings of the Fourth International Workshop
on Automated Debugging (AADebug 2000), August 2000, Munich. cs.SE/001003
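A toy Python sketch of the record/replay principle the survey covers: log only the outcome of each non-deterministic choice (here, which sender a receive returns) so the run can be re-executed deterministically. Real tools record race outcomes or logical clocks rather than full message contents, precisely to keep the log small and the probe effect low.

```python
import random

def run(sources, log=None):
    """Record mode (log=None): make each non-deterministic choice for
    real and return the outcomes. Replay mode: force the logged
    outcomes, making re-execution deterministic."""
    outcomes = []
    for step in range(len(sources)):
        choice = random.choice(sources) if log is None else log[step]
        outcomes.append(choice)
    return outcomes

record = run(["p1", "p2", "p3"])              # non-deterministic run
replay = run(["p1", "p2", "p3"], log=record)  # deterministic re-run
assert replay == record
print(record)
```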