Abstract-We present the concept and implementation of a self-stabilizing Byzantine fault-tolerant distributed clock generation scheme for multi-synchronous GALS architectures in critical applications. It combines a variant of a recently introduced self-stabilizing algorithm for generating low-frequency, low-accuracy synchronized pulses with a simple non-stabilizing high-frequency, high-accuracy clock synchronization algorithm. We provide thorough correctness proofs and a performance analysis, which use methods from fault-tolerant distributed computing research but also address hardware-related issues like metastability. The algorithm, which consists of several concurrent communicating asynchronous state machines, has been implemented in VHDL using Petrify in conjunction with some extensions, and synthesized for an Altera Cyclone FPGA. An experimental validation of this prototype has been carried out to confirm the skew and clock frequency bounds predicted by the theoretical analysis, as well as the very short stabilization times (required for recovering after excessively many transient failures) achievable in practice.
I. INTRODUCTION
To circumvent the cumbersome clock tree engineering issue [1], [2], [3], [4], systems-on-chip (SoC) are nowadays increasingly designed as globally asynchronous locally synchronous (GALS) systems [5]. Using independent and hence unsynchronized clock domains requires asynchronous cross-domain communication mechanisms or synchronizers [6], [7], [8], however, which inevitably create the potential for metastability [9]. This problem can be circumvented by means of multi-synchronous clocking [10], [11], which guarantees a certain degree of synchrony between clock domains. Multi-synchronous GALS is particularly beneficial from a designer's point of view, since it combines the convenient local synchrony of a GALS system with a global time base across the whole chip, including the ability for metastability-free high-speed communication across clock domains [12].
The decreasing feature sizes of deep submicron VLSI technology have also resulted in an increased likelihood of chip components failing during operation: Reduced voltage swings and smaller critical charges make circuits more susceptible to ionized particle hits, crosstalk, and electromagnetic interference [13], [14], [15], [16], [17], [18]. Fault-tolerance hence becomes an increasingly pressing issue also for chip design. Unfortunately, faulty components may behave in non-benign ways: They may perform signal transitions at arbitrary times and even convey inconsistent information to their successor components if their outputs are affected by a failure. Well-known theory on fault-tolerant agreement and synchronization shows that this behaviour is the key feature of unrestricted, i.e., Byzantine faults [19]. This forces one to model faulty components as Byzantine if a high fault coverage is to be guaranteed.
Unfortunately, lower-bound results [19], [20] reveal that, in order to cope with some maximum number f of Byzantine faulty components (say, processors) throughout an execution of a system, n ≥ 3f + 1 components are required. Given the typically transient nature of failures in digital circuits, these bounds reveal that even a Byzantine fault-tolerant system cannot be expected to recover from a situation where more than f components became faulty transiently, since their state may be corrupted. Dealing with this problem is in the realm of self-stabilizing algorithms [21], which are guaranteed to recover even if each and every component of the system fails arbitrarily, but later on works according to its specification again: in that case the system resumes correct operation after some stabilization time following the instant when no more failures occur. Byzantine-tolerant self-stabilizing algorithms [22], [23], [24], [25], [26], [27], [28] combine the best of both worlds, by guaranteeing both correct operation and self-stabilization in the presence of up to f Byzantine faulty components in the system.

This paper presents the concept and prototype implementation of a novel approach, termed FATAL + , for multi-synchronous clocking in GALS systems. It relies on a self-stabilizing and Byzantine fault-tolerant distributed algorithm, consisting of n identical instances (called nodes), which generate n local clock signals (one for each clock domain) with the following properties: bounded skew, i.e., bounded maximum time between the k-th clock transitions of any two clock signals of correct nodes, and bounded accuracy (i.e., frequency), i.e., bounded minimum and maximum time between the occurrence of any two successive clock transitions of the clock signal at any correct node. At most f < n/3 nodes may behave Byzantine faulty, in which case their clock signals may be arbitrary.
The whole algorithm can be directly implemented in hardware, without quartz oscillators, using standard asynchronous logic gates only. FATAL + self-stabilizes within O(kn) time with probability $1 - 2^{-k(n-f)}$ (with constant expectation in typical settings), and is metastability-free by construction after stabilization in failure-free runs. 1 If the number of faults is not overwhelming, i.e., a majority of at least n − f nodes continues to execute the protocol in an orderly fashion, recovering nodes and late joiners (re)synchronize deterministically in constant time.
Detailed contributions: (1) In Sections II-VI, we present the concept and theoretical analysis of FATAL + , which is based on a variant of the randomized self-stabilizing Byzantine-tolerant pulse synchronization algorithm [28] we recently proposed. It eventually generates synchronized periodic pulses with moderate skew and low frequency, and improves upon the results from [28] in that it tolerates arbitrarily large clock drifts and allows late joiners or nodes recovering from transient faults to deterministically resynchronize within constant time. The formal proof of these properties builds upon and extends the analysis in [30]. In Section VI, this algorithm is integrated with a Byzantine-tolerant but non-self-stabilizing tick generation algorithm based on Srikanth and Toueg's clock synchronization algorithm [31], operating in a control loop: The latter, referred to as the quick cycle algorithm, generates clock ticks with high frequency and small skew, which also (weakly) affect pulse generation. On the other hand, quick cycle uses pulses to monitor its ticks in order to detect the need for stabilization.
(2) In Section VII, we present the major ingredients of an Altera Cyclone IV FPGA prototype implementation of FATAL + . It primarily consists of multiple hybrid (asynchronous + synchronous) state machines, which have been generated semi-automatically from the specification of the algorithms using Petrify [32]. Non-standard extensions were needed for ensuring deadlock-free communication despite arbitrarily many desynchronized nodes, some of which could be Byzantine faulty; this forced us, for example, to use state-based communication instead of handshake-based communication. Special care also had to be exercised for ensuring self-stabilizing elementary building blocks and metastability-freedom in normal operation (after stabilization).
(3) In Section VIII, we provide some results of the experimental evaluation of our prototype implementation. They demonstrate the feasibility of FATAL + and confirm the results of our theoretical analysis, in particular, a tight skew bound, in the presence of Byzantine faulty nodes. Special emphasis has been put on experiments validating the predictions related to stabilization time, which revealed that the system indeed stabilizes in very short time from any initial/error state.
Section IX eventually concludes our paper.
Related work: The work [33] , [34] , [35] , [36] on distributed clock generation in VLSI circuits is essentially based on (distributed) ring oscillators, formed by regular structures (rings, meshes) of multiple inverter loops. Since clock synchronization theory [20] reveals that high connectivity is required for bounded synchronization tightness in the presence of failures, these approaches are fundamentally restricted in that they can overcome at most a small constant number of Byzantine failures.
The only exception we are aware of is the DARTS fault-tolerant clock generation approach [37], [38], which also addresses multi-synchronous clocking in GALS systems. Like FATAL + , DARTS is based on a fault-tolerant distributed algorithm [39] implemented in asynchronous digital logic. Although it shares many features with FATAL + , including Byzantine fault-tolerance, it is not self-stabilizing: If more than f nodes ever become faulty, the system will not recover even if all nodes work correctly thereafter. Moreover, in DARTS, simple transient faults such as radiation- or crosstalk-induced additional (or omitted) clock ticks accumulate over time to arbitrarily large skews in an otherwise benign execution. Despite not suffering from these drawbacks, FATAL + offers similar guarantees in terms of area consumption, clock skew, and amortized frequency as DARTS. Furthermore, a number of Byzantine-tolerant self-stabilizing clock synchronization protocols [22], [23], [24], [25], [26], [27] have been devised by the distributed systems community. Beyond optimal resilience, an attractive feature of most of these protocols is a small stabilization time. However, all of them exhibit deficiencies rendering them unsuitable in the VLSI context. This motivated us to devise the algorithm from [28], [30], an improved variant of which forms the basis of FATAL + .
II. MODEL
In this section we introduce our system model. Our formal framework will be tied to the peculiarities of hardware designs, which consist of modules that continuously 2 compute their output signals based on their input signals.
Signals
Following [40], [41], we define (the trace of) a signal to be a timed event trace over a finite alphabet S of possible signal states: Formally, signal $\sigma \subseteq S \times \mathbb{R}^+_0$. All times and time intervals refer to a global reference time taken from $\mathbb{R}^+_0$, that is, signals reflect the system's state from time 0 on. The elements of σ are called events, and for each event (s, t) we call s the state of event (s, t) and t the time of event (s, t). In general, a signal σ is required to fulfill the following conditions: (i) for each time interval $[t^-, t^+] \subseteq \mathbb{R}^+_0$ of finite length, the number of events in σ with times within $[t^-, t^+]$ is finite, (ii) from $(s, t) \in \sigma$ and $(s', t) \in \sigma$ follows that $s = s'$, and (iii) there exists an event at time 0 in σ.
Note that our definition allows for events $(s, t)$ and $(s, t') \in \sigma$, where $t < t'$, without having an event $(s', t'') \in \sigma$ with $s' \neq s$ and $t < t'' < t'$. In this case, we call event $(s, t')$ idempotent. Two signals σ and σ' are equivalent iff they differ in idempotent events only. We identify all signals of an equivalence class, as they describe the same physical signal. Each equivalence class [σ] of signals contains a unique signal $\sigma_0$ having no idempotent events. We say that signal σ switches to s at time t iff event $(s, t) \in \sigma_0$.
The state of signal σ at time $t \in \mathbb{R}^+_0$, denoted by σ(t), is given by the state of the event with the maximum time not greater than t. 3 Because of (i), (ii) and (iii), σ(t) is well defined for each time $t \in \mathbb{R}^+_0$. Note that σ's state function in fact depends on [σ] only, i.e., we may add or remove idempotent events at will without changing the state function.
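The signal definitions above can be illustrated with a small sketch. It assumes a signal is given as a finite list of (state, time) pairs satisfying conditions (ii) and (iii); the function names `canonical`, `state_at` and `equivalent` are hypothetical helpers introduced here for illustration only.

```python
from bisect import bisect_right

def canonical(events):
    """Return the representative without idempotent events, sorted by time.

    `events` is an iterable of (state, time) pairs with pairwise distinct
    times (condition (ii)) and an event at time 0 (condition (iii)).
    """
    ordered = sorted(events, key=lambda event: event[1])
    result = []
    for state, time in ordered:
        # An event is idempotent iff the preceding event has the same state.
        if not result or result[-1][0] != state:
            result.append((state, time))
    return result

def state_at(events, t):
    """State of the event with the maximum time not greater than t."""
    ordered = sorted(events, key=lambda event: event[1])
    times = [time for _, time in ordered]
    index = bisect_right(times, t) - 1
    if index < 0:
        raise ValueError("no event at or before t; a signal needs an event at time 0")
    return ordered[index][0]

def equivalent(first, second):
    """Two signals are equivalent iff they differ in idempotent events only."""
    return canonical(first) == canonical(second)
```

In this sketch the state function gives the same answer for a signal and for its canonical representative, which mirrors the remark that σ(t) depends on the equivalence class only.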
Distributed System
On the topmost level of abstraction, we see the system as a set V = {1, . . . , n} of physically remote nodes that communicate by means of channels. In the context of a VLSI circuit, "physically remote" actually refers to quite small distances (centimeters or even less). However, at gigahertz frequencies, a local state transition will not be observed remotely within a time that is negligible compared to clock speeds. We stress this point, since it is crucial that different clocks (and their attached logic) are not placed too close to each other, as otherwise they might fail due to the same event such as a particle hit. This would render it pointless to devise a system that is resilient to a certain fraction of the nodes failing.
Each node i comprises a number of input ports, namely $S_{i,j}$ for each node j, an output port $S_i$, and a set of local ports, introduced later on. An execution of the distributed system assigns to each port of each node a signal. For convenience of notation, for any port p, we refer to the signal assigned to port p simply by signal p. We say that node i is in state s at time t iff $S_i(t) = s$. We further say that node i switches to state s at time t iff signal $S_i$ switches to s at time t.
3 To facilitate intuition, we here slightly abuse notation, as this way σ denotes both a function of time and the signal (trace), which is a subset of $S \times \mathbb{R}^+_0$. Whenever referring to σ, we will talk of the signal, not the state function.
Nodes exchange their states via the channels between them: for each pair of nodes i, j, output port S i is connected to input port S j,i by a FIFO channel from i to j. Note that this includes a channel from i to i itself. Intuitively, S i being connected to S j,i by a (non-faulty) channel means that S j,i (·) should mimic S i (·), however, with a slight delay accounting for the time it takes the channel to propagate events. In contrast to an asynchronous system, this delay is bounded by the maximum delay d > 0. 4 Formally we define: The channel from node i to j is said to be correct during [t − , t + ] iff there exists a function τ i,j : R i,j (t)) ∈ S i , and for each t ∈ [t − , τ i,j (0)), (s, t) ∈ S j,i ⇒ s = S i (0). Note that because of (i), τ −1 i,j exists in the domain [τ i,j (0), ∞), and thus (ii) and (iii) are well defined. We say that node i observes node j in state s at time t if S i,j (t) = s.
Clocks and Timeouts
Nodes are never aware of the current reference time and we also do not require the reference time to resemble Newtonian "real" time. Rather we allow for physical clocks that run arbitrarily fast or slow, 5 as long as their speeds are close to each other in comparison. One may hence think of the reference time as progressing at the speed of the currently slowest correct clock. In this framework, nodes essentially make use of bounded clocks with bounded drift.
Formally, clock rates are within [1, ϑ] (with respect to reference time), where ϑ > 1 is constant and ϑ − 1 is the (maximum) clock drift. A clock C is a continuous, strictly increasing function $C : \mathbb{R}^+_0 \to \mathbb{R}^+_0$ mapping reference time to some local time. Clock C is said to be correct during $[t^-, t^+] \subseteq \mathbb{R}^+_0$ iff we have for any $t, t' \in [t^-, t^+]$, $t < t'$, that $t' - t \le C(t') - C(t) \le \vartheta (t' - t)$. Each node comprises a set of clocks assigned to it, which allow the node to estimate the progress of reference time.
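As a rough illustration of the correctness condition for clocks, the following sketch tests the two inequalities on a finite set of sample times. The function name `clock_is_correct` and the small tolerance `eps` are assumptions made here; passing the check on samples is only a necessary condition and says nothing about the clock between sample points.

```python
def clock_is_correct(clock, times, theta, eps=1e-12):
    """Check t' - t <= C(t') - C(t) <= theta * (t' - t) on sampled times.

    Checking consecutive samples suffices for all sampled pairs, because
    both inequalities add up over adjacent intervals.
    """
    samples = sorted(times)
    for t, t_next in zip(samples, samples[1:]):
        elapsed = t_next - t                  # reference time that passed
        local = clock(t_next) - clock(t)      # local time that passed
        if local < elapsed - eps or local > theta * elapsed + eps:
            return False
    return True
```

For example, with ϑ = 1.1 a clock running at constant rate 1.05 passes the check, whereas clocks running at rate 0.9 or 1.2 do not.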
Instead of directly accessing the value of their clocks, nodes have access to so-called timeout ports of watchdog timers. A timeout is a triple (T, s, C), where $T \in \mathbb{R}^+$ is a duration, s ∈ S is a state, and C is some local clock (there may be several), say of node i. Each timeout (T, s, C) has a corresponding timeout port $\mathrm{Time}_{T,s,C}$, being part of node i's local ports. Signal $\mathrm{Time}_{T,s,C}$ is Boolean, that is, its possible states are from the set {0, 1}. We say that timeout (T, s, C) 
] such that (T, s, C) is reset, i.e., $(0, t) \in \mathrm{Time}_{T,s,C}$. This is a one-to-one correspondence, i.e., (T, s, C) is not reset at any other times.
2) For a time $t \in [t^-, t^+]$, denote by $t_0$ the supremum of all times from $[t^-, t]$ when (T, s, C) is reset. Then it holds that $(1, t) \in \mathrm{Time}_{T,s,C}$ iff $C(t) - C(t_0) = T$. Again, this is a one-to-one correspondence.
We say that timeout (T, s, C) expires at time t iff $\mathrm{Time}_{T,s,C}$ switches to 1 at time t, and it is expired at time t iff $\mathrm{Time}_{T,s,C}(t) = 1$. For notational convenience, we will omit the clock C and simply write (T, s) for both the timeout and its signal.
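The expiry rule in item 2) can be sketched for the simplest possible clock. The example below assumes a clock with constant rate, C(t) = rate · t, and takes the reset times as given input; `expiry_times` is a hypothetical helper, not part of the model.

```python
def expiry_times(reset_times, duration, rate):
    """Reference times at which a timeout of length `duration` expires.

    Assumes the driving clock is C(t) = rate * t. After a reset at t0 the
    timeout expires at the time t with C(t) - C(t0) = duration, unless it
    is reset again at or before that time.
    """
    resets = sorted(reset_times)
    expiries = []
    for index, t0 in enumerate(resets):
        candidate = t0 + duration / rate
        has_next = index + 1 < len(resets)
        next_reset = resets[index + 1] if has_next else float("inf")
        if candidate < next_reset:
            expiries.append(candidate)
    return expiries
```

Since correct clocks have rates within [1, ϑ], a timeout of duration T that is not reset again expires between T/ϑ and T reference time units after its reset.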
A randomized timeout is a triple (D, s, C), where D is a bounded random distribution on $\mathbb{R}^+_0$, s ∈ S is a state, and C is a clock. Its corresponding timeout port $\mathrm{Time}_{D,s,C}$ behaves very similarly to that of an ordinary timeout, except that whenever it is reset, the local time that passes until it expires next (provided that it is not reset again before that happens) follows the distribution D. 
s,C "with probability µ(C(t)−C(t 0 ))" and we require that the probability of (1, t) ∈ Time D,s,Cconditional to t 0 and C on [t 0 , t] being given-is independent of the system's state at times smaller than t. More precisely, if superscript E identifies variables in execution E and t 0 is the infimum of all times from (t 0 , t + ] when node i switches to state s, then we demand for any [τ
We will apply the same notational conventions to randomized timeouts as we do for regular timeouts.
Note that, strictly speaking, this definition does not induce a random variable describing the time t ∈ [t 0 , t 0 ) satisfying that (1, t ) ∈ Time D,s,C . However, for the state of the timeout port, we get the meaningful statement that for any t ∈ [t 0 , t 0 ),
The reason for phrasing the definition in the above more cumbersome way is that we want to guarantee that an adversary knowing the full present state of the system and memorizing its whole history cannot reliably predict when the timeout will expire. 6 We remark that these definitions allow for different timeouts to be driven by the same clock, implying that an adversary may derive some information on the state of a randomized timeout before it expires from the node's behavior, even if it cannot directly access the values of the clock driving the timeout. This is crucial for implementability, as it might be very difficult to guarantee that the behavior of a dedicated clock that drives a randomized timeout is indeed independent of the execution of the algorithm.
Memory Flags
Besides timeout and randomized timeout ports, another kind of node i's local ports are memory flags. For each state s ∈ S and each node j ∈ V , Mem i,j,s is a local port of node i. It is used to memorize whether node i has observed node j in state s since the last reset of the flag. We say that node i memorizes node j in state s at time t if Mem i,j,s (t) = 1. Formally, we require that signal Mem i,j,s switches to 1 at time t iff node i observes node j in state s at time t and Mem i,j,s is not already in state 1. The times t when Mem i,j,s is reset, i.e., (0, t) ∈ Mem i,j,s , are specified by node i's state machine, which is introduced next.
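A memory flag behaves like a set-dominant latch that only the state machine can clear. The following sketch models one flag $\mathrm{Mem}_{i,j,s}$ in a discrete-step simulation; the class name `MemoryFlag` and its methods are assumptions introduced for illustration.

```python
class MemoryFlag:
    """Sketch of Mem_{i,j,s}: remembers that node j was observed in state s."""

    def __init__(self, state):
        self.state = state   # the state s this flag watches for
        self.value = 0       # Boolean signal, 0 or 1

    def observe(self, observed_state):
        # Switch to 1 when node j is observed in state s; a flag that is
        # already 1 stays 1 regardless of later observations.
        if observed_state == self.state:
            self.value = 1

    def reset(self):
        # Resets are triggered by the node's state machine only.
        self.value = 0
```

The flag keeps its value after node j leaves state s, which is what allows threshold guards to count nodes that were in a state at slightly different times.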
State Machine
It remains to specify how nodes switch states and when they reset memory flags. We do this by means of state machines that may attain states from the finite alphabet S. A node's state machine is specified by (i) the set S, (ii) a function tr, called the transition function, from $T \subseteq S^2$ to the set of Boolean predicates on the alphabet consisting of expressions "p = s" (used for expressing guards), where p is from the node's input and local ports and s is from the set of possible states of signal p, and (iii) a function re, called the reset function, from T to the power set of the node's memory flags.
Intuitively, the transition function specifies the conditions (guards) under which a node switches states, and the reset function determines which memory flags to reset upon the state change. Formally, let P be a predicate on node i's input and local ports. We define P holds at time t by structural induction: If P is equal to p = s, where p is one of node i's input and local ports and s is one of the states signal p can obtain, then P holds at time t iff p(t) = s. Otherwise, if P is of the form ¬P 1 , P 1 ∧ P 2 , or P 1 ∨ P 2 , we define P holds at time t in the straightforward manner.
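The structural induction used to define when a guard holds translates directly into a recursive evaluator. In the sketch below, predicates are encoded as nested tuples and `ports` maps port names to their current states; this encoding and the port names in the usage note are hypothetical.

```python
def holds(predicate, ports):
    """Evaluate a guard at one instant, given the current port states.

    Predicates are nested tuples:
      ("eq", port, state)    atomic expression "p = s"
      ("not", P1)
      ("and", P1, P2)
      ("or", P1, P2)
    """
    operator = predicate[0]
    if operator == "eq":
        _, port, state = predicate
        return ports[port] == state
    if operator == "not":
        return not holds(predicate[1], ports)
    if operator == "and":
        return holds(predicate[1], ports) and holds(predicate[2], ports)
    if operator == "or":
        return holds(predicate[1], ports) or holds(predicate[2], ports)
    raise ValueError(f"unknown operator: {operator!r}")
```

For instance, a guard requiring that the node observes itself in accept and that some timeout port is 1 could be written as `("and", ("eq", "S_ii", "accept"), ("eq", "Time_T1", 1))`.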
We say node i follows its state machine during $[t^-, t^+]$ iff the following holds: Assume node i observes itself in state s ∈ S at time $t \in [t^-, t^+]$, i.e., $S_{i,i}(t) = s$. Then, for each (s, s') ∈ T, both:
1) Node i switches to state s' at time t iff tr(s, s') holds at time t and i is not already in state s'. 7 (In case more than one guard tr(s, s') can be true at the same time, we assume that an arbitrary tie-breaking ordering exists among the transition guards that specifies to which state to switch.)
2) Node i resets memory flag m at some time in the interval $[t, \tau_{i,i}(t)]$ iff m ∈ re(s, s') and i switches from state s to state s' at time t. This correspondence is one-to-one.
A node is defined to be non-faulty
all its timeouts and randomized timeouts are correct and it follows its state machine. If it employs multiple state machines (see below), it needs to follow all of them.
In contrast, a faulty node may change states arbitrarily. Note that while a faulty node may be forced to send consistent output state signals to all other nodes if its channels remain correct, there is no way to guarantee that this still holds true if channels are faulty. 8 
Metastability
While the presented model does not fully capture propagation and decay of metastable upsets, i.e., the propagation of intermediate values through combinational circuit elements and the probability distributions on the decay of metastable upsets, it does capture their generation. An algorithm is inherently susceptible to metastability because state machines lack the capability to instantaneously take on new states: Node i decides on state transitions based on the delayed status of port $S_{i,i}$ instead of its "true" current state $S_i$. Consider the following example: Node i is in state s at some time t, but since it switched to s only very recently, it still observes itself in state $s' \neq s$ at time t. A metastable upset might occur at time t (i) if the guard tr(s', s) falls back to false at time t, or (ii) if there is another transition (s', s'') in T whose guard becomes true at time t. The treatment of scenario (i) is postponed to Section VII where it is discussed together with the implementation of a node's components. Scenario (ii) is accounted for in the following definition:
Definition 2.1 (Metastability-Freedom): We denote state machine M of node i as being metastability-free during $[t^-, t^+]$, iff for each time $t \in [t^-, t^+]$ when M switches from some state s to another state s', it holds that $\tau_{i,i}(t) < t'$, where t' is the infimum of all times in $(t, t^+]$ when M switches to some state s''.
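Definition 2.1 amounts to requiring that each state switch has become visible on the loopback channel before the next switch happens. A minimal sketch of this check, assuming the switch times of M within the interval are known and the loopback delay is given as a function `tau` (both hypothetical inputs):

```python
def metastability_free(switch_times, tau):
    """Check tau(t) < t' for every switch time t and its successor t'.

    `switch_times` are the times at which the state machine switches
    states; `tau(t)` is the time at which a switch at time t is observed
    on the node's own loopback channel. The last switch has no successor
    (infimum of the empty set is infinity), so it always passes.
    """
    times = sorted(switch_times)
    for t, t_next in zip(times, times[1:]):
        if not tau(t) < t_next:
            return False
    return True
```

With a constant loopback delay of 0.5, switches at times 0, 1, 2 pass the check, while switches at 0 and 0.3 violate it because the second switch occurs before the first one has been observed.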
Multiple State Machines
In some situations the previous definitions are too stringent, as there might be different "components" of a node's state machine that act concurrently and independently, mostly relying on signals from disjoint input ports or orthogonal components of a signal. We model this by permitting that nodes run several state machines in parallel. All these state machines share the input and local ports of the respective node and are required to have disjoint state spaces. If node i runs state machines M 1 , . . . , M k , node i's output signal is the product of the output signals of the individual machines. Formally we define: Each of the state machines M j , 1 ≤ j ≤ k, has an additional own output port s j . The state of node i's output port S i at any time t is given by S i (t) := (s 1 (t), . . . , s k (t)), where the signals of ports s 1 , . . . , s k are defined analogously to the signals of the output ports of state machines in the single state machine case. Note that by this definition, the only (local) means for node i's state machines to interact with each other is by reading the delayed state signal S i,i .
We say that node i's state machine M j is in state s at time t iff s j (t) = s, where S i (t) = (s 1 (t), . . . , s k (t)), and that node i's state machine M j switches to state s at time t iff signal s j switches to s at time t. Since the state spaces of the machines M j are disjoint, we will omit the phrase "state machine M j " from the notation, i.e., we write "node i is in state s" or "node i switched to state s", respectively.
Recall that the various state machines of node i are as loosely coupled as remote nodes, namely via the delayed status signal on channel $S_{i,i}$ only. Therefore, it makes sense to consider them independently also when it comes to metastability.
Definition 2.2 (Metastability-Freedom-Multiple SM's): We denote state machine M of node i ∈ V as metastability-free during $[t^-, t^+]$, iff for each time $t \in [t^-, t^+]$ when M switches from some state s ∈ S to another state s' ∈ S, it holds that $\tau_{i,i}(t) < t'$, where t' is the infimum of all times in $(t, t^+]$ when M switches to some state s'' ∈ S.
Note that by this definition the different state machines may switch states concurrently without suffering from metastability. 9 It is even possible that some state machine suffers metastability, while another is not affected by this at all. 10 
Problem Statement
The purpose of the pulse synchronization protocol is that nodes generate synchronized, well-separated pulses by switching to a distinguished state accept. Self-stabilization requires that it starts to do so within a bounded time, for any possible initial state. However, as our protocol makes use of randomization, there are executions where this does not happen at all; instead, we will show that the protocol stabilizes with probability one in finite time. To give a precise meaning to this statement, we need to define appropriate probability spaces. Definition 2.3 (Adversarial Spaces): Denote for i ∈ V by C i = (C i,1 , . . . , C i,ci ) the tuple of clocks of node i. An adversarial space is a probabilistic space that is defined by subsets of nodes and channels W ⊆ V and E ⊆ V 2 , a time interval [t − , t + ], a protocol P (nodes' ports, state machines, etc.) as previously defined, tuple of all clocks
, an initial state $E_0$ of all ports, and an adversarial function A. Here A is a function that maps a partial execution $E|_{[0,t]}$ until time t (i.e., all ports' values until time t), W, E, $[t^-, t^+]$, P, C, and Θ to the states of all faulty ports during the time interval $(t, t']$, where t' is the infimum of all times greater than t when a non-faulty node or channel switches states.
The adversarial space AS(W, E, [t − , t + ], P, C, Θ, E 0 , A) is now defined on the set of all executions E satisfying that (i) the initial state of all ports is given by
with respect to the protocol P, (v) all channels in E are correct during $[t^-, t^+]$, and (vi) given $E|_{[0,t]}$ for any time t, $E|_{(t,t']}$ is given by A, where t' is the infimum of times greater than t when a non-faulty node switches states. Thus, except for when randomized timeouts expire, E is fully predetermined by the parameters of AS. 11
9 However, care has to be taken when implementing the inter-node communication of the state components in a metastability-free manner, cf. Section VII.
10 This is crucial for the algorithm we are going to present. For stabilization purposes, nodes comprise a state machine that is prone to metastability. However, the state machine generating pulses (i.e., having the state accept, cf. Definition 2.4) does not take its output signal into account once stabilization is achieved. Thus, the algorithm is metastability-free after stabilization in the sense that we guarantee a metastability-free signal indicating when pulses occur.
11 This follows by induction starting from the initial configuration $E_0$. Using A, we can always extend E to the next time when a correct node switches states, and when non-faulty nodes switch states is fully determined by the parameters of AS except for when randomized timeouts expire. Note that the induction reaches any finite time within a finite number of steps, as signals switch states finitely often in finite time.
The probability measure on AS is induced by the random distributions of the randomized timeouts specified by P.
To avoid confusion, observe that if the clock functions and delays do not follow the model constraints during $[t^-, t^+]$, the respective adversarial space is empty and thus of no concern. This cumbersome definition provides the means to formalize a notion of stabilization that accounts for worst-case drifts and delays and an adversary that knows the full state of the system up to the current time.
We are now in the position to formally state the pulse synchronization problem in our framework. Intuitively, the goal is that after transient faults cease, nodes should with probability one eventually start to issue well-separated, synchronized pulses by switching to a dedicated state accept. Thus, as the initial state of the system is arbitrary, specifying an algorithm 12 is equivalent to defining the state machines that run at each node, one of which has a state accept.
Definition 2.4 (Self-Stabilizing Pulse Synchronization): Given a set of nodes W ⊆ V and a set E ⊆ V × V of channels, we say that protocol P is a (W, E)-stabilizing pulse synchronization protocol with skew Σ and accuracy bounds T − > Σ and T + that stabilizes within time T with probability p iff the following holds. Choose any time
e., C, Θ, E 0 , and A are arbitrary). Then executions from AS satisfy with probability at least p that there exists a time t s ∈ [t − , t − + T ] so that, denoting by t i (k) the time when node i ∈ W switches to a distinguished state accept for the k th time after t s (t i (k) = ∞ if no such time exists), (i)
Note that the fact that A is a deterministic function and, more generally, that we consider each space AS individually, is no restriction: As P succeeds for any adversarial space with probability at least p in achieving stabilization, the same holds true for randomized adversarial strategies A and worst-case drifts and delays.
III. THE FATAL PULSE SYNCHRONIZATION PROTOCOL
In this section, we present our self-stabilizing pulse generation algorithm. In order to be suitable for implementation in hardware, it needs to utilize very simple rules only. It is stated in terms of state machines as introduced in the previous section.
Since the ultimate goal of the pulse generation algorithm is to interact with an application layer, we introduce a possibility for a coupling with such a layer in the pulse generation algorithm itself: for each node i, we add a further port $\mathrm{NEXT}_i$, which can be driven by node i's application layer. As for other state signals, its output raises flag $\mathrm{Mem}_{i,\mathrm{NEXT}}$, which for simplicity we refer to as $\mathrm{NEXT}_i$ as well. The purpose of the port is to allow the application layer to influence the time between two of node i's successively generated pulses within a range that does not prevent the pulse generation algorithm from stabilizing correctly.
In Section VI we give an example for an application layer: The quick cycle completing FATAL + is a non-self-stabilizing clock synchronization routine which relies on the pulse generation algorithm for self-stabilization. Since we will show that the pulse algorithm stabilizes independently of the behavior of the NEXT signal, and the clock synchronization routine presented in Section VI is designed such that it will stabilize once the pulse generation algorithm did so, we can partition the analysis of the compound algorithm into two parts. When proving the correctness of the pulse generation algorithm in Section IV, we thus assume that for each node i, $\mathrm{NEXT}_i$ is arbitrary.
A. Basic Cycle
The full pulse generation algorithm makes use of a rather involved interplay between conditions on timeouts, states, and thresholds to converge to a safe state despite a limited number of faulty components. As our approach is thus complicated to present in bulk, we break it down into pieces. Moreover, to facilitate giving intuition about the key ideas of the algorithm, in this subsection we assume that there are never more than f < n/3 faulty nodes, i.e., the remaining n−f nodes are non-faulty within [0, ∞). We further assume that channels between non-faulty nodes (including loopback channels) are correct within [0, ∞). We start by presenting the basic cycle that is repeated every pulse once a safe configuration is reached (see Figure 1) .
We employ graphical representations of the state machine of each node i ∈ V. States are represented by circles containing their names, while a transition (s, s′) ∈ T is depicted as an arrow from s to s′. The guard tr(s, s′) is written as a label next to the arrow, and the reset function's value re(s, s′) is depicted in a rectangular box on the arrow. To keep labels simple, we make use of some abbreviations. Recall that in the notation of timeouts (T, s, C) the driving clock C is omitted. We write T instead of (T, s) if s is the same state which node i leaves if the condition involving (T, s) is satisfied. Threshold conditions like "≥ f + 1 s", where s ∈ S, abbreviate Boolean predicates that range over all of node i's memory flags Mem_{i,j,s}, where j ∈ V, and are defined in a straightforward manner. If in such an expression we connect two states by "or", e.g., "≥ n − f s or s′" for s, s′ ∈ S, the summation considers flags of both types s and s′. Thus, such an expression is equivalent to Σ_{j∈V} max{Mem_{i,j,s}, Mem_{i,j,s′}} ≥ n − f. For any state s ∈ S, the condition S_{i,i} = s (respectively, ¬(S_{i,i} = s)) is written in short as "in s" (respectively, "not in s"). We write "true" instead of a condition that is always true (like e.g. "(in s) or (not in s)" for an arbitrary state s ∈ S). Finally, re(·, ·) always requires resetting all memory flags of certain types, hence we write e.g. propose if all flags Mem_{i,j,propose} are to be reset.
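The threshold abbreviations can be illustrated with a minimal Python sketch. The dictionary layout for the memory flags and the function names `count_memorized` and `threshold_met` are assumptions made for illustration only; they are not part of the paper's model or its VHDL implementation.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical layout: flags[(j, s)] is True iff node i's flag Mem_{i,j,s} is set.
Flags = Dict[Tuple[int, str], bool]


def count_memorized(flags: Flags, nodes: Iterable[int], states: Iterable[str]) -> int:
    """Count the nodes j memorized in at least one of the given states.

    This mirrors the sum over j of the maximum of the flags for each state.
    """
    wanted = tuple(states)
    return sum(1 for j in nodes if any(flags.get((j, s), False) for s in wanted))


def threshold_met(flags: Flags, nodes: Iterable[int], states: Iterable[str], k: int) -> bool:
    """Evaluate a threshold condition of the form '>= k s or s2 ...'."""
    return count_memorized(flags, nodes, states) >= k


if __name__ == "__main__":
    n, f = 4, 1
    example = {(0, "propose"): True, (1, "accept"): True,
               (2, "propose"): True, (2, "accept"): True}
    print(threshold_met(example, range(n), ["propose", "accept"], n - f))
```

Note that a node memorized in both states is counted once, which is what the maximum inside the sum expresses.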
We now briefly introduce the basic flow of the algorithm once it stabilizes, i.e., once all n − f non-faulty nodes are well-synchronized. Recall that the remaining up to f < n/3 faulty nodes may produce arbitrary signals on their outgoing channels. A pulse is locally triggered by switching to state accept. Thus, assume that at some time all non-faulty nodes switch to state accept within a time window of 2d, i.e., pulses are generated by non-faulty nodes within a time interval of size 2d. Supposing that T_1 ≥ 3ϑd, these nodes will observe, and thus memorize, each other and themselves in state accept within a time interval of size 3d and thus before T_1 expires at any non-faulty node. This makes timeout T_1 the critical condition for switching to state sleep. From state sleep, they will switch to states sleep → waking, waking, and finally ready, where the timeout (T_2, accept) is determining the time this takes, as it is considerably larger than ϑ(2ϑ + 2)T_1. The intermediate states serve the purpose of achieving stabilization, hence we leave them out for the moment.
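The role of the factor ϑ in such timeout conditions can be made concrete with a small sketch. It assumes, as the analysis section does when bounding expiry times by T_1/ϑ, that a timeout of nominal duration T expires between T/ϑ and T real time after its reset; the function name `expiry_window` is hypothetical.

```python
from fractions import Fraction


def expiry_window(duration, theta):
    """Real-time window (earliest, latest) in which a timeout expires after a reset.

    Assumption: the driving clock runs at a rate between 1 and theta, so a
    nominal duration takes between duration / theta and duration real time.
    """
    if theta <= 1:
        raise ValueError("the drift bound theta is assumed to exceed 1")
    return duration / theta, duration


if __name__ == "__main__":
    d = 1
    theta = Fraction(5, 4)
    t1 = 4 * theta * d  # the choice T_1 = 4*theta*d mentioned for the basic cycle
    earliest, latest = expiry_window(t1, theta)
    # T_1 cannot expire before all accept signals, spread over 3d, were observed.
    print(earliest >= 3 * d, earliest, latest)
```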
Note that upon switching to state ready, nodes reset their propose flags and NEXT i . Thus, they essentially ignore these signals between the most recent time they switched to propose before switching to accept and the subsequent time when they switch to ready. Since nodes already reset their accept flags upon switching to waking, this ensures that nodes do not take into account outdated information for the decision when to switch to state propose.
Hence, it is guaranteed that the first node switching from state ready to state propose again does so because T 4 expired or because T 3 expired and its NEXT memory flag is true. The constraint min{T 3 , T 4 } ≥ ϑ(T 2 + 4d) ensures that all non-faulty nodes observe themselves in state ready before the first one switches to propose. Hence, no node deletes information about nodes that switch to propose again after the previous pulse.
The first non-faulty node that switches to state accept again cannot do so before it memorizes at least n − f nodes in state propose, as the accept flags have been reset upon switching to state waking. Therefore, at this time at least n−2f ≥ f +1 non-faulty nodes are in state propose. Hence, the rule that nodes switch to propose if they memorize f +1 nodes in states propose will take effect, i.e., the remaining non-faulty nodes in state ready switch to propose after less than d time. Another d time later all non-faulty nodes in state propose will have become aware of this and switch to state accept as well, as the threshold of n−f nodes in states propose or accept is reached. Thus the cycle is complete and the reasoning can be repeated inductively.
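The counting argument behind the step n − 2f ≥ f + 1 can be checked mechanically. The following sketch (function names are hypothetical) verifies it for the largest tolerable f for every n in a small range.

```python
def max_faults(n: int) -> int:
    """Largest f satisfying f < n/3, i.e., 3f < n."""
    return (n - 1) // 3


def non_faulty_proposers(n: int, f: int) -> int:
    """Lower bound on non-faulty nodes among n - f memorized nodes when up to f lie."""
    return n - 2 * f


if __name__ == "__main__":
    for n in range(1, 50):
        f = max_faults(n)
        # Among n - f memorized proposers, at least f + 1 are non-faulty,
        # which is exactly the threshold that pulls the remaining nodes to propose.
        assert non_faulty_proposers(n, f) >= f + 1
    print("n - 2f >= f + 1 holds for all tested n")
```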
Clearly, for this line of argumentation to be valid, the algorithm could be simpler than stated in Figure 1. We already mentioned that the motivation of having three intermediate states between accept and ready is to facilitate stabilization. Similarly, there is no need to make use of the accept flags in the basic cycle at all; in fact, it adversely affects the constraints the timeouts need to satisfy for the above reasoning to be valid. However, the accept flags are much better suited for diagnostic purposes than the propose flags, since nodes are expected to switch to accept in a small time window and remain in state accept for a small period of time only (for all our results, it is sufficient if T_1 = 4ϑd). Moreover, two different timeout conditions for switching from ready to propose are unnecessary for correct operation of the pulse synchronization routine. As discussed before, they are introduced in order to allow for a seamless coupling to the application layer.
B. Main Algorithm
We proceed by describing the main routine of the pulse algorithm in full. Alongside the main routine, several other state machines run concurrently and provide additional information to be used during recovery, as we detail later.
The main routine is graphically presented in Figure 2. Except for the states recover and join and additional resets of memory flags, the main routine is identical to the basic cycle. The purpose of the two additional states is the following: Nodes switch to state recover once they detect that something is wrong, that is, non-faulty nodes do not execute the basic cycle as outlined in Section III-A. This way, non-faulty nodes will not continue to confuse others by sending, for example, state signals propose or accept despite clearly being out-of-sync. There are various consistency checks that nodes perform during each execution of the basic cycle. The first one is that in order to switch from state accept to state sleep, non-faulty nodes need to memorize at least n − f nodes in state accept. If this does not happen within 4d ≤ T_1/ϑ time after switching to state accept, by the arguments given in Section III-A, the nodes could not have entered state accept within 2d of each other. Therefore, something must be wrong and it is feasible to switch to state recover. Next, whenever a non-faulty node is in state waking, there should be no non-faulty nodes in states accept or recover. Considering that the node resets its accept and recover flags upon switching to waking, it should not memorize f + 1 or more nodes in states accept or recover at a time when it observes itself in state waking. If it does, however, it again switches to state recover. Last but not least, during a synchronized execution of the basic cycle, no non-faulty node may be in state propose for more than a certain amount of time before switching to state accept. Therefore, nodes will switch from propose to recover when timeout T_5 expires.
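The three consistency checks described above can be summarized in a short sketch. It is a simplification derived from the prose only, not a transcription of the guards in Figure 2; the function name and its parameters are hypothetical.

```python
def should_recover(state: str, n: int, f: int,
                   accept_seen: int, accept_or_recover_seen: int,
                   t1_expired: bool, t5_expired: bool) -> bool:
    """Simplified consistency checks that send a node to state recover.

    accept_seen: number of nodes memorized in state accept.
    accept_or_recover_seen: number of nodes memorized in accept or recover.
    """
    if state == "accept":
        # T_1 ran out without memorizing n - f nodes in accept.
        return t1_expired and accept_seen < n - f
    if state == "waking":
        # Too many nodes still appear to be in accept or recover.
        return accept_or_recover_seen >= f + 1
    if state == "propose":
        # The node has been waiting in propose for too long.
        return t5_expired
    return False


if __name__ == "__main__":
    print(should_recover("accept", n=4, f=1, accept_seen=2,
                         accept_or_recover_seen=2, t1_expired=True, t5_expired=False))
```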
There are two different ways for nodes in recover to switch back to the basic cycle, corresponding to two different mechanisms for stabilization. The transition from recover to accept requires (directly) observing n − f nodes in state accept. This enables nodes to resynchronize provided that at least n − f nodes are already executing the basic cycle in synchrony. While this method is easily implemented, clearly it is insufficient to ensure stabilization from arbitrary initial configurations. Hence, nodes can also join the basic cycle again via the second new state, called join. Since the Byzantine nodes may "play nice" towards n − 2f or more nodes still executing the basic cycle, making them believe that system operation continues as usual, it must be possible to do so without having a majority of nodes in state recover. On the other hand, it is crucial that this happens in a sufficiently well-synchronized manner, as otherwise nodes could drop out of the basic cycle again because the various checks of consistency detect an erroneous execution of the basic cycle.
In part, this issue is solved by an additional agreement step. In order to enter the basic cycle again, nodes need to memorize n − f nodes in states join (the respective nodes detected an inconsistency), propose (these nodes continued to execute the basic cycle), or accept (there are executions where nodes reset their propose flags because of switching to join when other nodes already switched to accept). Due to the threshold conditions of f + 1 nodes memorized in state join or f + 1 nodes memorized in state propose for leaving state recover, all nodes will follow the first one switching from join to propose quickly, just as with the switch from propose to accept in an ordinary execution of the basic cycle. However, it is decisive that all nodes are in states that permit participating in this agreement step in order to guarantee success of this approach.
As a result, still a certain degree of synchronization needs to be established beforehand,13 both among nodes that still execute the basic cycle and those that do not. For instance, if at the point in time when a majority of nodes and channels become non-faulty, some nodes already memorize nodes in join that are not, they may switch to state join and subsequently propose prematurely, causing others to have inconsistent memory flags as well. Byzantine faults may sustain such an amiss configuration of the system indefinitely.
13 This is the reason for the complicated transition condition involving additional states and timeouts. The detailed interplay between these conditions is delicate and beyond the scope of a high-level description of the algorithm; the interested reader is referred to the analysis section.
So why did we put so much effort in "shifting" the focus to this part of the algorithm? The key advantage is that nodes outside the basic cycle may take into account less reliable information for stabilization purposes. They may take the risk of metastable upsets (as we know it is impossible to avoid these during the stabilization process, anyway) and make use of randomization.
In fact, to make the above scheme work, it is sufficient that all non-faulty nodes agree on a so-called resynchronization point (cf. Definitions 3.1 and 3.2), that is, a point in time at which nodes reset the memory flags for states join and sleep → waking as well as certain timeouts, while guaranteeing that no node is in these states close to the respective reset times. Except for state sleep → waking, all of these timeouts, memory flags, etc. are not part of the basic cycle at all, thus nodes may enforce consistent values for them easily when agreeing on such a resynchronization point.
Conveniently, the use of randomization also ensures that it is quite unlikely that nodes are in state sleep → waking close to a resynchronization point, as the consistency check of having to memorize n − f nodes in state accept in order to switch to state sleep guarantees that the time windows during which non-faulty nodes may switch to sleep make up a small fraction of all times only.
Consequently, the remaining components of the algorithm deal with agreeing on resynchronization points and utilizing this information in an appropriate way to ensure stabilization of the main routine. We describe this connection to the main routine first. It is done by another, quite simple state machine, which runs in parallel alongside the core routine. It is depicted in Figure 3 .
Its purpose is to reset memory flags in a consistent way and to determine when a node is permitted to switch to join. In general, a resynchronization point (locally observed by switching to state resync, which is introduced later) triggers the reset of the join and sleep → waking flags. If there are still nodes executing the basic cycle, a node may become aware of it by observing f + 1 nodes in state sleep → waking at some time. In this case it switches from the state passive, which it entered at the point in time when it locally observed the resynchronization point, to the state active. Subsequently, once timeout T 8 expires, the node will switch to state, in which it is more susceptive to switching to state join. This is expressed by the rather involved transition rule tr(recover, join) (in Figure 2) . T 6 is much smaller than T 7 , but T 6 is of no concern until the node switches to state active and resets T 6 .
The condition that Mem_{i,i,join} = 0 simply means that nodes should not already have attempted to stabilize by switching to join since the most recent transition to passive. This avoids interfering too much with the second stabilization mechanism (switching from recover to accept), as it might take significantly longer than the time required for this "immediate" recovery to stabilize by means of agreeing on a resynchronization point.
14 The conditions "in active" and "not in dormant", respectively, here ensure that the transition is not performed because the node has been in state resync a long time ago, but there was no recent switching to resync.
It remains to explain how resynchronization points are generated.
C. Resynchronization Algorithm
The resynchronization routine is specified in Figure 4 . Similarly to the extension of the core routine, it is a lower layer that the core routine uses for stabilization purposes only. It provides some synchronization that is akin to that of a pulse, except that such "weak pulses" occur at random times, and may be generated inconsistently even after the algorithm as a whole has stabilized. Since the main routine operates independently of the resynchronization routine once the system has stabilized, we can afford the weaker guarantees of the routine: If it succeeds in generating a "good" resynchronization point merely once, the main routine will stabilize deterministically.
Definition 3.1 (Resynchronization Points): Given W ⊆ V , time t is a W -resynchronization point iff each node in W switches to state supp → resync in the time interval (t, t + 2d).
Definition 3.2 (Good Resynchronization Points):
A W -resynchronization point is called good iff no node from W switches to state sleep during (t − ∆ g , t), where ∆ g := (2ϑ + 3)T 1 , and no node is in state join
In order to clarify that despite having a linear number of states (supp 1 , . . . , supp n ), this part of the algorithm can be implemented using 2-bit communication channels between state machines only, we generalize our description of state machines as follows. If a state is depicted as a circle separated into an upper and a lower part, the upper part denotes the local state, while the lower part indicates the signal state to which it is mapped. A node's memory flags then store the respective signal states only, i.e., remote nodes do not distinguish between states that share the same signal. Clearly, such a machine can be simulated by a machine as introduced in the model section. The advantage is that such a mapping can be used to reduce the number of transmitted state bits; for the resynchronization routine given in Figure 4 , we merely need two bits (init/wait and none/supp) instead of log(n + 3) + 1 bits.
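The saving in transmitted state bits can be quantified with a brief sketch. It assumes that "log" denotes the base-2 logarithm rounded up, and that the n + 3 local states of the right-hand machine are supp 1, ..., supp n, none, supp → resync, and resync; both readings are assumptions, as is the function name.

```python
def bits_without_mapping(n: int) -> int:
    """Bits needed to transmit the local state directly.

    Assumption: ceil(log2(n + 3)) bits for the n + 3 local states of the
    right-hand machine, plus one bit for init/wait.
    """
    states = n + 3
    return (states - 1).bit_length() + 1  # (m - 1).bit_length() == ceil(log2(m)) for m >= 1


BITS_WITH_MAPPING = 2  # init/wait and none/supp


if __name__ == "__main__":
    for n in (4, 13, 100):
        print(n, bits_without_mapping(n), BITS_WITH_MAPPING)
```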
The basic idea behind the resynchronization algorithm is the following: Every now and then, nodes will try to initiate agreement on a resynchronization point. This is the purpose of the small state machine on the left in Figure 4. Recalling that the transition condition "true" simply means that the node switches to state wait again as soon as it observes itself in state init, it is easy to see that it does nothing else than creating an init signal as soon as R_3 expires and resetting R_3 again as quickly as possible. As the time when a node switches to init is determined by the randomized timeout R_3 distributed over a large interval (cf. Equality (11)), it is impossible to predict when it will expire, even with full knowledge of the execution up to the current point in time. Note that the complete independence of this part of node i's state from the remaining protocol implies that faulty nodes are not able to influence the respective times by any means.
Consider now the state machine displayed on the right of Figure 4. To illustrate how the routine is intended to work, assume that at the time t when a non-faulty node i switches to state init, all non-faulty nodes are not in any of the states supp → resync, resync, or supp i, and at all non-faulty nodes the timeout (R_2, supp i) has expired. Then, no matter what the signals from faulty nodes or on faulty channels are, each non-faulty node will be in one of the states supp j, j ∈ V, or supp → resync at time t + d. Hence, they will observe each other (and themselves) in one of these states at some time smaller than t + 2d. These statements follow from the various timeout conditions of at least 2ϑd and the fact that observing node i in state init will make nodes switch to state supp i if in none or supp j, j ≠ i. Hence, all of them will switch to state supp → resync during (t, t + 2d), i.e., t is a resynchronization point. Since t follows a random distribution that is independent of the remaining algorithm and, as mentioned earlier, most of the time nodes do not switch to state sleep and it is easy to deal with the condition on join states, there is a large probability that t is a good resynchronization point. Note that timeout R_1 makes sure that no non-faulty node will switch to supp → resync again anytime soon, leaving sufficient time for the main routine to stabilize.
The scenario we just described relies on the fact that at time t no node is in state supp → resync or state resync. We will choose R_2 ≫ R_1, implying that R_2 + 3d time after a node switched to state init all nodes have "forgotten" about this, i.e., (R_2, supp i) is expired and they switched back to state none (unless other init signals interfered). Thus, in the absence of Byzantine faults, the above requirement is easily achieved with a large probability by choosing R_3 as a uniform distribution over some interval [R_2 + 3d, R_2 + Θ(nR_1)]: Other nodes will switch to init O(n) times during this interval, each time "blocking" other nodes for at most O(R_1) time. If the random choice picks any other point in time during this interval, a resynchronization point occurs. Even if the clock speed of the clock driving R_3 is manipulated in a worst-case manner (affecting the density of the probability distribution with respect to real time by a factor of at most ϑ), we can just increase the size of the interval to account for this.
However, what happens if only some of the nodes receive an init signal due to faulty channels or nodes? If the same holds for some of the subsequent supp signals, it might happen that only a fraction of the nodes reaches the threshold for switching to state supp → resync, resulting in an inconsistent reset of flags and timeouts across the system. Until the respective nodes switch to state none again, they will not support a resynchronization point again, i.e., about R_1 time is "lost". This issue is the reason for the agreement step and the timeouts (R_2, supp j). In order for any node to switch to state supp → resync, there must be at least n − 2f ≥ f + 1 non-faulty nodes supporting this. Hence, all of these nodes recently switched to a state supp j for some j ∈ V, resetting (R_2, supp j). Until these timeouts expire, f + 1 ∈ Ω(n) non-faulty nodes will ignore init signals on the respective channels. Since there are O(n^2) channels, it is possible to choose R_2 ∈ O(nR_1) such that this may happen at most O(n) times in O(n) time. Playing with constants, we can pick R_3 ∈ O(n) maintaining that still a constant fraction of the times are "good" in the sense that R_3 expiring at a non-faulty node will result in a good resynchronization point.
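Drawing the randomized timeout from such an interval can be sketched as follows. The constant `c` standing in for the Θ(·) term, the numeric values, and the function name `sample_r3` are assumptions for illustration; the actual distribution is fixed by the timeout constraints of Condition 3.3.

```python
import random
from typing import Optional


def sample_r3(r1: float, r2: float, d: float, n: int,
              c: float = 8.0, rng: Optional[random.Random] = None) -> float:
    """Draw R_3 uniformly from [R_2 + 3d, R_2 + c * n * R_1].

    The constant c is a hypothetical placeholder for the Theta(n * R_1) term.
    """
    if rng is None:
        rng = random.Random()
    low = r2 + 3 * d
    high = r2 + c * n * r1
    if high <= low:
        raise ValueError("interval is empty; increase c, n, or r1")
    return rng.uniform(low, high)


if __name__ == "__main__":
    rng = random.Random(42)
    samples = [sample_r3(r1=10.0, r2=200.0, d=1.0, n=4, rng=rng) for _ in range(5)]
    print(all(203.0 <= x <= 520.0 for x in samples))
```

In hardware the randomness would come from a physical or pseudo-random source rather than a software generator; the sketch only shows the shape of the interval.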
D. Timeout Constraints
Condition 3.3 summarizes the constraints we require on the timeouts for the core routine and the resynchronization algorithm to act and interact as intended.
Condition 3.3 (Timeout Constraints):
Recall that ϑ > 1 and ∆ g := (2ϑ + 3)T 1 . Define
The timeouts need to satisfy the constraints
> (2ϑ
We need to show that this system can always be solved. Furthermore, we would like to allow coupling the pulse generation algorithm to an application algorithm with any possible drift. To this end, we would like to be able to make the ratio (T_2 + T_4)/(ϑ(T_2 + T_3 + 4d)) arbitrarily large: Here, (T_2 + T_4) is the minimal gap between successive pulses generated at each node, provided that the states of all the NEXT signals are constantly zero, and ϑ(T_2 + T_3 + 4d) is the maximal time it takes nodes to observe themselves in state ready with T_3 expired after the last generated pulse (as then they will respond to NEXT_i switching to one).
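Solving the ratio for T_4 shows how a target value can be met once the other timeouts are fixed. The sketch below does only this algebra; it does not check the remaining constraints of Condition 3.3, and the names `pulse_ratio`, `t4_for_ratio`, and the parameter `alpha` (the target ratio) as well as the numeric values are illustrative assumptions.

```python
from fractions import Fraction


def pulse_ratio(t2, t3, t4, d, theta):
    """Ratio (T_2 + T_4) / (theta * (T_2 + T_3 + 4d))."""
    return (t2 + t4) / (theta * (t2 + t3 + 4 * d))


def t4_for_ratio(alpha, t2, t3, d, theta):
    """Value of T_4 that makes the ratio equal to alpha (pure algebra)."""
    return alpha * theta * (t2 + t3 + 4 * d) - t2


if __name__ == "__main__":
    theta = Fraction(11, 10)
    t2, t3, d = Fraction(100), Fraction(50), Fraction(1)
    t4 = t4_for_ratio(3, t2, t3, d, theta)
    print(t4, pulse_ratio(t2, t3, t4, d, theta))
```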
Lemma 3.4: For any d, ϑ ∈ O(1), Condition 3.3 can be satisfied with T_1, ..., T_7, R_1 ∈ O(1) and R_2 ∈ O(n), where the ratio α := (T_2 + T_4)/(ϑ(T_2 + T_3 + 4d)) may be chosen to be an arbitrarily large constant.
Proof: First, observe that if Inequality (3) holds, the denominator in the right hand side of Inequality (12) is positive. Thus, we can equivalently state Inequality (12) as
Since λ ∈ (4/5, 1), this inequality clearly imposes a stronger constraint than Inequality (3), hence we can replace Inequalities (3) and (12) with this one and obtain an equivalent system. The requirement of (T_2 + T_4)/(ϑ(T_2 + T_3 + 4d)) = α can be rephrased as
T_4 = αϑ(T_2 + T_3 + 4d) − T_2. (14)
Again, clearly this constraint is stronger than Inequality (5), hence we drop Inequality (5) in favor of Inequality (14).
We satisfy the inequalities by iteratively defining the values of the left hand sides in accordance with the respective constraint, in the order (2), (13) , (7), (4), (14), (6), (8), (9), and finally (10) . Note that this is feasible, as in any step the right hand side of the current inequality is an expression in d, ϑ, α, and, in case of Inequality (10), n − f . 15 We obtain the solution
As α ∈ O(1) was arbitrary, d and ϑ are constants, and λ ∈ (4/5, 1) depends on ϑ only and is thus a constant as well, these values satisfy the asymptotic bounds stated in the lemma, concluding the proof.
IV. ANALYSIS
In this section we derive skew bounds Σ, as well as accuracy bounds T⁻, T⁺, such that the presented protocol is a (W, E)-stabilizing pulse synchronization protocol, for proper choices of the set of nodes W and the set of channels E, with skew Σ and accuracy bounds T⁻, T⁺ that stabilizes within time T(k) ∈ O(kn) with probability 1 − 2^{−k(n−f)}, for any k ∈ N. This analysis follows the lines of [30], with minor adjustments due to the changes made to the FATAL protocol. Moreover, we show that if a set of at least n − f nodes fires pulses regularly, then other non-faulty nodes synchronize within O(R_1) time deterministically. This stabilization mechanism is much simpler; the main challenge here is to avoid interference with the other approach.
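To give a feeling for how quickly the stated probability approaches one, the following sketch evaluates the bound exactly with rational arithmetic. The function names are hypothetical, and the sketch only evaluates the formula; it says nothing about the constants hidden in T(k) ∈ O(kn).

```python
from fractions import Fraction


def failure_bound(k: int, n: int, f: int) -> Fraction:
    """Upper bound 2^(-k(n-f)) on the probability of not having stabilized by T(k)."""
    return Fraction(1, 2 ** (k * (n - f)))


def success_bound(k: int, n: int, f: int) -> Fraction:
    """Lower bound 1 - 2^(-k(n-f)) on the probability of having stabilized by T(k)."""
    return 1 - failure_bound(k, n, f)


if __name__ == "__main__":
    for k in (1, 2, 4):
        print(k, float(failure_bound(k, n=4, f=1)))
```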
A. Basic Statements
To start our analysis, we need to define the basic requirements for stabilization. Essentially, we need that a majority of nodes is non-faulty and the channels between them are correct. However, the first part of the stabilization process is simply that nodes "forget" about past events that are captured by their timeouts. Therefore, we demand that these nodes indeed have been non-faulty for a time period that is sufficiently large to ensure that all timeouts have been reset at least once after the considered set of nodes became non-faulty.
all nodes i ∈ W are non-faulty, and all channels S_{i,j}, i, j ∈ W, are correct. We will show that if a coherent set of at least n − f nodes fires a pulse, i.e., switches to accept in tight synchrony, this set will generate pulses deterministically and with controlled frequency, as long as the set remains coherent. This motivates the following definitions.
Definition 4.2 (Stabilization Points): We call time t a W-stabilization point (quasi-stabilization point) iff all nodes i ∈ W switch to accept during [t, t + 2d) ([t, t + 3d)).
Throughout this section, we assume the set of coherent nodes W with |W| ≥ n − f to be fixed and consider all nodes in and channels originating from V \ W as (potentially) faulty. As all our statements refer to nodes in W, we will typically omit the word "non-faulty" when referring to the behavior or states of nodes in W, and "all nodes" is short for "all nodes in W". Note, however, that we will still clearly distinguish between channels originating at faulty and non-faulty nodes, respectively, to nodes in W.
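The two notions of Definition 4.2 differ only in the width of the time window, which a short sketch makes explicit. It assumes, for simplicity, one recorded switching time to accept per node in W; the function name is hypothetical.

```python
from typing import Iterable


def is_stabilization_point(t: float, accept_times: Iterable[float],
                           d: float, quasi: bool = False) -> bool:
    """Check whether all nodes switch to accept during [t, t + 2d), or [t, t + 3d) if quasi.

    accept_times holds one switching time per node in W (a simplifying assumption).
    """
    width = 3 * d if quasi else 2 * d
    return all(t <= a < t + width for a in accept_times)


if __name__ == "__main__":
    times = [0.0, 1.5, 2.5]
    print(is_stabilization_point(0.0, times, d=1.0))
    print(is_stabilization_point(0.0, times, d=1.0, quasi=True))
```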
As a first step, we observe that at times when W is coherent, indeed all nodes reset their timeouts, basing the respective state transition on proper perception of nodes in W .
has been reset at least once since time t⁻ − (ϑ(R_2 + 3d) + 8(1 − λ)R_2). If t′ denotes the time when such a reset occurred, for any j ∈ W it holds that S_{i,j}(t′) = S_j(τ^{−1}_{j,i}(t′)), i.e., at time t′, i observes j in a state j attained when it was non-faulty.
Proof: According to Condition 3.3, the largest possible value of any (randomized) timeout is ϑ(R_2 + 3d) + 8(1 − λ)R_2. Hence, any timeout that is in state 1 at a time smaller than t⁻ − (ϑ(R_2 + 3d) + 8(1 − λ)R_2) ≥ 0 expires before time t⁻ or is reset at least once. As by the definition of coherency all nodes in W are non-faulty and all channels between such nodes are correct during
, this implies the statement of the lemma.
Phrased informally, any corruption of timeout and channel states eventually ceases, as correct timeouts expire and correct links remember no events that lie d or more time in the past. Proper cleaning of the memory flags is more complicated and will be explained further down the road.
Throughout this section, we will assume for the sake of simplicity that the set W is coherent at all times and use this lemma implicitly, e.g. we will always assume that nodes from W will observe all other nodes from W in states that they indeed had less than d time ago, expiring of randomized timeouts at non-faulty nodes cannot be predicted accurately, etc. We will discuss more general settings in Section V.
We proceed by showing that once all nodes in W switch to accept in a short period of time, i.e., a W -quasi-stabilization point is reached, the algorithm guarantees that synchronized pulses are generated deterministically with a frequency that is bounded both from above and below.
Theorem 4.4: Suppose t is a W-quasi-stabilization point. Then (i) all nodes in W switch to accept exactly once within [t, t + 3d), and do not leave accept until t + 4d; and (ii) there will be a W-stabilization point t′ ∈ (t + (T_2 + T_3)/ϑ, t + T_2 + T_4 + 5d) satisfying that no node in W switches to accept in the time interval [t + 3d, t′); and that (iii) each node i's, i ∈ W, core state machine (Figure 1) is metastability-free during [t + 3d, t′ + 3d].
Proof: Proof of (i): Due to Inequality (2), a node does not leave the state accept earlier than T_1/ϑ ≥ 4d time after switching to it. Thus, no node can switch to accept twice during [t, t + 3d). By definition of a quasi-stabilization point, every node does switch to accept in the interval [t, t + 3d).
Proof of (ii): For each i ∈ W, let t_i ∈ [t, t + 3d) be the time when i switches to accept. By (i), t_i is well-defined. Further let t′_i be the infimum of times in (t_i, ∞) when i switches to recover or propose.16 In the following, denote by i ∈ W a node with minimal t′_i.
We will show that all nodes switch to propose via states sleep, sleep → waking, waking, and ready in the presented order. By (i) nodes do not leave accept before t + 4d. Thus at time t + 4d, each node in W is in state accept and observes each other node in W in accept. Hence, each node in W memorizes each other node in W in accept at time t + 4d. For each node j ∈ W, let t_{j,s} be the time node j's timeout T_1 expires first after t_j. Then t_{j,s} ∈ (t_j + T_1/ϑ, t_j + T_1 + d).17 Since |W| ≥ n − f, each node j switches to state sleep at time t_{j,s}. Hence, by time t + T_1 + 4d, no node will be observed in state accept anymore (until the time when it switches to accept again).
When a node j ∈ W switches to state waking at the minimal time t w larger than t j , it does not do so earlier than at time
This implies that all nodes in W have already left accept at least d time ago, since they switched to it at their respective times t_j < t + T_1 + 3d. Moreover, they cannot switch to accept again until t′_i as it is minimal and nodes need to switch to propose or recover before switching to accept. Hence, nodes in W are not observed in state accept during (t + T_1 + 4d, t′_i], in particular not by node j. Furthermore, nodes in W are not observed in state recover
As it resets its accept and recover flags upon switching to waking, j will hence neither switch from waking to recover nor from ready to propose during (t_w, t′_i).
16 Note that we follow the convention that inf ∅ = ∞ if the infimum is taken with respect to a (from above) unbounded subset of R^+_0.
17 The upper bound comprises an additive term of d since T_1 is reset at some time from (t_j, t_j + d).
Now consider node i. By the previous observation, it will not switch from waking to recover, but to ready, following the basic cycle. Consequently, it must wait for timeout T_2 to expire, i.e., cannot switch to ready earlier than at time t + T_2/ϑ. By definition of t′_i, node i thus switches to propose at time t′_i. As it is the first node that does so, this cannot happen before timeouts T_3 or T_4 expire, i.e., before time
All other nodes in W will switch to waking, and for the first time after t j , observe themselves in state waking at a time within (t + T 1 + 4d, t + T 1 (2 + ϑ) + 7d). Recall that unless they memorize at least f + 1 nodes in accept or recover while being in state waking, they will all switch to state ready by time max{t + T 2 + 4d, t + (2ϑ + 2)T 1 + 7d}
As we just showed that t′_i > t + T_2 + 5d, this implies that at time t + T_2 + 5d all nodes are observed in state ready, and none of them leaves before time t′_i. Now choose t′ to be the infimum of times from (t + (T_2 + T_3)/ϑ, t + T_2 + T_4 + 4d] when a node in W switches to state accept.18 Because of Inequality (15), node j cannot switch to propose within [t_j, t + (T_2 + T_3)/ϑ). Thus, (after time t + 3d) node j does not switch to accept again earlier than time t′, and timeout T_5 cannot expire at j until time
making it impossible for j to switch from propose to recover at a time within [t_j, t′ + 3d]. What is more, a node from W that switches to accept must stay there for at least T_1/ϑ > 3d time. Thus, by definition of t′, no node j ∈ W can switch from accept to recover at a time within [t_j, t′ + 3d]. Hence, no node j ∈ W can switch to state recover after t_j, but earlier than time t′ + 2d. It follows that no node in W can switch to other states than propose or accept during
In particular, no node in W resets its propose flags during
If at time t′ a node in W switches to state accept, n − 2f ≥ f + 1 of its propose flags corresponding to nodes in W are true, i.e., each of these flags holds 1. As the node reset its propose flags at the most recent time when it switched to ready and no nodes from W have been observed in propose between this time and t′_i, it holds that f + 1 nodes in W switched to state propose during [t′_i, t′). Since we established that no node resets its propose flags during [t′_i, t′ + 2d], it follows that all nodes are in state propose by time t′ + d.
Consequently, all nodes in W will observe all nodes in W in state propose before time t′ + 2d and switch to accept, i.e., t′ ∈ (t + (
On the other hand, if at time t′ no node in W switches to state accept, it follows that t′ = t + T_2 + T_4 + 4d. As all nodes observe themselves in state ready by time t + T_2 + 5d, they switch to propose before time t + T_2 + T_4 + 5d = t′ + d because T_4 expired. By the same reasoning as in the previous case, they switch to accept before time t′ + 2d, i.e., Statement (ii) holds as well.
Proof of (iii): We have shown that within [t j , t + 3d], any node j ∈ W switches to states along the basic cycle only. Note that Condition (ii) in the definition of metastability-freedom is satisfied by definition for state transitions along the basic cycle, as the conditions involve memory flags and timeouts (that are not associated with the states the nodes switch to) only. To show the correctness of Statement (iii), it is thus sufficient to prove that, whenever j switches from state s of the basic cycle to state s′ of the basic cycle during [t j , t + 3d], the transition from s′ to recover is disabled from the time it switches to s′ until it observes itself in this state. We consider transitions tr(accept, recover), tr(waking, recover), and tr(propose, recover) one after the other:
1) tr(accept, recover): We showed that node j's condition tr(accept, sleep) is satisfied before time t + 4d ≤ t+T 1 /ϑ, i.e., before tr(accept, recover) can hold, and no node resets its accept flags less than d time after switching to state sleep. When j switches to state accept again at or after time t , T 1 will not expire earlier than time t + 4d.
2) tr(waking, recover): As part of the reasoning about Statement (ii), we derived that tr(waking, recover) does not hold at nodes from W observing themselves in state waking.
3) tr(propose, recover): The additional slack of d in Inequality (17) ensures that T 5 does not expire at any node in W switching to state accept during (t , t +2d) earlier than time t + 3d.
Inductive application of Theorem 4.4 shows that by construction of our algorithm, nodes in W provably do not suffer from metastability upsets once a W -quasi-stabilization point is reached, as long as all nodes in W remain non-faulty and the channels connecting them correct. Unfortunately, it can be shown that it is impossible to ensure this property during the stabilization period, thus rendering a formal treatment infeasible. This is not a peculiarity of our system model, but a threat to any model that allows for the possibility of metastable upsets as encountered in physical chip designs. However, it was shown that, by proper chip design, the probability of metastable upsets can be made arbitrarily small [29] . 19 In the remainder of this work, we will therefore assume that all non-faulty nodes are metastability-free in all executions.
The next lemma reveals a very basic property of the main algorithm that is satisfied if no nodes may switch to state join in a given period of time. It states that if a non-faulty node switches to state sleep, other non-faulty nodes cannot remain too far ahead or behind in their execution of the basic cycle.
Lemma 4.5: Assume that at time t sleep , some node from W switches to sleep and no node from W is in state join
, any node is in one of the states sleep, sleep → waking, waking, or recover; (ii) any node in states sleep, sleep → waking, or waking reset its timeout T 2 at some time from (t sleep − ∆ g − 4d, t sleep + (2 − 1/ϑ)T 1 + 3d); and (iii) no node switches from recover to accept during [t sleep + T 1 + 2d, t a ], where t a > t sleep + 2T 1 + 3d denotes the infimum of times larger than t sleep + T 1 + 2d when a node switches to state accept. Proof: We claim that there is a subset A ⊆ W of at least n − 2f nodes such that each node from A has been in state accept at some time in the interval
To see this, observe that if a node switches to state sleep at time t sleep , it must have observed n − 2f non-faulty nodes in state accept at times from (t sleep − T 1 , t sleep ], since it resets its accept flags at the time t a ≥ t sleep − T 1 (that is minimal with this property) when it switched to state accept. Each of these nodes must have been in state accept at some time from (t sleep − T 1 − d, t sleep ), showing the existence of a set A ⊆ W as claimed.
During
no node from A is observed in state accept, as following the basic cycle requires T 2 to expire, no node switches to join, and in order to switch directly from recover to accept, a timeout of ϑ(2T 1 + 3d) needs to expire first. Since this also applies to the nodes from A and no node is in state join until time t sleep + 2T 1 + 3d, the only way to do so is by following the basic cycle via states sleep, sleep → waking, waking, ready, and propose. However, this takes at least until time
as well. This shows Statement (iii) of the lemma. Now consider any node that observes itself in one of the states waking, ready, or propose at time t sleep − T 1 − d. By time t sleep + d, it will memorize all nodes from A in accept (provided that it did not switch to accept in the meantime). Hence, it satisfies tr(waking, recover), tr(ready, propose), and tr(propose, accept) until it switches to either recover or accept. It follows that any such node must have switched to recover or accept by time t s + 3d < t s + T 1 + 2d. On the other hand, nodes that do not observe themselves in state waking at time t sleep − T 1 − d but are in one of the states sleep, sleep → waking, or waking at this time or switch to sleep during time (t sleep − T 1 − d, t sleep + 2T 1 + 3d] must have reset their timeout T 2 at some time from
i.e., Statement (ii) holds. To infer Statement (i), it remains to show that none of the latter nodes may switch to ready until time t sleep + 2T 1 + 3d. As no nodes from W are in state
] by assumption, the statement follows immediately from Statement (ii), as
The lemma follows.
Granted that nodes are not in state join for sufficiently long, this implies that nodes will switch to sleep in rough synchrony with others or drop out of the basic cycle and switch to recover.
Corollary 4.6: Assume that at time t sleep , a node from W switches to sleep, no node is in state join during [t sleep −T 1 − d, t sleep +2T 1 +4d], and also that during (t sleep −∆ g , t sleep ) = (t sleep − (2ϑ + 3)T 1 , t sleep ) no node in W is in state sleep. Then at time t sleep + 2T 1 + 4d, any node from W is either in one of the states sleep or sleep → waking and observed in sleep, or it is in recover and also observed in recover.
Proof: We apply Lemma 4.5 to see that at time t sleep + 2T 1 + 3d, all nodes are in one of the states sleep, sleep → waking, waking, or recover. As nodes remain in sleep for a timeout of duration (2ϑ+1)T 1 ≥ ϑ(2T 1 +4d), the statement of the corollary follows immediately provided that we can show that any node that does not switch to state sleep during [t sleep , t sleep + T 1 + 3d] is not in state waking at time t sleep + T 1 +3d. Consider such a node. If there is a time from (t sleep − ∆ g , t sleep + T 1 + 3d] when the node is not in one of the states sleep, sleep → waking, or waking, it cannot be in state waking at time t sleep + 2T 1 + 5d, since it could not have switched to sleep again in order to get there. Assume w.l.o.g. that the node switches to sleep exactly at time t sleep − ∆ g . Thus, it must have previously reset its timeout T 2 no later than
Hence we conclude by Lemma 4.5 that the node is in state recover at time t sleep + 2T 1 + 5d, finishing the proof.
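The timeout comparison (2ϑ+1)T 1 ≥ ϑ(2T 1 +4d) used in the proof above can be checked by expanding both sides. The short derivation below is only a sanity check. It shows that the comparison is equivalent to T 1 ≥ 4ϑd, which we assume here to be implied by the timeout constraints stated elsewhere in the paper (they are not reproduced in this section).

```latex
\begin{align*}
(2\vartheta + 1)\,T_1 - \vartheta\,(2T_1 + 4d)
  &= 2\vartheta T_1 + T_1 - 2\vartheta T_1 - 4\vartheta d \\
  &= T_1 - 4\vartheta d ,
\end{align*}
% Hence (2\vartheta + 1) T_1 \ge \vartheta (2 T_1 + 4d)
% holds if and only if T_1 \ge 4 \vartheta d.
```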
B. Resynchronization Points
In this section, we derive that within linear time, it is very likely that good resynchronization points occur. As a first step, we infer from Lemma 4.5 that whenever nodes may not enter state join, the time windows during which nodes may switch to sleep occur infrequently.
Lemma 4.7: Suppose no node is in state join during
Proof: Denote by t 0 the infimum of times from [t − + T 1 +d, t + ] when a node switches to sleep. Thus, by definition any time t
. We proceed by induction over increasing times
and that
In fact, we will show these bounds by establishing that no node is in state sleep during
for all i ∈ {1, . . . , i max }. We first establish these bounds for t 1 . By Lemma 4.5, every node not switching to state recover until time t 0 + T 1 + 3d resets T 2 at some time from (t 0 − ∆ g − 4d, t 0 + 3d) and is in one of the states sleep, sleep → waking, or waking at time t 0 + T 1 + 3d. Hence, such nodes do not switch to state ready and subsequently to propose, accept, and sleep again until t 0 + T 2 /ϑ − ∆ g − 4d ≤ t + , giving
Moreover, the lemma implies that no node is in state sleep during [t 0 + (2ϑ + 3)T 1 + 3d, t 1 ), as any node in state sleep at time t 0 +2T 1 +3d will leave after a timeout of (2ϑ+1)T 1 expires. Hence, the volume of times t
showing the claim for i = 1.
We now perform the induction step from i < i max to i+1. By (20) , no node is in state sleep during
Hence we can apply Corollary 4.6 to see that nodes not observing themselves in state sleep at time t i + 2T 1 + 4d switched to state recover. Therefore, nodes that continue to execute the basic cycle must have performed their most recent reset of timeout T 2 at or after time t i − T 1 − d. Thus, such nodes do not switch to state ready and subsequently to propose, accept, and sleep again until
Moreover, no node is in state sleep during [t i +(2ϑ+3)T 1 + 3d, t i+1 ). These two statements show Inequality (20) and Inequality (21) for i + 1, and by means of the induction hypothesis directly imply Inequality (18) and Inequality (19) for i + 1 as well, i.e., the induction succeeds. From Inequality (19), we have that
Observe that the same reasoning as above shows that no node switches to sleep during
Thus, inserting i = i max into Inequality (18), we infer that the volume of times t ∈ [t − + T 1 + d, t + ] such that no node is in state sleep during (t, t − ∆ g ) is at least
concluding the proof.
We are now ready to advance to proving that good resynchronization points are likely to occur within bounded time, no matter what the strategy of the adversary is. To this end, we first establish that, in any execution, a node switching to state init results in a good resynchronization point at most times. This is formalized by the following definition.
Definition
Proof: Assume w.l.o.g. that |W | = n − f (otherwise consider a subset of size n − f ) and abbreviate
The proof is in two steps: First we construct a measurable subset of [t − , t + ] that comprises good times only. In a second step a lower bound on the volume of this set is derived.
Constructing the set: Consider an arbitrary time t ∈ [t − , t + ], and assume a node i ∈ W switches to state init at time t. When it does so, its timeout R 3 expires. By Lemma 4.3 all timeouts of node i that expire at times within [t − , t + ], have been reset at least once until time t − . Let t R3 be the maximum time not later than t when R 3 was reset. Due to the distribution of R 3 we know that
Thus, node i is not in state init during time [t−(R 2 +2d), t), and no node j ∈ W observes i in state init during time [t − (R 2 + d), t). Thereby any node j's, j ∈ W , timeout (R 2 , supp i) corresponding to node i is expired at time t.
We claim that the condition that no node from W is in or observed in one of the states resync or supp → resync at time t is sufficient for t being a W -resynchronization point. To see this, assume that the condition is satisfied. Thus all nodes j ∈ W are in states none or supp k for some k ∈ {1, . . . , n} at time t. By the algorithm, they all will switch to state supp i or state supp → resync during (t, t + d). It might happen that they subsequently switch to another state supp k for some k ∈ V , but all of them will be in one of the states with signal supp during (t +
we know that during (t r + R 1 + 2d, t r ), no node from W will be in, or be observed in, states supp → resync or resync. Thus, if a node from W switches to init at a time within (t r +R 1 +2d, t r ), it is a W -resynchronization point. Further, all nodes in W will be in state dormant during (t r + R 1 + 2d, t r + 4d). Thus all nodes in W will be observed to be in state dormant during (t r + R 1 + 3d, t r + 4d), implying that they are not in state join during (t r + R 1 + 3d, t r + 4d). In particular, any time t ∈ (t r + R 1 + T 1 + 4d, t r ) satisfies that no node in W is in state join during (t − T 1 − d, t + 4d). Applying Lemma 4.7, we infer that the total volume of times from (t r , t r ) that is good is at least
In other words, up to a constant loss in each interval (t r , t r ), a constant fraction of the times are good.
Volume of the set: In order to infer a lower bound on the volume of good times during [t − , t + ], we subtract from t + − t − the volume of some intervals during which we cannot exclude that a node switches to supp → resync, increased by the constant term R 1 +4∆ g +T 1 +10d from Inequality (24). The inequality then yields that at least a fraction of (T 2 − 2ϑ∆ g − (ϑ − 1)T 1 − 4ϑd)/(T 2 − (ϑ − 1)T 1 − ϑd) of the remaining volume of times is good. Note that we also need to account for the fact that nodes may already be in state supp → resync at time t − , which we do by also covering events prior to t − when nodes switch to supp → resync. Formally, we define
Observe that any node in W does not switch to state init more than
Now consider the case that a node in W switches to state supp → resync at a time t satisfying that no node in W switched to state init during (t − (8ϑ + 6)d, t). This necessitates that this node observes n − f of its channels in state supp during (t − (2ϑ + 1)d, t), at least n − 2f ≥ f + 1 of which originate from nodes in W . As no node from W switched to init during (t − (8ϑ + 6)d, t), every node that has not observed a node i ∈ V \ W in state init at a time from (t − (8ϑ + 4)d, t) when (R 2 , supp i) is expired must be in a state whose signal is none during (t − (2ϑ + 2)d, t) due to timeouts. Therefore its outgoing channels are not in state supp during (t − (2ϑ + 1)d, t). By means of contradiction, it thus follows that for each node j of the at least f + 1 nodes (which are all from W ), there exists a node i ∈ V \ W such that node j resets timeout (R 2 , supp i) during the time interval (t − (8ϑ + 4)d, t) .
The same reasoning applies to any time t ∈ (t − (8ϑ + 6)d, t) satisfying that some node in W switches to state supp → resync at time t and no node in W switched to state init during (t − (8ϑ + 6)d, t ). Note that the set of the respective at least f + 1 events (corresponding to the at least f + 1 nodes from W ) where timeouts (R 2 , supp i) with i ∈ V \ W are reset and the set of the events corresponding to t are disjoint. However, the total number of events where such a timeout can be reset during
is upper bounded by
i.e., the total number of channels from nodes not in W (|V \ W | many) to nodes in W multiplied by the number of times an associated timeout can expire at a receiving node in W
With the help of inequalities (25) and (26)
, since any such time requires the existence of at least f + 1 events where timeouts (R 2 , supp i), i ∈ V \W , are reset at nodes in W , and the respective events are disjoint. Thus, all times t r ∈ [t − −(R 1 +4∆ g +T 1 +10d), t + ] when some node i ∈ W switches to supp → resync are covered by at most 2N − 1 intervals of length (8ϑ + 6)d.
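The covering argument above can be illustrated numerically. The sketch below is not part of the paper's proof. The function names, the sample event times, and the choice to place each interval directly after an event are assumptions made only for illustration. It checks that the measure of a union of at most 2N − 1 intervals of length (8ϑ + 6)d never exceeds (2N − 1)(8ϑ + 6)d, which is the quantity subtracted when bounding the volume of good times.

```python
def cover_volume_bound(N, theta, d):
    """Upper bound (2N - 1) * (8*theta + 6) * d on the total length of the cover."""
    return (2 * N - 1) * (8 * theta + 6) * d


def union_length(intervals):
    """Total length of the union of (lo, hi) intervals, merging overlaps."""
    total = 0
    cur_lo = cur_hi = None
    for lo, hi in sorted(intervals):
        if cur_hi is None or lo > cur_hi:
            # Previous merged interval is finished; add its length.
            if cur_hi is not None:
                total += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            # Overlapping or touching interval: extend the current one.
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo
    return total


# Hypothetical numbers: N = 2, theta = 2, d = 1, three event times.
N, theta, d = 2, 2, 1
length = (8 * theta + 6) * d
events = [0, 5, 40]  # at most 2N - 1 = 3 events
cover = [(s, s + length) for s in events]  # interval placement is an assumption
covered = union_length(cover)
bound = cover_volume_bound(N, theta, d)
```

Overlapping intervals only reduce the covered volume, so the bound is tight exactly when the intervals are pairwise disjoint.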
This results in a coverḠ ⊇Ḡ consisting of at most 2N − 1 intervals that satisfies
As argued previously, summing over the at most 2N intervals that remain in [t − , t + ] \Ḡ and using Inequality (24), it follows that the volume of good times during [t
≥ λ(t
as claimed. The lemma follows.
We are now in the position to prove our second main theorem, which states that a good resynchronization point occurs within O(R 2 ) time with overwhelming probability.
Theorem 4.10: Denote by Ê 3 := ϑ(R 2 + 3d) + 8(1 − λ)R 2 + d the maximal value the distribution R 3 can attain plus the at most d time until R 3 is reset whenever it expires. For any k ∈ N and any time t, with probability at least 1 − (1/2)^{k(n−f)} there will be a good W -resynchronization point during [t, t + (k + 1)Ê 3 ].
Proof: Assume w.l.o.g. that |W | = n − f (otherwise consider a subset of size n − f ). Fix some node i ∈ W and denote by t 0 the infimum of times from [t, t + (k + 1)Ê 3 ] when node i switches to init. We have that t 0 < t +Ê 3 . By induction, it follows that node i will switch to state init at least another k times during [t, t + (k + 1)Ê 3 ] at the times t 1 < t 2 < . . . < t k . We claim that each such time t j , j ∈ {1, . . . , k}, has a probability of being good, and therefore of being a good W -resynchronization point, that is independently lower bounded by 1/2.
We prove this by induction on j: As induction hypothesis, suppose for some j ∈ {1, . . . , k − 1}, we showed the statement for j′ ∈ {1, . . . , j − 1} and the execution of the system is fixed until time t j−1 , i.e., E|
, and all nodes' clocks make progress in E as in E. Clearly each such E has its own time t j < t + (j + 1)Ê 3 when R 3 expires next after t j−1 at node i, and i switches to init. We next characterize the distribution of the times t j .
As the rate of the clock driving node i's R 3 is between 1 and ϑ, t j > t j−1 is within an interval, call it [t − , t + ], of size at most t
regardless of the progress that i's clock C makes in any execution E . Certainly we can apply Lemma 4.9 also to each of the E , showing that the volume of times from [t − , t + ] that are not good in E is at most
Since clock C can make progress not faster than at rate ϑ and the probability density of R 3 is constantly 1/(8(1 − λ)R 2 ) (with respect to the clock function C), we obtain that the probability of t j not being a good time is upper bounded by
Here we use that the time when R 3 expires is independent of E | [0,tj−1] .
We complete our reasoning as follows. Given E| [0,tj−1] , we permit an adversary to choose E , including random bits of all nodes and full knowledge of the future, with the exception that we deny it control or knowledge of the time t j when R 3 expires at node i, i.e., E is an imaginary execution in which R 3 does not expire at i at any time greater than t j−1 . Note that for the good W -resynchronization points we considered, the choice of E does not affect the probability that t 1 , . . . , t ] . We define that E| [0,tj ) = E | [0,tj ) and in E node i switches to state init (because R 3 expired). As-conditional to the clock driving R 3 and t j−1 being specified-t j is independent of E| [0,tj ) , E is indistinguishable from E until time t j . Because t j is good with probability at least 1/2 independently of E| [0,tj−1] = E| [0,tj−1] , so it is in E. Hence, in E t j is a good W -resynchronization point with probability 1/2, independently of E| [0,tj −1] . Since E was chosen in an adversarial manner, this completes the induction step.
In summary, we showed that for any node in W and any execution (in which we do not manipulate the times when R 3 expires at the respective node), starting from the second time during [t, t + (k + 1)Ê 3 ] when R 3 expires at the respective node, there is a probability of at least 1/2 that the respective time is a good W -resynchronization point. Since we assumed that |W | = n − f and there are at least k such times for each node in W , this implies that having no good W -resynchronization point during [t, t + (k + 1)Ê 3 ] is at least as unlikely as k(n − f ) unbiased and independent coin flips all showing tails, i.e., (1/2)^{k(n−f)}. This concludes the proof.
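To get a feeling for the bound of Theorem 4.10, the following sketch evaluates Ê 3 and the failure probability (1/2)^{k(n−f)} for given parameters. It is a minimal illustration only. The function names and the sample values of ϑ, R 2 , d, and λ are hypothetical and are not checked against the constraints the paper imposes on these parameters.

```python
from fractions import Fraction


def e3_hat(theta, R2, d, lam):
    """E3_hat = theta*(R2 + 3d) + 8*(1 - lam)*R2 + d, as in Theorem 4.10."""
    return theta * (R2 + 3 * d) + 8 * (1 - lam) * R2 + d


def failure_bound(k, n, f):
    """Upper bound (1/2)^(k(n-f)) on the probability that no good
    W-resynchronization point occurs in the considered window."""
    return Fraction(1, 2) ** (k * (n - f))


def smallest_k(n, f, target):
    """Smallest k >= 1 such that (1/2)^(k(n-f)) <= target."""
    k = 1
    while failure_bound(k, n, f) > target:
        k += 1
    return k


# Hypothetical example: n = 4 nodes, f = 1 fault, target failure 1e-9.
k = smallest_k(4, 1, Fraction(1, 10**9))
# Window length (k + 1) * E3_hat for made-up timing parameters.
window = (k + 1) * e3_hat(Fraction(11, 10), 1000, 1, Fraction(1, 2))
```

Because the exponent is k(n − f ), the number of rounds k needed for a fixed target probability shrinks as the number of non-faulty nodes grows, while the window length grows only linearly in k.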
C. Stabilization via Good Resynchronization Points
Having established that eventually a good W -resynchronization point t g will occur, we turn to proving the convergence of the main routine. We start with a few helper statements wrapping up that a good resynchronization point guarantees proper reset of flags and timeouts involved in the stabilization process of the main routine.
Lemma 4.11: Suppose t g is a good W -resynchronization point. Then (i) each node i ∈ W switches to passive at a time t i ∈ (t g + 4d, t g + (4ϑ + 3)d) and observes itself in state
,tjoin] ≡ 0 for all i, j ∈ W , where t join ≥ t g + 4d is the infimum of all times greater than t g − T 1 − d when a node from W switches to join,
where t s ≥ t g +(1+1/ϑ)T 1 is the infimum of all times greater or equal to t g when a node from W switches to sleep → waking, (iv) no node from W resets its sleep → waking flags during [t g + (1 + 1/ϑ)T 1 , t g + R 1 /ϑ], and (v) no node from W resets its join flags due to switching to passive during [t g + (1 + 1/ϑ)T 1 , t g + R 1 /ϑ].
Proof: All nodes in W switch to state supp → resync during (t g , t g + 2d) and switch to state resync when their timeout of ϑ4d expires, which does not happen until time t g + 4d. Once this timeout has expired, they switch to state passive as soon as they observe themselves in state resync, i.e., by time t g + (4ϑ + 3)d. Hence, every node i ∈ W does not observe itself in state resync within [t g + 3d, τ i,i (t i )), and therefore is in state dormant during [t g + 3d, τ i,i (t i )]. This implies that it observes itself in state dormant during [t g + 4d, τ i,i (t i )), completing the proof of Statement (i).
Moreover, from the definition of a good W -resynchronization point we have that no nodes from W are in state join at times in [t g − T 1 − d, t join ). Statement (ii) follows, as every node from W resets its join flags upon switching to state passive at time t i .
Regarding Statement (iii), observe first that no nodes from W are in state sleep → waking during (t g − d, t g + (1 + 1/ϑ)T 1 ) for the following reason: By definition of a good W -resynchronization point no node from W switches to sleep during (t g − ∆ g , t g ) ⊃ (t g − (2ϑ + 1)T 1 − 3d, t g ). Any node in W that is in states sleep or sleep → waking at time t g − (2ϑ + 1)T 1 − 3d switches to state waking before time t g − d due to timeouts. Finally, any node in W switching to sleep at or after time t g will not switch to state sleep → waking before time t g + (1 + 1/ϑ)T 1 . The observation follows.
Since nodes in W reset their sleep → waking flags at some time from
Statements (iv) and (v) follow from the fact that all nodes in W switch to state passive until time
while timeout (R 1 , supp → resync) must expire first in order to switch to dormant and subsequently passive again.
Before we proceed with our third main statement showing eventual stabilization, we make a few more basic observations. Firstly, if nodes do not make progress on the basic cycle, they must eventually switch to recover, i.e., the timeout conditions ensure detection of deadlocks.
Lemma 4.12: For any time t − and any node it holds that it must be in state recover or join or switch to sleep at some time from [t − , t − + (1 − 1/ϑ)T 1 + T 2 + T 4 + T 5 + 4d). Proof: Suppose a node is never in state recover or join
Thus it may follow transitions along the basic cycle only. Assume w.l.o.g. that the node switched to sleep right before time t − . Thus, it switched to state accept beforehand, no later than time t − − T 1 /ϑ. Due to timeouts, it must either switch to recover at some point in time or switch to sleep, sleep → waking, waking, ready, propose, accept, and finally sleep again. At each state, it takes less than d time until a respective timeout is started and it observes itself in the respective state. Hence, the node switches to recover or sleep before time
proving the claim of the lemma. Secondly, after a good W -resynchronization point t g , no node from W will switch to state join until either time t g + T 7 /ϑ + 4d or T 6 /ϑ time after the first non-faulty node switched to sleep → waking again after t g . By proper choice of T 7 > T 6 and T 6 , this will guarantee that nodes from W do not switch to join prematurely during the final steps of the stabilization process. Lemma 4.13: Suppose t g is a good W -resynchronization point. Denote by t s the infimum of times greater than t g when a node in W switches to state sleep → waking and by t join the infimum of times greater than t g − T 1 − d when a node in W switches to state join. Then, starting from time t g + 4d, tr(recover, join) is not satisfied at any node in W until time
Proof: By Statements (ii) and (iii) of Lemma 4.11 and Inequality (2), we have that t s ≥ t g +T 1 +4d ≥ t g +(4ϑ+4)d and t join ≥ t g + 4d. Consider a node i ∈ W not observing itself in state dormant at some time t ∈ [t g + 4d, t join ]. According to Statements (i) and (ii) of Lemma 4.11, the threshold condition of f + 1 nodes memorized in state join cannot be satisfied at such a node. By Statements (i) and (iii) of the lemma, the threshold condition of f +1 nodes memorized in state sleep → waking cannot be satisfied unless t > t s . Hence, if at time t a node from W satisfies that it observes itself in state active, we have that T 6 expired after being reset after time t s , i.e., t > t s + T 6 /ϑ. Moreover, by Statement (i) of Lemma 4.11, we have that if T 7 is expired at any node in W at time t, it holds that t > t g + T 7 /ϑ + 4d. Altogether, we conclude that tr(recover, join) is not satisfied at any node in W during [t g + 4d, min{t g + T 7 /ϑ + 4d, t s + T 6 /ϑ}].
In particular, t join must be larger than the upper boundary of this interval, concluding the proof. Thirdly, after a good W -resynchronization point, any node in W switches to recover or to sleep → waking within bounded time, and all nodes in W doing the latter will do so in rough synchrony. Using the previous lemmas, we can show that this happens before the transition to join is enabled for any node.
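The bound of Lemma 4.13 is a minimum of two expressions, one driven by timeout T 7 and one by timeout T 6 . The sketch below evaluates it, assuming the bound reads min{t g + T 7 /ϑ + 4d, t s + T 6 /ϑ} as suggested by the discussion preceding the lemma. The function names and the sample numbers are hypothetical.

```python
def join_guard_end(t_g, t_s, T6, T7, theta, d):
    """Earliest time after which tr(recover, join) may become satisfied,
    assumed here to be min{t_g + T7/theta + 4d, t_s + T6/theta}."""
    via_T7 = t_g + T7 / theta + 4 * d  # T7 expires at the earliest
    via_T6 = t_s + T6 / theta          # T6 expires after the first sleep -> waking
    return min(via_T7, via_T6)


def join_possible(t, t_g, t_s, T6, T7, theta, d):
    """True if time t lies strictly after the guarded interval."""
    return t > join_guard_end(t_g, t_s, T6, T7, theta, d)


# Hypothetical values: here the T6 branch determines the bound.
end = join_guard_end(t_g=0, t_s=100, T6=20, T7=400, theta=2, d=1)
```

With these sample values the T 6 branch is the smaller one, which corresponds to the situation where some node in W switches to sleep → waking well before T 7 can expire.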
Lemma 4.14: Suppose t g is a good W -resynchronization point and use the notation of Lemma 4.13. Define t + := t g − T 1 /ϑ + T 2 + T 4 + T 5 + 3d and denote by t sleep the infimum of all times greater than t g − ∆ g when a node in W switches to sleep. Then t sleep ≥ t g and either (i) t sleep < t + and at time t sleep + 2T 1 + 4d, any node in W is either in one of the states sleep or sleep → waking and observed in sleep or is in recover and also observed in recover, or (ii) all nodes in W are observed in state recover at time
Proof: By definition of a good resynchronization point, no node switches to sleep during (t g − ∆ g , t g ), giving that t sleep ≥ t g . If t sleep < t + , Lemma 4.13 yields that
≥ t sleep + 2T 1 + 4d.
Therefore, by definition of a good resynchronization point, no nodes are in state join during 
≥ t + + 2T 1 + 4d.
Hence, Lemma 4.12 states that every node must be in state recover at some time in (t g −T 1 −d, t + ). Since nodes do not leave state recover during (t g − T 1 − d, t join ), Statement (ii) follows.
We have everything in place for proving that a good resynchronization point leads to stabilization within R 1 /ϑ − 3d time.
Theorem 4.15: Suppose t g is a good W -resynchronization point. Then there is a quasi-stabilization point during (t g , t g + R 1 /ϑ − 3d].
Proof: For simplicity, assume during this proof that R 1 = ∞, i.e., by Statement (i) of Lemma 4.11 all nodes in W observe themselves in states passive or active at times greater or equal to t g + (4ϑ + 4)d. We will establish the existence of a quasi-stabilization point at a time larger than t g and show that it is upper bounded by t g + R 1 /ϑ − 3d. Hence this assumption can be made w.l.o.g., as the existence of the quasi-stabilization point depends on the execution up to time t g + R 1 /ϑ only, and R 1 cannot expire before this time at any node in W . Moreover, by Statements (i) and (ii) of Lemma 4.11, every node satisfies Mem i,i,join ≡ 0 on [t g + (4ϑ + 4)d, t i,join ), where t i,join denotes the infimum of all times greater or equal to t g + (4ϑ + 4)d when node i switches to join. During the time span considered in this proof, every node switches at most once to join, thus we may w.l.o.g. assume that Mem i,i,join = 0 is always satisfied in the following. We use the notation of Lemmas 4.13 and 4.14. By Statements (ii) of Lemma 4.11 and Inequality (2), we have that t s ≥ t g + (1 − 1/ϑ)T 1 ≥ t g + (4ϑ + 4)d.
According to Lemma 4.11, all nodes in W switched to state passive during (t g + 4d, t g + (3 + 4ϑ)d), implying that at any node in W , T 7 will expire at some time from
By Lemma 4.13, thus t join > t g + (1 + 1/ϑ)T 1 , and by Statement (v) of Lemma 4.11, no node resets its join flags after t g + (1 + 1/ϑ)T 1 again (before R 1 expires).
Case 1: Assume t sleep ≥ t + . Thus, Statement (ii) of Lemma 4.14 applies, i.e., all nodes are observed in state recover by time t + + 2T 1 + 4d. Any node from W will switch to state join before time t g + T 7 + (4ϑ + 4)d because T 7 expires no later than that. Subsequently, it will switch to propose as soon as it memorizes all non-faulty nodes in state join. Denote by t propose ∈ (t g +T 7 /ϑ+4d, t g +T 7 +(4ϑ+5)d) the minimal time when a node from W switches from join to propose. Certainly, nodes in W do not switch from waking to ready during (t propose , t propose + 2d) and therefore also not reset their join flags before time t propose +3d. As nodes in W reset their propose and accept flags upon switching to state join, some node in W must memorize n − 2f ≥ f + 1 non-faulty nodes in state join at time t propose . According to Statement (ii) of Lemma 4.11, these nodes must have switched to state join at or after time t join . Hence, all nodes in W will memorize them in state join by time t propose + d and thus have switched to state join. Hence, all nodes in W will switch to state propose before time t propose + 2d and subsequently to state accept before time t propose + 3d, i.e., t propose ≤ t g + T 7 + (4ϑ + 5)d is a quasi-stabilization point.
Case 2: Assume t sleep < t + . By Statement (i) of Lemma 4.14, all nodes are observed in either sleep or recover at time t sleep + 2T 1 + 4d. The nodes observed in state sleep will have been observed in state sleep → waking and switched to waking by time t sleep + (2ϑ + 3)T 1 + 5d.
Case 2a: Suppose fewer than f + 1 nodes in W are observed in state sleep at time t sleep +2T 1 +4d, i.e., at least n−2f ≥ f +1 nonfaulty nodes are observed in state recover. By Lemma 4.13, we have that
Hence, any node observing itself in state waking at some time t ∈ (t sleep + 2T 1 + 4d, t sleep + (2ϑ + 3)T 1 + 6d) will also observe at least f + 1 nodes in state recover and switch to recover. As any node in sleep or sleep → waking at time t sleep + 2T 1 + 4d will observe itself in state waking no later than time t sleep + (2ϑ + 3)T 1 + 6d, by time t sleep + (2ϑ + 3)T 1 + 7d < t join , all nodes observe themselves in state recover. From here we can argue analogously to the first case, i.e., there exists a quasi-stabilization point t propose ≤ t g + T 7 + (4ϑ + 5)d.
Case 2b: Suppose at least f + 1 nodes in W are observed in state sleep at time t sleep + 2T 1 + 4d. These nodes will switch to waking and subsequently ready until time
due to T 2 being expired while observing themselves in waking unless they switch from waking to recover. Note that these nodes reset their accept flags upon switching to waking. Denote by t propose and t accept the infima of times greater than t sleep +2T 1 +4d when a node switches to propose or accept, respectively. Recall that any node switching from recover to join resets its propose and accept flags, and any node switching from waking to ready resets its propose flags. Hence, we have for all i, j ∈ W that (i) Mem i,j,propose (t) = 0 at any time t ∈ [t sleep + 2T 1 + 4d, t propose ] when i observes itself in ready or join, and (ii) Mem i,j,accept (t) = 0 at any time t ∈ [t sleep + 2T 1 + 4d, t accept ] when i observes itself in ready, join, or propose. By Statements (ii) and (iv) of Lemma 4.11, no node from W resets its sleep → waking flags at or after time t s ≥ t g + (1 + 1/ϑ)T 1 . As t s ≥ t sleep + 2T 1 + 4d and all nodes observed in sleep at time t sleep +2T 1 +4d will be observed in sleep → waking by time t sleep +(2ϑ+3)T 1 +5d, Statement (i) of the lemma implies that all nodes in W switch to active at some time from (t s , t sleep +(2ϑ+3)T 1 +5d) ⊆ (t sleep +2T 1 + 6d, t sleep +(2ϑ+3)T 1 +5d). As, by the Statements (i) and (ii) from above, the first node switching to state propose must do so because of an expiring timeout, Lemma 4.13 yields that
Therefore,
(28) By Inequality (27), we conclude that at time t sleep − T 1 /ϑ + T 2 + 2d < t propose , any node from W observes itself in one of the states ready, recover, or join.
Again, we distinguish two cases.
Case 2b-I: t propose < t sleep − T 1 − d + (T 2 + T 3 )/ϑ. As previously used, no node can switch from ready to propose during (t sleep +2T 1 +4d, t sleep −T 1 −d+(T 2 +T 3 )/ϑ). Hence, there must be a node that switches from join to propose at time t propose . By Statements (i) and (ii) from above, the node must memorize at least n−2f ≥ f +1 nodes from W in state join at time t propose . By Statement (ii) of Lemma 4.11, these nodes must have switched to join at or after time t join . By Statements (iii) and (v) of the lemma, no node resets its join flags during [t g +(1+1/ϑ)T 1 , t g +R 1 /ϑ) ⊃ [t propose , t propose + 3d) unless it switches to state join. Hence, all nodes still in state recover have switched to join by time t propose +d, giving that all nodes are in one of the states ready, join, or accept at time t propose + d (since they cannot leave accept earlier than t propose + T 1 /ϑ ≥ t propose + 4d again).
Case 2b-II: t propose ≥ t sleep −T 1 −d+(T 2 +T 3 )/ϑ. Recall that all nodes switched to active by time t sleep +(2ϑ+3)T 1 + 5d. Hence, any node observing itself in state recover at time t sleep +2T 1 +4d will have switched to join because T 6 expired by time
Hence, also in this case all nodes are in one of the states ready, join, or accept at time t propose + d. Continuing Case 2b: Next, we claim that any node is in states propose or join by time t sleep − T 1 /ϑ + T 2 + T 4 + 2d. To see this, observe that any node following the basic cycle must switch from ready to propose by this time due to timeouts. On the other hand, according to Inequality (29) , all nodes in state recover switch to join by time
showing the claim. In summary, we showed the following points: that have not been in that state at or after time t propose . We claim that the infimum t q of all times from
when a node switches to accept is a quasi-stabilization point. Note that because
≤ t propose + T 5 ϑ no node will switch from propose to recover before time t q + 3d.
Again, we distinguish two cases. First assume that t q < t sleep − T 1 /ϑ + T 2 + T 4 + 2d, i.e., at time t q indeed a node switches to state accept. Due to Statement (iv) from the above list and the minimality of t q , it follows that the respective node memorizes n − 2f ≥ f + 1 nodes from W in state propose that switched to propose at or after time t p . These nodes must be in one of the states propose or accept during [t q , t q + 3d]. According to Statement (i) from above, thus all nodes still in ready will switch to propose by time t q + d. By time t q + 2d, all nodes in join will observe the at least n − f nodes from W in one of the states join, propose, or accept, and hence switch to propose. Another d time later, all nodes will have switched to accept, i.e., t q is indeed a quasi-stabilization point.
On the other hand, if t q = t sleep − T 1 /ϑ + T 2 + T 4 + 2d, Statement (ii) from the above list gives that all nodes from W are in one of the states join, propose, or accept during [t q + d, t q + 3d]. Therefore, nodes will switch from join to propose and subsequently from propose to accept until time t q + 3d as well.
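The counting step used repeatedly in this proof (a node whose threshold of n − f memorized flags is met must have seen at least n − 2f ≥ f + 1 of them from non-faulty nodes) can be checked numerically. The following is a minimal sketch; the function name is illustrative, and it assumes the resilience condition n ≥ 3f + 1, which is equivalent to the inequality n − 2f ≥ f + 1 used in the text.

```python
def correct_witnesses(n: int, f: int) -> int:
    """Lower bound on the number of non-faulty nodes among any n - f
    observed nodes, given that at most f of the n nodes are faulty.

    Assumption of this sketch: n >= 3f + 1.
    """
    if f < 0 or n < 3 * f + 1:
        raise ValueError("sketch assumes f >= 0 and n >= 3f + 1")
    # Up to f of the n - f observed nodes may be faulty.
    return (n - f) - f


if __name__ == "__main__":
    for n, f in [(4, 1), (7, 2), (10, 3)]:
        print(n, f, correct_witnesses(n, f))
```

Since the returned value is at least f + 1, it also exceeds the threshold at which a node can no longer attribute its observation to faulty nodes alone.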
It remains to check that in all cases, the obtained quasi-stabilization point t q occurs no later than time t g + R 1 /ϑ − 3d. In Cases 1 and 2a, we have that
In Case 2b, it holds that
We conclude that indeed all nodes in W switch to accept within a window of less than 3d time before, at any node in W , R 1 expires and it leaves state resync, concluding the proof. Finally, putting together the established main theorems and Lemma 3.4, we deduce that the system will stabilize from an arbitrary initial state provided that a subset of n − f nodes remains coherent for a sufficiently large period of time.
Corollary 4.16: Let W ⊆ V , where |W | ≥ n − f , and define for any k ∈ N
Then, for any k ∈ N, the proposed algorithm is a (W, W 2 )-stabilizing pulse synchronization protocol with skew 2d and accuracy bounds (T 2 +T 3 )/ϑ−2d and T 2 +T 4 +7d stabilizing within time T (k) with probability at least 1 − 2 −k(n−f ) . It is feasible to pick timeouts such that T (k) ∈ O(kn) and
Proof: The satisfiability of Condition 3.3 with T (k) ∈ O(kn) and T 2 + T 4 + 7d ∈ O(1) follows from Lemma 3.4. Assume that t + is sufficiently large for [t − +T (k)+2d, t + ] to be non-empty, as otherwise nothing is to show. By definition, W will be coherent during [t Theorem 4.4 inductively, we derive that the algorithm is a (W, W 2 )-stabilizing pulse synchronization protocol with the bounds as stated in the corollary that stabilizes within time T (k) with probability at least 1 − 1/2 k(n−f ) .
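To get a feeling for the probability guarantee of Corollary 4.16, the following sketch evaluates the failure bound 2^(−k(n−f)) and determines the smallest k that pushes it below a chosen target. The function names and the target value are illustrative assumptions and not part of the protocol.

```python
import math


def failure_bound(k: int, n: int, f: int) -> float:
    """Upper bound 2^(-k(n-f)) on the probability of not having
    stabilized within time T(k), as stated in Corollary 4.16."""
    return 2.0 ** (-k * (n - f))


def smallest_k(n: int, f: int, target: float) -> int:
    """Smallest k >= 1 such that 2^(-k(n-f)) <= target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    return max(1, math.ceil(math.log2(1.0 / target) / (n - f)))


if __name__ == "__main__":
    k = smallest_k(n=4, f=1, target=1e-9)
    print(k, failure_bound(k, 4, 1))
```

Because T(k) ∈ O(kn) while the exponent grows as k(n − f), larger systems need a smaller k for the same target probability.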
D. Late Joining and Fast Recovery
An important aspect of combining self-stabilization with Byzantine fault-tolerance is that the system can remain operational when facing a limited number of transient faults.
If the affected components stabilize quickly enough, this can prevent future faults from causing system failure. In an environment where transient faults occur according to a random distribution that is not too far from being uniform (i.e., one deals not primarily with bursts), the mean time until failure is therefore determined by the time it takes to recover from transient faults. Thus, it is of significant interest that a node that starts functioning according to the specifications again synchronizes as fast as possible to an existing subset of correct nodes. Moreover, it is of interest that a node that has been shut down temporarily, e.g. for maintenance, can join the operational system again quickly.
Theorem 4.17: Suppose there exists a node i in V and a set W ⊆ V , |W | ≥ n − f , such that there is a W -stabilization point at some time t − and W ∪ {i} is coherent during [t
Proof: Again, the proof is executed by distinguishing cases. W.l.o.g., we assume for the moment that W ∪ {i} is coherent during [t − , ∞) and later show that indeed t ≤ t − + (1 + 5/(2ϑ))R 1 .
Case 1: Node i does not switch to supp → resync during
. Thus, after R 1 expires at the latest by time
Denote by t sleep the minimum of the respective times. We apply Lemma 4.5 to W ∪{i}. Thus, at time t sleep +2T 1 +3d, node i is either in state recover and will not leave until the next W -stabilization point (or it switches to join), or it is in state sleep and reset its timeout T 2 at some time from
Case 1a: Node i is in recover at time t sleep +2T 1 +3d. As it cannot switch to join until time t − +R 1 +(ϑ−1)T 1 +2T 2 + 2T 4 + 18d, it will stay in recover until the subsequent W-stabilization point t W ∈ (t W + (T 2 + T 3 )/ϑ, t − + R 1 + (ϑ − 1)T 1 +2T 2 +2T 4 +14d) (existing according to Theorem 4.4). By time t W , clearly timeout (recover, ϑ(2T 1 + 3d)) has expired at the node, as
> t sleep + (ϑ + 1)(2T 1 + 3d).
Because T 1 /ϑ ≥ 4d, i will observe all nodes from W in accept during [t W + 3d, t W + 4d]. Hence it will switch to accept by time t W + 3d, i.e., t W is a W ∪ {i}-quasi-stabilization point.
Case 1b: Node i is in sleep at time t sleep + 2T 1 + 3d. Denote by t W the W -stabilization point subsequent to t W as in the previous case. As no node from W is observed in state accept or recover during [t s +2T 1 +3d, t W ) and i reset its timeout T 2 no earlier than time t W − ∆ g + T 1 /ϑ − 4d, it will not switch to recover before time min{t W , t W − ∆ g + (T 1 + T 2 + T 3 + T 5 )/ϑ − 4d} unless it switches to accept first. However, as it resets its propose and accept flags before switching to ready, it cannot switch to accept before at least f nodes from W switched to propose (unless switching to recover first). Moreover, by time t W , it will already have switched to ready since
Hence, reasoning analogously to the proof of Theorem 4.4, t W is in fact a W ∪ {i}-stabilization point provided that i switches to accept instead of recover first. This in turn follows from the bound
where in the last step we used that t W < t W + T 2 + T 4 + 5d according to Theorem 4.4. This shows that T 5 does not expire at i while it is in propose before time t W +2d. Hence, t W is a W ∪ {i}-stabilization point. Case 2: Node i switches to supp → resync at a time
. Denote by t W and t W the maximal W -stabilization point smaller than t and the minimal W -stabilization point larger than max{t , t W + 2d}, which exist by Theorem 4.4. Denote by t sleep the minimal time larger than t W when a node from W switches to sleep. Analogously to Case 1b, t W is a W ∪{i}-stabilization point if i is in state sleep at time t sleep + 2T 1 + 3d. Hence, assume w.l.o.g. that i is in state recover or already switched to join by this time. Analogously to Case 1a, t W will be a W ∪ {i}-quasi-stabilization point if it stays in recover until time t W + 3d. Therefore, w.l.o.g., i switches to join at some time during (t , t W + 3d), implying that it will leave the state no later than time t W + 4d and switch to state accept by time t W + 5d. Now either i continues to execute the basic cycle and thus will, analogously to Case 1b, participate in the minimal W-stabilization point t W > t W +2d, or it will switch to recover again. In the latter case, it cannot switch back to join until at least time t +R 1 /ϑ because it needs to reset its join flags first, which happens upon switching to passive only. As we have that
i cannot leave state recover through join again before time t W + 4d. Therefore, t W is a W ∪ {i}-quasi-stabilization point, analogously to Case 1a.
We have shown that there is some W ∪ {i}-quasi-stabilization point at the latest by time
in Case 2, while in Case 1 there is a quasi-stabilization point no later than time t − + R 1 + (ϑ − 1)T 1 + 2T 2 + 2T 4 + 18d. By Theorem 4.4, this implies a W ∪ {i}-stabilization point by time
where the estimate is obtained analogously to the bound t + R 1 /ϑ > t W + 4d shown above. This concludes the proof, as indeed there is a W ∪ {i}-stabilization point no later than time t − + (1 + 5/(2ϑ))R 1 .
V. GENERALIZATIONS
This section provides a few extensions of the core results derived in the previous section. In particular, we show that it is not necessary to map faulty channels to, for example, faulty nodes (thus rendering a non-faulty node effectively faulty in terms of results), that the algorithm can tolerate an even stronger adversary than defined in Section II without significant change of stabilization time, and that in many reasonable settings stabilization takes O(R 1 ) time only, even if there is no majority of non-faulty nodes that is already synchronized. With the exception of Corollary 5.6, we again follow [30] during this section.
A. Synchronization Despite Faulty Channels
Theorem 4.15 and our notion of coherency require that all involved nodes are connected by correct channels only. However, it is desirable that non-faulty nodes synchronize even if they are not connected by correct channels. To capture this, the notions of coherency and stability can be generalized as follows.
Definition 5.1 (Weak Coherency): We call the set C ⊆ V weakly coherent during [t − , t + ], iff for any node i ∈ C there is a subset C ⊆ C that contains i, has size n − f , and is coherent during [t
In particular, if there are in total at most f nodes that are faulty or have faulty outgoing channels, then the set of nonfaulty nodes is (after some amount of time) weakly coherent.
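The covering structure behind Definition 5.1 can be illustrated by a brute-force check. The sketch below is a simplification under stated assumptions: it ignores all timing conditions of coherency and treats a set as "clean" if it contains no faulty node and no faulty channel between its members. The names `is_clean` and `is_weakly_clean` are hypothetical, and the enumeration is exponential, so it is meant for small examples only.

```python
from itertools import combinations


def is_clean(nodes, faulty_nodes, faulty_channels):
    """Stand-in for coherency (assumption: timing conditions ignored):
    no faulty node and no faulty channel inside `nodes`."""
    members = set(nodes)
    if members & set(faulty_nodes):
        return False
    return not any(a in members and b in members for a, b in faulty_channels)


def is_weakly_clean(C, n, f, faulty_nodes, faulty_channels):
    """True iff every node of C lies in some clean subset of C of size n - f."""
    covered = set()
    for subset in combinations(sorted(C), n - f):
        if is_clean(subset, faulty_nodes, faulty_channels):
            covered.update(subset)
    return covered == set(C)


if __name__ == "__main__":
    # n = 4, f = 1, a single faulty channel from node 0 to node 1.
    print(is_clean({0, 1, 2, 3}, [], [(0, 1)]))                 # whole set
    print(is_weakly_clean({0, 1, 2, 3}, 4, 1, [], [(0, 1)]))    # covering
```

In the example, the full node set is not clean because of the single faulty channel, yet the subsets {0, 2, 3} and {1, 2, 3} are clean and together cover all four nodes.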
, where T (k) is defined as in Corollary 4.16. Suppose the subset of nodes C ⊆ V is weakly coherent during the time interval
Proof: By the definition of weak coherency, every node in C is in some coherent set C ⊆ C of size n − f . Hence, for any such C it holds that we can cover all nodes in C by at most 1+|V \C | ≤ f +1 coherent sets C 1 , . . . , C f +1 ⊆ C. By Corollary 4.16 and the union bound, with probability at least 1 − (f + 1)/2 k(n−f ) , for each of these sets there will be at least one stabilization point during [t
where i 0 ∈ {1, . . . , f + 1} is an index for which the first maximum is attained and t i0 is the respective maximal time, i.e., t i0 is a C i0 -stabilization point. Define t i0 ∈ (t i0 +2d, t − +T (k)] to be minimal such that it is another C i0 -stabilization point. Such a time must exist by Theorem 4.4. Since the theorem also states that no node from C i0 switches to state accept during [t i0 + 2d, t i0 ) and C i ∩ C i0 ≠ ∅, there can be no C i -stabilization point during (t i0 + 2d, t i0 − 2d) for any i ∈ {1, . . . , f + 1}. Applying the theorem once more, we see that there are also no C i -stabilization points during (t i0 + 2d, t i0 + (T 2 + T 3 )/ϑ − 2d) for any i ∈ {1, . . . , f +1}. On the other hand, the maximality of t i0 implies that every C i had a stabilization point by time t i0 . Applying Theorem 4.4 to the latest stabilization point until time t i0 for each C i , we see that it must have another stabilization point before time t i0 + T 2 + T 4 + 5d. We have
i.e., all C i have stabilization points within a short time interval of (t i0 − 2d, t i0 + 2d). Arguing analogously about the previous stabilization points of the sets C i (which exist because t i0 is maximal), we infer that all C i had their previous stabilization point during (t i0 − 2d, t i0 + 2d). Now suppose t a is the minimal time in (t i0 −2d, t i0 +2d) when a node from C switches to accept and this node is in set C i for some i ∈ {1, . . . , f +1}. As usual, there must be at least f +1 non-faulty nodes from C i in state propose at time t a and by time t a + d, all nodes from C i will be observed in either of the states propose or accept. As |C i ∩ C j | ≥ f + 1 for any j ∈ {1, . . . , f + 1}, all nodes in C j will observe at least f + 1 nodes in states propose or accept at times in (t a , t a + 2d). We have that t a ≥ t i0 + (T 2 + T 3 )/ϑ − 2d according to Theorem 4.4. As no nodes switched to state accept during (t i0 + 2d, t a ) and none of them switch to state recover (cf. Theorem 4.4), for any j we can bound
that all nodes from C j are in one of the states ready, propose, or accept at time t a + d. Hence, they will switch from ready to propose if they still are in ready before time t a + 2d. Less than d time later, all nodes in C j will memorize all nodes in C j in state propose and therefore switch to accept if not done so yet. Since j was arbitrary, it follows that t a is a C-quasi-stabilization point. Corollary 5.3: Suppose C ⊆ V is weakly coherent dur-
(i) all nodes from C switch to accept exactly once within [t, t + 3d);
(ii) there will be a C-quasi-stabilization point t ∈ [t + (T 2 + T 3 )/ϑ, t + T 2 + T 4 + 5d) satisfying that no nodes switch to accept in the time interval [t + 3d, t ); and
(iii) the main state machine (Figure 1) of each node i ∈ C is metastability-free during [t + 4d, t + 4d).
Proof: Analogously to the proofs of Theorem 4.4 and Corollary 5.2.
We point out that one cannot get stronger results by the proposed technique. Even if there are merely f + 1 failing channels, this can, e.g., effectively render a node faulty (as it may never see n − f nodes in states propose or accept) or exclude the existence of a coherent set of size n − f (if the channels connect f + 1 disjoint pairs of nodes, there can be no subset of n − f nodes whose induced subgraph contains correct channels only). Stronger resilience to channel faults would necessitate propagating information over several hops in a fault-tolerant manner, imposing larger bounds on timeouts and weaker synchronization guarantees.
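The claim about f + 1 disjoint pairs of nodes follows from a short counting argument, spelled out here for convenience. The pair names u_j, v_j and the set S are notation introduced only for this sketch.

```latex
Let $\{u_1,v_1\},\dots,\{u_{f+1},v_{f+1}\}$ be pairwise disjoint pairs of nodes,
each joined by a faulty channel, and let $S \subseteq V$ with $|S| = n-f$.
If the subgraph induced by $S$ contained correct channels only, then for every
$j \in \{1,\dots,f+1\}$ at least one of $u_j, v_j$ would lie in $V \setminus S$.
Since the pairs are disjoint, these excluded nodes are distinct, so
\[
  |V \setminus S| \;\ge\; f+1 \;>\; f \;=\; n - |S| \;=\; |V \setminus S|,
\]
a contradiction. Hence no such $S$ exists.
```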
Combination of Corollary 5.2 and Corollary 5.3 finally yields:
Corollary 5.4: Let C ⊆ V be such that, for each i ∈ C, there is a set C i ⊆ C with |C i | = n − f , and let $E = \bigcup_{i \in C} C_i^2$. Then, for any k ∈ N, the proposed algorithm is a (C, E)-stabilizing pulse synchronization protocol with skew 3d and accuracy bounds (T 2 + T 3 )/ϑ − 3d and T 2 + T 4 + 8d stabilizing within time T (k) + T 2 + T 4 + 5d with probability at least 1 − (f + 1)/2 k(n−f ) .
Proof: Analogously to the proof of Corollary 4.16.
B. Stronger Adversary
So far, our analysis considered a fixed set C of coherent (or weakly coherent) nodes. But what happens if the set of faulty nodes is not determined upfront, but depends on the execution? Phrased differently, does the algorithm still stabilize quickly with a large probability if an adversary may "corrupt" up to f nodes, but may decide on its choices as time progresses, fully aware of what happened so far? Since we operate in a system where all operations take positive time, it might even be the case that a node fails just when it is about to perform a certain state transition, and would not have done so if the execution had proceeded differently. Due to the way we use randomization, however, this makes little difference for the stabilization properties of the algorithm.
Corollary 5.5: Suppose at every time t, an adversary has full knowledge of the state of the system up to and including time t, and it might decide on in total up to f nodes (or all channels originating from a node) becoming faulty at arbitrary times. If it picks a node at time t, it fully controls its actions after and including time t. Furthermore, it controls delays and clock drifts of non-faulty components within the system specifications, and it initializes the system in an arbitrary state at time 0. For any k ∈ N, define t k as
Then the set of all nodes that remain non-faulty until time t k reaches a quasi-stabilization point during [Ê 3 , t k ] with probability at least
Moreover, at any time t ≥Ê 3 , the set of nodes that are non-faulty at time t is coherent. Proof: The last statement of the corollary holds by definition.
We need to show that Theorem 4.10 holds for the modified time interval [Ê 3 , (k + 3)Ê 3 ] with the modified probability of at least 1−e −k(n−f )/2 . If this is the case, we can proceed as in Corollaries 5.2 and 5.3.
We start to track the execution from time Ê 3 . Whenever a non-faulty node switches to state init at a good time, the adversary must corrupt it in order to prevent subsequent deterministic stabilization. In the proof of Theorem 4.10, we showed that for any non-faulty node, there are at least k + 1 different times during [Ê 3 , (k + 3)Ê 3 ] when it switches to init, each of which is good with probability at least 1/2, independently of the others. Since Lemma 4.9 holds for any execution where we have at most f faults, the adversary corrupting some node at time t affects the current and future trials of that node only, while the statement still holds true for the non-corrupted nodes. Thus, the probability that the adversary may prevent the system from stabilizing until time t k is upper bounded by the probability that (k + 1)(n − f ) independent and unbiased coin flips show tails at most f times. Chernoff's bound states for the random variable X counting the number of tails in this random experiment that for any δ ∈ (0, 1),
< e −δE [X] .
, we see that the probability that
as claimed.
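The coin-flip experiment in this proof is small enough to evaluate exactly. The following sketch compares the exact probability that (k + 1)(n − f) unbiased coin flips show tails at most f times with a Chernoff estimate. It assumes the textbook multiplicative lower-tail form P[X ≤ (1 − δ)E[X]] ≤ exp(−δ² E[X]/2), which may differ in its constants from the variant used above; all function names are illustrative.

```python
import math


def exact_lower_tail(trials: int, threshold: int) -> float:
    """P[X <= threshold] for X ~ Binomial(trials, 1/2), computed exactly."""
    favourable = sum(math.comb(trials, j) for j in range(threshold + 1))
    return favourable / 2 ** trials


def chernoff_lower_tail(trials: int, threshold: int) -> float:
    """Textbook bound exp(-delta^2 * E[X] / 2) on P[X <= (1 - delta) E[X]],
    with delta chosen such that (1 - delta) E[X] = threshold."""
    mean = trials / 2
    if threshold >= mean:
        return 1.0  # the bound is only informative below the mean
    delta = 1.0 - threshold / mean
    return math.exp(-delta ** 2 * mean / 2)


def adversary_success_bounds(k: int, n: int, f: int):
    """Exact tail and Chernoff estimate for (k + 1)(n - f) coin flips
    showing tails at most f times."""
    trials = (k + 1) * (n - f)
    return exact_lower_tail(trials, f), chernoff_lower_tail(trials, f)


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, adversary_success_bounds(k, n=4, f=1))
```

Both quantities decay exponentially in k(n − f); the exact tail is noticeably smaller than the Chernoff estimate for small parameters.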
C. Constant-Time Stabilization
Up to now, we considered worst-case scenarios only. In practice, it is likely that faulty nodes show not entirely arbitrary behavior. In particular, they might still be partially following the protocol, not exhibit a level of coordination that could only be achieved by a powerful central instance, or not be fully aware of non-faulty nodes' states. Moreover, it is unlikely that at the time when a majority of the nodes becomes non-faulty, all their timeouts R 2 and R 3 have been reset recently. In such settings, stabilization will be much easier and therefore be achieved in constant time with a large probability. It is difficult, however, to name simple conditions that cover most reasonable cases. Generally speaking, once (randomized) timeouts of duration R 2 or R 3 are not "messed up" at non-faulty nodes anymore, faulty channels and nodes need to collaborate in an organized manner in order to prevent stabilization for a long time period. We give a few examples in the following corollary.
Corollary 5.6: Suppose W ⊆ V , where |W | ≥ n − f , satisfies that for each i ∈ W , all (randomized) timeouts of duration R 2 or R 3 are correct during [t − , t + ], and the node is non-faulty
Moreover, channels between nodes in W are correct
(i) Nodes in V \ W switch to init at times that are independently distributed with probability density at most O(1/(R 1 n)), and channels from V \ W to W do not generate init signals on their own (or delay init signals from before t − more than R 1 time).
(ii) Channels from V \ W to W switch to init at times that are independently distributed with probability density at most O(1/(R 1 n 2 )).
(iii) Channels from V \ W to W switch to init obliviously of the history of signals originating at nodes in W and do not know the time t − .
If
] with probability at least 1 − 2 −Ω(k) .
Proof: In Theorems 4.4 and 4.15, we showed that stabilization is deterministic once a good resynchronization point occurs. The notion of coherency essentially states that at non-faulty nodes, each timeout expired at least once and has not been reset again because of incorrect observations on other non-faulty nodes' states until the set is considered coherent (cf. Lemma 4.3). Subsequently, the respective nodes are non-faulty and the channels connecting them correct. This is true by the prerequisites of the corollary, which essentially state respective conditions on timeouts R 2 and R 3 explicitly, while rephrasing the conditions for coherency for the remaining timeouts (note that R 1 is the largest timeout except for R 2 and R 3 ).
Moreover, the time span during which R 2 and R 3 behave and are observed regularly is large enough for R 3 to expire twice and additional R 2 + 3d time to pass. This accounts for the fact that in the proof of Theorem 4.10, we essentially first wait until R 3 expires once (so the adversary has no useful information on the timeout at the respective node anymore) and then consider the subsequent time(s) when it expire(s). The proof then exploits that the timeouts R 3 of non-faulty nodes will expire at roughly independently uniformly distributed points in time. Therefore, unless faulty nodes or channels interfere, the statement of the corollary holds.
Hence, we need to show that for any of the three conditions, there is not too much meddling from outside W . For Conditions (i) or (ii), we see that the probability that there are no init signals on channels from V \W to W at all for any time span of length O(R 1 ) is at least constant, regardless of the time interval considered. Regarding Condition (iii), recall that Theorem 4.10 essentially shows that whatever the strategy of the adversary, the expected number of good W-resynchronization points during a time interval (where W is coherent) is linear in the size of the interval divided by R 1 if the interval is sufficiently large. Since the adversary is oblivious of the current time in relation to t + and the state of W , the statement that for any strategy of the adversary the amortized number of good stabilization points per R 1 time is constant yields the claim of the corollary.
We remark that this observation is particularly interesting as the core routine of the algorithm is independent of the resynchronization routine after stabilization. If at some time W becomes subject to a large number of faults resulting in loss of synchronization, but the resynchronization routine still works properly, it is very likely that W will recover within O(1) time (provided R 1 ∈ O(1)). On the other hand, if the resynchronization routine fails in the sense that a majority of the nodes suffers from faulty timeouts R 2 or R 3 , or communication is faulty between too many nodes, this will not affect the core routine unless too many components related to it fail as well.
VI. THE FATAL + PROTOCOL
The synchronized pulses established by the FATAL pulse synchronization algorithm could in principle serve as the local clock signals provided to the application layer of the SoC. 21 However, just using the FATAL protocol in this way would result in a very low clock frequency: Despite the fact that the time between pulses is Θ(d) (if the timeouts are chosen accordingly) and thus asymptotically optimal, the actual clock speed would be several orders of magnitude below the upper bound resulting from [42] , due to the large implied constants. Moreover, the system model introduced in Section II assumes that delays may vary arbitrarily between 0 and d, with d also covering the fairly complex implementation of communicating the main algorithm's states (see Section VII-B). By contrast, pure wire delays of the communication channels between different nodes are much smaller and also vary within a smaller range.
This section contains an extension of FATAL, termed FATAL + , which overcomes these limitations. In a nutshell, it consists of adding a fast non-self-stabilizing, Byzantine-tolerant algorithm termed quick cycle to FATAL, which generates exactly M > 1 fast clock ticks between any two pulses at a correct node after stabilization.
The Quick Cycle Algorithm
Consider a system of n nodes, each of which runs the FATAL pulse synchronization protocol. Additionally, each node is equipped with an instance of the quick cycle state machine depicted in Figure 5 . The interface between the quick cycle algorithm and the underlying FATAL pulse synchronization protocol is by means of two signals only, one for each direction of the communication: (i) The quick cycle state machine generates the NEXT signal by which it (weakly) influences the time between two successive pulses generated by FATAL, and (ii) it observes the state of the (T + 2 , accept) signal, which signals the expiration of an additional timer added to the FATAL protocol. The timer is coupled to the state accept of FATAL, in which the pulse synchronization algorithm generates a new pulse. The signal's purpose is to enforce a consistent reset of the quick cycle state machine once FATAL has stabilized.
21 In order to establish a consistent global tick numbering (needed for establishing a global notion of time across different clock domains) of arbitrarily large bounded clocks, a self-stabilizing digital clock synchronization algorithm like the one from [25] can be employed. Implementing such algorithms in SoCs is part of our future work and thus outside the scope of this paper, however.
Essentially, the quick cycle state machine is a copy of the outer cycle of Figure 2 that is stripped down to the minimum. However, an additional mechanism is introduced in order to ensure stabilization, namely, some coupling to the accept state of the main algorithm: Whenever a pulse is generated by FATAL, we require that all nodes switch to the accept + state unless they already occupy that state. This is easily achieved by incorporating the state of the expiration signal of the additional FATAL timer (T + 2 , accept) in the guards of Figure 5 . Since pulses are synchronized up to the skew Σ of the pulse synchronization routine, it follows that all nodes switch to accept + within a time window of Σ + 2d. Subsequently, all nodes will switch to state ready + before the first one switches to propose
is sufficiently large, and the condition that f + 1 propose + signals trigger switching to propose + guarantees that all nodes switch to accept + in a tightly synchronized fashion. One element that is not depicted explicitly in Figure 5 is that nodes increase an integer cycle counter by one whenever they switch to accept + . The counter is reset to zero whenever (T + 2 , accept) expires, i.e., shortly after a pulse generated by the underlying pulse synchronization algorithm. The algorithm makes sure that, once the compound algorithm stabilized, these resets never happen when the counter holds a non-zero value. The counter operates mod M ∈ N, where M is large enough so that at least roughly T 2 + T 3 and at most (T 2 + T 4 )/ϑ time passed since the most recent pulse when it reaches M ≡ 0 again. Whenever the counter is set to 0, node i ∈ V will set its NEXT i signal to 1 and switch it back to 0 at once (thus raising the respective NEXT i memory flag of the main algorithm). Thus, by actively triggering the next pulse, we ensure that a pulse does not occur at an inconvenient point in time: When the system has stabilized, exactly M switches to accept + of the quick cycle algorithm occur between any two consecutive pulses at a correct node. As these switches occur also synchronously at different nodes, it is apparent that the quick cycle state machine in fact implements a bounded-size synchronized clock.
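The bookkeeping of the cycle counter described above can be mimicked by a few lines of code. This is a toy model under stated assumptions rather than the hardware implementation: the class and method names are hypothetical, timing and metastability are ignored, and the sentence "whenever the counter is set to 0" is read literally, so that a reset caused by the expiration of (T + 2 , accept) also produces a NEXT pulse.

```python
class CycleCounter:
    """Toy model of one node's cycle counter in the quick cycle."""

    def __init__(self, M: int):
        if M < 2:
            raise ValueError("the text assumes M > 1")
        self.M = M
        self.value = 0
        self.next_pulses = 0      # how often NEXT_i was briefly raised
        self.unclean_resets = 0   # resets that hit a non-zero counter

    def _set(self, value: int) -> None:
        self.value = value
        if value == 0:
            # NEXT_i is set to 1 and switched back to 0 at once.
            self.next_pulses += 1

    def on_accept_plus(self) -> None:
        """The node switches to accept+: count one fast clock tick mod M."""
        self._set((self.value + 1) % self.M)

    def on_t2_plus_expired(self) -> None:
        """(T_2^+, accept) expires shortly after a FATAL pulse: reset to 0."""
        if self.value != 0:
            self.unclean_resets += 1  # must not occur once stabilized
        self._set(0)


if __name__ == "__main__":
    counter = CycleCounter(M=4)
    counter.on_t2_plus_expired()          # pulse of the underlying FATAL layer
    for _ in range(4):                    # exactly M switches to accept+
        counter.on_accept_plus()
    print(counter.value, counter.next_pulses, counter.unclean_resets)
```

In the stabilized regime described in the text, the counter wraps to 0 after exactly M switches to accept + and thereby triggers the next pulse itself, so `unclean_resets` stays at zero in this model.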
To derive accurate bounds on the skew of the protocol, we need to state the involved delays more carefully.
Definition 6.1 (Refined Delay Bounds): The state of the quick cycle algorithm is communicated via separate channels S + i,j , with i, j ∈ V , whose delays vary within d 
It follows from Lemma 3.4 that it is always possible to pick appropriate values for the timeouts and M . Note, however, that choosing M ∈ ω(1) requires that T 2 +T 3 ∈ ω(1), resulting in a superlinear stabilization time. More precisely, the stabilization time of FATAL + is, given M and minimizing the timeouts under this constraint, in Θ(M n). As mentioned previously, this limitation can be overcome by employing a digital clock synchronization algorithm such as [22] .
We now prove the correctness of the FATAL + protocol. Theorem 6.2: Let W ⊆ V , where |W | ≥ n − f , and define T (k), for k ∈ N, as in Corollary 4.16. Then, for any k ∈ N, the FATAL + protocol is a (W, W 2 )-stabilizing pulse synchronization protocol (where accept + is the "pulse" state) with skew Σ + and accuracy bounds (T + max with probability at least 1 − 2 −k(n−f ) . Moreover, the cycle counters increase by exactly one mod M at each pulse, within a time window of Σ + , and both the quick cycle state machine and the cycle counters are metastability-free once the protocol stabilized and remains fault-free in W .
Proof: Assume that nodes in W ⊆ V , where |W | ≥ n − f , are non-faulty and channels between them are correct during [t − , t + ], where t
According to Corollary 4.16, with probability at least 1 − 2 −k(n−f ) , there exists a time t 0 ∈ [t − , t − + T (k)] such that all nodes in W switch to accept within [t 0 , t 0 + 2d), and they will continue to switch to accept regularly in a synchronized fashion until at least t + . For the remainder of the proof, we assume that such a time t 0 is given; from here we reason deterministically.
The skew bound is shown by induction on the k-th consecutive quick cycle pulse, where k ∈ N, generated after the stabilization time t 0 of the FATAL algorithm. Note that the time for which we are going to establish that the compound algorithm stabilizes is t 1 > t 0 ; here we denote for k ≥ 1 by t k the time when the first node from W switches to accept + for the k th time after t 0 + 3d, i.e., the beginning of the k th pulse of FATAL + that we prove correct. W.l.o.g. we assume that t + = ∞; otherwise, all statements will be satisfied until t + only (which is sufficient). To prove the theorem, we are going to show by induction
accept for the l th time after
, and (vii) if k ≥ 2, ∀i ∈ W : node i's cycle counter changes its state exactly once during [t k−1 , t k ). In particular, the protocol is a pulse synchronization protocol with the claimed bounds on skew, accuracy, and stabilization time. Proving these properties will also reveal that quick cycle is metastability-free after time t 1 .
To anchor the induction at k = 1, we need to establish Statement (i) as well as Statements (iv) and (vi) for k = 1; the remaining statements are empty for k = 1.
Recall that any node i ∈ W switches to accept during [t 0 , t 0 + 2d). Hence, during
at no node in W , (T + 2 , accept) is expired, implying that all nodes in W are in state accept + during [t 0 +3d+2d + max , t 0 + 3d+3d + max ). Note that each node will reset its cycle counter to 0 when (T + 2 , accept) expires, i.e., after having completed its transition to accept + . The above bound shows that at the minimal time after t 0 + 3d when a node in W switches to ready + , it is guaranteed that no node is observed in propose + until the minimal time t p ≥ t 0 + 3d when a node in W switches to propose + . Moreover, at any node switching to state ready + timeout (T + 2 , accept) must be expired, implying that the node may not switch from ready + to propose + due to this signal until it switches to accept again. Recall that nodes set their NEXT signals to 1 only briefly when their cycle counters are set to 0. Hence, for each such node in W , this signal is observed in state 0 from the time when (T + 2 , accept) expires until (a) at least time t M or (b) the time the node is forced by a switch to accept to set its counter to 0, whichever is earlier. Examining the main state machine, it thus can be easily verified that no node in W may switch from ready + to propose + because (T + 2 , accept) = 0 before (a) time t M or (b) time (34) is reached. We obtain:
when it is not in state accept + .
Considering that any node i ∈ W will switch to ready 
each node in W must have been observed in propose + at least once. On the other hand, as we established that nodes do not observe nodes in W in state propose + when switching to ready + at or after time t 0 + 3d before the first node in W switches to propose + , it follows that until time
nodes in W will have at most |V \W | ≤ f of their propose + flags in state 1, and their timeout T + 3 did not expire yet. Thus, by (P1), the first node in W that switches to propose + after t 0 + 3d must do so at time t p ≥ t 0 + 3d + T
Recall that t 1 is the minimal time larger than t 0 +3d when a node in W switches to state accept + . By (35) and since |W | ≥ n−f , we have that each node in W observes at least n−f nodes in propose + by time t 0 +T
+ max , and thus
Moreover, we can trivially bound
From (38) and (39) 
+ , all nodes in W memorize at least |W | ≥ n − f nodes in propose + and therefore switched to accept + . Hence, we successfully established Statement (iv) of the claim for k = 1. Statement (vi) follows for k = 1, as the cycle counters have been reset to zero at the expiration of (T + 2 , accept) and are increased upon the subsequent state transition to accept + . Note that Statements (ii), (iii), (v), and (vii) trivially hold.
We now perform the induction step from k ∈ N to k + 1. Assume that Statements (ii) to (vii) hold for all values smaller than or equal to k; Statement (i) only applies to k = 1 and was already shown. Define l := ⌊k/M⌋ ≥ 0. Thus, if we can show Statement (ii) for k + 1, we may infer that t k+1
(33)
In case l = 0, it holds that k < M and we may deduce (P1) by the same arguments as in the induction basis. In case l ≥ 1, we use Statement (v) for value k, and, by analogous arguments as in the induction basis, deduce that at no node in W , (T Since further t M l+M ≥ t k+1 by definition of l, we obtain from (P1') that no node i ∈ W will memorize NEXT i = 1 earlier than time min{t k+1 , t M l + M (T By Statement (iv) for the value k, we know that each node i ∈ W switches to accept + during [t k , t k + Σ + ). In particular, i will increase its cycle counter at the respective time, i.e., Statement (vi) for k + 1 follows at once if we establish Statement (vii) for k + 1. As Statement (iv) for the value k together with Statement (ii) for value k + 1 imply that each node switches to accept + exactly once during [t k , t k+1 ), Statement (vii) for k+1 follows, provided that we can exclude that the counter is reset to 0, due to (T + 2 , accept) expiring, at a time when it holds a non-zero value.
We now show that this never happens. By Statement (v) for value k each node i ∈ W switches to accept during
and this time is unique during [t M l , t k+1 ) due to (40) .
Because of (42) a node in W will reset its timeout
and (T + 2 , accept) will expire within
Thus, no node in W leaves state accept + after switching there for the (M l) th time after t 0 + 3d before observing that (T + 2 , accept) is reset and expires again. In particular, this shows that the counters are only reset to 0 at times when they are 0 anyway. Granted that Statement (ii) holds for k + 1, Statement (vii) for k + 1 follows.
Next, we establish Statements (ii) to (iv) for k + 1. We reason analogously to the case of k = 1, except that we have to revisit the conditions under which state accept + is left. As we have just seen, nodes switch from accept + to ready + upon T 
. By time
all nodes in W will be observed in accept + (and therefore not in propose + ), together with (P1') preventing that the first node in W that (directly) switches from ready + to propose 
This completes the induction. According to Statement (i), t 1 satisfies the claimed bound on the stabilization time. With respect to this time, Statement (iv) provides the skew bound, and combining it with Statements (ii) and (iii), respectively, yields the stated accuracy bounds. Statements (vi) and (vii) show the properties of the counters. Metastability-freedom of the state machine is trivially guaranteed by the fact that each state has a unique successor state. For the counter, we can infer metastability-freedom after stabilization from the observation made in the proof that for times t ≥ t 1 , (T 2 , accept)(t) = 0 at a non-faulty node implies that it is in state accept + with its cycle counter equal to zero. This completes the proof.
For some applications, one might require an even higher operational frequency than provided by the quick cycle state machine. It turns out that there is a simple solution to this issue.
Increasing the Frequency Further
Given any pulse synchronization protocol, one can derive clocks operating at an arbitrarily large frequency as follows. Whenever a pulse is triggered locally, the nodes start to increase a local integer counter modulo some value m ∈ N at a speed of φ ∈ R + times that of a local clock, starting from 0. Denote by T − the accuracy lower bound of the protocol and suppose that the local clock controlling the counter runs at a speed between 1 and ρ ∈ (1, ϑ], i.e., its maximum drift is ρ−1.
Once the counter reaches the value m − 1, it is halted until the next pulse. We demand that
This approach is similar to the one presented in [23] , enriched by addressing the problem of metastability.
In the context of the FATAL + protocol, we get the following result.
Corollary 6.3: Adding a counter as described above to the FATAL + protocol and concatenating the counter values of the two counters at node i ∈ V yields a bounded logical clock L i ∈ {0, . . . , mM − 1}. At any time t when the protocol has stabilized on some set W (according to Theorem 6.2), it holds for any two nodes i, j ∈ W that
Once stabilized, these clocks do not "jump", i.e., they always increase by exactly one mod mM , with at least 1/ρ time between any two consecutive "ticks". The amortized clock frequency is within the bounds m/(T
Proof: Observe that it takes at least m/(φρ) time for one of the new counters to increase from 0 to m. Since the counters are restarted at pulses, which are triggered locally at most Σ + time apart, at the time when a "fast" node arrives at the value m, a "slow" node will have increased its clock by at least m/ρ−φΣ + . According to Inequality (44), slow nodes will be able to increase their counters to m before the next pulse. The claimed bound on the clock skew and the facts that clock increases are one by one and at most every 1/ρ time follow.
The bound on the amortized clock frequency follows by considering the minimal and maximal times M iterations of the quick cycle may require.
The metastability-freedom of the clock is deduced from the metastability-freedom of the individual counters. For the new counter this is guaranteed by Inequality (44) , since the counter is always halted at 0 before it is reset due to a new quick cycle pulse.
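To illustrate how the two counters of Corollary 6.3 combine into one bounded logical clock, the following Python sketch may help. It assumes that the quick cycle counter (range 0 to M − 1) forms the high-order part and the fast counter (range 0 to m − 1) the low-order part; the function names and the example values m = 4, M = 3 are hypothetical and not taken from the implementation.

```python
def logical_clock(cycle_counter: int, fast_counter: int, m: int, M: int) -> int:
    """Concatenate the two counters into L in {0, ..., m*M - 1}.

    Assumption (not stated in this form in the text): the quick cycle
    counter is the high-order digit, the fast counter the low-order digit.
    """
    if not (0 <= cycle_counter < M and 0 <= fast_counter < m):
        raise ValueError("counter out of range")
    return cycle_counter * m + fast_counter


def tick_sequence(m: int, M: int):
    """Yield the logical clock values over M consecutive pulses.

    Within one pulse the fast counter runs from 0 to m - 1 and is then
    halted until the next pulse restarts it at 0.
    """
    for cycle in range(M):
        for fast in range(m):
            yield logical_clock(cycle, fast, m, M)


if __name__ == "__main__":
    print(list(tick_sequence(m=4, M=3)))
```

Under these assumptions, consecutive values differ by exactly one modulo mM, which is the "no jumps" property claimed for the stabilized clocks.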
We remark that in an implementation, one would probably utilize the better clock source, if available, to drive T + 1 and T + 3 as well (see footnote 26). Maximizing m with respect to Inequality (44) and choosing T + 1 + T + 3 sufficiently large will thus result in clocks whose amortized drift is arbitrarily close to ρ, the drift of the underlying local clock source.
VII. IMPLEMENTATION
In this section, we provide an overview of our FPGA prototype implementation of the FATAL + protocol. The purposes of this implementation are (i) to serve as a proof of concept, (ii) to validate the predictions of the theoretical analysis, and (iii) to form a basis for the future development of protocol variants and engineering improvements. Rather than optimizing performance, area, or power efficiency, our primary goal is hence to provide an essentially direct mapping of the algorithmic description to hardware, and to evaluate its properties in various operating scenarios.
Footnote 26: Since the new counter is started together with T + 1 , this does not incur metastability. Special handling is required for T + 3 on the M th pulse of the quick cycle, though.
Our implementation does not follow the usual design practice, for several reasons:
Asynchrony: Targeting ultra-reliable clock generation in SoCs, the implementation of FATAL + itself cannot rely on the availability of a synchronous clock. Moreover, some performance-critical guards, like the one of the transition from propose to accept in Figure 2 , are purely asynchronous and should hence not be synchronized to a local clock. Even worse, testing for activated guards synchronized to a local clock source risks generating metastability, as remote signals originate in different clock domains. On the other hand, conventional asynchronous state machines (ASM) are not well-suited for implementing Figure 2 - Figure 5 due to the possibility of choice of successor states and continuously enabled (i.e., non-alternating) guards. Our prototype therefore relies on hybrid state machines (HSM) that combine an ASM with synchronous transition state machines (TSM) that are started on demand only.
Fault tolerance: The presence of Byzantine faulty nodes forced us to abandon the classic "wait for all" paradigm traditionally used for enforcing the indication principle in asynchronous designs: Failures may easily inhibit the completion of the request/acknowledge cycles typically used for transition-based flow control. Instead, FATAL + resorts to timing constraints, established by our theoretical analysis, in conjunction with state-based communication in order to establish event ordering and synchronized executions.
Self-Stabilization: In sharp contrast to non-stabilizing algorithms, which can always assume that there is a (substantial) number of non-faulty nodes that run approximately synchronously and hence adhere to certain timing constraints, self-stabilizing algorithms cannot assume even this. Although FATAL + guarantees that non-faulty nodes will eventually execute synchronously, even when started from an arbitrary state, the violation of timing constraints and hence metastability [9] cannot be avoided during stabilization. For example, state accept in Figure 2 has two successors sleep and recover, the guards of which could become true arbitrarily close to each other in certain stabilizing scenarios. This is acceptable, though, as long as such problematic events are neither systematic nor frequent, which is ensured by the design and implementation of FATAL + (see Section VII-A).
The above requirements reveal the need for the following major building blocks:
• Concurrent HSMs, implementing the states and transitions specified in the protocol.
• Communication infrastructure between those state machines, continuously conveying the state information.
• Watchdog timers (also with random timeouts) for implementing type (1) guards.
• Threshold modules and memory flags for implementing type (2) and type (3) guards.
Obviously, all these building blocks require implementations that match the assumptions of the formal model in Section II. Apart from maintaining timing assumptions like an end-to-end communication delay bound t − τ −1 i,j (t) < d, this also includes the need to implement all stateful components in a self-stabilizing way: They must be able to eventually recover from an arbitrary erroneous internal state, including metastability, when operating in the specified environment.
Before we proceed with a description of the implementations of these components, we discuss how FATAL + deals with the threat of metastability arising from our extreme fault scenarios.
A. Metastability issues
Reducing the potential for both metastability generation and metastability propagation are important goals in the design and implementation of FATAL + . Although it is impossible to completely rule out metastability generation in the presence of Byzantine faulty nodes (which may issue signal transitions at arbitrary times) and during self-stabilization (where all nodes may be completely asynchronous), we nevertheless achieved the following properties:
(I) Guaranteed metastability-freedom in fault-free executions after stabilization.
(II) Non-faulty nodes are safeguarded against "attacks" by faulty nodes that aim at inducing metastability, in particular once the system has stabilized.
(III) Metastable upsets at non-faulty nodes are rare during stabilization, therefore delaying stabilization as little as possible.
(IV) Very small windows of vulnerability and the possibility to incorporate additional measures for decreasing the upset probability further.
The following approaches have been used in FATAL + to accomplish these goals (additional details will be given in the subsequent sections):
(I) is guaranteed by our proofs of metastability-freedom, which exploit the fact that all non-faulty nodes run approximately synchronously after stabilization. It is hence relatively straightforward to ensure, via timing constraints, that some data from remote ASMs does not change while it is used.
(II) is accomplished by several means, which make it very difficult (albeit not impossible) for a faulty node to generate/propagate metastability. Besides avoiding any explicit control flow between ASMs by communicating states only, which greatly reduces the dependency of a non-faulty receiver node on a faulty sender, several forms of logical masking of metastability are employed. One example is the combination of memory flags and threshold gates, which ensures that possibly upset memory flags are always overruled quickly by correct ones at the threshold output. A different form of logical masking occurs due to the fact that, after stabilization, all non-faulty nodes execute the outer cycle of the main state machine ( Figure 2 ) only: Since the outer cycle does not involve any type (3) guard once stabilization is achieved, any metastability originating from the (less metastability-safe) resynchronization algorithm ( Figure 4 ) and its extension (Figure 3 ) is completely masked.
To accomplish (III), the measures outlined in (II) are complemented by adding time masking using randomization: The resynchronization routine (Figure 4 ) tries to initialize recovery from arbitrary states at random, sufficiently sparse points in time. It is hence very unlikely that non-faulty nodes are kept from stabilizing due to metastable upsets. Moreover, if at the beginning of the stabilization process f' < f < n/3 nodes are faulty, up to f − f' metastable upsets can be tolerated without keeping the remaining nodes from stabilizing; the nodes that became subject to newly arising transient faults will stabilize quickly once n−f nodes established synchronization (cf. Theorem 4.17).
Finally, (IV) is achieved by implementing all building blocks that are susceptible to metastable upsets, like memory flags, in a way that minimizes the window of vulnerability. Moreover, elastic pipelines acting as metastability filters [29] or synchronizers can be added easily to further protect such elements.
B. State machine communication
According to our system model, an HSM must be able to continuously communicate its current state system-wide: It is required that every receiver is informed of the sender's current state within d time (resp. within d our implementation. 27 Since a node treats itself like any other node in type (2) and type (3) guards with thresholds, it comprises a complete receiver as described below for every node in the system (including itself).
Figure 6 shows the circuitry used for communicating the current state of the main algorithm in Figure 2 . The sender consists of a simple array of flip-flops, which drive the parallel data bus that thus continuously reflects the current state of the sender's HSM. In sharp contrast to handshake-based communication, reading at the receiver occurs without any coupling to the sender here. As argued in Section VII-A, the synchrony between non-faulty nodes established by the FATAL + protocol guarantees that the sender state data will always be stable when read after stabilization. For the stabilization phase, we cannot give such a guarantee but accept some risk of metastability. Figure 3 , init or wait in Figure 4 , and propose + or none + in Figure 5 ). Hence, every bus consists of a single wire here, and the decoder in the receiver becomes trivial.
The receiver is a simple combinational decoder built from AND gates, which generates a 1-out-of-m encoding from the binary representation of the state communicated via the data bus. Each decoded signal corresponds to a single sender state. This information is directly used for type (3) guards, and fed into memory flags for type (2) guards. Every memory flag is just an SR-latch with dominant reset, whose functional equivalents are also included in Figure 6 . Note that a memory flag is set depending on the state communicated by the sender, but (dominantly) cleared under the receiver's control.
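The receiver path described above can be mimicked in software. The following Python sketch is a purely behavioral model, not the circuit of Figure 6: the names `decode_state` and `MemoryFlag`, the MSB-first bit order, and the discrete `update` call are assumptions made for illustration.

```python
def decode_state(bus_bits, num_states):
    """1-out-of-m decoder: map a binary state code (MSB first, assumed) to one-hot.

    A code that does not correspond to any state yields an all-zero output.
    """
    index = 0
    for bit in bus_bits:
        index = (index << 1) | (1 if bit else 0)
    return [1 if i == index else 0 for i in range(num_states)]


class MemoryFlag:
    """Behavioral model of an SR latch with dominant reset."""

    def __init__(self):
        self.q = 0

    def update(self, set_in, reset_in):
        if reset_in:      # reset dominates: the receiver controls clearing
            self.q = 0
        elif set_in:      # set is driven by the decoded sender state
            self.q = 1
        return self.q     # otherwise the flag keeps its value


if __name__ == "__main__":
    one_hot = decode_state([1, 0], num_states=4)
    flag = MemoryFlag()
    flag.update(set_in=one_hot[2], reset_in=0)
    print(one_hot, flag.q)
```

The flag stays set after the sender leaves the observed state and is cleared only when the receiver asserts reset, which mirrors the division of control described in the text.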
A memory flag may become metastable when the inputs change during stabilization of its feedback loop, which can occur due to (a) input glitches and/or (b) simultaneous falling transitions on both inputs. However, for correct receivers, (a) can only occur in case of a faulty sender, and (b) is again only possible during stabilization: Once non-faulty nodes execute the outer cycle of Figure 2 , it is guaranteed that e.g. all non-faulty nodes enter accept before the first one leaves. The probability of an upset is thus very small, and could be further reduced by means of an elastic pipeline acting as metastability filter (which must be accounted for in the delay bounds).
The most straightforward implementation of the threshold modules used for generating the ≥ f + 1 and ≥ n − f thresholds in type (2) and type (3) guards is a simple sum-of-products network, which just builds the OR of all AND combinations of f + 1 resp. n − f inputs. In our FPGA implementation, a threshold module is built by means of lookup-tables (LUT); some dedicated experiments confirmed that they work glitch-free for monotonic inputs (as provided by the memory flags).
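The sum-of-products construction can be checked against a plain counting definition of the threshold. The Python sketch below is a functional model only (it says nothing about glitches or LUT mapping); the function names and the choice n = 4, f = 1 in the self-check are illustrative assumptions.

```python
from itertools import combinations, product


def threshold_sop(inputs, k):
    """Sum-of-products threshold: OR over all AND terms of k inputs."""
    return int(any(all(term) for term in combinations(inputs, k)))


def threshold_count(inputs, k):
    """Reference definition: at least k inputs are 1."""
    return int(sum(inputs) >= k)


def check_equivalence(n, f):
    """Exhaustively compare both definitions for thresholds f + 1 and n - f."""
    for bits in product((0, 1), repeat=n):
        for k in (f + 1, n - f):
            if threshold_sop(bits, k) != threshold_count(bits, k):
                return False
    return True


if __name__ == "__main__":
    print(check_equivalence(n=4, f=1))
```

The number of AND terms is the binomial coefficient of n over k, so this direct form is only practical for small n, which is consistent with the later remark that sorting networks would be preferable asymptotically.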
C. Hybrid state machines
Our prototype implementation of FATAL + relies on hybrid state machines (HSM): An ASM is used for determining, by asynchronously evaluating the guards, the points in time when a state transition shall occur. Our ASMs have been built by deriving a state transition graph (STG) specification directly from Figures 2-5 (see footnote 28) and generating the delay-insensitive implementation via Petrify [32] . The actual state transition is then performed by a TSM. The TSM is driven by a pausible clock (see Section VII-D), which is started dynamically by the ASM before the transition. Note that this avoids the need for synchronization with a free-running clock and hence preserves the ASM's continuous time scale.
Footnote 28: Note that the STG specification had to be extended slightly in order to transform our possibly non-alternating guards (which might be continuously enabled in some cycle, in particular during stabilization) into strictly alternating ones.
The TSM works as follows (see Figure 7) : Assume that the ASM is in state A, and that the guard G for the transition from A to B becomes true. In the absence of an inhibit signal (indicating that another transition is currently being taken, see below), the TSM clock is started. With every rising edge of TSMClock, the TSM unconditionally moves through a sequence of three states: synchronize (Syn), commit (Cmt), and terminate (Trm) shown in the rectangular box in Figure 7 . In Syn, the inhibit signal is activated to prevent other choices from being executed in case of more than one guard becoming true. Whereas any ambiguity can easily be resolved via some priority rule, metastability due to (a) enabled guards that become immediately disabled again or (b) new guards that are enabled close to transition time cannot be ruled out in general here. However, as argued in Section VII-A, (a) could only do harm to FATAL + during stabilization, due to type (3) guards; recall that type (1) and type (2) guards are always monotonic, with the reset (of watchdog timers and memory flags) being under the control of the local state machine. Similarly, our proofs reveal that upsets due to (b) are fully masked after stabilization. Thus, after stabilization, metastability of the TSM can only occur due to unstable inputs, i.e., upsets in memory flags. Given the small window of vulnerability of the synchronizing stage for Syn, the resulting very low probability of a metastable upset is considered acceptable.
Once the TSM has reached Syn, it has decided to actually take the transition to B and hence moves on to state Cmt. Here the watchdog timer associated with B and possibly 
D. Pausible oscillator
The TSM clock is an asynchronously startable and synchronously stoppable ring oscillator, which provides a clock signal TSMClock that is LOW when the clock is stopped via an input signal TSMCStop. Note that copies of this oscillator are used for driving the watchdog timers presented in Section VII-E.
The operation of the TSM clock circuit shown in Figure 8 is straightforward: In its initial state, TSMCStop=HIGH and the Muller C-gate has HIGH at its output, such that TSMClock=LOW. Note that the circuit also stabilizes to this state if the Muller C-gate was erroneously initialized to LOW, as the ring oscillator would eventually generate TSMClock=HIGH, enforcing the correct initial value HIGH of the C-gate.
When the ASM requests a state transition, at some arbitrary time when a transition guard became true, it just sets TSMCStop=LOW. This starts the TSM clock and produces the first rising edge of TSMClock half a clock cycle time later. As long as TSMCStop remains LOW, the ring oscillator runs freely.
The frequency of the ring oscillator is primarily determined by the (odd) number of inverters in the feedback loop. It varies heavily with the operating conditions, in particular with supply voltage and temperature: The resulting (two-sided) clock drift ξ is typically in the range of 7% . . . 9% for uncompensated ring oscillators like ours; in ASICs, it could be lowered down to 1% . . . 2% by special compensation techniques [44] . Note that the two-sided clock drifts map to ϑ = (1 + ξ)/(1 − ξ) bounds of 1.15 . . . 1.19 and 1.02 . . . 1.04, respectively.
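The mapping from a two-sided drift ξ to the one-sided bound ϑ can be evaluated directly. The short Python sketch below simply applies ϑ = (1 + ξ)/(1 − ξ) to the four drift values quoted above; the function name is an assumption made for illustration.

```python
def theta_from_drift(xi: float) -> float:
    """Map a two-sided clock drift xi (e.g. 0.07 for 7%) to theta = (1 + xi) / (1 - xi)."""
    if not 0.0 <= xi < 1.0:
        raise ValueError("xi must lie in [0, 1)")
    return (1.0 + xi) / (1.0 - xi)


if __name__ == "__main__":
    for xi in (0.07, 0.09, 0.01, 0.02):
        print(f"xi = {xi:.2f} -> theta = {theta_from_drift(xi):.4f}")
```

The results agree with the quoted ranges up to rounding or truncation in the second decimal place.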
The stopping of TSMClock is regularly initiated by the TSM itself: With the rising edge of TSMClock that moves the TSM into Trm, TSMCStop is set to HIGH. Since TSMClock is also HIGH after the rising edge, the C-gate output is also forced to HIGH. Hence, after having finished the half period of this final clock cycle, the feedback loop is frozen and TSMClock remains LOW.
For metastability-free operation of the C-gate in Figure 8 , (a) the falling transition of TSMCStop must not occur simultaneously with a rising edge of TSMClock, and (b) the rising transition of TSMCStop must not occur simultaneously with the falling edge of TSMClock. (a) is guaranteed by stopping the clock in state Trm of the TSM, since the output of the C-gate is permanently forced to HIGH on this occasion; TSMClock hence cannot generate a rising transition before TSMCStop goes to LOW again. Whereas this synchronous stopping normally also ensures (b), we cannot always rule out the possibility of getting TSMCStop=HIGH close to the first rising edge of TSMClock: (b) could thus occur due to prematurely disabled type (3) guards, whose potential to create metastability in the TSM we already discussed (recall Section VII-C). Besides being a rare event, this can only do harm during stabilization, however.
E. Watchdog Timers
Recall that every ASM state, except for accept in Figure 2 , is associated with at most one watchdog timer required for type (1) guards; accept is associated with three timers (for T 1 and T 2 as well as for T + 2 in Figure 5 ). A timer is reset by the TSM when its associated state is entered.
According to Figure 9 , every watchdog timer consists of a synchronous resettable up-counter that is clocked by some oscillator, and a timeout register that holds the timeout value. A comparator raises an output signal if the counter value is greater than or equal to the register value. An SR latch with dominant reset memorizes this expired condition until the timer is re-triggered. Like the TSM, timers are driven by pausible oscillators, which are started by the TSM after resetting the timer and stopped synchronously upon timer expiration. Note that every timer (except for the multiple accept timers, which share a common oscillator that is stopped when the largest timeout expires) is provided with a dedicated oscillator in our implementation for simplicity. This not only avoids quantization errors in the continuous timing of the ASM state transitions, but is also mandatory for avoiding the potential of metastability due to timer resets colliding with the transitions of a free-running clock. In our implementation, the timer reset takes place in TSM state Cmt, while the oscillator is started in state Trm. This well ordered sequence rules out all metastability issues.
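The interplay of counter, comparator, latch, and pausible oscillator can be sketched as a small behavioral model. The Python class below is an illustration under assumptions, not a description of Figure 9: the class and method names are hypothetical, time is discretized into oscillator edges, and `reset` is assumed to be called only while the oscillator is stopped (matching the Cmt-then-Trm ordering described above).

```python
class WatchdogTimer:
    """Behavioral sketch of a watchdog timer with a pausible oscillator."""

    def __init__(self, timeout_ticks: int):
        if timeout_ticks < 1:
            raise ValueError("timeout must be at least one tick")
        self.timeout = timeout_ticks   # timeout register
        self.counter = 0               # synchronous up-counter
        self.expired = False           # SR latch output (dominant reset)
        self.running = False           # whether the pausible oscillator runs

    def reset(self):
        """Timer reset, performed in TSM state Cmt (oscillator assumed stopped)."""
        self.counter = 0
        self.expired = False

    def start(self):
        """Oscillator start, performed in TSM state Trm."""
        self.running = True

    def clock_edge(self):
        """One rising oscillator edge; a stopped oscillator produces no effect."""
        if self.running:
            self.counter += 1
            if self.counter >= self.timeout:   # comparator: counter >= register
                self.expired = True            # latch memorizes expiration
                self.running = False           # oscillator stopped synchronously
        return self.expired


if __name__ == "__main__":
    timer = WatchdogTimer(timeout_ticks=3)
    timer.reset()
    timer.start()
    print([timer.clock_edge() for _ in range(4)])
```

Because reset and start are separate steps, the model never resets the counter while edges are being generated, which is the ordering the text relies on to exclude metastability.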
As for the watchdog timer with random timeout R 3 in Figure 4 , our implementation uses a linear feedback shift register (LFSR) clocked by a dedicated oscillator: A uniformly distributed random value, sampled from the LFSR, is loaded into the timeout register whenever the watchdog timer is re-triggered. Note that for many settings, it is reasonable to assume that the new random value remains a secret until the timeout expires, as it is not read or in any other way considered by the node until then. As our prototype implementation is not meant for studying security issues, this simple implementation is thus sufficient.
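For readers unfamiliar with LFSRs, the following Python sketch shows one common variant. The register width (16 bit), the tap positions (16, 14, 13, 11), the seed, and the way a timeout value is sampled are all assumptions chosen for illustration; the text does not specify the LFSR used in the prototype.

```python
class Lfsr16:
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (assumed, maximal length)."""

    def __init__(self, seed: int = 0xACE1):
        if seed & 0xFFFF == 0:
            raise ValueError("the all-zero state is a fixed point and must be avoided")
        self.state = seed & 0xFFFF

    def step(self) -> int:
        """Advance by one clock edge of the dedicated oscillator."""
        s = self.state
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        self.state = (s >> 1) | (bit << 15)
        return self.state

    def sample_timeout(self, bits: int = 8) -> int:
        """Value that would be loaded into the timeout register on re-trigger.

        Taking the low `bits` bits is a hypothetical sampling rule.
        """
        return self.state & ((1 << bits) - 1)


def period(seed: int = 0xACE1) -> int:
    """Count the steps until the LFSR returns to its seed state."""
    lfsr = Lfsr16(seed)
    count = 1
    while lfsr.step() != seed:
        count += 1
    return count


if __name__ == "__main__":
    print(period())
```

Since the LFSR is clocked independently of the timer, the value sampled at a re-trigger depends on how long the oscillator has been running, which is what makes the timeout appear random to an observer without access to the register.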
VIII. EXPERIMENTAL EVALUATION
Our prototype implementation has been written in VHDL and compiled for an Altera Cyclone IV FPGA using the Quartus tool.
Apart from standard functional and timing verification via Modelsim, we conducted some preliminary experiments for verifying the assumed properties (glitch-freeness, monotonicity, etc.) of the synthesized implementations of our core building blocks: Since FPGAs do not natively provide the basic elements required for asynchronous designs, and we have no control over the actual mapping of functions to the available LUTs (e.g. our threshold modules are implemented via LUT instead of the intended combinational AND-OR networks), we had to make sure that properties that hold naturally in "real" asynchronous implementations also hold here. Backed up by the (positive) results of these experiments, a complete system consisting of n = 4 resp. n = 8 nodes (tolerating at most f = 1 resp. f = 2 Byzantine faulty nodes) has been built and verified to work as expected; overall, they consume 23000 resp. 55000 logic blocks. Note, however, that both designs include the test environment, which makes up a significant part of the designs.
To facilitate systematic experiments, we also developed a custom test bench that provides the following functionality:
(1) Measurement of pulse frequency and skew at different nodes. (6) Varying the communication delay between any pair of sender and receiver, at arbitrary times.
All these experiments can be done with and without up to f (actually, f + 1 to also include excessively many) Byzantine nodes. To this end, the HSMs of at most f + 1 = 3 nodes can be replaced by special devices that can communicate (possibly inconsistently), via the communication data buses, any HSM state to any receiver HSM at any time.
(1) is accomplished using standard measurement equipment (logic analyzer, oscilloscope, frequency counter) attached to the appropriate signals routed via output pins. (2) is implemented by memorizing any event where more than one guard is enabled when the TSM performs its first state transition, in a flag that can be externally monitored.
(3) is realized by adding a scan-chain to the implementation, which allows arbitrary initial system states to be shifted in serially at run-time. Repeated random experiments are controlled via a Python script executed on a PC workstation, which is connected via USB to an ATMega 16 microcontroller (uC) that acts as a scan-controller towards the FPGA: The uC takes a bit-stream representing an initial configuration, sends it to the FPGA via the serial scan-chain interface, and signals the FPGA to start execution of FATAL + . When the system has stabilized, the uC informs the Python script, which records the stabilization time and proceeds with sending the next initial configuration.
To enable (4)-(6), the testbench provides a global high-resolution clock that can be used for triggering mode changes. To ensure its synchrony w.r.t. the various node clocks, all start/stoppable ring oscillators are replaced by start/stoppable oscillators that derive their output from the global clock signal. (4) is achieved by just forcing a node to reset to its initial state for this run at any time during the current execution. In order to facilitate (5), dividers combined with clock multipliers (PLLs) are used: For any oscillator, it is possible to choose one of five different frequencies (0, excessively slow, slow, fast, excessively fast) at any time. For (6), a variable delay line implemented as a synchronous shift register of length X ∈ [0, 15], driven by the global clock, can be inserted individually in any data bus connecting different HSMs.
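The variable delay line used for (6) behaves like a FIFO that is advanced once per global clock edge. The Python sketch below models this behavior; the class name, the initial fill value 0, and the pass-through treatment of length 0 are assumptions for illustration.

```python
from collections import deque


class DelayLine:
    """Synchronous shift register of length X in [0, 15], one stage per clock edge."""

    def __init__(self, length: int, initial=0):
        if not 0 <= length <= 15:
            raise ValueError("length must lie in [0, 15]")
        self.length = length
        self.stages = deque([initial] * length)

    def clock(self, value):
        """Shift in `value` on a global clock edge and return the delayed output."""
        if self.length == 0:
            return value               # no stage inserted: pass-through
        out = self.stages.popleft()    # oldest value leaves the register
        self.stages.append(value)      # newest value enters
        return out


if __name__ == "__main__":
    line = DelayLine(length=3)
    print([line.clock(v) for v in (1, 2, 3, 4, 5)])
```

A value shifted in at one clock edge appears at the output X edges later, so the inserted delay is X periods of the global clock in this model.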
In order to exercise also complex test scenarios in a reproducible way, a dedicated testbed execution state machine (TESM), driven by the global clock, is used to control the times and nodes when and where clock speeds, transmission delays and communicated fault states are changed and when a single node is reset throughout an execution of the system. Transition guards may involve global time and any combinatorial expression of signals used in the implementation of FATAL + , i.e., any predicate on the current system state.
Footnote 31: To decrease the experiment setup time (after all, changing the TESM requires recompilation of the entire system), the TESM is gradually changed to also incorporate additional parameters and configuration information downloaded at run-time via the uC.
Using our testbench, it was not too difficult to get FATAL + up and running. As expected, our experiments exposed several hidden design errors, but also some errors (like a missing factor of ϑ in one of our timeouts due to a typo) in the initial version of our theoretical analysis, which caused deviations of the measured from the predicted performance.
Finally, using the implementation parameters ϑ = 1.3,
, where T is the experimental clock period T = 400ns, and minimal timeouts according to the constraints, we conducted the following experiments, observing the behavior of both FATAL + and the underlying FATAL system:
(A) Maximum skew scenarios, including effects of excessively slow/fast clocks and message delays: The experimental results confirmed the analytic predictions as being tight. As shown in Figure 10 , pulses of the 8 node FATAL resp. FATAL + system occur at a frequency of about 62Hz resp. 10kHz. Note that the quite low values for the frequency stem from the fact that we intentionally slowed down the system in order to carry out our worst-case experiments.
The figure further clearly demonstrates the capability of FATAL + to generate pulses with significantly less skew (1µs) on top of the FATAL pulses.
Further experiments, involving f = 2 Byzantine nodes, were used to produce a worst-case scenario for the FATAL skew (6µs). The resulting waveform is depicted in Figure 11 .
(B) Scenarios leading to the potential of non-deterministic HSM state transitions in the absence of Byzantine nodes (which would invalidate our proof of metastability-freedom if happening after stabilization): We ran 17000 experiments, in each of which the 8 node system was set up with randomly chosen message delays between nodes and random clock speeds and stabilized from random initial states. Within 10 seconds after stabilization, not a single upset was encountered in any instance.
(C) Stabilization of an 8-node system from random initial states, with randomly chosen clock speeds and message delays (without Byzantine nodes). Over 4000 runs were performed. A considerable fraction of the setups stabilize in less than 0.035 seconds, which can be credited to the fast stabilization mechanism intended for individual nodes resynchronizing to a running system (see Figure 12 and Figure 13). The remaining runs stabilize, supported by the resynchronization routine, in less than 10 seconds, which is less than the system's upper bound on T(1). Note that the stabilization time is inversely proportional to the frequency, i.e., in a system that is not slowed down, stabilization is orders of magnitude faster.
IX. CONCLUSIONS
We conclude with a few considerations regarding the asymptotic complexity of implementations of FATAL+ and future work. The algorithm has the favorable property that nodes broadcast a constant number of bits in constant time, which clearly is optimal. While it would be beneficial to reduce node degrees, this must come at the price of reducing the resilience to faults [19] , [20] . In terms of the number of Byzantine faults the algorithm can sustain in relation to node degrees, the algorithm is asymptotically optimal as well. It is subject to future work to extend the algorithm to be applicable to networks of lower degree in a way preserving resilience to a (local) number of faults that is optimal in terms of connectivity.
Furthermore, it is not difficult to see that, except for the threshold modules, each node comprises a number of basic components that is linear in n (cf. [37], where similar building blocks were used). In an ASIC implementation, one could implement the threshold modules by sorting networks, resulting in a latency of O(log n) and a gate complexity of O(n log n) [45]. Clearly, it is necessary to have conditions involving more than f nodes in order to overcome f Byzantine faults. Hence, assuming constant fan-in of the gates, both the current and envisioned solutions are asymptotically optimal with respect to latency. Optimality of an implementation relying on sorting networks with respect to gate complexity is not immediate; however, there is at most a logarithmic gap to the trivial lower bound of Ω(n).
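The idea of realizing a threshold module by a sorting network can be illustrated in software. The sketch below is a simulation under stated assumptions, not the construction referenced by [45]: it substitutes Batcher's odd-even merge sort, which is simple to write down but has depth O(log² n) and size O(n log² n), and therefore does not attain the asymptotic bounds quoted above. Each comparator on binary inputs is an AND gate (minimum) paired with an OR gate (maximum); after sorting in ascending order, wire n − k carries 1 exactly if at least k inputs are 1. The names `batcher_network` and `threshold`, the restriction to n being a power of two, and the example threshold k = f + 1 are illustrative choices.

```python
from itertools import product


def _merge(lo, hi, r):
    """Comparators of Batcher's odd-even merge on wires lo..hi (inclusive)."""
    step = 2 * r
    if step < hi - lo:
        yield from _merge(lo, hi, step)
        yield from _merge(lo + r, hi, step)
        for i in range(lo + r, hi - r, step):
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def _sort(lo, hi):
    """Comparators sorting wires lo..hi (inclusive), recursively."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from _sort(lo, mid)
        yield from _sort(mid + 1, hi)
        yield from _merge(lo, hi, 1)


def batcher_network(n):
    """Comparator list (i, j), i < j, for n wires; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return list(_sort(0, n - 1))


def threshold(bits, k):
    """Return 1 if at least k of the binary inputs are 1, else 0."""
    wires = list(bits)
    n = len(wires)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    for i, j in batcher_network(n):
        a, b = wires[i], wires[j]
        wires[i], wires[j] = a & b, a | b  # AND = min, OR = max
    return wires[n - k]  # ascending order: 1 iff at least k ones


if __name__ == "__main__":
    n, f = 8, 2  # system size and fault count used in the experiments
    k = f + 1    # hypothetical threshold: "more than f nodes"
    ok = all(threshold(bits, k) == int(sum(bits) >= k)
             for bits in product((0, 1), repeat=n))
    print("comparators:", len(batcher_network(n)))
    print("threshold matches counting on all inputs:", ok)
```

Because the comparator sequence is fixed and independent of the data, such a network maps directly to a combinational circuit, which is what makes sorting networks attractive for hardware threshold modules.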
