Abstract
Classic distributed computing abstractions do not match well the reality of digital logic gates, which are the elementary building blocks of Systems-on-Chip (SoCs) and other Very Large Scale Integrated (VLSI) circuits: Massively concurrent, continuous computations undermine the concept of sequential processes executing sequences of atomic zero-time computing steps, and very limited computational resources at gate level make even simple operations prohibitively costly. In this paper, we introduce a modeling and analysis framework based on continuous computations and zero-bit message channels, and employ this framework for the correctness and performance analysis of a distributed fault-tolerant clocking approach for SoCs. Starting out from a "classic" distributed Byzantine fault-tolerant tick generation algorithm, we show how to adapt it for direct implementation in clockless digital logic, rigorously prove its correctness, and derive analytic expressions for worst-case performance metrics like synchronization precision and clock frequency. Both the algorithm's correctness and the achievable synchronization precision depend solely on the ratio of certain path delays, rather than on absolute delay values. Since these ratios can be mapped directly to placement & routing constraints, there is typically no need
This work originates in our DARTS project, which has been a joint effort of Vienna University of Technology and RUAG Space, see http://ti.tuwien.ac.at/darts for details. It has been supported by the Austrian bm:vit FIT-IT project DARTS (809456-SCK/SAI) and the Austrian FWF projects Theta (P17757), PSRTS (P20529) and FATAL (P21694).
Introduction
Shrinking feature sizes and increasing clock speeds are the most visible signs of the tremendous advances in VLSI design, which will accommodate billions of transistors on a single chip in the near future [34]. This comes at the price of increased system-level complexity, however: With today's deep submicron technology with GHz clock speeds, wiring delays dominate transistor switching delays, and electrical signals cannot traverse the whole chip within a single clock cycle any more. Consequently, modern VLSI chips can no longer be viewed as monolithic blocks of synchronous hardware, where all the chip's state-holding gates simultaneously perform state transitions: The engineering of the clock tree, which disseminates the high-speed clock (typically generated by a quartz oscillator in conjunction with a clock multiplier) throughout the whole chip with zero skew, is becoming more and more difficult [6, 22, 52, 64] and ineffective. Large VLSI chips are hence nowadays increasingly being considered as more or less loosely coupled systems of interacting subsystems, which has led to the advent of Systems-on-Chip (SoC) and Networks-on-Chip (NoC).
Moreover, the smaller feature sizes and the reduced voltage swing needed for high clock speeds and low power consumption dramatically increase the adverse effects of upsets due to α-particle or neutron hits [4, 15, 29, 36, 57, 73] , crosstalk and ground bouncing [50] . The resulting increase of the transient failure rate (soft-error rate) [46] and crosstalk sensitivity [59] has hence raised concerns about the dependability of future generation VLSI chips [11] . Mitigation techniques exist at different levels of abstraction, including replication of a chip's functional units at system-level. Since in synchronous hardware designs the oscillator and its clock tree make up a non-negligible single point-of-failure [70] , consequent fault-tolerant designs, too, comprise SoCs and NoCs of interacting functional units that independently perform state-transitions.
Modern SoCs hence have much in common with the loosely coupled distributed systems that have been studied by the fault-tolerant distributed algorithms community for decades. Recent work, e.g., on scheduling of memory requests [54], transactional memory in multicores [19], and self-stabilizing microprocessors [13], as well as our own work introduced below, confirms that it is indeed possible to utilize distributed algorithms research in the VLSI context. Conversely, results and methods from VLSI-related research, such as error-correcting codes, have proved useful, e.g., for tolerating Byzantine adversaries [14], and blend nicely with distributed algorithms research on fault-tolerant consensus [23], for example. Attempts to systematically bridge the gap between distributed algorithms and VLSI design are lacking, however [9, 67]; our paper makes a first step in this direction.
This work originates in our DARTS project, which is devoted to fault-tolerant clock generation in SoCs: As the zero-skew clock requirement had to be dropped in modern VLSI circuits anyway, replacing the classic centralized clock generation approach (an obvious single point of failure) by a distributed solution becomes a feasible option. Like standard GALS (globally asynchronous, locally synchronous) architectures [8], DARTS replaces the traditional common clock source by multiple clock sources. In sharp contrast to GALS, however, which employs multiple unsynchronized clock devices (typically quartz oscillators), DARTS utilizes a Byzantine fault-tolerant distributed tick generation algorithm, which also guarantees some bounded clock skew between the different clock sources and their clock domains. Such multi-synchronous [72, 78] GALS systems are beneficial from a designer's point of view, since they combine the convenient local synchrony with a global time base across the whole chip. It has been shown in [61] that these properties even facilitate metastability-free high-speed communication across clock domains via bounded-size buffers.
The DARTS clocking scheme has been implemented both in an FPGA [20] and in a custom radiation-hardened ASIC [24, 27] , which proves that the approach is feasible in practice and indeed works very well. Although the implementation complexity of DARTS is definitely not negligible, it must be considered as the price for a fault-tolerant clocking system. Moreover, it is not clear how an improved and fully engineered version of DARTS, which consists of standard gates and wires only, would actually compare w.r.t. area to traditional carefully balanced clock trees with their many strong clock drivers.
In our attempts to devise a rigorous correctness proof for DARTS, we realized that classic distributed computing abstractions do not match well the peculiarities of hardware implementations:
(i) Inherent fine-grained parallelism, which is caused by a large number of digital logic gates that concurrently and continuously compute their outputs based on their inputs. This undermines the abstraction of a (typically small) collection of processors that perform atomic computing steps at discrete points in time, which is common to (almost) all existing distributed computing models.
(ii) Very limited resources, which make even apparently simple operations like addition or comparison of k-bit numbers, as well as sending such numbers via messages, prohibitively costly. This is in conflict with the basic atomic operations that are typically used in existing distributed algorithms.
Contributions: This paper introduces a novel framework for modeling and analysis of distributed algorithms implemented in VLSI, which explicitly addresses the above issues. Using this framework, we present a complete and rigorous correctness proof and worst case performance analysis of DARTS. The detailed contributions of our paper are the following:
(1) We introduce a novel modeling and analysis framework based on signals, which allow modeling of continuous computations and binary message channels with delay. The framework facilitates "switching" between different (but consistent) abstractions of the same signal (e.g., state view). In sharp contrast to existing modeling frameworks capable of expressing timed executions,¹ these features make it possible to express the properties of fault-tolerant distributed algorithms designed for a direct implementation in digital logic, and to reason about their correctness and performance, in a very natural and simple way.
(2) We adapt the simple Byzantine fault-tolerant distributed tick generation algorithm introduced in [82] for a direct implementation in clockless digital logic. Major modifications concern the enforcement of certain atomic actions ("interlocking") via implicit handshaking between parts of the algorithm, and the replacement of the k-bit messages used for communication in the original algorithm by zero-bit messages, i.e., by up/down signal transitions.
(3) We prove that the adapted algorithm is correct, and derive worst-case bounds for its performance metrics like synchronization precision and minimum/maximum clock frequency. Since our system-level proof rests on fundamental properties of certain elementary building blocks only, it effectively reduces the complex problem of guaranteeing system correctness to the problem of guaranteeing the correctness of fairly simple basic blocks. Consequently, in sharp contrast to classic distributed algorithms and their correctness proofs, our low-level modeling approach leaves only a small "proof gap" with respect to the actual implementation.
The paper is organized as follows: Following an overview of related work in Sect. 2, we informally explain the original tick generation algorithm and the required modifications in Sect. 3. Section 4 introduces our modeling framework and provides the system and failure model used in the subsequent analysis, as well as the detailed specification of our hardware tick generation algorithm and its (elementary) building blocks. Section 5 provides the detailed correctness proof and the worst-case performance analysis. Some conclusions and directions of further research in Sect. 6 complete the paper. A glossary of our notation can be found in Table 1.
Related work
VLSI clock generation: There exists a huge body of work on classic fault-tolerant clock synchronization [63, 74], including hardware-assisted clock synchronization [68], where a set of free-running physical clocks are to be synchronized. In contrast to these approaches, DARTS does not compute adjustment values for synchronizing free-running (e.g., driven by a quartz oscillator) local hardware clocks, but generates clock ticks that are inherently synchronized (by means of a distributed algorithm) in a closed-loop fashion. Note that there is research on "extracting" clock synchronization from the actual communication in a distributed computation [1, 31, 58, 60] that bears some relation to our approach.
The few approaches for distributed clock generation without local clock sources we are aware of are essentially based on a (distributed) ring oscillator, which is formed by gates arranged in a feedback loop. Instead of being dictated by a quartz, the frequency of the generated clock signal is determined by the end-to-end delay of the feedback loop. In [51], a regular structure of closed loops of an odd number of inverters is used for distributed clock generation. Similarly, [17, 18] employs local tick generation cells, arranged in a two-dimensional grid, with each cell inverting its output signal when its four inputs (from the up, down, left and right neighbor) match the current clock output value. Since clock synchronization theory [12] reveals that high connectivity is required for bounded synchronization tightness in the presence of failures, however, the sparsely connected designs proposed in [17, 18, 51] are not fault-tolerant.
Modeling approaches: The theory of asynchronous (clockless) distributed systems, in the absence of failures, has been used in the VLSI community for decades [10]: Research on transition signaling [7, 71], delay-insensitivity [16, 49], micropipelines [77], etc. has established a sound basis for dealing with self-timed systems [32]. Since then, much research has been conducted on benefits and limitations of clockless circuits. There is a wealth of literature on the arbiter problem [41, 47], which is (like the latch, the inertial delay and the mutex) impossible to solve in a delay-insensitive way [3, 49]. Both arbiter-free problems [43] and a few ways to circumvent the impossibility of implementing arbiters by adding (some) timing properties [48] or order properties [76] have been thoroughly investigated.
Existing modeling approaches for clockless circuits are based on algebraic trace theory [16, 49] or Petri nets [43, 83]; such specifications are time-free. Time-augmented clockless circuits can be handled by using timed Petri nets, which assign time intervals to each transition [5, 55, 65, 84]. However, to the best of our knowledge, none of these modeling frameworks deals with failures. On the other hand, modeling frameworks developed in distributed algorithms research, like timed I/O automata [37] or TLA [42], can deal with failures, but are not tailored to the specific needs of VLSI circuits.
Fault-tolerance in VLSI: Dependability concerns have also stimulated a large body of research work devoted to fault-tolerance and fault-prevention in VLSI systems [40]. Fine-grained fault tolerance, e.g., at transistor and gate level, encoding, error detection & recovery/reconfiguration, and radiation hardening techniques are the methods of choice here, see e.g. [35, 53, 56, 79, 80] for some examples. The proposed techniques are very different from the "system-level approach" employed in fault-tolerant distributed algorithms, which we will exploit in this paper. We note, however, that those approaches are complementary, in the sense that VLSI fault-tolerance techniques can reduce component failure rates and, hence, the required system-level redundancy.
Although we are not aware of any work that deals with the fine-grain parallelism inherent in VLSI implementations, there is a sizeable body of work devoted to hardware implementations of fault-tolerant algorithms. Well-known examples are MAFT [38] , SAFEBUS [33] , GUARDS [62] and TTP [39] . However, in sharp contrast to our problem, these systems incorporate hardware assistance only. The major part of the algorithms is still implemented in conventional software and executed on general-purpose processors. As a consequence, minimizing the gate-level resource consumption implied by these algorithms has never been considered. A notable exception is [2] , however, where it has been shown that consensus can be solved with 1-bit messages.
DARTS informal overview
As shown in Fig. 1, the basic idea of DARTS is to replace the common quartz oscillator and the clock tree by a fully distributed GALS-like approach [8] [72, 78] ). It has been proved elsewhere [62] that even such loose synchrony suffices for implementing metastability-free high-speed communication between different Fus using bounded-size buffers. DARTS clocks (patented in [69]) provide a number of additional advantages, which make them particularly promising for critical applications in the aerospace domain: First of all, the approach entirely circumvents quartz oscillators, which are fairly big and sensitive devices (shock, vibration, temperature etc.), as well as the cumbersome clock tree engineering issue [6, 22, 52, 64]. It is fault-tolerant, in the sense that the correctness of the clock signals supplied by correct TG-Algs is not affected by transient and permanent failures occurring in other TG-Algs and/or in the TG-Net. DARTS clocks can also be guaranteed to indeed start operating at booting time, a feature that is difficult to ensure for oscillator-based clocking approaches in space applications. Moreover, the clocks always run at the maximum speed and adapt to the current communication delays within the TG-Algs and the TG-Net, both of which may vary with the current operating conditions, such as supply voltage and temperature. And last but not least, since different Fus are driven by slightly different clock signals, DARTS clocks alleviate EM radiation and ground bouncing problems [50] that typically plague devices using synchronous clocking.
The TG-Algs developed and analyzed in this paper derive from a simple synchronizer algorithm for the Θ-Model [44, 81, 82] introduced in [82]. The (core of this) algorithm, which is based on Srikanth and Toueg's consistent broadcasting primitive [75], is shown in Fig. 2. It assumes a message-driven system (where nodes make atomic receive-compute-send steps whenever they receive a message) of $n = 3f + 1$ nodes (= TG-Alg instances), at most $f$ of which may behave arbitrarily faulty, i.e., Byzantine. The number of nodes $n$ is equal to the number of Fus the SoC is partitioned into, and also depends on the intended number of faults $f$ the SoC should sustain. Typical DARTS systems will probably have some $f$ in the order of 1 to 4, resulting in some $n$ in the range of 4 to 13 TG-Algs. All correct nodes are connected by a reliable² point-to-point message-passing network (= TG-Net): No spurious messages are ever generated, no messages are lost or altered, and every message sent at time $t$ is received within the interval $t + [\tau^-, \tau^+]$, where $\tau^-$ and $\tau^+$ denote the (possibly unknown) lower and upper bound, respectively, on the end-to-end delay of messages exchanged between correct nodes. Let $\varepsilon = \tau^+ - \tau^-$ be the maximum uncertainty of the message delay, and $\Theta = \tau^+ / \tau^-$ the maximum delay ratio.
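The system parameters just introduced can be collected in a few lines of Python. This is only an illustrative sketch: the class name `SystemParameters` and its field names are hypothetical, and the delay ratio is written `theta` here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemParameters:
    """Parameters of the message-driven system model (names are illustrative)."""
    f: int              # maximum number of Byzantine faulty nodes
    tau_minus: float    # lower bound on the end-to-end message delay
    tau_plus: float     # upper bound on the end-to-end message delay

    @property
    def n(self) -> int:
        # Number of nodes (TG-Alg instances) assumed by the original algorithm.
        return 3 * self.f + 1

    @property
    def epsilon(self) -> float:
        # Maximum uncertainty of the message delay.
        return self.tau_plus - self.tau_minus

    @property
    def theta(self) -> float:
        # Maximum delay ratio.
        return self.tau_plus / self.tau_minus
```

For instance, `SystemParameters(f=1, tau_minus=1.0, tau_plus=1.5)` describes a system of 4 nodes with delay uncertainty 0.5 and delay ratio 1.5, while `f=4` yields the 13 nodes mentioned above.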
The idea of the algorithm is the following: Initially, every node broadcasts tick(0) in line 3. If a correct node $p$ receives $f + 1$ tick($\ell$) messages (line 5), it can be sure that at least one of those was broadcast by a correct node. Therefore, $p$ can safely catch up and send tick($k + 1$), …, tick($\ell$). If some node $p$ receives $2f + 1$ tick($k$) messages (line 7) and thus sends tick($k + 1$), one can be sure that all tick($k$) messages broadcast by correct nodes, i.e., at least $f + 1$, will be received within time $\varepsilon$ by every other correct node. Hence, every correct node will execute line 5 and send tick($k$), if it has not already done so.
In conjunction with the fact that the fastest correct node cannot send consecutive tick($k$), tick($k + 1$), … arbitrarily fast, this implies a bound on the synchronization precision: Our detailed analysis will reveal that correct nodes generate a sequence of consecutive messages tick($k$), $k \geq 1$, in a synchronized way (see Sect. 4.4): If $\#b_p(t)$ denotes the number of tick($k$) messages broadcast by node $p$ by real-time $t$ (which is identical to the value of variable $k$ at real-time $t$, cf. Fig. 2), it will turn out that $(t_2 - t_1)\,\alpha_{\min} \leq \#b_p(t_2) - \#b_p(t_1) \leq (t_2 - t_1)\,\alpha_{\max}$ for any correct node $p$ and $t_2 - t_1$ sufficiently large ("accuracy"); the constants $\alpha_{\min}$ and $\alpha_{\max}$ depend on $\tau^-$ and $\tau^+$. Moreover, every two correct nodes $p, q$ maintain $|\#b_p(t) - \#b_q(t)| \leq \pi$ for all $t \geq 0$ ("precision"), with a small constant $\pi$ that depends on the delay ratio $\Theta$ only. Note carefully that the algorithm automatically adapts to the instantaneous timing characteristics of any involved computation and communication.
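To make the two rules and the notion of precision more tangible, the following Python sketch simulates the message-driven algorithm as described in the text. It is not the authors' implementation and rests on several assumptions that are not taken from the paper: faulty nodes simply stay silent (only one of many possible Byzantine behaviors), local steps take zero time, every node also receives its own broadcasts, delays are drawn uniformly from $[\tau^-, \tau^+]$, and all function names are hypothetical.

```python
import heapq
import random


def simulate_ticks(f, tau_minus, tau_plus, max_tick, seed=0):
    """Event-driven sketch of the two tick rules; returns send_times[p][k]."""
    rng = random.Random(seed)
    n = 3 * f + 1
    correct = range(n - f)                 # assumption: the last f nodes are faulty and silent
    k = [0] * (n - f)                      # current tick number of every correct node
    senders = [dict() for _ in correct]    # senders[p][l] = nodes whose tick(l) reached p
    send_times = [dict() for _ in correct]
    queue, seq = [], 0                     # pending deliveries: (time, seq, receiver, sender, l)

    def broadcast(p, tick, now):
        nonlocal seq
        send_times[p][tick] = now
        for q in correct:                  # assumption: a node also receives its own tick
            delay = rng.uniform(tau_minus, tau_plus)
            heapq.heappush(queue, (now + delay, seq, q, p, tick))
            seq += 1

    for p in correct:
        broadcast(p, 0, 0.0)               # "line 3": every node starts with tick(0)

    while queue:
        now, _, p, sender, tick = heapq.heappop(queue)
        senders[p].setdefault(tick, set()).add(sender)
        fired = True
        while fired and k[p] < max_tick:   # one atomic receive-compute-send step
            fired = False
            ahead = [l for l, s in senders[p].items()
                     if l > k[p] and len(s) >= f + 1]
            if ahead:                      # "line 5": catch up to tick(l)
                for l in range(k[p] + 1, max(ahead) + 1):
                    broadcast(p, l, now)
                k[p] = max(ahead)
                fired = True
            elif len(senders[p].get(k[p], ())) >= 2 * f + 1:
                k[p] += 1                  # "line 7": advance by one tick
                broadcast(p, k[p], now)
                fired = True
    return send_times


def max_tick_spread(send_times):
    """Largest difference in the number of ticks sent by two nodes at any send instant."""
    instants = sorted(t for times in send_times for t in times.values())
    spread = 0
    for t in instants:
        counts = [sum(1 for s in times.values() if s <= t) for times in send_times]
        spread = max(spread, max(counts) - min(counts))
    return spread
```

Calling `simulate_ticks(f=1, tau_minus=1.0, tau_plus=1.5, max_tick=20)` and passing the result to `max_tick_spread` gives an empirical counterpart of the quantity $|\#b_p(t) - \#b_q(t)|$ for this particular benign scenario; it says nothing about the worst case, which is the subject of the formal analysis.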
Since the algorithm in Fig. 2 looks very simple, it is tempting to conclude that it is easily translated into a hardware description language: Node p's TG-Alg just needs to drive a Boolean-valued clock signal, where it outputs the k-th signal transition when the algorithm sends its tick(k) message; the TG-Net is formed by feeding all clock signals to all TG-Algs. It turns out, however, that a number of challenging issues must be solved to actually accomplish this:
• How to implement the TG-Net efficiently? The algorithm assumes a fully connected network, consisting of $n^2$ links,³ so anything beyond a single wire per link is unacceptable [26]. Moreover, for implementation simplicity and performance, the information transmitted via the TG-Net must be kept to a minimum. Ideally, and almost mandatorily, the TG-Net should just feed the emitted clock ticks, i.e., signal transitions, of every TG-Alg node to every other TG-Alg node.
• How to adapt the original algorithm for zero-bit messages? By just sending signal transitions, no information except the occurrence time can be conveyed over the TG-Net. Thus, the tick number $k$ contained in the messages of Fig. 2 must be maintained at every receiver, individually for every sender.
• How to map hardware faults to node failures? Given that the algorithm shown in Fig. 2 tolerates Byzantine failures, we are on the safe side here. Interestingly, there is evidence [30] that assuming less severe failures is inappropriate in the presence of real hardware faults: Even simple stuck-at faults could produce early and/or inconsistently perceived signal transitions, and hence cannot be modeled as a crash or omission node failure. More severe hardware faults, like delay faults or early/spurious clock transitions induced, e.g., by particle hits or crosstalk, can also easily lead to Byzantine failures.
• How to ensure atomicity of actions in a VLSI implementation? This turned out to be the most demanding challenge: All fault-tolerant distributed computing models assume that (i) messages are received, (ii) the number of received messages is checked with respect to a threshold, and (iii) possibly a new message is broadcast and variable $k$ is updated, and that all this happens in one atomic computing step. This abstraction does not apply when the algorithm is implemented via clockless digital logic gates, which concurrently and continuously compute their outputs based on their inputs. Explicit synchronization (serialization of actions/interlocking) must be introduced if two local computations must not interfere with each other.
³ Note that a bus of $n$ broadcast links that provide every TG-Alg with the messages from all other TG-Algs is in fact sufficient here.
Informal overview of the TG-Alg design problems
Taking into account the above issues, we arrived at the basic architecture of a single TG-Alg shown in Fig. 3. The major building blocks of a single TG-Alg are the $(n - 1)$ +/− counters, one for each of the $n - 1$ other TG-Algs in the system. Each such device counts the difference of (i) the number of clock ticks seen from the respective peer, and (ii) the number of clock ticks generated locally so far. Counting the difference between these two numbers is sufficient to implement the algorithm of Fig. 2: To decide when to broadcast the next message, the algorithm needs to know when there are enough remote messages tick($\ell$) for which $\ell > k$ or $\ell \geq k$. Since $\ell > k \Leftrightarrow \ell - k > 0$ and $\ell \geq k \Leftrightarrow \ell - k \geq 0$, it suffices to know when the difference $\ell - k$ is $> 0$ or $\geq 0$. Thus, the +/− counters are supposed to provide two binary status functions, $GR$ and $GEQ$, which are true when the counter's actual value is $> 0$ and $\geq 0$, respectively. In addition, a $\geq f + 1$ and a $\geq 2f + 1$ threshold circuit is required for implementing the rules in line 5 and line 7 in Fig. 2, respectively. Finally, there is a device (shown as an OR-gate in Fig. 3), which is responsible for generating every local clock tick exactly once when the $\geq f + 1$ or the $\geq 2f + 1$ threshold circuit triggers the generation of a new tick message.
Again, the above architecture is deceptively simple. The major problem encountered when trying to implement Fig. 3 in hardware is the lack of a common clock signal that could be used for a synchronous design. Obviously, such a clock signal is not available here. Hence, one has to resort to a (quasi) delay-insensitive clockless implementation [32] of the algorithm depicted in Fig. 3, which raises a number of intricate problems:
How to reconcile transition signaling and fault-tolerance? The probably most elegant paradigm for clockless logic is transition signaling, where information is conveyed exclusively via signal transitions, rather than via signal states as in conventional logic. Any reasonable delay-insensitive clockless circuit can be composed from a small set of elementary building blocks here, which includes the Muller C-Element that forms the equivalent of the logical AND of two input signals, see [7, 48, 77] for details.
The "expressive" power of transition signaling is restricted to time-free systems, however, where causality is the only meaningful relation between events. Consequently, there is no full equivalent to the conventional logical OR of two input signals: If only one of the two inputs can provide a transition, the EXOR (exclusive OR) gate can safely be used for this purpose. However, there is no meaningful transition signaling OR if both inputs could (but need not) provide a (somehow related) transition. In fact, generating an OR output transition when the first input transition arrives would destroy the causality relation of the second input transition and the generated output transition.
Unfortunately, incorporating fault-tolerance, as instantiated by our threshold gates, requires exactly this semantics: The clock signal generated by some TG-Alg must not depend causally on the clocks generated by faulty TG-Algs, since this may lead to blocking. Hence, there is no way but to "switch" from transition signaling to state signaling and vice versa to circumvent this problem.
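The limitation discussed above can be illustrated with a purely behavioral Python sketch of the two gates; it ignores all timing and electrical aspects, and the names `MullerC`, `xor_gate` and `output_trace` are hypothetical. The Muller C-Element changes its output only once both inputs agree, so it blocks if one input never provides its transition, whereas the EXOR gate produces an output transition for every input transition.

```python
class MullerC:
    """Muller C-Element: the output follows the inputs once they agree, else it holds."""

    def __init__(self, initial=0):
        self.out = initial

    def step(self, a, b):
        if a == b:          # both input transitions have arrived
            self.out = a
        return self.out     # otherwise the previous output is kept


def xor_gate(a, b):
    """EXOR gate: every single input transition toggles the output."""
    return a ^ b


def output_trace(gate, inputs):
    """Apply a gate to a sequence of (a, b) input states and collect the outputs."""
    return [gate(a, b) for a, b in inputs]


# Both inputs start at 0; input a rises first, input b follows later.
both_arrive = [(1, 0), (1, 1)]
c_element_outputs = output_trace(MullerC().step, both_arrive)   # waits for b
xor_outputs = output_trace(xor_gate, both_arrive)               # toggles twice
```

With only the first input state `(1, 0)` applied, the C-Element output stays at its initial value; this is the blocking behavior that makes a purely causal design unsuitable when the second input may stem from a faulty TG-Alg.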
How to manage a clean switch between transition signaling and state signaling? In our implementation of Fig. 3, transition signaling logic is used for processing clock ticks (+/− counters). The +/− counters, on the other hand, output signals with status $GR$ and $GEQ$. These signals are then processed by the threshold circuits, which themselves output signals with status $THGR$ (respectively $THGEQ$), signaling whether the $GR$ signals have reached the $f + 1$ threshold or not (respectively whether the $GEQ$ signals have reached the $2f + 1$ threshold or not). This information is finally fed into a circuit (depicted by the OR-gate in Fig. 3) responsible for generating exactly one state transition tick $k$ for any $k$, which is again implemented in transition logic.
The intermediate switching to status signaling performed by the +/− counters, however, bears a problem: It is not feasible to decrement all +/− counters at the same time, since it is inevitable that a local tick($k$) message is received at slightly different times by a node's +/− counters. As a result, some of the +/− counters produce a status of $GR$ and $GEQ$ based on local tick($k$) and others based on the old local tick($k - 1$), at least until they finally receive local tick($k$). This problem is solved as follows: We will see in the course of the paper that proper constraints can enforce that no three +/− counters set their ports' status based on local ticks $k$, $k - 1$ and $k - 2$ or even less at the same time. Thus it suffices to distinguish whether $GR$ and $GEQ$ are based on $k$ or on $k - 1$, which is done by simply duplicating both signals $GR$ and $GEQ$, one for status that is based on even local ticks and one for odd local ticks.
More specifically, since transitions of binary-valued clock signals must strictly alternate between low-to-high (= odd clock tick) and high-to-low (= even clock tick), it is obvious that the next tick to be generated after an even tick must be odd, and vice versa. Hence, in our solution, we just (i) duplicate the signals $GEQ$, $GR$, obtaining the signals $GEQ^e$, $GR^e$, $GEQ^o$ and $GR^o$, (ii) duplicate the two threshold circuits, (iii) duplicate the signals $THGEQ$ and $THGR$, obtaining the four signals $THGEQ^e$, $THGR^e$, $THGEQ^o$ and $THGR^o$, generated by one of the threshold circuits each, and (iv) use the rules:
• Generate an even tick $k + 1 \in \mathbb{N}_{even} := 2\mathbb{N}$ if the last tick generated was odd ($k \in \mathbb{N}_{odd} := 2\mathbb{N} + 1$) and either (a) the same or a greater number of ticks have been seen from $\geq 2f + 1$ TG-Algs ($THGEQ^o$ is true), or (b) a greater number of ticks have been seen from $\geq f + 1$ TG-Algs ($THGR^o$ is true). Note that condition (a) ensures that the odd tick $k$ has been seen from $\geq 2f + 1$ remote TG-Algs, whereas (b) guarantees that the even tick $k + 1$ has already been seen from $\geq f + 1$ ones.
• Generate an odd tick $k + 1 \in \mathbb{N}_{odd}$ if the last tick generated was even ($k \in \mathbb{N}_{even}$) and either (a) the same or a greater number of ticks have been seen from $\geq 2f + 1$ TG-Algs ($THGEQ^e$ is true), or (b) a greater number of ticks have been seen from $\geq f + 1$ TG-Algs ($THGR^e$ is true). Again, condition (a) ensures that the even tick $k$ has been seen from $\geq 2f + 1$ remote TG-Algs, whereas (b) guarantees that the odd tick $k + 1$ has already been seen from $\geq f + 1$ ones.
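The even/odd rules can be written down as a small Boolean function. The following Python sketch is a zero-delay abstraction that ignores glitches and timing; it assumes that the status based on odd local ticks is used for generating the next even tick and vice versa, and the names `threshold` and `may_generate_next_tick` as well as the list-based signal representation are hypothetical.

```python
def threshold(status_bits, minimum):
    """Threshold circuit: true if at least `minimum` of the status signals are true."""
    return sum(1 for bit in status_bits if bit) >= minimum


def may_generate_next_tick(k, f, geq_o, gr_o, geq_e, gr_e):
    """Decide whether tick k+1 may be generated after the local tick k.

    geq_o/gr_o: per-counter GEQ/GR status based on odd local ticks,
    geq_e/gr_e: per-counter GEQ/GR status based on even local ticks
    (each a list with one Boolean per remote TG-Alg).
    """
    if k % 2 == 1:
        # Last tick was odd, so the next tick is even: use the odd-based status.
        thgeq, thgr = threshold(geq_o, 2 * f + 1), threshold(gr_o, f + 1)
    else:
        # Last tick was even, so the next tick is odd: use the even-based status.
        thgeq, thgr = threshold(geq_e, 2 * f + 1), threshold(gr_e, f + 1)
    return thgeq or thgr    # condition (a) or condition (b)
```

For `f = 1`, for example, an odd last tick together with three true entries in `geq_o` satisfies condition (a), whereas two true entries in `geq_o` and a single true entry in `gr_o` satisfy neither condition.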
Glitches⁴, due to the non-simultaneous arrival of clock signal transitions from the peers, are masked by simply ignoring the output of the threshold circuits (say, $THGR^o$ and $THGEQ^o$) that generated the even tick $k$ when the next tick
⁴ A glitch is a wrong state transition.
How to implement the +/− counters? This task turned out to be the most delicate part of the hardware design work. Actually, implementing a clockless up/down counter is inherently difficult due to the fact that the up-clock ("+ port") and the down-clock ("− port") transitions are totally unrelated. They can hence occur arbitrarily close in time to each other, which usually causes metastability problems [77]. Another problem is how to correctly generate the status of $GR^o$ and $GEQ^o$ (respectively $GR^e$ and $GEQ^e$). They should truly reflect the current counter value, at least during times when they are used. We will specify the detailed properties that must be maintained by these signals in Sect. 4. Note that our correctness proof and performance analysis will only rely on these properties, i.e., they are valid for every implementation of the +/− counters that fulfills these properties.
Our particular implementation of the +/− counters consists of two elastic pipelines [77], which can be seen as shift registers/FIFO buffers for signal transitions. One of the elastic pipelines is attached to the remote clock signal ("+ port"), the other one is fed by the local clock signal ("− port"). They are fitted together at their ends via a special Diff-Gate, which removes "matching" transitions, that is, transitions representing the same tick number, as soon as they have traveled through the pipelines. The signals with status $GR^o$, $GEQ^o$, $GR^e$ and $GEQ^e$ are derived from monitoring the last few stages of both pipes. Further details are provided in Sect. 4 and in the descriptions of our implementations [20, 24, 27].
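The counting principle behind this structure can be sketched in Python as follows. The model is purely behavioral and makes simplifying assumptions that do not hold for the real circuit: the pipelines are unbounded, transitions travel through them in zero time, there is no metastability, and the even/odd duplication of the status signals is omitted. The class and method names are hypothetical.

```python
from collections import deque


class PlusMinusCounter:
    """Behavioral sketch of a +/- counter built from two FIFO pipelines and a Diff-Gate."""

    def __init__(self):
        self.plus_pipe = deque()    # tick numbers received from the remote clock ("+ port")
        self.minus_pipe = deque()   # tick numbers received from the local clock ("- port")
        self.remote_ticks = 0
        self.local_ticks = 0

    def remote_tick(self):
        self.remote_ticks += 1
        self.plus_pipe.append(self.remote_ticks)
        self._diff_gate()

    def local_tick(self):
        self.local_ticks += 1
        self.minus_pipe.append(self.local_ticks)
        self._diff_gate()

    def _diff_gate(self):
        # Remove "matching" transitions, i.e., those with the same tick number,
        # from the ends of both pipelines.
        while self.plus_pipe and self.minus_pipe:
            assert self.plus_pipe[0] == self.minus_pipe[0]
            self.plus_pipe.popleft()
            self.minus_pipe.popleft()

    @property
    def value(self):
        # Number of remote ticks seen minus number of local ticks generated.
        return len(self.plus_pipe) - len(self.minus_pipe)

    def gr(self):
        return self.value > 0       # status GR

    def geq(self):
        return self.value >= 0      # status GEQ
```

After one call to `remote_tick()`, both `gr()` and `geq()` are true; a subsequent `local_tick()` removes the matching pair, leaving only `geq()` true.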
Modeling and analysis framework
In this section, we introduce a modeling framework for fault-tolerant distributed algorithms that are implemented by means of clockless digital circuits, which is amenable to mathematical correctness proofs and worst-case performance analysis. The presented framework addresses the multitude of issues raised in the previous section: It is based on a continuous model of computation and time, and avoids the use of design elements and abstractions that are not available or too costly at the hardware implementation level. To handle the design complexity challenge at such low levels of abstraction, it also supports hierarchical modeling: At the top-level of DARTS, for example, there is the entire system, consisting of n TG-Algs interconnected via the clock signal wires making up the TG-Net; every TG-Alg can be further partitioned into several building-blocks (like the +/− counters), which are interconnected in some non-regular way. Before we formally state the framework, we give an informal overview of the main ingredients.
Our modeling framework is based on modules, which possess input and output ports. An execution of a module's ports is an assignment of a signal (which captures continuous computation over time) to each of the module's ports. A module's allowed behavior is specified by a set of executions of the module's ports. Note that modules differ from classic distributed computing abstractions like timed automata [45] primarily in that they continuously compute their outputs.
Compound modules consist of multiple sub-modules and their interconnect, which specifies how sub-module ports are connected to each other and to the module's input and output ports. The interconnect specification itself assumes zero delays; modeling non-zero interconnect delays, e.g., for real wires, requires intermediate channels: A channel is a module that possesses a single input port and a single output port, and its behavior specifies delayed FIFO delivery of input port signal transitions at the output port. Modules that are not further refined are called basic modules. Elementary basic modules are those that calculate zero-delay Boolean functions (AND, OR, …) and channels.
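A channel in the above sense can be illustrated by a small Python function that maps the transitions at the input port to delayed transitions at the output port. This is only a sketch under assumptions that are not part of the formal specification: transitions are given as a finite, time-ordered list of `(time, value)` pairs, one strictly positive delay is supplied per transition, and the function name `channel` is hypothetical.

```python
def channel(transitions, delays):
    """Deliver input transitions at the output, each after its own delay, in FIFO order.

    transitions: time-ordered list of (time, value) pairs at the input port
    delays:      one positive delay per transition
    """
    if len(transitions) != len(delays):
        raise ValueError("need exactly one delay per transition")
    output = []
    for (t, value), delay in zip(transitions, delays):
        if delay <= 0:
            raise ValueError("channel delays must be positive")
        arrival = t + delay
        # FIFO delivery: a transition must not overtake or coincide with its predecessor.
        if output and arrival <= output[-1][0]:
            raise ValueError("delays would violate FIFO delivery")
        output.append((arrival, value))
    return output
```

For example, `channel([(1.0, 1), (3.0, 0)], [0.75, 0.5])` delivers the rising transition at time 1.75 and the falling transition at time 3.5.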
Clearly, the behavior of a (non-faulty) compound module is determined by the behavior of its constituent sub-modules; the behavior of a basic module must be given a priori. Correctness proofs establish properties of the behaviors of higher-level compound modules, based on the assumptions that (1) the system and failure model hold, and (2) the implementations of (non-faulty) basic modules indeed satisfy their behavioral specifications.
Signals and zero-bit message channels
Since we target implementations using clockless circuits, our formal framework will be based on a continuous notion of real-time $t \in \mathbb{R}_0^+$. We assume that the system initialization (reset) is triggered at time t = 0; different modules may complete their reset at different times, however.
Signal: A signal S is an event trace, i.e., a set of time/value tuples. Formally,
$$S \subseteq \mathbb{R}_0^+ \times \{0, 1\},$$
where event (t, 1) ∈ S respectively (t, 0) ∈ S means that S takes on value 1 respectively 0 at time t. We require nonsimultaneity of contradicting events on a single signal, i.e.,
$$(t, x) \in S \wedge (t, y) \in S \Rightarrow x = y,$$
and assume that the initial event (0, I), with either I = 0 or I = 1, is always present in S. We also disallow alternating Zeno behavior in our event traces, i.e., we require that at most finitely many events with different values can occur in any finite time interval, cp. [45, p. 737f]. Note, however, that S may still contain arbitrarily many idempotent events. Consequently, if
denotes the prefix and suffix of S at time t, respectively, there need not be a maximum element $(v_{max}, t_{max})$, with respect to the time component, in pre(S, t) and/or no minimum element $(v_{min}, t_{min})$ in suff(S, t). However,
is well-defined. Since our modeling framework is primarily devoted to "real" systems like DARTS, we will restrict our attention to systems made up of well-formed circuits only. A well-formed circuit does not rely on zero-delay wires, branches with infinite fan-in/out, or other non-implementable assumptions. We say a signal S is well-formed if (i) it does not show alternating Zeno behavior and (ii) the function $t \mapsto$ last-val(S, t) is right-continuous. It can be shown that well-formed circuits never produce signals that are not well-formed, unless their input signals are not well-formed. Therefore, throughout this paper, we may safely assume that all signals by which the behavior of DARTS is modeled are well-formed. Well-formed signals allow for more abstract representations than just event traces, which we will now introduce.
In fact, specifying systems in terms of event traces is sometimes overly complicated. More convenient in this regard are two higher-level representations of signals: (i) status, and (ii) counting function. All three representations will be consistent, in a well-defined way, and can hence be used interchangeably.
Status: The status representation of a signal S, denoted by S, is a function
$$S : \mathbb{R}_0^+ \to \{0, 1\}$$
from real-time to its instantaneous Boolean value, defined by
$$S(t) := \text{last-val}(S, t).$$
Since S is well-formed, the resulting S(t) is obviously right-continuous.
Status functions may be composed out of already defined status functions by using arbitrary Boolean predicates, e.g.,
A := B ∧ C, with status functions B and C, is defined as A(t) := B(t) ∧ C(t).
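The status representation and its pointwise composition can be illustrated with a small sketch. It assumes a finite event trace (so that a latest event always exists) and takes the status at time t to be the value of the latest event at or before t, which is one reading consistent with right-continuity. The helper names `status` and `conj` and the example traces are hypothetical.

```python
from bisect import bisect_right

def status(trace):
    """Build a status function t -> {0, 1} from a finite event trace.

    trace: iterable of (time, value) pairs that contains an initial
    event at time 0 (assumption: finitely many events).
    """
    events = sorted(trace)
    times = [t for t, _ in events]

    def value_at(t):
        if t < 0:
            raise ValueError("status is only defined for t >= 0")
        # Latest event with time <= t, so the result is right-continuous.
        idx = bisect_right(times, t) - 1
        return events[idx][1]

    return value_at

def conj(b, c):
    """Pointwise conjunction A(t) := B(t) AND C(t) of two status functions."""
    return lambda t: b(t) & c(t)

# Hypothetical example signals.
B = status({(0, 0), (1, 1), (3, 0)})
C = status({(0, 1), (2, 0)})
A = conj(B, C)
```

Here A is 1 exactly on the interval [1, 2), where both B and C are 1.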
Counting function: Finally, a signal S can be represented by the number of non-idempotent events (excluding the initial event) that occur during (0, t], denoted as the signal's counting function #S(t). For example, if S's event trace is given by S = {(0, 0), (1, 1), (1.5, 1), (2, 0)} and I = 0, then #S(0) = 0, #S(0.5) = 0, #S(1) = 1, #S(1.5) = 1, and #S(2) = 2. Sometimes, we will also employ generalized counting functions that have an initial value other than 0; they are obtained by adding an arbitrary offset $S_0$ to the standard counting function #S(t) of S. It follows immediately from the properties of signals that $t_1 \le t_2 \Rightarrow \#S(t_1) \le \#S(t_2)$ for any counting function #S, and that #S is right-continuous at any point of discontinuity since S is well-formed.
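The example above can be reproduced with a minimal sketch that derives the counting function from a finite event trace. The function name `counting_function` is hypothetical, and the code assumes that the trace is finite and contains the initial event (0, I).

```python
def counting_function(trace):
    """Return #S as a Python function for a finite event trace.

    trace: iterable of (time, value) pairs including the initial event (0, I).
    """
    events = sorted(trace)
    current = events[0][1]           # value I of the initial event
    transition_times = []
    for t, v in events[1:]:
        if v != current:             # idempotent events repeat the current value
            transition_times.append(t)
            current = v

    def count(t):
        # Number of non-idempotent events in (0, t].
        return sum(1 for u in transition_times if u <= t)

    return count

S = {(0, 0), (1, 1), (1.5, 1), (2, 0)}
count_S = counting_function(S)
```

The idempotent event (1.5, 1) does not change the count, which is why `count_S(1.5)` equals `count_S(1)`.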
In the sequel, we will use whichever representation of a signal S is most convenient, namely S itself, its status S, or its counting function #S.
Execution: We begin with the formal definition of a system. A system is a set P of ports, whereby a port can be thought of as a measurement point on a digital chip. An execution (of ports P) is a function that assigns each p ∈ P a signal p. To avoid cluttered notation we simply write p respectively #p when we refer to the abstract signal representations p respectively #p. To specify a system's allowed executions in a convenient way, modules are introduced: a module is a triple comprising (i) a set of input ports I, (ii) a set of output ports O, and (iii) a set of allowed executions E of ports I ∪ O. It is in the specification of E that the convenience of a threefold representation of signals comes into play, and it will be used extensively in Sect. 4.3. For example, to specify a module with no input port and a single output port o that produces a constant-0 signal at o, we simply require that E comprises all executions of ports {o} that fulfill o = 0.
The system's allowed executions can thus be specified by stating a set of ports P together with a set of modules that have input/output ports in P.
Channel: A channel models a reliable FIFO channel for signal transitions with finite delay. Since signal transitions must be alternating, only the occurrence time but no data can be conveyed over a single channel ("zero-bit messages"). Formally, the semantics of a channel X is as follows: Let $X_s$ be the channel's single input port [which will be connected to an output port of a single sender module], and $X_r$ be its single output port [which will be connected to the input ports of some receiver module(s)]. Intuitively, a channel maps events at the input port occurring at some time t to (delayed) events at the output port occurring at some delivery time t', where the delay t' − t is not necessarily the same for each event if the channel has non-constant delay. Formally, we demand that there exists a continuous and strictly monotonically increasing delivery function $\delta : \mathbb{R}_0^+ \to \mathbb{R}_0^+$ for X, which maps sending time t to delivery time δ(t). Note that we will use δ(X; t) to refer to δ(t) when the corresponding channel X is not clear from the context. We assume that the channel delay is within $[\tau_X^-, \tau_X^+]$, i.e.,
$$\forall t \ge 0 : \ \tau_X^- \le \delta(t) - t \le \tau_X^+. \qquad (1)$$
From the properties of δ, it follows immediately that δ is a bijection from $\mathbb{R}_0^+$ onto $[\delta(0), \infty)$ and, for any $t_1 \le t_2$, from $[t_1, t_2]$ onto $[\delta(t_1), \delta(t_2)]$. Clearly, the inverse function $\delta^{-1}$ of δ also exists and has the same properties. In addition, we will assume that the channel output has some well-defined initial state (is initialized to) I ∈ {0, 1}, which is I = 0 if not specified otherwise. Given δ, the channel's behavior in terms of event traces is specified by two properties, namely
which ensures that before the first event is delivered from the input port to the output port, i.e., before δ(0), only one event occurs at the output port: the event which sets the channel output port to its initial value I. Secondly, we demand that
Since δ carries over the total order of the events (t, x) in $X_s$ to the events (δ(t), x) in $X_r$ [called matching events in the sequel], it follows immediately that $X_r$ is an event trace. A more abstract specification of a channel in terms of states is given by
Note that this definition is consistent in the sense that an execution fulfilling the event trace specification from above fulfills the abstract state definition. When considering the counting functions of a constant delay channel's ports, we observe that the output counting function is obtained by shifting the input counting function in time by the constant delay, say $\tau_X$, obtaining $\#X_r(t + \tau_X) = \#X_s(t)$. For non-constant delay channels, this equality does not hold in general, but has to be replaced by inequalities (Pmax) and (Pmin) of the following lemma, summarizing important channel properties:
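The shift property of a constant-delay channel can be checked on a small example. The sketch below is illustrative only: the function names are hypothetical, the traces are finite, and the input's initial value is assumed to equal the channel's initial output value I = 0, so that delivering the initial input event does not create an extra output transition.

```python
def constant_delay_channel(input_trace, delay, initial=0):
    """Output event trace of a channel with delivery function t -> t + delay."""
    if delay <= 0:
        raise ValueError("delay must be positive")
    output = {(0.0, initial)}                 # channel output starts at I
    for t, v in input_trace:
        output.add((t + delay, v))            # matching event (delta(t), v)
    return output

def count_transitions(trace, t):
    """Counting function: non-idempotent events of a finite trace in (0, t]."""
    events = sorted(trace)
    current = events[0][1]
    n = 0
    for u, v in events[1:]:
        if u > t:
            break
        if v != current:
            n += 1
            current = v
    return n

X_s = {(0, 0), (1, 1), (1.5, 1), (2, 0)}      # example input trace
TAU_X = 0.25                                  # hypothetical constant delay
X_r = constant_delay_channel(X_s, TAU_X)
```

For every sampled t, `count_transitions(X_r, t + TAU_X)` equals `count_transitions(X_s, t)`, which is the equality stated above for constant delays.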
System model
Having introduced the basics of our formal framework, we can now define our system model. A DARTS system consists of a set P of n := |P| top-level modules, where n ∈ N. The top-level modules will interchangeably be called TG-Alg or node, and are usually denoted by letters p, q etc. Every TG-Alg p in P has exactly one output port with the counting function $\#b_p(t)$, where it broadcasts its clock, and one input port per remote TG-Alg q ∈ P \ {p} with the counting function $\#r^{rem}_{p,q}(t)$, where it receives q's clock. We assume a fully connected system, i.e., from every TG-Alg p to every TG-Alg q ∈ P \ {p}, there is a channel ⟨REM, p, q⟩ with input $\#b_p(t)$, output $\#r^{rem}_{q,p}(t)$, and delay in $[\tau^-_{rem}, \tau^+_{rem}]$. Figure 5 shows the resulting outbound channels of TG-Alg p.
The following notation will be used throughout the paper: For a function f, let f(t→) be its left limit at time t, i.e., $f(t\to) := \lim_{\xi \to t} f(\xi)$, where ξ approaches t from the left. For any k ≥ 1, we say that node p sends tick k at time $t_{p,k}$ if the kth event (without counting idempotent events) occurs at $t_{p,k}$. In terms of counting functions: $t_{p,k}$ is the time for
The time when the first (respectively the last) correct node sends tick k is denoted by $t_{first,k}$ (respectively $t_{last,k}$); note that the node that is the first (respectively last) one to send tick k may be different for different k.
Failure model: Since hardware faults easily lead to Byzantine failures [30], we assume this failure semantics here: The adverse power of Byzantine failures in our context lies in the ability of a faulty node to generate wrong clock ticks (early timing failures or even spurious ticks) that are perceived inconsistently at different remote nodes. Such failures could be the consequence of manufacturing defects or electrostatic breakdown [40], particle hits [4, 57], or electromagnetic noise [50], which may affect any module in a TG-Alg. Due to different wire lengths and signal-level detection thresholds, such faults typically propagate differently to different receivers. Note that we allow faulty nodes to create even metastability [41], but we must assume that metastability cannot propagate beyond FCRs (see below); we already have some convincing evidence [25] that this is ensured by the elastic pipelines in the +/− counters with large probability.
We partition our system into multiple fault-containment regions (FCRs), i.e., sets of (sub-)modules that are potentially affected by a single fault like a particle hit and thus cannot be assumed to fail independently. More specifically, we define FCR p to consist of the single TG-Alg p together with all its outgoing channels, as depicted in Fig. 5. If FCR p is faulty, any of its (sub-)modules may behave arbitrarily (Byzantine). Since every FCR is associated with exactly one TG-Alg, we will also use these terms interchangeably.
Throughout the paper, let C be the set of correct FCRs, and F, with f := |F|, the set of faulty FCRs. Clearly P = C ∪ F and C ∩ F = ∅, i.e., C and F form a partition of P. We will prove that correct nodes behave as specified in Sect. 4.4 in the presence of up to f Byzantine faulty FCRs, provided that the total number of nodes is n ≥ 3f + 2. Note that this is slightly more than the required lower bound of n ≥ 3f + 1 for clock synchronization [12], but facilitates a considerably better precision and accuracy (attained by counting only remote messages when calculating the f + 1 respectively 2f + 1 thresholds; including self-reception would lead to $\tau^-_{rem} = \tau^-_{loc}$ in Theorem 2 and hence spoil the achievable worst-case precision). In case one does not want to add an extra node to the required 3f + 1 nodes, an alternative is to add self-reception and to artificially increase its delay, e.g., by feeding the signal through inverter chains, so that $\tau^-_{loc}$ becomes of the order of remote delays. Note that both a 3f + 2 node system without and a 3f + 1 node system with self-reception have about the same overall size for reasonably small f, since the extra digital logic needed for self-reception at 3f + 1 nodes is about the size of a node without self-reception. This allows the designer to choose between a system with a slightly larger number of nodes and a system which is slightly more resilient, without changing the overall system size. While, throughout this paper, we focus on the solution using 3f + 2 nodes without self-reception, an adaptation of the algorithm and its correctness proof to the alternative solution can be done in a straightforward manner.
Booting: We assume that the whole system is simultaneously reset at time t = 0. However, we allow the modules to
Note that the latter condition will ensure that messages sent by p are never lost at a correct node q because of late booting.
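The resilience condition n ≥ 3f + 2 and the f + 1 and 2f + 1 thresholds from the failure model can be turned into a small sizing helper. This is only a sketch: the function names are hypothetical, and the labels GR and GEQ follow the naming of the two threshold rules used in this paper.

```python
def max_tolerated_faults(n):
    """Largest f with n >= 3f + 2 (variant without self-reception)."""
    if n < 2:
        raise ValueError("need at least 2 nodes")
    return (n - 2) // 3

def thresholds(f):
    """Remote-tick thresholds: f + 1 for the GR rule, 2f + 1 for the GEQ rule."""
    return {"GR": f + 1, "GEQ": 2 * f + 1}

def correct_remote_nodes(n, f):
    """Remote nodes guaranteed to be correct, as seen from one correct node."""
    return n - 1 - f

n = 5
f = max_tolerated_faults(n)
```

With n = 3f + 2, every correct node sees n − 1 − f = 2f + 1 correct remote nodes, so the 2f + 1 threshold can be reached by correct remote nodes alone, without counting the node's own ticks.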
TG-Alg architecture and module specification
In this section, we will describe the architecture of a TG-Alg, i.e., its sub-modules and interconnect, and formally specify their behaviors. It is important to note here that the behavioral properties defined in this subsection are assumed properties, i.e., basic properties that must a priori be guaranteed by the implementation of the modules. (Modules in FCRs hit by a failure may deviate (arbitrarily) from their correct behavior, however.) Based on these basic properties, the correctness proofs provided in Sect. 5 will show that the system of TG-Algs will maintain the system-level properties (precision and accuracy) as specified in Sect. 4.4.
Figure 6 shows the general architecture of the TG-Alg p, cp. Fig. 3. It consists of one +/− counter module per remote TG-Alg (only two are depicted), four threshold modules implementing the f + 1 and 2f + 1 rules in Fig. 2, and a tick broadcast module that finally generates p's clock ticks $\#b_p(t)$. Every +/− counter is refined into several additional sub-modules: A pair of elastic pipes (remote pipe, local pipe) that form FIFO buffers for (remote, local) clock ticks, a Diff-Gate module that removes matching remote and local ticks from the pipes, and a Pipe Compare Signal Generator (PCSG) module that generates the signals' status reflecting the difference of the number of ticks present in the local and remote pipe.
We will now specify the ports and the behavior of all these modules in detail.
(i) Pairs of elastic pipes: Every TG-Alg p incorporates n − 1 pairs of elastic pipelines [77], each of which corresponds to a single remote TG-Alg q ∈ P \ {p}. We will denote the pair of pipes at p corresponding to q by (p, q) in the sequel. (p, q) consists of a remote pipeline that can store up to $S_{rem}$ clock ticks sent by q, and a local pipeline that can hold up to $S_{loc}$ clock ticks sent by p locally. Note that the numbers $S_{rem}$ and $S_{loc}$ are implementation parameters that have to be chosen in accordance with Theorems 4 and 6; in the specifications of this section, they are just assumed to be unbounded (= always sufficiently large).
(ii) Diff-Gate module: To avoid pipes with infinite capacity, each pair of pipes is equipped with a special Diff-Gate circuit that removes matching clock ticks, i.e., clock ticks contained in both pipes. However, it deletes matching ticks only if at least one tick remains in both the local and remote pipe. The Diff-Gate for (p, q) has two input ports connected to $\#r^{rem}_{p,q}(t)$ and $\#r^{loc}_{p,q}(t)$, and a single output port represented by the counting function $\#d_{p,q}(t)$, which gives the largest tick number that has been removed from both the remote and local pipe of (p, q) by time t. To allow removal of the virtual tick 0, the initial value is $\#d_{p,q}(t_{p,b}) = -1$.
Behavioral description:
Recall that virtual tick 0 shows up at the output of the local and the remote pipe at booting completion time $t_{loc,0} = t_{p,b}$.
Ticks are removed from the pipes as follows:
• k ≥ 1: If
- tick k + 1 shows up at the output $\#r^{rem}_{p,q}(t_{rem,k+1})$ of the remote pipe of (p, q) at time $t_{rem,k+1}$, and
- tick k + 1 shows up at the output $\#r^{loc}_{p,q}(t_{loc,k+1})$ of the local pipe of (p, q) at time $t_{loc,k+1}$, and
- tick k − 1 has been removed at time $t_{rmv}$
9 Actually, the pipe delays are accounted for in the delays of the real channels in the signal path, namely, ⟨REM, q, p⟩ and ⟨LOC, p, q⟩.
(t) signals when the number of remote clock ticks is greater than or equal to the number of local clock ticks, provided that the last clock tick that showed up in the local pipe was odd; $P^{GR,o}_{p,q}(t)$ does the same for "greater" replacing "greater or equal".
All these signals are fed, via dedicated channels that add some delay, to the threshold modules of the TG-Alg p.
The signals generated by the PCSG associated with (p, q) must satisfy the following properties:
Note that these signals need to be true only if the local pipes contain exactly one tick ($\#s^{loc}_{p,q}(t) = 1$), which makes it easier for an implementation to fulfill these properties.
The above signals are fed into four dedicated channels that connect the PCSG with the threshold modules, all of which are initialized to 0:
This property will be formalized below as a submodule which continuously computes a Boolean predicate [which is a function of time here], involving the sum of the status functions of a threshold module's input ports, and feeding the result into a channel.
Behavioral description: The threshold modules comprise modules with the threshold modules' input ports and outputs
• Channel
(t) for the current tick becomes active. We will refer to (i) as the disabling path and to (ii) as the enabling path in the sequel. The Tick broadcast module hence has four input ports connected to the threshold outputs, and a single output port represented by the counting function $\#b_p(t)$, which is the number of ticks broadcast by p by time t. Finally, $\#b_p(t)$ is distributed to the local pipe in (p, q) at TG-Alg p and to the remote pipe in (q, p) at TG-Alg q, for all q ∈ P \ {p}, via dedicated channels ⟨LOC, p, q⟩ and ⟨REM, p, q⟩, respectively.
Behavioral description: Let the signal $b_p$ be the set for which
The n − 1 channels, one for each q ∈ P \ {p}, ⟨LOC, p, q⟩ and ⟨REM, p, q⟩ for distributing $\#b_p(t)$ are all initialized to 0 and adhere to the following specifications:
G RtoT H G R, e, p with input T H G R
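• Local channel ⟨LOC, p, q⟩ with input signal $\#b_p(t)$, output $\#r^{loc}_{p,q}(t)$ and delay in $[\tau^-_{loc}, \tau^+_{loc}]$.
• Remote channel ⟨REM, p, q⟩ with input signal $\#b_p(t)$, output $\#r^{rem}_{q,p}(t)$ and delay in $[\tau^-_{rem}, \tau^+_{rem}]$.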
Recall that the channel delays also include the delays of the pipelines.
System-level properties
In all executions complying with the system and failure model introduced in the previous sections, the following properties must be guaranteed:
(P) Precision: (Theorem 2) There is a constant π such that, for every pair of correct nodes p, q in C and all times t: $|\#b_p(t) - \#b_q(t)| \le \pi$.
Informally, the precision requirement (P) just states that the difference of the number of clock ticks generated by any two different correct nodes is bounded, whereas the accuracy requirement (A) guarantees some relation of the progress of the clock ticks with respect to the progress of real-time. Note that (A) is also called envelope requirement in the literature, and effectively bounds the frequency of the generated clock ticks. Finally, the size requirement guarantees that the size of the pipelines remains bounded.
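As an illustration of the precision requirement, the following sketch computes the largest tick-count difference between two nodes from recorded tick sending times. It is a measurement aid under stated assumptions, not part of the proof: the function names and the example tick times are hypothetical, and each node's tick times are assumed to be given as a sorted list.

```python
from bisect import bisect_right

def ticks_by(tick_times, t):
    """Counting function #b(t): number of ticks sent at or before time t."""
    return bisect_right(tick_times, t)

def observed_precision(ticks_p, ticks_q):
    """Maximum of |#b_p(t) - #b_q(t)| over all t.

    Both counting functions are right-continuous step functions that change
    only at tick times, so it suffices to evaluate them at those times.
    """
    checkpoints = sorted(set(ticks_p) | set(ticks_q))
    return max(
        (abs(ticks_by(ticks_p, c) - ticks_by(ticks_q, c)) for c in checkpoints),
        default=0,
    )

# Hypothetical tick sending times of two correct nodes.
ticks_p = [1.0, 2.0, 3.0, 4.0]
ticks_q = [1.5, 2.2, 3.9, 4.1]
worst = observed_precision(ticks_p, ticks_q)
```

An execution satisfies (P) for a given bound π only if the value computed this way does not exceed π for every pair of correct nodes.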
In the following Sect. 5, we will show that the system of TG-Algs indeed satisfies the above properties in all executions complying with our system and failure model, provided that (a) the implementations of non-faulty basic modules specified in Sect. 4.3 indeed fulfill their specifications, and (b) the additional "global" Constraints 1-3 introduced later hold. Our Theorems 2, 3, 4 and 6 will also establish numerical values for all the constants introduced above, which only depend on the delay parameters introduced in the specifications of the TG-Alg basic modules in Sect. 4.3.
Correctness proofs
Our correctness proof and performance analysis have a layered structure, where the lemmas and theorems in a layer establish higher-level abstractions on top of the lower-level abstractions provided by the layer below:
1. The lowest proof layer (Sect. 5.1) deals with the problem of creating the abstraction of uniquely labeled tick k messages on top of anonymous up/down transitions: Provided that Constraint 1 (which bounds the relative speed ratio of all local channels) holds, the pivotal Interlocking Lemma 3 proves that ticks k − 2, k − 4, . . . are never falsely interpreted as tick k by the algorithm. The result of the Interlocking Lemma allows us to bound any correct node's maximum tick frequency (Lemma 4), and to rule out the possibility of unbounded queueing effects in the pipes (Lemma 5); the latter requires Constraint 2 to hold, which ensures that the Diff-Gate can digest ticks at the maximum clock frequency. On top of these results, it is possible to prove that both the (f + 1)-rule (GR, Lemma 6) and the (2f + 1)-rule (GEQ, Lemma 7) work as expected.
2. The intermediate proof layer (Sect. 5.2) establishes elementary synchronization properties of the ticks generated at different correct nodes, namely, Progress (P), Unforgeability (U), Quasi-Simultaneity (QS) and finally Booting Simultaneity (BS). Note that these results correspond to the synchronization properties of the classic algorithm of Fig. 2, cp. [44, 66, 75, 81, 82]. Our major Theorem 1 requires Constraint 3 to hold, which essentially guarantees that even the slowest local channel is faster than the fastest remote channel. Based on the elementary synchronization properties, it is possible to bound the progress of the fastest node (Lemmas 10-13) and the slowest node (Lemmas 15-17) in the system.
3. The top layer of our proof (Sect. 5.3) establishes our major results: Bounds for precision (Theorem 2), accuracy (Theorem 3) and pipeline sizes (Theorems 4 and 6).
Bottom proof layer
We start our detailed treatment with the technical Lemma 2, which asserts a certain persistence of the number of ticks present in the local and remote pipe for a certain time.
Lemma 2 If, for a correct node p ∈ C and a different correct node q ∈ C \ {p}, at time t it holds that
(#r
Proof See Lemma 2 in Appendix A.
We next establish a main result, namely our Interlocking Lemma 3, which states that an "old" tick k − 2, k − 4, . . . is never falsely interpreted as a "new" tick k in the GR and GEQ rules of the algorithm. This is not immediately evident: An even tick k + 1 is generated by T H G E Q The Interlocking Lemma will require that for each TG-Alg, all the delays along a path through the Threshold module, via the PCSG module and along the local loop are within about a factor 2 of each other, which is expressed formally in Constraint 1.
Constraint 1 (Interlocking Constraint). We abbreviate
Then the relation $T_{max} \le T_{min} + T_{min,dis}$ must hold.
Lemma 3 (Interlocking lemma)
If, for some correct node p and k + 1 ≥ 2, #b p (t) = k + 1, then:
k ∈ N even ⇒ ∀q ∈ Q : ∃t q ≤ t :
The proof is by induction on k + 1 ≥ 2, the number of ticks sent by node p. The lemma is first shown for tick k + 1 = 2. Then we assume that some tick k + 1 > 2 is the first tick for which the lemma does not hold. By investigating the cause which triggered the sending of this tick, we obtain a contradiction to Constraint 1. Begin (k + 1 = 2): Assume p sends tick 2 at time t p,2 . Then, by the algorithm (specification of the tick generation mod- 2 ) must have held. We consider both cases: ,2 ) , then by the algorithm (specification of the threshold modules), there must be a set Q ⊆ P \ {p}, of size |Q| ≥ 2 f + 1, such that, for time t :
Again by the algorithm (specification of the PCSG to threshold module channels), t q := δ −1 ( P G E Q toG E Q, o, p, q ; t ) defined for every q ∈ Q, we obtain
and by this
Since #r loc p,q (t q ) ≥ 0 from reset on and #r loc p,
Finally, from the channel properties, we know that
, then, by the algorithm (specification of the threshold modules), there must be a set Q ⊆ P \ {p}, of size |Q| ≥ f + 1, such that, for time t :
By the algorithm (specification of the PCSG to threshold module channels), and with t q := δ −1 ( P G R toG R, o, p, q ; t ) defined for every q ∈ Q, this implies
From the channel properties, we know that
The lemma follows, again.
Step (k + 1 ≥ 3): Assume by contradiction that k + 1 is the first tick for which the lemma does not hold. Let $t_{p,k+1}$ be the time p sends tick k + 1. Assume w.l.o.g. that $k \in \mathbb{N}_{odd}$. We will establish two delay bounds, one on the enabling path and the other on the disabling path.
Enabling path: To send tick k +1 at time t p,k+1 , by the algorithm (specification of the tick generation module), at least one of the two threshold signals must have been enabled, i.e., (t p,k+1 ) , then, by the algorithm (specification of the threshold modules), there must be a set Q ⊆ P \ {p}, of size |Q| ≥ 2 f + 1, such that, at time t :
Again by the algorithm (PCSG to threshold module channels), at t q := δ −1 ( P G E Q toG E Q, o, p, q ; t ) defined for every q ∈ Q, this implies
By the channel properties
Assuming that ∀q ∈ Q : #r loc p,q (t q ) ≥ k yields the desired result of the Lemma. Thus we only have to investigate the negation:
Since, by (13) , #r loc p,q (t q ) ∈ N odd , we obtain
Thus, tick k − 1 must be received in the local pipe of pipepair ( p, q) at time t rcv,k−1 , with
The combination with (14) yields
Let t p,k−1 be the sending time of tick k − 1. Clearly, by the local channel properties, we arrive at
Before proceeding further, we will handle the disabling path.
Disabling path: Let $t_{p,k}$ be the sending time of tick k. By the induction hypothesis, we know that the lemma holds for tick k. According to the lemma, we have to distinguish two cases (i) and (ii):
case (a.i): Assume that implication (i) is valid, i.e., there exists a set Q of size |Q | ≥ 2 f + 1, such that, for
Thus, local tick k − 1 must have been received in pipepair ( p, q ) at time t q ,rcv,k−1 , with
by Lemma 2. Furthermore, by the properties of the local channels,
From the algorithm (specification of the tick generation module), it follows that at time t p,k+1
[and also ¬ T H G R e p (t p,k+1 ), which is handled analogously, cf. (a.ii) below] must hold, i.e., the threshold signals that generated (the odd) tick k must be inactive again. Therefore, for the times t q defined as
it must hold that
Let us choose Q := Q . Because of the FIFO property of the channels and because of t p,k < t p,k+1 , we obtain
Clearly, there has to be at least one q ∈ Q for which
since otherwise Q := Q would have been a choice for Q , contradicting (21) . However, (22) can only hold if local tick k has been in pipepair ( p, q ) at time t q ,rcv,k , with
In combination with (20) and the channel properties, this implies
and, by the local channel properties,
Finally, (18) together with (23) and (24) yields
case (a.ii): Assuming that implication (ii) is valid, i.e., that there exists a set Q of size |Q | ≥ f + 1, such that, for
By analogous arguments as in case (a.i), we obtain
Combination of (a.i) and (a.ii): Combining (16), (25) and (26) leads to
which is a contradiction to Constraint 1. The lemma follows. ,k+1 ) , then, by the algorithm (specification of the threshold module), there must be a set Q ⊆ P \ {p}, of size |Q| ≥ f + 1, such that, for time t :
By analogous arguments as in case (a), we obtain an analogon to (16) as
as well as analogons to (25) and (26), namely,
and
Combination of (b.i) and (b.ii): Combining (27) , (28) and (29) leads to
which is a contradiction to Constraint 1. The lemma follows.
The next lemma establishes a minimum duration between any two successive ticks generated at a correct node, i.e., an upper bound on the clock frequency that could possibly be generated locally at any correct node.
Lemma 4 If correct node p sends tick k ≥ 1 at time t p,k , then it cannot send tick k
Proof Assume by contradiction that p sends tick k + 1 at time $t_{p,k+1}$, w.l.o.g. for $k \in \mathbb{N}_{even}$, with
We apply Lemma 3 for tick k + 1. Thus, implication (i) or (ii) has to be true: case (i): There exists at least one node q ∈ Q, such that, for some t q ≤ t p,k+1 − τ
By applying Lemma 2, we obtain
Furthermore, by the local channel property and (Pmin) in Lemma 1,
Combining (31) and (32) yields
e., tick k must have been sent at time t p,k with
contradicting (30) . The lemma follows. case (ii): Starting from (ii) in Lemma 3, the contradiction is derived analogously as for case (i). The lemma follows in both cases.
Lemma 5 together with Constraint 2 allows us to exclude the possibility of unbounded queuing effects inside a pipepair (p, q) of local/remote pipes $\#r^{loc}_{p,q}(t)$ and $\#r^{rem}_{p,q}(t)$ situated at correct node p and corresponding to correct node q. Constraint 2 ensures that the Diff-Gate can digest matching ticks at least as fast as any node can generate them.
Step (k > 1): As our induction hypothesis, assume that tick k − 2 is removed from both pipes by max{t
Constraint 2 τ
By analogous arguments as above, tick k is received in both pipes of ( p, q) by t both , defined as
Because of Lemma 4, consecutive ticks cannot be generated with less than $T_{min}$ distance in-between, i.e.,
by Constraint 2. According to the induction hypothesis, we know that tick k − 2 has already been removed by t both . By the properties of the Diff-gate, tick k − 1 is hence removed by t both + τ 
Proof Wlog. assume that k ∈ N even . We distinguish two cases:
(a) Suppose that ∃q ∈ Q : $\#r^{loc}_{p,q}(t') > k$. Since $\#r^{loc}_{p,q}(t') \le \#b_p(t')$, as no tick can be locally received if it has not been sent before, $\#b_p(t') > k$ must hold. Thus, p has sent tick k + 1 by t'. The lemma follows.
(b) Otherwise, assume that
By Lemma 4, it follows that all nodes q send tick k by t Q,k with t Q,k ≤ t Q,k+1 − T min . By applying Lemma 5, we conclude that tick k − 1 is removed from all of p's pipepairs (for all q ∈ Q) by 
By combination of (34) with (33) , it holds that for all ξ ∈ t rcv , t :
Furthermore,
Assuming ∃q ∈ Q : #d p,q (ξ ) > k − 1 implies #r loc p,q (ξ ) > k, since by definition of the Diff-gate's behavior, there must be at least one tick in the local pipe which was not deleted, thereby contradicting (35) . Thus it must hold that ∀q ∈ Q : #d p,q (ξ ) = k − 1. Therefore ∀q ∈ Q : P G R,e p,q (ξ ) is true for times ξ ∈ t rcv , t . By the algorithm (specification of the PCSG to threshold module channels), ∀q ∈ Q : G R e p,q (ξ ) is true for times ξ ∈ t rcv + τ + G R , t . Again by the algorithm (threshold module),
is true for each time ξ within t rcv + τ
It remains to be shown that the disabling path cannot inhibit the generation of tick k + 1 at p. For the sake of contradiction, assume that the disabling path can enforce t p,k+1 > t . Because of the lemma's assumption (ii), tick k must eventually be received in all of p's local pipes corresponding to correct nodes r , i.e., ∀r ∈ C \ {p} :
Since tick k + 1 is not generated by t , we actually have #r loc p,r (ξ ) = k. As k ∈ N even , this implies
Consequently, by the algorithm (specification of the PCSG to threshold module channels and the threshold modules), together with the fact that there are only up to f faulty nodes, who cannot reach a threshold themselves,
for ξ ∈ t p,k + τ (37) and (38) and noting that max{τ
is apparent that p must send tick k + 1 by t , providing the required contradiction. The lemma follows.
The lemma follows in both cases.
Analogous to Lemma 6, the next lemma states that the GEQrule does its duty. 
Proof See Lemma 7 in Appendix A.
Intermediate proof layer
Based on the results of the bottom proof layer, we can now establish elementary synchronization properties of the ticks generated at different correct nodes. The following Theorem 1 corresponds to well-known classic results on consistent broadcasting [44, 66, 75, 82], which are expressed and proved in our new modeling framework. Major differences to the existing proofs are the far lower level of abstraction at which the proofs have to be carried out, and the problems arising from queueing effects that are due to the bounded sizes of the local and remote queues and the fact that a TG-Alg cannot process and generate ticks arbitrarily fast, both of which are in contrast to the original algorithm stated in Fig. 2. For the theorem to hold, Constraint 3 must be satisfied, which essentially guarantees that even the slowest local channel is faster than the fastest remote channel.
Constraint 3 For
must hold. 
Theorem 1 (Synchronization Properties)
We will show the properties Progress (P), Unforgeability (U), Quasi-Simultaneity (QS) and Booting-Simultaneity (BS) one after the other.
Progress (P)
Proof Assume that all correct nodes C, with |C| ≥ 2 f + 2, sent tick k ≥ 1 by time t. Now focus on a correct node p ∈ C: We can apply Lemma 7 with Q = C \ {p} and t p,k = t Q,k = t. Thus, p must send tick k + 1 by
The property follows.
Unforgeability (U)
Proof Let p be the first correct node that sends tick k +1 ≥ 2 at time t p,k+1 , and assume wlog. that k ∈ N even . We apply Lemma 3 and consider the two possible cases: By the remote channel properties, this implies #b r (t ) ≥ k with
i.e., node r, and with it the first correct node that sent tick k, has sent tick k by $t_{p,k+1} - T^-_{\mathrm{first}}$. The property follows.
(ii) Since |Q| ≥ f + 1, there must be at least one correct node r ≠ p among Q for which #r rem p,r (t ) ≥ k + 1 with t = t p,k+1 − τ
For r, this implies #b r (t ) ≥ k + 1, a contradiction to the assumption that p was the first correct node to send tick k + 1. Again, the property follows.
The property follows in both cases.
Before turning to the proof of (QS), we proceed with two technical lemmas.
Lemma 8
If the first correct node p sends tick k + 1 ≥ 2 at time t p,k+1 , then at t := t p,k+1 − τ
it must hold that: There exists a set Q of size |Q| ≥ 2 f + 1 such that
Proof Analogous to the proof of (U), we apply Lemmas 3 and 2, and consider the two possible cases:
(i) This case exactly matches the implication of our lemma.
(ii) Since |Q| ≥ f + 1, there must be at least one correct node r ≠ p among Q which has already sent tick k + 1 before p did; this contradicts the assumption that p is the first correct node to send tick k + 1.
The lemma holds in both cases. 
Proof Let p be the first correct node that completes booting, i.e., t b = t p,b. Since the virtual tick 0 is already available in all pipe pairs (p, q) at p at time t b, the first correct node that sends tick 1 does so at t first,1, with
Since all other correct nodes boot by t b + B, they all must send tick 1 by time t last,1 with
The lemma follows.
Quasi-Simultaneity (QS)
Proof The proof is by induction on the number of ticks k + 1 ≥ 2 sent by the first correct node. Begin (k + 1 = 2): By Lemma 9, the first correct node must send tick 1 at t first,1 with
Let t first,2 be the time when the first correct node sends tick 2. By Unforgeability,
By Lemma 9, all other correct nodes must send tick 1 by t last,1 with
Step (k + 1 ≥ 3): Let t first,k be the time when the first correct node sends tick k. As our induction hypothesis, we assume that all correct nodes send tick k − 1 by
Let t first,k+1 be the time the first correct node, say p, sends tick k + 1 ≥ 3. By Lemma 8, there exists a set Q of size |Q| ≥ 2f + 1, such that at time t = t first,k+1 − τ
Clearly, there is a subset $\hat{Q} \subseteq Q$ of correct nodes among $Q$ of size $|\hat{Q}| \ge f + 1$. Let $Q' := C \setminus (\hat{Q} \cup \{p\})$. Now consider the partitioning of correct nodes $C = \hat{Q} \cup \{p\} \cup Q'$. We will prove the claim separately for each of the three partitions: In case $q \in \hat{Q}$, by (43) and the remote channel properties, $q$ has sent tick $k$ by $t - \tau^-_{\mathrm{rem}} < t_{\mathrm{first},k+1}$. In case $q = p$, it has sent tick $k$ by $t_{\mathrm{first},k+1}$. For the only non-trivial case $q \in Q'$, it follows from the remote channel properties that any $q \in \hat{Q}$, as a correct node, must have sent tick $k$ by
Now consider an arbitrary correct node $r \in Q'$. By the induction hypothesis and (U),
We may now apply Lemma 6 to node $r$ and the set of correct nodes $\hat{Q}$ with $t_{r,k-1}$ and $t_{\hat{Q},k}$ from (45) and (44). This yields: $r$ sends tick $k$ by $t_{r,k}$ with
where (46) 
The proposition follows.
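Step (k > 1): Assume the lemma is true for k. From (U) and (P), it follows that
\[ t_{\mathrm{first},k+1} - t_{\mathrm{first},k} \ge T^-_{\mathrm{first}}, \qquad t_{\mathrm{last},k+1} - t_{\mathrm{last},k} \le T_P. \]
In combination with the induction hypothesis, this yields:
\[ t_{\mathrm{last},k+1} - t_{\mathrm{first},k+1} = (t_{\mathrm{last},k+1} - t_{\mathrm{last},k}) + (t_{\mathrm{last},k} - t_{\mathrm{first},k}) + (t_{\mathrm{first},k} - t_{\mathrm{first},k+1}) \]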
The proposition follows. 
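The induction step above is a telescoping argument. As a sketch only, assume the induction hypothesis has the form $t_{\mathrm{last},k} - t_{\mathrm{first},k} \le \Delta_k$, where $\Delta_k$ is a placeholder symbol introduced here for illustration (the concrete bound is the one given in the lemma statement). Bounding the three differences term by term with (P), the hypothesis, and (U) then gives:

```latex
\begin{align*}
t_{\mathrm{last},k+1} - t_{\mathrm{first},k+1}
  &= \underbrace{(t_{\mathrm{last},k+1} - t_{\mathrm{last},k})}_{\le\, T_P \text{ by (P)}}
   + \underbrace{(t_{\mathrm{last},k} - t_{\mathrm{first},k})}_{\le\, \Delta_k \text{ (hypothesis)}}
   + \underbrace{(t_{\mathrm{first},k} - t_{\mathrm{first},k+1})}_{\le\, -T^-_{\mathrm{first}} \text{ by (U)}} \\
  &\le \Delta_k + T_P - T^-_{\mathrm{first}}.
\end{align*}
```

Under this assumption, the spread between the first and the last correct node can grow by at most $T_P - T^-_{\mathrm{first}}$ per tick.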
Lemma 10 (Fastest Progress)
Proof The proof is by induction on k − k. Begin (k = k): The lemma trivially holds since the first correct node cannot send tick k before time t f irst,k .
Step (k ≥ k + 1): Assume that p is the first correct node that sends tick k. The first correct node q ∈ C, by the induction hypothesis, does not send tick k before t + (k − k)T
Having completed the proof of our major Theorem 1, we proceed with Lemmas 11, 12 and 13 that bound the progress of the ticks generated by the fastest node. For this purpose, we define b max (t) as the maximum of #b p (t) over all correct nodes C, i.e., b max (t) := max{#b p (t) | p ∈ C}. Similarly, we define b min (t) := min{#b p (t) | p ∈ C}. Recall that f (t → ) denotes the left limit of function f at time t. For example, if node p sends tick k ≥ 1 at time t p,k, then #b p (t → p,k ) = k − 1 (whereas #b p (t p,k ) = k).
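The counting functions just defined can be illustrated on a recorded trace. The following is a minimal sketch, assuming each correct node is represented by a sorted list of the times at which it sent ticks 1, 2, ...; the function names and the sample trace are hypothetical and not part of the paper's framework.

```python
from bisect import bisect_left, bisect_right


def ticks_sent(send_times, t):
    """#b_p(t): number of ticks sent at or before time t (send_times is sorted)."""
    return bisect_right(send_times, t)


def ticks_sent_left_limit(send_times, t):
    """Left limit of #b_p at t: number of ticks sent strictly before time t."""
    return bisect_left(send_times, t)


def b_max(traces, t):
    """Maximum of #b_p(t) over all nodes in the trace."""
    return max(ticks_sent(times, t) for times in traces.values())


def b_min(traces, t):
    """Minimum of #b_p(t) over all nodes in the trace."""
    return min(ticks_sent(times, t) for times in traces.values())


# Hypothetical trace: node name -> sorted tick send times.
traces = {"p": [1.0, 2.0, 3.0], "q": [1.2, 2.5]}

# Node p sends tick k = 2 at time 2.0: the left limit is k - 1, the value is k.
print(ticks_sent_left_limit(traces["p"], 2.0), ticks_sent(traces["p"], 2.0))
print(b_max(traces, 2.4), b_min(traces, 2.4))
```

The left limit and the value differ only at instants where a tick is sent, which is exactly the distinction the example in the text makes.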
≤ N T P + T BS (k).
Top proof layer
We are now ready for establishing our major results. The first one, Theorem 2, bounds the precision π of our algorithm, i.e., shows that for every pair of correct nodes p, q ∈ C : ∀t ≥ 0 : |#b q (t) − #b p (t)| ≤ π . 
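To make the precision definition concrete, the sketch below measures the largest tick-count difference that actually occurs in a finite recorded trace. It is an illustration only: it computes an observed value for one trace, not the worst-case bound π established by the theorem, and the trace format (node name mapped to a sorted list of send times) is an assumption made here.

```python
from bisect import bisect_right


def observed_precision(traces):
    """Largest |#b_q(t) - #b_p(t)| over all node pairs and all times in the trace.

    Tick counts are step functions that change only when some node sends a
    tick, so it suffices to evaluate them at the send times.
    """
    event_times = sorted({t for times in traces.values() for t in times})
    worst = 0
    for t in event_times:
        counts = [bisect_right(times, t) for times in traces.values()]
        worst = max(worst, max(counts) - min(counts))
    return worst


# Hypothetical trace of three nodes.
traces = {
    "p": [1.0, 2.0, 3.0],
    "q": [1.2, 2.5, 3.4],
    "r": [1.1, 3.1, 3.2],
}
print(observed_precision(traces))
```

In this sample trace the largest gap occurs at time 3.0, when p has sent three ticks while r has sent only one.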
Theorem 2 (Precision)

Conclusions
We introduced the DARTS clock generation approach, which has been derived from a well-known distributed fault-tolerant tick generation algorithm and adapted for direct implementation in hardware. Major modifications had to be applied to the original distributed algorithm in order to adapt to the inherent fine-grain parallelism and limited resources of VLSI hardware implementations. DARTS provides a set of local clock signals, driving the subsystems of a SoC, for example, which are closely synchronized to each other. Our approach does not need quartz oscillators or the like, is guaranteed to start initially, and generates a clock frequency that automatically adapts to the current operating conditions.
The resulting algorithm and its synchronization precision depend only weakly on the implementation technology and the actual placement and routing on a chip: It is only required that the ratios of certain path delays, rather than the delays themselves, satisfy a few moderate constraints. The algorithm itself depends on these constraints only via the number of stages used in the elastic pipelines. Hence, there is usually no need to modify the algorithm when using a different implementation technology and/or a different placement & routing in an SoC.
We also provided a rigorous correctness proof and a worst case performance analysis, which employs a novel framework for the specification and analysis of distributed algorithms that are directly implementable in clockless digital logic. It shows that a system incorporating n ≥ 3 f + 2 clock generation units (TG-Algs) can cope with up to f Byzantine faulty TG-Algs. Our proof rests on simple properties of some elementary building blocks only, which can be verified by digital design tools. Hence, the correctness of a system of any size n can be guaranteed if the low-level building blocks are implemented correctly. Note that a comparable result cannot be established via model checking.
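The resilience condition n ≥ 3f + 2 can be read in both directions. The short sketch below, with hypothetical helper names, computes the largest number of Byzantine faulty TG-Algs a system of n units can cope with under this condition, and the smallest system size for a given f.

```python
def max_byzantine_faults(n):
    """Largest f satisfying n >= 3f + 2."""
    if n < 2:
        raise ValueError("the condition n >= 3f + 2 requires at least 2 nodes")
    return (n - 2) // 3


def min_nodes(f):
    """Smallest n satisfying n >= 3f + 2."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 2


for n in (5, 7, 8):
    print(n, max_byzantine_faults(n))
```

For example, five TG-Algs suffice to cope with a single Byzantine fault, while two faults require at least eight.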
Backed up by the lessons learned in the DARTS project, and encouraged by the conclusions of a recent Dagstuhl seminar [9] devoted to the topic, we feel justified in claiming that the proposed modeling framework, the building blocks, and the problems solved in DARTS are paradigmatic for fault-tolerant clockless algorithms in VLSI in general. We conclude our paper with a few arguments in favor of this claim.
In classic clockless VLSI circuits, each module handshakes with its predecessor and its successor modules in a "wait-for-all" manner, i.e., it waits until it has received a valid handshake signal from all its predecessor modules before it generates the next handshake signal for its predecessor modules. Clearly, this approach is infeasible if some modules may fail and hence generate erroneous or no handshake signals. In this case, however, it is natural to replace the "wait-for-all" by thresholds: Instead of waiting for all predecessors, a module only waits for a sufficiently large subset of its predecessors to complete the handshaking. Still, for repeated threshold-based handshaking, it is instrumental not to mix up handshake signals that have not been used for passing the threshold earlier with "fresh" ones. Obviously, it is exactly this kind of behavior that is encountered in a single DARTS TG-Alg module: It generates the next tick if a sufficiently large number of its predecessor modules (i.e., other TG-Alg modules) have generated a tick, and never mixes up old and new ticks.
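The difference between "wait-for-all" and threshold-based handshaking can be sketched in a few lines. The class below is a toy software model with hypothetical names, not the DARTS circuit: it fires once at least `threshold` predecessors are ahead of its own tick count, so a signal that already contributed to an earlier firing is never counted as fresh again. Setting the threshold to the number of predecessors gives the classic "wait-for-all" behavior.

```python
class ThresholdModule:
    """Toy model of a module that fires when enough predecessors have signalled."""

    def __init__(self, predecessors, threshold):
        self.threshold = threshold
        self.own_ticks = 0  # ticks generated by this module so far
        self.received = {p: 0 for p in predecessors}  # ticks seen per predecessor

    def on_tick(self, sender):
        """Record one tick from `sender`; return True if this module fires."""
        self.received[sender] += 1
        # A predecessor counts as fresh only if it is strictly ahead of us;
        # ticks used for an earlier threshold crossing are not reused.
        fresh = sum(1 for n in self.received.values() if n > self.own_ticks)
        if fresh >= self.threshold:
            self.own_ticks += 1
            return True
        return False


# Predecessor "d" is silent (e.g., faulty). A threshold of 3 still makes progress,
# whereas wait-for-all (threshold 4) blocks forever.
tolerant = ThresholdModule(["a", "b", "c", "d"], threshold=3)
wait_for_all = ThresholdModule(["a", "b", "c", "d"], threshold=4)
for sender in ["a", "b", "c", "a", "c", "b"]:
    print(sender, tolerant.on_tick(sender), wait_for_all.on_tick(sender))
```

The choice of threshold in a real design depends on the fault model; the value 3 here is arbitrary and only serves the illustration.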
As a consequence, we are convinced that both (i) the building blocks of DARTS, and (ii) the switch from transition logic (tick broadcast, elastic pipelines), which is typical for standard clockless circuits, to state logic (PCSGs and threshold modules, which are standard synchronous circuits), and back to transition logic, are not specific to DARTS but paradigmatic for fault-tolerant clockless circuits in general [28].
Part of our current and future research is devoted to further substantiating this claim: We recently completed some work on an important building block for a self-stabilizing variant of DARTS, which will make it possible to drop the simultaneous booting restriction and to recover transparently from transient failures, at the price of higher circuit complexity. Moreover, we have started working on how to make other fault-tolerant distributed algorithms, including consensus, amenable to a direct implementation in digital hardware. Needless to say, our modeling framework, which captures all the peculiarities of such systems without unnecessary overhead, proved instrumental for all this work.
(a) Suppose that ∃q ∈ Q : #r loc
