Abstract
Motivation
Shrinking feature sizes and increasing clock speeds are the most visible signs of the tremendous advances in VLSI design, which will accommodate billions of transistors on a single chip in the near future [13]. This comes at the price of increased system-level complexity, however: With today's deep submicron technology with GHz clock speeds, wiring delays dominate transistor switching delays, and signals can no longer traverse the whole die within a single clock cycle. Moreover, the reduced voltage swing needed for high clock speeds and low power consumption dramatically increases the adverse effects of single event upsets like α-particle or neutron hits. The resulting increase of the transient failure rate (soft-error rate) [18] and crosstalk sensitivity [24] has raised concerns about the dependability of future generation VLSI chips [5]. In fact, a modern VLSI chip can no longer be viewed as a monolithic block of synchronous hardware, where all state transitions occur simultaneously. Rather, VLSI chips are nowadays considered as systems of interacting subsystems, which marks the advent of Systems-on-Chip (SoC). Due to the problems listed above, however, SoCs have much in common with the loosely-coupled distributed systems that have been studied by the fault-tolerant distributed algorithms community for decades. This paper explores whether it is possible to utilize some of this research for SoCs and similar VLSI devices.
More specifically, in the context of our DARTS project (ti.tuwien.ac.at/darts), which is a joint project between Vienna University of Technology and Austrian Aerospace, we will explore an alternative approach (patented in [27]) to synchronous clocking in VLSI chips and PCB-level system designs. As shown in Fig. 1, the idea is to replace the external quartz oscillator and the clock tree, which supplies the clock signal to the different functional units (Fu_i) on a traditional chip, using a GALS-like approach [4]: Every functional unit has a dedicated fault-tolerant tick generation block (TS-Alg) attached, which generates the Fu's local clock signal. In contrast to GALS, however, our approach ensures that the local clock signals of different Fus are closely synchronized to each other. To accomplish this, all TS-Alg blocks communicate with each other over a simple "network" of clock signals (TS-Net). This alternative clocking approach has a number of advantages, which make it particularly promising for certain application domains: First of all, it does not need a quartz oscillator, which is an expensive and sensitive device (shock, vibration, temperature etc.). The generated clock always runs at the maximum speed and adapts to the current operating conditions (see footnote 2). Moreover, the approach tolerates transient failures in TS-Algs and TS-Net and avoids the cumbersome clock tree engineering issue [2, 9]. And last but not least, as different Fus are driven by slightly different clock signals, our approach alleviates EM radiation and ground bouncing problems [20] that typically plague devices using synchronous clocking.

Contributions: This paper shows that it is indeed possible to adapt fault-tolerant distributed algorithms to the particular needs of VLSI implementations. More specifically:
(i) We adapt the simple variant of Srikanth & Toueg's [28] consistent broadcasting introduced in [30] to the peculiarities of VLSI hardware implementations, namely, inherent fine-grained parallelism and very limited resources. Our major modifications are the enforcement of some atomic actions (interlocking) via implicit handshaking, and the replacement of k-bit messages by anonymous rising or falling signal transitions (zero-bit messages).
(ii) We provide a fault-tolerant distributed tick generation algorithm (TS-Alg), which tolerates up to f Byzantine faulty instances in a system containing n ≥ 3f + 2 TS-Algs. Examples of Byzantine failures are spurious clock transitions or early timing failures that are perceived inconsistently at different TS-Algs.
(iii) We prove (see footnote 3) that the resulting algorithm is correct, and derive bounds for its performance metrics like worst case precision and minimal/maximal clock frequency. Since our "system-level proof" rests upon some simple properties of certain digital logic blocks only, which can be easily verified by means of standard design tools, we can guarantee the correctness of any system of n ≥ 3f + 2 correctly implemented TS-Algs.
(iv) We provide some details of our synthesizable VHDL implementation of the algorithm, and demonstrate the feasibility of our approach by means of some measurement results obtained from an FPGA prototype system. These results will hence allow us to implement our DARTS clock generation scheme in a prototype SoC ASIC.
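The resilience bound n ≥ 3f + 2 from contribution (ii) can be turned into a small sizing aid. The following is a minimal sketch, not part of the DARTS implementation; the function names are hypothetical, and the two thresholds f + 1 and 2f + 1 are the ones used by the tick generation rules.

```python
def min_system_size(f: int) -> int:
    """Smallest number n of TS-Algs that satisfies n >= 3f + 2."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 2


def max_tolerated_faults(n: int) -> int:
    """Largest f such that n >= 3f + 2 still holds (0 if n is too small)."""
    return max((n - 2) // 3, 0)


def thresholds(f: int) -> tuple:
    """The two thresholds used by the tick rules: (f + 1, 2f + 1)."""
    return f + 1, 2 * f + 1


if __name__ == "__main__":
    for f in range(4):
        print(f, min_system_size(f), thresholds(f))
```

For instance, a system of n = 5 TS-Algs corresponds to f = 1 under this bound.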
Related work: Asynchronous distributed systems theory has been applied to VLSI chips for decades [7] : Research on transition signaling [3] , delay-insensitivity [19] , micropipelines [29] , etc. has in fact established a sound basis for dealing with self-timed systems [11] . However, those approaches cannot deal with failures.
Given the importance of dependability issues, there is a huge body of research work devoted to fault-tolerance in VLSI. However, the proposed techniques are very different from the "system-level approach" employed in fault-tolerant distributed algorithms: Fine-grained fault tolerance, e.g. at gate level, and error detection and recovery are the methods of choice in VLSI chips [23]. This research is hence not relevant w.r.t. our approach.

Footnote 2: Several important questions must still be answered in our DARTS project, however: E.g., it is not clear yet how area and power consumption of some reasonable number of TS-Algs relate to area and power consumption of a clock tree. Those problems, which primarily require comparison of suitable ASIC implementations, are outside the scope of this paper.

Footnote 3: The full details of the proofs have been omitted due to space restrictions and can be found in [10].
There is also a sizeable body of work devoted to hardware implementations of fault-tolerant algorithms. Well-known examples are MAFT [14], SAFEBUS [12] and TTP [15]. However, in sharp contrast to our problem, these systems incorporate hardware assistance only. The major part of the algorithms is still implemented in conventional software and executed on general-purpose processors. Consequently, there was no need to minimize the gate-level resource consumption implied by these algorithms. Something of an exception is the paper [1], which shows that consensus can be solved with 1-bit messages. None of the above systems had to deal with the fine-grain parallelism inherent in VLSI implementations, though.
There is also some related work on alternative clocking schemes in VLSI. Since we do not consider external clock sources in our approach, we can ignore the sizeable body of work on hardware-assisted fault-tolerant clock synchronization (see [26] for an overview) here. The few approaches for distributed clock generation without external clock sources we are aware of are essentially based on a (distributed) ring oscillator, which is formed by gates arranged in a positive feedback loop. Instead of being dictated by a quartz, the frequency of the generated clock signal is determined by the end-to-end delay of the feedback loop. In [21] , a regular structure of closed loops of an odd number of inverters is used for distributed clock generation. Similarly, [8] employs local tick generation cells, arranged in a two-dimensional grid. Since clock synchronization theory [6] reveals that high connectivity is required for bounded synchronization tightness in presence of failures, however, the sparsely connected designs proposed in [8, 21] are not fault-tolerant.
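To illustrate how the loop delay fixes the frequency of such an oscillator, the sketch below uses the textbook relation for a plain ring of an odd number of identical inverters, f = 1 / (2 · N · t_d). This idealized formula and the function name are assumptions for illustration only; they are not taken from [8] or [21].

```python
def ring_oscillator_frequency_hz(num_inverters: int, stage_delay_s: float) -> float:
    """Idealized frequency of a ring of identical inverters.

    A transition must travel around the loop twice per period, so the
    period is 2 * N * t_d for N stages with delay t_d each.
    """
    if num_inverters < 1 or num_inverters % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverters")
    if stage_delay_s <= 0:
        raise ValueError("stage delay must be positive")
    return 1.0 / (2 * num_inverters * stage_delay_s)


if __name__ == "__main__":
    # Hypothetical numbers: 5 stages with 1 ns delay each.
    print(ring_oscillator_frequency_hz(5, 1e-9))
```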
Informal Overview
The TS-Alg developed and analyzed in this paper derives from a simple synchronizer algorithm introduced in [30]. The (core of this) algorithm, which is based on Srikanth & Toueg's well-known consistent broadcasting primitive [17, 25, 28], is shown in Fig. 2. When executed in a system of n = 3f + 1 processes (= TS-Alg instances) with at most f of them being Byzantine faulty, it generates a sequence of consecutive messages tick(k), k ≥ 0, at every process. The algorithm ensures that the difference of the points in real-time when two correct processes p and q emit tick(k) is bounded by a certain constant precision π, which can be computed from the max. [...]

The algorithm is started by sending tick(0) in line 2 and works as follows: If a correct process p receives f + 1 tick(ℓ) messages (line 4), it can be sure that at least one of those was sent by a correct process. Therefore, p can safely catch up and send tick(k), ..., tick(ℓ). If some process p receives 2f + 1 tick(k) messages (line 6), one can be sure that at least f + 1 of those will be received by every other correct process (which then executes line 4) within [...]
Hence, all other correct processes q ≠ p receive 2f + 1 tick(k) messages within another τ^+. It follows that if p emits tick(k) at time t, every other correct q emits tick(k) no later than t + ε + τ^+. This eventually guarantees a bound π on the synchronization precision, as claimed above. Note that the algorithm automatically adapts to the instantaneous timing characteristics of all involved computations and message transmissions.
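The two threshold rules can be sketched as a small message-driven state machine. This is an illustrative model only, not the listing of Fig. 2: the class `TickProcess`, its method names, and the choice to send tick(k + 1) when the 2f + 1 rule fires are assumptions in the style of consistent broadcasting.

```python
from collections import defaultdict


class TickProcess:
    """One correct process applying the two threshold rules (illustrative)."""

    def __init__(self, f: int):
        self.f = f
        self.k = 0                    # highest tick number sent so far
        self.sent = [0]               # tick(0) is sent at start-up
        self.seen = defaultdict(set)  # tick number -> distinct senders seen

    def _send_up_to(self, target: int) -> None:
        # Emit every missing tick so the sequence stays consecutive.
        for tick in range(self.k + 1, target + 1):
            self.sent.append(tick)
        self.k = target

    def receive(self, sender: str, tick: int) -> list:
        self.seen[tick].add(sender)
        progressed = True
        while progressed:
            progressed = False
            # Catch-up rule: f + 1 senders vouch for a tick ahead of ours,
            # so at least one of them is correct.
            ahead = [l for l, who in self.seen.items()
                     if l > self.k and len(who) >= self.f + 1]
            if ahead:
                self._send_up_to(max(ahead))
                progressed = True
            # Advance rule: 2f + 1 senders have reached our current tick.
            if len(self.seen[self.k]) >= 2 * self.f + 1:
                self._send_up_to(self.k + 1)
                progressed = True
        return list(self.sent)


if __name__ == "__main__":
    p = TickProcess(f=1)
    for s in ("a", "b", "c"):
        p.receive(s, 0)
    print(p.sent)
```

Note that a single sender reporting a far-ahead tick cannot move the process forward in this model, since neither threshold is reached.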
Since the algorithm in Fig. 2 looks very simple, it is tempting to conclude that it should be easy to translate into a hardware description language: The k-th generated clock tick occurs when a process sends its tick(k) message. It turns out, however, that a number of challenging issues must be solved to accomplish this:

How to implement the TS-Net efficiently? The algorithm assumes a fully connected network, consisting of n^2 links, so anything beyond a single wire per link is considered unacceptable. Moreover, for implementation simplicity and performance, the information transmitted via the TS-Net must be kept to a minimum. Ideally, and almost mandatorily, the TS-Net should just feed the emitted clock ticks, i.e., signal transitions, of every TS-Alg to every other TS-Alg.

How to adapt the original algorithm for zero-bit messages? By just sending anonymous signal transitions, no information except the occurrence time can be conveyed over the TS-Net. Thus, the tick number k contained in a message in the algorithm of Fig. 2 must be maintained at every receiver, individually for every sender.

How to ensure atomicity of actions in a VLSI implementation? Any distributed computing model we are aware of assumes atomic computing steps at the level of a single processor. This abstraction does not apply when an algorithm is implemented directly in hardware, however, since all "computations" are done by several digital logic gates that run concurrently. Explicit synchronization (serialization of actions/interlocking) must be introduced if two local computations must not interfere with each other.
Taking into account the above issues, we arrived at the basic architecture of a single TS-Alg shown in Fig. 3. The major building blocks of a single TS-Alg are the n − 1 up/down counters, one for each of the n − 1 other TS-Algs in the system. Each such device counts the difference of (i) the number of clock ticks seen from the respective peer, and (ii) the number of clock ticks generated locally so far. It is supposed to provide two binary status signals, GR and GEQ, which are true when the counter's actual value is > 0 and ≥ 0, respectively. In addition, we need ≥ f + 1 and ≥ 2f + 1 threshold circuits implementing the rules in (line 4) and (line 6) in Fig. 2, respectively. Finally, there is a device (shown as an OR-gate in Fig. 3), which is responsible for generating the local clock ticks from the outputs of the threshold gates.
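A behavioral sketch of this architecture is given below. It is a sequential software model under simplifying assumptions, not the asynchronous hardware: the class `PeerCounter` and the function `tick_enabled` are hypothetical names, and the pairing of the f + 1 threshold with GR and the 2f + 1 threshold with GEQ follows the order in which the overview lists them.

```python
class PeerCounter:
    """Up/down counter for one remote peer: remote ticks minus local ticks."""

    def __init__(self) -> None:
        self.diff = 0

    def remote_tick(self) -> None:
        self.diff += 1

    def local_tick(self) -> None:
        self.diff -= 1

    @property
    def gr(self) -> bool:
        return self.diff > 0    # peer is strictly ahead

    @property
    def geq(self) -> bool:
        return self.diff >= 0   # peer is at least as far as we are


def tick_enabled(counters, f: int) -> bool:
    """OR of the two threshold circuits over all n - 1 peer counters."""
    gr_count = sum(c.gr for c in counters)
    geq_count = sum(c.geq for c in counters)
    return gr_count >= f + 1 or geq_count >= 2 * f + 1


if __name__ == "__main__":
    f = 1
    peers = [PeerCounter() for _ in range(4)]  # n = 5, so n - 1 = 4 peers
    print(tick_enabled(peers, f))
```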
Again, the above architecture is deceptively simple. The major problem when trying to implement Fig. 3 in hardware is the lack of a common clock signal that could be used for a synchronous logic design. Rather, a (quasi) delay insensitive asynchronous implementation [11] must be devised. The major problem here is to distinguish GR, GEQ signals that contributed to the previously generated tick k from GR, GEQ signals contributing to the next (to be generated) tick k + 1. Our solution exploits the fact that the transitions of a binary-valued signal must strictly alternate between low-to-high and high-to-low: We just provide independent signals GR, GEQ and threshold circuits for generating odd (k ∈ N_odd := 2N + 1) and even clock ticks (k ∈ N_even := 2N).
The outputs of the threshold circuits (say TH^o_{GR} and TH^o_{GEQ}) that generated the even tick k are ignored when the next tick k + 1 to be generated is odd. This "gap" thus allows GR [...]
The Algorithm
It has been highlighted in the previous section that even the simple algorithm presented in Fig. 2 makes use of design elements that are not available or too costly at the hardware design level. In addition, one has to account for the fact that even the simplest (= sequential) control flow comes with some delay since it actually involves sending a signal over a wire, i.e., a zero-bit FIFO message channel. In this section, we will provide the architectural design model of a TS-Alg that meets those requirements.
Signals and Zero-bit Message Channels
All components of our TS-Algs, which are digital logic blocks, deal with binary signals only. Given such a signal S [or an arbitrary boolean predicate], with possible values ⊥ (= logical 0, false, inactive) and ⊤ (= logical 1, true, active), we say that the event (= state transition) S−↑(t*) resp. S−↓(t*) occurs when S changes state from ⊥ to ⊤ resp. from ⊤ to ⊥ at time t*. The status S(t) of S at time t ≥ t* is S(t) = ⊥ resp. S(t) = ⊤ iff the last event at or before t was S−↓(t*) resp. S−↑(t*). In our analysis, we will reason about events and status of binary signals [and predicates]. In order not to clutter our notation, we will employ the convention that, depending on the context of usage, S(t) will denote either: (i) S's status S(t) ∈ {⊥, ⊤}, where t denotes the observation time, or (ii) the event S−↑(t), where t denotes the time of the last transition to the active state (or reset time).
Note also that we will sometimes drop the time t from events and status if it is clear from the context.
All components in our system are interconnected by signal wires, which are modeled as reliable FIFO channels with finite delay that carry zero-bit messages. The semantics of a zero-bit message channel X is as follows: Let X_s be the channel's input signal, which is controlled by a single sender component. It generates the events (= messages) X_s−↑(t) and X_s−↓(t), where t denotes the sending time. The associated input state X_s(t) can be viewed as the information content of the last message sent into X. Channel X's output is fed to the receiver, which perceives the event (= message) X_r−↑(t′) resp. X_r−↓(t′) for every sent event X_s−↑(t) resp. X_s−↓(t) within finite time t′ ≥ t. The receiving state X_r(t) at time t can again be viewed as the information content of the last received message. At reset time t_0, the channel state X_r(t_0) is initialized to ⊥. Obviously, a zero-bit message channel can only convey messages with strictly alternating content. Nevertheless, this type of communication is compatible with Lamport's happened-before relation [16]: For matching send and receive events, it holds that X_s−↑(t) → X_r−↑(t′) and [...] using a channel X, we will employ the convention that X−↑(t) and X−↓(t) abbreviate the send events X_s−↑(t) and X_s−↓(t), respectively, whereas X(t) abbreviates the state X_r(t) at the receiving end.

Figure 4. TS-Alg architecture, including the points of observation b_p(t), r_{p,q}(t) and r^s_{p,q}(t).
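The channel semantics can be mimicked in software as a FIFO queue of signal levels. The sketch below is a simplified model under stated assumptions: delays are abstracted into explicit `deliver` calls, and the class and method names are hypothetical.

```python
from collections import deque


class ZeroBitChannel:
    """Reliable FIFO wire that carries only signal transitions."""

    def __init__(self) -> None:
        self.sender_state = False     # X_s, starts at logical 0
        self.receiver_state = False   # X_r(t_0) is initialized to logical 0
        self._in_flight = deque()     # transitions sent but not yet received

    def send_transition(self) -> bool:
        """Toggle the input signal; the new level is the message content."""
        self.sender_state = not self.sender_state
        self._in_flight.append(self.sender_state)
        return self.sender_state

    def deliver(self):
        """Let the oldest pending transition arrive at the receiver."""
        if not self._in_flight:
            return None
        self.receiver_state = self._in_flight.popleft()
        return self.receiver_state


if __name__ == "__main__":
    x = ZeroBitChannel()
    x.send_transition()   # rising edge
    x.send_transition()   # falling edge
    print(x.deliver(), x.deliver())
```

Because each send toggles the input, the delivered levels necessarily alternate, which mirrors the observation that such a channel can only convey strictly alternating content.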
TS-Alg Component and Architecture Specification
In this section we will specify the TS-Alg components' behavior and determine their arrangement. The architecture is depicted in Fig. 4. Note that a +/− counter in Fig. 3 comprises the first 3 components listed below.

Pairs of elastic pipes: Every TS-Alg/process p incorporates n − 1 pairs of elastic pipelines (= a shift register/FIFO for signal transitions [29]), each of which corresponds to a single remote TS-Alg/process q ∈ P \ {p}. Every pair consists of a remote pipeline that can store up to S tick−↑/tick−↓ messages sent by q, and a local pipeline that can hold up to S tick−↑/tick−↓ messages sent by p locally. The number S is an implementation parameter that has to be chosen in accordance with Theorem 4.13.
For description and analysis purposes, we will need some notation: r_{p,q}(t) resp. r^s_{p,q}(t) denotes the number of messages that arrived at the end of the remote resp. local pipe at time t. Moreover, s_{p,q}(t) denotes the number of tick messages stored in the local pipeline by t. Note that those quantities are not available to the algorithm. Rather, the algorithm uses only the binary status signals r_{p,q}(t) ≥ r^s_{p,q}(t) and r_{p,q}(t) > r^s_{p,q}(t) in conjunction with s_{p,q}(t) = 1. Upon reset, all pipelines are initialized to contain a single even tick−↓ message. We assume that r_{p,q}(t_{0,p}) = r^s_{p,q}(t_{0,p}) = 0 and s_{p,q}(t_{0,p}) = 1 at reset time t_{0,p}.

Diff-Gate: To avoid the need for infinite storage, each pair of pipelines is equipped with a special circuit that removes tick messages contained in both pipes. The behavior of a Diff-Gate is as follows:
such that s_{p,q}(t′) = s_{p,q}(t′ − dt) − 1, for some infinitesimally small dt > 0. Note that we are not losing information due to this removal of "common" tick messages, since the algorithm is only interested in the difference of the number of messages received.
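The claim that removing "common" tick messages loses no information can be checked with a toy model of one pipeline pair. This is a sketch under simplifying assumptions: the pipes are plain queues, the initial tick stored at reset is ignored, and the function name `diff_gate` is hypothetical.

```python
from collections import deque


def diff_gate(remote: deque, local: deque) -> int:
    """Remove matching ticks from both pipes; return how many pairs were removed."""
    removed = 0
    while remote and local:
        remote.popleft()  # take one tick from the remote pipe ...
        local.popleft()   # ... and the matching one from the local pipe
        removed += 1
    return removed


if __name__ == "__main__":
    remote = deque(["tick"] * 3)
    local = deque(["tick"] * 1)
    before = len(remote) - len(local)
    diff_gate(remote, local)
    after = len(remote) - len(local)
    print(before, after)  # the difference is unchanged by the removal
```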
e GR for the current tick becomes active. We denote by b_p(t) the number of tick messages generated by process p by time t, with b_p(t_0) = 0 at reset time t_0. If t_k denotes the time when p generates tick k, we assume that b_p(t_k) = b_p(t_k − dt) + 1 for an infinitesimally small dt > 0, i.e., b_p(t_k) gives the number of ticks including the new tick k.
The detailed description of the TS-Alg based on our architectural model is given in Fig. 5 . Note that the algorithm's if-clauses are only evaluated when the validity of the if-clause's premise changes, i.e., upon a state transition of the enabling condition (which also happens upon reset). In case of the threshold gates (line 18-line 29), we conceptually assume that a (possibly idempotent) output event is sent upon any change of the state of any input.
Correctness Proofs
For our correctness proof and performance analysis, we employ the following system and failure model: Let P be a set of n distributed processes, executing TS-Algs, which communicate via simulated broadcasting (multiple point-to-point sends) over a fully connected network of reliable zero-bit FIFO message passing links. Every link carries strictly alternating tick−↑ and tick−↓ messages only. The transmission delay satisfies some upper and lower bounds, which are unknown to the algorithm, i.e., introduced solely for analysis purposes: The time of a locally generated tick message to reach the end of any local pipeline is bounded by [...] Note that this assumption has some impact on initialization as well: Every process p can be initialized to b_p(t_{0,p}) = r_{p,q}(t_{0,p}) = r [...]

0: /* Initialization */
1: ∀q: r_{p,q}(t_{0,p}) = r^s_{p,q}(t_{0,p}) = 0; s_{p,q}(t_{0,p}) = 1
2: ∀ channels X: X_r(t_{0,p}) = ⊥
3:
4: /* code for PCSGs at process p for remote process q */
[...]
→ send GEQ^e_{p,q} [...]
[...]
[... (t) ∨ TH^o_{GEQ}(t)] ∧ ¬[TH^e_{GR}(t) ∨ TH^e_{GEQ}(t)]
32:
→ send tick−↑

Figure 5. TS-Alg tick generation algorithm adapted for VLSI implementations.

Our tick generation algorithm will allow up to f processes to fail arbitrarily, provided that the total number of processes is n ≥ 3f + 2. This is slightly more than the required lower bound of n ≥ 3f + 1 for clock synchronization [6], but facilitates a much better precision and accuracy (attained by counting only remote messages when calculating the f + 1 resp. 2f + 1 thresholds; including self reception would lead to τ^−_{rem} = τ^−_{loc} in Theorem 4.11 (Precision)). In our context, the adverse power of arbitrary failures lies in the ability of a process to generate wrong (early timing failures or even spurious) clock ticks, which are perceived inconsistently at different receiver processes. Such failures may be the consequence of particle hits or electromagnetic interference, which can very well affect different receivers differently, depending upon wire length and signal detection sensitivity.
We will start our formal treatment with Definition 4.1, which will be employed frequently in our proofs.
Definition 4.1. (Direct Causality). Let I(t′) and O(t) be two events of some specific signal input and output, respectively, of a correct component C. We say that they are directly causally related, denoted by I(t′) → O(t), if they are (i) causally related and (ii) there is neither an ↑- nor a ↓-event I′(t″) on the same input in between, i.e., ∄ I′(t″): I(t′) → I′(t″) → O(t). If the input and output events I(t′) and O(t) of a correct component C with latency [...]
An instrumental part in the correctness proof of the tick generation algorithm in Fig. 5 is to make sure that a process generates tick k + 1 messages based on tick k messages only, i.e., does not incorporate stale information from earlier ticks < k. This will be formalized in the following Definition 4.2. Lemma 4.3 below will deduce some simple properties from this definition.
Definition 4.2. (Notion of Basis). Abbreviate
P^ℓ_{GEQ,p,q}(t) = [r_{p,q}(t) ≥ r^s_{p,q}(t) = ℓ] ∧ [s_{p,q}(t) = 1],
P^ℓ_{GR,p,q}(t) = [r_{p,q}(t) > r^s_{p,q}(t) = ℓ] ∧ [s_{p,q}(t) = 1].   (1)
We say that correct process p's tick k + 1 is based on correct process q's tick ℓ, if there exists at least one of the following chains of direct causal dependencies: For [...]
(and analogous for k + 1 ∈ N_odd). A correct process p's tick k + 1 is said to be based on tick ℓ if, for all correct processes q, it is based on q's tick ℓ_q with ℓ_q ≥ ℓ and ∃q_i: [...]
Lemma 4.3. The time instants t_{k+1} and t_ℓ in the direct causal chains (2) and (3) in Definition 4.2 satisfy τ [...]
Moreover, the predicate P^ℓ_{GEQ,p,q}(t) resp. P^ℓ_{GR,p,q}(t) holds not only at t = t_ℓ, but continues to hold true for every t ∈ [t_ℓ, t_{k+1} − τ [...]

The proof is by induction on the number of ticks p has sent so far. It is first shown that all ticks up to number 2 are based on their predecessor. Then we assume that some tick k is the first tick that is not based on the preceding tick k − 1 but rather on some tick ℓ ∈ {k − 2, k − 4, ...}. By investigating the direct causal chains that would lead to this behavior, we obtain a contradiction to Constraint 4.4.
Constraint 4.4. (Interlocking Constraint). With the abbreviations
We will now provide a sequence of simple lemmas, which are needed for establishing our major results Theo- 
The proof is by applying Simultaneity (S) once and Progress (P) iteratively.
Theorem 4.11 gives the precision π of our algorithm, which guarantees ∀t: |b_q(t) − b_p(t)| ≤ π for every pair of correct processes p and q. Theorem 4.12 (Accuracy) can be used to bound the number of tick messages generated locally at a correct process p during some real time interval Δ, i.e., it allows us to make statements about the local frequency. For example, it reveals that the long-term frequency is within [...]
This final theorem establishes that pipeline size is indeed bounded, provided that the additional condition τ [...]

The proof is by bounding the maximum time a tick message can exist before it is eliminated by the Diff-Gate.
Hardware Implementation
Now we will present implementation details of the low-level building blocks of our TS-Alg implementation, which have been introduced at higher levels of abstraction in Figs. 3 and 4. We will also derive timing conditions an implementation has to satisfy in order to fulfill the behavioral specification of Section 3.2. It turns out that the conditions can be enforced easily by design tools, if not a priori true.
Implementation of +/− Counters
Elastic Pipeline: An elastic pipeline can be seen as a FIFO buffer for signal transitions, based on Sutherland's micropipelines [29] . As shown in Fig. 6 , it consists of a chain of Muller C-Elements [22] (+ inverters), each of which can store a single low-to-high/high-to-low transition.
In order to get an idea of the internal operation of the elastic pipeline in Fig. 6 , consider the gate-level implementation of a C-Element given in Fig. 7 . It is apparent that it changes the value of its output y only if both inputs have the same value -it hence implements an AND for signal transitions. Note that a C-Element must incorporate a feedback loop, since y must stay at the current logic level when either a or b (but not both) changes.
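The described behavior of a C-Element (the output follows the inputs only when they agree, and holds its value otherwise) can be captured in a few lines. This is a purely behavioral, timing-free sketch; the class name `CElement` is hypothetical and the model says nothing about the gate-level feedback loop of Fig. 7.

```python
class CElement:
    """Behavioral model of a two-input Muller C-Element."""

    def __init__(self, initial: bool = False) -> None:
        self.y = initial

    def update(self, a: bool, b: bool) -> bool:
        # The output changes only when both inputs have the same value;
        # otherwise the previous output is held (the role of the feedback loop).
        if a == b:
            self.y = a
        return self.y


if __name__ == "__main__":
    c = CElement()
    for a, b in [(True, False), (True, True), (False, True), (False, False)]:
        print(a, b, c.update(a, b))
```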
The correct operation of the pipeline depends upon some timing constraints regarding (i) the feedback loop involved in the Muller C-Elements and (ii) the loop formed by two consecutive stages in the pipeline. Actually, the delays of those feedback loops impose some minimum delay between successive input transitions, which must be maintained in order not to jeopardize correct behavior. In standard delay-insensitive applications of elastic pipelines, this is ensured by triggering a new data in transition only when the ack out transition that acknowledges the previous data in transition occurs. Since we obviously cannot use ack out in our application (we cannot acknowledge every generated clock tick), the required timing constraints must be externally maintained.
We will first analyze the timing conditions of the Muller C-Element: To allow the gates on the feedback path in Fig. 7 to settle, it has to be ensured that a and/or b do not change state in opposite directions (high-to-low, low-to-high) within τ^+_{loop} of each other if potentially causing a state change. The condition τ_{min} ≥ τ^+_{loop} has to be obeyed for both consecutive transitions on a single input as well as opposite transitions on different inputs if these transitions would lead to a state change (e.g. a high, whereas b performs low-to-high and high-to-low within τ^+_{loop}). The bound τ^+_{loop} can be obtained by analyzing the netlist of the hardware generated by a synthesis resp. place&route tool.
The basic timing constraints for the elastic pipeline, which originate from the feedback loops formed by any two consecutive pipeline stages, as highlighted in Fig. 6, can be derived in a similar manner: As long as input signal data in does not change faster than any stage's τ [...]

It is important to mention, however, that the simple considerations above apply only to the case where data in and ack in do not change simultaneously. The latter case, which could obviously cause metastability problems if there were no additional constraints, is treated in conjunction with the Diff-Gate below.
Difference Gate: The Diff-Gate is responsible for removing matching transitions from the ends of the remote and local pipe, see Fig. 8. Our goal is to implement a +/− counter, which keeps track of the difference of the number of ticks seen so far. Still, every pipe must have at least [...] + 1 stages to avoid an overflow due to the fact that remote and local clock are not perfectly synchronized, cf. Theorem 4.13. Note that S is the only implementation-technology-dependent parameter that is compiled into the TS-Algs. However, since it depends on the ratio of maximum and minimum delays only, like in the Theta-Model [17], rather than on the absolute delays, we are convinced that it is fairly independent of things like temperature and implementation technology.
Our Diff-Gate implementation, cf. Fig. 8, is by a simple asynchronous state machine which, in case of matching values of req ext and req int, removes the last transition of the remote (left) pipeline before it removes the matching last transition of the local (right) one. There are two reasons why this left-before-right removal is mandatory: The first one is related to guaranteeing the required properties of the signals GEQ [...]

Unlike all other building blocks described so far, the PCSG is not a transition signaling logic but rather a conventional asynchronous state logic. However, it must be able to handle transitions, which arrive at random instants, without producing glitches or wrongly activating its output signals. Since this is impossible in general, we have separated GEQ^e, GR^e from GEQ^o and GR^o: As already explained in Section 2, the former signals are only relevant if the next tick to be generated is odd, whereas the latter status signals are only relevant when it is even. When the next tick to be generated is the other one, we do allow glitches and other incorrect signal changes.
To ensure correct behavior of our algorithm, the relevant status signals for the next tick to be generated, say, GEQ^e, must be activated only if r_{p,q}(t) > r^s_{p,q}(t) ∧ s_{p,q}(t) = 1 holds for sure. This requires a left-before-right removal of matching transitions by the Diff-Gate: GEQ^e must never become true, not even in a glitch, if the number of remote ticks seen is smaller than the number of local ticks. If we allowed the Diff-Gate to remove the right transition before the left one, however, the PCSG unit might see a larger number of remaining transitions in the remote pipe than in the local pipe, and would hence produce an erroneous output on GEQ^e. Further, it must be ensured that any state signal can become true only if the next tick to be generated is not already available in the local pipe. This can happen for the pipelines corresponding to a slow remote node, where the faster ones have already achieved the required 2f + 1 or f + 1 threshold. Our PCSG implementation ensures this by comparing the last stages of the local and remote pipeline, subject to the condition that the three last stages of the local pipeline contain the same value. It is easy to see from the operation of an elastic pipeline that the latter condition signals an (almost) empty pipe, which is sufficient to infer the required "freshness" of the status outputs.
Other low-level building blocks
Apart from the n − 1 +/− counters, a TS-Alg also consists of standard asynchronous f + 1 and 2f + 1 threshold gates and a clock generation unit (represented by the OR-gate in Fig. 3).

Threshold gates: The threshold logic is responsible for implementing the if-clauses depicted in Fig. 5. The clock generation is based on four separate threshold gates: Two are active, waiting for ≥ f + 1 GEQ^e resp. ≥ 2f + 1 GR^e, when the next tick to be generated is odd, while the remaining two, waiting for ≥ f + 1 GR^o resp. ≥ 2f + 1 GEQ^o, are inactive, in the sense that they can behave arbitrarily.

Clock generation: According to the algorithm in Fig. 5, either the active f + 1 or the active 2f + 1 threshold gate may produce the next clock tick. Our implementation employs a suitable combination of active-high and active-low signals on the inputs and outputs of the threshold gates, which allows us to use a simple logical AND resp. OR for combining the two corresponding threshold gate outputs for generating even resp. odd ticks. The output of the odd and even tick generation is joined by a final Muller C-Element, which ensures that the local clock output remains stable until the next valid clock tick is generated. Note that this element can be seen as the back-transition from the state-based logic (PCSG and threshold gates) to transition signaling logic (+/− counters).
Experimental Evaluation
The low-level building blocks described above have been implemented using VHDL and assembled in a complete TS-Alg. We have also prototyped a complete system of n = 5 TS-Algs, which has been synthesized and put into operation on an Altera APEX EP20K1000 FPGA. The clock signals of all five TS-Algs in a sample run are depicted in Fig. 9. The clock frequency is approximately 24 MHz, and the maximum skew of the clock ticks is 4 ns. Of course, this FPGA implementation is just a proof of concept and shall demonstrate the principal feasibility of the DARTS approach. We are convinced that there is a huge potential for optimizations, e.g., to increase the clock frequency: FPGAs are not particularly suitable for asynchronous designs due to the structure of the lookup tables and registers. Furthermore, the performance of the TS-Alg clock depends upon the ratio of the routing delays of the longest and the shortest clock signal paths (τ^+/τ^−). Due to the static interconnect structure of an FPGA, these signal delays may vary within a wide range. We are hence convinced that much better performance, at much lower cost, can be obtained by the standard cell CMOS ASIC implementation ultimately targeted by our DARTS project.

Figure 9. DARTS approach with 5 TS-Algs running on an FPGA.
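To put the reported figures into proportion, the short calculation below converts the approximate 24 MHz clock frequency into a period and relates the 4 ns maximum skew to it. The helper names are hypothetical, and the result is only as accurate as the approximate frequency quoted above.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1e3 / freq_mhz


def relative_skew(skew_ns: float, freq_mhz: float) -> float:
    """Skew expressed as a fraction of one clock period."""
    return skew_ns / clock_period_ns(freq_mhz)


if __name__ == "__main__":
    print(round(clock_period_ns(24.0), 2))     # period in ns
    print(round(relative_skew(4.0, 24.0), 3))  # skew as a fraction of the period
```

With these inputs the period is roughly 41.7 ns, so the measured skew is a little under a tenth of a clock period.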
Conclusions
We proposed a novel clock generation approach for VLSI chips that has been derived from a well-known distributed fault-tolerant tick generation algorithm. It provides a set of local clock signals, to be fed to the subsystems of a SoC, for example, which are synchronized within a bounded precision to each other. Our approach does not need any external quartz oscillator or the like, and generates a clock frequency that automatically adapts to the current operating conditions.
Major modifications had to be applied to the original algorithm in order to adapt to the inherent fine-grain parallelism and limited resources of VLSI hardware implementations. Our algorithm depends upon the implementation technology only via the required number S of stages of the elastic pipelines. Since S depends upon the ratio of max. and min. clock signal delays only, rather than on the delay values themselves, we are convinced that it is actually reasonably independent of implementation technology.
We also provided the cornerstones of our correctness proof which shows that a system incorporating n ≥ 3f + 2 clock generation units can cope with up to f Byzantine faulty nodes and links. This allows our algorithm to work correctly in presence of up to f units that produce spurious clock transients perceived inconsistently by the other units. Our proof rests upon simple properties of some locally available signals only, which can be verified by digital design tools. Hence, the correctness of a system of any size n can be guaranteed if the low-level building blocks providing the required signals are implemented correctly. Note that a comparable result cannot be established via model checking. Current work is devoted to speed optimization and stabilization, both within the scope of the algorithm and the hardware implementation. First simulations of our ASIC implementation reached ≥ 200 MHz clock frequency. Systematic experimental evaluations are also planned in the near future.
