10 research outputs found

    TRIX: Low-Skew Pulse Propagation for Fault-Tolerant Hardware

    Get PDF
    The vast majority of hardware architectures use a carefully timed reference signal to clock their computational logic. However, standard distribution solutions are not fault-tolerant. In this work, we present a simple grid structure as a more reliable clock propagation method and study it by means of simulation experiments. Fault-tolerance is achieved by forwarding clock pulses on arrival of the second of three incoming signals from the previous layer. A key question is how well neighboring grid nodes are synchronized, even without faults. Analyzing the clock skew under typical-case conditions is highly challenging. Because the forwarding mechanism involves taking the median, standard probabilistic tools fail, even when modeling link delays just by unbiased coin flips. Our statistical approach provides substantial evidence that this system performs surprisingly well. Specifically, in an "infinitely wide" grid of height~HH, the delay at a pre-selected node exhibits a standard deviation of O(H1/4)O(H^{1/4}) (2.7\approx 2.7 link delay uncertainties for H=2000H=2000) and skew between adjacent nodes of o(loglogH)o(\log \log H) (0.77\approx 0.77 link delay uncertainties for H=2000H=2000). We conclude that the proposed system is a very promising clock distribution method. This leads to the open problem of a stochastic explanation of the tight concentration of delays and skews. More generally, we believe that understanding our very simple abstraction of the system is of mathematical interest in its own right.Comment: 16 pages, 11 figure

    {TRIX}: {L}ow-Skew Pulse Propagation for Fault-Tolerant Hardware

    Get PDF
    The vast majority of hardware architectures use a carefully timed reference signal to clock their computational logic. However, standard distribution solutions are not fault-tolerant. In this work, we present a simple grid structure as a more reliable clock propagation method and study it by means of simulation experiments. Fault-tolerance is achieved by forwarding clock pulses on arrival of the second of three incoming signals from the previous layer. A key question is how well neighboring grid nodes are synchronized, even without faults. Analyzing the clock skew under typical-case conditions is highly challenging. Because the forwarding mechanism involves taking the median, standard probabilistic tools fail, even when modeling link delays just by unbiased coin flips. Our statistical approach provides substantial evidence that this system performs surprisingly well. Specifically, in an "infinitely wide" grid of height~HH, the delay at a pre-selected node exhibits a standard deviation of O(H1/4)O(H^{1/4}) (2.7\approx 2.7 link delay uncertainties for H=2000H=2000) and skew between adjacent nodes of o(loglogH)o(\log \log H) (0.77\approx 0.77 link delay uncertainties for H=2000H=2000). We conclude that the proposed system is a very promising clock distribution method. This leads to the open problem of a stochastic explanation of the tight concentration of delays and skews. More generally, we believe that understanding our very simple abstraction of the system is of mathematical interest in its own right

    Fast self-stabilizing byzantine tolerant digital clock synchronization

    Get PDF
    Consider a distributed network in which up to a third of the nodes may be Byzantine, and in which the non-faulty nodes may be subject to transient faults that alter their memory in an arbitrary fashion. Within the context of this model, we are interested in the digital clock synchronization problem; which consists of agreeing on bounded integer counters, and increasing these counters regularly. It has been postulated in the past that synchronization cannot be solved in a Byzantine tolerant and self-stabilizing manner. The first solution to this problem had an expected exponential convergence time. Later, a deterministic solution was published with linear convergence time, which is optimal for deterministic solutions. In the current paper we achieve an expected constant convergence time. We thus obtain the optimal probabilistic solution, both in terms of convergence time and in terms of resilience to Byzantine adversaries

    Fault Tolerant Gradient Clock Synchronization

    Full text link
    Synchronizing clocks in distributed systems is well-understood, both in terms of fault-tolerance in fully connected systems and the dependence of local and global worst-case skews (i.e., maximum clock difference between neighbors and arbitrary pairs of nodes, respectively) on the diameter of fault-free systems. However, so far nothing non-trivial is known about the local skew that can be achieved in topologies that are not fully connected even under a single Byzantine fault. Put simply, in this work we show that the most powerful known techniques for fault-tolerant and gradient clock synchronization are compatible, in the sense that the best of both worlds can be achieved simultaneously. Concretely, we combine the Lynch-Welch algorithm [Welch1988] for synchronizing a clique of nn nodes despite up to f<n/3f<n/3 Byzantine faults with the gradient clock synchronization (GCS) algorithm by Lenzen et al. [Lenzen2010] in order to render the latter resilient to faults. As this is not possible on general graphs, we augment an input graph G\mathcal{G} by replacing each node by 3f+13f+1 fully connected copies, which execute an instance of the Lynch-Welch algorithm. We then interpret these clusters as supernodes executing the GCS algorithm, where for each cluster its correct nodes' Lynch-Welch clocks provide estimates of the logical clock of the supernode in the GCS algorithm. By connecting clusters corresponding to neighbors in G\mathcal{G} in a fully bipartite manner, supernodes can inform each other about (estimates of) their logical clock values. This way, we achieve asymptotically optimal local skew, granted that no cluster contains more than ff faulty nodes, at factor O(f)O(f) and O(f2)O(f^2) overheads in terms of nodes and edges, respectively. Note that tolerating ff faulty neighbors trivially requires degree larger than ff, so this is asymptotically optimal as well

    Self-stabilising Byzantine Clock Synchronisation is Almost as Easy as Consensus

    Get PDF
    We give fault-tolerant algorithms for establishing synchrony in distributed systems in which each of the nn nodes has its own clock. Our algorithms operate in a very strong fault model: we require self-stabilisation, i.e., the initial state of the system may be arbitrary, and there can be up to f<n/3f<n/3 ongoing Byzantine faults, i.e., nodes that deviate from the protocol in an arbitrary manner. Furthermore, we assume that the local clocks of the nodes may progress at different speeds (clock drift) and communication has bounded delay. In this model, we study the pulse synchronisation problem, where the task is to guarantee that eventually all correct nodes generate well-separated local pulse events (i.e., unlabelled logical clock ticks) in a synchronised manner. Compared to prior work, we achieve exponential improvements in stabilisation time and the number of communicated bits, and give the first sublinear-time algorithm for the problem: - In the deterministic setting, the state-of-the-art solutions stabilise in time Θ(f)\Theta(f) and have each node broadcast Θ(flogf)\Theta(f \log f) bits per time unit. We exponentially reduce the number of bits broadcasted per time unit to Θ(logf)\Theta(\log f) while retaining the same stabilisation time. - In the randomised setting, the state-of-the-art solutions stabilise in time Θ(f)\Theta(f) and have each node broadcast O(1)O(1) bits per time unit. We exponentially reduce the stabilisation time to logO(1)f\log^{O(1)} f while each node broadcasts logO(1)f\log^{O(1)} f bits per time unit. These results are obtained by means of a recursive approach reducing the above task of self-stabilising pulse synchronisation in the bounded-delay model to non-self-stabilising binary consensus in the synchronous model. In general, our approach introduces at most logarithmic overheads in terms of stabilisation time and broadcasted bits over the underlying consensus routine.Comment: 54 pages. To appear in JACM, preliminary version of this work has appeared in DISC 201

    Attack Resilient Pulse Based Synchronization

    Get PDF
    Synchronization of pulse-coupled oscillators (PCOs) has gained significant attention recently due to increased applications in sensor networks and wireless communications. However, most existing results are obtained in the absence of malicious attacks. Given the distributed and unattended nature of wireless sensor networks, it is imperative to enhance the resilience of pulse-based synchronization against malicious attacks. To achieve this goal, we first show that by using a carefully designed phase response function (PRF), pulse-based synchronization of PCOs can be guaranteed despite the presence of a stealthy Byzantine attacker, even when legitimate PCOs have different initial phases. Next, we propose a new pulse-based synchronization mechanism to improve the resilience of pulse-based synchronization to multiple stealthy Byzantine attackers. We rigorously characterize the condition for mounting stealthy Byzantine attacks under the proposed new pulse-based synchronization mechanism and prove analytically that synchronization of legitimate oscillators can be achieved even when their initial phases are unrestricted, i.e., randomly distributed in the entire oscillation period. Since most existing results on resilient pulse-based synchronization are obtained only for all-to-all networks, we also propose a new pulse-based synchronization mechanism to improve the resilience of pulse-based synchronization that is applicable under general connected topologies. Under the proposed synchronization mechanism, we prove that synchronization of general connected legitimate PCOs can be guaranteed in the presence of multiple stealthy Byzantine attackers, irrespective of whether the attackers collude with each other or not. The new mechanism can guarantee resilient synchronization even when the initial phases of legitimate oscillators are distributed in a half circle. Then, to relax the limitation of the stealthy attacker model and the constraint on the legitimate oscillators\u27 initial phase distribution, we improved our synchronization mechanism and proved that finite time synchronization of legitimate oscillators can be guaranteed in the presence of multiple Byzantine attackers who can emit attack pulses arbitrarily without any constraint except that practical bit rate constraint renders the number of pulses from an attacker to be finite. The improved mechanism can guarantee synchronization even when the initial phases of all legitimate oscillators are arbitrarily distributed in the entire oscillation period. The new attack resilient pulse-based synchronization approaches in this dissertation are in distinct difference from most existing attack-resilient synchronization algorithms (including the seminal paper from Lamport and Melliar-Smith [1]) which require a priori (almost) synchronization among all legitimate nodes. Numerical simulations are given to confirm the theoretical results

    Hazard-free clock synchronization

    Get PDF
    The growing complexity of microprocessors makes it infeasible to distribute a single clock source over the whole processor with a small clock skew. Hence, chips are split into multiple clock regions, each covered by a single clock source. This poses a problem for communication between these clock regions. Clock synchronization algorithms promise an advantage over state-of-the-art solutions, such as GALS systems. When clock regions are synchronous the communication latency improves significantly over handshake-based solutions. We focus on the implementation of clock synchronization algorithms. A major obstacle when implementing circuits on clock domain crossings are hazardous signals. We can formally define hazards by extending the Boolean logic by a third value u. In this thesis, we describe a theory for designing and analyzing hazard-free circuits. We develop strategies for hazard-free encoding and construction of hazard-free circuits from finite state machines. Furthermore, we discuss clock synchronization algorithms and a possible combination of them. In the end, we present two implementations of the GCS algorithm by Lenzen, Locher, and Wattenhofer (JACM 2010). We prove by rigorous analysis that the systems implement the algorithm. The theory described above is used to prove that our clock synchronization circuits are hazard-free (in the sense that they compute the most precise output possible). Simulation of our GCS system shows that it achieves a skew between neighboring clock regions that is smaller than a few inverter delays.Aufgrund der zunehmenden Komplexität von Mikroprozessoren ist es unmöglich, mit einer einzigen Taktquelle den gesamten Prozessor ohne großen Versatz zu takten. Daher werden Chips in mehrere Regionen aufgeteilt, die jeweils von einer einzelnen Taktquelle abgedeckt werden. Dies stellt ein Problem für die Kommunikation zwischen diesen Taktregionen dar. Algorithmen zur Taktsynchronisation bieten einen Vorteil gegenüber aktuellen Lösungen, wie z.B. GALS-Systemen. Synchronisiert man die Taktregionen, so verbessert sich die Latenz der Kommunikation erheblich. In Schaltkreisen zwischen zwei Taktregionen können undefinierte Signale, sogenannte Hazards auftreten. Indem wir die boolesche Algebra um einen dritten Wert u erweitern, können wir diese Hazards formal definieren. In dieser Arbeit zeigen wir eine Methode zum Entwurf und zur Analyse von hazard-freien Schaltungen. Wir entwickeln Strategien für Kodierungen die Hazards vermeiden und zur Konstruktion von hazard-freien Schaltungen. Darüber hinaus stellen wir Algorithmen Taktsynchronisation vor und wie diese kombiniert werden können. Zum Schluss stellen wir zwei Implementierungen des GCS-Algorithmus von Lenzen, Locher und Wattenhofer (JACM 2010) vor. Oben genannte Mechanismen werden verwendet, um formal zu beweisen, dass diese Implementierungen korrekt sind. Die Implementierung hat keine Hazards, das heißt sie berechnet die bestmo ̈gliche Ausgabe. Anschließende Simulation der GCS Implementierung erzielt einen Versatz zwischen benachbarten Taktregionen, der kleiner als ein paar Gatter-Laufzeiten ist
    corecore