PALS: Plesiochronous and Locally Synchronous Systems by Bund, Johannes et al.
arXiv:2003.05542v1 [cs.DC] 11 Mar 2020
PALS: Plesiochronous and Locally Synchronous Systems
Johannes Bund∗†, Matthias Függer‡, Christoph Lenzen∗, Moti Medina§ and Will Rosenbaum∗
∗MPI for Informatics, Saarland Informatics Campus
{jbund, clenzen, wrosenba}@mpi-inf.mpg.de
†Saarbrücken Graduate School of Computer Science
‡CNRS & LSV, ENS Paris-Saclay, Université Paris-Saclay & Inria
mfuegger@lsv.fr
§Ben-Gurion University of the Negev
medinamo@bgu.ac.il
Abstract—Consider an arbitrary network of communicating
modules on a chip, each requiring a local signal telling it when to
execute a computational step. There are three common solutions
to generating such a local clock signal: (i) by deriving it from a
single, central clock source, (ii) by local, free-running oscillators,
or (iii) by handshaking between neighboring modules.
Conceptually, each of these solutions is the result of a perceived
dichotomy in which (sub)systems are either clocked or fully
asynchronous, suggesting that the designer’s choice is limited
to deciding where to draw the line between synchronous and
asynchronous design.
In contrast, we take the view that the better question to
ask is how synchronous the system can and should be. Based
on a distributed clock synchronization algorithm, we present
a novel design providing modules with local clocks whose
frequency bounds are almost as good as those of corresponding
free-running oscillators, yet neighboring modules are guaranteed
to have a phase offset substantially smaller than one clock
cycle. Concretely, parameters obtained from a 15 nm ASIC
implementation running at 2GHz yield mathematical worst-case
bounds of 30 ps on phase offset for a 32×32 node grid network.
Index Terms—gradient clock synchronization, clocking, GALS
I. INTRODUCTION AND RELATED WORK
At surface level, the synchronous and asynchronous design
paradigms seem to be opposing extremes. In their most pure
forms, this is true: Early synchronous systems would wait for
a clock signal to be propagated throughout the system and all
computations of the current clock cycle to complete before
moving on to the next; and delay-insensitive circuits make
no assumptions on timing whatsoever, explicitly acknowledging
completion of any computational step.
In reality, however, fully synchronous or asynchronous sys-
tems are the exception. It has long since become impractical to
wait for the clock to propagate across a chip, and there are nu-
merous clock domains and asynchronous interfaces in any off-
the-shelf “synchronously” clocked computer [1]. On the other
hand, delay-insensitive circuits [2] suffer from substantial
computational limitations [3], [4], [5] and provide no timing
guarantees, rendering them unsuitable for many applications –
in particular the construction of a general-purpose computer.
Accordingly, most real-world “asynchronous” systems will
utilize timing assumptions on some components, which in fact
could be used to construct a (possibly very primitive) clock.
As systems grow in size – physically or due to further minia-
turization – maintaining the illusion of perfect synchronism
becomes increasingly challenging. Due to various scalability
issues, more and more compromises are made. One well-known
compromise that has gained popularity in recent years is the
Globally Asynchronous Locally Synchronous (GALS) approach
[6], [7]. Here, several clock domains are independently
clocked and communicate asynchronously via handshakes,
where synchronizers are used to ensure sufficiently reliable
clock domain crossing [7], [8]. While this approach resolves
important scalability issues, arguably it does so by surren-
dering to them: between clock domains, all interaction is
asynchronous. However, fixing a sufficiently small probabil-
ity of synchronizer failure, communication latency becomes
bounded, permitting bounded response times to internal and
external events. Yet, as timing relations between different
clock domains remain desirable, GALS systems with guaran-
teed frequency relations between clock domains (but without
any bound on their phase offsets), so-called mesochronous
architectures, have been conceived [7].
One might think that GALS systems exemplify a funda-
mental struggle between the synchronous and asynchronous
paradigms. We argue that this dichotomy is false! Rather,
choices between clocked and clockless designs are driven
by tradeoffs between guarantees on response times, cost (in
terms of energy, buffer size, area, etc.), and complexity of
development. Ideally, we would like to provide the convenient
synchronous abstraction to the developer, yet have the system
respond quickly to external and internal events. Unfortunately,
existing approaches are less than ideal in this regard:
• Centralized clocking does not scale. In large systems, the
resulting timing guarantees become too loose (forcing the
system to run slowly). Indeed, it has been shown that the
achievable local skew, i.e., maximum phase offset between
neighbors, in a grid grows linearly with the width of the
grid; see Section V-B.
• A system-wide asynchronous design results in challenging
development, especially when tight timing constraints are
to be met. While in a clocked system one can bound
response times by bounding the number of clock cycles
for computation and communication, analyzing the (worst-
case) response time of a large-scale asynchronous system
has to be performed bottom-up. In addition, without highly
constraining design rules, it is difficult to ensure that waiting
for acknowledgements does not delay the response to a high-
priority local event or an external request for a significant
time. Causal acknowledge chains can span the entire system,
potentially resulting in waiting times that grow linearly with
the system diameter.
• A GALS design ostensibly does not suffer from these
issues, as each clock domain can progress on its own due
to independent clocks (see footnote 1). However, clock domain crossings
require synchronizers, incurring 2 or more clock cycles
of additional latency. If synchronizers are placed in the
data path, communication becomes slow, even if a simple
command is to be spread across the chip or information is
acquired from an adjacent clock domain.
• Alternative solutions that do not require synchronizers in
the data path have been proposed in [11], [12]. The designs
either skip clock cycles or switch to a clock signal shifted
by half a period when transmitter and receiver clocks risk
violating setup/hold conditions. The indicating signal is
synchronized without adding latency to the datapath.
Depending on the implementation and intended guarantees,
the additional latency is in the order of a clock period.
While this can, in principle, be brought down to the order
of setup/hold-windows, such designs would require consid-
erable logical overhead and fine-tuning of delays. Further,
note that an application of such a scheme has to periodically
insert no-data packets. An application-level transmission
may be delayed by such a timeslot. In [11] this additional
delay can be up to two periods when the no-data packet is
oversampled. Finally, note that a potential application that
runs on top of this scheme and uses handshaking to make
sure all its packets of a (logical) time step have arrived
before the next time step is locally initiated faces the same
problem as a fully asynchronous design, i.e., that the worst-
case waiting time between consecutive time steps grows
linearly with the system diameter.
Our Contribution: In this work, we present a radically
different approach. By using a distributed clock synchroniza-
tion algorithm, we essentially create a single, system-wide
clock domain without needing to spread a clock signal from
a single dedicated source with small skew. We employ results
on gradient clock synchronization (GCS) by Lenzen et al. [13],
in which the goal is to minimize the worst-case clock skew
between adjacent nodes in a network. In our setting, the
modules correspond to nodes, and they are connected by an
edge if they directly communicate (i.e., exchange data). Thus,
nodes of the clock synchronization algorithm communicate
only if the respective nodes exchange data for computational
functionality. This leads to an easy integration of our algorithm
into the existing communication infrastructure.
Footnote 1: This is different for designs with pausible clocks [9], [10],
rendering them even more problematic in this context.
The algorithm provides strong parametrized guarantees.
Consider a network of local clocks that are controlled by our
GCS algorithm. Let D be the diameter of the network. Further,
let ρ be the (unintended) drift of the local clock, µ > 2ρ a
freely chosen constant, and δ an upper bound on how precisely
the phase difference to neighbors is known. Then:
• The synchronized clocks are guaranteed to run at normalized
rates between 1 and (1 + µ)(1 + ρ).
• The local skew is bounded by O(δ log_{µ/ρ} D).
• The global skew, i.e., the maximum phase offset between
any two nodes in the system, is O(δD).
In other words, the synchronized clocks are almost as good as
free-running clocks with drift ρ, yet the local skew grows only
logarithmically in the chip’s diameter. The local and global
skew bounds are optimal up to roughly factor 2 [13].
As a novel theoretical result, we improve the global skew
bound by roughly factor 2 compared to [14]. This improve-
ment brings our theoretical worst-case skew to within a factor
of roughly 2 of the theoretical optimum (which is only
known to be achieved by a significantly more complicated
mechanism [13]). As a second theoretical contribution, we
prove that a minor modification of the algorithm reduces the
obtained local skew bound by an additive 2δ.
We can control the base of the logarithm in the local skew
bound by choosing µ. Picking, e.g., µ = 100ρ means that
log_{µ/ρ} D ≤ 1 for any D ≤ 100. Of course, the constants
hidden in the O-notation matter, but they are reasonably small.
Concretely, for a grid network of 32× 32 nodes in the 15 nm
FinFET-based Nangate OCL [15], 2GHz clock sources with an
assumed drift of ρ = 10^{-5}, and µ = 10^{-3}, our simple sample
implementation guarantees that δ ≤ 5 ps in the worst case.
The resulting local skew is 30 ps, well below a clock cycle.
We stress that this enables much faster communication than
for handshake-based solutions incurring synchronizer delay.
Note that locking the local oscillators to a common stable
reference does not require balancing the respective path
delays, implying that our assumed ρ is very pessimistic. Smaller
ρ (while keeping µ fixed) increases the base of the logarithm,
further improving scalability. To show that the asymptotic
behavior is relevant already to current systems and with our
pessimistic ρ, we compare the above results to skews obtained
by clock trees in the same grid networks in Section V-B.
Organization of this paper: We present the GCS algo-
rithm in Section II, stating worst-case bounds on the local
and global skews proved in the appendix. We then break
down the algorithm into modules in Section III and discuss
their implementation in Section IV. Section V presents Spice
simulations for a network of four nodes organized in a line
and compares them to clock trees. We conclude in Section VI.
II. ALGORITHM
A. High-level Description
We give a high level description of our algorithm that
achieves close synchronization between neighboring nodes in
a network. We model the network as an undirected graph
G = (V,E) where V is the set of nodes, and E is the
set of edges (or links). Abstractly, we think of each node
v as maintaining a logical clock, which we view as a function
$L_v : \mathbb{R}_{\ge 0} \to \mathbb{R}$. That is, for each (Newtonian) time
t, Lv(t) is v’s logical clock value at time t. The local
skew is the maximum clock difference between neighbors:
$L(t) = \max_{\{v,w\} \in E} |L_v(t) - L_w(t)|$. The global skew is
the maximum clock difference between any two nodes in the
network: $G(t) = \max_{v,w \in V} |L_v(t) - L_w(t)|$. The goal of
our algorithm is for each node to compute a logical clock
Lv(t) minimizing L(t) at all times t, subject to the condition
that all logical clocks progress at least at (normalized) rate 1 (see footnote 2).
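The two skew measures can be computed directly from a snapshot of logical clock values. The following is a minimal sketch under stated assumptions: the function names, the dictionary representation of clocks, and the sample values (in ps) are illustrative and do not come from the paper.

```python
def local_skew(clocks, edges):
    """Maximum |L_v - L_w| over all edges {v, w} of the network."""
    return max(abs(clocks[v] - clocks[w]) for v, w in edges)


def global_skew(clocks):
    """Maximum |L_v - L_w| over all pairs of nodes."""
    values = list(clocks.values())
    return max(values) - min(values)


# Hypothetical snapshot of logical clock values (ps) on a 4-node line.
clocks = {0: 100.0, 1: 104.0, 2: 109.0, 3: 111.0}
edges = [(0, 1), (1, 2), (2, 3)]

print(local_skew(clocks, edges))  # 5.0 (between nodes 1 and 2)
print(global_skew(clocks))        # 11.0 (between nodes 0 and 3)
```

The local skew is never larger than the global skew, since every edge is also a pair of nodes.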
We assume that each node v has an associated reference
clock signal, which we refer to as v’s hardware clock, denoted
Hv(t). For notational convenience (see footnote 3), we assume
that the minimum (normalized) rate of Hv is 1, and its maximum
rate is 1 + ρ: for all v ∈ V and t, t′ ∈ R≥0 with t ≤ t′,
$$t' - t \le H_v(t') - H_v(t) \le (1 + \rho)(t' - t). \tag{1}$$
To compute a logical clock, after initially setting Lv(0) =
Hv(0), v adjusts the rate of Lv relative to the rate of Hv
(where this rate itself is neither known to nor under the
influence of the algorithm). Specifically, v can be either in
slow mode or fast mode. In slow mode, Lv runs at the same
rate as Hv , while in fast mode, v sets the rate of Lv to be 1+µ
times the one of its hardware clock. Here, µ is a parameter
fixed by the designer. In order for the algorithm to work, a fast
node must always run faster than a slow node—i.e., µ > ρ.
We impose the stronger condition that µ > 2ρ.
The GCS algorithm of Lenzen et al. [13] specifies condi-
tions for a node to be in slow or fast mode that ensure asymp-
totically optimal local skew, provided that the global skew is
bounded. The algorithm is parametrized by a variable κ ∈ R+,
whose value determines the quality of synchronization.
Definition 1. Let κ ∈ R+ be a parameter. We say that a node
v satisfies the fast condition at time t if there exists a natural
number s ∈ N such that the following two conditions hold:
FC1 v has a neighbor x such that Lx(t)−Lv(t) ≥ (2s+1)κ
FC2 all of v’s neighbors y satisfy Lv(t)−Ly(t) ≤ (2s+1)κ.
It satisfies the slow condition if there exists s ∈ N such that:
SC1 v has a neighbor x such that Lv(t)− Lx(t) ≥ 2sκ
SC2 all of v’s neighbors y satisfy Ly(t)− Lv(t) ≤ 2sκ.
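The fast and slow conditions of Definition 1 can be checked mechanically for a given set of clock values. The sketch below makes two assumptions that are not stated in the text: N is taken to include 0, and the search over s is capped at a hypothetical bound `s_max`. Function and parameter names are illustrative.

```python
def satisfies_fast_condition(l_v, neighbor_clocks, kappa, s_max=64):
    """FC1 and FC2 of Def. 1 for some s in {0, ..., s_max} (assumed range)."""
    ahead = max(l_x - l_v for l_x in neighbor_clocks)   # largest lead of a neighbor
    behind = max(l_v - l_y for l_y in neighbor_clocks)  # largest lag of a neighbor
    return any(
        ahead >= (2 * s + 1) * kappa and behind <= (2 * s + 1) * kappa
        for s in range(s_max + 1)
    )


def satisfies_slow_condition(l_v, neighbor_clocks, kappa, s_max=64):
    """SC1 and SC2 of Def. 1 for some s in {0, ..., s_max} (assumed range)."""
    ahead = max(l_x - l_v for l_x in neighbor_clocks)
    behind = max(l_v - l_y for l_y in neighbor_clocks)
    return any(
        behind >= 2 * s * kappa and ahead <= 2 * s * kappa
        for s in range(s_max + 1)
    )


kappa = 10  # hypothetical value, in ps

# One neighbor is 12 ahead, the other 5 behind: v should catch up.
print(satisfies_fast_condition(0, [12, -5], kappa))  # True
print(satisfies_slow_condition(0, [12, -5], kappa))  # False

# One neighbor is 25 behind, the other only 3 ahead: v should wait.
print(satisfies_fast_condition(0, [-25, 3], kappa))  # False
print(satisfies_slow_condition(0, [-25, 3], kappa))  # True
```

In both examples exactly one of the two conditions holds; in general a node may satisfy neither, in which case Definition 2 leaves its rate free within the allowed range.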
Definition 2. We say that an algorithm is a GCS algorithm
with parameters ρ, µ, κ if the following invariants hold, for
every node v ∈ V and all times t, t′:
I1 µ > ρ,
I2 $H_v(t') - H_v(t) \le L_v(t') - L_v(t) \le (1 + \mu)(H_v(t') - H_v(t))$,
I3 if v satisfies the fast condition throughout the interval [t, t′],
then $L_v(t') - L_v(t) = (1 + \mu)(H_v(t') - H_v(t))$,
I4 if v satisfies the slow condition throughout the interval
[t, t′], then $L_v(t') - L_v(t) = H_v(t') - H_v(t)$.
Footnote 2: Without the minimum rate requirement, the task becomes trivial: all nodes
can simply set Lv(t) = 0 for all times t to achieve perfect “synchronization.”
Footnote 3: It is common to assume a two-sided frequency error, i.e., a rate between
1 − ρ and 1 + ρ. However, the one-sided notation simplifies expressions.
Translating between the two models is a straightforward renormalization.
Invariants (I3) and (I4) still allow a node’s clock Lv(t) to
vary within the rates of the underlying hardware clock, which
is assumed not to be under the control of the algorithm.
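Theorem 1. Suppose algorithm A is a GCS algorithm. Then
A maintains global skew
$$G(t) \le \frac{\mu \kappa D}{\mu - 2\rho}$$
and local skew
$$L(t) \le \left( 2 \left\lceil \log_{\mu/\rho} \frac{\mu D}{\mu - 2\rho} \right\rceil + 1 \right) \kappa$$
for all sufficiently large t.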
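To get a feeling for the magnitudes in Theorem 1, the bounds can be evaluated numerically. The sketch below plugs in the grid example from the introduction (ρ = 10^{-5}, µ = 10^{-3}, δ ≤ 5 ps). Two inputs are assumptions made here for illustration: the diameter of a 32×32 grid is taken to be D = 62 (hop distance between opposite corners), and κ is set to 10 ps as a stand-in for a value just above 2δ. The function name is hypothetical.

```python
import math


def gcs_skew_bounds(mu, rho, kappa, diameter):
    """Evaluate the global and local skew bounds of Theorem 1."""
    if mu <= 2 * rho:
        raise ValueError("the bounds require mu > 2 * rho")
    x = mu * diameter / (mu - 2 * rho)
    global_bound = x * kappa
    # Number of levels: ceil(log_{mu/rho}(x)). Floating-point rounding could
    # matter if x is an exact power of mu/rho; it is not in this example.
    levels = math.ceil(math.log(x) / math.log(mu / rho))
    local_bound = (2 * levels + 1) * kappa
    return global_bound, local_bound


# Assumed inputs: D = 62 for a 32x32 grid, kappa = 10 ps (roughly 2 * delta).
g, l = gcs_skew_bounds(mu=1e-3, rho=1e-5, kappa=10.0, diameter=62)
print(f"global skew bound: about {g:.1f} ps")
print(f"local skew bound:  {l:.1f} ps")
```

Under these assumptions the logarithm rounds up to 1, so the local bound evaluates to 3κ = 30 ps, which agrees with the figure quoted in the introduction.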
Remark 1. The precise local and global skew bounds
achieved by a GCS algorithm at an arbitrary time t depend
on the initial state of the system. GCS algorithms are self-
stabilizing in the sense that starting from an arbitrary initial
state, the algorithm will eventually achieve the skew bounds
claimed in Thm. 1 (see [16]). In the appendix, we analyze the speed
of convergence as a function of the local skew at initialization.
In order to fulfill the invariants of a GCS algorithm, each
node v maintains estimates of the offsets to neighboring
clocks. Specifically, for each neighboring node w, v computes
an offset estimate Ôw(t) ≈ Lw(t) − Lv(t). Given offset
estimates for each neighbor, the synchronization algorithm
determines if v should run in fast mode by checking if the
fast trigger (FT) is satisfied, as defined below. The trigger is
parametrized by variables κ (as in the GCS algorithm) and δ,
whose values are determined by the quality of estimates of
neighboring clock values.
Definition 3. We say that v satisfies the fast trigger, FT, if
there exists s ∈ N such that the following conditions hold:
FT1 Ômax ≥ (2s+ 1)κ− δ,
FT2 Ômin ≥ −(2s+ 1)κ− δ.
We are now in the position to formalize our GCS algorithm,
OffsetGCS (Algorithm 1). OffsetGCS is simple: at each
time, each node checks if it satisfies FT. If so, it runs in
fast mode. Otherwise, the node runs in slow mode. As the
decision to run fast or slow is a discrete decision, a hardware
implementation will be prone to metastability [17]. We discuss
how to work around this problem in Section III.
Algorithm 1 OffsetGCS algorithm for node v
1: At each time t do
2:   Ômin ← minw{Ôw(t) | w is a neighbor of v}
3:   Ômax ← maxw{Ôw(t) | w is a neighbor of v}
4:   if v satisfies FT then
5:     # fast mode (rate in [(1 + µ), (1 + ρ)(1 + µ)])
6:     rate of Lv ← (1 + µ) · rate of Hv
7:   else
8:     # slow mode (rate in [1, (1 + ρ)])
9:     rate of Lv ← rate of Hv
In what follows, we show that for a suitable choice of
parameters, OffsetGCS is a GCS algorithm in the sense
of Def. 2. Thus, OffsetGCS maintains the skew bounds
of Thm. 1.
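The behavior of OffsetGCS can be explored with a small discrete-time simulation. This is only a sketch under simplifying assumptions that are not part of the paper: time advances in fixed steps `dt`, offset estimates are exact (no measurement error), hardware clock rates are constant, and the search over s in the fast trigger is capped at a hypothetical `s_max`. The parameter values in the usage example are exaggerated for visibility.

```python
def fast_trigger(o_min, o_max, kappa, delta, s_max=8):
    """FT of Def. 3: some s satisfies both FT1 and FT2 (s capped at s_max)."""
    return any(
        o_max >= (2 * s + 1) * kappa - delta
        and o_min >= -(2 * s + 1) * kappa - delta
        for s in range(s_max + 1)
    )


def simulate(logical, edges, hw_rates, mu, kappa, delta, dt, steps):
    """Advance all logical clocks for `steps` rounds of length `dt`."""
    neighbors = {v: [] for v in logical}
    for v, w in edges:
        neighbors[v].append(w)
        neighbors[w].append(v)
    clocks = dict(logical)
    for _ in range(steps):
        updated = {}
        for v in clocks:
            # Exact offsets stand in for the estimates \hat{O}_w(t).
            offsets = [clocks[w] - clocks[v] for w in neighbors[v]]
            fast = fast_trigger(min(offsets), max(offsets), kappa, delta)
            rate = hw_rates[v] * ((1 + mu) if fast else 1.0)
            updated[v] = clocks[v] + rate * dt
        clocks = updated
    return clocks


# Hypothetical 4-node line in which node 3 starts 30 time units ahead.
start = {0: 0.0, 1: 0.0, 2: 0.0, 3: 30.0}
edges = [(0, 1), (1, 2), (2, 3)]
rates = {v: 1.0 for v in start}
final = simulate(start, edges, rates, mu=0.1, kappa=10, delta=4, dt=1.0, steps=200)

skew_before = max(abs(start[v] - start[w]) for v, w in edges)
skew_after = max(abs(final[v] - final[w]) for v, w in edges)
print(skew_before, skew_after)
```

In this run the nodes behind node 3 switch to fast mode one after another, so the initial gap of 30 is spread over the line instead of remaining concentrated on a single edge.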
B. Analysis of the OffsetGCS algorithm
We denote an upper bound on the overall uncertainty of v’s
estimate of w by δ:
$$\left| \hat{O}_w(t) - (L_w(t) - L_v(t)) \right| \le \delta. \tag{2}$$
In our analysis, it will be helpful to distinguish two sources
of uncertainty faced by any implementation of the GCS algo-
rithm. The first is the propagation delay uncertainty, which is
the absolute timing variation in signal propagation adding to
the measurement error. We use the parameter δ0 > 0 to denote
an upper bound on this value.
The second source of error is the time between initiating a
measurement and actually “using” it in control of the logical
clock speed. During this time, the logical clocks advance at
rates that are not precisely known. Here, we can exploit that
the maximum rate difference between any two logical clocks
is (1+ρ)(1+µ)−1 = ρ+µ+ρµ. Thus, denoting the maximum
end-to-end latency by Tmax, this contributes an error of at most
(ρ+ µ+ ρµ)Tmax at any given time. Time Tmax includes the
time for the logical clock to respond to the control signal.
Once suitable values of δ0 and Tmax are determined, δ can
be computed easily.
Lemma 1. With δ = δ0+(ρ+µ+ρµ) ·Tmax, Ineq. (2) holds.
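As a quick numerical illustration of Lemma 1, the sketch below evaluates δ for the drift and latency values quoted in Section IV (ρ ≈ 10^{-5}, µ = 10^{-4}, Tmax of 775 ps). The value of δ0 is a hypothetical placeholder chosen only for the example, and the function name is illustrative.

```python
def estimate_uncertainty(delta0, rho, mu, t_max):
    """delta = delta0 + (rho + mu + rho * mu) * T_max, as in Lemma 1."""
    return delta0 + (rho + mu + rho * mu) * t_max


# delta0 = 4.0 ps is an assumed value; the other inputs follow Section IV.
delta = estimate_uncertainty(delta0=4.0, rho=1e-5, mu=1e-4, t_max=775.0)
drift_part = delta - 4.0

print(f"drift contribution: {drift_part:.4f} ps")
print(f"delta: {delta:.4f} ps, so kappa must exceed {2 * delta:.4f} ps")
```

With these inputs the drift term contributes well under 0.1 ps, so δ is dominated by the propagation delay uncertainty δ0.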
Based on δ, we now seek to choose κ as small as possible
to realize the invariants given in Def. 2. The basic idea is
to ensure that if a node v satisfies the fast condition at time
t (which depends on the unknown phase difference), then it
must satisfy the fast trigger (which is expressed in terms of
the estimates Ôw), thus ensuring that v is in fast mode at time
t. In turn, if the slow condition is not satisfied, we must make
sure that the fast trigger does not hold either.
Lemma 2. Suppose for all times t an implementation of
OffsetGCS satisfies (2). Then for any
κ > 2δ (3)
and µ > ρ, OffsetGCS is a GCS algorithm.
Proof. We verify the conditions of Def. 2. Conditions I1
and I2 are direct consequences of the algorithm specifica-
tion. For Condition I3, suppose first that v satisfies the fast
condition at time t. Therefore, there exists some s ∈ N
and neighbor w of v such that Lw(t) − Lv(t) ≥ (2s + 1)κ.
Therefore, by Ineq. (2), Ôw(t) ≥ (2s + 1)κ − δ, so that FT1
is satisfied. Similarly, since v satisfies the fast condition, all
of its neighbors x satisfy Lv(t) − Lx(t) ≤ (2s + 1)κ.
Therefore, by Ineq. (2), Ôx(t) ≥ −(2s + 1)κ − δ, hence FT2 is
satisfied for the same value of s and v runs in fast mode at time t.
It remains to show that if v satisfies the slow condition at
time t, then it does not satisfy FT at time t (and, accordingly,
is in slow mode). To this end, suppose to the contrary that v
satisfies FT at t. Since v satisfies the slow condition at time t,
there exists some s ∈ N with
$$\exists x : L_v(t) - L_x(t) \ge 2s\kappa \tag{4}$$
$$\forall y : L_y(t) - L_v(t) \le 2s\kappa. \tag{5}$$
Since v is assumed to satisfy FT at time t, combining FT1
and FT2 with (2) implies that there exists some s′ ∈ N with
$$\exists x : L_x(t) - L_v(t) \ge (2s' + 1)\kappa - 2\delta \tag{6}$$
$$\forall y : L_v(t) - L_y(t) \le (2s' + 1)\kappa + 2\delta. \tag{7}$$
Combining (5) and (6), we must have
$$(2s' + 1)\kappa - 2\delta \le 2s\kappa,$$
hence 2s′κ ≤ 2sκ − κ + 2δ. Since 2δ < κ, the previous
expression implies that s′ < s. Similarly, combining (4)
and (7) gives 2sκ ≤ (2s′ + 1)κ + 2δ, hence
2sκ ≤ 2s′κ + κ + 2δ < 2(s′ + 1)κ. Thus, s < s′ + 1, or equivalently
(since s and s′ are integers), s ≤ s′. However, this
contradicts s′ < s from before. Thus FT cannot be
satisfied at time t if the slow condition is satisfied at time t,
as desired.
Applying Thm. 1 and Lem. 2 we obtain:
Corollary 1. For suitable choices of parameters, OffsetGCS
maintains local skew
$$L(t) \le \left( 2 \left\lceil \log_{\mu/\rho} \left( \frac{\mu \cdot D}{\mu - 2\rho} \right) \right\rceil + 1 \right) \kappa.$$
III. MODULES
For a hardware implementation of the OffsetGCS algo-
rithm, we break down the distributed algorithm into modules.
Per node, this will be a local clock and a controller. Per link,
we have a time offset measurement module for each node
connected via the link. For each module we specify its input
and output ports, its functionality, and its delay. We further
relate the delay Tmax from Section II to the module delays.
A. Local Clock
The clock signal of node v is derived from a tunable local
clock oscillator. It has input MODEv , the mode signal (given by
the controller; see Section III-C), and output CLKv , the clock
signal. The mode signal MODEv is used to tune the frequency
of the oscillator within a factor of 1 + µ. An oscillator
responds within time Tosc ≥ 0, i.e., switching between the
two frequency modes takes at most Tosc time. We impose four
requirements on the local clock module:
(C1) The initial maximum local skew is bounded by c · κ for
a parameter c > 0 depending on the implementation of
the module.
(C2) If MODEv is constantly 0 (respectively 1) during [t −
Tosc, t], then the local oscillator is in slow (respectively
fast) mode at time t and the rate of the local oscillator is
in [1, 1 + ρ] (respectively [1 + µ, (1 + µ)(1 + ρ)]).
(C3) If MODEv is neither constantly 0 nor 1 during [t−Tosc, t],
then the local oscillator is unlocked and its rate is in
[1, (1 + µ)(1 + ρ)].
(C4) Clocks in slow mode are never faster than clocks in fast
mode, hence µ > ρ.
Note that if (C2) does not apply, i.e., the mode signal is not
stable, (C3) allows an arbitrary rate between fast and slow.
B. Time Offset Measurement
In order to check whether the FT conditions are met, a
node v needs to measure the current phase offset Ôw to
each of its neighbors w. This is achieved by a time offset
measurement module between v and each neighbor w. Note
that the algorithm does not require a full access to the function
Ôw, but only to the knowledge of whether Ôw has reached a
bounded number of thresholds – we elaborate on this shortly.
The inputs of the module are the clock signal of v and w.
The outputs of the module are defined as follows. Let S =
{0, ..., ℓ} with ℓ > 0. The output of the measurement module
is a binary string of 2(ℓ+1) bits: the first ℓ+1 bits,
denoted $Q^{i}_w$, correspond to s = i − 1 going from ℓ down to 0,
followed by a further ℓ+1 bits, denoted $Q^{-i}_w$, with s = i − 1
going from 0 up to ℓ. For example, a
module with S = {0, 1} has 4 outputs with thresholds 3κ − δ,
κ − δ, −κ − δ, and −3κ − δ.
Let ε > 0 be a (small) time. We require that output $Q^{\pm i}_w$ is
set to 1 if $\hat{O}_w(t) \ge \mp(2(i-1)+1)\kappa - \delta + \varepsilon$. Output $Q^{\pm i}_w$ is
set to 0 if $\hat{O}_w(t) \le \mp(2(i-1)+1)\kappa - \delta$. Otherwise, $Q^{\pm i}_w$ is
unconstrained, i.e., within {0, M, 1}. Here, M denotes a meta-/unstable
signal between logical values 0 and 1. Intuitively, ε
accounts for the setup/hold times present in any realistic
hardware implementation.
We further require that ε < 2κ. This guarantees that at most
one output is M at a time: Assume that bit $Q^{\pm i}_w$ is metastable;
then $\hat{O}_w(t) \in \mp(2(i-1)+1)\kappa - \delta + [0, \varepsilon]$. Since the adjacent
thresholds are 2κ away, their corresponding outputs are either
0 or 1. In fact, by Eq. (3) and since ε ≤ δ0 ≤ δ (we account
for setup/hold times in δ0), we get that ε < κ/2, hence our
requirement is satisfied.
Choosing ℓ ≥ (L/κ − 1)/2, where L is the guaranteed local skew
of the OffsetGCS algorithm, guarantees that the nodes will
always be within the module’s measurement range. Note that
L here needs to respect the initial local skew as well, i.e., L
here is given by the bound from Cor. 1 plus the local skew
on initialization (as we show in the appendix).
Given the above, the module outputs form a unary ther-
mometer code of the phase difference between v and w’s
clocks. Moreover, since this module decides whether each
threshold is met, any implementation of this module (see
Section IV) is inevitably susceptible to metastable upsets.
If implemented correctly, one can leverage the unary
thermometer output encoding to
guarantee that at most one bit is in a metastable state, located
conveniently between a prefix of 1’s and a suffix of 0’s.
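The output specification of the measurement module can be modeled in a few lines. The sketch below is an abstraction, not the circuit: it returns "M" whenever the specification leaves a bit unconstrained (a worst-case modeling choice made here), and it orders the bits as Q^{ℓ+1}, ..., Q^{1}, Q^{-1}, ..., Q^{-(ℓ+1)}. Function names and the numeric values are illustrative assumptions.

```python
import math


def min_levels(local_skew_bound, kappa):
    """Smallest integer l with l >= (L / kappa - 1) / 2."""
    return math.ceil((local_skew_bound / kappa - 1) / 2)


def measure_offset(offset, kappa, delta, eps, ell):
    """Thermometer code for an offset, as a string over {'0', '1', 'M'}."""

    def bit(threshold):
        if offset >= threshold + eps:
            return "1"
        if offset <= threshold:
            return "0"
        return "M"  # unconstrained by the specification

    # Q^{i}: threshold -(2(i-1)+1)*kappa - delta, for i = ell+1 down to 1.
    positive = [bit(-(2 * (i - 1) + 1) * kappa - delta) for i in range(ell + 1, 0, -1)]
    # Q^{-i}: threshold +(2(i-1)+1)*kappa - delta, for i = 1 up to ell+1.
    negative = [bit((2 * (i - 1) + 1) * kappa - delta) for i in range(1, ell + 2)]
    return "".join(positive + negative)


# Hypothetical parameters (ps): kappa = 10, delta = 4, eps = 1.
print(min_levels(30.0, 10.0))                # 1, i.e., S = {0, 1}
print(measure_offset(0, 10, 4, 1, ell=2))    # 111000 (perfectly synchronized)
print(measure_offset(6.5, 10, 4, 1, ell=2))  # 111M00 (inside a setup/hold window)
```

Because adjacent thresholds are 2κ apart and ε < 2κ, at most one position can fall into the unconstrained window, which is why every output has the form of ones, at most one M, then zeros.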
Let Tmeas denote the maximum end-to-end latency of the
measurement module, i.e., an upper bound on the elapsed time
from when $Q^{\pm i}_w$ is set to when the measurements are available
at the output. More precisely, if $Q^{\pm i}_w$ is set to x ∈ {0, 1}
for the entire duration of an interval [t − Tmeas, t], then the
corresponding output is x.
C. Controller
Each node v is equipped with a controller module. Its input
is the (thermometer encoded) time measurement for each of
v’s neighbors, i.e., the outputs of the time offsets measurement
module on each link connecting v to an adjacent node. It
outputs the mode signal MODEv.
Denote by Tcnt the maximum end-to-end delay of the con-
troller circuit, i.e., the delay between its inputs (the measure-
ment offset outputs) and its output MODEv . The specification
of the controller’s interface is as follows:
(L1) For t > Tcnt, if algorithm OffsetGCS continuously
maps the rate of v to fast mode (resp. slow mode) during
[t − Tcnt, t], then MODEv(t) = 1 (resp. MODEv(t) = 0).
(L2) In all other cases, the output at time t is arbitrary, i.e.,
any value from {0,M, 1}.
D. Putting it all together
The module specifications above, together, specify a realiza-
tion of the OffsetGCS algorithm in hardware. The parameters
of this hardware specification of OffsetGCS are: δ0, ρ, µ, and
Tmax, where Tmax = Tmeas + Tcnt + Tosc. These parameters
are mapped to parameters of Cor. 1 by applying Lem. 2.
IV. HARDWARE IMPLEMENTATION
We have implemented the modules from Section III and
compiled them into a system of 4 nodes, connected in a line
from node 0 to node 3. To resemble a realistically sparse spacing
of clocks, we placed nodes at distances of 200 µm. Target
technology was the 15 nm FinFET-based Nangate OCL [15].
The gate-level design was laid out and routed with Cadence
Encounter, which was also used for extraction of parasitics
and timing. Local clocks run at a frequency of approximately
2GHz, controllable within a factor of 1 + µ ≈ 1 + 10^{-4}. We
use µ = 10^{-4} here to make the interplay of ρ and µ better
visible in traces. We will discuss the gate-level design and its
performance measures in the following.
A. Gate-level Implementation
Figures 1a to 1c show the schematics of an implementation
of the time offsets measurement module (Figure 1a), and the
controller (Figures 1b and 1c).
As a local clock source, we used a ring oscillator with some
of its inverters being starved-inverters to set the frequency to
either fast mode or slow mode. Nominal frequency is around
2GHz, controllable by a factor 1 + µ ≈ 1 + 10^{-4} via the
MODEv signal. We choose ρ ≈ µ/10 ≈ 10^{-5}, assuming
a moderately stable oscillator. While this is below drifts
achievable with uncontrolled ring oscillators, one may lock
the frequency of the ring oscillator to a stable external quartz
oscillator, see e.g., [18]. For such an implementation, we only
require a stable frequency reference for local clocks; the phase
difference of the distributed clock signal between adjacent
nodes (which may be large) is immaterial. If distributing a
stable clock source to all nodes is not feasible or considered
too costly for a design, one may choose a larger µ resulting in
a larger local and global skew bound; see Thm. 1 and Cor. 1.
We measure the logical clock value Lv(t) in terms of the
time passed since its first active clock transition.
The time offset measurement module resembles a time to
digital converter (TDC) in both its structure and function.
[Figure 1 schematics: (a) delay lines for clkv and clkw with element delays 2κ and 5κ + δ, feeding six flip-flops with outputs Q^3_w, Q^2_w, Q^1_w, Q^{-1}_w, Q^{-2}_w, Q^{-3}_w; (b) gates computing Q^{-i}_max and Q^i_min from Q^{±i}_{w1}, ..., Q^{±i}_{wk}; (c) gates computing modev from Q^{-3}_max, Q^{-2}_max, Q^{-1}_max, Q^1_min, Q^2_min, Q^3_min.]
Figure 1: Gate-level implementation of the OffsetGCS algorithm’s modules. Sub-figure 1a shows a linear TDC-based circuit
for the module that measures the time offsets between nodes v and w. Buffers and inverters are used as delay elements; the
delay of each appears next to the corresponding element. Given node v’s time offsets to its neighbors, the circuit in
Sub-figure 1b computes the minimum and maximum threshold levels which have been reached. Sub-figure 1c shows the circuit
that computes if the FT conditions are satisfied, i.e., if there is an s ∈ {0, 1, 2} that satisfies both FT1 and FT2.
The upper delay line in Figure 1a, fed by remote clock w, is tapped
at intervals of 2κ. The lower delay line is used to shift the
module’s own local clock v to the middle of the delay line
(plus some δ offset) so that phase differences can be measured
both in the negative and positive direction. The module in
Figure 1a is instantiated for S = {0, 1, 2} with 6 taps for
threshold levels. In fact, in our hardware implementation we
set S = {0, 1}, as even for µ/ρ = 10 this is sufficient for
networks of diameter up to around 80 (see how to choose this
set of thresholds in the specification of this module in Sec. III).
If both clocks are perfectly synchronized, i.e., Lv = Lw,
then the state of the flip-flops will be
Q^3 Q^2 Q^1 Q^{-1} Q^{-2} Q^{-3} = 111000 after a rising transition
of CLKv. Now, assume that clock w is earlier than clock v,
say by a small ε > 0 more than κ − δ. Then
Lw = Lv + κ − δ + ε. For the moment
assuming that we do not make a measurement error, we get
Ôw = Lw − Lv = κ − δ + ε. From the delays in Figure 1a one
verifies that in this case, the flip-flops are clocked before clock
w has reached the second flip-flop with output Q1, resulting in
a snapshot of 110000. Likewise, an offset of Ôw = Lw−Lv =
3κ− δ + ε results in a snapshot of 100000, etc.
However, care has to be taken for non-binary outputs.
Given the output specification above, one can verify that
measurements are of the form 1*0* or 1*M0*.
The circuit in Figure 1b then computes the minimum and
the maximum of the thermometer codes (by AND and OR
gates), determining the thresholds reached by the furthest node
ahead and behind v (while possibly masking metastable bits);
compare this with lines 2 and 3 in OffsetGCS (Algorithm 1).
Figure 1c finally computes the mode signal of v from the
thermometer codes, namely verifying whether there is an s ∈
{0, 1, 2} that satisfies both triggers; compare this with FT1
and FT2 in Def. 3.
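The controller logic described above can be sketched in software. The gate structure used here is inferred from the description and the figure labels rather than taken from a netlist: Q^i_min is modeled as the AND over all neighbors of Q^i_w, Q^{-i}_max as the OR over all neighbors of Q^{-i}_w, and the mode as the OR over i of (Q^i_min AND Q^{-i}_max). Metastable bits are propagated with three-valued (Kleene) logic, which is a modeling assumption. Bit strings are ordered Q^{ℓ+1}, ..., Q^{1}, Q^{-1}, ..., Q^{-(ℓ+1)}.

```python
def and3(bits):
    """Three-valued AND: a 0 dominates, otherwise any M yields M."""
    if "0" in bits:
        return "0"
    return "M" if "M" in bits else "1"


def or3(bits):
    """Three-valued OR: a 1 dominates, otherwise any M yields M."""
    if "1" in bits:
        return "1"
    return "M" if "M" in bits else "0"


def mode_signal(codes, ell):
    """Compute the mode of v from the thermometer codes of its neighbors."""
    n = ell + 1
    terms = []
    for i in range(1, n + 1):
        # FT2 for s = i - 1: every neighbor is above -(2s+1)*kappa - delta.
        q_min = and3([code[n - i] for code in codes])
        # FT1 for s = i - 1: some neighbor is above (2s+1)*kappa - delta.
        q_max = or3([code[n + i - 1] for code in codes])
        terms.append(and3([q_min, q_max]))
    return or3(terms)


print(mode_signal(["111000", "111000"], ell=2))  # 0: all neighbors in sync
print(mode_signal(["111100", "111000"], ell=2))  # 1: one neighbor ahead
print(mode_signal(["111100", "110000"], ell=2))  # 0: another neighbor too far behind
print(mode_signal(["111M00", "111000"], ell=2))  # M: decision is borderline
```

The last case shows that a single metastable input bit can reach the mode output, which is consistent with requirement (C3) permitting an arbitrary rate between slow and fast while the mode signal is unstable.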
a) Timing Parameters: We next discuss how the mod-
ules’ timing parameters relate to the extracted physical timing
of the above design.
The time required for switching between oscillator modes
Tosc is about the delay of the ring oscillator, which in our case
is about 1/(2 · 2GHz) = 250 ps. The measurement latency
Tmeas plus the controller latency Tcnt are given by a clock
cycle (500 ps) plus the delay (25 ps) from the flip-flops through
the AND/OR circuitry in Figures 1b and 1c to the mode signal.
In our case, delay extraction of the circuit yields Tmeas +
Tcnt < 500 ps + 25 ps. We thus have Tmax = Tmeas + Tcnt +
Tosc < 775 ps.
The propagation delay uncertainty, δ0, in measuring if Ôw
has reached a certain threshold is given by the uncertainties
in latency of the upper delay chain plus the lower delay chain
in Figure 1a. For the described naive implementation using
an uncalibrated delay line, this would be problematic. With
an uncertainty of ±5% for gate delays, and starting with
moderately sized κ and thus length of delay chains, extraction
of minimum and maximum delays showed that the constraints
for δ and κ from Lem. 2 were not met. Successive cycles
of increasing δ and κ do not converge due to the linear
dependency of δ and κ on the uncertainty δ0 with a too large
factor. Rather, delay variations (of the entire system) have to
be less than ±1% for the linear offset measurement circuit,
depicted in Sub-figure 1a, to fulfill Lem. 2’s requirements.
B. Improvements
Figure 2 shows an improved TDC-type offset-measurement
circuit that does not suffer from the problem above. Concep-
tually the TDC of node v that measures offsets w.r.t. node w
is integrated into the local ring oscillator of neighboring node
w. If w has several neighbors, e.g., up to 4 in a grid, they
share the taps, but have their own flip-flops within node w.
The figure shows a design for S = {0, 1} with 4 taps, as
used in our setup.
Integration of the TDC into w’s local ring oscillator greatly
reduces uncertainties at both ends: (i) the uncertainty at
the remote clock port (of node w) is removed to a large
extent, since the delay elements which are used for the offset
measurements are part of w’s oscillator, and (ii) the uncertainty
at the local clock port is greatly reduced by removing the delay
line of length 5κ+ δ. Remaining timing uncertainties are the
latency from taps to the D-ports of the flip-flops and from
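[Figure 2 diagram labels: starved inverters (#inv = 2i + 1) controlled by mode_w; signals clk_v and clk_w; delay elements κ, κ, 2κ, 2κ, and δ; flip-flop outputs Q_w^{1}, Q_w^{-1}, Q_w^{2}, Q_w^{-2}; tap offsets L_w, L_w + δ, L_w ± κ + δ, and L_w ± 3κ + δ.]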
Figure 2: Improved offset measurement implementation. The
delays of each delay element are written inside it. The gray
buffers at the offset measurement taps decouple the load of
the remaining circuitry. At the bottom of the ring oscillator, an
odd number of starved inverters is used to set slow or fast mode
for node w. The phase offset that we measure in each tap is
written next to the corresponding flip-flop. The delay elements
at the top are inverters instead of buffers to achieve a latency
of κ = 10 ps. We inverted the clock output to account for the
negated signal at the tap of clock w at the top.
clock v to the CLK-ports of the flip-flop. Timing extraction
yielded δ0 < 4 ps in presence of ±5% gate delay variations.
From Lem. 2, we thus readily obtain κ ≈ 10 ps and
δ ≈ 5 ps which matched the previously chosen latencies of
the delay elements. Applying Thm. 1 and Cor. 1 finally yields
bounds of 1.223κD = 12.23D ps on the global skew and of
(2⌈log10(1.223D)⌉ + 1)κ on the local skew. For our design
with diameter D = 3 this makes a maximum global skew of
36.69 ps and a maximum local skew of 3κ = 30 ps. Note that
considerably larger systems, e.g., a grid with side length of
W = 32 nodes and diameter D = 2W − 2 = 62, still are
guaranteed to have a maximum local skew of 3κ = 30 ps –
and for µ = 10−3, the base of the logarithm becomes 100.
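The numbers in this paragraph can be reproduced by plugging κ = 10 ps and the diameter D into the two quoted formulas. The sketch below does only that; the function names are hypothetical, and keeping the factor 1.223 unchanged when the base of the logarithm is switched from 10 to 100 is an assumption of this sketch, not a statement from the analysis.

```python
import math


def global_skew_bound_ps(kappa_ps, diameter, factor=1.223):
    """Global skew bound factor * kappa * D from the text."""
    return factor * kappa_ps * diameter


def local_skew_bound_ps(kappa_ps, diameter, log_base, factor=1.223):
    """Local skew bound (2 * ceil(log_base(factor * D)) + 1) * kappa."""
    levels = math.ceil(math.log(factor * diameter, log_base))
    return (2 * levels + 1) * kappa_ps


# Implemented design: D = 3, logarithm to base 10.
g_small = global_skew_bound_ps(10, 3)
l_small = local_skew_bound_ps(10, 3, log_base=10)

# 32 x 32 grid: D = 2 * 32 - 2 = 62, evaluated for both bases.
l_grid_base10 = local_skew_bound_ps(10, 62, log_base=10)
l_grid_base100 = local_skew_bound_ps(10, 62, log_base=100)
print(g_small, l_small, l_grid_base10, l_grid_base100)
```

Under these assumptions, the 3κ = 30 ps figure for D = 62 is obtained with base 100, which is the case the closing remark about µ = 10−3 refers to.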
V. SIMULATION AND COMPARISON TO CLOCK TREES
A. Spice Simulations on a Line Topology
We ran Spice simulations with Cadence Spectre of the post-
layout extracted design for 4 nodes arranged in a line, as
described in Section IV. The line’s nodes are labeled 0 to
3. For the simulations, we set µ = 10ρ instead of 100ρ,
resulting in slower decrease of skew, to better observe how
skew is removed. We simulated two scenarios where node 1
is initialized with an offset of 40 ps ahead of (resp. behind) all
other nodes. Simulation time is 1000 ns (≈ 2000 clock cycles)
for the first and 600 ns for the second scenario.
[Figure 3 plot: clock waveforms between 0 V and 0.8 V, shown in three time windows around 0.5 ns, 100 ns, and 175 ns.]
Figure 3: Spice simulation of the line topology. Node 1 has
been initialized with a skew of 40 ps ahead of the other nodes.
Nodes from left to right: (i) 1 before 0, 2, 3, (ii) 1 before 0, 2
before 3, (iii) 1 before 0, 2 before 3.
Figure 4: Maximum local skew (dotted) and global skew
(solid) for the scenarios of node 1 initially being ahead (red)
and behind (blue) of all other nodes.
Figure 3 shows the clock signals of nodes 0 to 3 at three
points in time for the first scenario: (i) shortly after the
initialization, (ii) around 100 ns, and (iii) after 175 ns.
For the mode signals, in the first scenario, we observe the
following: Since node 1 is ahead of nodes 0 and 2, node 1’s
mode signal is correctly set to 0 (slow mode) while node 0 and
2’s mode signals are set to 1 (fast mode). Node 3 is unaware
that node 1 is ahead since it only observes node 2. By default
its mode signal is set to slow mode. When the gap to 2 is large
enough it switches to fast mode. This configuration remains
until nodes 0 and 2 catch up to 1, where they switch to slow
mode, to not overtake node 1. Again node 3 sees only node 2
which is still ahead and switches only after it catches up to 2.
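The mode switching described above can be mimicked by a small discrete-time model. This is a sketch under strong assumptions and not the simulated circuit: measurements are ideal (δ = 0), oscillators are drift-free (ρ = 0), a node runs fast exactly when the fast condition (FC1 and FC2 as used in Appendix B) holds for some level s, and µ = 0.1 is chosen far larger than in the design so that few steps suffice. All names are hypothetical.

```python
def is_fast(L, v, neighbors, kappa, max_level=3):
    """Return True if node v satisfies FC1 and FC2 for some level s."""
    for s in range(max_level + 1):
        threshold = (2 * s + 1) * kappa
        # FC1: some neighbor is at least `threshold` ahead of v.
        fc1 = any(L[x] - L[v] >= threshold for x in neighbors[v])
        # FC2: v is at most `threshold` ahead of every neighbor.
        fc2 = all(L[v] - L[y] <= threshold for y in neighbors[v])
        if fc1 and fc2:
            return True
    return False


def simulate_line(initial_ps, kappa=10.0, mu=0.1, steps=2000, dt_ps=1.0):
    """Advance logical clocks (in ps) of nodes arranged in a line."""
    L = list(initial_ps)
    n = len(L)
    neighbors = {v: [u for u in (v - 1, v + 1) if 0 <= u < n] for v in range(n)}
    for _ in range(steps):
        fast = [is_fast(L, v, neighbors, kappa) for v in range(n)]
        # Slow mode advances at rate 1, fast mode at rate 1 + mu.
        L = [L[v] + dt_ps * ((1.0 + mu) if fast[v] else 1.0) for v in range(n)]
    return L


def local_skew(L):
    """Largest clock difference between adjacent nodes of the line."""
    return max(abs(a - b) for a, b in zip(L, L[1:]))


# First scenario: node 1 starts 40 ps ahead of nodes 0, 2, and 3.
final = simulate_line([0.0, 40.0, 0.0, 0.0])
print(local_skew(final))
```

In this model, nodes 0 and 2 start in fast mode, node 1 stays slow, and node 3 only speeds up once node 2 has pulled ahead of it, which mirrors the qualitative behavior reported for the mode signals.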
Figure 4 (red lines) depicts the dynamics of the maximum
local and global skews for the first scenario. Observe that,
from the beginning the local skew decreases until it reaches
less than 9 ps. It then remains in a stable oscillatory state
where it increases until the algorithm detects and reduces the
local skew. This is well below our worst-case bound of 30 ps
on the local skew. The global skew first increases, as node 3
does not switch to fast mode immediately. Scenario two shows
a similar behaviour (blue lines in Figure 4).
B. Comparison to Clock Tree
For comparison, we laid out a grid of W ×W flip-flops,
evenly spread in 200µm distance in x and y direction across
the chip. The data port of a flip-flop is driven by the OR of the
up to four adjacent flip-flops. Clock trees were synthesized and
routed with Cadence Encounter, with the target of minimizing
local skews. Delay variations on gates and nets were set to
±5%. The results are presented in Figure 5. For comparison,
we plotted local skews guaranteed by our algorithm for the
same grids with parameters extracted from the implementation
described in Section IV. Observe the linear growth of the local
skew of the clock tree and the logarithmic growth of the local
skew in our implementation. The figure also shows the skew for a
clock tree with delay variations of ±10%. This comparison
Figure 5: Local skew (ps) between neighboring flip-flops in
the W ×W grid. Clock tree with ±5% delay variation (solid
green) and our algorithm with ±5% delay variation (solid
magenta). The dotted line shows the clock tree with ±10%
delay variation, demonstrating linear growth of the skew also
in a different setting. Clock trees are shown up to W = 32
after which Encounter ran out of memory.
is relevant, as δ0 is governed by local delay variations, which
can be expected to be smaller than those across a large chip.
It is worth mentioning that, for any clock tree, there are
always two neighboring nodes in the grid whose local skew is
proportional to W [19]. This follows from the fact that
there are always two neighboring nodes in the grid which are
at distance proportional to W from each other in the clock
tree [19], [20]. Accordingly, uncertainties accumulate in a
worst-case fashion to create a local skew which is proportional
to W ; this behavior can be observed in Figure 6.
To gain intuition on this result, note that there is always an
edge that, if removed (see the edge which is marked by an X in
Figure 6), partitions the tree into two subtrees each spanning
an area of Ω(W 2) and hence having a shared perimeter of
length Ω(W ). Thus, there must be two adjacent nodes, one
on each side of the perimeter, at distance Ω(W ) in the tree.
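To make the stretch argument concrete, the following sketch measures, for one particular spanning tree of the W × W grid, the largest tree distance between two grid neighbors. The comb-shaped tree (all row edges plus the first column) is a hypothetical choice for illustration; it is not the low-stretch tree of Figure 6, and a single example does not prove the lower bound for all trees.

```python
from collections import deque


def comb_tree_adjacency(W):
    """Spanning tree of the W x W grid: every row plus the first column."""
    adj = {(r, c): [] for r in range(W) for c in range(W)}

    def link(a, b):
        adj[a].append(b)
        adj[b].append(a)

    for r in range(W):
        for c in range(W - 1):
            link((r, c), (r, c + 1))
    for r in range(W - 1):
        link((r, 0), (r + 1, 0))
    return adj


def tree_distance(adj, src, dst):
    """Breadth-first search distance between src and dst in the tree."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("tree is not connected")


def worst_neighbor_tree_distance(W):
    """Maximum tree distance over all pairs of adjacent grid nodes."""
    adj = comb_tree_adjacency(W)
    worst = 0
    for r in range(W):
        for c in range(W):
            for nb in ((r + 1, c), (r, c + 1)):
                if nb in adj:
                    worst = max(worst, tree_distance(adj, (r, c), nb))
    return worst


print(worst_neighbor_tree_distance(8))
```

For this tree, two vertically adjacent nodes in the last column are connected only via the first column, so their tree distance is 2W − 1 and grows linearly in W.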
Our algorithm, on the other hand, manages to reduce the
local skew exponentially to being proportional to logW .
VI. CONCLUSION
Low skew between neighboring nodes in a chip allows for
efficient low-latency communication and provides the illusion
of a single clock domain. A classical solution for this problem
is to use a clock tree. However, clock trees inevitably produce
local skews which are proportional to the diameter of the
chip. We propose a solution based on a distributed clock
synchronization algorithm. Its main idea is to control the local
clocks of each node by measuring the time offsets from its
neighbors and switching between fast and slow clock rates.
We compare our implementation to tool-generated 2GHz
clock trees for W ×W grids in 15 nm technology. Asymp-
totically, the implementation improves over the clock tree ex-
ponentially. Our simulations show an improvement of roughly
50% on the local skew already for W = 32.
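[Figure 6 diagram labels: grid side length W; subtree boundary of length Ω(W); the circled neighbors have grid distance dist = 1 and tree distance dist_T = Ω(W).]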
Figure 6: A low stretch spanning tree of an 8 × 8 grid [21].
The bold lines depict the spanning tree, i.e., our clock tree in
this example. The two neighboring nodes that are of distance
13 in the tree are circled (at the middle right side of the grid).
The algorithmic approach is highly robust. It does not
rely on a single node or link, and can stabilize to small
skews even under poor initialization conditions. In particular,
it will recover from transient faults, and can handle the
loss of individual nodes or links by adding simple detection
mechanisms [16]. Moreover, it is known how to integrate new
or recovering links or nodes by a simple mechanism without
interfering with the skew bounds [16]. Thus, our approach
provides a flexible and resilient alternative to classic designs.
In future work, we intend to design a full implementation
including suitable (locked) oscillators. As demonstrated by
the work of Mota et al. [18], systems with much smaller
values of ρ than 10−5 are feasible. Consequently, even a
simple design is likely to result in sufficiently stable local time
references. However, a challenge here is that the oscillators
need to be locked to a (frequency) reference. This prevents
directly adjusting their phase, which would be in conflict with
their locking. This issue can be resolved by using a digitally
controlled oscillator derived from the local clock. Such a
design is possible using synchronizers (which however would
increase Tmax), or could make use of metastability-containing
techniques in the vein of Függer et al. [22].
Acknowledgments. We thank the reviewers for their valuable
feedback, and in particular the third reviewer for pointers
to related work. This research has received funding from
the European Research Council (ERC) under the European
Union's Horizon 2020 research and innovation programme
(grant agreement No 716562), the Israel Science Foundation
under Grant 867/19, ANR grant FREDDA (ANR-17-CE40-
0013), and the Digicosme working group HicDiesMeus.
REFERENCES
[1] H. D. Foster, “Trends in functional verification: A 2014 industry study,”
in 52nd Annual Design Automation Conference. ACM, 2015, p. 48.
[2] A. J. Martin, “Compiling communicating processes into delay-
insensitive vlsi circuits,” Dist. comp., vol. 1, no. 4, pp. 226–234, 1986.
[3] ——, “The limitations to delay-insensitivity in asynchronous circuits,”
in Beauty is our business. Springer, 1990, pp. 302–311.
[4] R. Manohar and Y. Moses, “The eventual c-element theorem for delay-
insensitive asynchronous circuits,” in 23rd IEEE International Sympo-
sium on Asynchronous Circuits and Systems. IEEE, 2017, pp. 102–109.
[5] ——, “Asynchronous signalling processes,” in 25th IEEE Int. Sympo-
sium on Asynchronous Circuits and Systems. IEEE, 2019, pp. 68–75.
[6] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems.”
Stanford Univ CA Dept of Computer Science, Tech. Rep., 1984.
[7] P. Teehan, M. Greenstreet, and G. Lemieux, “A survey and taxonomy of
gals design styles,” IEEE Design & Test of Computers, vol. 24, no. 5,
pp. 418–428, 2007.
[8] R. Dobkin, R. Ginosar, and C. P. Sotiriou, “Data synchronization issues
in gals socs,” in 10th International Symposium on Asynchronous Circuits
and Systems. IEEE, 2004, pp. 170–179.
[9] K. Y. Yun and R. P. Donohue, “Pausible clocking: A first step toward
heterogeneous systems,” in Proc. Int. Conference on Computer Design.
VLSI in Computers and Processors. IEEE, 1996, pp. 118–123.
[10] X. Fan, M. Krstić, and E. Grass, “Analysis and optimization of pausible
clocking based gals design,” in IEEE International Conference on
Computer Design. IEEE, 2009, pp. 358–365.
[11] L. R. Dennison, W. J. Dally, and D. Xanthopoulos, “Low-latency
plesiochronous data retiming,” in Proceedings Sixteenth Conference on
Advanced Research in VLSI. IEEE, 1995, pp. 304–315.
[12] A. Chakraborty and M. R. Greenstreet, “Efficient self-timed interfaces
for crossing clock domains,” in 9th International Symposium on Asyn-
chronous Circuits and Systems. IEEE, 2003, pp. 78–88.
[13] C. Lenzen, T. Locher, and R. Wattenhofer, “Tight Bounds for Clock
Synchronization,” Journal of the ACM, vol. 57, no. 2, pp. 1–42, 2010.
[14] F. Kuhn and R. Oshman, “Gradient Clock Synchronization Using
Reference Broadcasts,” in Principles of Distributed Systems, 13th
International Conference, 2009, pp. 204–218. [Online]. Available:
https://doi.org/10.1007/978-3-642-10877-8_17
[15] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech,
and J. Michelsen, “Open cell library in 15nm freepdk technology,” in
Proceedings of the 2015 Symposium on International Symposium on
Physical Design. ACM, 2015, pp. 171–178.
[16] F. Kuhn, C. Lenzen, T. Locher, and R. Oshman, “Optimal gradient
clock synchronization in dynamic networks,” CoRR, vol. abs/1005.2894,
2010. [Online]. Available: http://arxiv.org/abs/1005.2894
[17] L. R. Marino, “General Theory of Metastable Operation,” IEEE Trans-
actions on Computers, vol. 30, no. 2, pp. 107–115, 1981.
[18] M. Mota, J. Christiansen, S. Debieux, V. Ryjov, P. Moreira, and
A. Marchioro, “A flexible multi-channel high-resolution time-to-digital
converter asic,” in IEEE Nuclear Science Symp., vol. 2, 2000, pp. 9–155.
[19] Fisher and Kung, “Synchronizing Large VLSI Processor Arrays,” IEEE
Transactions on Computers, vol. C-34, no. 8, pp. 734–740, 1985.
[20] P. Boksberger, F. Kuhn, and R. Wattenhofer, “On the approximation of
the minimum maximum stretch tree problem,” Technical report/ETH,
Department of Computer Science, vol. 409, 2003.
[21] M. James, “Linear solver in linear time.” [Online]. Available:
https://www.i-programmer.info/news/181-algorithms/5573-linear-solver-in-linear-time.html
[22] M. Függer, A. Kinali, C. Lenzen, and B. Wiederhake, “Fast All-Digital
Clock Frequency Adaptation Circuit for Voltage Droop Tolerance,” in
Symp. on Asynchronous Circuits and Systems, 2018.
[23] W. Rudin, Principles of Mathematical Analysis, 3rd ed. New York:
McGraw-Hill Education, 1976.
APPENDIX A
PROOF OF LEMMA 1
Proof. Consider the estimate $\hat{O}_w(t)$ that the algorithm uses at node $v$ for neighbor $w$ at time $t$. By definition of $T_{\max}$, the measurement is based on clock values $L_v(t_v)$ and $L_w(t_w)$ for some $t_v, t_w \in [t - T_{\max}, t)$. Without loss of generality, we assume that to measure whether $L_w - L_v \ge T \in \mathbb{R}$, the signals are sent at logical times satisfying $L_w(t_w) - T = L_v(t_v)$ (see footnote 4). Denote by $t'_v \in (t_v, t)$ and $t'_w \in (t_w, t)$ the times when the respective signals arrive at the data or clock input, respectively, of the register (see footnote 5) indicating whether $\hat{O}_w \ge T$ for a given threshold $T$. By definition of $\delta_0$, we have that
\[ |t'_v - t_v - (t'_w - t_w)| \le \delta_0. \]
Note that the register indicates $\hat{O}_w(t) \ge T$, i.e., latches 1, if and only if $t'_w < t'_v$ (see footnote 6). Thus, we need to show
\begin{align*}
L_w(t) - L_v(t) \ge T + \delta &\implies t'_w < t'_v, \\
L_w(t) - L_v(t) \le T - \delta &\implies t'_w > t'_v.
\end{align*}
Footnote 4: One can account for asymmetric propagation times by shifting $L_w(t_w)$ and $L_v(t_v)$ accordingly, so long as this is accounted for in $T_{\max}$, and carry out the proof analogously.
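Assume first that $L_w(t) - L_v(t) \ge T + \delta$. Then, using I4 and that $L_w(t_w) - T = L_v(t_v)$, we can bound
\begin{align*}
T + \delta &\le L_w(t) - L_v(t) \\
&\le L_w(t_v) - L_v(t_v) + ((1+\mu)(1+\rho) - 1)(t - t_v) \\
&= L_w(t_v) - L_w(t_w) + T + (\mu + \rho + \rho\mu)(t - t_v) \\
&\le t_v - t_w + T + (\mu + \rho + \rho\mu)(t - \min\{t_v, t_w\}) \\
&< t_v - t_w + T + (\mu + \rho + \rho\mu) T_{\max}.
\end{align*}
Hence, $t_v - t_w > \delta - (\mu + \rho + \rho\mu) T_{\max}$, and by the definition of $\delta_0$,
\[ t'_v - t'_w \ge t_v - t_w - \delta_0 > \delta - \delta_0 - (\mu + \rho + \rho\mu) T_{\max} = 0, \]
i.e., $t'_w < t'_v$.
For the second implication, observe that it is equivalent to
\[ L_v(t) - L_w(t) \ge -T + \delta \implies t'_v < t'_w. \]
As we have shown the first implication for any $T \in \mathbb{R}$, the second follows analogously by exchanging the roles of $v$ and $w$.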
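The final step of the proof uses $\delta = \delta_0 + (\mu + \rho + \rho\mu) T_{\max}$. As a plausibility check, the sketch below evaluates this expression with values quoted elsewhere in the paper: δ0 = 4 ps and Tmax = 775 ps from Section IV, and, as assumptions of this sketch, ρ = 10^{-5} and µ = 100ρ.

```python
# Assumed parameters (taken from values mentioned in the text, see lead-in).
rho = 1e-5          # oscillator drift bound
mu = 100 * rho      # speed-up of fast mode
t_max_ps = 775.0    # bound on Tmax from Section IV
delta0_ps = 4.0     # propagation delay uncertainty of the improved TDC

# Clock drift accumulated while a measurement is in flight.
drift_ps = (mu + rho + rho * mu) * t_max_ps

# Measurement error delta as used in the last line of the proof.
delta_ps = delta0_ps + drift_ps
print(round(drift_ps, 4), round(delta_ps, 4))
```

With these inputs, the drift term contributes less than 1 ps, so δ stays slightly below 5 ps, in line with the value δ ≈ 5 ps reported for the implementation.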
APPENDIX B
PROOF OF THEOREM 1
In this appendix, we prove Theorem 1. We assume that at (Newtonian) time $t = 0$, the system satisfies some bound on local skew. The analysis we provide shows that the GCS algorithm maintains a (slightly larger) bound on local skew for all $t \ge 0$. An upper bound on the local skew also bounds the number of values of $s$ for which FC or SC (Definition 1) can hold, as a large $s$ implies a large local skew. (For example, if a node $v$ satisfies FC1 for some $s$, then $v$ has a neighbor $x$ satisfying $L_x(t) - L_v(t) \ge (2s+1)\kappa$, implying that $\mathcal{L}(t) \ge (2s+1)\kappa$.) Accordingly, an implementation need only test for values of $s$ satisfying $|s| < \frac{1}{2\kappa}\mathcal{L}_{\max}$, where $\mathcal{L}_{\max}$ is an upper bound on the local skew. Our analysis also shows that given an arbitrary initial global skew $\mathcal{G}(0)$, the system will converge to the skew bounds claimed in Theorem 1 within time $O(\mathcal{G}(0)/\mu)$. We note that the skew upper bounds of Theorem 1 match the lower bounds of [13] up to a factor of approximately 2, and these lower bounds apply even under the assumption of initially perfect synchronization (i.e., systems with $\mathcal{L}(0) = \mathcal{G}(0) = 0$).
Footnote 5: We assume a register here, but the same argument applies to any state-holding component serving this purpose in the measurement circuit.
Footnote 6: For simplicity of the presentation we neglect the setup/hold time ε (accounted for in δ0) and metastability; see Section III for a discussion.
Our analysis also assumes that logical clocks are differen-
tiable functions. This assumption is without loss of generality:
By the Stone-Weierstrass Theorem (cf. Theorem 7.26 in [23])
every continuous function on a compact interval can be
approximated arbitrarily closely by a differentiable function.
We will rely on the following technical result. We provide
a proof in Section B-E.
Lemma 3. For $k \in \mathbb{Z}$ and $t_0, t_1 \in \mathbb{R}_{\ge 0}$ with $t_0 < t_1$, let $\mathcal{F} = \{f_i \mid i \in [k]\}$, where each $f_i : [t_0, t_1] \to \mathbb{R}$ is a differentiable function. Define $F : [t_0, t_1] \to \mathbb{R}$ by $F(t) = \max_{i \in [k]} \{f_i(t)\}$. Suppose $\mathcal{F}$ has the property that for every $i$ and $t$, if $f_i(t) = F(t)$, then $\frac{d}{dt} f_i(t) \le r$. Then for all $t \in [t_0, t_1]$, we have $F(t) \le F(t_0) + r(t - t_0)$.
Throughout this section, we assume that each node runs
an algorithm satisfying the invariants stated in Definition 2.
By Lemmas 1 and 2, Algorithm 1 meets this requirement if
κ > 2δ + 2(ρ+ µ+ ρµ)Tmax.
A. Leading Nodes
We start by showing that skew cannot build up too quickly.
This is captured by analyzing the following functions.
Definition 4 (Ψ and Leading Nodes). For each $v \in V$, $s \in \mathbb{N}$, and $t \in \mathbb{R}_{\ge 0}$, we define
\[ \Psi^s_v(t) = \max_{w \in V} \{ L_w(t) - L_v(t) - 2s\kappa\, d(v, w) \}, \]
where $d(v, w)$ denotes the distance between $v$ and $w$ in $G$. Moreover, set
\[ \Psi^s(t) = \max_{v \in V} \{ \Psi^s_v(t) \}. \]
Finally, we say that $w \in V$ is a leading node if there is some $v \in V$ satisfying
\[ \Psi^s_v(t) = L_w(t) - L_v(t) - 2s\kappa\, d(v, w) > 0. \]
Observe that any bound on $\Psi^s$ implies a corresponding bound on $\mathcal{L}$: If $\Psi^s(t) \le \kappa$, then for any adjacent nodes $v, w$ we have $L_w(t) - L_v(t) - 2s\kappa \le \Psi^s(t) \le \kappa$. Therefore, $\Psi^s(t) \le \kappa \implies \mathcal{L}(t) \le (2s+1)\kappa$. Our analysis will show that in general, $\Psi^s(t) \le \mathcal{G}_{\max}/\sigma^s$ for every $s \in \mathbb{N}$ and all times $t$. In particular, considering $s = \lceil \log_{\mu/\rho} (\mathcal{G}_{\max}/\kappa) \rceil$ gives a bound on $\mathcal{L}$ in terms of $\mathcal{G}_{\max}$. Because $\mathcal{G}(t) = \Psi^0(t)$, the skew bounds will then follow if we can suitably bound $\Psi^0$ at all times.
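As a small illustration of Definition 4, the sketch below evaluates $\Psi^s$ for nodes on a line, where $d(v, w) = |v - w|$. The clock values are the initial offsets of the first simulation scenario (node 1 being 40 ps ahead) and κ = 10 ps; the function name is hypothetical.

```python
def psi(L, s, kappa):
    """Psi^s for a line graph: max over v, w of L_w - L_v - 2*s*kappa*d(v, w)."""
    n = len(L)
    return max(
        L[w] - L[v] - 2 * s * kappa * abs(v - w)
        for v in range(n)
        for w in range(n)
    )


clocks_ps = [0, 40, 0, 0]   # logical clock offsets of nodes 0..3
kappa_ps = 10

psi0 = psi(clocks_ps, 0, kappa_ps)   # equals the global skew
psi1 = psi(clocks_ps, 1, kappa_ps)
psi2 = psi(clocks_ps, 2, kappa_ps)
print(psi0, psi1, psi2)
```

Here $\Psi^0$ is the global skew of 40 ps, and $\Psi^2 = 0 \le \kappa$ certifies a local skew of at most $(2 \cdot 2 + 1)\kappa = 50$ ps, which indeed holds for the example.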
Note that the definition of $\Psi^s_v$ is closely related to the definition of the slow condition. In fact, the following lemma shows that if $w$ is a leading node, then $w$ satisfies the slow condition. Thus, $\Psi^s$ cannot increase quickly: I4 (Def. 2) then stipulates that leading nodes increase their logical clocks at rate at most $1 + \rho$. This behavior allows nodes in fast mode to catch up to leading nodes.
Lemma 4 (Leading Lemma). Suppose $w \in V$ is a leading node at time $t$. Then $\frac{d}{dt} L_w(t) = \frac{d}{dt} H_w(t) \in [1, 1 + \rho]$.
Proof. By I4, the claim follows if $w$ satisfies the slow condition at time $t$. As $w$ is a leading node at time $t$, there are $s \in \mathbb{N}$ and $v \in V$ satisfying
\[ \Psi^s_v(t) = L_w(t) - L_v(t) - 2s\kappa\, d(v, w) > 0. \]
In particular, $L_w(t) > L_v(t)$, so $w \ne v$. For any $y \in V$, we have
\[ L_w(t) - L_v(t) - 2s\kappa\, d(v, w) = \Psi^s_v(t) \ge L_y(t) - L_v(t) - 2s\kappa\, d(v, y). \]
Rearranging this expression yields
\[ L_w(t) - L_y(t) \ge 2s\kappa\, (d(v, w) - d(v, y)). \]
In particular, for any $y \in N_w$, $d(v, w) \ge d(v, y) - 1$ and hence
\[ L_y(t) - L_w(t) \le 2s\kappa, \]
i.e., SC2 holds for $s$ at $w$.
Now consider $x \in N_w$ so that $d(v, x) = d(v, w) - 1$. Such a node exists because $v \ne w$. We obtain
\[ L_w(t) - L_x(t) \ge 2s\kappa. \]
Thus SC1 is satisfied for $s$, i.e., indeed the slow condition holds at $w$ at time $t$.
Lemma 4 can readily be translated into a bound on the growth of $\Psi^s_w$ whenever $\Psi^s_w > 0$.
Lemma 5 (Wait-up Lemma). Suppose $w \in V$ satisfies $\Psi^s_w(t) > 0$ for all $t \in (t_0, t_1]$. Then
\[ \Psi^s_w(t_1) \le \Psi^s_w(t_0) - (L_w(t_1) - L_w(t_0)) + (1 + \rho)(t_1 - t_0). \]
Proof. Fix $w \in V$, $s \in \mathbb{N}$ and $(t_0, t_1]$ as in the hypothesis of the lemma. For $v \in V$ and $t \in (t_0, t_1]$, define the function $f_v(t) = L_v(t) - 2s\kappa\, d(v, w)$. Observe that
\[ \max_{v \in V} \{ f_v(t) \} - L_w(t) = \Psi^s_w(t). \]
Moreover, for any $v$ satisfying $f_v(t) = L_w(t) + \Psi^s_w(t)$, we have $L_v(t) - L_w(t) - 2s\kappa\, d(v, w) = \Psi^s_w(t) > 0$. Thus, Lemma 4 shows that $v$ is in slow mode at time $t$. As (we assume that) logical clocks are differentiable, so is $f_v$, and it follows that $\frac{d}{dt} f_v(t) \le 1 + \rho$ for any $v \in V$ and time $t \in (t_0, t_1]$ satisfying $f_v(t) = \max_{x \in V} \{ f_x(t) \}$. By Lemma 3, it follows that $\max_{v \in V} \{ f_v(t) \}$ grows at most at rate $1 + \rho$:
\[ \max_{v \in V} \{ f_v(t_1) \} \le \max_{v \in V} \{ f_v(t_0) \} + (1 + \rho)(t_1 - t_0). \]
We conclude that
\begin{align*}
\Psi^s_w(t_1) - \Psi^s_w(t_0) &= \max_{v \in V} \{ f_v(t_1) \} - L_w(t_1) - \Big( \max_{v \in V} \{ f_v(t_0) \} - L_w(t_0) \Big) \\
&\le (1 + \rho)(t_1 - t_0) - (L_w(t_1) - L_w(t_0)),
\end{align*}
which can be rearranged into the desired result.
Corollary 2. For all $s \in \mathbb{N}$ and times $t_1 \ge t_0$, $\Psi^s(t_1) \le \Psi^s(t_0) + \rho(t_1 - t_0)$.
Proof. Choose $w \in V$ such that $\Psi^s(t_1) = \Psi^s_w(t_1)$. As $\Psi^s_w(t) \ge 0$ for all times $t$, nothing is to show if $\Psi^s(t_1) = 0$. Otherwise, let $t \in [t_0, t_1)$ be the supremum of times $t' \in [t_0, t_1)$ with the property that $\Psi^s_w(t') = 0$, or $t = t_0$ if no such time exists. Because $\Psi^s_w$ is continuous, $t \ne t_0$ implies that $\Psi^s_w(t) = 0$. Hence, $\Psi^s_w(t) \le \Psi^s_w(t_0)$. By I2 and Lemma 5, we get that
\begin{align*}
\Psi^s(t_1) &= \Psi^s_w(t_1) \\
&\le \Psi^s_w(t) - (L_w(t_1) - L_w(t)) + (1 + \rho)(t_1 - t) \\
&\le \Psi^s_w(t) + \rho(t_1 - t) \\
&\le \Psi^s_w(t_0) + \rho(t_1 - t_0) \\
&\le \Psi^s(t_0) + \rho(t_1 - t_0).
\end{align*}
Trailing Nodes
As $L_w(t_1) - L_w(t_0) \ge t_1 - t_0$ at all times by I2, Lemma 5 implies that $\Psi^s$ cannot grow faster than at rate $\rho$ when $\Psi^s(t) > 0$. This means that nodes whose clocks are far behind leading nodes can catch up, so long as the lagging nodes satisfy the fast condition and thus run at rate at least $1 + \mu$ by I3. Our next task is to show that “trailing nodes” always satisfy the fast condition so that they are never too far behind leading nodes. The approach to showing this is similar to the one for Lemma 5, where now we need to exploit the fast condition.
Definition 5 (Ξ and Trailing Nodes). For each $v \in V$, $s \in \mathbb{N}$, and $t \in \mathbb{R}_{\ge 0}$, we define
\[ \Xi^s_v(t) = \max_{w \in V} \{ L_v(t) - L_w(t) - (2s+1)\kappa\, d(v, w) \}, \]
where $d(v, w)$ denotes the distance between $v$ and $w$ in $G$. Moreover, set
\[ \Xi^s(t) = \max_{v \in V} \{ \Xi^s_v(t) \}. \]
Finally, we say that $w \in V$ is a trailing node at time $t$, if there is some $v \in V$ satisfying
\[ \Xi^s_v(t) = L_v(t) - L_w(t) - (2s+1)\kappa\, d(v, w) > 0. \]
Lemma 6 (Trailing Lemma). If $w \in V$ is a trailing node at time $t$, then $\frac{d}{dt} L_w(t) = (1 + \mu) \frac{d}{dt} H_w(t) \in [1 + \mu, (1 + \rho)(1 + \mu)]$.
Proof. By I3, it suffices to show that $w$ satisfies the fast condition at time $t$. Let $s$ and $v$ satisfy
\[ L_v(t) - L_w(t) - (2s+1)\kappa\, d(v, w) = \max_{x \in V} \{ L_v(t) - L_x(t) - (2s+1)\kappa\, d(v, x) \} > 0. \]
In particular, $L_v(t) > L_w(t)$, implying that $v \ne w$. For $y \in V$, we have
\[ L_v(t) - L_w(t) - (2s+1)\kappa\, d(v, w) \ge L_v(t) - L_y(t) - (2s+1)\kappa\, d(v, y). \]
Thus for all neighbors $y \in N_w$,
\[ L_y(t) - L_w(t) + (2s+1)\kappa\, (d(v, y) - d(v, w)) \ge 0. \]
As $d(v, y) \le d(v, w) + 1$, it follows that
\[ \forall y \in N_w : L_w(t) - L_y(t) \le (2s+1)\kappa, \]
i.e., FC2 holds for $s$. As $v \ne w$, there is some node $x \in N_w$ with $d(v, x) = d(v, w) - 1$. Thus we obtain
\[ \exists x \in N_w : L_x(t) - L_w(t) \ge (2s+1)\kappa, \]
showing FC1 for $s$, i.e., indeed the fast condition holds at $w$ at time $t$.
Using Lemma 6, we can show that if $\Psi^s_w(t_0) > 0$, $w$ will eventually catch up. How long this takes can be expressed in terms of $\Psi^{s-1}(t_0)$, or, if $s = 0$, $\mathcal{G}$.
Lemma 7 (Catch-up Lemma). Let $s \in \mathbb{N}$ and $v, w \in V$. Let $t_0$ and $t_1$ be times satisfying that
\[ t_1 \ge t_0 + \frac{\Xi^s_v(t_0)}{\mu}. \]
Then
\[ L_w(t_1) \ge t_1 - t_0 + L_v(t_0) - (2s+1)\kappa\, d(v, w). \]
Proof. W.l.o.g., we may assume that $t_1 = t_0 + \Xi^s_v(t_0)/\mu$, as I2 ensures that $\frac{d}{dt} L_w(t) \ge 1$ at all times, i.e., the general statement readily follows. For any $x \in V$, define
\[ f_x(t) = t - t_0 + L_v(t_0) - L_x(t) - (2s+1)\kappa\, d(v, x). \]
Again by I2, it thus suffices to show that $f_w(t) \le 0$ for some $t \in [t_0, t_1]$.
Observe that $\Xi^s_v(t_0) = \max_{x \in V} \{ f_x(t_0) \}$. Thus, it suffices to show that $\max_{x \in V} \{ f_x(t) \}$ decreases at rate $\mu$ so long as it is positive, as then $f_w(t_1) \le \max_{x \in V} \{ f_x(t_1) \} \le 0$. To this end, consider any time $t \in [t_0, t_1]$ satisfying $\max_{x \in V} \{ f_x(t) \} > 0$ and let $y \in V$ be any node such that $\max_{x \in V} \{ f_x(t) \} = f_y(t)$. Then $y$ is trailing, as
\begin{align*}
\Xi^s_v(t) &= \max_{x \in V} \{ L_v(t) - L_x(t) - (2s+1)\kappa\, d(v, x) \} \\
&= L_v(t) - L_v(t_0) - (t - t_0) + \max_{x \in V} \{ f_x(t) \} \\
&= L_v(t) - L_v(t_0) - (t - t_0) + f_y(t) \\
&= L_v(t) - L_y(t) - (2s+1)\kappa\, d(v, y)
\end{align*}
and
\begin{align*}
\Xi^s_v(t) &= L_v(t) - L_v(t_0) - (t - t_0) + \max_{x \in V} \{ f_x(t) \} \\
&> L_v(t) - L_v(t_0) - (t - t_0) \ge 0.
\end{align*}
Thus, by Lemma 6 we have that $\frac{d}{dt} L_y(t) \ge 1 + \mu$, implying $\frac{d}{dt} f_y(t) = 1 - \frac{d}{dt} L_y(t) \le -\mu$.
To complete the proof, assume towards a contradiction that $\max_{x \in V} \{ f_x(t) \} > 0$ for all $t \in [t_0, t_1]$. Then, applying Lemma 3 again, we conclude that
\begin{align*}
\Xi^s_v(t_0) &= \max_{x \in V} \{ f_x(t_0) \} \\
&> -\Big( \max_{x \in V} \{ f_x(t_1) \} - \max_{x \in V} \{ f_x(t_0) \} \Big) \\
&\ge \mu (t_1 - t_0) = \Xi^s_v(t_0),
\end{align*}
i.e., it must hold that $f_w(t) \le \max_{x \in V} \{ f_x(t) \} \le 0$ for some $t \in [t_0, t_1]$.
B. Base Case and Global Skew
We now prove that if $\Psi^s(0)$ is bounded for some $s \in \mathbb{N}$, it cannot grow significantly and thus remains bounded. This will both serve as an induction anchor for establishing our bound on the local skew and for bounding the global skew, as $\Psi^0(t) = \mathcal{G}(t)$. In addition, we will deduce that even if the initial global skew $\mathcal{G}(0)$ is large, at times $t \ge \mathcal{G}(0)/\mu$, $\mathcal{G}(t)$ is bounded by $\mathcal{G}_{\max} = \kappa D / (1 - 2\rho/\mu)$.
To this end, we will apply Lemma 7 in the following form.
Corollary 3. Let $s \in \mathbb{N}$ and $t_0, t_1$ be times satisfying
\[ t_1 \ge t_0 + \frac{\Psi^s(t_0)}{\mu}. \]
Then, for any $w \in V$ we have
\[ L_w(t_1) - L_w(t_0) \ge t_1 - t_0 + \Psi^s_w(t_0) - \kappa D. \]
Proof. If $\Psi^s_w(t_0) - \kappa D \le 0$, the claim is trivially satisfied due to I2 guaranteeing that $\frac{d}{dt} L_w(t) \ge 1$ at all times $t$. Hence, assume that $\Psi^s_w(t_0) - \kappa D > 0$ and choose any $v$ so that
\[ \Psi^s_w(t_0) = L_v(t_0) - L_w(t_0) - 2s\kappa\, d(v, w). \]
We have that
\begin{align*}
\Xi^s_v(t_0) &\ge L_v(t_0) - L_w(t_0) - (2s+1)\kappa\, d(v, w) \\
&\ge L_v(t_0) - L_w(t_0) - 2s\kappa\, d(v, w) - \kappa D \\
&= \Psi^s_w(t_0) - \kappa D.
\end{align*}
As trivially $\Psi^s(t_0) \ge \Xi^s(t_0) \ge \Xi^s_v(t_0)$, we have that $t_1 \ge t_0 + \Xi^s_v(t_0)/\mu$ and the claim follows by applying Lemma 7.
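Combining this corollary with Lemma 5, we can bound $\Psi^s$ at all times.
Lemma 8. Fix $s \in \mathbb{N}$. If $\Psi^s(0) \le \kappa D / (1 - \rho^2/\mu^2)$, then
\[ \Psi^s(t) \le \frac{\mu}{\mu - \rho} \cdot \kappa D \]
at all times $t$. Otherwise,
\[ \Psi^s(t) \le \begin{cases} \left(1 + \frac{\rho}{\mu}\right) \Psi^s(0) & \text{if } t \le \frac{\Psi^s(0)}{\mu}, \\ \kappa D + \frac{\rho}{\mu} \left(1 + \frac{\rho}{\mu}\right) \Psi^s(0) & \text{else.} \end{cases} \]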
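Proof. For $t \le \Psi^s(0)/\mu$, the claim follows immediately from Corollary 2 (and possibly using that $\Psi^s(0) \le \kappa D / (1 - \rho^2/\mu^2)$). Concerning larger times, denote by $B$ the bound that needs to be shown and suppose that $\Psi^s(t_1) = B + \varepsilon$ for some $\varepsilon > 0$ and minimal $t_1 > \Psi^s(0)/\mu$. Choose $w \in V$ so that $\Psi^s_w(t_1) = \Psi^s(t_1)$ and $t_0$ such that $t_1 = t_0 + \Psi^s(t_0)/\mu$. Such a time must exist, because the function $f(t) = t_1 - t - \Psi^s(t)/\mu$ is continuous and satisfies
\[ f(t_1) = -\frac{\Psi^s(t_1)}{\mu} < 0 < t_1 - \frac{\Psi^s(0)}{\mu} = f(0). \]
We apply Lemma 5 and Corollary 3, showing that
\begin{align*}
\Psi^s_w(t_1) &\le \Psi^s_w(t_0) - (L_w(t_1) - L_w(t_0)) + (1 + \rho)(t_1 - t_0) \\
&\le \kappa D + \rho (t_1 - t_0) \\
&= \kappa D + \frac{\rho}{\mu} \Psi^s(t_0).
\end{align*}
We distinguish two cases. If $\Psi^s(0) \le \kappa D / (1 - \rho^2/\mu^2)$, we have that
\[ \Psi^s(t_0) < \frac{\mu}{\mu - \rho} \cdot \kappa D + \varepsilon, \]
because $t_0 < t_1$, leading to the contradiction
\[ \frac{\mu}{\mu - \rho} \cdot \kappa D + \varepsilon = \Psi^s(t_1) < \left(1 + \frac{\rho}{\mu - \rho}\right) \kappa D + \varepsilon. \]
On the other hand, if $\Psi^s(0) > \kappa D / (1 - \rho^2/\mu^2)$, this is equivalent to
\[ \kappa D + \frac{\rho}{\mu} \left(1 + \frac{\rho}{\mu}\right) \Psi^s(0) > \left(1 + \frac{\rho}{\mu}\right) \Psi^s(0). \]
Hence,
\[ \Psi^s(t_0) < \kappa D + \frac{\rho}{\mu} \left(1 + \frac{\rho}{\mu}\right) \Psi^s(0) + \varepsilon \]
and we get that
\begin{align*}
\kappa D + \frac{\rho}{\mu} \left(1 + \frac{\rho}{\mu}\right) \Psi^s(0) + \varepsilon &= \Psi^s(t_1) \\
&< \kappa D + \frac{\rho}{\mu} \left( \kappa D + \frac{\rho}{\mu} \left(1 + \frac{\rho}{\mu}\right) \Psi^s(0) + \varepsilon \right).
\end{align*}
This implies the contradiction
\[ \Psi^s(0) < \frac{\kappa D}{1 + \rho/\mu} + \frac{\rho}{\mu} \cdot \Psi^s(0) \]
to $\Psi^s(0) > \kappa D / (1 - \rho^2/\mu^2)$.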
Corollary 4. Abbreviate $q = \frac{\rho}{\mu} \left(1 + \frac{\rho}{\mu}\right)$ and assume that $q \le \frac{3}{4}$. For $i, s \in \mathbb{N}$ and times $t \ge 4(\Psi^s(0) + i \kappa D)/\mu$, it holds that
\[ \Psi^s(t) \le \frac{\kappa D}{1 - q} + q^i \left(1 + \frac{\rho}{\mu}\right) \Psi^s(0). \]
Proof. Consider the series given by $x_0 = (1 + \rho/\mu) \Psi^s(0)$, $x_{i+1} = \kappa D + q x_i$, $t_0 = 0$, and $t_{i+1} = t_i + x_i/\mu$. By applying Lemma 8 with time $0$ replaced by time $t_i$ (i.e., shifting time) and $\Psi^s(0)$ by $x_i$, we can conclude that $x_i$ upper bounds $\Psi^s(t)$ at times $t \ge t_i$. Simple calculations show that $x_i \le \frac{\kappa D}{1 - q} + q^i x_0$ and $t_i \le 4(\Psi^s(0) + i \kappa D)/\mu$, so the claim follows.
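The "simple calculations" for the recursion $x_{i+1} = \kappa D + q x_i$ can be checked numerically. The sketch below iterates the recursion and compares each $x_i$ against $\frac{\kappa D}{1-q} + q^i x_0$; the parameter values (ρ/µ = 0.01, κD = 30 ps, Ψ^s(0) = 1000 ps) and the function name are illustrative assumptions.

```python
def bound_sequence(psi0, kappa_d, rho_over_mu, steps):
    """Iterate x_{i+1} = kappa*D + q*x_i with x_0 = (1 + rho/mu) * Psi^s(0)."""
    q = rho_over_mu * (1 + rho_over_mu)
    x = (1 + rho_over_mu) * psi0
    xs = [x]
    for _ in range(steps):
        x = kappa_d + q * x
        xs.append(x)
    return q, xs


q, xs = bound_sequence(psi0=1000.0, kappa_d=30.0, rho_over_mu=0.01, steps=6)
limit = 30.0 / (1 - q)  # fixed point kappa*D / (1 - q) of the recursion

for i, x in enumerate(xs):
    closed_form = limit + q ** i * xs[0]
    print(i, x, closed_form)
```

Unrolling the recursion gives $x_i = \kappa D \sum_{j<i} q^j + q^i x_0$, so the geometric sum is what produces the $\kappa D/(1-q)$ term.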
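In particular, $\Psi^s$ becomes bounded by $(1 + O(\rho/\mu)) \kappa D$ within $O(\Psi^s(0)/\mu)$ time. Plugging in $s = 0$, we obtain a bound on the global skew.
Corollary 5. If $q = \frac{\rho}{\mu} \left(1 + \frac{\rho}{\mu}\right) \le \frac{3}{4}$, it holds that
\[ \mathcal{G}(t) \le \frac{\kappa D}{1 - q} + q^i \left(1 + \frac{\rho}{\mu}\right) \mathcal{G}(0) \]
at all times $t \ge 4(\mathcal{G}(0) + i \kappa D)/\mu$.
Proof. By applying Corollary 4 for $s = 0$, noting that $\mathcal{G}(t) = \Psi^0(t)$.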
C. Bounding the Local Skew
In order to bound the local skew, we analyze the average
skew over paths in G of various lengths. For long paths of
Ω(D) hops, we will simply exploit that we already bounded
the global skew, i.e., the skew between any pair of nodes.
For successively shorter paths, we inductively show that the
average skew between endpoints cannot increase too quickly:
reducing the length of a path by factor σ can only increase the
skew between endpoints by an additive constant term. Thus,
paths of constant length (in particular edges) can only have
a(n average) skew that is logarithmic in the network diameter.
In order to bound $\Psi^s$ in terms of $\Psi^{s-1}$, we need to apply the catch-up lemma in a different form.
Corollary 6. Let $s \in \mathbb{Z}$ and $t_0, t_1$ be times satisfying
\[ t_1 \ge t_0 + \frac{\Psi^{s-1}(t_0)}{\mu}. \]
Then, for any $w \in V$ we have
\[ L_w(t_1) - L_w(t_0) \ge t_1 - t_0 + \Psi^s_w(t_0). \]
Proof. We have that $\Psi^{s-1}(t_0) \ge \Xi^{s-1}(t_0)$ and there is some $v \in V$ satisfying
\[ \Psi^s_w(t_0) = L_v(t_0) - L_w(t_0) - 2s\kappa\, d(v, w). \]
We apply Lemma 7 to $t_0$, $t_1$, $v$, $w$ and level $s - 1$, yielding that
\begin{align*}
L_w(t_1) - L_w(t_0) &\ge t_1 - t_0 + L_v(t_0) - L_w(t_0) - (2s - 1)\kappa\, d(v, w) \\
&\ge t_1 - t_0 + L_v(t_0) - L_w(t_0) - 2s\kappa\, d(v, w) \\
&= t_1 - t_0 + \Psi^s_w(t_0).
\end{align*}
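Combining this corollary with Lemma 5, we can bound $\Psi^s$ at all times.
Lemma 9. Fix $s \in \mathbb{Z}$ and suppose that $\Psi^{s-1}(t) \le \psi^{s-1}$ for all times $t$. Then
\[ \Psi^s(t) \le \begin{cases} \Psi^s(0) + \frac{\rho}{\mu} \cdot \psi^{s-1} & \text{if } t \le \frac{\psi^{s-1}}{\mu}, \\ \frac{\rho}{\mu} \cdot \psi^{s-1} & \text{else.} \end{cases} \]
Proof. For $t \le \psi^{s-1}/\mu$, the claim follows immediately from Corollary 2. To show the claim for $t > \psi^{s-1}/\mu$, assume for contradiction that it does not hold true and let $t_1$ be minimal such that $\Psi^s(t_1) = \rho \psi^{s-1}/\mu + \varepsilon$ for some $\varepsilon > 0$. Thus, there is some $w \in V$ so that
\[ \Psi^s_w(t_1) = \Psi^s(t_1) = \frac{\rho}{\mu} \cdot \psi^{s-1} + \varepsilon. \]
Applying Corollary 6 with $t_0 = t_1 - \psi^{s-1}/\mu$ together with Lemma 5 yields the contradiction
\begin{align*}
\Psi^s_w(t_1) &\le \Psi^s_w(t_0) - (L_w(t_1) - L_w(t_0)) + (1 + \rho)(t_1 - t_0) \\
&\le \rho (t_1 - t_0) \\
&= \frac{\rho}{\mu} \cdot \psi^{s-1}.
\end{align*}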
Corollary 7. Fix $s \in \mathbb{N}$. Suppose that $\Psi^s(t) \le \psi^s$ for all times $t$ and that $\mathcal{L}(0) \le 2(s+1)\kappa$. Then
\[ \Psi^{s'}(t) \le \left(\frac{\rho}{\mu}\right)^{s' - s} \psi^s \]
for all $s' \ge s$ and times $t$.
Proof. Observe that $\mathcal{L}(0) \le 2(s+1)\kappa$ implies that $\Psi^{s'}(0) = 0$ for all $s' > s$. Thus, the statement follows from Lemma 9 by induction on $s'$, where $\psi^{s'} = \rho \cdot \psi^{s'-1}/\mu$ and the base case is $s' = s$.
Corollary 8. Fix $s \in \mathbb{N}$. Suppose that $\Psi^s(t) \le \psi^s$ for all times $t$. Then
\[ \Psi^{s'}(t) \le \left(\frac{\rho}{\mu}\right)^{s' - s} \psi^s \]
for all $s' \ge s$ and times $t \ge \psi^s/(\mu - \rho)$.
Proof. Consider the times
\[ t_{s'} = \sum_{i=1}^{s'-s} \left(\frac{\rho}{\mu}\right)^i \cdot \frac{\psi^s}{\mu} \le \frac{\psi^s}{\mu} \cdot \frac{1}{1 - \rho/\mu} = \frac{\psi^s}{\mu - \rho}. \]
We apply Lemma 9 inductively, where in step $s' > s$ we shift times by $-t_{s'}$. Thus, all considered times fall under the second case of Lemma 9, i.e., the initial values $\Psi^{s'}(0)$ (or rather $\Psi^{s'}(t_{s'})$) do not matter.
D. Putting Things Together
It remains to combine the results on global and local skew
to derive bounds that depend on the system parameters and
initialization conditions only. First, we state the bounds on
global and local skew that hold at all times. We emphasize
that this bound on the local skew also bounds up to which
level s ∈ N the algorithm needs to check FT1 and FT2, as
larger local skews are impossible.
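Theorem 2. Suppose that $L(0) \le (2s+1)\kappa$ for some $s \in \mathbb{N}$. Then
\[
G(t) \le \left(2s + \frac{\mu}{\mu-\rho}\right)\kappa D
\quad\text{and}\quad
L(t) \le \left(2s + \left\lceil \log_{\mu/\rho}\frac{\mu D}{\mu-\rho} \right\rceil + 1\right)\kappa
\]
for all $t \in \mathbb{R}_{\ge 0}$.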
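Proof. As $L(0) \le (2s+1)\kappa$, we have that
\[
\Psi^s(0) \le \max_{v,w\in V}\{d(v,w)\}\cdot\kappa = \kappa\cdot D.
\]
Hence, by Lemma 8, $\Psi^s(t) \le \frac{\mu}{\mu-\rho}\cdot\kappa\cdot D$ at all times $t$. Thus,
\[
\begin{aligned}
L_v(t) - L_w(t) - 2s\kappa D &\le L_v(t) - L_w(t) - 2s\kappa d(v,w) \\
&\le \Psi^s(t) \\
&\le \frac{\mu}{\mu-\rho}\cdot\kappa\cdot D
\end{aligned}
\]
for all $v, w \in V$ and times $t$, implying the stated bound on the global skew.
Concerning the local skew, apply Corollary 7 with $\psi^s = \frac{\mu}{\mu-\rho}\cdot\kappa\cdot D$ and $s' = s + \left\lceil \log_{\mu/\rho}\frac{\mu D}{\mu-\rho} \right\rceil$, yielding that
\[
\Psi^{s'}(t) \le \left(\frac{\rho}{\mu}\right)^{\lceil \log_{\mu/\rho}(\psi^s/\kappa) \rceil}\psi^s \le \kappa.
\]
Hence, for all neighbors $v, w \in V$ and all times $t$,
\[
L_v(t) - L_w(t) - 2s'\kappa = L_v(t) - L_w(t) - 2s'\kappa d(v,w) \le \Psi^{s'}(t) \le \kappa,
\]
implying the claimed bound on the local skew.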
Theorem 2 bounds the number of levels s ∈ N for which
the algorithm needs to check FT1 and FT2, depending on the
local skew at initialization. It also shows that, if the system can
be initialized with local skew at most κ, the system maintains
the strongest bounds the algorithm guarantees at all times.
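Corollary 9. Suppose that $L(0) \le \kappa$. Then
\[
G(t) \le \frac{\mu}{\mu-\rho}\cdot\kappa D
\quad\text{and}\quad
L(t) \le \left(\left\lceil \log_{\mu/\rho}\frac{\mu D}{\mu-\rho} \right\rceil + 1\right)\kappa
\]
for all $t \in \mathbb{R}_{\ge 0}$.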
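To get a feeling for the magnitude of these bounds, the right-hand sides of Theorem 2 can be evaluated numerically. The sketch below does so exactly as the formulas are stated, with $s = 0$ corresponding to Corollary 9. The function name `skew_bounds` and the parameter values are hypothetical and not taken from the paper's implementation.

```python
import math


def skew_bounds(mu, rho, kappa, D, s=0):
    """Evaluate the right-hand sides of Theorem 2 (s = 0 gives Corollary 9).

    Returns (global_skew_bound, local_skew_bound), in the units of kappa.
    """
    if not 0 < rho < mu:
        raise ValueError("expected 0 < rho < mu")
    ratio = mu / (mu - rho)
    global_bound = (2 * s + ratio) * kappa * D
    # ceil(log_{mu/rho}(mu * D / (mu - rho))); floating-point rounding may
    # matter when the logarithm is extremely close to an integer.
    levels = math.ceil(math.log(ratio * D) / math.log(mu / rho))
    local_bound = (2 * s + levels + 1) * kappa
    return global_bound, local_bound


# Hypothetical parameters: mu = 0.1, rho = 0.001, kappa = 1 (unit), D = 10.
g_bound, l_bound = skew_bounds(mu=0.1, rho=0.001, kappa=1.0, D=10)
print(g_bound, l_bound)
```

For these illustrative values the global bound is slightly above $\kappa D$, while the local bound is $2\kappa$, reflecting the logarithmic dependence of the local skew on $D$.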
If such highly accurate initialization is not possible, the
algorithm will converge to the bounds from Corollary 9.
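Theorem 3. Suppose that $\mu > 2\rho$. Then there is some $T \in O\left(\frac{G(0)+\kappa D}{\mu-2\rho}\right)$ such that
\[
G(t) \le \frac{\mu}{\mu-2\rho}\cdot\kappa D
\quad\text{and}\quad
L(t) \le \left(\left\lceil \log_{\mu/\rho}\frac{\mu D}{\mu-2\rho} \right\rceil + 1\right)\kappa
\]
for all times $t \ge T$.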
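Proof. By assumption,
\[
q = \frac{\rho}{\mu}\cdot\left(1+\frac{\rho}{\mu}\right) \le \frac{1}{2}\cdot\frac{3}{2} = \frac{3}{4}.
\]
Fix some sufficiently small constant $\varepsilon > 0$ such that
\[
\frac{\kappa D}{1-q} + \varepsilon\kappa D \le \frac{\kappa D}{1-2\rho/\mu};
\]
since $q \le \frac{3}{2}\cdot\frac{\rho}{\mu}$, such a constant exists. Choose $i \in \mathbb{N}$ minimal with the property that $q^i\left(1+\frac{\rho}{\mu}\right)G(0) \le \varepsilon\kappa D$. Therefore, by Corollary 5,
\[
G(t) \le \frac{\mu\kappa D}{\mu-2\rho}
\]
at all times $t \ge 4(G(0)+i\kappa D)/\mu$. Noting that $\Psi^0(t) = G(t)$, analogously to Theorem 2 we can now apply Corollary 8 to infer the desired bound on the local skew for times
\[
t \ge \frac{4(G(0)+i\kappa D)}{\mu} + \frac{\mu\kappa D}{(\mu-\rho)(\mu-2\rho)}.
\]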
Consequently, it remains to show that the right hand side of this inequality is indeed in $O\left(\frac{G(0)+\kappa D}{\mu-2\rho}\right)$. As $\mu-\rho \ge \mu/2$, this is immediate for the second term. Concerning the first term, our choice of $i$ and $q \le 3/4$ yield that $i \in O\left(\log\frac{G(0)}{\kappa D}\right)$. Because for $x \ge y > 0$ it holds that $x \ge \log(x/y)\cdot y$, we can bound
\[
\frac{4(G(0)+i\kappa D)}{\mu} \in O\left(\frac{G(0)+\kappa D}{\mu-2\rho}\right).
\]
Theorem 1 is an immediate corollary of Theorem 3.
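The quantities $q$, $\varepsilon$, and $i$ from the proof of Theorem 3 can be computed for concrete parameters. The sketch below makes two assumptions that go beyond the text: it takes the largest $\varepsilon$ permitted by the displayed constraint (the proof only asks for a sufficiently small constant), and it treats $\mathbb{N}$ as including $0$. The function name `theorem3_constants` and the numeric inputs are hypothetical.

```python
import math


def theorem3_constants(G0, kappa, D, mu, rho):
    """Return (q, eps, i) as used in the proof of Theorem 3.

    eps is chosen as the largest value with
        kappa*D/(1-q) + eps*kappa*D <= kappa*D/(1 - 2*rho/mu),
    which is an assumption made for illustration.
    """
    if not mu > 2 * rho > 0:
        raise ValueError("expected mu > 2 * rho > 0")
    x = rho / mu
    q = x * (1 + x)                      # q <= 3/4 because x < 1/2
    eps = 1 / (1 - 2 * x) - 1 / (1 - q)  # positive because q < 2 * x
    i = 0
    # Smallest i with q^i * (1 + rho/mu) * G(0) <= eps * kappa * D.
    while q ** i * (1 + x) * G0 > eps * kappa * D:
        i += 1
    return q, eps, i


# Hypothetical inputs: initial global skew 100, kappa = 1, D = 10.
q, eps, i = theorem3_constants(G0=100.0, kappa=1.0, D=10, mu=0.1, rho=0.01)
print(q, eps, i)
```

Because $q \le 3/4$, the loop ends after a number of iterations that is logarithmic in $G(0)/(\kappa D)$, matching the bound on $i$ used in the proof.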
E. Proof of Lemma 3
Proof. We prove the stronger claim that for all $a, b$ satisfying $t_0 \le a < b \le t_1$, we have
\[
\frac{F(b) - F(a)}{b - a} \le r. \tag{8}
\]
To this end, suppose to the contrary that there exist $a_0 < b_0$ satisfying $(F(b_0) - F(a_0))/(b_0 - a_0) \ge r + \varepsilon$ for some $\varepsilon > 0$. We define a sequence of nested intervals $[a_0, b_0] \supset [a_1, b_1] \supset \cdots$ as follows. Given $[a_j, b_j]$, let $c_j = (b_j + a_j)/2$ be the midpoint of $a_j$ and $b_j$. Observe that
\[
\frac{F(b_j) - F(a_j)}{b_j - a_j}
= \frac{1}{2}\cdot\frac{F(b_j) - F(c_j)}{b_j - c_j} + \frac{1}{2}\cdot\frac{F(c_j) - F(a_j)}{c_j - a_j}
\ge r + \varepsilon,
\]
so that
\[
\frac{F(b_j) - F(c_j)}{b_j - c_j} \ge r + \varepsilon
\quad\text{or}\quad
\frac{F(c_j) - F(a_j)}{c_j - a_j} \ge r + \varepsilon.
\]
If the first inequality holds, define $a_{j+1} = c_j$, $b_{j+1} = b_j$, and otherwise define $a_{j+1} = a_j$, $b_{j+1} = c_j$. From the construction of the sequence, it is clear that for all $j$ we have
\[
\frac{F(b_j) - F(a_j)}{b_j - a_j} \ge r + \varepsilon. \tag{9}
\]
Observe that the sequences $\{a_j\}_{j=0}^{\infty}$ and $\{b_j\}_{j=0}^{\infty}$ are both bounded and monotonic, hence convergent. Further, since $b_j - a_j = \frac{1}{2^j}(b_0 - a_0)$, the two sequences share the same limit. Define
\[
c = \lim_{j\to\infty} a_j = \lim_{j\to\infty} b_j,
\]
and let $f \in \mathcal{F}$ be a function satisfying $f(c) = F(c)$. By the hypothesis of the lemma, we have $f'(c) \le r$, so that
\[
\lim_{h\to 0}\frac{f(c+h) - f(c)}{h} \le r.
\]
Therefore, there exists some $h > 0$ such that for all $t \in [c-h, c+h]$, $t \ne c$, we have
\[
\frac{f(t) - f(c)}{t - c} \le r + \frac{1}{2}\varepsilon.
\]
Further, from the definition of $c$, there exists $N \in \mathbb{N}$ such that for all $j \ge N$, we have $a_j, b_j \in [c-h, c+h]$. In particular, this implies that for all sufficiently large $j$, we have
\[
\frac{f(c) - f(a_j)}{c - a_j} \le r + \frac{1}{2}\varepsilon, \tag{10}
\]
\[
\frac{f(b_j) - f(c)}{b_j - c} \le r + \frac{1}{2}\varepsilon. \tag{11}
\]
Since $f(a_j) \le F(a_j)$ and $f(c) = F(c)$, (10) implies that for all $j \ge N$,
\[
\frac{F(c) - F(a_j)}{c - a_j} \le r + \frac{1}{2}\varepsilon.
\]
However, this expression combined with (9) implies that for all $j \ge N$
\[
\frac{F(b_j) - F(c)}{b_j - c} \ge r + \varepsilon. \tag{12}
\]
Since $F(c) = f(c)$, the previous expression together with (11) implies that for all $j \ge N$ we have $f(b_j) < F(b_j)$.
For each $j \ge N$, let $g_j \in \mathcal{F}$ be a function such that $g_j(b_j) = F(b_j)$. Since $\mathcal{F}$ is finite, there exists some $g \in \mathcal{F}$ such that $g = g_j$ for infinitely many values $j$. Let $j_0 < j_1 < \cdots$ be the subsequence such that $g = g_{j_k}$ for all $k \in \mathbb{N}$. Then for all $j_k$, we have $F(b_{j_k}) = g(b_{j_k})$. Further, since $F$ and $g$ are continuous, we have
\[
g(c) = \lim_{k\to\infty} g(b_{j_k}) = \lim_{k\to\infty} F(b_{j_k}) = F(c) = f(c).
\]
By (12), we therefore have that for all $k$
\[
\frac{g(b_{j_k}) - g(c)}{b_{j_k} - c} = \frac{F(b_{j_k}) - F(c)}{b_{j_k} - c} \ge r + \varepsilon.
\]
However, this final expression contradicts the assumption that $g'(c) \le r$. Therefore, (8) holds, as desired.
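As a numerical sanity check of claim (8), the sketch below samples difference quotients of a pointwise maximum. It assumes, as the proof suggests, that $F$ is the pointwise maximum of a finite family of differentiable functions. The concrete family, the interval, and the name `max_difference_quotient` are hypothetical choices for illustration, and every function in the family has derivative at most $r = 1$, which is a special case of the lemma's hypothesis.

```python
import math
from itertools import combinations


def max_difference_quotient(family, t0, t1, samples=200):
    """Largest sampled value of (F(b) - F(a)) / (b - a) for t0 <= a < b <= t1,
    where F is the pointwise maximum of the functions in `family`."""

    def F(t):
        return max(f(t) for f in family)

    # Strictly increasing grid, so every pair from combinations has a < b.
    grid = [t0 + (t1 - t0) * k / samples for k in range(samples + 1)]
    return max((F(b) - F(a)) / (b - a) for a, b in combinations(grid, 2))


# Hypothetical family: all three functions have derivative at most r = 1.
family = [lambda t: 0.5 * t, lambda t: t - 1.0, math.sin]
worst = max_difference_quotient(family, 0.0, 4.0)
print(worst)
```

A finite sample cannot replace the proof, but a sampled quotient noticeably above $r$ would indicate that the hypothesis of the lemma is violated for the chosen family.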
