Abstract. The Dutch company Chess develops a wireless sensor network (WSN) architecture using an epidemic communication model. One of the greatest challenges in the design is to find suitable mechanisms for clock synchronization. In this paper, we study a clock synchronization protocol for the Chess WSN. First, we model the protocol as a network of timed automata and verify various instances using the Uppaal model checker. Next, we present a full parametric analysis of the protocol for the special case of cliques (networks with full connectivity), that is, we give constraints on the parameters that are both necessary and sufficient for correctness. These results have been checked using the proof assistant Isabelle. Finally, we present a negative result for the special case of line topologies: for any instantiation of the parameters, the protocol will eventually fail if the network grows.
Introduction
Wireless sensor networks (WSNs) consist of potentially thousands of autonomous devices that communicate via radio and use sensors to cooperatively monitor physical or environmental conditions, such as temperature, sound or motion, at different locations. WSNs have numerous exciting applications, ranging from monitoring of dikes to smart kindergartens, and from forest fire detection to monitoring of the Matterhorn. It is an active research area with numerous workshops and conferences arranged each year.
The Dutch company Chess develops a WSN architecture using an epidemic (gossip) communication model [17] . Gossiping in distributed systems refers to the repeated probabilistic exchange of information between two members [11, 8] . The effect is that information can spread within a group just as it would in real life. Their simplicity, robustness and flexibility make gossip based algorithms attractive for data dissemination and aggregation in wireless sensor networks. However, formal analysis of gossip algorithms is a challenging research problem [2] . The Chess WSN currently distinguishes three protocol layers: the Medium Access Control (MAC) layer, which is responsible for regulating the access to the wireless shared channel, the intermediate Gossip layer, which is responsible for insertion of new messages, forwarding of current messages and deletion of old messages, and the Application layer, which has the business logic that interprets messages and may generate new messages. In our research we focus on the MAC layer of the Chess WSN. Characteristics of the other layers influence the design decisions for the MAC layer. For instance, the redundant nature of the Gossip layer justifies occasional message loss in the MAC layer.
The MAC layer uses a Time Division Multiple Access (TDMA) protocol. Time is divided into fixed-length frames, and each frame is subdivided into slots (see Figure 1). Slots can be either active or idle. During active slots, a node is either listening for incoming messages from neighboring nodes ("RX") or it is sending a message itself ("TX"). During idle slots a node is switched to energy saving mode. Nodes are battery-operated devices with an expected uninterrupted field deployment of several years. Hence, energy efficiency is a major concern in the design of WSNs. For this reason, the number of active slots is typically much smaller than the total number of slots (less than 1% in the current implementation [17]). The active slots form one contiguous sequence, which is currently placed at the beginning of the frame. A node can only transmit a message once per time frame, in its TX slot. If two neighboring nodes choose the same send slot, a collision will occur in the intersection of their ranges, preventing delivery of either node's message in that intersection. Ideally, no neighboring pair would ever choose the same send slot. This has proven to be very hard to achieve, especially in settings with node mobility. In our work, we have not addressed the issue of slot allocation and simply assume that the TX slots of all nodes are fixed and have been chosen in such a way that no collisions occur.
One of the greatest challenges in the design of the MAC layer is to find suitable mechanisms for clock synchronization: we must ensure that whenever some node is sending all its neighbors are listening. In this paper, we study clock synchronization in the Chess WSN. Each wireless sensor node comes equipped with a low-cost 32 kHz crystal oscillator that drives an internal clock that is used to determine the start and end of each slot. Since such oscillators are not perfectly accurate, the TDMA time slot boundaries may drift, which can lead to situations in which nodes get out of sync. To overcome this problem, the notion of guard time is introduced: at the beginning of its TX slot, before actually starting transmission, a sender waits a certain amount of time for the receiver to be ready to receive messages. Similarly, the sender also waits for some time period at the end of its TX slot (see Figure 2). In the current implementation, each slot consists of 29 clock cycles, out of which 18 cycles are used as guard time. Assegei [1] calculated how the battery life of a wireless sensor node is influenced by the guard time. Figure 3, taken from [1], summarizes these results. Clearly, it is of vital importance to reduce the guard time as much as possible, since this directly affects the battery life, which is a key characteristic of WSNs. Reduction of the guard time is possible if the hardware clocks are properly synchronized.
Many clock synchronization protocols have been proposed for WSNs. In most of these protocols, clocks are synchronized to an accurate real-time standard like Coordinated Universal Time (UTC). We refer to [22] for an overview of this type of protocol. However, these protocols are based on the exchange of time stamp messages, and for the Chess WSN this creates an unacceptable computation and communication overhead. It is possible to come up with more efficient algorithms, since for the MAC layer a weak form of clock synchronization suffices: a node only needs to be synchronized to its immediate neighbors, not to faraway nodes or to UTC. Fan & Lynch [9] study the gradient clock synchronization (GCS) problem, in which the difference between any two network nodes' clocks must be bounded from above by a non-decreasing function of the distance between the nodes. Thus nearby nodes must be closely synchronized but faraway nodes are allowed to be more loosely synchronized.
In the approach of Fan & Lynch [9], nodes compute logical clock values based on their hardware clocks and message exchanges, and the goal is to synchronize the nodes' logical clocks as closely as possible, while satisfying certain validity conditions. Logical clocks have been introduced by Lamport [12] to totally order the events in a distributed system. A key property of Lamport's logical clocks is that they never run backwards: their value can only increase. In fact, Fan & Lynch [9] assume that the rate of increase of each node's logical clock is at least 1/2, at all times. Also Meier & Thiele [13] and Pussente & Barbosa [15], who adapt the work of Fan & Lynch to the setting of wireless sensor networks, make a similar assumption (with minimal clock rates 1/2 and 1/D, respectively, where D is the network diameter). For certain applications of WSNs it is important to have Lamport style logical clocks. For example, if two sensor nodes observe a moving object, then logical clocks allow one to establish the object's direction by determining which node observed the object first [13]. However, for the MAC layer there is no need to compute a total order on events: we only need to ensure that whenever one node is sending all neighbors are listening. If we are willing to set back clocks now and then, we may obtain even more efficient clock synchronization protocols.
Unlike Fan & Lynch [9], Meier & Thiele [13] and Pussente & Barbosa [15] target sensor networks directly. Meier & Thiele [13] provide a lower bound for the achievable synchronization quality in sensor networks, but no algorithms that attain or come close to this bound. Pussente & Barbosa [15] do present an algorithm, but this cannot be applied in the TDMA-based setting of the Chess algorithm. Basic assumptions of [13, 15] are that (a) messages sent between neighbors are always delivered instantaneously, and (b) consecutive communications between any two neighbors in the same direction are no farther apart in time than some given time d. Pussente & Barbosa [15] derive a strict upper bound of c + 2(1 + 2ρ)d on the difference between the clocks of neighboring nodes, where c > 0 is a constant and ρ ∈ [0, 1) is the maximal clock drift. But since this bound exceeds 2d and in a TDMA setting d basically equals the length of a frame, the algorithm of [15] is unable to guarantee that whenever some node is sending all its neighbors are listening.
The current implementation of the Chess WSN uses Median, an extension of an algorithm proposed by Tjoa et al. [23]. The idea is that in every frame each node computes its phase error to each of its direct neighbors. After the last active slot, each node adjusts its clock by the median of the phase errors of its immediate neighbors. Assegei [1] points out that the performance of the Median algorithm decreases if the network becomes more dynamic. In fact, in [21] we established that in certain cases even a static, fully synchronized network may eventually become unsynchronized if the Median algorithm is used, even in a setting with infinitesimal clock drifts. Assegei [1] proposes a variation of the Median algorithm that uses Kalman filters. In our paper, we use formal methods to analyze another variation of the Chess algorithm in which a node adjusts its clock whenever a message arrives. Advantages of this approach are (a) unlike the Median approach and its variants we need almost no guard time at the end of a sending slot (2 clock ticks suffice instead of 9 ticks in the current implementation), and (b) the computational overhead becomes essentially zero. However, robustness of our algorithm still needs to be explored further.
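To make the Median idea concrete, the following Python sketch adjusts a node's clock by the median of the phase errors measured in one frame. It is only an illustration: the function name, the sign convention for phase errors and the handling of a frame without any received message are assumptions of this sketch and are not taken from the Chess implementation.

```python
from statistics import median

def median_adjustment(clock, phase_errors):
    """Return the clock value after one Median-style correction.

    phase_errors holds, for each neighbor heard in the current frame, the
    measured offset (neighbor clock minus own clock, in ticks). This sign
    convention is an assumption of the sketch.
    """
    if not phase_errors:
        # No neighbor was heard in this frame: leave the clock untouched.
        return clock
    return clock + median(phase_errors)
```

For instance, a node with clock value 100 that measures phase errors −2, 1 and 3 moves its clock to 101.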
In Section 2, we model the algorithm using timed automata. Section 3 describes the use of the timed automata model checker Uppaal [4, 3] to analyze WSNs with full connectivity. We verify various instances and identify three different scenarios that may lead to situations where the network is out of sync. Section 4 presents a full parametric analysis of the protocol for cliques (networks with a connection between every pair of nodes), that is, we give constraints on the parameters that are both necessary and sufficient for correctness. We have checked our results using the proof assistant Isabelle [14] . Section 5 presents a result for the special case of line topologies: for any instantiation of the parameters, the protocol will eventually fail if the network grows. Section 6, finally, discusses related work and draws conclusions.
Uppaal models, Isabelle sources and invariant proofs for this paper are available at http://www.mbsd.cs.ru.nl/publications/papers/fvaan/HSV09/.
Acknowledgement. Many thanks to Frits van der Wateren, Marcel Verhoef and Bert Bos from Chess for explaining their WSN algorithms to us.
Uppaal Model
In this section, we describe the Uppaal model that we constructed of the Chess protocol. For a detailed account of the timed automata model checking tool Uppaal, we refer to [4, 3] and to http://www.uppaal.com.
We assume a finite, fixed set of wireless nodes Nodes = {0, . . . , N − 1}. The behavior of an individual node i ∈ Nodes is described by three timed automata: Clock(i) (Section 2.1), WSN(i) (Section 2.2) and Synchronizer(i) (Section 2.3). Automaton Clock(i) models the hardware clock of node i, automaton WSN(i) takes care of sending messages, and the Synchronizer(i) automaton resynchronizes the hardware clock of i upon receipt of a message. The complete protocol is modeled as a network that consists of timed automata Clock(i), WSN(i) and Synchronizer(i), for each i ∈ Nodes. Table 1 lists the parameters (constants in Uppaal terminology) that we use in our model, together with some basic constraints. The domain of all parameters is the set of natural numbers.
Clock
Timed automaton Clock(i), displayed in Figure 4, models the behavior of the hardware clock of node i. It has a single location and a single transition. It comes equipped with a local clock variable x, which is initially 0, that is used to measure the time in between clock ticks. Whenever x reaches the value min, the automaton enables a tick[i]! action. Broadcast channel tick[i] is used to synchronize all activities within node i. The tick[i]! action must occur before x has reached value [...]

Table 1. Parameters of the model

Parameter   Description                               Constraints
N           number of nodes                           0 < N
C           number of slots in a time frame           0 < C
n           number of active slots in a time frame    0 < n ≤ C
tsn[i]      TX slot number for node i ∈ Nodes         0 ≤ tsn[i] < n
k0          number of clock ticks in a time slot      0 < k0
g           guard time                                0 < g
t           tail time                                 0 < t, g + t + 2 ≤ k0
min         minimal time between two clock ticks      0 < min
max         maximal time between two clock ticks      min ≤ max

[...] th tick in the current slot (cf. Figure 2), and then returns to location WAIT. At the end of each slot, that is, when the k0-th tick occurs, the automaton increments its current slot number (modulo C).
Synchronizer
Automaton Synchronizer(i), displayed in Figure 6, is the last component of our model. It performs the role of the clock synchronizer in the TDMA protocol. The automaton has two locations and two transitions. The automaton waits in its initial location S0 until it detects the start of a new message, that is, until a start_message[j]? event occurs, for some j. We use the Uppaal select statement to nondeterministically select a j ∈ Nodes. The automaton then moves to location S1, provided node i is active (csn[i] < n). Remember that at the moment when the start_message[j]? event occurs, the hardware clock of node j, clk[j], has value g. Therefore, node i resets its own hardware clock clk[i] to g + 1 upon occurrence of the first clock tick following the start_message[j]? event. The automaton then returns to its initial location S0.
Note that in our model there is no delay between sending and receipt of messages. Following Meier & Thiele [13] , we assume delay uncertainties to be negligible, and we therefore eliminate the delays themselves from our analysis. When communication is infrequent, this is reasonable since the impact of clock drift dominates over the influence of delay uncertainties.
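The resynchronization rule can be illustrated with a small discrete sketch in Python. It merges the clock, slot counter and synchronizer of one node into a single class and ignores real time (the bounds min and max), so it is a simplification of the timed automata rather than a translation of them. The class and method names are hypothetical.

```python
class Node:
    """Discrete sketch of one node: hardware clock, slot counter, synchronizer."""

    def __init__(self, k0, C, n, g):
        self.k0, self.C, self.n, self.g = k0, C, n, g
        self.clk = 0          # ticks elapsed in the current slot
        self.csn = 0          # current slot number
        self.pending = False  # True while the synchronizer is in S1

    def on_start_message(self):
        # S0 -> S1: only an active node (csn < n) reacts to a message start.
        if self.csn < self.n:
            self.pending = True

    def on_tick(self):
        if self.pending:
            # S1 -> S0: the sender's clock was g when it started sending,
            # so on the next tick the receiver sets its own clock to g + 1.
            self.clk = self.g + 1
            self.pending = False
            return
        self.clk = (self.clk + 1) % self.k0
        if self.clk == 0:
            # The k0-th tick ends the slot: advance the slot number modulo C.
            self.csn = (self.csn + 1) % self.C
```

With k0 = 10 and g = 2, a node whose clock reads 5 when a message starts will read 3 after its next tick, whatever its previous value was.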
Automaton Synchronizer(i) has no constraint on the value of j, that is, we assume that node i can receive messages from any node in the network. Hence the network has full connectivity. It is easy to generalize our model to a setting with arbitrary network topologies by adding a guard neighbor(i, j) to the transition from S0 to S1 that indicates that i is a direct neighbor of j. The neighbor(i, j) predicate does not have to be symmetric since in a wireless sensor network it may occur that i can receive messages from j, but not vice versa. For networks with full connectivity, we assume that all nodes have unique TX slot numbers:

tsn[i] ≠ tsn[j] for all i, j ∈ Nodes with i ≠ j
For networks that are not fully connected, this assumption can be relaxed to the requirement that neighboring nodes have distinct TX slot numbers, and distinct nodes with the same TX slot number do not have a common neighbor:
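The relaxed requirement can be checked mechanically. The Python sketch below does so for a given slot assignment and neighbor relation. For simplicity it treats the neighbor relation as undirected, which is an assumption of the sketch (the model allows asymmetric links), and the function and parameter names are hypothetical.

```python
from itertools import combinations

def valid_slot_assignment(tsn, neighbors):
    """Check the relaxed TX slot condition for an arbitrary topology.

    tsn[i] is the TX slot of node i; neighbors[i] is the set of nodes that
    are direct neighbors of i (treated as undirected here, an assumption).
    """
    for i, j in combinations(range(len(tsn)), 2):
        if tsn[i] != tsn[j]:
            continue
        # Same slot: i and j must not be neighbors ...
        if j in neighbors[i] or i in neighbors[j]:
            return False
        # ... and must not share a common neighbor.
        if neighbors[i] & neighbors[j]:
            return False
    return True
```

On a line of four nodes 0 - 1 - 2 - 3, the end nodes 0 and 3 may share a TX slot, whereas nodes 0 and 2 may not, since both are neighbors of node 1.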
Uppaal Analysis Results
We call a wireless sensor network synchronized if whenever a node is sending all neighboring nodes have the same slot number as the sending node. For networks with full connectivity this means that all nodes in the network agree on the current slot. We obtain the following formal definition of correctness.
Definition 1. A network with full connectivity is synchronized if and only if for all reachable states
Our objective is to find necessary and sufficient constraints on the system parameters that ensure that a network with full connectivity is synchronized. To this end, we assign different values to the parameters of the model and use Uppaal to verify the property of Definition 1. Based on the outcomes (and in particular the counterexamples generated by Uppaal) we try to derive general constraints. For networks with up to 4 nodes, Uppaal is able to explore the state space within a few seconds. Table 2 shows some example values of the parameters for which the model is synchronized. In fact, the values of min and max in this table are the smallest consecutive natural numbers for which the model with the values assigned to N, C, n, k0 and g is synchronized. Parameter t is chosen equal to g and tsn[i] is chosen equal to i. We keep n, k0 and g constant and vary C, the number of slots in a frame. Observe that if the value of C increases, the values of min and max also increase, i.e., if the length of a frame increases then the hardware clocks must become more accurate to maintain synchronization. Observe that these parameter values are not realistic: a realistic clock accuracy is around 30 ppm (parts-per-million), C is about 1000 (instead of 10), and g is 9 (instead of 2). Uppaal cannot handle realistic values because of the state explosion problem. Nevertheless, as we will see, the counterexamples provided by Uppaal do provide insight.
In Table 3, we keep all other parameters constant and consider the values of min and max for different numbers of nodes when n changes. It turns out that increasing n has no impact on network behavior. In Table 4, we keep all other parameters constant and consider the smallest values of min and max for different numbers of nodes when k0 changes. It turns out that increasing k0 forces us to increase min and max. In Table 5, we keep all other parameters constant and consider the smallest values of min and max for different numbers of nodes when g changes. Increasing g allows us to decrease min and max.

Table 3. Numerical results, changing n

It turns out that there are essentially three different scenarios that may lead to a state in which the network is not synchronized. In order to describe these scenarios at an abstract level, we need a bit of notation. We say that s ∈ {0, . . . , C − 1} is a transmitting slot, notation TX(s), if there is some node i that is transmitting in s, that is, TX(s) ⇔ ∃ i ∈ Nodes : tsn[i] = s.
We let PREV(s) denote the nearest transmitting slot that precedes s (cyclically). Formally, function PREV : {0, . . . , C − 1} → {0, . . . , C − 1} is defined by
We write D(s) to denote the number of slots visited when going from PREV(s) to s, that is, D(s) = (s − PREV(s)) % C. We define M = max_s D(s) to be the maximal distance between transmitting slots. As we will see, M plays a key role in defining correctness.
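The notions TX, PREV, D and M can be computed directly from the list of TX slot numbers. The Python sketch below reads PREV(s) as the nearest transmitting slot strictly before s, searching backwards cyclically. That reading, and the function names, are assumptions of this sketch. It presupposes that the frame contains at least two transmitting slots.

```python
def is_tx(s, tsn):
    """TX(s): some node has slot s as its TX slot."""
    return s in tsn

def prev(s, tsn, C):
    """Nearest transmitting slot strictly before s, searching cyclically."""
    for d in range(1, C + 1):
        candidate = (s - d) % C
        if is_tx(candidate, tsn):
            return candidate
    raise ValueError("no transmitting slot in the frame")

def max_distance(tsn, C):
    """M: the maximum over all slots s of D(s) = (s - PREV(s)) % C."""
    return max((s - prev(s, tsn, C)) % C for s in range(C))
```

For a frame of C = 6 slots in which two nodes send in slots 0 and 1, `max_distance([0, 1], 6)` yields 5: the longest gap is the one from slot 1 around the end of the frame to slot 0.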
Scenario 1: Fast Sender - Slow Receiver
In the first error scenario, a sending node is proceeding maximally fast whereas a receiving node runs maximally slow. The sender starts with the transmission of a message while the receiver is still in an earlier slot. The scenario is illustrated in Figure 7. It starts when the fast and the slow node receive a synchronization message. Immediately following receipt of this message (at the same point in time), the hardware clock of the fast node ticks and the synchronizer resets this clock to g + 1. Now, in the worst case, it may take M · k0 − 1 ticks before the fast node is in its TX slot with its hardware clock equal to g. Since the hardware clock of the fast node ticks maximally fast, the length of the corresponding time interval is (M · k0 − 1) · min. The slow node will reach the TX slot of the fast node after M · k0 − g ticks. With a clock that ticks maximally slow, this may take (M · k0 − g) · max time. If (M · k0 − g) · max ≥ (M · k0 − 1) · min then we may end up in a state where the network is no longer synchronized since the fast node is sending before the slow node has moved to the same slot. Hence, in order to exclude this scenario, we must have:

(M · k0 − g) · max < (M · k0 − 1) · min    (2)
This constraint is consistent with the results in Table 2. Consider, for instance, the first column. According to Uppaal the protocol is correct if N = 2, C = 6, n = 4, k0 = 10, g = 2, min = 49 and max = 50. Since we assume that the two nodes are sending in the first two slots of a frame, it is easy to see that M = 5. Now we can verify that (M · k0 − g) · max = 48 · 50 = 2400 < 2401 = 49 · 49 = (M · k0 − 1) · min.

Instead of the lower bound min and the upper bound max on the time between clock ticks, we sometimes find it convenient to consider the ratio ρ = min / max.
Since 0 < min ≤ max, it follows that ρ is contained in the interval (0, 1]. The following elementary lemma turns out to be quite useful.
This implies that the worst case scenario occurs when the distance between TX slots is maximal: if the constraint holds for M it also holds when we replace M by a smaller value.
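The arithmetic behind the first scenario is easy to replay. The Python sketch below takes constraint (2) to be (M · k0 − g) · max < (M · k0 − 1) · min, as the tick counts in Scenario 1 suggest, and evaluates it for given parameter values. The function name is hypothetical, and `min_` and `max_` stand for the parameters min and max.

```python
def excludes_scenario_1(M, k0, g, min_, max_):
    """True if the slow receiver reaches the TX slot before the fast sender starts."""
    slow_receiver_time = (M * k0 - g) * max_   # M*k0 - g ticks at the slowest rate
    fast_sender_time = (M * k0 - 1) * min_     # M*k0 - 1 ticks at the fastest rate
    return slow_receiver_time < fast_sender_time
```

For M = 5, k0 = 10 and g = 2 the check succeeds with min = 49 and max = 50 (2400 < 2401), but fails for the next smaller pair min = 48 and max = 49 (2352 on both sides). This agrees with the observation that 49 and 50 are the smallest consecutive values for which Uppaal reports synchronization.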
Example 1 (The Chess implementation). Constraint (2) allows us to infer a lower bound on the guard time g. In the current implementation of the protocol by Chess [17], a quartz crystal oscillator is used with a clock drift rate θ of at most 20 ppm. This means that

ρ = (1 − θ) / (1 + θ) = (1 − 20 · 10^−6) / (1 + 20 · 10^−6) ≈ 0.99996

In the Chess implementation, one time frame lasts for about 1 second. It consists of C = 1129 slots and each slot consists of k0 = 29 clock ticks. The number of active slots is small (n = 10). A typical value for M is C − n = 1119. Hence

g > (1 − ρ) · M · k0 + ρ ≈ 2.3
Thus, according to our theoretical model, a value of g = 3 should suffice. Chess actually uses a guard time of 9. Of course one should realize here that our model is overly simplified and, for instance, does not take into account (uncertainty in) message delays and partial connectivity. We will see that these restrictions greatly influence the minimal guard time.
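The numbers of Example 1 can be reproduced with a few lines of Python. The sketch assumes that constraint (2), solved for g, reads g > (1 − ρ) · M · k0 + ρ with ρ = (1 − θ)/(1 + θ). This form follows from the tick counts of Scenario 1 but is a reconstruction, and the function name is hypothetical.

```python
import math

def min_guard_time(theta, M, k0):
    """Smallest integer g with g > (1 - rho) * M * k0 + rho, plus the bound itself."""
    rho = (1 - theta) / (1 + theta)      # ratio min/max for drift rate theta
    bound = (1 - rho) * M * k0 + rho     # constraint (2) solved for g (assumed form)
    return math.floor(bound) + 1, bound
```

With θ = 20 · 10^−6, M = 1119 and k0 = 29 the bound evaluates to roughly 2.3, so the smallest admissible integer guard time is 3.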
Scenario 2: Fast Receiver - Slow Sender - before transmission
In our second error scenario, a receiving node runs maximally fast whereas a sending node proceeds maximally slow. The receiving node already leaves the slot in which it should receive a message from the sender before the sender has even started transmission. This scenario is illustrated in Figure 8. Again, the scenario starts when the fast and the slow node receive a synchronization message. But now the node that has to send the next message runs maximally slow. It sends this message after M · k0 ticks have occurred, which takes M · k0 · max time. Meanwhile, the fast node has made maximal progress: immediately after receipt of the first synchronization message (at the same point in time), the hardware clock of the fast node ticks and the synchronizer resets this clock to g + 1. Already after (k0 − g − 1) · min time the node proceeds to the next slot. Another (M · k0 − 1) · min time units later the fast node sets its clock to k0 − 1 and is about to leave the slot in which the slow node will send a message. If the slow node starts transmission after this point it is too late: after the next clock tick the fast node will increment its slot counter and the network is no longer synchronized. In order to exclude the second scenario, the following constraint must hold:
Also this constraint can be rewritten:
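The tick counts given for the second scenario can be combined as follows. This is a sketch under the assumption that the slow sender must start strictly before the fast receiver reaches clock value k0 − 1 in the target slot. Whether constraint (3) is stated with a strict inequality cannot be recovered from the text here, so the inequality sign below is an assumption.

```latex
\begin{align*}
T_{\text{slow sender}} &= M \cdot k_0 \cdot \mathit{max}, \\
T_{\text{fast receiver}} &= (k_0 - g - 1)\cdot \mathit{min} + (M \cdot k_0 - 1)\cdot \mathit{min}
  = \bigl((M+1)\cdot k_0 - g - 2\bigr)\cdot \mathit{min}, \\
T_{\text{slow sender}} < T_{\text{fast receiver}}
  &\iff M \cdot k_0 < \rho \cdot \bigl((M+1)\cdot k_0 - g - 2\bigr),
  \qquad \rho = \mathit{min}/\mathit{max}.
\end{align*}
```

For the first column of Table 2 (M = 5, k0 = 10, g = 2, min = 49, max = 50) the two sides are 2500 and 2744, so this scenario is excluded with a comfortable margin.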
Scenario 3: Fast Receiver - Slow Sender - during transmission
Our third scenario involves a fast receiver and a slow sender. The receiver moves to a new slot while the sender is still transmitting a message. Figure 9 illustrates the scenario. As in the previous scenarios, the hardware clock of the fast node is set to g + 1 immediately after receipt of the synchronization message. To exclude this scenario, the following condition should be satisfied:
Essentially, constraint (4) provides a lower bound on t: to rule out the scenario in Figure 9 , the sender should wait long enough before proceeding to the next slot.
Lemma 3. Constraint (4) is equivalent to
If we fill in the values of Example 1 with g set to 3, we obtain t > 1.001. Hence a value of t = 2 should suffice. Hence, for the simple case of a static network with full connectivity and no uncertainty in message delays, we only need to reserve 5 clock cycles for guard and tail time together. In Section 5, we will see that for different network topologies indeed much larger values are required.
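The value t > 1.001 can be reproduced numerically under an assumption about the shape of the bound. The Python sketch below assumes that the sender transmits for k0 − g − t ticks at the slowest rate while the receiver, reset to g + 1, leaves the slot after k0 − g − 1 ticks at the fastest rate, which gives t > (1 − ρ) · (k0 − g) + ρ. This form is not taken from the text of Lemma 3; it is a reconstruction that happens to match the reported number. The function name is hypothetical.

```python
def tail_time_bound(theta, k0, g):
    """Lower bound on t under the assumed form t > (1 - rho) * (k0 - g) + rho."""
    rho = (1 - theta) / (1 + theta)   # ratio min/max for drift rate theta
    return (1 - rho) * (k0 - g) + rho
```

With θ = 20 · 10^−6, k0 = 29 and g = 3 the bound rounds to 1.001, so t = 2 is the smallest admissible integer and guard and tail time together take 3 + 2 = 5 clock cycles.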
Proving Sufficiency of the Constraints
In this section, we outline our proof that the three constraints derived in Section 3 are sufficient to ensure synchronization in networks with full connectivity. We first present the key invariants used in the proof and then discuss the formalization of the full proof using Isabelle/HOL.
Invariants
We start our proof by stating some elementary invariants.
Lemma 4. For any network with full connectivity the following invariant assertions hold, for all reachable states and for all i ∈ Nodes:
Invariants (5)-(7) [...] (8)-(12) directly follow from the definitions of the individual automata in the network. For invariant (10), observe that since the tick?-transition from WAIT to GO_SEND may synchronize with the tick?-transition from S1 to S0, the value of clk[i] in GO_SEND_i is potentially g + 1. In order to be able to state more interesting invariants, we introduce two auxiliary global history (or ghost) variables. Clock y records the time that has elapsed since the last synchronization message (or the beginning of the protocol). Variable last records the last slot in which a synchronization message has been sent (initially last = −1). Figure 10 shows the version of the WSN(i) automaton obtained after adding these variables. The only change is that upon occurrence of a synchronization start_message[i]! clock y is reset to 0 and variable last is set to csn[i]. We first state a few basic invariants which restrict the values of the new variables.
Lemma 5. For any network with full connectivity the following invariant assertions hold, for all reachable states and for all i ∈ Nodes:
0 ≤ y    (13)
−1 ≤ last < C    (14)
S1_i ⇒ y ≤ x_i    (15)
last = −1 ⇒ S0_i    (16)
Fig. 10. WSN(i) with history variables
Invariant (13) says that y is always nonnegative, and invariant (14) says that last takes values in the integer domain [−1, C − 1]. If the system is in S1_i then a synchronization occurred after the last clock tick (invariant (15)), and if no synchronization has occurred yet then the system is in S0_i (invariant (16)).
The key idea behind our correctness proof is that, given the local state of some node i and the value of last, we can compute the number c(i) of ticks of i's hardware clock that have occurred since the last synchronization. Since we know the minimal and maximal clock speeds, we can then derive an interval that contains the value of y, the amount of real time that has elapsed since the last synchronization. Next, given the value of y, we can compute an interval that contains the value of c(j), for an arbitrary node j. Once we know the value of c(j), this gives us some information about the local state of node j. Through these correspondences, we are able to infer that if node i is sending, the slot numbers of i and j must be equal.
Formally, for i ∈ Nodes, the state function c(i) is defined by
If there has been no synchronization yet (last = −1) then c(i) is just equal to the hardware clock clk[i]. If the synchronizer is in location S1_i, then we know that there has been no tick since the last synchronization, so c(i) is set to 0. Otherwise, c(i) is k0 times the number of slots since the last synchronization, incremented by the number of ticks in the current slot, minus g to take into account that the hardware clock has been reset to g + 1 after the last synchronization.
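The three cases of this description translate into a short function. The Python sketch below is an illustration only: the function and parameter names are hypothetical, and counting the slots since the last synchronization as (csn[i] − last) modulo C is an assumption about how the cyclic frame is handled.

```python
def ticks_since_sync(clk_i, csn_i, in_S1, last, k0, g, C):
    """c(i): ticks of node i's hardware clock since the last synchronization."""
    if last == -1:
        # No synchronization yet: c(i) is the hardware clock itself.
        return clk_i
    if in_S1:
        # A message just started and node i has not ticked since.
        return 0
    # Slots elapsed since the last synchronization (cyclic count, assumed).
    slots_since_sync = (csn_i - last) % C
    # Subtract g because the clock was reset to g + 1 on the first tick.
    return slots_since_sync * k0 + clk_i - g
```

Right after the reset (clk[i] = g + 1, same slot) the function returns 1, the single tick that triggered the reset. When node i enters the next slot with clk[i] = 0 it returns k0 − g, the first tick plus the k0 − g − 1 ticks needed to finish the slot.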
We can now state the main invariant result from this section.
Theorem 1. Assume constraints (2), (3) and (4)
Proof. By induction, using the invariants from Lemmas 4 and 5. For a manual proof see http://www.mbsd.cs.ru.nl/publications/papers/fvaan/HSV09/.
Invariants (17) and (18) are the key invariants that relate the values of c(i) and y. Invariant (22) implies that the network is synchronized. This is the main correctness property we are interested in. All the other invariants in Theorem 1 are auxiliary assertions, needed to make the invariant inductive.
On the formal proof
The manual proof has been completely checked in Isabelle/HOL. The formal proof is about 5300 lines, whereas the manual proof is about 1000 lines. It is quite natural that formal proofs are longer than their manual counterparts. Wiedijk [26, 6] has proposed the De Bruijn factor as a way to quantify this difference. This factor basically compares the size of two proof files, compressed using the Unix utility gzip. The average De Bruijn factor is about 4. In our case, we obtain 4.58. This is a bit larger than usual, which can be explained by the following remarks. Our formal proof includes the definition of the Uppaal model and its semantics, which are not included in the manual proof. Obviously, we also need to define each one of the invariants. This preamble takes about 500 lines. We also need to formally prove that the 12 basic invariants defined in Lemmas 4 and 5 hold. In the manual proof, these are all disposed of by the word "trivial". The formal proof is indeed completely straightforward but still occupies about 440 lines.
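As described above, the De Bruijn factor is a ratio of compressed file sizes. A minimal Python sketch of that computation is given below. It uses the standard gzip module in place of the command line utility, and the function name is hypothetical. Compression settings and file headers may differ slightly from the tool used for the reported figure, so the sketch is not expected to reproduce 4.58 exactly.

```python
import gzip

def de_bruijn_factor(formal_text: bytes, informal_text: bytes) -> float:
    """Ratio of the gzip-compressed sizes of a formal and an informal proof."""
    formal_size = len(gzip.compress(formal_text))
    informal_size = len(gzip.compress(informal_text))
    return formal_size / informal_size
```

The two arguments would be the raw contents of the Isabelle sources and of the manual proof, read in binary mode.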
Key aspects of the Isabelle formalization are (1) an alternative definition of function PREV and a proof of lemmas showing particular properties of it, and (2) a formalization of the claim that there are at least three transmitting slots per frame. Common to these two issues is the introduction of the largest slot number in which a message is transmitted. This is the maximum of function tsn and is obtained for node i_max. The properties we need are basic facts, such as that PREV(s) cannot be s, or that in the idle period of a frame PREV(s) equals the transmitting slot of i_max, i.e., tsn[i_max]. Altogether, the definition of PREV, the introduction of i_max, the formal proof that there are at least three transmitting slots, and the proof of basic properties about these notions occupy about 600 lines.
In the remainder of this section, we first formally introduce i_max. Then, we rephrase the definition of function PREV and prove a sequence of properties of that function. After that, we formalize the claim that there are at least three transmitting slots. Finally, we illustrate the formal proof by two simple but representative examples.
Definition of i_max and PREV. As shown in Figure 1, a frame is composed of an active period and an idle period. In the active period, there are slots where a node is transmitting and the other nodes are listening, and also slots where no node is sending and all nodes are listening. Consequently, there is a last slot in which a message is emitted. Let i_max be the node that is transmitting in this slot. This transmitting node maximizes function tsn: tsn[i] ≤ tsn[i_max] for all i ∈ Nodes.
The formal definition of function PREV in Isabelle slightly differs from Equation 1. The combination of modulo and the incrementation in the argument does not translate to Isabelle, where functions must be total and proved to terminate. We basically remove the modulo and consider unbounded frames. We still have the assumption that function tsn returns a natural number strictly less than n. The first basic invariants then prove that parameters take values in their intended domain. Function PREV is the recursive function below: Definition 2.
Properties of PREV. In the formal proof, we need a sequence of properties showing the structure of a frame. The next lemma asserts that function PREV is constant during the idle period, that is, if slot s is transmitting and all slots from s to y are not transmitting, then PREV(y) is slot s.
Proof. By induction on s.
From the above lemma it directly follows that after the last transmitting slot, function PREV equals this slot:
Proof. By definition i max is such that there is no transmitting slot after it. We use this fact to instantiate Lemma 6 above.
We prove that the previous slot of slot s is strictly less than s. Because of the cyclic nature of a frame, this is only true if s > 0.
Another useful lemma asserts that the "PREV" of a transmitting slot cannot be tsn[i_max].
At least three sending nodes. In the informal case study description [17], it is assumed that for each node there is a transmission slot. Translated to the setting of our model, this means that tsn is a total function from nodes to slots. Interestingly, the Isabelle formalization revealed that the assumption that tsn is total is never used in the proof. The only assumption that we make is that there are at least three sending nodes.
In our formalization, we introduce a predicate TX n which states that for node i there exists a slot s that equals the transmitting slot of node i, that is, node i is a transmitting node. Predicate TX n complements predicate TX defined earlier. Predicate TX n is defined as follows:
The assumption that there are at least three transmitting slots is formalized by assuming that predicate TX n holds for nodes 0 to 2.
We derive two important facts. The first one is that tsn[i max ] is at least 2. The second one is that between slot number 0 and slot number n − 1 there is at least one transmitting slot.
. By definition a slot number is not greater than n − 1 and positive. Consequently, the "tsn" in the middle of the ordering is strictly positive and strictly less than n − 1. This shows the second term of our conclusion.
A consequence of Lemma 10 is that function PREV is at least one for all slots not smaller than n − 1.
Proof. We consider two cases. If s > n − 1, then s > tsn[i max ]. Moreover there is no transmitting slot between s and n−1. So, from Lemma 6 we obtain PREV(s) > tsn[i max ] > 1. If s = n − 1, then we know from Lemma 10 that there is at least one transmitting node between slot 0 and n − 1 and PREV(s) is then at least equal to this slot.
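The two facts derived from the three-sender assumption are easy to confirm by brute force for small frames. The sketch below assumes, as the phrase "in the middle of the ordering" suggests, that the three sending nodes have pairwise distinct transmitting slots in {0, ..., n − 1}. The function name `check_three_senders` and the range of frame sizes are ours.

```python
from itertools import permutations


def check_three_senders(n):
    """Check both facts for every assignment of distinct slots to nodes 0..2."""
    for tsn in permutations(range(n), 3):   # tsn[i] = transmitting slot of node i
        _, mid, hi = sorted(tsn)
        if hi < 2:                          # fact 1: tsn[i_max] is at least 2
            return False
        if not (0 < mid < n - 1):           # fact 2: a transmitting slot strictly
            return False                    # between slot 0 and slot n - 1
    return True


# Frame sizes 3..8 are an arbitrary illustrative range.
results = {n: check_three_senders(n) for n in range(3, 9)}
```

Such an exhaustive check is no substitute for the Isabelle proof, but it gives a quick sanity test of the statement for concrete values of n.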
Proof samples
Example 2. The situation of our first proof sample is pictured in Figure 11. This situation appeared in the proof of Invariants 21 and 23 of Theorem 1. It involves node k and an arbitrary different node j. Node k is sending in its current slot number, i.e. we have csn[k] = tsn[k] and TX(csn[k]). The last transmitting slot (depicted in the gray slot) is the previous transmitting slot of both nodes j and k.
The goal is to prove that these two nodes agree on the current slot number, i.e., that csn[j] = csn[k].
In brief, node i is about to send a message. The conclusion asserts that the last slot with a synchronization is not the current slot (last ≠ csn[i]). Before the first synchronization, last is negative (last = −1) and the conclusion holds as any csn is a nonnegative number (basic Invariant 7). Before a start message action, node i is in state GO SEND with its clock equal to g.
Line Topologies
In the two previous sections, we studied the correctness of our clock synchronization protocol for networks with full connectivity. In practice, however, wireless sensor networks are usually not fully connected. A full parametric analysis of the protocol for arbitrary network topologies will be quite involved. In this section, we report on some experiments we did using Uppaal involving line topologies, that is, connected networks in which each node is connected to exactly two other nodes, except for two nodes that only have a single neighbor. As we explained in Section 2, we can easily model arbitrary network topologies in Uppaal by appropriate instantiation of the function neighbor.
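As an illustration of what such an instantiation might look like, the following Python sketch (not Uppaal code) encodes a line and a clique as neighbor predicates and counts the neighbors of each node. The names `line_neighbor`, `clique_neighbor` and `degrees` are hypothetical, and nodes are assumed to be numbered 0 to N − 1 along the line.

```python
def line_neighbor(i, j):
    """Nodes i and j are neighbors in a line iff their indices differ by one."""
    return abs(i - j) == 1


def clique_neighbor(i, j):
    """In a clique every pair of distinct nodes is connected."""
    return i != j


def degrees(neighbor, size):
    """Number of neighbors of each node 0..size-1 under the given predicate."""
    return [sum(neighbor(i, j) for j in range(size)) for i in range(size)]


line_degrees = degrees(line_neighbor, 5)      # [1, 2, 2, 2, 1]
clique_degrees = degrees(clique_neighbor, 5)  # [4, 4, 4, 4, 4]
```

The degree lists reflect the definition of a line topology given above: every node has exactly two neighbors, except the two end nodes, which have one.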
In Figure 12, a simple three-node network is depicted in which there is no connection between nodes B and C. We defined this network in Uppaal and checked the behavior of the system for different variable valuations. It turns out that, unlike the fully connected network with three nodes (see Table 2), the network will not always remain synchronized for g = 2, even when the clocks are perfect. Table 6 lists some of our verification results. On the left we give the results for the network of Figure 12 and on the right those for a clique network of size 3. If we compare these results then we see that, in order to keep the network synchronized, the hardware clock of a network that is not fully connected must be more accurate than the hardware clock of a fully connected network. Intuitively, the reason is that in a line topology each node synchronizes less frequently than in a fully connected network. In order to maintain synchronization, a line topology requires more accurate hardware clocks and a larger guard time. We claim that, for a fixed value of the guard time, the network may become unsynchronized if we keep increasing the number of nodes. In fact, we claim that for a line topology of size N, the guard time g should be at least N.
Table 6. Results for network of Figure 12 (left) and for clique network of size 3 (right)
Model checking of synchronization for line topologies entails exploring a very large state space, and Uppaal needs considerable memory and CPU resources to do that. In order to reduce the state space, we considered only networks with perfect clocks. However, even with perfect clocks, we could only manage networks with at most 8 nodes. Table 7 shows the resource usage of Uppaal required for model checking of networks with line topologies. We used a Sun Fire X4440 machine with 4 Opteron 8356 2.3 GHz quad-core processors and 128 GB DDR2-667 memory. One processor on this machine needs about half an hour to establish that a line network with 8 nodes is synchronized if the guard time is 8.
Table 7. CPU time and memory usage of Uppaal for line networks of different sizes
Uppaal explores all possible interleavings of concurrent events that may occur at a given point in time. Depending on the interleaving, clock misalignment and loss of synchronization are possible. Due to race conditions involving arrival of messages and ticking of hardware clocks, even a network with perfect clocks will not necessarily remain synchronized for any parameter valuation. Figures 13 and 14 illustrate how race conditions may affect the time interval between two synchronization events in our model. Consider the case where a node is the receiver in one slot and the sender in the next slot. We know that the sender sends a message when the value of its clock equals g, and that the receiver resets its clock counter to g + 1 at the first clock tick after receiving the message. Figure 13 shows a scenario in which a synchronization signal is received immediately after a clock tick at the receiver. In this scenario, the receiver waits a full clock cycle before resetting its clock counter to g + 1. Figure 14 illustrates a different scenario in which a synchronization signal is received immediately before the receiver clock ticks and the receiver immediately resets its clock counter to g + 1. We see that the length of the time interval between two synchronization events in the first scenario is one clock cycle longer than that in the second scenario.
We show that in our Uppaal model of a line network of size N and with a guard time of N − 1, there exists an error scenario which leads to a collision, that is, a receiver node reaches the end of its current time slot and starts a new slot while receiving a message from a transmitter node. Figures 15, 16 and 17 illustrate three possible error scenarios produced by Uppaal, resulting in a loss of synchronization in a setting with Nodes = {0, 1, 2, 0, 1, · · · }. We explain the example of Figure 15 in detail. The other two scenarios are similar.
The scenario consists of two "staircases". One "fast" staircase has stairs with the minimum width, where a synchronization signal is received immediately before the receiver clock ticks and the receiver resets its clock counter to g + 1 immediately, while the other "slow" staircase has stairs with the maximum width, where a synchronization signal is received immediately after the receiver clock ticks, and it takes an additional clock tick before the clock is reset to g + 1. The staircases start from the same point, viz. when node number 1, the second node in the line, sends messages to its neighboring nodes 0 and 2. After N − 1 steps, which takes a guard time period, the two staircases join again when node N − 2 tries to communicate with node N − 1. At that point, node N − 2 has gone through g time units since its previous synchronization and is about to send a message to node N − 1. However, node N − 1 is about to make a clock tick and enter its new time slot, which is convenient for receiving the message from its neighbor. Synchronization is lost when node N − 2 starts sending before node N − 1 ticks.
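The one-cycle difference between a "slow" and a "fast" stair can be made concrete with a small calculation. The sketch below assumes a perfect receiver clock that ticks at integer times and a message that arrives at real time `arrival`; it computes how long the receiver waits for its first tick after the arrival. The function name, the arrival times and the value of `eps` are illustrative only and are not taken from the Uppaal model.

```python
import math


def wait_until_next_tick(arrival):
    """Time from message arrival to the first clock tick strictly after it.

    Assumes a perfect clock ticking at integer times 0, 1, 2, ...
    """
    next_tick = math.floor(arrival) + 1
    return next_tick - arrival


eps = 0.001                              # hypothetical small offset
slow = wait_until_next_tick(3 + eps)     # arrival just after the tick at time 3
fast = wait_until_next_tick(4 - eps)     # arrival just before the tick at time 4
difference = slow - fast                 # approaches one clock cycle as eps -> 0
```

As `eps` shrinks, the slow wait approaches a full clock cycle and the fast wait approaches zero, which is the gap between the widest and the narrowest stair.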
Conclusions and Related Work
Wireless sensor networks constitute a potentially very important but also extremely challenging application area for formal methods. As we have seen in this paper, even the analysis of a basic clock synchronization algorithm for an industrial WSN turns out to be quite difficult. Formal analysis of the gossip layer is a largely unexplored research field [2]. Using timed automata model checking, we discovered some interesting error scenarios for line topologies: for any instantiation of the parameters, the protocol will eventually fail if the network grows. We also succeeded in presenting a parametric verification for the very restrictive case of cliques (networks with full connectivity). We used model checking to find the key error scenarios that underlie the parameter constraints for correctness, and theorem proving to check the correctness of our manual invariant proof. In practical applications of WSNs, cliques rarely occur and therefore our results should primarily be seen as a first step towards a correctness proof for arbitrary and dynamically changing network topologies. Nevertheless, these results could give us an upper bound on the allowable clock drift of a generic WSN.
Using state-of-the-art model checking technology, we have only been able to analyze models of some really small networks. In order to carry out our analysis we had to make some drastic simplifying assumptions. Nevertheless, we conclude that the ability of model checkers to find worst-case error scenarios appears to be quite useful in this application domain. In particular, it is sometimes possible to reproduce error scenarios (found by exploring simple models of small networks) in real implementations of larger networks [18].
The use of simulations will be essential for providing additional insight into the robustness and usefulness of our algorithm, also because occasional flaws of the MAC layer protocol may be resolved by the redundancy of the gossip layer. However, we believe it is unlikely that simulation techniques will be able to produce worst-case counterexamples, such as the example of Figure 15 that was produced by the model checker Uppaal. The work of [7] also shows that one has to be extremely careful in using the results of MANET simulators.
Methodologically, the approach of this paper is similar to our study of the Biphase Mark Protocol [25], which also uses Uppaal to analyze instances of the protocol and a theorem prover for the full parametric analysis. Theorem provers have been frequently and successfully applied for the analysis of clock synchronization protocols, see for instance [19, 20]. An interesting research challenge is to synthesize (or prove the correctness of) the parameter constraints for the Chess protocol fully automatically. Recently, some approaches have been presented by which, for instance, the (parametric) Biphase Mark Protocol can be verified fully automatically [5, 24]. However, these approaches are not powerful enough (yet) to handle the Chess protocol in which the number N of sensor nodes is not fixed, and the parameter constraints and the length of the corresponding counterexamples depend on N.
