Abstract-The objective of this work is to validate mathematically derived clock synchronization theories and their associated algorithms through experiment. Two theories are considered, the Interactive Convergence Clock Synchronization Algorithm and the Mid-Point Algorithm. Special clock circuitry was designed and built so that several operating conditions and failure modes (including malicious failures) could be tested. Both theories are shown to predict conservative upper bounds (i.e., measured values of clock skew were always less than the theory prediction). Insight gained during experimentation led to alternative derivations of the theories. These new theories accurately predict the clock system's behavior. It is found that a 100% penalty is paid to tolerate worst case failures. It is also shown that under optimal conditions (with minimum error and no failures) the clock skew can be as much as 3 clock ticks. Clock skew grows to 6 clock ticks when failures are present. Finally, it is concluded that one cannot rely solely on test procedures or theoretical analysis to predict worst case conditions.
Index Terms-Clock synchronization, experimental verification, Byzantine failure, formal methods, proof of correctness.
I. INTRODUCTION
MANY theories of clock synchronization have been proposed and subjected to the rigors of mathematical proof of correctness (see the surveys [1] and [2]). Few of these theories are validated by experiment on actual systems [14]. The approach to experimental verification taken here is one which has been used in aeronautical research for decades, i.e., a theory is formulated which predicts the system's behavior and then tests are designed and performed to validate the theory's predictions. To simplify the test procedures and clarify the experimental results it is desirable to exercise a single degree of freedom (or variable) of the theory in any given test. The clock theories considered here are functions of several variables, e.g., clock drift rate, resynchronization interval and clock read error. It is necessary then to design the clock subsystem so that these characteristics can be altered and appropriate data be recorded. In particular, the clock subsystem must be capable of simulating the "malicious liar" behavior that the clock theories are designed to tolerate [3].
the algorithm and the accompanying bounding theory had been recently subjected to the rigors of a mechanical proof [6]. During the process of testing, it was found that the theoretical bound on the clock skew was larger than the observed maximum clock skew. Although the theory only guarantees an upper bound, this discrepancy led to inquiries into why the theory was not more accurate. In the course of this investigation an alternative method for the derivation of the expression for the clock skew bound was developed. This new expression accurately predicts the observed clock skew for the Interactive Convergence Clock Synchronization Algorithm (ICCSA).
Lundelius has derived a clock skew bound [7] for the Mid-Point Algorithm proposed by Dolev [8]. The Dolev algorithm was programmed into the clock synchronization subsystem and tested. As with the ICCSA theory, the predicted bound was found to be greater than the observed clock skew (although only in extreme cases). Using the insight gained from the previous derivation and applying a fresh approach to the worst case analysis of the Mid-Point Algorithm, a new expression is derived which accurately predicts the observed clock skew.
In the following sections, expressions for the clock skew bound for both the ICCSA and the Mid-Point Algorithm are derived. A test plan is introduced and the design of the clock subsystem described. Results of the testing are presented and case studies are done. Finally, conclusions concerning this work are drawn.
II. CLOCK FUNDAMENTALS
The existence of synchronized clocks on distributed systems greatly simplifies and improves the performance of algorithms operating on these systems [16]. Once this time base exists, transactions between members of the distributed system can be controlled based on time. For example, the management of redundant data in a real-time fault-tolerant computer is simplified if the processors are synchronized [9]. In the following discussions, the term clock refers to a device that provides a time base for a processor. A processor thus inherits time-related characteristics from its clock. For this reason, we sometimes refer to a processor as drifting with respect to other processors when, in fact, the drift is actually a property of the processor's clock. Table I is a compilation of the symbols that will be used in this text. A common convention has been that real time is denoted by a lowercase letter, as in $t$ or $\delta$, and that clock times are capitalized, as in $T$ and $\Delta$.

U.S. Government work not protected by U.S. copyright

TABLE I
SYMBOL TABLE

A clock approximates real time with the relationship between clock time and real time given by

where $t$ is real time, $T$ is clock time and $\rho$ is the rate of drift of clock time from real time. A clock may have some nonzero offset at clock time $T = 0$ as represented by the constant $t_0$ in (2).

If $\rho$ is zero then the clock is a perfect clock. If $\rho$ is positive, then the clock is a fast clock and accumulates time faster than real time. Clocks are considered to be digital devices consisting of a crystal oscillator and a counter. Ideally the crystal oscillates at frequency $f_c$. Deviations from this specification are what cause drift among a set of clocks. The digital nature of the counter causes the relation between $t$ and $T$ to be discontinuous. The error in reading a clock is denoted as $\epsilon$, and for digital clocks $\epsilon$ has a minimum, $\epsilon_0$, of $1/f_c$. Thus, for a digital clock the inverse of (2) becomes

The drift between any two clocks, p and q, in the set of c nonfaulty clocks is given by

The real-time skew that exists between two clocks at some clock time $T$ is given by

Alternatively, the skew can be expressed in terms of the difference between two clock values at some real time $t$. The form of (7) was chosen as this is the perspective taken in the Lamport and Melliar-Smith proof.

A. Synchronizing Clocks

In the two algorithms considered here, synchronization is accomplished by periodically executing an algorithm which first computes a clock correction value and then applies the correction to the local processor's clock. In order to compute either of the two algorithms, each processor in the synchronizing set must obtain a perceived skew, $\Delta_{qp}$, between its clock and each of the other clocks in the set.

To obtain $\Delta_{qp}$, processor p must compute the difference between its local clock and the remote clock. Processor p must, in effect, read processor q's clock. Figure 1 graphically depicts this process. By design, the algorithm executes every $R$ time units and takes $S$ time units to complete. In the clock subsystem constructed for these tests, actual clock values are not transmitted. Instead, at a predetermined time, $T_s$, during $S$, clock q sends a synchronization signal to p. Upon receipt of this signal, p reads its local clock and stores this value, $T_{qp}$. $T_{qp}$ is then processor p's local clock value taken at a real time corresponding to processor q's clock reading $T_s$. The perceived skew, $\Delta_{qp}$, can then be computed as $T_{qp} - T_s$.

More precisely stated, the perceived skew values are arrived at by the following process.
1) Each processor broadcasts a synchronizing signal at a predetermined time, $T_s$.
2) Upon receipt of the synchronizing signals from other processors, the receiving processor, p, stores its clock's value, $T_{qp}$.
3) The perceived skew is then the stored value, $T_{qp}$, minus $T_s$.

Figure 2 represents this process taking place between two processors, p and q, with processor q having a clock which is faster than processor p.

Using (3), the following expression for the perceived skew can be derived (see [17] for details).

(10)

An examination of Fig. 2 will reveal that if q is faster than p, then $\rho_{qp} \ge 0$, $\delta_{qp} \ge 0$, and $\Delta_{qp} < 0$. To correct its clock, the slower processor p must add a positive value to the clock. Since the $\Delta_{qp}$'s will be negative, the resulting correcting value, $x$, must be subtracted from p's clock.

B. Some Useful Relations

The following relations will be used to derive the bound formulas. Detailed derivations of these relations can be found in [17]. These relations hold true provided that a clock correction is not applied during the interval $T \to (T + C)$.

$$\delta_{qp}(T + C) = \delta_{qp}(T) + \rho_{qp} C \qquad (12)$$

Equation (12) states that the skew between p and q at some time $T$ plus a constant, $C$, is equivalent to the skew that exists at time $T$ plus an amount equal to the relative drift rate times the constant.

Equation (13) states that the skew between p and q during a synchronization period is equivalent to the skew which exists at the beginning of the period, plus the skew accumulated over the period due to the relative drift, $\rho_{qp}$.

$$\delta_{rp}(T) - \delta_{qp}(T) = \delta_{rq}(T) \qquad (14)$$

Equation (14) states a relationship that exists between the skews of three good clocks, p, q and r.

III. THE PROOFS

The statement of the bounding theorem is taken largely from [4] and later [6].
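As a numerical sanity check of relations (12) and (14) of Section II-B, on which the proofs rely, the following minimal sketch may be helpful. It assumes a simple linear clock model, $t = T/(1+\rho) + t_0$, and the sign convention $\delta_{qp}(T) = t_p(T) - t_q(T)$; both are assumptions made for illustration and are not taken from the text, as are the function names and the numeric values.

```python
# Sketch only: the linear clock model and the sign convention are assumptions.

def real_time(T, rho, t0=0.0):
    """Real time at which a clock with drift rho and offset t0 reads T."""
    return T / (1.0 + rho) + t0

def skew(T, clock_p, clock_q):
    """Real-time skew delta_qp(T) = t_p(T) - t_q(T) (assumed convention)."""
    return real_time(T, *clock_p) - real_time(T, *clock_q)

def relative_drift(clock_p, clock_q):
    """Growth of delta_qp per unit of clock time under the assumed model."""
    return 1.0 / (1.0 + clock_p[0]) - 1.0 / (1.0 + clock_q[0])

# Hypothetical clocks given as (rho, t0).
p = (-1e-4, 0.0)   # slow clock
q = (+1e-4, 2.0)   # fast clock
r = (0.0, -1.0)    # perfect rate, offset start

T, C = 1000.0, 250.0

# Relation (12): delta_qp(T + C) = delta_qp(T) + rho_qp * C
lhs_12 = skew(T + C, p, q)
rhs_12 = skew(T, p, q) + relative_drift(p, q) * C

# Relation (14): delta_rp(T) - delta_qp(T) = delta_rq(T)
lhs_14 = skew(T, p, r) - skew(T, p, q)
rhs_14 = skew(T, q, r)

print(abs(lhs_12 - rhs_12) < 1e-9, abs(lhs_14 - rhs_14) < 1e-9)
```

Under this model both relations hold to within floating-point rounding, provided no correction is applied inside the interval, which is the condition stated for the relations.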
A. Clock Skew Bounding Theorem
For a set of n processors cooperating in the synchronization algorithm for all time, $T$, through period i, a bound, $\delta$, exists on the skew between any two of the processors given that at most m of the n processors are faulty. Stated mathematically,
Because this theorem is written in terms of consecutive periods of time, it is convenient to use proof by induction.
To do this, we will derive an expression for $\delta$ for the first interval, i = 0, and then show that another expression exists which is true for the following intervals. This latter expression depends on characteristics of the synchronization algorithm and thus separate derivations are necessary for the ICCSA and the Mid-Point Algorithm.
At system startup, assume a maximum skew $\delta_0$ exists between all good processors in the set. Then at the end of

(16) is thus one constraint on the value of $\delta$, i.e., $\delta \ge \rho R + \delta_0$.

C. Periods i and i + 1

To continue the proof, we will assume that an expression for the bound is true for period i and show that the same expression is true for i + 1. As stated above, this expression will depend on the synchronization algorithm. However, we can derive a general expression from which the subsequent

$$\delta^{+}_{qp}(T_c) = t^{+}_{p}(T_c) - t^{+}_{q}(T_c) \qquad (17)$$

Then using (11) to replace the $t^{+}$ functions with $t$, and then

It is assumed that the difference between the $\rho x$ terms can be ignored. For an error-free system this is justified because, when considering the worst case skew condition with $\rho_q$ equal to negative $\rho_p$, $\rho_q x_q$ will be of the same sign and approximately equal to $\rho_p x_p$. When clock read errors are present, the worst case read error effect occurs when clock q's error is equal but opposite to clock p's error. As in the error-free case, the effect is canceled out in the $\rho x$ difference terms. In short,

Substituting the resulting expression in (13) written for

with $R \ge (T - T_c)$. Expression (19) will be used in the following sections to derive bound expressions for the algorithms.

D. The Interactive Convergence Clock Synchronization Algorithm

The ICCSA is derived for n clocks synchronizing in the presence of m faulty clocks. In this algorithm, a processor computes the correction by averaging all the perceived skew values, $\Delta_{qp}$. To limit the effect of a faulty clock, the $\Delta_{qp}$ are subjected to the test that their absolute value be less than some maximum expected value, $\Delta$. If a $\Delta_{qp}$ exceeds $\Delta$, then $\Delta_{qp}$ is set to 0. More precisely

A value for $\Delta$ is easily derived from (10).

Wishing to replace the correction terms in (19) with an expression based on (20), we look at the correction terms more closely:

clocks (denoted by V). Each term is taken individually and expanded under assumptions relating to those terms and then recombined to obtain (22)

See [17] for details of this derivation.

Substituting (22) into (19), we get

Now, we create an expression for $\delta$ and assume it holds for period i, i.e., that $\delta \ge \delta_{qp}(T)$ with $T$ in i and with $\delta$ given by (24).

Under this assumption then, by replacing $\delta_{qp}$ with $\delta$ in (23) we have

Now using (24) for $\delta$ it follows that

which completes the proof.
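The averaging rule described for the ICCSA can be sketched as follows. This is a minimal illustration, not the tested implementation: the function name and the sample readings are hypothetical, and it is assumed that the list of perceived skews includes the processor's reading of its own clock (zero), so that the average is taken over all n clocks.

```python
# Sketch of the ICCSA correction as described in the text; names and
# sample values are illustrative assumptions.

def iccsa_correction(perceived_skews, delta_max):
    """Average of the perceived skews, with any reading whose absolute
    value is not less than delta_max replaced by 0."""
    filtered = [d if abs(d) < delta_max else 0 for d in perceived_skews]
    return sum(filtered) / len(filtered)

# Four clocks, own reading first. The out-of-range reading (500) is zeroed.
print(iccsa_correction([0, -3, 2, 500], delta_max=100))   # -0.25

# A faulty clock reporting just under the threshold shifts the correction
# by roughly delta_max / n.
print(iccsa_correction([0, 0, 0, 99], delta_max=100))     # 24.75
```

The second call shows why the threshold alone does not remove the influence of a faulty clock: a reading just inside the acceptance test still enters the average with weight 1/n.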
E. The Mid-Point Algorithm
This algorithm has the property that a faulty processor's clock reading will not be used to compute the correction unless it is bounded by good clock readings. This makes it possible to derive a tighter bound.
In the following sections an expression for $x_p$ is derived by first considering the case with no errors, then with some clock read error, $\epsilon$, and finally with an arbitrary faulty clock reading.
1) The Ideal Case: In the absence of a faulty clock and read errors, all good processors in a synchronizing set will place the processor readings in the same order. Take, for example, the four-processor system (p, q, r, s) where

Then, for any member i in (p, q, r, s)

All good processors will then use clock readings from the same two processors to compute their respective corrections (in the above example, this would be $\Delta_{qi}$ and $\Delta_{ri}$). This is equivalent to the processors using a single clock reading from a processor which is at the midpoint of these two clock readings. Thus, using equation (10) with $\epsilon = 0$, we have

2) Including Read Error: Any read error present in the clock readings will affect the clock correction by at most the maximum read error, $\epsilon$:

IEEE TRANSACTIONS ON COMPUTERS, VOL. 43, NO. 6, JUNE 1994

3) Including a Faulty Clock: Referring to Fig. 4, consider that the maximum and minimum readings taken from good clocks differ by at most $(\delta + 2\epsilon)$.
The algorithm guarantees that if a faulty clock reading is used in computing the correction, then it is bounded by good clock readings. Thus, the maximum error that a faulty clock could cause is one half of $(\delta + 2\epsilon)$.
The expression for the correction including both read error and error due to a faulty clock reading becomes 
IV. EXPERIMENTAL VERIFICATION
Equations (26) and (36) describe clock skew bounds for systems executing their respective algorithms. These equations were derived based on assumed circuit behavior. It is desired to verify that these equations in fact describe actual clock circuit synchronization performance. The clock skew bound should at all times be greater than or equal to the observed clock skew while at no time being so much greater as to unnecessarily reduce system performance. This task is simplified by noting that for a particular system implementation the parameters $R$, $\Delta$, $n$, and $m$ are constant. The remaining parameters, $\epsilon$ and $\rho$, are independent. The clock skew is thus a linear function of two independent variables and it is possible to test the equations' dependence on $\epsilon$ and $\rho$ separately. Ideally, it would be preferred to run tests for a range of $n$ and $m$. However, the clock subsystem testbed that was available accommodated only 4 clocks, and thus $m$ was limited to (0, 1). The following test cases were generated for (26) and (36):

The first test case explores the dependence of $\delta$ on $\epsilon$ with the drift rate terms set to zero. The second test case confirms that the relation holds for nonzero drift rate. Similarly, the third test case explores the dependence of $\delta$ on $\rho$ while the read error terms are set to zero. The fourth test case confirms that this relation holds for nonzero read error. These four test cases were taken with an initial skew, $\delta_0$, of zero as it was determined early in the test procedure that the initial skew was removed after one cycle and had no effect thereafter. However, since the theory provides an expression for maximum skew based on the initial skew (16), the following two test cases were added:
5) $\delta = f(\delta_0)$ during the first period with $\rho = 0$;
6) $\delta = f(\delta_0)$ during the first period with $\rho = C$.
Finally, it should be the case that for an initial skew, drift rate, and read error of zero, and with no faulty clocks, the clock skew should be zero:
7) $\delta = 0$ with $\delta_0 = 0$, $m = 0$, $\epsilon = 0$ and $\rho = 0$.
In all the tests, the read error is treated as a random variable with a mean of zero. This is not the case in most communication systems. However, the expected value of the communication delay is often known and can be subtracted from the clock readings in the synchronization algorithm, so that the resulting effect is a read error with zero mean.
In addition to functioning as a synchronizing circuit, the clock subsystem must be able to support the test plan. The following capabilities were then designed into the clock subsystem and experiment support environment.
1) Ability to sustain long duration data acquisition of internal variables without perturbing the system function.
2) Availability of a global clock which can be read by each processor under test. The global clock will represent real time.
3) Ability to set the starting skew $\delta_0$ of each clock.
4) Ability to set the drift rate of each clock with respect to real time, i.e., the global clock.
5) Ability to set the read error of each clock.
6) Ability to emulate a faulty clock, especially a malicious liar.
The following sections describe the clock subsystem and experiment environment.
A. The Design of the Clock Subsystem
The ICCSA and the Mid-Point Algorithm can be implemented completely in software. However, as has been shown with SIFT [10], this can lead to unsatisfactory synchronization performance. Our implementation follows the design offered by Ramanathan [15] where the clock synchronization process is assisted by clock synchronization hardware. The primary purpose of the clock synchronization hardware is to quickly recognize clock messages as they are received so that $\Delta_{qp}$ can be determined with as little read error as possible. While it is possible to put the entire clock function in hardware, for the purposes of this test it is convenient to have the algorithm in software so that alternate algorithms can be tested. Having the algorithm in software also enhances data acquisition and fault simulation.
The primary synchronization function of the clock synchronization hardware is enhanced to provide the data acquisition and control necessary to accomplish the tests proposed in the previous section. The clock peripheral design is augmented to allow the adjustment of the oscillator drift rate, the setting of read error and the simulation of a malicious liar.
In these and the subsequent sections the term clock tick is used to refer to one increment of digital time. Practically all the parameters are stated in terms of clock ticks instead of time. A clock tick is easily converted to time once the base frequency of the clock is known.
B. Experiment Environment
The clock peripherals were installed on an existing fault-tolerant processor (FTP) testbed [11]. The FTP is hosted from a VAX computer through a dual port memory. In addition, each channel of the quad FTP has an additional dual port memory channel to separate VAX computers. These channels were dedicated to data acquisition. A sixth VAX computer with a windowing interface was used to control the experiment. The FTP is a tightly coupled computer. Initial skew is then easily controlled from the base skew of $\delta_0 = 0$ provided by the FTP. The synchronization algorithm is loaded into FTP RAM and configured for the test trial. The FTP operating system is then started from ROM. After the FTP stabilizes, control is passed to the synchronization algorithm and the FTP's clock synchronization is disabled.
Another component of the experiment environment is the global clock. The global clock has a base frequency of 2 MHz and a resolution of 32 bits. The output of the global clock can be read by each channel and is assumed to be real time.
To establish the global clock as real time, its 2 MHz base frequency is fed to the clock synchronization peripherals as the reference frequency. Thus, in the absence of any programmed drift rate, the clock synchronization peripherals are perfectly synchronized.
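Since most parameters in this paper are stated in clock ticks, a small conversion sketch may help in reading the results. It uses the 2 MHz base frequency and 32-bit resolution stated for the global clock; treating the global clock as a free-running 32-bit counter that wraps around, and the constant and function names, are assumptions made for illustration.

```python
# Sketch: tick/time conversion for a 2 MHz, 32-bit clock. The wrap-around
# interpretation of "32 bits" is an assumption.

BASE_FREQUENCY_HZ = 2_000_000   # base frequency stated for the global clock
COUNTER_BITS = 32               # resolution stated for the global clock

def ticks_to_seconds(ticks, freq_hz=BASE_FREQUENCY_HZ):
    """Convert a number of clock ticks to seconds."""
    return ticks / freq_hz

tick_period_us = ticks_to_seconds(1) * 1e6          # duration of one tick in microseconds
rollover_s = ticks_to_seconds(2 ** COUNTER_BITS)    # time before a 32-bit counter wraps

print(tick_period_us, rollover_s)
```

At 2 MHz one tick lasts half a microsecond, and a 32-bit counter would wrap after roughly 36 minutes.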
C. Results
Several tests were run to verify the functionality of the system. The following runs were made with the synchronization algorithm disabled:
1)
2)
3)
4)
With the synchronization algorithm enabled, several tests were run with $\delta_0 > 0$ and $\rho > 0$ and it was found that equation (16)

1) The ICCSA: In [6], six constraints are listed which must be met if the bounding theorem is to hold for a clock synchronization system executing the ICCSA (see Table II). These constraints include the skew bounds (C5 and C6), the maximum perceived clock skew, $\Delta$ (C4), the maximum clock correction, C (C3), the minimum length of time allocated to the synchronization process, S (C2), and the minimum length of the synchronization frame, R (C1). The constraints in all cases define minimum values. However, in C3, C4, C5 and C6, the values are the minimum value that can correctly be set as an upper bound. A synchronization subsystem based on these constraints must have the property that a processor can read a remote clock at a time when the remote processor is not executing the synchronization process. That is, the remote clock is accessible for external reads outside the scope of the remote clock's own synchronization process. This is not the case with the design used in this test.

Because a remote clock is read with the cooperation of the remote synchronization process, the synchronization window must allocate adequate time before and after the synchronization time, $T_s$, in order to be sure of capturing all good clocks. This time is at least $\delta + \epsilon$. In these tests, the window was set at 2 times the maximum perceived clock skew, $\Delta$, with the synchronization time, $T_s$, in the center of the window (see Fig. 5).

The constraints as defined for these tests are listed in Table III. The only expression that remains equivalent to Table II is C5. The difference in C4 may be due to the difference in S as described in the previous paragraph. C3 defines the maximum correction possible if all n - 1 clocks return a difference of $\Delta$. C2 comes directly from the above discussion. Finally, R must be at least as big as S with room for a correction.
Figures 6(a) and (b) show the result of one series of the tests and compare the old theory's bound (Table II, C6), the bound as derived in this paper (Table III, C6), and the actual data.
These plots are of maximum clock skew (in ticks) versus drift rate. The data was taken at large drift rates with a constant read error of 200 ticks. Figure 6(a) displays fault-free performance (m = 0) and Fig. 6(b) displays performance with one fault (m = 1). The bound as derived in this paper exactly predicts the circuit's performance.
2) The Mid-Point Algorithm: A theory based on the Mid-Point Algorithm was derived in [7] and interpreted in [2]. Table IV lists the constraints for the old theory in terms of the symbols used in this paper. Table V contains the constraints for the theory for the Mid-Point Algorithm as derived in this paper. The synchronization process was identical to the ICCSA with the exception that the Mid-Point Algorithm was executed at $T_s$. Figure 7 plots the clock skew bound predicted by the old theory, the theory derived in this paper and the actual measured results versus drift rate. As can be seen, the measured clock skew is well below the new theory's prediction. This is not due to an inaccuracy in the theory, but to an inability to replicate worst case conditions with the clock subsystem. This phenomenon will be explained in more detail in Section VI-C, Simulating a Malicious Liar.
V. CASE STUDIES
The parameters used in the verification tests are obviously far worse than can be expected in an actual system. However, now that the theory has been verified under these extreme conditions, it is reasonable to ask what level of performance can be expected under nominal conditions. The case studies listed in Table VI were generated to probe this area. The Case Studies deal primarily with read error and synchronization period as these are the most significant contributors to the clock skew.
A read error occurs every time a digital clock is read. It is believed that the minimum read error that will be obtainable in most synchronization systems is 1 tick. This tick of read error is added when, as is the case with the subject clock subsystem, the local clock is read in response to the strobe generated by the remote clock. In this case the remote clock is not actually read, but generates an event signal which, by definition, occurs at clock time $T_s$ and, therefore, does not include an error component. A similar situation would exist if the remote clock were to be read in response to a request from the local clock (given that there were no other overhead). Case 1 covers this best case situation.

TABLE VI
CASE STUDY PARAMETERS
If both the local clock and remote clock are read in response to asynchronous events generated by the processor, then two ticks of error would be added to a clock read. Similarly, two ticks of read error can also be added when a clock is corrected. This is again due primarily to the asynchronous nature of clock reads and writes. If the clock correction circuitry is designed properly this error will not be incurred. Case 2 covers the situation when the read error is 4, with two ticks added during clock read and clock correction.
Each case consists of 3 sub-cases where the drift rate is set so that the accumulated drift over one period is equal to one tenth of the read error for sub-case "a", 1.0 tick for sub-case "b", and the entire read error for sub-case "c".
lines) and for both non-fault-tolerant (filled symbols) and single-fault-tolerant (empty symbols) cases.
VI. DISCUSSION OF RESULTS
The results of this study span a broad spectrum of subject matter including clock algorithm performance, design methodology, and techniques of worst case testing. The following sections address these issues.
A. Clock Algorithm Performance
For the purposes of this study, clock algorithm performance is evaluated primarily by the tightness of the clock skews. As can be seen by comparing the fault-free and single-fault cases in Fig. 8, the single-fault bounds are twice the fault-free bounds, i.e., a performance penalty of 100% is paid to protect the system from faults. It is interesting to note that this percentage is the same for both algorithms. If a clock skew dead band is made part of every communications exchange, then a designer must consider whether this penalty is worth paying to protect the system from a rare form of malicious behavior.
The equations for the clock skew upper bound suggest that the component of clock skew due to actual drift ($\rho R$) can be reduced to an insignificant level if $R$ is made small enough. This is not thought to be possible since, in the absence of read error, no correction will be made for a series of intervals until a significant skew has accumulated. A correction will then be made. This was in fact observed indirectly. Direct observation was not possible because our system had a minimum read error of 1 tick.
The indirect observation was made by first taking one dataset with zero additional read error and zero drift rate. What is observed is the system's minimum read error. This was done for several thousand clock readings with none exceeding ±1 tick. To observe the effect of $\rho R < 1$, the same system was then run with $\rho R = 0.1$. Within this series, occasional readings of ±2 were observed, thus supporting the conjecture that the $\rho R$ term actually contributes an amount equivalent to the function ceiling($\rho R$).
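The conjecture about the ceiling of the drift term can be illustrated with a toy model. This is only a sketch under simplifying assumptions that are not taken from the test system: drift accumulates by a fixed fraction of a tick each period, a digital clock can only reveal whole ticks, and a correction removes exactly the whole ticks observed. All names are hypothetical.

```python
# Toy model of sub-tick drift on a digital clock; an illustration of the
# ceiling conjecture, not a model of the actual clock peripheral.
import math
from fractions import Fraction

def conjectured_drift_ticks(rho_R):
    """Whole-tick contribution of the drift term under the conjecture."""
    return math.ceil(rho_R)

def observed_whole_ticks(rho_R, frames):
    """Whole ticks of drift visible in each frame; sub-tick drift is
    carried over until it amounts to a full tick, then corrected."""
    accumulated = Fraction(0)
    seen = []
    for _ in range(frames):
        accumulated += Fraction(rho_R)
        whole = math.floor(accumulated)
        seen.append(whole)
        accumulated -= whole   # the correction removes what was observed
    return seen

rho_R = Fraction(1, 10)
seen = observed_whole_ticks(rho_R, frames=20)
# One tick of minimum read error plus the conjectured drift contribution.
worst_reading = 1 + conjectured_drift_ticks(rho_R)
print(seen, worst_reading)
```

In this model a full tick of drift shows up only once every ten periods, which is consistent with occasional readings of ±2 when the minimum read error of 1 tick is added.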
The Mid-Point Algorithm obtains tighter skew bounds than the ICCSA and is the clear choice. Remembering that the "a" series sub-cases are hypothetical with $\rho R < 1$, the next best design is Case 1b ($\epsilon = 1$, $\rho R = 1$) which yields a single-fault-tolerant skew bound of 6 ticks. While this kind of performance is possible over dedicated links, it may not be possible to design a general purpose communication protocol which can support both the efficient transfer of normal traffic and very low read error.
If it is necessary to allow for greater read error, the designer has a wider choice in selecting the synchronization period. In this case, the use of a minimum synchronization period (i.e., with $\rho R = 1$), may yield only marginally tighter clock skews because the read error dominates. The frequent synchronizations may produce more overhead on the communications channel than is saved by virtue of the resultant tighter clock skews.
B. Design Methodology
One of the areas in which clock synchronization is used is highly reliable fault-tolerant architectures such as those used in military and commercial aircraft. The high reliability requirements put on these designs (probability of failure = $10^{-9}$ per mission) preclude testing as a means of validating that this requirement has been met. One of the methods that has been suggested for this purpose is formal verification. A formal verification methodology would entail the use of a specification language and the construction of a hierarchical theory written in that language that could be proven to show that the final design meets the highest level specification. Automated theorem provers are often used to facilitate this task. A good example of this method is the HDM methodology [12] that was used on SIFT. Most recently this has matured to EHDM [13] which was used by Rushby [6] to re-derive the clock theory originally invented by Lamport and Melliar-Smith. In [6], Rushby reports that the rigor enforced by the use of the theorem provers led to the uncovering of several inconsistencies in the original, hand-derived theory.
It was the purpose of experimental verification as reported in this paper to demonstrate that the formal theory was indeed correct. What was found was that although the theory was correct in that it predicted a bound that was never violated, the bound was only a bound and not a model for the actual circuit performance. With the insight gained by experimentally observing the circuit's behavior, it was possible to derive a more accurate theory. Thus, although testing cannot be relied upon to verify highly reliable components, it becomes an integral part of deriving the theory which can then be used to predict the circuit's performance into the unobservable regions. While this may sound obvious to those who have practiced such techniques, it has been observed that the tendency exists in individuals to be heavily biased towards either the "design and debug" or "theorize and prove" camps.
An open question is the optimal mix of theory and practice. Should an effort begin by establishing a theoretical or practical basis? What kind of problems are better solved in the lab through experiment? Which are better suited to analytical solutions? These questions are significant to program managers who have the responsibility to deliver products on time and are skeptical of theoretical practices. As demonstrated in this work, verification of predicted values of physical quantities is well suited to testing. Testing will also provide behavioral insight which aids in the construction of provable and realizable theory. As will be seen in the next section, testing cannot be relied upon to quantify worst case behavior.
C. Simulating a Malicious Liar
To experimentally verify the clock theory, special circuits were added to the clock peripheral circuitry to enable the simulation of malicious faults. What was found during testing of the ICCSA was that the worst case behavior of a lying clock was more difficult to simulate than originally anticipated, and that the special circuitry could not be used to simulate worst case conditions without great difficulty. Moreover, in the case of the Mid-Point Algorithm, worst case conditions could not be simulated at all. Figure 9 shows the faulty behavior which was assumed during the design of the test equipment. The figure illustrates the time line of three processors, p , q , and T , with p and T being good processors and q being a lying processor. If p is a slow processor with respect to 7'. then y would send a synchronization signal to p just prior to the end of the synchronization window to give p the perception that it was a good deal faster than q and thus cause p to apply a correction which would slow its clock even further. Conversely, (1 would signal 'I' at the beginning of the window causing 7' to apply a correction which would speed up its clock.
In practice, the difficulty with doing this is that, while it is possible to anticipate the beginning and ending window times for p and r with respect to q for the first frame, worst case skew is not obtained until several frames later. This behavior is illustrated in Fig. 10.
Consider the case in which processor p uses the ICCSA. Processor p will read a clock difference of $\Delta$ from q in frame 1. Processor p uses this value as part of the averaging process to compute the correction. The correction computed by processor p will thus have an error of $\Delta/4$ (for four processors). Processor r, on the other hand, will apply a correction with an equal but opposite error, with the result that the synchronization windows of p and r have been driven $\Delta/2$ further apart. Thus, for q to again send worst case synchronization signals, it must now take this additional skew into account, as illustrated by the second frame in the figure. The correction error would then become $(\Delta + \Delta/4)/4$. The correction error thus increases in each frame by an amount $\Delta/4^k$, where $k$ is the frame number. The skew between p and r would increase until the additional error becomes insignificant, i.e., $\Delta < 4^k$. This typically took five frames when large drift rates made large synchronization windows necessary.
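The frame-by-frame growth described above can be checked numerically. The following is a minimal sketch, not the test software used in this work: the function names, the one-tick threshold, and the 400-tick clock difference are assumptions chosen for illustration. It accumulates the per-frame error terms and finds the first frame in which the added error falls below one tick.

```python
def correction_error(delta: float, frame: int, n_procs: int = 4) -> float:
    """Cumulative correction error after `frame` frames.

    Frame k contributes delta / n_procs**k, so two frames give
    delta/4 + delta/16 = (delta + delta/4)/4 for four processors.
    """
    return sum(delta / n_procs**k for k in range(1, frame + 1))


def frames_to_settle(delta: float, n_procs: int = 4, tick: float = 1.0) -> int:
    """First frame k whose added error delta / n_procs**k is below one tick.

    For four processors this is the smallest k with delta < 4**k.
    """
    k = 1
    while delta / n_procs**k >= tick:
        k += 1
    return k


if __name__ == "__main__":
    delta = 400.0  # hypothetical clock difference in ticks
    print(correction_error(delta, 1))  # error after the first frame
    print(correction_error(delta, 2))  # error after the second frame
    print(frames_to_settle(delta))     # frames until the increment is < 1 tick
```

Under these assumed values the increment drops below one tick in the fifth frame, which is in line with the five frames typically observed. The cumulative error approaches one third of the clock difference, the limit of the geometric series.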
It was decided, after having observed this behavior, to model the malicious behavior from the perspective of the good processors instead of creating the erroneous signal on the faulty processor. This was done by providing the synchronization algorithm with a parameter which indicated which remote clock was to be considered a liar and in which direction it was lying. The good processor then substituted its START or END window value for the faulty clock's actual reading, thus simulating the effect described above.
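The substitution can be sketched as follows. This is an illustrative model only: the function name, the dictionary of readings, the sign convention, and the rule that readings outside the window are replaced by zero are assumptions based on the usual description of interactive convergence, not the code that ran on the test hardware.

```python
from typing import Dict, Optional


def iccsa_correction(diffs: Dict[str, float], start: float, end: float,
                     liar: Optional[str] = None, lie_late: bool = True) -> float:
    """Average of clock-difference readings as seen by one good processor.

    diffs    : clock id -> measured difference in ticks (own clock reads 0)
    start/end: limits of the synchronization window in ticks
    liar     : id of the remote clock to be treated as lying, if any
    lie_late : True substitutes the END value, False the START value
    """
    readings = dict(diffs)
    if liar is not None:
        # Model the lie on the good processor: replace the liar's actual
        # reading with the edge of this processor's own window.
        readings[liar] = end if lie_late else start
    # Assumed ICCSA rule: readings outside the window count as zero.
    accepted = [d if start <= d <= end else 0.0 for d in readings.values()]
    return sum(accepted) / len(accepted)


if __name__ == "__main__":
    seen_by_p = {"p": 0.0, "q": 3.0, "r": 10.0, "s": 5.0}  # hypothetical ticks
    print(iccsa_correction(seen_by_p, -100.0, 100.0))
    print(iccsa_correction(seen_by_p, -100.0, 100.0, liar="q", lie_late=True))
```

Because the substituted value is taken from the processor's own window, it tracks the window from frame to frame, so the lying clock does not have to anticipate the accumulated skew.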
Worst case conditions could not be simulated with the Mid-Point Algorithm. Examination of the data has led to the conclusion that this is due to the lack of sufficient processors to create the necessary conditions. The theory's prediction overestimated the measured data by an amount equal to $2\epsilon$ (or 400 ticks, see Fig. 7). This is one half of the skew due to read error ($4\epsilon$) and is equivalent to the read error contribution of one clock (since two clock values are used). Worst case conditions are a combination of maximum drift, maximum read error, and the presence of a malicious liar, i.e., the fastest and slowest clocks receive read errors and faulty clock readings which cause them to underestimate their corrections. In the Mid-Point Algorithm, the two outlying clock readings are discarded and the remaining two averaged (for four processors). When a malicious liar is present and behaves as described above, the malicious clock replaces the fastest and slowest clocks at the extremes, with the effect that the fastest and slowest clocks include their own clock difference readings (0) in the correction computation. Normally the fastest and slowest clocks would be at the extremes and would not be used. The "self" clock readings do not contain any read error, so the worst case skew is not achieved. In a system of five or more clocks, the malicious clock would still replace the fastest and slowest clocks, but these clocks would now not be included in the central clock readings. It would then be possible to achieve worst case conditions.
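The four-clock case can be illustrated with a short sketch. The function names and the tick values are hypothetical, and the code only models the selection step described above (discard the two outlying readings, average the remaining two) as seen from the slowest good clock p.

```python
from typing import Dict, List


def midpoint_inputs(readings: Dict[str, float]) -> List[str]:
    """Ids of the clocks whose readings survive after the lowest and
    highest readings are discarded."""
    ordered = sorted(readings.items(), key=lambda item: item[1])
    return [name for name, _ in ordered[1:-1]]


def midpoint_correction(readings: Dict[str, float]) -> float:
    """Average of the readings that remain after discarding the extremes."""
    kept = midpoint_inputs(readings)
    return sum(readings[name] for name in kept) / len(kept)


if __name__ == "__main__":
    # Differences in ticks seen by the slowest clock p (own reading is 0).
    fault_free = {"p": 0.0, "q": 20.0, "r": 60.0, "s": 40.0}
    # Same view when q lies and appears far slower than p.
    with_liar = {"p": 0.0, "q": -100.0, "r": 60.0, "s": 40.0}

    print(midpoint_inputs(fault_free), midpoint_correction(fault_free))
    print(midpoint_inputs(with_liar), midpoint_correction(with_liar))
```

In the fault-free view, p's own reading is the lowest and is discarded. When q lies, q takes the low extreme, so p's error-free self reading of 0 enters the average, which is why the full read error contribution is not reached with only four clocks.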
What is concluded, then, is that testing cannot be relied upon to create worst case behavior. The complex interactions often confound cursory analysis, with the result that something other than worst case may be observed, and with the danger that the system will then be designed around these misleading specifications. Developing a theory which predicts worst case behavior provides a checking mechanism which, when the theory's prediction does not match the observation, immediately raises the question of which is at fault. For a highly reliable design, these kinds of discrepancies must be known and resolved.
VII. CONCLUSION
New theory has been developed and experimentally verified for the Interactive Convergence Clock Synchronization Algorithm and the Mid-Point Algorithm. The Mid-Point Algorithm is capable of achieving tighter synchronization than the Interactive Convergence Clock Synchronization Algorithm. Both algorithms suffer a 100% penalty to protect against one fault, i.e., skew bounds are twice as large for the single-fault case as they are for the fault-free case. The new theory is more accurate than existing theory that was developed without the benefit of the insight gained during experimental verification. However, it is also found that it is not adequate to rely on testing procedures to uncover worst case behavior. It is concluded that testing and theory go hand in hand to produce optimal designs. This is especially true for highly reliable systems.
