A hardware implementation of a provably correct design of a fault-tolerant clock synchronization circuit by Torres-Pomales, Wilfredo
NASA Technical Memorandum 109001
A Hardware Implementation of a Provably Correct
Design of a Fault-Tolerant Clock Synchronization
Circuit
Wilfredo Torres-Pomales
July 1993
(NASA-TM-109001) A HARDWARE
IMPLEMENTATION OF A PROVABLY
CORRECT DESIGN OF A FAULT-TOLERANT
CLOCK SYNCHRONIZATION CIRCUIT
(NASA) 14 p
G3/62
N94-13450
Unclas
National Aeronautics and
Space Administration
Langley Research Center
Hampton, Virginia 23681-0001
https://ntrs.nasa.gov/search.jsp?R=19940008977 2020-06-16T21:34:11+00:00Z
Contents
1. Introduction
2. System Description
3. Experimental Results
4. Concluding Remarks
5. References

1. Introduction
Many of the critical systems in aircraft are being implemented using electronic
instruments. These systems must be able to operate in many different types of
environments. Present reliability requirements for critical systems produce failure rate
specifications that exceed those of commercially available digital devices. This forces the
utilization of fault-tolerant architectures. This paper presents a hardware design of a fault-
tolerant clock synchronization circuit and the results of the tests performed on it.
This work is part of NASA's effort towards the development of a practical
validation and verification methodology for aircraft digital control systems. The circuit is
intended to be part of a verified hardware base for the formally verified Reliable
Computing Platform (RCP) for real-time digital flight control [1], currently under
development at NASA Langley Research Center. Also, the circuit serves as an
experimental test-bed for the ongoing FLY-BY-LIGHT/POWER-BY-WIRE project.
The system presented here was designed following the description given by Paul S.
Miner in [2]. It is a four-clock system capable of achieving initial synchronization, and it
tolerates a single transient or permanent fault. The system can tolerate only a single fault
because no Byzantine Exchange is used in the communication among the clocks. The
convergence function used is the fault-tolerant midpoint function. No further analysis of
the theory behind the implementation will be presented here. This can be found in [2, 3].
2. System Description
The synchronization algorithm requires that the clock system operate in frames
(i.e., cycles) and adjustments to the clocks be computed once every frame. The clocks
have to exchange local times, and then compute adjustments to their own times using the
information received. Each clock will then make its frame longer or shorter than the
nominal frame length (R = 8192 ticks; 1 tick = 1 clock cycle) by an amount corresponding
to the computed adjustment. Because the algorithm requires the time from at least three
nodes to compute an adjustment, proper behavior must be defined when a clock circuit
does not receive enough information. The design provides two options:
Assumed-Perfection - assume all clocks are observed to be in perfect
synchrony, or
Assumed-End-of-Frame - assume that unobserved clocks are seen at the end
of the frame (Local Clock = R), and then compute an adjustment.
Both options are available in the implementation. Prior to a test, the desired option can be
selected by setting a DIP switch.
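
The two options can be illustrated with a short Python sketch. It is not taken from the circuit design. It assumes that each reading is the difference between the local clock value at which a transmission arrived and the expected arrival time, that a clock's own reading is zero, that three readings (counting its own) are sufficient, and that a positive adjustment lengthens the frame. The function names, the option strings, and the expected arrival time of 4096 ticks used in the examples are hypothetical.

```python
R = 8192  # nominal frame length in ticks (from the text)


def ft_midpoint(readings):
    """Fault-tolerant midpoint for one tolerated fault: discard the smallest
    and the largest reading, then take the midpoint of what remains.
    Integer floor division is an assumption of this sketch."""
    s = sorted(readings)
    return (s[1] + s[-2]) // 2


def compute_adjustment(arrivals, expected, option, r=R):
    """arrivals: local clock values at which the three remote transmissions
    arrived, with None for a transmission that was not observed.
    expected: local clock value at which an arrival is expected (assumed)."""
    observed = [a - expected for a in arrivals if a is not None]
    if len(observed) + 1 >= 3:
        # Enough information: own reading (0) plus at least two remote ones.
        return ft_midpoint([0] + observed)
    if option == "assumed_perfection":
        # All clocks are taken to be in perfect synchrony: no correction.
        return 0
    if option == "assumed_end_of_frame":
        # Unobserved clocks are taken to be seen at Local Clock = R.
        filled = [(r if a is None else a) - expected for a in arrivals]
        return ft_midpoint([0] + filled)
    raise ValueError("unknown option: " + option)


# Three remote readings observed: ordinary fault-tolerant midpoint.
print(compute_adjustment([4100, 4102, 4098], 4096, "assumed_end_of_frame"))
# Only one remote reading observed: the two options differ.
print(compute_adjustment([4100, None, None], 4096, "assumed_perfection"))
print(compute_adjustment([4100, None, None], 4096, "assumed_end_of_frame"))
```

Under these assumptions, Assumed-End-of-Frame yields a large positive adjustment when readings are missing, which corresponds to a clock extending its frame.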
The dominant parameter in the formal theory is the inherent error in reading a
remote clock. An efficient communications network is required to achieve a tight
synchronization. This system uses point-to-point optical communications. The optical
transmitter-receiver interface is composed of off-the-shelf components and provides a
transmission rate of 125 Mbits/sec. A maximum read error of 0.5 ticks was estimated
from an engineering analysis of the communication network.
Figure 1 shows a block diagram of one of the four identical clock circuits. Each
circuit performs several functions including: keeping a local time, performing the
transmissions and receptions of the frame counter values, voting the frame values,
computing and applying an adjustment to local time, determining if there are at least three
clocks synchronized in the system, and determining if its local time is synchronized with
the rest of the system. Based on the current frame information and state, a clock is
capable of determining what its next state should be: initialization, maintenance or
recovery of synchronization. The clock circuits also provide external controls used during
the experiments.
The local time has two components: the Local Clock and the Frame Number. If i
is the current Frame Number and LC is the current value of the Local Clock, then the time
elapsed since the system achieved synchronization is iR + LC. Processors connected to
this system must perform this operation to determine the local time from the information
available at the Processor Interface.
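
As a minimal sketch of the operation a processor would perform, the following Python function combines the two components into a single tick count. The function and parameter names are hypothetical; only the formula iR + LC and the value R = 8192 come from the text.

```python
R = 8192  # nominal frame length in ticks (from the text)


def local_time(frame_number, local_clock, r=R):
    """Ticks elapsed since synchronization was achieved: i*R + LC."""
    return frame_number * r + local_clock


# Frame Number 3 and Local Clock 100 give 3 * 8192 + 100 ticks.
print(local_time(3, 100))  # 24676
```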
The Local Clock is a 16-bit counter driven by a 10MHz oscillator. It goes through
the Monotonic Clock Logic before being sent to the Processor Interface. The purpose of
this logic is to ensure that the local time is a monotonically non-decreasing function of real
time. This logic limits the output of the local clock to R, so that backward jumps in local
time are inhibited.
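
The effect of the limit can be checked with a small Python sketch. It models only the behavior described above (the output is capped at R) and assumes that a lengthened frame lets the raw counter run past R before it is reset to zero. The function names and the 5-tick extension are hypothetical.

```python
R = 8192  # nominal frame length in ticks (from the text)


def monotonic_output(local_clock, r=R):
    """Limit the Local Clock value presented to the processor to R."""
    return min(local_clock, r)


def readings(frame_number, frame_length, clamp, r=R):
    """Local time i*R + LC seen on every tick of one frame."""
    out = []
    for lc in range(frame_length):
        shown = monotonic_output(lc, r) if clamp else lc
        out.append(frame_number * r + shown)
    return out


def is_non_decreasing(seq):
    return all(a <= b for a, b in zip(seq, seq[1:]))


# Frame 0 is lengthened by 5 ticks, frame 1 has the nominal length.
clamped = readings(0, R + 5, True) + readings(1, R, True)
raw = readings(0, R + 5, False) + readings(1, R, False)

print(is_non_decreasing(clamped))  # True
print(is_non_decreasing(raw))      # False: jump back at the frame boundary
```

Without the limit, the last reading of the lengthened frame exceeds the first reading of the next frame, which is the backward jump the logic is meant to prevent.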
The Frame Counter keeps track of the number of frames since the system first
synchronized. It is an 8-bit counter and its value is determined by the state of the clock.
When a clock is synchronized with the system in normal operation, its Frame Counter is
incremented by one at the beginning of each new frame. During initialization, the value is
set to zero until synchronization is achieved. If a clock determines that it has lost
synchronization, the Frame Counter will not be incremented; the Majority Voter will
recover the correct value, which will be loaded in the Counter at the end of the frame.
The algorithm requires the clocks to exchange local time information in every
frame. The clocks in this implementation do not transmit both their Local Clock and
Frame Counter values. Instead, some time before the local clock reaches time Q (the
transmission time, Q = R/2), the Timing Logic will signal the Transmitters to send the
frame value to the other clocks. This transmission provides sufficient information about
the transmitting clock's local time. The receiving clocks compute the difference between
the actual times of arrival of the received transmissions and the expected arrival time, and
then use those differences to compute an adjustment to their local clocks. The received
frame values and the local Frame Counter value go through the majority voter which is
used to maintain agreement among the clocks on this component of the local times. If the
Majority Voter cannot find a majority among the frame values, it will assert the
NO_MAJORITY signal at the Processor Interface and the local Frame Counter will keep
its current value into the next frame. This way of exchanging local times satisfies the
efficiency requirements to achieve a tight synchronization and requires minimum use of the
communications network, which could be shared with other components in a fault-tolerant
computing system.
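
A majority vote over the frame values might be sketched in Python as follows. The text does not state how many matching values constitute a majority; this sketch assumes a strict majority (more than half of the values voted on), and the function name and return convention are hypothetical.

```python
from collections import Counter


def majority_vote(frame_values):
    """frame_values: the local Frame Counter value plus the received ones.
    Returns (value, no_majority). A strict majority is an assumption."""
    if not frame_values:
        return None, True
    value, count = Counter(frame_values).most_common(1)[0]
    if 2 * count > len(frame_values):
        return value, False
    return None, True


print(majority_vote([7, 7, 7, 3]))  # (7, False)
print(majority_vote([5, 5, 6, 6]))  # (None, True)
```

When the second element is True, the case corresponds to the NO_MAJORITY signal being asserted and the local Frame Counter keeping its current value.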
The Timing Logic also controls other functions. It uses the local clock value and the
computed adjustment to identify when the clock is close to the end of the frame and disables the
Receivers to allow the electronics to process the received information before starting a new
frame. The effects of this no-reception window on the performance of the system are discussed in
section 3. The Timing Logic also provides the signals needed by the Adjustment Computation
logic to compute an adjustment when not enough information has been received. At the end of
the frame, the Timing Logic clears all the information processing circuitry to prepare the system
for the next frame and signals the Local Clock to start a new frame and reset its value to zero.
The State Determination logic uses the received information and the current state
to determine the next state. It determines whether or not there is a group of at least three
clocks within the maximum allowed skew D (= 11 ticks or 1.1 µs at 10 MHz), and also
compares the absolute value of the computed adjustment to D. The results are then used
to determine if the system is synchronized and whether or not the local time is in sync with
the rest of the system. The State Determination logic decides when the Frame Counter
has to be incremented, signals the Restart Operation Logic when the system is not
synchronized, and also provides the OUT_OF_SYNC signal at the Processor Interface to
indicate when the local time is not synchronized.
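
The two checks can be sketched in Python. This is an illustration under stated assumptions, not the circuit's logic: readings are taken to be arrival-time differences in ticks with the clock's own reading equal to zero, "within D" is taken to mean a spread of at most D, and a clock is taken to be in sync when the magnitude of its adjustment does not exceed D. The function names are hypothetical.

```python
D = 11  # maximum allowed skew in ticks (from the text)


def three_within_skew(readings, d=D):
    """True if some group of at least three readings spans no more than d."""
    s = sorted(readings)
    # In sorted order, any group of three lies inside a window s[i]..s[i+2].
    return any(s[i + 2] - s[i] <= d for i in range(len(s) - 2))


def locally_in_sync(adjustment, d=D):
    """Compare the absolute value of the computed adjustment to d."""
    return abs(adjustment) <= d


print(three_within_skew([0, 3, -2, 4000]))  # True: -2, 0, 3 span 5 ticks
print(three_within_skew([0, 20, 40, 60]))   # False
print(locally_in_sync(-4))                  # True
print(locally_in_sync(1360))                # False
```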
The Restart Operation Logic is responsible for resetting the clock circuit. This
reset will cause a clock to enter the initialization state in which the Frame Counter is kept
at zero and the OUT_OF_SYNC signal asserted until synchronization is achieved. The
circuit goes to this state immediately after power-up and also when the State
Determination logic signals that the system has lost synchronization after already being
synchronized.
Because of the experimental nature of this implementation, external controls were
included. These allow the introduction of faults into the system and enable the
experimenter to create situations of special interest when studying its performance. The
DAS Decoder is used to decode the external control signals in real time (DAS stands for
Digital Acquisition System, which was the system chosen to control and study the
implementation). Some implementation parameters can be set using DIP switches. These
parameters include the frame length R, the time for transmission Q, the
maximum allowed skew D, and the length of the no-reception window at the end of the
frame. Also, the desired behavior when a circuit receives insufficient information to
compute an adjustment can be selected using the DIP switches.
3. Experimental Results
Several scenarios were of special interest in this implementation:
achieving initial synchronization, maintaining synchronization, recovery from a
single transient fault, recovery from a massive transient fault, and maintaining
synchronization in the presence of a single permanent fault. There was also interest in
determining the effect of the no-reception window at the end of the frames and comparing
the behavior of the system for both initialization algorithms: Assumed-Perfection, and
Assumed-End-of-Frame.
According to the theory, a system which correctly implements the synchronization
algorithm will be able to achieve and maintain synchronization if there are enough good
clocks present (i.e., at least three). Experimental observations did not show any
significant difference in the normal behavior of the clocks when one or the other
initialization algorithm was used. Because of this, the results will be presented for only
one case: Assumed-End-of-Frame.
Figure 2 is a plot of a typical response of the system during initialization after a
power-up. Only the local clocks are included because they are the ones directly affected
by the synchronization algorithm. The reference time was provided by a 32-bit counter at
10 MHz (1 tick = 100 ns). The figure shows that after power-up the clocks start with
random values. The system is not considered operational until after the initial reset of all
the circuitry. This occurs from approximately time 16,000 on clock 1 until time 30,000 on
clock 4. The net effect of the clocks resetting at different times is that they wake up at
different times. As can be seen, the system achieves full synchronization by time 51,000
and then maintains the synchronization. Taking the time at which the last clock woke up
as a reference (i.e., clock 4), synchronization was achieved in 21,000 ticks or 2.1ms.
Once the system achieved initial synchronization, it stayed in that state for as long as the
observations lasted. The longest continuous observation was of approximately 15
minutes. The synchronization was always within two ticks. This behavior was much
better than the estimated 11 ticks because the characteristics of the parts used in the
implementation exceeded the ones used in the analysis.
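
The tick figures quoted in this section convert to time as shown in the following Python sketch, which uses the 100 ns tick of the 10 MHz reference and integer nanoseconds to avoid rounding. The function name is hypothetical.

```python
TICK_NS = 100  # one tick of the 10 MHz reference, in nanoseconds
R = 8192       # nominal frame length in ticks


def ticks_to_ns(ticks):
    return ticks * TICK_NS


# Initial synchronization: from the last wake-up (30,000) to 51,000.
sync_ticks = 51_000 - 30_000
print(sync_ticks, ticks_to_ns(sync_ticks))  # 21000 ticks, 2100000 ns = 2.1 ms

# One nominal frame, the observed skew, and the estimated skew.
print(ticks_to_ns(R))   # 819200 ns = 819.2 us
print(ticks_to_ns(2))   # 200 ns
print(ticks_to_ns(11))  # 1100 ns = 1.1 us
```

On these numbers, the 21,000 ticks needed for initial synchronization correspond to a little over two and a half nominal frames.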
Figure 3 shows the response of the system to a single transient fault. The fault was
introduced in clock 2 at approximately time 23,000 when it started a new frame at LC =
1360 instead of zero. The other clocks were able to maintain synchronization because
there were enough good clocks present. Also, clock 2 was able to get back in sync within
one frame because the transmissions from the good clocks showed that they were
synchronized. Recovery of the frame value also took only one frame for this type of fault.
An example of a massive transient fault is presented in Figure 4. In this case clocks
2, 3, and 4 were affected by transients which made them reset to values different from
zero. Clock 1 was not affected. The jumps in clocks 3 and 4 were to values of LC greater
than the transmission time Q, and so they did not transmit in this frame. Because none of
the clocks had sufficient information to compute adjustments they all assumed End of
Frame arrival and extended their frames. However, because they were previously in sync,
they decided that a massive transient occurred, reset their circuitry and started again. The
system regained synchronization approximately 19,000 ticks or 1.9ms after the fault was
introduced. Synchronization was later maintained.
The performance of the system under a single permanent fault was excellent.
Figure 5 shows the response when a fault on clock 2 rendered its receivers inoperative
after reference time 22,000. Because of this fault, clock 2 was unable to compute
adjustments but kept transmitting in every frame. Since it did not have sufficient
information to compute adjustments, it always assumed End of Frame arrival (note the
flat extension on its curve). Clocks 1, 3, and 4 never lost synchronization and the
transmissions from clock 2 had no negative effects on their behavior.
The effect of the no-reception window at the end of every frame was also
investigated. There was a concern with the possibility of enabling a 2-2 split by using this
window. However, this condition has not been observed to date. Were it to occur, it
would be an unstable state, since the drift and jitter in the oscillators would tend to move
the clocks to a different state. Also, if a recovering clock receives all the transmissions
during this window, then it assumes that the system has lost synchronization, resets its
circuitry, and starts again. Synchronization takes only one frame following the reset
because the clock circuit will receive three synchronized transmissions in the first frame.
The effect of the no-reception window could be a problem if an initializing clock
happened to receive two or more of the transmissions in this window. In this situation the
clock behaves as if there were no other clocks present in the system. If the clock is
operating under the Assumed-Perfection algorithm, the drift in the oscillators will be the
only factor which would get that clock out of this state. However, if the clock is in the
initialization state and using the Assumed-End-of-Frame arrival option, then it will extend
its frame and receive all the transmissions in the next frame. This will enable the clock to
synchronize within one frame. This behavior of the system makes the Assumed-End-of-
Frame option the preferred one to handle situations in which a clock does not have
sufficient information to compute an adjustment.
4. Concluding Remarks
Based on the collected experimental data, the design seems to satisfy the
requirements of the synchronization algorithm. As mentioned before, synchronization was
better than expected: within 2 ticks for the experimental versus 11 ticks for the analytical.
Also, the system never failed to achieve initial synchronization and always recovered from
transient faults with expected behavior.
Work is currently under way to design and test a new version of the clock
synchronization system. This design will use a different block model than the one
presented in [2]. It will perform the same functions, without the experimental controls,
and minimize the amount of logic needed. It is expected that the new system will be able
to operate at frequencies in excess of 33 MHz.
5. References
[1] R.W. Butler and B.L. Di Vito, "Formal design and verification of a reliable
computing platform for real-time control: Phase 2 results", Technical
Memorandum 104196, NASA, Langley Research Center, Hampton, VA, Jan.
1992.
[2] P.S. Miner, "A verified design of a fault-tolerant clock synchronization circuit:
Preliminary investigation", Technical Memorandum 107568, NASA, Langley
Research Center, Hampton, VA, Mar. 1992.
[3] P.S. Miner, P.A. Padilla, and W. Torres, "A provably correct design of a fault-tolerant
clock synchronization circuit", Proceedings of the 11th DASC, Oct. 1992, pp. 341-
346.
[Figure 1: block diagram of one of the four identical clock circuits]
[Figure 2: Local Clock values against reference time during initialization after power-up]
[Figure 3: Local Clock values against reference time for a single transient fault]
[Figure 4: Local Clock values against reference time for a massive transient fault]
[Figure 5: Local Clock values against reference time for a single permanent fault]
REPORT DOCUMENTATION PAGE
Form Approved, OMB No. 0704-0188
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE: July 1993
3. REPORT TYPE AND DATES COVERED: Technical Memorandum
4. TITLE AND SUBTITLE: A Hardware Implementation of a Provably Correct Design of a Fault-Tolerant Clock
Synchronization Circuit
5. FUNDING NUMBERS: WU 505-64-10-10
6. AUTHOR(S): Wilfredo Torres-Pomales
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES):
NASA Langley Research Center
Hampton, VA 23681-0001
8. PERFORMING ORGANIZATION REPORT NUMBER
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES):
National Aeronautics and Space Administration
Washington, DC 20546-0001
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: NASA TM-109001
11. SUPPLEMENTARY NOTES
12a. DISTRIBUTION/AVAILABILITY STATEMENT:
Unclassified-Unlimited
Subject Category 62
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
A fault-tolerant clock synchronization system was designed to a proven correct formal specification. Formal Methods were
used in the development of this specification. This paper presents a description of the system and an analysis of the tests
performed. Plots of typical experimental results are included.
14. SUBJECT TERMS
Clock Synchronization, Formal Methods, Point-to-Point Communications
15. NUMBER OF PAGES: 13
16. PRICE CODE: A03
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT
20. LIMITATION OF ABSTRACT
NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18
298-102
