A Timeout-Based Message Ordering Protocol for a Lightweight Software Implementation of TMR Systems by Ezhilchelvan PD et al.
A Timeout-Based Message Ordering Protocol
for a Lightweight Software Implementation
of TMR Systems
Paul D. Ezhilchelvan, Francisco V. Brasileiro, Member, IEEE Computer Society, and Neil A. Speirs
Abstract—Replicated processing with majority voting is a well-known method for achieving reliability and availability. Triple Modular
Redundant (TMR) processing is the most commonly used version of that method. Replicated processing requires that the replicas
reach agreement on the order in which input requests are to be processed. Almost all synchronous and deterministic ordering
protocols published in the literature are time-based in the sense that they require replicas’ clocks to be kept synchronized within some
known bound. We present a protocol for TMR systems that is based on timeouts and does not require clocks to be kept in bounded
synchronism. Our design efforts focus on keeping the ordering delays small, without an unnecessary increase in message overhead.
Consequently, we are able to show that no symmetric protocol that works only with unsynchronized clocks can provide a smaller worst-case
delay. We also demonstrate through analysis and experiments that our protocol is faster than a time-based one of identical
message complexity in certain situations which can prevail in many application settings.
Index Terms—Byzantine failures, fault tolerance, Triple Modular Redundancy (TMR), process replication, agreement, message
ordering, physical and logical clocks.

1 INTRODUCTION
WE consider the task of designing and implementing a system that continues to provide services in the
presence of a bounded number of faulty processors. The
weaker the assumptions made about a faulty processor’s
behavior, the wider is the set of faults tolerated. The
weakest fault model known to the fault-tolerant community
is the Byzantine model [1], and N modular redundant
(NMR) processing is one of the most effective ways to mask
the effects of Byzantine failures [2]. The basic idea here is to
use $N$, $N \ge 3$, processors in place of a single processor so
that failures of at most $(N-1)/2$ processors are masked. A
Triple modular redundant (TMR) system with N = 3 is the
most practical version of an NMR system. The three
processors of a TMR system execute every given task in
parallel and the results produced by them are subject to a
majority vote. If the voted result is regarded as the output
from the TMR system, the system functions correctly
provided
1. at least two of its constituent processors are
nonfaulty,
2. all nonfaulty ones produce identical results after
executing any given task, and
3. the voting performed is correct.
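The voting step described above can be sketched as follows. This is a minimal illustration, not the voter used by the authors; the function name `majority_vote` and the use of plain hashable values as "results" are assumptions made here.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # A strict majority is needed: 2 of 3 in a TMR system.
    return value if count > len(results) // 2 else None
```

With three replicas of which one is faulty, `majority_vote([42, 42, 7])` yields the value agreed by the two nonfaulty replicas, provided conditions 1 to 3 hold.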
Process replication within a TMR system assumes the
well-understood state machine model, which imposes two
requirements [3] on replicas: they must be deterministic
(i.e., the execution of an operation in a given state and with
a given set of arguments must always produce the same
result) and they must start in the same state. The second
requirement means that a message ordering mechanism
that presents the input messages to replicas in an identical
order is necessary. In this paper, we develop an ordering
protocol for the authenticated Byzantine fault model (perhaps
the best-known subclass of the Byzantine fault model), in which
a faulty processor cannot undetectably forge a nonfaulty
processor’s signature; the protocol also
assumes the following synchronous environment:
1. Communication delays between nonfaulty processors
are bounded by a known constant.
2. Processing and scheduling delays within a nonfaulty
processor can also be bounded.
3. Each nonfaulty processor has a local read-only
physical clock whose running rate with respect to
the passage of real time differs from unity by a small
and known bound.
(Real time is measured in an assumed Newtonian time
frame that cannot be directly observed.)
Message ordering in the presence of failures requires the
ability to detect late and absent messages. In the time-based
approach [4], this capability is usually achieved by
synchronizing nonfaulty processors’ clocks within some
known constant. Since nonfaulty processors need not have
clocks with identical running rates, they should periodically
execute a synchronization protocol (e.g., [5], [6]) to adjust
their clock readings by appropriate amounts. This
synchronization involves
1. periodic exchange of messages, consuming network
bandwidth,
2. using data abstractions [7] to adjust the readings of a
physical clock (a read-only object), and
3. implementing amortization techniques [8] to avoid
sudden jumps in the synchronized clock readings.
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 1, JANUARY 2004 53
. P.D. Ezhilchelvan and N.A. Speirs are with the School of Computing
Science, University of Newcastle upon Tyne, NE1 7RU, UK.
E-mail: {Paul.Ezhilchelvan, Neil.Speirs}@ncl.ac.uk.
. F.V. Brasileiro is with the Departamento de Sistemas e Computação,
Universidade Federal de Campina Grande, 58.109-970, Campina Grande,
PB, Brazil. E-mail: fubica@dsc.ufcg.edu.br.
Manuscript received 18 Dec. 2000; revised 1 Nov. 2001; accepted 11 Sept.
2002.
For information on obtaining reprints of this article, please send e-mail to:
tpds@computer.org, and reference IEEECS Log Number 113319.
1045-9219/04/$17.00 © 2004 IEEE. Published by the IEEE Computer Society
An alternative to building a heavy-weight, synchronized
clock abstraction is to use timeouts and to employ
(unsynchronized) physical clocks only for measuring
timeouts. Here, we adopt the timeout-based approach for
designing the TMR message ordering protocol. Specifically,
we derive timeout durations called the timeliness bounds to
detect late and absent messages; such bounds are derived
by taking advantage of the synchronous nature of the TMR
system and they remain constant while the system is
operational.
The works of [9] and AMp [10] suggest that using
timeouts in the synchronous context is not something new.
One of the designers of AMp analyzed time-based versus
timeout-based approaches and observed [11] that, although
synchronous, timeout-based protocols cannot be perfect substitutes
for their time-based counterparts in all circumstances, they
can provide attractive alternatives in a number of
application settings. We analytically identify and
experimentally observe the situations where our timeout-based
protocol is faster than its time-based counterpart. Existence
of such situations indicates that the protocol presented here
enhances the tool-kit available to a TMR system builder.
The rest of the paper is organized as follows: Section 2
describes the structure of the TMR system, the basic
assumptions, and the message ordering requirements.
Section 3 develops and presents the protocol. Section 4
outlines the design efforts involved in keeping the ordering
delays small. Section 5 presents proofs of correctness and
also proves that no (symmetric) protocol that works only
with unsynchronized clocks can guarantee smaller
worst-case ordering delays. In Section 6, we compare the ordering
delays in both time and timeout-based approaches, and
identify situations where the timeout protocol performs
faster than the time-based protocol; our implementation
experiments demonstrate that such situations are not
uncommon in practice. Section 7 concludes the paper after
a brief survey on related work.
2 SYSTEM DESCRIPTION AND ASSUMPTIONS
The TMR system is made up of processors named Pi, Pj,
and Pk. These processors are uniquely ordered and the
ordering is known to them. Each processor is connected
directly to the other two processors of the system by internal
links. It is also connected to the “outside world,” or the
system environment, from which input messages are
received and to which output messages are sent. Fig. 1 shows a
TMR system whose processors are connected to the
environment via a bus. The unit (shown as a black square
in the figure) that connects a processor to the bus is called
the network attachment controller (NAC).
Assumption 1. Within a TMR system, processors fail independently
of each other and at least two processors are nonfaulty (and,
therefore, do not fail).
Assumption 2. The internal links that connect processors do not
fail. Within a processor, the messages received via each internal
link are queued separately before they are processed. A processor
is reliably connected to the system environment via a NAC, which
does not allow the attached processor to use the bus continuously
for a long time. Thus, if a faulty processor, like a “babbling idiot,”
transmits randomly generated messages, it cannot overwhelm
the communication subsystem and prevent a nonfaulty processor
from receiving another nonfaulty processor’s messages or
from accessing the bus.
Assumption 3. Each processor can sign the messages it sends,
and authenticate the signed messages it receives [12] such that:
1) a nonfaulty processor’s signature for a given message is
unique and cannot be generated by any other processor, and
2) any attempt to alter the contents of a nonfaulty processor’s
signed message is detected by any other nonfaulty processor.
These assumptions define the authenticated Byzantine
fault model. In the general (unauthenticated) Byzantine
model, the assumed failure modes are described not by
listing how a faulty processor is expected to fail, but rather
by describing a small set of incorrect behaviors that are
disallowed: a faulty processor cannot 1) make a nonfaulty
one faulty (failure independence), and 2) overwhelm the
communication subsystem with random messages and
deny a nonfaulty processor the capability to communicate
with another nonfaulty processor or with the environment.
The authenticated model additionally restricts a faulty
processor’s ability to undetectably impersonate a nonfaulty
processor (Assumption 3). Byzantine fault models are
regarded as the weakest since they allow a faulty
processor to be operated by a malicious adversary who can
exercise any combination of permitted failure modes to
make nonfaulty processors order inputs differently.
The next three assumptions define the synchronous
nature of the system. Before stating them, we remark that
we write, throughout this paper, real-time values in Greek
letters and clock time values in italicized lower case Roman
letters; the term “clock,” when not qualified, will refer to a
processor’s hardware clock.
Assumption 4. If any nonfaulty processor prepares and
transmits m at real-time $\tau$, any nonfaulty destination will
receive m at real-time $\tau'$, $\tau \le \tau' < \tau + \delta$, where $\delta > 0$ is a
known constant. Thus, $\delta$ bounds the message queuing delays at
the sending and receiving processors, and the propagation
delay from the sender to a receiver.
Assumption 5. The clock of any nonfaulty processor, say Px,
measures a duration of length l in real-time $l(1+\rho_x)$, where
$|\rho_x| \le \rho$ and $\rho$ is a known positive constant. ($\rho$ is around $10^{-6}$
for nonfaulty processor clocks.)
Assumption 6. Processing and scheduling delays are bounded
and known for any nonfaulty processor; more precisely, any
nonfaulty processor: 1) performs a local computation (e.g.,
processing a received message) within a known amount of
time; and 2) schedules a computational task within some
known amount of time.
Fig. 1. The TMR system.
Remarks. Assuming the synchronous model requires
accurate estimation of certain bounds that must hold
throughout the system operation. Since these bounds are
assumed to hold for all nonfaulty processors, a violation
of any of these bounds makes a nonfaulty processor
appear to be behaving like a faulty one, e.g., a violation
of Assumption 4 means that the (nonfaulty) sender
appears to be faulty to the (nonfaulty) receiver. When the
fault hypothesis (of at most one fault) is thus artificially
undermined, processors that correctly execute the protocol
can end up ordering messages differently. Therefore,
adopting the synchronous model, be it for a time-based
or a timeout-based approach, requires defining the
maximum processing and communication load which
the system will ever be subject to. This could, in turn,
involve assessing the maximum processing load which a
valid input can impose on processors, and the number of
most-demanding inputs which the system has to
concurrently deal with at any given time. The only
alternative to having to determine hard bounds on
various delays is to opt for the asynchronous model,
which is briefly described in Section 7.
Note that when the actual system load is less than the
maximum defined, the processing, the scheduling, and
communication delays will be smaller than their respective
upper bounds. A question then naturally arises:
whether an order protocol can order messages faster
when the TMR system is lightly loaded. A time-based
protocol, by the nature of its design, cannot work faster,
while a timeout-based one can. That is why our protocol
works faster than its time-based counterpart when
certain situations prevail, one of which happens to be a
lightly loaded TMR. However, for this benefit to
manifest, the design efforts, as shown in Section 4.3,
must go farther than those needed for a time-based
protocol. (Here, we remark that the class of early-stopping
protocols, such as [13], [14], is designed to
terminate early, not when delays are smaller than
expected, but when fewer failures than expected occur.)
2.1 Input Message Ordering
Nonfaulty processors can receive input messages in different
orders. Referring to Fig. 1, Pi can receive $\mu_1$ followed by $\mu_2$ and
Pk can receive them in the reverse order. So, when a processor receives an input
message from the system environment, it must first decide
the processing order for that message. For that, it forms an
internal message m that contains the received input message
in the data field $m.\mu$ and sends m to all other processors in the
system using the order protocol that guarantees the following
two conditions:
Validity. If a nonfaulty processor, say Pi, forms and
sends m, all nonfaulty processors (including Pi) decide on
an order for m, within a known and bounded real-time
interval $\Delta$.
Unanimity. If a nonfaulty processor decides on an order
for a given m, then every nonfaulty processor decides on the
same order for m.
These conditions ensure that an input $\mu$ supplied to a
nonfaulty P gets identically ordered in the form of P’s
internal message m, $m.\mu = \mu$, by all nonfaulty processors of
the TMR system within $\Delta$. We refer the reader to [15] for
details on how a nonfaulty processor 1) derives an ordered
stream of inputs from an ordered stream of internal
messages, and 2) generates a voted output from the results
it computes. This paper will focus only on the (timeout-
based) ordering of internal messages. Note that, since the
TMR system can have at most one faulty processor, an input
$\mu$ must be supplied to at least two processors within the
system; we will assume that every $\mu$ is sent to all three
processors in the TMR system.
3 THE PROTOCOL
The protocol has three aspects to it:
1. message counters maintained by processors,
2. message diffusion that enables nonfaulty processors
to receive each other’s messages, and
3. timeliness checks to assess the timeliness of a received
message.
All processors execute the same version of the protocol
except for processor identities and signatures.
The protocol is described for Pi which is assumed to be
nonfaulty throughout the paper.
3.1 Message Counter and Diffusion
Pi maintains a counter called the message counter, denoted as
MCi, which holds an integer value and is initialized to
INIT_VAL (usually 1) when the system is first started. For
every input $\mu$ received from the environment, Pi forms an
internal m in the following manner: the data field $m.\mu$ is set
to $\mu$, the originator field m.O to i, and the timestamp field m.TS
to MCi; further, MCi is then incremented by 1. Incrementing
MCi immediately after timestamping m ensures that any
message Pi later forms gets a timestamp larger than m.TS.
An internal message m formed is accepted with its copy
being entered into a message list called the acceptedi and is
then sent to other processors using the send(m) primitive.
An invocation of this primitive discards m if m.S, the signature
field of m, contains two signatures, and performs the following
actions if m.S contains zero or one signature: the signature of Pi for m is
generated and appended to any signature that may already
be in m.S; the signed m is then transmitted to processors in
the system that have not signed m. Pi receives internal
messages from other processors by executing the receive(m)
primitive which blocks until it can return an authentic m
that is received via an internal link and has the authentic
signatures of one or two distinct processors other than Pi.
Whenever Pi receives m, it checks whether m is timely.
(Procedures for checking the timeliness of a received
message are described in the next section.) If m is found
untimely, it is discarded; otherwise, the following three
actions are performed:
1. MCi is set to $\max\{MC_i,\, m.TS + 1\}$ (see footnote 1).
2. m is accepted by entering a copy of m into acceptedi.
3. send(m) which ensures that if the accepted m had one
signature when it was received, it is diffused to the
processor that appears not to have “seen” m.
Note that the message diffusion in 3) is essential to ensure the
unanimity condition when one processor can be faulty [16].
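The counter, send, and accept rules of this subsection can be sketched as below. This is a simplified model under stated assumptions: signatures are modelled as an ordered tuple of signer names (no cryptography), the network is replaced by returning the list of destinations, and the timeliness check is assumed to have been done before `accept` is called. The names `Msg`, `Processor`, `form`, and `accept` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

INIT_VAL = 1                      # "usually 1" according to the text
ALL = ("Pi", "Pj", "Pk")          # the three processors of the TMR system


@dataclass(frozen=True)
class Msg:
    data: str                     # data field (the input)
    origin: str                   # originator field m.O
    ts: int                       # timestamp field m.TS
    sigs: Tuple[str, ...] = ()    # m.S, modelled as ordered signer names


@dataclass
class Processor:
    name: str
    mc: int = INIT_VAL            # message counter MC
    accepted: List[Msg] = field(default_factory=list)

    def form(self, data: str):
        """Timestamp a new internal message, bump MC, accept it, send it."""
        m = Msg(data, self.name, self.mc)
        self.mc += 1
        self.accepted.append(m)
        return self.send(m)

    def send(self, m: Msg) -> Tuple[Optional[Msg], List[str]]:
        """Discard a 2-signed m; otherwise sign it and address non-signers."""
        if len(m.sigs) >= 2:
            return None, []
        signed = Msg(m.data, m.origin, m.ts, m.sigs + (self.name,))
        return signed, [p for p in ALL if p not in signed.sigs]

    def accept(self, m: Msg):
        """The three actions for a received message already judged timely."""
        self.mc = max(self.mc, m.ts + 1)
        self.accepted.append(m)
        return self.send(m)
```

In this sketch a message formed by Pi goes to Pj and Pk; when Pk accepts it, the 2-signed copy is diffused only to Pj, and Pj's own `send` then discards it.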
Footnote 1: A faulty processor can give arbitrarily large m.TS to messages it forms
and sends and, thereby, cause MCi to wrap around frequently. The watermark
scheme of [17] can be used to restrict this incorrect behavior.
From the description above, it is obvious that: 1) a
message sent or received by a nonfaulty processor will have
either one or two authentic signatures; and 2) every sent
message, but no received message, carries the host
processor’s signature. For a given message m, path(m) is
defined as the ordered sequence of processors that have
signed m. Thus, if m is a double-signed message that is
formed by Pj and diffused by Pk, then path(m) = Pj : Pk. The
first processor in path(m) is called the originator of m and the
last processor the immediate sender of m. Note that the
originator and the immediate sender of m are one and the same
if m is single-signed. Two paths are said to intersect if they
contain one or more processors in common.
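The path notions defined above are simple enough to state as code. The sketch below assumes a path is represented as a tuple of processor names in signing order; the helper names are illustrative.

```python
from typing import Tuple

Path = Tuple[str, ...]            # processors in the order they signed m


def originator(path: Path) -> str:
    """First processor in path(m)."""
    return path[0]


def immediate_sender(path: Path) -> str:
    """Last processor in path(m)."""
    return path[-1]


def intersect(p: Path, q: Path) -> bool:
    """Two paths intersect if they share at least one processor."""
    return bool(set(p) & set(q))
```

For a double-signed message formed by Pj and diffused by Pk, the path is `("Pj", "Pk")`, so the originator is Pj and the immediate sender is Pk.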
3.2 Timeliness Checks
These checks enable a processor to determine the timeliness
of a received m. Before presenting them, we will define a
clock time interval d such that by measuring d in its local
clock a nonfaulty processor is guaranteed to measure a real-time
interval of at least $\delta$ duration, i.e., $d \ge \delta/(1-\rho)$; d is
known to, and identical for, all nonfaulty processors of the
system. We will assume, for simplicity, that a processor
takes zero time to execute any instruction of the protocol
and the send(m) and receive(m) primitives. (Realizing this
assumption, in practice, will require an increase in the value
of d, which is possible as the protocol does not impose any
upper bound on the value of d.)
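As a small worked illustration, the smallest admissible d can be computed from the two bounds. The sketch writes `delta` for the message delay bound of Assumption 4 and `rho` for the clock drift bound of Assumption 5 (symbol names assumed here), and the numeric values in the usage note are hypothetical.

```python
def timeout_interval(delta: float, rho: float) -> float:
    """Smallest clock interval d satisfying d >= delta / (1 - rho)."""
    if delta <= 0 or not (0 <= rho < 1):
        raise ValueError("need delta > 0 and 0 <= rho < 1")
    # A clock running fast by a factor (1 - rho) must still cover delta
    # of real time while it counts d.
    return delta / (1.0 - rho)
```

For example, with a hypothetical delay bound of 10 ms and a drift bound of 1e-6, d is only marginally larger than 10 ms; in practice d would be increased further to cover protocol execution time, as noted above.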
Suppose that Pi receives a message m at its local clock
time $t_i$. One of three situations can arise,
depending on the value of MCi at $t_i$: $MC_i < m.TS$, $MC_i = m.TS$,
or $MC_i > m.TS$. If $MC_i < m.TS$ or $MC_i = m.TS$,
then m is a “future” or a “present” message, respectively,
and is considered by Pi as timely; if, on the other hand, MCi
is already larger than m.TS when m is being received, then
m is a “past” message and its timeliness should be judged
based on how much time has elapsed since MCi first
became larger than m.TS. So, timeliness checks are needed
only for messages received with past timestamps. Let us
suppose that the m received at $t_i$ is a past message. Let $m'$ be
the message whose acceptance by Pi caused MCi to become
larger than m.TS for the first time. That is, just before Pi
accepted $m'$, $MC_i \le m.TS$. So, $m'.TS \ge m.TS$ must be true.
Note that $m'$ could have been formed by any processor
including Pi; therefore, path($m'$) can be any one of Pi, Pj, Pk,
Pj : Pk, or Pk : Pj. Let Pi accept $m'$ at its clock time $t'_i$, $t'_i < t_i$.
For m to be considered timely, $(t_i - t'_i)$ must be less than a
fixed bound, called the timeliness bound, whose value
depends on path($m'$) and path(m). Table 1 summarizes these
timeliness bounds for all combinations of path($m'$) and
path(m). For example, if $m'$ is a single-signed message from
Pk and m is a single-signed message from Pj, then the entry
B2 (row B, column 2) indicates that $(t_i - t'_i)$ must be less than
2d for Pi to consider m timely. We will denote an entry of
Table 1 by treating the table as a matrix: Table 1 $[p_r, p_c]$
denotes the value in the row corresponding to the path $p_r$
and in the column corresponding to the path $p_c$.
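The three-way classification and the check for past messages can be sketched as follows. The timeliness bound is passed in as a parameter because the entries of Table 1 are not reproduced here; only the B2 example (a bound of 2d) comes from the text. The function names are illustrative.

```python
def classify(mc: int, ts: int) -> str:
    """Classify a received message by comparing MC with m.TS."""
    if mc < ts:
        return "future"
    if mc == ts:
        return "present"
    return "past"


def is_timely(mc: int, ts: int, elapsed: float, bound: float) -> bool:
    """Future and present messages are timely; a past message is timely
    only if the clock time elapsed since MC first exceeded m.TS is
    strictly less than the applicable timeliness bound."""
    return classify(mc, ts) != "past" or elapsed < bound
```

With the B2 entry as the bound, a past single-signed message from Pj arriving 1.5d after Pi accepted a single-signed m' from Pk would be timely, while one arriving exactly 2d after would not.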
3.3 Protocol Description
To perform the timeliness checks on received messages, Pi
maintains path counters, denoted as $PC_i[p]$, for every path
p through which Pi can receive a message. These counters
are initialized to (INIT_VAL − 1) and are updated at
appropriate times using two primitives: an invocation of
update(path: p, timestamp: T) sets $PC_i[p] = \max\{PC_i[p], T\}$;
schedule update(p, T) at t schedules update(p, T)
to be invoked at (local) clock time t, if t is a future time
when the schedule instruction is executed.
Whenever Pi forms and sends $m'$ or receives a timely
$m'$, updates are scheduled to set $PC_i[p] = \max\{PC_i[p], m'.TS\}$,
for every $p \in \{P_j, P_k, P_j : P_k, P_k : P_j\}$,
after the elapse of the time intervals indicated in Table 1
[path($m'$), p]. Thus, the timeliness of a received m can be
checked simply by referring to $PC_i[path(m)]$: m is timely
only if $m.TS > PC_i[path(m)]$ when m was received.
Putting it differently, a received m is not timely if
$PC_i[path(m)] \ge m.TS$ when m was received. Let $PC_{i,min}$
denote the minimum of the path counter values at any
time, and let its current value be T. From the moment
$PC_{i,min}$ becomes T, the set of messages in acceptedi which
have the timestamp of T cannot expand in size; that is,
this set has stabilized and is denoted as $stable_i(T)$.
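One way to realize the path counters and the two primitives is a priority queue of pending updates keyed by local clock time. The sketch below is an assumption-laden illustration: time is passed in explicitly rather than read from a hardware clock, a schedule request whose time is not in the future is ignored (a literal reading of the text), and the class and method names are invented for the example.

```python
import heapq
from typing import Dict, List, Tuple

INIT_VAL = 1
PATHS = ("Pj", "Pk", "Pj:Pk", "Pk:Pj")   # paths through which Pi can receive


class PathCounters:
    def __init__(self) -> None:
        self.pc: Dict[str, int] = {p: INIT_VAL - 1 for p in PATHS}
        self._pending: List[Tuple[float, int, str, int]] = []
        self._seq = 0                     # tie-breaker for equal times

    def update(self, path: str, ts: int) -> None:
        self.pc[path] = max(self.pc[path], ts)

    def schedule_update(self, path: str, ts: int, at: float, now: float) -> None:
        if at > now:                      # only future times are scheduled
            heapq.heappush(self._pending, (at, self._seq, path, ts))
            self._seq += 1

    def advance(self, now: float) -> None:
        """Invoke every scheduled update whose time has been reached."""
        while self._pending and self._pending[0][0] <= now:
            _, _, path, ts = heapq.heappop(self._pending)
            self.update(path, ts)

    def is_timely(self, path: str, ts: int) -> bool:
        return ts > self.pc[path]

    def pc_min(self) -> int:
        """Messages with this timestamp in the accepted list are stable."""
        return min(self.pc.values())
```

After accepting m', a caller would issue one `schedule_update` per path at the acceptance time plus the corresponding Table 1 bound, and call `advance` before each timeliness check.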
To perform message ordering, entries of $stable_i(T)$ are
first stripped of their signatures, and then duplicate entries
are discarded. Two messages m and $m'$ are said to be
spurious if m.O = $m'$.O, m.TS = $m'$.TS, and $m.\mu \ne m'.\mu$.
Spurious messages are generated when the faulty processor
gives the same timestamp to distinct messages m and $m'$ it
forms. Spurious messages in $stable_i(T)$ are discarded. After
this pruning, $stable_i(T)$ contains at most one message
originating from a given processor. Messages of $stable_i(T)$
are ordered 1) according to their source processors and
2) only after the contents of all $stable_i(T')$, $T' < T$, have been
ordered. To ensure 2), a stability counter, denoted as SCi and
initialized to (INIT_VAL − 1), is used. The protocol (for Pi)
is presented in Fig. 2.
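The pruning and ordering of one stable set can be sketched as below. The sketch assumes every entry is a tuple (data, originator, timestamp, signatures) with the same timestamp T, that the agreed processor ordering is Pi, Pj, Pk (the text only says a unique ordering exists), and that all messages of a spurious pair are dropped. The function name is illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

RANK = {"Pi": 0, "Pj": 1, "Pk": 2}        # assumed processor ordering

Entry = Tuple[str, str, int, tuple]        # (data, origin, ts, sigs)


def order_stable(stable: Iterable[Entry]) -> List[Tuple[str, str, int]]:
    """Strip signatures, drop duplicates and spurious messages, then order."""
    data_by_origin: Dict[str, Set[str]] = defaultdict(set)
    ts_by_origin: Dict[str, int] = {}
    for data, origin, ts, _sigs in stable:   # signatures are ignored
        data_by_origin[origin].add(data)     # the set removes duplicates
        ts_by_origin[origin] = ts
    # Spurious: same originator and timestamp but different data.
    kept = [
        (next(iter(datas)), origin, ts_by_origin[origin])
        for origin, datas in data_by_origin.items()
        if len(datas) == 1
    ]
    return sorted(kept, key=lambda m: RANK[m[1]])
```

A caller would apply this to the stable sets in increasing order of T, which is the role the stability counter plays in the text.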
TABLE 1
Pi’s Timeliness Bounds for a Past m Given That MCi First Exceeded m.TS Due to Accepting $m'$
4 ESTIMATING TIMELINESS BOUNDS
Central to protocol design is the estimation of timeliness
bounds which need to be 1) as small as possible for message
ordering to be fast and 2) large enough to ensure that the
validity and unanimity conditions can be realized. This
section describes how we meet these competing requirements.
It is done in three parts.
First, the basic condition that should be met for the
protocol to be correct is stated. We then analyze how
meeting this condition is made difficult by various possible
failure modes of the faulty processor; this analysis leads to
the identification of a requirement for meeting the basic
condition.
The third part is on estimating timeliness bounds which
meet that requirement and is done in two stages. First, we
derive the bounds in a manner that is typically employed in
designing time-based protocols. This results in what we call
the “conservative” bounds. Next, these bounds are reduced,
which leads to smaller worst-case ordering delay.
4.1 The Basic Condition
The following condition must be met for identical ordering
by nonfaulty processors:
Fig. 2. The timeout-based protocol for message ordering.
Unanimous acceptance. m enters the accepted list of a
nonfaulty processor if and only if m or a message equivalent
to m enters the accepted list of every other nonfaulty
processor.
A message equivalent to m is denoted as equiv(m) and is
defined as a message that differs from m only in its
signature field; that is, equiv(m).$\mu$ = m.$\mu$, equiv(m).O = m.O,
equiv(m).TS = m.TS, and equiv(m).S $\ne$ m.S. For every
accepted m, (nonfaulty) Pi can receive at most one equiv(m).
Note that equiv(m) and m become identical once their
signatures are stripped off (i.e., once the signature fields are
set to $\emptyset$).
If the above condition holds, then nonfaulty processors
will construct an identical stable(T) of signature-stripped and
nonspurious messages for every given T, $T \ge$ INIT_VAL.
Given that processor ordering is unique and known to
nonfaulty processors, the order determined by them for any
(signature-stripped) m will be identical.
4.2 Effects of Failures
A faulty processor, say Pj, can attempt to prevent the above
condition from being met, by failing in the following ways:
F1 (impersonating a nonfaulty processor): Pj generates a
signed message on behalf of Pi and attempts to deceive Pk
into accepting the forged message.
F2 (delayed sending of own messages). Pj delays the sending
of a message m it generates, with the consequence that one
nonfaulty processor, say Pi, finds m timely and Pk does not.
This situation, depicted in Fig. 3a, has the effect of m
entering acceptedi, but not acceptedk.
F3 (two-facing while sending own messages). Pj sends a
properly signed m to, say Pi, and sends to Pk either
1. an inauthentic version of m,
2. nothing, or
3. a different, authentic message $m'$ (that is never sent
to Pi).
The effect of a two-facing failure is the same as the previous
category: a 1-signed message enters the accepted list of only
one nonfaulty processor.
F4 (failure during diffusion). Pj fails while diffusing a
message m which it received from, say Pi, as shown in
Fig. 3b. It can fail by altering the contents of m (call this failure
type F4.1) or by delaying the diffusion by an arbitrary amount
of time (call this type F4.2).
Assumption 3 reduces the impact
of F1 and F4.1 to Pk detecting and discarding Pj’s tampered
message as not authentic. F4.2 can result in Pk not receiving
the diffused message at all, or receiving it but finding it
untimely. Thus, Pj’s failures of types F2, F3, and F4.2 can
cause Pj’s m not to be accepted by one nonfaulty processor,
while m or equiv(m) is being accepted by another. Despite this,
the condition of unanimous acceptance needs to be satisfied
through message diffusion, which is feasible if the following
requirement is met.
Unanimous acceptance requirement. A nonfaulty processor
finds a received m timely if the immediate sender of
m is nonfaulty.
Suppose that the above requirement is met. In case of F2
and F3 type failures, if the 1-signed m sent by faulty Pj
enters acceptedi but not acceptedk, then the equiv(m) diffused
by Pi will be accepted by Pk; Pj’s failures of type F4.2 are
made irrelevant since m sent by nonfaulty Pi is assured to
enter acceptedk.
4.3 Derivation of Timeliness Bounds
As per the protocol, whenever Pi accepts m, MCi is
immediately set to $\max\{MC_i, m.TS + 1\}$, while
$PC_i[p]$, for every path $p \in \{P_k, P_j, P_j : P_k, P_k : P_j\}$, is set to
$\max\{PC_i[p], m.TS\}$ after some time. So, $MC_i > PC_i[p]$
is always true for any path p. This means that when a received
m is a future or present message (i.e., $MC_i \le m.TS$), it is found
timely (i.e., $PC_i[p] < m.TS$) and, therefore, the unanimous
acceptance requirement is trivially met. So, in what follows,
we show that the timeliness bounds of Table 1 meet the above
requirement when a received m is a past one.
We will present our derivations in the context in which the
bounds of Table 1 are presented: Pi is nonfaulty and accepts
$m'$ at its clock time $t'_i$ and receives m, $m.TS \le m'.TS$, at $t_i$,
$t_i > t'_i$; just before $t'_i$, $MC_i \le m.TS$, and at $t'_i$, $MC_i > m'.TS$. To
keep the derivation simple, we assume the following:
1. The minimum message transmission delay (as
measured by a nonfaulty clock) is zero.
2. The clocks of all nonfaulty processors have an
identical running rate; so, $d = \delta/(1-\rho_i)$. (This
assumption is removed in Section 4.3.3.)
3. Unless stated explicitly, time is measured according
to the clock of Pi. So, when we say an event
happened at (time) t, it means that the event
happened at time t according to Pi’s clock.
4. Finally, the subscript i is dropped from $t_i$ and $t'_i$,
where the context is obvious.
4.3.1 Conservative Bounds
Lemma 1. Given that Pi accepts $m'$ at $t'$, any nonfaulty Px will
have $MC_x > m'.TS$ before $t' + d$.
Proof. If Px has signed $m'$, then $MC_x > m'.TS$ when it
signed $m'$. Since Px’s signing of $m'$ has to be prior to Pi
accepting $m'$, the lemma is true. If Px has not signed $m'$, it
will receive the diffused equiv($m'$) from Pi before $t' + d$.
Upon the reception of $m'$, if $MC_x \le$ equiv($m'$).TS then
equiv($m'$) will be seen by Px as a present or future
message and, therefore, must be accepted, which will
then set $MC_x > m'.TS$. $\square$
Say, nonfaulty Pk forms and sends m, $m.TS \le m'.TS$. By
Lemma 1, MCk becomes larger than $m'.TS$ before $t' + d$. So, m
should have been sent by Pk before $t' + d$. Even if m
experiences the maximum delay of just less than d time, Pi
should receive Pk’s single-signed m before $t' + 2d$. It suggests
Fig. 3. Pj’s failures of types F2 and F4.
that if Pi receives a single-signed message m at t such that
$(t - t') < 2d$, it must consider m timely. This scenario is
depicted in Fig. 4a, where time progresses from left to right
and the labels “$> m'.TS$” and “$= m.TS + 1$” indicate the
earliest instances when a nonfaulty message counter exceeds
$m'.TS$ and when it becomes equal to $m.TS + 1$, respectively;
“$< nd$” labels an interval of length less than nd, for some
integer $n \ge 1$. On the time-line of Pk (in Fig. 4a), the timing
instance labeled “$= m.TS + 1$” cannot be on the right-hand
side of that labeled “$> m'.TS$” since $m.TS \le m'.TS$. So, Pi can
receive a timely m from Pk at t, $(t - t') < 2d$.
Suppose that m originates from a faulty Pj and that Pk
receives it within 2d time after $MC_k > m'.TS$ has become
true. (See Fig. 4b.)
Just like in Fig. 4a, where Pi accepts Pk’s m because m
arrives within 2d time after $MC_i > m'.TS$, Pk must now
consider Pj’s m timely. (Note that a nonfaulty processor
cannot know whether another processor is nonfaulty or
faulty.) When Pk diffuses m, Pi must find the diffused
message timely for the unanimous acceptance requirement
to be met. From Fig. 4b, Pi can receive the diffused m at t,
$t < t' + 4d$. So, Pi’s timeliness check for accepting a
double-signed m is $(t - t') < 4d$.
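The arithmetic behind the two conservative bounds can be retraced step by step. The sketch below simply adds up the worst-case intervals named in the argument above, measured from the time Pi accepts m'; the function and variable names are illustrative.

```python
from typing import Tuple


def conservative_bounds(d: float) -> Tuple[float, float]:
    """Return (1-signed bound, 2-signed bound), measured from t'."""
    counter_exceeds = d            # Lemma 1: MC_k > m'.TS before t' + d
    transmission = d               # any message takes less than d to arrive
    # Nonfaulty Pk must have sent m before its counter exceeded m'.TS.
    one_signed = counter_exceeds + transmission
    # Pk accepts a 1-signed m from Pj within 2d of its counter exceeding
    # m'.TS, then diffuses it to Pi.
    accept_window = 2 * d
    two_signed = counter_exceeds + accept_window + transmission
    return one_signed, two_signed
```

The result is (2d, 4d), matching the checks derived in this subsection.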
4.3.2 Reducing the Conservative Bounds: Why and How?
Let us evaluate what would happen had the conservative bounds been used as
such. Consider the following scenario: Pi receives and
accepts a 1-signed $m'$ from nonfaulty Pk at time $t'$. Pi would
have to wait until $t' + 2d$ and $t' + 4d$ to deduce that it would
no longer accept a 1-signed m and a 2-signed m,
$m.TS \le m'.TS$, respectively. That is, Pi could not stabilize
$m'$ any earlier than $t' + 4d$. Say, Pk sends $m'$ at $t'_k$,
$t'_k \le t' \le t'_k + \delta$. So, the delay (measured from $t'_k$) for Pi to
order $m'$ could be at most $\delta + 4d$. The maximum delay for
Pk to order its own $m'$ is 4d. Thus, $\Delta$, the maximum
ordering delay, would have to be $\delta + 4d$. Interestingly, the
case which decides $\Delta$ is Pi ordering $m'$, $m'.O \ne P_i$; it turns out
that if $m'.O \ne P_i$, the conservative timeliness bounds can be
reduced, permitting Pi to stabilize $m'$ at $t' + 3d$ itself,
resulting in a smaller $\Delta = 4d$. Below, we present the
intuition behind this reduction.
Suppose that $P_i$ has received and accepted $m'$ at $t'$, and that
$P_i \neq m'.O = P_k$ (say). $P_i$ can now expect any of $P_k$'s 1-signed
$m$, $m.TS < m'.TS$, to be received before $t' + d$, if $P_k$ is
nonfaulty. (Note: The unanimous acceptance requirement
needs to be met only when $P_k$, the immediate sender of $m$, is
nonfaulty.) This is because a nonfaulty $P_k$ sends the messages
it forms in increasing order of message timestamps; $P_i$
must receive the earlier one within at most $d$ after it has
received the later one. So, the conclusion is: given that $m'$ is
accepted at $t'$, the timeliness bound for a 1-signed $m$ such that
$m.TS \leq m'.TS$, and such that $m$ and $m'$ have the same originator,
which is not $P_i$, is only $d$ (see entries B1 and E1 of Table 1), and
need not be $2d$, which is the conservative bound for any 1-signed $m$.
Observe that we achieved the above reduction (for two
particular cases) by considering the slowest possible behavior
of an unknown $m$ whose immediate sender is nonfaulty and
$m.TS \leq m'.TS$; further, we also made use of two facts:
1) $m'.O \neq P_i$, and 2) $\mathit{path}(m')$ intersects with $\mathit{path}(m)$. It turns
out that whenever 1) and 2) hold, reduction is possible.
Intuitively, the latest arrival time of the unknown $m$ can be
estimated more precisely if $m$ and the accepted $m'$ are
handled by the same processor. Because there are only three
processors in the TMR system, $\mathit{path}(m')$ and $\mathit{path}(m)$ always
intersect if $m$ is 2-signed and $m'.O \neq P_i$. Furthermore, if
$\mathit{path}(m')$ and $\mathit{path}(m)$ have only one common processor, then
that processor does not have to be nonfaulty for reduction to
be feasible. Suppose that $P_j$ is faulty. The only path
combination for which $\mathit{path}(m')$ and $\mathit{path}(m)$ intersect only
at $P_j$ is: $\mathit{path}(m') = P_j$ and $\mathit{path}(m) = P_j : P_k$, given that the
immediate sender of $m$ (i.e., $P_k$) has to be a nonfaulty one.
Appendix A (available online as supplemental material at
http://computer.org/tpds/archives.htm) argues that the
bound for this path combination is $3d$ as indicated by the
entry C3 of Table 1. With reduced bounds in use, $P_i$, after
having accepted $m'$ at $t'$, can conclude by $t' + 3d$ that it would
no longer accept any $m$ whose immediate sender is nonfaulty
and $m.TS \leq m'.TS$, i.e., it can stabilize $m'$ no later than $t' + 3d$.
Using arguments similar to those used in Appendix A
(available online as supplemental material at
http://computer.org/tpds/archives.htm), we have reduced many
other bounds, all of which are indicated by those entries of
Table 1 that are neither $2d$ nor $4d$, and by entries D3 and E4
which have $2d$ for a double-signed $m$. These arguments are
shown in [18].
4.3.3 Accounting for Nonidentical Clock Rates
We now remove the following simplifying assumption
made earlier: Clocks of all nonfaulty processors have a
known, identical running rate. Recall that d was defined in
Fig. 4. (a) Nonfaulty Pk forms and sends m. (b) Pk diffuses Pj’s m.
Section 3.2 to be the clock time interval such that by
measuring $d$ in its local clock a nonfaulty processor is
guaranteed to have measured a real time interval of at least
$\delta$. Since the running rate of any nonfaulty clock can differ
from unity by at most $\rho$ (Assumption 5), it is enough to set
$d = \delta/(1 - \rho) = d_1$ (say), if all nonfaulty clocks had the same
unknown running rate. We argue below that the effects of
nonidentical running rates of nonfaulty clocks are compensated
for when $d_1$ is increased to $d_1(1 + 2\rho)(1 + 2\rho)$.
Suppose that all processors start measuring $d_1$ from the
same real-time instant by referring to their respective local
clocks. If nonfaulty clocks had been running at an identical
rate, then every given nonfaulty processor deduces the
following two facts when it completes measuring $d_1$. Fact-1:
Every other nonfaulty processor has completed measuring
$d_1$. Fact-2: Every nonfaulty processor deduces that every
other nonfaulty processor has completed measuring $d_1$. By
Assumption 5, nonfaulty clocks can drift apart at the
maximum rate of $2\rho$. So, a nonfaulty processor must
measure $d_2 = d_1(1 + 2\rho)$ in its clock to ascertain that every
other nonfaulty processor must have measured at least $d_1$.
That is, every nonfaulty processor must measure $d_2$ in its
clock to be able to deduce fact-1. Similarly, to be able to
deduce fact-2, a duration $d_3 = d_2(1 + 2\rho)$ must be measured
in the local clock. More precisely, only after measuring $d_3$ in
its local clock can a given nonfaulty processor, say $P_i$,
1) ascertain that every other nonfaulty processor, say $P_k$,
must have measured at least $d_2$ and 2) deduce that $P_k$ would
have deduced fact-1.
If we ignore the terms containing the second and higher
order powers of $\rho$, we can write $d_3 = d_1/(1 - 4\rho) = \delta/(1 - 5\rho)$.
Thus, choosing $d \geq d_3 = \delta/(1 - 5\rho)$ will account for
nonidentical running rates of nonfaulty clocks.
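As a numerical illustration of this argument, the following sketch computes $d_1$, $d_2$, $d_3$ and the first-order form $\delta/(1 - 5\rho)$. The function name and the sample values of $\delta$ and $\rho$ are assumptions chosen only for the example.

```python
def timeout_unit(delta: float, rho: float) -> dict:
    """Compute the successive estimates of the timeout unit d.

    delta: bound on message delay in real time (Assumption 4)
    rho:   bound on clock drift rate (Assumption 5)
    """
    if not 0 <= rho < 0.2:
        raise ValueError("rho must satisfy 0 <= rho < 0.2 so that 1 - 5*rho > 0")
    d1 = delta / (1 - rho)          # identical but unknown clock rates
    d2 = d1 * (1 + 2 * rho)         # long enough to deduce fact-1
    d3 = d2 * (1 + 2 * rho)         # long enough to deduce fact-2
    approx = delta / (1 - 5 * rho)  # first-order form used in the text
    return {"d1": d1, "d2": d2, "d3": d3, "approx": approx}


# Illustrative values: delta = 10 ms, rho = 1e-6.
bounds = timeout_unit(delta=10.0, rho=1e-6)
```

For any $0 < \rho < 1/5$ the first-order form is slightly larger than the exact $d_3$, so choosing $d = \delta/(1 - 5\rho)$ errs on the safe side.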
5 PROTOCOL CORRECTNESS AND OPTIMALITY RESULT
Lemma 2. The protocol guarantees the validity condition with
$\Delta = 4d(1 + \rho)$, provided $d \geq \delta/(1 - 5\rho)$.
Proof. Consider an execution of the protocol in which $P_i$ and
$P_k$ are nonfaulty. Let $P_i$ form and send a message $m$ at real
time $\tau_i$. $P_k$ receives $m$ before $\tau_i + \delta$ (by Assumption 4). The
arguments of Section 4.3 indicate that the timeliness
bounds of Table 1 satisfy the unanimous acceptance
requirement when $d \geq \delta/(1 - 5\rho)$. (For a formal proof of
correctness, see Appendix C of [19].) So, $P_k$ will find the
received $m$ timely and accept it. Note that when a nonfaulty
processor accepts a message $m$, none of its path counters
stays below $m.TS$ after some finite time. So, the message $m$
accepted by $P_i$ and $P_k$ will be taken up for ordering and be
ordered, because a faulty processor cannot form and send
another message that could make $P_i$ and $P_k$ consider $m$ as
spurious (due to Assumption 3). This shows that the
validity condition is met within a finite time after $\tau_i$.
The schedule instructions in the Broadcast process
indicate that $PC_{i,\min}$ reaches $m.TS$ within $4d$ clock time
after $\tau_i$. So, $P_i$ orders $m$ no later than $\tau_i + 4d(1 + \rho_i)$. The
schedule instructions in the Diffuse process indicate that
$PC_{k,\min}$ reaches $m.TS$ within $3d$ clock time after $P_k$ receives
$m$, which is before $\tau_i + \delta$. So, $P_k$ orders $m$ no later than
$\tau_i + \delta + 3d(1 + \rho_k)$. Thus, the message $m$ sent by $P_i$ gets
ordered by $P_i$ and $P_k$ no later than
$\tau_i + \max\{4d(1 + \rho_i),\ \delta + 3d(1 + \rho_k)\}$. By Assumption 5, $|\rho_i| \leq \rho$ and
$|\rho_k| \leq \rho$. So, $\Delta = \max\{4d(1 + \rho),\ \delta + 3d(1 + \rho)\} = 4d(1 + \rho)$.
Hence, the lemma. □
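The final step of the proof can be checked numerically. The sketch below evaluates both terms of the maximum for sample values; the function name and the numbers are illustrative assumptions, not part of the protocol description.

```python
def validity_bound(d: float, delta: float, rho: float):
    """Return (bound, sender_term, receiver_term) for Lemma 2.

    sender_term:   4d(1 + rho), the latest the sender P_i orders m
    receiver_term: delta + 3d(1 + rho), the latest a receiver P_k orders m
    """
    sender_term = 4 * d * (1 + rho)
    receiver_term = delta + 3 * d * (1 + rho)
    return max(sender_term, receiver_term), sender_term, receiver_term


# Illustrative values: delta = 10 ms, rho = 1e-6, d = delta / (1 - 5*rho).
delta, rho = 10.0, 1e-6
d = delta / (1 - 5 * rho)
bound, sender_term, receiver_term = validity_bound(d, delta, rho)
```

Since $d \geq \delta/(1 - 5\rho) \geq \delta$, the difference between the two terms is $d(1 + \rho) - \delta > 0$, which is why the sender term always decides the maximum.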
Let $\mathit{Stable}_i(T)$, for some $T$, $T \geq \mathit{INIT\_VAL}$, denote the set
$\mathit{stable}_i$ which nonfaulty $P_i$ first constructs during an execution
of the protocol when $SC_i = T$. The code for the Order process
indicates that (a1) $P_i$ strips off the signatures of every $m$ in
$\mathit{Stable}_i(T)$ and removes duplicates; and then, (a2) it removes
spurious messages from $\mathit{Stable}_i(T)$. Let $\mathit{SigFree}_i(T)$ and
$\mathit{SpuFree}_i(T)$ denote the resulting $\mathit{Stable}_i(T)$ after a1 and a2
are carried out, respectively.
Definition 1. $\mathit{SigFree}_i(T) = \{m \mid m \in \mathit{Stable}_i(T) \wedge m.S = \emptyset\}$.
Definition 2.
$\mathit{SpuFree}_i(T) = \{m \mid m \in \mathit{SigFree}_i(T) \wedge (\nexists\, m' \in \mathit{SigFree}_i(T) : m' \neq m \wedge m.O = m'.O)\}$.
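Steps a1 and a2 can be sketched as two set operations. The `Msg` record, its field names, and the sample messages are hypothetical, and Definition 2 is read here as discarding every message whose originator also appears on a different message of the same set.

```python
from collections import Counter
from typing import NamedTuple, Set, Tuple


class Msg(NamedTuple):
    origin: str             # m.O, the originator
    ts: int                 # m.TS, the timestamp
    body: str               # message contents
    sigs: Tuple[str, ...]   # m.S, the signatures in path order


def sig_free(stable: Set[Msg]) -> Set[Msg]:
    """Step a1: empty the signature field; the set drops duplicates."""
    return {m._replace(sigs=()) for m in stable}


def spu_free(sigfree: Set[Msg]) -> Set[Msg]:
    """Step a2: keep m only if no other message has the same originator."""
    per_origin = Counter(m.origin for m in sigfree)
    return {m for m in sigfree if per_origin[m.origin] == 1}


# P_k's message arrives on two paths; faulty P_j sent two different bodies.
stable_i = {
    Msg("Pk", 5, "x", ("Pk",)),
    Msg("Pk", 5, "x", ("Pk", "Pj")),
    Msg("Pj", 5, "a", ("Pj",)),
    Msg("Pj", 5, "b", ("Pj", "Pk")),
}
ordered_candidates = spu_free(sig_free(stable_i))
```

In this example only the message originating from `Pk` survives, because the two signature-free messages from `Pj` differ in content and are therefore both treated as spurious.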
Lemma 3. Consider an execution of the protocol in which $P_i$ and $P_k$
are nonfaulty. Suppose that: $m \in \mathit{Stable}_i(T) \Rightarrow \mathit{equiv}(m) \in
\mathit{Stable}_k(T)$, and $m \in \mathit{Stable}_k(T) \Rightarrow \mathit{equiv}(m) \in \mathit{Stable}_i(T)$.
Then, $\mathit{SpuFree}_i(T) = \mathit{SpuFree}_k(T)$.
Proof. In reducing $\mathit{Stable}_i(T)$ to $\mathit{SigFree}_i(T)$, $P_i$ empties the
$m.S$ field of every $m$ in $\mathit{Stable}_i(T)$ and then discards
duplicates. $m$ and $\mathit{equiv}(m)$ are distinguished only by their
signature fields, and setting this field to empty will make
them identical. (See the definition of $\mathit{equiv}(m)$ in Section 4.1.)
If $\mathit{Stable}_i(T)$ contains $m$ and $\mathit{equiv}(m)$, the duplicate
removal ensures that only one of the identical copies is
retained in $\mathit{SigFree}_i(T)$. So, by the hypothesis of the
lemma, $\mathit{SigFree}_i(T) = \mathit{SigFree}_k(T)$. Contrary to the lemma,
assume that $\mathit{SpuFree}_i(T) \neq \mathit{SpuFree}_k(T)$; also assume
that, without loss of generality, there is an $m$ such that $m \in
\mathit{SpuFree}_i(T)$ and $m \notin \mathit{SpuFree}_k(T)$. By Definition 2,
$\mathit{SpuFree}_i(T) \subseteq \mathit{SigFree}_i(T)$. We have established that
$\mathit{SigFree}_i(T) = \mathit{SigFree}_k(T)$. So, $m \in \mathit{SigFree}_k(T)$. But
$m \notin \mathit{SpuFree}_k(T)$; so, by Definition 2, there must have been
an $m'$ in $\mathit{SigFree}_k(T)$ such that $m.O = m'.O$. Since
$\mathit{SigFree}_i(T) = \mathit{SigFree}_k(T)$, $m' \in \mathit{SigFree}_i(T)$. By
Definition 2, $m$ cannot be in $\mathit{SpuFree}_i(T)$. Hence, the lemma. □
Lemma 4. The protocol guarantees the unanimity condition,
provided $d \geq \delta/(1 - 5\rho)$.
Proof. Consider an execution of the protocol in which $P_i$
and $P_k$ are nonfaulty. Say $P_i$ accepts some $m'$. If $m'$ is
signed by $P_k$, then $\mathit{equiv}(m')$ is already accepted by $P_k$. If
$m'$ is not signed by $P_k$, then $P_i$ will sign and diffuse $m'$ to
$P_k$. Since the timeliness bounds of Table 1 satisfy the
unanimous acceptance requirement when $d \geq \delta/(1 - 5\rho)$,
$P_k$ will accept the diffused message. Thus, for every $m'$
accepted by $P_i$, $P_k$ accepts $\mathit{equiv}(m')$. Observe (from
Fig. 2) that a nonfaulty processor accepts no $m$,
$m.TS \leq T$, once its stability counter (SC) becomes equal
to $T$, $T \geq \mathit{INIT\_VAL}$. So, for every $m'$ accepted by $P_i$, $P_k$
should accept $\mathit{equiv}(m')$ before $SC_k$ becomes equal to
$m'.TS$. Therefore, for any $T \geq \mathit{INIT\_VAL}$, when $P_i$ and
60 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 1, JANUARY 2004
$P_k$ form $\mathit{Stable}_i(T)$ and $\mathit{Stable}_k(T)$, respectively:
$m \in \mathit{Stable}_i(T) \Rightarrow \mathit{equiv}(m) \in \mathit{Stable}_k(T)$.
By similar arguments, we can show: $m'' \in \mathit{Stable}_k(T) \Rightarrow
\mathit{equiv}(m'') \in \mathit{Stable}_i(T)$. By Lemma 3, $\mathit{SpuFree}_i(T) =
\mathit{SpuFree}_k(T) = \mathit{SpuFree}(T)$ (say) for every $T \geq \mathit{INIT\_VAL}$.
By Definition 2, $\mathit{SpuFree}(T)$ will contain at most one
message originating from a given processor. Since processor
ordering is unique and known, $P_i$ and $P_k$ will order the
entries of $\mathit{SpuFree}(T)$ identically. Further, entries of
$\mathit{SpuFree}(T)$ are ordered only after the entries of all
$\mathit{SpuFree}(T')$, $T' < T$, have been ordered. So, $P_i$ orders $m'$
before $m$ if and only if $P_k$ orders $m'$ before $m$. Thus, the
protocol satisfies the unanimity condition. □
Theorem 1. The protocol guarantees unanimity and validity
conditions with $\Delta = 4d(1 + \rho)$, provided $d \geq \delta/(1 - 5\rho)$.
Proof. Follows from Lemmas 2 and 4. □
5.1 Optimal Upper Bound
We show next that the ordering bound of our protocol is the
smallest achievable when
1. clocks are not synchronized,
2. ordering is symmetric, and
3. a processor cannot deduce temporal order between
concurrent messages it receives.
Each of these premises is defined below.
Unsynchronized Clocks. Nonfaulty processors’ clocks are
not synchronized, where a clock is a device which a
processor uses for observing time. To state this formally, let $c_i(\tau)$
denote the reading of $P_i$’s clock at real time $\tau$. Clocks of
nonfaulty $P_i$ and $P_j$ are said to be unsynchronized during an
interval of real time if $|c_i(\tau) - c_j(\tau)|$ is arbitrary for every $\tau$ in
that interval.
Symmetric Ordering. Our protocol, like [20], is symmetric
in the sense that the correct processors run the same
program (except for process identities and signatures). In
contrast, in an asymmetric protocol (e.g., [9]), correct
processors can execute different code and, hence, play
different roles: one processor, termed the sequencer,
decides and disseminates the message ordering for other
processors to accept what it has decided. In failure-free
executions, asymmetric ordering will be the fastest if the
nonsequencer processors can confirm the sequencer’s
correct behavior without delaying message ordering; such
delay-free confirmation is possible only when the sequencer
is guaranteed to fail in a benign manner (e.g., by crashing).
However, when (authenticated) Byzantine failures are
permitted, the nonsequencer processors must exchange
messages to confirm that the sequencer had behaved
correctly so far; that is, message diffusion must precede
message ordering. Further, when the sequencer fails,
message ordering is delayed until the failure is detected
and a new sequencer is elected. Though not proven here, it
appears that asymmetric ordering cannot offer a better
worst-case ordering delay.
Nondeducibility of Temporal Order. Based on the definition
of “happened before” in [21], we define two messages to be
concurrent if and only if neither one can be said to have
happened before the other. Temporal order on messages is
an order that is based on the Newtonian time instants at
which messages were generated: For any two messages in
the system, one message is before another in the temporal
order if the first one was generated earlier in a Newtonian
time-frame (see [22] for a formal definition). We assume
that no processor can deduce temporal order between
messages that are concurrent in the sense of [21].2
For simplicity, we will assume  ¼ 0 and consider a
system in which communication delays between two
nonfaulty processors can be anything between 0 and
m; 0 < m < . Let m be a value such that m < m and
ðm  mÞ is infinitely small.
Theorem 2. Any symmetric ordering protocol that works only
with unsynchronized clocks, will have executions in which the
ordering delay can be 3m þ m.
Proof. By contradiction. Shown in Appendix B (available
online as supplemental material at
http://computer.org/tpds/archives.htm). (Also in the full paper [19].) □
By Theorem 2, the upper bound  on ordering delays
must be at least ð3m þ mÞ, i.e.   ð3m þ mÞ. Since m
is not known directly, but only its upper bound $\delta$ (see
Assumption 4), $\Delta \geq 4\delta$. Theorem 1 establishes $\Delta$ of our
protocol to be $4d(1 + \rho)$ with $d$ recommended to be
$d \geq \delta/(1 - 5\rho)$. When $d$ is chosen to be $\delta/(1 - 5\rho)$, we have
$\Delta \approx 4\delta$, if we assume $(1 + \rho)/(1 - 5\rho) \approx 1$.
6 COMPARISON WITH TIME-BASED APPROACH
Let nonfaulty processors’ clocks be synchronized within $e$:
At any given real-time instant $\tau$, if a nonfaulty
processor’s synchronized clock reads $T$, then any other
nonfaulty processor’s synchronized clock reads $T'$ such
that $T - e \leq T' \leq T + e$, where $e$ is a known constant.
According to the classical time-based ordering protocol
of [20], when (nonfaulty) $P_i$ forms and broadcasts a
message $m$, it sets $m.TS$ to the current reading of its
synchronized clock; $P_i$ or any other nonfaulty $P_x$
stabilizes an accepted $m$ at its local synchronized time
$m.TS + 2(d + e)$. So, if $P_i$ has broadcast $m$ at real time $\tau$,
then it orders $m$ at real time $\tau + 2(d + e)(1 + \rho_i)$. (To
simplify the comparison of the two approaches, we will
assume that the running rate of the synchronized clock of
any nonfaulty $P_x$ is the same as that of $P_x$’s physical
clock.) At $\tau$, the synchronized clock of any nonfaulty $P_x$,
$x \neq i$, could be reading a value between $m.TS - e$ and
$m.TS + e$. So, $P_x$ will order $m$ at some real time in the
interval $[\tau + (2d + e)(1 + \rho_x),\ \tau + (2d + 3e)(1 + \rho_x)]$,
depending on whether $P_x$’s synchronized clock is ahead of
or behind $P_i$’s. So, in the best case when every nonfaulty
synchronized clock is ahead of the nonfaulty transmitter’s
synchronized clock by $e$, the upper bound on time-based
ordering delays, $\Delta_{time}$, is the maximum of $\{2(d + e)(1 + \rho_i),\
(2d + e)(1 + \rho_x)\}$. Thus, $\Delta_{time}$ is $2(d + e)(1 + \rho)$ in the best
case and is $(2d + 3e)(1 + \rho)$ in the worst case.
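The best-case and worst-case expressions for the time-based bound can be evaluated with a short sketch. The function name and the sample values are assumptions for illustration; the choice $e = d$ only mirrors the situation where no special synchronization support is available.

```python
def time_based_bounds(d: float, e: float, rho: float):
    """Return (best_case, worst_case) upper bounds on the ordering delay
    of the time-based protocol, following the expressions in the text."""
    sender = 2 * (d + e) * (1 + rho)            # transmitter P_i
    receiver_best = (2 * d + e) * (1 + rho)     # P_x's clock ahead by e
    receiver_worst = (2 * d + 3 * e) * (1 + rho)  # P_x's clock behind by e
    return max(sender, receiver_best), max(sender, receiver_worst)


# Illustrative values: d = e = 50 ms and a drift-free clock.
best_case, worst_case = time_based_bounds(d=50, e=50, rho=0)
```

With these sample values the best case equals $2(d + e) = 200$ ms and the worst case equals $2d + 3e = 250$ ms.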
The value of e depends on how frequently clocks are
adjusted and on the algorithm used to compute the
adjustment. The latter can be either an
2. According to [21], two messages need not be produced at the same
Newtonian time for them to be deemed concurrent.
• external synchronization algorithm, which requires a
reliable time source (e.g., global positioning system
(GPS)) to which processors are reliably connected
[6], or an
• internal synchronization algorithm, which does not
require any external assistance. It can be run entirely
in hardware [5], or by software processes with
specific hardware assistance as in MARS [23], or by
software processes without any special hardware
support.
In external, hardware-based, or specialized-hardware
assisted approaches, the $e$ obtained is very small (in the order of
picoseconds) compared to $d$ (which can be in the order of
milliseconds). Thus, $\Delta_{time}$ is effectively $2d$, i.e., one half of
the ordering delay $\Delta$ of the timeout approach. However,
these approaches involve the use of specific devices or
components which may not always be available to a system
builder; in those circumstances, software implementation of
an internal synchronization algorithm using commercial,
off-the-shelf (COTS) processors and operating systems is
the only available option, and it is the one we focus on in what
follows.
Among the internal synchronization protocols in the
literature, the deterministic and authenticated Byzantine
fault-tolerant protocols of [24], [25], [26] are suitable for our
TMR system and fault model. Among the rest which yield
small e, some, such as [6], assume a stronger fault model (of
bounded omission failures), while others, such as [27] and
the first two protocols in [25], assume a weaker fault model
of unauthenticated Byzantine faults, but are not appropriate
for a TMR system as they require the system to have four
processors.
In comparing $\Delta_{time}$ with $\Delta$, we will make another
simplifying assumption, favoring the time-based approach:
$e$ is taken to be the maximum clock difference immediately
after adjustments were made, i.e., we ignore a component of
$e$ which accounts for clock drift until the next adjustment. Thus,
the value of $e$ turns out to be $\delta/(1 + \rho)$ and $\delta(1 + 5\rho)/(1 + \rho)$
when [24] and [26] are used to compute the adjustment,
respectively. (The authenticated protocol of [25] is
considerably less efficient than [24].) In our calculations, we
will take $e = \delta/(1 + \rho)$ and, with no serious loss of accuracy,
ignore the terms containing second and higher order
powers of $\rho$; this approximation will enable us to write,
for example, $e = \delta/(1 + \rho) = \delta(1 - \rho)$.
$(\Delta_{time} - \Delta)$ for the best case of $\Delta_{time}$ becomes:
$2(d + e)(1 + \rho) - 4d(1 + \rho) = 2(1 + \rho)[\delta(1 + \rho) + \delta(1 - \rho)] - 4(1 + \rho)[\delta(1 + 5\rho)]$.
After some algebra, $(\Delta_{time} - \Delta) = -20\rho\delta$.
In the worst case for $\Delta_{time}$, $(\Delta_{time} - \Delta) = -20\rho\delta + e(1 + \rho)
= (1 - 20\rho)\delta$.
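The algebra can be checked numerically. The sketch below follows the expansion as printed, in which the time-based term uses $d \approx \delta(1 + \rho)$ and the timeout term uses $d \approx \delta(1 + 5\rho)$; the function name and the sample values are assumptions for illustration.

```python
def delay_difference(delta: float, rho: float):
    """Return (best, worst) values of Delta_time - Delta, using the
    first-order substitutions that appear in the expansion above."""
    d_time = delta * (1 + rho)         # d for the time-based protocol
    d_timeout = delta * (1 + 5 * rho)  # d for the timeout protocol
    e = delta * (1 - rho)              # e = delta / (1 + rho), first order
    timeout = 4 * d_timeout * (1 + rho)
    best = 2 * (d_time + e) * (1 + rho) - timeout
    worst = (2 * d_time + 3 * e) * (1 + rho) - timeout
    return best, worst


# Illustrative values: delta = 10 ms, rho = 1e-6.
delta, rho = 10.0, 1e-6
best, worst = delay_difference(delta, rho)
```

Up to terms in $\rho^2$, `best` agrees with $-20\rho\delta$ and `worst` with $(1 - 20\rho)\delta$.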
Note that the time-based protocol of [20], like ours, is
symmetric and involves message diffusion and authentica-
tion. Hence, the message complexity is the same for both.
We present two remarks on the evaluation of $(\Delta_{time} - \Delta)$
shown above. First, we have ignored a component of $e$ which
compensates for the clock drift between successive adjustments.
Since two nonfaulty clocks can drift apart at the maximum rate of
$2\rho$, the missing component of $e$ is $2\rho I$, where $I$ is the period
between successive adjustments. So, for example, if $\rho$ is taken
to be $10^{-6}$ and $I$ is chosen to be 8.33 minutes, the value of $e$ to be
considered for comparison increases by 1 millisecond. Thus,
the values of $(\Delta_{time} - \Delta)$ derived above hold in those contexts
where the clocks are synchronized so frequently that $2\rho I$
remains negligibly small. Observe that frequent execution of
clock synchronisation increases the message traffic, pushing
the worst-case delay $\delta$ for the time-based protocol to a higher
value.
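The 1 millisecond figure in the first remark follows directly from $2\rho I$. A minimal sketch of the arithmetic, with a hypothetical function name:

```python
def drift_component_ms(rho: float, period_minutes: float) -> float:
    """Drift-related component of e, 2 * rho * I, in milliseconds,
    for an adjustment period I given in minutes."""
    period_ms = period_minutes * 60 * 1000
    return 2 * rho * period_ms


# Values from the example in the text: rho = 1e-6, I = 8.33 minutes.
extra_e_ms = drift_component_ms(rho=1e-6, period_minutes=8.33)
```

Here `extra_e_ms` is approximately 0.9996 ms, which the text rounds to 1 millisecond.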
Second, it is possible to reduce $e$ by assigning a higher
priority to clock synchronization processes. This reduces
the uncertainty in the delays for process scheduling and
message queuing, thus resulting in a lower value for $\delta$ in
$e = \delta/(1 + \rho)$. It, however, has an implication for the $\delta$ estimated
for order protocol messages, which now have a lower
priority. Recall that the bound $\delta$ for order protocol messages
is the maximum delay which nonfaulty processors can
possibly encounter during the entire system operation. This
means that $\delta$ must be estimated in the most demanding scenario,
where the maximal set of higher priority processes are
transmitting simultaneously. Thus, assigning a higher
priority to synchronization processes, while reducing $\delta$ for
synchronization messages, tends to increase $\delta$ for (time-
based) order protocol messages.
6.1 Conditions Favoring Timeout Approach
We identify some favorable conditions in which our
protocol works faster than the time-based one. We define
the tightness [22] as the maximum difference within which nonfaulty
processors receive a given input from the environment
and $d_a$ as the actual maximum transmission delay that
currently holds within the system. The tightness will be
small when inputs from the system
environment are received via a broadcast LAN as shown in
Fig. 1; when the system is lightly loaded, $d_a$ is much smaller
than $d$ (the worst-case estimate).
Suppose that all processors within the TMR system are
nonfaulty, and an input is first received by a processor at
real time $\tau$. For simplicity, we will ignore the effect of
nonzero $\rho$ for the durations considered here. By
$\tau + \mathit{tightness}$, all processors must have formed and broadcast
a message containing the input. By $\tau + \mathit{tightness} + 2d_a$, $P_i$ accepts
two double-signed messages $m'$ and $m''$, both containing the input,
such that $\mathit{path}(m') = P_j : P_k$ and $\mathit{path}(m'') = P_k : P_j$; without
loss of generality, let us assume $m'.TS \leq m''.TS$. After
$\tau + \mathit{tightness} + 2d_a + 2d$, $P_i$ will not accept
1. any double-signed $m$ with $m.TS \leq m'.TS$ and $\mathit{path}(m)
= P_j : P_k$, due to having accepted $m'$ before
$\tau + \mathit{tightness} + 2d_a$ and due to the entry D3 of Table 1, and
2. any double-signed $m$ with $m.TS \leq m''.TS$ and $\mathit{path}(m)
= P_k : P_j$, due to having accepted $m''$ before
$\tau + \mathit{tightness} + 2d_a$ and due to the entry E5 of Table 1.
Therefore $m'$, $m'.TS \leq m''.TS$, becomes stable at $P_i$ by
$\tau + \mathit{tightness} + 2d_a + 2d$, i.e., every $P_i$ orders the input contained
within $m'$ by $\tau + \mathit{tightness} + 2d_a + 2d$. When one of the processors
is faulty, a similar reasoning indicates that nonfaulty
processors will order the input by $\tau + \mathit{tightness} + d_a + 3d$.
With the time-based protocol, nonfaulty processors can
order the input at or before $\tau + 2(d + e)$, and the explanation is as
follows: Let $P_i$ first receive the input (from the environment)
at $\tau$. The message formed and broadcast by $P_i$ will be
ordered by $P_i$ at $\tau + 2(d + e)$ and by other processors at
some time in $[\tau + (2d + e),\ \tau + (2d + 3e)]$. Assuming the best
case (and in favor of the time-based approach), we will
regard that all other processors order at or before
$\tau + (2d + e)$. Hence, the maximum delay incurred for
ordering the input is $2(d + e)$. Our protocol is guaranteed to
order the inputs faster in the following situation:
1. Processors receive their inputs via a broadcast LAN
which facilitates nearly simultaneous input arrival at
all processors.
2. No special hardware support or external assistance
is available for synchronising clocks, leaving $e$
comparable to $d$, the worst-case communication
delay envisaged.
3. The system is lightly loaded and message processing
and queuing delays are small, leaving $d_a$ and the tightness very
small compared to $d$, such that $\mathit{tightness} + 2d_a < 2e$ when no
processor is faulty and $\mathit{tightness} + d_a + d < 2e$ when one
processor is faulty.
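The comparison behind these conditions can be sketched as follows, ignoring clock drift as in the text. The function name and the sample values are illustrative assumptions and are not measurements.

```python
def timeout_is_faster(tightness: float, d_a: float, d: float, e: float,
                      one_faulty: bool = False):
    """Compare the input ordering delays of the two approaches.

    Returns (faster, timeout_delay, time_based_delay), where faster is
    True when the timeout protocol orders the input strictly earlier.
    """
    if one_faulty:
        timeout_delay = tightness + d_a + 3 * d
    else:
        timeout_delay = tightness + 2 * d_a + 2 * d
    time_based_delay = 2 * (d + e)  # best case for the time-based protocol
    return timeout_delay < time_based_delay, timeout_delay, time_based_delay


# Illustrative values (ms): lightly loaded system with e comparable to d.
no_fault = timeout_is_faster(tightness=10, d_a=10, d=50, e=50)
one_fault = timeout_is_faster(tightness=10, d_a=10, d=50, e=50, one_faulty=True)
```

With these values the timeout delay is 130 ms without faults and 170 ms with one faulty processor, against 200 ms for the time-based approach. When $e$ is negligible compared to $d$, the comparison reverses.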
6.2 Implementation and Relative Performance
Estimation of d. Estimating d requires making an assump-
tion on the maximum number of clients which can
simultaneously send an input at a given time. We fixed
this to be 10. To estimate d, a provisional estimate dpro was
first made as follows: 10 clients were made to issue their
requests (to all three processors) at the same time. All
messages were assumed to be timely and authentic and,
therefore, there was no message verification delay. The
experiment ran until all clients had sent 1,000 requests of
64 bytes. The maximum observed delay became dpro which
did not include the delay component due to maintaining
path counters and performing timeliness checks on received
messages. In the next stage, the experiment was repeated,
using dpro as the value for d, for both time and timeout-
based protocols, but now the protocols ran according to
their complete description. The final value of d was chosen
to be 10 percent more than the maximum observed delay, to
account for inputs of size larger than 64 bytes.
Experimental Set-Up. Since the clocks were synchronized
frequently, $e$ was taken to be $d$ itself. We measured the
input ordering delay (IOD) as the interval from the instant
an input is first received by a nonfaulty processor to the
instant when at least two nonfaulty processors are known to
have ordered a message containing that input. To measure
the average IOD, we ran a set of experiments in which one
client sent a batch of 10 requests 100 times sequentially.
(This emulates the situation of 10 clients accessing the
system at different times.) Processors processed an ordered
input by echoing it back to the client with a sequence
number which should be identical for nonfaulty processors.
The experiments were carried out in two different
settings: the TMR processors were 1) T800 Inmos transpu-
ters [28] connected directly to each other by fast links (as
shown in Fig. 1), and 2) Pentium II 233 MHz PC’s with
64 MB of memory running the Linux 2.2.14 operating
system and connected by a 100 Mbits/sec fast ethernet.
Thus, we consider two architectures commonly used for
interprocessor communication: point-to-point and bus-
based. Further, the second implementation was done using
Java (on Linux)—in a multithreaded environment where
thread scheduling is a source of nondeterminism which
should not be allowed to affect the deterministic behavior of
replicas. We ensured this by implementing the order
process—which makes the ordered delivery of stable
messages—in a single thread.
Experimental Results. With Transputers, the values of $d$
and the tightness were 50 ms (milliseconds) and 13.15 ms, respectively
(and $e = d$). With all processors being nonfaulty, the
average values were: $d_a = 9.18$ ms; the IOD for the timeout
protocol, $IOD_{To} = 136.28$ ms; and that for the time-based
protocol, $IOD_{Time} = 202.83$ ms. With one processor crashed,
$d_a = 5.87$ ms, $IOD_{To} = 163.88$ ms, and $IOD_{Time} = 202.75$ ms.
Observe that when the crashed processor is not participating
in the protocol, $d_a$ reduces but little change occurs to
$IOD_{Time}$. The reduction in $d_a$ is due to the reduced number
of messages to be handled when one processor crashes:
when there is no failure, each processor has to handle, for
each client request, five messages (one directly from the
client and two sets of two equivalent messages with each set
originating from a given coprocessor); this figure drops to
two when one processor is crashed. Thus, a single crash
results in a 60 percent drop in the number of messages to be
handled by an operational processor. (See [29], [18] for more
experiments in the Transputer context.)
In the second, bus-based experimental set-up, the value
chosen for $d$ was 121 ms and the tightness was observed to be 11 ms.
Ninety-nine percent of the delays observed were in the
range [90, 105] ms when there were 10 clients, and in the
range [60, 75] ms for a single client. The frequently encountered
delays being smaller than the chosen $d$ helps the timeout
protocol to be faster than the time-based protocol, as shown
in Table 2.
Observe that $IOD_{To}$ increases only very slightly when a
processor is crashed. Let us use subscripts 0 and 1 to
differentiate the estimates made when no processor and one
processor is crashed, respectively. So, $IOD_{To,0} = \mathit{tightness} + 2d_{a,0} +
2d$ and $IOD_{To,1} = \mathit{tightness} + d_{a,1} + 3d$. Recall that when a processor
is crashed, the number of messages handled by a working
processor drops by 60 percent. In a bus-based system, this
also means a reduction in processors competing for bus access,
which often leads to a nonlinear reduction in message
transmission delays. Our measurements suggest that
$(2d_{a,0} - d_{a,1}) \approx d$. Finally, the small difference of $IOD_{Time}$
over the theoretical estimate (of $2(d + e) = 4d$) is attributed to
TABLE 2
Input Ordering Delays in a Bus-Based Environment
the cost of thread scheduling in Java for delivering the
ordered messages to the application process.
7 RELATED WORK AND CONCLUDING REMARKS
To our knowledge, the problem of message ordering in a
distributed context was first addressed by Lamport [21]. We
owe our use of message counters (MCs) in our protocol to his
paper. The protocol in [21] is not fault-tolerant and is for an
asynchronous context where any estimated bounds on
processing, scheduling, and communication delays can be
violated. For an equivalent problem of reaching agreement,
Pease et al. [1] provided synchronous time-based protocols
for the least restrictive fault models of unauthenticated and
authenticated Byzantine faults; they also showed that at least
four processors are needed to contain one faulty processor
when authentication is not employed. Among the works that
ensued in the synchronous and time-based context, the
following results are significant: protocols of [24], [25], [26]
made internal clock synchronisation possible even for
Byzantine fault models; even in the absence of faults, $e$ cannot
be guaranteed to be less than one half of the worst-case delays
expected for synchronisation messages [7]; reaching agreement
simultaneously by nonfaulty processors, necessary for
identical message ordering, requires at least ($f$+1) rounds of
message diffusion if $f$ is the maximum number of faults
expected [16]. Cristian et al. [20] proposed a suite of ordering
protocols for a range of fault models, the weakest being the
authenticated Byzantine.
In the domain of synchronous timeout approach, [9] and
AMp of [10] are worth mentioning. The former is an
asymmetric protocol and assumes, like us, an authenticated
Byzantine model within a TMR system; further, it assumes
every client to be a TMR system as well and solves the
problem of message ordering together with majority voting
of inputs. AMp was developed with commercial applica-
tions in mind, and provides the same message ordering
guarantees as our protocol in a general n-processor system
but assumes a benign fault model where processors either
crash or occasionally omit to produce responses. Our
assumption of authenticated Byzantine faults is weaker
and, as argued in [1], any further weakening of our fault
model makes the desired form of message ordering
impossible in a three-processor system.
In the asynchronous model, the processing, the schedul-
ing, and communication delays are only known to be finite,
but their (upper) bounds cannot be known with certainty.
Consequently, no deterministic message ordering protocol
can be guaranteed to terminate even if one processor can
crash [30]. This impossibility stems from the inherent
difficulty in determining whether a remote processor has
crashed or is only very slow. That is, since the asynchronous
model permits any prior estimates of bounds to be violated,
a fault-tolerant deterministic protocol cannot be guaranteed
to terminate. It can only guarantee correctness without
liveness: If nonfaulty processes order a given message,
they do so identically. This is in contrast to synchronous
protocols which require the bounds to be inviolable; if
violations undermine the fault hypothesis, a synchronous
protocol will terminate within the guaranteed time period,
but can cause nonfaulty processors to order messages
differently. Stating a precise set of requirements for
eventual termination, asynchronous protocols such as [31]
and [32] solve the ordering problem for non-Byzantine fault
models, and [17] for the authenticated Byzantine model.
These requirements generally warrant the violations of the
assumed bounds to be below a threshold for a sufficiently
long time.
In this paper, we have developed a synchronous
timeout-based ordering protocol for a TMR system. It borrows
its structure from its time-based counterpart (thereby
having the same message complexity) and replaces the
synchronized time base with Lamport’s logical clocks [21]
and unsynchronized physical clocks. Where it required
considerable design effort was in reducing the timeliness
bounds by carefully analysing the various paths through
which a processor can receive a message. This made the
ordering delays smaller and the protocol optimal among the
symmetric, timeout-based protocols. The reductions
achieved made use of the structure of the TMR system:
There are only three processors of which at most one can
fail. It is not clear whether the conservative bounds can be
sufficiently reduced in a general system of n processors
with at most f faults, to yield a smaller worst-case delay.
This remains an open problem.
We have identified the contexts in which our protocol
performs worse or better than the time-based protocol. When
the synchronization accuracy achieved is very small
compared to the worst-case message communication
delays, the time-based approach is twice as fast as ours.
Where these two are nearly equal, as would be the case
when clocks are synchronized with no assistance from
special hardware or an external time source, our protocol
offers a choice. We have identified analytically, and justified
experimentally, the conditions in which our protocol orders
inputs faster. Such conditions hold when the prevailing
message traffic, failures, and failure types within
the system are lighter, or less severe, than the worst-case
conditions envisaged for the system. For example, as our
experiments show, when the number of clients simultaneously
accessing the system is smaller than expected, the
actual communication delays are certainly less than the
estimated bound; similarly, no processor may fail for a
considerable part of the TMR operation, and even a failure
that does occur may be a crash rather than the supposed
Byzantine failure. Thus, in addition to deriving a timeout-based
protocol, the paper presents a practical alternative to the
time-based protocol when the achievable clock-synchronization
accuracy is comparable to the worst-case message
communication delay envisaged.
ACKNOWLEDGMENTS
This work has been supported in part by grants from
CNPq/Brazil. Thanks to the anonymous reviewers for their
constructive criticisms and suggestions.
REFERENCES
[1] M. Pease, R. Shostak, and L. Lamport, “Reaching Agreement in
the Presence of Faults,” J. ACM, vol. 27, no. 2, pp. 228-234, Apr.
1980.
[2] D. Powell, P. Verissimo, G. Bonn, F. Waeselynck, and D. Seaton,
“The Delta-4 Approach to Dependability in Open Distributed
Computing Systems,” Digest of Papers, FTCS-18, Tokyo, pp. 246-
251, June 1988.
[3] F.B. Schneider, “Implementing Fault Tolerant Services Using the
State Machine Approach: A Tutorial,” ACM Computing Surveys,
vol. 22, no. 4, pp. 299-319, Dec. 1990.
[4] L. Lamport, “Using Time Instead of Timeout for Fault-Tolerant
Distributed Systems,” ACM Trans. Programming Languages and
Systems, vol. 6, no. 2, pp. 254-280, Apr. 1984.
64 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 1, JANUARY 2004
[5] N. Vasanthavada and P.N. Marinos, “Synchronisation of Fault-
Tolerant Clocks in the Presence of Malicious Failures,” IEEE Trans.
Computers, vol. 37, no. 4, pp. 440-448, Apr. 1988.
[6] P. Verissimo, L. Rodrigues, and A. Casimiro, “Cesium Spray: A
Precise and Accurate Global Clock Service of Large Scale
Systems,” J. Real Time Systems, vol. 11, no. 3, 1997.
[7] D. Dolev, J. Halpern, and H.R. Strong, “On the Possibility and
Impossibility of Achieving Clock Synchronisation,” Proc. 16th
Ann. ACM STOC, pp. 504-511, Apr. 1984.
[8] F. Schmuck and F. Cristian, “Continuous Clock Amortization
Need Not Affect the Precision of a Clock Synchronisation
Algorithm,” Proc. Ninth ACM Symp. Principles of Distributed
Computing, pp. 133-141, Aug. 1990.
[9] K. Echtle, “Fault Masking and Sequence Agreement by a Voting
Protocol with Low Message Number,” Proc. Sixth Symp. Reliability
in Distributed Software and Database Systems, pp. 149-160, Mar. 1987.
[10] P. Verissimo, L. Rodrigues, and J. Rufino, “The Atomic Multicast
Protocol (AMp),” Delta-4: A Generic Architecture for Dependable
Distributed Computing, D. Powell, ed., pp. 267-294, ESPRIT
Research Papers, Springer-Verlag, 1991.
[11] P. Verissimo, “Causal Delivery Protocols in Real-Time Systems: A
Generic Model,” J. Real Time Systems, vol. 10, no. 1, pp. 45-73, 1996.
[12] R.L. Rivest, A. Shamir, and L. Adleman, “A Method for Obtaining
Digital Signatures and Public Key Cryptosystems,” Comm. ACM,
vol. 21, no. 2, pp. 120-126, Feb. 1978.
[13] P.D. Ezhilchelvan, “Early Stopping Algorithms for Distributed
Agreement under Fail-Stop, Omission, and Timing Fault Types,”
Proc. Sixth Symp. Reliability in Distributed Software and Database
Systems, pp. 201-212, Mar. 1987.
[14] D. Dolev, R. Reischuk, and H.R. Strong, “Early Stopping in
Byzantine Agreement,” J. ACM, vol. 37, no. 4, pp. 720-741, Oct.
1990.
[15] S.K. Shrivastava, P.D. Ezhilchelvan, N.A. Speirs, S. Tao, and A.
Tully, “Principle Features of the Voltan Family of Reliable System
Architectures for Distributed Systems,” IEEE Trans. Computers,
vol. 41, no. 5, pp. 542-549, May 1992.
[16] D. Dolev and H.R. Strong, “Requirements for Agreement in a
Distributed System,” Proc. Second Symp. Distributed Databases,
pp. 115-129, Sept. 1982.
[17] M. Castro and B. Liskov, “Practical Byzantine Fault Tolerance,”
Proc. Third ACM Symp. Operating Systems Design and Implementa-
tion (OSDI), pp. 173-186, Feb. 1999.
[18] F.V. Brasileiro, P.D. Ezhilchelvan, and N.A. Speirs, “TMR
Processing Without Explicit Clock Synchronisation,” Proc. 14th
Symp. Reliable Distributed Systems, pp. 186-195, Sept. 1995.
[19] P.D. Ezhilchelvan, F.V. Brasileiro, and N.A. Speirs, “Timeout
Based Message Ordering Protocols for a Lightweight, Software
Implementation of TMR Systems,” http://www.cs.ncl.ac.uk/
research/pubs/trs/papers/817.pdf, June 2002.
[20] F. Cristian, H. Aghili, R. Strong, and D. Dolev, “Atomic Broadcast:
From Simple Message Diffusion to Byzantine Agreement,” Digest
of Papers, FTCS-15, Ann Arbor, pp. 200-206, June 1985.
[21] L. Lamport, “Time, Clocks, and the Ordering of Events in a Distributed
System,” Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[22] P. Verissimo and M. Raynal, “Time in Distributed System Models
and Algorithms,” Advances in Distributed Systems, S. Krakowiak and
S.K. Shrivastava, eds., pp. 1-32, LNCS 1752, Springer-Verlag, 2000.
[23] H. Kopetz, A. Damm, C. Koza, M. Mulazzani, W. Schwabl, C.
Senft, and R. Zainlinger, “Distributed Fault-Tolerant Real-Time
Systems: The Mars Approach,” IEEE Micro, pp. 25-41, 1989.
[24] J.Y. Halpern et al., “Fault-Tolerant Clock Synchronisation,” Proc.
Third ACM Symp. Principles of Distributed Computing, pp. 89-102,
Aug. 1984.
[25] L. Lamport and P.M. Melliar-Smith, “Synchronising Clocks in the
Presence of Faults,” J. ACM, vol. 32, no. 1, pp. 52-78, Jan. 1985.
[26] T.K. Srikanth and S. Toueg, “Optimal Clock Synchronisation,”
Proc. Fourth ACM Symp. Principles of Distributed Computing, pp. 71-
86, Aug. 1985.
[27] H. Kopetz and W. Ochsenreiter, “Clock Synchronisation in
Distributed Real Time Systems,” IEEE Trans. Computers, vol. 36,
no. 8, pp. 933-940, 1987.
[28] INMOS Limited, Transputer Instruction Set, Prentice Hall Int’l
(UK) Ltd., ISBN 0-13-929100-8, 1988.
[29] N.A. Speirs, S. Tao, F.V. Brasileiro, P.D. Ezhilchelvan, and S.K.
Shrivastava, “The Design and Implementation of Voltan Fault-
Tolerant Systems for Distributed Systems,” Transputer Comm.,
vol. 1, no. 2, pp. 1-17, Nov. 1993.
[30] M.J. Fischer, N.A. Lynch, and M.S. Paterson, “Impossibility of
Distributed Consensus with One Faulty Process,” J. ACM, vol. 32,
no. 2, pp. 374-382, Apr. 1985.
[31] T.D. Chandra and S. Toueg, “Unreliable Failure Detectors for
Reliable Distributed Systems,” J. ACM, vol. 43, no. 2, pp. 225-267,
Mar. 1996.
[32] F. Cristian and C. Fetzer, “The Timed Asynchronous Distributed
System Model,” IEEE Trans. Parallel and Distributed Systems, vol. 10,
no. 6, pp. 642-657, June 1999.
Paul D. Ezhilchelvan received the Bachelor of
Engineering degree in 1981 from the University
of Madras, India, and the Master of Engineering
degree in 1983 from the Indian Institute of
Science, Bangalore. He received the PhD
degree in computer science in 1989 from the
University of Newcastle upon Tyne, United
Kingdom. He joined the School of Computing
Science at the University of Newcastle upon
Tyne in 1983, where he is currently a lecturer.
His main research interests are in the areas of fault-tolerance and
distributed computing. He has published several research papers in the
topics of distributed agreement protocols, replicated processing,
scalable and reliable multicast protocols, and group membership in
synchronous and asynchronous distributed systems. He is currently the
principal investigator of the UK-EPSRC funded PACE project and a
Work-Package Investigator in the EU-IST funded TAPAS project.
Francisco V. Brasileiro received the Bachelor’s
degree in computing science and the MSc
degree in informatics from the Federal University
of Paraíba, Brazil, in 1988 and 1989, respectively.
He received the PhD degree in computing
science in 1995 from the University of Newcastle
upon Tyne, England, for his work on fail-
controlled nodes and agreement protocols. In
1989, after a brief incursion in industry, he joined
the Department of Systems and Computing of
the Federal University of Paraíba (now, Federal University of Campina
Grande), where he is currently a senior lecturer. His main research
areas are in fault tolerance, distributed systems, and protocols.
Dr. Brasileiro is a member of the Brazilian Computing Society, the
ACM, and the IEEE Computer Society.
Neil A. Speirs received the first class honors
degree in mathematics from the University of
Newcastle upon Tyne in 1980, and the
doctorate degree in theoretical physics from
the University of Durham in 1985. For two
years, he worked for Sagesoft Ltd. writing
many commercial packages. For two years,
he worked for Mari Applied Microelectronics
Ltd. where he was project leader on the
Esprit Projects Concordia and Delta-4, both of
which were concerned with the design and implementation of fault-
tolerant distributed computer systems. Since 1987, he has been at
the University of Newcastle upon Tyne where he is currently a
senior lecturer in computing science. His main research interests are
in fault-tolerance, reliability, and distributed systems.
Appendix A. Timeliness Bound C3 
In Section 4.3.2, we claimed that the entry C3 in Table 1 can be 3d (instead of the 
conservative bound 4d). We here show this by first proving the following lemma. 
Lemma A1: Let t1 and t2 be the local clock times when a non-faulty Pi receives a single-
signed m1 and a double-signed m2, respectively; also, let m1.TS ≤ m2.TS and t1 > t2. If Pi
finds m1 timely then it must also find m2 timely. 
Proof: We will measure time according to Pi’s local clock and prove the lemma by 
contradiction. Suppose that Pi finds m1 timely and m2 late. Let m2 be late by y, y > 0, time 
units, i.e., if Pi had received m2 at any time before (t2 - y) then it would have found m2 timely. 
(See figure A1.) That m2 received at t2 was found late implies that there exists a message m’,
m’.TS ≥ m2.TS, which was accepted by Pi at (t2 - y - tb2), where tb2 is the timeliness bound 
indicated by the entry of table 1 whose row corresponds to the path(m’) and column to the 
path(m2). Let tb1 be the timeliness bound indicated by the entry of table 1 whose row 
corresponds to the path(m’) and column to the path(m1).
Since m1.TS ≤ m2.TS ≤ m’.TS, m1 must be received by Pi before (t2 - y - tb2 + tb1) for it 
to be considered timely. For any given path(m’), i.e. in any given row of Table 1, the 
timeliness bound for a single-signed message is smaller than that for a double-signed 
message. That is, tb1 < tb2. This means, (t2 - y - tb2 + tb1) < t2; by hypothesis, t2 < t1. So, 
t1 > (t2 - y - tb2 + tb1). This means that m1 is received by Pi after (t2 - y - tb2 + tb1) and 
cannot be found timely by Pi. This is a contradiction. □
Figure A1. m1.TS ≤ m2.TS ≤ m’.TS.
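For reference, the inequality chain used in the proof of Lemma A1 can be collected in one place. The symbol D is introduced here only as a shorthand (it does not appear in the proof) for the instant by which Pi must receive m1 in order to find it timely; all other symbols are those of the proof.

```latex
\begin{align*}
D &= t_2 - y - tb_2 + tb_1 \\
tb_1 < tb_2 \ \text{and}\ y > 0 &\;\Longrightarrow\; D < t_2 \\
t_2 < t_1 &\;\Longrightarrow\; D < t_1
\end{align*}
```

Since m1 arrives at t1 > D, Pi cannot find m1 timely, which contradicts the assumption that it does.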
Claim: The entry C3 of Table 1 is correct. 
Proof: By hypothesis, path(m’) = Pj, path(m) = Pj:Pk and m’.TS ≥ m.TS. Observe that Pk here 
is the non-faulty immediate sender of m and Pj can be faulty or non-faulty. (If Pj is non-faulty 
then m’.TS > m.TS.) Since Pi accepts m’ (by hypothesis), it will diffuse the message to Pk. Pk
also diffuses m to Pi. So, there are two cases to consider: Pk receives the single-signed m
from Pj either (i) before or (ii) after it receives the diffused m’ from Pi. The sub-case (i) is 
shown in figure A2(a). Pk can receive the diffused m’ from Pi at any time before t’ + d. Even 
if it diffuses m just before receiving m’, Pi can receive the diffused m just before t’ + d + d;
that is, (t - t’) < 2d.
Figure A2(b) illustrates the second sub-case where Pk receives the single-signed m from 
Pj after it has received the double-signed m’ from Pi. Since Pk diffuses m to Pi, it must have 
found the single-signed m it received as timely. By Lemma A1, Pk must find the double-
signed m’ timely. When it receives the single-signed m after having accepted the double-
signed m’, it is in the same situation as Pi in case E1 where path(m’) = Pj:Pk, path(m) = Pj
and the timeliness bound is shown to be d in Section 4.3.2. So, for the non-faulty Pk to find 
the single-signed m timely, it must have received m within d time after it received m’; 
that is, the time elapsed between receiving m’ and m must be less than d. From Figure A2(b), 
(t - t’) cannot be more than 3d. Choosing the largest of the bounds estimated for the two sub-
cases, 3d becomes the timeliness bound for the entry C3. 
Figure A2. (a) Pk receives m before m’. (b) Pk receives m after m’.
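The arithmetic behind the two sub-cases can be sketched as follows. This is only an illustration under the assumptions of the proof above (every message between non-faulty processors takes at most d, and entry E1 is d); the function name c3_bound and its decomposition into hops are hypothetical and not part of the protocol.

```python
from fractions import Fraction


def c3_bound(d):
    """Upper bound on (t - t') for entry C3, taken over the two sub-cases.

    d is the bound on message delay between non-faulty processors.
    """
    # Sub-case (i): Pk diffuses m just before the diffused m' reaches it.
    # m' reaches Pk by t' + d, and the diffused m then needs at most d
    # to reach Pi.
    case_i = d + d

    # Sub-case (ii): Pk receives the diffused m' first (by t' + d), then
    # receives the single-signed m within d of that (entry E1), and finally
    # diffuses m, which needs at most d to reach Pi.
    case_ii = d + d + d

    # The entry must cover the larger of the two sub-cases.
    return max(case_i, case_ii)
```

For any positive d the second sub-case dominates, which is why the entry is 3d rather than 2d, and still below the conservative 4d.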
Appendix B. Protocol Optimality
Theorem 5.2: Any symmetric ordering protocol that works only with unsynchronised 
clocks, will have executions in which the ordering delay can be 3δm + δm-.
Proof: By contradiction. Assume that there is such a protocol which guarantees that 
ordering delays are always smaller than 3δm + δm-. Consider two distinct executions of this 
protocol during real-time intervals ι1 and ι2 respectively. By hypothesis, non-faulty 
processors’ clocks remain unsynchronised throughout each interval. (Note: By this, we 
exclude a class of protocols which permit clock synchronisation messages to be piggybacked 
onto the order protocol messages and thus achieve clock synchronisation during the order 
protocol execution.) 
In the first execution (see figure B1(a)), Pi fails only by not sending its messages to Pj
and not receiving Pj’s messages. Pi sends mi at its clock time ti. Let mi take zero time to 
reach Pk. Suppose that Pk’s clock reads tk when Pk receives mi. (Since mi takes zero time, 
when Pk’s clock reads tk, Pi’s clock reads ti.) Let Pk accept and diffuse mi to Pj and the 
diffused message take δm time to be received by Pj. Just before Pj receives the diffused mi,
i.e. when Pk’s clock reads tk + δm-, suppose that Pj’s clock reads tj and that Pj forms and 
sends mj which takes δm time to be received by Pk. So, Pk’s clock reads (tk + δm- + δm)
when Pk receives mj. Note that Pj sends mj before it receives the diffused mi from Pk;
therefore, neither mi nor mj happened before [21] the other. Since Pk cannot deduce that mi
originated before mj in real-time, we will assume (without loss of generality) that mj is 
ordered before mi by all non-faulty processors. 
Figure B1. Execution Scenarios. (a) First execution. (b) Second execution. 
In the second execution (see figure B1(b)), Pj fails only by not sending its messages to Pi
and not receiving Pi’s messages. Pi sends mi at its clock time ti. mi takes δm time to reach Pk
and is not received by Pj. Suppose that Pk’s clock reads tk when Pk receives mi, i.e., when 
Pi’s clock reads ti + δm. (Note: this is possible with unsynchronised clocks whose readings 
can differ by an arbitrary amount.) Assume that the mi diffused by Pk takes δm time to be 
received by Pj.
When Pk’s clock reads tk + δm-, suppose that Pj’s clock reads tj and that Pj forms and 
sends mj only to Pk which takes δm time to be received. That is, Pk’s clock reads 
(tk + δm- + δm) when Pk receives mj. For Pk, this execution is indistinguishable from the 
first one. (Note that since the readings of unsynchronised clocks can 
differ by an arbitrary amount, we have chosen the difference to be a convenient amount that 
allows the following claim to hold: the two executions are indistinguishable for Pk even if the 
sender of a message m timestamps m with the local send time.) So Pk must order mj before 
mi. Since ordering delays are always smaller than 3δm + δm-, non-faulty Pi must order its 
own mi before ti + 3δm + δm-. Say Pk’s diffused mj takes δm time to be received by Pi. That 
is, Pi can receive mj (for the first time), only when its clock reads ti + 3δm + δm-. So, Pi can 
receive mj only at or after ti + 3δm + δm-. Hence Pi cannot order mj before ti + 3δm + δm-,
and therefore cannot order mj before mi. This violates the unanimity condition, which is a contradiction. □
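The timing of the second execution, measured on Pi’s clock, can be traced with a short sketch. It assumes, purely for illustration, that δm- is modelled as δm minus a small positive eps; the function and key names are hypothetical and do not appear in the protocol.

```python
from fractions import Fraction


def second_execution_timeline(t_i, delta_m, eps):
    """Trace the second execution of Theorem 5.2 on Pi's clock.

    t_i     : Pi's clock reading when it sends mi
    delta_m : the message delay used in the proof
    eps     : small positive amount, so that delta_m - eps stands in for δm-
    """
    delta_m_minus = delta_m - eps

    # mi takes delta_m to reach Pk.
    mi_at_pk = t_i + delta_m
    # Pj forms and sends mj just before the diffused mi would reach it.
    mj_sent = mi_at_pk + delta_m_minus
    # mj takes delta_m to reach Pk.
    mj_at_pk = mj_sent + delta_m
    # Pk's diffused mj takes delta_m to reach Pi.
    mj_at_pi = mj_at_pk + delta_m

    return {
        "mi_at_pk": mi_at_pk,
        "mj_sent": mj_sent,
        "mj_at_pk": mj_at_pk,
        "mj_at_pi": mj_at_pi,
    }
```

Under these assumptions the last entry equals t_i + 3·delta_m + (delta_m - eps), so Pi cannot learn of mj before the deadline by which it is assumed to have ordered mi.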
