Abstract. The cost of using message-passing to implement linearizable read/write objects for shared memory multiprocessors with drifting clocks is studied. We take as cost measures the response times for performing read and write operations in distributed implementations of virtual shared memory consisting of such objects. A collection of necessary conditions on these response times is presented for a large family of assumptions on the network delays. The assumptions include the common one of lower and upper bounds on delays, and bounds on the difference between delays in opposite directions. In addition, we consider broadcast networks, where each message sent from one node arrives at all other nodes at approximately the same time. The necessary conditions are stated in the form of "gaps" on the values that the response times may attain in any arbitrary execution of the system; the ends of the gap intervals depend solely on the delays in a particular execution, and on certain fixed parameters of the system that express each specific delay assumption. The proofs of these necessary conditions are comprehensive and modular; they consist of two major components. The first component is independent of any particular type of delay assumptions; it constructs a "counter-example" execution, which respects the delay assumptions only if it is not linearizable. The second component must be tailored for each specific delay assumption; it derives necessary conditions for any linearizable implementation by requiring that the "counter-example" execution does not respect the specific delay assumptions. Our results highlight inherent limitations on the best possible cost for each specific execution of a linearizable implementation.
Moreover, our results imply lower bounds on the worst possible such costs as well; interestingly, for the last two assumptions on message delays, these worst-case lower bounds are products of the drifting factor of the clocks and the delay uncertainty inherent for the specific assumption.
Introduction
Shared memory has become a convenient paradigm of interprocessor communication in contemporary computer systems. Perhaps this is so due to its combined features that, first, it facilitates a natural extension of sequential programming, and, second, it is more high-level than message-passing in terms of semantics. This convenience has favored the evolution of concurrent programming on top of shared memory for the solution of many diverse problems. Thus, supporting shared memory in distributed memory machines has become a major current objective.
Unfortunately, implementing shared memory in a distributed memory machine encounters many complications; these complications are due to the high degree of parallelism and the lack of synchronization between dispersed processors, which are both inherent in a distributed architecture. This necessitates the explicit and precise definition of the guarantees provided by shared memory implemented this way; such a definition is called a consistency condition. Linearizability is a basic consistency condition for concurrent objects of shared memory due to Herlihy and Wing [7]. Informally, linearizability requires that each operation, spanning over an interval of time from its invocation to its response, appears to take effect at some instant in this interval. The use of linearizable data abstractions simplifies both the specification and the proofs of multiple instruction/multiple data shared memory algorithms, and enhances compositionality.
In this work, we continue the study of the impact of timing assumptions on the cost of supporting linearizability in distributed systems; this study has been initiated by Attiya and Welch [2], and continued further by Mavronicolas and Roth [12], Chaudhuri et al. [3], Friedman [5], and Kosa [9]. We consider a distributed system that introduces non-negligible timing uncertainty in two significant ways: first, in the synchronization with respect to real time of each individual process, and, second, in the communication among different processes.
Following previous work [2, 3, 5, 9, 12], we consider a model consisting of a collection of application programs running concurrently and communicating through virtual shared memory, which consists of a collection of read/write objects. These programs are running in a distributed system consisting of a collection of processes located at the nodes of a communication network. The shared memory abstraction is implemented by a memory consistency system (MCS), which uses local memory at each process node. Each MCS process executes a protocol, which defines the actions it takes on operation requests by the application programs. Specifically, each application program may submit requests to access shared data to a corresponding MCS process; the MCS process responds to such a request, based, possibly, on information from messages it receives from other MCS processes. In doing so, the MCS must, throughout the network, provide linearizability with respect to the values returned to application programs.
We take as cost measures the response times for performing read and write operations on read/write objects in such a distributed system. However, a first major diversion from previous works [2, 3, 5, 9, 12] addressing these particular cost measures is that we show bounds on them that hold for each specific execution of the system, while bounds established in previous work on the same cost measures hold only for the worst execution. Recent research work in distributed computing theory has addressed bounds that hold for each specific execution in the context of the clock synchronization [1, 13] and connection management [8, 11] problems. A common argument in support of showing such "per-execution" bounds is that for certain kinds of assumptions on network delays, the costs for the worst-case execution may, in fact, have to be unbounded [1], while one may still want to reward algorithms that achieve costs that are the best possible for each specific instance [1].
A second major diversion from previous related work [2, 3, 5, 9, 12] is with respect to assumptions on message delays; all that work has considered the relatively simple case where there are lower and upper bounds on message delays. Under this assumption, linearizable implementations of shared memory objects have been designed [3, 5, 12], whose efficiency depends critically on the existence of tight lower and upper bounds on message delays. This assumption, however, may not always apply, since it is often the case that there do not exist tight lower and upper bounds on message delays, while there is some other relevant information about the delays. We draw excellent motivation from the work of Attiya et al. [1] on clock synchronization under different delay assumptions to study the problem of implementing linearizable read/write objects in message-passing under the following assumptions on message delays (considered in [1]): (1) There is a lower and an upper bound on delays, d − u and d, respectively. (2) There is a bound ε on the difference between delays in opposite directions; this assumption is supported by experimental results revealing that message delays in opposite directions of a bidirectional link usually come very close (cf. [1]). The clock synchronization problem has already been studied under this assumption [1]. (3) There is a bound β on the difference between the times when different processes receive a broadcast message; this assumption is useful for broadcast networks that are used in many local area networks. The clock synchronization problem has been studied under this assumption in [1, 6, 14].
A third major diversion from previous related work is with respect to the amount of synchronization of processes to real time. While that work [2, 3, 5, 9, 12] has assumed "perfect" (non-drifting, but possibly translated) clocks to be available to processes, we allow a small "drift" on the processes' clocks; the impact of this assumption on the time complexity of distributed algorithms has already been studied for the clock synchronization problem (see, e.g., [13]), and the connection management problem [8, 11].
The main contribution of our work is a systematic methodology for proving necessary conditions on the response times of read and write operations that hold for each specific execution of any linearizable implementation, under a variety of message delay assumptions, and allowing a small "drift" on the processes' clocks. This methodology yields a collection of corresponding necessary conditions. Our proof methodology is modular, and consists of two major components. The first component is independent of the specific type of delay assumptions, while the second one addresses each such type in a special way.
In more detail, the first component starts with a linearizable execution that is chosen in a different way for the write operation, the read operation, and their combination, respectively. In each of the three cases, we use the technique of "retiming," originally introduced by Lundelius and Lynch for showing lower bounds for the clock synchronization problem [10], to transform this execution into another possible execution of the system that is not linearizable. The transformation preserves, in the resulting execution, the view held by each process in the original execution; moreover, the clocks in the resulting execution are still drifting. Roughly speaking, retiming is used to change the timing and the ordering of events in an execution of the system, while precluding any particular process from "realizing" the change.
The second component is tailored for each specific assumption on message delays. More specifically, the starting point of the second component is the result of transforming the original execution, and the corresponding message delays in this result. For each specific assumption on message delays, we insist that the resulting delays conform to the assumption. This yields corresponding upper and lower bounds on the response time of the read and write operations, as a function of the message delays in the original linearizable execution.
Our lower and upper bounds highlight inherent limitations on the best possible cost for each specific execution of a linearizable implementation, as a function of the message delays in the execution, and the parameters associated with each specific assumption on message delays. Moreover, our results imply also Ω(ρ²ε) and Ω(ρ²β) worst-case lower bounds on response times for both write and read operations for the bias model and the model of broadcast networks, respectively. These lower bounds indicate that the timing uncertainty ρ² in the drifting clocks model must multiply the delay uncertainty (ε and β, respectively) for each of these models. We have not been able to deduce a corresponding fact for the model with lower and upper bounds on delays. (However, for the special case where ρ = 1, our general results imply worst-case results that are identical to those in [2, 12].) This model appears to be stronger than the previous two since it does not allow unbounded delays; we conjecture that linearizable implementations allowing for response times o(ρ²u) for both write and read operations are possible for this model.
Framework
For the system model, we follow [2, 12]. We consider a collection of application programs running concurrently and communicating through virtual shared memory, consisting of a collection 𝒳 of read/write objects, or objects for short. Each object X ∈ 𝒳 attains values from a domain, a set V of values. We assume a system consisting of a collection of nodes, connected via a communication network. The shared memory abstraction is implemented by a memory consistency system (MCS), consisting of a collection of MCS processes, one at each node, that use local memory, execute some local protocol, and communicate through sending messages along the network. Each MCS process p_i, located at node i, is associated with an application program P_i; p_i and P_i interact by using call and response events.
Call events at p_i represent initiation of operations by the application program P_i; they are Read_i(X) and Write_i(X, v), for all objects X ∈ 𝒳 and values v ∈ V. Response events represent responses by p_i to operations initiated by the application program P_i; they are Return_i(X, v) and Ack_i(X), for all objects X ∈ 𝒳 and values v ∈ V. Message-delivery events represent delivery of a message from any other MCS process to p_i. Message-send events represent sending of a message by p_i to any other MCS process.
For each i, there is a physical, real-time clock at node i, readable by MCS process p_i but not under its control, that may drift away from the rate of real time. Formally, a clock is a strictly increasing (hence, unbounded), piecewise continuous function of real time γ_i : ℜ → ℜ. For a constant ρ ≥ 1, a clock γ_i is ρ-drifting if, for all real times t_1 < t_2, 1/ρ ≤ (γ_i(t_2) − γ_i(t_1))/(t_2 − t_1) ≤ ρ.
Define ρ² to be the drifting factor of a ρ-drifting clock. The clocks cannot be modified by the processes. Processes do not have access to real time; instead, each process obtains information about time from its clock. The call, message-delivery and timer-expire events are called interrupt events. The response, message-send and timer-set events are called react events.
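As an illustration only, the following Python sketch checks the drift condition on finitely many samples of a clock. It assumes the usual reading of ρ-drifting, namely that the clock advances at a rate between 1/ρ and ρ relative to real time for some ρ ≥ 1. The names `linear_clock`, `is_rho_drifting`, and `drifting_factor` are hypothetical helpers introduced here and are not part of the model.

```python
from typing import Callable, Sequence


def linear_clock(rate: float, offset: float = 0.0) -> Callable[[float], float]:
    """Clock that runs at a constant rate: gamma(t) = rate * t + offset."""
    return lambda t: rate * t + offset


def is_rho_drifting(clock: Callable[[float], float],
                    samples: Sequence[float],
                    rho: float,
                    tol: float = 1e-9) -> bool:
    """Sampled check of 1/rho <= (gamma(t2) - gamma(t1)) / (t2 - t1) <= rho.

    Only consecutive sample points are compared; the slope between any two
    samples is a weighted average of consecutive slopes, so this suffices
    for the sampled points (it proves nothing between samples).
    """
    if rho < 1.0:
        raise ValueError("rho must be at least 1")
    ts = sorted(set(samples))
    for t1, t2 in zip(ts, ts[1:]):
        slope = (clock(t2) - clock(t1)) / (t2 - t1)
        if slope < 1.0 / rho - tol or slope > rho + tol:
            return False
    return True


def drifting_factor(rho: float) -> float:
    """Drifting factor of a rho-drifting clock, rho squared as in the text."""
    return rho * rho
```

A clock built with `linear_clock(1.05)` passes the check for ρ = 1.1, whereas `linear_clock(1.2)` does not.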
Each MCS process p_i is modeled as an automaton with a (possibly infinite) set of states, including an initial state, and a transition function. Each interrupt event at MCS process p_i causes an application of its transition function, resulting in a computation step. The transition function is a function from tuples of a state, a clock time and an interrupt event to tuples of a state and sets of react events. Thus, the transition function takes as input the current state, the local clock time, and an interrupt event, and returns a new state, a set of response events for the corresponding application program, a set of messages to be sent to other MCS processes, and a set of timer-set events. A history for an MCS process p_i with clock γ_i is a mapping h_i from ℜ (real time) to finite sequences of computation steps by p_i such that: (1) For each real time t, there is only a finite number of times t′ < t such that the corresponding sequence of steps h_i(t′) is non-empty; thus, the concatenation of all such sequences in real-time order is also a sequence, called the history sequence. (2) The old state in the first computation step in the history sequence is p_i's initial state. (3) The old state of each subsequent computation step is the new state of the previous computation step in the history sequence. (4) For each real time t, the clock time component of every computation step in the sequence h_i(t) is equal to γ_i(t). (5) For each real time t, there is at most one computation step whose interrupt event is a timer-expire event; this step is ordered last in the sequence h_i(t). (6) At most one call event is "pending" at a time; this outlaws pipelining or prefetching at the interface between p_i and P_i. (8) For each call event, there exists a matching response event in some subsequent computation step of the history sequence.
Each pair of matching call and response events forms an operation. The call event marks the start of the operation, while the response event marks its end. An operation op is invoked when the application program issues the appropriate call event for op; op terminates when the MCS process issues the appropriate response for op. For a given MCS, an execution σ is a set of histories, one for each MCS process, such that for any pair of MCS processes p_i and p_j, i ≠ j, there is a one-to-one correspondence between the messages sent by p_i to p_j and those delivered at p_j that were sent by p_i. Use this message correspondence to define the delay of any message in an execution to be the real time of delivery minus the real time of sending. By definition of execution, a zero lower bound and an infinite upper bound hold on delay. Define ∆_ij^(e) to be the set of delays of messages from MCS process p_i to MCS process p_j in execution e. Two executions are equivalent [10] if each process has the same history sequence and associated local clock times in both. Intuitively, equivalent executions are indistinguishable to the processes, and only an "outside observer" with access to real time can tell them apart.
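The definition of delay can be made concrete with a small Python sketch. It assumes, purely for illustration, that each message is recorded as a tuple of sender, receiver, real send time, and real delivery time; the function name `delay_sets` and this record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Message = Tuple[str, str, float, float]  # (sender, receiver, send_time, delivery_time)


def delay_sets(messages: Iterable[Message]) -> Dict[Tuple[str, str], List[float]]:
    """Group message delays by ordered (sender, receiver) pair.

    The delay of a message is its real delivery time minus its real send
    time; a negative value would contradict the zero lower bound on delays.
    """
    delta: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for sender, receiver, sent, delivered in messages:
        delay = delivered - sent
        if delay < 0:
            raise ValueError("a message cannot be delivered before it is sent")
        delta[(sender, receiver)].append(delay)
    return dict(delta)


if __name__ == "__main__":
    trace = [("i", "j", 0.0, 2.0), ("j", "i", 1.0, 4.5), ("i", "j", 3.0, 4.0)]
    print(delay_sets(trace))
```

The entry for the pair `("i", "j")` plays the role of the set of delays from p_i to p_j in the recorded execution.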
We continue with specific assumptions on the delays, borrowing from [1, 13] .
The assumption of bounds on the round trip delay bias [1, Section 5.2] requires that the difference between the delays of any pair of messages in opposite directions be bounded. Fix any constant ε > 0, called the delay uncertainty. Formally, an execution σ is admissible if for any pair of processes p_i and p_j, and for any pair of messages m and m′ received by p_i from p_j and received by p_j from p_i, respectively, |d_σ(m) − d_σ(m′)| ≤ ε.
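Read informally, the bias assumption says that any two messages travelling in opposite directions between the same pair of processes have delays that differ by at most ε. The Python sketch below tests that condition on recorded delays; the function name `bias_admissible` and the dictionary layout (ordered process pair mapped to a list of delays) are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def bias_admissible(delays: Dict[Tuple[str, str], List[float]], eps: float) -> bool:
    """Return True iff all opposite-direction delay pairs differ by at most eps.

    delays maps an ordered pair (i, j) to the delays of messages sent from
    process i to process j. Pairs with traffic in one direction only impose
    no constraint.
    """
    for (i, j), forward in delays.items():
        backward = delays.get((j, i), [])
        for d_forward in forward:
            for d_backward in backward:
                if abs(d_forward - d_backward) > eps:
                    return False
    return True
```

Note that the check places no bound on the delays themselves, only on the difference between the two directions, which is why worst-case costs can be unbounded under this assumption while per-execution bounds remain meaningful.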
The assumption of multicast networks has been studied in [1, 4, 6, 14] in the context of the clock synchronization problem; our presentation follows [1, Section 5.3]. To define this assumption, we replace message-send events by events of the form Broadcast_i(m) at the MCS process p_i, for all messages m; such events represent a broadcast of m to all MCS processes. The definition of an execution is modified so that for any pair of processes p_i and p_j, i ≠ j, there is a one-to-one correspondence between the messages broadcast by p_i, and those delivered at p_j and broadcast by p_i. Use this message correspondence to define the delay of message m to process p_j in execution σ, denoted d_σ(m, p_j), to be the real time of delivery at p_j in σ minus the real time of broadcast by p_i in σ. Fix any constant β > 0, called the broadcast accuracy. Execution σ is admissible if for any process p_i, for any message m broadcast by p_i, and for any pair of processes p_j and p_k, |d_σ(m, p_j) − d_σ(m, p_k)| ≤ β;
that is, m reaches p_j at most β time units later than it reaches p_k, and vice versa.
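The broadcast condition can likewise be checked mechanically. The Python sketch below assumes that, for each broadcast message, the delay to every receiver has been recorded; the function name `broadcast_admissible` and the nested-dictionary layout are hypothetical.

```python
from typing import Dict


def broadcast_admissible(deliveries: Dict[str, Dict[str, float]], beta: float) -> bool:
    """Return True iff each broadcast reaches all receivers within beta of each other.

    deliveries maps a message identifier to a dictionary from receiver to
    the delay of that message to that receiver. Comparing the largest and
    smallest delay is equivalent to comparing every pair of receivers.
    """
    for per_receiver in deliveries.values():
        delays = list(per_receiver.values())
        if delays and max(delays) - min(delays) > beta:
            return False
    return True
```

As with the bias model, the individual delays may be arbitrarily large; only their spread across receivers is constrained.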
Each object X has a serial specification [7] , which describes its behavior in the absence of concurrency and failures. Formally, it defines: (1) A set Op(X) of operations on X, which are ordered pairs of call and response events. Each operation op ∈ Op(X) has a value val (op) associated with it. (2) A set of legal operation sequences for X, which are the allowable sequences of operations on X.
For each process p_i, Op(X) contains a read operation [Read_i(X), Return_i(X, v)] on X and a write operation [Write_i(X, v), Ack_i(X)] on X, for all values v ∈ V; v is the value associated with each of these operations. The set of legal operation sequences for X contains all sequences of operations on X for which, for any read operation rop in the sequence, either val(rop) = ⊥ and there is no preceding write operation in the sequence, or val(wop) = val(rop) for the latest preceding write operation wop. A sequence of operations τ for a collection of processes and objects is legal if, for every object X ∈ 𝒳, the restriction of τ to operations on X, denoted τ | X, is in the set of legal operation sequences for X.
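The legality condition for read/write objects translates directly into a short checker. The following Python sketch is illustrative only: it represents ⊥ by `None` (and therefore assumes that `None` is never written), and the names `is_legal_for_object` and `is_legal` are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

BOTTOM = None  # stands for the initial value ⊥; assumed never to be written

ObjectOp = Tuple[str, Hashable]       # ("read" | "write", value)
Op = Tuple[str, str, Hashable]        # (object name, "read" | "write", value)


def is_legal_for_object(ops: List[ObjectOp]) -> bool:
    """Check a sequence of operations on one object against the serial specification.

    Every read must return the value of the latest preceding write, or
    BOTTOM if no write precedes it.
    """
    current = BOTTOM
    for kind, value in ops:
        if kind == "write":
            current = value
        elif kind == "read":
            if value != current:
                return False
        else:
            raise ValueError(f"unknown operation kind: {kind}")
    return True


def is_legal(ops: List[Op]) -> bool:
    """A sequence is legal iff its restriction to each object is legal."""
    per_object: Dict[str, List[ObjectOp]] = {}
    for obj, kind, value in ops:
        per_object.setdefault(obj, []).append((kind, value))
    return all(is_legal_for_object(seq) for seq in per_object.values())
```

For instance, a sequence in which two writes of 1 and then 2 to X are followed by a read of X returning 1 is rejected, which is exactly the situation the retiming constructions below aim to force.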
Given an execution σ, let ops(σ) be the sequence of call and response events appearing in σ in real-time order, breaking ties for each real time t as follows: First, order all response events whose matching call events occur before time t, using process identification numbers (ids) to break any remaining ties. Then, order all operations whose call and response events both occur at time t. Preserve the relative ordering of operations for each process, and break any remaining ties using again process ids. Finally, order all call events whose matching response events occur after time t, using process ids to break any remaining ties. An execution σ specifies a partial order
The efficiency of an implementation A of X is measured by the response time for any operation on an object X ∈ 𝒳. Given a particular MCS A and a read/write object X implemented by it, the time |op_A(X, σ)| taken by an operation op on X in an admissible execution σ of A is the maximum difference between the times at which the response and call events of op occur in σ, where the maximum is taken over all occurrences of op in σ. In particular, we denote by |R_A(X, σ)| and |W_A(X, σ)| the maximum time taken by a read and a write operation, respectively, on X in σ, where the maximum is taken over all occurrences of the corresponding operations in σ. Define |R_A(X)| (resp., |W_A(X)|) to be the maximum of |R_A(X, σ)| (resp., |W_A(X, σ)|) over all executions σ of A.
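As a small worked example of the cost measures, the Python sketch below computes the maximum read and write response times in one recorded execution. The record layout (operation kind, call time, response time) and the function name `max_response_times` are assumptions made for illustration; an execution without operations of some kind is reported with a response time of 0.0 for that kind.

```python
from typing import Dict, Iterable, Tuple

TimedOp = Tuple[str, float, float]  # ("read" | "write", call_time, response_time)


def max_response_times(operations: Iterable[TimedOp]) -> Dict[str, float]:
    """Maximum response minus call time, separately for reads and writes."""
    worst = {"read": 0.0, "write": 0.0}
    for kind, t_call, t_response in operations:
        if kind not in worst:
            raise ValueError(f"unknown operation kind: {kind}")
        if t_response < t_call:
            raise ValueError("a response cannot precede its call")
        worst[kind] = max(worst[kind], t_response - t_call)
    return worst
```

The per-execution bounds studied in this paper constrain these two quantities for each individual execution, rather than only their maximum over all executions.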
Fix e to be any execution, and let op = [Call(op), Response(op)] be any operation in e. We denote by t_c^(e)(op) and t_r^(e)(op) the (real) times at which Call(op) and Response(op), respectively, occur in e. We use val^(e)(op) to denote the value associated with the "execution" of operation op in e.
Writes
A construction of a non-linearizable, if admissible, execution is presented in Section 3.1; this execution is used in Section 3.2 for deriving necessary conditions for the write operation under specific assumptions on the delays. We refer to any linearizable implementation A of read/write objects, including an object X with at least two writers p_i and p_j, and a distinct reader p_k.
A Non-Linearizable, if Admissible, Execution
This construction is based on one in [2, Section 4] and [12, Section 5]. We start with an admissible execution e, in which p_i writes x_i to X, then p_j writes x_j to X, x_j ≠ x_i, and finally p_k reads x_j from X; moreover, we assume that all clocks in e run at a rate of σ for some constant σ such that 1/ρ ≤ σ ≤ ρ. If p_i's history is shifted later, while p_j's history is shifted earlier, each by an appropriate amount, while both are either "stretched" or "shrunk" by a factor of σ, depending on whether 1 ≤ σ ≤ ρ or 1/ρ ≤ σ ≤ 1, the result is an execution e′, not necessarily admissible, in which the write operation by p_j precedes the write operation by p_i, which, in turn, precedes the read operation by p_k. If, in addition, all clocks are correspondingly "stretched" or "shrunk" by the same factor of σ, all three processes still "see" the same events occurring at the same local time and cannot, therefore, distinguish between e and e′; thus, in particular, p_k still reads x_j from X, which implies that e′, if admissible, is not linearizable. We now present some details of the construction.
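The general idea of retiming can be sketched in a few lines of Python. This is not the exact construction of this section: it assumes a generic affine retiming in which an event at real time t in a process's history is moved to real time scale·t + shift while its local clock reading is kept, and the names `Event`, `retime`, `local_view`, and `retimed_delay` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    real_time: float   # when the step occurs for an outside observer
    clock_time: float  # what the process's own clock shows at that step
    label: str         # e.g. "Write(X, 1)", "send m", "Ack(X)"


def retime(history: List[Event], scale: float, shift: float) -> List[Event]:
    """Move every event to real time scale * t + shift, keeping clock readings.

    A positive scale preserves the order of the events within the history.
    """
    if scale <= 0:
        raise ValueError("scale must be positive to preserve event order")
    return [Event(scale * e.real_time + shift, e.clock_time, e.label) for e in history]


def local_view(history: List[Event]) -> List[Tuple[float, str]]:
    """What the process itself can observe: clock readings and events only."""
    return [(e.clock_time, e.label) for e in history]


def retimed_delay(send_time: float, delivery_time: float, scale: float,
                  shift_sender: float, shift_receiver: float) -> float:
    """Delay of a message after sender and receiver histories are retimed.

    Equals scale * (original delay) + shift_receiver - shift_sender; a
    negative result means the retimed histories no longer form an execution.
    """
    new_send = scale * send_time + shift_sender
    new_delivery = scale * delivery_time + shift_receiver
    return new_delivery - new_send
```

Because `local_view` is unchanged by `retime`, no process can detect the transformation; only the message delays change, which is why each delay assumption must be checked separately against the retimed execution.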
By the serial specification of X, there exists an admissible execution e of A consisting of the following operations at processes p_i, p_j, and p_k: p_i performs a write operation wop_i on X with t
Since A is a linearizable implementation and e is an admissible execution, e is a linearizable execution. Thus, there exists a legal linearization τ of e such that for each MCS process p, ops(e) | p = τ | p. We use the construction of e to show simple properties of the sequence τ, namely that wop_i →_τ wop_j, and that wop_j →_τ rop_k.
Since τ is a legal operation sequence, these properties imply that val^(e)(rop_k) = val^(e)(wop_j) = x_j. We now "perturb" the (admissible) execution e in order to obtain another execution e′, which is not necessarily admissible; however, we shall show that if e′ is admissible, then it is not linearizable. We construct e′ as follows. l (t)) in e′; in addition, h′_l preserves the ordering of steps in h_l. (3) e′ preserves the correspondence between message-delivery and message-send events in e.
Since e is an execution of A, for each MCS process p_l, h_l is a history for p_l with clock γ_l^(e). By rule (2), this implies that h′_l is a history for p_l with clock γ_l^(e′); moreover, for any real times t_1, t_2 ∈ ℜ with t_1 < t_2, 1/ρ ≤ (γ_l^(e′)(t_2) − γ_l^(e′)(t_1))/(t_2 − t_1) ≤ ρ, so that γ_l^(e′) is ρ-drifting; thus, by rule (3), it follows that e′ is an execution of A. In addition, rule (2) immediately implies that executions e and e′ are equivalent. We continue to establish a fundamental property of the execution e′.
Lemma 1.
Assume that e′ is an admissible execution. Then, e′ is not linearizable.
Proof. We give a sketch of the proof. Since A is a linearizable implementation and e is an admissible execution of A, e is a linearizable execution. Thus, there exists a legal linearization τ of e such that for each MCS process p, 
Results for Specific Models of Delays
Our methodology is as follows. We first calculate message delays in execution e′ (independent of specific delay assumptions). Next, we consider separately each specific assumption on delays; requiring that message delays in the execution e′ constructed in Section 3.1 satisfy the assumption yields the admissibility of e′, which, by Lemma 1, implies the non-linearizability of e′. For the model with lower and upper bounds on delays, we show:
kj , such that either 
The proof of Theorem 3 is similar to the proof of Theorem 1, and it is omitted. Theorem 3 demonstrates the existence of executions with "gaps" on the response times of write operations. In order to derive a worst-case lower bound on the response time for write operations from Theorem 3, we set σ = 1/ρ, δ_ij − δ_ik = β, δ_jk − δ_ji = β, and δ_kj − δ_ki = β. With these choices, the upper limit on |W_A(X, e)| becomes negative, and, therefore, it cannot be met, which implies that the lower limit on |W_A(X, e)|, which is positive for these choices, must be met. We obtain:
Reads
A construction of a non-linearizable, if admissible, execution is presented in Section 4.1; this execution is used in Section 4.2 for deriving necessary conditions for the read operation under specific assumptions on the delays. We refer to any linearizable implementation A of read/write objects including an object X with at least two readers p_i and p_j, and a distinct writer p_k.
A Non-Linearizable, if Admissible, Execution
This construction is based on one in [2, Section 4] and [12, Section 5]. We start with an admissible execution e, in which p_i reads ⊥ from X, then p_j and p_i alternate reading from X while p_k is writing x to X, and finally p_j reads x from X; moreover, we assume that all clocks in e run at a rate of σ, for some constant σ such that 1/ρ ≤ σ ≤ ρ.
Theorem 6 demonstrates the existence of executions with "gaps" on the response times of read operations. In order to derive a worst-case lower bound on the response time for read operations from Theorem 6, we set σ = 1/ρ, δ_ij − δ_ik = β, δ_jk − δ_ji = β, and δ_kj − δ_ki = β. With these choices, the upper limit on |R_A(X, e)| becomes negative, and, therefore, it cannot be met, which implies that the lower limit on |R_A(X, e)|, which is positive for these choices, must be met. We obtain:
