[13] M. Herlihy and N. Shavit. The asynchronous computability theorem for t-resilient tasks.
In Proceedings of the 25th ACM Symposium on Theory of Computing, pages 111-120, 1993.
A simple application of Lemma 4.9 implies that the longest message requires O(n^4) bits. However, we can note that, in response to the writing processor, the intermediates send only the passed vector of its entry (O(n^3) bits). Similarly, the intermediates send only a vector of n timestamps (again, O(n^3) bits) to the reading processor. This implies the tighter bound.
A simple application of Lemma 4.9 implies that the local memory size is O(n^5) at the writer, since it is the writing processor for O(n) registers, and O(n^4) at each reader, since it is the writing processor for O(1) registers. In addition, each processor holds O(n^4) local memory bits as an intermediate for O(n) registers.
Discussion
We have presented a simulation of an atomic, single-writer multi-reader register in message-passing systems, in the presence of processor failures. Each operation (read or write) to the register requires O(n) messages and constant time. The size of the messages and the local memory (in bits) is polynomial in n.
The simulation of a shared register requires Ω(n) messages per simulated operation, read or write. If a write operation sends fewer than n/2 messages, then at most n/2 processors store the latest value written. Since these processors may later fail, by stopping, later read operations will not be able to return the latest value written. A similar argument can be used to prove a lower bound on the read operation: If a read operation sends fewer than n/2 messages, then at most n/2 processors respond. Since it is possible that these processors did not get the writer's latest value, the read may return an old value.
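The counting behind the write half of this argument can be checked mechanically. The following is a minimal sketch, not part of the simulation itself: the function names are hypothetical, `total` stands for the n + 1 processors of the system, and the crash model (any set of processors may stop as long as a majority stays alive) is an assumption taken from the failure model of this paper.

```python
def majority(total: int) -> int:
    """Smallest number of processors that forms a strict majority of `total`."""
    return total // 2 + 1


def max_crashes(total: int) -> int:
    """Largest number of crashes that still leaves a majority alive."""
    return total - majority(total)


def write_can_be_lost(total: int, holders: int) -> bool:
    """True if every processor holding the latest value may crash.

    `holders` is the number of processors that stored the value of the
    last write. If all of them can stop while a majority survives, a
    later read has no way to return that value.
    """
    return holders <= max_crashes(total)


if __name__ == "__main__":
    total = 9  # n + 1 processors with n = 8
    for holders in range(1, total + 1):
        print(holders, write_can_be_lost(total, holders))
```

With nine processors, a value held by four or fewer processors can disappear, while a value held by five cannot; this is why a write has to reach a linear number of processors.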
Clearly, the time complexity of our simulation cannot be improved. However, there is room for improvement in the size (in bits) of the messages and the local memory. One possibility to improve the efficiency using the current scheme is to reduce the size of the bounded timestamps used by the simulation by reducing the number of outstanding timestamps. Another possibility is to employ the timestamp system of Dwork and Waarts [10], where the number of bits required for each timestamp is logarithmic in the number of outstanding timestamps. However, this timestamp system uses a register where the ordering of generated timestamps is announced, and such a register is exactly what we are trying to simulate.
The read procedure is significantly more complicated. The reader reads the first copy (Line 1), and then checks whether the writer is overlapping it by reading the handshake registers (Line 2). If the writer is not overlapping, then the reader unsets the overlap flag (Line 4) and equates the handshake register (Line 5). Then the reader checks the current value of the pointer (Line 6). If the pointer is 0, then the write of the first copy has completed, and the reader can safely return the value read from it (Line 7). If the pointer is 2, then the writer is between the first copy and the second copy; in this case, the reader sets its overlap flag and returns the value it read from the first copy. If the pointer is 1, then it is not clear whether the writer has already written to the first copy, so the reader checks whether other readers have overlapped the write. This is done by reading the reading, overlap and writing arrays (Lines 12, 13 and 14, respectively). If, for some other reader p_j, the handshake bits are equal and the overlap flag is set, then the reader sets its overlap flag (Line 16) and returns the value it read from the first copy (Line 17); otherwise, it returns the value of the second copy.
The partial correctness proof of the algorithm, that is, the proof that any execution of the algorithm is linearizable, is exactly the same as the corresponding proof of [12]. The proof is by case analysis, depending on the index of the main copy whose value is returned by r_1 and r_2. That is, we check the four possible combinations of r_1 returning the value from main[1] or main[2], and r_2 returning the value from main[1] or main[2].
By Lemma 4.6, maj-read and maj-write simulate a single-writer multi-reader regular register; the only difference between Algorithm 3 and the algorithm of [12] is that some arrays of regular registers are read or written together. The proof of [12] does not rely on the specific order in which these entries are read or written, and hence, it also applies to Algorithm 3.
The complexities of the simulation are dominated by the complexities of maj-read and maj-write, as shown in the proof of the following theorem.
The High-Level Algorithm
Our high-level algorithm has exactly the same structure as the algorithm of Haldar and Vidyasankar [12], which uses several regular registers. Each regular register is written by a single processor, and read by all other processors. In our algorithm, invocations of maj-read and maj-write simply replace reading and writing the regular registers. The only real modification is that arrays of registers, with the same writing processor, are written together to save on the communication overhead.
There are two copies of the value, main[1] and main[2], as well as a ternary register, pointer, indicating which version should currently be read. The writing processor of these registers is the writer.
Two arrays of bits are used for the handshake between the writer and the readers: writing[1..n], written by the writer, and reading[1..n], in which the ith entry is written by reader p_i. There is an additional array of bits, overlap[1..n], in which the ith entry is written by reader p_i, and which all readers can read.
In addition, several local variables are used to store copies of the shared register; their usage should be clear from the code.
The code appears in Algorithm 3. Note that the array writing is written in a single maj-write operation, and each of the arrays writing, reading and overlap is read in a single maj-read operation. All entries of the arrays writing, reading and overlap are initially 0; main[1] and main[2] initially contain the desired initial value of the atomic register; pointer is initially 0.
In the write procedure, the writer signals it is about to write the first copy, by setting pointer to be 1 (Line 1), and then writes (Line 2). Then it sets the pointer to the intermediate value (Line 3), and afterwards signals it is about to write the second copy, by setting pointer to be 0 (Line 4), and then writes (Line 5). Finally, the writer negates all handshake registers with the readers (Lines 6-8).
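The steps of the write procedure can be sketched as follows. This is an illustration that follows the prose line by line, not a transcription of Algorithm 3. Every shared variable is modeled as a plain Python field, whereas in the simulation each access would be a maj-read or maj-write. The class name `SharedState` is hypothetical, the intermediate pointer value is taken to be 2, and reading "negates" as "make writing[i] differ from reading[i]" is an assumption.

```python
from typing import Any


class SharedState:
    """Shared variables of the high-level algorithm, as plain fields."""

    def __init__(self, n: int, initial: Any = None) -> None:
        self.n = n
        self.main = {1: initial, 2: initial}  # the two copies of the value
        self.pointer = 0                      # ternary register: 0, 1 or 2
        # Index 0 is unused so that entries 1..n match the text.
        self.writing = [0] * (n + 1)
        self.reading = [0] * (n + 1)
        self.overlap = [0] * (n + 1)


def write(state: SharedState, value: Any) -> None:
    """Writer's procedure, following the order of steps in the prose."""
    state.pointer = 1        # about to write the first copy
    state.main[1] = value
    state.pointer = 2        # intermediate value: between the two copies
    state.pointer = 0        # about to write the second copy
    state.main[2] = value
    # Assumed meaning of "negate": read the reading array once, then
    # write the whole writing array once so every handshake bit differs.
    snapshot = list(state.reading)
    state.writing = [1 - bit for bit in snapshot]


if __name__ == "__main__":
    s = SharedState(n=3, initial=0)
    write(s, 42)
    print(s.main, s.pointer, s.writing[1:], s.reading[1:])
```

Reading and writing each handshake array in one step mirrors the remark that arrays with the same writing processor are handled by a single maj-read or maj-write.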
Proof: Assume mr is a maj-read operation of p_i which returns a value with timestamp x.
Clearly, mr cannot return the value of a maj-write operation which starts after it completes.
The size of the messages and the local memory depends on the number of timestamps maintained in the writing processor and the size of the bounded timestamps. By the structure of the bounded timestamp system, the size of the bounded timestamps depends on the
If status_k[j] equals notsent, then exactly one message is in transit from p_k to q_j or vice versa. Furthermore, if this is a ⟨value⟩ message from p_k to q_j, then the data in the message is for a maj-write prior to the most recent one.
Then #acks correctly counts the number of acks received for the current maj-write, and each one indicates that the sending intermediate has received the timestamped value sent by the current maj-write.
By definition, the current value of last-ts_j is viable. By Lemma 4.3, ts is larger (in the timestamp order) than the current value of last-ts_j. Thus, q_j assigns ts to last-ts_j.
Then #acks_i correctly counts the number of messages received for the current maj-read, and each one contains the contents of the sending intermediate's last-ts and last-val variables at some point since mr began.
The crux of the next lemma deals with the ordering of the timestamps of the value returned by a maj-read operation and of the values written by maj-write operations which completely precede it.
Lemma 4.6 A maj-read operation mr returns either the value written by the most recent preceding maj-write operation (the initial value if there is no such operation) or a value written by a maj-write operation that overlaps mr.
The following lemma proves that the variable pending at the writing processor keeps track of the viable timestamps. The parameter passed to the LABEL procedure is the value of the variable pending at the writing processor. By Lemma 4.2, it contains all the viable timestamps. Since the LABEL procedure returns a timestamp which is larger (according to the timestamp order) than all the timestamps in the parameter passed to it [16], we have the following lemma:
Lemma 4.3 The timestamp generated in the call to the LABEL procedure is greater than any viable timestamp in the system.
The ping-pong mechanism guarantees that the responses received were indeed sent in reply to the message sent in the maj-read or maj-write procedures. As we prove in the next two lemmas, at least ⌊(n+1)/2⌋ + 1 processors either store the recent value for the register (in maj-write) or respond with their current value for the register (in maj-read).
The correctness proof shows that the low-level communication algorithm guarantees that the pending variable (at the writing processor) contains all the timestamps that may be in use in the system. This implies that the call to LABEL generates a timestamp which is larger than all outstanding timestamps. The correctness proof uses this fact to prove that if a maj-write operation mw completes before a maj-read operation mr starts, then mr returns a value which was written by mw or by a later maj-write operation.
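The reason a completed maj-write is visible to a later maj-read is that any two majorities of the n + 1 processors share at least one member. The brute-force check below is a small illustration of that fact for small systems; the function names are hypothetical and `total` stands for n + 1.

```python
from itertools import combinations


def majority(total: int) -> int:
    """Majority threshold for `total` processors: floor(total / 2) + 1."""
    return total // 2 + 1


def all_majorities_intersect(total: int) -> bool:
    """Check by enumeration that any two majority-sized sets share a member."""
    size = majority(total)
    quorums = [set(q) for q in combinations(range(total), size)]
    return all(a & b for a in quorums for b in quorums)


if __name__ == "__main__":
    for total in range(1, 8):
        print(total, majority(total), all_majorities_intersect(total))
```

The enumeration grows quickly with `total`, so it is meant only as a sanity check; the general fact follows because two sets of more than half the processors cannot be disjoint.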
The next lemma proves that the pending variable at the writing processor keeps track of values forwarded by intermediates to the reading processors.
A timestamp x is viable after some finite execution prefix if one of the following conditions
In addition to storing the most recent value for the register, an intermediate also keeps track of the latest timestamps it has passed to the reading processors. The writing processor for the register keeps track of the latest two timestamps it has sent to each intermediate. These sets allow the processor writing the register to know which of the timestamps it has generated so far are still outstanding (as we prove in Lemma 4.2).
In more detail, the algorithm uses the following variables.
received from the intermediates is the most recent value "written", the values are accompanied by a timestamp.
It can be seen (as proved below, in Lemma 4.6) that such a "read" operation returns a value which was recently "written". There is no atomicity guarantee for concurrent reads of several processors.
The timestamps are taken from a bounded sequential time-stamp system [16], which is a finite domain of timestamps, L, together with a total order relation on L. Whenever the writer needs a new timestamp it produces a new one, larger (with respect to the order) than all the timestamps that "exist" in the system. This is done by invoking a special procedure called LABEL, whose output is a new timestamp that is greater than all the timestamps in the set given as input to the procedure. In addition, it is assumed that we can compare two timestamps according to the order. There is a distinguished value ⊥, which is smaller (in the order) than any possible timestamp, and is never generated by LABEL.
The bounded sequential time-stamp system requires that the set of timestamps that exist in the system is contained in the set of timestamps given as input to the LABEL procedure. To achieve this, we have to be very careful how these timestamps are passed around, as discussed above. In our simulation, a timestamped value is sent only from the writing processor to the reading processors, through a majority of the processors. Reading processors handle the timestamp in an "un-interpreted" manner, and do not forward it.
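The interface that the simulation expects from the timestamp system can be illustrated with a deliberately simplified stand-in. The sketch below uses unbounded integers, so it is not the bounded system of [16]; it only shows the contract of LABEL and of the comparison. The names `label`, `precedes` and `BOTTOM` are hypothetical, and modeling the distinguished smallest value as `None` is an assumption.

```python
from typing import Iterable, Optional

# Stand-in for the distinguished value that is smaller than every timestamp.
BOTTOM = None
Timestamp = Optional[int]


def precedes(a: Timestamp, b: Timestamp) -> bool:
    """True if a is strictly smaller than b in the timestamp order."""
    if a is BOTTOM:
        return b is not BOTTOM
    if b is BOTTOM:
        return False
    return a < b


def label(existing: Iterable[Timestamp]) -> int:
    """Return a timestamp larger than every timestamp in `existing`.

    Unbounded stand-in: a bounded system would draw from a finite domain
    and rely on `existing` containing all timestamps still in use.
    """
    real = [t for t in existing if t is not BOTTOM]
    return max(real) + 1 if real else 0


if __name__ == "__main__":
    first = label([BOTTOM])
    second = label([BOTTOM, first])
    print(first, second, precedes(first, second))
```

In the bounded setting the correctness of `label` depends on its input covering every timestamp that still exists in the system, which is why the simulation restricts how timestamps are passed around.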
The Low-Level Communication Algorithm
In this section, we explain the low-level communication algorithm used to exchange information: the procedure maj-write, used to store a value at a majority of the processors, and the procedure maj-read, used to get a value from a majority of the processors.
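The effect of the two procedures can be sketched with an in-memory model. This is a minimal sketch under several assumptions, not the paper's algorithm: messages, the ping-pong mechanism and the pending set are omitted, integer timestamps replace bounded ones, and the `responders` argument (a hypothetical name) lists which intermediates happen to answer. The field names `last_ts` and `last_val` follow the intermediates' variables last-ts and last-val.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Sequence


@dataclass
class Intermediate:
    """Stores the most recent timestamped value it has received."""
    last_ts: int = -1      # -1 stands in for "only the initial value"
    last_val: Any = None

    def store(self, ts: int, val: Any) -> None:
        if ts > self.last_ts:  # keep only the newest value
            self.last_ts, self.last_val = ts, val


def majority(total: int) -> int:
    return total // 2 + 1


def maj_write(nodes: Sequence[Intermediate], responders: Iterable[int],
              ts: int, val: Any) -> None:
    """Store (ts, val) at the responding intermediates; needs a majority."""
    chosen = set(responders)
    if len(chosen) < majority(len(nodes)):
        raise RuntimeError("no majority responded; the operation would block")
    for i in chosen:
        nodes[i].store(ts, val)


def maj_read(nodes: Sequence[Intermediate], responders: Iterable[int]) -> Any:
    """Collect values from a majority and return the one with the largest timestamp."""
    chosen = set(responders)
    if len(chosen) < majority(len(nodes)):
        raise RuntimeError("no majority responded; the operation would block")
    newest = max((nodes[i] for i in chosen), key=lambda node: node.last_ts)
    return newest.last_val


if __name__ == "__main__":
    nodes = [Intermediate() for _ in range(5)]
    maj_write(nodes, [0, 1, 2], ts=0, val="a")
    print(maj_read(nodes, [2, 3, 4]))
```

In the usage example the write reaches processors 0, 1 and 2 and the read hears from 2, 3 and 4; processor 2 lies in both majorities and supplies the written value.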
The high-level algorithm requires several regular registers, but our description in this section concentrates on a single register, var, which is sometimes left implicit. When several regular registers are written by the same processor, the communication can be combined to save messages, as described in Section 5.
The regular register is written by a single processor, and read by all processors. The writing processor of the regular register is not necessarily the writer of the atomic register simulated by the high-level algorithm, and the reading processors of the regular register are not necessarily the readers of the atomic register simulated by the high-level algorithm.
In our simulation, processors play dual roles: First, a processor acts as a reading or writing processor; second, it acts as an intermediate, storing and echoing values for the reading and writing processors. By running both tasks in parallel, a processor can play each role independently without passing information between the tasks. The exposition in the rest of this
Message size: The size of the messages, in bits.
Time complexity: The time to execute a write or a read operation, under the assumption that any message is either received within one time unit, or never at all (cf. [5]).
Local memory size: The amount of the local memory used by a processor, in bits.
Overview of the Simulation
The simulation of Attiya, Bar-Noy and Dolev [2] is based on the construction of Israeli and Li [16]. In this construction, the writer writes a copy of the new value for each reader; timestamps are added to the values to distinguish the most recent value. This provides the regularity property between reads and writes. To guarantee the atomicity property between reads, the readers also forward to each other the most recent value they have read (with its timestamp).
To bound the memory requirements and the size of messages, the timestamps are bounded using the method of Israeli and Li [16]; this method requires the writer to keep track of the timestamps currently in use. Keeping track of the copies of the writer's timestamps which are forwarded between readers causes the quadratic message complexity (see the analysis in [2]).
In our simulation, we side-step the cost of keeping track of the timestamps forwarded around by readers, by avoiding the forwarding altogether. We use the same idea as in [2] to provide the regularity property between reads and writes, and thus, provide a single-writer multi-reader regular register (cf. [17]). To obtain the atomicity property, our simulation follows Haldar and Vidyasankar [12].
The simulation of Haldar and Vidyasankar [12] uses several single-writer multi-reader regular registers to simulate one single-writer multi-reader atomic register. Atomicity of the reads is guaranteed since reads do not return values of writes that overlap them. Instead, when a reader detects that a write operation is in progress, it copies the value of a previous write operation, set aside for it by the writer. Since values are obtained directly from the writer, and not from other readers, it is simpler for the writer to keep track of the values in use.
In the rest of this section, we explain how processors exchange information, by passing messages, to simulate regular registers. The next section details this low-level communication algorithm, while Section 5 describes how our high-level algorithm adapts the simulation of [12].
In a message-passing system, unlike the shared memory model, there is no way for one processor to make sure that another processor will see a message sent to it. Furthermore, a sender cannot wait to receive an acknowledgment from the receiver, since the receiver may fail. Therefore, instead of sending messages directly to each other, processors have to communicate via intermediates. The idea, used in several previous papers [2, 3, 6], is to "write" a value by making sure that at least a majority of the processors store it, and to "read" a value by requesting it from a majority of processors. To allow a processor to know which of the values
Single-Writer Multi-Reader Atomic Registers
A single-writer multi-reader register is an abstract data structure which is accessed by two operations: write(v), which is executed only by a specific processor p_0, the writer, and read_i(v), which can be executed by any reader processor p_i, 1 ≤ i ≤ n.
A single-writer multi-reader register is atomic if each read operation returns the value of the most recent write operation before it (or the initial value if there is no such write).
The Simulation Problem
A simulation of a single-writer multi-reader register supplies two procedures written in the low-level computation model, for read and for write. In our case, the low-level computation model is the message-passing model.
An invocation of a read or write operation translates into a sequence of low-level computation steps; we also associate computation steps with the invocation and the response. A sequence of invocations of the read and write procedures generates an execution in which the low-level computation steps corresponding to different invocations are interleaved. An operation op_1 precedes an operation op_2 in an execution, if the response of op_1 is generated before op_2 is invoked. Two operations overlap if neither of them precedes the other.
Each interleaved execution generated by the simulation is required to be linearizable [15]; that is, it must be equivalent to an execution in which the operations are executed sequentially, and the order of non-overlapping operations is preserved. In the specific case of read and write procedures, this condition translates into the following properties:
1. A read operation r returns either the value written by the most recent preceding write operation (the initial value if there is no such write) or a value written by a write operation that overlaps r. (This is the regularity property.)
2. If a read operation r_1 reads a value from a write operation w_1, and a read operation r_2 reads a value from a write operation w_2, and r_1 precedes r_2, then w_2 does not precede w_1. (This is the atomicity property.)
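The two properties above can be turned into a small checker over a recorded history. The sketch below is illustrative only: the record types, the `INITIAL` marker, the use of real-valued invocation and response times, and the requirement that writes be listed in the order the single writer issued them are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Union

INITIAL = -1  # marker meaning "the read returned the initial value"


@dataclass(frozen=True)
class Write:
    start: float  # time of invocation
    end: float    # time of response


@dataclass(frozen=True)
class Read:
    start: float
    end: float
    source: int   # index of the write whose value was returned, or INITIAL


Op = Union[Write, Read]


def precedes(a: Op, b: Op) -> bool:
    """a precedes b if a's response comes before b's invocation."""
    return a.end < b.start


def overlaps(a: Op, b: Op) -> bool:
    return not precedes(a, b) and not precedes(b, a)


def is_regular(writes: List[Write], read: Read) -> bool:
    """Property 1: latest preceding write, or some overlapping write."""
    preceding = [k for k, w in enumerate(writes) if precedes(w, read)]
    latest = preceding[-1] if preceding else INITIAL
    if read.source == latest:
        return True
    if read.source == INITIAL:
        return False
    return overlaps(writes[read.source], read)


def write_precedes(writes: List[Write], a: int, b: int) -> bool:
    """Does write a precede write b? The initial value precedes every write."""
    if a == b or b == INITIAL:
        return False
    if a == INITIAL:
        return True
    return precedes(writes[a], writes[b])


def is_atomic_pair(writes: List[Write], r1: Read, r2: Read) -> bool:
    """Property 2 for one pair: if r1 precedes r2, w2 must not precede w1."""
    if not precedes(r1, r2):
        return True
    return not write_precedes(writes, r2.source, r1.source)


if __name__ == "__main__":
    writes = [Write(0, 2), Write(5, 9)]
    r1 = Read(6, 7, source=1)    # overlaps the second write, returns its value
    r2 = Read(8, 8.5, source=0)  # later read returns the older value
    print(is_regular(writes, r1), is_regular(writes, r2),
          is_atomic_pair(writes, r1, r2))
```

The usage example is the standard new-then-old inversion: both reads are regular when considered alone, yet the pair violates atomicity because the earlier read already returned the newer value.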
We concentrate on the simulation of a single atomic register. Since linearizability is local (cf. [15]), multiple copies of the simulation can be composed to simulate any number of atomic registers.
The quality of a simulation is evaluated by the worst case, over all its possible executions, of the following quantities:
Message complexity: The number of messages sent per execution of a write or read operation.
The Computation Model: Message Passing Systems
In a message-passing system, processors communicate by sending messages (taken from some alphabet M) to each other. Processors are located at the nodes of a complete network, and any processor can send to any other processor.
We model computations of the system as sequences of steps. Each step is either a message delivery step, representing the delivery of a message to a processor, or a computation step of a single processor.
In each message delivery step, a single message is placed in the incoming buffer of the processor. Formally, a message delivery step is a pair (i, m), where m ∈ M. In a computation step, a processor receives all messages delivered to it since its last computation step, performs some local computation and sends some messages, and (possibly)
step, and 4. there is a one-to-one mapping from delivery steps to corresponding send events (with the same message).
The network is not explicitly modeled; however, the last condition guarantees that messages are not duplicated or corrupted. The network is allowed to not deliver some messages or to deliver them out of order.
Processor p_i is nonfaulty in an execution if the execution contains an infinite number of computation steps by p_i, and all the messages sent by p_i are eventually delivered. A faulty processor stops operating or fails to receive a message; that is, we assume a send omission type of failure and do not allow processors to behave in an arbitrary malicious manner. Messages can be delivered to a faulty processor, even after it stops taking steps.
majority of the processors is considered faulty and is blocked. A wait-free algorithm will run correctly under our simulation if at least a majority of the processors are non-faulty.
In the simulation of Attiya, Bar-Noy and Dolev [2], atomicity of reads is guaranteed by communicating timestamped values between the readers. In contrast, our simulation relies on a handshake mechanism between the writer and the readers to guarantee atomicity of the reads, following the ideas of Haldar and Vidyasankar [12]. However, since communication between the writer and the readers is not by directly writing and reading to registers, but rather by sending messages to a majority of processors, we still need timestamps. As in [2], bounded timestamps [16] are used to bound the size of messages and local memory. Since timestamps are used in a different manner, the number of messages needed for maintaining them correctly is reduced. Furthermore, it allows a smaller set of timestamps to be outstanding at any point, and therefore, reduces the size of the messages and the local memory requirements.
Our algorithm tolerates failures by requiring the writers and the readers to communicate with a majority of the processors. This is similar to the notion of quorums, introduced by Gifford [11], and employed in many situations, cf. [1, 18, 19].
Two other papers investigated the relationships between shared-memory and message-passing systems. Bar-Noy and Dolev [6] provide translations between protocols in the shared-memory and the message-passing models. These translations apply only to protocols that use a restricted form of communication. Chor and Moscovici [9] present a hierarchy of resilience for problems in shared-memory systems and complete networks. They show that for some problems, the wait-free shared-memory model is not equivalent to the complete network model, where up to half of the processors may fail. This result, however, assumes that processors halt after deciding.
The rest of this paper is organized as follows. In Section 2, we define the message-passing model, the behavior of a single-writer multi-reader atomic register, and the notion of simulating an atomic register in the message-passing model. Section 3 contains an overview of the simulation. Section 4 presents the basic mechanism for maintaining information in our simulation, while Section 5 includes the high-level algorithm for simulating an atomic single-writer multi-reader register. We conclude in Section 6, with a discussion of the results.
Introduction
Two major interprocessor communication models in distributed systems have attracted much attention and study: the shared-memory model and the message-passing model. In the shared-memory model, processors communicate by writing and reading to shared registers. In the message-passing model, processors are located at the nodes of a network and communicate by sending messages over communication links. In both models, we consider asynchronous unreliable systems in which failures may occur; a processor fails by stopping and a slow processor cannot be distinguished from a failed processor.
Originally, the two models were investigated separately; algorithms and impossibility results were designed and proved for each of the models individually. However, Attiya, Bar-Noy and Dolev have shown that the message-passing variation of the asynchronous model can be reduced to the shared-memory variation [2]. This was proved by presenting a simulation of read/write registers in the message-passing model. The simulation implies that any wait-free algorithm in the shared-memory model which uses atomic, single-writer multi-reader registers can be executed in the message-passing model.
This simulation immediately implies that many algorithms designed for the shared-memory model can be automatically employed in the message-passing model; see the many references in [2]. Given this simulation, the recent flourishing research on wait-free solvability in asynchronous systems, e.g., [4, 7, 8, 13, 14], solely addresses the shared-memory model. These papers rely on the simulation of [2] to translate the results into the message-passing model. Unfortunately, the simulation of [2] has a relatively high message complexity: Each read or write operation requires O(n^2) messages, for a system with n + 1 processors.
This paper presents an improved algorithm for simulating an atomic single-writer multi-reader register in message-passing systems in the presence of processor failures. Each simulated operation (read or write) requires O(n) messages, each with O(n^3) bits, constant time, and O(n^5) bits of local memory. This improves on the simulation of [2] in which each simulated operation (read or write) requires O(n^2) messages, each with O(n^5) bits, constant time, and O(n^6) bits of local memory.¹
Wait-free protocols in shared-memory systems allow a processor to complete its operation regardless of the speed of other processors. In message-passing systems, it can be shown, following the proof in [3], that for many problems requiring global coordination, there is no solution that can prevail over a "strong" adversary, that is, an adversary that can stop a majority of the processors. Such an adversary can cause two groups of fewer than a majority of the processors to operate separately by suspending all the messages from one group to the other. For many global coordination problems this leads to contradicting and inconsistent operations by the two groups. As mentioned in [3], similar arguments show that processors cannot halt after deciding.
Thus, in our simulation, a processor that is disconnected (permanently) from a
¹ The complexity analysis in [2] states that message size is O(n) bits, and local memory requires O(n^3) bits. This analysis is wrong, as it does not account correctly for the size of the bounded timestamps used.
