We present a sequence of constructions of commuting synchronization objects (e.g., fetch-and-increment and fetch-and-add) in a system of n processors from any two-processor synchronization object whose consensus number is two or more [Her91a]. Each implementation in the sequence uses a particular type of shared memory snapshot as a building block. Later implementations in the sequence are based on higher quality snapshots. The first implementation, of fetch-and-increment, uses the standard atomic snapshot concept introduced in [AAD+93, And90], while the last construction in the sequence, of fetch-and-add, is based on the immediate snapshot concept introduced in [BG93b]. This last construction also yields an implementation of a stronger snapshot, which we call Write-and-snapshot. In addition, this work solves an open question of Borowsky and Gafni by presenting an implementation of a multi-shot immediate snapshot object.
Introduction
In his seminal paper [Her91a], Herlihy established the existence of a hierarchy of wait-free concurrent data objects that classifies objects according to their ability to solve k-consensus, that is, consensus among k asynchronous processes. An object has consensus number k if any number of copies of this object and of read/write registers can be used to implement a k-consensus protocol, but cannot be used to implement a (k+1)-consensus protocol. Thus objects with a higher consensus number cannot be deterministically implemented by employing objects with lower consensus numbers [Jay93].
A preliminary version of this paper was presented at PODC '93. † Motorola Semi-Conductors Israel.
At the bottom level are the weakest objects with consensus number 1, e.g., read/write atomic registers, while at the top are objects such as compare-and-swap, whose consensus number is $\infty$.
In [Her91a, Plo88] it is shown that any wait-free shared object can be implemented from any object in the top level of the hierarchy; that is, these papers present a universal construction for any sequentially specified wait-free object.
The second level of the hierarchy includes many synchronization primitives that are commonly used in practice, such as test-and-set, fetch-and-add, fetch-and-increment, queue, and swap (which are supported by Encore-Multimax, Sequent Symmetry, and SGI's MIPS-based multiprocessor [Ber91]). In addition, many algorithms for queue management and many primitives of concurrent operating systems are based on one of these objects [HW87, GLR83, PS85], but not necessarily on the one supported by the system employed. This situation raises the question: does an implementation of an object (e.g., queue) that is based on n-processor fetch-and-add [GLR83] imply the existence of an implementation of the same object based on 2-processor test-and-set [Ber91] and read/write registers? Yet, aside from very trivial implementations, no implementation of an object at level two from any other object at that level has been provided (trivial ones are test-and-set from swap, or fetch-and-increment from fetch-and-add). Herlihy posed the following open question: "Can fetch-and-add implement any object having consensus number two in a system of three or more processes?" [Her91a]. Although the universal construction implements any sequentially specified n-process shared object, it is necessarily based on objects with consensus number n.
A read-modify-write (RMW) object, over a set of functions $F$ and a register $r$, is generically defined as follows:

    read-modify-write(f, r): return(v)    { f ∈ F }
        v := r
        r := f(r)
        return(v)

Let $F$ be a set of functions indexed by an arbitrary set $S$. The set $F$ is commuting if for all values $v$ and all $i$ and $j$ in $S$, $f_i(f_j(v)) = f_j(f_i(v))$. The set is overwriting if either $f_i(f_j(v)) = f_i(v)$ or $f_j(f_i(v)) = f_j(v)$. For example, fetch-and-add is a classical example of an object that applies commuting functions (add 5, add 2, ...), while swap and test-and-set apply overwriting functions.
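The generic RMW operation and the two function classes can be illustrated with a small sequential sketch. Everything here is an assumption made for illustration: the register is modeled as a one-element Python list, the helper names (`read_modify_write`, `is_commuting`, `is_overwriting`) are hypothetical, and the two properties are only tested over a finite sample of values rather than proved. The overwriting check reads the definition per pair of functions, which is one possible reading.

```python
from itertools import product


def read_modify_write(register, f):
    """Return the old value of `register` (a one-element list) and store f(old)."""
    v = register[0]
    register[0] = f(v)
    return v


def is_commuting(funcs, values):
    """Check f_i(f_j(v)) == f_j(f_i(v)) for all pairs, over the sampled values."""
    return all(fi(fj(v)) == fj(fi(v))
               for fi, fj in product(funcs, repeat=2)
               for v in values)


def is_overwriting(funcs, values):
    """Check, for each pair, that one function overwrites the other on the sample."""
    return all(all(fi(fj(v)) == fi(v) for v in values)
               or all(fj(fi(v)) == fj(v) for v in values)
               for fi, fj in product(funcs, repeat=2))


# fetch-and-add applies functions of the form v -> v + k (commuting).
adds = [lambda v, k=k: v + k for k in (1, 2, 5)]
# swap applies functions of the form v -> k (overwriting).
swaps = [lambda v, k=k: k for k in (0, 7)]
sample = range(-3, 4)
```

With these definitions, `is_commuting(adds, sample)` holds while `is_overwriting(adds, sample)` does not, and the reverse is true for `swaps`.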
In this paper we present constructions of any read-modify-write shared synchronization object that supports commutative operations, in a system with an arbitrary number of processors, from any two-processor synchronization object whose consensus number is two or more.
The uniqueness of our implementations is that we implement objects that can be accessed by more processes than their consensus number. To understand why the implementation of n-process objects with consensus number two from similar two-process objects is more difficult than the analogous implementations between objects with consensus number n, one should understand the difference in the computational power of the two classes. Intuitively, an n-consensus object has the capability to return, in the response to each access, a total order of any subset of preceding accesses, such that the returned order is consistent with the linearization of the accesses; that is, which process accessed the object first, which second, which third, etc. The capability of consensus number two objects is totally different. They cannot return as a response to an access a total order of previous accesses in the linearization order (because that would imply infinite consensus number). Yet at least the objects in the common2 class of synchronization objects that we introduced in [AWW93] have the capability either to return in a response to an access the unordered subset of all previous accesses, or to return the access that is linearized just before this access. (Read/write registers cannot give any of these orders atomically, but can give an order on the operations in a nonatomic manner [DS89].) The common2 class contains read-modify-write objects that commute (e.g., fetch-and-add) or overwrite (e.g., swap). It is known that this class is contained in the consensus number two class of objects [Her91a] and that any object in this class can be implemented from any other object in the class.
From the universal construction of [Her91a, Plo88] it follows that any consensus number k object is universal in a system with k processes. Thus, any two-process object can be implemented from any consensus number two object. Our implementations are based only on two-process objects. By this transformation, each of our implementations can use any consensus number two object as the base object.
All the constructions in this paper use n-process test-and-set and read/write registers as their base objects. The construction of the n-process test-and-set register from any two-process object with consensus number two or more can be found in [AWW93, Wei94]. In this paper we present a sequence of constructions, each of which combines a particular type of snapshot algorithm with a linear array of test-and-set objects, as follows:
Single-use fetch-and-increment object: Constructed from n n-process test-and-set objects. Each fetch&increment operation takes at most $O(n^2)$ primitive operations on the base objects of the implementation (when considering the implementation from 2-process objects) (Section 3.1).
Non-linearizable multi-use fetch-and-increment object: This construction uses an unbounded array of n-process test-and-set objects in conjunction with atomic snapshots. The time complexity of this construction is unbounded (Section 3.2).
Linearizable multi-use fetch-and-increment object: In this construction we replace the regular atomic snapshot by a new type of snapshot, called proximate-snapshot, obtaining a linearizable implementation of a fetch-and-increment register in which the time complexity of each operation is $O(n^2)$ (Section 3.4).
Single-use fetch-and-add object: This construction uses n test-and-set registers in conjunction with the single-use (i.e., one-shot) immediate-snapshot of Borowsky and Gafni [BG93b]. The time complexity of the resulting construction is again $O(n^2)$ operations on either two-processor objects with consensus number two or read/write registers (Sections 4 and 4.2).
Multi-use fetch-and-add object: This construction is based on a multi-shot immediate-snapshot in conjunction with an unbounded array of test-and-set registers. The complexity of the resulting algorithm is $O(n^3)$ operations on read/write registers and $O(n^2)$ test&set operations per operation (Sections 4 and 4.3). An important by-product of this construction is the proof that the construction of multi-use immediate-snapshot from read/write registers is possible.

The space complexity of the multi-use fetch-and-increment and multi-use fetch-and-add constructions is unbounded in two ways: in the number of registers used in the constructions and in the size of the registers assumed. In Section 5 we describe a method by which the number of registers used in the constructions can be made bounded. However, the size of the registers cannot be bounded since there is no bound on the number of fetch&increment operations applied to the object. That is, the size of the constructed multi-use object is inherently unbounded. Although we managed to bound the number of registers used, the main point is the possibility of constructing any commuting synchronization object from any other object whose consensus number is 2, and not the complexity of the construction.
Model definitions and notations
We assume the standard model of shared memory as in [Her91a]. A concurrent system consists of a collection of processes that communicate through typed shared memory objects. Processes are sequential; that is, each process applies a sequence of operations to objects in the system. An operation on an object is defined by two events: (1) an invocation to the object, and (2) a response by the object. There are no fairness assumptions concerning the processes, i.e., a process may halt or operate at different speeds. In particular, processes cannot detect whether other processes have halted.
State Machine and Sequential Specification
More formally, we model an object type using a Mealy state machine [Mea55]. We use the notations of [AGMT92] to define a sequential specification via the Mealy machine. The sequential specification allows operations to be specified as atomic state transitions. However, in asynchronous concurrent systems operations have duration and can overlap in time. This is modeled by allowing the interleaving of invocations and responses by different processes, so that between the invocation and response by one process there may be any number of invocations or responses by other processes. Thus, concurrent runs of the object are modeled as elements of $(Inv + Res)^{\infty}$.
Specific correctness conditions constrain these runs by relating them to those in the sequential specification.
For example, the linearizable [HW87] runs of an object O are specified as follows. Given a string $\alpha \in (Inv + Res)^{\infty}$, define a partial order $\prec_\alpha$ on the events in $\alpha$ (the distinct occurrences of symbols in $\alpha$), such that $a \prec_\alpha b$ if and only if either (1) both a and b are invocation or response events of the same process (have the same subscript) and a appears before b in $\alpha$, or (2) a is a response that precedes the invocation b in $\alpha$. Then $\alpha$ is linearizable if there is a sequential run $\sigma$ of O, containing exactly the events in $\alpha$ and such that the total order $\prec_\sigma$ is an extension of $\prec_\alpha$.
Implementations
It suffices to consider an implementation of an n-process high-level object O from a set of primitive objects $\{O_1, \ldots, O_k\}$ as a set of procedures for each process. Invocations and responses of these procedures are identified with those in the specification of O. The procedures are allowed to do local computation, and communicate only by making invocations to and receiving responses from the primitive objects $\{O_1, \ldots, O_k\}$. Hence, runs of the implementation of O consist of sequences containing invocations and responses to the high-level object O, local steps of the procedures, and invocations and responses to the primitive objects $\{O_1, \ldots, O_k\}$.
The runs are constrained in the obvious way to respect the control flow of the individual procedures, allowing arbitrary, asynchronous interleaving between the threads of distinct processes. Moreover, the subsequences of invocations and responses to each primitive object must satisfy the specification of that object. Given these constraints, the subsequences of high-level invocations and responses to O must in turn satisfy the specification of O.
The implementations are required to be wait-free; thus any single high-level operation op (procedure invocation) must terminate, regardless of the steps taken by any other high-level operations, provided the local actions and low-level operations of op are allowed to progress [Lam77].
This informal notion of implementation may be made precise using any of several formalisms [HW87, Her91a, Lam86].
The step (time) complexity of an implementation of a high-level operation is the worst-case number of atomic (primitive) operations each process performs when completing the operation on the (high-level) object. Define $C(i, op, r)$ as the step complexity of operation op of process $P_i$ in run r.
An implementation I of an object A is bounded wait-free if there exists N such that, for every run r of I, for every operation op of r and for every $i \le n$, $C(i, op, r) \le N$.
Two characteristics of objects are used in this work:
Definition 2 The access number of an object is the maximum number of processes that can perform an operation on the object concurrently.
Definition 3 A single-use object is one in which each process may perform at most one operation; in a multi-use object each process may repeatedly access the object an arbitrary number of times.
We pay special attention to two objects, fetch-and-increment and fetch-and-add. The sequential specification of fetch-and-increment applies the function $f(r) = r + 1$ in the generic RMW of the introduction. The fetch-and-add object applies the function $f(val, r) = r + val$ in the generic RMW of the introduction.
3 Proximate Snapshots and Fetch-and-increment
3.1 Single-use-fetch-and-increment
The implementation of n-process single-use-fetch-and-increment from n-process test-and-set is very simple (see Figure 1): there is a linear array T of n test-and-set registers. To perform fetch&increment, each process performs test&set on the array, starting from T[1], until it wins. A process that wins in T[j] ($j = 1, \ldots, n$) returns $j - 1$.
    operation single-use-fetch&increment
        Shared T[1..n]: array of n-process test-and-set;
        for j = 1 to n
            if test&set(T[j]) = 0 then return (j-1)
Figure 1: Single-use-fetch&increment.
Correctness: The implementation is wait-free, since the test&set operations are wait-free and each fetch&increment operation performs at most n of those. Each process returns a unique value from $\{0, \ldots, n-1\}$ since only one process wins in each location. Linearizability follows since, when a process returns x, the operations that would return k, for every k < x, have already started, by the linearizability of test-and-set.
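The construction of Figure 1 can be exercised with a small threaded sketch. The `TestAndSet` class below is a lock-based stand-in for the n-process test-and-set primitive, which is an assumption of this sketch and not the construction from two-process objects. The class and function names are hypothetical, and indices are 0-based, so winning the register at index j returns j directly.

```python
import threading


class TestAndSet:
    """Lock-based stand-in for an n-process test-and-set register (assumed primitive)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._bit = 0

    def test_and_set(self):
        # Atomically read the old bit and set it to 1.
        with self._lock:
            old = self._bit
            self._bit = 1
            return old


class SingleUseFetchAndIncrement:
    def __init__(self, n):
        self.T = [TestAndSet() for _ in range(n)]

    def fetch_and_increment(self):
        # Ascend the array until a test&set is won (index j here is the paper's T[j+1]).
        for j, reg in enumerate(self.T):
            if reg.test_and_set() == 0:
                return j
        raise RuntimeError("more than n operations applied to a single-use object")


def run_once(n):
    """Let n threads each perform one operation and collect the returned values."""
    obj = SingleUseFetchAndIncrement(n)
    results = [None] * n

    def worker(i):
        results[i] = obj.fetch_and_increment()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In any schedule, the n operations return the values 0 through n-1 exactly once each, since each register is won by exactly one thread.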
3.2 Non-linearizable multi-use fetch-and-increment
The single-use-fetch-and-increment suggests a multi-use implementation by using an infinite array of test-and-set registers. While it may be possible to make this solution space bounded, it is not wait-free. Fast processes can repeatedly win an arbitrarily long sequence of registers in the array while a slow process trails behind.
To overcome this problem we suggest a different approach. Instead of ascending from the test-and-set register at some location up to the location at which the fetch&increment operation wins a test&set, a process approximates a position in the linear array of test-and-sets such that it is guaranteed to find an unset test-and-set register while descending from that location.
The problem of approximating such a position is best captured through the write and snapshot problem.
Definition 4 The write and snapshot problem is to update (write) a register in a shared array of single-writer-multi-reader (swmr) registers and then to take a snapshot of the array such that a bounded number of other updates and snapshots have taken place between the update and the snapshot.
The straightforward approach, using an ordinary snapshot algorithm, does not provide any bound on the number of updates that may take place between the update of a processor and its subsequent snapshot. On the other hand, in the fetch-and-increment object that we implement here, a process updates a shared counter and gets back the value of the counter immediately after (before) its update. The write and snapshot problem in the read/write model was first raised by Borowsky and Gafni [BG93a], and Saks and Zaharoglou [SZ93]. Borowsky and Gafni have studied this problem and defined the immediate snapshot problem, for which they provided a one-shot solution in [BG93b]. However, here we need a multi-shot write and snapshot that would bound the number of other updates between the two operations.
Let us first consider the straightforward solution, using an ordinary snapshot scan algorithm [AAD+93]. Clearly this construction is wait-free (though unbounded, since each operation depends on the number of operations that preceded it) because each operation cascades down from some location towards the beginning of the array until it wins a test&set operation.
The key observation of this subsection, which is the basis for subsequent subsections, is that every operation is guaranteed to find, and to set, an unset test-and-set register in the cascading process described in the last line of code in Figure 2. This fact is proved in the following lemma. However, the implementation is not linearizable, since it is easy to construct scenarios in which an operation starts and finishes with value k while another operation that starts after that returns with a value smaller than k.
Lemma 3.2.1 Every fetch&increment operation (Figure 2) of any process $P_i$ terminates after winning in a test&set operation.
Proof: Assume to the contrary and consider an operation of process $P_i$ that starts with entry = h and loses in test&set(T[1]). Then $P_i$ must have failed in all the test-and-set registers between T[h] and T[1]. Since in each fetch&increment operation a process wins at most one test-and-set register, it must be that more than h fetch&increment operations have accessed the array in between T[1] and T[h].
Consider a run $\alpha$ of the implementation in which the lemma is violated. Define $\alpha_l$ to be the shortest prefix of $\alpha$ in which there is an $l > 0$ such that more than l operations accessed the test-and-set registers in the interval T[1]-T[l]. Note that by the end of $\alpha_l$ there might be several different values $l_1, l_2, \ldots, l_j$ such that for each i more than $l_i$ operations accessed the test-and-set registers in the interval T[1]-T[$l_i$]. Let $l'$ be the largest value in $l_1, l_2, \ldots, l_j$. That is, at least $l' + 1$ operations have accessed the test-and-set array between T[1] and T[$l'$]. Let L denote this set of $l' + 1$ operations.
Let op be the last operation in L to update its my_inc register in the first line of code in Figure 2.

Since the non-linearizable fetch-and-increment of Figure 2 is introduced here mostly as a pedagogical step, we will not give a full proof of its correctness. However, note that (1) each test-and-set may be won by at most one process, thus a value is returned by at most one non-linearizable fetch&increment operation, and (2) consider a complete run, that is, a run all of whose operations have terminated. Let k be the total number of fetch&increment operations that have started in the run. Then no operation has started with entry > k. By Lemma 3.2.1, each of the operations returns with a correct value, and hence each value in $0, \ldots, k-1$ is returned, by the pigeonhole principle.
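The descending scheme described in this subsection (increment my_inc, take a snapshot, use the sum of the snapshot as the entry point, then cascade down until a test&set is won) can be sketched as follows. This is a minimal sketch under stated assumptions, not the code of Figure 2: the atomic snapshot and the test-and-set registers are simulated with locks, the unbounded array is simulated with a set of already-set indices, and all names are hypothetical.

```python
import threading


class MultiUseFetchAndIncrement:
    """Non-linearizable multi-use fetch-and-increment (lock-simulated primitives)."""

    def __init__(self, n):
        self.my_inc = [0] * n
        self._snap_lock = threading.Lock()  # stand-in for atomic update/snapshot
        self._tas_lock = threading.Lock()   # stand-in for atomic test&set
        self._set_indices = set()           # registers of the unbounded array already set

    def _test_and_set(self, idx):
        with self._tas_lock:
            if idx in self._set_indices:
                return 1
            self._set_indices.add(idx)
            return 0

    def _update(self, i):
        with self._snap_lock:
            self.my_inc[i] += 1

    def _snapshot(self):
        with self._snap_lock:
            return list(self.my_inc)

    def fetch_and_increment(self, i):
        self._update(i)
        # Other processes may run between the update and the snapshot; that gap
        # is what makes this variant non-linearizable.
        entry = sum(self._snapshot())
        while entry > 0:
            if self._test_and_set(entry) == 0:
                return entry - 1
            entry -= 1
        raise AssertionError("unreachable according to Lemma 3.2.1")


def run(n, ops_per_process):
    """Run n threads, each performing ops_per_process operations; return all values."""
    obj = MultiUseFetchAndIncrement(n)
    out = [[] for _ in range(n)]

    def worker(i):
        for _ in range(ops_per_process):
            out[i].append(obj.fetch_and_increment(i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [v for vals in out for v in vals]
```

In a complete run with k operations, the returned values are exactly 0 through k-1, matching observation (2) above. The sketch does not demonstrate the linearizability violation, which depends on a specific adversarial schedule.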
3.3 Proximate Snapshot
As indicated above, the problem of the construction in Figure 2 is that a fetch&increment operation op may be delayed for a long time after increasing its my_inc and before taking the snapshot. In that time interval many other operations may take place, each winning a test-and-set register and leaving an unset test-and-set lower in the array. When op eventually resumes its operation, its snapshot would return an entry point which is much larger than the entry it would have gotten had it taken the snapshot soon after it had incremented my_inc. This causes the construction to be both non-linearizable and unbounded in time complexity.
To alleviate this problem we introduce a new type of write and snapshot, called proximate-snapshot.
Definition 5 In the proximate-snapshot of process P, P performs an update and takes a proximate-snapshot that returns an atomic snapshot such that:
1. The update is an atomic write operation and the proximate-snapshot is also an atomic operation.
2. At most $n-1$ other updates may take place between P's update and its subsequent proximate-snapshot, i.e., the returned snapshot contains at most $n-1$ updates that followed P's update.
3. At the point at which P's proximate-snapshot returns, no other proximate-snapshot operation has returned an earlier snapshot that includes P's last update. (In other words, all the snapshots returned by other proximate-snapshot operations that are serialized before P's proximate-snapshot either do not include P's last update or return a snapshot that is not earlier than the snapshot returned by P's proximate-snapshot.)

The basic idea behind the proximate-snapshot given in Figure 3 is that fast processes help processes that are much slower to obtain an early snapshot. To this end each process keeps several copies of snapshots that it saw while doing update and snapshot operations. A process whose snapshot took place many steps (of other processes) after its corresponding update would find, among the copies of snapshots retained by the fast processes, an older snapshot which is closer to its update.
The shared data structures used by the implementation are: my_inc[1..n], an array of swmr registers, each of which counts the number of updates that the corresponding process has performed; and my_minSS[1..n, 1..n], a 2-dimensional array of snapshot vectors. Each process maintains n snapshots, one for each other process. In addition, the my_inc registers are augmented so that the snapshot algorithm of [AAD+93] or of [AR93] may be applied.
Since we construct the proximate-snapshot here only as a building block for the construction of fetch-and-increment, we present here a simple version in which we assume that the update operation always increments a register called my_inc. Generalizing the proximate-snapshot presented here to a proximate-snapshot that makes arbitrary updates is a straightforward process which does not fit the current flow of the paper. Henceforth we assume that each process alternates between two operations, update and Proximate-snapshot, such that the update of process $P_i$ always increments the value in register my_inc[i]. That is, no process ever performs two updates without taking a Proximate-snapshot between them.
The proximate-snapshot operation of process $P_i$ has three steps. First, the process takes a regular snapshot of the array my_inc (Line 1). Second, $P_i$ checks the snapshot taken in the previous step to see if any process $P_j$ has incremented its my_inc[j] in between the last two snapshots taken by $P_i$. If $P_i$ finds that my_inc[j] has been updated, then it retains a copy of the new snapshot in my_minSS[i, j] in case $P_j$ will ever need it (Lines 2-6). Note that my_minSS is first calculated on a local copy, my_minSS', and then my_minSS is updated for all the processes in one atomic step (Line 5).
Third, $P_i$ scans all the snapshots that other processes hold for it, together with the new snapshot it took, to find out which of them is the earliest that contains the current value of my_inc[i] (which is the most recent update of $P_i$) (Lines 7-13). To determine which of two snapshots is earlier, the summation of their entries is compared. The snapshot with the smaller summation is clearly the earlier. To ensure the third property of proximate-snapshot we have to embed the scan in a repeat loop. Embedding the scan in a repeat loop ensures that the returned minimal snapshot was observed twice (and was the minimum in both) before being returned. This ensures that the minimal snapshot returned was not changed during the first time it was observed (collected). Hence, there is a point at which the process has finished reading the minimal snapshot and at which this is still the minimal snapshot for that process (this property is used in the proof of Lemma 3.4.4).
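The selection rule of the third step can be illustrated in isolation. The helper below is a hypothetical sketch, not the code of Figure 3: it takes a list of candidate snapshot vectors, keeps those whose entry for process i equals the current value of my_inc[i], and returns the one with the smallest sum of entries. It uses 0-based process indices and does not model the repeat loop or any concurrency.

```python
def earliest_snapshot(candidates, i, current):
    """Return the candidate with the smallest entry sum whose i-th entry equals `current`.

    candidates: list of snapshot vectors (lists of update counters)
    i:          0-based index of the process selecting a snapshot
    current:    the current value of that process's own counter
    Returns None when no candidate contains the current update.
    """
    valid = [snap for snap in candidates if snap[i] == current]
    return min(valid, key=sum) if valid else None
```

For example, with candidates `[[2, 1, 0], [1, 3, 4], [2, 0, 0], [2, 5, 1]]`, process index 0, and current counter 2, the vector `[1, 3, 4]` is excluded because it predates the update, and `[2, 0, 0]` is chosen as the earliest of the rest.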
Claim 3.3.1 For every two processes $P_i$ and $P_j$, $P_i$ never updates the value in my_minSS[i, j] twice such that my_minSS[i, j][j] = l in both updates.
Proof: At Line 4 a process $P_i$ changes the value held for another process $P_j$ only when it has observed a change in the value of $P_j$ in the snapshot S. Therefore, for a given l it will update the value of my_minSS[i, j] at most once.
Claim 3.3.2 A process $P_i$ in its l-th proximate-snapshot operation performs the repeat loop (Lines 7-13) at most $n-1$ times.
Proof: $P_i$ remains in the loop only if it finds that another process $P_k$ has updated its value in my_minSS[k, i] and my_minSS[k, i][i] is equal to l (Line 10). From Claim 3.3.1 it follows that at most $n-1$ such updates could occur (one for each other process).

Proof: $PSS_i^j$ is a regular atomic snapshot that was taken after $w_i^j$. Hence all the values in this snapshot are at least as large as they were at the time of the $w_i^j$ operation. Therefore their sum is at least as large as $atomic_i^j$, and the left-hand part of the inequality holds.
To prove the right-hand side of the inequality we claim that there are at most $n-1$ operations $w_k^l$, $k = 1, \ldots, n$, $l = 1, 2, \ldots$, after $w_i^j$ and before $PSS_i^j$ was first taken as an atomic snapshot by some process (not necessarily process i). Assume by way of contradiction that there are more than $n-1$ such operations. Thus there must be at least one process index t such that there were two $w_t$ operations ($w_t^l$ and $w_t^{l+1}$) after $w_i^j$ and before $PSS_i^j$ was taken. Thus, after $w_t^l$ and before $w_t^{l+1}$, process $P_t$ must have performed a complete proximate-snapshot operation, denoted $pss_t^l$. During the $pss_t^l$ operation process $P_t$ computes a snapshot in which $w_t^l$ was observed and hence also $w_i^j$ was observed. Therefore, after executing Line 5 of $pss_t^l$, my_minSS[t, i] must contain a snapshot that includes $w_i^j$. Since by the assumption $PSS_i^j$ is after $w_t^{l+1}$, process $P_i$ starts executing procedure Proximate-snapshot after process $P_t$ has finished the $pss_t^l$ operation. Therefore, when process $P_i$ reaches Line 10 the condition in that line must be satisfied for some process index, and S must contain a snapshot whose value for process $P_t$ is at most l. This contradicts the assumption that $PSS_i^j$ has a value larger than l for t.
Definition 9 $ll8_i^j$ is the last execution of Line 8 in the j-th execution of the proximate-snapshot operation by process i.
Lemma 3.3.5 The implementation in Figure 3 satisfies property 3 of Definition 5.
Proof: Serialize each operation $pss_i^j$ at $ll8_i^j$. Let $pss_r^l$ and $pss_q^k$ be two proximate-snapshot operations such that $w_q^k \in PSS_r^l$ and $PSS_r^l$ is earlier than $PSS_q^k$. To complete the proof of the lemma we claim that $ll8_q^k$ was performed before $ll8_r^l$. $P_r$ computes $PSS_r^l$ according to a snapshot of $P_g$ (g might be equal to r), which had updated its my_minSS (Line 5 in Figure 3) to contain a snapshot that corresponds to $PSS_r^l$. Therefore $P_r$ performs $ll8_r^l$ after $P_g$'s execution of Line 5. On the other hand, $PSS_r^l$ could also be a valid value for $PSS_q^k$, since $w_q^k \in PSS_r^l$. But $PSS_q^k$ is later than $PSS_r^l$, so it follows that $P_q$ had performed $ll8_q^k$ before the execution of Line 5 by $P_g$.
From the above lemmas we get:
Theorem 1 The code in Figure 3 correctly implements a wait-free proximate-snapshot operation, and each operation in it terminates in at most $n^2$ accesses to the shared memory.
3.4 Multi-use fetch-and-increment
Here we combine the proximate-snapshot of the previous subsection with the unbounded array of test-and-set registers as in Subsection 3.2 to get a wait-free linearizable multi-use-fetch-and-increment. We give in detail an unbounded space implementation; a method to transform it into an implementation with a bounded number of registers is outlined in Section 5. The idea is exactly as in Subsection 3.2. Since here we replace the regular snapshot by a proximate-snapshot, the implementation becomes bounded wait-free and linearizable, as proved next.

3.4.1 Correctness

Definition 10 $start_i^j$ is the value of the local variable entry of $P_i$ in its j-th multi-use-fetch&increment operation. This is the index of the first test-and-set register that $P_i$ tries to set in its j-th increment.
Definition 11 $inc_i^j$ is the value returned in $P_i$'s j-th multi-use-fetch&increment operation. That is, the index of the test-and-set register which $P_i$ wins when my_inc[i] = j, minus one.
Definition 12 $pss_i^j$ ($fai_i^j$) is the j-th proximate-snapshot (fetch&increment) operation of process $P_i$.
Lemma 3.4.1 Every multi-use-fetch&increment operation of any process $P_i$ terminates with entry > 0.
Proof: The algorithm in Figure 4 differs from the non-linearizable solution in Figure 2 only in the calculation of entry. Since the proximate-snapshot operation returns a "legal" snapshot vector, the proof of Lemma 3.2.1 carries over verbatim.
Lemma 3.4.3 If in a run of the system there is a process $P_i$ and a j such that $atomic_i^j > t \ge inc_i^j$, then in this run there must be a process $P_l$, $l \ne i$, and an m such that $atomic_l^m \le t < inc_l^m \le start_l^m$.
Proof: By way of contradiction, assume such $P_i$, j, t and that for any operation by process $P_r$ and any k such that $atomic_r^k \le t$, $inc_r^k \le t$. Then at least $t + 1$ operations (these operations and the j-th operation of $P_i$) access the array $T[1..\infty]$ between 1 and t. By Lemma 3.4.1 all the processes terminate. Thus $t + 1$ processes terminate in locations 1 to t of the array. Therefore two processes must have won the same test-and-set register, a contradiction.
Lemma 3.4.4 Consider a run of the system in which there are r, l, q and k such that $atomic_q^k \le start_r^l < start_q^k$. Then process $P_q$ had performed $pss_q^k$ before process $P_r$ performed $pss_r^l$.
Proof: From the third property of proximate-snapshot (see Definition 5), it follows that $start_r^l$ was taken after $start_q^k$ was taken, hence the lemma.

... was performed after $pss_{k_1}^{b_1}$. Apply this lemma repeatedly to get that $pss_{k_{z-1}}^{b_{z-1}}$ was performed after $pss_{k_z}^{b_z}$, for $z = 2, \ldots, l$. $P_{k_l}$ started its $pss_{k_l}^{b_l}$ operation after $w_i^x$, since $start_{k_l}^{b_l} > atomic_i^x$. Hence $P_j$ performed its $pss_j^y$ operation after $w_i^x$, in contradiction to the assumption that $P_i$ starts after $P_j$ finishes.
Lemma 3.4.6 For every $x \ge 1$ there are at most n $fai_t^l$ operations with $atomic_t^l > x$ such that $inc_t^l \le x$ (each might access the array $T[1..\infty]$ between 1 and x).
Proof: Since $start_t^l = PSS_t^l$, it follows from Lemma 3.4.2 that $atomic_t^l \le start_t^l < atomic_t^l + n$. Therefore, of the x $fai_k^m$ operations with $atomic_k^m \le x$, at most n operations might have computed $start_k^m > x$, and the others, at least $x - n$ operations, compute $start_k^m \le x$. Since by Lemma 3.4.1 at most x fetch&increment operations access the array $T[1..\infty]$ between 1 and x, and at least $x - n$ of them have $atomic_k^m \le x$, it follows that at most n $fai_t^l$ operations with $atomic_t^l > x$ access the array between 1 and x.
Lemma 3.4.7 In a single fetch&increment operation, a process loses in fewer than 3n T&S registers in the array $T[1..\infty]$.
Proof: Assume to the contrary that there are t and l such that during the $fai_t^l$ operation, $P_t$ loses in 3n T&S registers. Since $P_t$ starts in $T[start_t^l]$, it follows from Lemmas 3.4.2 and 3.4.6 that it must have lost to two fetch&increment operations of the same process $P_k$, $fai_k^{m_1}$ and $fai_k^{m_2}$, such that $atomic_k^{m_1} < atomic_k^{m_2} < atomic_t^l$. The operation $fai_k^{m_1}$ was completed before the beginning of the execution of $fai_t^l$. Since $fai_t^l$ lost to $fai_k^{m_1}$, it would terminate with an $inc_t^l$ which is smaller than $inc_k^{m_1}$, in contradiction to Lemma 3.4.5.
Definition 13 $x$ is a hole in a run if $x$ was not returned by any fetch&increment operation in the run while there is a value $y$, $y > x$, that is returned by an operation in the run.
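As an illustration of Definition 13, the following minimal sketch lists the holes in a finite collection of returned values. The function name `holes` and the representation of a run as a plain list of returned integers are assumptions made for this example; they do not come from the paper's figures.

```python
def holes(returned):
    """Return, in increasing order, every non-negative integer x that was not
    returned although some returned value y satisfies y > x."""
    seen = set(returned)
    if not seen:
        return []
    # Every missing x below the maximum has a witness y = max(seen) > x.
    return [x for x in range(max(seen)) if x not in seen]


if __name__ == "__main__":
    print(holes([0, 1, 3]))  # 2 is missing while 3 was returned
    print(holes([2, 0, 1]))  # no holes: exactly the values 0..2
```

A run of z operations without holes returns exactly the values 0 to z - 1, which is the situation the sketch reports as an empty list.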
Claim 3.4.8 In a complete finite run of $z$ fetch&increment operations there are no holes, that is, all the integer values from 0 to $z - 1$ are returned in the run.
Proof: For every $fai_i^t$ in the run, $start_i^t \le z$, since $start_i^t$ is a "real" snapshot of the my_inc registers. The usage of the test-and-set bit guarantees that such an operation terminates with a unique value in entry. The claim now follows from Lemma 3.4.1 and the "pigeon hole" principle.

Proof: Consider a run as in the statement of the claim. We construct for it a suffix that satisfies the claim by simply scheduling all the operations that have not yet terminated to take enough steps (in any order) until they all terminate. By Claim 3.4.8 this suffix satisfies the claim.
Theorem 2 The code in Figure 4 correctly implements fetch-and-increment, and each operation in it terminates in at most $n^2$ accesses to the shared memory and at most $n$ test&set operations on the test-and-set array.
Proof: From Lemmas 3.4.1, 3.4.7 and 3.3.3 it follows that the implementation is wait-free, with a complexity of $O(n^2)$ primitive operations per fetch&increment operation. To prove the correctness of the implementation we define a total order on the fetch&increment operations and prove that this total order is an extension of the natural partial order. The partial order is induced on the operations by the relation < which is defined as follows: $fai_q^j < fai_p^l$ iff $fai_q^j$ finishes before $fai_p^l$ starts. The total order is defined by the relation < which is: $fai_q^j < fai_p^l$ iff $inc_q^j < inc_p^l$. By Lemma 3.4.5 the total order agrees with the partial order. Together with Claim 3.4.9 this completes the proof.
4 Atomic-write-and-snapshot and Fetch-and-add
While proximate-snapshot is sufficient for the construction of fetch-and-increment, it does not suffice to implement a fetch-and-add object. The difference between the two is that in fetch-and-increment a process returns from an operation with the number of operations that took place (are linearized) before it, whereas a process returning from a fetch&add operation essentially identifies the exact set of operations that is linearized before it. This additional strength of fetch-and-add requires a more sophisticated implementation. Here we show that it is enough to strengthen the snapshot primitive used within the implementation of fetch-and-increment. That is, while in the non-linearizable fetch-and-increment a regular snapshot was enough, and in the linearizable fetch-and-increment proximate-snapshot was sufficient, here for the fetch-and-add construction we need the atomic-write-and-snapshot. The atomic-write-and-snapshot is based on the immediate-snapshot defined by Borowsky and Gafni [BG93b], which is a restricted case of proximate-snapshot. Let us next define the atomic-write-and-snapshot, then show how it is used to generate fetch-and-add in Subsection 4.1. Then in Subsection 4.2 we present an implementation of single-use-atomic-write-and-snapshot from test-and-set registers that also gives us single-use-fetch-and-add, and in Subsection 4.3 we implement multi-use-atomic-write-and-snapshot from test-and-set registers, which produces multi-use-fetch-and-add as a by-product.
The atomic-write-and-snapshot is a restriction of the immediate-snapshot (or the participating set problem) of Borowsky and Gafni. In the participating set algorithm several processes that have accessed the object concurrently may return (observe) the same set of participating processes. Here we augment their algorithm with test-and-set registers to obtain the new object, called atomic-write-and-snapshot:
Definition 14 The atomic-write&snapshot operation by process $p$ atomically writes a value in a register in a shared array of single-writer multi-reader (swmr) registers, and returns an atomic snapshot scan of the array. That is, the $i$-th operation of process $p$ writes $i$ into the shared memory and returns a vector $R_p^i$ of indices such that:
1. Self-Containment: $R_p^i[p] = i$.
2. Atomic Snapshot: For all $p, q, i, j$, either $R_p^i \le R_q^j$ (element by element) or $R_p^i \ge R_q^j$.
3. Atomic Write&Snapshot: For all $p, q, i, j$, if $R_p^i[q] = j$ then $R_q^j[p] < i$.
Note that it follows that any atomic-write&snapshot operation that is later in the serial order "observes" all the atomic-write&snapshot operations that are ordered before it. That is, the snapshot returned is an accurate picture of the array instantly after the write operation. This is the best one can hope for in the write and snapshot problem from Subsection 3.2, i.e., this is an atomic write&snapshot. No other update or snapshot may take place in between the update and the snapshot of one processor.
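The three requirements of Definition 14 can be checked mechanically on a finite history. The sketch below is illustrative only: the function name, the encoding of a history as a dictionary from (process, operation index) to the returned vector, the use of 0 for "no operation of that process observed", and the reading of requirement 3 as applying to distinct processes are all assumptions of this example.

```python
from itertools import combinations


def satisfies_definition_14(results):
    """results maps (p, i) to the vector R returned by the i-th operation of
    process p; R[q] is the index of the latest operation of q that was observed
    (0 if none)."""
    # 1. Self-containment: an operation sees its own write.
    for (p, i), R in results.items():
        if R[p] != i:
            return False
    # 2. Atomic snapshot: any two vectors are comparable element by element.
    for R1, R2 in combinations(results.values(), 2):
        le = all(a <= b for a, b in zip(R1, R2))
        ge = all(a >= b for a, b in zip(R1, R2))
        if not (le or ge):
            return False
    # 3. Atomic write&snapshot: if (p, i) sees (q, j), then (q, j) did not see
    #    (p, i). Applied to q != p only (assumption of this sketch).
    for (p, i), R in results.items():
        for q, j in enumerate(R):
            if q != p and (q, j) in results and results[(q, j)][p] >= i:
                return False
    return True


if __name__ == "__main__":
    serial = {(0, 1): [1, 0], (1, 1): [1, 1], (0, 2): [2, 1]}
    print(satisfies_definition_14(serial))
    # Two operations that both see each other, as an immediate snapshot allows.
    mutual = {(0, 1): [1, 1], (1, 1): [1, 1]}
    print(satisfies_definition_14(mutual))
```

The second history is legal for an immediate snapshot but violates requirement 3, which is exactly the gap the test-and-set registers are meant to close.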
Implementing Fetch-and-add from Atomic-write-and-snapshots
The definition of the atomic-write-and-snapshot object enables each operation to return an unordered set of operations linearized before it, which is essentially what is necessary for fetch-and-f (Figure 5). In the implementation of fetch-and-add (replacing f in the figure by summation) process $P_i$ keeps a local variable $part_i$ with the sum of all of its inputs. To perform fetch&add, $P_i$ first adds $input_i$ to its private sum, $part_i$, and then calls atomic-write-and-snapshot with $part_i$ as an input. Then, summing up all the parts in the returned snapshot gives $P_i$ the total value of all the processes' parts at the time it posted its new part in the atomic-write-and-snapshot.

The implementation of single-use-Atomic-write-and-snapshot is based on the algorithm of Borowsky and Gafni for the one-shot immediate-snapshot. The one-shot immediate-snapshot is also an operation in which a process writes a value to the immediate-snapshot shared memory and returns a snapshot. However, in the immediate-snapshot the operation is not atomic. Instead, in each execution of the immediate-snapshot, the operations are partitioned into sets of operations with a total order between the sets. Operations that are not concurrent belong to different sets, and the total order between the sets extends the natural after relation [Lam78]. The specification of immediate-snapshot requires that the snapshot that operation o returns contains the values written by immediate-snapshot operations that are together with o in the same set and by immediate-snapshot operations that are in preceding sets. Single-use-Atomic-write-and-snapshot is, however, a restriction of immediate-snapshot in the sense that here each operation should be in a set of its own. That is, a write and a snapshot operation are executed as one atomic operation. Therefore, the code of [BG93b] is augmented with test&set operations.
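Returning to the fetch-and-add construction described above (add the input to a private part, call atomic-write-and-snapshot, sum the returned parts), the following is a minimal sketch rather than the code of Figure 5. The class names are hypothetical, a lock is used as a stand-in for the wait-free atomic-write-and-snapshot object, and returning the sum minus the caller's own input (the value before the addition) is an assumption about the intended return value.

```python
import threading


class LockedWriteAndSnapshot:
    """Stand-in (assumption): a lock makes the write and the scan one atomic step."""

    def __init__(self, n):
        self._lock = threading.Lock()
        self._cells = [0] * n

    def write_and_snapshot(self, p, value):
        with self._lock:
            self._cells[p] = value
            return list(self._cells)


class FetchAndAdd:
    def __init__(self, n):
        self._ws = LockedWriteAndSnapshot(n)
        self._part = [0] * n  # part[p] is touched only by process p

    def fetch_and_add(self, p, delta):
        self._part[p] += delta                       # add input to private sum
        snapshot = self._ws.write_and_snapshot(p, self._part[p])
        return sum(snapshot) - delta                 # total before this addition


if __name__ == "__main__":
    faa = FetchAndAdd(2)
    print(faa.fetch_and_add(0, 5), faa.fetch_and_add(1, 3), faa.fetch_and_add(0, 2))
```

Because each process posts its running sum rather than a single input, one register per process suffices no matter how many operations the process performs.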
At the point where an operation should return the immediate snapshot in [BG93b], it first performs a test&set on that particular snapshot. If the test&set was successful, then this snapshot is returned only by that operation. Otherwise, the process performing the atomic-write-and-snapshot tries to capture a smaller snapshot that is contained in the snapshot it failed to capture. This is done by continuing the iterations of the immediate-snapshot algorithm, until finally winning the test-and-set in one of the snapshots.
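The augmentation just described can be sketched as follows. This is a reconstruction under stated assumptions, not the code of Figure 6: the class and field names are hypothetical, the level-descending loop follows the usual participating-set formulation of Borowsky and Gafni, and the test-and-set bit is simulated with a lock. The operation returns the set of process names it captured; values would then be read by indirect addressing.

```python
import threading


class TasBit:
    """One-shot test&set bit; a lock stands in for the hardware primitive."""

    def __init__(self):
        self._lock = threading.Lock()
        self._taken = False

    def test_and_set(self):
        # Returns True for exactly one caller (the winner).
        with self._lock:
            won = not self._taken
            self._taken = True
            return won


class SingleUseWriteAndSnapshot:
    def __init__(self, n):
        self.n = n
        self.level = [n + 1] * n                    # level[p] only decreases
        self.value = [None] * n
        self.T = [TasBit() for _ in range(n + 1)]   # T[1..n]; T[0] unused

    def write_and_snapshot(self, p, v):
        self.value[p] = v
        for l in range(self.n, 0, -1):
            self.level[p] = l
            # Collect the processes currently at level l or below.
            S = frozenset(q for q in range(self.n) if self.level[q] <= l)
            # Capture S only if it is large enough and the test&set is won.
            if len(S) >= l and self.T[l].test_and_set():
                return S
        raise AssertionError("unreachable: the process at level 1 always wins")


if __name__ == "__main__":
    obj = SingleUseWriteAndSnapshot(3)
    print([sorted(obj.write_and_snapshot(p, f"v{p}")) for p in range(3)])
```

In this sketch at most one operation returns from each level, so the returned sets form a strictly increasing chain under containment, which is the property Lemma 4.2.4 states for the original algorithm.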
Definition 15 Let $wss_i$ be the single write-and-snapshot operation of process $i$; denote by $snap_i$ the value returned by $wss_i$.
Definition 16 $S_p$ is the set that is assigned into $S$ in the last execution of Line 4 in the code in Figure 6, by $P_p$ in operation $wss_p$.

... and has performed test&set($T[l' + 1]$). Since by assumption $P_r$ accessed $T[l']$, it must have lost in test&set($T[l' + 1]$). Therefore at least $l' + 2$ write&snapshot operations accessed the array $T[1..n]$ between 1 and $l' + 1$: the $l' + 1$ operations in $L$, and the operation that won in $T[l' + 1]$, which is necessarily not a member of $L$. This is a contradiction to the assumption that $l'$ is the largest index that was violated in l.
Definition 17 For a run of the algorithm we define $participate_l = \{ r \mid level[r] \le l \text{ at some point in the run} \}$, for $1 \le l \le n$.
Corollary 4.2.2 For every $l_1$ and $l_2$, $l_1 < l_2$, if $q \in participate_{l_1}$ then $q \in participate_{l_2}$, and therefore $participate_{l_1} \subseteq participate_{l_2}$. Therefore, for every $l_1 \ne l_2$ either $participate_{l_1} \subseteq participate_{l_2}$ or $participate_{l_1} \supseteq participate_{l_2}$.
Claim 4.2.3 Consider a run of the algorithm in which process $P_p$ wins the test&set on $T[l]$. Then the set $S_p$ equals $participate_l$, and for every other process $P_q$ either $S_p \subseteq S_q$ or $S_p \supseteq S_q$.
Proof: From Lemma 4.2.1 it follows that $|participate_l| \le l$, and since in Line 5 $|S| \ge l$, it must be that $S_p$ is identical to $participate_l$. Since for every other process $q$, $S_q$ is identical to $participate_{l'}$, $l' \ne l$, it follows from Corollary 4.2.2 that either $S_p \subseteq S_q$ or $S_p \supseteq S_q$.
Lemma 4.2.4 Let $snap_p$ and $snap_q$ be the sets that were returned by two write&snapshot operations. Then, either $snap_p \subseteq snap_q$ or $snap_p \supseteq snap_q$.
Proof: Let $x$ be the index of the T&S register that $P_p$ won, and $y$ be the index of the T&S register that $P_q$ won. Assume, w.l.o.g., that $x > y$. From Claim 4.2.3 it follows that $S_p = participate_x$ and $S_q = participate_y$ and that $S_q \subseteq S_p$. Therefore, $snap_q \subseteq snap_p$.
Lemma 4.2.5 In any run of the implementation, if $P_p$ starts its write&snapshot operation after $P_q$ finishes its write&snapshot operation, then $snap_p \supseteq snap_q$.
Proof: Let $y$ be the index of the T&S register that was won by $P_q$ and let $S_y$ be the set $S$ that $P_q$ saw in its last while iteration in Line 4. The index $p$ is not in $S_y$, since $P_p$ has not started its operation when $S_y$ was taken. It follows that $snap_q \not\supseteq snap_p$ and therefore from Lemma 4.2.4 it must be that $snap_p \supseteq snap_q$.
Theorem 3 The code in Figure 6 correctly implements a single-use write-and-snapshot, and each operation in it terminates in at most $n^2$ accesses to the shared memory and at most $n$ test&set operations on the test-and-set array.
Proof: From Lemma 4.2.1 it follows that the run of the implementation terminates after at most $n$ iterations of the while loop. The cost of the execution of Line 4 is $O(n)$, all the other lines have constant cost, and therefore the complexity of the implementation is $O(n^2)$ with at most $n$ operations on the test-and-set array, and it is obviously wait-free. To prove the correctness of the implementation we define a total order on the write&snapshot operations and prove that this total order is an extension of the natural partial order. The partial order is induced on the operations by the relation < which is defined as follows: $wss_q < wss_p$ iff $wss_q$ finishes before $wss_p$ starts. The total order is defined by the relation < which is: $wss_q < wss_p$ iff $wss_q \subseteq wss_p$. By Lemma 4.2.5 the total order agrees with the partial order.
Multi-use-Write-and-snapshot
Overview: The single-use solution of the previous section cannot be extended as is to the multi-use case because it requires knowing ahead of time the total number of operations that would ever be applied (and the complexity of the multi-use operation would depend linearly on that number). A multi-use immediate-snapshot algorithm would probably solve the problem; however, no such algorithm existed before this work, and the algorithm here provides the only solution that we know of to that problem. The implementation of Multi-use-write-and-snapshot presented here is a nontrivial extension of the previous section combined with the techniques of the proximate snapshot (Section 3.3).
Following [BG93b], the single-use solution was constructed around an algorithm to snapshot the names of the processes whose values should be returned ("participating set" in the jargon of [BG93b]). Then, these names were used as indices to the shared memory from which the values of the immediate snapshot were taken (i.e., indirect addressing). Here we use a similar idea. Each process owns an unbounded linear array of registers, each of which can hold one value (the array is called values in the code), and a pointer to the array (called my_inc in the code). Instead of overwriting the values in a single register, each process writes a new value by appending it in the next vacant place in its array and then incrementing its pointer by one (Lines 1 and 2 in the code in Figure 7). After writing a value in this way each process performs a write-and-snapshot on the array of pointers (Lines 3 to the end in Figure 7). The multi-use-write-and-snapshot then terminates by returning the values of each process by indirect addressing with the snapshotted pointer values. Henceforth, the rest of the overview considers the multi-use-write-and-snapshot of the pointer values (that are incremented by one in each operation).
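The append-then-increment discipline and the final indirect addressing step can be sketched as below. The class `AppendOnlyValues` and its method names are hypothetical; only the field names `values` and `my_inc` follow the names the text attributes to the code. The snapshot of pointers is taken trivially here, since the sketch is sequential and shows only the addressing scheme.

```python
class AppendOnlyValues:
    def __init__(self, n):
        self.values = [[] for _ in range(n)]  # values[p][k - 1] is p's k-th value
        self.my_inc = [0] * n                 # my_inc[p] counts p's writes

    def append(self, p, v):
        self.values[p].append(v)   # write into the next vacant place
        self.my_inc[p] += 1        # then advance the pointer by one

    def dereference(self, pointers):
        """Map a vector of pointer values to the values they point at."""
        return [self.values[q][k - 1] if k > 0 else None
                for q, k in enumerate(pointers)]


if __name__ == "__main__":
    store = AppendOnlyValues(2)
    store.append(0, "a")
    old = list(store.my_inc)       # a snapshot of the pointers: [1, 0]
    store.append(0, "b")
    store.append(1, "c")
    print(store.dereference(old), store.dereference(store.my_inc))
```

Since entries are never overwritten, an old pointer vector still dereferences to the values that were current when it was taken.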
Intuitively the algorithm presented here works as follows: The algorithm groups the write&snapshot(v) operations into disjoint sets of operations. The operations in each set are concurrent. Within each set the single-use-write-and-snapshot is used to order the operations in a legal order. While each operation locates the set with which it has to do the single-use-write-and-snapshot it also informs later sets of its new value and ensures that no earlier set observes the new value.
Let us first provide a high level description of the write-and-snapshot algorithm on the pointer values (each value is monotonically incremented by its process, like a counter). Imagine an unbounded linear array of cells. Associate with each cell a single-use-write-and-snapshot object, and n read/write words, one per process, in which each may register a pointer value. In each cell each process is registered with some pointer value. In each multi-use-write-and-snapshot operation a process increments its pointer value by one and finds one cell. The cell a process finds in an operation is higher in the array than the one it had found in the previous operation. However, in finding a cell, a process ensures two things: (1) that it is registered with the previous pointer value in all the cells from the one it had found in the previous operation (or the zero cell, if this is the first time) to the current one (exclusive) and (2) that in all higher cells it would be automatically registered with its new pointer value (until the next operation). After entering a cell a process performs a single-use-write-and-snapshot with all the other operations that have entered this cell. Then, from the set obtained by the single-use operation plus the registration information associated with the cell, it computes a set of pointer values to return as a snapshot. Thus the algorithm ensures that if in some operation process p enters a cell that is lower than a cell in which some operation of process q enters, then p is already registered in the cell that is captured by q, with p's new pointer value.
More specifically, each process p starts its operation by writing its input and incrementing the pointer (my_inc[p]) by one (Lines 1 and 2 in Figure 7). Since all the updates of the pointers are atomic write operations, they are totally ordered and may be sequentially enumerated by an outside observer. Moreover, each my_inc[] pointer can be viewed as a counter since it is incremented by one in each operation; therefore, the enumeration of the atomic increment points (of the my_inc[] pointers) associates with each updating point the summation of the pointer values at this point. We call this summation of a point the floor of the point, and with each such point we associate one cell (with a single-use-write-and-snapshot as described above). That is, the floor of one point is larger by one than that of the previous point. In other words, had we taken a snapshot of the my_inc[] pointer values instantly after an increment, then the summation of its entries is equal to the floor of this update. In the implementation we maintain an unbounded array of snapshots that reserves one slot for each such point (called SS[] in the code). This array parallels the array of cells, i.e., it associates one snapshot with each cell (these snapshots should not be confused with the ones returned by processes that exit from a cell).
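The enumeration by an outside observer can be made concrete with a small sequential sketch. The function name and the `schedule` argument (the order in which processes perform their atomic increments) are assumptions of this example. It records the snapshot that would be associated with every floor, whereas in the algorithm only the snapshots actually obtained by processes are written into SS[].

```python
def floors_of_increments(schedule, n):
    """Replay a sequence of atomic pointer increments and return the floor of
    each increment point together with the snapshot taken right after it."""
    my_inc = [0] * n
    SS = {}
    floors = []
    for p in schedule:
        my_inc[p] += 1            # atomic increment by process p
        f = sum(my_inc)           # floor of this point: sum of all pointers
        SS[f] = tuple(my_inc)     # instantaneous snapshot associated with floor f
        floors.append(f)
    return floors, SS


if __name__ == "__main__":
    floors, SS = floors_of_increments([0, 1, 0, 2], 3)
    print(floors)
    print(SS[3])
```

Each increment raises the sum by exactly one, so consecutive increment points get consecutive floors and every floor determines a unique snapshot.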
Clearly, many of these snapshots may not be observed and thus their corresponding slots in the array remain vacant throughout the algorithm execution. However, the first snapshot each process obtains in its operation (in Line 3) is a real snapshot produced by a procedure as in [AAD+93, And93] that is called from within proximate-snapshot. Hence, this snapshot is one of those enumerated above and the process writes it into SS[floor]. Since these are real instantaneous atomic snapshots, if two processes write a snapshot to the same location in the array, the two snapshots are identical.
After obtaining an initial snapshot with total value f, and writing it into SS[f], process p starts the search for the floor (cell) in which to perform single-use-write-and-snapshot. The snapshots that are recorded in the array SS are used by processes to obtain values for processes that they do not observe in the single-use-write-and-snapshot. Clearly, any atomic snapshot that would be recorded in higher floors (higher than f) would contain the new or later my_inc[p] pointer values of p and thus would be observed by processes that exit the algorithm from these floors. Also, if my_inc[p] - 1, the pointer value in the previous operation of p, appears in some snapshot in SS[l], in a floor l lower than f, then all the operations that exit from floor l and below would observe one of the previous values of my_inc[p] but not the new one. All that remains for p to do is to record its new my_inc[p] pointer value in all the floors below f in which it does not find itself recorded in a snapshot with the previous pointer value. Since we used the proximate-snapshot to compute f, it is guaranteed that p finds a floor in which it is recorded with its previous pointer value after checking no more than 2n floors from f down (see Lemma 4.4.3). In each floor that p is descending through, it performs two tasks: first it records its new my_inc[p] pointer value, and second, if it observes itself in the atomic snapshot that is associated with that floor with its previous pointer value, my_inc[p] - 1, then it participates in a single-use-write-and-snapshot that is associated with that floor. The key to our algorithm is that both tasks can be achieved by a single-use-write-and-snapshot. Thus, when arriving at a new floor the process performs single-use-write-and-snapshot to first record its new value in this floor, and this operation is used in case it turns out that it has to exit from that floor.
Note that the snapshots that are returned at the end of each operation are not necessarily snapshots that were returned by a proximate-snapshot or recorded in SS. The ... (note that the observed process could be p itself, Line 18 in Figure 7). At this point p has to compose the snapshot it returns from the write&snapshot operation. For each process q, p should choose one of two values to be included in the snapshot it returns: either the value with which q appears in the snapshot associated with floor f, or the value with which q is registered in floor f (in inc[f, q]). For each process that p observes in the single-use-write&snapshot it chooses the value with which that process is registered. For all other processes it chooses the value with which they appear in the associated snapshot (in SS[f]).
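The composition rule in the last three sentences can be written as a one-line selection. This is a sketch only: the function name and argument names are hypothetical, with `observed` standing for the set of processes seen in the single-use-write&snapshot at floor f, `inc_f` for the registered values inc[f, q], and `ss_f` for the snapshot SS[f].

```python
def compose_snapshot(observed, inc_f, ss_f):
    """For each process q, take the value registered at the floor if q was
    observed in the single-use operation, otherwise the value in SS[f]."""
    return [inc_f[q] if q in observed else ss_f[q] for q in range(len(ss_f))]


if __name__ == "__main__":
    # Processes 0 and 2 were observed; process 1 was not.
    print(compose_snapshot({0, 2}, inc_f=[3, 0, 5], ss_f=[2, 4, 4]))
```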
The formal code of the algorithm is given twice, in Figure 7 and in Figure 8. The proofs use the code given in Figure 7. The reason for presenting Figure 8 is that the code there is modular and uses the single-use-write-and-snapshot as a building block, and is therefore shorter. However, for the proofs we found it more appropriate to use the full expanded version without the modular calls.
Correctness
Definition 18 In this section we make the following definitions: $wss_p^i$ denotes the $i$-th write&snapshot operation of process $p$.

The proof of this lemma (given below) follows the proofs of the single-use-write-and-snapshot, since all the processes that reach a floor access a single-use-write-and-snapshot object that is associated with that floor.

... by the relation < defined as follows: $wss_q^j < wss_p^i$ iff $wss_q^j$ finishes before $wss_p^i$ starts. The total order < is defined as follows: $wss_q^j < wss_p^i$ iff $capture^{-1}(q, j) < capture^{-1}(p, i)$ (i.e., iff $(f', t') < (f, t)$ lexicographically, where $capture_{(f,t)} = (p, i)$ and $capture_{(f',t')} = (q, j)$). By Lemma 4.4.9 the total order agrees with the partial order.
Next we prove that the snapshots returned by the write&snapshot operations satisfy Definition 14. Requirement (1) of the definition follows from the code that assigns $i$ to $R[p]$ (which is $R_{(f,t)}[p]$). Requirement (2) follows from Lemma 4.4.8. Requirement (3) follows from Lemma 4.4.10.
Multi-use Immediate-snapshot
An interesting derivative of our algorithm is a multi-shot algorithm for the Immediate-snapshot problem. In [BG93b] Borowsky and Gafni presented a single-shot implementation of Immediate-snapshot and left the question of a multi-shot implementation open. As was explained before, in the implementation of the multi-use write-and-snapshot we use multi-use immediate-snapshot. Each set of processors that was returned by the immediate-snapshot is then diffracted into individual processes by using test-and-set registers. Since we were unable to employ the multi-use immediate-snapshot as a true black box in the implementation of the multi-use write-and-snapshot, we present the code of the multi-use immediate-snapshot separately in this subsection and in Figure 9. Note that the code is exactly the same as the code given in Figure 7 after omitting Lines 15, 16, and 26 (Line 26 is just a closing delimiter).
The proof of correctness is essentially the same as the proof of the multi-use write-and-snapshot implementation. The definition of $capture_p^i$ should be changed to: $capture_p^i$ is the pair of indices [floor, level[p, floor]] of process $p$ when it returns from its $i$-th immediate-snapshot operation. Then the lemmas of the previous subsection should be modified to take care of the case $(f', t') = (f, t)$.
5 Bounding the number of registers in the Atomic-write-and-snapshot algorithm
Several entities in the Atomic-write-and-snapshot algorithm are unbounded. These are:
1. The number of registers used (i.e., the number of floors in the array of cells).
2. The number of test-and-set registers used.
3. The size (number of bits) of values written in the different registers, e.g., floor and my_inc[].
In this section we describe how the first two entities may be bounded. Bounding the third entity, the size of the values, was left for further research.
The basic principle in bounding the number of registers used is that each process can compute a bounded set of floor indices such that only these indices in the different arrays owned by that process may be accessed by any other process. We then describe our method, and finally the method is extended to bound the size of T[] as well.
Instead of holding one multi-writer SS[] array, each process maintains a private single-writer multi-reader copy of the array. Since each entry in the array records a real atomic snapshot, and for the same floor index (which is the summation of the snapshot entries) any process observes the same snapshot, all processes that record a snapshot at some floor value record the same snapshot at this floor value. Thus to read a snapshot at SS[f] a process should read entry f in each of the n new arrays. A read that finds a snapshot is serialized at its last atomic read, and a read that does not find a snapshot in any of the n arrays is serialized at the first atomic read.
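A minimal sketch of this read follows. The names `record` and `read_ss` and the use of one dictionary per process as its private copy are assumptions of this example, as is the choice to stop at the first copy that holds a snapshot for floor f.

```python
def record(copies, p, snapshot):
    """Process p records a snapshot in its own copy, at floor sum(snapshot)."""
    copies[p][sum(snapshot)] = tuple(snapshot)


def read_ss(copies, f):
    """Read SS[f] by scanning entry f of every process's private copy."""
    for copy in copies:
        snap = copy.get(f)
        if snap is not None:
            return snap        # all copies that hold floor f hold this snapshot
    return None                # no process has recorded floor f


if __name__ == "__main__":
    n = 3
    copies = [dict() for _ in range(n)]
    record(copies, 2, (2, 1, 0))   # only process 2 recorded floor 3
    print(read_ss(copies, 3), read_ss(copies, 4))
```

Because the floor is the sum of the snapshot entries and snapshots with equal sums are identical, it does not matter which copy answers the read.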
The basic idea in bounding the swmr arrays is that each process may have information that might be needed by another process in at most $6n$ different indices of the arrays. Moreover, each process can compute at any point in time the set of these $6n$ indices for each other process. This computation is based on the following observations:
1. To bound the number of entries that need to be preserved by process $q$ in array values[q, :] we use the above technique. Since the number of entries that process $q$ has information about and that might still be accessed by some process is $4n^2$, the number of input values that $q$ needs to preserve is also bounded by $4n^2$. This follows because with each floor index value that $q$ needs to preserve at most one of $q$'s input values is associated. Thus for each of the $4n^2$ indices that $q$ preserves it also preserves the corresponding input value.
Since each array is now kept in a bounded size memory, we need to provide a method by which, if a certain index is given, the corresponding entry is found. In order to be able to locate an entry we keep with each entry its original index value in the unbounded array. A trivial method to access an entry at index i in array A would then be to scan the entire memory of array A. However, a more efficient method would be to use a hashing function.
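One way to realize the hashing idea is a fixed-capacity table in which every slot stores the entry's original index next to its value. The sketch below uses linear probing; the class name, the probing scheme, and the omission of entry eviction (which the bounding argument would require once an index is no longer needed) are all assumptions of this example.

```python
class BoundedArray:
    """Fixed-capacity store addressed by the original (unbounded) index."""

    def __init__(self, capacity):
        self._slots = [None] * capacity   # each slot holds (original_index, value)

    def put(self, index, value):
        cap = len(self._slots)
        for step in range(cap):
            s = (hash(index) + step) % cap
            if self._slots[s] is None or self._slots[s][0] == index:
                self._slots[s] = (index, value)
                return
        raise MemoryError("table full")

    def get(self, index):
        cap = len(self._slots)
        for step in range(cap):
            s = (hash(index) + step) % cap
            if self._slots[s] is None:
                return None               # empty slot ends the probe sequence
            if self._slots[s][0] == index:
                return self._slots[s][1]
        return None


if __name__ == "__main__":
    table = BoundedArray(4)
    table.put(10, "a")
    table.put(14, "b")                    # collides with 10 and probes onward
    print(table.get(10), table.get(14), table.get(18))
```

Storing the original index in the slot is what lets a lookup distinguish the requested entry from another entry that hashed to the same place.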
It remains to bound the number of test-and-set registers used. To do this we employ a method that we have developed in [Wei94, AWW93], in which a collection of $n^2/2$ 2-process swap registers are used to implement the test-and-set registers. In that method (see Section 3.1 in [AWW93], or Chapter 3 in [Wei94]) one swap register is placed between each two processes. To perform a test&set operation a process has to swap the index of the test-and-set register it accesses into each of the swap registers it shares with the other processes. If in any of the swaps it detects that that index was already swapped in by the competing process, it loses and returns from the test&set operation. (The order in which processes perform the swap operations with each other is very important and is defined in [AWW93, Wei94, AGTV92].) Following Observations 4 and 5 above, each process q knows a set of $6n$ floor values in which another process p may compete with it. Therefore, when process q swaps into the swap register which it shares with p (in a test&set operation), it swaps in the indices of all the test-and-set registers that correspond to these floor values and that it won in the past, and the index of the one it tries to win currently. If the swap operation returns a vector that contains the index in which the process is trying to win, then the process loses and returns from the test&set operation. Otherwise the process wins the test&set. Since in the Atomic-write-and-snapshot algorithm n test-and-set registers are associated with each floor, each swap register has to hold at most $6n^2$ indices. It is possible to reduce the number of indices kept in each swap register from $6n^2$ to $6n$ by keeping only one index for each floor; however, the many details of this implementation are beyond the scope and space of this paper.
Conclusions and Implications
The commuting objects constructed in this paper are part of the common2 class of objects introduced in [AWW93]: The class of objects, called common2, contains (1) any read-modify-write object that applies commutative functions, and (2) any read-modify-write object that applies overwrite functions. In [Her91a] it is shown that common2 is contained in the class of objects with consensus number 2. The class common2 includes most of the commonly used objects with consensus number 2. In [Wei94, AWW93] we present implementations of any read-modify-write synchronization object that applies overwriting functions, in a system with an arbitrary number of processors, from any other shared object whose consensus number is 2 or more.
Thus, together with [Wei94, AWW93] this paper provides a completeness theorem for the common2 class of objects. That is, we provide a reduction from any object in common2 to any other object in the class. Moreover, we show that any object with consensus number 2 can implement any object in common2 (in a polynomial number of steps per operation). We have thus resolved Herlihy's open question ("Can fetch-and-add implement any object having consensus number two in a system of three or more processes?") with respect to the class common2. Our result has three additional implications: First, we show that there are fault tolerant implementations for n-process commuting objects from any set of such objects that contains some faulty objects. This is the result of combining the constructions of this paper with the constructions for two processes given in [AGMT92, JCT92].
The second implication of our result is a set of new randomized implementations for commuting objects from read/write registers. This is the result of combining the constructions given herein with the known randomized constructions of 2-process consensus and test-and-set [ADS89, Asp90, AH90, SSW91, Her91b, AGTV92]. Some of those constructions are more efficient than what was previously known, e.g., $O(n^2)$ steps for fetch-and-increment and $O(n^3)$ steps for fetch-and-add, as opposed to $O(n^4)$ for either one in [Her91b].
The third implication of our results is low contention implementations of commuting synchronization objects. This follows from the fact that the basic building blocks of all our constructions are 2-process synchronization objects and read/write registers. Thus, if any of those are implemented by a critical section (e.g., in a system with only read/write registers) then, within our implementation, at most two processes will contend on each critical section. Moreover, since all our implementations are wait-free, the slowdown or failure of one critical section will not affect the progress of any process aside from the two connected with the faulty section.
Low contention constructions were also provided in [AHS94, HSW91], where fetch-and-increment is constructed by a counting network of balancers. A balancer is an object whose consensus number is two, and which essentially behaves as fetch-and-increment mod 2. The contention level of our implementations is different from that of the counting networks. Though each balancer in the counting network has only two ports, the two processors that access each of these ports are different at different points in the execution, while in our implementations always the same process accesses each port of a two-processor building block. Furthermore, we conjecture that it is impossible to adapt the concept of counting networks to implement functions such as fetch-and-add (e.g., by replacing each balancer with a more sophisticated object).
This work leaves several open questions. First, to extend the class common2 to include more or all of the objects with consensus number 2, or to prove that the class common2 is included in but not equal to the class of objects with consensus number 2 (an object has consensus number 2 if 2 processes can reach consensus by using any number of this object and any number of read/write registers). More specifically, can common2 include objects that may apply either commutative functions or overwrite functions (e.g., fetch-and-add with a set operation)? The question of the optimality of any of the above constructions in time and/or space is also open.
