We investigate the time and space complexity of detecting and preventing ABAs in shared memory algorithms for systems with n processes and bounded base objects. To that end, we define ABA-detecting registers, which are similar to normal read/write registers, except that they allow a process q to detect with a read operation whether some process wrote the register since q's last read. ABA-detecting registers can be implemented trivially from a single unbounded register, but we show that they have a high complexity if base objects are bounded: An obstruction-free ABA-detecting single bit register cannot be implemented from fewer than n − 1 bounded registers. Moreover, bounded CAS objects (or more generally, conditional read-modify-write primitives) offer little help to implement ABA-detecting single bit registers: We prove a linear time-space tradeoff for such implementations. We show that the same time-space tradeoff holds for implementations of single bit LL/SC primitives from bounded writable CAS objects. This proves that the implementations of LL/SC/VL by Anderson and Moir [2] as well as Jayanti and Petrovic [15] are optimal.
INTRODUCTION
Since the beginning of shared memory computing, programmers and researchers have had to deal with the ABA problem: Even though a process retrieves the same value twice in a row from a shared memory object, it is still possible that the value of the object has changed multiple times.
Especially algorithms using the standard Compare-and-Swap (CAS) primitive seem to be susceptible. A CAS object provides two operations: Read() returns the value of the object, and CAS(x,y) changes the value of the object to y provided that its value v prior to the operation equals x, and it returns v. (According to some specifications a CAS(x,y) returns a Boolean, which is True if and only if the CAS() succeeded, i.e., it wrote y.) Often, CAS() objects are used in the following way: First, a process p reads the value x stored in the CAS object, then it performs some computation, and finally it tries to propagate the result of the computation by performing a CAS(x,y). The idea is that if another process has already updated the data structure, p's CAS() should fail, and so inconsistencies are avoided. However, if multiple successful CAS() operations have occurred and the value of the object has changed back to x, p's CAS() might still succeed, possibly yielding inconsistencies.
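The read-compute-CAS pattern and its failure mode can be illustrated by a small sequential simulation. This is only a sketch: the class `CASObject` and the function `aba_interleaving` are hypothetical names chosen for illustration, and the steps of the two processes p and q are interleaved by hand instead of being run concurrently.

```python
class CASObject:
    """Sequential model of a CAS object; each method call is one atomic step."""

    def __init__(self, initial):
        self.value = initial

    def read(self):
        return self.value

    def cas(self, expected, new):
        # Returns the value held before the operation, as in the specification above.
        old = self.value
        if old == expected:
            self.value = new
        return old


def aba_interleaving():
    obj = CASObject("A")
    x = obj.read()                   # p reads A and starts its computation
    assert obj.cas("A", "B") == "A"  # q changes A -> B
    assert obj.cas("B", "A") == "B"  # q changes B -> A
    prior = obj.cas(x, "C")          # p's CAS succeeds although the object changed twice
    return prior == x, obj.read()
```

In this interleaving p's final CAS cannot tell that q modified the object twice, which is exactly the situation described above.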
ABAs are also a problem for algorithms using other strong primitives, or even only registers. For example, in mutual exclusion algorithms processes often busy-wait for certain events to happen by repeatedly reading the same register. In systems with caches, the cost of waiting is small, because as long as no process changes the register value, all reads are cache hits. The event is signaled by other processes through a change in the register value. But it may also be desirable to eventually reset the register to the state it had before the event was signaled, in order to be able to reuse it. This may result in the ABA problem, and as a consequence waiting processes may miss events. Therefore, algorithm designers have to devise more complicated code in order to avoid unnoticed cache misses, or even lack of progress. Many shared memory algorithms and data structures have to deal with the ABA problem. Often this is done in an ad-hoc, application specific way [31], or solutions are based on tagging [10, 19, 23, 24, 25, 27, 28, 29] (see below). Other papers combine tagging and memory management techniques, or suggest both as alternatives [10, 18].
Tagging, introduced by IBM [14], requires augmenting an object with a tag (which is sometimes called sequence number) that gets changed with every write operation. This technique avoids the ABA problem only if tags never repeat. Therefore, theoretically, an infinite number of tags and thus base objects of unbounded size are required. One may argue that, in practice, for reasonably large base objects, a system will never run out of tags. However, this is unrealistic in cases where the tag has to be stored together with other information in the same object. In some cases, it is possible to store the tag in a separate object (e.g., [15]), but this requires technically difficult algorithms and tedious correctness proofs. Some architectures like the IBM System/370 [14] introduced a double-width CAS primitive, which allows one of two (32-bit) words to be used for storing tags. While using bounded tags does not completely avoid the ABA problem (because tag values may wrap around), it has been argued [24, 25, 28, 29] that an erroneous algorithm execution due to an unexpected ABA becomes very unlikely. From a theoretical perspective this is unsatisfactory. Moreover, for practical applications, it is often necessary to use the entire object space (today usually comprising 64 bits) for data, so the tagging technique requires double-width atomic instructions. Those are not supported by most mainstream architectures [20].
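Tag wrap-around can be made concrete with a small sequential model. The sketch below assumes a hypothetical cell that stores a value together with a `tag_bits`-bit tag; the names `TaggedCell` and `writes_until_repeat` are illustrative and not taken from any of the cited works.

```python
class TaggedCell:
    """Value stored together with a bounded tag; sequential model only."""

    def __init__(self, value, tag_bits):
        self.value = value
        self.tag = 0
        self.modulus = 1 << tag_bits   # number of distinct tags

    def read(self):
        return (self.value, self.tag)

    def write(self, value):
        # Every write changes the tag; with a bounded tag the counter wraps around.
        self.value = value
        self.tag = (self.tag + 1) % self.modulus


def writes_until_repeat(tag_bits):
    """Count writes of the same value until a reader sees its old (value, tag) pair again."""
    cell = TaggedCell("A", tag_bits)
    seen = cell.read()
    count = 0
    while True:
        cell.write("A")
        count += 1
        if cell.read() == seen:
            return count
```

With a b-bit tag, 2^b writes of the same value bring the pair (value, tag) back to what an earlier reader saw, so that reader cannot tell that any write occurred.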
ABAs cause problems in algorithms that use some form of memory management, where a pointer to some memory space may change its value in an ABA fashion. In this context, memory reclamation techniques based on reference counting [32] , Hazard pointers [20, 21] , the repeat-offender problem technique [12] , or the memory reclamation technique introduced in [1] deal with the ABA problem. But those techniques are application specific.
A more methodological approach has been followed by research that showed how a load-linked store-conditional (LL/SC) object can be implemented from CAS objects and registers. Such an object provides two operations, LL() and SC(), where LL() returns the current value of the object. SC(x) may either fail and not change anything, or succeed and write the value x to the object. Specifically, an SC(x) operation by process p succeeds if and only if no other SC() operation succeeded since p's last LL(). A Boolean return value of an SC() operation indicates its success (True) or failure (False). An extended specification also allows for a VL() (verify-link) operation, which does not change the state of the object, but it returns False if a successful SC() has been performed since the calling process' last LL(), and True otherwise. LL/SC (or LL/SC/VL) objects can in almost all cases replace CAS objects in algorithms, and are an effective way of avoiding the ABA problem. Unfortunately, existing multiprocessor systems only provide weak versions of LL/SC that restrict programmers severely in how they can use the objects [26] , and hence they "offer little or no help with preventing the ABA problem" [22] .
For that reason, a line of research has been dedicated to finding time and space efficient LL/SC implementations from CAS objects and registers [2, 5, 15, 16, 20, 22, 26]. While many of those solutions are wait-free and often even guarantee constant time execution of each LL() and SC() operation, they still have drawbacks: Existing implementations either require unbounded tags (e.g., [26]) and thus use unbounded CAS objects or registers, or they need at least linear space. Jayanti and Petrovic [15] and Anderson and Moir [2] presented the most space efficient implementations of an LL/SC object from bounded CAS objects and registers that achieve constant step-complexity: they use only one CAS object but require Θ(n) registers. This raises the question whether time efficient implementations of LL/SC from a smaller number of bounded CAS objects and registers may exist. More generally, in order to understand the power and limits of shared memory primitives, it seems important to learn how much time and space is required to avoid or detect ABAs, and not to restrict this question to the implementation of LL/SC objects from CAS objects and registers.
CAS and LL/SC objects have a consensus number of infinity [11] , while registers have a consensus number of one. Therefore, it is impossible to implement wait-free LL/SC from registers or other objects with a bounded consensus number. Time and space lower bounds for implementations of LL/SC objects may not necessarily imply that it is the ABA problem that is hard to solve, but such lower bounds may follow inherently from other properties of the LL/SC specification.
Results.
To investigate the complexity of detecting or preventing ABAs, we define a natural object, the ABA-detecting register. It supports two operations, DRead() and DWrite(). Operation DWrite(x) writes value x to the register, and returns nothing. Operation DRead() by process p returns, in addition to the value of the register, a Boolean flag, which is True if and only if some process executed a DWrite() since p's last DRead() operation. We distinguish between single-writer ABA-detecting registers, where only one dedicated process is allowed to call DWrite(), and multi-writer ones that do not have this restriction.
A wait-free ABA-detecting register can be implemented from registers, and thus has consensus number 1. (Therefore, ABA-detecting registers are weaker with respect to wait-freedom than CAS or LL/SC.) Using a single unbounded register with an unbounded tag that gets changed whenever some process writes to it, it is trivial to obtain an ABA-detecting register with constant time complexity. But if base objects have only bounded size, the situation is completely different: For implementations of ABA-detecting registers in a system with n processes and bounded registers, we obtain a linear (in n) space lower bound, even if the implementation satisfies only nondeterministic solo-termination (the nondeterministic variant of obstruction-freedom), which is a progress condition strictly weaker than wait-freedom. The availability of CAS seems to be of little help: For wait-free implementations from CAS objects and registers we obtain a time-space tradeoff that is linear in n. The same asymptotic time-space tradeoff is obtained if the base objects support arbitrary conditional read-modify-write operations [7]. Each conditional operation can be simulated by a single operation on a writable CAS object, i.e., an object that supports a Write() operation in addition to Read() and CAS(). For that reason we state the lower bound for implementations from conditional read-modify-write operations in terms of writable CAS base objects.
Theorem 1. Any linearizable implementation of a single-writer 1-bit ABA-detecting register from m bounded base objects satisfies:
(a) m ≥ n − 1, if the base objects are bounded registers, and the implementation satisfies nondeterministic solo-termination;
(b) m ≥ (n − 1)/t, if the base objects are bounded CAS objects and registers, and the implementation is deterministic and wait-free with worst-case step-complexity at most t; and
(c) m ≥ (n − 1)/(2t), if the base objects are bounded writable CAS objects, and the implementation is deterministic and wait-free with worst-case step-complexity at most t.
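To get a feeling for the three bounds of Theorem 1, consider the purely illustrative values n = 65 and t = 4 (these numbers are chosen for this example only):

```latex
\begin{align*}
\text{(a)}\quad & m \ge n - 1 = 64,\\
\text{(b)}\quad & m \ge \frac{n-1}{t} = \frac{64}{4} = 16,\\
\text{(c)}\quad & m \ge \frac{n-1}{2t} = \frac{64}{8} = 8.
\end{align*}
```

Read in the other direction, an implementation from a single base object (m = 1) must have worst-case step-complexity t ≥ n − 1 under part (b) and t ≥ (n − 1)/2 under part (c).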
The requirement that base objects are bounded is necessary for this lower bound, because, as mentioned earlier, an ABA-detecting register can be trivially obtained by augmenting a normal register with an unbounded tag.
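One way the unbounded-tag construction might look is sketched below as a sequential model. The concrete tag format, a pair of writer ID and that writer's local write counter, is an assumption made here so that a DWrite() needs only one write step even with several writers; the class and method names are hypothetical.

```python
class UnboundedTagRegister:
    """ABA-detecting register from one register holding (value, tag); sequential model."""

    def __init__(self, initial, n):
        self.cell = (initial, (-1, 0))         # the single base register: (value, tag)
        self.seq = [0] * n                     # local write counter of each process
        self.last_seen = [(-1, 0)] * n         # tag seen by each process's last DRead

    def dwrite(self, pid, x):
        self.seq[pid] += 1                     # unbounded local counter
        self.cell = (x, (pid, self.seq[pid]))  # one write step; the tag is never reused

    def dread(self, pid):
        value, tag = self.cell                 # one read step
        changed = tag != self.last_seen[pid]
        self.last_seen[pid] = tag
        return value, changed
```

Because Python integers are unbounded, the tags in this model never repeat; with a bounded counter the comparison in `dread` could miss a write after a wrap-around.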
There is a simple implementation of a (bounded) ABA-detecting register with constant step-complexity from a single (bounded) LL/SC/VL object of the same size: Each process uses a local variable old. To DWrite(x), the process executes an LL() operation followed by an SC(x). To DRead(), the process first executes a VL(). If VL() returns True, the process returns (old, False); otherwise, it reads the value of the LL/SC/VL object into old (by executing LL()), and then returns (old, True). It is not hard to see that this implementation is linearizable. Thus, by reduction we obtain the same lower bound as the one stated in Theorem 1 for implementations of single bit LL/SC/VL. Unfortunately, for that reduction the VL() operation is needed, and we do not know how to obtain a similarly efficient ABA-detecting register from an LL/SC object that does not support VL(). However, the proofs of Theorem 1 can be easily modified to accommodate LL/SC objects:
Corollary 1. Any linearizable implementation of a single bit LL/SC object from m bounded base objects satisfies
(a) m ≥ (n − 1)/t, if the base objects are bounded CAS objects and registers, and the implementation is deterministic and wait-free with worst-case step-complexity at most t; and
(b) m ≥ (n − 1)/(2t), if the base objects are bounded writable CAS objects, and the implementation is deterministic and wait-free with worst-case step-complexity at most t.
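The reduction from an LL/SC/VL object to an ABA-detecting register described before Corollary 1 can be sketched as follows. The class `LLSCVL` is only a sequential model of the object's semantics; its version counter is a modelling aid and not a bounded implementation. Two assumptions are made here: a process that has not yet called LL() is treated as linked to the initial state, and each local variable old starts with the initial value.

```python
class LLSCVL:
    """Sequential model of an LL/SC/VL object (not a concurrent implementation)."""

    def __init__(self, initial):
        self.value = initial
        self.version = 0      # number of successful SCs; modelling aid only
        self.link = {}        # process ID -> version observed by its last LL

    def ll(self, pid):
        self.link[pid] = self.version
        return self.value

    def vl(self, pid):
        # Assumption: without a prior LL the process is linked to the initial state.
        return self.link.get(pid, 0) == self.version

    def sc(self, pid, x):
        if self.link.get(pid, 0) != self.version:
            return False
        self.value = x
        self.version += 1
        return True


class ABADetectingRegister:
    """DRead/DWrite built on top of one LL/SC/VL object, following the text."""

    def __init__(self, initial, n):
        self.obj = LLSCVL(initial)
        self.old = [initial] * n          # local variable old of each process

    def dwrite(self, pid, x):
        self.obj.ll(pid)
        self.obj.sc(pid, x)

    def dread(self, pid):
        if self.obj.vl(pid):
            return (self.old[pid], False)
        self.old[pid] = self.obj.ll(pid)
        return (self.old[pid], True)
```

In this model a reader that sees the value 0 before and after two writes (0 to 1 and back to 0) still gets the flag True on its next DRead(), and False on the one after that.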
A linear space lower bound (corresponding to Part (a) of Theorem 1) for nondeterministic solo-terminating implementations of LL/SC from (even unbounded) registers follows from the fact that LL/SC objects are perturbable [17]. As in Theorem 1, the assumption that base objects are bounded is necessary, because there is an implementation of an LL/SC/VL object from a single unbounded CAS object with constant step complexity by Moir [26]. Our time-space tradeoff is asymptotically tight for implementations with constant step-complexity, as it matches known upper bounds [2, 15]. We show that it is also asymptotically tight for implementations using a single CAS object:
Theorem 2. A single bounded CAS object suffices to implement a bounded LL/SC/VL object or a bounded multi-writer ABA-detecting register with O(n) step-complexity.
These results raise the question, whether bounded CAS objects are helpful for ABA detection. We determine that for this problem bounded CAS objects do not provide additional benefits over bounded registers:
There is a linearizable wait-free implementation of a multi-writer b-bit ABA-detecting register from n + 1 registers of b + 2 log n + O(1) bits each, with constant step complexity.
Not only do our lower bounds show that Anderson's and Moir's [2] as well as Jayanti's and Petrovic's [15] implementations of LL/SC from CAS objects and registers are optimal with respect to their time-space product, but they also clearly indicate that ABA detection is inherently difficult, even if bounded conditional read-modify-write primitives such as (writable) CAS objects are available. Therefore, other primitives that provide a solution to the ABA problem would most likely be as difficult to obtain as LL/SC. Our upper bounds demonstrate that bounded CAS objects (and in fact any conditional read-modify-write operations) are not more helpful than bounded registers with respect to ABA detection. On the other hand, ABA detection is difficult only if base objects are bounded, but for our lower bounds it does not matter how large that bound on the size of the base objects is, as long as it is finite.
Other Related Work.
Our lower bounds use covering arguments. Covering arguments were first used by Burns and Lynch [4] to prove a space lower bound for mutual exclusion, and essentially all space lower bounds are based on this technique. Examples are space lower bounds for one-time test-and-set objects [30] , consensus [8] , timestamps [6, 9] , and the general class of perturbable objects [17] (which includes LL/SC among others). These lower bounds have in common that they do not apply if CAS objects are available as base objects. (They allow for registers, swap objects, and, in case of [17] , resettable consensus.) An overview of covering arguments can be found in Attiya's and Ellen's recent textbook [3] .
In our time-space tradeoffs we construct executions in which a sequence of operations by a process p is interleaved with successful CAS() and Write() operations of other processes, so that p's steps remain "hidden". Such a technique has also been used by Ellen, Hendler, and Shavit [7] to prove linear space lower bounds for wait-free implementations of visible objects implemented from conditional read-modify-write (i.e., writable CAS) objects. Visible objects include counters, queues, stacks, or snapshots. Neither ABA-detecting registers nor LL/SC objects are visible, because they can be implemented from a single unbounded CAS object. In fact, we are not aware of any other non-trivial lower bounds that, like ours, separate bounded from unbounded base objects.
Preliminaries.
We consider a system with n processes with unique IDs in 𝒫 = {0, . . . , n − 1}. Processes communicate through shared memory operations, called steps, that are executed on atomic base objects provided by the system. Each process executes a possibly nondeterministic program. If processes are deterministic, a schedule is a sequence of process IDs that determines the order in which processes execute their steps. If processes are nondeterministic, a schedule is a sequence of process IDs together with coin-flips, and it describes the order in which processes take steps together with the nondeterministic decisions they make. The sequence of shared memory steps taken by processes is called an execution. A history on some implemented object is the sequence of method call invocations and responses that occur in an execution on that object. A configuration describes the state of the system, i.e., of all processes and all base objects.
Our implementations are deterministic and wait-free, which means that every method call terminates within a finite number of the calling process' steps, in any execution. The step-complexity of a deterministic wait-free method is the maximum number of steps a process needs to terminate the method call in any execution. Our lower bounds hold for implementations that satisfy a progress condition which is strictly weaker than wait-freedom: A nondeterministic method m satisfies nondeterministic solo-termination, if for every process p and every configuration C in which a call of method m by p is pending, there is a p-only execution that starts in C and during which p finishes method m. For deterministic algorithms, nondeterministic solo-termination is the same as obstruction-freedom. Our algorithms are linearizable [13] , but our lower bounds work for much weaker correctness conditions.
LOWER BOUNDS
For a configuration C and a schedule σ, let Exec(C, σ) denote the execution arising from processes taking steps, starting in configuration C, in the order defined by σ, and using the nondeterministic decisions defined by σ, if the algorithm is nondeterministic. Let Conf(C, σ) denote the configuration resulting from execution Exec(C, σ) started in C.
For two configurations C and D and a schedule α, we write $C \xrightarrow{\alpha} D$ to indicate that Conf(C, α) = D. Let Cinit denote the initial configuration. If there exists a schedule α such that $C \xrightarrow{\alpha} D$, then we say D is reachable from C, and if D is reachable from Cinit, we simply say D is reachable.
An execution E or a schedule α is P-only for a set P ⊆ {0, . . . , n − 1} of processes, if only processes in P take steps during E or α, respectively. If P = {p} consists of a single process, then we sometimes write p-only instead of {p}-only.
For an execution E, let ≺E denote the happens-before order on operations in E, i.e., op ≺E op′ (op happens before op′) if and only if operation op responds in E before op′ gets invoked. We simply write ≺ instead of ≺E, if it is clear from the context which execution E the relation refers to. For a schedule α, an execution E and a process p, E|p and α|p denote the sub-sequences of steps by p in E and in α, respectively.
Two configurations C and D are indistinguishable to process p, if every register has the same value in C as in D, and p is in the same state in both configurations. We write C ∼p D to denote that C and D are indistinguishable to p. We write C ∼S D for a set S of processes to denote that C ∼p D for every process p ∈ S. We say process p is idle in configuration C, if it has no pending method call, and if all processes are idle, then the configuration is quiescent. A process completes a method call in an execution E, if that method terminates in E.
For our lower bounds, we do not require that the implementation of the ABA-detecting registers is linearizable. Instead, we consider methods WeakRead() and WeakWrite() that take no arguments, and where WeakRead() returns a Boolean value, and WeakWrite() returns nothing. A correct concurrent implementation of these methods must guarantee for every execution, that a WeakRead() operation r by process p returns True if and only if there exists a WeakWrite() operation w such that w happens before r and every other WeakRead() operation by p happens before w.
Linearizability of an ABA-detecting register R guarantees that the operations R.DRead() (in place of WeakRead()) and R.DWrite() (in place of WeakWrite()) satisfy the correctness properties above. Therefore, every lower bound on the time and/or space complexity for correct implementations of those methods implies the same lower bound for linearizable ABA-detecting registers.
Let p be some process and C a configuration. We say C is p-clean, if there exists a schedule α with $C_{init} \xrightarrow{\alpha} C$, such that Exec(Cinit, α) contains a complete WeakRead() operation r* by p, and every WeakWrite() happens before r*. Configuration C is p-dirty, if there exists a schedule α with $C_{init} \xrightarrow{\alpha} C$, such that Exec(Cinit, α) contains a complete WeakWrite() operation w* such that no WeakRead() by p is pending at any point after w* has been invoked. Note that some configurations are neither p-dirty nor p-clean.
Throughout this section we assume that each process executes an infinite program, in which it repeatedly calls WeakRead() and WeakWrite() methods. More specifically, process 0 repeatedly executes WeakWrite(), while every process in {1, . . . , n − 1} repeatedly calls WeakRead().
Then in a p-only execution starting from a configuration C, the first WeakRead() operation by p returns False if C is p-clean and True if C is p-dirty. Therefore, each process must be able to distinguish p-clean configurations from p-dirty ones:
Observation 1. Suppose the WeakRead() method satisfies nondeterministic solo-termination. For any process p ∈ {1, . . . , n − 1} and any two reachable configurations C1, C2, if C1 is p-clean and C2 is p-dirty, then C1 ≁p C2, i.e., C1 and C2 are not indistinguishable to p.
A Space Lower Bound for Implementations from Bounded Registers
Let R be a set of k registers and P a set of processes. We say the processes in P cover R in configuration C, if for each register R ∈ R there is a process in P that is poised to write to R. A block-write to R is an execution in which k processes participate, and each of them takes exactly one step in which it writes to a distinct register in R. (The only block-write to ∅ is the empty execution.)
In the following we assume an implementation of methods WeakRead() and WeakWrite() from m bounded registers. The register configuration of a configuration C is an m-tuple, reg(C) = (v1, . . . , vm), where vi is the value of the i-th register.
Proof of Lemma 1. The proof is by induction on k. If k = 0, we let α be the empty schedule, and the lemma is immediate because Conf(Q, α) = Q is a quiescent configuration (so 0 is idle). Now suppose we have proved the inductive hypothesis for some integer k < n − 1. Let β = (p1, . . . , pk) be the schedule in which each of p1, . . . , pk takes exactly one step. Let Q0 = Q. By the inductive hypothesis there is a schedule α1 such that in C1 := Conf(Q0, α1) a set R1 of exactly k registers is covered, and process 0 is idle. Hence, Exec(C1, β) is a block-write to R1 yielding a configuration D1 = Conf(C1, β). We let γ1 be the schedule such that in Exec(D1, γ1) first each process in {p1, . . . , pk} takes enough unobstructed steps to finish its WeakRead() method call, and after that process 0 takes enough unobstructed steps to complete exactly one WeakWrite() method. Then Q1 = Conf(D1, γ1) is quiescent, and during Exec(D1, γ1) exactly one complete WeakWrite() gets executed. Repeating this construction (using the inductive hypothesis repeatedly) we obtain a schedule α1βγ1α2βγ2α3 . . . and configurations Q0, C1, D1, Q1, C2, D2, Q2, . . . and sets of k registers R1, R2, . . . , such that for any i ≥ 1:
• Qi is quiescent;
• during Exec(Di, γi) process 0 executes a complete WeakWrite() operation; and
• in Ci process 0 is idle and Ri is covered by Pk (and thus Exec(Ci, β) is a block-write to Ri).
Since the number of registers is finite, and all registers are bounded, there exist indices 1 ≤ i < j such that reg(Di) = reg(Dj). Let σ = γiαi+1βγi+1αi+2 . . . αjβ, i.e.,
$C_i \xrightarrow{\beta} D_i \xrightarrow{\sigma} D_j$. (1)
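The existence of indices i < j with reg(Di) = reg(Dj) is a pigeonhole argument: m registers of b bits each admit at most (2^b)^m register configurations, so any longer sequence must repeat one. The sketch below illustrates this on a made-up sequence; `toy_register_configs` is a hypothetical stand-in for reg(D1), reg(D2), . . . and does not model the actual construction.

```python
def first_repeat(reg_configs):
    """Return indices (i, j), i < j, of the first repeated register configuration, or None."""
    seen = {}
    for j, cfg in enumerate(reg_configs, start=1):
        if cfg in seen:
            return seen[cfg], j
        seen[cfg] = j
    return None


def toy_register_configs(m, bits, length):
    """Hypothetical sequence of m-tuples over b-bit registers (illustration only)."""
    size = 1 << bits
    configs = []
    for step in range(length):
        x = (3 * step) % (size ** m)  # arbitrary deterministic evolution
        configs.append(tuple((x // size ** r) % size for r in range(m)))
    return configs
```

For m = 2 one-bit registers there are four register configurations, so a sequence of five configurations necessarily contains a repeat, while a sequence of four need not.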
Now let λ′ be a pk+1-only schedule such that in Exec(Ci, λ′) process pk+1 completes exactly one WeakRead() method call. By the nondeterministic solo-termination property, such a schedule λ′ exists. Let λ be the prefix of λ′ such that Exec(Ci, λ) ends when pk+1 is poised to write to a register R ∉ Ri for the first time, or λ = λ′ if pk+1 finishes its WeakRead() method call without writing to a register outside of Ri. First assume λ ≠ λ′, i.e., in Exec(Ci, λ) process pk+1 does not finish its WeakRead() method call, but instead the execution ends when pk+1 covers a register R ∉ Ri. Since in Ci process 0 is idle and Ri is covered by Pk, and since λ is pk+1-only, in configuration Conf(Ci, λ) = Conf(Q, α1βγ1 . . . αiλ) processes p1, . . . , pk+1 cover k + 1 registers, and process 0 is still idle. This completes the proof of the inductive step for α = α1βγ1 . . . αiλ.
Now we consider the case λ = λ′, i.e., during Exec(Ci, λ) process pk+1 finishes its WeakRead() method call without writing to a register outside of Ri. To complete the proof of the lemma, it suffices to show that this case cannot occur.
Since in Ci the processes in Pk cover Ri, and pk+1 only writes to registers in Ri during Exec(Ci, λ), it follows that Exec(Ci, λβ) ends with a block-write by Pk in which all writes by pk+1 get obliterated. In particular, for D′i := Conf(Ci, λβ) we have
reg(D′i) = reg(Di). (2)
Hence, since schedule σ is (Pk ∪ {0})-only, i.e., pk+1 does not participate, we obtain Exec(D′i, σ) = Exec(Di, σ), and in particular, using Eq. (1), for D′j := Conf(D′i, σ),
$D'_i \xrightarrow{\sigma} D'_j$ and reg(D′j) = reg(Dj). (3)
Now recall that we chose i and j in such a way that reg(Di) = reg(Dj). Thus, from Eq. (2) and (3) we get reg(D′i) = reg(Di) = reg(Dj) = reg(D′j). Because $D'_i \xrightarrow{\sigma} D'_j$ (Eq. (3)), and since by construction only processes {0, p1, . . . , pk} appear in σ, pk+1 is in D′i in exactly the same state as in D′j. Hence,
D′i ∼pk+1 D′j. (4)
Now recall that $C_i \xrightarrow{\lambda\beta} D'_i$, and in the corresponding execution process pk+1 executes a complete WeakRead() method, while process 0 takes no steps, and pk+1 is idle in D′i. Hence, D′i is pk+1-clean. On the other hand, Exec(D′i, σ) = Exec(Di, σ) starts with a complete WeakWrite() operation (during the prefix Exec(D′i, γi)) by process 0, while process pk+1 takes no steps, and thus remains idle. It follows that the configuration resulting from that execution, D′j, is pk+1-dirty. Summarizing, we have two reachable configurations, D′i and D′j, where one of them is pk+1-clean and the other one is pk+1-dirty, and both are indistinguishable to pk+1, according to Eq. (4). This contradicts Observation 1.
A Time-Space Tradeoff for Implementations from CAS Objects
We now consider deterministic wait-free implementations of WeakRead() and WeakWrite() from m writable bounded CAS objects. We assume without loss of generality that every CAS(x,y) operation satisfies x ≠ y. (A CAS(x,x) operation can be replaced by a Read().)
For any configuration C and any shared CAS object R let CCov(C, R) and WCov(C, R) denote the sets of processes that are poised in C to execute a CAS() operation and a Write() operation on R, respectively. Let P be a set of processes and C a configuration. A schedule β is called P-successful for C, if it contains every process in P exactly once, and every step of Exec(C, β) is either a Write() or a successful CAS(). If a configuration C has a P-successful schedule β, then we also say execution Exec(C, β) is P-successful.
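The definition of a P-successful schedule can be checked mechanically on a simplified model of a configuration. In the sketch below a configuration is reduced to the object values and to the single step each process is poised to execute; this encoding and the function name `is_successful` are assumptions made for illustration.

```python
def is_successful(values, poised, procs, schedule):
    """Check whether `schedule` is P-successful for a simplified configuration.

    values:   dict mapping object name -> current value
    poised:   dict mapping process -> ("write", obj, v), ("cas", obj, x, y) or ("read", obj)
    procs:    the set P
    schedule: sequence of process IDs
    """
    if sorted(schedule) != sorted(procs):
        return False                      # every process of P must occur exactly once
    values = dict(values)                 # simulate on a copy
    for p in schedule:
        step = poised[p]
        if step[0] == "write":
            _, obj, v = step
            values[obj] = v
        elif step[0] == "cas":
            _, obj, x, y = step
            if values[obj] != x:
                return False              # a failed CAS() is not allowed
            values[obj] = y
        else:
            return False                  # Read() steps are not allowed
    return True
```

For instance, if process 1 is poised to execute CAS(0,1), process 2 CAS(1,2), and process 3 Write(0) on the same object with initial value 0, then the orders (1, 2, 3) and (3, 1, 2) are successful, whereas (2, 1, 3) is not, because the CAS() by process 2 fails.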
As before, we assume that all processes run an infinite loop, where process 0 repeatedly calls WeakWrite() while all other processes repeatedly call WeakRead().
Lemma 2. Let P ⊆ 𝒫 \ {0} and q ∈ 𝒫 \ P, q ≠ 0, where 𝒫 = {0, . . . , n − 1} is the set of all processes. Let C be a configuration, in which either q is idle, or in no execution starting from C process q executes more than t shared memory steps before finishing a pending WeakRead() call. If β is a P-successful schedule for C, then there is a schedule σ such that
Conf(C, β) ∼𝒫\{q} Conf(C, σ), (5)
and at least one of the following is the case:
(a) in Conf(C, σ) process q is idle;
(b) in Conf(C, σ) process q is poised to write to some object R and |WCov(C, R) ∩ P| < t; or
(c) in Conf(C, σ) process q is poised to execute a CAS(x,y) operation on some object R, where x is the value of R in configuration Conf(C, σ), and |WCov(C, R) ∩ P| + |CCov(C, R) ∩ P| < t.
Proof. We prove the lemma by induction on t. If t = 0, then q is idle in C. Hence, for σ = β we obtain Eq. (5) and Case (a).
Let opq be the step q is poised to execute in C, and let V be the object affected by opq. Further, let valC (V ) denote the value of V in configuration C.
Case 1: First, assume that opq is a Read() or a CAS() operation. Let z be the first process in P that executes a step in Exec(C, β)|V , and let opz be that step. We construct a two-step schedule λ that contains q and z, such that
First suppose opz is a CAS(a, b) operation and opq a CAS(x, y) operation that would succeed in C (i.e., x = valC(V)). Then we define λ = (z, q). Since β is P-successful, the CAS(a, b) by z in configuration C succeeds and changes the value of V from a to b. In this case, x = a ≠ b, so in the execution Exec(C, λ) the CAS(x, y) by q fails. Eq. (6) follows. In all other cases (i.e., if opz is a Write() operation or opq is a Read() or a CAS() that fails in C), we let λ = (q, z). Then in Exec(C, λ) either operation opq does not change the value of object V, or opz executes a Write() and overwrites any changes that may have resulted from opq. It follows that Eq. (6) is true. Now let β′ = β|(P \ {z}), and recall that C′ = Conf(C, λ). Since β is P-successful and in Exec(C, β) process z executes the first step on V, it follows that β′ is (P \ {z})-successful in C′.
Hence, we can apply the inductive hypothesis for configuration C′, process set P′ = P \ {z}, and schedule β′, to obtain a schedule σ′ that satisfies one of the Cases (a)-(c). Let σ = λ • σ′. Then by construction, Conf(C, σ) = Conf(C′, σ′) ∼𝒫\{q,z} Conf(C′, β′) = Conf(C, β). Because of Eq. (6), process z can also not distinguish between Conf(C, σ) and Conf(C, β), so we obtain Eq. (5). If (a) of the inductive hypothesis applies for C′ and σ′, then the same also applies for C and σ, because Conf(C, σ) = Conf(C′, σ′). Now suppose that Case (b) applies for C′ and σ′. Let R be the object on which process q is poised to execute a Write() in Conf(C′, σ′) = Conf(C, σ). Starting from configuration C′, process q must finish its WeakRead() method within t′ = t − 1 steps. Hence, |WCov(C′, R) ∩ P′| ≤ t′. Since all processes other than z are poised to execute the same step in C′ as in C, we have |WCov(C, R) ∩ P| ≤ |WCov(C′, R) ∩ P′| + 1 ≤ t′ + 1 = t. Hence, Case (b) follows for C, σ and P. With exactly the same argument, if Case (c) applies for C′, σ′, and P′, then it also applies for C, σ, and P.
Case 2: We now assume that in C process q is poised to execute a Write() operation opq on object V . If |WCov(C, V ) ∩ P | < t, we let σ = β. Then Eq. (5) and Case (b) (for R = V ) of the lemma are trivially satisfied.
Hence, assume that |WCov(C, V) ∩ P| ≥ t. Then Exec(C, β) contains at least t writes to V. Let z1, . . . , zℓ−1 be the processes accessing V in this order in Exec(C, β)|V before the first write to V occurs, and let zℓ be the first process writing to V. Let Z = {z1, . . . , zℓ}, λ = (z1, . . . , zℓ−1, q, zℓ), γ = β|Z = (z1, . . . , zℓ) and β′ = β|(P \ Z). Then in Exec(C, γ • β′) all processes in P execute exactly one step, as they do in Exec(C, β), and for each object U the steps executed on U occur in the same order in both executions. Hence, processes cannot distinguish these executions from each other, and in particular
Conf(C, γ • β′) = Conf(C, β). (7)
In Exec(C, λ), first processes z1, . . . , zℓ−1 execute successful CAS operations on V, then q writes to V, and finally, zℓ overwrites what q has written. It follows that
Combining this with Eq. (7) we obtain that β′ is P-successful in C′. Moreover, since q executed one step in the execution leading from C to C′, in any execution starting from C′ it finishes its WeakRead() method after at most t′ = t − 1 steps. Thus, we can apply the inductive hypothesis to obtain a schedule σ′ such that Conf(C′, β′) ∼_{P \ {q}} Conf(C′, σ′), and one of Cases
(i) all processes in P \ P_k are idle in C;
(ii) there is a P_k-successful schedule for C; and
(iii) |WCov(C, R) ∩ P_k| ≤ t and |CCov(C, R) ∩ P_k| ≤ t for all objects R.
Proof. Throughout this proof let P_k^0 denote the set P_k ∪ {0}. We prove the lemma by induction on k. For k = 0, we let α be the empty schedule, so C = Conf(Q, α) = Q. Then C is quiescent and (i) is true. Statements (ii)-(iii) follow immediately from P_k = ∅. Now suppose the inductive hypothesis is true for some value of k ∈ {0, …, n − 2}. We let Q0 = Q and P_{k+1} = P_k ∪ {p_{k+1}} for an arbitrary process p_{k+1} ∈ P \ P_k^0. Then, for i = 1, 2, … we iteratively construct schedules αi, βi, γi and configurations Qi, Ci, Di, where Ci = Conf(Qi−1, αi), Di = Conf(Ci, βi), and Qi = Conf(Di, γi),
and αi, βi, γi are determined as follows: αi is a P_k^0-only schedule that guarantees properties (i)-(iii) from the inductive hypothesis for configuration Qi−1; βi is a P_k-successful schedule for Ci; and γi is an arbitrary P_k-only schedule followed by a 0-only schedule such that Qi is quiescent, and where Exec(Di, γi) contains exactly one complete WeakWrite() operation by process 0. By the assumption that WeakRead() and WeakWrite() are wait-free, γi exists.
We define for each configuration Ci a signature, sig(Ci), which encodes for every process p the exact shared memory operation p is poised to execute next (including its parameters), and for every base object R its value.
Since there is only a finite number of bounded base objects in the system, there is a finite number of signatures, and thus there exist 1 ≤ i < j such that Ci and Cj have the same signature. We let λ = α1β1γ1α2 … αi−1βi−1γi−1, and λ′ = γiαi+1βi+1γi+1 … αj−1βj−1γj−1. From the construction, Qi−1 = Conf(Q, λ), Ci = Conf(Qi−1, αi), Di = Conf(Ci, βi), Qj−1 = Conf(Di, λ′), and Cj = Conf(Qj−1, αj),
where Qi−1 and Qj−1 are quiescent, Ci satisfies (i)-(iii) from the inductive hypothesis, and sig(Ci) = sig(Cj). This situation, as well as the following construction, is depicted in Figure 1. Now we apply Lemma 2 to configuration Ci. For the purpose of applying this lemma, we may assume that in Ci process p_{k+1} has just invoked a WeakRead() operation but not yet executed its first shared memory step during that operation. Hence, in all executions starting from Ci, p_{k+1} will finish that pending WeakRead() operation in at most t steps. Then Lemma 2 yields a P_{k+1}-only schedule σi such that for D′i = Conf(Ci, σi)
and one of the Cases (a)-(c) of Lemma 2 holds. Let C′j = Conf(D′i, λ′αj), and D′j = Conf(C′j, βi). (The use of βi instead of βj is intentional.) Then Ci = Conf(Qi−1, αi), D′i = Conf(Ci, σi), C′j = Conf(D′i, λ′αj), and D′j = Conf(C′j, βi).
Since λ′αj does not contain p_{k+1}, which is the only process that, according to Eq. (9), may be able to distinguish Di from D′i, we obtain
Configurations Ci and Cj have the same signature. Therefore, every process is poised to execute the same step in Ci as in Cj, and all objects have the same values in both configurations. This, together with the fact that βi is P_k-only and every process appears at most once in βi, implies
Exec(Ci, βi) = Exec(Cj, βi) = Exec(C′j, βi), (11)
where the second equality follows from Eq. (10).
Hence, all objects have the same value in Di = Conf(Ci, βi) as in D′j = Conf(C′j, βi), and thus from Eq. (9), all objects have the same value in D′i as in D′j. Since p_{k+1} does not appear in λ′αjβi, and thus takes no step in the execution leading from D′i to D′j, we conclude
Now recall that σi is the schedule σ guaranteed by Lemma 2 (applied with C = Ci and q = p_{k+1}), and the lemma guarantees one of the three Cases (a)-(c). First, assume Case (a) occurs. Then p_{k+1} completes a WeakRead() method call in Exec(Ci, σi) (recall that in Ci it had just invoked that method call) and is idle in D′i = Conf(Ci, σi). Since process 0 takes no steps in Exec(Ci, σi), it follows that D′i is p_{k+1}-clean. On the other hand, Exec(D′i, λ′αjβi) contains no steps by p_{k+1}, but instead a complete WeakWrite() by process 0. Hence, D′j = Conf(D′i, λ′αjβi) is p_{k+1}-dirty. But this contradicts Observation 1, because according to Eq. (12) process p_{k+1} cannot distinguish D′i from D′j. Hence, we know that Case (a) from Lemma 2 cannot apply. Now, suppose that instead Case (b) or (c) applies. We show that statements (i)-(iii) of the lemma are true for α = λαiσiλ′αj and C = Conf(Q, α) = C′j. By the inductive hypothesis (i), in Cj all processes in P \ P_k are idle, so from Eq. (10) it follows that in C′j all processes in P \ P_{k+1} are idle. This proves (i).
According to Cases (b) and (c) of Lemma 2, in configuration D′i (and thus also in C′j and D′j) process p_{k+1} is poised to execute an operation op that is either a Write() or a CAS(x,y) on some object R*. Moreover, in case op is a CAS(x,y), in configuration D′i object R* has value x. Then Eq. (12) implies that the value of R* is also x in configuration D′j, and in particular, if p_{k+1} executes CAS(x,y) in that configuration, that CAS() succeeds. We conclude that in the execution Exec(C′j, βi • p_{k+1}) process p_{k+1} takes exactly one step, which is either a Write() or a successful CAS(x,y). By construction, Exec(Ci, βi) is P_k-successful, and so Eq. (11) implies that Exec(C′j, βi) is also P_k-successful. It follows that Exec(C′j, βi • p_{k+1}) is P_{k+1}-successful, which proves statement (ii). Finally, since in C′j process p_{k+1} is poised to execute operation op on R*, and all other processes are poised to execute exactly the same step as in Ci, we have: In case op is a Write(), WCov(C′j, R*) = WCov(Ci, R*) ∪ {p_{k+1}}, and in case op is a CAS(), CCov(C′j, R*) = CCov(Ci, R*) ∪ {p_{k+1}}. All other sets WCov(·, ·) and CCov(·, ·) are the same for Ci as for C′j. Therefore, Cases (b) and (c) of Lemma 2 together with the inductive hypothesis (iii) immediately imply |WCov(C′j, R) ∩ P_{k+1}| ≤ t and |CCov(C′j, R) ∩ P_{k+1}| ≤ t for all objects R. This proves (iii) and completes the inductive step.
Parts (b) and (c) of Theorem 1 follow immediately from this lemma. Replacing each WeakWrite() with an LL()/SC() pair by process 0, and adjusting the definitions of p-clean and p-dirty configurations accordingly, we obtain Corollary 1.
UPPER BOUNDS

Constant-Time ABA-Detecting Registers from Registers
Figure 2 depicts an optimal linearizable implementation of an ABA-detecting register from n + 1 bounded registers with constant step complexity. We use two more registers than needed according to the lower bound in Theorem 1 (a). The main idea of the algorithm is similar to one used in the multi-layered construction of LL/SC/VL from CAS by Jayanti and Petrovic [15], which itself is a modified version of the implementation by Anderson and Moir [2].
Here, we briefly discuss the implementation. The correctness proof is omitted due to space restrictions. We use a shared bounded register X that stores a triple (x, p, s), where x ∈ D, p ∈ P is a process ID, and s ∈ {0, …, 2n + 1} is a sequence number. We also use a shared announce array A[0 … n − 1], where only process q can write to A[q]. Each array entry A[q] stores a pair (p, s), where p ∈ {0, …, n − 1} is a process ID and s ∈ {0, …, 2n + 1} is a sequence number. Register X is initialized to (⊥, ⊥, ⊥) and all entries of A are initialized to (⊥, ⊥).
In a DWrite(x) operation, the calling process p first determines a suitable sequence number, s, using the helper method GetSeq(), and writes the triple (x, p, s) to X (lines 1-2). Method GetSeq() ensures that the sequence number s it returns satisfies the following: If there is any point at which X = (·, p, s) and A[q] = (p, s) for some process q, then p will not use sequence number s again in any following DWrite() call until A[q] ≠ (p, s). To achieve that, in a sequence of n consecutive GetSeq() calls process p scans through the entire announce array, reading one array entry with each GetSeq() call. It then returns a sequence number that p has not used in its preceding n DWrite() method calls, and which it has not found in any array entry of A[] when it last read that entry.
For its DRead() operation each process q uses a local variable, b, that indicates whether a DWrite() operation linearized already during q's previous DRead() operation after that operation's linearization point. Our algorithm ensures that if b = True at the beginning of a DRead() operation, then such a DWrite() has happened. In a DRead() operation, a process reads X twice to obtain the triples (x, p, s) in line 13 and (x′, p′, s′) in line 16, and it announces the pair (p, s) in A[q] between these two reads. First suppose q reads two different triples from X in line 13 and line 16, i.e., (x, p, s) ≠ (x′, p′, s′). Then the DRead() operation will linearize with the first read of X (line 13). We now know that the value of X has changed between the linearization point and the response of q's DRead(). Hence, q sets flag b to indicate that its next DRead() should return a pair (·, True). If (x, p, s) = (x′, p′, s′), then, on the other hand, it is ensured that A[q] = (p, s) at the point when q read (x, p, s) from X in line 16. As explained above, in this case the pair (p, s) will not be used again in any following DWrite() operation until q has replaced its announcement (p, s) with a new one. Hence, q resets b because in the following DRead() operation, q will be able to detect any DWrite() that has happened in between by comparing A[q] with the corresponding pair stored in X.
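The announce-and-compare idea described above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the pseudo-code of Figure 2: the class name `ABADetectingRegister`, the per-process bookkeeping fields (`recent`, `seen`, `cursor`), and the exact rule by which `dread` computes its flag are hypothetical choices made for illustration, and the code makes no attempt to model concurrent interleavings.

```python
from collections import deque


class ABADetectingRegister:
    """Sequential sketch of an ABA-detecting register for n processes.

    Assumption: every operation runs atomically here, so the two reads
    of X inside dread() always agree and the flag b stays False.
    """

    def __init__(self, n):
        self.n = n
        self.X = (None, None, None)        # triple (value, writer id, seq)
        self.A = [(None, None)] * n        # announce array, A[q] = (p, s)
        # Hypothetical local state of each process:
        self.recent = [deque(maxlen=n) for _ in range(n)]  # last n seqs used
        self.seen = [[None] * n for _ in range(n)]  # own seqs seen in A
        self.cursor = [0] * n              # next announce entry to scan
        self.b = [False] * n               # per-process flag b

    def get_seq(self, p):
        # Read one announce entry per call, so n calls scan all of A.
        q = self.cursor[p]
        owner, s = self.A[q]
        self.seen[p][q] = s if owner == p else None
        self.cursor[p] = (q + 1) % self.n
        # At most 2n numbers are blocked, and 2n + 2 numbers exist.
        blocked = set(self.recent[p])
        blocked.update(s for s in self.seen[p] if s is not None)
        return next(s for s in range(2 * self.n + 2) if s not in blocked)

    def dwrite(self, p, x):
        s = self.get_seq(p)
        self.recent[p].append(s)
        self.X = (x, p, s)

    def dread(self, q):
        x, p, s = self.X                   # first read of X
        old = self.A[q]
        self.A[q] = (p, s)                 # announce the pair just read
        second = self.X                    # second read of X
        # A write is reported if b was set or the announced pair differs.
        changed = self.b[q] or old != (p, s)
        self.b[q] = (x, p, s) != second
        return x, changed
```

In this sketch, writing the same value twice still produces a fresh (writer, sequence number) pair, so a reader whose announcement holds the old pair reports a change even though the value itself is unchanged.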
LL/SC/VL from a Single Bounded CAS
We now briefly sketch our wait-free implementation of LL/SC/VL from a single bounded CAS object. The implementation has O(n) step complexity, and thus, by Corollary 1, is optimal. The pseudo-code is presented in Figure 3 . The correctness proof is omitted due to space restrictions.
In a CAS object X, we store a pair (x, a), where x represents the value of the implemented LL/SC/VL object, and a is an n-bit string. The p-th bit of a is used to indicate whether an SC() operation linearized since p's last LL() (the bit is usually set in this case). As in the previous algorithm, we use a local variable b for each process p. In an LL() call, a process p tries to reset its bit (the p-th bit of the second component) of X. As we explain below, this may fail, but only if an SC() sc linearizes during p's attempts to reset that bit. If that happens, p sets the flag b and its LL() linearizes before sc. Thus, in a subsequent SC() or VL(), p determines from the set flag b that it does not have a valid link, and that SC() or VL() can fail, even though p's bit in X is not set.
More precisely, in an LL() method call, process p reads the pair (x, a) from X (line 39) and checks whether its bit in a is set. If not, in lines 41-42 it simply resets b (because in subsequent SC() or VL() calls p's bit in X will indicate whether p has a valid link or not) and returns x. That LL() operation linearizes with the Read() of X in line 39. Now suppose p's bit in X is set. Then p tries to reset that bit, using a CAS() operation on X. However, that CAS() may fail because of some other process' successful CAS() during an LL() or SC() call. Therefore, p repeatedly reads X and then performs a CAS() to reset its bit in the second component of X, until its CAS() operation succeeds, or until it has failed n times (lines 45-46). If a CAS() succeeds, p resets b and returns the first component of X that it read just before its last, successful CAS() attempt (lines 47-48); the LL() linearizes with that CAS(), and since p's bit in X is now reset, in the next SC() or VL() operation p can use its bit in X to determine whether an SC() linearized since the linearization point of the current LL(). If p's CAS() fails n times, then X must have changed n times since p's first Read() of X. We argue that then at least one such change must be due to a CAS() operation during some process' SC(): Suppose not. Then each of these at least n changes must be due to a CAS() executed in an LL() operation. But this is not possible, because each time such a CAS() succeeds, one of the bits in the second component of X changes from 1 to 0, and p's bit does not change at all, so at most n − 1 such changes can occur. We conclude that at least once, while p has been trying to reset its bit in X, a successful CAS() on X must have occurred during an SC() operation. As we discuss below, this means that a successful SC() linearized. Hence, in this case p can set its flag b to True (line 49), which guarantees that p's next SC() or VL() will fail, and return in line 50 the value x it read at the very beginning from X.
The linearization point of that LL() is the Read() of X in line 39.
In an SC(y) operation a process p first checks flag b, and if it is set, p immediately returns False; this indicates that an SC() linearized during p's last LL() but after the linearization point of that LL(). If b is not set, then p reads X to determine whether its bit in X is set, and if yes, it can also return False (lines 28-30), because this indicates that some other SC() has linearized since p's last LL(). If p's bit in X is not set, then p tries to write (y, 2^n − 1) into X using a CAS() operation (line 31). If that CAS() succeeds, the value of the LL/SC/VL object changes to y, and the bits of all processes are now set in X. Hence, p's SC(y) linearizes with that successful CAS(), and p returns True (line 32). If the CAS() fails, then p repeats up to n times, until either it finds that its bit in X is set (and thus some other process' SC() succeeded), or its own CAS() succeeds. If p's CAS() fails n times, then for the same reasons as explained earlier, we know that some process' SC() must have linearized during p's ongoing SC(y) operation, and thus p can return False (the unsuccessful SC() linearizes with its response).
Operation VL() is very simple: A process simply checks whether flag b or its bit in X is set, and if yes, it returns False, otherwise it returns True.
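The LL(), SC(), and VL() operations described in this section can be sketched as follows. This is a minimal single-threaded illustration, not the pseudo-code of Figure 3: the class names `CASObject` and `LLSC`, the Boolean-returning `cas` method, and the choice to initialize all n bits to 1 (so that no process holds a valid link before its first LL()) are assumptions made for the example.

```python
class CASObject:
    """Toy CAS object; cas() returns True iff the swap took place."""

    def __init__(self, value):
        self.value = value

    def read(self):
        return self.value

    def cas(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False


class LLSC:
    """Sketch of LL/SC/VL from one CAS object holding a pair (x, a)."""

    def __init__(self, n, initial=None):
        self.n = n
        self.all_set = (1 << n) - 1        # the bit string 2^n - 1
        # Assumption: all bits start set, so nobody has a valid link.
        self.X = CASObject((initial, self.all_set))
        self.b = [False] * n               # local flag b of each process

    def ll(self, p):
        mask = 1 << p
        first, a = self.X.read()
        if not a & mask:                   # bit already reset
            self.b[p] = False
            return first
        for _ in range(self.n):            # at most n attempts
            x, a = self.X.read()
            if self.X.cas((x, a), (x, a & ~mask)):
                self.b[p] = False
                return x
        # n failed attempts: some SC() must have linearized meanwhile.
        self.b[p] = True
        return first

    def sc(self, p, y):
        if self.b[p]:
            return False
        mask = 1 << p
        for _ in range(self.n):
            x, a = self.X.read()
            if a & mask:                   # another SC() succeeded
                return False
            if self.X.cas((x, a), (y, self.all_set)):
                return True                # all bits are set again
        return False

    def vl(self, p):
        if self.b[p]:
            return False
        _, a = self.X.read()
        return not (a & (1 << p))
```

A successful `sc` sets every bit, which invalidates all outstanding links at once; this is why a later `sc` by another process fails even if the stored value has meanwhile returned to the one that process originally read.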
