The emergence of systems with non-volatile main memory (NVM) increases the interest in the design of recoverable concurrent objects that are robust to crash-failures, since their operations are able to recover from such failures by using state retained in NVM. Of particular interest are recoverable algorithms that, in addition to ensuring object consistency, also provide detectability, a correctness condition requiring that the recovery code can infer if the failed operation was linearized or not and, in the former case, obtain its response.
Introduction
Byte-addressable non-volatile main memory (NVM) combines the performance benefits of conventional (volatile) main memory with the durability of secondary storage. Systems where non-volatile memory co-exists with volatile main memory already exist and are expected to become more prevalent in the future. This increases the interest in the crash-recovery model, where a failed process may be resurrected by the system following a crash. Traditional log-based recovery techniques can be applied correctly in such systems but fail to take full advantage of the parallelism and efficiency that may be gained by allowing processing cores to concurrently access recovery data directly from NVM, rather than by performing slow block transfers from secondary storage. Consequently, there is increasing interest in the design of recoverable concurrent objects that are robust to crash-failures, since their operations are able to recover from such failures by using state retained in NVM (see e.g. [3, 9, 11, 10, 12, 19, 20] ).
Of particular interest are recoverable algorithms that, in addition to ensuring object consistency, also provide detectability [9]. Detectability requires that the code recovering from a failed operation can infer whether it was linearized and, if so, obtain its response. Several recent works presented detectable algorithms [3, 4, 9]. In particular, both Ben-David et al. [4] and Attiya et al. [3] presented detectable CAS algorithms, and [3] also presented a detectable read/write object. All these algorithms augment the arguments of recoverable operations with unique identifiers that allow the recovery code to detect whether or not the failed operation was linearized, and consequently incur unbounded space complexity. In addition, [3] proved that every lock-free detectable test-and-set implementation from (non-recoverable) test-and-set objects must use unbounded space. This raises the question of whether unbounded space complexity is inherent to nonblocking detectable implementations of these objects. We provide a negative answer to this question by presenting the first nonblocking bounded-space detectable CAS and read/write algorithms. Both algorithms are wait-free. Our N-process bounded-space CAS algorithm uses Θ(N) bits in addition to those storing the CAS object's value. In our second contribution, we show that every obstruction-free detectable CAS implementation, assuming values from a domain of size at least N, must have Ω(2^N) different reachable shared-memory configurations, thus establishing that our CAS algorithm's space complexity is asymptotically optimal.
Detectable algorithms often require auxiliary state that helps them infer where in the execution the failure occurred. Informally, auxiliary state is information that is provided to the recoverable operation that is not provided to (nor required by) the "original" (non-recoverable) operation. In some works, it is assumed that this information is provided by the system. For example, the recoverable mutual exclusion algorithms presented by Golab and Hendler [10] assume a model in which the system provides to each operation an epoch number whose value increases after each (systemwide) failure. Some detectable algorithms presented by Attiya et al. [3] assume that the system provides to the recovery code information identifying the instruction that the failed operation was about to execute via a non-volatile variable. However, auxiliary state is not necessarily provided by the system. For example, the read/write algorithm of [3] , the CAS algorithm of Ben-David et al. [4] and the queue algorithm of Friedman et al. [9] rely on auxiliary state (e.g. unique identifiers) passed to recoverable operations via their arguments by the operations that invoke them.
We show that, for a large class of objects that includes read/write, CAS and FIFO queue objects, any obstruction-free detectable implementation must receive auxiliary state. As we prove, this auxiliary state must be made available to recoverable operations either via their arguments or via a non-volatile variable accessible by them whose value must be modified outside the operation. In contrast, this external support is, in general, not required if the recoverable algorithm is not detectable.
The rest of the paper is organized as follows. We describe the system model in Section 2. We then present our bounded-space detectable read/write and CAS algorithms in Sections 3 and 4, respectively. In Section 4 we also prove a lower bound on the space complexity of detectable CAS. This is followed by a proof that detectable implementations of a large class of objects require auxiliary state in Section 5. The paper is concluded by a short discussion in Section 6.
System Model
A set P of N asynchronous crash-prone processes communicate through shared objects. The system provides base objects (also called shared variables or registers) that support atomic read, write, and read-modify-write primitive operations. Base objects are used to implement higher-level concurrent objects by defining algorithms, for each process, which use primitive operations to carry out the operations of the implemented object.
The state of the system consists of non-volatile shared-memory variables and per-process local variables stored in its local volatile cache. Local variables are accessed only by the process to which they belong. For presentation simplicity, we assume that each process p may own non-volatile private variables that reside in the NVM but are accessed only by p. We also assume the abstract private cache model [4, 18] , in which all shared variables are always persistent and there is no shared cache. In this model, primitive operations to shared variables are applied directly to the NVM. At any point during the execution of an operation, a system-wide crash-failure (or simply a crash) may occur, which resets the local variables of all processes to their initial values, but preserves the values of all non-volatile variables.
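The crash semantics described above can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions: the names `Process`, `System` and `crash` are hypothetical and do not come from the paper, and dictionaries stand in for volatile caches and for NVM.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Process:
    """One process: volatile locals plus private non-volatile variables."""
    initial_locals: Dict[str, Any]
    local: Dict[str, Any] = field(default_factory=dict)        # volatile
    private_nvm: Dict[str, Any] = field(default_factory=dict)  # survives crashes

    def __post_init__(self) -> None:
        # Locals start at their initial values.
        self.local = dict(self.initial_locals)


@dataclass
class System:
    """Shared non-volatile memory plus the set of processes."""
    shared_nvm: Dict[str, Any]
    processes: Dict[int, Process]

    def crash(self) -> None:
        """System-wide crash: reset all locals, keep every non-volatile value."""
        for proc in self.processes.values():
            proc.local = dict(proc.initial_locals)
```

After `crash()`, only `local` changes; `shared_nvm` and each `private_nvm` are untouched, which is the property the recovery functions rely on.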
As we explain in Section 6, all our results hold also in the more realistic shared-cache model. In this model, in addition to per-process private caches, there is a single (volatile) shared cache. Primitive operations to shared variables are applied to this cache and explicit persistency instructions may be required for guaranteeing that values written to this cache get persisted to the NVM in the correct order [18] .
To start executing an operation Op, a process p invokes Op. We say that Op completes once control returns to the caller of Op. Before completing, Op returns a response value, which is stored to a local variable of p. The response value is lost if p crashes before persisting it (i.e., writing it to a non-volatile variable). We say that a process is idle if it is not in the midst of executing any operation. Each recoverable operation Op of a shared object is associated with a recovery function, denoted Op.Recover, which is responsible for inferring whether Op was linearized and, if so, for obtaining its response. Op.Recover is performed by p in order to recover from a failure that occurred while p was executing Op. We assume that Op.Recover is called with the same arguments as those with which Op was invoked when the crash occurred. If Op.Recover infers that Op was not yet linearized, it returns a special fail value; otherwise it returns Op's response.
Our lower bounds (Theorems 1 and 2) only require the model assumptions specified above. However, as we prove in Theorem 2, detectable algorithms must receive auxiliary state whose value is modified either by the operation's caller or by the system. We therefore make the following additional assumptions that are used by the algorithms we present. Each process p is associated with a private non-volatile structure Ann_p consisting of three fields. Ann_p.op stores the type of recoverable operation currently performed by p, as well as the arguments with which it was called. It is accessed only by the caller of the recoverable operation Op, which sets its value (thus announcing the operation it is about to perform) immediately before invoking Op. Which function (if any) should be invoked by p in order to recover from a failure is determined according to the value of Ann_p.op. Field Ann_p.resp stores the response of the recoverable operation and is initialized to ⊥ immediately before Op is invoked. The third field, Ann_p.CP, may be used by recoverable operations and recovery functions for managing checkpoints in their execution flow. Field Ann_p.CP is set to 0 by the caller of the recoverable operation immediately before invoking it. Ann_p.CP can be read and written by recoverable operations and their recovery functions and is used by p in order to record (in the NVM) the fact that the execution reached a certain point. The recovery function can then use this information in order to correctly recover and to avoid re-execution of critical instructions.
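The caller-side protocol around the announcement structure can be sketched as follows. The names `Ann`, `announce_and_invoke` and `recover`, the `BOTTOM` sentinel standing in for ⊥, and the dictionary of recovery functions are all illustrative assumptions; the paper only specifies the three fields and who writes them.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple

BOTTOM = object()  # hypothetical sentinel standing in for the paper's ⊥


@dataclass
class Ann:
    """Private non-volatile announcement structure of one process."""
    op: Optional[Tuple[str, tuple]] = None  # operation type and its arguments
    resp: Any = BOTTOM                      # persisted response, ⊥ if none yet
    CP: int = 0                             # checkpoint within the operation


def announce_and_invoke(ann: Ann, name: str, args: tuple,
                        operation: Callable[..., Any]) -> Any:
    """Caller side: announce the operation, reset resp and CP, then invoke."""
    ann.op = (name, args)
    ann.resp = BOTTOM
    ann.CP = 0
    return operation(ann, *args)


def recover(ann: Ann, recovery_functions: Dict[str, Callable[..., Any]]) -> Any:
    """After a crash, choose the recovery function according to ann.op."""
    if ann.op is None:
        return None  # nothing was announced, so there is nothing to recover
    name, args = ann.op
    # The recovery function receives the same arguments as the failed operation.
    return recovery_functions[name](ann, *args)
```

Note that the writes to `op`, `resp` and `CP` happen outside the recoverable operation itself; this is the auxiliary state discussed in Section 5.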
Failed processes recover in an asynchronous manner, independently of each other. Specifically, the recovery of some processes may have already completed while other processes may have not yet completed (or even started) their recovery. Op.Recover may be invoked multiple times before it completes, because the system may undergo multiple crashes in the course of executing it. If all the operations of an implementation are recoverable, then the implementation is called recoverable.
Linearizability [17] requires that each operation applied to a concurrent object takes effect instantaneously at some point between its invocation and response. The correctness condition ensured by our algorithms is durable linearizability (DL) [18] . DL requires that linearizability be maintained in spite of crash-failures. In other words, once the system recovers after a crash-failure, the state of the data structure reflects a history containing all operations that completed before the crash and may also contain some operations that have not completed before the crash. This captures the idea that an operation can be linearized only once its effect gets persisted to NVM.
The progress conditions we consider are wait-freedom [15] and obstruction-freedom [16] . A recoverable operation or a recovery function is wait-free (resp. obstruction-free) if, starting from any reachable configuration, p completes it in a finite number of its own steps (resp. when running solo), when the system experiences no crashes. We emphasize that all our results hold also in a model where processes may fail independently, such as that assumed by [3] .
Detectable Read/Write Object
Algorithm 1 presents the pseudo-code of a detectable read/write object O that uses bounded space from (bounded-space) variables that support read/write primitive operations. To the best of our knowledge, this is the first detectable read/write algorithm that uses bounded space. The checkpoint field Ann_p.CP is used by process p in order to allow the recovery function to infer where in the recoverable operation the failure occurred. Each process p owns two private variables: RD_p, storing data used during recovery, and T_p, storing an index in {0, 1} to one of two size-N toggle-bit arrays, A[·][p][0] and A[·][p][1], that are used by p's write operations in an alternating manner. O's state consists of a single shared read/write register R storing a triplet of values ⟨v, q, b⟩, where v is O's current value, q is the identifier of the process that (last) wrote v, and b is the index of the toggle-bit array used by q for that write operation. Initially, R = ⟨v_init, 0, 0⟩, where v_init is O's initial value, thus "attributing" this value to a write by process 0 that used toggle-bit array 0. Register R stores O(log N) bits in addition to the application value v, in contrast with the unbounded state required by the read/write object implementation of Attiya et al. [3]. A 3-dimensional array A[N][N][2] allows each writing process p to coordinate with any other process q using p's two toggle-bit arrays.
The key challenge with which Algorithm 1 copes is the ABA problem. Attiya et al. [3] avoid it by ensuring that all written values are distinct, at the cost of using a register of unbounded size. Algorithm 1 allows the same value to be written multiple times, so a process p may read from R a value v_q (written by process q) and then write some value v_p that is later overwritten by another write of v_q by q. In this case, if p recovers after a system crash, it needs a mechanism for detecting whether or not its operation was linearized. As we explain below, per-process toggle bits are used to implement this mechanism. Before invoking an operation on the object, its caller initializes the Ann_p structure as described in Section 2. Specifically, Ann_p.CP is initialized to 0 and Ann_p.resp is initialized to ⊥.
The Write operation. To write, process p reads R (line 1), thus learning that q was the last to write to R and which toggle-bit array was used by q for writing. Next, p resets the bit from q's other toggle-bit array corresponding to p (line 2), and persists the value read from R, as well as the index of the toggle-bit array used by p's current write (stored in T_p), into RD_p (lines 3-4). Then, p reads R again (line 5) and proceeds to write to R (line 7) only if it read from R the same value as in line 1. In this case, p sets its checkpoint field to 1 (line 6) immediately before the write to R and sets it to 2 (line 8) immediately after it. It then sets all the bits in the toggle-bit array used by its current write operation, switches its toggle-bit array index, persists the response and returns (lines 9-13).
If the condition of line 5 is not satisfied then, as we prove, a write operation W by a process other than p is linearized between p's first and second reads of R, hence p can be assumed to have been overwritten by W . In this case, p skips lines 6-7 and proceeds directly to line 8.
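The crash-free path of Write can be sketched as below. This is a reconstruction from the prose description only, not the paper's Algorithm 1: the class name `DetectableRegister`, the all-zero initial contents of `A`, the exact layout of `RD_p`, and the mapping of statements to line numbers are assumptions. The simulation is sequential, so the second read of R (line 5) always matches the first.

```python
from typing import Any, Tuple

ACK = "ack"
BOTTOM = object()  # hypothetical sentinel standing in for ⊥


class DetectableRegister:
    """Sequential sketch of the state and the Write path described above."""

    def __init__(self, n: int, v_init: Any) -> None:
        self.n = n
        self.R: Tuple[Any, int, int] = (v_init, 0, 0)  # <v, q, b>
        # A[i][p][b]: bit of process i in p's toggle-bit array b (assumed 0 initially).
        self.A = [[[0, 0] for _ in range(n)] for _ in range(n)]
        self.T = [0] * n            # T_p: toggle-bit array used by p's next write
        self.RD = [None] * n        # RD_p: recovery data
        self.CP = [0] * n           # Ann_p.CP
        self.resp = [BOTTOM] * n    # Ann_p.resp

    def write(self, p: int, val: Any) -> str:
        # Caller's initialisation of the announcement fields.
        self.CP[p] = 0
        self.resp[p] = BOTTOM
        seen = self.R                          # line 1: read <qval, q, qtoggle>
        _qval, q, qtoggle = seen
        self.A[p][q][1 - qtoggle] = 0          # line 2: reset p's bit in q's other array
        self.RD[p] = (seen, self.T[p])         # lines 3-4: persist recovery data
        if self.R == seen:                     # line 5: R unchanged since line 1?
            self.CP[p] = 1                     # line 6
            self.R = (val, p, self.T[p])       # line 7
        self.CP[p] = 2                         # line 8
        for i in range(self.n):                # lines 9-10: set all bits of current array
            self.A[i][p][self.T[p]] = 1
        self.T[p] = 1 - self.T[p]              # switch toggle-bit array index
        self.resp[p] = ACK                     # persist the response
        return ACK
```

Two consecutive writes by the same process use toggle indices 0 and then 1, so a process can only reuse an index after having set all bits of the other array.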
The Write.Recover recovery function. Upon recovery from a failed Write operation W, p first reads RD_p (line 14) and then checks whether Ann_p.resp was set (line 15). If so, W completed and has been linearized, so the recovery function returns ack. Next, p checks whether Ann_p.CP equals 0 (line 17). In this case, as we prove, W was not linearized before the failure, so the recovery function returns fail (line 18); the caller of the failed operation can now decide whether or not to reattempt W. Otherwise, if Ann_p.CP equals 1 (line 19), then the recovery code must determine whether or not R was written (either by p in line 7 or by another process) after W read ⟨qval, q, qtoggle⟩ from R in line 1. This is done in line 20 as follows. If R's value differs from ⟨qval, q, qtoggle⟩, then R was written and so either W performed line 7 or W can be assumed to have been overwritten by another write, so the recovery code proceeds by performing lines 22-27 (which are identical to lines 8-13). Otherwise, R's value equals ⟨qval, q, qtoggle⟩ but it is still possible that q wrote ⟨qval, q, qtoggle⟩ to R again after R was read by W. This is checked by the second condition of line 20, which relies on the following key observation used by our correctness proof: in order for q to write again using the same toggle-bit index, it must first complete a write operation using the other toggle-bit index. However, in that earlier write operation, q sets all the bits of that toggle-bit array to 1 (either in lines 9-10 of its write operation or in lines 23-24 of its recovery function). Therefore, upon recovery, if p reads the same value from R as before the crash, it can conclude that a write occurred in between its two reads of R if and only if q's toggle bit that it has set to 0 is now 1.
If this is not the case, p concludes that W was not linearized and returns fail.
The Read operation. The Read operation reads a triplet of values from R and then extracts its first component, writes it to Ann_p.resp and returns it. Its recovery function re-invokes Read if Ann_p.resp = ⊥ holds, otherwise it returns it. This simple code is not presented in Algorithm 1.
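The case analysis performed by Write.Recover can be sketched as a pure decision function. This is a hedged reconstruction from the prose, not the paper's pseudo-code: the function name, the `BOTTOM` sentinel, the layout of `rd_p` as `(value_read_in_line_1, own_toggle_index)` and the `"finish"` outcome (meaning "perform lines 22-27 and return ack") are assumptions made for illustration.

```python
from typing import Any, List, Tuple

BOTTOM = object()  # hypothetical sentinel standing in for ⊥


def write_recover_decision(R: Tuple[Any, int, int],
                           A: List[List[List[int]]],
                           rd_p: Tuple[Tuple[Any, int, int], int],
                           ann_resp: Any,
                           ann_cp: int,
                           p: int) -> str:
    """Return "ack", "fail", or "finish" for a recovering Write by process p."""
    if ann_resp is not BOTTOM:            # line 15: response already persisted
        return "ack"
    if ann_cp == 0:                       # lines 17-18: never reached line 6
        return "fail"
    if ann_cp == 1:                       # line 19: crashed around the write to R
        seen, _own_toggle = rd_p
        _qval, q, qtoggle = seen
        # Line 20: R unchanged AND the bit p reset in line 2 is still 0,
        # so no write happened since p's first read of R.
        if R == seen and A[p][q][1 - qtoggle] == 0:
            return "fail"
    return "finish"                       # W (or an overwriting write) took effect
```

The second conjunct is what resolves the ABA case: if q rewrote the same triplet, it must first have completed a write with the other toggle index, which sets `A[p][q][1 - qtoggle]` back to 1.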
It is easily seen that Algorithm 1 uses bounded space, assuming that the values written by Write operations are of bounded size. It remains to show that the algorithm satisfies durable linearizability, detectability and wait-freedom.
Lemma 1. Algorithm 1 is wait-free and satisfies durable linearizability and detectability.
Proof. Consider an execution α of Algorithm 1. Assume process p completes a Write(val) operation W in α (either directly or by completing the recovery function). We prove that one of the following holds: 1) p writes to R exactly once, and this is W 's linearization point; 2) p does not write to R and there is a concurrent write operation W ′ by a different process that writes to R, hence we can linearize W immediately before W ′ ; or 3) the failure occurred before W wrote to R, in which case Write.Recover returns fail.
We start by observing that if there is no crash during W, then either it writes to R exactly once in line 7, or the condition in line 5 does not hold, in which case p does not write to R and there was a concurrent write operation W′ by a different process that wrote to R.
We proceed to prove that the lemma holds also if W and its recovery code experience one or more crashes. If the system crashes before p writes to CP_p in line 6 then p did not write to R while executing W, nor did it write to any entry in its toggle-bit arrays A[i][p][y] for i ≠ p, hence none of its writes is ever read by another process. Thus, W was not linearized yet and the recovery code simply returns fail (line 18). Otherwise, consider the case where a crash occurred after p executed line 8. Then either p wrote to R in line 7 (hence W was linearized) or p observed in line 5 a write by a concurrent write operation W′ and so W can be linearized immediately before W′. In both cases, Write.Recover re-executes the for loop of lines 23-24, switches its toggle-bit index, and returns.
We are left with the case where a crash occurred after the update of CP_p in line 6 but before its update in line 8. Upon recovery, in order to satisfy detectability, the algorithm has to determine whether W was linearized or not. To complete the proof, we establish the following two claims: 1) the condition of line 20 evaluates to false only if there was a write to R (either by p or by some other process) after p first read it (in line 1); 2) if p wrote to R before the crash (in line 7), then the condition of line 20 evaluates to false.
In line 1, p reads R and writes its content (together with p's current value of T_p) to RD_p in line 4. Therefore, p persists the identifier q of the process that last wrote to R and the toggle-bit index used by q. Denote by W_q this write of q, and let qts denote the toggle-bit index used in W_q. Assume that the condition of line 20 evaluates to false. We prove that some write operation is linearized after W_q. Since a write operation can only be linearized when R is written (not necessarily by the linearized operation itself), this would prove claim 1). If p reads from R in line 20 a value different from the one stored in RD_p, then clearly there was a write to R after the read in line 1 and we are done. Otherwise, p reads the same value from R in line 20, and so it holds that A[p][q][1 − qts] = 1. However, in line 2, p set A[p][q][1 − qts] to 0. Thus, q must have set it to 1 later in the execution. Notice that q sets a bit of the (1 − qts) toggle-bit array to 1 only during the for loop (in lines 9-10 or 23-24), after completing a write using the toggle-bit index 1 − qts. Denote this write by W′_q. Since p observed the write W_q, which is associated with the toggle-bit index qts, it must be that write W′_q was performed after the read of p in line 1. This is true since W′_q must either write to R, or observe a write to R by another process, and in both cases, the linearization point of W′_q must be after p observed W_q in line 1.
We now prove claim 2). Assume the system crashed after p wrote to R in line 7. Upon recovery, if p reads in line 20 a value from R other than that stored in RD p , the claim clearly holds. Otherwise, p reads the same value written by W q . Notice that p reads the same value in line 5, after setting A[p][q][1 − qts] to 0. Moreover, later in the execution p writes to R, and thus there must be another write by q to R with the same value as in W q that was done after p wrote to R. In particular, these two writes use the same toggle-bit index qts, thus there must be another write by q using toggle-bit index 1 − qts in between the two. This write must have been completed, and thus the for loop updating A[p][q][1 − qts] to 1 was done after p's read in line 5, hence, after p has set A[p][q][1 − qts] to 0. Moreover, no process but p can set A[p][q][1 − qts] to 0 again. Consequently, the second condition of the if statement in line 20 does not hold, implying that claim 2) holds.
To conclude the proof of durable linearizability and detectability, we claim that every completed Read operation returns the value of the latest Write operation linearized before it, or v_init if there are no such Write operations. We linearize a Read operation Op when it writes to Ann_p.resp and its linearization point is then set to its previous read of R. It follows that the response of a completed Read is the value of R's first component when last read by the operation. A Write operation W can be linearized either when it writes to R (line 7) or when it does not write to R but is linearized before a concurrent Write operation W′ that does write to R. The claim now follows from the atomicity of R.
It is easily seen that the algorithm is wait-free: none of Read, Write, or their recovery functions contains an unbounded loop (the only loop, in lines 9-10 and 23-24, iterates once per toggle bit), so each of these operations/functions terminates in a bounded number of steps if it experiences no crashes.
A Detectable CAS Algorithm and Lower Bound
We now present a wait-free detectable implementation of an N-process durable linearizable CAS object from (bounded-space) variables that support read/CAS primitive operations. As far as we know, this is the first bounded-space detectable CAS implementation. Our algorithm uses Θ(N) bits in addition to those storing the CAS object's value. We then prove that any such implementation requires Ω(N) bits, thus establishing that our algorithm is asymptotically space optimal.
A Bounded-Space Recoverable CAS Algorithm
A Read operation simply reads C, extracts O's value, writes it to Ann_p.resp and returns it. Its recovery function re-invokes Read if Ann_p.resp = ⊥ holds, otherwise it returns it. This simple code is not shown.
Lemma 2. Algorithm 2 is wait-free and satisfies durable linearizability and detectability.
Proof. Consider an execution α of Algorithm 2. Assume process p completes a Cas(old, val) operation Op in α (either directly or by completing the recovery function). We prove that one of the following holds: 1) p successfully writes to C exactly once and the return value of Op is true, in which case this is Op's linearization point; 2) p does not write to C and C contains a value different than old at some point during Op, hence we can linearize Op at this point and it returns false; or 3) a failure occurs before Op wrote to C, in which case Cas.Recover returns fail.
We start by observing that if Op does not experience a crash, then either p reads a value different from old in C, thus it returns false (lines 29-31), or p performs a single CAS in line 35 and returns its response. The CAS is successful only if the value stored in C is old, in which case p returns true. On the other hand, the CAS fails only if another process performed a successful CAS to C after p first read it in line 28, hence the value of C after it must be other than old, and this is also the linearization point of Op, which indeed returns false.
Note that p is the only process to ever update vec[p] and the only place in which this update occurs is the CAS of line 35. Moreover, this is the only place in the code where a CAS is performed. Thus, since lines 32-33 have to be executed before line 35 (even in case of a crash), each successful CAS to C by p will flip the bit vec[p] stored in C, and it will remain flipped until p's next successful CAS .
We proceed to prove that the lemma holds also if Op experiences one or more crashes. If a crash occurs before Op writes to Ann p .CP in line 34, then p did not write to C while executing Op. Thus, Op was not linearized yet and Cas.Recover simply re-invokes Op (line 41). Otherwise, consider the case where a crash occurs after p performs line 34. If p performed a successful CAS at line 35 before the crash, then the operation was already linearized. As per our previous observation, C.vec[p] = RD p will hold as long as p does not perform another (successful) CAS . Hence, the condition in line 43 is evaluated as false, and p persists true as its response and then returns (lines 45-46).
It remains to consider the case where the crash occurred when p did not perform a successful CAS at line 35, either because the CAS failed or because p crashed before performing line 35. In this case, we can consider Op as not having been linearized yet, since it did not change the value of any variable that operations by other processes may read. In both cases, vec[p] ≠ RD_p holds, since vec[p] stores the old, un-flipped, value, whereas RD_p stores the new, flipped, value. Moreover, no process but p can change vec[p]. Thus, the condition of line 43 is evaluated as true and the recovery function returns fail in line 44.
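The flip-bit mechanism used in the proof above can be sketched in a sequential simulation with injected crashes. Algorithm 2 itself is not reproduced in this text, so everything below is a reconstruction from the proof: the class `DetectableCas`, the `Crash` exception, the `crash_at` parameter, the representation of C as a pair (value, bit vector) and the mapping to line numbers are assumptions.

```python
from typing import Any, Optional, Tuple

BOTTOM = object()  # hypothetical sentinel standing in for ⊥
FAIL = "fail"


class Crash(Exception):
    """Simulated system-wide crash."""


class DetectableCas:
    def __init__(self, n: int, v_init: Any) -> None:
        self.C: Tuple[Any, Tuple[int, ...]] = (v_init, (0,) * n)  # <value, vec>
        self.RD = [0] * n          # RD_p: the flipped bit p intends to install
        self.CP = [0] * n          # Ann_p.CP
        self.resp = [BOTTOM] * n   # Ann_p.resp

    def _cas_primitive(self, expected: tuple, new: tuple) -> bool:
        if self.C == expected:
            self.C = new
            return True
        return False

    def cas(self, p: int, old: Any, new: Any, crash_at: Optional[int] = None):
        self.CP[p], self.resp[p] = 0, BOTTOM   # caller's initialisation
        return self._body(p, old, new, crash_at)

    def _body(self, p: int, old: Any, new: Any, crash_at: Optional[int] = None):
        value, vec = self.C                     # line 28
        if value != old:                        # lines 29-31
            self.resp[p] = False
            return False
        flipped = 1 - vec[p]
        self.RD[p] = flipped                    # lines 32-33: persist flipped bit
        new_vec = vec[:p] + (flipped,) + vec[p + 1:]
        self.CP[p] = 1                          # line 34
        if crash_at == 35:
            raise Crash                         # crash just before the CAS
        ok = self._cas_primitive((old, vec), (new, new_vec))  # line 35
        if crash_at == 36:
            raise Crash                         # crash before persisting the response
        self.resp[p] = ok
        return ok

    def recover(self, p: int, old: Any, new: Any):
        if self.resp[p] is not BOTTOM:
            return self.resp[p]
        if self.CP[p] == 0:
            return self._body(p, old, new)      # line 41: re-invoke
        if self.C[1][p] != self.RD[p]:          # line 43: bit not flipped
            return FAIL                         # line 44
        self.resp[p] = True                     # lines 45-46
        return True
```

The vector holds one bit per process, which matches the Θ(N) additional bits claimed for the algorithm; a process learns on recovery whether its own CAS succeeded by comparing its bit in C with the value it persisted in `RD`.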
A Space Lower Bound on Detectable CAS
Algorithm 2 uses Θ(N) shared-memory bits, in addition to those required for storing the CAS object's value. Let V denote the set of states that may be assumed by a CAS object O. Assuming that |V| ≥ N, we prove that any recoverable and detectable obstruction-free implementation A of O must have at least 2^{N−1} reachable configurations with distinct shared-memory states. This implies that A uses at least N − 1 shared-memory bits. If |V| = O(N), then only O(log N) bits are required for storing the CAS object's value, implying that Ω(N) additional shared-memory bits are required for supporting detectability. Before proceeding with the proof, we need the following two definitions. We say that configurations C, D are memory-equivalent if the value of each shared-memory variable is the same in both configurations. We say that a step s by some process p is a modifying step [14] w.r.t. an operation Op by another process q in some configuration C, if the solo executions of Op by q after C and after C • s return different responses. Figure 1 illustrates the structure of the inductive construction of the proof that follows.
Theorem 1. Let A be an N-process obstruction-free implementation of a CAS object O (using any primitive operations) satisfying durable linearizability and detectability, assuming values from a domain V of size at least N. Then A has at least 2^{N−1} different reachable configurations, no two of which are memory-equivalent.
Proof. We prove the claim by induction on N. The claim holds trivially for N = 1. Assume the claim holds for N and let A be an (N+1)-process obstruction-free recoverable implementation of a CAS object O satisfying durable linearizability and detectability, assuming values from a domain V of size at least N + 1. Denote the processes using A by p_1, p_2, ..., p_N, q. Assume also WLOG that {0, 1, ..., N} ⊆ V and that O's value in the initial configuration C is 0. Starting from C, we let p_N perform a Cas_{p_N}(0, 1) operation until it is about to perform the first modifying step s with respect to a Cas_q(0, 1) operation by process q. We denote by α the prefix of p_N's execution up to (but not including) step s. That is, a solo execution of Cas_q(0, 1) by q after C • α returns true, while its solo execution after C • α • s returns false. Such a modifying step exists because A is obstruction-free, a solo execution of Cas_q(0, 1) starting from C returns true, but a solo execution of Cas_q(0, 1) after Cas_{p_N}(0, 1) completes returns false. We define the following two configurations: C_ns, reached from C • α by a solo execution of Cas_q(0, 1) by q, and C_s, reached from C • α • s by a solo execution of Cas_q(0, 1) by q. Both are the configurations that are reached after Cas_q(0, 1) returns. The executions leading to C_ns and C_s exist, since A is obstruction-free. From linearizability, O's value is 1 in both C_s and C_ns, that is, a Read() (resp. a Cas(1, 0)) operation on O, performed to completion immediately after either C_s or C_ns, must return 1 (resp. return true and change the object's value to 0). Moreover, Cas_q(0, 1) returns true in the execution leading to C_ns and false in the execution leading to C_s. Notice that all processes but p_N are idle in C_ns. Moreover, if we fix C_ns as an initial configuration, halt p_N starting from this point on and restrict processes p_1, ..., p_{N−1}, q to perform CAS operations with arguments from the domain V \ {0}, then we obtain an N-process algorithm A′ for which we can apply the induction hypothesis. Thus, there is a set S_1 of at least 2^{N−1} configurations reachable from C_ns in p_N-free executions, no two of which are memory-equivalent. The same argument can be applied to configuration C_s, resulting in a second set S_2 of 2^{N−1} configurations reachable from C_s, no two of which are memory-equivalent.
To complete the proof, we show that no configuration reachable from S_1 can be memory-equivalent with any configuration reachable from S_2. Intuitively, this is because the modifying step s must write to shared memory and does not modify any of p_N's local variables. Thus, if we can extend both C_s and C_ns and reach two memory-equivalent configurations, then p_N will behave the same after a crash in both of them, which leads to a contradiction. The formal proof follows.
Assume towards a contradiction that there exist executions β_ns and β_s starting from C_ns and C_s, respectively, leading to memory-equivalent configurations D_ns ∈ S_1 and D_s ∈ S_2. Note that both β_ns and β_s are p_N-free. We extend D_ns by a system crash, followed by a recovery of p_N, followed by an execution in which p_N performs its recovery function Cas.Recover_{p_N}(0, 1) to completion. Denote this extension by γ. Since Cas_{p_N}(0, 1) can be linearized only after Cas_q(0, 1) (as the latter returns true in the execution leading to C_ns) and as O's state is positive all throughout β_ns, Cas.Recover_{p_N}(0, 1) must return either false or fail in γ.
Since all of p_N's local variables are reset after the crash, and since D_ns and D_s are memory-equivalent, if the system crashes after D_s, and then p_N recovers by performing Cas.Recover_{p_N}(0, 1) to completion, it will return false or fail as well. However, as Cas_q(0, 1) returned false in the execution leading to C_s, from linearizability, Cas.Recover_{p_N}(0, 1) must return true. This is a contradiction.
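The step from the configuration count of Theorem 1 to the bit bounds stated before it is a counting argument, spelled out here as a worked derivation. The symbol b (the total number of shared-memory bits used by the implementation) is introduced only for this illustration.

```latex
% b bits of shared memory admit at most 2^b distinct shared-memory states,
% and Theorem 1 gives 2^{N-1} pairwise non-memory-equivalent configurations:
\[
  2^{b} \;\ge\; 2^{N-1}
  \quad\Longrightarrow\quad
  b \;\ge\; N-1 .
\]
% If |V| = O(N), the object's value needs only O(log N) of these bits,
% so the remainder is charged to detectability:
\[
  b - \lceil \log_2 |V| \rceil \;\ge\; (N-1) - O(\log N) \;=\; \Omega(N).
\]
```

This matches the Θ(N) additional bits used by Algorithm 2 up to a constant factor.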
Detectable Algorithms Require Auxiliary State
The detectable algorithms we presented in Sections 3-4 use auxiliary state, provided via the non-volatile Ann_p.CP field used for managing checkpoints. As mentioned in Section 1, detectable algorithms often require auxiliary state that helps them infer where in the execution the failure occurred [3, 9]. We now formalize this notion.
Definition 1. Let Op be a recoverable operation. We say that auxiliary state is provided to Op via NVM, if in-between every two successive invocations of Op, a write is made to a non-volatile variable that can be accessed by Op. We say that auxiliary state is provided to Op via operation arguments, if the arguments to Op contain data not specified by the object's abstract operation.
For example, in our model, the system provides auxiliary state via Ann p .CP , since it sets its value to 0 before any invocation of a recoverable operation by p. The queue algorithm of [9] provides auxiliary state via unique operation identifiers passed as arguments.
In the following, we prove that the usage of auxiliary state is inevitable for obtaining detectable implementations for a large class of objects that includes read/write, CAS, resettable test-and-set and FIFO queue objects.
It is easily seen that weak obstruction-freedom is satisfied by any recoverable algorithm that satisfies obstruction-freedom or deadlock-freedom. Our proof holds for the class of doubly-perturbing objects, a notion we define next.
Definition 3. Given an object O and a sequential history H, we say that an operation Op is perturbing after H, if there exists an operation Op ′ by a different process such that Op ′ returns different responses in H • Op • Op ′ and in H • Op ′ . We also say that Op is perturbing with respect to Op ′ after H. We say that O is doubly-perturbing if there exists an operation Op p by some process p and a sequential history H 1 of O such that the following conditions hold:
1. Op p is perturbing with respect to some Op ′ after H 1 .
2. H 1 • Op p • Op ′ has a p-free extension, resulting in a sequential history H 2 , such that Op p is perturbing after H 2 .
We say that Op p witnesses that O is doubly-perturbing.
It is easily shown that many widely-used objects are doubly-perturbing. We now prove that a read/write object is doubly-perturbing. In the appendix, we provide proofs that establish that the counter, CAS, fetch-and-add and FIFO queue objects are doubly-perturbing as well.
Proof. Consider a read/write object O over a domain of values including at least two distinct values v 0 , v 1 , initialized to v 0 . We claim that write p (v 1 ) witnesses that O is doubly-perturbing. For any process q ≠ p, write p (v 1 ) is perturbing w.r.t. read q after the empty sequential history. This satisfies the first condition of Definition 3. Moreover, write p (v 1 ) • read q can be extended by a write q (v 0 ) operation, resulting in a sequential history H 2 = write p (v 1 ) • read q • write q (v 0 ), and write p (v 1 ) is perturbing w.r.t. (a second instance of) the read q operation after H 2 . This satisfies the second condition of Definition 3. Thus, O is doubly-perturbing.
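The two conditions used in this proof can be checked mechanically on a sequential specification. The following is a minimal Python sketch and not part of the paper's algorithms: the class `Register` and the helpers `run` and `is_perturbing` are hypothetical names introduced here, and a history is modeled simply as a list of (operation, argument) pairs, so process identities are implicit in the variable names only.

```python
class Register:
    """Sequential specification of a read/write object, initialized to v0."""

    def __init__(self, v0):
        self.value = v0

    def apply(self, op, arg=None):
        if op == "write":
            self.value = arg
            return "ack"
        if op == "read":
            return self.value
        raise ValueError(op)


def run(history, v0=0):
    """Apply a sequential history to a fresh register; return all responses."""
    obj = Register(v0)
    return [obj.apply(op, arg) for op, arg in history]


def is_perturbing(history, op, other, v0=0):
    """Definition 3: op is perturbing w.r.t. other after history H if other
    returns different responses in H.op.other and in H.other."""
    with_op = run(history + [op, other], v0)[-1]
    without_op = run(history + [other], v0)[-1]
    return with_op != without_op


V0, V1 = 0, 1
write_p = ("write", V1)  # write_p(v1), the candidate witness
read_q = ("read", None)  # read_q
write_q = ("write", V0)  # write_q(v0), the p-free extension

H1 = []                               # the empty sequential history
H2 = H1 + [write_p, read_q, write_q]  # H1 . Op_p . Op' followed by the extension

cond1 = is_perturbing(H1, write_p, read_q)  # first condition of Definition 3
cond2 = is_perturbing(H2, write_p, read_q)  # second condition of Definition 3
print(cond1, cond2)
```

Both checks succeed because the extension write q (v 0 ) restores the initial value, so a second write p (v 1 ) is again visible to a subsequent read.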
The notion of doubly-perturbing objects bears similarity to the notion of perturbable objects defined by Jayanti et al. [21] . Although most common perturbable objects (including read/write, counter, compare-and-swap, swap and fetch-and-add objects) are also doubly-perturbing, the two classes of objects are incomparable. We now prove that there are perturbable objects that are not doubly-perturbing (e.g., a max register [2] ). In the appendix we prove that a bounded counter is doubly-perturbing but not perturbable.
Lemma 4.
A max register object is not doubly-perturbing.
Proof. Consider a max register object O, which supports a writeMax(v) operation and a read() operation that returns the largest value written to the register preceding it. Clearly, a read operation cannot witness that O is doubly-perturbing, as it cannot be observed by other operations. As for a writeMax(v) operation by some process p, once the operation is linearized, any read must return a value of v or higher. Thus, invoking writeMax(v) for a second time cannot modify O's value, and thus cannot change the response of any other operation.
Lemma 4 proves that a max register is not doubly-perturbing. However, it is known to be perturbable [21] . Thus, Theorem 2 below does not apply to a max register object. This raises the question of whether there is an obstruction-free implementation of a max register satisfying durable linearizability and detectability that does not use auxiliary state. We now show that this is indeed the case.
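The key step of the proof of Lemma 4 (a second writeMax(v) cannot perturb anything) can be sanity-checked by brute force on a small domain. The sketch below is only an illustration under stated assumptions: `MaxRegister`, `last_response` and `is_perturbing` are hypothetical helper names, the register is assumed to start at 0, and only values 0..2 and histories of length at most 3 are enumerated.

```python
from itertools import product


class MaxRegister:
    """Sequential specification of a max register (assumed initial value 0)."""

    def __init__(self):
        self.value = 0

    def apply(self, op, arg=None):
        if op == "writeMax":
            self.value = max(self.value, arg)
            return "ack"
        if op == "read":
            return self.value
        raise ValueError(op)


def last_response(history):
    """Run a sequential history on a fresh object; return the last response."""
    obj = MaxRegister()
    resp = None
    for op, arg in history:
        resp = obj.apply(op, arg)
    return resp


def is_perturbing(history, op, other):
    """op is perturbing w.r.t. other after H if other answers differently
    in H.op.other and in H.other."""
    return last_response(history + [op, other]) != last_response(history + [other])


VALUES = range(3)
OPS = [("read", None)] + [("writeMax", v) for v in VALUES]

# Collect every case in which writeMax(v) is perturbing after a history
# that already contains a writeMax(v); the proof says there are none.
violations = [
    (H, v, other)
    for n in range(4)
    for H in product(OPS, repeat=n)
    for v in VALUES
    if ("writeMax", v) in H
    for other in OPS
    if is_perturbing(list(H), ("writeMax", v), other)
]
print(len(violations))
```

A first writeMax(1) on the initial register is still perturbing w.r.t. a read, which is consistent with the max register being perturbable; only the second condition of Definition 3 fails.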
Algorithm 3 presents the pseudo-code of a detectable max register implementation that uses no auxiliary state. The max register is composed of an integer array MR, with one entry MR[p] per process p. To perform WriteMax(val), process p checks whether MR[p] is smaller than val, and if so updates it. To perform Read, p uses a local array. For simplicity of presentation, we use an array-assignment/comparison notation as a shorthand for copying/comparing the array entry by entry. Process p repeatedly copies the contents of MR until the first successful double collect. Upon success, p has obtained a valid snapshot of MR, so the largest value in MR was the value of the max register at some point in the execution interval of Read. The recovery function of each of these operations simply re-invokes the operation, and thus the code is not shown.
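The description above can be sketched in a few lines of Python. This is a single-threaded illustration and not the paper's Algorithm 3: the class and method names are hypothetical, the array `MR` is assumed to hold one entry per process (process p writes only to its own entry) and to start at 0, and an ordinary Python list stands in for non-volatile memory. In a real system each entry would be read and written with an individual atomic memory access.

```python
class DetectableMaxRegister:
    """Sketch of a max register built from a per-process integer array."""

    def __init__(self, num_processes):
        # Stand-in for the non-volatile array MR (assumed initial value 0).
        self.MR = [0] * num_processes

    def write_max(self, p, val):
        # Process p updates only its own entry, and only if val is larger.
        if self.MR[p] < val:
            self.MR[p] = val

    def read(self):
        # Repeatedly copy MR into a local array until two consecutive
        # copies are identical (a successful double collect).
        prev = list(self.MR)
        while True:
            cur = list(self.MR)
            if cur == prev:
                return max(cur)  # largest value in a valid snapshot
            prev = cur

    # Recovery simply re-invokes the operation; no auxiliary state is needed
    # because repeating write_max or read has no additional visible effect.
    def write_max_recover(self, p, val):
        self.write_max(p, val)

    def read_recover(self):
        return self.read()


reg = DetectableMaxRegister(3)
reg.write_max(0, 5)
reg.write_max(1, 3)
reg.write_max_recover(0, 5)  # as if p = 0 crashed during write_max(0, 5)
print(reg.read())
```

Re-invocation is safe here precisely because a max register is not doubly-perturbing: executing WriteMax(val) a second time cannot change the response of any other operation.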
Next, we present our impossibility result. The following two definitions are required for the proof. We say that configurations C, D are indistinguishable to process p, denoted by $C \overset{p}{\sim} D$, if the values of all shared-memory variables, as well as those of p's local variables, are the same in both configurations. We say that $C \overset{Q}{\sim} D$ for a set of processes Q if $C \overset{p}{\sim} D$ for every p ∈ Q. Figure 2 depicts the structure of the execution constructed by our proof. We note that Theorem 2 holds regardless of the primitive operations used by the implementation.
Theorem 2. Let A be a weak obstruction-free recoverable implementation of a doubly-perturbing object O satisfying durable linearizability and detectability and let Op be a recoverable operation of O. Then either auxiliary state is provided to Op via NVM, or it is provided to Op via operation arguments.
Proof. Assume towards a contradiction that auxiliary state is provided to Op neither via NVM nor via the arguments to Op, so the arguments to operations in A are identical to those applied to the implemented object. For example, if O is a read/write object, our assumption means that read operations receive no arguments and a write operation invoked for writing value v receives a single argument whose value is v. We construct an execution of A that will establish that A violates durable linearizability, thus reaching a contradiction.
Since O is doubly-perturbing, there exists an operation Op p by some process p, witnessing that O is doubly-perturbing. Thus, from the first condition of Definition 3, there is a sequential history H 1 such that Op p is perturbing w.r.t. some operation Op r after H 1 , for r ≠ p. Since A satisfies weak obstruction-freedom, there exists an execution α in which processes perform their operations according to H 1 in a sequential manner, starting from an initial configuration, leading to a configuration C α . Starting from C α , let p apply its Op p operation and halt just before returning; denote the resulting configuration by C β . Starting from C β , let p complete its Op p operation on O by returning from it and denote the resulting configuration by C ′ β . As A satisfies weak obstruction-freedom, the executions leading to C β and C ′ β exist, since in both p runs solo starting from a quiescent configuration. Since returning a response does not change any shared variable and no auxiliary state is provided via NVM, C β and C ′ β are indistinguishable to every process other than p. Writing P for the set of all processes, we get:
(1) $C_\beta \overset{P \setminus \{p\}}{\sim} C'_\beta$.
Notice that C ′ β is a quiescent configuration. Since Op p witnesses that O is doubly-perturbing, from the second condition of Definition 3, the history H 1 • Op p • Op r has a p-free extension resulting in a sequential history H 2 (extending H 1 ) such that Op p is perturbing after H 2 . Assume it is perturbing w.r.t. an operation Op q (for some q ≠ p). Since A is weak obstruction-free, there exists an execution γ starting from C ′ β in which first r performs Op r to completion, followed by a p-free sequential execution corresponding to the extension of H 1 • Op p • Op r , resulting in configuration C ′ γ . Moreover, if δ denotes the solo execution of Op p after C ′ γ , then Op q returns different responses when executed after C ′ γ and after C ′ γ δ. Since γ is a p-free execution, and from Equation (1), we get that γ is also an execution starting from C β , leading to a configuration C γ that is indistinguishable from C ′ γ to every process other than p:
(2) $C_\gamma \overset{P \setminus \{p\}}{\sim} C'_\gamma$.
Op r was performed to completion in γ. Since Op r returns the same response in the executions leading to C ′ γ and to C γ and as Op p perturbed its response, it follows that the linearization point of Op p precedes that of Op r in both executions.
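Next, consider the following two configurations: D 1 , reached from C γ by a system crash, and E 1 , reached from C ′ γ by p invoking Op p a second time, immediately followed by a system crash.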
Thus, in the execution leading to D 1 , the system crashes just before p returns from its first Op p operation, whereas, in the execution leading to E 1 , p completes that operation (thus p returns its response), invokes a second Op p operation and then the system crashes immediately after p's invocation. Note that both these executions exist, since all processes but p are idle in C γ and A satisfies weak obstruction-freedom. Since neither responses nor invocations change any shared variables, since no auxiliary state is provided, neither via NVM nor via operation arguments, and as a crash resets all local variables of p, we get:
(3) $D_1 \overset{P}{\sim} E_1$.
Next, starting from E 1 , we extend the execution by recovering p, letting it perform a solo execution of its recovery function to completion, denoted by ζ. We let E 2 = E 1 • (p.REC) • ζ denote the resulting configuration. ζ exists, since all processes but p are idle in E 1 and A satisfies weak obstruction-freedom. From Equation (3), p performs the same execution ζ after E 1 • (p.REC) and after D 1 • (p.REC). Hence, letting D 2 = D 1 • (p.REC) • ζ, we get:
(4) $D_2 \overset{P}{\sim} E_2$.
Finally, starting from configuration E 2 , extend the execution by letting process q ≠ p perform Op q to completion. Op q completes, since E 2 is quiescent and A satisfies weak obstruction-freedom. Assume that Op q returns value v. From Equation (4), Op q returns v also if it is performed starting from configuration D 2 . Notice that p cannot return fail in its recovery function in ζ. This is due to the fact that in the execution leading to D 1 , p executed a single Op p operation, which perturbs the response of Op r . Hence, Op p must be linearized before Op r , and the recovery function returns its response. In particular, the recovery function of p executed in the execution leading to E 2 returns a response different from fail, and thus the second Op p must have a linearization point in the execution leading to E 2 . Moreover, since Op p is perturbing w.r.t. Op q after H 2 , executing Op q starting from E 1 returns a response u ≠ v. That is, the execution of Op p in ζ perturbs the response of Op q . From Equation (3), we get that executing Op q starting from D 1 returns u as well. It follows that p's recovery function, executed in ζ after D 1 , perturbs the response of Op q performed after D 2 . However, p executed a single Op p operation in the execution leading to D 2 , which was linearized before Op r , while a second linearization point of Op p must exist in the execution of Op p 's recovery function (ζ), since it perturbs the response of Op q . Thus, Op p has two linearization points in the same execution. This contradicts our assumption that A satisfies durable linearizability.
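The crux of the proof is that, without auxiliary state, the configurations D 1 and E 1 coincide once the crash has reset p's local variables. The toy Python model below illustrates only this step and is not taken from the paper: a configuration is modeled as a pair (shared NVM contents, local variables of p), the names `crash`, `R`, `CP` and `pc` are hypothetical, and the value 1 for the checkpoint after the first operation is an assumption made for the sake of the example.

```python
def crash(config):
    """A system-wide crash keeps shared NVM contents and resets local variables."""
    shared, _local = config
    return (dict(shared), {})


# Case 1: no auxiliary state. The effect of p's first operation (here, a
# write of 1 to a shared variable R) is already in NVM in both executions.
shared = {"R": 1}
before_return = (shared, {"pc": "about to return", "arg": 1})   # leads to D1
after_reinvoke = (shared, {"pc": "second op invoked", "arg": 1})  # leads to E1

D1 = crash(before_return)
E1 = crash(after_reinvoke)
same_without_aux = (D1 == E1)  # recovery cannot tell the two cases apart

# Case 2: auxiliary state via NVM. The system resets a checkpoint variable
# CP to 0 before every invocation; the first operation is ASSUMED here to
# have advanced CP to 1 before the crash.
shared_d = {"R": 1, "CP": 1}  # crash during the first operation
shared_e = {"R": 1, "CP": 0}  # CP was reset before the second invocation

D1_aux = crash((shared_d, {"pc": "about to return", "arg": 1}))
E1_aux = crash((shared_e, {"pc": "second op invoked", "arg": 1}))
same_with_aux = (D1_aux == E1_aux)

print(same_without_aux, same_with_aux)
```

Any deterministic recovery function is a function of the post-crash configuration, so in the first case it must behave identically after D 1 and after E 1 , which is exactly what Equations (3) and (4) of the proof exploit.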
Discussion
Several works propose different correctness conditions for the crash-recovery model [1, 5, 13] . Izraelevitz et al. [18] presented durable linearizability (DL) that assumes a system-wide crash model. Roughly speaking, it requires that after the system recovers from a crash, the state of each object reflects a consistent sub-history that contains all the operations that completed before the crash. Detectability [9] requires, in addition, that a process recovering from a crash that occurred while it was performing an operation Op can infer if Op took effect (was linearized) and, in such case, obtain its response. This allows for a "client" operation that called Op to continue its execution after the crash, since Op's response is made available to it. If Op was not linearized, the client operation can choose whether or not to re-invoke it.
Attiya et al. [3] formalized a strict variant of this notion as nesting-safe recoverable linearizability (NRL). NRL requires that Op.Recover complete the crashed operation and persist its response before returning, thus allowing the client operation access to this response. NRL is stricter than detectability [9] , since the latter provides the client code with the flexibility of choosing whether or not to re-invoke a crashed operation that was not linearized (upon receiving a fail response), whereas NRL will re-attempt such an operation again and again until it completes (if it ever does).
We note that NRL implies DL and detectability, as each crashed operation must complete, persist its response, and be linearized before Op.Recover completes. As for the other direction, given an implementation that satisfies both DL and detectability, one can easily transform it into an implementation satisfying NRL by having the recovery function Op.Recover invoke Op again instead of returning a fail response. Therefore, all algorithms presented in this paper can easily be transformed to satisfy NRL.
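The transformation described above amounts to a thin wrapper around the detectable recovery function. The following Python sketch is an illustration only: `FAIL`, `make_nrl_recover`, `op` and `recover` are hypothetical names, crashes are not modeled, and persisting the response (which NRL also requires) is left out.

```python
FAIL = object()  # sentinel standing in for the "fail" response


def make_nrl_recover(op, recover):
    """Wrap a detectable recovery function so that it never returns FAIL.

    op      -- the recoverable operation Op (hypothetical callable)
    recover -- its detectable recovery function Op.Recover, which returns
               either the response of the crashed operation or FAIL
    """

    def nrl_recover(*args):
        resp = recover(*args)
        if resp is FAIL:
            # The crashed operation was not linearized: invoke Op again
            # instead of reporting failure to the caller.
            resp = op(*args)
        return resp

    return nrl_recover


# Toy usage: an "append" operation whose crashed instance was not linearized.
log = []


def op(x):
    log.append(x)
    return len(log)


def recover_not_linearized(x):
    return FAIL


def recover_linearized(x):
    return 42  # the response of the already-linearized operation


result_retry = make_nrl_recover(op, recover_not_linearized)(7)
result_done = make_nrl_recover(op, recover_linearized)(8)
print(result_retry, result_done, log)
```

If a crash occurs during the re-invoked operation, the system would run the wrapped recovery function again, which is why NRL may retry an operation repeatedly until it completes.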
The private cache model [9, 18] is an abstract model, where primitive operations are applied directly to NVM and processes do not share a cache. In the more realistic shared cache model [18] , the main memory is non-volatile, while primitive operations are applied to a volatile shared cache. In this model, explicit persistency instructions are used in order to ensure that writes get persisted to the NVM and do so in the right order. Upon a crash, the shared cache content, as well as the values of any local variables, are lost. It is easily seen that our impossibility and lower-bound results hold also in the shared cache model. As for our algorithms, for presentation simplicity, we specified and analyzed them assuming the private cache model. A simple syntactic transformation was proposed by [18] , where persistency instructions are added to the code, thus transforming it to an implementation for the shared cache model. When this transformation is applied to our algorithms, they maintain their correctness (as well as their space complexity) in the shared cache model.
An alternative approach for achieving recoverability is the design of persistent universal constructions. For example, [7] designed a persistent log-based universal construction that requires only one round trip to NVRAM per operation, which is optimal. However, logging imposes significant overheads in time and space, which are even more pronounced for concurrent data structures, where there is an extra cost of synchronizing accesses to the log. [5] presented a recoverable variant of Herlihy's wait-free universal construction. These constructions do not provide detectability as defined by [9] . However, upon recovery, processes are able to determine which operations were linearized before the crash and obtain their responses. Such implementations do not require auxiliary state, since upon recovery a process cannot infer whether its last invoked operation was linearized. For example, assume process p performs the same operation Op twice, and a crash occurs during its second instance of Op. Upon recovery, p is able to conclude that its last linearized operation was Op, and obtain its response. However, p cannot tell which instance of Op was linearized. In contrast, detectability requires p to infer whether the second instance of Op, during which the crash occurred, was linearized.
Ben-David et al. [4] showed that any implementation using only read, write and CAS primitives can be made detectable by partitioning the code into capsules, each containing a single CAS followed by several reads, and replacing each CAS with its recoverable version.
Some recent works study the recoverable mutual exclusion (RME) problem, defined by Golab and Ramaraju [11, 10, 12, 19, 20] . Much work was also done on implementing persistent transactional memory frameworks (see [8, 22, 24, 26] ). Several previous works investigated recoverable implementations for specific data-structures. Friedman et al. [9] presented concurrent lock-free queue algorithms exhibiting different tradeoffs between consistency and efficiency. Coburn et al. [6] presented NV-heaps. Several works proposed persistent algorithms for index trees [27, 23, 25, 27] .
In this work, we presented the first bounded-space detectable CAS and read/write algorithms. Detectable algorithms have the advantage of supporting composability. On the downside, as we established, this comes with a price tag in terms of space complexity and the need to provide recoverable operations and recovery functions with auxiliary data.
Open Problems
Theorem 1 establishes a space lower bound of Ω(N ) shared bits on any obstruction-free detectable implementation of a CAS object, thus establishing that Algorithm 2 is asymptotically space optimal. However, Algorithm 2 uses a single Ω(N )-bit shared variable. Finding such an algorithm which uses registers of size O(log N ) bits, or alternatively proving that such an algorithm is impossible, is an interesting research direction.
No (non-trivial) space lower bound for a detectable read/write object is known and finding the tight bound is another open question. Finally, exploring the tradeoff between space and time complexity for detectable implementations, as well as the tradeoff between the complexities of a recoverable operation and its recovery function, is another interesting avenue for future work.
