The promise of software transactional memory (STM) is to combine an easy-to-use programming interface with an efficient utilization of the concurrent-computing abilities provided by modern machines. But does this combination come with an inherent cost?
Introduction
The software transactional memory (STM) paradigm promises to efficiently exploit the concurrency provided by modern computers while offering an easy-to-use programming interface. It allows a programmer to write a concurrent program as a sequence of transactions. A transaction is a series of read and write operations on transactional objects (or t-objects). An STM implementation turns this series into a sequence of accesses to underlying base objects and exports "all-or-nothing" semantics: every transaction either commits in which case all its operations are expected to instantaneously "take effect", or aborts in which case the transaction does not affect any other transaction. In this paper, the default STM correctness property is opacity [13, 15] that, informally, requires that in every execution, there is a total order on all transactions, including aborted ones, where every read operation returns the argument of the last committed write operation on the read t-object.
An STM implementation that aborts every transaction is trivially correct but useless. Therefore, we need to specify a progress condition that captures the execution scenarios in which a transaction should commit. Consider, for example, a simple non-trivial progress condition that requires a transaction to commit if it does not overlap with any other transaction. This condition can be implemented using a single lock that is acquired at the beginning of a transaction and released at its end. The resulting "single-lock" STM will be running one transaction at a time, thus ignoring the potential benefits of multiprocessing. Similarly, an obstruction-free STM [12] that only requires a transaction to commit if it eventually runs with no contention allows for no concurrency at all. But to exploit the power of modern multiprocessor machines, an STM implementation must allow at least some transactions to make progress concurrently. If this is the case, we say that the implementation provides positive concurrency, in contrast to zero concurrency provided by "single-lock" and obstruction-free STMs.
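The "single-lock" STM described above can be sketched in a few lines. This is a minimal illustration only: the class name `SingleLockSTM`, the method `atomically`, and the dictionary-backed store are assumptions made for this sketch and do not come from the paper. The body of a transaction is assumed not to raise exceptions.

```python
import threading


class SingleLockSTM:
    """Toy single-lock STM: one global lock serializes all transactions."""

    def __init__(self):
        self._lock = threading.Lock()   # the single global lock
        self._store = {}                # t-object name -> value (initially 0)

    def atomically(self, body):
        """Run body(read, write) as one transaction; it always commits."""
        with self._lock:                # acquired at the start, released at the end
            def read(x):
                return self._store.get(x, 0)

            def write(x, v):
                self._store[x] = v

            return body(read, write)


def transfer(read, write):
    # Move one unit from X to Y inside a single transaction.
    write("X", read("X") - 1)
    write("Y", read("Y") + 1)


stm = SingleLockSTM()
threads = [threading.Thread(target=stm.atomically, args=(transfer,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The invariant X + Y == 0 holds because transactions never interleave.
total = stm.atomically(lambda read, write: read("X") + read("Y"))
```

Since every transaction holds the lock for its whole duration, at most one transaction runs at a time, which is exactly the zero-concurrency behaviour the text contrasts with positive concurrency.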
In this paper, we try to understand the inherent costs of allowing multiple concurrent transactions to commit. Therefore, we focus on progress conditions that provide positive concurrency: progressiveness [14] and permissiveness [11]. Informally, a progressive STM [14] provides a very small degree of concurrency by only requiring a transaction $T$ to commit if it encounters no concurrent conflicting transaction $T'$: $T$ and $T'$ conflict on a t-object $X$ if they concurrently access $X$ and one of the transactions tries to update $X$. A stronger variant of progressiveness, called strong progressiveness, additionally requires that in case a set of transactions conflict on at most one t-object, at least one transaction commits. A much more demanding permissive STM [11] stipulates that a transaction must commit, unless committing it violates correctness, which, informally, provides the highest degree of concurrency.
To understand the inherent cost of positive concurrency in STM implementations, we first consider the number of RAW/AWAR synchronization patterns [6] that must be performed by a process in the course of a transaction. A read-after-write (RAW) pattern consists of a write to a (shared) base object x followed by a read from a different base object y (without a write to y in between). An atomic write-after-read (AWAR) pattern consists of an atomic (indivisible) execution of a read of a base object followed by a write to (possibly the same) base object. Accounting for RAW/AWAR patterns is important since most modern processor architectures use relaxed memory models, where maintaining the order of operations in a RAW requires a memory fence [21] and each AWAR is manifested as an atomic instruction such as Compare-and-Swap (CAS). In most architectures, memory fences and atomic instructions are believed to be considerably slower than regular shared-memory accesses [1, 19, 21, 20].
We show that every permissive and opaque STM implementation has, for any m ∈ N, an execution in which a transaction with a read set of size m incurs Ω(m) consecutive RAW/AWAR patterns. This contrasts with a single-lock STM that uses only one such pattern, since a successful lock acquisition can be implemented using only one (multi-)RAW [18] or AWAR [4]. We show that one RAW/AWAR is in fact optimal for single-lock STMs. Moreover, we present implementations of progressive STMs that employ just a single RAW or AWAR pattern per transaction. Also, we describe a strongly progressive space-bounded STM implementation that incurs four RAWs per transaction.
These implementations suggest that the RAW/AWAR metric is too coarse-grained to evaluate the complexity of progressive STMs. Therefore, we introduce a new metric called protected data size that, intuitively, captures the amount of data that a transaction must exclusively control at some point of its execution. All progressive STM implementations we are aware of (see, e.g., an overview in [14]) use locks or timing assumptions to give an updating transaction exclusive access to all objects in its write set at some point of its execution. For example, lock-based progressive implementations require that a transaction grabs all locks on its write set before updating the corresponding base objects. Our results show that this is an inherent price to pay for providing progressive concurrency: every committed transaction in a progressive and disjoint-access-parallel STM implementation must, at some point of its execution, protect every object in its write set. Interestingly, as our progressive implementations show, the transaction's read set does not need to be protected.
In brief, our results imply that providing high degrees of concurrency in opaque STM implementations incurs a considerable synchronization cost. Permissive STMs, while providing the best possible concurrency in theory, require a strong synchronization primitive or a memory fence per read operation, which may result in excessively slow execution times. Progressive STMs provide only basic concurrency but perform considerably better in this respect: we present progressive implementations that incur constant RAW/AWAR complexity. Does this mean that maximizing the ability of processing multiple transactions in parallel should not be an important factor in STM design? Should we rather settle for the little positive concurrency provided by progressiveness or even focus on speculative single-lock solutions à la flat combining [16]? It is difficult to say affirmatively, but our results suggest so.
The rest of the paper is organized as follows. Section 2 briefly introduces our system model and recalls the correctness criteria in STM. Section 3 presents some useful properties of STM implementations and Section 4 recalls the definitions of progress conditions of STM, including progressiveness and permissiveness. Section 5 presents the definitions of RAW/AWAR complexity. Section 6 presents a linear lower bound on the number of RAW/AWAR patterns executed by a transaction in a permissive STM. Section 7 describes our progressive STM implementations that perform a constant number of RAWs or AWARs per transaction and presents a lower bound on the amount of data to be protected by a transaction in a progressive STM. Section 8 summarizes some related work and Section 9 concludes the paper. Detailed proofs are delegated to the optional Appendix.
Model
Our STM model, while keeping the spirit of the original definitions of [13, 15] , introduces some refinements that are instrumental for our results.
Transactions. Transactional memory provides the ability to read and write a set of transactional objects, or t-objects, using atomic transactions. A transaction is a sequence of accesses (reads or writes) to t-objects. We assume that every transaction $T_k$ has a unique identifier $k$. Formally, STM exports the following operations (called tm-operations in the paper): (1) $read_k(X)$ that returns a value in a set $V$ or a special value $A_k \notin V$ (abort); (2) $write_k(X, v)$ that returns $ok_k$ or $A_k$; (3) $tryC_k$ that returns $C_k \notin V$ (commit) or $A_k$; and (4) $tryA_k$ that returns $A_k$.
A history $H$ is a sequence of invocations and responses of tm-operations. A history $H$ is sequential if every invocation is either the last event in $H$ or is immediately followed by a matching response. $H|k$ denotes the subsequence of $H$ restricted to events with index $k$. If $H|k$ is non-empty, we say that $T_k$ participates in $H$, and $parts(H)$ denotes the set of transactions that participate in $H$. A history is well-formed if for all $T_k$, $H|k$ is sequential and contains no events that appear after $A_k$ or $C_k$. Throughout this paper, we assume that all histories are well-formed, i.e., the user of transactional memory never invokes a new operation before receiving a response from the current one and does not invoke any operation $op_k$ after $T_k$ has returned $C_k$ or $A_k$. A history $H$ is complete if for every $T_k \in parts(H)$, $H|k$ ends with a response event. A transaction $T_k \in parts(H)$ is live in $H$ if $H|k$ does not end with $A_k$ or $C_k$. Otherwise, $T_k$ is called complete. A history is t-complete if $parts(H)$ contains only complete transactions. A transaction $T_k \in parts(H)$ is forcefully aborted in $H$ if some operation $op_k \neq tryA_k$ returns $A_k$. Two histories $H$ and $H'$ are equivalent if for every transaction $T_k$, $H|k = H'|k$.
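The history notions above can be made concrete with a small sketch. The encoding is an assumption of this example, not the paper's: a history is a list of pairs `(k, label)`, where `k` is the transaction identifier and the labels `"A"` and `"C"` stand for the abort and commit responses (so ordinary return values are assumed never to equal these two strings). The function names are likewise hypothetical.

```python
def project(history, k):
    """H|k: the subsequence of events of transaction k."""
    return [event for event in history if event[0] == k]


def parts(history):
    """Transactions that participate in H (those with a non-empty H|k)."""
    return {k for k, _ in history}


def is_live(history, k):
    """T_k is live if H|k does not end with A_k or C_k."""
    return project(history, k)[-1][1] not in ("A", "C")


def is_t_complete(history):
    """A history is t-complete if no participating transaction is live."""
    return not any(is_live(history, k) for k in parts(history))


def equivalent(h1, h2):
    """Two histories are equivalent if they agree on every projection H|k."""
    return all(project(h1, k) == project(h2, k) for k in parts(h1) | parts(h2))


# T_1 reads X and stays live; T_2 writes X and commits.
H = [(1, "read(X)"), (2, "write(X,1)"), (1, "0"), (2, "ok"), (2, "tryC"), (2, "C")]
# Same per-transaction subsequences, different interleaving.
H2 = [(2, "write(X,1)"), (1, "read(X)"), (2, "ok"), (1, "0"), (2, "tryC"), (2, "C")]
```

Here `H` and `H2` are equivalent although their events interleave differently, since equivalence only compares the per-transaction projections.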
The read set (resp., the write set) of a transaction T k ∈ parts(H), denoted Rset(T k ) (resp., Wset(T k )), is the set of t-objects that T k reads (resp., writes to) in H.
Real-time and deferred-update orders. For T k , T m ∈ parts(H), we say that T k precedes T m in the real-time order in H, and we write T k ≺ H T m , if T k is committed or aborted and the last event of T k precedes the first event of T m in H. If neither T k ≺ H T m nor T m ≺ H T k , then we say that T k and T m are concurrent in H. A transaction T k ∈ parts(H) which is not concurrent with any other transaction in H is called uncontended in H. A history H is t-sequential if no two transactions are concurrent in H.
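As an illustration of the real-time order, the following sketch computes precedence and concurrency from a history. It assumes a hypothetical encoding in which a history is a list of `(k, label)` pairs and the labels `"A"` and `"C"` mark the completing response of a transaction; histories are assumed well-formed.

```python
def intervals(history):
    """Map each transaction id to (first index, last index, completed?)."""
    info = {}
    for pos, (k, label) in enumerate(history):
        first = info[k][0] if k in info else pos
        info[k] = (first, pos, label in ("A", "C"))
    return info


def precedes(history, k, m):
    """T_k precedes T_m: T_k is committed or aborted and its last event
    comes before the first event of T_m."""
    info = intervals(history)
    return info[k][2] and info[k][1] < info[m][0]


def concurrent(history, k, m):
    """Distinct transactions are concurrent if neither precedes the other."""
    return k != m and not precedes(history, k, m) and not precedes(history, m, k)


def is_t_sequential(history):
    """A history is t-sequential if no two transactions are concurrent."""
    ids = list(intervals(history))
    return not any(concurrent(history, a, b) for a in ids for b in ids)


# T_1 and T_2 overlap; T_3 starts after both have committed.
H = [(1, "read(X)"), (1, "0"), (2, "write(X,1)"), (2, "ok"),
     (1, "tryC"), (1, "C"), (2, "tryC"), (2, "C"),
     (3, "read(X)"), (3, "1"), (3, "tryC"), (3, "C")]
```

In this example only $T_3$ is uncontended: it is preceded by both other transactions, while $T_1$ and $T_2$ are concurrent with each other.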
For $T_k, T_m \in parts(H)$, we say that $T_k$ precedes $T_m$ in the deferred-update order, and we write $T_k \prec^{DU}_H T_m$, if there exists $X \in Rset(T_k) \cap Wset(T_m)$, $T_m$ has committed, such that the response of $read_k(X)$ precedes the invocation of $tryC_m()$ in $H$. For $T_k, T_m \in parts(H)$, we write $T_k \prec^{X}_H T_m$ if $T_k$ has committed and the response of $read_m(X)$, $X \in Rset(T_m) \cap Wset(T_k)$, returns $v$, the value of $X$ updated in $write_k(X, v)$.
Legal histories. Let $H$ be a complete t-sequential history. For every operation $read_k(X)$ in $H$ that reads a t-object $X$, we define the latest written value of $X$ as follows: (1) If $T_k$ contains a $write_k(X, v)$ preceding $read_k(X)$, then the latest written value of $X$ is the value of the latest such write. (2) Otherwise, if $H$ contains a $write_m(X, v)$ such that $m \neq k$, $T_m$ precedes $T_k$, and $T_m$ commits in $H$, then the latest written value of $X$ is the value of the latest such write in $H$. (3) Otherwise, the latest written value of $X$ is the initial value of $X$. Without loss of generality, we assume that $H$ starts with a fictitious initializing transaction $T_0$ that writes 0 to every t-object. We say that a complete t-sequential history $H$ is legal if for every t-object $X$, every read of $X$ in $H$ returns the latest written value of $X$.
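The three rules for the latest written value translate directly into a legality check. The sketch below assumes a simplified, hypothetical encoding of a complete t-sequential history: a list of `(ops, committed)` pairs in serial order, where each operation is `("r", X, returned_value)` or `("w", X, written_value)`. Reads that return abort are not modelled.

```python
def is_legal(transactions):
    """Check that every read returns the latest written value.

    transactions: list of (ops, committed) pairs in t-sequential order.
    """
    committed_state = {}              # latest committed value of each t-object
    for ops, committed in transactions:
        local = {}                    # writes performed so far by this transaction
        for kind, x, v in ops:
            if kind == "w":
                local[x] = v
                continue
            # Rule (1): own latest write; rule (2): latest committed write;
            # rule (3): the initial value 0 written by the fictitious T_0.
            expected = local[x] if x in local else committed_state.get(x, 0)
            if v != expected:
                return False
        if committed:                 # writes of aborted transactions are never visible
            committed_state.update(local)
    return True


committed_writer = [([("w", "X", 1)], True), ([("r", "X", 1)], True)]
aborted_writer = [([("w", "X", 1)], False), ([("r", "X", 0)], True)]
stale_read = [([("w", "X", 1)], True), ([("r", "X", 0)], True)]
```

The second history is legal because the write of the aborted transaction is ignored, whereas the third is not: the reader misses a committed write that precedes it.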
Opacity. Let $H$ be any complete sequential history. Now $\overline{H}$ denotes a history constructed from $H$ as follows: (1) for every live transaction $T_k$ in $H$, we insert $tryC_k \cdot A_k$ immediately after the last event of $T_k$ in $H$, and (2) for every aborted transaction $T_k$ in $H$, we remove all write operations in $T_k$ with the matching responses.
Definition 1 A complete sequential history $H$ is opaque if there exists a legal complete t-sequential history $S$ such that (1) $\overline{H}$ and $S$ are equivalent and (2) $S$ respects $\prec_H$ and $\prec^{DU}_H$.
We call such a legal complete t-sequential history S a serialization of H. A weaker property, called strict serializability [22] , guarantees opacity with respect to committed transactions in H. Obviously, every opaque history is also strictly serializable.
Implementations. We consider an asynchronous shared-memory system in which processes p 1 , . . . p N communicate by executing atomic operations on shared base objects.
An STM implementation provides the processes with algorithms for operations $read_k$, $write_k$, $tryC_k$ and $tryA_k$. Without loss of generality, we assume that base objects are accessed with atomic read-write operations, but we allow the programmer to aggregate a sequence of operations on base objects using clearly demarcated atomic sections: the operations within an atomic section are to be executed sequentially. The atomic-section construct is general enough to implement various strong synchronization primitives, such as test-and-set (TAS) or compare-and-swap (CAS). We assume that atomic sections may only contain a bounded number of base-object operations.
An execution of an implementation M is a sequence of atomic accesses to base objects (base-object events), and invocation and responses of the TM operations (TM-events). If a base-object event is a write or an atomic-section that contains a write (in one of its execution paths), we say that the event is non-trivial.
A configuration of M (after some execution E) is determined by the states of all base objects and the states of the processes. An initial state of M is determined by the initial states of base objects and t-objects. We assume that each base object and each t-object is initialized to 0. A history of an execution E, denoted by E| T M is the subsequence of E restricted to TM-events. E| T M,p i denotes the subsequence of E| T M restricted to events issued by process p i .
The interval of a transaction T k in E is the fragment of E that starts with the first event of T k in E and ends with the completing event of T k (A k or C k ) in E, or, if T k has not completed in E, with the last event of E. A tm-operation op 1 precedes op 2 in H if the invocation of op 2 appears after the response of op 1 in H. An execution E is well-formed if every atomic section is executed sequentially in E, E| T M,p i is t-sequential for each p i , and no event on behalf of a transaction T k is taking place outside of an interval between invocation and response of some TM-operation in T k . We assume here that a TM implementation generates only well-formed executions.
A completion of H is a history constructed from H by removing some pending invocations and adding responses to the remaining pending invocations to the end of H. To account for initial values of t-objects, we add to the beginning of H a (fictitious) transaction T 0 that writes 0 to every t-object and commits.
A complete sequential history $\tilde{H}$ is a linearization of $H$ if there exists a history $H'$, a completion of $H$, such that (1) $\tilde{H}$ respects the precedence order of $H$, and (2) $\tilde{H}$ and $H'$ are equivalent.
Definition 2 An STM implementation M is opaque if for every execution E of M , there exists an opaque linearization of E| T M .
Preliminaries
In this section, we define some useful properties of STM implementations and prove some simple facts that follow from these definitions.
Access patterns. The definition of STM allows a process to alternate reading and writing to t-objects arbitrarily in the course of a transaction. Moreover, it allows a process to read from a t-object that was previously written within the same transaction. We show that this flexibility can be obtained "for free" given an implementation that only allows a user to read from a set of t-object and then to write to a set of t-objects within a transaction.
We say that a transaction $T_k$ is canonic in a history $H$ if $H|k$ is a sequence of reads (of distinct t-objects) followed by a sequence of writes (to distinct t-objects). The general complexity of an STM implementation $M$ accounts for the number of accesses to base objects used to implement every given transaction in every execution of $M$.
Lemma 3 Let $M$ be an opaque STM implementation that can only be accessed with canonic transactions. Then there exists an opaque STM implementation $M'$ that preserves the complexity of $M$.
Proof. Let $read^M$, $write^M$, $tryC^M$ and $tryA^M$ denote the implementations of the operations provided by $M$. Now $M'$ is constructed as follows.
We associate every transaction $T_k$ with a local variable $Wset(T_k)$ which contains, at any moment of time, the current write set of $T_k$ with the values to be written.
When $write_k(X, v)$ is invoked, $(X, v)$ is simply added to $Wset(T_k)$ and all other entries of the form $(X, v')$ are removed from $Wset(T_k)$. When $read_k(X)$ is invoked, we first check if $X$ is in $Wset(T_k)$ and if so, we return the value stored in $Wset(T_k)$. Otherwise, we invoke $read^M_k(X)$ and return the obtained value.
When $tryC_k()$ is invoked, we first execute $write^M_k(X, v)$ for each $(X, v) \in Wset(T_k)$. Since for each $X$ there can be at most one entry of the form $(X, v)$, the order in which these operations are invoked does not matter. Also, since all invocations of $write^M_k$ follow all invocations of $read^M_k$, the resulting sequence of invocations of $M$ on behalf of $T_k$ is a canonic transaction. Operation
Since $M$ is opaque, the resulting implementation $M'$ is also opaque: just use the serialization of the resulting history of $M$. Since the modifications of $M$ involve only local variables, the base-object complexity of $M'$ is the same as that of $M$. Therefore, in the rest of the paper, we only consider canonic transactions, which simplifies the analysis without sacrificing generality.
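The write-buffering construction in the proof of Lemma 3 can be sketched as a thin wrapper. Everything named here is hypothetical: `CanonicOnlySTM` is a trivial sequential stand-in for $M$ that merely logs the calls it receives, and `BufferedSTM` plays the role of the constructed implementation. The final call to the commit operation of $M$ is an assumption of this sketch, since the description of that step is cut off in the text. As in the text, repeated reads of the same t-object are not deduplicated.

```python
class CanonicOnlySTM:
    """Stand-in for M: a sequential store that logs each transaction's calls."""

    def __init__(self):
        self.store = {}
        self.pending = {}          # k -> {X: v}, writes awaiting commit
        self.log = {}              # k -> list of ("r" | "w", X)

    def read(self, k, x):
        self.log.setdefault(k, []).append(("r", x))
        return self.store.get(x, 0)

    def write(self, k, x, v):
        self.log.setdefault(k, []).append(("w", x))
        self.pending.setdefault(k, {})[x] = v
        return "ok"

    def try_commit(self, k):
        self.store.update(self.pending.pop(k, {}))
        return "C"


class BufferedSTM:
    """Wrapper: buffers writes in a local Wset(T_k) and forwards the rest to M."""

    def __init__(self, m):
        self.m = m
        self.wset = {}             # k -> {X: v}; at most one entry per t-object

    def write(self, k, x, v):
        self.wset.setdefault(k, {})[x] = v    # replaces any older entry (X, v')
        return "ok"

    def read(self, k, x):
        buffered = self.wset.setdefault(k, {})
        if x in buffered:                     # read own buffered write locally
            return buffered[x]
        return self.m.read(k, x)

    def try_commit(self, k):
        for x, v in self.wset.pop(k, {}).items():
            if self.m.write(k, x, v) != "ok":
                return "A"
        return self.m.try_commit(k)           # assumed final step


m = CanonicOnlySTM()
wrapped = BufferedSTM(m)
wrapped.write(1, "X", 1)
r1 = wrapped.read(1, "X")      # served from the local write set
r2 = wrapped.read(1, "Y")      # forwarded to M
wrapped.write(1, "X", 2)
outcome = wrapped.try_commit(1)
```

Although the user-level transaction interleaves reads and writes and reads its own write, the calls that reach `m` form a canonic sequence: one read of `Y` followed by one write to `X`.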
Disjoint-access parallelism. In STM implementations, it is considered important to allow transactions that are not related through the data sets they access to execute independently.
Let $I$ be a fragment of an execution $E$. Following [17, 8], we first define a conflict graph which relates transactions that are live in $I$. Vertices of the graph represent t-objects. The vertices representing distinct t-objects $X$ and $Y$ are related with an edge if and only if there is a transaction $T$ such that $\{X, Y\} \subseteq Dset(T)$ and the interval of $T$ overlaps with $I$ in $E$, where $Dset(T) = Rset(T) \cup Wset(T)$ denotes the data set of $T$.
Two transactions T i and T j are disjoint-access in E if there is no path between an item in Dset(T i ) and an item in Dset(T j ) in the conflict graph of the minimal execution interval containing the intervals of T i and T j .
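The disjoint-access test amounts to a reachability check in the conflict graph. The sketch below is illustrative only: the function names are hypothetical, and the caller is assumed to supply the data sets of all transactions whose intervals overlap the relevant execution fragment.

```python
from itertools import combinations


def conflict_graph(dsets):
    """Build the conflict graph: t-objects are vertices, and two t-objects are
    adjacent if some overlapping transaction has both in its data set."""
    adj = {}
    for dset in dsets:
        for x in dset:
            adj.setdefault(x, set())
        for x, y in combinations(dset, 2):
            adj[x].add(y)
            adj[y].add(x)
    return adj


def disjoint_access(dset_i, dset_j, overlapping_dsets):
    """True if no path links an item of dset_i to an item of dset_j."""
    adj = conflict_graph(overlapping_dsets)
    seen, stack = set(dset_i), list(dset_i)
    while stack:                       # depth-first search from all items of dset_i
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen.isdisjoint(dset_j)


alone = disjoint_access({"X"}, {"Y"}, [{"X"}, {"Y"}])
bridged = disjoint_access({"X"}, {"Y"}, [{"X"}, {"Y"}, {"X", "Y"}])
chained = disjoint_access({"X"}, {"Z"}, [{"X", "Y"}, {"Y", "Z"}])
```

The second and third calls show why the definition uses paths rather than direct overlap: a third transaction accessing both `X` and `Y`, or a chain of overlapping data sets, is enough to make two transactions with disjoint data sets not disjoint-access.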
Two processes concurrently contend on a base-object x in a given configuration if they have pending events on x in the configuration and one of these events is non-trivial.
Definition 4 An STM implementation M is disjoint-access parallel (DAP) if, for all executions E of M , two processes executing T i and T j concurrently contend on the same base object in E only if T i and T j are not disjoint-access.
The following lemma is inspired by [8] :
Lemma 5 Let E be an execution of a DAP STM implementation M in which a complete execution of T 1 is immediately followed by a (possibly incomplete) execution of T 2 such that T 1 and T 2 are disjoint-access. Then there does not exist a base object x such that both processes executing T 1 and T 2 access x in E and one of the accessing events is non-trivial.
Proof. Let E 0 · E 1 · E 2 be the prefix of E, where E 1 is the fragment of E consisting of the complete execution of T 1 , and E 2 is the fragment of E consisting of the execution of T 2 .
Suppose, by contradiction, that $T_1$ writes to a base object that is accessed by $T_2$. Let $E_2'$ be the longest prefix of $E_2$ that does not contain the first event on an object that is accessed by $T_1$ in $E_1$ with a non-trivial event, and let $x$ be this base object. Let $E_1'$ be the longest prefix of $E_1$ that does not contain the first non-trivial event on $x$. Since before accessing $x$, $T_1$ does not observe $T_2$, $E_0 \cdot E_1' \cdot E_2'$ is an execution of $M$. By construction, $T_1$ and $T_2$ are disjoint-access in $E_0 \cdot E_1' \cdot E_2'$. But in the resulting configuration, the processes executing $T_1$ and $T_2$ concurrently contend on $x$, a contradiction. Now suppose that $T_2$ writes to a base object that is accessed by $T_1$. Let $E_2'$ be the longest prefix of $E_2$ that does not contain the first non-trivial event of $T_2$ on a base object that is accessed by $T_1$ in $E_1$, and let $y$ be this object. Let $E_1'$ be the longest prefix of $E_1$ that does not contain the first event on $y$ in $E_1$. Since, as we showed above, $T_2$ does not observe the presence of $T_1$ in $E$, $E_0 \cdot E_1' \cdot E_2'$ is an execution of $M$. But, again, we obtain a configuration in which the processes executing $T_1$ and $T_2$ concurrently contend on $y$, a contradiction.
Definition 6 An STM implementation $M$ provides strict data partitioning if every t-object $X$ is associated with a set of base objects $\beta(X)$ such that $\forall X \neq Y$, $\beta(X) \cap \beta(Y) = \emptyset$, and a transaction $T_i$ can access a base object in $\beta(X)$ only if $X \in Dset(T_i)$.
Any STM that provides strict data partitioning is also disjoint-access parallel (but not vice versa).
Invisible reads and single-version opacity. An STM implementation M uses invisible reads if no execution of a tm-read operation incurs a write on a base object.
Let H be a sequential history. We say that T i precedes T j in H in the single-version order, and we write
A sequential history H is single-version opaque if there exists a legal t-sequential history H such that: 
Liveness and Progress
To describe the conditions under which a TM implementation does something useful, we need to address two orthogonal dimensions. First, we need to give a tm-liveness property [3] that determines the conditions under which an individual tm-operation must return. Second, we need to give a progress condition that describes the cases in which a transaction must commit.
TM-liveness properties
A TM implementation M is wait-free if in every infinite execution of M , each tm-operation returns in a finite number of its own steps, regardless of the behavior of concurrent transactions. In other words, a wait-free individual tm-operation (read, write, tryC or tryA) cannot be delayed because of a concurrent operation. The property can be very beneficial if executions of transactions are subject to unpredictable delays or failures.
In this paper, we do not assume failures: every operation is expected to take steps until it terminates. Moreover, we are interested in deriving inherent costs of implementing non-trivial concurrency in TM. Therefore, we assume a weaker default tm-liveness guarantee that we call starvation-freedom. A TM implementation M is starvation-free if in every infinite execution of M, each tm-operation eventually returns, assuming that no concurrent tm-operation stops indefinitely before returning. Starvation-freedom allows a tm-operation to be delayed only by a concurrent tm-operation.
Progress conditions
A progress condition determines the scenarios in which a transaction is allowed to abort. Technically, unlike tm-liveness, a progress condition is a safety property [3], since it can be violated in a finite execution. The simplest non-trivial progress property we consider in this paper is single-lock progressiveness, which says that a transaction can only abort if there is a concurrent transaction. Clearly, an opaque single-lock TM can be implemented using any mutual exclusion algorithm [24] with one critical section per transaction. Stronger progress conditions allow some transactions to progress concurrently in some scenarios, implying positive concurrency.
Progressiveness allows an implementation to abort a transaction only in case of a conflict.
Definition 7 A TM implementation $M$ is (weakly) progressive if for every history $H$ of $M$ and every transaction $T_i \in parts(H)$ that is forcefully aborted, there exists a prefix $H'$ of $H$ and a transaction $T_k \in parts(H')$ that is live in $H'$, such that $T_k$ and $T_i$ conflict in $H'$.
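The conflict relation that justifies an abort under progressiveness can be illustrated as follows. This is a simplification under stated assumptions: each transaction is a hypothetical dictionary with `rset`, `wset` and an `interval` given as a pair of event positions, and concurrency is approximated by interval overlap rather than by the prefix-based condition of Definition 7.

```python
def conflict_objects(t1, t2):
    """t-objects on which two transactions conflict: both access the object
    concurrently and at least one of them updates it."""
    (s1, e1), (s2, e2) = t1["interval"], t2["interval"]
    if e1 < s2 or e2 < s1:             # intervals do not overlap: no conflict
        return set()
    d1 = t1["rset"] | t1["wset"]
    d2 = t2["rset"] | t2["wset"]
    return (d1 & d2) & (t1["wset"] | t2["wset"])


def abort_is_justified(t, others):
    """A progressive TM may forcefully abort t only if t conflicts with
    some other transaction."""
    return any(conflict_objects(t, other) for other in others)


reader = {"rset": {"X"}, "wset": set(), "interval": (0, 5)}
other_reader = {"rset": {"X"}, "wset": set(), "interval": (2, 7)}
writer = {"rset": set(), "wset": {"X"}, "interval": (3, 8)}
late_writer = {"rset": set(), "wset": {"X"}, "interval": (6, 9)}
```

Two concurrent readers of the same t-object never conflict, so under this reading a progressive implementation has no excuse to abort either of them; a concurrent writer of that t-object changes the picture.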
The strong progressiveness property [14] additionally requires that in case a set of transactions conflict on a single t-object, at least one transaction commits. The formal definition is inspired by [15].
Let CObj H (T i ) denote the set of t-objects over which transaction T i ∈ parts(H) conflicts with any other transaction in history H i.e. X ∈ CObj H (T i ) if there exists a transaction
Let $CTrans(H)$ denote the set of non-empty subsets of $parts(H)$ such that a set $Q$ is in $CTrans(H)$ if no transaction in $Q$ conflicts with a transaction not in $Q$.
Definition 8 A TM implementation $M$ is strongly progressive if there does not exist any history $H$ of $M$ in which for every set $Q \in CTrans(H)$ of transactions such that $|CObj_H(Q)| \leq 1$, every transaction in $Q$ is forcefully aborted in $H$.
But since the goal of this paper is to derive a lower bound, we consider weakly progressive implementations (from now on, simply progressive).
Let $C$ be any correctness property, i.e., any safety property on TM histories [3]. The following property guarantees that no transaction is forcefully aborted if there is a chance to commit the transaction and preserve correctness.
Definition 9 A TM implementation $M$ is permissive with respect to $C$ if for every history $H$ of $M$ such that $H$ ends with a response $r_k$ and replacing $r_k$ with some $r_k' \neq A_k$ gives a history that satisfies $C$, we have $r_k \neq A_k$.
Therefore, permissiveness does not allow a transaction to abort, unless committing it would violate the execution's correctness. In this paper, we consider TM implementations that are permissive with respect to opacity. Clearly, permissiveness with respect to opacity is strictly stronger than progressiveness: every permissive opaque implementation is also progressive opaque, but not vice versa.
A transaction in a permissive opaque implementation can only be forcefully aborted if it tries to commit:
Lemma 10 Let a TM implementation M be permissive with respect to opacity. If a transaction T i is forcefully aborted executing an operation op i , then op i is tryC i .
Proof. Suppose, by contradiction, that there exists a history $H$ of $M$ such that some $op_i \in \{read_i, write_i\}$ executed within a transaction $T_i$ returns $A_i$. Let $H_0$ be the shortest prefix of $H$ that ends just before $op_i$ returns. By definition, $H_0$ is opaque and any history $H_0 \cdot r_i$ where $r_i \neq A_i$ is not opaque. Let $H_0'$ be a serialization of $H_0$.
If $op_i$ is a write, then $H_0 \cdot ok_i$ is also opaque: no write operation of the incomplete transaction $T_i$ appears in $H_0'$ and, thus, $H_0'$ is also a serialization of $H_0 \cdot ok_i$.
If $op_i$ is a $read(X)$ for some t-object $X$, then we can construct a serialization of $H_0 \cdot v$ where $v$ is the value of $X$ written by the last committed transaction in $H_0'$ preceding $T_i$, or the initial value of $X$ if there is no such transaction. It is easy to see that $H_0''$, obtained from $H_0'$ by adding $read(X) \cdot v$ at the end of $T_i$, is a serialization of $H_0 \cdot v$. In both cases, there exists a non-$A_i$ response $r_i$ to $op_i$ that preserves opacity of $H_0 \cdot r_i$, and, thus, the only operation that can be forcefully aborted in an execution of $M$ is tryC.
Obviously, Lemma 10 implies that there does not exist a permissive single-version TM implementation.
Multi-version permissiveness. A relaxation of permissiveness, called multi-version permissiveness (or mv-permissiveness) [23], says that a transaction $T_i$ can only abort if $T_i$ is updating and there is a concurrent conflicting updating transaction $T_j$; in particular, a read-only transaction cannot be aborted.
Lemma 11 There does not exist an mv-permissive TM implementation $M$ that guarantees starvation-freedom (let alone wait-freedom) of individual tm-operations and single-version opacity.
Proof. By contradiction, suppose that there exists a single-version opaque mv-permissive implementation M . Consider an execution of M in which transaction T 1 sequentially reads X, then transaction T 2 writes to X and Y and commits. Such an execution exists, since none of these operations can be forcefully aborted in a mv-permissive implementation. Now we extend this history with T 1 reading Y . There is no way to serialize T 1 and T 2 preserving single-version opacity, unless read 1 (Y ) aborts. But a mv-permissive TM implementation does not allow a read-only transaction to return abort-a contradiction.
If we relax our tm-liveness property and allow a tm-operation to be delayed by a concurrent conflicting transaction, then a single-version mv-permissive implementation is possible [7] .
Probabilistic permissiveness. Intuitively, a probabilistic permissive TM ensures the property of Definition 9 with a positive probability. It is conjectured in [11] that probabilistically permissive (with respect to opacity) implementations can be considerably cheaper than deterministic ones. This is achieved by choosing the response to a tm-operation op k by sampling uniformly at random from the set of possible return values (including A k ).
Definition 12 A TM implementation $M$ is probabilistically permissive with respect to $C$ if for every history $H$ of $M$ such that $H$ ends with a response $r_k$ and replacing $r_k$ with some $r_k' \neq A_k$ gives a history that satisfies $C$, we have $r_k \neq A_k$ with positive probability.
RAW/AWAR complexity
Modern CPU architectures perform reordering of memory references for better performance. Hence, memory barriers/fences are needed to enforce ordering in synchronization primitives whose correct operation depends on ordered memory references. Attiya et al. [6] formalized the RAW/AWAR class of synchronization patterns and showed that a wide class of concurrent algorithm implementations must involve these expensive patterns. We recall the definitions below.
Let $\pi$ be an execution fragment and let $\pi_i$ denote the $i$-th event in $\pi$ ($i = 0, \ldots, |\pi| - 1$). We say that process $p$ performs a RAW (read-after-write) in $\pi$ if $\exists i, j$; $0 \leq i < j < |\pi|$ such that
• $\pi_i$ is a write to a base object $x$ by process $p$,
• $\pi_j$ is a read of a base object $y \neq x$ by process $p$ and
• there is no $\pi_k$ such that $i < k < j$ and $\pi_k$ is a write to $y$ by $p$.
We say that two RAWs by process $p$ overlap in an execution $E$ if the read event of the first RAW occurs after the write event of the second RAW. A multi-RAW consists of a series of writes to a set of base objects followed by a series of reads from different base objects.
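The RAW definition can be checked mechanically on a trace. The sketch below assumes a hypothetical encoding of an execution fragment as a list of triples `(process, op, base_object)` with `op` equal to `"w"` or `"r"`; the function name is likewise an assumption of this example.

```python
def has_raw(trace, p):
    """True if process p performs a RAW in trace: a write to x, later followed
    by a read of some y != x, with no write to y by p in between."""
    for i, (proc_i, op_i, x) in enumerate(trace):
        if proc_i != p or op_i != "w":
            continue
        written_since = set()          # objects p wrote after position i
        for proc_j, op_j, y in trace[i + 1:]:
            if proc_j != p:
                continue
            if op_j == "w":
                written_since.add(y)
            elif y != x and y not in written_since:
                return True
    return False


raw = [("p", "w", "x"), ("p", "r", "y")]
no_raw_overwritten = [("p", "w", "x"), ("p", "w", "y"), ("p", "r", "y")]
no_raw_same_object = [("p", "w", "x"), ("p", "r", "x")]
no_raw_other_process = [("p", "w", "x"), ("q", "r", "y")]
```

Only the first trace contains a RAW by `p`. In the second, the read of `y` is preceded by a write to `y` by the same process; in the third, the read targets the object just written; in the fourth, the read is performed by a different process.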
We say a process p performs an AWAR (atomic-write-after-read) in π if ∃i, j, 0 ≤ i < j < |π| such that
• π i is a read of a base-object x by process p,
• π j is a write to a base-object y by process p and
• π i and π j belong to the same atomic section.
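Analogously, an AWAR can be detected on a trace that records atomic sections. The encoding is an assumption of this sketch: each event is a tuple `(process, op, base_object, section)`, where `section` identifies the atomic section the event belongs to, or is `None` for an event outside any atomic section.

```python
def has_awar(trace, p):
    """True if process p performs an AWAR in trace: a read followed by a write
    within the same atomic section."""
    sections_with_read = set()         # atomic sections in which p has already read
    for proc, op, _obj, section in trace:
        if proc != p or section is None:
            continue
        if op == "r":
            sections_with_read.add(section)
        elif op == "w" and section in sections_with_read:
            return True
    return False


# A successful compare-and-swap modelled as a read and a write in one section.
successful_cas = [("p", "r", "x", 1), ("p", "w", "x", 1)]
# A failed compare-and-swap only reads.
failed_cas = [("p", "r", "x", 1)]
# A plain read followed by a plain write is not atomic, hence not an AWAR.
plain = [("p", "r", "x", None), ("p", "w", "x", None)]
```

Under this encoding a successful compare-and-swap counts as an AWAR, while a non-atomic read followed by a write does not, which matches the requirement that both events belong to the same atomic section.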
Examples of AWAR are CAS and mCAS.
Figure 1: Execution $E$ of a permissive, opaque STM: $T_2$ and $T_3$ force $T_1$ to perform a RAW/AWAR in
RAW/AWAR cost of permissive STMs
In this section, we show that a transaction in a permissive STM implementation may be required to perform at least one RAW/AWAR pattern per tm-read.
Let M be a permissive, opaque TM implementation. Consider an execution E of M with a history H consisting of transactions T 1 , T 2 , T 3 as shown in Figure 1 : T 3 performs a read of X 1 , then T 2 performs a write on X 1 and commits, and finally T 1 performs a series of reads from objects X 1 , . . . , X m . Here, R k (X), W k (X, v) denote complete executions of read k (X) and write k (X, v) respectively. Since the implementation is permissive, no transaction can be forcefully aborted in E, and the only valid serialization of this execution is T 3 , T 2 , T 1 . Note also that the execution generates a sequential history: each invocation of a tm-operation is immediately followed by a matching response in H. Thus, since we assume starvation-freedom as a liveness property, such an execution exists.
Imagine that we modify the execution $E$ as follows. Immediately after $R_1(X_k)$ executed by $T_1$, we add $W_3(X_k, v)$ and $tryC_3$ executed by $T_3$ (let $TC_3(X_k)$ denote the complete execution of $W_3(X_k, v)$ followed by $tryC_3$). Obviously, $TC_3(X_k)$ must return abort: $T_3$ cannot be serialized before $T_1$, nor can $T_1$ be serialized before $T_3$. On the other hand, if $TC_3(X_k)$ takes place just before $R_1(X_k)$, then $TC_3(X_k)$ must return commit but $R_1(X_k)$ must return the value written by $T_3$. In other words, $R_1(X_k)$ and $TC_3(X_k)$ are strongly non-commutative [6]: both of them see the difference when ordered differently. As a result, intuitively, $R_1(X_k)$ needs to perform a RAW or AWAR to make sure that the order of these two "conflicting" operations is properly maintained. A formal proof follows.
Theorem 13 Let M be a permissive opaque STM implementation. Then, for any m ∈ N, M has an execution in which some transaction performs m tm-reads such that the execution of each tm-read contains at least one RAW or AWAR.
Proof. Suppose, by contradiction, that the execution of some R 1 (X k ) does not modify base objects. Imagine a modification E′ of E, in which T 3 performs W 3 (X k ) immediately after R 1 (X k ) and then tries to commit. A serialization of H′ = E′| T M should obey T 3 ≺ DU H′ T 2 and T 2 ≺ H′ T 1 . Since the execution of R 1 (X k ) does not modify base objects, T 3 does not observe R 1 (X k ) in E′. Since M is permissive, T 3 must commit in E′. But since T 1 performs R 1 (X k ) before T 3 commits and T 3 updates X k , we also have T 1 ≺ DU H′ T 3 . Thus, T 3 cannot precede T 1 in any serialization, a contradiction. Consequently, each R 1 (X k ) must perform a write to a base object.
Let π be a fragment of E that represents the complete execution of R 1 (X k ). Clearly, π contains a write to a base object. Let π j be the first write to a base object in π, and let π w denote the shortest fragment of π that contains the atomic section to which π j belongs (π w = π j if π j is not part of an atomic section). Thus, π can be represented as π s · π w · π f .
Suppose that π does not contain a RAW or AWAR. Since π w does not contain an AWAR, there are no read events in π w that precede π j . Thus, π j is the first base object event in π w . Consider the execution fragment π s · ρ, where ρ is the complete execution of T C 3 (X k ) by T 3 . Such an execution exists since π s does not perform any base object write, hence, π s · ρ is indistinguishable to T 3 from ρ.
Since, by our assumption, π w · π f contains no RAW, any read performed in π w · π f can only be applied to base objects previously written in π w · π f . Thus, there exists an execution π s · ρ · π w · π f that is indistinguishable to T 1 from π. In π s · ρ · π w · π f , T 3 commits (as in ρ) but T 1 ignores the value written by T 3 to X k . But T 3 , T 2 , T 1 is the only valid serialization for
Note that since all tm-reads of T 1 are executed sequentially, all these RAW/AWAR patterns are pairwise non-overlapping.
RAW/AWAR cost and protected data in progressive STMs
In this section, we first describe our progressive STM implementations that perform at most one RAW/AWAR per transaction. Then we present a lower bound on the amount of data to be protected by a transaction in a progressive STM.
Constant RAW/AWAR implementations for progressive STM
We start by showing that even a single-lock progressive STM cannot avoid performing one RAW/AWAR per transaction in some executions.
Theorem 14
Let M be a single-lock progressive opaque STM implementation. Then every execution of M in which an uncontended transaction performs at least one read and at least one write contains a RAW/AWAR pattern.
Proof. Consider an execution π of M in which an uncontended transaction T 1 performs (among other events) read 1 (X), write 1 (Y, v) and tryC 1 (). Since M is single-lock progressive, T 1 must commit in π. Clearly π must contain a write to a base object. Otherwise a subsequent transaction reading Y would return the initial value of Y instead of the value written by T 1 .
Let π j be the first write to a base object in π and let π w denote the shortest fragment of π that contains the atomic section to which π j belongs (π w = π j if π j is not part of an atomic section). Thus, π can be represented as π s · π w · π f . Now suppose, by contradiction, that π contains neither RAW nor AWAR patterns. Since π w contains no AWAR, there are no read events in π w that precede π j . Since π j is the first write event in π, it follows that π j is the first base-object event in π w .
Since π s contains no writes, the states of base objects in the initial configuration and in the configuration after π s is performed are the same. Consider an execution π s · ρ where in ρ, a transaction T 2 performs read 2 (Y ), write 2 (X, 1), tryC 2 () and commits. Such an execution exists, since ρ is indistinguishable to T 2 from an execution in which T 2 is uncontended and thus T 2 cannot be forcefully aborted in π s · ρ.
Since π w · π f contains no RAWs, every read performed in π w · π f is applied to base objects which were previously written in π w · π f . Thus, there exists an execution
But T 1 reads the initial value of X and T 2 reads the initial value of Y in π s · ρ · π w · π f , and thus T 1 and T 2 cannot both be committed (at least one of the committed transactions must read the value written by the other), a contradiction.
The proof is analogous in the case when an execution of T 1 extends any execution π 0 that contains only complete transactions.
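The final step of the proof can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the encoding of transactions as lists of operations, the helper `is_legal`, and the written values 7 and 1 are hypothetical and not part of the paper. It enumerates both sequential orders of T 1 and T 2 and confirms that neither is legal when both transactions read initial values.

```python
from itertools import permutations

# Initial values of the two t-objects (assumed to be 0 for illustration).
INITIAL = {"X": 0, "Y": 0}

# Each committed transaction is a list of (operation, t-object, value).
# For a read, the value is what the read returned in the execution.
T1 = [("read", "X", 0), ("write", "Y", 7)]
T2 = [("read", "Y", 0), ("write", "X", 1)]


def is_legal(order):
    """Return True if, in this sequential order, every read returns
    the value of the latest preceding write (or the initial value)."""
    state = dict(INITIAL)
    for txn in order:
        for op, obj, val in txn:
            if op == "read" and state[obj] != val:
                return False
            if op == "write":
                state[obj] = val
    return True


# Collect every sequential order of {T1, T2} that is legal.
legal_orders = [order for order in permutations([T1, T2]) if is_legal(order)]
```

Under these assumptions `legal_orders` is empty: whichever transaction is placed second would have to read the value written by the first, which is exactly the contradiction used in the proof.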
Since every progressive or permissive STM implementation is also single-lock progressive, the RAW/AWAR lower bound of Theorem 14 also holds for progressive and permissive STM implementations. The lower bound is actually tight, and we sketch two progressive opaque implementations. Both implementations are strict data-partitioned [15] (split the set of base objects used into disjoint subsets, each subset storing information of only a single t-object) and single-version (maintain exactly one copy of a t-object's state at a time). They also use invisible reads, i.e., no execution of a tm-read operation performs a write to a base object.
Our first implementation employs an mCAS primitive and works, in brief, as follows. Every t-object X i is associated with a distinct base object v i that stores the "most recent" value of X i together with the id of the transaction that was the last to update X i . Each time a transaction T k performs a read of a t-object X i , it reads v i , adds X i to its read set and checks whether any t-object in the current read set of T k has been updated since T k read it. If so, the transaction is forcefully aborted. Otherwise, T k returns the value read in v i . Each time T k performs a write to a t-object X i , it adds X i to its write set and returns ok.
For every updating transaction T k , tryC k () invokes the mCAS primitive over Dset(T k ). If the mCAS returns true, tryC k () returns C k , otherwise it returns A k . Clearly, if T k is forcefully aborted, then the execution of mCAS involved no AWAR (no write to a base object took place). Read-only transactions simply return C k . Consequently, the implementation incurs a single AWAR per updating committed transaction.
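The description above can be illustrated by a minimal sequential sketch in Python. It is not the paper's algorithm: the class names `MCASMemory` and `Transaction` are hypothetical, the mCAS primitive is merely simulated with a lock (Python offers no hardware mCAS), and reading the current state of write-only objects just before the mCAS is our assumption about how expected values for Dset(T k ) are obtained.

```python
import threading

COMMIT, ABORT = "commit", "abort"


class MCASMemory:
    """Base objects v_i, each holding (value, id of last writer)."""

    def __init__(self, names):
        # Writer id 0 stands for the fictitious initializing transaction.
        self.cells = {name: (0, 0) for name in names}
        self._guard = threading.Lock()  # simulates atomicity of mCAS

    def read(self, name):
        return self.cells[name]

    def mcas(self, expected, new):
        """Atomically install `new` iff every cell matches `expected`."""
        with self._guard:
            if any(self.cells[n] != e for n, e in expected.items()):
                return False
            self.cells.update(new)
            return True


class Transaction:
    def __init__(self, tid, mem):
        self.tid, self.mem = tid, mem
        self.rset = {}  # t-object -> (value, writer id) seen at first read
        self.wset = {}  # t-object -> value to install at commit

    def read(self, name):
        cell = self.mem.read(name)
        self.rset.setdefault(name, cell)
        # Invisible read: validate the whole read set, write nothing.
        if any(self.mem.read(n) != seen for n, seen in self.rset.items()):
            return ABORT
        return cell[0]

    def write(self, name, value):
        self.wset[name] = value
        return "ok"

    def try_commit(self):
        if not self.wset:
            return COMMIT  # read-only: no mCAS, hence no AWAR
        expected = dict(self.rset)
        for n in self.wset:  # assumption: sample write-only objects here
            expected.setdefault(n, self.mem.read(n))
        new = {n: (v, self.tid) for n, v in self.wset.items()}
        return COMMIT if self.mem.mcas(expected, new) else ABORT


mem = MCASMemory(["X", "Y"])
t1, t2 = Transaction(1, mem), Transaction(2, mem)
v1 = t1.read("X")
t2.write("X", 5)
r2 = t2.try_commit()  # commits: nothing it depends on has changed
t1.write("Y", 9)
r1 = t1.try_commit()  # aborts: X in Rset(T1) was overwritten by T2
```

In this run T 1 is aborted only because of a read-write conflict with T 2 on X, which is the kind of abort that progressiveness permits.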
Theorem 15
There exists a progressive opaque STM implementation with wait-free operations that employs exactly one AWAR per transaction. Moreover, no AWARs are performed in read-only or aborted transactions.
Even if we do not use atomic sections (and, thus, AWARs) we still can implement a progressive opaque STM using reads and writes that incurs only a single multi-RAW (and, thus, incurring just a single fence) per update transaction. This implementation uses a simple multi-trylock primitive, which in turn can be implemented with a single multi-RAW. The multi-trylock primitive exports operations acquire(W ), release(W ) and isContended(X), for all sets of t-objects W and all t-objects X. Informally, if there is no contention on the locks on objects in W , then acquire(W ) returns true which means that exclusive locks on all objects in W are acquired. Otherwise, acquire(W ) returns false which means that no locks on objects in W are acquired. Operation release(W ) releases the acquired locks on objects in W and isContended(X) returns true iff a lock on X is currently held by any process. The implementation of acquire(W ) first writes to a series of base objects and then reads a series of base objects incurring a single multi-RAW, while operations release(W ) and isContended(X) incur no RAW.
Implementations of reads and writes are similar to ones described above, except that each time a transaction T k performs a read of a t-object X i , it additionally checks if no object in the current read set is locked by an updating transaction. If some object in the read set has been modified or is locked, the transaction is forcefully aborted.
For every updating transaction T k , tryC k () invokes acquire(Wset(T k )). If it returns true, tryC k () returns C k , otherwise it returns A k . Read-only transactions simply return C k . Consequently, the implementation incurs a single multi-RAW per updating transaction.
Theorem 16
There exists a progressive opaque STM implementation with wait-free operations that employs a single multi-RAW per transaction. Moreover, no RAWs are performed in read-only transactions.
We also derive a strongly progressive STM using only reads and writes that incurs at most four RAWs per updating transaction and uses a finite number of bounded registers. Our implementation uses a starvation-free multi-trylock primitive inspired by the Black-White Bakery Algorithm [25] , a bounded version of the Bakery Algorithm [18] .
Informally, if no concurrent process contends infinitely long on some X ∈ W , then the acquire(W ) operation of the starvation-free multi-trylock eventually returns true which means that exclusive locks on all objects in W are acquired. The implementation of acquire(W ) incurs three RAWs, while operation release(W ) performs a single RAW.
Implementations of tm-reads and tm-writes are identical to the constant RAW progressive implementation described above. For every updating transaction T k , tryC k () invokes the acquire operation of the starvation-free multi-trylock over Wset(T k ). Note that this always returns true and a transaction T k with Rset k = ∅ eventually returns C k . Read-only transactions simply return C k . Consequently, the implementation incurs four RAWs per updating transaction.
Theorem 17 There exists a strongly progressive single-version opaque STM implementation with starvation-free operations that uses invisible reads and employs four RAWs per transaction. Moreover, no RAWs are performed in read-only transactions.
Note that our implementation does not violate the impossibility result of Guerraoui and Kapalka [15] who proved that a strongly progressive opaque STM cannot be implemented using only reads and writes if tm-operations are required to be wait-free.
Protected data
Let M be a progressive STM implementation. Intuitively, a t-object X j is protected at the end of some finite execution π of M if some transaction T 0 is about to atomically change the value of X j in its next step (e.g., by performing a CAS operation) or does not allow any concurrent transaction to read X j (e.g., by holding a lock on X j ).
Formally, let α · π be an execution of M such that π is an uncontended complete execution of a transaction T 0 , where Wset(T 0 ) = {X 1 , . . . , X m }. Let u j (j = 1, . . . , m) denote the value written by T 0 to t-object X j in π. We say that π′ is a proper prefix of π if π′ is a prefix of π and every atomic section is complete in π′. In this section, let π t denote the t-th shortest proper prefix of π. Let π 0 denote the empty prefix. (Recall that an atomic event is either a tm-event, a read or write on a base object, or an atomic section.)
For any X j ∈ Wset(T 0 ), let T j denote a transaction that tries to read X j and commit. Let E t j = α · π t · ρ t j denote the extension of α · π t in which T j runs solo until it completes. Note that, since we only require the implementation to be starvation-free, ρ t j can be infinite. We say that α · π t is (1, j)-valent if the read operation performed by T j in α · π t · ρ t j returns u j (the value written by T 0 to X j ). We say that α · π t is (0, j)-valent if the read operation performed by T j in α · π t · ρ t j does not abort and returns an "old" value u ≠ u j . Otherwise, if the read operation of T j aborts or never returns in α · π t · ρ t j , we say that α · π t is (⊥, j)-valent.
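The three kinds of valence can be made concrete with a small Python sketch. The encoding is hypothetical: the outcome of T j 's solo read is represented by the returned value, or by `None` when the read aborts or never returns, and the sample outcomes are invented for illustration only.

```python
from enum import Enum


class Valency(Enum):
    ZERO = "0"         # read returns an old value u != u_j
    ONE = "1"          # read returns u_j, the value written by T_0
    BOTTOM = "bottom"  # read aborts or never returns


def classify(read_outcome, u_j):
    """Valency of a prefix with respect to X_j.

    read_outcome is the value returned by T_j's solo read after the
    prefix, or None if that read aborts or never returns.
    """
    if read_outcome is None:
        return Valency.BOTTOM
    return Valency.ONE if read_outcome == u_j else Valency.ZERO


# Hypothetical outcomes of T_j's read after prefixes pi_0 .. pi_4,
# assuming T_0 writes u_j = 7 and the old value is 0.
outcomes = [0, 0, None, None, 7]
valencies = [classify(o, 7) for o in outcomes]

# Indices of prefixes that are (bottom, j)-valent.
bottom_prefixes = [t for t, v in enumerate(valencies) if v is Valency.BOTTOM]
```

In this invented trace the prefixes π 2 and π 3 are (⊥, j)-valent, sitting between the (0, j)-valent prefixes π 0 , π 1 and the (1, j)-valent prefix π 4 .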
Definition 18
We say that T 0 protects an object X j in α · π t , where π t is the t-th shortest proper prefix of π (t > 0) if one of the following conditions holds:
For disjoint-access parallel (DAP) progressive STM, we show that every uncontended transaction must protect every object in its write set at some point of its execution. We observe that no prefix of π can be 0-valent and 1-valent at the same time (the notation is the same as introduced in Section 7.2).
Lemma 19
There does not exist π t , a proper prefix of π, and i, j ∈ {1, . . . , m} such that α · π t is both (0, i)-valent and (1, j)-valent.
Proof. By contradiction, suppose that there exist i, j and α · π t that is both (0, i)-valent and (1, j)-valent. Since the implementation is DAP, by Lemma 5, there exists an execution of M , E t ij = α · π t · ρ t j · ρ t i that is indistinguishable to T i from α · π t · ρ t i . In E t ij , the only possible serialization is T 0 , T j , T i . But T i returns the "old" value of X i and, thus, the serialization is not legal-a contradiction.
If α · π t is (0, i)-valent (resp., (1, i)-valent) for some i, we say that it is 0-valent (resp., 1-valent). By Lemma 19, the notions of 0-valence and 1-valence are well-defined.
Theorem 20 Let M be a progressive, opaque and disjoint-access-parallel STM implementation. Let α · π be an execution of M , where π is an uncontended complete execution of a transaction T 0 . Then there exists π t , a proper prefix of π, such that T 0 protects |Wset(T 0 )| t-objects in α · π t .
Proof. Let Wset T 0 = {X 1 , . . . , X m }. Consider two cases:
(1) Suppose that π has a proper prefix π t such that α · π t is 0-valent and α · π t+1 is 1-valent.
By Lemma 19, there does not exist i such that α · π t is (1, i)-valent and α · π t+1 is (0, i)-valent. Thus, one of the following is true:
• For every i ∈ {1, . . . , m}, α · π t is (0, i)-valent and α · π t+1 is (1, i)-valent
• At least one of α · π t and α · π t+1 is (⊥, i)-valent i.e. the operation of T i aborts or never returns
In either case, T 0 protects m t-objects in α · π t .
(2) Now suppose that such π t does not exist, i.e., there is no i ∈ {1, . . . , m} and t ∈ {0, |π|−1} such that E t i exists and returns an old value, and E t+1 i exists and returns a new value.
Suppose there exist s, t, 0 < s + 1 < t, S ⊆ {1, . . . , m}, such that:
• for all r, s < r < t, and for all i ∈ S, α · π r is (⊥, i)-valent.
We say that s + 1, . . . , t − 1 is a protecting fragment for t-objects {X j |j ∈ S}.
Since M is opaque and progressive, α · π 0 = α is 0-valent and α · π is 1-valent. Thus, the assumption of Case (2) implies that for each X i , there exists a protecting fragment for {X i }. In particular, there exists a protecting fragment for {X 1 }.
Now we proceed by induction. Let π s+1 , . . . , π t−1 be a protecting fragment for {X 1 , . . . , X u−1 } such that u ≤ m.
Now we claim that there must be a subfragment of s+1, . . . , t−1 that protects {X 1 , . . . , X u }.
Suppose not. Thus, there exists r, s < r < t, such that α · π r is (0, u)-valent or (1, u)-valent. Suppose first that α · π r is (1, u)-valent. Since α · π s is (0, i)-valent for some i ≠ u, by Lemma 19 and the assumption of Case (2), there must exist s′, t′, s < s′ + 1 < t′ ≤ r such that
• for all r′, s′ < r′ < t′, α · π r′ is (⊥, u)-valent.
As a result, s′ + 1, . . . , t′ − 1 is a protecting fragment for {X 1 , . . . , X u }. The case when α · π r is (0, u)-valent is symmetric, except that now we should consider fragment r, . . . , t instead of s, . . . , r.
Thus, there exists a subfragment of s + 1, . . . , t − 1 that protects {X 1 , . . . , X u }. By induction, we obtain a protecting fragment s + 1, . . . , t − 1 for {X 1 , . . . , X m }. Thus, any prefix α · π r , where s < r < t protects exactly m t-objects.
In both cases, there is a proper prefix of α · π that protects exactly m t-objects.
The lower bound of Theorem 20 is tight: it is matched by all progressive implementations we are aware of, including ones in Section 7.1. Note that any DAP single-lock STM implementation automatically provides a stronger progress condition than just single-lock progressiveness. A transaction T in a DAP single-lock STM can only be forcefully aborted if it observes a concurrent transaction T′ such that Dset(T) ∩ Dset(T′) ≠ ∅. This is not very far from progressiveness, where T may abort only if T and T′ experience a write-write or write-read conflict on a t-object. Thus, in the realm of DAP STM implementations, progressiveness is very close to the weakest nontrivial progress condition.
Related work
Crain et al. [9] proved that a permissive opaque TM implementation cannot maintain invisible reads, which inspired the derivation of our lower bound on RAW/AWAR complexity in Section 6. The RAW/AWAR complexity for concurrent implementations was recently introduced in [6] . The proofs of Theorems 13 and 14 extend the arguments used in [6] to the STM context.
A related paper by Attiya et al. [8] showed that every permissive strictly serializable and DAP TM in which every read-only transaction must commit in a wait-free manner has an execution in which some read-only transaction T k performs at least |Dset(T k )|-1 base-object writes. In this paper we do not assume that a read operation must be wait-free and we do not require disjoint-access parallelism. Also, we focus on the number of RAW/AWAR patterns and not only on base-object writes. On the other hand, we consider a stronger correctness property (opacity). Therefore, our lower bound in Section 6 is incomparable with the one of [8].
To establish the lower bound on t-objects that must be "protected" in an opaque, progressive TM (Section 7.2), we use the definition of disjoint-access parallelism introduced in [8] . Guerraoui and Kapalka [15] considered a stronger version of DAP called strict data-partitioning to prove a linear lower bound on the number of steps performed by a successful read operation in a progressive, opaque TM that uses invisible reads. Interestingly, the constant RAW/AWAR implementations of progressive, opaque TMs sketched in Section 7 are strict data-partitioned.
Concluding remarks
In this paper, we derived inherent costs of implementing STMs with non-trivial concurrency guarantees. At a high level, our results suggest that providing high degrees of concurrency in STM may incur considerable unavoidable costs. Our results give rise to many intriguing questions, and we list some of them below.
In this paper, we focused on progress conditions that provide positive concurrency, progressiveness and permissiveness. The results do not apply to obstruction-free STMs [12] that only guarantee that a transaction commits if it eventually runs without contention. Effectively, an obstruction-free STM provides zero concurrency, since progress is guaranteed only when one transaction is active at a time. However, unlike single-lock implementations, it does allow overlapping transactions to make progress (one at a time). Does this incur higher RAW/AWAR complexity?
We cannot expect the lower bound of Theorem 20 (the protected-data size) to apply to non-DAP STMs, including trivial ones that allow storing the state of the whole STM in one base object. One way to avoid trivialities is to assume that a base object can store information only about a constant number of t-objects (the constant-size information property in [13] ) which can potentially give asymptotically close results.
We focused on implementations that allow a tm-operation to be delayed only by concurrent operations performed by other transactions. Does relaxing the tm-liveness property by allowing a read operation to wait until a concurrent transaction terminates [7] improve the RAW/AWAR complexity with respect to permissive implementations? It is easy to see that the proof of our permissive lower bound (Theorem 13) does not work for this case. But it is unclear a priori how this may affect the cost of progressive implementations.
Last but not least, the results of this paper assume opacity as a correctness property. Recently, multiple relaxations of opacity were proposed [10, 2, 9, 8] . It would be very interesting to understand the concurrency benefits gained by such relaxed consistency conditions.
A Constant RAW/AWAR implementations for progressive TM
This section presents the pseudo-code for single RAW and single AWAR implementations of progressive opaque STMs and their proofs of correctness. The single RAW implementation uses a multi-trylock primitive described below, while the single AWAR implementation uses an mCAS primitive. Finally, we describe the read-write implementation of a strongly progressive STM that employs at most four RAWs per transaction. In the implementations, every t-object X i is associated with a distinct base object v i that stores the "most recent" value of X i together with the id of the transaction that was the last to update X i .
A.1 Multi-trylock
Algorithm 1 Multi-trylock invoked by process p i
    ...
 3: acquire(Q):
 4:   for all X j ∈ Q do
 5:     write(r ij , 1)
 6:   if ∃X j ∈ Q; t ≠ i : r tj = 1 then
 7:     for all X j ∈ Q do
 8:       write(r ij , 0)
 9:     return false
10:   return true
11: release(Q):
12:   for all X j ∈ Q do
13:     write(r ij , 0)
14:   return ok
15: isContended(X j ):
16:   if ∃p t : r tj ≠ 0, t ≠ i then
17:     return true
18:   return false
A multi-trylock provides exclusive write-access to a set Q of t-objects. Specifically, a multi-trylock exports the following operations
• acquire(Q) returns true or false
• release(Q) releases the lock and returns ok
• isContended(X j ), X j ∈ Q returns true or false
We assume that processes are well-formed: they never invoke a new operation on the multi-trylock before receiving a response from the previous invocation.
We say that a process p i holds a lock on X j after an execution π if π contains the invocation of acquire(Q), X j ∈ Q by p i that returned true, but does not contain a subsequent invocation of release(Q′), X j ∈ Q′, by p i in π. We say that X j is locked after π by process p i if p i holds a lock on X j after π.
We say that X j is contended by p i after an execution π if π contains the invocation of acquire(Q), X j ∈ Q, by p i but does not contain a subsequent return false or return of release(Q ),
Let an execution π contain the invocation i op of an operation op followed by a corresponding response r op (we say that π contains op). We say that X j is uncontended (resp., locked) during the execution of op in π if X j is uncontended (resp., locked) after every prefix of π that contains i op but does not contain r op .
A multi-trylock implementation satisfies the following properties:
• Mutual-exclusion: For any object X j , and any execution π, there exists at most one process that holds a lock on X j after π.
• Progress: Let π be any execution that contains acquire(Q) by process p i . If no object in Q is contended during the execution of acquire(Q) by a process p k ≠ p i in π then acquire(Q) returns true in π.
• Let π be any execution that contains isContended(X j ) invoked by p i .
-If X j is locked by p ℓ , ℓ ≠ i, during the complete execution of isContended(X j ) in π, then isContended(X j ) returns true.
-If ∀ℓ ≠ i, X j is never contended by p ℓ during the complete execution of isContended(X j ) in π, then isContended(X j ) returns false.
Note that if X j is neither locked nor uncontended during the complete execution of isContended(X j ), then either true or false can be returned.
Theorem 21 Algorithm 1 is an implementation of a multi-trylock object in which every operation is wait-free, every operation incurs at most one multi-RAW, and isContended involves no base-object writes.
Proof. Denote by L the shared object implemented by Algorithm 1. The operations exported by L are wait-free, i.e., every operation returns a value to the invoking process after a finite number of its own steps. This follows from the fact that the implementation of acquire, release and isContended described by Algorithm 1 contains no unbounded loops or waiting statements.
Assume, by contradiction, that L does not provide mutual-exclusion: there exists an execution π after which processes p i and p k , i ≠ k, hold a lock on the same object, say X j . In order to hold the lock on X j , process p i writes 1 to register r ij and then checks if any other process p k has written 1 to r kj . Since the corresponding operation acquire(Q), X j ∈ Q invoked by p i returns true, p i read 0 in r kj in Line 6. But then p k writes 1 to r kj and later finds that r ij is 1. This is because p k can write 1 to r kj only after the read of r kj returned 0 to p i , which is preceded by the write of 1 to r ij . Hence, p k evaluates the conditional in Line 6 while r ij = 1, i ≠ k, so its acquire cannot return true, a contradiction.
L also ensures progress. A process p i wishing to hold a lock on X j in an execution π invokes acquire(Q), X j ∈ Q, which writes 1 to register r ij and then checks if any other process p k has written 1 to register r kj . If no other process contends on an object in Q during the execution of acquire(Q), the conditional in Line 6 evaluates to false and acquire(Q) returns true.
Let π be any execution that contains isContended(X j ) executed by p i . If no process p t , t ≠ i, contends on X j during the execution of isContended(X j ) in π, then p i finds r tj = 0 for all t ≠ i, the conditional in Line 16 evaluates to false, and isContended(X j ) returns false. However, if X j is locked by another process during the execution of isContended(X j ) in π, then at any point of the execution there exists t ≠ i such that r tj = 1. Thus, the conditional in Line 16 evaluates to true and isContended(X j ) returns true.
The implementation of isContended(X j ) only reads base objects. The implementation of acquire(Q) first writes to a series of base objects and then reads a series of base objects incurring a single multi-RAW. The implementation of release(Q) only writes to base objects.
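The structure of Algorithm 1 can be sketched in Python as follows. This is a sequential illustration under stated assumptions, not a faithful concurrent implementation: the class name `MultiTryLock`, the explicit process-id argument, and the dictionary of registers are hypothetical, and plain Python assignments do not model memory fences. The sketch only shows the shape of the operations, in particular that acquire performs all its writes before all its reads.

```python
class MultiTryLock:
    """One register r[(i, x)] per process i and t-object x, initially 0."""

    def __init__(self, n_processes, objects):
        self.n = n_processes
        self.r = {(i, x): 0 for i in range(n_processes) for x in objects}

    def acquire(self, i, Q):
        # Write phase: announce interest in every object of Q.
        for x in Q:
            self.r[(i, x)] = 1
        # Read phase: look for any other process announced on Q.
        # Writes followed by reads form a single multi-RAW.
        if any(self.r[(t, x)] == 1
               for x in Q for t in range(self.n) if t != i):
            for x in Q:  # back off: withdraw the announcements
                self.r[(i, x)] = 0
            return False
        return True

    def release(self, i, Q):
        # Only writes to base objects, hence no RAW.
        for x in Q:
            self.r[(i, x)] = 0
        return "ok"

    def is_contended(self, i, x):
        # Only reads of base objects, hence no RAW.
        return any(self.r[(t, x)] != 0 for t in range(self.n) if t != i)


lock = MultiTryLock(2, ["X", "Y", "Z"])
a = lock.acquire(0, {"X", "Y"})   # p0 locks X and Y
b = lock.acquire(1, {"Y", "Z"})   # p1 fails on Y and backs off
c = lock.is_contended(1, "X")     # X is held by p0
d = lock.is_contended(1, "Z")     # p1 withdrew its own announcement
lock.release(0, {"X", "Y"})
e = lock.acquire(1, {"Y", "Z"})   # now succeeds
```

A real implementation would need the writes of the announcement phase to become visible before the reads begin, which is where the single fence of the multi-RAW would be placed.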
A.2 Progressive implementation with single multi-RAW
Algorithm 2 describes the algorithms for tm-operations of a progressive opaque STM incurring at most a single multi-RAW per transaction. 
    ...
 7:   if isAbortable() then
 8:     return A k
 9:   return the value of ov j
10: write k (X j , v):
11:   if X j ∈ Wset(T k ) then
12:     nv j := v
13:     ...
14:   return ok k
15: tryA k ():
    ...
20:   locked := L.acquire(Wset(T k ))
21:   if not locked then
22:     return A k
23:   if isAbortable() then
24:     L.release(Wset(T k ))
25:     return A k
26:   for all X j ∈ Wset(T k ) do
27:     ...
    ...
29:   return C k
30: Function: isAbortable():
    ...
      return false
Each time a transaction T k performs a read of a t-object X i , it reads v i , adds X i to its read set and checks if the t-objects in the current read set of T k have not been updated since T k has read them and additionally checks if no object in the current read set is locked by an updating transaction. If some object in the read set has been modified or is locked, the transaction is forcefully aborted. Otherwise, T k returns the value read in v i .
Each time T k performs a write to a t-object X i , it adds X i to its write set and returns ok. The implementation of tryC k () uses the multi-trylock primitive described in Section A.1. For every updating transaction T k , tryC k () invokes L.acquire(Wset(T k )), where L denotes the multi-trylock implemented by Algorithm 1. If it returns true, tryC k () returns C k , otherwise it returns A k . Read-only transactions simply return C k .
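The tm-operations described above can be sketched in Python as follows. This is a sequential illustration only, with several assumptions that are not taken from the paper: the names `TryLock` and `Txn` are hypothetical, `TryLock` is a compact stand-in for the multi-trylock of Section A.1, memory is a plain dictionary from t-objects to (value, writer id) pairs, and the listing follows only the recoverable control flow of tryC k () (acquire, validate, write back, release).

```python
COMMIT, ABORT = "commit", "abort"


class TryLock:
    """Compact stand-in for the multi-trylock (one flag per pid/object)."""

    def __init__(self):
        self.flags = set()  # raised (process id, t-object) pairs

    def acquire(self, pid, objs):
        self.flags |= {(pid, x) for x in objs}
        if any(p != pid and x in objs for p, x in self.flags):
            self.release(pid, objs)
            return False
        return True

    def release(self, pid, objs):
        self.flags -= {(pid, x) for x in objs}

    def is_contended(self, pid, x):
        return any(p != pid and y == x for p, y in self.flags)


class Txn:
    def __init__(self, tid, mem, lock):
        self.tid, self.mem, self.lock = tid, mem, lock
        self.rset, self.wset = {}, {}

    def _is_abortable(self):
        # Abort if a read object is locked by another or was modified.
        return any(self.lock.is_contended(self.tid, x) or self.mem[x] != seen
                   for x, seen in self.rset.items())

    def read(self, x):
        ov = self.mem[x]
        self.rset.setdefault(x, ov)
        return ABORT if self._is_abortable() else ov[0]

    def write(self, x, v):
        self.wset[x] = v
        return "ok"

    def try_commit(self):
        if not self.wset:
            return COMMIT  # read-only: no lock, hence no RAW
        objs = set(self.wset)
        if not self.lock.acquire(self.tid, objs):
            return ABORT
        if self._is_abortable():
            self.lock.release(self.tid, objs)
            return ABORT
        for x, v in self.wset.items():
            self.mem[x] = (v, self.tid)
        self.lock.release(self.tid, objs)
        return COMMIT


mem = {"X": (0, 0), "Y": (0, 0)}
lock = TryLock()
t1, t2 = Txn(1, mem, lock), Txn(2, mem, lock)
v1 = t1.read("X")
t2.write("X", 5)
r2 = t2.try_commit()   # commits and updates X
t1.write("Y", 9)
r1 = t1.try_commit()   # aborts: X in Rset(T1) was modified by T2
lock.acquire(3, {"Y"})  # a hypothetical process 3 holds a lock on Y
r4 = Txn(4, mem, lock).read("Y")  # aborts: Y is locked by another process
```

Both aborts in this run are caused by a conflict on a t-object (a modified or locked element of the read set), in line with the progressive behaviour described above.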
A.2.1 Proof of opacity
Let E be any execution of the TM implemented by Algorithm 2. Recall that we assume every t-object was initialized by some fictitious committed transaction T 0 that precedes E. Let < E denote a total-order on events in E.
Linearization points. Let H denote a linearization of E| T M constructed by selecting linearization points of tm-operations performed in E| T M . The linearization point of a tm-operation op, denoted as ℓ op , is associated with a base object event or a tm-event performed during the lifetime of op using the following procedure.
First, we obtain a completion of E| T M by removing some pending invocations and adding responses to the remaining pending invocations involving a transaction T k as follows:
• Every incomplete read k , write k or tryA k operation is removed from E| T M
• For every pending tryC k , if some base object v j was written (Line 12), the response C k is added to the end of E| T M , else A k is added to the end of E| T M
Now a linearization H of E| T M is obtained by associating linearization points to tm-operations in the obtained completion of E| T M as follows:
• For every tm-read op k that returns a non-A k value, the linearization point ℓ op k is chosen as the event in Line 5 of Algorithm 2, else, ℓ op k is chosen as the invocation event of op k
• For every tm-write or tm-abort op k that returns, ℓ op k is chosen as the invocation event of op k
• For every op k = tryC k that returns C k such that Wset(T k ) ≠ ∅, ℓ op k is associated with the successful acquisition of the lock on Wset(T k ) (at the end of Line 7), else if op k returns A k , ℓ op k is associated with the invocation event of op k
• For every op k = tryC k that returns C k such that Wset(T k ) = ∅, ℓ op k is associated with Line 6
< H denotes a total-order on tm-operations in the complete sequential history H.
Serialization points. The serialization of a transaction T j , denoted as δ T j , is associated with the linearization point of a tm-operation performed within the lifetime of the transaction. We obtain a t-complete history H̄ from H as follows:
• For every transaction T k in H that is live, we insert tryC k · A k immediately after the last event of T k in H.
• For every aborted transaction T k in H, we remove each write operation in T k with the matching response
H̄ is thus a t-complete sequential history that contains only updating committed transactions and read-only transactions since every aborted transaction is reduced to its read-prefix. A serialization S is obtained by associating serialization points to transactions in H̄ as follows:
• If T k is an update transaction that commits, then δ T k is the linearization point of tryC k
• If T k is a read-only or aborted transaction, then δ T k is assigned to the linearization point of the last tm-read that returned a non-A k value in T k
< S denotes a total-order on transactions in the t-sequential history S.
Recall that T i precedes T j in the deferred-update order if there exists X ∈ Rset(T i ) ∩ W set(T j ), T j has committed, such that the response of read i (X) precedes the invocation of tryC j () in H. Thus, read i (X) < E tryC j . Consider the histories depicted in Figure 2 where T i precedes T j in the deferred-update order (tryC k (X j ) denotes a tryC k such that X j ∈ Wset(T k )).
(1) Consider the history depicted in Figure 2(A) where T i is a read-only transaction and T j is an updating transaction that returns C j . Assume the contrary that T i ≺ DU H T j , but T j < S T i , which implies that δ T j < E δ T i , i.e., tryC j (X) precedes the linearization point of the last tm-read in T i that returns a non-A i value (say read i (X′)). Thus, successful lock acquisition on X by T j in Line 7 precedes the read of the base object associated with X by T i in Line 5.
Figure 2: Assignment of serialization points respects the deferred-update order
read i (X′) checks if any object in Rset(T i ) is locked by a concurrent transaction, then performs read-validation (Line 7). Consider the following possible sequence of events: T j acquires the lock on X, updates X in shared memory, T i reads the base object associated with X′, T j releases the lock and finally T i performs the check in Line 7. Then read i (X′) is forced to return A i because X has been invalidated.
Alternatively, suppose that T j acquires the lock on X, updates X in shared memory, T i reads the base object associated with X′, T i performs the check in Line 7 and finally T j releases the lock on X. Again, read i (X′) returns A i since T j is holding a lock on X ∈ Rset(T i ), a contradiction.
Hence, the only possibility is that the last successful tm-read (read i (X )) in T i is linearized before tryC j (X), which implies that
(2) Suppose that T i is an updating transaction as shown in Figure 2 (B), then tryC i (X 2 ) and tryC j (X 1 ) are assigned to Line 7 of Algorithm 2 when the locks are acquired on X 2 and
Proof. This follows from the fact that for a given transaction, its serialization point is chosen within the lifetime of the transaction implying if
Proof. Assume the contrary, i.e. there exists a read j (X), X ∈ Rset(T j ) ∩ W set(T i ) that returns the value of X updated in write i (X, value) and T j < S T i . T i is an updating committing transaction, hence δ T i = tryC i . Consider two cases:
(1) Suppose that T j is a read-only transaction. Thus, δ T j is assigned to the last tm-read that returns a non-A j value (say read j (X )), whose linearization point precedes tryC i . This implies that the read of the base object associated with X by T j in Line 5 precedes the successful lock acquisition on X by T i in Line 7. Thus, the write to the base object associated with X performed by tryC i () in Line 12 is executed after the read of the base object performed by read j (X) in Line 5, a contradiction.
(2) Suppose that T j is an updating transaction. Then, tryC j < E tryC i . Again, this implies that the read of the base object in Line 5 executed by read j (X) precedes the write to the base object performed by tryC i (), a contradiction.
Lemma 25 S is legal
Proof. Recall that S is legal if every tm-read of an object X performed by a transaction T i returns the latest value written to X in S. Since we only consider canonic transactions, the latest value written to X in S is the value written by the last transaction T j such that T j commits, T j < S T i and X ∈ Wset(T j ). From Lemma 24, we have that for all T i and
Assume the contrary that
T i and T k are both updating transactions that commit. Thus,
Since, T j reads the value of X written by T i , one of the following is true
, then the successful lock acquisition on X by T k in Line 7 precedes the read of the base object associated with X by T j in Line 5. read j (X) checks if any object in Rset(T j ) is locked by a concurrent transaction, then performs read-validation (Line 7). Consider the following possible sequence of events: T k acquires the lock on X, updates X to shared-memory, T j reads the base object associated with X, T k releases the lock and finally T j performs the check in Line 7. read j (X) is forced to return A j because X ∈ Rset(T j ) (Line 6) and has been invalidated since last reading its value.
Otherwise, suppose that T k acquires the lock on X, updates X in shared memory, T j reads the base object associated with X, T j performs the check in Line 7, and finally T k releases the lock on X. Again, read j (X) returns A j since T k is holding a lock on X ∈ Rset(T j ), a contradiction.
Thus, read j (X) < E tryC k . Consider two cases:
(1) Suppose that T j is a read-only transaction. Then, δ T j is assigned to the last tm-read performed by T j that returns a non-A j value. If read j (X) is not the last tm-read that returned a non-A j value, then there exists a read j (X ) such that
(2) Suppose that T j is an updating transaction that commits, then δ T j = tryC j which implies that
The same argument as in the proof of Lemma 22 shows that both cases lead to a contradiction, since both read j (X ) and tryC j are forced to return A j .
Lemma 26 Algorithm 2 implements a progressive TM
Proof. Every transaction T k in a TM M whose tm-operations are defined by Algorithm 2 can be aborted in the following scenarios:
• Read-validation failed in read k or tryC k
• read k or tryC k returned A k because X j ∈ Rset(T k ) is locked (belongs to write set of a concurrent transaction)
• L.acquire(Wset(T k )) returned false in Line 21 of Algorithm 2
Read-validation consists of checking whether the value to be returned from a tm-read of transaction T k is consistent with the values returned from the previous tm-reads. Hence, if validation of a tm-read in T k fails, it means that the t-object is overwritten by some transaction T i such that T i < S T k , implying a read-write conflict. A read-write conflict is also implied if a tm-operation of T k returns abort because some t-object X j ∈ Rset(T k ) is locked, since the t-object is then in the write set of a concurrent transaction. Acquisition of the multi-trylock can return false for T i because there exists some X j ∈ Wset(T i ) that was being written to by a concurrent transaction T k , implying a write-write conflict.
Hence, for every transaction T i ∈ H that is aborted, there exists a conflicting t-object that is contended by a concurrent transaction. Thus, Algorithm 2 implements a progressive TM.
A.3 Progressive implementation with single mCAS
Algorithm 3 describes the implementation of a progressive, opaque TM incurring a single AWAR per updating committed transaction. The implementations of reads and writes are similar to the ones described in Section A.2, except that each time a transaction T k performs a read of a t-object X i , it reads v i , adds X i to its read set and checks that the t-objects in the current read set of T k have not been updated since T k read them. If this is not the case, the transaction is forcefully aborted. Otherwise, T k returns the value read in v i .
For every updating transaction T k , tryC k () invokes the mCAS primitive over Dset(T k ). If the mCAS returns true, tryC k () returns C k , otherwise it returns A k . Read-only transactions simply return C k .
[Algorithm 3 listing: only the following fragments survive]
    ...
    if isInvalid() then
 8:   return A k
 9: if X j ∈ Wset(T k ) then
    ...
    for all X j ∈ Wset(T k ) do
21:   ov j := read(v j )
    ...
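The read-validation and single-mCAS commit described for Algorithm 3 can be illustrated with a minimal Python sketch. It is not the paper's listing: the class names, the `ABORT` sentinel, the per-object version counters, and the lock that emulates an atomic mCAS are all assumptions made for illustration.

```python
import threading

ABORT = object()  # hypothetical stand-in for the abort response A_k


class Memory:
    """Base objects v_j, each holding a (value, version) pair.

    The version counter and the lock-based mcas are assumptions of this
    sketch; together they emulate an atomic multi-word compare-and-swap.
    """

    def __init__(self, initial):
        self._cells = {name: (value, 0) for name, value in initial.items()}
        self._atomic = threading.Lock()

    def read(self, name):
        return self._cells[name]

    def mcas(self, expected, new):
        """Install `new` only if every cell in `expected` is unchanged."""
        with self._atomic:
            if any(self._cells[n] != old for n, old in expected.items()):
                return False
            self._cells.update(new)
            return True


class Transaction:
    def __init__(self, memory):
        self.memory = memory
        self.rset = {}  # t-object -> (value, version) seen at first read
        self.wset = {}  # t-object -> value to install at commit

    def _is_invalid(self):
        # Read validation: has any t-object in the read set changed?
        return any(self.memory.read(n) != seen for n, seen in self.rset.items())

    def read(self, name):
        if name in self.wset:
            return self.wset[name]
        self.rset.setdefault(name, self.memory.read(name))
        if self._is_invalid():
            return ABORT
        return self.rset[name][0]

    def write(self, name, value):
        self.wset[name] = value  # deferred update: nothing is shared yet

    def try_commit(self):
        if not self.wset:
            return True  # read-only: commits without invoking mcas
        expected = dict(self.rset)
        for name in self.wset:
            # Objects written but never read: take the old value now.
            expected.setdefault(name, self.memory.read(name))
        new = {n: (v, expected[n][1] + 1) for n, v in self.wset.items()}
        return self.memory.mcas(expected, new)
```

In this sketch a transaction aborts only when some object in its data set was changed by a committed transaction, which mirrors the progress argument given for Algorithm 3; read-only transactions never call `mcas`.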
A.3.1 Proof of opacity
Using the same notation as in the proof of opacity for Algorithm 2 in Section A.2.1, let E denote an execution of the TM implemented by Algorithm 3 and H , a linearization of the execution history E | T M . We construct H by assigning linearization points to tm-operations performed in a completion of E | T M .
The linearization point of a tm-operation op k performed by transaction T k in a completion of E | T M is associated with access of a base object or a tm-event performed during the lifetime of the tm-operation as follows.
• For every tm-read op k that returns a non-A k value, the linearization point of op k is chosen as the event in Line 4 of Algorithm 3; otherwise, it is chosen as the invocation event of op k
• For every tm-write op k that returns, the linearization point of op k is chosen as the invocation event of op k
• For every op k = tryC k that returns C k such that Wset(T k ) ≠ ∅, the linearization point of op k is associated with the successful mCAS over Dset(T k ) (Line 26); if op k returns A k , it is associated with the invocation event of op k
The t-sequential history S is constructed in the same manner as described in Section A.2.1 from the above assignment of linearization points. Note that the lemmas proven for Algorithm 2 are clearly also valid for Algorithm 3.
Theorem 27 There exists a progressive opaque STM implementation that employs exactly one AWAR per transaction. Moreover, no AWARs are performed in read-only or aborted transactions.
Proof sketch. Clearly, Algorithm 3 implements an opaque STM.
Algorithm 3 is progressive since every transaction forcefully aborts either due to read invalidation or because mCAS returns false, implying that there exists a conflicting t-object contended by a concurrent transaction. Also note that, if several transactions concurrently conflict on a single t-object, the first transaction to execute the mCAS in Line 26 obtains true and commits. Thus, the implementation guarantees that in any set of concurrent conflicting transactions, at least one of the transactions commits, which actually provides a stronger progress guarantee than progressiveness or even strong progressiveness. Indeed, a transaction T k can abort only if a concurrent committed transaction modifies the value of v j for some X j ∈ Dset(T k ).
Algorithm 3 performs a single mCAS operation on Dset(T k ) of a transaction T k that commits during tryC k ; if T k aborts, the mCAS only performs reads of base objects. For read-only transactions, the transaction simply returns C k incurring no AWAR.
A.4 Starvation-free multi-trylock
In this section, we define a multi-trylock object analogous to the one defined in Section A.1, but whose operations are starvation-free. The algorithm is inspired by the Black-White Bakery Algorithm [25] and uses a finite number of bounded registers.
The algorithm uses the following shared variables: registers r ij for each process p i and object X j , a shared bit color ∈ {B, W }, registers LA i ∈ {0, . . . , N } for each p i that denote a Label and
A starvation-free multi-trylock implementation satisfies the following properties:
• Progress: Let π be any execution that contains acquire(Q) by process p i . If no other process p k , k ≠ i, contends infinitely long on some X j ∈ Q, then acquire(Q) returns true in π.
- If ∀ℓ ≠ i, X j is never contended by p ℓ during the execution of isContended(X j ) in π, then isContended(X j ) returns false.
Lemma 28 In every execution π of Algorithm 4, if p i holds a lock on some object X j after π, then one of the following conditions must hold:
Proof. In order to hold the lock on X j , some process p i writes 1 to r ij , writes a value, say W to M C i and reads the Labels of other processes that have obtained the same color as itself and generates a Label greater by one than the maximum Label read (Line 11). Observe that until the value of the color bit is changed, all processes read the same value W . The first process p i to hold the lock on X j changes the color bit to B when releasing the lock and hence the value read by all subsequent processes will be B until it is changed again. Now consider two cases:
L also ensures progress. If process p i wants to hold the lock on an object X j , i.e., invokes acquire(Q), X j ∈ Q, it checks if any other process p k holds the lock on X j . If such a process p k exists and M C k = M C i , then clearly isContended(X j ) returns true for p i and (LA k , k) < (LA i , i). Thus, p i fails the conditional in Line 13 and waits until p k releases the lock on X j to return true. However, if p k contends infinitely long on X j , p i is also forced to wait indefinitely to be returned true from the invocation of acquire(Q). The same argument works when M C k ≠ M C i , since when p k stops contending on X j , isContended(X j ) eventually returns false for p i if p k does not contend infinitely long on X j .
All operations performed by L are starvation-free. Each process p i that successfully holds the lock on an object X j in an execution π invokes acquire(Q), X j ∈ Q, obtains a color and chooses a value for LA i , since there is no way to be blocked while writing to LA i . The response of operation acquire(Q) by p i is only delayed if there exists a concurrent invocation of acquire(Q′), X j ∈ Q′, by p k in π. In that case, process p i waits until p k invokes release(Q′) and writes 0 to r kj , and eventually holds the lock on X j . The implementations of release and isContended are wait-free (and hence starvation-free) since they contain no unbounded loops or waiting statements.
The implementation of isContended(X j ) only reads base objects. The implementation of release(Q) writes to a series of base objects (Line 18) and then reads a base object (Line 20) incurring a single RAW. The implementation of acquire(Q) writes to base objects (Line 8), reads the shared bit color (Line 9)-one RAW, writes to a base object (Line 10), reads the Labels (Line 11)-one RAW, writes to its own Label and finally performs a sequence of reads when evaluating the conditional in Line 13-one RAW.
Thus, Algorithm 4 incurs at most four RAWs.
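The count of four can be checked mechanically on an event trace. The sketch below assumes the usual reading of a RAW pattern (a write to one base object followed by a read of a different base object, with no intervening write to the latter) and counts non-overlapping occurrences. The traces are hypothetical: they only follow the order of accesses described above for acquire and release, since the listing of Algorithm 4 is not reproduced here, and the variable names (`r_i1`, `MC_i`, `LA_k`, and so on) are illustrative.

```python
def count_raws(events):
    """Count non-overlapping read-after-write patterns in one process's trace.

    `events` is a list of ("w", var) or ("r", var) pairs. A RAW is counted
    when a read targets a variable different from the most recent write.
    """
    raws = 0
    last_written = None
    for kind, var in events:
        if kind == "w":
            last_written = var
        elif kind == "r" and last_written is not None and last_written != var:
            raws += 1
            last_written = None  # start looking for the next, disjoint RAW
    return raws


# Hypothetical trace of acquire(Q) for Q = {X_1, X_2}, following the prose.
ACQUIRE = [
    ("w", "r_i1"), ("w", "r_i2"),                    # Line 8: announce contention
    ("r", "color"),                                  # Line 9: read the shared bit
    ("w", "MC_i"),                                   # Line 10: record own color
    ("r", "LA_1"), ("r", "LA_2"),                    # Line 11: read the Labels
    ("w", "LA_i"),                                   # write own Label
    ("r", "r_k1"), ("r", "MC_k"), ("r", "LA_k"),     # Line 13: evaluate conditional
]

# Hypothetical trace of release(Q).
RELEASE = [
    ("w", "r_i1"), ("w", "r_i2"),                    # Line 18: clear contention
    ("r", "color"),                                  # Line 20: read the shared bit
]

total = count_raws(ACQUIRE) + count_raws(RELEASE)
```

Under these assumptions acquire contributes three RAWs and release one, matching the bound of four stated for Algorithm 4.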
A.5 Strong progressive implementation with constant RAWs
Let CObj H (T i ) denote the set of t-objects over which transaction T i ∈ parts(H) conflicts with any other transaction in history H, i.e., X ∈ CObj H (T i ) if there exists a transaction T k ∈ parts(H), k ≠ i, such that T i conflicts with T k on X in H. Then, CObj H (Q) = ∪ {CObj H (T i ) | T i ∈ Q} denotes the union of the sets CObj H (T i ) for all transactions in Q. Let CTrans(H) denote the set of non-empty subsets of parts(H) such that a set Q is in CTrans(H) if no transaction in Q conflicts with a transaction not in Q.
Definition 30 A TM implementation M is strongly progressive if M is weakly progressive and, for any history H of M , there does not exist a prefix H′ of H and a set Q ∈ CTrans(H′) of transactions that are live in H such that |CObj H (Q)| ≤ 1 and every transaction in Q is forcefully aborted in H.
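To make the sets CObj and CTrans concrete, here is a small Python sketch on a toy history. It assumes that all listed transactions are concurrent and that two transactions conflict on X when X is in both data sets and in the write set of at least one of them; the function names and the example read and write sets are hypothetical.

```python
from itertools import combinations


def conflicts_on(ti, tk, rsets, wsets):
    """T-objects on which ti and tk conflict (assumed definition, see above)."""
    dset_i = rsets[ti] | wsets[ti]
    dset_k = rsets[tk] | wsets[tk]
    return {x for x in dset_i & dset_k if x in wsets[ti] or x in wsets[tk]}


def cobj(q, txns, rsets, wsets):
    """Union of the conflict objects of every transaction in q."""
    objects = set()
    for ti in q:
        for tk in txns:
            if tk != ti:
                objects |= conflicts_on(ti, tk, rsets, wsets)
    return objects


def ctrans(txns, rsets, wsets):
    """Non-empty subsets Q such that no member conflicts with a non-member."""
    result = set()
    for size in range(1, len(txns) + 1):
        for q in combinations(sorted(txns), size):
            members = set(q)
            closed = all(
                not conflicts_on(ti, tk, rsets, wsets)
                for ti in members
                for tk in txns
                if tk not in members
            )
            if closed:
                result.add(frozenset(members))
    return result


# Toy history: T1 reads X, T2 writes X, T3 writes Y.
TXNS = ["T1", "T2", "T3"]
RSETS = {"T1": {"X"}, "T2": set(), "T3": set()}
WSETS = {"T1": set(), "T2": {"X"}, "T3": {"Y"}}

closed_sets = ctrans(TXNS, RSETS, WSETS)
shared = cobj({"T1", "T2"}, TXNS, RSETS, WSETS)
```

Here {T1, T2} is in CTrans and conflicts on the single object X, so a strongly progressive TM may not forcefully abort both T1 and T2.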
Algorithm 5 describes the implementation of the tryC operation of a strongly progressive, opaque TM. The only modification over the tryC implementation of Algorithm 2 is that in Algorithm 5, every transaction with |Rset| = ∅ eventually commits. The read, write, tryA and isAbortable operations are the same as in Algorithm 2.
    v j , for each t-object X j
 3: L, a starvation-free multi-trylock object
 4: tryC k ():
 5:   if |Wset(T k )| = ∅ then
 6:     return C k
 7:   locked := L.acquire(Wset(T k ))
 8:   if isAbortable() then
 9:     L.release(Wset(T k ))
10:     return A k
11:   for all X j ∈ Wset(T k ) do
12:     write(v j , (nv j , k))
13:   L.release(Wset(T k ))
14:   return C k
Proof. Every transaction T k in a TM M whose tm-operations are defined by Algorithm 5 can be aborted in the following scenarios
Thus, Algorithm 5 implements a weakly progressive TM (From Lemma 26).
To show Algorithm 5 also implements a strongly progressive STM, we need to show that for every set of transactions that concurrently contend on a single t-object, at least one of the transactions is not aborted.
Consider transactions T i and T k that concurrently attempt to execute tryC i and tryC k such that X j ∈ Wset i ∪ Wset k . Consequently, they both invoke the acquire operation of the multi-trylock (Line 7) and thus, from Theorem 29, both T i and T k must commit eventually. Also, if validation of a tm-read in T k fails, it means that the t-object is overwritten by some transaction T i such that T i precedes T k , implying that at least one of the transactions commits. Otherwise, a tm-operation of T k returns abort because some t-object X j ∈ Rset(T k ) is locked, i.e., the t-object is in the write set of a concurrent transaction T i . While it may still be possible that T i returns A i after acquiring the lock on Wset i , strong progressiveness only guarantees progress for transactions that conflict on at most one t-object. Thus, in either case, for every set of transactions that conflict on at most one t-object, at least one transaction is not forcefully aborted.
Theorem 17 There exists a strongly progressive single-version opaque STM implementation with starvation-free operations that uses invisible reads and employs at most four RAWs per transaction. Moreover, no RAWs are performed in read-only transactions.
Proof. The correctness of Algorithm 5 clearly follows from the proof of opacity presented in Section A.2.1 for Algorithm 2. From Theorem 31, it is also strongly progressive.
Any process executing a transaction T k holds the lock on Wset(T k ) only once during tryC k . If |Wset(T k )| = ∅, then the transaction simply returns C k incurring no RAWs. Thus, from Theorem 29, Algorithm 5 incurs at most four RAWs per updating transaction and no RAWs are performed in read-only transactions.
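The control flow of tryC in Algorithm 5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: `threading.Lock` objects acquired in a fixed order stand in for the starvation-free multi-trylock L (Python's locks give no starvation-freedom guarantee), `is_abortable` is a guess at the checks performed by isAbortable (read validation plus a test for locked read-set objects), and the class and method names are hypothetical.

```python
import threading


class LockTM:
    def __init__(self, names):
        # v_j holds (value, id of last writer), as in write(v_j, (nv_j, k)).
        self.cells = {n: (0, None) for n in names}
        self.locks = {n: threading.Lock() for n in names}

    def read(self, name):
        return self.cells[name]

    def is_abortable(self, rset, wset):
        """Assumed checks: a read object changed, or is locked by another."""
        for name, seen in rset.items():
            if self.cells[name] != seen:
                return True
            if name not in wset and self.locks[name].locked():
                return True
        return False

    def try_commit(self, k, rset, wset):
        if not wset:
            return "C"  # read-only: no lock is taken, hence no RAW
        held = sorted(wset)  # fixed order avoids deadlock in this sketch
        for name in held:
            self.locks[name].acquire()  # blocks; stands in for L.acquire
        try:
            if self.is_abortable(rset, wset):
                return "A"
            for name in held:
                self.cells[name] = (wset[name], k)
            return "C"
        finally:
            for name in held:
                self.locks[name].release()  # L.release on both paths
```

An updating transaction takes the locks on its write set exactly once, and a read-only transaction returns immediately, which is the structure the RAW bound above relies on.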
B RAW/AWAR cost of probabilistically permissive STMs
Theorem 32 Let M be a probabilistically permissive opaque STM implementation. Then, for any m, there exists with positive probability, an execution in which a read-only transaction T i contains Ω(m) non-overlapping RAWs or AWARs on base objects where m = |Rset(T i )|.
Proof. For the proof, note that we only need to show that there exists an execution of the probabilistically permissive TM that is the same as the execution of a permissive TM. Then, the construction and arguments used in the proof of Theorem 13 can be extended for the probabilistic case.
Let E denote the execution depicted in Figure 1 where T 3 performs a read of X 1 , then T 2 performs a write on X 1 and commits, and finally T 1 performs a series of reads on X 1 , . . . , X m . We proceed by induction by considering R 1 (X k ), the k-th read of T 1 , 2 ≤ k ≤ m.
(1) Imagine an extension of E, denoted by E′, in which T 3 performs a W 3 (X k ) immediately after R 1 (X k ) and then tries to commit. A serialization of H = E | T M should obey T 3 ≺ DU H T 2 and T 2 ≺ H T 1 . Suppose that the execution of R 1 (X k ) does not modify base objects; hence, T 3 does not observe R 1 (X k ) in E′. In a probabilistically permissive TM, the tm-operation W 3 (X k ) can return one of the following values: A k or ok k . Note that this response is chosen by sampling uniformly at random from the set of possible return values; thus, there exists a positive probability that T 3 commits successfully (when it returns ok k ). But since T 1 performs R 1 (X k ) before T 3 commits and T 3 updates X k , we also have T 1 ≺ DU H T 3 . Thus, T 3 cannot precede T 1 in any serialization and we establish a contradiction. Consequently, there exists with positive probability, an execution in which each R 1 (X k ), 2 ≤ k ≤ m performs a write to a base-object.
(2) Let π be a fragment of E that represents the complete execution of R 1 (X k ). Clearly, there exists with positive probability, an execution in which π contains a write to a base-object. Let π j be the first write to a base-object in π and π w , the shortest fragment of π that contains the atomic section to which π j belongs, else if π j is not part of an atomic section, π w = π j . Thus, π can be represented as π s · π w · π f .
Suppose that π does not contain a RAW or AWAR. Since π w does not contain an AWAR (atomic write-after-read), there are no read events in π w that precede π j . Thus, π j is the first base-object event in π w . Consider the execution fragment π s · ρ, where ρ is the complete execution of {W 3 (X k ), T C 3 } by transaction T 3 . By Definition 9, such an execution exists with positive probability in which T 3 commits. Since π s does not perform any base-object write, π s · ρ is indistinguishable to T 3 from ρ.
Also, by our assumption, π w · π f contains no RAW, i.e., any read performed in π w · π f can only be applied to base objects previously written in π w · π f . Thus, in a probabilistically permissive TM in which responses to tm-operations are chosen by independent coin-tosses, there exists with positive probability, an execution π s · ρ · π w · π f that is indistinguishable to T 1 from π. However, in π s · ρ · π w · π f , T 3 commits (as in ρ) but T 1 ignores the value written by T 3 to X k . But T 3 can only be serialized before T 1 , a contradiction.
