On the Cost of Concurrency in Hybrid Transactional Memory by Brown, Trevor & Ravi, Srivatsan
On the Cost of Concurrency in Hybrid Transactional Memory
Trevor Brown¹ Srivatsan Ravi²
¹Dept. of CS, University of Waterloo
²Dept. of CS and Information Sciences Institute, University of Southern California
July 8, 2019
Abstract
State-of-the-art software transactional memory (STM) implementations achieve good performance
by carefully avoiding the overhead of incremental validation (i.e., re-reading previously read data
items to avoid inconsistency) while still providing progressiveness (allowing transactional aborts only
due to data conflicts). Hardware transactional memory (HTM) implementations promise even better
performance, but offer no progress guarantees. Thus, they must be combined with STMs, leading to
hybrid TMs (HyTMs) in which hardware transactions must be instrumented (i.e., access metadata)
to detect contention with software transactions.
We show that, unlike in progressive STMs, software transactions in progressive HyTMs cannot
avoid incremental validation. In fact, this result holds even if hardware transactions can read metadata
non-speculatively. We then present opaque HyTM algorithms providing progressiveness for a subset of
transactions that are optimal in terms of hardware instrumentation. We explore the concurrency vs.
hardware instrumentation vs. software validation trade-offs for these algorithms. Our experiments
with Intel and IBM POWER8 HTMs seem to suggest that (i) the cost of concurrency also exists in
practice, (ii) it is important to implement HyTMs that provide progressiveness for a maximal set of
transactions without incurring high hardware instrumentation overhead or using global contending
bottlenecks and (iii) there is no easy way to derive more efficient HyTMs by taking advantage of
non-speculative accesses within hardware.
1 Introduction
The Transactional Memory (TM) abstraction is a synchronization mechanism that allows the programmer
to optimistically execute sequences of shared-memory operations as atomic transactions. Several software
TM designs [9, 26, 14, 12] have been introduced subsequent to the original TM proposal based in
hardware [15]. The original dynamic STM implementation DSTM [14] ensures that a transaction aborts
only if there is a read-write data conflict with a concurrent transaction (à la progressiveness [13]). However,
to satisfy opacity [13], read operations in DSTM must incrementally validate the responses of all previous
read operations to avoid inconsistent executions. This results in quadratic (in the size of the transaction’s
read set) step-complexity for transactions. Subsequent STM implementations like NOrec [9] and TL2 [11]
minimize the impact on performance due to incremental validation. NOrec uses a global sequence lock
that is read at the start of a transaction and performs value-based validation during read operations only
if the value of the global lock has been changed (by an updating transaction) since reading it. TL2, on
the other hand, eliminates incremental validation completely. Like NOrec, it uses a global sequence lock,
but each data item also has an associated sequence lock value that is updated alongside the data item.
When a data item is read, if its associated sequence lock value is different from the value that was read
from the sequence lock at the start of the transaction, then the transaction aborts.
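The per-read version check can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: the names `VersionedStore`, `begin`, `read`, `commit_write` and `AbortTx` are hypothetical and are not taken from the TL2 implementation, locking and write-back are omitted, and the comparison uses the "item version greater than the start snapshot" form stated in Section 3.

```python
class AbortTx(Exception):
    """Raised when a read observes an item stamped after the snapshot."""


class VersionedStore:
    def __init__(self, items):
        self.clock = 0                                      # global version clock
        self.data = {k: (v, 0) for k, v in items.items()}   # key -> (value, version)

    def commit_write(self, key, value):
        # An updating transaction advances the clock and stamps the item.
        self.clock += 1
        self.data[key] = (value, self.clock)

    def begin(self):
        # A transaction records the clock once, at its start.
        return self.clock

    def read(self, snapshot, key):
        # Constant work per read: earlier reads are never re-read.
        value, version = self.data[key]
        if version > snapshot:
            raise AbortTx(key)
        return value
```

A read costs one comparison regardless of how many items the transaction has already read, which is the property the text contrasts with incremental validation.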
In fact, STMs like TL2 and NOrec ensure progress in the absence of data conflicts with O(1) step
complexity read operations and invisible reads (read operations which do not modify shared memory).
Nonetheless, TM designs that are implemented entirely in software still incur significant performance
overhead. Thus, current CPUs have included instructions to mark a block of memory accesses as
transactional [24, 1, 18], allowing them to be executed atomically in hardware. Hardware transactions
promise better performance than STMs, but they offer no progress guarantees since they may experience
spurious aborts. This motivates the need for hybrid TMs in which the fast hardware transactions are
complemented with slower software transactions that do not have spurious aborts.
To allow hardware transactions in a HyTM to detect conflicts with software transactions, hardware
transactions must be instrumented to perform additional metadata accesses, which introduces overhead.
arXiv:1907.02669v1 [cs.DC] 5 Jul 2019
Hardware transactions typically provide automatic conflict detection at cacheline granularity, thus ensuring
that a transaction will be aborted if it experiences memory contention on a cacheline. This is at least
the case with Intel’s Transactional Synchronization Extensions [27]. The IBM POWER8 architecture
additionally allows hardware transactions to access metadata non-speculatively, thus bypassing automatic
conflict detection. While this has the advantage of potentially reducing contention aborts in hardware,
this makes the design of HyTM implementations potentially harder to prove correct.
In [3], it was shown that hardware transactions in opaque progressive HyTMs must perform at least
one metadata access per transactional read and write. In this paper, we show that in opaque progressive
HyTMs with invisible reads, software transactions cannot avoid incremental validation. Specifically, we
prove that each read operation of a software transaction in a progressive HyTM must necessarily incur a
validation cost that is linear in the size of the transaction’s read set. This is in contrast to TL2 which is
progressive and has constant complexity read operations. Thus, in addition to the linear instrumentation
cost in hardware transactions, there is a quadratic step complexity cost in software transactions.
We then present opaque HyTM algorithms providing progressiveness for a subset of transactions that
are optimal in terms of hardware instrumentation. Algorithm 1 is progressive for all transactions, but it
incurs high instrumentation overhead in practice. Algorithm 2 avoids all instrumentation in fast-path
read operations, but is progressive only for slow-path reading transactions. We also sketch how some
hardware instrumentation can be performed non-speculatively without violating opacity.
Extensive experiments were performed to characterize the cost of concurrency in practice. We studied
the instrumentation-optimal algorithms, as well as TL2, Transactional Lock Elision (TLE) [23] and
Hybrid NOrec [25] on both Intel and IBM POWER architectures. Each of the algorithms we studied
contributes to an improved understanding of the concurrency vs. hardware instrumentation vs. software
validation trade-offs for HyTMs. Comparing results between the very different Intel and IBM POWER
architectures also led to new insights. Collectively, our results suggest the following. (i) The cost of
concurrency is significant in practice; high hardware instrumentation impacts performance negatively on
Intel and much more so on POWER8 due to its limited transactional cache capacity. (ii) It is important
to implement HyTMs that provide progressiveness for a maximal set of transactions without incurring
high hardware instrumentation overhead or using global contending bottlenecks. (iii) There is no easy
way to derive more efficient HyTMs by taking advantage of non-speculative accesses supported within
the fast-path in POWER8 processors.
Roadmap. The rest of the paper is organized as follows. Section 2 presents details of the HyTM model that
extends the model introduced in [3]. Section 3 presents our main lower bound result on the step-complexity
of slow-path transactions in progressive HyTMs, while Section 4 presents opaque HyTMs that are progressive
for a subset of transactions. Section 5 presents results from experiments on Intel Haswell and IBM POWER8
architectures, which provide a clear characterization of the cost of concurrency in HyTMs, and studies the
impact of non-speculative (or direct) accesses within hardware transactions on performance. Section 6 presents
the related work along with concluding remarks. Formal proofs appear in the Appendix.
2 Hybrid transactional memory (HyTM)
Transactional memory (TM). A transaction is a sequence of transactional operations (or t-operations),
reads and writes, performed on a set of transactional objects (t-objects). A TM implementation provides
a set of concurrent processes with deterministic algorithms that implement reads and writes on t-objects
using a set of base objects.
Configurations and executions. A configuration of a TM implementation specifies the state of each
base object and each process. In the initial configuration, each base object has its initial value and each
process is in its initial state. An event (or step) of a transaction invoked by some process is an invocation
of a t-operation, a response of a t-operation, or an atomic primitive operation applied to base object
along with its response. An execution fragment is a (finite or infinite) sequence of events E = e1, e2, . . . .
An execution of a TM implementation M is an execution fragment where, informally, each event respects
the specification of base objects and the algorithms specified by M.
For any finite execution E and execution fragment E′, E · E′ denotes the concatenation of E and
E′, and we say that E · E′ is an extension of E. For every transaction identifier k, E|k denotes the
subsequence of E restricted to events of transaction Tk. If E|k is non-empty, we say that Tk participates
in E. Let txns(E) denote the set of transactions that participate in E. Two executions E and E′ are
indistinguishable to a set T of transactions, if for each transaction Tk ∈ T , E|k = E′|k. A transaction
Tk ∈ txns(E) is complete in E if E|k ends with a response event. The execution E is complete if all
transactions in txns(E) are complete in E. A transaction Tk ∈ txns(E) is t-complete if E|k ends with Ak
or Ck; otherwise, Tk is t-incomplete. We consider the dynamic programming model: the read set (resp.,
the write set) of a transaction Tk in an execution E, denoted RsetE(Tk) (resp., WsetE(Tk)), is the set of
t-objects that Tk attempts to read (and resp. write) by issuing a t-read (resp., t-write) invocation in E
(for brevity, we sometimes omit the subscript E from the notation).
We assume that base objects are accessed with read-modify-write (rmw) primitives. A rmw primitive
event on a base object is trivial if, in any configuration, its application does not change the state of the
object. Otherwise, it is called nontrivial. Events e and e′ of an execution E contend on a base object b if
they are both primitives on b in E and at least one of them is nontrivial.
Hybrid transactional memory executions. We now describe the execution model of a Hybrid
transactional memory (HyTM) implementation. In our HyTM model, shared memory configurations
may be modified by accessing base objects via two kinds of primitives: direct and cached. (i) In a direct
(also called non-speculative) access, the rmw primitive operates on the memory state: the direct-access
event atomically reads the value of the object in the shared memory and, if necessary, modifies it. (ii) In
a cached access performed by a process i, the rmw primitive operates on the cached state recorded in
process i’s tracking set τi.
More precisely, τi is a set of triples (b, v,m) where b is a base object identifier, v is a value, and
m ∈ {shared , exclusive} is an access mode. The triple (b, v,m) is added to the tracking set when i
performs a cached rmw access of b, where m is set to exclusive if the access is nontrivial, and to shared
otherwise. We assume that there exists some constant TS such that the condition |τi| ≤ TS must always
hold; this condition will be enforced by our model. A base object b is present in τi with mode m if
∃v, (b, v,m) ∈ τi.
Hardware aborts. A tracking set can be invalidated by a concurrent process: if, in a configuration C
where (b, v, exclusive) ∈ τi (resp., (b, v, shared) ∈ τi), a process j ≠ i applies any primitive (resp., any
nontrivial primitive) to b, then τi becomes invalid and any subsequent event invoked by i sets τi to ∅ and
returns ⊥. We refer to this event as a tracking set abort.
Any transaction executed by a correct process that performs at least one cached access must necessarily
perform a cache-commit primitive that determines the terminal response of the transaction. A cache-
commit primitive issued by process i with a valid τi does the following: for each base object b such
that (b, v, exclusive) ∈ τi, the value of b in C is updated to v. Finally, τi is set to ∅ and the operation
returns commit. We assume that a fast-path transaction Tk returns Ak as soon as a cached primitive or
cache-commit returns ⊥.
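The tracking-set rules above can be mimicked by a toy model. This is a sketch, not part of the formal model: the class `TrackingSet` and its method names are hypothetical, `None` stands in for ⊥, and the capacity check is simplified so that it fires only when a new base object would exceed the bound TS.

```python
class TrackingSet:
    """Toy model of one process's tracking set: base object -> (value, mode)."""

    def __init__(self, capacity):
        self.capacity = capacity   # the constant TS
        self.entries = {}          # b -> (v, "shared" | "exclusive")
        self.valid = True

    def cached_access(self, memory, b, new_value=None):
        # Returns the value seen, or None on a tracking set or capacity abort.
        full = b not in self.entries and len(self.entries) == self.capacity
        if not self.valid or full:
            self.entries.clear()
            return None
        value, mode = self.entries.get(b, (memory[b], "shared"))
        if new_value is not None:            # nontrivial access
            value, mode = new_value, "exclusive"
        self.entries[b] = (value, mode)
        return value

    def observe(self, b, nontrivial):
        # Another process applied a primitive to b.
        if b in self.entries:
            mode = self.entries[b][1]
            if mode == "exclusive" or nontrivial:
                self.valid = False

    def cache_commit(self, memory):
        if not self.valid:
            self.entries.clear()
            return None
        for b, (v, mode) in self.entries.items():
            if mode == "exclusive":
                memory[b] = v                # publish buffered writes
        self.entries.clear()
        return "commit"
```

Writes stay buffered in the tracking set until `cache_commit`, and a shared entry is invalidated only by a nontrivial primitive of another process, mirroring the two cases in the definition of hardware aborts.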
Slow-path and fast-path transactions. We partition HyTM transactions into fast-path transactions
and slow-path transactions. A slow-path transaction models a regular software transaction. An event of a
slow-path transaction is either an invocation or response of a t-operation, or a direct rmw primitive on a
base object. A fast-path transaction essentially encapsulates a hardware transaction. Specifically, in any
execution E, we say that a transaction Tk ∈ txns(E) is a fast-path transaction if E|k contains at least
one cached event. An event of a hardware transaction is either an invocation or response of a t-operation,
or a direct trivial access or a cached access, or a cache-commit primitive.
Remark 1 (Tracking set aborts). Let Tk ∈ txns(E) be any t-incomplete fast-path transaction executed
by process i, where (b, v, exclusive) ∈ τi (resp., (b, v, shared) ∈ τi) after execution E, and e be any event
(resp., nontrivial event) that some process j ≠ i is poised to apply after E. The next event of Tk in any
extension of E · e is Ak.
Remark 2 (Capacity aborts). Any cached access performed by a process i executing a fast-path transaction
Tk with |Dset(Tk)| > 1 first checks the condition |τi| = TS, where TS is a pre-defined constant, and if so, it
sets τi = ∅ and immediately returns ⊥.
Direct reads within fast-path. Note that we specifically allow hardware transactions to perform
reads without adding the corresponding base object to the process’ tracking set, thus modeling the
suspend/resume instructions supported by IBM POWER8 architectures. Note that Intel’s HTM does
not support this feature: an event of a fast-path transaction does not include any direct access to base
objects.
HyTM properties. We consider the TM-correctness property of opacity [13]: an execution E is opaque
if there exists a legal (every t-read of a t-object returns the value of its latest committed t-write) sequential
execution S equivalent to some t-completion of E that respects the real-time ordering of transactions in
E. We also assume a weak TM-liveness property for t-operations: every t-operation returns a matching
response within a finite number of its own steps if running step-contention free from a configuration in
which every other transaction is t-complete. Moreover, we focus on HyTMs that provide invisible reads:
t-read operations do not perform nontrivial primitives in any execution.
3 Progressive HyTM must perform incremental validation
In this section, we show that it is impossible to implement opaque progressive HyTMs with invisible reads
with O(1) step-complexity read operations for slow-path transactions. This result holds even if fast-path
transactions may perform direct trivial accesses.
Formally, we say that a HyTM implementation M is progressive for a set T of transactions if in
any execution E of M with T ⊆ txns(E), if any transaction Tk ∈ T returns Ak in E, there exists another
concurrent transaction Tm that conflicts (both access the same t-object and at least one writes) with Tk
in E [13].
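The conflict condition in this definition can be written as a small predicate over read and write sets. This is an illustrative sketch: the function names `conflicts` and `abort_allowed` are hypothetical, and concurrency between transactions is assumed to have been established by the caller.

```python
def conflicts(rset_a, wset_a, rset_b, wset_b):
    """True if both transactions access a common t-object and at least one writes it."""
    return bool(wset_a & (rset_b | wset_b)) or bool(wset_b & rset_a)


def abort_allowed(aborted, concurrent):
    """Progressiveness permits an abort only if some concurrent transaction conflicts.

    `aborted` is a (read set, write set) pair; `concurrent` is a list of such pairs.
    """
    r, w = aborted
    return any(conflicts(r, w, r2, w2) for r2, w2 in concurrent)
```

Two transactions that only read the same t-object do not conflict, so a progressive implementation may not abort either of them on account of the other.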
We construct an execution of a progressive opaque HyTM in which every t-read performed by a
read-only slow-path transaction must access a linear (in the size of the read set) number of distinct base
objects.
Theorem 3. Let M be any progressive opaque HyTM implementation providing invisible reads. There
exists an execution E of M and some slow-path read-only transaction Tk ∈ txns(E) that incurs a time
complexity of Ω(m²), where m = |Rset(Tk)|.
Proof sketch. We construct an execution of a read-only slow-path transaction Tφ that performs m ∈ N
distinct t-reads of t-objects X1, . . . , Xm. We show inductively that for each i ∈ {1, . . . , m}, the
i-th t-read must access i − 1 distinct base objects during its execution. The (partial) steps in our execution
are depicted in Figure 1.
For each i ∈ {1, . . . , m}, M has an execution of the form depicted in Figure 1b. Start with the
complete step contention-free execution of slow-path read-only transaction Tφ that performs (i − 1)
t-reads: readφ(X1) · · · readφ(Xi−1), followed by the t-complete step contention-free execution of a fast-path
transaction Ti that writes nvi ≠ vi to Xi and commits, and then the complete step contention-free
execution fragment of Tφ that performs its i-th t-read: readφ(Xi) → nvi. Indeed, by progressiveness, Ti
cannot incur tracking set aborts and, since it accesses only a single t-object, it cannot incur capacity
aborts. Moreover, in this execution, the t-read of Xi by slow-path transaction Tφ must return the value
nvi written by fast-path transaction Ti since this execution is indistinguishable to Tφ from the execution
in Figure 1a.
We now construct (i − 1) different executions of the form depicted in Figure 1c: for each ℓ ≤ (i − 1), a
fast-path transaction Tℓ (preceding Ti in real-time ordering, but invoked following the (i − 1) t-reads
by Tφ) writes nvℓ ≠ v to Xℓ and commits, followed by the t-read of Xi by Tφ. Observe that Tℓ and
Ti, which access mutually disjoint data sets, cannot contend with each other: if they did, they would
concurrently contend on some base object and incur a tracking set abort, thus violating progressiveness.
Indeed, by the TM-liveness property we assumed (cf. Section 2) and invisible reads for Tφ, each of these
(i − 1) executions exists.
In each of these (i − 1) executions, the final t-read of Xi cannot return the new value nvi: the only
possible serialization for transactions is Tℓ, Ti, Tφ; but the readφ(Xℓ) performed by Tφ that returns the
initial value v is not legal in this serialization, contradicting the assumption of opacity. In other words,
slow-path transaction Tφ is forced to verify the validity of t-objects in Rset(Tφ). Finally, we note that,
for all ℓ, ℓ′ ≤ (i − 1) with ℓ′ ≠ ℓ, fast-path transactions Tℓ and Tℓ′ access mutually disjoint sets of base objects,
thus forcing the t-read of Xi to access at least i − 1 different base objects in the worst case. Consequently,
for all i ∈ {2, . . . , m}, slow-path transaction Tφ must perform at least i − 1 steps while executing the i-th
t-read in such an execution.
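Summing the per-read bound over all m t-reads of Tφ gives the quadratic total stated in Theorem 3:

```latex
\sum_{i=1}^{m} (i-1) \;=\; \frac{m(m-1)}{2} \;=\; \Omega(m^{2}),
\qquad m = |\mathit{Rset}(T_\phi)|.
```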
Figure 1: Proof steps for Theorem 3
(a) Slow-path transaction Tφ performs i − 1 distinct t-reads (each returning the initial value) followed by the
t-read of Xi that returns value nv written by fast-path transaction Ti
(b) Fast-path transaction Ti does not contend with any of the i − 1 t-reads performed by Tφ and must be committed
in this execution since it cannot incur a tracking set or capacity abort. The t-read of Xi must return nv because
this execution is indistinguishable to Tφ from 1a
(c) In each of these i − 1 executions, fast-path transactions cannot incur a tracking set or capacity abort. By
opacity, the t-read of Xi by Tφ cannot return new value nv. Therefore, to distinguish the i − 1 different executions,
the t-read of Xi by slow-path transaction Tφ is forced to access i − 1 different base objects
How STM implementations mitigate the quadratic lower bound step complexity. NOrec [9]
is a progressive opaque STM that minimizes the average step-complexity resulting from incremental
validation of t-reads. Transactions read a global versioned lock at the start, and perform value-based
validation during t-read operations iff the global version has changed. TL2 [11] improves over NOrec
by circumventing the lower bound of Theorem 3. Concretely, TL2 associates a global version with each
t-object updated during a transaction and performs validation with O(1) complexity during t-reads
by simply verifying if the version of the t-object is greater than the global version read at the start
of the transaction. Technically, NOrec and the algorithms in this paper provide a stronger definition of
progressiveness: a transaction may abort only if there is a prefix in which it conflicts with another
transaction and both are t-incomplete. TL2 on the other hand allows a transaction to abort due to a
concurrent conflicting transaction.
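The NOrec-style rule, validating by value only when the global version has moved, can be sketched in a single-threaded model. The class `NOrecLikeReader`, the one-element list used as the global versioned lock, and the `validation_steps` counter are hypothetical names introduced for illustration; this is not the NOrec implementation, and `None` stands in for an abort.

```python
class NOrecLikeReader:
    """Read-only transaction that revalidates by value only if the global lock moved."""

    def __init__(self, memory, global_seq):
        self.memory = memory            # shared dict: t-object -> value
        self.global_seq = global_seq    # one-element list standing in for the global lock
        self.snapshot = global_seq[0]   # version observed at transaction start
        self.read_log = []              # (t-object, value seen)
        self.validation_steps = 0

    def read(self, key):
        now = self.global_seq[0]
        if now != self.snapshot:
            # An updater committed since the snapshot: re-read every logged item.
            for k, seen in self.read_log:
                self.validation_steps += 1
                if self.memory[k] != seen:
                    return None
            self.snapshot = now
        value = self.memory[key]
        self.read_log.append((key, value))
        return value
```

When no updater commits, reads cost no validation at all; the linear re-read of the log is paid only after the global version changes.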
Implications for disjoint-access parallelism in HyTM. The property of disjoint-access parallelism
(DAP), in its weakest form, ensures that two transactions concurrently contend on the same base object
only if their data sets are connected in the conflict graph, capturing data-set overlaps among all concurrent
transactions [5]. It is well known that weak DAP STMs with invisible reads must perform incremental
validation even if the TM-progress condition requires transactions to commit only in the absence
of any concurrent transaction [13, 17]. For example, DSTM [14] is a weak DAP STM that is progressive
and consequently incurs the validation complexity. On the other hand, TL2 and NOrec are not weak
DAP since they employ a global versioned lock that mitigates the cost of incremental validation, but this
allows two transactions accessing disjoint data sets to concurrently contend on the same memory location.
Indeed, this inspires the proof of Theorem 3.
4 Hybrid transactional memory algorithms
Property                             Algorithm 1  Algorithm 2                  TLE       Hybrid NOrec
Instrumentation in fast-path reads   per-read     constant                     constant  constant
Instrumentation in fast-path writes  per-write    per-write                    constant  constant
Validation in slow-path reads        Θ(|Rset|)    O(|Rset|)                    none      O(|Rset|), but validation only if concurrency
Hardware-software concurrency        prog.        prog. for slow-path readers  zero      not prog., but small contention window
Direct accesses inside fast-path     yes          no                           no        yes
Opacity                              yes          yes                          yes       yes
Figure 2: Table summarizing complexities of HyTM implementations
Instrumentation-optimal progressive HyTM. We describe a HyTM algorithm whose complexity matches the
lower bound of Theorem 3 and the instrumentation cost on the fast-path transactions established in [3]. Pseudocode
appears in Algorithm 1. For every t-object Xj, our implementation maintains a base object vj that stores
the value of Xj and a sequence lock rj.
Fast-path transactions: For a fast-path transaction Tk executed by process pi, the readk(Xj) implementation
first reads rj (direct) and returns Ak if some other process pj holds a lock on Xj. Otherwise, it
returns the value of Xj. As with readk(Xj), the write(Xj, v) implementation returns Ak if some other
process pj holds a lock on Xj; otherwise process pi increments the sequence lock rj. If the cache has not
been invalidated, pi updates the shared memory during tryCk by invoking the cache-commit primitive.
Slow-path read-only transactions: Any readk(Xj) invoked by a slow-path transaction first reads the value
of the t-object from vj, adds rj to Rset(Tk) if it is not held by a concurrent transaction, and then performs
validation on its entire read set to check if any of its entries have been modified. If rj is held or validation
fails, the transaction returns Ak. Otherwise, it returns the value of Xj. Validation of the read set is
performed by re-reading the values of the sequence lock entries stored in Rset(Tk).
Slow-path updating transactions: An updating slow-path transaction Tk attempts to obtain exclusive
write access to its entire write set. If all the locks on the write set were acquired successfully, Tk performs
validation of the read set and if successful, updates the values of the t-objects in shared memory, releases
the locks and returns Ck; else pi aborts the transaction.
Direct accesses inside fast-path: Note that opacity is not violated even if the accesses to the sequence lock
during a t-read are performed directly, that is, without incurring tracking set aborts.
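The slow-path read and its incremental validation can be sketched as follows. This single-threaded model is an illustration only: the class `SlowPathReader`, the helper `fast_path_write`, the `steps` counter, and the encoding of the lock state in the low bit of the sequence word are all assumptions, and the whole lock word is compared rather than just its sequence part. `None` stands in for Ak.

```python
class SlowPathReader:
    def __init__(self, values, seqlocks):
        self.v = values        # t-object -> value (base objects v_j)
        self.r = seqlocks      # t-object -> sequence lock word (base objects r_j)
        self.rset = {}         # t-object -> lock word observed at read time
        self.steps = 0         # base-object reads spent on validation

    @staticmethod
    def is_locked(word):
        return word % 2 == 1   # assumption: low bit marks "locked"

    def validate(self):
        # Re-read the sequence lock of every entry in the read set.
        for x, seen in self.rset.items():
            self.steps += 1
            if self.r[x] != seen:
                return False
        return True

    def read(self, x):
        seen = self.r[x]
        value = self.v[x]
        self.rset[x] = seen
        if self.is_locked(seen) or not self.validate():
            return None
        return value


def fast_path_write(values, seqlocks, x, new_value):
    # A committed fast-path write bumps the sequence lock and the value together.
    seqlocks[x] += 2
    values[x] = new_value
```

Reading four distinct t-objects costs 1 + 2 + 3 + 4 = 10 validation steps in this model, which is the quadratic growth the lower bound predicts.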
Instrumentation-optimal HyTM that is progressive only for slow-path reading transactions.
Algorithm 2 does not incur the linear instrumentation cost on the fast-path reading transactions (inherent
to Algorithm 1), but provides progressiveness only for slow-path reading transactions. The instrumentation
cost on fast-path t-reads is avoided by using a global lock that serializes all updating slow-path transactions
during the tryCk procedure. Fast-path transactions simply check if this lock is held without acquiring it
(similar to TLE [23]). While the per-read instrumentation overhead is avoided, Algorithm 2 still incurs
the per-write instrumentation cost.
Sacrificing progressiveness and minimizing contention window. Observe that the lower bound
in Theorem 3 assumes progressiveness for both slow-path and fast-path transactions along with opacity
and invisible reads. Note that Algorithm 2 retains the validation step complexity cost since it provides
progressiveness for slow-path readers.
Hybrid NOrec [8] is a HyTM implementation that does not satisfy progressiveness (unlike its STM
counterpart NOrec), but mitigates the step-complexity cost on slow-path transactions by performing
incremental validation during a transactional read iff the shared memory has changed since the start of
the transaction. Conceptually, Hybrid NOrec uses a global sequence lock gsl that is incremented at the
start and end of each transaction’s commit procedure. Readers can use the value of gsl to determine
whether shared memory has changed between two configurations. Unfortunately, with this approach, two
fast-path transactions will always conflict on the gsl if their commit procedures are concurrent. To reduce
the contention window for fast-path transactions, the gsl is actually implemented as two separate locks
(the second one called esl). A slow-path transaction locks both esl and gsl while it is committing. Instead
of incrementing gsl, a fast-path transaction checks if esl is locked and aborts if it is. Then, at the end of
the fast-path transaction’s commit procedure, it increments gsl twice (quickly locking and releasing it)
and immediately commits in hardware. Although the window for fast-path transactions to contend on
gsl is small, our experiments have shown that contention on gsl has a significant impact on performance.
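The two-lock scheme described above can be modeled with two counters. This is a loose single-threaded sketch rather than the Hybrid NOrec implementation: the class `HybridNOrecLocks`, its method names, and the convention that an odd counter value means "locked" are assumptions made for illustration.

```python
class HybridNOrecLocks:
    def __init__(self):
        self.gsl = 0   # global sequence lock; assumption: odd value means locked
        self.esl = 0   # second lock, held only by committing slow-path transactions

    def slow_path_commit(self, apply_writes):
        self.esl += 1          # lock esl
        self.gsl += 1          # lock gsl
        apply_writes()
        self.gsl += 1          # release gsl; readers now see a new even value
        self.esl += 1          # release esl

    def fast_path_commit(self, apply_writes):
        if self.esl % 2 == 1:
            return False       # a slow-path commit is in progress: abort
        apply_writes()
        self.gsl += 2          # lock and release gsl back to back
        return True
```

A slow-path reader that recorded `gsl` at its start sees a different value after either kind of commit, which is what triggers its validation; fast-path transactions touch `gsl` only at the very end of their commit.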
5 Evaluation
In this section, we study the performance characteristics of Algorithms 1 and 2, Hybrid NOrec, TLE
and TL2. Our experimental goals are: (G1) to study the performance impact of instrumentation on
the fast-path and validation on the slow-path, (G2) to understand how HyTM algorithm design affects
performance with Intel and IBM POWER8 HTMs, and (G3) to determine whether direct accesses can
be used to obtain performance improvements on IBM POWER8 using the supported suspend/resume
instruction to escape from a hardware transaction.
Algorithm 1 Progressive fast-path and slow-path opaque HyTM implementation; code for transaction Tk

Shared objects
  vj, value of each t-object Xj
  rj, a sequence lock of each t-object Xj

Code for fast-path transactions

readk(Xj)
  ovj := vj
  orj := rj                          ▷ direct read
  if orj.isLocked() then return Ak
  return ovj

writek(Xj, v)
  orj := rj
  if orj.isLocked() then return Ak
  rj := orj.IncSequence()
  vj := v
  return OK

tryCk()
  commit-cachei

Function: release(Q)
  for each Xj ∈ Q do rj := orj.unlock()

Function: acquire(Q)
  for each Xj ∈ Q
    if rj.tryLock()                  ▷ CAS/LLSC
      Lset(Tk) := Lset(Tk) ∪ {Xj}
    else
      release(Lset(Tk))
      return false
  return true

Code for slow-path transactions

readk(Xj)
  if Xj ∈ Wset(Tk) then return Wset(Tk).locate(Xj)
  orj := rj
  ovj := vj
  Rset(Tk) := Rset(Tk) ∪ {Xj, orj}
  if orj.isLocked() then return Ak
  if not validate() then return Ak
  return ovj

writek(Xj, v)
  orj := rj
  nvj := v
  if orj.isLocked() then return Ak
  Wset(Tk) := Wset(Tk) ∪ {Xj, nvj, orj}
  return OK

tryCk()
  if Wset(Tk) = ∅ then return Ck
  if not acquire(Wset(Tk)) then return Ak
  if not validate()
    release(Wset(Tk))
    return Ak
  for each Xj ∈ Wset(Tk) do vj := nvj
  release(Wset(Tk))
  return Ck

Function: validate()
  if ∃ Xj ∈ Rset(Tk): orj.getSequence() ≠ rj.getSequence() then return false
  return true

Experimental system (Intel). The experimental system is a 2-socket Intel E7-4830 v3 with 12 cores
per socket and 2 hyperthreads (HTs) per core, for a total of 48 threads. Each core has a private 32KB L1
cache and 256KB L2 cache (shared between HTs on a core). All cores on a socket share a 30MB L3 cache.
This system has a non-uniform memory architecture (NUMA) in which threads have significantly different
access costs to different parts of memory depending on which processor they are currently executing on.
The machine has 128GB of RAM, and runs Ubuntu 14.04 LTS. All code was compiled with the GNU
C++ compiler (G++) 4.8.4 with build target x86_64-linux-gnu and compilation options -std=c++0x -O3
-mx32.
We pin threads so that the first socket is saturated before we place any threads on the second socket.
Thus, thread counts 1-24 run on a single socket. Furthermore, hyperthreading is engaged on the first
socket for thread counts 13-24, and on the second socket for thread counts 37-48. Consequently, our
graphs clearly show the effects of NUMA and hyperthreading.
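The pinning policy on the Intel machine can be written as a mapping from thread index to socket, core and hyperthread. This sketch only reproduces the order described above; the function name `place_thread` is hypothetical, and translating the result into operating-system CPU identifiers is outside what the text specifies.

```python
def place_thread(t, cores_per_socket=12, hts_per_core=2):
    """Map a 0-based thread index to (socket, core, hyperthread).

    A socket is filled completely before the next one is used; within a socket,
    every core receives one thread before any core receives a second one.
    """
    per_socket = cores_per_socket * hts_per_core
    socket, within = divmod(t, per_socket)
    ht, core = divmod(within, cores_per_socket)
    return socket, core, ht
```

With the default parameters, the 13th thread (index 12) is the first to share a core with another thread, and the 25th thread (index 24) is the first to run on the second socket, matching the thread-count ranges given in the text.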
Algorithm 2 Opaque HyTM implementation that is progressive only for slow-path reading transactions;
code for Tk by process pi

Shared objects
  L, global lock

Code for fast-path transactions

startk()
  if L.isLocked() then return Ak

readk(Xj)
  ovj := vj
  return ovj

writek(Xj, v)
  orj := rj
  rj := orj.IncSequence()
  vj := v
  return OK

tryCk()
  return commit-cachei

Code for slow-path transactions

tryCk()
  if Wset(Tk) = ∅ then return Ck
  L.Lock()
  if not acquire(Wset(Tk)) then return Ak
  if not validate() then
    release(Wset(Tk))
    return Ak
  for each Xj ∈ Wset(Tk) do vj := nvj
  release(Wset(Tk))
  return Ck

Function: release(Q)
  for each Xj ∈ Q do rj := nrj.unlock()
  L.unlock(); return OK
Experimental system (IBM POWER8). The experimental system is an IBM S822L with 2x 12-core
3.02GHz processor cards, 128GB of RAM, running Ubuntu 16.04 LTS. All code was compiled using G++
5.3.1. This is a dual socket machine, and each socket has two NUMA zones. It is expensive to access
memory on a different NUMA zone, and even more expensive if the NUMA zone is on a different socket.
POWER8 uses the L2 cache for detecting tracking set aborts, and limits the size of a transaction’s read-
and write-set to 8KB each [21]. This is in contrast to Intel which tracks conflicts on the entire L3 cache,
and only limits a transaction’s read-set to the L3 cache size, and its write-set to the L1 cache size.
We pin one thread on each core within a NUMA zone before moving to the next zone. We remark
that unlike the thread pinning policy for Intel which saturated the first socket before moving to the next,
this proved to be the best policy for POWER8 which experiences severe negative scaling when threads
are saturated on a single 8-way hardware multi-threaded core. This is because all threads on a core share
resources, including the L1 and L2 cache, a single branch execution pipeline, and only two load-store
pipelines.
Hybrid TM implementations. For TL2, we used the implementation published by its authors. We
implemented the other algorithms in C++. Each hybrid TM algorithm first attempts to execute a
transaction on the fast-path, and will continue to execute on the fast-path until the transaction has
experienced 20 aborts, at which point it will fall back to the slow-path. We implemented Algorithm 1 on
POWER8 where each read of a sequence lock during a transactional read operation was enclosed within a
pair of suspend/resume instructions to access them without incurring tracking set aborts (Algorithm 1∗).
We remark that this does not affect the opacity of the implementation. We also implemented the variant
of Hybrid NOrec (Hybrid NOrec∗) in which the update to gsl is performed using a fetch-increment
primitive between suspend/resume instructions, as is recommended in [25].
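The fallback policy described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the names runTransaction, fastAttempt and slowPath are hypothetical, and the hardware attempt is modelled as a callable that returns true on commit instead of real HTM instructions.

```cpp
#include <functional>

// Path on which a transaction eventually committed.
enum class Path { Fast, Slow };

// Abort threshold stated in the text.
constexpr int kMaxFastPathAborts = 20;

// Retries the (hypothetical) fast-path attempt until it commits or has
// aborted kMaxFastPathAborts times, then falls back to the slow-path.
// fastAttempt returns true on commit and false on abort; slowPath is
// assumed to retry internally until it commits.
inline Path runTransaction(const std::function<bool()>& fastAttempt,
                           const std::function<void()>& slowPath) {
    for (int aborts = 0; aborts < kMaxFastPathAborts; ++aborts) {
        if (fastAttempt()) return Path::Fast;  // committed in hardware
    }
    slowPath();  // software fallback
    return Path::Slow;
}
```

A real implementation would replace fastAttempt with a hardware transaction and might inspect the abort reason before retrying; the sketch only captures the abort-count threshold.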
In each algorithm, instead of placing a lock next to each address in memory, we allocated a global
array of one million locks, and used a simple hash function to map each address to one of these locks.
This avoids the problem of having to change a program’s memory layout to incorporate locks, and greatly
reduces the amount of memory needed to store locks, at the cost of some possible false conflicts since
many addresses map to each lock. Note that the exact same approach was taken by the authors of TL2.
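As an illustration of this lock-table layout, the following sketch maps addresses to entries of a fixed array of one million locks. The hash (a shift followed by a multiplicative mix) and the names LockTable, indexFor and lockFor are assumptions made for the example; the text only states that a simple hash function was used.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

class LockTable {
public:
    // "One million locks", as stated in the text.
    static constexpr std::size_t kNumLocks = 1000000;

    LockTable() : locks_(kNumLocks) {}

    // Maps an address to a lock index. The shift and the multiplier are
    // illustrative assumptions, not the hash used in the experiments.
    static std::size_t indexFor(const void* addr) {
        auto a = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(addr));
        std::uint64_t h = (a >> 3) * 0x9E3779B97F4A7C15ULL;
        return static_cast<std::size_t>(h % kNumLocks);
    }

    // Returns the lock word guarding addr; distinct addresses may share
    // a lock, which is the source of the false conflicts noted above.
    std::atomic<std::uint64_t>& lockFor(const void* addr) {
        return locks_[indexFor(addr)];
    }

private:
    std::vector<std::atomic<std::uint64_t>> locks_;
};
```

Because the table is indexed by address, no change to the program's memory layout is needed, at the price of occasional false conflicts.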
We chose not to compile the hybrid TMs as separate libraries, since invoking library functions for
each read and write can cause algorithms to incur enormous overhead. Instead, we compiled each hybrid
TM directly into the code that uses it.
Experimental methodology. We used a simple unbalanced binary search tree (BST) microbenchmark
as a vehicle to study the performance of our implementations. The BST implements a dictionary,
which contains a set of keys, each with an associated value. For each TM algorithm and update rate
U ∈ {40, 10, 0}, we run six timed trials for several thread counts n. Each trial proceeds in two phases:
prefilling and measuring. In the prefilling phase, n concurrent threads perform 50% Insert and 50%
Delete operations on keys drawn uniformly randomly from [0, 10^5) until the size of the tree converges to a
steady state (containing approximately 10^5/2 keys). Next, the trial enters the measuring phase, during
which threads begin counting how many operations they perform. In this phase, each thread performs
(U/2)% Insert, (U/2)% Delete and (100 − U)% Search operations, on keys/values drawn uniformly from
[0, 10^5), for one second.
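The operation mix of the measuring phase can be sketched as below. The sketch assumes a key range of [0, 10^5) and uses hypothetical names (Op, Request, nextRequest); it shows only how one operation is drawn for a given update rate U, not the benchmark harness itself.

```cpp
#include <cstdint>
#include <random>

enum class Op { Insert, Delete, Search };

struct Request {
    Op op;
    std::uint32_t key;
};

// Assumed key range [0, 10^5).
constexpr std::uint32_t kKeyRange = 100000;

// Draws one operation for update rate updatePercent (U in the text):
// (U/2)% Insert, (U/2)% Delete and (100 - U)% Search, with a uniform key.
inline Request nextRequest(std::mt19937& rng, int updatePercent) {
    std::uniform_int_distribution<int> percent(0, 99);
    std::uniform_int_distribution<std::uint32_t> key(0, kKeyRange - 1);
    int r = percent(rng);
    Op op;
    if (2 * r < updatePercent) {
        op = Op::Insert;   // first U/2 percent of outcomes
    } else if (r < updatePercent) {
        op = Op::Delete;   // next U/2 percent of outcomes
    } else {
        op = Op::Search;   // remaining (100 - U) percent
    }
    return {op, key(rng)};
}
```

Each thread would own its generator so that drawing requests does not itself introduce shared-memory contention.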
Figure 3: Results for a BST microbenchmark. The x-axis represents the number of concurrent threads.
The y-axis represents operations per microsecond. (Panels: for each of the 2x12-core Intel E7-4830v3 and
the 2x12-core IBM POWER8, workloads W1, in which no threads perform RangeIncrement, and W2, in which
one thread performs RangeIncrement, at 0%, 10% and 40% updates.)
Uniformly random updates to an unbalanced BST have been proven to yield trees of logarithmic height
with high probability. Thus, in this type of workload, almost all transactions succeed in hardware, and
the slow-path is almost never used. To study performance when transactions regularly run on slow-path,
we introduced an operation called a RangeIncrement that often fails in hardware and must run on the
slow-path. A RangeIncrement(low, hi) atomically increments the values associated with each key in the
range [low, hi] present in the tree. Note that a RangeIncrement is more likely to experience data conflicts
and capacity aborts than BST updates, which only modify a single node.
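To make the semantics of RangeIncrement concrete, the following sketch applies it to a sequential stand-in for the dictionary. Using std::map in place of the unbalanced BST, and the names Dictionary and rangeIncrement, are assumptions of the example; in the benchmark the whole operation would run inside a single transaction, which is what makes it atomic.

```cpp
#include <map>

// Sequential stand-in for the BST dictionary (an assumption of this sketch).
using Dictionary = std::map<int, long>;

// Increments the value associated with every key in [low, hi] that is
// present in dict, and returns the number of entries modified.
inline int rangeIncrement(Dictionary& dict, int low, int hi) {
    if (low > hi) return 0;  // empty range
    int touched = 0;
    for (auto it = dict.lower_bound(low);
         it != dict.end() && it->first <= hi; ++it) {
        ++it->second;
        ++touched;
    }
    return touched;
}
```

The read- and write-sets of the enclosing transaction grow with the number of keys in the range, which is why the operation is prone to capacity aborts and data conflicts.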
We consider two types of workloads: (W1) all n threads perform Insert, Delete and Search, and
(W2) n− 1 threads perform Insert, Delete and Search and one thread performs only RangeIncrement
operations. Figure 3 shows the results for both types of workloads.
Results (Intel). We first discuss the 0% updates graph for workload type W1. In this graph, essentially
all operations committed in hardware. In fact, in each trial, a small fraction of 1% of operations ran on
the slow-path. Thus, any performance differences shown in the graph are essentially differences in the
performance of the algorithms’ respective fast-paths (with the exception of TL2). Algorithm 1, which has
instrumentation in its fast-path read operations, has significantly lower performance than Algorithm 2,
which does not. Since this is a read-only workload, this instrumentation is responsible for the performance
difference.
In the W1 workloads, TLE, Algorithm 2 and Hybrid NOrec perform similarly (with a small performance
advantage for Hybrid NOrec at high thread counts). This is because the fast-paths for these three
algorithms have similar amounts of instrumentation: there is no instrumentation for reads or writes, and
the transaction itself incurs one or two metadata accesses. In contrast, in the W2 workloads, TLE performs
quite poorly, compared to the HyTM algorithms. In these workloads, transactions must periodically run
on the slow-path, and in TLE, this entails acquiring a global lock that restricts progress for all other
threads. At high thread counts this significantly impacts performance. Its performance decreases as
the sizes of the ranges passed to RangeIncrement increase. Its performance is also negatively impacted
by NUMA effects at thread counts higher than 24. (This is because, when a thread p reads the lock
and incurs a cache miss, if the lock was last held by another thread on the same socket, then p can fill
the cache miss by loading it from the shared L3 cache. However, if the lock was last held by a thread
on a different socket, then p must read the lock state from main memory, which is significantly more
expensive.) On the other hand, in each graph in the W2 workloads, the performance of each HyTM
(and TL2) is similar to its performance in the corresponding W1 workload graph. For Algorithm 1
(and TL2), this is because of progressiveness. Although Algorithm 2 is not truly progressive, fast-path
transactions will abort only if they are concurrent with the commit procedure of a slow-path transaction.
In RangeIncrement operations, there is a long read-only prefix (which is exceptionally long because of
Algorithm 2’s quadratic validation) followed by a relatively small set of writes. Thus, RangeIncrement
operations have relatively little impact on the fast-path. The explanation is similar for Hybrid NOrec
(except that it performs less validation than Algorithm 2).
Observe that the performance of Hybrid NOrec decreases slightly, relative to Algorithm 2, after 24
threads. Recall that, in Hybrid NOrec, the global sequence number is a single point of contention on
the fast-path. (In Algorithm 2, the global lock is only modified by slow-path transactions, so fast-path
transactions do not have a single point of contention.) We believe this is due to NUMA effects, similar to
those described in [6]. Specifically, whenever a thread on the first socket performs a fast-path transaction
that commits and modifies the global lock, it causes cache invalidations for all other threads. Threads on
socket two must then load the lock state from main memory, which takes much longer than loading it
from the shared L3 cache. This lengthens the transaction’s window of contention, making it more likely
to abort. (In the 0% updates graph in the W2 workload, we still see this effect, because there is a thread
performing RangeIncrement operations.)
Results (IBM POWER8). Algorithm 1 performs poorly on POWER8: POWER8 transactions can
only load 64 cache lines before they will abort [22]. Transactions read locks and tree nodes, which are
in different cache lines: together, they often exceed 64 cache lines loaded in a tree operation, so most
transactions cannot succeed in hardware. Consequently, on POWER8, it is incredibly important either
to have minimal instrumentation in transactions, or for metadata to be located in the same cache lines
as program data. Of course, the latter is not possible for HyTMs, which do not have control over the
layout of program data. As a result, Algorithm 2 easily outperforms Algorithm 1 on POWER8 by
avoiding the per-read instrumentation.
Algorithm 1 is improved slightly by the expensive (on POWER8) suspend/resume on sequence locks
during transactional reads, but it still performs relatively poorly. To make suspend/resume a practical
tool, one could imagine attempting to collect several metadata accesses and perform them together to
amortize the cost of a suspend/resume pair. For instance, in Algorithm 1, one might try to update the
locks for all of the transactional writes at once, when the transaction commits. Typically one would
accomplish this by logging all writes so that a process can remember which addresses it must lock at
commit time. However, logging the writes inside the transaction would be at least as costly as just
performing them.
Observe that Hybrid NOrec does far worse with updates on POWER8 than on the Intel machine.
This is due to the fact that fetch-increment on a single location experiences severe negative scaling on
the POWER8 processor: e.g., in one second, a single thread can perform 37 million fetch-add operations while
6 threads perform a total of 9 million and 24 threads perform only 4 million fetch-add operations. In
contrast, the Intel machine performs 32 million operations with 6 threads and 45 million with 24 threads.
This is likely because this Intel processor provides a fetch-add instruction, whereas fetch-add must be
emulated on the POWER8 processor.
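A measurement of this kind can be approximated with the sketch below, in which several threads perform fetch-and-add on a single shared counter. It is not the benchmark used in the paper: the name hammerCounter is hypothetical, and the sketch performs a fixed number of operations per thread (to be timed externally) rather than counting operations completed in one second.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Each of numThreads threads performs opsPerThread fetch-and-add
// operations on one shared counter. Returns the final counter value,
// which equals numThreads * opsPerThread.
inline std::uint64_t hammerCounter(int numThreads, std::uint64_t opsPerThread) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&counter, opsPerThread] {
            for (std::uint64_t i = 0; i < opsPerThread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

Timing calls with increasing thread counts would expose whether total throughput on the single location grows or, as reported above for POWER8, shrinks.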
In Hybrid NOrec∗, the non-speculative increment of gsl actually makes performance worse. Recall that
in Hybrid NOrec, if a fast-path transaction T1 increments gsl, and then a software transaction T2 reads
gsl (as part of validation) before T1 commits, then T1 will abort, and T2 will not see T1’s change to gsl.
So, T2 will have a higher chance of avoiding incremental validation (and, hence, will likely take less time
to run, and have a smaller contention window). However, in Hybrid NOrec∗, once T1 increments gsl, T2
will see the change to gsl, regardless of whether T1 commits or aborts. Thus, T2 will be forced to perform
incremental validation. In our experiments, we observed that a much larger number of transactions ran
on the fallback path in Hybrid NOrec∗ than in Hybrid NOrec (often several orders of magnitude more).
6 Related work and discussion
HyTM implementations and complexity. Early HyTMs like the ones described in [10, 16] provided
progressiveness, but subsequent HyTM proposals like PhTM [19] and Hybrid NOrec [8] sacrificed
progressiveness for lower instrumentation overheads. However, the trade-off between concurrency
and instrumentation for these HyTMs has not been studied in the context of currently available HTM
architectures. The instrumentation cost on the fast-path was precisely characterized in [3]. In this paper,
we proved the inherent cost of concurrency on the slow-path, thus establishing a surprising, but intuitive,
complexity separation between progressive STMs and HyTMs. Moreover, to the best of our knowledge,
this is the first work to study the cost of concurrency in HyTMs both in theory and in practice
(on currently available HTM architectures). The proof of Theorem 3 is based on the analogous
proof for step complexity of STMs that are disjoint-access parallel [17, 13]. Our implementation of Hybrid
NOrec follows [25], which additionally proposed the use of direct accesses in fast-path transactions to
reduce instrumentation overhead in the AMD Advanced Synchronization Facility (ASF) architecture.
Beyond the two path HyTM approach. Employing an uninstrumented fast fast-path. We now
describe how every transaction may first be executed in a “fast” fast-path with almost no instrumentation
and, if unsuccessful, may be re-attempted in the fast-path and subsequently in the slow-path. Specifically,
we transform any opaque HyTM M to an opaque HyTM M′ that uses a shared fetch-and-add metadata
base object F, which slow-path updating transactions increment (resp. decrement) at the start (resp.
end). In M′, a “fast” fast-path transaction first checks whether F is 0 and, if not, aborts;
otherwise the transaction continues as an uninstrumented hardware transaction. The code for the
fast-path and the slow-path is identical to that of M.
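The three-tier scheme can be sketched as follows. The names SlowPathCounter, Tier and runThreeTier are hypothetical, and each execution path is modelled as a callable that returns true on commit. In an actual HyTM the check of F would be performed inside the hardware transaction, so that a later increment of F by a slow-path updating transaction aborts the uninstrumented attempt; the sketch does not model that.

```cpp
#include <atomic>
#include <functional>

// The shared fetch-and-add object F: the number of slow-path updating
// transactions currently in progress.
struct SlowPathCounter {
    std::atomic<long> f{0};
    void enterSlowPathUpdate() { f.fetch_add(1); }  // at transaction start
    void exitSlowPathUpdate() { f.fetch_sub(1); }   // at transaction end
    bool idle() const { return f.load() == 0; }
};

enum class Tier { FastFast, Fast, Slow };

// Tries the uninstrumented "fast" fast-path (only if F is 0), then the
// instrumented fast-path, then the slow-path. The first two callables
// return true on commit; slowPath is assumed to always commit.
inline Tier runThreeTier(const SlowPathCounter& F,
                         const std::function<bool()>& uninstrumented,
                         const std::function<bool()>& instrumentedFast,
                         const std::function<void()>& slowPath) {
    if (F.idle() && uninstrumented()) return Tier::FastFast;
    if (instrumentedFast()) return Tier::Fast;
    slowPath();
    return Tier::Slow;
}
```

A slow-path updating transaction would call enterSlowPathUpdate before its first step and exitSlowPathUpdate after it commits or aborts, leaving the remaining slow-path and fast-path code unchanged.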
Other approaches. Recent work has investigated fallback to reduced hardware transactions [20] in which
an all-software slow-path is augmented using a slightly faster slow-path that is optimistically used to
avoid running some transactions on the true software-only slow-path. Amalgamated lock elision (ALE)
was proposed in [2] which improves over TLE by executing the slow-path as a series of segments, each of
which is a dynamic length hardware transaction. Invyswell [7] is a HyTM design with multiple hardware
and software modes of execution that gives flexibility to avoid instrumentation overhead in uncontended
executions.
We remark that such multi-path approaches may be easily applied to each of the algorithms proposed in
this paper. However, in the search for an efficient HyTM, it is important to strike a fine balance between
concurrency, hardware instrumentation and software validation cost. Our lower bound, experimental
methodology and evaluation of HyTMs provide the first clear characterization of these trade-offs in both
Intel and POWER8 architectures.
References
[1] Advanced Synchronization Facility Proposed Architectural Specification, March 2009. http://
developer.amd.com/wordpress/media/2013/09/45432-ASF_Spec_2.1.pdf.
[2] Yehuda Afek, Alexander Matveev, Oscar R. Moll, and Nir Shavit. Amalgamated lock-elision. In
Proceedings of 29th Int. Sym. on Distributed Computing, DISC ’15, pages 309–324, 2015. URL:
http://dx.doi.org/10.1007/978-3-662-48653-5_21, doi:10.1007/978-3-662-48653-5_21.
[3] Dan Alistarh, Justin Kopinsky, Petr Kuznetsov, Srivatsan Ravi, and Nir Shavit. Inherent limitations
of hybrid transactional memory. In Proceedings of 29th Int. Sym. on Distributed Computing,
DISC ’15, pages 185–199, 2015. URL: http://dx.doi.org/10.1007/978-3-662-48653-5_13, doi:
10.1007/978-3-662-48653-5_13.
[4] Hagit Attiya, Sandeep Hans, Petr Kuznetsov, and Srivatsan Ravi. Safety of deferred update
in transactional memory. In 2013 IEEE 33rd Int. Conf. on Distributed Computing Systems
(ICDCS), pages 601–610, Los Alamitos, CA, USA, 2013. IEEE Computer Society. doi:http:
//doi.ieeecomputersociety.org/10.1109/ICDCS.2013.57.
[5] Hagit Attiya, Eshcar Hillel, and Alessia Milani. Inherent limitations on disjoint-access parallel
implementations of transactional memory. Theory of Computing Systems, 49(4):698–719, 2011. URL:
http://dx.doi.org/10.1007/s00224-010-9304-5, doi:10.1007/s00224-010-9304-5.
[6] Trevor Brown, Alex Kogan, Yossi Lev, and Victor Luchangco. Investigating the performance of
hardware transactions on a multi-socket machine. In Proceedings of 28th ACM Sym. on Parallelism
in Algorithms and Architectures, SPAA ’16, pages 121–132, 2016. URL: http://doi.acm.org/10.
1145/2935764.2935796, doi:10.1145/2935764.2935796.
[7] Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, and Maurice Herlihy. Invyswell:
a hybrid transactional memory for haswell’s restricted transactional memory. In Int. Conf. on Par.
Arch. and Compilation, PACT ’14, pages 187–200, 2014. URL: http://doi.acm.org/10.1145/
2628071.2628086, doi:10.1145/2628071.2628086.
[8] Luke Dalessandro, Francois Carouge, Sean White, Yossi Lev, Mark Moir, Michael L. Scott, and
Michael F. Spear. Hybrid NOrec: a case study in the effectiveness of best effort hardware transactional
memory. In ASPLOS ’11, pages 39–52. ACM, 2011. URL: http://doi.acm.org/10.1145/1950365.
1950373.
[9] Luke Dalessandro, Michael F. Spear, and Michael L. Scott. Norec: Streamlining stm by abolishing
ownership records. SIGPLAN Not., 45(5):67–78, January 2010. URL: http://doi.acm.org/10.
1145/1837853.1693464, doi:10.1145/1837853.1693464.
[10] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, and Daniel Nussbaum.
Hybrid transactional memory. SIGPLAN Not., 41(11):336–346, October 2006. URL: http://doi.
acm.org/10.1145/1168918.1168900, doi:10.1145/1168918.1168900.
[11] Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking ii. In Proceedings of the 20th
International Conference on Distributed Computing, DISC’06, pages 194–208, Berlin, Heidelberg, 2006.
Springer-Verlag. URL: http://dx.doi.org/10.1007/11864219_14, doi:10.1007/11864219_14.
[12] K. Fraser. Practical lock-freedom. Technical report, Cambridge University Computer Laboratory,
2003.
[13] Rachid Guerraoui and Michal Kapalka. Principles of Transactional Memory, Synthesis Lectures on
Distributed Computing Theory. Morgan and Claypool, 2010.
[14] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer, III. Software transactional
memory for dynamic-sized data structures. In Proc. of 22nd Int. Sym. on Principles of Distr. Comp.,
PODC ’03, pages 92–101, New York, NY, USA, 2003. ACM. URL: http://doi.acm.org/10.1145/
872035.872048, doi:10.1145/872035.872048.
[15] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-free
data structures. In ISCA, pages 289–300, 1993.
[16] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and Anthony Nguyen. Hybrid
transactional memory. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, PPoPP ’06, pages 209–220, New York, NY, USA, 2006. ACM.
URL: http://doi.acm.org/10.1145/1122971.1123003.
[17] Petr Kuznetsov and Srivatsan Ravi. Progressive transactional memory in time and space. In Proceed-
ings of 13th Int. Conf. on Parallel Computing Technologies, PaCT ’15, pages 410–425, 2015. URL:
http://dx.doi.org/10.1007/978-3-319-21909-7_40, doi:10.1007/978-3-319-21909-7_40.
[18] Hung Q. Le, G. L. Guthrie, Derek Williams, Maged M. Michael, Brad Frey, William J. Starke,
Cathy May, Rei Odaira, and Takuya Nakaike. Transactional memory support in the IBM POWER8
processor. IBM Journal of Research and Development, 59(1), 2015. URL: http://dx.doi.org/10.
1147/JRD.2014.2380199.
[19] Yossi Lev, Mark Moir, and Dan Nussbaum. PhTM: Phased transactional memory. In Workshop
on Transactional Computing (Transact), 2007.
[20] Alexander Matveev and Nir Shavit. Reduced hardware transactions: a new approach to hybrid
transactional memory. In Proceedings of the 25th ACM symposium on Parallelism in algorithms and
architectures, pages 11–22. ACM, 2013.
[21] Takuya Nakaike, Rei Odaira, Matthew Gaudet, Maged M. Michael, and Hisanobu Tomari. Quantita-
tive comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core,
and POWER8. In Proc. of 42nd Int. Sym. on Comp. Arch., ISCA ’15, pages 144–157, NY, USA,
2015. URL: http://doi.acm.org/10.1145/2749469.2750403.
[22] Andrew T. Nguyen. Investigation of hardware transactional memory. 2015.
http://groups.csail.mit.edu/mag/Andrew-Nguyen-Thesis.pdf.
[23] Ravi Rajwar and James R. Goodman. Speculative lock elision: Enabling highly concurrent mul-
tithreaded execution. In Proc. of 34th ACM/IEEE Int. Sym. on Microarchitecture, MICRO ’01,
pages 294–305, Washington, DC, USA, 2001. URL: http://dl.acm.org/citation.cfm?id=563998.
564036.
[24] James Reinders. Transactional Synchronization in Haswell, 2012. http://software.intel.com/
en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/.
[25] Torvald Riegel, Patrick Marlier, Martin Nowack, Pascal Felber, and Christof Fetzer. Optimizing
hybrid transactional memory: The importance of nonspeculative operations. In Proc. of 23rd ACM
Sym. on Parallelism in Algs. and Arch., pages 53–64. ACM, 2011.
[26] Nir Shavit and Dan Touitou. Software transactional memory. In Principles of Distributed Computing
(PODC), pages 204–213, 1995.
[27] Richard M. Yoo, Christopher J. Hughes, Konrad Lai, and Ravi Rajwar. Performance evaluation of
Intel® transactional synchronization extensions for high-performance computing. In Proceedings
of Int. Conf. on High Perf. Computing, Networking, Storage and Analysis, SC ’13, pages 19:1–
19:11, New York, NY, USA, 2013. URL: http://doi.acm.org/10.1145/2503210.2503232, doi:
10.1145/2503210.2503232.
A Proof of opacity for algorithms
We will prove the opacity of Algorithm 1 even if some of the accesses performed by fast-path transactions are
direct (as indicated in the pseudocode). Analogous arguments apply to Algorithm 2.
Let E be any execution of Algorithm 1. Since opacity is a safety property, it is sufficient to prove that
every finite execution is opaque [4]. Let <E denote a total-order on events in E.
Let H denote a subsequence of E constructed by selecting linearization points of t-operations performed
in E. The linearization point of a t-operation op, denoted as ℓop, is associated with a base object event or
an event performed during the execution of op using the following procedure.
Completions. First, we obtain a completion of E by removing some pending invocations or adding
responses to the remaining pending invocations. Every incomplete readk or writek operation performed by a
slow-path transaction Tk is removed from E; an incomplete tryCk is removed from E if Tk has not
performed any write to a base object rj; Xj ∈ Wset(Tk) in Line 59, otherwise it is completed by including
Ck after E. Every incomplete readk, tryAk, writek and tryCk performed by a fast-path transaction Tk is
removed from E.
Linearization points. Now a linearization H of E is obtained by associating linearization points to
t-operations in the obtained completion of E. For all t-operations performed by a slow-path transaction Tk,
linearization points are assigned as follows:
• For every t-read opk that returns a non-Ak value, ℓopk is chosen as the event in Line 40 of Algorithm 1,
else, ℓopk is chosen as the invocation event of opk
• For every opk = writek that returns, ℓopk is chosen as the invocation event of opk
• For every opk = tryCk that returns Ck such that Wset(Tk) ≠ ∅, ℓopk is associated with the first
write to a base object performed by release when invoked in Line 60, else if opk returns Ak, ℓopk is
associated with the invocation event of opk
• For every opk = tryCk that returns Ck such that Wset(Tk) = ∅, ℓopk is associated with Line 54
For all t-operations performed by a fast-path transaction Tk, linearization points are assigned as follows:
• For every t-read opk that returns a non-Ak value, ℓopk is chosen as the event in Line 8 of Algorithm 1,
else, ℓopk is chosen as the invocation event of opk
• For every opk that is a tryCk, ℓopk is the commit-cachek primitive invoked by Tk
• For every opk that is a writek, ℓopk is the event in Line 17.
<H denotes a total-order on t-operations in the complete sequential history H.
Serialization points. The serialization point of a transaction Tj, denoted as δTj, is associated with the
linearization point of a t-operation performed by the transaction.
We obtain a t-complete history H̄ from H as follows: for every transaction Tk in H that is complete, but
not t-complete, we insert tryCk · Ak immediately after the last event of Tk in H. A serialization S is then
obtained by associating serialization points to transactions in H̄ as follows. If Tk is an updating
transaction that commits, then δTk is ℓtryCk. If Tk is a read-only or aborted transaction, then δTk is
assigned to the linearization point of the last t-read that returned a non-Ak value in Tk.
<S denotes a total-order on transactions in the t-sequential history S. Since for a given transaction,
its serialization point is chosen between the first and last event of the transaction, if Ti ≺H Tj, then
δTi <E δTj, which implies Ti <S Tj.
Throughout this proof, we consider that process pi executing fast-path transaction Tk ∈ txns(E) does
not include the sequence lock rj in the tracking set of pi when accessed in Line 9 during readk(Xj).
Claim 4. If every transaction Tk ∈ txns(E) is fast-path, then S is legal.
Proof. Recall that Algorithm 1 performs direct accesses only during the t-read operation in Line 9
which involves reading the sequence lock rj corresponding to t-object Xj . However, any two fast-path
transactions accessing conflicting data sets must necessarily incur a tracking abort (cf. Remark 1) in E.
It follows immediately that S must be legal.
Claim 5. S is legal, i.e., every t-read returns the value of the latest committed t-write in S.
Proof. We claim that for every readj(Xm) → v, there exists some slow-path transaction Ti (or resp.
fast-path) that performs writei(Xm, v) and completes the event in Line 59 (or resp. Line 17) such that
readj(Xm) ⊀_H^RT writei(Xm, v).
Suppose that Ti is a slow-path transaction: since readj(Xm) returns the response v, the event in
Line 40 succeeds the event in Line 59 performed by tryCi. Since readj(Xm) can return a non-abort
response only after Ti releases the lock on rm in Line 24, Ti must be committed in S. Consequently,
ℓtryCi <E ℓreadj(Xm). Since, for any updating committing transaction Ti, δTi = ℓtryCi, it follows that
δTi <E δTj.
Otherwise if Ti is a fast-path transaction, then clearly Ti is a committed transaction in S. Recall that
readj(Xm) can read v during the event in Line 40 only after Ti applies the commit-cache primitive. By
the assignment of linearization points, ℓtryCi <E ℓreadj(Xm) and thus, δTi <E ℓreadj(Xm).
Thus, to prove that S is legal, it suffices to show that there does not exist a transaction Tk that
returns Ck in S and performs writek(Xm, v′); v′ ≠ v such that Ti <S Tk <S Tj.
Ti and Tk are both updating transactions that commit. Thus, (Ti <S Tk) ⇐⇒ (δTi <E δTk) and
(δTi <E δTk) ⇐⇒ (ℓtryCi <E ℓtryCk).
Since Tj reads the value of Xm written by Ti, one of the following is true: ℓtryCi <E ℓtryCk <E ℓreadj(Xm)
or ℓtryCi <E ℓreadj(Xm) <E ℓtryCk.
Suppose that ℓtryCi <E ℓtryCk <E ℓreadj(Xm).
(Case I:) Ti and Tk are slow-path transactions.
Thus, Tk returns a response from the event in Line 55 before the read of the base object associated
with Xm by Tj in Line 40. Since Ti and Tk are both committed in E, Tk returns true from the event in
Line 55 only after Ti releases rm in Line 24.
If Tj is a slow-path transaction, recall that readj(Xm) checks if Xm is locked by a concurrent transaction,
then performs read-validation (Line 42) before returning a matching response. Indeed, readj(Xm) must
return Aj in any such execution.
If Tj is a fast-path transaction, it follows that readj(Xm) must return Aj immediately from Remark 1.
Thus, ℓtryCi <E ℓreadj(Xm) <E ℓtryCk.
(Case II:) Ti is a slow-path transaction and Tk is a fast-path transaction. Thus, Tk returns Ck before
the read of the base object associated with Xm by Tj in Line 40, but after the response of acquire by
Ti in Line 55. Since readj(Xm) reads the value of Xm to be v and not v′, Ti performs the cas to vm in
Line 59 after Tk performs the commit-cache primitive (since if otherwise, Tk would be aborted in E).
But then the cas on vm performed by Ti would return false and Ti would return Ai, a contradiction.
(Case III:) Tk is a slow-path transaction and Ti is a fast-path transaction. This is analogous to the
above case.
(Case IV:) Ti and Tk are fast-path transactions. Follows immediately from Claim 4.
We now need to prove that δTj indeed precedes ℓtryCk in E. Consider the two possible cases. Suppose
that Tj is a read-only transaction. Then, δTj is assigned to the last t-read performed by Tj that returns
a non-Aj value. If readj(Xm) is not the last t-read that returned a non-Aj value, then there exists
a readj(X′) such that ℓreadj(Xm) <E ℓtryCk <E ℓreadj(X′). But then this t-read of X′ must abort by
performing the checks in Line 42 or incur a tracking set abort, a contradiction.
Otherwise suppose that Tj is an updating transaction that commits, then δTj = ℓtryCj which implies
that ℓreadj(Xm) <E ℓtryCk <E ℓtryCj. Then, Tj must necessarily perform the checks in Line 56 and return
Aj or incur a tracking set abort, contradicting the assumption that Tj is a committed transaction.
Since S is legal and respects the real-time ordering of transactions, Algorithm 1 is opaque.
B Proof of Theorem 3
The proof of the lemma below is a simple extension of the analogous lemma from [3] allowing direct
trivial accesses inside fast-path transactions which in turn is inspired by an analogous result concerning
disjoint-access parallel STMs [5]. Intuitively, the proof follows from the fact that the tracking
set of a process executing a fast-path transaction is invalidated due to contention on a base object with
another transaction (cf. Remark 1).
Lemma 6. Let M be any progressive HyTM implementation in which fast-path transactions may perform
trivial direct accesses. Let E1 ·E2 be an execution ofM where E1 (and resp. E2) is the step contention-free
execution fragment of transaction T1 (and resp. T2) executed by process p1 (and resp. p2), T1 and T2 do
not conflict in E1 ·E2, and at least one of T1 or T2 is a fast-path transaction. Then, T1 and T2 do not
contend on any base object in E1 · E2.
Proof. Suppose, by contradiction that T1 and T2 contend on the same base object in E1 · E2.
If in E1, T1 performs a nontrivial event on a base object on which they contend, let e1 be the last
event in E1 in which T1 performs such an event to some base object b and e2, the first event in E2 that
accesses b (note that by assumption, e1 is a direct access). Otherwise, T1 only performs trivial events in
E1 to base objects (some of which may be direct) on which it contends with T2 in E1 · E2: let e2 be the
first event in E2 in which T2 performs a nontrivial event to some base object b on which they contend
and e1, the last event of E1 in T1 that accesses b.
Let E′1 (and resp. E′2) be the longest prefix of E1 (and resp. E2) that does not include e1 (and resp.
e2). Since before accessing b, the execution is step contention-free for T1, E′1 · E′2 is an execution
of M. By assumption of lemma, T1 and T2 do not conflict in E′1 · E′2. By construction, E1 · E′2 is
indistinguishable to T2 from E′1 · E′2. Hence, T1 and T2 are poised to apply contending events e1 and e2
on b in the execution Ẽ = E′1 · E′2.
We now consider two cases:
1. (e1 is a nontrivial event) After Ẽ · e1, b is contained in the tracking set of process p1 in exclusive
mode and in the extension Ẽ · e1 · e2, we have that τ1 is invalidated. Thus, by Remark 1, transaction
T1 must return A1 in any extension of Ẽ · e1 · e2, a contradiction to the assumption that M is
progressive.
2. (e1 is a trivial event) Recall that e1 may be potentially an event involving a direct access. Consider
the execution Ẽ · e2 following which b is contained in the tracking set of process p2 in exclusive
mode. Clearly, we have an extension Ẽ · e2 · e1 in which τ2 is invalidated. Thus transaction T2 must
return A2 in any extension of Ẽ · e2 · e1, a contradiction to the assumption that M is progressive.
Theorem 7. Let M be any progressive opaque HyTM implementation providing invisible reads. There
exists an execution E of M and some slow-path read-only transaction Tk ∈ txns(E) that incurs a time
complexity of Ω(m^2); m = |Rset(Tk)|.
Proof. For all i ∈ {1, . . . , m}; m ∈ N, let v be the initial value of t-object Xi. Let π^m denote
the complete step contention-free execution of a slow-path transaction Tφ that performs m t-reads:
readφ(X1) · · · readφ(Xm) such that for all i ∈ {1, . . . , m}, readφ(Xi) → v.
Claim 8. For all i ∈ N, M has an execution of the form π^{i−1} · ρ^i · α^i where,
• π^{i−1} is the complete step contention-free execution of slow-path read-only transaction Tφ that
performs (i − 1) t-reads: readφ(X1) · · · readφ(Xi−1),
• ρ^i is the t-complete step contention-free execution of a fast-path transaction Ti that writes nvi ≠ vi
to Xi and commits,
• α^i is the complete step contention-free execution fragment of Tφ that performs its ith t-read:
readφ(Xi) → nvi.
Proof. M has an execution of the form ρ^i · π^{i−1}. Since Dset(Tφ) ∩ Dset(Ti) = ∅ in ρ^i · π^{i−1}, by Lemma 6,
transactions Tφ and Ti do not contend on any base object in execution ρ^i · π^{i−1}. Moreover, since they
each access a single t-object, fast-path transaction Ti cannot incur a capacity abort. Thus, ρ^i · π^{i−1} is
also an execution of M.
By opacity, ρ^i · π^{i−1} · α^i (Figure 1a) is an execution of M in which the t-read of Xi performed by Tφ
must return nvi. But ρ^i · π^{i−1} · α^i is indistinguishable to Tφ from π^{i−1} · ρ^i · α^i. Thus, M has an execution
of the form π^{i−1} · ρ^i · α^i (Figure 1b).
For each i ∈ {2, . . . ,m}, j ∈ {1, 2} and ℓ ≤ (i − 1), we now define an execution of the form
E^i_{jℓ} = π^{i−1} · β^ℓ · ρ^i · α^i_j as follows:
• β^ℓ is the t-complete step contention-free execution fragment of a fast-path transaction Tℓ that writes
nvℓ ≠ v to Xℓ and commits,
• α^i_1 (resp. α^i_2) is the complete step contention-free execution fragment of readφ(Xi) → v
(resp. readφ(Xi) → Aφ).
Note that in the executions defined above, we assume that the fast-path transactions Ti and Tℓ, ℓ ≤ (i − 1),
are executed by distinct processes.
Claim 9. For all i ∈ {2, . . . ,m} and ℓ ≤ (i − 1), M has an execution of the form E^i_{1ℓ} or E^i_{2ℓ}.
Proof. Note that by our assumption on capacity aborts, fast-path transactions Ti and Tℓ cannot incur
capacity aborts in the defined execution.
For all i ∈ {2, . . . ,m}, π^{i−1} is an execution of M. By assumption of invisible reads, Tℓ must be
committed in π^{i−1} · ρ^ℓ and M has an execution of the form π^{i−1} · β^ℓ. By the same reasoning, since Ti
and Tℓ do not have conflicting data sets, M has an execution of the form π^{i−1} · β^ℓ · ρ^i.
Since the configuration after π^{i−1} · β^ℓ · ρ^i is quiescent, π^{i−1} · β^ℓ · ρ^i extended with readφ(Xi) must
return a matching response. If readφ(Xi) → vi, then clearly E^i_1 is an execution of M with Tφ, Ti−1, Ti
being a valid serialization of transactions. If readφ(Xi) → Aφ, the same serialization justifies an opaque
execution.
Suppose by contradiction that there exists an execution of M such that π^{i−1} · β^ℓ · ρ^i is extended
with the complete execution of readφ(Xi) → r; r ∉ {Aφ, v}. The only plausible case to analyse is when
r = nvi. Since readφ(Xi) returns the value of Xi updated by Ti, the only possible serialization for
transactions is Tℓ, Ti, Tφ; but readφ(Xℓ) performed by Tφ that returns the initial value v is not legal in
this serialization, a contradiction.
Claim 10. For all i ∈ {2, . . . ,m}, j ∈ {1, 2} and ℓ ≤ (i − 1), slow-path transaction Tφ must access (i − 1)
different base objects during the execution of readφ(Xi) in the execution π^{i−1} · β^ℓ · ρ^i · α^i_j.
Proof. Consider the (i − 1) different executions: π^{i−1} · β^1 · ρ^i, . . ., π^{i−1} · β^{i−1} · ρ^i (cf. Figure 1c). For all
ℓ, ℓ′ ≤ (i − 1); ℓ′ ≠ ℓ, M has an execution of the form π^{i−1} · β^ℓ · ρ^i · β^{ℓ′} in which fast-path transactions Tℓ
(executed by process pℓ) and Tℓ′ (executed by process pℓ′) access mutually disjoint data sets. By invisible
reads and Lemma 6, the pairs of transactions Tℓ′, Ti and Tℓ′, Tℓ do not contend on any base object in
this execution. This implies that π^{i−1} · β^ℓ · β^{ℓ′} · ρ^i is an execution of M in which transactions Tℓ and Tℓ′
each apply nontrivial primitives to mutually disjoint sets of base objects in the execution fragments β^ℓ
and β^{ℓ′} respectively.
This implies that for any j ∈ {1, 2}, ℓ ≤ (i − 1), the configuration C^i after E^i differs from the
configurations after E^i_{jℓ} only in the states of the base objects that are accessed in the fragment β^ℓ.
Consequently, slow-path transaction Tφ must access at least i − 1 different base objects in the execution
fragment π^i_j to distinguish configuration C^i from the configurations that result after the (i − 1) different
executions π^{i−1} · β^1 · ρ^i, . . ., π^{i−1} · β^{i−1} · ρ^i respectively.
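The per-read cost established by Claim 10 matches what a reader performing incremental validation pays. The following is a minimal sketch of such a reader, not one of the algorithms in this paper: the function name, the single version word per t-object and the cost model (one base-object access per previously read t-object, with the t-read itself not counted) are all assumptions made for illustration.

```python
def validation_cost_per_read(m):
    """Return the number of validation accesses made at each of m t-reads.

    Hypothetical cost model: before its i-th t-read returns, the reader
    re-reads one version word for each of the i - 1 t-objects already in
    its read set. No concurrent writer runs in this sketch.
    """
    versions = [0] * m   # one hypothetical version word per t-object
    read_set = []        # (object index, version observed when it was read)
    costs = []
    for i in range(m):
        steps = 0
        for j, seen in read_set:
            steps += 1   # one base-object access per previously read t-object
            if versions[j] != seen:
                raise RuntimeError("read set changed; the transaction would abort")
        read_set.append((i, versions[i]))
        costs.append(steps)
    return costs


costs = validation_cost_per_read(5)
```

Here `costs` lists 0, 1, 2, 3, 4 accesses for the five reads, so the ith t-read pays i − 1 accesses in this model.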
Thus, for all i ∈ {2, . . . ,m}, slow-path transaction Tφ must perform at least i − 1 steps while executing
the ith t-read in the execution fragment π^i_j. Inductively, this gives a step complexity of
∑_{i=1}^{m−1} i = m(m−1)/2 for Tφ, thus completing the proof.
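For completeness, the arithmetic behind the bound can be written out. The ith t-read costs at least i − 1 steps for each i ∈ {2, . . . ,m}; the substitution k = i − 1 is the only step added here.

```latex
\[
  \sum_{i=2}^{m} (i-1) \;=\; \sum_{k=1}^{m-1} k \;=\; \frac{m(m-1)}{2} \;\in\; \Omega(m^{2}).
\]
```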
