Stretching the capacity of Hardware Transactional
Memory in IBM POWER architectures
Ricardo Filipe, Shady Issa, Paolo Romano, João Barreto
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa
{ricardo.filipe, shadi.issa, paolo.romano, joao.barreto}@tecnico.ulisboa.pt
Abstract
The hardware transactional memory (HTM) implementa-
tions in commercially available processors are significantly
hindered by their tight capacity constraints. In practice, this
renders current HTMs unsuitable to many real-world work-
loads of in-memory databases.
This paper proposes SI-HTM, which stretches the capac-
ity bounds of the underlying HTM, thus opening HTM to a
much broader class of applications. SI-HTM leverages the
HTM implementation of the IBM POWER architecture with
a software layer to offer a single-version implementation of
Snapshot Isolation. When compared to HTM- and software-
based concurrency control alternatives, SI-HTM exhibits
improved scalability, achieving speedups of up to 300% relative
to HTM on in-memory database benchmarks.
CCS Concepts • Information systems → Parallel and
distributed DBMSs; • Theory of computation → Concurrent
algorithms; • Computing methodologies → Concurrent
algorithms.
Keywords HTM, Transactional Memory, IMDBs, IBM POWER,
Snapshot Isolation
1 Introduction
In the quest for scalability, in-memory databases (IMDBs)
offering weak consistency guarantees like Snapshot Isola-
tion (SI) [2] are increasingly prominent within the database
landscape. On the one hand, the in-memory nature of IMDBs
minimizes (or even eliminates) disk access latency to achieve
faster and more predictable performance [35, 39]. On the
other hand, weak consistency models alleviate many concur-
rency bottlenecks that characterize serializable systems [2].
Today, popular databases like HyPer [31], SAP HANA [20],
solidDB [22] and Hekaton [8] combine the virtues of both
trends, relying on weakly consistent IMDB designs.
PPoPP ’19, February 16–20, 2019, Washington, DC, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6225-2/19/02...$15.00
https://doi.org/10.1145/3293883.3295714
At first glance, the recent emergence of hardware transactional
memory (HTM) support in commercially available
processors such as Intel Core and IBM POWER might seem
like a perfect match to push the current generation of IMDBs
to new performance levels. However, the limited capacity of
HTM implementations [9, 12, 30] is incompatible with many
real-world OLTP/OLAP workloads, whose access footprints
are often much larger than the reduced capacity of existing
HTM implementations [21].
To work around this crucial obstacle, recent works either
propose modifications to HTM for SI support over a multi-
versioned memory layout that eliminates the capacity limits
[4, 24]; or exploit HTM as an auxiliary mechanism to ac-
celerate software-based concurrency control schemes [21].
However, while the former depends on hardware that is not
yet available, the latter exploits HTM only to a limited extent.
To the best of our knowledge, the expectation of IMDBs that
rely on HTM transactions as a first-class mechanism to run
and synchronize each transaction is yet to be met in practice.
This paper proposes SI-HTM, the first solution that achieves
such a goal by relying on a commercially available HTM im-
plementation – the HTM support originally introduced in
the IBM POWER8 architecture and continued in the most re-
cent IBM POWER9 [16]. Hereafter, we will denote such HTM
implementation as P8-HTM. The key novelty of SI-HTM is
that, with no hardware modification to P8-HTM, SI-HTM is
able to support SI-equivalent guarantees while relying on
the hardware to detect conflicts and abort transactions.
Intuitively, SI-HTM constructs a restricted SI implemen-
tation by combining two building blocks: i) rollback-only
transactions (ROT), a complementary mode available in P8-
HTM that is originally aimed at speculative execution of
code blocks that do not manipulate shared data [16]; and ii)
a software regulated quiescence phase that is added before
the hardware commit to ensure that the transaction only
commits once it is certain that its execution is compliant
with SI semantics.
As we describe later, this hybrid software-hardware mech-
anism is able to substantially stretch the capacity bounds of
the hardware transactions that can run on P8-HTM, with
no software instrumentation of memory reads and writes.
SI-HTM eliminates capacity bounds on a transaction’s read
set, restricting only their write sets by the HTM capacity.
arXiv:2003.03317v1 [cs.DC] 6 Mar 2020
Since many IMDB workloads are dominated by read-only
and read-dominated transactions with few writes, SI-HTM
is typically able to run the vast majority of transactions in
the HTM fast path.
Breaking the tight capacity bounds of the original P8-
HTM contributes to important improvements in the scala-
bility that P8-HTM can attain, for two distinct reasons. First
and foremost, with SI-HTM, fewer transactions abort due to
exceeding the HTM capacity. This means that transactions less
often have to fall back to a sequential fall-back path.
The second improvement is related to the notable power of
simultaneous multi-threading (SMT) on the POWER8 and 9
architectures, which are able to run up to 8 hardware threads
on each core. As acknowledged by previous studies on P8-
HTM [30], this SMT feature is practically incompatible with
HTM programs since the already scarce HTM capacity be-
comes shared among the co-located SMT threads. By stretch-
ing the capacity of P8-HTM, SI-HTM enables SMT-friendly
transactional workloads to achieve speed-ups at multi-SMT
levels, thus enabling parallelism in scenarios that are typi-
cally strongly adverse to HTM based applications.
This paper has three main contributions:
• We propose SI-HTM, a restricted implementation of
SI for P8-HTM.
• We experimentally evaluate SI-HTM on a real IBM
POWER8 server, both with a synthetic benchmark
and the TPC-C benchmark [5], which is serializable
under SI. When compared to HTM-based concurrency
control alternatives, SI-HTM exhibits speedups of up
to 300% on TPC-C.
• We prove that any execution history that SI-HTM al-
lows is correct under SI. An important corollary is
that any application that is serializable under SI is also
serializable on SI-HTM.
The remainder of this paper is organized as follows. Sec-
tion 2 provides the background on SI and on the features of
P8-HTM. Section 3 describes SI-HTM. Section 4 evaluates
SI-HTM, comparing its performance to relevant alternatives.
Section 5 surveys related work. Finally, Section 6 concludes
and describes future work.
2 Background
In this section we start by introducing the basic notions of
SI. We then describe the HTM support in the IBM POWER
architecture (which we call P8-HTM), emphasizing its lim-
itations and showing why it is not trivial to obtain weak
semantics like SI when using this HTM.
2.1 Snapshot Isolation
SI is a widely used correctness criterion in databases. Intu-
itively, SI allows each transaction to read from and write to
its own private isolated snapshot of data [2, 3, 10].
[Figure 1 image: timeline of transactions t0 to t3 showing reads and writes on X and Y, with three commits and one abort.]
Figure 1. Example of SI semantics.
Each transaction’s snapshot is created when the transac-
tion starts (or, alternatively, when it performs its first read).
Hence the snapshot holds the committed values that were
valid at that moment.
Each transaction’s snapshot is isolated from the writes
of concurrent transactions. More precisely, each write that
an active transaction performs on its snapshot is only vis-
ible to that transaction; other concurrent transactions will
not observe such a write when reading from their own snapshots.
This means that SI is typically implemented with a multi-versioned
approach, since different transactions reading from
the same location at the same time may observe different
versions. It is only when a transaction commits that its writes
become (atomically) visible to any new transaction whose
snapshot is created afterwards.
SI aborts transactions in the presence of write-write con-
flicts. More precisely, a transaction t is only allowed to com-
mit if its write set does not overlap with the write-set of any
other (concurrent) transaction that has committed after t
started. In contrast, SI tolerates read-write conflicts. There-
fore, two transactions may commit even if one transaction’s
read-set overlaps with the other transaction’s write-set.
To illustrate SI semantics, consider the example in Figure
1. Since t1 and t2 read from their own snapshots, they are
isolated from the writes that t0 performs. Since t1 and t2 do
not incur any write-write conflict, they can safely commit
under SI. Only t3 has to abort, because of the write-write
conflict with t0 (on X ). Note, however, that SI allows both t0
and t1 to commit, although they are not serializable.
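The commit rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of SI-HTM: the `Txn` record, the integer timestamps and the `can_commit` helper are hypothetical names introduced here, and the transactions are a simplified rendering of the pairs discussed in the example.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Txn:
    name: str
    start: int                         # snapshot (start) timestamp
    writes: set = field(default_factory=set)
    commit: Optional[int] = None       # commit timestamp, None while active

def can_commit(t, committed):
    """SI rule: t may commit only if no transaction that committed
    after t started wrote to a location that t also wrote."""
    for other in committed:
        if other.commit > t.start and other.writes & t.writes:
            return False
    return True

# Hypothetical schedule: t0 writes X and commits at time 5.
t0 = Txn("t0", start=1, writes={"X"}, commit=5)
# t1 is concurrent with t0 but writes only Y, so its write set is disjoint.
t1 = Txn("t1", start=2, writes={"Y"})
# t3 is concurrent with t0 and also writes X: a write-write conflict.
t3 = Txn("t3", start=3, writes={"X"})

print(can_commit(t1, [t0]))  # True
print(can_commit(t3, [t0]))  # False
```

Note that the check never inspects read sets, which is exactly why read-write conflicts are tolerated under SI.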
The weaker (than serializable) guarantees of SI enable ef-
ficient concurrency control implementations and improved
concurrency, especially for read-only or read-dominated
transactions, while avoiding most isolation anomalies [2].
These advantages have quickly rendered SI a mainstream
consistency guarantee in the database domain [32] and, more
recently, in IMDBs [8, 20, 22, 31] and distributed transac-
tional systems [34, 38].
SI can still yield a few anomalies, most notably the write
skew anomaly [10]. An example of a write skew is when
two transactions start from a common snapshot, each one
writes to a different location, and later each transaction reads
from the location written by the other transaction. While
some application semantics naturally tolerate the write skew
anomaly, other applications may suffer from unexpected be-
havior in the presence of write skews. Fortunately, recent
tools and methodologies have been proposed to detect and
remove write skews [10], with the goal of ensuring seri-
alizable executions even when the program runs under SI.
One common fix is read promotion [10, 24]: the problematic
reads are also inserted into the transaction’s write set, which
ensures that a write skew triggers an abort.
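As a rough illustration of why read promotion works, the following Python sketch checks whether SI would let two concurrent transactions both commit, first with their plain write sets and then with the problematic reads promoted. The helper name `si_allows_both` and the locations X and Y are assumptions made for this example only.

```python
def si_allows_both(writes_a, writes_b):
    """Under SI, two concurrent transactions can both commit
    only if their write sets are disjoint."""
    return not (writes_a & writes_b)

# Write skew pattern: T1 reads Y and writes X; T2 reads X and writes Y.
t1_reads, t1_writes = {"Y"}, {"X"}
t2_reads, t2_writes = {"X"}, {"Y"}

# Plain SI: the write sets {X} and {Y} are disjoint, so both commit.
print(si_allows_both(t1_writes, t2_writes))       # True

# Read promotion: the problematic reads are added to the write sets,
# which turns the read-write overlap into a write-write conflict.
t1_promoted = t1_writes | t1_reads
t2_promoted = t2_writes | t2_reads
print(si_allows_both(t1_promoted, t2_promoted))   # False
```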
2.2 P8-HTM
P8-HTM detects conflicts by adopting a 2-Phase Locking
(2PL) scheme at the granularity of a cache line. In P8-HTM’s
2PL scheme, the last transaction to read from some shared
variable will kill the execution of any other previous writer
transaction on that same variable. In the case of write-write
conflicts the last writer is killed.
P8-HTM can only handle transactions whose read and
write sets fit into each core’s transactional buffer. For the IBM
POWER8 and POWER9 processors, this buffer is called TM-
CAM and consists of a content-addressable memory linked
with the L2 cache, shared by eight hardware threads [16].
Since the TMCAM is 8 KB in size, the available capacity for
transactions running on the core(s) sharing a TMCAM is up
to 64 cache lines.
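The capacity figures above can be checked with some simple arithmetic. The sketch below derives the tracking granularity implied by the two numbers in the text and then estimates a per-thread share under SMT. The even split among co-located threads is an assumption made purely for illustration; the actual sharing is dynamic, so the per-thread numbers are only a rough average.

```python
TMCAM_BYTES = 8 * 1024      # TMCAM size stated in the text
TMCAM_LINES = 64            # capacity in cache lines stated in the text

# Implied tracking granularity: 8 KB / 64 lines.
line_bytes = TMCAM_BYTES // TMCAM_LINES

def lines_per_thread(smt_threads, total_lines=TMCAM_LINES):
    """Assumption for illustration: threads sharing a TMCAM split it evenly."""
    return total_lines // smt_threads

print(f"cache line size implied by the text: {line_bytes} bytes")
for smt in (1, 2, 4, 8):
    share = lines_per_thread(smt)
    print(f"SMT-{smt}: about {share} lines ({share * line_bytes} bytes) per thread")
```

Under this simplification, a transaction running at SMT-8 could track only about 8 cache lines, which gives a sense of how quickly regular HTM transactions run out of capacity.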
One of the main features of the POWER architecture is its
extensive use of Simultaneous Multi-Threading (SMT), which
supports the execution of up to 8 threads per core (SMT-
1,2,4,8). When multiple threads run in SMT on the same core,
they share the hardware resources available to that core.
It is worth noting that SMT and P8-HTM are, in practice,
conflicting features since the TMCAM is shared among co-
located SMT threads. Therefore, a transactional program that
takes advantage of SMT will inherently reduce the available
capacity to hardware transactions, which degrades the effec-
tiveness of P8-HTM. Most recently, POWER9 introduced an
additional 512 KB read tracking structure, called L2 LVDIR
and also shared among two cores, intended to support trans-
actions with larger read sets [16]; however, the L2 LVDIR can
only be used by up to two threads at any given time, which
essentially makes it incompatible with workloads with large
transactions that wish to use SMT. Not surprisingly, we are
not aware of any paper that has proposed a transactional sys-
tem that exploits P8-HTM and achieves consistent speed-ups
when running in SMT scenarios.
When running transactions in P8-HTM, the program can
resort to an advanced suspend-resume mechanism. When
a transaction is suspended, all its subsequent operations
are executed non-transactionally, thus not tracked in the
TMCAM. When the program eventually resumes the trans-
action any transactional conflicts that were detected during
the suspend-resume interval take effect (and the transaction
aborts). This is a useful mechanism to support programs
that need to access control variables within a transaction’s
[Figure 2 image: two timelines, A and B, each showing ROTs r0 and r1 with a read r(X)=0 and a write w(X,1); both ROTs commit in A, while in B an invalidation causes one ROT to abort.]
Figure 2. Example A: write-after-read conflict is tolerated by ROTs.
Example B: read-after-write conflict causes the writer ROT to abort.
lifetime without aborting due to conflicts when accessing
those control variables.
Another advanced feature of P8-HTM is the support of a
special kind of transactions called rollback-only transactions
(ROT). In this mode, the TMCAM only tracks writes¹ while
reads are performed as if they were not inside a transaction.
This difference has two key consequences when we compare
the behaviors of ROTs and regular HTM transactions. First,
reads no longer contribute to spending the HTM capacity.
Since reads are usually the most prevalent operation inside
a transaction, the capacity of ROTs is massively improved
relative to regular HTM transactions.
The second consequence is that, while concurrent ROTs
can still abort due to write-write conflicts, some read-write
conflicts are not guaranteed to be (and in general are not)
detected, hence will not lead to aborts. Example A in Figure
2 shows an example of a write-after-read conflict between
two ROTs that is tolerated. However, it should be noted that
ROTs can still abort due to read-after-write conflicts. As
illustrated in example B in Figure 2, if a ROT r1 writes to a
location and a concurrent ROT r2 later reads from the same
location, r1’s entry in the TMCAM will be invalidated and
thus r1 will abort. Evidently, the weak semantics of ROTs
do not guarantee serializability. For that reason, the official
documentation of P8-HTM clearly states that ROTs should
only be used with code blocks that exclusively access thread-
local data and may benefit from the ability to roll back their
updates [16] – hence the term “rollback-only transactions”.
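The conflict behavior of ROTs described above can be summarized as a small decision function. This is a simplified model based only on the description in this section: the function name and return values are hypothetical, and the model ignores the small fraction of reads that the TMCAM may still track in a ROT.

```python
def rot_conflict(first_op, second_op):
    """Which ROT aborts when two concurrent ROTs touch the same cache line.

    first_op is issued before second_op; each is "r" or "w".
    Returns None (no conflict), "first", or "either".
    """
    if first_op == "w" and second_op == "r":
        return "first"    # read-after-write: the earlier writer is invalidated
    if first_op == "w" and second_op == "w":
        return "either"   # write-write: at least one contender aborts
    return None           # reads are untracked: read-read and write-after-read pass

print(rot_conflict("r", "w"))  # None   (Figure 2, example A)
print(rot_conflict("w", "r"))  # first  (Figure 2, example B)
```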
As we show next, ROTs can actually be used with a dif-
ferent purpose than intended and constitute a fundamental
building block to implementing a restricted form of hardware-
supported SI in SI-HTM.
3 SI-HTM
The goal of SI-HTM is to build an SI implementation directly
supported by HTM. The gains of the SI-HTM design are twofold:
first, to benefit from the fast transactional execution that
an HTM delivers when directly handling memory accesses
and conflicts; second, to take advantage of SI to avoid the
tight capacity restrictions of HTM. Recent proposals have
shown how to achieve this goal through modified hardware
[4, 24]. The key novelty of SI-HTM is that it relies on the
commercially available IBM POWER8 and 9 architectures,
hence SI-HTM is ready to run on off-the-shelf hardware.
¹ Actually, due to implementation-specific reasons, the TMCAM can also
track a small fraction of reads in a ROT [15].
Accomplishing this design requires overcoming impor-
tant challenges. First, the available plain HTM transactions
impose rigid strong semantics and capacity limits. SI-HTM
relies on ROTs as the main building block to execute trans-
actions. This inherently enables SI-HTM transactions to be
capacity-bounded only by their write-sets, which represents
a decisive advantage with read-dominated and read-only
transactions. However, executing each transaction as a ROT
that accesses shared memory is unsafe, as it may yield serious
anomalies that SI disallows.
To prevent the ROT-induced anomalies, the hardware ROT
support needs to be complemented with software-level in-
strumentation that enforces the hardware-supported execu-
tion to circumvent those anomalies. The second challenge
is, then, to ensure that such software instrumentation has a
reduced impact on the runtime performance of HTM. Ideally,
memory accesses should be handled directly by the HTM;
any code instrumentation should only be allowed (and min-
imized) on the begin and commit stages – especially for
read-only transactions, which dominate many workloads.
Non-instrumented transactional accesses imply that each
transaction will directly access the cache-coherent memory,
which is single-version (i.e., two transactions that read from
the same location simultaneously observe the same value).
This is incompatible with the original definition of SI, which
relies on a multi-versioned scheme to allow concurrent trans-
actions to access distinct isolated snapshots. An example is
given in Figure 1, where, after t0 has written to X on its local
snapshot, both t0 and t1 observe different values when they
read from X . The third and last challenge is, then, how to
implement SI on a single-version memory system.
Since building a multi-versioned memory would require
significant software instrumentation on memory accesses,
SI-HTM follows a different approach: SI-HTM relies on the
single-version memory system, which keeps transactional
accesses non-instrumented, and restricts the allowed execu-
tions to those that, under SI rules, would not require keeping
older data versions. For instance, recalling the above exam-
ple from Figure 1, a correct execution (under SI) implies that
t0 and t1 observe different values when accessing the same
memory location. To deal with this, SI-HTM enforces that
one of the contending transactions (either t0 or t1) aborts.
The next sections describe SI-HTM in detail. Section 3.1
starts by discussing the semantics of encapsulating trans-
actions in ROTs, pointing out possible anomalies that are
not accepted under SI. Section 3.2 then complements ROTs
with the necessary software instrumentation to ensure that
allowed executions are correct under SI. Section 3.3 then
describes how read-only transactions may be optimized.
We describe SI-HTM as a support for general-purpose
transactional memory programs that, within each transac-
tion may read and write to pre-allocated memory locations,
indexed by their virtual address. Among other uses, SI-HTM
can be integrated as a concurrency control mechanism in
[Figure 3 image: timeline of ROTs r0 and r1 with operations r(X)=0, w(X,1) and r(X)=1; both ROTs commit.]
Figure 3. Example of two concurrent ROTs contending on shared
data with an anomaly that is not allowed under SI.
IMDBs, including IMDBs that store named records that are
accessed by a set-oriented language (like SQL), making use
of efficient indexes [35].
3.1 ROTs as the building block of SI-HTM
SI-HTM encapsulates each transaction in a ROT by preceding
each transaction’s code with an HTMBeginROT instruction
and committing the transaction with HTMEnd. Yet, as Section
2 discussed, the semantics of ROTs are unsuitable for transactional
programs, as they may yield serious correctness
anomalies. Still, it is also true that ROTs implicitly ensure
some key consistency properties that are shared with SI.
Namely:
• Since P8-HTM keeps track of each ROT’s write set,
the underlying hardware 2PL implementation detects
write-write conflicts and resolves them by aborting
(at least) one of the contending ROTs. This implicitly
satisfies SI’s restriction that, when two concurrent
transactions have overlapping write sets, one of them
should not be allowed to commit.
• Executions where a ROT writes to a location that has
previously been read (and not written) by an ongoing
concurrent ROT are not treated as conflicts. While this
was not allowed by the serializable 2PL implementa-
tion of plain HTM, it is allowed under SI. Of course, as
Section 2 discusses, ROTs still treat some read-write sit-
uations as conflicts. This reflects the fact that SI-HTM
is a single-version implementation of SI.
Therefore, by running transactions as ROTs, SI-HTM gets
the above SI properties for free from the hardware. However,
using ROTs to encapsulate transactions that concurrently
access shared data may yield dirty read anomalies [1], which
are not allowed under SI [2].
To illustrate these anomalies consider the example in Fig-
ure 3. In this example, two concurrent ROT-encapsulated
transactions, r0 and r1, access shared variable X. Since r1
writes to X after r0 reads X (a write-after-read case), no conflict
is detected, hence both ROTs are allowed to continue
running. However, since r1’s write is performed in the actual
shared location, this write is not isolated in r1’s conceptual
snapshot, as SI mandates. Instead, the write is visible to r0
when r0 later reads from X. Recall that, in an execution that
is correct under SI, the second read by r0 should return the
value of X that was committed when the (isolated) snapshot
was initially created. Clearly, the above execution with
[Figure 4 image: two timelines, A and B, of ROTs r0 and r1. In A, r1 writes X, calls txCommit and enters the safety wait, and a later read of X by r0 invalidates r1. In B, r0 reads other locations, and both ROTs reach rot-commit after r1's safety wait.]
Figure 4. Examples of the safety wait as a means to prevent dirty
read anomalies. Example A: a dirty read is effectively prevented by
having r1 wait until r0 performs the problematic read. Example B:
after a safety wait, r1 commits without causing dirty reads.
ROTs results in a dirty read, which violates the requirement
of isolated snapshots in SI.
The next section shows how SI-HTM prevents dirty reads
on ROT-encapsulated transactions.
3.2 Base algorithm
Recall the example in Figure 3 illustrating the dirty read
anomaly induced by encapsulating transactions in ROTs. The
key insight behind SI-HTM is that, if ROT r1 had waited for
a sufficiently long time before issuing the HTMEnd instruction,
the anomaly would not have occurred.
Suppose that r1 had waited until r0 had issued its last
read, r (X ), as Figure 4 illustrates in example A. Since this
read targets a location that is currently in r1’s write-set,
r0’s read invalidates r1’s write entry in the TMCAM and
r1 aborts. Consequently, r1’s write is rolled back, thus r0
reads the original (and correct) value (X = 0). Therefore,
this waiting prevents the dirty read anomaly by aborting the
writer transaction.
It should be noted that r1 can only be sure that it can safely
commit without incurring dirty read anomalies on r0 after r0
has performed its last access (and r1 has survived each read
access by r0). Example B in Figure 4 illustrates an alternative
execution where r0 does not read from locations written by
r1 and, thus, r1 can safely commit after waiting for r0’s last read.
We should remark that, in both examples in Figure 4, there
is a cost to pay for correctness: in example A, the writer trans-
action aborts; in example B, the writer transaction spends sig-
nificant time spinning. As we describe next, SI-HTM trades
these costs for important improvements in capacity. As Sec-
tion 4 evaluates, the gains clearly outweigh these costs.
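To make the effect of the safety wait concrete, the following Python sketch simulates a single-version memory in which only writes are tracked, as in a ROT. It is a toy model, not the hardware mechanism: the `Memory` class, its undo log and the two scenario functions are hypothetical constructs that mimic the behavior described for Figure 3 and for example A of Figure 4.

```python
class Memory:
    """Single-version memory with ROT-style tracking of writes only."""

    def __init__(self, **init):
        self.data = dict(init)
        self.undo = {}          # ROT name -> {location: value before its write}
        self.aborted = set()

    def write(self, rot, loc, value):
        self.undo.setdefault(rot, {}).setdefault(loc, self.data[loc])
        self.data[loc] = value  # the write goes straight to shared memory

    def read(self, rot, loc):
        # Reading a location in another active ROT's write set invalidates
        # that ROT: its writes are rolled back and it is marked as aborted.
        for writer, log in list(self.undo.items()):
            if writer != rot and loc in log:
                self.data.update(log)
                del self.undo[writer]
                self.aborted.add(writer)
        return self.data[loc]

    def commit(self, rot):
        if rot in self.aborted:
            return False
        self.undo.pop(rot, None)  # writes become permanent
        return True

def figure3_no_wait():
    m = Memory(X=0)
    first = m.read("r0", "X")
    m.write("r1", "X", 1)
    committed = m.commit("r1")    # r1 commits immediately
    second = m.read("r0", "X")    # r0 now observes r1's write
    return first, second, committed

def figure4a_with_wait():
    m = Memory(X=0)
    first = m.read("r0", "X")
    m.write("r1", "X", 1)
    second = m.read("r0", "X")    # r1 is still waiting, so it gets invalidated
    committed = m.commit("r1")    # r1 finds itself aborted
    return first, second, committed

print(figure3_no_wait())     # (0, 1, True)
print(figure4a_with_wait())  # (0, 0, False)
```

In the first scenario r0 reads two different values of X within one transaction; in the second, delaying r1's commit lets the hardware-style invalidation restore the value r0 is supposed to see.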
More generally, when a ROT-encapsulated transaction, r,
that wrote to a given location, X, completes (i.e., performs
its last memory access before entering the commit phase),
r should wait until any active transaction, t , that has read
X before r ’s update is guaranteed to have performed its last
read of X . If r only issues HTMEnd after that condition is
Algorithm 1 Transaction begin and end
 1: int state[N] = {inactive, .., inactive}
 2:
 3: function TxBegin()
 4:     state[tid] ← currentTime()
 5:     sync()
 6:     if HTMBeginROT then
 7:         return
 8:     end if
 9: end function
10:
11: function TxEnd()
12:     HTMSuspend
13:     state[tid] ← completed
14:     sync()
15:     HTMResume
16:     snapshot[0..N − 1] ← state[0..N − 1]
17:     for c : 0..N − 1, c ≠ tid do
18:         if snapshot[c] > 1 then
19:             wait while snapshot[c] = state[c]
20:         end if
21:     end for
22:     HTMEnd
23:     state[tid] ← inactive
24: end function
guaranteed, then none of r ’s writes will induce dirty reads
on any t .
However, precisely determining the earliest instant when
such a guarantee is met is impractical. First, it is usually
not possible to predict the future reads of a transaction. Sec-
ond, determining which ROTs have previously read X would
imply keeping track (at the software-level) of each read, re-
quiring prohibitive read instrumentation. Hence, SI-HTM
adopts a conservative approach: r assumes that every active
ROT may have read from at least one location of r ’s write
set; moreover, r waits until each such concurrent transac-
tion has completed, thus it will not issue any more memory
requests. When this condition is satisfied, we say that r is
safe to commit.
Algorithm 1 shows the software-level instrumentation
that SI-HTM adds to implement the safety wait. SI-HTM
maintains a shared array where each thread publishes its cur-
rent state, which can either be not running any transaction
(inactive=0), running a transaction (any value greater than
1), or completed and waiting for a safe commit (completed=1).
Before starting the ROT on P8-HTM, a transaction an-
nounces that it will become active (line 4) by setting its state
to the current system timestamp (in clock cycles). Conversely,
when a transaction has completed and wishes to commit, it
sets its state to completed (line 13) and then waits until ev-
ery other active transaction, t , leaves that state. After t leaves
that state, the corresponding thread’s state will change to
inactive, eventually starting another transaction and be-
coming active again with a higher timestamp, and later
switching to completed, and so forth.
The waiting condition in lines 17-19 spins until one of
these options is observed for each other thread. Once that
happens, the waiting transaction can finally commit its ROT
in P8-HTM and announce itself as inactive (line 23).
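The waiting condition of Algorithm 1 can be expressed as a pure function over the state array, which makes it easy to see when a committing thread stops waiting. The sketch below is a single-threaded illustration only: `must_wait_for` and the timestamp values are hypothetical, and the real algorithm spins on shared memory rather than calling a function.

```python
INACTIVE, COMPLETED = 0, 1   # any value greater than 1 is a start timestamp

def must_wait_for(snapshot, state, tid):
    """Threads that the committing thread `tid` still has to wait for.

    Mirrors lines 17-19 of Algorithm 1: wait for every other thread that was
    active in the snapshot and whose published state has not changed since.
    """
    return [c for c in range(len(state))
            if c != tid and snapshot[c] > COMPLETED and snapshot[c] == state[c]]

state = [1000, 1007, INACTIVE, 1012]  # thread 2 is idle, the others are active
state[0] = COMPLETED                  # thread 0 has finished its accesses (line 13)
snapshot = list(state)                # line 16: private copy of the state array

print(must_wait_for(snapshot, state, 0))  # [1, 3]

state[1] = COMPLETED                  # thread 1 completes its transaction
state[3] = INACTIVE                   # thread 3 ends its transaction...
state[3] = 1040                       # ...and starts a new one later
print(must_wait_for(snapshot, state, 0))  # []
```

Because each new transaction publishes a fresh timestamp, comparing against the snapshot is enough to tell that thread 3's earlier transaction is over, even though thread 3 is active again.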
Algorithm 2 Extension with SGL and RO paths
 1: function SyncWithGL()
 2:     state[tid] ← currentTime()
 3:     sync()
 4:     if globalLock.isLocked() then
 5:         state[tid] ← inactive
 6:         wait while globalLock.isLocked()
 7:         go to 2
 8:     end if
 9: end function
10:
11: function TxBeginExt(boolean isRO)
12:     if isRO then
13:         SyncWithGL()
14:         return
15:     else
16:         while (retries-- > 0) do
17:             SyncWithGL()
18:             if HTMBeginROT then
19:                 return
20:             end if
21:         end while
22:         state[tid] ← inactive
23:         globalLock.lock(tid)
24:         for c : 0..N − 1, c ≠ tid do
25:             wait while state[c] ≠ inactive
26:         end for
27:     end if
28: end function
29:
30: function TxEndExt()
31:     if globalLock.isLocked(tid) then
32:         globalLock.unlock()
33:     else
34:         if isRO then
35:             lwsync()
36:             state[tid] ← inactive
37:         else
38:             TxEnd()
39:         end if
40:     end if
41: end function
Whenever a transaction’s state is set, after the beginning of
a new transaction (line 4) and after suspending (line 13), we
need to ensure that the change propagates to all concurrent
transactions. To do so, we enforce a full memory barrier
(sync) after the state change.
One relevant implementation detail is that all updates to
the thread’s entry on the shared state array are performed in
non-transactional mode. If these updates happened inside an
active ROT, the transactional buffer would be occupied with
one unnecessary write and, most importantly, the ROT would
abort whenever other transactions read this transaction’s
state in the shared array.
3.3 Read-only fast path and the fall-back path
SI-HTM has alternative paths besides the algorithm described
so far, which Algorithm 2 presents.
One important path is the fast path for read-only trans-
actions. In the context of SI-HTM, we define a read-only
transaction as one that performs no writes on shared data
locations. Note that a read-only transaction is allowed to
update thread-private memory locations (such as its local
stack). When a transaction is launched in SI-HTM, an argu-
ment specifies whether the transaction is read-only or not.
We assume this parameter is set by the programmer or by
some automatic tool (e.g. a compiler).
Since read-only transactions perform no shared updates,
they are not prone to cause dirty reads. Therefore, they may
safely skip the safety waiting and immediately commit upon
completion (line 34). Of course, read-only transactions still
need to announce their state changes, so that other read-
write transactions can know how to coordinate their safety
waitings. Finally, when a read-only transaction ends we must
ensure that all shared memory reads were performed before
the state is set. This is accomplished using a light-weight
barrier lwsync issued at Line 35.
Another alternative path is the fall-back path when a read-
write transaction is not able to commit after a number of
retries. This can happen for a number of reasons, ranging
from transactions whose write-sets exceed the available
capacity, frequent aborts under high contention, to transac-
tions issuing instructions that are illegal on P8-HTM, among
others. To ensure progress under such situations, transac-
tions in SI-HTM resort to a traditional single global lock
(SGL) fall-back path after having aborted too many consec-
utive times (line 16). We should note that, upon acquiring
the SGL, the transaction cannot proceed immediately since
other concurrent transactions may still be actively running.
Hence, the SGL holder first waits until no other active transaction
exists (lines 23-26). While the SGL is locked, no other
transaction is allowed to proceed².
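The retry-then-fall-back policy and the drain condition checked by the SGL holder can be sketched as follows. This is a simplified, single-threaded illustration: `run_with_fallback`, `others_inactive` and the stub transaction are hypothetical names, and no real lock or hardware transaction is involved.

```python
INACTIVE = 0

def others_inactive(state, tid):
    """True once every thread other than `tid` has published `inactive`."""
    return all(s == INACTIVE for c, s in enumerate(state) if c != tid)

def run_with_fallback(attempt_rot, run_under_lock, retries):
    """Try the hardware path up to `retries` times, then take the fall-back.

    attempt_rot() returns True on commit and False on abort (a stub here);
    run_under_lock() stands in for the single-global-lock path.
    """
    for _ in range(retries):
        if attempt_rot():
            return "rot"
    run_under_lock()
    return "sgl"

attempts = []

def oversized_txn():
    # Stub for a transaction that always aborts, e.g. because its
    # write set exceeds the available capacity.
    attempts.append("abort")
    return False

path = run_with_fallback(oversized_txn, lambda: None, retries=3)
print(path, len(attempts))               # sgl 3
print(others_inactive([0, 1017, 0], 0))  # False
print(others_inactive([0, 0, 0], 0))     # True
```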
3.4 Correctness
In this section, we show that any execution history that SI-
HTM allows is correct under the original definition of SI [2].
This implies that any existing application that is serializable
under SI may be directly executed under SI-HTM and will
retain its correctness. Moreover, it means that the techniques
and tools that have been proposed to analyze and fix pro-
grams to run under SI without anomalies (e.g., [10, 24]) are
also applicable to SI-HTM.
We note, however, that SI-HTM assumes that any access
to shared memory locations is included within a SI-HTM
transaction (i.e., between TxBeginExt and TxEndExt). This
is similar to the weak atomicity model [26], albeit employed
in the context of SI. We also remark that SI-HTM prevents
inconsistent reads, in the sense that any transaction in SI-HTM
must see all and only the effects of transactions that committed
before it started.
We highlight that this holds even for transactions that eventually
abort, which is a stronger guarantee that SI does not require. It is
straightforward to prove this property by recalling the semantics
² It should be noted that the early subscription scheme that is usually used
with regular HTM (which avoids the initial wait after acquiring the SGL) is not
possible in SI-HTM since read-only transactions run non-transactionally
and ROTs do not detect write-after-read conflicts.
Stretching the capacity of HTM in IBM POWER PPoPP ’19, February 16–20, 2019, Washington, DC, USA
of P8-HTM. Let us suppose, by contradiction, that some
transaction, t , performs an inconsistent read, i.e. reads a
value written by some transaction that has or will abort.
This would mean that t was able to read a value written by
another transaction that had not yet been committed, which
contradicts P8-HTM’s semantics. A consequence of this prop-
erty is that, for those applications that are serializable under
SI (i.e., are guaranteed to run under SI without incurring
SI-related anomalies) SI-HTM will not yield the undesirable
side-effects that may arise in TM implementations that allow
inconsistent reads [13]. Therefore, SI-HTM can safely run
such applications in non-sandboxed environments.
To show that any execution history that SI-HTM allows
is correct under SI, the following sketch of proof addresses
each restriction from the operational definition of SI [2],
explaining why SI-HTM satisfies all of them.
R1: Each transaction reads data from a snapshot of
the (committed) data as of the time the transaction
started, called its Start-Timestamp.
Proof. Rephrasing the above restriction, any transaction t
reads from a snapshot that reflects the writes by the most
recently committed transactions whose Commit-Timestamp
precedes t’s Start-Timestamp. The above restriction is
guaranteed if, for any transaction t that successfully commits
(in SI-HTM), we define its Commit-Timestamp as the instant
where t completes taking a snapshot of each thread’s state
(line 16 in Alg. 1); i.e., just before performing its safety wait.
Consider a pair of transactions, tw and tr, where:
transaction tw writes to a given location, o, and tw eventually
commits; tr starts after tw commits and reads from o; tw is
the last transaction to write to o before the Start-Timestamp
of tr. Assume, by contradiction, that tr observes a version,
v∗, different from the one produced by tw. This would be
possible only in the following three cases:
Case a: v∗ is produced by a not yet committed transaction.
This is impossible, since it implies that tr read uncommitted
values, which is prevented by P8-HTM. Note that this
would be a violation of restriction R4, which we address later.
Case b: v∗ is produced by a transaction, t′, that committed
after tr started; i.e., the Start-Timestamp of tr precedes
the Commit-Timestamp of t′. Clearly, tr can only read v∗
after t′ has issued HTMEnd. This implies that t′ completed
its safety wait before tr read o. However, by hypothesis, tr’s
Start-Timestamp is earlier than the Commit-Timestamp of
t′; consequently, t′ observed that tr’s state was active before
t′ initiated its safety wait. Therefore, t′ could only conclude
its safety wait after tr committed, which contradicts the
hypothesis that tr read after t′ executed HTMEnd.
Case c: v∗ is produced by a committed transaction, t′,
whose Commit-Timestamp is earlier than tw’s
Commit-Timestamp. By restriction R5 (described later), the
only case where both t′ and tw are able to commit is if the
Commit-Timestamp of t′ precedes the Start-Timestamp of tw.
Since, by hypothesis, both transactions commit, then it is easy
to prove that tw’s write to o occurred after t′ executed HTMEnd
(otherwise, a write-write conflict would arise and P8-HTM
would abort one writer) and, consequently, after the write by t′.
Since, by hypothesis, tr starts after the Commit-Timestamp
of tw, then tr will necessarily observe the most recent write
(tw’s write), which contradicts case c’s hypothesis. □
[Figure 5 diagram: timelines of t1 and t2, labelled w(X,1),
r(X)=1, state snapshot, safety wait and HTMEnd.]
Figure 5. Example illustrating why the Commit-Timestamp of
t1 is defined as the time at which t1 completes reading each
thread’s state, rather than the time at which HTMEnd occurs.
As a side note, we explain the rationale behind defining the
Commit-Timestamp as above, instead of as the instant
in which HTMEnd is performed (line 22 of Alg. 1). To illustrate
why the alternative definition is not appropriate, consider
the example in Figure 5. Transaction t2 reads the value that
t1 wrote (and committed) to x. However, t2 began before t1
executed HTMEnd. Hence, considering the moment at which
a transaction performs HTMEnd as its Commit-Timestamp
would contradict restriction R1.
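The Figure 5 argument can be replayed with concrete numbers. The sketch below is only an illustration: the integer instants are hypothetical, and it assumes that t2 starts after t1 has completed its state snapshot (otherwise t1's safety wait would have covered t2). It checks R1 under both candidate definitions of the Commit-Timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Writer:
    snapshot_done: int  # instant at which the state snapshot completes
    htm_end: int        # instant at which HTMEnd executes


def read_allowed_by_r1(reader_start, writer_commit_ts):
    """R1: a reader may only see writers whose Commit-Timestamp precedes its start."""
    return writer_commit_ts < reader_start


# Hypothetical instants for the Figure 5 scenario (only their order matters).
t1 = Writer(snapshot_done=2, htm_end=5)
t2_start = 3  # t2 begins during t1's safety wait, before t1's HTMEnd
# After instant 5, t2 reads X and observes 1, the value written by t1.

ok_with_snapshot_ts = read_allowed_by_r1(t2_start, t1.snapshot_done)
ok_with_htm_end_ts = read_allowed_by_r1(t2_start, t1.htm_end)
print(ok_with_snapshot_ts, ok_with_htm_end_ts)
```

With the snapshot-based definition, t2's read of t1's value is a legal snapshot read; with the HTMEnd-based definition, the same read would count as observing a transaction that committed after t2 started.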
R2: A transaction running in SI is never blocked at-
tempting a read as long as the snapshot data from its
Start-Timestamp can be maintained.
Proof. Trivially ensured by P8-HTM’s semantics, which never
block upon memory accesses. □
R3: A transaction’s writes are also reflected in its lo-
cal snapshot.
Proof. Since update transactions are executed in ROTs, the
semantics of P8-HTM trivially ensure this restriction. □
R4: Updates by other transactions active after the
transaction Start-Timestamp are invisible to the trans-
action.
Proof. Let us assume that a transaction tw writes on a given
location and another transaction, tr, reads the same location.
Further, let us assume that tw is active (i.e., tw has not reached
its Commit-Timestamp, as defined in R1) after tr started. If
tw still has not executed HTMEnd when the read occurs, then
P8-HTM invalidates tw’s write (thus aborting tw) and, thus,
tr reads the previous value, which satisfies R4.
Alternatively, let us suppose, by contradiction, that tw had
already committed when tr reads, so that tr would see tw’s
update. However, since tw had already performed HTMEnd,
we know that tw had previously completed its safety
wait. This implies that every transaction that had started
before tw’s Commit-Timestamp has already completed; since
tr subsequently performs a read, tr must have started after tw
committed, which contradicts the initial hypothesis. □
R5: A transaction t1 can only commit if no other
transaction, t2, with a Commit-Timestamp in t1’s exe-
cution interval [Start-Timestamp,Commit-Timestamp]
wrote data that t1 also wrote.
Proof. Suppose, by contradiction, that transactions t1 and
t2 write to a common location, o. Further, suppose, with-
out loss of generality, that t2 commits before t1 does. Since,
by hypothesis, t1’s Start-Timestamp precedes t2’s Commit-
Timestamp, t2 must have observed that the state of t1 was not
inactive before t2 entered its safety wait. Therefore, t2 had
to wait until t1 issued all its memory accesses and completed
(since, by hypothesis, t1 did not abort). This implies that t1
wrote to o before the write to the same location by t2 was
committed in hardware, which is a write-write conflict that
P8-HTM solved by aborting either t1 or t2, which contradicts
the hypothesis that both t1 and t2 commit. □
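Restriction R5 can also be phrased as a check over a history of committed transactions. The following sketch is an illustrative checker, not part of SI-HTM: the `Committed` record, the integer timestamps and the function name are assumptions made for this example.

```python
from itertools import combinations
from typing import FrozenSet, List, NamedTuple, Tuple


class Committed(NamedTuple):
    name: str
    start: int              # Start-Timestamp
    commit: int             # Commit-Timestamp
    writes: FrozenSet[str]  # locations written by the transaction


def r5_violations(history: List[Committed]) -> List[Tuple[str, str]]:
    """Return pairs of committed transactions that R5 forbids: one commits
    inside the other's execution interval and their write-sets intersect."""
    bad = []
    for a, b in combinations(history, 2):
        b_commits_inside_a = a.start <= b.commit <= a.commit
        a_commits_inside_b = b.start <= a.commit <= b.commit
        if (b_commits_inside_a or a_commits_inside_b) and a.writes & b.writes:
            bad.append((a.name, b.name))
    return bad
```

A history that SI-HTM allows should yield an empty list; the proof above argues that P8-HTM's write-write conflict detection, combined with the safety wait, rules out any pair that this check would report.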
To conclude, we complement the above proof sketch
with correctness arguments for the SGL fall-back path
scenario. As Section 3.3 briefly explains, after a transaction
acquires the SGL, it waits until ongoing transactions finish
(lines 14-15, Alg. 2) before it starts executing. Conversely, any
other transaction that may try to start will observe that the
SGL is taken and wait (lines 4 and 7, Alg. 2) until that
condition changes. Therefore, when the thread holding the SGL
runs its transaction, no other threads have active transactions
and any previous transaction that had issued writes must
have already committed or aborted. Consequently, it is easy
to show that the previous restrictions apply in the scenario
where one transaction runs in the SGL fall-back path.
4 Evaluation
SI-HTM has distinct features that have the potential to con-
tribute to effective performance improvements. More pre-
cisely, when compared with plain HTM, SI-HTM potentially
offers the following benefits: i) update transactions with
much larger memory footprints may run and commit with-
out exhausting the HTM capacity; ii) read-only transactions
run non-transactionally, hence exhibit lower begin/commit
overheads and have unlimited capacity; iii) as a corollary
of the previous outcomes, it becomes feasible to co-locate
more parallel transactions on a common SMT core; iv) since
SI-HTM provides weaker correctness guarantees than plain
HTM, SI-HTM allows higher concurrency.
The main goal of this evaluation is to understand, for a
wide range of scenarios and workloads, the performance
and scalability gains of SI-HTM, when compared to relevant
HTM- and software-based concurrency control mechanisms.
In this study, we also aim to evaluate the effective benefits
that each factor above (i to iv) contributes to the global
outcome of SI-HTM, as well as the real performance costs of the
quiescence phase component of SI-HTM.
In order to answer these questions, we deploy SI-HTM
on an IBM POWER8 system with one 8284-22A processor of 10
cores with SMT-8 (i.e., up to 8 hardware threads per core). We
use a hash-map micro-benchmark to compare the behavior
of SI-HTM with the pure HTM baseline; and TPC-C [5] as a
real-world application benchmark, which we use to compare
SI-HTM with a relevant set of state-of-the-art concurrency
control systems.
4.1 Hash-map benchmark
The hash-map benchmark consists of a simple transactional
hash-map implementation, where clients can perform lookup,
insert and remove operations. A read-only transaction per-
forms a lookup operation and a read-write transaction per-
forms an insert, or a remove operation if the last transaction
on that thread was an insert. This synthetic benchmark al-
lows us to study different workload scenarios that cover
distinct combinations between the orthogonal dimensions
of transaction footprint and contention.
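The per-thread choice of operation described above could look roughly like the sketch below. It is a guess at the workload driver, not the benchmark's actual code: the class name, the uniform key selection and the seeding are assumptions, and it reads "the last transaction on that thread" as the last update transaction, so that inserts and removes alternate.

```python
import random


class ClientThread:
    """Picks the next hash-map operation for one benchmark thread (toy model)."""

    def __init__(self, read_only_pct, key_space, seed):
        self.rng = random.Random(seed)      # per-thread generator (assumption)
        self.read_only_pct = read_only_pct  # e.g. 90 or 50
        self.key_space = key_space          # hypothetical number of distinct keys
        self.last_update = None             # "insert", "remove", or None

    def next_op(self):
        key = self.rng.randrange(self.key_space)
        if self.rng.randrange(100) < self.read_only_pct:
            return ("lookup", key)          # read-only transaction
        # Update transaction: alternate insert and remove on this thread.
        op = "remove" if self.last_update == "insert" else "insert"
        self.last_update = op
        return (op, key)
```

Alternating inserts and removes keeps the number of stored elements roughly stable over a run, which is presumably why the benchmark pairs them.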
Regarding the transaction footprint dimension, the number
of elements that initially populate the hash-map can be in
one of two modes: a large transaction footprint mode, where
the hash-map size is such that each bucket has, on average,
a list of 200 elements (hence, operations on a key in that
bucket may need to read from up to 200 cache lines to find
the target element, which easily leads transactions to exceed
P8-HTM’s capacity); and a short transaction footprint mode,
where each bucket has, on average, 50 elements (thus most
operations find the target element without exceeding
P8-HTM’s capacity). Since the available capacity, both in HTM
and in SI-HTM, depends on the read/write ratio, we further
distinguish the (large vs. short) transaction footprint
dimension by the read/write ratio. Thus, in total we consider
3 scenarios: a large-footprint scenario dominated by read-only
transactions (90% read-only transactions vs. 10% update
transactions); a large-footprint scenario with 50% read-only
vs. 50% update transactions; and a short-footprint scenario
that mixes 90% read-only transactions and 10% update
transactions. (We omit the short/50%:50% case for space
limitations, as it adds no relevant findings.)
Concerning the orthogonal dimension of contention, it can
be tuned by choosing different numbers of buckets for the
hash-map. We consider two scenarios along this dimension:
low contention, where the hash-map has 1000 buckets (hence,
concurrent operations on the same bucket are rare); and
high contention, where the hash-map has only 10 buckets
(frequent operations contend for a common bucket).
We experiment with each possible combination of transaction
footprint with the two contention scenarios (3 × 2 scenarios),
running each combination with up to 80 threads (10 cores
running in no-SMT up to SMT-8 mode).
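The experiment grid described above can be enumerated mechanically. The sketch below is illustrative only: the scenario names and dictionary layout are assumptions, the thread counts are the ones listed on the figures' x-axes, and the initial population is derived as buckets times average elements per bucket.

```python
from itertools import product

# Footprint scenarios: average elements per bucket and share of read-only txs.
FOOTPRINTS = {
    "large-90ro": {"avg_chain": 200, "read_only_pct": 90},
    "large-50ro": {"avg_chain": 200, "read_only_pct": 50},
    "short-90ro": {"avg_chain": 50, "read_only_pct": 90},
}
# Contention scenarios: number of hash-map buckets.
CONTENTION = {"low": 1000, "high": 10}
# Thread counts as listed on the x-axis of the result plots.
THREAD_COUNTS = [1, 2, 4, 8, 16, 32, 40, 80]


def scenarios():
    """Yield one configuration per (footprint, contention) pair."""
    for (fname, fp), (cname, buckets) in product(FOOTPRINTS.items(), CONTENTION.items()):
        yield {
            "name": f"{fname}/{cname}",
            "buckets": buckets,
            "read_only_pct": fp["read_only_pct"],
            # Average chain length times bucket count gives the initial size.
            "initial_elements": buckets * fp["avg_chain"],
        }


for s in scenarios():
    print(s["name"], s["buckets"], s["initial_elements"])
```

Under these numbers, the large-footprint low-contention map starts with 200,000 elements, while the short-footprint high-contention map starts with only 500.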
[Figures 6 to 8: for each scenario, a throughput plot (10^6 Tx/s)
and an abort-rate plot (Aborts (%), split into transactional,
non-transactional and capacity) for HTM and SI-HTM, against the
number of threads (1, 2, 4, 8, 16, 32, 40, 80).]
Figure 6. Hash-map 90% large read-only txs, low (left) and high (right) contention
Figure 7. Hash-map 50% large read-only txs, low (left) and high (right) contention
Figure 8. Hash-map 90% small txs, low (left) and high (right) contention
Figures 6 to 8 present the throughput and the abort rate,
broken down by abort type, of each experiment, averaged over
five runs. Regarding the types of aborts, we distinguish
transactional aborts, essentially caused by conflicting accesses
to shared memory locations; non-transactional aborts, mostly
caused by a locked SGL that kills ongoing transactions (only
possible in HTM); and, of course, capacity aborts.
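For readers reproducing the plots, one plausible way to turn raw counters into the per-type abort percentages is sketched below. The exact formula used for the figures is not stated here, so normalising by the total number of attempts (commits plus aborts) is an assumption, as are the function and key names.

```python
from statistics import mean

ABORT_TYPES = ("transactional", "non-transactional", "capacity")


def abort_breakdown(commits, aborts):
    """Percentage of attempts that ended in each abort type.

    Assumes attempts = commits + all aborts; other normalisations are possible."""
    attempts = commits + sum(aborts[t] for t in ABORT_TYPES)
    if attempts == 0:
        return {t: 0.0 for t in ABORT_TYPES}
    return {t: 100.0 * aborts[t] / attempts for t in ABORT_TYPES}


def average_runs(values):
    """Arithmetic mean of one metric over repeated runs (e.g. five runs)."""
    return mean(values)
```

For example, 50 commits alongside 25 transactional, 15 non-transactional and 10 capacity aborts would be reported as 25%, 15% and 10% respectively under this convention.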
As expected, the largest gains we observe are on large
read-only transaction scenarios, with an impressive 576%
improvement of peak throughput on the low contention
workload. This is the best-case scenario for SI-HTM, where
most transactions are read-only, hence run with no capacity
bounds. This is in clear contrast with the prohibitive capacity
constraints of HTM; such capacity issues quickly escalate
onto non-transactional aborts due to falling back to the SGL.
On the scenario with 50% update transactions, where the
majority of SI-HTM’s transactions run as ROTs with a limited
write set capacity, SI-HTM still proves to be the best approach
on a low contention workload, with gains of up to 10% peak
throughput. Again, the main reason is that update
transactions do not abort frequently for capacity reasons.
However, on the high contention workload, SI-HTM is not
able to surpass regular HTM. Because of the quiescence
phase, SI-HTM transactions take longer to abort (than in
HTM), leading to a delay on the fall-back to the SGL.
On scenarios of small transactions, which mostly fall within
P8-HTM’s capacity bounds, SI-HTM is not able to surpass
HTM. The added safety wait delays the execution of update
transactions in SI-HTM, on both low and high contention
workloads, a cost that is not compensated by relevant
reductions on capacity aborts.
HTM has historically performed poorly under SMT execution
[9, 12, 30], mostly due to sharing scarce hardware resources
between SMT threads. Since a transaction’s footprint in
SI-HTM is limited only by its write-set, we expect that, in some
workloads, multiple SMT transactions on the same core will
finally fit in a shared TMCAM. In fact, we can observe on all
low contention scenarios (even on short transactions) that
SI-HTM behaves very well on SMT threads, scaling up to
32 threads, only showing signs of exhaustion at 40 threads,
when the resources of each core start to be shared by more
than four SMT threads. To the best of our knowledge,
SI-HTM is the first HTM-based algorithm to consistently exhibit
the power of SMT in low contention workloads.
[Figures 9 and 10: for each scenario, a throughput plot
(10^4 Tx/s) for HTM, SI-HTM, P8TM and Silo, and an abort-rate
plot (Aborts (%), split into transactional, non-transactional
and capacity) for HTM and SI-HTM, against the number of
threads (1, 2, 4, 8, 16, 32, 40, 80).]
Figure 9. TPC-C standard mix with low (left) and high (right) contention
Figure 10. TPC-C read dominated mix with low (left) and high (right) contention
4.2 TPC-C benchmark
To evaluate SI-HTM on a real-world application, we use the
TPC-C benchmark, with a standard mix of transactions (i.e.
-s 4 -d 4 -o 4 -p 43 -r 45), and a read-dominated mix
(i.e. -s 4 -d 4 -o 80 -p 4 -r 8). The standard mix is pri-
marily composed of update transactions, where roughly half
of them have large transactional footprints. We also tested
high and low-contention workloads of both mixes. Figures
9 and 10 present the results for TPC-C under such
configurations. We compare SI-HTM not only to HTM but also to
P8TM [17], an HTM-based design for larger capacity
transactions on P8-HTM (discussed in Section 5); and Silo [35], a
software-level optimistic concurrency control for in-memory
databases. For a fair comparison, we disable record indexing
in Silo, and the on-line adaptation of P8TM; this way, our
analysis focuses exclusively on the core concurrency control
performance of all solutions.
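The two mixes above are percentage weights that sum to 100. The sketch below shows how a driver might map a uniform roll onto a transaction type under such a mix. The flag letters are copied from the command lines; reading them as TPC-C's stock-level, delivery, order-status, payment and new-order transactions is an assumption that should be checked against the benchmark's usage text, and the function name is hypothetical.

```python
# Percentages exactly as given on the command lines (flag letter -> weight).
STANDARD_MIX = {"s": 4, "d": 4, "o": 4, "p": 43, "r": 45}
READ_DOMINATED_MIX = {"s": 4, "d": 4, "o": 80, "p": 4, "r": 8}


def pick_transaction(mix, roll):
    """Map a roll in [0, 100) to a flag letter, proportionally to the mix."""
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    if not 0 <= roll < 100:
        raise ValueError("roll must be in [0, 100)")
    upper = 0
    for flag, pct in mix.items():
        upper += pct
        if roll < upper:
            return flag
    raise AssertionError("unreachable: weights sum to 100")
```

In a real driver the roll would come from a per-thread random generator; taking it as a parameter keeps the mapping deterministic and easy to check.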
Overall, SI-HTM is able to improve peak performance
of TPC-C’s standard mix by 48% on 8 threads, relative to
the best alternative (HTM). With 16 threads (SMT-2), the
low contention workload still delivers very good results,
albeit SI-HTM starts to show capacity issues, explained by
TMCAM sharing, and exacerbated on higher thread counts.
The improved resource usage of SI-HTM is especially evident
on the read-dominated mix, where SI-HTM gracefully scales
up to SMT-2 levels, improving 27% in peak throughput over
the best alternative (P8TM) and 300% over base HTM. Still,
the performance degrades on SMT-4 and SMT-8 modes, since
much of a core’s hardware is shared between the multiple
SMT threads of that core.
5 Related Work
Hybrid transactional memory (TM) designs [6, 7, 19, 27] are
the most visible effort that addresses the capacity issue in
HTM-based programs. Hybrid TMs fall back to software-
based TM (STM) implementations when transactions cannot
successfully execute in hardware. In contrast to SI-HTM,
hybrid TMs do not change the capacity limits of the HTM;
rather, they aim at providing scalable STM fall-back paths.
Postponing a transaction’s commit to a moment where the
system state ensures that committing the transaction will
not result in correctness anomalies is not a new concept of SI-
HTM. In another context, a notable example of a solution that
relies on a similar technique is the read-copy-update (RCU)
synchronization mechanism [29]. Like SI-HTM, RCU allows
multiple read-only threads to read directly from shared mem-
ory by having writer threads update a snapshot that is only
committed at a later time when safety is ensured. The RCU
mechanism is entirely done in software and has been
implemented at user and kernel-level [14]. It requires the
programmer to explicitly provide dedicated code in order to
create snapshots of the objects to update and ensure
consistent pointers to the right snapshot. Read-Log-Update (RLU)
extended RCU to allow a simpler programming model and
higher concurrency between readers and writers [28], by
relying on techniques borrowed from the world of STM.
The works that are technically closest to SI-HTM are
HERWL [11], SpRWL [18] and P8TM [17]. These works
exploit the principle of making writers commit only when it is
safe to do so. However, their aim is clearly distinct from
SI-HTM’s, as they all offer strong consistency guarantees to
programs based on read-write locks (HERWL and SpRWL) and
in-memory transactions (P8TM). One direct consequence of
this key distinction is that none of them relieve transactions
from the cost of having the read sets of update transactions
tracked by the HTM (in HERWL and SpRWL) or by costly
software instrumentation of each read (in P8TM). In con-
trast, since SI-HTM aims at weaker consistency guarantees,
SI-HTM is able to completely free update transactions from
read tracking. This fundamental advantage allows SI-HTM
to clearly outperform P8TM for applications that are correct
under SI, as Section 4 shows.
In order to mitigate the worst-case scenarios where the
quiescence phase yields prohibitive latencies, P8TM pro-
poses self-tuning techniques that revert to the baseline HTM
support in unfavourable scenarios. With minor adaptations,
such self-tuning techniques may be incorporated in SI-HTM
to improve its performance in some of the unfavourable
scenarios that Section 4 identified.
An alternative direction to mitigate the constraints of
HTM capacity is by providing the weaker correctness guar-
antee of SI. Two recent efforts have proposed new (or modi-
fied) hardware support to implement SI on HTM. Litz et al.
[24] propose a multi-versioned memory architecture that
implements SI for transactional programs running on par-
allel CPUs while Chen et al. [4] pursue the same goal on
GPUs. They propose a multi-versioned memory subsystem
for transactional programs that run on a GPU, together with
an online method for eliminating the write skew anomaly
on the fly. When compared to SI-HTM, these systems al-
low higher concurrency degrees as they constitute full SI
implementations. Further, the fact that they rely on a multi-
versioned memory system obviates the need for (and the cost
of) SI-HTM’s quiescence. Nevertheless, SI-HTM is ready to
use on commercially available systems, which distinguishes
it from these recent attempts to combine SI and HTM.
In software, Litz et al. [23] implement SI by manipulating
virtual memory mappings and using a copy-on-read
mechanism with a customized page cache. Riegel et al. [33] created
an STM approach to SI by using a lazy multi-version
mechanism. These approaches require instrumenting the read and
write operations of transactions, which encumbers them
relative to our HTM-based implementation. Litz et al. [25]
present a technique to automatically correct SI anomalies. Our
work can also benefit from such a technique being applied
before deploying a new workload that is untested on SI systems.
IMDBs are among the domains where HTM capacity limits
constitute a major obstacle to adopt HTM to replace current
software-based concurrency control schemes [21]. Several
previous proposals have leveraged concurrency control in
IMDBs with HTM. Leis et al. [21] use HTM transactions
to run individual portions of a large transaction, with sub-
stantial code instrumentation. Wang et al. [36] leverage a
software-based optimistic concurrency control mechanism
with an optimized HTM-based commit stage. Wu et al. [37]
adopt a similar HTM-assisted strategy, in this case using
HTM transactions to perform optimized HTM-based pre-
commit validation and writes to individual database records.
All these proposals use HTM as an auxiliary hardware mech-
anism to assist a software-based concurrency control. In con-
trast, SI-HTM relies on HTM transactions – more precisely,
on ROTs – as a first-class construct that runs full individual
transactions, with non-instrumented memory accesses.
6 Conclusions
SI-HTM leverages the HTM features of the IBM POWER ar-
chitectures with a software-based safety wait before commit
to offer a restricted implementation of Snapshot Isolation. As
a main outcome, SI-HTM dramatically stretches the capacity
bounds of the underlying HTM with no hardware modifica-
tions, thus boosting the scalability of P8-HTM and opening
it to a much broader class of applications, like large-footprint
transactions from the IMDB domain.
Our work emphasizes how important it is for commer-
cially available HTM implementations to expose advanced
HTM related mechanisms like ROTs and suspend-resume
to the software layers. SI-HTM shows that such mecha-
nisms can serve as building blocks to sophisticated software-
hardware designs that enrich the baseline features.
As future work, we plan to study advanced mechanisms to
mitigate the idle waiting time that RW transactions spend in
SI-HTM. Among possible approaches, we envision a killing
alternative, where the group of already-completed
transactions decides, based on system-efficient heuristics, to kill
the transactions that are taking too long to complete; and
a batching alternative, where a completed transaction that
predicts a long safety wait uses such idle time to execute
one (or more) subsequent transactions. We could also study
how feasible a software-based SI fallback path would be.
7 Acknowledgements
Our thanks go to Pascal Felber for shepherding our pa-
per, the anonymous reviewers who gave us valuable feed-
back, Brno University of Technology and the University of
Neuchatel for providing us access to their IBM POWER8
machines. This work is partially funded by FCT via projects
UID/CEC/50021/2019 and PTDC/EEISCR/1743/2014.
8 Artifact Appendix
8.1 Abstract
Our artifact includes the algorithm described in the SI-HTM
paper, the benchmarks used and scripts to reproduce the pa-
per’s results. There are no software dependencies to run our
algorithm. The test machine should include an IBM POWER8
processor with at least 10 cores. Plotting scripts were in-
cluded which produce the graphs presented in the paper.
These graphs can be used to validate our results.
8.2 Artifact check-list (meta-information)
• Algorithm: yes, the main file of our algorithm is:
POWER8TM/backends/p8tm-si/tm.h
• Program: hashmap and TPCC, included
• Compilation: GCC 5+
• Transformations: no
• Binary: no
• Data set: none
• Run-time environment: Linux
• Hardware: IBM POWER8 processor
• Run-time state: no other processes sharing the same
cores as SI-HTM
• Execution: specific thread pinning, two hours
• Metrics: Execution time, number of operations, de-
tailed specific abort counters
• Output: graphs with throughput and abort rate
• Experiments: compile the TinySTM back-end and
run the given scripts
• How much disk space required (approximately)?:
10 MBytes
• How much time is needed to prepare workflow
(approximately)?: 20 minutes
• How much time is needed to complete experi-
ments (approximately)?: 2 hours
• Publicly available?: yes
• Code/data licenses (if publicly available)?: MIT
• Workflow frameworks used?: None
• Archived?:
https://doi.org/10.6084/m9.figshare.7378496
8.3 Description
8.3.1 How delivered
https://doi.org/10.6084/m9.figshare.7378496
10 MB
8.3.2 Hardware dependencies
IBM POWER8 processor with 10+ cores
8.3.3 Software dependencies
Gnuplot
8.3.4 Data sets
None
8.4 Installation
tar -xvf artifact.tgz
cd POWER8TM/stms/tinystm; make
8.5 Experiment workflow
cd si-htm
- In each sub-folder (hashmap; tpcc) you will find scripts
to run and plot the results of the respective benchmark
- Create a results folder for each of the benchmarks:
mkdir <benchmark>/results
- Run each benchmark with the absolute path to the
POWER8TM directory included in the artifact and the abso-
lute path to the corresponding results directory:
bash run_<benchmark>.sh <POWER8TM-dir> <results-dir>
This creates a <results-dir>/date sub-directory for that run
- Edit each benchmark’s plot.sh script with the absolute
path to gnuplot and eps driver
- Edit each benchmark’s default-plots.sh script with the
absolute path to your Gnuplot PostScript directory and the
corresponding <results-dir>/date directory
- Plot each benchmark: bash default-plots.sh
8.6 Evaluation and expected result
The results of our artifact should be reproducible with small
variations in single digit percentages. It will output graphs
into the results/plots folder, which can be compared to those
on the paper. The raw data for the plots can be found in the
results/summary folder, which include run time, number of
transactions executed and number of aborts by type.
References
[1] American National Standards Institute. 1992. American national
standard for information systems: database language - SQL: ANSI
X3.135-1992. xiv + 580 pages. Revision and consolidation of ANSI
X3.135-1989 and ANSI X3.168-1989, Approved October 3, 1989.
[2] Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O’Neil,
and Patrick O’Neil. 1995. A Critique of ANSI SQL Isolation Levels.
In Proceedings of the 1995 ACM SIGMOD International Conference on
Management of Data (SIGMOD ’95). ACM, New York, NY, USA, 1–10.
[3] Andrea Cerone and Alexey Gotsman. 2016. Analysing Snapshot Iso-
lation. In Proceedings of the 2016 ACM Symposium on Principles of
Distributed Computing (PODC ’16). ACM, New York, NY, USA, 55–64.
[4] Sui Chen, Lu Peng, and Samuel Irving. 2017. Accelerating GPU Hard-
ware Transactional Memory with Snapshot Isolation. In Proceedings
of the 44th Annual International Symposium on Computer Architecture
(ISCA ’17). ACM, New York, NY, USA, 282–294.
[5] TPC Council. 2010. Transaction Processing Performance Council, TPC
BENCHMARK™ C. Revision 5.11. February 2010.
[6] Luke Dalessandro, François Carouge, Sean White, Yossi Lev, Mark
Moir, Michael L. Scott, and Michael F. Spear. 2011. Hybrid NOrec: A
Case Study in the Effectiveness of Best Effort Hardware Transactional
Memory. In Proceedings of the Sixteenth International Conference on Ar-
chitectural Support for Programming Languages and Operating Systems
(ASPLOS XVI). ACM, New York, NY, USA, 39–52.
[7] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark
Moir, and Daniel Nussbaum. 2006. Hybrid Transactional Memory. In
Proceedings of the 12th International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS XII). ACM,
New York, NY, USA, 336–346.
[8] Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin
Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Heka-
ton: SQL Server’s Memory-optimized OLTP Engine. In Proceedings
of the 2013 ACM SIGMOD International Conference on Management of
Data (SIGMOD ’13). ACM, New York, NY, USA, 1243–1254.
[9] Nuno Diegues, Paolo Romano, and Luís Rodrigues. 2014. Virtues
and Limitations of Commodity Hardware Transactional Memory. In
Proceedings of the 23rd International Conference on Parallel Architectures
and Compilation (PACT ’14). ACM, New York, NY, USA, 3–14.
[10] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, Patrick O’Neil,
and Dennis Shasha. 2005. Making Snapshot Isolation Serializable. ACM
Trans. Database Syst. 30, 2 (June 2005), 492–528.
[11] Pascal Felber, Shady Issa, Alexander Matveev, and Paolo Romano. 2016.
Hardware Read-write Lock Elision. In Proceedings of the 11th European
Conference on Computer Systems (EuroSys ’16). ACM, New York, NY,
USA, Article 34, 15 pages.
[12] Bhavishya Goel, Rubén Titos-Gil, Anurag Negi, Sally McKee, and Per
Stenstrom. 2014. Performance and Energy Analysis of the Restricted
Transactional Memory Implementation on Haswell. Proceedings of
the International Parallel and Distributed Processing Symposium, IPDPS,
615–624.
[13] Rachid Guerraoui and Michal Kapalka. 2008. On the Correctness
of Transactional Memory. In Proceedings of the 13th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming (PPoPP
’08). ACM, New York, NY, USA, 175–184.
[14] D. Guniguntala, P. E. McKenney, J. Triplett, and J. Walpole. 2008. The
read-copy-update mechanism for supporting real-time applications
on shared-memory multiprocessor systems with Linux. IBM Systems
Journal 47, 2 (2008), 221–236.
[15] IBM. 2016. POWER8 Processor User’s Manual for the Single-Chip Module
(version 1.3).
[16] IBM. 2018. POWER9 Processor User’s Manual (version 2.0).
[17] Shady Issa, Pascal Felber, Alexander Matveev, and Paolo Romano. 2017.
Extending Hardware Transactional Memory Capacity via Rollback-
Only Transactions and Suspend/Resume. In LIPIcs 31st International
Symposium on Distributed Computing (DISC ’17), Vol. 91. Schloss
Dagstuhl–Leibniz-Zentrum fuer Informatik, Germany, 28:1–28:16.
[18] Shady Issa, Paolo Romano, and Tiago Lopes. 2018. Speculative Read
Write Locks. In Proceedings of the 19th International Middleware Con-
ference (Middleware ’18). ACM, New York, NY, USA, 214–226.
[19] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu,
and Anthony Nguyen. 2006. Hybrid Transactional Memory. In Pro-
ceedings of the Eleventh ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming (PPoPP ’06). ACM, New York, NY,
USA, 209–220.
[20] J. Lee, Y. S. Kwon, F. Färber, M. Muehle, C. Lee, C. Bensberg, J. Y. Lee,
A. H. Lee, and W. Lehner. 2013. SAP HANA distributed in-memory
database system: Transaction, session, and metadata management. In
2013 IEEE 29th International Conference on Data Engineering (ICDE).
1165–1173.
[21] V. Leis, A. Kemper, and T. Neumann. 2014. Exploiting hardware trans-
actional memory in main-memory databases. (March 2014), 580–591.
[22] Jan Lindström, Vilho Raatikka, Jarmo Ruuth, Petri Soini, and Katri-
ina Vakkila. 2013. IBM solidDB: In-Memory Database Optimized for
Extreme Speed and Availability. IEEE Data Eng. Bull. 36, 2 (2013),
14–20.
[23] Heiner Litz, Benjamin Braun, and David Cheriton. 2016. EXCITE-
VM: Extending the virtual memory system to support snapshot isola-
tion transactions. In Parallel Architecture and Compilation Techniques
(PACT), 2016 International Conference on. IEEE, 401–412.
[24] Heiner Litz, David Cheriton, Amin Firoozshahian, Omid Azizi, and
John P. Stevenson. 2014. SI-TM: Reducing Transactional Memory
Abort Rates Through Snapshot Isolation. In Proceedings of the 19th
International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS ’14). ACM, New York, NY,
USA, 383–398.
[25] Heiner Litz, Ricardo J. Dias, and David R. Cheriton. 2015. Efficient correction
of anomalies in snapshot isolation transactions. ACM Transactions
on Architecture and Code Optimization (TACO) 11, 4 (2015), 65.
[26] Milo Martin, Colin Blundell, and E. Lewis. 2006. Subtleties of Transac-
tional Memory Atomicity Semantics. IEEE Comput. Archit. Lett. 5, 2
(July 2006), 17–17.
[27] Alexander Matveev and Nir Shavit. 2015. Reduced Hardware NOrec:
A Safe and Scalable Hybrid Transactional Memory. In Proceedings
of the Twentieth International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS ’15). ACM,
New York, NY, USA, 59–71. https://doi.org/10.1145/2694344.2694393
[28] Alexander Matveev, Nir Shavit, Pascal Felber, and Patrick Marlier.
2015. Read-log-update: A Lightweight Synchronization Mechanism
for Concurrent Programming. In Proceedings of the 25th Symposium
on Operating Systems Principles (SOSP ’15). ACM, New York, NY, USA,
168–183.
[29] Paul E. McKenney and John D. Slingwine. 1998. Read-Copy Update:
Using Execution History to Solve Concurrency Problems. In Parallel
and Distributed Computing and Systems. Las Vegas, NV, 509–518.
[30] Takuya Nakaike, Rei Odaira, Matthew Gaudet, Maged M. Michael,
and Hisanobu Tomari. 2015. Quantitative Comparison of Hardware
Transactional Memory for Blue Gene/Q, zEnterprise EC12, Intel Core,
and POWER8. In Proceedings of the 42nd Annual International Symposium
on Computer Architecture (ISCA ’15). ACM, New York, NY, USA,
144–157.
[31] Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. 2015. Fast
Serializable Multi-Version Concurrency Control for Main-Memory
Database Systems. In Proceedings of the 2015 ACM SIGMOD Interna-
tional Conference on Management of Data (SIGMOD ’15). ACM, New
York, NY, USA, 677–689.
[32] Dan R. K. Ports and Kevin Grittner. 2012. Serializable Snapshot Isola-
tion in PostgreSQL. Proc. VLDB Endow. 5, 12 (Aug. 2012), 1850–1861.
[33] Torvald Riegel, Christof Fetzer, and Pascal Felber. 2006. Snapshot
isolation for software transactional memory. In First ACM SIGPLAN
Workshop on Languages, Compilers, and Hardware Support for Transac-
tional Computing (TRANSACT’06). 1–10.
[34] Tudor-Ioan Salomie, Ionut Emanuel Subasu, Jana Giceva, and Gustavo
Alonso. 2011. Database Engines on Multicores, Why Parallelize when
You Can Distribute?. In Proceedings of the Sixth Conference on Computer
Systems (EuroSys ’11). ACM, New York, NY, USA, 17–30.
[35] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and
Samuel Madden. 2013. Speedy Transactions in Multicore In-memory
Databases. In Proceedings of the Twenty-Fourth ACM Symposium on
Operating Systems Principles (SOSP ’13). ACM, New York, NY, USA,
18–32.
[36] Zhaoguo Wang, Hao Qian, Jinyang Li, and Haibo Chen. 2014. Using restricted
transactional memory to build a scalable in-memory database.
In Proceedings of the Ninth European Conference on Computer Systems.
ACM, 26.
[37] Yingjun Wu and Kian-Lee Tan. 2016. Scalable In-Memory Transaction
Processing with HTM. In 2016 USENIX Annual Technical Conference
(USENIX ATC 16). USENIX Association, Denver, CO, 365–377.
[38] Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska. 2017.
The End of a Myth: Distributed Transactions Can Scale. Proc. VLDB
Endow. 10, 6 (Feb. 2017), 685–696.
[39] H. Zhang, G. Chen, B. C. Ooi, K. Tan, and M. Zhang. 2015. In-Memory
Big Data Management and Processing: A Survey. IEEE Transactions
on Knowledge and Data Engineering 27, 7 (July 2015), 1920–1948.
