Crafty: Efficient, HTM-Compatible Persistent Transactions by Genç, Kaan et al.
arXiv:2004.00262v3 [cs.PL] 21 Apr 2020
Crafty: Efficient, HTM-Compatible
Persistent Transactions
Kaan Genç
Ohio State University (USA)
genc.5@osu.edu
Michael D. Bond
Ohio State University (USA)
mikebond@cse.ohio-state.edu
Guoqing Harry Xu
UCLA (USA)
harryxu@cs.ucla.edu
This extended arXiv version of our PLDI 2020 paper adds an appendix with additional results
Abstract
Byte-addressable persistent memory, such as Intel/Micron
3D XPoint, is an emerging technology that bridges the gap
between volatile memory and persistent storage. Data in
persistent memory survives crashes and restarts; however,
it is challenging to ensure that this data is consistent after
failures. Existing approaches incur significant performance
costs to ensure crash consistency.
This paper introduces Crafty, a new approach for ensur-
ing consistency and atomicity on persistent memory oper-
ations using commodity hardware with existing hardware
transactional memory (HTM) capabilities, while incurring
low overhead. Crafty employs a novel technique called non-
destructive undo logging that leverages commodity HTM to
control persist ordering. Our evaluation shows that Crafty
outperforms state-of-the-art prior work under low contention,
and performs competitively under high contention.
CCS Concepts: • Information systems → Storage class
memory; • Software and its engineering → Concurrent
programming structures.
Keywords: persistent transactions, transactional memory
1 Introduction
Non-volatile memory (NVM) technologies, such as phase
change memory (PCM) [37, 55, 60], resistive random-access
memory (RRAM) [54], spin-transfer torque memory (STT-
MRAM) [34], and 3D XPoint [25], are designed to combine
DRAM’s byte-addressability and storage’s durability: A pro-
gram’s updates to data structures residing in persistent mem-
ory can persist across failures such as a program crash or
power interruption. As a result, NVM has the potential to
fundamentally change the dichotomy between DRAM and
durable storage in many important domains such as storage
systems [23, 35, 57, 58, 62], databases [1, 2, 24, 63], and big
data analytics [52].
State of the art. As with any storage system, the first chal-
lenge in effectively using NVM is to provide crash consis-
tency [43, 51], which allows a program to correctly recover
from persistent data upon a failure. Crash consistency is often
achieved by leveraging transactional support in a high-level
programming model. The developer specifies persistent
transactions, in which updates to persistent memory appear
to be one atomic unit—upon a program crash, either all or
none of these updates are committed, ensuring that impor-
tant data structures are always left in a consistent state.
However, prior work’s mechanisms for persistent transactions
have two main drawbacks. First, all of the mechanisms (undo
logging [9, 31], redo logging [51], and copy-on-write [7, 14, 38,
42, 56]) incur performance costs such as persist latency on each
write, lookups at program reads, maintenance of shadow memory,
and poor multithreaded scalability.
Second, while commodity hardware transactional mem-
ory (HTM) such as Intel’s transactional synchronization ex-
tensions (TSX) [20, 21, 59] is an appealing mechanism for
supplementing persistent transactions to achieve full ACID
transactions, persistent transaction mechanisms are incompatible
with commodity HTM because of the following dilemma:
To ensure correct recovery, log entries must be persisted before
a transaction commits, yet the nature of transactions dictates
that executing transactions cannot perform persist operations.
Although some recent work shows how to make hardware
transactions persistent [3, 4, 7, 14, 15, 38, 53], it has ma-
jor drawbacks such as requiring log lookups at reads, using
shadow memory, incurring scalability bottlenecks, or rely-
ing on nontrivial hardware changes (Section 2).
Contributions. This paper addresses both major limitations
of prior work by leveraging commodity HTM to control persist
ordering. We introduce a new kind of persistent transaction
mechanism, nondestructive undo logging, that exploits
commodity HTM to populate and persist undo logs before
making persistent writes visible. Key to nondestructive undo
logging—which runs a persistent transaction’s code in a hard-
ware transaction and logs persistent writes in an undo log—
is that the hardware transaction rolls back its persistent writes
prior to committing, effectively creating its undo log entries
without performing actual persistent writes. This behavior
breaks HTM’s persist–commit dependence cycle mentioned
above, by decoupling the undo log updates from the persis-
tent writes. After committing its hardware transaction that
computes the undo log entries, a persistent transaction can
perform its persistent writes—albeit in a way that is con-
sistent with the persisted undo log entries and with other
threads’ persistent transactions.
We apply nondestructive undo logging in introducing Crafty,
a novel and general approach for correct and efficient per-
sistent transactions using unmodified commodity HTM. For
each persistent transaction, Crafty first uses nondestructive
undo logging to compute and persist undo log entries. It
then performs the transaction’s writes—by performing the
logged writes directly if contention is low, or by repeating
the transaction’s execution while validating its consistency
with the persisted undo log entry if contention is high. Crafty
can operate in a thread-unsafe mode that provides only fail-
ure atomicity (relying on some other mechanisms such as
locks for thread atomicity), or it can operate in a thread-safe
mode that provides both failure and thread atomicity (i.e.,
full ACID transactions).
We implemented Crafty by extending the publicly available
implementation of NV-HTM [7], which also implements
DudeTM [38]; both approaches support persistent transactions
with HTM using shadow-memory-based copy-on-write
mechanisms. Our evaluation uses several programs with varying
levels of thread contention: persistent transaction
microbenchmarks and transactional benchmarks. Our results
demonstrate that Crafty outperforms the two state-of-the-
art HTM-based persistent transaction implementations NV-
HTM and DudeTM, especially under low thread contention.
Furthermore, Crafty usually adds low run-time overhead
over non-durable transactions, and its overhead is largely
thread local and thus scales well with additional threads.
These results suggest that nondestructive undo logging
and Crafty are promising approaches for providing efficient,
HTM-compatible persistent transactions.
2 Background and Motivation
This section covers background on persistent memory pro-
gramming models and motivates the need for better mech-
anisms for persistent transactions.
2.1 Persistent Memory Programming Model
The key challenge of supporting persistent memory is en-
suring that if a failure occurs, a recovery observer can restore
persistent memory to a state that is usable by the restarted
program. This property is often provided through failure
atomicity: in the event of a crash or power failure, persis-
tent memory state can be restored so that each persistent
transaction appears to have executed fully or not executed
at all [5, 7–9, 17, 22, 26, 38, 40].
In addition, a multithreaded program generally needs thread
atomicity—persistent transactions execute atomically with
respect to other threads—and state reconstructed by the re-
covery observer should be consistent with the commit order
of persistent transactions. A program with persistent trans-
actions can provide thread atomicity by using locks [8, 17,
22], or by using transactional memory [7, 9, 14, 38, 51]—in
which case the transactions have full ACID properties.
Requirements. An implementation of persistent transactions
must ensure that, after a crash, the recovery observer can re-
store the program’s persistent state so that it corresponds to
a serialization of persistent transactions consistent with the
program’s multithreaded execution. For example, if trans-
action A happened before transaction B, the recovered state
must correspond to one of the following three execution sce-
narios: (1) B executed after A, (2) only A executed, or (3) no
transaction executed at all.
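The requirement above can be phrased as a small check. The sketch below is only an illustration: the function name `is_valid_recovered_state` is hypothetical, and it assumes the simplest case, in which all transactions are totally ordered by happens-before, so the acceptable recovered states are exactly the prefixes of that order.

```python
def is_valid_recovered_state(commit_order, recovered):
    """Return True if `recovered` is a prefix of `commit_order`.

    Assumption for illustration: transactions are totally ordered by
    happens-before, so the allowed recovered states are exactly the prefixes.
    """
    recovered = list(recovered)
    return list(commit_order[:len(recovered)]) == recovered


order = ["A", "B"]                      # A happened before B
ok_states = [[], ["A"], ["A", "B"]]     # scenarios (3), (2), and (1)
bad_state = ["B"]                       # B without A is not a valid recovery
```

With concurrent, unordered transactions the set of acceptable states is larger than the prefixes of a single list; the sketch does not cover that case.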
Furthermore, recovered state should correspond to a point
in time that is not too “far back” from the crash time. Other-
wise, the amount of work that needs to be re-executed may
be too large to be practical.
2.2 Persistent Transaction Mechanisms
Upon a crash or power failure, the recovery observer must
reconstruct a state in which transactions appear to have exe-
cuted fully or not at all. This challenge is exacerbated by the
fact that stores do not reach persistent memory in their issu-
ing order. This is because processor caches effectively buffer
writes until eviction or explicit write-back of the dirty line
to persistent memory.
To ensure that stores reach persistent memory in order,
one can use persist operations. A persist operation consists
of one or more flush operations that write back specified
cache lines to persistent memory, followed by a drain oper-
ation that waits until the flush operations have completed.
On x86, flush can be implemented with the CLWB (cache
line write-back) instruction, and drain can be implemented
with the SFENCE (store fence) instruction [48]. A persist
operation is expensive because it incurs the roundtrip write
latency of NVM, which is expected to be several hundreds
of nanoseconds [37, 49]. Even if the NVM controller buffers
persistent stores and includes the buffer as part of the persis-
tence domain [48], persist latency (i.e., the time for roundtrip
communication with the NVM controller) is still significant.
If a commodity approach can be developed that amortizes
persist latency effectively across many persistent writes,
one can make a case for removing the buffer from the per-
sistence domain, simplifying future hardware designs.
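The interplay of stores, flush, and drain can be illustrated with a toy model. The sketch below is not hardware code: `ToyPersistentMemory` and its methods are hypothetical names, and the model pessimistically assumes that a store reaches persistent memory only through an explicit flush followed by a drain, whereas a real cache may also evict a dirty line at any time.

```python
class ToyPersistentMemory:
    """Toy model: a volatile cache in front of persistent memory.

    Assumption for illustration: a line is written back only by an explicit
    flush followed by a drain (real caches may also evict lines at any time).
    """

    def __init__(self):
        self.cache = {}        # volatile: lost on a crash
        self.pending = {}      # flushed but not yet drained write-backs
        self.persistent = {}   # survives a crash

    def store(self, addr, value):
        self.cache[addr] = value

    def load(self, addr):
        return self.cache.get(addr, self.persistent.get(addr, 0))

    def flush(self, addr):
        # Like CLWB: start writing the line back, but do not wait.
        if addr in self.cache:
            self.pending[addr] = self.cache[addr]

    def drain(self):
        # Like SFENCE: wait until all earlier flushes have completed.
        self.persistent.update(self.pending)
        self.pending.clear()

    def persist(self, *addrs):
        # A persist operation: one or more flushes followed by a drain.
        for addr in addrs:
            self.flush(addr)
        self.drain()

    def crash(self):
        # Everything volatile disappears; only persistent state remains.
        self.cache.clear()
        self.pending.clear()


mem = ToyPersistentMemory()
mem.store("p", 1)
mem.persist("p")      # p is now durable
mem.store("r", 2)
mem.flush("r")        # flushed but never drained
mem.crash()
recovered = (mem.load("p"), mem.load("r"))
```

In this model the store to `r` is lost because the crash arrives before the drain, which is the ordering hazard that persist operations exist to rule out.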
Persistent transactions generally use one of the follow-
ing three mechanisms to provide crash consistency: undo
logging [9, 31], redo logging [51], and copy-on-write [7, 38,
42, 56]. Marathe et al. compared these mechanisms quan-
titatively and found that no mechanism is a clear winner
in all situations (e.g., across thread counts or transaction
sizes) [40]. Here we describe each mechanism and its draw-
backs. We use Figure 1(a) as a simple example persistent
transaction.
(a) A persistent transaction:
  failure_atomic {
    *p = 1;
    ... = *q;
    *r = 2;
  }
(b) Undo logging applied to (a):
  undoLog.append(p, *p);
  flush(last log entry);
  drain;
  *p = 1;
  ... = *q;
  undoLog.append(r, *r);
  flush(last log entry);
  drain;
  *r = 2;
  undoLog.append(COMMITTED);
(c) Redo logging applied to (a):
  redoLog.put(p, 1);
  ... = redoLog.lookup(q);
  redoLog.put(r, 2);
  redoLog.append(COMMITTED);
  foreach entry in redoLog
    flush(entry);
  drain;
  foreach <ptr, val> in redoLog
    *ptr = val;
(d) Copy-on-write applied to (a):
  *p = 1; // writes shadow mem
  redoLog.append(p, 1);
  ... = *q; // reads shadow mem
  *r = 2; // writes shadow mem
  redoLog.append(r, 2);
  redoLog.append(COMMITTED);
  foreach entry in redoLog
    flush(entry);
  drain;
  foreach <ptr, val> in redoLog
    // Write to persistent addr
    *getPersistAddr(ptr) = val;
Figure 1. To provide failure atomicity for the persistent transaction in (a), a system uses one of the following crash-
consistency mechanisms: (b) undo logging, (c) redo logging, or (d) copy-on-write. Initial values for all locations are 0.
Undo logging. In undo logging, a persistent transaction logs
the old value of a persistent location in a persistent undo log
before the location is updated by a memory store. Undo log-
ging enables fast read accesses: Since each store performs
an in-place memory update, any memory load can directly
read the latest value from persistent memory without being
remapped to a different address. However, to ensure correct
rollback after a crash, the implementation must persist (i.e.,
flush and drain) each update to the undo log before writing
to the corresponding persistent memory location, incurring
a high write latency for each NVM write.
Figure 1(b) shows how undo logging works for the persis-
tent transaction in Figure 1(a). To signal the end of a persis-
tent transaction’s log entries, the implementation appends
a COMMITTED entry to the undo log. A multithreaded
implementation can include a timestamp (not shown in the
figure) with the COMMITTED entry to enable the recovery
observer to reconstruct a state corresponding to some globally
consistent point in time.
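A minimal single-threaded sketch of undo logging and its recovery step follows. It is an illustration under stated assumptions, not any system's implementation: a Python dictionary stands in for persistent memory, the names `run_undo_logged`, `recover`, and the `crash_after` test knob are hypothetical, and the per-entry flush and drain are indicated only by a comment.

```python
COMMITTED = "COMMITTED"


def run_undo_logged(memory, undo_log, writes, crash_after=None):
    """Apply `writes` in place, logging each old value first.

    `crash_after` (hypothetical test knob) stops after that many writes
    to imitate a crash in the middle of the transaction.
    """
    for i, (addr, new_value) in enumerate(writes):
        if crash_after is not None and i == crash_after:
            return  # crash: no COMMITTED entry was appended
        undo_log.append((addr, memory[addr]))  # persist (flush + drain) here
        memory[addr] = new_value               # in-place update
    undo_log.append(COMMITTED)


def recover(memory, undo_log):
    """Roll back the writes of a trailing transaction that never committed."""
    if COMMITTED in undo_log:
        last = len(undo_log) - 1 - undo_log[::-1].index(COMMITTED)
        uncommitted = undo_log[last + 1:]
    else:
        uncommitted = undo_log[:]
    for addr, old_value in reversed(uncommitted):
        memory[addr] = old_value


memory = {"p": 0, "q": 0, "r": 0}
log = []
run_undo_logged(memory, log, [("p", 1), ("r", 2)], crash_after=1)
state_at_crash = dict(memory)   # p was already updated in place
recover(memory, log)            # the logged old value of p is restored
```

The sketch makes the cost visible: every write is preceded by its own log append, and in a real system each append must be persisted before the write proceeds.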
Redo logging. In contrast, instead of performing in-place
updates to persistent memory, redo logging buffers all per-
sistent writes and performs them together at the end of the
transaction. By buffering writes, redo logging pays the cost
of persist ordering once only at the end of each transaction,
effectively amortizing the latency across all of the writes.
However, it adds an overhead for each persistent read be-
cause the read needs to find the latest value in a set of buffered
writes. Since reads often significantly outnumber writes, redo
logging can also incur significant overhead.
Figure 1(c) illustrates how redo logging works. Writes and
reads to persistent memory are replaced with updates and
lookups, respectively, to a map-based log so that reads of
persistent memory correctly read from any preceding writes.
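The map-based log can be sketched as follows. This is an illustrative model only: the class `RedoLogTransaction` is a hypothetical name, a dictionary stands in for persistent memory, and the single flush-and-drain at commit is indicated by a comment rather than performed.

```python
class RedoLogTransaction:
    def __init__(self, memory):
        self.memory = memory   # stands in for persistent memory
        self.redo_log = {}     # map-based log: addr -> latest buffered value

    def write(self, addr, value):
        self.redo_log[addr] = value          # buffered, not yet in memory

    def read(self, addr):
        # Every read pays for a log lookup before falling back to memory.
        if addr in self.redo_log:
            return self.redo_log[addr]
        return self.memory[addr]

    def commit(self):
        # A real system would flush the log entries and drain once here,
        # append COMMITTED, and only then apply the writes in place.
        for addr, value in self.redo_log.items():
            self.memory[addr] = value
        self.redo_log.clear()


memory = {"p": 0, "q": 5, "r": 0}
tx = RedoLogTransaction(memory)
tx.write("p", 1)
seen_p = tx.read("p")          # served from the log
seen_q = tx.read("q")          # falls through to memory
before_commit = dict(memory)   # memory is untouched until commit
tx.commit()
```

The lookup in `read` is the per-read overhead that the text attributes to redo logging.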
Copy-on-write. Recent work proposes copy-on-write mechanisms
that maintain a volatile shadow for each persistent
page to be modified [7, 38, 42, 56]. We focus on copy-on-
write mechanisms that use shadow paging because it allows
efficient in-place writes. Other copy-on-write mechanisms
use indirection to copy an object upon the first write in a
transaction, incurring costs similar to redo logging [14, 40].
Persistent transactions perform reads and writes normally,
since virtual addresses are mapped to physical volatile shadow
memory addresses. At the end of the transaction, changes
to each shadow page are persisted to its corresponding non-
volatile page. Figure 1(d) shows how this mechanism works.
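A rough sketch of the shadow-memory idea, under simplifying assumptions: the class `ShadowTransaction` is hypothetical, it shadows the whole (tiny) memory rather than individual pages, and each transaction gets its own shadow copy, whereas a real system shares one shadow among all transactions.

```python
class ShadowTransaction:
    def __init__(self, persistent):
        self.persistent = persistent        # stands in for NVM
        self.shadow = dict(persistent)      # volatile copy the program uses
        self.redo_log = []

    def write(self, addr, value):
        self.shadow[addr] = value           # in-place write to shadow memory
        self.redo_log.append((addr, value))

    def read(self, addr):
        return self.shadow[addr]            # plain read, no log lookup

    def commit(self):
        # A real system would persist the redo log (flush + drain) first.
        for addr, value in self.redo_log:
            self.persistent[addr] = value   # copy to the persistent address
        self.redo_log.clear()


nvm = {"p": 0, "q": 0, "r": 0}
tx = ShadowTransaction(nvm)
tx.write("p", 1)
value_read = tx.read("p")        # reads the shadow copy directly
nvm_before_commit = dict(nvm)    # NVM still holds the old value
tx.commit()
```

The duplicated dictionary is the space overhead of shadowing in miniature.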
Although copy-on-write techniques enjoy the performance
benefits of undo and redo logging, and can be made compatible
with commodity HTM as described shortly, shadowing
the entire NVM is expensive and impractical. Most significantly,
copy-on-write mechanisms must ensure consistency
between the updates to volatile and non-volatile pages, lead-
ing to scalability bottlenecks, as detailed below.
2.3 Transactional Memory
A natural way to implement persistent transactions that provide
full ACID properties is to leverage transactional memory
[20, 21]. Much of the existing work on persistent transactions
extends software transactional memory (STM) [19],
which incurs a high overhead in detecting and resolving
conflicts between concurrent transactions.
Hardware transactional memory (HTM), which detects
and resolves conflicts at the hardware level, is an appeal-
ing technique for implementing efficient persistent transac-
tions. However, commodity HTM implementations includ-
ing Intel’s restricted TM (RTM) [59] are fundamentally in-
compatible with persistency. Because log updates must oc-
cur before memory updates, there is an obvious dilemma:
On the one hand, undo or redo log entries must be per-
sisted before the hardware transaction commits the actual
memory updates (to ensure crash consistency), while on the
other hand, the nature of the transaction dictates that these
log entries cannot be persisted before the transaction commits—
otherwise they cannot be revoked upon an abort. The updates
to the log entries and the actual memory updates depend
on each other, forming a dependence cycle that seemingly
thwarts the use of HTM for persistent transactions.
Recent approaches use commodity HTM for persistent
transactions, by decoupling persistence from HTM’s concurrency
control. DudeTM [38] and NV-HTM [7] show how copy-on-write
mechanisms can support HTM-based persistent transactions.
Hardware transactions perform in-place reads and
writes to shadow memory. After a transaction commits, redo
log entries can be persisted before copying the transaction’s
writes to persistent memory. In addition, by writing redo
log entries and program writes to persistent memory asyn-
chronously, writes to the same persistent locations can be
combined.
The drawback, though, is that DudeTM and NV-HTM have
significant disadvantages in maintaining shadow state and
keeping updates to persistent memory consistent with the
order of transactions writing to the volatile shadow state.
First, these approaches add space overhead by maintaining
two copies of program state, as discussed above. Second,
they must ensure consistency between the order of the writes
to DRAM inside a transaction and that to NVM at the end
of the transaction. DudeTM computes timestamps by incrementing
a global variable inside hardware transactions; on commodity
HTM, concurrent transactions that all write the same variable
conflict with one another, making DudeTM effectively
incompatible with commodity HTM [38].
NV-HTM, on the other hand, works with unmodified commodity
HTM, but it has two major scalability bottlenecks
that limit performance at higher thread counts [7]. First,
each persistent transaction cannot complete until every other
ongoing transaction completes. In particular, each transac-
tion cannot write a COMMIT entry to its redo log until it
ensures that no ongoing transaction may still write a COM-
MIT entry for an earlier transaction, since redo logs are used
by the recovery observer to roll the persistent state forward
after a crash. Waiting ensures that if the recovery observer
sees a COMMIT entry for a transaction, it sees COMMIT en-
tries for all earlier transactions. Of course, this incurs over-
head.
Second, threads that persist logs and program writes to
persistent memory must do it in a serialized manner. In NV-
HTM, an asynchronous background thread applies transac-
tions’ writes (based on their redo log entries) to persistent
memory locations in timestamp order. This serialization of
writes to persistent memory is inherent in the fact that trans-
actions record a timestamp (for efficiency), from which only
a global transaction order can be inferred.
The DudeTM paper surmises that decoupling persistence
from HTM may be “the best (and possibly the only) way
to avoid the drawbacks of both undo and redo logging and
reduce the performance penalty” [38]. Our work seeks to
counter that supposition and overcome the performance dis-
advantages of existing persistent transaction mechanisms.
3 Crafty Overview
As Section 2.3 explained, the main obstacle that precludes
efficient use of commodity HTM in implementing persistent
transactions is the dependence cycle that results from the
tight coupling of log entry updates and program memory
updates: If a hardware transaction contains a mix of these
two types of updates, it can neither commit before persist-
ing, nor persist before committing.
To address this problem, we introduce a new persistent
transaction design called Crafty that leverages a new logging
mechanism called nondestructive undo logging. Key to Crafty’s
success is breaking the persist–commit dependence cycle by
executing the log entry updates and the program memory
updates in separate hardware transactions, effectively decou-
pling these two types of updates. In nondestructive undo
logging, a hardware transaction performs a Log phase that
executes a persistent transaction in a way that updates only
undo log entries, not the program’s persistent data. These
log entries are persisted after the hardware transaction com-
mits. Next, Crafty executes the program writes using an-
other hardware transaction. These writes are performed in
a way that is consistent with the updates of the log entries
and also with other threads’ executed transactions.
Challenges and insights. Achieving a correct and efficient
design presents three major challenges. The first challenge
is how to make the Log phase only update undo or redo
log entries without modifying program memory locations.
To overcome this challenge, Crafty uses undo logging when
executing the Log phase: Before each write to a persistent
memory location, the old value in the location is recorded
in a thread-local undo log. At the end of the transaction,
Crafty rolls back all of these writes by applying the entries
of the undo log in reverse order, effectively setting the modified
values back to their original values before the transaction
executed. Furthermore, during this rollback process,
when both the old and new values are visible, the hardware
transaction builds a redo log for these locations. After the
Log phase commits, all of the undo log entries are persisted
into persistent memory. Figure 2 shows how the Log phase
uses nondestructive undo logging to construct an undo log
for the persistent transaction from Figure 1.
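As an illustration only, the Log phase’s record-then-roll-back step can be sketched in Python. The function name `log_phase` and the dictionary standing in for persistent memory are assumptions of this sketch, not part of Crafty’s implementation, and a plain function call stands in for the hardware transaction.

```python
def log_phase(memory, writes):
    """Run the transaction's writes, then roll them back.

    Returns (undo_log, redo_log). In Crafty this whole function body would
    run inside one hardware transaction; here a plain function stands in.
    """
    undo_log = []
    for addr, new_value in writes:
        undo_log.append((addr, memory[addr]))   # record the old value
        memory[addr] = new_value                # perform the write
    redo_log = []
    for addr, old_value in reversed(undo_log):
        redo_log.append((addr, memory[addr]))   # new value, still visible
        memory[addr] = old_value                # undo the write
    return undo_log, redo_log


memory = {"p": 0, "q": 0, "r": 0}
undo_log, redo_log = log_phase(memory, [("p", 1), ("r", 2)])
```

After the call, `memory` holds its original values, `undo_log` holds the old values in program order, and `redo_log` holds the new values in reverse program order, so the undo log can be persisted before any program write becomes visible.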
The second challenge is how to execute the program’s memory
updates in the same order as the updates to log entries.
To do this, Crafty starts the second phase, which updates
program memory locations. In theory, all we need is a Redo
phase that applies the redo log constructed at the end of
the Log phase. This naïve approach would work if persistent
transactions were protected by a pessimistic technique
such as locking, because transactions executed by different
threads would not conflict with each other. However, if persistent
transactions can conflict, then a thread’s Log and Redo
phases, which execute in two separate hardware transactions,
may not execute together atomically. This can potentially
HTM_BEGIN;
  undoLog.append(p, *p);
  *p = 1;
  ... = *q;
  undoLog.append(r, *r);
  *r = 2;
  foreach entry <ptr, oldVal> in undoLog in reverse
    redoLog.append(ptr, *ptr);
    *ptr = oldVal; // undo each write
HTM_END;
foreach entry in undoLog
  flush(entry);
drain; // persist the undo log entries
/* ... Transaction’s writes can now be performed here ... */
undoLog.append(COMMITTED);
Figure 2. How Crafty’s crash-consistency mechanism, nondestructive
undo logging, provides failure atomicity for the
persistent transaction in Figure 1(a).
lead to inconsistencies between the log entries and the contents
in their corresponding memory locations. To solve this
problem, Crafty lets the Redo phase check a conservative
conflict constraint based on timestamps. Failure of this check
is a necessary but insufficient condition for a transaction conflict.
To guarantee safety, Crafty aborts the HTM transaction
that executes this Redo phase.
The third challenge is what to do if and when Redo aborts.
Due to the conservative nature of our conflict constraint,
a Redo abort does not necessarily indicate a real conflict.
Hence, if and only if Redo aborts, Crafty executes a Vali-
date phase, which re-executes the persistent transaction to
check the validity of the undo log entries that were per-
sisted in the Log phase. If all of the undo log entries are
still valid, the transaction succeeds, allowing the memory
writes to be committed and visible to other threads. Any mismatch
between the values in a log entry and its corresponding
memory location makes Validate abort, indicating that
another thread has committed new, conflicting writes after
the current thread’s Log phase finished. The aborted thread
handles this case by starting over—by re-executing the Log
phase and constructing new undo and redo logs.
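The Validate phase’s check can be sketched as follows. This is a simplified model, not Crafty’s code: `validate_phase`, `ValidationFailed`, and the `transaction(read, write)` callback style are hypothetical, a dictionary stands in for persistent memory, and a private write buffer imitates the all-or-nothing commit of a hardware transaction.

```python
class ValidationFailed(Exception):
    pass


def validate_phase(memory, transaction, undo_log):
    """Re-execute `transaction`, checking each write against `undo_log`.

    `transaction(read, write)` is a hypothetical callback style chosen for
    this sketch. Writes go to a private buffer and reach `memory` only if
    every check passes, imitating a hardware transaction's commit.
    """
    buffered = {}
    count = 0

    def read(addr):
        return buffered.get(addr, memory[addr])

    def write(addr, value):
        nonlocal count
        # The next log entry must name this address and its current value.
        if count >= len(undo_log) or undo_log[count] != (addr, read(addr)):
            raise ValidationFailed(addr)
        buffered[addr] = value
        count += 1

    transaction(read, write)
    if count != len(undo_log):
        raise ValidationFailed("write count differs from log entries")
    memory.update(buffered)


def example_tx(read, write):       # *p = *q; *r = 1
    write("p", read("q"))
    write("r", 1)


logged = [("p", 0), ("r", 0)]      # undo log persisted by the Log phase

# Another thread changed q (not in the undo log): validation succeeds.
memory_ok = {"p": 0, "q": 2, "r": 0}
validate_phase(memory_ok, example_tx, logged)

# Another thread changed p (in the undo log): validation fails.
memory_bad = {"p": 7, "q": 0, "r": 0}
try:
    validate_phase(memory_bad, example_tx, logged)
    failed = False
except ValidationFailed:
    failed = True
```

The first case shows why a failed Redo check need not force a restart: the persisted old values are still accurate, so the re-executed writes can commit even though they now store a different new value into `p`.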
Outline. Section 4 describes how Crafty executes persistent
transactions to provide atomicity at run time and support re-
covery on a crash. Section 5 describes how recovery restores
persistent state after a crash.
4 How Crafty Executes Transactions
This section describes how Crafty leverages nondestructive
undo logging to execute persistent transactions.
Execution modes. Crafty can operate in either of two modes.
In its thread-safe mode (this paper’s focus), programmers
specify persistent transaction boundaries, and Crafty provides
both atomicity and durability (i.e., all ACID properties)
for persistent transactions.
[Figure 3: flowchart. Nodes: Start persistent transaction; Log phase; Redo phase; Validate phase; SGL-based execution in thread-unsafe mode; End persistent transaction. Edge labels: success, failure, read-only, validation failure, fallback.]
Figure 3. Crafty’s phases in thread-safe mode.
[Figure 4: flowchart. Nodes: Start persistent transaction, k ← MAX_WRITES; Execute until persistent write or transaction end; Log phase for up to k writes; Return to last Log phase’s start, k ← k/2; Redo phase for Log phase’s writes; Persist undo log entry before persistent write; End persistent transaction. Edge labels: persistent write (if k > 1), persistent write (if k = 1), transaction end, success, failure.]
Figure 4. Crafty’s phases in thread-unsafe mode.
Crafty’s thread-unsafe mode is appropriate when locks or
another mechanism already provides atomicity, so Crafty
only needs to provide durability. In this mode, programmers
can specify transaction boundaries explicitly or inform Crafty
to treat all critical sections [5, 8, 22, 26, 39] or synchronization-
free regions [17] as persistent transactions.
Figures 3 and 4 show how Crafty operates in its thread-
safe and thread-unsafe modes (Section 3), respectively. Sec-
tions 4.1–4.3 provide a detailed description of the Log, Redo,
and Validate phases in the context of Crafty’s thread-safe
(a) Two persistent transactions.
Thread 1:
  atomic_and_durable {
    *p = *q;
    *r = 1;
  }
Thread 2:
  atomic_and_durable {
    *q = 2;
    *s = 3;
  }
(b) A possible execution of the persistent transactions in (a) using
Crafty in its thread-safe mode, which provides all ACID
properties.
Thread 1:
  Log phase:
    HTM_BEGIN
    undoLog_T1.add(p, 0)
    *p = *q
    undoLog_T1.add(r, 0)
    *r = 1
    // Start roll back:
    redoLog_T1.add(r, 1)
    *r = 0 // from undo log
    redoLog_T1.add(p, 0)
    *p = 0 // from undo log
    lastTS_T1 = ts()
    undoLog_T1.add(LOGGED, lastTS_T1)
    HTM_END
  Redo phase:
    HTM_BEGIN
    check gLastRedoTS < lastTS_T1
    *p = 0 // from redo log
    *r = 1 // from redo log
    gLastRedoTS = ts()
    undoLog_T1.add(COMMITTED, gLastRedoTS)
    HTM_END
Thread 2:
  Log phase:
    HTM_BEGIN
    undoLog_T2.add(q, 0)
    *q = 2
    undoLog_T2.add(s, 0)
    *s = 3
    // Start roll back:
    redoLog_T2.add(s, 3)
    *s = 0 // from undo log
    redoLog_T2.add(q, 2)
    *q = 0 // from undo log
    lastTS_T2 = ts()
    undoLog_T2.add(LOGGED, lastTS_T2)
    HTM_END
  Redo phase:
    HTM_BEGIN // redo
    check gLastRedoTS < lastTS_T2
    ABORT // check failed
  Validate phase:
    HTM_BEGIN
    check *q == 0 // from undo log
    *q = 2
    check *s == 0 // from undo log
    *s = 3
    check # writes == # log entries
    gLastRedoTS = ts()
    undoLog_T2.add(COMMITTED, gLastRedoTS)
    HTM_END
Figure 5. An example of Crafty’s thread-safe mode executing
persistent transactions. Initial values of *p, *q, *r, and
*s are 0. The example omits flush and drain instructions.
Algorithm 1 Log phase
1: HTM_BEGIN
⊲ Start of persistent transaction
. . .
⊲ Program write to persistent variable:
2: Add 〈addr, oldValue〉 to T.undoLog ⊲ T is current thread
3: *addr = newValue ⊲ Original program write
. . .
⊲ End of persistent transaction
4: Roll back transaction’s writes using T.undoLog, and populate
local redo log from T.undoLog
5: Add 〈LOGGED, getTimestamp()〉 to T.undoLog
6: HTM_END
7: flush(T.undoLog entries for this transaction)
mode. In thread-safe mode, repeated aborts cause Crafty
to transition to thread-unsafe mode while holding a single
global lock (SGL), as Figure 3 shows and Section 4.4 describes.
The rest of this section uses Figure 5 as an example to
show how Crafty works.
4.1 Log Phase
Crafty’s Log phase generates undo log entries for an exe-
cuted persistent transaction and then persists these entries.
The key point here is that the Log phase does not commit
any program writes to persistent memory. The Log phase
achieves this outcome by allowing the persistent transaction
to perform writes normally during its execution, but
rolling back the writes before the hardware transaction commits.
Algorithm 1 shows the details of the Log phase, which ex-
ecutes the persistent transaction body in a hardware trans-
action. Before each persistent write, the Log phase records
the old value of the written-to address in the executing thread’s
persistent undo log. For example, in Figure 5, each persistent
transaction’s Log phase adds old values to the undo log be-
fore each write.
At the transaction end, the Log phase uses the undo log
entries to roll back the transaction’s writes, by applying the
undo log entries’ old values in the reverse order. When rolling
back the writes, Crafty simultaneously builds a volatile redo
log for the transaction, which can be used by the subsequent
Redo phase to perform program writes. For example, in Figure 5,
starting from the “Start roll back:” comment, the persistent
transaction’s Log phase rolls back the writes by applying
the values from the undo log. Before committing the
hardware transaction, the Log phase adds a LOGGED entry
with a Lamport timestamp (footnote 1) denoting the current logical
time (which is equivalent to the logical time at the beginning
of the hardware transaction since HTM ensures atomicity).
(Footnote 1: If two events are ordered by happens-before, their
logical times are correspondingly ordered [36].)
The timestamps will be used by recovery to order undo log
Algorithm 2 Redo phase
Thread-safe Redo phase:
1: HTM_BEGIN
2: if gLastRedoTS < LOGGED timestamp from Log phase then
3: gLastRedoTS← getTimestamp()
4: Perform thread-unsafe Redo phase
5: else
6: Abort transaction and fail Redo phase
7: end if
8: HTM_END
9: flush(written-to addresses)
Thread-unsafe Redo phase:
10: Perform writes from redo log
11: if not in hardware transaction then
12: flush(written-to addresses)
13: end if
14: Add 〈COMMITTED, getTimestamp()〉 to T.undoLog
entries by different threads. In Figure 5, each Log phase con-
cludes by inserting a LOGGED entry into the undo log be-
fore committing the hardware transaction.
After committing the transaction, the Log phase flushes
the transaction’s undo log entries to persistent memory. The
algorithm flushes the transaction’s undo log entries but does
not wait for them to be written back to persistent memory
(i.e., flush but no drain) because the program writes will be
committed by the Redo or Validate phase inside of a hard-
ware transaction, which has drain semantics (e.g., an RTM
transaction has SFENCE semantics).
Once the Log phase completes, undo log entries for the
transaction have been persisted, but the transaction has ef-
fectively not executed from the perspective of other threads
and memory because none of the memory updates have
been performed. In Figure 5, after the Log phase completes,
*p, *r, *q, and *s still have their initial values (0). To make
these updates visible to other threads and persistent mem-
ory, Crafty uses the Redo or, if needed, the Validate phase.
A read-only transaction need not add a LOGGED entry to
the undo log or perform any persist operations, and it can
skip the Redo and Validate phases, as shown in Figure 3.
4.2 Redo Phase
The Redo phase applies the writes from the redo log (in the
reverse order of how they were recorded in the Log phase),
as illustrated in Algorithm 2.
If the program is single-threaded or no other threads ac-
cess persistent memory, it is safe to execute the Redo phase
unconditionally. However, if multiple threads are executing
persistent transactions, atomicity can be violated. For exam-
ple, thread B’s Redo phase can occur between thread A’s
Log and Redo phases. Hence, it is important to ensure that
A’s Redo phase completes only if it executes atomically with
its preceding Log phase.
To this end, Crafty uses a global variable gLastRedoTS
that represents the timestamp of the last writes committed
by any thread. Figure 5 demonstrates how gLastRedoTS is
updated and used. Crafty checks gLastRedoTS at the start
of the Redo phase in Thread 1. The check succeeds because
no thread has committed writes since Thread 1’s Log phase.
The Redo phase then performs the writes from the redo log.
Thread 1 completes the Redo phase by updating gLastRe-
doTS and adding a timestamped COMMITTED entry to the
log. This timestamp represents the time at which the trans-
action’s writes happened in relation to other threads’ trans-
actions.
A failed check indicates a potential atomicity violation.
In Figure 5, Thread 2’s check of gLastRedoTS fails because
Thread 1 updated gLastRedoTS to reflect that it committed
writes (in its Redo phase) after Thread 2’s Log phase. Thread 2’s
Redo phase thus fails, and Crafty tries the Validate phase.
(Alternatively, under different timing, Thread 2’s Redo phase
could start and complete before Thread 1’s Redo phase reads
gLastRedoTS, allowing Thread 2 to commit its writes with
the Redo phase. Thread 1 would fail the Redo phase and try
the Validate phase, re-executing the transaction and writing
an updated value of *q, 2, to *p.)
If successful, the Redo phase concludes by flushing the
transaction’s writes to persistent memory, but does not wait
for the write-backs to finish (i.e., flush but no drain). The
next persistent transaction’s Log phase will perform a hard-
ware transaction, which has drain semantics, and the recov-
ery algorithm always rolls back each thread’s latest transac-
tion in case its writes had not fully persisted (Section 5).
The Redo phases of all transactions are effectively seri-
alized. This does not necessarily cause a bottleneck in per-
formance because the Redo phase is often short and can
execute concurrently with Log and Validate phases.
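The timestamp check can be sketched in Python as follows. This is a minimal model, assuming a counter in place of getTimestamp() and a lock in place of the hardware transaction; SharedState and redo_phase are illustrative names, and flushing and the COMMITTED log entry are left out.

```python
import itertools
import threading

class SharedState:
    """State shared by all threads in this sketch (names are hypothetical)."""
    def __init__(self):
        self.memory = {}
        self.g_last_redo_ts = 0            # models gLastRedoTS
        self.htm = threading.Lock()        # stand-in for a hardware transaction
        self._clock = itertools.count(1)   # stand-in for getTimestamp()

    def get_timestamp(self):
        return next(self._clock)

def redo_phase(state, redo_log, logged_ts):
    """Apply redo_log only if no thread committed writes after logged_ts.

    Returns the commit timestamp, or None if the check fails and the
    caller should fall back to the Validate phase.
    """
    with state.htm:
        if state.g_last_redo_ts > logged_ts:
            return None                    # potential atomicity violation
        # The redo log was built during rollback, so replay it backwards.
        for addr, new_value in reversed(redo_log):
            state.memory[addr] = new_value
        state.g_last_redo_ts = state.get_timestamp()
        return state.g_last_redo_ts
```

With two threads that both finish their Log phases before either runs Redo, the first call succeeds and the second returns None, mirroring the Figure 5 scenario.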
4.3 Validate Phase
The goal of the Validate phase is to execute a persistent
transaction that is consistent with the persisted undo log
entries. The Validate phase checks consistency by compar-
ing the old values recorded in the undo log with the current
values of the same locations, as Algorithm 3 illustrates.
The Validate phase checks whether, for each program
write, its corresponding entry in the undo log matches the
write’s address and the value at the address. If it does, this
implies the validity of the undo log entry. For example, in
Figure 5, Thread 2’s Validate phase checks that both writes
to q and s match the original addresses and old values in
the undo log, and that there are no new writes. At the end
of the persistent transaction, the hardware transaction is
committed, and the writes are persisted and made visible
to other threads. Note that it is important to re-execute the
transaction, validating the undo log entries rather than
just performing the writes, to ensure (implicitly) that values
read by the transaction are still consistent with the undo
log entries. Like the Redo phase, after performing the writes,
Crafty adds a 〈COMMITTED, getTimestamp()〉 entry to the
log, which represents the time at which the transaction’s
writes happened in relation to other threads.
Algorithm 3 Validate phase
1: Reset T.undoLog to the beginning of the transaction
2: HTM_BEGIN
⊲ Start of the persistent transaction
. . .
⊲ Program write to a persistent variable:
3: let 〈expectedAddr, expectedValue〉 be next entry in T.undoLog
4: if addr ≠ expectedAddr ∨ *addr ≠ expectedValue then
5:   Abort transaction and fail validation
6: end if
7: *addr ← newValue ⊲ Original program write
. . .
⊲ End of the persistent transaction
8: Check that the next entry in T.undoLog is a LOGGED entry (if not, abort the transaction and fail the validation)
9: gLastRedoTS ← getTimestamp()
10: Add 〈COMMITTED, gLastRedoTS〉 to T.undoLog after the LOGGED entry
11: HTM_END
12: flush(written-to addresses)
Note that the Validate phase executes only if the Redo
phase fails. Every persistent transaction commits its writes
exactly once, with either the Redo or Validate phase.
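The validation check can be modeled in Python as below. This is a sketch under assumptions rather than the paper's code: writes are buffered in a dict to imitate the all-or-nothing behavior of a hardware transaction, an exception plays the role of an abort, and the gLastRedoTS update, the COMMITTED entry, and flushing are omitted. The names validate_phase and ValidationFailed are hypothetical.

```python
class ValidationFailed(Exception):
    """Raised where a hardware transaction would abort."""

def validate_phase(memory, txn_body, undo_log):
    """Re-execute txn_body, checking each write against undo_log.

    Returns True and applies the writes if every write matches the logged
    address and old value; returns False and leaves memory untouched otherwise.
    """
    entries = iter(undo_log)
    pending = {}   # buffered writes, published only on success

    def read(addr):
        return pending[addr] if addr in pending else memory[addr]

    def write(addr, new_value):
        expected_addr, expected_value = next(entries)
        if addr != expected_addr or read(addr) != expected_value:
            raise ValidationFailed(addr)
        pending[addr] = new_value

    try:
        txn_body(read, write)
        tag, _ = next(entries)
        if tag != "LOGGED":    # the transaction performed fewer writes than logged
            raise ValidationFailed(tag)
    except ValidationFailed:
        return False
    memory.update(pending)
    return True
```

Because the transaction body is re-executed, a value read from another location (such as *q in the Figure 5 discussion) is picked up afresh, while a changed old value at a written location causes validation to fail.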
4.4 Single Global Lock Fallback
A hardware transaction may abort for a variety of reasons
including a conflict with other threads, cache capacity over-
flow, or an unsupported event such as an interrupt [59]. Since
commodity HTM generally provides no progress guaran-
tee, special care must be taken to ensure that an execution
makes progress. As Figure 3 shows, Crafty’s thread-safe mode
retries an aborted transaction several times; if no Redo or
Validate phase commits successfully, it falls back to acquir-
ing a single global lock (SGL) to provide progress guaran-
tees. The SGL serves two purposes. First, it eliminates con-
flicts among different threads. Second, it allows Crafty to
execute in thread-unsafe mode, where Crafty can execute
shorter hardware transactions (fewer instructions) or with-
out hardware transactions to ensure progress.
The SGL is a global variable that a thread acquires by up-
dating it atomically from 0 to 1, and releases by setting it to 0
(with proper memory fencing). To ensure consistency with
respect to other threads executing hardware transactions in
thread-safe mode, each hardware transaction in thread-safe
mode must check whether the SGL is 0 at the beginning of
the transaction; if the SGL is 1, the transaction must abort
(not shown in the algorithms). This handling ensures consis-
tency with an ongoing SGL section or with an SGL section
that starts while the transaction is executing (since the trans-
action’s read set contains the SGL). This fallback method is
referred to as speculative lock elision in the literature and has
been widely studied [27, 45, 59].
After acquiring the SGL, a thread can safely execute in
Crafty’s thread-unsafe mode, as illustrated in Figure 4. In
this mode, the SGL ensures atomicity, so HTM serves solely
to implement nondestructive undo logging (i.e., to prevent
updated cache lines from being written back to persistent
memory prematurely), not for thread atomicity.
As a result, thread-unsafe mode uses hardware transac-
tions for the Log phase only, which can wait to start a hard-
ware transaction until the first persistent write of the per-
sistent transaction. Thread-unsafe mode does not use HTM
for the Redo phase because no other threads can update the
global timestamp gLastRedoTS, and hence this phase always
succeeds. The Validate phase is not needed at all.
Ensuring progress. Even without contention from other
threads, a hardware transaction may still abort for cache
capacity or other reasons. The Log phase in thread-unsafe
mode can ensure progress by breaking a persistent transac-
tion into smaller hardware transactions, each executing at
most k persistent writes. After executing k persistent writes
(or fewer, if the persistent transaction ends before k is reached),
the Log phase completes normally, rolling back writes and
persisting undo log entries including a LOGGED entry. The
Redo phase then performs the k (or fewer) persistent writes,
except that it does not add a COMMITTED entry, which
should only be used to indicate the end of the (SGL-based)
transaction. The Log and Redo phases continue executing
the persistent transaction, in chunks of up to k writes. If
k = 1, the Log phase writes and persists an undo log entry
before performing the write to memory, without using any
hardware transaction.
When entering thread-unsafe mode for a persistent trans-
action, Crafty begins with a (relatively large) value of k (e.g.,
64) with the goal to amortize persist latency across multi-
ple writes. After each transactional abort in thread-unsafe
mode, Crafty decreases k geometrically for the next hard-
ware transaction. When the value of k drops to 1, thread-
unsafe mode is guaranteed to make progress. Figure 4 illus-
trates the logic for thread-unsafe mode, which executes the
transaction in k-write chunks until completion.
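The chunking policy can be sketched in Python as follows. The abort model is an assumption made for illustration: a chunk "aborts" whenever it contains more writes than a fixed htm_capacity, and a chunk of one write never aborts because it uses no hardware transaction. Halving is one possible geometric decrease; the text does not fix the factor. The name run_in_chunks is hypothetical.

```python
def run_in_chunks(writes, htm_capacity, k=64):
    """Execute writes in chunks of at most k, shrinking k after each abort.

    Returns the sizes of the chunks that committed, in order.
    """
    committed = []
    done = 0
    while done < len(writes):
        size = min(k, len(writes) - done)
        # Modeled abort: the chunk does not fit in the hardware transaction.
        if size > 1 and size > htm_capacity:
            k = max(1, k // 2)   # geometric decrease for the next attempt
            continue
        committed.append(size)   # Log phase, then Redo phase, for this chunk
        done += size
    return committed
```

With 100 writes and a capacity of 20, k shrinks from 64 to 32 to 16, after which the transaction completes in six chunks of 16 writes and one chunk of 4.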
Before releasing the SGL, Crafty adds a COMMITTED entry
to the persistent undo log. All of the SGL section’s hardware
transactions’ LOGGED and COMMITTED entries use
the same timestamp (from the first call to getTimestamp())
to ensure that the recovery algorithm either rolls back all or
none of the SGL section’s writes.
Cray: Efficient, HTM-Compatible Persistent Transactions
Crafty thus adaptively adjusts transaction sizes to provide
a tradeoff between persist latency and the risk of aborting.
Prior work in other contexts splits transactions to balance
between per-transaction costs and abort costs [41, 50].
5 How Crafty Recovers After a Crash
This section presents Crafty’s recovery logic. We first de-
scribe how recovery is done under the assumptions that an
infinite log is initially zeroed (i.e., no wraparound or reuse)
and log entries are persisted atomically. Then we show how
to handle logs without these simplifying assumptions.
5.1 Basic Recovery Logic
Under the above-mentioned assumptions, the recovery ob-
server can detect persisted entries, which are entries with a
nonzero addr field (which is either an address or a LOGGED
or COMMITTED tag). A fully persisted sequence is a consecutive
sequence of persisted 〈addr, oldValue〉 entries preceded
by a (persisted) LOGGED or COMMITTED entry and
concluded by a persisted LOGGED entry.
The recovery observer needs to roll back the last fully
persisted sequence of each thread because some of the cor-
responding writes may have persisted, but not all of them
have definitely persisted. Let 〈LOGGED, ts〉 be the sequence’s
concluding entry. Then we define a sequence’s timestamp to
be ts. To arrive at a globally consistent snapshot, the recov-
ery observer must also roll back every sequence that has a
timestamp later than or equal to the timestamp of any se-
quence being rolled back. Any persisted entries outside a
fully persisted sequence must not be rolled back because
their corresponding writes definitely have not persisted.
To roll back a sequence, the recovery observer applies the
〈addr, oldValue〉 entries in reverse order, performing *addr
= oldValue for each entry. The recovery observer rolls back
the fully persisted sequences in the reverse timestamp order.
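The basic recovery logic can be sketched in Python under the same simplifying assumptions (infinite logs, atomically persisted entries). Each log is modeled as a list of (addr_or_tag, value) pairs. The sketch also assumes that the start of a log counts as a valid predecessor for the first sequence, and it breaks timestamp ties by log position; the function names are hypothetical.

```python
LOGGED, COMMITTED = "LOGGED", "COMMITTED"

def persisted_sequences(log):
    """Split one thread's log into (timestamp, entries) fully persisted sequences."""
    sequences, current = [], []
    for tag, value in log:
        if tag == LOGGED:
            sequences.append((value, current))
            current = []
        elif tag == COMMITTED:
            current = []
        else:
            current.append((tag, value))
    return sequences   # trailing entries with no LOGGED entry are ignored

def recover(memory, logs):
    """Roll back each thread's last sequence and every sequence at or after it."""
    per_thread = [persisted_sequences(log) for log in logs]
    last = [seqs[-1][0] for seqs in per_thread if seqs]
    if not last:
        return
    cutoff = min(last)   # earliest timestamp that must be rolled back
    to_undo = [(ts, i, entries)
               for seqs in per_thread
               for i, (ts, entries) in enumerate(seqs)
               if ts >= cutoff]
    # Latest sequence first; within a sequence, latest write first.
    for _, _, entries in sorted(to_undo, key=lambda s: s[:2], reverse=True):
        for addr, old_value in reversed(entries):
            memory[addr] = old_value
```

In the test scenario used to check this sketch, one thread's older sequence (timestamp 1) is left alone, while its newer sequence (timestamp 5) and another thread's only sequence (timestamp 3) are both undone.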
5.2 Handling High-Performance Logs
Next, we discuss how Crafty provides correct recovery in
the absence of simplifying assumptions about the logs. The
following design handles circular logs that reuse entries, and
it does not require a log entry to be persisted atomically.
The design also addresses a limitation of Crafty’s design as
presented so far: Because the recovery observer rolls back
at least each thread’s last transaction, a rolled-back transac-
tion can be arbitrarily far back in time if a thread has not
executed a transaction in a while.
The design assumes that the system provides persistence
at word (or coarser) granularity. The design relies on each
thread’s circular log being large enough to hold log entries
for at least two persisted sequences (which are bounded due
to HTM bounding constraints).
Distinguishing reused entries. To handle reuse of undo
log entries (e.g., via a circular log), the recovery observer
needs to be able to tell whether an entry 〈addr, oldValue〉
is from the latest transaction or the last wraparound of the
log. Inspired by prior mechanisms [10], Crafty’s execution
of transactions maintains a per-thread wraparound bit that
is encoded in each word and flips each time the log wraps
around. This wraparound bit allows recovery to differentiate
words written after versus before the latest log wraparound.
Because logged transactions occur more often than wraparound,
recovery can only observe log entries that are after the
next-to-latest wraparound and hence a single wraparound bit
suffices.
We further assume that all addresses are word (4- or 8-
byte) aligned. This allows us to steal two or three bits of
the addr word in each 〈addr, value〉 log entry. One of them
is used as the wraparound bit. The LOGGED and COM-
MITTED tags are each represented as a reserved, aligned
address.
Because NVM is only guaranteed to provide persistence
at word granularity, the value word in a 〈addr, value〉 log
entry will also need a wraparound bit. However, value needs
all of its bits for program values. We thus steal another bit
in each addr word to store a bit (e.g., the lowest bit) of the
value word, allowing that same bit of the value word to be
replaced with the wraparound bit.
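The bit stealing can be illustrated with a short Python sketch. The exact bit positions are an assumption made for the example (bit 0 of each word holds the wraparound bit, bit 1 of the addr word holds the displaced low bit of the value); the paper does not specify the layout, and encode_entry and decode_entry are hypothetical names.

```python
MASK64 = (1 << 64) - 1

def encode_entry(addr, value, wrap):
    """Pack an 8-byte-aligned addr and a 64-bit value into two logged words."""
    assert addr % 8 == 0 and 0 <= value <= MASK64 and wrap in (0, 1)
    addr_word = addr | ((value & 1) << 1) | wrap   # stash value's low bit
    value_word = (value & ~1 & MASK64) | wrap      # low bit becomes wrap bit
    return addr_word, value_word

def decode_entry(addr_word, value_word, current_wrap):
    """Return (addr, value), or None if either word lacks the current wrap bit."""
    if (addr_word & 1) != current_wrap or (value_word & 1) != current_wrap:
        return None   # stale or only partially persisted entry
    addr = addr_word & ~0b111 & MASK64
    value = (value_word & ~1 & MASK64) | ((addr_word >> 1) & 1)
    return addr, value
```

Because each word carries its own wraparound bit, an entry whose addr word was persisted but whose value word still holds data from the previous pass is rejected rather than misread.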
Discarding entries and bounding rollback severity. In
order for Crafty’s Log phase to reuse log entries (e.g., via a
circular log), we must be able to discard some entries that
would no longer be needed by the recovery observer. Since
Crafty must not discard entries for a logged transaction that
needs to be rolled back, we need to ensure that the earliest
possible rollback timestamp ts is greater than the timestamp
of a logged transaction that Crafty is ready to discard. A
related issue we address here is bounding how far back in
time the recovery observer must roll back to. This distance
can be quite far if a thread has not executed a persistent
transaction for a while.
The logging algorithm maintains a global timestamp tsLowerBound
that is a lower bound on the earliest possible timestamp
r that recovery might need to roll back to. It is a lower
bound because, for performance reasons, it is kept up to
date lazily. When adding a LOGGED entry to its undo log,
a thread checks that
currentTS() < tsLowerBound + MAX_LAG
where MAX_LAG represents a customizable maximum time
duration for which recovery might need to roll back, and
currentTS() is a timestamp representing the current time.
Likewise, whenever a thread T gets halfway through its cir-
cular log, it first checks if overwriting the next half of the
log will violate
T.log.earliestTSToBeOverwrien > tsLowerBound
If either of these conditions fails, thread T performs further
inspection, by checking the following two conditions for ev-
ery other thread U:
currentTS() < U.lastCommittedTxn.ts + MAX_LAG
T.log.earliestTSToBeOverwritten < U.lastCommittedTxn.ts
T can perform these checks safely (atomically) by executing
them in a hardware transaction, performing read-only accesses
to U.logStart and U’s log. If either condition fails on
U, then T forces U to commit a 〈LOGGED,
getTimestamp()〉 entry to U’s log (representing an empty
completed transaction). T can accomplish this by using a
transaction to safely update U’s log (because we need to be
careful about interference with U, especially its non-transactional
state manipulations).
After T makes each delinquent thread U commit a more
up-to-date transaction, it sets
tsLowerBound ← min_U U.lastCommittedTxn.ts
Note that most transactions only need to read a global shared
variable (tsLowerBound) that is mostly read-only, resorting
to more expensive operations only when they are halfway
through the circular undo log. The frequency of expensive
operations can be reduced by increasing the size of each
thread’s circular log.
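The lazy maintenance of tsLowerBound can be sketched in Python. This is an illustrative model only: timestamps are plain integers, each thread's last committed transaction is a dictionary entry, forcing an empty transaction is modeled by overwriting that entry, and the hardware transactions that make the inspection atomic are omitted. The slow path follows the two per-thread conditions stated above; needs_inspection and refresh_lower_bound are hypothetical names.

```python
def needs_inspection(now, ts_lower_bound, max_lag):
    """Fast path when adding a LOGGED entry: one read of a shared variable."""
    return not (now < ts_lower_bound + max_lag)

def refresh_lower_bound(now, earliest_overwritten, last_committed,
                        max_lag, get_timestamp):
    """Slow path: inspect every other thread U and return the new lower bound.

    last_committed maps each other thread to U.lastCommittedTxn.ts and is
    updated in place when a delinquent thread is forced to log.
    """
    for thread, ts in last_committed.items():
        within_lag = now < ts + max_lag
        safe_to_reuse = earliest_overwritten < ts
        if not (within_lag and safe_to_reuse):
            # Model of committing an empty <LOGGED, ts> entry to U's log.
            last_committed[thread] = get_timestamp()
    return min(last_committed.values())
```

Only the thread whose last transaction is too old is touched; threads that already satisfy both conditions keep their timestamps, so the new lower bound is the minimum over the refreshed values.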
Providing immediate persistence. Some persistent trans-
action systems, including DudeTM [38] and NV-HTM [7],
guarantee that if a persistent transaction completes and the
thread continues execution, then the recovered state will
include the completed transaction’s state. This “immediate
persistence” property ensures that the persistent state is con-
sistent with any externally visible, irrevocable actions be-
tween transactions such as system calls. However, Crafty
does not provide “immediate persistence” because it does
not ensure that all writes have been persisted when a trans-
action completes (which is why recovery rolls back each
thread’s last logged transaction). Some prior work includ-
ing PMThreads [56] also does not provide immediate per-
sistence.
Instead of providing immediate persistence, Crafty can
provide a method for “on-demand” immediate persistence
(to be invoked before performing externally visible, irrevocable
actions). Crafty can implement on-demand immediate
persistence by adding a 〈LOGGED, getTimestamp()〉 entry
to each thread’s log, similar to the approach described above for
reusing log entries and bounding rollback severity. Modifying
other threads’ logs can be performed safely by executing
in a hardware transaction. Our prototype implementation
does not support on-demand immediate persistence.
6 Implementation
Our Crafty implementation, which we have made publicly
available,2 extends the publicly available NV-HTM
implementation [7].3 The NV-HTM implementation also provides
a configuration that represents the prior work DudeTM [38].
It also includes a Non-durable configuration that simply
executes each persistent transaction in a hardware transaction
and thus does not provide any crash-consistency
guarantees.
The NV-HTM implementation emulates non-volatile memory
in volatile memory by performing 300 ns of busy waiting
at drain operations (emulating the roundtrip latency of
each SFENCE instruction that follows one or more CLWB
instructions). This methodology is consistent with the evaluations
of prior work including DudeTM and NV-HTM [7,
38].
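The emulation idea can be sketched in a few lines of Python. This only illustrates busy waiting at a drain point; it is not the NV-HTM code, and Python timers cannot hit a 300 ns target with any precision, so the real delay will be longer. The function name emulated_drain is hypothetical.

```python
import time

def emulated_drain(latency_ns=300):
    """Spin for at least latency_ns to mimic an SFENCE that follows CLWBs."""
    deadline = time.perf_counter_ns() + latency_ns
    while time.perf_counter_ns() < deadline:
        pass   # spin rather than sleep: sleeping is far too coarse at this scale
```

Flush operations themselves incur no emulated cost in this model; only the drain does, which matches the description above of charging latency per SFENCE.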
Cray logging details. Each thread has an undo and a redo
log. Undo logs are in non-volatile memory and are circular.
Redo logs are in volatile memory and not needed after a per-
sistent transaction completes, so the next persistent transac-
tion can reuse the redo log from the beginning.
Each undo log entry 〈addr, oldValue〉 contains two 8-byte
words: the written-to address and the old value. Each addr
value is 8-byte-aligned because all writes are expressed as
8-byte, aligned stores. The implementation merges the LOGGED
and COMMITTED entries into a single entry, overwriting
the entry’s timestamp on commit. This optimization is safe
as the recovery observer does not need to differentiate be-
tween LOGGED and COMMITTED entries when deciding
what sequences to roll back. The recovery observer can check
if each log entry has persisted using the wraparound bit.
Timestamps come from RDTSC.
The implementation performs the work needed to sup-
port rollback (i.e., the wraparound bit and the log checks in
Section 5.2). However, we have not implemented the actual
recovery logic, leaving it and its evaluation to future work.
Mixed-mode accesses. The implementation requires that
all writes to persistent memory happen in persistent trans-
actions. Crafty can support writes to volatile memory in
transactions by ensuring transactions are idempotent with
respect to volatile memory accesses. Our implementation re-
quires manual transformation of transactions to make them
idempotent with respect to function-local variables. It does
not allow other volatile memory writes in transactions, but
could do so by adding undo logs for volatile accesses.
The same (volatile or persistent) variable can be accessed
both in and out of transactions, subject to the aforementioned
constraints. Programmers must be careful to synchronize
such accesses correctly: Although Intel’s RTM provides
strong atomicity [20], Crafty may fall back to using a global
lock for providing thread atomicity. Programs thus must ensure
transactional data race freedom [12].
2 https://github.com/PLaSSticity/Crafty
3 https://bitbucket.org/dfscastro/nvhtm-selfcontained/src/master/
Memory management. Because the Log and Validate phases
execute the same code, the implementation must handle side
effects from malloc and free to avoid leaking memory and
failing checks in the Validate phase. The implementation
thus logs allocations during the Log phase and reuses the
allocated memory at corresponding malloc calls during the
Validate phase. Similarly, the Log phase logs free calls,
and the implementation either performs the logged frees after
completing the Redo phase or allows the Validate phase to
perform the free calls and then discards the logged frees.
7 Evaluation
This section evaluates the performance of Crafty by compar-
ing it with prior work’s HTM-compatible persistent transac-
tions and with non-durable transactions.
7.1 Methodology
Configurations. Our experiments run Crafty in thread-safe
mode to provide full ACID transactions. NV-HTM and DudeTM
are run under their standard configurations. As a baseline,
we run the implementation’s Non-durable configuration,
which does not provide any guarantees on crash
consistency.
In addition to evaluating Crafty’s full version (in
thread-safe mode), we evaluate two variants of Crafty that
exclude the Validate and Redo phases, referred to as
Crafty-NoValidate and Crafty-NoRedo, respectively, in the rest of
the section. These configurations help tease out the performance
effects of Crafty’s components. Note that these
configurations are also fully functioning configurations that
provide the same guarantees as Crafty.
Evaluated programs. We use two microbenchmarks and a
set of transactional memory benchmarks.
The bank microbenchmark is from the publicly available
NV-HTM implementation [7] that performs random trans-
fers between accounts. We configure the benchmark to run
five transfers (ten persistent writes) per transaction with
three levels of contention: high, medium, and no conflict. The
difference in conflict rates is achieved by varying the num-
ber of accounts—the medium- and high-conflict configura-
tions operate on 4,096 and 1,024 cache-line-aligned accounts,
respectively. The no-conflict configuration avoids all con-
flicts by partitioning the accounts among threads.
The other microbenchmark performs operations on a B+
tree and is adapted from the implementation of Zardoshti et
al. [61] by annotating writes to shared memory. The bench-
mark provides two variants: one performs only insertions
on the tree, and the other performs a mixture of lookups,
insertions, and removals.
As a standard benchmark suite for transactional memory
research, our experiments use the transactional STAMP bench-
marks [6]. In particular, we consider each transaction to be a
persistent transaction, and treat all shared-memory accesses
in transactions as accesses to persistent memory. This same
methodology was used in the evaluation of prior work [7].
We exclude the benchmark yada because it fails to run with
Non-durable and NV-HTM due to a pointer corruption, and bayes
because around half of its transactions fall back to SGL
mode due to HTM-incompatible instructions, which makes
the results not meaningful.
Experimental setup. We run the experiments on a quiet machine
with a 16-core Intel Skylake processor with hyperthreading
disabled. The implementation uses native hardware
transactions [47] and emulates non-volatile memory as described
in Section 6. Each reported result is the arithmetic mean
of five trials. The throughput results are normalized to the
throughput of the single-thread, non-durable configuration
of the same benchmark. We define throughput as the inverse
of the execution’s wall-clock time.
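As a small worked example of this normalization, the following Python helper computes normalized throughput from wall-clock times. The function name and the sample times are illustrative and are not measurements from the evaluation.

```python
def normalized_throughput(wall_clock_s, baseline_wall_clock_s):
    """Throughput (1 / time) relative to the single-thread Non-durable run.

    (1 / t) / (1 / t_base) simplifies to t_base / t.
    """
    return baseline_wall_clock_s / wall_clock_s
```

For instance, a configuration that finishes in 4 s when the single-thread Non-durable baseline takes 10 s has a normalized throughput of 2.5, and the baseline itself is 1.0 by construction.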
7.2 Performance Results
This subsection presents our main results: performance and
scalability for the evaluated programs across thread counts
and persistent transaction implementations. We also per-
form additional measurements that help explain the perfor-
mance including (1) breakdowns of persistent transactions
by Crafty phases and (2) hardware transaction commit and
abort counts and abort causes. These additional measure-
ments, as well as performance results that emulate 100 ns
(instead of 300 ns) write latency, can be found in Appen-
dix A.
Bank microbenchmark. Figure 6 compares Crafty and other
persistent transaction implementations under different
contention levels. The general trend is that
Crafty outperforms NV-HTM and DudeTM under low-contention
settings, when there are few threads or few conflicting
transactions. For example, under all contention levels Crafty
outperforms NV-HTM and DudeTM for 1–2 threads.
Under high contention, Crafty scales poorly because it
amplifies transactional conflicts by executing persistent
transactions using more hardware transactions than
other approaches. While NV-HTM scales well up to 4 threads,
its scalability limitations (Section 2) cause it to anti-scale
above 4 threads and underperform Crafty above 8 threads.
Crafty outperforms or performs as well as the competing
approaches except for NV-HTM on the high-conflict
configuration at 4 threads.
Note that NV-HTM’s and DudeTM’s throughput drops
dramatically at 16 threads. NV-HTM and DudeTM use an
extra thread that performs writes to persistent memory. Be-
cause 16 program threads are running on 16 cores, the extra
thread is scheduled on the same core as a program thread,
causing frequent context switches because of the producer–consumer
relationship between the two threads.
[Figure 6: plots of normalized throughput (y-axis) versus thread count (x-axis: 1, 2, 4, 8, 12, 15, 16) for Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, and Crafty-NoRedo. Panels: (a) High contention, (b) Medium contention, (c) No contention.]
Figure 6. Throughput of Crafty and competing approaches,
using the bank microbenchmark at three contention levels.
Crafty generally outperforms NV-HTM and DudeTM, especially
under low contention and at low thread counts.
[Figure 7: plots of normalized throughput (y-axis) versus thread count (x-axis: 1, 2, 4, 8, 12, 15, 16) for the same six configurations. Panels: (a) Insert operations only, (b) Lookup, insert, and remove operations.]
Figure 7. Throughput of Crafty and competing approaches,
on the B+ tree microbenchmark, for mixed operations and
insert only. Crafty scales better than NV-HTM and DudeTM
and has low overhead compared with Non-durable.
Figure 6(c) motivates the Validate phase. When the number
of threads is above 4, Crafty-NoValidate is slower than
Crafty because Redo fails due to timestamp checks, but Validate
succeeds since there is no true contention. Results
in Figure 11 in Appendix A support this conclusion:
Crafty-NoValidate incurs many explicit aborts at thread counts above
4, caused by failed timestamp checks.
B+ tree microbenchmark. Figure 7 shows the results for
the B+ tree microbenchmark. NV-HTM and DudeTM scale
poorly compared with Crafty and Non-durable, presumably
as a result of serializing execution during transaction com-
mit and when persisting writes; our extended results (Ap-
pendix A) do not show significant differences in transaction
abort rates.
For both configurations of the benchmark at all thread
counts, Crafty outperforms NV-HTM and DudeTM, and has
modest overhead over Non-durable.
STAMP benchmarks. Figure 8 shows results for the STAMP
benchmarks. Across the benchmarks, Crafty generally per-
forms better than NV-HTM and DudeTM and scales as well
as Non-durable (the exception is intruder, discussed below).
Cray: Efficient, HTM-Compatible Persistent Transactions
1 2 4 8 12 15 16
Thread count
0.0
0.2
0.4
0.6
0.8
1.0
No
rm
al
ize
d 
th
ro
ug
hp
ut
Non-durable
DudeTM
NV-HTM
Crafty
Crafty-NoValidate
Crafty-NoRedo
(a) kmeans (high contention)
1 2 4 8 12 15 16
Thread count
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
No
rm
al
ize
d 
th
ro
ug
hp
ut
Non-durable
DudeTM
NV-HTM
Crafty
Crafty-NoValidate
Crafty-NoRedo
(b) kmeans (low contention)
1 2 4 8 12 15 16
Thread count
0
1
2
3
4
5
6
No
rm
al
ize
d 
th
ro
ug
hp
ut
Non-durable
DudeTM
NV-HTM
Crafty
Crafty-NoValidate
Crafty-NoRedo
(c) vacation (high contention)
1 2 4 8 12 15 16
Thread count
0
2
4
6
8
No
rm
al
ize
d 
th
ro
ug
hp
ut
Non-durable
DudeTM
NV-HTM
Crafty
Crafty-NoValidate
Crafty-NoRedo
(d) vacation (low contention)
1 2 4 8 12 15 16
Thread count
0.0
0.2
0.4
0.6
0.8
1.0
No
rm
al
ize
d 
th
ro
ug
hp
ut
Non-durable
DudeTM
NV-HTM
Crafty
Crafty-NoValidate
Crafty-NoRedo
(e) labyrinth
1 2 4 8 12 15 16
Thread count
0
2
4
6
8
10
12
No
rm
al
ize
d 
th
ro
ug
hp
ut
Non-durable
DudeTM
NV-HTM
Crafty
Crafty-NoValidate
Crafty-NoRedo
(f) ssca2
1 2 4 8 12 15 16
Thread count
0
1
2
3
4
5
6
7
No
rm
al
ize
d 
th
ro
ug
hp
ut
Non-durable
DudeTM
NV-HTM
Crafty
Crafty-NoValidate
Crafty-NoRedo
(g) genome
1 2 4 8 12 15 16
Thread count
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
No
rm
al
ize
d 
th
ro
ug
hp
ut
Non-durable
DudeTM
NV-HTM
Crafty
Crafty-NoValidate
Crafty-NoRedo
(h) intruder
Figure 8. Throughput of Crafty and competing approaches, on the STAMP benchmarks. Crafty has low overhead and scales
well at high thread counts.
Kaan Genç, Michael D. Bond, and Guoqing Harry Xu
Figures 8(a) and 8(b) show that Crafty outperforms NV-
HTM and DudeTM at thread counts above 4 for kmeans un-
der both high and low contention.
Figures 8(c) and 8(d) show that Crafty adds modest overhead
over Non-durable on vacation. Except for the
high-contention vacation configuration above 8 threads, for which
NV-HTM and DudeTM perform best, Crafty outperforms
competing approaches. (The sudden drop in NV-HTM’s and
DudeTM’s throughput at 16 threads is the same issue as
described above for the bank microbenchmark.) Both figures
show the benefits of using both the Redo and Validate
phases under higher thread counts but low contention.
Figures 8(e) and 8(f) show similar performance for Crafty,
NV-HTM, and DudeTM on labyrinth and ssca2. An exception
is that for ssca2, which has very low contention,
Crafty-NoRedo performs significantly better than the other
configurations. As Figure 19 in Appendix A shows, across all
thread counts, Crafty-NoRedo, which uses only Log and Validate
phases, experiences very few aborts.
Figure 8(g) shows that Crafty scales well at high thread
counts for genome, while NV-HTM and DudeTM do not
scale as well. Crafty-NoValidate scales poorly
with more threads, showing the value of the Validate phase
when the Redo phase fails frequently due to numerous
simultaneous transactions.
Figure 8(h) shows that for intruder, Crafty performs worse
than NV-HTM and DudeTM. While detailed statistics in
Figure 21 in Appendix A show that Crafty configurations commit
and abort significantly more hardware transactions than
the other configurations, these results do not explain Crafty’s
poor performance: Crafty inherently commits, and often aborts,
more hardware transactions than other approaches across
the other programs, yet generally outperforms NV-HTM and
DudeTM. As of the camera-ready deadline, we have not been
able to understand this issue better (we fixed an issue just
days before the deadline that allowed our implementation
to run intruder without error).
8 Related Work
Crafty leverages hardware transactional memory (HTM) to
control persist ordering, while also supporting the use of
commodity HTM for concurrency control in persistent
transactions. To our knowledge, no prior work has used HTM
to control persist ordering. Prior work supports commodity
HTM for concurrency control in persistent transactions [7,
14, 38]. DudeTM and NV-HTM use shadow-paging-based
copy-on-write mechanisms and incur scalability bottlenecks [7,
38]; we compared with them qualitatively and quantitatively
in this paper. Giles et al. introduced an approach for
HTM-based persistent transactions that requires instrumenting
program reads [14], arguably forgoing a key benefit of using
HTM instead of STM. In contrast with the prior work, which
works around the challenges of combining persistence and
HTM, Crafty leverages HTM to control persist ordering, as
realized in the new nondestructive undo logging mechanism.
Modifying HTM. Several research efforts propose nontriv-
ial modifications to commodity HTM to support persistent
transactions [3, 4, 15, 29, 53]. In contrast, Crafty shows how
to leverage and work with contemporary systems.
Soware persistent transactions. Many existing systems
provide persistent transactions [9, 11, 13, 16, 18, 31, 32, 40, 42,
44, 46, 48, 51, 56]. These approaches use undo, redo, or copy-
on-write mechanisms to provide failure atomicity. The ap-
proaches either assume that programs provide thread atom-
icity through locks or another concurrency control mech-
anism, or they apply STM to provide thread atomicity to-
gether with failure atomicity.
Failure atomicity of critical sections. Several approaches
including Atlas provide failure atomicity for lock-based crit-
ical sections [5, 8, 22, 26, 39] or synchronization-free regions [17].
Crafty (in its thread-unsafe mode) could likewise provide
failure atomicity for lock-based regions.
Failure ordering. This paper focuses on providing failure
atomicity. Providing failure atomicity relies on the lower-
level property of failure ordering, which refers to the order
of persisted writes that the recovery observer sees. This pa-
per’s nondestructive undo logging leverages HTM to con-
trol persist ordering. Prior work introduces memory persis-
tency models, which extend memory consistency models to
incorporate the recovery observer [28, 30, 33, 43].
9 Conclusion
Nondestructive undo logging is a new crash-consistency
mechanism that leverages commodity HTM to persist a
transaction’s undo log entries before its persistent writes. Crafty
is a new design that uses nondestructive undo logging to
provide persistent transactions. An evaluation shows that
Crafty performs well compared with non-durable transac-
tions and has better performance than state-of-the-art per-
sistent transaction designs. These results show the potential
for efficient persistent transactions using today’s computing
systems.
Acknowledgments
Many thanks to Daniel Castro for making the NV-HTM im-
plementation publicly available and providing help using
it. Thanks to Steve Blackburn, Jake Roemer, and Tomoharu
Ugawa for helpful discussions and feedback. We thank the
anonymous reviewers and our shepherd, Erez Petrank, for
feedback and suggestions that improved the final paper.
This material is based upon work supported by the National
Science Foundation under Grants CAREER-1253703,
XPS-1629126, CNS-1613023, CNS-1703598, and CNS-1763172,
and by the Office of Naval Research under Grants
N00014-16-1-2913 and N00014-18-1-2037.
References
[1] Joy Arulraj, Justin Levandoski, Umar Farooq Minhas, and Per-Ake Lar-
son. 2018. BzTree: A High-Performance Latch-free Range Index for
Non-Volatile Memory. VLDB 11, 5 (2018), 553–565.
[2] Joy Arulraj, Matthew Perron, and Andrew Pavlo. 2016. Write-Behind
Logging. VLDB 10, 4 (2016), 337–348.
[3] Hillel Avni and Trevor Brown. 2016. Persistent Hybrid Transactional
Memory for Databases. VLDB 10, 4 (2016), 409–420.
[4] Hillel Avni, Eliezer Levy, and Avi Mendelson. 2015. Hardware Trans-
actions in Nonvolatile Memory. In DISC. 617–630.
[5] Hans-J. Boehm and Dhruva R. Chakrabarti. 2016. Persistence Pro-
gramming Models for Non-volatile Memory. In ISMM. 55–67.
[6] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle
Olukotun. 2008. STAMP: Stanford Transactional Applications for
Multi-Processing. In IISWC.
[7] Daniel Castro, Paolo Romano, and João Barreto. 2018. Hardware
Transactional Memory Meets Memory Persistency. In IPDPS. 368–
377.
[8] Dhruva R. Chakrabarti, Hans-J. Boehm, and Kumud Bhandari. 2014.
Atlas: Leveraging Locks for Non-volatile Memory Consistency. In
OOPSLA. 433–452.
[9] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Ra-
jesh K. Gupta, Ranjit Jhala, and Steven Swanson. 2011. NV-Heaps:
Making Persistent Objects Fast and Safe with Next-generation, Non-
volatile Memories. In ASPLOS. 105–118.
[10] Nachshon Cohen, Michal Friedman, and James R. Larus. 2017. Effi-
cient Logging in Non-Volatile Memory by Exploiting Coherency Pro-
tocols. PACMPL 1, OOPSLA, Article 67 (2017), 24 pages.
[11] Andreia Correia, Pascal Felber, and Pedro Ramalhete. 2018. Romulus:
Efficient Algorithms for Persistent Transactional Memory. In SPAA.
271–282.
[12] Luke Dalessandro and Michael L. Scott. 2009. Strong Isolation is a
Weak Idea. In TRANSACT.
[13] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip
Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. 2014. Sys-
tem Software for Persistent Memory. In EuroSys. 15:1–15:15.
[14] Ellis Giles, Kshitij Doshi, and Peter Varman. 2017. Continuous Check-
pointing of HTM Transactions in NVM. In ISMM. 70–81.
[15] Ellis Giles, Kshitij Doshi, and Peter Varman. 2018. Hardware Transac-
tional Persistent Memory. In MEMSYS. 190–205.
[16] Ellis R. Giles, Kshitij Doshi, and Peter Varman. 2015. SoftWrAP: A
Lightweight Framework for Transactional Support of Storage Class
Memory. In MSST. 1–14.
[17] Vaibhav Gogte, Stephan Diestelhorst, William Wang, Satish
Narayanasamy, Peter M. Chen, and Thomas F. Wenisch. 2018.
Persistency for Synchronization-Free Regions. In PLDI. 46–61.
[18] Jinyu Gu, Qianqian Yu, Xiayang Wang, Zhaoguo Wang, Binyu Zang,
Haibing Guan, and Haibo Chen. 2019. Pisces: A Scalable and Efficient
Persistent Transactional Memory. In USENIX. 913–928.
[19] Tim Harris and Keir Fraser. 2003. Language Support for Lightweight
Transactions. In OOPSLA. 388–402.
[20] Tim Harris, James Larus, and Ravi Rajwar. 2010. Transactional Mem-
ory (2nd ed.). Morgan and Claypool Publishers.
[21] Maurice Herlihy and J. Eliot B. Moss. 1993. Transactional Memory:
Architectural Support for Lock-Free Data Structures. In ISCA. 289–
300.
[22] Terry Ching-Hsiang Hsu, Helge Brügner, Indrajit Roy, Kimberly Kee-
ton, and Patrick Eugster. 2017. NVthreads: Practical Persistence for
Multi-threaded Applications. In EuroSys. 468–482.
[23] Qingda Hu, Jinglei Ren, Anirudh Badam, Jiwu Shu, and Thomas
Moscibroda. 2017. Log-Structured Non-Volatile Main Memory. In
USENIX. 703–717.
[24] Jian Huang, Karsten Schwan, and Moinuddin K. Qureshi. 2014.
NVRAM-aware Logging in Transaction Systems. VLDB 8, 4 (2014),
389–400.
[25] Intel Corporation. 2018. 3D XPoint™: A Breakthrough in Non-
Volatile Memory Technology. https://www.intel.com/content/www/us/en/architecture-and-technology/intel-micron-3d-xpoint-webcast.html.
[26] Joseph Izraelevitz, Terence Kelly, and Aasheesh Kolli. 2016. Failure-
Atomic Persistent Memory Updates via JUSTDO Logging. In ASPLOS.
427–442.
[27] Joseph Izraelevitz, Lingxiang Xiang, and Michael L. Scott. 2017.
Performance Improvement via Always-Abort HTM. In PACT. 79–90.
[28] Arpit Joshi, Vijay Nagarajan, Marcelo Cintra, and Stratis Viglas. 2015.
Efficient Persist Barriers for Multicores. In MICRO. 660–671.
[29] Arpit Joshi, Vijay Nagarajan, Marcelo Cintra, and Stratis Viglas. 2018.
DHTM: Durable Hardware Transactional Memory. In ISCA. 452–465.
[30] Aasheesh Kolli, Vaibhav Gogte, Ali Saidi, Stephan Diestelhorst, Pe-
ter M. Chen, Satish Narayanasamy, and Thomas F. Wenisch. 2017.
Language-level persistency. In ISCA. 481–493.
[31] Aasheesh Kolli, Steven Pelley, Ali Saidi, Peter M. Chen, and Thomas F.
Wenisch. 2016. High-Performance Transactions for Persistent Memo-
ries. In ASPLOS. 399–411.
[32] Aasheesh Kolli, Steven Pelley, Ali Saidi, Peter M. Chen, and Thomas F.
Wenisch. 2016. High-Performance Transactions for Persistent Memo-
ries. In ASPLOS. 399–411.
[33] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen,
and T. F. Wenisch. 2016. Delegated persist ordering. In MICRO. 1–13.
[34] Emre Kultursay, Mahmut Kandemir, Anand Sivasubramaniam, and
Onur Mutlu. 2013. Evaluating STT-RAM as an Energy-Efficient Main
Memory Alternative. 256–267.
[35] Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett
Witchel, and Thomas Anderson. 2017. Strata: A Cross Media File Sys-
tem. In SOSP. 460–477.
[36] Leslie Lamport. 1978. Time, Clocks, and the Ordering of Events in a
Distributed System. CACM 21, 7 (1978), 558–565.
[37] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. 2009.
Architecting Phase Change Memory As a Scalable Dram Alternative.
In ISCA. 2–13.
[38] Mengxing Liu, Mingxing Zhang, Kang Chen, Xuehai Qian, Yongwei
Wu, Weimin Zheng, and Jinglei Ren. 2017. DudeTM: Building Durable
Transactions with Decoupling for Persistent Memory. In ASPLOS.
329–343.
[39] Qingrui Liu, Joseph Izraelevitz, Se Kwon Lee, Michael L Scott, Sam H
Noh, and Changhee Jung. 2018. iDO: Compiler-Directed Failure
Atomicity for Nonvolatile Memory. In MICRO. 258–270.
[40] Virendra Marathe, Achin Mishra, Amee Trivedi, Yihe Huang, Faisal
Zaghloul, Sanidhya Kashyap, Margo Seltzer, Tim Harris, Steve Byan,
Bill Bridge, and Dave Dice. 2018. Persistent Memory Transactions.
arXiv:1804.00701
[41] Jason Mars and Naveen Kumar. 2012. BlockChop: Dynamic Squash
Elimination for Hybrid Processor Architecture. In ISCA. 536–547.
[42] Amirsaman Memaripour, Anirudh Badam, Amar Phanishayee, Yanqi
Zhou, Ramnatthan Alagappan, Karin Strauss, and Steven Swanson.
2017. Atomic In-Place Updates for Non-Volatile Main Memories with
Kamino-Tx. In EuroSys. 499–512.
[43] Steven Pelley, Peter M. Chen, and Thomas F. Wenisch. 2014. Memory
Persistency. In ISCA. 265–276.
[44] Azalea Raad, John Wickerson, and Viktor Vafeiadis. 2019. Weak Per-
sistency Semantics from the Ground up: Formalising the Persistency
Semantics of ARMv8 and Transactional Models. PACMPL 3, OOPSLA,
Article 135 (2019).
[45] Ravi Rajwar and James R. Goodman. 2001. Speculative Lock Eli-
sion: Enabling Highly Concurrent Multithreaded Execution. In MI-
CRO. 294–305.
[46] Pedro Ramalhete, Andreia Correia, Pascal Felber, and Nachshon Co-
hen. 2019. OneFile: A Wait-Free Persistent Transactional Memory. In
DSN. 151–163.
[47] Carl G. Ritson and Frederick R.M. Barnes. 2013. An Evaluation of
Intel’s Restricted Transactional Memory for CPAs. In CPA. 271–292.
[48] Andy Rudoff. 2017. Persistent Memory Programming. Login: The
Usenix Magazine (2017), 34–40. Issue 2.
[49] Andy M. Rudoff. 2016. Deprecating the PCOMMIT Instruction.
https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction.
[50] Aritra Sengupta, Man Cao, Michael D. Bond, and Milind Kulkarni.
2017. Legato: End-to-End Bounded Region Serializability Using Com-
modity Hardware Transactional Memory. In CGO. 1–13.
[51] Haris Volos, Andres Jaan Tack, and Michael M. Swift. 2011.
Mnemosyne: Lightweight Persistent Memory. In ASPLOS. 91–104.
[52] Chenxi Wang, Huimin Cui, Ting Cao, John Zigman, Haris Volos, Onur
Mutlu, Fang Lv, Xiaobing Feng, and Guoqing Harry Xu. 2019. Pan-
thera: Holistic Memory Management for Big Data Processing over
Hybrid Memories. In PLDI. 347–362.
[53] Zhaoguo Wang, Han Yi, Ran Liu, Mingkai Dong, and Haibo Chen.
2015. Persistent Transactional Memory. CAL 14 (2015), 58–61. Issue
1.
[54] H.-S. Philip Wong, Heng-Yuan Lee, Shimeng Yu, Yu-Sheng Chen, Yi
Wu, Pang-Shiu Chen, Byoungil L Lee, Frederick T. Chen, and Ming-
Jinn Tsai. 2012. Metal-Oxide RRAM. Proc. IEEE 100, 6 (2012), 1951–
1970.
[55] H.-S. Philip Wong, Simone Raoux, SangBum Kim, Jiale Liang, John P.
Reifenberg, Bipin Rajendran, Mehdi Asheghi, and Kenneth E. Good-
son. 2010. Phase Change Memory. Proc. IEEE 98, 12 (2010), 2201–2227.
[56] Zhenwei Wu, Kai Lu, Andrew Nisbet, Wenzhe Zhang, and Mikel Lu-
ján. 2020. PMThreads: Persistent Memory Threads Harnessing Ver-
sioned Shadow Copies. In PLDI.
[57] Jian Xu and Steven Swanson. 2016. NOVA: A Log-structured File Sys-
tem for Hybrid Volatile/Non-volatile Main Memories. In FAST. 323–
338.
[58] Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadhara-
iah, Amit Borase, Tamires Brito Da Silva, Steven Swanson, and Andy
Rudoff. 2017. NOVA-Fortis: A Fault-Tolerant Non-Volatile Main
Memory File System. In SOSP. 478–496.
[59] Richard M. Yoo, Christopher J. Hughes, Konrad Lai, and Ravi Rajwar.
2013. Performance Evaluation of Intel Transactional Synchronization
Extensions for High-Performance Computing. In SC. 19:1–19:11.
[60] Hanbin Yoon, Justin Meza, Naveen Muralimanohar, Norman P. Jouppi,
and Onur Mutlu. 2014. Efficient Data Mapping and Buffering
Techniques for Multilevel Cell Phase-Change Memories. TACO 11, 4 (2014),
40:1–40:25.
[61] Pantea Zardoshti, Tingzhe Zhou, Yujie Liu, and Michael Spear. 2019.
Optimizing Persistent Memory Transactions. In PACT. 219–231.
[62] Yiying Zhang, Jian Yang, Amirsaman Memaripour, and Steven Swan-
son. 2015. Mojim: A Reliable and Highly-Available Non-Volatile Mem-
ory System. In ASPLOS. 3–18.
[63] Pengfei Zuo, Yu Hua, and Jie Wu. 2018. Write-Optimized and High-
Performance Hashing Index Scheme for Persistent Memory. In OSDI.
461–476.
A Additional Results
This section contains additional results that supplement the
main paper’s results.
                        Thread count
Benchmark                     1      2      4      8     12     15     16
Bank (medium)              10.0   10.0   10.0   10.0   10.0   10.0   10.0
Bank (high)                10.0   10.0   10.0   10.0   10.0   10.0   10.0
Bank (none)                10.0   10.0   10.0   10.0   10.0   10.0   10.0
B+ tree (mixed)            13.3   13.3   13.3   13.3   13.2   13.2   13.2
B+ tree (insert only)      14.0   14.0   14.0   14.0   14.0   14.0   14.0
kmeans (high)              25.0   25.0   25.0   25.0   25.0   25.0   25.0
kmeans (low)               25.0   25.0   25.0   25.0   25.0   25.0   25.0
vacation (high)             8.0    8.0    8.0    8.0    8.0    8.0    8.0
vacation (low)              5.5    5.5    5.5    5.5    5.5    5.5    5.5
labyrinth                 177.6  177.4  177.1  176.3  175.4  175.1  174.9
ssca2                       2.0    2.0    2.0    2.0    2.0    2.0    2.0
genome                      2.1    2.1    2.0    2.0    2.1    2.1    2.0
intruder                    1.8    1.8    1.8    1.8    1.8    1.8    1.8
Table 1. Number of writes per executed persistent transaction
on average, for each evaluated thread count.
Persistent writes per transaction. Table 1 shows the average
number of writes executed by each persistent transaction.
Because Crafty amortizes persist latency across all
writes in a transaction, it reduces latency relative to
approaches that incur per-write overhead whenever a transaction
executes multiple writes. On the other hand, long
transactions are more likely to abort due to capacity constraints
and conflicts with other threads.
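The amortization argument above can be illustrated with a toy cost model. The sketch below is only an illustration, not Crafty's implementation: it assumes that an amortizing design pays one persist latency per transaction while a per-write design pays one persist latency per write, and it ignores aborts, logging, and every other cost. The function and variable names are hypothetical. The write counts are taken from Table 1, and the 300 ns figure is the emulated persist latency used in the main paper.

```python
# Toy model (an assumption, not the paper's measurement methodology):
# an amortizing design pays one persist latency per transaction,
# a per-write design pays one persist latency for every write.

def persist_latency_per_write_ns(persist_latency_ns, writes_per_txn, amortized=True):
    """Return the persist latency charged to each write, in nanoseconds."""
    if writes_per_txn <= 0:
        raise ValueError("writes_per_txn must be positive")
    if amortized:
        # One persist latency is shared by all writes in the transaction.
        return persist_latency_ns / writes_per_txn
    # Every write pays the full persist latency.
    return float(persist_latency_ns)


# Average writes per persistent transaction, copied from Table 1.
TABLE1_WRITES = {
    "kmeans (high)": 25.0,
    "vacation (high)": 8.0,
    "intruder": 1.8,
}


def amortized_costs(persist_latency_ns=300):
    """Per-write persist latency under the amortized model for each benchmark."""
    return {
        name: persist_latency_per_write_ns(persist_latency_ns, writes)
        for name, writes in TABLE1_WRITES.items()
    }


if __name__ == "__main__":
    for name, cost in amortized_costs().items():
        print(f"{name}: {cost:.1f} ns per write (vs. 300.0 ns unamortized)")
```

Under this simplified model, benchmarks with many writes per transaction (such as kmeans) benefit most from amortization, while benchmarks with about two writes per transaction (such as intruder) benefit least.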
Transaction breakdowns. The following pages contain
figures that present the breakdowns for persistent transactions
and hardware transactions executed. For each benchmark,
the figure contains two bar graphs: one for the breakdown
of persistent transactions and the other for the breakdown
of hardware transactions.
The persistent transaction breakdown shows how each
persistent transaction was completed. For Non-durable,
DudeTM, and NV-HTM, persistent transactions can be completed
using a hardware transaction (labeled Non-Crafty) or the
SGL fallback. For Crafty, persistent transactions can be
completed using the Redo or Validate phase or the SGL
fallback. An exception is for Read Only transactions, for which
Crafty skips the Redo and Validate phases. (Non-durable,
DudeTM, and NV-HTM also perform read-only transactions,
but the graphs categorize them as Non-Crafty.)
The hardware transaction breakdown shows the outcome
of each hardware transaction: either a commit or a conflict,
capacity, explicit, or “zero” abort. Conflict aborts occur if
multiple concurrent transactions perform conflicting
accesses to the same cache line. Capacity aborts occur if the
transaction accesses more cache lines than HTM can handle.
Explicit aborts occur if the program explicitly requests an
abort as part of its programming, or if a Redo or Validate
transaction aborts due to failed checks (i.e., line 6 in
Algorithm 2 or line 8 in Algorithm 3). A “zero” abort is an abort
that does not fit into any of these categories. For example,
a transaction that triggers a page fault, executes a system
call, or receives an interrupt will cause a zero abort. The
figures count every executed hardware transaction; for Crafty,
these counts include transactions performed by Crafty’s Log,
Redo, and Validate phases, including the Log phase
in an SGL section.
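As a rough illustration of these outcome categories, the following sketch classifies a hardware transaction from an abort status word. It assumes the bit layout that Intel RTM reports through _xbegin() (explicit = bit 0, retry = bit 1, conflict = bit 2, capacity = bit 3). The function name classify_outcome, the COMMITTED sentinel, and the priority order used when several bits are set are hypothetical choices made for this example; they are not taken from Crafty's implementation.

```python
from collections import Counter

# Abort status bits as defined for Intel RTM (immintrin.h).
XABORT_EXPLICIT = 1 << 0   # transaction executed an explicit abort
XABORT_RETRY = 1 << 1      # hardware hints that a retry may succeed
XABORT_CONFLICT = 1 << 2   # conflicting access to a cache line
XABORT_CAPACITY = 1 << 3   # transaction exceeded HTM capacity

COMMITTED = None  # hypothetical sentinel meaning "the transaction committed"


def classify_outcome(status):
    """Map a status word to one of the five categories used in the figures."""
    if status is COMMITTED:
        return "commit"
    # Priority order below is an assumption for statuses with several bits set.
    if status & XABORT_EXPLICIT:
        return "explicit"
    if status & XABORT_CONFLICT:
        return "conflict"
    if status & XABORT_CAPACITY:
        return "capacity"
    # No recognized cause bit, e.g. after a page fault, system call, or interrupt.
    return "zero"


def breakdown(statuses):
    """Count outcomes over every executed hardware transaction."""
    return Counter(classify_outcome(s) for s in statuses)


if __name__ == "__main__":
    sample = [COMMITTED, XABORT_CONFLICT | XABORT_RETRY, XABORT_CAPACITY, 0, COMMITTED]
    print(breakdown(sample))
```

A per-benchmark bar in the hardware transaction breakdown then corresponds to one such Counter, accumulated over all hardware transactions that a configuration executes.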
Sensitivity to NVM latency. The last several figures present
the same performance results as the main paper, but
emulate an NVM persist latency of 100 ns (instead of 300 ns as
in the main paper). These results help to show how much
performance cost is due to NVM latency, and they represent
the expected performance if the NVM controller includes a
buffer as part of the persistence domain [48] (Section 2.2).
[Figure 9: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 9. Persistent and hardware transaction breakdowns for the bank microbenchmark (high contention).
[Figure 10: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 10. Persistent and hardware transaction breakdowns for the bank microbenchmark (medium contention).
[Figure 11: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 11. Persistent and hardware transaction breakdowns for the bank microbenchmark (no contention).
[Figure 12: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 12. Persistent and hardware transaction breakdowns for the B+ tree microbenchmark with insert operations only.
[Figure 13: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 13. Persistent and hardware transaction breakdowns for the B+ tree microbenchmark with mixed operations.
[Figure 14: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 14. Persistent and hardware transaction breakdowns for kmeans (high contention).
[Figure 15: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 15. Persistent and hardware transaction breakdowns for kmeans (low contention).
[Figure 16: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 16. Persistent and hardware transaction breakdowns for vacation (high contention).
[Figure 17: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 17. Persistent and hardware transaction breakdowns for vacation (low contention).
[Figure 18: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 18. Persistent and hardware transaction breakdowns for labyrinth.
[Figure 19: bar charts for each configuration (Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo) at thread counts 1, 2, 4, 8, 12, 15, and 16. (a) Persistent transaction breakdowns (Non-Crafty, Read Only, Redo, Validate, SGL). (b) Hardware transaction breakdowns (Commit, Conflict, Capacity, Explicit, Zero).]
Figure 19. Persistent and hardware transaction breakdowns for ssca2.
[Figure 20 plots omitted. Panels: (a) Persistent transaction breakdowns (y-axis: persistent transactions; categories: Non-Crafty, Read Only, Redo, Validate, SGL); (b) Hardware transaction breakdowns (y-axis: hardware transactions; categories: Commit, Conflict, Capacity, Explicit, Zero). x-axis: thread count (1, 2, 4, 8, 12, 15, 16). Configurations: Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo.]
Figure 20. Persistent and hardware transaction breakdowns for genome.
[Figure 21 plots omitted. Panels: (a) Persistent transaction breakdowns (y-axis: persistent transactions; categories: Non-Crafty, Read Only, Redo, Validate, SGL); (b) Hardware transaction breakdowns (y-axis: hardware transactions; categories: Commit, Conflict, Capacity, Explicit, Zero). x-axis: thread count (1, 2, 4, 8, 12, 15, 16). Configurations: Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo.]
Figure 21. Persistent and hardware transaction breakdowns for intruder.
Cray: Efficient, HTM-Compatible Persistent Transactions
[Figure 22 plots omitted. Panels: (a) High contention; (b) Medium contention; (c) No contention. x-axis: thread count (1, 2, 4, 8, 12, 15, 16); y-axis: normalized throughput. Series: Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo.]
Figure 22. Throughput of Crafty and competing approaches, using the bank microbenchmark at three contention levels,
emulating an NVM latency of 100 ns (instead of 300 ns as in Figure 6).
[Figure 23 plots omitted. Panels: (a) Insert operations only; (b) Lookup, insert, and remove operations. x-axis: thread count (1, 2, 4, 8, 12, 15, 16); y-axis: normalized throughput. Series (legend shown in panel (a)): Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo.]
Figure 23. Throughput of Crafty and competing approaches, on the B+ tree microbenchmark, for mixed operations and insert
only, emulating an NVM latency of 100 ns (instead of 300 ns as in Figure 7).
Cray: Efficient, HTM-Compatible Persistent Transactions
[Figure 24 plots omitted. Panels: (a) kmeans (high contention); (b) kmeans (low contention); (c) vacation (high contention); (d) vacation (low contention); (e) labyrinth; (f) ssca2; (g) genome; (h) intruder. x-axis: thread count (1, 2, 4, 8, 12, 15, 16); y-axis: normalized throughput. Series: Non-durable, DudeTM, NV-HTM, Crafty, Crafty-NoValidate, Crafty-NoRedo.]
Figure 24. Throughput of Crafty and competing approaches, on the STAMP benchmarks, emulating an NVM latency of 100
ns (instead of 300 ns as in Figure 8).
