Hardware Transactional Persistent Memory by Giles, Ellis et al.
Hardware Transactional Persistent Memory
Ellis Giles
Rice University
Houston, TX 77005, USA
erg@rice.edu
Kshitij Doshi
Intel Corporation
Chandler, AZ 85226, USA
kshitij.a.doshi@intel.com
Peter Varman
Rice University
Houston, TX 77005, USA
pjv@rice.edu
ABSTRACT
Emerging Persistent Memory technologies (also PM, Non-Volatile
DIMMs, Storage Class Memory or SCM) hold tremendous promise
for accelerating popular data-management applications like in-
memory databases. However, programmers now need to deal with
ensuring the atomicity of transactions on Persistent Memory resi-
dent data and maintaining consistency between the order in which
processors perform stores and that in which the updated values
become durable.
The problem is especially challenging when high-performance isolation
mechanisms like Hardware Transactional Memory (HTM) are used for
concurrency control. This work shows how HTM transactions can be
ordered correctly and atomically into PM by the use of a novel
software protocol combined with a Persistent Memory Controller,
without requiring changes to processor cache hardware or HTM
protocols. In contrast, previous approaches require significant
changes to existing processor microarchitectures. Our approach,
evaluated using both micro-benchmarks and the STAMP suite, compares
well with standard (volatile) HTM transactions. It also yields
significant gains in throughput and latency in comparison with
persistent transactional locking.
1 INTRODUCTION
This paper provides a solution to the problems of adding durability
to concurrent in-memory transactions that use Hardware Trans-
actional Memory (HTM) for concurrency control while operating
on data in byte-addressable Non-Volatile Memory or Persistent
Memory (PM).
Recent years have witnessed a sharp shift towards real-time,
data-driven, and high-throughput applications. This has spurred
a broad adoption of in-memory and massively parallelized data
processing [1–3] across business, scientific, and industrial
application domains [4, 5]. Two emerging hardware developments provide
further impetus to this approach and raise the potential for
transformative gains in speed and scalability: (a) the arrival of
inexpensive, byte-addressable, and large capacity Persistent Memory
devices [6] eliminates the I/O operations and bottlenecks common to
many data management applications, and (b) the availability of CPU-based
transaction support (with HTM [7, 8]) makes it straightforward for
threads to work spontaneously in shared memory spaces without
having to synchronize explicitly.
To prevent corruption of state upon an untimely machine or
software failure, a sequence of store operations to PM in a
transactional section of code cannot be partially reflected into PM; nor can
it be transmitted piecemeal from processor caches into PM without
similarly risking significant loss of data. Operating on Persistent
Memory based data thus produces new atomicity and consistency
requirements. Software approaches for ensuring atomic durable
updates share some characteristics with HTM techniques in commercial
processors: both checkpoint state at some level of granularity
and guard against communication of partial updates. However,
different mechanisms are at play: while stable stores to persistent
media are usually obtained by covering updates with logging or
versioning, partial updates are prevented from being propagated
between threads in HTM transactions by CPUs sheltering them
until the transaction closes. Once an HTM transaction closes, its
updates become visible en masse through the cache hierarchy and
can travel in any order to memory DIMMs. However, stable storage
of the updates into Persistent Memory by the transaction additionally
requires an ability to reliably delineate its updates from those
by other overlapping transactions, and to use that delineation to
recover from an unanticipated machine restart.
Existing PM programming frameworks separate into categories
based on the degree of change they require in the processor ar-
chitecture for such problems. Several works [9–13] operate with
existing processor capabilities and write a log (either a write-ahead
log or an undo log) durably to cover data changes arriving via the
volatile cache hierarchy; some [14–16] require significant changes
to existing cache hardware and protocols in the processor
microarchitecture, while others [17–19] only require external controllers
not affecting the processor core.
Isolation among concurrent transactions in the above works
is achieved either by the use of two-phase locking protocols or
provided within the rubric of an STM [9]. Logging-based software
approaches are problematic for HTM transactions (e.g., Intel TSX),
which cannot bypass the caches in order to flush the log records
synchronously into Persistent Memory ahead of transaction closings.
For these, PTM [20] proposes changes to processor caches
while adding an on-chip scoreboard and global transaction id register
to couple HTM with PM. Recent work [13, 21, 22] has attempted
to provide inter-transactional isolation by employing processor-based
(HTM) [7, 8] mechanisms. However, these solutions all require
changes to the existing HTM semantics and implementations:
[21] and [22] propose a new instruction to perform a non-aborting
cache line flush from within an HTM, while [13] proposes allowing
non-aborting concurrent access to designated memory variables
within an HTM.
Selective and incremental changes to the clean isolation seman-
tics of HTM are not to be undertaken lightly; understanding their
impact on global system correctness and performance typically re-
quires long gestation periods before processor manufacturers will
embrace them. In this paper we provide a new solution to obtain
persistence consistency in Persistent Memory while using HTM
for concurrency control. The solution does not alter the processor
microarchitecture, but leverages a very simple, external Persistent
Memory controller along with a persistence protocol to supplement
the existing HTM semantics and allow HTM transactions to operate
at the speed of in-memory volatile transactions. The solution,
while it achieves the concurrency benefits of HTM for PM-based
data, also applies to non-HTM transactions in a straightforward
way.
arXiv:1806.01108v1 [cs.DC] 22 May 2018
2 OVERVIEW
2.1 HTM+PM Basics
Hardware Transactional Memory, or HTM, was introduced in [7]
as a new, easy-to-use method for lock-free synchronization supported
by hardware. The initial instructions for HTM included load
and store transactional instructions in addition to transactional
management instructions. Most HTM implementations extend
an underlying cache-coherency protocol to handle detection of
transactional conflicts during program execution. The hardware
system performs a speculative execution on a demarcated region of
code similar to an atomic section. Independent transactions (those
that do not write a shared variable) proceed unrestricted through
their HTM sections. Transactions which access common variables
concurrently in their HTM sections, with at least one transaction
performing a write, are serialized by the HTM. That is, all but one
of the transactions are aborted; an aborted transaction will restart
its HTM code at the beginning. Updates made within the HTM
section are hidden from other transactions and are prevented from
writing to memory until the transaction successfully completes the
HTM section. This mechanism provides atomicity (all-or-nothing)
semantics for individual transactions with respect to visibility by
other threads, and serialization of conflicting, dependent transactions.
However, HTM was originally designed for volatile memory
systems (rather than supporting database style ACID transactions)
and therefore any power failure leaves main memory in an
unpredictable state relative to the actual values of the transaction
variables.
Persistent Memory, or PM, introduces a new method of persis-
tence to the processor. PM, in the form of persistent DIMM, resides
on the main-memory bus alongside DRAM. Software can access
persistent memory using the usual LOAD and STORE instructions
used for DRAM. Like other memory variables, PM variables are
subject to forced and involuntary cache-evictions and encounter
other deferred memory operations done by the processor.
For Intel CPUs, the CLWB and CLFLUSHOPT instructions provide the
ability to flush modified data (at cache line granularity) out of
the processor cache hierarchy. These instructions, however,
are weakly ordered with respect to other store operations in the
instruction stream. Intel has extended the semantics of SFENCE
to cover such flushed store operations so that software can issue
SFENCE to prevent new stores from executing until previously
flushed data has entered a power-safe domain; i.e., the data is
guaranteed by hardware to reach its locations in the PM media. This
guarantee also applies to data that is written to PM with instructions
that bypass the processor caches. However, when executing within
an HTM transaction, a CPU cannot exercise CLWB, CLFLUSHOPT,
non-cacheable stores, or SFENCE instructions since the stores
by the CPU are considered speculative until the HTM transaction
completes successfully.
Even though HTM guarantees that transactional values are only
visible on transaction completion, hardware manufacturers cannot
simply utilize a non-volatile processor cache hierarchy or
battery-backed flushing of the cache on failures to provide transactional
atomicity. Transactions that do not complete before a software or
hardware restart produce partial and therefore inconsistent updates
in non-volatile memory, as there is no guarantee when a machine
halt will occur. The halt may happen during XEND execution,
leaving only partial updates in cache or write buffers, which can
corrupt in-memory data structures.
2.2 Challenges of persistent HTM transactions
Consider transactions A, B and C shown in Listings 1, 2 and 3.
Assume that w, x, y, z are persistent variables initialized to zero in
their home locations in Persistent Memory (PM). The code section
demarcated between the instructions XBegin and XEnd will be
referred to as an HTM transaction or simply a transaction. The
HTM mechanism ensures the atomicity of transaction execution.
Within an HTM transaction, all updates are made to private locations
in the cache, and the hardware guarantees that the updates are
not allowed to propagate to their home locations in PM. After the
XEnd instruction completes, all of the cache lines updated in the
transaction become instantaneously visible in the cache hierarchy.
Atomic Persistence: The first challenge is to ensure that the
transaction’s updates that were made atomically in the cache are also
persisted atomically in PM. Following XEnd, the transaction variables
are once again subject to the normal cache operations like
evictions and the use of cache write-back instructions. There are no
guarantees regarding whether or when the transaction’s updates
actually get written to PM from the cache. This can create a problem
if the machine crashes before all these updates are written back
to PM. On a reboot, the values of these variables in PM will be
inconsistent with the pre-crash transaction values. This leads to
the first requirement:
• Following crash recovery, ensure that all or none of the up-
dates of an HTM transaction are stored in their PM home
locations.
A common solution is to log the transaction updates in a sepa-
rate persistent storage area before allowing them to update their
PM home locations. Should a crash interrupt the updating of the
home locations, the saved log can be replayed. When transactions
execute within an HTM there is a problem with this solution, since
the log cannot be written to PM within the transaction and can
be written only after the XEnd. At that time the transaction updates
are also made visible in the cache hierarchy and are susceptible to
uncontrolled cache evictions into PM. Hence there is no guarantee
that the log has been persisted before transaction updates have
percolated into PM. We describe our solution in Section 3.
Listing 1: A
A() {
    XBegin;
    w = 1;
    x = w;
    XEnd;
}
Listing 2: B
B() {
    XBegin;
    w = w+1;
    y = w;
    XEnd;
}
Listing 3: C
C() {
    XBegin;
    w = w+1;
    z = w;
    XEnd;
}
Persistence Ordering: The second problem deals with ensuring
that the execution order of dependent HTM transactions is correctly
reflected in PM following crash recovery. As an example, consider
the dependent transactions A, B, C in Listings 1, 2 and 3. The HTM
will serialize their execution in some order: say A, B and C. The
values of the transaction variables following the execution of A are
given by the vector V1 = [w, x, y, z] = [1, 1, 0, 0]; after the execution
of B the vector becomes V2 = [2, 1, 2, 0] and finally following C it is
V3 = [3, 1, 2, 3]. Under normal operation the write backs of variables
to PM from different transactions may become arbitrarily interleaved.
For instance, suppose that x is evicted immediately after A
completes, w after B completes, and z after C completes. The persistent
memory state is then [2, 1, 0, 3]; should the machine crash, the
PM will contain this meaningless combination of values on reboot.
A consistent state would be either the initial vector [0, 0, 0, 0] or
one of V1, V2 or V3. This leads to the second requirement:
• Following crash recovery, ensure that the persistent state of
any sequence of dependent transactions is consistent with
their execution order.
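The interleaving in the example above can be checked with a small simulation. This is only an illustrative sketch: the dictionary model of the cache and PM and all function names are assumptions made here, not part of the protocol.

```python
ORDER = ("w", "x", "y", "z")

def txn_a(cache):               # Listing 1
    cache["w"] = 1
    cache["x"] = cache["w"]

def txn_b(cache):               # Listing 2
    cache["w"] = cache["w"] + 1
    cache["y"] = cache["w"]

def txn_c(cache):               # Listing 3
    cache["w"] = cache["w"] + 1
    cache["z"] = cache["w"]

def snapshot(state):
    """Return the values as the vector [w, x, y, z]."""
    return [state[name] for name in ORDER]

def simulate():
    cache = dict.fromkeys(ORDER, 0)   # latest values, held in the cache
    pm = dict.fromkeys(ORDER, 0)      # values that actually reached PM
    consistent = [snapshot(cache)]    # the initial vector [0, 0, 0, 0]
    # Serialized order A, B, C; one variable is evicted after each transaction.
    for txn, evicted in ((txn_a, "x"), (txn_b, "w"), (txn_c, "z")):
        txn(cache)
        consistent.append(snapshot(cache))   # V1, V2, V3
        pm[evicted] = cache[evicted]         # uncontrolled write-back to PM
    return snapshot(pm), consistent

pm_state, consistent_states = simulate()
print(pm_state)                       # [2, 1, 0, 3]
print(pm_state in consistent_states)  # False
```

The PM vector matches none of the initial state, V1, V2 or V3, which is exactly the situation the second requirement rules out.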
If individual transactions satisfy atomic persistence, then it is
sufficient to ensure that PM is updated in transaction execution
order. With software concurrency control (using an STM or two-phase
transaction locking), it is straightforward to correctly order
the updates simply by saving a transaction’s log before it commits
and releases its locks. In case of a crash, the saved logs are simply
replayed in the order they were saved, thereby reconstructing
persistent state to a correctly-ordered prefix of the executed
transactions.
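A minimal sketch of this replay step follows. The log layout (a list of (address, value) pairs), the function name `replay`, and the assumed PM contents at the crash are illustrative choices, not taken from the paper. Replaying the logs of Listings 1 and 2 in the order they were saved rebuilds V2, while replaying them out of order does not.

```python
def replay(saved_logs, pm):
    """Apply each saved write set to PM in the order the logs were saved."""
    for write_set in saved_logs:
        for address, value in write_set:
            pm[address] = value
    return pm

# Write sets that A and B (Listings 1 and 2) would log when run in that order.
log_a = [("w", 1), ("x", 1)]
log_b = [("w", 2), ("y", 2)]

# Assumed PM contents at the crash: only some of A's and B's updates arrived.
crashed_pm = {"w": 1, "x": 0, "y": 2, "z": 0}

in_order = replay([log_a, log_b], dict(crashed_pm))
out_of_order = replay([log_b, log_a], dict(crashed_pm))

print(in_order)      # {'w': 2, 'x': 1, 'y': 2, 'z': 0}, i.e. V2
print(out_of_order)  # {'w': 1, 'x': 1, 'y': 2, 'z': 0}, not a consistent state
```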
When HTM is used for concurrency control the logs can only
be written to PM after the transaction XEnd. At that time other
dependent transactions can execute and race with the completed
transaction, perhaps writing out their logs before the first. Solutions
like using an atomic counter within transactions to order them
correctly are not practical since the shared counter will result in
HTM-induced aborts and serialization of all transactions. Some
papers have advocated that processor manufacturers alter HTM
semantics and implementation to allow selective writes to PM from
within an HTM [13, 21, 22]. We describe our solution without the
need for such intrusive processor changes in Section 3.
Strict and Relaxed Durability: In traditional ACID databases, a
committed transaction is guaranteed to be durable since its log is
made persistent before it commits. We refer to this property as
strict durability. In HTM transactions the log is written to PM after
the XEnd instruction, some time before the transaction commits. A
natural question is to characterize the time at which it is safe for a
transaction requiring strict durability to commit.
It is generally not safe to commit a transaction Y at the time it
completes persisting its log, for the same reason that it is difficult
to ensure persistence ordering. Due to races in the code outside
the HTM, it is possible for an earlier transaction X (on which Y
depends) to have completed but not yet persisted its log. When
recovering from a crash that occurs at this time, the log of Y should
not be replayed since the earlier transaction X cannot be replayed. This
leads to the third requirement:
• Following crash recovery, strict durability requires that every
committed transaction is persistent in PM.
We define a new property known as relaxed durability that allows
an individual transaction to opt for an early commit immediately
after it persists its log. Requiring relaxed or strict durability is
a local choice made individually by a transaction based on the
application requirements. Transactions choosing relaxed durability
face a window of vulnerability after they commit, during which a
crash may result in their transaction updates not being reflected
in PM after recovery. The gain is potentially reduced transaction
latency. However, irrespective of the choice of the durability model
by individual transactions, the underlying persistent memory will
always recover to a consistent state, reflecting an ordered atomic
prefix of every sequence of dependent transactions.
3 OUR APPROACH
Our approach achieves durability of HTM transactions to Persistent
Memory by a cooperative protocol involving three components: a
back-end Persistent Memory Controller, transaction execution and
logging, and a failure recovery procedure. The Persistent Memory
Controller intercepts dirty cache lines evicted from the last-level
processor cache (LLC) on their way to persistent memory. An
intercepted cache line is held in a FIFO queue within the controller until
it is safe to write it out to PM. All memory variables are subject to
the normal cache operations and are fetched and evicted according
to normal cache protocols. The only change we introduce is
interposing the external controller between the LLC and memory.
Note that the controller does not require changing any of the
internal processor behavior. The controller simply delays the evicted
cache lines on their way to memory until it can guarantee safety. It
is pre-programmed with the address range of a region of persistent
memory that is reserved for holding transaction logs. Addresses in
the log region pass through the controller without intervention.
HTM+PM transactions execute independently. Within an outer-
envelope that achieves consistency of updates between the volatile
cache hierarchy and the durable state in PM, these transactions
use an unmodified HTM to serialize the computational portions of
conflicting transactions. A transaction (1) notifies the controller
when it opens and closes, (2) saves start and end timestamps in
PM to enable consistent recovery after a failure, (3) performs its
HTM operation, and (4) persists a log of its updates in PM before
closing. If a transaction requires strict durability it informs the
controller during its closing step, and then waits for the go-ahead
from the controller before committing. If it only needs relaxed
durability it can commit immediately after its close. The recovery
routine is invoked after a system crash to restore the PM variables
to a valid state, i.e., a state that is consistent with the actual execution
order of every sequence of dependent transactions. The recovery
procedure uses the saved logs to recover the values of the updated
variables, and the saved start and end timestamps to determine
which logs are eligible for replay and their replay order.
3.1 Transaction Lifecycle
A transaction can be viewed as progressing through five states:
OPEN, COMPUTE, LOG, CLOSE and COMMIT, as shown in Listing 4.
When a transaction begins, it calls the library function OpenWrapC
(see Algorithm 1 in Section 4.2). This function invokes the Persistent
Memory Controller with a small unique integer (wrapId) that
identifies the transaction. The controller adds wrapId to a set of
currently open transactions (referred to as COT) that it maintains
(see Algorithm 2 in Section 4.3). The transaction then allocates and
initializes space in PM for its log and updates the startTime record
of the log. The startTime is obtained by reading a system-wide
platform timer using the RDTSCP instruction (see Section 4.1). In
addition to startTime, a log includes a second timestamp persistTime
that will be set just prior to completing the HTM transaction. The
writeSet is a sequence of (address, value) pairs in the log that will be
filled with the locations and values of the updates made by the HTM
transaction. The log with its recorded startTime is then persisted
using cache line write-back instructions (CLWB) and SFENCE.
The transaction then enters the COMPUTE state by executing
XBegin and entering the HTM code section. Within the HTM section,
the transaction updates writeSet with the persistent variables
that it writes. Note that the records in writeSet will be held in cache
during the COMPUTE state since it occurs within an HTM and cannot
be propagated to PM until XEnd completes. Immediately before
XEnd the transaction obtains a second timestamp persistTime that
will be used to order the transactions correctly. This timestamp is
also obtained using the same RDTSCP instruction.
Aer executing XEnd, a transaction next enters the LOG state. It
ushes its log records from cache hierarchy into PM using cache
line write-back instructions (CLWB or CLFLUSHOPT), following the
last of them with an SFENCE. is ensures that all the log records
have been persisted. In addition to startTime , a log includes the
persistTime time stamp that was set just prior to completing the
transaction. ewriteSet records in the log hold (address; value)
pairs representing the locations and the values updated by the
transaction. Aer the SFENCE following the log record ushes, the
transaction enters the CLOSE state.
In the CLOSE state the transaction signals the Persistent Memory
Controller that its log has been persisted in PM. The controller
removes the transaction from its set of currently open transactions
COT. It also reflects the closing in the state of evicted cache lines
in its FIFO buffer as described below in Section 3.2. A transaction
requiring strict durability informs the controller at this time; the
controller will signal the transaction in due course when it is safe
to commit, i.e., its updates are guaranteed to be durable.
The transaction is then complete and enters the COMMIT state. If
it requires strict durability it waits until it is signaled by the controller.
Otherwise, it immediately commits and leaves the system.
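The five states can be walked through in a small software model. The sketch below is a hedged illustration only: a `threading.Lock` stands in for the XBegin/XEnd section, a counter stands in for the RDTSCP platform timer, a dictionary stands in for the PM log region, and `MockController`, `persist` and `run_transaction` are hypothetical names. It does not model aborts, real cache flushes, or the controller's delayed signal for strict durability.

```python
import itertools
import threading

_platform_timer = itertools.count(1)   # stand-in for RDTSCP (assumption)
_htm_section = threading.Lock()        # stand-in for XBegin/XEnd (assumption)


class MockController:
    """Hypothetical controller stub that only tracks the COT set."""

    def __init__(self):
        self.cot = set()

    def notify_open(self, wrap_id):
        self.cot.add(wrap_id)

    def notify_close(self, wrap_id, strict):
        # A real controller would also record a strict-durability request here.
        self.cot.discard(wrap_id)


def persist(pm_logs, wrap_id, log):
    """Model 'CLWB + SFENCE' of the log by copying it into a PM dictionary."""
    pm_logs[wrap_id] = {**log, "writeSet": list(log["writeSet"])}


def run_transaction(wrap_id, body, memory, pm_logs, controller, strict=False):
    trace = ["OPEN"]
    controller.notify_open(wrap_id)
    log = {"startTime": next(_platform_timer), "persistTime": None,
           "writeSet": [], "complete": False}
    persist(pm_logs, wrap_id, log)

    trace.append("COMPUTE")
    with _htm_section:
        def store(address, value):
            memory[address] = value
            log["writeSet"].append((address, value))
        body(memory, store)
        end_time = next(_platform_timer)   # second timestamp, just before XEnd

    trace.append("LOG")
    log["persistTime"] = end_time
    log["complete"] = True                 # plays the role of an end-of-log marker
    persist(pm_logs, wrap_id, log)

    trace.append("CLOSE")
    controller.notify_close(wrap_id, strict)

    trace.append("COMMIT")                 # a strict transaction would wait here
    return trace


def body_a(memory, store):                 # transaction A from Listing 1
    store("w", 1)
    store("x", memory["w"])


memory, pm_logs, controller = {"w": 0, "x": 0}, {}, MockController()
trace = run_transaction(1, body_a, memory, pm_logs, controller)
print(trace)
print(pm_logs[1]["writeSet"])
```

Note that the log is persisted twice: once in OPEN with only startTime, and once in LOG with persistTime and the write set, mirroring the two persist steps described above.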
3.2 Persistent Memory Controller
The Persistent Memory Controller is shown in Figure 1. While
superficially similar to an earlier design [18, 19, 23], this controller
includes enhancements to handle the subtleties of using HTM rather
than locking for concurrency control, and makes significant
simplifications by shifting some of the responsibility for the maintenance
of PM state to the recovery protocol.
Listing 4: Transaction Structure
Transaction Begin
---------- State: OPEN ----------------
1  NotifyPMController(open);
2  Allocate a Log structure in PM;
3  nowTime = ReadPlatformCounter();
4  Log.startTime = nowTime;
5  Log.writeSet = {};
6  Persist Log in PM;
---------- State: COMPUTE -------------
   XBegin
     // Transaction Body of HTM
     // All reads and writes to transaction
     // variables are performed and also
     // appended to Log.writeSet.
7    endTime = ReadPlatformCounter();
   XEnd
---------- State: LOG -----------------
8  Log.persistTime = endTime;
9  Persist Log in PM;
---------- State: CLOSE ---------------
10 NotifyPMController(close);
---------- State: COMMIT --------------
11 If (strict durability requested)
     Wait for notification by controller;
12 Transaction End
Figure 1: PM Controller with Volatile Delay Buffer, Current
Open Transactions Set, and Dependency Wait Queue
A crucial function of the controller is to prevent any transaction’s
updates from reaching PM until it is safe to do so,
without requiring detailed book-keeping about which transactions,
currently active or previously completed, generated a specific
update. It does this by enforcing two requirements before allowing
an evicted dirty cache line (representing a transaction’s update)
to proceed to PM: (i) ensuring that the log of the transaction
has been made persistent, and (ii) guaranteeing that the saved log
will be replayed during recovery. The first condition is needed (but
not sufficient) for atomic persistence by guarding against a failure
that occurs after only a subset of a transaction’s updates have been
persisted in PM. The second requirement arises because transaction
logs are persisted outside the HTM, and there is no relation between
the order in which the transactions execute their HTM and the order
in which their logs are persisted. To maintain correct persistence
ordering, the recovery routine may not be able to replay a log. We
illustrate the issue in the example below.
Example: Consider two dependent transactions A and B, A: {w =
3; x = 1;} and B: {y = w+1; z = 1;}. Assume that HTM transaction A
executes before B, but that B persists its logs and closes before A.
Suppose there is a crash just after B closes. The recovery routine
will not replay either A or B, since the log of the earlier transaction
A is not available. This behavior is correct.
Now consider the situation where y (value 4) is evicted from the
cache and then written to PM after B persists its log. Once again,
following a crash at this time, neither log will be replayed. However,
atomic persistence is now violated since B’s updates are partially
reflected in PM. Note that this violation occurred even though the
write back of y to PM happened after B’s log was persisted. The
Persistent Memory Controller protocol prevents a write back to
PM unless it can also guarantee that the log of the transaction
creating the update will be played back on recovery (see Lemmas
C2 and C4 in Section 3.4).
The second function of the controller is to track when it is safe
for a transaction requiring strict durability to commit. It is not
sufficient to commit when a transaction’s logs are persistent on
PM since, as seen in the example above, the recovery routine may
not replay the log if that would violate persistency ordering. The
controller protocol effectively delays a strict durability transaction
τ from committing until the earliest open transaction has a
startTime greater than the persistTime of τ. This is because the
recovery protocol will replay a log (see Section 3.3) if and only if
all transactions with startTime less than its persistTime have closed.
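The commit condition for a strict durability transaction can be written as a one-line predicate. This sketch only states the condition; the function name and the idea of passing the start times of open transactions as a list are assumptions, and it says nothing about how the controller actually detects the condition (see Section 4.3).

```python
def safe_to_commit(persist_time, open_start_times):
    """True when every still-open transaction started after persist_time."""
    return all(start > persist_time for start in open_start_times)


print(safe_to_commit(10, [12, 15]))  # True: all open transactions started later
print(safe_to_commit(10, [8, 12]))   # False: one open transaction started earlier
print(safe_to_commit(10, []))        # True: no transaction is open
```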
Implementation Overview:
The controller tracks transactions by maintaining a COT (currently
open transactions) set S. When a transaction opens, its
identifier is added to COT and when the transaction closes it is
removed. The write into PM of a cache line C evicted into the
Persistent Memory Controller is deferred by placing it at the tail
of a FIFO queue maintained by the controller. The cache line is
also assigned a tag called its dependency set, initialized with S, the
value of COT at the instant that C entered the Persistent Memory
Controller.
The controller holds the evicted instance of C in a FIFO until
all transactions that are in its dependency set (i.e., S) have closed.
When a transaction closes it is removed from both the COT and from
the dependency sets of all the FIFO entries. When the dependency
set of a cache line in the FIFO becomes empty, it is eligible to be
flushed to PM. One can see that the dependency sets will become
empty in the order in which the cache lines were evicted, since a
transaction still in the dependency set of C when a new eviction
enters the FIFO will also be in the dependency set of the new entry.
The simple protocol guarantees that all transactions that opened
before cache line C was evicted into the controller (which must
also include the transaction that last wrote C) must have closed and
persisted their logs when C becomes eligible to be written to PM.
This also implies that all transactions with startTime less than the
persistTime of the transaction that last wrote C would have closed,
satisfying the condition for log replay. Hence the cache line can be
safely written to PM without violating atomic persistence.
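The COT set, the FIFO and the dependency sets can be modeled in a few lines. The class below is a simplified sketch under stated assumptions: the class and method names are hypothetical, cache lines are reduced to (address, value) pairs, and the pass-through path for the log region is omitted.

```python
from collections import deque


class PMController:
    """Toy model of the delay FIFO with per-entry dependency sets."""

    def __init__(self):
        self.cot = set()      # currently open transactions
        self.fifo = deque()   # entries: [address, value, dependency_set]
        self.pm = {}          # values that have reached persistent memory

    def open(self, wrap_id):
        self.cot.add(wrap_id)

    def evict(self, address, value):
        # Tag the evicted line with a copy of COT at the instant it arrives.
        self.fifo.append([address, value, set(self.cot)])
        self._drain()

    def close(self, wrap_id):
        self.cot.discard(wrap_id)
        for entry in self.fifo:
            entry[2].discard(wrap_id)
        self._drain()

    def _drain(self):
        # Write out entries from the head while their dependency sets are empty.
        while self.fifo and not self.fifo[0][2]:
            address, value, _ = self.fifo.popleft()
            self.pm[address] = value


controller = PMController()
controller.open("A")
controller.open("B")
controller.evict("y", 4)           # B's update leaves the cache early
controller.close("B")              # B has persisted its log and closed
held_after_b = dict(controller.pm)
controller.close("A")
print(held_after_b)                # {}
print(controller.pm)               # {'y': 4}
```

In the run above, y stays in the FIFO after B closes because A is still in its dependency set, which is the behavior needed to avoid the atomic persistence violation in the earlier example.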
Note that the evicted cache lines intercepted by the controller do
not hold any identifying transaction information and can occur at
arbitrary times after the transaction leaves the COMPUTE state. The
cache line could hold the update of a currently open transaction
or could be from a transaction that has completed or even committed
and left the system. To guarantee safety, the controller must
perforce assume that the first situation holds. The details of the
controller implementation will be presented in Section 4.3.
3.3 Recovery
Each completed transaction saves its log in PM. The log holds
records startTime and persistTime obtained by reading the platform
timer using the RDTSCP instruction. We refer to these as the start and
end timestamps of the transaction. The start timestamp is persisted
before a transaction enters its HTM. This allows the recovery routine
to identify a transaction that started but had not completed at the
time of a failure. Note that even though such a transaction has not
completed, it could still have finished its HTM and fed values to
a later dependent transaction which has completed and persisted
its log. The end timestamp and the write set of the transaction are
persisted after the transaction completes its HTM section, followed
by an end-of-log marker. There can be an arbitrary delay between
the end of the HTM and the time that its log is flushed from caches
into PM and persisted.
The recovery procedure is invoked on reboot following a machine
failure. The routine will restore PM values to a consistent state that
satisfies persistence ordering by copying values from the writeSets
of the logs of qualifying transactions to the specified addresses.
A transaction τ qualifies for log replay if and only if all earlier
transactions on which it depends (both directly and transitively)
are also replayed.
Implementation Overview:
The recovery procedure first identifies the set of incomplete
transactions I, which have started (as indicated by the presence of
a startTime record in their log) but have not completed (indicated by
the lack of a valid end-of-record marker). The remaining complete
transactions (set C) are potential candidates for replay. Denote the
smallest start timestamp of transactions in I by Tmin. A transaction
in C is valid (qualifies for replay) if its end timestamp (persistTime) is
no more than Tmin. All valid transactions are replayed in increasing
order of their end timestamps persistTime.
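The recovery rule can be sketched as a short function. The log representation (a dictionary with `id`, `startTime`, `persistTime`, `complete` and `writeSet` fields) and the timestamp values are assumptions for illustration; `complete` stands in for the presence of a valid end-of-record marker. The two runs reproduce the A and B example of Section 3.2.

```python
def recover(logs, pm):
    """Replay qualifying logs into pm; return the ids replayed, in order."""
    incomplete = [log for log in logs if not log["complete"]]
    complete = [log for log in logs if log["complete"]]
    # Tmin: smallest start timestamp among incomplete transactions.
    t_min = min((log["startTime"] for log in incomplete), default=float("inf"))
    valid = sorted((log for log in complete if log["persistTime"] <= t_min),
                   key=lambda log: log["persistTime"])
    for log in valid:
        for address, value in log["writeSet"]:
            pm[address] = value
    return [log["id"] for log in valid]


# A started first but had not persisted its log when the crash occurred.
log_a = {"id": "A", "startTime": 1, "persistTime": None, "complete": False,
         "writeSet": []}
log_b = {"id": "B", "startTime": 2, "persistTime": 5, "complete": True,
         "writeSet": [("y", 4), ("z", 1)]}

pm_first = {"w": 0, "x": 0, "y": 0, "z": 0}
replayed_first = recover([log_a, log_b], pm_first)
print(replayed_first, pm_first)    # [] and PM unchanged

# Same crash, but A's log was also persisted before the failure.
log_a_done = {"id": "A", "startTime": 1, "persistTime": 3, "complete": True,
              "writeSet": [("w", 3), ("x", 1)]}
pm_second = {"w": 0, "x": 0, "y": 0, "z": 0}
replayed_second = recover([log_a_done, log_b], pm_second)
print(replayed_second, pm_second)  # ['A', 'B'] and all four variables updated
```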
3.4 Protocol Properties
We now summarize the invariants maintained by our protocol.
Denition: e precedence set of a transaction T , denoted by
prec(T ), is the set of all dependent transactions that executed their
HTM before T . Since the HTM properly orders any two dependent
transactions the set is well dened.
Lemma C1: Consider a transaction X with a precedence set
prec(X ). For all transactions Y in prec(X ), startTime(Y ) <
persistTime(X ).
5
Proof Sketch: Let Y be a transaction in prec(X). First let us con-
sider direct precedence, in which a cacheline C modied in Y con-
trols the ordering of X with respect to Y . at is, X either reads
or writes the cacheline C . Since Y is in prec(X ), the earliest time
that X accesses C must be no earlier than the latest time that Y
accessesC , and thus persistTime(Y ) < persistTime(X ). Next consider
a chain of direct precedences, Y → Z →W → · · ·X , which puts
Y in prec(X ); and by transitivity, persistTime(Y ) < persistTime(X ).
Since startTime(Y ) < persistTime(Y ) the lemma follows.
Lemma C2: Consider transactions X and Y with startTime(Y) <
persistTime(X). If a cache line C that is updated by X is written
to PM by the controller at time t, then Y must have closed and
persisted its log before t.
Proof Sketch: Suppose C was evicted to the controller at time
t′ ≤ t. Now t′ must be later than the time X completed HTM
execution and set persistTime(X); by assumption this is after Y set
its startTime, at which time Y must have been registered as an open
transaction by the controller. Now, either Y has closed before t′ or
is still open at that time. In the latter case, Y will be added to the
dependency set of C at t′. Since C can only be written to PM after
its dependency set is empty, it follows that Y must have closed and
removed itself from the dependency set of C.
Lemma C3: Any transaction X that writes an update to PM and
closes at time t will be replayed by the recovery routine if there is
a crash any time after t.
Proof: The recovery routine will replay a transaction X if the only
incomplete transactions (started but not closed) at the time of the
crash started after X completed; that is, there is no incomplete
transaction Y that has a startTime(Y) ≤ persistTime(X). By Lemma
C2 such an incomplete transaction cannot exist.
Lemma C4: Consider a transaction X with a precedence set prec(X). Then
by the time X closes and persists its logs, one of the following must
hold: (i) Some update of X has been written back to PM and all
transactions Y in prec(X) have persisted their logs; (ii) No update
of X has been written to PM and all transactions Y in prec(X) have
persisted their logs; (iii) No update of X has been written to PM and
some transactions Y in prec(X) have not yet persisted their logs.
Proof Sketch: From Lemmas C1 and C2, it is not possible to have
a transaction in prec(X) that is still open if an update from X has
been written to PM.
4 ALGORITHM AND IMPLEMENTATION
The implementation consists of a user software library backed by a
simple Persistent Memory Controller. The library is used primarily
to coordinate closures of concurrent transactions with the flow of
any data evicted from processor caches into PM home locations during
those transactions. The Persistent Memory Controller uses the
dependency set concept from [18, 19, 23] to temporarily park any
processor cache eviction in a searchable Volatile Delay Buffer (VDB)
so that its effective time to reach PM is no earlier than the time that
the last possible transaction with which the eviction could have
overlapped has become recoverable. The Persistent Memory Controller
in this work improves upon the backend controller of [18] by
dispensing with synchronous log replays and victim cache management.
The library also covers any writes to PM variables with volatile
write-aside log entries made within the speculative scope of an
HTM transaction, and then streams the transactional log record
into a non-volatile PM area outside the HTM transaction. These log
streaming writes into PM bypass the VDB. A software mechanism
may periodically check the remaining capacity of the PM log area
and initiate a log cleanup if needed; for such occasional cleanups,
new transactions are delayed, and, after all open transactions have
closed, processor caches are flushed (with a closing sfence) and the
logs are removed.
Figure 2: Flow of a transaction with implementation using
HTM, cached write sets, timing, and logging.
We refer to our implementation as WrAP, for Write-Aside Persistence,
and individual transactions as wraps. We first describe the
timestamp mechanism, then the user software library, and finally
describe the Persistent Memory Controller implementation details.
4.1 System Time Stamp
We use the recent Intel instruction RDTSCP, or Read Time Stamp
Counter and Processor ID, to obtain the timestamps in listing 4.
The RDTSCP instruction provides access to a global monotonically
increasing processor clock across processor sockets [24], while
serializing itself behind the instructions that precede it in program
order. To prevent the reordering of an XEnd before the RDTSCP
instruction, we save the resulting time stamp into a volatile memory
address. Since all stores preceding an XEnd become visible after
XEnd, and the store of the persist timestamp is the last store before
XEnd, that store gets neither re-ordered before other stores nor
re-ordered after the end of the HTM transaction. We note that
RDTSCP has also been used to order HTM transactions in novel
transaction proling [25] and memory version checkpointing [26]
tools.
4.2 Software Library
For HTM we employ Intel’s implementation of Restricted Transactional
Memory or RTM, which includes the instructions XBegin
and XEnd. Aborting HTM transactions retry with exponential
back-off a few times, and then are performed under a software lock.
Our HTMBegin routine checks the status of the software lock both
before and after an XBegin, to produce the correct indication of
conflicts with the non-speculative paths, and acquires the software
lock non-speculatively after having backed off. The HTMBegin and
HTMEnd library routines perform the acquire and release of the
software lock for the fallback case within themselves. The remaining
software library procedures are shown in Algorithm 1. Various
events that arise in the course of a transaction are shown in Figure 2,
which depicts the HTM concurrency section with vertical lines and
the logging section with slanted lines.
Algorithm 1: Concurrent WrAP User Software Library
User Level Software WrAP Library:
OpenWrapC ()
    // ---- State: OPEN ----
    wrapId = threadId;
    Notify Controller Open Wrap wrapId;
    startTime = RDTSCP();
    Log[wrapId].startTime = startTime;
    Log[wrapId].writeSet = {};
    sfence();
    CLFLUSH Log[wrapId];
    sfence();
    // ---- State: COMPUTE ----
    HTMBegin(); // XBegin
wrapStore (addrVar, Value)
    Add {addrVar, Value} to Log[wrapId].writeSet;
    Normal store of Value to addrVar;
CloseWrapC (strictDurability)
    Log[wrapId].persistTime = RDTSCP();
    HTMEnd(); // XEnd
    // ---- State: LOG ----
    CLFLUSH Log[wrapId].persistTime;
    for each cacheline in Log[wrapId].writeSet
        CLFLUSH cacheline;
    if (strictDurability)
        durabilityAddr = 0; // Reset Flag
        tAddr = durabilityAddr;
    else tAddr = 0;
    sfence();
    // ---- State: CLOSE ----
    Notify Controller Closed (wrapId, tAddr);
    // ---- State: COMMIT ----
    if (strictDurability)
        // Wait for durable notification from controller
        Monitor durabilityAddr;
Not shown in Figure 2 is a per-thread durability address location
that we call durabilityAddr in Algorithm 1. A software thread may
use it to set up a Monitor-Mwait coordination to be signaled via
memory by the Persistent Memory Controller (as described shortly)
when a transacting thread wants to wait until all updates from
any non-conflicting transactions that may have raced with it are
confirmed to be recoverable.
This provision allows for implementing strict durability for
any transaction, because the logs of all other transactions that could
possibly precede it in persistence order are in PM, which guarantees
the replayability of its log. By contrast, many other transactions
that only need the consistency guarantee (correct log ordering) may
continue without waiting (or defer waiting to a different point in
the course of a higher level multi-transaction operation). The number
of active HTM transactions at any given time is bounded by the
number of CPUs; therefore, we use thread identifiers as wrapIds. In
OpenWrapC we notify the Persistent Memory Controller that a
wrap has started. We then read the start time with RDTSCP and save
it and an empty write set into its log persistently. The transaction
is then started with the HTMBegin routine.
During the course of a transactional computation, the stores
are performed using the wrapStore function. The stores are just
the ordinary (speculatively performed) store instructions, but are
accompanied by (speculative) recording of the updates into the log
locations, each capturing the (address, value) pair for an update, to
be committed into PM later during the logging phase (after XEnd).
In CloseWrapC we obtain and record the ending timestamp for
an HTM transaction into the persistTime variable in its log. Its
concurrency section is then terminated with the HTMEnd routine.
At this point, the cached write set for the log and ending persistent
timestamp are instantly visible in the cache. Next, we flush transactional
values and the persist timestamp to the log area followed by a
persistent memory fence. The transaction closure is then notified to
the Persistent Memory Controller with the wrapId, and along with
it, the durabilityAddr, if the thread has requested strict durability
(by passing a flag to CloseWrapC); for this, we use the efficient
Monitor-Mwait construct to receive memory based signaling
from the Persistent Memory Controller. If strict durability is not
requested, then CloseWrapC can return immediately and let the
thread proceed with relaxed durability. In many cases
a thread performing a series of transactions may choose relaxed
durability over all but the last and then request strict durability
over the entire set by waiting for only the last one to be strictly
durable.
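The open, store, and close steps described above can be mimicked by a small Python model. This is only a sketch under stated assumptions: a shared counter stands in for RDTSCP, a flag stands in for the CLFLUSH and sfence of the log, a list of tuples stands in for controller notifications, and HTM speculation is not modeled at all. The class and method names are hypothetical.

```python
import itertools

_clock = itertools.count(1)      # assumption: monotonic stand-in for RDTSCP

class WrapLog:
    """Toy model of one thread's write-aside log record."""
    def __init__(self):
        self.start_time = None
        self.persist_time = None
        self.write_set = []      # (address, value) pairs
        self.persisted = False   # True once the whole record is "in PM"

class Wrap:
    def __init__(self, wrap_id, memory, controller_events):
        self.wrap_id = wrap_id
        self.memory = memory               # models cached PM variables
        self.events = controller_events    # models controller notifications
        self.log = WrapLog()

    def open(self):
        self.events.append(("open", self.wrap_id))
        self.log.start_time = next(_clock)  # saved before the HTM section begins
        self.log.write_set = []

    def store(self, addr, value):
        self.log.write_set.append((addr, value))  # write-aside log entry
        self.memory[addr] = value                 # ordinary store

    def close(self, strict_durability=False):
        self.log.persist_time = next(_clock)  # last store before XEnd
        self.log.persisted = True             # models flushing the log record
        self.events.append(("close", self.wrap_id, strict_durability))

memory, events = {}, []
w = Wrap(wrap_id=0, memory=memory, controller_events=events)
w.open()
w.store(0x100, 7)
w.store(0x108, 9)
w.close(strict_durability=True)
print(w.log.write_set, events)
```

The model keeps the one property the protocol relies on here: the start timestamp is taken before any store, and the persist timestamp after the last one.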
4.3 PM Controller Implementation
The Persistent Memory Controller provides for two needs: (1) holding
back modified PM cachelines that fall into it at any time T
from the processor caches, from flowing into PM until at least
a time when all successful transactions that were active at time
T are recoverable, and (2) tracking the ordering of dependencies
among transactions so that only those that need strict durability
guarantees need to be delayed pending the completion of the log
phases of those with which they overlap. It implements a VDB
(Volatile Delay Buffer) as a means of transient storage for the first
need, implements a durability wait queue (DWQ) for the second
need, and implements dependency set (DS) tracking logic across
the open/close notifications to control the VDB and the DWQ, as
described next.
Volatile Delay Buffer (VDB): The VDB is comprised of a FIFO
queue and a hash table that points to entries in the FIFO queue. Each
entry in the FIFO queue contains a tuple of PM address, data, and
dependency set. On a PM write, resulting from a cache eviction
or streaming store, to a memory address not in the log area or
pass-through area, the PM address and data are added to the FIFO
queue and tagged with a dependency set initialized to the COT.
Additionally, the PM address is inserted into the hash table with a
pointer to the FIFO queue entry. If the address already exists in the
hash table, then it is updated to point to the new queue entry. On
a memory read, the hash table is first consulted. If an entry is in
the hash table, then the pointer is to the latest memory value for
the address, and the data is retrieved from the queue. On a hash
table miss, PM is read and data is returned. As wraps close, the
dependency set in each entry in the queue is updated to remove
the dependency on the wrap.
Dependency sets become empty in FIFO order, and as they become
empty, we perform three actions. First, we write back the
data to the PM address. Next, we consult the hash table. If the hash
table entry points to the current FIFO queue entry, we remove the
entry in the hash table, since we know there are no later entries
for the same memory address in the queue. Finally, we remove the
entry from the FIFO queue.
On inserting an entry into the back of the queue, we can also
consult the head of the FIFO queue to check to see if the dependency
set is empty. If the head has an empty dependency set, we can
perform the same actions, allowing for O(1) VDB management.
Algorithm 2: Hardware WrAP Implementation
Persistent Memory Controller:
Open Wrap Notification (wrapId)
    Add wrapId to Current Open Transactions COT;
Memory Write (memoryAddr, data)
    // Triggered from cache evict or stream store.
    if ((memoryAddr not in Pass-Through Log Area) and
        (Current Open Transactions COT != {}))
        Add (memoryAddr, data, COT) to VDB;
    else
        // Normal memory write
        Memory[memoryAddr] = data;
Memory Read (memoryAddr)
    if (memoryAddr in Volatile Delay Buffer)
        return latest cacheline data from VDB;
    return Memory[memoryAddr];
Close WrAP Notification (wrapId, durabilityAddr)
    Remove wrapId from Current Open Transactions COT;
    if (durabilityAddr)
        durabilityDS = COT;
        // Add pair to Durability Wait Queue DWQ;
        Add (durabilityDS, durabilityAddr) to DWQ;
    Remove wrapId from Volatile Delay Buffer elements
    if earliest VDB elements have empty DS
        // Write back entries to memory in FIFO order
        Memory[memoryAddr] = data;
    Remove wrapId from Durability Wait Queue elements
    if earliest DWQ elements have empty DS
        // Notify waiting thread of durability
        Memory[durabilityAddr] = 1;
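The FIFO queue plus hash table organization of the VDB can be modeled in a few lines of Python. This is a software sketch, not the hardware design: the class name, the dictionary used for PM, and the use of Python sets for dependency sets are assumptions. In particular, close_wrap walks every entry here, whereas the controller is described as clearing one bit across the whole buffer.

```python
from collections import deque

class VolatileDelayBuffer:
    """Toy VDB: FIFO of [addr, data, dependency_set] plus addr -> newest entry."""
    def __init__(self, pm):
        self.pm = pm             # dict modeling PM home locations
        self.fifo = deque()
        self.newest = {}         # hash table: address -> most recent entry

    def write(self, addr, data, cot):
        entry = [addr, data, set(cot)]   # tag with a copy of the COT
        self.fifo.append(entry)
        self.newest[addr] = entry        # repoint to the new entry
        self._drain_head()               # also check the head on insert

    def read(self, addr):
        entry = self.newest.get(addr)
        return entry[1] if entry is not None else self.pm.get(addr)

    def close_wrap(self, wrap_id):
        for entry in self.fifo:
            entry[2].discard(wrap_id)
        self._drain_head()

    def _drain_head(self):
        # Write back head entries whose dependency set is empty, in FIFO order.
        while self.fifo and not self.fifo[0][2]:
            entry = self.fifo.popleft()
            addr, data, _ = entry
            self.pm[addr] = data
            if self.newest.get(addr) is entry:
                del self.newest[addr]    # no later entry for this address

pm = {"X": 0}
vdb = VolatileDelayBuffer(pm)
vdb.write("X", 1, cot={1, 2})
vdb.write("X", 2, cot={1, 2, 3})
vdb.close_wrap(1)
vdb.close_wrap(2)                # the first copy of X drains to PM
print(pm["X"], vdb.read("X"))
```

After wraps 1 and 2 close, PM holds the older copy of X while reads are still served from the newer copy that remains in the buffer.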
Durability Wait Queue (DWQ): Strict durability is handled
by the Persistent Memory Controller using the Durability Wait
Queue or DWQ, which is used to track transactions waiting on
others to complete and to notify a waiting transaction that it is safe to
proceed. The DWQ is a FIFO queue similar to the VDB with entries
containing pairs of the dependency set and a durability address.
When a thread notifies the Persistent Memory Controller that it
is closing a transaction (see steps below), it can request strict durability
by passing a durability address. Dependencies on closing
wraps are also removed from the dependency set for each entry in
the DWQ. When the dependency set becomes empty, the controller
writes to the durability address and removes the entry from the
queue. Threads waiting on a write to the address can then proceed.
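A minimal Python model of the DWQ behavior just described follows. The class name, the dictionary standing in for memory, and the string flag addresses are hypothetical; the real mechanism is a hardware queue that signals a thread waiting with Monitor-Mwait.

```python
from collections import deque

class DurabilityWaitQueue:
    """Toy DWQ: FIFO of [dependency_set, durability_addr] pairs."""
    def __init__(self, memory):
        self.memory = memory     # dict modeling the durability flag locations
        self.queue = deque()

    def request(self, cot_after_close, durability_addr):
        # Called when a wrap closes with strict durability requested.
        self.queue.append([set(cot_after_close), durability_addr])
        self._notify_head()

    def close_wrap(self, wrap_id):
        for entry in self.queue:
            entry[0].discard(wrap_id)
        self._notify_head()

    def _notify_head(self):
        # Signal waiters whose dependency set has emptied, in FIFO order.
        while self.queue and not self.queue[0][0]:
            _, addr = self.queue.popleft()
            self.memory[addr] = 1

memory = {"flagA": 0, "flagB": 0}
dwq = DurabilityWaitQueue(memory)
# Wrap 1 closes with strict durability while wraps 2 and 3 are still open.
dwq.request({2, 3}, "flagA")
dwq.close_wrap(1)
# Wrap 2 closes with strict durability while wrap 3 is still open.
dwq.request({3}, "flagB")
dwq.close_wrap(2)
print(memory)                    # neither waiter has been signaled yet
dwq.close_wrap(3)
print(memory)                    # both waiters are signaled
```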
Opening and Closing WrAPs: As outlined in Algorithm 2,
the controller supports two interfaces from software, namely those
for Open Wrap and Close Wrap notifications exercised from the
user library as shown in Algorithm 1. (Implementations of these
notifications can vary: for example, one possible mechanism may
consist of software writing to a designated set of control addresses
for these notifications.) It also implements hardware operations
against the VDB from the processor caches: Memory Write, for
handling modified cachelines evicted from the processor caches or
non-temporal stores from CPUs, and Memory Read, for handling
reads from PM from the processor caches.
The Open Wrap notification simply adds the passed wrapId
to a bit vector of open transactions. We call this bit vector of
open transactions the Current Open Transactions COT. When the
controller receives a Memory Write (i.e., a processor cache eviction
or a non-temporal/streaming/uncached write) it checks the COT: if
the COT is empty, writes can flow into the PM. Writes that target
the log range in PM can also flow into PM irrespective of the COT.
For the non-log writes, if the COT is nonempty, the cache line is tagged
with the COT and placed into the VDB.
The Close Wrap controller notification receives the wrapId and
durability address, durabilityAddr. The controller removes the
wrapId from the Current Open Transactions COT bit mask. If the
transaction requires strict durability, we save the resulting COT as
durabilityDS and add it, paired with durabilityAddr, to the DWQ. The
controller then removes the wrapId from all entries in the VDB and
DWQ. This is performed by simply clearing the wrap's bit in the
dependency set bit mask across the entire FIFO VDB. If the earliest
entries in the queue result in an empty dependency set, the cache
line data is written back in FIFO order. Similarly, the controller
removes the wrapId from all entries in the Durability Wait Queue DWQ.
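The bit-vector form of the COT and the pass-through rule for log writes can be sketched with Python integers used as bit masks. The Controller class, the log_range parameter, and the addresses below are assumptions for illustration; the DWQ is left out to keep the sketch short.

```python
class Controller:
    """Toy controller: COT and dependency sets are integer bit masks."""
    def __init__(self, log_range):
        self.cot = 0                 # bit i set <=> wrap i is open
        self.log_range = log_range   # hypothetical pass-through log area
        self.pm = {}
        self.vdb = []                # FIFO of [addr, data, dependency_mask]

    def open_wrap(self, wrap_id):
        self.cot |= 1 << wrap_id

    def memory_write(self, addr, data):
        if addr in self.log_range or self.cot == 0:
            self.pm[addr] = data     # log writes and idle-time writes flow to PM
        else:
            self.vdb.append([addr, data, self.cot])   # tag with current COT

    def close_wrap(self, wrap_id):
        bit = 1 << wrap_id
        self.cot &= ~bit
        for entry in self.vdb:
            entry[2] &= ~bit         # clear this wrap's bit in every entry
        while self.vdb and self.vdb[0][2] == 0:
            addr, data, _ = self.vdb.pop(0)
            self.pm[addr] = data     # write back in FIFO order

ctl = Controller(log_range=range(0x1000, 0x2000))
ctl.open_wrap(0)
ctl.open_wrap(2)
ctl.memory_write(0x1000, "log record")   # bypasses the VDB
ctl.memory_write(0x40, "v1")             # held, tagged with mask 0b101
held_before_close = 0x40 not in ctl.pm
ctl.close_wrap(0)
ctl.close_wrap(2)
print(held_before_close, ctl.pm)
```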
Software Based Strict Durability Alternative: As an alternative
to implementing strict durability in the controller, strict
durability may be implemented entirely in the software library; we
modify the software algorithm as follows. On a transaction start,
a thread saves the start time and an open flag in a dedicated cache
line for the thread. On transaction close, to ensure strict durability,
it saves its end time in the same cache line with the start time and
clears the open flag. It then waits until all prior open transactions
have closed. It scans the set of all thread cache lines and compares
any open transaction end times and start times to its end time. The
thread may only continue, with ensured durability, once all other
threads are either not in an open transaction or have a start or
persist time greater than its persist time.
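The wait condition of the software alternative can be written as a predicate over the per-thread cache lines. The sketch below assumes a ThreadSlot record with an open flag, a start time, and an optional persist time that may already be set while the flag is still raised; these names and that layout are illustrative guesses, not the library's actual data structure. A closing thread would poll the predicate until it returns True.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ThreadSlot:                       # hypothetical per-thread cache line
    open_flag: bool = False
    start_time: int = 0
    persist_time: Optional[int] = None  # None until the end time is saved

def may_proceed(my_id, slots):
    """True once no other open transaction can precede this thread's persist time."""
    my_persist = slots[my_id].persist_time
    for tid, slot in enumerate(slots):
        if tid == my_id or not slot.open_flag:
            continue                    # closed transactions do not block
        later_start = slot.start_time > my_persist
        later_persist = (slot.persist_time is not None
                         and slot.persist_time > my_persist)
        if not (later_start or later_persist):
            return False
    return True

slots = [
    ThreadSlot(open_flag=False, start_time=1, persist_time=5),  # the closing thread
    ThreadSlot(open_flag=True, start_time=3),                   # started before time 5
    ThreadSlot(open_flag=True, start_time=7),                   # started after time 5
]
blocked = not may_proceed(0, slots)
slots[1].persist_time = 9               # thread 1 records a later persist time
released = may_proceed(0, slots)
print(blocked, released)
```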
(a) Example Transactions T1-T4
Time   Only Start Timestamp Has Been Persisted   Order of Persist Timestamps
t1     T1
t2     T1, T2
t3     T1, T2, T3
t4     T1, T3                                    T2
t5     T1, T3, T4                                T2
t6     T1, T4                                    T2, T3
t7     T1, T4                                    T2, T3
t8     T1, T4                                    T2, T3
t9     T4                                        T2, T3, T1
t10                                              T2, T3, T4, T1
t11                                              T2, T3, T4, T1
t12                                              T2, T3, T4, T1
(b) Transaction Recovery Order on Failure
Time    Event            COT         Volatile Delay Buffer (most recent eviction first)
t1-t3   T1,T2,T3 Start   {1,1,1,0}   -
t4      Evict X          {1,1,1,0}   X: {1,1,1,0}
t5      T4 Starts        {1,1,1,1}   X: {1,1,1,0}
t6      Evict Y          {1,1,1,1}   Y: {1,1,1,1}, X: {1,1,1,0}
t7      T3 Ends          {1,1,0,1}   Y: {1,1,0,1}, X: {1,1,0,0}
t8      T2 Ends          {1,0,0,1}   Y: {1,0,0,1}, X: {1,0,0,0}
t9      Evict Z          {1,0,0,1}   Z: {1,0,0,1}, Y: {1,0,0,1}, X: {1,0,0,0}
t10     Evict X          {1,0,0,1}   X: {1,0,0,1}, Z: {1,0,0,1}, Y: {1,0,0,1}, X: {1,0,0,0}
t11     T1 Ends          {0,0,0,1}   X: {0,0,0,1}, Z: {0,0,0,1}, Y: {0,0,0,1}, X: {0,0,0,0}
t12     T4 Ends          {0,0,0,0}   X: {0,0,0,0}, Z: {0,0,0,0}, Y: {0,0,0,0}
(c) Contents of the Persistent Memory Controller Current Open Transactions Dependency Set and FIFO Queue
Figure 3: Example Transaction Sequence with Contents of Persistent Memory Controller and Recovery Algorithm
4.4 Example
Figure 3a shows an example set of four transactions, T1-T4, happening
concurrently. In this example, we show T1-T4 split into states,
specifically, concurrency (COMPUTE), shown in vertical lines, and
LOG, depicted with slanted lines. At certain time steps we show
the contents of the Persistent Memory Controller’s Volatile Delay
Buffer, or VDB, which is a FIFO Queue, and the Current Open
Transactions or COT in Figure 3c. The recovery algorithm is shown
in Figure 3b with the contents of the log. Either the start timestamp
only or the persist timestamp order is shown; where bold
and underline indicate the log is written, and a circled transaction
indicates that it is recoverable.
First, at time t1, T1 opens, notifying the controller, and records
its start timestamp safely in its log. The controller adds T1 to the
bitmap COT of open transactions. At times t2 and t3, transactions T2
and T3 also open, notify the controller, and safely read and persist
their start timestamps. At this point in time the Persistent Memory
Controller has a COT of {1,1,1,0} and only start timestamps have
been recorded in the log. T2 then completes its concurrency section
and persists its persist timestamp at time t4 and begins writing its
log. In Figure 3c, we also show a random cache eviction of a cache
line X that is tagged with the COT of {1,1,1,0}.
Transaction T4 starts at time t5 persisting its start timestamp and
is added to the set of open transactions on the Persistent Memory
Controller, now {1,1,1,1}. At time t6 we illustrate several events. T3
completes its concurrency section and persists its persist timestamp.
At this time, as shown in 3b, T3 is now ordered after T2 for recovery,
however neither have completed persisting their logs. Also, we
show a random cache eviction of cache line Y , and it is placed at
the back of the VDB on the Persistent Memory Controller as shown
in 3c.
At time t7, transaction T3 has completed writing its logs, is
marked completed, and is removed from the dependency set in the
controller and from the cache line dependencies for X and Y. However, as
shown in 3b, T3 is not recoverable at this point since it is not first
in the persist timestamp order. T3 is behind T2, and T3 also has a
larger persist timestamp than the start timestamp of T1, which
hasn’t written its persist timestamp yet. When T2 completes writing
its logs at time t8, it is removed from the current and dependency
sets in the queue of the Persistent Memory Controller as shown
in 3c. Note that cache line X still cannot be written back
to Persistent Memory as it is still tagged as being dependent on
T1, and Y on both T1 and T4. At this time, T2 and T3 are also not
recoverable, as shown in 3b: T2 has a persist timestamp that is
greater than the start timestamp of T1, so a recovery process could
not know whether T1 had simply had a delay in persisting its log
while T2 had transactional values dependent on T1.
We illustrate two events at time t9. T1 finally completes its
concurrency section and writes its persist timestamp at t9. Since the
persist timestamp of T1 is safely persisted and known at recovery
time, transaction T2 is now fully recoverable as shown circled in 3b.
However, at this time, T3 is not fully recoverable since it is waiting
on T4, which started before T3 completed its concurrency section
and T4 hasn’t yet written the persist timestamp. Also at t9, in 3c
we illustrate the eviction of cache line Z which is tagged with the
set of open transactions COT {1,0,0,1}.
At time t10, T4 writes its persist timestamp and its order is now
known to a recovery routine to be behind T3, which is now fully
recoverable as shown with the circle in 3b. Note that T4 has a
persist time before T1. In 3c, we also illustrate the eviction of cache
line X again into the VDB of the Persistent Memory Controller and
tagged with the set of the two open transactions, T1 and T4. Note
that there are two copies of cache line X in the controller. The one
at the head of the queue has fewer dependencies (only dependent
on T1) than the recent eviction. Any subsequent read for cache line
X returns the most recent copy, the last entry in the VDB. Note how
cache lines at the back of the queue have dependency set sizes that
are greater than or equal to entries earlier in the queue.
T1 completes log writing at t11, but is behind T4, which hasn’t
yet finished writing its logs, so neither is yet recoverable. The
PM controller also removes T1 from its dependency set and from those
in the VDB. The first copy of X now has no dependencies in the
queue and is safely written back to PM as shown in 3c.
At time t12, T4 completes writing its logs and both T4 and T1
are recoverable. Also, T4 is removed from the dependency sets in
the controller, which allows Y, Z, and X to flow to PM.
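The controller side of this example can be replayed with a short Python trace. The event list below is transcribed from the Event row of Figure 3c; the run_trace helper and its set-based dependency tags are assumptions made for the sketch, and the recovery side of the example (Figure 3b) is not modeled.

```python
def run_trace(events):
    """Replay open/evict/close events; return final COT, VDB, and write-back order."""
    cot = set()
    vdb = []          # FIFO of [cache_line, dependency_set], head at index 0
    written = []      # cache lines written back to PM, in order
    for kind, arg in events:
        if kind == "open":
            cot.add(arg)
        elif kind == "evict":
            vdb.append([arg, set(cot)])      # tag with the current COT
        elif kind == "close":
            cot.discard(arg)
            for entry in vdb:
                entry[1].discard(arg)
        while vdb and not vdb[0][1]:         # drain entries with empty sets
            written.append(vdb.pop(0)[0])
    return cot, vdb, written

events = [
    ("open", "T1"), ("open", "T2"), ("open", "T3"),   # t1-t3
    ("evict", "X"),                                    # t4
    ("open", "T4"),                                    # t5
    ("evict", "Y"),                                    # t6
    ("close", "T3"),                                   # t7
    ("close", "T2"),                                   # t8
    ("evict", "Z"),                                    # t9
    ("evict", "X"),                                    # t10
    ("close", "T1"),                                   # t11
    ("close", "T4"),                                   # t12
]

cot_t11, vdb_t11, written_t11 = run_trace(events[:-1])   # state after t11
cot_t12, vdb_t12, written_t12 = run_trace(events)        # state after t12
print(sorted(cot_t11), [(line, sorted(ds)) for line, ds in vdb_t11], written_t11)
print(written_t12)
```

After t11 only the first copy of X has been written back and Y, Z, and the second copy of X each still depend on T4; after t12 they drain in that order.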
Strict Durability: Suppose a transaction requires strict durability
during its Commit Stage, ensuring that once complete, the
transactional writes will be reflected in PM if a failure were to occur.
If T4 requires strict durability, it is simply durable at the end as
there are no open transactions when it completes. However, T1,
T2, and T3 have other constraints. A transaction requiring strict
durability is only durable when it is fully recoverable. Figure 3b
marks a transaction as durable by circling it. T1 must wait until
step t12 if it requires strict durability as it might have dependencies
on T4. T2 is fully durable at time t9 when T1, which started earlier,
writes its persist timestamp. At time t10, T3 is fully durable when
T4, which started before T3 completed its concurrency section and
could have introduced transactional dependencies, writes its persist
timestamp, which indicates that T4 started its HTM section later.
5 EVALUATION
We evaluated our method using benchmarks directly running on
hardware and through simulation analysis (described in Section 5.4).
Our simulation evaluates the length of the FIFO buffer and performance
against various Persistent Memory write times. In the direct
hardware evaluation described next, we employed Intel(R) Xeon(R)
E5-2650 v4 series processors, 12 cores per processor, running at
2.20 GHz, with Red Hat Enterprise Linux 7.2. HTM transactions
were implemented with Intel Transactional Synchronization Extensions
(TSX) [27] using a global fallback lock. We built our software
using g++ 4.8.5. Each measurement reflects an average over twenty
repeats with small variation among the repeats.
Using micro-benchmarks and SSCA2 [28] and Vacation, from the
STAMP [29] benchmark suite, we compared the following methods:
• HTM Only: Hardware Transactional Memory with Intel
TSX, without any logging or persistence. This method provides
a baseline for transaction performance in cache memory
without any persistence guarantees. If a power failure
occurs after a transaction, writes to memory locations may
be left in the cache, or written back to memory in an
out-of-order subset of the transactional updates.
• WrAP: Our method. The volatile delay buffer in the controller
is assumed to be able to keep up with back pressure
from the cache, as shown in Section 3. We therefore
perform all other aspects of the protocol such as logging,
reading timestamps, HTM and fall-back locking, etc.
• WrAP-Strict: Same as above, but we implement the software
strict durability method as described in Section 4.
Threads wait until all prior-open transactions have closed
before proceeding.
• PTL-Eager: (Persistent Transactional Locking). In this
method, we added persistence to Transactional Locking
(TL-Eager) [30–32] by persisting the undo log at the time
that a TL transaction performs its sequence of writes. The
undo-log entries are written with write-through stores and
SFENCEs, and once the transaction commits and the new
data values are flushed into PM, the undo-log entries are
removed.
5.1 Benchmarks
The Scalable Synthetic Compact Applications for benchmarking
High Productivity Computing Systems [28], SSCA2, is part of the
Stanford Transactional Applications for Multi-Processing [29], or
STAMP, benchmark suite. SSCA2 uses a large memory area and has
multiple kernels that construct a graph and perform operations on
the graph. We executed the SSCA2 benchmark with scale 20, which
generates a graph with over 45 million edges. We increased the
number of threads from 1 to 16 in powers of two and recorded the
execution time for the kernel for each method.
Figure 4: SSCA2 Benchmark Compute Graph Kernel Execution
Time as a Function of the Number of Parallel Execution
Threads
Figure 5: SSCA2 Benchmark Compute Graph Kernel Speedup
as a Function of the Number of Parallel Execution Threads
Figure 6: SSCA2 Benchmark Compute Graph Kernel HTM
Aborts as a Function of the Number of Parallel Execution
Threads
Figure 4 shows the execution time for each method for the Com-
pute Kernel in the SSCA2 benchmark as a function of the number of
threads. Each method reduces the execution time with increasing
numbers of threads. Our WrAP approach has similar execution
time to HTM in the cache hierarchy with no persistence and is over
2.25 times faster than a persistent PTL-Eager method to PM.
Figure 5 shows the speedup for each method as a function of
the number of threads when compared to a single threaded undo
log for the persistence methods and speedup versus no persistence
for the in cache method HTM only. Even though the HTM (cache-only)
method does better in absolute terms as we saw in Figure 4, it
proceeds from a higher baseline for single-threaded execution.
PTL-Eager yields significantly weaker scalability due to the inherent
costs of having to perform persistent flushes within its concurrent
region.
Figure 7: Vacation Benchmark Execution Time as a Function
of the Number of Parallel Execution Threads
Figure 8: Vacation Benchmark Execution Time with Various
Persistent Memory Write Times for Four Threads
Figure 6 shows the number of hardware aborts for both our
WrAP approach and cache-only HTM. Our approach introduces
extra writes to log the write-set, and, along with reading the system
time stamp, extends the transaction time. However, as shown in the
Figure, this only slightly increases the number of hardware aborts.
We also evaluated the Vacation benchmark which is part of the
STAMP benchmark suite. The Vacation benchmark emulates database
transactions for a travel reservation system. We executed the
benchmark with the low option for lower contention emulation.
Figure 7 shows the execution time for each method for the Va-
cation benchmark as a function of the number of threads. Each
method reduces the execution time with increasing numbers of
threads. The WrAP approach follows trends similar to HTM
in the cache hierarchy with no persistence, with both approaches
flattening execution time after 4 threads. We examine the effect
of strict durability, WrAP-Strict in the figure, and show that strict
durability only introduces a small amount of overhead. For just
a single thread, it has the same performance as WrAP relaxed as
a thread doesn’t need to wait on other threads, as it is durable as
soon as the transaction completes.
Additionally, we examined the effect of increased Persistent
Memory write times on the benchmark. Byte-addressable Persistent
Memory can have longer write times. To emulate the longer write
times for PM, we insert a delay after non-temporal stores when
writing to new cache lines and a delay after cache line flushes. The
write delay can be tuned to emulate the effect of longer write times
typical of PM. Figure 8 shows the Vacation benchmark execution
time for various PM write times.
The WrAP approach is less affected by increasing PM write times
than the PTL-Eager approach due to several factors. WrAP performs
write-combining for log entries on the foreground path for each
thread, so writes to several transaction variables may be combined
into fewer writes. Also, PTL-Eager transactionally persists an undo
log on writes, causing a foreground delay.
5.2 Hash Table
Our next series of experiments shows how transaction sizes and high
memory traffic affect overall performance. We create a 64 MB
Hash Table Array of elements in main memory and transactionally
perform a number of element updates. For each transaction, we
generate a set of random numbers of a configurable size, compute
their hash, and write the value into the Hash Table Array.
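The shape of one such update transaction can be sketched as follows. Only the 64 MB table size comes from the text; the 8-byte element size, the multiplicative hash, the fixed seed, and the sparse dictionary standing in for the array are assumptions of this sketch, and the stores are plain assignments rather than wrapStore calls inside a wrap.

```python
import random

TABLE_BYTES = 64 * 1024 * 1024      # 64 MB table, as in the text
ELEMENT_BYTES = 8                   # assumption: 8-byte elements
NUM_SLOTS = TABLE_BYTES // ELEMENT_BYTES

def slot_for(value):
    # Assumption: a multiplicative hash; the benchmark's hash is not specified.
    return (value * 2654435761) % (1 << 32) % NUM_SLOTS

def build_write_set(rng, num_updates):
    """Generate the (slot, value) pairs updated by one transaction."""
    write_set = []
    for _ in range(num_updates):
        value = rng.getrandbits(32)
        write_set.append((slot_for(value), value))
    return write_set

def run_transaction(table, write_set):
    # In the benchmark, these stores would be issued atomically within one wrap.
    for slot, value in write_set:
        table[slot] = value

rng = random.Random(42)
table = {}                          # sparse stand-in for the 64 MB array
write_set = build_write_set(rng, num_updates=10)
run_transaction(table, write_set)
print(len(write_set), len(table))
```

Varying num_updates corresponds to varying the transaction write set size in the experiments that follow.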
Figure 9: Millions of Transactions per Second for Hash Table
Updates of 10 Elements versus Concurrent Number of
Threads
First, we create transactions consisting of 10 atomic updates and
vary the number of concurrent threads and measure the maximum
throughput. We perform 1 million updates and record the average
throughput and plot the results in Figure 9. Our approach achieves
roughly 3x throughput over PTL-Eager. Figure 10 shows increasing
the write set to 20 atomic updates has similar performance. In both
figures, adding strict durability only slightly decreases the overall
performance; threads wait additional time for the dependency on
other transactions to clear before continuing to another transaction.
Figure 10: Millions of Transactions per Second for Hash Table
Updates of 20 Elements versus Concurrent Number of
Threads
Figure 11: Average Txps for Increasing Transaction Sizes of
Atomic Hash Table Updates with 6 Concurrent Threads
The transaction write set was then varied from 2 to 30 elements
with 6 concurrent threads. The average throughput was recorded
and is shown in Figure 11. Even with adding strict durability, WrAP
performs roughly three times faster than PTL-Eager.
With the transaction size fixed at ten elements, we then varied the
write-to-read ratio with 6 concurrent threads. The average throughput
was recorded and is shown in Figure 12. Unlike transactional memory
approaches, our approach does not require instrumenting read
accesses and can therefore execute reads at cache speeds.
5.3 Red-Black Tree
We use the transactional Red-Black tree from STAMP [29] initialized
with 1 million elements. We then perform insert operations
on the Red-Black tree and record average transaction times and
throughput over 200k additional inserts. Each transaction inserts
an additional element into the Red-Black tree. Inserting an element
into a Red-Black tree first requires finding the insertion point,
which can take many read operations, and can trigger many writes
through a rebalance. In our experiments, we averaged 63 reads and
11 writes per transactional insert of one element into the Red-Black
tree.
Figure 12: Average Txps for Write / Read Percentage of
Atomic Hash Table Updates with 6 Concurrent Threads
Figure 13: Millions of Transactions per Second for Atomic
Red-Black Tree Element Inserts versus Number of Concurrent
Threads
We record the maximum throughput of inserts into the Red-Black
tree per second for a varying number of threads in Figure 13. As
the figure shows, WrAP has almost 9x higher throughput than
PTL-Eager, and almost 6x higher with strict durability. Our
method can perform reads at the speed of the hardware, while
PTL-Eager requires instrumenting reads through software to track
dependencies on other concurrent transactions.
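To make the effect of read instrumentation concrete, the short calculation below derives the share of accesses that are reads from the averages reported for the Red-Black tree insert (63 reads and 11 writes). The function name `read_fraction` is illustrative only, and the calculation says nothing about the per-access cost of instrumentation, which is not given in the text.

```python
# Average accesses per Red-Black tree insert, as reported in the text.
READS_PER_INSERT = 63
WRITES_PER_INSERT = 11


def read_fraction(reads: int, writes: int) -> float:
    """Return the share of memory accesses that are reads."""
    total = reads + writes
    if total == 0:
        raise ValueError("transaction performs no accesses")
    return reads / total


rb_read_share = read_fraction(READS_PER_INSERT, WRITES_PER_INSERT)
print(f"{rb_read_share:.1%} of accesses per insert are reads")
```

With roughly 85% of accesses being reads, any scheme that adds software work to every read pays that overhead on the large majority of the transaction.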
Figure 14: Average Atomic Hash Table 10 Element Update
with Various Persistent Memory Write Times with 8 Concurrent
Threads
Figure 15: Maximum FIFO Queue Length for Atomic Hash
Table 10 Element Update with 4 Concurrent Threads
5.4 Persistent Memory Controller Analysis
We investigated the required length of our FIFO in the Volatile Delay
Buffer and performance with respect to Persistent Memory write
times using an approach similar to [14]. In the absence of readily
available memory controllers, we modified the McSimA+ simulator
[33]. McSimA+ is a PIN [34] based simulator that decouples
execution from simulation and tightly models out-of-order processor
micro-architecture at the cycle level. We extended the simulator
to support the notifications for opening and closing WrAPs along
with extended support for memory reads and writes. We added
support for DRAMSim2 [35], a cycle-accurate memory system and
DRAM memory controller model library. Write-combining and
store buffers were then added with multiple configuration options
to allow fine tuning to match the system to be modeled.
Figure 16: Average B-Tree Atomic Element Insert with Various
Persistent Memory Write Times with 8 Concurrent Threads
Figure 17: Maximum FIFO Queue Length for B-Tree Atomic
Element Insert with 8 Concurrent Threads
To stress the Persistent Memory Controller, we executed an
atomic hash table update without any thread contention by having
each thread update elements on a separate portion of the table. In
the simulation, we filled the cache with dirty cache lines so that each
write by a thread in a transaction generates write-backs to main
Persistent Memory. For 8 threads, we recorded the average atomic
hash table update time for 10 elements in each transaction. We then
varied the Persistent Memory write time as a multiple of DRAM write
time. As shown in Figure 14, WrAP is less affected by increasing
write times when compared to PTL-Eager.
Additionally, we record the maximum FIFO buffer size for various
Persistent Memory write times and 4 concurrent threads, shown
in Figure 15. Initially, the buffer size decreases for an increasing
PM write time, due to slower transaction throughput and fewer cache
evictions into the buffer. As the write time increases further, the
buffer length increases, but is still less than 1k cache lines or 64KB.
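The size bound quoted above follows from simple arithmetic. The sketch below assumes 64-byte cache lines and reads "1k" as 1024 entries; both are assumptions, since the text gives only the rounded figures. The helper name `fifo_bytes` is illustrative.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


def fifo_bytes(num_lines: int, line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Return the storage needed to hold num_lines cache lines."""
    if num_lines < 0 or line_bytes <= 0:
        raise ValueError("invalid FIFO dimensions")
    return num_lines * line_bytes


max_lines = 1024  # assumption: "1k" taken as 1024 FIFO entries
print(f"{max_lines} lines -> {fifo_bytes(max_lines) // 1024} KiB")
```

This counts cache-line payload only; any per-entry tags or dependency metadata held alongside each line would add to the total.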
We performed a similar analysis using a B-Tree, where each
thread atomically inserts elements on its own copy of a B-Tree.
Each insert into the tree required, on average, over 5 times as many
reads as writes. As shown in Figure 16, our method is less affected
by increasing PM write times, because PTL-Eager must instrument
the large proportion of read operations. In this experiment, we use
eight concurrent threads, each atomically inserting elements into
an initialized B-Tree of 128 elements.
As more reads than writes are generated for each atomic insert
transaction, the FIFO buffer length remains small. We also examined
the FIFO buffer length in the VDB with 8 concurrent threads.
Figure 17 shows the length was less than about 100 elements for
each write speed due to the large proportion of reads.
6 RELATED WORK
Related Persistence Work: Analysis of consistency models for
persistent memory was considered in [36]. Changes to the front-end
cache for ordering cache evictions were proposed in [14, 15,
37, 38]. BPFS [37] proposed epoch barriers to control eviction order,
while [38] proposed a flush software primitive to control update
order. Snapshotting the entire micro-architectural state at the point
of a failure is proposed in [39]. A non-volatile victim cache to provide
transactional buffering was proposed in [14], with the added
property of not requiring logging; however, it requires changes to the
front-end cache controller to track pre- and post-transactional states
for cache lines in both volatile and persistent caches, atomically
moving them to durable state on transaction commits.
Memory controller support for transaction atomicity in Persistent
Memory has been proposed in [17–19, 23, 40–42]. Adding
a small DRAM buffer in front of Persistent Memory to improve
latency and to coalesce writes was proposed in [40]. The use of
a volatile victim cache to prevent uncontrolled cache evictions
from reaching PM was described in [17–19], but requires software
locking for concurrency control. FIRM [42] describes techniques
to differentiate persistent and non-persistent memory traffic, and
presents scheduling algorithms to maximize system throughput
and fairness. Low-level memory scheduling to improve efficiency
of persistent memory access was studied in [41]. Except for [17–19],
none of these works deal with the issues of atomicity or durability
of write sequences. Our approach effectively uses HTM for concurrency
control and does not require changes to the front-end cache
controller or use logs for replaying transactions to PM.
Related Concurrency Work: Existing non-HTM solutions [9,
11, 12] tightly couple concurrency control with durable writes of
either write-ahead logs or data updates into Persistent Memory to
maintain persistence consistency. Software that employs these
approaches must generally extend the time it spends in critical
sections, which lengthens lock hold times, reduces concurrency,
and expands transaction duration.
Other work [10, 13] decouples concurrency control so that
post-transactional values may flow through the cache hierarchy and reach
PM asynchronously; however, the write-ahead log for an updating
transaction has to be committed into PM synchronously before
the transaction can close so that the integrity of the foreground
value flow is preserved across machine restarts. Another
hardware-assisted mechanism proposes hardware changes to allow a
dual-scheme checkpointing that writes previous check-pointed values
in the background while collecting current transaction writes [43].
Recent work [13, 21, 22] aims to exploit processor-supported
HTM mechanisms for concurrency control instead of traditional
locking or STM-based approaches. However, all of these solutions
require making significant changes to the existing HTM semantics
and implementations. For instance, PHTM [21] and PHyTM [22]
propose a new instruction called TransparentFlush which can be
used to flush a cache line from within a transaction to persistent
memory without causing any transaction to abort. They also propose
a change to the xend instruction that ends an atomic HTM
region, so that it atomically updates a bit in persistent memory as
part of its execution. Similarly, for DUDETM [13] to use HTM, it
requires that designated memory variables within a transaction be
allowed to be updated globally and concurrently without causing
an abort. Other work [26] utilizes HTM for concurrency control,
but requires aliasing all read and write accesses while concurrently
maintaining log ordering and replaying logs for retirement.
7 SUMMARY
In this paper we presented an approach that unifies HTM and PM to
create durable, concurrent transactions. Our approach works with
existing HTM and cache coherency mechanisms, does not require
changes to existing processor caches or store instructions,
avoids synchronous cache-line write-backs on completions, and
utilizes logs only for recovery. The solution correctly orders HTM
transactions and atomically commits them to Persistent Memory by
the use of a novel software protocol combined with a back-end
Persistent Memory Controller.
Our approach, evaluated using both micro-benchmarks and the
STAMP suite, compares well with standard (volatile) HTM
transactions. In comparison with persistent transactional locking, our
approach performs 3x faster on standard benchmarks and almost
9x faster on a Red-Black Tree data structure.
REFERENCES
[1] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, “SAP HANA
Database: Data management for modern business applications,” SIGMOD Rec.,
vol. 40, no. 4, pp. 45–51, Jan. 2012.
[2] V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy,
J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman et al., “Db2 with blu acceleration:
So much more than just a column store,” Proceedings of the VLDB Endowment,
vol. 6, no. 11, pp. 1080–1091, 2013.
[3] R. Palamuttam, R. M. Mogrovejo, C. Mattmann, B. Wilson, K. Whitehall, R. Verma,
L. McGibbney, and P. Ramirez, “Scispark: Applying in-memory distributed
computing to weather event detection and tracking,” in Big Data (Big Data),
2015 IEEE International Conference on. IEEE, 2015, pp. 2020–2026.
[4] G. Team, “Gridgain: In-memory computing platform,” 2007.
[5] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman,
D. Tsai, M. Amde, S. Owen et al., “MLlib: Machine learning in apache spark,”
Journal of Machine Learning Research, vol. 17, no. 34, pp. 1–7, 2016.
[6] I. Corporation. (2015, July) Intel and Micron Produce Breakthrough Memory
Technology. [Online]. Available:
https://newsroom.intel.com/news-releases/intel-and-micron-produce-breakthrough-memory-technology/
[7] M. Herlihy and J. E. B. Moss, Transactional Memory: Architectural support for
lock-free data structures. ACM, 1993, vol. 21,2.
[8] R. Rajwar and J. R. Goodman, “Speculative lock elision: Enabling highly con-
current multithreaded execution,” in Proceedings of the 34th annual ACM/IEEE
international symposium on Microarchitecture. IEEE Computer Society, 2001,
pp. 294–305.
[9] H. Volos, A. J. Tack, and M. Swift, “Mnemosyne: Lightweight persistent
memory,” in Proceedings of 16th International Conference on Architectural Support for
Programming Languages and Operating Systems. ACM Press, 2011, pp. 91–104.
[10] E. Giles, K. Doshi, and P. Varman, “SoftWrAP: A lightweight framework for
transactional support of storage class memory,” in Mass Storage Systems and
Technologies (MSST), 2015 31st Symposium on, May 2015, pp. 1–14.
[11] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari, “Atlas: Leveraging locks for
non-volatile memory consistency,” in Proceedings of the 2014 ACM International
Conference on Object Oriented Programming Systems Languages & Applications,
ser. OOPSLA ’14. New York, NY, USA: ACM, 2014, pp. 433–452. [Online].
Available: hp://doi.acm.org/10.1145/2660193.2660224
[12] A. Chatzistergiou, M. Cintra, and S. D. Viglas, “Rewind: Recovery write-ahead
system for in-memory non-volatile data-structures,” Proceedings of the VLDB
Endowment, vol. 8, no. 5, pp. 497–508, 2015.
[13] M. Liu, M. Zhang, K. Chen, X. Qian, Y. Wu, W. Zheng, and J. Ren, “DudeTM:
Building Durable Transactions with Decoupling for Persistent Memory,” in
Proceedings of the Twenty-Second International Conference on Architectural
Support for Programming Languages and Operating Systems, ser. ASPLOS
’17. New York, NY, USA: ACM, 2017, pp. 329–343. [Online]. Available:
hp://doi.acm.org/10.1145/3037697.3037714
[14] J. Zhao, S. Li, D. H. Yoon, Y. Xie, and N. P. Jouppi, “Kiln: Closing the performance
gap between systems with and without persistence support,” in Proceedings of
the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser.
MICRO-46. New York, NY, USA: ACM, 2013, pp. 421–432. [Online]. Available:
hp://doi.acm.org/10.1145/2540708.2540744
[15] A. Joshi, V. Nagarajan, S. Viglas, and M. Cintra, “ATOM: Atomic durability in
non-volatile memory through hardware logging,” in 2017 IEEE International
Symposium on High Performance Computer Architecture (HPCA), 2017.
[16] S. Li, P. Wang, N. Xiao, G. Sun, and F. Liu, “SPMS: Strand based persistent
memory system,” in Proceedings of the Conference on Design, Automation &
Test in Europe, ser. DATE ’17. 3001 Leuven, Belgium, Belgium: European
Design and Automation Association, 2017, pp. 622–625. [Online]. Available:
hp://dl.acm.org/citation.cfm?id=3130379.3130529
[17] E. Giles, K. Doshi, and P. Varman, “Bridging the programming gap between
persistent and volatile memory using WrAP,” in Proceedings of the ACM International
Conference on Computing Frontiers. ACM, 2013, p. 30.
[18] L. Pu, K. Doshi, E. Giles, and P. Varman, “Non-Intrusive Persistence with a
Backend NVM Controller,” IEEE Computer Architecture Letters, vol. 15, no. 1, pp.
29–32, Jan 2016.
[19] K. Doshi, E. Giles, and P. Varman, “Atomic Persistence for SCM with a
Non-intrusive Backend Controller,” in The 22nd International Symposium on
High-Performance Computer Architecture (HPCA). IEEE, March 2016.
[20] Z. Wang, H. Yi, R. Liu, M. Dong, and H. Chen, “Persistent transactional memory,”
IEEE Computer Architecture Leers, vol. 14, no. 1, pp. 58–61, Jan 2015.
[21] H. Avni, E. Levy, and A. Mendelson, “Hardware transactions in nonvolatile
memory,” in Proceedings of the 29th International Symposium on Distributed
Computing - Volume 9363, ser. DISC 2015. New York, NY, USA: Springer-Verlag
New York, Inc., 2015, pp. 617–630.
[22] H. Avni and T. Brown, “PHyTM: Persistent hybrid transactional memory,” Pro-
ceedings of the VLDB Endowment, vol. 10, no. 4, pp. 409–420, 2016.
[23] E. Giles, K. Doshi, and P. Varman, “Brief announcement: Hardware transactional
storage class memory,” in Proceedings of the 29th ACM Symposium on Parallelism
in Algorithms and Architectures, ser. SPAA ’17. New York, NY, USA: ACM, 2017,
pp. 375–378. [Online]. Available: http://doi.acm.org/10.1145/3087556.3087589
[24] W. Ruan, Y. Liu, and M. Spear, “Boosting timestamp-based transactional memory
by exploiting hardware cycle counters,” ACM Transactions on Architecture and
Code Optimization (TACO), vol. 10, no. 4, p. 40, 2013.
[25] Y. Liu, J. Gottschlich, G. Pokam, and M. Spear, “Tsxprof: Profiling hardware
transactions,” in Parallel Architecture and Compilation (PACT ), 2015 International
Conference on. IEEE, 2015, pp. 75–86.
[26] E. Giles, K. Doshi, and P. Varman, “Continuous Checkpointing of HTM
Transactions in NVM,” in Proceedings of the 2017 ACM SIGPLAN International
Symposium on Memory Management, ser. ISMM 2017. New York, NY, USA: ACM,
2017, pp. 70–81. [Online]. Available: http://doi.acm.org/10.1145/3092255.3092270
[27] Intel Corporation, “Intel Transactional Synchronization Extensions,” in Intel
Architecture Instruction Set Extensions Programming Reference, February 2012,
ch. 8, http://software.intel.com/.
[28] D. A. Bader and K. Madduri, “Design and implementation of the HPCS graph
analysis benchmark on symmetric multiprocessors,” in International Conference
on High-Performance Computing. Springer, 2005, pp. 465–476.
[29] C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun, “STAMP: Stanford
transactional applications for multi-processing,” in Workload Characterization, 2008.
IISWC 2008. IEEE International Symposium on. IEEE, 2008, pp. 35–46.
[30] D. Dice and N. Shavit, “Understanding tradeoffs in software transactional
memory,” in Code Generation and Optimization, 2007. CGO’07. International Symposium
on. IEEE, 2007, pp. 21–33.
[31] D. Dice, O. Shalev, and N. Shavit, “Transactional Locking II,” in Distributed
Computing. Springer, 2006, pp. 194–208.
[32] C. C. Minh, “TL2-x86, a port of tl2 to x86 architecture,” in On GitHub, ccaominh,
tl2-x86. Stanford, 2015. [Online]. Available:
https://github.com/ccaominh/tl2-x86
[33] J. H. Ahn, S. Li, O. Seongil, and N. P. Jouppi, “Mcsima+: A manycore simulator
with application-level+ simulation and detailed microarchitecture modeling,” in
Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International
Symposium on. IEEE, 2013, pp. 74–85.
[34] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J.
Reddi, and K. Hazelwood, “Pin: building customized program analysis tools with
dynamic instrumentation,” in ACM Sigplan Notices, vol. 40, no. 6. ACM, 2005,
pp. 190–200.
[35] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “Dramsim2: A cycle accurate memory
system simulator,” Computer Architecture Letters, vol. 10, no. 1, pp. 16–19, 2011.
[36] S. Pelley, P. M. Chen, and T. F. Wenisch, “Memory persistency,” in Computer
Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on. IEEE,
2014, pp. 265–276.
[37] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee,
“Better I/O through byte-addressable, persistent memory,” in Proceedings of
the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ser. SOSP
’09. New York, NY, USA: ACM, 2009, pp. 133–146. [Online]. Available:
http://doi.acm.org/10.1145/1629575.1629589
[38] S. Venkatraman, N. Tolia, P. Ranganathan, and R. H. Campbell, “Consistent and
durable data structures for non-volatile byte addressable memory,” in Proceedings
of 9th Usenix Conference on File and Storage Technologies. ACM Press, 2011, pp.
61–76.
[39] D. Narayanan and O. Hodson, “Whole-system persistence,” in Proceedings of 17th
International Conference on Architectural Support for Programming Languages
and Operating Systems. ACM Press, 2012, pp. 401–410.
[40] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance
main memory system using phase-change memory technology,” in Proceedings
of the 36th Annual International Symposium on Computer Architecture, ser.
ISCA ’09. New York, NY, USA: ACM, 2009, pp. 24–33. [Online]. Available:
http://doi.acm.org/10.1145/1555754.1555760
[41] P. Zhou, Y. Du, Y. Zhang, and J. Yang, “Fine-grained QoS scheduling for PCM-
based main memory systems,” in Parallel & Distributed Processing (IPDPS), 2010
IEEE International Symposium on. IEEE, 2010, pp. 1–12.
[42] J. Zhao, O. Mutlu, and Y. Xie, “Firm: Fair and high-performance memory control
for persistent memory systems,” in Microarchitecture (MICRO), 2014 47th Annual
IEEE/ACM International Symposium on. IEEE, 2014, pp. 153–165.
[43] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, “ThyNVM: Enabling
software-transparent crash consistency in persistent memory systems,” in
Proceedings of the 48th International Symposium on Microarchitecture. ACM, 2015,
pp. 672–685.
