Emerging Persistent Memory technologies (also PM, Non-Volatile DIMMs, Storage Class Memory or SCM) hold tremendous promise for accelerating popular data-management applications like in-memory databases. However, programmers now need to deal with ensuring the atomicity of transactions on Persistent Memory resident data and maintaining consistency between the order in which processors perform stores and that in which the updated values become durable.
INTRODUCTION
This paper provides a solution to the problems of adding durability to concurrent in-memory transactions that use Hardware Transactional Memory (HTM) for concurrency control while operating on data in byte-addressable Non-Volatile Memory or Persistent Memory (PM).
Recent years have witnessed a sharp shift towards real-time, data-driven, and high-throughput applications. This has spurred a broad adoption of in-memory and massively parallelized data processing [1] [2] [3] across business, scientific, and industrial application domains [4, 5]. Two emerging hardware developments provide further impetus to this approach and raise the potential for transformative gains in speed and scalability: (a) the arrival of inexpensive, byte-addressable, and large-capacity Persistent Memory devices [6] eliminates the I/O operations and bottlenecks common to many data management applications, and (b) the availability of CPU-based transaction support (with HTM) [7, 8] makes it straightforward for threads to work spontaneously in shared memory spaces without having to synchronize explicitly.
To prevent corruption of state upon an untimely machine or software failure, a sequence of store operations to PM in a transactional section of code cannot be partially reflected into PM; nor can it be transmitted piecemeal from processor caches into PM without similarly risking significant loss of data. Operating on Persistent Memory based data thus produces new atomicity and consistency requirements. Software approaches for ensuring atomic durable updates share some characteristics with HTM techniques in commercial processors: both checkpoint state at some level of granularity and guard against communication of partial updates. However, different mechanisms are at play: while stable stores to persistent media are usually obtained by covering updates with logging or versioning, partial updates are prevented from being propagated between threads in HTM transactions by CPUs sheltering them until the transaction closes. Once an HTM transaction closes, its updates become visible en masse through the cache hierarchy and can travel in any order to memory DIMMs. However, stable storage of the updates into Persistent Memory by the transaction additionally requires an ability to reliably delineate its updates from those by other overlapping transactions, and to use that delineation to recover from an unanticipated machine restart.
Existing PM programming frameworks separate into categories based on the degree of change they require in the processor architecture for such problems. Several works [9] [10] [11] [12] [13] operate with existing processor capabilities and write a log (either a write-ahead log or an undo log) durably to cover data changes arriving via the volatile cache hierarchy; some [14] [15] [16] require significant changes to existing cache hardware and protocols in the processor microarchitecture, while others [17] [18] [19] only require external controllers not affecting the processor core.
Isolation among concurrent transactions in the above works is achieved either by the use of two-phase locking protocols or provided within the rubric of an STM [9] . Logging based software approaches are problematic for HTM transactions (e.g., Intel TSX) which cannot bypass the caches in order to flush the log records synchronously into Persistent Memory ahead of transaction closings.
For these, PTM [20] proposes changes to processor caches while adding an on-chip scoreboard and global transaction id register to couple HTM with PM. Recent work [13, 21, 22] has attempted to provide inter-transactional isolation by employing processor-based (HTM) [7, 8] mechanisms. However, these solutions all require changes to the existing HTM semantics and implementations: [21] and [22] propose a new instruction to perform a non-aborting cache-line flush from within an HTM, while [13] proposes allowing non-aborting concurrent access to designated memory variables within an HTM. Selective and incremental changes to the clean isolation semantics of HTM are not to be undertaken lightly; understanding their impact on global system correctness and performance typically requires long gestation periods before processor manufacturers will embrace them. In this paper we provide a new solution to obtain persistence consistency in Persistent Memory while using HTM for concurrency control. The solution does not alter the processor microarchitecture, but leverages a very simple, external Persistent Memory controller along with a persistence protocol to supplement the existing HTM semantics, allowing HTM transactions to operate at the speed of in-memory volatile transactions. While the solution achieves the concurrency benefits of HTM for PM-based data, it also applies to non-HTM transactions in a straightforward way.
OVERVIEW
HTM+PM Basics
Hardware Transactional Memory, or HTM, was introduced in [7] as a new, easy-to-use method for lock-free synchronization supported by hardware. The initial instructions for HTM included load and store transactional instructions in addition to transactional management instructions. Most HTM implementations extend an underlying cache-coherency protocol to handle detection of transactional conflicts during program execution. The hardware system performs a speculative execution on a demarcated region of code similar to an atomic section. Independent transactions (those that do not write a shared variable) proceed unrestricted through their HTM sections. Transactions which access common variables concurrently in their HTM sections, with at least one transaction performing a write, are serialized by the HTM. That is, all but one of the transactions is aborted; an aborted transaction will restart its HTM code at the beginning. Updates made within the HTM section are hidden from other transactions and are prevented from writing to memory until the transaction successfully completes the HTM section. This mechanism provides atomicity (all-or-nothing) semantics for individual transactions with respect to visibility by other threads, and serialization of conflicting, dependent transactions. However HTM was originally designed for volatile memory systems (rather than supporting database style ACID transactions) and therefore any power failure leaves main memory in an unpredictable state relative to the actual values of the transaction variables.
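The conflict rule described above can be illustrated with a minimal Python sketch. It is a model under stated assumptions, not the hardware mechanism: real HTM detects conflicts at cache-line granularity through the coherency protocol, whereas the hypothetical AccessSets class and conflicts function below compare sets of variable names. The transactions A and B are illustrative and R is a hypothetical read-only transaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessSets:
    """Variables read and written inside one HTM section (hypothetical model)."""
    reads: frozenset
    writes: frozenset


def conflicts(a: AccessSets, b: AccessSets) -> bool:
    """True when a and b share a variable and at least one of them writes it."""
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)


# Illustrative transactions: A: {w = 3; x = 1;}  B: {y = w + 1; z = 1;}
A = AccessSets(reads=frozenset(), writes=frozenset({"w", "x"}))
B = AccessSets(reads=frozenset({"w"}), writes=frozenset({"y", "z"}))
R = AccessSets(reads=frozenset({"y"}), writes=frozenset())  # hypothetical reader

a_vs_b = conflicts(A, B)   # B reads w, which A writes
a_vs_r = conflicts(A, R)   # no shared variable
r_vs_r = conflicts(R, R)   # shared variable, but nobody writes it
```

Under this model only the pair (A, B) would be serialized by the HTM; A and R are independent and proceed unrestricted.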
Persistent Memory, or PM, introduces a new method of persistence to the processor. PM, in the form of persistent DIMM, resides on the main-memory bus alongside DRAM. Software can access persistent memory using the usual LOAD and STORE instructions used for DRAM. Like other memory variables, PM variables are subject to forced and involuntary cache-evictions and encounter other deferred memory operations done by the processor.
For Intel CPUs, the CLWB and CLFLUSHOPT instructions provide the ability to write modified data back (at cache-line granularity) from the processor cache hierarchy. These instructions, however, are weakly ordered with respect to other store operations in the instruction stream. Intel has extended the semantics of SFENCE to cover such flushed store operations, so that software can issue SFENCE to prevent new stores from executing until previously flushed data has entered a power-safe domain; i.e., the data is guaranteed by hardware to reach its locations in the PM media. This guarantee also applies to data that is written to PM with instructions that bypass the processor caches. However, when executing within an HTM transaction, a CPU cannot exercise CLWB, CLFLUSHOPT, non-cacheable stores, and SFENCE instructions, since the stores by the CPU are considered speculative until the HTM transaction completes successfully.
Even though HTM guarantees that transactional values are only visible on transaction completion, hardware manufacturers cannot simply utilize a non-volatile processor cache hierarchy or battery backed flushing of the cache on failures to provide transactional atomicity. Transactions that do not complete before a software or hardware restart produce partial and therefore inconsistent updates in non-volatile memory, as there is no guarantee when a machine halt will occur. The halt may happen during XEND execution leaving only partial updates in cache or write buffers which can corrupt in-memory data structures.
Challenges of persistent HTM transactions
Consider transactions A, B and C shown in Listings 1, 2 and 3. Assume that w, x, y, z are persistent variables initialized to zero in their home locations in Persistent Memory (PM). The code section demarcated between the instructions XBegin and XEnd will be referred to as an HTM transaction or simply a transaction. The HTM mechanism ensures the atomicity of transaction execution. Within an HTM transaction, all updates are made to private locations in the cache, and the hardware guarantees that the updates are not allowed to propagate to their home locations in PM. After the XEnd instruction completes, all of the cache lines updated in the transaction become instantaneously visible in the cache hierarchy.
Atomic Persistence: The first challenge is to ensure that the transaction's updates that were made atomically in the cache are also persisted atomically in PM. Following XEnd, the transaction variables are once again subject to the normal cache operations like evictions and the use of cache write-back instructions. There are no guarantees regarding whether or when the transaction's updates actually get written to PM from the cache. This can create a problem if the machine crashes before all these updates are written back to PM. On a reboot, the values of these variables in PM will be inconsistent with the pre-crash transaction values. This leads to the first requirement:
• Following crash recovery, ensure that all or none of the updates of an HTM transaction are stored in their PM home locations.
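The failure mode behind this requirement can be shown with a small Python model. This is only a sketch under assumptions: PM and the cache are modeled as dictionaries, the helper names pm_after_crash and is_atomic are hypothetical, and the transaction (setting w and z to 1) is chosen purely for illustration.

```python
def pm_after_crash(pm_before, cache_updates, evicted):
    """PM contents after a crash in which only the `evicted` lines were written back."""
    pm_after = dict(pm_before)
    for addr in evicted:
        pm_after[addr] = cache_updates[addr]
    return pm_after


def is_atomic(pm_before, pm_after, cache_updates):
    """True when PM reflects either none or all of the transaction's updates."""
    none_applied = all(pm_after[a] == pm_before[a] for a in cache_updates)
    all_applied = all(pm_after[a] == v for a, v in cache_updates.items())
    return none_applied or all_applied


pm = {"w": 0, "z": 0}
updates = {"w": 1, "z": 1}          # hypothetical transaction: w = w + 1; z = w
partial = pm_after_crash(pm, updates, evicted=["w"])   # only w reached PM
nothing = pm_after_crash(pm, updates, evicted=[])
everything = pm_after_crash(pm, updates, evicted=["w", "z"])
```

In this model the partial case leaves w updated and z stale, which is exactly the state the requirement forbids after recovery.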
A common solution is to log the transaction updates in a separate persistent storage area before allowing them to update their PM home locations. Should a crash interrupt the updating of the home locations, the saved log can be replayed. When transactions execute within an HTM there is a problem with this solution, since the log cannot be written to PM within the transaction; it can be written only after the XEnd. At that time the transaction updates are also made visible in the cache hierarchy and are susceptible to uncontrolled cache evictions into PM. Hence there is no guarantee that the log has been persisted before transaction updates have percolated into PM. We describe our solution in Section 3.
[Listing fragment: XBegin; w = w + 1; z = w; XEnd; }]
Persistence Ordering: The second problem deals with ensuring that the execution order of dependent HTM transactions is correctly reflected in PM following crash recovery. As an example, consider the dependent transactions A, B, C in Listings 1, 2 and 3. This leads to the second requirement:
• Following crash recovery, ensure that the persistent state of any sequence of dependent transactions is consistent with their execution order.
If individual transactions satisfy atomic persistence, then it is sufficient to ensure that PM is updated in transaction execution order. With software concurrency control (using an STM or two-phase transaction locking), it is straightforward to correctly order the updates simply by saving a transaction's log before it commits and releases its locks. In case of a crash, the saved logs are simply replayed in the order they were saved, thereby reconstructing persistent state to a correctly-ordered prefix of the executed transactions.
When HTM is used for concurrency control the logs can only be written to PM after the transaction XEnd. At that time other dependent transactions can execute and race with the completed transaction, perhaps writing out their logs before the first. Solutions like using an atomic counter within transactions to order them correctly are not practical since the shared counter will result in increased HTM-induced aborts and serialization of transactions. Some papers have advocated that processor manufacturers alter HTM semantics and implementation to allow selective writes to PM from within an HTM [13, 21, 22] . We describe our solution without the need for such intrusive processor changes in Section 3.
Strict and Relaxed Durability: In traditional ACID databases, a committed transaction is guaranteed to be durable since its log is made persistent before it commits. We refer to this property as strict durability. In HTM transactions the log is written to PM after the XEnd instruction, some time before the transaction commits. A natural question is to characterize the time at which it is safe for a transaction requiring strict durability to commit.
It is generally not safe to commit a transaction Y at the time it completes persisting its log, for the same reason that it is difficult to ensure persistence ordering. Due to races in the code outside the HTM, it is possible for an earlier transaction X (on which Y depends) to have completed but not yet persisted its log. When recovering from a crash that occurs at this time, the log of Y should not be replayed, since the earlier transaction X cannot be replayed. This leads to the third requirement:
• Following crash recovery, strict durability requires that every committed transaction is persistent in PM.
We define a new property known as relaxed durability that allows an individual transaction to opt for an early commit immediately after it persists its log. Requiring relaxed or strict durability is a local choice made individually by a transaction based on the application requirements. Transactions choosing relaxed durability face a window of vulnerability after they commit, during which a crash may result in their transaction updates not being reflected in PM after recovery. The gain is potentially reduced transaction latency. However, irrespective of the choice of the durability model by individual transactions, the underlying persistent memory will always recover to a consistent state, reflecting an ordered atomic prefix of every sequence of dependent transactions.
OUR APPROACH
Our approach achieves durability of HTM transactions to Persistent Memory by a cooperative protocol involving three components: a back-end Persistent Memory Controller, transaction execution and logging, and a failure recovery procedure. The Persistent Memory Controller intercepts dirty cache lines evicted from the last-level processor cache (LLC) on their way to persistent memory. An intercepted cache line is held in a FIFO queue within the controller until it is safe to write it out to PM. All memory variables are subject to the normal cache operations and are fetched and evicted according to normal cache protocols. The only change we introduce is interposing the external controller between the LLC and memory. Note that the controller does not require changing any of the internal processor behavior. The controller simply delays the evicted cache lines on their way to memory till it can guarantee safety. It is pre-programmed with the address range of a region of persistent memory that is reserved for holding transaction logs. Addresses in the log region pass through the controller without intervention.
HTM+PM transactions execute independently. Within an outer envelope that achieves consistency of updates between the volatile cache hierarchy and the durable state in PM, these transactions use an unmodified HTM to serialize the computational portions of conflicting transactions. A transaction (1) notifies the controller when it opens and closes, (2) saves start and end timestamps in PM to enable consistent recovery after a failure, (3) performs its HTM operation, and (4) persists a log of its updates in PM before closing. If a transaction requires strict durability it informs the controller during its closing step, and then waits for the go-ahead from the controller before committing. If it only needs relaxed durability it can commit immediately after its close. The recovery routine is invoked after a system crash to restore the PM variables to a valid state, i.e., a state that is consistent with the actual execution order of every sequence of dependent transactions. The recovery procedure uses the saved logs to recover the values of the updated variables, and the saved start and end timestamps to determine which logs are eligible for replay and their replay order.
Transaction Lifecycle
A transaction can be viewed as progressing through five states: OPEN, COMPUTE, LOG, CLOSE and COMMIT as shown in Listing 4. When a transaction begins, it calls the library function OpenWrapC (see Algorithm 1 in Section 4.2). This function invokes the Persistent Memory Controller with a small unique integer (wrapId) that identifies the transaction. The controller adds wrapId to a set of currently open transactions (referred to as COT) that it maintains (see Algorithm 2 in Section 4.3). The transaction then allocates and initializes space in PM for its log and updates the startTime record of the log. The startTime is obtained by reading a system wide platform timer using the RDTSCP instruction (see Section 4.1). In addition to startTime, a log includes a second timestamp persistTime that will be set just prior to completing the HTM transaction. The writeSet is a sequence of (address, value) pairs in the log that will be filled with the locations and values of the updates made by the HTM transaction. The log with its recorded startTime is then persisted using cache line write-back instructions (clwb) and sfence.
The transaction then enters the COMPUTE state by executing XBegin and entering the HTM code section. Within the HTM section, the transaction updates writeSet with the persistent variables that it writes. Note that the records in writeSet will be held in cache during the COMPUTE state since it occurs within an HTM and cannot be propagated to PM until XEnd completes. Immediately before XEnd the transaction obtains a second timestamp persistTime that will be used to order the transactions correctly. This timestamp is also obtained using the same RDTSCP instruction.
After executing XEnd, a transaction next enters the LOG state. It flushes its log records from the cache hierarchy into PM using cache line write-back instructions (CLWB or CLFLUSHOPT), following the last of them with an SFENCE. This ensures that all the log records have been persisted. In addition to startTime, a log includes the persistTime timestamp that was set just prior to completing the transaction. The writeSet records in the log hold (address, value) pairs representing the locations and the values updated by the transaction. After the SFENCE following the log record flushes, the transaction enters the CLOSE state.
In the CLOSE state the transaction signals the Persistent Memory Controller that its log has been persisted in PM. The controller removes the transaction from its set of currently open transactions COT. It also reflects the closing in the state of evicted cache lines in its FIFO buffer as described below in Section 3.2. A transaction requiring strict durability informs the controller at this time; the controller will signal the transaction in due course when it is safe to commit i.e. its updates are guaranteed to be durable. The transaction is then complete and enters the COMMIT state. If it requires strict durability it waits till it is signaled by the controller. Otherwise, it immediately commits and leaves the system.
If (strict durability requested) Wait for notification by controller;
Transaction End
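The five states described in this section can be traced with a short Python simulation. It is a sketch under several assumptions and not the library code: a counter stands in for the RDTSCP timer, dictionaries stand in for the cache and for the PM log area, and StubController, run_wrap and the body callback are hypothetical names. The stub controller reports durability immediately, whereas a real controller would signal later.

```python
import itertools

_clock = itertools.count(1)  # stand-in for the RDTSCP platform timer


class StubController:
    """Hypothetical controller stub: tracks only the set of open transactions."""

    def __init__(self):
        self.cot = set()

    def open(self, wrap_id):
        self.cot.add(wrap_id)

    def close(self, wrap_id, strict):
        self.cot.discard(wrap_id)
        return True  # stub: "safe to commit" is reported at once


def run_wrap(wrap_id, controller, pm_logs, cache, body, strict=False):
    """Drive one transaction through OPEN, COMPUTE, LOG, CLOSE and COMMIT."""
    trace = ["OPEN"]
    controller.open(wrap_id)
    log = {"startTime": next(_clock), "persistTime": None,
           "writeSet": [], "complete": False}
    pm_logs[wrap_id] = dict(log, writeSet=[])    # persist startTime (CLWB + SFENCE)

    trace.append("COMPUTE")                      # XBegin ... XEnd on real hardware

    def write(addr, value):
        cache[addr] = value                      # update stays in the cache
        log["writeSet"].append((addr, value))    # write-aside log record

    body(cache.get, write)
    log["persistTime"] = next(_clock)            # taken just before XEnd

    trace.append("LOG")                          # flush log records, then SFENCE
    log["complete"] = True
    pm_logs[wrap_id] = dict(log, writeSet=list(log["writeSet"]))

    trace.append("CLOSE")
    durable = controller.close(wrap_id, strict)

    trace.append("COMMIT")
    if strict and not durable:
        raise RuntimeError("real code would wait for the controller here")
    return trace


def example_body(read, write):
    write("w", read("w") + 1)
    write("z", read("w"))


controller = StubController()
pm_logs = {}
cache = {"w": 0, "z": 0}
trace = run_wrap(7, controller, pm_logs, cache, example_body, strict=True)
```

The sketch keeps the log separate from the cache contents on purpose: the data updates remain volatile after COMMIT, and only the log is assumed persistent.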

Persistent Memory Controller
The Persistent Memory Controller is shown in Figure 1 . While superficially similar to an earlier design [18, 19, 23] , this controller includes enhancements to handle the subtleties of using HTM rather than locking for concurrency control, and makes significant simplifications by shifting some of the responsibility for the maintenance of PM state to the recovery protocol.
A crucial function of the controller is to prevent any transaction's updates from reaching PM until it is safe for it to do so, without requiring detailed book-keeping about which transactions, currently active or previously completed, generated a specific update. It does this by enforcing two requirements before allowing an evicted dirty cache line (representing a transaction's update) to proceed to PM: (i) ensuring that the log of the transaction has been made persistent, and (ii) guaranteeing that the saved log will be replayed during recovery. The first condition is needed (but not sufficient) for atomic persistence by guarding against a failure that occurs after only a subset of a transaction's updates have been persisted in PM. The second requirement arises because transaction logs are persisted outside the HTM, and there is no relation between the order in which the transactions execute their HTM and the order in which their logs are persisted. To maintain correct persistence ordering, the recovery routine may not be able to replay a log. We illustrate the issue in the example below.
Example: Consider two dependent transactions A and B, A: {w = 3; x=1;} and B: {y = w+1; z = 1;}. Assume that HTM transaction A executes before B, but that B persists its logs and closes before A. Suppose there is a crash just after B closes. The recovery routine will not replay either A or B, since the log of the earlier transaction A is not available. This behavior is correct. Now consider the situation where y (value 4) is evicted from the cache and then written to PM after B persists its log. Once again, following a crash at this time, neither log will be replayed. However, atomic persistence is now violated since B's updates are partially reflected in PM. Note that this violation occurred even though the write back of y to PM happened after B's log was persisted. The Persistent Memory Controller protocol prevents a write back to PM unless it can also guarantee that the log of the transaction creating the update will be played back on recovery (see Lemmas C2 and C4 in Section 3.4).
The second function of the controller is to track when it is safe for a transaction requiring strict durability to commit. It is not sufficient to commit when a transaction's logs are persistent on PM since, as seen in the example above, the recovery routine may not replay the log if that would violate persistency ordering. The controller protocol effectively delays a strict durability transaction τ from committing until the earliest open transaction has a startTime greater than the persistTime of τ . This is because the recovery protocol will replay a log (see Section 3.3) if and only if all transactions with startTime less than its persistTime have closed.
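The commit condition for strict durability stated above can be written as a one-line predicate. This is a sketch of the condition only: the function name safe_to_commit and the explicit list of start timestamps are assumptions for illustration, and the text says the controller achieves this effect without necessarily comparing timestamps itself.

```python
def safe_to_commit(persist_time, open_start_times):
    """True once every open transaction started after `persist_time`.

    With no open transactions the condition holds trivially.
    """
    return all(start > persist_time for start in open_start_times)


blocked = safe_to_commit(5, [3, 8])   # a transaction that started at 3 is still open
ready = safe_to_commit(5, [6, 8])     # all open transactions started after 5
idle = safe_to_commit(5, [])          # nothing is open
```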
Implementation Overview:
The controller tracks transactions by maintaining a COT (currently open transactions) set. When a transaction opens, its identifier is added to COT, and when the transaction closes it is removed. The write into PM of a cache line C evicted into the Persistent Memory Controller is deferred by placing it at the tail of a FIFO queue maintained by the controller. The cache line is also assigned a tag called its dependency set, initialized with S, the value of COT at the instant that C entered the Persistent Memory Controller.
The controller holds the evicted instance of C in a FIFO until all transactions that are in its dependency set (i.e. S) have closed. When a transaction closes it is removed from both the COT and from the dependency sets of all the FIFO entries. When the dependency set of a cache line in the FIFO becomes empty, it is eligible to be flushed to PM. One can see that the dependency sets will become empty in the order in which the cache lines were evicted, since a transaction still in the dependency set of C when a new eviction enters the FIFO will also be in the dependency set of the new entry. The simple protocol guarantees that all transactions that opened before cache line C was evicted into the controller (which must also include the transaction that last wrote C) must have closed and persisted their logs when C becomes eligible to be written to PM. This also implies that all transactions with startTime less than the persistTime of the transaction that last wrote C would have closed, satisfying the condition for log replay. Hence the cache line can be safely written to PM without violating atomic persistence.
Note that the evicted cache lines intercepted by the controller do not hold any identifying transaction information and can occur at arbitrary times after the transaction leaves the COMPUTE state. The cache line could hold the update of a currently open transaction or could be from a transaction that has completed or even committed and left the system. To guarantee safety, the controller must perforce assume that the first situation holds. The details of the controller implementation will be presented in Section 4.3.
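The dependency-set protocol described above can be modeled in a few lines of Python. This is a behavioral sketch under assumptions, not the hardware design: the class name PMController and its methods are hypothetical, PM is a dictionary, and the pass-through path for the log region is not modeled. The demonstration replays the scenario of the earlier example, in which B closes before A and y (value 4) is then evicted.

```python
from collections import deque


class PMController:
    """Behavioral model of the FIFO with per-entry dependency sets."""

    def __init__(self):
        self.cot = set()      # currently open transactions
        self.fifo = deque()   # entries: (address, value, dependency_set)
        self.pm = {}          # home locations in persistent memory

    def open(self, wrap_id):
        self.cot.add(wrap_id)

    def evict(self, addr, value):
        # Tag the line with a snapshot of COT taken at the instant of arrival.
        self.fifo.append((addr, value, set(self.cot)))
        self._drain()

    def close(self, wrap_id):
        self.cot.discard(wrap_id)
        for _, _, deps in self.fifo:
            deps.discard(wrap_id)
        self._drain()

    def _drain(self):
        # Dependency sets empty in FIFO order, so only the head needs checking.
        while self.fifo and not self.fifo[0][2]:
            addr, value, _ = self.fifo.popleft()
            self.pm[addr] = value


ctrl = PMController()
ctrl.open("A")
ctrl.open("B")
ctrl.close("B")                     # B persists its log and closes first
ctrl.evict("y", 4)                  # B's update arrives while A is still open
held_back = "y" not in ctrl.pm      # the line waits in the FIFO
ctrl.close("A")                     # now the line may proceed to PM
```

The model does not know which transaction wrote y; it conservatively waits for every transaction that was open when the line arrived, as the text requires.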
Recovery
Each completed transaction saves its log in PM. The log holds records startTime and persistTime obtained by reading the platform timer using RDTSCP instruction. We refer to these as the start and end timestamps of the transaction. The start timestamp is persisted before a transaction enters its HTM. This allows the recovery routine to identify a transaction that started but had not completed at the time of a failure. Note even though such a transaction has not completed, it could still have finished its HTM and fed values to a later dependent transaction which has completed and persisted its log. The end timestamp and the write set of the transaction are persisted after the transaction completes its HTM section, followed by an end of log marker. There can be an arbitrary delay between the end of the HTM and the time that its log is flushed from caches into PM and persisted.
The recovery procedure is invoked on reboot following a machine failure. The routine will restore PM values to a consistent state that satisfies persistence ordering by copying values from the writeSets of the logs of qualifying transactions to the specified addresses. A transaction τ qualifies for log replay if and only if all earlier transactions on which it depends (both directly and transitively) are also replayed.
Implementation Overview:
The recovery procedure first identifies the set of incomplete transactions I, which have started (as indicated by the presence of a startTime record in their log) but have not completed (indicated by the lack of a valid end-of-record marker). The remaining complete transactions (set C) are potential candidates for replay. Denote the smallest start timestamp of transactions in I by T_min. A transaction in C is valid (qualifies for replay) if its end timestamp (persistTime) is no more than T_min. All valid transactions are replayed in increasing order of their end timestamps persistTime.
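The replay rule above can be sketched in Python as follows. The log layout (a dictionary with id, startTime, persistTime, complete and writeSet fields), the function name recover, and the timestamp values are assumptions made for illustration; the transactions reuse A: {w = 3; x = 1;} and B: {y = w + 1; z = 1;} from the controller example.

```python
def recover(logs, pm):
    """Replay qualifying logs into `pm`; return the ids replayed, in order."""
    incomplete = [log for log in logs if not log["complete"]]
    complete = [log for log in logs if log["complete"]]
    # Smallest start timestamp among incomplete transactions (infinity if none).
    t_min = min((log["startTime"] for log in incomplete), default=float("inf"))
    valid = [log for log in complete if log["persistTime"] <= t_min]
    valid.sort(key=lambda log: log["persistTime"])
    for log in valid:
        for addr, value in log["writeSet"]:
            pm[addr] = value
    return [log["id"] for log in valid]


log_a = {"id": "A", "startTime": 1, "persistTime": 3, "complete": True,
         "writeSet": [("w", 3), ("x", 1)]}
log_b = {"id": "B", "startTime": 2, "persistTime": 5, "complete": True,
         "writeSet": [("y", 4), ("z", 1)]}
# Crash before A persisted its log: only A's startTime record exists.
log_a_open = {"id": "A", "startTime": 1, "persistTime": None, "complete": False,
              "writeSet": []}

pm_crash = {"w": 0, "x": 0, "y": 0, "z": 0}
replayed_crash = recover([log_a_open, log_b], pm_crash)

pm_ok = {"w": 0, "x": 0, "y": 0, "z": 0}
replayed_ok = recover([log_b, log_a], pm_ok)
```

In the crash case B is skipped even though its log is complete, because A started before B's persistTime and never completed.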
Protocol Properties
We now summarize the invariants maintained by our protocol.
Definition:
The precedence set of a transaction T , denoted by prec(T ), is the set of all dependent transactions that executed their HTM before T . Since the HTM properly orders any two dependent transactions the set is well defined.
Lemma C1: Consider a transaction X with a precedence set prec(X ). For all transactions Y in prec(X ), startTime(Y ) < persistTime(X ).
Proof Sketch: Let Y be a transaction in prec(X). First let us consider direct precedence, in which a cache line C modified in Y controls the ordering of X with respect to Y. That is, X either reads or writes the cache line C. Since Y is in prec(X), the earliest time that X accesses C must be no earlier than the latest time that Y accesses C, and thus persistTime(Y) < persistTime(X). Next consider a chain of direct precedences, Y → Z → W → ⋯ → X, which puts Y in prec(X); by transitivity, persistTime(Y) < persistTime(X). Since startTime(Y) < persistTime(Y), the lemma follows.
Lemma C2: Consider transactions X and Y with startTime(Y ) < persistTime(X ). If a cache line C that is updated by X is written to PM by the controller at time t, then Y must have closed and persisted its log before t.
Proof Sketch: Suppose C was evicted to the controller at time t′ ≤ t. Now t′ must be later than the time X completed HTM execution and set persistTime(X); by assumption this is after Y set its startTime, at which time Y must have been registered as an open transaction by the controller. Now, either Y has closed before t′ or is still open at that time. In the latter case, Y will be added to the dependency set of C at t′. Since C can only be written to PM after its dependency set is empty, it follows that Y must have closed and removed itself from the dependency set of C.
Lemma C3: Any transaction X that writes an update to PM and closes at time t will be replayed by the recovery routine if there is a crash any time after t.
Proof Sketch: The recovery routine will replay a transaction X if the only incomplete transactions (started but not closed) at the time of the crash started after X completed; that is, there is no incomplete transaction Y that has startTime(Y) ≤ persistTime(X). By Lemma C2 such an incomplete transaction cannot exist.
Lemma C4:
Consider a transaction X with a precedence set prec(X ). Then by the time X closes and persists its logs, one of the following must hold: (i) Some update of X has been written back to PM and all 
ALGORITHM AND IMPLEMENTATION
The implementation consists of a user software library backed by a simple Persistent Memory Controller. The library is used primarily to coordinate closures of concurrent transactions with the flow of any data evicted from processor caches into PM home locations during those transactions. The Persistent Memory Controller uses the dependency set concept from [18, 19, 23] to temporarily park any processor cache eviction in a searchable structure (in our implementation, a Volatile Delay Buffer or VDB), so that its effective time to reach PM is no earlier than the time that the last possible transaction with which the eviction could have overlapped has become recoverable. The Persistent Memory Controller in this work improves upon the backend controller of [18] by dispensing with synchronous log replays and victim cache management. The library also covers any writes to PM variables with volatile write-aside log entries made within the speculative scope of an HTM transaction, and then streams the transactional log record into a non-volatile PM area outside the HTM transaction. These log streaming writes into PM bypass the VDB. A software mechanism may periodically check the remaining capacity of the PM log area and initiate a log cleanup if needed; for such occasional cleanups, new transactions are delayed, and, after all open transactions have closed, processor caches are flushed (with a closing sfence) and the logs are removed.
We refer to our implementation as WrAP, for Write-Aside Persistence, and individual transactions as wraps. We first describe the timestamp mechanism, then the user software library, and finally describe the Persistent Memory Controller implementation details.
System Time Stamp
We use the recent Intel instruction RDTSCP, or Read Time Stamp Counter and Processor ID, to obtain the timestamps in listing 4. The RDTSCP instruction provides access to a global monotonically increasing processor clock across processor sockets [24], while serializing itself behind the instructions that precede it in program order. To prevent the reordering of an XEnd before the RDTSCP instruction, we save the resulting time stamp into a volatile memory address. Since all stores preceding an XEnd become visible after XEnd, and the store of the persist timestamp is the last store before XEnd, that store gets neither reordered before other stores nor reordered after the end of the HTM transaction. We note that RDTSCP has also been used to order HTM transactions in novel transaction profiling [25] and memory version checkpointing [26].
Software Library
For HTM we employ Intel's implementation of Restricted Transactional Memory, or RTM, which includes the instructions XBegin and XEnd. Aborted HTM transactions are retried with exponential back-off a few times and are then performed under a software lock. Our HTMBegin routine checks the status of the software lock both before and after an XBegin to produce the correct indication of conflicts with the non-speculative paths, and acquires the software lock non-speculatively after having backed off. The HTMBegin and HTMEnd library routines perform the acquire and release of the software lock for the fallback case within themselves. The remaining software library procedures are shown in Algorithm 1. Various events that arise in the course of a transaction are shown in Figure 2, which depicts the HTM concurrency section with vertical lines and the logging section with slanted lines.
Not shown in Figure 2 is a per-thread durability address location that we call durabilityAddr in Algorithm 1. A software thread may use it to set up a Monitor-Mwait coordination to be signaled via memory by the Persistent Memory Controller (as described shortly) when a transacting thread wants to wait until all updates from any non-conflicting transactions that may have raced with it are confirmed to be recoverable.
This provision allows for implementing strict durability for any transaction, because the logs of all other transactions that could possibly precede it in persistence order are in PM, which guarantees the replayability of its log. By contrast, many other transactions that only need the consistency guarantee (correct log ordering) may continue without waiting (or defer waiting to a different point in the course of a higher level multi-transaction operation).
The number of active HTM transactions at any given time is bounded by the number of CPUs; therefore, we use thread identifiers as wrapIds. In OpenWrapC we notify the Persistent Memory Controller that a wrap has started. We then read the start time with RDTSCP and persistently save it, together with an empty write set, into the wrap's log. The transaction is then started with the HTMBegin routine.
During the course of a transactional computation, the stores are performed using the wrapStore function. The stores are just the ordinary (speculatively performed) store instructions, but are accompanied by (speculative) recording of the updates into the log locations, each capturing the (address, value) pair of an update, to be committed into PM later during the logging phase (after XEnd).
In CloseWrapC we obtain and record the ending timestamp for an HTM transaction into the persistTime variable in its log. Its concurrency section is then terminated with the HTMEnd routine. At this point, the cached write set for the log and the ending persist timestamp are instantly visible in the cache. Next, we flush the transactional values and the persist timestamp to the log area, followed by a persistent memory fence. The transaction closure is then notified to the Persistent Memory Controller with the wrapId and, if the thread has requested strict durability (by passing a flag to CloseWrapC), the durabilityAddr; in that case we use the efficient Monitor-Mwait construct to receive memory-based signaling from the Persistent Memory Controller. If strict durability is not requested, CloseWrapC returns immediately and the thread proceeds with relaxed durability. In many cases a thread performing a series of transactions may choose relaxed durability for all but the last and then obtain strict durability over the entire set by waiting only for the last one to be strictly durable.
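The order of steps in OpenWrapC, wrapStore, and CloseWrapC can be summarized with a small software model. This is only a sketch under stated assumptions: a Python counter stands in for RDTSCP, dictionaries stand in for cached PM variables and the PM log area, MockController is a hypothetical placeholder for the controller notifications, and the HTM section, the cache-line flushes, the fence, and the Monitor-Mwait wait are represented by comments rather than modeled.

```python
import itertools

_clock = itertools.count(1)  # assumption: stand-in for RDTSCP, strictly increasing


class MockController:
    """Hypothetical placeholder for the Open Wrap / Close Wrap notifications."""

    def __init__(self):
        self.open_wraps = set()
        self.last_durability_addr = None

    def open_wrap(self, wrap_id):
        self.open_wraps.add(wrap_id)

    def close_wrap(self, wrap_id, durability_addr=None):
        self.open_wraps.discard(wrap_id)
        self.last_durability_addr = durability_addr


class Wrap:
    def __init__(self, wrap_id, controller, pm_log, memory):
        self.wrap_id = wrap_id
        self.controller = controller
        self.pm_log = pm_log      # models the persistent log area
        self.memory = memory      # models cached PM variables
        self.write_set = []       # volatile write-aside log entries

    def open(self):
        self.controller.open_wrap(self.wrap_id)
        start = next(_clock)
        # Start time and an empty write set are persisted before the HTM section.
        self.pm_log[self.wrap_id] = {"start": start, "writes": [], "persist": None}
        # HTMBegin would be called here.

    def store(self, addr, value):
        self.memory[addr] = value              # ordinary (speculative) store
        self.write_set.append((addr, value))   # write-aside log entry

    def close(self, strict=False):
        persist = next(_clock)  # persist timestamp, taken last inside the HTM section
        # HTMEnd would be called here; the log record is streamed to PM afterwards.
        record = self.pm_log[self.wrap_id]
        record["writes"] = list(self.write_set)
        record["persist"] = persist
        # A persistent memory fence would follow the log flush here.
        addr = f"durability-{self.wrap_id}" if strict else None
        self.controller.close_wrap(self.wrap_id, durability_addr=addr)
        # With strict=True the thread would now wait on addr (not modeled).


ctrl, log, mem = MockController(), {}, {}
w = Wrap(0, ctrl, log, mem)
w.open()
w.store("a", 1)
w.store("b", 2)
w.close(strict=True)
print(log[0])
```

In this model the log record carries the start timestamp, the (address, value) pairs, and the persist timestamp by the time the controller is notified of the close, mirroring the ordering described above.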
PM Controller Implementation
The Persistent Memory Controller provides for two needs: (1) holding back modified PM cache lines that fall into it at any time T from the processor caches, so that they do not flow into PM until at least a time when all successful transactions that were active at time T are recoverable, and (2) tracking the ordering of dependencies among transactions so that only those that need strict durability guarantees are delayed pending the completion of the log phases of those with which they overlap. It implements a VDB (Volatile Delay Buffer) as the means of transient storage for the first need, and a durability wait queue (DWQ) for the second.
The VDB consists of a FIFO queue and a hash table that points into it. Each FIFO queue entry contains a tuple of PM address, data, and dependency set. On a PM write, resulting from a cache eviction or streaming store, to a memory address not in the log area or pass-through area, the PM address and data are added to the FIFO queue and tagged with a dependency set initialized to the COT (the set of Current Open Transactions). Additionally, the PM address is inserted into the hash table with a pointer to the FIFO queue entry. If the address already exists in the hash table, then it is updated to point to the new queue entry. On a memory read, the hash table is first consulted. If an entry is in the hash table, then the pointer is to the latest memory value for the address, and the data is retrieved from the queue. On a hash table miss, PM is read and the data is returned. As wraps close, the dependency set in each entry in the queue is updated to remove the dependency on the closing wrap.
Dependency sets become empty in FIFO order, and as they become empty, we perform three actions. First, we write back the data to the PM address. Next, we consult the hash [...]
On inserting an entry into the back of the queue, we can also check whether the dependency set at the head of the FIFO queue is empty. If it is, we can perform the same actions, allowing for O(1) VDB management.
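A minimal software model of the VDB described above is sketched here. It is not the controller's hardware design: the class and method names are hypothetical, dependency sets are Python sets rather than bit vectors, and the rule for dropping a hash table entry at write-back time (only when it still points at the entry being written back) is an assumption of this sketch.

```python
from collections import deque


class VolatileDelayBuffer:
    def __init__(self):
        self.fifo = deque()   # entries: [addr, data, dependency_set]
        self.latest = {}      # hash table: addr -> newest FIFO entry for addr
        self.pm = {}          # models PM home locations

    def write(self, addr, data, open_wraps):
        entry = [addr, data, set(open_wraps)]
        self.fifo.append(entry)
        self.latest[addr] = entry   # repoint to the newest copy of this address
        self._drain()               # checking the head on insert is cheap

    def read(self, addr):
        entry = self.latest.get(addr)
        if entry is not None:
            return entry[1]          # newest buffered value
        return self.pm.get(addr)     # hash table miss: read PM

    def close_wrap(self, wrap_id):
        # Hardware would clear one bit across all entries; the model loops.
        for entry in self.fifo:
            entry[2].discard(wrap_id)
        self._drain()

    def _drain(self):
        # Write back head entries whose dependency sets are empty, in FIFO order.
        while self.fifo and not self.fifo[0][2]:
            entry = self.fifo.popleft()
            addr, data, _ = entry
            self.pm[addr] = data
            if self.latest.get(addr) is entry:   # assumption: drop only if newest
                del self.latest[addr]


vdb = VolatileDelayBuffer()
vdb.write(0x40, "old", {1})        # evicted while wrap 1 is open
vdb.write(0x40, "new", {1, 2})     # evicted again while wraps 1 and 2 are open
vdb.close_wrap(1)                  # first copy drains; second still waits on wrap 2
value_while_buffered = vdb.read(0x40)
pm_after_first_close = dict(vdb.pm)
vdb.close_wrap(2)                  # second copy drains
```

After wrap 1 closes, PM holds the older copy while reads still return the newer buffered copy; after wrap 2 closes, PM holds the newer copy and the buffer is empty.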
Dependency Wait Queue (DWQ): Strict durability is handled by the Persistent Memory Controller using the Dependency Wait Queue or DWQ, which is used to track transactions waiting on others to complete and notify the transaction that it is safe to proceed. The DWQ is a FIFO queue similar to the VDB with entries containing pairs of the dependency set and a durability address.
When a thread notifies the Persistent Memory Controller that it is closing a transaction (see steps below), it can request strict durability by passing a durability address. Dependencies on closing wraps are also removed from the dependency set for each entry in the DWQ. When the dependency set becomes empty, the controller writes to the durability address and removes the entry from the queue. Threads waiting on a write to the address can then proceed.
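The DWQ behavior described above can be modeled in a few lines. This is a sketch, not the controller implementation: the class name, the signalled list (which records the writes the controller would make to durability addresses), and the string addresses are all hypothetical.

```python
from collections import deque


class DependencyWaitQueue:
    def __init__(self):
        self.queue = deque()   # entries: (dependency_set, durability_addr)
        self.signalled = []    # durability addresses written, in order

    def request_strict(self, durability_addr, open_wraps):
        """Called on a Close Wrap that asks for strict durability."""
        deps = set(open_wraps)
        if deps:
            self.queue.append((deps, durability_addr))
        else:
            self.signalled.append(durability_addr)  # nothing to wait for

    def close_wrap(self, wrap_id):
        """Remove a closing wrap from every entry, then release ready entries."""
        for deps, _ in self.queue:
            deps.discard(wrap_id)
        while self.queue and not self.queue[0][0]:
            _, addr = self.queue.popleft()
            self.signalled.append(addr)


dwq = DependencyWaitQueue()
dwq.request_strict("addrA", {2, 3})   # a closing wrap waits on wraps 2 and 3
dwq.close_wrap(2)
dwq.request_strict("addrB", {3})      # a later closing wrap waits on wrap 3
signalled_before = list(dwq.signalled)
dwq.close_wrap(3)
print(dwq.signalled)
```

Both waiters are released, in queue order, only once wrap 3 closes; a thread waiting on its durability address would proceed at that point.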
Opening and Closing WrAPs: As outlined in Algorithm 2, the controller supports two interfaces from software, namely those for Open Wrap and Close Wrap notifications exercised from the user library as shown in Algorithm 1. (Implementations of these notifications can vary; for example, one possible mechanism may consist of software writing to a designated set of control addresses.) It also implements hardware operations against the VDB from the processor caches: Memory Write, for handling modified cache lines evicted from the processor caches or non-temporal stores from CPUs, and Memory Read, for handling reads of PM from the processor caches.
The Open Wrap notification simply adds the passed wrapId to a bit vector of open transactions. We call this bit vector the Current Open Transactions, or COT. When the controller receives a Memory Write (i.e., a processor cache eviction or a non-temporal/streaming/uncached write), it checks the COT: if the COT is empty, writes can flow into PM. Writes that target the log range in PM can also flow into PM irrespective of the COT. For non-log writes, if the COT is nonempty, the cache line is tagged with the COT and placed into the VDB.
The Close Wrap controller notification receives the wrapId and the durability address, durabilityAddr. The controller removes the wrapId from the Current Open Transactions (COT) bit vector. If the transaction requires strict durability, we save the durabilityDS and COT as a pair in the DWQ. The controller then removes the wrapId from all entries in the VDB and DWQ. For the VDB, this is performed by simply clearing the corresponding bit in the dependency set bit vector of every entry in the FIFO. If the earliest entries in the queue then have an empty dependency set, their cache line data is written back in FIFO order. Similarly, the controller removes the wrapId from all entries in the Durability Wait Queue (DWQ).
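The bit-vector handling of the COT and the routing decision for a Memory Write can be sketched as follows. The log address range and all names here are hypothetical, and Python integers stand in for the hardware bit vector.

```python
LOG_BASE, LOG_END = 0x1000, 0x2000   # assumption: hypothetical PM log address range


class CotModel:
    def __init__(self):
        self.cot = 0   # bit i set <=> wrap i is currently open

    def open_wrap(self, wrap_id):
        self.cot |= 1 << wrap_id

    def close_wrap(self, wrap_id, tags):
        """Clear the wrap's bit in the COT and in every buffered entry's tag."""
        mask = ~(1 << wrap_id)
        self.cot &= mask
        return [tag & mask for tag in tags]

    def route_write(self, addr):
        """Return the destination of a Memory Write and the tag it receives."""
        if LOG_BASE <= addr < LOG_END or self.cot == 0:
            return ("PM", 0)          # log writes, or no open wraps: pass through
        return ("VDB", self.cot)      # otherwise park it, tagged with the COT


c = CotModel()
c.open_wrap(0)
c.open_wrap(2)
log_route = c.route_write(0x1800)     # inside the assumed log range
data_route = c.route_write(0x9000)    # ordinary PM address
tags = c.close_wrap(0, [0b101, 0b100])
tags = c.close_wrap(2, tags)
idle_route = c.route_write(0x9000)
print(log_route, data_route, tags, idle_route)
```

Entries whose tag reaches zero are the ones that become eligible for write-back in FIFO order.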
Software Based Strict Durability Alternative: As an alternative to implementing strict durability in the controller, strict durability may be implemented entirely in the software library; we modify the software algorithm as follows. On a transaction start, [...]
Example
Figure 3a shows an example set of four transactions, T1-T4, happening concurrently. In this example, we show T1-T4 split into states, specifically HTM concurrency (COMPUTE), shown in vertical lines, and LOG, depicted with slanted lines. At certain time steps we show the contents of the Persistent Memory Controller's Volatile Delay Buffer, or VDB, which is a FIFO queue, and the Current Open Transactions, or COT, in Figure 3c. The recovery algorithm is shown in Figure 3b with the contents of the log. Either the start timestamp only or the persist timestamp order is shown, where bold and underline indicate that the log is written, and a circled transaction indicates that it is recoverable.
First, at time t1, T1 opens, notifying the controller, and records its start timestamp safely in its log. The controller adds T1 to the bitmap COT of open transactions. At times t2 and t3, transactions T2 and T3 also open, notify the controller, and safely read and persist their start timestamps. At this point in time the Persistent Memory Controller has a COT of {1,1,1,0} and only start timestamps have been recorded in the log. T2 then completes its concurrency section, persists its persist timestamp at time t4, and begins writing its log. In Figure 3c, we also show a random cache eviction of a cache line X that is tagged with the COT of {1,1,1,0}.
Transaction T4 starts at time t5, persisting its start timestamp, and is added to the set of open transactions on the Persistent Memory Controller, now {1,1,1,1}. At time t6 we illustrate several events. T3 completes its concurrency section and persists its persist timestamp. At this time, as shown in 3b, T3 is now ordered after T2 for recovery; however, neither has completed persisting its log. Also, we show a random cache eviction of cache line Y, which is placed at the back of the VDB on the Persistent Memory Controller as shown in 3c.
At time t7, transaction T3 has completed writing its logs, is marked completed, and is removed from the dependency set in the controller and from the cache line dependencies for X and Y. However, as shown in 3b, T3 is not recoverable at this point since it is not first in the persist timestamp order. T3 is behind T2, and T3 also has a larger persist timestamp than the start timestamp of T1, which hasn't written its persist timestamp yet. When T2 completes writing its logs at time t8, it is removed from the COT and from the dependency sets in the queue of the Persistent Memory Controller, as shown in 3c. Note that cache line X still cannot be written back to Persistent Memory, as it is still tagged as being dependent on T1, and Y on both T1 and T4. At this time, T2 and T3 are also not recoverable, as shown in 3b: T2 has a persist timestamp that is greater than the start timestamp of T1, and a recovery process could not know whether T1 had simply been delayed in persisting its log and T2 had transactional values dependent on T1.
We illustrate two events at time t9. T1 finally completes its concurrency section and writes its persist timestamp at t9. Since the persist timestamp of T1 is safely persisted and known at recovery time, transaction T2 is now fully recoverable as shown circled in 3b. However, at this time, T3 is not fully recoverable since it is waiting on T4, which started before T3 completed its concurrency section and T4 hasn't yet written the persist timestamp. Also at t9, in 3c we illustrate the eviction of cache line Z which is tagged with the set of open transactions COT {1,0,0,1}.
At time t10, T4 writes its persist timestamp and its order is now known to a recovery routine to be behind T3, which is now fully recoverable as shown with the circle in 3b. Note that T4 has a persist time before T1. In 3c, we also illustrate the eviction of cache line X again into the VDB of the Persistent Memory Controller and tagged with the set of the two open transactions, T1 and T4. Note that there are two copies of cache line X in the controller. The one at the head of the queue has fewer dependencies (only dependent on T1) than the recent eviction. Any subsequent read for cache line X returns the most recent copy, the last entry in the VDB. Note how cache lines at the back of the queue have dependency set sizes that are greater than or equal to entries earlier in the queue.
T1 completes log writing at t11, but is behind T4, which hasn't yet finished writing its logs, so neither is yet recoverable. The PM controller also removes T1 from the COT and from the dependency sets of the entries in the VDB. The first copy of X now has no dependencies in the queue and is safely written back to PM as shown in 3c.
At time t12, T4 completes writing its logs and both T4 and T1 are recoverable. Also, T4 is removed from the dependency sets in the controller, which allows Y, Z, and X to flow to PM.
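The controller side of this example can be replayed with a short script. It is a sketch that models only the COT and the VDB dependency sets, using Python sets in place of bit vectors; the event encoding and helper names are assumptions, and a wrap is treated as closed at the time step where the text says it finished writing its log.

```python
from collections import deque

cot = set()          # Current Open Transactions
vdb = deque()        # FIFO of [cache_line, dependency_set]
written_back = []    # (time, cache_line) in the order lines reach PM


def drain(t):
    # Write back head entries whose dependency sets are empty, in FIFO order.
    while vdb and not vdb[0][1]:
        written_back.append((t, vdb.popleft()[0]))


def evict(t, line):
    vdb.append([line, set(cot)])   # tag the eviction with the current COT
    drain(t)


def close(t, wrap):
    cot.discard(wrap)
    for entry in vdb:
        entry[1].discard(wrap)
    drain(t)


events = [
    (1, "open", 1), (2, "open", 2), (3, "open", 3),
    (4, "evict", "X"),                 # tagged {1,1,1,0}
    (5, "open", 4),
    (6, "evict", "Y"),
    (7, "close", 3), (8, "close", 2),  # T3, then T2, finish their logs
    (9, "evict", "Z"),                 # tagged {1,0,0,1}
    (10, "evict", "X"),                # second copy of X
    (11, "close", 1), (12, "close", 4),
]

for t, kind, arg in events:
    if kind == "open":
        cot.add(arg)
    elif kind == "evict":
        evict(t, arg)
    else:
        close(t, arg)

print(written_back)  # [(11, 'X'), (12, 'Y'), (12, 'Z'), (12, 'X')]
```

The output matches the walkthrough: the first copy of X reaches PM at t11 when T1 closes, and Y, Z, and the second copy of X follow at t12 when T4 closes.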
Strict Durability: Suppose a transaction requires strict durability during its Commit Stage, ensuring that once complete, the transactional writes will be reflected in PM if a failure were to occur. If T4 requires strict durability, it is simply durable at the end, as there are no open transactions when it completes. However, T1, T2, and T3 have other constraints. A transaction requiring strict durability is only durable when it is fully recoverable. Figure 3b marks a transaction as durable when it is circled. T1 must wait until step t12 if it requires strict durability, as it might have dependencies on T4. T2 is fully durable at time t9, when T1, which started earlier, writes its persist timestamp. T3 is fully durable at time t10, when T4, which started before T3 completed its concurrency section and could have introduced transactional dependencies, writes its persist timestamp, which indicates that T4 started its HTM section later.
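The durability times quoted in this example can be reproduced from the timestamp reasoning alone. The sketch below is one reading of that reasoning, not the controller's queues: a transaction is treated as durable once its own log is complete, every transaction that started before its persist timestamp has logged a persist timestamp, and every transaction ordered before it is itself durable. The numeric persist timestamp values are assumptions chosen only to give the order T2, T3, T4, T1 described in the text.

```python
# Time steps from the example (t1..t12 written as integers).
start = {"T1": 1, "T2": 2, "T3": 3, "T4": 5}          # start timestamps
ts_logged = {"T2": 4, "T3": 6, "T1": 9, "T4": 10}     # step at which persist timestamp is in the log
log_done = {"T3": 7, "T2": 8, "T1": 11, "T4": 12}     # step at which the log is fully written

# Assumed clock values of the persist timestamps; only their order matters.
persist_ts = {"T2": 4.0, "T3": 6.0, "T4": 8.4, "T1": 8.6}


def durable_times():
    durable = {}
    for x in sorted(persist_ts, key=persist_ts.get):   # persist timestamp order
        t = log_done[x]
        # Anyone who started before X's persist timestamp must have logged
        # a persist timestamp, so that the relative order is known.
        for y in start:
            if y != x and start[y] < persist_ts[x]:
                t = max(t, ts_logged[y])
        # Everything ordered before X must itself be durable.
        for earlier in durable.values():
            t = max(t, earlier)
        durable[x] = t
    return durable


result = durable_times()
print(result)  # {'T2': 9, 'T3': 10, 'T4': 12, 'T1': 12}
```

Under these assumptions the script yields t9 for T2, t10 for T3, and t12 for both T4 and T1, in agreement with the text.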
EVALUATION
[Figure 3b: log contents at each time step, listing the transactions with only a start timestamp logged and the transactions in persist timestamp order.]
We evaluated our method using benchmarks directly running on hardware and through simulation analysis (described in Section 5.4). Our simulation evaluates the length of the FIFO buffer and performance against various Persistent Memory write times. In the direct hardware evaluation described next, we employed Intel(R) Xeon(R) E5-2650 v4 series processors, 12 cores per processor, running at 2.20 GHz, with Red Hat Enterprise Linux 7.2. HTM transactions were implemented with Intel Transactional Synchronization Extensions (TSX) [27] using a global fallback lock. We built our software using g++ 4.8.5. Each measurement reflects an average over twenty repeats with small variation among the repeats. Using micro-benchmarks and SSCA2 [28] and Vacation, from the STAMP [29] benchmark suite, we compared the following methods:
• HTM Only: Hardware Transactional Memory with Intel TSX, without any logging or persistence. This method provides a baseline for transaction performance in cache memory without any persistence guarantees. If a power failure occurs after a transaction, writes to memory locations may be left in the cache, or written back to memory in an out-of-order subset of the transactional updates.
• WrAP: Our method. We perform all aspects of the protocol such as logging, reading timestamps, HTM and fall-back locking, etc. as described in Section 3. The volatile delay buffer in the controller is assumed to be able to keep up with back pressure from the cache, as shown in Section 5.4.
• WrAP-Strict: Same as above, but we implement the software strict durability method as described in Section 4.3. Threads wait until all prior-open transactions have closed before proceeding.
• PTL-Eager: (Persistent Transactional Locking). In this method, we added persistence to Transactional Locking (TL-Eager) [30] [31] [32] by persisting the undo log at the time that a TL transaction performs its sequence of writes. The undo-log entries are written with write-through stores and SFENCEs, and once the transaction commits and the new data values are flushed into PM, the undo-log entries are removed.
Benchmarks
The Scalable Synthetic Compact Applications for benchmarking High Productivity Computing Systems [28], SSCA2, is part of the Stanford Transactional Applications for Multi-Processing [29], or STAMP, benchmark suite. SSCA2 uses a large memory area and has multiple kernels that construct a graph and perform operations on the graph. We executed the SSCA2 benchmark with scale 20, which generates a graph with over 45 million edges. We increased the number of threads from 1 to 16 in powers of two and recorded the execution time for the kernel for each method.
Figure 4 shows the execution time for each method for the Compute Kernel in the SSCA2 benchmark as a function of the number of threads. Each method reduces the execution time with increasing numbers of threads. Our WrAP approach has an execution time similar to that of HTM Only in the cache hierarchy with no persistence, and is over 2.25 times faster than a persistent PTL-Eager method to PM.
Figure 5 shows the speedup for each method as a function of the number of threads, compared to a single-threaded undo log for the persistence methods, and the speedup versus no persistence for the in-cache method HTM Only. Even though the HTM (cache-only) method does better in absolute terms, as we saw in Figure 4, it proceeds from a higher baseline for single-threaded execution. PTL-Eager yields significantly weaker scalability due to the inherent [...]
Figure 6 shows the number of hardware aborts for both our WrAP approach and cache-only HTM. Our approach introduces extra writes to log the write set and, along with reading the system time stamp, extends the transaction time. However, as shown in the figure, this only slightly increases the number of hardware aborts.
We also evaluated the Vacation benchmark which is part of the STAMP benchmark suite. The Vacation benchmark emulates database transactions for a travel reservation system. We executed the benchmark with the low option for lower contention emulation. Figure 7 shows the execution time for each method for the Vacation benchmark as a function of the number of threads. Each method reduces the execution time with increasing numbers of threads. The WrAP approach follows the trends similar to HTM in the cache hierarchy with no persistence, with both approaches flattening execution time after 4 threads. We also examine the effect of strict durability, WrAP-Strict in the figure, and show that strict durability only introduces a small amount of overhead. For just a single thread, it has the same performance as WrAP relaxed as a thread doesn't need to wait on other threads, as it is durable as soon as the transaction completes.
Additionally, we examined the effect of increased Persistent Memory write times on the benchmark. When compared to DRAM, byte-addressable Persistent Memory can have longer write times. To emulate the longer write times for PM, we insert a delay after non-temporal stores when writing to new cache lines and a delay after cache line flushes. The write delay can be tuned to emulate the effect of longer write times typical of PM. Figure 8 shows the Vacation benchmark execution time for various PM write times. The WrAP approach is less affected by increasing PM write times than the PTL-Eager approach due to several factors. WrAP performs write-combining for log entries on the foreground path for each thread, so writes to several transaction variables may be combined into fewer writes. Also, PTL-Eager transactionally persists an undo log on writes, causing a foreground delay.
Hash Table
Our next series of experiments shows how transaction sizes and high memory traffic affect overall performance. We create a 64 MB Hash Table Array of elements in main memory and transactionally perform a number of element updates. For each transaction, we generate a set of random numbers of a configurable size, compute their hash, and write the value into the Hash Table Array. First, we create transactions consisting of 10 atomic updates, vary the number of concurrent threads, and measure the maximum throughput. We perform 1 million updates, record the average throughput, and plot the results in Figure 9. Our approach achieves roughly 3x the throughput of PTL-Eager. Figure 10 shows that increasing the write set to 20 atomic updates gives similar performance. In both figures, adding strict durability only slightly decreases the overall performance; threads wait additional time for the dependency on other transactions to clear before continuing to another transaction. The transaction write set was then varied from 2 to 26 elements with 6 concurrent threads. The average throughput was recorded and is shown in Figure 11. Even with strict durability added, WrAP performs roughly three times faster than PTL-Eager. A transaction size of twenty elements was then analyzed using a varied write to read ratio with 6 concurrent threads. The average throughput was recorded and is shown in Figure 12. Unlike transactional memory approaches, our approach does not require instrumenting read accesses and can therefore execute reads at cache speeds.
Red-Black Tree
We use the transactional Red-Black tree from STAMP [29] initialized with 1 million elements. We then perform insert operations on the Red-Black tree and record average transaction times and throughput over 200k additional inserts. Each transaction inserts an additional element into the Red-Black tree.
Inserting an element into a Red-Black tree first requires finding the insertion point, which can take many read operations, and can trigger many writes through a rebalance. In our experiments, we averaged 63 reads and 11 writes per transactional insert of one element into the Red-Black tree.
We record the maximum throughput of inserts into the Red-Black tree per second for a varying number of threads in Figure 13. As can be seen in the figure, WrAP has almost 9x higher throughput than PTL-Eager and, with strict durability, almost 6x higher. Our method can perform reads at the speed of the hardware, while PTL-Eager requires instrumenting reads through software to track dependencies on other concurrent transactions.
Persistent Memory Controller Analysis
We investigated the required length of our FIFO in the Volatile Delay Buffer and performance with respect to Persistent Memory write times using an approach similar to [14] . In the absence of readily available memory controllers, we modified the McSimA+ simulator [33] . McSimA+ is a PIN [34] based simulator that decouples execution from simulation and tightly models out-of-order processor micro-architecture at the cycle level. We extended the simulator to support the notifications for opening and closing WrAPs along with extended support for memory reads and writes. We added support for DRAMSim2 [35] , a cycle-accurate memory system and DRAM memory controller model library. Write-combining and store buffers were then added with multiple configuration options to allow fine tuning to match the system to be modeled.
To stress the Persistent Memory Controller, we executed an atomic hash table update without any thread contention by having each thread update elements on a separate portion of the table. In the simulation, we fill the cache with dirty cache lines so that each write by a thread in a transaction generates write-backs to main Persistent Memory. For 8 threads, we recorded the average atomic hash table update time for 10 elements in each transaction. We then vary the Persistent Memory write time as a multiple of DRAM write time. As shown in Figure 14, WrAP is less affected by increasing write times when compared to PTL-Eager. Additionally, we record the maximum FIFO buffer size for various Persistent Memory write times and 4 concurrent threads, shown in Figure 15. Initially, the buffer size decreases for an increasing PM write time, due to slower transaction throughput and fewer cache evictions into the buffer. As the write time increases further, the buffer length increases, but is still less than 1k cache lines, or 64 KB.
We performed a similar analysis using a B-Tree, where each thread atomically inserts elements on its own copy of a B-Tree. Each insert into the tree required, on average, over 5 times as many reads as writes. As shown in Figure 16 , our method is less affected by increasing PM write times, due to PTL-Eager instrumenting the large portion of the read operations. In this experiment, we use eight concurrent threads each atomically inserting elements into an initialized B-Tree of 128 elements. As more reads than writes are generated for each atomic insert transaction, the FIFO buffer length remains small. We also examined the FIFO buffer length in the VDB with 8 concurrent threads. Figure 17 shows the length was less than about 100 elements for each write speed due to the large proportion of reads.
RELATED WORK
Related Persistence Work: Analysis of consistency models for persistent memory was considered in [36]. Changes to the front-end cache for ordering cache evictions were proposed in [14, 15, 37, 38]. BPFS [37] proposed epoch barriers to control eviction order, while [38] proposed a flush software primitive to control update order. Snapshotting the entire micro-architectural state at the point of a failure is proposed in [39]. A non-volatile victim cache to provide transactional buffering was proposed in [14], with the added property of not requiring logging, but it requires changes to the front-end cache controller to track pre- and post-transactional states for cache lines in both volatile and persistent caches, atomically moving them to durable state on transaction commits.
Memory controller support for transaction atomicity in Persistent Memory has been proposed in [17-19, 23, 40-42]. Adding a small DRAM buffer in front of Persistent Memory to improve latency and to coalesce writes was proposed in [40]. The use of a volatile victim cache to prevent uncontrolled cache evictions from reaching PM was described in [17] [18] [19], but requires software locking for concurrency control. FIRM [42] describes techniques to differentiate persistent and non-persistent memory traffic, and presents scheduling algorithms to maximize system throughput and fairness. Low-level memory scheduling to improve efficiency of persistent memory access was studied in [41]. Except for [17] [18] [19], none of these works deal with the issues of atomicity or durability of write sequences. Our approach effectively uses HTM for concurrency control and does not require changes to the front-end cache controller or use logs for replaying transactions to PM.
Related Concurrency Work: Existing non-HTM solutions [9, 11, 12] tightly couple concurrency control with durable writes of either write-ahead logs or data updates into Persistent Memory to maintain persistence consistency. Software that employs these approaches must generally extend the duration for which it remains in critical sections, leading to longer lock hold times, which reduces concurrency and expands transactional duration. Other work [10, 13] decouples concurrency control so that post-transactional values may flow through the cache hierarchy and reach PM asynchronously; however, the write-ahead log for an updating transaction has to be committed into PM synchronously before the transaction can close, so that the integrity of the foreground value flow is preserved across machine restarts. Another hardware-assisted mechanism proposes hardware changes to allow a dual-scheme checkpointing that writes previous check-pointed values in the background while collecting current transaction writes [43].
Recent work [13, 21, 22] aims to exploit processor-supported HTM mechanisms for concurrency control instead of traditional locking or STM-based approaches. However, all of these solutions require making significant changes to the existing HTM semantics and implementations. For instance, PHTM [21] and PHyTM [22] propose a new instruction called TransparentFlush, which can be used to flush a cache line from within a transaction to persistent memory without causing any transaction to abort. They also propose a change to the xend instruction that ends an atomic HTM region, so that it atomically updates a bit in persistent memory as part of its execution. Similarly, for DUDETM [13] to use HTM, it requires that designated memory variables within a transaction be allowed to be updated globally and concurrently without causing an abort. Other work [26] utilizes HTM for concurrency control, but requires aliasing all read and write accesses while concurrently maintaining log ordering and replaying logs for retirement.
SUMMARY
In this paper we presented an approach that unifies HTM and PM to create durable, concurrent transactions. Our approach works with existing HTM and cache coherency mechanisms, and does not require changes to existing processor caches or store instructions, avoids synchronous cache-line write-backs on completions, and only utilizes logs for recovery. The solution correctly orders HTM transactions and atomically commits them to Persistent Memory by the use of a novel software protocol combined with a back-end Persistent Memory Controller.
Our approach, evaluated using both micro-benchmarks and the STAMP suite, compares well with standard (volatile) HTM transactions. In comparison with persistent transactional locking, our approach performs 3x faster on standard benchmarks and almost 9x faster on a Red-Black Tree data structure.
