MOD: Minimally Ordered Durable Datastructures for Persistent Memory
Swapnil Haria, Mark D. Hill, Michael M. Swift
University of Wisconsin-Madison
{swapnilh,markhill,swift}@cs.wisc.edu
Abstract
Persistent Memory (PM) makes possible recoverable appli-
cations that can preserve application progress across system
reboots and power failures. Actual recoverability requires
careful ordering of cacheline flushes, currently done in two ex-
treme ways. On one hand, expert programmers have reasoned
deeply about consistency and durability to create applications
centered on a single custom-crafted durable datastructure. On
the other hand, less-expert programmers have used software
transactional memory (STM) to make one or more updates
atomic, albeit at a significant performance cost due largely to
ordered log updates.
In this work, we propose the middle ground of composable
persistent datastructures called Minimally Ordered Durable
(MOD) datastructures. MOD is a C++ library of several
datastructures—currently, map, set, stack, queue and vector—
that often perform better than STM and yet are relatively
easy to use. They allow multiple updates to one or more
datastructures to be atomic with respect to failure. Moreover,
we provide a recipe to create more recoverable datastructures.
MOD is motivated by our analysis of real Intel Optane
PM hardware showing that allowing unordered, overlapping
flushes significantly improves performance. MOD reduces
ordering by adapting existing techniques for out-of-place up-
dates (like shadow paging) with space-reducing structural
sharing (from functional programming). MOD exposes a Ba-
sic interface for single updates and a Composition interface
for atomically performing multiple updates. Relative to the
state-of-the-art Intel PMDK v1.5 STM, MOD improves map,
set, stack, queue microbenchmark performance by 40%, and
speeds up application benchmark performance by 38%.
1. Introduction
Persistent Memory (PM) is here—Intel Optane DC Persistent
Memory Modules (DCPMM) began shipping in 2019 [20].
Such systems expose fast, byte-addressable, non-volatile mem-
ory (NVM) devices as main memory and allow applications
to access this persistent memory via regular load/store instruc-
tions. In fact, we ran all experiments in this paper on a system
with engineering samples of Optane DCPMM [21, 23].
The durability of PM enables recoverable applications that
preserve in-memory data beyond process lifetimes and system
crashes, a desirable quality for workloads like databases, key-
value stores and long-running scientific computations [4, 29].
Such applications use cacheline flush instructions to move data
from volatile caches to durable PM and order these flushes
carefully to ensure consistency. For instance, applications
must durably update data before updating a persistent pointer
that points to the data, or atomically do both.
However, few recoverable PM applications have been devel-
oped so far, even though PM libraries like Mnemosyne [47]
and Intel Persistent Memory Development Kit (PMDK) [18]
have existed for several years. Currently, there are two ap-
proaches to building such applications: single-purpose custom
datastructures (e.g., persistent B-trees [6, 45, 49]) or general-
purpose transactions. While both approaches have some bene-
fits, we believe that neither is suitable for encouraging devel-
opers to start building PM applications.
Although custom datastructures are typically very fast, sig-
nificant effort is needed in designing these structures to ensure
that updates are performed atomically with respect to failure,
i.e., either all modified data is made durable in PM or none is.
Accordingly, the designers need to ensure that modified data
is logged and that dirty cachelines are explicitly flushed to PM in
a deliberate order enforced by the use of appropriate fence
instructions for consistency. Furthermore, performance opti-
mizations useful in one datastructure may not generalize to
other datastructures. These custom datastructures typically
do not support the composition of failure-atomic operations
spanning multiple datastructures, e.g., popping an element of
a durable queue and inserting it into a durable map.
Existing PM libraries offer software transactional memory
(PM-STM) for building general-purpose crash-consistent code,
but with complicated interfaces and high performance over-
heads. Operations on existing datastructures can be wrapped
in transactions to facilitate consistent recovery on a crash.
These transactions also allow developers to compose failure-
atomic operations that update multiple datastructures. How-
ever, it is not easy to use these PM-STM interfaces correctly.
For instance, the state-of-the-art PMDK transactions require
programmer annotations (TX_ADD) in each transaction to de-
marcate memory that could be modified in that transaction.
Incorrect usage of such annotations is a common source of
crash-consistency bugs in applications built with PMDK [32].
Moreover, the generality of transactions comes at a high
performance cost. As we will show, about 64% of the over-
all execution time in PM-STM based applications is spent in
flushing activity. These high overheads arise from excessive
ordering constraints in these transactions, with each transac-
tion having 5-11 ordering points (i.e., sfence on x86-64).
Our experiments on Optane DCPMM show that flushes (i.e.,
clwb on x86-64) slow execution more when they are more
frequently ordered. For instance, 8 clwbs can be performed
75% faster when they are ordered jointly by a single sfence
than when each clwb is individually ordered by an sfence.
To make PM application development more widely ac-
cessible, we propose a middle ground: Minimally Ordered
Durable Datastructures (MOD), a library of many persistent
datastructures with simple abstractions and good performance
(and a methodology to make more). MOD performs better than
transactions in most cases and also allows the composition
of updates to multiple persistent datastructures. To allow the
programmer to easily build new PM applications, we encapsu-
late away the details of persistence such as crash-consistency,
ordering and durability mechanisms in the implementation of
these datastructures. Instead, MOD enables programmers to
focus on the core logic of their applications.
Similar efforts such as the Standard Template Library
(STL) [11] in C++ have proved extremely popular, allow-
ing programmers to develop high-performance applications
using simple datastructure abstractions whose efficient and
complicated implementations are hidden from the program-
mer. MOD offers datastructure abstractions similar to those in
the STL, namely map, set, stack, queue and vector. For each
datastructure, MOD offers convenient failure-atomic update
and lookup operations with familiar STL-like interfaces such
as push_back for vectors and insert for maps.
New abstractions get wider adoption only if they per-
form well. For high performance, MOD datastructures use
shadow paging [15, 33] to minimize internal ordering in up-
date operations—one sfence per failure-atomic operation
in the common case. Specifically, we rely on out-of-place
writes to create a new and updated copy (shadow) of each
datastructure without overwriting the original data. These out-
of-place writes do not need to be logged and can be flushed
with overlapping flushes to minimize flushing overheads.
To reduce the memory overhead introduced by shadow pag-
ing, our datastructures use structural sharing optimizations
found in purely functional datastructures [13, 37, 39, 42, 44].
With these optimizations, the updated shadow is built out of
the unmodified data of the original datastructure plus modest
new and updated state. Consequently, the shadow incurs ad-
ditional space overheads of less than 0.01% over the original
datastructure. On Intel Optane DCPMM, our MOD data-
structures improve the performance of map, set, stack, queue
microbenchmarks by 43%, hurt vector performance by 122%, and speed
up application benchmarks by 36% as compared to state-of-
the-art Intel PMDK v1.5. We also present a methodology to
repurpose other existing purely functional datastructures into
new persistent datastructures.
Finally, MOD also offers the ability to compose failure-
atomic updates to multiple durable datastructures. Only for
such use cases, we expose the underlying out-of-place update
operations to the programmer. Thus, programmers can up-
date multiple datastructures to generate new versions of these
datastructures. We provide a convenient Commit interface to
failure-atomically replace all the original datastructures with
their respective updated versions.
We make the following contributions in this paper:
• We develop the design and implementation of the MOD li-
brary of high-performance durable datastructures with en-
capsulated persistence.
• We present two alternative interfaces for using MOD data-
structures for different use cases.
• We provide a recipe to create more MOD datastructures
from existing functional datastructures.
• We develop an analytical model for estimating the latency
of concurrent cacheline flushes with Optane DCPMM.
• We release a C++ implementation of MOD datastructures.
2. Background
We first provide basic knowledge of PM programming and
functional programming as required for this paper.
2.1. Persistent Memory System
We consider a system in which the physical address space
is partitioned into volatile DRAM and durable PM. While
the contents of PM are preserved in case of a system failure,
DRAM and other structures such as CPU registers, caches, etc.,
are wiped clean. This system model is similar to most prior
work [5, 31, 34, 47] and representative of Optane DCPMM.
Recoverable PM software relies on hardware guarantees to
know when PM writes are persisted, i.e., when a write is guar-
anteed to be durable in PM. Writes are first stored in volatile
caches to exploit temporal locality of accesses and written
back to PM at a later time unknown to software, depending on
the cache replacement policy. Hence, PM systems support two
instructions for durability and/or ordering: a flush instruction
to explicitly write back a cacheline from the volatile caches to
PM, and a fence instruction to order subsequent instructions
after preceding flushes become durable.
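The flush-then-fence discipline described above can be sketched as follows. This is a minimal illustration, not code from the paper: `PersistLog`, `flush_range`, `fence`, and `publish` are hypothetical names, and the flush and fence are simulated by bookkeeping so that the ordering can be inspected without PM hardware. On x86-64 they would presumably map to clwb and sfence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the hardware state: it only records which
// cachelines were flushed and when a fence made them durable.
struct PersistLog {
    std::vector<std::uintptr_t> pending;  // flushed, not yet ordered by a fence
    std::vector<std::uintptr_t> durable;  // guaranteed durable
    std::size_t fences = 0;
};

constexpr std::uintptr_t kCacheline = 64;

// Simulated flush: request write-back of every cacheline in [addr, addr+len).
inline void flush_range(PersistLog& log, const void* addr, std::size_t len) {
    std::uintptr_t first = reinterpret_cast<std::uintptr_t>(addr) & ~(kCacheline - 1);
    std::uintptr_t last = reinterpret_cast<std::uintptr_t>(addr) + len;
    for (std::uintptr_t line = first; line < last; line += kCacheline)
        log.pending.push_back(line);
}

// Simulated fence: every previously issued flush becomes durable.
inline void fence(PersistLog& log) {
    log.durable.insert(log.durable.end(), log.pending.begin(), log.pending.end());
    log.pending.clear();
    ++log.fences;
}

struct Record { int payload; };
struct Root { Record* current; };

// Make the record durable before the persistent pointer refers to it.
inline void publish(PersistLog& log, Root& root, Record* rec, int value) {
    rec->payload = value;
    flush_range(log, rec, sizeof(Record));
    fence(log);  // data is durable before the pointer changes
    root.current = rec;
    flush_range(log, &root.current, sizeof(root.current));
    fence(log);  // pointer update is durable
}
```

In this sketch the first fence is what prevents a crash from exposing a durable pointer to data that never reached PM.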
2.2. Persistent Memory Programming
Here, we discuss applications that rely on the persistence of
PM. Such applications are recoverable if they store enough
state in PM to recover to a recent state after a system crash
without losing all progress. There are several
challenges involved in programming recoverable applications.
Sufficient application data must be persisted in PM to allow
successful recovery to a consistent and recent state. System
crashes at inopportune moments could result in partially up-
dated and thus inconsistent datastructures that cannot be used
for recovery. As a result, programmers have to carefully rea-
son about the ordering and durability of PM updates. Unfor-
tunately, PM updates in program order can be reordered by
hardware including write-back caches and memory controller
(MC) scheduling policies.
To abstract away these programming challenges, researchers
have developed failure-atomic sections (FASEs) [5]. FASEs
are code segments with the guarantee that all PM writes within
a FASE happen atomically with respect to system failure. For
example, prepending to a linked list (Figure 1b) in a FASE
guarantees that either the linked list is successfully updated
(a)
struct Node {
    Data data;
    Node* next;
};

struct List {
    Node* head;
};

(b)
// Impure: overwrites the head pointer of the original list L.
void ImpurePrepend(List* L, Data D) {
    Node* new_node = new Node();
    new_node->data = D;
    new_node->next = L->head;
    L->head = new_node;
    return;
}

(c)
// Pure: leaves L untouched and returns a new list shadowL.
List* PurePrepend(List* L, Data D) {
    Node* new_node = new Node();
    new_node->data = D;
    new_node->next = L->head;
    List* shadowL = new List();
    shadowL->head = new_node;
    return shadowL;
}

(d)
[Diagram: lists L and shadowL, new_node holding D, and nodes 1, 2, 3.]
Figure 1: For linked list defined in (a), implementation of prepend as (b) impure function with original list L modified and (c) pure
function where new updated shadowL is created and returned. (d) shadowL reuses nodes of list L to reduce space overheads.
with its head pointing to the durable new node or that the
original linked list can be reconstructed after a crash.
PM libraries [7, 18, 47] typically implement FASEs with
software transactions that guarantee failure-atomicity and dura-
bility. All updates made within a transaction are durable when
the transaction commits. If a transaction gets interrupted due
to a crash, write-ahead logging techniques are typically used
to allow recovery code to clean up partial updates and return
persistent data to a consistent state. Hence, recoverable appli-
cations can be written by allocating datastructures in PM and
only updating them within PM transactions. We discuss the
performance bottlenecks of PM transactions in Section 3.
2.3. Functional Programming Concepts
In this work, we leverage two basic concepts in functional
programming languages: pure functions and purely functional
datastructures. These ideas are briefly described below and
illustrated in Figure 1.
Pure Functions. A pure function is one whose outputs are de-
termined solely based on the input arguments and are returned
explicitly. Pure functions have no externally visible effects
(i.e., side effects) such as updates to any non-local variables or
I/O activity. Hence, only data that is newly allocated within the
pure function can be updated. Figure 1 shows how a pure and
an impure function differ in performing a prepend operation to
a list. The impure function overwrites the head pointer in the
original list L, which is a non-local variable and thus results in
a side effect. In contrast, the pure function allocates a new list
shadowL to mimic the effect of the prepend operation on the
original list and explicitly returns the new list. Note that the
pure function does not copy the original list to create the new
list. Instead, it reuses the nodes of the original list without
modifying them.
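The pure prepend of Figure 1c can be written as a small self-contained sketch. This is an illustration under assumptions: `Data` is taken to be `int`, and `std::shared_ptr` replaces the raw pointers of the figure so that shared nodes are reclaimed automatically. The names `pure_prepend` and `length` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

using Data = int;  // assumption: the figure leaves Data unspecified

struct Node {
    Data data;
    std::shared_ptr<const Node> next;
};

struct List {
    std::shared_ptr<const Node> head;
};

// Pure prepend: L is never modified, and the returned list reuses
// every node of L instead of copying them.
inline List pure_prepend(const List& L, Data d) {
    auto new_node = std::make_shared<const Node>(Node{d, L.head});
    return List{new_node};
}

// Number of nodes reachable from the head of L.
inline std::size_t length(const List& L) {
    std::size_t n = 0;
    for (const Node* p = L.head.get(); p != nullptr; p = p->next.get()) ++n;
    return n;
}
```

Because nodes are immutable once created, the original list and the new list can safely point into the same chain.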
Functional Datastructures. Commonly used in functional
languages, purely functional or persistent datastructures are
those that preserve previous versions of themselves when mod-
ified [13]. We refer to these as purely functional datastructures
in this paper to avoid confusion with persistent (i.e., durable)
datastructures for PM.
Purely functional datastructures are never modified in-place.
Instead, every update of such datastructures creates a logically
new version while preserving the old version. Thus these
datastructures are inherently multi-versioned.
To reduce space overheads and improve performance, func-
tional datastructures (even arrays and vectors) are often imple-
mented as trees [37, 39]. Tree-based implementations allow
different versions of a datastructure to appear logically differ-
ent while sharing most of the internal nodes of the tree. For
example, Figure 1 shows a simple example where the original
list L and the updated list shadowL share nodes labeled 1, 2
and 3. Such optimizations are called structural sharing.
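Structural sharing in a tree is commonly achieved by path copying: an update copies only the nodes on the path from the root to the change and reuses all other subtrees. The sketch below shows this on a plain unbalanced binary search tree, which is a simplification chosen for brevity and not one of the tree structures used by MOD. `TreeNode`, `insert`, and `contains` are hypothetical names.

```cpp
#include <cassert>
#include <memory>

struct TreeNode;
using Tree = std::shared_ptr<const TreeNode>;

struct TreeNode {
    int key;
    Tree left;
    Tree right;
};

// Returns a new version of the tree containing key. Only nodes on the
// search path are copied; untouched subtrees are shared with the old version.
inline Tree insert(const Tree& t, int key) {
    if (!t) return std::make_shared<const TreeNode>(TreeNode{key, nullptr, nullptr});
    if (key < t->key)
        return std::make_shared<const TreeNode>(TreeNode{t->key, insert(t->left, key), t->right});
    if (key > t->key)
        return std::make_shared<const TreeNode>(TreeNode{t->key, t->left, insert(t->right, key)});
    return t;  // key already present: the old version is reused as-is
}

inline bool contains(const Tree& t, int key) {
    const TreeNode* p = t.get();
    while (p != nullptr) {
        if (key == p->key) return true;
        p = (key < p->key) ? p->left.get() : p->right.get();
    }
    return false;
}
```

After an insert into the right subtree, the new root's left child is the very same node as the old root's left child, which is the sharing that keeps the space overhead of a new version small.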
3. Understanding Performance Bottlenecks
Good performance typically aids the adoption of new abstrac-
tions. Thus in this section, we try to identify the main per-
formance bottlenecks in PM-STM workloads and understand
how to mitigate these overheads.
Overheads in PM-STM Workloads. At a high level, PM-
STM implementations suffer from two main overheads: flush-
ing (required for durability of data) and logging (required for
failure-atomicity). We measured these overheads on Optane
DCPMMs by running recoverable PM workloads (described in
Table 2 in Section 6) with Intel PMDK v1.5, a state-of-the-art
PM-STM implementation that uses hybrid undo-redo logging.
As shown in Figure 2, these applications spend on average
about 64% of their execution time performing flushing and
9% performing logging operations. These PM-STM imple-
mentations flush both log entries and data updates to PM, and
we consider the time spent in flushing log entries as part of
flushing overheads. Clearly, flushing overheads are the biggest
performance bottlenecks in these applications.
[Plot: x-axis workloads (map, set, queue, stack, vector, vec-swap, bfs, vacation, memcached); y-axis Fraction of Execution Time (0.0 to 1.0); legend: PMDK-other, PMDK-Flush, PMDK-Log.]
Figure 2: Fraction of execution time spent logging and flush-
ing data in PM workloads using PMDK v1.5.
As we show in the rest of this section, the high flushing
overheads in PM-STM are caused by excessive ordering constraints
(sfence) that limit the overlapping of long-latency flush
instructions. Undo-logging techniques typically require 5-50
fences [34] per transaction. These fences mainly order log
[Diagram: timeline of program execution (CLWB A, CLWB B, CLWB C, SFENCE) against CPU time, with four numbered annotations: CLWBs issue and commit instantly; concurrent flushes, each with a cacheline flush latency made up of serial and parallel parts; SFENCE issues and stalls the CPU; SFENCE commits when inflight flushes "complete".]
Figure 3: Execution of concurrent flushes on Optane DCPMM.
updates before the corresponding data updates. In some imple-
mentations, the number of fences per transaction scales with
the number of modified cachelines. In our workloads with
hybrid undo-redo logging, we observed 4-23 flushes and 5-11
fences per transaction (Figure 10 in Section 6). Consequently,
the median number of flushes overlapped per fence is 1-2,
resulting in high flush overheads.
Flushes on Test Machine. In this paper, we focus on the
clwb instruction that writes back a dirty cacheline but may not
evict it from the caches. This instruction commits instantly but
launches a cacheline flush that proceeds in the background of
execution, unordered with other flushes to different addresses,
as shown in Figure 3. Ordering points (sfence) stall the CPU
until all inflight flushes are completed. On our test machine
(described in Table 1), we observed the latency of one clwb
followed by one sfence to be 353 ns when the address being
flushed was present in the L1D cache. Thus, ordering points
degrade performance by bringing the flush latency of weakly
ordered flushes on the critical path. In the rest of this paper,
we use the term flushes to refer to weakly ordered flushes.
Effects of Ordering Points. To mitigate the high flush over-
heads, we must reduce the frequency of ordering points and
enable the overlap of multiple flushes. We evaluated the ef-
ficacy of this approach on Optane DCPMMs via a simple
microbenchmark. Our microbenchmark first allocates an array
backed by PM. It issues writes to 320 random cachelines (=
20KB < 32 KB L1D cache) within the array to fault in physical
pages and fetch these cachelines into the private L1D cache.
Next, it measures the time taken to issue clwb instructions to
each of these cachelines. Fence instructions are performed
at regular intervals, e.g., one sfence after every N clwb instructions.
The total time (for 320 clwb + variable sfence
instructions) is divided by 320 to get the average latency of
a single cacheline flush. Figure 4 reports the average flush
latency for varying flush concurrency.
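The structure of this measurement loop can be sketched as below. The sketch only counts the operations that the microbenchmark would issue; it does not touch PM or measure time. `run_pattern` and `average_flush_latency_ns` are hypothetical names, and the trailing fence for a partial final group is an assumption that the text does not spell out.

```cpp
#include <cassert>
#include <cstddef>

struct Counters {
    std::size_t flushes = 0;
    std::size_t fences = 0;
};

// Issue one (simulated) flush per cacheline and one (simulated) fence
// after every flushes_per_fence flushes.
inline Counters run_pattern(std::size_t lines, std::size_t flushes_per_fence) {
    assert(flushes_per_fence > 0);
    Counters c;
    for (std::size_t i = 0; i < lines; ++i) {
        ++c.flushes;                                       // stands in for clwb
        if ((i + 1) % flushes_per_fence == 0) ++c.fences;  // stands in for sfence
    }
    if (lines % flushes_per_fence != 0) ++c.fences;  // assumed: fence the tail
    return c;
}

// Average latency per flush: total elapsed time divided by the flush count.
inline double average_flush_latency_ns(double total_ns, std::size_t lines) {
    return total_ns / static_cast<double>(lines);
}
```

With 320 cachelines, a flush concurrency of 1 issues 320 fences, while a concurrency of 16 issues 20, so the fixed total is divided by the same 320 flushes in both cases.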
The blue line in Figure 4 shows that the average flush la-
tency can be effectively reduced by overlapping flushes, up
to a limit. Compared to a single un-overlapped flush (clwb
+sfence), performing 16 flushes concurrently reduces aver-
age flush latency by 75%. However, performing 32 flushes
concurrently only reduces average flush latency by 3% com-
pared to the case with 16 concurrent flushes. Beyond 32, there
is no noticeable improvement in flush latency.
[Plot: x-axis Flush Concurrency: Flushes Overlapped per Fence (0 to 30); y-axis Flush Latency (in ns) (100 to 300); series: observed; amdahl, f=0.82.]
Figure 4: Average Latency of PM cacheline flush observed on
Optane DCPMM and estimated by analytical model (amdahl).
Analytical Model of Flush Latencies. As a side note, while
Figure 4’s blue line gives the empirical benefit of overlapping
flushes, it also seems to closely follow Amdahl’s law [1]. In
particular, the red line shows an Amdahl’s law fit using the
Karp-Flatt metric [25] that has concurrent flushes acting 82%
parallel and 18% serial. With the 18% serial component, it is
easy to understand the diminishing returns of many concur-
rent flushes. As the hardware is a black box, we do not yet
know what features cause the appearance of serialization in
the system under test.
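The text does not state the fitted formula, but assuming the standard form of Amdahl's law, the average latency of one flush when N flushes are overlapped would be T(N) = T(1) * ((1 - f) + f / N), where f is the parallel fraction (0.82 in the fit). A minimal sketch under that assumption follows; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed model: standard Amdahl's law applied to flush latency.
// single_ns is the latency of one un-overlapped flush, f the parallel
// fraction, and n the number of flushes overlapped per fence.
inline double amdahl_flush_latency_ns(double single_ns, double f, unsigned n) {
    assert(n > 0);
    return single_ns * ((1.0 - f) + f / static_cast<double>(n));
}

// Fractional latency reduction relative to a single un-overlapped flush.
inline double amdahl_reduction(double f, unsigned n) {
    assert(n > 0);
    return 1.0 - ((1.0 - f) + f / static_cast<double>(n));
}
```

Under this model with f = 0.82, overlapping 16 flushes gives a reduction of about 77%, in line with the roughly 75% observed, and no amount of overlap can reduce the latency by more than 82%.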
4. Minimally Ordered Durable Datastructures
We address the high flushing costs with Minimally Ordered
Durable (MOD) datastructures that allow failure-atomic and
durable updates to be performed with one ordering point in
the common case. These datastructures significantly reduce
flushing overheads that are the main bottleneck in recoverable
PM applications. We have six goals for these datastructures:
1. Failure-atomic updates for recoverable applications.
2. Minimal ordering constraints to tolerate flush latency.
3. Simple programming interface that hides implementation
details for handling simple use cases.
4. Allow composition of failure-atomic updates to multiple
datastructures.
5. Support for common datastructures for application pro-
grammers such as set, map, vector, queue and stack.
6. No hardware modifications needed to enable high perfor-
mance applications on currently available systems.
We first introduce the Functional Shadowing technique
underpinning MOD datastructures. Next, we show a recipe
to create MOD datastructures from existing functional data-
structures. Then, we describe MOD’s programming interfaces.
4.1. Functional Shadowing
Functional Shadowing leverages shadow paging techniques to
minimize ordering constraints in updates to PM datastructures
and uses optimizations from functional datastructures to re-
duce the overheads of shadow paging. As per shadow paging
techniques, we implement non-destructive and out-of-place
[Diagram, panels (a) to (c): VectorPtr, Vector [7], and shadowVector [8] with new element X; labels: push_back (X), Shadow Paging, Functional Optimizations (Structural Sharing).]
Figure 5: Functional Shadowing in action on (a) MOD vector.
(b) Shadow is created on Append (i.e., push_back) operation
that reuses data from the original vector. (c) Application starts
using updated shadow and old data is cleaned up.
update operations for all MOD datastructures. Accordingly,
updates of MOD datastructures logically return a new version
of the datastructure without any modifications to the original
data. As shown in Figure 5, a push_back operation in a vec-
tor of size 7 would result in a new version of size 8 while the
original vector of size 7 remains untouched. We refer to the
updated version of the datastructure as a shadow in accordance
with conventional shadow paging techniques.
There are no ordering constraints in creating the updated
shadow as it is not considered a necessary part of application
state yet. We do not log these writes as they do not overwrite
any useful data. In case of a crash at this point, recovery code
can reclaim memory corresponding to any partially updated
shadow in PM. Due to the absence of ordering constraints,
we can overlap flushes to all dirty cachelines comprising the
updated shadow to minimize flushing overheads. A single
ordering point is sufficient to ensure the completion of all the
outstanding flushes and guarantee the durability of the shadow.
Subsequently, the application must atomically replace the orig-
inal datastructure with the updated shadow. For this purpose,
we offer multiple efficient Commit functions described in the
next subsection. In contrast, PM-STM implementations per-
form in-place modifications which overwrite existing data and
need logging to revert partial updates in case of crashes. In-
place updates also introduce ordering constraints as log writes
must be ordered before the corresponding data update.
We reduce shadow paging overheads using optimizations
commonly found in functional datastructures. Conventional
shadow paging techniques incur high overheads as the orig-
inal data must be copied completely to create the shadow.
Instead, we use structural sharing optimizations to maximize
data reuse between the original datastructure and its shadow
copy. We illustrate this in Figure 5, where shadowVector
reuses 6/8 internal nodes from the original Vector and only
adds 2 internal and 3 top-level nodes. In the next subsection,
we discuss a method to convert existing implementations of
functional datastructures to MOD datastructures.
4.2. Recipe for MOD Datastructures
We provide a simple recipe for creating MOD datastructures
out of existing implementations of functional datastructures:
1. First, we use an off-the-shelf persistent memory allocator
nvm_malloc [2] to allocate datastructure state in PM.
2. Next, we allocate the internal state of the datastructure on
the persistent heap instead of the volatile stack.
3. Finally, we extend all update operations to flush all modi-
fied PM cachelines with clwb instructions and no ordering
points. These flushes will be ordered by an ordering point
in a Commit step described later in this section.
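The recipe can be illustrated on a functional stack. The sketch below is an assumption-laden toy and not MOD's implementation: `PmHeap` is a hypothetical stand-in for a persistent heap, backed by ordinary memory, and `flush`, `fence`, `pure_push`, and `commit` are hypothetical names that only count operations. It shows that the update flushes what it wrote without any ordering point, while the single fence is deferred to the commit step.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

struct FlushLog {
    std::size_t flushes = 0;
    std::size_t fences = 0;
};

inline void flush(FlushLog& log, const void*, std::size_t) { ++log.flushes; }  // simulated clwb
inline void fence(FlushLog& log) { ++log.fences; }                             // simulated sfence

struct Node { int value; const Node* next; };
struct Stack { const Node* head; };

// Stand-in for a persistent heap (steps 1 and 2 of the recipe). std::deque
// keeps element addresses stable across push_back.
struct PmHeap {
    std::deque<Node> nodes;
    std::deque<Stack> stacks;
    FlushLog log;
};

// Step 3: a pure update that flushes its new state and issues no fence.
inline const Stack* pure_push(PmHeap& pm, const Stack* s, int v) {
    pm.nodes.push_back(Node{v, s->head});  // reuses the old nodes
    const Node* n = &pm.nodes.back();
    flush(pm.log, n, sizeof(Node));
    pm.stacks.push_back(Stack{n});
    const Stack* shadow = &pm.stacks.back();
    flush(pm.log, shadow, sizeof(Stack));
    return shadow;
}

// Assumed commit step: one ordering point for all outstanding flushes,
// then the pointer is switched to the shadow and flushed.
inline void commit(PmHeap& pm, const Stack*& root, const Stack* shadow) {
    fence(pm.log);
    root = shadow;
    flush(pm.log, &root, sizeof(root));
}
```

A crash before `commit` leaves `root` pointing at the untouched original, which is why the writes to the shadow need no logging.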
While functional datastructures do not support durability by
default, they offer a suitable starting point from which to gener-
ate MOD datastructures. They support non-destructive update
operations which are typically implemented through pure func-
tions. Thus, every update returns a new updated version (i.e.,
shadow) of the functional datastructure without modifying the
original. They export simple interfaces such as map, vector,
etc. that are implemented internally as highly optimized trees
such as Compressed Hash-Array Mapped Prefix-trees [43] (for
map, set) or Relaxed Radix Balanced Trees [44] (for vector).
These implementations are designed to amortize the overheads
of data copying as needed to create new versions on updates.
Optimized functional implementations also have low space
overheads via structural sharing, i.e., maximizing data reuse
between the original data and the shadow. Tree-based im-
plementations are particularly amenable to structural sharing.
On an update, the new version creates new nodes at the up-
per levels of the tree, but these nodes can point to (and thus
reuse) large sub-trees of unmodified nodes from the original
datastructure. The number of new nodes created grows ex-
tremely slowly with the size of the datastructures, resulting
in low overheads for large datastructures. As we show in our
evaluation section, the additional memory required on average
for an updated shadow is less than 0.01% of the memory of
the original datastructure of size 1 million elements.
Moreover, the trees are broad but not deep to avoid the prob-
lem of ‘bubbling-up of writes’ [8] that plagues conventional
shadow paging techniques. This problem arises as the update
of an internal node in the tree requires an update of its parent
and so on all the way to the root. We find that existing imple-
mentations of such functional datastructures are commonly
available in several languages, including C++ and Java.
We conjecture that the ability to create MOD datastructures
from existing functional datastructures is important for three
reasons. First, we benefit from significant research efforts to-
wards lowering space overheads and improving performance
of these datastructures [13, 37, 39, 42, 44]. Secondly, program-
mers can easily create MOD implementations of additional
datastructures beyond those in this paper by using our recipe to
port other functional datastructures. Finally, we forecast that
this approach can help extend PM software beyond C and C++
to Python, JavaScript and Rust, which have implementations
of functional datastructures.
4.3. Programming Interface
To abstract away the details of Functional Shadowing from ap-
plication programmers, we provide two alternative interfaces
for MOD datastructures:
• A Basic interface that abstracts away the internal versioning
and is sufficient for simple use cases.
• A Composition interface that exposes multiple versions
of datastructures to enable complex use cases, while still
hiding the complexities of the implementation.
(a) Basic Interface
// BEGIN-FASE
Update(dsPtr, updateParams)
// END-FASE

(b) Composition Interface
// BEGIN-FASE
// (1) Update
dsPtr1shadow = dsPtr1->PureUpdate(...)
dsPtr2shadow = dsPtr2->PureUpdate(...)
...
// (2) Commit
Commit(dsPtr1, dsPtr1shadow,
       dsPtr2, dsPtr2shadow, ...)
// END-FASE
// dsPtr1, dsPtr2 are updated.
Figure 6: Failure-atomic Code Sections (FASEs) with MOD
datastructures using (a) Basic interface to update one data-
structure and (b) Composition interface to atomically update
multiple datastructures with (1) Update and (2) Commit steps.
4.3.1. Basic Interface The Basic interface to MOD data-
structures (Figure 6a) allows programmers to perform individ-
ual failure-atomic update operations to a single datastructure.
With this interface, MOD datastructures appear as mutable
datastructures with logically in-place updates. Programmers
use pointers to datastructures (e.g., dsPtr in Figure 6a), as is
common in PM programming. Each update operation is implemented
as a self-contained FASE with one ordering point,
as described in the next section. If the update completes
successfully, the datastructure pointer points to an updated and
durable datastructure. In case of a crash before the update completes,
the datastructure pointer points to the original durable
and uncorrupted datastructure. We expose common update
operations for datastructures such as push_back, update for
vectors, set for sets/maps, push, pop for stacks and enqueue,
dequeue for queues, as in C++ STL.
The Basic interface targets the common case when a FASE
contains only one update operation on one datastructure. This
common case applies to all our workloads except vacation
and vector-swaps. For instance, memcached relies on a
single recoverable map to implement its cache and FASEs
involve a single set operation.
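How a logically in-place update can be layered over a pure update is sketched below. This is a toy in volatile memory, not MOD's API: `Vec`, `VecPtr`, `pure_push_back`, and `basic_push_back` are hypothetical names, the pure update copies the whole vector instead of using structural sharing, and the flushes and the ordering point are only marked by a comment.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

using Vec = std::vector<int>;
using VecPtr = std::shared_ptr<const Vec>;

// Pure update: returns a new version and leaves *v untouched.
// (Full copy for brevity; a functional vector would share most nodes.)
inline VecPtr pure_push_back(const VecPtr& v, int x) {
    auto shadow = std::make_shared<Vec>(*v);
    shadow->push_back(x);
    return shadow;
}

// Basic-interface style update: the caller sees one mutable-looking
// operation; versioning stays internal.
inline void basic_push_back(VecPtr& dsPtr, int x) {
    VecPtr shadow = pure_push_back(dsPtr, x);  // build the shadow out of place
    // A PM implementation would flush the shadow and fence once here.
    dsPtr = std::move(shadow);                 // switch to the new version
}
```

A caller that kept a reference to the old version would still see the original three elements after the update, while `dsPtr` observes the appended element.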
4.3.2. Composition Interface The Composition interface
to MOD datastructures (Figure 6b) is a general-purpose
transaction-like programming interface. It allows program-
mers to failure-atomically perform updates on multiple data-
structures or perform multiple updates to the same data-
structure or any combination thereof. For instance, moving an
element from one queue to another requires a pop operation on
the first queue and a push operation on the second queue, both
performed failure-atomically in one FASE. Complex opera-
tions such as swapping two elements in a vector also require
two update operations on the same vector to be performed
failure-atomically. In such cases, the Composition interface
allows programmers to perform individual non-destructive up-
date operations on multiple datastructures to get new versions,
and then atomically replace all the updated datastructures with
their updated versions in a single Commit operation.
With this interface, programmers can build complex FASEs,
each with multiple update operations on multiple data-
structures. Each FASE must consist of two parts: Update
and Commit. During Update, programmers perform updates
on one or more MOD datastructures. On an update opera-
tion, the original datastructure is preserved and a new updated
version is returned that is guaranteed to be durable only after
Commit. Thus, programmers are temporarily exposed to mul-
tiple versions of datastructures. Programmers use the Commit
function to atomically replace all the original datastructures
with their latest updated and durable versions. Our Commit
implementation (described in Section 5.1) contains a single
ordering point in the common case. We use this interface in
two workloads: vector-swaps and vacation.
Figure 7 demonstrates the following use cases:
Single Update of Single Datastructure: While this case
is best handled by the Basic interface, we repeat it here to
show how this can be achieved with the Composition inter-
face. In Figure 7a, appending an element to VectorPtr
results in an updated version (VectorPtrShadow). The
Commit step atomically modifies VectorPtr to point to
VectorPtrShadow. As a result of this FASE, a new element
is failure-atomically appended to VectorPtr.
Multiple Update of Single Datastructure: We show a FASE
that swaps two elements of a vector in Figure 7b. The Update
step involves two vector lookups and two vector updates. The
first vector update results in a new version VectorPtrShadow.
The second vector update is performed on the new version
to get another version (VectorPtrShadowShadow) that re-
flects the effects of both updates. Finally, Commit makes
VectorPtr point to the latest version.
Single Updates of Multiple Datastructures: Figure 7c
shows how we swap elements from two different vectors in
one FASE. For each vector, we perform an update operation to get
a new version. In Commit, both vector pointers are atomically
updated to point to the respective new versions.
Multiple Updates of Multiple Datastructures: The general
case is realized by combining the previous use cases.
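The multiple-update case can be sketched with a small path-copying vector. This is a hedged illustration, assuming a hypothetical balanced binary tree (`VNode`, `build`, `get`, `update`) in place of the tree-based vector that MOD actually uses; it runs on volatile memory with no flushes or fences.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

struct VNode;
using VPtr = std::shared_ptr<const VNode>;

// Hypothetical tree node: leaves carry a value, inner nodes carry children.
struct VNode {
    int value;
    VPtr left, right;
};

// Build a tree holding n zero-initialised elements.
VPtr build(std::size_t n) {
    if (n == 0) return nullptr;
    if (n == 1) return std::make_shared<VNode>(VNode{0, nullptr, nullptr});
    std::size_t half = n / 2;
    return std::make_shared<VNode>(VNode{0, build(half), build(n - half)});
}

// Read element i of a tree holding n elements.
int get(const VPtr& t, std::size_t n, std::size_t i) {
    if (n == 1) return t->value;
    std::size_t half = n / 2;
    return i < half ? get(t->left, half, i) : get(t->right, n - half, i - half);
}

// Non-destructive update: copies only the path to element i and
// shares every other subtree with the original version.
VPtr update(const VPtr& t, std::size_t n, std::size_t i, int v) {
    if (n == 1) return std::make_shared<VNode>(VNode{v, nullptr, nullptr});
    std::size_t half = n / 2;
    if (i < half)
        return std::make_shared<VNode>(VNode{0, update(t->left, half, i, v), t->right});
    return std::make_shared<VNode>(VNode{0, t->left, update(t->right, n - half, i - half, v)});
}
```

A swap then consists of two lookups, two chained `update` calls, and one pointer assignment that plays the role of Commit.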
5. Implementation Details
We now discuss our implementation of MOD datastructures.
(a)
// BEGIN-FASE (Vector-Append)
// 1: Update
VectorPtrShadow =
        VectorPtr->push_back(X)
// 2: Commit
CommitSingle (VectorPtr,
              VectorPtrShadow)
// END-FASE
// VectorPtr now points to updated vector
(b)
// BEGIN-FASE (Vector-Swap)
// 1: Update
val1 = (*VectorPtr)[index1]
val2 = (*VectorPtr)[index2]
VectorPtrShadow =
        VectorPtr->update(index1, val2)
VectorPtrShadowShadow =
        VectorPtrShadow->update(index2, val1)
// 2: Commit
CommitSingle (VectorPtr, VectorPtrShadow,
              VectorPtrShadowShadow)
// END-FASE
// VectorPtr points to doubly-updated vector
(c)
// BEGIN-FASE (Multi-Vector-Swap)
// 1: Update
val1 = (*VectorPtr1)[index1]
val2 = (*VectorPtr2)[index2]
VectorPtr1Shadow =
        VectorPtr1->update(index1, val2)
VectorPtr2Shadow =
        VectorPtr2->update(index2, val1)
// 2: Commit
CommitUnrelated (VectorPtr1, VectorPtr1Shadow,
                 VectorPtr2, VectorPtr2Shadow)
// END-FASE
// VectorPtr1, VectorPtr2 point to updated vectors
Figure 7: Using the Composition interface for failure-atomically (a) appending an element to a vector, (b) swapping two elements
of a vector and (c) swapping two elements of two different vectors.
5.1. Implementation of Programming Interfaces
We prioritize minimal ordering constraints in our implementa-
tion of the two interfaces to MOD datastructures.
Basic Interface. As shown in Figure 8a, the Basic interface
is a wrapper around the Composition interface to create the
illusion of a mutable datastructure. The programmer accesses
the MOD datastructure indirectly via a pointer. On a failure-
atomic update, we internally create an updated shadow of
the datastructure by performing the non-destructive update.
Then, using Commit, we ensure the durability of the shadow
and atomically update the datastructure pointer to point to the
updated and durable shadow. Thus, we hide FS details from
the programmer.
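The wrapper idea can be sketched as follows. This is a minimal illustration on volatile memory, assuming a hypothetical `BasicStack` class built on an immutable linked stack; the class and its private `commit` helper are invented for illustration and are not MOD's actual implementation.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical immutable stack used as the underlying versioned structure.
struct Node { int value; std::shared_ptr<const Node> next; };
using Version = std::shared_ptr<const Node>;

// Callers see a mutable stack; each update builds a shadow internally
// and then replaces a single pointer.
class BasicStack {
public:
    void push(int v) {
        Version shadow = std::make_shared<Node>(Node{v, current_});  // pure update
        commit(std::move(shadow));
    }
    void pop() {
        if (!current_) return;
        Version shadow = current_->next;                             // pure update
        commit(std::move(shadow));
    }
    bool empty() const { return current_ == nullptr; }
    int top() const { return current_->value; }

private:
    // Stand-in for Commit: a persistent implementation would fence
    // before this pointer write.
    void commit(Version shadow) { current_ = std::move(shadow); }

    Version current_;
};
```

The caller never handles a shadow version directly, which is what makes the datastructure appear mutable.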
Composition Interface. The Composition interface can be
used to build complex multi-update FASEs, each with one
ordering point in the common case.
To support the Update step, MOD datastructures support
non-destructive update operations. Within these update op-
erations, all modified cachelines are flushed using (weakly
ordered) clwb instructions and there are no ordering points or
fences. However, this step results in multiple versions of the
updated MOD datastructures.
The Commit step ensures the durability of the updated ver-
sions and failure-atomically updates all relevant datastructure
pointers to point to the latest version of each datastructure. We
provide optimized implementations of Commit for two com-
mon cases as well as the general implementation, as shown in
Figure 8. We discuss the memory reclamation needed to free
up unused memory in Section 5.3.
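The split between unfenced flushes in Update and a single fence in Commit can be made concrete with counters. In this sketch, `flush` and `fence` are stand-ins that only count how often clwb and sfence would be issued; `PersistStats`, `push` and `commitSingle` are hypothetical names, and no persistent memory is touched.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Counters standing in for real persistence primitives.
struct PersistStats { std::size_t flushes = 0; std::size_t fences = 0; };

struct Node { int value; std::shared_ptr<const Node> next; };
using Version = std::shared_ptr<const Node>;

void flush(const void*, PersistStats& st) { ++st.flushes; }  // stand-in for clwb
void fence(PersistStats& st) { ++st.fences; }                // stand-in for sfence

// Update: write only newly allocated memory, flush it, and do not fence.
Version push(const Version& s, int v, PersistStats& st) {
    Version shadow = std::make_shared<Node>(Node{v, s});
    flush(shadow.get(), st);
    return shadow;
}

// Commit: one fence orders all earlier flushes before the pointer write.
void commitSingle(Version& root, const Version& latest, PersistStats& st) {
    fence(st);
    root = latest;
}
```

Three updates followed by one commit produce three flushes and one fence in this model, regardless of how many updates the FASE contains.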
The first common case (CommitSingle in Figure 8b) oc-
curs when one datastructure is updated one or multiple times
in a FASE (e.g., Figure 7a,b). To Commit, we update the data-
structure pointer to point to the latest version after all updates
with a single 8B (i.e., size of a pointer) atomic write. We then
reclaim the memory of the old datastructure and intermediate
shadow versions, i.e., all but the latest shadow version.
The second common case (CommitSiblings in Figure 8c)
occurs when the application updates two or more MOD data-
structures that are pointed to by a common persistent object
(parent) in one FASE. In this case, we create a new instance
of the parent (parentShadow) that points to the updated shad-
ows of the MOD datastructures. Then, we use a single pointer
write to replace the old parent itself with its updated ver-
sion. We used this approach in porting vacation, wherein
a manager object has three separate recoverable maps as its
member variables. A commonly occurring parent object in
PM applications is the root pointer, one for each persistent
heap, that points to all recoverable datastructures in the heap.
Such root pointers allow PM applications to locate recoverable
datastructures in persistent heaps across process lifetimes.
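A sketch of the sibling case, assuming a hypothetical `Parent` object with two member datastructures. The types and the `commitSiblings` signature are illustrative only, and the flush and fence of the new parent are indicated by a comment rather than performed.

```cpp
#include <cassert>
#include <memory>

struct Node { int value; std::shared_ptr<const Node> next; };
using Version = std::shared_ptr<const Node>;

// Hypothetical parent object holding two sibling datastructures.
struct Parent { Version ds1; Version ds2; };
using ParentPtr = std::shared_ptr<const Parent>;

Version push(const Version& s, int v) {
    return std::make_shared<Node>(Node{v, s});
}

// Build a fresh parent that points at both shadows, then swing one pointer.
void commitSiblings(ParentPtr& parent,
                    const Version& ds1Shadow, const Version& ds2Shadow) {
    ParentPtr parentShadow = std::make_shared<Parent>(Parent{ds1Shadow, ds2Shadow});
    // A persistent implementation would flush parentShadow and fence here.
    parent = parentShadow;  // single pointer write replaces the old parent
}
```

Because both updated siblings become visible through the same pointer write, no transaction is needed to keep them consistent with each other.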
In these two common cases, our approach requires only
one ordering point per FASE. The single ordering point is
required in the commit operation to guarantee the durability
of the shadow before we replace the original data. The entire
FASE is a single epoch per the epoch persistency model [38].
Both of the common cases require an atomic write to a single
pointer, which can be performed via an 8-byte atomic write.
In contrast, PM-STM implementations require 5-11 ordering
points per FASE (Section 6.4).
For the general and uncommon case (CommitUnrelated
in Figure 8d) where two unrelated datastructures get updated
in the same FASE, we need to atomically update two or more
pointers. For this purpose, we use a very short transaction
(STM) to atomically update the multiple pointers, albeit with
more ordering constraints. Even in this approach, the majority
of the flushes are performed concurrently and efficiently as
part of the non-destructive updates. Only the flushes to update
the persistent pointers in the Commit transaction cannot be
overlapped due to PM-STM ordering constraints.
Thus, the Composition interface enables efficient FASEs
that update multiple datastructures in the two common cases.
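The short pointer-update transaction in the unrelated case can be pictured with a tiny undo log. This is only a sketch under stated assumptions: `PointerLog`, `commitUnrelated` and `recover` are hypothetical, the log is a stand-in for whatever the PM-STM library does internally, and the flushes and fences a real transaction needs are noted in comments only.

```cpp
#include <cassert>
#include <memory>

struct Node { int value; std::shared_ptr<const Node> next; };
using Version = std::shared_ptr<const Node>;

// Hypothetical two-entry undo log covering only the root pointers.
struct PointerLog {
    bool active = false;
    Version old1, old2;
};

void commitUnrelated(Version& ds1, const Version& ds1Shadow,
                     Version& ds2, const Version& ds2Shadow, PointerLog& log) {
    log.old1 = ds1;
    log.old2 = ds2;
    log.active = true;   // the log would be made durable before any pointer changes
    ds1 = ds1Shadow;
    ds2 = ds2Shadow;
    log.active = false;  // transaction complete
}

// Recovery: if a crash interrupted the transaction, restore both pointers.
void recover(Version& ds1, Version& ds2, PointerLog& log) {
    if (!log.active) return;
    ds1 = log.old1;
    ds2 = log.old2;
    log.active = false;
}
```

Only the two pointer writes are logged; the bulk of the data was already written and flushed during the non-destructive updates.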
5.2. Correctness
We provide a simple and intuitive argument for correct failure-
atomicity of MOD datastructures. The main correctness con-
dition is that there must not be any pointer from persistent
data to any unflushed or partially flushed data. MOD data-
structures support non-destructive updates that involve writes
only to newly allocated data and so there is no possibility
of any partial writes corrupting the datastructure. All writes
performed to the new version of the datastructure are flushed
to PM for durability. During Commit, one fence orders the
pointer writes after all flushes are completed, i.e., all updates
are made durable. Finally, the pointer writes in Commit are
performed atomically. If there is a crash before the atomic
pointer writes in Commit, the persistent pointers point to the
consistent and durable original version of the datastructure. If
the atomic pointer writes complete successfully, the persistent
pointers point to the durable and consistent new version of
the datastructures. Thus, we support correct failure-atomic
updates of MOD datastructures.
(a)
Update
   (dsPtr, updateParams) {
   // BEGIN-FASE
   // 1: Update
   dsPtrShadow =
      dsPtr->PureUpdate(updateParams)
   // 2: Commit
   Commit (dsPtr, dsPtrShadow)
   // END-FASE
   // dsPtr points to updated datastructure
}
(b)
CommitSingle
     (ds,
      dsShadow, ..., dsShadowN)
FENCE
dsOld = ds
ds = dsShadowN
Reclaim (dsOld, dsShadow, ...)
(c)
CommitSiblings
    (parent,
     ds1, ds1Shadow,
     ds2, ds2Shadow, ...)
parentShadow = new Parent
parentShadow->ds1 = ds1Shadow
parentShadow->ds2 = ds2Shadow
...
FLUSH parentShadow
FENCE
parentOld = parent
parent = parentShadow
Reclaim (parentOld)
(d)
CommitUnrelated
   (ds1, ds1Shadow,
    ds2, ds2Shadow, ...)
ds1Old = ds1
ds2Old = ds2
...
FENCE
Begin-TX {
   ds1 = ds1Shadow
   ds2 = ds2Shadow
} End-TX
Reclaim (ds1Old, ds2Old, ..)
Figure 8: (a) Implementation of Basic interface as a wrapper around the Composition interface. Commit implementation shown
for multi-update FASEs operating on (b) single datastructure, (c) multiple datastructures pointed to by common parent object,
and (d) (uncommon) multiple unrelated datastructures.
5.3. Memory Reclamation
Leaks of persistent memory cannot be fixed by restarting a pro-
gram and thus are more harmful than leaks of volatile memory.
Such PM leaks can occur on crashes during the execution of a
FASE. Specifically, allocations from an incomplete FASE leak
PM data that must be reclaimed by recovery code. Addition-
ally, our MOD datastructures must also reclaim the old version
of the datastructure on completion of a successful FASE.
We use reference counting for memory reclamation. Our
MOD datastructures are implemented as trees. In these trees,
each internal node maintains a count of other nodes that point
to it, i.e., parent nodes. We increment reference counts of nodes
that are reused on an update operation and decrement counts
for nodes whose parents are deleted on a delete operation.
Finally, we deallocate a node when its reference count hits 0.
Our key optimization here is to recognize that reference
counts do not need to be stored durably in PM. On a crash, all
reference counts in the latest version can be scanned and set
to 1 as the recovered application sees only one consistent and
durable version of each datastructure.
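The recovery-time reset can be sketched as a single tree traversal. The node layout below (`TNode` with a `refcount` field and a child list) is a hypothetical simplification, and the sketch assumes each node in the recovered version has exactly one parent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tree node; refcount is treated as non-durable metadata
// whose value after a crash cannot be trusted.
struct TNode {
    int refcount = 0;
    std::vector<TNode*> children;
};

// After a crash only one version is reachable, so every reachable node
// has one parent: reset each count to 1 and return the number visited.
std::size_t resetRefcounts(TNode* root) {
    if (root == nullptr) return 0;
    std::size_t visited = 1;
    root->refcount = 1;
    for (TNode* child : root->children) visited += resetRefcounts(child);
    return visited;
}
```

Since the counts are rebuilt from the structure itself, updating them during normal execution needs no flushes or fences.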
We rely on garbage collection during recovery to clean up
allocated memory from an incomplete FASE (on a crash). Our
performance results include the time spent in garbage collec-
tion. As our datastructures are implemented as trees, we can
perform a reachability analysis starting from the root node of
each MOD datastructure to mark all memory currently refer-
enced by the application. Any unmarked data remaining in
the persistent heap is a PM leak and can be reclaimed at this
point. A common solution for catching memory leaks is to
log memory allocator activity. However, this approach reintro-
duces ordering constraints and degrades the performance of
all FASEs to prevent memory leaks in case of a rare crash.
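The reachability analysis amounts to a mark phase from the datastructure roots followed by a sweep over all allocated blocks. The sketch below assumes a hypothetical `Block` type and an explicit list of heap allocations; how a real persistent allocator enumerates its blocks is outside this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical heap block with the outgoing pointers it stores.
struct Block {
    std::vector<Block*> children;
};

// Mark phase: record every block reachable from b.
void mark(const Block* b, std::unordered_set<const Block*>& live) {
    if (b == nullptr || !live.insert(b).second) return;
    for (const Block* c : b->children) mark(c, live);
}

// Sweep phase: any allocated block that was not marked is a leak.
std::vector<const Block*> findLeaks(const std::vector<const Block*>& heap,
                                    const std::vector<const Block*>& roots) {
    std::unordered_set<const Block*> live;
    for (const Block* r : roots) mark(r, live);
    std::vector<const Block*> leaks;
    for (const Block* b : heap)
        if (live.count(b) == 0) leaks.push_back(b);
    return leaks;
}
```

A shadow allocated by an interrupted FASE is never linked from a root, so it shows up in the returned list even if it points into live data.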
5.4. Automated Testing
While it is tricky to test the correctness of recoverable data-
structures, the relaxed ordering constraints of shadow updates
allow us to build a simple and automated testing framework
for our MOD datastructures. We generate a trace of all PM al-
locations, writes, flushes, commits, and fences during program
execution. Subsequently, our testing script scans the trace to
ensure that all PM writes (except those in commit) are only
to newly allocated PM and that all PM writes are followed
by a corresponding flush before the next fence. By verifying
these two invariants, we can test the correctness of recoverable
applications as per our correctness argument in Section 5.2.
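A checker for the two invariants can be sketched as a single pass over the trace. The trace format here is an assumption made for illustration: `Event`, `Op` and the use of one address per cacheline are hypothetical, and commit sections are delimited by explicit begin and end events.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

enum class Op { Alloc, Write, Flush, Fence, CommitBegin, CommitEnd };
struct Event { Op op; std::uintptr_t addr; };

// Returns true if the trace satisfies both invariants:
// (1) writes outside commit target only newly allocated memory;
// (2) every write is flushed before the next fence.
bool traceIsValid(const std::vector<Event>& trace) {
    std::unordered_set<std::uintptr_t> fresh;      // allocated in this trace
    std::unordered_set<std::uintptr_t> unflushed;  // written, not yet flushed
    bool inCommit = false;
    for (const Event& e : trace) {
        switch (e.op) {
        case Op::Alloc:       fresh.insert(e.addr); break;
        case Op::CommitBegin: inCommit = true; break;
        case Op::CommitEnd:   inCommit = false; break;
        case Op::Write:
            if (!inCommit && fresh.count(e.addr) == 0) return false;  // invariant 1
            unflushed.insert(e.addr);
            break;
        case Op::Flush:       unflushed.erase(e.addr); break;
        case Op::Fence:
            if (!unflushed.empty()) return false;                     // invariant 2
            break;
        }
    }
    return true;
}
```

The checker needs no knowledge of the datastructure being tested, which is what keeps the testing framework simple.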
6. Evaluation
In this work, we seek to provide a library of recoverable data-
structures with good abstractions and good performance. We
answer four questions in our evaluation:
1. Programmability: What was our experience program-
ming with MOD datastructures?
2. Performance: Do MOD datastructures improve the perfor-
mance of recoverable workloads compared to PM-STM?
3. Ordering Constraints: Do workloads with MOD data-
structures have fewer fences than with PM-STM?
4. Additional Overheads: What are the additional overheads
introduced by MOD datastructures?
6.1. Methodology
Test System Configuration. We ran our experiments on
a machine with actual Persistent Memory—Intel Optane
DCPMM [21]—and upcoming second-generation Xeon Scal-
able processors (codenamed Cascade Lake). We configured
our test machine such that Optane DCPMM is in 100% App
Direct mode [16] and uses the default Directory protocol. In
this mode, software has direct byte-addressable access to the
Optane DCPMM. Table 1 reports relevant details of our test
machine. We measured read latencies using Intel Memory
Latency Checker v3.6 [46].
Hardware Primitives. The Cascade Lake processors on our
test machine support the new clwb instruction for flushing
cachelines. The clwb instruction flushes a dirty cacheline
by writing back its data but may not evict it. Our workloads
use clwb instructions for flushing cachelines and sfence
instructions to order flushes.
CPU
  Type                 Intel Cascade Lake
  Cores                96 cores across 2 sockets
  Frequency            3.7 GHz (with Turbo Boost)
  Caches               L1: 32KB Icache, 32KB Dcache
                       L2: 1MB, L3: 33 MB (shared)
Memory System
  PM Capacity          2.9 TB (256 GB/DIMM)
  PM Read Latency      302 ns (Random 8-byte read)
  DRAM Capacity        376 GB
  DRAM Read Latency    80 ns (Random 8-byte read)
Table 1: Test Machine Configuration.
OS interface to PM. Our test machine runs Linux v4.15.6.
The DCPMMs are exposed to user-space applications via the
DAX-filesystem interface [48]. Accordingly, we created an
ext4-dax filesystem on each PM DIMM. Our PM allocators
create files in these filesystems to back persistent heaps. We
map these PM-resident files into application memory with
flags MAP_SHARED_VALIDATE and MAP_SYNC [9] to allow direct
user-space access to PM.
PM-STM Implementation. We use the PM-STM implemen-
tation (libpmemobj) from Intel’s PMDK library [18] in our
evaluations. We choose PMDK as it is publicly available, reg-
ularly updated, Intel-supported and optimized for Intel’s PM
hardware. Moreover, PMDK (v1.4 or earlier) has been used for
comparison by most earlier PM proposals [10, 30, 31, 40, 41].
We evaluate both PMDK v1.5 (released October 2018), which
uses hybrid undo-redo logging techniques as well as PMDK
v1.4, which primarily relies on undo-logging.
Workloads. Our workloads include several microbench-
marks and two recoverable applications, consistent with re-
cent PM works [10, 26, 30, 31, 40, 41]. As described in
Table 2, our microbenchmarks involve operations on commonly
used datastructures: map, set, stack, queue and vector.
The vector-swaps workload emulates the main computa-
tion in the canneal benchmark from the PARSEC suite [3].
The baseline map datastructure can be implemented by ei-
ther hashmap or ctree from the WHISPER suite [34]. Here,
we compare against hashmap which outperformed ctree on
Optane DCPMM. Moreover, we also measured two recover-
able applications from the WHISPER suite: memcached and
vacation. We modified these applications to use the PMDK
and MOD map implementations. The only other PM-STM
application in WHISPER is redis, but it also uses a map data-
structure so we found it redundant for our purposes. The other
WHISPER benchmarks are not applicable for our evaluation
as they are either filesystem-based or do not use PM-STM.
Instead, we created the bfs workload that uses a recoverable
queue for breadth-first search on the large Flickr graph [12].
We do not store the graph durably but reconstruct it from the
dataset in each execution. We ran all workloads to completion
on real hardware.
Benchmark   Description                              Configuration
map         Insert/Lookup random keys in map         8B key, 32B value
set         Insert/Lookup random keys in set         8B key, 32B value
stack       Push/Pop elements from top of stack      8B elements
queue       Enqueue/Dequeue elements in queue        8B elements
vector      Update/Read random indices in vector     8B element
vec-swap    Swap two random elements in vector       8B element
bfs         Breadth-First Search using recoverable   0.82M nodes, 9.84M
            queue on Flickr graph [12]               edges, 8B elements
vacation    Travel reservation system with four      query range: 80%,
            recoverable maps                         55% user queries
memcached   In-memory key value store using one      95% sets, 5% gets,
            recoverable map                          16B key, 512B value
Table 2: Benchmarks used in this study. Each workload performs
1 million iterations of the operations described.
6.2. Programmability
While the rest of this section presents quantitative perfor-
mance results, in this paragraph we qualitatively describe the
programmability of MOD datastructures. We demonstrated
the use of MOD datastructures in two existing applications:
vacation and memcached. With MOD datastructures as with
C++ STL, applications get access to datastructures via narrow,
expressive interfaces but without access to the internal imple-
mentation. However, memcached, like many PM applications,
uses a custom datastructure (hashmap) whose implementation
is tightly coupled to the application logic. Thus our main
challenge was to decouple the code (i.e., application logic)
from the internal datastructure implementation. We do not
expect this to be an issue when building new applications.
For instance, vacation was easy to port as its datastructure
implementations were neatly encapsulated. Also, vacation’s
logic required composing failure-atomic updates to multiple
distinct maps that were members of the same object, for which
we used our Composition interface with CommitSiblings.
6.3. Performance
Figure 9 shows the execution time (so smaller is better) of PM
workloads with PMDK transactions and MOD datastructures.
We make the following observations.
First, PMDK v1.5 with hybrid undo-redo logging performs
23% better on average than undo-logging based PMDK v1.4,
due to optimizations targeting transaction overheads [17].
Second, MOD datastructures offer a speedup of 43% on av-
erage for pointer-based datastructures (map, set, queue, stack)
over PMDK v1.5. The performance improvements are at-
tributed to lower flushing overheads (50% vs 66% of PMDK
v1.5 execution time) and no logging overheads (0% vs 13%).
Third, for only vector and vec-swap microbenchmarks, the
abstraction benefits of MOD datastructures come with a per-
formance cost—not benefit. This occurs due to the overhead
of moving from a dense 1-D array to a tree-based implementa-
tion that functional datastructures use to facilitate incremental
updates. Future work can examine whether this slowdown
can be mitigated or that the MOD abstractions must have a
performance cost in this case.
[Figure 9: stacked bar chart for map, set, queue, stack, vector, vec-swap, bfs, vacation and memcached. Y-axis: Execution Time Normalized to PMDK v1.5 (0.0 to 1.4). Series: PMDK14-other, PMDK14-Flush, PMDK14-Log, PMDK15-other, PMDK15-Flush, PMDK15-Log, MOD-other, MOD-Flush. Bar labels: 1.5, 1.8, 2.2, 1.7, 2.2.]
Figure 9: Execution Time of PM workloads, normalized to
PMDK v1.5 implementation of each workload.
For the three full applications, MOD datastructures show
an average speedup of 36% over PMDK v1.5. Here, the per-
formance improvements arise from lower flushing overheads
(25% vs 50% of PMDK v1.5 execution time) and no logging
overheads. vacation shows a lower speedup of 13% as we
have to copy and flush the parent object in our approach (with
CommitSiblings), while PMDK can update in place.
6.4. Flushing Concurrency
Figure 10 illustrates ordering (x-axis) and flushing frequency
(y-axis) in update operations to evaluated datastructures.
Lookup operations in our workloads do not require any flush
or fence instructions. Here, we restrict the comparison to
PMDK v1.5 and MOD.
Lower ordering constraints enable lower flushing overheads,
if the number of flushed cachelines is comparable. PMDK
workloads typically exhibit a high number of fences per opera-
tion. In our evaluated workloads, MOD datastructures always
have only one fence per operation. While MOD datastructures
copy and flush more data than PMDK, there are no log
entries to be flushed. For map and set implementations, the
amount of flushed data is comparable in both approaches. Pop
operations in the MOD queue occasionally require a reversal
of one of the internal linked lists resulting in greater flushing
activity than PMDK on average. The reduction in flushes
comes from the absence of log entries as well as implemen-
tation differences. However, writes and swaps to the MOD
vector require significantly more cachelines to be flushed as
compared to the PMDK vector. This helps explain the performance
degradation described in the previous subsection.
6.5. Additional Overheads
MOD datastructures introduce two new overheads: space over-
heads and increased cache pressure. First, extra memory is
allocated on every update for the shadow, resulting in addi-
tional space overheads. Secondly, functional datastructures
(including vectors and arrays) are implemented as pointer-
based datastructures with little spatial locality.
Space overheads. In Table 3, we report the increase in memory
consumption on doubling the capacity of datastructures,
i.e., inserting an additional 1 million elements in a
datastructure of size 1 million elements. On average for most
of our workloads (except vector), the memory consumption
of MOD datastructures only grows 21% faster than that of PMDK
datastructures. More importantly, every individual update
operation only requires 0.00002-0.00004× extra memory beyond
the original version, as compared to 2× extra memory in naive
shadow paging. Thus, structural sharing in our datastructure
implementations minimizes the FS space overheads.
[Figure 10: scatter plot. X-axis: Fences per Operation (0 to 12). Y-axis: Flushes per Operation (0 to 25). Points for map-insert, set-insert, queue-push, queue-pop, stack-push, stack-pop, vector-write and vec-swap, each shown for MOD and PMDK.]
Figure 10: Flush and fence frequency in PM workloads.
        map     set     stack   queue   vector
MOD     1.87×   2.08×   2.25×   1.67×   131×
PMDK    1.78×   1.75×   1.50×   1.50×   2×
Table 3: Ratio of memory consumed by datastructure with 2M
elements compared to 1M elements.
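One way to arrive at the 21% figure from Table 3 is to average the per-datastructure ratio of the MOD growth factor to the PMDK growth factor over map, set, stack and queue. That this is the calculation behind the reported number is an assumption, not something the text states; the sketch below simply carries out the arithmetic on the table's values.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Growth factors copied from Table 3 for map, set, stack and queue.
const double kMod[]  = {1.87, 2.08, 2.25, 1.67};
const double kPmdk[] = {1.78, 1.75, 1.50, 1.50};

// Mean of the per-datastructure ratios MOD / PMDK.
double meanRelativeGrowth() {
    double sum = 0.0;
    const std::size_t n = sizeof(kMod) / sizeof(kMod[0]);
    for (std::size_t i = 0; i < n; ++i) sum += kMod[i] / kPmdk[i];
    return sum / static_cast<double>(n);
}
```

The four ratios are roughly 1.05, 1.19, 1.50 and 1.11, whose mean is about 1.21.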
Cache Pressure. While our MOD datastructures typically
perform better than PMDK datastructures, interestingly they
also exhibit more cache misses. Unfortunately, it was not
possible to separate cache misses to PM and DRAM in our
experiments on real hardware, but we expect most of the cache
accesses to be for PM cachelines in our workloads.
The pointer-based functional implementations result in
more cache misses particularly in the small L1D cache, as
seen in Figure 11. This is evident in case of map, set and
vector workloads, which show 2.8-4.6× the cache misses with
MOD datastructures than with PMDK. The PMDK implemen-
tations of map, set and vector involve arrays contiguously laid
out in memory and thus have greater spatial locality and fewer
pointer-chasing patterns. However, the pointer-based imple-
mentations of MOD datastructures are necessary to reduce the
shadow copying overheads.
MOD implementations of stack, queue and bfs show low
cache miss ratios, comparable to the PMDK implementations.
These results are to be expected as stacks and queues are
pointer-based datastructures in both PMDK and MOD
implementations. Moreover, push and pop operations in these
datastructures only operate on the head or the tail, resulting in
high temporal locality of accesses.
[Figure 11: bar chart for map, set, queue, stack, vector, vec-swap, bfs, vacation and memcached. Y-axis: L1D Cache Miss Ratios (0.0% to 6.0%). Series: PMDK v1.5, MOD.]
Figure 11: L1D Cache miss ratios for PM workloads.
7. Related Work
Prior work mainly consists of general optimizations for PM-STM
and specific optimizations for durable datastructures.
7.1. PM-STM Optimizations.
Software approaches include Mnemosyne [47], NV-Heaps [7],
SoftWrap [14], Intel PMDK [18], JUSTDO [22], iDO [31],
Romulus [10], DudeTM [30]. Mnemosyne, SoftWrap, Romu-
lus and DudeTM rely on redo logging, NV-Heaps employs
undo logging techniques and PMDK recently switched from
undo logging (in v1.4) to hybrid undo-redo log (in v1.5) [19].
Each of these approaches requires 4+ ordering points per
FASE. Most undo-logging implementations require ordering
points proportional to the number of contiguous data ranges
modified in each transaction and can have as many as 50 or-
dering points in a transaction [34]. In contrast, redo-logging
implementations require a relatively constant number of ordering
points regardless of the size of the transaction and are
better for large transactions. However, redo logging requires
load interposition to redirect loads to updated PM addresses,
resulting in slow reads and increased complexity.
Romulus and DudeTM both utilize innovative approaches
based on redo-logging and shadow paging to reduce ordering
constraints. Romulus uses a volatile redo-log with shadow
data stored in PM while DudeTM uses a persistent redo-log
with shadow data stored in DRAM. Both of these approaches
double the memory consumption of the application as two
copies of the data are maintained. This is a greater challenge
with DudeTM as the shadow occupies DRAM capacity, which
is expected to be much smaller than available PM. Our MOD
datastructures only have two versions during an update oper-
ation, with significant data reuse between the two versions.
Both DudeTM and Romulus incur logging overheads and re-
quire store interposition, unlike MOD datastructures.
The optimal ordering constraints for PM-STM implementa-
tions under idealized scenarios have been analyzed [27]. The
results show that PM-STM performance can be improved us-
ing new hardware primitives that support Epoch or Strand
Persistency [27], neither of which is currently supported by
any architecture. In contrast, MOD datastructures reduce
ordering constraints on currently available hardware.
Finally, better hardware primitives for ordering and dura-
bility have also been proposed. For instance, DPO [28] and
HOPS [34] propose lightweight ordering fences that do not
stall the CPU pipeline. Efficient Persist Barriers [24] move
cacheline flushes out of the critical path of execution by mini-
mizing epoch conflicts. Speculative Persist Barriers [41] allow
the core to speculatively execute instructions in the shadow
of an ordering point. Forced Write-Back [36] proposes cache
modifications to perform efficient flushes with low software
overheads. All these proposals reduce the performance impact
of each ordering point in PM applications, whereas we reduce
the number of ordering points in these applications. Moreover,
these proposals require hardware modifications to the core
and/or the cache hierarchy while MOD datastructures improve
performance on unmodified hardware.
7.2. Recoverable Datastructures.
While Functional Shadowing provides a way to directly con-
vert existing functional datastructures into recoverable ones,
the following papers demonstrate the value of handcrafting
recoverable datastructures. Dali [35] is a recoverable prepend-
only hashmap that is updated non-destructively while preserv-
ing the old version. Updates in both Functional Shadowing and
Dali are logically performed as a single epoch to minimize or-
dering constraints. However, our datastructures are optimized
to reuse data between versions, while the Dali hashmap uses a
list of historical records for each key. The CDDS B-tree [45]
is a recoverable datastructure that also relies on versioning at
a node-granularity for crash-consistency. However, it is not
straightforward to extend such fine-grained versioning to other
datastructures beyond B-trees. Instead, we rely on versioning
at the datastructure-level.
There have also been several attempts at optimizing recov-
erable B+-trees, which are commonly used in key-value stores
and filesystems. NV-Tree [49] achieves significant perfor-
mance improvement by storing internal tree nodes in volatile
DRAM and reconstructing them on a crash. wB+-Trees [6]
uses atomic writes and bitmap-based layout to reduce the num-
ber of PM writes and flushes for higher performance. These op-
timizations cannot be directly extended to other datastructures
such as vectors and queues. Our MOD datastructures are all
implemented as trees, and could allow these optimizations to
apply generally to more datastructures with further research.
8. Conclusion
Persistent memory devices are close to becoming commer-
cially available. Ensuring consistency and durability across
failures introduces new requirements on programmers and new
demands on hardware to efficiently move data from volatile
caches into persistent memory. Minimally ordered durable
datastructures provide an efficient mechanism that leverages
the performance characteristics of Intel’s Optane DCPMM for
much higher performance. Rather than focusing on minimiz-
ing the amount of data written, MOD datastructures minimize
the ordering points that impose long program delays. Furthermore,
they can be created via simple extensions to a large
library of existing highly optimized functional datastructures,
providing flexibility to programmers.
References
[1] Gene M. Amdahl. Validity of the single processor approach to achiev-
ing large scale computing capabilities. In Proceedings of the April
18-20, 1967, Spring Joint Computer Conference (AFIPS), 1967.
[2] Tim Berning. nvm malloc: Memory allocation for nvram.
https://github.com/hyrise/nvm_malloc, 2017.
[3] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li.
The parsec benchmark suite: Characterization and architectural im-
plications. In Proceedings of the 17th International Conference on
Parallel Architectures and Compilation Techniques (PACT), 2008.
[4] Adrian M. Caulfield, Joel Coburn, Todor Mollov, Arup De, Ameen
Akel, Jiahua He, Arun Jagatheesan, Rajesh K. Gupta, Allan Snavely,
and Steven Swanson. Understanding the impact of emerging non-
volatile memories on high-performance, io-intensive computing. In
Proceedings of the 2010 ACM/IEEE International Conference for
High Performance Computing, Networking, Storage and Analysis (SC),
2010.
[5] Dhruva R. Chakrabarti, Hans-J. Boehm, and Kumud Bhandari. Atlas:
Leveraging locks for non-volatile memory consistency. In Proceed-
ings of the 2014 ACM International Conference on Object Oriented
Programming Systems Languages & Applications (OOPSLA), 2014.
[6] Shimin Chen and Qin Jin. Persistent b+-trees in non-volatile main
memory. Proceedings of the VLDB Endowment, 8, February 2015.
[7] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Ra-
jesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making
Persistent Objects Fast and Safe with Next-generation, Non-volatile
Memories. In Proceedings of the 16th International Conference on
Architectural Support for Programming Languages and Operating
Systems, 2011.
[8] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek,
Benjamin Lee, Doug Burger, and Derrick Coetzee. Better i/o through
byte-addressable, persistent memory. In Proceedings of the ACM
SIGOPS 22nd Symposium on Operating Systems Principles (SOSP),
2009.
[9] Jonathan Corbet. Two more approaches to persistent-memory writes.
https://lwn.net/Articles/731706/, 2017.
[10] Andreia Correia, Pascal Felber, and Pedro Ramalhete. Romulus: Effi-
cient algorithms for persistent transactional memory. In Proceedings of
the 30th on Symposium on Parallelism in Algorithms and Architectures
(SPAA), 2018.
[11] cppreference. Containers library. https://en.cppreference.com/
w/cpp/container, 2018.
[12] Tim Davis. The university of florida sparse matrix collection.
http://www.cise.ufl.edu/research/sparse/matrices.
[13] James R. Driscoll, Neil Sarnak, Daniel D. Sleator, and Robert E. Tarjan.
Making data structures persistent. Journal of Computer and System
Sciences, 38, 1989.
[14] E. R. Giles, K. Doshi, and P. Varman. Softwrap: A lightweight frame-
work for transactional support of storage class memory. In 2015 31st
Symposium on Mass Storage Systems and Technologies (MSST), 2015.
[15] Jim Gray, Paul McJones, Mike Blasgen, Bruce Lindsay, Raymond
Lorie, Tom Price, Franco Putzolu, and Irving Traiger. The recovery
manager of the system r database manager. ACM Computing Surveys
(CSUR), 13, June 1981.
[16] Alper Ilkbahar. Intel optane dc persistent memory operating modes
explained.
https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/, 2018.
[17] Intel. New release of pmdk.
https://pmem.io/2018/10/22/release-1-5.html.
[18] Intel. Persistent memory development kit. http://pmem.io/pmdk.
[19] Intel. Pmdk issues: introduce hybrid transactions.
https://github.com/pmem/pmdk/pull/2716.
[20] Intel. Intel optane dc persistent memory readies for widespread deployment.
https://newsroom.intel.com/news/intel-optane-dc-persistent-memory-readies-widespread-deployment, 2018.
[21] Intel. Intel optane dc persistent memory.
https://www.intel.com/content/www/us/en/architecture-and-technology/optane-dc-persistent-memory.html, 2019.
[22] Joseph Izraelevitz, Terence Kelly, and Aasheesh Kolli. Failure-atomic
persistent memory updates via justdo logging. In Proceedings of the
Twenty-First International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), 2016.
[23] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amir-
saman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subra-
manya R. Dulloor, Jishen Zhao, and Steven Swanson. Basic per-
formance measurements of the intel optane DC persistent memory
module. arXiv preprint, 2019.
[24] Arpit Joshi, Vijay Nagarajan, Marcelo Cintra, and Stratis Viglas. Ef-
ficient persist barriers for multicores. In Proceedings of the 48th
International Symposium on Microarchitecture (MICRO), 2015.
[25] Alan H. Karp and Horace P. Flatt. Measuring parallel processor perfor-
mance. Communications of the ACM (CACM), 33, May 1990.
[26] Aasheesh Kolli, Vaibhav Gogte, Ali Saidi, Stephan Diestelhorst, Pe-
ter M. Chen, Satish Narayanasamy, and Thomas F. Wenisch. Language-
level persistency. In Proceedings of the 44th Annual International
Symposium on Computer Architecture (ISCA), 2017.
[27] Aasheesh Kolli, Steven Pelley, Ali Saidi, Peter M. Chen, and Thomas F.
Wenisch. High-performance transactions for persistent memories. In
Proceedings of the Twenty-First International Conference on Archi-
tectural Support for Programming Languages and Operating Systems
(ASPLOS), 2016.
[28] Aasheesh Kolli, Jeff Rosen, Stephan Diestelhorst, Ali Saidi, Steven
Pelley, Sihang Liu, Peter M. Chen, and Thomas F. Wenisch. Dele-
gated persist ordering. In The 49th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), 2016.
[29] Dong Li, Jeffrey S. Vetter, Gabriel Marin, Collin McCurdy, Cristian
Cira, Zhuo Liu, and Weikuan Yu. Identifying opportunities for byte-
addressable non-volatile memory in extreme-scale scientific applica-
tions. In Proceedings of the 2012 IEEE 26th International Parallel and
Distributed Processing Symposium (IPDPS), 2012.
[30] Mengxing Liu, Mingxing Zhang, Kang Chen, Xuehai Qian, Yongwei
Wu, Weimin Zheng, and Jinglei Ren. Dudetm: Building durable
transactions with decoupling for persistent memory. In Proceedings of
the Twenty-Second International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS), 2017.
[31] Q. Liu, J. Izraelevitz, S. K. Lee, M. L. Scott, S. H. Noh, and C. Jung.
ido: Compiler-directed failure atomicity for nonvolatile memory. In
2018 51st Annual IEEE/ACM International Symposium on Microarchi-
tecture (MICRO), 2018.
[32] Sihang Liu, Yizhou Wei, Jishen Zhao, Aasheesh Kolli, and Samira
Khan. Pmtest: A fast and flexible testing framework for persistent
memory programs. In Proceedings of the Twenty-Fourth International
Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS), 2019.
[33] Raymond A. Lorie. Physical integrity in a large segmented database.
ACM Transactions on Database Systems (TODS), 2, March 1977.
[34] S. Nalli, S. Haria, M. D. Hill, M. M. Swift, H. Volos, and K. Keeton.
An analysis of persistent memory use with whisper. In Proceedings of
the Twenty-Second International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS), 2017.
[35] Faisal Nawab, Joseph Izraelevitz, Terence Kelly, Charles B. Morrey
III, Dhruva R. Chakrabarti, and Michael L. Scott. Dalí: A Periodically
Persistent Hash Map. In 31st International Symposium on Distributed
Computing (DISC), 2017.
[36] M. A. Ogleari, E. L. Miller, and J. Zhao. Steal but no force: Effi-
cient hardware undo+redo logging for persistent memory systems. In
2018 IEEE International Symposium on High Performance Computer
Architecture (HPCA), 2018.
[37] Chris Okasaki. Purely Functional Data Structures. PhD thesis,
Carnegie Mellon University, 1998.
[38] Steven Pelley, Peter M. Chen, and Thomas F. Wenisch. Memory
persistency. In Proceedings of the 41st Annual International Symposium
on Computer Architecture (ISCA), 2014.
[39] Juan Pedro Bolívar Puente. Persistence for the masses: Rrb-vectors
in a systems language. Proceedings of the ACM on Programming
Languages, 1, September 2017.
[40] Seunghee Shin, Satish Kumar Tirukkovalluri, James Tuck, and Yan
Solihin. Proteus: A flexible and fast software supported hardware log-
ging approach for nvm. In Proceedings of the 50th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO), 2017.
[41] Seunghee Shin, James Tuck, and Yan Solihin. Hiding the long latency
of persist barriers using speculative execution. In Proceedings of
the 44th Annual International Symposium on Computer Architecture
(ISCA), 2017.
[42] Michael Steindorfer. Efficient Immutable Collections. PhD thesis,
University of Amsterdam, 2017.
[43] Michael J. Steindorfer and Jurgen J. Vinju. Optimizing hash-array
mapped tries for fast and lean immutable jvm collections. In Pro-
ceedings of the 2015 ACM SIGPLAN International Conference on
Object-Oriented Programming, Systems, Languages, and Applications
(OOPSLA), 2015.
[44] Nicolas Stucki, Tiark Rompf, Vlad Ureche, and Phil Bagwell. Rrb vec-
tor: A practical general purpose immutable sequence. In Proceedings
of the 20th ACM SIGPLAN International Conference on Functional
Programming (ICFP), 2015.
[45] Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and
Roy H. Campbell. Consistent and durable data structures for non-
volatile byte-addressable memory. In Proceedings of the 9th USENIX
Conference on File and Storage Technologies (FAST), 2011.
[46] Vish Viswanathan. Intel memory latency checker v3.6.
https://software.intel.com/en-us/articles/intelr-memory-latency-checker, December 2018.
[47] H. Volos, A. J. Tack, and M. M. Swift. Mnemosyne: Lightweight
persistent memory. In Proceedings of the Sixteenth International
Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS), 2011.
[48] Matthew Wilcox. Dax: Page cache bypass for filesystems on memory
storage. https://lwn.net/Articles/618064/, 2014.
[49] Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong
Yong, and Bingsheng He. Nv-tree: Reducing consistency cost for
nvm-based single level systems. In Proceedings of the 13th USENIX
Conference on File and Storage Technologies (FAST), 2015.
