Data Structure Primitives on Persistent Memory: An Evaluation by Götze, Philipp et al.
Data Structure Primitives on Persistent Memory: An Evaluation
Philipp Götze
philipp.goetze@tu-ilmenau.de
TU Ilmenau, Germany
Arun Kumar Tharanatha
arun-kumar.tharanatha@tu-ilmenau.de
TU Ilmenau, Germany
Kai-Uwe Sattler
kus@tu-ilmenau.de
TU Ilmenau, Germany
ABSTRACT
Persistent Memory (PMem), as already available, e.g., with Intel
Optane DC Persistent Memory, represents a very promising, next-
generation memory solution with a significant impact on database
architectures. Several data structures for this new technology have
already been proposed. However, primarily only complete struc-
tures are presented and evaluated. Thus, the implications of the
individual ideas and PMem features are concealed. Therefore, in
this paper, we disassemble the structures presented so far, identify
their underlying design primitives, and assign them to appropriate
design goals regarding PMem. As a result of our comprehensive
experiments on real PMem hardware, we can reveal the trade-offs
of the primitives for various access patterns. This allowed us to
pinpoint their best use cases as well as vulnerabilities. Besides our
general insights regarding PMem-based data structure design, we
also discovered new combinations not examined in the literature
so far.
1 INTRODUCTION
Data structures play a crucial role in all data management systems.
Over the past decades, numerous structures have been designed for
very different purposes and each design is always a compromise
among the three performance trade-offs read, write, and memory
amplification [2]. Furthermore, advances in hardware technology
with changing characteristics make designing data structures an
ever-lasting challenge.
Persistent Memory (PMem) – also known as non-volatile mem-
ory (NVM) or storage-class memory (SCM) – is one of the most
promising trends in hardware development which might have a
huge impact on database system architectures in general, but also
particularly on data structures. Intel® has recently commercialized
Optane™ DC Persistent Memory Modules (DCPMMs) based on
the 3D XPoint™ technology [22], on which we focus in this paper.
Characteristics such as byte-addressability, read latency close to
DRAM but with a read-write asymmetry, and the inherent persis-
tence open up new opportunities but require also new designs, e.g.,
to mitigate the read-write asymmetry or to guarantee consistent
updates.
Over the last few years, several data structures for PMem have
been proposed trying to address these specifics. However, the lack
of widely available hardware platforms, different benchmarks, and
complex designs addressing different aspects make it difficult to
compare these approaches and – more importantly – identify the
most promising PMem-specific primitives.
Idreos et al. [11] presented the idea of a periodic table of data
structures to be able to argue about the design space of data struc-
tures. The work provides a great foundation for a systematic study
of data structure designs. In this paper, we try to support this ap-
proach by identifying core primitives of tree-based data structures
and evaluate different designs of these primitives on PMem. Lersch
et al. [19] have already extensively evaluated existing B+-Tree de-
signs for PMem on real hardware. However, again this was done
on the macro level hiding impacts of the separate underlying ideas.
Our contributions are as follows:
• From the literature (§3), we identify a set of data structure design
primitives on PMem. Furthermore, we generalize the existing
approaches for various types of tree-like structures such as B+-
Trees, Skip-Lists, Tries, and LSM-Trees (§4).
• We classify these primitives in three PMem-critical design goals:
reducing writes, fine-grained access as well as consistent and
durable operations (§4).
• Instead of a black box (or end-to-end) approach, we introduce
typical low-level access patterns applicable to the primitives (§4).
• We comprehensively evaluate and report a selection of these
access patterns on real PMem hardware (§5).
• We summarize our findings within a performance profile per
primitive and formulate general insights (§5 and §6).
To the best of our knowledge, this is the first evaluation considering
various data structure types and designs on the micro level with real
PMem hardware. The goal of our work is to get deep insights into
PMem-optimized design patterns for data structures in databases.
2 PERSISTENT MEMORY PROPERTIES
There are several variants of PMem that use different physical mechanisms
to achieve persistence. PCM [17] is probably one of the best-known
technologies among them. Intel® has recently commercialized
Optane™ DC Persistent Memory Modules (DCPMMs) based on
the 3D XPoint™ technology [22], which seems to behave similarly
to PCM. What makes PMem special is the byte-addressability and
direct persistence at DRAM speed. On modern CPU architectures,
byte-addressability corresponds to cache-line granularity (typically
64 bytes). Further interesting features are a higher density and bet-
ter economic characteristics than DRAM (both in monetary and
energy terms) as well as direct load and store semantics. Another
important fact is that the Optane DC devices internally work with
cache lines, but a write-combining buffer aggregates writes to 256-
byte blocks (cf. [28]). This is mainly to avoid write-amplification.
In our experiments, we could not identify a performance difference
when switching from 64-byte to 256-byte aligned data structures.
arXiv:2001.02172v2 [cs.DB] 12 Jun 2020
Table 1: Main characteristics of different memory/storage
technologies (cf. [19, 23, 25, 28])
                     DRAM       Optane DC        NAND Flash
Idle read latency    80 ns      175 ns           25 µs
Loaded rand. lat.    120 ns     400 ns           N/A
Write latency        80 ns      100 ns − 2 µs    500 µs
Write endurance      > 10^15    N/A              10^4 − 10^5
Density              1X         2X − 4X          4 − 8X
Therefore, we assume that it is sufficient for the data nodes to be at
least 256 bytes in size while being aligned only to cache lines. Another
benefit of this buffer is that writes can be faster than reads at low
load on the device. However, that is also why it is hard to measure
the real write latency.
Table 1 summarizes some of the characteristics and compares
them with those of DRAM and SLC NAND flash. We remeasured
the latencies on our system (see Section 5) using Intel’s Memory
Latency Checker [13] and Flexible I/O tester [3]. Since we focus
on single-threaded experiments, total bandwidth numbers are not
relevant for us here. Similar to flash, PMem exhibits a read-write
asymmetry and lower write endurance than DRAM. However, we
could not find any actual endurance data of the DCPMMs. When
designing new data structures, these properties mean that writes
should be minimized, even at the cost of additional computation.
The DCPMMs provide two possible operating modes: Memory
and App Direct mode. The Memory mode allows applications to
use the DCPMMs as an extension to volatile memory, where DRAM
acts as a kind of L4 cache. This requires no rewrite of in-memory
software. However, to fully utilize PMem and its persistence,
the App Direct mode must be used. In this mode, developers have to
take care of persistence, failure-atomicity, performance, and so on
themselves. In the remainder of this paper, we exclusively use the
latter mode. On the software level, we used the de facto standard
Persistent Memory Development Kit (PMDK) [14] to get uniform
and comparable implementations. It provides different levels of
granularity to manage PMem including allocations, transactions,
object management, etc.
3 RELATED WORK
Data Structures for PMem. The properties described in Section 2
enable new, more fine-grained techniques when designing
PMem-based data structures. An overview of these approaches
is given in [9]. There have already been several publications ad-
dressing byte-addressability and write endurance in particular. One
of the first approaches of Venkataraman et al. [29] proposes a single-
level storage hierarchy and general ideas for consistent and durable
data structures. They mainly focus on B+-Trees and use version-
ing, atomics, and shadowing to guarantee atomicity. This is also
addressed by Chen et al. [4] who exploit indirection and propose
to keep nodes unsorted to save writes. Additionally, they compare
the approaches and effects when adding certain features such as
bitmaps. Yang et al. [32] propose selective consistency, i.e., enforc-
ing consistency of leaf nodes and relaxing it for inner nodes. Here,
too, leaf nodes are kept unsorted and new keys are just appended.
In [24] Oukid et al. present a hybrid solution where leaf nodes
remain in the persistent layer but inner nodes are placed in DRAM.
This allows a much faster traversal of the upper levels but requires
recovery measures to rebuild it in case of failure. Another crucial
part is the use of fingerprints to reduce the number of keys probed.
HiKV [31] takes a similar path and places the B+-Tree in DRAM
and keeps only the hash partitions persistent. This avoids costly
structure reorganizations on PMem. The BP-Tree [10] buffers changes
in DRAM and, when the buffer is full, merges them into PMem. It also
collects information to predict future accesses, which allows it to
pre-allocate nodes and reduce writes caused by splits or merges.
In [16] the authors propose cache-line-sized nodes combined with differential encoding
to reduce the number of cache line flushes. The BzTree [1] is a
high-performance latch-free B-Tree using a persistent multi-word
compare-and-swap (PMwCAS [30]) operation to provide failure
atomicity. In [19] some of these trees are already evaluated on real
PMem hardware. But again the complete trees were compared, in-
stead of the individual primitives. As a result, e.g., the wB+-Tree [4]
always performs poorly as the inner nodes are persistent, leading
to a costly traversal. Furthermore, their evaluation is limited to
B+-Trees.
Recalling the properties of PMem, write-optimized data struc-
tures such as the LSM-Tree are a promising option. Many modern
key-value stores such as RocksDB [6] or Cassandra [27] are based
on this concept. There are already first approaches to adapt this
concept for PMem [15, 20, 21]. Furthermore, prefix trees (tries) like
ART were already adapted to PMem [7] as well as some write op-
timized versions of it [18]. There are also first publications in the
field of hash tables [5, 26, 33]. So far, only [8] has targeted analytical
processing, whose idea is based on clustering and unsorted blocks.
The special feature is the three-level architecture and the ability to
efficiently query any attribute besides the key.
The approaches mentioned above have so far been mainly eval-
uated for operator or end-to-end performance. However, this hides
important details and trade-offs of the underlying design primitives
on which we focus in this paper.
Evaluating Data Structure Designs. Several approaches ana-
lyze access patterns and the hardware profile to pick appropriate
data structures and implementations as well as hardware placement.
Similar to us, some of these also subdivide the data structures into
primitives. The Data Calculator [12] and the periodic table of data
structures [11], for example, discuss a novel approach which inter-
prets data structures as an assembly of first principles. The authors
combine analytical models, benchmarks, and machine learning to
gain insights into the impact of these fundamental primitives. Their
engine takes a high-level specification of a data structure assem-
bled from the primitives and predicts the performance for a given
workload and hardware profile. The so-called operation and cost
synthesizer learns a basic set of cost models for different access pat-
terns and synthesizes the cost for more complex operations. These
cost models are trained by micro-benchmarks and thus strongly
dependent on these. The authors indicate that the benchmarks must be
entered manually by the user when new patterns or hardware are
added. This is where the micro-benchmarks presented in this paper
can tie in very well.
Table 2: Design primitives and micro-operations for PMem-aware trees (→ applicable).

Micro-operations (columns):
  Read-based:    Node Search, Tree Traverse, Tree Iterate
  Insert-based:  Insert in Node, Node Split, Move Node, Merge Level
  Erase-based:   Erase from Node, Balance Nodes, Merge Nodes
  Recovery

Design primitives (rows):
  Design Goal 1 (reduce writes):
    sorted [1, 10, 29]
    unsorted [1, 4, 10, 16, 24, 32]
    bitmaps [4, 16, 24]
    indirection [4]
    hashing [24, 31]
    split move [1, 4, 10, 16, 29, 32]
    2-way merge
    K-way merge
    placement [10, 24, 31, 32]
  Design Goal 2 (fine-grained access):
    linear search [1, 10, 16, 32]
    search with bit check [4, 16, 24]
    search with hash probing [24]
    binary search [1, 10, 29, 32]
    search with indirection [4]
    split copy [24]
    cache sensitive [1, 4, 16, 24, 32]
  Design Goal 3 (failure atomicity):
    PMDK Transactions [8, 14]
    PMwCAS [1, 30]
    individually [4, 16, 24, 29, 31, 32]

Evaluation reported in Experiment No.: E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
4 DESIGN PRIMITIVES
Design Goals. Using the PMem properties and related work, we
identify three design goals. The first design goal is the reduction of
writes (DG1), which is based on the write endurance and read-write
asymmetry. Due to the byte-addressability, much finer-grained
accesses are possible that should be exploited (DG2). The direct
load and store semantics further enable zero-copy memory mapping
and, thus, new opportunities to ensure consistency and durability,
e.g., by atomic primitives (DG3).
In the following, we start by classifying typical data structures
found in DBMSs and give an insight into their huge design space.
Subsequently, we extract the design primitives with focus on tree-
based structures from the literature and connect them with our
defined design goals.
Glimpse into the Design Space. In Figure 1, the typical data
and index structures used within a DBMS are summarized. Each
of these data structures is more suitable for certain scenarios and
access patterns than others. As described in [11] there is a gen-
eral trade-off between read, write, and space optimized designs.
Accordingly, their performance depends on both the workload run-
ning on them and the underlying hardware. Furthermore, these
basic structures can also be extended by features or combined with
other structures in order to meet the given requirements. Adding
features sometimes also makes new access primitives
applicable to them.
Looking at the illustration and the related work above, it becomes
clear that the design space is huge and there are still thousands of
variants that have not been studied yet, in particular with regard to
PMem. We have primarily worked with B+-Trees, LSM-Trees and
single node structures since these already cover a vast part of the
design space.
The question we want to answer is what impact certain design
primitives have in which scenarios in the presence of PMem. The
goal is to reveal their trade-offs for facilitating design decisions.
This must be done in the form of white-box testing to avoid side
effects in the measurements. As a result, we envisage a profile per
design primitive, from which performance and memory impacts
can be derived for each kind of access pattern.
Figure 1: Classification of typical data and index structures
in database management systems. (Basic data structures, flat or
branched: Sorted Array, Skip-List, Hash Table, LSM, Log, B-Tree, Trie.
Feature pool: Partitioning, Compression, Zone Maps, Ordering, Hash,
Filter, Pruning, Adaptivity, Bit Map, Placement.)

Definitions. Similar to [11], we define a design primitive in this
context as an indivisible layout or access concept. To achieve the
goal mentioned above, it is necessary to break down the possible
primitives taking into account the properties of PMem. For that,
we study the approaches as described in Section 3 and assign the
ideas to the corresponding design goals. Furthermore, we consider
existing micro-operations for trees/nodes and set these in
relation to the derived primitives. In this context, a micro-operation
describes a low-level access pattern whose result is independent
of the chosen primitive(s). Typical macro-operations such as get,
insert, update, delete, and scan can be implemented by combining
such micro-operations. We classify them into read-, insert-,
and erase-based as well as recovery operations. Table 2 shows our
results. For design primitives that are not applicable or relevant for
certain operations, we left the cells empty.
Micro-Operations. The great advantage of considering micro-
operations is the disclosure of bottlenecks and optimization poten-
tial, which would be concealed at the macro level. For instance,
inserts for hybrid DRAM/PMem solutions are always faster than
completely persistent ones almost independent of the node layout.
Therefore, it is important to keep both access patterns and the de-
sign space concise. For the read category, the first micro-operation
is the search for a key within a node (lookup). To get the target node,
there are usually two types of traversing the tree, namely vertical
(tree traverse from top) and horizontal (tree iterate from lowest
left). The macro-operations get and scan can be built by combining
lookups and traversals. Next, there are insert-based operations like
placing records (e.g., key-value pairs) into a node. This can lead
to splits, which require the allocation of new nodes. An insert or
update macro-operation would need the micro-operations lookup,
traversal, insert, and split. For hybrid structures (such as the BP- or
LSM-Trees) a common operation is also the movement or migration
from DRAM to PMem. Furthermore, a compaction or merge of mul-
tiple nodes (in a level) into a larger node can be necessary. Erasing
an entry from a node is a micro-operation downsizing the tree. This
may cause an underflow which can be resolved by balancing or
merging with another node. The typical delete macro-operation
consists of search, traversal, erase, balance and merge. The last
class is recovery. It mainly consists of the operations of the read
category combined with the recreation of volatile DRAM structures.
Persisting operations depend on the primitives of DG3.
Primitives. We now briefly describe a set of found primitives.
For the first design goal - reducing writes - the node layout (which
applies to a large variety of tree-like structures) was reconsidered
and the main consensus was to leave data nodes unsorted. To keep
the access fast, indirection, hashing, and bitmaps, as well as
combinations of these, were used. For this, appropriate auxiliary structures
are added at the start of each node. In Figure 2, we compare the
data node layouts of these ideas as we have reimplemented them
for the evaluation (cf. Section 5).
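The indirection idea can be illustrated with a minimal sketch, which is not the implementation evaluated in this paper. Keys are appended in arrival order, and a small slot array at the front of the node records their sorted order, so an insert shifts only slot entries instead of keys and values. The class and method names are hypothetical, Python lists stand in for the fixed-size arrays, and persistence is not modeled.

```python
class IndirectionNode:
    """Unsorted key/value arrays plus a slot array that holds the sorted order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []    # append-only, arrival order
        self.values = []  # same positions as keys
        self.slots = []   # slots[i] = position of the i-th smallest key

    def _rank(self, key):
        # Binary search over the slot array; every probe jumps to the key array.
        lo, hi = 0, len(self.slots)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.keys[self.slots[mid]] < key:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def insert(self, key, value):
        if len(self.keys) >= self.capacity:
            raise OverflowError("node full, split required")
        rank = self._rank(key)
        self.keys.append(key)
        self.values.append(value)
        # Only the small slot entries are shifted, not keys or values.
        self.slots.insert(rank, len(self.keys) - 1)

    def get(self, key):
        rank = self._rank(key)
        if rank < len(self.slots) and self.keys[self.slots[rank]] == key:
            return self.values[self.slots[rank]]
        return None
```

Inserting 30, 10, and 20 leaves the key array in arrival order, while the slot array becomes [1, 2, 0] and supports a binary search. The probe in `_rank` also shows the cost of this layout: each comparison hops from the search structure to the key array.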
Figure 2: Different data node layouts, where M is the number of
entries and J ∈ N>0. Byte offsets are given in brackets; Keys (k1 … kM)
and Values (v1 … vM) follow the padded header.
(a) Sorted/Unsorted data nodes: numKeys [0, 8), nextPtr [8, 24), prevPtr [24, 40), padding [40, 64).
(b) Bitmap-only data nodes: bitmap [0, M/8), nextPtr [M/8, M/8 + 16), prevPtr [M/8 + 16, M/8 + 32), padding up to 64 * J.
(c) Indirection data nodes: slots [0, M + 1), bitmap [M + 1, M + 1 + M/8), nextPtr, prevPtr, padding up to 64 * J.
(d) Hashing data nodes: bitmap [0, M/8), hashes [M/8, M + M/8), nextPtr, prevPtr, padding up to 64 * J.

All of them align the search structure in the beginning and the
keys to cache lines to have a fair comparison. For the sorted and unsorted
case the metadata always costs only one cache line whereas
the other layouts can cover multiple cache lines depending on the
number of elements. To further save writes, the node placement was
adapted leading to selective persistence by placing some parts (e.g.,
inner nodes) in DRAM. Depending on the node layout there are different
access primitives (DG2). A simple linear search over the keys
is always possible. If the entries are sorted, a binary search is applicable.
Both algorithms can also be modified with more fine-granular
access using the cache-line-sized auxiliary structures. Additionally,
other algorithms such as interpolation or exponential search are
conceivable. When splitting a node, as typical in B-Trees and Skip-Lists,
we identified two approaches. The basic algorithms move
all keys greater than the split key to the new node. This can also
be done by creating two new nodes. An alternative when using a
bitmap is to copy the full node, reset the greater keys in the bitmap,
and finally store the inverted bitmap in the new node. The first variant
will trigger fewer writes, but the second variant could be faster
by exploiting the fine-grained access. Regarding failure atomicity
(DG3), PMDK provides general-purpose transactions that can be
placed around the algorithms and allocations. We will go into more
detail below. Alternatively, PMwCAS could be used, which provides
compare-and-swap operations for ranges bigger than eight bytes.
Another method is to individually persist the data using flush and
fence instructions.
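Two of the primitives above, search with hash probing and split copy, can be sketched on a single unsorted node. This is a simplified illustration under stated assumptions, not the evaluated code: the class name, the one-byte `fingerprint` function, and the use of Python lists for the bitmap and arrays are hypothetical, and flush and fence instructions are omitted. For nodes that are not completely full, the sketch additionally masks the inverted bitmap with the previously valid slots.

```python
class HashedNode:
    """Unsorted node with a validity bitmap and one-byte key hashes."""

    def __init__(self, capacity):
        self.bitmap = [False] * capacity   # True = slot holds a valid entry
        self.hashes = [0] * capacity       # one-byte hash per slot
        self.keys = [None] * capacity
        self.values = [None] * capacity

    @staticmethod
    def fingerprint(key):
        # Hypothetical one-byte hash; any cheap hash function would do.
        return (key * 2654435761) % 256

    def insert(self, key, value):
        pos = self.bitmap.index(False)     # first free slot; ValueError if full
        self.keys[pos], self.values[pos] = key, value
        self.hashes[pos] = self.fingerprint(key)
        self.bitmap[pos] = True            # setting the bit publishes the entry

    def find(self, key):
        fp = self.fingerprint(key)
        for pos, valid in enumerate(self.bitmap):
            # The key array is touched only when bit and hash both match.
            if valid and self.hashes[pos] == fp and self.keys[pos] == key:
                return pos
        return -1

    def split_copy(self, split_key):
        """Copy the whole node, then separate the halves via the bitmaps only."""
        right = HashedNode(len(self.bitmap))
        right.hashes = list(self.hashes)
        right.keys = list(self.keys)
        right.values = list(self.values)
        keep = [v and self.keys[p] <= split_key for p, v in enumerate(self.bitmap)]
        # New node: inverted bitmap, restricted to slots that were valid.
        right.bitmap = [v and not k for v, k in zip(self.bitmap, keep)]
        self.bitmap = keep                 # reset the greater keys in the old node
        return right
```

After `split_copy`, both nodes still contain every key and value physically; only the bitmaps decide which entries are visible. This is why the copy variant writes more bytes than a split move while needing no per-entry bookkeeping.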
Figure 3a illustrates the design primitives considered for the
Move Node operation, as typical for LSM-Trees. The initial data is
stored in a DRAM buffer and later moved to a free persistent node.
We consider a scenario where the data in the persistent nodes is
always sorted, whereas the DRAM data can be (1) sorted or (2)
unsorted. If it is sorted, the DRAM data is just copied to the PMem
node; otherwise, it must be sorted before the copy operation. In the former
case, insertions into DRAM would be costlier and the penalty would
be higher for a bigger DRAM node size. Alternatively, an unordered
structure could be used whose insertion costs are comparatively
less dependent on DRAM node size. However, a sort operation is
needed during the data movement from DRAM to a persistent node.
Figure 3b and Figure 3c show the design primitives used for Merge
Level. While the design space for merge algorithms is large,
we use two common approaches: 2-way and K-way merge. When
all the nodes in a level are filled, the sorted data in these nodes
is merged to a free persistent node at the next higher level. The
final result of a 2-way or K-way merge could be written using the
following two approaches. (1) Directly perform merge in PMem.
(2) Merge to a DRAM buffer and copy the result to PMem. The
former approach could result in a performance benefit when the
inserted keys are unique, whereas the latter approach could be a
better choice if there are many updates (i.e., duplicate keys).

Figure 3: Design primitives for Move Node and Merge Level.
(a) Move DRAM data to a PMem node: the volatile DRAM buffer is either (1) sorted or (2) unsorted and sorted on the way into a persistent first-level PMem node.
(b) 2-way merge into a PMem node: first-level PMem nodes are merged into a second-level PMem node, either (1) directly or (2) buffered.
(c) K-way merge into a PMem node: first-level PMem nodes are merged into a second-level PMem node, either (1) directly or (2) buffered.
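The difference between the direct and the buffered variant can be made concrete with a small K-way merge sketch. It rests on assumptions that go beyond the text: nodes are modeled as sorted Python lists of (key, value) pairs, for duplicate keys the entry from the later node wins, and a direct merge is assumed to overwrite an already written slot when an update arrives. The function names and the write counter are hypothetical and only serve the illustration.

```python
import heapq

def k_way_merge(runs):
    """Merge sorted (key, value) runs; for equal keys the later run comes last."""
    # Tag entries with the run index so ties are ordered oldest to newest.
    tagged = [[(key, idx, val) for key, val in run] for idx, run in enumerate(runs)]
    for key, _, val in heapq.merge(*tagged):
        yield key, val

def merge_direct(runs, pmem_node):
    """Variant (1): write merged records straight into the target node."""
    writes = 0
    for key, val in k_way_merge(runs):
        if pmem_node and pmem_node[-1][0] == key:
            pmem_node[-1] = (key, val)   # update overwrites an already written slot
        else:
            pmem_node.append((key, val))
        writes += 1
    return writes

def merge_buffered(runs, pmem_node):
    """Variant (2): resolve duplicates in a DRAM buffer, then copy once."""
    buffer = []
    for key, val in k_way_merge(runs):
        if buffer and buffer[-1][0] == key:
            buffer[-1] = (key, val)
        else:
            buffer.append((key, val))
    pmem_node.extend(buffer)             # single bulk copy to the target node
    return len(buffer)
```

With unique keys both variants write every record exactly once, so the direct variant saves the extra copy. With many duplicates, the buffered variant absorbs the overwrites in DRAM and writes only the surviving records to the target node.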
Using PMDK transactions for failure atomicity (FA) induces
a noticeable performance penalty. Figure 4 shows a different way
of realizing FA for Move Node and Merge Level. An array of pointers/offsets
keeps track of the nodes that contain valid data and
points to the next free node. After moving data
from one level to the next, the offset of this level is incremented.
Suppose a failure occurs during a write to node 1; an undo operation
on this node or a portion of it is unnecessary since the pointer remains
at the previous position, indicating that node 1 is still free. Therefore, it
is sufficient to only add the pointer to the PMDK transaction. As a
consequence, the performance penalty of PMDK transactions can
be greatly reduced. We term this Individual failure atomicity.
However, the data must be flushed before the pointer with the help
of fences. Furthermore, on x86 architectures, 8-byte aligned writes
are failure atomic. Hence, by limiting the size of the offsets to 8
bytes and properly aligning them, we can completely avoid transactions.
We term this No FA. Regarding Table 2, these mechanisms
basically fall under DG3, whereby Individual FA and No FA also fit
into DG1.
Figure 4: Alternative for realizing failure atomicity
(use case: PMem-aware LSM-Tree). A DRAM buffer is moved into the next
free PMem node of the first level; one offset/pointer per level marks
the next free node.
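The ordering argument behind the offset-based scheme can be sketched as a small simulation. It does not use PMDK and does not touch real PMem: the `Level` class, the `Crash` exception, and the `crash_before_publish` switch are hypothetical names, and the flush and fence steps appear only as comments.

```python
class Level:
    """One LSM level: fixed PMem nodes plus an offset to the next free node."""

    def __init__(self, num_nodes):
        self.nodes = [None] * num_nodes
        self.next_free = 0   # stands in for the 8-byte persistent offset

class Crash(Exception):
    """Simulated power failure."""

def move_node(level, dram_buffer, crash_before_publish=False):
    target = level.next_free
    level.nodes[target] = sorted(dram_buffer)   # 1) write the node payload
    # 2) on real PMem: flush the payload, then issue a fence
    if crash_before_publish:
        raise Crash("failure before the offset was updated")
    level.next_free = target + 1                # 3) one aligned 8-byte store
    # 4) on real PMem: flush the offset as well

def valid_nodes(level):
    """Recovery view: only nodes below the offset count as valid."""
    return level.nodes[:level.next_free]
```

If the simulated failure hits between steps 1 and 3, the partially or fully written node stays invisible because the offset still points to it as free, so no undo log is required. A later move simply overwrites the same node.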
Extendability. Table 2 is only an excerpt and can be extended
by more primitives and micro-operations. Further aspects would be,
e.g., hardware utilization, concurrency, and more in-depth failure
atomicity. We excluded concurrency control as it does not fit into
our micro consideration and is also sufficiently complex to merit a
separate paper. Regarding hardware utilization, we already applied
cache-line alignment for all nodes and auxiliary structures.
Metrics. The task now is to check and evaluate the applicable
options in the table with the help of micro-benchmarks. We have
already pursued this task to a certain extent. However, before delv-
ing into the experiments, we must first define relevant metrics.
Since we are on the micro level, throughput does not provide us-
able values at this point. Therefore, we mainly report the average
latency per operation. Moreover, hardware-specific measures can
be studied such as cache misses, flushes, instructions per cycle, or
the number of reads and writes. From our point of view, the num-
ber of persist operations or written bytes are crucial factors due to
the read-write asymmetry. Apart from the performance indicators,
memory consumption is of interest, as PMem is less dense than
disks.
5 EXPERIMENTS
In our experiments, we focus on the micro-operations on tree-
like data structures as introduced in the previous section. From
the primitives described above, we picked for the node layout:
sorted, unsorted, indirection + bitmap ("indirection"), hash-probing
+ bitmap ("hashing") and bitmap only, in most of the experiments.
For the access primitives, binary search with and without using
indirection as well as linear search with and without using hashing
and bitmaps are tested. We re-implemented the approaches from
the literature focusing on the corresponding primitive(s). More
details are given at each experiment. The aim and contribution are
to evaluate the design primitives independently of their original
context and to compare their strengths and costs. This should reveal
a performance profile for each primitive sketched at the end of this
section.
5.1 Experimental Setup
For our experiments, we used a dual-socket Intel Xeon Gold 5215
server as outlined in Table 3. Each socket comes with 6 DCPMMs,
which we grouped into one region and namespace. The operating
mode of the modules is set to AppDirect allowing direct access to
the devices. On the PMem DIMMs, we created an ext4 file system
and mounted it with the DAX option to enable direct loads and stores
bypassing the OS cache. To avoid NUMA effects, all experiments
are controlled to allocate resources (memory, persistent memory,
and cores) only from the same socket.

Table 3: Experimental setup.
Processor        2× Intel® Xeon® Gold 5215, 10 cores / 20 threads each, max. 3.4 GHz
Caches           32 KB L1d, 32 KB L1i, 1024 KB L2, 13.75 MB LLC
Memory           2×6×32 GB DDR4 (2666 MT/s), 2×6×128 GB Intel® Optane™ DCPMM
OS & Compiler    CentOS 7.8, Linux 5.6.11 kernel, cmake 3.15.3, ICC 19.1.0.166 (-O3)

Figure 5: Searching for a key within a node (E1). Panels: keypos = first,
middle, last; x-axis: node size (256 to 4096 bytes); y-axis: latency (µs);
series: binary, linear, indirection, hashing, bitmap, DRAM.
We are using PMDK [14] for the implemented data structures to
guarantee failure atomicity. Alternatively, PMwCAS [30] could be
used, which provides CAS operations even for structures bigger than
8 bytes. Another method could be to manually place persist, flush,
or fence instructions, which is also possible with PMDK. Since the
PMDK transactions incurred so much overhead that they hid the impact
of the approaches in our implementations, we decided to mainly
report the results for manually persisting the modified data.
Unless stated otherwise, we used fixed-size keys and values in all
our experiments: 8-byte integers as keys and 16-byte tuples
(<int, int, double>) as values. The size of the values, therefore, corresponds
to the size of a persistent pointer (e.g., to the actual payload). Keys,
values, and child pointers were stored in separate arrays within
the nodes for better locality when iterating through the
keys. In addition, all nodes as well as their inner key arrays are
aligned to cache lines. The fill ratio of the trees was always 100% to
make optimal use of memory. Since, in most experiments, we access
predefined positions, the fill ratio is not too important. When the
node size is varied the various implementations often result in a
different number of actual elements due to their node layout. To
primarily measure PMem and not cache performance, we created a
collection of nodes (or trees) that is a multiple of the size of the LLC.
Each iteration randomly accesses one of these instances to reduce
prefetching and caching of other instances as much as possible.
When referring to a cached case below, we mean that each iteration
always accesses the same instance. Every data point in our plots
is supported by several thousand iterations. Our implementations
can be examined via our public repository.1
5.2 Read Operations
Node Search (E1). In our first experiment, we study the perfor-
mance for searching a key position within a node. This type of
operation is crucial to nearly every macro operation including get,
update, and delete. We varied both the node sizes as well as the
position of the requested key and tested on various node layouts
combined with their corresponding access primitive. The expecta-
tion is that the approaches using binary search are faster, except for
accesses to front elements. Figure 5 shows our outcome including
a baseline that aggregates the results for all variants on DRAM.
1 https://github.com/dbis-ilm/PMem_DS
We observe that our expectations have not been met in this setup.
Only if the key is in the middle do the binary and linear approaches show
similar performance. The indirection approach is a little slower
than the direct binary search but costs far fewer writes for
inserts and deletes, which we will consider later. The disadvantage
of indirect binary search is that it needs to jump back and forth from
search structure to key array. Of the linear approaches, hashing is
usually the best since all comparisons are first done in the front
cache line(s) and only if a hash matches, the actual key is checked.
It means that on average, the fewest cache lines have to be loaded
from PMem. However, it should be noted that for DRAM the binary
search is better in the middle and, for a cached setup, also in the
back access areas. Furthermore, the unsorted structures are not
suitable for inner nodes, since the search is not based on equality
but a key range. This is certainly relevant to hybrid structures. It is
also notable that the lines for indirection and hashing are sometimes
jumpy. This is due to the changing size of the front search structure
depending on the node capacity. For 1 KiB and 2 KiB, the search
structure consumes two cache lines and for 4 KiB it needs three
cache lines. However, in contrast to indirection, hashing usually
reads less of these cache lines. Thus, indirection and hashing (and
partly also the bitmap) require more memory.
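To make the cache-line argument concrete, the following is a minimal Python sketch, not the authors' implementation. It contrasts a fingerprint-based linear probe with a binary search and counts how many distinct cache lines of the key array each one touches. The 64-byte cache line, the 8-byte key size, and the `fingerprint` function are assumptions made for illustration only.

```python
CACHE_LINE = 64   # bytes per cache line (assumed)
KEY_SIZE = 8      # bytes per key (assumed)

def fingerprint(key: int) -> int:
    """Hypothetical 1-byte hash of a key; any cheap hash would do."""
    return ((key * 0x9E3779B1) >> 24) & 0xFF

def search_hashing(keys, hashes, bitmap, target):
    """Linear probe over fingerprints; a key is read only on a fingerprint hit.

    Returns (position or -1, number of distinct key cache lines loaded).
    """
    fp = fingerprint(target)
    lines = set()
    for pos, used in enumerate(bitmap):
        if used and hashes[pos] == fp:
            lines.add(pos * KEY_SIZE // CACHE_LINE)
            if keys[pos] == target:
                return pos, len(lines)
    return -1, len(lines)

def search_binary(keys, target):
    """Binary search over a sorted key array, counting key cache lines loaded."""
    lo, hi = 0, len(keys) - 1
    lines = set()
    while lo <= hi:
        mid = (lo + hi) // 2
        lines.add(mid * KEY_SIZE // CACHE_LINE)
        if keys[mid] == target:
            return mid, len(lines)
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, len(lines)
```

The sketch ignores the cache lines of the fingerprint array itself, which sit at the front of the node and are read sequentially. It only illustrates why few key cache lines are loaded when fingerprints rarely collide.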
Regarding memory, Table 4 shows the actual number of entries that
can be stored for a given node size. Without a search structure - as
for basic binary and linear search - more records can be placed in a
node. For indirection and hashing each entry requires an additional
bit for the bitmap and an additional byte for the slot or hash array.
For a fairer comparison, we also aligned the approaches without a
search structure so that the counter of entries and the sibling
pointers are placed in the first cache line. The resulting size
adjustment can also be found in the table. Hence, all variants have
their actual data cache-aligned as already mentioned above. It
becomes visible that smaller node sizes generally lead to a larger
overhead in PMem consumption. In addition, they also result in a
longer traversal path. This of course highly depends on the size of
the keys and values. Apart from the higher memory footprint, hashing
is the best choice for searching a persistent node. Nodes that are in
DRAM or most probably cached should use a sorted layout.

Table 4: Calculated number of records per node (r/n) and memory
consumption of a node chain (50M records) for a given node size.

                             256 B   512 B   1 KiB   2 KiB   4 KiB   Unit
Base                             9      19      41      83     169   r/n
                              1.32    1.25    1.16    1.15    1.13   GB
Aligned                          8      18      40      82     168   r/n
                              1.49    1.32    1.19    1.16    1.14   GB
w/ search structure              8      18      37      79     160   r/n
                              1.49    1.32    1.25    1.22    1.19   GB
Overhead Aligned              +13%     +6%     +3%     +1%     +1%
Overhead Search Structure     +13%     +6%     +8%     +6%     +5%
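The memory column of Table 4 can be approximated with simple arithmetic. The sketch below is an illustration under two assumptions that the text does not state: the chain holds ceil(records / r/n) nodes, and the reported unit is the binary gigabyte (2^30 bytes). Under this reading the Base and Aligned rows come out as printed. The function name `chain_size_gib` is hypothetical.

```python
GIB = 2 ** 30  # assumption: the "GB" column uses binary gigabytes

def chain_size_gib(records_per_node: int, node_size_bytes: int,
                   total_records: int = 50_000_000) -> float:
    """Size of a node chain that stores total_records entries."""
    # Integer ceiling division: number of nodes needed for all records.
    nodes = (total_records + records_per_node - 1) // records_per_node
    return nodes * node_size_bytes / GIB

# Example: the 256 B column for the Base and Aligned layouts.
print(round(chain_size_gib(9, 256), 2))   # Base, 9 r/n
print(round(chain_size_gib(8, 256), 2))   # Aligned, 8 r/n
```

The calculation shows why small nodes are costly: each node pays the fixed header once, so fewer records per node directly translate into more nodes and more bytes.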
Tree Traversal (E2). The next experiment focuses on the inner
nodes and the costs for traversing from the root to the leaf level as
typical for B+-Trees. A search within the nodes is not included in
order to obtain bare dereferencing and pointer chasing measures.
Instead, a random child position is chosen to prevent prefetching.
Therefore, we limit the comparison to the timing of traversing nodes
residing in PMem and DRAM, respectively. Only the last access is to a
persistent leaf node. This reflects the idea of hybrid data
structures and placement. Here we varied the depth of the tree. The
node sizes hardly made a difference. Based on our idle latency
measurements for DRAM and PMem, we would expect an increase of about
this latency per level. The results are shown in Figure 6a.
In fact, this behaves almost as expected. For DRAM, about 50-100 ns
are added for each further level. For PMem, however, each level adds
400-500 ns, which is nearly double the latency reported by the MLC
benchmark. We assume that this is mainly due to the software overhead
(e.g., PMDK) and the random access under heavy load. However, we also
note that this would nearly fit with the read latency reported in
[19, 28]. It becomes visible that all approaches would greatly
benefit from a hybrid variant. If raw performance during operation is
most important, we found DRAM-based sorted nodes with both binary
search and indirection to be good solutions. The former is more
memory efficient and the latter saves write operations, which however
is not so crucial on DRAM. Placing the inner nodes in DRAM, however,
requires recovery actions in the event of a failure. If this is not
desired, it is mainly the search algorithm that makes the difference
(see E1). In summary, if recovery is rare, a hybrid approach is
highly preferable.
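The per-level costs above suggest a simple back-of-the-envelope model for a root-to-leaf traversal. The sketch below is only an estimate under stated assumptions: it uses 75 ns and 450 ns as hypothetical midpoints of the reported 50-100 ns (DRAM) and 400-500 ns (PMem) ranges, and it ignores the search inside each node. The function name is illustrative.

```python
def traversal_latency_ns(depth: int, inner_ns: float, leaf_ns: float) -> float:
    """Estimate: (depth - 1) inner dereferences plus one leaf access."""
    if depth < 1:
        raise ValueError("a tree has at least one level")
    return (depth - 1) * inner_ns + leaf_ns

DRAM_NS = 75    # assumed midpoint of the 50-100 ns range
PMEM_NS = 450   # assumed midpoint of the 400-500 ns range

# Five levels: everything in PMem versus inner nodes in DRAM.
pure_pmem = traversal_latency_ns(5, PMEM_NS, PMEM_NS)
hybrid = traversal_latency_ns(5, DRAM_NS, PMEM_NS)
print(pure_pmem, hybrid)
```

Under these assumptions the hybrid placement removes most of the traversal cost, because only the final leaf access pays the PMem latency.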
Tree Iterate (E3). For the last experiment in the read category,
the horizontal traversal of data nodes, usually also referred to as
scan, is examined. This includes not only the chasing of the node
pointers (like in E2) but also the iteration over the key and value
arrays within them. Since the order is not prescribed, we stick
with the term iterate to avoid confusion with range scans. For this
experiment, all approaches use the same number of entries based on
the variants with a growing search structure. Here, we use different
data node sizes and let the tree grow horizontally by increasing the
single inner node (the root). Since the order does not matter when
iterating, the sorted and unsorted approaches use the same algorithm.
The same applies to indirection, hashing, and bitmap, as only the
bitmap has to be checked for valid entries. However, since this
causes branching in the loop, we expect a weaker performance of
the latter class. For the indirect organization, it is also possible to
iterate using the slot array instead of the bitmap. In Figure 6b the
results for 1 KiB data node sizes are reported.
Figure 6: Traversing and iterating nodes (E2 & E3). (a) Traversing a
tree w/o search: latency (µs) over tree depth (levels) for PM and
DRAM. (b) Iterating through nodes: latency (µs) over # of tree
elements for w/o bitmap, indirection, and w/ bitmap.
Interestingly, iterating via indirection is worse than via bitmap.
This means that even with indirection slots the bitmap should be
used for iteration if the order is not important. As expected, the
approaches without a bitmap are always the fastest. Using these as
a baseline, in the largest case the overhead is 31% for the bitmap
and 58% for indirection. However, it should be noted that the nodes
were filled to 100%. Thus, it is already the best case for the bitmap
since the other approaches also have to check all entries. In the case
of indirection and without a bitmap the loop only iterates over
the number of actual keys. Besides branching, jumps between cache
lines (bitmap/indirection slots, key, and value array) also have a
negative impact. Since our scan function only copies out the key in
each case, we assume that this is the worst case and that with
increasing complexity of the function all methods might approach
each other. For a range scan based on this, the impact of sorting per
node compared to pre-sorted nodes has to be checked. Nevertheless,
for iterating through persistent data nodes the sorted and unsorted
approaches perform best. Particularly for the DCPMMs, it is important
to avoid jumping between non-sequential cache lines.
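The three iteration styles compared above can be sketched as follows. This is a simplified illustration in Python with plain lists standing in for the persistent arrays. The function names are hypothetical, and each function only copies out the keys, as the scan function in the experiment does.

```python
def iterate_dense(keys, num_keys):
    """Sorted or plain unsorted node: the first num_keys slots are valid."""
    return [keys[i] for i in range(num_keys)]

def iterate_bitmap(keys, bitmap):
    """Bitmap or hashing node: every slot needs a validity check (a branch)."""
    return [keys[i] for i, used in enumerate(bitmap) if used]

def iterate_slots(keys, slots, num_keys):
    """Indirection node: follow the slot array, which yields sorted order
    but jumps between the slot array and arbitrary key positions."""
    return [keys[slots[i]] for i in range(num_keys)]

# Example node with three live entries and one free slot.
keys = [30, 10, 20, None]
print(iterate_dense(keys, 3))
print(iterate_bitmap(keys, [True, True, True, False]))
print(iterate_slots(keys, [1, 2, 0], 3))
```

The dense loop has no per-entry branch and walks the key array sequentially, which matches the observation that the variants without a bitmap iterate fastest.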
5.3 Insert-based Operations
Node Insert (E4). For insert-based operations, we first check the
behavior when inserting a key-value pair into a data node. Similar
to experiment E1, we vary the node size and the insert position.
In addition to the time an operation takes, we report the number
of bytes modified and actually written to the device. In the setup
phase, the key to be inserted is omitted so that space is left for it.
For instance, when inserting at the first position in a node with 10
slots, the keys from 2-10 are pre-inserted. The insertion of key 1 is
then the measured part. The lookup for the insert position is not
part of the measurement. Adding up the times of traverse, node
searches, and node insert would result in approximately the insert
macro operation. We expect that the sorted approach will show
the poorest results as entries have to be moved, which leads to many
writes and flushed cache lines. This is not the case for the other
approaches, which only append the new data and adapt the search
structure. The results are illustrated in Figure 7.
It is apparent that the unsorted variants almost always perform
best. For the plain unsorted case, only the key and value are
appended and the size field is updated. The hashing and bitmap
approaches have to set the corresponding hash and bit, respectively.
Indirection also performs quite well, even in the first case where
all slots have to be shifted to the back to keep the indirect
ordering. Compared to the other approaches optimized for PMem,
however, indirection performs worst. The overhead is not caused by
sorting, but by the additional determination of a free bit position
besides the given slot position. Since indirection is the only
approach that has to find two positions, we have included this
overhead. Matching the high number of write accesses, the performance
of the sorted approach is significantly worse than the others. It can
only keep up with small node sizes. This is because the number of
flushed cache lines is about the same here. It becomes obvious that
keeping the nodes sorted is not suitable for read-write asymmetric
PMem. Although indirection also involves many writes, these are on a
much finer granularity and multiple slots can be persisted at once.
Hence, it shows a similar performance as when using hashing or
appending only. The impact of the read-write asymmetry of PMem
becomes especially clear in this experiment. Overall the unsorted
variants perform about equally well.
Figure 7: Inserting a key into a node (E4). Panels: keypos = first,
middle, last. x-axis: node size (bytes); y-axes: latency (µs) and
bytes modified | written. Series: sorted, unsorted, indirection,
hashing, bitmap.
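The gap between the sorted and the appending layouts can be illustrated by counting the cache lines of the key array that an insert dirties. The following Python sketch is a simplification under assumptions: 64-byte cache lines, 8-byte keys, a key array starting at a cache-line boundary, and no accounting for values, header fields, or search structures.

```python
CACHE_LINE = 64  # bytes per cache line (assumed)
KEY_SIZE = 8     # bytes per key (assumed)

def dirty_lines(first_slot: int, last_slot: int) -> int:
    """Cache lines covering key slots first_slot..last_slot (inclusive)."""
    first = first_slot * KEY_SIZE // CACHE_LINE
    last = ((last_slot + 1) * KEY_SIZE - 1) // CACHE_LINE
    return last - first + 1

def sorted_insert_lines(num_keys: int, pos: int) -> int:
    """Sorted layout: keys pos..num_keys-1 shift right by one, so slots
    pos..num_keys are rewritten and must be flushed."""
    return dirty_lines(pos, num_keys)

def append_insert_lines(num_keys: int) -> int:
    """Unsorted layouts: only the next free slot is written."""
    return dirty_lines(num_keys, num_keys)

# Node with 40 key slots, 39 of them occupied before the insert.
print(sorted_insert_lines(39, 0))    # insert at the first position
print(sorted_insert_lines(39, 39))   # insert at the last position
print(append_insert_lines(39))       # append
```

In this model an insert at the front of a sorted node dirties every cache line of the key array, while an insert at the last position costs the same as an append. This mirrors the observation that the sorted layout only keeps up for small nodes or back positions.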
Node Split (E5). As the next experiment, we chose node splits, in
particular for data nodes as these are definitely placed on PMem.
We picked a similar setup as for the inserts. We applied the two split
strategies mentioned in Section 4 to the indirection, hashing,
and bitmap approaches. Since bitmap and hashing showed exactly
the same performance and to keep the figure clear, we summarized
them as bitmap in the graph. As stated before, the move variant
costs fewer writes and thus supports DG1, whereas the copy
variant exploits the fine-grained access, supporting DG2. For a node
organization without a bitmap the copy strategy is not useful since
the entries of the new node would be written twice. This is because
the whole node is copied and then all entries are reordered to the
left. Generally, the performance is hard for us to predict, but we
expect at least that the sorted approach should be faster than the
unsorted one. This is because in an unsorted node all entries have
to be checked for whether they are greater or less than the split
key. In a sorted node, everything is simply copied starting from the
middle. Figure 8 shows our results with measures for performance and
the number of bytes modified as well as actually written to the
device.
In general, a split on a sorted node is the fastest and the unsorted
case (with or without bitmap) is almost always the worst. Indirection
with the move strategy behaves similarly to the sorted variant, but
requires more effort to transfer and set the indirection slots and
bitmap per entry. Hashing and the bitmap need more time due to
the search for the median in an unsorted array (using quickselect).
The copy strategy needs more bytes as it includes the writing of
a whole node. With regard to write endurance and read-write
asymmetry, this could be a shortcoming. However, we combined
this copy process with the allocation and, as it initiates sequential
writing, this seems beneficial for the write-combining buffer on the
DCPMMs. The copy strategy works a bit better since everything is
transferred once and after that the slots are shifted and the bitmap
is inverted. The same applies to hashing, so we can deduce that the
copy approach is more effective when a bitmap is present. Also, if
the inner nodes are in DRAM, we would recommend either a sorted
variant or the copy approach. In this setup, for PMem most of the
time is spent on allocating a new node (around 80%) and, thus, the
approaches show roughly the same performance. The allocation
is done by PMDK encapsulated in a transaction and the time
depends on the allocated size. As already discussed in [19], PMem
allocations add a tremendous overhead and should be handled with
care. As suggested in [24], a group allocation could reduce this
overhead. Apart from this, we see a compromise of performance
against endurance when having unsorted nodes. However, also
here the sorted and indirection variants are always better.
Figure 8: Splitting a node (E5). x-axis: node size (bytes); y-axes:
latency (µs) and bytes modified | written. Series: sorted, unsorted,
indirection(move), indirection(copy), bitmap(move), bitmap(copy).
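The two split strategies for a bitmap node can be sketched as follows. Section 4 defines them precisely. The Python code below is only one reading of the description given here (move: transfer the upper half entry by entry, copy: duplicate the whole node once and then invert the bitmap). Lists stand in for persistent arrays, `sorted()` stands in for quickselect, and all function names are hypothetical.

```python
def live_keys(keys, bitmap):
    """Keys whose bitmap bit is set."""
    return [k for k, used in zip(keys, bitmap) if used]

def choose_split_key(keys, bitmap):
    """Median of the live keys; sorted() stands in for quickselect here."""
    live = sorted(live_keys(keys, bitmap))
    return live[len(live) // 2]

def split_move(keys, bitmap):
    """Move strategy: write only the upper half into the new node."""
    pivot = choose_split_key(keys, bitmap)
    old_bitmap = list(bitmap)
    new_keys = [None] * len(keys)
    new_bitmap = [False] * len(keys)
    nxt = 0
    for pos, (key, used) in enumerate(zip(keys, bitmap)):
        if used and key >= pivot:
            new_keys[nxt] = key          # one write per moved entry
            new_bitmap[nxt] = True
            old_bitmap[pos] = False      # entry leaves the old node
            nxt += 1
    return (keys, old_bitmap), (new_keys, new_bitmap), pivot

def split_copy(keys, bitmap):
    """Copy strategy: one sequential copy, then only bitmaps change."""
    pivot = choose_split_key(keys, bitmap)
    new_keys = list(keys)                # whole node copied once
    old_bitmap = [bool(used and key < pivot)
                  for key, used in zip(keys, bitmap)]
    new_bitmap = [bool(used and not keep)
                  for used, keep in zip(bitmap, old_bitmap)]
    return (keys, old_bitmap), (new_keys, new_bitmap), pivot
```

Both functions leave the same live keys in each node. The copy variant writes more bytes, but as one sequential transfer, and the new node's bitmap is simply the complement of the old one for a full node.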
Move Node (E6). In this experiment, we are interested in profiling
the latencies involved in moving DRAM data to a persistent
node as necessary in LSM-Trees when merging to the first level.
The experimental setup involves varying the node size and switching
between the FA strategies: No FA, Individual FA, and the default
PMDK transactions as explained in Section 4. Note that all nodes
have the same capacity and, if the DRAM data is unsorted, the
measurements also include the sorting operation. The nodes are
composed of persistent arrays where each element is a key-value
pair. We conduct the experiment by inserting unique keys (since this
operation is independent of inserts or updates). The first goal is
to analyze the overall PMem write performance for two cases:
(1) sorting the DRAM data and moving it to PMem, against
(2) maintaining a sorted DRAM data structure. The second goal is
to analyze the effect of different FA strategies on varying persistent
node sizes. The results are illustrated in Figure 9.
Figure 9: Move data from DRAM to a PMem node (E6). x-axis: node size
(KiBytes); y-axis: latency (µs). Series: Hashmap and Ordered Map,
each with No FA, Individual FA, and PMDK TX.
It is apparent that using a sorted DRAM data structure is faster
than an unsorted hash data structure since, in the unsorted case,
each time the capacity is reached, the data must be sorted and moved
to a persistent run. However, maintaining a sorted DRAM data
structure is costlier. For a typical LSM-Tree use case scenario, the
DRAM buffer is in the order of a few kilobytes. Hence, a sorted data
structure is always a better choice for small DRAM buffers. Regarding
the second goal, as depicted in Figure 9, using PMDK transactions
for FA has a much higher performance impact than No FA and
Individual FA (cf. Section 4). On the other hand, No FA
and Individual FA have almost the same performance, i.e., adding a
single 64-bit persistent variable into the PMDK transaction (plus the
individual flushes and fences) has negligible performance impact.
This shows that PMDK transactions should only be used for
allocations and deallocations. Performance-critical applications
should definitely take care of failure atomicity individually.
Merge Level (E7). In this experiment, we examine the impact
of merging sorted PMem nodes into a new larger node as applied,
for example, when merging the nodes of one level in an LSM-Tree
into the next level. Similar to the previous experiment, the setup
involves varying the node size and switching between two different
FA strategies (i.e., PMDK transaction and No FA). Additionally, we
examine the impact of the two extreme scenarios: unique keys in each
node and duplicate keys in all the nodes (i.e., 0% and 100%
duplicates). Finally, we benchmark the performance by applying the
2-way and K-way merge algorithms as illustrated in Figure 3b and
Figure 3c, respectively. We used two merge sub-strategies in our
experiments: (1) merge directly to a PMem node, and (2) merge to a
DRAM buffer and then copy the result to a PMem node. These two
strategies are applied to both the 2-way and K-way algorithms. The
results are shown in Figure 10.
An important observation is that enabling PMDK transactions
again incurs a great performance penalty in all cases. When merging
directly to a persistent node, 2-way is faster since the CPU can
cache the intermediate merge results. In K-way, on the other hand,
the CPU needs to read the persistent memory more often for key
comparisons due to more cache misses. It is interesting to see that
K-way merge shows the same performance as 2-way when the keys
are duplicated in each node (second column of Figure 10). However,
when PMDK transactions are enabled the 2-way merge is faster
again. The effect of PMDK transactions is explained as follows. After
a merge operation, the number of elements in the resultant node
can vary between two extreme limits: from the number of elements
in a single PMem source node to the sum of elements in all the PMem
source nodes. In a general scenario, it is not possible to determine
the resultant number of elements exactly in advance. Therefore, in
the case of K-way merge, the entire sum of elements must be added to
the PMDK transaction, whereas in 2-way merge it is sufficient to add
only the sum of elements of the last binary merge step. Hence, 2-way
merge has a better performance when the keys are duplicated with the
default PMDK transactions enabled. To improve the performance
of K-way merge, one could use intermediate DRAM buffers as
shown in Figure 3b and Figure 3c, respectively. Once again, we
see that there are two possible scenarios. (1) No or very few
duplicate keys: In such cases, using an intermediate DRAM buffer
brings in an additional performance penalty for both merge algorithms
because the number of elements added to a transaction is always
equal to the sum of all elements. (2) Many duplicates: In such cases,
both algorithms perform better when an intermediate DRAM buffer
is used, as shown in column four of Figure 10. The insights from
this experiment can be summarized as follows: (1) 2-way merge
could be the preferred choice in scenarios where the inserted keys
are unique and PMDK transactions are enabled. (2) K-way merge
can be used in scenarios where no FA is needed and there
are many updates (i.e., duplicates). (3) If there are many updates
and PMDK transactions are enabled, then using an intermediate
buffer will improve performance for both merge
algorithms.
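The two merge algorithms and the effect of duplicates on the result size can be sketched in Python as follows. This is an illustration, not the code behind Figure 3b and Figure 3c. It assumes that runs are sorted lists of (key, value) pairs with unique keys per run, that later runs are newer, and that the newest value wins on duplicates. The cascaded form of the 2-way merge is one possible arrangement of the binary merge steps.

```python
import heapq

def two_way_merge(older, newer):
    """Merge two sorted runs; on equal keys the newer entry wins."""
    out, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        if older[i][0] < newer[j][0]:
            out.append(older[i]); i += 1
        elif older[i][0] > newer[j][0]:
            out.append(newer[j]); j += 1
        else:
            out.append(newer[j]); i += 1; j += 1
    out.extend(older[i:])
    out.extend(newer[j:])
    return out

def cascaded_two_way(runs):
    """Apply binary merge steps one after another (intermediate results)."""
    result = runs[0]
    for run in runs[1:]:
        result = two_way_merge(result, run)
    return result

def k_way_merge(runs):
    """Merge all runs in one pass; -index makes the newest run sort first."""
    tagged = [[(k, -idx, v) for k, v in run] for idx, run in enumerate(runs)]
    out = []
    for key, _, value in heapq.merge(*tagged):
        if not out or out[-1][0] != key:
            out.append((key, value))
    return out

runs = [[(1, "a"), (3, "a")], [(2, "b"), (3, "b")], [(1, "c"), (4, "c")]]
merged = k_way_merge(runs)
# The result size lies between the largest single run and the sum of all runs.
print(len(merged), max(len(r) for r in runs), sum(len(r) for r in runs))
```

Because the final size is only known after the merge, a transactional K-way merge has to reserve space for the sum of all runs, whereas each binary step only needs the sum of its two inputs.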
5.4 Erase-based Operations
Erase from Node (E8). For erase-based operations, the first
experiment is the removal of a single key-value pair from a node.
Similar to the lookup (E1) and insert (E4), we measure different node
sizes and key positions. Again, we report the latency and the
modified as well as written bytes for each operation. The preparation
always creates a full node, from which the entry is then deleted at
different positions. The lookup for the erase position is again not
part of the measurement since it is already represented in E1. Hence,
an erase macro operation without underflow would be the result of
traversing the tree, searching each node, and this experiment. Once
again, we expect the sorted approach to show the poorest results
as keys and values have to be moved to fill the resulting gap. This
entails many writes and flushes. For the unsorted organization, not
all the entries have to be shifted. In this case, it is enough to move
the last entry into the gap and decrease the entry counter.
The hashing, indirection, and bitmap variants only need to reset
one bit. For indirection the slot array additionally has to be
shifted. Therefore, these approaches will probably run the fastest,
with indirection possibly taking slightly more time. The actual
results are visualized in Figure 11.
The advantage of the bitmap is unambiguous. In all cases it
performs the best. The hashing approach is directly below the bitmap
line, because the algorithm is the same. As expected, indirection
is only a little slower. Starting from 2 KiB it jumps a bit higher if
the key is in the first or middle position. This is due to the fact
that from 2 KiB on the bitmap and slots need another cache line to
be flushed (cf. Table 4). In the last case nothing has to be shifted,
thus, it is almost constant. We would have expected the unsorted
variant to be more stable, since always the same number of bytes
are changed. Here the locality of the deleted and the last position,
from where the entry is moved, is quite important. The sorted approach
is absolutely not appropriate for erasing a key in PMem. It costs
far too many writes and also flushes, which drastically reduces
the performance. Only when the entry is rather at the end can this
approach keep up. Hence, using a bitmap is the best choice for
fast erasures.
Figure 10: Merging sorted data in persistent nodes to a new persistent
node (E7). Panels: merge to PMem with 0% and 100% duplicates, merge
to DRAM with 0% and 100% duplicates. x-axis: node size (KiBytes);
y-axis: latency (µs). Series: 2-way merge and K-way merge, each with
No FA and PMDK TX.
Figure 11: Erasing an entry from a node (E8). Panels: keypos = first,
middle, last. x-axis: node size (bytes); y-axes: latency (µs) and
bytes modified | written. Series: sorted, unsorted, indirection,
hashing, bitmap.
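The erase variants discussed above differ mainly in how many key slots they rewrite. The following Python sketch illustrates this with plain lists. The function names and the returned write counts are illustrative, and values, hashes, and the slot array of indirection are left out.

```python
def erase_sorted(keys, num_keys, pos):
    """Sorted layout: shift the tail left by one to close the gap.

    Returns (new number of keys, key slots written).
    """
    for i in range(pos, num_keys - 1):
        keys[i] = keys[i + 1]
    return num_keys - 1, num_keys - 1 - pos

def erase_unsorted(keys, num_keys, pos):
    """Unsorted layout: move the last entry into the gap."""
    last = num_keys - 1
    written = 0
    if pos != last:
        keys[pos] = keys[last]
        written = 1
    return last, written

def erase_bitmap(bitmap, pos):
    """Bitmap or hashing layout: reset one bit, no key slot is written."""
    bitmap[pos] = False
    return 0

keys = [10, 20, 30, 40]
print(erase_sorted(list(keys), 4, 0))    # erase at the first position
print(erase_unsorted(list(keys), 4, 0))
print(erase_bitmap([True] * 4, 0))
```

Erasing at the front of a sorted node rewrites every remaining slot, the unsorted variant writes a single slot whose distance from the gap determines locality, and the bitmap variant touches only the front cache line that holds the bitmap.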
Balance Node (E9). Often in trees it is necessary to move entries
from one node to another. For instance, an erase operation can lead
to an underflow and a balance operation. This operation is what we
want to evaluate in this experiment. For this, in the setup, arrays of
full nodes and half-filled nodes (actually: half-1) are prepared. The
balance operation should move a quarter of the entries in the full
node to the half-filled node. There are two possible cases: the
entries are moved to a node with either smaller or larger keys. If
the order is important, the first case requires a shift of the
already existing entries on the donor site to bring them to the
front. In the other case, a shift is necessary on the receiver site
to make room for the new smaller entries. Since the number of writes
is about the same, we do not expect much difference. Again, we tested
various node sizes and report the average latency as well as the
modified bytes. Basically, we expect the sorted variants to perform
better than the unsorted ones. The results are shown in Figure 12.
Unlike our previous experience, the number of written bytes is
not reflected in the performance here. The sorted variants are, as
expected, faster, but the distance to the other techniques is enormous
for larger node sizes. Comparing the sorted and hashing approaches,
the former is nearly four times faster than the latter. This is
because in the unsorted case the next maximum or minimum must
be searched before each move. In addition, the bitmap approaches
must always search and set a free bit on the receiver site. The
hashing approach must also copy the hashes, which can lead to further
written cache lines. Compared to the sorted case, with indirection
the slots have to be additionally shifted and written. However, the
keys and values can simply be appended. Since less is written here,
we would have expected it to perform better than direct sorting. This
is only the case with small node sizes. Nevertheless, with regard to
our design goals, we would still consider indirection as the best
choice here.
Figure 12: Balancing two nodes (E9). Panels: move to left sibling,
move to right sibling. x-axis: node size (bytes); y-axes: latency (µs)
and bytes modified | written. Series: sorted, unsorted, indirection,
hashing, bitmap.
Figure 13: Merging two nodes (E10). x-axis: node size (bytes); y-axes:
latency (µs) and bytes modified | written. Series: numKeys
(un-/sorted), indirection, hashing, bitmap.
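The extra search effort of the unsorted balance can be made visible with a small Python sketch. It is a simplified illustration that moves the largest keys of a donor to a right sibling. The function names are hypothetical, lists stand in for the key arrays, and bitmap handling and hash copying are left out.

```python
def balance_sorted(donor, receiver, count):
    """Sorted layout: the largest keys are the donor's tail; they are
    placed in front of the receiver's (larger) keys."""
    if count <= 0:
        return list(donor), list(receiver)
    return donor[:-count], donor[-count:] + receiver

def balance_unsorted(donor, receiver, count):
    """Unsorted layout: search the next maximum before each move.

    Returns (donor, receiver, number of key comparisons).
    """
    donor, receiver = list(donor), list(receiver)
    comparisons = 0
    for _ in range(count):
        idx = 0
        for i in range(1, len(donor)):
            comparisons += 1
            if donor[i] > donor[idx]:
                idx = i
        receiver.append(donor[idx])   # order in the receiver is irrelevant
        donor[idx] = donor[-1]        # fill the gap with the last entry
        donor.pop()
    return donor, receiver, comparisons

print(balance_sorted([1, 2, 3, 4, 5, 6, 7, 8], [20, 30, 40], 2))
print(balance_unsorted([5, 1, 8, 3, 7, 2, 6, 4], [20, 30, 40], 2))
```

In this model the sorted variant needs no key comparisons at all, whereas the unsorted variant scans the remaining donor entries once per moved entry. This is consistent with the growing gap for larger nodes.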
Merge Nodes (E10). We already discussed the merge of multiple
nodes into another in experiment E7. There is also a less complex
merge operation that occurs as a consequence of an underflow in a
B-Tree. No duplicates are present in this micro-operation. On
top of that, we cannot apply different merge strategies like 2-way or
K-way merge since only two nodes are affected. Instead, the various
node organizations and their corresponding access primitives are
compared once again. In contrast to the previous experiment, we
only consider one direction, the merging into the node with the
smaller keys. This is always the better option, because there is no
need to shift entries or slots. As a result, the sorted and unsorted
approaches proceed in exactly the same way. They simply append all
keys and values and finally update the number of keys. Hence, they are
summarized as numKeys. The deallocation of the donor node is not
part of the measurement, but it would be the same overhead for all
approaches. Once more, we report the latency and modified/written
bytes. We expect that the numKeys approach performs better than
the approaches using extra search structures since these do not
include a mechanism yielding any performance gain here. In Figure 13
the results can be inspected.
Again the disadvantage of the bitmap shows up. It requires the
verification of each bit on the donor site and the search for a free
bit on the receiver site before moving an entry. For indirection the
first check is not required; however, additional reads are necessary
due to the indirect access and the updates of the slot array. This
results in bitmap and indirection having almost the same performance.
Again the hashing approach has to additionally copy the hashes
and check the bit on the donor site. This leads to more written
cache lines with larger node sizes. Comparing the numKeys and
hashing approaches, we can again see a performance difference of a
factor of four. In a separate measurement the pure deallocation
constantly took around 1.6 µs independent of the size. In total, the
sorted and plain unsorted variants work most efficiently when merging
two nodes.
5.5 Performance Profiles
Using the results from our experiments, we created performance
profiles summarizing the performance and write reduction of the
main primitives identified in this paper (see Figure 14). They are
based on the node layouts (sorted, unsorted, indirection, hashing, and
bitmap) and their corresponding access primitives, from which we
considered the best performing alternatives.
For sorted nodes, the main drawbacks are raw entry inserts and
key deletions. Also, the search performance is worse than that of the
other variants in uncached cases. However, it performs well in
iterations and structural adjustments. Although it is the smallest in
size, it takes the most writes and flushes when modifying nodes.
The unsorted layout significantly improves the most typical
operations like search, insert, and erase, particularly in terms of
memory efficiency and write reduction. If many scans are used, the
sorted and unsorted variants are best suited. Only balancing and
splitting cost considerably more as the order is beneficial for these
micro-operations. So if the tree grows and shrinks a lot, this can
lead to enormous overhead. A countermeasure could be a larger
node size to avoid too much restructuring.
Another variant to overcome this is to add indirection slots,
which deliver almost the same performance as in the sorted case,
but require fewer write operations. This is paid for with a node size
overhead and worse read operations. Therefore, this variant seems
more suitable for write-dominated workloads.
The hashing approach is a bit worse regarding the restructuring.
However, it offers the best overall package, especially with a small
node size. Similar to the simple unsorted approach, the basic micro-
operations search, insert, and erase of a key or entry are its great
strength. A drawback is the iterate operation due to the bitmap. If
no underflow handling and few scans are required, this primitive is
the best choice.
The bare bitmap approach behaves similarly to the hashing
approach and is only slower for a few operations. Therefore, the
combination of hash and bitmap should usually be chosen. The
sole advantages of the bitmap-only variant are the lower memory
consumption and, to an extent, faster underflow operations.
6 INSIGHTS & CONCLUSION
The results of the experiments gave us some interesting general
insights for designing data structures, choosing corresponding
access primitives, and combining various ideas. In part, the insights
gained were consistent with those in [19].
I1: As it became clear already from looking at the design
space, there are still numerous untested possible primitives and
combinations of them. For instance, using hash probing without a
bitmap but a numKeys field, combining indirection for inner and
hashing for data nodes, examining other algorithms like interpo-
lation or exponential search, etc. Also investigating well-known
techniques like compression or zone maps to reduce writes and
read accesses seems reasonable. Explicitly combining the tested
primitives for B+-Trees, we propose hash probing and bitmap for
data nodes (1 KiB) and a sorted layout for inner nodes residing
in DRAM. If inner nodes also need to be persistent and there are
many writes, they should rely on indirection. This is basically a
combination of the ideas from [4] and [24].
Figure 14: Performance profile of design primitives. One panel per
layout (Sorted, Unsorted, Indirection, Hashing, Bitmap), each rating
Performance and Write Reduction from Worse over Fair to Better along
the axes Layout, Search, Iterate, Insert, Split, Erase, Balance, and
Merge.
I2: A hybrid DRAM/PMem approach is highly recommended
when seeking the best performance and still requiring persistence.
Especially the traversal experiment E2 proved that dereferencing
and pointer chasing have an even greater impact on PMem. In our
DRAM-based and cached tests, we found sorted and indirection
to be the best solutions for inner nodes. If a hybrid approach or
recovery is not an option, it is important to note that hash probing
is not applicable for inner nodes.
I3: PMDK transactions are universal, but not recommended
for performance-critical applications. As was evident in E6 and
E7, the log and snapshotting used in PMDK transactions add a
tremendous overhead compared to individual realizations of failure
atomicity. This means that the classic Copy-on-Write approaches
should be avoided for PMem.
I4: Jumping between non-sequential cache lines is quite ex-
pensive. Although PMem allows byte-addressable random access,
sequential access is still preferable. This was particularly apparent
for indirection and binary search (E1), iterations via bitmap and in-
direction (E3), as well as erasing in unsorted nodes (E8). Especially
the latter showed the importance of locality.
I5: Allocations are expensive in PMem and depend on the
requested size. During the experiments E5 and E10 it became clear
that allocations should be used wisely. To overcome this bottle-
neck, designers of PMem-based data structures should use group
allocations and reuse already allocated nodes instead of frequently
deallocating and allocating.
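A minimal sketch of group allocation combined with node reuse is given below. The class NodePool and the allocate_chunk callback are hypothetical. The callback stands in for whatever persistent allocator is used and is assumed to return several nodes per call.

```python
class NodePool:
    """Hands out nodes from a free list that is refilled in groups."""

    def __init__(self, allocate_chunk, chunk_size=8):
        self._allocate_chunk = allocate_chunk  # hypothetical allocator callback
        self._chunk_size = chunk_size
        self._free = []
        self.allocator_calls = 0  # number of (expensive) allocator invocations

    def acquire(self):
        if not self._free:
            # Group allocation: one allocator call yields chunk_size nodes.
            self._free.extend(self._allocate_chunk(self._chunk_size))
            self.allocator_calls += 1
        return self._free.pop()

    def release(self, node):
        # Reuse: keep the node for later instead of deallocating it.
        self._free.append(node)
```

With a chunk size of four, four consecutive acquisitions trigger a single allocator call, and a released node is handed out again before the allocator is involved a second time.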
I6: The optimal size for nodes located in PMem-based index
structures lies between 256 bytes and 1 KiB. The lower bound of 256
bytes results from the write-combining buffer of the DCPMMs. The
upper bound is the size from which the performance typically dras-
tically degrades. This is partly due to the search structures, which
should not grow beyond a cache line. Small nodes automatically
lead to longer traversal paths; hence we refer back to I2.
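The relation between node size and traversal length can be made concrete with a small calculation. The sketch assumes 8-byte keys and 8-byte values, ignores node headers, assumes completely filled nodes, and uses the same fan-out on every level. These are simplifications for illustration and not the configuration of our experiments.

```python
def entries_per_node(node_bytes, key_bytes=8, value_bytes=8, header_bytes=0):
    """Number of key-value entries that fit into a node of the given size."""
    return (node_bytes - header_bytes) // (key_bytes + value_bytes)

def levels_needed(num_keys, fanout):
    """Smallest number of tree levels whose capacity covers num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels

small = entries_per_node(256)    # 16 entries per 256-byte node
large = entries_per_node(1024)   # 64 entries per 1 KiB node
print(levels_needed(1_000_000, small))  # 5 levels
print(levels_needed(1_000_000, large))  # 4 levels
```

Under these assumptions, indexing one million keys requires one additional level with 256-byte nodes compared to 1 KiB nodes, that is, one more dereference per lookup.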
Ultimately, every design decision depends on the specific application,
and there is no single best solution. This paper is intended to
help determine the optimal PMem-specific design parameters and
(combinations of) primitives for a given use case. For instance, for
a write-intensive workload with many structural adjustments, indi-
rection is the primitive of choice. If there are many point queries,
hashing is best suited. To accelerate deletes, a bitmap should be
added. If mainly iteration through the data nodes is required, no
auxiliary structure should be used. Table 2 and our investigations
still offer much potential for extension. For future work, further
data structures, primitives, and access patterns shall be studied. The
final result should be a far broader derived performance profile per
design primitive as sketched in Figure 14.
ACKNOWLEDGMENTS
This work was funded by the German Research Foundation (DFG)
within the SPP2037 under grant no. SA 782/28.
REFERENCES
[1] Joy Arulraj, Justin J. Levandoski, Umar Farooq Minhas, and Per-Åke Larson. 2018.
BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory.
PVLDB 11, 5 (2018), 553–565. http://www.vldb.org/pvldb/vol11/p553-arulraj.pdf
[2] Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos
Idreos, Anastasia Ailamaki, and Mark Callaghan. 2016. Designing Access Meth-
ods: The RUM Conjecture. In EDBT. OpenProceedings.org, 461–466.
[3] Jens Axboe. 2019. Flexible I/O Tester. https://github.com/axboe/fio. Accessed:
December 27, 2019.
[4] Shimin Chen and Qin Jin. 2015. Persistent B+-Trees in Non-Volatile MainMemory.
PVLDB 8, 7 (2015), 786–797. https://doi.org/10.14778/2752939.2752947
[5] Biplob Debnath, Alireza Haghdoost, Asim Kadav, Mohammed G. Khatib, and Cris-
tian Ungureanu. 2015. Revisiting Hash Table Design for Phase Change Memory.
In Proceedings of the 3rd Workshop on Interactions of NVM/FLASH with Operating
Systems and Workloads, INFLOW 2015, Monterey, California, USA, October 4, 2015.
1:1–1:9. https://doi.org/10.1145/2819001.2819002
[6] Facebook Open Source. 2019. RocksDB: A persistent key-value store. https:
//rocksdb.org. Accessed: December 27, 2019.
[7] FUJITSU Technology Solutions GmbH. 2019. libart. https://github.com/pmem/
pmdk/tree/master/src/examples/libpmemobj/libart. Accessed: December 27,
2019.
[8] Philipp Götze, Stephan Baumann, and Kai-Uwe Sattler. 2018. An NVM-Aware
Storage Layout for Analytical Workloads. In 34th IEEE International Conference
on Data Engineering Workshops, ICDE Workshops 2018, Paris, France, April 16-20,
2018. 110–115. https://doi.org/10.1109/ICDEW.2018.00025
[9] Philipp Götze, Alexander van Renen, Lucas Lersch, Viktor Leis, and Ismail Oukid.
2018. Data Management on Non-Volatile Memory: A Perspective. Datenbank-
Spektrum 18, 3 (2018), 171–182. https://doi.org/10.1007/s13222-018-0301-1
[10] Weiwei Hu, Guoliang Li, Jiacai Ni, Dalie Sun, and Kian-Lee Tan. 2014. Bp -Tree :
A Predictive B+-Tree for Reducing Writes on Phase Change Memory. IEEE Trans.
Knowl. Data Eng. 26, 10 (2014), 2368–2381. https://doi.org/10.1109/TKDE.2014.5
[11] Stratos Idreos, Kostas Zoumpatianos, Manos Athanassoulis, Niv Dayan, Brian
Hentschel, Michael S. Kester, Demi Guo, Lukas M. Maas, Wilson Qin, Abdul
Wasay, and Yiyou Sun. 2018. The Periodic Table of Data Structures. IEEE Data
Eng. Bull. 41, 3 (2018), 64–75. http://sites.computer.org/debull/A18sept/p64.pdf
[12] Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S. Kester, and
Demi Guo. 2018. The Data Calculator: Data Structure Design and Cost Synthesis
from First Principles and Learned Cost Models. In Proceedings of the 2018 Inter-
national Conference on Management of Data, SIGMOD Conference 2018, Houston,
TX, USA, June 10-15, 2018. 535–550. https://doi.org/10.1145/3183713.3199671
[13] Intel Corporation. 2019. IntelÂő Memory Latency Checker v3.7. https://software.
intel.com/en-us/articles/intelr-memory-latency-checker. Accessed: December
27, 2019.
[14] Intel Corporation. 2019. Persistent Memory Development Kit. http://pmem.io/
pmdk. Accessed: December 27, 2019.
[15] Sudarsun Kannan, Nitish Bhat, Ada Gavrilovska, Andrea C. Arpaci-Dusseau, and
Remzi H. Arpaci-Dusseau. 2018. Redesigning LSMs for Nonvolatile Memory with
NoveLSM. In 2018 USENIX Annual Technical Conference, USENIX ATC 2018, Boston,
MA, USA, July 11-13, 2018. 993–1005. https://www.usenix.org/conference/atc18/
presentation/kannan
Data Structure Primitives on Persistent Memory: An Evaluation , ,
[16] Wook-Hee Kim, Jihye Seo, Jinwoong Kim, and Beomseok Nam. 2018. clfB-tree:
Cacheline Friendly Persistent B-tree for NVRAM. TOS 14, 1 (2018), 5:1–5:17.
https://doi.org/10.1145/3129263
[17] Benjamin C. Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin Ipek, Onur
Mutlu, and Doug Burger. 2010. Phase-Change Technology and the Future of
Main Memory. IEEE Micro 30, 1 (2010), 143. https://doi.org/10.1109/MM.2010.24
[18] Se Kwon Lee, K. Hyun Lim, Hyunsub Song, Beomseok Nam, and Sam H. Noh.
2017. WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems.
In 15th USENIX Conference on File and Storage Technologies, FAST 2017, Santa
Clara, CA, USA, February 27 - March 2, 2017. 257–270. https://www.usenix.org/
conference/fast17/technical-sessions/presentation/lee-se-kwon
[19] Lucas Lersch, Xiangpeng Hao, Ismail Oukid, Tianzheng Wang, and Thomas
Willhalm. 2019. Evaluating Persistent Memory Range Indexes. PVLDB 13, 4
(2019), 574–587. https://doi.org/10.14778/3372716.3372728
[20] Lucas Lersch, Ismail Oukid,Wolfgang Lehner, and Ivan Schreter. 2017. An analysis
of LSM caching in NVRAM. In Proceedings of the 13th International Workshop
on Data Management on New Hardware, DaMoN 2017, Chicago, IL, USA, May 15,
2017. 9:1–9:5. https://doi.org/10.1145/3076113.3076123
[21] Jianhong Li, Andrew Pavlo, and Siying Dong. 2017. NVMRocks: RocksDB on Non-
Volatile Memory Systems. http://istc-bigdata.org/index.php/nvmrocks-rocksdb-
on-non-volatile-memory-systems/. Accessed December 27, 2019.
[22] Micron Technology, Inc. 2019. 3D XPoint Technology. https://www.micron.com/
products/advanced-solutions/3d-xpoint-technology. Accessed December 27,
2019.
[23] Sparsh Mittal and Jeffrey S. Vetter. 2016. A Survey of Software Techniques for
Using Non-Volatile Memories for Storage and Main Memory Systems. IEEE Trans.
Parallel Distrib. Syst. 27, 5 (2016), 1537–1550. https://doi.org/10.1109/TPDS.2015.
2442980
[24] Ismail Oukid, Johan Lasperas, Anisoara Nica, Thomas Willhalm, and Wolfgang
Lehner. 2016. FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree
for Storage Class Memory. In Proceedings of the 2016 International Conference on
Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 -
July 01, 2016. 371–386. https://doi.org/10.1145/2882903.2915251
[25] Dulloor Subramanya Rao, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz,
Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. 2014. System software for
persistent memory. In Ninth Eurosys Conference 2014, EuroSys 2014, Amsterdam,
The Netherlands, April 13-16, 2014. 15:1–15:15. https://doi.org/10.1145/2592798.
2592814
[26] David Schwalb, Markus Dreseler, Matthias Uflacker, and Hasso Plattner. 2015.
NVC-Hashmap: A Persistent and Concurrent Hashmap For Non-Volatile Memo-
ries. In Proceedings of the 3rd VLDB Workshop on In-Memory Data Mangement
and Analytics, IMDM@VLDB 2015, Kohala Coast, HI, USA, August 31, 2015. 4:1–4:8.
https://doi.org/10.1145/2803140.2803144
[27] The Apache Software Foundation. 2016. Apache Cassandra. https://cassandra.
apache.org. Accessed: December 27, 2019.
[28] Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann, and Alfons
Kemper. 2019. Persistent Memory I/O Primitives. In Proceedings of the 15th
International Workshop on Data Management on New Hardware, DaMoN 2019,
Amsterdam, The Netherlands, 1 July 2019. 12:1–12:7. https://doi.org/10.1145/
3329785.3329930
[29] Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and Roy H.
Campbell. 2011. Consistent and Durable Data Structures for Non-Volatile Byte-
Addressable Memory. In 9th USENIX Conference on File and Storage Technologies,
San Jose, CA, USA, February 15-17, 2011. 61–75. http://www.usenix.org/events/
fast11/tech/techAbstracts.html#Venkataraman
[30] Tianzheng Wang, Justin J. Levandoski, and Per-Åke Larson. 2018. Easy Lock-
Free Indexing in Non-Volatile Memory. In 34th IEEE International Conference on
Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018. 461–472. https:
//doi.org/10.1109/ICDE.2018.00049
[31] Fei Xia, Dejun Jiang, Jin Xiong, and Ninghui Sun. 2017. HiKV: AHybrid Index Key-
Value Store for DRAM-NVM Memory Systems. In 2017 USENIX Annual Technical
Conference, USENIX ATC 2017, Santa Clara, CA, USA, July 12-14, 2017. 349–362.
https://www.usenix.org/conference/atc17/technical-sessions/presentation/xia
[32] Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong,
and Bingsheng He. 2015. NV-Tree: Reducing Consistency Cost for NVM-based
Single Level Systems. In Proceedings of the 13th USENIX Conference on File and
Storage Technologies, FAST 2015, Santa Clara, CA, USA, February 16-19, 2015. 167–
181. https://www.usenix.org/conference/fast15/technical-sessions/presentation/
yang
[33] Pengfei Zuo, Yu Hua, and Jie Wu. 2018. Write-Optimized and High-Performance
Hashing Index Scheme for Persistent Memory. In 13th USENIX Symposium on Op-
erating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October
8-10, 2018. 461–476. https://www.usenix.org/conference/osdi18/presentation/zuo
