Persistent Memory (PM), as already available e.g. with Intel Optane DC Persistent Memory, represents a very promising next-generation memory solution with a significant impact on database architectures. Several data structures for this new technology and its properties have already been proposed. However, mostly complete structures were presented and evaluated, which hides the impact of the individual ideas and PM characteristics. Therefore, in this paper, we disassemble the structures presented so far, identify their underlying design primitives, and assign them to appropriate design goals regarding PM. As a result of our comprehensive experiments on real PM hardware, we were able to reveal the trade-offs of the primitives at the micro level. From this, performance profiles could be derived for selected primitives. With these, it is possible to precisely identify their best use cases as well as vulnerabilities. Besides our general insights regarding PM-based data structure design, we also discovered new promising combinations not considered in the literature so far.
INTRODUCTION
Data structures play a crucial role in all data management systems. Numerous structures have been designed over the past decades for very different purposes and each design is always a compromise among the three performance trade-offs read, write, and memory amplification [2] . Furthermore, advances in hardware technology with changing characteristics make designing data structures an ever-lasting challenge.
Persistent Memory (PM), also known as non-volatile memory (NVM) or storage-class memory (SCM), is one of the most promising trends in hardware development which might have a huge impact on database system architectures in general, but also particularly on data structures. Characteristics such as byte-addressability, read latency close to DRAM but with a read-write asymmetry, and the inherent persistence open up new opportunities but also require new designs, e.g., to mitigate the read-write asymmetry or to guarantee consistent updates.
Over the last few years, several data structures for PM have been proposed trying to address these specifics. However, the lack of widely available hardware platforms, different benchmarks, and complex designs addressing different aspects make it difficult to compare these approaches and, more importantly, identify the most promising PM-specific primitives.
In [11], Idreos et al. have presented their idea of a periodic table of data structures with the goal of being able to reason about the design space of data structures. The work provides a great foundation for a systematic study of data structure designs. In this paper, we try to support this approach by identifying core primitives of tree-based data structures and evaluating different designs of these primitives on persistent memory. Lersch et al. have already extensively evaluated existing B+-Tree designs for PM on real hardware [19]. However, this was again done on the macro level, hiding the impact of the separate underlying ideas. Instead of a black-box (or end-to-end) approach, we focus on read and write primitives (including structural changes) and analyze their behavior in terms of three PM-critical design goals: reducing writes, fine-grained access, and consistent and durable operations. Furthermore, in this work, we generalize the approaches found to various types of tree-like structures such as B+-Trees, Skip-Lists, Tries, and LSM-Trees. To the best of our knowledge, this is the first evaluation considering several different data structure types and designs on the micro level with real PM hardware. The goal of our work is to gain deep insights into PM-optimized design patterns for data structures in databases.
PERSISTENT MEMORY PROPERTIES
There are several variants of PM that use different physical mechanisms to achieve persistency. PCM [17] is probably one of the best known technologies among them. Intel® has recently commercialized Optane™ DC Persistent Memory Modules (DCPMMs) based on the 3D XPoint™ technology [22], which seems to behave similarly to PCM. What makes PM special is the byte-addressability and direct persistence at close to DRAM speed. On modern CPU architectures, byte-addressability actually corresponds to cache-line granularity (typically 64 Bytes). Further interesting features are a higher density and better economic characteristics than DRAM (both in monetary and energy terms) as well as direct load and store semantics. Another important fact is that the Optane DC devices internally work with cache-lines, but a write-combining buffer aggregates writes to 256 Byte blocks (cf. [28]). This is mainly to avoid write amplification.
In our experiments, we could not identify a performance difference when switching from 64 Byte to 256 Byte aligned data structures. Therefore, we assume that it is sufficient if the data nodes are at least 256 Byte in size but are aligned only to cache-lines. Another benefit of this buffer is that writes can be faster than reads at low load on the device. However, that is also why it is hard to measure the real write latency. Table 1 summarizes some of the characteristics and compares them with those of DRAM and SLC NAND flash. We remeasured the latencies on our system (see Section 5) using Intel's Memory Latency Checker [13] and Flexible I/O tester [3]. Since we focus on single-threaded experiments in this paper, total bandwidth numbers are not relevant for us here. Similar to flash, PM exhibits a read-write asymmetry and much lower write endurance than DRAM. However, we could not find any actual endurance data of Optane DCPMMs.
For the design of new data structures, these properties mean that writes should be used as sparingly as possible and that more compute power should be utilized instead. In general, we crystallize three design goals. Wear-leveling will be applied at the controller level. Hence, the write endurance and read-write asymmetry can be summarized in a first design goal (DG1): reducing writes. Due to the byte-addressability, much more fine-grained accesses are possible, which should be utilized (DG2). The direct load and store semantics further enable zero-copy memory mapping and, thus, new opportunities to ensure consistency and durability, for example, by atomic primitives (DG3).
The Intel Optane DCPMMs provide two possible operating modes: Memory and App Direct mode. The Memory mode allows applications to use the DCPMMs as an extension to volatile memory, where DRAM acts like a kind of L4 cache. For that, no rewrite of in-memory software is necessary. However, to fully utilize PM and its persistence, the App Direct mode must be used. In this mode, developers have to take care of persistence, failure-atomicity, performance, and so on themselves. In the remainder of this paper, we exclusively use the latter mode.
To access PM devices, we used the de facto standard Persistent Memory Development Kit (PMDK) [14] to get uniform and comparable implementations. It provides different levels of granularity to manage PM including allocations, transactions, object management, etc.
RELATED WORK
Data Structures for PM. Due to the new characteristics described in Section 2, new and more fine-grained techniques become possible when designing persistent data structures to fully utilize PM. An overview of these approaches is given in [9]. There have already been a couple of publications addressing particularly the byte-addressability and write endurance. In one of the first approaches, Venkataraman et al. [29] propose a single-level storage hierarchy and general ideas for consistent and durable data structures. They mainly focus on the B+-Tree and use versioning, atomics, and shadowing features to guarantee atomicity. This is also addressed by Chen et al. [4], who exploit indirection and propose to keep nodes unsorted in order to save writes. The result is the wB+-Tree, which should provide a significant increase in performance, especially for inserts and deletes. In addition, they compare the approaches and their effects when adding certain features such as bitmaps. Yang et al. [32] propose selective consistency, i.e., enforcing consistency of leaf nodes and relaxing it for inner nodes. Here, too, leaf nodes are kept unsorted and new keys are just appended. To get the latest version, the node is scanned in reverse. In [24], Oukid et al. present a hybrid solution where leaf nodes remain in the persistent layer but inner nodes are placed in DRAM. This allows a much faster traversal of the upper levels, but requires recovery actions to rebuild them in case of failure. Another crucial part is the usage of fingerprints in the leaf nodes to reduce the number of keys probed when searching for a key. HiKV [31] takes a similar path and places the B+-Tree in DRAM and keeps only hash partitions persistent. This avoids costly structure reorganizations on PM. The BP-Tree presented in [10] buffers changes in DRAM and merges them into PM as soon as the buffer is full. The authors collect information to predict future accesses in order to pre-allocate nodes and reduce writes caused by splits or merges.
In [16], the authors propose cache-line sized nodes combined with differential encoding to reduce the number of cache-line flushes. The BzTree [1] is a high-performance latch-free B-Tree using a persistent multi-word compare-and-swap (PMwCAS [30]) operation to provide failure-atomicity. In [19], some of these trees are already evaluated on real PM hardware. But here again, the complete trees were compared instead of the individual primitives. As a result, the wB+-Tree, for instance, always performs poorly, since its inner nodes are persistent, leading to a costly traversal. Furthermore, their evaluation is limited to B+-Trees.
Recalling the properties of PM, write-optimized data structures such as the LSM-Tree are a promising option. Many modern key-value stores such as RocksDB [6] or Cassandra [27] are based on this concept. There are already first approaches to adapt this concept for PM [15, 20, 21]. Furthermore, prefix trees (tries) like ART were already migrated to PM [7] as well as some write-optimized versions of it [18]. There are also first publications in the field of hash tables [5, 26, 33]. So far, only [8] has considered the analytical direction, which is based on clustering and unsorted blocks. The special feature is the three-level architecture and the ability to efficiently query any attribute besides the key.
The approaches mentioned above have so far been mainly evaluated for operator or end-to-end performance. However, this hides important details and trade-offs of the underlying design primitives on which we focus in this paper.
Evaluating Data Structure Designs. Several systems analyze access patterns and the hardware profile in order to pick appropriate data structures and implementations as well as hardware placement. Similar to us, some of these also subdivide the data structures into primitives. The Data Calculator [12] and the periodic table of data structures [11] , for example, discuss a novel approach which interprets data structures as an assembly of first principles. The authors combine analytical models, benchmarks, and machine learning to gain insights into the impact of these fundamental primitives. Their engine takes a high-level specification of a data structure assembled from the primitives and predicts the performance for a given workload and hardware profile. This process is designed interactively where the user can exchange features and directly observe the effects. The so-called operation and cost synthesizer learns a basic set of cost models for different access patterns and synthesizes the cost for more complex operations. These cost models are trained by micro-benchmarks and thus strongly dependent on these. It is indicated that the benchmarks must be entered manually by the user when new patterns or hardware is added. This is where the micro-benchmarks presented in this paper can very well tie in.
DESIGN SPACE
In this section, we start by classifying typical data structures found in DBMSs and give an insight into their huge design space. Subsequently, we extract the design primitives with focus on tree-based structures from the literature and connect them with our defined design goals (see Section 2).
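[Figure 1: Typical data and index structures used within a DBMS, grouped into flat and branched structures; legible labels include Sorted Array, Skip-List, and Hash.]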
Glimpse into the Design Space
In Figure 1, the typical data and index structures used within a DBMS are summarized. Each of these data structures is more suitable for certain scenarios and access patterns than others. As described in [11], there is a general trade-off between read, write, and space optimized designs. Accordingly, their performance depends on both the workload running on them and the underlying hardware. Furthermore, these basic structures can also be extended by features or combined with other structures in order to meet the given requirements. Adding features sometimes also makes new access primitives applicable to them. Looking at the illustration and the related work above, it becomes clear that the design space is huge and there are still thousands of variants that have not been studied yet, in particular with regard to PM. Due to this complexity, the focus of this paper is on tree-like structures for the time being, although we will gradually look at more data structures. The question we want to answer is what impact certain design primitives have in which scenarios in the presence of PM. The long-term goal is a benchmark revealing these trade-offs to facilitate design decisions. This must be done in the form of white-box testing in order to avoid side effects in the measurements. As a result, a profile per design primitive would be conceivable, from which performance and memory impacts can be derived for each type of operation.
Design Primitives
Similar to [11], we define a design primitive in this context as an indivisible layout or access concept. In order to achieve the goal mentioned above, it is necessary to break down the possible primitives taking into account the properties of PM. For that, we study the approaches described in Section 3 and assign the ideas to the corresponding design goals. Furthermore, we consider existing micro-operations for trees and put these in relation to the derived primitives. In this context, a micro-operation describes a low-level access pattern independent of the chosen primitive(s). The typical macro-operations such as get, insert, update, delete, and scan can be implemented by combining such micro-operations. Therefore, we classify them into read-, insert-, and erase-based as well as recovery operations. Table 2 shows our results. For design primitives that are not applicable or relevant for certain operations, we left the cells empty.
Micro-Operations. For the read block, the first micro-operation is a search for a specified key within a node. To reach the target node, there are usually two types of tree traversal, namely vertical (tree traverse from the top) and horizontal (tree iterate from the lowest left). The macro-operations get and scan can be built by combining lookups and traversals. Next, there are insert-based operations like placing data (e.g., key-value pairs) into a node. This can lead to split operations, which require the allocation of new nodes. An insert or update macro-operation would need the micro-operations lookup, traversal, insert, and split. For hybrid structures (such as the BP- or LSM-Trees), a common operation is also the movement or migration from DRAM to PM. Furthermore, a compaction or merge of multiple nodes (in a level) into a larger node can be necessary. Erasing an entry from a node is another micro-operation, which downsizes the tree. This may cause an underflow, which can be resolved by balancing or merging with another node. The typical delete macro-operation consists of lookup, traversal, erase, balance, and merge. The last class is recovery. We have not considered it in our experiments since it mainly consists of the operations of the read block combined with the recreation of volatile DRAM structures. Our focus here, however, is on operations on PM.
Primitives. For the first design goal, reducing writes, the general node organization (which applies to a large variety of tree-like structures) was reconsidered, and the main consensus was to leave data nodes unsorted. In order to still keep the access fast, indirection, hashing, and bitmaps as well as combinations of these were used. In Figure 2, we compare the data node layouts of these ideas as we have reimplemented them for the evaluation (cf. Section 5). All of them place the search structure at the beginning of the node and align the keys to cache-lines to allow a fair comparison. For the sorted and unsorted case, the metadata always costs only one cache-line, whereas the other layouts can cover multiple cache-lines depending on the number of elements. To further save writes, the node placement was adapted, leading to selective consistency or persistence by placing inner nodes in DRAM. Depending on the node organization, there are different access primitives. For example, one could do a simple linear search over the keys in all cases. If the entries are sorted, binary search is enabled. Both algorithms can also be modified with more fine-granular access using the cache-line sized auxiliary structures. Additionally, other algorithms such as interpolation or exponential search are conceivable. When splitting a node, as is typical in B-Trees and Skip-Lists, we have found roughly two approaches. The basic algorithm finds the split key and then moves all greater keys to the new node. This can also be done by creating two new nodes. An alternative when using a bitmap is to copy the full node, reset the greater keys in the bitmap, and finally store the inverted bitmap in the new node. The first variant will trigger fewer writes, but the second variant could be faster by exploiting the fine-grained access. Figure 3a illustrates the design primitives considered for the Move Node operation. The initial data is stored in a DRAM buffer and later moved to a free persistent array.
We consider a scenario where the data in the persistent arrays ⟨R0...R3⟩ is always sorted, whereas the DRAM data can be sorted or unsorted. If it is sorted, (1) the DRAM data is just copied to R0 when its capacity is reached; otherwise, (2) it must be sorted before the copy operation. In the former case, the DRAM data structure could be an ordered map. Consequently, insertions into DRAM would be costlier and the penalty would be higher for a bigger DRAM node size. Alternatively, an unordered hash map could be used, whose insertion costs are comparatively less dependent on the DRAM node size. However, a sort operation is then needed during the data movement from DRAM to a persistent array. The data is in the form of key-value pairs (only keys are shown in the figures). A typical use case could be a PM-aware LSM tree, where the DRAM buffer is called C0, R0-R3 are called runs, and an array of such runs constitutes a level called L0. Figure 3b and Figure 3c show the design primitives used for Merge Level. While the design space for merge algorithms is large, we use two common approaches: 2-way and K-way merge. When all the runs/nodes in a level are filled, the sorted data in these runs is merged and written to a free persistent run at the next higher level. The final result of a 2-way or K-way merge could be written using the following two approaches: (1) perform the merge directly onto a persistent run, or (2) merge into a DRAM buffer and copy the final result to a persistent run. The former approach could result in a performance benefit when the inserted keys are unique, whereas the latter approach could be a better choice if there are many updates (i.e., duplicate keys).
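The Move Node and Merge Level primitives described above can be illustrated with a minimal Python sketch. It is not the evaluated implementation: plain lists stand in for the DRAM buffer and the persistent runs, the function names `move_node` and `k_way_merge` are hypothetical, and we assume that each run holds unique keys and that later runs are newer, so that the newest value wins for duplicate keys.

```python
import heapq


def move_node(buffer_items, already_sorted):
    """Return the (key, value) pairs to copy into a free, always-sorted run.

    buffer_items: pairs held in the DRAM buffer (assumed stand-in for C0).
    already_sorted: True models an ordered map, False an unordered hash map.
    """
    if already_sorted:
        # Case (1): the buffer is ordered, so a plain copy suffices.
        return list(buffer_items)
    # Case (2): the buffer is unordered, so sort once at move time.
    return sorted(buffer_items, key=lambda kv: kv[0])


def k_way_merge(runs):
    """Merge K sorted runs into one sorted run, keeping the newest duplicate.

    runs[0] is assumed to be the oldest run and runs[-1] the newest.
    """
    # Tag entries with the negated run index so that, for equal keys,
    # the entry from the newest run is produced first.
    tagged = (
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(runs)
    )
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            continue  # older version of a key that was already emitted
        merged.append((key, value))
    return merged
```

A 2-way merge is the special case of passing exactly two runs. In the "merge to DRAM, then copy" variant, the returned list would play the role of the DRAM buffer whose final content is copied to the persistent run in one step.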
Many database applications use transactions to ensure failure atomicity (FA). Hence, in such cases, either the entire persistent run or a portion of it must be added to a transaction before writing any data to it. A simple solution is to use the transactional libraries from PMDK. However, PMDK transactions induce a noticeable performance penalty. Figure 4 shows a different way of realizing FA for Move Node and Merge Level. Once again, we consider a scenario where ⟨R0...R3⟩ stores sorted data. A Run_Ptr[level][run] could be used to keep track of runs that contain valid data and to point to the next free array, as shown in Figure 4. After pushing the C0 data to ⟨L0 : R0⟩, it is incremented from position (1) to position (2), indicating that R0 contains valid data and R1 is free. If a failure occurs during a write to R1, an undo operation on R1 or a portion of it is unnecessary, since the Run_Ptr remains at position (2), indicating that R1 is still free (i.e., invalidating any partial data written to R1). Data consistency remains unchanged irrespective of whether the entire run (R1) or just the Run_Ptr is added to the PMDK transaction. Therefore, it is sufficient to add only the pointer to the PMDK transaction. As a consequence, the performance penalty of PMDK transactions could be greatly reduced. We term this Individual failure atomicity (Individual FA). However, the data must be flushed before the pointer with the help of fences. Furthermore, on x86 architectures, 8 Byte aligned writes are power-fail atomic. Hence, by limiting the size of the persistent variable Run_Ptr to 64 bits and placing it at an 8 Byte boundary in PM, we can even avoid adding Run_Ptr to the transaction. We term this No FA. Consequently, this limits the number of persistent levels in the tree or the number of persistent runs in a level. However, this limitation does not hinder our micro-operation benchmarks.
In our experiments, we use the PM-aware LSM tree as a use case scenario and evaluate the above two design optimizations against the PMDK transactions, where a data structure is automatically added into a transaction by default, before any write operation. Regarding Table 2 , these mechanisms basically fall under DG3, whereby Individual FA and No FA also fit into DG1.
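The ordering requirement behind the Run_Ptr idea (first persist the run data, then publish it by advancing the pointer) can be sketched with a toy model. The class `SimulatedPM` and the function `push_run` are hypothetical and exist only for illustration: stores become durable only on `persist()`, which stands in for a flush followed by a fence, and a simulated crash tears run data that has not been persisted while leaving the pointer untouched.

```python
class SimulatedPM:
    """Toy persistence model, not a real PM interface."""

    def __init__(self, num_runs):
        self.durable = {"run_ptr": 0, "runs": [[] for _ in range(num_runs)]}
        self.pending = []  # stores issued but not yet flushed and fenced

    def store_run(self, idx, data):
        self.pending.append(("run", idx, list(data)))

    def store_run_ptr(self, value):
        # Assumed to be a single aligned 8 Byte store, hence all-or-nothing.
        self.pending.append(("ptr", None, value))

    def persist(self):
        """Stand-in for flushing the written cache-lines and fencing."""
        for kind, idx, value in self.pending:
            if kind == "run":
                self.durable["runs"][idx] = value
            else:
                self.durable["run_ptr"] = value
        self.pending.clear()

    def crash(self):
        """Power failure: unflushed run data is torn, the pointer is not updated."""
        for kind, idx, value in self.pending:
            if kind == "run":
                self.durable["runs"][idx] = value[: len(value) // 2]
        self.pending.clear()


def push_run(pm, data):
    """Write a run, persist it, and only then advance Run_Ptr."""
    idx = pm.durable["run_ptr"]
    pm.store_run(idx, data)
    pm.persist()              # data must be durable before it is published
    pm.store_run_ptr(idx + 1)
    pm.persist()


def valid_runs(pm):
    """Only runs below Run_Ptr count as valid after a restart."""
    return pm.durable["runs"][: pm.durable["run_ptr"]]
```

In this model, a crash between `store_run` and the pointer update leaves partial data in the next free run, but `valid_runs` never exposes it, which mirrors why no undo of the run itself is needed.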
Extendability. The table shown is only an excerpt and can be extended by more primitives and micro-operations. Further aspects would be, for example, hardware utilization, concurrency, and more in-depth failure atomicity. For the latter, we have focused on PMDK transactions and individual persist operations for the time being. Regarding hardware utilization, we already applied cache-line alignment for all nodes and auxiliary structures in our experiments.
Metrics. The task now is to work through and fill this table with the help of micro-benchmarks to determine the trade-offs. In the following section, we have already pursued this task to a certain extent. Before starting the experiments and filling the table, we must first define relevant metrics. Typically performance is measured as throughput and latency (or execution time). Since we are on the micro level, throughput does not provide usable values at this point. Moreover, hardware specific measures can be studied such as cache misses, flushes, instructions per cycle, or the number of reads and writes. From our point of view, the number of persist operations or written Bytes are crucial factors due to the read-write asymmetry. In addition to the performance indicators, memory consumption is of interest, as PM DIMMs are likely to be less dense than disks.
EXPERIMENTS
In our experiments, we focus on the micro-operations on tree-like data structures as introduced in the previous section. From the primitives described above, we picked the following node organizations for most of the experiments: sorted, unsorted, indirection + bitmap ("indirection"), hash-probing + bitmap ("hashing"), and bitmap only. For the access primitives, binary search with and without indirection as well as linear search with and without hashing and bitmaps are tested. We re-implemented the approaches from the literature focusing on the corresponding primitive(s). More details are given with each experiment. The aim and contribution is to evaluate the design primitives independently of their original context and to compare their strengths and costs. This should reveal a performance profile for each primitive and possibly promising new combinations, which we will sketch at the end of this section.
Experimental Setup
For our experiments, we used a dual-socket Intel Xeon Gold 5215 server as outlined in Table 3 . Each socket comes with 6 DCPMMs, which we grouped into one region and namespace to maximize the possible throughput. The operating mode of the modules is set to AppDirect allowing direct access to the devices. On the PM DIMMs, we created an ext4 file system and mounted it with the dax option to enable direct loads and stores bypassing the OS cache.
To avoid NUMA effects, all experiments are controlled to allocate resources (memory, persistent memory, and cores) only from the same socket. On the software part, we are using PMDK [14] for all the implemented data structures to guarantee failure atomicity. Alternatively, PMwCAS [30] could be used, which provides CAS operations even for structures bigger than 8 Byte. Another method could be to manually place persist, flush, or fence instructions, which is also possible with PMDK. Since the transactions of PMDK had so much overhead hiding the impact of the approaches in our implementations, we decided to report mainly the results for manually persisting the modified data.
Unless stated otherwise, we used fixed-size keys and values in all our experiments, namely 8 Byte integers and 16 Byte tuples (<int, int, double>), respectively. The size of the values, therefore, also corresponds to the size of a persistent pointer (e.g., to the actual payload). Keys, values, and child pointers were stored in separate arrays within the nodes for better locality when iterating through the keys. In addition, all nodes as well as their inner key arrays are aligned to cache-lines. The fill ratio of the trees was always 100% to make optimal use of memory. When the node size is varied, the various implementations often result in a different number of actual elements due to their node layout. To primarily measure PM and not cache performance, we created an array of nodes that is more than double the size of the LLC. In each iteration, we accessed a random position in order to prevent prefetching of other nodes as far as possible. Every data point in our plots is supported by several thousand iterations. Our implementations can be accessed and examined via our public repository.
Read Operations
Node Search (E1). In our first experiment, we study the performance profile for looking up a key's position within a node. This kind of operation is fundamental for nearly every macro-operation including get, update, and delete. We varied both the node size and the position of the requested key and tested various node layouts combined with their corresponding access primitive. The expectation is that the approaches using binary search are faster, except for accesses to front elements. Figure 5 shows the results. We observe that our expectations have not been met in this uncached setup. Only if the key is in the middle do the binary and linear approaches show about the same performance. The indirection approach is always slightly slower than the direct binary search, but costs far fewer writes for inserts and deletes, which we will consider later. The disadvantage of indirect binary search is that it needs to jump back and forth between the search structure and the key array. Of the linear approaches, hashing is usually the best, since all comparisons are first done in the front cache-line(s) and the actual key is checked only if a hash matches. This means that, on average, the fewest cache-lines have to be loaded from PM. However, it should be noted that in cached and in-memory cases binary search is better in both the middle and back access areas. Furthermore, it is notable that the lines for indirection and hashing are sometimes jumpy. This is due to the changing size of the front search structure, which depends on the maximum number of elements in a node. For 1 KB and 2 KB, the search structure consumes 2 cache-lines, and for 4 KB it needs 3 cache-lines. However, in contrast to indirection, hashing usually reads fewer of these cache-lines. Thus, indirection and hashing (and partly also the bitmap) require more memory.
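The hashing lookup discussed above can be sketched as follows. This is a minimal Python illustration, not the evaluated implementation: the class `HashedNode`, the one-byte hash function `fingerprint`, and the list-based layout are assumptions. The point it shows is that the bitmap and the hash array are scanned first, and the key array is touched only when a hash matches.

```python
def fingerprint(key):
    """One-byte hash of an integer key (the concrete function is an arbitrary choice)."""
    return ((key * 0x9E3779B1) >> 24) & 0xFF


class HashedNode:
    """Unsorted node with a bitmap and a one-byte hash per slot."""

    def __init__(self, capacity):
        self.bitmap = 0                  # bit i set means slot i is valid
        self.hashes = [0] * capacity     # front search structure
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def insert(self, key, value):
        for slot in range(len(self.keys)):
            if not (self.bitmap >> slot) & 1:
                self.keys[slot] = key
                self.values[slot] = value
                self.hashes[slot] = fingerprint(key)
                self.bitmap |= 1 << slot  # set last, which makes the entry visible
                return slot
        raise OverflowError("node is full; a split would be needed")

    def search(self, key):
        """Return (slot or None, number of comparisons against the key array)."""
        wanted = fingerprint(key)
        key_probes = 0
        for slot in range(len(self.keys)):
            if (self.bitmap >> slot) & 1 and self.hashes[slot] == wanted:
                key_probes += 1           # only now is the key array accessed
                if self.keys[slot] == key:
                    return slot, key_probes
        return None, key_probes
```

With a one-byte hash, false matches are possible, so the final key comparison is still required; the hash array merely filters most slots before the key array is read.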
Regarding memory, Table 4 shows the actual number of entries that can be stored for a given node size. Without a search structure, as for basic binary and linear search, more key-value pairs can be placed in a node. For indirection and hashing, each entry requires an additional bit for the bitmap and an additional byte for the slot or hash array. For a fairer comparison, we also aligned the approaches without a search structure so that the entry counter and the sibling pointers are placed in the first cache-line. The resulting size adjustment can also be found in the table. Hence, all variants have their actual data cache-aligned, as already mentioned above. It becomes visible that smaller node sizes generally lead to a larger relative overhead in PM consumption. In addition, they also result in a longer traversal path. This, of course, highly depends on the size of the keys and values. Apart from the higher memory footprint, hashing is the best choice for searching a node.
Tree Traversal (E2). Our second experiment focuses on the inner nodes and the costs for traversing from the root to the leaf level, as is typical for B+-Trees. A search within the nodes is not included in order to obtain bare dereferencing and pointer chasing measurements. Instead, a random child position is chosen to prevent prefetching. Therefore, we limit the comparison to the timing of traversing nodes residing in PM and DRAM, respectively. Only the last access is to a persistent leaf node. This reflects the idea of hybrid data structures and placement (see Section 3 and Section 4). Here, we have varied the depth of the tree. The node sizes have hardly made a difference; thus, we report only one size (256 Bytes). Based on our idle latency measurements for DRAM and PM, we would expect an increase of approximately this latency per level. The results are shown in Figure 6.
In fact, this behaves almost as expected. For DRAM, each further level adds roughly 50-100 ns. For PM, however, each level adds 400-500 ns, which is nearly double the latency reported by the MLC benchmark. We assume that this is mainly due to the software overhead (e.g., PMDK) and the random accesses under load. However, we also note that this would nearly fit with the read latency reported in [19, 28]. It becomes visible that all approaches would greatly benefit from a hybrid variant. Placing the inner nodes in DRAM, however, requires recovery actions in the event of a failure. If this is not desired, it is mainly the search algorithm that makes the difference (see E1). If pure performance during normal operation is most important, we found both sorted nodes with binary search and indirection to be good solutions, since these perform best in DRAM. The former is more memory efficient and the latter saves write operations, which, however, is not so crucial on DRAM.
Tree Iterate (E3). For the last experiment in the read block, the horizontal traversal of data nodes, usually also referred to as scan, is examined. This contains not only the chasing of the node pointers (like in E2), but also the iteration over the key and value arrays within them. Since the order is not prescribed, we stick with the term iterate to avoid confusion with range scans. For this experiment, all approaches use the same number of entries based on the variants with a growing search structure. Here, we use different data node sizes and let the tree grow horizontally by increasing the single inner node (the root). Since the order does not matter when iterating, the sorted and unsorted approaches use the same algorithm. The same applies to indirection, hashing, and bitmap, as only the bitmap has to be checked for valid entries. However, since this causes branching in the loop, we expect a weaker performance of the latter class.
For the indirect organization, it is also possible to iterate using the slot array instead of the bitmap, which we also included in the experiment. Figure 7 reports the results for the different data node sizes.
Interestingly, iterating via indirection performs worse than via bitmap. This means that even with indirection slots the bitmap should be used for iterating. As expected, the approaches without a bitmap are always the fastest. Using this as the baseline, in the largest case there is an overhead of 6% for the bitmap and 18% for indirection. However, it should be noted that in this experiment the nodes were filled to 100%. Thus, it is already the best case for the bitmap, since the other approaches also have to check all entries. In the case of indirection and without a bitmap, the loop only iterates over the number of actual keys. We have also performed the test in a cached manner, with the bitmap performing even worse. Especially in the largest case, the bitmap overhead for iterations was quite significant (> 60%). Besides branching, jumping between cache-lines (bitmap/indirection slots, key and value array) also has its influence. Since our scan function only copies out the key in each case, we assume that this is the worst case and that with increasing complexity of the function all methods might approach each other. Nevertheless, for iterating through nodes the sorted and unsorted approaches generally perform best.
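To make the difference between the two iteration classes concrete, the following minimal Python sketch contrasts a counter-based loop (sorted and unsorted nodes) with a bitmap-based loop. The function names, the list-based node layout, and the integer bitmap are illustrative assumptions and do not reflect the implementation used in the experiments.

```python
def iterate_dense(keys, num_keys):
    """Sorted/unsorted node: the first num_keys slots are all valid."""
    return [keys[i] for i in range(num_keys)]


def iterate_bitmap(keys, bitmap):
    """Bitmap node: every slot is visited and a branch decides validity."""
    out = []
    for i in range(len(keys)):
        if (bitmap >> i) & 1:  # data-dependent branch per slot
            out.append(keys[i])
    return out


# Hypothetical 8-slot nodes holding the same six keys.
dense_keys = [7, 3, 9, 5, 1, 8, None, None]
sparse_keys = [7, 3, 9, None, 5, None, 1, 8]
sparse_bitmap = 0b11010111  # bit i set <=> slot i is valid

print(iterate_dense(dense_keys, 6))
print(iterate_bitmap(sparse_keys, sparse_bitmap))
```

The dense loop touches only as many slots as there are keys, whereas the bitmap loop always inspects every slot, which is why a completely filled node is the most favorable case for the bitmap.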
Insert-based Operations
Node Insert (E4). For insert-based operations, we first check the behavior when inserting a key-value pair into a data node. Similar to experiment E1, we vary the node size and the insert position. In addition to the time an operation takes, we report the number of bytes written in each case. In the preparation phase, the key to be inserted is omitted so that exactly the space for it is left. For instance, when inserting at the first position in a node with 10 slots, the keys 2-10 are pre-inserted. The insertion of key 1 is then the measured part. The lookup for the insert position is not part of the measurement. Adding up the timings of traversal, node searches, and node insert would roughly yield the insert macro operation. Our expectation is that the sorted approach will show the poorest results, as keys and values actually have to be moved. This leads to many writes and flushed cache-lines. This is not the case for the other approaches, which only append the new data and adapt the search structure. The results are illustrated in Figure 8. It is apparent that the plain unsorted variant almost always performs best. Here, only the key and value are appended and the size field is updated. The hashing and bitmap approaches additionally have to set the corresponding hash and bit, respectively. Indirection also performs quite well, even in the first case where all slots have to be shifted to the back to keep the indirect ordering. Compared to the other approaches optimized for PM, however, it performs worst. Matching the high number of write accesses, the performance of the sorted approach is significantly worse than the others. Only for small node sizes can it keep up. This is because the number of flushed cache-lines is about the same here. It becomes obvious that keeping the nodes sorted is not suitable for read-write asymmetric PM.
Although indirection also involves many writes, these are on a much finer granularity and multiple slots can be persisted at once. Hence, it shows a similar performance as when using hashing or appending only. Especially the impact of a read-write asymmetry becomes clear by this experiment. Overall the unsorted variants perform about equally well.
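The write amplification of the sorted layout can be illustrated with a small Python model that counts the bytes modified per insert. The entry size of 16 bytes (8-byte key plus 8-byte value) and the 8-byte size field are assumptions made for this sketch only; cache-line flushes and the search structures of the other variants are not modeled.

```python
ENTRY_BYTES = 16      # assumed: 8-byte key + 8-byte value
SIZE_FIELD_BYTES = 8  # assumed size of the entry counter


def sorted_insert(keys, key):
    """Insert into a sorted node and return the number of bytes modified."""
    pos = 0
    while pos < len(keys) and keys[pos] < key:
        pos += 1
    moved = len(keys) - pos  # entries shifted one slot to the right
    keys.insert(pos, key)
    return (moved + 1) * ENTRY_BYTES + SIZE_FIELD_BYTES


def unsorted_insert(keys, key):
    """Append to an unsorted node and return the number of bytes modified."""
    keys.append(key)
    return ENTRY_BYTES + SIZE_FIELD_BYTES


# Setup from the text: keys 2-10 are pre-inserted, key 1 is the measured insert.
sorted_node = list(range(2, 11))
unsorted_node = list(range(2, 11))
print(sorted_insert(sorted_node, 1))      # 9 shifted entries + new entry + counter
print(unsorted_insert(unsorted_node, 1))  # new entry + counter only
```

Under these assumptions, inserting at the first position of a 10-slot node modifies 168 bytes in the sorted layout and 24 bytes in the unsorted one.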
Node Split (E5). As the next experiment, we chose node splits, in particular for data nodes, as these are definitely placed on PM. We picked a similar setup as for the inserts. We applied the two split strategies mentioned in Section 4 to the indirection, hashing, and bitmap approaches. Since bitmap and hashing showed exactly the same performance, and to keep the figure clear, we summarized them as bitmap in the graph. As stated before, the move variant costs fewer writes and thus supports DG1, whereas the copy variant exploits the fine-grained access, supporting DG2. For a node organization without a bitmap, the copy strategy is not useful since the entries of the new node would be written twice. This is because the whole node is copied and then all entries are reordered to the left. Generally, the performance is hard for us to predict, but we expect at least that the sorted approach should be faster than the unsorted one. This is because in an unsorted node all entries have to be checked as to whether they are greater or less than the split key. In a sorted node, everything is simply copied starting from the middle. Figure 9 shows our results with measures for performance and the number of bytes modified.
As can be seen, the various approaches barely show any performance difference in this setup. The copy strategy needs more bytes as it involves writing a whole node. With regard to write endurance and read-write asymmetry, this could be a shortcoming. However, the copy process initiates sequential writing, which seems beneficial for the write-combining buffer on the DCPMMs. When tested on DRAM and in a cached case, we noted that in general a split on a sorted node is the fastest and the unsorted case (with or without bitmap) was always worst. Indirection with the move strategy behaves similarly to the sorted variant, but requires more effort to transfer and set the indirection slots and bitmap per entry. Especially when the search structure grows to multiple cache-lines, this gets more expensive. Hashing and the bitmap only need a bit more time due to the search for the median in an unsorted array (using quickselect). The copy strategy in DRAM works a bit better since everything is transferred once and after that the slots are shifted and the bitmap is inverted. The same applies to hashing, so we can deduce that the copy approach is more effective when a bitmap is present. If the inner nodes are in DRAM, we would recommend either a sorted variant or the copy approach. In this setup, for PM most of the time is spent on allocating a new node (between 80-95%) and thus the approaches show roughly the same performance. The allocation is done by PMDK encapsulated in a transaction and the time depends on the allocated size. As already discussed in [19], PM allocations add a tremendous overhead and should be handled with care. As suggested in [24], a group allocation could reduce this overhead. Apart from this, we see a trade-off between performance and endurance when having unsorted nodes. However, here too the sorted and indirection variants are always better.
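The extra work of splitting an unsorted node can be sketched as follows. This minimal Python example selects the split key with quickselect and then checks every entry against it, whereas the sorted split simply cuts the array in the middle. Unique keys, the list-based layout, and all function names are assumptions for illustration; allocation and persistence are not modeled.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) of values."""
    pool = list(values)
    while True:
        pivot = random.choice(pool)
        lows = [v for v in pool if v < pivot]
        pivots = [v for v in pool if v == pivot]
        if k < len(lows):
            pool = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            pool = [v for v in pool if v > pivot]


def split_unsorted(keys):
    """Every entry must be compared against the split key."""
    split_key = quickselect(keys, len(keys) // 2)
    left = [k for k in keys if k < split_key]
    right = [k for k in keys if k >= split_key]
    return split_key, left, right


def split_sorted(keys):
    """The upper half is copied starting from the middle."""
    mid = len(keys) // 2
    return keys[mid], keys[:mid], keys[mid:]


node = [42, 7, 19, 3, 88, 56, 21, 64]
print(split_unsorted(node))
print(split_sorted(sorted(node)))
```

Both variants produce the same split key and the same key sets per node; only the amount of comparison and data movement differs.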
Move Node (E6). In this experiment, we are interested in profiling the latencies involved in writing to a persistent run when DRAM (i.e., C0) data is moved to the first level L0. The experimental setup involves varying the C0/L0-run node size and switching between the FA strategies: No FA, Individual FA and the default PMDK transactions as explained in Section 4. It is to be noted that at Level L0 all runs have the same capacity as C0 and if the C0 data is unsorted, then the measurements also involve the sorting operation. The runs are composed of persistent arrays where each element is a key-value pair. We conduct the experiment by inserting unique keys (since this operation is independent of inserts or updates). The first goal is to analyze the overall PM write performance for two cases: (1) Sorting the C0 data and moving it to PM, against (2) Maintaining a sorted C0 DRAM data structure. The second goal is to analyze the effect of different FA strategies on varying node sizes of persistent runs. The results are illustrated in Figure 10.
Figure 10: Move data from DRAM to a PM node (E6).
It is apparent that using a sorted C0 data structure is faster than an unsorted hash data structure since, in the unsorted case, each time the C0 capacity is reached, the C0 data must be sorted and moved to a persistent run. On the other hand, maintaining a sorted DRAM data structure in C0 is costlier. For a typical LSM-Tree use case, the DRAM buffer is in the order of a few kilobytes. Hence, a sorted data structure is always a better choice for small DRAM buffers. Regarding the second goal, as depicted in Figure 10, using PMDK transactions for FA has a much higher performance impact compared to No FA and Individual FA (cf. Section 4). On the other hand, No FA and Individual FA have almost the same performance, i.e., adding a single 64-bit persistent variable (Run_Ptr) into a PMDK transaction (plus the individual flushes and fences) has a negligible performance impact. This shows that PMDK transactions should only be used for allocations and deallocations. Performance-critical applications should definitely take care of failure atomicity individually.
Merge Level (E7). In this experiment, we examine the impact of merging sorted PM runs/nodes of a level into a new PM node of the next level. Similar to the previous experiment, the setup involves varying the node size and switching between two different FA strategies (i.e., PMDK transaction and No FA). Additionally, we examine the impact of the two extreme scenarios: unique keys in each run and duplicate keys in all the runs (i.e., 0% and 100% duplicates). Finally, we benchmark the performance by applying the 2-way and K-way merge algorithms as illustrated in Figure 3b and Figure 3c, respectively. We used two merge sub-strategies in our experiments: (1) merge directly to a PM node, and (2) merge to a DRAM buffer and then copy the result to a PM node. These two strategies are applied to both the 2-way and K-way algorithms. The results are shown in Figure 11. An important observation is that enabling PMDK transactions again incurs a great performance penalty in all cases. When merging directly to a persistent node, 2-way is faster since the CPU can cache the intermediate merge results. On the other hand, in K-way, the CPU needs to read the persistent memory more often for key comparisons due to more cache misses. It is interesting to see that K-way merge has a better performance when the keys are duplicated in each run (second sub-figure of Figure 11). On the contrary, the scenario is reversed when PMDK transactions are enabled, i.e., the 2-way merge is faster. The effect of PMDK transactions is distinct here and explained as follows. After a merge operation, the number of elements in the resultant run can vary between the two extreme limits ⟨the number of elements in a single L0 run : the sum of elements in all the L0 runs⟩, and in a general scenario it is not possible to exactly determine the resultant number of elements.
Therefore, in the case of K-way merge, the entire run ⟨L1 : R0⟩ must be added to the PMDK transaction, whereas in 2-way merge it is sufficient to add only (B1 + B2) elements. Hence, 2-way merge has a better performance when the keys are duplicated and the default PMDK transactions are enabled. To improve the performance of K-way merge, one could use intermediate DRAM buffers: B3 and B4, as shown in Figure 3b and Figure 3c, respectively. Once again we see that, when PMDK transactions are enabled, there could be two possible scenarios.
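The two extreme limits for the size of the merged run can be reproduced with a short Python sketch. It implements a K-way merge via the standard library's `heapq.merge` and, as an assumption of this sketch, models the 2-way variant as a cascade of pairwise merges; the exact procedures of Figure 3b and Figure 3c, the buffers B1 to B4, and any PMDK interaction are not modeled. Keys are assumed to be unique within each run.

```python
import heapq


def two_way_merge(a, b):
    """Merge two sorted runs, keeping one copy of keys present in both."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:  # duplicate key: emit once, advance both runs
            out.append(a[i])
            i += 1
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def cascaded_two_way_merge(runs):
    """Assumed 2-way scheme: merge the runs pairwise, one after another."""
    result = runs[0]
    for run in runs[1:]:
        result = two_way_merge(result, run)
    return result


def k_way_merge(runs):
    """Merge all sorted runs at once, dropping duplicate keys."""
    out = []
    for key in heapq.merge(*runs):
        if not out or out[-1] != key:
            out.append(key)
    return out


unique_runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]     # 0% duplicates
duplicate_runs = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]  # 100% duplicates
print(len(k_way_merge(unique_runs)), len(k_way_merge(duplicate_runs)))
```

With unique keys the result holds the sum of all run sizes, and with fully duplicated keys it shrinks to the size of a single run, so the final size is only known once the merge has finished.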
Erase-based Operations
Erase from Node (E8). For erase-based operations, the first experiment is the removal of a single key-value pair from a node. Similar to the lookup (E1) and insert (E4), we measure different node sizes and key positions. Again, we report the latency and modified bytes for each operation. The preparation always creates a full node, where the entry is then deleted at different positions. The lookup for the erase position is again not part of the measurement since it is already represented in E1. Hence, an erase macro operation without underflow would be the result of traversing the tree, searching each node and this experiment. Once again, we expect the sorted approach to show the poorest results as keys and values have to be moved to fill the caused gap. This entails many writes and flushes. For the unsorted organization, not all the entries have to be shifted. In this case, it is enough to move the last entry to the caused gap and decrease the entry counter. The hashing, indirection, and bitmap variants only need to reset one bit. For indirection the slot array has to be additionally shifted. Therefore, these approaches will probably run the fastest, with indirection possibly taking slightly more time.
The actual results are visualized in Figure 12. The advantage of the bitmap is unambiguous. In all cases it performs the best. The hashing approach is directly below the bitmap line, because the algorithm is the same. As expected, indirection is only a little slower. Starting from 2 KB, it jumps a bit higher if the key is in the first or middle position. This is due to the fact that from 2 KB the bitmap and slots need another cache-line to be flushed (cf. Table 4). In the last case nothing has to be shifted, thus it is constant. We would have expected the unsorted variant to be more stable, since always the same number of bytes is changed.
Here, the locality of the deleted and the last position, from where the entry is moved, is quite important. The sorted approach is absolutely not appropriate for erasing a key in PM. It costs far too many writes and flushes, which drastically reduces the performance. Only when the entry is close to the end can this approach keep up. Hence, using a bitmap is the best choice for fast erasures.
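The three erase strategies discussed above can be summarized in a minimal Python sketch. It returns the number of entries that must be rewritten for the sorted and unsorted layouts and the updated bitmap for the bitmap-based layouts. The list-based nodes, the integer bitmap, and the function names are illustrative assumptions; the size counter update and cache-line flushes are omitted.

```python
def erase_sorted(keys, pos):
    """Close the gap by shifting all following entries one slot to the left."""
    moved = len(keys) - pos - 1
    del keys[pos]
    return moved  # number of entries rewritten


def erase_unsorted(keys, pos):
    """Fill the gap with the last entry and shrink the node by one."""
    last = keys.pop()
    if pos < len(keys):
        keys[pos] = last
        return 1  # one entry rewritten
    return 0      # the erased entry was the last one


def erase_bitmap(bitmap, pos):
    """Invalidate the slot by resetting a single bit; no entry is moved."""
    return bitmap & ~(1 << pos)


sorted_node = list(range(1, 11))
unsorted_node = list(range(1, 11))
print(erase_sorted(sorted_node, 0))      # all remaining entries are shifted
print(erase_unsorted(unsorted_node, 0))  # only the last entry is moved
print(bin(erase_bitmap(0b1111111111, 0)))
```

For a full 10-entry node, erasing the first key rewrites nine entries in the sorted layout, one entry in the unsorted layout, and none with a bitmap.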
Balance Node (E9). Often in trees it is necessary to move entries from one node to another. For instance, an erase operation can lead to an underflow and a balance operation. This operation is what we want to evaluate in this experiment. For this, arrays of full nodes and half-filled nodes (actually: half-1) are prepared in the setup. The balance operation should move a quarter of the entries in the full node to the half-filled node. There are two possible cases: the entries are moved to a node with either smaller or larger keys. If the order is important, the first case requires a shift of the already existing entries on the donor site to bring them to the front. In the other case, a shift is necessary on the receiver site to make room for the new smaller entries. Since the number of writes is about the same, we do not expect much difference. Again, we tested various node sizes and report the average latency as well as the modified bytes. Basically, we expect the sorted variants to perform better than the unsorted ones. The results are shown in Figure 13. Unlike in our previous experiments, the number of written bytes is not reflected in the performance here. The sorted variants are, as expected, faster, but the distance to the other techniques is enormous for larger node sizes. Comparing the sorted and hashing approaches, the former is nearly four times faster than the latter. This is because in the unsorted case the next maximum or minimum must be searched before each move. In addition, the bitmap approaches must always search and set a free bit on the receiver site. The hashing approach must also copy the hashes, which can lead to further written cache-lines. Compared to the sorted case, with indirection the slots have to be additionally shifted and written. However, the keys and values can simply be appended. Since less is written here, we would have expected it to perform better than direct sorting. This is only the case with small node sizes.
Nevertheless, with regard to our design goals, we would still consider indirection as the best choice here.
Merge Nodes (E10). We already discussed the merge of multiple nodes into another in experiment E7. There is also a less complex merge operation, which occurs as a consequence of an underflow in a B-Tree. No duplicates are present in this micro-operation. On top of that, we cannot apply different merge strategies like 2-way or K-way merge since only two nodes are affected. Instead, the various node organizations and their corresponding access primitives are compared once again. In contrast to the previous experiment, we only consider one direction, the merging into the node with the smaller keys. This is always the better option, because there is no need to shift entries or slots. As a result, the sorted and unsorted approaches proceed in exactly the same way. They simply append all keys and values and finally update the number of keys. Hence, they are summarized as numKeys. The deallocation of the donor node is not part of the measurement, but it would be the same overhead for all approaches. Once more, we report the latency and modified bytes. We expect that the numKeys approach performs better than the approaches using extra search structures, since the latter do not include a mechanism yielding any performance gain. The results are shown in Figure 14.
Again the disadvantage of the bitmap shows up. It requires the verification of each bit on the donor site and search for a free bit on the receiver site before moving an entry. For indirection the first check is not required, however additional reads are necessary due to the indirect access and the updates of the slot array. This results in bitmap and indirection having almost the same performance. Again the hashing approach has to additionally copy the hashes and check the bit on the donor site. This leads to more written cache-lines with larger node size. Comparing the numKeys and hashing approach, we can again see a performance difference of a factor of nearly four.
Figure 15: Performance profile of design primitives.
In a separate measurement the pure deallocation took constantly around 2 µs independent of the size. In total, the sorted and plain unsorted variant work most efficiently when merging two nodes.
Performance Profiles
As mentioned at the beginning, the goal is to develop performance profiles for different design primitives. Based on our results in the experiments above, we created a first version summarizing the performance and write reduction of the main identified primitives in this paper. This is based on the node organization (sorted, unsorted, indirection, hashing, and bitmap) and their corresponding access primitives, from which we considered the best performing alternatives. Figure 15 shows the performance profile of them.
For the sorted case, it quickly becomes visible that its main drawbacks are raw entry inserts and key deletions. Also when searching, it performs worse than the other variants in uncached cases. However, it scores when iterating and for structural adjustments. Although it is the smallest in size, it takes the most writes and flushes when modifying nodes. The unsorted layout significantly improves the most typical operations like search, insert, and erase, particularly in terms of memory efficiency and write reduction. If many scans are used, the sorted and unsorted variants are best suited. Only balancing and splitting cost considerably more, as the order is beneficial for these micro-operations. So if the tree grows and shrinks a lot, this can lead to enormous overhead. A countermeasure could be a larger node size to avoid too much restructuring. Another way to overcome this is to add indirection slots, which deliver almost the same performance as in the sorted case but require fewer write operations. This is paid for with a node size overhead and worse read operations. Therefore, this variant seems more suitable for write-dominated workloads. The hashing approach is a bit worse regarding the restructuring. However, it offers the best overall package, especially with a small node size. Similar to the simple unsorted approach, the basic micro-operations search, insert, and erase of a key or entry are its great strength. A drawback is the iterate operation due to the bitmap. If no underflow handling and few scans are required, this variant is the best choice. The bare bitmap approach behaves quite similarly to the hashing approach and is only slower for a few operations. Therefore, the combination of hash and bitmap should usually be chosen when considering these two variants. The sole advantages of the bitmap-only variant are the lower memory consumption and, to an extent, faster underflow operations.
INSIGHTS & CONCLUSION
The results of the experiments gave us some interesting general insights for designing data structures, choosing corresponding access primitives, and combining various ideas. In part, the insights gained were consistent with those in [19].
I1: As it became clear already from looking at the design space, there are still numerous untested possible primitives and combinations of them. For instance, using hash probing without a bitmap but a size field, combining indirection for inner and hashing for data nodes, examining other algorithms like interpolation or exponential search, etc. Also investigating well-known techniques like compression or zone maps to reduce writes and read accesses seems reasonable.
I2: A hybrid approach is highly recommended when seeking the best performance and still requiring persistence. Especially the traversal experiment E2 proved that dereferencing and pointer chasing has an even greater impact on PM. In our DRAM-based and cached tests, we found sorted and indirection to be the best solutions for inner nodes. If a hybrid approach or recovery is not an option, it is important to note that hash probing is not applicable for inner nodes.
I3: PMDK transactions are universal, but not recommended for performance critical applications. As it was evident in E6 and E7 the log and snapshotting used in PMDK transactions add a tremendous overhead compared to individual realizations of failure atomicity. This means that the classic Copy-on-Write approaches should be avoided for PM.
I4: Jumping between non-sequential cache-lines is quite expensive. Although PM allows byte-addressable random access, sequential access is still preferable. This was particularly apparent for indirection and binary search (E1), iterations via bitmap and indirection (E3), as well as erasing in unsorted nodes (E8). The latter in particular showed the importance of locality.
I5: Allocations are expensive in PM and depend on the requested size. During the experiments E5 and E10 it became clear that allocations should be used wisely. To overcome this bottleneck, designers of PM-based data structures should use group allocations and reuse already allocated nodes instead of frequently deallocating and allocating.
I6: The optimal size for nodes located in PM-based index structures lies between 256 bytes and 1 KB. The lower bound of 256 bytes results from the write-combining buffer of the DCPMMs. The upper bound is the size from which the performance typically degrades drastically. This is partly due to the search structures, which should not grow beyond a cache-line. Small nodes automatically lead to longer traversal paths, hence we refer back to I2.
I7: For B+-Trees the overall best performance could be achieved with hash probing and bitmap for data nodes (1 KB) and a sorted or indirection layout for inner nodes residing in DRAM. This is basically a combination of the ideas from [4] and [24].
Finally, it should be mentioned again that Table 2 and our investigations still offer much potential for extension. For future work, further data structures, primitives, and operations shall be studied. Benchmarks testing various concurrency approaches should also be included. The final result should be a far broader derived performance profile per design primitive as sketched in Figure 15 .
