I/O latency and throughput are among the major performance bottlenecks for disk-based database systems. Upcoming persistent memory (PMem) technologies, like Intel's Optane DC Persistent Memory Modules, promise to bridge the gap between NAND-based flash (SSD) and DRAM and thus eliminate the I/O bottleneck. In this paper, we provide one of the first performance evaluations of PMem in terms of bandwidth and latency. Based on the results, we develop guidelines for efficient PMem usage and two essential I/O primitives tuned for PMem: log writing and block flushing.
INTRODUCTION
Today, data management systems mainly rely on solid state drives (NAND flash) or magnetic disks to store data. These storage technologies offer persistence and large capacities at low cost. However, due to the high access latencies, most systems also use volatile main memory in the form of DRAM as a cache. This yields the traditional two-layered architecture, as DRAM cannot solely be used due to its volatility, high cost, and limited capacity.
Novel storage technologies like Phase Change Memory are about to shrink this fundamental gap between memory and storage. Specifically, Intel's upcoming Optane DC Persistent Memory Modules (Optane DC PMM) offer an amalgamation of the best properties of memory and storage, though, as we show in this paper, with some trade-offs. This Persistent Memory (PMem) is durable like storage and directly addressable by the CPU like memory. We also expect the price, capacity, and latency to lie between DRAM and flash.
PMem promises to greatly improve the latency of storage technologies, which in turn would greatly increase the performance of data management systems. However, because PMem is fundamentally different from existing, well-known technologies, it also has different performance characteristics than DRAM and flash. In this work, we show how to efficiently implement atomic log writing and page flushing, two critical I/O primitives for database systems. While we perform our evaluation in a database context, these two I/O primitives are transferable to other systems, as evidenced by the fact that they are also implemented by the Persistent Memory Development Kit (PMDK) [1]. The results reported are based on a prototype of Intel's Optane DC PMM rather than on software- or hardware-based emulation. Our contributions can be summarized as follows:
• We provide one of the first analyses of PMem on a prototype of Intel's Optane DC PMM. We highlight the impact of the physical properties of PMem on software and derive guidelines for efficient usage of PMem.
• We introduce an algorithm for persisting small data chunks (transactional log entries) that reduces the latency by 2× compared to state-of-the-art algorithms.
• We investigate different algorithms for persisting large data chunks (database pages) in a failure atomic fashion to PMem. By combining a copy on write method with temporary delta files, we achieve significant speedups.
PMEM CHARACTERISTICS
In this section, we first describe how we configured our system before presenting latency and bandwidth results.
Setup and Configuration
There are two ways of using PMem: memory mode and app direct mode. In memory mode, PMem replaces DRAM as the (volatile) main memory, and DRAM serves as an additional hardware-managed caching layer ("L4 cache"). The advantage of this mode is that it works transparently for legacy software and thus offers a simple way to cheaply extend the main memory capacity. However, this mode does not utilize persistence, and performance may degrade due to the lower bandwidth and higher latency of PMem. In fact, as we show later, there is a ≈10 % overhead for accessing data when DRAM acts as an L4 cache rather than as regular main memory. Because it is not possible to leverage the persistency of PMem in memory mode, we focus on app direct mode in the remainder of this paper. App direct mode, unlike memory mode, leaves the regular memory system untouched. It optionally allows programs to make use of PMem in the form of memory-mapped files. We describe this process from a developer's point of view in the following:
We are using a two socket system with 24 physical (48 virtual) cores on each node. The machine is running Fedora with a Linux kernel version 4.15.6. Each socket has 6 PMem DIMMs with 128 GB each and 6 DRAM DIMMs with 32 GB each. To access PMem, the physical PMem DIMMs first have to be grouped into so-called regions with ipmctl 1 :
To avoid complicating the following experiments with a discussion on NUMA effects (which are similar to the ones on DRAM) we run all our experiments on socket 0. Once a region is created, ndctl 2 is used to create a namespace on top of it:
ndctl create-namespace --mode fsdax --region 28
Next, we create a file system on top of this namespace (mkfs.ext4 3 ) and mount it (mount 4 ) using the dax flag, which enables direct cache-line-grained access to the device by the CPU:
Programs can now create files on the newly mounted device and map them into their address space using mmap 5 :
fd = open("/mnt/pmem28/file", O_RDWR, 0);
res = ftruncate(fd, SIZE);
ptr = mmap(nullptr, SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
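The three calls above can be wrapped into a small helper with error handling. The following is a minimal sketch, not the code used in our experiments: the function name map_file is hypothetical, and the flags O_CREAT and PROT_READ are additions made here so that the helper also works for files that do not exist yet and for reading back data. On a file system that is not mounted with dax, the same code yields an ordinary page-cache-backed mapping.

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Maps `size` bytes of the file at `path` into the address space.
// Returns nullptr on failure. On a DAX-mounted file system the returned
// pointer refers directly to PMem.
void* map_file(const char* path, std::size_t size) {
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return nullptr;
    }
    void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return ptr == MAP_FAILED ? nullptr : ptr;
}
```

A caller would release the mapping with munmap(ptr, size) once it is no longer needed.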
The pointer can be used to directly access the PMem just like regular memory. How to ensure that a value written to PMem is actually persistent is discussed in Section 3. In the remainder of this section, we discuss the bandwidth and latency of PMem. 
Bandwidth
It is important to know that the PMem hardware internally works on 256-byte blocks. Because the transfer size between PMem and the CPU is 64 bytes (one cache line), as for DRAM, a small write-combining buffer is used to avoid write amplification.
The block-based (4 cache lines) design of PMem leads to some interesting performance characteristics that we show in Figure 1 . The experiment measures the bandwidth for loading/storing from/to independent random locations on PMem and DRAM. We use all 24 physical cores of one socket to maximize the number of parallel accesses. The figure shows store (PMem: (a), DRAM: (b)) and load (PMem: (c), DRAM: (d)) benchmarks. The performance heavily depends on the number of consecutively accessed cache lines on PMem while there is no significant difference on DRAM. Peak throughput can only be reached when a multiple of the block size (4 cache lines = 256 byte) is used.
As on DRAM, streaming (non-temporal) stores are more efficient on PMem because the modified cache lines do not have to be loaded first, which saves memory bandwidth. However, on PMem the performance of regular stores can be increased to that of streaming stores by issuing a clwb (cache line write back) instruction after each store. The clwb forces a dirty cache line in the data cache to be written to the underlying memory system (without evicting the cache line). While this is beneficial on PMem (a), it does not change the throughput on DRAM (b).
This effect is further studied in Figure 2, which shows the same experiment, but instead of varying the number of cache lines loaded/stored, we vary the number of threads. It shows that the clwb instruction only becomes necessary once several threads are writing to PMem: With more threads, cache lines are evicted more randomly from the last-level CPU cache and thus arrive increasingly out of order at the PMem write-combining buffer. It seems that at a certain point (≈ 4 threads), the buffer is no longer able to combine the cache lines into a single PMem block write. Using the clwb instruction, we can force the order in which the cache lines arrive at the PMem write-combining buffer and thus enable it to combine neighboring cache lines into a single block write.
Another effect we observe is that the throughput peaks at around 3 threads for streaming (and 12 for stores with clwb). Using additional threads decreases the throughput slightly.
In summary, judging from our experimental results, we recommend the following guidelines for bandwidth-critical applications:
• Algorithms should no longer be designed to fit data on single cache lines (64 byte) but on PMem blocks (256 byte).
• Streaming operations should be utilized when possible, otherwise stores should be followed by clwb.
• Over-saturating PMem can lead to reduced performance.
• The experiments showed that the PMem read bandwidth is 2.6× lower and the write bandwidth 7.5× lower compared to DRAM. Therefore, performance-critical code should prefer DRAM over PMem (e.g., by buffering writes in a DRAM cache).
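The first guideline can be made concrete with a little arithmetic. The following sketch uses hypothetical helper names that are not part of any library; it only computes how many 256-byte PMem blocks an access touches, which is a rough indicator of how well a data layout matches the block size.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;   // transfer size between CPU and memory
constexpr std::size_t kPMemBlock = 256;  // internal PMem block size (4 cache lines)

// Rounds `n` up to the next multiple of the PMem block size.
constexpr std::size_t round_up_to_block(std::size_t n) {
    return (n + kPMemBlock - 1) / kPMemBlock * kPMemBlock;
}

// Number of 256-byte PMem blocks touched by an access of `len` bytes
// that starts at byte offset `offset`.
constexpr std::size_t blocks_touched(std::size_t offset, std::size_t len) {
    return len == 0 ? 0
                    : (offset + len - 1) / kPMemBlock - offset / kPMemBlock + 1;
}
```

For example, a 256-byte record stored at offset 0 touches one block, whereas the same record stored at offset 64 touches two; aligning records with round_up_to_block avoids the second case.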
Latency
While bandwidth is critical for OLAP-style applications, latency is much more important for OLTP workloads because the access pattern shifts from large scan operations (sequential I/O) to point lookups, which are essentially random accesses into memory. The performance of these random accesses is dominated by the latency of the underlying device.
To measure the latency for load operations on PMem, we use a single thread and perform loads from random locations. To study this effect, we prevented out-of-order execution by chaining the loads such that the address for the load in step i depends on the value read in step i − 1. The results are shown in Figure 3 .
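The chaining of loads can be sketched as a pointer chase over a random cyclic permutation. This is an illustration under assumptions rather than our benchmark code: the function names are hypothetical, the array lives in DRAM (for a PMem measurement it would be placed in a mapped PMem file), and the 8-byte elements mean that several entries share one cache line.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Builds a random permutation that forms one single cycle (Sattolo's
// algorithm): following next[] from any index visits every element once
// before returning to the start.
std::vector<std::uint64_t> make_chain(std::size_t n, std::uint64_t seed) {
    std::vector<std::uint64_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    if (n < 2) return next;
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Performs `steps` dependent loads: the address of load i is the value
// returned by load i - 1, so the loads cannot overlap.
std::uint64_t chase(const std::vector<std::uint64_t>& next, std::size_t steps) {
    std::uint64_t pos = 0;
    for (std::size_t i = 0; i < steps; ++i) pos = next[pos];
    return pos;
}

// Average time per dependent load in nanoseconds (steps must be > 0).
double ns_per_load(const std::vector<std::uint64_t>& next, std::size_t steps) {
    auto start = std::chrono::steady_clock::now();
    volatile std::uint64_t sink = chase(next, steps);  // keep the result alive
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() /
           static_cast<double>(steps);
}
```

The array has to be much larger than the last-level cache for the result to reflect memory latency rather than cache latency.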
We can observe that DRAM read latency is lower by a factor of 3 in comparison with PMem. Note that this does not mean that each access to PMem is that much slower, because many applications can still benefit from the on-CPU L3 cache. When using PMem in memory mode, it replaces DRAM as main memory and DRAM acts as an L4 cache. In that case, most accesses are captured by the DRAM cache because our workload fits into this huge DRAM cache (196 GB), and the workload therefore only sees a small slowdown of around 10 %.
To persistently store data on PMem, the data has to be written, the cache line evicted, and then an sfence has to be used to wait for the data to reach PMem. This process is described in more detail in Section 3.1. To measure the latency for persistent store operations on PMem, we use a single thread that persistently stores data to an array of size 10 GB. Each store is aligned to a cache line (64 byte) boundary. The results are shown in Figure 4. The four bars on the left show the results for continuously writing to the same cache line, in the middle we write cache lines sequentially, and on the right randomly. In each scenario, we use four different methods for flushing cache lines (from left to right: flush, flushopt, clwb, and streaming stores).
As the results show, when storing data to the same cache line over and over again, streaming stores should be preferred. This pattern appears in many data structures (e.g., array-like structures with a size field) or algorithms (e.g., a global counter for time-stamping) that have some kind of global variable that is often modified. Therefore, for efficient usage of PMem, techniques similar to the ones developed to avoid congestion in multi-threaded programming have to be applied to PMem as well. Our experiments suggest using clwb when available.
STORAGE PRIMITIVES FOR PMEM
The low write latency of PMem (compared to other storage devices) makes it an ideal candidate for use in database systems, file systems, and other systems software. However, due to the CPU cache, writes to PMem are only persistent once the corresponding cache line is flushed. Algorithms have to explicitly order stores and cache line flushes to ensure that a persistent data structure is always in a consistent state (in case of a crash). We call this property failure atomicity and discuss it in Section 3.1. Intel's Persistent Memory Development Kit (PMDK) [1], an open-source library for PMem, abstracts from this complexity by providing two failure atomic I/O primitives: log writing (libpmemlog) and block/page flushing (libpmemblk). In Section 3.2 and Section 3.3, we apply the guidelines developed earlier (Section 2) to these two problems and analyze their performance.
Failure Atomicity
As mentioned earlier, when writing to PMem, stores are not immediately propagated to the PMem device, but are instead buffered in the regular on-CPU cache. While programs cannot prevent the eviction of a cache line, they can force it using explicit write-back or flush instructions. This implies that any persistent data structure on PMem always needs to be in a consistent state; otherwise, a system crash that interrupts an update operation could lead to an inconsistent state after a restart. The following code snippet shows how an element is appended to a pre-allocated buffer: The new element is first copied into the next free slot and the corresponding cache line is forced to be written back to PMem. Instead of using a regular flush operation, clwb (cache line write back) is used, which is an efficient flush operation designed for PMem that flushes the cache line without invalidating it. Before the buffer's size indicator (next) can be changed, an sfence (store fence) must be issued to prevent re-ordering by the compiler or hardware. Once next has been written, it is persisted to memory in the same fashion. Note that persisting the next field is not necessary for the failure atomicity of a single append operation. However, it is convenient and often required for subsequent code (e.g., another append). In the following, we will use the terms persistency barrier and persist for a combination of a clwb and a subsequent sfence:
void persist(void* ptr) {
    clwb(ptr);
    sfence();
}
Generally speaking, a persistency barrier is an expensive operation, as it forces a synchronous write to PMem (or, more precisely, to its internal battery-backed buffers). Therefore, in addition to the guidelines laid out in Section 2, it is also important to minimize the number of persistency barriers while still maintaining failure atomicity. In the following two sections, we show a manually-tuned implementation for logging and page flushing.
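The append operation described at the beginning of this section can be sketched as follows. This is a minimal illustration under several assumptions: the types Slot and Buffer and the capacity of 16 are hypothetical, every element fits into one cache line, and _mm_clflush stands in for clwb because clwb requires dedicated compiler and CPU support. On non-x86 targets the sketch only issues a fence and provides no persistence.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#else
#include <atomic>
#endif

// Persistency barrier: write back the cache line that holds `ptr`, then fence.
// Assumption: clflush replaces clwb here; it also evicts the line.
inline void persist(const void* ptr) {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_clflush(ptr);
    _mm_sfence();
#else
    (void)ptr;
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

constexpr std::size_t kCapacity = 16;  // hypothetical buffer capacity

struct alignas(64) Slot {  // one element per cache line
    std::uint64_t value;
};

struct Buffer {
    alignas(64) std::uint64_t next;  // number of valid slots, on its own cache line
    Slot slots[kCapacity];
};

// Appends `value`; returns false if the buffer is full.
bool append(Buffer& buf, std::uint64_t value) {
    if (buf.next >= kCapacity) return false;
    buf.slots[buf.next].value = value;  // 1. copy the element into the free slot
    persist(&buf.slots[buf.next]);      // 2. element is durable before it is published
    buf.next += 1;                      // 3. publish the element
    persist(&buf.next);                 // 4. persist the size indicator
    return true;
}
```

If a crash occurs between steps 2 and 4, next still holds its old value and the half-appended element is simply ignored after a restart.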
Page Propagation
Besides logging, the other essential storage engine component that requires I/O is the buffer manager. It is responsible for loading (swapping in) pages from SSD/HDD into DRAM whenever a page is accessed by the query engine. When the buffer pool is full, the buffer manager needs to evict pages in order to serve new requests. When a dirty page is evicted, it needs to be flushed to storage before it can be dropped from the buffer pool in order to ensure durability. This process has to be carefully coordinated with the transaction and logging controller, i.e., a page can only be flushed when the undo information of all non-committed modifications is persisted in the log file (otherwise a crash would lead to corrupt data). In addition, flushing a page needs to be failure atomic: After a crash, the recovery component needs a consistent snapshot of the page.
Flushing pages to persistent storage is an inherently I/O-bound task. To reduce the latency for page requests, the buffer manager constantly flushes dirty pages to persistent storage in the background. This way, it can always serve requests without needing to flush a page first. In addition, this makes flushing pages (on a background thread) a mostly bandwidth-critical problem (compared to log writing, where latency is most important).
For SSDs/HDDs this architecture is strictly necessary as pages have to be copied to DRAM before they can be read or written by the CPU. When using PMem instead, the buffer pool becomes optional. However, as recent work [6, 32] has shown, it is still beneficial to use a buffer pool, due to the lower latencies and reduced complexity when working on DRAM compared to PMem. In addition, this architecture is used in most existing disk-based database systems. In order to integrate PMem into existing systems, the page flushing algorithm needs to be correct (failure atomicity) and efficient (high bandwidth). In the following, we describe two algorithms for failure atomic page flushing and then evaluate them.
Copy on Write.
CoW does not overwrite the original PMem page, but instead writes the DRAM page to an unused PMem page (left-hand side of Listing 1, lines 1-3). Once the new PMem page is persisted, it is marked as valid (lines 5-10) and the old PMem page can be reused. During recovery, the headers of all PMem pages are inspected to determine the physical location of each logical page. By adding a page version number (pvn) that is increased after each flush, we can identify the latest version of a page. Using the pvn, it becomes unnecessary to invalidate the old PMem page before writing the new one. This lowers the number of required persistency barriers from three to two and thus yields ≈ 10% increased throughput. We illustrate the pvn in the following example: The green page slot (3) contains the latest persistent copy of page B. The red one (2) contains the original version of page A. The different versions of the blue page slot (1) show each step of flushing a new version of page A. The line numbers where the transition might occur are written over the arrow. In each step, the pvn can be used to figure out the most recent version of each page. In database systems, the log sequence number (lsn) could be used instead of the pvn; however, if the system crashes in line 6, log entries might be reapplied to a page.
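The copy-on-write flush and the pvn-based recovery can be sketched as follows. This is an illustration, not a reproduction of Listing 1: the slot layout, the names PMemSlot, flush_cow, and find_latest, and the packing of page id and pvn into one 64-bit header word are assumptions made here, and persist is an empty stand-in for the persistency barrier so that the example runs on regular memory.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kPageSize = 4096;

// Hypothetical stand-in for the persistency barrier (clwb per line + sfence).
inline void persist(const void*, std::size_t) {}

struct PMemSlot {
    std::uint64_t header = 0;  // (page id << 32) | pvn; pvn 0 means never used
    unsigned char data[kPageSize] = {};
};

inline std::uint64_t make_header(std::uint32_t pid, std::uint32_t pvn) {
    return (static_cast<std::uint64_t>(pid) << 32) | pvn;
}

// Flushes `page` into the unused slot `dst` with two persistency barriers.
void flush_cow(PMemSlot& dst, std::uint32_t pid, std::uint32_t new_pvn,
               const unsigned char* page) {
    std::memcpy(dst.data, page, kPageSize);
    persist(dst.data, kPageSize);             // barrier 1: page content is durable
    dst.header = make_header(pid, new_pvn);   // one 8-byte store marks the slot valid
    persist(&dst.header, sizeof dst.header);  // barrier 2: header is durable
}

// Recovery: index of the slot with the highest pvn for `pid`, or -1 if none.
template <std::size_t N>
std::ptrdiff_t find_latest(const std::array<PMemSlot, N>& slots, std::uint32_t pid) {
    std::ptrdiff_t best = -1;
    std::uint32_t best_pvn = 0;
    for (std::size_t i = 0; i < N; ++i) {
        const std::uint32_t slot_pid = static_cast<std::uint32_t>(slots[i].header >> 32);
        const std::uint32_t slot_pvn = static_cast<std::uint32_t>(slots[i].header);
        if (slot_pvn != 0 && slot_pid == pid && slot_pvn > best_pvn) {
            best = static_cast<std::ptrdiff_t>(i);
            best_pvn = slot_pvn;
        }
    }
    return best;
}
```

Because recovery always picks the highest pvn, a slot that still carries an older header can be overwritten without invalidating it first, which is what saves the third persistency barrier.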
Micro Log.
The micro-log technique uses a small log file to record changes that are going to be made to the page. During recovery, all valid micro logs are reapplied, independent of the page's state. This forces us to invalidate the log (right-hand side of Listing 1, lines 1-3) before changing the content (lines 5-7); otherwise, the changes would be applied to the previous page in case of a crash. Only once the changes are written do we set them to valid (lines 8-10) and then apply them to the actual page (lines 13-15).
Experiments.
Figure 5 details the page flush performance. All techniques are implemented using streaming (non-temporal) writes, which have been shown to provide the highest throughput in Section 2. When using copy on write, we differentiate whether all cache lines are available in DRAM or only the dirty ones. As a performance metric, we chose the number of pages that can be flushed to PMem per second. We vary the number of dirty cache lines in (a) for a single thread and in (c) for 7 threads. In (b), we vary the number of threads to show the scale-out behavior.
The results show that the micro log is efficient when the number of cache lines that have to be flushed is low. We can observe this effect for a single thread in (a). Using the micro log yields performance gains for up to 112 dirty cache lines. A multi-threaded experiment is shown in (c). Here, the micro log only offers throughput gains when fewer than 32 cache lines are dirty. Therefore, a hybrid technique based on a simple cost model should be used to choose the better technique depending on the number of dirty cache lines (and single/multi-threading). The micro benchmarks in Section 2 suggested that streaming instructions should be preferred over regular stores. We were able to confirm this finding in the page flushing experiment (not shown in chart). In addition, as in the bandwidth experiments, we can see a performance degradation when too many threads are used: For optimal throughput, it is important to tailor the number of writer threads to the system. As (b) shows, the performance degrades after reaching a peak at around 7-11 threads.
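One very simple form of such a hybrid is a threshold rule. The sketch below is only an illustration: the names are hypothetical, and the two constants are the break-even points read off the experiments above for this specific machine and for the 7-thread configuration, so they would have to be re-measured on other hardware.

```cpp
#include <cstddef>

enum class FlushMethod { MicroLog, CopyOnWrite };

// Largest number of dirty cache lines for which the micro log was still
// faster in the experiments above (machine-specific assumptions).
constexpr std::size_t kMaxMicroLogSingleThread = 112;
constexpr std::size_t kMaxMicroLogMultiThread = 31;

// Picks the page flushing technique for one page.
FlushMethod choose_flush_method(std::size_t dirty_cache_lines, bool multi_threaded) {
    const std::size_t limit =
        multi_threaded ? kMaxMicroLogMultiThread : kMaxMicroLogSingleThread;
    return dirty_cache_lines <= limit ? FlushMethod::MicroLog
                                      : FlushMethod::CopyOnWrite;
}
```

A more elaborate cost model could interpolate between thread counts instead of using only two operating points.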
Logging
In database systems, write-ahead logging is used to ensure the atomicity and durability of transactions. This is achieved by recording (logging) the individual changes of a larger transaction in order to be able to undo them in the event of a rollback. If any of the changes to the data are persisted while the transaction is still active, the log has to be persisted as well. Before completing a transaction (and thereby guaranteeing to the user that all changes of the transaction are durable), all log entries of the transaction are persistently written. Logging allows a database to only persist the delta of the modifications: For example, consider an insert operation into a table stored as a B-Tree: using logging, only the altered data needs to be persisted instead of all modified nodes (pages). During a restart, the recovery component reads the log file, determines the most recent fully persisted log entry, and applies the log to the database.
Logging constitutes a major performance bottleneck in database systems using traditional storage devices (SSD/HDD) because each transaction has to wait until the log entry recording its changes is written. As a mitigation, reduced consistency guarantees are offered and complex group commit protocols are implemented. However, using PMem, a low-latency logging protocol can be implemented that largely eliminates this problem.
Algorithms.
In the following, we first explain and then evaluate three logging techniques: Classic, Header, and Zero:
Classic represents a form of logging commonly used in database systems [31] . The following listing shows the algorithm in pseudo code (left) and the file layout grammar (right). For clarity, only information relevant to the protocol is depicted.
log << header << payload
persist(log);
log << footer
persist(log);

LogFile -> Entry*
Entry -> header payload footer
A log entry is flushed in two steps: first, the header and payload are appended to the log and persisted; second, the footer, which contains a copy of the log sequence number (lsn; an id given to each log entry), is appended and persisted. The lsn in the footer can be used during recovery to determine whether a log entry was completely written and therefore should be considered valid and applied to the database. Note that this takes two persistency barriers. Without the first barrier, parts of the payload could be missing even though the header and footer have already become durable in PMem.
Header uses the same technique as libpmemlog in the PMDK [1]. It is similar to appending elements to an array:
log << header << payload
persist(log);
log.size += entry_size
persist(log.size);

LogFile -> size, Entry*
Entry -> header payload
The log entry is also written in two steps: first, the header and payload are appended to the tail of the log and persisted. Next, the new size of the log is set in the header of the log file and persisted. This eliminates the need to scan the log file for the last valid entry during recovery because the valid size is directly stored in the header.
Zero is a novel technique we propose for PMem that requires only one persistency barrier: Before logging starts, each log file is initialized to zero. This is commonly done anyway by database systems (e.g., PostgreSQL) to enforce that the file system actually allocates pages to the file. When writing a log entry, the number of set bits is counted (using the popcnt instruction). Next, the header, data, and bit count (cnt) are written to the log and persisted together. Using the bit count, it is always possible to determine the validity of a log entry: either the cache line containing the bit count was not flushed or it was. In the former case, the field contains the number zero (because the file was zeroed) and the entry is invalid. In the latter case, the bit count field can be used to determine if all other cache lines belonging to the log entry have been flushed as well.
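The following sketch illustrates the idea of Zero logging on a pre-zeroed buffer in regular memory. Several details are assumptions made for this illustration and not taken from our implementation: the entry layout (size, cnt, payload, padding to a cache line boundary), the function names, the portable bit counting via std::bitset instead of the popcnt instruction, and the empty persist_range stand-in for flushing all touched cache lines followed by a single sfence. The caller is assumed to ensure that the entry fits into the log.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kHeaderSize = 2 * sizeof(std::uint64_t);  // size + cnt

// Hypothetical stand-in for clwb on every touched cache line plus one sfence.
inline void persist_range(const void*, std::size_t) {}

// Counts the set bits in `len` bytes.
inline std::uint64_t count_bits(const unsigned char* p, std::size_t len) {
    std::uint64_t bits = 0;
    for (std::size_t i = 0; i < len; ++i) bits += std::bitset<8>(p[i]).count();
    return bits;
}

// Appends one entry at `log + offset` (offset is cache line aligned) and
// returns the offset of the next entry. `size` must be > 0, which also
// guarantees that the bit count of a written entry is never zero.
std::size_t append_zero(unsigned char* log, std::size_t offset,
                        const void* payload, std::uint64_t size) {
    unsigned char* entry = log + offset;
    std::memcpy(entry, &size, sizeof size);
    std::memcpy(entry + kHeaderSize, payload, size);
    const std::uint64_t cnt = count_bits(entry, sizeof size) +
                              count_bits(entry + kHeaderSize, size);
    std::memcpy(entry + sizeof size, &cnt, sizeof cnt);
    const std::size_t total =
        (kHeaderSize + size + kCacheLine - 1) / kCacheLine * kCacheLine;
    persist_range(entry, total);  // the only persistency barrier
    return offset + total;
}

// Recovery check: true if the entry at `log + offset` was written completely.
bool is_valid(const unsigned char* log, std::size_t capacity, std::size_t offset) {
    if (offset > capacity || capacity - offset < kHeaderSize) return false;
    std::uint64_t size = 0, cnt = 0;
    std::memcpy(&size, log + offset, sizeof size);
    std::memcpy(&cnt, log + offset + sizeof size, sizeof cnt);
    if (size == 0 || size > capacity - offset - kHeaderSize) return false;
    const std::uint64_t expected = count_bits(log + offset, sizeof size) +
                                   count_bits(log + offset + kHeaderSize, size);
    return cnt == expected;
}
```

A cache line that never reached PMem still reads as zeros, so the recomputed bit count falls short of cnt unless the missing line contained only zero bits, in which case the entry content is complete anyway.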
Experiments.
In Section 2.3, we showed that there is a large performance penalty when persisting the same cache line twice in a row. This effect is very relevant for latency-critical systems, as shown in Figure 6. We use a micro-benchmark that measures the throughput of flushing log entries of varying sizes. The left chart shows a naive implementation, while the right one uses padding on each log entry to align entries to cache line boundaries and thus avoid subsequent writes to the same cache line. While padding wastes some memory 6 , the throughput greatly increases (≈ 8×). However, even with padding, the Classic approach still outperforms the Header one, because of the slowdown due to the writes to the same cache line in the header when updating the size. This problem can be solved by using a dancing size field: We use several size fields on different cache lines in the header and write only one (round-robin) for each log entry. By using 64 of these dancing size fields, the throughput of Header can be increased to that of Classic. However, both of these techniques still require two persistency barriers and can therefore not compete with Zero logging (≈ 2× faster).
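Padding and the dancing size field can be sketched as follows. The type and function names are hypothetical, persist is an empty stand-in for the persistency barrier, and the recovery rule (taking the maximum of all size fields) is an assumption that relies on the log size growing monotonically.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSizeFields = 64;  // number of dancing size fields

// Hypothetical stand-in for the persistency barrier (clwb + sfence).
inline void persist(const void*) {}

struct alignas(64) SizeField {  // every size field occupies its own cache line
    std::uint64_t value;
};

struct LogHeader {
    SizeField sizes[kSizeFields];
};

// Pads an entry size to a multiple of the cache line size.
constexpr std::uint64_t pad_to_cache_line(std::uint64_t n) {
    return (n + 63) / 64 * 64;
}

// Publishes the new log size in the next field (round-robin), so that
// consecutive appends never persist the same cache line twice in a row.
void publish_size(LogHeader& header, std::uint64_t entry_no, std::uint64_t new_size) {
    SizeField& field = header.sizes[entry_no % kSizeFields];
    field.value = new_size;
    persist(&field);
}

// Recovery (assumption): the log only grows, so the largest field is current.
std::uint64_t recover_size(const LogHeader& header) {
    std::uint64_t size = 0;
    for (const SizeField& field : header.sizes) size = std::max(size, field.value);
    return size;
}
```

With this layout the header grows from one cache line to 64 cache lines (4 KB), which is the price for avoiding repeated flushes of the same line.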
The log implementation (libpmemlog) of the PMDK [1] uses the same approach as our naive Header implementation without alignment and dancing, and it yields the same throughput once the locks that it uses for thread safety are disabled. It has the advantage that the log file is dense and can be presented as one continuous memory segment to the user. However, this leaves the user with the task of reconstructing log entry boundaries manually. By moving this functionality into the library, a better logging strategy could be implemented and the usability would be increased.
To validate our results, we have integrated all techniques into our PMem-based storage engine prototype HyMem [32]. Running a write-heavy (100%) YCSB benchmark [10] on a single thread with a small table that fits into DRAM, Zero logging achieves a throughput of 2 M transactions per second. The Header and Classic techniques achieve 1.7 M and 1.5 M transactions per second, respectively.
6 Up to 1 cache line for Zero and Header; up to 2 cache lines for Classic.
RELATED WORK
With PMem only being released recently, there have not been any studies on the actual hardware yet. So far, software- or hardware-based simulations or emulations based on speculative performance characteristics have been used to evaluate possible system architectures [4, 24, 26, 28]. There is a large number of persistent index structures [3, 9, 14, 20, 33, 36], which have been summarized by Götze et al. [15]. Similar techniques have been used to build storage engines directly on PMem [5, 23]. These approaches use in-place updates on PMem, which suffer from the lower-than-DRAM performance. Therefore, a number of indexes [25, 35] as well as storage engines [2, 8, 11, 17, 18, 21, 22] integrate PMem as a separate storage layer or as an extension to the recovery component [27, 29]. Furthermore, buffer-managed architectures [6, 19, 32] have been proposed to use PMem more adaptively. Recovery has always been an essential (and performance-critical) component of database systems [31]. Several designs have been proposed for database-specific logging [7, 13, 16, 30, 34] and file systems [12].
CONCLUSION
In our evaluation, we derived several guidelines for using PMem efficiently (cf. Section 2.2 and 2.3): (1) Instead of optimizing for cache lines (64 byte) as on DRAM, we have to optimize for PMem blocks (256 byte). (2) As in multi-threaded programming, writes to the same cache line in close temporal proximity should be avoided. (3) Forcing the data out of the on-CPU cache (clwb or streaming) is essential for a high write bandwidth. Further, we evaluated algorithms for logging and page propagation: (1) Our logging experiments have shown that latency-critical code should minimize the number of persistency barriers and avoid subsequent writes to the same cache line. (2) Our Zero logging algorithm reduces the required persistency barriers from two to one, thus doubling the throughput. (3) For flushing database pages, a small log (µLog) can be used to flush only dirty cache lines. The introduced I/O primitives use an interface similar to the one in PMDK [1], making them widely applicable.
