# NVSWAP: LATENCY-AWARE PAGING USING NON-VOLATILE MAIN MEMORY

By

YEKANG WU

A thesis submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE

WASHINGTON STATE UNIVERSITY School of Engineering and Computer Science, Vancouver

MAY 2021

© Copyright by YEKANG WU, 2021 All Rights Reserved

To the Faculty of Washington State University:

The members of the Committee appointed to examine the thesis of YEKANG WU find it satisfactory and recommend that it be accepted.

Xuechen Zhang, Ph.D., Chair

Xinghui Zhao, Ph.D.

Scott Wallace, Ph.D.

# ACKNOWLEDGMENT

I would like to express my sincere gratitude to my research adviser, Dr. Xuechen Zhang, for guiding me through the world of graduate studies. He is patient and knowledgeable, and his professional suggestions have always helped me solve hard problems.

I would also like to express my sincere gratitude to my parents. Without their support, I would not have been able to finish my studies. Their words always motivate and inspire me in difficult times.

### NVSWAP: LATENCY-AWARE PAGING USING NON-VOLATILE MAIN MEMORY

Abstract

by Yekang Wu, M.S. Washington State University May 2021

### Chair: Xuechen Zhang

Page relocation (paging) from DRAM to swap devices is an important task of a virtual memory system in operating systems. Existing Linux paging mechanisms have two main deficiencies: (1) they may incur high I/O latency due to write interference on solid-state disks and an aggressive memory page reclaiming rate under high memory pressure, and (2) they do not provide predictable latency bounds for latency-sensitive applications because they cannot control the allocation of system resources among concurrent processes sharing swap devices.

In this thesis, we present the design and implementation of a latency-aware paging mechanism called NVSwap. It supports a hybrid swap space using both regular secondary storage devices (e.g., solid-state disks) and non-volatile main memory (NVMM). The design is more cost-effective than using only NVMM as swap space. Furthermore, NVSwap uses NVMM as a persistent paging buffer to serve page-out requests and hide the latency of paging between the regular swap device and DRAM. It supports in-situ paging for pages in the persistent paging buffer, avoiding the slow I/O path. Finally, NVSwap allows users to specify latency bounds for individual processes or a group of related processes and enforces the bounds by dynamically controlling the allocation of NVMM and the page reclaiming rate among scheduling units. We have implemented a prototype of NVSwap in Linux kernel 3.16.74. Our results demonstrate that NVSwap reduces paging latency by up to 99% and provides performance guarantees and isolation among concurrent applications sharing swap devices.

Keywords: Paging, Virtual Memory, Storage QoS, Non-Volatile Main Memory

# TABLE OF CONTENTS

- ACKNOWLEDGMENT
- ABSTRACT
- LIST OF TABLES
- LIST OF FIGURES
- CHAPTER 1: INTRODUCTION
- CHAPTER 2: MOTIVATION AND RELATED WORK
  - 2.1 Needs for Latency-Aware Paging
  - 2.2 Previous Work
- CHAPTER 3: NVSwap Design
- CHAPTER 4: Latency Control Module
- CHAPTER 5: Implementation Issues
- CHAPTER 6: Evaluation
  - 6.1 System Setup
  - 6.2 Latency Enforcement
    - 6.2.1 Single Workloads
    - 6.2.2 Concurrent Homogeneous Workloads
    - 6.2.3 Concurrent Heterogeneous Workloads
  - 6.3 Changing Latency Bounds Dynamically
  - 6.4 Impact of DB Record Size
  - 6.5 Impact of NVMM Size
  - 6.6 Comparison with Other Systems
  - 6.7 Experiments with Real NVMM
- CHAPTER 7: Conclusion
- REFERENCES

# LIST OF TABLES

| Table   | Title                                                    |
|---------|----------------------------------------------------------|
| Table 1 | Comparison of NVSwap with existing major paging schemes  |
| Table 2 | Configurations of heterogeneous Workload A, B, and C     |
| Table 3 | Configurations of heterogeneous Workload A, C, and F     |

# LIST OF FIGURES

| Figure    | Title                                                         |
|-----------|---------------------------------------------------------------|
| Figure 1  | Motivation                                                    |
| Figure 2  | Illustration of the Linux paging mechanism                    |
| Figure 3  | NVSwap system architecture                                    |
| Figure 4  | Illustration of NVSwap paging scheme                          |
| Figure 5  | Single Workloads experiment                                   |
| Figure 6  | Concurrent Homogeneous Workloads experiment                   |
| Figure 7  | Concurrent Heterogeneous Workloads experiment 1               |
| Figure 8  | Concurrent Heterogeneous Workloads experiment 2               |
| Figure 9  | Dynamic latency bounds experiment                             |
| Figure 10 | DB Record Size experiment                                     |
| Figure 11 | NVMM Size experiment                                          |
| Figure 12 | Comparison of latency by NVSwap versus Dr. Swap               |
| Figure 13 | Single Workload experiment with real NVMM                     |
| Figure 14 | Concurrent Homogeneous Workloads experiment with real NVMM    |
| Figure 15 | Concurrent Heterogeneous Workloads experiment with real NVMM  |

# CHAPTER 1: INTRODUCTION

In the Linux operating system, paging extends the main memory capacity using the space of secondary storage devices [31]. The existing paging policy in Linux is designed to improve the overall I/O throughput of concurrent paging workloads. For example, upon paging out, memory pages are written to swap space in units of page clusters [29] to exploit spatial locality. However, the tail latency (i.e., *X<sup>th</sup> percentile latency*) of paging for a particular application can be prohibitively high because it is affected by many factors, such as queuing time in the kernel and I/O interference from applications concurrently accessing the swap devices. In addition, the latency is unpredictable because existing paging systems cannot enforce a paging latency bound for an application, which may result in a poor experience for users of latency-sensitive applications, e.g., in-memory databases and mobile applications.

In this thesis, we present a new paging system called NVSwap that uses *Non-Volatile Main Memory* (NVMM) (e.g., Intel Optane DIMMs [2, 24]) to extend memory capacity for serving latency-sensitive, memory-demanding applications. Supporting paging to NVMM in the operating system hides the complexity of programming and enables programmers to directly run memory-demanding legacy code without needing to understand NVMM memory models and their programming interfaces. NVSwap has the following desirable properties to hide paging latency and improve users' experience.

**A cost-effective hybrid swap space:** For paging, NVSwap employs two swap zones: a swap zone in NVMM on the memory bus and a regular swap zone in solid-state disks (SSDs) on the I/O bus. Latency-sensitive applications can page to both zones according to their latency requirements. Other applications can page to the regular zone at a low cost.

NVSwap may swap in pages in NVMM in an in-situ manner (in-situ paging) without extra memory copy, which will further reduce the latency of paging.

**Latency-bound enforcement:** NVSwap allows users to specify QoS requirements in the form of tail latency bounds of page-in requests for any latency-sensitive process. These may be set for individual processes or collectively for a group of related processes. NVSwap enforces the latency bound for page-in requests by controlling the space allocation of NVMM among scheduling units (e.g., processes). According to the latency requirements, it also dynamically selects the host swap device from the regular swap zone and the NVMM zone and adjusts the rate of memory page reclaiming to reduce the queuing time of paging requests in the disk scheduling queue.

**Persistent paging buffer:** We implemented a persistent paging buffer in NVMM to enforce latency bounds and reduce I/O interference in the regular swap zone. It organizes its space into multiple latency groups, each of which consists of pages from processes having the same latency bound. The pages in the buffer are destaged in units of latency groups and in descending order of their respective latency bounds. In this way, pages from latency-sensitive processes have a higher chance of staying in NVMM. The buffer exploits temporal locality by storing the pages of a group in the order of their eviction from DRAM. By serving page-out requests from the persistent paging buffer and giving higher priority to read requests in the disk scheduling queue, NVSwap significantly reduces write interference on SSDs. Finally, pages in the buffer that are hit by subsequent page-in requests are copied directly to the swap zone in NVMM and then mapped into process address spaces. With the help of in-situ paging, such a page can be used immediately by the process without triggering the overhead of block-level I/O processing.

We have implemented a prototype of NVSwap in Linux kernel 3.16.74. Our extensive evaluation with an in-memory database and the YCSB benchmark shows that NVSwap can reduce paging latency by up to 99% compared to systems using only SSDs for swapping. Furthermore, it dynamically adapts the allocation of system resources to enforce the X<sup>th</sup> percentile latency for concurrent paging workloads and provides the desired performance isolation among them.

# CHAPTER 2: MOTIVATION AND RELATED WORK

In this chapter, we first motivate the need for latency-aware paging with motivating examples. We then review the literature on paging mechanisms and the limitations of existing paging techniques.



Figure 1: Motivation. (a) and (b) show the minimum, 95<sup>th</sup> percentile, and 99<sup>th</sup> percentile latency of DB update and read operations with one instance of memcached (Dedicated) and two concurrent instances (Concurrent). (c) and (d) show the minimum, 95<sup>th</sup> percentile, and 99<sup>th</sup> percentile latency of page-out and page-in operations during the execution of memcached.

## 2.1 Needs for Latency-Aware Paging



Figure 2: Illustration of the Linux paging mechanism. Page(1) is paged out from DRAM to the swap space and Page(2) is paged in from the swap space to DRAM. Both the page-out and page-in requests are issued from the disk scheduling queue.

Paging in virtual memory is a core component of the Linux operating system. Figure 2 illustrates the existing Linux paging mechanism. Paging out happens when the swap-out daemon *kswapd* in the kernel is woken up under memory pressure or when direct page reclaiming is required under even higher memory pressure [1, 11]. To swap out a page in DRAM, the kernel generates a block-level I/O request and adds it to the disk scheduling queue associated with the device for dispatch. Then the page is written to the swap space hosted on block storage devices, e.g., SSDs. When a page fault is triggered by a STORE or LOAD CPU instruction, the kernel needs to swap the accessed page back into DRAM. Paging to NVMM hides the complexity of the new memory models and enables programmers to smoothly adapt their applications to the new hardware.

The current Linux paging mechanism is designed to improve the overall I/O bandwidth of slow swap devices [29] by exploiting spatial locality. For paging out, the kernel typically selects 32 pages from the list of inactive pages and sequentially writes them in a page cluster on swap devices. The size of a cluster ranges from 8 KB to 4096 KB. For paging in, the kernel prefetches multiple pages in a cluster to benefit from the high sequential read bandwidth of storage devices.

However, these design choices may cause long paging latency at both the kernel and application levels. To illustrate the impact of paging on the latency of major application operations, we run the *memcached* in-memory database server provided in the *YCSB* benchmark [3]. The server daemons access an SSD-based swap device. The sizes of the main memory and swap space are set to 5 GB and 10 GB, respectively. We run Workload A with a 50/50 read/update ratio, 1 KB record size, and zipfian [10] request distribution. The detailed hardware configuration can be found in Chapter 6. In Figure 1, we compare the latency of major database operations (e.g., update and read) with one *memcached* server using the swap space (*Dedicated*) to that with two *memcached* servers concurrently accessing the space (*Concurrent*). We make the following observations from the results.

**1. During paging, both DB read and update operations may have a long tail latency.** With dedicated access to the swap space, the 99<sup>th</sup> percentile latency of DB read operations (840 us as shown in Figure 1(b)) is 25X longer than the minimum latency (34 us). With concurrent access to the swap space, the 99<sup>th</sup> percentile latency of read operations increases to 2143 us, which is 61X higher than the minimum latency (35 us), while the minimum latency increases by only 3%. A similar pattern is observed for DB updates. This is because when the page to read or update is in the swap space, the kernel needs to synchronously read the page from the SSD-based swap space. While reading the page, the SSD also needs to serve page-out requests due to memory reclaim under high memory pressure. When page-in requests and page-out requests are mixed and concurrently access the SSD, the read (page-in) requests might be blocked by write (page-out) requests, causing a long tail latency [12, 27] at the application level.

**2. During paging, page-out requests have an extremely long tail latency.** We observe that the 99<sup>th</sup> percentile latency of page-out requests is 260495 us, which is 2503X higher than the corresponding minimum latency (Figure 1(c)) with concurrent access to the swap space. This is because under high memory pressure the kernel executes memory pre-cleaning, which swaps out pages from the inactive page list before new pages are requested [29]. The page reclaiming rate becomes more aggressive as the ratio of free page frames decreases. Even though paging out is executed asynchronously, we observed that the page-out I/O requests issued by page pre-cleaning may saturate the disk scheduling queue, leading to a long queuing time and a long tail latency.

**3. The Linux paging system is not able to enforce latency bounds.** We observed a large variation between the minimum latency and the corresponding 99<sup>th</sup> percentile latency for both page-in and page-out requests in Figure 1.

**4. The OS page-in latency has a critical impact on the latency of DB operations at the application level.** Our experimental results show that the OS page-in latency is directly correlated with users' perceived latency because serving page-in requests is on the critical path of major page faults, while page-out requests can be served asynchronously in the kernel. As a result, the 99<sup>th</sup> percentile latency of page-in requests is comparable to that of DB reads and updates. In contrast, the tail latency of page-out requests can be 122X higher than the latency of the DB operations.
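As a quick check of the ratios quoted in the observations above, the following Python sketch recomputes them from the reported latencies. The constant and function names are ours and purely illustrative; only the latency values come from the text.

```python
# Latencies in microseconds, as quoted in observations 1, 2, and 4.
READ_MIN_DEDICATED, READ_P99_DEDICATED = 34, 840
READ_MIN_CONCURRENT, READ_P99_CONCURRENT = 35, 2143
PAGEOUT_P99_CONCURRENT = 260495


def times_higher(tail_us, base_us):
    """Ratio of a tail latency to a baseline latency, rounded to an integer."""
    return round(tail_us / base_us)


def percent_increase(new_us, old_us):
    """Relative increase from old_us to new_us, as a rounded percentage."""
    return round(100 * (new_us - old_us) / old_us)


if __name__ == "__main__":
    print(times_higher(READ_P99_DEDICATED, READ_MIN_DEDICATED))
    print(times_higher(READ_P99_CONCURRENT, READ_MIN_CONCURRENT))
    print(percent_increase(READ_MIN_CONCURRENT, READ_MIN_DEDICATED))
    print(times_higher(PAGEOUT_P99_CONCURRENT, READ_P99_CONCURRENT))
```

The four printed values correspond to the 25X, 61X, 3%, and 122X figures stated above.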

In summary, while existing paging systems are well implemented to provide high I/O bandwidth during paging, they are not latency-aware. The deficiencies in their design discourage latency-sensitive users from using the swap space. They also prevent paging from being used with in-memory applications and mobile devices [37], which require predictable and low latency bounds to improve users' experience, and with high-performance computing applications [6], whose performance is sensitive to OS noise.

## 2.2 Previous Work

We classified existing work on paging in virtual memory management into three categories, as discussed below.

**Paging Approaches for Flash:** FlashVM was designed to improve paging performance with aggressive pre-cleaning and stride prefetching, and to reduce the number of page writes to flash-based swap devices with page sampling and sharing [29]. In comparison, NVSwap is designed to enforce paging latency bounds with the help of NVMM. Because of the good random-access performance of NVMM, prefetching is no longer needed for the swap zone in NVMM. Instead, NVSwap uses in-situ paging and a persistent paging buffer to reduce paging latency. Hybrid Swap allocates SSD space according to a user-specified bound on the program stall time due to page faults as a percentage of the program's total run time [23]. Most recently, MARS was designed to speed up the relaunching of mobile applications via flash-aware paging [17]. It reserves memory space and dynamically adjusts the memory watermarks to avoid mixing page-out and page-in requests during relaunching, thus alleviating the impact of write interference. In addition, it separates the swap space allocated to each application to improve the spatial locality of page-in workloads. Different from MARS, NVSwap uses a dedicated persistent paging buffer in NVMM to separate page-out from page-in requests and reduce write interference. Newhall et al. designed a swap space using distributed DRAM and flash [25].

| Technique          | In-situ Paging-in | Write Awareness | Latency Enforcement | Swap Device Type | Swapping Unit |
|--------------------|-------------------|-----------------|---------------------|------------------|---------------|
| Linux [31]         | No                | No              | No                  | Disk             | Page          |
| FlashVM [29]       | No                | Yes             | No                  | Flash            | Page          |
| SmartSwap [41]     | No                | No              | Some                | Flash            | Application   |
| MARS [17]          | No                | Some            | No                  | Flash            | Application   |
| Memorage [20]      | Yes               | No              | No                  | NVMM             | Page/block    |
| Dr. Swap [37]      | Yes               | No              | No                  | NVMM             | Page          |
| Refinery swap [13] | Some              | Yes             | No                  | NVMM             | Page          |
| NVSwap             | Yes               | Supported       | Yes                 | NVMM+Flash       | Page          |

Table 1: Comparison of NVSwap with existing major paging schemes.

**Paging Approaches for NVMM:** NVMM is used for paging because it has low latency and a vendor-guaranteed lifespan of at least 5 years [8, 24]. Table 1 summarizes existing paging approaches for NVMM and compares them with NVSwap. Memorage manages NVMM as storage space when storage capacity is low and as a main memory extension when the availability of memory pages is low [20]. By dynamically changing the allocation of NVMM to main memory, it uses existing virtual memory managers to improve the performance of in-memory applications. Dr. Swap uses direct reads to reduce the overhead of memory copies from NVMM to DRAM when paging in [37]. Refinery swap and nCode were designed to reduce the number of page-out and page-in requests by swapping out less-frequently accessed pages [13] or read-only code pages to NVMM [38]. Both SmartSwap [41] and MARS [17] reduce the user-perceived latency of application relaunching on mobile devices. They typically swap in the whole process address space of the application for relaunching. Other paging approaches have been designed to reduce paging overhead in virtual machines using distributed NVMM [39]. Awad et al. comprehensively studied the impact of existing techniques (e.g., page prefetching and page replacement algorithms) on the performance of NVMM-based paging systems [9].



Figure 3: NVSwap system architecture.

**Latency Enforcement in Storage Systems:** For non-distributed storage systems, SARC+Avatar is a two-level scheduler that uses the earliest-deadline-first scheduling policy to achieve latency control [36]. In the Xen hypervisor, PSLO was proposed to enforce tail latency for consolidated VM storage by controlling the level of I/O concurrency and the arrival rate for each VM issue queue [22]. Tiny-tail flash was designed to eliminate the tail latency induced by garbage collection in solid-state disks. Because the Linux paging system does not maintain an individual issue queue for each process, NVSwap cannot directly use approaches like Avatar or PSLO. Instead, it uses user-defined latency bounds to control the disk scheduling queue size and the paging location of each process. It also needs to adjust the page pre-cleaning rate and the arrival rate of page-out requests of processes competing for the slots of the disk scheduling queue and for NVMM space.

Other studies focus on latency enforcement in distributed storage systems, in which congestion may happen at either the network or the storage layer. For example, Cake [35] is a two-level scheduling framework designed for tail latency enforcement in HBase [5]. Its first-level scheduler uses a single FIFO queue, splits large requests into small ones to reduce head-of-line blocking in the queue, and determines the number of outstanding requests. Its second-level scheduler issues I/O requests to storage and reduces resource underutilization by controlling the queue occupancy. PriorityMeister further provides scheduling for multiple latency-sensitive workloads, whereas Cake can handle only one such workload [40]. It limits the network rate to manage networking-induced tail latency. Most recently, Rein [28] was designed to reduce tail latency in distributed key-value stores, e.g., Cassandra [4]. It aims to control tail latency in a distributed storage system using the client-server model. We do not need to consider networking latency in NVSwap.

# CHAPTER 3: NVSWAP DESIGN

The objective of NVSwap is to enforce latency bounds of paging for latency-sensitive processes using NVMM. We focus on page-in latency in this thesis because it has a direct impact on users' perceived latency, as shown in Chapter 2. In this chapter, we discuss the key concepts and overall design of NVSwap. Figure 3 shows the system architecture with multiple processes accessing a shared swap space. From the perspective of software architecture, NVSwap has the same basic functionality as Linux, including paging to/from disks, page prefetching to hide disk access latency, page pre-cleaning to eagerly swap out dirty pages using *kswapd* before new pages are needed, and other functions (e.g., swap space management). Besides these, NVSwap supports paging to/from NVMM. It has a new latency control module, which is responsible for determining the memory page reclaim rate and the dynamic allocation of NVMM for each paging process according to its user-specified tail latency bound. Since page-out requests are served asynchronously, we only provide latency control for page-in requests. We describe the algorithm used for latency control and page reclaim in Chapter 4.

The swap space of NVSwap has four main components: a regular-zone, an NV-zone, a persistent paging buffer, and a shadow mapping table. The regular-zone is hosted on block storage devices, e.g., solid-state disks. It is used to serve paging requests dispatched from a disk scheduling queue, as in the Linux paging scheme. The NV-zone and the persistent paging buffer are hosted in NVMM. They are used to serve paging requests so as to enforce the latency bounds specified by users. The NV-zone consists of NVMM page frames that can be directly accessed in process address spaces. The persistent paging buffer stores swapped-out pages from latency-sensitive processes and pages prefetched from the regular-zone. When the buffer is full, NVSwap asynchronously flushes pages to the regular-zone in the background. When page flushing happens, NVSwap does not need to change the page table entry of the corresponding process. Instead, the new disk location of each flushed page in the regular-zone is recorded in the shadow mapping table. Incoming page-in requests for the flushed pages are then served using the new disk addresses looked up in the shadow mapping table.

**Paging-out:** According to the output of the latency control module, page-out requests are directed to either the persistent paging buffer or the regular-zone. In Figure 4(a), we illustrate the three page-out paths of NVSwap. (1) Page(1) is paged out to the persistent paging buffer first and then asynchronously flushed to the regular-zone when the buffer is full. Because the persistent paging buffer is on the memory bus, NVSwap simply copies the page to be swapped out from DRAM to a new page frame in NVMM. Since the persistent paging buffer is non-volatile, a page-out request can be considered complete once the page has been copied to the main memory extension. We schedule writing pages from NVMM to the regular-zone when the scheduling queue is not saturated. (2) Page(2) is simply paged out to NVMM. After the process referencing Page(2) terminates, the page frame is freed for future use by other processes. (3) Page(3) is paged out to the regular-zone. The page-out request is dispatched by the scheduling queue in DRAM. The size of the scheduling queue is measured in queue slots. It has a large impact on the tail latency of paging requests, as shown in Figure 1. Therefore, the latency control module periodically adjusts the queue size based on the latency requirements of processes.

**Paging-in:** When a STORE/LOAD instruction triggers a page fault on a page in the swap space, NVSwap has two paths for serving the page-in request. We illustrate them in Figure 4(b). (1) If the page (e.g., Page(3)) is stored in the regular-zone, NVSwap first issues a read request to read the page from the block device into a new DRAM page frame allocated for serving the page fault. Then, by updating the page table entry (PTE) of the process that references the page, it sets up the mapping from the virtual address to the physical address in DRAM. Finally, the process can write/read data to/from the page. This page-in path is the same as in Linux. (2) If the page (e.g., Page(2)) is stored in the persistent paging buffer, NVSwap allocates a page frame in the NV-zone. Then it sets up the PTE mapping from the virtual address to the physical address in NVMM. Finally, it copies the data from the persistent paging buffer to the NVMM page frame in the NV-zone, and the buffer slot that hosted the page is freed. This operation is called in-situ paging in NVSwap. In-situ paging replaces the read from the regular-zone with a memory copy from the persistent paging buffer to the NV-zone. Consequently, it reduces the page-in latency of serving the page fault.
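The two page-in paths can be summarized with a minimal Python sketch. It only models the order of the steps described above; the names `Zone` and `page_in` are illustrative and do not come from the NVSwap kernel code. The thesis states that in-situ paging is disabled when the NV-zone is full but does not say how such a request is then served, so the sketch merely reports that case.

```python
from enum import Enum


class Zone(Enum):
    REGULAR = "regular-zone"             # hosted on the block device
    BUFFER = "persistent paging buffer"  # hosted in NVMM


def page_in(zone, nv_zone_free_frames):
    """Return the ordered steps taken to serve one page-in request."""
    if zone is Zone.REGULAR:
        # Path (1): same as the Linux page-in path.
        return [
            "allocate DRAM frame",
            "read page from block device",
            "map PTE to DRAM frame",
        ]
    if nv_zone_free_frames > 0:
        # Path (2): in-situ paging, no block-level I/O involved.
        return [
            "allocate NV-zone frame",
            "map PTE to NVMM frame",
            "copy page from buffer to NV-zone frame",
            "free buffer slot",
        ]
    # The fallback for a full NV-zone is not specified in the text.
    return ["in-situ paging disabled: NV-zone full"]


if __name__ == "__main__":
    for step in page_in(Zone.BUFFER, nv_zone_free_frames=4):
        print(step)
```

The point of the sketch is that path (2) contains no block-device step, which is where the latency reduction comes from.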

**Resizing the persistent paging buffer:** The size of the persistent paging buffer is periodically adjusted according to the ratio of the number of page-in requests to the number of page-out requests. Specifically, let the size of NVMM be $C_{nvmm}$, the size of the persistent paging buffer be $C_{buffer}$, and the size of the NV-zone be $C_{nvzone}$. Let the rates of page-in and page-out be $\text{Rate}_{in}$ and $\text{Rate}_{out}$, respectively. Then $C_{nvzone} = \text{Rate}_{in} * C_{nvmm} / (\text{Rate}_{out} + \text{Rate}_{in})$ and $C_{buffer} = C_{nvmm} - C_{nvzone}$. To calculate $\text{Rate}_{in}$ and $\text{Rate}_{out}$, NVSwap maintains a moving average of the total number of page-in and page-out requests served in a 1-second time window. This does not induce additional overhead because Linux already tracks these metrics (e.g., the number of page faults). When the NV-zone is full, in-situ paging is disabled until existing page frames are freed or more NVMM page frames are allocated to the zone.
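The split can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the kernel implementation: sizes are counted in pages, the moving average is a simple mean over the last few 1-second samples (the window count of 8 is our choice), and the behavior when no paging has been observed is assumed, since the text does not specify it.

```python
from collections import deque


class MovingAverage:
    """Simple moving average over the most recent 1-second samples."""

    def __init__(self, windows=8):  # number of windows is an assumption
        self.samples = deque(maxlen=windows)

    def add(self, count):
        self.samples.append(count)
        return sum(self.samples) / len(self.samples)


def split_nvmm(c_nvmm, rate_in, rate_out, prev=None):
    """Return (c_nvzone, c_buffer) in pages for the given paging rates."""
    total = rate_in + rate_out
    if total == 0:
        # Assumption: with no paging activity, keep the previous split,
        # or give everything to the buffer if there is none yet.
        return prev if prev is not None else (0, c_nvmm)
    c_nvzone = int(rate_in * c_nvmm / total)  # round down to whole pages
    return c_nvzone, c_nvmm - c_nvzone


if __name__ == "__main__":
    avg_in, avg_out = MovingAverage(), MovingAverage()
    rate_in = avg_in.add(300)    # page-ins served in the last second
    rate_out = avg_out.add(100)  # page-outs served in the last second
    print(split_nvmm(1000, rate_in, rate_out))
```

With 300 page-ins and 100 page-outs per second and 1000 NVMM pages, three quarters of NVMM go to the NV-zone and the rest to the buffer.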

**The page layout of swap space and prefetching:** When flushing occurs in the persistent paging buffer, NVSwap evicts pages from the processes that have the highest latency bounds specified by users. For this purpose, it organizes the space of the persistent paging buffer into multiple latency groups, each of which consists of 64 pages from processes having the same latency bound. When a latency group is full, it is split into two latency groups with the same latency requirement. Furthermore, the buffer exploits temporal locality by storing the pages of a group in the order of their eviction from DRAM. We schedule writing a group of pages from the persistent paging buffer to the regular-zone when the scheduling queue is not full. For the regular-zone, NVSwap manages page slots using a cluster-based approach like Linux.

Serving read requests from the regular-zone is on the critical path and may significantly increase the latency of page fault handling. Linux prefetches pages after a page fault to hide the latency [29]. However, the existing prefetching mechanism reads pages from the regular-zone into DRAM, which may cause memory thrashing under high memory pressure. In contrast, upon a page fault on the regular-zone, NVSwap prefetches pages from the regular-zone to the persistent paging buffer in units of a latency group. Furthermore, it only prefetches the pages of latency-sensitive processes that are set to access NVMM according to the output of the latency control module. When page prefetching happens, NVSwap does not need to change the page table entry of the corresponding process. Instead, the new NVMM page frame ID of each prefetched page is recorded in the shadow mapping table. Incoming page-in requests for a prefetched page are served using the new NVMM page frame. Because our flushing and prefetching algorithm exploits application semantics (e.g., the latency bound) and temporal locality, the pages in the same cluster in the regular-zone are likely to be accessed together.
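A minimal Python sketch of the latency-group layout follows. It is our interpretation rather than the NVSwap data structure: a full group is modeled by opening a second group with the same bound, which is how we read the split described above, and ties between groups with equal bounds are broken in favor of the oldest group, which is an assumption.

```python
GROUP_PAGES = 64  # pages per latency group, as stated in the text


class PagingBuffer:
    def __init__(self):
        # Each group is (latency_bound_us, pages in DRAM-eviction order).
        self.groups = []

    def page_out(self, bound_us, page):
        """Append a page to the newest non-full group with the same bound."""
        for bound, pages in reversed(self.groups):
            if bound == bound_us and len(pages) < GROUP_PAGES:
                pages.append(page)
                return
        self.groups.append((bound_us, [page]))

    def destage(self):
        """Remove and return the group with the highest latency bound."""
        if not self.groups:
            return None
        # max() keeps the first maximum, i.e. the oldest group on ties.
        idx = max(range(len(self.groups)), key=lambda i: self.groups[i][0])
        return self.groups.pop(idx)


if __name__ == "__main__":
    buf = PagingBuffer()
    for page_id in range(65):
        buf.page_out(500, page_id)  # relaxed bound: flushed first
    buf.page_out(100, "hot")        # strict bound: stays in NVMM longest
    bound, pages = buf.destage()
    print(bound, len(pages))
```

Destaging by descending bound is what gives pages of the most latency-sensitive processes the best chance of remaining in NVMM.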
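The role of the shadow mapping table in both flushing and prefetching can be sketched in a few lines of Python. The class and method names are hypothetical, and a dictionary keyed by the swap entry stored in the PTE is an assumed layout; the thesis does not describe the table's concrete format.

```python
class ShadowMap:
    """Tracks pages whose location changed without touching their PTEs."""

    def __init__(self):
        # swap entry recorded in the PTE -> current (zone, address)
        self._moved = {}

    def record_flush(self, pte_entry, disk_addr):
        # Buffer -> regular-zone: remember the new disk address.
        self._moved[pte_entry] = ("regular-zone", disk_addr)

    def record_prefetch(self, pte_entry, nvmm_frame):
        # Regular-zone -> buffer: remember the new NVMM page frame.
        self._moved[pte_entry] = ("paging-buffer", nvmm_frame)

    def resolve(self, pte_entry, original):
        """Return where the page lives now; default to the PTE's location."""
        return self._moved.get(pte_entry, original)


if __name__ == "__main__":
    shadow = ShadowMap()
    original = ("paging-buffer", 7)
    shadow.record_flush(42, disk_addr=9001)
    print(shadow.resolve(42, original))
```

Because only the table is updated, flushing and prefetching can run in the background without modifying the page tables of running processes.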



Figure 4: Illustration of NVSwap paging scheme. (a) Paging out; (b) Paging in.

# CHAPTER 4: LATENCY CONTROL MODULE

NVSwap supports storage QoS specified using X<sup>th</sup> percentile page-in latency. These may be set for individual processes or collectively for a group of related processes. According to our observations, the paging latency is affected by the characteristics of both swap devices and workloads, e.g., read/write latency, disk scheduling queue size, and I/O arrival rate. NVSwap selects a host swap device for each latency-sensitive process according to its tail latency bound. Then according to the latency requirements of the processes accessing the regular-zone, it determines the disk scheduling queue size to control the queuing time. Finally, according to the size of the queue, it adjusts the rate of memory page reclaiming to control the I/O arrival rate. In this section, we describe the algorithm used in NVSwap to enforce the latency bounds.

Selecting the host swap device and the queue size: The default host swap device is the regular-zone for processes. Then given the capacity of the regular-zone, NVSwap may select NVMM as the host swap device for latency-sensitive processes. We adopt a control strategy to estimate the capacity in terms of the scheduling queue size. The strategy is inspired by those used in Storage Resource Pools [16] and PARDA [15].

Let's assume that the paging latency is Lat<sub>i</sub> for process $P_i$ ($1 \le i \le n$). Then the latency goal to achieve using the scheduling queue, Lat<sub>goal</sub>, is min(Lat<sub>1</sub>, ..., Lat<sub>n</sub>). The queue size is adjusted based on Lat<sub>goal</sub> and the observed latency Lat<sub>observed</sub> using Equation 1, where S(t) denotes the size of the scheduling queue at time t and $\gamma$ is a smoothing parameter between 0 and 1. To measure Lat<sub>observed</sub>, we instrumented the Linux kernel to collect the latency of paging requests.

$$S(t + 1) = (1 - \gamma) * S(t) + \gamma * \left(S(t) * \frac{Lat_{goal}}{Lat_{observed}}\right) \qquad (1)$$

Using the control strategy, if the observed latency is higher than $Lat_{goal}$, NVSwap will reduce the queue size. Otherwise, it will increase the queue size. If the queue size is too large, we are at risk of losing data in the queue upon system failures. Consequently, we cap the queue size at $S_{max}$: if $S(t + 1) > S_{max}$, then $S(t + 1) = S_{max}$. We set $S_{max}$ to 1024 in this thesis. We also bound the queue size from below by $S_{min}$. For example, $S_{min}$ can be set to the number of channels of the SSD to exploit its I/O parallelism. If the queue size $S(t + 1)$ computed using Equation 1 is smaller than $S_{min}$, NVSwap considers the regular-zone to be under-provisioned. It then serves the requests from the most latency-sensitive processes using the persistent write buffer to reduce the load on the regular-zone until $S(t + 1)$ is no smaller than $S_{min}$. Algorithm 1 describes the assignment of host swap devices and the determination of the queue size.
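As an illustration of Equation 1 together with the $S_{max}$ cap, the following sketch computes one update step. The function name and the value $\gamma = 0.5$ are assumptions made for the example; the thesis only states that $\gamma$ lies between 0 and 1.

```python
def next_queue_size(s_t, lat_goal, lat_observed, gamma=0.5, s_max=1024):
    """One step of Equation 1, followed by the S_max cap.

    gamma=0.5 is an illustrative choice, not a value from the thesis.
    """
    s_next = (1 - gamma) * s_t + gamma * (s_t * lat_goal / lat_observed)
    return min(s_next, s_max)


# Observed latency (400 us) above the goal (200 us): the queue shrinks.
print(next_queue_size(128, lat_goal=200, lat_observed=400))   # 96.0
# Observed latency (100 us) below the goal (200 us): the queue grows.
print(next_queue_size(128, lat_goal=200, lat_observed=100))   # 192.0
# A very large computed size is capped at S_max.
print(next_queue_size(1000, lat_goal=400, lat_observed=100))  # 1024
```

With $\gamma = 0.5$, the new size is the average of the current size and the size scaled by the ratio of goal to observed latency, so the queue moves halfway toward the scaled target in each step.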

NVSwap reserves a fixed number of slots in the queue to serve processes that are not latency-sensitive, which prevents starvation in request scheduling (Line 12 of Algorithm 1). Finally, NVSwap is designed to reduce write interference in the regular-zone: in the scheduling queue, read requests are given higher priority than write requests.
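The read-over-write priority can be sketched with a small priority queue. This is a user-space illustration only, with hypothetical names; it does not model the reserved slots or the actual Linux block-layer scheduler.

```python
import heapq
import itertools

READ, WRITE = 0, 1  # lower value is dispatched first


class PagingQueue:
    """Hypothetical scheduling queue in which reads overtake writes."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # keeps FIFO order within each class

    def submit(self, kind, page):
        heapq.heappush(self._heap, (kind, next(self._seq), page))

    def dispatch(self):
        kind, _, page = heapq.heappop(self._heap)
        return kind, page


q = PagingQueue()
for kind, page in [(WRITE, "w1"), (READ, "r1"), (WRITE, "w2"), (READ, "r2")]:
    q.submit(kind, page)
print([q.dispatch() for _ in range(4)])
```

Both reads are dispatched before either write, while requests of the same type keep their arrival order.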

Latency-sensitive page reclaiming: The page replacement algorithm determines which pages should be swapped out, and *SWAP\_CLUSTER\_MAX* determines how many pages should be swapped out [14]. It is set to 32 in Linux [29], indicating that kswapd will swap out 32 pages from the list of inactive pages. For latency enforcement, instead of using *SWAP\_CLUSTER\_MAX* with a fixed value, NVSwap sets the maximum number of pages to swap out according to the size of the scheduling queue S(t). As a result, the rate of page scanning matches the capacity of the regular-zone given the latency bounds of processes.

**Algorithm 1:** Assignment of host swap devices and determination of the scheduling queue size.

**Input:**

- *Lat<sub>i</sub>*: user-specified paging latency of process i, $1 \le i \le n$;
- Set *LS*: ordered set $\{ls_1, ls_2, ..., ls_n\}$ of elements from set $\{Lat_1, Lat_2, ..., Lat_n\}$;
- index[i]: equals k if $ls_i$ is $Lat_k$;
- $S_{min}$ and $S_{max}$: minimum and maximum size of the scheduling queue respectively;
- $S_{reserve}$: the reserved slots in the scheduling queue;
- Lat<sub>observed</sub>: observed latency of accessing the queue.

**Output:**

- Set *NS*: the set of processes using the persistent paging buffer;
- Set *RS*: the set of processes using the regular-zone;
- S(t + 1): the size of the scheduling queue.

**Steps:**

- 1 $NS = \{\}$, $RS = \{1, \dots, n\}$.
- 2 **for** k in 1, ..., n **do**
  - // Set the latency goal using the minimum latency
  - 3 $Lat_{goal} = ls_k$;
  - // Update the scheduling queue size using user-specified latency
  - 4 $S(t + 1) = (1 - \gamma) * S(t) + \gamma * (S(t) * \frac{Lat_{goal}}{Lat_{observed}})$;
  - // Handle the case of under-provisioned regular-zone by serving latency-sensitive processes using persistent paging buffer
  - 5 **if** $S(t + 1) < S_{min}$ **then**
    - 6 $NS = NS \cup index[k]$;
    - 7 $RS = RS \setminus index[k]$;
  - 8 **else**
    - 9 **break**;
- // Handle the case of extremely large queue size
- 10 **if** $S(t + 1) > S_{max}$ **then**
  - 11 $S(t + 1) = S_{max}$;
- // Add reserved slots for processes that are not latency-sensitive
- 12 $S(t + 1)$ += $S_{reserve}$;
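The steps of Algorithm 1 can be sketched as a short user-space function. This is an illustrative transcription, not the kernel implementation: the function name, the dictionary-based input, and the numeric values in the example are assumptions, and the $S_{max}$ cap and reserved slots are applied after the loop.

```python
def assign_swap_devices(lat, s_t, lat_observed, gamma, s_min, s_max, s_reserve):
    """Sketch of Algorithm 1.

    lat maps a process ID to its user-specified latency bound.
    Returns (NS, RS, S(t+1)).
    """
    ns, rs = set(), set(lat)                      # Line 1
    s_next = s_t
    # Visit processes from the tightest to the loosest bound (ordered set LS).
    for pid in sorted(lat, key=lat.get):          # Line 2
        lat_goal = lat[pid]                       # Line 3
        s_next = (1 - gamma) * s_t + gamma * (s_t * lat_goal / lat_observed)  # Line 4
        if s_next < s_min:                        # Lines 5-7: under-provisioned
            ns.add(pid)
            rs.discard(pid)
        else:                                     # Lines 8-9
            break
    s_next = min(s_next, s_max)                   # Lines 10-11
    s_next += s_reserve                           # Line 12
    return ns, rs, s_next


ns, rs, size = assign_swap_devices(
    lat={1: 100, 2: 500, 3: 1000},  # latency bounds in us (example values)
    s_t=64, lat_observed=400, gamma=0.5,
    s_min=48, s_max=1024, s_reserve=8,
)
print(ns, rs, size)  # {1} {2, 3} 80.0
```

In this example the 100 us goal would shrink the queue to 40 slots, which is below $S_{min}$ = 48, so process 1 is moved to the persistent paging buffer. The next goal of 500 us yields 72 slots, the loop stops, and the 8 reserved slots bring the final queue size to 80.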

The page replacement algorithm is modified to evict pages of latency-sensitive processes in NS to the persistent paging buffer and pages of processes not in NS to the regular-zone. Specifically, to select a page to reclaim, the algorithm scans pages starting from the end of the inactive\_list. We use reverse mapping to map each page frame to its associated process, indexed by process ID. If the process is in NS, NVSwap directs the paging request to the persistent write buffer. Otherwise, it directs the request to the regular-zone. The scan loop terminates when the number of reclaimed pages from processes not in NS reaches S(t) or when the list is empty.
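The scan loop described above can be sketched as follows. The data structures are simplified stand-ins: a Python list plays the role of the inactive\_list, and a dictionary replaces the kernel's reverse mapping. All names are hypothetical.

```python
def reclaim_pages(inactive_list, owner_of, ns, budget):
    """Scan from the tail of inactive_list and route each reclaimed page.

    owner_of maps a page to its process ID (stand-in for reverse mapping).
    ns is the set of processes served by the persistent paging buffer.
    budget corresponds to S(t): pages allowed to go to the regular-zone.
    Returns (pages sent to the buffer, pages sent to the regular-zone).
    """
    to_buffer, to_regular = [], []
    while inactive_list and len(to_regular) < budget:
        page = inactive_list.pop()          # tail of the inactive list
        if owner_of[page] in ns:
            to_buffer.append(page)
        else:
            to_regular.append(page)
    return to_buffer, to_regular


pages = ["a", "b", "c", "d"]
owners = {"a": 2, "b": 1, "c": 2, "d": 1}
print(reclaim_pages(pages, owners, ns={1}, budget=1))  # (['d'], ['c'])
print(pages)                                           # ['a', 'b']
```

Pages owned by processes in NS do not count against the budget, so only traffic to the regular-zone is throttled by the queue size.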

**Tail latency monitoring and enforcement:** In the Linux kernel, we implemented a monitor that collects the paging rate and the X<sup>th</sup> percentile latency of paging processes for each time window k (k > 0). Let's assume that the X<sup>th</sup> percentile latency specified by users is $Tail_i^k$ for process $P_i$ ($1 \le i \le n$) at time window k. If the observed tail latency is higher than $Tail_i^k$, all the page-out requests from $P_i$ in the next time window k + 1 will be served using the persistent write buffer. In addition, it will trigger prefetching of the pages of $P_i$ so that the page-in requests issued in time window k + 1 will be served using NVMM to reduce the page-in latency $Tail_i^{k+1}$.
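The per-window decision can be sketched in a few lines. The nearest-rank percentile definition and all names are assumptions made for illustration; the thesis does not state how the in-kernel monitor computes the percentile.

```python
import math


def percentile(samples, x):
    """Nearest-rank X-th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(x * len(ordered) / 100))
    return ordered[rank - 1]


def processes_to_redirect(window_samples, bounds, x=99):
    """Processes whose observed tail latency in window k exceeded their bound.

    These processes would be served from NVMM in window k + 1.
    """
    return {
        pid
        for pid, samples in window_samples.items()
        if samples and percentile(samples, x) > bounds[pid]
    }


window_k = {
    1: [100] * 98 + [250, 300],  # page-in latencies in us (example values)
    2: [50] * 100,
}
print(processes_to_redirect(window_k, bounds={1: 200, 2: 200}))  # {1}
```

Process 1 has a 99<sup>th</sup> percentile of 250 us in this window, above its 200 us bound, so its paging requests would be redirected in the next window, while process 2 stays on its current device.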

#### **CHAPTER 5: IMPLEMENTATION ISSUES**

We discuss some of the implementation issues that we handled while building our prototype of NVSwap in this section.

Admission control: A key question that arises in the implementation of NVSwap is how many latency-sensitive paging processes can we serve on a hybrid swap space using NVMM? We need to understand both the system capacity and the total I/O demand to answer the question. We use the following equation to provide a general understanding of I/O demand.

$$Demand_{IOPS} = \sum_{i=1}^{n} \frac{1}{Lat_i} \qquad (2)$$

For the capacity of the regular-zone, we suggest measuring its throughput (IOPS) using random I/O workloads. The request size should equal the page size in Linux, e.g., 4 KB. The measurement should be conducted with increasing levels of I/O concurrency. This can be done either during system installation or later by running micro-benchmarks, e.g., IOMETER [19]. NVSwap only copies pages to the persistent write buffer. To measure the capacity of the persistent write buffer, we developed a simple tool that measures the latency of copying pages from DRAM to the buffer and then converts it to throughput. Using this approach, we measured the capacity of the regular-zone and the persistent write buffer to be 7,900 and 215,000 paging I/O operations per second respectively in our experiments. With the capacity set, NVSwap can automatically determine whether to admit a process given the existing total I/O demand Demand<sub>IOPS</sub> and the latency bound of the incoming process.
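A minimal sketch of an admission check based on Equation 2 is shown below. The admission rule itself (admit when the total demand stays within the measured capacity) is an assumption, since the thesis does not spell out the exact test; the function names are hypothetical and latency bounds are given in microseconds.

```python
def demand_iops(latency_bounds_us):
    """Equation 2 with latencies in microseconds, result in IOPS."""
    return sum(1e6 / lat for lat in latency_bounds_us)


def admit(existing_bounds_us, new_bound_us, capacity_iops):
    """Assumed rule: admit only if total demand stays within capacity."""
    total = demand_iops(existing_bounds_us + [new_bound_us])
    return total <= capacity_iops


REGULAR_ZONE_IOPS = 7900  # capacity measured in the thesis experiments

existing = [1000, 500]                         # 1,000 + 2,000 = 3,000 IOPS
print(admit(existing, 300, REGULAR_ZONE_IOPS))  # True  (about 6,333 IOPS)
print(admit(existing, 200, REGULAR_ZONE_IOPS))  # False (8,000 IOPS)
```

Under this rule, a third process with a 300 us bound fits within the regular-zone capacity, whereas a 200 us bound would push the demand to 8,000 IOPS and would have to be rejected or served from NVMM.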

**Page reclaiming in NV-zone:** Once a page frame of NVMM in the NV-zone is mapped to a process address space, it is possible to reclaim the page under high memory pressure in NVMM. However, since we assume the capacity of NVMM is much larger than DRAM, the page frames in the NV-zone are not subject to page reclaiming in the prototype of NVSwap. In its implementation, we set kswapd to simply skip the pages in the NVMM zones during page scanning for replacement. We plan to add page reclaiming support for the NV-zone as future work. Currently, all pages in the NV-zone are freed only after the exit of the processes that reference them. The freed pages are added back to the persistent write buffer for serving future paging requests.

**Reducing writes to NVMM:** Many approaches have been proposed to reduce the number of writes to NVMM during paging, thus increasing its lifetime [13, 38]. In NVSwap, we focus on the software design related to the enforcement of paging latency. However, we believe our design can also benefit from those schemes. For example, without violating the latency requirement, NVSwap may swap out less-frequently accessed pages or read-only pages (e.g., those that store program code) to the persistent write buffer.

**Setting the user-perceived latency bounds:** In this thesis, we focus on enforcing the page-in latency, which is directly correlated to user-perceived latency as shown in Figure 1. The relationship between the paging latency and user-perceived application-level latency can be captured either using classical mathematical models (e.g., linear regression modeling) or using machine-learning approaches (e.g., supervised learning). We experimentally demonstrate the relationship in Section 6.2.1. In a separate thesis, we will discuss our findings on these in detail.

#### **CHAPTER 6: EVALUATION**

In this section, we present results from a comprehensive evaluation of our prototype implementation of NVSwap in the Linux kernel. Our experiments examine the following three questions: (1) How effective is the latency control module for latency enforcement? (2) Does the module provide performance isolation between paging processes? (3) How effective is the approach of in-situ paging compared to others?

### 6.1 System Setup

We implemented NVSwap in Linux kernel 3.16.74, which is a longterm stable version. We instrumented the Linux /proc file system to pass the value of the X<sup>th</sup> percentile latency bound specified by users for the corresponding processes to the kernel. By default, processes are not latency-sensitive. Other code modifications are in the virtual memory management system, for example in the do\_swap\_page() function for in-situ paging, and in the shrink\_page\_list() function to select reclaimed pages using the latency control module.

For the experiments, we used a server configured with a 6-core Intel Xeon X5670 2.93 GHz CPU, 32 GB DRAM, one 1 TB hard disk (Seagate Barracuda 7200.12), and one 128 GB SSD (OCZ-VERTEX 4). The hard disk is used to host the operating system. The SSD is used to host the regular-zone. In most of the experiments, we configured the computer so that the kernel can only address 5 GB DRAM as the main memory. A reserved DRAM space is used to emulate NVMM, which hosts the persistent write buffer and the NV-zone. The size of the emulated NVMM is 10 GB by default.

We model NVMM using DRAM on the server using an emulation-based approach. Our emulator is similar to those used in other projects [18, 26, 33, 34]. Specifically, our NVMM emulator introduces extra latency for NVMM writes and reads in routines that write to or read from DRAM. The delay is determined using the worst-case read/write latency in published data [2, 21, 30, 32]. We set the read and write latency of NVMM to be 100 ns and 150 ns respectively. We create delays using a software spin loop [26, 33] that uses the x86 RDTSCP instruction to read the processor timestamp counter and spins until the counter reaches the intended delay. For sequential access, we also model NVMM bandwidth by inserting a proper delay after the request sequence completes to limit the effective bandwidth. Specifically, the bandwidth is limited to 10 GB/s for writes and 35 GB/s for reads in the experiments. A similar approach was used in Mnemosyne [34].
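The spin-loop delay can be illustrated in user space with a monotonic clock. This is only a sketch of the idea: `time.perf_counter_ns()` stands in for reading the timestamp counter, the function names are hypothetical, and the interpreter overhead of Python is itself far larger than 150 ns, so the sketch is not usable for measurement.

```python
import time


def spin_delay(delay_ns):
    """Busy-wait until at least delay_ns nanoseconds have elapsed."""
    deadline = time.perf_counter_ns() + delay_ns
    while time.perf_counter_ns() < deadline:
        pass


def emulated_nvmm_write(buf, offset, data, write_delay_ns=150):
    """Write to a DRAM-backed buffer, then add the emulated NVMM write latency."""
    buf[offset:offset + len(data)] = data
    spin_delay(write_delay_ns)


nvmm = bytearray(4096)                   # one emulated NVMM page
start = time.perf_counter_ns()
emulated_nvmm_write(nvmm, 0, b"page data")
print(time.perf_counter_ns() - start >= 150)  # True
```

Spinning instead of sleeping keeps the delay at nanosecond granularity, which a scheduler-based sleep could not provide.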

We used the *YCSB* [3] benchmark in our experiments. YCSB is a framework developed for benchmarking cloud system performance. It provides a YCSB client for workload generation and a variety of database backends. In the experiments, we use memcached, an in-memory key-value database, as the default backend [7]. Unless otherwise specified, we use Workload A with a 50/50 read/update ratio by default. The database record size is 1 KB. Its request distribution is zipfian [10]. Both recordcount and operationcount are set to 3 million. To demonstrate the practical effectiveness of NVSwap, we experimented with other workloads including B, C, and F. We also varied the I/O characteristics of workloads, e.g., DB record size and operation types. We use both dedicated workloads and concurrent workloads to generate paging requests. The configurations of the corresponding experiments are given in the following sections.

### 6.2 Latency Enforcement

In this section, we present several experiments based on the YCSB benchmark that show the effectiveness of NVSwap in enforcing the latency bounds with both single and concurrent workloads.

### 6.2.1 Single Workloads

We study the effectiveness of latency control using NVSwap with a single memcached server accessing the swap space. We compare NVSwap to the Linux swap system without latency control. For NVSwap, we set the 99<sup>th</sup> percentile latency bound to 200 us for page-in requests. Figure 5 shows the results. We have the following observations. First, from Figure 5(a), we observe that the 99<sup>th</sup> percentile latency is reduced from 345 us to 192 us, which is below the 200 us latency bound. The result shows the effectiveness of NVSwap in enforcing latency bounds. It achieves the QoS goal by synergistically managing the disk scheduling queue and NVMM allocation. For example, the memcached server wrote 418,577 pages to the persistent paging buffer. The minimum latency of page-in requests is reduced from 86 us to 0.08 us because serving the requests using NVMM has a smaller latency than using SSDs. Second, from Figure 5(b), we observe that the 95<sup>th</sup> and 99<sup>th</sup> percentile latencies of page-out requests are reduced by 99.6% and 96.7% respectively. This shows that using the persistent paging buffer can significantly alleviate the write interference on SSDs and reduce the congestion in the disk scheduling queue while also exploiting the locality of workloads. Third, the latency of application-level requests was reduced by up to 74%, and the 99<sup>th</sup> percentile latency of DB read and update operations is comparable to that of OS page-in requests.






Figure 5: Single Workloads experiment. (a), (b), (c), and (d) show the minimum, 95<sup>th</sup> percentile, and 99<sup>th</sup> percentile latency of OS page-in, page-out, DB update, and DB read records respectively with one instance of memcached. We set the 99<sup>th</sup> percentile latency bound of page-in requests to be 200 us which is indicated by the green line.

### 6.2.2 Concurrent Homogeneous Workloads

In this experiment, we study the effectiveness of NVSwap with three concurrent memcached servers accessing the workloads, which are named Workload A1, A2, and A3 for the convenience of discussion. We set the 99<sup>th</sup> percentile latency bounds for the three workloads as 1000 us, 500 us, and 300 us respectively. The results are shown in Figure 6. We can observe that the 99<sup>th</sup> percentile latency of page-in requests is 975 us, 498 us, and 291 us for Workload A1, A2, and A3 respectively. They all meet the latency requirements. In addition, the latency of DB operations is also enforced to the same level of page-in latency bounds. For example, the 99<sup>th</sup> percentile latency of DB update of Workload A3 is 334 us while its corresponding latency of page-in is 291 us. Another observation is that there is a long tail in the latency distribution as shown in Figure 6(a). This is because SSD writes incur long latency after the persistent paging buffer becomes full.



Figure 6: Concurrent Homogeneous Workloads experiment. (a) shows the latency distribution with three instances of memcached workloads. The 99<sup>th</sup> percentile latency bounds are set to 1000 us, 500 us, and 300 us for workload A1, A2, and A3 respectively. (b), (c), and (d) show the minimum, 95<sup>th</sup> percentile, and 99<sup>th</sup> percentile latency of OS page-in, DB update, and DB read operations.

### 6.2.3 Concurrent Heterogeneous Workloads

| Workload | Read% | Update% | 99<sup>th</sup> Percentile Latency Bound |
|----------|-------|---------|-------------------|
| A        | 50%   | 50%     | 1000 us           |
| B        | 95%   | 5%      | 500 us            |
| C        | 100%  | 0%      | 300 us            |



Table 2: Configurations of heterogeneous Workload A, B, and C.


Figure 7: Concurrent Heterogeneous Workloads experiment 1. (a) shows the latency distribution with three instances of memcached workloads. The 99<sup>th</sup> percentile latency bounds are set to 1000 us, 500 us, and 300 us for workload A, B, and C respectively. (b), (c), and (d) show the minimum, 95<sup>th</sup> percentile, and 99<sup>th</sup> percentile latency of OS page-in, DB update, and DB read records. Workload C does not have DB update operations.



Figure 8: Concurrent Heterogeneous Workloads experiment 2. (a), (b), (c), and (d) show the minimum, 95<sup>th</sup> percentile, and 99<sup>th</sup> percentile latency of OS page-in, DB update, DB read, and DB read-modify-write operations. Workload C does not have DB update operations. Only Workload F has read-modify-write operations.

In this section, we study the effectiveness of NVSwap using concurrent heterogeneous workloads. In the first experiment, we run three YCSB workloads A, B, and C. We show their configurations in Table 2. The DB record size is 1 KB. The results are shown in Figure 7. We observed that the 99<sup>th</sup> percentile latency of page-in requests is 737 us, 323 us, and 255 us for Workload A, B, and C respectively. They are smaller than their corresponding latency bounds, indicating that the QoS requirements were met with the help of NVSwap. In addition, because the larger ratio of DB read operations in the workloads leads to a higher cache hit ratio, the 99<sup>th</sup> percentile latency of DB operations is on average 33% smaller than the respective latency bound.

| Workload | Read% | Update% | Read-modify-write% | 99<sup>th</sup> Percentile Latency Bound |
|----------|-------|---------|--------------------|-------------|
| A        | 50%   | 50%     | 0%                 | 1000 us     |
| C        | 100%  | 0%      | 0%                 | 500 us      |
| F        | 50%   | 0%      | 50%                | 300 us      |

Table 3: Configurations of heterogeneous Workload A, C, and F.

In the second experiment, we run Workload A, C, and F, whose configurations are shown in Table 3. The DB record size is 1 KB. We show the results in Figure 8. We did not show the latency CDF because it is similar to the one observed in the first experiment. We have two observations. First, the 99<sup>th</sup> percentile latency of page-in requests is 956 us, 410 us, and 299 us for Workload A, C, and F respectively. They are smaller than their corresponding latency bounds. Second, the 99<sup>th</sup> percentile latency of DB read and update operations also meets the latency requirements, while that of read-modify-write operations is 370 us, which is 23% higher than the 300 us latency bound. The reason is that the read-modify-write operation has two phases: read and write. As a result, serving the additional write request increased the 99<sup>th</sup> percentile latency of the read-modify-write operations.

### 6.3 Changing Latency Bounds Dynamically

In this experiment, we show how latency bounds set dynamically at the process level are respected. We ran two workloads, A and C, sharing the swap space. Their initial 99<sup>th</sup> percentile latency bounds are set to 100 us and 1000 us respectively. Then the latency bound of Workload C is changed from 1000 us to 800 us at t = 200 seconds and from 800 us to 600 us at t = 450 seconds. Figure 9(a) shows the tail latency of DB operations in each 1-second time window during the execution of the two workloads.

At the start, the tail latency of the workloads matches the initial latency bounds as expected. Because of the latency bounds, the persistent paging buffer was used to serve Workload A and the regular-zone was used to serve Workload C. After the latency bound was reduced from 1000 us to 600 us for Workload C, more of its pages were directed to the persistent paging buffer to meet its QoS requirements. For example, after the tail latency became 600 us at t = 450 seconds, we see that up to 90% of paging requests were served by the persistent paging buffer. The latency of Workload A was not affected, showing the strong performance isolation between the two workloads. The overall paging performance is shown in Figure 9(b). We also observe that the measured 99<sup>th</sup> percentile latency of Workload A is smaller than 100 us. For Workload C, the 99<sup>th</sup> percentile latency is 748 us overall. This experiment shows that the latency bound can be dynamically set and enforced by NVSwap during the execution of processes. Performance isolation is achieved between latency-sensitive paging processes.



Figure 9: Dynamic latency bounds experiment. The 99<sup>th</sup> percentile latency bound of Workload C is changed from 1000 us to 800 us at t = 200 seconds and from 800 us to 600 us at t = 450 seconds. (a) shows the measured tail latency of DB operations in each 1-second time window. (b) shows the overall performance.

### 6.4 Impact of DB Record Size

In this section, we study the impact of DB record size. We ran one instance of Workload A and set the 99<sup>th</sup> percentile latency bound of VM page-in requests to 200 us. We increased the DB record size from 1 KB to 8 KB. From the results shown in Figure 10, we can observe that the 99<sup>th</sup> percentile latency of page-in requests is below 200 us, indicating the effectiveness of NVSwap for latency enforcement. Furthermore, we find that the 99<sup>th</sup> percentile latency of DB read and update operations is directly correlated with the record size. When the record size is not larger than the page size, the DB read and update operations can be served with just one page-in request. Therefore, the tail latency of page-in requests is comparable to that of DB operations at the application level. When the record size is 8 KB, NVSwap needs to swap in two pages to serve a single DB read/update operation. This increases the 99<sup>th</sup> percentile latency from 243 us to 455 us for DB reads, almost doubling it, and from 231 us to 325 us for DB updates.
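The page count behind this observation follows from simple arithmetic, sketched below. The sketch assumes 4 KB pages and page-aligned records, and it ignores any per-item metadata that memcached adds, so it gives a lower bound on the number of page-in requests per operation.

```python
import math

PAGE_SIZE = 4096  # bytes, assuming the 4 KB Linux page size


def pages_per_record(record_bytes, page_size=PAGE_SIZE):
    """Minimum number of pages spanned by one page-aligned record."""
    return math.ceil(record_bytes / page_size)


for size_kb in (1, 4, 8):
    print(size_kb, "KB ->", pages_per_record(size_kb * 1024), "page(s)")
```

Records of 1 KB and 4 KB fit in a single page, whereas an 8 KB record spans two pages, which matches the two page-in requests described above.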



Figure 10: DB Record Size experiment. The 99<sup>th</sup> percentile latency of page-in requests, DB read and update operations as we increase the DB record size from 1 KB to 8 KB. We set the 99<sup>th</sup> percentile latency bound of page-in requests to be 200 us which is indicated by the green line.

### 6.5 Impact of NVMM Size

NVSwap uses NVMM to host both the persistent paging buffer and the NV-zone. In this experiment, we show the impact of NVMM size on the tail latency of requests. Specifically, we run one instance of Workload A and use the default size of the Linux scheduling queue (128). We deactivated the function that adjusts the queue size in the latency control module. Then we measure the 99<sup>th</sup> percentile latency as the NVMM size is increased from 2.5 GB to 3.5 GB and 4.5 GB. The results are shown in Figure 11. The 99<sup>th</sup> percentile latency of page-in requests is very sensitive to the NVMM size. For example, the latency is reduced by 70% as we increase the NVMM size from 2.5 GB to 4.5 GB. At the application level, the tail latency of DB read and update is reduced by 45% and 42% respectively. The reduction is smaller than for page-in requests because the software overhead of operations in memcached (e.g., slab management) does not change as the NVMM size is increased.



Figure 11: NVMM Size experiment. The 99<sup>th</sup> percentile latency of page-in requests, DB read and update operations as we increase the NVMM size from 2.5 GB to 4.5 GB.

### 6.6 Comparison with Other Systems

We compared NVSwap with other state-of-the-art systems that support swapping using NVMM. Among them, we chose to implement Dr. swap as it is a page-level solution and provides direct reads from NVMM, which is similar to the in-situ paging used in NVSwap. Because Dr. swap was not designed to provide latency enforcement, we only study the performance of NVSwap without using the latency control module. In the experiment, both Dr. swap and NVSwap only access NVMM for paging. No regular-zone is configured. We ran two instances of Workload A concurrently accessing NVMM. The 99<sup>th</sup> percentile latency of DB operations is shown in Figure 12. Since kernel-level tracing is disabled, we do not show the latency of page-in requests, for fairness of the study. The results show that the tail latency of NVSwap is 0.8% higher than that of Dr. swap. The reason is that NVSwap needs to copy the page from the persistent write buffer to the NV-zone, which is then mapped to process address spaces. In contrast, Dr. swap maps the page directly without the additional latency of a memory copy.



Figure 12: Comparison of latency by NVSwap versus Dr. swap.



### 6.7 Experiments with Real NVMM


Figure 13: Single Workload experiment with Real NVMM. (a), (b), (c), and (d) show the minimum, 95<sup>th</sup> percentile, and 99<sup>th</sup> percentile latency of OS page-in, page-out, DB update, and DB read records respectively with one instance of memcached. We set the 99<sup>th</sup> percentile latency bound of page-in requests to be 200 us which is indicated by the green line.

To further demonstrate the effectiveness of latency control using NVSwap, we repeated the experiments on a server with real NVMM. The Optane DIMM Server is configured with an 8-core Intel Xeon Scalable Silver 4208 2.1 GHz CPU, 32 GB DRAM, two 128 GB Intel Optane DC Persistent Memory modules, and a 240 GB SSD (Samsung PM883 Series 2.5" SATA 6Gb/s). To ensure that the modified kernel works well on the Optane DIMM Server, we implemented NVSwap in Linux kernel 4.4.241, which offers support for NVMM. To avoid introducing unexpected differences, we kept the experiment configurations consistent with the emulated setup. We configured the Optane DIMM Server so that the kernel can only address 5 GB of main memory, and we reserved 10 GB of NVMM to host the persistent write buffer and the NV-zone.

*Single Workload Experiment with real NVMM.* In the experiment, we set the 99<sup>th</sup> percentile latency bound to 200 us for page-in requests. Figure 13 shows the results. From Figure 13(a), we observe that the 99<sup>th</sup> percentile latency is reduced from 527.2 us to 143.2 us, which is below the 200 us latency bound. The result indicates the effectiveness of NVSwap in enforcing latency bounds in a real NVMM environment. In addition, the minimum latency of page-in requests is reduced from 27.1 us to 0.211 us. Moreover, Figure 13(b) shows that the 95<sup>th</sup> and 99<sup>th</sup> percentile latencies of page-out requests are reduced by 98.6% and 83.4% respectively. All these results are consistent with the experimental results and analysis from the emulation environment.

*Concurrent Homogeneous Workloads.* In this experiment, as in the concurrent homogeneous workloads experiment in the emulated environment, we run three instances of YCSB Workload A. We set the 99<sup>th</sup> percentile latency bounds for the three workloads to 1000 us, 500 us, and 300 us respectively. The results are shown in Figure 14.

*Concurrent Heterogeneous Workloads.* We also reproduced the Concurrent Heterogeneous Workloads experiment on the new platform. We run three YCSB workloads A, B, and C and set the 99<sup>th</sup> percentile latency bounds for the three workloads to 1000 us, 500 us, and 300 us respectively. The results are shown in Figure 15.

From the results of these experiments, we find that although the data varies because of the difference in hardware environments, the results with real NVMM and with emulated NVMM are similar, and the latency bound is met in all experiments where one is specified. This demonstrates the effectiveness of latency control using NVSwap with real NVMM.



Figure 14: Concurrent Homogeneous Workloads experiment with real NVMM. The 99<sup>th</sup> percentile latency bounds are set to 1000 us, 500 us, and 300 us for workload A1, A2, and A3 respectively. (a), (b), and (c) show the minimum, 95<sup>th</sup> percentile, and 99<sup>th</sup> percentile latency of OS page-in, DB update, and DB read operations.



Figure 15: Concurrent Heterogeneous Workloads experiment with real NVMM. The 99<sup>th</sup> percentile latency bounds are set to 1000 us, 500 us, and 300 us for workload A, B, and C respectively. (a), (b), and (c) show the minimum, 95<sup>th</sup> percentile, and 99<sup>th</sup> percentile latency of OS page-in, DB update, and DB read records. Workload C does not have DB update operations.

#### **CHAPTER 7: CONCLUSION**

In this thesis, we studied the problem of latency-aware paging in the virtual memory of operating systems. We proposed a novel paging scheme called NVSwap, which provides a cost-effective hybrid swap space using both NVMM and SSD. It allows the setting of an X<sup>th</sup> percentile page-in latency bound for a single process or a group of processes. NVSwap controls the host swap device, the memory page reclaim rate, the scheduling queue size in DRAM, and the allocation of the persistent paging buffer in NVMM for paging processes. We implemented NVSwap in Linux kernel 3.16.74. Our evaluation with a diverse set of YCSB workloads shows that NVSwap can enforce the tail latency while providing strong performance isolation for latency-sensitive processes. As future work, we plan to design and implement page reclaiming in the NV-zone and automate the setting of page-in latency bounds according to user-perceived application-level latency bounds.

#### REFERENCES

- [1] 2011. Tunable Watermark. https://lwn.net/Articles/422291/.
- [2] 2018. Intel Optane DIMMs. https://blocksandfiles.com/2018/12/13/intel-confirms-optanedimm-and-ssd-speed/a.
- [3] 2018. Yahoo! Cloud Serving Benchmark. https://github.com/brianfrankcooper/YCSB/wiki.
- [4] 2019. Apache Cassandra Database. http://cassandra.apache.org/.
- [5] 2019. Apache HBase Database. https://hbase.apache.org.
- [6] 2019. High Performance Computing using Linux. http://events.linuxfoundation.org/sites/events/files/slides/LinuxCon.
- [7] 2019. A High-performance, Distributed Memory Object Caching System. https://memcached.org/.
- [8] 2019. Intel Announces Cascade Lake: Up to 56 Cores and Optane Persistent Memory DIMMs. https://www.tomshardware.com/reviews/intel-cascade-lake-xeon-optane,6061-3.html.
- [9] Amro Awad, Sergey Blagodurov, and Yan Solihin. 2016. Write-Aware Management of NVM-based Memory Extensions. In *Proceedings of the 2016 International Conference on Supercomputing (ICS '16)*. ACM, New York, NY, USA, Article 9, 12 pages. https://doi.org/10.1145/2925426.2926284
- [10] Paul E. Black. 2019. Zipfian Distribution. https://xlinux.nist.gov/dads/HTML/zipfian.html.
- [11] Daniel Bovet and Marco Cesati. 2005. Understanding The Linux Kernel. O'Reilly & Associates Inc.

- [12] Feng Chen, David A. Koufaty, and Xiaodong Zhang. 2009. Understanding Intrinsic Characteristics and System Implications of Flash Memory Based Solid State Drives. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '09). ACM, New York, NY, USA, 181–192. https://doi.org/10.1145/1555349.1555371
- [13] X. Chen, E. H. Sha, W. Jiang, Q. Zhuge, Junxi Chen, Jiejie Qin, and Yuansong Zeng. 2016. The design of an efficient swap mechanism for hybrid DRAM-NVM systems. In 2016 International Conference on Embedded Software (EMSOFT). 1–10. https://doi.org/10.1145/2968478.2968497
- [14] Mel Gorman. 2004. Understanding the Linux Virtual Memory Manager.
- [15] Ajay Gulati, Irfan Ahmad, and Carl A. Waldspurger. 2009. PARDA: Proportional Allocation of Resources for Distributed Storage Access. In 7th USENIX Conference on File and Storage Technologies (FAST 09). USENIX Association, San Francisco, CA. https://www.usenix.org/conference/fast-09/parda-proportional-allocation-resourcesdistributed-storage-access
- [16] Ajay Gulati, Ganesha Shanmuganathan, Xuechen Zhang, and Peter Varman. 2012. Demand Based Hierarchical QoS Using Storage Resource Pools. In *Presented as part of the 2012* USENIX Annual Technical Conference(USENIXATC12).USENIX,Boston,MA,1–13. https://www.usenix.org/conference/atc12/technical-sessions/presentation/gulati
- [17] Weichao Guo, Kang Chen, Huan Feng, Yongwei Wu, Rui Zhang, and Weimin Zheng. 2016.
   MARS : Mobile Application Relaunching Speed-Up Through Flash-Aware Page Swapping.
   IEEE Trans. Comput. 65, 3 (March 2016), 916–928.
   https://doi.org/10.1109/TC.2015.2428692

[18] Jian Huang, Karsten Schwan, and Moinuddin K. Qureshi. 2014. NVRAM-aware Logging in Transaction Systems. Proc. VLDB Endow. 8, 4 (Dec. 2014).

[19] iometer 1998. The IOMETER benchmark. http://www.iometer.org.

- [20] Ju-Young Jung and Sangyeun Cho. 2013. Memorage: Emerging Persistent RAM Based Malleable Main Memory and Storage Architecture. In *Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13)*. ACM, New York, NY, USA, 115–126. https://doi.org/10.1145/2464996.2465005
- [21] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. 2009. Architecting Phase Change Memory As a Scalable Dram Alternative. In *Proceedings of the 36<sup>th</sup> Annual International Symposiumon Computer Architecture (ISCA '09)*.
- [22] Ning Li, Hong Jiang, Dan Feng, and Zhan Shi. 2016. PSLO: Enforcing the X<sup>th</sup> Percentile Latency and Throughput SLOs for Consolidated VM Storage. In *Proceedings of the Eleventh European Conference on Computer Systems (EuroSys '16)*. ACM, New York, NY, USA, Article 28, 14 pages. https://doi.org/10.1145/2901318.2901330
- [23] K. Liu, X. Zhang, K. Davis, and S. Jiang. 2013. Synergistic coupling of SSD and hard disk for QoS-aware virtual memory. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 24–33. https://doi.org/10.1109/ISPASS.2013.6557143
- [24] 3D XPoint Memory. 2019. https://www.intel.com/content/www/us/en/architecture-andtechnology/intel-optane-technology.html.
- [25] T. Newhall and D. Woos. 2011. Incorporating Network RAM and Flash into Fast Backing Store for Clusters. In 2011 IEEE International Conference on Cluster Computing. 121–129. https://doi.org/10.1109/CLUSTER.2011.22

- [26] Jiaxin Ou, Jiwu Shu, and Youyou Lu. 2016. A High Performance File System for Nonvolatile Main Memory. In Proceedings of the Eleventh European Conference on Computer Systems (EuroSys '16).
- [27] Stan Park and Kai Shen. 2012. FIOS: A Fair, Efficient Flash I/O Scheduler. In *Proceedings* of the 10th USENIX Conference on File and Storage Technologies (FAST'12). USENIX Association, Berkeley, CA, USA, 13–13. http://dl.acm.org/citation.cfm?id=2208461.2208474
- [28] Waleed Reda, Marco Canini, Lalith Suresh, Dejan Kostić, and Sean Braithwaite. 2017. Rein: Taming Tail Latency in Key-Value Stores via Multiget Scheduling. In *Proceedings of the Twelfth European Conference on Computer Systems (EuroSys '17)*. ACM, New York, NY, USA, 95–110. https://doi.org/10.1145/3064176.3064209
- [29] Mohit Saxena and Michael M. Swift. 2010. FlashVM: Virtual Memory Management on Flash. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (USENIXATC'10). USENIX Association, Berkeley, CA, USA, 14–14. http://dl.acm.org/citation.cfm?id=1855840.1855854
- [30] Suman Nath Shimin Chen, Phillip B. Gibbons. 2011. Rethinking Database Algorithms for Phase Change Memory. In CIDR'11: 5th Biennial Conference on Innovative Data Systems Research.
- [31] Abraham Silberschatz, Peter B. Galvin, and Greg Gagne. 2012. *Operating System Concepts* (9th ed.). Wiley Publishing.
- [32] Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and Roy H. Campbell.
   2011. Consistent and Durable Data Structures for Non-volatile Byte-addressable Memory.
   In Proceedings of the 9<sup>th</sup> USENIX Conference on File and Stroage Technologies (FAST'11).

- [33] Haris Volos, Sanketh Nalli, Sankarlingam Panneerselvam, Venkatanathan Varadarajan, Prashant Saxena, and Michael M. Swift. 2014. Aerie: Flexible File-system Interfaces to Storage-class Memory. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys '14).
- [34] Haris Volos, Andres Jaan Tack, and Michael M. Swift. 2011. Mnemosyne: Lightweight Persistent Memory. SIGPLAN Not. 47, 4 (March 2011), 91–104.
- [35] Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Randy Katz, and Ion Stoica. 2012.
  Cake: Enabling High-level SLOs on Shared Storage Systems. In *Proceedings of the Third ACM Symposium on Cloud Computing (SoCC '12)*. ACM, New York, NY, USA, Article 14, 14 pages. https://doi.org/10.1145/2391229.2391243
- [36] J. Zhang, A. Riska, A. Sivasubramaniam, Q. Wang, and E. Riedel. 2005. Storage performance virtualization via throughput and latency control. In 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems. 135–142. https://doi.org/10.1109/MASCOTS.2005.70
- [37] K.Zhong, D.Liu, L.Liang, X.Zhu, L.Long, Y.Wang, and E.H.Sha. 2016. Energy-Efficient In-Memory Paging for Smartphones. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 10 (Oct 2016)*, 1577–1590. https://doi.org/10.1109/TCAD.2015.2512904
- [38] K. Zhong, D. Liu, L. Long, J. Ren, Y. Li, and E. H. Sha. 2017. Building NVRAM-Aware Swapping Through Code Migration in Mobile Devices. *IEEE Transactions on Parallel and Distributed Systems 28, 11 (Nov 2017)*, 3089–3099. https://doi.org/10.1109/TPDS.2017.2713780

- [39] G. Zhu, K. Lu, X. Wang, Y. Zhang, P. Zhang, and S. Mittal. 2017. SwapX: An NVM-Based Hierarchical Swapping Framework. *IEEE Access 5 (2017)*, 16383–16392. https://doi.org/10.1109/ACCESS.2017.2737634
- [40] Timothy Zhu, Alexey Tumanov, Michael A. Kozuch, Mor Harchol-Balter, and Gregory R. Ganger. 2014. PriorityMeister: Tail Latency QoS for Shared Networked Storage. In *Proceedings of the ACM Symposium on Cloud Computing (SOCC '14)*. ACM, New York, NY, USA, Article 29, 14 pages. https://doi.org/10.1145/2670979.2671008
- [41] X. Zhu, D. Liu, K. Zhong, Jinting Ren, and T. Li. 2017. SmartSwap: High-performance and user experience friendly swapping in mobile systems. In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC). 1–6. https://doi.org/10.1145/3061639.3062317