Heterogeneous memory systems that comprise memory nodes based on widely different device technologies (e.g., DRAM and nonvolatile memory (NVM)) are emerging in various computing domains ranging from high-performance to embedded computing. Despite the extensive prior work on architectural and system software support for heterogeneous memory systems, relatively little work has been done to investigate OS-level memory placement and migration policies that consider the bandwidth differences of heterogeneous memory nodes.
INTRODUCTION
Heterogeneous memory systems comprise memory nodes that exhibit widely different architectural characteristics. The key idea behind heterogeneous memory systems is to provide useful properties such as performance and durability that cannot be achieved by employing just a single type of memory. For instance, heterogeneous memory systems that consist of DRAM and nonvolatile memory (NVM) nodes can effectively provide both DRAM-like performance and storage-class persistence. Heterogeneous memory systems are emerging in various computing domains ranging from high-performance [14] to embedded [3] computing.
Prior work has extensively investigated efficient architectural [1, 23] and system software [12] support for heterogeneous memory systems. In particular, Kannan et al. present the design and implementation of a virtual memory manager (VMM) that manages the DRAM and NVM nodes in a single virtual address space [12]. One of the main design objectives of VMMs for heterogeneous memory systems is to provide capacity scaling by simultaneously utilizing the DRAM and NVM nodes.
However, the existing system software techniques for heterogeneous memory systems allocate physical pages in a bandwidth-oblivious manner, which fails to achieve the best possible performance. Since the performance of the heterogeneous memory nodes differs widely, the system software must be able to place and migrate memory pages by judiciously considering the difference in the performance of heterogeneous memory nodes.
To bridge this gap, this work proposes bandwidth-aware memory placement and migration policies for bandwidth-intensive applications (which are especially common in high-performance computing (HPC)) on heterogeneous memory systems. Specifically, we present the design and implementation of three bandwidth-aware memory placement policies and a bandwidth-aware memory migration policy by extending the virtual memory manager of the Linux kernel. Through our quantitative evaluation, we demonstrate the effectiveness of the proposed bandwidth-aware policies for heterogeneous memory systems.
Specifically, this paper makes the following contributions:
• We present three bandwidth-aware memory placement policies and a bandwidth-aware memory migration policy. The proposed policies dynamically allocate and migrate memory pages by judiciously considering the bandwidth difference of the heterogeneous memory nodes.
• We implement and evaluate the bandwidth-aware memory placement and migration policies by extending the virtual memory manager of the Linux kernel. We demonstrate that the Linux kernel code can be effectively extended to manage the memory subsystem in a bandwidth-aware manner.
• We quantify the effectiveness of the proposed bandwidth-aware memory placement and migration policies using a real system. Our quantitative evaluation shows that the bandwidth-aware policies achieve significantly higher performance (i.e., up to 35.1%) than the conventional memory policies across a wide range of DRAM-to-NVM ratios when executing memory-intensive workloads.
The rest of this paper is organized as follows. Section 2 provides the background information and motivation for this work. Section 3 presents the design and implementation of the bandwidth-aware memory placement and migration policies. Section 4 describes the experimental methodology. Section 5 quantifies the effectiveness of the proposed bandwidth-aware policies. Section 6 summarizes related work. Section 7 concludes the paper.
BACKGROUND AND MOTIVATION
2.1 Physical Memory Description
The Linux kernel describes physical memory in an architecture-independent way [10]. Figure 1 shows the physical memory organization of Linux. Major memory-organization components in the Linux kernel are nodes, zones, and pages.
Each memory node represents a set of DIMMs that are associated with the same CPU. Each memory node consists of zones, each of which represents a range of memory and is suitable for a different type of usage. For instance, there are three zones in the x86 architecture: ZONE_DMA (0MB-16MB), ZONE_NORMAL (16MB-896MB), and ZONE_HIGHMEM (896MB-end). Finally, each physical page frame is described using a page.
Memory Placement and Migration Policies
Non-uniform memory access (NUMA) systems comprise multiple CPU sockets and DIMMs. Each CPU socket is locally connected with a memory node that consists of one or multiple DIMMs. In addition, each CPU socket is remotely connected with other CPU sockets and DIMMs via the interconnection network. A majority of modern NUMA systems provide a single globally-addressable physical memory space with support for cache coherence.
In NUMA systems, memory accesses can be largely classified into local and remote memory accesses. A processor core in a CPU socket performs a local (remote) memory access when it accesses a memory object placed in a local (remote) memory node. Remote memory accesses incur a higher latency than local memory accesses due to the data transmission overheads via the interconnection network [7]. Load imbalance across the memory nodes is also a potential performance bottleneck of NUMA systems because it may cause significant contention on the memory controllers of the frequently accessed memory nodes and partial utilization of the available memory bandwidth provided by the system. NUMA systems generally achieve high performance when (1) the ratio of the local to remote memory accesses is high and (2) the memory traffic is evenly distributed across the memory nodes.
Figure 2: System architecture with heterogeneous memory
To achieve robust performance on NUMA systems, the Linux kernel provides several memory placement and migration policies. The representative memory placement policies supported by the Linux kernel are the local and interleave policies.
The local policy places a physical memory page in a local memory node of the processor core where the task that has first touched the page is currently scheduled. The main advantage of the local policy is the improved locality. On the other hand, the local policy may cause load imbalance across the memory nodes.
The interleave policy places physical pages across the memory nodes in a round-robin manner without considering the physical location of the tasks that access the pages. The main advantage of the interleave policy is the improved load balancing across the memory nodes. On the other hand, the interleave policy may incur frequent remote memory accesses.
The major role of memory migration policies is to determine which of the allocated physical pages need to be migrated to which memory nodes to improve the overall performance of NUMA systems. One of the representative memory migration policies integrated into the mainline Linux kernel is the automatic NUMA balancing (ANB) policy [11].
To dynamically collect the memory access pattern of the tasks, ANB periodically marks a subset of the pages as NUMA pages [11]. When a task accesses a page marked as a NUMA page, a NUMA hinting page fault is triggered. The page fault handler then collects the runtime data (e.g., the processor core where the task is scheduled), which ANB uses to perform page migrations for reducing the number of remote memory accesses.
Non-Volatile Memory
Nonvolatile memory (NVM) is an emerging memory technology that provides useful properties such as durability, byte addressability, and high density. NVMs such as phase-change memory (PCM) [20] are expected to achieve 100× higher performance than current SSDs and 2-5× lower performance than DRAM [12] .
To eliminate the potential performance overheads of the block-device interface, prior work has extensively investigated the design and implementation of NVM-based systems in which the processors can directly access the NVM memory space by performing regular memory operations [4, 6]. Figure 2 shows the overall architecture of a heterogeneous memory system that comprises DRAM and NVM nodes.
In line with the prior work [6, 12, 21] on system-software support for heterogeneous memory systems with DRAM and NVM, we consider the baseline architecture in which heterogeneous memory nodes are connected to the CPUs through separate memory channels. Intel has built a hardware prototype based on this architecture [6]. Allocating separate channels for heterogeneous memory nodes has advantages such as eliminating their performance interference.
Recent work has investigated the design and implementation of a virtual memory manager (VMM) that manages the DRAM and NVM nodes in a single virtual address space for each process [12] .
The main advantage of the VMM-based design for heterogeneous memory systems is the automatic memory capacity scaling with which the operating system can effectively exploit all the available capacity of the DRAM and NVM nodes with little or no user intervention [12].
Need for Bandwidth-Aware Memory Placement and Migration Policies
The major drawback of the conventional memory placement and migration policies is that they are bandwidth-oblivious, which leads to the sub-optimal execution of multithreaded applications on heterogeneous memory systems. To motivate the need for bandwidth-aware memory placement and migration policies, we consider a simple example (inspired by the example discussed in [1]) with a bandwidth-intensive multithreaded application whose execution time is mainly determined by the time for transferring the data (S) from the memory subsystem to the CPUs.
We assume that the underlying heterogeneous memory system comprises a DRAM node with a bandwidth $B_{DRAM}$ and an NVM node with a bandwidth $B_{NVM}$. In addition, we assume that the portions of the data placed in the DRAM and NVM nodes are $p$ and $1 - p$, respectively. Then, the time spent transferring the data from the DRAM ($t_{DRAM}$) and NVM ($t_{NVM}$) nodes is computed using Equations 1 and 2.
The total execution time of the application is determined by the longer of the DRAM and NVM transfer times. Specifically, the total execution time is computed using Equation 3.
The total execution time is minimized if and only if $t_{DRAM}$ and $t_{NVM}$ are equal. Without loss of generality, let us suppose that $t_{DRAM} > t_{NVM}$. If so, we can keep reducing the total execution time by decreasing $p$ until $t_{DRAM}$ and $t_{NVM}$ become equal. With $t_{DRAM} = t_{NVM}$, the optimal portion ($p_{OPT}$) of the data placed in the DRAM node is computed using Equation 4.
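Written out from the definitions above (data size $S$, DRAM share $p$, bandwidths $B_{DRAM}$ and $B_{NVM}$), the relations referred to as Equations 1 through 4 can be expressed as follows. This is a reconstruction from the surrounding definitions, so the exact typesetting is an assumption.

```latex
\begin{align}
t_{DRAM} &= \frac{p \cdot S}{B_{DRAM}} \tag{1} \\
t_{NVM}  &= \frac{(1 - p) \cdot S}{B_{NVM}} \tag{2} \\
t        &= \max\left(t_{DRAM},\; t_{NVM}\right) \tag{3} \\
p_{OPT}  &= \frac{B_{DRAM}}{B_{DRAM} + B_{NVM}} \tag{4}
\end{align}
```

Equation 4 follows from setting Equations 1 and 2 equal: $p \cdot S / B_{DRAM} = (1 - p) \cdot S / B_{NVM}$ gives $p \cdot B_{NVM} = (1 - p) \cdot B_{DRAM}$, and solving for $p$ yields $p_{OPT}$.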
The conventional bandwidth-oblivious memory placement policies fail to achieve the optimal partitioning of the data on heterogeneous memory systems. For instance, with the conventional interleave policy, the ratio of the data placed in the DRAM node will be 0.5 regardless of the node bandwidths, which is sub-optimal whenever the DRAM and NVM bandwidths differ.
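The gap between the interleave policy and the optimal partitioning can be illustrated with a small numerical sketch of the transfer-time model. The data size and bandwidth values are hypothetical (DRAM twice as fast as NVM), and the function names are introduced here for illustration only.

```python
def transfer_time(p, size, bw_dram, bw_nvm):
    """Execution time when a fraction p of `size` is placed in DRAM.

    The slower of the two concurrent transfers determines the total time.
    """
    t_dram = p * size / bw_dram
    t_nvm = (1.0 - p) * size / bw_nvm
    return max(t_dram, t_nvm)


def optimal_fraction(bw_dram, bw_nvm):
    """DRAM share that equalizes the DRAM and NVM transfer times."""
    return bw_dram / (bw_dram + bw_nvm)


# Hypothetical values: 12 units of data, DRAM bandwidth 2, NVM bandwidth 1.
size, bw_dram, bw_nvm = 12.0, 2.0, 1.0
p_opt = optimal_fraction(bw_dram, bw_nvm)
t_interleave = transfer_time(0.5, size, bw_dram, bw_nvm)
t_dram_only = transfer_time(1.0, size, bw_dram, bw_nvm)
t_optimal = transfer_time(p_opt, size, bw_dram, bw_nvm)
```

With these numbers, both the interleave split and a DRAM-only placement take about 6 time units, while the bandwidth-proportional split takes about 4, because neither node sits idle while the other is still transferring.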
The conventional bandwidth-oblivious memory migration policies also fail to achieve the optimal performance on heterogeneous memory systems. Even if the memory placement policy were modified to allocate pages in an optimal manner, the conventional bandwidth-oblivious migration policy would migrate pages solely based on the locality information, which makes the memory allocation ratio diverge from the optimal one as migrations gradually occur. Therefore, it is crucial to design and implement the bandwidth-aware memory placement and migration policies in a tightly integrated manner in order to improve the performance of heterogeneous memory systems.
DESIGN AND IMPLEMENTATION
This section discusses the design and implementation of the bandwidth-aware memory placement and migration policies. Section 3.1 discusses how we extend the existing memory organization in the Linux kernel to manage heterogeneous memory systems. Sections 3.2 and 3.3 present the design and implementation of three bandwidth-aware memory placement policies (i.e., bandwidth-aware interleave, random, and local) and a bandwidth-aware migration policy. Section 3.4 discusses the implementation of the proposed policies in the Linux kernel. Section 3.5 discusses the advanced design issues.
We mainly assume that the underlying heterogeneous memory system comprises two types of memory nodes (i.e., DRAM and NVM) as a majority of heterogeneous memory systems build on such architectural configurations [6]. Section 3.5 discusses how the techniques proposed in this work can be extended for heterogeneous memory systems with three or more memory clusters.
Heterogeneous Memory Description
We extend the memory organization of the Linux kernel by adding top-level components called memory clusters. A memory cluster consists of one or more physical memory nodes, each of which exhibits the same architectural characteristics (e.g., performance).
Then, a heterogeneous memory system can be modeled as a system that comprises multiple memory clusters.
The memory cluster data structure is used to represent memory clusters. The memory cluster structure contains a pointer variable that points to an array of the nodes that belong to the corresponding cluster. In addition, it comprises additional member variables (e.g., bandwidth) to specify the architectural characteristics of the corresponding memory cluster.
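The cluster abstraction can be sketched in user-space Python as follows. This is only a model of the description above, not the kernel data structure: the class and field names (`MemoryCluster`, `MemoryNode`, `node_id`) are hypothetical, and the bandwidth is stored as a relative weight.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryNode:
    node_id: int  # hypothetical identifier of a physical memory node


@dataclass
class MemoryCluster:
    name: str
    bandwidth: float  # relative bandwidth of every node in this cluster
    nodes: List[MemoryNode] = field(default_factory=list)


# A heterogeneous system modeled as a list of clusters (values are examples).
dram = MemoryCluster("DRAM", bandwidth=2.0, nodes=[MemoryNode(0)])
nvm = MemoryCluster("NVM", bandwidth=1.0, nodes=[MemoryNode(1)])
system = [dram, nvm]
total_bandwidth = sum(cluster.bandwidth for cluster in system)
```

Grouping nodes under a cluster lets a policy first pick a cluster by bandwidth and only then pick a node inside it, which matches the hierarchical decision described in the following sections.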
BW-Aware Memory Placement Policies

3.2.1 The BW-INTERLEAVE Policy. The bandwidth-aware interleave memory placement (BW-INTERLEAVE) policy allocates memory pages across the clusters in a round-robin manner while preserving the optimal allocation ratio (i.e., $p_{OPT}$) discussed in Section 2.4. Without loss of generality, we assume that the bandwidth ratio of DRAM to NVM is D : N. Algorithm 1 shows the pseudocode for the BW-INTERLEAVE policy. The BW-INTERLEAVE policy determines the location of a page in a hierarchical manner in that it first determines the cluster (Line 8) and then the node (Line 9).

Algorithm 1 The BW-INTERLEAVE policy
1: procedure getClusterInterleave(task)
2:     cluster ← DRAM
3:     if task.aCount ≥ D then
4:         cluster ← NVM
5:     task.aCount ← (task.aCount + 1) % (D + N)
6:     return cluster
7: procedure AllocInterleave(task)
8:     cluster ← getClusterInterleave(task)
9:     node ← getNodeInterleave(task, cluster)
10:    page ← allocPage(task, node)
11:    return page

Algorithm 2 The BW-RANDOM policy
1: procedure getClusterRandom(task)
2:     cluster ← DRAM
3:     r ← random() mod (D + N)
4:     if r ≥ D then
5:         cluster ← NVM
6:     return cluster
7: procedure AllocRandom(task)
8:     cluster ← getClusterRandom(task)
9:     node ← getNodeRandom(task, cluster)
10:    page ← allocPage(task, node)
11:    return page
To dynamically determine the cluster of a newly-allocated memory page, we augment the task structure of the Linux kernel with aCount, which tracks the number of pages allocated by the corresponding task (Line 5). If aCount is less than D, a new memory page is allocated in the DRAM cluster and aCount is incremented by one. If aCount is equal to or greater than D (Line 3), a new memory page is allocated in the NVM cluster. If aCount reaches D + N, it is reset to zero (Line 5).
After determining the target cluster, the BW-INTERLEAVE policy determines the target memory node (Line 9) within the cluster in an interleaved manner, similarly to the conventional interleave policy. Based on the target cluster and node information, a new physical page is allocated for the request (Line 10).
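The cluster-selection step of BW-INTERLEAVE can be simulated outside the kernel with a few lines of Python. This is a minimal sketch that assumes integer weights D and N and covers only the cluster choice; node selection within the cluster and the actual page allocation are left out, and the `Task` class is a stand-in for the augmented task structure.

```python
from collections import Counter

DRAM, NVM = "DRAM", "NVM"


class Task:
    """Stand-in for the task structure, holding only the aCount field."""

    def __init__(self):
        self.a_count = 0


def get_cluster_interleave(task, d, n):
    # The first d slots of every (d + n)-slot period go to DRAM, the rest to NVM.
    cluster = DRAM if task.a_count < d else NVM
    task.a_count = (task.a_count + 1) % (d + n)
    return cluster


def simulate(num_pages, d, n):
    task = Task()
    return Counter(get_cluster_interleave(task, d, n) for _ in range(num_pages))


counts = simulate(num_pages=300, d=2, n=1)
```

Because the counter wraps around every D + N allocations, the per-task split is exact over each full period, so a task never drifts from the D : N ratio by more than one period's worth of pages.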
3.2.2 The BW-RANDOM Policy. The bandwidth-aware random memory placement (BW-RANDOM) policy randomly determines the target cluster for a memory page while preserving the optimal allocation ratio (i.e., $p_{OPT}$). In contrast to the BW-INTERLEAVE policy, the BW-RANDOM policy provides a probabilistic guarantee that pathological memory placement across the memory clusters can be avoided.

Algorithm 3 The BW-LOCAL policy
1: procedure getClusterLocal(task)
2:     currCluster ← task.cluster
3:     aCount ← task.aCount[currCluster]
4:     cluster ← DRAM
5:     if currCluster = DRAM then
6:         if aCount ≥ D then
7:             cluster ← NVM
8:     else
9:         if aCount < N then
10:            cluster ← NVM
11:    task.aCount[currCluster] ← (aCount + 1) % (D + N)
12:    return cluster
13: procedure AllocLocal(task)
14:    cluster ← getClusterLocal(task)
15:    node ← getNodeLocal(task, cluster)
16:    page ← allocPage(task, node)
17:    return page

Algorithm 2 shows the pseudocode for the BW-RANDOM policy. To determine the target cluster, the BW-RANDOM policy generates a random number (r) and computes r mod (D + N) (Line 3). If the remainder is equal to or greater than D, the target cluster is determined to be the NVM cluster (Line 5). Otherwise, the target cluster is determined to be the DRAM cluster, preserving the optimal allocation ratio. After randomly determining the target memory cluster, the BW-RANDOM policy determines the target node in the target cluster in a hierarchical manner.
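The random cluster choice can be sketched in the same style. This sketch assumes integer weights D and N and uses Python's seeded `random.Random` in place of whatever random source the kernel implementation uses; `randrange(d + n)` plays the role of r mod (D + N).

```python
import random
from collections import Counter

DRAM, NVM = "DRAM", "NVM"


def get_cluster_random(rng, d, n):
    # Uniform draw over 0 .. d+n-1, standing in for r mod (D + N).
    r = rng.randrange(d + n)
    return NVM if r >= d else DRAM


rng = random.Random(0)  # fixed seed so the sketch is repeatable
num_pages = 30000
counts = Counter(get_cluster_random(rng, 2, 1) for _ in range(num_pages))
dram_share = counts[DRAM] / num_pages
```

Unlike the interleaved counter, the ratio here holds only in expectation: the DRAM share approaches D / (D + N) as the number of allocated pages grows, which is the probabilistic guarantee mentioned above.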
3.2.3 The BW-LOCAL Policy. The bandwidth-aware local memory placement (BW-LOCAL) policy allocates a memory page in a local node as much as possible while preserving the optimal allocation ratio (i.e., $p_{OPT}$). The main difference between the BW-LOCAL policy and the BW-INTERLEAVE and BW-RANDOM policies is that the BW-LOCAL policy employs the physical location of the task that requests a page allocation to determine the target memory cluster and node. Similarly to the BW-INTERLEAVE policy, the BW-LOCAL policy allocates a page in a hierarchical manner in that it first determines the target cluster and then the target memory node. Algorithm 3 shows the pseudocode for the BW-LOCAL policy. The task structure of the Linux kernel is augmented with an integer array (i.e., aCount), whose length is the number of memory clusters in the system. Each array element tracks the number of pages allocated by the task in the corresponding memory cluster.
The BW-LOCAL policy first checks the current memory cluster where the task that requests a page allocation is currently running (Line 2) and retrieves the allocation count of the corresponding cluster (Line 3). If the current cluster is the DRAM cluster, the BW-LOCAL policy allocates a memory page in the DRAM cluster if aCount[DRAM] is less than D and in the NVM cluster if aCount[DRAM] is equal to or greater than D (Lines 5-7).
If the current cluster is the NVM cluster, the BW-LOCAL policy allocates a memory page in the NVM cluster if aCount[NVM] is less than N and in the DRAM cluster if aCount[NVM] is equal to or greater than N (Lines 8-10). Once the target cluster has been determined, the BW-LOCAL policy updates the corresponding aCount by incrementing it by 1 and performing a modulo operation with D + N.

Algorithm 4 The BW-MIGRATION policy
1: migCount ← 0
2: procedure MigratePage(task, page)
3:     currCluster ← page.cluster
4:     destCluster ← task.cluster
5:     if currCluster ≠ destCluster then
6:         interClusterMigAllowed ← false
7:         currCount ← migCount
8:         nextCount ← 0
9:         if destCluster = DRAM then
10:            if currCount < M then
11:                interClusterMigAllowed ← true
12:                nextCount ← currCount + 1
13:        else    ▷ destCluster = NVM
14:            if currCount > −M then
15:                interClusterMigAllowed ← true
16:                nextCount ← currCount − 1
17:        if interClusterMigAllowed = false or atomicCompareAndSet(migCount, currCount, nextCount) = failed then
18:            destCluster ← currCluster
19:    doMigration(task, page, destCluster)

After determining the target memory cluster, the BW-LOCAL policy determines the target memory node for the memory allocation request (Line 15). If the target cluster is the same as the cluster where the requesting task is running, the BW-LOCAL policy attempts to allocate a page in the current node. Otherwise, the BW-LOCAL policy attempts to allocate a page in the node closest to the current node. The BW-LOCAL policy then allocates a physical page in the target memory cluster and node (Line 16).
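The cluster choice of BW-LOCAL can be modeled along the same lines. This sketch makes one assumption that the text leaves open: the counter that is advanced after each allocation is the one belonging to the cluster where the task is running, which is what keeps the D : N split exact over every D + N allocations. Node selection and page allocation are omitted, and the `Task` class is a stand-in for the augmented task structure.

```python
from collections import Counter

DRAM, NVM = "DRAM", "NVM"


class Task:
    """Stand-in for the task structure: current cluster plus per-cluster aCount."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.a_count = {DRAM: 0, NVM: 0}


def get_cluster_local(task, d, n):
    curr = task.cluster
    a_count = task.a_count[curr]
    cluster = DRAM
    if curr == DRAM:
        if a_count >= d:  # local (DRAM) quota of the period is used up
            cluster = NVM
    else:
        if a_count < n:   # local (NVM) quota of the period is still available
            cluster = NVM
    # Assumption: advance the counter of the cluster the task runs on.
    task.a_count[curr] = (a_count + 1) % (d + n)
    return cluster


def simulate(cluster, num_pages, d, n):
    task = Task(cluster)
    return Counter(get_cluster_local(task, d, n) for _ in range(num_pages))


on_dram = simulate(DRAM, 300, 2, 1)
on_nvm = simulate(NVM, 300, 2, 1)
```

In both cases the task receives its local share first within each period and spills to the other cluster only after the local quota is exhausted, so the overall split stays at D : N regardless of where the task runs.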
Bandwidth-Aware Migration Policy
The bandwidth-aware memory migration (BW-MIGRATION) policy dynamically migrates memory pages to improve the locality of memory accesses while preserving the optimal allocation ratio (i.e., $p_{OPT}$). Similarly to the bandwidth-aware memory placement policies, the BW-MIGRATION policy migrates memory pages in a hierarchical manner in that it first determines the target memory cluster and then the target memory node. Algorithm 4 shows the pseudocode for the BW-MIGRATION policy. The key idea behind the BW-MIGRATION policy is to keep the number of the pages migrated from the DRAM cluster to the NVM cluster and the number of the pages migrated from the NVM cluster to the DRAM cluster roughly the same in order to ensure that the optimal allocation ratio is preserved.
To guarantee that the number of pages migrated to each cluster is roughly the same, the BW-MIGRATION policy employs the inter-cluster migration threshold (i.e., M), which indicates the number of outstanding inter-cluster migrations that the BW-MIGRATION policy performs even though such outstanding migrations may make the actual allocation ratio temporarily deviate from the optimal one. In general, a larger value of M facilitates inter-cluster migrations while increasing the possibility of a suboptimal allocation ratio (and vice versa). The inter-cluster migration threshold can be configured at runtime through the sysfs interface.
In addition, the BW-MIGRATION policy employs the migration counter (i.e., migCount in Line 1), which indicates the allowed direction for the page migration. If the current value of migCount is within [−M + 1, M − 1], it allows any page migration request. If migCount is less than −M + 1, it only allows a migration request from the NVM cluster to the DRAM cluster. If migCount is larger than M − 1, it only allows a migration request from the DRAM cluster to the NVM cluster.
We now discuss the pseudocode for the BW-MIGRATION policy in depth. The BW-MIGRATION policy retrieves the current cluster of the page to migrate (Line 3) and the destination cluster where the task that has caused the NUMA hinting page fault is running (Line 4). If the current and destination clusters are the same, it simply migrates the page to a node in the same cluster (Line 19), similarly to the conventional bandwidth-oblivious memory migration policy.
If the current and destination clusters are different (Line 5), the BW-MIGRATION policy reads the current value of migCount (Line 7). If the destination cluster for the incoming migration request is the DRAM cluster (Line 9), the BW-MIGRATION policy checks if currCount, which holds the previously-observed value of migCount, is no greater than M − 1 (Line 10). If so, the BW-MIGRATION policy sets the interClusterMigAllowed flag variable to true (Line 11) and computes the next value of migCount by incrementing its previously-observed value by one (Line 12). Note that the BW-MIGRATION policy handles a page migration request whose destination is the NVM node in a similar manner (Lines 13 to 16).
If the interClusterMigAllowed flag variable is set to false (Line 17), the BW-MIGRATION policy cannot perform an inter-cluster migration for the page. Therefore, it sets destCluster to currCluster (Line 18) and performs a page migration in the cluster where the page is currently placed.
If the interClusterMigAllowed flag variable is set to true, the BW-MIGRATION policy attempts to migrate a page across the clusters by performing an atomic compare-and-set operation on the migration counter (Line 17). If the atomic compare-and-set operation has been successfully performed, it indicates that the following steps have been atomically performed: (1) checking that migCount and currCount hold the same value and (2) setting migCount to nextCount. Then, the BW-MIGRATION policy performs an inter-cluster migration (Line 19). If the atomic compare-and-set operation has failed, the BW-MIGRATION policy cannot perform an inter-cluster migration for the page and falls back to a migration within the current cluster (Line 18).
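The destination-cluster decision of BW-MIGRATION can be sketched as follows. Python has no hardware compare-and-set primitive, so a lock-protected method stands in for the atomic operation; the class and function names are hypothetical, the counter is assumed to start at zero, and the bound for migrations toward NVM (counter greater than −M) is inferred from the symmetric description of the allowed range rather than taken from the listing.

```python
import threading

DRAM, NVM = "DRAM", "NVM"


class MigrationCounter:
    """migCount with a lock-based stand-in for atomic compare-and-set."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        with self._lock:
            if self.value != expected:
                return False
            self.value = new
            return True


def choose_destination(page_cluster, task_cluster, counter, m):
    """Return the cluster the page should be migrated to."""
    dest = task_cluster
    if page_cluster != dest:
        allowed = False
        curr = counter.value
        nxt = 0
        if dest == DRAM:
            if curr < m:
                allowed, nxt = True, curr + 1
        else:
            if curr > -m:  # assumed symmetric bound for migrations toward NVM
                allowed, nxt = True, curr - 1
        # Fall back to an intra-cluster migration if the move is disallowed
        # or another thread changed the counter in the meantime.
        if not allowed or not counter.compare_and_set(curr, nxt):
            dest = page_cluster
    return dest


counter = MigrationCounter()
moves = [choose_destination(NVM, DRAM, counter, 2) for _ in range(3)]
back = choose_destination(DRAM, NVM, counter, 2)
```

With M = 2, the first two NVM-to-DRAM requests are granted and the third is redirected to stay within the NVM cluster; a subsequent DRAM-to-NVM request is granted and brings the counter back toward zero, which is how the two migration directions stay balanced.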
Implementation
We have implemented the bandwidth-aware memory placement and migration policies in the Linux kernel 3.10.0, which is the default kernel installed in CentOS 7 (i.e., one of the most widely-used Linux distributions in cluster and datacenter environments). We believe that this version closely matches the system software configurations used in production.
Discussion
Three or more memory clusters: For clarity and conciseness, we have discussed the bandwidth-aware placement and migration policies with an assumption that the underlying heterogeneous memory system comprises two types of memory clusters. However, the proposed policies can be extended for heterogeneous memory systems with three or more memory clusters.
Without loss of generality, we assume that the underlying heterogeneous memory system consists of $H$ memory clusters (i.e., $C_0, C_1, \cdots, C_{H-1}$), where $H \geq 3$. We also assume that the bandwidth of the memory cluster $C_i$ is $B_i$. Based on the same logic discussed in Section 2.4, the optimal allocation ratio (i.e., $p_{i,OPT}$) of the memory cluster $C_i$ can be computed as follows: $p_{i,OPT} = B_i / \sum_{j=0}^{H-1} B_j$. The BW-INTERLEAVE and BW-RANDOM policies preserve this optimal allocation ratio while allocating memory pages across the heterogeneous memory clusters in an interleaved or randomized manner. The BW-LOCAL policy first attempts to allocate a page to the local memory cluster. If it is disallowed due to the optimal allocation rule, the BW-LOCAL policy continues to attempt to allocate the page in the next nearest memory cluster until it succeeds.
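One way the interleaved allocation could generalize to H clusters is sketched below. This is an illustrative assumption, not the implementation described here: each cluster receives a bandwidth-proportional share of an integer-weighted period, and a slot number within the period is mapped to a cluster through cumulative weights. The weights are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate


def optimal_ratios(bandwidths):
    """Bandwidth-proportional share of each cluster."""
    total = sum(bandwidths)
    return [b / total for b in bandwidths]


def cluster_for_slot(slot, weights):
    """Map a slot in [0, sum(weights)) to a cluster index."""
    bounds = list(accumulate(weights))  # e.g. [5, 7, 8] for weights [5, 2, 1]
    return bisect_right(bounds, slot)


weights = [5, 2, 1]  # hypothetical relative bandwidths of three clusters
ratios = optimal_ratios(weights)
period = sum(weights)
counts = Counter(cluster_for_slot(i % period, weights) for i in range(800))
```

The randomized variant would follow the same mapping with the slot drawn uniformly at random instead of taken from a wrapping counter.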
The bandwidth-aware memory migration policy can be extended by employing the migration counter (i.e., migCount) for each pair of heterogeneous memory clusters. For instance, if the underlying heterogeneous memory system comprises four memory clusters, six migration counters (i.e., $\binom{4}{2} = 6$) are required. The key idea behind this approach is to keep the number of migrated pages between every pair of memory clusters roughly the same, preserving the optimal allocation ratio of each memory cluster.
Latency: The bandwidth-aware memory placement and migration policies primarily aim to optimize the performance of bandwidth-intensive applications, which are especially common in HPC. As quantified in Section 5.3, the proposed policies incur little or no negative performance impact on the bandwidth non-intensive applications evaluated in this work. Nevertheless, it is interesting future work to dynamically identify the characteristics of the target application (e.g., bandwidth vs. latency sensitive) based on the runtime data (e.g., memory traffic) and accordingly allocate and migrate pages. For instance, pages can be allocated in the memory clusters that are expected to incur the minimal latency for tasks of the latency-sensitive applications.
METHODOLOGY
To investigate the performance impact of the bandwidth-aware memory placement and migration policies, we employ a 16-core NUMA system with two 8-core CPUs. Each CPU is locally connected with a 16GB memory node. Table 1 summarizes the detailed system specification of the server system. The server system is installed with CentOS 7. We use the Linux kernel (version 3.10.0), which has been modified to implement the bandwidth-aware memory placement and migration policies. We also use OpenJDK 1.8.0 and Spark 1.5.0 [26] to run the big-data benchmarks, which are subsequently discussed. For performance data collection, we use the tools summarized in Table 2.
Since NVMs are not widely available yet, we use Quartz, an NVM emulator [21]. Quartz provides an interface to set the maximum bandwidth of a subset of the memory nodes in the system using the thermal throttling functionality of the memory controllers. Using Quartz, we emulate an NVM memory node by setting the bandwidth of one of the two memory nodes to be 2×, 5×, or 10× lower than that of the DRAM memory node, which is in the range of the predicted NVM performance.
As for the benchmarks, we use seven bandwidth-intensive benchmarks and five bandwidth non-intensive benchmarks from the PARSEC [2], SPLASH [25], and BigDataBench [24] benchmark suites. As for the datasets, we use the largest datasets (i.e., the native datasets) for the PARSEC and SPLASH benchmarks and the 1GB and 8GB datasets for Kmeans and WordCount, which have been generated using the big data generation tool [17].
All the evaluated benchmarks are multithreaded applications. For single-process experiments (Sections 5.1, 5.2, and 5.3), we set the thread count of each benchmark to 16, which is the same as the core count. For multiprogrammed experiments (Section 5.4), we set the thread count of each benchmark to 8 and simultaneously run two benchmarks. Table 3 summarizes the evaluated benchmarks with their bandwidth requirements. The bandwidth requirement for each benchmark is measured using the Intel performance counter monitor (Table 2). To measure the bandwidth requirement for each benchmark, we execute it with the conventional local memory placement policy, 16 threads, and no memory throttling on the aforementioned server system. We then compute it by multiplying the number of data transfers per second (measured through the performance monitoring counters) and the bytes per transfer.

Table 3: The first seven benchmarks are bandwidth-intensive benchmarks and the next five benchmarks are bandwidth non-intensive benchmarks.

                           Bandwidth Requirements
Benchmark                  Reads (MB/s)    Writes (MB/s)
canneal (CA) [2]           8082.6          2988.7
FFT (FFT) [25]             6628.7          4678.8
kmeans (KM) [24]           8018.5          7228.5
ocean_cp (OC) [25]         13154.6         4777.0
ocean_ncp (ON) [25]        12047.8         4018.4
streamcluster (SC) [2]     14589.2         168.3
wordcount (WC) [24]        8795.9          6973.7
blackscholes (BL) [2]      2952.3          448.0
facesim (FS) [2]           3420.8          1076.1
freqmine (FM) [2]          1333.0          591.6
raytrace (RT) [25]         196.8           30.9
swaptions (SW) [2]         669.5           81.7
EVALUATION
This section presents the quantitative evaluation of the bandwidth-aware memory placement and migration policies, which aims to investigate the following: (1) the performance gains achieved by the bandwidth-aware policies over the conventional policies with bandwidth-intensive benchmarks, (2) the performance sensitivity of the proposed policies to the bandwidth ratio of the DRAM to NVM clusters, (3) the performance impact of the bandwidth-aware policies on bandwidth non-intensive benchmarks, and (4) the performance gains of the bandwidth-aware policies with multiprogrammed workloads.
Performance Impact on Bandwidth-Intensive Benchmarks
To quantify the performance impact of the bandwidth-aware memory placement and migration policies on bandwidth-intensive benchmarks, we run each benchmark with the following seven versions: (1) the conventional local placement policy (Local), (2) the conventional interleave placement policy (IL), (3) the DRAM-only policy (D-Only), (4) the bandwidth-aware interleave policy (BW-I), (5) the bandwidth-aware random policy (BW-R), (6) the bandwidth-aware local policy (BW-L), and (7) the bandwidth-aware local policy augmented with the bandwidth-aware migration policy (BW-LM). In this and subsequent sections, unless stated otherwise, the bandwidth ratio of the DRAM cluster to the NVM cluster is set to 2.
We first analyze the overall performance results of the different memory placement and migration policies. Figure 3 shows the execution time of each version, which is the average across the seven bandwidth-intensive benchmarks summarized in Table 3. Each bar denotes the execution time of each version, normalized to that of the Local version. All the data reported in this section is the …
To gain a deeper insight into the overall performance trends, Figure 4 shows the execution time breakdowns of each version of the benchmarks, normalized to that of the Local version. Each bar consists of the CPU time spent for executing the instructions in the user mode (User), executing the instructions in the kernel mode (System), idling (Idle), and handling the I/O operations (I/O). Figures 3 and 4 demonstrate the following performance data trends, which will be explained with the low-level performance data presented subsequently.
Overall performance trends:
• The bandwidth-aware memory placement and migration policies significantly outperform the conventional policies when executing bandwidth-intensive benchmarks. For instance, the BW-L version achieves 34.8% higher performance than the Local version on average.
• The three bandwidth-aware memory placement policies achieve similar performance. For example, the performance difference between the BW-I and BW-L versions is only 0.5%.
• The bandwidth-aware local policy augmented with the bandwidth-aware memory migration policy achieves slightly lower (i.e., 5.8%) performance than the bandwidth-aware local policy (with no page migration).
Figure 3 shows that the performance gains achieved by the bandwidth-aware memory placement and migration policies are mainly from the reduced User time. To investigate the cause of the reduced User time with the bandwidth-aware policies, Figure 5 shows the data traffic (MB/s) between the CPUs and memory nodes.
We observe that the bandwidth-aware policies show significantly higher traffic than the conventional policies. The bandwidth-aware policies effectively utilize all the available bandwidth of the DRAM and NVM clusters by preserving the optimal memory allocation ratio (i.e., ) across the clusters. Figure 6 shows that the pages are allocated across the DRAM and NVM clusters in proportion to their bandwidth. Further, Figure 7 shows that each benchmark performs memory accesses across the clusters in proportion to their allocation ratio. In contrast, the conventional policies allocate pages in a bandwidth-oblivious manner. As a result, the bandwidth-aware policies achieve significantly higher performance than the conventional policies.
Figure 9: Performance sensitivity to the bandwidth ratio
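The bandwidth-proportional allocation discussed above can be illustrated with a minimal sketch. It assumes that the target share of pages on each cluster equals that cluster's share of the total bandwidth, and it uses a simple deficit-based weighted interleave to choose the cluster for each new page. The function names and the interleaving scheme are illustrative assumptions, not the kernel implementation evaluated in this work.

```python
from fractions import Fraction


def target_fractions(bandwidths):
    """Map each cluster to its share of the total (integer) relative bandwidth."""
    total = sum(bandwidths.values())
    return {name: Fraction(bw, total) for name, bw in bandwidths.items()}


def place_pages(num_pages, bandwidths):
    """Assign pages one by one to the cluster furthest below its target share."""
    targets = target_fractions(bandwidths)
    counts = {name: 0 for name in bandwidths}
    placement = []
    for allocated in range(num_pages):
        # Deficit: pages the cluster should hold after this allocation,
        # minus the pages it already holds. Ties go to the first cluster.
        choice = max(
            counts,
            key=lambda name: targets[name] * (allocated + 1) - counts[name],
        )
        counts[choice] += 1
        placement.append(choice)
    return placement, counts


# Relative bandwidths for a DRAM-to-NVM bandwidth ratio of 2,
# the default ratio used in this section.
placement, counts = place_pages(300, {"DRAM": 2, "NVM": 1})
```

Under these assumptions, two of every three pages land on the DRAM cluster, so `counts` ends at 200 DRAM pages and 100 NVM pages for 300 allocations.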
For some benchmarks (e.g., FFT, KM, and WC), the Local and BW-LM versions incur relatively high System time (Figure 4). As shown in Figure 8, page migrations are actively performed when executing such benchmarks, increasing the time spent by the Linux kernel. Since the evaluated benchmarks are bandwidth-sensitive and do not usually benefit from dynamic page migration, the Local and BW-LM versions are outperformed by the other versions of the same kind with no migration.
Sensitivity to the Bandwidth Ratio
We investigate the performance sensitivity of the bandwidth-aware memory placement and migration policies to the bandwidth ratio of the DRAM to NVM cluster using the seven bandwidth-intensive benchmarks. Figure 9 shows the average execution time of each version (normalized to that of the Local version) as the bandwidth ratio of the DRAM to NVM cluster changes from 2 to 10.
We observe that the bandwidth-aware policies achieve higher performance gains as the bandwidth ratio of the DRAM to NVM cluster increases. For instance, the performance gain of the BW-L version over the Local version increases from 34.8% to 81.6% as the bandwidth ratio of the DRAM to NVM cluster increases from 2 to 10.
This is mainly because, as the bandwidth ratio of the DRAM to NVM cluster increases, the conventional policies suffer more severely from the low performance of NVM by placing and migrating pages to the NVM memory node in a bandwidth-oblivious manner. In contrast, the bandwidth-aware policies robustly utilize the DRAM and NVM nodes by preserving the optimal allocation ratio across the clusters, significantly outperforming the conventional policies.
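The trend can be illustrated with an idealized back-of-the-envelope model, sketched below. The model is an assumption introduced here for illustration and is not taken from this work: it treats the workload as purely bandwidth-bound, lets both clusters stream their share of the data in parallel, and compares an even page split against a bandwidth-proportional split. It is not meant to reproduce the measured gains, which also depend on latency, caching, and kernel overheads.

```python
def ideal_transfer_time(data, dram_fraction, bw_dram, bw_nvm):
    """Time to stream `data` units when both clusters are accessed in parallel.

    The slower of the two concurrent transfers determines the total time.
    """
    dram_time = data * dram_fraction / bw_dram
    nvm_time = data * (1.0 - dram_fraction) / bw_nvm
    return max(dram_time, nvm_time)


def ideal_speedup(ratio):
    """Speedup of a bandwidth-proportional split over an even split.

    `ratio` is the DRAM-to-NVM bandwidth ratio; NVM bandwidth is normalized to 1.
    """
    bw_nvm = 1.0
    bw_dram = ratio * bw_nvm
    data = 1.0
    even = ideal_transfer_time(data, 0.5, bw_dram, bw_nvm)
    proportional_fraction = bw_dram / (bw_dram + bw_nvm)
    proportional = ideal_transfer_time(data, proportional_fraction, bw_dram, bw_nvm)
    return even / proportional


# Illustrative sample points; the text states only the endpoints 2 and 10.
speedups = {ratio: ideal_speedup(ratio) for ratio in (2, 4, 6, 8, 10)}
```

In this simplified model the speedup is (r + 1) / 2 for a bandwidth ratio r, so the benefit of preserving the proportional allocation grows steadily as the NVM cluster becomes relatively slower, which matches the direction of the trend reported above.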
Performance Impact on Bandwidth Non-Intensive Benchmarks
We investigate the performance impact of the bandwidth-aware policies on bandwidth non-intensive benchmarks. Figure 10 shows the average execution time of each version, normalized to that of the Local version. We observe that all the versions achieve almost identical performance, which indicates that the bandwidth-aware policies cause little, if any, negative performance impact on bandwidth non-intensive benchmarks on heterogeneous memory systems.
Performance Impact on Multiprogrammed Workloads
Finally, we investigate the performance impact of the bandwidth-aware policies on multiprogrammed workloads. Figure 11 shows the average execution time of each version, normalized to that of the Local version. The average execution time in Figure 11 is collected by executing all the possible pairs (i.e.,
We observe that the bandwidth-aware policies significantly outperform the bandwidth-oblivious policies. For instance, the BW-L version achieves 18.8% higher performance than the Local version on average. Compared with the experimental results (Figure 3) with single multithreaded benchmarks, the bandwidth-aware policies achieve somewhat lower performance gains with multiprogrammed workloads. This is mainly because some of the multithreaded benchmarks, which execute for a longer time than the other benchmark in the same pair and determine the makespan of the pair, incur lower memory traffic with a smaller thread count (i.e., 8), making their performance less sensitive to the memory bandwidth.
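As a minimal sketch of how such pairwise results could be aggregated, the snippet below enumerates unordered benchmark pairs and averages the makespan of each pair, where the makespan is the completion time of the longer-running benchmark in the pair. It assumes that pairs consist of two distinct benchmarks; the three benchmark names are taken from the text, while the completion times and helper names are hypothetical placeholders rather than measurements from this work.

```python
from itertools import combinations


def all_pairs(benchmarks):
    """Enumerate every unordered pair of distinct benchmarks."""
    return list(combinations(benchmarks, 2))


def makespan(completion_times):
    """The makespan of a co-run is set by the benchmark that finishes last."""
    return max(completion_times)


def average_makespan(pair_times):
    """Average the makespan over all co-run pairs."""
    spans = [makespan(times) for times in pair_times.values()]
    return sum(spans) / len(spans)


benchmarks = ["FFT", "KM", "WC"]  # subset of benchmarks named in the text
pairs = all_pairs(benchmarks)

# Hypothetical co-run completion times in seconds (illustrative only),
# given as (time of first benchmark, time of second benchmark).
pair_times = {
    ("FFT", "KM"): (12.0, 9.0),
    ("FFT", "WC"): (11.0, 14.0),
    ("KM", "WC"): (8.0, 13.0),
}
avg = average_makespan(pair_times)
```

Because only the longer-running benchmark of each pair contributes to the makespan, a pair's result is insensitive to memory bandwidth whenever that benchmark generates little memory traffic.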
Overall, our quantitative evaluation demonstrates that the bandwidth-aware policies are promising in that they provide significant performance gains over the conventional policies with bandwidth-intensive benchmarks, across a wide range of the DRAM-to-NVM bandwidth ratios, and in multiprogrammed environments.
RELATED WORK
Prior work has extensively investigated the architectural [1, 16, 23] and system software [12] support for heterogeneous memory systems. The key theme of the prior work is to improve the capacity scaling of heterogeneous memory systems using architectural and system software techniques. While insightful, most of the prior work lacks bandwidth-aware optimizations, which results in sub-optimal performance on heterogeneous memory systems as quantified in this work.
The prior work closest to ours is the work presented in [1], which investigates the bandwidth-aware memory placement for GPU architectures that employ both the GPU (GDDR5) and CPU (DDR4) memory nodes. While insightful, the prior work investigates a memory placement policy similar to the bandwidth-aware interleave policy proposed in this work, lacks an in-depth discussion on the design and implementation of the OS-level bandwidth-aware memory management, and evaluates the proposed technique solely based on an architectural simulator, which lacks the modeling of the entire system software stack such as the OS. In contrast, our work proposes various bandwidth-aware memory placement and migration policies, discusses the design and implementation of the bandwidth-aware policies on top of the Linux kernel, and evaluates the proposed techniques based on a real system to investigate the interaction among the applications, the OS, and the underlying heterogeneous memory system.
Prior work has proposed the architectural [9, 19, 27] and system software techniques [4, 6, 13, 22] for improving the performance of NVM-based systems. Our work differs in that it aims to investigate the design and implementation of the memory placement and migration policies for heterogeneous memory systems that simultaneously employ the DRAM and NVM clusters to improve the performance and capacity scaling.
Prior work has extensively investigated the performance analysis and optimization for NUMA systems [5, 7, 8, 15, 18]. While insightful, the prior work is based on the assumption that all the memory nodes in the underlying system are identical in terms of performance. In contrast, our work investigates the design and implementation of the bandwidth-aware memory placement and migration policies for heterogeneous memory systems where there are multiple types of memory clusters that exhibit widely different performance characteristics.
CONCLUSIONS
This work presents the design and implementation of the bandwidth-aware memory placement and migration policies for bandwidth-intensive applications on heterogeneous memory systems. The proposed policies dynamically place and migrate memory pages in a bandwidth-aware manner in that they preserve the optimal allocation ratio across the heterogeneous memory clusters. Our quantitative evaluation demonstrates the effectiveness of the bandwidth-aware policies in that they significantly outperform the conventional bandwidth-oblivious policies across a wide range of the DRAM-to-NVM bandwidth ratios when executing bandwidth-intensive benchmarks. As future work, we would like to characterize the performance of virtual machines on heterogeneous memory systems in cloud-computing environments and investigate the design and implementation of the bandwidth-aware virtual machine manager.
