With the deployment of innovative memories such as non-volatile memory and 3D-stacked memory in distributed systems, how to improve application performance by exploiting the unique characteristics of these hybrid memories remains an active research direction. For instance, the Intel Knights Landing (KNL) processor integrates High Bandwidth Memory (HBM), built with 3D-stacked technology, with traditional DRAM on the same chip. HBM achieves much higher bandwidth than traditional DRAM when the application exhibits high parallelism and sequential access. In this paper, we propose a new metric, SP-factor, to guide data scheduling in distributed systems that use hybrid memories such as HBM and DRAM. The SP-factor incorporates data access patterns, including data block size and data access parallelism, which leads to better data scheduling decisions and higher performance. We apply SP-factor to several data eviction policies on the hybrid memory system and obtain better performance. Moreover, we propose an adaptive data scheduling method (ADSM) for hybrid memory systems with HBM and DRAM. ADSM dynamically adjusts scheduling decisions based on runtime performance metrics so that it can adapt to workloads with different data access patterns. Our experimental results show that ADSM significantly improves the performance of representative workloads. For an SQL query application with a mixed access pattern, ADSM increases the cache hit ratio by 10.4% and reduces the execution time by 14.6% compared to the ARC policy.
I. INTRODUCTION
With the development of innovative memories, emerging memory technologies such as non-volatile memory [1], [2] and 3D-stacked memory [3], [4] have been integrated into the memory system, effectively complementing traditional DRAM. For example, the Intel Knights Landing (KNL) processor [5] incorporates high bandwidth memory (HBM) into the processor, which achieves much higher bandwidth than DRAM when the application exhibits high parallelism and sequential access. However, in the case of discrete access, the access latency of HBM is higher than that of DRAM [6]. Emerging processors such as KNL represent the future trend of memory architecture design: a hybrid memory system consisting of high bandwidth memory and traditional memory. Therefore, how to utilize such a hybrid memory system to achieve both high bandwidth and low latency is important for future applications.
The associate editor coordinating the review of this manuscript and approving it for publication was Yupeng Wang.
Research [7], [8] shows that a distributed file system equipped with large-capacity memory, which keeps frequently accessed data in memory, can effectively speed up big data applications. Alluxio [9] (formerly Tachyon) is a distributed memory file system based on this idea. It unifies the data read and write interfaces and bridges the gap between computing frameworks such as Hadoop MapReduce [10], Spark [11] and Flink [12], and distributed file systems such as HDFS [13], S3 [14] and Ceph [15]. Alluxio stores data in user-assigned memory space and provides a pluggable framework for defining customized data scheduling methods.
However, the current data scheduling method for hybrid memory systems mainly adopts traditional cache eviction policies with certain improvements. The fundamental idea is to store the data with higher re-access probability in the high bandwidth memory. However, existing works fail to consider the influence of data block size and access parallelism. Especially for hybrid memory using HBM and DRAM, the existing scheduling methods cannot fully exploit the high bandwidth of HBM and the low latency of DRAM, which wastes the potential for further performance improvement of big data applications.
To address the above limitations, this paper proposes a novel data scheduling metric, the Size and Parallelism factor (SP-factor), that quantitatively describes the weight for scheduling data to high bandwidth memory or traditional memory. We use SP-factor to improve several traditional data eviction policies, which leads to better application performance. In addition, we propose an adaptive data scheduling method, ADSM, that dynamically adjusts scheduling decisions to achieve a higher cache hit ratio and thus better performance for applications with various data access patterns.
Specifically, the contributions of this paper are as follows:
• Based on the characteristics of hybrid memory using HBM and DRAM, we propose a new data scheduling metric, SP-factor, which represents the impact on application performance of storing a data block in HBM. It overcomes the limitations of existing scheduling methods by considering both data block size and data access parallelism, which improves the effectiveness of data scheduling. To derive the data access parallelism, we also propose a prediction method.
• We improve the traditional data eviction policies using SP-factor. Specifically, we apply the modified stack bottom re-order algorithm to the LRU stack, which serves as the basis for various data eviction policies such as LRU-SP, LIRS-SP, and ARC-SP. In addition, we propose LRFU-SP, an improved version of LRFU that modifies the eviction weight using SP-factor. The experimental results show that the improved data eviction policies using SP-factor significantly increase the application bandwidth on the hybrid memory system.
• We propose an adaptive data scheduling method, ADSM, which considers both the SP-factor and the data re-access probability in the benefit function, and dynamically adjusts their contribution to the schedule tuning module based on performance metrics collected at runtime. The experimental results show that ADSM achieves better performance for representative workloads with different data access patterns. In particular, for mixed data access patterns, the application performance improves significantly.
The rest of the paper is organized as follows. Section II introduces the research background of data scheduling methods for hybrid memory in distributed systems. Section III presents the definition of SP-factor, which considers the data block size and access parallelism to better guide data scheduling. Section IV illustrates the improvements to traditional data eviction policies using SP-factor. Section V presents the dynamic data scheduling method ADSM for hybrid memory using HBM and DRAM. Both the design overview and implementation details are provided. In Section VI, we evaluate the proposed method with representative workloads and compare it with various data scheduling policies. Section VII presents existing research in related fields. Section VIII concludes this paper.
II. BACKGROUND
A. HYBRID MEMORY
Future distributed systems will increasingly rely on innovative memory devices with high bandwidth and low latency in order to achieve further performance gains. For example, Intel KNL processors are equipped with high bandwidth memory using 3D-stacked technology, which is integrated with traditional DRAM. The design of KNL provides a unique HBM-DRAM hybrid memory system. KNL offers three ways to utilize HBM: flat mode, cache mode and hybrid mode. The flat mode exposes HBM and DRAM as different NUMA nodes, which can be allocated explicitly through the operating system. In the cache mode, HBM is transparent to the user and behaves as a large-capacity cache for DRAM that directly maps memory by physical address with 64B cache lines. Data read from DRAM is sent to HBM at the same time, and the core queries the HBM first when accessing data. The hybrid mode offers a middle ground between flat mode and cache mode, in which the HBM is divided into two parts: addressable memory alongside DRAM, and an L3 cache for DRAM.
Theoretically, the bandwidth of HBM exceeds 400 GB/s; however, the achievable bandwidth of HBM when running actual workloads is affected by many factors, including configuration mode, data access pattern, data block size and access parallelism. The peak bandwidth of HBM can reach 4× that of DRAM for applications with high parallelism and sequential access, although the latency of HBM is between 1.15× and 1.2× that of DRAM [6]. Therefore, these unique characteristics of HBM need to be considered in the data scheduling method for the hybrid memory system.
B. DISTRIBUTED FILE SYSTEM WITH HETEROGENEITY
To enable high bandwidth for applications running on distributed file system, HDFS supports heterogeneous storage devices including RAM_DISK, SSD, DISK and ARCHIVE. The corresponding strategies for storing data include Hot, Cold, Warm, All_SSD, One_SSD and Lazy_Persist. These strategies are responsible for managing the placement of the data blocks and their backups among the heterogeneous storages.
One limitation of the current HDFS implementation is that it cannot automatically decide on which storage the data is placed based on the recency and frequency of data access. Several research works attempt to improve HDFS in this direction, including Hats [7], Triple-H [16] and PAHDFS [17]. However, these improvements are tightly coupled to a specific version of the HDFS implementation and require significant code changes, which makes them difficult to apply to other storage systems or other versions of HDFS. In addition, HDFS was originally designed for offline processing in a traditional cluster. Its high availability and backup mechanisms introduce significant software overhead, so the bandwidth of the heterogeneous storages cannot be utilized efficiently. Compared to HDFS, Alluxio (formerly Tachyon) is a memory-centric distributed file system with the advantage of efficient data sharing and high bandwidth data access across the cluster. Evaluations show that the write bandwidth on Alluxio can reach 110× that of HDFS [8]. Because of its fault tolerance, HDFS is usually deployed as the underlying file system of Alluxio in practice.
In addition, Alluxio is capable of integrating different types of storages by tagging the storage path with aliases and serial numbers. The architecture of hybrid memory system using HBM and DRAM on top of Alluxio is shown in Figure 1 . Alluxio starts a master daemon to manage the metadata and worker daemons. The worker daemon is responsible for managing the data stored in the worker nodes. Moreover, HDFS is used as the underlying file system to persist data. The blue arrow indicates the control flow between Alluxio master and worker. The workers periodically send the heartbeats and the block information to the master, and the master returns the control command. The yellow arrow indicates the control flow of the client request, including the file path, nodes list, and data block IDs. The brown arrow indicates the control flow between the scheduler and the storage module in the worker. The scheduler receives the status of the storage module and issues scheduling requests to the storage in return. The green arrows indicate the data flow between nodes as well as between storage tiers. In our Alluxio settings, the HBM storage path is marked as the 0th tier, whereas the DRAM storage path is marked as the 1st tier. By default, the write tier is set as the 1st tier. Note that Alluxio uses the HBM tier in the flat mode.
C. DATA ACCESS PATTERN AND SCHEDULING
The data access patterns of big data applications are quite diverse. For example, KMeans traverses the sample data multiple times, and similarly PageRank accesses the state transition matrix multiple times until it converges. Such big data applications exhibit distinct loop characteristics. In contrast, Hive SQL query workloads include various data access patterns, such as the most recently accessed data being re-accessed or the more frequently accessed data being re-accessed. Research works [18], [19] summarize the data access patterns of big data applications into four categories: recency-friendly, frequency-friendly, loop and mixed.
• Pattern 1, recency-friendly access pattern: This pattern shows good temporal locality with small access time interval. For this access pattern, the LRU policy can reach the maximum bandwidth, since it can store the recently accessed data in the high bandwidth memory.
• Pattern 2, frequency-friendly access pattern: This pattern shows unbalanced data access frequency, and thus storing the most frequently accessed data in the high bandwidth memory achieves better performance. The LFU policy fits well for this pattern.
• Pattern 3, loop access pattern: Data in this pattern is accessed multiple times through loops and the amount of data is larger than the space of high bandwidth memory. For this pattern, the LIRS [20] policy achieves better performance, because it only stores the data with high re-access probability in high bandwidth memory.
• Pattern 4, mixed access pattern: Generally, the data access pattern of a real application is mixed. The LRFU [21] policy combines the ideas of LRU and LFU, and uses the CRF value to indicate the likelihood that the data will be re-accessed in the future. ARC [22] dynamically adjusts the decision of where data is evicted from in order to adapt to changes in the data access pattern.
On Alluxio, the default allocation policy is cache-promote, which moves the data to the highest tier if it is already in Alluxio storage, and writes the data into the highest tier of the local Alluxio worker if the data needs to be read from the underlying storage. Several classic eviction policies are also available on Alluxio, such as LRU, LRFU, LIRS and ARC [19]. However, instead of scheduling the data placement on hybrid memories directly, existing data scheduling methods on Alluxio schedule the data placement indirectly when data is evicted, and thus miss potential opportunities for performance optimization during data scheduling. Moreover, current data scheduling methods do not jointly consider the impact of data block size, access parallelism and re-access probability, and therefore cannot fully exploit the performance potential of the hybrid memory system. We argue that it is necessary to propose a new data scheduling metric to effectively guide data block placement in the hybrid memory system, and to design an adaptive scheduling method that meets the bandwidth requirements of applications with different data access patterns, so as to fully utilize the advantages of the hybrid memory system for both high bandwidth and low latency. This motivates the work of this paper.
III. SP-FACTOR: DATA SIZE AND PARALLELISM FACTOR
In this section, we measure the bandwidth of the Enhanced DFSIO workload on HBM and DRAM separately by experimenting with different numbers of map containers and data split sizes. Based on these observations, we propose a new data scheduling metric, SP-factor, and a corresponding mechanism to model the access parallelism.
A. PERFORMANCE OBSERVATION
To measure the bandwidth characteristics of HBM and DRAM under different data block sizes and access parallelism, we use the Enhanced DFSIO workload [23] in the HiBench benchmark suite, which launches multiple map tasks to read files on storage concurrently and records the read ratio. The file size is set by the user. Unlike the TestDFSIO workload in Hadoop, Enhanced DFSIO avoids measuring the warm-up and cool-down phases when launching or shutting down the JVMs, and obtains the aggregate bandwidth by sampling during the steady phase. Thus, Enhanced DFSIO gives a more accurate measurement of the bandwidth under high access parallelism.
The experiments are divided into four groups on a single machine. The data split size of the four groups is 1MB, 10MB, 50MB and 100MB respectively, and within each group the number of map containers is set to 2, 4, 6, 8, 10, 20, 30, 40 and 50 in turn in order to observe the impact of parallelism [24]. Each map container includes one core, which has four hardware threads, and 1GB of memory. The detailed experimental settings are described in Section VI-A.
The experimental results are shown in Figure 2. The dashed line indicates the ratio of bandwidth improvement observed on the workload using HBM compared to DRAM. Under the same data split size, when the number of map containers is small, the bandwidth improvement of HBM over DRAM is insignificant. As the number of map containers increases, the bandwidth of HBM becomes much higher than that of DRAM. When the parallelism is very high, for example when the number of map containers exceeds 30, the bandwidth advantage of HBM over DRAM begins to decline. This is because launching a large number of JVMs leads to serious resource contention and reduces the bandwidth of HBM. In addition, the data split size has a strong impact on the bandwidth of HBM. As the size of the data split increases, the bandwidth of HBM shows an increasing trend. When the number of map containers is 30 and the data split size is 100MB, the bandwidth improvement of HBM over DRAM is the most substantial, at 21%. From the experiments we observe that the HBM bandwidth is closely related to the data block size and access parallelism, which existing data scheduling methods for hybrid memory systems fail to consider. This observation guides us to propose a new data scheduling metric, the Data Size and Parallelism Factor (SP-factor), for better data scheduling decisions.
B. THE DEFINITION OF SP-FACTOR
Based on the observation that data block size and access parallelism have a significant impact on HBM bandwidth and thus application performance, we propose SP-factor, which is calculated from the data block size and access parallelism in order to guide data scheduling (Section III-B.1). The data block size is defined as the smallest data unit that is scheduled in Alluxio, whereas the access parallelism is defined as the number of accesses requested from the storage tier in a specific interval. The difficult part of calculating SP-factor is obtaining the access parallelism of the data blocks. To address this, we propose a mechanism to model the parallelism based on the data access history at the storage tier of Alluxio (Section III-B.2). Note that SP-factor is calculated and stored as metadata of the Alluxio Worker, which does not introduce extra communication overhead between the Alluxio Master and Workers, because only the data blocks, and not the metadata, are transferred to other Workers.
1) THE SP-FACTOR EQUATION
With the measured results of Enhanced DFSIO in Section III-A, we observe that the read bandwidth improvement ratio IR is related to data size and access parallelism as shown in Equation 1, where s correlates with IR positively, and p negatively. There are two reasons for this phenomenon. First, there is barely any bandwidth advantage of HBM over DRAM when the block size is small (e.g., less than 20MB), because the extra access latency [6] offsets the bandwidth gain. Second, we find that the bandwidth improvement obtained by increasing the data block size is nonuniform. There is a peak in the bandwidth improvement, after which adjusting the block size is less effective. The partial derivative of IR is shown in Equation 2, and the peak improvement is obtained when s equals p 2 as shown in Equation 3, which is associated with data access parallelism.
As shown in Equation 4, size represents the data block size, and MaxBlockSize is determined by configuration, which is 128MB in our system. By fitting the slope peak values with various access parallelisms, we derive Equation 5, where OptimalParallelism is the parallelism with the highest bandwidth improvement. OptimalParallelism is set to 30 in our system, as described in Section III-A.
It is easy to understand that p decreases when the parallelism approaches OptimalParallelism. In addition, the increase in parallelism will result in more bandwidth contention, and we add 0.5 in Equation 5 to constrain p within the appropriate range.
As shown in Figure 3, with different parallelisms and block sizes (from 1MB to 100MB), the improvement ratio calculated according to Equation 1 fits well with the measurements on Enhanced DFSIO when η equals 0.39. Based on this observation, we define the SP-factor to represent the weight of scheduling a data block to HBM. A larger SP-factor value indicates that the data block is more suitable for HBM, as shown in Equation 6. Based on SP-factor, we improve the existing data eviction policies in Section IV, and propose an adaptive data scheduling method, ADSM, in Section V.
SP-factor
2) DATA ACCESS PARALLELISM MODELING
To better understand the data access parallelism, we analyze the SparkSQL Join workload, which contains a large number of data accesses. As shown in Figure 4, we collect the block access record of the SparkSQL Join workload. Even though thirty executors are launched in parallel, the block accesses are still largely sequential. We identify six data access phases within this workload. The two file scan phases, in which each Spark executor reads the source data blocks, are marked by blue arrows in Figure 4. Due to the limited heap space, each executor needs to access the source data block several times. Therefore, there are three rounds of data accesses in the first phase, and two rounds in the second phase.
The time of one round of data access is denoted as Span, and P is the data access parallelism within a Span. We assume that a Span is composed of n Intervals, and that block accesses from diverse task slots are independent and random during the Intervals, which is supported by studies [25], [26]. The data access parallelism within an Interval then obeys the Poisson distribution with λ = P/n. We use a list to record the block access timestamps. When a block is accessed, we count the number of block accesses within the previous Interval and scale this count by n to obtain the access parallelism p of that data block in this round. The expectation of p equals P.
To address the large parallelism variance in our modeling method, we use the arithmetic mean of multiple rounds to approximate the true parallelism, as shown in Equation 7. The worker in Alluxio needs to record both the access parallelism and the block access number m. If the data block is accessed again, the access parallelism and block access number are updated. The information of a block that has not been accessed for a long time can be removed to reduce space occupancy.
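The modeling steps above can be sketched as follows. This is a minimal Python illustration, not the Alluxio implementation: the class name `ParallelismEstimator`, its fields, and the choice to include the current access in the Interval count are assumptions made here, and the running mean stands in for the arithmetic mean over rounds.

```python
from collections import deque


class ParallelismEstimator:
    """Hypothetical per-tier estimator of block access parallelism."""

    def __init__(self, interval, n):
        self.interval = interval   # length of one Interval (seconds)
        self.n = n                 # assumed number of Intervals per Span
        self.timestamps = deque()  # tier-wide block access timestamps
        self.stats = {}            # block_id -> (mean parallelism, access number m)

    def record_access(self, block_id, now):
        # Keep only the accesses that fall within the previous Interval.
        while self.timestamps and self.timestamps[0] <= now - self.interval:
            self.timestamps.popleft()
        self.timestamps.append(now)  # assumption: the current access is counted

        # Scale the Interval count by n to estimate the Span-level parallelism.
        sample = len(self.timestamps) * self.n

        # Update the arithmetic mean over the m samples seen for this block.
        mean, m = self.stats.get(block_id, (0.0, 0))
        m += 1
        mean += (sample - mean) / m
        self.stats[block_id] = (mean, m)
        return mean


est = ParallelismEstimator(interval=1.0, n=10)
est.record_access("A", 0.0)
est.record_access("B", 0.2)
est.record_access("A", 0.4)
mean_a = est.record_access("A", 5.0)
```

Averaging over m samples damps the variance of a single Poisson-distributed count, which is the reason the text gives for recording both the parallelism and the access number per block.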
As mentioned above, the source data block of the Join workload is accessed by executors for five rounds. The parallelism of the data block approaches the practical access parallelism with smaller variance, as shown in Figure 5 . Therefore, the above method is able to accurately model the access parallelism of data blocks.
IV. IMPROVING DATA EVICTION POLICIES USING SP-FACTOR
The existing data eviction policies focus on making eviction decisions when the memory space is exhausted. In this section, we use SP-factor to improve these data eviction policies, including LRU, LIRS, ARC and LRFU. We show the impact of SP-factor on the efficiency of data eviction policies in Section VI.
A. STACK BOTTOM RE-ORDER ALGORITHM
Existing data eviction policies such as LRU, LIRS, and ARC are implemented based on a fixed-size LRU stack structure. The distance between an element and the stack top represents the recency of the element. To illustrate the LRU stack, we take Figure 6 (a) as an example. The space of the stack is 6. If the accessed data block is in the stack (e.g., step 1, accessing data block 1), the data block is moved to the top of the stack, indicating that its recency is minimal. If a new data block is accessed (e.g., step 2, accessing data block 7), because the stack has no free space, the element at the bottom of the stack is evicted and the new data block is inserted at the top of the stack. To take advantage of the high bandwidth of HBM under high parallelism and sequential access, we design a stack bottom re-order algorithm as shown in Figure 6
where LockRatio determines the number of data blocks to be locked above the bottom of the stack. The data blocks in the locked area are re-ordered by SP-factor so that the data block with the smallest SP-factor sits at the bottom of the locked area and is evicted first (e.g., step 1). If a data block in the locked area is re-accessed (e.g., step 2, accessing data block 1), it is moved from the locked area to the top of the stack. If a new data block is accessed (e.g., step 3, accessing data block 7), the data block at the bottom of the locked area is evicted and the new block is inserted at the top of the stack. If the locked area is empty, the stack bottom re-order algorithm runs again. The stack bottom re-order algorithm using SP-factor extends the time that data blocks with a large SP-factor reside in HBM, and LockRatio adjusts the relative impact of recency and SP-factor on data eviction.
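The following minimal Python sketch illustrates the re-order behaviour described above. It is an illustration under stated assumptions rather than the paper's implementation: the class `ReorderedLRUStack` is hypothetical, the locked area is assumed to hold `int(capacity * lock_ratio)` blocks (at least one), capacity is counted in blocks rather than bytes, and SP-factor values are supplied by the caller.

```python
class ReorderedLRUStack:
    """Hypothetical LRU stack whose bottom is locked and re-ordered by SP-factor."""

    def __init__(self, capacity, lock_ratio):
        self.capacity = capacity
        # Assumption: LockRatio is a fraction of the stack capacity.
        self.lock_count = max(1, int(capacity * lock_ratio))
        self.stack = []   # index 0 is the top (most recently accessed)
        self.locked = []  # locked area; the last element is evicted first
        self.sp = {}      # block_id -> SP-factor

    def _relock(self):
        # Lock the bottom-most blocks and order them so that the block
        # with the smallest SP-factor ends up at the bottom.
        count = min(self.lock_count, len(self.stack))
        bottom = self.stack[len(self.stack) - count:]
        del self.stack[len(self.stack) - count:]
        self.locked = sorted(bottom, key=lambda b: self.sp[b], reverse=True)

    def access(self, block_id, sp_factor):
        """Record an access; return the evicted block ID, or None."""
        self.sp[block_id] = sp_factor
        evicted = None
        if block_id in self.stack:
            self.stack.remove(block_id)
        elif block_id in self.locked:
            self.locked.remove(block_id)   # a re-accessed block leaves the locked area
        elif len(self.stack) + len(self.locked) >= self.capacity:
            if not self.locked:
                self._relock()             # locked area is empty: re-order again
            evicted = self.locked.pop()
            del self.sp[evicted]
        self.stack.insert(0, block_id)
        return evicted


demo = ReorderedLRUStack(capacity=6, lock_ratio=0.5)
factors = {1: 0.9, 2: 0.1, 3: 0.5, 4: 0.3, 5: 0.2, 6: 0.8}
for block in range(1, 7):
    demo.access(block, factors[block])
first_evicted = demo.access(7, 0.4)
```

In this run plain LRU would evict block 1, the least recently used block, whereas the re-ordered stack evicts block 2, which has the smallest SP-factor among the three locked blocks.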
B. LRU-SP POLICY
The LRU [27] policy is the most commonly used data eviction policy. The idea is to put the most recently accessed data blocks into the memory with higher bandwidth. If the memory space is insufficient, the data that has not been accessed for the longest time is evicted. The LRU policy performs well for applications with high temporal locality.
Based on the stack bottom re-order algorithm presented in Section IV-A, we improve the LRU policy (LRU-SP). LRU-SP maintains a limited stack for the data blocks residing in HBM, locks the bottom of the stack according to the LockRatio, and then re-orders the locked area by SP-factor so that the data block with the smallest SP-factor sits at the bottom and is evicted first. In a hybrid memory system using HBM and DRAM, LRU-SP encounters the following two types of data access:
• If the data in DRAM is accessed, the policy moves the data to HBM and inserts the corresponding block ID at the top of the stack if there is enough space in HBM. If the remaining space of HBM is insufficient, the bottom data blocks in the locked area are evicted first until the remaining space is enough to hold the new data block.
• If the data in HBM is accessed, the data block is moved to the top of the stack. If the data block is already in the locked area of the stack, the policy removes it from the locked area.
The LRU-SP policy combines temporal locality with the characteristics of the HBM-DRAM hybrid memory system: it ensures that the most recently accessed data blocks reside in HBM, and that the data blocks with high recency are re-ordered by SP-factor. Generally, the LRU-SP policy extends the residence time of data blocks that are suitable for HBM, which effectively reduces the scheduling overhead. Compared to LRU, LRU-SP can better utilize the bandwidth advantage of the HBM-DRAM hybrid memory system. The ratio of the data blocks to be locked can be customized by the user.
C. LIRS-SP POLICY
Many workloads exhibit repeatable data access patterns but poor temporal locality. In such cases, the LRU policy performs poorly. The LIRS [20] policy is designed to cope with the situation in which bursty accesses to cold data cause hot data to be evicted. The policy considers two attributes of a data block: Inter-Reference Recency (IRR) and recency. It divides the memory space into two parts: the Low Inter-reference Recency (LIR) region, which stores data that has been accessed twice, and the High Inter-reference Recency (HIR) region, which stores data that has been accessed only once or has been evicted from the LIR region.
As shown in Figure 7 (a), the LIRS policy is implemented using a stack S and a queue Q. The stack ensures that a LIR data block is at the bottom of the stack. Otherwise, stack pruning is performed, which removes the bottom data blocks until a LIR data block reaches the bottom. This guarantees that the recency of any HIR data block in the stack is less than the maximum LIR recency, which is necessary for converting HIR to LIR. Stack S holds the LIR and HIR data blocks, and the HIR data blocks are further divided into resident and non-resident states. Therefore, in the LIRS policy, a data block has three states: LIR, HIR resident and HIR non-resident. Queue Q only buffers the HIR data blocks in the resident state.
Figure 7 (a) illustrates how the LIRS policy works, assuming the buffer size is 5, the LIR size is 3, and the HIR size is 2. In step 1, data block 5 is accessed. The LIR space is exhausted and HIR has remaining space at this time, so block 5 is placed in the HIR resident state and inserted at the top of stack S and the tail of queue Q. In step 2, data block 6 is accessed. Since HIR has no space, data block 4 at the head of queue Q needs to be evicted first. After that, data block 4 in stack S is set to the HIR non-resident state, and data block 6 is set to the HIR resident state. Eventually, data block 6 is inserted at the top of the stack and the tail of the queue. In step 3, data block 4 is accessed. Because data block 4 resides in the stack, it satisfies the condition for being converted to a LIR data block. However, due to the insufficient LIR space, data block 1 at the bottom of the stack is converted to the HIR resident state and moved to the top of S and the tail of Q after data block 5 in Q is evicted. Since data block 5 has been evicted, it is set to the HIR non-resident state in S. Finally, data block 4 is inserted as a LIR data block at the top of S. In the loop pattern, if the LIR size is k, then the LIRS policy can achieve a hit rate of at least k/n.
We improve the LIRS policy (LIRS-SP) using the stack bottom re-order algorithm, which locks the LIR data blocks at the bottom of the stack according to LockRatio, and then re-orders the locked area by SP-factor so that the data block with the smallest SP-factor sits at the bottom, as shown in Figure 7 (b). Compared to the LIRS policy, the time before a data block with a large SP-factor is converted to HIR is extended, which reduces the overhead caused by data eviction. The LIRS-SP policy can therefore better utilize the high bandwidth memory in the hybrid memory system. The ratio of the LIR data blocks to be locked can be customized by the user.
D. ARC-SP POLICY
ARC [22] is a self-tuning data eviction policy that adapts to changes in data access patterns by adjusting the space occupied by recently re-referenced data and frequently re-referenced data. The memory space is divided into two stacks: stack recently re-reference used (rru) and stack frequently re-reference used (fru), where rru stores the data accessed once, and fru stores the data that has been accessed twice or more. In addition, an rru ghost queue and an fru ghost queue are maintained to record the IDs of evicted data blocks: when a data block is evicted from rru or fru, its block ID is added to the tail of the corresponding ghost queue. When a data block in the rru ghost queue is accessed, it is called an rru ghost hit. This means that the memory space allocated to the recently re-referenced data is insufficient, and the memory space of rru needs to be increased. Conversely, if the data in the fru ghost queue is accessed, the memory space of fru needs to be increased.
Figure 8 gives an example of data evictions using the ARC policy. In step 1, a new data block 9 is accessed and stack rru has no remaining space, so data block 1 is evicted from the memory and its data block ID is moved from the bottom of stack rru to the top of stack rru ghost. Then data block ID 9 is inserted at the top of the rru stack. In step 2, data block 5 in the rru stack is re-accessed and moved into the fru stack. Because stack fru has no remaining space, data block 6 at the bottom of stack fru is first evicted from the memory and inserted at the top of the fru ghost stack, and then data block 5 is inserted at the top of stack fru. In step 3, data block 1 stored in the rru ghost stack is re-accessed, so the capacity of rru needs to be increased to 5. Then data block 1 is stored at the top of stack rru, and no data block is evicted. In step 4, data block 6 stored in the fru ghost stack is re-accessed, so the capacity of fru needs to be increased to 5. Then data block 6 is stored at the top of stack fru, and no data block is evicted. In step 5, data block 4 stored in stack fru ghost is accessed, so the capacity of fru is increased to 5 and the capacity of rru is reduced accordingly. The data block at the bottom of stack rru is evicted due to the limited capacity of rru.
We improve the ARC policy (ARC-SP) using the stack bottom re-order algorithm, which locks the bottoms of stack rru and stack fru according to the lock ratio, and then re-orders the locked area by SP-factor in ascending order. Compared to ARC, ARC-SP considers the unique characteristics of the HBM-DRAM hybrid memory system and gives better performance with higher bandwidth.
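The stack bottom re-order step can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the Alluxio implementation: the function name `reorder_stack_bottom` is hypothetical, index 0 of the list is taken to be the stack bottom (the next eviction victim), the locked region is rounded up, and the block with the smallest SP-factor is placed at the very bottom so that it is evicted first.

```python
import math


def reorder_stack_bottom(stack, sp_factor, lock_ratio):
    """Return a new stack whose locked bottom region is sorted by SP-factor.

    stack      -- block IDs, stack[0] is the bottom (assumed eviction end)
    sp_factor  -- dict mapping block ID to its SP-factor
    lock_ratio -- fraction of the stack (from the bottom) that is locked
    """
    # Assumption: the size of the locked region is rounded up.
    locked = math.ceil(len(stack) * lock_ratio)
    # Smallest SP-factor goes to the bottom, so it is evicted first.
    bottom = sorted(stack[:locked], key=lambda block: sp_factor[block])
    # The unlocked upper part keeps its recency order.
    return bottom + stack[locked:]
```

With a lock ratio of 0.5 on a six-block stack, only the three bottom blocks are re-ordered; the recency order of the upper half is untouched, so the policy still behaves like the original one for recently accessed data.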
E. LRFU-SP POLICY
Different from the LRU, LIRS and ARC policies, which make data eviction decisions based on an LRU stack, the LRFU [21] policy determines the data to be evicted based on the Combined Recency and Frequency (CRF) value. The CRF is calculated by Equation 8 and represents the sum of the contributions of all previous accesses to the data block, where n is the total number of times the data block has been accessed and t_i is the time of the i-th access. F(t) represents the contribution of a single access, which decays with time as shown in Equation 9.
In addition, the calculation of CRF is recursive, so only the latest update time and the CRF value at that time need to be kept. Depending on whether the data block is accessed again in the time range (t, t + Δt), the value of CRF at time t + Δt is calculated by Equation 10 (not re-accessed) or Equation 11 (re-accessed).
In the contribution function F(t), the parameter λ controls whether the LRFU policy behaves more like LRU or more like LFU. When λ is 0, the LRFU policy is equivalent to the LFU policy; when λ is 1, it is equivalent to the LRU policy. Using SP-factor, we propose the LRFU-SP policy, which makes data eviction decisions according to the CRF-SP value calculated by Equation 12. Compared to the LRFU policy, the LRFU-SP policy is able to leverage the advantage of HBM and can achieve better bandwidth in the HBM-DRAM hybrid memory system.
$$\mathrm{CRF\text{-}SP}(t) = \mathrm{CRF}(t) \times \text{SP-factor} \tag{12}$$
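The recursive CRF update can be checked with a short Python sketch. Equations 8 to 11 are not reproduced in this text, so the sketch assumes the contribution function of the original LRFU formulation, F(x) = (1/2)^(λx); the class name `CrfState` and the function names are hypothetical. Because F(x + y) = F(x)·F(y), the stored CRF can be decayed to any later time, and a re-access simply adds F(0) = 1.

```python
def contribution(x, lam):
    """Assumed LRFU contribution of one access that happened x time units ago."""
    return 0.5 ** (lam * x)


def crf_direct(access_times, now, lam):
    """CRF as the sum of the contributions of all previous accesses."""
    return sum(contribution(now - t, lam) for t in access_times)


class CrfState:
    """Keeps only the latest update time and the CRF value at that time."""

    def __init__(self, first_access, lam):
        self.lam = lam
        self.last_update = first_access
        self.crf = contribution(0, lam)  # a fresh access contributes F(0) = 1

    def value_at(self, now):
        # Not re-accessed: decay the stored CRF by F(now - last_update).
        return contribution(now - self.last_update, self.lam) * self.crf

    def on_access(self, now):
        # Re-accessed: decayed old CRF plus the contribution of the new access.
        self.crf = contribution(0, self.lam) + self.value_at(now)
        self.last_update = now


def crf_sp(crf, sp_factor):
    """CRF-SP value used by LRFU-SP (Equation 12)."""
    return crf * sp_factor
```

For accesses at times 0, 3 and 7, the recursive state evaluated at time 10 matches the direct sum, and with λ = 0 the CRF degenerates to the access count, which is the LFU behaviour described above.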

V. DESIGN AND IMPLEMENTATION OF ADSM
The existing data scheduling strategy in Alluxio is based on the idea of cache eviction policies, where the recently accessed data is preferentially written to the upper storage tier. Such a mechanism leads to large scheduling overhead with low benefit. Therefore, this paper proposes a benefit-oriented scheduling strategy that abstracts the scheduling as a classic 0/1 Knapsack problem and schedules the data with the goal of maximizing the benefit of the upper storage tier.
A. KNAPSACK PROBLEM ABSTRACTION
Inspired by [28], we abstract the data block scheduling problem in the HBM-DRAM hybrid memory system as a 0/1 Knapsack problem. The HBM is equivalent to a capacity-limited knapsack, v_i is the value of data block i, s_i is the size of data block i, x_i indicates whether data block i is placed in HBM, and V represents the total value of the HBM. The constraint is the capacity of HBM, and the target of data block scheduling in the HBM-DRAM hybrid memory system can be expressed as Equation 13.
$$\max V = \sum_i v_i x_i \quad \text{s.t.} \quad \sum_i s_i x_i \le \mathrm{CAPACITY}_{\mathrm{HBM}}, \quad x_i \in \{0, 1\} \tag{13}$$
Solving the 0/1 Knapsack problem is NP-hard [29]. Finding the optimal solution requires formidable computation, therefore a sub-optimal greedy algorithm is commonly applied to solve this problem in polynomial time. The value density is defined in Equation 14. The data blocks in the hybrid memory system are placed in HBM in descending order of value density until the remaining space cannot accommodate more data blocks.
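The greedy placement can be sketched in a few lines of Python. Equation 14 is not reproduced in this text, so the sketch assumes the usual definition of value density, v_i / s_i; the function name `greedy_fill` and the example blocks are hypothetical.

```python
def greedy_fill(blocks, capacity):
    """Pick blocks for HBM in descending order of value density.

    blocks   -- dict mapping block ID to a (value, size) pair
    capacity -- HBM capacity, in the same unit as the block sizes
    Returns the list of chosen block IDs.
    """
    # Assumed value density: value per unit of size.
    order = sorted(blocks, key=lambda b: blocks[b][0] / blocks[b][1], reverse=True)
    chosen, free = [], capacity
    for block_id in order:
        size = blocks[block_id][1]
        if size <= free:  # skip blocks that no longer fit
            chosen.append(block_id)
            free -= size
    return chosen
```

For blocks a (value 60, size 10), b (value 100, size 20) and c (value 120, size 30) with a capacity of 50, the sketch selects a and b for a total value of 160, whereas the optimal choice is b and c with a total value of 220. This gap is the price paid for a polynomial-time solution.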
B. THE BENEFIT FUNCTION
In the HBM-DRAM hybrid memory system, v in Equation 13 indicates the benefit of storing the data block in HBM and is positively correlated with the SP-factor and the re-access probability. The benefit is represented by Equation 15, where λ is a balance factor that adjusts the relative impact of the SP-factor and the re-access probability on the value. Based on the comparison of SP-factor and re-access probability, λ is limited to the range of 0.35 to 0.45. For workloads with long access times, the re-access probability is generally low. Therefore, λ needs to be set smaller accordingly, so that data blocks with a low re-access probability do not reside in HBM merely because of a large SP-factor, which would decrease BHR and bandwidth performance.
The parameter λ can be tuned adaptively, which will be used in our ADSM method in Section V-C. P(t) is the probability that the data block is accessed at time t. If the data block is not accessed again in the time interval Δt, the re-access probabilities at time t + Δt and time t satisfy Equation 16. We use exponential decay to calculate the re-access probability, whose definition is shown in Equation 17, where n is the number of times the data block has been accessed and t_i is the time of the i-th access. The parameter a determines the relative impact of recency and frequency on the re-access probability: the larger a is, the higher the impact of recency. The value of a can be selected from the empirical space [28] and can also be adjusted adaptively by our ADSM method in Section V-C.
In addition, the access probability function P is also recursive. The re-access probability at a specific time in the future can be calculated from the last update time and the corresponding re-access probability. If the data block is not accessed within the time interval Δt, the re-access probability at time t + Δt is calculated by Equation 18. If it is re-accessed at time t + Δt, the new re-access probability is calculated by Equation 19.
The ADSM scheduling method aims at storing the data blocks with the highest value density in HBM. The detailed algorithm of the ADSM scheduling method is shown in Algorithm 1. When a data block is accessed in the hybrid memory system, the scheduling module is triggered to generate the scheduling lists toEvict and toPromote. List toEvict records the IDs of the data blocks that should be evicted from HBM, whereas list toPromote records the IDs of the data blocks that should be promoted from DRAM into HBM. Based on these two lists, the adaptive data scheduling method generates scheduling decisions. In Algorithm 1, the HBMList records the data blocks in HBM, sorted in ascending order of value density. accessId represents the ID of the most recently accessed data block, and freespace represents the remaining space of HBM. When a data block is accessed, accessId is first updated to the ID of the accessed data block. The algorithm then determines whether the accessed data block is already in the HBMList (line 3). If it is, we re-order all data blocks stored in the HBMList according to the new value density of the accessed data block. Otherwise, the algorithm determines whether to promote the accessed data block. If there is enough free space (line 11), the data block is moved from DRAM to HBM. Otherwise, the algorithm determines whether to evict data blocks from HBM. We accumulate the value of the data blocks that need to be evicted in order to leave enough space for the data block to be promoted, and compare the accumulated value with the value of the data block to be promoted (line 15-23). If the latter is higher, the data evictions are performed and the accessed data block is promoted to HBM.
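The scheduling decision described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the actual ADSM code: the names `HbmState` and `schedule` are hypothetical, block values are supplied by the caller instead of being refreshed inside the function, value density is taken as value divided by size, and the state is only modified once the decision to promote has been made.

```python
from dataclasses import dataclass, field


@dataclass
class HbmState:
    """Hypothetical container for the state read and updated by the scheduler."""
    free_space: int                  # remaining HBM space
    size: dict                       # block ID -> size
    value: dict                      # block ID -> current benefit value
    hbm_list: list = field(default_factory=list)  # blocks resident in HBM

    def density(self, block_id):
        # Assumed value density: benefit per unit of size.
        return self.value[block_id] / self.size[block_id]


def schedule(state, access_id):
    """Return (changed, to_evict, to_promote) for one block access."""
    if access_id in state.hbm_list:
        # Hit in HBM: only re-order by the refreshed value density.
        state.hbm_list.sort(key=state.density)
        return False, [], []

    promote_size = state.size[access_id]
    promote_value = state.value[access_id]
    to_evict, evict_size, evict_value = [], 0, 0.0

    if state.free_space <= promote_size:
        # Walk eviction candidates from the lowest value density upwards.
        for victim in sorted(state.hbm_list, key=state.density):
            if evict_size + state.free_space >= promote_size:
                break
            if evict_value + state.value[victim] >= promote_value:
                return False, [], []  # evicting would cost more than it gains
            to_evict.append(victim)
            evict_size += state.size[victim]
            evict_value += state.value[victim]
        if evict_size + state.free_space < promote_size:
            return False, [], []      # block does not fit even in an empty HBM

    # Commit the decision: drop evicted blocks, add the promoted one.
    state.hbm_list = [b for b in state.hbm_list if b not in to_evict]
    state.hbm_list.append(access_id)
    state.hbm_list.sort(key=state.density)
    state.free_space += evict_size - promote_size
    return True, to_evict, [access_id]
```

Deferring all state changes until the comparison has succeeded is a design choice of this sketch: a rejected promotion leaves the HBM contents and the free space exactly as they were.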
D. PARAMETER TUNING IN ADSM
There are two parameters, λ and a, that can be tuned in ADSM. Parameter λ adjusts the relative impact of the SP-factor and the re-access probability on the value of a data block, to prevent data blocks with a large SP-factor but a small re-access probability from residing in HBM. Parameter a adjusts the impact of access recency and frequency on the re-access probability. The larger a is, the higher the impact of recency on the re-access probability. In order to reduce the overhead of parameter tuning and obtain better scheduling performance, we present a parameter auto-tuning module in the data scheduling method. The auto-tuning module starts periodically and adjusts a set of parameters through multiple phases. Each phase consists of several events regarding specific data block operations, including access, evict, promote, commit and delete.
To keep the tuning effective while constraining its overhead, we tune the parameters based on the historical performance of the parameter settings during that phase, as shown in Equation 20. The detailed process of parameter tuning is shown in Algorithm 2. At the beginning of a phase, a new set of parameters is taken from the parameter space and sent to the scheduling module (line 3), then the performance metrics are collected. Based on the performance metrics, the performance score is calculated at the end of the phase. The current performance score is then compared to the highest performance score in history (lines 7-15). After all parameter sets have been tested, the optimal parameter set is selected and sent to the scheduling module as the current optimal parameter setting (line 20).

Algorithm 1 Adaptive Data Scheduling Algorithm
Input: accessId, freespace, HBMList, toEvict, toPromote
Output: true if there is data to be evicted or promoted, otherwise false
1: Time ++
2: // update the value of block accessId
3: updateValue(accessId)
4: if accessId in HBMList then
5:   re-order HBMList according to the new value density of accessId
6:   return false
7: else
8:   // update the value of blocks in HBMList
9:   updateValues()
10:  toPromote ← accessId
11:  promoteSize ← getSize(accessId)
12:  promoteValue ← getValue(accessId)
13:  if freeSpace > promoteSize then
14:    freeSpace ← freeSpace − promoteSize
15:    return true
16:  else
17:    // whether accessId should be inserted into HBM
18:    for evictSize + freeSpace < promoteSize do
19:      next ← HBMList.head()
20:      if evictValue + getValue(next) < promoteValue then
21:        // pop out the head of HBMList and add it to toEvict
22:        toEvict ← HBMList.pop()
23:        evictSize ← evictSize + getSize(next)
24:        evictValue ← evictValue + getValue(next)
25:      else
26:        return false
27:      end if
28:    end for
29:    freeSpace ← freeSpace + evictSize − promoteSize
30:    HBMList.update(toEvict, toPromote)
31:    return true
32:  end if
33: end if
In our system, the parameter space of λ and a in the tuning module is shown in Equation 21. The space of a consists of five values almost equally scattered in log space [28], which keeps the overhead of a full-range search small.
Algorithm 2 Parameter Tuning Algorithm in ADSM
Input: parameterSpace, maxEventPerPhase
Output: optimalParameterSet
1: for each parameterSet in parameterSpace do
2:   // parameter set in scheduling module
3:   send(parameterSet)
4:   while true do
5:     eventListener()
6:     eventNo ++
7:     if eventNo = maxEventPerPhase then
8:       eventNo ← 0
9:       BHR, BIR, SPHR, SPIR ← update(parameterSet)
10:      score ← Score(BHR, BIR, SPHR, SPIR)
11:      if score > maxScore then
12:        optimalParameterSet ← parameterSet
13:        maxScore ← score
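The search over the parameter space can be sketched in Python as a plain exhaustive loop. This is a sketch under assumptions, not the ADSM tuning module itself: `tune_parameters` is a hypothetical name, `run_phase` stands for running the scheduler with one parameter set until the per-phase event limit is reached and returning the collected metrics, and `score` stands for the performance score function, whose concrete form (Equation 26) is not reproduced here and must be supplied by the caller.

```python
def tune_parameters(parameter_space, run_phase, score):
    """Try every parameter set for one phase and keep the best-scoring one.

    parameter_space -- iterable of parameter sets, e.g. (lambda, a) tuples
    run_phase       -- callable: parameter set -> dict of phase metrics
    score           -- callable: dict of phase metrics -> float
    Returns (best_parameter_set, best_score).
    """
    best_params, best_score = None, float("-inf")
    for params in parameter_space:
        metrics = run_phase(params)   # metrics collected during the phase
        current = score(metrics)      # performance score at the end of the phase
        if current > best_score:      # compare with the best score so far
            best_params, best_score = params, current
    return best_params, best_score
```

Passing the phase runner and the score function as callables keeps the loop independent of how the metrics are gathered, so the same loop can be exercised offline with recorded metrics before it is connected to a live worker.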
Note that the parameter tuning module in Algorithm 2 leverages four performance metrics: BHR, BIR, SPHR and SPIR. Their definitions are as follows:
1) BHR (Byte Hit Ratio) is a performance metric proposed in [30], [31], indicating the ratio of the data bytes accessed in HBM to the total bytes accessed within a phase. The higher the BHR is, the more data is accessed in HBM, which is positively correlated with the overall bandwidth of the hybrid memory system. BHR is defined in Equation 22, where n is the number of data blocks accessed in HBM and m is the number of data blocks accessed in DRAM.
2) BIR (Byte Insert Ratio) is a metric that accounts for the data migration overhead. In a hybrid memory system, migrating data blocks between HBM and DRAM slows down data access, because an access must wait for the migration to complete before it can reach the data block. BIR is the ratio of the data bytes moved to HBM to the total bytes accessed within a phase, as shown in Equation 23, where k is the number of data blocks moved from DRAM to HBM.
3) Based on the BHR, we propose a new performance metric SPHR (SP-factor Hit Ratio) using SP-factor. This metric reflects the performance advantage of HBM under high parallelism and sequential access. SPHR is the ratio of the sum of the SP-factors of the data blocks accessed in HBM to the sum of the SP-factors of all data blocks accessed within the phase, as shown in Equation 24.
4) Based on the BIR, we propose a new performance metric SPIR (SP-factor Insert Ratio) using SP-factor. SPIR is the ratio of the sum of the SP-factors of the data blocks moved into HBM to the sum of the SP-factors of all data blocks accessed within the phase, as shown in Equation 25.
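The four metrics can be computed from the events of one phase as in the following Python sketch. It follows the verbal definitions above; the function name `phase_metrics` and the tuple layout of the event records are assumptions made for illustration.

```python
def phase_metrics(accesses, promotions):
    """Compute BHR, BIR, SPHR and SPIR for one phase.

    accesses   -- list of (size_bytes, sp_factor, in_hbm) per block access
    promotions -- list of (size_bytes, sp_factor) per block moved DRAM -> HBM
    """
    total_bytes = sum(size for size, _, _ in accesses)
    total_sp = sum(sp for _, sp, _ in accesses)
    return {
        # Bytes served from HBM over all bytes accessed.
        "BHR": sum(size for size, _, hit in accesses if hit) / total_bytes,
        # Bytes moved into HBM over all bytes accessed.
        "BIR": sum(size for size, _ in promotions) / total_bytes,
        # SP-factor served from HBM over all SP-factor accessed.
        "SPHR": sum(sp for _, sp, hit in accesses if hit) / total_sp,
        # SP-factor moved into HBM over all SP-factor accessed.
        "SPIR": sum(sp for _, sp in promotions) / total_sp,
    }
```

If every access that misses in HBM triggers a promotion of the same block, the bytes served from HBM and the bytes moved into HBM together cover all accessed bytes, so BHR + BIR = 1 and likewise SPHR + SPIR = 1. A scheduler that declines some promotions yields sums below 1.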
From an empirical study with various workloads, we find that BHR, BIR, SPHR and SPIR have a strong linear relationship with the overall bandwidth, and that there is strong multi-collinearity among these four performance metrics. After building a linear regression model between the four performance metrics and the bandwidth performance, we derive the performance score function in Equation 26. The performance score can be used to directly compare the scheduling efficiency under various parameter settings.
The performance metrics with the recency-friendly workload are shown in Figure 9 (see Section VI for the specific experiment setup). In Figure 9, each group contains three columns: BHR+BIR, SPHR+SPIR and Speedup. The speedup on the right y-axis is relative to the longest execution time (LRFU λ = 0.0). Clearly, for all policies except ADSM, the sum of BIR and BHR as well as the sum of SPHR and SPIR is 1. Under the cache eviction policies of Alluxio, if a data access misses in the upper tier, the data is moved to the upper tier, which results in a lot of data scheduling with hardly any performance gain. The LRU-SP policy extends the residence time of data blocks with a large SP-factor, thus the SPHR increases by 16%, which reduces the execution time by 4% compared to the LRU policy. It should be noted that the scheduling volume (BIR) of ADSM is 48.5% lower than that of the LRU policy, thus the scheduling overhead is effectively mitigated. In addition, the SPHR of ADSM is significantly higher than that of the other policies, therefore ADSM achieves the shortest execution time.
E. IMPLEMENTATION AND OVERHEAD
We introduce several metadata fields to Alluxio, as shown in Table 1. When a data block is accessed, BlockStore checks the recentAccessList to obtain the parallelism new, and updates the parallelism, SP-factor and BlockMeta. Then, the worker checks whether the block triggers an inter-tier migration. If so, the block ID is added to the list toPromote, and the evicted block ID is added to the list toEvict. The BlockStore is responsible for migrating data blocks and updating the HBMList.
The ParameterMonitor module launches periodically and traverses the parameterSet. Then, ParameterMonitor collects the corresponding performance metrics BHR, BIR, SPHR and SPIR, and obtains the optimalParameterSet as the parameters for next stage. The memory overhead of ADSM is quite limited. For instance, storing 100k data blocks on the worker only takes up 2.4MB memory. Moreover, the computation overhead of ADSM is also negligible with an increase of 296 microseconds to the execution time when dealing with 100k blocks.
VI. EVALUATION
In this section, we evaluate the performance of the scheduling methods we proposed using SP-factor, including LRU-SP, LIRS-SP, LRFU-SP, ARC-SP and ADSM. The experiments demonstrate the effectiveness of our approach from the following aspects: 1) evaluate the HBM hit ratio and execution time for workloads with various data access patterns using improved data eviction policies based on SP-factor; 2) compare the ADSM method with other data eviction policies to verify that ADSM achieves higher HBM hit ratio and shorter execution time. In addition, we verify that the parameter auto-tuning module adjusts the scheduling decision for better performance.
A. EXPERIMENTAL SETUP
The experiments are performed on a cluster consisting of a master node and four slave nodes, all equipped with an Intel Xeon Phi 7210 CPU (KNL). Each KNL CPU consists of 64 cores (4 hardware threads per core) and 16GB HBM. Each node is also equipped with 200GB DRAM and a 2TB hard disk. Both master and slave nodes are installed with CentOS 7.0. We deploy Hadoop v2.7.6, Spark v2.12, and Alluxio v1.8.0 on each node. The HBM tier in Alluxio is partitioned with different sizes using the memory space from the HBM, whereas the MEM tier is allocated 50GB DRAM on each slave node. The MEM tier is set as the default write tier. In addition, each slave node runs HDFS with a 500GB hard disk as the underlying file system of Alluxio. We deploy Alluxio-perf and HiBench v7.0 on the master node. Alluxio-perf is a built-in evaluation tool, which can customize the data access parallelism and data block size for workloads running on top of Alluxio. HiBench is a widely used big data benchmark suite, which contains representative big data applications such as Kmeans, PageRank and SQL query.
In the experiment, we generate synthesized workloads with four different data access patterns using Alluxio-perf, including recency-friendly workload, frequency-friendly workload, loop workload and mixed workload. As shown in Figure 10 and Figure 11, the access parallelism and the data block size are generated according to the normal distribution, where the access parallelism and the data block size are distributed as N(30, 10²) and N(100, 30²) respectively. The configuration of our HBM-DRAM hybrid memory system is shown in Table 2, where the Cache mode means that there is no HBM tier in Alluxio. In Cache mode, all I/O operations are performed in the MEM tier. Therefore, the HBM hit ratio is 0, and the execution time is the same across all policies. All experiments in the following sections use Cache mode as the baseline.
B. PERFORMANCE COMPARISON WITH SYNTHESIZED WORKLOAD
In this section, we compare the performance of different data scheduling methods under different configurations of the hybrid memory system with synthesized workloads. As shown in Figure 12 , for the recency-friendly workload, the HBM hit ratio of different data scheduling methods is significantly improved as the amount of HBM allocated to Alluxio increases. The LRU policy achieves the highest hit ratio across all memory configurations. In the Flat mode, the hit ratio of LRU policy reaches 46.7%, while the hit ratio of LRU-SP policy reaches 40.18%.
The execution time of the recency-friendly workload is shown in Figure 13. Note the blue dashed line in Figure 13, Figure 15, Figure 17 and Figure 19; the Speedup is the execution time in Cache mode divided by the execution time of each method in each mode. In the Hybrid-1 mode, the execution time of several data scheduling methods is even higher than in Cache mode, because the data scheduling between HBM and DRAM introduces additional overhead. Similar to the trend in Figure 12, the execution time decreases as the size of the HBM tier in Alluxio increases. In the Flat mode, the execution time of LRU-SP is 13.9% shorter than that of LRU. Compared to other methods, ADSM reduces the number of data promotions from the MEM tier to the HBM tier, which results in the least scheduling overhead and thus the shortest execution time.
As shown in Figure 14, LRFU (λ = 0) achieves the highest hit ratio on the frequency-friendly workload, which is 2.18× that of LRU in Flat mode. However, in terms of execution time, as shown in Figure 15, LRFU-SP (λ = 0) achieves the shortest execution time, which is 19.7% shorter than that of LRFU (λ = 0). For the Loop workload, LIRS and LIRS-SP achieve a much higher hit ratio than other methods, as shown in Figure 16. LRU achieves a hit ratio of 0 with this workload, whereas LRU-SP extends the time that data blocks with a large SP-factor reside in HBM, and thus its hit ratio in the Flat mode reaches 10.5%. As shown in Figure 17, LIRS-SP has the shortest execution time on the Loop workload.
As shown in Figure 18, for the mixed workload, ADSM achieves the highest HBM hit ratio, which is more than 9.5% higher than that of ARC-SP. The execution time of the mixed workload is shown in Figure 19. ADSM achieves the shortest execution time compared to other scheduling methods across all memory configurations. In Flat mode in particular, the execution time of ADSM is 18% and 5.8% shorter than that of ARC and ARC-SP respectively. The results demonstrate that ADSM can quickly adjust the scheduling decisions during runtime, and thus achieves better performance than other methods for workloads with mixed access patterns.
In sum, after applying SP-factor to the existing data eviction policies, the execution time generally decreases even though the hit ratio decreases, which means better performance. In addition, in Flat mode, which allocates all HBM space to Alluxio, the workloads achieve the shortest execution time compared to other memory configurations. This confirms the conclusion of [6] that using HBM in Flat mode achieves better performance than using HBM as system L3 cache. Compared to an existing adaptive policy such as ARC, ADSM performs better on all types of workloads due to the efficient scheduling method and the effectiveness of parameter tuning, which reduces both the scheduling and data migration overhead.
C. PERFORMANCE COMPARISON WITH HIBENCH
In this section, we evaluate the performance of different data scheduling methods in the Flat mode using representative big data applications. Specifically, we use KMeans, PageRank and SQL queries from HiBench.
1) KMeans: A well-known clustering algorithm for knowledge discovery and data mining that is widely used in large-scale machine learning applications. We store the input and intermediate data in Alluxio and ensure the data is large enough to exceed the space of HBM in order to exercise the scheduling methods. For KMeans, the intermediate data is accessed in an irregular loop access pattern by multiple threads on each node. As shown in Figure 20, LIRS achieves the highest hit ratio, which is 1.87× that of LRU. Figure 21 shows that the execution time of all improved data eviction policies is shorter than that of the original policies. For instance, LIRS-SP reduces the execution time by 7.2% compared to LIRS. In particular, ADSM achieves the shortest execution time compared to other scheduling methods, which benefits from the value density function applied to reduce the overhead of useless data migrations.
2) PageRank: A link analysis algorithm widely used in web search engines, which calculates the ranks of web pages based on the number of reference links. Similar to KMeans, PageRank consists of a series of computation steps, among which several steps are iterated until the convergence condition is satisfied. The intermediate data in PageRank is also accessed in an irregular loop access pattern. As shown in Figure 20, LIRS achieves the highest hit ratio, whereas LIRS-SP and ADSM achieve the shortest execution time, reducing the execution time by 7.4% and 8.4% respectively compared to the LIRS policy, as shown in Figure 21.
3) SQL Queries: The SQL queries are evaluated against two types of tables, the web page access Step 1 and
Step 2) are repeated 10 times. Trace-1 is a typical recency-friendly workload. LRU achieves the highest HBM hit ratio as shown in Figure 20, while LRU-SP achieves the shortest execution time, which is 12.2% shorter than that of LRU as shown in Figure 21. Trace-2 belongs to the frequency-friendly access pattern. Table Rankings is accessed 3× as frequently as table UserVisits, therefore the hit ratio of LRFU (λ = 0) is the highest. However, the execution time of LRFU-SP (λ = 0) and ADSM is the shortest, which is about 7.8% shorter than that of LRFU (λ = 0). Since Trace-3 exhibits a mixed access pattern, ARC, ARC-SP and ADSM achieve high hit ratios. In particular, ADSM adapts to the mixed access pattern more effectively, therefore its hit ratio, 49%, is higher than that of ARC and ARC-SP. In addition, the execution time of ADSM is also the shortest, which is 5% shorter than that of ARC-SP.
Compared to synthetic workloads, which spend most of their time on I/O operations, real big data applications spend a significant amount of time on computation and container scheduling. Therefore, although the gap in hit ratio among different scheduling methods is quite large, the difference in execution time is small. Nevertheless, the ADSM proposed in this paper always achieves the shortest execution time across all big data applications. This is because ADSM can effectively adjust the parameters to improve the HBM hit ratio, and reduce the data movement within the hybrid memory system to improve the scheduling efficiency.
effectively reduces the system overhead in scan pattern workload. Reference [19] implements the classic cache eviction policies LIRS and ARC on Alluxio, and makes experimental comparisons under various workload patterns. The results show that LIRS and ARC have better performance, and that ARC performs particularly well on workloads with changing access patterns.
Reference [28] regards the data scheduling in the external memory as a knapsack problem, and it demonstrates that the exponential decay algorithm can be used to characterize the re-access probability.
VIII. CONCLUSION AND FUTURE WORK
In this paper, we leverage the unique characteristics of the HBM-DRAM hybrid memory system to deliver higher bandwidth for applications with diverse access patterns. We propose a novel data scheduling metric, SP-factor, along with an access parallelism modeling mechanism, which considers the impact of data block size and data access parallelism on the bandwidth performance of the hybrid memory system. In addition to improving existing data eviction policies such as LRU, LIRS, LRFU and ARC using SP-factor, this paper also proposes an adaptive data scheduling method, ADSM, which can effectively tune the scheduling decisions to adapt to different access patterns based on performance metrics collected during runtime. Finally, we evaluate the performance of our scheduling methods on Alluxio with synthetic workloads and big data applications. The experimental results demonstrate that ADSM achieves the best performance for all applications under different memory configurations. We have contributed the improved data eviction policies using SP-factor as well as ADSM to the open source community [41]. For future work, we would like to study systems with more types of memory devices such as 3D XPoint and PCM in order to verify the scalability of ADSM.
