ABSTRACT Owing to its high performance, small size, and low energy consumption, NAND flash memory has been extensively adopted in cyber-physical systems. However, the inherent characteristics of flash memory, including out-of-place update and asymmetric I/O latencies, complicate the design of buffer management policies. In this paper, we propose an enhanced buffer management policy for flash memory, referred to as dynamic page weight least recently used (DPW-LRU), which considers temporal locality while making effective use of limited buffer resources. Page migration is further enhanced by identifying the page access mode and frequency while separating the buffer into two different regions. A novel eviction algorithm is also designed to reduce write operations and maintain a high hit ratio in the buffer regions by combining dynamic temporal locality, real-time eviction cost, and recency of pages. The experimental results show that DPW-LRU improves the hit ratio by up to 8.3%, decreases the write count by up to 22.6%, and reduces the overall latency by up to 18.8% relative to other state-of-the-art buffer management policies.
I. INTRODUCTION
Massive data processing and analysis impose considerably higher requirements on the data throughput and I/O latency of Cyber-Physical Systems. While the hard disk drive (HDD) no longer meets the growing demand for I/O quality, flash memory excels because of its high performance, light weight, small size, and low energy consumption. Moreover, NAND flash memory has been widely used in Cyber-Physical Systems because technologies such as the multi-level cell, triple-level cell, and quad-level cell lead to higher bit storage density and thus lower product cost [1]-[4].
However, NAND flash memory cannot completely replace the hard disk drive owing to its inherent characteristics, including out-of-place update, block-based erasure, asymmetric I/O latency, and limited endurance [5]-[7]. In NAND flash memory, pages must be erased before they are overwritten, so data updates have to follow the out-of-place update policy. In addition, both the read and write operations are page-based, whereas the erase operation is aligned on the block size. The I/O latencies of NAND flash memory are asymmetric, and the latencies of the write and erase operations are considerably higher than that of the read operation [8]. Moreover, NAND flash memory chips have a limited number of program/erase (P/E) cycles, which means that a block becomes unusable after a certain number of erase operations.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Yu.
The buffer is envisioned as a potential method to handle the asymmetric I/O speed of different storage layers. By combining the buffer with storage devices, a hard disk drive and flash memory can better and more rapidly process upper-layer requests. However, traditional buffer management policies are optimized for hard disk drives, so the internal parallelism and inherent characteristics of flash memory are ignored when disk-oriented buffer policies are migrated directly to flash memory [9]-[11]. Consequently, the mechanism of the storage system needs to be reviewed, and optimized buffer management policies have to be developed for flash-based storage systems [12]-[14].
The purpose of flash-based buffer management policies is to regulate the I/O sequence, reduce the access to flash memory, and improve the overall efficiency of flash-based storage systems. Traditional buffer management policies only consider the hit ratio and neglect the asymmetric I/O latency of flash memory [15], [16]. Therefore, buffer management policies should maintain a good hit ratio while minimizing the write and erase operations to decrease the overall latency. Many flash-based buffer management policies have been proposed to handle the asymmetric I/O latency, such as Clean-First Least Recently Used (CF-LRU) and Least Recently Used Write Sequence Reordering (LRU-WSR). CF-LRU and LRU-WSR not only distinguish clean pages from dirty pages but also follow a clean-first eviction strategy to reduce overall latency. However, both of them can be polluted by a sequential scan of pages, which leads to a poor hit ratio and overall latency [17], [18]. In contrast to CF-LRU and LRU-WSR, AD-LRU splits the buffer into two regions to store clean and dirty pages and evicts the clean page first to reduce the write and erase operations. AD-LRU performs better; however, on some workloads, clean-first eviction results in a buffer full of dirty pages and, consequently, more read operations on flash memory [19], [20]. In addition, the size of the buffer regions that suits different workloads is difficult to determine. Moreover, current buffer management policies ignore the access pattern of pages and have not fully utilized temporal locality.
In the current study, we propose a Dynamic Page Weight LRU (DPW-LRU) buffer management policy that takes advantage of the access pattern and locality of pages. A novel page migration strategy is introduced, which identifies the page access mode and frequency while separating the buffer into two different regions. The eviction algorithm is also improved to reduce the write count and maintain the high hit ratio of the buffer regions, combining the dynamic temporal locality, real-time eviction cost, and recency of pages.
The remainder of this paper is organized as follows: Section II reviews the background and related work on buffer management policies for flash storage systems. Section III explains the proposed DPW-LRU policy. Section IV presents the experimental results and their analysis on different workloads. Finally, Section V presents the conclusions of the study and outlines our future work.
II. BACKGROUND AND RELATED WORK
A. FLASH TRANSLATION LAYER AND FLASH MEMORY BUFFER
Fig. 1 illustrates the architecture of a flash-based storage system, which consists of operating system-level components and flash memory-level components. The operating system level includes applications for different scenarios, file systems for file management, and block layers for block-device abstraction. At the flash memory level, a flash translation layer (FTL) acts as an interface layer between the operating system and the flash memory chip to overcome the inherent defects of NAND flash memory and to remain compatible with the legacy or heterogeneous upper-layer operating system [21]-[23]. Buffer management is integrated in the FTL, along with other flash memory management strategies, such as address mapping, garbage collection, and wear leveling [24]-[26]. With an FTL integrated, the inherent characteristics of flash memory are transparent to the upper-layer operating system, and flash memory is treated as a block device by the upper-layer file system. The buffer in flash memory usually comes in the form of dynamic random access memory (DRAM). LPDDR3 and LPDDR4 are typically integrated as the flash memory buffer, which renders flash memory more power efficient. Compared with flash memory, DRAM is capable of in-place write operations and is about 1,000 times faster in access time [27]-[29]. The buffer allows the storage of frequently requested data and page information, which leads to reductions in flash read and write operations. However, owing to the high cost of DRAM, the buffer in flash memory is rather small in contrast to the high-capacity flash chip [30]. An efficient buffer management policy is evidently necessary to achieve high performance and prolong lifetime.
B. RELATED WORK
Many flash-based buffer management policies have been proposed to enhance buffer efficiency [31] - [34] . The first buffer management policy designed for flash storage systems is CF-LRU [31] . It maintains a clean-first window of size w, and clean pages inside the window are evicted first to reduce the write operations. However, a proper window size adapted to different workloads is difficult to find. In addition, the hit ratio decreases as dirty pages take over the buffer.
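The clean-first window of CF-LRU can be illustrated with a minimal Python sketch. It assumes that the buffer is an LRU-ordered mapping from page ID to a dirty flag and that the policy falls back to the plain LRU page when the window holds no clean page; the function name `cf_lru_victim` and the data layout are illustrative choices and are not taken from [31].

```python
from collections import OrderedDict


def cf_lru_victim(buffer, window_size):
    """Pick a victim using a clean-first window (illustrative sketch).

    buffer: OrderedDict mapping page id -> is_dirty, ordered from LRU to MRU.
    window_size: number of pages at the LRU end that form the window (w).
    """
    if not buffer:
        raise ValueError("buffer is empty")
    # Pages inside the clean-first window, starting from the LRU end.
    window = list(buffer.items())[:window_size]
    for page_id, is_dirty in window:
        if not is_dirty:
            return page_id  # evicting a clean page needs no flash write
    # Assumed fallback: no clean page in the window, so take the LRU page.
    return next(iter(buffer))


pages = OrderedDict([(1, True), (2, False), (3, False), (4, True)])
victim = cf_lru_victim(pages, window_size=2)
```

With a small window the policy behaves like plain LRU, and with a large window it keeps dirty pages longer at the cost of evicting recently used clean pages, which is the tuning difficulty noted above.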
In [32] , LRU-WSR, which was developed to address the disadvantages of CF-LRU, considers hotness of pages and keeps hot dirty pages retained in the buffer. To ensure that the buffer is not dominated by cold dirty pages, hot clean pages and cold pages are evicted first. Both CF-LRU and LRU-WSR are vulnerable to a sequential scan of pages.
Cold Clean First Least Recently Used (CCF-LRU), a policy that considers both cold/hot and clean/dirty characteristics, was introduced in [33]. It uses two LRU lists, one for cold clean pages and one for all other pages. When the buffer is full, the cold clean pages are evicted first, whereas the hot clean pages are retained in the buffer. The disadvantage of CCF-LRU is that recently buffered clean pages can be easily evicted when the cold clean list is empty.
Based on CCF-LRU, AD-LRU was proposed in [34] . The algorithm uses the dynamic size of the cold and hot regions to avoid eviction of recently buffered clean pages. The eviction occurs in the cold region when the size of the cold list exceeds a predefined size; otherwise, pages within the hot region are evicted.
Current flash-based policies [31]-[34] are considerably enhanced relative to traditional buffer management policies; however, the access pattern and locality of pages are not fully utilized [35]. Consequently, flash-based buffer management policies can be optimized further.
III. PROPOSED DYNAMIC PAGE WEIGHT (DPW) POLICY
A. DESIGN AND ARCHITECTURE OF DPW-LRU
The dynamic page weight LRU (DPW-LRU) policy is designed to enhance traditional LRU-based buffer management policies by decreasing the write operations of flash memory and improving the hit ratio of the buffer. DPW-LRU distinguishes pages in terms of page access mode and frequency. In DPW-LRU, the pages are categorized into four types: cold clean (CC), cold dirty (CD), hot clean (HC), and hot dirty (HD) pages. A buffered page that is referenced only once is referred to as a cold page; otherwise, it is considered a hot page. A buffered page with a read access mode is regarded as a clean page; otherwise, it is regarded as a dirty page.
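The four page types can be expressed as a small classification routine. The following Python sketch assumes that each buffered page carries a reference counter and a dirty flag that is set once the page has been written; the `BufferedPage` class and its field names are hypothetical and serve only to illustrate the rule stated above.

```python
from dataclasses import dataclass


@dataclass
class BufferedPage:
    page_id: int
    ref_count: int = 1   # references since the page entered the buffer
    dirty: bool = False  # assumed to be set by any write access


def classify(page):
    """Return 'CC', 'CD', 'HC', or 'HD' for a buffered page."""
    temperature = "C" if page.ref_count <= 1 else "H"  # cold: referenced once
    state = "D" if page.dirty else "C"                 # dirty: written
    return temperature + state


label = classify(BufferedPage(page_id=7, ref_count=3, dirty=True))
```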
As illustrated in Fig. 2, the buffer of the DPW-LRU algorithm is divided into two regions, the exchange region and the working region, with sizes L1 and L2, respectively. The size of the total buffer is L, which is the sum of L1 and L2.
VOLUME 7, 2019
The exchange region contains the CC pages and the pages evicted from the working region; meanwhile, the working region maintains the CD, HC, and HD pages. The working region keeps a window of size w for eviction. In contrast to previous LRU-based buffer management policies, DPW-LRU has a novel page migration strategy, and the buffer cannot be polluted by a sequential scan of pages. In addition, clean pages have a sufficient life cycle, and the buffer is not full of dirty pages.
Fig. 2 also presents the data transmission of DPW-LRU through different layers. When upper-layer requests arrive, the requests are formatted and enqueued into a first-in-first-out (FIFO) queue (step 1 in Fig. 2). The hash function is applied to every request to determine whether a hit occurs in the hash buckets (step 2 in Fig. 2). If a hit occurs, the buffer is notified to update the pages within the working region or the exchange region (step 3 in Fig. 2). If a miss occurs, the requested page is read from flash memory, and the buffer is updated simultaneously (steps 4 and 5 in Fig. 2). When eviction occurs, the victim page inside the buffer is evicted to flash memory (steps 4 and 6 in Fig. 2).
B. PAGE MIGRATION STRATEGY AND DPW ALGORITHM
The page migration strategy within and between the exchange region and the working region is presented in Fig. 3 . The exchange region obeys the LRU policy, and the working region follows a novel page migration strategy on the basis of a DPW algorithm. To maintain the efficiency of the buffer regions, three arguments are applied in the algorithm: temporal locality interval (TLI), eviction cost (EC), and recency (RE). The fundamental model of the DPW algorithm is outlined below.
We refer to the page access request as $R$, which contains two parameters: $R_{pid}$ and $R_{am}$. $R_{pid}$ is the requested page ID, and $R_{am} \in \{read, write\}$ is the access mode of $R$. On the basis of $R$, the formalization of the access sequence can be denoted as
$$S = \{R_j \mid 1 \le j \le n\}$$
where $j$ is the sequence number of the page access, and $n$ represents the total access amount. To record the unique pages in an access sequence, a set called $U$ is defined as
$$U = \{P_i \mid 1 \le i \le m\}$$
where $P_i$ denotes the $i$th unique page in the access sequence, and $m$ is the number of unique requested pages. For every specific page in $U$, an access sequence set can be deduced as
where $AS_k$ is the sequence number of the $k$th access request of $P_i$. Suppose $P_i$ is a buffered page; the DPW algorithm is described below.
Definition 1: Given that $S$ is the access sequence, and $AS_i$ is the access sequence of $P_i$, the TLI of $P_i$ is defined as
where TLI represents the average access interval of $P_i$, which is used to estimate the temporal locality of $P_i$. With TLI, we can forecast the next access of $P_i$. On the basis of $P_i^{TLI}$, we can calculate the average TLI of all pages in $U$ as
. . , L w q } are the latest q latencies of different operations on a specific flash storage system. The eviction cost of P i is given by
In contrast to other LRU-based algorithms, EC adapts to the ratio of the write and read costs of a specific memory device. Given dynamic read and write costs based on the load of flash memory, EC becomes even more adaptive. By using EC, the total latency can be reduced because the different characteristics of flash memory are taken into account.
Definition 3: Let $AS_{latest}$ be the latest access, and $AS_i$ be the last access of $P_i$. Recency of $P_i$ is defined as
Recency shows the freshness of $P_i$, and pages with low freshness should be evicted first.
Definition 4: By integrating the three aforementioned arguments, the DPW of $P_i$ can be calculated as
where α is the coefficient of page weight. On the LRU side of the working region, a window of size w is set to conduct the DPW buffer replacement strategy. Pages within the window of the working region are compared using the aforementioned formula. The page weight determines the eviction decision: the page with the lowest weight is evicted to the exchange region.
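The per-page bookkeeping behind these definitions can be sketched in Python. The sketch assumes one plausible reading of the text: the temporal locality interval is taken as the mean gap between consecutive accesses of a page, and recency is measured as the number of requests issued since the page was last accessed. Both readings, as well as the function names, are illustrative assumptions and do not reproduce the exact expressions of Definitions 1 and 3.

```python
from collections import defaultdict


def access_sequences(requests):
    """Map each unique page id to the 1-based sequence numbers of its accesses.

    requests: iterable of (page_id, access_mode) pairs, i.e. the sequence S.
    The keys of the result correspond to the set U of unique pages.
    """
    seqs = defaultdict(list)
    for j, (page_id, _mode) in enumerate(requests, start=1):
        seqs[page_id].append(j)
    return dict(seqs)


def mean_access_interval(seq):
    """Assumed TLI reading: average gap between consecutive accesses."""
    if len(seq) < 2:
        return None  # a page accessed once has no interval yet
    return (seq[-1] - seq[0]) / (len(seq) - 1)


def requests_since_last_access(seq, latest):
    """Assumed recency reading: distance from the latest request."""
    return latest - seq[-1]


trace = [(7, "read"), (3, "write"), (7, "read"), (7, "write"), (3, "read")]
seqs = access_sequences(trace)
tli_7 = mean_access_interval(seqs[7])
gap_7 = requests_since_last_access(seqs[7], latest=len(trace))
```

Each quantity needs only the first access, the last access, and the access count of a page, so it can be maintained with constant additional state per buffered page.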
C. DPW-LRU WORKFLOW AND ALGORITHM DESCRIPTION
The overall workflow of DPW-LRU is illustrated in detail in Algorithm 1. When a page access request arrives, the hash function is applied to determine whether the target page is already stored in the buffer. If the target page is already buffered, it is called a hit; otherwise, it is regarded as a miss. When a hit occurs in the working region, the page is moved to the most recently used (MRU) position of the working region. If a hit occurs in the exchange region or the request misses, the next step is to check whether the working region or the exchange region is full. If the required region is full, then the DPW algorithm is executed. The victim page is evicted, and the requested page is inserted into the required region. The destination of the eviction varies depending on the situation. When a hit occurs in the exchange region, the victim page is moved to the MRU position of the exchange region (lines 8-10). When a miss occurs in the working region, the victim page is moved to the MRU position of the exchange region, and the LRU page of the exchange region is evicted to flash memory (lines 26-29). When a miss occurs in the exchange region, the victim page is evicted to flash memory (lines 18-20). Algorithm 2 demonstrates the process of the eviction algorithm. In the algorithm, the weight of every page within the window is calculated (lines 6-9). The three aforementioned arguments are adopted for page weight computation (line 7). The weight of each page is stored in a weight map, and the page with the minimum weight is chosen as the victim page (lines 10-16).
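The workflow described above can be sketched in Python under several stated assumptions. The two regions are modeled as LRU-ordered dictionaries, both limits and the window are assumed to be at least 1, and the exchange region only evicts to flash when it actually overflows. The class `DpwLruSketch` and the function `placeholder_weight` are hypothetical names; in particular, `placeholder_weight` is a stand-in and not the DPW formula of Definition 4.

```python
from collections import OrderedDict


class Page:
    def __init__(self, pid, dirty):
        self.pid, self.dirty, self.refs = pid, dirty, 1


def placeholder_weight(page):
    # Stand-in for DPW(page): dirty and frequently used pages weigh more.
    return (8 if page.dirty else 1) * page.refs


class DpwLruSketch:
    def __init__(self, er_limit, wr_limit, window, weight=placeholder_weight):
        self.er, self.wr = OrderedDict(), OrderedDict()  # LRU first, MRU last
        self.er_limit, self.wr_limit = er_limit, wr_limit
        self.window, self.weight = window, weight
        self.flash_writes = 0

    def _flush(self, page):
        # Evicting to flash costs a write only if the page is dirty.
        if page.dirty:
            self.flash_writes += 1

    def _evict_from_wr(self):
        # Lowest weight among the w pages nearest the LRU end of WR.
        candidates = list(self.wr.values())[: self.window]
        victim = min(candidates, key=self.weight)
        return self.wr.pop(victim.pid)

    def _demote(self, victim):
        # The WR victim moves to the MRU end of ER; ER overflow goes to flash.
        self.er[victim.pid] = victim
        if len(self.er) > self.er_limit:
            _, lru = self.er.popitem(last=False)
            self._flush(lru)

    def _insert_wr(self, page):
        if len(self.wr) >= self.wr_limit:
            victim = self._evict_from_wr()
            self.wr[page.pid] = page
            self._demote(victim)
        else:
            self.wr[page.pid] = page

    def access(self, pid, mode):
        """Process one request and return True on a buffer hit."""
        write = mode == "write"
        if pid in self.wr:                    # hit in the working region
            page = self.wr[pid]
            page.refs += 1
            page.dirty |= write
            self.wr.move_to_end(pid)
            return True
        if pid in self.er:                    # hit in ER: promote to WR
            page = self.er.pop(pid)
            page.refs += 1
            page.dirty |= write
            self._insert_wr(page)
            return True
        page = Page(pid, write)               # miss
        if write:
            self._insert_wr(page)             # cold dirty pages enter WR
        else:
            if len(self.er) >= self.er_limit:
                _, lru = self.er.popitem(last=False)
                self._flush(lru)
            self.er[pid] = page               # cold clean pages enter ER
        return False


buf = DpwLruSketch(er_limit=2, wr_limit=2, window=2)
hits = [buf.access(p, m) for p, m in
        [(1, "read"), (1, "read"), (2, "write"), (3, "write")]]
```

Passing a different `weight` callable is the intended extension point: any function that maps a page to a number can replace the placeholder without touching the migration logic.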
Algorithm 1: Workflow of DPW-LRU (WR: working region; ER: exchange region)
 1: if Page ∈ buffer then
 2:   if Page ∈ WR then
 3:     Move Page to the MRU of WR;
 4:   else
 5:     if WR_size < WR_size-limit then
 6:       Insert Page into the MRU of WR;
 7:     else
 8:       victim ← Eviction(WR);
 9:       Insert Page into MRU of WR;
10:       Insert victim into MRU of ER;
11:     end if
12:   end if
13: else
14:   if P_am == Read then
15:     if ER_size < ER_size-limit then
16:       Insert Page into MRU of ER;
17:     else
18:       victim ← Eviction(ER);
19:       Insert Page into MRU of ER;
20:       Evict the victim page to the flash memory;
21:     end if
22:   else
23:     if WR_size < WR_size-limit then
24:       Insert Page into MRU of WR;
25:     else
26:       victim ← Eviction(WR);
27:       Insert Page into MRU of WR;
28:       Insert victim into MRU of ER;
29:       Evict the LRU of ER to the flash memory;
30:     end if
31:   end if
32: end if
33: return Page;
Algorithm 2: Eviction algorithm
 1: if the target region is ER then
 2:   Run the LRU strategy in the LRU of ER;
 3:   return the LRU reference of ER;
 4: else
 5:   Run the DPW-LRU strategy in WR;
 6:   for page ∈ [0, w] of WR do
 7:     P_weight ← DPW(page);
 8:     Map_weight.put(page, P_weight);
 9:   end for
10:   for page ∈ Map_weight do
11:     if Map_weight.get(page) < W_min then
12:       victim ← page;
13:       W_min ← Map_weight.get(page);
14:     end if
15:   end for
16:   return victim;
17: end if
D. COMPLEXITY ANALYSIS OF DPW-LRU ALGORITHM
In this subsection, we describe the time and space complexity of the DPW-LRU algorithm. The primary source of time complexity is the eviction algorithm (Algorithm 2). For each page in the eviction window, the page weight needs to be calculated. The eviction algorithm has O(1) time complexity, since the size of the eviction window in the working region is a constant. Therefore, the total time complexity of the DPW-LRU algorithm is O(1). In terms of space complexity, DPW-LRU uses additional data structures to maintain the recency and temporal locality interval of each page. The additional space complexity for each page in the buffer is O(1), since a constant amount of additional space is used to store the additional data for each page.
IV. PERFORMANCE EVALUATION
In this section, the Flash-DBSim [36] simulation platform was used in our experiment, and various workloads including real-world traces were employed to analyze the overall efficiency of the DPW-LRU policy. We obtained a comprehensive analysis of the DPW-LRU algorithm compared with other state-of-the-art policies, including LRU [11], CF-LRU [31], LRU-WSR [32], and AD-LRU [34].
A. EXPERIMENTAL SETUP
Experiments were conducted on a workstation with an Intel Xeon E5-2680 v4 2.4 GHz processor and 64 GB of 2400 MHz DDR4 RAM. The operating system of the workstation was 64-bit Windows 10 Pro. Flash-DBSim is a well-designed simulation environment for flash-based algorithms, and we used it with Visual Studio 2015 to simulate a NAND flash memory. Table 1 illustrates the detailed NAND flash memory configuration compared with a real storage device configuration. The block size is 128 KB with 64 pages in each block, and the erase latency is set to 1.5 ms. The endurance is set to 100,000 P/E cycles, and the read-to-write latency ratio is set to 1/8.
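As a rough illustration of how these settings translate into latency accounting, the Python sketch below derives the page size from the block geometry and sums per-operation latencies into a total. The erase latency (1.5 ms) and the 1/8 read-to-write latency ratio come from the configuration above; the absolute read latency is not restated here, so `read_us` is a caller-supplied assumption, and both function names are hypothetical.

```python
def page_size_bytes(block_bytes=128 * 1024, pages_per_block=64):
    """Page size implied by the block geometry (128 KB / 64 pages)."""
    return block_bytes // pages_per_block


def total_latency_us(reads, writes, erases, read_us,
                     write_ratio=8, erase_us=1500):
    """Sum read, write, and erase latencies in microseconds.

    read_us is an assumed value supplied by the caller; the write latency
    is derived from it through the 1/8 read-to-write latency ratio.
    """
    write_us = read_us * write_ratio
    return reads * read_us + writes * write_us + erases * erase_us


size = page_size_bytes()
example = total_latency_us(reads=100, writes=10, erases=1, read_us=25)
```

Because each write costs eight reads and each erase costs far more, a policy that trades a few extra reads for fewer writes can still lower the total.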
B. SYNTHETIC AND REAL-WORLD WORKLOADS
To simulate the real-world access pattern of flash memory, two kinds of traces were used in our experiments. The first type consists of four synthetic traces generated on the basis of a Zipfian distribution. The characteristics of the four traces are described in detail in Table 2, with the read-and-write ratio indicating the read and write percentages of total requests and the locality representing the share of total operations performed on a certain percentage of pages. The four synthetic traces range from read-intensive to write-intensive, and each has 300,000 requests. Different ratios of the total operations indicate the different workloads of the storage system, and different localities show a combination of different access patterns of various upper-layer applications.
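A synthetic trace of this kind can be produced with a few lines of Python. The sketch below is only an approximation of the idea: it draws page IDs with probability proportional to 1/rank^skew and assigns each request a read or write mode according to a target read ratio. The function name `zipf_trace`, the `skew` parameter, and the example sizes are assumptions and do not correspond to the settings in Table 2.

```python
import random
from collections import Counter


def zipf_trace(num_requests, num_pages, read_ratio, skew=1.0, seed=0):
    """Generate (page_id, mode) requests with Zipf-like page popularity."""
    rng = random.Random(seed)  # fixed seed keeps the trace reproducible
    # Page 0 is the most popular; popularity decays as 1 / rank**skew.
    weights = [1.0 / (rank ** skew) for rank in range(1, num_pages + 1)]
    pages = rng.choices(range(num_pages), weights=weights, k=num_requests)
    return [(pid, "read" if rng.random() < read_ratio else "write")
            for pid in pages]


trace = zipf_trace(num_requests=1000, num_pages=100, read_ratio=0.9)
popularity = Counter(pid for pid, _ in trace)
```

A larger `skew` concentrates more requests on fewer pages, which corresponds to a higher locality in the sense used above.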
The second type contains two real-world traces from the Storage Performance Council (SPC), named Financial1 and WebSearch1 [37]. Financial1 is a trace recorded by monitoring requests of OLTP applications running at a large financial institution, and it contains a large number of random write requests. WebSearch1 is a trace from a popular search engine, which represents a read-intensive scenario. The characteristics of the two real-world traces are described in detail in Table 3.
C. PERFORMANCE EVALUATION ON TRACES
To obtain an overall analysis of the DPW-LRU algorithm, three evaluation criteria were adopted in our experiments: the hit ratio, the read/write count, and the total latency. We did not consider the erase count since it is always proportional to the write count.
1) READ/WRITE COUNT OF DPW-LRU
Fig. 4 shows the flash read/write counts of the DPW-LRU algorithm on different workloads. The read/write count of the first batch of requests is emphasized. Workloads with different localities differ significantly in flash read/write count. Traces with a higher locality are more likely to have a smaller read/write count. In addition, traces with similar localities vary only slightly in read operations, whereas the read/write ratio of a trace has a major impact on write operations. To illustrate, the steady read count of trace1 is around 390, while that of trace2 is about 400. However, compared with trace1, which has an average of 50 write operations, trace2 has an average of 90 write operations.
Moreover, the first 3% of requests show a period of intensive read operations and extremely sparse write operations. This can be attributed to the initially empty exchange region and working region. From 10,000 operations to 25,000 operations, DPW-LRU maintains a low flash write count because dirty pages in the working region are evicted to the exchange region rather than to flash memory. After 25,000 operations, the read/write operations of all traces become stable. Unlike the other figures, Fig. 4(f) also indicates that the WebSearch1 trace is an extremely read-intensive I/O workload with a low locality.
2) HIT RATIO
The hit ratios of five different algorithms with six different workloads are presented in Fig. 5. The hit ratio is markedly higher with a growing buffer size, but the growth rate decreases when the buffer size is large. Among these algorithms, those that consider the access pattern of pages have higher hit ratios. Results in Fig. 5(e) and (f) indicate that the hit ratio growth depends on the ratio of read/write operations and the access locality of traces. Moreover, the DPW-LRU algorithm has higher hit ratios than the other four algorithms because of the distinct design of the buffer structure to maintain an efficient buffer. Compared with those of LRU [11], CF-LRU [31], LRU-WSR [32], and AD-LRU [34], the hit ratio of DPW-LRU is 8.3%, 7.4%, 6.2%, and 4.3% higher, respectively. By combining the identification of the page access mode and frequency with the separation of the buffer into two different regions, DPW-LRU enhances the hit ratio and thus improves the buffer performance on different workloads.
3) WRITE COUNT
Fig. 6 shows that the write counts of different algorithms decrease considerably when the buffer size increases. Nevertheless, different algorithms have different descending rates. For trace1 and trace2, the gap in write count between different algorithms is considerably larger than that for trace3 and trace4 because of the smaller reference locality of trace1 and trace2. Evidently, algorithms that take the hot and dirty attributes into consideration can manage the buffer better when the locality of the trace is small. Moreover, the write counts of the five algorithms on trace4 vary less, as shown in Fig. 6(d). The reason is that dirty pages dominate the buffer in the write-intensive scenario and the difference between policies becomes negligible. DPW-LRU decreases the write count by 22.6%, 21.7%, 16.3%, and 2.1% relative to those of LRU [11], CF-LRU [31], LRU-WSR [32], and AD-LRU [34], respectively. Among these algorithms, DPW-LRU produces the lowest write count on different workloads, which can be attributed to the novel page division and region partition mechanism.
4) TOTAL LATENCY
To observe the effectiveness of the DPW algorithm on total latency, we compare it with previous methods using different workloads. Fig. 7 illustrates the total latency of different algorithms with six workloads. The total latency consists of read latency, write latency, and erase latency, representing the overall performance of a buffer management algorithm. The five algorithms vary only slightly in total latency when the size of the buffer is rather small. With a larger buffer size, DPW-LRU and AD-LRU evidently perform better than the other algorithms. As a result of asymmetric read, write, and erase latencies, total latency is largely influenced by the write count and erase count, which causes the five algorithms to exhibit almost the same efficiency on trace4, as shown in Fig. 7(d). DPW-LRU reduces the total latency by 18.8%, 17.9%, 13.3%, and 2.2% relative to those of LRU [11], CF-LRU [31], LRU-WSR [32], and AD-LRU [34], respectively. As the DPW-LRU algorithm focuses on eviction cost and temporal locality to compute the dynamic weight of pages, it makes more reasonable eviction decisions, thereby leading to lower total latency in different scenarios.
Overall, the experimental results demonstrate that the DPW-LRU buffer management policy enhances the hit ratio of the buffer, reduces the write count, and lowers the total latency, while providing effective utilization of limited buffer resources.
V. CONCLUSION AND FUTURE WORK
We proposed an enhanced buffer management policy for flash memory in Cyber-Physical Systems, which uses the access pattern and locality of pages. In accordance with the characteristics of flash memory and upper-layer operating systems, the buffer architecture was further investigated and redesigned. The flash memory buffer is divided into the working region and the exchange region, which ensures eviction accuracy and buffer efficiency. Meanwhile, pages within the buffer are categorized into four specific types on the basis of the access mode and frequency, which significantly influence the hit ratio. Compared with previous LRU-based replacement methods, the DPW-LRU algorithm improves buffer efficiency by introducing a novel page migration strategy. The eviction algorithm is also specifically designed to reduce total latency by combining dynamic temporal locality, real-time eviction cost, and recency of pages. We simulated DPW-LRU, together with other algorithms, on the Flash-DBSim platform on different workloads. The simulation results demonstrate that, compared with other buffer management policies, DPW-LRU is more efficient with respect to hit ratio, write count, and total latency. The exploration of histograms in our experimental analysis confirms the statistical significance of the DPW-LRU method in buffer management performance. Finally, the proposed buffer management policy is applicable to flash-based storage for Cyber-Physical Systems.
Our future study will focus on the application of the DPW-LRU algorithm in different scenarios. Specifically, the DPW-LRU method will be integrated with a database engine and DBMS. In addition, the implementation of the DPW-LRU algorithm on an open-channel SSD will be incorporated in our future research. 
