Storage class memories, including flash, have been attracting much attention as promising candidates for today's enterprise storage systems. In particular, since the cost and performance characteristics of flash lie between those of DRAM and hard disks, many studies have considered it as a secondary caching layer underneath the main memory cache. However, the correlation and interdependency between DRAM and flash caching have received little study. This paper views this problem as a special form of multi-level caching and tries to understand the benefits of this multi-level hybrid cache hierarchy. We reveal that significant cost can be saved by using flash to reduce the size of the DRAM cache while maintaining the same performance. We also discuss design challenges of using flash in the caching hierarchy and present potential solutions.
Introduction
The past decade has witnessed significant advances in semiconductor technologies and the emergence of various forms of storage class memory (SCM) such as NAND flash, magnetic RAM (MRAM), phase-change memory (PRAM), and ferroelectric RAM (FeRAM). In particular, NAND flash memory has a low unit price (in $/byte) due to its mass production. It has therefore not only become a major storage medium for mobile devices in recent years, but is also on the road to replacing a portion of the hard disks (HDDs) currently used in desktop and server environments [12, 18].
Researchers have also proposed using SCM to extend and complement traditional DRAM-based main memory in different ways: to compose hybrid main memory [15], to form a new virtual memory hierarchy [19], and as secondary buffer caches [8, 14]. Among them, SCM-based caching is particularly important in I/O intensive environments, where enterprise storage servers are provisioned with large DRAM buffer caches to improve I/O throughput and response time. A secondary SCM caching layer can reduce the need for DRAM cache, saving hardware costs and energy, or supply additional cache capacity.
When both DRAM and SCM caches are used in a storage server, a multi-level hybrid cache hierarchy is formed, as illustrated in Figure 1. However, existing work has overlooked the coordinated management of DRAM and SCM caches in this hierarchy.
In fact, coordinated multi-level caching has been widely studied in distributed computing environments [3, 6, 22, 24, 25]. In this work we view the above problem as a special form of multi-level cache management, and revisit existing mechanisms based on the following unique characteristics of the DRAM-SCM-HDD hybrid storage hierarchy.
Firstly, in previously studied environments, multiple levels of caches are typically deployed in and managed by distributed computer systems connected with local or remote networks. SCM, on the other hand, often resides in the same node as, and has a fast bus connection to, the main memory.
Moreover, the low unit price of SCM, especially flash, means that its cache space can be an order of magnitude larger than the DRAM capacity. This differs from distributed multi-level caches, where a lower level often has comparable or even smaller cache space than an upper level [3].
Finally, as a storage medium, SCM possesses the following unique characteristics:
(i) SCM writes are noticeably slower than reads. (ii) SCM becomes less reliable after many repeated write-erase cycles; depending on the media type, some SCM devices also suffer degraded performance during garbage collection, which is performed when insufficient free blocks are available. (iii) The non-volatile property of SCM makes it an attractive option for buffering dirty data blocks longer.
Based on these observations, in this paper we make two major contributions.
• To understand the impact of multi-level hybrid caches, we analyze the amount of DRAM that each GB of SCM "saves" when used for demand paging, prefetching, and write buffering, respectively.
• To address the challenges of adopting this architecture, we propose several novel mechanisms as well as a number of design guidelines. Their purpose is to help the system approach the ideal saving mentioned above. For simplicity, the rest of our discussion is based on flash; however, our techniques can be extended to any type of SCM.
How much DRAM can be saved?
In this section, we analyze the multi-level caching problem in a hybrid storage system consisting of DRAM, flash memory, and a hard disk drive. We consider saving DRAM as the primary impact of using flash in this I/O stack. Therefore, our basic approach is to solve the following equation, i.e., to find the amount of DRAM whose caching benefit equals that of a given amount of flash:

$$Flash(x_f) = DRAM(x_d) \qquad (1)$$
Demand paging
Memory cache space can be used in three ways:
• Demand paging - keeping data blocks that are likely to be re-accessed in the future
• Prefetching - fetching data blocks before they are requested, based on predicted access patterns
• Write buffering - absorbing asynchronous write requests
In this subsection we present our analysis and preliminary results on demand paging. Section 2.2 extends the discussion to prefetching and write buffering.
In multi-level demand paging, the role of the lower-level cache is to catch misses from the upper level. For a given workload, let $h(x)$ denote the cache hit rate with cache size $x$. If we have a DRAM cache of size $x_d$ and a flash cache of size $x_f$, then the first-level hit rate is $h(x_d)$ and the second-level hit rate is $h(x_d + x_f) - h(x_d)$.
If we further assume the random read costs on DRAM, flash, and HDD are $r_d$, $r_f$, and $r_h$ respectively, then the average I/O response time is:

$$t(x_d, x_f) = h(x_d)\,r_d + \left(h(x_d + x_f) - h(x_d)\right) r_f + \left(1 - h(x_d + x_f)\right) r_h$$
Having $x_f$ amount of flash saves total access time by:

$$\Delta t = \left(h(x_d + x_f) - h(x_d)\right)\left(r_h - r_f\right)$$
Therefore, when coupled with $x_d$ amount of DRAM, we can deduce the following: $x_f$ of flash saves the amount of DRAM $\Delta x_d$ that satisfies

$$h(x_d + \Delta x_d)\,r_d + \left(1 - h(x_d + \Delta x_d)\right) r_h = t(x_d, x_f)$$

i.e., a DRAM-only cache of size $x_d + \Delta x_d$ matches the response time of the hybrid configuration.
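To make the model concrete, the following Python sketch (ours, not the paper's code; the hit-rate curve h(x) is synthetic and stands in for a trace-measured one) computes the hybrid response time and numerically solves for the DRAM-only size with matching performance:

```python
# Sketch of the demand-paging model: average response time and the
# DRAM-equivalent of a flash cache. Costs follow Section 2.1 (seconds).
R_D, R_F, R_H = 1.56e-6, 86e-6, 2.35e-3   # DRAM, flash, HDD read costs

def h(x_mb):
    # Hypothetical concave hit-rate curve; replace with trace-driven data.
    return min(0.95, 0.3 * (x_mb ** 0.25))

def avg_response_time(x_d, x_f):
    hit_d = h(x_d)                       # first-level (DRAM) hit rate
    hit_f = h(x_d + x_f) - h(x_d)        # second-level (flash) hit rate
    miss = 1 - h(x_d + x_f)              # remaining requests go to disk
    return hit_d * R_D + hit_f * R_F + miss * R_H

def dram_saved(x_d, x_f, step=1):
    # Smallest DRAM-only size whose response time matches the hybrid's,
    # i.e., solving Equation 1 numerically for the DRAM-equivalent of x_f.
    target = avg_response_time(x_d, x_f)
    x_eq = x_d
    while avg_response_time(x_eq, 0) > target:
        x_eq += step
    return x_eq - x_d                    # MB of DRAM that x_f flash saves

print(dram_saved(x_d=64, x_f=128))       # e.g., for the Section 2 setup
```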
To empirically evaluate the saving we use a multi-level cache emulator extended from, and cross-validated with, the multi-level cache simulator used in [3, 25]. The emulator runs at the application level and bypasses the OS buffer cache by using direct and synchronous I/O. A real SATA-II SSD (Intel SSDSA2SH032G1GN) and HDD (WDC WD3200AAKS-75L9A0) are used in the experiments. Our preliminary experiments use a simple DRAM-Flash-HDD architecture and adopt the LRU replacement policy on both caches. Three I/O traces are used: OLTP (collected from online transaction processing applications at a large financial institution) and WebSearch (collected from a popular search engine) are obtained from the Storage Performance Council (SPC), while TPC-H was collected at Purdue University by running the corresponding benchmark. Prefetching is turned off and write requests in the traces are ignored. In these experiments the average request size is 15KB, and the read access costs on DRAM, SSD, and HDD are $r_d$ = 1.56µs, $r_f$ = 86µs, and $r_h$ = 2.35ms.

The results are shown in Figure 2. Without flash, the response time decreases almost linearly with larger DRAM cache. Adding a flash cache improves performance significantly in most cases. Moreover, the response time is much less sensitive to the DRAM cache size when the flash cache is larger than DRAM. This is because, even with a high aggregate cache hit ratio (~90%), the I/O response time is dominated by disk accesses; as long as flash is larger than DRAM, increasing the DRAM size only converts flash cache hits into DRAM cache hits, without reducing disk accesses. We also see that in the OLTP and TPC-H traces, almost 90% of requests can be served from 256MB of memory. In contrast, WebSearch shows much less temporal locality: even 4GB of memory barely achieves a 50% hit ratio.
$ Saving:
We illustrate the actual $ saving by comparing two sample configurations: a 128MB DRAM cache and a hybrid cache of 64MB DRAM and 128MB flash (marked as "A" and "B" in Figure 2(a)). Since they have almost the same response time, we have $Flash(128MB) = DRAM(64MB)$ (as in Equation 1). With a 5:1 DRAM-to-flash price ratio ($/GB), the hybrid cache saves 59% over the single-level DRAM cache. With a 10:1 ratio the saving becomes 79%.
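As a sanity check on the arithmetic (our reading, not spelled out in the paper): the figures are consistent with measuring the saving on the portion of DRAM being replaced. Here 128MB of flash substitutes for 64MB of DRAM, so with a DRAM-to-flash price ratio $r$ (in $/GB),

$$\text{saving} = 1 - \frac{128 \cdot (1/r)}{64 \cdot 1} = 1 - \frac{2}{r}$$

which gives 60% for $r = 5$ and 80% for $r = 10$; the reported 59% and 79% presumably reflect the authors' exact price points.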
Prefetching and Write Buffering
Prefetching: In modern systems, sequential prefetching is controlled by two parameters: the prefetch degree, indicating how much data to prefetch per prefetch request, and the trigger distance, indicating how early to issue the next prefetch request. The memory consumption of a stream with a given prefetch degree $P$ and trigger distance $T$ is roughly $T + \frac{P}{2}$ [24]. While prefetching mechanisms differ in how they set $T$ and $P$, the main factors affecting these decisions are the application's data access rate and the storage medium's latency and throughput.
Since flash has smaller read latency and higher read throughput than HDD, for a given workload we have $P_f < P_h$ and $T_f < T_h$, where $P_f$, $T_f$, $P_h$, and $T_h$ are the prefetch degrees and trigger distances for flash and HDD, respectively. Therefore, with $n_{sr}$ sequential read streams, the following equation can be deduced:

$$Flash(x_f) = DRAM\left(n_{sr} \cdot \left[\left(T_h + \frac{P_h}{2}\right) - \left(T_f + \frac{P_f}{2}\right)\right]\right)$$
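For illustration with hypothetical numbers (ours, not the paper's): if an HDD-backed stream uses $P_h = 512$KB and $T_h = 256$KB while a flash-backed stream needs only $P_f = 128$KB and $T_f = 32$KB, each stream frees $(256 + 256) - (32 + 64) = 416$KB of DRAM, so $n_{sr} = 100$ concurrent streams would save roughly 40MB.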
Write buffering: Write buffering is less studied than demand paging and prefetching because of its asynchronous nature. However, researchers have pointed out that a lack of careful management can lead to considerable performance degradation [2].
Current DRAM write buffering mechanisms are designed with several main considerations in mind. Firstly, multiple updates to the same page may be aggregated and performed once. Secondly, spatially related writes may be merged into large sequential requests which are favorable for hard disk performance.
In contrast to HDDs, flash devices have no rotating parts, and the main benefit of large sequential writes is to reduce the number of erases. Therefore, with a flash cache layer underneath, DRAM write buffering only needs to accumulate write requests up to the size of a flash erase unit $f_{pg}$ (typically 128KB or 256KB).
However, it is still beneficial for DRAM to leverage temporal locality and catch re-writes to the same page. Otherwise the same page will be erased multiple times on the flash layer. Meanwhile, the flash memory layer could serve as a buffer region to re-organize dirty blocks into spatially sequential batches to optimize disk I/O.
If there are $n_{sw}$ sequential write streams with an average write rate of $wr$, and the system uses $al_d$ as the age limit for flushing dirty pages from DRAM and $al_f$ for flash, then we have the following equation:

$$Flash(x_f) = DRAM\left(n_{sw} \cdot wr \cdot (al_f - al_d)\right)$$
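For illustration with hypothetical numbers (ours, not the paper's): with $n_{sw} = 50$ streams writing at $wr = 1$MB/s each, $al_d = 1$s, and $al_f = 30$s, the flash layer would hold $50 \times 1 \times (30 - 1) \approx 1.4$GB of dirty data that a DRAM-only buffer would otherwise have to keep.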
Approaching the ideal saving
It is not trivial to achieve the ideal savings as analyzed in Section 2. In this section we discuss major technical challenges of multi-level hybrid caching and also propose possible mechanisms to overcome those challenges.
Challenges
In NAND flash memory devices, read and write operations are executed at the granularity of pages. Unlike in-place updates on HDD, an update to a flash page marks the page as "invalid" and writes the new content to a free page elsewhere. The device eventually reaches a state where too few free pages remain, at which point a process is triggered to erase the pages invalidated by updates and create free pages/blocks for future write operations. This process is called garbage collection (GC). Normal read and write operations are slowed down during GC, to extents that vary with device type [11].
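To make the out-of-place update and GC mechanics concrete, here is a minimal Python sketch (our illustration, not code from the paper; real FTLs manage whole erase blocks and must copy valid pages out of victim blocks before erasing):

```python
# Minimal flash translation layer (FTL) model illustrating
# out-of-place updates and garbage collection (GC).
class TinyFTL:
    def __init__(self, num_pages, gc_threshold=2):
        self.free = list(range(num_pages))   # free physical pages
        self.mapping = {}                    # logical page -> physical page
        self.invalid = set()                 # physical pages holding stale data
        self.gc_threshold = gc_threshold     # run GC when free pages run low

    def write(self, logical_page):
        # An update never overwrites in place: the old physical page is
        # merely marked invalid, and the new data goes to a free page.
        if logical_page in self.mapping:
            self.invalid.add(self.mapping[logical_page])
        if len(self.free) <= self.gc_threshold:
            self.garbage_collect()
        self.mapping[logical_page] = self.free.pop()

    def garbage_collect(self):
        # Reclaim stale pages by erasing them. On real devices whole erase
        # blocks are erased, slowing down concurrent reads and writes.
        self.free.extend(self.invalid)
        self.invalid.clear()

ftl = TinyFTL(num_pages=8)
for i in range(20):
    ftl.write(i % 4)   # repeated updates to 4 logical pages trigger GC
```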
In addition, even without interference from GC, write operations on flash take considerably longer than reads (up to 10× the latency). When such writes fall on the critical path, e.g., a cache insertion blocking a synchronous read, they can cause application slowdowns.
Moreover, the reliability and lifetime of flash devices can be severely degraded by frequent page updates. [17] analyzes the case where flash is used as a write buffer for storage volumes, and points out that since all writes to a volume are absorbed by a relatively small flash device, the wear-out time is much shorter (less than 5 years for more than half of the workloads) than when flash is used as persistent storage (over 100 years for the majority of workloads).
We further argue that when used as a cache (for demand paging, prefetching, and write buffering), flash sees page updates caused by both read and write requests from applications. As an extreme example, even a read-only workload introduces write operations to the flash cache as new cache blocks are inserted and old ones evicted. More precisely, every read/write request observed at the disk level corresponds to one cache content update, so the total update/erase frequency of the flash cache is the sum of the disk read and write I/O rates. In Table 1 we characterize representative disk block I/O traces collected from real workloads. For instance, with the MSR Cambridge trace, the write rate to the flash cache is 213% of the write rate perceived by the disk volume.
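As a back-calculation of the 213% figure (our reading, assuming every disk-level read or write triggers exactly one flash-cache update):

$$\frac{\text{flash cache writes}}{\text{disk writes}} = \frac{\text{disk reads} + \text{disk writes}}{\text{disk writes}} = 2.13 \;\Rightarrow\; \text{disk reads} \approx 1.13 \times \text{disk writes}$$

in that trace.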
Possible solutions
To address the above challenges, existing single-and multi-level cache management mechanisms need to be revisited. In this section we discuss several new designs and present preliminary results.
Direct path between DRAM and HDD: Toward the goal of eliminating flash write operations from critical paths, we propose keeping a direct path between DRAM and HDD, instead of the traditional tiered architecture assumed by most existing multi-level caching studies [3]. This enables data blocks missing from both caches to be fetched directly from HDD into DRAM and then written to flash in the background. In addition, the system can choose to bypass the flash cache layer for some requests, as sketched below.
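A minimal sketch of such a read path (our illustration; `dram_cache`, `flash_cache`, and `hdd` are hypothetical objects with get/put/read methods), showing that a double miss completes via the direct HDD-to-DRAM path while flash population stays off the critical path:

```python
import threading

def read_block(block_id, dram_cache, flash_cache, hdd,
               bypass_flash=lambda b: False):
    # 1. DRAM hit: fastest path.
    data = dram_cache.get(block_id)
    if data is not None:
        return data
    # 2. Flash hit: promote the block to DRAM.
    data = flash_cache.get(block_id)
    if data is not None:
        dram_cache.put(block_id, data)
        return data
    # 3. Double miss: direct path HDD -> DRAM; the request completes here,
    # without waiting for a slow flash write.
    data = hdd.read(block_id)
    dram_cache.put(block_id, data)
    # Populate flash in the background; some requests may bypass it entirely.
    if not bypass_flash(block_id):
        threading.Thread(target=flash_cache.put, args=(block_id, data)).start()
    return data
```

Here a background thread stands in for whatever asynchronous mechanism the system provides; the key property is that the flash write never blocks the request.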
Frequency-aware policies:
We argue that new cache replacement policies should be designed that are aware of the cache content update frequency. As a first step we propose a simple algorithm, $LRU_f$: a variation of LRU whose cache content update frequency is only a fraction $f$ of the original algorithm's. This can be achieved in different ways, the simplest of which is to randomly discard some updates, as sketched below.
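A minimal sketch of $LRU_f$ under the random-discard variant (our illustration of the idea, not the authors' implementation):

```python
import random
from collections import OrderedDict

class LRUf:
    """LRU whose cache-content updates happen only with probability f."""
    def __init__(self, capacity, f=0.75):
        self.capacity = capacity
        self.f = f
        self.cache = OrderedDict()   # block_id -> data, in LRU order

    def get(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)   # refresh recency on a hit
            return self.cache[block_id]
        return None                            # miss

    def put(self, block_id, data):
        # Randomly discard a (1 - f) fraction of new insertions: fewer
        # flash writes/erases at the cost of a small hit-ratio loss.
        if block_id not in self.cache and random.random() > self.f:
            return
        self.cache[block_id] = data
        self.cache.move_to_end(block_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the LRU block
```

Discarding insertions rather than hits keeps reads correct: a discarded update only means a block is not cached, never that stale data is served.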
To evaluate this trade-off we implemented $LRU_f$ on the flash layer of the emulator introduced in Section 2.1 and performed experiments with the OLTP trace. The DRAM cache size is fixed at 4MB and the flash cache size varies from 8MB to 256MB. Figure 3 shows both the flash cache hit ratio and the number of cache updates of $LRU_f$ with different $f$ values. $LRU_{75\%}$ has hit ratios very similar to LRU while incurring 25% fewer cache updates. $LRU_{25\%}$ reduces the number of updates much more aggressively, at the cost of degraded hit ratios at relatively large cache sizes: it decreases the hit ratio by 4% for a 128MB cache and 9% for a 256MB one, translating into 13% and 37% increases in the number of disk reads, which dominate the response time as discussed earlier.
Related Work
Multi-level Caching: There have been many studies on improving the secondary cache hit ratio in multi-level demand paging. Existing work has focused on providing exclusiveness [22, 4, 6] or capturing long-term access patterns [26]. A comprehensive empirical evaluation is provided in [3]. With these optimizations the savings of a hybrid cache can be further improved. However, some techniques increase data transfers between the upper and lower caches, which needs to be considered when they are applied to flash. Multi-level prefetching has also been studied recently [23, 24, 25]. However, how to set an appropriate prefetching aggressiveness for a very fast secondary cache, such as locally attached flash, has not been empirically studied; we consider this our future work.
SSD storage and caching:
Many studies have been conducted on the performance optimization of systems employing flash/SSDs [8, 20, 13]. In [21] a hybrid storage device is proposed that uses an HDD as a write cache for an SSD, reducing over-writes. There are also ongoing efforts to design and develop adaptive file systems for heterogeneous media [10, 5].
There has been research on SSD-aware caching [18] as well as on using flash as a partial replacement for main memory/DRAM [8]. CFLRU [18] optimizes the main memory caching mechanism for underlying SSD-based persistent storage. In [9] a hybrid memory system is proposed in which a portion of main memory is replaced with a larger flash; however, that work focuses on architectural design issues rather than cache management.
Conclusion
In this paper we showed that a multi-level hybrid cache can achieve a significant $ cost saving (59% ∼ 79%) over a traditional DRAM cache by adding a flash layer.
