Abstract: Flash-based SSDs are widely used as storage caches, which benefit from both the higher performance of SSDs and the lower price of disks. Unfortunately, reliability and lifetime issues limit the use of flash-based caches. One way to solve this problem is to use the flash memory as a read cache and use other devices, such as nonvolatile memory, for write buffering. In this paper, we propose a new flash-aware read cache design that leverages the out-of-place update property of SSDs to improve both cache hit ratio and lifetime. Because of out-of-place updates, when a cache entry is evicted from the flash cache, the eviction only removes the metadata, while the real data remains accessible in the physical flash page until the whole flash block is erased. The main idea of our flash-aware cache is to reuse these evicted but still available data: when a request for a previously evicted data page arrives, instead of accessing the underlying storage to fetch the data and rewriting it into the flash cache, our design only needs to revive the evicted data. To evaluate the benefits of the flash-aware cache design, we implemented the normal LRU, normal ARC, flash-aware LRU (FLRU), and flash-aware ARC (FARC) cache algorithms on the DiskSim simulator with the SSD extension. Our simulation results demonstrate that our flash-aware cache can improve the cache hit ratio by up to 28 percent, reduce the average response time by up to 40 percent with higher performance stability, and alleviate the lifetime limitation of flash cache by reducing the erase count by more than 70 percent in the best case. Besides the flash-aware design, we also propose a new zero-migration garbage collection scheme to further extend the lifetime of flash cache. Our experiments show that the combination of our flash-aware cache design and the zero-migration garbage collection scheme reduces the erase count by up to nearly 90 percent.

INTRODUCTION
NAND flash-based solid state disks (SSDs) have recently become immensely popular and are employed in many types of environments, including portable devices, personal computers, large data centers, and distributed data systems [11], [25], [40], [52]. Unlike traditional mechanically rotating media, an SSD is a type of electrically erasable programmable read-only memory (EEPROM) using floating-gate technology, which provides many attractive technical merits, such as low power consumption, light weight, shock resistance, tolerance of hotter operating regimes, and extraordinarily high performance for random read access. An SSD is a good compromise among performance, capacity, and cost: DRAM is too costly and is not a persistent storage medium (it loses data when a power outage occurs), while conventional hard disk drives (HDDs) are too slow. Therefore, SSDs are widely used as caches sitting between DRAM and HDDs to fill the huge performance gap between them [20], [27], [28], [33], [37], [43], [46], [54]. Despite all these attractive merits, SSDs suffer from several inherent limitations, especially limited erase cycles. Each flash block can only be erased a limited number of times, after which the block becomes unreliable and is marked as a bad block. The authors of [33] showed how serious the limited lifetime of SSDs can be. They considered a 60 GB Intel 520 SSD used as a data cache for a deduplication system, where the available capacity of the SSD cache is 5 percent of the deduplicated data. Taking into account the write speed and the total amount of data that the Intel 520 SSD can absorb before wearing out, the expected SSD lifetime is only several days.
Flash-based SSDs have several distinct properties compared with hard disk drives. Two of the most important are erase-before-write and out-of-place update. A page can only be rewritten after the whole block that contains it, which holds multiple pages, has been erased. An erase operation takes several milliseconds [4], which degrades the write performance of SSDs, so out-of-place update is adopted to alleviate the influence of slow erase operations. Instead of updating the data in its original physical location, the new data is written to a free location and the previous data is marked as invalid, to be reclaimed in the future. To support out-of-place updates, the controller maintains a mapping table that associates each logical page number (LPN) with a physical page number (PPN). Whenever the accumulation of invalid pages reaches a threshold, a garbage collection process is triggered to reclaim the obsolete space. In a typical SSD, the real physical capacity is always larger than the user-addressable space, and the surplus space is called over-provisioning. The over-provisioning space serves two purposes: it supports out-of-place updates and reduces the frequency of garbage collection, and it provides substitutes for bad blocks. For enterprise applications where reliability and performance stability are of paramount importance, a large amount of flash memory is reserved as over-provisioning space.
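To make the out-of-place update mechanism concrete, the following minimal Python sketch models a page-level LPN-to-PPN mapping table. The class name, the three page states, and the append-only allocator are illustrative assumptions, not part of any real FTL.

```python
class PageMappedFlash:
    """Toy page-mapped flash with out-of-place updates (illustrative only)."""

    def __init__(self, num_pages):
        self.state = ["free"] * num_pages  # per-PPN state: free / valid / invalid
        self.mapping = {}                  # LPN -> PPN
        self.next_free = 0                 # assumed append-only allocator

    def write(self, lpn):
        """Write (or update) a logical page and return the PPN that was used."""
        if self.next_free >= len(self.state):
            raise RuntimeError("no free page: garbage collection would be needed")
        ppn = self.next_free
        self.next_free += 1
        old = self.mapping.get(lpn)
        if old is not None:
            # The old copy is only marked invalid; it stays on flash until erase.
            self.state[old] = "invalid"
        self.state[ppn] = "valid"
        self.mapping[lpn] = ppn
        return ppn
```

Updating the same LPN twice consumes two physical pages and leaves the first copy marked invalid, which is exactly the space that garbage collection later reclaims.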
When SSDs are used as primary storage devices, previous research has leveraged the out-of-place update property to improve performance and alleviate the limitations of flash memory in special application scenarios such as RAID (Redundant Arrays of Inexpensive Disks) [23], CDP (Continuous Data Protection) [22], and snapshots [48]. However, to the best of our knowledge, no existing work has utilized the out-of-place update property to improve performance when SSDs are used as caches. In general cache algorithms, when there is a cache miss and the cache is full, the replacement algorithm evicts a cache entry, and then the missed data is loaded from the lower-level storage and inserted into the cache. In a flash cache, however, the eviction merely removes the metadata; the actual user data remains accessible in the physical flash page until it is updated or erased.
In this paper, we propose a flash-aware read cache design that leverages these evicted but still accessible pages inside SSDs with negligible overhead. One possible concern about a pure read cache is data consistency. In fact, the isolation of read and write caches is well studied [5], [15], [16], [38], [51]; therefore, for simplicity, we directly use SSDs as read caches without discussing consistency issues in this paper. Moreover, since most current SSDs are used as black boxes and the cache management on the host side is unaware of the internal activities of SSDs, including the garbage collection process, how to bridge the gap between the cache management and the flash translation layer (FTL) is another potential issue. Two possible approaches to this issue will be discussed in Section 4. In our simulations, we merged the cache management and the FTL and implemented both inside the SSD. Through simulations, we show that our flash-aware read cache design can significantly improve the performance and alleviate the lifetime limitation of flash cache. Additionally, a new zero-migration garbage collection scheme is proposed and implemented to further mitigate the lifetime limitation of flash cache.
The rest of this paper is organized as follows. In Section 2, we describe background and related work on flash memory. Section 3 presents the details of our proposed flash-aware cache and zero-migration garbage collection scheme. Section 4 presents the evaluation methodology, and the experimental results are shown in Section 5. Section 6 presents concluding remarks.
BACKGROUND AND RELATED WORK
Flash Memory
There are two types of flash memory, NOR and NAND [49]. NOR flash memory is a random-access device mainly used to store firmware code. NAND flash memory offers higher density and only supports page-level access. NAND flash memory is cheaper and more common on the market. In this paper, we focus only on NAND flash memory.
NAND flash cells now come in three categories: SLC (single-level cell, storing a single bit per memory cell), MLC (multi-level cell, storing two bits per cell), and TLC (triple-level cell, storing three bits per cell). MLC and TLC technologies emerged to increase memory density and reduce price, at the cost of impaired performance and endurance. Table 1 lists the main parameters of SLC, MLC, and TLC flash memory [3].
Flash memory is organized in units of blocks and pages. A fixed number of pages (32 or 64) compose a block. There are three main operations in flash memory: read, write, and erase. Read and write operations are performed in units of pages, while erase operations are performed on a block basis. Flash memory has several distinguishing features such as out-of-place updates, limited erase cycles, and erase-before-write. Since most conventional file systems are designed for in-place-update storage devices, the Flash Translation Layer (FTL) has been developed and deployed in SSDs to mimic an in-place-update block device, making flash memory compatible with existing file systems. An FTL includes three main functional units: address translation, garbage collector, and wear-leveler. The address translation unit translates a logical page number to a physical page number in the flash memory and hides the erase-before-write feature of flash memory. The mapping methods can be coarsely classified into three categories: page-level mapping, block-level mapping, and hybrid mapping. Although page-level mapping [13] achieves the best performance, it is constrained by the limited amount of expensive SRAM. Block-level mapping [6] saves a huge amount of memory for the mapping information, but it leads to wasted space and performance degradation. As a compromise, several hybrid schemes [17], [26], [30], [31], [58] combine page-level and block-level mapping based on the following idea: most of the data are mapped at the block level to reduce the overhead, while a small fraction of frequently accessed data are mapped at the page level to guarantee performance. A garbage collector is used to reclaim the obsolete pages caused by out-of-place updates.
Whenever the number of free pages drops to a threshold, a garbage collection process is triggered to make room for incoming requests. A victim block is selected from the pool, all the valid data in the victim block are moved to other free space, and then the whole block is erased. There are several algorithms for selecting the victim block: the FIFO GC algorithm [14], [44], [55], which selects blocks in a cyclic manner; the greedy GC algorithm [8], [14], which selects the block with the fewest valid pages; the windowed GC algorithm [18], which is a combination of the FIFO and greedy algorithms; and the d-choices GC algorithm [32], [50], which selects the block containing the fewest valid pages among d randomly chosen blocks. The objective of the wear-leveler is to achieve an even erase-count distribution among all flash memory blocks to improve the overall endurance of flash memory. For simplicity, we assume in this paper that the page-level mapping scheme [13] and the greedy garbage collection policy [8] are used, and that no wear-leveling unit is present. A typical SSD is composed of a host interface, an SSD controller, DRAM, a flash controller, and flash chips. The SSD controller contains a processor and an SRAM that stores the firmware. The controller is responsible for data placement, garbage collection, wear leveling, ECC, and bad block management. DRAM is used as a cache for flash memory and to store the address mapping table. To improve performance, modern SSDs are organized into multiple channels. All the channels are independent of each other and can work in parallel. Each channel has a flash controller that buffers the pending requests and sends them to the lower level in the channel. Within a channel, there can be multiple packages, and each package contains multiple dies. All these packages and dies can work in an interleaved manner.
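The greedy and d-choices victim selection policies described earlier in this section can be sketched as follows. This is a minimal illustration that assumes each block is summarized only by its count of valid pages; the function names and the `valid_counts` list are hypothetical.

```python
import random


def greedy_victim(valid_counts):
    """Greedy GC: return the index of the block with the fewest valid pages."""
    return min(range(len(valid_counts)), key=lambda b: valid_counts[b])


def d_choices_victim(valid_counts, d, rng=random):
    """d-choices GC: sample d distinct blocks, then pick the one with the
    fewest valid pages among the sampled candidates."""
    candidates = rng.sample(range(len(valid_counts)), d)
    return min(candidates, key=lambda b: valid_counts[b])
```

When d equals the number of blocks, d-choices degenerates to the greedy policy; smaller values of d trade selection quality for a cheaper search.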
Related Work
When flash memory is used as a cache, lifetime is one of the main concerns, and it is becoming more serious due to continuously shrinking feature sizes and the adoption of MLC and TLC technologies. A number of solutions have been proposed to alleviate the lifetime problem. Typical techniques focus on designing more robust ECC [34], [56] or on improving traditional wear-leveling techniques [42]. Due to the garbage collection and wear-leveling processes, the actual number of write operations inside the flash memory is larger than the number of write requests from the host; this is called write amplification. Several research works tried to extend the lifetime of flash memory by reducing write amplification [19], [35], [47]. CAFTL, proposed by Chen et al. [12], integrated data deduplication into the FTL of the SSD to eliminate unnecessary duplicate writes and extend the lifetime of the SSD.
Another way to improve the lifetime of flash memory is retention relaxation. Retention errors, which are caused by charge leakage after the flash cells are programmed, are the dominant source of flash memory errors [9]. Liu et al. [34] improved the write speed and mitigated the requirement for stronger ECC codes by relaxing the retention time requirement. Cai et al. [10] proposed FCR (Flash Correct-and-Refresh) to extend the limited erase cycles due to retention errors. FCR reads, corrects, and reprograms (in place) or remaps the stored data before the accumulated retention errors exceed the capability of the ECC. Huang et al. [21] aggressively placed frequently updated data into worn-out flash blocks, which can only sustain shorter retention times, to prolong the lifetime of these dead blocks.
Besides these typical techniques for improving the endurance of flash memory, other research focuses on specific optimizations for flash caches. BPLRU, proposed by Kim and Ahn [29], deployed a RAM inside the SSD as a write buffer to improve the performance and lifetime of flash memory. Kgil et al. [28] put forward a scheme that splits the flash cache into separate read and write regions with changeable error correction strength and cell density to improve the reliability and lifetime of flash memory. NetApp used flash memory as a second-level read cache and NVRAM as the second-level write cache [53]. Soundararajan et al. [46] used a hard disk drive as a write cache for SSDs. How the partition between the user space and the over-provisioning space affects the performance of flash caches was explored in [39]. In that work, the over-provisioning space was dynamically configured based on the properties of the workloads to improve performance and lifetime at the same time. All of the schemes above are complementary to our solution.
FLASH-AWARE CACHE AND ZERO-MIGRATION GARBAGE COLLECTION
Our flash-aware cache design is based on the out-of-place update property of flash memory and takes advantage of evicted but still accessible data pages to serve incoming requests. To support our flash-aware cache, we add queues that preserve the evicted but still valuable entries in flash memory. These additional queues are named suspected queues (SQs), as the entries inside them will either die or be revived depending on their future behavior. We also have to make a few lightweight changes to the original cache replacement algorithms. In this paper, we chose LRU and ARC, two of the most widely used cache replacement algorithms, to incorporate into our flash-aware cache design and validate the efficiency of the proposed cache architecture. Other cache replacement algorithms can be easily tailored and integrated with our design via the following three steps. First, add suspected queues corresponding to the main cache queues. Since LRU has only one main cache queue, one SQ is required; ARC has two main cache queues, T1 and T2, so we need to add two corresponding suspected queues, SQ1 and SQ2. Second, entries evicted from the main cache queues are moved to the head of the corresponding SQs, while entries hit in the SQs are revived to the main cache queues according to the details of the specific cache algorithm. Finally, during the garbage collection process, the entries in the SQs whose data resides in the victim block are directly deleted or moved to the ghost queues.
Motivation Example
In a page-level mapping scheme, a mapping entry consists of a logical page number and the corresponding physical page number. The whole mapping table is constructed and maintained in both RAM and NAND flash memory. When a write request arrives, the mapping table is checked to determine whether the request is a new write or an update of existing data. For a new write, the data is written to a free location and a new mapping entry is added to the mapping table. For an overwrite, the data is first written to a new free location, then the old page is marked as invalid and the mapping table is updated to reflect this change. For a flash read cache, the situation is slightly different: invalid data is generated only when a cache miss happens and the cache is full. Fig. 1 shows a simplified example of out-of-place updates and the generation of invalid pages under a page-level mapping scheme. For simplicity, we assume that there are four physical blocks and each block consists of four pages. As mentioned, the flash memory has some over-provisioning capacity to support out-of-place updates and bad block replacement. In our example, although the real physical capacity is four blocks, the user-addressable space is only three blocks. Equation (1) defines the over-provision:
$$OP = \frac{C_{total} - C_{user}}{C_{total}} \times 100\% \qquad (1)$$
$C_{total}$ is the real physical capacity, $C_{user}$ is the user-addressable capacity, and $OP$ is the percentage of over-provision. We assume the OP is 25 percent in our example. Initially, the user-addressable space is full, but the over-provisioning part is completely empty. We use the corresponding logical page number to represent the user data in each physical page. Then a series of requests arrives from the upper level. A request is expressed by its LPN and operation type (read or write). In the figure, the dotted arrows point to the obsolete mapping information. The first request is a read request for LPN 15, which results in a cache miss. Thus an entry is evicted according to a specific cache replacement algorithm, for example LRU. Here we assume the LRU entry is LPN 0, so that entry is evicted and the corresponding data (PPN 3) is marked as invalid. Then the data for LPN 15 is fetched from the lower storage device and written to PPN 12, and the mapping entry for LPN 15 is added to the mapping table. The next read requests, for LPN 23 and LPN 12, are handled in the same way as the read request for LPN 15: the LRU cache entries are evicted and replaced by the new entries. The fourth and fifth requests are both read requests for previously evicted data. If the flash cache is treated as a traditional cache device like DRAM, these two requests lead to two cache misses: we need to evict two cache entries, fetch the data for LPN 0 and LPN 1 from the lower storage device, write the new data into the flash memory, and then update the mapping information. This process not only degrades the performance but also reduces the lifetime of the flash cache. Fetching data from a lower-level storage device such as an HDD introduces a long latency, and rewriting the data into flash memory adds timing overhead and more erase operations, which are themselves long-latency and harmful to the lifetime of flash memory.
Even though the mapping entries for LPN 0 and LPN 1 have been evicted from the mapping table, the user data still resides in PPN 3 and PPN 6. This gives us the opportunity to design a flash-aware cache architecture. Instead of fetching the data from the lower-level storage device and rewriting it into flash memory, we can merely revive it.
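As a quick arithmetic check of the example above, assume that over-provision is measured relative to the total physical capacity. This is what the stated 25 percent implies for four physical blocks of which three are user-addressable:

```latex
OP = \frac{C_{total} - C_{user}}{C_{total}} = \frac{4 - 3}{4} = 25\%
```

With four pages per block, this corresponds to 16 physical pages, 12 of which are user-addressable, leaving four spare pages for out-of-place writes.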
LRU and ARC
Least recently used (LRU) data replacement is one of the most basic and classic cache replacement algorithms. The main idea of LRU is: data recently used is likely to be reused in the near future; data not used in ages is not likely to be used again in the near future. To age the data, a queue will be maintained, recently used at the front and the oldest at the rear. Every time a page is referenced, it is moved to the head of the queue. When a cache miss happens and the cache is full, the LRU based policy evicts the entry which was requested least recently. Then for a read request, the missed data will be fetched from the lower storage device and written into the head of LRU queue.
Basic LRU captures only the recency of workloads. Adaptive Replacement Cache (ARC) [36] improves on the basic LRU algorithm by capturing both recency and frequency and by dynamically and continually balancing between the two in an online, self-tuning fashion according to evolving access patterns. In the original ARC architecture, the cache directory is split into two lists, T1 and T2. T1 caches the recently referenced entries, while T2 caches the frequently referenced entries. Every entry in T1 has been accessed only once recently, and every entry in T2 has been accessed at least twice. Two ghost lists, B1 and B2, which contain only metadata, are attached to the bottoms of T1 and T2. B1 and B2 record the entries recently evicted from T1 and T2, respectively. The main idea of the learning process is as follows: if there is a hit in B1, we should increase the size of T1, and if there is a hit in B2, we should increase the size of T2. To support this learning process, a tunable parameter p is defined as the target size of T1. On a hit in B1, the value of p is increased, and on a hit in B2, the value of p is decreased.
LRU-Based Flash-Aware Cache Design
As described previously, a cache eviction in flash memory only discards the metadata, while the user data still resides in its physical location. When a read request for evicted but still available data arrives, instead of fetching the data from the lower-level storage device, we can revive the suspected data. Algorithm 1 shows our LRU-based flash-aware cache algorithm, FLRU. We add a suspected queue (SQ) to preserve the evicted entries. The size of the LRU queue is determined by the user-addressable physical capacity. The maximal size of the SQ, in number of entries, is given by equation (2),
$$SQ_{max} = (OP - GC_{th}) \times C_{total} \qquad (2)$$
$GC_{th}$ is the garbage collection threshold, defined as the percentage of free physical capacity over the total physical capacity. Whenever the number of free pages drops to the garbage collection threshold, a garbage collection process is triggered to reclaim invalid pages. Therefore, only the difference between the over-provision and the garbage collection threshold can be utilized by our flash-aware design. On a hit in the LRU queue, we move the requested entry to the head of the LRU queue and return the data, as in the normal LRU-based cache. On a miss in the LRU queue, unlike the original LRU-based cache, we first check the SQ. If the request hits in the SQ, we revive the entry by moving it from the SQ to the head of the LRU queue. Because the entries are maintained in memory, the overhead of moving an entry from the SQ to the LRU queue is negligible, especially compared with the long-latency lower-level read and flash write operations. Therefore, the access latency of a hit in the SQ is almost the same as that of a hit in the LRU queue, and both a read operation on the lower-level storage device and a write to the flash cache are avoided. For a request that misses in both the LRU queue and the SQ, we first move the tail entry from the LRU queue to the head of the SQ if the LRU queue is full. Then the requested data is fetched from the lower-level storage device and written into the flash memory. Besides adding the SQ and changing the original LRU algorithm, we also need to modify the garbage collection part of the SSD. When a garbage collection process is triggered, a victim block is erased. The invalid data inside the victim block no longer exists after the erasure, so there is no need to preserve the corresponding entries in the SQ. The bottom of Algorithm 1 shows our modified garbage collection process.
Whenever we need to erase a victim block, if there is any entry for pages in this victim block buffered in the SQ, these entries will be discarded from the SQ. 
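The FLRU behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in our simulations: the class and method names are hypothetical, the SQ capacity is passed in as a parameter, a counter stands in for the FTL page allocator, the LRU tail is moved to the SQ on a revival when the LRU queue is full, and the oldest suspected entry is simply dropped if the SQ overflows.

```python
from collections import OrderedDict


class FLRU:
    """Sketch of a flash-aware LRU read cache (names are illustrative)."""

    def __init__(self, capacity, sq_capacity):
        self.capacity = capacity        # entries in the main LRU queue
        self.sq_capacity = sq_capacity  # entries in the suspected queue
        self.lru = OrderedDict()        # LPN -> PPN; last item is the head (MRU)
        self.sq = OrderedDict()         # evicted entries whose pages are still readable
        self.next_ppn = 0               # toy allocator standing in for the FTL

    def _evict_to_sq(self):
        # Move the LRU tail entry to the head of the SQ; only metadata moves.
        lpn, ppn = self.lru.popitem(last=False)
        self.sq[lpn] = ppn
        if len(self.sq) > self.sq_capacity:
            # Assumption: drop the oldest suspected entry when the SQ overflows.
            self.sq.popitem(last=False)

    def read(self, lpn):
        if lpn in self.lru:                   # hit in the LRU queue
            self.lru.move_to_end(lpn)
            return "hit"
        if lpn in self.sq:                    # hit in the SQ: revive the entry
            ppn = self.sq.pop(lpn)
            if len(self.lru) >= self.capacity:
                self._evict_to_sq()
            self.lru[lpn] = ppn               # same physical page, no flash write
            return "revived"
        if len(self.lru) >= self.capacity:    # miss in both queues
            self._evict_to_sq()
        self.lru[lpn] = self.next_ppn         # fetch from backend, write to flash
        self.next_ppn += 1
        return "miss"

    def on_erase(self, erased_ppns):
        # Modified GC hook: forget suspected entries whose pages were erased.
        erased = set(erased_ppns)
        for lpn in [l for l, p in self.sq.items() if p in erased]:
            del self.sq[lpn]
```

In this sketch, a revival reuses the original PPN and does not advance the allocator, which mirrors the point that a hit in the SQ avoids both the lower-level read and the flash write.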
As in the flash-aware LRU architecture, an SQ is added to buffer the evicted but still accessible data in flash memory. The difference is that the SQ is split into SQ1 and SQ2, corresponding to T1 and T2. The total size of SQ1 and SQ2 is also defined by equation (2). Algorithm 2 shows our flash-aware ARC algorithm, FARC. We use c to denote the user-addressable physical flash capacity, and p is the target size of T1. A Replace function is defined to evict an entry, based on the current value of p, when a cache miss happens or when a request hits in SQ1 or SQ2. Unlike in the original ARC algorithm, an entry evicted by the Replace function is moved to SQ1 or SQ2 in our FARC algorithm. For any request, one of the six cases listed in Algorithm 2 should happen. If a request hits in T1 or T2, the requested entry is moved to the MRU position of T2, since it has been accessed twice recently. Two other cases, a hit in B1 and a hit in B2, cause the adjustment of p. In this paper, we follow the policy used in the original ARC algorithm to adjust the value of p. When a request hits in B1, p is increased by k1. If B1 contains more entries than B2, then k1 is 1; otherwise, k1 equals the length of B2 divided by the length of B1. When a request hits in B2, p is decreased by k2. If the length of B2 exceeds the length of B1, then k2 is 1; otherwise, k2 equals the length of B1 divided by the length of B2. In addition, the value of p is confined to the range between 0 and c. In fact, with our flash-aware design, the adjustment of p could be performed slightly differently. The total usable space in the flash cache is not c but c plus the number of entries in SQ1 and SQ2, which we call c'. In the same way, the length of T1 could be extended to include SQ1, and the length of T2 could be extended to include SQ2.
As the lengths of SQ1 and SQ2 vary due to the garbage collection process, c' and the lengths of the extended T1 and T2 also fluctuate. Although we could use these extended variables to adjust p more accurately, we ignore these factors in this paper for simplicity. We believe that this does not affect the cache performance much, because SQ1 and SQ2 are much shorter than T1 and T2. After the recalculation of p, an entry is evicted and the requested entry is moved to the MRU position of T2, since it has also been accessed at least twice recently. The requested data is also fetched from the lower-level storage and written to the cache. The fourth case is that a request hits in SQ1 or SQ2. An entry is evicted by calling the Replace function. After that, the requested entry is moved from SQ1 or SQ2 to the MRU position of T2. If a request misses in all the queues, an entry from B1 or B2 is deleted and the Replace function is called, or an entry is moved from T1 to SQ1, as depicted in case V.
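The adjustment of p on a ghost-list hit, as described above, can be written as a small helper. This is a sketch under the simplification adopted in this paper (the SQ lengths are ignored); the function name and argument names are hypothetical.

```python
def adapt_p(p, c, len_b1, len_b2, hit_in_b1):
    """Return the new target size of T1 after a hit in B1 or B2.

    The list that was hit is assumed to be non-empty, so the divisions
    below cannot divide by zero.
    """
    if hit_in_b1:
        # Grow T1's target: step is 1 if B1 is longer, else |B2| / |B1|.
        k1 = 1 if len_b1 > len_b2 else len_b2 / len_b1
        return min(p + k1, c)
    # Hit in B2: shrink T1's target by 1 if B2 is longer, else |B1| / |B2|.
    k2 = 1 if len_b2 > len_b1 else len_b1 / len_b2
    return max(p - k2, 0)
```

For example, with c = 10, a hit in B1 when B1 holds 2 entries and B2 holds 4 raises p by 2, while the clamping keeps p within the range from 0 to c.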
The bottom of Algorithm 2 is the modified garbage collection process for our new ARC-based flash cache. Whenever a victim block is about to be erased, every page inside it is checked; if a page is buffered in SQ1 or SQ2, its entry is removed from SQ1 or SQ2 and moved to B1 or B2.
Algorithm 2. FARC
Case I: x_t is in T1 or T2.
    Move x_t to the MRU position in T2.
    Return the data.
Case II: x_t is in B1.
    If B1 contains more entries than B2, k1 = 1; otherwise k1 = |B2| / |B1|.
    Update p = min{p + k1, c}.
    Replace(x_t, p).
    Move x_t from B1 to the MRU position in T2.
    Also fetch x_t to the cache and return the data.
Case III: x_t is in B2.
    If B2 contains more entries than B1, k2 = 1; otherwise k2 = |B1| / |B2|.
    Update p = max{p - k2, 0}.
    Replace(x_t, p).
    Move x_t from B2 to the MRU position in T2.
    Also fetch x_t to the cache and return the data.
Case IV: x_t is in SQ1 or SQ2.
    Replace(x_t, p).
    Move x_t from SQ1 or SQ2 to the MRU position in T2.
    Return the data.
Case V: x_t is not in any of the queues.
    Case A: T1 and B1 together have exactly c pages.
        If (|T1| < c)
            Delete the LRU page in B1. Replace(x_t, p).
        else
            Here B1 is empty. Move the LRU page from T1 to the head of SQ1.
        endif
    Case B: T1 and B1 together have fewer than c pages.
        If (|T1| + |T2| + |B1| + |B2| ≥ c)
            Delete the LRU page in B2 if (|T1| + |T2| + |B1| + |B2| = 2c).
            Replace(x_t, p).
        endif
    Finally fetch x_t and move it to the MRU position in T1.
    Return the data.
Subroutine Replace(x_t, p)
    If ((|T1| ≠ 0) and ((|T1| > p) or (x_t is in B2 and |T1| = p)))
        Move the LRU page from T1 to the head of SQ1.
    else
        Move the LRU page from T2 to the head of SQ2.
    endif
ERASE:
    If the victim block contains any entries in SQ1 or SQ2
        Move the entries to the head of B1 or B2.
    endif
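The Replace subroutine and the ERASE step of Algorithm 2 can be sketched in Python as follows. This is an illustrative fragment rather than a full FARC implementation: the queues are assumed to be `OrderedDict` objects mapping LPN to PPN with the LRU entry first, the ghost lists store metadata only, and all function and parameter names are hypothetical.

```python
from collections import OrderedDict


def replace(t1, t2, sq1, sq2, p, x_in_b2):
    """Evict one entry from T1 or T2 into its suspected queue.

    As in ARC, the caller is assumed to invoke this only when the chosen
    list is non-empty. Returns the evicted LPN.
    """
    if t1 and (len(t1) > p or (x_in_b2 and len(t1) == p)):
        lpn, ppn = t1.popitem(last=False)  # LRU page of T1
        sq1[lpn] = ppn                     # head of SQ1 (data stays on flash)
    else:
        lpn, ppn = t2.popitem(last=False)  # LRU page of T2
        sq2[lpn] = ppn                     # head of SQ2
    return lpn


def on_erase(erased_ppns, sq1, sq2, b1, b2):
    """Move suspected entries whose pages were erased into the ghost lists."""
    erased = set(erased_ppns)
    for sq, ghost in ((sq1, b1), (sq2, b2)):
        for lpn in [l for l, p in sq.items() if p in erased]:
            del sq[lpn]
            ghost[lpn] = None              # ghost lists keep metadata only
```

The only difference from the original ARC Replace is the destination of the evicted entry: it goes to SQ1 or SQ2 first and reaches B1 or B2 only when its flash block is erased.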
Zero-Migration Garbage Collection
Garbage collection algorithms are one of the key factors that affect both the performance and lifetime of flash memories. In this section, we describe a new garbage collection method that aims to further improve the lifetime of flash cache. During traditional garbage collection, all the valid data pages in the victim block need to be migrated to new free locations before the whole block is erased. If the victim block contains N valid pages, N page reads and N page writes are required for the migration. The data migration process introduces extra overhead and blocks the whole plane or package from servicing outside requests. Here we propose a new zero-migration garbage collection design, which aggressively erases the whole victim block without migrating data pages. The zero-migration garbage collection design is based on two observations. First, when flash memories are used as read caches, all the data in the flash memory has a copy in the write buffer or the lower-level storage device. Hence, skipping valid data migrations during garbage collection never results in data loss. Second, when used as a cache, flash memory receives more pressure from upper-level requests. Therefore, any additional write operations during garbage collection may trigger extra erase operations in the future, which hurts both the performance and lifetime of flash cache.
To perform zero-migration garbage collection, the FTL selects a victim block using a garbage collection policy such as greedy, FIFO, or random. For all the valid pages inside the victim block, the corresponding entries in the cache queues are deleted and the mapping information in the mapping table is invalidated. When our flash-aware design is applied, we also need to delete the related entries in the SQs and the mapping table.
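A minimal sketch of the zero-migration procedure described above is given below. It assumes greedy victim selection and a merged cache queue and mapping table represented as a plain dictionary; the data structures (`blocks`, `page_state`, `cache`, `sq`) and the function name are hypothetical.

```python
def zero_migration_gc(blocks, page_state, cache, sq):
    """Erase one victim block without migrating its valid pages.

    blocks:     block id -> list of PPNs in that block
    page_state: PPN -> "free" | "valid" | "invalid"
    cache:      LPN -> PPN for entries in the main cache queue(s)
    sq:         LPN -> PPN for entries in the suspected queue(s)
    Returns the id of the erased block.
    """
    # Greedy policy (assumed here): fewest valid pages.
    victim = min(blocks, key=lambda b: sum(page_state[p] == "valid"
                                           for p in blocks[b]))
    victim_ppns = set(blocks[victim])
    # Valid pages are not copied elsewhere; their cache entries are dropped.
    for lpn in [l for l, p in cache.items() if p in victim_ppns]:
        del cache[lpn]
    # Suspected entries pointing into the victim block are dropped as well.
    for lpn in [l for l, p in sq.items() if p in victim_ppns]:
        del sq[lpn]
    for ppn in victim_ppns:
        page_state[ppn] = "free"           # block erased
    return victim
```

Because this is a read cache, the dropped valid pages can still be fetched from the write buffer or the lower-level storage on a later miss.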
Although aggressively removing all valid pages during garbage collection may have some negative impact on the cache hit ratio, overall performance metrics such as the average response time might be unaffected or even improved due to the reduction of garbage collection activity. Besides the basic zero-migration garbage collection design described above, we also implemented, for comparison, a variant named conditional zero-migration, which deletes only the relatively cold pages and keeps the hot pages. In the conditional zero-migration design, we treat the entries in the tail half of a cache queue as cold, and they are removed during garbage collection, while entries in the head half of the cache queues are regarded as hot and are migrated to other free locations during garbage collection. The results show that the conditional zero-migration garbage collection scheme achieves only a marginal improvement in cache hit ratio (within 1 percent), with even worse average response time and limited extension of lifetime compared with the original zero-migration scheme. Therefore, we only present and analyze the results of the original zero-migration garbage collection scheme in Section 6.
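The hot/cold split used by the conditional variant can be illustrated with a short helper. This sketch assumes the cache queue is given as a list of LPNs ordered from head (MRU) to tail (LRU) and that, for an odd-length queue, the middle entry is treated as cold; both the function name and that tie-breaking choice are assumptions.

```python
def split_hot_cold(queue_mru_first, victim_lpns):
    """Partition the valid LPNs of a victim block into (migrate, drop).

    Entries in the head half of the queue are hot and will be migrated;
    entries in the tail half are cold and will be dropped.
    """
    half = len(queue_mru_first) // 2
    position = {lpn: i for i, lpn in enumerate(queue_mru_first)}
    migrate = [l for l in victim_lpns if position[l] < half]
    drop = [l for l in victim_lpns if position[l] >= half]
    return migrate, drop
```

For a queue ordered as 1, 2, 3, 4 from head to tail and a victim block holding LPN 2 and LPN 4, LPN 2 would be migrated and LPN 4 would be dropped.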
Discussions of Implementation Issues
One potential problem with our flash-aware design is the communication between the cache management and the garbage collection process. Currently, most SSDs are designed as black boxes, and FTLs, including the garbage collection function units, run on embedded processors within the SSDs, while the cache management unit that contains the cache queues is maintained by the OS on the host side. Therefore, the cache replacement algorithm running on the host side is unaware of the semantic information about garbage collection. What's more, the addresses in the cache queues belong to the address space of the underlying storage system rather than the user-addressable space of the SSDs. Hence, an additional mapping table should be maintained to translate the address space of the underlying storage system to the SSDs' address space [27].
One possible way to bridge the gap is to merge the FTLs with the cache management unit, either by opening the SSDs and moving FTLs to the host side, like Triple-A [24], SDF [41] (which has been widely deployed in Baidu's storage system), and Fusion-io's host-based FTL [7], or by moving the cache management units into the SSDs. By combining the cache management units with FTLs, the mapping table between the underlying storage's address space and the SSD's address space can be removed, and the mapping table inside the original SSDs can be merged with the cache queues by adding the physical flash addresses to the cache entries. Therefore, whenever a cache hit is detected by searching the cache queues, the corresponding physical location in the flash memory can be returned immediately. Moving FTLs to the host side has several benefits, such as performance enhancement and cost reduction [7], [24], [41], due to the elimination of redundant resources. The drawback is that it consumes additional host resources. On the other hand, moving cache management units into the SSDs could deliver good flexibility but requires more computing and memory resources inside the SSDs. We adopt and implement the second solution in our simulations to verify the efficiency of our flash-aware design.
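A minimal sketch of such a merged structure is given below, assuming a plain LRU queue keyed by the backing-store address. The names `CacheEntry` and `MergedLRU` are hypothetical and do not correspond to identifiers in our implementation.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class CacheEntry:
    block: int  # physical flash block holding the cached page
    page: int   # page offset inside that block


class MergedLRU:
    """LRU queue whose entries carry the physical flash location."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # backing-store address -> CacheEntry

    def lookup(self, addr):
        entry = self.entries.get(addr)
        if entry is None:
            return None                 # cache miss
        self.entries.move_to_end(addr)  # most recently used goes last
        return entry.block, entry.page  # no second table lookup needed

    def insert(self, addr, block, page):
        self.entries[addr] = CacheEntry(block, page)
        self.entries.move_to_end(addr)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)  # least recently used
        return None


cache = MergedLRU(capacity=2)
cache.insert(100, block=0, page=5)
cache.insert(200, block=0, page=6)
hit = cache.lookup(100)                       # 100 becomes most recent
evicted = cache.insert(300, block=1, page=0)  # 200 is the LRU entry
```

A hit returns the physical location directly from the queue entry, which is the point of merging the two tables.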
Another way is to design a special feedback interface that exposes the necessary internal information of SSDs to the host side. A similar interface has been proposed in [57] to support the nameless writes scheme, which returns the physical address of the data inside SSDs to the host side after each write operation (nameless write interface) or after data migration during garbage collection (migration or callback interface). Moreover, a real prototype of the nameless writes scheme is implemented and evaluated on the OpenSSD Jasmine board in [45]. To make our flash-aware and zero-migration designs work, we can utilize the migration interface of the nameless writes scheme: whenever garbage collection happens inside the SSDs, the feedback interface sends the LPNs of the invalid and valid pages in the victim block to the host side so that the cache management unit can eliminate the corresponding cache entries from the suspension queues and main cache queues.
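The feedback path can be sketched with a simulated device that invokes a host callback on every erase. Both classes and the callback signature are assumptions made for illustration; they do not reproduce the actual nameless writes interface of [57].

```python
from collections import OrderedDict


class HostCache:
    """Host-side cache metadata (main queue and suspension queue)."""

    def __init__(self):
        self.main_queue = OrderedDict()
        self.suspension_queue = OrderedDict()

    def handle_erase(self, lpns):
        # Drop every entry whose flash page no longer exists.
        for lpn in lpns:
            self.main_queue.pop(lpn, None)
            self.suspension_queue.pop(lpn, None)


class SimulatedSSD:
    """Device that reports the LPNs of each erased block to the host."""

    def __init__(self, on_erase):
        self.on_erase = on_erase
        self.blocks = {}  # block id -> list of LPNs stored in the block

    def program(self, block_id, lpn):
        self.blocks.setdefault(block_id, []).append(lpn)

    def garbage_collect(self, block_id):
        lpns = self.blocks.pop(block_id, [])
        self.on_erase(lpns)  # feedback: valid and invalid pages alike


host = HostCache()
ssd = SimulatedSSD(on_erase=host.handle_erase)
for lpn in (7, 8):
    ssd.program(0, lpn)
ssd.program(1, 9)
host.main_queue[7] = True        # still valid in the cache
host.suspension_queue[8] = True  # evicted but still readable
host.main_queue[9] = True
ssd.garbage_collect(0)
```

After the callback runs, the host holds no metadata that points into the erased block, while the entry stored in block 1 is untouched.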
EXPERIMENTAL METHODOLOGY
We modified the Disksim with SSD extension to evaluate our proposed flash-aware cache schemes [4]. Table 2 lists the main parameters of our experiments. Since LRU and ARC are two of the most widely used cache replacement algorithms to evaluate a cache architecture, we choose the normal LRU and ARC algorithms as our baselines and implement both of them with the Disksim simulator. Our flash-aware LRU, flash-aware ARC, and zero-migration garbage collection algorithms as described in the previous section are implemented to show what benefits could be obtained from our proposed schemes.
Five realistic workloads, WebSearch1, WebSearch2, WebSearch3, DevDivRelease, and MSNFS, are used in our evaluation. WebSearch1, WebSearch2, and WebSearch3 were collected from popular search engines, and nearly all of their requests are reads [2]. DevDivRelease and MSNFS are released by Microsoft [1]. DevDivRelease was collected on a developer tools release server, and MSNFS was collected on an MSN storage file server. Since our flash cache is used as a read cache, we pick only the read requests from DevDivRelease and MSNFS as our test benchmarks. The characteristics of these workloads are listed in Table 3.
EXPERIMENTAL RESULTS
Cache hit ratio, average response time, and erase count are the three main metrics used in this paper to evaluate our proposed flash-aware and zero-migration garbage collection cache designs. The first section shows the results when only our flash-aware design is applied. Normal LRU, flash-aware LRU, ARC, and flash-aware ARC, as described previously, are implemented and simulated to measure the results. The second section presents the results when our zero-migration garbage collection scheme is integrated with the normal and flash-aware caches.
Flash-Aware Cache
Cache Hit Ratio
Cache hit ratio is a common metric to evaluate a cache's performance and efficiency. Fig. 3 shows our simulation results. For both LRU and ARC algorithms with multiple flash cache over-provision and capacity configurations, our flash-aware cache designs can achieve promising cache hit ratio improvement. For example, with 15 percent over-provision, our flash-aware LRU algorithm can increase the cache hit ratio by nearly 10 percent for WebSearch1 and WebSearch2 when the capacity is 3 GB, while our flash-aware ARC design can obtain about 8 percent improvement in the best case. When over-provision is 25 percent, the cache hit ratio improvements could reach 19 and 12 percent for our FLRU and FARC, respectively.
With 35 percent over-provision, FLRU and FARC achieve about 28 and 21 percent cache hit ratio improvements in the best scenarios. Moreover, as the over-provision increases, our flash-aware algorithms gain more advantage over traditional algorithms. For example, for FLRU with over-provision increasing from 15 to 25 to 35 percent, the geometric means of the cache hit ratio improvements are around 7.3, 13.9, and 20.3 percent, respectively, when the cache capacity is 3 GB. For ARC, the results are similar. There are two reasons why our flash-aware design obtains more benefit from larger over-provision. On one hand, larger over-provision means more additional physical flash memory space for our flash-aware algorithm to exploit. On the other hand, larger over-provision also means that the user-addressable physical space is reduced when the total physical capacity is fixed, which introduces more cache misses for the normal LRU and ARC algorithms. Fortunately, our flash-aware algorithm can counteract the negative effects of these cache misses. Besides, when the user-addressable physical capacity is large enough, for example when the over-provision is 15 percent and the capacity is 6 GB, the benefit from FLRU and FARC is limited. The reason is straightforward: larger capacity means a higher cache hit ratio and less room for improvement. What's more, for WebSearch1-3, the original ARC gets much better results than LRU, but FLRU achieves similar or even better results compared with FARC. Fig. 4a shows the geometric means of the cache hit ratio improvements of our FLRU and FARC algorithms under different flash capacities and over-provisions. It is clear that LRU gains more from our flash-aware design. The reason is that ARC already does a very good job of capturing the locality of workloads, which limits the room for improvement left for our flash-aware design. Fig. 4b presents the impact of cache size on the geometric means of the cache hit ratio. LRU clearly benefits more from an increase in cache capacity. Since our flash-aware scheme works in a way similar to expanding the cache capacity, LRU gets a larger performance gain from our flash-aware design. Based on this observation, we may use the simple and low-overhead LRU algorithm with our flash-aware design to get performance similar to or even better than that of the ARC algorithm, which will further boost the cache performance.
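For readers unfamiliar with the summary statistic, the following sketch shows how a geometric mean over workloads can be computed. The helper `summarize` and the per-workload values are hypothetical placeholders, not measured results, and applying the geometric mean directly to positive per-workload values is an assumption of this sketch.

```python
from statistics import geometric_mean


def summarize(per_workload_values):
    """Geometric mean of positive per-workload values.

    per_workload_values: dict mapping workload name -> value (> 0).
    """
    return geometric_mean(list(per_workload_values.values()))


# Placeholder values for illustration only; these are not measured data.
example = {"workloadA": 2.0, "workloadB": 8.0}
summary = summarize(example)
print(round(summary, 3))
```

Compared with the arithmetic mean (5.0 for the two placeholder values), the geometric mean is less dominated by a single workload with a very large value.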
Impacts on Lifetime
In order to investigate how our flash-aware cache algorithms affect the lifetime of flash memory, we collected the erase count for all our experiments. Fig. 5 presents our simulation results of erase count. The experimental results clearly show that our flash-aware cache design significantly extends the lifetime of flash memory. For example, our FLRU and FARC can at least reduce the number of erases by about 10 and 17 percent on average, respectively, when over-provision configuration is 15 percent. When the over-provision is 35 percent, the reduction could even reach nearly 72 percent. Thus the reduction of erase count is more significant than the improvement of cache hit ratios. One of the possible reasons is the write amplification effect of garbage collection processes. Before erasing a block, all the valid pages inside the victim block need to be moved to some new free space and this will introduce more additional writes. Then like the cascade effect, more additional writes will trigger more garbage collection processes especially when used as caches which means more pressure from the upper-level requests.
What's more, for all four cache algorithms, higher over-provision reduces the erase count even with the same total physical cache size, although higher over-provision means less user-addressable physical space and lower cache hit ratios for the normal cache algorithms. Even for our flash-aware cache algorithms, the cache hit ratios are slightly lower with higher over-provision, as the over-provisioned part is not always filled with evicted data. Two factors affect the number of erases for a flash cache with a fixed total capacity: cache hit ratio and over-provision. On one hand, a lower cache hit ratio introduces more writes to bring the missed data into the flash cache. On the other hand, higher over-provision gives the flash memory more space to delay garbage collection and to reduce the number and overhead of garbage collection processes. In our experimental results, the reduction of erase count from higher over-provision outweighs the penalty from the reduced cache hit ratio.
From Fig. 5, we also find that ARC and FARC suffer more erase operations than LRU and FLRU, even in cases where they achieve higher cache hit ratios. For instance, when the capacity is 3 GB and the over-provision is 15 percent, for WebSearch2, the cache hit ratios for LRU and ARC are 41.87 and 50.4 percent, but LRU suffers fewer than 300,000 erase operations, while ARC suffers more than 400,000. We believe this is due to the multi-queue architecture of the ARC algorithm. Unlike LRU, ARC is divided into four LRU queues: T1, T2, B1, and B2. Cache entries are moved among these queues, and cache evictions can happen in both T1 and T2 depending on the current cache condition. This multi-queue architecture may generate more data fragmentation, which mixes valid and invalid pages within the same block. This kind of data fragmentation introduces higher overhead for garbage collection since more valid pages need to be migrated.
Integration of Zero-Migration Garbage Collection Scheme
Since our zero-migration garbage collection scheme aims to extend the lifetime of flash cache, we will present the extension of lifetime at first and then discuss the impacts on the cache hit ratio and whole cache performance. 
Impacts on Lifetime
Fig. 6 shows the normalized number of erase operations when our zero-migration garbage collection policy is integrated with the normal and our flash-aware cache algorithms. The results demonstrate that our zero-migration garbage collection scheme significantly reduces the number of erase operations for both the normal and flash-aware cache algorithms. For the normal cache, our zero-migration garbage collection scheme reduces the erase count by up to about 72 percent. The reduction of erase count from the combination of the flash-aware design and the zero-migration garbage collection scheme reaches nearly 90 percent. One observation is that the lifetime benefit of our zero-migration garbage collection scheme decreases as the over-provision increases, for both the normal and flash-aware cache algorithms. The reason is that higher over-provision means more space to delay garbage collection and improve garbage collection efficiency. In contrast, the benefit of our flash-aware design is positively related to the over-provision. When the over-provision is 15 or 25 percent, our zero-migration garbage collection scheme gains more than the flash-aware design from the perspective of lifetime extension, whereas when the over-provision reaches 35 percent, the flash-aware design improves the lifetime of flash cache more significantly. Experimental results show that the combination of flash-aware cache and zero-migration garbage collection works well for all these over-provisioning configurations. Therefore, our flash-aware design and zero-migration garbage collection scheme complement each other in extending lifetime. What's more, ARC and FARC gain more benefit from our zero-migration garbage collection scheme since they have lower garbage collection efficiency, as concluded in the previous section.
Table 4 lists the differences in the geometric means of the cache hit ratios after the adoption of our zero-migration garbage collection policy. A negative value means a decrease of the cache hit ratio, while a positive value means an increase. From the results, we find that the impact of our zero-migration garbage collection scheme on the cache hit ratio is negligible. In most circumstances, the drop of the cache hit ratio is less than 2 percent or even 1 percent. Even in the worst cases, the loss of the average cache hit ratio is still below 4 percent. What's more, the average cache hit ratio increases by more than 2 percent in some special cases. We believe that the results can be explained from two aspects. On one hand, our zero-migration garbage collection scheme reduces the amount of data available to serve incoming requests, which results in lower cache hit ratios. On the other hand, the zero-migration garbage collection scheme has the potential to remove cold data in advance and make room for data that will miss in the future. Hence, the next time a cache miss occurs, the missed entry can be inserted directly into the cache queues without triggering a cache eviction.
Performance
Although our zero-migration garbage collection policy may sacrifice a little cache hit ratio, the real performance of flash cache, such as the average response time, can be unaffected or even improved because of the significant reduction of garbage collection processes, which has been verified and utilized in [39]. For flash cache, the cache hit ratio is only one of the important factors that determine cache performance. Another vital factor is garbage collection. During a garbage collection process, the whole flash plane or package is unable to serve any requests until the process ends, which increases the response time of the flash cache. For each individual garbage collection process, the overhead consists of the data migration and the erase operation. Thanks to our zero-migration garbage collection scheme, both the number of garbage collection processes and the cost of each individual process are dramatically reduced due to the removal of data migration. Fig. 7 presents the normalized geometric means of the average response time when our flash-aware and zero-migration designs are applied to the normal LRU and ARC algorithms. When the two designs are applied separately, our flash-aware design gains more from higher over-provision configurations, while the zero-migration design works better if both the over-provision and cache size are small. For instance, when the over-provision is 35 percent and the cache size is larger than 5 GB, our flash-aware design reduces the average response time by around 40 percent for both LRU and ARC algorithms, but the zero-migration design can worsen the average response time by up to 2 percent (which is still negligible compared with the significant improvement in lifetime).
However, when the over-provision and cache size are 15 percent and 3 GB, respectively, our flash-aware design can only reduce the average response time by 11 and 14 percent for LRU and ARC, while the zero-migration design can improve the performance by 14 and 30 percent for LRU and ARC. This is consistent with the conclusion in our previous section related to the impacts on lifetime. If we combine our flash-aware and zero-migration designs together, we can always obtain considerable reduction of the average response time which is from about 20 to 40 percent.
Besides the average response time, another important aspect is performance stability, especially for some high-end or real-time applications. In our paper, we use the standard deviation of the response time, a popular measure of the amount of variation in a set of data values, to represent performance stability. A small standard deviation of the response time means stable performance, while a large standard deviation indicates large fluctuations. Fig. 8 shows the mean deviations of the response time. In the figure, both our flash-aware and zero-migration designs provide more stable response times in all cases, especially the zero-migration design, which reduces the amount of garbage collection and eliminates the data migration process at the same time. What's more, when the flash-aware and zero-migration designs are combined, the performance stability is further enhanced.
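The stability metric can be illustrated with a short sketch. The two response-time series are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption of this sketch.

```python
from statistics import mean, pstdev


def stability(response_times_ms):
    """Return (average, standard deviation) of a response-time series."""
    return mean(response_times_ms), pstdev(response_times_ms)


# Hypothetical series with the same average but different spread.
steady = [3.0, 3.0, 3.0, 3.0]  # no request is delayed
bursty = [1.0, 1.0, 1.0, 9.0]  # one request waits behind a long operation

steady_avg, steady_sd = stability(steady)
bursty_avg, bursty_sd = stability(bursty)
```

Both series have the same average response time, so only the standard deviation reveals that the second one contains an occasional long stall of the kind a blocking garbage collection would cause.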
CONCLUSION
In this paper, we proposed a novel flash-aware cache design and a new zero-migration garbage collection scheme. One of flash memory's most important properties is out-of-place update. When flash memory is used as a cache, cache evictions generate superseded but still accessible data. Our flash-aware cache design takes advantage of this superseded but still accessible data to improve the performance and prolong the lifetime of flash cache. Our experimental results for the LRU and ARC cache replacement algorithms clearly show that our flash-aware design provides higher and more stable performance and significantly extends the lifetime of flash cache compared with the original replacement algorithms. With the assistance of our zero-migration garbage collection scheme, a more significant lifetime extension and more stable performance of flash memory can be achieved. In future work, we will explore how to apply our flash-aware design and zero-migration garbage collection scheme to current commodity SSDs. In addition, we will consider the best way to incorporate our key idea of flash-aware cache design into different state-of-the-art cache replacement algorithms. Especially for some high-end applications, which have stricter requirements for reliability and need performance that is not only high but also sustained, the flash cache will be configured with more additional capacity, which will give our flash-aware cache design more space to achieve further improvement in both performance and lifetime.
