Abstract-The increasing capacity of NAND flash memory results in larger page sizes. Since a larger page requires a longer access time, the performance of a flash translation layer (FTL) that stores its mapping table in flash pages degrades. An economical FTL named SCFTL is proposed to avoid the issues caused by the larger page size. To reduce spatial requirements, SCFTL employs a demand-based caching mechanism for the page-mapping table. Unlike other FTLs, SCFTL employs two techniques that exploit the spatial locality at a fine granularity. In addition, the replacement algorithm is customized for the asymmetrical access time of flash memory. The experiments show that the average overhead of SCFTL in terms of access time is only 6.89%; this overhead is 75.96 and 11.35 percentage points lower than those of the state-of-the-art FTLs. The average cache hit ratio of SCFTL is as high as 0.92 despite the compact memory footprint. Because of this cache utilization, SCFTL still achieves high performance even when the page size is larger.
I. INTRODUCTION
Flash-based storage devices are prevalent in computer systems due to various benefits over ferromagnetic storage devices. They are widely used in high-performance systems because of their low access latency. In addition, flash-based storage devices are more robust to shock and consume less energy; hence, they have become an essential component in most mobile systems.
However, these advantages do not come without restrictions. Unlike ferromagnetic materials, NAND flash memory is unsuitable for in-place updates. A flash page, which consists of thousands of cells, cannot be reprogrammed until it has been erased. More importantly, pages cannot be erased individually. The smallest erasable unit is one block, which is a group of hundreds of pages. In addition, the lifetime of each flash cell is limited by its number of program/erase (P/E) cycles. Consequently, a flash translation layer (FTL) is employed to solve these problems and provide the sector-based file system interface.
One of the main functions of an FTL is address translation. Since the upper level locates data by logical addresses, an FTL translates them to flash page locations, or physical addresses, and records the pairs of mapped addresses in a page-mapping table [1]. Because a flash memory contains millions of pages, a page-mapping table requires enormous RAM capacity. For instance, an 8GB flash memory with 4,096 blocks of 256 pages needs 4,096KB for a page-mapping table, while a block-mapping table [2] takes only 16KB. However, a block-mapping table translates only the most significant bits of a logical address to a physical block number, while the offset of the physical address is fixed to the least significant bits of the logical address. Owing to its greater flexibility, page-level address translation, which uses a page-mapping table, usually yields better performance and lifetime.
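The table sizes quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming 4-byte mapping entries (the entry width used later in the memory requirements discussion); the function names are illustrative only.

```python
# Geometry of the 8GB example device from the text.
BLOCKS = 4096
PAGES_PER_BLOCK = 256
ENTRY_BYTES = 4  # assumed width of one mapping entry


def page_mapping_table_kb(blocks: int, pages_per_block: int, entry_bytes: int) -> int:
    """RAM needed when every page has its own mapping entry, in KB."""
    return blocks * pages_per_block * entry_bytes // 1024


def block_mapping_table_kb(blocks: int, entry_bytes: int) -> int:
    """RAM needed when only every block has a mapping entry, in KB."""
    return blocks * entry_bytes // 1024


if __name__ == "__main__":
    print(page_mapping_table_kb(BLOCKS, PAGES_PER_BLOCK, ENTRY_BYTES), "KB")
    print(block_mapping_table_kb(BLOCKS, ENTRY_BYTES), "KB")
```

The factor of 256 between the two results is simply the number of pages per block.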
Numerous research works put constraints on their address translations in order to meet the RAM requirements [3]. Among these works, the log buffer-based scheme [4]-[7] is the most popular. In addition to a block-mapping table, this scheme adds log buffers to enhance performance. However, the log buffer-based scheme suffers from the costly full merge operations required for rearranging data in its log buffers. For this reason, a non-merging approach was proposed [8]. This approach enables fine-grained page selection within every group of blocks and offloads the mapping table to the flash memory. Even so, it cannot fully utilize the flash memory capacity due to block dependency and therefore frequently needs block reclamation.
More recently, an approach for lowering the spatial requirements of page-level address translation [9]-[15] has been proposed. Instead of keeping an entire page-mapping table inside RAM, this approach offloads the table to the flash memory and caches only small portions of it in RAM. Consequently, the performance of this approach depends on the cache efficiency.
Several issues can affect the efficiency of the cache in an FTL; this paper focuses on two of them. One is the flash page size. Because of the large page size, one flash page can hold numerous mapping entries. Thus, S-FTL [11] and CDFTL [12] exploit the spatial locality by caching whole flash pages. Nevertheless, the demand for very high capacity flash memory drives the flash page size larger [16]. The page is now so large that caching entire flash pages is too extravagant and causes undesirable effects.
The other issue is that the programming time of a flash page is substantially longer than the reading time. The cost of writing a modified mapping entry back to the flash memory is several times higher than that of rereading a victim into the cache. Although the number of cache blocks written back is usually only a small fraction of the total cache misses, it considerably affects the average address translation time. However, traditional cache replacement policies, such as LRU, treat modified and unmodified cache blocks equally. Even a recent cache replacement policy [17] focuses on increasing the hit ratio without differentiating modified cache blocks. These policies are not aware of the asymmetrical access time of flash memory. Although many cache replacement policies are customized for flash memory [18]-[20], they are not designed for caching the mapping table of a flash memory. Hence, they cannot utilize the localities of mapping table accesses.
Therefore, we propose a novel caching strategy for a page-level address translation FTL named SCFTL. SCFTL is designed to be an efficient FTL for a NAND flash-based storage device with a large page size and small RAM capacity. By implementing two spatial locality exploitation techniques and a specialized cache replacement policy, SCFTL incurs only 6.89% additional latency. This is less than half of the additional latency required by the state-of-the-art FTL, CDFTL. In addition, the average cache hit ratio of SCFTL is as high as 92.04% despite its compact cache capacity. Furthermore, the customized cache replacement policy reduces the average number of cache victims written back to only 0.38% of the total cache accesses.
The rest of this paper is organized as follows. The related FTLs are described in Section II. Then, Section III presents the proposed FTL, and its performance is evaluated in Section IV. Finally, the paper is concluded in Section V.
II. RELATED WORKS
Although page-level address translation has many advantages, it is not widely adopted due to the infeasible spatial requirements of the page-mapping table. In order to implement page-level address translation in limited RAM space, DFTL [9] stores the enormous page-mapping table in several pages of the flash memory. These pages are called translation pages and can be located by a small mapping table in RAM. Since retrieving and updating a mapping entry requires accessing the flash memory, keeping the mapping table in the flash memory drastically burdens the performance. For this reason, DFTL exploits the temporal locality by caching some mapping entries in RAM. Hence, the number of flash page read operations decreases, and translation page updates can be postponed until a modified mapping entry is evicted. Furthermore, the postponed updating allows the modified mapping entries of the same translation page to be combined and written together when one of them is evicted. However, the hit ratio of DFTL is not high because it takes little advantage of the spatial locality.
In order to increase the hit ratio, CFTL [10] and CAST [15] add a consecutive field to their caches. Consecutive logical addresses that are mapped to consecutive physical addresses are grouped into a single cache block, as illustrated in Fig. 1. Owing to the nature of sequential write operations, their physical addresses are likely to be consecutive. Hence, adding the consecutive field improves overall performance.
Fig. 1. An example of a cache with a consecutive field (C). As the physical addresses of the logical addresses 10-12 are consecutive, they can be kept together in the same cache block by setting the consecutive field to two.
In addition, CFTL improves the performance of infrequently accessed mapping entries by using the block-mapping table.
On the other hand, CAST biases its physical address selection to prefer contiguous data locations and therefore increases the chance of consecutive addresses.
To further exploit the spatial locality, S-FTL [11] caches a whole translation page as a single cache block. It also reduces the RAM space needed by compressing cached translation pages. In contrast, CDFTL [12] enhances DFTL by adding a second level cache. The first level cache is similar to DFTL while the second level cache stores entire translation pages. Hence, the temporal locality is exploited on the first level cache, while the spatial locality is handled by the second level cache. Even though caching full translation pages can guarantee the spatial locality exploitation, it is not suitable for a device with small RAM capacity because each cached translation page is huge.
Since the size of a flash page tends to grow, S-FTL and CDFTL cannot maintain the same level of performance without enlarging their caches. Additionally, a larger page size lowers the chance that a file spans multiple pages; hence, the consecutive field of CFTL and CAST will be less effective. SCFTL is introduced in order to utilize page-level address translation on large-page flash memory.
III. DESIGN OF SCFTL
SCFTL is a page-level address translation FTL that employs an efficient caching strategy. It consists of three main components: the page-mapping table (PMT), the translation page directory (TPD), and the cache mapping table (CMT). In order to achieve page-level address translation, SCFTL stores the page-mapping table in several translation pages (TPs). Each translation page keeps a group of physical page numbers mapped to consecutive logical addresses. Due to the very large flash page size, each translation page holds thousands of physical page numbers; hence, only a few pages are needed for the complete page-mapping table. TPD keeps the addresses of all translation pages in RAM and indexes them by the most significant bits of the logical addresses. The performance degradation from offloading the mapping table is reduced by caching several mapping entries in CMT. Furthermore, CMT integrates two spatial locality exploitation techniques and a customized cache replacement policy in order to enhance its efficiency.
A. Two-Level Address Translation
As the page-mapping table of SCFTL is kept inside the flash memory, the address translation has to be done by a two-level process. Generally, the physical address of a request can be found in CMT; hence, the two-level address translation is not triggered. However, the two-level address translation is executed in case of a cache miss.
Fig. 2. An example of the SCFTL address translation. Suppose the logical address of the request is 11, and each translation page contains eight physical addresses; the index and offset of the logical address are 1 and 3, respectively. (1) The access of the logical address 11 incurs a cache miss in CMT, and the first cache block is selected as a victim. Writing back does not occur, as the victim is not modified. Then, (2) the two-level address translation begins, and the translation page 1 is located at the page number 4. (3) The translation page is read from the flash memory, and (4) the physical address of the request is found at the offset 3. (5) Instead of storing only one mapping entry in CMT, the consecutive physical addresses of the logical addresses 10 and 12 are fetched and stored together with the logical address 11. (6) Assume that the spatial size is four; another mapping entry 13 will be fetched to CMT. Since this is a spatial fetching, the third cache block is selected as a victim instead of the second cache block, which has an MC value lower than the threshold.
The two-level address translation splits a logical address into two parts: an index and an offset. In the first level, TPD converts the index into the location of the related translation page, and the translation page is then retrieved from the flash memory. After that, the second level uses the offset, which is the position of the physical address within the translation page, to extract the physical address. The logical address is thereby translated to the corresponding physical address. An example of the two-level address translation is provided in steps (2)-(4) of Fig. 2.
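The two levels can be sketched in a few lines. This is an illustrative sketch only: the dictionary-based `tpd` and `flash` structures and the physical page numbers are hypothetical stand-ins, while the geometry (eight entries per translation page, logical address 11, translation page 1 stored at page number 4) follows the example of Fig. 2.

```python
from typing import Dict, List, Tuple


def split_address(lpn: int, entries_per_tp: int) -> Tuple[int, int]:
    """Split a logical page number into (index, offset)."""
    return divmod(lpn, entries_per_tp)


def translate(lpn: int, tpd: Dict[int, int], flash: Dict[int, List[int]],
              entries_per_tp: int) -> int:
    """Two-level translation performed on a CMT miss."""
    index, offset = split_address(lpn, entries_per_tp)
    tp_location = tpd[index]               # first level: TPD lookup in RAM
    translation_page = flash[tp_location]  # one flash page read
    return translation_page[offset]        # second level: offset in the page


if __name__ == "__main__":
    tpd = {1: 4}                          # translation page 1 is at page number 4
    flash = {4: list(range(100, 108))}    # hypothetical physical page numbers
    print(split_address(11, 8))           # index 1, offset 3
    print(translate(11, tpd, flash, 8))
```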
B. Efficient Caching Strategy
Owing to the temporal and spatial localities of storage accesses, several page read operations can be omitted by caching mapped physical page numbers. In addition, the caching allows the update of a translation page to be postponed; the updates on the same translation page can be combined together to minimize the number of additional page write operations. However, the efficiency of CMT does not only depend on the temporal locality; it is also highly influenced by the spatial locality.
1) Spatial Locality Exploitation:
As a flash page, which is the smallest reading or writing unit, can pack thousands of mapping entries, caching multiple entries on each translation page retrieval is convenient. However, caching an entry that will not be accessed wastes cache space. In order to avoid caching unused mapping entries, a fine-grained spatial fetching technique is introduced.
Instead of increasing the cache block size to accommodate more mapping entries, SCFTL spends several small cache blocks to exploit the spatial locality. As a result, the chance of cache thrashing can be controlled by limiting the number of mapping entries cached on each translation page read. Since SCFTL treats each mapping entry as an individual cache block, a mapping entry in low demand can be replaced independently without disturbing the others. However, not caching an entire translation page forces SCFTL to reacquire the translation page before writing back.
As the physical addresses of sequential write operations are likely to be sequentially assigned, using the consecutive field can save CMT capacity by combining several sequentially mapped entries into a single cache block. The drawback of the consecutive field is additional cache eviction, because a cache block is unable to retain the same consecutive value after one of its consecutive mapping entries is updated. Hence, the cache block has to be split. However, the split cache blocks can be merged back if their mapping entries are subjected to sequential write operations.
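The behavior of a cache block with a consecutive field can be illustrated as follows. This is a minimal sketch under stated assumptions, not the SCFTL implementation: the `CacheBlock` layout, the helper names, and the rule that an update splits a block into at most three pieces are assumptions made for illustration, and the physical page numbers are hypothetical. The lookup follows the Fig. 1 example, where a consecutive field of two covers the logical addresses 10-12.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheBlock:
    lpn: int          # first logical page number covered by the block
    ppn: int          # physical page number mapped to `lpn`
    consecutive: int  # number of additional consecutive entries


def lookup(block: CacheBlock, lpn: int) -> Optional[int]:
    """Return the physical page number if `lpn` falls inside the block."""
    delta = lpn - block.lpn
    if 0 <= delta <= block.consecutive:
        return block.ppn + delta
    return None


def split_on_update(block: CacheBlock, lpn: int, new_ppn: int) -> List[CacheBlock]:
    """Split a block when one of its entries is remapped to `new_ppn`."""
    delta = lpn - block.lpn
    if not 0 <= delta <= block.consecutive:
        raise ValueError("lpn is not covered by this cache block")
    pieces = []
    if delta > 0:  # entries before the updated one keep their mapping
        pieces.append(CacheBlock(block.lpn, block.ppn, delta - 1))
    pieces.append(CacheBlock(lpn, new_ppn, 0))
    if delta < block.consecutive:  # entries after the updated one
        pieces.append(CacheBlock(lpn + 1, block.ppn + delta + 1,
                                 block.consecutive - delta - 1))
    return pieces
```

In this sketch, updating the middle entry of a three-entry block yields three single-entry blocks, which shows why an update can cost extra cache space and trigger additional evictions.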
2) Cache Replacement Policy:
In order to decrease the number of translation page write operations, the victim selection process has to discriminate modified cache blocks from the others. However, preventing modified cache blocks from becoming victims in a fully associative cache may cause inefficient utilization of the cache capacity. Hence, SCFTL implements a customized cache replacement policy named D-NRU. D-NRU is very similar to NRU [21]. Each cache block contains a 1-bit flag indicating that it was recently accessed. In addition, D-NRU takes the modified flag into account when it selects a victim. As the modified mapping entries from the same translation page can be written back simultaneously, writing a translation page that contains many modified mapping entries is more economical. Consequently, D-NRU considers the number of modified mapping entries in each translation page when it selects a victim. The counters (MCs) are attached to TPD, as shown in Fig. 2. Each MC is very small, as it only needs to count until its value reaches the worthwhile threshold.
D-NRU is a combination of two variants of the NRU algorithm. The algorithm selection is based on the type of mapping entry fetching: normal fetching or spatial fetching. Normal fetching has high priority, as it is caused by an I/O request. Its victim selection prefers, in order, a cache block that is not recently accessed ($\neg A$), unmodified ($\neg M$), and modified ($M$) with a high MC value ($MC_{TP}$). On the other hand, spatial fetching is initiated by spatial locality exploitation. As its mapping entry may never be referenced, the cost of bringing it into the cache should be low. A modified cache block whose MC value is lower than the threshold ($MC_{TP} < c$) will not become the victim of spatial fetching. Furthermore, the recently accessed flag is not set for a cache block that is brought in by spatial fetching. The orders of D-NRU victim selection are provided in Table I, and examples are illustrated in steps (1) and (6) of Fig. 2.
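The victim selection described above can be approximated by ranking the candidate blocks. The sketch below is an interpretation, not the exact order of Table I: it assumes that the three preferences are applied lexicographically (not recently accessed first, then unmodified, then modified with an MC value at or above the threshold), and it omits the periodic clearing of the recently accessed flags that NRU normally performs. The `CacheBlock` fields and the `mc` dictionary are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class CacheBlock:
    tp_index: int   # translation page that holds this mapping entry
    accessed: bool  # recently accessed flag (A)
    modified: bool  # modified flag (M)


def select_victim(blocks: Sequence[CacheBlock], mc: Dict[int, int],
                  threshold: int, spatial: bool) -> Optional[int]:
    """Return the index of the victim block, or None if none is eligible."""
    best_index = None
    best_key = None
    for i, block in enumerate(blocks):
        # A modified block is "low MC" when its translation page has too few
        # modified entries to make the write-back worthwhile.
        low_mc = block.modified and mc[block.tp_index] < threshold
        if spatial and low_mc:
            continue  # spatial fetching never evicts such a block
        # Assumed lexicographic preference; smaller keys are evicted first.
        key = (block.accessed, block.modified, low_mc)
        if best_key is None or key < best_key:
            best_index, best_key = i, key
    return best_index
```

With this ordering, a spatial fetch can find no eligible victim at all, in which case the caller would simply skip caching the extra entry; that fallback is also an assumption of the sketch.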
IV. PERFORMANCE EVALUATION
In this section, SCFTL is compared with two state-of-the-art FTLs: DFTL and CDFTL. The 8GB NAND flash memory [22] used in the experiments is specified in Table II. It is simulated by a customized version of FlashSim [23]. The cache size is set to roughly 16KB, which is equal to the memory footprint required by a block-mapping table. As CDFTL prefers the second-level cache to be large, the first-level and second-level caches of CDFTL are configured to 2KB and 16KB, respectively.
The performance evaluation is done by executing several workload traces selected from the Storage Performance Council (SPC) [24] and Microsoft Research Cambridge (MSRC) [25]. Among the SPC benchmarks, Financial consists of I/O traces from OLTP applications, while WebSearch consists of I/O traces from a popular search engine. For the MSRC benchmarks, the traces from storage volume 0 of enterprise data center servers running various applications are selected. The details of these traces can be found in their publication [25].
As shown in Table II, the programming time is about 17 times longer than the reading time; therefore, the penalty of a cache miss that requires a victim to be written back is much higher. Under this circumstance, the hit or miss ratio alone is insufficient to measure the performance of the cache in the FTLs. In this paper, we introduce another metric called the write-back ratio ($WB_{Ratio}$). It is the ratio of the number of cache write-backs ($num_{writeback}$) to the number of cache accesses ($num_{access}$):
$$WB_{Ratio} = \frac{num_{writeback}}{num_{access}} \qquad (1)$$
Although the average system response time is a widely adopted performance measurement of FTLs, its value is mainly dominated by the data access time. To observe the effect of FTLs more precisely, we measure the percentage change of the average system response time ($T_{PC}$) from the ideal page-level address translation FTL (PFTL) [1], which can be calculated by (2). Since the page-mapping table of PFTL is completely held in RAM, the mapping time of PFTL is negligible. Consequently, $T_{PC}$ is the percentage of extra time required to complete the address translation, and a lower value means better performance. Furthermore, as the percentage change is already normalized, it can be fairly compared across various benchmarks.
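Both metrics are straightforward to compute from simulator counters. The following sketch assumes that the percentage change takes the usual form, (FTL time minus PFTL time) divided by PFTL time, times 100; the function and parameter names are illustrative and do not come from the simulator.

```python
def writeback_ratio(num_writeback: int, num_access: int) -> float:
    """Fraction of cache accesses that cause a victim to be written back."""
    return num_writeback / num_access


def response_time_change(t_ftl: float, t_pftl: float) -> float:
    """Percentage change of the average response time relative to PFTL.

    Assumes the standard percentage-change definition.
    """
    return (t_ftl - t_pftl) / t_pftl * 100.0


if __name__ == "__main__":
    # Hypothetical counters: 38 write-backs in 10,000 cache accesses.
    print(f"{writeback_ratio(38, 10_000):.2%}")
    # Hypothetical times in arbitrary units.
    print(f"{response_time_change(106.89, 100.0):.2f}%")
```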
A. Performance of the Efficient Caching Strategy
The performance of the D-NRU replacement policy is shown in Table III. The numeric suffix of D-NRU is the number of MC bits. By design, D-NRU avoids cache thrashing by not setting the recently accessed flag for spatial fetching, which in turn lowers the miss ratio. In addition, it prevents spatial fetching from replacing modified cache blocks whose write-back offers little benefit. Consequently, the write-back ratio decreases significantly, as preventing premature write-backs provides additional time to gather more modified mapping entries from the same translation page. However, overprotection, which leaves too few victim candidates, heightens the risk of cache thrashing and therefore results in a high miss ratio. SCFTL therefore uses D-NRU-3, as it shows better performance than the others.
In Fig. 3, SCFTL achieves an average cache miss ratio of 7.96%. Due to the very small cache size configuration, the miss ratios of DFTL are very high. Its average cache miss ratio is 54.17%, which is worse than that of SCFTL by 46.22 percentage points. This enhancement is the impact of the spatial locality exploitation. In addition, the average cache miss ratio of CDFTL is 13.29%. It is 5.33 percentage points higher than that of SCFTL because SCFTL can preserve the diversity of logical addresses better than the small first-level cache of CDFTL.
The comparison of write-back ratios between the FTLs is shown in Fig. 4. Because D-NRU-3 in SCFTL works efficiently, the average write-back ratio of SCFTL is 0.38%, while that of DFTL is 1.09%. The second-level cache of CDFTL is too small to effectively avoid the write-back operations of modified mapping entries evicted from the first-level cache. As a result, the average write-back ratio of CDFTL is surprisingly high; it is 4.72%, which is about 12 times that of SCFTL.
Finally, the average system response times of the FTLs are compared in Fig. 5. Due to the cache performance of SCFTL, its average $T_{PC}$ is only 6.89%. The $T_{PC}$ values of DFTL and CDFTL are 75.96 and 11.35 percentage points higher than that of SCFTL, respectively. The $T_{PC}$ of DFTL in mds_0 is
In order to match the SCFTL performance, CDFTL needs 128KB of second-level cache. Furthermore, SCFTL still performs well in the 4KB cache configuration, which is smaller than the flash page size and therefore infeasible for CDFTL, with a 19.28% average $T_{PC}$.
B. Impact on Flash Memory Lifetime
As the flash memory endurance is limited by the P/E cycles, the number of extra block erasures relative to PFTL is measured. Owing to the very low write-back ratio of SCFTL, additional block erasures are barely needed. The average number of block erasures of SCFTL is only 0.67% higher than that of PFTL, while those of DFTL and CDFTL are 1.72% and 7.12% higher, respectively. Consequently, SCFTL has very little effect on the flash memory lifetime.
C. Memory Requirements
According to Table II, the total number of pages is 4,096×256. As each page can store 8,192 bytes, it can hold 2,048 4-byte physical addresses. Only 512 translation pages, which are about 0.05% of the total pages, are required for SCFTL, DFTL, and CDFTL.
Since SCFTL keeps TPD and CMT in RAM, the amount of RAM needed is the summation of the requirements of these two components. TPD is a simple mapping table with counters. Each entry contains a 4-byte translation page address and a 3-bit MC; hence, only 2.25KB of RAM is needed by TPD. On the other hand, every cache block of CMT consists of a 4-byte tag, a 4-byte mapping entry, a 5-bit consecutive field, a 1-bit valid flag, a 1-bit modified flag, and a 1-bit recently accessed flag; therefore, each cache block is 9 bytes. With 2,048 cache blocks, the total size of CMT is 18KB. As a result, SCFTL requires only 20.25KB of RAM. In addition, DFTL and CDFTL require 18.50KB and 20.25KB of RAM, respectively.
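The figures in this subsection can be checked with a short calculation. This is a sketch under one explicit assumption: a TPD entry of 32 + 3 = 35 bits is rounded up to 36 bits (4.5 bytes), which is the entry size that reproduces the 2.25KB stated above; without that rounding the result would be slightly smaller. The function names are illustrative.

```python
import math

TOTAL_PAGES = 4096 * 256
PAGE_BYTES = 8192
ADDRESS_BYTES = 4
CACHE_BLOCKS = 2048


def translation_pages(total_pages: int, page_bytes: int, address_bytes: int) -> int:
    """Number of translation pages needed for the full page-mapping table."""
    entries_per_tp = page_bytes // address_bytes
    return total_pages // entries_per_tp


def tpd_bytes(num_tps: int, address_bits: int = 32, mc_bits: int = 3,
              align_bits: int = 4) -> int:
    """TPD size, with each entry padded to a multiple of `align_bits` (assumed)."""
    entry_bits = math.ceil((address_bits + mc_bits) / align_bits) * align_bits
    return num_tps * entry_bits // 8


def cmt_bytes(num_blocks: int) -> int:
    """CMT size: tag, mapping entry, consecutive field, and three 1-bit flags."""
    block_bits = 32 + 32 + 5 + 1 + 1 + 1  # 72 bits = 9 bytes
    return num_blocks * block_bits // 8


if __name__ == "__main__":
    tps = translation_pages(TOTAL_PAGES, PAGE_BYTES, ADDRESS_BYTES)
    total_kb = (tpd_bytes(tps) + cmt_bytes(CACHE_BLOCKS)) / 1024
    print(tps, tpd_bytes(tps) / 1024, cmt_bytes(CACHE_BLOCKS) / 1024, total_kb)
```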
V. CONCLUSION
Since flash memory tends to have larger pages, it is necessary to include this constraint in the design of an FTL. To overcome this restriction, we propose SCFTL, an efficient caching strategy for a page-level address translation FTL. In order to utilize the cache efficiently, SCFTL employs fine-grained spatial fetching and the consecutive field. Moreover, SCFTL is aware of the asymmetrical access time of flash memory; it customizes the cache replacement policy to make write-backs economical. In spite of the limited memory space, SCFTL successfully exploits the spatial locality and reduces the number of cache write-backs. In the 16KB cache configuration, SCFTL needs only 6.89% additional average system response time compared with the FTL that keeps the complete page-mapping table in RAM. The average overhead of SCFTL is 75.96 and 11.35 percentage points lower than those of DFTL and CDFTL, respectively. In addition, the average cache miss and write-back ratios of SCFTL are as low as 7.96% and 0.38%, respectively.
Because the degree of spatial locality varies among workloads, dynamically adjusting the spatial fetching size should further increase the efficiency of SCFTL. In addition, as SCFTL does not impose any mapping restriction, it would be interesting to evaluate the performance of cutting-edge garbage collectors and wear levelers on SCFTL.
