ABSTRACT NAND flash memory offers strong shock resistance, low power consumption, non-volatility, and high performance, and it is increasingly used in embedded systems and enterprise servers, such as the IoT-based power grid storage system. Research on flash translation layer (FTL) strategies has therefore become an active topic for solid-state storage devices. FTL policies differ in mapping granularity, and their garbage collection methods differ accordingly. Furthermore, write operations on flash memory devices suffer from "write amplification". This paper proposes a data aggregation preprocessing algorithm that aggregates as many active pages as possible and reduces write amplification. The algorithm is implemented and tested on the flash simulation platform FlashSim. The results indicate that our algorithm can effectively improve the performance of the IoT-based power grid storage system by reducing redundant write operations, the number of physical block erasures, the number of physical page reads and writes, and the system response time, thereby extending the life of solid-state devices.
I. INTRODUCTION
We are in the era of big data, including power grid big data. Information technology is used ever more widely in production and social activities, and data volumes are growing explosively. After a long period of development, computer architecture has made great progress in all aspects. In storage architecture, however, the speed gap between devices has become one of the bottlenecks in the development of the entire computer system. Research on new and efficient storage media has therefore become an indispensable technical requirement for applications such as the IoT-based power grid storage system.
The invention of NAND flash memory [1] has effectively mitigated the speed gap between memory and the underlying storage system. NAND flash memory has the advantages of small size, high speed, and strong shock resistance, so it is widely used in various embedded systems. However, NAND flash memory has inherent defects, such as large differences between read and write overhead, "erase-before-write", "out-of-place updates", and a limited number of erasures. These defects seriously affect the performance, reliability, and lifetime of flash-based storage devices.
The flash translation layer (FTL) is the most critical technology for solid-state devices. Its basic functions include address mapping, garbage collection, wear leveling, bad block management, and power-down recovery. Through these functions, the FTL hides the characteristics of the flash device, provides a virtual disk, and enables upper-layer applications to use the flash-based storage device directly. The FTL strategy therefore largely determines the performance of the solid-state storage system.
In studying the flash translation layer of solid-state drives, we found that the hybrid mapping strategy incurs a high merge cost during garbage collection [2], [3]. At the same time, by analyzing Basic_GC, the garbage collection strategy in DFTL [4] (the strategy used with page-level mapping), we found that classifying the requested data as hot or cold can effectively reduce the number of flash page reads and writes and the number of physical block erases. Building on this observation, this paper proposes a data aggregation preprocessing (DAP) algorithm, which can reduce redundant write operations, the number of physical block erasures, the number of physical page reads and writes, and the system response time, and extend the life of solid-state devices.
The specific contributions of this paper can be summarized as follows:
(1) We present the DAP algorithm, which performs data preprocessing operations before garbage collection to make flash memory more efficient.
(2) The DAP algorithm is based on a preprocessing strategy for hot and cold data separation. It divides the data into normal data and hot data to reduce the number of valid page transfers and block erasures during garbage collection.
(3) The experimental results demonstrate the effectiveness of the proposed DAP algorithm.
The remainder of this paper is organized as follows. In section II, we present theoretical background and research motivation. In section III, the proposed DAP algorithm is explained. The experimental methodology is described in section IV. Finally, we present a brief conclusion in section V.
II. THEORETICAL BACKGROUND AND RESEARCH MOTIVATION
A. FLASH TRANSLATION LAYER
As shown in Figure 1, the FTL mainly provides address mapping, garbage collection [5], wear leveling [6]-[8], bad block management [9], and error correction control [10]. By providing these functions, the FTL presents a virtual disk and hides the characteristics of the flash device, so upper-layer applications can use flash-based storage devices directly.
Although the flash translation layer has many functions, its core function is address mapping. Garbage collection should therefore be designed together with address mapping.
B. FTL MAPPING ALGORITHM AND CORRESPONDING GARBAGE COLLECTION
There are three main mapping methods in the FTL's address mapping algorithm: page-level mapping [11], block-level mapping [12], and hybrid mapping [13]. Each address-mapping method has corresponding garbage-collection strategies [15]-[18].
In the case of hybrid address mapping, garbage collection must merge log blocks and data blocks. There are three types of merge operations: swap merge, partial merge, and full merge. Figure 2 illustrates them: (A) is a swap merge, in which log block B is valid and data block A is invalid, so block A is erased and block B becomes the valid data block; (B) is a partial merge, in which the valid data in block A is copied to block B, block A is erased, and block B becomes the new data block; and (C) is a full merge, which is the most expensive merge. Because hybrid address mapping maintains both log and data blocks, full merge operations cannot be avoided. Page-level mapping, however, avoids these expensive merge operations.
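The relative cost of the three merges can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the accounting used by the cited schemes: the function name `merge_cost` and the cost model are hypothetical, and the full-merge case assumes the commonly described behavior in which valid pages are copied into a fresh block and both the old data block and the log block are erased.

```python
from typing import Tuple


def merge_cost(kind: str, valid_pages: int = 0) -> Tuple[int, int]:
    """Return (page_copies, block_erases) for one merge operation.

    Assumed cost model (hypothetical, for illustration only):
      swap    : no copies, erase the stale data block
      partial : copy the remaining valid pages, erase the stale data block
      full    : copy valid pages into a fresh block, erase data and log blocks
    """
    if kind == "swap":
        return (0, 1)
    if kind == "partial":
        return (valid_pages, 1)
    if kind == "full":
        return (valid_pages, 2)
    raise ValueError(f"unknown merge kind: {kind}")
```

Under this model a full merge of a 64-page block costs up to 64 page copies and two erases, while a swap merge costs a single erase.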
The garbage collection of DFTL works at the page level. DFTL maintains two types of blocks, data blocks and translation blocks, which store data pages and translation pages, respectively. As data requests accumulate, the number of available physical blocks decreases. The garbage collection strategy of DFTL maintains a threshold called GC_threshold. When the number of free blocks is greater than GC_threshold, the system does not perform garbage collection. When the number of free blocks falls below GC_threshold, DFTL executes the garbage collection strategy, referred to as Basic_GC.
C. MOTIVATION
During long-term use of a solid-state drive, a large number of invalid data blocks accumulate. When invalid blocks fill the entire flash memory, the system's read and write operations are hindered. Therefore, the master chip must perform garbage collection on invalid physical blocks while serving normal reads and writes. Because physical erase operations are expensive, the overall performance of the system is reduced. The timing and efficiency of garbage collection are therefore important factors in determining the performance of the SSD. Most FTLs start garbage collection dynamically when the SSD is idle. This method improves both read and write efficiency and the utilization of system resources.
Since the number of erasures of the flash chip is limited [24], a poor garbage collection strategy generates a large amount of valid data transfer and shortens the life of the solid-state drive. Therefore, this paper proposes a DAP method based on the DFTL algorithm to improve the efficiency of garbage collection [20], [21].
At the same time, to make the process of garbage collection clear, this section introduces "write amplification" [14], the ratio of the number of pages actually written to the flash memory to the number of pages written by the host. In this paper, we use A to represent write amplification, as shown in (1):
Note that c represents the total number of pages in a block, V_k represents the valid data when a block is erased, and represents the number of times a page is relocated. A can be redefined as the following formula:
In (2), v represents the average number of page relocations, which is equivalent to the average occupancy rate of relocation selection blocks during garbage collection.
Therefore, reducing the value of v not only reduces the write amplification, but also reduces the total number of block erases and the system response time.
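To see numerically why a smaller v helps, the following Python sketch uses one common formulation that is consistent with the definitions above but is an assumption here, not necessarily the paper's equations (1) and (2): if every erased block of c pages still holds V_k valid pages that must be relocated, only c - V_k pages per block are available for new host data, so A = 1 / (1 - v), where v is the average of V_k / c. The function name `write_amplification` is hypothetical.

```python
from typing import Sequence


def write_amplification(c: int, valid_at_erase: Sequence[int]) -> float:
    """Estimate A from the valid-page counts V_k of the erased blocks.

    Assumed model (not taken from the source): A = 1 / (1 - v),
    where v is the mean occupancy V_k / c of the blocks chosen for erasure.
    """
    if not valid_at_erase:
        raise ValueError("need at least one erased block")
    v = sum(valid_at_erase) / (len(valid_at_erase) * c)
    if v >= 1.0:
        raise ValueError("erased blocks reclaim no free pages")
    return 1.0 / (1.0 - v)
```

With c = 64, blocks erased while a quarter full (V_k = 16) give A of about 1.33 under this model, and blocks erased with no valid pages give A = 1, meaning no redundant writes at all.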
III. PROPOSED DAP ALGORITHM
A. THE DESIGN IDEA OF DAP ALGORITHM
The DAP algorithm uses a data preprocessing strategy based on the page-level mapping algorithm DFTL, dividing the data into cold and hot data in the FTL. This strategy can reduce the number of valid page transfers and the number of block erases during garbage collection. Because the hot and cold properties of data are time sensitive, the hot and cold information must be stored, and it must be read back whenever the hotness of the data is judged. These operations therefore consume system time and storage space.
As shown in Figure 3, the DAP algorithm is located between the address mapping layer and the flash chip layer. The DAP algorithm assigns the data to a hot or cold block before the data is written to the flash. Using this method, the system can quickly reclaim invalid data blocks during garbage collection.
B. THRESHOLD INTRODUCTION
Before introducing the algorithm flow, we first introduce the hot and cold threshold (HW_Threshold), which is used to classify data as hot or cold when a request arrives. HW counts the number of hits; each time the data is hit again, HW is incremented. When HW reaches HW_Threshold, the data is written to the hot data block. Otherwise, it is written to the normal data block.
This paper considers that the appropriate threshold (HW_Threshold) differs for different SRAM sizes: the threshold must increase as the SRAM capacity increases.
C. DAP ALGORITHM FLOW
Algorithm 1 shows the flow of the DAP algorithm. HW indicates the number of hits before the entry is evicted, RQST indicates the logical address of the request, and HW_Threshold is the threshold set at the FTL layer. This paper analyzes the relationship between different thresholds and storage strategy performance. When a request misses, the mapping information is read from the flash memory through the global translation directory (GTD) and HW is set to 0. When a write request hits again, HW is incremented. When HW reaches HW_Threshold, the data is written to the H_block (hot data block). Otherwise, it is written to the N_block (normal data block). Finally, the DAP algorithm must update the cached mapping table (CMT) and the MAPPING_Block. When a read request hits, the DAP algorithm only needs to read the corresponding page data from the DATA_Block (data block).
When a request arrives, the data is classified as cold or hot and stored in the corresponding cold or hot block. When the number of free blocks is less than the garbage collection threshold (GC_threshold), the garbage collection operation is performed. Because the data in hot blocks is more likely to be invalid, the DAP algorithm preferentially selects the data block with the most invalid pages in the hot block set during garbage collection.
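The flow described in this subsection can be sketched in Python as follows. This is a simplified illustration, not the authors' FlashSim implementation: the names `DapClassifier`, `needs_gc`, and `pick_victim` are hypothetical, the cached mapping table is reduced to a dictionary without eviction, the free-block check uses a separate `gc_threshold` parameter following the GC_threshold of DFTL described in Section II, and falling back to normal blocks when no hot block has invalid pages is an assumption.

```python
from typing import Dict, Optional


class DapClassifier:
    """Tracks the hit counter HW per logical page and picks a target block type."""

    def __init__(self, hw_threshold: int = 2) -> None:
        self.hw_threshold = hw_threshold
        # Stand-in for the cached mapping table: logical page -> HW.
        # Eviction and the flash-resident mapping blocks are not modeled.
        self.hw: Dict[int, int] = {}

    def on_request(self, rqst: int, is_write: bool) -> str:
        if rqst not in self.hw:
            # Miss: the mapping would be read from flash via the GTD; HW starts at 0.
            self.hw[rqst] = 0
        elif is_write:
            # Write hit: HW is incremented.
            self.hw[rqst] += 1
        if not is_write:
            return "DATA_Block"  # reads only fetch the page
        return "H_block" if self.hw[rqst] >= self.hw_threshold else "N_block"


def needs_gc(free_blocks: int, gc_threshold: int) -> bool:
    """Garbage collection starts once free blocks drop below the threshold."""
    return free_blocks < gc_threshold


def pick_victim(hot_invalid: Dict[int, int],
                normal_invalid: Dict[int, int]) -> Optional[int]:
    """Prefer the hot block with the most invalid pages; fall back to normal blocks."""
    for pool in (hot_invalid, normal_invalid):
        candidates = {blk: n for blk, n in pool.items() if n > 0}
        if candidates:
            return max(candidates, key=candidates.get)
    return None
```

With `hw_threshold=2`, the first write to a page (a miss) and the next write hit go to the normal block, and the following write hit is routed to the hot block.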
D. BLOCK RECYCLING ALGORITHM
Algorithm 2 shows the block recycling process. Valid_Page_Per_Block represents the number of valid pages in a block. Invalid_Page_Per_Block represents the number of invalid pages in a block. Selected_Block_Index represents the index value of the last selected replacement block. Exist_In_Block indicates whether the page has been written. Exist_Valid_Page_Per_Block indicates the number of valid pages in the block, if it has been written. Exist_Invalid_Page_Per_Block indicates the number of invalid pages in the block, if it has been written. Recycling_Set represents an index set.
The basic flow of the algorithm is as follows. When writing data, the algorithm first writes the data to the new data block and marks the original data as invalid. The number of valid data pages of the new data block then increases, and its number of free pages decreases. The number of valid data pages of the original data block decreases, and its number of invalid data pages increases. Finally, the number of invalid pages and the block number of the data block are placed into the set. When performing block reclamation, the algorithm selects the block with the most invalid pages in the set.
Algorithm 1 DAP Algorithm
    ...
        Read from MAPPING_Block by GTD;
      end if
    end if
    if operation is write then
      if HW ≥ Threshold then
        write into HDATA_Block;
      else
        write into NDATA_Block;
      end if
      update CMT;
      update MAPPING_Block when evict;
    else
      Read from DATA_Block;
    end if
Algorithm 2 Block Recycling Algorithm
Definition:
    Valid_Page_Per_Block: Valid page number in one block
    Invalid_Page_Per_Block: Invalid page number in one block
    Selected_Block_Index: The evicted number
    Exist_In_Block: If write logical page address exist in this block
    Exist_Valid_Page_Per_Block: Valid page number in exist block
    Exist_Invalid_Page_Per_Block: Invalid page number in exist block
    Recycling_Set: Candidate evict block set
Return: Selected_Block_Index
    Valid_Page_Per_Block = 0;
    Invalid_Page_Per_Block = 0;
    if operation is write then
      Valid_Page_Per_Block++;
      Free_Page_Per_Block--;
      if Exist_In_Block is True then
        Exist_Valid_Page_Per_Block--;
        Exist_Invalid_Page_Per_Block++;
      end if
      add invalid page per block into Recycling_Set;
    end if
    Find the largest Invalid_Page_Per_Block number;
    Selected in Recycling_Set as evicted one;
    return Selected_Block_Index;
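The bookkeeping in Algorithm 2 can be sketched in runnable Python as follows. This is a minimal illustration under assumptions rather than the FlashSim code: the class name `BlockRecycler` and its dictionaries are hypothetical, the caller chooses the target block for each write, and block capacity and free-page counts are not modeled.

```python
from typing import Dict, Optional


class BlockRecycler:
    """Per-block valid/invalid page counters with most-invalid victim selection."""

    def __init__(self) -> None:
        self.valid: Dict[int, int] = {}    # block -> number of valid pages
        self.invalid: Dict[int, int] = {}  # block -> number of invalid pages
        self.where: Dict[int, int] = {}    # logical page -> block holding it

    def write(self, logical_page: int, target_block: int) -> None:
        old_block = self.where.get(logical_page)
        if old_block is not None:
            # Out-of-place update: the previous copy becomes invalid.
            self.valid[old_block] -= 1
            self.invalid[old_block] += 1
        self.valid[target_block] = self.valid.get(target_block, 0) + 1
        self.invalid.setdefault(target_block, 0)
        self.where[logical_page] = target_block

    def select_block(self) -> Optional[int]:
        # Candidate set: blocks that hold at least one invalid page.
        recycling_set = {blk: n for blk, n in self.invalid.items() if n > 0}
        if not recycling_set:
            return None
        return max(recycling_set, key=recycling_set.get)
```

After three pages are written to block 0 and two of them are rewritten into block 1, block 0 holds two invalid pages and is returned by `select_block()`.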
E. ALGORITHM TIME AND SPACE COMPLEXITY ANALYSIS
The time and space complexity of the algorithm are important criteria for evaluating an algorithm. An excellent algorithm must be able to run in a short time and occupy a small storage space.
1) TIME COMPLEXITY ANALYSIS
The main flow of the algorithm can be divided into two parts. The first part is the hot and cold partition of the data, the time complexity of which is O(1). The second part is garbage collection, which places the block number with the most invalid pages in the recycling set. Although different recycling methods consume different amounts of time, the complexity of the best algorithm in this section is O(1). Therefore, the time complexity of the algorithm meets the goal of this paper.
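The text does not say how the block with the most invalid pages is found in constant time. One way this could be achieved, shown here purely as an assumption and not as the authors' data structure, is to bucket blocks by their invalid-page count. Because a block holds a fixed number of pages, the scan over buckets is bounded by that constant and does not grow with the number of blocks. The class name `InvalidCountBuckets` is hypothetical.

```python
from typing import Dict, Hashable, List, Optional, Set


class InvalidCountBuckets:
    """Buckets blocks by invalid-page count so the victim lookup is bounded."""

    def __init__(self, pages_per_block: int = 64) -> None:
        self.pages_per_block = pages_per_block
        # buckets[n] holds the blocks that currently have n invalid pages.
        self.buckets: List[Set[Hashable]] = [set() for _ in range(pages_per_block + 1)]
        self.count: Dict[Hashable, int] = {}

    def invalidate_page(self, block: Hashable) -> None:
        old = self.count.get(block, 0)
        if old >= self.pages_per_block:
            raise ValueError("block is already fully invalid")
        self.buckets[old].discard(block)  # no-op when the block was not tracked yet
        self.count[block] = old + 1
        self.buckets[old + 1].add(block)

    def pop_victim(self) -> Optional[Hashable]:
        # At most pages_per_block buckets are inspected, independent of block count.
        for n in range(self.pages_per_block, 0, -1):
            if self.buckets[n]:
                block = self.buckets[n].pop()
                del self.count[block]
                return block
        return None
```

Each invalidation moves a block between two adjacent buckets, and `pop_victim()` returns a block from the highest non-empty bucket.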
2) SPATIAL COMPLEXITY ANALYSIS
IV. PERFORMANCE EVALUATION
This section describes the performance evaluation and result analysis of the DAP algorithm. The experiments are implemented on the simulation platform FlashSim [19]. In the simulation, we used a series of real loads and some synthetic loads, and compared the results with the classic DFTL algorithm. Under a large number of write operations, the DAP algorithm can reduce the number of page reads and writes and the number of block erases during garbage collection.
VOLUME 6, 2018
A. EXPERIMENT SETUP
FlashSim is an open-source, event-driven, modular flash simulator. It has a variety of built-in FTL algorithms, such as basic FTL, FAST, and DFTL. Figure 4 shows the architecture of FlashSim. This modular design allows us to focus on the DAP module in garbage collection and simplifies the experimental steps.
This simulation experiment uses three real loads, Financial1, Financial2, and Websearch1, together with a synthetic test load. Financial1 and Financial2 were collected from OLTP applications at financial institutions [22]. Financial1 is dominated by random write requests, and Financial2 is dominated by random read requests. Websearch1 was obtained from a website search engine [23]. It consists mainly of a large number of sequential read requests, and 99.9% of its requests are reads. The synthetic test load consists of a large number of random write requests within a small address range; the number of requests is approximately 700,000. Table 1 shows the read-write ratio of these four loads.
Table 2 shows the experimental parameters for the simulation. Flash is erased in units of blocks, and the read performance of a physical page is better than its write performance. The specific parameters are as follows: the page size is 2 KB, the OOB area size is 64 B, the block size is 64 pages, reading a physical page takes 130.9 µs, writing a physical page takes 405.9 µs, and erasing one block takes 2 ms.
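These timing parameters show why the number of valid pages in a reclaimed block matters. The Python sketch below estimates the time to reclaim one block under a simplified cost model that is an assumption here, not FlashSim's accounting: each valid page costs one page read plus one page write to relocate, followed by one block erase, and translation-page updates are ignored. The name `reclaim_cost_us` is hypothetical.

```python
READ_US = 130.9        # time to read one physical page, from the setup above
WRITE_US = 405.9       # time to write one physical page
ERASE_US = 2000.0      # time to erase one block (2 ms)
PAGES_PER_BLOCK = 64


def reclaim_cost_us(valid_pages: int) -> float:
    """Estimated microseconds to relocate valid pages and erase one block."""
    if not 0 <= valid_pages <= PAGES_PER_BLOCK:
        raise ValueError("valid_pages must be between 0 and 64")
    return valid_pages * (READ_US + WRITE_US) + ERASE_US
```

Under this model, a block with no valid pages costs only the 2 ms erase, while every 5 additional valid pages add roughly 2.7 ms of relocation time.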
B. ALGORITHM OVERALL PERFORMANCE COMPARISON
To simplify the presentation, the experiment sets the SRAM size to 32 K and the HW_Threshold value to 2. This section compares the DAP method with the Basic_GC method in terms of the number of page reads and writes, the number of block erases, and the average response time of the SSD when garbage collection is triggered. Tables 3 through 6 show the percentage reduction in the number of page reads and writes and the number of block erases during garbage collection. Figure 5 shows the overall performance improvement: Figure 5(a) shows the reduction in average response time, and Figure 5(b) shows the number of physical block erases.
C. WRITE AMPLIFICATION
The DAP method proposed in this paper can reduce the proportion of valid pages in each block that is erased. In Figure 6, the abscissa indicates the number of valid pages in a block when it is erased, and the ordinate indicates the percentage of erased blocks with that number of valid pages. The red line indicates the Basic_GC method (referred to as GA) and the black line indicates the DAP method. Figure 6(a) shows the results for load Fin1. In general, the red line (GA method) peaks later than the black line (DAP method): blocks erased by the GA method hold approximately 5 more valid pages than those erased by the DAP method. Figure 6(b) shows the results for load Fin2, which yield a similar curve distribution. Because the peak of the DAP curve lies at fewer valid pages, blocks with a small number of valid pages account for a larger proportion of the total when the DAP method performs garbage collection. Figure 6(c) shows the results for load Web1. Read operations make up 99.9% of the operations in Web1, so the DAP method yields little performance improvement. Figure 6(d) shows the results for the synthetic test load. More than 90% of the blocks can be erased directly when the DAP method performs garbage collection. The DAP method reduces the value of v in the write amplification, which improves system performance and reduces the number of physical block erasures. In summary, the DAP method can effectively reduce redundant write operations, the number of physical block erasures, the number of physical page reads and writes, and the system response time, and extend the life of solid-state devices.
D. GARBAGE COLLECTION ANALYSIS
The number of page reads and writes during garbage collection is an important criterion for evaluating the DAP algorithm, because it affects the number of block erases and the life of the SSD. Figure 7 shows the results of the experiment per 100,000 requests. Because address mapping uses a page-level mapping strategy, we can clearly see that the DAP method effectively reduces the number of page reads and writes during garbage collection. The upper blue line is the original DFTL garbage collection algorithm, and the lower red line is the DAP algorithm with HW_Threshold set to 2.
Figure 7(a) shows the results for load Fin1. For every 100,000 requests, the DAP algorithm reduces the number of page reads and writes during garbage collection by up to 15,000 compared to the original algorithm. Figure 7(b) shows the results for load Fin2. For every 100,000 requests, the DAP algorithm reduces the number of page reads and writes during garbage collection by 3,000 compared to the original algorithm. Figure 7(c) shows the results for load Web1. Since 99.9% of the requests are read requests, the DAP algorithm cannot significantly reduce the number of reads and writes during garbage collection. Figure 7(d) shows the results for the synthetic test load. Because this load has a large number of small-range read and write operations, the DAP algorithm reduces the number of page read and write operations during garbage collection by up to 12,000 for every 100,000 requests. Under a large number of write operations, the DAP algorithm can effectively reduce the number of page read and write operations during garbage collection.
E. BLOCK ERASURE ANALYSIS
The number of times a flash memory chip can be erased is limited, so reducing the number of erasures effectively extends the life of the solid-state drive. As shown in Figure 8, the DAP algorithm can effectively reduce the number of erases of the flash chip under a large number of write operations. The upper blue line is the original algorithm, and the lower red line is the result of setting HW_Threshold to 2. Figure 8(a) shows the results for load Fin1. For every 100,000 requests, the DAP algorithm reduces the number of block erase operations by up to 300 compared to the Basic_GC algorithm. Figure 8(b) shows the results for load Fin2. Since these requests are primarily read requests, the improvement is not particularly pronounced. However, it can be clearly seen from the curve that the DAP algorithm is effective. Figure 8(c) shows the results for load Web1. Since 99.9% of the requests are read requests, the DAP algorithm cannot significantly reduce the number of reads and writes during garbage collection. All read requests are updated to the same logical address in a short period of time, so the number of block erases is generally reduced. Figure 8(d) shows the results for the synthetic test load. Because this load has a large number of small-range read and write operations, the number of block erases is already relatively small, and the reduction is less obvious than for the other traces. Under a large number of write operations, the DAP algorithm can effectively reduce the number of page read and write operations during garbage collection. In general, the DAP algorithm can reduce the number of block erases.
F. THRESHOLD ANALYSIS
The capacity of the SRAM must be analyzed to determine the threshold. The experimental results are shown in Figure 9. In the experiment, the SRAM size was set to 16 K, 32 K, 64 K, and 128 K, as shown in Figures 9(a) through 9(f). For loads with more write requests, it is most appropriate to set the threshold to approximately 2. For loads with fewer write operations, such as Web1, the experimental results are shown in Figures 9(g) and 9(h). For such loads, this paper does not recommend adding an HW_Threshold at the FTL layer.
V. CONCLUSION AND FUTURE WORK
In this paper, we propose a novel data aggregation preprocessing algorithm called DAP for improving the performance of the IoT-based power grid storage system. It is designed to reduce redundant write operations, the number of physical block erasures, the number of physical page reads and writes, and the system response time, and to extend the life of solid-state devices. The DAP algorithm uses a preprocessing strategy for hot and cold data separation. The written data is divided into normal and hot data at the FTL layer. The DAP algorithm maintains hot and cold data blocks to reduce the number of valid page transfers and block erases during garbage collection. The proposed DAP algorithm was tested on the flash simulation platform FlashSim. The results show that, compared with the Basic_GC algorithm, our algorithm reduces the number of data reads and writes by 16.25%, the number of block erases by 5.86%, and the average response time by 4.86%.
Since solid-state storage systems are widely used in personal devices and enterprise-class devices, their limited resources must be managed carefully. In future studies, we will mainly focus on wear leveling and address mapping to achieve better flash memory performance.
