Abstract
Introduction
Recently, NAND Flash memory has become the main storage medium for embedded devices, such as PDAs and music players. NAND Flash memory is now also being used in systems ranging from laptop and desktop computers to enterprise-scale storage servers. NAND Flash memory offers a number of benefits over conventional hard disk drives (HDDs). These benefits include lower power consumption, lighter weight, higher resilience to external shock, the ability to sustain hotter operating regimes, and faster access times (with some exceptions that arise due to random writes). Unlike HDDs, NAND Flash memory based Solid-State Disks (SSDs) have no mechanical moving parts, such as a spindle and voice-coil motors. Despite these benefits, a storage system designer needs to carefully consider the use of SSDs because they also have some notable weaknesses. The main weaknesses of SSDs include a higher price ($/GB) than HDDs, writes being 4-5 times slower than reads, slowdown in device throughput during periods of garbage collection that are hastened by small, random writes [14], and limited lifetime (10K-1M erase cycles per block) [4].
In order to overcome the limitations described above, a variety of complementary approaches have been proposed. For example, Multi-level Cell (MLC) technology gives higher density and lower cost per GB than Single-level Cell (SLC) [19]. The downside of MLC is that its read and write times are slower. Consequently, there are current attempts to employ combinations of SLC and MLC Flash chips in SSDs. Numerous techniques for efficient address translation, garbage collection, and wear-leveling in the Flash Translation Layer (FTL) (more details in Section 2) have been explored to improve the performance of SSD devices and/or provide longer lifetimes.
The design and implementation of cost-efficient, reliable SSDs require faithful and accurate test-beds for evaluating new algorithms for specific software components (such as those that constitute the FTL) within different hardware configurations of the SSD before implementing them in the actual firmware. The fact that significant aspects of the techniques employed within SSDs are unknown to the public because of intellectual property concerns further adds to the urgency of having such a test-bed for SSD research. With this motivation, we have designed and developed a simulation infrastructure. The salient features and contributions of our work are as follows.
• The components of an SSD can be classified as belonging to either the hardware or the software category. The hardware component consists of a processing unit, memory, bus, and Flash chips. The software component (which executes on the processing unit) consists of an FTL. The price of the SSD depends on the hardware configuration in the SSD and the software running on the hardware, but there is a lack of test infrastructure to examine cost-effective hardware configurations and software algorithms in research environments outside those affiliated with manufacturers of SSDs. In this work, we provide an experimental test-bed to fill this void.
• The few efforts that have attempted to provide a simulation infrastructure [13], [3] lack desirable features, especially an object-oriented design. It is typically difficult to understand and enhance these simulators. Compared to other existing/evolving SSD simulators, FlashSim is entirely object-oriented. Our approach allows developers to easily understand, use, and extend our simulator. Furthermore, our simulator has been integrated with the well-regarded and popular DiskSim simulator [7] and validated for behavioral similarity with real SSD devices.
• Energy consumption in SSDs is surprisingly higher than initially expected; energy consumption is approximately the same as mobile HDDs [17]. Thus, it is important to understand the causes of the energy consumption. We have analyzed the energy consumption in SSDs with our simulator by considering a simple energy model including various FTLs with real traces (Financial and TPC-H).
The rest of this paper is organized as follows: In Section 2, we present the basics of NAND Flash memory technology. We present the design of FlashSim and its implementation details in Section 3. We present the experimental results in Section 4. We discuss related work in Section 5. Finally, we summarize our work and discuss future direction in Section 6.
Background
Basics of NAND Flash Memory Technology. The most popular flash type for storage media is NAND flash memory due to its higher density and lower cost compared with NOR flash. NAND flash provides three different operations: read, write, and erase. Each operation requires a different operation time and granularity. Erase operations are performed at the granularity of a block, which is composed of multiple pages. A page is the granularity at which reads and writes are performed. In addition to its data area, a page contains a small spare Out-of-Band (OOB) area which is used for storing a variety of information including: (i) Error Correction Code (ECC) information used to check data correctness, (ii) the logical page number corresponding to the data stored in the data area, and (iii) page state. Each page on flash can be in one of three different states: (i) valid, (ii) invalid, and (iii) free/erased. When no data has been written to a page, it is in the free/erased state. A write can be done only to a free page and changes its state to valid. An erase operation on an entire block of pages is required to revert the pages back to the free/erased state. Out-of-place updates leave behind previously written pages that are no longer valid; these are called invalid pages. Table 1 shows comparisons for different flash types in terms of access time and data unit size [20].
The Flash Translation Layer. The FTL is mainly composed of three software components (address translation, garbage collector, and wear-leveler), but the FTL is generally thought of as the address translation layer. The address translation layer translates logical addresses from the file system into physical addresses on flash devices, which helps in emulating flash as a normal block device; the layer performs out-of-place updates, which in turn help to hide the erase operation in the flash memory. The mapping table is stored in a small, fast on-board SSD RAM.
The garbage collector is in charge of collecting invalid pages to create free space in the flash memory. Since the lifetime of flash memory is limited by the number of erase operations on its cells (each memory cell typically has a lifetime of 10K-1M erase operations [6]), the wear-leveler elongates the lifetime of flash by maintaining the same level of wear for every block in the flash memory.
Although many FTL schemes have been proposed [5], [10], [15], [9], [16], they share one fundamental design principle. Each scheme is a hybrid between page-level and block-level schemes. The schemes logically partition their blocks into two groups: Data Blocks and Log/Update Blocks. Data blocks form the majority and are mapped using a block-level mapping technique, whereas the log blocks are mapped using a page-level mapping technique.
SSD Simulator Design
We have designed and implemented an SSD simulator that is based on the hardware diagram in Figure 1. The first version of our SSD simulator focused on software components (for instance, FTL schemes, garbage collection, and wear-leveling); we considered a simplified hardware model that simulated a single Plane with a simplified channel implementation.
Since this version of our simulator was limited by a simplified hardware model and was not easy to extend due to a highly coupled implementation with DiskSim, we redesigned and re-implemented the simulator with an object-oriented approach. Our new simulator is entirely event-driven and written in a familiar language, C++; we achieve modularity, low coupling, and high cohesion. Our hardware-level diagram is shown in Figure 1.
Object-Oriented Component Design
The simulator was written as a single-threaded program in C++ for simplicity. C++ provides a comprehensible object-oriented scheme in which each class instance represents a hardware or software component. The UML diagram of all C++ classes used by the SSD simulator is in [12]. FlashSim is integrated with DiskSim's C code. The classes in the SSD simulator for hardware and software components are as follows:
Hardware Component Design
• SSD: The SSD class provides an interface to DiskSim and is the single class to instantiate in order to create the SSD simulator module. The SSD class creates event objects to wrap the DiskSim ioreq event structures and returns the event time to DiskSim.
• Package: The package class represents a group of flash dies that share a bus channel. The package class allocates its dies in its constructor and connects the dies to a bus channel. The package also facilitates addressing.
• Die: A die is a single flash chip organized into a set of planes. Dies are connected to bus channels, but individual planes contained in the die buffer bus transfers. In future development, the highest level at which merge operations may take place will be at the die level. The corresponding event object is updated with the merge delay time.
• Plane: Planes are comprised of blocks and provide a single page-sized register to buffer page data for bus transfers. The register is also used as a buffer for merge operations inside planes. The corresponding event object is updated with merge delays for merge operations and considers register delays.
• Block: A block is comprised of pages and is the smallest component that can be individually erased. When a block is erased, all pages in it are erased and can then be written to again. The corresponding event object is updated with the erase delay time. A block can only be erased a finite number of times because of reliability constraints [4] .
• Page: Each page maintains its state and updates event objects with the read and write delays of the given flash technology. Page states include free/empty after erasure, valid after a successful write, and invalid after being copied to a new location in a merge operation.
• Controller: The controller class receives event objects from the SSD and consults the FTL regarding how to handle each event. The controller sends the virtual data for events to the RAM for buffering before sending the event object to the bus.
• RAM: The RAM class calculates how long it takes to read or write data to itself. The RAM buffers virtual event data for the controller to send across the bus.
• Bus: The bus class has a number of channels that are each shared by all the dies in a package. The bus examines addresses in events and passes the event object on to the proper channel. 
Software Component Design
• Event: First, the event class keeps track of its corresponding DiskSim ioreq event structure. Second, the event class holds methods and attributes to do all the record-keeping for the SSD simulator's state, including SSD addresses. Simulator objects pass event class objects and update the event objects' statistics.
• Address: Addresses are comprised of a separate field for each hardware address level from the package down to the page. We provide an address class instead of a struct to help make a clear interface to assign and validate addresses.
• FTL: The FTL provides address translation from logical addresses to physical addresses. It determines how to process events that involve many pages by producing a list of single-page events to be processed in order by the controller. The FTL is responsible for taking advantage of hardware parallelism for performance. The FTL also has a wear leveler and garbage collector to facilitate its tasks.
• Wear Leveler: The wear leveler class helps spread the block erasures over all blocks in the SSD. The wear leveler is responsible for keeping as many blocks functional for as long as possible because blocks of pages can only be erased for reuse a finite number of times.
• Garbage Collector: The garbage collector is activated when a write request cannot be satisfied because the selected block is not writable or there is not enough free space in the selected block. The garbage collector seeks to merge partially-used blocks and free up blocks by erasing them. Any other algorithm for GC can also be simulated.
Bus Channel Interleaving
Figure 2 shows the interleaving of processing events for one bus channel. As per Figure 1, each bus channel connects to several flash dies that are grouped in a package. Each bus channel functions independently and in parallel; operations on different channels are not dependent on each other. The read interleaving for one bus channel is shown in Figure 2-(a). First, the control time signifies when the bus channel is locked for control signals that request a flash die to prepare data from a specific page. Next, the flash die processes the request for the data to be read. The bus channel is free to handle other requests at this time. Finally, the bus channel is locked for control signals that request the flash die to send data from a specific page and for sending the data. The interesting part of this figure is the bus channel idle time period between the end of the control time for request two (R2) and the beginning of the second control time period for request one (R1). A control time period for request three cannot fit; request three (R3) must be delayed until after request two finishes. The write interleaving for one bus channel is shown in Figure 2-(b). First, the bus channel must be locked for control signals to inform the proper flash die that it will receive data. Second, the bus remains locked to send the data. Finally, the flash die writes the data; the bus channel is free to handle other requests at this time. Since write requests only require one contiguous block of bus channel time, write requests are served in FIFO order.

Event Flow
The SSD simulator is instantiated as an SSD object designed to accept ioreq event structures from DiskSim. Its functionality is described in detail in Algorithm 1. The SSD controller uses the FTL software module to create a list of events for a multi-page request. The controller issues each event in the list to the data hardware through corresponding bus channels. The bus channels handle the scheduling and interleaving of events for the controller; this simplifies our controller implementation. In Algorithm 2, events continue through the package and are handled starting at the die level; merge events can be handled inside flash dies or planes. Erase events are handled inside blocks, and read and write events are handled inside pages. The SSD and package components are included in the call stack after consulting the bus channel because these components also keep track of wear statistics. Wear statistics stored in the SSD, package, die, plane, and block are updated every time an erase event occurs to keep a simple interface with lower algorithmic complexity for the FTL.
Experimental Results
We validated our simulator by comparing it to real SSDs for behavioral similarity; we compared the performance of different FTL schemes for realistic workload traces. We used the simplified version of the simulator that simulates a single Plane with a simplified channel implementation for various software implementations, such as the FTL, garbage collector, and wear-leveler. More thorough evaluation that also considers interleaving with parallelism effects is left for future work.
Evaluation Setup
The specifications available for commercial SSDs are insufficient for modeling them accurately. For example, the memory cache size for FTL mappings and the exact FTL scheme used are not disclosed. Hence, it is difficult to simulate these commercial devices. We made assumptions for flash devices as described in Table 2 and configured our simulator accordingly. 
Validation of SSD Simulator
Using the parameters from Table 2, we validated our flash device simulator against commercial SSDs (MTron's SSD [1] and Super-Talent's SSD [2]) for behavioral similarity. For this purpose, we sent raw I/O requests to real SSDs and similar traces to our flash device simulator to measure device performance. As shown in Figure 3, our simulator was able to capture the performance trends exhibited by the real SSDs. With increasing sequentiality of writes (Figure 3-(a)), the performance of real SSDs improved, and our flash simulator with various FTLs was able to provide similar characteristics. When examining reads (Figure 3-(b)), real SSDs showed much less variation; the same was observed with our simulator. With a high degree of randomness in writes (80% random in Figure 3-(c)), real SSDs demonstrated a long-tailed response time distribution (due to larger GC overhead); our simulator exhibited a similar trend.
Evaluation
We conducted a comparison of performance and energy consumption according to different FTL schemes, including a page-based FTL, FAST [15] , and DFTL [8] . We assumed the memory was just sufficient to hold the address translations for FAST. Since the actual memory size is not disclosed by device manufacturers, our estimate represents the minimum memory required for the functioning of a typical hybrid FTL. We allocated extra space (approximately 3% of the total active region [9] ) for use as log-buffers by the hybrid FTL (FAST).
Performance Analysis. The Cumulative Distribution Function of the average system response time for different workloads is shown in Figure 4. DFTL is able to closely match the performance of the page-based FTL for the Financial trace. In comparison with the page-based FTL, DFTL reduces the total number of block erases as well as the extra page read/write operations by about 3 times. This results in improved device service times and shorter queuing delays; this improvement in turn improves the overall I/O system response time by about 78% as compared to FAST. For read-oriented workloads, DFTL incurs a larger additional address translation overhead, and its performance deviates from the page-based FTL. When considering TPC-H (in Figure 4(b)), however, FAST exhibits a long tail primarily because of the expensive full merges and the consequent high latencies seen by requests in the I/O driver queue. Hence, even though FAST services about 95% of the requests faster, it suffers from long latencies in the remaining requests, resulting in a higher average system response time than DFTL.
Analysis of Energy Consumption.
Power consumption of the flash memory in the SSD may not be significant when compared to other components (CPU and Memory), but as shown in Table 2, erase operations consume significant power. Unlike individual read and write operations, erase operations have a greater impact on the overall SSD's energy consumption, and the number of erase operations for a given workload varies according to the FTL scheme in use. Figure 5 shows the energy consumption by operations for different FTL schemes in the Financial and TPC-H traces. The Financial trace is mostly random-write-dominant, while TPC-H is read-dominant (see Table 3). Thus, the energy consumption for the Financial trace is much higher than that for TPC-H due to the power consumed by GCs. DFTL requires additional page read and write operations due to mapping table entry misses in the memory, causing additional energy consumption in both traces. As expected, FAST FTL consumes significantly more energy than other FTL schemes due to more erase and write operations during GC. In addition to power consumption by flash operations, the processor power consumption can be considerably high during GC. GC involves a victim block search, which aims at finding the block with the fewest valid pages in order to reduce the page copying overhead. GC energy therefore has two parts: (i) the page copy operations, and (ii) the search operations, which induce energy consumption through processor and system bus usage. Thus, the energy consumption during GC can be reduced by balancing fewer search operations with a greater number of copy operations. Fewer search operations will slightly increase response time because an incomplete search may select blocks with more valid pages that must be copied.
On-board RAM is another considerable factor in the power consumption of the SSD. Since the page-based FTL requires more memory than the block-based FTL, the idle power consumption of the additional memory will be larger. FAST maintains block-level mapping for data regions and page-level mapping for log regions; the on-board RAM's energy consumption is close to that of the block-level FTL. DFTL requires the same memory as the block-level FTL; the idle power consumption is the same as that of the block-level FTL.
Related Work
Other research has been conducted to develop a simulator for NAND flash-based SSDs [3], [13]. Microsoft Research's simulator [3] is one of the first available SSD simulators; however, it is highly coupled with DiskSim. The strengths of their simulator include the implementation of parallelism effects across multiple channels and interleaving across different components within a single plane, but only a page-based FTL scheme is available. J. Lee et al. have developed a simple flash-based SSD simulator [13]. This simulator is a stand-alone simulator that is limited to a single FTL scheme implementation, and it does not simulate I/O queueing effects.
Compared to the above simulators, our simulator has the ability to simulate multiple FTL schemes, including page-based, block-based, FAST [15], and DFTL [8]. Our simulator is integrated with DiskSim to simulate queuing effects, and our simulator module can be instantiated multiple times within DiskSim. Our single-threaded, event-driven, object-oriented approach is comprehensible and modular to allow for future extensions. Furthermore, we have validated FlashSim against real SSD devices for behavioral similarity.
Summary and Future Work
We have developed a flexible and robust simulator for SSDs that features an object-oriented design. We have validated our simulator with real SSD devices by demonstrating behavioral similarity and compared performance results for various FTL schemes. We also have analyzed the impact of various FTL schemes on performance and power consumption in the SSD.
This project is a work in progress. Since the simulator has only been validated with a simple behavioral model for a single plane and simplified channel implementation, we will continue with more thorough validation methods that include bus channel interleaving effects. Caching and I/O scheduling effects will be added and examined. Since our simulator module can have multiple instances in DiskSim, we can simulate disk arrays that contain a combination of both SSDs and HDDs. In addition to performance simulation, our simulator is able to incorporate power models and other extensions. We plan to combine our thermal-performance simulator of disk drives [11] with our future work involving hybrid disk arrays.
Download
Source code is available for download from http://csl.cse.psu.edu/hybridstore.
