The usage of this PDF file must comply with the IEICE Provisions on Copyright. The author(s) can distribute this PDF file for research and educational (nonprofit) purposes only. Distribution by anyone other than the author(s) is prohibited. 
Introduction
NAND flash memory based SSDs (Solid State Drives) were introduced to overcome the limitations of HDDs (Hard Disk Drives), such as high power consumption, high noise level, low bandwidth, and low IOPS (Input/Output Operations per Second) [1], [2]. Although the bit density of flash memory devices has increased significantly by packing more cells per in² and storing more bits per cell (MLC [3], TLC, and QLC [4]), almost all other aspects of NAND flash memory devices are getting worse: the program time is getting longer, the endurance (number of erase and write cycles) is getting shorter, and the retention period (expected time to hold data reliably) is getting shorter [5]. In addition, there are other problems with SSDs, such as slow random write performance and high power consumption [6], [7]. The latency and throughput of a storage device depend heavily on how its internal firmware algorithm handles incoming I/Os, and this boils down to the issue of how to place data blocks on the storage media. In legacy HDDs, the firmware algorithms focus on reducing the head movement overhead in reading or writing data; to achieve this, cylinder serpentine [8], surface serpentine [9], and hybrid serpentine [10] methods have been proposed. At the time, one of the key concerns among practitioners was to understand how data blocks are laid out on the storage device. This information can be used to optimize the filesystem layout for a given storage medium [11].
The disk layout algorithm is closely linked to the mechanical characteristics of a given HDD model. Disclosing the disk layout algorithm of an HDD means revealing the manufacturer's critical competitive edge. Therefore, the disk layout algorithm is often the most hidden part of the device. Numerous efforts have been dedicated to shedding light on the data layout mechanism of storage devices [12]-[14]. For HDDs, these works exploit I/O latency to infer the seek time for a given sequence of I/O operations and subsequently to find out the data layout scheme (e.g., DIG [11]). A NAND flash storage device, such as an SSD for desktops and servers or an eMMC for smartphones, consists of multiple NAND flash chips. A write request from the host (e.g., 32 KByte) can be directed to a single chip or can be interleaved across multiple chips at a certain granularity, exploiting a certain type of parallelism. SSDs read data blocks from the flash chip in 4 KByte or larger units. SSD vendors are extremely reluctant to disclose the firmware algorithms of their SSDs, and the firmware is one of the most hidden parts of NAND based storage devices.
The existing works on reverse engineering of HDDs cannot be applied to SSDs because they exploit the seek time behavior of HDDs to understand their internal structure. This work aims at developing a method to derive the internal behavior of SSDs. In particular, we focus on finding the degree of internal parallelism and how it is exploited in laying out data. To this end, we exploit the power consumption behavior of SSDs. We discover hidden features of SSDs that can be used to improve performance. As a result, we develop a method, named I/O unit aligning, that improves performance and significantly reduces energy consumption. In addition, we present the Power Budget as a guideline to evaluate SSDs.
In our previous work [15], we discovered the page allocation scheme of Intel X-25M and the I/O unit of Samsung MXP, and introduced the concept of Power Budget in designing SSDs. In this paper, we delve into the characteristics of two additional SSDs, Samsung 840 and Toshiba Q Pro. While the previous work focused on measuring and analyzing the power consumption of write operations, this work also considers the read behavior of SSDs and proposes a formal characterization method to generalize the technique. This paper also introduces a mechanism to improve the performance of SSDs while minimizing the power consumption of the device. This work's contributions are as follows:
• (i) Formal characterization method to find the internal parallelism, I/O unit, and page allocation scheme of SSDs (Sect.
The remainder of this paper is organized as follows: Sect. 2 describes other works related to this study. The background, such as the SSD architecture, flash memory, channel and way, and FTL (Flash Translation Layer), is briefly discussed in Sect. 3. Section 4 presents three essential SSD characterization algorithms. The experiment environment, validation, and case studies on the four SSDs are presented in Sect. 5. Section 6 explains the benefits of aligning the I/O unit of SSDs. Section 7 explains Power Budget. Section 8 concludes the paper.
Related Work
Previous works exploit the seek time of the disk arm to determine the sector layout strategy in HDDs [11], [16], [17]. However, HDDs and SSDs have different physical structures. HDDs have various mechanical parts (platter, spindle, head, actuator arm, actuator axis, and actuator), while SSDs have only electronic parts. Therefore, these previous studies on HDDs cannot be applied to SSDs.
A number of works characterized the behavior of SSDs through mathematical modeling or via simulation [18]. Yang et al. [19] and Tao et al. [20] used simulation to determine the page size, the address allocation policies, and the parallelism. In particular, Tao et al. [20] examined the effect of write buffer size and page size on the overall performance of SSDs. Changing the write buffer from 4 MB to 32 MB did not have a significant impact on the throughput (MB/s). However, varying the page size from 1 KB to 4 KB resulted in noticeable changes in the average response time (msec) and the throughput (MB/s).
Simona et al. [21] used Flashsim [22] to identify the discrepancies between the potential performance of SSDs and the performance observed under real-world workloads (Microsoft Research (rsrch0, prxy0, src11, proj2), Harvard University (dea2, akadeasna2, and lair62b), and UMass Trace Repository (fin1, fin2)) [22]. They proposed an SSD performance prediction model, which shows that SSDs are not utilized to their maximum performance.
Some works exploited energy consumption behavior to understand the internal behavior of SSDs. Shin et al. used uFlip [23] to measure the power consumption of SSDs. They connected a resistor between the SSD and the power line and collected the power consumption through the resistor using Labview in order to analyze the page allocation scheme of SSDs. SSDs consume high energy in a short time while performing write and erase operations. Although the peak value of power consumption is important in the analysis, their method failed to capture it because of the low sampling resolution. Similarly, Seong et al. [24] collected the power consumption of real SSDs and HDDs and compared it with the power consumption of an in-house developed SSD. With a PCI-6259 (National Instruments' data acquisition board), Seong et al. measured the energy consumption of storage devices in µJ and used the results as a basis to compare the performance of SSDs. However, they did not provide enough proof that their method can be used to analyze the behavior of SSDs. Seo et al. [7] collected the power consumption of SSDs using an SM2040 (Signametrics PCI digital multimeter) and analyzed the power consumption patterns. They measured the energy consumption of SSDs while performing random/sequential read/write operations in various sizes. Although their experiment can be used to characterize SSDs, their focus was limited to the energy efficiency of SSDs under different workloads. Mohan et al. [25] used the data of Grupp et al. [26] to validate their power model. Grupp et al. measured the power consumption of SSDs while performing basic operations such as read, program, and erase using a high-resolution current probe (Agilent 1147A). Grupp et al. also measured and calculated the energy per operation in terms of peak power, average power, and idle power of flash memory. Grupp et al. collected the power consumption of SSDs with a high sampling rate so as not to lose the peak values. Therefore, unlike Shin et al., they captured all the peak values of power consumption. Also, Grupp et al. suggested Mango FTL to improve the responsiveness of SSDs and decrease their power consumption. Park et al. [27] classified the state of SSDs into four categories (active, idle, standby, and sleep) and measured the current (in amperes) for each state using an oscilloscope. However, they presented data only for in-house developed SSDs, which makes it difficult to apply their work to other studies or applications.
A common problem in most of these measurements is a low sampling rate (a sampling interval longer than 1 msec). To accurately capture the energy consumption behavior of SSDs, the sampling interval should be set smaller than the program time of a NAND flash, e.g., 900 µs [28]. In this paper, we propose SSD characterization techniques and algorithms to identify the internal parallelism, I/O unit, and page allocation scheme of SSDs, with a high sampling rate (a sampling interval of less than 2 µs). We applied these techniques and algorithms to four real devices and present the results as a case study.
Background

Flash Memory and SSD
The structure of a common flash memory package is shown in Fig. 1 [6]. A flash memory package contains two or more dies (chips), and each die can be selected individually to execute commands independently [29]. Typically, a die is composed of two or more planes, and in most flash memories [29], [30], each plane has a page register which is used as a buffer for read and write operations [22]. Page registers allow multiple planes to perform the same operation concurrently. This is called plane-level parallelism [31].
In modern SSDs, flash memory packages are organized into multiple groups, and each group is attached to a dedicated bus called a channel; the packages sharing a channel are referred to as ways. The flash packages that are attached to different channels can transfer data in parallel, and the packages that are attached to the ways of one channel can transfer data in an interleaved manner. Flash memory packages are thus connected in a multi-channel and multi-way fashion. The main components of a typical SSD include the host interface, internal buffer (DRAM or SRAM), SSD controller, flash memory controller, and flash memory. The host interface connects the SSD with the host via a standard interface (e.g., SATA or IDE). The internal buffer holds data or metadata which will be written to the flash memory, and stores the address table of the flash memory. The SSD controller manages the flash memory via the flash memory controller. The SSD controller executes firmware, such as the FTL, buffer management, ECC, etc. The FTL is mainly responsible for three tasks: address mapping [32]-[34], garbage collection [35], [36], and wear-leveling [37], [38]. The flash memory controller manages the flash memory packages in each channel. It has a cache register to cache data in channel operations. Flash memory packages in the same channel share the cache register. Read and write operations are performed in an interleaved manner [39]. Some SSDs have an ECC block for each channel [27].
Energy Consumption of SSDs
The peak energy consumption of SSDs provides important information for characterizing the internal structure of SSDs, which is closely related to an SSD's degree of parallelism. When data is programmed in parallel to a number of empty flash blocks, the throughput and energy consumption of the SSD increase. In contrast, when data is programmed to empty flash blocks serially, both the throughput and the energy consumption of the SSD are lower than in the parallel case. There is a trade-off between the peak energy consumption of SSDs and their throughput. To characterize the internals of SSDs, such as the page allocation algorithm and the number of channels and ways, we exploit the peak energy consumption of the device. Figure 2 illustrates the energy consumption in writing three pages. Writing a page to a NAND flash can be partitioned into three phases: i) sending a command to the command register (C), ii) sending data to the data register (D), and iii) programming the NAND page using the content in the data register (P).
There are three possible layouts when programming three pages to the flash memories: all pages are on the same die; the pages are spread across multiple dies on a channel; or the pages are spread across different dies in different channels. Figure 2 (a) shows the first case, in which three pages are written to the same flash die. The three pages are written sequentially; therefore, there can be only one programming activity at a time and this case takes the longest time. Figure 2 (b) shows the second case, in which three pages are written to different flash dies attached to the same channel. Phases C and D are serialized on the shared channel, while the flash dies can perform the programming concurrently. We call this way-parallelism. Figure 2 (c) shows the third case, in which three pages are written to flash dies in different channels. Phases C and D can be parallelized, and the latency of this case is the shortest. This is called channel-parallelism.
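The three layouts can be compared with a small timing model. The sketch below is only an illustration under simplifying assumptions: every page has the same C, D, and P times, channel switching overhead is ignored, and the timing values in the usage note are hypothetical, not measurements from any of the studied SSDs.

```python
def write_latency(n_pages, t_cmd, t_data, t_prog, layout):
    """Latency of writing n_pages under one of the three layouts.

    layout: "same_die", "same_channel" (way-parallelism), or
            "different_channels" (channel-parallelism).
    All times must use the same unit (e.g., microseconds).
    """
    transfer = t_cmd + t_data  # phases C and D occupy the channel
    if layout == "same_die":
        # One die: C, D, and P of every page run back to back.
        return n_pages * (transfer + t_prog)
    if layout == "same_channel":
        # Shared channel: C and D are serialized, programming overlaps,
        # so only the program time of the last page is exposed.
        return n_pages * transfer + t_prog
    if layout == "different_channels":
        # Independent channels: all pages proceed concurrently.
        return transfer + t_prog
    raise ValueError(f"unknown layout: {layout}")
```

With hypothetical values of 1 µs for C, 100 µs for D, and 900 µs for P, three pages take 3003 µs on the same die, 1203 µs on different dies of one channel, and 1001 µs on different channels, which reproduces the ordering described above.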
In one of the SSD models†, we observed some delays in switching channels [15]. We cautiously suspect that this is due to the hardware overhead of channel multiplexing. As shown in Fig. 2 (c), with a higher degree of parallelism, the total time for writing three pages decreases. When SSDs use multiple flash memory packages in parallel to increase the performance, the duration of the current peak decreases. However, the peak current increases for a brief moment in proportion to the number of flash memory packages that draw power. Conversely, when a small number of flash memory packages are used to perform an operation, the performance decreases and the duration of current usage increases, but the peak current decreases.
The number of packages used in performing read and write operations varies among SSDs. Using a large number of packages results in better performance, but at the expense of high power consumption.
SSD Characterization
In this paper, we develop a characterization method to find three features of SSDs. The first feature is the internal parallelism. An SSD has four levels of parallelism (channel, chip, die, and plane). We are especially interested in finding the die-level parallelism. Second, we find the read/write I/O unit size of SSDs, which is larger than a page. Third, we find the page allocation scheme of SSDs. This means finding the location of the flash memory package on which data is written.
While programming write-requested data, SSDs show various patterns of peak currents, which are closely related to the number of dies and the I/O unit size. In this paper, we apply our characterization method to four real SSDs to find these three features. We find the number of dies and the I/O unit size by analyzing the energy consumption patterns of SSDs while they perform read and write operations. This information, combined with the number of flash memory packages and the number of channels, enables us to infer the channel/way utilization and the page allocation scheme of SSDs.
Internal Parallelism and I/O Unit
In analyzing the energy consumption of SSDs, we focus on how the peak current changes when the write-requested data size increases. In a flash memory, a page is the smallest unit for reading and writing data. Page sizes differ (4 KB, 8 KB, or 16 KB) depending on the flash memory package. If a flash memory package has multiple flash memories (chips or dies), each die can perform page programming concurrently. Therefore, the number of dies equals the number of concurrent programming operations. A large write request is divided into units of page size × (no. of dies), and each unit is allocated to a package. An SSD that is performing a program operation has a higher peak current (active state peak current) than an SSD in the idle state. An SSD shows the minimum active state peak current ($m_{peak}$) when it is performing a page-size write, and shows the maximum active state peak current ($M_{peak}$) when all packages are activated. When an SSD uses one package with multiple dies to program page size × (no. of dies) of data, its peak current is the same as for a page-size write, because the dies perform programming in an interleaved manner. Therefore, as the write-requested data size is increased, the peak current of the SSD increases in steps from $m_{peak}$ to $M_{peak}$.
† Samsung SSD 840 PRO
Some SSD controllers read and write in units that are larger than a page. They read or write multiple pages simultaneously as a single operation to reduce the number of read/write operations. We define this unit as the I/O unit of SSDs. As explained above, the peak current of SSDs changes in steps. Generally, as the number of dies in a flash memory package increases, the tread depth of each step increases, and as the size of the I/O unit increases, the number of steps decreases. In Fig. 3 (a), each flash memory package has 1 die and the I/O unit is 1 page. If the write size is 1 page, the peak current is $m_{peak}$ because the I/O unit is 1 page. If the write size is 4 pages, the peak current is $M_{peak}$. Therefore, the number of steps is 4, and the tread depth of a step is 1 page. Figure 3 (b) shows the peak current for a flash memory package that has 2 dies. In this case, the peak current is $m_{peak}$ for both 1-page and 2-page writes because two dies are activated concurrently. If the write size is larger than 6 pages, the peak current is $M_{peak}$. Therefore, the number of steps is 4, and the tread depth of a step is 2 pages. Figure 3 (c) shows how the number of steps changes when the I/O unit changes from 1 page (Fig. 3 (a)) to 4 pages in an SSD with 1 die. Even when the write size is 1 page, the SSD reaches $M_{peak}$ because the I/O unit is 4 pages. When the write size is larger than 4 pages and smaller than 9 pages, two write operations are performed serially with no change in peak current. Therefore, there are no steps in this case. Figure 3 (d) shows how the number of steps changes when the I/O unit changes from 1 page (Fig. 3 (b)) to 4 pages in an SSD with 2 dies. The peak current is $m_{peak} \times 2$ for both 1-page and 4-page writes. When the write size is larger than 5 pages, the peak current increases to $M_{peak}$. Therefore, the number of steps is 2, and the tread depth of a step is 4 pages.
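The staircase behaviour of the four panels of Fig. 3 can be reproduced with a toy model. The sketch below is an illustration, not the authors' method. It assumes 4 flash memory packages, as the examples above imply, and that the peak current is proportional to the number of active packages (in units of $m_{peak}$). It further assumes that an I/O unit larger than the number of dies is spread over `io_unit // n_die` packages. The function and parameter names are hypothetical.

```python
import math

def active_packages(write_pages, n_package, n_die, io_unit):
    """Number of packages drawing program current for a given write size."""
    # Pages consumed before the peak rises to the next level.
    pages_per_step = max(io_unit, n_die)
    # Packages switched on together by one I/O unit.
    packages_per_step = max(1, io_unit // n_die)
    steps = math.ceil(write_pages / pages_per_step)
    return min(n_package, steps * packages_per_step)

def step_profile(n_package, n_die, io_unit, max_pages):
    """Peak level (in units of m_peak) for writes of 1..max_pages pages."""
    return [active_packages(p, n_package, n_die, io_unit)
            for p in range(1, max_pages + 1)]
```

For instance, `step_profile(4, 1, 1, 4)` yields four steps with a tread depth of 1 page, `step_profile(4, 2, 1, 8)` yields four steps with a tread depth of 2 pages, and `step_profile(4, 1, 4, 8)` stays at the maximum level for every write size, matching the no-step case.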
We can infer the number of dies in a flash memory package ($N_{die}$) from the tread depth of a step ($D_{step}$) and the number of steps ($N_{step}$), both obtained from our experiment, and the number of flash memory packages ($N_{package}$), which can be obtained from the vendor specifications. By multiplying $N_{step}$ and $D_{step}$, we get the number of pages that an SSD can program simultaneously. By dividing this value by $N_{package}$, we get the number of pages which can be programmed simultaneously to a package. This value is equal to $N_{die}$ (Eq. (1)).
$$ N_{die} = \frac{N_{step} \times D_{step}}{N_{package}} \tag{1} $$
Since the size of the I/O unit is fixed by the firmware of SSDs, inferring the size of the I/O unit is essential to performance optimization. We can infer the I/O unit from $N_{step}$, $N_{package}$, and $N_{die}$. The number of flash memory packages that belong to one step can be obtained by dividing $N_{package}$ by $N_{step}$. The number of dies that belong to one step can be obtained by multiplying this value by $N_{die}$ (Eq. (2)). If we substitute $N_{die}$ in Eq. (2) with Eq. (1), we can see that the I/O unit is identical to $D_{step}$.
$$ \text{I/O unit} = \frac{N_{package}}{N_{step}} \times N_{die} \tag{2} $$
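As a quick check of Eq. (1) and Eq. (2), the two relations can be evaluated directly. The sketch below is only an illustration, and the function name is hypothetical. The inputs in the usage note are inferred from the case study in Sect. 5: Toshiba Q Pro with 4 packages, a step every 4 pages, and 4 steps, and Intel X-25M with 20 packages, a step every page, and 20 steps.

```python
from fractions import Fraction

def infer_dies_and_io_unit(n_step, d_step, n_package):
    """Apply Eq. (1) and Eq. (2); d_step and the result are in pages."""
    # Eq. (1): pages programmable at once, divided over the packages.
    n_die = Fraction(n_step * d_step, n_package)
    # Eq. (2): packages per step times dies per package.
    io_unit = Fraction(n_package, n_step) * n_die
    return n_die, io_unit
```

`infer_dies_and_io_unit(4, 4, 4)` gives 4 dies and a 4-page I/O unit, and `infer_dies_and_io_unit(20, 1, 20)` gives 1 die and a 1-page I/O unit. In both cases the I/O unit equals $D_{step}$, as the substitution predicts. Exact fractions are used so that a non-integer result, which would signal inconsistent inputs, is not silently rounded.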
Through our experiment, we can obtain $N_{step}$ and $D_{step}$. We used a sequential write workload and started with a write size equal to the page size of the SSD. The write size was increased in multiples of the page size until the peak current of the SSD reached its $M_{peak}$, which was obtained before the experiment by performing a 1 MByte sequential write; a 1 MByte sequential write is large enough to map the I/O to all flash memory packages in all target SSDs. We measured the peak current value for each write size. The experiment terminated when the peak current reached $M_{peak}$. $N_{step}$ and $D_{step}$ can be found by plotting the results on a graph.
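Reading $N_{step}$ and $D_{step}$ off the plot can also be done programmatically. The following is a minimal sketch rather than the procedure used in the paper. It assumes one peak value per write size (1 page, 2 pages, and so on), that two peaks belong to the same step when they differ by at most a chosen tolerance, and that the sweep stops at the first size reaching $M_{peak}$, so the tread depth is taken from the first step. The function name and the sample values are hypothetical.

```python
def find_steps(peaks, tolerance):
    """Return (n_step, d_step) from peak currents indexed by write size.

    peaks[i] is the peak current for a write of (i + 1) pages.
    """
    levels = []  # each entry: [reference peak value, width in pages]
    for value in peaks:
        if levels and abs(value - levels[-1][0]) <= tolerance:
            levels[-1][1] += 1          # same step, one page wider
        else:
            levels.append([value, 1])   # the peak jumped: new step
    n_step = len(levels)
    # The last step may be cut short by the stop condition, so the
    # tread depth is read from the first step.
    d_step = levels[0][1]
    return n_step, d_step
```

For a hypothetical sweep `[300, 302, 410, 409, 520, 521, 630]` (in mA) with a tolerance of 20 mA, the function returns 4 steps with a tread depth of 2 pages.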
We already defined the I/O unit as $D_{step}$. However, this definition applies only to the write I/O unit, and a separate experiment is needed to determine the read I/O unit size. When a read request is issued by the host, the SSD controller looks up the physical location of the requested data. After locating the requested data, the flash memory controller reads one I/O unit of data from the flash memory into the cache register. After that, the SSD controller reduces the power supplied to the flash memory from the active level to the standby level, because the flash memory is not being used while the data is transferred from the cache register to the I/O buffer and from the I/O buffer to the host. Because the power state drops from the active level, the waveform of the current falls. Therefore, it is possible to find the read I/O unit size from the number of peak currents in the waveform.
Read and write I/O unit sizes refer to the actual read and write units sent to the flash memory by the SSD controller. Information on the I/O unit size is essential to improving the performance of SSDs. If the OS knows the I/O unit of the SSD, the interface bottleneck can be reduced by adjusting the basic I/O unit of the OS to the I/O unit of the SSD. Also, by changing random I/O to sequential I/O, fragmentation of SSDs can be reduced.
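Counting the current peaks of a read waveform can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: a peak is counted each time the smoothed current rises above a threshold chosen between the standby and active levels, and the read I/O unit is taken to be the candidate size for which every read of size s produces ceil(s / unit) peaks. The function names, the threshold, and the candidate list are hypothetical.

```python
import math

def count_bursts(samples, threshold):
    """Count rising crossings of the threshold in a current trace."""
    bursts = 0
    above = False
    for s in samples:
        if s > threshold and not above:
            bursts += 1  # the current just rose to the active level
        above = s > threshold
    return bursts

def infer_read_unit_kb(observations, candidates_kb):
    """observations: (read size in KB, number of peaks) pairs."""
    for unit in sorted(candidates_kb):
        if all(math.ceil(size / unit) == n for size, n in observations):
            return unit
    return None  # no candidate explains the observations
```

For example, the X-25M observations reported in Sect. 5 (one, two, three, and four repetitions for 4 KB, 8 KB, 12 KB, and 16 KB reads) are explained by a 4 KB unit: `infer_read_unit_kb([(4, 1), (8, 2), (12, 3), (16, 4)], [4, 8, 16])` returns 4.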
Page Allocation Scheme
In this section, we present a characterization method to find the page allocation scheme of SSDs. For this characterization, we need information from the vendor specifications, such as the number of channels ($N_{channel}$) and the number of packages ($N_{package}$). We also need experiment results, such as $D_{step}$, $N_{die}$, and the peak current duration. Our method consists of two steps. First, we find the physical structure of an SSD. For example, if an SSD has 8 channels and 16 packages, then 2 packages are connected to each channel and the number of dies in each package is $N_{die}$. Second, we infer the I/O allocation scheme. The SSD controller chooses target packages to which the write-requested data are allocated. If the write-requested data are allocated to multiple packages, the target packages fall into one of two cases:
Case 1: Target packages share a channel: When write-requested data are allocated to a package, the data is transferred to the register of the package through the channel, and this process occupies the channel until the transfer is complete. Therefore, when target packages share a channel, a delay occurs during the register transfer time. From this, we can conclude that if one channel is used to allocate the write-requested data, the performance of the SSD is reduced by the register transfer delay, but since the packages perform programming in an interleaved manner, the peaks in current are also reduced. In such a case, we use case 2.
Case 2: Target packages do not share a channel: In this case, after the first target package is selected, the next I/O is allocated to another channel. This incurs only the channel switching delay, which is less than the register transfer delay [15]. From this, we can conclude that when multiple channels are used to allocate I/Os, the performance of the SSD increases; since multiple packages operate concurrently, the peaks in current consumption are also higher than in case 1.
We can calculate the I/O duration of an SSD using the register transfer delay and the channel switching delay. By comparing the calculated values with the actual I/O duration obtained from an experiment, we can infer the SSD's page allocation scheme. We have to consider three cases in which write-requested data are allocated to multiple packages. We explain the three cases using the example in Fig. 4, which has 4 packages.
Figure 4 (a) shows the first case, in which four packages are connected in a 2-channel 2-way configuration. When data is received, package #0 of the first channel is selected and occupies the first channel, which incurs the register transfer delay. When package #0 of the second channel is selected and occupies the second channel, there is a channel switching delay and a register transfer delay. Package #1 of the first channel cannot be selected immediately because package #0 is occupying the first channel; it must wait before it can perform its register transfer operation. When package #1 of the second channel is selected, the channel switching delay occurs but no waiting is needed, because the operation on package #0 of the second channel is complete. Therefore, package #1 of the second channel can perform its register transfer without waiting. We call this the channel priority allocation method.
Figure 4 (c) shows the second case, in which 4 packages are also connected in a 2-channel 2-way configuration. Package #0 of the first channel occupies the channel, which incurs the register transfer delay. Package #1 of the first channel cannot be selected because package #0 of the first channel is occupying the channel; it must wait before it performs its register transfer. The channel switching delay occurs to select package #0 of the second channel, and the register transfer delay also occurs when this package occupies the channel. Package #1 of the second channel cannot be selected because package #0 of the second channel is occupying the channel; it must wait before it performs its register transfer operation. This case has the longest duration of the three cases, as shown in Fig. 4 (d). We call this the way priority allocation method.
Figure 4 (e) shows the third case, in which 4 packages are connected in a 4-channel 1-way configuration. In this case, there is no waiting time, unlike Fig. 4 (a) and Fig. 4 (c); there is only the channel switching delay. Whenever there is waiting, the duration of the I/O increases, which decreases the performance of the SSD. Also, since power is continuously supplied to the device even during the waiting period, waiting leads to power inefficiency. This case, which does not have waiting time, has the shortest duration of the three cases, as shown in Fig. 4 (f). We call this the channel only allocation method.
We can infer the page allocation scheme of SSDs with the following steps. First, calculate the duration of the three cases. Second, compare the calculated durations with the actual duration obtained from an experiment. The case whose duration is closest to that of the experiment is the page allocation scheme of the target SSD. Third, extend the page allocation scheme to other write request sizes. A detailed analysis with four real SSDs is presented in Sect. 5.
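The comparison of calculated and measured durations can be sketched with a simple event model. This is only an illustration under assumptions that go beyond the text: the controller issues packages in order, pays the channel switching delay whenever the next package sits on a different channel, and must wait while the target channel is still busy with a register transfer. Programming is assumed to start right after the transfer and to take a fixed time. All names and the delay values in the usage note are hypothetical.

```python
def io_duration(channels, t_reg, t_switch, t_prog):
    """Duration of one write; channels[i] is the channel of the i-th package."""
    channel_free = {}   # time at which each channel finishes its transfer
    t = 0.0             # time at which the controller can issue the next I/O
    prev = None
    finish = 0.0
    for ch in channels:
        if prev is not None and ch != prev:
            t += t_switch                        # channel switching delay
        start = max(t, channel_free.get(ch, 0.0))  # wait for a busy channel
        channel_free[ch] = start + t_reg         # register transfer delay
        finish = max(finish, start + t_reg + t_prog)
        t = start
        prev = ch
    return finish

# Issue order of the four packages in the three cases of Fig. 4.
SCHEMES = {
    "channel priority": [0, 1, 0, 1],
    "way priority": [0, 0, 1, 1],
    "channel only": [0, 1, 2, 3],
}

def closest_scheme(measured, t_reg, t_switch, t_prog):
    """Return the scheme whose calculated duration is nearest to measured."""
    durations = {name: io_duration(order, t_reg, t_switch, t_prog)
                 for name, order in SCHEMES.items()}
    best = min(durations, key=lambda name: abs(durations[name] - measured))
    return best, durations
```

With hypothetical delays of 100 µs for the register transfer, 30 µs for channel switching, and 900 µs for programming, the model gives 1130 µs, 1230 µs, and 1090 µs for the channel priority, way priority, and channel only methods, respectively, which preserves the ordering described above. A measured duration of 1225 µs would then be attributed to the way priority method.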
Case Study
In this paper, four SSDs are studied: Intel X-25M, Samsung MXP, Toshiba Q Pro, and Samsung 840. Intel X-25M has 10 channels; Samsung MXP has 8; Toshiba Q Pro has 4; and Samsung 840 has 8 channels. The size of the I/O buffer varies, as shown in Table 1. Toshiba Q Pro does not have a DRAM I/O buffer on its PCB. We conjecture that Toshiba Q Pro has internal DRAM in the controller chip. In this section, various features of SSDs, such as power consumption in the idle state, internal parallelism, I/O unit, and page allocation scheme, are described. Table 2 summarizes the results of our experiment. The workload for each SSD started at 4 KB and increased by 4 KB up to 160 KB with X-25M and up to 128 KB with MXP and Q Pro. As we increased the write size, the peak current increased. The interval at which the peak current increases varies among the SSDs. With X-25M, the peak current increased every time the size increased by 4 KB. MXP and Q Pro increase their peak current at every 32 KB increase in the write size.
Figure 5 shows the experiment environment, which consists of the host system, an oscilloscope, and the target SSD. A high-resolution current probe is used to measure the current of the SSD between the target and the host system. A Tektronix DPO3012 oscilloscope is used for data collection. In general, a flash memory's program time is shorter than 900 µs [28]. In this paper, the sampling interval of the oscilloscope is set to 2 µs for write operations and 4 µs for read operations to measure the current without losing the peak value. The host system, which generates the workload for the target SSD, has an Intel Dual Core2 2.9 GHz CPU and 4 GB of main memory. The host system ran Linux 2.6.39, and we opened the SSDs as raw devices to reduce the noise caused by the file system. We calculate the power (watt) by multiplying the current by the input voltage, which is 5 V, and the energy (joule) by multiplying the averaged power by the time.
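The power and energy calculation described above amounts to a few lines of arithmetic. The sketch below assumes that the current samples are in amperes and evenly spaced. The function name and the sample values in the usage note are hypothetical.

```python
def power_and_energy(current_a, sample_interval_s, voltage_v=5.0):
    """Return (average power in W, energy in J) for a current trace."""
    # Instantaneous power: current times the 5 V input voltage.
    power_w = [i * voltage_v for i in current_a]
    avg_power_w = sum(power_w) / len(power_w)
    # Energy: averaged power times the duration of the trace.
    duration_s = len(current_a) * sample_interval_s
    return avg_power_w, avg_power_w * duration_s
```

For a hypothetical trace of four samples (0.1 A, 0.2 A, 0.3 A, 0.2 A) taken at 2 µs intervals, the average power is 1 W and the energy is 8 µJ.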
SSD Initialization
As SSDs are used extensively, the number of invalid pages increases and the SSDs become "dirty". When a "dirty" SSD performs a workload, garbage collection is performed in the background. To prevent this background operation, which might interfere with the test, "dirty" SSDs should be initialized before running an experiment. In this paper, the secure erase technique [40] (disk purging commands specific to the ANSI ATA and SCSI disk interfaces, which are performed internally by the disk) is used to initialize SSDs.
In this paper, initialization refers to setting the target SSD to the right state before performing experiments. When SSDs are used without initialization, there is too much noise, which makes it difficult to separate small-sized I/Os from the noise. Some features of SSDs, such as read-ahead, look-ahead, and the write buffer, can interfere with performing the workload, and these are turned off before the experiment. Figure 6 (a) shows the power consumption of Intel X-25M when it is in the idle state. It shows 240 mA peaks at 50 msec intervals. Without removing these peaks, it is difficult to analyze the results. Figure 6 (b) shows the power consumption of Intel X-25M after a standby command was sent with hdparm [41] using the -y option. It shows that the 240 mA peaks are removed. In some SSDs, write request data is not written to the flash memory but is recorded only in an I/O buffer (DRAM or SRAM). In this case, the current value obtained in the experiment is not from the flash memory. Figure 7 shows the two energy consumption patterns of X-25M: Fig. 7 (a) shows the power consumption of the SSD when the write requests are written to the I/O buffer, and Fig. 7 (b) shows the power consumption when the requests are written to the flash memory. To prevent write request data from being recorded in an I/O buffer, we disabled the device's write buffer with a command (hdparm) before the experiment. The initialization commands are summarized in Table 3.
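Since Table 3 is not reproduced here, the following sketch shows one plausible way to script this initialization with hdparm. Only the -y (standby) option is taken from the text; the use of -W0, -A0, and -a0 to disable the write cache, the drive's read look-ahead, and the OS read-ahead is an assumption about how the features were turned off, and the device path is a placeholder. The secure erase step is deliberately left out because it destroys all data on the drive.

```python
import subprocess

def initialization_commands(device):
    """hdparm invocations assumed to match the initialization described."""
    return [
        ["hdparm", "-W0", device],  # assumption: disable the write cache
        ["hdparm", "-A0", device],  # assumption: disable drive read look-ahead
        ["hdparm", "-a0", device],  # assumption: set OS read-ahead to 0
        ["hdparm", "-y", device],   # from the text: send the standby command
    ]

def initialize(device, dry_run=True):
    """Print the commands, or run them (requires root) if dry_run is False."""
    for argv in initialization_commands(device):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Calling `initialize("/dev/sdX")` only prints the four commands, so the sequence can be reviewed before it is run against a real device.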
Data Sampling
In this paper, we collect electric current, which is the amount of charge flowing through a conductor per unit time. Magnetic fields or heat flow can interfere with current measurement, and noise was initially found in the oscilloscope readings. For efficient analysis, we applied a moving average to the collected data. Very high, short-lasting peaks observed in the raw data are lost after applying the moving average; this is not a problem because what we want to observe is the gap between the peaks and the pattern of power consumption, not the original peak values. Two window sizes were used in calculating the moving averages. First, a window size of 20 was used to find the internal parallelism of the SSD, where the exact peak values are important. Second, a window size of 100 was used to find the I/O unit, where the patterns of the results are more important than the exact peak values.
TOSHIBA Q Pro
Figure 8 shows the power consumption of Toshiba Q Pro performing write operations in sizes from 4 KB to 128 KB in increments of 4 KB. Since Toshiba does not disclose the page size of Q Pro, we used the method presented in Chen's paper [31] to find it. With this method, we found the page size of Q Pro to be 8 KB. Therefore, D step of Q Pro is 4 pages (32 KB). From Eq. (1) and Eq. (2), we obtain the following parameters: N die is 4, and the write I/O unit is 32 KB.
Figure 9 shows the power consumption of Q Pro performing read operations in various sizes, from 4 KB to 32 KB, in increments of 4 KB. The result shows that as the size of the read operation increases in increments of 8 KB, the power consumption pattern of an 8 KB read is repeated: twice with 12 KB, three times with 20 KB, and four times with 32 KB. From this result, we conclude that the read I/O unit of Q Pro is 8 KB.
Toshiba Q Pro has 4 channels and 4 flash memory packages, and N die is 4. We can infer its configuration as follows: Q Pro has 4 channels; each channel is connected to one flash memory package (one way per channel); and each package has 4 dies.
The write I/O unit of Q Pro is 4 pages (32 KB), which is allocated to one package (4 dies). An 8-page write request is allocated to 2 packages; a 12-page write request is allocated to 3 packages; and a 16-page write request is allocated to 4 packages. Therefore, the maximum size that can be programmed concurrently in Q Pro is 128 KB (4 × 4 × 8 KB). Since up to four pages are allocated to a single package (i.e., channel), we can conclude that Q Pro uses the way priority page allocation scheme.
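To make the smoothing step described under Data Sampling concrete, here is a minimal sketch of a moving average over a sampled current trace. The text gives only the window sizes (20 and 100); the choice of a simple unweighted average, the function name moving_average, and the synthetic trace are assumptions for illustration.

```python
def moving_average(samples, window):
    """Simple (unweighted) moving average; returns len(samples) - window + 1 values."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        return []
    running = sum(samples[:window])
    averaged = [running / window]
    for i in range(window, len(samples)):
        # Slide the window by one sample: add the newest, drop the oldest.
        running += samples[i] - samples[i - window]
        averaged.append(running / window)
    return averaged


if __name__ == "__main__":
    # Synthetic trace in mA: a flat 100 mA baseline with one short 650 mA spike.
    trace = [100] * 50 + [650] + [100] * 50
    print(max(moving_average(trace, 20)))   # spike is flattened more with ...
    print(max(moving_average(trace, 100)))  # ... the larger window
```

A single-sample spike is attenuated more strongly by the larger window, which is consistent with using the smaller window when peak values matter and the larger one when only the overall pattern matters.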
Intel X-25M
Figure 10 (a) shows the power consumption of Intel X-25M performing write operations in sizes from 4 KB to 80 KB, in increments of 4 KB. From Eq. (1) and Eq. (2) of Sect. 4.1, we obtain the following parameters: N die is 1, and the write I/O unit is 4 KB. The current duration of X-25M increased by 33 µs on average per 4 KB increase in write size until the write size reached 80 KB. When the write size increased from 80 KB to 84 KB, however, the duration increased by 1.4 ms, as shown in Fig. 10 (b).
Figure 11 shows the power consumption of X-25M performing read operations in various sizes, from 4 KB to 16 KB, in increments of 4 KB. The result shows that as the size of the read operation increases in increments of 4 KB, the power consumption pattern of a 4 KB read is repeated: twice with 8 KB, three times with 12 KB, and four times with 16 KB. From this result, we conclude that the read I/O unit of X-25M is 4 KB, which is the same as its write I/O unit. With a small I/O unit, each channel processes small I/Os, resulting in high performance. A disadvantage is high power consumption due to the larger number of channels in use.
Fig. 11 Power consumption of Intel X-25M for read operations
Intel X-25M has 10 channels and 20 flash memory packages, and N die is 1. From this, we can infer X-25M's flash memory configuration as follows: X-25M has 10 channels; each channel is connected to 2 flash memory packages (two ways per channel); and each package has one die. The maximum size that can be programmed concurrently is 80 KB (20 × 1 × 4 KB) because X-25M allocates one page (4 KB) to each die.
X-25M exhibits two types of delay. In our experiment, when we increased the write size by 4 KB, the duration of the peak current increased by 33 µs on average. When we increased the write size from 80 KB to 84 KB, the duration of the peak current increased by 1.4 ms. From this result, we can infer that X-25M allocates 84 KB in the following 3 steps.
The first 40 KB is allocated to package #0 in each channel, 4 KB per channel. Each package preempts its channel, which incurs a register transfer delay. This channel preemption time is 82 µs [30], which is the time needed to transfer 4 KB.
The second 40 KB is allocated to package #1 in each channel, 4 KB per channel. Selecting package #1 in each channel as the target requires 10 channel switches. Since the channel switching delay in X-25M is 33 µs, the 10 channel switching delays (33 µs × 10) take longer than the channel preemption time of the first step (82 µs). Therefore, the second 40 KB is allocated without waiting.
The last 4 KB is allocated to package #0 of channel #0. However, since the target package is busy with the 82 µs register transfer and the 900 µs page program, the last 4 KB must wait until the prior operations finish. The waiting time of the last 4 KB is 322 µs ((900 µs + 82 µs) − (33 µs × 20)). The experimental result was about 1 ms longer than our calculation. We conjecture that page programming is slower than the vendor specification or that there is an unknown delay between consecutive program operations. We conclude that X-25M uses the channel priority page allocation scheme.
SAMSUNG 840 has 8 channels and 8 flash memory packages, and N die is 1. The 840 has a very simple page allocation scheme, similar to that of X-25M. Write data is allocated to one page in each channel. In addition, the maximum size of concurrent writes is 32 KB. Figure 12 shows the power consumption of Samsung 840 performing write operations in sizes from 4 KB to 16 KB, in increments of 4 KB. The peak current increased by 50 mA per 4 KB increase in write size. Its maximum peak current was 650 mA, which was reached while performing a 32 KB write. The peak current value increased 8 times over the above-mentioned write operations. From the result, we obtain the following parameters: N step is 8, and D step is 1 page (4 KB). From Eq. (1) and Eq. (2) of Sect. 4.1, we obtain the following parameters: N die is 1, and the write I/O unit is 4 KB.
If only the write results are considered, the 840 gives results similar to those of X-25M. Figure 13 shows the power consumption of the 840 performing read operations of 4 KB and 512 KB. The result shows that the power consumption pattern does not change as the size of the read operation increases. Furthermore, the energy consumption of the read operation is far too high. For example, the energy consumption of a 4 KB read on Q Pro is 1.97 mJ, whereas on the 840 it is 6,325 J.
From the result, we conclude that the read I/O unit of Samsung 840 is larger than its write I/O unit. The specification of the 840 states that the device uses TLC NAND flash memory and has read and write performance of 530 MB/sec and 240 MB/sec, respectively. In contrast, the 840 Pro, which is based on the 840, uses MLC NAND and has read and write performance of 540 MB/sec and 520 MB/sec, respectively. One of the main reasons that the 840 has slower write performance while having similar read performance is its use of TLC NAND flash memory. Since a TLC device stores three bits in a cell, it has to increment the voltage more carefully to set the bits. One way to deal with this is to introduce more steps in Incremental Step Pulse Programming (ISPP) [42].
SAMSUNG MXP
Figure 14 shows the power consumption of Samsung MXP performing write operations in sizes from 4 KB to 128 KB, in increments of 4 KB. Using Eq. (1) and Eq. (2) of Sect. 4.1, we obtain the following parameters: N die is 2, and the write I/O unit is 32 KB.
We observed an unusual trait in MXP's power consumption. As shown in Fig. 14, the current of MXP does not return to the idle-state level after completing the write operations but is maintained at 80 mA for a while. The full graphs are omitted to save space, but the tail of this 80 mA state lasts about 5 seconds. Another noticeable point is MXP's very low power consumption in the idle state compared to Q Pro, X-25M, and 840. From these two points, the power management technique of MXP can be inferred as follows: MXP reduces idle-state power consumption by completely cutting the power supply to the flash memory when it is not handling an I/O. This may cause a standby delay in which MXP cannot respond immediately to incoming I/Os, resulting in low throughput. To avoid low throughput, MXP maintains a standby power condition of 80 mA after completing an I/O, so that it is ready for additional I/Os and can perform I/Os continuously without standby delay. We conclude that MXP reduces power consumption to the extreme in the idle state, when it is not handling any I/O, but once an operation is completed, it consumes more power than needed in order to remain ready for subsequent I/Os.
Samsung MXP has 8 channels and 16 flash memory packages, and N die is 2. The configuration of MXP can be estimated as follows: MXP has 8 channels; each channel is connected to 2 flash memory packages (two ways per channel); and each package has two dies. The maximum size that can be programmed concurrently is 128 KB (16 × 2 × 4 KB) because MXP allocates one page (4 KB) to each die.
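The maximum concurrently programmable size quoted for each device follows the same arithmetic: packages × dies per package × page size. The short sketch below tabulates it from the figures given in the text. The class and field names are hypothetical, and for Samsung 840 the product is simply checked against the 32 KB stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SsdConfig:
    name: str
    packages: int          # flash memory packages
    dies_per_package: int  # N die
    page_kb: int           # page size in KB

    def max_concurrent_program_kb(self):
        # One page is programmed per die, across all dies in all packages.
        return self.packages * self.dies_per_package * self.page_kb


CONFIGS = [
    SsdConfig("Toshiba Q Pro", packages=4, dies_per_package=4, page_kb=8),
    SsdConfig("Intel X-25M", packages=20, dies_per_package=1, page_kb=4),
    SsdConfig("Samsung 840", packages=8, dies_per_package=1, page_kb=4),
    SsdConfig("Samsung MXP", packages=16, dies_per_package=2, page_kb=4),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(f"{cfg.name}: {cfg.max_concurrent_program_kb()} KB")
```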
The write I/O unit of MXP is 8 pages (32 KB), which is allocated to 4 packages (8 dies). There are three ways to allocate 32 KB, depending on channel and way usage: (i) 2-channel 2-way with channel priority allocation, (ii) 2-channel 2-way with way priority allocation, and (iii) 4-channel 1-way with channel-only allocation.
When we wrote data of the write I/O unit size, the duration of the peak current was measured at 1.12 ms. Considering that other NAND flash memories released at about the same time as MXP take about 900 µs [30] to program 4 KB, the write speed of MXP is unexpectedly fast. Therefore, we assume that MXP allocates 4 KB to each of the 2 dies in a package and programs 8 KB concurrently by using the internal command [29]. Based on this assumption, we used 164 µs (82 µs × 2) for the register transfer delay, which is the time it takes to transfer 8 KB, and 900 µs for the page programming time, which is the time it takes to program 4 KB.
We consider three cases to find MXP's page allocation scheme: 2-channel 2-way with channel priority, 2-channel 2-way with way priority, and 4-channel 1-way with channel-only allocation. In the first case (2-channel 2-way with channel priority allocation), when data is received, package #0 of the first channel is selected and the first channel is preempted, which incurs a register transfer delay. When package #0 of the second channel is selected and the second channel is preempted, a channel switching delay and a register transfer delay occur. Package #1 of the first channel cannot be selected right away because package #0 has preempted the first channel. Package #1 of the first channel must wait 98 µs (164 µs − (33 µs × 2)) before it performs its register transfer. When package #1 of the second channel is selected, a channel switching delay occurs, but there is no waiting for the channel because the preemption by package #0 of the second channel has already finished by then (98 µs + (33 µs × 2)). Therefore, package #1 of the second channel can perform its register transfer without waiting. This case takes 1,261 µs in total.
In the second case (2-channel 2-way with way priority allocation), package #0 of the first channel preempts the channel, which incurs a register transfer delay. Package #1 of the first channel cannot be selected because package #0 of the first channel has preempted the channel. Package #1 of the first channel must wait 131 µs (164 µs − 33 µs) before it performs its register transfer. A channel switching delay occurs to select package #0 of the second channel, and a register transfer delay also occurs when that package preempts the channel. Package #1 of the second channel cannot be selected because package #0 of the second channel has preempted the channel. Package #1 of the second channel must wait 131 µs (164 µs − 33 µs) before it performs its register transfer. This case takes 1,425 µs in total.
In the third case (4-channel 1-way with channel-only allocation), there is no waiting time, unlike the first and second cases. There is only the channel switching delay. Whenever there is waiting, it increases the duration of the I/O, which decreases the performance of the SSD. Also, since power is continuously supplied to the device even during the waiting time, waiting leads to power inefficiency. This case, which has no waiting time, takes 1,163 µs. This value is the closest to the experimental result, 1.120 ms. Therefore, we conclude that MXP uses 4-channel 1-way allocation for I/O unit sized data.
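The three totals above can be reproduced with a small timing model. The sketch below is one possible reading of the description, not the authors' tool: it assumes that the next target is selected one channel switching delay (33 µs) after the previous target starts its register transfer, that a transfer waits until both its channel and its package are free, and that page programming (900 µs) follows the transfer. The function name write_duration_us and the (channel, package) encoding are hypothetical.

```python
def write_duration_us(targets, transfer_us, program_us, switch_us=33):
    """Return (total duration in microseconds, per-target waiting times).

    targets: ordered list of (channel, package) pairs, one per dispatched chunk.
    Assumed model: each later target is selected switch_us after the previous
    target began its register transfer; a transfer needs a free channel and a
    free package; programming follows the transfer immediately.
    """
    channel_free = {}
    package_free = {}
    prev_start = 0
    finish = 0
    waits = []
    for i, (channel, package) in enumerate(targets):
        dispatch = 0 if i == 0 else prev_start + switch_us
        start = max(dispatch,
                    channel_free.get(channel, 0),
                    package_free.get((channel, package), 0))
        waits.append(start - dispatch)
        transfer_end = start + transfer_us
        channel_free[channel] = transfer_end
        package_free[(channel, package)] = transfer_end + program_us
        finish = max(finish, transfer_end + program_us)
        prev_start = start
    return finish, waits


# MXP, 32 KB write: 164 us per 8 KB register transfer, 900 us page program.
CHANNEL_PRIORITY = [(0, 0), (1, 0), (0, 1), (1, 1)]
WAY_PRIORITY = [(0, 0), (0, 1), (1, 0), (1, 1)]
CHANNEL_ONLY = [(0, 0), (1, 0), (2, 0), (3, 0)]

# X-25M, 84 KB write: package #0 then package #1 on 10 channels, then one more page.
X25M_84KB = [(c, 0) for c in range(10)] + [(c, 1) for c in range(10)] + [(0, 0)]

if __name__ == "__main__":
    for name, plan in [("channel priority", CHANNEL_PRIORITY),
                       ("way priority", WAY_PRIORITY),
                       ("channel only", CHANNEL_ONLY)]:
        print(name, write_duration_us(plan, 164, 900)[0], "us")
    print("X-25M last-page wait", write_duration_us(X25M_84KB, 82, 900)[1][-1], "us")
```

Under these assumptions the model also reproduces the 322 µs wait derived for the last 4 KB of the 84 KB write on X-25M. It does not account for the additional delay of about 1 ms observed in that experiment.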
Another unique feature of MXP is its read-modify-write operation. Figure 16 shows the power consumption of Samsung MXP while performing a 4 KB write operation and a 32 KB read operation. The waveforms for the two operations are almost identical for the first 0.2 ms. The duration of this identical waveform decreases as the write size increases, until the write size reaches 28 KB. For a 32 KB write, there is no waveform at the beginning that is identical to that of the read operation. However, a 36 KB write operation shows a beginning waveform similar to that of a 4 KB write. From this result, we can estimate that MXP always writes to flash memory in 32 KB units. If the requested write size is smaller than 32 KB, MXP reads data from the write address to make up 32 KB, places the I/O in the I/O buffer, and processes it.
Fig. 16 Proof of read-modify-write
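The inferred read-modify-write behavior can be summarized as a small calculation: the device rounds each write up to a multiple of 32 KB and reads the remainder from flash. The sketch below assumes that the write starts at an I/O unit boundary, which the text does not state. The function name rmw_plan is hypothetical.

```python
IO_UNIT_KB = 32  # write I/O unit inferred for MXP


def rmw_plan(write_kb, unit_kb=IO_UNIT_KB):
    """Return (KB read to fill the unit, KB actually programmed).

    Assumes the write begins on an I/O unit boundary.
    """
    units = -(-write_kb // unit_kb)  # ceiling division
    programmed_kb = units * unit_kb
    read_fill_kb = programmed_kb - write_kb
    return read_fill_kb, programmed_kb


if __name__ == "__main__":
    for size_kb in (4, 28, 32, 36):
        fill, programmed = rmw_plan(size_kb)
        print(f"{size_kb} KB write: read {fill} KB, program {programmed} KB")
```

The read portion shrinks from 28 KB to 4 KB as the write grows from 4 KB to 28 KB, vanishes at 32 KB, and returns to 28 KB at 36 KB, which matches the waveform observations described above.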
Breakdown of SSD Energy Consumption
We use data from various cases to build a formal characterization method. Although the experimental results are clear, the characterization method itself needs verification to be reliable. The I/O unit discovered with our characterization method is an important factor that can affect the performance and energy consumption of SSDs. We therefore verify the I/O unit with experiments that exploit it. We expect that SSDs can improve performance and reduce energy consumption when I/O requests match the I/O unit.
We can consider two cases when performing a large write operation on an SSD that has an I/O unit: first, performing direct I/O with the page size; and second, performing direct I/O with the I/O unit size. Figure 17 (a) shows the first case. The SSD extends the write data from the page size to the I/O unit size in the write cache. The additional data is filled with data read from the flash memory. Therefore, the SSD writes I/O unit sized data to the flash memory whenever a page size write is performed. Figure 17 (b) shows the second case. The SSD writes the I/O unit sized data to the write cache. There is no additional read operation or data extension. Therefore, the SSD consumes only the energy of the I/O unit sized write operation. This case consumes less energy than the first case. As a result, if we align the record size with the I/O unit of the SSD when writing a large file, we can maximize performance and minimize energy consumption.
We conducted a validation experiment for the I/O unit as follows. We used IOzone as the workload generator and measured performance and energy consumption. The experiment targets were two I/O unit SSDs (Q Pro, MXP) and two non-I/O-unit SSDs (840, X-25M). The workload was as follows: a file size of 128 MB, a record size increased from 4 KB to 64 KB by doubling, direct I/O, and sequential write. We expected performance to improve greatly and energy consumption to decrease significantly with increasing record size in the I/O unit SSDs (Q Pro, MXP). Figure 18 shows the performance (MB/sec) and energy consumption (kilojoules) of the four SSDs. Table 4 shows the ratio of performance and energy consumption when the record size increased from 4 KB to the I/O unit size. Figure 18 (a) and Fig. 18 (b) show the results of the I/O unit cases, and Fig. 18 (c) and Fig. 18 (d) show the results of the non-I/O-unit cases. The I/O unit size of both Q Pro and MXP is 32 KB.
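Since the measurements in this paper are current traces, energy consumption has to be derived from them. The sketch below shows one way this could be done, by integrating the sampled current over time and multiplying by the supply voltage. The 5 V supply, the fixed sampling interval, the rectangle-rule integration, and the function name energy_joules are assumptions for illustration; the text does not specify how energy was computed.

```python
def energy_joules(current_ma, sample_interval_s, supply_voltage_v=5.0):
    """Estimate energy as V * sum(I * dt) over a uniformly sampled current trace.

    current_ma: current samples in milliamperes.
    sample_interval_s: time between samples in seconds (assumed constant).
    supply_voltage_v: assumed supply voltage (5 V is a placeholder value).
    """
    charge_coulombs = sum(current_ma) / 1000.0 * sample_interval_s
    return supply_voltage_v * charge_coulombs


if __name__ == "__main__":
    # Synthetic trace: a constant 200 mA sampled every 1 ms for one second.
    trace = [200] * 1000
    print(energy_joules(trace, 0.001))
```

With a constant 200 mA drawn for one second at 5 V, the estimate is approximately 1 J, which serves as a sanity check of the units.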
The results show that the I/O performance of the SSDs with an I/O unit increased by 162% on average, whereas that of the non-I/O-unit SSDs increased by 50% on average. The energy consumption of the I/O unit SSDs was reduced by 78% on average, whereas that of the non-I/O-unit SSDs was reduced by only 35% on average. From these results, it can be said that Toshiba Q Pro has an I/O unit.
In general purpose systems, aligning I/O requests to the I/O unit size of an SSD is not possible because the internal configurations of SSDs are proprietary to the manufacturers, who are not willing to disclose the information. The method proposed in this paper provides a technique to expose one aspect of the internal configuration of an SSD, the I/O unit size, which can be exploited in RAID storage systems. Typically, the stripe size of a RAID is carefully determined after thorough analysis of the given workload and I/O characteristics. Although the process of optimizing the RAID is tedious, the fact that it is a one-time effort relieves management and deployment overhead. Our method of finding the I/O unit size also needs to be performed only once. As shown in Fig. 18, the proposed method not only increases performance by aligning the stripe size to the I/O unit size of the SSD but also decreases the overall operating cost of storage systems.
Power Budget
In previous work, we warned against excessive use of the channels and ways of SSDs [15]. As related works show, the peak current increases excessively when program operations are performed concurrently on too many NAND chips. Excessive peak current can cause supply voltage drop, ground bounce, signal noise, black-out, etc., which can lead to unreliable SSD operation [43]. Therefore, we propose a metric called Power Budget, which specifies the maximum tolerable peak current for an SSD's operations.
The previous version of Power Budget served only to prevent excessive simultaneous use of parallel resources. However, it is possible to achieve better performance and lower energy consumption while using fewer parallel resources by exploiting the I/O unit alignment described in Sect. 6. Therefore, stricter criteria can be applied in the Power Budget. Figure 19 shows the new Power Budget. The x-axis is the number of ways, and the y-axis is the number of channels. The new Power Budget proposes not only the use of a balanced level of parallelism but also the use of an appropriate I/O unit size.
Conclusion
This paper presents an SSD characterization algorithm that infers characteristics of SSDs that are not disclosed by the vendors, such as internal parallelism, I/O unit, and page allocation scheme, by measuring current with an oscilloscope and a high-resolution current probe.
The characterization algorithm was applied to four real SSDs. We found the internal parallelism, which is the number of dies in a flash memory package, and the I/O unit, which is a read/write unit larger than the page size. From these two characteristics of an SSD, its page allocation scheme is inferred.
Internal parallelism, I/O unit, and page allocation scheme are characteristics of SSDs that are not made public by the vendors. Yet, they affect the I/O performance of SSDs, which is the biggest competitive factor in SSDs. If the OS knows these characteristics, I/O performance can be improved by file system tuning. In addition, vendors will be able to devote more effort to developing more energy-efficient SSDs.
Currently, it is possible to implement SSDs whose performance is close to the limit of interface performance by placing large-sized I/O buffers and using heavy internal parallelism; however, this implementation causes a significant level of power consumption. The required design direction of SSDs is a balance between achieving I/O performance improvement and energy efficiency.
