Solid-state drives (SSDs) based on a next-generation memory technology have recently been released on the market. In this work, we call them low-latency SSDs because their device latency is an order of magnitude lower than that of conventional NAND flash SSDs. Although low-latency SSDs can drastically reduce the I/O latency perceived by an application, the very low device latency makes the overhead of OS processing included in the I/O latency noticeable. Since OS processing is executed on a CPU core, the operating frequency of that core should be maximized to reduce the OS overhead. However, a higher core frequency increases the CPU power consumption during I/O accesses to low-latency SSDs. Therefore, we propose the device utilization-aware DVFS (DU-DVFS) technique, which periodically monitors the utilization of a target block device and applies dynamic voltage and frequency scaling (DVFS) to CPU cores executing I/O-intensive processes only when the block device is fully utilized. In this case, DU-DVFS can reduce the CPU power consumption without hurting performance because the delay of OS processing incurred by decreasing the core frequency can be hidden. Our evaluation with 28 I/O-intensive workloads on a real server containing an Intel® Optane™ SSD demonstrates that DU-DVFS reduces the CPU power consumption by 41.4% on average (up to 53.8%) with a negligible performance degradation, compared to a standard DVFS governor on Linux. Moreover, the evaluation with multiprogrammed workloads composed of I/O-intensive and non-I/O-intensive programs shows that DU-DVFS is also effective for them because it can apply DVFS only to CPU cores executing I/O-intensive processes.
Introduction
Low-latency SSDs based on a next-generation memory technology have been on the market since 2017. They have a higher write throughput, higher write endurance, and lower latency compared to conventional NAND flash SSDs [1]. In particular, their device latency is about ten microseconds, which is an order of magnitude lower than that of recent NAND flash SSDs. This very low device latency has significantly reduced the I/O latency perceived by an application because the device latency used to account for a large proportion of it. Instead, the time required for OS processing has become a primary overhead in the I/O latency. Although many optimization techniques have been implemented to reduce this OS overhead, it is still non-trivial for low-latency SSDs [2].
The OS processing required for each I/O access is executed on a CPU core whose operating speed strongly depends on its operating frequency. Therefore, we first investigate the correlation between the core frequency and the I/O latency using a low-latency SSD. The result shows that the I/O latency increases as the core frequency decreases; thus, the core frequency should be maximized to reduce the I/O latency. Moreover, as the core frequency is automatically controlled by governors of the intel_pstate driver [3] on Linux, we investigate how they control the core frequency during I/O accesses to the low-latency SSD. We observe that they reduce the I/O latency by automatically maximizing the core frequency in response to a high core utilization. However, we also observe that raising the core frequency increases the CPU power consumption during I/O wait times, because I/O wait times for the low-latency SSD are not long enough for a CPU core to enter deep sleep states. Ideally, we could reduce this wasteful CPU power consumption without increasing the I/O latency by applying DVFS to CPU cores only during I/O wait times. However, this approach is currently impractical: an I/O wait time for low-latency SSDs is about ten microseconds, whereas existing CPUs take tens of microseconds to change the core frequency [4], [5]. Therefore, we propose the device utilization-aware DVFS (DU-DVFS) technique in order to reduce the CPU power consumption without hurting performance. It periodically monitors the utilization of a target block device and identifies CPU cores executing I/O-intensive processes. It then applies DVFS to them only when the block device is fully utilized. In this case, as I/O requests issued from each CPU core are temporarily accumulated in an I/O queue, the delay of OS processing incurred by decreasing the core frequency can be hidden by the queueing time.
The main contributions of this work are as follows.*
• We show the importance of controlling the core frequency for low-latency SSDs in terms of the I/O latency and the CPU power consumption.
• We propose the DU-DVFS technique to reduce the CPU power consumption during I/O accesses to low-latency SSDs and implement it as a user-level runtime system.
• We evaluate it with 28 I/O-intensive workloads on a real server containing an Intel® Optane™ SSD [1]. The result demonstrates that DU-DVFS reduces the CPU power consumption by 41.4% on average (up to 53.8%) with a negligible performance degradation, compared to the powersave governor in the intel_pstate driver.
• We conduct a sensitivity analysis with respect to how often and how aggressively DU-DVFS decreases the core frequency.
• We also evaluate the proposed technique with multiprogrammed workloads composed of I/O-intensive and non-I/O-intensive programs and verify its effectiveness for them.
* An earlier version of this paper was published in the Proceedings of the 7th IEEE Non-Volatile Memory Systems and Application Symposium (NVMSA 2018) [6]. In this paper, we improve the algorithm of the proposed technique to enable per-core DVFS and evaluate it with multiprogrammed workloads.
Copyright © 2019 The Institute of Electronics, Information and Communication Engineers
The rest of this paper is organized as follows. We explain our research background in Sect. 2 and argue the importance of the core frequency for low-latency SSDs in Sect. 3. Then, we describe the DU-DVFS technique in Sect. 4 and evaluate it in Sect. 5. In Sect. 6, we introduce related work. Finally, we conclude this work in Sect. 7.
Background
In this section, we explain low-latency SSDs by comparing the specification of a product used in this work with that of a conventional NAND flash SSD. We then describe a new problem caused by low-latency SSDs.
Low-Latency SSDs
Commercial low-latency SSDs based on a next-generation memory technology called 3D XPoint™ have been on the market since 2017. Table 1 compares the specification of a low-latency SSD (Intel® Optane™ SSD 900P [1]) with that of a recent NAND flash SSD (Intel® SSD 750 [7]). Note that both belong to the same category, called enthusiast SSDs, and that we show their recommended customer prices as of September 19th, 2018. This table shows that the capacity and price per gigabyte of the low-latency SSD are still inferior to those of the NAND flash SSD. On the other hand, the low-latency SSD has a higher write throughput and lower read/write latency. In particular, its read latency is ten microseconds, which is an order of magnitude lower than that of the NAND flash SSD.
Noticeable OS Overhead for Low-Latency SSDs
For conventional block storage devices such as hard disks and NAND flash SSDs, the overhead of OS processing required for each I/O access is negligible because of their long device latency. For example, the OS overhead is less than 5% of the overall I/O latency for a recent NVMe NAND flash SSD. In contrast, it reaches about 40% for a low-latency SSD due to its very low device latency [2]. Therefore, reducing the OS overhead is a major challenge in fully exploiting the potential of low-latency SSDs. To reduce the OS overhead, many optimization techniques have been proposed and implemented [2], [9]. In particular, polling I/O is a promising approach and has been implemented in the NVMe driver since Linux kernel 4.4. It uses a CPU core during an I/O access to continuously check for the completion of an I/O command. Therefore, compared to conventional interrupt request (IRQ)-based I/O, it can eliminate context switches and the processing of interrupt handlers from the I/O path, at the cost of a higher CPU utilization [10], [11]. However, the OS overhead is still non-trivial for low-latency SSDs even if polling I/O is applied.
Importance of Core Frequency for Low-Latency SSDs
In this section, we argue the importance of controlling the core frequency during I/O accesses to low-latency SSDs. We first show the correlation between the core frequency and an I/O latency and then investigate how the core frequency is controlled by a conventional DVFS technique during I/O accesses. Moreover, we discuss an ideal DVFS technique and its problem.
Correlation between Core Frequency and I/O Latency
As explained in Sect. 2.2, the time of OS processing accounts for a large proportion of an overall I/O latency for low-latency SSDs. Since OS processing is executed on a CPU core whose operating speed strongly depends on its operating frequency, we first investigate the correlation between the core frequency and an I/O latency using the low-latency SSD summarized in 
Conventional DVFS Technique
On recent Linux kernels, the intel_pstate driver is available to control the P-state (i.e., the voltage and frequency operating point) of each CPU core [3]. For simplicity, we use the term "core frequency" to refer to the P-state of a CPU core in this paper, which means that the voltage changes together with the core frequency. The intel_pstate driver implements two types of governors: performance, which maximizes the frequency of each active core, and powersave, which adaptively adjusts the frequency of each active core in response to its utilization. Note that the powersave governor in the intel_pstate driver is similar to the ondemand governor in the previous cpufreq driver. In this section, we investigate how these governors control the core frequency and affect the CPU power consumption during I/O accesses to the Optane SSD 900P. The experimental methodology is similar to that in the previous section. Figure 2 shows the evaluation results of the two governors in addition to the minfreq execution, which manually minimizes the core frequency, with IRQ-based I/O and polling I/O. For reference, this figure also includes the results with a SAS hard disk and a recent NVMe NAND flash SSD (Intel® SSD DC P3700 [13]) with IRQ-based I/O. In the case of the hard disk, a CPU core enters deep sleep states such as the C3 and C6 states during I/O accesses because I/O wait times for the hard disk are very long (about two milliseconds). Consequently, both governors select a low frequency (1.4 GHz) and keep the CPU power consumption low (16 W). Since the OS overhead is negligible in this case, the I/O latency is not increased even if the core frequency is low.
With the NAND flash SSD, the active ratio of the CPU core is increased to 14% because its device latency is much lower (about 100 microseconds) than that of the hard disk. Moreover, we can see that the CPU core only enters the lightest C1 sleep state during I/O wait times. This result means that I/O wait times for recent NVMe NAND flash SSDs are not long enough for a CPU core to enter deeper sleep states. Although the C1 sleep state stops the internal clock of the core, it does not shut off the voltage and internal cache memory [14] . Therefore, the performance governor that selects the maximum frequency (3.6 GHz) causes the high CPU power consumption (40 W). In contrast, the powersave governor decreases the core frequency to 1.9 GHz based on the low core utilization and reduces the CPU power consumption to 26 W. Since the time of OS processing is still much shorter than the device latency of the NAND flash SSD, an I/O latency is not increased significantly by decreasing the core frequency.
Finally, in the case of the Optane SSD whose device latency is very low (about ten microseconds), the active ratio of the CPU core is increased to 34% with IRQ-based I/O. Moreover, the active ratio becomes 98% with polling I/O because it continuously utilizes the CPU core even during I/O accesses. In these cases, both the governors select almost the maximum frequency and increase the CPU power consumption to over 40 W. This high CPU power consumption is wasteful during I/O wait times because the CPU core does not execute meaningful instructions for an application.
The results of the minfreq execution show that manually decreasing the core frequency can reduce the wasteful CPU power consumption but incurs a significant increase in an I/O latency for the low-latency SSD.
Ideal DVFS Technique for Low-Latency SSDs
On the basis of the above observations, we find that the wasteful CPU power consumption could be reduced without increasing the I/O latency by using the ideal DVFS technique shown in Fig. 3. This figure illustrates a situation where a workload running on a CPU core continuously issues synchronous I/O requests to a low-latency SSD. In this situation, the core frequency should be maximized while the core executes software processing in order to avoid increasing the I/O latency. In contrast, it should be minimized during I/O wait times to reduce the CPU power consumption. However, this ideal technique needs to decrease the core frequency only during each I/O wait time, which is about ten microseconds for low-latency SSDs. Unfortunately, this technique is currently impractical because existing CPUs take tens of microseconds to change the core frequency [4], [5]. Decreasing the core frequency at the beginning of each I/O wait time may therefore delay the software processing that follows.
DU-DVFS: Device Utilization-Aware DVFS
Instead of the ideal DVFS technique shown in Fig. 3 , we propose the device utilization-aware DVFS (DU-DVFS) technique. In this section, we first explain its concept and a metric called device utilization. After that, we describe the algorithm and implementation of this technique.
Concept
The objective of the DU-DVFS technique is to reduce the CPU power consumption without hurting performance during I/O accesses to a low-latency SSD. It identifies CPU cores executing I/O-intensive processes and decreases their operating frequencies only when the delay of OS processing caused by the decreased core frequency is not likely to increase an I/O latency. In order to detect this situation, DU-DVFS periodically monitors a metric called device utilization.
Device Utilization
The device utilization is a metric used in the iostat tool, which monitors I/O statistics of block devices on Linux [15]. It is defined as the percentage of a time unit during which a block device has been active. Figure 4 illustrates this concept. For example, if a block device is always active during a time unit, the device utilization is 100%. In order to measure the active time of a target block device, the iostat tool uses a value called io_ticks, which can be obtained from the /sys/block/<dev>/stat file on Linux. Recent device drivers implement an I/O queue per CPU core to temporarily store I/O requests issued from each core, and the io_ticks value counts the total time during which I/O requests have been queued [16]. As I/O requests issued from CPU cores are accumulated in I/O queues while a block device is active processing previous requests, this value corresponds to the total time the device has been active. Therefore, the device utilization is calculated from this equation:
\[ \text{device utilization} = \frac{\mathit{io\_ticks}_{i+1} - \mathit{io\_ticks}_{i}}{\mathit{interval}} \times 100 \qquad (1) \]
where $\mathit{io\_ticks}_{i}$ and $\mathit{io\_ticks}_{i+1}$ are two consecutive samples of io_ticks taken one interval apart. Note that the io_ticks value and the interval are both in milliseconds. The DU-DVFS technique uses the device utilization to decide whether or not to decrease the core frequencies. Figure 5 illustrates how the I/O latency is influenced by decreasing the core frequency in two different situations. When a target block device is idle, I/O requests issued from a CPU core are immediately transferred to the block device without being accumulated in an I/O queue. Therefore, the delay of OS processing incurred by the decreased core frequency directly increases the I/O latency. On the other hand, when the block device is active, I/O requests issued from a CPU core are temporarily queued. In this case, the I/O latency is not increased by decreasing the core frequency because the delay of OS processing is hidden by the queueing time. Thus, we can decrease the core frequency without hurting performance when the device utilization is almost 100% (i.e., while a target block device continues to be active). Table 2 verifies the above intuition by showing the device utilization and throughput of the Optane SSD 900P during 8 KB and 64 KB random reads. Note that we use request sizes larger than 4 KB and run the fio benchmark with two jobs on two cores in order to make the device utilization almost 100% at the maximum frequency (3.6 GHz). This table shows that when the request size is 8 KB, the device utilization is lowered from 98.9% to 73.3% by decreasing the core frequencies from 3.6 GHz to 1.2 GHz. In this case, the throughput is degraded by 32% because the delay of OS processing increases the I/O latency while the Optane SSD is idle. In contrast, when the request size is 64 KB, the device utilization remains almost 100% even at 1.2 GHz. Since the delay of OS processing can be hidden in this case, the throughput is not degraded by decreasing the core frequencies.
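The device utilization defined above can be computed from two samples of the stat file. The following is a minimal sketch in Python, not the authors' implementation: the function names and the two sample lines are invented for illustration, and the position of io_ticks (the tenth whitespace-separated field) follows the field order given in the Linux block-layer stat documentation.

```python
def parse_io_ticks(stat_line: str) -> int:
    """Return io_ticks (in ms) from one line of /sys/block/<dev>/stat."""
    fields = stat_line.split()
    # Field order: 0-3 read stats, 4-7 write stats, 8 in_flight,
    # 9 io_ticks, 10 time_in_queue (newer kernels append more fields).
    return int(fields[9])


def device_utilization(prev_ticks: int, curr_ticks: int, interval_ms: float) -> float:
    """Percentage of the interval during which the device was active."""
    return (curr_ticks - prev_ticks) * 100.0 / interval_ms


# Two hypothetical samples taken 100 ms apart (values invented for illustration).
sample_a = "1000 0 8000 120 500 0 4000 60 2 5000 180"
sample_b = "1900 0 15200 228 500 0 4000 60 3 5099 279"

util = device_utilization(parse_io_ticks(sample_a), parse_io_ticks(sample_b), 100)
print(f"device utilization: {util:.1f}%")
```

In a real monitor, the two lines would be read from the stat file of the target device before and after sleeping for the interval.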
Algorithm
The DU-DVFS technique periodically monitors the device utilization and, based on it, adaptively controls the frequencies of CPU cores executing I/O-intensive processes. In order to identify those cores, DU-DVFS uses the num_queued value of each core, which counts the total number of I/O requests that have been stored in each I/O queue. It is recorded by an NVMe device driver and can be obtained from the /sys/block/<dev>/mq/<coreID>/queued file on Linux. DU-DVFS avoids hurting the performance of non-I/O-intensive processes by setting the frequencies of cores executing them to the maximum. It requires four parameters as inputs: a target block device (dev), an interval in milliseconds to monitor the device utilization (interval), a threshold of the device utilization to decrease the core frequencies (dec_thr), and a threshold of the number of I/O requests queued per second to identify cores executing I/O-intensive processes (queued_thr).
Algorithm 1 Algorithm of the DU-DVFS technique
Algorithm 1 shows the pseudo code of the DU-DVFS technique. It first gets the number of available cores on a CPU and sets the frequencies of all cores to the maximum. It then stores the io_ticks value in the prev_io_ticks variable and the num_queued values of all cores in the prev_num_queued array (lines 1 to 4). After sleeping for the interval, it stores the io_ticks value in the curr_io_ticks variable and the num_queued values of all cores in the curr_num_queued array. The device utilization is then calculated from Eq. (1) (lines 6 to 9). If the device utilization exceeds the dec_thr threshold, the number of I/O requests queued per second is calculated for each core. If this value exceeds the queued_thr threshold, DU-DVFS regards the corresponding core as one executing an I/O-intensive process and decreases its frequency by one step (lines 10 to 16). If the device utilization is less than or equal to the dec_thr threshold, the frequencies of all cores are immediately set to the maximum (lines 17 and 18). Finally, the prev_io_ticks variable and prev_num_queued array are updated with the current values, and the above procedure is repeated (lines 20 and 22).
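The decision made in each monitoring iteration can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: the function names, the 100 MHz step, and the example frequency range are hypothetical, reading sysfs and writing the frequency to hardware are left out, and resetting non-I/O-intensive cores to the maximum while the device is busy is an interpretation of the description above.

```python
from typing import Dict


def queued_rate(prev: int, curr: int, interval_ms: float) -> float:
    """I/O requests queued per second between two num_queued samples."""
    return (curr - prev) * 1000.0 / interval_ms


def du_dvfs_step(
    util: float,
    queued_per_sec: Dict[int, float],
    freqs_mhz: Dict[int, int],
    dec_thr: float,
    queued_thr: float,
    f_min: int,
    f_max: int,
    step: int = 100,
) -> Dict[int, int]:
    """One monitoring iteration: return the new per-core frequencies in MHz."""
    if util <= dec_thr:
        # Device not fully utilized: restore every core to the maximum.
        return {core: f_max for core in freqs_mhz}
    new_freqs = {}
    for core, freq in freqs_mhz.items():
        if queued_per_sec.get(core, 0.0) > queued_thr:
            # I/O-intensive core: lower its frequency by one step.
            new_freqs[core] = max(f_min, freq - step)
        else:
            # Assumption: other cores are kept at the maximum.
            new_freqs[core] = f_max
    return new_freqs


# Hypothetical two-core example: core 0 replays I/O, core 1 is CPU-bound.
freqs = {0: 2300, 1: 2300}
rates = {0: queued_rate(0, 5000, 100), 1: queued_rate(0, 3, 100)}

busy = du_dvfs_step(99.5, rates, freqs, dec_thr=99.0, queued_thr=1000,
                    f_min=1200, f_max=2300)
idle = du_dvfs_step(80.0, rates, busy, dec_thr=99.0, queued_thr=1000,
                    f_min=1200, f_max=2300)
print(busy)
print(idle)
```

Keeping the decision logic free of I/O makes it easy to check against recorded utilization traces before attaching it to real sysfs readers and a frequency-setting backend.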
Implementation
We implement the DU-DVFS technique as a user-level runtime system that runs in the background. It controls the frequency of each core by writing a specific value to each of the corresponding model-specific registers (e.g., the IA32_PERF_CTL MSRs for Intel CPUs [17]). In this work, we set the queued_thr threshold to 1000. This is because the I/O-intensive workloads used in our evaluation issue thousands of I/O requests per second, while a non-I/O-intensive workload issues only tens of I/O requests per second. We can adjust how often and how aggressively DU-DVFS controls the core frequencies with the interval and dec_thr parameters. We conduct a sensitivity analysis of these parameters in Sect. 5.3.
Evaluation
In this section, we evaluate the DU-DVFS technique with I/O-intensive workloads and multiprogrammed workloads by comparing it with the powersave governor and the minfreq execution. We first explain our experimental setup and then show the evaluation results.
Experimental Setup
To evaluate the DU-DVFS technique, we use 15 physical cores of an 18-core Xeon E5-2697 v4 processor, 1.5 TB DRAM, and the Optane SSD 900P installed in a PRIMERGY RX2540 M2 server. This server contains two CPU sockets, and thus the total number of available physical cores is 36 (=18×2). However, we enable only 15 out of 18 cores on each CPU socket in order to restrict the total number of cores to 30. This is a tuning method to avoid contention for an I/O queue among multiple cores [18]. As the Optane SSD contains 31 hardware I/O queues, we can assign a single I/O queue to each of the 30 cores. In addition, we only use a single CPU socket in order to obtain stable results. Note that Hyper-Threading Technology is disabled. The operating frequency of each core can be changed from 1.2 GHz to 2.3 GHz in steps of 0.1 GHz, and the Turbo Boost technology can increase it up to 3.6 GHz with one active core and up to 2.8 GHz with 15 active cores. The CPU power consumption is measured with the turbostat tool on Linux.
We use the MSR Cambridge I/O Traces [19], [20] as I/O-intensive workloads by replaying them with the blkreplay tool [21]. They are commonly used to evaluate block storage devices in prior work [22]-[25]. The blkreplay tool was developed for the performance validation of block storage devices in data centers and is built with a synchronous multithreaded I/O engine [26]. We disable its write-verify mode and write-protection mode, as in prior work [26], and enable its direct I/O option to bypass the Linux page cache. In order to put a high load on the Optane SSD 900P, we replay each of the I/O traces at one million times the original speed. As the traces include I/O statistics for seven days (= 604,800 seconds), their timestamps can be ignored at this speed. Note that we omit traces that finish in a short time (within 3 seconds) and run the prxy_0 workload with 8 threads because its execution time increases significantly with more than 8 threads. In addition, we select the bodytrack benchmark from the PARSEC benchmark suite [27] as a non-I/O-intensive workload. It is CPU-intensive and issues only tens of I/O requests per second. We use the Linux kernel 4.4.117 and apply polling I/O.
Evaluation Results with I/O-intensive Workloads
For the evaluation with I/O-intensive workloads, we replay each of the 28 I/O traces with 15 threads (equal to the number of available cores) in order to fully stress the Optane SSD 900P. Figure 6 shows the evaluation results of the minfreq execution, which manually minimizes the core frequencies, and the DU-DVFS technique. The left graph plots the execution time of each workload along with the average device utilization throughout each execution. The center and right graphs plot the average CPU power consumption and the energy-delay² product (EDDP) of each workload throughout each execution in descending order, respectively. Each of the three metrics is normalized to that with the powersave governor, which automatically maximizes the core frequencies. The execution time with the powersave governor corresponds to that with the ideal DVFS method described in Sect. 3.3, because OS processing is executed at the maximum core frequencies with the powersave governor. Note that the interval and dec_thr parameters are set to 100 milliseconds and 99%, respectively.
Fig. 7 Runtime results of the proj_3 workload.
In the left graph, we can see that the minfreq execution increases the execution time more significantly for workloads exhibiting a lower device utilization. This is because the delay of OS processing incurred by minimizing the core frequencies cannot be hidden when the device utilization is below 100%. In the worst case, the minfreq execution increases the execution time by 67.3%. In contrast, DU-DVFS keeps the device utilization at nearly 100% and achieves performance comparable to the powersave governor for all workloads. It increases the execution time by at most 4.4%. On the other hand, the center graph shows that DU-DVFS reduces the CPU power consumption by 20.6% to 53.8%, while the minfreq execution does so by approximately 60% for all workloads. Consequently, we can see in the right graph that the minfreq execution achieves a lower EDDP than DU-DVFS for 21 workloads because of its larger reductions in the CPU power consumption. However, DU-DVFS achieves a lower EDDP than the minfreq execution for the other seven workloads, whose performance is significantly degraded by the minfreq execution. Even in the worst case, DU-DVFS reduces the EDDP by 9.4%, while the minfreq execution causes a 97.1% increase. In terms of geometric means across the 28 workloads, the minfreq execution reduces the CPU power consumption by 59.1% and the EDDP by 43.5% with an 11.3% execution time increase. In contrast, DU-DVFS reduces the CPU power consumption by 41.4% and the EDDP by 39.9% with only a 0.8% execution time increase.
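The relation between the three reported metrics can be checked with a short calculation. The sketch below assumes that energy is the average CPU power multiplied by the execution time, so that the energy-delay² product is proportional to power × time³; the function name is hypothetical. Because geometric means are multiplicative, the same relation should hold for the reported geometric means, and the results come out within a few tenths of a percentage point of the EDDP reductions stated above.

```python
def normalized_eddp(power_ratio: float, time_ratio: float) -> float:
    """Energy-delay^2 product relative to a baseline.

    Assumes energy = average power x execution time, so E * D^2 = P * T^3.
    """
    return power_ratio * time_ratio ** 3


# Geometric-mean ratios reported in the text (baseline: powersave governor).
minfreq = normalized_eddp(1 - 0.591, 1 + 0.113)
du_dvfs = normalized_eddp(1 - 0.414, 1 + 0.008)

print(f"minfreq EDDP reduction: {(1 - minfreq) * 100:.1f}%")
print(f"DU-DVFS EDDP reduction: {(1 - du_dvfs) * 100:.1f}%")
```

The cubic dependence on execution time explains why the minfreq execution, despite its larger power savings, loses to DU-DVFS on the workloads where it slows execution the most.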
In addition, we demonstrate how DU-DVFS controls the core frequencies for three representative workloads by plotting the average frequency across all cores at runtime in Figs. 7-9. Figure 7 first shows the result of the proj_3 workload, which is the worst case for the minfreq execution. The two graphs plot the device utilization and core frequencies at runtime, respectively. The top graph shows that the minfreq execution causes the device utilization to drop below 100% for a long time by decreasing the core frequencies to the minimum. In this case, DU-DVFS selects relatively higher frequencies to keep the device utilization at 100%. Second, Fig. 8 shows the result of the mds_1 workload, for which the device utilization remains almost 100% even with the minfreq execution. Therefore, DU-DVFS also aggressively decreases the core frequencies to the minimum. Finally, Fig. 9 shows the result of the proj_4 workload, which exhibits variable behavior. With the minfreq execution, the device utilization remains 100% in the middle of the execution but drops below 100% at the beginning and end of the execution. Thus, DU-DVFS adaptively controls the core frequencies while keeping the device utilization at 100% to avoid hurting performance.
Sensitivity Analysis
In this section, we first conduct a sensitivity analysis of the interval parameter, with which we can adjust how often DU-DVFS controls the core frequencies. Figure 10 shows the analysis results when the interval parameter is varied from 1000 milliseconds to one millisecond. This figure shows that DU-DVFS achieves performance comparable to the powersave governor with an interval of 1000 or 100 milliseconds but increases the execution time with an interval of ten or one milliseconds. On the other hand, it reduces the CPU power consumption more significantly with a shorter interval. To analyze the reason, we plot in Fig. 11 the average core frequency across all cores selected by DU-DVFS during the execution of the proj_4 workload for each value of the interval parameter. This figure shows that DU-DVFS decreases the core frequency gradually and conservatively with an interval of 1000 milliseconds (blue line). In this case, it does not hurt performance but misses some opportunities to reduce the CPU power consumption. In contrast, it decreases the core frequency more quickly with an interval of 100 milliseconds (orange line), leading to a larger CPU power reduction. However, if the interval is set to ten or one milliseconds (green and red lines), the core frequency is decreased immediately and too drastically every time the device utilization becomes 100%. Thus, DU-DVFS hurts performance while the device utilization fluctuates (at the beginning and end of the execution). On the basis of the above observations, we conclude that the interval parameter should be set to 100 milliseconds in order to reduce the CPU power consumption as much as possible without hurting performance.
Next, we conduct a sensitivity analysis of the dec_thr parameter, with which we can adjust how aggressively DU-DVFS decreases the core frequency. Figure 12 plots the analysis results in the same manner as Fig. 10 and shows that, with a lower threshold, DU-DVFS reduces the CPU power consumption more significantly at the cost of a larger performance degradation. This result means that we can achieve a larger power reduction by decreasing this threshold if a larger performance degradation is permitted.
Evaluation Results with Multiprogrammed Workloads
In this section, we evaluate the DU-DVFS technique with multiprogrammed workloads. We construct 28 multiprogrammed workloads by combining the non-I/O-intensive bodytrack benchmark with each of the 28 I/O-intensive traces replayed using the blkreplay tool. Whichever program finishes first is executed repeatedly until the other one finishes once. We report the average execution time of the multiple executions for the former and the execution time of the single execution for the latter. Figure 13 shows geometric means across the 28 workloads when the thread count of the blkreplay tool is set to 12, 8, and 4. Note that we adjust the thread count of the bodytrack benchmark so that the total number of threads remains 15.
In Fig. 13 (a), which shows the results with the 12-threaded blkreplay tool, we can see that the minfreq execution increases the execution time of the blkreplay tool by 11.0% and that of the bodytrack benchmark by 127.9%. Since the bodytrack benchmark is CPU-intensive, the impact of minimizing the core frequency on its execution time is very large. On the other hand, the minfreq execution reduces the CPU power consumption by 59.2%. In contrast, DU-DVFS does not increase the execution time of either program because it identifies the CPU cores executing the blkreplay tool and only decreases their frequencies while keeping the device utilization at 100%. As a result, it reduces the CPU power consumption by 18.6%. However, this power reduction is much smaller than the result with I/O-intensive workloads shown in Fig. 6 (41.4% on average). This is because the CPU used in this work controls the supply voltage to cores at the chip level. Our additional experiment shows that decreasing the frequency of a single active core from the maximum to the minimum reduces the CPU power consumption by only 3 W if any other cores are running at the maximum frequency. On the other hand, doing so reduces the CPU power consumption by about 19 W if all of the other cores are also running at the minimum frequency. For our multiprogrammed workloads, DU-DVFS does not decrease the frequencies of the cores executing the bodytrack benchmark. If we used a CPU that can control the supply voltage per core, the power reduction by DU-DVFS would be larger even for the multiprogrammed workloads.
In Figs. 13 (b) and 13 (c), we can observe that the CPU power reduction by DU-DVFS gets smaller as the thread count of the blkreplay tool is decreased (7.9% with 8 threads and 3.6% with 4 threads), because DU-DVFS decreases the frequencies of fewer cores. However, we would like to emphasize that DU-DVFS achieves performance comparable to the powersave governor in every case. This result verifies that it reduces the CPU power consumption whenever possible without hurting performance.
Related Work
Since DVFS is a well-known approach to optimizing the energy efficiency of CPUs, it is widely used in prior work. However, DVFS techniques that take storage devices into consideration are very limited, as described below.
Lee et al. applied DVFS to hardware components embedded in flash-based storage devices such as a microprocessor and a flash controller [28] . Their heuristic algorithm adjusts the voltage and frequency level so that background maintenance jobs such as garbage collection and wear-leveling can be processed within a given idle interval. In contrast, the target of our proposed technique is host CPUs that issue I/O requests to low-latency SSDs.
Ge et al. presented a parallel I/O middleware that combines runtime I/O access interception and DVFS for parallel computer systems [29]. Manousakis et al. proposed a feedback-driven DVFS technique that tracks the power consumption of CPUs, DRAM, and storage devices for I/O-intensive workloads [30]. Mills et al. applied DVFS during I/O-intensive checkpoint and restart operations on high-performance computing (HPC) systems [31]. The common concept of these studies is to reduce the system-level power or energy consumption by applying DVFS to CPUs during I/O-intensive phases where the full computation capability of CPUs is not required. As these studies assumed conventional hard disks or NAND flash SSDs, decreasing the core frequencies did not significantly hurt performance due to their long device latency. However, simply decreasing the core frequencies causes a significant increase in the I/O latency for low-latency SSDs, as shown in Fig. 1.
Saito et al. also proposed a profile-based technique that applies DVFS and controls I/O processes to reduce the energy consumption of I/O-intensive checkpoint and restart operations on HPC systems [32]. They used PCIe-attached flash memory devices that leave garbage collection and wear-leveling to CPU cores. Therefore, the core frequencies directly affected the I/O latency. In our work, we reveal that the core frequencies have a large impact on the I/O latency for low-latency SSDs. Moreover, while their technique trades off power consumption against performance to minimize the energy consumption, our proposed technique can reduce the CPU power consumption without hurting performance.
Conclusions
In this work, we first investigate the correlation between the core frequency and the I/O latency for low-latency SSDs and reveal that the core frequency should be maximized to reduce the I/O latency. However, we also observe that a CPU wastefully consumes high power during I/O wait times in this case. Therefore, we propose the device utilization-aware DVFS technique, which decreases the frequencies of CPU cores executing I/O-intensive processes only when a target storage device is fully utilized. Our evaluation using an Intel Optane SSD and 28 I/O-intensive workloads demonstrates that it reduces the CPU power consumption by 41.4% with only a 0.8% performance degradation on average compared to the Linux powersave governor. Moreover, we show that it is also effective for multiprogrammed workloads composed of I/O-intensive and non-I/O-intensive programs.
