Abstract-Emerging data-intensive applications create non-uniform CPU and I/O workloads, which requires power management strategies to consider both CPU and I/O effects. Current approaches focus on scaling down the CPU frequency based on the CPU busy/idle ratio without taking I/O into consideration, and therefore do not fully exploit the opportunities for power conservation. In this paper, we propose a novel power management scheme called model-free, adaptive, rule-based (MAR) for multiprocessor systems, which minimizes CPU power consumption subject to performance constraints. By introducing a new I/O wait status, MAR is able to accurately describe the relationship between core frequencies, performance and power consumption. Moreover, we adopt a model-free control method to filter out the I/O wait status from the traditional CPU busy/idle model in order to achieve fast responsiveness to bursty situations and take full advantage of power saving. Our extensive experiments on a physical testbed demonstrate that, for SPEC benchmarks and data-intensive (TPC-C) benchmarks, an MAR prototype system achieves 95.8-97.8 percent accuracy of the ideal power saving strategy calculated offline. Compared with baseline solutions, MAR is able to save 12.3-16.1 percent more power while maintaining a comparable performance loss of about 0.78-1.08 percent. In addition, further simulation results indicate that our design achieves 3.35-14.2 percent more power saving efficiency and 4.2-10.7 percent less performance loss under various CMP configurations as compared with various baseline approaches such as LAST, Relax, PID and MPC.

INTRODUCTION
MULTICORE processors, or CMPs, have become mainstream in the current processor market because of the tremendous improvement in transistor density and the advancement in semiconductor technology. At the same time, the limitations in instruction level parallelism (ILP), coupled with power dissipation restrictions, encourage us to enter the "CMP" era for both high performance and power savings [1], [2]. However, many crucial application domains still demand growth in single-thread (core) performance [3]. Even without considering the above factors, the increasing number of transistors on a single die or chip leads to a super-linear growth in power consumption [4]. Thus, how to balance system performance and power saving is a critical issue that needs to be solved effectively.
In recent years, many power management strategies have been proposed for CMP processors based on DVFS [5], [6], [7]. Dynamic voltage and frequency scaling (DVFS) is a highly effective technique used to reduce power dissipation. Previous research [8], [9], [10], [11] has successfully balanced CPU power consumption and overall performance by using chip-wide DVFS. However, this coarse-grained management cannot be directly applied to CMP chips where per-core DVFS is available for more aggressive power saving. Several recently proposed algorithms [12], [13] are based on open-loop search or optimization strategies, assuming that the power consumption of a CMP at different DVFS levels can be estimated accurately. This assumption may result in severe performance degradation or even power constraint violation when the workloads vary significantly from the ones used to perform the estimations. There are also some closed-loop solutions based on feedback control theory [14], [15], which obtain power savings by using detected statistics of each core. The most widely adopted CPU statistics are CPU busy/idle times [16]. This busy/idle time approach, referred to as the B-I model, works well for non-I/O-intensive applications. Unfortunately, data-intensive applications, which are the main focus of this paper, are mainly I/O bound. They often exhibit unpredictable I/O randomness and thus make the CPU utilization status hard to track. Consequently, the B-I model might not work well in this circumstance.
It may be noted that I/O operations can actually affect both the processor's performance and its power consumption, especially when we deal with data-intensive applications. More specifically, by taking I/O factors into consideration, we are able to make the best use of CPU slack periods to save power more effectively and more efficiently. One example is when the CPU is waiting for I/O operations to complete. Through our experiments, we found several facts that can be utilized to conserve more power for a CMP processor. First, each core's waiting time for an I/O operation to complete can be separated from its idle or working status. Second, appropriately scaling down the core's frequency during its I/O wait can provide more opportunities to save power without sacrificing overall system performance. Third, the core frequency does not directly affect the I/O wait, which means no relationship exists between these two factors. According to the above-mentioned observations, we develop our power management solution for data-intensive applications in the following specific ways: 1) considering each core's I/O wait status besides its working and idle status; 2) accurately quantifying each status for accurate power-saving decisions; 3) precisely describing the relationship between frequency, performance and power consumption when the I/O wait factor is considered. When we integrate the I/O wait into the power control system design, one challenge is that the CPU workload cannot be modeled because of the I/O randomness, which mainly results from the diversity of I/O patterns and the internal characteristics of storage systems. In our experiments, we found that even with a comprehensive control policy, such as MPC, it is impossible to accurately predict the CPU workloads for I/O-intensive applications.
To resolve this issue, we employ a fuzzy logic control to simplify the model construction work, which incorporates a precise logic and approximate reasoning into the system design and obtains much more accurate representations [40] , [41] , [42] , [43] , [44] .
In this paper, we develop a multi-input-multi-output power management system named MAR (model-free, adaptive, and rule-based) for CMP machines. It adopts fuzzy logic for power management by considering the I/O wait rather than employing the traditional B-I model. Our design can precisely control the performance of a CMP chip at the desired set point while saving 12.3-16.1 percent more power at run-time. Moreover, we are able to save even more power if the loaded CPUs support more frequency levels, as detailed in the experimental results in Section 5.2. There are four primary contributions of this work:
1) Developing a model-free, adaptive and rule-based power management method named MAR.
2) Achieving fast responsiveness through the fast detection methods adopted by MAR.
3) Conducting extensive experiments to examine the impact of the cores' seven working statuses and proposing an accurate description of the relationship between frequency, performance and power consumption.
4) Simplifying the power management method by introducing fuzzy control theory in order to avoid heavy reliance on a precise system model.
The rest of this paper is organized as follows. Section 2 discusses CMP behaviors to learn the relationship between frequency, power and performance. Section 3 describes the design of MAR. Section 4 provides the implementation details, and Section 5 presents the extensive experiments and the corresponding results. Section 6 introduces related work on power management. Finally, Section 7 concludes the paper.
LEARNING CORE'S BEHAVIORS
In this section, we explore the behaviors of each core in a CMP processor to learn the relationship between power consumption, performance, and frequency settings, as shown in Fig. 1. As widely shown in previous works, CPU power consumption and performance are both highly related to CPU frequency [5], [19]. Our experimental results demonstrate that there exists a cubic relationship between power consumption and CPU frequency, which is well documented and shown in Fig. 1. However, the relationship between performance and frequency is difficult to model: the same frequency setting may result in a different response time (rt) or execution time (et) for various types of applications. Hence, the performance is related to both the processor's frequency and the workload characteristics. On the other hand, the behavior of the CPU can illustrate the characteristics of the running workloads. More specifically, each core in a CMP has seven working statuses, which we denote as the "metrics" in the rest of this paper:
user: normal processes executing in user mode;
nice: niced processes executing in user mode;
system: processes executing in kernel mode;
idle: idle time;
I/O wait: waiting for I/O operations to complete;
irq: servicing interrupts;
softirq: servicing soft interrupts.
The durations of a core's seven statuses completely exhibit the composition of the running workload. As a result, the performance is determined by a function F that considers both the CPU frequency and the workload characteristics. Since the workload can be formulated using f, which is a function of the seven working statuses, the system performance can be presented as in Equation (1).
$$\mathit{Per} = F\{\mathit{frequency},\ f(\mathit{user}, \mathit{nice}, \mathit{sys}, \mathit{idle}, \mathit{iowait}, \mathit{irq}, \mathit{softirq})\} \tag{1}$$

We launch various applications on our testbed to learn the curve of Equation (1), including an I/O bomb from the isolation benchmark suite (IBS) [22], the gcc and mcf benchmarks from the SPEC CPU2006 suite version 1.0 [23], and TPC-C running on PostgreSQL [24]. I/O bomb uses the IOzone benchmark tool to continuously read from and write to the hard disk (by writing files larger than main memory to ensure that it is not just testing memory); mcf is the most memory-bound benchmark in SPEC CPU2006, while gcc is a CPU-intensive application, as shown in Table 1. TPC-C is a standard online-transaction-processing (data-intensive) benchmark. The configuration details of these benchmarks can be found in Section 5. We run these tests on a Quad-Core Intel Core2 Q9550 2.83 GHz processor with 12 MB L2 cache and 1333 MHz FSB. The CPU supports frequencies of 2.0, 2.33, and 2.83 GHz.
Per-Core
Because we are using per-core level DVFS for power management, it is necessary to understand the meanings of the seven statuses of each single core. In previous works, the relationship between CPU behavior and estimated performance et is simply determined by the CPU busy/idle (B-I) time [14], [25]. This B-I model is defined as Equation (2):

$$\frac{et_{new}}{et_{old}} = P_{busy} \cdot \frac{f_{old}}{f_{new}} + P_{idle} \tag{2}$$
where $P_{busy}$ (see footnote 1) is the busy ratio of the CPU while $P_{idle}$ is the idle percentage of the CPU; $f_{old}$ and $f_{new}$ are the old and new CPU frequency settings. We first enable only one core in the CMP processor and assign one process to run the benchmarks, so that we can avoid the noise from task switches among the cores. The upper two charts in Fig. 2 illustrate that for the first two workloads, gcc (CPU-intensive) and mcf (memory-intensive), the B-I model is accurate enough, with less than 3 percent deviation. However, for the I/O-intensive or data-intensive workloads, I/O bomb and TPC-C, shown in the lower two charts in Fig. 2, using the B-I model, which does not consider the I/O impact, results in up to 39.6 percent modeling error. The B-I model works well for CPU-intensive and memory-intensive workloads because of ILP: the latency caused by cache misses and mispredictions is eliminated by advancing future operations. However, ILP is not always capable of eliminating the latency caused by I/O operations [3], which leads to the prediction errors for the I/O bomb and TPC-C benchmarks. We also show the statistics of the three main working statuses in Fig. 3. For gcc and mcf, most of the execution time is spent in user mode; the cache misses and mispredictions of mcf have negligible impact on the CMP's behavior due to ILP. For I/O bomb, I/O wait is the main latency; for the data-intensive benchmark TPC-C, the lower frequency hides some of the I/O wait latency because of ILP, but the latency in both user and iowait modes cannot be ignored. For all four cases, the irq and softirq latencies are very small and constitute only about 0.2 percent of the total working status. Therefore, irq and softirq are not taken into account in our experiments, since the latency caused by them cannot affect the overall system performance compared with the other major latencies.
As a result, "user+nice+sys", "idle" and "I/O wait" are the three most important working statuses which could describe the CMP's behavior in general. Without considering I/O wait latency, the basic B-I model may result in non-trivial modeling errors for data-intensive or I/O intensive applications.
Multi-Core
Because of the job scheduler in a CMP processor, a task may be switched among the cores during its run. In order to show whether these core-level task switches can eliminate the I/O wait latency, we run seven processes on all four cores in our testbed. Each process randomly runs one of the following benchmarks: gcc, mcf, bzip2, gap, applu, gzip and TPC-C. Each core has three available frequency settings: 2.83, 2.33 and 2.0 GHz. Fig. 4 shows the traces for core0 under different frequency settings. We omit "irq" and "softirq" based on the results of Section 2.1, and we treat "user, nice, sys" as a group denoting the real "busy" status. When the frequency is 2.83 GHz, all the workloads are processed in parallel in "phase 1"; the I/O wait latency can be hidden by the process-level parallelism. However, in "phase 2", when there are not enough available processes to schedule, the I/O wait latency emerges. After all processes are finished, the core stays idle in "phase 3". The traditional B-I based power management scheme is only able to discover the chances to save power in "phase 3" by lowering the processor's voltage and frequency. In fact, however, "phase 2" also provides opportunities to save more power. In this case, we can lower the frequency in order to overlap the CPU and I/O operations as much as possible.
In Case 2, as shown in the lower part of Fig. 4, we can use 2.0 GHz to process all the workloads in a roughly comparable execution time while consuming only 35.3 percent of the power of Case 1, which runs at 2.83 GHz. We note that heavy disk utilization may not necessarily result in I/O wait if there are enough parallel CPU-consuming tasks. Interestingly, new data-intensive analyses and applications incur long I/O wait latencies [18], [26], [27], [28], [29]. As a result, I/O wait still needs to be taken into account in the big data era.

1. The "busy time" in previous works is usually calculated as overall time minus idle time, without consideration of the core's other metrics [14], [25].
Analysis and Preparation for MAR Design
Although the I/O wait is the duration in which the processor is waiting for an I/O operation to finish, we cannot simply consider it a sub-category of the idle time, because increasing the CPU frequency linearly decreases the execution time if and only if only CPU idle time (and no I/O wait) exists. When I/O wait is present, there are two cases, as shown in Fig. 5. In Case 1, where the CPU-consuming tasks and I/O tasks are running asynchronously or blocking each other [30], the traditional B-I method (discussed in Section 2.1) can be utilized to model the relation between execution time and CPU frequency. In Case 2, where the two types of workloads are running in parallel but are not well aligned, using the B-I model to scale the CPU frequency will not affect the overall execution time. In other words, the traditional B-I model will not work precisely in this situation. In order to distinguish these two cases, we introduce two thresholds, $th_{up}$ and $th_{down}$: $th_{up}$ stands for the CPU scaling-up threshold, and $th_{down}$ stands for the CPU scaling-down threshold. With their help, we can quantify Equation (1) as the following Equation (3). In Case 1, the core is either busy-dominated ($v < th_{up}$) or idle-dominated ($v > th_{down}$), thus the traditional B-I model can be utilized. In Case 2, the core is neither in busy nor in idle status, thus scaling the CPU frequency will not affect the overall execution time. Therefore, the ratio $rt_{new}/rt_{old}$ is set to 1.
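$$
\frac{rt_{new}}{rt_{old}} =
\begin{cases}
P_{busy} \cdot \dfrac{f_{old}}{f_{new}} + P_{idle}, & \text{Case 1: if } v < th_{up} \text{ when scaling up, or if } v > th_{down} \text{ when scaling down} \\
1, & \text{otherwise (Case 2)}
\end{cases}
\tag{3}
$$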
where $P_{busy}$ represents the busy ratio, rt stands for response time, and $P_{idle}$ is the idle ratio. The default values of $th_{up}$ and $th_{down}$ are based on our comprehensive experimental results. Note that these two thresholds are affected by the throughput of I/O devices, L1/L2 cache hit rates, network traffic, etc. A self-tuning strategy for these two thresholds is explained in detail in Section 3.6. Equation (3) completes the relationship among performance, frequency and the various types of workloads that we try to present in Fig. 1. Our rule-based power management controller MAR is designed according to the two relationships in Equation (3).
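The two cases of Equation (3) can be illustrated with a short Python sketch. It assumes the common B-I form in which only the busy share of the time scales with $f_{old}/f_{new}$; the function names and the threshold values in the usage note are hypothetical and are not taken from the paper.

```python
def bi_ratio(p_busy: float, p_idle: float, f_old: float, f_new: float) -> float:
    """Assumed B-I form: the busy share scales with f_old / f_new, the idle share does not."""
    return p_busy * (f_old / f_new) + p_idle


def predicted_rt_ratio(p_busy: float, p_idle: float, v: float,
                       f_old: float, f_new: float,
                       th_up: float, th_down: float) -> float:
    """Predicted rt_new / rt_old following the two cases described for Equation (3)."""
    scaling_up = f_new > f_old
    scaling_down = f_new < f_old
    # Case 1: busy-dominated when scaling up, or idle-dominated when scaling down.
    if (scaling_up and v < th_up) or (scaling_down and v > th_down):
        return bi_ratio(p_busy, p_idle, f_old, f_new)
    # Case 2: CPU and I/O overlap, so frequency scaling leaves rt unchanged.
    return 1.0
```

For instance, with hypothetical thresholds `th_up=0.3` and `th_down=0.1`, scaling from 2.0 to 2.83 GHz with `v=0.05` falls into Case 1, while the same change with `v=0.5` falls into Case 2 and leaves the predicted response time unchanged.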
MAR'S DESIGN
In this section, we introduce the design, analysis, and optimization of our rule-based power management approach.
In the previous sections, we explained why I/O wait should be considered in power management. One challenge is that, when the I/O wait is considered, the CPU workload becomes unpredictable due to the I/O randomness. Therefore, the short sampling period that used to work well for DVFS control might cause severe problems for the system, such as instability and noise. To attack this problem, we need to prolong the sampling period without significantly stretching the response time under variations of CPU utilization. This requires incorporating a thorough understanding of the control object and control system into the MAR system design. Fuzzy logic control is an ideal candidate solution here, as it takes human experience as input and enforces a precise logic in the control system to reflect that understanding.
MAR Control Model
Fuzzy logic is the main basis of MAR's design, which includes fuzzification, rule evaluation and defuzzification. MAR is designed as a MIMO controller, as shown in Fig. 6. In order to demonstrate that our control efficiency is better than that of other control methods, we divide our control algorithm into two parts. In the first part, the I/O wait is not taken into account, in order to show that the outcome of our control algorithm is more precise and its response time is faster. In the second part, since the I/O wait is critical for a power-saving strategy, especially when running data-intensive tasks, the I/O wait is considered to show that our new "B-W-I" model can work accurately and efficiently. By introducing the I/O wait, Equation (3) can be further updated into the following Equation (4):
where v stands for the I/O wait ratio. Let SP denote the sampling period; RRT the required response time, which is a key factor used to determine whether the system performance has been achieved; cb the core boundness of the workloads (the percentage of the core's busy time compared with the sample time); v the I/O wait ratio; and ecb the tracking error of core boundness. One basic control loop is described as follows: at the end of each SP, the rt, cb, and v vectors are fed back into the controller through an input interface. It should be noted that rt, ecb, and v can be directly measured from the last SP. These inputs are processed into the arguments $P_{busy}$, v, $rt_{new}$ and $rt_{old}$ of Equation (3) or Equation (4), depending on whether the I/O wait is taken into account. We now show how to calculate the arguments.
Fuzzification without Consideration of I/O Wait
We first fuzzify the input values of cb and rt by performing a scale mapping, using a membership function to transfer the range of crisp values into the corresponding universe of discourse. Here, the universe of discourse refers to the linguistic terms used in fuzzy logic, such as "PS", "ME" and "PL", which represent "positive short", "moderate" and "positive large" respectively. A membership function represents the degree of truth as an extension of valuation. The response time rt is one input of the system and is used to determine the performance of the controller. In order to set a performance constraint, we denote 0 < d < 1 as the user-specified performance-loss constraint. We use the symmetric triangular membership function presented in Equation (5) for mapping between crisp values and linguistic terms. We apply a linear transformation to the response time because it has a specified minimum and maximum value in this case. Equation (5) presents the membership function and Fig. 7 plots the function of rt.
$$
\mu_A(rt) =
\begin{cases}
\cdots & \\
\cdots, & \text{if } rt > RRT(1 + d) \\
0, & \text{otherwise}
\end{cases}
\tag{5}
$$
where A stands for the fuzzy set of response time, which includes {PF, ME, PS}, standing for positive fast, medium and positive small. For each crisp value of response time, we can compute a set of membership degrees m that can be used in the defuzzification step by applying certain rules.
Second, we fuzzify the other input variable, core boundness, using a Gaussian membership function, because it transforms the original values into a normal distribution, which creates a smoother transformation than the linear functions. In addition, since the value of core boundness changes frequently, a Gaussian membership function offers a higher chance of detecting the fluctuation and responding accordingly. Equation (6) is the Gaussian membership function used to map the crisp values of core boundness, and Fig. 8 shows the plot of the mapping.

$$\mu_B(cb) = \exp\left[-\frac{1}{2}\left(\frac{cb - cb_0}{d}\right)^2\right] \tag{6}$$

where B represents the fuzzy set of core boundness, which includes {PS, ME, PL}, representing "Positive Small", "Medium" and "Positive Large"; $cb_0$ is the position of the peak relative to the universe and d is the standard deviation. Fig. 8 shows that if a crisp value of core boundness falls between 0.4 and 0.5, we can calculate its degree under each element of the fuzzy set.
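As an illustration of the Gaussian fuzzification of core boundness, the following Python sketch evaluates the membership degree of a crisp value under each linguistic term. It assumes the standard Gaussian form with peak position $cb_0$ and standard deviation d; the peak positions and the width used for PS, ME and PL are hypothetical values chosen only for the example.

```python
import math


def gaussian_membership(cb: float, cb0: float, d: float) -> float:
    """Degree of membership of cb in a Gaussian fuzzy set peaking at cb0 with width d."""
    return math.exp(-((cb - cb0) ** 2) / (2.0 * d ** 2))


# Hypothetical peak positions and a shared width; the paper's actual values are not given here.
CENTERS = {"PS": 0.0, "ME": 0.5, "PL": 1.0}
WIDTH = 0.2


def fuzzify_core_boundness(cb: float) -> dict:
    """Return the membership degree of cb under each linguistic term."""
    return {term: gaussian_membership(cb, center, WIDTH)
            for term, center in CENTERS.items()}
```

With these assumed parameters, a core boundness of 0.45 belongs mostly to "ME", with smaller degrees in "PS" and "PL", which is the kind of graded answer the rule evaluation step consumes.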
Fuzzification with Consideration of I/O Wait
In this section, we focus on the fuzzification of the I/O wait. As mentioned above, two thresholds were introduced in Section 2.3 to distinguish the parallelism of I/O wait and core boundness. These two thresholds are used as the minimum and maximum values for the mapping, which utilizes a symmetric triangular membership function. Equation (7) shows the relationship between crisp values and the membership degrees, and Fig. 9 presents the corresponding plot.
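$$
\mu_C(v) =
\begin{cases}
\dfrac{\cdots}{(th_{up} - th_{down})/2}, & \text{if } th_{down} < v < th_{up} \\
1, & \text{if } v > th_{up} \\
0, & \text{otherwise}
\end{cases}
\tag{7}
$$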
where C represents the fuzzy set of I/O wait, which includes {PS, ME, PL}, representing "Positive Small", "Medium" and "Positive Large"; $th_{up}$ and $th_{down}$ denote the two thresholds.
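A symmetric triangular membership function bounded by the two thresholds can be sketched as follows. This is only an illustration of the middle branch of the mapping, under the assumption that the triangle peaks midway between $th_{down}$ and $th_{up}$ with half-width $(th_{up} - th_{down})/2$; the function name and the example thresholds are hypothetical.

```python
def triangular_membership(x: float, low: float, high: float) -> float:
    """Symmetric triangle: 0 at or outside [low, high], 1 at the midpoint."""
    if not low < x < high:
        return 0.0
    center = (low + high) / 2.0
    half_width = (high - low) / 2.0
    return 1.0 - abs(x - center) / half_width


def iowait_degree(v: float, th_up: float, th_down: float) -> float:
    """Degree to which the I/O wait ratio v lies between the two thresholds."""
    return triangular_membership(v, th_down, th_up)
```

For example, with assumed thresholds `th_down=0.2` and `th_up=0.6`, an I/O wait ratio of 0.4 sits at the peak of the triangle, while 0.3 and 0.5 both receive a degree of about 0.5.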
Fuzzy Rules
In this section, we propose a set of rules that guide MAR in finding the frequencies to be set in the next sampling period. The fuzzy rules are presented in Tables 2 and 3. Table 2 lists the fuzzy rules without considering I/O wait, and aims at demonstrating that our MAR control method works better than other existing control approaches. Table 3 presents a set of fuzzy rules that take I/O wait into account. The following paragraphs explain in detail the procedure for generating the rules.
First, if $RRT(1 - s) \le rt \le RRT(1 + s)$: this is the ideal case from a performance perspective. Traditional solutions may not change the core's frequency setting. However, MAR further checks whether $v > th_{down}$.
If so, the frequency can be scaled down to a lower level to save more power without affecting the response time rt.
If not, scaling the frequency will result in a different response time rt, which deviates from RRT. In this case, MAR keeps using the current frequency.
Secondly, if $rt > RRT(1 + s)$: the real response time does not meet the requirement, and MAR checks whether $v > th_{up}$.
If v exceeds the scaling-up threshold, changing to a higher frequency will not improve the performance. Moreover, a higher frequency will result in a higher I/O wait, which is a waste of core resources. As a result, MAR keeps the current frequency setting.
If $v \le th_{down}$, we may be able to scale down the core frequency to just meet the performance requirement while saving more power.

[Table 2: Fuzzy rules without considering I/O wait]
[Table 3: Fuzzy rules considering I/O wait]
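The rule-generation procedure above can be summarized in a crisp (non-fuzzy) Python sketch. It is only an approximation of the reasoning behind the rule tables: the lower bound $RRT(1 - s)$ of the ideal band and the treatment of an over-met requirement as "scale down" are assumptions, and the function name and return labels are hypothetical.

```python
def mar_rule(rt: float, rrt: float, s: float, v: float,
             th_up: float, th_down: float) -> str:
    """Return 'down', 'keep' or 'up' for the next sampling period."""
    lower, upper = rrt * (1.0 - s), rrt * (1.0 + s)
    if lower <= rt <= upper:
        # Performance is on target: scale down only if I/O wait hides the slowdown.
        return "down" if v > th_down else "keep"
    if rt > upper:
        # Requirement not met: a higher frequency helps only if I/O wait is not dominant.
        return "keep" if v > th_up else "up"
    # rt below the band: the requirement is over-met, so a lower frequency can still meet it.
    return "down"
```

For example, with `rrt=1.0`, `s=0.05`, `th_up=0.5` and `th_down=0.2` (all illustrative values), a measured `rt=1.2` with `v=0.6` keeps the current frequency, because raising it would mostly add I/O wait.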
Centroid Defuzzification Method
Since we introduce two different kinds of membership functions, triangular and Gaussian, there are also two types of centroid defuzzification methods. In the previous sections we presented two fuzzification approaches for our MAR controller, which differ in whether the I/O wait is considered. Equations (8) and (9) were generated to deal with these two cases by using two kinds of defuzzification solutions.
where $f_i$ stands for the center of the CPU frequencies. In our case, $f_i$ = 2.0, 2.33, 2.83. Through Equations (8) and (9), we are able to defuzzify the linguistic values obtained from the output of the rule table by incorporating the various membership degrees. The output results fall into three intervals, which are 0-2.0 GHz, 2.0-2.33 GHz and 2.33-2.83 GHz.
Since the CPU only supports three DVFS levels, a value below the first interval is automatically set to 2.0 GHz; otherwise, MAR decides which frequency level to set based on the output of the rule table. For example, if the linguistic term of the output frequency is "PB", 2.83 GHz is set for the next sampling period.
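To make the defuzzification step concrete, the following Python sketch uses a generic weighted-average (centroid of singletons) form over the three frequency centers. This is a stand-in for illustration and not necessarily the exact form of Equations (8) and (9); mapping the crisp output to the upper bound of its interval is likewise an assumption.

```python
LEVELS_GHZ = (2.0, 2.33, 2.83)  # frequency centers f_i supported by the testbed CPU


def centroid(degrees) -> float:
    """Weighted average of the frequency centers, weighted by membership degree."""
    total = sum(degrees)
    if total == 0:
        raise ValueError("at least one membership degree must be non-zero")
    return sum(m * f for m, f in zip(degrees, LEVELS_GHZ)) / total


def snap_to_level(freq_ghz: float) -> float:
    """Map a crisp output to a supported DVFS level (assumed: upper bound of its interval)."""
    for level in LEVELS_GHZ:
        if freq_ghz <= level:
            return level
    return LEVELS_GHZ[-1]
```

For example, equal degrees on the two higher centers give a crisp value of about 2.58 GHz, which falls into the 2.33-2.83 GHz interval and is mapped to 2.83 GHz under the assumption above.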
Self-Tuning Strategy
There are several factors affecting the thresholds th up and th down , for example:
1) Throughput of I/O devices. Higher I/O throughput means the same amount of data can be transferred in fewer "I/O wait" jiffies. As a result, the thresholds become higher because the core needs more I/O wait jiffies to reach the time boundary, which defines whether the core-bounded part or the I/O part is the determinant of execution time.
2) On-chip L1/L2 cache hit rate. A lower cache hit rate results in more memory accesses, which are much slower than cache accesses. Therefore, the overall processing speed of the core-bounded part (including both cache and memory access) becomes slower.
3) The noise in I/O wait, such as network I/O traffic, file system journaling, paging, swapping, etc.
4) Heat and heat dissipation. When processors run too hot, they can experience errors, lock up, freeze, or even burn out.
It is difficult to predict the thresholds in these cases; hence we adopt self-tuning methods based on the observed system behaviors. The self-tuning strategies are listed below.
When $rt > RRT(1 + s)$ and $v \le th_{up}$: in this case, the RRT is not met and the frequency needs to be scaled up to improve the performance. However, if $rt_{new}$ after the frequency scaling is the same as the rt in the last sampling period, we need to adjust $th_{up}$ lower by Equation (10):
When $rt < RRT$: in this case, the RRT is met and the frequency may be scaled based on the rules presented in Section 3.4. Thus, if $v \le th_{down}$, RRT is over-met and the frequency needs to be scaled down; $th_{down}$ also needs to be adjusted to a lower level. Else, if $v > th_{down}$, RRT is over-met and $rt_{new}$ is changed, which means v should be lower than $th_{down}$; hence we set $th_{down}$ to a higher level. Equation (11) can be used to set $th_{down}$ either to a higher or a lower level.
METHODOLOGY
In this section, we show our experimental methodology and benchmarks, as well as the implementation details of each component in our MAR controller.
Processor
We use a Quad-Core Intel Core2 Q9550 2. 
Benchmark
We use three stress tests (CPU-bomb, I/O-bomb, and memory-bomb) from the isolation benchmark suite [22]. TPC-C incorporates five types of transactions with different complexity for online and deferred execution on a database system. Every transaction consists of a computing part and an I/O part. Due to the database buffer pool, the updated records will not be flushed until the pool is full.
Core Statistics
Various information about kernel activities is available in the /proc/stat file. The first lines in this file hold the CPU statistics, such as usr, nice, sys, idle, etc. Since we are using version 3.8.13 of the Linux kernel, the file includes three additional columns: iowait, irq, softirq. These numbers identify the amount of time that the CPU has spent performing different kinds of work. Time units are in USER_HZ, or jiffies. In our x86 system, the default value of a jiffy is 10 ms, or 1/100 of a second. MAR needs to collect core boundness information as well as I/O wait latency. Each core's boundness is the sum of the jiffies in user, nice and sys mode divided by the total number of jiffies in the last SP. Similarly, I/O wait latency is calculated based on the iowait column. The way to measure the real-time response time depends on the benchmark. In the Isolation Benchmark, the response time can be monitored through the I/O throughput. In TPC-C, the primary metric, the transaction rate (tpmC), can be used as the response time. However, for SPEC CPU2006 benchmarks, it is difficult to find any metric to denote response time because there is no "throughput" concept. Our previous experiments in Fig. 2 show that these CPU-intensive and memory-intensive benchmarks have roughly linear relationships with core frequency. Hence we can calculate the number of instructions that have been processed in the sampling period by multiplying the CPU time (first three fields in the /proc/stat file) by the core frequency. The result can be used as the response time metric.
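The collection of core boundness and I/O wait ratio from /proc/stat can be sketched in Python as follows. The sketch takes two snapshots of the file contents as text, one at the start and one at the end of a sampling period; it assumes that the total jiffies of a period can be approximated by the sum of the seven fields discussed above, and the function names are hypothetical.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_proc_stat(text: str) -> dict:
    """Map 'cpu0', 'cpu1', ... to a dict of the seven jiffy counters."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-core lines start with "cpuN"; the aggregate line is just "cpu".
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            values = [int(x) for x in parts[1:1 + len(FIELDS)]]
            cores[parts[0]] = dict(zip(FIELDS, values))
    return cores


def core_metrics(before: dict, after: dict) -> tuple:
    """Return (core boundness, I/O wait ratio) for one core over one sampling period."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return 0.0, 0.0
    cb = (delta["user"] + delta["nice"] + delta["system"]) / total
    v = delta["iowait"] / total
    return cb, v
```

On a Linux machine the two snapshots would be obtained by reading `/proc/stat` once at the beginning and once at the end of each sampling period, then calling `core_metrics` per core on the parsed results.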
DVFS Interface
We enable Intel SpeedStep in the BIOS and use the cpufreq package to implement DVFS. With root privilege, we can echo different frequencies into the system file /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_setspeed, where [X] is the index of the core. We test the overhead of scaling CPU frequencies on our platform, which is only 0.08 milliseconds on average.
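The same operation can be sketched in Python. The sketch assumes that the cpufreq `userspace` governor is active for the core (otherwise `scaling_setspeed` does not accept writes) and that the frequency is given in kHz, as cpufreq expects; the helper names and the `root` parameter, which exists only to allow testing against a scratch directory, are hypothetical.

```python
from pathlib import Path

SYSFS_CPU_ROOT = "/sys/devices/system/cpu"


def setspeed_path(core: int, root: str = SYSFS_CPU_ROOT) -> Path:
    """Path of the per-core cpufreq file that sets the target frequency."""
    return Path(root) / f"cpu{core}" / "cpufreq" / "scaling_setspeed"


def set_core_frequency(core: int, khz: int, root: str = SYSFS_CPU_ROOT) -> Path:
    """Write the target frequency (in kHz) for one core; needs root on a real system."""
    path = setspeed_path(core, root)
    path.write_text(f"{khz}\n")
    return path
```

For example, `set_core_frequency(0, 2000000)` would request 2.0 GHz for core 0, mirroring the `echo` command described above.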
Power Estimation
We measure the processor's power consumption by connecting two multimeters into the circuit, as shown in Fig. 10. Specifically, as the processor is powered by two +12 V CPU 4-pin cables, we put one multimeter on each cable to measure the amperage (A). On the other side, the Agilent IntuiLink software logs the data into the logging server using Microsoft Excel. After collecting the measured amperage for each sampling period, we compute the average amperage, so that we can obtain the energy by multiplying the voltage, the amperage and the time duration.
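The energy computation described above can be sketched in Python. The sketch assumes that the currents of the two +12 V cables are sampled at the same instants and add up to the processor's total current; the function name and the sample values in the usage note are illustrative only.

```python
def energy_joules(amps_cable_a, amps_cable_b, sample_period_s: float,
                  volts: float = 12.0) -> float:
    """Energy = voltage x average total amperage x duration of the measurement."""
    if len(amps_cable_a) != len(amps_cable_b) or not amps_cable_a:
        raise ValueError("need two equally long, non-empty sample lists")
    # Total current per sample is the sum over both cables (assumption).
    totals = [a + b for a, b in zip(amps_cable_a, amps_cable_b)]
    avg_amps = sum(totals) / len(totals)
    duration_s = sample_period_s * len(totals)
    return volts * avg_amps * duration_s
```

For example, two samples of 2.0 A and 1.0 A on the two cables, taken 10 s apart, correspond to an average of 3.0 A over 20 s, i.e., 12 V x 3.0 A x 20 s = 720 J.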
Baseline Control Methods
PID controller [32] is a control loop feedback mechanism widely used in industrial control systems. A PID controller calculates an error value as the difference between a measured process variable and a desired setpoint. Model predictive control (MPC) [33] is an advanced method of process control that has been widely used in the process industries. It relies on dynamic models of the process, most often linear empirical models obtained by system identification. LAST [9] is the simplest statistical predictor, which assumes the next sample behavior is identical to its last seen behavior. Relax [25] is an algorithm which predicts the workload using both history values and run-time profiling.
EXPERIMENTS
First, MAR is compared with four baselines to illustrate its high responsiveness. Second, MAR is used to perform power control for different types of workloads, including CPU-intensive, memory-intensive and I/O-intensive benchmarks. The purpose is to show MAR's performance in each specific environment. Third, we compare the two versions of MAR (with and without considering I/O wait) by running data-intensive benchmarks, in order to highlight the impact of I/O wait on power management schemes. After that, we compare the overall efficiency of MAR and the baselines. In order to demonstrate MAR's efficiency in power management, we also compare MAR with the conventional Linux governor solution (Ondemand). Finally, we briefly evaluate the overhead of our power management schemes.
Fast Responsiveness
In this section, we compare MAR's response time and prediction accuracy with four baselines: LAST [9], PID [32], Relax [25] and MPC [33]. All of these algorithms are implemented in RTAI 4.0 [34] to trace the cores' behavior and predict the next core-boundness. For Relax, we set the relaxation factor to 0.5 based on the empirical value in [25]. For PID, we tune the controller following [32]. In the MPC implementation, the prediction horizon size is 2, the control horizon size is 1 as described in [35], and the setpoint is set to the average core boundness obtained from offline computation. First, we use Fig. 11 to show the trajectory of prediction when running the bzip2 benchmark with SP = 10 s. We also run gcc, mcf, gap, applu and gzip, and collect the "average detecting times" for all of the benchmarks, as shown in the table in Fig. 11. It can be seen that MAR has the fastest response time (as shown in the zoomed area in Fig. 11). MAR responds more quickly than the other methods because it is built on fuzzy logic, a solution derived from knowledge of the system: the more we understand about the system, the less uncertainty appears. The table in Fig. 11 also shows that MAR achieves the shortest settling time after the deviation. The overall response time of MAR outperforms LAST, Relax, PID and MPC by 56.5, 104, 174 and 43.4 percent, respectively. Second, we measure the impact of SP on prediction accuracy. Fig. 12 shows the average prediction errors for all five algorithms when SP = 5 s, SP = 10 s and SP = 20 s. We can see that all predictors perform better when SP = 10 s. With a smaller SP, MAR can respond to changes of core boundness more quickly, but this may lead to over-reactions; with a larger SP, the core's behavior changes more slowly due to the larger time window but is more stable.
Slow-responsive algorithms such as PID do not work well here since they are only suited to workloads with strong locality. In summary, MAR always obtains the lowest prediction error because it incorporates the tracking error, which gives more hints about the coming trend and provides high responsiveness to the core's status switches.
Power Efficiency
This set of experiments shows the power management efficiency of MAR for different types of benchmarks: gcc, mcf, bzip2, gap, applu, gzip and TPC-C.
Running homogeneous workloads. In this section, we show MAR's control performance when homogeneous workloads are running. For each benchmark, we use four threads to run four copies of it on our testbed to evaluate MAR's performance for each specific type of workload. Fig. 13 shows the power consumption and performance loss of MAR, the baselines (LAST, Relax, PID and MPC) and the Ideal case. In the "Ideal" case, we use the ideal DVFS settings calculated offline, which achieve the best power saving efficiency and the least performance loss. Taking the ideal case as the one that saves the most power, MAR and the other baselines perform well when the workloads have no explicit I/O operations. For gcc, mcf, bzip2, gap, applu and gzip, MAR achieves 95.4 percent of the ideal power management, while LAST achieves 91.7 percent, Relax 94.1 percent, PID 93.6 percent and MPC 94.5 percent. However, when we run the TPC-C benchmark, the baselines achieve only 57.8-69.8 percent of the power saving of the ideal case. By treating I/O wait as an opportunity to save power, MAR still achieves 92.5 percent of the power efficiency of the Ideal case. At the same time, the performance loss of all power management strategies is between 2 and 3 percent. Although MAR has the highest performance loss, 2.9 percent, for the TPC-C benchmark (because of our aggressive power saving strategy), it is still in the safe zone [32].
Running heterogeneous workloads. This section compares MAR with the baselines when heterogeneous workloads are running; we launch all seven aforementioned benchmarks in parallel on our testbed. The database for the TPC-C benchmark is set up locally. Fig. 14 shows the overall DVFS results and power-saving efficiency. The upper two charts in Fig. 14 illustrate the frequency distributions of all management methods. Note that, compared with SP = 5 s, the workload trajectory in the SP = 10 s case has fewer fluctuations caused by "phantom bursts". The slow-responsive methods such as Relax, PID and MPC cannot discover as many power-saving opportunities as the fast-responsive ones, MAR and LAST, especially in the smaller SP case. From the upper left of Fig. 14, we can see that 60 percent of MAR's DVFS results run at the lowest frequency and nearly 10 percent are set to the medium level. MAR outperforms the other four DVFS control solutions. The lower two charts in Fig. 14 describe the power consumption of all management methods. All the numbers are normalized to MAR, which saves the most power. PID and MPC perform very differently when SP = 10 s and SP = 5 s. The reason is that more "phantom bursts" in the workloads (when SP = 5 s) can affect the control accuracy significantly. LAST is always better than Relax because it responds faster to the core's status switches in a CMP environment. From the power saving perspective, MAR, on average over SP = 10 s and SP = 5 s, saves 5.6 percent more power than LAST, 3.7 percent more than Relax, 2.9 percent more than PID and 2.1 percent more than MPC.
The impact of the new model B-W-I. In order to highlight the impact of the B-W-I model on power management, we conduct experiments on the two versions of MAR described in Section 3: MAR and MAR (B-W-I Model). The former is used to test the fast responsiveness to status switches but does not consider I/O effects. We use seven threads to run gcc, mcf, bzip2, gap, applu, gzip and TPC-C in parallel. The comparison of MAR and MAR (B-W-I Model) is shown in Fig. 15. The results show that MAR (B-W-I Model) is more likely to use lower frequencies than MAR. The reason is that when the I/O wait exceeds the thresholds in the control period, MAR (B-W-I Model) still scales the core frequency down to a lower level to save more power, even if the response time is close to RRT. Compared with MAR (B-W-I Model), MAR cannot discover the potential I/O jobs that overlap with the computing-intensive jobs. Based on the cubic relation between frequency and power consumption, MAR (B-W-I Model) saves 10 percent more power than MAR when SP = 10 s and 9.7 percent more power when SP = 5 s. We plot the power consumption and performance statistics of MAR, MAR (B-W-I Model), the performance-oriented case and the ideal case in Fig. 16. In the "Perf. Oriented" case, maximum frequencies are used all the time. All the numbers are normalized to "Perf. Oriented", which has the highest power consumption. Based on the results, MAR (B-W-I Model) saves about 9.7-10 percent more power than MAR on average. We expect that the more I/O-intensive the workloads are, the better the results MAR (B-W-I Model) can achieve.
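As an illustration of how I/O wait can be separated from the traditional busy/idle accounting on Linux, the sketch below derives busy, I/O wait and idle shares from two samples of a per-core line in /proc/stat. It is not MAR's implementation; the function names and sample lines are hypothetical, and the only assumption about the system is the standard /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_cpu_line(line):
    """Return (busy, iowait, idle) jiffies from one 'cpuN ...' line of /proc/stat."""
    fields = line.split()
    if not fields or not fields[0].startswith("cpu"):
        raise ValueError("not a cpu line: %r" % line)
    values = [int(v) for v in fields[1:9]]
    values += [0] * (8 - len(values))  # older kernels report fewer fields
    user, nice, system, idle, iowait, irq, softirq, steal = values
    busy = user + nice + system + irq + softirq + steal
    return busy, iowait, idle


def bwi_shares(before, after):
    """Busy / I-O wait / idle shares of one core between two samples."""
    b0, w0, i0 = parse_cpu_line(before)
    b1, w1, i1 = parse_cpu_line(after)
    d_busy, d_wait, d_idle = b1 - b0, w1 - w0, i1 - i0
    total = d_busy + d_wait + d_idle
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return d_busy / total, d_wait / total, d_idle / total


if __name__ == "__main__":
    # Hypothetical samples taken one sampling period apart.
    s0 = "cpu0 100 0 50 800 50 0 0 0"
    s1 = "cpu0 150 0 100 850 150 0 0 0"
    print(bwi_shares(s0, s1))
```

In this hypothetical interval, a busy/idle view would lump the I/O wait share together with idle time or with busy time, whereas the three-way split exposes it as a separate signal that a controller can act on.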
Overhead. Finally, Table 4 shows the overhead of the tested methods. They are all lightweight and consume less than 1 percent CPU utilization for a sampling period of 10 s. The MPC controller has the highest overhead because it is computationally expensive. MAR executes almost nine times faster than the MPC controller.
Scalability. In previous sections, we tested MAR on our testbed, which has only four cores and three available voltage-frequency settings. In order to show the scalability of MAR, we use the cycle-accurate SESC simulator with modifications to support per-core level DVFS. Each core is configured as an Alpha 21264 [37]. We enable Wattchify and Cactify [36] to estimate the power change caused by DVFS scaling. In our simulation, we scale up MAR for 8-, 16- and 32-core processors with a private L1 and L2 hierarchy, and the cores are placed in the middle of the die. Each core in our simulation has three DVFS levels (3.88 GHz, 4.5 GHz and 5 GHz).
The overhead of each DVFS scaling is set to 20 [3]. The benchmarks we use are randomly selected SPEC 2006 benchmarks (gcc, mcf and bzip2) and the data-intensive TPC-C benchmark. The number of processes equals the number of cores; e.g., we run two copies of each of the four benchmarks when there are eight cores. We first record the maximum power consumption and the best performance of the workloads by setting all the cores to the highest DVFS level. Then we normalize the results of MAR and the other baselines to show their power management efficiency and performance loss. Fig. 17 plots the average power saving efficiency and the performance loss of the MAR, LAST, Relax, PID and MPC based per-core level DVFS controllers. All the numbers are normalized to the "Performance-Oriented" case. For every number of cores, the CMP processor under MAR's control always saves the most power: about 65 percent compared with the cases without DVFS control. On average, MAR outperforms LAST, Relax, PID and MPC by about 14, 12.3, 11.8 and 10.1 percent, respectively, under our benchmark configurations. At the same time, the performance losses of MAR and the baselines are all between 2 and 3 percent, which confirms what we observed on our testbed. Our simulation results demonstrate that MAR can precisely and stably control power to meet the performance requirement for CMPs with different numbers of cores.
Power conservation potential for multiple CPU frequency levels. In order to further investigate the relationship between MAR's power conservation potential and multiple CPU frequency levels, we conducted more experiments on a new testbed with five frequency levels. We launched the data-intensive benchmark (TPC-C) in parallel on this testbed and recorded the DVFS control outputs for MAR, LAST, Relax, PID and MPC. The detailed results are illustrated in Fig. 18. We calculate the power consumption based on the cubic function of frequency documented in Section 2 and also present the results in Fig. 18. The upper chart in Fig. 18 shows the DVFS results for running the TPC-C benchmark with a sampling period of 5 s. From the figure, we can see that MAR keeps the CPU frequency at the lowest level for nearly 40 percent of the entire running period, while the corresponding results for LAST, Relax, PID and MPC are 23, 19, 20 and 21 percent, respectively. The lower chart compares the power consumption of all five control methods with respect to MAR. It is clear that MAR saves the most power, about 20, 24, 21 and 16 percent more than LAST, Relax, PID and MPC, respectively. This further demonstrates MAR's better potential for energy conservation when the processors support more CPU frequency levels.
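The way a recorded frequency distribution translates into a power estimate under a cubic frequency-power relation can be sketched as follows. The frequency levels and residency shares in the example are hypothetical placeholders, not the values measured in Fig. 18, and the function assumes only that power is proportional to the cube of the frequency.

```python
def normalized_power(freqs_ghz, residency):
    """Average power relative to always running at the highest frequency.

    freqs_ghz: available frequency levels.
    residency: share of time spent at each level (any positive scale).
    Assumption: power is proportional to frequency cubed.
    """
    if len(freqs_ghz) != len(residency):
        raise ValueError("one residency value is needed per frequency level")
    total = sum(residency)
    if total <= 0:
        raise ValueError("residency shares must sum to a positive value")
    f_max = max(freqs_ghz)
    return sum((share / total) * (f / f_max) ** 3
               for f, share in zip(freqs_ghz, residency))


if __name__ == "__main__":
    # Hypothetical two-level example: half of the time at half the top frequency.
    print(normalized_power([1.0, 2.0], [0.5, 0.5]))
```

In the two-level example, the time spent at half the maximum frequency costs one eighth of the peak power, so the average is 0.5 x 0.125 + 0.5 x 1 = 0.5625 of the performance-oriented case.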
Comparison with Conventional Governors
In this section, we compare the power conservation capability of MAR with that of the conventional Linux governors [45]. The CPU adapts its frequency to the workload through a user feedback mechanism. As a result, if the CPU is set to a lower frequency, the power consumption is reduced. The CPUfreq infrastructure of Linux allows CPU frequency scaling to be handled by governors. These governors can adjust the CPU frequency based on different criteria such as CPU usage. There are basically four governors in the kernel-level power management scheme: "Performance", "Powersave", "Ondemand" and "Conservative". Specifically, the "Ondemand" governor provides the best compromise between heat emission, power consumption, performance and manageability. When the system is only busy at specific times of the day, the "Ondemand" governor automatically switches between the maximum and minimum frequency depending on the real workload without any further intervention. This is a dynamic governor that allows the system to achieve maximum performance if the workload is high and scales the frequency down to save power when the system is idle. The "Ondemand" governor uses the traditional B-I model that we introduced in Section 2.1 for frequency scaling. It is able to switch frequencies quickly, at the cost of a longer clock frequency latency. The "Performance" governor forces the CPU to the highest frequency to achieve the best performance; in this case, it consumes the most power. The "Powersave" governor conserves the most power with the lowest system performance.
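For readers who want to check which governor is active, the sketch below reads the standard CPUfreq sysfs files `scaling_governor` and `scaling_available_governors`. The helper names and the `sysfs_root` parameter are our own illustrative choices (the parameter exists so the code can be pointed at a different directory), and the files are only present on systems where a CPUfreq driver is loaded.

```python
from pathlib import Path

DEFAULT_SYSFS_ROOT = "/sys/devices/system/cpu"


def cpufreq_dir(cpu=0, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Directory holding the CPUfreq policy files of one core."""
    return Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"


def current_governor(cpu=0, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Name of the governor currently driving the given core, e.g. 'ondemand'."""
    return (cpufreq_dir(cpu, sysfs_root) / "scaling_governor").read_text().strip()


def available_governors(cpu=0, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Governors the running kernel offers for the given core."""
    text = (cpufreq_dir(cpu, sysfs_root) / "scaling_available_governors").read_text()
    return text.split()


if __name__ == "__main__":
    try:
        print(current_governor(0), available_governors(0))
    except OSError:
        print("CPUfreq sysfs files are not available on this system")
```

Switching governors is done by writing a governor name to `scaling_governor` with root privileges; that step is left out here because it changes system state.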
In order to demonstrate MAR's effectiveness and efficiency in power conservation compared with the conventional Linux governors, we run the data-intensive benchmark TPC-C on our testbed and set the Linux governor to "Ondemand". In the meantime, we record the DVFS results for MAR, MAR (BWI) and "Ondemand", respectively. Fig. 19 illustrates the comparison of these three methods. Based on the results, we can see that MAR and MAR (BWI) outperform the conventional "Ondemand" governor by up to 8.62 and 17 percent, respectively, in power conservation.
RELATED WORKS
In recent years, various power management strategies have been proposed for CMP systems. From the perspective of DVFS level, previous power management schemes can be divided into two categories: chip-level and core-level power management. Chip-level power management uses chip-wide DVFS. In chip-level management [8], [9], [10], [11], the voltage and frequency of all cores are scaled to the same level during program execution by taking advantage of application phase changes. These techniques benefit extensively from application "phase" information that can pinpoint execution regions with different characteristics. They define several CPU frequency phases in which every phase is assigned to a fixed range of Mem/op. However, these task-oriented power management schemes do not take advantage of per-core level DVFS. Core-level power management means managing the power consumption of each core. The authors in [12] and [13] collect performance-related information through on-core hardware called the performance monitoring counter (PMC). There are several limitations in using PMCs. Each CPU has a different set of available performance counters, usually with different names; even different models in the same processor family can differ substantially in the specific performance counters available [38]. Moreover, modern superscalar processors schedule and execute multiple instructions at one time. These "in-flight" instructions can retire at any time, depending on memory access, cache hits, pipeline stalls and many other factors. This can cause performance counter events to be attributed to the wrong instructions, making precise performance analysis difficult or impossible. Some recently proposed power management approaches [16] use an MPC-based control model, which is derived from cluster-level or large-scale data center level power management, such as SHIP [33], DEUCON [39], [15] and [16].
They assume that the actual execution times of real-time tasks are equal to their estimated execution times, and their online predictive model causes significant error in spiky cases due to slow settling from deviation. Moreover, their control architecture allows degraded performance since they do not include performance metrics in the feedback. Lu et al. [35] try to satisfy QoS-critical systems, but they assume that maintaining the same CPU utilization guarantees the same performance. However, this does not hold for CPU-unrelated work, such as data-intensive or I/O-intensive workloads. Rule-based control theory [40] is widely used in machine control [41], [42]; it incorporates precise logic and approximate reasoning into the system design and obtains much more accurate representations. More specifically, rule-based control logic can be viewed as an attempt at formalizing remarkable human capabilities. It is not "fuzzy" but a precise logic of imprecision and approximate reasoning with four distinguishing features: graduation, granulation, precisiation and the concept of a generalized constraint [44]. It also reduces the development time and cycle, simplifies design complexity as well as implementation, and improves the overall control performance [43].
CONCLUSION
Power control for multi-core systems has become increasingly important and challenging. However, existing power control solutions cannot be directly applied to CMP systems because of the new data-intensive applications and complicated job scheduling strategies in CMP systems. In this paper, we present MAR, a model-free, adaptive, rule-based power management scheme for multi-core systems that manages the power consumption while maintaining the required performance. "Model-free" reduces the complexity of system modeling as well as the risk of design errors caused by statistical inaccuracies or inappropriate approximations. "Adaptive" allows MAR to adjust the control methods based on real-time system behavior. The rules in MAR are derived from experimental observations and operators' experience, which provides a more accurate and practical way to describe the system behavior. The "rule-based" architecture also reduces the development cycle and control overhead and simplifies design complexity. The MAR controller is highly responsive (with short detection time and settling time) to workload bouncing because it incorporates more comprehensive control references (e.g., changing speed, I/O wait). Empirical results on a physical testbed show that our control method achieves more precise power control as well as higher power efficiency for optimized system performance compared with four other existing solutions. Based on our comprehensive experiments, MAR outperforms the baseline methods by 12.3-16.1 percent in power saving efficiency and maintains a comparable performance loss of about 0.78-1.08 percent. In our future research, we will consider applying fuzzy logic control to power conservation in storage systems.
Pengju Shang received the BS degree from Jilin University, Changchun, China, and the MS degree from Huazhong University of Science and Technology, Wuhan, China. He is currently working toward the PhD degree in computer engineering at the School of EECS, University of Central Florida, Orlando, and is a principal system engineer at EMC. His specific interests include transaction processing, RAID systems, storage architecture, power management in storage systems, CMP architectures, Hadoop/MapReduce architecture, and cloud computing.
Junyao Zhang received the BE and MS degrees in software engineering from Jilin University, Changchun, China. He is currently working toward the PhD degree in the Computer Science Department, University of Central Florida, Orlando. His primary interests include scalability, fault tolerance (fast recovery) and power management in distributed storage systems. He is a student member of the IEEE.
Ting Liu is currently working toward the BS degree in electronic science and technology at Huazhong University of Science and Technology, China. Her research interests include storage systems, circuit design, and data analysis.
Jun Wang is an Associate Professor of Computer Engineering; and Director of the Computer Architecture and Storage Systems
