Abstract: Learning-based dynamic power management (DPM) techniques, which can adapt to varying system conditions and workloads, have attracted considerable research attention recently. To the best of our knowledge, however, none of the existing learning-based DPM solutions is dedicated to power reduction in multicore processors, although they can be applied by treating each processor core as a standalone entity and conducting DPM for each core separately. In this paper, by including task allocation in our learning-based DPM framework for multicore processors, we are able to manipulate idle periods on processor cores to achieve a better tradeoff between power consumption and system performance. Experimental results show that the proposed solution significantly outperforms existing DPM techniques.
I. Introduction

Dynamic power management (DPM) for electronic systems, which trades off performance for power savings in a controlled fashion, is one of the most successful techniques for energy-efficient computing [2]. Specifically, by taking system workloads into account, DPM reduces power dissipation by selectively shutting down (or lowering the performance of) inactive system components. For example, a microprocessor can be put into sleep mode to reduce power when it has been idle for some time, and woken up when new tasks arrive.
State transitions in DPM, however, involve a nontrivial performance penalty and power cost, and an eager power management policy that turns off system components as soon as they are idle may even increase system power dissipation and degrade performance at the same time. Consequently, optimizing the DPM policy, i.e., the procedure that decides the state of the system components, is a rather complex constrained optimization problem, especially because a component may have multiple operational modes with different power benefits and transition costs [4]. For example, a processor in a deep sleep state consumes less power than one in a light sleep state, but requires more transition time and transition power when woken up. As it is very difficult, if not impossible, to choose the opportune moment to turn off inactive components without knowing the actual system workloads in advance, the key issue in DPM policy optimization is how to efficiently predict and utilize the idle periods of system components to save power without much performance degradation.
Earlier work in this domain first characterized the system workloads and then derived an optimal policy for the system [8]. Stochastic workload models such as Markov decision processes have also been used for complicated systems [2], [21]. The effectiveness of these techniques, however, relies heavily on the accuracy of their workload models, which cannot be guaranteed during the offline optimization process. Recently, several learning-based DPM techniques have been presented in [4], [5], [11], [25], and [26]. By learning the workload characteristics online and adjusting the DPM policy on the fly, learning-based solutions can adapt to varying system conditions and workloads and hence are potentially able to achieve a better power/performance tradeoff than offline solutions.
Today, multicore processors are widely used in electronic systems. Since the system behavior gets more complicated as the number of processor cores increases, building an effective offline DPM solution may involve long trial-and-error iterations for workload modeling and learning-based solutions seem to be a natural choice. To the best of our knowledge, however, none of the existing learning-based DPM solutions are dedicated to power reduction in multicore processors, although they can be utilized by treating each processor core as a standalone entity and conducting DPM for them separately.
In multicore processors, global power management solutions can outperform solutions that manage power locally on a per-core basis. This is because, given the same workloads, different task allocation strategies may lead to significantly different idle periods on each processor core and hence have a significant impact on the efficiency of DPM policies. From a different perspective, even though the workloads are still unknown, the idle periods on processor cores become partially controllable in multicore processors. Motivated by the above, we develop a novel learning-based DPM framework for multicore processors that judiciously allocates tasks to processor cores to achieve a better tradeoff between power consumption and system performance. Considering the significant impact of temperature on leakage power consumption, core temperatures are integrated into our DPM framework to achieve greater savings in overall system power. Specifically, we use Q-learning, a reinforcement learning technique, to learn the system behavior and determine proper processor power state transitions. Because of the huge solution space, we use a neural network in our learning framework to speed up the training process. In addition, we develop a novel core grouping scheme that limits the number of system states and dramatically reduces core allocation time. Experimental results show that our proposed power manager for multicore processors significantly outperforms existing DPM techniques.
The remainder of this paper is organized as follows. In Section II, we present the preliminaries, survey related work in this area, and motivate our work. The proposed learning-based DPM framework and the techniques that improve it are then detailed in Sections III and IV, respectively. Next, Section V summarizes the overall algorithm flow, and Section VI presents our experimental results. Finally, Section VII concludes this paper.
II. Preliminaries and Motivation
A. Power Models
The total power consumption of an electronic system consists of two components: dynamic power and static power [16], [20]. Dynamic power is due to switching activity, manifesting as the charging and discharging of the load capacitance, while static power is primarily caused by leakage current, which is present even when no logic operations are performed. They can be described as
$$P_d = \alpha \, C \, V_{dd}^2 \, f$$
$$P_s = V_{dd} \, I_s$$
where $P_d$ is dynamic power, $\alpha$ is switching activity, $C$ is load capacitance, $V_{dd}$ is supply voltage, $f$ is clock frequency, $P_s$ is static power, and $I_s$ is the cumulative leakage current of all leakage mechanisms, as discussed in [19]. In the near future, subthreshold leakage and gate leakage will be the dominant types of leakage current [15], and the fundamental leakage current formulas for CMOS devices can be used to derive an expression for functional unit leakage power. In this paper, we refer to leakage power as static power and model it similarly to previous work [1], [15] as follows:
where $I_{sr}$ is the reference leakage current at the reference temperature, $T$ is the current temperature, $V_{bs}$ is the body bias voltage, $I_{ju}$ is the junction leakage current, and $\beta$, $\gamma$, and $\theta$ are curve-fitting parameters that depend on the circuit technology.
B. Power Modes
A complex electronic system that supports DPM can be modeled as a finite-state machine with multiple power modes [4], as shown in Fig. 1, which takes the StrongARM SA-1100 processor as an example. This processor has three operational modes: run mode, idle mode, and sleep mode. In run mode, the processor performs operations and dissipates both dynamic power and leakage power. In idle mode, it is ready to operate and consumes only leakage power, while in sleep mode it is deactivated by power management techniques (e.g., clock gating [18]) and dissipates only a reduced leakage power. The processor dissipates different amounts of power in different power modes, and each mode transition incurs both a performance penalty and a power cost. In general, a processor with several sleep modes pays a larger performance penalty and power cost to be woken up from a deeper sleep mode, in which less power is consumed. Despite these transition costs, selective shutdown can still yield net power savings, because systems typically experience nonuniform workloads (manifested as idle periods between tasks).
C. Related Work
There is a large body of related work on dynamic power management in the literature. In this paper, we focus on how to conduct effective power state transitions. From this perspective, existing DPM policies can generally be classified into two categories [5]: heuristic policies and stochastic policies. The timeout policy [13] is one of the most widely used heuristic policies; it simply turns off a component once the component has been idle for longer than a predefined time interval. Timeout policies are simple and robust, but they may react too quickly or too slowly. Stochastic policies, on the other hand, model system state changes and request arrivals as stochastic processes. Markov decision processes [2] and semi-Markov decision processes [21] are often adopted to derive an optimal DPM policy according to these models.
In the above work, DPM policies are determined at the design stage, and they may not work well under varying workload characteristics and environmental conditions. Learning-based DPM solutions are thus attractive, since they are able to adapt to varying system conditions and workloads. Srivastava et al. [24] explored a shutdown mechanism that predicts the length of idle time based on real-life traces and recent computation history. Hwang and Wu [8] predicted the current idle period length using an exponential average of previous idle periods. Steinbach [25] proposed a reinforcement learning-based DPM policy to perform mid-level power management in wireless network cards. Theocharous [27] considered user annoyance as a performance constraint and presented a user-based adaptive power management technique. Dhiman and Rosing [5] proposed to dynamically select the best DPM policy from a set of candidate policies. Tan et al. [26] presented an approach for system-level power management in a partially observable environment, based on model-free constrained reinforcement learning.
There is also some recent work on DPM in multicore processors, which can be categorized into per-core approaches [10] - [12] and chip-wide approaches [6], [7]. Isci et al. [10] proposed an approach that sets the power mode of each core to meet a power budget. Jung and Pedram [11], [12] presented a supervised learning-based DPM framework for multicore processors. Their approach, however, determines power management actions for each core based on its individual workload prediction and hence is not a true multicore power management scheme. Herbert and Marculescu [7] utilized a control theory-based controller to apply dynamic voltage and frequency scaling (DVFS), but the task-to-core allocation is fixed in their approach. Ghasemazar et al. [6] proposed a hierarchical DPM framework under a given throughput constraint that employs core consolidation, coarse-grained DVFS, and task allocation at the chip multiprocessor (CMP) level, and fine-grained DVFS based on closed-loop feedback control at the individual core level. This approach requires task characteristics to be known a priori for task allocation.
Learning-based approaches have also been used for thermal management. In [32], reinforcement learning is applied to thermal management, controlling the system as a whole. In [33], a method for modeling the controlled system is proposed, in which an artificial neural network extracts the system description, a threshold controller handles temperature violations, and a thermal-aware scheduler allocates tasks. Unlike this paper, whose objective is to minimize power consumption, these approaches aim to achieve an even thermal distribution.
D. Motivation
As discussed earlier, in multicore processors we have the flexibility to assign a task to any processor core, and hence the idle periods on processor cores become partially controllable, which can be exploited for power savings. Note that, for the sake of simplicity, we assume that each task is executed on only one core and that there is no dependency between tasks. Fig. 2 presents the motivational example for our work. In this 4-core processor, when task n+1 arrives, allocating it to different cores may lead to very different results. Suppose the task is assigned to core 2. Since this core has been idle for some time, it might be in sleep mode at this time point, and we have to wake it up to process the task, causing extra power dissipation and a performance penalty. If, however, the task is assigned to core 1, we avoid this cost without incurring much performance penalty, since core 1 is about to finish the task assigned to it earlier. Ideally, if we can assign a new task to a processor core that has just finished its earlier-assigned task at that time point, we do not incur any additional power or performance cost.
On the other hand, as leakage current changes superlinearly with temperature for a given supply voltage [15], it is essential to take temperature information into consideration when developing an effective online DPM policy. For example, if we can intentionally allocate tasks to processor cores with lower temperature at little performance penalty, such a temperature-aware DPM policy will yield greater overall power savings than a policy that ignores this information.
Motivated by the above, we develop a novel learning-based DPM framework for multicore processors that judiciously allocates tasks to processor cores to achieve a better tradeoff between power dissipation and system performance, as discussed in the following sections.
III. General System Framework
A power-manageable system [2], [26] can be modeled as shown in Fig. 3, with three components: the service requester (SR), the service queue (SQ), and the service provider (SP). The SR issues requests as the event source, while the SP processes requests. The SQ buffers requests that cannot be processed at once because the SP is busy. The SR has several operational states that represent different service request rates, which can roughly be considered as the number of requests issued per time unit. The SP can have different modes, such as run mode to process requests and sleep mode to save power when there are no requests. The state of the SQ can simply be described as the number of stored requests. The power manager observes the system state (consisting of the SR, SQ, and SP states) and controls the behavior of the SP to achieve power savings at a certain performance penalty. Based on the above, we set up our Q-learning model for the DPM problem in multicore processors in this section. This is a general framework, and we discuss how to make it more applicable and efficient in Section IV.
A. Background on Q-Learning
Q-learning, as one of the prevalent reinforcement learning algorithms, has been applied in many scientific and engineering fields. Since it is also used in our proposed DPM solution, we briefly introduce it in the following.
The basic idea of Q-learning [17], [25] is to decide which action to take based on current system state information in order to maximize the expected future reward, by mapping states to actions. In the standard Q-learning framework (as shown in Fig. 4), an agent is connected to its environment via perception and actions. In each step $t$, the agent observes the system state $s_t$, chooses an action $a_t$ to perform, and then receives a reward $r(s_t, a_t)$ from the environment and observes the new state $s_{t+1}$. Formally, the model consists of the following:
1) a discrete set of environmental states, $S = \{s_t\}$;
2) a discrete set of agent actions, $A = \{a_t\}$.
In each state, there is a Q-value associated with each action. The Q-value is defined as the sum of the reinforcements received when the system performs the associated action and then follows the given policy thereafter. Given this definition, it is easy to derive the equivalent of the Bellman equation for Q-learning
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
which is the objective to be maximized in Q-learning. Accordingly, when a reward is received in each learning cycle, we update the Q-value according to the following equation:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \mu \left[ r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$
where $r(s_t, a_t)$ is the reward received in state $s_t$ when action $a_t$ is taken, $\mu$ is the learning rate, and $\gamma$ is the discount rate. The following points should be noted.
1) The learning rate $\mu$ determines to what extent newly acquired information overrides old information, while the discount rate $\gamma$ determines the importance of future rewards.
2) The number of possible system states and actions must be finite, and as the number of states and actions increases, the Q-table grows and the learning accuracy deteriorates quickly.
3) If the agent always takes the action with the highest Q-value for a given state, it might end up in a local maximum, because one action might be taken repeatedly without exploring new actions.
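The update described above can be sketched as a single tabular Q-learning step. This is a minimal illustration of the textbook rule, not the paper's implementation; the function name `q_update`, the state labels, and the values chosen for the learning rate and discount rate are hypothetical.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, mu=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the new Q(state, action).

    mu is the learning rate and gamma the discount rate (illustrative values).
    """
    # Best Q-value reachable from the next state under the current table.
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # Move the old estimate a fraction mu of the way toward the target.
    q[(state, action)] += mu * (target - q[(state, action)])
    return q[(state, action)]

# Hypothetical example: two actions, all Q-values initially zero.
q_table = defaultdict(float)
actions = ["run", "sleep"]
new_value = q_update(q_table, "s0", "sleep", reward=-1.0,
                     next_state="s1", actions=actions)
```

With an all-zero table, the new estimate is simply the learning rate times the reward, which shows how $\mu$ controls how strongly a single sample overrides the old value.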
B. State Space
In our Q-learning model, system states are composed of the states of the SQ and SP only, because the state of the SR is unknown a priori. To simplify the problem, we first consider how to describe the state space of a single-core system and then extend it to multicore processors.
For single-core processors, we use a two-dimensional vector $(s_t, q_t)$ to describe the state. Here, $s_t$ stands for the processor power state, e.g., run mode or sleep mode, and $q_t$ represents the queue status, which indicates how many task requests are waiting in the queue. For example, we may let $q_t = 0$, 1, and 2 represent the cases in which the number of requests in the queue is 0, 1, and larger than 1, respectively. A single core then has as many as $(n_c \cdot n_q)$ states, where $n_c$ is the number of power states and $n_q$ is the number of queue states. For an $n$-core processor, the system state combines the states of all $n$ cores, so the number of system states grows to $(n_c \cdot n_q)^n$. Such a huge state space is a critical problem for learning-based approaches, because many more training samples are needed and learning accuracy deteriorates quickly.
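To get a feel for how quickly the state space grows, the short sketch below evaluates the count $(n_c \cdot n_q)^n$ that Section IV-A quotes for the system state, assuming the per-core states combine independently. The choice of three power states and three queue states is only an illustrative assumption.

```python
def num_system_states(n_power, n_queue, n_cores):
    """Number of joint system states when each core has n_power * n_queue states."""
    return (n_power * n_queue) ** n_cores

# Illustrative values: 3 power states (run, idle, sleep) and 3 queue states (0, 1, >1).
for cores in (1, 4, 8):
    print(cores, num_system_states(3, 3, cores))
```

Even with only nine states per core, an 8-core processor already has tens of millions of joint states, which is why a plain Q-table is impractical.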
C. Action Space
In our model, we sample the system state at each time point when a task request arrives. The power manager then observes the current system state and determines an action for the SP to perform. As shown in Fig. 2, the arrival of task n+1 determines the time point $t_2$. At that time point, the power manager samples the system state and chooses an action to apply. The action is composed of two components: the core to which task n+1 is assigned, and the power state that the assigned core enters after finishing this task. In other words, the power manager presets the power modes for all the cores. When idle time slots appear on the cores, the cores are switched to the preset power modes.
The action can be represented as $(core_t, mode_t)$. The variable $mode_t$ stands for the preset mode of the assigned core, and $core_t$ is the index of the core to which the task is assigned. In this case, the size of the action space is $(n_c \cdot n)$, where $n_c$ is the number of power modes of each core and $n$ is the number of cores.
D. Reward
The objective of DPM techniques is usually to achieve the maximum power savings at a slight performance penalty. To achieve a tradeoff between the two, the reward function used in our Q-learning model is expressed as
where $R$ is the reward, $P$ is the mean power dissipation, $RT$ is the response time, and $\beta$ is the coefficient that trades off power against performance. Changing $\beta$ adjusts the weights of mean power and $RT$ in the reward function to satisfy system demands. A larger $\beta$ means that $RT$ is of greater concern. The computation of the mean power dissipation $P$ at runtime relies on information collected from sensors, e.g., temperature sensors and current sensors. With the collected current information, $P$ can be calculated by multiplying the supply voltage by the total current at runtime. As for $RT$, the power manager can simply record the arrival time and finishing time of each task and use their difference as $RT$. Furthermore, hardware performance counters embedded in microprocessors to evaluate performance and even localize thermal problems [34] can be utilized to provide more accurate and reliable system information. Note that, for the sake of simplicity, we assume there is no dependency between tasks. If this assumption is relaxed, the proposed framework is still applicable. A simple solution is to modify the reward function to account for task dependency as an additional $RT$ penalty, since delaying a task also delays its subsequent tasks and thus increases the $RT$ penalty.
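As a rough illustration of how such a reward could be computed at runtime, the sketch below assumes a negated weighted sum, $R = -(P + \beta \cdot RT)$. This specific form is an assumption made only for illustration and may differ from the expression used in the paper; the function names and numeric values are likewise hypothetical.

```python
def response_time(arrival_time_s, finish_time_s):
    """RT is the difference between a task's finishing time and arrival time."""
    return finish_time_s - arrival_time_s

def mean_power(supply_voltage_v, total_current_a):
    """P is obtained by multiplying the supply voltage by the measured total current."""
    return supply_voltage_v * total_current_a

def reward(mean_power_w, response_time_s, beta):
    # Assumed form: lower power and lower response time both raise the reward,
    # and a larger beta makes response time weigh more heavily.
    return -(mean_power_w + beta * response_time_s)

# Hypothetical measurements for one task.
p = mean_power(supply_voltage_v=1.0, total_current_a=2.0)
rt = response_time(arrival_time_s=0.25, finish_time_s=0.75)
r_low_beta = reward(p, rt, beta=1.0)
r_high_beta = reward(p, rt, beta=4.0)
```

Under this assumed form, raising $\beta$ makes the same response time cost more reward, which matches the statement that a larger $\beta$ emphasizes $RT$.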
At the time point with system state $s_t$, the power manager chooses action $a_t$. The system state then transfers from $s_t$ to $s_{t+1}$, and the corresponding reward is received.
From the above, we can conclude that the basic Q-learning framework cannot be directly applied to our problem because of the huge state space and action space. We therefore propose to use a neural network to approximate the Q-values and improve the convergence speed of the learning process. In addition, when the number of cores in a multicore processor is large, we propose to group cores with the same core state together and change the action from core selection to group selection. We detail our techniques for speeding up Q-learning in the following section.
IV. Proposed Techniques to Improve Learning Process
A. Q-Function Approximation
One of the most challenging issues in our work is the huge size of the combined state-action space, $(n_c \cdot n_q)^n \cdot (n_c \cdot n)$, which increases exponentially with the number of processor cores $n$. In its simplest form, Q-learning stores Q-values in a table. This not only requires a prohibitive amount of memory, but also requires a huge number of training samples to learn the Q-values.
To address the above issue, we use a neural network (NN) [17] to approximate the Q-function. Hence, the key task in Q-learning for our problem becomes how to estimate the mapping $Q(s, a): s \times a \to Q$. We adopt a feedforward neural network to model the mapping $Q(s, a)$ and represent the value function of the state-action pair $(s_t, a_t)$. A variety of neural networks are applicable to function approximation; we use the back propagation neural network (BPNN) [17], one of the most prevalent neural network algorithms for function approximation.
As shown in Fig. 5, the neural network has three layers: an input layer, a hidden layer, and an output layer. In the input layer, the input vector is the composite of the state vector $(s_1, \ldots, s_n)$ and the action vector $a$. We use a binary encoding scheme to denote the input vectors for every possible system state and action. In the hidden layer, each hidden node $H_i$ employs the following sigmoid function:
where $w_{i,j}$ and $u_i$ are the parameters of the neural network, $i$ denotes one bit of the binary input, $p$ denotes the number of bits in the binary input for the system state, and $q$ denotes the number of bits in the binary input for the action. In the output layer, the approximated Q-value function is given by
As a whole, this neural network describes a nonlinear mapping $Q(s, a)$. At each time $t$, the parameters of the network $(w_{1,1}, \ldots, w_{p+q,h};\ u_1, \ldots, u_h)$ are updated by gradient descent with the help of the back-propagation algorithm [17]. The errors propagate backward from the output nodes to the inner nodes to adjust the network's weights. When applied to Q-learning, the input of the back propagation neural network is the state-action pair, and its output is the Q-value corresponding to that state-action pair.
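The following sketch shows one way a three-layer approximator of this kind might look in plain Python: binary inputs, a sigmoid hidden layer, and a single linear output trained by one gradient step per sample. It is an illustration under stated assumptions rather than the paper's network: the absence of bias terms, the linear output, the weight initialization range, the learning rate, and all names (`QNetwork`, `train_step`) are hypothetical choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class QNetwork:
    """Binary inputs -> sigmoid hidden layer -> linear Q-value output."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # w[i][j]: weight from input bit i to hidden node j.
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                  for _ in range(n_inputs)]
        # u[j]: weight from hidden node j to the output.
        self.u = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

    def forward(self, x):
        hidden = [sigmoid(sum(x[i] * self.w[i][j] for i in range(len(x))))
                  for j in range(len(self.u))]
        q = sum(u_j * h_j for u_j, h_j in zip(self.u, hidden))
        return q, hidden

    def train_step(self, x, target, lr=0.1):
        """One gradient-descent step on 0.5 * (target - q)^2; returns the error."""
        q, hidden = self.forward(x)
        err = target - q
        for j, h_j in enumerate(hidden):
            # Error signal propagated back to hidden node j (uses the old u[j]).
            delta_j = err * self.u[j] * h_j * (1.0 - h_j)
            self.u[j] += lr * err * h_j
            for i, x_i in enumerate(x):
                self.w[i][j] += lr * delta_j * x_i
        return err

# Hypothetical encoding: 4 state bits followed by 2 action bits.
net = QNetwork(n_inputs=6, n_hidden=4)
x = [1, 0, 1, 1, 0, 1]
errors = [abs(net.train_step(x, target=-1.0)) for _ in range(200)]
```

In a Q-learning loop, `target` would be the reward plus the discounted maximum of the network's own outputs for the next state, so the table lookup is replaced by a forward pass per candidate action.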
B. Action Selection
The action selection mechanism is an important component of Q-learning. There are two problems to tackle in our action selection phase.
First, if the agent always takes the action with the highest Q-value for a given state, it might end up in a local maximum, because one action might be taken repeatedly without exploring new actions, which prevents us from finding other solutions. In other words, action selection may greatly affect learning effectiveness because of the tradeoff between exploitation and exploration. To balance these two aspects, we employ the $\varepsilon$-greedy method for action selection, so that the agent can reinforce the evaluation of actions known to be good and also explore unknown actions, which helps to avoid local maxima. We give the action with the highest Q-value a high selection probability $(1 - \varepsilon)$, and all the actions equally share the remaining probability $\varepsilon$. The probability of choosing a certain action $a_i$ is
$$P(a_i) = \begin{cases} 1 - \varepsilon + \varepsilon / num, & \text{if } a_i \text{ has the highest Q-value} \\ \varepsilon / num, & \text{otherwise} \end{cases}$$
We consider $\varepsilon = 10\%$ here, and $num$ is the number of actions. That is, with a probability of 10% the action is drawn uniformly from all actions instead of being chosen greedily, in order to avoid local maxima.
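A minimal sketch of this selection rule follows, in which the best action receives probability $1 - \varepsilon$ plus its equal share of $\varepsilon$ and every other action receives only its share. The function names and the example Q-values are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Return the index of the selected action."""
    num = len(q_values)
    if rng.random() < epsilon:
        # Explore: every action (including the best one) is equally likely.
        return rng.randrange(num)
    # Exploit: take the action with the highest Q-value.
    return max(range(num), key=lambda i: q_values[i])

def selection_probabilities(q_values, epsilon=0.1):
    """Probability of each action being chosen under the rule above."""
    num = len(q_values)
    best = max(range(num), key=lambda i: q_values[i])
    return [(1 - epsilon) + epsilon / num if i == best else epsilon / num
            for i in range(num)]

# Hypothetical Q-values for four candidate actions.
probs = selection_probabilities([0.2, 0.9, -0.1, 0.4], epsilon=0.1)
greedy_choice = epsilon_greedy([0.2, 0.9, -0.1, 0.4], epsilon=0.0)
```

With four actions and $\varepsilon = 10\%$, the best action is chosen with probability 0.925 and each of the others with probability 0.025.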
Second, since one of the motivations of this paper is to reduce unnecessary power state transitions and thereby avoid transition costs, our algorithm tends to allocate tasks successively to certain cores. This may induce temperature stress on those cores and cause reliability concerns. To tackle this problem, our action selection mechanism is further modified. Each time, if the current temperature of the chosen core is higher than a predefined temperature threshold, we give up this action and select another action from the remaining cores by $\varepsilon$-greedy again, until a core whose temperature is below the threshold is found. In the special case of rather heavy workloads, all the cores may exceed the temperature threshold. In that case, the entire system has to be halted to protect system reliability. We set the temperature threshold above which no task is allocated to a processor core to 85 °C. Note that the focus of this paper is on reducing energy consumption rather than on thermal management; we therefore simply resort to the above threshold-based scheme, which is used in most commercial processors to meet reliability requirements. If thermal reliability is a critical concern in certain applications, more advanced thermal management techniques, e.g., an adaptive-threshold scheme, can be employed. The proposed solution can cooperate with thermal management techniques that do not control task allocation and power states.
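The retry rule described above can be sketched as follows. For brevity, the sketch treats an action as a core index only and omits the preset power mode, which is a simplification; the function name `select_core` and the example temperatures are hypothetical, while the 85 °C threshold is the value stated in the text.

```python
import random

T_THRESHOLD_C = 85.0  # allocation threshold stated in the text

def select_core(q_values, core_temps_c, epsilon=0.1, rng=random):
    """Return a core index, or None if every core exceeds the threshold.

    q_values[i] is the Q-value of assigning the task to core i.
    """
    remaining = list(range(len(q_values)))
    while remaining:
        if rng.random() < epsilon:
            core = rng.choice(remaining)                       # explore
        else:
            core = max(remaining, key=lambda i: q_values[i])   # exploit
        if core_temps_c[core] <= T_THRESHOLD_C:
            return core
        # Core is too hot: give up this action and retry among the rest.
        remaining.remove(core)
    return None  # all cores too hot: the caller has to halt the system

# Hypothetical example: the core with the best Q-value is above the threshold.
chosen = select_core([0.9, 0.5, 0.1], [90.0, 70.0, 60.0], epsilon=0.0)
all_hot = select_core([0.9, 0.5, 0.1], [90.0, 95.0, 88.0], epsilon=0.0)
```

Returning `None` leaves the halt decision to the caller, which keeps the selection logic separate from whatever shutdown mechanism the platform provides.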
C. Dealing With Temperature Information
In the general DPM framework discussed in Section III, temperature information is not included in the system state description. However, since leakage power is expected to be dominant in future technologies and increases superlinearly with temperature, it is essential to take temperature information into account explicitly to obtain an effective DPM policy. Note that this temperature information is assumed to be acquired from temperature sensors deployed in the processor.
We define a lower temperature threshold $T_{lower}$ and an upper temperature threshold $T_{upper}$, and divide the temperature range $(T_{lower}, T_{upper})$ evenly into $(n_T - 2)$ segments, which, together with the ranges below $T_{lower}$ and above $T_{upper}$, gives $n_T$ temperature states. For single-core processors, we add one dimension to describe the temperature state, so the state vector of the core becomes $(s_t, q_t, T_t)$, where $s_t$ is the power state, $q_t$ is the queue state, and $T_t$ is the temperature state. For multicore processors, we can simply use a $3n$-dimensional vector $(s_{t1}, s_{t2}, \ldots, s_{tn};\ q_{t1}, q_{t2}, \ldots, q_{tn};\ T_{t1}, T_{t2}, \ldots, T_{tn})$ to represent the overall system state. With temperature information included, the power manager can distinguish the effect of temperature on power and optimize power consumption by selecting a proper action. The number of system states increases to $(n_c \cdot n_q \cdot n_T)^n$, where $n_c$, $n_q$, and $n_T$ are the numbers of power states, queue states, and temperature states of one core, respectively. As this shows, it is easy and intuitive to include any necessary information (e.g., temperature information) in the system state description if this information is useful for power management.
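The temperature discretization can be sketched as a small mapping from a sensor reading to one of $n_T$ states: one state below $T_{lower}$, $(n_T - 2)$ equal-width states in between, and one state above $T_{upper}$. The threshold values and the treatment of readings that fall exactly on a boundary are assumptions made for illustration.

```python
def temperature_state(temp_c, t_lower, t_upper, n_t):
    """Map a temperature reading to a state index in 0 .. n_t - 1."""
    if n_t < 3:
        raise ValueError("need at least one segment between the two thresholds")
    if temp_c <= t_lower:
        return 0                      # at or below the lower threshold
    if temp_c >= t_upper:
        return n_t - 1                # at or above the upper threshold
    width = (t_upper - t_lower) / (n_t - 2)
    segment = int((temp_c - t_lower) / width)
    # Guard against floating-point rounding at the top of the range.
    return 1 + min(segment, n_t - 3)

# Hypothetical thresholds: 45 C and 85 C with n_T = 6, i.e. four 10-degree segments.
states = [temperature_state(t, 45.0, 85.0, 6) for t in (40.0, 50.0, 60.0, 84.9, 90.0)]
```

The resulting index can be used directly as the $T_t$ component of the per-core state vector $(s_t, q_t, T_t)$.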
D. Processor Core Grouping
Our proposed DPM algorithm can be implemented in software and/or with extra hardware. For each task, the power manager selects the proper action before execution and receives system information to train the neural network by updating its parameters; both steps incur energy and performance costs. Through cost analysis, we estimate that, for each task, a pure software implementation needs $O(n^2)$ multiplication and addition operations, where $n$ is the number of processor cores. For example, in a 4-core processor with a CPU frequency of 1 GHz, assuming that a multiplication takes four CPU cycles and an addition takes one CPU cycle, the mean allocation time for each task is on the order of 10 μs, which is usually not a big concern because the average task execution time is much longer (e.g., on the order of 10 ms).
Our proposed approach can work quite well in 4-core and 8-core processors. However, since the cost of our proposed approach is increased quadratically with respect to the number of processor cores, we can estimate that if the core number becomes huge, the allocation cost may not be negligible. For instance, it can be estimated that the allocation cost is about 1% in 64-core processors, and about 3% in 100-core processors. To reduce allocation cost, we can easily change our current DPM model to classify cores with similar state into groups, redefine the actions to select a certain group, and then choose a core in the selected group hierarchically. This can effectively reduce system state number and allocation cost, and hence, improve the algorithm scalability and applicability to a processor with more cores.
We build the processor core groups using the core state $(s_t, q_t, T_t)$ as the group index. As shown in Fig. 6, the circles represent processor cores, and the core state vectors are specified within the circles. For such an 8-core processor, we can simply count the cores with the same state vector and maintain a table of the number of cores in each group (see the right rectangle in Fig. 6). With the grouping scheme, the allocation cost is $O(N_{group})$, where $N_{group}$ is the number of groups. Given the number of power states $n_c$, the number of queue states $n_q$, and the number of temperature states $n_T$, the number of groups $N_{group}$ is $(n_c \cdot n_q \cdot n_T)$, a fixed value that does not change with the number of processor cores $n$; therefore, the allocation cost is also fixed. Furthermore, for an $n$-core processor, the number of system states is reduced to $n^{(n_c \cdot n_q \cdot n_T)}$, which increases only polynomially with the number of cores $n$, instead of exponentially as without the grouping scheme. Consequently, when $n$ is large, the grouping scheme results in far fewer system states than the nongrouping scheme, which benefits the learning process.
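The grouping step amounts to counting cores by their state vector. The sketch below builds such a table with `collections.Counter` and compares the two state-count expressions quoted above; the example core states and the parameter values are hypothetical.

```python
from collections import Counter

def group_cores(core_states):
    """Count how many cores share each (power, queue, temperature) state."""
    return Counter(core_states)

def states_without_grouping(n_c, n_q, n_t, n):
    return (n_c * n_q * n_t) ** n

def states_with_grouping(n_c, n_q, n_t, n):
    return n ** (n_c * n_q * n_t)

# Hypothetical 8-core snapshot; each tuple is (power state, queue state, temp state).
cores = [("run", 1, 2), ("run", 1, 2), ("sleep", 0, 0), ("idle", 0, 1),
         ("run", 1, 2), ("sleep", 0, 0), ("idle", 0, 1), ("run", 2, 3)]
groups = group_cores(cores)

# Illustrative sizes: 2 power, 2 queue, 2 temperature states on a 64-core processor.
flat = states_without_grouping(2, 2, 2, 64)
grouped = states_with_grouping(2, 2, 2, 64)
```

For these illustrative sizes the grouped representation has $2^{48}$ states against $2^{192}$ without grouping; the benefit only appears once the number of cores exceeds the number of groups.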
V. Overall Algorithm Flow
To sum up, the proposed Q-learning algorithm is illustrated in Fig. 7. It starts with initialization (Line 1), and the procedure is repeated until there are no more episodes. For every episode, the Q-values are computed via the back propagation neural network. We then select an action in the $\varepsilon$-greedy manner (Line 5), and take the action to transfer the system state from $s$ to $s'$ and receive the reward (Line 7). When we get the reward as feedback, we update the parameters of the back propagation neural network using the gradient descent algorithm (Line 8), and replace the state $s$ with the next state $s'$ (Line 9).
Note that an offline training phase with a convergence criterion (e.g., the normalized error of the approximated Q-value is less than 5%) can be used, if necessary, to improve solution quality at the beginning of task execution. The proposed solution trains both the Q-learning model and the neural network online, treating each task as one training sample.
The block diagram of our proposed DPM framework is shown in Fig. 8. Each time a task arrives, the power manager obtains the power, queue, and temperature states of all the cores and translates these core states into the group state representation. This group state representation is then fed into the BPNN-based Q-function approximation component to calculate the Q-values that are used to select an action. With the selected action, the system processes the task accordingly. At the same time, the power spent on the task and its associated RT are both used to calculate the reward, which is used together with the group state and the selected action to update the parameters of the BPNN.
VI. Experimental Results
A. Experimental Setup
To evaluate the effectiveness of the proposed DPM solution, we implement a simulator in C for multicore processors to obtain the power dissipation and performance index. The recently updated HotSpot 5.0 [22] is integrated into our simulator to acquire the operating temperature of the processor cores. In reality, temperature information can be collected directly from on-chip temperature sensors and then fed into the power manager. Hypothetical homogeneous 4-core and 8-core processors are used in our experiments, wherein all processor cores have the same execution time for a given task; their power and thermal features are based on those of the Intel Atom Processor N450 [9]. All the experiments are conducted on a 2.8 GHz PC with 4 GB RAM.
As discussed earlier, the proposed DPM policy does not require information about the characteristics of each task, since our concerns in this paper are to reduce unnecessary power state transitions (and hence avoid transition costs) and to set a proper power mode for each processor core. From this viewpoint, the particular power value of each individual task does not have a high impact on the effectiveness of the proposed technique. Instead, the task flow of the entire workload, characterized by the task interarrival time and the task service time, is the decisive factor. Consequently, synthetic workloads are used in our experiments, wherein the dynamic power value of each task is generated using PTscalar [14], and the leakage power value is calculated according to the power models introduced in Section II, based on SPEC2000 benchmark programs [23]. Task arrival is assumed to be a Poisson process with arrival rate λ, as is widely done in prior papers [28], [29], while the service time follows an exponential distribution with mean 1/μ in our simulation. Denoting the number of processor cores as n, the task load ρ of each core is λ/(nμ). A larger ρ-value means that the system is running with heavier workloads. Each task set is composed of 1000 tasks.
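The task-flow model above can be reproduced with a few lines of code. The following is a sketch under stated assumptions, not the simulator's generator: the function name `generate_task_set`, the seed, and the time unit are hypothetical, and the per-task power values produced by PTscalar are not modeled.

```python
import random


def generate_task_set(n_cores, rho, mean_service, n_tasks=1000, seed=0):
    """Return (arrival_time, service_time) pairs for one synthetic task set."""
    rng = random.Random(seed)
    mu = 1.0 / mean_service        # service rate; the mean service time is 1/mu
    lam = rho * n_cores * mu       # arrival rate, from rho = lambda / (n * mu)
    tasks, now = [], 0.0
    for _ in range(n_tasks):
        # Exponential interarrival gaps with rate lam give Poisson arrivals.
        now += rng.expovariate(lam)
        tasks.append((now, rng.expovariate(mu)))
    return tasks


# Example: 4 cores, core load 0.4, mean service time 0.1 (arbitrary time unit).
tasks = generate_task_set(n_cores=4, rho=0.4, mean_service=0.1)
```

Fixing ρ and varying `mean_service` rescales the arrival rate accordingly, which matches the later experiments that keep ρ constant while sweeping the service time.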
There are numerous prior papers in the field of power management. To the best of our knowledge, however, none of them targets exactly the same problem for multicore processors as ours. Hence, for comparison, we implement the reinforcement learning-based policy presented in [26], which is the state-of-the-art DPM approach reported in the literature, and use it to conduct DPM for each processor core individually. The comparison with this policy demonstrates the advantage of manipulating idle periods globally in multicore processors, a feature that is not available for single-core processors. In addition, we construct a global heuristic-based DPM approach for multicore processors, as described in Fig. 9, to demonstrate the effectiveness provided by learning. This heuristic-based approach is built on the observation from our motivational example in Fig. 2. If there are idle cores in the system, assigning tasks to these cores does not induce any transition costs; hence, they should be the first choices. Otherwise, we need to choose between cores in run mode and cores in sleep mode. Neither kind of core has an absolute advantage over the other; hence, a predefined parameter H is used to make the choice by observing the task queue state. We also compare the proposed policy with the oracle policy [3], an ideal policy that is obtained by assuming the task arrivals are known in advance and hence induces no performance penalty. As for our proposed DPM methodology, we implement two versions: 1) a basic version (denoted as Proposed_b) that uses the system state representation (see Section III) without considering temperature information and 2) a temperature-aware version (denoted as Proposed_t). The learning parameters in (3) can be determined manually according to designers' experience. In the experiments, the learning rate μ and discount rate γ in (3) are set to 0.1 and 0.8, respectively.
How to determine effective learning parameters has, in fact, been studied in many machine learning papers. One widely used solution is cross-validation [30], a model validation technique for assessing how well the results of a statistical analysis generalize.
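The heuristic baseline can be sketched as below. Because Fig. 9 is not reproduced in this section, the exact way H is compared with the queue state is an assumption: the sketch keeps using the least-loaded running core while its queue length is below H and wakes a sleeping core otherwise. The function name `heuristic_allocate` and the `(mode, queue_length)` core representation are also hypothetical.

```python
RUN, IDLE, SLEEP = "run", "idle", "sleep"


def heuristic_allocate(cores, H):
    """cores: list of (mode, queue_length) pairs; returns the chosen core index."""
    if not cores:
        raise ValueError("at least one core is required")
    # Idle cores cost no power state transition, so they are always preferred.
    idle = [i for i, (mode, _) in enumerate(cores) if mode == IDLE]
    if idle:
        return idle[0]
    running = [i for i, (mode, _) in enumerate(cores) if mode == RUN]
    sleeping = [i for i, (mode, _) in enumerate(cores) if mode == SLEEP]
    if running:
        best = min(running, key=lambda i: cores[i][1])
        # ASSUMED rule: queue on a running core while its queue is shorter
        # than H (or when no sleeping core exists); otherwise wake a core.
        if cores[best][1] < H or not sleeping:
            return best
    return sleeping[0]


# Example: no idle core, one running core with one queued task, one sleeper.
snapshot = [(RUN, 1), (SLEEP, 0)]
choice_h1 = heuristic_allocate(snapshot, H=1)   # wakes the sleeping core
choice_h2 = heuristic_allocate(snapshot, H=2)   # queues on the running core
```

Under this assumed rule, a larger H wakes sleeping cores less often, which saves transition energy but lets queues grow.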
B. Results and Discussion
First of all, we consider dynamic power only and conduct experiments to investigate the effectiveness of the proposed learning-based DPM in making proper power state transitions.
In Tables I-III, column 1 indicates the core load ρ of the task sets, and column 2 presents the resulting mean energy consumption (En.) and mean RT for each task; both the 4-core and 8-core cases are presented. Each column for the proposed policy has two subcolumns, for $\beta_1 = 100$ and $\beta_2 = 1000$, respectively. Here, $\beta_1$ and $\beta_2$ are the parameter values that trade off power dissipation against RT in (5). The column for the heuristic policy in Tables II and III has three subcolumns, showing the results with different parameter values (H = 1, 2, and 3), respectively.
As shown in Tables I and II, our proposed DPM technique obtains better energy savings than both the learning-based policy [26] and the heuristic policy in Fig. 9. To be specific, for the 4-core case with $\beta_2$ and core load ρ = 0.4, the proposed policy achieves 24.91% energy savings with 14.47% RT reduction compared with the single-core learning-based policy in [26], and 16.32% energy savings with 17.19% RT reduction compared with the simple heuristic policy for multicore processors with $H_2 = 2$.
One notable finding from the results is that our DPM technique provides more power savings when the core load ρ is moderate. This is mainly because, when the task sequence is too tight (or too loose), most of the cores stay in run (or idle) mode, and thus there are not many power state transitions that can be manipulated by the proposed policy. As for the mean RT, the proposed policy achieves a considerable decrease in most cases, except for the case with ρ = 0.2. This is expected because, with low ρ, the room for RT improvement is limited. If we change the heuristic policy parameter H (from $H_1 = 1$ to $H_3 = 3$), the heuristic policy obtains more energy savings at the cost of performance loss. When we increase the tradeoff parameter β (from $\beta_1 = 100$ to $\beta_2 = 1000$), the energy consumption increases while the RT decreases. This is because a larger β-value means that RT is more important in the reward function definition [see (5)].
To find out the efficacy of our proposed DPM technique under various task service times, we keep the core load at ρ = 0.2 and vary the task service time from 0.04 to 0.4. As Figs. 10 and 11 indicate, the proposed policy is the closest to the oracle policy and performs much better than the other two kinds of policies. It is worth noting that Proposed_t does not outperform Proposed_b significantly. This is because, although Proposed_t is a temperature-aware solution, only dynamic power is considered in this case, so Proposed_t cannot exploit the impact of temperature on leakage power.
Next, we investigate the effectiveness of the proposed DPM policies with leakage power considerations. As shown in Tables IV-VI, our proposed DPM (Proposed_b) still achieves more power savings than the other two policies when temperature-related leakage power is taken into account. We can see that the proposed DPM (Proposed_b) usually performs better in the cases with moderate ρ than in the cases with small or large ρ. In Table X, we compare the basic and temperature-aware versions of the proposed DPM framework. It can be seen that the temperature-aware version (Proposed_t) works better than the basic version (Proposed_b). On average, Proposed_t further achieves about 5.01% power reduction and 6.59% RT reduction compared with Proposed_b. Specifically, taking the case with (1/μ, ρ) = (0.1, 0.4) in 8-core processors as an example, Proposed_t has 5.42% power reduction and 6.01% RT reduction. This observation shows the importance and necessity of integrating temperature information and demonstrates the effectiveness of the proposed temperature-aware DPM framework. We have also varied the service time from 0.04 to 0.4 with core load ρ = 0.2 and plotted the curves (Figs. 12 and 13) to describe the trends of energy consumption and RT with respect to service time. As can be seen, Proposed_t and Proposed_b still perform better than the other policies across the various service times.
We count the number of power state transitions in 4-core processors with various ρ values to show more details. As demonstrated in Table VII, compared with the policy in [26], Proposed_b and Proposed_t significantly reduce state transitions, which supports our argument that the proposed approach avoids unnecessary power state transitions. Note that more transitions are removed when the ρ value is moderate. This is consistent with the above discussion that, when the task sequence is too tight (or too loose), most of the cores stay in run (or idle) mode, and thus there are not many state transitions that can be manipulated and reduced. In our implementation, we use three temperature states (see Section IV-C for details on temperature states) to describe the temperature of each core. In Table VIII, we report the statistics on temperature states for the 4-core processor with ρ = 0.6, using states 0, 1, and 2 to represent low, moderate, and high temperature, respectively. As can be seen, Proposed_b assigns 28.0% of the tasks to low-temperature cores, 44.9% to moderate-temperature cores, and 27.1% to high-temperature cores, while Proposed_t assigns more tasks to low-temperature cores and fewer tasks to high-temperature cores. This is because Proposed_t is designed as a temperature-aware solution that considers the impact of temperature on power consumption and hence tends to avoid high temperatures in this case with a relatively tight task sequence.
In addition, to understand the overall temperature profile, we report the temperature values of 4-core processors in Table IX. Due to the homogeneity of the processor cores, different cores experience similar workload pressures and have almost the same temperature characteristics. For the sake of clear presentation, in Table IX we report the maximum, minimum, and average temperatures of only one core (rather than all the cores) for the cases with different ρ-values. As expected, with a larger ρ-value, the system experiences heavier workloads and hence has higher temperatures. In particular, in the case with ρ = 0.8, the maximum temperature value is 88.3 °C, exceeding the specified threshold of 85 °C. Such events trigger the protection of overheated processor cores, which avoids allocating tasks to the affected core and thereby reduces its temperature. In our experiments, even with the heaviest workloads (i.e., ρ = 0.8), only 0.2% of the task allocations trigger such protection events, so they are very rare. The associated performance cost is less than 0.1% from the perspective of the entire task sequence. Note that, although the core temperature sometimes exceeds the specified threshold of 85 °C, the maximum temperature at which processors are allowed to run is generally much higher (e.g., 95 °C for AMD Mobile Sempron processors and 100 °C for Intel Atom N450 processors [9]).
Finally, we conduct some experiments to obtain the cost of our proposed approach, considering a software implementation of the DPM policy. The energy cost is the energy consumed by running our learning algorithm, while the RT cost is the task allocation time. The cost of our proposed approach without the grouping scheme is shown in Table XI, obtained by running the learning algorithm and recording its runtime. In the cases of 4-core and 8-core processors, the cost is very small and almost negligible. In 4-core processors, the mean energy cost is less than 0.02% of the mean energy consumption, while the RT cost is about 0.01% of the mean RT. In 8-core processors, the cost is also quite small, about 0.04% in both energy and RT. As discussed earlier, without grouping, the cost increases quadratically with the number of processor cores n and is no longer negligible when n is large. With the grouping scheme, however, the allocation cost is fixed because the number of groups $(n_c \cdot n_q \cdot n_T)$ is fixed. In our experiments, this allocation cost is about 0.05% in terms of energy and RT, which is quite small.
VII. Conclusion
Power consumption is a key issue in the design of computing systems today, especially in portable devices, which are more sensitive to battery life. In this paper, by including task allocation in our learning-based DPM framework for multicore processors, we are able to manipulate idle periods on processor cores to achieve a better tradeoff between power consumption and system performance. Since leakage power is strongly related to temperature and will be dominant in system power consumption in the future, we integrate temperature information into our proposed DPM framework so that additional power savings can be achieved. Experimental results based on synthetic workloads demonstrate the effectiveness of our proposed approach and justify the importance and necessity of considering temperature during the power optimization process.
For the sake of simplicity, we assume no dependencies among tasks in this paper; however, our proposed learning-based framework remains applicable after lifting this assumption. In addition, although we assume that a task assigned to a certain core is not migrated to another core during its execution, task migration can in fact be integrated into our optimization framework by simply treating a migrated task as a new task. We plan to investigate the impact of task dependencies on our proposed framework and the integration of task migration in the future.
