An SMDP-Based Approach to Thermal-Aware Task Scheduling in NoC-based
  MPSoC platforms by Niknia, Farnaz et al.
IEEE TRANSACTIONS ON JOURNAL NAME,  MANUSCRIPT ID 1 
 
An SMDP-Based Approach to Thermal-Aware 
Task Scheduling in NoC-based MPSoC 
platforms 
Farnaz Niknia, Kiamehr Rezaee, Vesal Hakami 
Abstract— One efficient approach to control chip-wide thermal distribution in multi-core systems is the optimization of online 
assignments of tasks to processing cores. Online task assignment, however, faces several uncertainties in real-world systems 
and does not show a deterministic nature. In this paper, we consider the operation of a thermal-aware task scheduler, 
dispatching tasks from an arrival queue as well as setting the voltage and frequency of the processing cores to optimize the 
mean temperature margin of the entire chip (i.e., cores as well as the NoC routers). We model the decision process of the task 
scheduler as a semi-Markov decision problem (SMDP). Then, to solve the formulated SMDP, we propose two reinforcement 
learning algorithms that are capable of computing the optimal task assignment policy without requiring the statistical knowledge 
of the stochastic dynamics underlying the system states. The proposed algorithms also rely on function approximation 
techniques to handle the infinite length of the task queue as well as the continuous nature of temperature readings. Compared 
to related research, the simulation results show nearly 6 Kelvin reduction in system average peak temperature and 66 
milliseconds decrease in mean task service time. 
Index Terms— Multi-core Processors, Online Task Assignment, Thermal Management and Reinforcement Learning. 
1 INTRODUCTION 
1.1 Research Background 
Older generations of computers (the so-called single-core
systems) relied on a single processing unit,
which could run only a single thread at any one time.
Various parallelization techniques were applied on dif-
ferent levels, but despite the added complexity, increased 
power consumption, and heat generation, no significant 
improvement was achieved [1, 2]. The demand for higher 
performance and the shortcomings of single-core proces-
sors brought about the advent of Chip Multi-Processors 
(CMP). A CMP comprises several independent processing 
cores, as well as Network on Chip (NoC) routers and 
shared resources, managed by a single operating system 
(OS). Rather than relying on a large and powerful pro-
cessing unit to improve performance, CMPs use several 
processing cores to run tasks in parallel to achieve more 
efficiency. Some types of CMPs further boost their ener-
gy-efficiency by also applying distinct voltage/frequency 
levels to processing cores depending on conditions and 
workloads [1].  
Despite all the advantages associated with multi-cores, 
by integrating a large number of tiny transistors into a 
small chip, CMPs may have high power density and pro-
duce more heat, causing material fatigue, partial or com-
plete failure, and shorter device lifetime. Although cool-
ing equipment is used to prevent high temperature and 
thermal emergencies, specific workloads can aggressively 
use the processing resources, generating more heat than 
can be dissipated [2]. Moreover, not all processing units 
can be used simultaneously because of the high power 
and energy consumption of a CMP. This means that de-
spite comprising several processing cores, the desired 
performance is not achievable. This phenomenon is called 
dark silicon [3]. One of the critical factors in increasing the 
power usage of a CMP is the temperature. High tempera-
ture increases the leakage power, thereby increasing the 
total energy consumption of the chip [2].  
The dark silicon phenomenon, on top of the damage
looming from high temperatures, gives rise to the dynamic
thermal management (DTM) problem. The research on
DTM aims at mitigating thermal emergencies and pre-
venting thermal crisis in multi-core processors. Although 
failure-avoiding technologies like AMD’s multi-point 
thermal control or Intel’s Adaptive Thermal Monitor are 
being built into modern processors [2], the research is still 
ongoing, and there is a large body of work on thermal 
management in CMPs. In one broad taxonomy, one can 
classify the existing works into three groups: Dynamic 
Power Management (DPM), Dynamic Voltage and Fre-
quency Scaling (DVFS), and Task assignment.  
DPM [4-6] exploits the fact that the whole device or at 
least some of its parts may remain idle during specific 
time intervals and waste energy. These parts of the device 
(e.g. processing cores) can be turned off or switched to 
low power (sleep) states in idle periods in order to save 
• F. Niknia is with the Iran University of Science and Technology, Tehran,
Iran. E-mail: niknia_farnaz@alumni.iust.ac.ir.
• K. Rezaee is with the Iran University of Science and Technology, Tehran,
Iran. E-mail: k_rezaee@comp.iust.ac.ir.
• V. Hakami is with the Iran University of Science and Technology, Tehran,
Iran. E-mail: vhakami@iust.ac.ir.
energy, while being brought back to active state when 
work becomes available. DPM can also be considered as a 
DTM technique given that it turns off the whole chip or 
hot cores in case of thermal emergencies so that they can
cool down [2]. Although DPM is advantageous in
terms of energy and thermal management, frequent tran-
sitions between sleep and active states incur additional 
latency. 
The idea behind DVFS [4, 7-9] is to reduce the power 
consumption of a processor via run-time adaptation of 
voltage and frequency (V-F) levels. DVFS relies on the 
fact that the dynamic power consumption of a system 
scales linearly with frequency and quadratically with 
voltage [10]. Despite its benefits, DVFS may result in in-
creased execution time by applying lower frequency. 
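The scaling relation above is commonly written as follows. This is a sketch of the standard CMOS dynamic-power model; the symbols \alpha (switching activity factor), C (switched capacitance), and k (a common scaling factor applied to both voltage and frequency) are introduced here for illustration and are not defined in the paper.

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V^{2} f
\qquad\Longrightarrow\qquad
\frac{P_{\mathrm{dyn}}(kV,\, kf)}{P_{\mathrm{dyn}}(V,\, f)} = k^{3}
```

Under this model, lowering both voltage and frequency to 80% of their nominal values (k = 0.8) reduces dynamic power to about 0.8^3 ≈ 51% of nominal, while a task whose run time is inversely proportional to frequency takes roughly 1/0.8 = 1.25 times as long.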
Both DPM and DVFS are considered “reactive” tech-
niques as they typically take no measure to prevent ther-
mal crises and are only activated when a predetermined 
temperature threshold is crossed [2]. Task assignment, on 
the other hand, can be considered as a “proactive” tech-
nique as it is aimed at preventing thermal emergencies 
rather than post-crisis activation. Also, unlike DPM and 
DVFS, it has no adverse effect on system performance. 
Depending on the position of the cores and their instan-
taneous temperature, tasks can be dispatched to pro-
cessing cores located in different parts of the chip, thus 
distributing the heat in a chip-wide manner and reducing 
peak temperatures. 
Our proposed scheme in this paper is a hybrid of DVFS
and thermal-aware task assignment. We assume that ran-
dom types of tasks may arrive at the system at stochastic 
time instants, and wait their turn in a queue to be as-
signed to processing cores. Tasks are assigned to cores by 
a dispatcher unit in the OS scheduler, while the voltage 
and frequency of the cores are regulated through a ther-
mal manager component. Aside from handling stochastic 
task arrivals, our proposed thermal management scheme 
is a systematic methodology for coping with other realistic
(yet uncertain) factors. In particular, a recently assigned 
task may be randomly paired with an already running 
one, thus involving a number of NoC routers in addition 
to processing cores. As argued in [11] and [12], the uncer-
tain NoC router involvements should also be accounted 
for by a task scheduler to more wisely control chip heat 
distribution. 
In the sequel, to better highlight the research gaps and 
motivate our contributions, we review and categorize the 
most relevant literature. 
1.2 Related Work 
The simplest scheme that has addressed the thermal 
management problem in CMPs using task assignment is 
[13]. The authors have investigated the unwanted thermal 
cases, and presented a mechanism named Thermal-
Aware Scheduling (TAS) that aims to moderate or even 
eliminate these thermal issues of CMPs. TAS has two 
forms: 1) the newly dispatched task is assigned to the 
coolest idle core. This is the simplest and easiest way to 
implement a thermal aware algorithm; 2) a cost is calcu-
lated for each idle core considering the temperature of 
individual cores as well as the temperature of cores in the 
neighborhood. Moving forward, several more advanced 
research works have investigated thermal-aware task as-
signment. Based on the availability/unavailability of the 
complete task graph at the time of assignments, task as-
signment approaches can be categorized into two groups: 
batch and online. 
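The two forms of TAS described above can be sketched as follows. This is a minimal illustration, not the implementation of [13]: the function names, the neighbor map, and the `weight` applied to the mean neighbor temperature are assumptions made for the example, since the exact cost function is not given here.

```python
from typing import Dict, List, Optional


def coolest_idle_core(temps: Dict[int, float], idle: List[int]) -> Optional[int]:
    """Form 1: return the idle core with the lowest current temperature."""
    if not idle:
        return None
    return min(idle, key=lambda core: temps[core])


def neighbor_aware_core(
    temps: Dict[int, float],
    idle: List[int],
    neighbors: Dict[int, List[int]],
    weight: float = 0.5,  # hypothetical weight on the neighborhood term
) -> Optional[int]:
    """Form 2: return the idle core with the lowest cost, where the cost
    combines the core's own temperature with the mean temperature of its
    neighbors (assumed cost function, for illustration only)."""
    if not idle:
        return None

    def cost(core: int) -> float:
        nbrs = neighbors.get(core, [])
        # A core without neighbors falls back to its own temperature.
        nbr_mean = sum(temps[n] for n in nbrs) / len(nbrs) if nbrs else temps[core]
        return temps[core] + weight * nbr_mean

    return min(idle, key=cost)


# Four cores in a row: 0 - 1 - 2 - 3 (temperatures in degrees Celsius).
temps = {0: 60.0, 1: 90.0, 2: 70.0, 3: 62.0}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
idle = [0, 3]

form1_choice = coolest_idle_core(temps, idle)
form2_choice = neighbor_aware_core(temps, idle, neighbors)
```

In this example the first form selects core 0 because it is the coolest idle core, whereas the second form selects core 3 because core 0 sits next to the hottest core on the chip.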
In batch schemes, assignment decisions are made when
all tasks have arrived at the system and are ready to run.
Online methods, on the other hand, decide on the alloca-
tion of each incoming task individually while other tasks 
are currently running, entering the system or have not 
arrived yet. Therefore, unlike batch methods, at the time 
of each allocation, the system is in a different thermal 
condition which should be taken into account in assign-
ments. Online methods can in turn be divided into two 
groups: discrete-time and continuous-time. With a discrete-
time operation, each task is assigned at the beginning of 
fixed time intervals which leads to increased service time 
due to increased task waiting times. 
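The waiting-time penalty of discrete-time operation can be illustrated with a small calculation. This sketch assumes that an idle core is always available, so that the only delay is the wait until the next dispatch instant; the function name and the slot length are hypothetical.

```python
import math


def slotted_wait(arrival: float, slot: float) -> float:
    """Extra waiting time of a task that arrives at time `arrival` when
    dispatching happens only at integer multiples of `slot`."""
    next_boundary = math.ceil(arrival / slot) * slot
    return next_boundary - arrival


# Arrival instants in milliseconds; dispatching every 10 ms.
arrivals = [3, 12.5, 20]
waits = [slotted_wait(a, 10) for a in arrivals]
mean_wait = sum(waits) / len(waits)
```

A continuous-time scheduler that reacts to each arrival event would incur none of this added delay under the same assumption, which is the motivation for the continuous-time category.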
From another perspective, in many real-world systems,
the scheduler faces several uncertainties, including:
Stochastic workload inter-arrival times: Generally,
workloads arrive at stochastic times, with typically no
prior knowledge of their arrival instants [14]. Note that
methods in the batch category do not consider arrival
uncertainties, because all tasks have arrived at the system
before the assignment starts and no new tasks enter the
system while assignments are being decided.
Stochastic workload characteristics: In real-world
problems, workload characteristics may not be known 
beforehand. This means that the execution times and the 
thermal impact induced by running each task are not giv-
en [14]. 
Random pairing: In general, the tasks arriving into the
system can engage in inter-process communication (IPC) as
clients, servers or both. A
client is a process that requests a service from some other 
process. A server, on the other hand, responds to a client 
request. Many processes act as both a client and a server, 
depending on the situation. For example, a word pro-
cessing task might act as a client in requesting a summary 
table from a spreadsheet process acting as a server. The 
spreadsheet process, in turn, might act as a client in re-
questing the latest inventory levels from an automated 
inventory control application [15]. In the absence of a de-
terministic and known task communication graph, IPC 
can be particularly challenging as each task (once as-
signed to a core), can randomly pair with some other cur-
rently running task. These random pairings can be due to 
several reasons; for example, IPC can be initiated by clip-
board sharing when a user performs “cut, copy, and 
paste” operations, or it can arise from message passing 
for the exchange of randomly generated sensor data be-
tween a writer and a sensor program. IPC is assisted by 
and in fact relies on NoC routers, and since the pairings 
are random in general [16], [17], the routers to be in-
volved in a communication are not known beforehand. 
Given that the NoC routers consume considerable power 
and produce a significant amount of heat compared to 
other chip components [18], it is vitally important to con-
sider IPC-related uncertainties in our thermal-aware task 
assignments. 
Stochastic chip thermal profile: the thermal impact of 
circuit components, the impact of thermal interface mate-
rials and cooling condition, make the thermal profile of 
the chip stochastic [18].   
In the following, we review the related work in batch 
and online categories and with respect to the way the 
above-mentioned uncertainties have been dealt with:  
• Within the category of batch methods, authors in [16]
have observed that more heat will be dissipated if the 
processor reaches a high temperature earlier. Moreo-
ver, the authors have used a thermal model [17] in 
which the power consumption of a processor is calcu-
lated using the air ambient temperature and chip-
dependent parameters. Exploiting this model, the au-
thors indicate that in a particular time interval, a con-
stant amount of energy is required to run a task. 
Therefore, to prevent temperature increase at the sec-
ond half of the time interval, the temperature should 
rise as much as possible in the first half. Relying on 
these observations, they have proposed a greedy approach
that, at each step, runs the hottest job that does not
violate the thermal threshold; in this way, the operating
temperature is raised to the threshold as quickly as
possible. Also, to predict the
future temperature of each core, a thermal predictor is 
presented using the hardware specifications, thermal 
sensor readings and the steady state temperature of an 
application which can be obtained by running each 
benchmark on each core until the temperature does 
not change anymore. Although this method is shown 
to outperform particular scheduling algorithms, it uses 
a deterministic thermal model to predict future tem-
perature without considering thermal uncertainty. 
Moreover, selecting the hottest job requires the thermal
profile of the tasks to be given, which ignores the
workload-related (characteristics and pairing) uncertainties.
[19] is a pioneering work that has addressed the tradeoff
between energy consumption and thermal balance in a
2D mesh NoC architecture. There, a heuristic based on a
multi-objective ant colony algorithm is presented to
explore the task-to-core mapping space and find the
Pareto-optimal front that optimizes both energy
consumption and hotspot temperature. The energy
associated with the involved routers and links is taken
into account to calculate the total energy consumption
and to estimate the temperature of each tile.
However, the authors in [19] have conveniently assumed
that the task communication graph of the applications
is given a priori, thus circumventing the uncertainties
related to workload characteristics. The
thermal model presented in [20] calculates the temper-
ature difference between two tiles by only using the 
physical distance of tiles, chip-dependent parameters 
and the power consumption of the cores. This means 
that the thermal model is deterministic and thermal 
uncertainties are not considered. In [20], workload 
characteristics are assumed to be known a priori as 
well (i.e., the authors use an application task graph). A 
Kernighan-Lin bi-partitioning-based approach [21] is
given in [22] to map the graph of an application onto a
mesh-based NoC architecture with the aim of optimizing
both communication cost and thermal variance.
Similar to previous approaches, this approach does not 
consider thermal and workload uncertainties. In [11], 
the authors have shown that the NoC routers have rel-
atively small chip area and high power consumption 
compared to other on-chip components, thus they can 
potentially become hotspots. Accordingly, the authors
analyze the importance of taking into account the NoC
router power consumption in application mapping,
and consider the thermal effect of both cores
and the NoC routers in their formulation. The authors
have specifically addressed the tradeoff between temperature
and network latency. Due to thermal correlation,
reducing peak temperature requires placing high-power
cores far from each other, while the goal of reducing
average latency may require these cores to be
as close as possible. To address this issue, the authors 
have presented a temperature-aware partitioning and 
placement mapping approach using hierarchical bi-
partitioning of the cores. However, the thermal model 
used to estimate the temperature of the cores is deter-
ministic and does not take thermal uncertainties into 
account. Also, the algorithm exploits the task commu-
nication graph, sidestepping the issue of workload 
characteristics. Realizing that thermal correlation be-
tween heat sources may cause hotspots, [23] describes 
a thermal model in which the temperature of each location
on a die depends on several factors such as the on-chip
heat sources and their distances. Using this model,
[23] searches the application graph and maps
high-communication-flow tasks that do not communicate
directly onto a column of the mesh topology; when
no tile remains, it turns to the right or left,
whichever yields the minimal thermal correlation, with
the aim of reducing both peak temperature and communication
cost. This method relies completely on a
fixed application task graph given a priori. The work
presented in [24] proposes a temperature-aware task 
scheduling approach for streaming applications on 
mesh-based NoC systems. There, a temperature model 
is built to estimate the temperature increase for pro-
cessing a particular task. Using this model, the thermal 
profile of the tasks is first extracted at each V-F level.
Then, for assigning each task, a priority value is calcu-
lated for each core using the estimated temperature in-
crease of the task, the thermal effect of adjacent cores, 
the communication overhead and the location of the 
core on chip. Finally, the task is assigned to the core 
with the highest priority. However, the authors again 
assume that there is no uncertainty regarding the 
workload characteristics such as: the maximum num-
ber of clock cycles, the energy consumption, and tem-
perature increase caused by each task execution. In 
both [25], [26], the authors rely on the fact that the 
cores located at the corners and the edges of the chip 
will cool faster if they become hot because they are 
surrounded by fewer cores. The maximum voltage
and frequency level is used initially, then the V-F level 
is lowered at each step. When no lower V-F is availa-
ble, tasks are migrated to cooler cores which causes 
displacement overhead. Authors in [27] exploit rein-
forcement learning to learn the best assignment policy 
for multi-threaded applications and adapt to workload 
changes. There, each action comprises assigning a task 
to a core and determining its working V-F level. The 
system states include thermal stress and aging that are 
calculated using performance counters and thermal 
sensor readings. To limit the state space, a discretiza-
tion method is used to divide the working range of 
state elements into separate intervals and a representa-
tive is defined for each one. The values obtained from 
the thermal sensors and performance counters at each 
time, fall into particular intervals. Then, the original 
values are replaced with the relevant representatives. 
Although discretization limits the state-space and 
simplifies learning, it results in less learning precision. 
In addition, the size of intervals and their numbers af-
fect the number of states and the learning algorithm, 
as well. Also, since the standard Q-learning algorithm is
used in [27] for learning the best decision-making policy,
storing Q-values for each state-action pair causes
significant memory overhead. This is crucial, especial-
ly for embedded systems where the on-chip memory 
is limited [10]. In [28] an assignment approach is pre-
sented to reduce thermal hotspots and temperature 
gradients. The authors have defined several tempera-
ture thresholds for categorizing the processing cores 
based on their current temperature. Threads are also 
classified into three groups according to the tempera-
ture increase that they induce to a core. When a thread 
is ready to be assigned, it is assigned, according to its
class, to a core with a suitable temperature. Although
the classes of cores and threads are dynamically adjusted
depending on new thermal conditions, it takes
several iterations to determine a suitable group for
each thread.
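The discretization step and the memory cost of tabular Q-learning discussed above for [27] can be made concrete with a short sketch. The interval edges, the representatives, and the state and action counts below are hypothetical values chosen for illustration; they are not taken from [27].

```python
from bisect import bisect_right
from typing import Sequence


def discretize(value: float, edges: Sequence[float], reps: Sequence[float]) -> float:
    """Replace a continuous reading with the representative of the interval
    it falls into. `edges` are the sorted interval boundaries, and `reps`
    holds one representative per interval (len(reps) == len(edges) + 1)."""
    return reps[bisect_right(edges, value)]


def q_table_entries(bins_per_element: int, num_elements: int,
                    num_cores: int, num_vf_levels: int) -> int:
    """Number of Q-values a table must store when every state element is
    discretized into `bins_per_element` intervals and each action is a
    (core, V-F level) pair."""
    num_states = bins_per_element ** num_elements
    num_actions = num_cores * num_vf_levels
    return num_states * num_actions


# Hypothetical temperature intervals (degrees Celsius) and representatives.
edges = [50.0, 70.0, 90.0]
reps = [40.0, 60.0, 80.0, 100.0]
binned = discretize(63.7, edges, reps)

# Assumed setup: 8 state elements, 4 cores, 3 V-F levels.
coarse = q_table_entries(4, 8, 4, 3)
fine = q_table_entries(8, 8, 4, 3)
```

With these assumed numbers, four intervals per state element already require 786,432 table entries, and doubling the resolution to eight intervals multiplies the table size by 256. This illustrates the tension between learning precision and memory overhead noted above.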
Online methods can also be further sub-classified as 
discrete-time and continuous-time: 
• Among the discrete-time methods, in [25], the authors
schedule the task queue under thermal and perfor-
mance constraints such that a balanced temporal and 
spatial thermal profile is achieved. The scheduler ob-
tains the temperature feedback of each individual core. 
Then, considering core temperature values and layout 
positions, a task queue length is assigned to each core 
and at next step, DVFS is applied to balance perfor-
mance, temperature and task queue for each core indi-
vidually. Applications are assigned to the cores based 
on the lengths of the queues, core layouts and the 
temperature readings from on-chip thermal sensors. 
This method takes arrival uncertainties into account by 
assuming that arrival intervals are Poisson distributed. 
Also, it does not exploit any workload characteristics 
for making assignment decisions. However, thermal 
uncertainties are not considered. Authors in [29] have 
implemented two simple heuristics: 1) selecting the 
coolest core for task allocation; 2) prioritizing cooler 
cores that have idle neighbors. They also presented an 
Adaptive Random scheme that assigns a probability to 
each core based on its thermal history. This value is in-
creased if the core does not exceed a predetermined 
temperature, thus cores that keep temperature below 
the threshold for a longer time are more likely to be se-
lected to run a task. It has been shown that Adaptive 
Random outperforms approaches that rely only on 
current temperature to make assignment decisions. 
However, it incurs memory cost for storing thermal 
history for each core [18]. Another shortcoming of this 
approach is that a list of jobs with their arrival times is 
provided for the scheduler beforehand which means 
that the stochastic task arrivals cannot be accounted
for. In [30] a scheduling algorithm is proposed for
single-core processors to maximize the total number of 
completed tasks while keeping the temperature below 
a threshold. Time is divided into fixed intervals, and at 
each interval, the algorithm decides whether to schedule
a task and which task to schedule. The future temperature
of the system can be calculated assuming that
the heat contribution of the tasks is given a priori.
This algorithm has also been extended to multicore
processors, but the temperature of each core is
calculated in the same way as in the single-core case, i.e.,
the heat transfer between adjacent cores is not considered.
A similar strategy is used in [31] which has the same 
drawbacks as [30]. In [10], multi-threaded applications 
are executed for several iterations to learn the optimal 
assignment policy for each application with the aim of 
optimizing energy and temperature while addressing 
thermal aspects (peak temperature, average tempera-
ture, and thermal cycling). There, the thread allocation 
is separated from frequency scaling. The thread alloca-
tion is changed at long-term intervals using a greedy 
heuristic. A thermal overhead is then calculated after 
assigning each application using thermal cycling, av-
erage and peak temperatures. If the thermal overhead 
is reduced in comparison with the last allocation, the 
allocation is retained, otherwise, it is returned to the 
previous one. At the next step, the frequency selection 
is performed at every decision epoch using tabular Q-
learning. Thermal and arrival uncertainties are ig-
nored in this approach. In [26], for executing multi-
threaded applications, a finite state machine is defined 
with five states: start, wait, read, calculate and assign. In
the start state, all variables are initialized, then the al-
gorithm switches to the wait state where the system 
waits for a new time quota to run a new application. 
The read state is the next system state in which tem-
perature values are collected from the on-chip thermal 
sensors and used in the calculate state where a cost ma-
trix is built using system utilization data besides the 
location of each core. At last, the algorithm switches to 
the assign state and allocates tasks to cores according 
to minimal cost principle. However, arrival and ther-
mal uncertainties are not considered. The problem of 
online task assignment is addressed more systemati-
cally in [18] where the assignment problem is formu-
lated as a Markov Decision Process (MDP) [32]. The 
only system state is the vector of temperature values 
obtained from the on-chip thermal sensors, each action 
refers to assigning a task to a core, and the instantane-
ous reward is the temperature margin. Reinforcement 
learning is used to solve the MDP. The main drawback
of [18] is that it is ambiguous with respect to the nature
of time: on the one hand, the authors mention
that a task is allocated to an idle core as soon as a task
arrival event occurs, which is reminiscent of a continuous-time
operation; on the other hand, all their equations
are based on a discrete-time MDP formulation.
Also, since the task queue length is not explicitly considered
as a state component in the MDP, the uncertainties
associated with task arrival and workload characteristics
have not been captured in their formulation.
• As for continuous-time methods, in [33], the ready
tasks are scheduled such that thermal emergencies are 
reduced in the presence of fixed ambient temperature. 
When a task is ready to run, a utilization factor is cal-
culated, and then, core configurations are defined to 
select cores in a round-robin fashion. When the
temperature of a core exceeds the threshold, the configuration
is changed. The well-known Global Earliest
Deadline First (GEDF) algorithm is used as the scheduling
policy in [33]. This method, however, assumes periodic
task arrivals and does not consider arrival uncertainty.
In [34], a method is presented to balance the usage
of wireless links by avoiding congestion over wireless
routers and to distribute temperature across the
chip in many-core systems-on-chip. A 2D mesh wireless
NoC is virtually divided into several regions
that each region contains a wireless router and each 
core falls into a region that has less hop count to its 
wireless router. Then, each application is mapped re-
gion by region in a round robin manner to balance 
thermal distribution while periodically rotating the 
Cartesian coordinate system to locally balance the 
thermal profile within each region. Finally, in each region,
the tasks of an application are mapped to the cores with the
smallest indices. In [34], intra-application communications
are given within a task communication graph
while inter-application communications are considered
to be highly dynamic. Also, there is no prior
knowledge of applications’ entrance time. Therefore, 
arrival and pairing uncertainties have been taken into 
account. 
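As an illustration of the history-based selection idea behind the Adaptive Random scheme of [29] reviewed above, the following sketch keeps one weight per core and samples an idle core in proportion to its weight. Only the increase of the weight for cores that stay below the threshold comes from the description above; the decrease for hot cores, the step size, the floor value, and all names are assumptions made for the example.

```python
import random
from typing import List, Sequence


def update_weights(weights: Sequence[float], temps: Sequence[float],
                   threshold: float, step: float = 0.1,
                   floor: float = 0.05) -> List[float]:
    """Raise the weight of every core at or below `threshold`; lower it
    (never below `floor`) for cores above the threshold. The decrease and
    the floor are assumptions, not part of the description of [29]."""
    return [
        w + step if t <= threshold else max(floor, w - step)
        for w, t in zip(weights, temps)
    ]


def pick_core(weights: Sequence[float], idle: Sequence[int],
              rng: random.Random) -> int:
    """Sample one idle core with probability proportional to its weight."""
    return rng.choices(list(idle), weights=[weights[c] for c in idle], k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = [1.0, 1.0, 1.0]
weights = update_weights(weights, temps=[60.0, 85.0, 70.0], threshold=80.0)
chosen = pick_core(weights, idle=[0, 2], rng=rng)
```

Because the weights accumulate over time, a core that has stayed cool for many intervals is favored over one that is only momentarily cool, at the cost of storing one value per core.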
 
 
TABLE 1
RELATED WORK
Ref. | Optimization goal | Method | Adaptive to
[16] | Throughput & utilization | Greedy | ×
[19] | Energy & temperature | Multi-objective ant colony | ×
[22] | Communication & thermal-variance | Kernighan-Lin bi-partitioning | ×
[11] | Temperature & latency | Hierarchical bi-partitioning | ×
[23] | Temperature & communication | Mapping high communication flow tasks that do not communicate directly, on the non-adjacent tiles | Thermal profile
[24] | Peak temperature & temperature distribution | Prioritizing processing cores based on temperature | Temperature
[27] | The average temperature & thermal cycling | Tabular Q-learning | Workload
[28] | Hotspots & temperature gradients | Categorizing processing cores and threads | Temperature of cores & threads
[25] | Balanced temporal and spatial thermal profile | Closed-loop control system | Thermal profile
[29] | Temperature | Assigning probabilities to cores based on their temperature | Thermal history of cores
[30] | The total number of completed tasks | Predicting future temperature of the chip | Workload & temperature
[31] | Weighted throughput | Using heat characteristics of the jobs | Workload & temperature
[10] | Energy & temperature | Heuristic and reinforcement learning algorithms | Workload & temperature
[33] | Thermal balancing | Selecting number of cores based on application utilization | Workload
[26] | Heat distribution & ensuring the reliability | Finite state machine | Thermal profile
[18] | Peak temperature | Q-learning | Thermal profile
[34] | Global & local thermal distribution | Dividing cores to regions - Round-robin scheduling | ×
 
1.3 Research Gap and Motivations
In many real-world systems, multi-core task processing is 
subject to various types of uncertainties. Reviewing the 
prior work, there is no scheme that accounts for all these 
uncertainties within a single unified framework. Probably 
the most neglected type of uncertainty is associated with 
IPC and task pairings; in fact, tasks of different applica-
tions may communicate with unknown patterns, and this 
type of inter-task interaction can be highly dynamic [34]. 
The random task pairings are particularly important in 
thermal-aware task assignment because they involve the 
NoC routers. These routers consume a significant amount 
of power in comparison with other on-chip components 
[4], and as they have a relatively small area, they can po-
tentially become hotspots themselves [11]. Also, in most 
practical cases, there is no deterministic prior knowledge 
about the routers to be involved in each communication 
as the inter-application communication graph is not 
available beforehand [18, 34]. To the best of our 
knowledge, only [34] and [18] have partially taken task 
pairing uncertainties into account. Although they seem to 
be one step ahead of the other approaches, they also have 
several shortcomings. Aside from its ambiguous formulation
(cf. Section 1.2), the MDP model proposed in [18] accounts
for IPC uncertainty only implicitly, through its
indirect impact on the thermal profile of the chip. As for
the work in [34], it is specialized for NoCs with wireless
routers. Also, it fails to address other important types of 
uncertainties (e.g., workload characteristics), and pro-
vides no adaptability to thermal profile. 
1.4 Overview of the Proposed Scheme and 
Contributions 
Here, we give an overview of the proposed scheme to-
gether with a summary of our contributions in this paper: 
• In order to account for the most common uncertainties
prevalent in multi-core processing scenarios (includ-
ing: stochastic task arrivals, unknown workload char-
acteristics, random pairings, and unpredictable ther-
mal profile of the chip), we systematically model CMP 
as a stochastic dynamic system. We also use the SMDP 
(Semi-Markov Decision Process) formalism [35] to 
formulate the online task assignment problem with the 
objective of maximizing the long-run average temperature
margin of the chip. SMDP is among the fairly
general continuous-time optimization
frameworks from stochastic control theory. Since
tasks in a CMP arrive and depart at stochastic times,
SMDP is a much more efficient choice than
discrete-time formalisms (which would lead to increased
task waiting times and degraded system performance).
Each system state in our proposed SMDP
comprises both discrete and continuous elements:
the continuous element is the core temperature, and 
the discrete element includes the number of tasks in-
side the system (both in the task queue and in-service) 
as well as the idle/busy status of the processing cores. 
Each control action includes selecting a processing 
core to assign a task and determining its working V-F 
level. The operating system scheduler acts as the 
SMDP controller which is comprised of three units: 
dispatcher (assigns tasks to cores), thermal manager (ap-
plies working V-F level), and thermal monitor (reads the 
implemented on-chip thermal sensors).  
In principle, the optimal scheduling policy can be cal-
culated using model-based techniques for SMDPs such 
as dynamic programming algorithms [35]. However, 
these techniques rely on the availability of the prior 
knowledge of the system statistics (e.g., task arrivals, 
execution times, inter-task communication times, as 
well as the chip thermal dynamics). As these statistics 
cannot be realistically assumed to be available in all 
cases, to make up for this lack of knowledge, we propose
a model-free scheme in which the scheduler
agent learns from sample-based experience. Our
proposed model-free solution is built on the
well-known Q-learning algorithm
from the MDP literature [35]. Standard tabular-based
Q-learning, however, is suited particularly for dis-
crete-state MDPs with relatively small state-action 
space dimension. Our assignment problem has a 
mixed continuous-discrete state space structure (corre-
sponding to the temperature of processing cores and 
the number of tasks inside the system, respectively). 
Therefore, tabular Q-learning is not feasible
for this problem, as it would need infinite memory
space. In addition, a large state space requires
more iterations for the Q-values to converge. As
such, we propose modifications of Q-learning which 
rely on function approximation techniques to handle 
the infinite length of the task queue as well as the con-
tinuous nature of temperature readings. In particular, 
we come up with two modified Q-learning algorithms 
as described below: 
 
 DVFS-Enabled: In this variant, each action includes 
both assigning a task to a processing core as well as 
determining its working V-F level. To combat the 
curse of dimensionality associated with standard Q-
learning, we exploit Radial Basis Functions (RBFs) [36]
to obtain a featurized representation of states that
includes only the thermal features of the chip state.
These features are obtained from on-chip thermal sensors
that are implemented next to each core. The RBFs are fed
with the temperature values, and the summarized thermal
features they produce are used in function approximation
for estimating the Q-values. The main advantage
of the proposed DVFS-enabled scheme is reduced dy-
namic power dissipation, which is due to applying 
DVFS in addition to making intelligent task assign-
ment decisions. 
 IR: In this variant, each action is defined only as as-
signing a task to an idle core, without making any 
change to the working V-F level of the cores. Similarly
to our first scheme, IR uses function approximation with
RBFs instead of tabular Q-learning, but the
difference lies in the definition of the features. In addition
to thermal features, in our IR scheme, we exploit
several RBFs for extracting combined state-action fea-
tures which include: 1) the Euclidean distance of a core 
from the chip center, 2) the Euclidean distance of a 
core from the hotspot, and 3) the ratio of the number 
of tasks whose communication paths are going to in-
clude the hotspot if they pair with the task assigned to 
some given core to the total number of tasks that may 
pair with the tasks running on the same given core. 
Our proposed IR scheme tries to assign tasks far from
the chip center and the hotspot, effectively preferring
the cores located at the chip corners and edges.
Compared to DVFS-Enabled, IR needs a smaller number
of learning parameters, making way for a higher number
of approximation functions. This can be exploited in
larger mesh sizes, where the number of learning
parameters of DVFS-Enabled would grow unmanageably.
As evidenced by simulations, IR outperforms
DVFS-Enabled on 7x7 processors, indicating IR’s
superior scalability.
 The processor and scheduler are simulated using multicore
and thermal simulators, namely Sniper [37],
Hotspot [38], McPAT [39], DSENT [40], and Hotfloorplan
[38], which are used for simulating the CMP, the thermal
profile, core power consumption, router power dissipation,
and the core floorplan, respectively. The simulation
flow starts with simulating a CMP in the Sniper multicore
simulator and running each Splash2 [41] benchmark
on a single core to obtain its characteristics, such as
execution time. Then, using McPAT and DSENT, the
power consumption of the processing cores and NoC
routers is obtained. Finally, Hotspot is fed with the
floorplan of each processing core produced by Hotfloorplan
and with the power consumption of the routers and
cores. Hotspot thermally simulates the processor and
produces the thermal profile of the whole CMP. At the
final step, the simulation results are compared with
two of the previous approaches. The results indicate
that the proposed DVFS-Enabled and IR schemes reduce
the long-run average peak temperature by 6 K
and 5 K, respectively. They also reduce the average task
service time by 110 and 40 milliseconds, respectively.
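The RBF-based function approximation mentioned in the contributions above can be illustrated with a minimal sketch. It assumes Gaussian basis functions and a linear combination of features per action; the function names, the centers, the width, and the weights are hypothetical and are not taken from the paper, whose exact feature design is given in later sections.

```python
import math


def rbf_features(temps, centers, width):
    """Gaussian RBF features of a temperature vector, one feature per center.

    `centers` and `width` are illustrative assumptions, not values from the paper.
    """
    features = []
    for center in centers:
        sq_dist = sum((t - c) ** 2 for t, c in zip(temps, center))
        features.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    return features


def q_value(temps, action, weights, centers, width):
    """Linear approximation: Q(s, a) is a weighted sum of the RBF features."""
    phi = rbf_features(temps, centers, width)
    return sum(w * f for w, f in zip(weights[action], phi))
```

With such a representation, the number of learned parameters is the number of basis functions times the number of actions, independent of how finely the temperatures are measured.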
2 SYSTEM MODEL 
2.1 System Architecture  
In this section, we first describe the system model for a multi-
core processor including the processing cores, the operating 
system scheduler and the task queue. Then, we elaborate on the 
assumptions we make about the dynamics of the task arrivals, 
chip thermal profile, as well as the task pairings. 
2.1.1 Multi-core processor  
 We consider a multicore processor with an NoC-based 
mesh topology including M processing cores and M NoC 
routers (Figure 1). We denote by ℳ =
{𝑚1,𝑚2,𝑚3, … ,𝑚𝑚 , … ,𝑚𝑀} the set of processing cores and 
by ℛ = {𝑟1, 𝑟2, 𝑟3, … , 𝑟𝑚 ,… , 𝑟𝑀} the set of NoC routers. It is 
assumed that 𝑟𝑚 is the router connected to core 𝑚𝑚. 
Assumption 1: Each core is assumed to have a V-F regula-
tor for adjusting its voltage and frequency individually. An ex-
ample of a processor supporting core-level DVFS is the AMD 
Opteron “Barcelona” processor [2]. 
 
 
Fig. 1. A mesh-based multi-core processor [42]   
2.1.2 The operating system scheduler 
In multi-core processors, the operating system scheduler 
is responsible for assigning tasks to the cores as well as 
for adjusting the working V-F levels. In order to manage 
the thermal profile of the multi-core processor, the sched-
uler needs to evenly distribute the heat across the chip-
wide components by intelligently assigning tasks to cores 
located on different parts of the chip. As shown in Figure 
2, the scheduler consists of three units [43]: 
 Thermal monitor: As electronic devices operate in specific
normal temperature ranges [44], the current temperature
of the cores must be taken into account when
computing the task assignment policy in order to prevent
thermal crises. As such, modern chips are
equipped with thermal sensors to measure the tem-
perature of the different parts of the chip. Here, we 
consider M temperature sensors placed next to each 
processing core. The thermal monitor reads these 
thermal sensors, and sends the collected data to the 
thermal manager as well as to the dispatcher to be 
subsequently considered in task assignments. We de-
note the temperature readings at time 𝑡 by the vector 
𝒄(𝑡) = (𝑐1(𝑡), 𝑐2(𝑡), 𝑐3(𝑡),… , 𝑐𝑚(𝑡),… , 𝑐𝑀(𝑡)) which in-
cludes M temperature values such that 𝑐𝑚 is the tem-
perature of core 𝑚𝑚. Accordingly, one common ther-
mal-sensitive performance measure is the so-called in-
stantaneous temperature margin of the chip which we 
formally define as follows: 
Definition 1: The chip-wide temperature margin at time 𝑡 
is defined as: 
 
𝑇𝑚𝑔(𝑡) ≝ max (𝑇𝑡ℎ − 𝑇𝑝(𝑡), 0) (1) 
 
where 𝑇𝑡ℎ is the maximum tolerable temperature of the
chip, as specified in the datasheet of the device, and
𝑇𝑝(𝑡) is the peak temperature at time 𝑡, which is
calculated as:

$$T_p(t) = \max_{m} c_m(t) \tag{2}$$
 Dispatcher: This unit receives 𝒄(𝑡) from the thermal
monitor, and then (based on an assignment policy) selects
an idle core for assigning a given task.
 Thermal manager: This unit receives 𝒄(𝑡) from the
thermal monitor, and then determines the operating
voltage and frequency of the processing cores. We represent
by ℒ = {𝑙1, 𝑙2, 𝑙3, … , 𝑙𝑙 , … , 𝑙𝐿} the set of V-F levels
applicable to each processing core.
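Definition 1 and Eq. (2) can be sketched in a few lines. The function names and the numeric values in the usage note are illustrative assumptions; temperatures are taken to be in Kelvin.

```python
def peak_temperature(core_temps):
    """Eq. (2): the peak temperature is the maximum sensor reading."""
    return max(core_temps)


def temperature_margin(core_temps, t_threshold):
    """Eq. (1): margin between the tolerable threshold and the peak, floored at zero."""
    return max(t_threshold - peak_temperature(core_temps), 0.0)
```

For example, with readings of 330.0, 345.5 and 338.0 K and a hypothetical threshold of 358.0 K, the margin is 12.5 K; once any core exceeds the threshold, the margin is 0.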
Assumption 2: The task scheduler is implemented on a 
dedicated core.  
2.2 System uncertainties  
A realistic multicore system is faced with many uncer-
tainties (e.g., random task arrivals, uncertain IPC or 
thermal dynamics) which drastically influence the per-
formance of any thermal-aware task assignment policy. In 
fact, in the presence of these uncertainties, the system 
undergoes a series of stochastic events over time, and the 
scheduler needs to adaptively adjust its decisions to cater 
not just for the current conditions, but also for the dynam-
ic changes that are about to happen in the future. Hence, 
it is technically fair to say that the problem faced by a 
scheduler is sequential in nature, in the sense that every 
decision made in the present also has an impact on determining
the long-run performance of the system. We
elaborate further on this issue in Section 2.3 by giving a
didactic example. Before that, we first discuss the most
important types of uncertainties that need to be factored 
in our computation of a scheduling policy.  
2.2.1 Stochastic task arrivals 
In many real-world scenarios, the tasks entering a system 
are of different natures; for example, some tasks may 
heavily engage the arithmetic units (CPU-bound), while 
others mostly involve I/O operations (I/O-bound). In 
fact, executing tasks of various types results in different
thermal profiles and different execution times. On
top of that, the input size of an application also
affects both its thermal footprint and its execution
time. Here, we assume that there are a total of I different
task types, each in turn associated with N subtypes. In 
particular, let ℐ = {1,2, 3,… , 𝑖, … , 𝐼} denote the set of task 
types, and the set 𝒩 = {1,2,3,… , 𝑛, … , 𝑁} be the set of all 
subtypes. Accordingly, we use the symbol 𝜏𝑖,𝑛 to repre-
sent a task of type i and subtype n. Also, the symbol 𝜁𝑖,𝑛,𝑙 
is used to indicate the pure execution time of 𝜏𝑖,𝑛 under
the operating level 𝑙𝑙 (i.e., without considering the
inter-task communication time). It is further assumed that 
each task can only run on a single core. We denote by 𝜆𝑖,𝑛 
the mean arrival rate of tasks of type 𝜏𝑖,𝑛.  
The key assumptions we make regarding the task model 
are as follows: 
Assumption 3: For all 𝑖, 𝑛, the task arrival process is Poisson
with unknown parameter 𝜆𝑖,𝑛.
Remark 1: As the arrival processes are Poisson for all 𝑖, 𝑛, the
aggregate arrival process into the multi-core system is also
Poisson with unknown rate parameter 𝜆.
Assumption 4: Each task entering the system has an exponentially
distributed execution time (considering the communication
time) [18].
Assumption 5: There is no prior knowledge (neither
acausal nor statistical) about the task arrivals, the execution
times, or the thermal impact of the tasks.
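Remark 1 can be checked numerically. The sketch below assumes the per-type arrival streams are mutually independent, in which case the aggregate rate is the sum of the individual rates; the rate values in the usage note are hypothetical, and in the paper's setting the rates are unknown to the scheduler.

```python
import random


def aggregate_rate(rates):
    """Rate of the superposition of independent Poisson streams."""
    return sum(rates.values())


def merged_arrivals(rates, horizon, seed=0):
    """Merge one Poisson stream per (type, subtype) key over [0, horizon]."""
    rng = random.Random(seed)
    arrivals = []
    for key, lam in rates.items():
        t = rng.expovariate(lam)          # exponential inter-arrival times
        while t <= horizon:
            arrivals.append((t, key))
            t += rng.expovariate(lam)
    arrivals.sort()                       # chronological order across all streams
    return arrivals
```

Dividing the number of merged arrivals by the horizon gives an empirical rate that should approach `aggregate_rate(rates)` as the horizon grows.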
2.2.2 Communication uncertainty 
In order to capture the uncertainty associated with IPC 
and task pairings, we assume that each task entering the 
system has a chance of getting paired and communicating 
with another currently running task. This model of ran-
dom task pairing can account for many real-world scenar-
ios (e.g., clipboard sharing, random sensory data genera-
tion, etc.). Let 𝑒𝑖,𝑛,𝑗,𝑦 denote the probability that task 𝜏𝑖,𝑛 
pairs with 𝜏𝑗,𝑦. The uncertain IPC model we envisage here
is characterized by the following assumptions: 
Assumption 6: Each task can pair with just one of the 
running tasks at any one time.  
Assumption 7: The duration of communication between 
any pair of tasks is exponentially distributed with unknown 
parameter 𝜉𝑖,𝑛,𝑗,𝑦.  
Remark 2: Given the deterministic nature of the pure exe-
cution times and the exponential distribution of the communi-
cation times, the total service time of the task 𝜏𝑖,𝑛 paired with 
𝜏𝑗,𝑦 is an exponentially distributed random variable with the 
shifted parameter  𝜉𝑖,𝑛,𝑗,𝑦 + 𝜁𝑖,𝑛,𝑙.  
Assumption 8: The well-known xy routing algorithm [44] 
is used for determining the on-chip communication paths.  
Assumption 9: The mean rate of data exchange is Poisson
distributed with unknown parameter 𝛾𝑖,𝑛,𝑗,𝑦.
2.2.3 Thermal uncertainty  
The thermal impact of circuit components, together with the
impact of thermal interface materials and cooling conditions,
makes the thermal profile of the chip stochastic. We assume that
all uncertainties of this kind can summarily be captured
by an unknown perturbation parameter 𝜚. Therefore, the
temperature values observed from the thermal sensors
evolve over time as follows:
 
(3) 𝒄(𝑡′) = 𝑓(𝒄(𝑡),𝑎(𝑡), 𝜚) 
2.2.4 The arrival queue  
Tasks arriving into the system are queued by the
scheduler to be served in a continuous-time FCFS fashion.
Given the specifications of our system model, the arrival 
queue corresponds to an infinite-length multi-server 
queue with a Poisson arrival process, and exponentially 
distributed service time. We denote the occupancy state 
of the arrival queue at time 𝑡 by 𝑞(𝑡) ∈ {0,1,2,… }.
Assumption 10: The aggregate arrival rate 𝜆 is assumed to 
be within the stability region of the queue. The queue stability 
region is a set 𝛬 entailing all arrival regimes (i.e., the 𝜆 parame-
ters), for which there is at least one scheduling policy 𝜋 under 
which the average length of the queue is bounded from above: 
 
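$$\Lambda \triangleq \left\{ \lambda \in \mathbb{R}^{+} \;\middle|\; \exists \pi : \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left[ \int_{0}^{T} q(t)\, dt \right] < \infty \right\} \tag{4}$$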
Table 2 summarizes the notations used in our system 
model. Also, in Figure 2, we show a sample snapshot of 
the system at time 𝑡. The thermometers represent the rela-
tive temperature of the processing cores. These tempera-
tures are first sensed by thermal sensors placed next to 
each core (shown with small circles next to the cores). 
Then, the thermal monitor collects these temperature readings
and sends them to the dispatcher and to the thermal
manager. According to a scheduling policy, the dispatcher
assigns 𝜏𝑖,𝑛 to some core 𝑚𝑚, and the thermal manager
sets the working V-F level of core 𝑚𝑚 to some 𝑙𝑙. Of particular
note is the continuous-time event-based operation
of the scheduler. More specifically, the scheduling policy 
can be executed upon the arrival and departure events. In 
a fully-occupied system, as soon as the execution of a task 
is finished, and a core becomes idle, the head-of-line job 
(in case the task queue is non-empty) can be assigned to 
this newly idle core. Similarly, immediately upon arrival, 
a task can be assigned to an idle core (if there is any). 
 
TABLE 2
SUMMARY OF NOTATIONS IN SYSTEM MODEL
| Description | Notation |
|---|---|
| Set of processing cores | ℳ |
| Set of NoC routers | ℛ |
| Number of processing cores, NoC routers and thermal sensors | 𝑀 |
| The 𝑚-th processing core | 𝑚𝑚 |
| The NoC router connected to core 𝑚𝑚 | 𝑟𝑚 |
| The chip thermal profile at time 𝑡 | 𝒄(𝑡) |
| The temperature of core 𝑚𝑚 | 𝑐𝑚(𝑡) |
| Set of working voltage and frequency levels | ℒ |
| The 𝑙-th voltage and frequency level | 𝑙𝑙 |
2.3 Problem statement  
Before giving a formal definition of the thermal-aware 
task scheduling problem, in this section, we first give an 
illustrative example to further motivate our proposed 
solution in Section 3.   
Figure 3 illustrates a snapshot of the system
where the tasks 𝜏1,8, 𝜏2,3, 𝜏3,5, 𝜏1,4, 𝜏5,5, 𝜏6,1 and 𝜏1,7
are currently running on cores 𝑚1,  𝑚7,  𝑚6,  𝑚2,  𝑚4,  𝑚3 
and 𝑚8, respectively. At the same time, some new task 
𝜏2,8 enters the system, and the scheduler has the option of 
assigning 𝜏2,8 to either of the idle cores 𝑚5 and 𝑚9. As-
sume that 𝑚5 is currently cooler than 𝑚9, and consider a 
myopic scheduler with an instantaneously greedy policy. 
Such a scheduler would assign 𝜏2,8 to 𝑚5. An unlucky run 
unfolds as follows: The newly assigned task 𝜏2,8 gets 
paired with 𝜏1,4 such that the NoC routers 𝑟2 and 𝑟5 would 
become involved in carrying out their IPC with a low in-
jection rate. A few moments later, another task 𝜏9,5 enters 
the system and is necessarily assigned to the only idle
core 𝑚9. Now, it is quite likely that 𝜏9,5 gets paired with 
𝜏1,8 such that the NoC routers 𝑟1, 𝑟4,  𝑟7,  𝑟8 and 𝑟9 would 
become entangled with a high injection rate. In fact, as 
pairings could occur quite at random, the set of involved 
routers would not be predictable before task assignment. 
Given that the engaged NoC routers could generate sig-
nificant heat, a foresighted scheduling policy that ac-
counts for system uncertainties can bring about major 
improvements over a myopic policy that only considers 
the current temperature footprint to decide on its task-to-
core mappings.   
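The router sets named in this example are consistent with XY routing (Assumption 8) on a 3x3 mesh, as the following sketch shows. It assumes that cores are numbered row by row starting from 1 and that the newly assigned task is the source of the communication; both are assumptions made here for illustration, since the figure's numbering is not reproduced in the text.

```python
def xy_route(src, dst, cols):
    """Routers visited by XY routing on a row-major, 1-indexed mesh.

    The packet first travels along the X dimension, then along Y.
    """
    x, y = (src - 1) % cols, (src - 1) // cols
    dst_x, dst_y = (dst - 1) % cols, (dst - 1) // cols
    path = [src]
    while x != dst_x:                      # X dimension first
        x += 1 if dst_x > x else -1
        path.append(y * cols + x + 1)
    while y != dst_y:                      # then Y dimension
        y += 1 if dst_y > y else -1
        path.append(y * cols + x + 1)
    return path
```

Under these assumptions, `xy_route(5, 2, 3)` visits routers 5 and 2, while `xy_route(9, 1, 3)` visits routers 9, 8, 7, 4 and 1, which matches the two pairings described above: the second assignment engages five routers instead of two.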
 
 
Fig. 2. The SMDP Scheduler 
 
 
 
Fig. 3. An example task pairing scenario 
 
In light of this simple example, one could perceive that 
the uncertainties associated with the task arrivals, task 
pairings, and the chip thermal dynamics can have a sig-
nificant bearing on the thermal performance of a schedul-
ing policy. In more technical terms, the multi-core proces-
sor gives rise to a stochastic dynamic system whose tra-
jectory needs to be controlled with a state-dependent 
foresighted policy over time. Accordingly, unlike a myop-
ic policy, it is not the instantaneous temperature margin 
of the chip that we seek to maximize. Rather, the sched-
uler needs to shape its course of actions so as to optimize 
a long-run thermal performance measure. Armed with 
this understanding, in Section 3, we specify the structure 
of an adaptive scheduling policy and give a formal defini-
tion of our optimization objective that accounts for the 
uncertainties affecting the system.  
3  PROBLEM FORMULATION 
In the system model discussed in Section 2, the future
thermal profile of the chip depends not only on the task 
assignment policy but also on important task characteris-
tics such as task arrivals, pairings, and inter-task commu-
nication time. As these task characteristics are random in 
general, we need to cast the task-to-core assignment prob-
lem using a proper formalism from the stochastic optimi-
zation theory [45]. In particular, we use the framework of 
Markov Decision Processes (MDPs) [32] which is a stand-
ard formalism to express the decision problem faced by 
an agent operating in an uncertain environment. In an 
MDP, the decision made in each state affects the trajectory 
of the states to be visited by the system in the future, re-
flecting the sequential nature of the optimization prob-
lem. The objective is also expressed in a foresighted fash-
ion which is to calculate an optimal policy that maximizes 
the long-run average reward accumulated by the agent. 
In general, the optimal policy in an MDP can be calculat-
ed in two different ways depending on the information 
availability assumptions [46]: model-based or model-free. 
Unlike model-based methods, a "model-free" method 
does not require prior statistical knowledge of the sto-
chastic processes underlying the system, and this lack of 
knowledge is offset by equipping the device with the abil-
ity to learn and experience. 
Also, most previous works on task scheduling (e.g., [10,
18, 27]) use discrete-time MDPs in which tasks are assigned
only at the start of fixed time quotas. This formulation,
however, may lead to performance degradation in
cases where several tasks are waiting inside the queue
while some processing cores are idle, because the scheduler
has to wait for a new time quota to assign the queued tasks. A
further complication with a discrete-time formulation
is the combinatorial nature of task-to-core
assignments; in fact, in each decision epoch, the scheduler 
has no recourse but to consider all combinations of task-
to-core assignments, which drastically adds to the com-
plexity of the process.  
Armed with this understanding, we propose a continuous-time
formulation based on the so-called semi-MDP or
SMDP formalism [35] in which the scheduler is no longer 
limited to operate just at the start of fixed time quotas; 
instead, it has an event-based functionality; i.e., it is only 
invoked if an event (including a task arrival or departure) 
occurs. In case of a task arrival, if there is at least one idle 
core, one task will be assigned. Also, when a task departs, 
a core becomes idle. At this moment, if there is any task in 
the arrival queue, the scheduler will select an idle core to 
assign a task.  
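The event-based operation described above can be sketched as a small discrete-event loop in which the scheduler is invoked only at arrivals and departures. This is a simplified illustration rather than the paper's simulator: it ignores temperatures, V-F levels and task pairing, uses a single exponential service rate, and takes a caller-supplied `pick_core` function as a placeholder for the assignment policy. All names and parameters are hypothetical.

```python
import heapq
import random
from collections import deque


def simulate(num_cores, arrival_rate, service_rate, horizon, pick_core, seed=0):
    """Event-driven dispatch: decisions happen only on arrivals and departures."""
    rng = random.Random(seed)
    busy = [False] * num_cores          # idle/busy flag per core
    queue = deque()                     # FCFS arrival queue
    seq = 0                             # unique tie-breaker for heap entries
    events = [(rng.expovariate(arrival_rate), seq, "arrival", None)]
    stats = {"arrived": 0, "completed": 0, "decisions": 0}

    while events:
        now, _, kind, core = heapq.heappop(events)
        if now > horizon:
            break
        if kind == "arrival":
            stats["arrived"] += 1
            queue.append(stats["arrived"])
            seq += 1
            heapq.heappush(
                events, (now + rng.expovariate(arrival_rate), seq, "arrival", None))
        else:                           # a departure frees its core
            busy[core] = False
            stats["completed"] += 1

        # Decision epoch: assign the head-of-line task if some core is idle.
        idle = [m for m in range(num_cores) if not busy[m]]
        if queue and idle:
            chosen = pick_core(idle)
            queue.popleft()
            busy[chosen] = True
            stats["decisions"] += 1
            seq += 1
            heapq.heappush(
                events, (now + rng.expovariate(service_rate), seq, "departure", chosen))

    stats["in_service"] = sum(busy)
    stats["queued"] = len(queue)
    return stats
```

Because a decision is made at every event, a task never waits while a core is idle, which is the advantage over fixed time quotas discussed above. Exactly one task is considered per decision epoch, so the combinatorial enumeration of task-to-core assignments does not arise.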
3.1 SMDP formalism  
In this section, we formalize the task scheduling problem 
by casting it as an SMDP 〈𝒮, 𝒜, 𝐹, 𝑟, 𝜋, 𝜌̅𝜋〉, where 𝒮, 𝒜, 𝐹,
𝑟, 𝜋 and 𝜌̅𝜋 denote respectively: the set of states, the set of
control actions, system dynamics, instantaneous reward, 
the decision policy function, and the optimization objec-
tive.  In the following, we elaborate on our formulation of 
the stochastic optimization problem. 
3.1.1 System states  
The system state at time t is denoted by 𝑠(𝑡) ∈ 𝒮 which is 
comprised of three components: 
 
$$s(t) = \{\boldsymbol{c}(t), \boldsymbol{b}(t), \kappa(t)\} \tag{5}$$
As before, 𝒄(𝑡) denotes the temperature readings at 
time 𝑡. Also, 𝒃(𝑡) is the state (idle/busy) of the processing 
cores: 
$$\boldsymbol{b}(t) \in \mathcal{B} = \{0,1\}^{M} \tag{6}$$
In fact, 𝒃(𝑡) is a vector of 0s and 1s. For each core, the 0 
value indicates that the core is idle, and 1 means that it is 
busy. 𝜅(𝑡) represents the number of tasks in the system. 
In an infinite-capacity system, 𝜅(𝑡) varies in the range of 
zero to infinity; i.e., 
 
𝜅(𝑡) ∈ 𝐾 = {0,1,2,… } (7) 
3.1.2 Actions 
Let 𝒜(𝑠) be the set of all feasible actions in state 𝑠. Fur-
ther, 𝒜 =∪𝑠∈𝒮 𝒜(𝑠). At each time 𝑡, a feasible action 𝑎(𝑡) 
at state 𝑠(𝑡) ∈ 𝒮 is selected from the set 𝒜(𝑠(𝑡)) ⊆ 𝒜, and 
is comprised of two components: 
 
(8) 𝑎(𝑡) = (𝑚𝑚 , 𝑙𝑙) 
in which 𝑚𝑚 is the selected core (possibly nil) to assign
a task and 𝑙𝑙 is the chosen V-F level of that core. For ex-
ample, in Figure 3, the system is considered to be at state 
𝑠(𝑡𝑘) at time 𝑡𝑘. At this moment, a task enters the system 
and there may be several idle cores. The scheduler assigns 
the task at the head of the queue to the core 𝑚𝑚 and sets 
its working V-F level to 𝑙𝑙. Then, at 𝑡𝑘+1, some task departs
and another core becomes idle. Now, if the arrival
queue is not empty, the scheduler assigns a waiting task
to the core 𝑚𝑚′ with 𝑙𝑙′ V-F level. Finally, at 𝑡𝑘+2, a new 
task enters the system and is assigned to 𝑚𝑚" with 𝑙𝑙". 
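The state of Eq. (5) and the action of Eq. (8) can be represented as follows. This is a minimal sketch under two assumptions that the text does not spell out: the number of waiting tasks is taken as 𝜅(𝑡) minus the number of busy cores, and the nil action is encoded as `(None, None)`. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (core index, V-F level index); (None, None) encodes the nil action.
Action = Tuple[Optional[int], Optional[int]]


@dataclass(frozen=True)
class State:
    temps: Tuple[float, ...]   # c(t): one temperature reading per core
    busy: Tuple[int, ...]      # b(t): 0 = idle, 1 = busy
    num_tasks: int             # kappa(t): tasks queued plus tasks in service


def feasible_actions(state: State, num_levels: int) -> List[Action]:
    """A(s): every (idle core, V-F level) pair if a task is waiting, else nil."""
    waiting = state.num_tasks - sum(state.busy)
    idle = [m for m, flag in enumerate(state.busy) if flag == 0]
    if waiting <= 0 or not idle:
        return [(None, None)]
    return [(m, level) for m in idle for level in range(num_levels)]
```

The size of the feasible set is at most the number of idle cores times the number of V-F levels, which is why a variant without DVFS has fewer actions to learn over.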
3.1.3 System dynamics 
Our formalization of the system’s dynamics follows from 
the standard treatment of SMDPs given in [35]. Let 
𝐹𝑠𝑠′(𝑇,  𝑎) represent the conditional cumulative joint dis-
tribution of the transition time and the next system state 
conditioned on currently being in state 𝑠 and taking ac-
tion 𝑎. In other words, 𝐹𝑠𝑠′(𝑇,  𝑎) is the probability of 
transitioning from state 𝑠 to 𝑠′ by performing action 𝑎 
with the transition time lasting less than or equal to 𝑇, i.e., 
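$$F_{ss'}(T, a) = \mathbb{P}\left\{ t_{k+1} - t_k \le T,\; \boldsymbol{c}_{k+1} \in \boldsymbol{c}',\; \boldsymbol{b}_{k+1} = \boldsymbol{b}',\; \kappa_{k+1} = \kappa' \;\middle|\; \boldsymbol{c}_k = \boldsymbol{c},\; \boldsymbol{b}_k = \boldsymbol{b},\; \kappa_k = \kappa,\; a_k = a \right\} \tag{9}$$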
 
 
Moreover, the transition probability from state 𝑠 to 𝑠′ 
by taking action 𝑎 can be calculated as follows: 
 
Fig. 3. Timeline 
 
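$$P_{ss'}(a) = \mathbb{P}\left( \boldsymbol{c}_{k+1} \in \boldsymbol{c}',\; \boldsymbol{b}_{k+1} = \boldsymbol{b}',\; \kappa_{k+1} = \kappa' \;\middle|\; \boldsymbol{c}_k = \boldsymbol{c},\; \boldsymbol{b}_k = \boldsymbol{b},\; \kappa_k = \kappa,\; a_k = a \right) = \mathbb{P}(\boldsymbol{b}' \mid \boldsymbol{b}, a) \times \mathbb{P}(\kappa' \mid \kappa, a) \times \int_{\mathcal{C}} f_{\boldsymbol{c}\boldsymbol{c}'}(a)\, d\boldsymbol{c}' \tag{10}$$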
where 𝑓𝓬𝓬′(𝑎) is the conditional transition probability 
density function of the chip temperature given the current 
temperature profile 𝓬 and action 𝑎. Note that all the three 
dynamics ℙ(?́?|𝒃, 𝑎), ℙ(κ́ |κ, a), 𝑓𝓬𝓬′(𝑎) as well as 𝐹𝑠𝑠′(𝑇,  𝑎) 
are dependent on the statistics of the task arrival, task 
execution times and the inter-task communication times 
(c.f., Assumptions 3, 4 and 7). The thermal dynamics 
𝑓𝓬𝓬′(𝑎) also depends on the thermal perturbation function 
𝑓(𝒄(𝑡),𝑎(𝑡), 𝜚) defined previously in Section 2.2.3.  
Now, with the definitions of the 𝐹𝑠𝑠′(𝑇,  𝑎) and 𝑓𝓬𝓬′(𝑎) 
at hand, and given 𝑎, 𝑠 and 𝑠′, the conditional cumulative 
distribution of the transition time from 𝑠 to 𝑠′ by taking 
action 𝑎 can be computed as follows: 
 
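$$\mathbb{P}\left\{ t_{k+1} - t_k \le T \;\middle|\; \boldsymbol{c}_k = \boldsymbol{c},\; \boldsymbol{b}_k = \boldsymbol{b},\; \kappa_k = \kappa,\; \boldsymbol{c}_{k+1} \in \boldsymbol{c}',\; \boldsymbol{b}_{k+1} = \boldsymbol{b}',\; \kappa_{k+1} = \kappa',\; a_k = a \right\} = \frac{F_{ss'}(T, a)}{P_{ss'}(a)} \tag{11}$$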
Also, given 𝑎, 𝑠 and 𝑠′, the conditional expected value 
of the transition time, i.e., 𝔼{𝑇|𝑠, 𝑠′,  𝑎} can be calculated 
as: 
$$\mathbb{E}\{T \mid s, s', a\} = \int_{0}^{\infty} T\, \frac{dF_{ss'}(T, a)}{P_{ss'}(a)} \tag{12}$$
 
 
where 𝑑𝐹𝑠𝑠′(𝑇, 𝑎) is the differential, with respect to 𝑇, of the
conditional cumulative joint distribution of the transition
time and the next state.
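Now, for each 𝑠 = (𝓬, 𝒃, κ) and 𝑎 ∈ 𝒜(𝑠), the expected
value of the transition time 𝑇̅𝑠(𝑎) from state 𝑠 by taking
action 𝑎 is calculated as:

$$\bar{T}_s(a) = \sum_{\boldsymbol{b}' \in \mathcal{B}} \sum_{\kappa' \in K} \left[ \mathbb{P}(\boldsymbol{b}' \mid \boldsymbol{b}, a) \times \mathbb{P}(\kappa' \mid \kappa, a) \times \int_{\mathcal{C}} f_{\boldsymbol{c}\boldsymbol{c}'}(a)\, \mathbb{E}\{T \mid s, s', a\}\, d\boldsymbol{c}' \right] \;\Longrightarrow\; \bar{T}_s(a) = \sum_{\boldsymbol{b}' \in \mathcal{B}} \sum_{\kappa' \in K} \int_{\mathcal{C}} \int_{0}^{\infty} T\, dF_{ss'}(T, a)\, d\boldsymbol{c}' \tag{13}$$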
3.1.4 The instantaneous reward 
In our thermal management problem, decisions should be
made so as to reduce the average peak temperature
of the system in the long run. At each instant, the temperature
of all processor components is measured. Then, the on-chip
unit with the maximum temperature is considered
the hottest spot, and its temperature is marked as
the system’s peak temperature. Reducing the peak temperature
of the system means maximizing the temperature
margin. Therefore, the instantaneous reward
𝑟(𝑠(𝑡), 𝑎(𝑡)) accrued in state 𝑠(𝑡) by performing action 
𝑎(𝑡) at time t is defined as: 
 
(14) 𝑟(𝑠(𝑡), 𝑎(𝑡)) =  𝑇𝑚𝑔(𝑡). 
3.1.5 Policy 
In an SMDP-based formulation, we seek the optimal
policy as the solution of the optimization problem. In fact,
each candidate policy function is a mapping which specifies
what action should be taken by the decision maker in
each state of the system. More formally, a policy is defined
as a mapping of the form:
 
π: 𝒮 → 𝒜, (15) 
with 𝜋(𝑠) ∈ 𝒜(𝑠), ∀𝑠 ∈ 𝒮.  
3.1.6 The systematic goal (The average long-term re-
ward) 
In principle, the average of the instantaneous rewards 
obtained by following a given decision policy 𝜋 over an 
infinite-length time horizon represents the performance 
of the policy 𝜋 [35]. More formally, the long-term average 
“one-stage” reward for the continuous-time problem is 
calculated as follows [35]: 
 
$$\rho^{\pi} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\left\{ \int_{0}^{T} r(s(t), a(t))\, dt \right\}, \tag{16}$$
where it is assumed that at each time instant 𝑡, the action
𝑎(𝑡) is drawn from the probability distribution
𝜋(𝑠(𝑡)). Equation (16) can equivalently be written as [35]:

$$\rho^{\pi} = \lim_{N \to \infty} \frac{1}{\mathbb{E}_{\pi}\{t_N\}}\, \mathbb{E}_{\pi}\left\{ \int_{0}^{t_N} r(s(t), a(t))\, dt \right\}, \tag{17}$$
where 𝑡𝑁 is the completion time of the N-th transition.  
Therefore, our goal is to compute an optimal adaptive 
policy 𝜋∗ for assigning the tasks and selecting V-F levels 
for the cores such that: 
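$$\pi^{*} = \arg\max_{\pi} \rho^{\pi} \triangleq \arg\max_{\pi} \lim_{N \to \infty} \frac{1}{\mathbb{E}_{\pi}\{t_N \mid s_0 = s\}} \times \mathbb{E}_{\pi}\left\{ \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} r(s_k, \pi(s_k))\, dt \;\middle|\; s_0 = s \right\} \tag{18}$$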
 
Following the treatment in [35], under the assumption
that the stochastic system state process {𝑠(𝑡)}𝑡≥0 is an er-
godic Markov chain, the long-term average reward ρπ is 
well-defined and its value is independent of the initial 
state s0. Next, in Section 4.1, we discuss how the optimal 
policy π∗ can be approximated in the absence of the statis-
tical knowledge of the system’s stochastic dynamics.  
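For intuition, the ratio form in Eqs. (17) and (18) can be estimated from a single sample trajectory. The sketch below assumes, as in Eq. (18), that the reward is held constant between consecutive decision epochs; the function name and the numbers in the usage note are illustrative.

```python
def average_reward(transition_times, stage_rewards):
    """Time-weighted reward divided by elapsed time, cf. Eq. (18).

    transition_times: t_0, t_1, ..., t_N (N + 1 values)
    stage_rewards:    r(s_k, pi(s_k)) for k = 0, ..., N - 1 (N values)
    """
    if len(transition_times) != len(stage_rewards) + 1:
        raise ValueError("need exactly one more time stamp than rewards")
    accumulated = 0.0
    for k, reward in enumerate(stage_rewards):
        accumulated += reward * (transition_times[k + 1] - transition_times[k])
    elapsed = transition_times[-1] - transition_times[0]
    return accumulated / elapsed
```

For instance, a margin of 10 K held for 1 time unit followed by a margin of 20 K held for 3 time units gives an average of 17.5 K. A long sojourn in a state weighs more than a short one, which is what distinguishes the SMDP objective from a per-decision average.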
The symbols used for our SMDP formalism are gath-
ered in Table 3. 
TABLE 3
NOTATIONS USED FOR SMDP FORMALISM
| Description | Notation |
|---|---|
| The state (busy/idle) of all processing cores at time 𝑡 | 𝑏(𝑡) |
| The number of tasks inside the system at time 𝑡 | κ(𝑡) |
| The (busy/idle) state space of all cores | ℬ |
| The state space of the number of tasks in the system | Κ |
| The selected action at time 𝑡 | 𝑎(𝑡) |
| The set of all actions | 𝒜 |
| The set of all system states | 𝒮 |
| The state of the system at 𝑡𝑘 (the k-th transition time) | 𝑠(𝑡𝑘) |
| The number of tasks that enter the system according to a Poisson process with parameter 𝜆 | 𝑝𝑜𝑖𝑠𝑠𝑟𝑎𝑛𝑑(𝜆) |
| The number of terminated tasks within (𝑡, 𝑡′) | O(𝑡) |
| The thermal dynamics perturbation parameter | ϱ |
| Conditional cumulative joint distribution of the transition time 𝑇 and the next system state 𝑠′ given the current state-action pair (𝑠, 𝑎) | 𝐹𝑠𝑠′(𝑇, 𝑎) |
| The number of tasks inside the system at the 𝑘-th transition | κ𝑘 |
| The state (busy/idle) of all processing cores at the 𝑘-th transition | 𝑏𝑘 |
| The temperature of all processing cores at the 𝑘-th transition | 𝑐𝑘 |
| The current system state | 𝑠 |
| The next system state | 𝑠′ |
| The conditional transition probability density function of the chip temperature given the current temperature profile 𝒸 and action 𝑎 | 𝑓𝒸𝒸′(𝑎) |
| The transition probability from 𝑠 to 𝑠′ by performing 𝑎 | 𝑃𝑠𝑠′(𝑎) |
| The instantaneous reward | 𝑟(𝑡) |
| Decision policy | π |
| The set of all probability distributions over the action space | ∆(𝒜) |
| The long-term average reward | 𝜌̅𝜋 |
| Time instant of the 𝑘-th transition | 𝑡𝑘 |
| The completion time of the N-th transition | 𝑡𝑁 |
| “One-stage” expected reward | 𝑅(𝑠, 𝑎) |
| The expected value of the transition time from state 𝑠 by performing action 𝑎 | 𝑇̅𝑠(𝑎) |
| The optimal policy | 𝜋∗ |
4  THE PROPOSED REINFORCEMENT LEARNING 
ALGORITHM 
The computation of the optimal policy 𝜋∗ can be done 
using standard dynamic programming techniques for 
SMDPs such as the specialized value iteration or policy 
iteration algorithms discussed in [32]. A major drawback 
with these techniques is that they require the system’s 
stochastic dynamics 𝑃𝑠𝑠′(𝑎) and 𝐹𝑠𝑠′(𝑇, 𝑎) be known in 
advance. The characterization of these dynamics in our 
case would require the complete knowledge of the task 
arrivals, execution times, inter-task communication times, 
as well as the chip thermal dynamics. As these statistics 
cannot be realistically assumed to be available in all set-
tings, to compensate for this lack of knowledge, we pro-
pose a model-free scheme in which the scheduler agent is 
equipped with the ability to experience and learn. In the 
sequel, we explain our proposed learning algorithm for 
task scheduling in a multi-core processing system. As 
with all forms of MDPs, the road to synthesizing a learning
algorithm starts with the basic Bellman equations, which we
discuss next.
 
4.1 Bellman equations   
Before stating the Bellman equations, we first establish 
one new notation. We define for each pair (𝑠, 𝑎) the so-called
“one-stage” expected reward 𝑅(𝑠, 𝑎), which corresponds
to the instantaneous reward from taking action 𝑎
in state 𝑠 weighted by the mean time spent in the transition
from 𝑠 to some next state, and can be calculated as
follows:

$$R(s, a) = r(s, a)\, \bar{T}_s(a) \tag{19}$$
 
Now, according to the standard treatment of
“average-reward” SMDPs, to solve for the optimal policy 𝜋∗,
each state of the system 𝑠 ∈ 𝒮 is first given a “value” 𝑉𝜋(𝑠)
under each candidate policy 𝜋. In particular, 𝑉𝜋(𝑠) denotes
the expected total of the “per-stage values” that would be
obtained starting from that state and following policy 𝜋.
Each “per-stage value” is defined as the difference between
the “one-stage” expected reward 𝑅(𝑠, 𝜋(𝑠)) specific to the
pair (𝑠, 𝜋(𝑠)) and the term $\rho^{\pi}\,\bar{\tau}_s(\pi(s))$,
in which 𝜌𝜋 denotes the long-term average “one-stage” reward
corresponding to policy 𝜋. Following the derivation in [35],
this value function satisfies Bellman equations of the form
given below:

$$ V^{\pi}(s) = R(s, \pi(s)) - \rho^{\pi}\,\bar{\tau}_s(\pi(s)) + \bar{V}^{\pi}, \quad \forall s \in \mathcal{S} \qquad (20) $$
 
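in which $\bar{V}^{\pi}$ is the expected value to be obtained by
following policy 𝜋 onward from state 𝑠, and can be defined
as follows:

$$ \bar{V}^{\pi} = \sum_{\boldsymbol{b}' \in \mathcal{B}} \sum_{\kappa' \in K} \left[ \mathbb{P}(\boldsymbol{b}' \mid \boldsymbol{b}, a) \times \mathbb{P}(\kappa' \mid \kappa, a) \times \int_{\mathcal{C}} f_{\boldsymbol{c}\boldsymbol{c}'}(a)\, V^{\pi}(s')\, d\boldsymbol{c}' \right] \qquad (21) $$

where 𝑠′ = (𝒃′, κ′, 𝓬′) symbolizes the next system state.
Since the optimal policy maximizes the value of all states,
if the agent follows the optimal policy 𝜋∗, the Bellman
equation becomes:

$$ V^{*}(s) = \max_{a \in A(s)} \left[ R(s, a) - \rho^{*}\,\bar{\tau}_s(a) + \bar{V}^{*}(s') \right], \quad \forall s \in \mathcal{S} \qquad (22) $$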
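in which 𝑉∗(𝑠) is the maximum value of state 𝑠, 𝜌∗ is
the optimal long-term average “one-stage” reward, and
$\bar{V}^{*}(s')$ is defined as follows:

$$ \bar{V}^{*}(s') = \sum_{\boldsymbol{b}' \in \mathcal{B}} \sum_{\kappa' \in K} \left[ \mathbb{P}(\boldsymbol{b}' \mid \boldsymbol{b}, a) \times \mathbb{P}(\kappa' \mid \kappa, a) \times \int_{\mathcal{C}} f_{\boldsymbol{c}\boldsymbol{c}'}(a)\, V^{*}(s')\, d\boldsymbol{c}' \right] \qquad (23) $$

Similarly, a value is assigned to each action in each state
using the Bellman optimality equation [35]:

$$ Q(s, a) = R(s, a) - \rho^{*}\,\bar{\tau}_s(a) + \bar{V}^{*}(s') \qquad (24) $$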
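So we have:

$$ V^{*}(s) = \max_{a \in A(s)} Q(s, a), \quad \forall s \qquad (25) $$

Finally, it follows that:

$$ Q(s, a) = R(s, a) - \rho^{*}\,\bar{\tau}_s(a) + \sum_{\boldsymbol{b}' \in \mathcal{B}} \sum_{\kappa' \in K} \left[ \mathbb{P}(\boldsymbol{b}' \mid \boldsymbol{b}, a) \times \mathbb{P}(\kappa' \mid \kappa, a) \times \int_{\mathcal{C}} f_{\boldsymbol{c}\boldsymbol{c}'}(a) \max_{a' \in A(s')} Q(s', a')\, d\boldsymbol{c}' \right], \quad \forall (s, a) \qquad (26) $$

Once the Q-function is computed for each (𝑠, 𝑎) pair,
the optimal policy is obtained as
$\pi^{*}(s) = \arg\max_{a \in \mathcal{A}(s)} Q(s, a)$ [47].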
However, for calculating 𝑄(𝑠, 𝑎) based on equation 
(26), the transition probabilities 𝑃𝑠𝑠′(𝑎) and 𝐹𝑠𝑠′(𝑇, 𝑎) are 
required. Ironically, we also need to know the optimal 
long-term average “one-stage” reward 𝜌∗ itself! 
In the absence of such knowledge, the well-known
Q-learning algorithm [47] can be used. Q-learning is an
iterative approximation procedure in which the learning
agent actually interacts with the environment and exploits
samples of instantaneous rewards and observations of
next-state transitions to approximate the 𝑄(𝑠, 𝑎) values.
Also, it has been shown that the estimate of the 𝑄 value
for some fixed (𝑠∗, 𝑎∗) pair can replace 𝜌∗ in equation (24)
[35]. In particular, the continuous-time version of the
Q-learning algorithm is expressed as follows [47]:
 
$$ Q_{k+1}(s, a) = (1 - \alpha_k)\, Q_k(s, a) + \alpha_k \left[ r(s, a, s') - Q_k(s^{*}, a^{*})\, t(s, a, s') + \max_{b \in A(s')} Q_k(s', b) \right] \qquad (27) $$

where 𝑟(𝑠, 𝑎, 𝑠′) denotes the immediate reward earned
in the transition from 𝑠 to 𝑠′ by performing action 𝑎,
𝑡(𝑠, 𝑎, 𝑠′) is the actual time spent in this transition, and
𝛼𝑘 is the learning rate, which is calculated as follows:

$$ \alpha_k = \frac{A}{B + k} \qquad (28) $$

where A and B are some pre-defined constants.
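As a minimal sketch, the update rule (27) with the learning-rate schedule (28) could be coded as follows for the tabular case. The function names, the dictionary-based Q-table, the default constants, and the toy states and actions are illustrative assumptions rather than part of the original formulation.

```python
from collections import defaultdict


def learning_rate(k, A=50.0, B=1000.0):
    """Learning rate of equation (28): alpha_k = A / (B + k).

    The default constants are illustrative placeholders.
    """
    return A / (B + k)


def smdp_q_update(Q, s, a, reward, sojourn, s_next, actions_next, ref_pair, k):
    """Apply one step of the continuous-time Q-learning rule (27).

    Q            -- dict mapping (state, action) to its current estimate
    reward       -- r(s, a, s'), the reward earned during the transition
    sojourn      -- t(s, a, s'), the time actually spent in the transition
    actions_next -- actions admissible in s_next, i.e. A(s')
    ref_pair     -- fixed pair (s*, a*) whose Q-value stands in for rho*
    """
    alpha = learning_rate(k)
    best_next = max(Q[(s_next, b)] for b in actions_next)
    target = reward - Q[ref_pair] * sojourn + best_next
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]


# Hypothetical two-state example; states and actions are plain labels here.
Q = defaultdict(float)
new_value = smdp_q_update(
    Q, s="s0", a="core1", reward=2.0, sojourn=0.5,
    s_next="s1", actions_next=["core1", "core2"],
    ref_pair=("s0", "core1"), k=0,
)
```

Note that the sampled transition time multiplies the reference Q-value, so long transitions are penalized in proportion to the current estimate of the average reward.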
In standard Q-learning, a Q-value is stored for each 
(s,a)-pair in a two-dimensional array called the Q-table. 
Then, at each stage 𝑘, the Q-value for the current (s,a) is 
updated according to equation (27). In our thermal-aware 
task scheduling problem, each system state is comprised 
of three components, viz. the state of processing cores, 
temperature values across the chip, and the number of 
tasks inside the system. In large-scale many-core systems, 
the state component associated with the idle/busy status 
of the cores would have a high dimension. Also, with a 
task queue of infinite capacity, the number of tasks inside 
the system can theoretically grow without limit. Most 
problematic though is the thermal profile of the chip 
which is a vector of sensor readings all of continuous real-
valued nature. Therefore, the state space is infinitely 
large, and due to memory limit, using the standard form 
of the Q-function with a Q-table is not practical. In general,
when faced with continuous state values, two different
methods can be used: discretization and function
approximation. With discretization, the continuous elements
of the state vector are divided into several intervals, and a
representative value is defined for each interval and used in
place of all values in that interval [7, 10, 27]. However,
learning becomes much slower with discretization because
more trial and error is needed [48]. Moreover, the size of
the intervals drastically affects the learning precision. In
the RL literature, the preferred way to combat the curse of
dimensionality is to devise a function approximation
architecture. In the next section, we propose a novel
approximation architecture suited to the problem at hand.
4.2 Scaling the Q-learning Algorithm: The Proposed 
Function Approximation Architecture  
To scale the Q-learning scheme, in this section, we
propose a function approximation architecture that
approximates state values using state similarities [36].
Intuitively, the more similar a pair of states are, the more
similar their values should be [49]. Hence, the learner does
not need to experience all system states to estimate their
values; it can generalize its knowledge to unseen states
with similar features, which can greatly increase the
convergence rate. A common way to approximate the Q-table
is to use a linear approximation of the form below [50]:

$$ \hat{Q}(s, a) = \boldsymbol{\theta}^{T} \boldsymbol{\phi}(s, a), \qquad (29) $$

where 𝝓(𝑠, 𝑎) is a vector-valued feature function that
maps state-action pairs to a summarized feature space and
𝜽 is a weight vector denoting the relative importance of
each feature. We leave the details of the proposed 𝝓(𝑠, 𝑎)
to Sections 4.2.1 and 4.2.2, in which two versions of the
Q-learning algorithm are introduced for the task scheduling
problem. Hence, for now, we suppose that the 𝝓(𝑠, 𝑎) values
are defined. Then, the Q-learning algorithm in Section 4.1
should be modified as follows: after each decision in stage
𝑘, the vector 𝜽𝑘 is updated so as to minimize the squared
error (𝐸𝑘) between the new estimate ($Q_{k+1}$) and its
approximated value ($\hat{Q}$) [36]:
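$$ E_k \triangleq \left[ Q_{k+1}(s_k, a_k) - \hat{Q}(s_k, a_k) \right]^2 = \left[ Q_{k+1}(s_k, a_k) - \boldsymbol{\theta}^{T} \boldsymbol{\phi}(s_k, a_k) \right]^2 \qquad (30) $$

in which 𝑄𝑘+1(𝑠𝑘, 𝑎𝑘) is calculated based on the most
recent observation as follows:

$$ Q_{k+1}(s_k, a_k) = r(s_k, a_k, s_{k+1}) - \hat{Q}(s^{*}, a^{*})\, t(s_k, a_k, s_{k+1}) + \max_{b \in A(s_{k+1})} \hat{Q}(s_{k+1}, b) \qquad (31) $$

To minimize the error 𝐸, we use the gradient descent
technique (with 𝛽𝑘 being the step size):

$$ \nabla_{\boldsymbol{\theta}} E = -2\, \boldsymbol{\phi}(s_k, a_k) \left[ Q_{k+1}(s_k, a_k) - \boldsymbol{\theta}^{T} \boldsymbol{\phi}(s_k, a_k) \right] \qquad (32) $$

Then 𝜽𝑘 is updated as follows:

$$ \boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \beta_k \nabla_{\boldsymbol{\theta}} E \qquad (33) $$

By defining $\alpha_k \triangleq 2\beta_k$, we have:

$$ \boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \alpha_k \left( Q_{k+1}(s_k, a_k) - \hat{Q}(s_k, a_k) \right) \boldsymbol{\phi}(s_k, a_k) \qquad (34) $$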
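The weight update in equations (29), (31) and (34) can be sketched in a few lines. This is only an illustration under stated assumptions: feature vectors are plain Python lists, the helper names `q_hat` and `fa_update` are hypothetical, and the numbers in the example are made up.

```python
def q_hat(theta, phi):
    """Linear approximation (29): Q_hat(s, a) = theta^T phi(s, a)."""
    return sum(w * f for w, f in zip(theta, phi))


def fa_update(theta, phi_sa, reward, sojourn, phi_ref, phi_next_all, alpha):
    """One gradient step on theta following equations (31) and (34).

    phi_sa       -- feature vector phi(s_k, a_k)
    phi_ref      -- feature vector of the reference pair (s*, a*)
    phi_next_all -- feature vectors phi(s_{k+1}, b) for every b in A(s_{k+1})
    alpha        -- step size alpha_k (equal to 2 * beta_k in the text)
    """
    # Equation (31): sampled target built from the latest observation.
    best_next = max(q_hat(theta, phi_b) for phi_b in phi_next_all)
    target = reward - q_hat(theta, phi_ref) * sojourn + best_next
    # Equation (34): move theta along phi(s_k, a_k), scaled by the error.
    error = target - q_hat(theta, phi_sa)
    return [w + alpha * error * f for w, f in zip(theta, phi_sa)]


# Hypothetical numbers: two actions, two features per action (four weights).
theta = [0.0, 0.0, 0.0, 0.0]
phi_a1 = [1.0, 0.5, 0.0, 0.0]
phi_a2 = [0.0, 0.0, 1.0, 0.5]
theta = fa_update(theta, phi_a1, reward=1.0, sojourn=2.0,
                  phi_ref=phi_a1, phi_next_all=[phi_a1, phi_a2], alpha=0.1)
```

Only the weights that multiply non-zero features of the visited pair change, which is what lets experience generalize to states with similar features.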
4.2.1 Proposed (DVFS-Enabled) 
As discussed in the previous section, each element of the 
vector 𝝓(𝑠, 𝑎) is called a feature; 𝜙𝑖(𝑠, 𝑎) denotes the value 
of feature 𝑖 for state-action pair (𝑠, 𝑎). The feature func-
tion 𝝓:𝓢 × 𝓐 → ℝ𝑔 maps each pair (𝑠, 𝑎) to a vector of 
feature values. Finding the right feature function plays a 
key role in the success of our RL-based algorithm.  
Our first suggestion is to use the standard Radial Basis
Function (RBF) as the state-only feature function [36], and
then construct the whole feature function 𝝓(𝑠, 𝑎) from it;
in particular, the 𝑖-th component of 𝝓(𝑠) is defined as:

$$ \phi_i(s) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\lVert \boldsymbol{c} - \boldsymbol{\omega} \rVert^2}{2\sigma^2} \right), \qquad (35) $$
where 𝜎 is a constant reflecting the width of the features
and 𝝎 is an M-element vector that specifies the centers of
the features. In our context, each center 𝜔𝑚 denotes a
value that lies within the normal temperature range of a
processor (e.g., 𝜔𝑚 ∈ [330, 360] Kelvin), and 𝒄 − 𝝎 is the
difference between the thermal sensor readings and these
center values. Typically, rather than letting 𝜔𝑚 take values
from a continuous range, each element 𝜔𝑚 is chosen from
one of 𝑥 discrete values. For example, if 𝑥 = 2, then each
element has 2 candidate centers (e.g., 𝜔𝑚 ∈ {340, 350}).
Thus, we will have 𝑔 = 𝑥^𝑀 combinations for the vector 𝝎,
and a total of 𝑔 feature functions 𝜙𝑖(𝑠), 𝑖 = 1, …, 𝑔, to
estimate the value of each state. With this scheme, the
features are extracted only from the thermal component of
the system state 𝑠. However, the whole point of
approximating the Q-table is to be able to correctly rank
the actions in a given state. Therefore, if 𝑎 ≠ 𝑎′, the
features used for approximating 𝑄(𝑠, 𝑎) should be different
from the features used for approximating 𝑄(𝑠, 𝑎′). Following
the discussion in [50], given a pair (𝑠, 𝑎), this constraint
is met by mapping 𝑠 to a vector of feature values 𝜙𝑖(𝑠), and
then using these values in the slot corresponding to action
𝑎 while setting the feature values for the rest of the
actions to zero.
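The following example shows this mechanism for a
system with 2 actions and 3 features per action. Hence,
3 × 2 = 6 features are used for linear function approximation.

$$ \boldsymbol{\phi}(s) = \begin{bmatrix} \phi_1(s) \\ \phi_2(s) \\ \phi_3(s) \end{bmatrix} \Rightarrow \boldsymbol{\phi}(s, a_1) = \begin{bmatrix} \phi_1(s) \\ \phi_2(s) \\ \phi_3(s) \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \boldsymbol{\phi}(s, a_2) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \phi_1(s) \\ \phi_2(s) \\ \phi_3(s) \end{bmatrix} \qquad (36) $$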
However, as the number of actions increases, the number
of parameters increases too, and more memory is required
to store them. With the proposed scheme, the dimension of
𝜽 is in the order of O(𝑥^𝑀 × number of V-F levels (L) ×
number of cores). As such, the proposed DVFS-Enabled
scheme is only feasible for systems with a moderate number
of actions.
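The construction of the RBF features in (35) and the per-action slots in (36) can be sketched as follows. The function names, the sensor readings, the candidate centers and the width are hypothetical values chosen only to keep the example small.

```python
import itertools
import math


def rbf_centers(candidates, M):
    """Enumerate all x**M center vectors omega, with x = len(candidates)."""
    return list(itertools.product(candidates, repeat=M))


def state_features(c, centers, sigma):
    """Evaluate equation (35) for every center vector omega."""
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    feats = []
    for omega in centers:
        sq_dist = sum((ci - wi) ** 2 for ci, wi in zip(c, omega))
        feats.append(norm * math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return feats


def state_action_features(phi_s, action_index, num_actions):
    """Copy phi(s) into the slot of the chosen action, zeros elsewhere (36)."""
    g = len(phi_s)
    phi_sa = [0.0] * (g * num_actions)
    start = action_index * g
    phi_sa[start:start + g] = phi_s
    return phi_sa


# Hypothetical setup: M = 3 sensor readings, x = 2 candidate centers each.
centers = rbf_centers([340.0, 350.0], M=3)   # g = 2**3 = 8 center vectors
phi_s = state_features([341.0, 349.0, 345.0], centers, sigma=5.0)
phi_sa = state_action_features(phi_s, action_index=1, num_actions=2)
```

With 𝑔 features per action, the weight vector has 𝑔 times the number of actions entries, which makes the memory growth discussed above explicit.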
4.2.2 Proposed (IR) 
In this section, we attempt to further reduce the
complexity of our approximate Q-learning algorithm. This
comes at the cost of reducing the action space to only
choosing a processing core (without being able to set its
working V-F level); i.e.,

$$ a_k \in \mathcal{M} = \{m_1, m_2, m_3, \ldots, m_m, \ldots, m_M\} \qquad (37) $$

In our proposed IR scheme, only 4 simple features are
extracted for each state-action pair. In particular, each
state-action pair of the system is featurized in the form of
the quadruple below:

$$ \{c_a,\ d_a(\mathit{Center}),\ d_a(\mathit{Hotspot}),\ \mathit{PairingRatio}_a\} \qquad (38) $$
 
where we have:
• 𝒄𝒂: the sensor reading associated with the chosen core a.
• 𝒅𝒂(𝑪𝒆𝒏𝒕𝒆𝒓): the Euclidean distance of core a from the
chip center. The rationale for including this feature is
that cooling the cores located at the center of the chip
is harder if they become hot.
• 𝒅𝒂(𝑯𝒐𝒕𝒔𝒑𝒐𝒕): the Euclidean distance of core a from the
hottest on-chip component. In fact, core a is thermally
affected by hot adjacent cores. Moreover, assigning
tasks to neighboring cores can increase the temperature
of core a.
• 𝑷𝒂𝒊𝒓𝒊𝒏𝒈𝑹𝒂𝒕𝒊𝒐𝒂: the ratio of data transfer paths passing
through the hottest on-chip component to the number of
tasks likely to be paired with the task running on core a.
In our proposed IR scheme, some feature functions are
related to the actions while others are based on the states.
Similarly to our DVFS-Enabled scheme, we have features
concerning the temperature of the cores, but in the IR
scheme, the only temperature used as a feature is that of
the core selected by action a. Given that all elements of
the above feature quadruple have a continuous real-valued
nature, we define an RBF for each feature element. In
particular, we use 𝝓𝑐𝑎(𝑠, 𝑎), 𝝓𝑑𝑎(𝐶𝑒𝑛𝑡𝑒𝑟)(𝑠, 𝑎),
𝝓𝑑𝑎(𝐻𝑜𝑡𝑠𝑝𝑜𝑡)(𝑠, 𝑎) and 𝝓𝑃𝑎𝑖𝑟𝑖𝑛𝑔𝑅𝑎𝑡𝑖𝑜𝑎(𝑠, 𝑎) to denote the
thermal features, the distance from the chip center, the
distance from the hottest component and the ratio of
pairings, respectively. Also, let 𝑥𝑖, 𝑖 ∈ {1, 2, 3, 4}, be the
number of RBF centers used for the 𝑖-th element of our
feature quadruple. The dimension of the parameter vector 𝜽
in our proposed IR scheme would be in the order of
$O(\prod_i x_i)$, which is much smaller than in the proposed
DVFS-Enabled scheme. Hence, as the number of RBFs
increases, the number of learning parameters would still be
manageable.
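To make the feature quadruple (38) concrete, the sketch below computes it for one candidate core and expands it into RBF features. Several points are assumptions made only for illustration: the hottest core stands in for the hottest on-chip component, the pairing ratio is passed in as a ready-made number, the Gaussians are left unnormalized, and the per-element RBFs are combined as a tensor product so that the feature count equals the product of the 𝑥𝑖.

```python
import itertools
import math


def ir_quadruple(core, temps, coords, chip_center, pairing_ratio):
    """Return (c_a, d_a(Center), d_a(Hotspot), PairingRatio_a) for core a.

    The hottest core stands in for the hottest on-chip component, and the
    pairing ratio is supplied by the caller (it depends on NoC traffic).
    """
    hottest = max(temps, key=temps.get)
    return (
        temps[core],
        math.dist(coords[core], chip_center),
        math.dist(coords[core], coords[hottest]),
        pairing_ratio,
    )


def rbf_1d(value, centers, sigma):
    """Unnormalized Gaussian RBF responses of a scalar to each center."""
    return [math.exp(-(value - w) ** 2 / (2.0 * sigma ** 2)) for w in centers]


def ir_features(quad, centers_per_element, sigma):
    """Combine per-element RBFs into prod(x_i) features (assumed product)."""
    per_element = [rbf_1d(v, ctrs, sigma)
                   for v, ctrs in zip(quad, centers_per_element)]
    return [math.prod(combo) for combo in itertools.product(*per_element)]


# Hypothetical 2x2 mesh with normalized temperatures.
coords = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0), 3: (1.0, 1.0)}
temps = {0: 0.2, 1: 0.9, 2: 0.4, 3: 0.5}
quad = ir_quadruple(0, temps, coords, chip_center=(0.5, 0.5), pairing_ratio=0.0)
phi = ir_features(quad, [[0.25, 0.75]] * 4, sigma=0.3)
```

With two centers per element, the example yields 2 × 2 × 2 × 2 = 16 features regardless of the number of cores or V-F levels.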
Table 4 summarizes the notations used in our pro-
posed learning algorithms. Also, Algorithm 1 is a generic 
pseudo-code outlining the overall learning procedure for 
both our proposed variations. 
 
TABLE 4
NOTATIONS USED IN THE PROPOSED LEARNING ALGORITHMS
| Description | Notation |
|---|---|
| The long-term average “one-stage” reward obtained by following policy 𝜋 starting from any state s ∈ 𝒮 | $\rho^{\pi}$ |
| The value of state s obtained by following policy π | $V^{\pi}(s)$ |
| The maximum value of state s | $V^{*}(s)$ |
| The optimal long-term average “one-stage” reward | $\rho^{*}$ |
| Value of action a at state s | $Q(s, a)$ |
| Stage index | $k$ |
| Instant reward obtained by doing action a in state s and transiting to state 𝑠′ | $r(s, a, s')$ |
| An arbitrarily chosen reference state-action pair | $(s^{*}, a^{*})$ |
| The random transition time from state 𝑠 to 𝑠′ by performing action a | $t(s, a, s')$ |
| Learning rate | $\alpha_k$ |
| Vector of feature weights | $\theta$ |
| Parameter vectors of action a | 𝜃𝑖𝑎 |
| The 𝑖-th feature function | $\phi_i(s)$ |
| Number of RBFs | $g = x^{M}$ |
| Width of RBF features | $\sigma$ |
| M-element vector, consisting of feature centers | $\omega$ |
| Number of feature centers | $x$ |
| The temperature obtained from the sensor placed next to the 𝑎-th core at time t | $c_a(t)$ |
| The Euclidean distance of core a from the chip center | $d_a(\mathit{Center})$ |
| The Euclidean distance of core a from the hottest on-chip component | $d_a(\mathit{Hotspot})$ |
| The ratio of data transfer paths passing through the hottest CPU component to the number of tasks likely to be paired with the task running on core a | $\mathit{PairingRatio}_a$ |
| Thermal features | 𝜙𝑖𝑐 |
| Distance from the center of the chip | 𝜙𝑖𝑑(𝐶𝑒𝑛𝑡𝑒𝑟) |
| Distance from the hottest on-chip point | 𝜙𝑖𝑑(𝐻𝑜𝑡𝑠𝑝𝑜𝑡) |
| Ratio of the number of communications | 𝜙𝑖𝑃𝑎𝑖𝑟𝑖𝑛𝑔𝑅𝑎𝑡𝑖𝑜 |
5  SIMULATION EXPERIMENTS AND PERFORMANCE 
EVALUATION 
In this section, we first describe our simulation flow, set-
tings, and tools. Then, we compare our simulation results 
with related previous approaches.  
5.1 Simulation platform and settings 
Here, we introduce our simulation tools and explain the 
role that each tool plays in the simulation workflow. Also, 
we show the overall workflow and specify the values for 
key parameters. 
5.1.1 Simulation tools and workflow 
The simulation workflow is comprised of two phases (see 
Fig. …): online and offline. The offline phase is performed 
only once at the beginning of the simulation to collect 
data for the online phase. The online phase, however, is 
repeatedly executed in successive trials.  
More specifically, in the offline phase, we carry out the
following three steps:
1) A multi-core processor is simulated using Sniper [37].
As our processor is made up of similar processing
cores, we only create one core and use it to simulate
the entire processor. We also use the Splash-2
benchmark suite to simulate various programs with
different input sizes (i.e., each Splash-2 program is
considered as a single task). The execution of each task
at each V-F level is simulated using Sniper [37].
2) We use the McPAT simulator [39] to calculate the static
and dynamic power of running each task at each V-F
level. In particular, McPAT is fed with the simulation
results produced by Sniper, including the number of
accesses to RAM, main memory, etc. Next, the routers’
power consumption is calculated using DSENT [40],
which is fed with the NoC router specifications and the
data injection rate.
3) In the third step of the offline phase, the floor plan of a
core is simulated using HotFloorPlan [17], which takes
as input the area of the various units of a processing
core (as produced by Sniper).
Algorithm 1 The Proposed RL-based Task Scheduling Algorithm
Initialization:
    k = 0; 𝜽0 ← 0; ∀(𝑠, 𝑎): 𝑄0(𝑠, 𝑎) ← 0; Set Q̂(𝑠∗, 𝑎∗); 𝜋: 𝜀-greedy;
begin
    case (event) do
        TASK_ARRIVAL:
            // an idle core exists
            if ∃𝑚 ∈ ℳ s.t. 𝑏𝑚(𝑘) = 0 then
                Call Assign_Task();
            end if
        TASK_DEPARTURE:
            // the ready queue is not empty
            if 𝑞(𝑘) != 0 then
                Call Assign_Task();
            end if
end
Function Assign_Task()
begin
    Read temperature values 𝒄(𝑘) from temperature sensors in state 𝑠𝑘+1;
    Use (…) to calculate 𝑟(𝑠𝑘, 𝑎𝑘, 𝑠𝑘+1);
    Obtain 𝑡(𝑠𝑘, 𝑎𝑘, 𝑠𝑘+1);
    Use (…) to calculate 𝑄𝑘+1(𝑠𝑘, 𝑎𝑘);
    Use (…) to calculate Q̂(𝑠𝑘, 𝑎𝑘);
    Use (…) to update 𝜃;
    𝑎𝑘 ← Choose an action according to policy 𝜋;
    Apply 𝑎𝑘;
    𝑠𝑘 ← Current system state;
    Calculate 𝜙(𝑠𝑘, 𝑎𝑘) (…);
end
As mentioned before, the offline phase is executed only
once at the beginning of the simulation. Below, we discuss
the 6 steps associated with the online phase:
1) Various tasks with different input sizes arrive randomly
at the system over time (according to a Poisson
process). The incoming tasks are queued in the waiting
line.
2) The scheduler assigns a task from the waiting line to
an idle core and sets its working V-F level.
3) The newly assigned task pairs randomly with a running
task that is not currently paired. Following a pairing,
the NoC routers involved in the pairing are injected
with data at an injection rate randomly chosen from
[0,1].
4) The execution time of the newly assigned task is
extracted from the table produced by Sniper. Also, owing
to the uncertainties related to the transmission time
and the number of involved NoC routers, each pairing's
communication time is not deterministic. To capture
these uncertainties, the communication time is
considered as an exponentially distributed variable
with rate of 𝜉.
5) The power consumption of each unit and the floor plan
of a core are given to HotSpot, which generates the
thermal profile of the chip considering the thermal
interactions of neighboring units.
6) The temperature margin is calculated and reported as
an instantaneous reward to the scheduler.

Fig. 4. Simulation workflow
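The stochastic inputs of the online phase (steps 1, 3 and 4 above) can be sampled as in the following sketch. The function name, the numeric rates, and the choice of a uniform distribution for the injection rate are assumptions made for illustration; the scheduler, Sniper tables and HotSpot calls are deliberately left out.

```python
import random


def sample_workload(arrival_rate, comm_rate, horizon, seed=0):
    """Draw task arrival instants and one communication time per task.

    arrival_rate -- rate of the Poisson arrival process (tasks per second)
    comm_rate    -- rate of the exponential communication time
    horizon      -- length of the simulated interval in seconds
    """
    rng = random.Random(seed)
    now, tasks = 0.0, []
    while True:
        # Inter-arrival times of a Poisson process are exponential.
        now += rng.expovariate(arrival_rate)
        if now > horizon:
            break
        tasks.append({
            "arrival": now,
            "comm_time": rng.expovariate(comm_rate),
            # Assumed uniform draw for the router injection rate in [0, 1].
            "injection_rate": rng.uniform(0.0, 1.0),
        })
    return tasks


# Hypothetical rates: 8.41 tasks per second over a 100-second window.
tasks = sample_workload(arrival_rate=8.41, comm_rate=2.0, horizon=100.0)
```

Each sampled arrival would then trigger the TASK_ARRIVAL branch of the scheduler, and each completion the TASK_DEPARTURE branch.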
5.1.2 Parameters and settings 
The multicore processor is based on a mesh architecture
in which each core is an Intel Xeon Gainestown
(Nehalem-EP) [51], and both the NoC routers and the cores
are in 45 nm technology (the core and router settings and
the V-F levels are described in Table 5 [52]). In the
experiments, we vary the mesh size from 3×3 up to 7×7 cores.
The injection rate of the NoC routers involved in each
pairing is randomly selected from [0,1], and the well-known
XY routing algorithm [44] is used for packet routing.
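TABLE 5
THE SETTINGS FOR CORES, NOC ROUTERS AND V-F LEVELS

Core settings [18]
| Parameter | Value |
|---|---|
| L1-I | 16 KB |
| L1-D | 16 KB |
| L2 | 256 KB |
| ITLB | 16 entries |
| DTLB | 16 entries |

Router settings [18]
| Parameter | Value |
|---|---|
| Ports | 5 |
| Frequency | 2 GHz |
| Virtual channels | 8 |
| Flit size | 144 b |
| Buffer size | 24 flits |

V-F levels [52]
| Voltage (V) | Frequency (GHz) |
|---|---|
| 0.9 | 2.7 |
| 1.0 | 3.0 |
| 1.1 | 3.3 |
| 1.2 | 3.6 |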
 
 
As mentioned before, each Splash-2 program is considered
as a single task. The suite consists of 14 programs, and
each program can be executed with different input sizes. By
varying these sizes, we generate a total of 29 types of
tasks. Each task type enters the system as a Poisson stream
with a mean rate of 0.29 tasks per second (i.e., the overall
rate λ is 29 × 0.29 = 8.41 tasks per second for all tasks).
The communication time for each pair is exponentially
distributed with mean ζ. To capture the temperature of the
processing cores at time t, 𝑐(𝑡), we place a thermal sensor
next to each core, and the thermal values are obtained by
reading these sensors. As the dimension of c(t) corresponds
to the number of processing cores, placing sensors next to
every single core is not practical for larger processors. As
a remedy, we use linear interpolation [53] to decrease the
number of temperature values to 9 irrespective of the
multicore size. In linear interpolation, we exploit the
known function values to fill in the unknown ones. More
formally, if a function Z is given at 4 points, then its
interpolant takes the form Z(x, y) = ax + by + cxy + d. One
equation is written for each given point, and the
coefficients (a, b, c, d) are calculated. Using these
coefficients, the value of the function can be calculated at
any new point (p, q). By exploiting linear interpolation, we
have only 9 temperature values from which to build ϕ using
the feature function. We also consider 2, 3, and 5
temperature centers for each basis function, which are
normalized to [0, 1] in Table 6. Therefore, depending on the
number of temperature centers, we have 2^9, 3^9, and 5^9
feature functions to estimate the Q-value of action a in
state s. The initial values of Q and θ are zero, and σ is
defined as in Table 6 depending on the number of
temperature centers.
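The interpolation step can be illustrated with a short sketch that fits Z(x, y) = ax + by + cxy + d (a bilinear form, because of the cxy term) to four samples and evaluates it at a new point. The function names, the sample positions and the temperature values are hypothetical.

```python
def fit_bilinear(points):
    """Solve Z(x, y) = a*x + b*y + c*x*y + d from four (x, y, z) samples.

    Uses Gauss-Jordan elimination with partial pivoting on the 4x4 system;
    the four sample points must make that system non-singular.
    """
    rows = [[x, y, x * y, 1.0, z] for x, y, z in points]
    n = 4
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return tuple(rows[i][n] / rows[i][i] for i in range(n))


def interpolate(coeffs, p, q):
    """Evaluate the fitted function at a new point (p, q)."""
    a, b, c, d = coeffs
    return a * p + b * q + c * p * q + d


# Hypothetical sensor readings (Kelvin) at the corners of a unit square.
samples = [(0.0, 0.0, 340.0), (1.0, 0.0, 350.0),
           (0.0, 1.0, 344.0), (1.0, 1.0, 358.0)]
coeffs = fit_bilinear(samples)
center_temp = interpolate(coeffs, 0.5, 0.5)
```

For these samples the fit gives (a, b, c, d) = (10, 4, 4, 340), so the interpolated value at the center of the square is 348 K.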
 
 
 
 
 
 
 
 
 
 
Fig. 5. The floorplan of a single processing core 
 
TABLE 6
NUMBER OF TEMPERATURE CENTERS AND FEATURE WIDTH
| Number of centers | Feature width | Temperature centers |
|---|---|---|
| 2 centers | 0.09 | 0.33 and 0.66 |
| 3 centers | 0.07 | 0.25, 0.5 and 0.75 |
| 5 centers | 0.05 | 0, 0.25, 0.5, 0.75, 1 |
 
Finally, in equation (28), where 𝛼𝑘 (the learning rate) is
calculated, A and B are set to 50 and 1000, respectively.
The term "Iteration" refers to the repetition number of the
learning algorithm. Table 7 summarizes the simulation
parameters.
5.2 Comparison with related work 
In this section, we compare our proposed algorithms with
previous work:
• TBO: The TBO scheme in [26] thermally manages CMPs
using a finite state machine defined for executing
multi-threaded applications. The states are named start,
wait, read, calculate, and assign. In the start state,
all variables are initialized; the algorithm then
switches to the wait state, where the system waits for a
new time quota to run a new application. In the read
state, a new time quota has started, and temperature
values are collected from the on-chip thermal sensors.
Then, in the calculate state, a weight matrix is created
such that each element of the matrix is the inverse of
the Euclidean distance of the corresponding core from
the center of the chip. Next, a utilization matrix is
built from the utilization values of the processing
cores, and it is multiplied with the weight matrix to
produce a cost matrix. If two cores have the same
utilization simultaneously, the core closer to the
center of the chip will have a higher temperature than
the other. This is why TBO exploits the weight matrix in
assignment decisions in addition to the utilization and
temperature values. Finally, the algorithm switches to
the assign state and allocates tasks to cores according
to the minimal cost principle: if two cores have the
same weight, the core with the lower temperature is
selected to run a task.
• LDT & LCT: A more closely related work to our proposed
scheme is [18], in which the assignment problem is
formulated as an MDP [32]. Each system state consists of
the temperature values obtained from the on-chip thermal
sensors, and an action refers to assigning a task to a
core. The instantaneous reward is the temperature margin.
The work in [18] exploits reinforcement learning to
solve the MDP. However, since there is no rigorous
formulation of the problem, there is an ambiguity with
respect to the nature of time: on the one hand, the
authors mention that a task is allocated to an idle core
upon each task arrival, which is indicative of a
continuous-time MDP. On the other hand, the Bellman
equations in [18] are all based on a discrete-time MDP
formulation, which contradicts the event-based operation
discussed by the authors. As such, we simulate [18] in
both continuous- and discrete-time versions. We use LCT
to refer to the continuous-time simulation of [18], and
LDT for the discrete-time version.
• RAND: The fourth baseline is purely random task
assignment, which has a very low implementation cost
but can still serve as a reference for performance
comparisons.

TABLE 7
PARAMETERS AND NOTATIONS
| Notation | Description | Value |
|---|---|---|
| 𝑀 | Count of processing cores and NoC routers | 3×3, 4×4, 5×5, 6×6 and 7×7 |
| 𝜆 | Accumulative task arrival rate | 8.41 tasks per second |
| ζ | Exponential rate of pairing time | The average execution time of paired tasks |
| 𝑙𝐿 | Working voltage and frequency level of the processing core | 0.9/2.7, 1.0/3.0, 1.1/3.3, 1.2/3.6 (V (V) / F (GHz)) |
| 𝑇𝑡ℎ | Valid temperature threshold | 358 Kelvin |
| 𝛼 | Learning rate | A/(B+k) |
| A | Constant value in learning rate equation | 50 |
| B | Constant value in learning rate equation | 1000 |
| 𝜎 | Feature width | 2 centers: 0.09; 3 centers: 0.07; 5 centers: 0.05 |
| ω | Feature centers | 2 centers: 0.33, 0.66; 3 centers: 0.25, 0.5, 0.75; 5 centers: 0, 0.25, 0.5, 0.75, 1 |
| 𝑧 | Count of RBFs | 2 centers: 512; 3 centers: 19683; 5 centers: 1953125 |

Our proposed DVFS-Enabled scheme works in a
continuous-time fashion, as it assigns a task to a core and
sets its V-F level as soon as a system event occurs. These
events include: 1) the arrival of a new task and 2) the
departure of a task that has been completely executed. The
proposed IR scheme is similar to DVFS-Enabled, with the only
difference being that it does not apply DVFS to the cores.
5.3 Performance evaluation Criteria 
We evaluate the proposed schemes in terms of the following
criteria:
• Average peak temperature: The primary criterion is to
reduce the average peak temperature of the processor
in the long run.
• Average service time: The secondary criterion is the
service time, which refers to the interval between the
arrival of a task and its completion. It includes both
the execution time and the time that a task waits for
its turn in the waiting line. Therefore, a lower task
service time means less waiting time or quicker
execution, which leads to higher system performance.
• The convergence of learning parameters: The learning
algorithm is repeated for each trial and updates its
parameters and the decision-making policy. Algorithmic
convergence is important for system stability.
• Dynamic power consumption: As a peripheral goal,
reducing the dynamic power (which depends on the
working voltage and frequency level of the cores) is
highly desirable due to power budget limitations.
5.4 Experiments and simulation results analysis 
In this section, the proposed approaches (DVFS-enabled 
and IR) are compared with TBO, LCT, LDT, and RAND.  
5.4.1 Test 1: Convergence of the proposed learning algorithms
In our proposed learning algorithms, Q and θ are updated
at the end of each trial until they converge to their
expected values in the long run. Figure 6 shows the Q-value
convergence of DVFS-Enabled for four different actions in a
specific state (0.4, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6),
where only the interpolated temperature values are
indicated. Figures 7 and 8 show the Q-value and θ-value
convergence for our IR scheme, respectively. In Figure 7,
Q1, Q2 and Q3 are the Q-values of some randomly selected
actions in states (0.1, 0.5, 0.5, 0.0), (0.5, 0.5, 0.5, 0.0)
and (0.8, 0.5, 0.5, 1.0). Figure 9 shows the θ-value
convergence for Proposed (DVFS-Enabled). In Figures 8 and 9,
Theta1, Theta2 and Theta3 are the coefficients of the first,
second and third radial basis functions.
The learning rate is at its highest in the first steps of
learning, giving new observations a greater effect on the
updates of θ and the Q-values. Actions are also selected
randomly in the initial trials, and as a result, θ and the
Q-values fluctuate strongly in the early stages. In
intermediate trials, the variations taper off, and there are
no significant changes in the final trials. This is due to
the gradual reduction of the learning rate toward zero. In
the final stages, θ and Q mostly keep their current values
and are much less influenced by new observations.
5.4.2 Test 2: Average peak temperature comparison
Figure 10 shows the average peak temperature for each
method. As can be seen, the proposed DVFS-Enabled
scheme outperforms the others. Its average peak temperature
is 3 K lower than LDT and LCT, 1.5 K lower than TBO, and
as much as 6 K lower than RAND. As for the proposed IR
scheme, its average peak temperature is about 1.4 K lower
than LDT and LCT, and about 4 K lower than RAND.
 
 
Fig. 6. Q-Value convergence of Proposed (DVFS-Enabled) 
 
 
Fig. 7. Q-Value convergence of Proposed (IR) 
As mentioned earlier, TBO creates a cost matrix over the
cores and assigns each task to the core with the minimum
cost. This cost matrix is built from the utilization history
of each core and its inverse Euclidean distance from the
center of the chip. Proposed (IR) also considers the
distance from the center of the chip and from the hotspot,
but, unlike TBO, it ignores the utilization history, which
may be the reason for the lower peak temperature obtained by
TBO compared to Proposed (IR).
 
Fig. 8. θ convergence of Proposed (IR) 
 
Fig. 9. θ convergence of Proposed (DVFS-Enabled) 
 
 
Fig. 10. The average peak temperature 
5.4.3 Test 3: Average service time
In Figure 11, both of the proposed approaches outperform
LDT in terms of the average service time. In fact, LDT
operates in regular time intervals without any regard to
the arrival or departure events. As such, it always waits
for a new time quota to assign a new task. Hence, the
waiting line may remain crowded despite the availability
of several idle processing cores. The proposed schemes,
on the other hand, work in a continuous-time manner and
assign a new task as soon as an event (a task arrival or
departure) occurs. Tasks are immediately assigned to cores
(if the line is not empty) until there is no idle core left,
so the overall task service time is reduced in comparison
with LDT. As a side note, since the DVFS-Enabled scheme may
apply lower V-F levels to reduce the peak temperature, its
task execution time is slightly higher than that of LCT,
RAND, and IR. It is also noteworthy that the curves related
to LCT, RAND and IR coincide in Figure 11. The reason is
that none of these schemes applies DVFS; they all use the
same fixed V-F level for all the cores during the
simulation.
5.4.4 Test 4: The impact of mesh size on the average peak temperature
Figure 12 depicts the impact of the mesh size (i.e., the
number of processing cores) on the average peak temperature.
As the mesh size increases, the peak temperature also
increases. Also, at the larger mesh sizes, DVFS-Enabled is
outperformed by both our proposed IR and the TBO scheme.
In fact, both TBO and IR avoid assigning tasks to cores near
the center of the chip and instead prefer the cores located
far from the center, such as those on the edges or corners.
As fewer cores surround these cores, they are less affected
by the neighboring cores and are also less likely to become
hot.
 
Fig. 11. The average task service time 
 
Fig. 12. The average peak temperature vs. mesh size
5.4.5 Test 5: Impact of the mean task arrival rate on average peak temperature
According to Figure 13, in general, the peak temperature
rises with an increase in the mean task arrival rate. Both
of the proposed schemes outperform the others for every
arrival rate. In fact, IR tries to assign new tasks to the
cores located far from the chip center and the hottest core,
while DVFS-Enabled exploits several V-F levels to maximally
reduce the average peak temperature. However, under higher
arrival regimes, a larger number of tasks run in the system,
and the heat arising from the neighboring cores may affect
the peak temperature. The proposed IR scheme considers the
distance from the hottest core and the chip center, and it
thereby partly mitigates the thermal interactions among the
processing cores. This is why IR outperforms DVFS-Enabled
under higher arrival intensities.
 
 
 
Fig. 13. The average peak temperature vs. arrival rate 
5.4.6 Test 6: Dynamic power consumption  
As mentioned before, there is an NoC router next to each
processing core in a mesh topology. A core together with
its local router is called a tile. Figure 14 reports the
average dynamic power consumption of each tile
individually. Under TBO and IR, the internal cores (5, 6,
9, 10) consume less dynamic power, while cores (0, 3, 12,
15) dissipate more dynamic power. This is because TBO and
IR prefer cores away from the chip center (cores are
numbered starting from 1 at the top left of the mesh and
ending at the bottom-right core; within each row, cores
are numbered from left to right).
The dynamic power consumption depends on the working
V-F level of the processor. Since DVFS-Enabled applies
DVFS to each processing core separately, it is more
power-efficient than the other schemes across all tiles.
However, according to Figure 14, the average dynamic
power consumption under DVFS-Enabled is higher in the
internal cores because tasks may be assigned unevenly;
for example, DVFS-Enabled assigned 641 tasks (on average)
to core 6 but only 108 tasks (on average) to core 16.
Figure 15 shows the total dynamic power consumption of
all tiles under each approach. As mentioned in the
previous subsections, all methods except Proposed
(DVFS-Enabled) apply a fixed V-F level, which we set here
to 1.1 V and 3.3 GHz. Proposed (DVFS-Enabled), in
contrast, applies the various V-F levels listed in
Table 5. Since dynamic power depends on both voltage and
frequency, the results in Figure 15 are the same for all
methods except Proposed (DVFS-Enabled). Proposed
(DVFS-Enabled) dissipates less dynamic power than the
others because it sometimes applies lower voltage and
frequency levels to reduce the average peak temperature.
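
The dependence of dynamic power on the V-F level can be made concrete with the standard CMOS switching-power model, P = a · C · V² · f. The sketch below uses this textbook model as an assumption; it is not the power model of the simulation tools used in this paper. The effective capacitance c_eff, the activity factor, and the lower V-F level (0.9 V, 2.4 GHz) are hypothetical placeholders, since the actual levels of Table 5 are not repeated in this subsection.

```python
def dynamic_power(voltage, freq_hz, c_eff=1e-9, activity=1.0):
    """Switching power P = a * C * V^2 * f (standard CMOS model, in watts)."""
    return activity * c_eff * voltage ** 2 * freq_hz


# Fixed V-F level used by the non-DVFS schemes in this test.
fixed = dynamic_power(1.1, 3.3e9)

# Hypothetical lower V-F level, for illustration only (not taken from Table 5).
lower = dynamic_power(0.9, 2.4e9)

# The ratio is independent of the assumed capacitance and activity factor.
print(f"relative dynamic power at the lower level: {lower / fixed:.2f}")
```

Because voltage enters quadratically, even a modest reduction of the V-F level lowers dynamic power noticeably, which is in line with the advantage of Proposed (DVFS-Enabled) in Figure 15.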
 
 
Fig. 14. Dynamic power consumption per tile
5.4.7 Test 7: The impact of thermal center count on the
average peak temperature
Since part of the state space is continuous and its other
elements span a broad range, it is not possible to store
state-action values in a table. Function approximation is
used instead, which requires a feature calculator for the
continuous part of the state space. As discussed in
Section 4.2, we rely on the standard RBF-based
approximation technique [36] with several thermal centers
(ω). Also, as mentioned before, the temperature values
collected from on-chip sensors are reduced to only 9
values using bilinear interpolation.
Therefore, 𝑔 = 𝑥𝑀 features are used to estimate the value
of an action in a specific state. Moreover, if the
dimensions of the feature vector (ϕ) are 𝑔 × 1, the
dimensions of θ are (the count of actions in a state) × 𝑔.
As a result, an increase in the number of temperature
centers enlarges θ to an unmanageable size (Table 8).
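
A minimal sketch of RBF features with a linear action-value estimate is given below. It assumes Gaussian basis functions placed on a full grid with the same ω candidate centers along each of the 9 interpolated temperature dimensions, so that the feature vector has ω^9 entries; the grid placement, the center values, the width sigma, and the action count are illustrative assumptions and are not taken from Section 4.2.

```python
import itertools
import math


def rbf_features(temps, centers_per_dim, sigma=5.0):
    """Gaussian RBF features on a full grid of centers (one axis per reading)."""
    grid = itertools.product(centers_per_dim, repeat=len(temps))
    return [
        math.exp(-sum((t - c) ** 2 for t, c in zip(temps, center))
                 / (2 * sigma ** 2))
        for center in grid
    ]


def q_value(theta_row, phi):
    """Linear estimate: dot product of one row of theta with phi."""
    return sum(w * f for w, f in zip(theta_row, phi))


temps = [340.0] * 9          # nine interpolated temperature readings (K)
centers = [330.0, 350.0]     # omega = 2 hypothetical centers per dimension
phi = rbf_features(temps, centers)

num_actions = 3              # illustrative only
theta = [[0.0] * len(phi) for _ in range(num_actions)]  # one row per action

print(len(phi))              # 2 ** 9 = 512 features
print(q_value(theta[0], phi))  # 0.0 before any learning
```

With one weight row per action, θ holds (number of actions) × (number of features) entries, which is why the number of centers dominates the storage cost.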
 
 
Fig. 15. Total dynamic power of tiles 
 
In DVFS-Enabled, the count of actions is equal to 4 ×
number of V-F levels (L) × number of processing cores
(M). In IR, the action count is reduced to M. Therefore, θ
is larger in DVFS-Enabled, and more storage space is
needed as the number of temperature centers increases. As
shown in Table 8, with 5 temperature centers θ becomes so
large for DVFS-Enabled that running the assignment
simulations on an ordinary computer is infeasible.
In IR, this issue is handled by using combined
state-action features: the feature function includes
features for actions in addition to the thermal features,
which results in a much smaller θ. Consequently, IR can
use a higher number of temperature centers to obtain
higher accuracy.
TABLE 8
The count of parameters per number of temperature centers
(each cell: average peak temperature / count of parameters)
                          2 Centers          3 Centers            5 Centers
Proposed (IR)             348.32 K / 16      347.07 K / 81        345.74 K / 625
Proposed (DVFS-Enabled)   346.97 K / 51200   346.35 K / 1968300   Unmeasurable / 195312500
LCT                       350.12 K / 12800   349.33 K / 492075    Unmeasurable / 48828125
LDT                       350.83 K / 12800   349.95 K / 492075    Unmeasurable / 195312500
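
The growth of the parameter counts in Table 8 can be checked with a few lines of Python. The factors used below (100 weight rows over ω^9 features for Proposed (DVFS-Enabled), and a single weight vector over ω^4 features for Proposed (IR)) are not stated in the text; they are inferred here only because they reproduce the corresponding entries of Table 8, and should be read as an assumption about how those entries arise.

```python
def theta_size(num_rows, omega, dims):
    """Parameter count: number of weight rows times omega ** dims features."""
    return num_rows * omega ** dims


for omega in (2, 3, 5):
    dvfs = theta_size(100, omega, 9)   # inferred factors, see the note above
    ir = theta_size(1, omega, 4)       # inferred factors, see the note above
    print(omega, ir, dvfs)
# prints:
# 2 16 51200
# 3 81 1968300
# 5 625 195312500
```

The exponential dependence on the number of centers explains why five centers remain manageable for IR while the per-action parameterization grows to hundreds of millions of entries.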
5.4.8 Test 8: The impact of task arrival rate on the 
average peak temperature 
In Table 9, the peak temperatures are recorded for
different arrival rates. As a general trend, the peak
temperature rises as the arrival rate increases. To
explain this, consider what happens when the arrival rate
increases. The arrival rate is the average number of
tasks that arrive at the system per second. When a large
number of tasks enter the system within a short time, the
system becomes crowded and the processing cores execute
tasks continuously, with no time to cool down. Each
processing core therefore heats up individually, while
pairings and thermal interactions among cores add to the
heating.
TABLE 9 
The arrival rate impact on the average peak temperature 
 3.915  5.365  6.815  
Proposed (IR) 337.32 K 340.97 K 343.21 K 
Proposed (DVFS-Enabled) 336.56 K 340.04 K 343.75 K 
LCT 338.72 K 342.65 K 345.48 K 
LDT 338.9 K 343.56 K 345.94 K 
RAND 341.89 K 346.46 K 347.77 K 
TBO 337.91 K 342.35 K 344.68 K 
 
5.4.9 Test 9: The impact of mean arrival rate on
service time
Table 10 shows the task service times for different arrival
rates. Under higher arrival rates, the waiting time in the
queue increases, which results in longer service times for
all assignment schemes. LDT has the highest service time
of all schemes because it operates in discrete time.
TABLE 10 
Service time per arrival rate 
 3.915 5.365 6.815 
Proposed (IR) 1175 ms 1207 ms 1288 ms 
Proposed (DVFS-Enabled) 1218 ms 1251 ms 1330 ms 
LCT 1175 ms 1207 ms 1288 ms 
LDT 1284 ms 1317 ms 1395 ms 
RAND 1175 ms 1207 ms 1288 ms 
TBO 1175 ms 1207 ms 1288 ms 
In Table 10, both proposed schemes achieve lower service
times than LDT at every arrival rate. IR matches the
lowest service time, whereas DVFS-Enabled is roughly
40 ms slower than IR at each arrival rate because it
sometimes runs tasks at lower V-F levels.
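
The penalty of discrete-time dispatching mentioned above can be sketched as follows. The sketch assumes that a discrete-time scheduler may only dispatch at multiples of a fixed scheduling quantum, whereas a continuous-time scheduler dispatches as soon as a task arrives and a core is idle; the function name, the quantum of 200 ms, and the arrival times are hypothetical and do not correspond to the LDT configuration used in the experiments.

```python
def dispatch_time_ms(arrival_ms, quantum_ms=None):
    """Earliest dispatch time for a task, assuming an idle core is available."""
    if quantum_ms is None:
        # Continuous-time scheduler: dispatch immediately on arrival.
        return arrival_ms
    # Discrete-time scheduler: wait for the next quantum boundary
    # (integer ceiling division; an arrival on a boundary is not delayed).
    return -(-arrival_ms // quantum_ms) * quantum_ms


arrivals = [130, 480, 950, 1020]   # hypothetical arrival times (ms)
quantum = 200                      # hypothetical scheduling quantum (ms)

extra = [dispatch_time_ms(t, quantum) - dispatch_time_ms(t) for t in arrivals]
print(sum(extra) / len(extra))     # prints 105.0 (mean added wait in ms)
```

Even with idle cores available, every task in this example waits for the next boundary, which adds a delay that a continuous-time scheduler avoids.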
5.4.10 Test 10: The thermal profile of the processor
under the learning process
To investigate the learning process and its impact on
task assignment, we provide snapshots of the thermal
profile of the processor before and after learning.
Figure 16 shows the task assignment before learning
begins, where tasks are paired as follows:
• The task running on core 4 is paired with the task
running on core 16.
• The involved NoC routers are 4, 8, 12, 16.
• The task running on core 6 is paired with the task
running on core 9.
• The involved NoC routers are 6, 10, 9.
• The task running on core 5 is paired with the task
running on core 8.
• The involved NoC routers are 5, 6, 7, 8.

Fig. 16. Thermal profile of the processor before learning
 
Figure 17 shows the thermal profile resulting from
DVFS-Enabled, where:
• The task running on core 13 is paired with the task
running on core 3.
• The involved NoC routers are 13, 14, 15, 11, 7, 3.
• The task running on core 11 is paired with the task
running on core 12.
• The involved NoC routers are 11 and 12.
• The task running on core 6 is paired with the task
running on core 8.
• The involved NoC routers are 6, 7, 8.
As shown in Figure 17, although core 7 is not assigned
any tasks, it is bounded on all sides by other cores,
which raises its temperature through thermal
interactions. Thermal isolators could prevent this, but
we set this idea aside because of its cost and area
overhead.
According to Figure 16, the peak temperature of the
processor is 354.31 K, whereas applying DVFS-Enabled
reduces it to 347.4 K (Figure 17). Some methods, such as
TBO, aim to reduce the peak temperature through thermal
balancing, that is, by distributing heat uniformly across
the processing cores of the chip. Our DVFS-Enabled
scheme, in contrast, does not guarantee thermal balancing
but achieves the lowest peak temperature.
 
 
Fig. 17. Thermal profile after learning with DVFS-enabled 
 
Figure 18 depicts the thermal profile after the task
assignment policy has been learned by our proposed IR
scheme, where the pairings are as follows:
• The task running on core 16 is paired with the task
running on core 4.
• The involved NoC routers are 16, 12, 8, 4.
• The task running on core 1 is paired with the task
running on core 13.
• The involved NoC routers are 1, 5, 9, 13.
• The task running on core 14 is paired with the task
running on core 3.
• The involved NoC routers are 3, 7, 11, 15, 14.
As can be seen in Figure 18, IR tries to assign tasks to
cores located away from the chip center and the hottest
core. Therefore, IR largely avoids thermal interactions.
 
Fig. 18. Thermal profile after learning with Proposed (IR) 
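
The router lists given for the pairings in Figures 16 to 18 can be reproduced with a short sketch of dimension-ordered (XY) routing on a 4 × 4 mesh with cores numbered from 1 in row-major order. That the simulated NoC uses XY routing is an assumption made here because it matches the listed routers when the source tile is chosen as in the calls below; the function xy_route is hypothetical and is not part of the proposed schedulers.

```python
def xy_route(src, dst, width=4):
    """Routers visited from src to dst under XY routing (1-based tile ids)."""
    row, col = divmod(src - 1, width)
    dst_row, dst_col = divmod(dst - 1, width)
    path = [src]
    # First travel along the row (X dimension) ...
    while col != dst_col:
        col += 1 if dst_col > col else -1
        path.append(row * width + col + 1)
    # ... then along the column (Y dimension).
    while row != dst_row:
        row += 1 if dst_row > row else -1
        path.append(row * width + col + 1)
    return path


print(xy_route(4, 16))   # [4, 8, 12, 16]
print(xy_route(9, 6))    # [9, 10, 6]
print(xy_route(13, 3))   # [13, 14, 15, 11, 7, 3]
print(xy_route(14, 3))   # [14, 15, 11, 7, 3]
```

Pairings between distant tiles involve more routers, so the communication path itself contributes to the heat of the tiles it crosses.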
6  CONCLUSION 
In this paper, we first discussed how rising temperature
affects multicore processors and why thermal management
is important. We then reviewed several surveys and recent
papers on the thermal management of CMPs. Based on this
review, we categorized existing methods into two groups,
batch and online, according to whether tasks may arrive
at scheduling time. We also noted that, in real-world
scenarios, the system (as the environment of a learning
agent) is not deterministic and involves several
uncertainties that play a significant role in determining
its future state. These uncertainties include arrivals,
workload characteristics, pairings, and thermal
interactions. Batch methods ignore arrival uncertainty,
since all tasks enter the system prior to scheduling. The
online group, in contrast, has the potential to account
for all types of uncertainty. This group is further
divided into two subgroups: discrete-time and
continuous-time. Discrete-time methods wait for a new
time slot before assigning a task to a processing core,
even when several cores are idle. For this reason, they
result in higher service times, especially when the
system is crowded. We therefore set out to develop a
continuous-time method. For this purpose, we used the
SMDP framework to formulate the task assignment problem
in a multicore processor, and because statistical
knowledge of the environment was not available, we used a
model-free learning approach to solve the SMDP. In this paper,
we presented two continuous-time methods, named Proposed
(DVFS-Enabled) and Proposed (IR). In both methods, a
state space is defined to specify the system state at
each step, and the operating system scheduler is treated
as a learning agent that selects an action in each
system state. Each action changes the system state and
earns the scheduler a reward that indicates how well the
selected action serves the system goal. Here, the
long-term goal is to reduce the average peak temperature
of the CMP. The reward is also used to update the values
of the last action and the previous state, which are used
in future decisions. In the long run, the action and
state values thus converge to their optimal values, and
the optimal policy is obtained.
Our simulation results indicate that, in most cases, both
proposed approaches outperform the others in reducing
the average peak temperature. Since both methods operate
in continuous time, they also result in lower service
times than discrete-time scheduling. In addition, we
showed that Proposed (DVFS-Enabled) dissipates less
dynamic power than the others because it applies lower
voltage and frequency levels to keep the average peak
temperature as low as possible.
Comparing the two methods presented in this paper,
Proposed (DVFS-Enabled) uses a large number of learning
parameters, and this number grows as more temperature
centers are used, which in turn slows down learning.
Proposed (IR), on the other hand, can use more centers
for the state features since it needs fewer learning
parameters. Therefore, it learns faster and more
precisely.
Both proposed methods can be improved in several ways;
some suggestions follow:
• Since combining function approximation with off-policy
learning is problematic and its convergence is not
guaranteed, policy gradient methods can be used instead.
• Several constraints, such as the length of the waiting
queue and the number of routers involved in pairings,
can be considered to increase performance or energy
efficiency, or to reduce the average peak temperature.
• Other system events can be included, for example
exceeding a predetermined temperature threshold. This
event occurs when the temperature of a processing core
goes beyond the threshold. In this case, the operating
system scheduler migrates the running task to an idle
core and applies DPM techniques to cool down the hot
core.
REFERENCES 
[1] J. L. Hennessy and D. A. Patterson, Computer architecture: a 
quantitative approach. Elsevier, 2011. 
[2] S. Zhuravlev, J. C. Saez, S. Blagodurov, A. Fedorova, and P.
Prieto, "Survey of energy-cognizant scheduling techniques,"
IEEE Transactions on Parallel and Distributed Systems, vol. 24,
no. 7, pp. 1447-1464, 2012.
[3] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and
D. Burger, "Dark silicon and the end of multicore scaling," in
38th Annual International Symposium on Computer Architecture
(ISCA), pp. 365-376, 2011.
[4] M. Ansari, A. Yeganeh-Khaksar, S. Safari, and A. Ejlali,
"Peak-Power-Aware Energy Management for Periodic
Real-Time Applications," 2019.
[5] A. K. Singh, C. Leech, B. K. Reddy, B. M. Al-Hashimi, and G. V.
Merrett, "Learning-based Run-time Power and Energy
Management of Multi/Many-core Systems: Current and Future
Trends," Journal of Low Power Electronics, vol. 13, no. 3, pp.
310-325, 2017.
[6] J. Donald and M. Martonosi, "Techniques for multicore
thermal management: Classification and new exploration," in ACM
SIGARCH Computer Architecture News, vol. 34, no. 2, pp. 78-88,
2006.
[7] Y. Ge and Q. Qiu, "Dynamic thermal management for mul-
timedia applications using machine learning," in Proceedings of 
the 48th Design Automation Conference, 2011, pp. 95-100. 
[8] S. Herbert and D. Marculescu, "Analysis of dynamic volt-
age/frequency scaling in chip-multiprocessors," in Proceed-
ings of the 2007 international symposium on Low power elec-
tronics and design (ISLPED'07), 2007, pp. 38-43. 
[9] H. Shen, Y. Tan, J. Lu, Q. Wu, and Q. Qiu, "Achieving au-
tonomous power management using reinforcement learning," 
ACM Transactions on Design Automation of Electronic Sys-
tems, vol. 18, no. 2, p. 24, 2013. 
[10] A. Das, B. M. Al-Hashimi, and G. Merrett, "Adaptive and hier-
archical runtime manager for energy-aware thermal man-
agement of embedded systems," ACM Transactions on Em-
bedded Computing Systems, vol. 15, no. 2, p. 24, 2016. 
[11] D. Zhu, L. Chen, T. M. Pinkston, and M. Pedram, "TAPP: Tem-
perature-aware application mapping for NoC-based many-core 
processors," in Proceedings of the 2015 Design, Automation & 
Test in Europe Conference & Exhibition, 2015, pp. 1241-1244. 
[12] S. Lu, "On thermal sensor calibration and software
techniques for many-core thermal management," Doctoral
Dissertation, University of Massachusetts - Amherst, 2015.
[13] K. Stavrou and P. Trancoso, "Thermal-aware scheduling for 
future chip multiprocessors," no. 1, pp. 40-40, 2007. 
[14] M. Mohaqeqi, M. Kargahi, and A. Movaghar, "Analytical
leakage-aware thermal modeling of a real-time system," IEEE
Transactions on Computers, vol. 63, no. 6, pp. 1378-1392, 2012.
[15] Inter-process communications, Microsoft Doc Whitepaper, 
URL: https://docs.microsoft.com/en-
us/windows/win32/ipc/interprocess-communications, last 
accessed: Nov 18, 2019. 
 [16] S.-x. Qu, M.-x. Zhang, G.-h. Liu, and T. Liu, "Dynamic thermal 
management by greedy scheduling algorithm," vol. 19, no. 1, 
pp. 193-199, 2012. 
[17] K. Skadron et al., "Temperature-aware microarchitecture: Mod-
eling and implementation," ACM Transactions on Archi-
tecture, vol. 1, no. 1, pp. 94-125, 2004. 
 [18] S. J. Lu, R. Tessier, and W. Burleson, "Reinforcement Learning 
For Thermal-Aware Many-Core Task Allocation," in Proceed-
ings of the 25th edition on Great Lakes Symposium on VLSI, 
2015, pp. 379-384. 
[19] Y. Liu, Y. Ruan, Z. Lai, and W. Jing, "Energy and thermal aware 
mapping for mesh-based NoC architectures using multi-
objective ant colony algorithm," in 3rd International Conference 
on Computer Research and Development, vol. 3, pp. 407-411, 
2011. 
[20] A. Tockhorn, C. Cornelius, H. Saemrow, and D. Timmermann,
"Modeling temperature distribution in networks-on-chip using
RC-circuits," in 13th IEEE Symposium on Design and Diagnostics
of Electronic Circuits and Systems, pp. 229-232, 2010.
[21] B. W. Kernighan and S. Lin, "An efficient heuristic procedure 
for partitioning graphs," Bell system technical journal, vol. 49, 
no. 2, pp. 291-307, 1970. 
[22] K. Manna, V. Choubey, S. Chattopadhyay, and I. Sengupta,
"Thermal variance-aware application mapping for mesh based
network-on-chip design using Kernighan-Lin partitioning," in
International Conference on Parallel, Distributed and Grid
Computing, 2014, pp. 274-279.
[23] M. Moazzen, A. Reza, and M. Reshadi, "CoolMap: A
thermal-aware mapping algorithm for application specific
networks-on-chip," in 15th Euromicro Conference on Digital
System Design, pp. 731-734, 2012.
[24] S. Cao, Z. Salcic, Y. Ding, Z. Li, S. Wei, and X. Zhao, "Tem-
perature-aware task scheduling heuristics on network-on-
chips," in 2016 IEEE International Symposium on Circuits and 
Systems (ISCAS), pp. 2603-2606, 2016. 
[25] A. S. Arani, "Online thermal-aware scheduling for multiple 
clock domain CMPs," in 2007 IEEE International SOC Con-
ference, pp. 137-140, 2007. 
[26] J. Wang, J. Lu, S. Guo, Z. Chen, and Y. Li, "A Thermal Balance
Oriented Task Mapping for CMPs," in Proceedings of the 8th
International Conference on Information Communication and
Management, 2018, pp. 12-16.
 [27] A. Das, R. A. Shafik, G. V. Merrett, B. M. Al-Hashimi, A. Ku-
mar, and B. Veeravalli, "Reinforcement learning-based inter-
and intra-application thermal optimization for lifetime im-
provement of multicore systems," in Proceedings of the 51st 
Annual Design Automation Conference, 2014, pp. 1-6. 
[28] T.-H. Chien and R.-G. Chang, "A thermal-aware scheduling for
multicore architectures," vol. 62, pp. 54-62, 2016.
[29] A. K. Coskun, T. S. Rosing, and K. Whisnant, "Temperature 
aware task scheduling in MPSoCs," in Design, Automation & 
Test in Europe Conference & Exhibition, 2007, pp. 1-6. 
[30] M. Birks and S. P. Fung, "Temperature aware online schedul-
ing with a low cooling factor," in International Conference on 
Theory and Applications of Models of Computation, pp. 105-
116, 2010. 
[31] M. Birks, D. Cole, S. P. Fung, and H. Xue, "Online algorithms 
for maximizing weighted throughput of unit jobs with tem-
perature constraints," in Frontiers in Algorithmics and Algo-
rithmic Aspects in Information and Management, pp. 319-329, 
2011. 
[32] M. L. Puterman, Markov Decision Processes: Discrete
Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[33] Q. Bashir et al., "An online temperature-aware scheduling tech-
nique to avoid thermal emergencies in multiprocessor systems," 
vol. 70, pp. 83-98, 2018. 
[34] A. Rezaei, D. Zhao, M. Daneshtalab, and H. Zhou, "Multi-
objective task mapping approach for wireless NoC in dark sili-
con age," in 25th Euromicro International Conference on Paral-
lel, Distributed and Network-based Processing (PDP), 2017, pp. 
589-592. 
[35] D. P. Bertsekas, Dynamic programming and optimal control 
(no. 2). Athena scientific Belmont, MA, 1995. 
[36] R. S. Sutton, A. G. Barto, and F. Bach, Reinforcement learning:
An introduction. MIT press, 1998.
[37] T. E. Carlson, W. Heirman, and L. Eeckhout, "Sniper: exploring
the level of abstraction for scalable and accurate parallel
multi-core simulation," in Proceedings of 2011 International
Conference for High Performance Computing, Networking, Storage
and Analysis, 2011, p. 52.
[38] (2019-01-23). HotSpot thermal simulator,  Available: 
http://lava.cs.virginia.edu/HotSpot 
[39] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and 
N. P. Jouppi, "McPAT: an integrated power, area, and timing 
modeling framework for multicore and manycore ar-
chitectures," in Proceedings of the 42nd Annual IEEE/ACM In-
ternational Symposium on Microarchitecture, 2009, pp. 469-480. 
[40] C. Sun et al., "DSENT-a tool connecting emerging photonics 
with electronics for opto-electronic networks-on-chip model-
ing," in IEEE/ACM Sixth International Symposium on Net-
works-on-Chip, 2012, pp. 201-210. 
[41] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The
SPLASH-2 programs: Characterization and methodological
considerations," in ACM SIGARCH Computer Architecture
News, vol. 23, no. 2, pp. 24-36, 1995.
[42] K. Tatas, K. Siozios, D. Soudris, and A. Jantsch, Designing 2D 
and 3D network-on-chip architectures. Springer, 2014. 
[43] C.-H. Chou, M. E. Belviranli, and L. N. Bhuyan, "Thermal
prediction and scheduling of network applications on multi-core
processors," in Proceedings of the ninth ACM/IEEE symposium
on Architectures for networking and communications systems,
2013, pp. 115-116.
 [44] V. Rantala, T. Lehtonen, and J. Plosila, Network on chip routing 
algorithms, 2006. 
[45] H. Jung, P. Rong, and M. Pedram, "Stochastic modeling of a 
thermally-managed multi-core system," in Proceedings of the 
45th annual Design Automation Conference, pp. 728-733, 2008. 
[46] R. S. Sutton and A. G. Barto, Introduction to reinforcement 
learning. MIT press Cambridge, 1998. 
[47] A. Gosavi, "Relative value iteration for average reward semi-
Markov control via simulation," in Proceedings of the 2013 
Winter Simulation Conference: Simulation: Making Decisions 
in a Complex World, 2013, pp. 623-630. 
[48] C. Szepesvári, "Algorithms for reinforcement learning," vol. 4, 
no. 1, pp. 1-103, 2010. 
[49] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst,
Reinforcement learning and dynamic programming using function
approximators. CRC press, 2010.
[50] A. Geramifard, T. J. Walsh, S. Tellex, "A tutorial on linear func-
tion approximators for dynamic programming and rein-
forcement learning," vol. 6, no. 4, pp. 375-451, 2013. 
[51] Intel® Xeon® Processor 5500 Series Datasheet, Volume 1. (June 
2011) Available: 
https://www.intel.com/content/dam/www/public/us/en/d
ocuments/datasheets/xeon-5500-vol-1-datasheet.pdf 
[52] Z. Chen and D. Marculescu, "Distributed reinforcement learn-
ing for power limited many-core system performance optimiza-
tion," in Proceedings of the Design, Automation & Test in Eu-
rope Conference & Exhibition, 2015, pp. 1521-1526. 
[53] F. G. De Natale, G. S. Desoli, D. D. Giusto, and G. Vernazza, "A
spline-like scheme for least-squares bilinear interpolations of
images," in ICASSP, vol. 5, pp. 141-144, 1993.
 
