Abstract: In this paper, we present PASTEMP, a solution for Package Aware Scheduling for Thermal and Energy management using Multi-Parametric programming in heterogeneous embedded multiprocessor SoCs (MPSoCs). Based on the current thermal state of the system and current performance requirements of the workload, PASTEMP finds thermally safe and energy efficient voltage/frequency configurations for the cores on an MPSoC. The tasks are assigned to the cores depending on their performance demand and the current voltage/frequency of the core. The voltage/frequency settings of the cores are chosen through an optimization process based on the instantaneous thermal model we introduce to decouple the effect of package temperature from the temperature changes caused by the power consumption of the cores. To find the best voltage/frequency settings at runtime, we use multi-parametric programming to separate the optimization into offline and online phases. According to our experimental results, compared to similar DTM techniques, PASTEMP achieves up to 23% energy savings and 26% throughput improvement and reduces deadline misses by more than half while meeting all thermal constraints.
I. INTRODUCTION
Embedded systems often must satisfy stricter limitations on power efficiency, cost, size, reliability, etc. compared to general purpose systems. There has been a growing tendency toward the use of multicore processors in embedded systems since they usually offer higher performance within a specific power budget and thermal envelope compared to single core processors. Heterogeneous multiprocessor systems-on-chip (MPSoCs), which integrate cores of various types on the same die, offer even better performance and power than their homogeneous counterparts, as their heterogeneous nature enables them to customize their performance and power characteristics to match the workload requirements [1], [2]. Such heterogeneous MPSoCs are used in a wide range of applications from cell phones (Qualcomm's Snapdragon platform) to wireless base stations (Mindspeed's Transcede 4000 processors for 4G wireless base stations). Thermal and energy issues become particularly important in heterogeneous embedded MPSoCs due to the inherent imbalance in the distribution of power densities across the die. Moreover, embedded systems experience a wider range of environmental conditions. One example is the case of wireless base stations deployed outdoors in harsh environmental conditions with ambient temperatures even exceeding 80 °C [3]. These diverse and adverse conditions make thermal management a must to prevent issues such as degraded reliability, degraded performance, and higher leakage power, all of which are caused by high temperatures. While high temperatures can be mitigated using better packaging and cooling, in many embedded systems such as cell phones, due to space and cost limitations, using higher quality packaging or larger heat sinks is not an option.
On the other hand, the available heterogeneity allows adapting to the workload requirements by offloading activities from more complex, power hungry and therefore hotter cores to simpler lower power cores (which are also less likely to experience thermal issues). This creates new thermal and power management opportunities for such systems. Unlike homogeneous MPSoCs, little work has been done on thermal management of heterogeneous MPSoCs. New techniques are required to make the best use of the heterogeneity in addition to exploiting the variable characteristics of the workload. Scheduling techniques offer a more flexible and cost efficient way to manage power and temperature. In this paper, we propose PASTEMP, a scheduling solution for joint temperature and energy management in heterogeneous embedded MPSoCs running embedded tasks with predefined characteristics. At each scheduling step, PASTEMP finds a set of thermally safe, power efficient voltage/frequency settings for the cores under thermal constraints and assigns the tasks to the cores accordingly. To find these voltage/frequency settings, PASTEMP solves an optimization problem around the runtime performance requirements of the workload and the thermal state of the package (the set of temperature values at various locations on the package). Using a low overhead state estimator, the thermal state of the package is estimated at runtime. To avoid complex optimization at runtime, we break the problem into offline and online phases using multi-parametric programming. We evaluated PASTEMP in an embedded heterogeneous MPSoC which runs tasks with deadlines. Our experimental results show that PASTEMP can reduce energy by up to 23% and reduce the deadline misses to one third compared to similar DTM techniques, while meeting all thermal constraints.
Although we have evaluated PASTEMP for a deadline based system, it is more general and can be applied to any system where knowledge of the tasks and their execution characteristics on the cores can be obtained in advance. The rest of the paper is organized as follows. Section II describes the related work, followed by details of the design of PASTEMP in Section III. Methodology and experimental results are discussed in Section IV, and Section V concludes the paper.
Fig. 1. Temperature of a core and the package area above it on an embedded MPSoC.
II. RELATED WORK
In general, task scheduling under thermal constraints on an MPSoC is an NP-hard problem. MPSoCs naturally provide multiple instances of similar processing resources on the chip, allowing distribution of activities and heat as necessary. Although this creates further opportunities for thermal management compared to single CPU processors, it also significantly complicates the management process. The solution space is huge due to the numerous possibilities for assigning frequencies and tasks to the cores and for deciding when to start each task. Considering that the current temperature at a particular location on the die is also a strong function of recent thermal and workload history, the complexity of the problem is even worse. Many thermal management techniques have been proposed for homogeneous MPSoCs. In general purpose and high performance systems, where there is no a priori knowledge of the workloads, the techniques are based on heuristics. In [4], the authors have tried several combinations of thread migration, dynamic voltage and frequency scaling (DVFS), and clock gating for thermal management of a homogeneous MPSoC. In [5], a probabilistic approach is taken where the technique changes the probability of assigning a task to a core based on the temperature history of that core. Unlike in general purpose processing, the types of workloads in the embedded domain are often known in advance. An a priori approximation of the characteristics of these workloads, such as their execution time, can be obtained. Although this extra information is helpful in devising better solutions, the complexity of the problem makes it practically infeasible to find the optimal solution. To manage this complexity and make the problem tractable, various techniques in this domain use different simplifying assumptions and different heuristics.
In [6], the problem of scheduling a task graph on a homogeneous MPSoC to minimize the hotspots and balance the temperature distribution is formulated as an Integer Linear Programming (ILP) problem. The technique is based on heuristics such as minimizing the overlap between tasks running on neighboring cores. In [7], an assignment and scheduling technique for hard real-time applications on MPSoCs is proposed which uses a mixed-integer linear programming formulation to minimize peak temperature under hard real-time constraints and task dependencies. This technique is limited to tasks with large execution times and works based on a steady-state thermal model. A global optimization approach is taken to minimize the temperature of a set of known tasks. The complexity of this approach increases exponentially with the number of tasks and the number of cores and, as the authors have mentioned, this formulation is not practical for large problem instances. For large problems, a heuristic approach is presented which does not consider any thermal model and works based on the mobility of the tasks. Another issue with this approach is that it does not consider the ambient temperature and might violate the thermal threshold in the presence of ambient temperature changes.
III. PASTEMP DESIGN
The temperature of a core is a combination of contributions from the package temperature and the current power state of the cores. Figure 1 shows an example of the changes in temperature of one of the cores and its corresponding area on the package in an embedded MPSoC as the power state of the core changes. We can see that the rate of change in the package temperature (the green plot) is very slow, with a time constant on the order of seconds. Therefore, it takes seconds for the package to reach its steady state value. Steady state thermal models are inaccurate before the package reaches its steady state temperature. Therefore, techniques such as [7] which are based on the steady state thermal model of the chip could be very inefficient, as the steady state model provides an inaccurate thermal model for dynamic workloads. On the other hand, the dynamic thermal management (DTM) techniques which operate only based on the readings from the thermal sensors could be misled, as they are oblivious to the thermal state of the system as a whole. As a simple example, consider a core which has been turned off as its temperature approached the threshold. When the core is turned off, the core temperature decreases quickly but the package does not cool down fast. If the core is turned on again, the temperature contribution from the power consumption of the core will soon cause the thermal sensor to detect a hot spot and the core must be turned off again. This frequent unnecessary switching between on and off states leads to a performance penalty. As opposed to this case, a DTM technique which is aware of the package temperature could prevent these penalties by not turning on the core until the package has cooled down enough. As mentioned before, the temperature of a core is a combination of two components: the temperature effect caused by the current thermal state of the package and the temperature caused by the current power state of the cores.
Decoupling these two components, which are different in nature, can lead to a better understanding of the thermal behavior of the system. Our proposed technique estimates the package thermal state (the set of temperature values at various locations on the package) based on the power consumption of the cores. Given the package thermal state, the decisions on frequency settings are made so that each core temperature does not exceed the threshold and the total power is minimized. An optimization problem can be formulated using the thermal state of the package and the performance required by the workload to decide on frequency settings of the cores. Solving such an optimization problem in real time is not feasible. Therefore we split the optimization process into two components using a parametric optimization approach as shown in Figure 2. The first part, which is done offline, calculates the optimal frequency settings for different thermal states of the system and workload requirements. At runtime, performance requirements and the thermal state of the system are identified and decisions on assignment of frequencies to the cores are made. Based on the core frequencies, the tasks are assigned to the appropriate cores such that their performance requirements are met. As shown in the figure, the parametric optimization in the offline phase is based on multi-parametric programming [9]. The optimization problem is parameterized based on the thermal state of the package and performance requirements of the workload, as explained in detail in Section III-C. The thermal state of the system at runtime is calculated as explained in Section III-B. The performance requirements are found based on the current workload needs and are represented in terms of the required number of cores and their desired frequency settings.
PASTEMP considers leakage power and temperature dependence of leakage power. Moreover, it imposes no restrictions on the distributions of the size and the number of concurrent tasks in the workload which makes it applicable to various classes of workloads. PASTEMP is also applicable to the systems that use heat sinks. We next outline the details of 
A. Thermal model
In this work we use the thermal model proposed in [10]. This model is based on the well known duality of thermal and electrical phenomena and represents the dynamics of the chip temperature by the thermal RC network shown in Figure 3.a. We represent this thermal network in state space form. The states of the system are the temperatures at different nodes, represented by vector T, and the inputs to this system are the power consumptions at different nodes of the thermal circuit (represented by vector P). In our thermal model, corresponding to each core on the silicon layer there is a region on the package layer above that core. The thermal state of the package is characterized by the set of temperature values at these locations ($T_{pkg}$). The vectors and matrices used in our formulations are: The dependence of temperature on the power consumption of the cores and the thermal characteristics of the system is represented as:
The thermal state of the package usually changes slowly (seconds) as a function of the power consumed by the cores over time and is less affected by recent scheduling decisions. However, the temperature due to power dissipation of each core changes much faster (10s of ms) and is a strong function of the current scheduling decisions. The temperature of a core is the aggregate of these two components. Figure 1 shows how the temperature of a core (blue plot) and its corresponding location on the package (green plot) change in an embedded MPSoC. The difference between these two plots represents the temperature component caused by instantaneous power consumption on the core. This component changes quickly, while the package changes very slowly. Although the contribution of the package changes slowly, it cannot be neglected since it can change significantly given enough time.
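The separation of time scales described above can be illustrated with a minimal two-node RC sketch: one core node stacked on one package node. This is not the HotSpot model used in the paper; all resistances, capacitances, the power level and the ambient temperature are hypothetical values chosen only so that the core time constant is about 20 ms and the package time constant is about 5 s.

```python
def simulate_two_node(power_w, duration_s, dt_s=0.001,
                      r_core_pkg=2.0, c_core=0.01,
                      r_pkg_amb=10.0, c_pkg=0.5,
                      t_ambient=45.0):
    """Forward-Euler simulation of a core node on top of a package node.

    All parameter values are hypothetical (K/W for resistances, J/K for
    capacitances) and are not taken from the paper.
    Returns (core temperature, package temperature) in degrees C.
    """
    t_core = t_pkg = t_ambient
    steps = int(round(duration_s / dt_s))
    for _ in range(steps):
        # Heat flows (W) through each thermal resistance at this instant.
        q_core_to_pkg = (t_core - t_pkg) / r_core_pkg
        q_pkg_to_amb = (t_pkg - t_ambient) / r_pkg_amb
        # Each node integrates its net heat inflow.
        t_core += dt_s * (power_w - q_core_to_pkg) / c_core
        t_pkg += dt_s * (q_core_to_pkg - q_pkg_to_amb) / c_pkg
    return t_core, t_pkg


# After 100 ms the core-minus-package component has almost settled,
# while the package itself has barely moved from ambient.
short_core, short_pkg = simulate_two_node(power_w=2.0, duration_s=0.1)
# After 60 s the package has also reached its steady state.
long_core, long_pkg = simulate_two_node(power_w=2.0, duration_s=60.0)

print("100 ms: core component", short_core - short_pkg,
      "package rise", short_pkg - 45.0)
print("60 s:   core component", long_core - long_pkg,
      "package rise", long_pkg - 45.0)
```

In this sketch the difference between the two node temperatures plays the role of the fast component in Figure 1, and the package node plays the role of the slow component that cannot be neglected over longer intervals.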
B. Estimation of package thermal state
This subsection describes how online estimation of thermal state of the package is performed in PASTEMP.
As explained before, Equation (1) formulates the dynamics of the thermal network. It describes the dependence of the future thermal state of the whole chip on its current thermal state and the power consumption of the functional units on the chip. This equation provides an accurate means to calculate the thermal state of the whole chip, but it is computationally expensive because of the operations that must be performed on very large matrices. This overhead makes it impractical for applications requiring low overhead temperature tracking.
In order to find a lower overhead, yet accurate estimator for package thermal state, we use model order reduction. Model order reduction provides a low dimensional approximation of the thermal network. For example, if the circuit formulation of thermal network is represented in the form
where $C_t$ and $G_t$ are $q \times q$ matrices and $T$ is a vector of length $q$, then the model order reduction reduces the model to
where $C_r$ and $G_r$ are $q_r \times q_r$ matrices, $T_r$ is a vector of length $q_r$, and $q_r \ll q$. We use a projection based implicit moment matching method called PRIMA [8]. In PRIMA, the larger the number of matched moments, the closer the behavior of the reduced model is to the original system, but at the cost of higher computational complexity. In our case, PRIMA with a single moment around frequency $s = 0$ provides sufficient accuracy for tracking the package thermal state. The reason is that the thermal network operates as a low pass filter for package temperature and eliminates high frequencies of the inputs to a large extent. Interested readers can refer to [8] for a detailed discussion of PRIMA.
The reduced order approximation of the thermal network provides a low-overhead estimate of the thermal state of the package based on the power consumptions of the cores.
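For reference, the full and reduced systems described in this subsection can be written as below. This is a sketch in assumed notation rather than the exact equations of the paper: the input matrix $B_t$, which maps core powers $P$ to the nodes of the thermal network, and the projection matrix $V$ are symbols introduced here for illustration.

```latex
\begin{align}
  % Full-order thermal network with q nodes
  C_t \,\dot{T}(t) + G_t \,T(t) &= B_t \,P(t), & T &\in \mathbb{R}^{q} \\
  % Reduced-order network with q_r << q states
  C_r \,\dot{T}_r(t) + G_r \,T_r(t) &= B_r \,P(t), & T &\approx V\,T_r \\
  % Congruence projection used by PRIMA
  C_r = V^{\top} C_t V, \quad G_r &= V^{\top} G_t V, \quad B_r = V^{\top} B_t \\
  % Single moment about s = 0: V spans the steady-state response
  \operatorname{range}(V) &= \operatorname{range}\!\left(G_t^{-1} B_t\right), & V^{\top} V &= I
\end{align}
```

With this choice of $V$, the reduced model reproduces the steady-state response of the full network exactly, since $V G_r^{-1} B_r = G_t^{-1} B_t$. This is consistent with the low pass behavior of the package noted above.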
C. Optimization formulation
In our parametric optimization, the decision variables determine the frequency settings of the cores on the MPSoC. The optimization parameters are the thermal state of the package and the number of cores of each type to be set to certain frequencies in order to meet the timing and performance characteristics of the workload. Due to the much larger time constants of the package compared to the cores, the thermal state of the package stays practically constant during quick changes of the temperature at the silicon layer. Therefore, for short time intervals, we model the thermal behavior using the instantaneous thermal model shown in Figure 3.b. Since the changes in temperature of the package are minimal during these short intervals, the package temperatures are represented as constant voltage sources. To simplify the model, we use a steady state thermal model only for the silicon-to-package part of the thermal network. Since the effect of the package on the core temperature is already decoupled, this steady state thermal network accurately models the contribution of the power consumption to the core temperature within intervals on the order of tens of milliseconds. Since the decisions regarding frequency settings of the cores are made on the same time scale, this estimate is sufficient. Based on these assumptions and using superposition and nodal analysis, the core temperatures at time instance k are:
where $T_{cores}$ and $T_{pkg}$ are respectively the vector of temperature values at the cores and at the locations on the package which correspond to these cores. We call Ψ the package contribution matrix and Φ the power contribution matrix. The power consumption at each core n is the sum of the dynamic and leakage power:
To account for the temperature dependence of leakage power, we use a linear approximation as suggested in [28], with an approximate estimation error of up to 5%. Using this model, the leakage power of a core can be estimated as the sum of a constant term and a term linearly dependent on the core's temperature. Therefore, the leakage power of different cores can be estimated as:
where L is a diagonal matrix containing the coefficients for the linear terms and C is a vector of constant terms for different cores. Therefore, the core temperature is:
where $\tilde{\Psi}$, $\tilde{\Phi}$ and $\tilde{P}$ are respectively the effective package contribution matrix, the effective power contribution matrix and the effective power. They are calculated as:
For each core, assuming a core of type k has v frequency settings, the dynamic power of core n, which is of core type k, can be written as:
where $P_{k,v}$ is the dynamic power consumed at frequency setting v at a core of type k, and the decision variable $\alpha_{n,v}$ is 1 if core n is set to frequency setting v, and 0 otherwise.
We denote the number of cores of type Ω which are set to frequency setting v as $\lambda_{\Omega,v}$, which is determined as:
The minimum number of cores of type Ω required at a certain frequency v (determined at runtime based on the requirements of the tasks) is denoted as $\sigma_{\Omega,v}$. The resulting set of constraints is:
Based on these, we formulate an optimization problem for minimizing the total power consumption under thermal constraints.
where we use ≺ and ⪰ as the element-wise less-than and greater-than-or-equal operators for vectors and matrices. $P_{total}$ is the sum of the power consumptions of all of the cores. Since this optimization cannot be solved at runtime, we take a parametric optimization approach that breaks the optimization into two phases. Our approach is based on multi-parametric programming [9].
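The chain of relations described in this subsection can be collected in one place. The block below is a sketch that follows from the definitions given in the text, not necessarily the authors' exact notation: the tilde for effective quantities, the index set $N_\Omega$ of cores of type Ω, the threshold vector $T_{max}$, and the one-setting-per-core constraint are assumptions introduced here.

```latex
\begin{align}
  % Instantaneous thermal model at time instance k
  T_{cores}[k] &= \Psi\, T_{pkg}[k] + \Phi\, P[k] \\
  % Power is dynamic plus leakage; leakage is linear in core temperature
  P &= P_{dyn} + P_{leak}, \qquad P_{leak} = L\, T_{cores} + C \\
  % Substituting and solving for T_cores (requires I - \Phi L invertible)
  T_{cores} &= \tilde{\Psi}\, T_{pkg} + \tilde{\Phi}\, \tilde{P} \\
  \tilde{\Psi} &= (I - \Phi L)^{-1} \Psi, \quad
  \tilde{\Phi} = (I - \Phi L)^{-1} \Phi, \quad
  \tilde{P} = P_{dyn} + C \\
  % Dynamic power of core n (of type k) from its selected setting
  P_{dyn,n} &= \sum_{v} \alpha_{n,v}\, P_{k,v}, \qquad
  \alpha_{n,v} \in \{0,1\}, \quad \sum_{v} \alpha_{n,v} = 1 \\
  % Number of cores of type \Omega at setting v, and the workload requirement
  \lambda_{\Omega,v} &= \sum_{n \in N_{\Omega}} \alpha_{n,v}, \qquad \lambda \succeq \sigma \\
  % Optimization solved parametrically in (T_pkg, \sigma)
  \min_{\alpha}\; P_{total} &= \sum_{n} \left( P_{dyn,n} + P_{leak,n} \right)
  \quad \text{s.t.} \quad T_{cores} \prec T_{max}, \;\; \lambda \succeq \sigma
\end{align}
```

The third line follows by substituting the leakage model into the first line: $T_{cores} - \Phi L\, T_{cores} = \Psi\, T_{pkg} + \Phi (P_{dyn} + C)$, which gives the effective matrices after multiplying by $(I - \Phi L)^{-1}$.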
As optimization parameters, $T_{pkg}$ and σ partition the parameter space into different so-called critical regions. Each critical region is defined by a set of constraints on the values of $T_{pkg}$ and σ and corresponds to a set of voltage/frequency settings for the cores on the MPSoC. A region specifies the validity range of that set of voltage/frequency settings, within which the temperatures of all cores are below the threshold temperature and the total power is minimized. We use multi-parametric programming to obtain the critical regions defined by $T_{pkg}$ and σ and their corresponding voltage/frequency settings, which are represented by the decision variables α. The actual values of σ and $T_{pkg}$ are found at runtime as explained. Then the corresponding region and the appropriate set of voltage/frequency settings for the cores are found.
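The online phase described above amounts to locating the current parameter vector in a table of precomputed regions. The following is a minimal sketch of such a lookup, assuming each critical region is stored as a polyhedron $\{\theta : A\theta \le b\}$ over the parameter vector $\theta = (T_{pkg}, \sigma)$. The class, the function and the two example regions are hypothetical and do not reflect the storage format produced by MPT or YALMIP.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class CriticalRegion:
    """One precomputed region {theta : A @ theta <= b} and its settings."""
    a_rows: List[Sequence[float]]   # rows of A, one per half-space
    b: Sequence[float]              # right-hand sides
    settings: Tuple[int, ...]       # frequency-setting index per core

    def contains(self, theta: Sequence[float], tol: float = 1e-9) -> bool:
        # theta is inside the region if every half-space inequality holds.
        return all(
            sum(a * x for a, x in zip(row, theta)) <= bound + tol
            for row, bound in zip(self.a_rows, self.b)
        )


def lookup_settings(regions: List[CriticalRegion],
                    theta: Sequence[float]) -> Optional[Tuple[int, ...]]:
    """Return the settings of the first region containing theta, else None."""
    for region in regions:
        if region.contains(theta):
            return region.settings
    return None


# Hypothetical table for theta = (package temperature in C, cores required).
example_regions = [
    # T_pkg <= 60 and sigma <= 2: both cores may use their fastest setting.
    CriticalRegion(a_rows=[(1.0, 0.0), (0.0, 1.0)], b=(60.0, 2.0),
                   settings=(2, 2)),
    # 60 <= T_pkg <= 80 and sigma <= 2: both cores drop one setting.
    CriticalRegion(a_rows=[(-1.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
                   b=(-60.0, 80.0, 2.0), settings=(1, 1)),
]

print(lookup_settings(example_regions, (55.0, 1.0)))
print(lookup_settings(example_regions, (70.0, 2.0)))
print(lookup_settings(example_regions, (95.0, 1.0)))
```

A linear scan is the simplest option and is used here only for clarity. A parameter vector that falls outside every stored region returns `None`, which in this sketch stands for the case where no thermally safe setting satisfies the requested σ.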
The offline optimization phase of PASTEMP is completely performed in Matlab. The multi-parametric programming framework is implemented in YALMIP [11] toolbox in Matlab which relies on Multi-Parametric Toolbox (MPT) [12] to solve multiparametric programming problems. The offline optimization phase takes around 10 minutes on a laptop with dual core Intel CPUs at 2.2GHz and 2GB of RAM. The regions created by offline optimization phase are stored in a lookup table which takes 4MB to store. 
D. Task assignment to the cores
As explained before, $\sigma_{\Omega,v}$ in the optimization is the minimum number of cores of type Ω required at a certain frequency v. The set of σ values represents the performance requirements of the workload. For example, given a set of tasks with deadlines, the scheduler can tell how many cores of each type and at which frequencies are required to meet the deadlines of the current tasks. At runtime, when σ is identified and $T_{pkg}$ is estimated, appropriate voltage/frequency settings for the cores are identified (represented by the α values). Given these frequency settings, the processing capabilities of the cores are determined and each task in the system is assigned to the core that is the best match for the task's performance requirements. Algorithm 1 explains how this is done at each scheduling tick in a deadline based system. After the frequency settings are determined, at each scheduling tick, we pick the task with the earliest deadline. This task is assigned to the available core with the highest processing capability. This is repeated until no core or no task is left.
Algorithm 1 Task to core assignment
C ⇐ currently available cores
J ⇐ current tasks in the system in the decreasing order of performance requirements
while (C and J are not empty) do
    Ω ⇐ the highest performance type of cores in C
    C_Ω ⇐ set of cores of type Ω in C
    while (C_Ω and J are not empty) do
        v ⇐ the core at the highest frequency in C_Ω
        j ⇐ the first task in J
        assign j to v
        remove j from J and v from C_Ω and C
    end while
end while
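Algorithm 1 can be sketched in Python as follows. The nested loops over core types and over cores within a type are flattened into a single ordering of cores by (type, frequency), which visits the cores in the same order. The `Core` and `Task` classes, the numeric `type_rank` and `demand` fields, and the example values are hypothetical and serve only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Core:
    name: str
    type_rank: int   # higher value = higher performance core type
    freq_mhz: int    # current frequency chosen by the optimization


@dataclass(frozen=True)
class Task:
    name: str
    demand: float    # performance requirement; higher = more demanding


def assign_tasks(cores: Iterable[Core], tasks: Iterable[Task]) -> Dict[str, str]:
    """Map task names to core names following Algorithm 1."""
    # J: tasks in decreasing order of performance requirement.
    pending = sorted(tasks, key=lambda t: t.demand, reverse=True)
    # C: highest performance type first, then highest frequency within a type.
    ordered = sorted(cores, key=lambda c: (c.type_rank, c.freq_mhz),
                     reverse=True)
    # zip stops as soon as either the cores or the tasks run out, which
    # matches the "C and J are not empty" loop condition.
    return {task.name: core.name for core, task in zip(ordered, pending)}


# Hypothetical example: two SPARC-like cores and one XScale-like core.
example_cores = [
    Core("xscale0", type_rank=1, freq_mhz=600),
    Core("sparc1", type_rank=2, freq_mhz=1020),
    Core("sparc0", type_rank=2, freq_mhz=1200),
]
example_tasks = [
    Task("a", demand=0.9),
    Task("b", demand=0.3),
    Task("c", demand=0.6),
    Task("d", demand=0.1),
]

print(assign_tasks(example_cores, example_tasks))
```

In this example there are more tasks than cores, so the least demanding task stays unassigned until the next scheduling tick.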
IV. EXPERIMENTAL RESULTS

A. Methodology
The cores used in our experiments are a low-power in-order architecture similar to the SPARC cores in UltraSPARC T1 [13] and a very low power core designed for embedded systems, similar to Intel's XScale [14]. Power, performance, and area characteristics of these cores are shown in Figure 4. We assume that the MPSoC is implemented in 65 nm technology. The areas of the cores are derived from published photos of the dies after subtracting the area occupied by I/O pads, interconnection wires, interface units, L2 cache, and control logic as in [1], and scaled to 65 nm. Each L2 cache has 1 MB size, 2 banks, 64-byte lines, and is 4-way associative. Using CACTI [15], the area and power consumption of the caches at 65 nm are estimated as 14 mm² and 1.7 W, respectively. The cache power consumption value includes leakage.
The performance and power simulation of each core type is decoupled from the overall system simulation. This is done by collecting the performance and power data for all the benchmarks for the different types of cores using the M5 Simulator [16]. This modular framework allows for an easy extension of the set of cores simulated in the heterogeneous MPSoC, and it is capable of integrating a variety of simulators or real-life experiments if needed. We assume the same three voltage settings for the XScale and SPARC cores. For XScale, we use the existing available frequency levels (as reported in [14]), and for SPARC we set the default frequency to 1.2 GHz (as reported in [13]) and scale frequency using the 95% and 85% settings as in [17]. The in-order pipelines of SPARC and XScale are modeled by modifying M5's execution engine. Wattch [18] is used for power modeling of the cores and is updated with model parameters for 65 nm. We observe that for each core, the dynamic power values have little variation for our benchmark set. Thus, we compute the average power consumption values at each voltage/frequency setting for each of the cores. These power values are then utilized in the temperature simulations. We compute the leakage power of CPU cores based on structure areas and temperature. For transitions during frequency scaling, we use the power of the higher power state. We set idle power to 1/3 of average dynamic power. We compute temperature dependence using the model introduced in [20] with the same constants mentioned in that paper for 65 nm. The overhead of switching to a new frequency is 500 μs [21], [22].
The workloads in our experiments consist of integer benchmarks provided in the MiBench benchmark suite [23], which include automotive/industrial, network and telecommunications applications. In addition to the datasets provided in the MiBench suite, we use datasets provided by [24]. To evaluate our technique under various conditions, we create moderate to intensive workloads consisting of varying numbers of tasks from the MiBench suite. Instances of each task are generated regularly at every arrival period. We created two classes of workloads which differ in the way their deadlines (d) and periods of arrival (τ) are selected. For workload class A, we set d and τ of a task to twice the execution time of that task at the slowest frequency on the slowest core (XScale). This way the tasks can meet their deadlines irrespective of the core type they are assigned to. For workload class B, we set d and τ of some of the tasks to twice their execution time at the slowest frequency of the fastest core type (SPARC). This way we make these tasks more suitable for the SPARC cores. For each class, we create three workloads at three different levels of utilization: 50%, 70% and 90%. We run these workloads until all of the tasks are finished (they take about 100 seconds to complete).
We use HotSpot Version 4.2 [10] as the thermal modeling tool with a sampling interval of 1 ms to ensure sufficient accuracy. In many embedded systems such as cell phones there is no heat sink or spreader. To model this within HotSpot, we set the spreader thickness to a very small value of 0.1 mm. The heat sink was replaced by a package with the thermal parameters shown in Table 1, which are within the ranges suggested by [25] and [26]. The parameters used in HotSpot are summarized in Table 1. In our experiments, the maximum safe temperature is assumed to be 90 °C, which must not be violated by any of the DTM techniques. We compare PASTEMP with two other DTM techniques. Thermal PO is similar to PASTEMP, with the only difference that instead of the instantaneous thermal model of PASTEMP, it uses a steady state thermal model that does not consider the effect of the package. The other technique, Thermal DVFS, relies on the direct temperature readings from the thermal sensors. It reduces the frequency and voltage of a core when it reaches a critical threshold of 89 °C. This threshold is set 1 degree below the safe temperature threshold to allow enough time for DVFS to take effect before the temperature reaches the 90 °C threshold. When the temperature gets below 85 °C, 
B. Results
In this section, we present the results of comparing our technique (PASTEMP) to two other DTM techniques: Thermal DVFS, which relies only on the readings from the thermal sensors, and Thermal PO, which uses an optimization similar to that of PASTEMP, but in a package oblivious manner relying on a steady state thermal model. A general trend observed in the results is that at lower utilizations, due to lower temperatures, thermal constraints do not limit scheduling, so the various techniques make similar scheduling decisions. As system utilization increases, so do the temperatures, and thermal constraints start playing a more important role in scheduling decisions, resulting in larger differences between the scheduling techniques. Figure 5 shows the distribution of core temperatures across various ranges. PASTEMP and Thermal DVFS operate closer to the maximum safe threshold temperature, while with Thermal PO, cores operate at lower temperatures. The reason is that compared to the first two, Thermal PO uses lower frequencies even when the cores can operate at higher frequencies. This is because the steady state thermal model leads to more pessimistic temperature estimates, resulting in more conservative frequency settings. Figure 6 shows the energy consumption of the MPSoC when running various workloads and using the three different thermal management techniques. As this figure shows, in all cases, Thermal DVFS consumes more energy than PASTEMP. There are two reasons for this. The first is that at each scheduling step, based on the thermal state of the package and the performance requirements of the workload, PASTEMP optimizes the frequencies of the cores for minimum power. The second is the frequent unnecessary switching between different voltage/frequency settings in Thermal DVFS. Although Thermal PO uses an optimization similar to that of PASTEMP, it consumes more energy than both of the other two techniques.
The reason is that since Thermal PO relies on a steady state thermal model, it overestimates the temperatures when the package is not hot. Therefore, the maximum thermally safe frequencies are not identified correctly and the cores are set to lower frequencies. This leads to longer execution times, which in turn lead to higher leakage energy. The effect is more significant at higher temperatures due to the exponential dependence of leakage on temperature. Figure 7 shows the percentage reduction of deadline misses at different utilization levels. As the utilization increases, the number of deadline misses increases. In all cases, PASTEMP experiences fewer misses. This is due to the better matching of the tasks and the cores based on the individual performance needs of a task. We also measured the throughput of the MPSoC when each of these three techniques is applied. PASTEMP improves throughput by 11% compared to Thermal DVFS and by 26% compared to Thermal PO.
The runtime overhead of PASTEMP consists of estimating the thermal state of the package and accessing the lookup table to find the corresponding region provided by our offline phase. Estimating the package thermal state takes less than 20 μs per estimation on a SPARC processor running at 1 GHz, while accessing the lookup table takes virtually no time. Therefore, the total runtime overhead of our technique is about 20 μs, which is roughly 0.2% of a typical 10 ms scheduling tick and is therefore negligible.
V. CONCLUSION
In this paper, we described PASTEMP, a scheduling solution for management of temperature and energy in embedded multiprocessor SoCs (MPSoCs). This technique is applicable to homogeneous or heterogeneous embedded MPSoCs and can be applied to various types of embedded systems where a priori knowledge of the tasks and their characteristics exists. The main idea is to decompose the core temperature into the effect of the power consumption of the core and the effect of the thermal state of the package. This gives more visibility into the thermal state of the system and allows more informed scheduling decisions compared to techniques which directly use thermal sensor readings for scheduling decisions. Given the thermal state of the system and the performance requirements of the workload, an optimization is performed to find the best voltage/frequency configuration. To be able to perform this optimization at runtime, we use multi-parametric programming to divide the optimization into offline and online phases. PASTEMP is a low overhead temperature aware scheduling technique which results in up to 23% energy savings and 26% throughput improvement and reduces the number of deadline misses by more than half compared to other DTM techniques, while meeting all thermal constraints.
