The reliability of multiprocessor system-on-chips (MPSoCs) is nowadays threatened by high chip temperatures leading to long-term reliability concerns and short-term functional errors. High chip temperatures might not only cause potential deadline violations, but also increase cooling costs and leakage power. Pro-active thermal-aware allocation and scheduling techniques that avoid thermal emergencies are promising techniques to reduce the peak temperature of an MPSoC. However, calculating the peak temperature of hundreds of design alternatives during design space exploration is time-consuming, in particular for unknown input patterns and data. In this paper, we address this challenge and present a fast analytic method to calculate a non-trivial upper bound on the maximum temperature of a multi-core real-time system with non-deterministic workload. The considered thermal model is able to address various thermal effects like heat exchange between neighboring cores and temperature-dependent leakage power. Afterwards, we integrate the proposed thermal analysis method into a design-space exploration framework to optimize the task to processing component assignment. Finally, we apply the proposed method in various case studies to explore thermal hot spots and to optimize the task to processing component assignment.
Introduction
The increasing demand for computational performance and better power efficiency motivates system designers to use multiprocessor system-on-chips (MPSoCs) that integrate multiple processing components, memories, and communication units on a single die. However, the use of deep submicrometer process technology to fabricate MPSoCs imposes a major rise in power densities, which in turn threatens the reliability and performance of the system by inducing high chip temperatures. Thermal hot spots, i.e., areas on the chip with high temperatures, affect the design of the cooling system. In order to reduce device failures, the cooling system has to be designed for the worst-case chip temperature, i.e., the maximum chip temperature under all feasible scenarios of task arrivals [7].
Besides improving the cooling system, thermal and reliability issues have been tackled by reactive thermal management techniques or thermal-aware task allocation and scheduling algorithms. Reactive thermal management techniques such as dynamic voltage and frequency scaling (DVFS) keep the maximum temperature under a given threshold [9, 13] . However, the drawback of reactive thermal management techniques is the significant degradation of performance caused by stalling or slowing down the processor [7] .
When the workload of the system is known, pro-active thermal-aware allocation and scheduling techniques that avoid thermal emergencies and thus, a reduction in performance, might be preferable over reactive thermal management techniques. In particular, by selecting an optimal frequency, voltage, and task assignment, the peak temperature can significantly be reduced so that a certain quality of service level can be guaranteed at design-time [6, 7, 10, 17] . Nonetheless, prior work either lowers the average temperature or assumes deterministic workload where the maximum temperature of the system can be calculated by simulating the system. However, unknown input patterns and data cause the workload to be non-deterministic so that the maximum possible chip temperature under all feasible scenarios of task arrivals is difficult to identify. Only when the corner case that actually leads to the maximum temperature of the system is considered, simulation-based thermal analysis techniques do not lead to an undesired underestimation of the maximum temperature. However, calculating this critical workload has been shown to be time-consuming [20] , so that calculating the peak temperature of hundreds of design alternatives during design space exploration would be infeasible.
In this paper, we first present a fast analytic method to calculate an upper bound on the maximum temperature of an MPSoC with non-deterministic workload. The proposed method has a time-complexity that only depends on the number of processing components, which enables efficient design-space exploration. The considered thermal model is able to address various thermal effects like heat exchange between neighboring cores and temperature-dependent leakage power. We use the well-established standard event model [11] to model non-determinism in the workload, i.e., we consider periodic event streams with jitter and delay. Real-time calculus [23], a formal method for schedulability and performance analysis of real-time systems, is applied to upper bound the workload that might arrive in any time interval. Although arrival curves constrain the maximum possible workload, infinitely many traces comply with this specification. Thus, our method identifies the critical workload trace that leads to the worst-case chip temperature. The only requirement of the method is that the real-time scheduling algorithms are work-conserving, i.e., the respective processing component has to process as soon as there is an event in its ready queue. However, this applies to most of the traditional scheduling algorithms such as, for example, earliest-deadline-first (EDF), rate-monotonic (RM), fixed-priority (FP), and deadline-monotonic (DM).
We then integrate the proposed thermal analysis method into a design-space exploration framework intended to optimize the task to processing component assignment at design-time, prior to deployment and execution. We formulate this optimization problem with the objective to minimize the worst-case chip temperature. Clearly, the reliability of a system is also affected by other (thermal) metrics such as, for instance, the temperature gradient. However, due to the above listed reasons, we selected the minimization of the worst-case chip temperature as the objective in our framework. Finally, we solve the optimization problem using simulated annealing and provide an extensive evaluation of its performance. This paper is based on the work published in [19] , which formally proves the thermal analysis method. Nonetheless, prior work does not integrate the thermal analysis method into an optimization framework. Thus, we extend the previous work by first formulating a thermal-aware task assignment problem, and then integrating the thermal analysis method into a framework to solve the optimization problem. In addition, this paper gives a broader coverage of related work, the proposed technique is more detailed, and a richer set of experiments is carried out. Therefore, the contributions of this paper can be summarized as follows:
- The considered system is formally described along with an overview of worst-case chip temperature analysis.
- A mathematical expression for a non-trivial upper bound on the worst-case chip temperature of a multi-core system with non-deterministic workload is derived. In contrast to previous work, the time-complexity of the proposed method only depends on the number of processing components.
- The task to processing component assignment is formulated as an optimization problem to minimize the worst-case chip temperature prior to execution.
- In various case studies, the proposed method is applied to explore the temperature distribution and optimize the task to processing component assignment.
The paper continues with a discussion of related work. Afterwards, in Section 3, the considered thermal and computational models are introduced. In Section 4, the thermal analysis methods are introduced. In Section 5, the task to processing component assignment is formulated as an optimization problem. Finally, in Section 6, case studies are presented to highlight the viability of our method.
Related Work
As the use of dynamic thermal management (DTM) techniques in multi-core systems leads to reduced system reliability, a degradation of performance [4], and high complexity [7], various pro-active thermal-aware binding and scheduling techniques have been proposed in recent years. In [6, 7, 9, 10, 17], the maximum temperature is reduced so that the performance requirements are still met. For instance, in [17], the thermal-aware scheduling problem is formulated as a convex optimization problem. In [6], a mixed-integer linear programming formulation is stated to solve the thermal-aware scheduling problem. In [7], a similar problem is solved for minimizing the energy consumption and reducing thermal hot spots. Finally, a global scheduling algorithm is proposed in [10] so that all cores are running at their ideally preferred speed and the peak temperature is optimally reduced. However, all of these works either calculate only the average temperature or assume a deterministic workload, so that the maximum temperature of the system can be calculated by simulating the transient temperature evolution.
Evaluating the temperature characteristics is typically a two-step process. First, the transient power dissipation is determined by a software-based [1, 24] or hardware-based [26] power-aware simulator. Having the transient power dissipation, either the average temperature is calculated based on steady-state analysis [31] or the transient temperature evolution is obtained by simulating the system in a thermal simulator [12, 21, 22] . However, due to the complexity of today's systems, it is difficult to identify corner cases that actually lead to the maximum temperature of the system. Consequently, simulation-based thermal analysis may lead to undesired underestimations of the maximum temperature.
In this work, we use a different approach. Similar to well-known best-case / worst-case timing analysis methods for multiprocessor systems, we use formal analysis methods to predict the maximum temperature of a real-time system. For example, modular performance analysis (MPA) [29] or SymTA/S [11] provide upper and lower bounds on the latency of a system with non-deterministic workload. In particular, a first formal analysis method to calculate the worst-case chip temperature of a multi-core system with non-deterministic workload has been proposed in [20]. By incorporating the temperature into real-time analysis, real-time deadlines can be guaranteed even if reliability is subject to high temperatures. However, as the method proposed in [20] uses linear search to calculate a tight bound on the worst-case chip temperature, its evaluation time is too long for design space exploration of multi-core systems with tens of processing components. In this work, we address this challenge and describe a method to calculate an upper bound on the maximum temperature of a multi-core system with a time complexity that only depends on the number of processing components. Finally, we integrate the proposed method into a thermal-aware task assignment strategy to minimize the worst-case chip temperature subject to real-time deadlines.
System Models
This section introduces the formal models to analyze a real-time application on an MPSoC.

Notation Bold characters are used for vectors and matrices and non-bold characters for scalars. For example, $\mathbf{H}$ denotes a matrix whose $(k, \ell)$-th element is $H_{k\ell}$ and $\mathbf{T}$ denotes a vector whose $k$-th element is $T_k$.
Computational Model
In this work, we consider an MPSoC with a set of processing components $\Theta$. The considered computational model of each processing component $\Theta_\ell$ is expressed using abstractions from real-time calculus [23]. In particular, we suppose that events with a total workload of $R_\ell(s, t)$ time units arrive at component $\Theta_\ell$ in time interval $[s, t)$ and that each event has a constant workload of $\Delta^A$ time units. Thus, for any fixed $s$, $R_\ell(s, t)$ is a staircase function that increases by the computation time of an event whenever an event arrives. The arrival curve $\alpha_\ell$ upper bounds all possible cumulative workloads:

$$R_\ell(s, t) \le \alpha_\ell(t - s) \quad \forall\, s < t$$
with $\alpha_\ell(0) = 0$. Figure 1a illustrates the concept of arrival curves for the widely used standard event model, where the event stream is defined by the parametric triple $(p_\ell, j_\ell, d_\ell)$ with period $p_\ell$, jitter $j_\ell$, and minimum inter-arrival distance between events $d_\ell$ [11]. For the rest of this paper, we assume that an event stream is always characterized by these three parameters.
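As an illustration of the period/jitter/distance parametrization, the following minimal sketch evaluates an upper arrival curve in Python. It assumes the event bound commonly used with the standard event model, i.e., the minimum of $\lceil (\Delta + j) / p \rceil$ and $\lceil \Delta / d \rceil$; the function names, the minimum distance of 100 ms, and the per-event workload of 20 ms are hypothetical choices for this example and are not taken from the paper.

```python
def ceil_div(a, b):
    """Ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def pjd_event_bound(delta, period, jitter, min_distance):
    """Upper bound on the number of events in any window of length delta."""
    if delta <= 0:
        return 0
    bound = ceil_div(delta + jitter, period)
    if min_distance > 0:
        # Two events are never closer than min_distance, which caps bursts.
        bound = min(bound, ceil_div(delta, min_distance))
    return bound


def arrival_curve(delta, period, jitter, min_distance, event_workload):
    """Workload arrival curve: event bound times the per-event workload."""
    return event_workload * pjd_event_bound(delta, period, jitter, min_distance)


# Period 450 ms and jitter 600 ms; the minimum distance (100 ms) and the
# per-event workload (20 ms) are hypothetical values for illustration.
curve = [arrival_curve(d, 450, 600, 100, 20) for d in (0, 1, 150, 301, 1000)]
```

The resulting staircase is non-decreasing and starts at zero, which matches the requirement that the arrival curve vanishes for an empty interval.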
We suppose that processing components are work-conserving. In other words, they will be in 'active' mode as long as there are events in their ready queue. The accumulated computing time $Q_\ell(s, t)$ describes the amount of time units that component $\Theta_\ell$ spends processing an incoming workload of $R_\ell(s, t)$ time units. It is upper bounded by $\gamma_\ell(\Delta)$ for all intervals of length $\Delta < t$ [28]:

$$Q_\ell(t - \Delta, t) \le \gamma_\ell(\Delta) \quad \forall\, 0 \le \Delta < t$$
For any fixed $s$ with $s < t$, the accumulated computing time $Q_\ell(s, t)$ is monotonically increasing and has either slope 1 or 0. When the slope is 1, the component is in 'active' mode, i.e., it is processing events. When the component is idle, i.e., in sleep mode, the slope is 0. Thus, we express the processing mode by the mode function:

$$S_\ell(t) = \frac{d\,Q_\ell(0, t)}{dt} \in \{0, 1\}$$
In Fig. 1b 
Thermal Model
The considered thermal model of an MPSoC describes the temperature evolution by means of an equivalent RC circuit [6, 12, 20, 21] . In particular, we model the layout of the chip by four layers, namely heat sink, heat spreader, thermal interface, and silicon die. Each layer is divided into a set of blocks according to the architectural-level units. In our case, we select a processing component abstraction, i.e., we represent each processing component as an individual node with separate power source and temperature characteristics. Even though a finer granularity could have been selected, the processing component abstraction has been shown to be accurate enough for system-level optimization [8, 30] . As 12 additional nodes are introduced in the heat spreader and heat sink layers to model the area that is not covered by the subjacent layer, a multi-core system with |Θ| processing components is modeled by n = 4 · |Θ| + 12 nodes. The n-dimensional temperature vector T(t) at time t is described by a set of first-order differential equations:
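$$\mathbf{C} \cdot \frac{d\mathbf{T}(t)}{dt} = \left(\mathbf{P}(t) + \mathbf{K} \cdot \mathbf{T}^{amb}\right) - \left(\mathbf{G} + \mathbf{K}\right) \cdot \mathbf{T}(t)$$

with $\mathbf{C}$ the thermal capacitance matrix, $\mathbf{G}$ the thermal conductance matrix, $\mathbf{K}$ the thermal ground conductance matrix, $\mathbf{P}$ the power dissipation vector, and $\mathbf{T}^{amb} = T^{amb} \cdot [1, \ldots, 1]'$ the ambient temperature vector. The initial temperature vector is denoted as $\mathbf{T}^0$ and the system is assumed to start at time $t_0 = 0$.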
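A linear dependency of power dissipation on temperature [6, 16] is assumed due to leakage power:

$$\mathbf{P}(t) = \boldsymbol{\phi} \cdot \mathbf{T}(t) + \boldsymbol{\psi}(t)$$

where $\boldsymbol{\phi}$ is a diagonal matrix with constant coefficients, and $\boldsymbol{\psi}$ a vector. Finally, the state-space representation of the thermal model is expressed by:

$$\frac{d\mathbf{T}(t)}{dt} = \mathbf{A} \cdot \mathbf{T}(t) + \mathbf{B} \cdot \mathbf{u}(t)$$

with input vector $\mathbf{u}(t) = \boldsymbol{\psi}(t) + \mathbf{K} \cdot \mathbf{T}^{amb}$, $\mathbf{A} = -\mathbf{C}^{-1} \cdot (\mathbf{G} + \mathbf{K} - \boldsymbol{\phi})$, and $\mathbf{B} = \mathbf{C}^{-1}$. As the thermal system is linear and time-invariant, the temperature of node $k$ is:

$$T_k(t) = T_k^{init}(t) + \sum_{\ell=1}^{n} T_{k,\ell}(t)$$

with $T_k^{init}(t) = \left[e^{\mathbf{A} \cdot t} \cdot \mathbf{T}^0\right]_k$ the contribution of the initial temperature. $T_{k,\ell}(t)$ is the convolution of input $u_\ell$ and $H_{k\ell}$, i.e., the impulse response between nodes $\ell$ and $k$:

$$T_{k,\ell}(t) = \int_0^t H_{k\ell}(\xi) \cdot u_\ell(t - \xi)\, d\xi$$

Input $u_\ell$ depends on the processing mode, i.e., the slope of the accumulated computing time of the processing component $\Theta_\ell$ corresponding to node $\ell$:

$$u_\ell(t) = S_\ell(t) \cdot u_\ell^a + \left(1 - S_\ell(t)\right) \cdot u_\ell^i$$

Nodes that do not correspond to a processing component have input $u_\ell(t) = u_\ell^i$. Similar to [20], we assume that $H_{k\ell}(t)$ is a non-negative unimodal function that has its maximum at time $t^{H_{k\ell}}_{max}$, see Fig. 2 for an illustration.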
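To make the thermal model more tangible, the following sketch integrates a two-node RC network with forward Euler steps in plain Python. It assumes the heat balance $C \cdot dT/dt = (P + K \cdot T^{amb}) - (G + K) \cdot T$ with leakage-dependent power $P = \phi \cdot T + \psi$; the two-node layout (one core, one heat-sink node) and all numeric parameters are hypothetical and chosen only so that the steady state can be checked by hand.

```python
def simulate_temperature(t0, psi, cap, g, k_ground, phi, t_amb, dt, steps):
    """Forward-Euler integration of C*dT/dt = (P + K*T_amb) - (G + K)*T
    with P = phi*T + psi, for diagonal C, K, and phi given as lists."""
    n = len(t0)
    temp = list(t0)
    for _ in range(steps):
        deriv = []
        for k in range(n):
            power = phi[k] * temp[k] + psi[k]           # leakage + dynamic part
            lateral = sum(g[k][l] * temp[l] for l in range(n))  # row k of G*T
            ground = k_ground[k] * (temp[k] - t_amb)    # heat flow to ambient
            deriv.append((power - lateral - ground) / cap[k])
        temp = [temp[k] + dt * deriv[k] for k in range(n)]
    return temp


# Hypothetical parameters: node 0 is a core, node 1 a heat-sink node.
cap = [1.0, 10.0]                      # thermal capacitances
g = [[2.0, -2.0], [-2.0, 2.0]]         # conductance (Laplacian) matrix
k_ground = [0.0, 1.0]                  # conductance to the ambient
phi = [0.05, 0.0]                      # leakage slope
psi = [10.0, 0.0]                      # temperature-independent power
t_amb = 300.0

final = simulate_temperature([t_amb, t_amb], psi, cap, g, k_ground, phi,
                             t_amb, dt=0.01, steps=40000)
# Steady state solves (G + K - phi) * T = psi + K * T_amb.
expected = [630.0 / 1.85, 605.0 / 1.85]
```

For a constant input the simulated temperatures settle at the solution of the linear steady-state system, which is the fixed point of the state-space form under the stated assumption.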
Thermal Analysis
In this section, we start by presenting some key results from peak temperature analysis, and then introduce a novel method to calculate a non-trivial upper bound on the maximum temperature without simulating the transient temperature evolution.
Peak Temperature Analysis
The worst-case chip temperature $T^*_S$, i.e., the maximum temperature of the system under all feasible scenarios of task arrivals, is the maximum possible temperature of all nodes:

$$T^*_S = \max\left(T^*_1, \ldots, T^*_n\right)$$

with $n$ the number of nodes and $T^*_k$ the worst-case peak temperature of node $k$. Because of non-determinism in the workload arriving at the different components, one first has to identify the critical set of cumulative workload traces that leads to the worst-case peak temperature $T^*_k$ of node $k$. Due to heat exchange between neighboring components, $T^*_k$ depends not only on the workload of component $\Theta_k$, but also on the workload of all other components of the chip.
Once the critical set of cumulative workload traces is identified, the temperature $T^*_k(\tau)$ at a certain observation time $\tau$ is found by simulating the system with the set of critical accumulated computing times

$$\mathbf{Q}^{\{k\}}(0, t) = \left[Q^{\{k\}}_1(0, t), \ldots, Q^{\{k\}}_n(0, t)\right]$$

where $\{k\}$ indicates that $\mathbf{Q}^{\{k\}}$ leads to the worst-case peak temperature of node $k$. The critical accumulated computing time $Q^{\{k\}}_\ell(0, t)$ describes the sequence of active and idle time units that, among all possible sequences of active and idle time units, leads to the maximum temperature $T^*_k(\tau)$. Thus, the remaining question is how to calculate the set of critical accumulated computing times $\mathbf{Q}^{\{k\}}$.
First, we note that the critical accumulated computing time $Q^{\{k\}}_\ell(0, \Delta)$ of node $\ell$ can be calculated independently of all other accumulated computing times [20]. In particular, the problem is equivalent to finding the accumulated computing time $Q^{\{k\}}_\ell(0, \Delta)$ that maximizes $T_{k,\ell}$, defined as in Eq. 8.
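The convolution of Eq. 8 can be approximated numerically to see why the placement of active time matters. The sketch below is a minimal illustration under assumptions: the impulse response $H(\xi) = \xi \cdot e^{-\xi}$ is a hypothetical unimodal function with its peak at $\xi = 1$, the input is 1 in 'active' and 0 in 'idle' mode, and both traces contain exactly one time unit of activity.

```python
import math


def impulse_response(xi):
    """Hypothetical non-negative unimodal impulse response, peak at xi = 1."""
    return xi * math.exp(-xi) if xi >= 0.0 else 0.0


def temperature_contribution(mode, tau, u_active, u_idle, dt=1e-3):
    """Midpoint-rule approximation of int_0^tau H(xi) * u(tau - xi) d xi."""
    total = 0.0
    steps = int(round(tau / dt))
    for i in range(steps):
        xi = (i + 0.5) * dt
        t = tau - xi                       # trace time seen at lag xi
        u = u_active if mode(t) else u_idle
        total += impulse_response(xi) * u * dt
    return total


tau = 5.0
# Active interval centered at tau minus the peak lag of H.
aligned = temperature_contribution(lambda t: 3.5 <= t <= 4.5, tau, 1.0, 0.0)
# Same amount of active time, but placed at the start of the trace.
early = temperature_contribution(lambda t: 0.0 <= t <= 1.0, tau, 1.0, 0.0)
```

With identical active time, the trace whose activity coincides with the peak of the impulse response contributes considerably more to the temperature at the observation time, which is the effect the critical computing time exploits.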
Let us define t H k
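where $\tau$ is the predefined observation time of the peak temperature, and introduce the auxiliary function $v_\ell$ of node $\ell$, which is, starting at time $t_s$, one for $\Delta^A$ time units:

$$v_\ell(t, t_s) = \begin{cases} 1 & t_s \le t < t_s + \Delta^A \\ 0 & \text{otherwise} \end{cases}$$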
The next theorem follows from the results of [20] and provides a method to calculate the critical accumulated computing time Q {k} leading to the worst-case peak temperature T * k of node k. 
Theorem 1 Suppose that the accumulated computing time [...]

[Algorithm 1: calculation of the critical accumulated computing time $\mathbf{Q}^{\{k\}}$; the listing is not recoverable from the extracted text.]
Algorithm 1 calculates the critical accumulated computing time $\mathbf{Q}^{\{k\}}$ by altering both the position of the burst and the gap between the burst and the first successive active interval, see Fig. 3 for an illustration of the algorithm. Then, $\mathbf{Q}^{\{k\}}$ is the computing time that maximizes the sum of all areas below $H_{k\ell}$ where the node is in 'active' processing mode. Afterwards, the peak temperature $T^*_k(\tau)$ of node $k$ is obtained by simulating the system with computing time $\mathbf{Q}^{\{k\}}$. Calculating an upper bound on the maximum temperature $T^*_S$ has time complexity $O(n^2 \cdot m)$ with $n$ the number of nodes. The factor $m$ reflects the time to execute Algorithm 1 and is inversely proportional to the selected time step. Increasing the time step can improve the execution time, but might lead to a reduced accuracy.
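The trade-off between time step and accuracy can be illustrated with a strongly simplified stand-in for the search: a single burst of fixed length is shifted over a grid, and the position with the largest area below the impulse response is kept. This is not Algorithm 1 itself; the impulse response $H(\xi) = \xi \cdot e^{-\xi}$, the burst length, and the function names are hypothetical.

```python
import math


def area_under_h(start, length):
    """Integral of H(xi) = xi * exp(-xi) over [start, start + length]."""
    def primitive(x):
        return -(x + 1.0) * math.exp(-x)
    return primitive(start + length) - primitive(start)


def search_burst_position(burst_length, horizon, step):
    """Try every grid position and keep the one with the largest area."""
    candidates = int(round((horizon - burst_length) / step)) + 1
    best_start, best_area = 0.0, area_under_h(0.0, burst_length)
    for i in range(1, candidates):
        start = i * step
        area = area_under_h(start, burst_length)
        if area > best_area:
            best_start, best_area = start, area
    return best_start, best_area, candidates


# The number of evaluated positions grows inversely with the time step.
coarse = search_burst_position(1.0, 5.0, 0.5)
fine = search_burst_position(1.0, 5.0, 0.05)
```

The fine grid evaluates roughly ten times more positions than the coarse one and lands closer to the continuous optimum at $1/(e - 1)$, which mirrors the role of the factor $m$ in the complexity statement.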
Fast Temperature Evaluation
As indicated, Algorithm 1 might be time-consuming and hence not suited for design space exploration. Therefore, we first derive an analytical expression for the accumulated computing time that leads to a non-trivial upper bound on the peak temperature. Afterwards, we use the obtained computing time to propose a novel mathematical expression for an upper bound on the maximum temperature.
The first lemma simplifies Algorithm 1 so that it is constant time and so that the resulting upper bound on the maximum temperature at observation time $\tau$, $T^*_k(\tau)$, is not smaller than the worst-case peak temperature of node $k$. $\mathbf{Q}^{\{k\}}$ denotes the critical accumulated computing time that leads to $T^*_k(\tau)$. Starting from Algorithm 1 [20], the simplification proceeds in two steps. In the first step, the precedent and successive active intervals of the burst are moved to the burst so that the node is continuously active for $b + \Delta^A$ time units, compare Fig. 4a and 4b for an illustration. The second step makes the search for the position of the burst obsolete. To this end, the length of the burst is extended so that it covers all possible positions of the burst, i.e., $S_\ell(t) = 1$ for all $t \in [t^{(l)}, t^{(r)}]$, see Fig. 4c for an illustration.
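Theorem 2 Suppose that the accumulated computing time function $\mathbf{Q}^{\{k\}}(0, \Delta) = \left[Q^{\{k\}}_1(0, \Delta), \ldots, Q^{\{k\}}_n(0, \Delta)\right]$ for all $0 \le \Delta \le \tau$ with:

$$Q^{\{k\}}_\ell(0, \Delta) = \begin{cases} \gamma_\ell\left(t^{H_{k\ell}}_{max}\right) - \gamma_\ell\left(t^{H_{k\ell}}_{max} - \Delta\right) & 0 \le \Delta < t^{H_{k\ell}}_{max} \\ \gamma_\ell\left(t^{H_{k\ell}}_{max}\right) + \gamma_\ell\left(\Delta - t^{H_{k\ell}}_{max}\right) & t^{H_{k\ell}}_{max} \le \Delta < \tau \end{cases} \quad (12)$$

leads to $T^*_k(\tau)$ at time $\tau$.

Algorithm 2
1: $t^{(r)} = t^{H_{k\ell}}_{max} + b - \Delta^A$, $t^{(l)} = t^{H_{k\ell}}_{max} - b + \Delta^A$
2: $S_k(t) = 1$ if $t \in [t^{(l)}, t^{(r)}]$ (extended burst), $0$ otherwise
3: for $i = 1$ to $(\tau - t^{(r)}) / p$ do (trace for $t > t^{(r)}$)
4: $S_k(t) = S_k(t) + v(t, t^{(r)}$ [...]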
Algorithm 2 is the result of these two translations. One can readily prove that in both steps, the amount of 'active' time units is either increased or shifted closer to $t^{H_{k\ell}}_{max}$. Finally, we note that Eq. 12 is equivalent to Algorithm 2. As Algorithm 2 is constant time, computing an upper bound on the maximum temperature of an MPSoC has time complexity $O(n^2)$ with $n$ the number of nodes. As illustrated in Fig. 4c, the system continuously switches between active and idle mode except during the burst. Next, we show that calculating the peak temperature can be simplified further by running the processing component with constant slope $\delta = \frac{\Delta^A}{\Delta^A + \Delta^I}$ for all time units except during the burst. This lemma provides the foundation for the main theorem of this section.
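The two-branch expression of Eq. 12 translates directly into code once a bound $\gamma$ on the accumulated computing time is available. The sketch below assumes that $\gamma$ is passed in as a callable; `toy_gamma`, the peak position of 3 time units, and the observation time of 5 time units are hypothetical values used only to exercise the formula.

```python
def critical_computing_time(delta, gamma, t_peak, tau):
    """Accumulated computing time following the two branches of Eq. 12."""
    if not 0.0 <= delta <= tau:
        raise ValueError("delta must lie in [0, tau]")
    if delta < t_peak:
        # Before the peak: count the active time backwards from t_peak.
        return gamma(t_peak) - gamma(t_peak - delta)
    # After the peak: add the active time that follows t_peak.
    return gamma(t_peak) + gamma(delta - t_peak)


def toy_gamma(delta):
    """Hypothetical bound: full utilization for windows up to 0.8 time units,
    then 50 % utilization for the remainder."""
    return max(0.0, min(delta, 0.4 + 0.5 * delta))


t_peak, tau = 3.0, 5.0
samples = [critical_computing_time(d, toy_gamma, t_peak, tau)
           for d in (0.0, 2.2, 3.0, 3.8, 5.0)]
```

With this toy bound the component is continuously active in a window of 0.8 time units on either side of the peak position and runs at half utilization elsewhere, which corresponds to the extended burst described above.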
Lemma 1 Suppose that the mode function:
with utilization $\delta = \frac{\Delta^A}{\Delta^A + \Delta^I}$ [...]

Proof Rewriting Eq. 8 with Eq. 9 leads to:
As we know from Theorem 2 that
By rewriting the integral from 0 to θ (l) as a sum, we get:
where ρ is an integer that is selected so that p Fig. 5 for an illustration. Then, we find:
where we subtracted the two integrals in the inter- 
where we used the fact that δ = 
Based on Lemma 1, we present the main result of this section. The following theorem provides a mathematical expression to calculate a non-trivial upper bound on the maximum temperature T * k (τ ) of node k. Finally, according to Eq. 10, an upper bound on the maximum temperature of the system can be obtained by calculating the maximum of all individual upper bounds.
Theorem 3 Suppose that T k (t) is the temperature of node k at time instant t for a set of workload functions R(s, t) that are bounded by the set of arrival curves α. When the scheduler is work-conserving, the following statements hold:
-The temperature: 
The second item is a simple consequence of Theorem 2.
Three different methods to calculate an upper bound on the maximum temperature have been presented in this section. The first method calculates the critical accumulated computing time by Algorithm 1 leading to the worst-case peak temperature T * k of node k. The second method calculates the accumulated computing time according to Eq. 12 leading to T * k , and the last method calculates an upper bound on the maximum temperature T * k of node k by the mathematical expression defined by Eq. 14. The relation between these three different bounds on the maximum temperature is as follows:
Minimizing the Peak Temperature
So far, we have seen a method to calculate the worst-case chip temperature, i.e., the maximum chip temperature under all feasible scenarios of task arrivals. Next, we apply this method to calculate an optimal task assignment that minimizes the worst-case chip temperature and guarantees that all real-time deadlines are met. By offering safe bounds, the resulting framework is intended to optimize the task assignment at design-time, i.e., prior to execution.
Task Model
In our task model, we assume a set of tasks $\nu$ that are executed concurrently. Each task $\nu_j$ is modeled as a stream of events and has an event arrival curve $e_{\nu_j}(\Delta)$ that upper bounds the cumulative number of events arriving in any time interval of length $\Delta \ge 0$. An event has to complete its execution within $D_{\nu_j}$ time units after its arrival, and function $\phi(\nu_j, \Theta_\ell)$ assigns each task $\nu_j$ the number of time units that are required to process an event on processing component $\Theta_\ell$. Finally, function $\Gamma(\nu_j, \Theta_\ell)$ is 1 if task $\nu_j$ is assigned to processing component $\Theta_\ell$ and 0 otherwise:

$$\Gamma(\nu_j, \Theta_\ell) = \begin{cases} 1 & \text{if } \nu_j \text{ is assigned to } \Theta_\ell \\ 0 & \text{otherwise} \end{cases}$$
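Thus, the total accumulated workload of component $\Theta_\ell$ is, in any time interval of length $\Delta \ge 0$, upper bounded by the arrival curve $\alpha_\ell$ [23]:

$$\alpha_\ell(\Delta) = \sum_{\nu_j \in \nu} \Gamma(\nu_j, \Theta_\ell) \cdot \phi(\nu_j, \Theta_\ell) \cdot e_{\nu_j}(\Delta)$$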
To check schedulability, we use the concept of a demand bound function [2] that models the maximum resource demand of a task. The demand bound function $dbf_{\nu_j, \Theta_\ell}(\Delta)$ of task $\nu_j$ upper bounds the maximum accumulated computational demand of all events that arrive and have their deadline in any interval of length $\Delta$ on processing component $\Theta_\ell$. Formally, the demand bound function $dbf_{\nu_j, \Theta_\ell}(\Delta)$ is defined as:

$$dbf_{\nu_j, \Theta_\ell}(\Delta) = \phi(\nu_j, \Theta_\ell) \cdot e_{\nu_j}\left(\Delta - D_{\nu_j}\right)$$
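The demand bound function $dbf_\ell(\Delta)$ of a processing component $\Theta_\ell$ depends on the scheduling algorithm. For example, when an EDF scheduler is used to arbitrate between events of different tasks assigned to the same processing component, the demand bound function $dbf_\ell(\Delta)$ is:

$$dbf_\ell(\Delta) = \sum_{\nu_j :\, \Gamma(\nu_j, \Theta_\ell) = 1} dbf_{\nu_j, \Theta_\ell}(\Delta)$$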
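A demand-bound check for EDF can be sketched in a few lines. The example below deliberately uses the classical sporadic task model (worst-case execution time, relative deadline, period, all integers) rather than general event arrival curves with jitter, so it is an illustration of the principle under that assumption and not the analysis performed in this paper; all function names are hypothetical.

```python
from fractions import Fraction
from math import gcd


def dbf_task(delta, wcet, deadline, period):
    """Demand of one sporadic task in any window of length delta."""
    if delta < deadline:
        return 0
    return ((delta - deadline) // period + 1) * wcet


def dbf_component(delta, tasks):
    """EDF demand of a component: sum over the tasks assigned to it."""
    return sum(dbf_task(delta, *task) for task in tasks)


def edf_schedulable(tasks):
    """tasks: list of (wcet, deadline, period) integer tuples."""
    if sum(Fraction(wcet, period) for wcet, _, period in tasks) > 1:
        return False
    hyperperiod = 1
    for _, _, period in tasks:
        hyperperiod = hyperperiod * period // gcd(hyperperiod, period)
    horizon = hyperperiod + max(deadline for _, deadline, _ in tasks)
    # The demand only changes at absolute deadlines, so test those points.
    checkpoints = sorted({deadline + i * period
                          for _, deadline, period in tasks
                          for i in range((horizon - deadline) // period + 1)})
    return all(dbf_component(t, tasks) <= t for t in checkpoints)


feasible_set = [(1, 4, 4), (2, 6, 6), (1, 8, 8)]
infeasible_set = [(2, 2, 4), (2, 3, 4)]
results = (edf_schedulable(feasible_set), edf_schedulable(infeasible_set))
```

The second task set has a utilization of exactly one but tight deadlines, so the demand within a window of three time units already exceeds the available time and the test fails.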
Optimization Problem
Once we have specified the task model, we can formulate the considered optimization problem:
Given is a set of tasks $\nu$ that is mapped onto an MPSoC with processing components $\Theta$. The goal is to select a static assignment of tasks to processing components such that all deadlines are met and the worst-case chip temperature $T^*_S$ is minimized. In other words, the objective of the optimization problem is to minimize the worst-case chip temperature:

$$\text{minimize} \quad T^*_S = \max\left(T^*_1, \ldots, T^*_n\right)$$
where T * k is defined as in Eq. 14 and n is the number of nodes of the equivalent thermal RC circuit.
We call a processing component $\Theta_\ell$ schedulable if the real-time deadlines of all events are met. We have to guarantee that, in every time interval of length $\Delta$, the amount of available computing resources is not smaller than the maximum resource demand, defined by the demand bound function $dbf_\ell(\Delta)$. Thus, the schedulability test is written as:

$$dbf_\ell(\Delta) \le \Delta \quad \forall\, \Delta \ge 0$$
Practically, the RTC toolbox [27] can be used to verify schedulability. Finally, we have to make sure that each task is assigned to only one processing component:

$$\sum_{\Theta_\ell \in \Theta} \Gamma(\nu_j, \Theta_\ell) = 1 \quad \forall\, \nu_j \in \nu$$
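The structure of the optimization problem (one core per task, a schedulability constraint, and a peak-temperature objective) can be sketched with a small simulated annealing loop. Everything specific in this sketch is an assumption made for illustration: `temperature_proxy` is a hypothetical stand-in for the analytic temperature bound, cores are arranged on a line, and schedulability is approximated by a utilization of at most one per core.

```python
import math
import random


def core_loads(assignment, utilization, n_cores):
    loads = [0.0] * n_cores
    for task, core in enumerate(assignment):
        loads[core] += utilization[task]
    return loads


def is_schedulable(assignment, utilization, n_cores):
    # Assumption: EDF with implicit deadlines, so utilization <= 1 suffices.
    return all(load <= 1.0 + 1e-12
               for load in core_loads(assignment, utilization, n_cores))


def temperature_proxy(assignment, utilization, n_cores, coupling=0.3):
    # Hypothetical objective: own load plus a share of the neighbors' load.
    loads = core_loads(assignment, utilization, n_cores)
    worst = 0.0
    for core in range(n_cores):
        neighbors = [c for c in (core - 1, core + 1) if 0 <= c < n_cores]
        heat = loads[core] + coupling * sum(loads[c] for c in neighbors)
        worst = max(worst, heat)
    return worst


def anneal(utilization, n_cores, initial, steps=5000,
           t_start=1.0, t_end=1e-3, seed=0):
    rng = random.Random(seed)
    current = list(initial)
    current_cost = temperature_proxy(current, utilization, n_cores)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = list(current)
        # Each list entry holds exactly one core, so every task stays
        # assigned to exactly one processing component.
        candidate[rng.randrange(len(candidate))] = rng.randrange(n_cores)
        if not is_schedulable(candidate, utilization, n_cores):
            continue
        cost = temperature_proxy(candidate, utilization, n_cores)
        accept = cost <= current_cost or \
            rng.random() < math.exp((current_cost - cost) / temp)
        if accept:
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
    return best, best_cost


utilization = [0.3, 0.2, 0.25, 0.15]
start = [0, 0, 0, 0]
best, best_cost = anneal(utilization, n_cores=4, initial=start)
```

In a real flow, the proxy would be replaced by the analytic upper bound and the utilization check by the demand-bound test, while the annealing loop itself stays unchanged.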
Temperature Reduction by Voltage Scaling
The worst-case chip temperature can further be reduced by assigning each processing component its optimal frequency, i.e., the minimum operation frequency so that no real-time deadlines are missed. In the following, we extend the system model and the thermal analysis model, and formulate the optimization problem to make use of voltage and frequency scaling to reduce the power consumption, and thus, the worst-case chip temperature. Each processing component Θ has its own clock domain and executes at a static frequency f with 0 ≤ f ≤ f max . We suppose that the number of time units φ(ν j , Θ ) that an event of task ν j has to execute on processing component Θ scales linearly with the operation frequency. Thus, the total accumulated workload of Θ is upper bounded by the arrival curve:
Furthermore, we assume that the dynamic power consumption of component $\Theta_\ell$ grows quadratically with its supply voltage $v_\ell$ and linearly with its operation frequency $f_\ell$ [18]:

$$P^{dyn}_\ell \propto v_\ell^2 \cdot f_\ell$$
Similar to [17] , we suppose that the square of the supply voltage scales linearly with the operation frequency even though the results of the paper also hold for any other monotonic relation between supply voltage and frequency. Now, we can write the total power consumption as:
with diagonal matrix diag(f) of vector f and constant diagonal matrices ρ and ω. As the operation frequency is statically assigned at design-time, the thermal analysis method proposed in Section 4 still provides an upper bound on the maximum temperature. In order to calculate the minimum operation frequency so that no real-time deadlines are missed, we rewrite the demand bound function dbf (Δ) with the scaled operation frequency f :
Finally, rewriting Eq. 22 with the above expression for the demand bound function results in the following expression for the minimum operation frequency for a processing component that uses an EDF scheduler to arbitrate between events of different tasks:
Case Studies
In order to evaluate the proposed analysis methods, we extended the modular performance analysis (MPA) framework [5] with the ability to calculate the maximum temperature by the discussed algorithms. Afterwards, we solve the mapping problem proposed in Section 5 for various task sets to illustrate the capability of the thermal analysis method.
Experimental Setup and System Description
We consider a homogeneous multi-core ARM platform with a variable number of processing components. Fixed priority preemptive scheduling is used on all processing components while a TDMA policy is employed on the shared bus that connects all processing components. Intermediate streams that cannot be represented by a period, a jitter, and a minimum inter-arrival distance are upper-bounded by the method presented in [15], and the observation time $\tau$ is set to five seconds. Temperature-dependency of leakage power is addressed by linearizing the model described in [21]. Table 1 summarizes the parameters of the considered power model defined by Eq. 5. As we consider a homogeneous platform, every component has the same power values. HotSpot [12] is used to calculate the thermal parameters of the platform, i.e., the C, G, and K matrices, see Table 2 for the detailed thermal configuration. In all experiments, the traces start from the steady-state temperature in 'idle' mode, i.e., $\mathbf{T}^0 = (\mathbf{T}^\infty)^i$. All experiments have been performed on an Intel Core i7-2720QM processor with 8 GB of RAM.
Worst-Case Chip Temperature Evaluation
First, we consider four benchmark applications to evaluate the performance of the proposed thermal analysis method.
In particular, we compare the accuracy and the evaluation time of the novel method with the method proposed in [20] .
Application Description
A producer-consumer (P-C), a distributed matrix multiplication, a Fast-Fourier transform (FFT), and a motion JPEG (MJPEG) decoder application are considered in the first case study. To improve the performance, the benchmark applications are split into several tasks that might run in parallel. Each task is characterized by its best-case and worst-case execution demand, which have been determined by simulating the benchmark application on the MPARM virtual platform [3] . In particular, the P-C application is split into five tasks, the matrix multiplication application into ten tasks, and the FFT application into twelve tasks. The MJPEG decoder application is split into a variable number of tasks to concurrently decode individual frames. In addition, the MJPEG decoder consists of a task to split up the input sequence into individual frames and to send the frames to the decompressing tasks. Finally, another task merges the decoded frames back into a stream.
Efficiency and Accuracy
First, we use an MJPEG decoder that consists of five tasks running on three processing components, thus the thermal model has order 24. The application is driven by an input stream with a periodic invocation interval of 450 ms and a jitter of 600 ms. To evaluate our method, both the time to calculate an upper bound on the maximum temperature and the quality of this bound are analyzed. To this end, we first calculate three different upper bounds on the maximum temperature:
1. The critical accumulated computing time is computed with Algorithm 1, leading to the worst-case chip temperature $T^*_S$.
2. An upper bound on $T^*_S$ is calculated by simulating the system with the critical accumulated computing time defined by Eq. 12.
3. An upper bound on $T^*_S$ is calculated according to Eq. 14.

We compare these three bounds as well as the durations to calculate them. Peak temperatures and durations to calculate these temperatures are listed in Table 3. Calculating the bound of Eq. 14 is on average 549 times faster than calculating $T^*_S$ with Algorithm 1, but note that the execution time of Algorithm 1 depends on the selected time increment. Furthermore, the execution time depends on the actual mapping as shown in Table 3a. We quantify the accuracy of the bound of Eq. 14 by means of the worst-case chip temperature $T^*_S$. To this end, we introduce
Applying this formula, the average error of our results is found to be only 0.22 %. This confirms our approach to upper bound the peak temperature by Eq. 14 instead of using Algorithm 1 to calculate the critical accumulated computing time and then simulating the system with the critical computing time. Overall, calculating the bound of Eq. 14 instead of $T^*_S$ is desirable in the design flow of real-time systems, as the reduction in evaluation time by nearly three orders of magnitude enables a faster and more exhaustive design space exploration.
The small error is mainly attributed to the fact that the heat transfer among neighboring nodes is smaller than the self-heating. For self-heating, Eq. 12 and Algorithm 1 calculate the same critical accumulated computing time as t H k max is equal to the observation time τ .
Finally, we compare $T^*_S$ and the Eq. 14 bound, as well as the durations to calculate them, for the P-C, the matrix multiplication, and the FFT applications. The results are listed in Table 4 for one mapping per application. They exhibit trends similar to those observed with the MJPEG decoder application.
Comparison with a Cycle-Accurate Simulation
The method proposed in [20] has been extensively evaluated against a cycle-accurate simulation tool-chain. For completeness, we summarize the results of this evaluation; see [25] for additional details. The tool-chain is based on MPARM [3] and HotSpot [12]. Key results of the evaluation are listed in Table 5, where the reported values are the average of six different mapping configurations. Simulating the temperature evolution on the tool-chain is on average 266 times slower than calculating the peak temperature with the analytic method proposed in [20]. The maximum chip temperature $T^*_S$ is on average 4.8 K higher than the maximum temperature of the cycle-accurate simulation. One reason for the difference is that the cycle-accurate simulation underestimates the worst-case chip temperature, because an exhaustive simulation of all system configurations is infeasible.
Temperature Distribution on a 25-Core Processor
Next, we consider a multi-core system with 25 processing components executing an MJPEG decoder with 10 tasks. The processing components are arranged in a grid with five rows and the corresponding thermal model has order 112. We will show that the temperature distribution, and thereby the worst-case chip temperature of the system, is affected by the assignment of tasks to processing components. Figure 6 shows the worst-case chip temperature distribution of the system for four different mappings. In Fig. 6a, the tasks are mapped onto components situated in the top-left corner of the chip. In Fig. 6b, the tasks are distributed among components in all four corners. In Fig. 6c, the tasks are distributed all over the chip, and in Fig. 6d, the tasks are only mapped onto components in the middle of the chip. The highest peak temperature occurs in Fig. 6a and the lowest one in Fig. 6c. The difference between their worst-case chip temperatures is about 16 K. This shows that the worst-case chip temperature can be reduced by spreading the workload over the chip. In this case, intermediate processing components with no workload act like a passive cooling system and keep hot spots separated.
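The qualitative difference between a clustered and a spread mapping can be illustrated with a small sketch. It is not the thermal model used in this paper: it only counts how many pairs of loaded components share an edge on the 5 × 5 grid, which serves as a crude proxy for how strongly hot spots are packed together. The two placements, the one-task-per-component assumption, and the function name `adjacent_active_pairs` are hypothetical and do not reproduce the mappings of Fig. 6.

```python
from itertools import combinations

ROWS, COLS = 5, 5  # 25 processing components arranged in five rows


def adjacent_active_pairs(active_cores):
    """Count pairs of loaded cores that are direct neighbors in the grid.

    This is only a crude clustering proxy, not a temperature estimate.
    """
    return sum(
        1
        for (r1, c1), (r2, c2) in combinations(sorted(active_cores), 2)
        if abs(r1 - r2) + abs(c1 - c2) == 1
    )


# Hypothetical placement of 10 tasks (one per core) in the top-left corner.
corner_mapping = {(r, c) for r in range(4) for c in range(4) if r + c <= 3}

# Hypothetical spread placement: 10 cells of a checkerboard pattern, so that
# every loaded core is surrounded by idle cores.
checkerboard = sorted(
    (r, c) for r in range(ROWS) for c in range(COLS) if (r + c) % 2 == 0
)
spread_mapping = set(checkerboard[:10])

corner_pairs = adjacent_active_pairs(corner_mapping)
spread_pairs = adjacent_active_pairs(spread_mapping)
```

In this sketch the corner placement has many neighboring loaded cores while the checkerboard placement has none, which mirrors the observation that idle components between loaded ones keep hot spots separated.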
Thermal-Aware Task Assignments
In the second case study, we apply the proposed temperature analysis method to calculate an optimal task assignment that minimizes the worst-case chip temperature and guarantees that all real-time deadlines are met.
System Description
We again target the homogeneous multi-core ARM platform. The platform has two different modes to control the operation frequency: either all processing components share a common clock domain, or each processing component has its own clock domain. The maximum operation frequency is assumed to be 1.6 GHz, and the power model shown in Table 1 has been extended according to Eq. 26. An EDF scheduler runs on each processing component to arbitrate between events of different tasks assigned to the same component.
Simulated Annealing to Optimize the Temperature
We first suppose that each processing component is running at its maximum operation frequency, i.e., 1.6 GHz.
The thermal optimization problem stated in Eq. 5 can be solved exhaustively for small task sets and platforms with a low number of processing components. Thus, we first compare the performance of a heuristic solver with the optimal solution found by exhaustively exploring the design space. The heuristic solver uses simulated annealing [14] to solve the thermal optimization problem. In addition, for comparison with the optimized task assignments, the average peak temperature of 20 feasible, i.e., schedulable, random task assignments is calculated.

We consider three different hardware platforms with three, four, and six cores, respectively. Each task set is randomly generated and contains between four and six tasks. Each task $\nu_j$ is characterized by a period $p_{\nu_j}$, a jitter $j_{\nu_j}$, and a computing demand $c_{\nu_j}$. The period $p_{\nu_j}$ is uniformly chosen from $[1, 400]$ ms, the jitter $j_{\nu_j}$ is uniformly chosen from $[1\,\text{ms}, 2 \cdot p_{\nu_j}]$, and the computing demand $c_{\nu_j}$ is uniformly chosen from $[1, p_{\nu_j} \cdot f_{\max}/5]$ cycles with $f_{\max} = 1.6$ GHz. Finally, the real-time deadline of an event is set to the period of its task.

Figure 7 compares the performance of the three solvers. Exhaustively exploring the design space results in a task assignment whose worst-case chip temperature is, on average, only 0.37 K lower than that of the task assignment found by simulated annealing. For comparison, the average peak temperature of the random assignments is on average 3.6 K higher than the minimum peak temperature. Calculating the optimal solution for the hardware platform with six cores took on average 94.5 min, whereas simulated annealing finished on average in 33.8 s.
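The experimental setup described above can be sketched as follows. The task generator follows the stated distributions, assuming continuous uniform sampling. The annealing loop is a generic simulated annealing scheme with geometric cooling, not necessarily the configuration used in our experiments. The cost function `max_core_utilization` is a hypothetical stand-in: a faithful implementation would evaluate the worst-case chip temperature of an assignment and reject unschedulable assignments. All function and parameter names are illustrative assumptions.

```python
import math
import random

F_MAX_HZ = 1.6e9  # maximum operation frequency stated in the text


def random_task(rng):
    """Draw one task; continuous uniform sampling is an assumption."""
    period_ms = rng.uniform(1.0, 400.0)
    max_cycles = period_ms * 1e-3 * F_MAX_HZ / 5.0
    return {
        "period_ms": period_ms,
        "jitter_ms": rng.uniform(1.0, 2.0 * period_ms),
        "demand_cycles": rng.uniform(1.0, max_cycles),
        "deadline_ms": period_ms,  # deadline equals the period
    }


def random_task_set(rng):
    """Generate a task set with four to six tasks."""
    return [random_task(rng) for _ in range(rng.randint(4, 6))]


def max_core_utilization(tasks, assignment, n_cores):
    """HYPOTHETICAL cost: stand-in for the worst-case chip temperature."""
    load = [0.0] * n_cores
    for task, core in zip(tasks, assignment):
        load[core] += task["demand_cycles"] / (
            task["period_ms"] * 1e-3 * F_MAX_HZ
        )
    return max(load)


def anneal(tasks, n_cores, cost, rng, steps=2000, t_start=1.0, t_end=1e-3):
    """Generic simulated annealing over task-to-core assignments."""
    current = [rng.randrange(n_cores) for _ in tasks]
    current_cost = cost(tasks, current, n_cores)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Geometric cooling schedule from t_start towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        # Neighbor move: reassign one randomly chosen task.
        candidate = list(current)
        candidate[rng.randrange(len(tasks))] = rng.randrange(n_cores)
        candidate_cost = cost(tasks, candidate, n_cores)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost
```

A call such as `anneal(random_task_set(rng), 3, max_core_utilization, rng)` with `rng = random.Random(0)` returns the best assignment visited and its cost. Replacing the cost function with a worst-case temperature evaluation changes the objective without touching the search loop.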
Voltage and Frequency Scaling
Finally, we evaluate the effect of frequency and voltage scaling on the worst-case chip temperature. For a given task set, we solve the optimization problem for the following three configurations:
1. maximum frequency: each processing component is running at its maximum frequency.
2. single clock domain: the platform has a single clock domain for all processing components and is running at the minimum operation frequency so that no real-time deadline is missed.
3. separate clock domain: each processing component has its own clock domain and is running at the minimum operation frequency so that no real-time deadline is missed.
In other words, in the third configuration, each core has a separate frequency that is individually calculated by Eq. 28. In the second configuration, all cores are running at the same frequency, which is set to the highest of the frequencies used in the third configuration. The layout of the considered platforms is 3 × 1, 3 × 2, 3 × 3, and 4 × 4 with 3, 6, 9, and 16 cores, respectively. We compare eight different task sets per platform, and each task set is randomly generated so that the number of tasks in a set is between one and three times the number of processing components. Simulated annealing is used to solve the optimization problem in all benchmarks.
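The relation between the three configurations can be sketched as follows. The per-core minimum frequencies are taken as given inputs here; in our setup they would come from Eq. 28, which this sketch does not implement. The function name `frequency_configurations` and the example values are hypothetical.

```python
F_MAX_HZ = 1.6e9  # maximum operation frequency stated in the text


def frequency_configurations(min_freqs_hz, f_max_hz=F_MAX_HZ):
    """Return the per-core frequencies of the three configurations.

    min_freqs_hz: minimum frequency per core so that no deadline is missed
    (assumed to be precomputed, e.g., by Eq. 28).
    """
    if any(f > f_max_hz for f in min_freqs_hz):
        raise ValueError("a core would need more than the maximum frequency")
    n_cores = len(min_freqs_hz)
    # With a single clock domain, the slowest feasible common frequency is
    # the largest of the individual minimum frequencies.
    common = max(min_freqs_hz)
    return {
        "maximum frequency": [f_max_hz] * n_cores,
        "single clock domain": [common] * n_cores,
        "separate clock domain": list(min_freqs_hz),
    }


# Hypothetical minimum frequencies for a 3 x 1 platform.
example = frequency_configurations([0.4e9, 1.0e9, 0.7e9])
```

For these example values, the single clock domain runs all three cores at 1.0 GHz, whereas separate clock domains allow the first and third core to run slower.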
In Fig. 8, we plot the worst-case chip temperature for the three different frequency configurations and four hardware platforms. The results show that the worst-case chip temperature can be drastically reduced when the processing components are running at their optimal frequency. If each processing component has its own clock domain, the peak temperature is on average reduced by 24.2 K for the 3 × 1 layout, by 17.6 K for the 3 × 2 layout, by 22.5 K for the 3 × 3 layout, and by 22.8 K for the 4 × 4 layout.
Conclusion
In this paper, we presented a fast thermal analysis method to calculate an upper bound on the maximum temperature of a real-time application with non-deterministic workload running on a multi-core system. The considered thermal model is able to address various thermal effects like temperature-dependent leakage power and heat exchange between neighboring cores to accurately model the thermal behavior of multi-core systems. Afterwards, we applied the proposed thermal analysis method to calculate an optimal task assignment that minimizes the worst-case chip temperature and guarantees that all real-time deadlines are met. Finally, we have shown that the worst-case chip temperature can be drastically reduced when each core is running at its optimal operation frequency, i.e., the minimum operation frequency so that no real-time deadlines are missed.
