Temperature plays an increasingly important role in the overall performance and reliability of a computing system. Multi- and many-core systems provide an opportunity to manage the overall temperature profile by cleverly designing the application-to-core mapping and the associated scheduling policies. An uncontrolled temperature profile may lead to an unplanned performance loss, since the system activates protective mechanisms such as voltage and/or frequency scaling to cool itself. Similarly, deep thermal cycles with high frequency lead to severe deterioration in the overall reliability of the system. Design space exploration tools are often used to optimize binding and scheduling choices based on a given set of constraints and objectives, thus motivating the need for fast and accurate temperature estimation techniques. We argue that the currently available techniques are not an ideal fit for design space exploration tools, and suggest a system-level technique which is based on application fingerprinting. It does not need any information about the processor floorplan, the physical and thermal structure, or power consumption. Instead, its temperature estimation is based on a set of application-specific calibration runs and associated temperature measurements using available built-in sensors. We show that a given application possesses a unique thermal signature on the system it executes on, which provides a computationally fast method to calculate accurate temperature traces. Extensive experimental studies show that our technique can estimate temperature on all cores of a system to within 5 °C, and is three orders of magnitude faster than state-of-the-art numerical simulators like Hotspot.
INTRODUCTION
High temperature has become a first-order concern in recent microprocessors due to significantly higher power densities, see [1, 2]. Managing the temperature profile of a system is critical, especially in safety-critical applications, like embedded electronic control units (ECUs) in a modern car. These ECUs are generally mounted in areas where the ambient temperature is already high, such as in the vicinity of the engine. This creates a situation where the thermal effects of applications running on the ECU must be carefully analyzed and controlled to achieve reliable and consistent computational performance, see [3].

CASES'12, October 7-12, 2012, Tampere, Finland.
Different applications stress a given processor core differently, thereby giving rise to different temperature profiles over time. Excessively high temperatures may also lead to "thermal runaway", causing physical destruction of the computing hardware, see [4]. A core experiencing a dangerously high temperature automatically triggers dynamic temperature control techniques like dynamic voltage and frequency scaling (DVFS), or may shut down, leading to an unplanned loss of quality-of-service, see [5, 6, 7]. Such performance disruptions make it hard to provide end-to-end performance guarantees about the system. They can be avoided if one can ensure that the set of applications mapped to the cores of a multiprocessor, along with the scheduling algorithms used on each core, will never lead to a temperature increase beyond the critical values. Even if dynamic power reduction techniques are applied, the availability of a proper thermal analysis methodology will allow for a combined temperature and performance analysis that can be used to explore alternative mapping, scheduling and thermal management mechanisms.
Beyond the analysis and control of the maximal temperature, it is also important to investigate thermal cycles: the magnitude and frequency of temperature variations on a processor influence its reliability, see [8, 9].
Embedded multiprocessor systems generally do not have spare computational power for online computation of task mappings and schedules that meet all temperature and performance metrics. Thus, a potentially large design space needs to be explored offline using temperature-aware exploration tools, to evaluate various mappings and select the one which best suits the optimization criteria at hand.
The overall objective of this work, therefore, is to develop a fast and yet accurate temperature estimation framework, which can be used in design space exploration iterations. The problem of correctly estimating the temperature profile of all cores in a given multiprocessor has traditionally been solved by using two common approaches. One solution is to use a low-level thermal simulator, like Hotspot, which requires detailed knowledge of the floorplan and electrical characteristics such as technology node, rail voltage, materials used, and power consumption of each micro-architectural unit in the processor, just to name a few, see [10]. This information is not easily available, requiring designers to approximate the physical and electrical characteristics of the processor, which may lead to unacceptable inaccuracies in the estimated temperature traces. Numerical simulators also tend to be too slow to be used in an iterative design-space exploration tool.
It is possible to get total power consumption of the processor by using the measuring technique described in [11] . On the other hand, without detailed circuit and hardware implementation information, it is not possible to get an accurate breakdown of the total power amongst the micro-architectural units (power density distribution). Consequently, abstract temperature models have been reported in [12] which attempt to estimate the temperature profile as a function of total power consumption of each core. Such abstract models can be computed quickly, but are relatively inaccurate, since these consider only a few observable parameters to calculate temperature, e.g. the total power consumption of a processor. We argue that it is not possible to build a sufficiently accurate thermal model by using such coarse-grained abstractions.
Motivational Example
The approach used in this work is unique, since it does not abstract away the power density distribution that arises as a given application executes on a core of a multiprocessor. Abstracting this information away may lead to inaccurate temperature estimates, as illustrated in this section. To motivate the discussion, a commonly used hardware architecture is used, which allows us to clearly identify deficiencies of current approaches. Please note that the new techniques described in later sections do not need information about hardware details like floorplan or power-density information.
Correct Temperature Trace Estimation
Consider a four-core chip-multiprocessor (CMP) on which three applications (producer, FFT and consumer) are running, see Figure 1. The producer is in charge of creating data for the FFT application, which in turn supplies the results to the consumer for display. Both producer and consumer are I/O-intensive applications with data caches dominating the total power consumption. On the other hand, FFT is a compute-intensive application and the ALU will dominate the power consumption of the corresponding processor core. All three applications consume the same total amount of power. The left floorplan shows the power consumption on core 1 when producer is running, while the right floorplan shows the power distribution on core 2 when FFT executes. The temperature sensor is located near the upper left corner of the processor; hence, it is more sensitive to the heat generated by the computational units of the processor.
Coarse-grained temperature estimation techniques (see [12, 13]) estimate the temperature T_i(t) of a core i as a function f_i of its total power consumption according to

$$T_i(t) = f_i\big(P_i(t)\big) \qquad (1)$$
where Pi(t) denotes the total power consumption of processor core i in the CMP at time t. The temperature traces on cores 1, 2 and 4 are shown in Figure 2. The temperature simulation was performed on Hotspot using the Alpha 21264 thermal model as supplied with Hotspot. Power numbers were generated from the Wattch/Simplescalar tool chain [14]. Temperature influence from cores to their neighbors has been accounted for. All applications execute in lock-step, i.e., cores 1, 2, and 4 have the same total power trace over time. Communication between cores is implemented using FIFO buffers. In such a scenario, a simple coarse-grained model such as (1) cannot be expected to yield accurate results, since it is oblivious to the power density distribution in the cores. The model will predict the same temperature trace for cores 1, 2 and 4, which may be similar to one of the temperature traces shown in Figure 2, depending on how fi is constructed. We can conclude that ignoring the power density distribution between the micro-architectural units can lead to large errors in the temperature estimates. Knowledge of power distribution information requires detailed circuit-level and physical models, which are rarely available for modern processors.
Computation Time
The temperature trace in the previous example was calculated using Hotspot at a temporal resolution of 1 ms, requiring 6 hours for a relatively short trace of 100 ms. Clearly, numerical simulators, due to their long run time, are not suitable for design space exploration tools, since potentially hundreds of mappings may need to be investigated. Instead, an ideal temperature estimation technique for design space exploration must be reasonably accurate, sufficiently fast, and must not rely on often unavailable information such as detailed power, layout, physical and thermal models.
The Problem
Based on the above discussion, the problem that needs to be solved can be described as follows:
Given a possibly heterogeneous chip multiprocessor system S and a set of applications A: estimate the temperature trace on all cores of S with sufficient speed and accuracy as required for design space exploration. The method should not depend on prior availability of power, layout, physical and thermal models of the hardware platform.
As we will see below, the fingerprinting approach as proposed in this paper replaces the detailed knowledge about platform internals by a limited set of calibration runs where applications are executed on the platform and temperature traces from the internal sensors are recorded.
Our Contribution
It is clear that an ideal temperature estimation framework for design space exploration (DSE) will need to rely on an abstract thermal model, which provides sufficient speed and accuracy such that many mappings of A onto S can be quickly evaluated. In this work, we demonstrate a technique to build a thermal model relying on the results of a limited set of calibration runs including temperature measurements. This model is combined with mapping and scheduling information and results in the desired estimated temperature traces.
Specifically, our contribution is a temperature evaluation framework which:
• Determines a correct temperature trace even when two applications running on the CMP consume the same total power but exercise different micro-architectural units;
• Does not require knowledge of power traces, and does not assume that all cores of the CMP are homogeneous;
• Can model and evaluate temperature effects of various mapping policies in terms of peak temperature and dynamic temperature range, both in space and time;
• Does not depend on prior knowledge of details about the hardware platform;
• Allows for fast and accurate temperature estimation to be reliably used in DSE loops.
RELATED WORK
Temperature estimation has been a recent focus of research, for the reasons discussed above. Overall, the estimation techniques are based either on numerical simulation or on abstract relationships between power and temperature such as (1).
Numerical simulators model the entire multiprocessor as a complex resistor-capacitor network (e.g., Hotspot) and calculate temperature by numerically solving a large set of differential equations. These simulators depend on knowledge of the exact power consumption of each micro-architectural unit within each core. These power numbers may be obtained for certain processors using the Wattch/Simplescalar toolchain. For other processor designs, a generally accepted method has been to use hardware sniffers, which count the number of times a micro-architectural unit has been accessed by an application, see [15]. However, this requires setting up special registers using software or hardware methods that record the accesses of all micro-architectural entities in the processor, which may not always be possible. In addition, unless dedicated hardware is used, the measurement of access counts may disturb the behavior of the profiled application itself, leading to an inaccurate model. The combination of the large computational power required for numerical simulators and the detailed knowledge of the hardware as well as the software makes this approach infeasible for design space exploration. A System-C based thermal simulator has recently been reported, but suffers from the same basic limitation as other simulators: the level of detailed information required for setting up the model is not easily available, see [16].
Computationally fast simulators based on first-order differential equations have been used in [12], but their applicability to modern CMP systems is not clear, since the thermal model is too simplistic to take all important temperature dynamics into account, such as differences in the utilization of a core's micro-architectural units. Li et al. propose an abstraction-based approach, which builds a thermal model based on the total power consumption of the processor, for calculating temperature traces for given applications, see [17]. Li's approach is oblivious to the power density distribution of applications, i.e., two applications consuming the same total power but targeting different micro-architectural units are indistinguishable. Such an abstraction can lead to large errors in the estimated temperature of cores, as already discussed in Section 1.1.
A look-up table based approach involves building resistor-capacitor (RC) models for S, and recording tables of the time-temperature relationship for the set of applications, A, see [18]. The resulting estimation methods are fast, but again, they assume that a unique total power consumption always implies a unique temperature distribution. Separate databases are created for temperature increments and decrements (when an application raises or lowers the temperature of the CMP, respectively). Since temperature changes depend on the current temperature of a core, it is not clear how much data is required to model all possible switching scenarios.
Thus, conventionally available solutions either assume the availability of hard-to-get information (approaches based on numerical simulators) or are too abstract for estimating correct temperature traces (approaches based on abstract power models).
SETUP AND NOTATIONS
In the following sections, we assume an arbitrary chip multiprocessor S consisting of N ∈ ℕ cores Sj ∈ S. It is not necessary that all cores in S are homogeneous. A core may, for instance, be a graphics processor (GPU), a floating point processor (FPU), a RISC processor, etc. Thus, we have a set of processor types, C = {GPU, FPU, RISC, ...}, available on S. A function T : S → C maps the cores in S to their types. Also available is the set of applications A that may execute on S. The i-th application in A is referred to as Ai. The approach taken in this work treats each application Ai ∈ A as a black box to be run on S. Thus, this approach can be used to estimate temperature for an arbitrary given set of applications. All applications bound to a given core may be scheduled according to a scheduling policy such as earliest deadline first (EDF), round-robin (RR), least laxity first (LLF) or rate-monotonic (RM).
P (Ai, Sj) denotes the instantaneous total power consumption of core Sj due to an application Ai. The time trace of this instantaneous total power consumption is written P(Ai, Sj) and is henceforth referred to as the 'power trace'. We suppose that an application consumes constant power as long as it is running. The utilization alphabet, U = {0, 1}, represents the utilization of a core by an application for a time interval of length ts. In other words, if an application is running then its utilization is 1, otherwise 0. The time trace of the utilization of an application Ai, specified with a given time resolution ts, is given by U Ai, which is a tuple whose elements are in {0, 1}. The set of all such tuples is denoted by U*.
If S has a square or a rectangular physical footprint, the location of a core in S can also be specified in Cartesian coordinates <x,y>, with the origin located at the lower left corner of S. In this case, two cores with coordinates <x,y> and <x',y'>, respectively, are said to be k hops apart if k = max{|x − x'|, |y − y'|}.
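The hop distance defined above is the Chebyshev distance between grid coordinates. A minimal sketch follows; the function name `hop_distance` and the tuple representation of a core location are illustrative assumptions, not part of the original framework.

```python
def hop_distance(core_a, core_b):
    """Hop distance between two cores given as (x, y) grid coordinates.

    Implements k = max(|x - x'|, |y - y'|); the (x, y) tuple layout is an
    assumption made for this sketch.
    """
    (xa, ya), (xb, yb) = core_a, core_b
    return max(abs(xa - xb), abs(ya - yb))


# Example: cores at <0,0> and <2,1> are two hops apart.
k = hop_distance((0, 0), (2, 1))
```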
A set of temperature sensors R is available on S. Temperature for core Sj is available from Rj . It is assumed that a reading from Rj represents the temperature for that core. Logging of temperature trace is done only during the construction of the thermal model.
We also define the severity of thermal cycles experienced by the chip multiprocessor. Thermal cycles are periodic changes in temperature experienced by a core Sj when a given subset A' ⊆ A of applications executes on it. Large variations in temperature are said to be worse for hardware reliability than small ones. The hardware is designed to withstand a certain maximum number of such temperature cycles before it fails, see [19]. Therefore, a core Sj has a fixed "thermal-cycle budget", and the entire system's budget is simply the sum total of the thermal-cycle budgets of all cores. Based on this concept, a simple metric that measures the "expenditure" from the total thermal-cycle budget is V:

$$V = \sum_{i=1}^{N} \Delta T_i \times f_i \qquad (2)$$
ΔTi is the maximum temperature variation experienced by a core Si; fi is the frequency of this variation and '×' is the multiplication operator. A mapping with smaller V is preferable.
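As an illustration of the metric, the following sketch computes V under the assumption that it is the sum over all cores of ΔTi × fi, which is how the description above reads. The function name and the example numbers are hypothetical.

```python
def thermal_cycle_expenditure(delta_t, frequency):
    """Sum of (temperature swing x cycle frequency) over all cores.

    delta_t[i]   : maximum temperature variation on core i
    frequency[i] : frequency of that variation on core i
    Assumes V is a plain sum over cores; this form is an assumption.
    """
    if len(delta_t) != len(frequency):
        raise ValueError("one swing and one frequency are needed per core")
    return sum(dt * f for dt, f in zip(delta_t, frequency))


# Two candidate mappings on a two-core system (made-up values).
v_mapping_a = thermal_cycle_expenditure([10.0, 4.0], [2.0, 5.0])
v_mapping_b = thermal_cycle_expenditure([6.0, 6.0], [2.0, 2.0])
preferred = "a" if v_mapping_a < v_mapping_b else "b"
```

Under these made-up numbers the second mapping has the smaller V and would be preferred.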
APPLICATION FINGERPRINTING
This section describes the construction of the thermal model of chip multiprocessor S, given a set of applications, A. We call this technique "application fingerprinting". Application fingerprinting assumes that the thermal model of S is linear. It has been shown that a thermal model of a processor can be constructed by using only passive electrical components, such as resistors and capacitors, which therefore form a linear circuit, see [20, 10, 21].
The overall idea of application fingerprinting is to determine the thermal impulse response, H(Ai, Sp, Sq) ∀ i, p, q, such that the temperature trace T (Ai, Sp, Sq) due to Ai can then be calculated easily:

$$T(A_i, S_p, S_q) = U_{A_i} \otimes H(A_i, S_p, S_q) \qquad (3)$$
where H(Ai, Sp, Sq) is the required thermal impulse response for the application Ai, when it executes on core Sp and the resulting temperature change is calculated for core Sq. The symbol ⊗ is the convolution operator. Notice that power trace is not used in (3). The utilization trace, U Ai, the impulse response, and the calculated temperature trace are all given at the time resolution of ts.
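The convolution of a utilization trace with an impulse response can be sketched in a few lines. This is only an illustration: the function name, the example impulse response and the interpretation of the result as a temperature rise above the idle level are assumptions, and both sequences are taken to be sampled at the same resolution ts.

```python
def temperature_trace(utilization, impulse_response):
    """Discrete convolution T = U (*) H, truncated to the length of U.

    utilization      : sequence of 0/1 samples, one per interval ts
    impulse_response : samples of H at the same resolution ts
    """
    trace = []
    for t in range(len(utilization)):
        acc = 0.0
        # Sum the contributions of all past active intervals.
        for k in range(min(t, len(impulse_response) - 1) + 1):
            acc += impulse_response[k] * utilization[t - k]
        trace.append(acc)
    return trace


# Hypothetical, geometrically decaying impulse response.
h = [0.5 * (0.5 ** k) for k in range(8)]
u = [1, 1, 1, 0, 0, 1, 0, 0]
rise = temperature_trace(u, h)
```

Note that no power value enters the calculation; only the on/off pattern of the application is needed once H is known.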
Presented next are two claims that enable the calculation of accurate temperature traces, without requiring any knowledge of the power density distribution in a core, or its power trace. Data extracted from the Simplescalar/Wattch simulator is used to support the claims being made on the relationships between total power, temperature and power densities due to an application such as Ai.
Non-Unique Relationship between Total Power Trace and Temperature Trace
A power trace associated with an application Ai, executing on a core Sj, does not automatically imply a unique temperature trace on any core in the chip multiprocessor S. The power density distribution in the core determines the net flow of heat between various parts of the core, and hence the overall temperature trace. Multiple applications can have the same total power consumption but different power density distributions, causing a different overall temperature trace. An example was already discussed in section 1.1. Consequently, temperature traces cannot be determined solely as a function of the various power traces observed when the set of applications A executes on the system S. In summary, given a temperature trace T (Ai, Sp, Sq), the application Ai may not be unique. However, the following relationship is deterministic: given that Ai executes on core Sp, its power trace is always P(Ai, Sp).
Unique Relationship between Application and Relative Power Distribution
An application Ai executing on core Sp consumes a constant total instantaneous power, P (Ai, Sp). Furthermore, a given P (Ai, Sp) uniquely determines the relative distribution of this total power amongst the core's micro-architectural units. This unique relationship between an application and its power distribution holds well for typical embedded applications, which are often subjected to similar inputs over their lifetime. We suppose that a particular application, such as a 16-point FFT, will run through the same sequence of steps, irrespective of the inputs. We show that such an assumption is a reasonable abstraction. In case the input to this application varies, this sequence of steps is repeated appropriately. Such repetitions also cause the number of accesses to each of Sp's micro-architectural units to scale appropriately.
This claim was validated using several benchmarks from the MiBench embedded systems benchmark suite, see [22]. The results for selected benchmarks are shown in Table 1. These benchmarks were run with varying inputs, which is reflected in the number of instructions executed (column 5, minimum vs. maximum number of instructions executed). The variation in instantaneous total power consumption of an application executing under varying inputs is minimal. For instance, the maximum variation for the FFT application was only 4%, with inputs ranging from single-digit values to six-digit values. Similar results are obtained for other benchmarks.
In addition, power consumption per instruction for each of these benchmarks was also evaluated after making suitable modifications to the Wattch simulator. The results for the FFT and GSM-Encoder ("toast") applications are shown in Figure 3. The same conclusions apply for other applications. It can be seen from the figure that the mean power consumption in all micro-architectural units in the core remains almost constant even under significant input variations. Almost all of the difference in total power consumption can be attributed to the variation in power consumed by the clock. Statistical parameters such as mode, median and standard deviation are also shown in Figure 3. It can be observed that these statistical parameters also remain relatively constant.
From the preceding discussion, the following conclusions can be drawn:
• The instantaneous total power consumption of Ai executing on core Sp is constant:

$$P(A_i, S_p) = \text{const.} \qquad (4)$$
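• For an application Ai ∈ A executing on core Sp, its utilization trace U Ai also determines its power trace, one differing from the other only by a scalar. Specifically:

$$\mathbf{P}(A_i, S_p) = s(A_i, T(S_p)) \cdot U_{A_i} \qquad (5)$$

where $\mathbf{P}(A_i, S_p)$ is the power trace and s(Ai, T (Sp)) is a scalar depending on Ai and on the type of processor core T (Sp).

• Assume a thermal impulse response H'(Ai, Sp, Sq) determined from the power trace P(Ai, Sp) and the temperature trace T (Ai, Sp, Sq). Also assume another impulse response H(Ai, Sp, Sq), determined from U Ai and T (Ai, Sp, Sq). Requiring that the temperature traces calculated using either impulse response be equal:

$$\mathbf{P}(A_i, S_p) \otimes H'(A_i, S_p, S_q) = U_{A_i} \otimes H(A_i, S_p, S_q) \qquad (6)$$

Therefore, we find:

$$H(A_i, S_p, S_q) = s(A_i, T(S_p)) \cdot H'(A_i, S_p, S_q) \qquad (7)$$

and:

$$T(A_i, S_p, S_q) = \mathbf{P}(A_i, S_p) \otimes H'(A_i, S_p, S_q) \qquad (8)$$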
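The step from (6) to (7) can be spelled out as follows. This is a short sketch that writes $H'$ for the impulse response obtained from the power trace and $s$ for the scalar $s(A_i, T(S_p))$; it relies only on the linearity of convolution and on the requirement that the equality hold for the utilization traces used in calibration.

```latex
\begin{align*}
\mathbf{P}(A_i,S_p) \otimes H' &= U_{A_i} \otimes H
  && \text{equal temperature traces, (6)} \\
\bigl(s\,U_{A_i}\bigr) \otimes H' &= U_{A_i} \otimes H
  && \text{power trace is a scaled utilization trace, (5)} \\
U_{A_i} \otimes \bigl(s\,H'\bigr) &= U_{A_i} \otimes H
  && \text{a scalar factor commutes with convolution} \\
\Rightarrow\quad H &= s\,H'
  && \text{which is (7)}
\end{align*}
```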
In other words, (5), (7) and (8) taken together show that the impulse response H(Ai, Sp, Sq), determined using the temperature trace T (Ai, Sp, Sq) and the utilization trace U Ai, can be used to estimate correct temperature traces without requiring any power trace information.

Estimating Impulse Responses

Figure 4 shows an overview of the application fingerprinting technique. The technique takes the set of applications A and the system S as inputs for the estimation of the impulse responses H(Ai, Sp, Sq) ∀ i, p, q. Each application Ai ∈ A is run individually on Sp using a known utilization trace, U Ai. As the application Ai executes, temperature traces from all cores of S are recorded. The estimation of impulse responses is based on the Generalized Pencil-of-functions (GPOF) technique, see [23]. Our estimation approach is summarized in Algorithm 1. All impulse responses are collected in a three-dimensional matrix, H.
Notice that the procedure for estimation of impulse responses is different from Li's approach, see [17]. Li requires power trace information, whereas we require only utilization traces and the corresponding temperature traces from S. Impulse responses are estimated for each application Ai ∈ A, making it possible to account for different power density distributions, even when multiple applications consume the same total power. As a consequence, it is possible to avoid incorrect calculations of temperature traces, as discussed in section 1.1. Note that since impulse responses are calculated using the utilization trace of an application and the corresponding temperature trace, the scalar s(Ai, T (Sp)) is automatically accounted for. Once all impulse responses are available, the correct temperature trace can be calculated for any candidate mapping consisting of applications from A executing on S. Subsequent temperature trace calculations require knowledge of only the utilization trace U Ai of the core on which the application Ai executes. Note that power traces are not required for the temperature estimation step.

Algorithm 1: Estimation of impulse responses
for each application Ai ∈ A do
    for each core Sp ∈ S do
        Run Ai on core Sp
        H(Ai, Sp, Sq) = GPOF(U Ai, T (Ai, Sp, Sq)), ∀q   // s(Ai, T (Sp)) is automatically accounted for
    end
end
Procedure: GPOF(UtilizationTrace, TemperatureTrace)
    // Calculates the impulse response from a utilization trace and a temperature trace based on the generalized pencil-of-functions algorithm.
    // Returns the estimated impulse response.
end
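To make the calibration step concrete, the sketch below recovers an impulse response from a utilization trace and a temperature trace by direct time-domain deconvolution. This is deliberately not the GPOF method: it is a much simpler stand-in that assumes noise-free measurements, equal-length traces and a utilization trace whose first sample is 1. All names and numbers are hypothetical.

```python
def estimate_impulse_response(utilization, temperature):
    """Solve temperature = utilization (*) h for h by forward substitution.

    Stand-in for GPOF: exact only for noise-free data and utilization[0] != 0.
    """
    if len(utilization) != len(temperature):
        raise ValueError("traces must have the same length")
    if not utilization or utilization[0] == 0:
        raise ValueError("utilization trace must start with a non-zero sample")
    h = []
    for t in range(len(temperature)):
        # Contribution of the already-recovered samples h[0..t-1].
        known = sum(utilization[k] * h[t - k] for k in range(1, t + 1))
        h.append((temperature[t] - known) / utilization[0])
    return h


# Calibration run with a made-up impulse response and utilization pattern.
u = [1, 1, 0, 1, 0, 0]
h_true = [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625]
temp = [sum(h_true[k] * u[t - k] for k in range(t + 1)) for t in range(len(u))]
h_est = estimate_impulse_response(u, temp)
```

With real sensor data, a fitted low-order model (as GPOF provides) is preferable, since direct deconvolution amplifies measurement noise.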
TEMPERATURE AWARE DESIGN SPACE EXPLORATION
DSE tools are already available which, given a set of applications and a set of abstract hardware properties (number of cores, type of cores, etc., but not the detailed floorplan), calculate various mappings subject to a set of constraints and objectives, see [24]. However, such tools are usually not temperature aware. Once all required impulse responses are available, temperature-aware design space exploration can be performed using applications from the set A on the system S.
A temperature aware DSE loop is shown in Figure 5 . The DSE tool accepts the following parameters:
1. Abstract architectural properties: available computing resources, their types, etc.;
2. Set of mapping constraints and objectives;
3. Set of applications, A;
4. Evaluated temperature traces for a given mapping, M.
A mapping M provides the following information:
• Binding for all applications in A, i.e., each application in A is provided with a core to execute on;
• Scheduling policy for each core, such as EDF, RR, or least laxity first (LLF).

A candidate mapping generated by the DSE tool is evaluated using the temperature evaluation component. The detailed algorithm for calculation of temperature traces is presented in the next section. Based on the feedback of the temperature evaluation component, the DSE tool may modify its internal parameters to rule out combinations that lead to unacceptable temperature profiles on S. Alternatively, the DSE tool may successively refine mappings that are deemed to be favorable in terms of temperature. A simple approach to successive refinement of mappings is simulated annealing, which successively transforms an initial mapping M0 by moving applications between cores and recalculating temperature traces at each step. The DSE tool also evaluates different scheduling policies for each core. The process continues until the required objectives are met (e.g., minimizing peak temperature or minimizing thermal cycles).
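A refinement loop of this kind can be sketched as follows. This is a generic simulated-annealing skeleton under stated assumptions, not the DSE tool of [24]: the cost function `peak_proxy` is a hypothetical stand-in for the temperature evaluation component, and the cooling schedule, the step count and the `HEAT` values are arbitrary.

```python
import math
import random


def anneal_mapping(num_apps, num_cores, cost, steps=2000, t0=1.0, seed=0):
    """Simulated annealing over bindings; mapping[i] is the core of application i."""
    rng = random.Random(seed)
    current = [rng.randrange(num_cores) for _ in range(num_apps)]
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling schedule
        candidate = list(current)
        # Move one randomly chosen application to a randomly chosen core.
        candidate[rng.randrange(num_apps)] = rng.randrange(num_cores)
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with decaying probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


# Hypothetical per-application heating contribution (arbitrary units).
HEAT = [5.0, 3.0, 3.0, 1.0]


def peak_proxy(mapping, num_cores=4):
    """Stand-in for temperature evaluation: the hottest core's accumulated load."""
    load = [0.0] * num_cores
    for app, core in enumerate(mapping):
        load[core] += HEAT[app]
    return max(load)


best_mapping, best_peak = anneal_mapping(len(HEAT), 4, peak_proxy)
```

In an actual flow, `peak_proxy` would be replaced by the fingerprint-based temperature trace calculation, so that each candidate mapping is scored on its estimated peak temperature or thermal-cycle metric.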
Temperature Trace Calculation from a given Mapping
The linearity property of the thermal model of S allows us to use the superposition principle for determining the overall temperature trace due to a given mapping, M. Algorithm 2 presents the procedure to calculate detailed temperature traces.

[Algorithm 2: calculation of temperature traces for a given mapping M]
Temperature trace calculations start with a candidate mapping provided by the DSE. For a given core Sp, its scheduling policy determines the utilization trace for each application bound to Sp. The Cheddar project provides for automating the construction of such utilization traces, see [25] .
Referring to Algorithm 2, lines 2-8 define the required variables. Line 11 initializes a loop to iterate over all applications to be run on S. Line 12 iterates over each core in S, calculating the temperature trace due to Ai, on all cores of S (Line 13). The algorithm loops till all applications have been accounted for, and the overall temperature trace on each core due to mapping M is calculated by superposition.
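The superposition step can be sketched as follows. This is an illustrative reading of the loop structure described above, not a transcription of Algorithm 2; the dictionary layout, the function names and the example numbers are assumptions, and all utilization traces are assumed to have the same length.

```python
def convolve(u, h):
    """Discrete convolution of u with h, truncated to len(u) samples."""
    return [sum(h[k] * u[t - k] for k in range(min(t, len(h) - 1) + 1))
            for t in range(len(u))]


def core_temperatures(mapping, utilization, impulse, num_cores):
    """Superpose the contribution of every application on every core.

    mapping     : application name -> index p of the core it runs on
    utilization : application name -> 0/1 utilization trace
    impulse     : (application, p, q) -> impulse response seen on core q
    """
    length = len(next(iter(utilization.values())))
    total = [[0.0] * length for _ in range(num_cores)]
    for app, p in mapping.items():          # loop over applications
        for q in range(num_cores):          # loop over observed cores
            contribution = convolve(utilization[app], impulse[(app, p, q)])
            for t in range(length):
                total[q][t] += contribution[t]
    return total


# Two applications on a two-core system with made-up impulse responses.
mapping = {"fft": 0, "producer": 1}
utilization = {"fft": [1, 0], "producer": [0, 1]}
impulse = {
    ("fft", 0, 0): [2.0, 1.0], ("fft", 0, 1): [0.5, 0.25],
    ("producer", 1, 0): [0.2, 0.1], ("producer", 1, 1): [1.0, 0.5],
}
traces = core_temperatures(mapping, utilization, impulse, 2)
```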
Sources of Inaccuracies

Inexact Impulse Responses
Estimation of the impulse response from a given utilization trace and an associated temperature trace measurement is often an approximate process. Further, the order of the thermal model of S is limited to avoid dealing with overly complex impulse responses, thereby saving some computational effort. In this work, the accuracy of an estimated impulse response H(Ai, Sp, Sq) is specified as Quality of Fit (QoF):

$$\mathrm{QoF} = 100 \times \left(1 - \frac{\lVert T(A_i,S_p,S_q)_m - T(A_i,S_p,S_q)_e \rVert}{\lVert T(A_i,S_p,S_q)_m - \overline{T(A_i,S_p,S_q)_m} \rVert}\right) \qquad (9)$$

Where:

T (Ai, Sp, Sq)m is the measured temperature trace due to a known utilization trace, U Ai. The mean value of the measured temperature trace is given by $\overline{T(A_i,S_p,S_q)_m}$. The temperature trace T (Ai, Sp, Sq)e is calculated using the estimated impulse response H(Ai, Sp, Sq).
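A quality-of-fit figure of this kind can be computed as sketched below. The sketch assumes the common normalized-error form, 100 × (1 − ‖Tm − Te‖ / ‖Tm − mean(Tm)‖), built from the three quantities named above; this exact form is an assumption, and the function name is hypothetical.

```python
import math


def quality_of_fit(measured, estimated):
    """Percentage fit between a measured and an estimated temperature trace.

    Assumed form: 100 * (1 - ||Tm - Te|| / ||Tm - mean(Tm)||), Euclidean norms.
    """
    if len(measured) != len(estimated) or not measured:
        raise ValueError("traces must be non-empty and of equal length")
    mean = sum(measured) / len(measured)
    error = math.sqrt(sum((m - e) ** 2 for m, e in zip(measured, estimated)))
    spread = math.sqrt(sum((m - mean) ** 2 for m in measured))
    if spread == 0.0:
        raise ValueError("measured trace is constant; fit is undefined")
    return 100.0 * (1.0 - error / spread)


perfect = quality_of_fit([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = quality_of_fit([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Under this form, a perfect estimate scores 100 and an estimate that only reproduces the mean of the measurement scores 0.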
[Algorithm 3: worst-case error E* at the observer core O c,c when cores beyond k hops are ignored. Recoverable steps: H k = {C x,y | (|y − c| = k) || (|x − c| = k)}; E* = E* + FV(H(A*, H i j, O c,c)).]
Under-estimating the impact of a hot core on a distant neighbor
Due to high lateral thermal resistance, the temperature on a core drops rapidly with distance from the temperature hotspot, see [26]. If, from section 4.3, it is determined that the lateral thermal resistance of S is very high, it becomes tempting to ignore the temperature effects of an active core on cores far away from itself. This reduces the effort for the computation of temperature traces, but at the cost of accuracy. In this case, the maximum possible error that can be incurred in temperature calculations must be determined. Assuming that we would like to ignore the temperature effects of an active core beyond distances of k hops, the resulting worst-case error in temperature estimates is calculated using Algorithm 3.
Referring to Algorithm 3, the worst-case error is estimated at an 'observer core', O c,c, at the center of S. Since the temperature effect of an active core reduces with the hop distance from itself, a centrally located core will have the maximum number of 1-hop neighbors, 2-hop neighbors, and so on. Further, uniform cooling over S is assumed, ensuring that the worst-case error in the temperature estimate is not missed. The Final Value theorem for transfer functions is used to determine the maximum possible temperature influence of an active core on its neighbors. Line 12 determines the steady-state temperature due to application Ai running on a core Sp. Any core Sp within S may be chosen. Line 14 determines the application A* which leads to the highest steady-state temperature on core Sp. Line 17 then calculates the error by accumulating the influence of all cores beyond k hops from O c,c.
The algorithm assumes that all cores lying more than k hops from O_{c,c} are running A*. The values of E* with respect to hop distance are shown in Figure 6. These values are specific to our experimental platform, but the nature of the curve is expected to remain the same for any chip-multiprocessor platform. The results clearly show the risk of making unquantified simplifications about the impact of a hot core on its neighbors.
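The accumulation described above can be sketched as follows. This is an illustrative sketch, not the paper's Algorithm 3. It assumes a square n x n grid, a hop distance equal to the Chebyshev distance from the observer core, and a hypothetical callable `gain_at_hop` that returns the steady-state temperature rise at the observer when one core at that distance runs A*. The helper `dc_gain` applies the Final Value Theorem to a stable discrete-time transfer function given by hypothetical coefficient lists `b` and `a`.

```python
def dc_gain(b, a):
    """Steady-state value of the unit-step response of H(z) = B(z)/A(z).

    By the Final Value Theorem this equals H(1) = sum(b) / sum(a)
    for a stable discrete-time system.
    """
    return sum(b) / sum(a)


def worst_case_error(n, k, gain_at_hop):
    """Accumulated steady-state influence of all cores beyond k hops.

    n           -- grid dimension (n x n cores), an assumption of this sketch
    k           -- hop distance beyond which cores are ignored
    gain_at_hop -- hypothetical callable: hop distance -> steady-state
                   temperature rise at the observer due to one core running A*
    """
    c = n // 2  # observer core O_{c,c} at (or next to) the center
    error = 0.0
    for x in range(n):
        for y in range(n):
            hop = max(abs(x - c), abs(y - c))  # assumed Chebyshev hop distance
            if hop > k:
                error += gain_at_hop(hop)
    return error
```

With a per-hop gain derived from the calibrated impulse responses, calling `worst_case_error` for increasing k would give a curve of the kind described for Figure 6.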
EXPERIMENTS AND RESULTS
Our approach was validated using Hotspot. The specification of S was taken from the Magma project, which provides a variety of multicore floorplans ranging from 2-core to 64-core layouts, see [27]. Each core is an appropriately scaled version of the Alpha 21264 processor. Knowledge of the physical arrangement of cores on the chip multiprocessor is required only if the user intends to apply the approximations discussed in section 5.2.2. Such approximations were not made in our experiments. Although our technique does not require that all cores of S be homogeneous, the floorplans available in the Magma project consist of only homogeneous cores, and thus we report results for a system with homogeneous cores. Furthermore, no power traces were used, either in the estimation of impulse responses or in the calculation of any temperature traces. The scalability of our technique is demonstrated using a floorplan consisting of an 8x8 arrangement of cores. The following sections describe results relating to the QoF of estimated impulse responses, the accuracy of estimated temperature traces, and the speedup of our model compared to Hotspot. The time resolution ts is 1 ms.
Accuracy of Estimated Impulse Responses
The order of the thermal model was limited to 10, at which the QoF achieved was greater than 90%. Further gains in QoF from increasing the model order were insignificant (< 0.1%). The net effect of thermal resistance and thermal capacitance becomes increasingly complex as the hop distance between two given cores grows, resulting in a drop in QoF. However, due to high lateral thermal resistance, the absolute error in temperature estimates remains small (within 5 °C). The results are summarized in Table 2. Only the worst QoF per hop is reported.
Speedup
A total of sixty mappings were evaluated, using applications from Table 2. The scheduling policy used on each core was varied between EDF, LFF, RM and RR. For each mapping, temperature traces were calculated using both our model and Hotspot. The average calculation time using our model was ∼24.9 seconds, while Hotspot averaged ∼6 hours, see Table 3. Further speedup is expected upon porting our algorithms from the current Matlab/Java environment to C/C++.
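As a rough check, the ratio of the two approximate averages quoted above can be worked out directly. The result is only as precise as those averages.

```latex
\[
\text{speedup} \approx \frac{6\,\text{h} \times 3600\,\text{s/h}}{24.9\,\text{s}}
  = \frac{21600}{24.9} \approx 867 \approx 10^{2.9}
\]
```

This is consistent with a speedup of roughly three orders of magnitude.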
Accuracy of Estimation
We consider a mapping in which all applications are bound to cores located in a close spatial neighborhood. The temperature of each active core is thus significantly influenced by all other active cores. Further, all active cores in this mapping are scheduled according to a round-robin policy (1 ms quantum) where possible, leading to a significant number of context switches and rapid variations in temperature over time. Such a mapping provides a good test of the accuracy of temperature traces estimated using our technique.
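To illustrate why such a schedule stresses the model, the following sketch builds per-application utilization traces for a single core under round-robin scheduling. It is an illustration only: the function name `round_robin_traces`, the one-shot job model, and integer millisecond demands are all assumptions of this sketch, with one trace sample per millisecond to match the 1 ms time resolution.

```python
from collections import deque


def round_robin_traces(demands_ms, quantum_ms=1):
    """Return a 0/1 utilization trace per application, one sample per ms.

    demands_ms -- dict: application name -> execution demand in whole ms
                  (hypothetical one-shot jobs, not periodic tasks)
    quantum_ms -- round-robin time slice in ms
    """
    remaining = dict(demands_ms)
    queue = deque(app for app, d in remaining.items() if d > 0)
    traces = {app: [] for app in remaining}
    while queue:
        app = queue.popleft()
        run = min(quantum_ms, remaining[app])
        # The running application is marked 1, all others 0, for `run` ms.
        for other in traces:
            traces[other].extend([1 if other == app else 0] * run)
        remaining[app] -= run
        if remaining[app] > 0:
            queue.append(app)  # not finished: back to the end of the queue
    return traces
```

With a 1 ms quantum the active application changes every sample while more than one job is pending, which is what produces the rapid temperature variations mentioned above.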
The mapping is shown in Figure 9. Taking core <5,5> as the center, applications are mapped onto it and its immediate 1-hop neighborhood, totaling 9 heat-generating cores. All other cores are idle. The results are shown in Figure 8. It is clear that the mapping in Figure 9 led to large changes in temperature on almost all active cores. Our thermal model was able to calculate correct temperature traces for all cores, well within the accuracy goal set in the introductory section of this paper, see Figure 10.
As discussed in section 1.1, the applications producer and FFT consume the same total power but differ in their power density distributions. Both applications were mapped onto core <4,4>, see Figure 9. Our model accurately captured the effect of this difference, leading to the trace in Figure 8, which shows temperature variations as each application executes. Figure 7 shows a section of the temperature trace for core <4,4> from Figure 8 in more detail.
Other mappings, in which active cores are not immediate neighbors, were also evaluated, and the prediction error was lower than that reported in Figure 10. This is because an active core was not significantly influenced by its neighbors. Under such circumstances, estimation errors due to relatively low QoF were limited by high thermal resistance, see Table 2.
The speedup gained by our approach allowed us to experiment with many different mappings using the design space exploration loop. For instance, it was possible to reduce the thermal cycles experienced by S by changing the binding and scheduling policies of a few applications, as shown in Figure 12, resulting in the temperature traces shown in Figure 11. Notice that the total work done by each application remains unchanged. For instance, LAME executes for a total of 6 ms with a period of 15 ms in both mappings. The overall error in estimated temperature is similar to the result shown in Figure 10, with the maximum error being 4.7 °C. It is not always possible to reduce thermal cycles by changing the bindings of applications, because a feasible scheduling policy for all cores may not exist.
VARIATIONS AND OPTIMIZATIONS
It is not necessary to estimate impulse responses for all cores if the given system has thermal symmetry. In this case, all cores are first classified into a set of thermally different locations (TDLs), see [18]. During the calibration step, applications need to be executed on only one distinguished core in each TDL. The calculation of the thermal effect of an active core Sp on core Sq then proceeds in two steps. First, a sequence of transformations is determined which translates Sp to a core in one of the TDLs. Next, the same sequence of transformations is applied to core Sq, preserving the relative location of cores Sp and Sq. Temperature trace calculation then proceeds normally. Thermal symmetry reduces both the memory required to store H and the one-time effort required for its calculation. However, if large errors in temperature estimates are to be avoided, the computational effort may not change much, see section 5.2.2.
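The two-step lookup can be sketched as below. This is a minimal illustration, not the classification procedure of [18]. It assumes a square n x n grid of homogeneous cores with uniform cooling, so that all eight reflections and rotations of the grid are thermal symmetries. The function names and the choice of the lexicographically smallest image as the distinguished core are assumptions of this sketch.

```python
def grid_symmetries(n):
    """The 8 reflections/rotations of an n x n grid as maps on (x, y)."""
    m = n - 1
    return [
        lambda x, y: (x, y),
        lambda x, y: (m - x, y),
        lambda x, y: (x, m - y),
        lambda x, y: (m - x, m - y),
        lambda x, y: (y, x),
        lambda x, y: (m - y, x),
        lambda x, y: (y, m - x),
        lambda x, y: (m - y, m - x),
    ]


def tdl_representative(core, n):
    """Distinguished core of the TDL containing `core` (smallest image)."""
    x, y = core
    return min(t(x, y) for t in grid_symmetries(n))


def canonical_pair(sp, sq, n):
    """Map (Sp, Sq) to an equivalent pair whose first core is distinguished.

    The same transformation is applied to both cores, so their relative
    location is preserved. Ties are broken by the image of Sq.
    """
    return min((t(*sp), t(*sq)) for t in grid_symmetries(n))
```

Under these assumptions an 8x8 grid has 10 distinguished cores, so impulse responses would need to be stored for 10 source cores instead of 64.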
The effect of caches on temperature was factored in during the impulse response estimation step. All caches were "clean" before executing an application to collect the associated temperature traces. Thus, even when multiple applications execute on the same core, the impulse responses already account for the effect of inevitable cache misses. Consequently, our approach is thermally safe, since temperature traces estimated using it will be slightly higher than actual measurements.
CONCLUSIONS
This paper presented a new calibration-based approach for estimating accurate temperature traces. A compact thermal model was built using a small set of mappings and the associated temperature trace measurements. The speed and accuracy of our approach enable exploration of a large set of mappings using the design space exploration loop. The highlight of our approach is that it does not require any power-trace information or hard-to-obtain details about the hardware, such as the detailed floorplan. Our technique can also account for differences in power densities on a core due to an application, even when the total power consumed by two or more applications is the same. This makes our technique applicable to any given set of embedded applications and hardware.
APPENDIX A. ACKNOWLEDGMENTS
This work was supported by EU FP7 project EURETILE under grant number 247846.
