Abstract: Current multicore platforms contain different types of cores, organized in clusters (e.g., ARM's big.LITTLE). These platforms deal with concurrently executing applications, having varying workload profiles and performance requirements. Runtime management is imperative for adapting to such performance requirements and workload variabilities and to increase energy and temperature efficiency. Temperature has also become a critical parameter since it affects reliability, power consumption, and performance and, hence, must be managed. This paper proposes an accurate temperature prediction scheme coupled with a runtime energy management approach to proactively avoid exceeding temperature thresholds while maintaining performance targets. Experiments show up to 20% energy savings while keeping high-temperature averages and peaks below the threshold. Compared with state-of-the-art temperature predictors, the proposed predictor is 35% faster and reduces the mean absolute error from 3.25 °C to 1.15 °C for the evaluated application scenarios.
to increase energy and temperature efficiency [1]. Moreover, management becomes challenging when applications are multithreaded and heterogeneity of the processing cores needs to be exploited [i.e., identifying the most appropriate cluster(s) for each application]. The existing RTM approaches exploit cores situated in different clusters simultaneously (referred to as intercluster exploitation) and dynamic voltage and frequency scaling (DVFS) potential of cores [1]-[3]. However, these approaches lack in providing an accurate temperature estimator. We postulate that such exploitation may help to satisfy performance requirements while simultaneously achieving energy savings and avoiding thermal peaks.
System-on-Chip (SoC) thermal management has become a critical subject. The effects of temperature range from transient faults to long-term defects of the chip. To mitigate such effects, thermal hotspots, thermal gradients, and thermal cycling need to be well managed. Thermal hotspots are high temperatures at particular spatial locations on the chip. CPU or cache units are the usual hotspots on a chip die. Thermal hotspots induce failures, such as electromigration, stress migration, and dielectric breakdown. Thermal gradients are the spatial variations of temperature across the die. As the die size of multicore processors gets larger, thermal gradients increase the interconnect delays and, consequently, induce larger clock skews [4]. Communication between the cores is thus negatively affected. Thermal cycling represents repeated temporal variations of temperature in the die and reduces lifetime reliability [5].
To mitigate the thermal hotspots, gradient, and cycling issues, appropriate thermal management actions are needed to improve the performance of the system while reducing the power consumption and protecting the chip from damage. Most thermal management techniques focus on short-term performance, limiting the influence of temperature on the system performance. Performance should, however, be kept at an acceptable level for the running applications. The aforementioned thermal issues are faced using different techniques. Clock gating, a reactive dynamic power and temperature management technique, reduces the power consumption and, thus, the temperature of the chip by shutting down parts of the circuit when the chip temperature is too high. DVFS can be used as a proactive or reactive technique, which adapts the frequency to obtain the desired performance, limiting the power consumption as well as the temperature. Reactive DVFS reduces the frequency whenever the die temperature rises above a certain defined threshold. A task migration process moves tasks from hot cores to cool cores to avoid high-temperature peaks or thermal gradients across the die.
In this paper, we propose an RTM approach coupled with an accurate temperature prediction scheme to comply with energy-performance requirements while keeping the temperature below a threshold. In this way, we focus on the long-term reliability while avoiding thermal hotspots and thermal cycling. We combine a power management algorithm with a temperature prediction algorithm developed for heterogeneous architectures. The accurate temperature predictor helps to avoid high-temperature averages and peaks. To address the aforementioned challenges, this paper makes the following major contributions:
1) an accurate temperature prediction algorithm that can work for any frequency setting of the system; 2) a runtime manager that proactively controls the frequency setting to keep the temperature below a configurable threshold value.
II. EXPERIMENTAL DESIGN
This paper evaluates the existing and proposed methods on the heterogeneous Multiprocessor System-on-Chip (MPSoC) platform Odroid-XU3, built around the Samsung Exynos 5422 SoC [6]. It contains four ARM Cortex-A15 (big) CPUs, four ARM Cortex-A7 (LITTLE) CPUs, and six ARM Mali-T628 GPU cores. Such an architecture provides the opportunity to exploit different designs, such as low-power processing (LITTLE cores) and high-performance processing (big cores). The platform contains five temperature sensors, enabling management decisions based on the current thermal state of the chip: one for the GPU and one for each of the four big cores.
The MPSoC provides DVFS at a per-cluster granularity. For the Cortex-A15 cluster, the frequency can be varied between 200 and 2000 MHz with a 100-MHz step, whereas for the Cortex-A7 cluster, it can be varied between 200 and 1400 MHz with a 100-MHz step. The frequency of the GPU cluster can be set at 177, 266, 350, 420, 480, 543, and 600 MHz. It should be noted that we vary only frequency, but the firmware automatically adjusts the voltage based on the preset pairs of voltage-frequency values. The SoC also has 2-GB LPDDR3 RAM, operating at 933 MHz. The memory system is not considered in the design space exploration while building the temperature predictor, as it has only two levels of DVFS [7] , 400 and 800 MHz, which would severely affect the performance of the evaluated applications.
The Odroid-XU3 board allows the hardware measurement of power consumption. It contains four real-time current/voltage sensors for four separate power domains: big (A15) CPU cores, LITTLE (A7) CPU cores, GPU cores, and DRAM. A power measurement circuit estimates the power as the product of voltage and current, i.e., P = I · V . The energy consumption is measured as the product of average power consumption and execution time. Since the power is considered for all the domains, the energy consumption of all the software components (e.g., proposed predictor, OS, runtime manager, and applications) executing within the chip is included.
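The energy computation described above can be illustrated with a minimal Python sketch. It assumes that each power domain delivers a list of (voltage, current) samples; the sample values, the domain names used as dictionary keys, and the function names are hypothetical and are not taken from the platform's software.

```python
from statistics import mean

def domain_power(samples):
    """Average power of one domain, taking P = I * V for every sample."""
    return mean(volt * amp for volt, amp in samples)

def total_energy(domains, execution_time_s):
    """Energy in joules: sum of the average domain powers times the execution time."""
    return sum(domain_power(s) for s in domains.values()) * execution_time_s

# Hypothetical (voltage in V, current in A) samples for the four power domains.
readings = {
    "big": [(1.0, 2.0), (1.0, 3.0)],
    "little": [(1.0, 0.5), (1.0, 0.5)],
    "gpu": [(1.0, 0.25), (1.0, 0.25)],
    "dram": [(1.0, 0.25), (1.0, 0.25)],
}
energy_j = total_energy(readings, execution_time_s=10.0)
```

Because all four domains are summed, the result covers every software component running on the chip, which matches the accounting described in the text.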
III. MOTIVATIONAL EXAMPLE
Current state-of-the-art RTM approaches dynamically change voltage and frequency (DVFS) to prevent power consumption or temperature from surpassing a given requirement. Some of these proposals throttle the frequency [8] while trying to comply with these requirements. RTM normally takes into account only performance and energy. Basireddy et al. [9] propose an RTM approach that first selects thread-to-core mapping based on performance requirements and resource availability. Then, it classifies the workload using the metric memory reads per instruction (MRPI). Finally, it decides the appropriate V-f pair for the predicted workload. This approach does not take into account the temperature, leading to temperature violations and performance losses due to frequency throttling by the Linux kernel. In this paper, the approach in [9] is used as a case study of a state-of-the-art RTM approach, which we extend with the temperature predictor to keep the temperature from going beyond the threshold. Fig. 1 presents the frequency and temperature measurements for the big core when the Linux Ondemand governor is controlling the execution of the blackscholes application from the PARSEC benchmark [10]. The application was run by allocating all cores (big and LITTLE) available on the Odroid-XU3. It is shown that, multiple times after 100 s, Linux needs to reduce the frequency after the temperature has risen above 95 °C. These variations in the frequency may lead to thermal cycling. Prakash et al. [11] evaluate the thermal behavior of mobile gaming devices. They show that the GPU frequency is reduced but is not coordinated with the CPU frequency adjustment. The net effect is that the temperature continues to rise even after throttling CPU frequency due to thermal inertia. This phenomenon occurs because the temperature of a device is influenced by its current frequency and its past frequency values.
Even RTM that provides better energy-performance tradeoffs presents the same behavior. Fig. 2 presents the same scenario but with a state-of-the-art RTM [8] controlling the system. This RTM does not take into account the temperature, which also leads to Linux overriding operation to throttle the frequency when the temperature exceeds the 95 °C threshold.
This throttling could be avoided using less abrupt changes to current frequency changing algorithms. One possible solution is shown in Fig. 3 where the performance requirements are met, but there is less changing of frequency and no throttling. This can be achieved by employing proactive management where a temperature predictor can be used to set the frequencies while staying below the thermal threshold. With less frequency throttling, we aim to also reduce the average temperature of the chip and increase performance as well.
IV. STATE OF THE ART

A. Temperature Prediction
First, we introduce the reactive and the proactive approaches used for temperature management. Reactive methods focus on reducing the die temperatures based on the current temperatures. Most of those techniques involve shutting down or slowing down cores after the die temperature rises above a defined threshold. The time between two temperature checks is usually short to avoid temperature exceeding the limit. Some examples of reactive approaches are implemented in the Linux kernel. If the CPU temperature goes above the threshold, Linux will throttle the frequency, as demonstrated in Section III, largely impacting the application performance. Proactive methods usually involve the prediction of future die/core temperatures to adjust the workloads or frequencies before exceeding a defined threshold. The computation of the predicted temperature increases the performance overhead of proactive methods when comparing to reactive ones. However, they run less frequently than the reactive methods.
Coskun et al. [12] proposed an example of proactive methods using an autoregressive moving average (ARMA) model to predict future temperatures. This ARMA model is one example of regression models applied to temperature prediction. Ljung [13] extended the ARMA model with exogenous inputs (ARMAX model). The exogenous inputs are, in this case, the average power trend of the running applications. Ge et al. [14] claimed that their neural network approach to predict future temperature performs, on average, 79% better than the ARMAX model while limiting the maximum prediction error to 2.5 °C. This maximum prediction error is, however, difficult to compare with the ARMA model, as this model takes some time to adapt to the changing environments (ambient temperature, running applications, and so on).
Prakash et al. [11] estimated the temperature of the CPU and GPU separately for cooperative CPU-GPU thermal management on chip. Their estimator uses the actual temperature sensors of both the CPU and GPU as well as the cores utilization to set the frequency setting for the next time interval. Singla et al. [15] presented a predictor using power sensors to predict the next power consumption based on the following frequency setting. Their technique uses a leakage power model of the ARM big.LITTLE architecture on Odroid-XU3 to test its predictor and dynamic power and frequency management technique. An extension of this paper has also been published in [16] .
Peters et al. [17] proposed a power management strategy for mobile games. The approach saves 1.9% of energy compared with the Android default governor for the evaluated scenarios in the Odroid XU3 board. The work uses the frame rate as a metric to evaluate the workload predictors and apply a thread-to-core mapping. The power management is employed to minimize the operating frequency while keeping the frames per second constraint.
Two works [18], [19] also proposed power-temperature analysis for many-processor and multiprocessor systems. Pagani et al. [18] presented the power budget concept, called thermal safe power (TSP), which is an abstraction that provides power constraints as a function of the number of simultaneously active cores. Executing cores at any power consumption below TSP ensures that thermal management actions are not triggered. The authors show simulations of platform models with 72 heterogeneous cores, which provide offline and online TSP computations for a particular mapping of active cores and ambient temperature. The simulations make it possible to obtain safe power and power density constraints for the worst cases, allowing system designers to estimate mapping decisions and the amount of dark silicon.
Bhat et al. [19] presented a power-temperature stability and safety analysis technique. The approach is based on a formula to compute the stable fixed point and thermally safe power consumption at runtime. Hardware measurements on an XU3 board with Android OS can predict the stable fixed point with an average error of 2.6%.
B. Runtime Management
RTM represents an essential paradigm in tackling these challenges by enabling optimization and tradeoffs between computational quality, application throughput, system reliability, and energy efficiency. An increasing number of RTM algorithms are being employed to control and optimize the execution of applications on heterogeneous embedded systems. Mainly online optimization has been considered to cater for dynamic workload scenarios to optimize energy consumption while respecting the timing constraint. For online optimization, either all the processing is performed at runtime or else the optimization is supported by offline characterization.
For performing all the processing at runtime, several works have been reported [20]-[25]. In [20], the online algorithm utilizes hardware performance monitoring counters to achieve energy savings without recompiling the applications. Singleton et al. [21] presented an accurate runtime prediction of execution time and a corresponding DVFS technique based on memory resource utilization. A similar approach, which is a hardware-specific implementation of the stall-based model, is proposed in [22]. In [23], an adaptive DVFS approach for field-programmable gate array-based video motion compensation engines using runtime measurements of the underlying hardware is introduced. In [25], online reinforcement learning-based adaptive DVFS is performed to achieve energy savings. These approaches perform well for unknown applications to be executed at runtime but lead to inefficient results as optimization decisions need to be taken quickly and offline analysis results are not used. Furthermore, they are agnostic of concurrent workload variations and thus fail to adapt for concurrently executing multiple applications.
Recently, there has been focus on online optimization facilitated by offline analysis results [26] - [32] . Such approaches lead to better performance results than only online optimizations as they take advantage from both offline and online computations. In [26] , thread-to-core mapping and DVFS are performed based on power constraint. Similarly, in [27] , first thread-to-core mapping is obtained based on utilization, and then, DVFS is applied depending upon the surplus power. However, the approaches of [26] and [27] target homogeneous multicore architectures and thus cannot be applied to heterogeneous ones.
The state of the art shows some implementations of runtime managers using temperature predictors [15] or the current chip temperature [11]. These works show good improvements in energy and/or temperature efficiency. All these temperature predictors were evaluated for a training set (explained in detail in Section V); however, the approaches lack accuracy, showing high average prediction errors (up to 5 °C or 4%), which may be improved by an accurate temperature predictor. In addition, a few implementations [16] are implemented as kernel modules rather than as a standalone library, which impacts portability, e.g., when using the predictor in another runtime manager.
V. PROPOSED METHODOLOGY
An overview of the proposed methodology for temperature prediction and its integration into the runtime manager is shown in Fig. 4 . First, a training data set to classify the best temperature prediction regression model is created offline. This training set is composed of measurements of the system when executing applications from the PARSEC benchmark on both the big and LITTLE clusters of the chip. During this step, we log the temperature, frequency, and power consumption for the memory and big cluster. The offline data are collected at a rate of 1 Hz and later used for 1-Hz temperature prediction at runtime, predicting the temperature over the next second of execution. The frequency was changed randomly every 500 ms to evaluate all the operating points of the platform. In this way, we focus more on the platform behavior rather than the application behavior.
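As an illustration of the random frequency sweep used to build the training set, the following Python sketch draws a new big-cluster frequency every 500 ms. The frequency list follows the 200-2000 MHz range with 100-MHz steps stated in Section II; the function name, the seed, and the 60-s duration are hypothetical choices made for this example only.

```python
import random

# Cortex-A15 operating points: 200 to 2000 MHz in 100-MHz steps.
BIG_FREQS_MHZ = list(range(200, 2001, 100))

def random_frequency_schedule(duration_s, step_s=0.5, seed=None):
    """Return (time in s, frequency in MHz) pairs, one random pick per step."""
    rng = random.Random(seed)
    steps = int(duration_s / step_s)
    return [(k * step_s, rng.choice(BIG_FREQS_MHZ)) for k in range(steps)]

# Hypothetical 60-s sweep; temperature and power would be logged at 1 Hz alongside it.
schedule = random_frequency_schedule(duration_s=60, seed=0)
```

Drawing the frequency uniformly at random exercises all operating points regardless of the running application, which is the stated reason for focusing on platform behavior rather than application behavior.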
When applying this approach, we noticed that leaving the fan disabled limits the testing ranges and capabilities. In particular, we were only able to use lower frequency levels of the big cluster, as higher frequencies lead to reaching the temperature threshold quickly. By default, the Linux kernel usually starts the fan when the temperature rises above the 65 °C temperature threshold. Leaving the fan with this setting adds an undesired nonlinearity. Therefore, the fan is kept turned on to make sure that the predictor is not exposed to any additional nonlinear behavior, which would increase the temperature prediction error. We expect that the same methodology could be applied to the system with the fan always off, but it would limit the system to operate only with lower frequencies. In this case, the training set would generate different regression coefficients, explained in Section V-A.
This training set is then compared with the different regression model outputs, which is explained in the next sections. The regression model that provides the least error on predicting temperature is then used to feed the runtime manager. Finally, the runtime manager can take into account the next interval when setting the frequency and mapping of the tasks. One advantage of the predictor is that it is totally decoupled from the RTM and thus can be used with other RTMs.
A. Temperature Predictor
This section outlines the proposed temperature predictor. We first list and describe the assumptions that are followed throughout this paper.
Assumption 1: The LITTLE cluster does not influence the global temperature. Since there are no temperature sensors available specifically targeting the LITTLE cluster, the only measurement available is the cluster's power consumption. To measure the impact of the LITTLE cluster on the big cluster temperature, we executed the PARSEC benchmark on the four LITTLE cores only and then measured the temperature on the big cluster. The maximum temperature achieved on the big cluster was 42 °C. When executing the same benchmark on the big cluster, the minimum temperature on the big cluster is 42 °C, while the maximum temperature on the big cluster is 95 °C. Also, the frequency setting of the LITTLE cluster will not be modified by the temperature management algorithm developed in Section V-B.
Assumption 2: The GPU is not used by the running applications. The GPU frequency is set to the lowest possible frequency, and its power consumption is constant during the system operation. To take into account the GPU management, applications should be written with environments such as OpenCL or OpenGL to enable the design space exploration. Also, the RTM should be able to deal with load balancing between the CPU and the GPU to target energy/performance or temperature tradeoffs.
Considering the above-mentioned assumptions, the following equation outlines the regression model of the proposed temperature predictor:
The predictor uses the two past temperature measurements [$T(t-1)$ and $T(t-2)$] as well as the future power consumption estimation for the big cores ($\hat{P}^{big}_{t}$) and the memory ($P^{mem}_{t}$). Considering that the memory operates at a constant frequency, the value of $P^{mem}_{t-1}$ is the last value measured from the on-chip power sensor. The value of $\hat{P}^{big}_{t}$ is estimated using the following equation [15]:
where $\alpha$ and $C$ are the activity factor and switching capacitance, respectively, $f$ is the operating frequency, $V$ is the voltage, and $I^{big}_{leakage}$ corresponds to the leakage current, computed using the following equation:
Parameters $\alpha$, $\beta$, $\gamma$, $\theta$, $\beta_1$, and $\beta_2$ represent the regression coefficients. The chip was first placed in an oven to retrieve measurements of the power sensors with temperature $T \in [40, 90]$ and a constant frequency $f$. During that process, $\alpha$ is kept approximately constant using a constant workload. The leakage power model was computed using simulated annealing in MATLAB. Simulated annealing found good values for $\beta_1$ and $\beta_2$ with a starting temperature of 1000, a cooling rate of $10^{-6}$, and five rounds per temperature. Fig. 5 shows the power estimation of each cluster with the $I_{gate}$ and $\beta_i$ coefficients computed using simulated annealing.
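To make the roles of these quantities concrete, the following Python sketch shows one way they could fit together. It is an illustration under stated assumptions, not the exact model of this paper: it assumes a linear regression $\hat{T}(t) = \alpha T(t-1) + \beta T(t-2) + \gamma \hat{P}^{big}_{t} + \theta P^{mem}_{t-1}$, the standard power expression $\alpha C V^2 f + V I^{big}_{leakage}$, and a leakage current of the form $\beta_1 T^2 e^{\beta_2 / T} + I_{gate}$ with $T$ in kelvin. All function names and numeric values are hypothetical.

```python
import math

def leakage_current(temp_k, beta1, beta2, i_gate):
    """Assumed leakage form: beta1 * T^2 * exp(beta2 / T) + I_gate (T in kelvin)."""
    return beta1 * temp_k ** 2 * math.exp(beta2 / temp_k) + i_gate

def big_power(activity_cap, freq_hz, volt, i_leak):
    """Dynamic power (activity factor * C * V^2 * f) plus leakage power (V * I_leak)."""
    return activity_cap * volt ** 2 * freq_hz + volt * i_leak

def predict_temperature(t_prev1, t_prev2, p_big_next, p_mem_last, coeffs):
    """Assumed linear regression on two past temperatures and two power terms."""
    alpha, beta, gamma, theta = coeffs
    return alpha * t_prev1 + beta * t_prev2 + gamma * p_big_next + theta * p_mem_last

# Hypothetical values, chosen only to exercise the functions.
i_leak = leakage_current(temp_k=330.0, beta1=1e-6, beta2=-500.0, i_gate=0.05)
p_big = big_power(activity_cap=1e-9, freq_hz=1.0e9, volt=1.0, i_leak=i_leak)
t_next = predict_temperature(60.0, 58.0, p_big, 0.5, coeffs=(1.2, -0.3, 1.5, 2.0))
```

Because the power estimate depends on the candidate frequency and voltage, the same function can be evaluated for any frequency setting before that setting is applied, which is the property the runtime manager relies on.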
One way of improving the results of the predictor is to use the previous prediction errors to estimate the future values. The ARMA model uses this approach and should, therefore, improve the quality of the prediction. The following equation shows the estimated prediction error [$\hat{\eta}(t)$] model:
where $\mu$ represents the mean, $\eta$ represents the actual error, $\hat{\eta}$ represents the predicted error, and $\beta_1$ and $\beta_2$ are the previously calculated values. Fig. 6 shows a comparison between the predicted and actual measured temperatures for different applications and temperature ranges. Also, it shows the evolution of prediction error over time.
The estimator developed estimates the future temperature with a low average error at runtime but leaves different errors based on the operating frequency, e.g., the average error is different if the big cluster is executing at 1.7 or 2 GHz.
Error Correction Algorithm: The error difference between the frequencies arises because the temperature estimator is built to behave identically across the whole frequency set. To solve this issue, we propose an error correction algorithm that uses different error coefficients for each possible frequency setting of the big cluster. Algorithm 1 computes the associated error after each temperature prediction iteration. Algorithm 1 uses an error_correction table to store the error for each frequency. The error is calculated as the difference between the temperature prediction and the actual temperature measurement (line 3) and then stored in the corresponding position of the error_correction table (line 4). This error is then used to predict the next interval temperature (line 6), taking into account the last error for that frequency. The same temperature prediction model could be applied for the error correction, but this would lead to a longer execution time and more memory to store the values for each of the frequencies. Therefore, we trade off accuracy for a lower execution time and, thus, a lower processing overhead.
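The per-frequency error table can be sketched in Python as follows. This is a minimal illustration of the idea described above rather than a transcription of Algorithm 1: the class and method names are hypothetical, and the sketch assumes that the stored error is the uncorrected prediction minus the measurement and that it is subtracted from the next prediction made at the same frequency.

```python
class FrequencyErrorCorrector:
    """Keeps the last prediction error for each big-cluster frequency."""

    def __init__(self, frequencies_mhz):
        # One error entry per frequency setting, initially zero.
        self.error_correction = {f: 0.0 for f in frequencies_mhz}
        self._pending = None  # (frequency, uncorrected prediction) awaiting a measurement

    def update(self, measured_temp):
        """Store the error of the last prediction once the real temperature is known."""
        if self._pending is not None:
            freq, predicted = self._pending
            self.error_correction[freq] = predicted - measured_temp
            self._pending = None

    def correct(self, freq, raw_prediction):
        """Apply the last error observed at this frequency to a new prediction."""
        self._pending = (freq, raw_prediction)
        return raw_prediction - self.error_correction[freq]

corrector = FrequencyErrorCorrector([1700, 2000])
first = corrector.correct(2000, 80.0)   # no error recorded yet for 2000 MHz
corrector.update(78.0)                  # the prediction was 2 degrees too high
second = corrector.correct(2000, 81.0)  # corrected with the stored error
```

Storing a single value per frequency keeps the memory footprint and the update cost small, which reflects the accuracy-for-overhead tradeoff stated in the text.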
B. Dynamic Temperature Management
Dynamic temperature management (DTM) is used to limit the chip temperature based on the temperature estimator developed in the previous section. This section describes the developed DTM algorithm. The temperature is managed proactively by the proposed DTM, while the Linux kernel uses a reactive method by default, with the lowest temperature threshold being 95 °C. Reactive temperature management needs to run more frequently than the proactive methods to avoid high-temperature violations, since it acts only after the temperature reaches a given value. This may result in performance losses, as cores are shut down to reach lower temperatures.
The goal of a DTM algorithm on heterogeneous architectures is to determine a maximum frequency setting for each cluster separately to avoid temperature violations. This algorithm is analyzed based on its error rate, mean absolute error (MAE), and performance losses or gains. The DTM algorithm may also reduce the power consumption. For example, the DTM developed in [15] is not only used for temperature purposes but also to limit the power consumption of the cluster.
The developed DTM algorithm determines the maximum frequency based on the temperature estimator from the previous section. It applies DVFS based on the maximum frequency for each cluster during a certain time interval. Algorithm 2 outlines the developed DTM. The DTM algorithm predicts the temperature for the highest frequency and reduces this frequency until the prediction stays below the defined threshold. This keeps the temperature below the threshold while preserving the highest performance that the threshold allows.
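The frequency search described above can be sketched in Python as follows. The function name and the toy predictor are hypothetical; in practice, the predictor would be the regression model of Section V-A evaluated for each candidate frequency.

```python
def dtm_max_frequency(frequencies_mhz, predict_temp, threshold_c):
    """Return the highest frequency whose predicted temperature is below the threshold."""
    for freq in sorted(frequencies_mhz, reverse=True):
        if predict_temp(freq) < threshold_c:
            return freq
    # No setting is predicted to be safe: fall back to the lowest frequency.
    return min(frequencies_mhz)

# Toy predictor, for illustration only: temperature grows linearly with frequency.
def toy_predictor(freq_mhz):
    return 40.0 + 0.03 * freq_mhz

max_big_freq = dtm_max_frequency(range(200, 2001, 100), toy_predictor, threshold_c=90.0)
```

With this toy predictor, 1700 MHz is predicted to reach 91 °C, so the search settles on 1600 MHz. Starting from the highest frequency means the first safe setting found is also the fastest one.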
C. Predictive Dynamic Thermal and Power Management
The developed predictive dynamic thermal and power management (PDTPM) for heterogeneous mobile platforms uses [8], a dynamic power management algorithm for heterogeneous architectures that applies DVFS to the different clusters. This method takes advantage of the frequency of the memory read and write instructions to adapt the CPU frequency settings and consequently reduce the energy consumption. The approach combines application mapping and DVFS to reduce the energy consumption.
It starts by applying a thread-to-core mapping of the different applications depending on their memory intensiveness. It then applies DVFS to reduce the energy consumption. The proposed PDTPM algorithm (Algorithm 3) combines the DTM algorithm (Section V-B) with the power management based on MRPI.
The DVFS algorithm is executed 10 times every second, while the DTM should predict the temperature and choose the maximum frequency settings every second. The interval_count variable has been introduced for this purpose (in this case, DTM_INTERVAL should be initialized with 9). The DTM algorithm only runs when this variable is equal to 0 (line 4). It then resets this variable to its maximum value, which is equal to the number of times the DPM algorithm should run before another temperature prediction is made (see the DTM_interval constant). The function DTM_big() predicts ... Finally, the algorithm (lines 15-19) sets the computed frequencies to both clusters, updates the interval counter, and sleeps until the next interval.
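The interplay between the 10-Hz DVFS step and the 1-Hz temperature prediction can be sketched in Python as follows. This is a simplified illustration, not a transcription of Algorithm 3: the callbacks dtm_big, dpm_frequency, and set_frequency are hypothetical stand-ins for the temperature manager, the MRPI-based power manager, and the platform frequency interface, and the sleep between iterations is omitted so that the example runs instantly.

```python
DTM_INTERVAL = 9  # DPM iterations between two temperature predictions

def pdtpm_steps(n_steps, dtm_big, dpm_frequency, set_frequency):
    """Run n_steps DVFS iterations, refreshing the frequency cap every tenth one."""
    interval_count = 0
    max_freq = None
    for _ in range(n_steps):
        if interval_count == 0:
            max_freq = dtm_big()           # temperature-aware frequency cap
            interval_count = DTM_INTERVAL  # reset the counter
        else:
            interval_count -= 1
        # The power manager proposes a frequency; the DTM cap bounds it.
        set_frequency(min(dpm_frequency(), max_freq))
        # A real manager would sleep for about 100 ms here.

dtm_calls = []
applied = []
pdtpm_steps(
    20,
    dtm_big=lambda: dtm_calls.append(1) or 1600,  # records the call, returns a cap
    dpm_frequency=lambda: 2000,
    set_frequency=applied.append,
)
```

In this run, the cap is computed twice over 20 iterations, and every applied frequency is limited to the 1600-MHz cap even though the power manager requests 2000 MHz.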
VI. RESULTS
This section is divided into two parts. First, we evaluate and compare the proposed regression model with the state of the art and show the impact of the improvements made on this model. Then, we present the results of the RTM using the regression to predict the temperature. The temperature measurements have been collected at room temperature.
Validation of the proposed temperature predictor and methodology is done on the Odroid-XU3 (see Section II for more details of the experimental setup). PARSEC [10] and SPLASH [33] applications are used to compare the results of the proposed PDTPM algorithm with different approaches. The chosen mapping for the validation and comparison is taken from a state-of-the-art approach [8]. Table I lists the applications used for the validation. These applications are run with the new PDTPM approach, and the results are compared with a series of Mapping-DPM tools of the Linux kernel and the Inter-cluster Thread-to-core Mapping and DVFS (ITMD). The ITMD approach proposes a mapping of tasks and uses the MRPI metric to execute DVFS. The Linux kernel uses the heterogeneous multiprocessing (HMP) [34] scheduler to map the tasks on the different clusters. The temperature threshold of the PDTPM approach is set to 90 °C, while the Linux reactive temperature limit is 95 °C. This margin is required to prevent temperature peaks from exceeding 95 °C, the limit of the Linux reactive control. The list of considered approaches is shown in Table II.
A. Regression Model Evaluation
Table III outlines the results on the training set for the regression with and without the error prediction. MAE represents the mean absolute error, AE_max represents the maximum absolute error, and AE_std represents the standard deviation. The Akaike information criterion (AIC) is an estimator of the relative quality of a set of statistical models for a given set of data, i.e., AIC estimates the quality of each model, relative to each of the other models. A model with a lower AIC provides a better estimator. Thus, AIC provides a means for model selection. The MAE on the training data set is 1.21 °C without the error prediction and drops to 1.13 °C when using it. Table IV shows the error generated when the frequency error correction algorithm is used in combination with the proposed temperature predictor. Table IV also outlines the reduction of the MAE when using the error correction algorithm. The average error drops by more than 1 °C, from 2.48 °C to 1.19 °C, while the dependence of the error on the frequency drops by 0.3 °C. This algorithm is not only useful to reduce the error difference between the frequencies, but it also lowers the MAE by approximately 50% on average.
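For reference, the metrics named above can be computed as in the following Python sketch. It assumes that the standard deviation is taken over the absolute errors and uses the least-squares form of the AIC, $n \ln(\mathrm{RSS}/n) + 2k$, which is valid up to an additive constant for Gaussian errors; the exact definitions behind Tables III and IV may differ. The function name and the sample values are hypothetical.

```python
import math
from statistics import mean, pstdev

def error_metrics(predicted, measured, n_params):
    """MAE, maximum and standard deviation of the absolute error, and AIC."""
    errors = [p - m for p, m in zip(predicted, measured)]
    abs_errors = [abs(e) for e in errors]
    n = len(errors)
    rss = sum(e * e for e in errors)  # residual sum of squares, must be > 0
    return {
        "mae": mean(abs_errors),
        "ae_max": max(abs_errors),
        "ae_std": pstdev(abs_errors),
        "aic": n * math.log(rss / n) + 2 * n_params,
    }

# Hypothetical predictions and measurements in degrees Celsius.
metrics = error_metrics([61.0, 59.0, 62.0, 60.0], [60.0, 60.0, 60.0, 60.0], n_params=4)
```

The AIC penalizes each additional coefficient by 2, so a model with more terms is preferred only when it reduces the residual error enough to offset that penalty.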
The prediction error is analyzed by running a series of multithreaded PARSEC applications on the big and the LITTLE cluster together for each of the different temperature predictors described in Section V-B. The final temperature prediction model developed performs better for each of the different temperature thresholds of the DTM. ... predictor, while the blue lines show the standard deviation for each predictor. It is important for the temperature predictor to maintain similar results and error for different ranges of temperature. The proposed predictor gives better error averages (53% better than the version without the error correction and 64% better than a state-of-the-art temperature predictor [15] with a temperature threshold of 90 °C) while keeping the error standard deviation within the same range. The error standard deviation for other temperature thresholds might grow further alongside higher thresholds. This is because error correction introduces an instability when the workload is changing. This error correction algorithm needs time to adapt to the changing environments. Fig. 8 shows a comparison of the proposed temperature predictor with the approach in [16] for a set of applications. The proposed predictor shows less error for most of the applications, and all the errors are lower than 2%.
B. Runtime Manager Evaluation
Table V lists the different application scenarios with their respective core allocations among the big and/or LITTLE cluster. These scenarios are then launched separately on the Odroid-XU3. The performance and energy consumption are measured and then compared. Fig. 9 shows an overview of the energy and execution time savings obtained by the proposed approach compared with the approaches detailed in Table V. It computes the improvements of each scenario for all approaches and then presents the average improvement for each approach. Fig. 9(a)-(c) shows the results for scenarios with single, double, and triple applications, respectively. Later, Figs. 11-13 show the details for each of the applications compared with the Linux performance governor [heterogeneous multiprocessing performance (HMPP)].
1) Energy: Fig. 9(a) shows single-application scenarios compared with the Linux HMP and governors. On average, the proposed PDTPM consumes 5%-10% less energy than the Linux governors while increasing the execution time by 5%-10%. Therefore, it shows a simple tradeoff rather than real improvements. It is interesting to note that, for single-application scenarios, the ITMD approach consumes slightly more energy than the HMP+conservative Linux approach (less than 2% on average). This means that most of the PDTPM energy savings for single-application scenarios are due to the temperature management algorithm, which improves energy savings in any of the single-application cases. The results also show that the performance governor is affected, since maintaining the highest frequency leads to more temperature threshold violations, and thus frequency throttling is more likely to occur than with the conservative governor, for example. Fig. 9(b) shows the double-application scenarios. The proposed approach reduces the energy consumption by more than 10% on average compared with any other Linux HMP-governor association considered.
The proposal also saves 14% of energy compared with the Linux conservative governor (the low-energy governor) and more than 17% compared with the Linux performance governor. Part of this is due to the mapping proposed by the ITMD approach and its MRPI-based DVFS management tool. However, more than 10% of the energy savings are due to PDTPM and, especially, to the temperature management algorithm added to the original ITMD.
Triple-application scenarios show that PDTPM further improves the energy savings over the Linux HMP-governor associations [see Fig. 9(c)]. It reaches average energy savings of 28% over the different governors and more than 25% over the conservative governor, the one focusing on low energy consumption. These improvements are partly due to the added temperature management algorithm. The PDTPM raises the energy savings by 10% on average compared with the power management in ITMD alone, while the other 15% is due to the mapping and power management.
The energy savings that the temperature management achieves on the different application scenarios evidently increase with the number of applications running at the same time on the cores. This is due to the high workload and, consequently, the high temperatures induced on the cores. The temperature management part of the PDTPM limits the frequency to even lower settings than for single-application scenarios. This results in a reduction of the energy consumption.
Unlike the existing approaches, the proposed approach is aware of concurrent execution; therefore, in multiapplication scenarios, there is more room for optimization in terms of choosing the thread-to-core mapping and compensating for contention. Moreover, the temperature threshold violations by other approaches become more prominent when multiple applications are executed concurrently, which leads to frequent scaling down of the frequency. This has helped our proactive thermal manager (PDTPM) to improve performance by not aggressively scaling down the frequency. These two factors lead to improved performance in the two- and three-application scenarios compared with the single-application scenario. Fig. 10 shows the results of the average power for the single-, double-, and triple-application scenarios. It shows that the proposed approach reduces the power consumption when compared with the Linux governors and ITMD.
The mapping and PDTPM energy savings can be separated into three parts. The application mapping onto the cores tries to limit the energy consumption by mapping memory-intensive applications onto the LITTLE cores, sometimes trading performance for energy savings when performance requirements are still met. The MRPI-based power management limits the frequency and, consequently, the energy consumption by adapting the frequency to the memory intensiveness of the applications. The temperature management limits the frequency to avoid exceeding a given temperature threshold. This increases the energy savings as analyzed earlier.
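The third part, limiting the frequency so that a temperature threshold is not exceeded, can be sketched as follows. This is only an illustration under stated assumptions and not the algorithm of this paper: the selection rule (highest frequency whose predicted temperature stays below the threshold), the function name `pick_frequency`, the frequency list, and the linear `toy_predictor` are all hypothetical.

```python
def pick_frequency(freqs_mhz, predict_temp_c, threshold_c):
    """Return the highest frequency whose predicted temperature is below
    the threshold; fall back to the lowest frequency if none qualifies."""
    for freq in sorted(freqs_mhz, reverse=True):
        if predict_temp_c(freq) < threshold_c:
            return freq
    return min(freqs_mhz)


def toy_predictor(freq_mhz):
    """Hypothetical linear predictor: 60 deg C at 1000 MHz, +2 deg C per 100 MHz."""
    return 60.0 + 0.02 * (freq_mhz - 1000)


# Hypothetical frequency settings (assumption), in MHz.
available = [1200, 1600, 2000]
chosen = pick_frequency(available, toy_predictor, threshold_c=75.0)
```

In this toy setting, 2000 MHz is predicted to reach 80 °C and is rejected, so 1600 MHz (predicted 72 °C) is selected. A real manager would presumably combine this cap with the frequency chosen by the MRPI-based power management.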
2) Performance: Now, we compare the PDTPM approach with the Linux performance governor for each scenario, running one, two, and three applications concurrently using the mapping of Table V. Single-application scenarios show that the PDTPM improves energy consumption at the cost of performance. The improvements for single-application scenarios are limited. For single-application scenarios, the PDTPM runs faster than the HMP-performance except for the PARSEC application Freqmine [10], as shown in Fig. 11. The energy savings for Freqmine are considerable, at the cost of lower performance.
In double-application scenarios, the PDTPM improves the energy savings compared with the Linux HMP and governors, while the performance overheads of the proposed PDTPM are limited. The PDTPM achieves better performance for most of the double-application scenarios compared with the different Linux HMP-governor associations. The addition of temperature management in the proposed PDTPM improves the performance by almost 10%. Fig. 12 shows that performance is usually better for the double-application scenarios compared with HMPP, the Linux governor built for the highest performance.
Triple-application scenarios follow the results given by the double-application scenarios. Fig. 13 shows that performance and energy improve for all scenarios compared with the Linux governor. In the worst case, the energy savings are 10% compared with the Linux governor.
3) Thermal Cycling: Managing the temperature variations is also important to avoid the reduction of lifetime reliability. A DTM algorithm may induce more temperature variation because frequency scaling may change the frequency and, thus, the temperature at every time interval. This section compares the thermal cycling rates of the PDTPM for the different predictors and Linux governors. Fig. 14 shows the average of the temperature variations for the presented scenarios. Each bar represents the average temperature variation within 1 s. First, we measure how much the temperature decreases or increases compared with the previous second and then calculate the average for the sample. Fig. 14 shows that the thermal cycling of the proposed model is almost equivalent to that of [15] for every temperature threshold. The difference in temperature between two measurements almost doubles at 90 °C.
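The per-second variation metric described above can be sketched as follows. The sketch assumes that increases and decreases are both counted by their magnitude (so that they do not cancel out in the average) and that the samples are taken 1 s apart; the function name and the sample trace are hypothetical.

```python
def mean_temperature_variation(samples_c):
    """Average absolute temperature change between consecutive samples.

    Assumes samples_c holds temperatures in deg C taken 1 s apart.
    """
    if len(samples_c) < 2:
        raise ValueError("need at least two samples")
    deltas = [abs(cur - prev) for prev, cur in zip(samples_c, samples_c[1:])]
    return sum(deltas) / len(deltas)


# Illustrative trace only (assumption), not data from the experiments.
trace = [80.0, 82.0, 81.0, 84.0]
variation = mean_temperature_variation(trace)
```

For this trace, the consecutive changes are 2, 1, and 3 °C, which gives an average variation of 2 °C per second.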
The PDTPM also increases thermal cycling compared with the Linux performance CPU governor [36]. Table VI shows that the difference between two temperature checks of the Linux performance governor is half the thermal cycling of the PDTPM. The Linux reactive thermal management model reduces thermal cycling. This is due to the smaller intervals between reactive measurements and frequency setting adjustments.
4) Overheads:
As outlined in the previous sections, the proposed PDTPM saves energy by different means, but it has an overhead for predicting the temperature for the next interval. This section evaluates the overhead of the temperature predictor only, not taking into account the power management and DVFS. We measure the overheads caused by the temperature prediction, comprising lines 4-7 of Algorithm 3, during the execution of all the scenarios outlined earlier. The average overhead is 836.5 μs with a standard deviation of 48.5 μs. However, 70% of this overhead is spent reading the temperature sensor and the power of the memory and big cluster. Therefore, the temperature prediction algorithm itself takes only around 250 μs on average. Since the temperature prediction is executed only every 10th time the DVFS algorithm is executed, the overhead is minimal. To put these results in perspective, a model predictive control-based policy proposed in [37] takes more than 4 ms, and it is applied every 10 ms. The best comparison is the proposal in [16], which takes 390 μs to predict the temperature and to determine the frequency levels. This means that the proposed approach is 35% faster than the one in [16].
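The overhead figures above can be checked with a few lines of arithmetic. The input values are those quoted in the text; the amortized value per DVFS invocation is an illustrative derivation that assumes the full measured overhead is paid once every 10 invocations.

```python
# Values quoted in the text.
total_overhead_us = 836.5   # average measured overhead of the prediction step
sensor_share = 0.70         # fraction spent reading temperature and power sensors
reference_us = 390.0        # time reported for the proposal in [16]
prediction_period = 10      # prediction runs every 10th DVFS invocation

# Time left for the prediction algorithm itself (about 250 us).
prediction_us = total_overhead_us * (1.0 - sensor_share)

# Relative speedup over [16], using the rounded 250 us figure from the text.
speedup_fraction = (reference_us - 250.0) / reference_us

# Assumption: the whole overhead is amortized over 10 DVFS invocations.
amortized_us = total_overhead_us / prediction_period
```

This gives about 251 μs for the prediction algorithm, a reduction of roughly 36% relative to the 390 μs in [16] (reported as 35% in the text), and an amortized cost of about 84 μs per DVFS invocation.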
VII. CONCLUSION
This paper first demonstrates that an accurate temperature predictor helps to limit the high-temperature averages and peaks. The proposed temperature model combines [8] and a regression model to further reduce the theoretical MAE to 1.13 °C. The addition of an error correction algorithm that uses different error predictions for each frequency setting further improves the accuracy at runtime. The overall result is a faster and more accurate temperature prediction when compared with the latest state-of-the-art proposals [15], [16]. The second contribution is the temperature estimator developed to build a DTM algorithm. The DTM shows that temperature management alone already reduces the energy consumption by limiting the frequency. The third contribution combines a state-of-the-art power manager and the DTM algorithm. This combination gives good results for two- and three-application scenarios, improving the energy savings by up to 20% compared with the power manager alone while limiting the performance overhead. Finally, the accurate predictor is decoupled from the runtime manager and may be easily included in a different approach.
The proposed prediction can be extended to estimate the temperature of the GPU since the Odroid-XU3 board provides temperature and current/voltage sensors for the GPU. Therefore, we could also evaluate the division of workloads between the CPU clusters and the GPU, taking into account the energy/performance and temperature tradeoffs. The proposed approach could also be applied to other single-ISA heterogeneous multicore platforms. This would require: 1) power monitors or an accurate power model and 2) temperature monitors.
