Abstract-With CMOS scaling beyond 14 nm, reliability is a major concern for IC manufacturers. Reliability-aware design has a non-negligible overhead and cannot account for user experience in mobile devices. An alternative is dynamic reliability management (DRM), which counteracts degradation by adapting the operating conditions at runtime. In this paper, for the first time, we formulate DRM as an optimization problem that accounts for reliability, temperature, and performance. We develop an optimal policy for multicores using convex optimization, and show that it is not feasible to implement on real systems. For this reason, we propose workload-aware reliability management (WARM), a fast DRM technique that adapts to diverse workload requirements to trade off reliability and user experience. WARM is implemented and tested on a real Android device. WARM approximates the solution of the convex solver within 5% on average, while executing more than 400× faster. WARM integrates a thermal controller that allocates tasks to meet thermal constraints. This is required since degradation strongly depends on temperature. We show that WARM meets temperature constraints within 5% in 87.5% more cases than the state-of-the-art. We show that WARM task allocation achieves up to one year of lifetime improvement for a multicore platform. It can achieve up to 100% performance improvement on cluster architectures, such as big.LITTLE, while still guaranteeing the reliability target. Finally, we show that it achieves performance within 4% of the maximum for a broad range of applications, while meeting the reliability constraints.
As the technology scales, the impact of mechanisms such as time dependent dielectric breakdown (TDDB), negative bias temperature instability (NBTI), and hot carrier injection (HCI) becomes dramatic [8] . Degradation worsens under voltage and temperature stress and it is influenced by environmental conditions, such as ambient temperature, and workload variations.
Degradation mechanisms are described by the mean time to failure (MTTF) of devices [21], [39]. A common practice is to assume that all the devices of the same kind have the same MTTF. However, with scaling, the mean lifetime of processors becomes shorter and the distribution of lifetimes becomes wider, so it is less and less accurate to assume the same MTTF. The result is the production of devices whose lifetime is more difficult to predict [7]. This has an impact on warranty costs and on the trust and reputation of companies. Design techniques cannot completely address these problems, due to workload variability, frequent user interactions and changing environmental conditions. Imposing high design margins results in a loss of performance and higher costs.
The last decade has seen a great development of mobile systems. Modern smartphones and tablets support graphics, wifi communication, Web browsing and multimedia, thanks to powerful systems-on-chips (see [1] , [22] ). They run a great variety of workloads with different performance requirements and they are subject to variable voltage/frequency stress [42] . Since processor degradation depends on temperature and voltage, a runtime control is needed to correctly manage the reliability of a device over time [15] . Since many degradation mechanisms depend exponentially on temperature, thermal control is a key requirement. Reliability of processors can be estimated by monitoring voltage and temperature, and providing these values to a mathematical model. Alternatively, recent work presents sensors for monitoring degradation [34] , and embeds them in prototypes [35] , [43] .
Dynamic reliability management (DRM) is a technique to trade off degradation and performance at runtime, to meet a target lifetime. Reliability is defined at any point in time as a real number between 0 and 1 corresponding to the probability of not having failures. Usually, a reliability threshold is defined a priori. If the estimated reliability at the target lifetime is greater than the threshold, the lifetime constraint is met. Reliability can be modeled as a function of technological parameters, voltage, temperature and time [48]. In DRM, reliability is periodically estimated and per-core operating conditions are controlled to limit the sources of degradation (i.e., temperature and voltage) [26], [39], [49].
0278-0070 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
A recent DRM technique uses arrays of local controllers that independently set the voltage of the associated core [26]. The novelty of this approach is to discriminate between highly critical (H) and less critical (L) tasks, depending on their impact on user experience. Each local controller works on two time scales: 1) long intervals (LIs) and 2) short intervals (SIs). The long term controller (LTC) monitors the degradation status and calculates average voltage constraints that guarantee the lifetime requirement. The short term controller (STC) changes the voltage and frequency frequently so that: 1) the average voltage over an LI is below the constraint and 2) when high performance is required, the voltage can be boosted for a limited time. This approach only observes temperature, but does not control it, leading to suboptimal decisions. Moreover, the approach only exploits voltage and frequency scaling, but does not leverage task allocation/migration.
All modern operating systems like Linux, Windows and iOS have dedicated components for CPU power management, to target energy efficiency and high performance. Linux uses governors to control operating conditions. Governors are kernel modules, part of the cpufreq driver, which is interfaced with hardware regulators enabling voltage and frequency switching [30]. Android, the most popular operating system for mobile devices, is based on the Linux kernel and can select between different governors, targeting goals such as providing maximum performance or saving energy. For example, the powersave governor forces the processor to run at minimum frequency, while the performance governor forces it to run at maximum frequency. A more complex governor, ondemand [30], samples the CPU utilization at a given rate and scales the frequency accordingly. Standard governors are not aware of running applications, and cannot make distinctions based on their priority. They have no DRM capability, but previous work shows that governors can implement user experience-aware per-core DRM with low overhead [27]. Android also implements mechanisms to allocate and migrate tasks.
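The utilization-driven behavior described for the ondemand governor can be illustrated with a small sketch. This is not the kernel implementation: the frequency table `FREQS_KHZ`, the threshold `UP_THRESHOLD`, and the function `next_frequency` are hypothetical names and values chosen for illustration only.

```python
# Simplified, ondemand-style frequency selection (illustrative only).
# FREQS_KHZ and UP_THRESHOLD are hypothetical values, not taken from any device.
FREQS_KHZ = [300_000, 600_000, 900_000, 1_200_000, 1_500_000]
UP_THRESHOLD = 0.80  # utilization above which the sketch jumps to the maximum


def next_frequency(current_khz, utilization):
    """Return the frequency to use for the next sampling period."""
    if utilization > UP_THRESHOLD:
        return FREQS_KHZ[-1]
    # Otherwise pick the lowest frequency that would keep utilization
    # under the threshold for the same amount of work.
    needed = current_khz * utilization / UP_THRESHOLD
    for f in FREQS_KHZ:
        if f >= needed:
            return f
    return FREQS_KHZ[-1]
```

Note that the decision depends only on measured utilization: nothing in it distinguishes a foreground task from a background one, which is the limitation the text points out for standard governors.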
In this paper, for the first time, we formulate an optimization problem for temperature and workload-aware reliability management (WARM) and solve it with a multilevel controller employing convex optimization. It comprehensively manages reliability and temperature subject to the diverse performance requirements of applications. We then show that convex solvers are not suited for systems requiring response times on the order of milliseconds, because of their computational overhead. For this reason we propose WARM, a heuristic that efficiently solves the optimization problem by leveraging a cascade of controllers that act on different time scales. The fundamental contribution of WARM with respect to previous work in [26], [27], and [49] is the addition of a thermal controller (TC) that manages temperature by leveraging task allocation on a multicore platform. This is extremely important, given that most degradation mechanisms affecting reliability have a strong, exponential dependency on temperature. Our results show that WARM approximates the solution of the optimal policy within 5% error on average, while executing more than 400× faster. We also show that, since temperature is a major concern for degradation, WARM meets temperature constraints within 5% in 87.5% more cases than the state-of-the-art.
II. RELATED WORK
The international technology roadmap for semiconductors [2] recognizes reliability degradation due to aging as a primary concern for integrated circuits. Uncertainties in reliability can lead to performance, cost, and time-to-market penalties and can lead to field failures that are costly to fix and damaging to reputation. In the near future, TDDB and NBTI are a primary concern. TDDB (also referred to as oxide breakdown or dielectric breakdown) is a degradation mechanism that results in a low-impedance path through the gate dielectric of a transistor. Failures related to TDDB manifest as abnormally high off-state leakage current, changes in circuit switching delay or even failure to switch (hard breakdown). It depends on temperature, voltage and oxide thickness [18], [45]. NBTI results in an increased absolute threshold voltage of p-channel MOSFETs, and hence a degradation in drain current and performance. Most research attributes the threshold shift to two mechanisms. The first mechanism involves interface traps and oxide charge formation due to negative gate bias at elevated temperatures. The second mechanism involves breaking of Si-H bonds at the Si/SiO2 interface by a combination of electric field, temperature and holes. It results in dangling bonds or interface traps at that interface and positive oxide charge that may be due to H+ [20]. NBTI has been the subject of extensive research [31]. In this paper, we focus on TDDB, which is shown to have a much faster degradation rate compared to NBTI [16]. However, the proposed solution can be applied to any degradation mechanism which can be modeled with a reliability function.
Traditionally, aging is handled at design time, with the adoption of high design margins under the assumption of worst-case conditions. However, it will no longer be possible for designers to take into account a worst-case design window, because this would jeopardize the performance of circuits too much [2]. A more promising approach lowers the design margins and exposes degradation to the software stack to manage it at runtime [15].
Runtime control of operating conditions and task allocation has been successfully used in recent years for dynamic power management and dynamic thermal management (DTM). The latter is a technique that aims at keeping the operating temperature of the chip in a safe range. Researchers developed different techniques, from simple reactive ones to more elaborate approaches like model predictive control [3], [37]. Today DTM is employed in almost all commercial devices and can be handled by all operating systems. DRM, instead, is a technique where a processor can dynamically trade performance to adapt to aging. DRM is different from DTM in that it makes it possible to meet a target lifetime. This is introduced in [39], where a microarchitectural reliability model is the reference for dynamic adaptation. The authors report how DTM and DRM lead to different control behaviors. Work in [23] highlights the limitations of [39], namely its focus on a short time scale and its use of benchmark applications for experiments. In turn, it proposes a PID-based DRM algorithm which exploits DVFS for peak performance improvement under high demand. The DRM here exploits a linear extrapolation to project the lifetime total damage. The evaluation uses macro-level user-collected processor usage profiles. That work assumes that the future workload is equal to the previous one and is sensitive to sudden workload variations. Zhuo et al. [48] proposed a process variation and temperature-aware reliability model for TDDB, which can estimate processor reliability from temperature and voltage history. Zhuo et al. [49] presented a DRM framework that extends this model to periodically predict the future value of reliability at the target lifetime. Based on the difference between the predicted reliability and the target one, the controller sets a maximum operating voltage. This approach is less sensitive to workload variations compared to previous work because it exploits a confidence-based workload estimation.
Since the policy sets the maximum voltage, it is not able to guarantee speed bursts for tasks that demand high performance, causing user experience degradation. Recent work in [38] explores the tradeoff between performance and reliability in multicore processors by introducing the throughput-lifetime product and proposes dynamic reliability variance management to enhance multicore lifetime. The approach in [29] employs an efficient Bayesian classifier to detect reliable configurations online and uses architectural adaptation to select the one with the best performance thanks to a performance prediction model. These papers present results derived from simulations. For this reason they assume workload and platform models. Workload models are assumed to be known a priori, but they may not be available for general purpose architectures. This limits the feasibility of implementing the proposed approaches on a real platform. DRM is not available in commercial devices today.
Recent work proposes hardware degradation monitors. Blome et al. [5] proposed the design of a wear-out detection unit for automatic compensation. Singh et al. [34] presented in-situ sensors for NBTI and TDDB. These devices are based on a ring oscillator driven by transistors under stress. The frequency of the ring oscillator varies depending on the impact of degradation. Work in [35] uses the output of these sensors in a control loop for dynamically managing the impact of NBTI degradation. Wang et al. [44] presented Radic, a novel built-in sensor for reliability analysis. Based on this, they deploy an aging adaptation system to prevent failures. These publications effectively counteract degradation, but they do not consider the existence of different workload requirements to achieve good user experience. This is achieved by the technique in [26], which is a workload-aware DRM technique for multiprocessors based on a two-level controller. This technique monitors system reliability on a long time scale and adapts operating conditions to workload quality requirements on a short time scale, avoiding user experience degradation. This is shown to outperform state-of-the-art techniques, as it provides full performance to critical applications. This paper focuses on TDDB, exploiting the model presented in [49] for simulating the presence of degradation sensors. It does not assume a priori knowledge of workload, but leverages a runtime binary characterization of workload requirements. Thus it is feasible to implement.
A number of recent publications propose reliability-aware scheduling techniques. Work in [19] proposes a task allocation technique for MPSoCs which maximizes lifetime subject to performance constraints. Bolchini et al. [6], instead, focused on NoC-based platforms and devised a scheduling algorithm for joint energy and reliability optimization. In these publications, however, lifetime is considered as an objective function and maximized, rather than as a constraint. Work in [31] formulates and solves the problem of energy efficient, aging-tolerant task allocation for a variability-affected platform running rate-constrained multimedia applications, adapting to system degradation conditions. In this paper, reliability is a constraint rather than an objective function. Such a formulation is more meaningful from a manufacturer perspective [2]. These publications do not exploit DVFS and the existing power management infrastructure. Also, the scheduler of operating systems such as Linux is a very critical section. The implementation of scheduling techniques is highly challenging in real devices and the results presented in the referenced works come from simulations.
Alternative DVFS and task scheduling approaches have been proposed to mitigate thermal hotspots, also improving the system reliability. In [33], the thermal behavior of a multicore processor is first modeled through a state-space system, which accounts for the thermal coupling between cores. Next, the model is reduced into a representation suitable to formulate an integer linear problem (ILP). The ILP's goal is to allocate N tasks onto N cores while maximizing the per-core frequency and meeting a thermal constraint. A similar approach is proposed in [17]. In that work, a heterogeneous multicore processor running a queue containing N nonidentical tasks is considered. The authors use a more detailed thermal model which leads to a nonlinear optimization problem. These publications are effective in controlling temperature, but are not aware of degradation, which may lead to reliability violations and performance degradation.
Another class of related work focuses on management for soft errors (for example, bit flips caused by radiation). Such errors can compromise data integrity and accuracy of computations, but they do not affect aging. Work in [47] investigates the effects of energy management using DVFS on real-time embedded systems. The authors find that, for critical applications, DVFS may increase the fault rate dramatically. Based on this, Zhu and Aydin [46] proposed two effective scheduling algorithms for real-time tasks that maintain a very low fault rate, without impacting energy efficiency. Our work is orthogonal, as we target mobile devices and degradation mechanisms that induce circuit aging.
Reliability management for embedded systems is the subject of very recent work. Das et al. [10] presented a hierarchical manager that exploits Q-learning to jointly control temperature and energy consumption. The proposed strategy is also demonstrated to improve the lifetime reliability. Work in [11] proposes a simplified temperature model to develop a gradient-based efficient heuristic to determine multicore operating conditions with the goal of minimizing energy consumption and maximizing the system lifetime. The presented strategies are effective in improving system reliability, but they cannot account for user-experience.
Previous work in [26] proposed a two time-scale DRM framework and a following publication described its implementation as a reliability governor on a real Android device [27] . These techniques have three main limitations, which may lead to suboptimal decisions. First, they do not control temperature, but only voltage and frequency. This is very critical because degradation mechanisms have an exponential dependency on temperature. Second, they scale voltage and frequency, but cannot leverage task allocation/migration. This is an effective strategy to balance temperature in a multicore. Third, they have not been compared to an optimal approach.
In this paper, we make the following key contributions.
1) For the first time, we formulate and solve an optimization problem that accounts for reliability and thermal constraints, while meeting application-specific performance requirements for quality of experience (QoE). We then present WARM, a fast hierarchical heuristic that approximates the optimal solution, and implement it in the Android software stack.
2) We develop and add a TC to WARM that uses task allocation and migration to effectively balance degradation among different cores and achieve higher performance.
3) In the Android implementation, we decouple reliability emulation from reliability control. We also implement an application manager that enables taking short term decisions based on the currently scheduled task.
4) In experimental results, we compare our technique on a real Android device against state-of-the-art governors on a broad range of applications.
The remainder of this paper is organized as follows. Section III illustrates the theoretical background for the presented work. Section IV formulates the management problem and presents the controller using convex optimization, Section V describes the WARM architecture, Section VI presents the implementation, Section VII describes the experimental setup, Section VIII presents our experimental results and Section IX concludes this paper.
III. MATHEMATICAL MODELS
The main degradation mechanisms affecting integrated circuits are TDDB, NBTI, HCI, electromigration and thermal cycling. The average time before the failure of a device is denoted as MTTF [22] . Models have been developed for MTTF for each degradation phenomenon, which show a strong dependence on temperature.
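For example, the MTTF for TDDB is described by
$$MTTF_{TDDB} = A_0 \, e^{-\gamma E_{ox}} \, e^{E_a/(kT)} \tag{1}$$
where $A_0$ is a constant determined empirically, $E_{ox}$ is the electric field across the dielectric, $\gamma$ is the field acceleration parameter, $E_a$ is the activation energy, $k$ is the Boltzmann constant and $T$ is the absolute temperature.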
The MTTF for NBTI is described by
where γ is the voltage acceleration factor and V is the applied voltage.
The MTTF for HCI is described by the Eyring model, expressed in Equation (3) for N-channel devices.
$$MTTF_{HCI} = B \, (I_{sub})^{-N} \, e^{E_a/(kT)} \tag{3}$$
where $I_{sub}$ is the peak substrate current during stressing, N is a material-dependent constant, B is a scale factor that is a function of technological parameters, $E_a$ is the activation energy, $k$ is the Boltzmann constant and $T$ is the absolute temperature. Temperature is a parameter in all the previous equations and the dependency of MTTF on temperature is exponential, so it is very critical for degradation. MTTF for each degradation mechanism is related to a reliability function $R(t)$ as expressed by Equation (4).
$$MTTF = \int_0^{\infty} R(t) \, dt \tag{4}$$
Compared to MTTF, reliability is a function of time, so it is better suited for the purpose of dynamic management [49]. When considering the effect of multiple mechanisms acting together, multiple reliability functions can be combined into a single one [39]. In this paper, we focus on TDDB, as it is considered to be one of the key sources of degradation [2]. We describe the reference reliability model for TDDB [48] that is used in our implementation.
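To make the exponential temperature sensitivity concrete, the sketch below computes how much the modeled MTTF shrinks when the temperature rises, assuming an Arrhenius-type factor exp(E_a / (k T)) and everything else held constant. The activation energy of 0.7 eV and the two temperatures are hypothetical values chosen for illustration, and `thermal_acceleration` is an illustrative helper, not part of the paper's framework.

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def thermal_acceleration(e_a_ev, t_ref_k, t_hot_k):
    """Ratio MTTF(t_ref_k) / MTTF(t_hot_k) for an exp(E_a / (k T)) dependence."""
    return math.exp(e_a_ev / K_BOLTZMANN_EV * (1.0 / t_ref_k - 1.0 / t_hot_k))


# Hypothetical activation energy of 0.7 eV, 60 C versus 80 C.
factor = thermal_acceleration(0.7, 333.15, 353.15)
```

Under these assumed values, a 20 K increase shortens the modeled MTTF by roughly a factor of four, which is why a controller that only observes temperature leaves a large lever unused.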
A. TDDB Reliability Model
TDDB is a major concern for modern integrated circuits, given its strong dependence on gate oxide thickness [2]. As technology scales, the gate oxide layer becomes thinner, increasing the risk of breakdown and shortening device lifetime. Because of its irreversibility and increasing impact, it is a very representative degradation phenomenon. The oxide breakdown time to failure is inherently a statistically distributed quantity. TDDB time is modeled as a random variable with a Weibull probability distribution function. The reliability of a single transistor subject to oxide degradation can be expressed as [40]
$$R(t) = e^{-a \left( t/\alpha \right)^{\beta}} \tag{5}$$
where t is the time-to-breakdown, a is the device area normalized with respect to the minimum area, and α and β are respectively the scale parameter and the shape parameter (the latter sometimes also referred to as the Weibull slope) of the Weibull distribution. The scale parameter α represents the characteristic life, which is the time by which 63.2% of devices have failed, and it depends on voltage and temperature. The shape parameter β, instead, is a function of the critical defect density, which in turn depends on oxide thickness, temperature and applied voltage. In [12] the shape parameter is shown to vary linearly with the oxide thickness x, that is, β = bx, so R(t) can be expressed as follows:
$$R(t) = e^{-a \left( t/\alpha \right)^{b x}} \tag{6}$$
In the remainder of this paper, for simplicity, we will refer to b as the shape parameter. It is constant for a given oxide thickness, and depends only on voltage and temperature. Due to process variations, the oxide thickness is actually a distributed quantity. Therefore, the reliability of the ith device can be expressed as a conditional probability
$$R_i(t \mid x_i) = e^{-a_i \left( t/\alpha_i \right)^{b_i x_i}} \tag{7}$$
Based on this, the reliability of the entire chip $R_c$ can then be expressed as the product of all of the single device reliabilities as
$$R_c(t \mid \mathbf{x}) = \prod_{i=1}^{m} R_i(t \mid x_i) \tag{8}$$
where x is the vector of all oxide thicknesses and m is the number of transistors on the chip. Eliminating the dependence on x would require an m-dimensional integral. This problem has high complexity, as m could be in the millions. The work in [48] proposes a way to reduce the complexity of this problem, while still including process variation effects on oxide thickness. It is based on the observation that different regions of the chip share similar temperatures.
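The per-device Weibull form and the product over devices can be sketched as follows, assuming the reliability of one transistor is exp(-a (t/α)^(b x)) as the text describes. All numeric values and the function names `device_reliability` and `chip_reliability` are hypothetical and serve only to show the structure of the computation.

```python
import math


def device_reliability(t, alpha, b, x, a=1.0):
    """R(t | x) = exp(-a * (t / alpha) ** (b * x)) for one transistor."""
    return math.exp(-a * (t / alpha) ** (b * x))


def chip_reliability(t, alpha, b, thicknesses, a=1.0):
    """Product of per-device reliabilities for a list of oxide thicknesses."""
    r = 1.0
    for x in thicknesses:
        r *= device_reliability(t, alpha, b, x, a)
    return r
```

At t equal to the scale parameter with unit area, `device_reliability` returns exp(-1), matching the characteristic-life definition (63.2% of devices failed). The loop in `chip_reliability` runs once per transistor, which is the cost that the block-level simplification is meant to avoid.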
A block is defined as a region of the chip with almost the same temperature. Given this definition, the reliability functions of the single devices belonging to the same block have the same scale and shape parameters. Given that N is the total number of blocks, $R_c(t \mid \mathbf{x})$ can be rewritten as
A further simplification is introduced by defining the block level oxide thickness distribution (BLOD). Collecting the oxide thicknesses of all the devices in a block, it is possible to build a frequency histogram. The histogram can then be fitted to a Gaussian curve, which is the BLOD. Each BLOD has mean $u_i$ and variance $v_i$. Since it is unfeasible to actually measure oxide thicknesses and build a frequency histogram, the means and variances of BLODs are random variables. However, the number of random variables in the problem is reduced from m (millions) to 2N (a few units).
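The BLOD idea can be illustrated with synthetic data: many per-device thicknesses in a block are summarized by just two numbers. The samples below are generated, not measured (the text notes that measurement is unfeasible), and the nominal thickness of 1.8 nm, the spread of 0.03 nm, and the helper `blod_parameters` are hypothetical.

```python
import random
import statistics


def blod_parameters(thicknesses_nm):
    """Return (mean, variance) of the Gaussian fitted to one block's thicknesses."""
    return statistics.fmean(thicknesses_nm), statistics.variance(thicknesses_nm)


# Synthetic block: 10,000 devices around a hypothetical 1.8 nm nominal thickness.
random.seed(0)
block = [random.gauss(1.8, 0.03) for _ in range(10_000)]
u, v = blod_parameters(block)
```

Here the sample mean and sample variance play the role of the Gaussian fit; ten thousand values collapse into the pair (u, v), which mirrors the reduction from m to 2N variables described above.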
The chip reliability can be approximated as
The simplification steps described in [48] make it possible to remove the dependence on x and express the chip reliability as a sum of double integrals in the space of means and variances of BLODs, as
where N is the number of blocks composing the chip, $A_j$ is the area of the jth block normalized with respect to the minimum area, $f_{u_j,v_j}$ is the joint distribution of means and variances of BLODs and $g(u_j, v_j)$ results from the simplification procedure and reads
The joint distribution $f_{u_j,v_j}$ can be expressed as the product $f_{u_j,v_j} = f_{u_j} \times f_{v_j}$ with good approximation, as shown in [48]. The distribution $f_u$ is Gaussian, while for the variances we have
where χ is a chi-square distribution with b degrees of freedom. The quantity $R_c$ is static, since it assumes the same voltage and temperature from time t = 0. To exploit the value of reliability in a control loop, it is necessary to consider temperature and voltage changes over time. This is possible by discretizing the time axis and calculating at each time step the system (dynamic) reliability as
where k indicates the generic kth time instant, $T_{k-1,k}$ and $V_{k-1,k}$ are the temperature and voltage experienced by the system between the time instants k − 1 and k, and $R_c$ is the static reliability. In this way, the system reliability is calculated as a recursive sum of progressive damages. When applying this model, we consider each core of the multicore platform as a single block, thus N = 1. Equation (11) reduces to
Equation (14) is used in our framework to assess the reliability status of the processors, given temperature and voltage values, using (15) to evaluate the static reliability $R_c$. The value $R_k$ resulting from (14) goes from 1 to 0 and is considered as a measure of the system degradation status. Equation (14), used to update the reliability value, is general and does not depend on a specific degradation mechanism. Therefore, our framework is valid for every degradation mechanism or combination of multiple mechanisms as long as it can be described by a function $R_c(t)$ such as that in (15). For example, to extend the framework to include NBTI and HCI, we assume that we have models to describe the reliability functions associated with these mechanisms, respectively $R_{NBTI}$ and $R_{HCI}$. As described in Karl et al. [23], [24], the total system reliability function is given by the product of the functions associated with the single mechanisms as
$$R(t) = R_{TDDB}(t) \, R_{NBTI}(t) \, R_{HCI}(t)$$
where $R_{TDDB}$ denotes the TDDB reliability function described above.
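The following sketch shows one plausible reading of a "recursive sum of progressive damages": at each step, the reliability lost over the interval under the current stress is subtracted from the running value. This update rule is an assumption for illustration, not the paper's Equation (14), and the Weibull stand-in `static_reliability`, its parameters, and the helper names are hypothetical. The last function illustrates the product rule for combining mechanisms.

```python
import math


def static_reliability(t, alpha, beta, area=1.0):
    """Static Weibull reliability for constant stress (alpha encodes T and V)."""
    return math.exp(-area * (t / alpha) ** beta)


def update_reliability(r_prev, t_prev, t_now, alpha, beta):
    """Subtract the damage accrued in [t_prev, t_now] under the current stress."""
    damage = static_reliability(t_prev, alpha, beta) - static_reliability(t_now, alpha, beta)
    return r_prev - damage


def combined_reliability(*mechanisms):
    """Total reliability as the product of per-mechanism reliabilities."""
    total = 1.0
    for r in mechanisms:
        total *= r
    return total
```

Under constant stress the per-step damages telescope, so the recursive value coincides with the static reliability at the final time. When temperature or voltage changes between steps, `alpha` changes and the recursion tracks the accumulated history instead.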
B. Thermal Model
The heat propagation across a multicore processor is modeled using the heat diffusion relationship reported in (16). In this equation, $T(\vec{r}, t)$ and $P(\vec{r}, t)$ are the temperature and power at location $\vec{r} = (x, y, z)$ and time t. Parameter ρ is the material density, $c_p$ is the material specific heat, and $k_T$ is the material thermal conductivity [4]
$$\rho \, c_p \frac{\partial T(\vec{r}, t)}{\partial t} = \nabla \cdot \left( k_T(\vec{r}, t) \, \nabla T(\vec{r}, t) \right) + P(\vec{r}, t). \tag{16}$$
Equation (16) 
The representation of (17) and (18) is useful to deploy effective control strategies aimed at meeting a thermal constraint while improving performance and reducing power consumption through DVFS and task scheduling [3].
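As an illustration of why a state-space thermal model is convenient for control, the sketch below simulates a discrete-time model of the common form T[k+1] = A T[k] + B P[k]. This form is an assumption consistent with the matrices A, B and C mentioned later in the text, not a reproduction of Equations (17) and (18); the two-core coefficients are hypothetical and temperatures are expressed as rises above ambient.

```python
def thermal_step(temps, powers, A, B):
    """One step of T[k+1] = A T[k] + B P[k] for per-core temperatures."""
    n = len(temps)
    return [
        sum(A[i][j] * temps[j] for j in range(n))
        + sum(B[i][j] * powers[j] for j in range(n))
        for i in range(n)
    ]


# Hypothetical two-core model; off-diagonal terms of A capture thermal coupling.
A = [[0.90, 0.05], [0.05, 0.90]]
B = [[0.50, 0.00], [0.00, 0.50]]
temps = [0.0, 0.0]
for _ in range(500):
    temps = thermal_step(temps, [2.0, 0.5], A, B)
```

With these assumed coefficients, the lightly loaded second core settles well above the level its own power would produce, because it is heated by its neighbor. This coupling is what makes task allocation a useful knob for temperature control.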
C. Power Model
For a core, the power is the sum of two contributions: dynamic and leakage. The dynamic power can be modeled through (19), where α and C are the activity factor and the switching capacitance. As the frequency f depends linearly on $V_{dd}$, the dynamic power can be approximated as a cubic function of frequency, similarly as in [36]
$$P_{dyn} = \alpha \, C \, V_{dd}^2 \, f. \tag{19}$$
The leakage power can be modeled through (20), where the coefficient b depends on technology-dependent constants, channel length and width; the coefficient k depends on the Boltzmann constant, the electron charge, and the threshold voltage; and $I_{gate}$ is the gate leakage current, which can be assumed constant
A simplified model, that accounts for both dynamic and static power is shown in (21) . The strength of this model is that it can be easily fit to real measurements
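The cubic relation between dynamic power and frequency follows directly from the standard switching-power expression once the supply voltage is taken to scale linearly with frequency. The sketch below makes that explicit; the proportionality constant `volts_per_hz` and the numeric values in the usage example are hypothetical.

```python
def dynamic_power(freq_hz, activity, capacitance_f, volts_per_hz):
    """P_dyn = activity * C * Vdd^2 * f with Vdd assumed proportional to f."""
    vdd = volts_per_hz * freq_hz
    return activity * capacitance_f * vdd ** 2 * freq_hz


# Hypothetical core: 1 V at 1 GHz, 1 nF switched capacitance, activity 0.2.
p_1ghz = dynamic_power(1e9, 0.2, 1e-9, 1e-9)
p_2ghz = dynamic_power(2e9, 0.2, 1e-9, 1e-9)
```

Doubling the frequency multiplies the modeled dynamic power by eight, so even a modest frequency reduction yields a large drop in power and, through the thermal model, in temperature.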
IV. OPTIMAL CONTROLLER ARCHITECTURE
In this section, we describe the structure of the optimal controller. We divide the problem into LIs, on the order of the days it takes for reliability to change, and SIs, on the order of milliseconds for scheduling decisions. Fig. 1 shows the structure of an optimal dynamic reliability controller. It consists of an optimal long term controller (OLTC), which selects the target average voltage and temperature for each core for the next LI, and an optimal short term controller (OSTC), which determines task allocation and frequency levels at each SI. The OLTC activates at the beginning of an LI and provides each core with temperature and voltage reference values, $T^{LTC}$ and $V^{LTC}$. Then, the OSTC activates at each SI and determines voltage/frequency levels and task allocation so that the average temperature and voltage in the LI are, respectively, lower than $T^{LTC}$ and $V^{LTC}$.
The OLTC problem for finding $T^{LTC}$ and $V^{LTC}$ is formulated as follows:
In this problem, $R_p(t_{life})$ indicates the predicted reliability at the target lifetime [26]. Based on this formulation, the OLTC finds the pair of values $T^{LTC}$ and $V^{LTC}$, which are used as constraints in the OSTC problem. The constraint on reliability is met if, at the end of each LI, the average observed temperature and voltage are lower than the constraints ($T^{LTC}$, $V^{LTC}$) [26]. For the formulation of the OSTC problem, we assume that the target device has N cores, each one executing a task j (j = 1, . . . , N) that requires a frequency $f_j^*[k]$ at the SI k. For each core i, its average frequency over an LI, labeled as $f_i^{LI}$, must not be greater than the value $f_i^{LTC}$. This corresponds to the maximum frequency that can be obtained at voltage $V_i^{LTC}$. This is a reasonable assumption given that processors supporting DVFS usually have predefined voltage-frequency operating points. Moreover, its average temperature over an LI should not be greater than the reference $T_i^{LTC}$.
We also assume that such a system works alongside the native system scheduler, which selects at most N tasks. The problem is then to allocate N tasks onto N cores and select the frequency $f_i[k]$ of the core i at each SI k. If there are fewer actual tasks than cores, we add idle tasks to bring the total to N. Similarly as in [26], some tasks are labeled as highly critical (H) for user experience, thus they must execute at a frequency as close as possible to $f_j^*[k]$. These can be, for example, tasks belonging to the foreground application. All others are labeled as less critical (L). Given these assumptions, at each SI, the following optimization problem is formulated:
In the problem above, $F[k]$ is the vector of core frequencies $[f_1[k], \ldots, f_N[k]]$ at instant k. $X_{alloc}$ is a matrix whose elements $x_{ij}$ assume value 1 if core i executes task j and 0 otherwise. The matrix $X_{alloc} \in \mathbb{R}^{N \times N}$ of elements $x_{ij}$ has exactly one element equal to 1 in each column and row. This is because every core executes one task. The solution to the problem is given by matrix $X_{alloc}$ and vector F.
The 
In the above equation, $T_c$ is the critical temperature (at which the core shuts down). The values of $f_{ref}$ and $T_{ref}$ are updated at each SI, based on the rules shown in Equations (34) and (31), respectively. The goal of (28) and (29) is to relax the constraints of the optimization problem for H tasks, so that the solver can find a solution that provides a higher frequency for H tasks. We use the thermal model of (17) and (18). We also assume that each core has a power sensor and a thermal sensor. Therefore, $A, B, C \in \mathbb{R}^{N \times N}$, with $C = I$.
The problem has real (i.e., F) and binary (i.e., X alloc ) decision variables. For a fixed X alloc , the problem is convex, because frequency and power can only have positive values. The problem is solved for all the possible instances of X alloc to choose the one that provides the best solution.
As illustrated in [9], an optimization problem belongs to the class of convex problems if the objective function and the constraint functions are convex. To prove this, we assume that $X_{alloc} = X$ is fixed. First, the objective function $\|F - X F^*\|^2$ is convex, because it is the squared Euclidean norm of an affine function of F. Then, the constraint in (25) is linear, and linear functions are a subclass of convex functions. The constraints in (26) and (27) are also convex with respect to frequency. This can be verified by substituting (19) and (20) and considering that frequency is always greater than zero.
To find the solution, the OSTC iterates through each possible $X_{alloc}$ and computes the optimal F. Then, it returns the pair of $X_{alloc}$ and F that provides the lowest value of $\|F - X_{alloc} F^*\|^2$.
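The exhaustive search described above can be sketched in a few lines. This is a simplified illustration, not the solver used in this paper: it assumes that constraints (25)-(27) reduce to a per-core upper bound on frequency (the hypothetical `f_cap`), in which case the inner convex problem for a fixed allocation has a closed-form solution (clipping). All function and parameter names are illustrative.

```python
from itertools import permutations


def solve_fixed_allocation(perm, f_star, f_cap):
    """Inner step for a fixed allocation; perm[i] is the task run by core i.

    Assumption: the only constraint is f_i <= f_cap[i], so the minimizer of
    the squared distance to the requested frequencies is a simple clip.
    """
    requested = [f_star[j] for j in perm]
    freqs = [min(r, cap) for r, cap in zip(requested, f_cap)]
    cost = sum((f - r) ** 2 for f, r in zip(freqs, requested))
    return freqs, cost


def exhaustive_ostc(f_star, f_cap):
    """Try every one-to-one allocation and keep the one with the lowest cost."""
    best = None
    for perm in permutations(range(len(f_star))):
        freqs, cost = solve_fixed_allocation(perm, f_star, f_cap)
        if best is None or cost < best[2]:
            best = (perm, freqs, cost)
    return best


if __name__ == "__main__":
    # Two cores: core 0 is capped at 1000 MHz, core 1 at 2000 MHz.
    print(exhaustive_ostc(f_star=[2000, 800], f_cap=[1000, 2000]))
```

The search visits all N! allocations (24 for 4 cores, 40320 for 8 cores), and in the full problem each of them requires a convex solve, which is why the cost grows quickly with the number of cores.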
V. WARM CONTROLLER ARCHITECTURE
Convex solvers are too computationally expensive to be used in runtime management policies, since the system should take control decisions on the order of milliseconds. For this reason, we develop WARM, a heuristic that approximates the optimal solution but is more than 400× faster. WARM, represented in Fig. 2, consists of three components. The first is a set of local and independent STCs that find F and switch the operating frequency of the cores at each SI. The second is a TC, which determines the allocation of tasks for a medium interval (MI) lasting from less than a second to a few seconds. This is a key component, since degradation depends exponentially on temperature. On top of these two, a set of OLTCs based on convex optimization estimates the degradation status of the cores and sets the operating condition constraints for a LI (i.e., T LTC and V LTC ). In the following, we explain the behavior of each component in more detail.
A. Long Term Controller
The OLTC activates at each LI, samples data from aging sensors, monitors the degradation status, and calculates the average temperature and voltage. The OLTC predicts future reliability with these values, using the technique described in [26]. The prediction assumes a constant voltage and temperature for the remaining lifetime. Since reliability loss occurs on a long time scale, we consider a LI to be on the order of days. Based on this, it solves the problem presented in (22) using convex optimization and provides a reference voltage/frequency V/f LTC and a reference temperature T LTC , which are the inputs for the STC. The constraint on reliability is met if, for each core, the average applied voltage V LI is less than or equal to V LTC and the average temperature T LI is below T LTC at the end of the LI. Since the OLTC activates on the order of days, the overhead of convex optimization is negligible here.
B. Thermal Controller
Previous work in [26] and [27] is based on only a LTC and a STC, and it has two main limitations. First, it cannot balance temperature across different cores of a multicore platform, because it cannot exploit task allocation. Second, it cannot enforce the constraint on T LTC , because it does not control temperature directly. To solve these problems, WARM has a centralized TC that monitors core temperatures and updates the values of average temperature T avg and reference temperature T ref with
In these equations, i indicates the ith SI inside a LI, T(i) is the temperature at the ith SI, t LI is the duration of a LI (measured in SIs), T LTC is the reliability-induced constraint on average temperature, and T c is the core critical temperature. The TC takes decisions at a coarser time granularity than the STC. Its activation periods are called MIs and last on the order of tens of SIs, which is representative of how temperature changes over time.
When the TC activates, it determines the task allocation by assigning the tasks with higher priority (e.g., H tasks) to cooler cores. This is equivalent to determining the matrix X alloc . Then, if the current temperature T is higher than T ref , it forces the frequency f L assigned to L tasks for the next MI to the minimum. This helps reduce the thermal stress for the next MI, and thus the value of T avg for the current LI. In this way, the TC limits the performance of L tasks only when the temperature is outside the safe range for reliability (i.e., above T ref ). By limiting the performance of L tasks, the TC spends a MI lowering the average temperature.
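The TC decision at each MI can be illustrated with a short sketch. It is only a reading of the textual description, under stated assumptions: tasks carry an "H" or "L" label, the throttling decision is taken per core by comparing its current temperature with a per-core T ref, and the names `thermal_controller_step`, `core_temps`, `task_labels` and `t_ref` are hypothetical.

```python
def thermal_controller_step(core_temps, task_labels, t_ref):
    """One TC activation: allocate tasks and flag cores whose L tasks are throttled.

    core_temps[c]  -- current temperature of core c
    task_labels[j] -- "H" or "L" for task j
    t_ref[c]       -- reference temperature of core c (assumed per core)
    Returns (allocation, throttled): allocation maps core -> task, and
    throttled is the set of cores where f_L is forced to the minimum.
    """
    # Coolest cores first.
    cores_by_temp = sorted(range(len(core_temps)), key=lambda c: core_temps[c])
    # H tasks first; the sort is stable, so the order within each class is kept.
    tasks_by_priority = sorted(
        range(len(task_labels)), key=lambda j: task_labels[j] != "H"
    )
    allocation = dict(zip(cores_by_temp, tasks_by_priority))
    # L tasks are limited only on cores that are above their reference.
    throttled = {c for c, temp in enumerate(core_temps) if temp > t_ref[c]}
    return allocation, throttled


if __name__ == "__main__":
    alloc, hot = thermal_controller_step(
        core_temps=[70, 50, 60, 55],
        task_labels=["L", "H", "L", "H"],
        t_ref=[65, 65, 65, 65],
    )
    print(alloc, hot)
```

In this example the two H tasks land on the two coolest cores, and only the core above its reference has its L frequency forced to the minimum.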
C. Short Term Controller
The STC activates at each SI and selects the voltage to apply at a fast time rate. A SI ideally corresponds to the scheduling tick of a real system. As already described, the frequency to apply for the execution of L tasks is selected at each MI following the rule specified in (33) . Then, the STC selects the applied frequency as described by
In this equation, the value of f L is selected by the rule in
Here, the first case with f MIN is enforced by the TC. For the second case, f req is the required execution frequency of the task, while f ref is updated at each SI with the rule specified in (34) . Moreover, the value of f H corresponds to f req for the current task. Finally, f ref is computed as
where f LTC represents the reliability-induced constraint on frequency and f avg is updated at each SI in the same way as T avg . If the system runs into a power-critical scenario that requires the intervention of power management, the applied frequency f app can be lowered, and the target reliability would still be met. If the executed task is H, this would come at the cost of QoE degradation, but in any case the proposed DRM would not exacerbate the power consumption.
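The selection rule of the STC can be summarized in code. The sketch below is an interpretation of the description in this section, not a transcription of (32) and (33): it assumes that an H task always receives its required frequency, that a throttled L task receives the minimum frequency, and that an L task otherwise receives the smaller of its required frequency and f ref. The last case in particular is an assumption, and `stc_select_frequency` is a hypothetical name. The value of f ref is taken as an input rather than computed.

```python
def stc_select_frequency(label, f_req, f_ref, f_min, throttled):
    """Return the frequency to apply for the current task at this SI.

    label     -- "H" or "L"
    f_req     -- frequency required by the task
    f_ref     -- current reference frequency (computed elsewhere)
    f_min     -- minimum operating frequency
    throttled -- True if the TC forced f_L to the minimum for this MI
    """
    if label == "H":
        return f_req  # f_H corresponds to the required frequency
    if throttled:
        return f_min  # first case of the f_L rule, enforced by the TC
    return min(f_req, f_ref)  # assumed form of the second case


if __name__ == "__main__":
    print(stc_select_frequency("L", f_req=2000, f_ref=1400, f_min=200, throttled=False))
```

Keeping the rule to a few comparisons is consistent with the requirement that the STC runs at every scheduling tick.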
In this paper, we assume without loss of generality that applications are either H or L. However, the controller operates correctly even if quality requirements change quickly within the same application, as long as they change more slowly than the STC activation rate. For example, an H application could have an L phase (like the menu screen of a mobile game). This could be identified online by integrating the DRM with the app-phase recognition engine proposed in [25] to enable further reliability improvement.
VI. ANDROID IMPLEMENTATION
All the components of WARM have been implemented in the Android software stack to execute on real devices. Fig. 3 shows the block diagram of the proposed implementation. WARM consists of three subcomponents: the Application Monitor, RelDroid and the WARM reliability manager. In the following, we describe each component in detail.
A. Application Monitor
The application monitor consists of the configuration file (App Monitor Config in the figure), the App Monitor and the App Driver. The user can optionally fill in a list of favorite applications, which is saved in the configuration file. This is because, due to subjective judgment, not all applications may be critical for the user. The Monitor periodically checks which applications are currently active in the foreground and, if it finds a match in the list of favorite applications, it outputs the ID of the corresponding process. This ID is passed to kernel space through the driver and stored in a shared variable.
B. RelDroid
RelDroid implements the infrastructure for monitoring reliability. Since commercial devices do not have degradation sensors, reliability needs to be emulated online through temperature and voltage readings, which are the input for the model discussed in Section III. In the kernel space, the reliability module samples the values of voltage and temperature at each scheduling tick and updates the average values. In the userspace, the reliability model activates periodically (at a user-defined rate) and reads the average voltage and temperature values from the module through a dedicated reliability driver. Finally, the reliability model implements the model presented in Section III to update the values of reliability and writes the values to a log file.
C. WARM
WARM consists of the OLTC, the TC and the STC. The OLTC selects the reference average voltage and temperature for the next LI. The TC activates at each MI and switches the allocation of active processes on cores to balance temperature. This is accomplished by the set_affinity mechanism, which automatically increases the affinity of tasks to run on a specific core. Both the OLTC and the TC are implemented in C and cross-compiled with the Android NDK toolchain to run on the target ARM architecture. Finally, the STC is implemented as a cpufreq governor, called WARM Governor.
To implement this, we modified the code of the ondemand governor and replaced the original algorithm with the rule described in Section V-C. The kernel and userspace components of WARM communicate with each other through the WARM driver.
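The task reallocation performed by the TC relies on CPU affinity. The TC in this work is written in C; purely as an illustration, the sketch below pins a process to one core with Python's `os.sched_setaffinity`, which wraps the Linux scheduler affinity call. We assume this is the kind of mechanism meant by set_affinity above, and the helper name `pin_to_core` is hypothetical. The call is available on Linux only.

```python
import os


def pin_to_core(pid, core):
    """Restrict process `pid` (0 means the calling process) to a single CPU.

    Returns the affinity set reported by the kernel after the change.
    """
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)      # CPUs this process may use now
    target = min(allowed)                  # pick one that is certainly allowed
    print(pin_to_core(0, target))
    os.sched_setaffinity(0, allowed)       # restore the original affinity
```

A controller built this way would call the helper once per MI for each process whose core assignment changes.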
VII. EXPERIMENTAL SETUP
Our Android test platform is the Odroid XU3 development board, shown in Fig. 4. This board has a Samsung Exynos 5422 Octa core processor based on the ARM big.LITTLE architecture, with a 2.0 GHz Cortex-A15 quad-core cluster and a Cortex-A7 quad-core cluster. It also has a Mali-T628 MP6 GPU. The device has four integrated voltage/current/power monitoring sensors implemented on the PCB. They measure the voltage, current and power of, respectively, the LITTLE cluster, the big cluster, the GPU, and the memory subsystem.
The LITTLE and big clusters have DVFS capability and can switch voltage and frequency between predefined operating points using the cpufreq utility. Fig. 5 shows the voltage-frequency curves for the two clusters. These have been obtained by changing the frequency of the cores and sampling the integrated voltage sensor. The platform has 4 temperature sensors for the big cluster and 1 for the GPU, whose exact placement is unknown, but no sensor for the LITTLE cluster. We perform an experiment to associate a sensor with each big core. To do this, we run a power virus application with one big core active at a time for 5 s, and observe which sensor records the highest temperature.
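The sensor association experiment amounts to an argmax per run. The sketch below shows the idea on made-up readings, which are illustrative and not measured data. The core numbering 4 to 7 for the big cluster and the function name `map_sensors_to_cores` are assumptions.

```python
def map_sensors_to_cores(peak_temps):
    """Associate each core with the sensor that got hottest while it ran alone.

    peak_temps[core] is the list of peak readings of every sensor, recorded
    while `core` was the only big core running the power virus.
    """
    mapping = {}
    for core, readings in peak_temps.items():
        mapping[core] = max(range(len(readings)), key=lambda s: readings[s])
    # Each sensor should be claimed by exactly one core.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("ambiguous mapping: two cores claim the same sensor")
    return mapping


if __name__ == "__main__":
    # Illustrative readings in degrees Celsius, one row per active core.
    runs = {
        4: [61, 55, 54, 53],
        5: [55, 62, 54, 55],
        6: [53, 54, 55, 63],
        7: [54, 55, 60, 56],
    }
    print(map_sensors_to_cores(runs))
```

The explicit check for duplicate assignments guards against runs that are too short for one sensor to stand out clearly.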
We also implement a virtual platform that leverages thermal and power models to simulate the control policies. This is required due to the complexity of convex solvers. Fig. 6 shows the block diagram of the virtual platform, which implements a control loop. The trace of required frequencies F * is randomly generated and provided as an input. It is used to compute the power consumption of each core with the derived power model. The power is then used in the state-space model to predict the temperature of each core. Finally, the control policy adapts the frequencies to meet the set of constraints of the management problem.
To simulate temperature, we employ the state-space model described in Section III-B. For power, we derive the model by fitting the relation in (21) to real experimental data. This model accounts for both dynamic and leakage contributions. The fitting procedure estimates the parameters a and b by applying a least-squares algorithm to training traces of power and frequency. The virtual platform has been implemented in MATLAB version R2014a (8.3.0.532) and executed on a Lenovo T440s Thinkpad. The laptop has a 4th Gen Intel Core i5-4300U processor (3 MB cache, up to 2.90 GHz). The virtual platform in our evaluation has 4 cores and is configured based on measurements from the Cortex-A15 cores of the ARM big.LITTLE architecture. We specify and solve convex optimization problems with CVX [13], [14].
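The least-squares fit of the two parameters a and b can be illustrated as follows. The exact form of (21) is not reproduced in this section, so the sketch assumes a model that is linear in the parameters, P = a·x + b, where x is some known feature of the operating point (for example a function of frequency and voltage) and b collects the frequency-independent contribution. This is an assumption made for illustration, and `fit_power_model` is a hypothetical name.

```python
def fit_power_model(x, p):
    """Ordinary least squares for p ~ a * x + b (closed form, no libraries).

    x -- feature values derived from the frequency trace (assumed known)
    p -- measured power samples, same length as x
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_p = sum(p) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xp = sum((xi - mean_x) * (pi - mean_p) for xi, pi in zip(x, p))
    a = s_xp / s_xx            # slope: contribution that scales with x
    b = mean_p - a * mean_x    # intercept: contribution independent of x
    return a, b


if __name__ == "__main__":
    # Synthetic, noise-free samples generated with a = 2.0 and b = 0.5.
    features = [1.0, 2.0, 3.0, 4.0]
    power = [2.0 * f + 0.5 for f in features]
    print(fit_power_model(features, power))
```

With real traces the samples are noisy, and the fit returns the pair (a, b) that minimizes the sum of squared residuals.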
VIII. EXPERIMENTAL RESULTS
In this section we report our experimental results. First, we illustrate how the OLTC behaves. Then, we compare the optimal and the proposed WARM short term control in a simulation environment. We also show the benefits of WARM compared to the state-of-the-art techniques in [26] and [49]. Then, we discuss the overhead associated with the implementation of WARM in a real operating system. On the real system, we highlight the benefits of the TC migration compared to the technique in [27]. Finally, we evaluate the performance and reliability tradeoffs of WARM compared to other CPU governors. The results presented in Sections VIII-A to VIII-C are obtained through simulations on the virtual platform presented in the previous section. The results in Sections VIII-D to VIII-F, instead, are obtained on the real Odroid XU3 platform.
A. Optimal Long Term Controller
The OLTC activates at each LI, solves the problem expressed in (22) and provides the values V LTC and T LTC to the STC. The problem is solved using convex optimization. In this experiment we implement the solver in the MATLAB virtual platform for a single core, and configure the OLTC to meet a final target reliability of 0.8 at a target lifetime of 5 years. The activation period of the OLTC is 30 days. Fig. 7 shows, in the first plot, the controlled reliability curve for the target core, where the final value is above the target. The second and third plots show the voltage and temperature, respectively. Both the controlled and the average values are reported. For the two time intervals highlighted in the figure, we assume that the device is active but not used, so that temperature and voltage are at their minimum values. In this case the device leaves a reliability margin unexploited, and the OLTC reacts by providing higher targets in the subsequent intervals.
B. Optimal Versus Heuristic
With the virtual platform described in Section VII, we compare the optimal and WARM policies. We generate two random input traces: a trace of required frequencies f req and a trace of required quality flags H/L. We then provide these traces as input to the simulator. Fig. 8 shows the average temperature and average frequency of the four cores over the simulated LI. The values of T LTC and f LTC for this simulation are, respectively, 40 °C and 1400 MHz (reported with a black straight line in the figure), for a LI with a duration of 200 SIs. The figure shows that both policies are effective in meeting the reliability-induced average constraints, since the average temperature and frequency are below the constraints at the end of the LI. Moreover, we evaluate the performance of the two policies using the metric δ(q) defined in [26] and given in (35), where q can be H or L alone, or both (indicated with a "*").
Metric δ measures how close the performance provided by the policy is to the required one, over the whole LI. The closer it is to 1, the better. We run the two policies (optimal and WARM) on a trace of required frequencies equal to the maximum on a LI with 200 SIs. Then, we vary the average target temperature T LTC and frequency f LTC and the percentage of H tasks in the trace. The results in the table allow us to conclude that the proposed heuristic is effective in approximating the results given by convex optimization. When the convex approach gives a better score, the heuristic is within 5% error on average compared to the convex approach. Also, we verify that, on the simulation platform, the heuristic is more than 400× faster than the convex optimization approach. This confirms that WARM is feasible to implement on a real system for runtime management.
C. Comparison With State-of-the-Art
In this section we compare the proposed WARM technique against the state-of-the-art techniques in [26] and [49] to highlight the advantages of WARM. The two techniques are both effective in guaranteeing the predefined target lifetime, but they have the following limitations. The technique in [49] cannot distinguish between applications with different quality requirements, and only fixes a maximum bound on operating voltage and frequency. As a result, it may degrade the user experience. We denote this technique as F max . The technique in [26] can distinguish H and L applications, but temperature is only observed, so it may violate the constraint on average temperature. This would also affect performance, as the control in the LTC would lower the constraint on average voltage/frequency for future LIs. We denote this technique as T obs . Fig. 9 shows a comparison of the three techniques on a random trace of tasks with 10% of H tasks [26]. In this experiment, we set T LTC = 45 °C and F LTC = 1600 MHz, on a simulated LI of 200 SIs (indicated as jiffies, which correspond to scheduling ticks). The results show two key advantages of the new heuristic over the previous two. First, F max cannot meet the desired performance whenever the required frequency for H tasks is higher than F LTC . This would result in degradation of user experience. Second, both F max and T obs violate the constraint on temperature, as the value of average temperature at the end of the LI is higher than T LTC . Finally, we show that our proposed technique can adapt to temperature variations. Looking at the traces of applied frequencies (central column), we observe that they are different for WARM and T obs . This is because WARM employs task migration to exploit higher performance from cooler cores while keeping the temperature below the limit critical for reliability.
In the next experiment, we analyze temperature violations in more detail. We execute a trace of required frequencies equal to the maximum for a LI with 200 SIs, and we keep the average target frequency equal to the maximum f LTC = 2000 MHz. Then we vary the average target temperature T LTC and the percentage of H tasks. For each policy, we count the number of cores for which, at the end of the LI, the average temperature is within 5% of the target. Out of a total of 40 cases, WARM succeeds in 35, which is 87.5% better than the state-of-the-art. The detailed results are reported in Table II.
D. Implementation Overhead
Table III reports a detailed evaluation of all the overheads involved in the implementation of WARM on the target Odroid board. All the overheads are measured with the WARM infrastructure running with a single LITTLE core active at minimum frequency. All results are averaged over 100 samples, with a standard deviation lower than 5%. In the following, we describe each of the profiled actions. We get the values of the temperature sensors from the original device driver thermal_exynos.c. The values are stored in a variable that is shared with the Reliability Module. From the moment in which the value is recorded until it is available to the reliability module, 160 cycles elapse. Similarly, the integrated voltage sensors are read in 170 cycles. In the WARM Governor itself, the STC algorithm runs for 74 cycles. The action of switching frequency, which is performed by any governor, takes 16 824 cycles (corresponding to 8.4 μs on average). All the previous overheads are obtained by sampling the ARM cycle counter in kernel space.
Fig. 9. Comparison between WARM heuristic and state-of-the-art policies Fmax [49] and Tobs [26].
In user space, we record the execution time of the TC routine, which updates the medium term temperature targets and migrates tasks. The time elapsed is 1.5 ms. Considering that the TC activates every 1 s, this represents a time overhead of only 0.15%. Also, every second the application manager passes to kernel space the ID of the application currently active in the foreground. This operation takes 10 214 cycles (corresponding to 5.1 μs on average). Similarly, we record the LTC execution time, which is 342 ms. Considering that the LTC activates on the order of days, this is a very low overhead. In user space, timing overheads are measured using the gettimeofday function. In this paper we target mobile CPUs, which today can have 8 cores (like the ARM big.LITTLE processor in the Odroid XU3). Given the low overhead of the implemented solution, reported in Table III, it is possible to implement it on most mobile devices.
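The overhead figures quoted above can be cross-checked with simple arithmetic. The sketch below recomputes the TC duty cycle and the clock rate implied by the reported cycle and time pairs. The implied clock of about 2 GHz is an inference from those pairs and is not stated in the text, and the 30-day LTC period is taken from the OLTC experiment as an example. The function names are illustrative.

```python
def duty_cycle_percent(exec_time_s, period_s):
    """Share of time spent in a periodic routine, as a percentage."""
    return 100.0 * exec_time_s / period_s


def implied_clock_hz(cycles, elapsed_s):
    """Clock rate that makes a cycle count match a measured duration."""
    return cycles / elapsed_s


if __name__ == "__main__":
    # TC: 1.5 ms of work every 1 s.
    print(duty_cycle_percent(1.5e-3, 1.0))
    # LTC: 342 ms of work, assuming one activation every 30 days.
    print(duty_cycle_percent(342e-3, 30 * 24 * 3600))
    # Frequency switch: 16 824 cycles reported as 8.4 microseconds.
    print(implied_clock_hz(16824, 8.4e-6))
    # Foreground ID hand-off: 10 214 cycles reported as 5.1 microseconds.
    print(implied_clock_hz(10214, 5.1e-6))
```

Both cycle counts map to their reported durations at roughly the same clock rate, which suggests that the two measurements were taken under the same conditions.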
E. Benefits of Task Migration
In this section we present the benefits of including the TC and its task migration in the reliability management policy with respect to the previous technique proposed in [27]. To this end, we present the effects of both inter- and intracluster migration in the ARM big.LITTLE architecture.
For intracluster migration, we execute a single CPU-intensive task on a big core. To isolate the effect of temperature control through migration, we keep the frequency constant at the maximum value and we keep only two cores active (core4 and core7). First, we employ the technique in [27], which is static (i.e., it cannot leverage reliability-aware migration). The right plot of Fig. 10 shows the reliability curves for the two cores, which turn out to be unbalanced. In this and the following experiments, to derive the reliability curves in a reasonable experimental time, we activate the LTC every 10 s. Then, in the model we update reliability as if 30 days had passed [27]. We repeat the same experiment while activating the migration policy. Thanks to this, WARM is able to migrate the task between the two cores when the temperature exceeds the limit critical for reliability. The reliability curves for this case are shown in the left plot of Fig. 10. The migration policy of WARM keeps the degradation among cores more balanced, thus enabling a more efficient utilization of the performance budget of a multicore platform. Therefore, reliability-aware intracluster migration by itself may increase the lifetime of a multicore platform by more than 1 year, for a target reliability of both 0.8 and 0.6.
For intercluster migration, we execute the Vellamo Metal benchmark, labeled as an H application. This benchmark evaluates the performance of the CPU and provides a final score, which we use for comparison. We first execute the benchmark with the technique from [27]. Since this technique has static allocation, by default it executes the benchmark on the master cluster (which is the LITTLE one). We repeat the experiment while activating WARM, which can also leverage task migration. WARM, thanks to the application manager, recognizes that Vellamo Metal is an H application and automatically places it in the big cluster. Fig. 11 reports the results.
The plot on the right shows the reliability curves for the most degraded core of the big cluster. The plot on the left reports the final Vellamo Metal score in the two cases. The result is that the technique from [27] leaves a reliability margin unexploited, at a significant cost in terms of performance. The proposed technique, in this case, achieves a 100% performance improvement.
F. Comparison With Standard Governors
In this section we compare our WARM governor against standard governors. First, we want to show that the performance provided by WARM for critical applications is comparable to that provided by the performance governor. To this end, we execute a set of popular benchmarks for Android: AnTuTu, Vellamo Browser, Vellamo Metal, GeekBench and CFBench. These benchmarks provide a score at the end of execution that we use as a comparison metric, as in [28]. We first execute the benchmarks with the performance governor (giving the highest score) and with the powersave governor (giving the lowest score). Finally, we execute them with WARM. Fig. 12 shows the benchmark scores (normalized as the increase with respect to the score obtained with the powersave governor) obtained with the three configurations. In all cases, the score obtained with WARM is within 4% of that obtained with the performance governor.
In the last experiment we show an example of how the implemented WARM technique behaves with real applications when compared to standard governors. For this experiment we run the AnTuTu benchmark with different frequency governors, including our WARM reliability governor. For the execution with the reliability governor, AnTuTu is labeled as an H application. Together with AnTuTu, we execute a background program that forks and allocates a periodic task on each core, labeled as L. This mimics the presence of background activity on a mobile device. Fig. 13 shows the reliability curves in the case of the powersave (blue), reliability (pink), ondemand (light blue), interactive (green), and performance (red) governors. The performance and powersave governors give, respectively, the shortest and longest lifetime. The WARM Governor is able to meet the reliability target of 0.8 at the target lifetime of 5 years (corresponding to 60 LIs of 30 days each). Ondemand and interactive both fail to meet this constraint. Fig. 14 shows the corresponding scores obtained with each governor, normalized to the maximum, which is obtained with the performance governor. The result shows that WARM not only achieves the target reliability, but also provides performance very close to the maximum.
IX. CONCLUSION
In this paper, for the first time, we develop an optimal controller for comprehensive temperature, performance and reliability management that leverages the CVX convex solver. We also show that convex solvers are not suited for implementation on a real device, due to computational overhead. Motivated by this, we develop WARM, a multilevel heuristic controller that approximates the optimal solution within 5% on average (18% in the worst case), while executing more than 400× faster. WARM leverages task allocation to control temperature, while exploiting voltage and frequency scaling to provide maximum performance to critical applications. We show that, since temperature is a major concern for degradation, WARM meets temperature constraints within 5% in 87.5% more cases than the state-of-the-art. Also, WARM task allocation achieves up to one year of lifetime improvement for a multicore platform. We show that it can achieve up to 100% performance improvement on cluster architectures, while guaranteeing the reliability target. Finally, we show that it achieves performance within 4% of the maximum for a broad range of applications, while meeting the reliability constraints.
