Reliability is a major concern for nanoscale CMOS circuits. Degradation phenomena such as Electromigration, Negative Bias Temperature Instability and Time Dependent Dielectric Breakdown worsen with transistor scaling. Dynamic Reliability Management (DRM) techniques reduce reliability loss at runtime by constraining operating points, but they face the challenge of limiting user experience degradation while meeting a lifetime target. In this work we propose a sensor-based hierarchical controller for multicore processor DRM that exploits the large gap between the time scales of workload variations and reliability loss. We improve performance and user experience by locally relaxing reliability-induced operating point constraints, while still meeting them over the large time windows relevant for reliability. Compared with the state of the art, our solution guarantees timely execution of 100% of latency-critical applications and achieves a 4% performance improvement over the whole lifetime.
INTRODUCTION
Technology scaling has made modern integrated circuits more susceptible to degradation phenomena such as Negative Bias Temperature Instability (NBTI), Electromigration (EM) and Time Dependent Dielectric Breakdown (TDDB) [9]. Degradation depends on many process and environmental factors, but can be controlled by managing temperature and voltage. Degradation worsens under continued stress [18], while short spikes in temperature and voltage do not affect reliability much. Aging effects are described by mathematical models in terms of Mean Time To Failure (MTTF) [15] or reliability [20].
We focus on TDDB [6, 16, 20], but our solution can apply to other phenomena as well. TDDB is a degradation phenomenon that results in a low-impedance path through the transistor gate dielectric, causing high leakage current that leads to failures. With the scaling of technology, increasing the design margin and binning the chips are becoming costly strategies. Runtime management techniques can overcome this by dynamically changing chip operating points [15]. Modern processors can exploit dynamic voltage and frequency scaling (DVFS) to modify the degradation rate [18, 21]. Degradation can be estimated through voltage and temperature dependent mathematical models, but recent works have also proposed devices that directly sense degradation [13, 14] and are more accurate than temperature and voltage based estimation. In [14], TDDB-specific sensors are presented.
DAC '13, May 29 - June 07, 2013, Austin, TX, USA. Copyright 2013 ACM 978-1-4503-2071-9/13/05.
These elements, together with a control algorithm, are the basis for Dynamic Reliability Management (DRM). DRM has been proposed in [15, 10, 21] as a mechanism to trade off system performance against reliability margin. DRM policies guarantee a target reliability within the predefined lifetime of the system. However, the long time scale and the non-reversibility of degradation pose a challenge for the DRM control algorithm. To alleviate this issue, [21] proposes a predictive control for a single-core CPU, based on a mathematical model. Many systems today use multicore processors, from servers to smartphones and tablets. Embedded devices exploit multicore CPUs for data and computing intensive applications with varied requirements in terms of performance and QoS [1, 2, 8]. Android-based systems use the intent mechanism to describe and communicate to the hardware the urgency/quality of a task to be executed. Not satisfying these requirements causes significant user experience degradation. Little has been done for the reliability of multicore CPUs [5, 17, 19]. The main problem with state-of-the-art DRM solutions is that they control reliability by fixing a maximum limit on the operating conditions, disregarding the potential degradation of the user experience.
In this work we propose a novel DRM policy for multicore platforms. The proposed policy is based on a two-level controller, composed of a Long Term Controller and a Short Term Controller. The two levels operate on two different time scales, which we call Long Intervals, corresponding to the days that it takes for reliability to change, and Short Intervals, corresponding to OS scheduling ticks. Our controller monitors system reliability on a long time scale and adapts operating conditions to workload phase changes on a short time scale, with the goal of meeting a target reliability within a predefined target lifetime. Our solution uses aging sensors to improve reliability control robustness. The main novelty we introduce is the Borrowing Strategy, through which our solution is able to locally relax reliability-induced operating point constraints, while still meeting them over the large time windows relevant for reliability loss. This is a key feature for systems like smartphones and tablets, which emphasize user experience. We compare our policy against the state of the art and show that with our solution 100% of latency-critical applications meet their needs, with an overall performance improvement of 4% over the whole lifetime.
RELATED WORK
Traditionally, chips are designed under the assumption of worst-case conditions [10]. This approach ignores the dynamic nature of actual operating conditions, which results in an overly conservative, performance-limiting device. Srinivasan et al. [15] first introduced DRM as a technique where the processor can dynamically respond to changing application behavior to maintain its lifetime reliability target, by means of dynamic voltage and frequency scaling (DVFS). This was a significant enhancement over previous worst-case reliability qualification methodologies. Blome et al. [4] extended the approach to monitor and control the impact on lifetime reliability, through thread scheduling and DVFS. The authors show how to leverage the slack between the typical degradation and the worst-case one to improve performance in periods of high peak demand. Karl et al. [10] explored DRM using a systematic model to improve performance. Both [4, 10] assume that the future workload is equal to the previous one. Therefore they are very sensitive to sudden workload variations.
Zhuo et al. [20] proposed a process variation and temperature-aware oxide reliability model, which can estimate reliability from temperature and voltage history. In [21] the authors propose a DRM framework that extends this model to periodically predict the future value of reliability at the target lifetime. Based on the difference between the predicted reliability and the target one, the controller sets a maximum operating voltage. This approach is less sensitive to workload variations than previous works, because it exploits a confidence-based workload estimation. Since the policy sets the maximum voltage, it is not able to guarantee speed bursts for tasks with high performance demands, causing user experience degradation. Moreover, it is entirely model-based, relying only on temperature and voltage readings, and does not use sensors to estimate aging. Therefore it has high model uncertainty.
Singh et al. [14, 13] propose oxide degradation sensors and a sensor-based DRM approach. Degradation sensors are non-intrusive monitors designed to be integrated in modern CMOS circuits. The authors have designed, manufactured and tested these devices. In the rest of this paper, we refer to them as degradation sensors or aging sensors. Monitoring with these sensors helps mitigate model inaccuracy. Since reliability models are based on stress measurements, a strictly model-based DRM policy tends to be very pessimistic [14].
The presented approaches for DRM ignore the fact that aging is observable on a large time scale, and that degradation is affected by average, rather than immediate stress. Therefore, they neglect the opportunity for short performance bursts to meet quality requirements of real applications. Furthermore, none of the previous work considers multicore platforms, a key component in most computing devices today. 
CONTROLLER ARCHITECTURE
The target platform is a homogeneous multiprocessor with N cores, with per-chip voltage setting and per-core frequency control. Each core is single threaded and has its own degradation sensors. Tasks are assigned in FIFO manner. The controller exploits voltage and frequency settings as knobs to trade off performance, while meeting the target temperature and reliability within a predefined lifetime. Figure 1 shows the basic building blocks of the proposed architecture, consisting of Application Manager, Long Term Controller and Short Term Controller.
Application Manager (AM): allocates tasks in FIFO manner and communicates the requested frequency fREQ and the quality requirement for the task execution to the Short Term Controller.
Long Term Controller (LTC): samples data from aging sensors at the beginning of each Long Interval, monitors the degradation status and calculates the average temperature and voltage. It predicts future reliability using these parameters. Since reliability loss occurs on a long time scale [15], we consider Long Intervals to be on the order of days. Based on the difference between predicted and target reliability, it computes a reference voltage VLTC and a reference temperature TLTC, which are the inputs for the Short Term Controller. The constraint on reliability is met if the mean applied voltage in the Long Interval, VLI, is less than or equal to VLTC and the temperature is below TLTC.
Short Term Controller (STC): receives fREQ and the quality requirement for the allocated tasks from the Application Manager, and VLTC and TLTC from the Long Term Controller. Based on that, it applies the Borrowing Strategy, adjusting voltage and frequencies at each scheduling tick given the tasks' quality requirements, while keeping the mean applied voltage inside the Long Interval, VLI, lower than VLTC, and the temperature below TLTC. The Short Term Controller can be coupled with state-of-the-art thermal management techniques to handle thermal emergencies [3, 12], given the thermal constraint from the LTC. The operations performed by the blocks are discussed in more detail in the following subsections.
Application Manager:
In modern operating systems, such as Android, the application can request a certain level of hardware and software service to provide a given QoS to the final user. Therefore, we characterize each task as either Highly critical (H) or Less critical (L) in terms of latency and user experience. Executing H tasks at a frequency lower than fREQ causes user experience degradation. This information allows the Short Term Controller to adjust its Borrowing Strategy. General purpose workloads for embedded devices do not contain profile information. Therefore the Application Manager selects fREQ = fMAX for a running task and fREQ = fMIN for an idle period.
Long Term Controller:
The diagram in Figure 2 shows the Long Term Controller. The Long Term Controller samples data from the aging sensors at the beginning of a new Long Interval, separately for each core.
The sens2R block estimates the reliability Ri for the i-th core from the aging sensor readings Si. Ri at time t is a number in [0, 1] indicating the probability that the system will not fail before time t. It is a measure of the system degradation status [20]. For example, TDDB sensors, based on a ring oscillator whose frequency increases as degradation takes place [13, 14], give statistically significant information on the aging status of their core [14]. In [13], the authors map sensor readings for NBTI degradation to the system aging status. Based on that, they dynamically manage operating conditions to minimize NBTI degradation. In this work we assume that there exists a mapping between the output of TDDB sensors [14] and Ri, and refer to [13, 14] for further discussion.
The Voltage Monitor and the Temperature Monitor keep the voltage and temperature history of the multicore platform by estimating the mean voltage/temperature that are going to be applied from the current time instant until the target lifetime. These values are V̄ and T̄, and they are calculated as an exponential moving average of the past voltages/temperatures, as:
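A form of this moving average consistent with the symbols defined next is sketched below. Which of the two terms is weighted by the factor α is an assumption (here α weights the most recent Long Interval), so this should be read as a plausible reconstruction rather than the authors' exact expression.

```latex
% Assumed form: alpha weights the newest sample, (1 - alpha) the history.
\bar{V}_k = \alpha_V \, V_{LI,k-1} + (1 - \alpha_V)\, \bar{V}_{k-1}
\qquad
\bar{T}_k = \alpha_T \, T_{LI,k-1} + (1 - \alpha_T)\, \bar{T}_{k-1}
```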
Where k identifies the k-th Long Interval, αV and αT are the weighting factors, V̄k−1 and T̄k−1 are the values at the previous Long Interval, and VLI,k−1 and TLI,k−1 are the mean applied voltage/temperature in the previous Long Interval.
The Reliability Predictor receives Ri, V̄ and T̄ and computes the predicted reliability RPi using the model presented in [21]. RPi is the reliability that we would have at the target lifetime given a current reliability equal to Ri, supposing that from the present time on we apply a voltage equal to V̄ and a temperature equal to T̄.
The PID controller, as in [10, 21], receives RPi and the target reliability at the target lifetime Rt. Based on their difference, it calculates VLTC so as to asymptotically minimize the tracking error. For the i-th core at the k-th Long Interval, we have ek = Rt − RPk and therefore:
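A standard discrete-time PID law matching the symbols used here is sketched below. The rectangular integration, the backward-difference derivative and the sign convention are assumptions: with the error defined as target minus predicted reliability, the gains would have to be negative for a low predicted reliability to lower the reference voltage, as the text describes.

```latex
% e_k = R_t - RP_k; discretization and gain signs are assumptions.
V_{LTC,k} = K_P \, e_k
          + K_I \, \Delta_{LI} \sum_{j=0}^{k} e_j
          + K_D \, \frac{e_k - e_{k-1}}{\Delta_{LI}}
```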
where KP, KI, KD are the PID parameters and ∆LI is the duration of a Long Interval. For example, if the system was subject to high temperature and voltage during the previous Long Interval, RP will be low and the PID outputs a VLTC lower than the previous one. If it was subject to low temperature and voltage, the PID outputs a VLTC higher than the previous one. The Voltage Selector selects the minimum of the per-core VLTC values computed by the Long Term Controller, to guarantee the reliability of the most degraded core (the one with the lowest RP). Only one VLTC is needed, as the target platform has a single voltage island. This work can be easily generalized to multiple VLTC values. Similarly, the Temperature Selector outputs the reference temperature TLTC.
Short Term Controller:
Figure 3 shows the block diagram of the Short Term Controller. The Short Term Controller receives VLTC and TLTC from the Long Term Controller, and fREQ and the H/L flags from the Application Manager. Based on these parameters, it applies the Borrowing Strategy, selecting the frequencies fAPP and the voltage VAPP to be applied at each Short Interval to meet the task quality requirements, while keeping VLI less than or equal to VLTC. Note that frequency selection should be coordinated with dynamic thermal management (DTM). If the temperature is higher than TLTC, a lower frequency is selected. Given that there has been a lot of work on DVFS for DTM, here we focus only on the reliability aspects of voltage selection. The key to the Borrowing Strategy is the computation of the reference voltage VSTC:
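One form of this reference voltage consistent with the definitions that follow spreads the remaining voltage budget evenly over the rest of the Long Interval. It is a reconstruction under that assumption, not necessarily the authors' exact Equation 3.

```latex
% Voltage that, if held for the remaining time, makes the mean over the
% whole Long Interval equal to V_LTC (assumed form).
V_{STC,l} = \frac{V_{LTC}\,\Delta_{LI} - V_{LI,l}\; t_{LI}}{\Delta_{LI} - t_{LI}}
```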
Where l identifies the current Short Interval, VLI,l is the mean voltage applied from the beginning of the Long Interval, and tLI is the time elapsed since the beginning of the Long Interval. For each core executing an L task, the Short Term Controller selects fAPP = fSTC, while for each core executing an H task, it selects fAPP = fMAX. For an idle core, it selects fAPP = fMIN. Since the system has a single voltage island, in order to execute the most performance-heavy task, the controller selects the voltage VAPP corresponding to the maximum fAPP. The other cores have the same applied voltage, but run at a lower or equal frequency. The Borrowing Strategy is based on VSTC. If the controller applies a voltage lower than VSTC, VSTC tends to increase, allowing the system to go faster in the next intervals. If, conversely, an H task occurs and the controller applies a voltage higher than VSTC, VSTC tends to decrease, in order to "repay the loan".
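The running mean that keeps track of the loan can be sketched as a time-weighted average. The exact expression is an assumption, with VAPP,l denoting the voltage applied during the l-th Short Interval:

```latex
% Time-weighted running mean of the applied voltage (assumed form).
V_{LI,l+1} = \frac{V_{LI,l}\; t_{LI} + V_{APP,l}\, \Delta_{SI,l}}{t_{LI} + \Delta_{SI,l}}
```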
Where ∆SI,l is the duration of the l-th Short Interval. As a Short Interval ends, VSTC is updated through Equation 3 and the Short Term Controller performs a new frequency/voltage selection. If VLTC − VLI at the end of a Long Interval is non-zero, the system has either not fully exploited the available reliability margin (if positive) or violated the reliability constraint for the current Long Interval (if negative). Therefore, this difference is added to the VLTC computed for the next Long Interval, so as to keep track of under/over-utilization.
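The borrowing behaviour described above can be illustrated with a minimal single-core sketch in Python. It is not the authors' implementation: the class and method names are hypothetical, the controller works directly in the voltage domain (an L task is run at VSTC, an H task at VMAX, an idle period at VMIN), the reference voltage is clamped to the platform range, and the remaining-budget form of VSTC is the assumed one.

```python
from dataclasses import dataclass


@dataclass
class ShortTermController:
    """Hypothetical single-core sketch of the Borrowing Strategy."""
    v_ltc: float          # reference voltage from the Long Term Controller (V)
    long_interval: float  # duration of the Long Interval (arbitrary time unit)
    v_min: float = 0.8    # platform voltage range quoted in the text
    v_max: float = 1.4
    v_li: float = 0.0     # mean voltage applied so far in this Long Interval
    t_li: float = 0.0     # time elapsed in this Long Interval

    def v_stc(self) -> float:
        """Voltage that, held for the remaining time, makes the mean equal v_ltc."""
        remaining = self.long_interval - self.t_li
        if remaining <= 0:
            return self.v_min
        budget = self.v_ltc * self.long_interval - self.v_li * self.t_li
        # Clamping to the platform range is an assumption of this sketch.
        return min(self.v_max, max(self.v_min, budget / remaining))

    def step(self, task: str, dt: float) -> float:
        """Apply one Short Interval of length dt; task is 'H', 'L' or 'idle'."""
        if dt <= 0:
            raise ValueError("dt must be positive")
        if task == "H":
            v_app = self.v_max      # latency-critical: always served at full speed
        elif task == "L":
            v_app = self.v_stc()    # less critical: follow the reference voltage
        else:
            v_app = self.v_min      # idle period
        # Time-weighted running mean of the applied voltage.
        self.v_li = (self.v_li * self.t_li + v_app * dt) / (self.t_li + dt)
        self.t_li += dt
        return v_app

    def carry_over(self) -> float:
        """Margin (positive) or violation (negative) added to the next v_ltc."""
        return self.v_ltc - self.v_li


if __name__ == "__main__":
    stc = ShortTermController(v_ltc=1.0, long_interval=10.0)
    for task in ["idle", "idle", "H", "L", "L"]:
        v_app = stc.step(task, 1.0)
        print(task, round(v_app, 3), round(stc.v_stc(), 3))
```

In this sketch, idle periods raise the reference voltage above VLTC (credit is accumulated), while an H burst lowers it again, which is the "loan" mechanism in miniature.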
RESULTS
The target platform is composed of 4 homogeneous cores with per-chip voltage and per-core frequency settings. The voltage ranges from 0.8 V to 1.4 V and the frequency ranges from 223 MHz to 532 MHz. Our reference is the STMicroelectronics xSTsim architecture [11]. This platform is composed of a General-purpose Processing Element (GPE) acting as host processor and Processing Elements (PEs) acting as streaming engines. The GPE is an ST231 processor and the PEs are programmable processors with a simple ISA extended with SIMD and vector mode instructions. The platform addresses the needs of data-flow dominated, highly computationally intensive tasks, typical of many embedded products.
For short term simulations we test our policy with xSTsim executing the Inverse Discrete Cosine Transform (IDCT) [11] over a single Long Interval on random frame sequences. IDCT is a representative multimedia computational kernel, used in MPEG2 and JPEG decoding. Each frame is considered as a task. The GPE acts only as a dispatcher for the PEs and performs no computation. We have modified the application to mark each frame as either H or L, and to control the percentages of H and L tasks. The reliability control sets the operating voltage and frequency for the PEs by following the Borrowing Strategy.
For long term simulations we have developed a simulation infrastructure in Matlab, following the characteristics of the described platform. With this framework we can simulate the reliability model presented in [20] over the whole system lifetime and evaluate performance. The workload is modeled as a sequence of tasks with their own requested frequency and H/L flag. The task sequences are generated to reflect different user profiles [7], by varying the percentage of idle and busy periods and the percentage of H and L tasks. Table 1 shows the values of the parameters used in our simulations. The PID gains are obtained through the Ziegler-Nichols (Z-N) open-loop method. The system is run at VMAX for the entire lifetime and its response, i.e. the reliability curve, provides the parameters for the Z-N method. ∆LI is set to 25 days, to keep simulation times reasonable. αV and αT are both 0.1. This value was chosen after extensive tests of the model with different workload and temperature profiles.
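The text names the Ziegler-Nichols open-loop method without listing the tuning rule. As a hedged illustration, the classic reaction-curve rule for a PID sets KP = 1.2·τ/(K·L), TI = 2L and TD = 0.5L, where K, L and τ are the static gain, dead time and time constant read from the open-loop step response. The function and argument names below are hypothetical, and the conversion to parallel-form gains (KI = KP/TI, KD = KP·TD) is an assumption, since the paper does not state which PID form it uses.

```python
def ziegler_nichols_open_loop(process_gain: float,
                              dead_time: float,
                              time_constant: float) -> dict:
    """Classic Ziegler-Nichols reaction-curve PID tuning (parallel-form gains).

    process_gain  -- static gain K of the open-loop step response
    dead_time     -- apparent dead time L
    time_constant -- apparent time constant tau
    """
    if process_gain == 0 or dead_time <= 0 or time_constant <= 0:
        raise ValueError("need non-zero gain and positive dead time / time constant")
    kp = 1.2 * time_constant / (process_gain * dead_time)
    ti = 2.0 * dead_time   # integral time
    td = 0.5 * dead_time   # derivative time
    return {"KP": kp, "KI": kp / ti, "KD": kp * td}


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    print(ziegler_nichols_open_loop(process_gain=2.0, dead_time=1.0, time_constant=10.0))
```

A process whose output falls when its input rises has a negative K, and the rule then returns negative gains.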
Comparison With State-of-the-Art
We compare our policy against the state-of-the-art technique presented in [21]. In that work, the authors present a reliability management framework which also uses the model in [20]. The framework periodically computes the predicted reliability RP and exploits it to set a maximum operating voltage. That work has two main limitations:
• It limits the maximum voltage, causing user experience degradation when executing H tasks. We show how our policy, thanks to the two-level controller and the Borrowing Strategy, overcomes this limitation by following the task quality requirements, while still meeting the target reliability.
• It does not use aging sensors, and only exploits temperature and voltage readings for calculating reliability. We show how the use of aging sensors makes the reliability control more robust with respect to model variations.
Since the policy in [21] is for a single core, the comparison is conducted referring to this scenario. We assume the core to have a target reliability Rt = 0.8 and a target lifetime of 5 years. In the following, we denote our policy as LTST (Long Term - Short Term), and the policy in [21] as Zhuo (from the name of the author).
In Figure 4 we compare the voltage traces that we obtain with LTST and Zhuo for the execution of IDCT on a random sequence of frames in a Long Interval. VREQ is 1.4 V for a running task and 0.8 V for an idle period. Executing tasks at a higher voltage yields better quality and higher performance. We assume that VLTC = 1 V. Zhuo is able to raise the voltage to at most 1 V. LTST achieves better performance for both H and L tasks. In the former case, LTST boosts the voltage to VMAX, and in the latter case, the voltage is set to VSTC, which is already higher than Zhuo's 1 V thanks to the Borrowing Strategy.
Figure 5 shows the comparison in terms of VLI. We show three cases, in which we vary the percentage of H tasks. In all of them LTST achieves a higher VLI, and thus higher performance. In the cases of 0% and 50% of H tasks, LTST also keeps VLI lower than VLTC, respecting the reliability constraint for the current Long Interval. This means that LTST fully exploits the available reliability margin, while Zhuo does not. In the case of 100% of H tasks, VLI is higher than VLTC. This is not a problem, since the Borrowing Strategy subtracts the difference VLI − VLTC from the VLTC of the next Long Interval. By doing this, the next Long Interval is slightly penalized to recover from this violation.
Figure 6 shows the comparison in terms of performance over the whole system lifetime, evaluated with the long term simulator. The comparison refers to different user profiles [7]. A higher percentage of Busy time denotes a period of more intense user activity. For this evaluation we define a Performance Metric γ as:
where B is the set of tasks (Busy), ∆tB,i is the duration of the i-th task, TB is the total Busy time and TI is the total Idle time. Variable γ measures the frequency reduction with respect to the requested one for the executed tasks. Lower values of γ are better. Even if both solutions respect the reliability target, Figure 6 shows that our policy presents a gain of 4% in terms of γ with respect to Zhuo over the entire lifetime. Moreover, our policy executes 100% of H tasks at their fREQ, guaranteeing high quality execution to the final user. Zhuo, instead, slows down all the H tasks, causing significant user experience degradation.
To simulate the absence/presence of aging sensors, we distinguish the model that describes the real degradation, MREAL, from the model that is adopted inside the reliability controller, MCTRL (through which R and RP are computed). In the case of Zhuo, no aging sensors are present. Therefore MCTRL ≠ MREAL and, as a consequence, the controller takes decisions based on both a wrong current reliability R and a wrong predicted reliability RP. In LTST we have aging sensors that can give a better estimate of R. Therefore, only RP is inaccurate. We define LTREAL and LTCTRL respectively as the lifetime obtained by controlling the system (with constant voltage and temperature) with MREAL and MCTRL. Therefore MCTRL leads to an error on lifetime prediction, LTERR, equal to:
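A natural candidate for this error is the relative difference between the two lifetimes, sketched below. Both the sign convention and the normalization by the real lifetime are assumptions, chosen only so that the result can be expressed as a signed percentage.

```latex
% Assumed definition: relative lifetime error, in percent.
LT_{ERR} = \frac{LT_{CTRL} - LT_{REAL}}{LT_{REAL}} \times 100\%
```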
LTERR measures the difference between MCTRL and MREAL. Figure 7 shows the comparison between LTST and Zhuo in terms of final reliability for different values of LTERR, expressed in percentages. In each case, LTST obtains a final reliability closer to the target than Zhuo does (final reliability is just 1% less than the target with -48% LTERR). For this reason a sensor-based reliability control is significantly more robust.
In Figure 8 we show the benefits of adopting our solution for a platform with 4 cores, and its robustness with respect to long term temperature fluctuations. The target lifetime is 3 years, and the target reliability is 0.8. We set the nominal temperature TNOM to 65°, 85° and 105°. At each Long Interval, the temperature is given by TNOM ± U(−20, +20), where U is a uniform distribution, in order to simulate long term temperature fluctuation. For all the cases and cores, the target reliability is met and 100% of H tasks are guaranteed, whereas if control is disabled, the target reliability is violated. Figure 9 shows the reliability curve and temperature vs. time for the 2nd core and TNOM = 85°. The controller easily meets the target reliability of 0.8 by carefully managing the tradeoff between better performance (higher temperatures) and reliability.
CONCLUSION
As technology scaling continues, many phenomena, such as TDDB, impact system reliability. DRM techniques have been proposed to guarantee system lifetime by constraining operating conditions, but they can cause user experience degradation. In this paper we propose a two-level controller: the first level acts on Long Intervals to set voltage and temperature constraints according to reliability, and the second sets operating conditions over Short Intervals to meet quality requirements, while keeping the average voltage lower than or equal to VLTC and the temperature below TLTC. Our solution uses aging sensors to improve reliability control robustness. We compare our policy against the state of the art and show that with our solution 100% of latency-critical applications meet their needs, with an overall performance improvement of 4% over the whole lifetime.
ACKNOWLEDGMENTS
This work was supported, in part, by the EU FP7 ERC Project MULTITHERMAN (GA n. 291125) and the EU FP7 Project Phidias (GA n. 318013).
