In this article, we propose a new dynamic reliability management (DRM) technique at the system level for emerging low-power dark silicon manycore microprocessors operating in the near-threshold region. We mainly consider electromigration (EM) failures. To leverage EM recovery effects, which were ignored in the past, at the system level, we propose a new equivalent DC current model that accounts for recovery effects for general time-varying current waveforms so that existing compact EM models can be applied.
INTRODUCTION
Reliability becomes more and more significant as technology scales to 10nm and below. Future chips are expected to show signs of reliability-induced aging much faster than previous generations, so reliability issues have to be addressed in a cross-layer way. Because of the failure of Dennard scaling [1], power density can no longer be kept constant (and in fact starts to increase) as technology scales. The consequence is the emergence of so-called dark silicon manycore microprocessors, in which only a fraction of the cores on the chip can be powered on at a time due to power and temperature limitations. Recently, architecture researchers have begun focusing on the development of dark silicon manycore devices with as many as 100 or 1000 cores integrated onto a single die. Such manycore systems pose new challenges and opportunities for the power/thermal and reliability management of those chips [2].
To further reduce power for many applications, ultra-low-power designs become necessary. Recent research has explored the sub-threshold region, where CMOS circuits are found to be capable of operating with a supply voltage of less than 200 mV. The theoretical lower limit of Vdd has been determined to be 36 mV [3]. But at such low voltages, leakage power dissipation increases drastically, making the reduction in dynamic power insignificant. The circuit delay also increases rapidly as the supply voltage is scaled down, resulting in decreased operating frequency or performance of the circuits.
For dark silicon manycore processors operating at near-threshold voltage, long-term reliability effects such as electromigration become quite significant. To address the increasing reliability issues, a system-level, run-time approach becomes more appealing. There are some existing works on dynamic reliability management for dark silicon [4, 5]. These works leverage dark silicon manycore processors to save energy while maintaining performance under reliability constraints. They address the runtime management of heterogeneous dark silicon processors and the optimal policy for core status, and they employ dynamic voltage and frequency scaling as the energy-saving technique. However, dynamic reliability management for near-threshold dark silicon processors has not been studied.
In this article, we propose a new dynamic reliability management (DRM) technique at the system level for emerging low-power dark silicon manycore microprocessors operating in the near-threshold region. We mainly consider electromigration (EM) failures. To leverage EM recovery effects, which were ignored in the past, at the system level, we propose a new equivalent DC current model that accounts for recovery effects for general time-varying current waveforms so that existing compact EM models can be applied. The new equivalent DC current is calculated in two steps: first, an equivalent square waveform is calculated so that the peak and terminal stresses are matched; second, the parameterized equivalent DC current is derived in terms of the parameters of the fitted periodic square waveform from the first step. The significance of the new EM current model is that it allows EM recovery effects to be considered at the system level for the first time and thus allows the EM-induced lifetime of chips to be better managed at the system level.
The system-level energy optimization problem considering recovery-aware EM-induced reliability, subject to power and performance constraints, is framed as seeking the best voltage and on/off status for the dark silicon cores. The resulting problem is solved by the State-Action-Reward-State-Action (SARSA) reinforcement learning algorithm. Experimental results on a 64-core near-threshold dark silicon processor show that the new equivalent EM DC current can fully capture the recovery effects at the system level, so that trade-offs between EM lifetime and energy/performance can be easily made. We further show that the proposed learning-based energy optimization can effectively manage and optimize energy subject to reliability, given power budget and performance limits. When the recovery effects are considered, the new optimization method can achieve 8.6X longer lifetime at the cost of 2.0X more energy and 3.3X more performance degradation.
ELECTROMIGRATION MODELING AT SYSTEM LEVELS

Review of EM physics and physics-based models
EM is the physical phenomenon of the migration of metal atoms along the direction of an applied electrical field. Atoms (either lattice atoms or defects/impurities) migrate toward the anode end of the metal wire along the trajectory of the conducting electrons. Over time, a lasting unidirectional electrical load increases the stress that this migration builds up in the line, as well as the stress gradient along the metal line. In some cases, usually when a line is long, this stress can reach a critical level, resulting in void nucleation at the cathode and/or hillock formation at the anode end of the line.
EM effects are mainly modeled with the empirical Black's equation [6] and the Blech limit [7], both of which are heavily used in practice. However, those models are not physics-based, and they are not fully predictive under varying stress conditions or for complicated interconnect wire structures. Additionally, they do not address the inherent redundancy in the power grid networks, which contain the most vulnerable wires in a chip.
To address those problems, a more physics-based compact EM model has recently been proposed for full-chip reliability analysis [8, 9], which is the basis for the EM reliability modeling in this work. The EM development process consists of two phases: the nucleation phase and the growth phase. For the nucleation phase, a closed-form expression for the nucleation time (tnuc) is given as a function of the current density, the temperature, the residual stress of the wire due to thermal and other effects, and other wire geometry and material parameters. The approximate value of the void nucleation time (tnuc) is determined as the instant in time when the stress at the cathode end of the line reaches σcrit, which corresponds well to an analytical formulation of tnuc derived from the approximate solution of the continuity equations for the evolution of vacancy and plated atom concentrations (see, for example, [10]) in the confined 1D line.
Here, j is the current density, T is the temperature, kB is Boltzmann's constant, l is the segment length, EV and ED are the activation energies of vacancy formation and diffusion, f is the ratio of the volumes occupied by a vacancy and a lattice atom, and σcrit is the critical stress needed for nucleation of the failure precursor (void/hillock). σRes is the residual stress of the metal segment from the cooling process and other factors.
The second phase is void growth: voids are formed at tnuc and grow for t > tnuc. The wire resistance starts to increase over time in the growth phase [8].
New equivalent DC current based modeling for EM recovery effects
For EM failures, one important phenomenon is that the EM-induced stress can decrease when the stressing current becomes small. This effect is called the "EM recovery effect", and it is an important transient effect of time-varying currents. Fig. 1 shows how the EM-induced stress changes over time under a periodic current pulse. As we can see, the stress can go down significantly. The net result of this recovery effect is that the EM lifetime of a wire can be extended significantly, as it takes longer for the stress to reach the critical stress.
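The recovery behavior described above can be reproduced qualitatively with a toy simulation. The following is a minimal sketch, not the model used in this work: it assumes a dimensionless 1D stress-evolution equation of the Korhonen type, ∂σ/∂t = ∂/∂x[κ(∂σ/∂x + G)], with blocking (zero-flux) boundaries at both ends of the line. The function name `simulate_cathode_stress`, the normalized parameters, and the explicit finite-volume scheme are all illustrative assumptions.

```python
def simulate_cathode_stress(drive_of_time, n_cells=20, length=1.0,
                            kappa=1.0, dt=1e-3, steps=1000):
    """Return the stress history at the cathode end of a confined 1D line.

    Hypothetical, dimensionless toy model: `drive_of_time(t)` gives the
    normalized EM driving force G(t), which is proportional to the
    current density. All parameter values are illustrative assumptions.
    """
    dx = length / n_cells
    # Stability condition of the explicit scheme.
    assert kappa * dt / dx ** 2 <= 0.5, "time step too large"

    sigma = [0.0] * n_cells          # stress in each cell, initially zero
    cathode_history = []
    for step in range(steps):
        drive = drive_of_time(step * dt)
        # flux[k] is the flux through the face between cell k-1 and cell k.
        # The end faces (k = 0 and k = n_cells) stay at zero: blocked ends.
        flux = [0.0] * (n_cells + 1)
        for k in range(1, n_cells):
            gradient = (sigma[k] - sigma[k - 1]) / dx
            flux[k] = kappa * (gradient + drive)
        for i in range(n_cells):
            sigma[i] += dt * (flux[i + 1] - flux[i]) / dx
        cathode_history.append(sigma[0])   # cell 0 is the cathode end
    return cathode_history
```

Driving the line with `lambda t: 1.0 if t < 0.5 else 0.0` makes the cathode stress build up while the current is on and relax back toward zero once it is switched off, which is the recovery effect in its simplest form.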
However, EM recovery effects were ignored completely in the existing EM models, as most of those models assume a constant current or current density. To mitigate this problem, a physics-based EM recovery model was proposed recently [11, 12] by obtaining an analytical solution of Korhonen's equation, which describes the stress evolution kinetics of EM effects. Although the accuracy of this model is high, it is still too complicated for practical use. For practical chip design, EM assessment and signoff still use simple EM models such as Black's model [6] or the recently proposed, more accurate EM model in (1), which take a constant current and temperature as inputs. In order to handle practical non-DC currents, a simple time-varying equivalent DC current is computed as follows,
where j+(t) and j−(t) are the current densities of the positive and negative phases of the bipolar current, γ is the EM recovery factor, and P is the period of the current waveform. When the current density is unidirectional, jtrans,EM,eff is essentially the time-averaged current density. However, using the effective current formula in (2) creates a number of problems [13]. First of all, the recovery factor is not constant; it depends on the specific current waveform. Also, the formula ignores important transient effects such as the recovery and peak stress effects. Fig. 2 shows the stress evolution over time driven by two current waveforms, the actual one and the time-varying equivalent DC current. As we can see, the peak stress due to the actual current waveform can exceed the critical stress, while the average current never leads to void nucleation (the wire is immortal).
In order to solve the problems of these models, in this paper we propose a new equivalent DC current method that considers the transient EM recovery effects. The new model is based on first-principles numerical analysis of EM effects. Here we use the nucleation phase to compute the time to failure of a wire as a demonstration of the proposed method. The idea is that, for a given EM model, the equivalent DC current leads to the same time to failure (TTF) as that computed from the detailed numerical EM analysis of the stress diffusion equation. This is illustrated in Fig. 3(b), in which the periodic current density and a DC current give the same nucleation time tnuc. Unlike the traditional method, which can miss the case in which the peak stress exceeds the critical stress while the equivalent current density never leads to void nucleation, this method explicitly takes the transient effects into account.
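For reference, the conventional effective current discussed above can be sketched in a few lines. This is a minimal sketch under a stated assumption: it uses the commonly cited form in which the negative-phase current is weighted by the recovery factor γ and the result is averaged over the period P, which is consistent with the symbol definitions given for (2). The function name and the uniformly sampled waveform representation are hypothetical.

```python
def effective_dc_current_density(samples, dt, recovery_factor):
    """Conventional time-averaged effective current density.

    Assumed form (illustrative, matching the symbol definitions for (2)):
        j_eff = (1/P) * ( integral of j+(t) dt
                          - gamma * integral of |j-(t)| dt )
    `samples` holds the current density at uniform time steps of width
    `dt` covering exactly one period P = dt * len(samples).
    """
    period = dt * len(samples)
    positive_charge = sum(j for j in samples if j > 0) * dt
    negative_charge = sum(-j for j in samples if j < 0) * dt
    return (positive_charge - recovery_factor * negative_charge) / period
```

For a unidirectional waveform the result reduces to the plain time average, so a pulsed current with a high peak and a DC current with the same average are indistinguishable to this formula. That is precisely the limitation the peak-stress argument above points out.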
The proposed model works well for standard periodic square waveforms with one high current density (j1) and one low current density (j2). As shown in Fig. 3(a), j1, j2, the period (P) and the duty cycle (D) are used as the variables of the model. We also find that the temperature (T) is one of the dominant parameters for the equivalent EM DC current density (jem).
To further derive the parameterized equivalent DC current in terms of the two currents, the period, the duty cycle and the temperature, response surface methodology (RSM) [14] is applied over many different waveforms using measured data or detailed numerical analysis results. Equation (3) is the fitted model for the equivalent DC current in terms of the five parameters.
$$j_{em} = 4.988 \times 10^{9} - 0.0663 \times 10^{9} X_1^{2} - 1.114 \times 10^{9} X_1 X_2 - 0.9981 \times 10^{9} X_1 X_3 - 0.1390 \times 10^{9} X_1 X_4 - 0.3485\, X_1 X_5 - 0.0315 \times 10^{9} X \dots \quad (3)$$
However, this model can only handle regular square waveforms, whereas in practical cases the current waveforms are arbitrary. To mitigate this problem, one idea is to convert the arbitrary current waveform to an equivalent square waveform before we apply the aforementioned parameterized equivalent DC current modeling. In this conversion process, we make sure that the stresses produced by the square waveform and by the actual current waveform match at both the highest peak stress and the final stress (end of the period or time), as shown in Fig. 4(b). By matching the two stress points, we can find the two currents, j1 for the highest stress point and j2 for the end-of-period stress, as shown in Fig. 4. During this conversion process, we assume that the given current waveform repeats itself over time so that it becomes a periodic waveform. This assumption is reasonable because the future current or power of a chip cannot be predicted precisely in general, and the recurrence assumption is a good guess.
The other idea is to convert the arbitrary current waveform directly to a DC equivalent current so that the stresses from the two waveforms match at the end of the period, as shown in Fig. 4(b). But this approach may lead to large errors in the time to failure estimation because it ignores the peak stress, which can be significant in determining the time at which the critical stress is reached (the time to failure). To study the accuracy of the two modeling methods, namely the two-step method (square waveform modeling followed by RSM fitting) and the direct equivalent DC current method, we compare the stress generated by each method against the stress generated by the original current waveform. The results are shown in Fig. 5. As we can see, the equivalent square waveform based DC current density (two-step method) has a smaller error than the direct DC equivalent method in terms of time to failure estimation. As a result, in this paper, we use the two-step method to compute the parameterized equivalent DC current.
EM modeling for varying temperature effects
For system-level EM reliability, the manycore system runs different tasks under different voltages and frequencies. As a result, its temperature and current densities change with time. However, existing EM models, including the new physics-based model, can only take a constant temperature. A previous study shows that the whole-system MTTF or lifetime under different temperatures can be approximated by [15]:
where MTTF_m is the actual MTTF (mean time to failure) under the m-th power and temperature setting, which lasts for a period ∆t_m, assuming the chip works through n different power and temperature settings and T = Σ_{m=1}^{n} ∆t_m. Each MTTF_m is computed based on the EM models discussed in the previous section.
To consider system-level EM reliability on a manycore dark silicon processor, we use the shortest lifetime among all the cores as the lifetime of the whole manycore processor [16].
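The two aggregation steps above can be sketched as follows. This is a hedged illustration: it assumes the time-weighted harmonic-mean form MTTF = T / Σ_m (∆t_m / MTTF_m), which is consistent with the symbols defined above but is an assumption here, and the function names and the list-of-pairs data layout are hypothetical.

```python
def time_weighted_mttf(intervals):
    """Combine per-setting MTTFs into one lifetime estimate for a core.

    `intervals` is a list of (duration, mttf) pairs, one per power and
    temperature setting. Assumed form (illustrative):
        MTTF = T / sum(duration_m / mttf_m),  with T = sum(duration_m)
    """
    total_time = sum(duration for duration, _ in intervals)
    consumed_life = sum(duration / mttf for duration, mttf in intervals)
    return total_time / consumed_life


def system_mttf(per_core_intervals):
    """System lifetime is taken as the shortest lifetime among all cores."""
    return min(time_weighted_mttf(iv) for iv in per_core_intervals)
```

Under this assumed form, a core that spends equal time in a setting with an MTTF of 10 years and one with an MTTF of 30 years ends up at 15 years, not the arithmetic mean of 20 years, because the harsher setting consumes lifetime faster.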
NEW LEARNING-BASED RELIABILITY MANAGEMENT FOR NEAR THRESHOLD DARK SILICON FOR EM RECOVERY EFFECTS
Near threshold dark silicon
Near-Threshold Computing (NTC) has been proposed as a viable solution to overcome the limits of energy-efficient computing by using an optimal near-threshold voltage between the super-threshold and sub-threshold regions.
NTC cores are operated at or near their threshold voltage Vth. By reducing the supply voltage Vdd from a nominal 1.1 V to 500 mV, a 10X energy efficiency gain can be achieved at the expense of a 10X performance degradation [17]. In comparison, the sub-threshold region can achieve a 20X energy efficiency gain, but its 50X performance degradation due to increased circuit delay is too large to ignore for large-scale applications. Applications with significant standby times benefit greatly from NTC. Memories have to retain their contents even when the digital logic is powered off, so supply voltage scaling results in a significant reduction in leakage power.
Figure 6: The DRM and NTC framework.
On the other hand, NTC is also a promising technique to mitigate the effects of dark silicon, as cores can reduce power and temperature under a given power budget, thus allowing a larger number of cores to be turned on simultaneously at the cost of an allowed performance loss. Recently, instead of operating all the cores at either the nominal voltage or the near-threshold voltage (NTV), voltage islands have been defined such that only some of the cores operate at NTV and the rest operate at the nominal voltage, for a more flexible trade-off between power and performance [18]. The supply voltage is proportional to the threshold voltage of the transistors in the core, and the core with the highest threshold determines the supply voltage for the voltage island. However, the different types of parallel workloads can lead to performance degradation and energy waste. Efficient dynamic management and scheduling to find suitable NTC regions are therefore needed.
In addition to energy and performance, NTC has an effect on reliability. NTC may exhibit better long-term EM reliability, as a lower voltage can lead to lower temperature, current density and residual stress, which are the major factors of EM effects [8]. NTV, being a lower supply voltage, can therefore improve the EM-induced lifetime of dark silicon processors. However, using NTV on a manycore system can cause significant performance issues, so such a system still operates some cores at the nominal voltage, and the reliability of the manycore system can be strongly affected by the reliability of those cores [16].
Framework of dark silicon in the near-threshold computing region
We present the framework for Dynamic Reliability Management (DRM) in the NTC region for dark silicon. The DRM framework employs several simulator models (microarchitecture, power, thermal) and a policy optimization module, all in conjunction with the EM recovery model. The policy optimizer lets the DRM choose the best NTC policy for the cores to maximize energy efficiency while meeting the performance limit and power budget. This work uses a 45nm-based 64-core dark silicon simulation framework with a threshold voltage of Vth = 0.20 V, core on/off knobs, and NTV capabilities. The DRM makes its decisions based on an on-line policy optimization module that employs the SARSA algorithm, which is explained later. In the framework, the DRM receives the new policy from the optimization module. It then sets each V_k or turns the core off. Additionally, each core's operating frequency (f_k) is affected by V_k; because of this, we use (5) as the relation to calculate f_k from its respective V_k. This ensures that the performance degradation from NTV is reflected in the simulation framework. The policy is then propagated to the architecture, thermal, and power simulators as well as to the EM recovery model and the optimizer.
The framework uses the Sniper architecture simulator [19] to generate the system performance for given workloads on a specified architecture. Parameters describing the architecture (chip floorplan, number of cores, frequency, and cache design) are passed to Sniper. Benchmarks representing the desired workloads are also used as inputs to Sniper to simulate the system's functionality. Sniper then outputs the performance characteristics of the chip, such as instructions per cycle, for each benchmark run. This is repeated in our experimental setup for several different sets of workloads. The whole near-threshold dark silicon framework is illustrated in Fig. 6.
Once the architectural system performance is generated, it is transferred to the physical simulators Hotspot [20] and McPAT [21]. Based on the architecture of the chip, its system performance from Sniper, and the voltage scheduling from the DRM, McPAT (Multi-Core Power, Area and Timing simulator) generates a power trace for each chip component, including the power P_k of each core. Hotspot then uses the chip floorplan in conjunction with the power trace generated by McPAT to produce a thermal trace for each chip component, including the temperature T_k of each core.
After the power/thermal/voltage characteristics have been generated by the various simulators, the EM recovery model uses these parameters (P_k, T_k, and V_k) to estimate the chip's time to failure under the given policy, considering any recovery effects the chip may experience from the near-threshold and off settings, V_k^nt and V_k^off.
Lastly, all the information generated, in addition to the current policy enacted by the DRM, is passed to the policy optimizer, which generates a new policy. This new policy gives the best voltage schedule for the various cores to optimize the energy of the chip while meeting the MTTF, performance, and power budgets.
SARSA-based learning algorithm for DRM considering long-term recovery
We can model our DRM problem as a Markov Decision Process (MDP) with states s(t) and actions a(t), where the states are the parameters of the framework for the time-step ∆t, e.g., f_k(t), T_k(t), MTTF(t), P_k(t), and V_k(t). Actions are defined as changes to these parameters, which in our case is the tuning of V_k(t). Our goal is to find the best policy, one that minimizes the energy E while meeting all constraints and budgets.
The reinforcement learning algorithm used to optimize the DRM policy is the State Action Reward State Action algorithm, or SARSA, first presented in [22] . SARSA is a combination of Q-learning and the traditional Temporal Difference method (TD) [22] . This algorithm exchanges the greedy updates of Q-learning with a policy driven update that is closer to the TD method. The result is an on-policy reinforcement learning algorithm with faster convergence when compared to Q-learning [22] .
The major difference from traditional Q-learning is that the maximum reward (minimum penalty) for the next state is not used for updating the Q-values. Instead, a new action is selected using the same policy that determined the original action.
The SARSA algorithm works by first populating a Q-table Q(s(t), a(t)), where s(t) is a state and a(t) is an action for time-step ∆t. It then selects an action for the current state using some policy. This action is taken, and the penalty PT(t + 1) (negative reward) and the new state s(t+1) are generated. From this new state s(t+1), another action a(t+1) is selected. The Q-table is then updated using a penalty function shown below.
Here, α is the learning rate and γ is the discount factor. In our DRM, we employ the multiple-constraint penalty function PT(t + 1) from [4] and modify it to accommodate each value (EM, power, temperature, and performance) and to incorporate the power budget and the performance and thermal limits as constraints.
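The on-policy update described above can be sketched as follows. This is a minimal sketch assuming the textbook SARSA update rule, Q(s, a) ← Q(s, a) + α[PT + γQ(s', a') − Q(s, a)], written in terms of penalties so that lower Q-values are better. The function names, the ε-greedy selection rule, and the dictionary-based Q-table are illustrative assumptions, and the multiple-constraint penalty function itself is not reproduced here.

```python
import random
from collections import defaultdict


def select_action(q_table, state, actions, epsilon, rng):
    """Epsilon-greedy selection over penalties (lower Q is better).

    The selection policy is an assumption; the text only says that
    "some policy" is used for both the current and the next action.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)
    return min(actions, key=lambda action: q_table[(state, action)])


def sarsa_update(q_table, state, action, penalty, next_state, next_action,
                 alpha, gamma):
    """Apply one textbook SARSA update and return the new Q-value.

    Unlike Q-learning, the target uses the Q-value of the action that was
    actually selected for the next state, not the best one available.
    """
    target = penalty + gamma * q_table[(next_state, next_action)]
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


# Hypothetical usage: states and actions are opaque hashable labels.
q_table = defaultdict(float)
rng = random.Random(0)
```

In a DRM loop of this kind, `select_action` would be called once per time-step for the new state, and that same action would be both applied to the cores and fed into `sarsa_update` as `next_action`.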
In order to provide the long-term shut-off time needed to leverage EM recovery effects, our DRM has a recovery selection scheduler. It is a periodic scheduler on top of SARSA with a long-term recovery cycle (Trecovery), which is the time required for recovery effects to take hold. The selected cores are turned off for the whole long-term recovery cycle, since recovery takes a long time, as seen in Subsection 2.2. Every Trecovery cycle, we use a greedy selection algorithm based on EM-induced lifetime evaluation to find the set of worst-lifetime cores, i.e., those below a certain lifetime threshold (EM_threshold); SARSA then works only on the cores that are not recovering during Trecovery. After each long-term cycle, we find a new long-term recovery core set.
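The selection step of the recovery scheduler can be sketched as follows. This is a minimal sketch, assuming that the per-core lifetimes have already been evaluated and are given as a mapping from a core id to a lifetime in years; the function names and the worst-first ordering are illustrative choices, not details taken from the implementation.

```python
def select_recovery_cores(core_lifetimes, em_threshold):
    """Return the ids of cores whose lifetime is below the threshold.

    `core_lifetimes` maps a core id to its estimated EM-induced lifetime.
    The result is ordered worst lifetime first (an illustrative choice).
    """
    below = [(lifetime, core_id)
             for core_id, lifetime in core_lifetimes.items()
             if lifetime < em_threshold]
    return [core_id for _, core_id in sorted(below)]


def split_cores(core_lifetimes, em_threshold):
    """Split cores into a recovering set and the set left to SARSA."""
    recovering = select_recovery_cores(core_lifetimes, em_threshold)
    recovering_set = set(recovering)
    active = [core_id for core_id in core_lifetimes
              if core_id not in recovering_set]
    return recovering, active
```

Calling `split_cores` once per Trecovery cycle yields the cores to keep off for that cycle and the remaining cores whose voltages the learning algorithm continues to tune.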
The proposed new energy optimization algorithm in the near-threshold dark silicon framework has been implemented in Python 2.7.9 with numerical libraries (Numpy 1.9.2 and Scipy 0.15.1). For the near-threshold dark silicon framework, we used an architectural simulator (Sniper 6.1), a power estimator (McPAT 1.0.32), and a thermal simulator (Hotspot 5.02 [23]) to estimate the recovery-aware EM-induced lifetime on top of the new physics-based EM model [8]. In the proposed framework, shown in Fig. 6, each simulator module is connected with a plugin connector, so that one simulator's results can dynamically provide another's inputs. The learning-based SARSA method and the recovery selection scheduler have been implemented for our dynamic reliability management (DRM).
The DRM module assigns each core either the near-threshold or the nominal voltage and calculates its frequency from (5). The near-threshold voltage is defined as 0.45 V and the nominal voltage as 1.2 V [24]. Additionally, each core can be turned off completely, or turned back on, for dark silicon operation. Once the simulation runs, the optimization module sends a new policy to the DRM, which uses it to schedule the core voltages. Our energy optimization method is validated with a near-threshold 64-core processor on the SPLASH-2 multi-threaded benchmarks.
NUMERICAL RESULTS
Evaluation of the lifetime impacts considering EM recovery effects
In order to evaluate the system-level EM-induced lifetime considering recovery effects, a single-core long-term task example from our framework is shown here. Our proposed framework, shown in Fig. 6, can properly manage and control both the power-on and the shut-down of each core, so we can significantly extend the system-level reliability by leveraging the EM recovery effects introduced in Subsection 2.2. We present two different simple power traces and calculate the EM-induced MTTF (mean time to failure) in Fig. 7. In this example, the period is 1000 seconds, and the core is switched off for 500 seconds of each period, which can be regarded as a sufficient time for the recovery effect. The original power trace (5.72 W) is converted to an equivalent power of 2.122 W using our recovery model. Our duty cycle for turning off is 50% in the recovery case. As a result, it leads to a 50% performance degradation. On the other hand, the EM-induced lifetime considering the recovery effect is 9.16X higher than that of the original case without the recovery effect, which is quite significant.
Figure 8: Performance, energy and EM-induced lifetime from the proposed DRM considering recovery effects for three cases: (1) recovery effects with Trecovery = 50s (first column), (2) recovery effects with Trecovery = 1000s (second column), (3) DRM only, without recovery effects (third column).
Evaluation of the DRM for near-threshold dark silicon processors
To evaluate the proposed learning-based energy optimization method in Section 3, we compare performance, EM-induced lifetime, and energy consumption when processing 64 multi-threaded tasks on a 64-core near-threshold dark silicon processor. Our experiment uses s/tasks and J/tasks as the performance and energy metrics, which are the total execution time and energy consumption for the selected 64 multi-threaded tasks (16 CHOLESKYs, 16 RADIXs, 16 RAYTRACEs, 16 VOLRENDs) on our framework. The results are shown in Table 1. We compare our DRM case (without recovery effects) with the all-NTV case (all 64 cores at near-threshold voltage (NTV)), the half-NTV/half-dark case (32 cores at NTV and 32 cores turned off), and the half-nominal/half-dark case (32 cores at nominal voltage and 32 cores turned off). The optimization results show the energy consumption improvements under the given budget constraints (a power budget of 250 W, a performance limit of 5 s for the given tasks, and an EM-induced lifetime limit of more than 5 years). For the all-NTV and half-NTV/half-dark cases, EM lifetimes are significantly improved, but their performance is 16X and 35X slower than the DRM baseline result. On the other hand, for the half-nominal/half-dark case, energy and performance are considerably improved but the EM lifetime is relatively short. Therefore, our DRM effectively finds a better EM lifetime (9.03 years) with high performance and the lowest energy.
DRM considering recovery effects
Finally, we evaluate our proposed DRM method considering the EM recovery effects. As seen in Fig. 8, we evaluate two different recovery cycles (Trecovery = 50s and 1000s): the MTTFs of all the cores are evaluated every Trecovery to determine which cores fall below the lifetime threshold, EM_threshold, and need to be turned off for the whole period. For our experiment, we set EM_threshold to 8 years in the recovery selection scheduler. As we can see, both DRM cases with recovery effects significantly improve the EM-induced lifetime (8.6X longer compared to the baseline results, which are shown in the third column in Fig. 8). However, the costs are 2.0X more energy consumption (1.9 J/tasks vs 0.94 J/tasks) and 3.3X performance degradation (0.05 s/tasks vs 0.014 s/tasks). This is still a worthwhile trade-off for the higher EM lifetime (64.7 years and 78.1 years for Trecovery = 50s and 1000s, respectively) compared with the baseline case in Fig. 8.
CONCLUSION
In this paper, we proposed a new dynamic reliability management (DRM) technique for emerging near-threshold dark silicon manycore microprocessors considering electromigration (EM) reliability. To leverage EM recovery effects, which were ignored in the past, at the system level, we proposed a new equivalent DC current model that accounts for recovery effects for general time-varying current waveforms so that existing compact EM models can be applied. The new EM current model allows EM recovery effects to be effectively considered at the system level for the first time. To leverage the EM recovery effects, we considered the energy optimization problem for dark silicon manycore processors with Near-Threshold Voltage (NTV) capabilities considering EM reliability. We showed that on-chip power consumption patterns have different impacts on reliability. The resulting optimization problem was solved with the State-Action-Reward-State-Action (SARSA) reinforcement learning algorithm, which optimizes the voltage policy of the near-threshold dark silicon cores to minimize energy while considering reliability. Experimental results on a 64-core near-threshold dark silicon processor showed that the new equivalent EM DC current was able to fully capture the recovery effects at the system level, so that trade-offs between EM lifetime and energy/performance were easily made. We further showed that the proposed learning-based energy optimization can effectively manage and optimize energy subject to reliability, given power budget and performance limits. When the recovery effects were considered, the new optimization method was able to achieve 8.6X longer lifetime at the cost of 2.0X more energy and 3.3X more performance degradation.
