A framework is proposed to analyze system-level reliability and evaluate the lifetimes of state-of-art microprocessors considering the impact of process-voltage-temperature (PVT) variations and device wearout mechanisms, including bias temperature instability (BTI), hot carrier injection (HCI), and gate oxide breakdown (GOBD). This work studies not only the system performance degradation due to each wearout mechanism individually, but also the performance degradation while all these wearout mechanisms happen simultaneously. A unified gate-delay model is developed to combine PVT variations and the aging effect, and then a statistical timing engine is constructed to analyze performance degradations and system lifetimes.
Introduction
Device wearout mechanisms, namely BTI, HCI and GOBD, not only degrade microprocessor performances, but also shorten microprocessor lifetimes. Besides transistor aging, PVT variations also lead to a large amount of variability in microprocessor performances and lifetimes. The complexity of large systems makes it very challenging to characterize microprocessor lifetime due to device reliability in the presence of PVT variations.
The purpose of this paper is to present a methodology to assess system-level microprocessor lifetime due to the combined effect of both PVT variations and the aging effect. Not only the performance degradation due to each wearout mechanism is studied individually, but also the performance degradation while BTI, HCI and GOBD happen simultaneously is analyzed. Because all these wearout mechanisms happen in devices at the same time, it's necessary to study the performance degradation and system lifetimes due to the combined effect of BTI, HCI and GOBD.
A unified gate-delay model is proposed to link device-level wearout models and the gate delays. The gate-delay model includes the following parameters: channel length, threshold voltage, GOBD breakdown resistances, supply voltage, temperature, input slew and Pi-model parameters (for the RC load). Among them, the threshold voltage of each transistor combines the effect of process variation, BTI, and HCI, while the GOBD breakdown resistances of each transistor represent the GOBD effect. All the included parameters have a considerable effect on gate delays. A method, called multivariate adaptive regression splines (MARS), is employed to characterize the gate delay as a function of these parameters. MARS is well-suited for capturing essential nonlinearities and interactions in a high-dimension parameter space.
Based on the unified gate-delay models, a statistical timing engine is developed to estimate the variability of circuit-performance degradation due to the aging effect when PVT variations are present. The proposed timing engine consists of two parts: a block-based analyzer and a path-based analyzer. The block-based analyzer performs PVTaging-aware critical path extraction, and the path-based analyzer performs accurate circuit-delay estimation of the extracted paths.
Using the statistical timing engine, a framework of system-level aging assessment has been constructed. Since the wearout mechanisms under study are activity, voltage (VDD) and temperature dependent, the proposed framework determines the detailed electrical stress profiles, thermal profiles and IR-drop profiles of each device within the system under study. Combining these profiles and device wearout models, the statistical timing engine is applied to characterize microprocessor performance degradation and assess system lifetime. Moreover, the lifetime estimates take into account realistic use scenarios which include active, standby, and sleep modes.
When PVT variations and the aging effect are taken into account, Static Timing Analysis (STA) is not capable of identifying the real critical paths, while Statistical Static Timing Analysis (SSTA) can be used to handle variations. In this work, our framework implements a blockbased SSTA timing analyzer which extracts the critical paths in the presence of both PVT variations and the aging effect. Fig. 1 shows that the critical paths extracted by STA are different from the ones extracted by the methodology proposed, and~7% error is introduced into the circuit delay distribution using STA-extracted paths. For a large balanced circuit, using STA-extracted critical paths results in significant inaccuracy. Also, using SPICE to run Monte Carlo simulations on critical paths is very computationally expensive, especially when the number of critical paths is large. To address this issue, the path-based SSTA timing analyzer presented in this paper achieves SPICE-level accuracy, while consuming only 1% runtime compared to SPICE, making it feasible for the proposed framework to handle complex systems.
The rest of this paper is organized as follows. Section 2 presents the wearout models used in this paper. Section 3 introduces the gate-delay model which integrates the impact of PVT variations, BTI, HCI, and GOBD. Section 4 shows the system-level aging assessment methodology using the SSTA timing engine. Section 5 concludes the paper.
Device-level wearout models

NBTI/PBTI models
Recent experimental work has shown that the threshold voltage shift as a function of time under DC stress (t DC ) is best modeled with trapping/de-trapping theory [1] [2] [3] [4] [5] [6] :
where, A, B, and ϕ are constants. ϕ is proportional to the number of available traps and is a function of temperature, T, and the Fermi level, E F . The temperature dependence is incorporated in ϕ. We have modeled temperature with the Arrhenius relationship:
where E a is the activation energy, k is a constant, and T is temperature. A frequency dependence in ΔV tp/tn has not been included, since it has been shown to be relatively insignificant, especially for low frequency signals [7] . However, the duty cycle, α, impacts the shift and is incorporated as an effective Fermi level, where E F,eff = αE F,on + (1 − α)E F,off , and E F,on and E F,off are Fermi levels when the device is on and off, respectively. The function, g(α), is a nonlinear function of the duty cycle, α, where g(1) = 1 and g(0) = 0 [2] . The duty cycle accounts for the time under stress, t stress , and the recovery time, t rec , since α = t stress /(t stress + t rec ). Hence, overall,
where ϕ 0 is a constant. The constants were obtained from the experimental results in [8] .
HCI model
Since hot electrons are generated during logic transitions, the impact of HCI is directly proportional to the switching frequency. In this paper, predictive HCI lifetime models under dynamic stress are used for long term performance-degradation simulations. The threshold voltage degradation due to HCI during the time under stress is modeled as [9] :
where r trans is the frequency-dependent transition rate (transitions per unit time), t stress is the total time under stress/operation, t trans is the transition time (rise or fall time), and A HCI and n are technologydependent constants that relate to the inversion charge, trap generation energy, and the hot electron mean free path.
GOBD model
In order to represent soft and hard breakdown (SBD and HBD) in GOBD, the oxide breakdown resistance is modeled as a function of time with the percolation and quantum point contact (QPC) models.
In the percolation model [10] , during electrical stress, the trap density in the SiO 2 increases with stress time t as a power law in the anode hole injection model. Stress is converted to a number of traps [11] :
where V G is gate voltage, A, B, and β are fitting constants, and τ ox , W, and L are oxide thickness, gate width, and length, respectively. The percolation model involves placing neutral traps randomly within the oxide and analyzing the number of resistive conduction paths in a three dimensional matrix representing the oxide layer. In simulation, the number of HBD and SBD paths in each device is randomly selected based on its probability of stress. Next, for each device experiencing breakdown, the transistor is replaced with a resistive transistor model. For HBD, R HBD is in the kΩ range. For SBD, the SBD leakage resistance is calculated with the QPC model [12] ,
ℏ is Plank's constant, e is the electron charge, and N is the number of SBD conduction paths [12] [13] [14] . In summary, the device-level GOBD model used in this paper introduces two time-dependent resistances for each transistor, shown in Fig. 2 . If the breakdown path happens between gate and drain, R G2D is the breakdown resistance while R G2S is adequately large to have a negligible effect. Similarly, if the breakdown path happens between gate and source, then R G2S is the breakdown resistance and R G2D is adequately large. A gate-to-substrate resistance has not been considered, because such breakdown paths have less impact on circuit performances. Fig. 2 . Device-level GOBD model used in this paper.
R G2D (t) R G2S (t)
Gate-level PVT-aging aware delay models
This section introduces a gate-delay model which considers PVT variations and the aging effect using a modeling method called multivariate adaptive regression splines (MARS).
RC interconnect modeling
The effect of resistive-capacitive (RC) interconnect is considered to ensure accurate timing estimation. For the example shown in Fig. 3(a) , the interconnect network, together with the loading cells (NOR2 and INV), forms the load of the driving cell (Buffer). After modeling the loading cells (NOR2 and INV) using their input capacitances (C NOR2_B and C INV ), the load of the driving cell is purely resistive-capacitive, as shown in Fig. 3(b) . Then the three-parameter Pi-model [15] , shown in Fig. 3(c) is used to reduce the order of the original RC network. The Pimodel is the most popular reduced-order model to estimate the input admittance of RC interconnects. The values of C pi1 , R pi and C pi2 in the Pi-model shown in Fig. 3(c) are obtained by equating the first, second, and third moments of the Pi-model to the corresponding moments of the original network in Fig. 3(b) . The moment generation is obtained via the modified nodal analysis (MNA) method [15] .
When PVT variations are present in the loading cells (NOR2 and INV), they change the input capacitances (C NOR2_B and C INV ) which are used to form the resistive-capacitive network in Fig. 3(b) and thus change the Pi-model in Fig. 3(c) . With PVT variations, C NOR2_B and C INV are variational. It is not satisfactory to model standard-cell input capacitances as fixed values. A first-order regression equation is used to model the input capacitances of standard cells as a function of PVT parameters, shown as in Eq. (7). More details on computing Eq. (7) can be found in [16] .
Then, a statistical Pi-model is constructed using the first-order sensitivity of C pi1 , R pi and C pi2 to the input capacitances of each loading cell. The statistical Pi-model is characterized for each interconnect network, that is, each interconnect network has its unique statistical Pi-model. After input capacitances of the loading cells are calculated using Eq. (7), the Pi-model parameters are quickly calculated via the statistical Pi-model. Eq. (8) gives C pi1 as an example.
where C i denotes the input capacitance of i-th loading cell, C i nominal is the nominal value of C i , α i is the first derivative of C pi1 with respect to C i , and C pi1 nominal is the nominal value of C pi1 .
Parameters included in gate-delay model
For each transistor of a gate, the variation of channel length and threshold voltage is considered, denoted as ΔL and ΔVth, respectively. The variation of ΔL comes from only process variation, while the variation of ΔVth comes from process variation, BTI and HCI.
ΔVth ¼ ΔVth process þ ΔVth BTI þ ΔVth HCI ð9Þ
GOBD introduces two additional parameters (gate-to-source resistance, R G2S , and gate-to-drain resistance, R G2D ) for each transistor of a gate, because the gate-oxide breakdown paths can happen from gateto-source or from gate-to-drain. Therefore, each transistor has four parameters (ΔL, ΔVth, R G2S , and R G2D ) considered in the gate-level model. For a gate containing N transistors, there are 4N parameters for ΔL, ΔVth, R G2S and R G2D .
ΔVDD and ΔT are used to represent the supply voltage and temperature of a gate, respectively. It is assumed that all the transistors in the gate have the same supply voltage and temperature. The three parameters (R pi , C pi1 , C pi2 ) in the Pi-model represent the load of a gate. Also, for each timing arc, the input slew time (Slope) is included. Note that Multiple Input Switching (MIS) is not considered in this work.
In summary, for a gate containing N transistors, the 4N device parameters (ΔL, ΔVth, R G2S , R G2D ), together with the six global parameters (ΔVDD, ΔT, R pi , C pi1 , C pi2 , Slope), result in a total of (4 N + 6) parameters for each gate. In the standard cell library used in this work, the value of N is as high as 32, making (4 N + 6) as high as 134.
The multivariate adaptive regression splines (MARS) method
The MARS method is used to model the function relating the gate delay/slew and the (4N + 6) explanatory parameters mentioned in Section 3.2. The high dimension of the explanatory parameter space, as well as the nonlinearities between gate delay/slew and the explanatory parameters, makes it very difficult to characterize this function. Traditional regression techniques, like the response surface method (RSM), suffer from the disadvantage of using the same model to cover the entire parameter space. One model is insufficient to accurately estimate the gate delay (or slew time) over the whole parameter space.
MARS is well suited for high-dimensional problems, as it captures essential nonlinearities and interactions. The piecewise nature of MARS allows it to split the high-dimension parameter space into multiple subspaces, and each subspace has a unique regression model. MARS inherently integrates all the regression models of different subspaces into one general expression using piecewise hinge functions. A hinge function has the form of (x − t) + or (t − x) + . Mathematically, they are defined as:
where t is a constant, called the knot. (x − t) + and (t − x) + are called a reflected pair of hinge functions. MARS uses reflected pairs of hinge functions for each explanatory parameter X j with potential knots at each observed value (x 1j , x 2j , …, x Mj ), where M is the number of observations.
The MARS model has the form:
where h t ð X ! Þ is called a basis function, X ! denotes the vector of all the explanatory parameters, and the set M denotes the collection of the basis function in the model. MARS builds a model in two phases: the forward stepwise addition and the backward stepwise deletion. In the forward phase, MARS repeatedly adds basis functions in pairs to the model step by step. It finds the pair of basis functions that gives the maximum reduction in the sum-of-squares residue error. The process of forward stepwise addition continues until the change in residual error is smaller than a threshold or until the maximum number of terms is reached. The backward stepwise deletion prunes the model to avoid overfitting. More details can be found in [17] .
Gate-level delay model
SPICE simulations are used as training data to train the MARS model. The design of the experiments for the SPICE simulation is a mixture of central composite design [18] and random samples within the corners. The corners of explanatory parameters which is used for central composite design are shown in Table 1 . The implementation of MARS is from a Matlab toolbox called ARESLab [19] . Every timing arc in every cell (a total of 247 standard cells in our library) is characterized. Table 2 presents the results of some representative cells. The "4N + 6" column denotes the number of parameters in the gate-delay model, and the "Time (s)" column denotes the run time to characterize the MARS gate-delay model. The two columns (mean and S.D.) in "error" denote the average and standard deviation of the errors between MARS and SPICE, respectively. With these gate-delay models, the device-level wearout models are converted to gate-delay degradation, which is further used to facilitate system-level aging assessment.
The interconnect-delay model is also obtained using MARS. A method called the stable two-pole (S2P) approximation [20] was employed to get the reduced-order model of the transfer function of the original RC network. The three parameters from this reducedorder model, together with ΔVDD, ΔT and Slope, are used in the interconnect-delay MARS model.
System-level aging assessment framework
Extraction of the stress, thermal, IR-drop and process profiles
The wearout mechanisms being studied are activity, supply voltage (VDD) and temperature dependent. The degradation of system performance is also directly dependent on the thermal and the IR-drop profile because temperature and VDD directly impact circuit timing.
Therefore, this paper proposes the framework shown in Fig. 4 to extract the spatial and temporal electrical stress, IR-drop, the thermal profile of a system.
Running RTL or SPICE simulations of a complete microprocessor to extract the activity profile of each net is not feasible in most cases, since it may take a few months to finish simulating a single benchmark. On the other hand, simulating microprocessors with standard benchmarks on an FPGA takes only a few minutes [21] . Our framework, shown in Fig. 4 , provides an efficient way to acquire electrical, thermal and IR-drop profiles for any digital system for use in system-level reliability analysis. Input: circuit netlist, RC parasitics (.spef), process-variation spatial correlation profile, BTI ΔVth profile, HCI ΔVth profile, GOBD RG2S/RG2D profile, thermal profile, IR-drop profile Output: Circuit-delay distribution Block-based timing analyzer 1. generate N Monte Carlo samples for the circuits according to process-variation spatial correlation profile.
2. abstract timing graph from circuit netlist, and apply (ΔL, ΔVth, RG2S, RG2D) 3. for i=1 to N //for each Monte Carlo sample 3.1. apply to each gate the corresponding values of ΔL, ΔVth, RG2S, RG2D, T and etc. 3.2. do forward traversal to evaluate the edges (gate and wire delay) and propagate the
Algorithm: Statistical Timing Engine
arrival time to Primary Output/DFF 3.3. do backward traversal to extract the first 3 critical paths using PERT algorithm. 4. end for 5. get the critical-path collection: P ={CP1, CP2, …, CPm} // m is up to 3*N, because some paths are repetitive Path-based timing analyzer 6. for i=1 to N //for each Monte Carlo sample 6.1. calculate the path delays for each critical path in P as {PD For activity tracking, the hardware netlists were loaded onto an FPGA for emulation [22, 23] . The netlist is also used for layout generation. Using the RC information from the layout, the net activities, and the computed power profile (via a power simulator), the consequent thermal profile throughout the microprocessor is computed using a thermal simulator. The net activities and layout are also used to determine the IR-drop profile throughout the microprocessor.
For process variations, the proposed methodology includes inter-die, intra-die, and intra-gate variations supporting any distribution and any correlation profile between different parameters. The process variations due to channel length (ΔL) and threshold voltage (ΔVth) for each transistor are considered. Because BTI and HCI also impact the threshold voltage, the threshold voltage for each transistor is a combination of variation due to the process, BTI and HCI, while ΔL is only due to process variation.
The electrical stress, thermal, IR-drop, and process profiles and RC parasitics, together with the device-level wearout models, generate all the parameters for each gate which are needed to calculate the gate delays. All the information is fed to a PVT-aging aware timing engine for the analysis of performance degradation. The timing engine generates the delay degradation under a variety of stress times, and calculates the lifetime according to the specified operating frequency. After a set of Monte Carlo samples are run, the lifetime distribution is obtained.
PVT-aging aware timing engine
The PVT-aging aware timing engine consists of two parts, a blockbased analyzer and a path-based analyzer. The block-based timing analyzer extracts critical paths considering PVT variations and the aging effect; the path-based timing analyzer performs timing analysis and is fed by the critical paths extracted from the block-based analyzer. Our timing engine has two features: it performs PVT-aging aware critical-path extraction, and it achieves SPICE-level accuracy, while consuming only 1% runtime. Fig. 5 presents the algorithm of the timing engine. Both the blockbased and path-based timing analyzer are based on Monte Carlo analysis. The block-based timing analyzer first abstracts a timing graph from the gate-level netlist, shown in Fig. 6 . The nodes of the timing graph represent primary inputs/outputs and gate input/output pins. Its edges represent the timing elements of the circuit, namely, the gate input-pin-output-pin delay and the interconnect delay. The weights on these edges is the delay of the corresponding timing elements, which is calculated using the unified gate-level MARS models. After a forward traversal of the timing graph, the arrival times at primary outputs and D inputs of flip-flops are obtained. By a backward traversal, the critical paths can be extracted using the PERT algorithm [24] . For each Monte Carlo sample, the process of delay evaluation, forward traversal and backward traversal is repeated. With a number of samples, a set of critical paths is obtained in the presence of PVT variations and the aging effect.
The path-based timing analyzer is fed by a set of paths. The delay of each path is calculated by accumulating the gate delay and interconnect delay along the path using the MARS delay models. Then the maximum value of the path delays is taken as the circuit delay. Similarly, after a number of Monte Carlo runs, the circuit-delay distribution is obtained.
Our timing engine exploits the advantages of both the block-based timing analyzer and the path-based timing analyzer, by using the block-based analyzer to extract accurate critical paths and using the path-based analyzer to determine an accurate timing distribution.
The efficiency and accuracy of our timing engine has been verified using several industrial designs from among the IWLS2005 benchmarks [25] . The results are shown in Table 3 . The computational complexity of our timing engine is O(MN), where M is the number of gates and N is the number of samples. The results in Table 3 show the comparison of the mean and standard deviation (S.D.) of the circuit-delay distribution between SPICE and our timing engine. Our timing engine achieves SPICE accuracy while consuming 1% runtime of SPICE, and hence facilitates its usage in the system-level aging simulator which is introduced in Section 4.3.
System-level aging simulator
The core of this aging simulator is the PVT-aging aware timing engine, shown in Fig. 4 . The timing engine is fed by a set of profiles, including process variation, thermal, IR-drop, RC parasitics, BTI, HCI and GOBD. The generation of these profiles was discussed in Section 4.1.
The system-level aging simulator is constructed by analyzing the performance (circuit delay) degradation at different stress times. When the circuit delay degrades beyond the clock period, the system is fails and the lifetime is thus obtained. Our system-level aging simulator has been applied to the LEON3 microprocessor [26] , which consists of logic units (IU, MUL, DIV and MMU) and storage blocks (RF, D-Cache, I-Cache, Dtags and Itags). Fig. 7 shows the distributions of the state probability and the transition rate in Fig. 7(a) and (b) , respectively, when LEON3 is running a set of standard benchmarks [27] . Fig. 8 shows the distributions of the state probability and the transition rate in part Fig. 8(a) and (b) , respectively. Fig. 9 shows a comparison of the lifetime distributions due to BTI, HCI, GOBD individually and simultaneously. It can be seen that GOBD is generally the most critical mechanism of the three, but for some samples BTI is dominant. Besides, when all of these three mechanisms happen simultaneously, the lifetime is shorter than the lifetimes due to analyzing each mechanism alone. It is concluded that BTI and HCI also have some effect on circuit lifetimes even if they are not dominant.
Conclusion
A system-level aging simulator which evaluates the microprocessor lifetimes due to the combined effect of BTI, HCI and GOBD has been proposed. The proposed work has the feature of extracting the critical paths considering PVT variations and aging effect. The impact of wearout and PVT is jointly incorporated in timing tables. The gate timing information is incorporated in statistical static timing analysis to estimate the degradation of microprocessor performances. The statistical degradation in microprocessor performances is linked to timing margins to determine a lifetime distribution. The methodology can be extended for analysis of yield enhancement [28] . Fig. 8 . The distributions of the stress probability (part (a)) and the transition rate (part (b)) for the microprocessor while running a standard benchmark. Fig. 9 . The lifetime distributions of the microprocessor due to BTI, HCI, GOBD and the combined effect.
