This work provides an efficient statistical electrothermal simulator for analyzing on-chip thermal reliability under process variations. Using the collocation-based statistical modeling technique, first, the statistical interpolation polynomial for on-chip temperature distribution can be obtained by performing deterministic electrothermal simulation very few times and by utilizing polynomial interpolation. After that, the proposed simulator not only provides the mean and standard deviation profiles of on-chip temperature distribution, but also innovates the concept of thermal yield profile to statistically characterize the on-chip temperature distribution more precisely, and builds an efficient technique for estimating this figure of merit. Moreover, a mixed-mesh strategy is presented to further enhance the efficiency of the developed statistical electrothermal simulator.
INTRODUCTION
As technology scales down to the sub-90nm node, on-chip power densities increase rapidly. Hence, power dissipation and thermal management have become important issues of VLSI design. High on-chip temperature distribution and thermal gradients
Preliminary versions of this article appeared in Proceedings of the IEEE International Systems-on-Chip
Conference (SoCC'10) [Chang et al. 2010] [Huang et al. 2012] . This work was supported in part by the National Science Council of Taiwan under grants NSC 99-2220-E-009-035, 100-2221-E-009-074, and 101-2221-E-009-168, and by the Industrial Technology Research Institute, Taiwan. Authors' addresses: Y.-M. Lee, Department of Electrical and Computer Engineering, National Chiao Tung University, HsinChu, Taiwan; email: yumin@nctu.edu.tw; P.-Y. Huang, Industrial Technology Research Institute, Taiwan. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested fromdrastically degrade the circuit performance and design reliability, operating temperatures seriously affect gate delays [Kumar and Kursun 2006] , and nonuniform on-chip temperature distribution induces timing faults [Bota et al. 2004] . Since on-chip power consumption is proportional to operating temperatures, thermal runaway might occur if thermal-related issues are not carefully considered in package design [Vassighi and Sachdev 2006] . To ensure design qualities, such as performance, power consumption, and reliability, researchers have been devoted to dealing with thermal-related issues in physical design [Tsai et al. 2006; Liu et al. 2008] . To provide the related thermal cost for optimization engines of physical design, several efficient deterministic thermal simulators [Wang and Chen 2003; Huang et al. 2006; Yang et al. 2007; have been developed to predict on-chip temperature profile. However, these thermal simulators only provide thermal information with deterministic on-chip power consumption.
and in Proceedings of the IEEE/ACM International Asia and South Pacific Design Automation Conference (ASP-DAC'12)
As predicted by the international technology roadmap for semiconductors (ITRS), leakage power consumption increases dramatically and has become a dominant portion of total power consumption [ITRS 2010] . Moreover, the scaling down of technology causes physical parameter variations to be non-ignorable, and this fact leads to substantial on-chip leakage power fluctuations. As pointed out by Pang and Nikolic, 8% of process variations can lead to about 25% of on-chip leakage power fluctuations [2009] . Therefore, physical parameter variations are essential to be considered for on-chip power estimation techniques [Chang and Sapatnekar 2007; Shen et al. 2010b Shen et al. , 2010a . Since on-chip temperature is transfered from on-chip power consumption, its distribution is impacted by process variations inducing leakage power fluctuations. However, deterministic thermal analyzers [Wang and Chen 2003; Huang et al. 2006; Yang et al. 2007; ], which did not take process variations into account in their power models, are not adequate to precisely provide thermal reliability estimation under process variations. Therefore, the on-chip temperature profile should be treated statistically under process variations, and statistical thermal simulation techniques are essential, especially for leakage dominated technologies Jaffari and Anis 2008] .
To provide the statistical characteristics of on-chip temperature distribution, Jaffari and Anis [2008] proposed a recursive log-normal approximation algorithm to obtain mean and standard deviation profiles of the on-chip temperature distribution. Compared with the Monte Carlo (MC) simulations, they have successfully demonstrated its efficiency and accuracy for estimating mean and standard deviation profiles of the temperature distribution in the macro-architectural level. Instead of constructing the different leakage power model for each different macro/gate type, they built the different leakage power model for each bin (grid) on a die. Since optimization engines, such as floorplanners or placers, might disturb positions of macros or gates, their related leakage power models need to be rebuilt after an optimization loop is executed. Therefore, the efficiency of their approach will be degraded while casting into thermal-aware design flows, because their approach needs to execute the time-consuming HSPICE simulation numerous times and fit curves for reestablishing leakage power models. In addition, their recursive log-normal approximation algorithm is restricted to the form of their proposed leakage power models. However, the leakage power model is getting more complicated for maintaining an acceptable accuracy level as the technology continuously scales down. To overcome the leakage power model restriction, a statistical thermal simulation framework that has high capability of adopting complex and accurate power models for any technology generations is required.
Besides the leakage power model issues, the figure of merit for on-chip temperature distribution is still ambiguous if only its mean and standard deviation profiles are reported.
1 Therefore, instead of only reporting mean and standard deviation profiles, a more precise figure of merit for the statistical characteristics of on-chip temperature distribution should be addressed to ensure the thermal reliability or to provide the thermal related cost for thermal-aware design engines.
In this work, a statistical simulation framework is developed for characterizing the on-chip temperature distribution. With the aim of dealing with the restriction issues of leakage power models, providing a more precise figure of merit for ensuring thermal reliability and being more easily incorporated into statistical performance analysis and design engines, technical key points and advantages of this work are summarized as follows.
(1) Compared with the bin-based model [Jaffari and Anis 2008] , a cell-based model is adopted for characterizing leakage powers. With the precharacterizing property, the reestablishing process of leakage power models can be avoided, while macros or gates are exchanged by optimization engines, such as floorplanners or placers. (2) Adopting the concept of sparse collocation-based methods, a statistical electrothermal simulation framework is developed to generate the statistical polynomial expression of on-chip temperature distribution. Compared with that of Jaffari and Anis [2008] , the developed framework is more flexible for complex and precise leakage powers models. (3) This work not only provides the mean and standard deviation profiles of on-chip temperature distribution, but also introduces the concept of thermal yield profile to statistically characterize the on-chip temperature distribution more precisely, and builds an efficient estimating technique for this figure of merit. (4) Without sacrificing the accuracy, a mixed-mesh strategy is presented and integrated into the baseline method of our statistical electrothermal simulation engine to further enhance its efficiency.
The rest of this article is organized as follows. Section 2 motivates the concept and essentialness of on-chip thermal yield profile, investigates the accuracy of existing cellbased leakage power models, and indicates that complex leakage current models are required for maintaining acceptable accuracy level. After that, the problem formulation and the modeling technique of device parameters are described in Section 3. Then, the developed statistical electrothermal simulator is detailed in Section 4, and experimental results are given in Section 5. Finally, the conclusion and potential applications of the developed simulation framework are presented in Section 6.
MOTIVATION ILLUSTRATIONS

Concept of On-Chip Thermal Yield Profile
Because of process variations, on-chip temperature at an arbitrary position r is a random variable. Therefore, the deterministic thermal analysis with nominal on-chip power profile can no longer be a good figure of merit for identifying the hotspot location of the chip. Moreover, even if the temperature at any arbitrary position r has been treated as a random variable, it will still be ambiguous if only the mean (μ T (r)) and standard deviation (σ T (r)) profiles of on-chip temperature distribution are delivered. For example, only using the mean profile of on-chip temperature distribution as a figure of merit is very likely (about 50%) to incorrectly indicate hotspot locations. Furthermore, as both μ T (r) and σ T (r) are delivered, by utilizing the Chebyshev inequality, a large temperature value will be estimated to ensure the lower bound 1 Please see the interpretation stated in Section 2.1.
41:4
Y.-M. Lee and P.-Y. Huang Fig. 1 . An example to demonstrating the gap between the temperature profile that really achieves 90% thermal reliability and the temperature profile that satisfies the lower bound of 90% thermal reliability in the Chebyshev inequality: (a) the temperature profile that really achieves 90% thermal reliability, T 90 Real (r); (b) the temperature profile that satisfies the lower bound of 90% thermal reliability in the Chebyshev inequality, T 90 Chebyshev (r).
of 90% thermal reliability, that is, T 90 Chebyshev (r) needs to be μ T (r) + 3σ T (r) to ensure Prob(T (r) ≤ T 90 Chebyshev (r)) ≥ 0.9. Here, T (r) is the on-chip temperature distribution. Since the Chebyshev inequality does not always get a tight lower bound for any type of random variables, there will be a gap between the temperature profile, T 90 Real (r), that really achieves 90% thermal reliability (i.e., Prob(T (r) ≤ T 90 Real (r)) = 0.9), and T 90 Chebyshev (r).
2 As shown in Figure 1 , the gap between T
90
Real (r) and T
Chebyshev (r) can achieve about 10
• C in our experimental results. Therefore, using the temperature profile of the Chebyshev bound might be an immoderately conservative constraint for thermal reliability. This undesirable phenomenon can result in immoderate guardbanding for circuit design.
On the other hand, from the aspect of circuit design, the specifications of circuit performance and the timing constrains of primary I/Os are usually specified in the system-level design stage. Moreover, in the timing and thermal cooptimization process, timing issues usually take higher priority than thermal issues. Designers try to minimize the circuit delay and meet the temperature requirement. Therefore, in the presence of process variations, to identify possible hotspot regions of a chip, the thermal yield profile, T yield(r, T spec (r)), can be defined as the probability profile of the on-chip temperature at an arbitrary position r being at or less than a specified temperature T spec (r).
The thermal yield profile can identify the hotspot regions and can also quantify the probability of a region that could be a hotspot. Therefore, it is a suitable figure of merit for the thermal-related cost in timing-thermal cooptimization physical design stages.
Leakage Power Modeling
Leakage currents of a gate not only depend on physical device parameters and operating temperatures but also on its input patterns [Chang and Sapatnekar 2007] . To build leakage power models, different input patterns, physical parameters, and operating temperatures are set for each gate in the cell library, and HSPICE simulation is 
ox , T 3 † The adoptive forms of f g and f s in this work.
Note: The second column shows the fitting components of f g and f s adopted by existing and proposed models.
performed with the industry design kit to generate data of the leakage currents. After that, average leakage currents of the input patterns are fitted by the least squares method. With leakage currents exponentially relating with physical parameters and operating temperatures and using the least squares fitting method, two major leakage currents-gate tunneling leakage current (I g ) and subthreshold leakage current (I s )-for each gate type can be fitted [Chang and Sapatnekar 2007; Shen et al. 2010a Shen et al. , 2010b Yu et al. 2009] .
Here, a 0 and b 0 are fitting constants, L is the channel length, t ox is the oxide thickness, T is the operating temperature, and f g (·) and f s (·) are specific fitting forms.
3
Basically, I g occurs in both on and off states, and I s is the off-state leakage mechanism. Therefore, the leakage power of a gate can be represented as follows [Chang and Sapatnekar 2007; Shen et al. 2010a Shen et al. , 2010b Yu et al. 2009] .
where V dd is the supply voltage and Sw is the switching activity. To adopt suitable leakage power models, different cell-based leakage current models are investigated [Chang and Sapatnekar 2007; Shen et al. 2010a Shen et al. , 2010b Yu et al. 2009 ]. To examine their accuracy, we have implemented their proposed models and compared their results with that of HSPICE simulation under TSMC 65nm design kit. Since leakage currents are temperature-dependent, simulation results show that the ignorance of temperature effect in the models [Chang and Sapatnekar 2007; Shen et al. 2010b Shen et al. , 2010a leads to considerable errors. As shown in the second row of Table I, the model adopted by Chang and Sapatnekar [2007] and Shen et al. [2010b] can provide acceptable accuracy for the gate tunneling leakage current because of its insensitivity to operating temperatures. However, since the subthreshold leakage current is sensitive to operating temperatures, as shown in the sixth and seventh rows of Table I , the models [Chang and Sapatnekar 2007; Shen et al. 2010a Shen et al. , 2010b (1) and (2) are increased. As shown in the fourth and tenth rows of Table I , compared with some models [Chang and Sapatnekar 2007; Shen et al. 2010a Shen et al. , 2010b Yu et al. 2009 ], our models can present accurate results for both gate tunneling and subthreshold leakage currents. As demonstrating by the test results, exquisite approaches are still required for modern statistical power analyzers [Chang and Sapatnekar 2007; Shen et al. 2010a Shen et al. , 2010b ] to refine the estimated result, while the temperature-dependent leakage power model is included.
Besides the cell-based leakage current models [Chang and Sapatnekar 2007; Shen et al. 2010a Shen et al. , 2010b Yu et al. 2009 ], Jaffari and Anis proposed a bin (grid)-based leakage power model that also simultaneously contains temperature and process variation effects [2008] . Adopting the bin (grid)-based leakage power model, Haghdad and Anis developed a power yield analysis engine that simultaneously considers temperature and process variation effects [2012] . However, the power dissipation of several bins might be changed because of disturbing macros/cells after each optimization iteration of thermal-aware design engines, such as floorplanners or placers. Therefore, the time-consuming HSPICE simulation and least-squares fitting process need to be reperformed for rebuilding leakage power models of the disturbed bins (grids). This will degrade its efficiency of providing thermal reliability information or thermal-related cost for thermal-aware design engines. We have implemented their leakage power models for examining the accuracy. With the fitting forms adopted in Jaffari and Anis [2008] and Haghdad and Anis [2012] , both subthreshold and gate tunneling leakage currents of each gate type in the cell library are fitted as a 0 (1
The fitting results are also compared with those of HSPICE simulations under the TSMC 65nm model card, and listed in Table II . The results show that their adopted leakage power models present considerable errors under the 65nm technology node, although adequate accuracy of these models for the 90nm technology node has been reported [Jaffari and Anis 2008] . As shown in Table II , their adopted leakage power models result in the maximum error being 35.53% and the average error being 9.82% for the subthreshold leakage current of an NAND gate under the 65nm technology node.
Demonstrating results, more accurate leakage power models should be adopted in the electrothermal analysis frameworks [Jaffari and Anis 2008; Haghdad and Anis 2012] to refine the estimated results, since the temperature is transferred from power consumption. Although the electrothermal analysis frameworks proposed [Jaffari and Anis 2008; Haghdad and Anis 2012] can be very efficient, their baseline temperature calculation frameworks require exquisite extending strategies, because their lognormal temperature approximation algorithm can not be applied to the leakage power models that are not expressed as log-normal random variables, that is, the first-order regression models for f g (L, t ox , T ) and f s (L, t ox , T ) in Eqs. (1) and (2) are not sufficient for the accuracy.
Compared with the framework in Jaffari and Anis [2008] and Haghdad and Anis [2012] , our proposed thermal reliability estimator can handle accurate and more complicated leakage power models and present accurate estimated results.
PROBLEM FORMULATION AND PHYSICAL PARAMETER MODELING
Problem Formulation
The compact thermal model for physical design stages is shown in Figure 2 [Wang and Chen 2003; Huang et al. 2006; Yang et al. 2007; . It consists of three portions. The primary heat flow path is composed of the thermal interface material, heat spreader, and heat sink. The secondary heat flow path contains interconnect layers, I/O pads, and the print circuit board. Functional blocks of the die are modeled as many power-generating sources attached to a thin layer close to the top surface of the die with the thickness being equal to the junction depth of device [Lallement et al. 2004] . Due to variations of physical parameters, the power consumption of functional blocks is treated statistically. Therefore, the profile of power generating sources, p(r, L, t ox , T ), shown in Figure 2 is modeled as a function of device channel length L, oxide thickness t ox , and on-chip temperature distribution T .
Combining the compact thermal model and the statistical power consumption of functional blocks, the on-chip temperature distribution T (r, L, t ox ) can be governed by the statistical steady state heat transfer equation. subject to the boundary condition is the heat flux function on b s , and ∂/∂ n b s is the differentiation along the outward direction which is normalized to b s . p(r, L, t ox , T ) is the power density profile that consists of the deterministic dynamic power density profile p d (r), the statistical gate tunneling leakage power density profile p g (r, L, t ox , T ), and the statistical subthreshold leakage power density profile p s (r, L, t ox , T ). Since the major part of device current flows through the channel, power density distribution has its value only when r ∈ (0,
is the junction depth of device [Lallement et al. 2004] .
Generally, the values of κ(r, T ) are temperature dependent. In practice, they can be treated as appropriate constant values while performing temperature-aware physical design procedures [Tsai et al. 2006] . Given nominal values of the physical device parameters, the appropriate thermal conductivity can be computed by using the approximated average of steady-state nominal temperatures calculated by an iterative computation scheme shown in Figure 3 .
With the appropriate thermal conductivity, the statistical steady-state heat transfer equation can be rewritten as
subject to the boundary condition
where κ is the thermal conductivity of the die that is obtained by utilizing the procedure presented in Figure 3 . With Eqs. (6) and (7), the goals of this work are to estimate the mean profile, the standard deviation profile, and the thermal yield profile of on-chip temperature distribution.
Physical Parameter Modeling
Generally, variations of physical parameters can be classified into two categories, dieto-die (D2D) variations and within-die (WID) variations. Due to different stages of the fabrication process, D2D and WID variations can be treated as two independent variation sources. Since D2D variations are smooth within a die, it is reasonable to model all devices having the same D2D variation. On the other hand, WID variations present considerable gradients within a die, and they are spatially correlated because spatial imperfection of the chemical-mechanical polishing and lithography processes. As indicated by the measured results of Cline et al. [2006] and Cheng et al. [2011] , distributions of the physical parameters are similar to those of the Gaussian random variables; generally, WID variations are assumed to be a correlated Gaussian random process, and D2D variations are treated as a Gaussian random variable for all devices [Bhardwaj et al. 2008; Chang and Sapatnekar 2007; Shen et al. 2010a Shen et al. , 2010b .
Combining the models of D2D and WID variations, the physical parameter Par(r xy ) with its nominal value μ Par (r xy ) at position
where δ W ID (r xy ) is a Gaussian random process of WID variations, and δ D2D is a Gaussian random variable of D2D variations.
Since the spatial correlation of δ W ID (r xy ) has different decreasing rates in the x-direction and y-direction [Cline et al. 2006] , the following spatial covariance function [Bhardwaj et al. 2008 ] is adopted for modeling the spatial correlation of δ W ID (r xy ).
where r x 1 y 1 = (x 1 , y 1 ) and r x 1 y 2 = (x 2 , y 2 ), λ x and λ y are correlation lengths of δ W ID in the x-and y-directions, respectively, and σ is the standard deviation of δ W ID (r xy ). In this work, the Karhunen-Loève (KL) expansion is utilized to simplify δ W ID (r xy ), since its number of transformed random variables is much smaller than that of principal component analysis (PCA) [Bhardwaj et al. 2008] . By applying the KL expansion, δ W ID (r xy ) with the spatial covariance function shown in Eq. (9) can be approximated as
Here, N Par is the truncation number, each (χ l , ϑ l (r xy )) is an eigen-pair of C(r x 1 y 1 , r x 2 y 2 ), and ζ l 's are independent standard normal random variables. The closed-form expressions of an eigen-pair (χ l , ϑ l (r xy )) for C(r x 1 y 1 , r x 2 y 2 ) shown in Eq. (9) can be derived as follows [Zhang and Lu 2004] .
where l, i, and j are indices, and the mapping between (i, j) and l is one to one. ϑ x,i (x) and ϑ y, j (y) are equal to
Here, ν x,i and ν y, j are positive values that satisfy
To get reasonable truncation numbers of KL expansions for the physical parameters L and t ox , in this work, N Par for Par = L or Par = t ox is decided by the following criterion.
with ε = 1%. Generally, devices located adjacently have similar physical characteristics [Chang and Sapatnekar 2007; Shen et al. 2010b Shen et al. , 2010a . Therefore, the active layer is partitioned into several rectangular grids for modeling physical parameters. After that, with the KL expansion, the device channel length L m and oxide thickness t oxm in the mth modeling grid can be approximated as 
. . , η t ox Nt ox ]
T are standard normal random vectors constituted by the KL expanded WID and D2D random variables for representing the device channel length and the oxide thickness in all modeling grids, respectively.
In the rest of this article,
T for the sake of notation simplicity.
STATISTICAL ELECTRO-THERMAL SIMULATOR
The executing flow of the proposed statistical electrothermal simulator is summarized in Figure 4 . Given the information of physical parameters, the KL expansion is performed to transform the spatial correlated physical parameters into a set of uncorrelated random variables. Then, the statistical polynomial expression of the on-chip temperature distribution is generated by the developed stochastic collocation-based statistical interpolation polynomial generator. After that, the on-chip thermal yield profile is estimated by the developed thermal yield profile estimation engine. The stochastic collocation-based statistical interpolation polynomial generator, the thermal yield profile estimation engine, and a mixed-mesh strategy for enhancing the statistical electrothermal simulator will be described in the following three sections.
Stochastic Collocation-Based Statistical Interpolation Polynomial Generator
The generator takes three steps to construct the statistical interpolation polynomial of the on-chip temperature distribution. First, the multidimensional sampling points of KL expanded random variables are generated by using the Smolyak sparse grid formula with sampling points being the roots of Hermite polynomials (HPs). Then, for each sampling point of the physical parameters, its corresponding temperature profile can be obtained by solving the corresponding deterministic heat transfer equation. Finally, the approximated expression of on-chip temperature distribution is built by utilizing Newton's interpolation polynomial formula. The details are presented in the rest of this section.
4.1.1. Smolyak Sparse Grid Generation. The primary advantage of Smolyak sparse grid formulation is to construct an interpolating polynomial for approximating a multivariate function u ∈ C r by using much fewer samples of the desired function than those of the full tensor product interpolation formula and the MC method but still to maintain an acceptable error bound [Smolyak 1963; Barthelmann et al. 2000] . Here, C r is the set of all functions that have continuous derivatives of all orders up to r. With this stochastic collocation technique, the statistical interpolation polynomial of on-chip temperature distribution can be efficiently constructed.
The MC method randomly generates samples of the random variables and hence requires a large number of samples for achieving an accurate estimate. In contrast, the Smolyak sparse grid technique uses the roots of HPs or the extrema of the Chebyshev polynomial [Barthelmann et al. 2000 ] to generate samples of the random variables and employs these fewer samples to effectively interpolate the desired solution. For example, Figure 5 illustrates that the number of possible sample points of the MC method is much larger than that of the Smolyak sparse grid formulation for a twodimensional random variable.
According to the Smolyak sparse grid formulation [Smolyak 1963 ], on-chip temperature distribution can be explicitly approximated as follows [Barthelmann et al. 2000] .
Here, N KL = N t ox + N L is the number of random variables in ξ , q = N KL + l, l ≥ 1 is the formulation level, and |i| = N KL n=1 i n , with each i n ≥ 1. Q i n (T ) is an interpolating polynomial of T (r, ξ ) by only utilizing a single random variable ξ n , i n is the index to decide the sample number (m i n ) for Q i n (T ), and ⊗ is the cross product operator for functions. As suggested by Barthelmann et al. [2000] , m i n =1 is set to 1, and m i n is equal to 2 i n −1 + 1 for i n > 1. From Eq. (19), only the corresponding temperature values of a small set of samples for ξ need to be known [Barthelmann et al. 2000 ]. This set is called the sparse grid and can be represented as
n } is the set of sample points used by Q i n (T ), and × is the cross product operator for sets.
The , for the function having bounded derivatives up to order r [Barthelmann et al. 2000] . Here, N H is the number of sample points in H(q, N KL ), and c N KL ,r is a constant that depends on N KL and r. According to our experience, the accurate estimation of the thermal yield profile can be obtained by setting level l to be 1. The number of sample points for the Smolyak sparse grid formulation can be much less than that of the MC method. A simple example is presented in Appendix A to illustrate the Smolyak sparse grid formulation.
The sampling values ofh i n for each i n must be properly decided. Adopting the roots of H-PCs with its order corresponding to each i n can achieve the most accurate result, as ξ is a normal random vector [Phillips 2003 ]. Choosing the extrema of the Chebyshev polynomial with its order corresponding to i n can achieve the nested sparse grid structure, that is,h i n = j ⊂h i n =k as j < k, and the acceptable accuracy [Barthelmann et al. 2000] . In this work, we select the roots of H-PCs as the sampling values, since the result is shown to be very accurate by using the low-level approximation, and the nested sparse grid structure is still preserved for q = N KL + 1. (17) and (18), respectively, and the deterministic power density profile corresponding to ξ j can also be obtained. Hence, we have the deterministic steady-state heat transfer equation as
Here, T (r, ξ j ) and p(r, ξ j , T ) are the deterministic temperature and power density profiles with the sampling point ξ j , respectively. Since the power density profile in Eq. (21) is temperature dependent, a deterministic electrothermal solver is summarized in Figure 6 and built to obtain each T (r, ξ j ). The implementation of the developed deterministic electrothermal solver is illustrated in Figure 7 . The accumulated area of each gate type in each simulating temperature grid can be precalculated and stored in the precalculation stage. With this precalculated data, the deterministic power density profile for each sampling point in H(q, N KL ) can be obtained and updated in the order of O(N x N y N type ) during the electrothermal simulation loop. Here, N x and N y are the division numbers of the simulation grid along x− and y−directions, respectively, and N type is the number of gate types for the given design. Generally, N type is determined by the specific cell library, and it is far less than the number of simulation grids, N x N y . After obtaining or updating the deterministic power density profile, an efficient deterministic thermal simulator ] is adopted to solve the deterministic heat transfer equation. These updating and solving procedures are repeated until the result is converged. 
Temperature Profile Construction by Using Polynomial Interpolation.
With each obtained T (r, ξ j ), the polynomial interpolation technique can be applied to construct the interpolating polynomial for the statistical temperature. As suggested by Barthelmann et al. [2000] , the Lagrange polynomial can be applied to construct the interpolating polynomial Q i 1 (T ) ⊗ · · · ⊗ Q i N KL (T ) for each different |i|. However, the suggested interpolation method requires performing the cross-product operation of functions; this can be slightly complicated for the implementation. Therefore, we adopt Newton's interpolating method to globally interpolate T (r, ξ ), because it can be implemented more easily and can interpolate the same polynomial as Barthelmann's method [Phillips 2003 ]. Therefore, the Smolyak's error bound can still be preserved. With Newton's interpolation formula, the on-chip temperature at an arbitrary position r * of the die can be approximated as
Here, each φ j (ξ ) is a fundamental polynomial with respect to the jth sampling vector ξ j , and the form of each φ j (ξ ) can be found in Phillips [2003] . N H is the number of sampling vectors in the sparse grid H(q, N KL ). Eachû j (r * ) is an unknown coefficient that needs to be determined.
Based on the basic idea of interpolation that the approximated function must match each known data, the interpolated polynomial in Eq. (23) satisfies the following equation for each ξ n .
With the property of fundamental polynomial described in [Phillips 2003 ], Eq. (24) can be rewritten as the matrix form for finding eachû j (r * ) at position r
Eachû j (r * ) can be calculated by using the forward substitution. The algorithm for generating the statistical interpolation polynomial of on-chip temperature distribution is shown in Figure 8. 
Thermal Yield Profile Estimation Engine
With the generated statistical interpolation polynomial of on-chip temperature distribution, the mean, standard deviation, and skewness profiles of on-chip temperature distribution are computed. After that, the temperature at each arbitrary position is approximated to be a corresponding skew normal random variable by the moment matching technique. Finally, the on-chip thermal yield profile is estimated by looking up the cumulative distribution function (CDF) table of those corresponding skew normal random variables. The detailed description of this thermal yield profile estimation engine is shown next.
As mentioned in Section 2.1, the on-chip thermal yield profile at an arbitrary position r * of the die can be defined as
With the definition given in Eq. (26), our target is to approximate the CDF of T (r * , ξ ). To obtain the approximated expression of T (r * , ξ ), the formulation level l is set as 1 for generating the sparse grid H(q, N KL ) with q = N KL + l in Eq. (20). Then, applying Newton's interpolating method, the approximated expression of T (r * , ξ ), shown in Eq. (23), can be rewritten as
, andĉ(r * ) are the coefficients and can be obtained by performing the algorithm shown in Figure 8 .
After several manipulations, Eq. (27) can be rewritten as
Here, each χ k (r
) 2 is a non-central chi-square random variable, because ξ k is a normal random variable,c(r
is a constant, and χ k (r * , ξ k )'s are independent because ξ k 's are independent. Therefore, T (r * , ξ ) is a weighted sum of independent non-central chi-square random variables.
The estimation of Eq. (26) can be done by calculating the CDF of T (r * , ξ ) represented in Eq. (28). Since ξ is an independent normal random vector, theoretically, the PDF of T (r * , ξ ) could be obtained by convolving the PDFs of χ k (r * , ξ k )'s. However, it is not practical because of numerous numerical convolutions. The moment matching-based CDF estimation techniques are another choice for efficiently approximating the CDF of T (r * , ξ ). APEX [Li et al. 2004 ], a state-of-the-art method, approximates the CDF of a random variable with the similar form of Eq. (27) by linearly combining exponential waveforms and can achieve an arbitrarily required matching order of statistical moments. Padé approximation is essential during performing APEX, although it cannot guarantee being stable for obtaining poles/zeros, even in the low-order approximation. To remedy this unstable issue, the technique proposed by Tutuianu et al. [1996] can be adopted to obtain the first two dominated pole/zero pairs for APEX. However, the first two dominated pole/zero pairs only can construct an approximated CDF of T (r * , ξ ) that 7 To get a more accurate approximated expression of T (r * , ξ ), one can set the formulation level l as 2 to capture the cross-product terms of ξ k 's in the second-order polynomial approximation. As shown in Appendix C, with cross-product terms of ξ k 's, the second-order polynomial approximated expression of T (r * , ξ ) can be transformed to the form that is consistent with Eq. (27). Therefore, the proposed thermal yield profile estimation engine can be extended to the second-order polynomial approximation that has cross-product terms of ξ k 's. matches up to the first two statistical moments. Refer to Li et al. [2004] and Tutuianu et al. [1996] for the details of APEX and the stable two-pole technique, respectively. Here, we are going to develop a moment matching-based technique to match the statistical moments of T (r * , ξ ) up to the third order for approximating the CDF of T (r * , ξ ). The basic idea of this approach is to approximate a random variable with a unimodal and skewed PDF by matching its mean, variance, and skewness to be a skewnormal random variable. To explain the unimodal PDF property of T (r * , ξ ), the sketches of PDFs corresponding to two different cases for the weighted sum of two independent non-central chi-square random variables are shown in Figure 9 . Case 1 shows the convolution result of two right-skewed distributions or two left-skewed distributions, and Case 2 presents the convolution result of one left-skewed distribution and one rightskewed distribution. Although, depending on the leading coefficients, the skewness of resulting random variables might increase or decrease, both resulting random variables have unimodal PDFs. Since T (r * , ξ ) is the weighted sum of independent non-central chi-square random variables, its PDF can be obtained by performing the convolution of two random variables successively. Therefore, it still has a unimodal PDF.
As indicated in Azzalini [2005] , the skew-normal random variable is suitable for approximating the random variable with unimodal and skewed PDF. Hence, the skewnormal random variable can be a suitable model for approximating T (r * , ξ ). From the representation of Eq. (28), the thermal yield at an arbitrary location r * can be approximated as
where
, and T (r
. μ T (r * ) and σ T (r * ) are the mean and the standard deviation of T (r * , ξ ), respectively; they can be computed in the order of O(N KL ), since χ k (r * , ξ k )'s are independent. To approximate T (r * , ξ ) to be a skew-normal random variable, Z ∼ SN(υ r * , ω r * , α r * ), the parameters υ r * , ω r * , and α r * [Azzalini 2005 ] need to be calculated. The first three moments of Z can be formulated as follows with these parameters.
After matching the first three moments of T (r * , ξ ) with Eqs. (30)- (32), we have 
Here, γ T (r * ) is the skewness of T (r * , ξ ), and the sign of δ r * is the same as the sign of γ T (r * ).
8
To obtain γ T (r * ), E{ T 3 (r * , ξ )} is needed and can be calculated as
where μ T (r * ) and σ T (r * ) are the mean and standard deviation of T (r * , ξ ), respectively. As shown in Eq. (38), E{ T 3 (r * , ξ )} is needed. However, the computational complexity of obtaining its value is O(N 3 KL ) if the expression of T (r * , ξ ) shown in Eq. (28) is directly used. To reduce the complexity, T (r * , ξ ) can be rewritten as
s have zero mean and are independent, we have
for i = 1, 2, and 3. Therefore, E{ T 3 (r * , ξ )} can be obtained in the order of O(N KL ), and the computational complexity of evaluating E{ T 3 (r * , ξ )} is also O(N KL ). With υ r * , ω r * , and α r * , T yield(r * , T spec (r * )) can be estimated by the CDF of the matched skew-normal random variable. Finally, we have
Here, (·) is the CDF of the standard normal random variable, T Owen (·) is Owen's T function [Azzalini 2005] , and β r * = ρ T (r * )−υ r * ω r * . With Eqs. (34)- (36) and Eq. (41), T yield(r * , T spec (r * )) can be efficiently evaluated by the lookup table method.
Mixed-Mesh Strategy for Enhancing the Statistical Electrothermal Simulator
As stated in Section 4.1.3, the deterministic electrothermal solver presented in Figure 6 needs to be executed N H times to generate the statistical interpolation polynomial of on-chip temperature distribution shown in Eq. (23) with the level-1 Smolyak sparse grid formula. Hence, although the developed thermal yield profile estimation engine can be done efficiently, the total runtime for obtaining the thermal yield profile is still dominated by the statistical interpolation polynomial generation. Here, we will present a mixed-mesh strategy to speed up the statistical interpolation polynomial generator without sacrificing the accuracy of the estimated thermal yield profile.
The developed mixed-mesh strategy is inspired by the following observations. The statistical interpolation polynomial generator needs to perform the deterministic electrothermal solver once for calculating the temperature profile with nominal device parameters and execute the deterministic electrothermal solver N H -1 times for obtaining temperature variations corresponding to the nominal temperature profile. The temperature profile from the first part contributes the major portion of the mean profile of temperature distribution, and the temperature variations from the second part contribute a large portion of the variance and skewness profiles of temperature distribution. In practice, the magnitude of the mean temperature profile is larger than those of the variance and skewness profiles of temperature distribution, since process variations of parameters are usually within a controllable range.
Based on the preceding observations, the mixed-mesh strategy for generating the statistical interpolation polynomial of on-chip temperature distribution is illustrated in Figure 10 . Since the mean profile contributes a major portion to the thermal yield profile, the nominal temperature profile is built by the deterministic electrothermal solver with a fine mesh having N F N F grids to preserve the estimation accuracy of the mean temperature profile. Then, the difference between the maximum and minimum temperature values, T max , of the nominal temperature profile is calculated, and an allowable temperature resolution, T res , is chosen. After that, the remaining N H − 1 deterministic electrothermal simulations are executed with a coarse mesh having N C N C grids. Here, N C is equal to T max /T res . Finally, the mean, variance, and skewness profiles of on-chip temperature distribution can be approximated by the generated statistical interpolation polynomial, and these temperature profiles are utilized to calculate the thermal yield profile.
With the proposed mixed-mesh strategy, the runtime for generating the statistical interpolation polynomial of on-chip temperature distribution can be significantly reduced. In this work, an effective deterministic thermal simulator ] is adopted as the kernel of our developed deterministic electrothermal solver. 9 The computational complexity of the deterministic thermal simulator presented in is O(N M N M log N Base ). Here, N M N M is the mesh size, and N Base is the number of bases for expressing the deterministic temperature profile. Generally, by using the average chip temperature calculated by the iterative computation scheme of the 1D thermal model shown in Figure 3 to be the initial operating temperature, the number of electrothermal loops for achieving convergence can be less than a small value. Hence, the computational complexity of the developed deterministic electrothermal solver is also O(N M N M log N Base ).
Therefore, the computational complexity of our baseline algorithm (the fine mesh is used for each deterministic electrothermal simulation) stated in Figure 4 is O(N H N F N F log N Base ), and the computational complexity by utilizing the developed mixed-mesh strategy can be reduced to
The computational complexity ratio of the developed mixed-mesh strategy for generating the statistical interpolation polynomial to the deterministic electrothermal solver with nominal device parameters is equal to 1+(N H −1)×(N C /N F ) 2 . In our experimental results, an accurate thermal yield profile can be estimated with N H = 53, N F = 128, N C = 16, and T res = 0.65
• C. The computational complexity ratio is 1.8125. Therefore, the mixed-mesh strategy does enhance the efficiency of the developed statistical electrothermal simulator for catching up with that of a deterministic electro-thermal simulator.
EXPERIMENTAL RESULTS
The developed statistical electrothermal simulator is implemented in C++ language and tested on a Linux system with Intel Xeon 3.0-GHz CPU and 32GB memory. The die size is 2.5 mm × 2.5 mm × 0.5 mm. The junction depth is set to 20nm for the 65nm technology [Lallement et al. 2004] , and the Debye length is 2nm [Bienacel et al. 2004] . The floorplan of a test chip having 1.2 million functional gates is shown in Figure 11(a) , and the geometries of chip and package are shown in Figure 11 (b) .
By applying the modeling skill of thermal parameters mentioned in Figure 3 of Section 3.1 and the modeling skill for both heat transfer paths mentioned in Huang et al. [2006] , the thermal conductivity and the equivalent heat transfer coefficients of the primary and secondary heat flow paths for executing the deterministic thermal simulator are summarized in Table III . The boundary condition of each vertical surface is set to be isothermal .
The device parameters, the truncation points of KL expansions for the channel length (N L ) and the oxide thickness (N t ox ), and the number of device modeling grid (N KL g ) are summarized in Table IV . Both N L and N t ox are decided by using the criterion stated in Eq. (16) with ε = 1%. To model the spatial correlation, both η x /L x and η y /L y are set to 0.98 for the correlation function shown in Eq. (9) [Cline et al. 2006] . The active layer of the test chip is divided into 128 × 128 simulated grids.
The estimated mean and standard deviation profiles of the power map under the settings of 60% WID and 40% D2D variations to the total variations are shown in Figures 11(c)-(d) , respectively. 
Electrothermal vs. Nonelectrothermal Simulations
With the number of grids being 100 for modeling device parameters and the ratios of WID and D2D variations to the total variations being 60% and 40%, respectively, the MC method with 2 × 10 4 samples is employed to demonstrate essentialness of the electrothermal simulation loop. The estimated mean and standard deviation profiles of the on-chip temperature distribution with and without considering the temperaturedependent issue of leakage power are shown in Figures 12(a) 
Accuracy and Efficiency
This section is going to demonstrate the correctness and efficiency of the developed statistical electrothermal simulator and show its efficiency improvement by using the mixed-mesh strategy.
Given three different ratio pairs of the WID and D2D variations to the total variations, (WID, D2D) = (40%, 60%), (50%, 50%), or (60%, 40%), the results from 2 × 10 4 MC simulations, which satisfy the maximum absolute relative error of variance is less than 1%, are utilized as the reference solution.
Mean and Standard Deviation Estimation.
To demonstrate the accuracy and efficiency of the developed statistical interpolation polynomial generator, the level-1 Smolyak sparse grid formula with the sampling points being the roots of HPs is built. The deterministic electrothermal solver needs to be executed 53 times, since the number of sampling points (N H ) for physical parameters to obtain the level-1 Smolyak sparse grid formula is equal to 2 × (N L + N t ox ) + 1, and both N L and N t ox are calculated to be 13, as shown in Table IV . The size of each simulation mesh is 128 × 128.
The maximum absolute errors of mean and standard deviation profiles are presented in Table V . The first two columns indicate the ratio pairs of WID and D2D variations to the total variations. Compared with 2 × 10 4 MC simulations, the maximum absolute errors of the estimated mean and standard deviation profiles from the developed statistical interpolation polynomial generator are shown in the fifth and sixth columns, respectively. As shown in Table V , the maximum absolute errors are less than 3.0% for all three different ratio pairs.
To fairly compare the runtime, the MC simulation is performed till achieving the same accuracy of standard deviation as the developed statistical interpolation polynomial generator. The number of MC simulations is shown in the #Samples column, and the Runtime column denotes the runtime for both methods. According to the Speedu column, the developed statistical interpolation polynomial generator can be orders of magnitude faster than the MC method. Under the ratio pair (WID, D2D) = (60%, 40%), the developed statistical interpolation polynomial generator takes 2.74 seconds to generate the interpolation polynomial of temperature profile for the 128 × 128 simulation mesh. It contains 0.47 seconds for executing 53 deterministic electrothermal simulations, and 2.27 seconds to generate the interpolation polynomials after the 53 sampling temperature profiles are obtained.
With the ratio pair (WID, D2D) = (60%, 40%), the estimated mean and standard deviation profiles of on-chip temperature distribution are shown in Figures 13(a) 
Thermal Yield Estimation.
To demonstrate the accuracy and efficiency of the developed thermal yield estimation engine by using the skew normal model, APEX [Li et al. 2004] , a well-known cumulative distribution function estimation method, is also implemented. To avoid the instability of Padè approximation for APEX, a stable two-pole model [Tutuianu et al. 1996 ] is employed for finding poles and zeros. The developed statistical interpolation polynomial generator is utilized to generate the statistical expression of temperature distribution for both the skew normal model method and APEX. With the average mean (μ T ) and standard deviation (σ T ) of temperature distribution obtained by 2 × 10 4 MC simulations, the specified reference temperature, T re f , is assumed to be μ T + 2.5σ T .
Table VI summarizes the results, and it shows that both the skew normal model method and APEX can accurately provide the thermal yield profile for each test case; furthermore, the developed skew normal model method is more accurate than that of APEX. The maximum error of the skew normal model method is 1.63%, and the maximum error of APEX is 2.32%. The results also reveal that the developed skew normal model method achieves 215× speedup over APEX. This is for two reasons. One is that APEX needs higher-order statistical moments to get a tight bound of its generalized Chebyshev inequality for the PDF/CDF shifting process. In our implementation, APEX requires moments up to the ninth order to achieve an accurate thermal yield profile estimate, even though it only needs moments up to the fourth order to get the first two dominated poles [Tutuianu et al. 1996 ]. The other is that APEX needs to solve zeros for constructing its exponential model after two dominated poles are computed. Compared with APEX, after the moments of temperature distribution up to the third order are computed, the skew normal model method simply looks up the CDF table of the skew-normal random variable to estimate the thermal yield profile.
The thermal yield profiles obtained by 2 × 10 4 MC simulations and estimated by the developed skew normal method and APEX under 60% of WID variation and 40% of D2D variation to the the total variations are presented in Figures 14(a)-14(c) , respectively. The errors of estimated thermal yield profiles by the skew normal model method and APEX are plotted in Figures 14(d)-14(e) , respectively. As shown in Figures 14(b) -14(c), the developed statistical interpolation polynomial generator can provide accurate statistical on-chip temperature expression for the thermal yield estimation, and the thermal yield profile can indicate thermal reliabilities at different locations of the die, that is, the smaller the thermal yield at a location, the less reliable the location. Figure 14 (c) also shows that the estimated thermal yield profile of APEX might impractically exceed 100% in some region, since APEX does not guarantee generating a statistical model with preserving the property of CDF.
To further demonstrate the accuracy of the skew normal model method, the estimated CDF of temperature distribution at position A in Figures 14(a) MC simulations, the skew normal model method, and APEX with the ninth and fourth order moments for the PDF/CDF shifting process are drawn in Figure 15 . As shown in Figure 15 , the result of the skew normal model method fits that of MC simulations very well. However, APEX with the fourth order for the PDF/CDF shifting process cannot 5.2.3. Mixed-Mesh Strategy for Thermal Yield Estimation. As demonstrated in Section 5.2.2, the developed skew normal model-based thermal yield profile estimation engine can be two orders of magnitude faster than APEX and can obtain the thermal yield profile in 0.013 seconds. However, as shown in Table V , it still requires couples of seconds to generate the statistical interpolation polynomial of temperature distribution. Here, we are going to show the superiority of the developed mixed-mesh strategy for thermal yield profile estimation with the test case having 60% WID variation ratio and 40% D2D variation ratio. For the accuracy verification, its results are compared with 2 × 10 4 MC simulations, and for the efficiency demonstration, its results are compared with the baseline method of the developed skew normal model thermal yield profile estimation engine with the size of the simulation mesh being equal to 128 × 128 for each deterministic electrothermal simulation corresponding to each sampling point in the Smolyak sparse grid.
By using the mixed-mesh strategy, first, the simulation mesh for calculating the nominal temperature profile of the test case is set to 128 × 128. After obtaining the nominal temperature profile, the difference between the maximum and minimum temperature values T max is calculated to be 10.1
• C, and the temperature resolution T res is set to 0.65
• C for the coarse-mesh construction. Under this setting, the size of the coarse mesh is calculated as 16 × 16 for executing the remaining 52 (= 2 × N KL ) deterministic electrothermal simulations. The estimated thermal yield profile by using the mixed-mesh strategy is shown in Figure 16 (a). Compared with Figure 14(a) , the estimated thermal yield profile matches with that of the MC simulations. Its error distribution is drawn in Figure 16 (b), and the plot shows that the errors are within the range [−2.28%, 0.13%].
Comparison between the developed baseline and mixed-mesh strategy thermal yield profile estimation methods is summarized in Table VII . The Expression Generation column denotes the runtime of generating the statistical interpolation polynomial, and it is 2.74 seconds and 0.049 seconds for the baseline method and the mixed-mesh strategy, respectively. The Thermal Yield Profile column shows the runtime to obtain the thermal yield profile by using the skew normal random variable model, and it is 0.013 seconds for both methods. From the last three columns of Table VII, we can see that the developed mixed-mesh strategy, with slightly trading off the accuracy, can achieve 44.4× speedup over the developed baseline method.
CONCLUSION AND POTENTIAL APPLICATIONS
An efficient statistical electrothermal simulator is proposed. The developed statistical electro-thermal simulator not only provides the mean and standard deviation profiles of on-chip temperature distribution, but also delivers the thermal yield profile of onchip temperature distribution to designers and provides a proper figure of merit to indicate the thermal reliability of designed circuit.
Compared with the MC method and APEX [Li et al. 2004] , the experimental results demonstrate that the proposed statistical electrothermal simulator can accurately and efficiently provide the mean, standard deviation, and thermal yield profiles of on-chip temperature distribution.
Potential extensions and applications of the developed framework are summarized as follows.
Statistical Electrothermal Analysis of 3D ICs
One possible strategy of extending our framework to 3D ICs is summarized as follows.
(1) Construct the equivalent modified nodal analysis (MNA) system for a 3D IC with the process variation and temperature-dependent power consumption vector at the right-hand side of MNA equation. (2) Build the sparse grid sample points for KL expanded or PCA decomposed random variables of the physical parameters. Use these sample points to get the deterministic physical parameters, and hence, we have several deterministic temperaturedependent power consumption vectors corresponding to these deterministic physical parameters. (3) Perform the deterministic electrothermal analysis with each deterministic temperature-dependent power consumption vector by using any existing deterministic electrothermal simulators to get the corresponding temperature profile. (4) Interpolate the statistical polynomials of the on-chip temperature distribution, and apply the developed thermal yield profile estimation technique to it.
Thermal-Aware Statistical Timing Analysis
One possible extension strategy for the thermal-aware statistical timing analysis is briefed as follows. Given different channel-length variations, different oxide-thickness variations, different operating temperatures, and different load capacitances, the delay data of each gate type in the cell library can be constructed by utilizing the HSPICE. With the simulated data and the least square fitting methodology, the delay of a specific gate type can be fitted as the first-order canonical form
where the fitting coefficients a 0 , a 1 , a 2 , and a 3 are load-capacitance dependent. After that, by using the formulation of KL expansion or PCA decomposition, L(r * ) and t ox (r * ) at the arbitrary position r * can be represented as a linear combination of their transformed Gaussian random variables. Then, using the level-1 Smolyak sparse grid formulation, the delay of an arbitrary gate type at position r * can be expressed as
Here, b i 's, c i 's, and d are constants, and ξ i 's are the Gaussian random variables transformed by KL expansion or PCA decomposition. With the expression stated in Eq. (43), any existing second-order statistical static timing analysis engine can be utilized to perform the thermal-aware statistical timing analysis. requires
