In this paper we propose a novel projection-based algorithm to estimate the full-chip leakage power with consideration of both inter-die and intra-die process variations. Unlike many traditional approaches that rely on log-Normal approximations, the proposed algorithm applies a novel projection method to extract a low-rank quadratic model of the logarithm of the full-chip leakage current and, therefore, is not limited to log-Normal distributions. By exploring the underlying sparse structure of the problem, an efficient algorithm is developed to extract the non-log-Normal leakage distribution with linear computational complexity in circuit size. In addition, an incremental analysis algorithm is proposed to quickly update the leakage distribution after changes to a circuit are made. Our numerical examples in a commercial 90nm CMOS process demonstrate that the proposed algorithm provides 4x error reduction compared with the previously proposed log-Normal approximations, while achieving orders of magnitude more efficiency than a Monte Carlo analysis with 10 4 samples. 
INTRODUCTION
As IC technologies move to nanoscale feature sizes, leakage power becomes increasingly large and significantly impacts the total chip power consumption. The predicted leakage power is expected to reach 50% of the total chip power within the next few technology generations [1] . Therefore, accurately modeling and analyzing leakage power has been identified as one of the top priorities for today's IC design problems.
The most important leakage components in nanoscale CMOS technologies include sub-threshold leakage and gate tunneling leakage [2] . The sub-threshold leakage models the weak inversion conduction when gate voltage is below the threshold voltage. At the same time, the reduction of gate oxide thickness facilitates tunneling of electrons through gate oxide, creating the gate leakage. Both of these leakage components are significant for sub100nm technologies and must be considered for leakage analysis.
Unlike many other performances (e.g., delay), leakage power varies substantially due to process variations, which increases the difficulty of leakage estimation. As demonstrated in [3] , leakage variations can reach 20x, while delays only vary about 30%. It has also been observed that leakage power is sensitive to both interdie and intra-die variations. Intra-die variations model the individual, but spatially correlated, local variations within the same die. These intra-die variations must be modeled by many additional random variables, thereby significantly increasing the problem size of leakage analysis. For example, the total number of random variables can reach 10 3~1 0 6 to model the full-chip variations for a practical industry design.
Many works have been developed to capture the leakage variations [4] - [10] . Most of these approaches approximate the leakage variation as a log-Normal distribution. For that purpose, a first-order (i.e., linear) Taylor expansion is used to approximate the logarithm of the leakage current. Given the increasingly larger variations in nanoscale technologies, such a linear approximation can result in inaccurate results, especially because the leakage current has a strongly nonlinear dependency on process variations. As will be demonstrated by the numerical examples in Section 4, a 20% estimation error is observed by using the linear approximation for a commercial 90nm CMOS process.
To achieve higher accuracy, a quadratic approximation can be used, which, however, significantly increases the computational cost. For example, if the total number of random variables reaches 10 6 , a quadratic approximation will result in a 10 6 x10 6 quadratic coefficient matrix including 10 12 coefficients! The authors of [11] propose a projection-based approach (PROBE) to reduce the quadratic modeling cost. Instead of finding a full-rank quadratic model, PROBE attempts to find an optimal low-rank model by minimizing the approximation error. However, one major difference between leakage analysis and that addressed in [11] is the problem size. The PROBE algorithm is efficient for handling 10 1~1 0 2 random variables, while the fullchip leakage analysis involves 10 3~1 0 6 variables. The challenging problem here is how to make the quadratic modeling feasible for such a large problem size.
In this paper, we propose a novel projection-based algorithm to extract the optimal low-rank quadratic model for statistical leakage analysis. The proposed algorithm is facilitated by exploring the underlying sparse structure of the problem. Namely, the large number of intra-die variations only locally impact the leakage power in their neighborhood, as is demonstrated by many previous works, e.g., [10] . Considering this sparse property, we formulate the statistical leakage analysis problem into a special form that can be efficiently solved by the Arnoldi algorithm and the orthogonal iteration borrowed from matrix computations. As such, an accurate low-rank quadratic model can be extracted with linear computational complexity in circuit size.
Another important contribution of this paper is to propose a quadratic model compaction algorithm that converts a low-rank, high-dimensional quadratic leakage model to a full-rank, lowdimensional one without changing the probability distribution of the leakage. The probability density function (PDF) and the cumulative distribution function (CDF) of the low-dimensional Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. model is much easier to estimate using either Monte Carlo analysis or APEX [12] .
The third contribution of this paper is to offer an incremental analysis capability to quickly update the leakage distribution after changes to a circuit are made. The proposed incremental analysis locally updates the leakage distribution and, therefore, achieves significant speedup over the full leakage analysis.
The remainder of this paper is organized as follows. In Section 2 we review the background materials and then propose our projection-based leakage analysis algorithm in Section 3. The efficacy of the proposed algorithm is demonstrated by numerical examples in Section 4. Finally, we conclude in Section 5.
BACKGROUND

Modeling Process Variations
Process variations are typically characterized into two broad categories: inter-die variations and intra-die variations. Inter-die variations represent the common/average variations across the die and can be modeled by using common random variables for all components in a chip. Intra-die variations represent the individual, but spatially correlated, local variations within the same die. A typical approach for modeling intra-die variations is to partition the entire die into a number of grids [10] , as shown in Fig. 1 . The intra-die variations in the same grid are fully correlated, while those in close (far-away) grids are strongly (weakly) correlated. The process variations, both the inter-die and intra-die variations, are typically modeled as Normal distributions. Principal component analysis (PCA) [13] can be applied to decompose correlated Normal distributions into independent ones. After PCA, the process variations (e.g., ΔV TH , ΔT OX and ΔL) of each logic cell can be modeled as:
T denotes the parameter variations of the i-th logic cell. E = [ε 1 ,ε 2 ,...,ε N ]
T stands for the random variables for modeling both inter-die and intra-die variations of the entire die. {ε 1 ,ε 2 ,...,ε N } are extracted by PCA. They are mutually independent and satisfy the standard Normal distribution (i.e., zero mean and unit standard deviation). N is the total number of these random variables, and it is typically large (e.g., 10 3~1 0 6 ) for practical industry designs. V Celli captures the correlations among the random variables.
The size of V Celli can be extremely large if there are a great number of random variables for modeling intra-die variations. However, ΔX Celli only depends on the intra-die variations in its neighborhood; therefore, V Celli is quite sparse. For example, referring to the grid in Fig. 1 , ε 1 has little impact on Cell 1 , since they are far away. In Section 3, we will show how this sparse property is utilized in our proposed leakage analysis algorithm to reduce the computational cost.
Statistical Leakage Analysis
Statistical leakage analysis typically starts from the leakage modeling for logic cells. Most previous works approximate the logarithm of the cell leakage current by a linear model:
where I Celli denotes the total leakage current (including both subthreshold leakage and gate tunneling leakage) of the i-th cell, B Celli ∈ R N and C Celli ∈ R are the linear model coefficients. Since the random variables {ε 1 ,ε 2 ,...,ε N } satisfy Normal distributions, log(I Celli ) is the linear combination of multiple Normal distributions and, therefore, is also a Normal distribution [14] . It follows that I Celli is a log-Normal distribution [14] .
Given the leakage models of all individual cells, the full-chip leakage current is the sum of all cell leakage currents:
where M is the total number of logic cells in a chip.
Eq. (3) implies that the full-chip leakage current is the sum of many log-Normal distributions. It can be approximated as a logNormal distribution [10] . This is equivalent to approximating the logarithm of the full-chip leakage current by a linear model:
where B Chip ∈ R N and C Chip ∈ R are the model coefficients. It is well-known that leakage current depends on input vector state. The cell and chip leakages in (2)- (4) can be the leakage currents for a fixed input state or the average leakage currents over all input states. In this paper, we will not distinguish these two cases.
The linear models in (2) and (4) are not sufficiently accurate for modeling the large-scale process variations that are expected for nanoscale technologies. This, in turn, suggests that applying quadratic models might be required to improve the accuracy:
where
, B Celli , B Chip ∈ R N and C Celli , C Chip ∈ R are the model coefficients. In (5)- (6), the quadratic coefficient matrices A Celli and A Chip can be extremely large for capturing intradie variations. This makes the quadratic modeling problem extremely challenging in practical applications.
Projection-based Modeling
The authors in [11] proposed a projection-based approach (PROBE) to reduce the quadratic modeling cost. The key difficulty of quadratic modeling is the need to compute all elements of the quadratic coefficient matrix, e.g., A Chip in (6). This matrix is often rank-deficient in practical applications. Therefore, instead of finding the full-rank matrix A Chip , PROBE approximates A Chip by another low-rank matrix Ã Chip such that their difference ||A Chip -Ã Chip || F is minimized. Here, ||•|| F denotes the Frobenius norm, which is the square root of the sum of the squares of all matrix elements. The authors of [11] prove that the optimal rank-R approximation is:
where λ Chipr ∈ R and P Cellr ∈ R N are the r-th dominant eigenvalue and eigenvector of the matrix A Chip respectively.
The PROBE algorithm proposed in [11] is efficient to handle 10 1~1 0 2 random variables, while the full-chip leakage analysis involves 10 3~1 0 6 variables. The challenging problem here is how to extract the dominant eigenvalues and eigenvectors for a 10 6 x10 6 matrix.
PROJECTION-BASED ANALYSIS
We propose a projection-based analysis algorithm that is facilitated by exploring the underlying sparse structure of the leakage analysis problem. Specifically, we propose the following methodology, which includes: 1) a two-step iterative algorithm for quadratic leakage modeling; 2) a quadratic model compaction algorithm for leakage distribution estimation; and 3) an incremental analysis algorithm for locally updating the leakage distribution.
Standard Cell Library Characterization
Our statistical leakage analysis starts from the standard cell library characterization where the objective is to approximate the leakage current of each logic cell by a regression model. Typically, there are only a few (e.g., 5~10) random variables for modeling the variations in one cell. Therefore, we can run SPICE simulations (or utilize measurement models if available) and apply PROBE [11] to fit the rank-K model for each cell:
where ΔX Celli is defined in (1), and λ Cellik , PC ellik , BC elli and C Celli are the model coefficients. Substituting (1) into (8) yields:
, and N is the total number of random variables for the entire die. The sizes of P Cellik and B Celli in (9) are much larger than the sizes of PC ellik and BC elli in (8) . However, as discussed in Section 2.1, V Celli is sparse. Therefore, both P Cellik and B Celli are sparse and contain many zeros.
For simplifying the notation, we define the following symbols to represent all cell leakage models in a matrix form: Comparing (11) with (9), it is easy to verify that:
where ⊗ stands for the point-wise multiplication, i.e., [a 1 ,a 2 ,...]
Full-Chip Leakage Modeling
We next develop the algorithm to efficiently extract the lowrank quadratic model of the full-chip leakage current. As shown in (3), the full-chip leakage current is the sum of all cell leakage currents. Applying the log transform to both sides of (3) yields: 
Substituting (12) into (13) and applying a second order Taylor expansion, after some mathematical manipulations we obtain a quadratic model in the form of (6), where the model coefficients are given by:
In (14)- (16) [ ] (14)- (15) . Because the matrix B Cell in (15) is sparse, computing the matrix-vector product B Cell Φ has linear computational complexity. Therefore, both C Chip in (14) and B Chip in (15) can be extracted with linear complexity.
The major difficulty, however, stems from the non-sparse quadratic coefficient matrix A Chip in (16) . This non-sparse feature can be understood from the last term at the right-hand side of (16) . The vector Φ is dense and, therefore, ΦΦ T is a dense matrix. It follows that B Cell ΦΦ
T B Cell
T is dense, although B Cell is sparse. For this reason, it would be extremely expensive to explicitly construct the quadratic coefficient matrix A Chip based on (16) .
To overcome this problem, we propose a novel iterative algorithm that consists of two steps: Krylov subspace generation and orthogonal iteration. Instead of finding the full matrix A Chip , the proposed algorithm attempts to find the optimal low-rank approximation of A Chip .
A.
Krylov Subspace Generation As shown in (7), the optimal rank-R approximation of A Chip is determined by the dominant eigenvalues {λ Chip1 ,λ Chip2 ,...,λ ChipR } and eigenvectors {P Chip1 ,P Chip2 ,...,P ChipR }. The subspace generated by all linear combinations of these dominant eigenvectors is called the dominant invariant subspace [15] and is denoted as:
It is well-known that the dominant invariant subspace in (19) can be approximated by the following Krylov subspace [15] :
where Q 0 ∈ R N is a non-zero vector that is not orthogonal to any dominant eigenvectors. We first develop the algorithm to extract the Krylov subspace which gives a good approximation of the dominant invariant subspace. The extracted Krylov subspace is then used as a starting point for the orthogonal iteration in Section 3.2.B such that the orthogonal iteration could converge to the dominant invariant subspace within a few iteration steps.
We adapt the Arnoldi algorithm from matrix computations [15] to generate the Krylov subspace. The Arnoldi algorithm has been applied to large-scale numerical problems and its numerical stability has been well-demonstrated for many applications, most notably, IC interconnect order reduction [16] . Fig. 2 summarizes a simplified implementation of the Arnoldi algorithm.
Step 3 in Fig. 2 
4.
Orthogonalize Q r to all Q i (i = 1,2,...,r-1).
5.
Q r = Q r / ||Q r || F . End For 6.
[ ] ( ) ( )
Q (i)
(24) Fig. 3 . Simplified orthogonal iteration algorithm.
The Krylov subspace computed from Fig. 2 is not exactly equal to the dominant invariant subspace. Starting from the matrix Q in (22), we further apply an orthogonal iteration [15] which exactly converges to the dominant invariant subspace. Theoretically, the orthogonal iteration can start from any matrix. However, since the Krylov subspace Q gives a good approximation of the dominant invariant subspace, using Q as the starting point helps the orthogonal iteration to reach convergence within a few iteration steps. Fig. 3 shows a simplified implementation of the orthogonal iteration algorithm. In (23), Q
∈ R N×R is a matrix containing only a few columns, because R is typically small (e.g., around 10) in most applications. Therefore, similar to (21), Z ∈ R N×R contains only a few columns. The orthogonal iteration in Fig. 3 is provably convergent if the columns in the initial matrix Q are not orthogonal to the dominant invariance subspace [15] . After the orthogonal iteration converges, the optimal rank-R approximation of A Chip is determined by Q Chip and U Chip in (24) [15] :
Combining (25) with (6) yields:
where C Chip and B Chip are given in (14)- (15) . The algorithms in Fig. 2 and Fig. 3 assume a given approximation rank R. In practice, the value of R can be iteratively determined based on the approximation error. For example, starting from a low-rank approximation, R should be iteratively increased if the modeling error remains large. In most cases, we find that selecting R in the range of 5~15 provides sufficient accuracy.
In summary, we developed a two-step iterative algorithm to extract the low-rank quadratic model of the full-chip leakage current. The proposed algorithm only involves simple vector operations and sparse matrix-vector multiplications; therefore, its computational complexity is linear in circuit size. In addition, it is not necessary to explicitly construct the matrix Ã Chip in (25). In the next subsection we will develop an algorithm that efficiently estimates the leakage current distribution.
Leakage Distribution Estimation
The quadratic function in (26) is N-dimensional, where N is typically large. It is not easy to estimate the leakage distribution directly from (26). Next we propose a quadratic model compaction algorithm that converts the high-dimensional model to a low-dimensional one, while keeping the leakage distribution unchanged.
Start from the quadratic model in (26
( ) ( ) Fig. 4 . Quadratic model compaction algorithm. Fig. 4 summarizes the proposed quadratic model compaction algorithm. The following two theorems prove that the quadratic models in (26) and (28) are equivalent and the random variables Ω defined in (27) are independent and satisfy the standard Normal distribution.
Theorem 1:
The quadratic models in (26) and (28) are equivalent. Proof: The QR factorization in Step 2 of Fig. 4 results in:
Substituting (29) and (27) into (26) yields:
Eq. (30) proves Theorem 1. ■ Theorem 2: Given a set of independent random variables E that satisfy the standard Normal distribution, the random variables Ω defined in (27) are independent and satisfy the standard Normal distribution. Proof: Since the random variables Ω are the linear combinations of zero-mean Normal distributions, they are also zero-mean Normal distributions. The correlation matrix for Ω is given by:
where a stands for the expected value of the random variable a. Remember that the random variables E are independent with the standard Normal distribution, i.e.:
(32) where I is the identity matrix. In addition, the matrix Q Comp is constructed from QR factorization and, therefore, is orthogonal:
Substituting (32)- (33) into (31) yields:
Eq. (34) implies that the random variables Ω are uncorrelated. Uncorrelated Normal distributions are also independent [14] . ■
The quadratic function in (28) has a dimension of R+1 which is much smaller than N. Based on (28), the PDF/CDF of log(I Chip ) can be extracted, for example, using either Monte Carlo analysis or APEX [12] . After that, the PDF/CDF of I Chip can be easily computed by a simple nonlinear transform [14] .
Incremental Leakage Analysis
Incremental leakage analysis facilitates a quick update on the leakage distribution after local changes to a circuit are made. For simplicity, we only detailedly discuss the case where one logic cell is changed. However, it should be noted that the proposed algorithm can be directly extended to handle the simultaneous change of multiple cells.
Assume that the i-th logic cell is changed (e.g., a low V TH cell is replaced by a high V TH cell to reduce leakage), resulting in:
where I Chip Old (I Chip New ) and I Celli Old (I Celli New ) respectively denote the leakage currents of the entire chip and the i-th cell before (after) the change.
Given the low-rank quadratic models of log(I Chip Old ), log(I Celli Old ) and log(I Celli New ), the objective of incremental leakage analysis is to quickly generate the low-rank model for log(I Chip New ). Compared with (3), Eq. (35) is much simpler and only contains a few terms. Therefore, updating the leakage distribution using (35) is much cheaper than the full leakage analysis from (3). The aforementioned iterative algorithm and compaction algorithm can be directly applied to (35). More details on the incremental analysis are not included in this paper due to the limited number of available pages.
NUMERICAL EXAMPLES
We demonstrate the efficacy of the proposed algorithm using ISCAS'85 benchmark circuits. All circuits are implemented in a commercial 90nm CMOS process. The numerical experiments in this section are performed on a SUN -1GHz server.
Standard Cell Library Characterization
SPICE simulations were used to construct an approximate model of the logarithm of the cell leakage current as a function of the parameter variations: ΔV THP , ΔV THN , ΔT OX , ΔW and ΔL. The distributions and correlations of these parameter variations are provided in the CMOS process design kit. Both linear and quadratic regression models are created for accuracy comparison. The low-rank quadratic models are extracted using the PROBE algorithm in [11] . Fig. 5 shows the approximation errors for three different logic cells. As the quadratic model is used, the approximation error is significantly reduced, e.g., dropping from 14% to 2% for NOR2. In addition, a rank-2 quadratic model, instead of the full-rank model with rank 5, is sufficiently accurate in this example. It is also worth mentioning that similar error patterns are observed for other logic cells, although only three of them are shown in Fig. 5. 
Statistical Leakage Analysis
Both inter-die and intra-die variations are considered for statistical leakage analysis. The grid model discussed in Section 2.1 ( Fig. 1 ) is used to capture intra-die variations. We partition the circuit into extremely fine grids and one grid only contains one logic cell, thereby significantly increasing the problem size. Such a large problem size helps us to verify the efficiency and robustness of the proposed projection-based algorithm. Table 1 shows the number of random variables (i.e., the size of the vector E in (1)) for each benchmark circuit. Note that the problem size is greater than 17K for C7552. A. Probability Density Function Taking the circuit C432 as an example, Fig. 6(a) shows the leakage distributions extracted by three different approaches: the linear approximation, the rank-10 quadratic approximation and the Monte Carlo analysis with 10 4 samples. As shown in Fig. 6(a) , the linear approximation yields large errors, especially at both tails of the PDF which are often the points of greatest concern.
B.
Eigenvalue Distribution For testing and comparison, we extract the full-rank quadratic leakage model for C432. Fig. 6(b) shows the magnitude of the eigenvalues of the quadratic coefficient matrix. Note that there are only a few dominant eigenvalues. Fig. 6(b) explains why the lowrank quadratic approximation is efficient in this example. Table 2 and Table 3 compare the leakage analysis accuracy and cost for different modeling approaches. The worst-case leakage is measured at the 99% point on CDF. The error values in Table 2 are calculated against the Monte Carlo simulation with 10 4 samples. As shown in Table 2 , the proposed low-rank quadratic approximation significantly reduces the maximal error from 21.01% to 5.37%. Using the full-rank quadratic approximation can modestly reduce the error further, however resulting in extremely expensive cost, as shown in Table 3 . The proposed low-rank approximation achieves up to 10 5 x speedup over the full-rank approximation and up to 10 3 x speedup over the Monte Carlo analysis with 10 4 samples. 
C.
Accuracy and Speed
D.
Incremental Analysis For comparison, we change one gate in each benchmark circuit and apply the proposed incremental analysis algorithm to locally update the leakage value. Table 4 shows the computational cost of the incremental analysis. Compared with the second column in Table 3 , the incremental analysis achieves up to 10x speedup. We expect that as the problem size increases further, the incremental analysis could achieve more speedup over the full leakage analysis. 
CONCLUSIONS
We have proposed a projection-based algorithm to capture the non-log-Normal leakage distributions that are expected in nanoscale technologies. The proposed algorithm has linear computational complexity in circuit size, which is facilitated by several novel algorithms, including: 1) a two-step iterative quadratic modeling algorithm; 2) a quadratic model compaction algorithm; and 3) an incremental analysis algorithm. Our numerical examples in a commercial 90nm CMOS process demonstrate that, compared with the popular log-Normal approximation, the proposed leakage analysis reduces the error from 21.01% to 5.37%, while achieving up to 10 3 x speedup over the Monte Carlo analysis with 10 4 samples.
6.
