Abstract-Process variations are of increasing concern in today's technologies, and they can significantly affect circuit performance. An efficient statistical timing analysis algorithm that predicts the probability distribution of the circuit delay considering both inter-die and intra-die variations, while accounting for the effects of spatial correlations of intra-die parameter variations, is presented. The procedure uses a first-order Taylor 
I. INTRODUCTION

P
ROCESS variations have become an increasing concern in integrated circuits as circuit sizes continue to increase and feature sizes continue to shrink. As device and interconnect parameters such as physical dimensions show variability, the prediction of circuit performance is becoming a challenging task. Conventional static timing analysis (STA) handles the problem of variability by analyzing a circuit at multiple process corners (MPCs). However, it is generally accepted that such an approach is inadequate, since the complexity of the variations in the performance space implies that if a small number of process corners is to be chosen, these corners must be very conservative and pessimistic. For true accuracy, this can be overcome by using a larger number of process corners, but then the number of corners that must be considered for an accurate modeling will be too large for computational efficiency. The limitations of traditional STA techniques lie in their deterministic nature. An alternative approach that overcomes these problems is statistical STA, which treats delays not as fixed numbers, but as probability density functions (pdfs), taking the statistical distribution of parametric variations into consideration while analyzing the circuit.
Process variations can be classified into the following two categories: 1) inter-die variations are the variations from die to die; and 2) intra-die variations correspond to variability within a single chip. Inter-die variations affect all the devices on the same chip in the same way, e.g., making the transistor gate lengths of devices on the same chip all larger or all smaller, while the intra-die variations may affect different devices differently on the same chip, e.g., making some devices have smaller transistor gate lengths and others larger transistor gate lengths.
It used to be the case that the inter-die variations dominated intra-die variations, so that the latter could be safely neglected. However, in modern technologies, intra-die variations are rapidly and steadily growing and can significantly affect the variability of performance parameters on a chip [1] . The increase in intra-chip parameter variations is due to the effects such as microloading in the etch, variation in photoresist thickness, optical proximity effects, and steeper within-field aberrations as the manufacturing sizes approach the optical resolution limit [2] . Intra-die variation is spatially correlated: It is locally layout dependent and circuit specific, i.e., devices with similar layout patterns and proximity structures tend to have similar characteristics; it is globally location dependent, i.e., devices located close to each other are more likely to have the similar characteristics than those placed far away.
Due to the increasing effect of intra-die variations, several commercial flows have begun to include intra-die variations in the last few years, e.g., the on-chip variation (OCV) analysis in Synopsys's PrimeTime and the linear combination of delay (LCD) mode of IBM's EinsTimer. In the literature, a number of studies on statistical timing analysis have focused on circuit performance prediction considering intra-die variation. Continuous methods [3] - [6] use analytical approaches to find closed-form expressions for the pdf of the circuit delay. For simplicity, these methods often assume a normal distribution for the gate delay, but even so, finding the closed-from expression of the circuit distribution is still not an easy task. Discrete methods [7] - [9] are not limited to normal distributions and can discretize any arbitrary delay distribution as a set of tuples, each corresponding to a discrete delay and its probability. The discrete probabilities are propagated through the circuit to find 0278-0070/$20.00 © 2005 IEEE a discrete pdf for the circuit delay. However, this method is liable to suffer from the problem of having to propagate an exponential number of discrete point probabilities. In [10] , an efficient method was proposed by modeling arrival times as cumulative density functions (cdfs) and delays as pdfs and by defining operations of add and max on these functions. Alternatively, instead of finding the distribution of circuit delay directly, several attempts have been made to find upper and lower bounds for the circuit delay distribution [5] , [7] , [11] .
Although many prior works have dealt with intra-chip variations, most of them have ignored intra-chip spatial correlations by simply assuming zero correlations among devices on the chip. The difficulty in considering spatial correlations between parameters is that it can result in complicated path correlation structures that are hard to deal with. Tsukiyama et al. [6] consider correlation between delays among the transistors inside a single gate (but not correlations between gates). The work in [12] uses a Monte Carlo (MC) sampling-based framework to analyze circuit timing on a set of selected sensitizable true paths. Another method in [5] computes path correlations on the basis of pairwise gate delay covariances and used an analytic method to derive lower and upper bounds of circuit delay. The statistical timing analyzer in [13] takes into account capacitive coupling and intra-die process variation to estimate the worst case delay of critical path. Two-parameter space techniques, namely, the parallelepiped method and the ellipsoid method, and a performance-space procedure, the binding probability method, were proposed in [14] to find either bounds or the exact distribution of the minimum slack of a selected set of paths. The approach in [3] proposes a model for spatial correlation and a method of statistical timing analysis to compute the delay distribution of a specific critical path. However, the pdf for a critical path may not be a good predictor of the distribution of the circuit delay (which is the maximum of all path delays), as explained in Section II. Moreover, the method may be computationally expensive when the number of critical paths is too large. In [15] , the authors further extended their work in [3] and [7] to compute an upper bound on the distribution of exact circuit delay.
In this paper, an algorithm for statistical STA that computes the distribution of circuit delay while considering spatial correlations will be proposed. The circuit delay will be modeled as a correlated multivariate normal distribution, considering both gate and wire delay variations. In order to manipulate the complicated correlation structure, the principal component analysis (PCA) technique is employed to transform the sets of correlated parameters into sets of uncorrelated ones. The statistical timing computation is then performed with a program evaluation and review technique (PERT)-like circuit graph traversal. The complexity of the algorithm is O(pn(N g + N I )), which is linear in the number of gates N g and interconnects N I , and also linear in the number of varying parameters p and the number of grid squares n that are used to model variational regions. In other words, the cost is, at worst, pn times the cost of a deterministic STA. It is believed that this is the first method that can fully handle spatially correlated distributions under reasonably general assumptions, with a complexity that is comparable to traditional deterministic STA. This work can also be extended, using the same framework of maximum of delays (Section IV-C), to find the distribution of minimum of delays that can be applied to analysis such as computing minimum delay distributions for short-path analysis (to check for hold time violations), for required arrival time (RAT) analysis, etc.
The remainder of the paper is organized as follows. Section II formally formulates the problem to be solved in this work. Section III explains the model used for process variation and spatial correlation of intra-die variation. The algorithm is presented in Section IV, and its run time complexity analysis is given in the following section. The extension to handle interchip variation and spatially uncorrelated intra-die components is introduced in Section VI. The extension to compute minimum of delays is also presented in Section VI. Finally, Section VII shows a list of experimental results and their analysis.
II. PROBLEM FORMULATION
Under process variations, parameter values such as the gate length, the gate width, the metal line width, and the metal line height are random variables. Some of these variations such as across-chip linewidth variations (ACLVs) are deterministic, while others are random: This work will focus on the effects of random variations and will model these parameters as random variables. The gate and interconnect delays, as functions of these parameters, also become random variables. Given appropriate modeling of process parameters or gate and interconnect delays, the task of statistical STA is to find the pdf of the circuit delay.
The STA works with the usual translation from a combinational circuit to a timing graph [16] . The nodes in this graph correspond to the circuit primary inputs/outputs and gate input/output pins. The edges are of two types, namely: 1) a set that corresponds to the pin-to-pin delay arcs within a gate; and 2) a set that corresponds to interconnections from the drivers to receivers. The edges are weighted by the pin-to-pin gate delay and interconnect delay, respectively. The primary inputs of the combinational circuit are connected to a virtual source node, and the primary outputs to a virtual sink node with directed virtual edges. In the case that primary inputs arrive at different times, the virtual edges from the virtual source to the primary inputs are assigned weights of the arrival times. Likewise, if the required times at the primary outputs are different, the weights of the edges from the outputs to the virtual sink are appropriately chosen.
For a combinational logic circuit, the problem of STA is to compute the longest path delay in the circuit from any primary input to any primary output, which corresponds to the length of the longest path in the timing graph. In STA, the technique that is commonly 1referred to in the literature as PERT is commonly used. 1 This procedure starts from the source node to traverse the graph in a topological order and uses a sum operation or max operation (at a multifanin node) to find the longest path at the sink node. For details, the reader may refer to [16] and [17] .
Since a PERT-like traversal will be employed to analyze the distribution of circuit delay, a statistical timing graph of a circuit is defined, as in the case of deterministic STA.
Definition 2.1: Let G s = (V, E) be a timing graph for a circuit with a single source node and a single sink node, where V is a set of nodes and E is a set of directed edges. The graph G s is called a statistical timing graph if each edge i is assigned a weight d i , where d i is a random variable, and where the random variables may be uncorrelated or correlated. The weight associated with an edge corresponds to gate delay or interconnect delay. For a virtual edge, the weight is a random variable with a mean of its deterministic value and a standard deviation (SD) of zero, and it is independent from any other edges. Note that for the same nominal design, the identity of the longest path may change, depending on the random values taken by the process parameters. Therefore, finding the delay distribution of one critical path at a time is not enough, and correlations between paths must be considered in finding the max of the pdfs of all paths. Such an analysis is essential for finding the probability of failure of a circuit, which is available from the cdf of the circuit delay.
For an edge-triggered sequential circuit, the statistical timing graph can be constructed similarly by breaking the circuit into a set of combinational blocks between latches, and the analysis includes statistical checks on setup and hold time violations. The former requires the computation of the distribution of the maximum arrival time at the latches, which requires the solution of the SSTA problem as defined above. On the other hand, the latter problem needs the distribution of the minimum arrival time at the latches to be computed, and this can be solved by a trivial extension of the framework for the SSTA problem proposed in the paper, using minimum operators, as will be mentioned in Section IV-C, instead of maximum operators.
The proposed approach to solve the SSTA problem is based on the following assumption on the distribution of the process parameter values.
Assumption 1: The process parameter values are assumed to be normally distributed random variables.
The gate and interconnect delays, being functions of the fundamental process parameters, are approximated using a firstorder Taylor series expansion. It will be shown that as a result of this, all edges in graph G s are normally distributed random variables. Since spatial correlations of the process parameters are considered, it turns out that some of the delays are correlated random variables. Furthermore, the circuit delay D max is modeled as a multivariate normal distribution. Although the closed form of circuit delay distribution is not normal, it is shown that the loss of accuracy is not significant under this approximation.
III. MODELING PARAMETER VARIATIONS
In this section, the model used for intra-die variations with spatial correlation will be introduced. Although only intradie variations of parameters are considered at this point, the extension of this work to handle inter-die variations will be introduced later in Section VI-A.
A. Components of Intra-die Variations
The intra-die parametric variation δ intra can be decomposed into three components: 1) a deterministic global component δ global ; 2) a deterministic local component δ local ; and 3) a random component [18] 
The global component δ global is location dependent. Across the die or reticle field, it can be modeled by a slanted plane and expressed as a simple function of its location
where (x, y) is its die location and δ x and δ y are gradients of parameter indicating the spatial variations of parameter along the x and y directions, respectively. The local component δ local is proximity dependent and layout specific. The random component stands for the random intra-chip variation and the vector of all random components across the chip or reticle field has a correlated multivariate normal distribution due to spatial correlation of the intra-die variation
where Σ is the covariance matrix of parameters. The detailed model for this covariance matrix will be described in the next section. For spatially uncorrelated parameters, Σ becomes a diagonal matrix where the entries represent variances. If the variances of the parameters described by this matrix are assumed to be uniform across the chip, then Σ is a multiple of the identity matrix. In this paper, only the impact of global and random components will be considered. However, the local component can also be included in the model, given, for instance, the chip layout and precharacterized spatial maps of parameters as in [19] .
Under intra-die variation, the value of parameter p located at (x, y) can be modeled as
wherep is the nominal design parameter value at die location (0, 0). In this way, all parameter variations are modeled as locationdependent normally distributed random variables.
In this work, for transistors, the following process parameters are considered [20] as random variables: transistor length L g and width W g , gate oxide thickness T ox , and doping concentration density N a ; for interconnect, at each metal layer, the following parameters are considered: metal width W int l , metal thickness T int l , and ILD thickness H ILD l , where the subscript l represents that the random variable is of layer l, where l = 1, . . . , n layers . Among all the parameters listed above, L g is observed to exhibit largest parameter variability and also has the most important impact on circuit performance when it shows variations [20] . It is believed that this framework is general enough that it can be applied to handle variations of other parameters as well.
B. Spatial Correlations
To model the intra-die spatial correlations of parameters, the region of die or reticle field 2 is partitioned into nrow × ncol = n grids. Since devices [wires] close to each other are more likely to have more similar characteristics than those placed far away, assume perfect correlations among the devices [wires] in the same grid, high correlations among those in close grids, and low or zero correlations in far-away grids. For example, in Fig. 1 , gates a and b (whose sizes are shown to be exaggeratedly large) are located in the same grid square, and it is assumed that their parameter variations (such as the variations of their gate length) are always identical. Gates a and c lie in neighboring grids, and their parameter variations are not identical but highly correlated due to their spatial proximity (for example, when gate a has a larger than nominal gate length, it is highly probable that gate c will have a larger than nominal gate length, and less probable that it will have a smaller than nominal gate length). On the other hand, gates a and d are far away from each other, and their parameters may be uncorrelated (e.g., when gate a has a larger than nominal gate length, the gate length for d may be either larger or smaller than nominal).
The proposed algorithm makes a second assumption on the distribution of process parameters.
Assumption 2: It is assumed that nonzero correlations may exist only among the same type of process parameters in 2 The same model can be used to model the parameter variations across a reticle field containing multiple chips, in which case, these multiple chips can be analyzed simultaneously and the maximum of the delays at the primary outputs (POs) of all chips is the distribution of chip delay. This does not change the complexity of the algorithm, since the number of dies in a reticle field is a small integer. different grids, and there is no correlation between different types of process parameters. 3 For example, L g values for transistors in a grid are correlated with those in nearby grids, but they are uncorrelated with other parameters such as W g or W int l in any grid. (Note here that interconnect parameters in different layers are considered to be "different types of parameters," e.g., W int 1 and W int 2 are uncorrelated.)
Under this model, the parametric variation for a spatially correlated parameter in a single grid at location (x, y) can be modeled using a single random variable p(x, y). In total, this representation requires n random variables, each representing the value of a parameter in one of the n grids, and a covariance matrix of size n × n representing the spatial correlations among the grids. The covariance matrix could be determined from data extracted from manufactured wafers. For example, a test structure methodology was developed to support the evaluation of process parameter variations in [22] . The number of grid regions divided can also be determined using the test structure methodology by refining the number of grids until delay distribution of test structure converges or changes only within a small tolerance range. In this work, due to the lack of access to real wafer data, the correlation matrix derived from the spatial correlation model in [3] is used. However, it is believed that the proposed model is more general than the model used in [3] , since it is purely based on neighborhood. For example, consider again the case in Fig. 1 , by the proposed model, the parameter in grid (1, 2) has equal correlations with that in grid (1, 1) and (1, 3) . While by the model of [3] , it will have higher correlation with grid (1, 1) than grid (1, 3), i.e., the correlations are uneven at the two neighbors of grid (1, 2) .
For clarity of presentation, it is assumed here that all types of parameters have spatial correlations. In manufacturing, due to effects such as random dopant fluctuations, the intra-die variations of some parameters are truly uncorrelated from transistor to transistor. The extension of this work to incorporate the effect of spatially uncorrelated parameters will be shown in Section VI.
IV. STATISTICAL TIMING ANALYSIS ALGORITHM
The core statistical STA method is described in this section, and its description is organized as follows. At first, Section IV-A will describe how the distributions of gate and interconnect delays are modeled as normal distributions, given the pdfs that describe the variations of various parameters. In general, these pdfs will be correlated with each other. Section IV-B will show how the complicated correlated structure of parameters can be simplified by orthogonal transformations. Section IV-C will describe the PERT-like traversal algorithm on the statistical timing graph by demonstrating the procedure for the computation of max and sum functions. Finally, Section IV-D will explain why orthogonal transformations are important in the proposed method.
A. Modeling Gate/Interconnect Delay pdfs
This section will show how the variations in the process parameters are translated into pdfs that describe the variations in the gate and interconnect delays that correspond to the weights on edges of the statistical timing graph.
In Section III, the geometrical parameters associated with the gate and interconnect are modeled as normally distributed random variables. Before introducing how the distributions of gate and interconnect delays will be modeled, consider first an arbitrary function d = F ( P ) that is assumed to be a function on a set of parameters P , where each p i ∈ P is a random variable with a normal distribution given by
The function d can be approximated linearly using a firstorder Taylor expansion 
where cov(p i , p j ) is the covariance of p i and p j . It is reasonable to ask whether the approximation of d as a normal distribution is valid, since the distribution of d may, strictly speaking, not be Gaussian. It can be said that when ∆p i has relatively small variations, the first-order Taylor expansion is adequate and the approximation is acceptable with little loss of accuracy. This is generally true of intra-die variations, where the process parameter variations are relatively small in comparison with the nominal values. For this reason, as functions of process parameters, the gate and interconnect delays can be approximated as a sum of normal distributions (which is also normal) applying (5) .
Computing the pdf of Interconnect Delay: In this work, the Elmore delay model is used for simplicity to calculate the interconnect delays. 4 Under the Elmore model, the interconnect delay is a function of the vector of resistances R w , the vector of capacitances C w , of all wire segments in the interconnect tree, and the vector of input load capacitances C g , of the fanout gates or receivers
Since the resistances and capacitances above are determined by the process parameters P of the interconnect and the receivers, such as
, and T ox , the sensitivities of the interconnect delay to a parameter p i can be found by using the chain's rule
The distribution of interconnect delay can then be approximated on the computed sensitivities. The factors that affect the interconnect delay associated with edges in the statistical timing graph will now be specifically considered. Recall that under the proposed model, the chip area is divided into grids so that the parameter variations within a grid are identical, but those in different grids exhibit spatial correlations. Now, consider an interconnect tree with several different segments that reside in different grids. The delay variations in the tree are affected by the parameter variations of wires in all grids that the tree traverses. For example, in Fig. 1 , consider the two segments uv and pq in the interconnect tree driven by gate a. Segment uv passes through the grid (1, 1) and pq through the grid (1, 2) . Then the resistance and capacitance of segment uv should be calculated based on the process parameters of grid (1, 1), while the resistance and capacitance of segment pq should be based on those of grid (1, 2) . Hence, the distribution of the interconnect tree delay is actually a function of random variables of interconnect parameters in both grid 4 However, it should be emphasized that any delay model may be used, and all that is needed is the sensitivity of the delay to the process parameters. For example, through a full circuit simulation, the sensitivities may be computed by performing adjoint sensitivity analysis.
(1, 1) and grid (1, 2) , and should incorporate any correlations between these random variables. Similarly, if the gates that the interconnect tree drives reside in different grid locations, the interconnect delay to any sink is also a function of random variables of gate parameters of all grids in which the receivers are located. In summary, the distribution of interconnect delay function can be approximated by (10) , shown at the bottom of the page, where d 0 int is the interconnect delay value calculated with nominal values of parameters, Γ g is the set of indices of grids that all the receivers reside in, Γ int is the set of indices of grids that the interconnect tree traverses, and
g is the random variable representing transistor length in the ith grid. The parameters
, and ∆H i ILD l are similarly defined. As before, the subscript "0" next to each sensitivity represents the fact that it is evaluated at the nominal point.
Computing the pdfs of Gate Delay and Output Signal Transition Time: The distribution of gate delay and output signal transition time at the gate output can be approximated in a similar manner as described above, given the sensitivities of the gate delay to the process parameters.
Consider a multiple-input gate, let d can be written as a function of the process parameters P of the gate, the loading capacitance of the driving interconnect tree C w and the succeeding gates that it drives C g , and the input signal transition time S pin i in at this input pin of the gate to process parameters can be computed applying the chain's rule. The derivatives of C w and C g to the process parameters can be easily computed, as C w and C g are functions of process parameters. The input signal transition time S in is a function of the output transition time of the preceding gate and the delay of the interconnect connecting the preceding gates and this gate, where both interconnect delay (as discussed earlier) and output transition time of the preceding gate (as will be shown in the next paragraph) are Gaussian random variables that can be expressed as a linear function of parameter variations. Therefore, at a gate input, the input signal transition time S in is always given as a normally distributed random variable with a mean and firstorder sensitivities to the parameter variations.
To consider the effect of nonideal input signal on gate delay, the output signal transition time S out at each gate output needs to be computed in addition to pin-to-pin delay of the gate. In conventional STA, S out is set to S pin i out if the path ending at the output of the gate traversing the ith input pin has the longest path delay d path i . In SSTA, each of the paths through different gate input pins has a certain probability to be the longest path. Therefore, S out should be computed as a weighted sum of the distributions of S pin i out , where the weight equals the probability that the path through the ith pin is the longest among all others
becomes computing the probability of a Gaussian random variable greater than zero, which can easily be found from a look-up table. As each S pin i out is a Gaussian random variable in linear combination of parameter variations, S out is, therefore, also a Gaussian distributed random variable, and its sensitivities to all process parameters ∂S out /∂p i can easily be found from its linear expression of parameters.
B. Orthogonal Transformation of Correlated Variables
In statistical timing analysis without spatial correlations, correlations due to reconvergent paths has long been an obstacle. When the spatial correlation of process parameters is also taken into consideration, the correlation structure becomes even more complicated. To make the problem tractable, the PCA technique [23] is used to transform the set of correlated parameters into an uncorrelated set.
PCA is a method that can be employed to examine the relationship among a set of correlated variables. Given a set of correlated random variables X with a covariance matrix R, PCA can transform the set X into a set of mutually orthogonal random variables, X , such that each member of X has zero mean and unit variance. The elements of the set X are called principal components (PCs) in PCA, and the size of X is no larger than the size of X. Any variable x i ∈ X can then be expressed in terms of the PCs X as
where x j is a PC in set X , λ j is the jth eigenvalue of the covariance matrix R, v ij is the ith element of the jth eigenvector of R, and σ i and µ i are, respectively, the mean and SD of x i . Since it is assumed that different types of parameters are uncorrelated, the random variables of parameters can be grouped by types, and PCA can be performed in each group separately, i.e., the PCs for L g , W g , T ox , N a , W int l , and T int l are computed individually. Clearly, not only are the PCs of the same type of parameters independent, but so are the PCs of different type of parameters.
For instance, let L g be a random vector representing transistor gate length variations in all grids, and it is of multivariate normal distribution with covariance matrix R L g . Let L g be the set of PCs computed by PCA. Then any L i g ∈ L g representing the variation of transistor gate length of the ith grid can then be expressed as a linear function of the PCs
where Superposing the set of rotated random variables of parameters on the original random variables in gate or interconnect delay in (10), the expression of gate or interconnect delay is then changed to the linear combination of PCs of all parameters
where
and m is the size of P . Note that all of the PCs p i that appear in (16) are independent. Equation (16) has the following properties.
Property 1: Since all p i are orthogonal, the variance of d can be simply computed as
Property 2: The covariance between d and any PC p i is given by
In other words, the coefficient of p i is exactly the covariance between d and p i .
Property 3: Let d i and d j be two random variables
The covariance of
In comparison, without an orthogonal transformation, the value of cov(d i , d j ) has to be computed by a more complicated formula as will be described in Section IV-D.
C. PERT-Like Traversal of Statistical STA
Using the techniques discussed up to this point, all edges of the statistical timing graph may be modeled as normally distributed random variables. In this section, a procedure for finding the distribution of the statistical longest path in the graph will be described.
In conventional deterministic STA, the PERT algorithm can be used to find the longest path in a graph by traversing it in topological order using the following two types of function:
1) the sum function; 2) the max function. In the statistical timing analysis, a PERT-like traversal is employed to find the distribution of circuit delay. However, unlike deterministic STA, the sum and max operations here are functions of a set of correlated multivariate Gaussian random variables instead of fixed values:
where d i is a Gaussian random variable representing either gate delay or wire delay expressed as linear functions of PCs in the form of (19) , and l is the number of random variables that the sum or the max function is operating on.
Computing the Distribution of the Sum Function:
The computation of the distribution of sum function is simple. Since the 
Computing the Distribution of the Max Function:
The max function of n normally distributed random variables d max = max(d 1 , . . . , d l ) is, strictly speaking, not Gaussian. However, it has been found that, in practice, it can be approximated closely by a Gaussian. This idea is similar in spirit to Berkelaar's approach in [4] and [24] , although it is more general since Berkelaar's work restricted its attention to delay random variables that were uncorrelated. 5 In this work, the Gaussian distribution is used to approximate the result of a max function, 
Therefore, determining this approximation for d max is equivalent to finding the values of µ d max and all a i 's. From Property 2 of Section IV-B, it is known that the coefficient a r equals cov(d max , p r ). Then the variance of the expression on the right-hand side of (24) 
It can now be seen that to find the linear approximation for d max , the values of µ d max , σ d max , and cov(d max , p i ) are required. In [6] , similar inputs were required in their algorithm and the results from [25] were applied and seen to provide good results. In this work, the same analytical formula has been borrowed from [25] for the computation of the max function.
According to [25] , if ξ and η are two random variables, ξ ∼ N (µ 1 , σ 1 ) and η ∼ N (µ 2 , σ 2 ), with a correlation coefficient of r(ξ, η) = ρ, then the mean µ t and the variance σ 2 t of t = max(ξ, η) can be approximated by (27) where
The formula will not apply if σ 1 = σ 2 and ρ = 1. However, in this case, the max function is simply identical to the random variable with the largest mean value. Moreover, from [25] , if γ is another normally distributed random variable and the correlation coefficients r(ξ, γ) = ρ 1 , r(η, γ) = ρ 2 , then the correlation between γ and t = max(ξ, η) can be obtained by
Using the formula above, all the values needed can be found. As an example, notice how this can be done by first starting with a two-variable max function, d max = max(d i , d j ) . Let d max be of the form of (24) . The approximation of d max can be found as follows. 
whichever has a larger mean value, and it can be stopped here; otherwise, it will be continued to the next step. The calculation of the two-variable max function can easily be extended to a multivariable max function by repeating the steps of the two-variable case recursively.
As mentioned at the beginning of this section, the max of two Gaussian random variables is not strictly Gaussian. This approximation can sometimes introduce serious errors, e.g., when the two Gaussian random variables have the same mean and SD and correlation value of −1, and the distribution of the maximum is a half Gaussian. During the computation of the multivariable max function, some inaccuracy could be introduced since the max function is approximated as normal even though it is not really normal, and proceed with further recursive calculations. To the best of the authors' knowledge, there is no theoretical analysis available in the literature that quantifies the inaccuracies when a normal distribution is used to approximate the maximum of a set of Gaussian random variables. However, a numerically based analysis was provided in [25] , which suggests that in some situations, the errors can be great, but for many applications, this approximate is quite satisfactory. Results suggesting that such inaccuracies are not significant in the circuit context will be shown in Section VII, and it will be seen that the results match very well with the simulation results from an MC analysis.
Furthermore, recall that there is a "normalization" step to diminish the difference between the variance computed from the linear form of the max approximation and the real variance of the max function. As in the case of approximating the max as normal distribution, there is no theoretical proof about how this "normalization" step can affect the accuracy of the approximation. Another option to diminish the difference is to move it into an independent random Gaussian component, and it is difficult to state definitively which of these options is better. In this work, the former option is chosen, and it is found that it provides excellent accuracy, as will be shown in Section VII, where the statistics of the "normalization" ratio for several test circuits are provided.
At this point, not only the edges, but also the results of sum and max functions are expressed as linear functions of the PCs. Therefore, using a PERT traversal by incorporating the computation of sum and max functions described above, the distribution of arrival time at any node in the timing graph becomes a linear function of PCs, and so the distribution of circuit delay can be computed at the virtual sink node.
The overall flow of the proposed algorithm is shown in Fig. 2 . It is noticed that this work is in some sense parallel to [14] : In [14] , delays are represented as linear combinations of global random variables, while in this work, they are linear functions of PCs; in [14] , the max of delays are reexpressed as linear functions using binding probabilities, while in this work, the linear functions are found by an analytical method from [25] .
To further speed up the process, the following technique may be used: During the max operation of statistical STA, if the value of µ + 3σ of one path has a lower delay than the value of µ − 3σ of another path, the max function can simply be calculated ignoring the former path.
D. Utility of Principal Components
The previous sections described the proposed statistical STA algorithm. The purpose of this section is to elaborate why the orthogonal transformation is needed to transform the set of correlated process parameters to an uncorrelated set, and how it can simplify the problem of statistical STA considering spatial correlations.
Let d i and d j be the distributions of two gate delays. For simplicity, it is assumed that the gate lengths L g are the only spatially correlated parameters. It is also assumed that d i and d j are sensitive to the same set of correlated random variables of gate lengths L 
Obviously, the covariance of d i and d j is decided by the covariance structure of L g . The direct calculation of cov(
is of a complicated form as in [5] 
In contrast, in the proposed method, orthogonal transforma-
Next, by superposition, d i and d j are transformed to
The value of cov(d i , d j ) can be simply computed using the coefficients of
The advantage in this computation is that it is not necessary to know which specific parameters in d i and d j are correlated. In fact, consider the coefficients of l The direct computation of the covariance of path delays is in a similar form. In general, the path delays are correlated when the gate delays on these paths are correlated. As shown in [5] , the path covariances can be computed on the basis of pairwise gate delay covariances; however, the number of paths is numerous, which makes it computationally difficult to apply such a path-based method to large circuits.
In the proposed method, with the orthogonal transformation, the covariances of path delays are manifested as the coefficients of the independent PCs as in the case of correlated gate delays. The covariances of the paths can then be simply computed in linear time based on these coefficients only, and there is no need to worry about how the gates on the paths are correlated or which parts are correlated. For the same reason, in this algorithm, besides the spatial correlations, path correlations due to reconvergence (structural correlations) can also be accounted for automatically by using the orthogonal transformation on the spatially correlated parameters. However, when spatially uncorrelated parameters are involved in the computation, the structural correlations due to these independent parameters cannot be efficiently dealt with by this methodology, since directly keeping all uncorrelated random variables in the delay form results in a huge number of variables. The extension of the work for handling spatially uncorrelated parameters will be given in Section VI-B.
V. COMPUTATIONAL COMPLEXITY
A run time complexity analysis is presented here to show which factors most greatly affect the CPU time of the algorithm.
The flowchart shown in Fig. 2 can be divided into two parts: 1) model precharacterization (steps 1, 2, and 3); and 2) SSTA (steps 4, 5, and 6). Model precharacterization consists of construction of parameter variations and grid-based spatial correlation models, and the computation of PCs for spatially correlated parameters. The computation of PCs requires calculations of eigenvectors and eigenvalues of the covariance matrix, and its time complexity is O(pn 3 ), where n is total number of grids divided and p is the number of parameters considered. While this step may seem to be a bottleneck of the algorithm, it is a only one-time process. Once the models of parameter variations are constructed, they can be repeatedly used to analyze any design. Meanwhile, for spatial correlated parameters, the PCs computed from the covariance matrix are only model-dependent, so that for different designs analyzed with the same parameter model, the same set of PCs can be applied. In other words, the step of model precharacterization is in fact a one-time library construction at early stage and, therefore, can be excluded from the run time complexity analysis of the algorithm.
The run time of the SSTA algorithm can be divided into the following.
1) The time required to find the delay distribution of the gate and interconnect: 6 This run time depends on how many different grids the interconnect passes through and how many grids the gates are located in, and in general, these numbers are bounded by constant numbers. The run time is also proportional to the total number of PCs, since orthogonal transformation is performed at each wire segment of interconnect. For each random variable, the number of PCs is no more than the total number of grids n partitioned on the chip. The total number of PCs is no more than pn. Thus, the time required to find the distribution of a single gate or wire can be estimated as O (pn) . If N g is the total number of gates and N I the number of net connections in the circuit, the time of this part can be estimated as O(pn (N g + N I ) ).
2) The time required to evaluate the max function: The cost of this operation is proportional to the number of random variables involved in the max operation and the number of PCs of each random variable. The max operation is used at all multi-input gates and at the last level (sink node) where the maximum circuit delay is computed. This number can be upper bounded by the total number of net connections N I in the circuit. Thus, the run time of this part is O(pnN I ).
3) The time required to compute output transition time at each gate output: For a gate with k > 2 inputs, it requires k 2 max operations and k − 1 sum operations, which are constant numbers of max and sum operations. The computation is needed for all gates and thus the total cost is O(pnN g ). 4) The time required to evaluate the sum function: The sum operation must be performed at all gates and interconnects encountered during the PERT-like traversal. A single sum operation requires O(n) and, therefore, the total complexity for this part is O(pn (N g + N I ) ).
Therefore, the run time complexity of the algorithm is O(pn(N g + N I )), which is pn times that of deterministic STA. 6 The time to characterize the sensitivities of delay on parameter variations is excluded from this analysis.
VI. EXTENDING THE METHOD TO HANDLE INTER-DIE VARIATIONS, SPATIALLY UNCORRELATED INTRA-DIE PARAMETERS, AND MIN-DELAY COMPUTATIONS
This section will first describe how this work can be extended to include the effect of inter-die variations in addition to intradie variations. Subsequently, it will be explained how spatially uncorrelated parameters can be incorporated into the current proposed algorithm. Finally, it will be shown how minimum delay computations can easily be incorporated into this framework.
A. Inter-die Variations
In general, the process parametric variation can be modeled as
where δ inter is the inter-die variation and δ intra is the intra-die variation. As for δ intra , δ inter is also modeled as a Gaussian random variable.
As introduced in Section I, inter-die variation has a global effect on all the transistors [wires] within a single chip and, therefore, a single random variable δ inter can be applied to all transistors [wires] to model the effect of inter-die variation. Consequently, the covariance matrix for each type of spatially correlated parameter is changed by adding to all entries a value of σ 2 δ inter , the variance of inter-die parametric variation. Based on the new covariance matrices, the same statistical STA methodology can still be applied to compute distribution of chip delay.
B. Spatially Uncorrelated Parameters
In practice, it is observed that not all process parameters are spatially correlated. For example, the variations of T ox or N a are independent from transistor to transistor. To model the intradie variation of spatially uncorrelated parameter, a separate random variable has to be used for each gate [wire] to represent such independence, instead of a single random variable for all gates [wires] in the same grid for the spatial correlated parameters. Consequently, the timing analysis framework introduced in previous sections must be further extended to accommodate the spatially uncorrelated parameters.
As an example, let us consider the case that gate oxide thickness T ox is the only spatially uncorrelated parameter. The idea described here can easily be extended to the case where there is more than one uncorrelated parameter. With inter-and intra-die variations, the variation of T ox for the ith transistor can be expressed as δ linear combination of PCs and either variable is independent from the PCs and any other random variables in the delay expressions. The timing propagation using the sum and max operators remains the same, except that after each sum or max operation, the random variables for intra-die variations of spatially uncorrelated parameters ∆T i ox 's are merged into one random variable, so that, at each arrival time, only one independent random variable is kept for all intra-die variations of spatially uncorrelated parameters. It is observed that the way of adding this independent random variable to the standard form of the representation of arrival times is similar to the "residual" variance's lumping into the independently random part in [26] .
Although structural correlations can be automatically taken into account using orthogonal transformation on spatially correlated parameters as explained in Section IV-D, the structural correlations due to spatially uncorrelated parameters cannot be handled with the same technique because of the merging of these random variables during the propagation. To reduce the inaccuracies caused, one can appeal to the available [7] , [9] , [10] . In this work, the structural correlations caused by the spatially uncorrelated parameters have been ignored. However, since the structural correlations from spatially correlated parameters are considered, the inaccuracies introduced from this assumption are not significant, as will be demonstrated in Section VII.
C. Distribution of the Minimum of a Set of Gaussians
In circuit performance analysis, computations such as finding the RAT for long-path analysis, and minimum delay computations for short-path analysis (to check for hold time violations), require the computation of the minimum of a set of delays, which becomes finding the distribution of the minimum of a set of random variables under process variations.
The procedure for calculation of the maximum of a set of Gaussians can be utilized to compute the minimum of a set of Gaussian random variables,
where d i is a normally distributed random variable and max is the operator introduced in Section IV-C.
VII. EXPERIMENTAL RESULTS
The proposed algorithm was implemented in C++ as the software package "MinnSSTA" and tested on the edge-triggered ISCAS89 benchmark circuits by working on the combinational logic blocks between the latches. All experiments were run on a Linux PC with a 2.0-GHz CPU and a 256-MB memory. Parameters of 100 nm technologies are experimented upon on a two-metal layer interconnect model. The process parameters (Table I) used here are based on predictions from [20] and [27] .
Since the computation requires physical information about the locations of the gates and interconnects, all cells in the circuit were first placed using the placement tool, Capo [28] . Global routing was then performed to route all the nets in the circuits. Depending on the size of circuit, the chip area into different sizes of grids was divided. Again, due to the lack of access to real wafer data, the covariance matrix for intra-die variations used in this work were derived from the spatial correlation model used in [3] by equally splitting the variance into all levels.
To verify the results of the proposed method MinnSSTA, MC simulations based on the same grid models for comparison were used. To balance the accuracy and run time, 10 000 iterations for the MC simulation were run.
First, the experimental results are presented assuming that all parameters are spatially correlated while using fixed values for the spatially uncorrelated parameters (T ox and N a ). Table II shows a comparison of the results of MC with those from MinnSSTA. For each test case, the mean and SD values for both methods are listed. The results of MinnSSTA can be seen to be very close to the MC results: The average error is −0.23% for the mean and −0.32% for the SD. In Fig. 3 , for the largest test case s38417, the plots of the pdf and cdf of the circuit delay for both MinnSSTA and MC methods are provided. It is observed that the curves almost perfectly match each other. This demonstrates the accuracy of the PCA approach for correlated parameters, including its ability to account for structural correlations.
Next, the results for considering the variations of the spatially uncorrelated parameters (T ox and N a ) are given in Table III . On average, the error is 1.06% for the mean value and −4.34% for the SD. In Table VIII , the 99% and 1% confidence points achieved by MC and MinnSSTA are also provided, and the average errors are −2.46% and −0.99%, respectively. Again, for the largest test case s38417, the pdf and cdf curves of the circuit delay for both MinnSSTA and MC methods are plotted in Fig. 4 . It can be seen that, at the range of lower and higher circuit delay values, the circuit delay distribution computed from MinnSSTA matches well with that of the MC simulation, although there are some deviations in the central portion. As mentioned in Section VI-B, some errors may be introduced from the structural correlations, which are not handled exactly in the presence of uncorrelated intra-die components. Based on the analysis of the experiments, it is found that the cause for the small error that is introduced here is primarily because our implementation does not handle structural correlations between the uncorrelated variables. It is believed that, by appending into the existing framework, an algorithm that handles structural correlation [7] , [9] , [10] , the error of the results in Table III can be further reduced.
In Table III , the CPU times for both methods are provided. To show that the PCA steps require very little run time, the run time for this part is also listed; however, as pointed out earlier, this can be considered a preprocessing step that is carried out once for each technology, and its cost need not be considered in the computation. It can be seen that the CPU time of MinnSSTA on all test cases is very fast. The circuit with the longest run time, s35932, was analyzed in only about 500 s, while the MC simulation required over 15 h.
In the proposed approach, in order to make the computed value of SD of d max the same as that of the approximated linear expression, the coefficients of parameters in the linear expression are normalized by the ratio of the SD of d max (namely, σ d max ) to that of the linear expression s 0 . In Table IV , the statistics of this ratio for all testcases are listed, including the mean, SD, minimum, and maximum values of the ratio and the probability of the ratio falls into each given range. In general, the higher the ratio, the larger the error for estimating d max is, and thus the less accurate for estimating the circuit delay distribution using the proposed approach. For example, the testcase s35932 has the highest probability of 0.045 for the ratio to be greater than 1.1, and it also has the largest errors predicting the circuit mean and SD. Over all test cases, the average value of the ratio is 1.003, which is a reasonably small number so that the accuracy of the proposed statistical SSTA should not be affected significantly by this normalization step.
To further verify the applicability of the proposed algorithm, it has been demonstrated on a path-balanced circuit whose topology is a binary tree of depth 10. Table V lists the results achieved by MinnSSTA and MC. The errors obtained are −0.54% for the mean and −6.26% for the SD; −4.56% and −1.65% for the 99% and 1% confidence points, respectively. This shows that the proposed approach can predict the timing yield well, even for path-balanced circuits.
One may ask what happens if an MC approach was run for the same amount of time as the proposed algorithm. In Table VI , the data achieved from MC runs in the equivalent CPU time of the proposed method "MinnSSTA" are shown. Since this MC simulation can only run a small number of iterations and samples the solution space insufficiently, it does not meet any of the usual convergence criteria used for MC analysis. Therefore, what is achieved is not the genuine distribution of circuit delay, but merely the distribution from an incomplete number of runs. The table shows the minimum values and maximum values of the circuit delay from this insufficient number of MC runs, and for purposes of comparison, the results with the 1% and 99% confidence points, respectively, from the 10 000 iterations of MC simulation. It can be seen that the accuracy is highly variable: In some cases, MC analysis comes close to the action value, while in others, it is very far away. Most notably, large deviations can be seen both for a small circuit (s27) and a large circuit (s38584), implying that the reliability of such an approach is suspect. Of course, this is not surprising in the least, because the artificial limitation on the run time has made the MC analysis unreliable, by permitting only a low point of confidence for its predictions, and has not permitted it to fully sample the search space.
To show the importance of considering spatial correlations, the difference between performing statistical timing analysis while considering spatial correlation and while ignoring it is considered. Since this is a comparison to determine why spatial correlations are important, the CPU time is not a consideration. Therefore, another set of MC simulations (MCNoCorr) was run on the same set of benchmarks, this time assuming zero correlations among the devices and wires on the chip. The comparison between the data is shown in Table VII . It can be observed that although the mean values are close, the variances of the uncorrelated cases (MCNoCorr) are much smaller than the correlated cases (MC). On average, the SD of the correlated case increases by 25.93%. Again, the pdf and cdf curves of both simulations are plotted for circuit s38417 in Fig. 5 . It is seen that the cdf and pdf curves of MCNoCorr deviate significantly from those of MC. In other words, statistical timing analysis without considering correlation may incorrectly predict the real performance of the circuit and could even overestimate the performance of the circuit. This emphasizes the importance of developing efficient statistical STA methods that can incorporate spatial correlations.
As an alternative, the option of using MPCs for these experiments is considered, where the circuit delays are evaluated at all possible corners of parameter values at µ ± 3σ, where µ is the mean and σ the SD for the parameter. Table VIII compares the worst case and best case delays obtained at exhaustive process corners using the MPC method, with the 99% and 1% confidence point delay achieved from the MC simulation accordingly. On average, the MPC approach overestimates the 
VIII. CONCLUSION AND FUTURE WORK
In this paper, an algorithm for performing statistical static timing analysis (SSTA) has been proposed, considering spatial correlations related to intra-die process variations. It is shown that performing statistical timing analysis while ignoring spatial correlations may not be adequate to predict the circuit performance correctly, and that fast and accurate SSTA methods, such as ours, that incorporate spatial correlations are essential. An analysis of the complexity shows it to be reasonable, and like conventional static timing analysis (STA), it is linear in the number of gates and interconnects. The penalty that is paid here is that unlike deterministic STA, it is also linear in the number of grid squares. As a trivial extension of maximum of delays, the computation for the distribution of minimum of delays is also provided.
The current algorithm is limited by the following: It assumes that the distribution of parameter variations are Gaussian and the distribution of gate [wire] delays has linear dependency on the variation of process parameters. A good direction for future research involves solving the problem of statistical timing analysis on non-Gaussian process parameter variations and nonlinear delay dependencies.
