An efficient statistical timing analysis algorithm that can handle arbitrary (spatial and structural) causes of delay correlation is described. The algorithm derives the entire cumulative distribution function of the circuit delay using a new mathematical formulation. Spatial as well as structural correlations between gate and wire delays can be taken into account. The algorithm can handle node delays described by non-Gaussian distributions. Because the analytical computation of an exact cumulative distribution function for a probabilistic graph with arbitrary distributions is infeasible, we find tight upper and lower bounds on the true cumulative distribution. An efficient algorithm to compute the bounds is based on a PERT-like single traversal of the sub-graph containing the set of N deterministically longest paths. The efficiency and accuracy of the algorithm is demonstrated on a set of ISCAS'85 benchmarks. Across all the benchmarks, the average rms error between the exact distribution and lower bound is 0.7%, and the average maximum error at 95 th percentile is 0.6%. The computation of bounds for the largest benchmark takes 39 seconds.
INTRODUCTION
Statistical timing analysis is needed because the traditionally formulated worst-case timing analysis is no longer sufficient for timing verification of advanced chips. Multiple approaches have been proposed over the years to deal with the statistical timing problem. These methods rely either on the performance-space Monte-Carlo STA [1, 2] , the parameter-space integration methods [8] , or the analytical approaches [3, 4, 6, 10] . Monte-Carlo based and integration techniques are very accurate but computationally expensive and become infeasible in the presence of a large number of independent sources of variation and for large circuits. The central challenge for the analytical techniques is the computation of the probabilistic maximum of circuit delay. The proposed approaches for dealing with the challenge rely either on approximating the true distribution by a more convenient one [3, 9, 10] , or on producing the bounds on the true distribution [5, 6] .
Several features seem to be required of a real statistical timing analysis tool. First, it must handle path delay correlation due to all possible sources, including reconvergence, spatial, and structural (non-spatial) correlation. Second, it should allow the analysis of timing graphs with arbitrarily distributed node delays (nonGaussian). Third, the statistical computation must be based on parametric statistical techniques which represent distributions in a compact parametric form (by moments), not histograms. Storing a full distribution (histogram) in memory for every nominal delay point is likely to be too expensive. Finally, given the designer's overwhelming reliance on the standard STA tools such as PrimeTime TM , using the statistical engine as the post-processor to the standard STA is an important feature.
In this paper we describe a statistical static timing analysis tool that addresses the above requirements. The exact distribution of the longest path through a probabilistic timing graph appears intractable regardless of the assumed dependence structure [11] . In contrast to the approaches based on the approximations [8] [9], we believe that computing accurate bounds on the exact distribution is more appropriate. It gives greater flexibility in modeling complex sources of uncertainty, and is essentially consistent with the existing timing paradigm, i.e. the worst-case timing is also a bound, albeit a poor one. We introduce several key mathematical and algorithmic advances. In Section 3, the new mathematical formulation for deriving the entire cdf of the longest circuit path is described. Arbitrary correlations between the delay of the nodes in the probabilistic timing graph, and arbitrary node delay distributions can be handled. In Section 4, an efficient algorithm for computing the bounds based on a PERT-like one pass traversal of the probabilistic timing graph is described. The runtime of the algorithm is O(V+E) where V, E are the number of nodes and edges in the sub-graph containing the N deterministically longest paths to be analyzed. In Section 5, the implementation details and the result of benchmark testing of the algorithm are presented.
MATHEMATICAL BASIS FOR PROBABILISTIC BOUNDING

Problem formulation
The clock cycle of a chip is constrained by the maximum path delay,
, where D i is the delay of the i th path in the circuit. The delay of each path is a random variable, described by a probability distribution. 
where F(t) is the cumulative probability function defined over the path delay probability space. The contribution of this work is a way to efficiently derive bounds on the cdf of circuit delay for any probabilistic timing graph, which is simply a timing graph with random node delays.
In contrast to other approaches, we assume that path delays, not node delays, are Gaussian. This assumption is not, however, as restrictive as it appears because of the distributional convergence under the Central Limit Theorem (CLT) [12] . CLT ensures that a sum of arbitrarily distributed random variables converges to a normal distribution. It is well known that in practice convergence occurs quite rapidly, for N=8-10 random variables, especially for distributions that are nearly Gaussian. Also, while the standard formulation of CLT is for independent random variables, the sum of correlated random variables is also guaranteed to converge to a normal distribution for most practical models of correlation [15] .
Under the above model of path delay, we can handle node delay variation described by an arbitrary probability distribution (Gaussian or non-Gaussian). Not being limited to Gaussian distributions would permit us to handle non-linear dependencies of delay to process and environmental parameter variations. The statistical model of delay can then be found by a higher-order Taylor expansion of the gate and interconnect delay expressions, as compared to the typically used linear expansions [5] [6] [9] . This will improve the accuracy of statistical gate and interconnect delay models.
Bounding the Distribution of Circuit Delay
We now show how we can derive a bound on the cdf of the longest path delay propagating through the probabilistic timing graph that has the properties defined in the previous section. Specifically, we derive an upper and a lower bounds on
where D i is the delay of the i th path. We can re-write this equation as a cumulative probability:
Assuming that the path delay vector is a Gaussian vector, the following theorem can be proven.
Theorem 1.
For any multi-normal random vector with a correlation matrix given by
where and
This theorem expresses the sought cumulative probability in terms of the distribution of a standard multivariate normal vector with an arbitrary correlation matrix. Note that the vector t' that determines the set over which the probability content is evaluated is not equi-coordinate, i.e.
. The first step is to reexpress the cumulative probability of Eq. 6 by a cumulative probability of a vector with a well structured correlation matrix. This is needed because it will allow to later use accurate numerical methods to evaluate the cumulative probability. The following theorem can be used to do that [13] : holds for all
Corollary 1: Using Theorem 2, the cumulative distribution of Eq. 6, can be bounded from below and from above by:
where min Σ (and max Σ ) are generated by setting all their offdiagonal elements to min min{ } ij ij
If min min{ } ij ij Σ = Σ is small, the bound produced by Equations 7 and 8 may be further improved by the following transformation. Let C be a subset of {1….N}, such that ij q ρ ≥ for all and i C j C ∈ ∉ , then:
Then, to get the tightest bound, we will use the largest bound:
The cumulative probabilities (7) and (8) are now described by correlation matrices with identical off-diagonal elements. Still, they require evaluating the probability content of a standard multinormal vector over the non-equicoodinate set, which is numerically expensive. To enable a more efficient numerical evaluation of these cumulative probabilities (c.p.), we now express them in terms of the equicoordinate probability.
Bounding Cumulative Probability through Stochastic Majorization
In this section we will show how we can bound the cumulative probabilities (7) and (8) by a partial ordering of the mean and variance vectors, using the techniques of the theory of majorization [14] . In order to derive the bounds, however, we will need to ensure that several criteria are met, since the bounding is predicated on several properties of the probability distributions.
The notions of strong and weak majorization can be used to compare random variables and their distributions [13] . First, we introduce a set of formal definitions that allow comparing random variables and their distributions. The theorems in this section are shown without the proofs, which can be found in [13] . The notions of strong and weak majorization can be extended to enable comparing random variables and their distributions.
Let and be two -dimensional random variables. is said to stochastically
for every Borel-measurable Schur-convex (Schur-concave) set .
is said to weakly stochastically majorize , in symbols if
g Schur-convex (decreasing Schur-concave) set . A The key idea that we are exploiting in deriving bounds on the cumulative probabilities (7) and (8) is that for certain distributions stochastic inequalities can be established on the basis of a partial ordering of the parameter vectors using ordinary (deterministic) majorization.
The class of random vectors which can be ordered via the ordering of their parameter vectors is limited to distributions that are Schur-convex (Schur-concave). The following theorems formalize this fact [13] :
Let the random variable have a density for (a location parameter vector). If is Schur-convave in , then implies that Thus, if the probability density function of random vector θ X satisfies the properties of Theorem 3 and 4, then we can find a a random vector ξ X that will stochastically majorize, or bound the distribution of, θ X . This is equivalent to saying that (by the definition of stochastic majorization) the probability content of ξ X over the appropriate set will bound the probability content of θ X over this set. This set must satisfy two properties. First, this set must enable computing the probability content that corresponds to our original purpose: the cumulative distribution of the path delay random vector. Second, by definition, it must be Borel-measurable and Schur-convex.
The next theorem provides final answers: We can now point out that for A defined above,
, and putting everything together:
If then or
The similar theorems can be stated for the case of weak stochastic majorization:
(1) Let be the multivariate normal random variable with the mean vector The density function is Schur-concave.
(2) Let denote the set . This set is decrea Theorem 7 :
sing and Schur-concave.
If then or
Theorem 8 :
,
In the next section, we use the above results to bound the cumulative distribution of the path delay vector. In doing that, it will also become clear why the parallel notions of strong and weak stochastic majorization had to be developed.
We can now easily find an equi-coordinate parameter vector that is majorized by the mean path delay vector t. Using the definition of majorization, one may check 
It is impossible, however, to find an equi-coordinate parameter vector that (strongly) majorizes t. 
We have finally bounded the original cumulative probability by cumulative probabilities expressed in terms of an equi-coordinate vector, a correlation matrix with identical off-diagonal elements, and the standard multivariate normal N(0,1) vector.
Evaluating the Cumulative Probability
The prior transformations reduced the problem to estimating the cumulative probabilities of a well-structured object whose probability content is amenable to numerical evaluation. This can be done by using pre-generated look up tables for a wide range of dimensionalities, coordinates, and correlation coefficients. The tables have been generated via the Monte-Carlo integration of the cumulative probability of a multivariate standard normal vector for a range of: (1) vector cardinalities (e.g. number of paths, N); (2) the off-diagonal correlation coefficients. To reduce the data storage requirements, interpolation is used to find the cumulative probability values.
ALGORITHM FOR COMPUTING
CIRCUIT DELAY DISTRIBUTION
Linear Additive Statistical Delay Model
The exposition above assumed that the path-to-path covariance matrix is given. In reality, there will be a significant cost to computing this matrix since this operation is
with the number of paths N. In this section, we derive a set of results that allow computing the correlation matrix and the vector of variances in an algorithmically efficient manner.
The efficient algorithm to compute the bounds is based on the realization that for linear additive models of statistical dependence between the delays of nodes in the probabilistic timing graph, the computation of bounds is greatly simplified. It turns out that under this model, the binary relation of correlation can be converted into a unary relation that reflects the degree of correlation of a node delay variation to the common level of variation among the nodes. This transformation allows key computational improvements. The linear additive statistical model is a natural result of ascribing the total node delay variation to two components: inter-chip (or chip-to-chip) and intra-chip (or withinchip) components of variation [16] :
The first is due to the chip-to-chip, wafer-to-wafer, and lot-tolot variations in the processing conditions. The second is due to local variations on the same chip. Note that the environmental variations can also be captured by the intra-chip source of variation.
We will now work in the space of node delays. For the purpose of notational simplicity, in this section we refer to delay by x. For reasons of greater analytical suitability, we write the statistical delay model as a sum of two independent weighted random variables: Together, the coefficients a and b determine the variance of x and the correlation of x to the chip-level component of variation: Obviously, the binary relationship of correlation between the delay of node i, and the chip-level variation is expressed exclusively in terms of the attributes that describe the statistical structure of delay of node i. To further drive this point home, we will use the following notation to highlight the unary nature of the node-to-chip correlation:
( , )
cor x x c = . Now we can express the relationship of correlation between the two node delays as:
This is the key result because it allows representing a binary operator as a product of two unary operators. Note, however, that this is simply an algorithmically convenient way of capturing the fact that random variables described by linear additive statistical models have an specially simple structure of correlation.
It is easy to see that the above definition can be extended to compute the correlation between two sums: 
where a i , b i and a i , b j are the statistical coefficients of the nodes from the first and second groups of summands.
The node-to-chip correlation can be directly computed by setting the second sum to be the chip-level variation, z. This simply requires setting k 2 =1, a o = 0 and b o =1. Then, setting k=k 1 ,
Finally, note that using the above notation, the correlation between two sums of node delays can be expressed succinctly by:
Again, we can represent a binary operator that describes the correlation between sums of nodes, as a product of two unary operators.
Efficient Computation of the Bounds
From the full probabilistic timing graph G, we first extract a sub-graph G' that contains N deterministically (with respect to the mean delay value) longest paths using an efficient path-extraction algorithm, e.g. [17] . Assuming the longest path is known ( max D ) and the timing threshold ( D ∆ ) is set, the N longest paths with delays in the interval
can be extracted in a very reasonable time. We now use the unary correlation to describe an efficient algorithm for computing the lower and upper bounds on the circuit delay cdf. This algorithm has the computational complexity of O(V+E) where V, E are the number of nodes and edges in G'. The algorithm computes all the variables needed for establishing the bounds in a single PERT-like pass of the graph.
The first key ingredient is the update formula for computing the correlation of the sum of (k+1) variables based on the node-tochip correlation of k variables and the node-to-chip correlation of the (k+1) th node:
Theorem 10. The update formula is
( )
This expression would allow an exact computation of path-topath delay correlation, and path variance, for all paths through the timing graph. It is clearly impossible to compute the full correlation matrix in a one-pass manner. The second ingredient is the realization that the computation of the entire path-to-path correlation matrix is unnecessary, since the bounds rely only on the minimum and maximum values of the correlation matrix. With that in mind, we can use an efficient graph-traversal algorithm to compute only the smallest (largest) path-to-path correlation, rather than every path-to-path correlation. The algorithm is based on the following two theorems. By directly analyzing the unary definition of path-to-path correlations ( ,
Theorem 11: The minimum (maximum) path-to-path correlation in a timing graph is given by the product of two smallest (largest) path-to-chip correlations in the graph.
At the same time, we are guaranteed to find the two smallest (largest) path-to-chip correlation values by propagating them in a PERT-like fashion from the primary inputs to the primary outputs:
Computing two smallest (largest) path-to-chip correlations in the probabilistic timing graph can be done in a single traversal of a graph.
Similarly, the maximum and minimum path variance can be computed in one PERT-like traversal of the graph. We can show that, defining the variance update function, VU:
In summary, all the variables needed for computing the bounds on the true cdf can be found in a single traversal of the timing graph.
RESULTS
The algorithms described in Sections 3 and 4 have been implemented in C, and have been run on Windows PC with CPU 2.4 GHz and 512 MB memory. Below is our flow's pseudo-code:
The algorithms have been tested on a set of combinational ISCAS85 benchmark circuits. The 3 σ values of node delays were set at 15-20% of the mean values. The node delays were described by a correlation matrix with entries ranging from 0.8-0.95. The exact cumulative distribution function was computed via MonteCarlo runs of the deterministic static timing analysis algorithm with samples taken from the relevant correlated gate delay distributions. For each simulation, 1000 iterations of Monte Carlo were run.
The bounds generated by the algorithm follow the Monte Carlo distribution very closely (Figure 1 and 2a) . Table 1 shows the errors of both upper and lower bounds with respect to the exact distribution. The bounds were computed for N=20 and 50 longest paths. Across the benchmarks, the lower bound rms error is 0.3%-1.4% while the upper bound rms error is 0.7%-2.1%. Note that we are specifically interested in the lower bound since it gives us a conservative value of circuit delay at any confidence level. Importantly, the lower bound becomes tighter at higher confidence levels, giving a more reliable estimate of parametric yield: at the 95 th percentile the maximum error across all the benchmark circuits is 0.7%. Figure 2b demonstrates that accurately accounting for node delay correlations is crucial in predicting the shape of cdf: the mean value of an uncorrelated case is larger than that of the correlated one, while the spread is much smaller. under 20s for N=20 and under 40s for N=50 longest paths, over all the benchmarks. (At the same time, the longest path extraction takes up to 90% of total runtime -188s and 210s for N=20,50). Figure 3 shows a close-to-linear growth in the runtime of the algorithm as a function of circuit size, which makes practical the use of the algorithm for significantly larger circuits.
Bounding Error RMS (%)
Run-time (sec) o o: H ig h co rrelatio n so lid: M oderate correlation + +: U n co rrelated C um ulative P roba bility
