Abstract-An efficient and accurate statistical static timing analysis (SSTA) algorithm is reported in this paper, which features 1) a conditional linear approximation method of the MAX/MIN timing operator, 2) an extended canonical representation of correlated timing variables, and 3) a variation pruning method that facilitates intelligent tradeoff between simulation time and accuracy of simulation result. A special design focus of the proposed algorithm is on the propagation of the statistical correlation among timing variables through nonlinear circuit elements. The proposed algorithm distinguishes itself from existing block-based SSTA algorithms in that it not only deals with correlations due to dependence on global variation factors but also correlations due to signal propagation path reconvergence. Tested with the International Symposium on Circuits and Systems (ISCAS) benchmark suites, the proposed algorithm has demonstrated very satisfactory performance in terms of both accuracy and running time. Compared with Monte-Carlo-based statistical timing simulation, the output probability distribution got from the proposed algorithm is within 1.5% estimation error while a 350 times speed-up is achieved over a circuit with 5355 gates.
I. INTRODUCTION
The timing performance of deep submicrometer microarchitectures will be dominated by several factors. Integrated circuit (IC) manufacturing process parameter variations will cause device and circuit parameters to deviate from their designed value. Low supply voltage for low-power applications will reduce noise margin, causing increased timing delay variations. Due to dense integration and nonideal on-chip power dissipation, the rising temperature of a substrate may lead to hot spot, causing excessive timing variations. Classical worst case timing analysis produces timing predictions that are often too pessimistic and grossly conservative. On the other hand, statistical static timing analysis (SSTA) that characterizes timing delays as statistical random variables offers a better approach for more accurate and realistic timing prediction.
Existing SSTA methods can be categorized into two distinct approaches: path-based SSTA [1] - [4] and block-based SSTA [5] - [10] . Path-based SSTA seeks to estimate timing statistically on selected critical paths. However, the task of selecting a subset of paths whose time constraints are statistically critical has a worst case computation complexity that grows exponentially with respect to circuit size. Hence, path-based SSTA is not easily scalable to handle realistic circuits. Block-based SSTA, on the other hand, champions the notion of progressive computation. Specifically, by treating every gate/wire as a timing block, SSTA is performed block by block in the forward direction in the circuit timing graph without looking back to the path history. As such, the computational complexity of block-based SSTA would grow linearly with respect to circuit size. However, to realize the full benefit of block-based SSTA, one must address a challenging issue that timing variables in a circuit could be correlated due to either global variations [6] , [7] , [10] or path reconvergence [5] , [9] . As illustrated in the left-hand side of Fig. 1 , global correlation refers to the statistical correlation among timing variables in the circuit due to global variations such as interdie or intradie spatial correlations, same gate type correlations, temperature or supply voltage fluctuations, etc. Path correlation, on the other hand, is caused by the phenomenon of path reconvergence, that is, timing variables in the circuit can share a common subset of gate/wire blocks along their path histories (Fig. 1) .
The importance of path correlation comes from the fact that each gate/wire block in the circuit will have some local variations that are independent to the rest of the circuit. These local variations will propagate toward the circuit output and cause additional correlations due to the phenomenon of path reconvergence. Furthermore, these correlations caused by sharing local variations cannot be correctly captured by any algorithm that deals with global variations only. So for clarity, the term path correlation used here and after specifically refers to the correlation caused by the local variations of common path history.
Several solutions have been proposed to deal with either of these two types of correlations. In [6] , [7] , and [10] , the dependence on global variations is explicitly represented using a canonical timing model. However, these approaches did not take into account path correlations. In [9] , a method based on common block detection is introduced to deal with path correlations. However, this method does not address the issue of dependence on global variations. To the best of our knowledge, there is no existing method that has dealt with both types of correlations simultaneously. We present a novel block-based SSTA algorithm in this paper that is designed to consider both global correlations and path correlations.
• We develop a novel method to conditionally approximate the MAX/MIN operator by a linear mixing operator. Using the precomputed skewness, we are able to determine the linearity of the MAX/MIN operator analytically. Linear approximation is then applied only when MAX/MIN behaves linearly. When MAX/MIN is significantly nonlinear, the MAX/MIN evaluation is postponed with a form of max tuple.
• We extend the commonly used canonical timing model to be able to represent all possible correlations, including path correlations, between timing variables in the circuit. We further explore the sparse structure of the extended canonical representations of the timing variable and dynamically drop the nonsignificant terms so as to curtail the amount of storage and computation required for implementations.
0278-0070/$20.00 © 2006 IEEE Since min(X, Y ) = − max(−X, −Y ), in the interests of brevity, in the rest of this paper, we will only discuss the MAX operator, with the understanding that the same results can be easily adapted to the MIN operator.
The rest of the paper is organized as follows. In Section II, previous block-based SSTA methods are reviewed briefly. Section III discusses the nonlinearity of the MAX operator and our conditional linear approximation method. Section IV describes the extended canonical timing (ECT) model and the proposed SSTA algorithm with the technique to reduce computation complexity. Section V presents a real implementation of our algorithm in C/C++ and the testing results with benchmark circuits. Section VI gives the conclusions.
II. BRIEF REVIEW OF CURRENT SSTA ALGORITHMS
For the purpose of timing analysis, timing blocks are used to represent the gate/wires in the circuit. Signals propagating through these blocks will add block delays into their arrival times. Block delays and arrival times are both called timing variables of the circuit. The history or path history of an arrival time is then defined as the set of block delays through which the signal ever passes.
A. Timing Variable Propagation
In statistical timing analysis, a timing variable is modeled as a random variable that is characterized by its distribution of probability density function (pdf) or, equivalently, cumulative distribution function (cdf). The goal of statistical timing analysis is to estimate the distribution of the arrival time in circuits given the distributions of each block delay in the circuit. To accrue the overall timing delay distribution, the timing delay random variables will be joined through two basic operators [5] .
• ADD: When an input arrival time X propagates through a block delay Y , the output arrival time will be Z = X + Y .
• MAX: When two arrival times X and Y merge in a block, a new arrival time Z = max(X, Y ) will be formulated before the block delay is added.
In the ADD operation, if both X and Y are Gaussian random variables, then Z = X + Y will also be a Gaussian random variable whose mean and variance can be found as
where
to be the output of the MAX operator. Since MAX is generally a nonlinear operator, Z will not have a Gaussian distribution even if both X and Y are Gaussian. However, in this situation, the mean, variance, and skewness of the distribution of Z have been already derived analytically by Clark [11] as
where θ = σ (X−Y ) . P and Q are the pdf and cdf of a standard normal distribution evaluated at λ = µ (X−Y ) /σ (X−Y ) , i.e.,
The skewness of random variable Z is defined as
B. Linear Approximation of MAX Operator
Although Z = max(X, Y ) does not have a Gaussian pdf even if both inputs X and Y are Gaussian distributed, in the interests of simplicity, it is still desirable by many SSTA researchers to find a Gaussian random variable that approximates Z in some way [6] , [10] . In [6] , the output of the MAX operator Z is approximated by a Gaussian random variableẐ, which is a linear combination of X, Y , and an additional independent Gaussian random variable ∆, i.e.,
where Q is defined in (6) and is called tightness in [6] . The purpose of the additional random variable ∆ is to ensure that the mean and the variance ofẐ match those of Z as specified in Clark's formula (3) and (4). In [11] , it has also been shown that if W is a Gaussian random variable, then the cross-covariance between W and Z = MAX(X, Y ) can be found analytically as
Substituting (8), it is easy to verify that
Hence, a nice property of the approximatorẐ shown in (8) is that the cross-covariance between Z and other timing variable W is preserved when Z is replaced byẐ. While this approximation formula is simple, it does not work safely when the nonlinearity of the MAX operation is significant and the output of the MAX operator is significantly non-Gaussian. A simple example is illustrated in Fig. 2 , where the left panel shows the two independent input Gaussian random variables and the right panel shows the cdfs of max(X, Y ) from Monte Carlo simulation and linear approximation. It can be seen from Fig. 2 (b) that the existing linear approximation will underestimate the distribution at high probability level. This behavior is risky since decisions made upon the estimated delay may result in excessive design failure.
C. Canonical Timing Model
Previously, a canonical timing model [6] , [7] , [10] has been proposed to address the delay correlations through shared global variations. In this model, the block delay is represented as a sum of three terms
where n i (i = 1, 2, . . .) is the random variable corresponding to the ith block delay in the timing graph; µ i is the expected value of n i ; R i ∼ N (0, 1) (called local variation) represents the localized statistical uncertainties of n i ; G j ∼ N (0, 1) represents the jth global variation; R i and {G j (j = 1, 2, . . .)} are additionally assumed to be mutually independent; and the weight parameters α i (named local sensitivity) and β i,j (named global sensitivities) are deterministic constants, explicitly expressing the amount of dependence of n i on each of the corresponding independent random variables. With this canonical representation, the variance of a block delay n i and its covariance with another block delay n k can be evaluated as
However, if arrival times are also expressed in this canonical model, the path correlation between them due to sharing local variations because of path reconvergence will incorrectly be ignored. For example, in Fig. 1(b) , both arrival times X and Y include a common path history of block p. However, the local variation of block p, R p , is no longer a part of the canonical representation of arrival times X and Y . Hence, the path correlation between X and Y due to R p is incorrectly dropped.
III. NONLINEARITY OF MAX OPERATOR
For Gaussian inputs, the linearity of the MAX operator will be equivalent to the Gaussianity of the output. Using Monte Carlo simulation, the Gaussianity of the output can be evaluated with a method called QQ-Plot [12] . Specifically, if the output is Gaussian, then the simulated output of the MAX operator will show a straight line in its QQ-Plot against a standard Gaussian distribution. And if the MAX output is non-Gaussian, such QQ-Plot will deviated from linear. The more the non-Gaussianity of the MAX output, the worse the linearity of such QQ-Plot.
Since the linearity of the QQ-Plot can be quantitatively represented by the linear correlation coefficient of the QQ-Plot, the Gaussianity of the output of the MAX operator can be statistically and quantitatively measured. However, it will be very expensive if we run extensive Monte Carlo simulation during every step of MAX operation in timing analysis. So it is desirable to establish a more convenient criteria to determine the linearity of the MAX operator. It is well known that skewness is not a Gaussianity index for a general random variable since there are distributions that are symmetric but non-Gaussian. However, to measure the linearity of the MAX operator with Gaussian inputs, skewness of the MAX output will be a good choice. Fig. 3 shows the relationship between the nonlinearity of the MAX operator and the skewness of Z = max(X, Y ) for Gaussian inputs X and Y . The scattering points in the figure represent 1000 random samples of the relative mean, relative variance, and the correlation of Gaussian random variables X and Y . The nonlinearity of the MAX operator for each set of randomly sampled mean, variance, and correlation is determined by QQ-Plot method with 10 000 Monte Carlo simulations. It is very clear in the figure that the skewness of the MAX output has a significant positive correlation with the nonlinearity of the MAX operator. Since skewness of the MAX output given Gaussian inputs can be analytically computed by equations developed by Clark [11] , it is suitable to use skewness as an accurate and efficient measurement for the nonlinearity of the MAX operator. 
A. Nonlinearity Condition of MAX Operator
It is clear that the linearity of the MAX operator is heavily dependent on its input parameters. Since we have a good measure of the linearity of the MAX operator, it is ready to study how the linearity changes when inputs vary.
Assuming the standard deviation σ Y ≥ σ X in max(X, Y ), then no generality will be lost if the two variances are assumed to be σ X = σ ∈ And the skewness of Z = max(X, Y ) are computed using equations developed by Clark [11] and are shown in Fig. 4 .
From the figures, it is clear that in most of the cases the skewness is zero, which means Z = max(X, Y ) is normally distributed and the MAX operator is linear. As a rule of thumb, the nonlinearity of the MAX operator is significant when the following nonlinear condition is satisfied.
Given X and Y are Gaussian, max(X, Y ) will be significantly non-Gaussian if X and Y have very similar mean but very different variance or if X and Y have similar mean and variance but very negative correlation.
B. Conditional Linear MAX Approximation
Given two Gaussian random variables X and Y , Z = max(X, Y ) could be significantly skewed if the nonlinear condition is satisfied. If the MAX operator is significantly nonlinear, significant error will occur if a linear operator is forced to approximate the MAX operator. But for the purpose of timing analysis, it is not necessary to explicitly compute the MAX output at every step.
1) Max Tuple: During timing analysis, the arrival time propagates from block to block with two elemental operations: ADD and MAX. If during a propagation step of MAX max(X, Y ) the output arrival time is not Gaussian, no actual computation will be done and the output will be simply recorded as a max tuple: Mt{X, Y }. With such max tuples, the arrival time propagation will have the following computations.
• ADD: Gate/wire delay D is added into a max tuple Mt{X, Y } as
• aMAX: Arrival time A is MAXed with a max tuple Mt{X, Y } as
• tMAX: Two max tuples are MAXed together
2) Tuple Size:
To practically implement such tuple-based MAX evaluation, the number of arrival times in the max tuple, i.e., the tuple size, has to be maintained as small as possible. This is realized by the obvious combinational rule of max tuple as
so if any two Gaussian random variables in the max tuple do not satisfy the nonlinear condition, then they can be replaced by a new Gaussian random variable by approximating the MAX with a linear operator and so that the size of the max tuple is reduced. This reduction process will be done iteratively to minimize the tuple size.
Such kind of tuple size reduction method is realized by associating each max tuple with a skewness matrix that stores the output skewness if pairs of random variables in the max tuple are actually MAXed out. And also a threshold of skewness κth is set beforehand to decide if the MAX result is Gaussian or non-Gaussian. Also, to prevent the explosion of the tuple size, a safeguard maximum allowed size for max tuple is also set, and if any of the tuple size exceeds the maximum size, the skewness threshold will be increased to tolerate more tuple size reduction.
Finally, in the primary output of the circuit, if the circuit delay is reported as max tuple, the output distribution can be easily evaluated by Monte Carlo simulation. For limited size of max tuple, such evaluation is efficient and accurate.
IV. ECT MODEL
The canonical timing model [6] , [7] , [10] is a powerful tool to represent the numerous timing variables for a given circuit. However, as pointed out in the previous section, in its original format, it can only handle timing correlations caused by global variations. In this work, we propose an ECT model that is capable of capturing all correlations between any pair of timing variables in the circuit, be it a block delay or an arrival time.
A. ECT Model
Assume that there are N gate/wire blocks and M global variations in the timing graph. If every block delay is modeled by the canonical format shown in (10) and MAX is approximated by a linear combination operator, then every time variable, including all block delays and arrival times, will then have the ECT expression as
where R i ∼ N (0, 1) is the local and independent variation only related with block i, G j ∼ N (0, 1) is the jth global variation, and α X,i and β X,j are the corresponding sensitivity factors. To differ our approach from the existing canonical timing model, the word "extended" is used to indicate that the local variations are additionally included to the timing model. With such "extended" timing model, both global and path correlations can be handled elegantly. More specifically, global variations are represented by the set of global sensitivity terms {β X,j }, and dependence on path history is represented by nonzero local sensitivity terms α X,k . Equation (13) can be rewritten in a compacted vector format as
where " * " means transpose and
* ∼ N (0, I) are mutually independent local variation vector and global variation vector, respectively. 0 is a zero vector and I is the unit matrix.
* are deterministic local and global sensitivity vectors.
Chang and Sapatnekar [10] prove the correlation evaluation formula between timing variables represented by the canonical timing model of (10) . We here prove a similar formula for correlation evaluation between time variables expressed with the ECT model as (13) or (14).
, the correlation between them can be evaluated as
Proof: By definition
where the independence of r and g is applied.
To get the variance of a time variable, it is easy to prove the following corollary.
which is actually the special case when X = Y of Theorem 1.
B. SSTA Algorithm
Before timing analysis, the delay sensitivities of each individual gate/wire are extracted from its Spice model and a gate/wire delay library is then formed. This library, together with the circuit being analyzed, serves as the input of the SSTA algorithm. An SSTA algorithm will then calculate the distributions for all arrival times in the entire circuit by carrying out ADD and MAX operation at each gate/wire block. The overall data flow of the algorithm is summarized in Fig. 5 , where the timing graph in the SSTA is represented by a file with standard delay variance correlation format (sdvcf), where both gate/wire delays and connections among gate/wires are specified.
Assuming
According to the linear MAX approximation (8), the output distribution of the MAX operator Z = max(X, Y ) will be
Clearly, the complexity of a single iteration of the SSTA algorithm comes from the sensitivity vector computation and the correlation evaluation involved in the MAX operation. Assuming there are totally M global variations and N gate/wire blocks in the circuit, the overall SSTA complexity will then be
C. Exploration of Sparsity
While working with benchmark circuits, we noticed that many components in the variation vectors have very small sensitivity values, indicating that their contributions to the overall correlation is insignificant. By setting these small coefficients to zero, the sensitivity vector will become a sparse vector that contains many zero components. Motivated by this observation, we apply a drop-and-lump method to exploit the sparsity of the sensitivity vector and to further decrease the average complexity of the SSTA algorithm.
For this purpose, a drop threshold is selected such that if α X,i or β X,j is smaller than this threshold, it is deemed to have a small value and will be dropped from the sensitivity vector. However, dropping α X,i or β X,j with a small magnitude directly is the same as applying truncation to the sensitivity vector. In subsequent computations, the quantization error may accumulate, causing nonnegligible error. This is a problem that cannot be overlooked for large circuits. Our solution to this problem is to lump those dropped components into a single correction term
Using circuit c499 as the example, the variation lumping method is compared with a simple dropping method at a drop level of 100%. From Table I , the advantage of using the lumping mechanism is clear: the estimation error for 97% delay quantile (τ 97 ) is improved from 6.1% of the simple dropping mechanism to 3.4% when the lumping mechanism is used.
Using this drop and lump method, the average number of nonzero terms in global sensitivity vector β will be M C and the complexity of the proposed SSTA algorithm will be O[(M C + Γ)N ], where the average number of nonzero terms in local sensitivity vector α is Γ. So what is really dropped in the local sensitivity vector α during computation is then the path correlation, and the length of the local sensitivity vector actually gives a good indication to the extent of path correlation in the circuit. So Γ is given a special name of path correlation length for a given drop threshold.
Theoretically speaking, if a block is not in any statistically critical path, its variation will be automatically dropped. On the other hand, if the block is in the critical path but is not statistically important, it will be dropped too. Furthermore, the importance of the variation will decrease after it propagates through a long path. In real circuits, usually only a few blocks in the circuit will survive the arrival time propagation so that Γ N and M C N . The computational complexity of the proposed method will be practically
, and a significant reduction of complexity is achieved with the above drop and pool mechanism, although it is important know that the actual complexity reduction is highly dependent on the topology of the circuit being analyzed.
V. SIMULATIONS AND DISCUSSIONS
Our SSTA algorithm, named as conditional linear MAX/MIN approximation with extended canonical timing model (CLECT), has already been implemented in C/C++ and tested by benchmark circuits. Before testing, however, all benchmark circuits are remapped into a library that has gates of NOT, NAND2, NAND3, NOR2, NOR3, and XOR/XNOR. All library gates are implemented in 0.18-µm technology and their delays are characterized by Monte Carlo Spice simulation with Cadence tools assuming all variation sources follow Gaussian distribution.
For illustration purposes, only three parameter variations are considered global: channel length (L), supply voltage (V dd), and temperature (T ). All other variation sources, specified in the 0.18-µm technology file, are assumed to be localized in the considered gate only. Furthermore, we do not address the spatial dependency of the gate delays just for demonstration purposes. In real life, gate delay parameters are position dependent and our method is still applicable.
Extensive Monte Carlo simulations with 10 000 repetitions are used as a "golden value" for each benchmark circuit. Each repetition is a process of static timing analysis by fixing global and block variations into a set of randomly sampled values. The global variations are sampled once for each repetition while block variation for each gate is sampled every time the gate delay is computed.
A. Accuracy Improvement With Max Tuple
One simple circuit is given in Fig. 6 , where the overall delay and all internal MAX operators are significantly nonlinear as revealed by Monte Carlo simulation shown in Fig. 7 . The skewness of the output distribution is κ = 2.2, which is significantly larger than zero. As shown in Fig. 7 , it is very clear that for a circuit where MAX operators are significantly nonlinear, the existing linear approximation cannot correctly capture the cdf behavior. Especially, the existing method significantly underestimates the distribution at the high probability level so that it is risky to use such a distribution to predict circuit performance.
Our conditional linear MAX approximation method, on the other hand, matches the exact distribution much better than the existing method. Especially, at the high probability level, the computed distribution is almost exactly the same as the one got from Monte Carlo simulation. Such significant accuracy improvement over the existing method makes our method more suitable to predict performance for nonlinear circuits.
B. Accuracy Improvement by Including Path Correlation
Our SSTA method is also tested in the International Symposium on Circuits and Systems (ISCAS) benchmark suite. From Monte Carlo simulation, all ISCAS combinational circuits are Gaussian circuits whose MAX operators can be well approximated by linear operators. Both our conditional linear MAX approximation method and existing linear approximation will be good MAX approximation methods for these circuits. This nice property comes from the fact that for arrival times in these circuits, bigger mean usually means bigger variance and arrival times are usually positively correlated. So the nonlinear condition will not be satisfied for them. Table II summarizes the error of the arrival time distribution parameters computed at the primary output for each testing circuit from three methods: 1) our method of CLECT; 2) NoPath, where the existing canonical timing model is used and no path correlation is considered; 3) NoCorr, where neither global correlation nor path correlation is considered. µ and σ are mean and standard variation of the distribution. τ 97 = µ + 2σ is the 97% delay quantile estimated assuming the output delay distribution is Gaussian.
From Table II , it is very clear that the method NoCorr fails to give reasonable variance estimation because no correlation is considered. This is a good example demonstrating the importance of correlations in SSTA. Table II also shows that the method NoPath has a significantly larger error in mean estimation than CLECT, although it shows similar accuracy in variance estimation. As a consequence, the method NoPath has a significantly larger error in 97% delay quantile estimation. This consistently larger error in all simulated circuits shows the importance to use the ECT model and consider the path correlations.
To further elaborate the accuracy improvement of CLECT over NoPath, Fig. 8 shows the pdf and cdf for circuit c6288 from three methods: Monte Carlo, CLECT, and NoPath. Apparently enough, CLECT shows excellent accuracy since it considers path correlation. And NoPath has a significant distribution shift because it uses canonical timing model and path correlation is dropped.
C. Performance and Path Correlation Length
It has been mentioned in Section IV-C that the path correlation length (Γ) is an interesting macroproperty of the simulated circuit and gives a good indication of the extent of path correlation existing in that circuit. For the above ISCAS circuits, the path correlation length (Γ) at a drop threshold of 1% is summarized in Table III , where the run time improvement over Monte Carlo simulation is also shown.
From Table III , we can first conclude that the correlation length Γ is much smaller than circuit size and basically independent on circuit size since it remains about 10-20 when the circuit size changes dramatically. This observation helps the conclusion we made before about the complexity reduction of our method by using the technique of flexible vector format.
Second, the only exceptionally high path correlation length among the tested circuits happens with the circuit c6288, which is known as a 16-bit array multiplier. Since there are a large amount of equal delay paths in the circuit, a large path correlation length is natural. Fewer local sensitivities can be dropped due to equal importance. To study the relationship between path correlation length and the accuracy of the SSTA method, an experiment is conducted for circuit c6288 and the results are shown in Fig. 9 , where the errors in τ 97 and path correlation length are both plotted against the drop threshold. It is clear that the path correlation length drops sharply when the drop threshold changes slightly from zero and maintains almost constant after that. But the error changes steadily when drop threshold changes. This phenomenon proves the efficiency of the drop mechanism introduced in this work since it means we can sacrifice very little accuracy to gain very significant reduction in path correlation length and so as to save a significant amount of CPU time.
VI. CONCLUSION
This paper presents a novel method for block-based SSTA. We first disclose a new method to approximate the MAX operation with a linear operator assisted by skewness-based linearity evaluation. Second, we extend the commonly used canonical timing model into an "extended" version to represent the possible occurring path correlation. With these theoretical progress, we are able to evaluate and propagate both global and path correlation in the circuit timing graph. We also design a novel algorithm CLECT that treats both global and path correlation simultaneously and systematically. This algorithm, with the help of a drop-and-lump method, achieves high accuracy and high performance at the same time as tested by ISCAS circuits and compared with Monte Carlo "golden values."
