Abstract-State of the art statistical timing analysis (STA) tools often yield less accurate results when timing variables become correlated due to global source of variations and path reconvergence. To the best of our knowledge, no good solution is available dealing both types of correlations simultaneously.
I. INTRODUCTION
It is well-known that the timing performance of future generations of deep-submicron micro-architecture will be dominated by several factors. IC manufacturing process parameter variations will cause device and circuit parameters to deviate from their designed value. Low supply voltage for lowpower applications will reduce noise margin, causing increased timing delay variations. Due to dense integration and non-ideal on-chip power dissipation, rising temperature of substrate may lead to hot spot, causing excessive timing variations.
Classical worst case timing analysis produces timing predictions that are often too pessimistic and grossly conservative. On the other hand, statistical timing analysis (STA) that characterizes timing delays as statistical random variables offers a better approach for more accurate and realistic timing prediction.
In literatures, there are two distinct approach for STA: path based STA and block based STA. The fundamental challenge of the path based STA [1] - [4] is its requirement to select a proper subset of paths whose time constraints are statistically critical. This task has a computation complexity that grows exponentially with respect to the circuit size, and hence can not be easily scaled to handle realistic circuits.
This potential difficulty has motivated the development of block base STA [5] - [10] that champions the notion of progressive computation. Specifically, statistical timing analysis is performed block by block in the forward direction in the circuit timing graph without looking back to the path history. As such, the computation complexity of block based STA will grow linearly with respect to the circuit size. To even further speed up the computation, Gaussian assumption has been widely adopted ( [6] , [9] , [10] ) with small accuracy penalty, and all internal timing random variables in a circuits are forced to follow the Gaussian distribution.
However, to realize the full benefit of block based STA, one must solve a difficult problem that timing variables in a circuit could be correlated due to either global variation ( [6] , [7] , [10] ) or path reconvergence( [5] , [9] ). As illustrated in the left hand side of Figure 1 , global correlation refers to the statistical correlation among timing variables in the circuit due to global variations such as inter-or intra-die spatial correlations, same gate type correlations, temperature or supply voltage fluctuations, etc. Path correlation, illustrated in the right hand side of Figure 1 , refers to the correlation resulting from the phenomenon of path reconvergence, that is, timing variables may share a common subset of gate or interconnect along their path histories. The importance of the path correlation comes from the fact that each gate in the circuit will have some local variations which are independent to other gates in the circuit. These local variations will propagate towards the circuit output and cause additional correlations. However, the correlations caused by sharing some of these local variations because of path reconvergence, cannot be correctly captured by any algorithm that deals with global variations only.
Several preliminary solutions have been proposed to deal with these correlations. In [6] , [7] , [10] the dependence on global variations is explicitly represented using a canonical timing model. However, none of these approaches has taken into account the path correlations. In [9] , a method based on common node detection is introduced to deal with the path correlations. However, this method does not address the issue of dependence on global variations.
In this paper, we present a systematic STA solution that takes into account correlations caused by both global variations and path reconvergence. Specifically,
• We extend the commonly used canonical timing model to represent all timing variables in the circuit as a weighted linear combination of a set of Gaussian random variables.
A variation vector, consisting of all the weights, is then used to explicitly represent both global and path correlation information.
• We develop a pseudo-canonical timing model which relax the requirement of the independence between the global variation sources involved in the original canonical timing model. With such, the expensive principle component analysis is avoided and total computation time is saved.
• We further explore the sparse structure of the variation vector and develop a flexible vector format so that the non-significant entries of the variation vector are dynamically dropped during computation. According to simulations on ISCAS circuits, this technique significantly curtails the amount of storage and computation required for our method. Since min(X, Y ) = −max(−X, −Y ), in the interests of brevity, in the rest of this paper, we will only discuss the MAX operator, with the understanding that the same results can be easily adapted to the MIN operator.
The rest of the paper is organized as following: In section II, previous block based STA methods are reviewed briefly; Section III describes the vectorized timing format and a theorem used for correlation decompose; Section IV is the detailed algorithm and technique to reduce computation complexity. Section V presents a real implementation of our method in C/C++ and the testing result with ISCAS85 benchmark suites; Section VI gives the conclusions.
II. CANONICAL TIMING MODEL
In timing analysis field, the circuit is modeled as a timing graph, where nodes are used to represent the gate/wire in the circuit and there will be some delay, called node delay, associated with them. Signals propagate through these nodes will add their delays into its arrival time. Node delay and arrival time are usually called timing variables of the circuit.
Different from classical timing analysis, the statistical timing analysis models timing variables as random variables, which are characterized by its probability density func-
The purpose of statistical timing analysis is then to estimate the arrival time distribution at the primary output of the circuits knowing input arrival time distributions and all internal node delay distributions. This is accomplished through two operators [5] :
• ADD: When an input arrival time X propagates through a node delay Y , the output arrival time will be Z = X+Y • M AX: When two arrival times X and Y merge in a node, a new arrival time Z = max(X, Y ) will be computed before the node delay is added. [6] , [7] , [10] proposed a canonical delay model to address the node delay correlations through sharing global variations. In particular, they model each of the node delay as a summation of three terms:
where n i (i = 1, 2, ...) are random variables corresponding to the the i th node delay in the timing graph; µ i is the expected value of n i ; R i , (named node variation), is a zero-mean, unity variance Gaussian random variable representing the localized statistical uncertainties of n i ; G j represents the j th global variation, and is also modeled as a zero-mean, unity variance Gaussian random variable; {R i } and {G j } are additionally assumed to be mutually independent; the weight parameters α i (named node sensitivity) and β i,j (named global sensitivities) are deterministic constants, explicitly expressing the amount of dependence of n i on each of the corresponding independent random variables.
With this canonical representation, the correlation (covariance) between any two node delays, n i and n k , can be easily evaluated.
This canonical timing model is very good for node delay(gate delay or interconnect delay) modeling. But it is not sufficient to model the arrival time. When an arrival time is somehow mapped into this canonical timing model as done in [10] , some important path information will be dropped and the accuracy will be compromised.
For example, referring back to Figure 1 (b), arrival time X and Y share the common history of node p. So between them, there will be some extend of correlation caused by the node p's local variation R p . But X and Y have only one term of local variation of R X and R Y respectively in their canonical representations. And according to the assumption from the canonical timing model, R X and R Y are assumed to be independent to each other, so the correlation caused by R p in node p is incorrectively ignored.
Another significant disadvantage of the original format of the canonical model is that it requires the independence between the global variation sources G i . This requirement, as stated in [10] , will be satisfied by Principle Component Analysis(PCA) before the timing analysis. But we think it is not reasonable to exclude the computation complexity of PCA from the overall performance evaluation of the statistical timing analysis because the complexity of the PCA grows as
where M is the total number of global variation sources considered. Due to the important spatial correlation between global variation sources, the number of global variation sources have to be included in the PCA will easily go up to thousands and so that PCA itself will be a very timing consuming step.
For example, to model the spatial correlation, the chip is usually partitioned into grids and each grid will be associated with m global variations. So the total number of global variation sources will be proportional to the number of grids used to represent the chip area. If the chip is partitioned into 10×10 grid, the total number of global variations will be M = 100m. This number maybe tolerable but the coarse grid may not be sufficient to accurately model the spatial correlation of global variations between different grid locations. If we partition the chip with 100 × 100 to get more accurate spatial correlation model, the total number of global variations will now be M = 10000m which is beyond the reasonable range for cases where a fast PCA is demanded. To be a fact, for a real chip with dimension of 10mm × 10mm, 100 × 100 grids will only have grid cell with size of 100µm × 100µm which is not yet small enough for high quality spatial correlation modeling.
III. EXTENDED PSEUDO-CANONICAL TIMING MODEL
The canonical timing model [6] , [7] , [10] is a powerful tool to represent the numerous timing variables for a given circuit. However, as pointed out in the previous section, in its original format, it can only handle node delay correlations caused by global variations. Furthermore, the original format of canonical timing model requires a computationally expensive PCA which may prohibit it from using in cases where large number of global variations are correlated with each other. In this work, we propose an extended pseudo-canonical timing model(EPCT) that is capable of capture all the correlation between any pair of timing variables in the circuit be it a node delay or an arrival time. And also it removes the requirement of a PCA procedure for its validation.
A. Extended Pseudo-Canonical Timing Model
We notice an important fact that it is not necessary to have the canonical format to be able to evaluate the correlations caused by global variation sources. So we propose a PseudoCanonical Timing Model for node delay as following:
which, compared with the commonly used canonical model in equation (1), relaxes the requirement of the independence of the global variations and so that the global variations G j may be CORRELATED.
With such pseudo-canonical timing model, the correlation between two node delays can be evaluated as:
Assume that there are N nodes and M global variations in the timing graph, if every node delay can be modeled by the pseudo-canonical format of Equation (3), then every timing variable, including all the node delays and arrival times will then have a extended pseudo-canonical timing(EPCT) model as:
where R i and G j are independent to each other although there will possibly correlations between G j s. With this equation, both global and path correlations can be handled elegantly. More specifically, global variations are represented by the set of global sensitivity terms {β X,j }, and dependence on path history are represented by non-zero node sensitivity terms α X,k .
B. Variation Vector
Equation (5) can be rewritten in a compacted vector format as
where "*" means transpose and
are mutually independent local variation vector and global variation vector respectively. 0 is a zero vector and Σ g = E{gg * } is the covariance matrix of global variations and generally not equal to the unit matrix I due to the potential correlations existing between global variations.
Authors in [10] proves the correlation evaluation formula between timing variables represented by the canonical timing model of equation (1). We here prove a similar formula for correlation evaluation between time variables expressed with the EPCT model as equation (5) or (6) .
, the correlation between them can be evaluated as:
where the independence between r and g is applied.
For the variance of a time variable, it is easy to get:
This corollary is actually the special case when X = Y of theorem 8.
IV. PROPAGATING MEAN AND VARIATION VECTOR
In a timing graph, the mean and variation vectors of a node delay is obtained from technology extraction. A STA algorithm, instead, will take those node's means and variation vectors as its input and calculate the mean and variation vector for all arrival times in the entire circuit.
A. Exploration of Sparsity
Since there are N nodes and M global variations in a timing graph, the length of the local variation α will be N and the length of global variation vector β will be M which is the same as the rank of the correlation matrix Σ g . So for every step of arrival time propagation in the timing analysis, the computation complexity will be O(N + M 2 ). And the overall timing analysis computation will have complexity of
) since there will be N steps of arrival time propagations.
Although the correlation matrix Σ g will be dense and have rank of M which is the same as the total number of global variation sources. However, while working with benchmark circuits, we noticed that many components in the variation vectors have very small values, indicating that their contributions to the overall correlation evaluation is insignificant. By setting these small coefficient to zero, the variation vector will become a sparse vector that contains many zero components.
Motivated by this observation, we developed a novel technique called the flexible vector format to exploit the sparsity of the variation vector. For this purpose, a drop threshold is selected so that if α X,i or β X,j is smaller than this threshold, it is deemed to have small value and will be placed into a drop candidate pool to be pruned from the variation vector representation.
However, dropping α X,i or β X,j with small magnitude is the same as applying truncation to the variation vector. In subsequent computations, the quantization error may accumulate, causing non-negligible error. This is a problem that can not be overlooked for large circuits. Our solution to this problem is to lump those components in the drop candidate pool into a single correction term
B. Complexity and Path Correlation Length
Using this drop and pool mechanism, the actual number of non-zero terms in β will be much smaller than M . Assuming there are m global variation sources for a single node, the average number of non-zero terms in β will be either (1) m if the time variable is a node delay; or (2) M C = m × q << M if the time variable is an arrival time and q is proportional to the logic depth of the circuit. So, using equation (8) Using this drop and pool mechanism, what is really dropped in the local variation vector α during computation is then the path correlation. So the length of the local variation vector actually gives a good indication to the extent of path correlation in the circuit. The path correlation length(Γ) of the circuit is then defined as the average length of the pruned local variation vectors for a given drop threshold.
Theoretically speaking, the path correlation length will be the number of critical nodes which are in the statistically critical paths. Specifically, If a node is not in any statistically critical paths, its variation will be automatically dropped. On the other hand, if the node is in one critical path but is not statistically important, it will be dropped too. So by studying the variation vector in the circuit output, it is easy to know which path is critical and which node is critical for circuit performance and this is very important information to help designer improve the circuit design.
In real circuits, usually only a few paths are statistically critical and so that only a few nodes in the circuit will survive the variation vector propagation. So the computation complexity of our method will be
Since Γ << N and M C << M , the total complexity will then be O[N ] and a significant reduction of computation and storage is achieved with the drop and pool mechanism. Compared with the PCA-based approach, the complexity saving is even larger since there the complexity will be
even the drop and pool mechanism is used. So the extended pseudo-canonical timing model actually results in a big saving in computation complexity when the total number of global variations is large.
V. SIMULATION RESULTS AND DISCUSSIONS
The above described algorithm has already been implemented in C/C++ and tested by ISCAS85 benchmark circuits.
Before testing, however, all benchmark circuits are remapped into a library which has gates of not, nand2, nand3, nor2, nor3 and xor/xnor. All library gates are implemented in 0.18µm technology and their delays are characterized by Monte Carlo simulation with Cadence tools assuming all variation sources, either process variations or operational variations, follow Gaussian distribution.
For illustration purpose, only three parameter variation are considered global: channel length(L), supply voltage(Vdd) and temperature(T). All other variation sources, specified in the 0.18µm technology file, are assumed to be localized in the considered gate only.
A. Accuracy and Performance
Monte Carlo simulation results with 10,000 repetitions are used as "Golden Value" for each benchmark circuit. Each repetition is a process of static timing analysis by fixing global and node variation into a set of randomly sampled values. The global variations are sampled once for each repetition while node variation for each gate is newly sampled every time when the gate is computed.
To elaborate the importance of including the path reconvergence correlation, A STA methods called CTM is implemented with canonical timing model where no path correlations are considered. Table I summarizes the arrival time distribution parameters at the primary output of each testing circuit from Monte Carlo(M.C.), CTM and our method of EPCTM which is implemented with extended pseudo-canonical timing model. µ and σ are mean and standard variation of the distribution. τ 97 = µ + 2σ is the delay estimation at confidence level of 97%. The accuracy of STA methods compared with Monte Carlo method, is evaluated in Table II Table II shows that CTM have significantly larger error in mean estimation than EPCTM. This is reasonable because CTM will overestimate the mean at every MAX operation due to smaller correlation considered and this over estimation is accumulated through distribution propagation. It is also interesting to notice that CTM and EPCTM give similar accuracy in variance estimation. This is possibly because of the fact that the variance is dominated by global variation in the tested cases.
Of course, Monte Carlo simulation gives the best STA results but with big runtime penalty. EPCTM runs order-ofmagnitude faster but can provide both mean and variance estimation almost as accurate as Monte Carlo does if most of the path correlations are considered as the cases of EPCTM shown in Table I and II. To further elaborate the accuracy of EPCTM, Figure 2 shows the p.d.f. and c.d.f. for circuit c6288 from three methods: Mont Carlo and two methods of EPCTM and CTM. Apparently enough, EPCTM shows excellent accuracy since most path correlation is considered.
B. Performance and Path Correlation Length
It has been mentioned in Section IV-B that path correlation length (Γ) is an interesting macro property of the simulated circuit and gives a good indication of the extent of the path correlation existing in that circuit. For the above ISCAS Table III where the run time improvement is also shown when EPCTM is compared with Monte Carlo. From Table III , we can firstly conclude that the correlation length Γ is much smaller than the circuit size and basically independent on the circuit size since it remains about 10 − 20 when circuit size changes dramatically. This observation helps the conclusion we made before about the complexity reduction of our method by using the technique of flexible vector format.
Secondly, the only exceptional high path correlation length among the tested circuits happens with the circuit c6288 which is known as a 16-bit array multiplier. Since there are large amount of equal delay paths in the circuit, large To study the relationship between path correlation length and the accuracy of the STA method, An experiment is conducted for circuit c6288 and results are shown in figure  4 where the error in τ 97 and path correlation length are both plotted against the drop threshold. It is clear that the path correlation length drops sharply when the drop threshold changes slightly from zero and maintain almost constant after that. But the error changes readily when drop threshold changes. This phenomenon proves the efficiency of the drop mechanism introduced in this work since it means we can sacrifice very little accuracy to gain very significant reduction in the path correlation length and so that save significant amount of CPU time since the run time of the proposed STA method is proportional to the path correlation length.
VI. CONCLUSIONS This paper presents a novel method for block-based statistical timing analysis. Applying the generally accepted Gaussian assumption, we firstly disclose that the MAX operation can be approximated by linear supposition of its inputs. Secondly we extend the commonly used canonical timing model into a vectorized format, variation vector. We also disclose a novel method to decompose correlated timing variables into independent ones to simplify computation. With these theoretical progress, we are able to evaluate and propagate the global and path correlation systematically in the circuit timing graph. We also design a novel algorithm which treat both global and path correlation simultaneously and systematically. This algorithm, with the help with a new flexible vector format achieves high accuracy and high performance at the same time as tested by ISCAS circuits and compared with Monte Carlo results.
