In this paper, we highlight a fast, effective and practical statistical 
Introduction
Variability is posing an increasing challenge to timing analysis of VLSI designs implemented in the nanometer regime. Variations can be either environmental or physical. Environmental variations are due to unpredictable operating conditions such as the power supply voltage V dd and the chip's temperature, whereas physical variations arise during the fabrication of the chip and include permanent discrepancies in the oxide thickness tox, effective channel length L ef f and interconnect dimensions. These variations appear at different levels of the manufacturing process: inter-lot, inter-wafer, inter-die and intra-die. Traditionally, discrepancies in VLSI chip parameters have been accounted for using several cases (best-case, nominal and worst-case) in Static Timing Analysis. Worst-case analysis is a deterministic timing analysis that was a fair assumption to make until a few years ago, when intra-die parameter mismatches were considered to be negligible in the presence of the more significant interdie variations. Today, intra-die variations must be accounted for, as worst-case timing analysis is highly potential-cutting [1] .
Physical parameters are susceptible to random variations from their nominal values, and therefore should be modeled as Random Variables RVs. If the circuit to be analyzed contains thousands of gates, and because of the existing spatial correlations between RVs of the same kind, thousand-variable Joint PDFs (JPDFs) should be used for each RV kind. Getting all these joint distributions is computationally impossible for some reasonable discretization accuracy, let alone the exponential run-time needed for computing any path PDF using such JPDFs. If the number of discretization points of the JPDFs per dimension is equal to QUALITY, then an N-variable JPDF will have QUALITY N discretization points, making the computational complexity for computing the delay PDF of that path (QUALITY N ). Hence, the exact numerical solution for finding the PDF of the critical path is unrealistically complex.
Recently, a number of techniques have been developed to perform statistical timing analysis. They can be broadly classified into full-chip analysis and path-based analysis. Full-chip analysis maps the whole block to an acyclic graph and strives to propagate and merge the PDFs of the gate delays, in order to get the PDF of the critical path of the circuit [2] [3] [4] [5] [6] [7] [8] [9] . This method has an inherently exponential run-time with circuit size. Therefore, in the pursuit of reducing the run-time, advocators of full-chip analysis have to use primitive delay models, assume that gate delay PDFs are given [3, 4] , neglect parameter correlations [2, 3, 8] and draw on inaccurate approximations [7] . And even with this, they sometimes give bounds for the delay PDF and not the PDF itself [2, 8, 9] . In the path-based approach, the PDF of one path is calculated at a time [10] [11] . First, Deterministic Timing Analysis is performed on the circuit in order to find some upper percentage of nominally critical paths. Then, each candidate critical path is analyzed statistically, with the aim of computing its delay PDF. The probabilistic critical path is decided by comparing some confidence-point on the PDFs, such as the 3σ points. This method allows for more complex delay and interconnect models [11] . Some common assumptions and drawbacks, both in full-chip and path-based statistical analysis, include the consideration of only 1 RV (usually effective channel length) [9, 10] , the use of 1 correlation-space at most (usually spatial correlations) [9, 10] , the restriction to a certain kind of input PDF (usually Gaussian) [10, 12] , and the exponential run-time needed to get accurate results [2, 9, 12] .
In this work, we propose a path-based methodology which relaxes some these assumptions while maintaining a run-time of polynomial complexity. Our methodology includes all the steps that lead to finding the probabilistic critical path, from the initial deterministic analysis algorithms to the ranking of all probabilistic critical paths of the circuit. The methodology accounts for several RVs which affect gate delay variations. Spatial correlations for these RVs are taken into consideration. Our findings emphasize the critical path delay overestimation that arises due to worst-case analysis.
Modeling and Assumptions

Gate Delay Using Elmore's Model
In order to perform probabilistic timing analysis later in the paper, we first derive the delay of any path which is composed of gates by using Elmore's model. Based on Elmore's delay model, [13] showed that the propagation delay of an n-input NAND gate tp is formalized as
where FI is the number of fan-ins, C dN is the drain capacitance, Cn is the sum of the drain capacitances of this gate at the output node, plus the gate capacitances of all fan-out gates, as well as the wire capacitance, while RN and RP are the NMOS and PMOS onresistances, respectively. For short channel models for RN and RP , tp can be expressed as
where
where ox is the oxide permittivity, Wn/Wp are the NMOS/PMOS channel widths, µn/µp are the NMOS/PMOS mobilities and VTn/VTp are the NMOS/PMOS threshold voltages. Similar delay equations can be found for an inverter, an n-input NOR gate and a 2-input XNOR gate. They all have the same form as (2) with different values for α and β. These are the gates under investigation which constitute all ISCAS85 benchmarks. Thus, a path consisting of N such gates will have a delay equation of the form
Sensitivity Analysis
In order to investigate how much each parameter's variability affects the delay, we performed a first-degree sensitivity analysis for the delay of a 2-input NAND gate, an inverter, a 2-input NOR gate and a 2-input XNOR gate, all with fan-outs of 2, at the nominal values of all parameters. To avoid complicated calculations, and to focus the reader's attention to the approach, we assumed the parameters to be independent and all capacitances to be constant. We compared the values of
for each parameter xi, where X is the vector of parameters and σx i is the standard deviation of xi. 130nm CMOS technology values were used, and the typical variances were obtained from [15] . The parameters that had the most impact on the delay were tox, L ef f , V dd , while VTn and VTp had less effect. Table 1 shows the linear approximations of gate delay variations as a result of varying those 5 RVs by one σx i .
Layering of Correlation-Spaces
In this section, we show a simple yet efficient method developed in [9] , [10] that divides the spatial correlation-space into several levels. We use this technique to get rid of the correlated RVs while keeping the correlation information. The model first replicates the die on several layers, then divides each layer i of the die into 4 i rectangle regions. Using variabilities in L ef f for each layer's partitionsize, the model replaces the L ef f of each gate by a sum of RVs, such 
that the correlation information between two L ef f 's is in number of common RVs they have and their variances. This method was used only for spatial correlations in L ef f . We include other RVs to the technique proposed in [9] , to account for any other chip variations.
Since we are accounting for more than 1 RV, we will first provide a general formulation to the spatial correlation problem. Let χ refer to a parameter (RV) with a known marginal probability distribution, χi refers to a RV of type χ of any partition in a certain level i, and χi,j refers to the RV of type χ and of partition j in level i. χ0 is assigned a PDF whose mean is χ nominal . Each subsequent χi is assigned another PDF, with a mean of zero, such that the PDF of
i=0 χi is equal to the PDF of χ, which is given. If χ has a Gaussian PDF, this simplifies to assigning a Gaussian PDF to each χi, such that
where L is the number of hierarchical layers, and layer 0 represents the whole correlation space. The RV of a certain gate, χgate for instance, will be set to the sum over all layers of the partition RVs that it belongs to, yielding
where ξ = 1 only if j is the partition the gate belongs to on level i, and ξ = 0 otherwise. It is the inter-die variations in some χ that decide the chip-mean of χ. The remaining layers correspond to different levels of intra-die variations. Intra-die variations are only deviations around the chip-mean. This is why all layers except layer 0 were assigned zero-mean PDFs. As for the mean of the inter-PDF of some χ (the mean of the PDF of χ0), it is equal to χ nominal , because the average of the means of χ, in all chips of all wafers of all lots, is χ nominal . Using the above spatial correlation description, the path delay (5) can be re-written as
where u and w represent the layer number and the partition number, respectively. Even though the RVs in the equation of a path are considered as independent, we still cannot use this to our advantage in order to find the PDF of any tPATH. This is because the same RV, for instance, say, L ef f 1,2 , can belong to the tp of many gates in the same path. Therefore, we cannot separately calculate the delay PDF of each tp i along a path, since the same RVs may re-occur in the path, causing correlation between tp i 's. In the following section, we will use the Taylor series first-order approximation for the gate delay, in order to linearize part of tp i , and therefore transform (8) into a suitable form for PDF calculation in low run-time complexity.
Taylor Series First-Order Approximation for Path Delay
Let Xi represent the vector of RVs of gate i, Xinter the vector of inter-RVs, ∆Xintra i the vector of intra-RVs of gate i, and for any gate i: Xi = Xinter + ∆Xintra i . Using the Taylor-series first order approximation of tp i taken at the bias point Xinter, the gate delay equation is formulated in vector and scalar forms as follows
It should be noted that the mean of Xinter is X nominal , the mean of ∆Xintra i is zero, and that this approximation is accurate if ∆Xintra i
Xinter. Since the intra-die RV standard deviations are much smaller than the nominal values of those RVs, the approximation is valid. Replacing the intra and inter RVs of (9) by the layer RVs described in the previous subsection, we will get
Xu,w (10) where X0,0 is Xinter. All summation RVs are assumed to be independent, both inside each sum and across sums. This is very helpful because, if their coefficients were constants, we would have an intra-part that is simply a linear combination of independent RVs. Finding the intra-delay PDF would then be very easy. However, their coefficients are the partial derivatives of delay evaluated at Xinter, which is a random vector. Hence, we will take a final zeroth-order approximation that evaluates the partial derivatives at nominal value [10] , and thus making them constants:
where ai, bi, ci, di and ei are constants equal to the derivatives of the delay of gate i at nominal value. Thus, the path-delay equation in (8) changes when we use this approximation. Using (12) , for a path of N gates, (8) yields where au,w, bu,w, cu,w, du,w and eu,w are constants corresponding to layer u and partition w. The first summation is the tPATH inter , while the following terms are the tPATH intra . Therefore, the intradelay of a path is simply a linear combination of RVs. If the input RVs were Gaussian, then finding the intra-PDF would be equivalent to finding its variance
Convexity Analysis
For the delay equation for the gates used, we calculated the values of
for each parameter xi. This represents the change of the derivative of delay with respect to that parameter for a one standard deviation change of the parameter. The first derivative of gate delay with respect to those RVs is in the order of tens of pico-seconds per Volts, which is significantly higher that the values of
for all parameter xi. Therefore, we can conclude that even a worst case change in the derivative, for a 3σ change in those RVs, would still be an order of magnitude less than the actual value of the derivative. Therefore, the convexity is small enough to ensure an acceptable accuracy for this approximation.
Doing the numerical computation of the inter-PDF at once would require a complexity of (QUALITYinter R ), where QUALITYinter is the discretization of the PDFs of the inter-RVs and does not have to be equal to QUALITYintra (the discretization of the PDFs of the intra-RVs), and R is the number of different parameters being varied. One should try to separate as many variables, in order to reduce the complexity of the PDF-computation. It should be noted however, that with more complex models, inter-delay equation will tend to become inseparable in almost all its parameters. In such a case, one should identify the parameters with the least inter-variabilities, and when using the Taylor-series approximation, instead of using their inter-RVs, one should use their nominal values. Another possible compromise of accuracy for faster run-time could be to reduce QUALITYinter.
In the next section, we explain our full methodology, which makes use of the two assumptions analyzed in this section, in order to find and rank some upper percentage of probabilistic critical paths.
Methodology
A high-level description of our methodology flowchart is the following. First, we set the number of RVs (R), the number of layers (L) for the spatial correlation-space, as well as the variabilities of each layer in each RV. The circuit is then mapped to a timing graph, and we evaluate all gate deterministic delays as well as derivatives with respect to all RVs that are being considered, at their nominal values. The inputs to the methodology are a description of the circuit connections, gates, inputs, outputs as well as any necessary information related to correlations, like (x, y) coordinates to compute the spatial correlations. These are one time calculations. Next, the deterministic critical path is found, using Bellman-Ford. We perform the probabilistic timing analysis of the deterministic critical path delay: We find its intra-delay PDF and its inter-PDF, using the techniques described in the previous section, and finally its total delay PDF (accounting for inter and intra variations). From its delay PDF, its standard deviation σC is extracted in order to have an idea about the variability of our circuit. Using that standard deviation and some arbitrary confidence constant C that the user inputs, we find all the next deterministic critical paths that are within C · σC of the first deterministic critical path delay. We perform the same probabilistic timing analysis to each of those paths sequentially. In the end, we can compare any confidence point, such as the 3σ's, on the critical path PDFs, in order to rank them and obtain the probabilistic critical path. Fig. 1 shows the graphical flowchart of our methodology.
Deterministic Delay Computation
The Bellman-Ford algorithm was used to find the deterministic critical path of a circuit. The weight of each edge in the algorithm is to be the nominal delay of the node before the edge, since the graph is simple, directed and acyclic. The worst-case complexity of Bellman-Ford is (|N | × |E|) , where |N | is the number of nodes in the graph and |E| is the number of edges. This can be quite big if the graph is very dense, however, the fact that our graph is simple and acyclic makes it highly unlikely to reach that worst-case.
Probabilistic Timing Analysis to Deterministic Critical Path
Probabilistic timing analysis is applied to the deterministic critical path to find its standard deviation. The probabilistic analysis is separated into intra-and inter-delay calculations. The variance of the intra-die variations are formulated using (14) , and the PDF of the intra-delay is computed (assuming it's Gaussian). The complexity of such a PDF computation is simply (QUALITYintra). For the inter-delay PDF calculation, the first term in (13) is used. Finally, the intra-and inter-PDFs are convolved to evaluate the total delay PDF. Supposing that both PDFs have a discretization of QUALITY, the complexity of the convolution is obviously (QUALITY 2 ).
We use the standard deviation (σC ) of the deterministic critical path total delay PDF as an indicator of variability in the circuit. We will choose and analyze all the paths whose delays are larger than D − C · σC , where D is the deterministic delay of the critical path and C is a constant the user specifies. The larger this constant, the more confident we get in that there is no other path in the circuit that is probabilistically longer than the one we got, but the more run-time is needed. In order to find all the next critical paths within CσC of the deterministic critical path, we used the recursive algorithm shown in Fig. 2 , where Wi is the weight of edge i and LABELn i is the delay label of node ni, equal to the maximum arrival time to ni from ns, that was calculated using Bellman-Ford. In broad terms, the algorithm starts from a root node, which is n f , and checks for fan-in nodes that have labels larger than the label of the root minus the weight of the edge between them. If there is such a node, than the same thing is repeated for that node as the root. This is repeated until ns is reached for each path. The worst-case complexity of this algorithm is (κ × |E|), where κ is the number of near-critical paths and |E| is the number of edges in the graph.
Finally, probabilistic analysis is done for all the near-critical paths, one path at a time, to find the delay PDF of each. In the end, we rank the paths based on their PDFs by some confidence point. We determine the probabilistic critical path and visualize the change in rank of paths, going from deterministic to statistical analysis. It is important to note that the means of the delay PDFs of the paths are not the same as their deterministic delays, because the inter-delay is not a linear function of RVs and therefore the expected value of the delay, is not the delay of the expected values. The intra-delay calculations for all the paths yield a complexity of
where Ω is the number of layer-RVs in the path. However, the bottleneck complexity of the methodology comes from the inter-delay PDFs, which have a worst-case complexity of (κ × QUALITYinter R ).
Results and Discussion
We implemented a program that reads the circuit-description as a Design Exchange Format (DEF) file and gets the variability information from the user, in order to generate and rank the delay-PDFs of critical paths. It applies all the steps of the methodology described in the previous section and depicted in Fig. 1 . The considered RVs were: tox, L ef f , V dd , VTn and VTp. Their PDFs were assumed to be Gaussian, truncated at their 6σ points. Typical standard-deviations were taken from [15] . The (x, y) coordinates of the gates were extracted from the DEF files in order to account for spatial correlations. We used a 4 layer model along with a fifth random layer, and we divided the total variances equally over all layers. We tested our program on the ISCAS85 benchmark circuits, using 130nm technology nominal values. The computations were performed using PDF discretizations of QUALITYintra = 100 and QUALITYinter = 50. We shall illustrate how these values were chosen as an optimal trade-off between run-time and solution accuracy. Table 2 summarizes the results.
The confidence constant C provides a mean for controlling the number of near-critical paths to consider. We started with a minimum of C = 0.05 (except for c6288) and we increased C until we converged to a critical path that is not changing. In Table 2 , we used the minimum value of C that found the correct probabilistic critical path. However, for benchmark c6288, even for a C = 0.005, the number of near-critical paths was more than a hundred thousand, which is unacceptable both in terms of run-time and memory. So we used C = 0.001, which still yielded about 900 paths. Column 7 shows the number of critical paths for each circuit. This number depends heavily on the choice of the confidence C, because all paths with a delay within CσC of the nominal critical delay are considered near-critical, where σC is the deterministic critical path's standard deviation. In addition, the type of benchmark (structure of the graph), as well as the variability and correlations of parameters, affect σC , and therefore the number of critical paths. To see this, one can observe the large difference in critical path numbers, between circuits for which we used the same C, such as benchmarks c880 and c1355: For C = 0.05, the former had 3 near-critical paths, while the latter had 1596. Fig. 3 shows how close the delay PDFs of the 1 st , 798 th and 1596 th paths are in c1355. It should be noted that Column 8 shows the mean of the probabilistic critical path PDF which is very close but not equal to the nominal critical path delay, because the delay is non-linear. Columns 9 and 5 depict the 3σ points for the critical path delay PDFs and the % overestimation of the worst-case analysis over the 3σ points of the probabilistic critical path delay. On average, an overestimation of 55% is found, which demonstrates how unacceptable and persistent the conservatism resulting from worst-case analysis can get. Fig. 4 shows the critical path's intra-and inter-delay PDFs for c432, as well as the convolution of the two, which is the total delay PDF. In addition, the 3-sigma point of the total probabilistic (statistical) PDF compared to the worst-case deterministic analysis.
In order to characterize the different effects of inter-and intravariations on the delay, we performed probabilistic timing analysis of c432, for 3 different scenarios of inter-and intra-variances, for the same total RV variabilities. Table 3 demonstrates the results. One can clearly see that the larger the inter-variability is, the larger will be the path-delay standard-deviation. In fact, large inter-variability increases the possibility that all the gates in a path are subject to worst-case variations. Even the number of near-critical paths increases because of the increase of σC . Table 3 . inter-and intra-variations Column 11 in Table 2 demonstrates the new rank of deterministic critical path after applying probabilistic timing analysis. Some paths remained unchanged (such as c432, c880, c7552), while others were nominally faster paths (notably in c1355 -what used to be the 40th slowest deterministic path is now the critical path in the probabilistic analysis). This is because of the spatial correlations' impact on c1355's topology, increasing the variability in a path, causing their 3σ delays to become very big. The larger the variances and correlations, the more deterministic and probabilistic ranks will tend to differ from one another. Fig. 5 draws the probabilistic rank of the first 100 paths of benchmark c1355 versus their deterministic ranks, when around 1600 near-critical paths were analyzed. Fig. 6 illustrates the same plot for the first 100 paths of benchmark c7552 with the same number of analyzed near-critical paths. We observe that the first plot is considerably far from the first bisector, which means that distances between paths are very small for c1355 compared to the existing amount of variability. On the other hand, in the plot for c7552, we can see that the ranks' changes are minor. This can be explained by the fact that the graph of c1355 is more bushy than that of c7552, therefore paths are much closer in terms of their delays. Thus, path delays are very vulnerable to change ranks for the c1355 case (as depicted in Fig. 5 ). This is not the case for c7552 where the delay of the paths are more distinctive. Moreover, spatial correlations can be accounted for larger delay variances in c1355, due to its DEF circuit description. In conclusion, it is the topology and placement of the circuit that usually determine changes in critical path ranks.
Finally, the last column in Table 2 whole program, from the input-file parsing, to finding the deterministic critical path, to computing their delay PDFs and ranking them. For a path-based approach such as ours, the path delay complexity is directly proportional to the number of near-critical paths to consider, hence those run-times are very strong functions of C. Moreover, the run-time also varies with the graph structure: very close deterministic delays will imply a considerable number of nearcritical paths. Lastly, run-time is a strong function of the discretization points QUALITYintra and QUALITYinter. In order to measure the trade-off between accuracy and runtime, as far as QUALITYintra and QUALITYinter are concerned, we calculated the delay PDF of c499, using a series of discretization combinations. The most accurate would be the one done with the most discretization points. As an optimal trade-off between accuracy and run-time, we chose the point (QUALITYintra = 100, QUALITYinter = 50) because it yielded an accuracy within 0.009% of the 3σ point with the highest discretization, with a runtime of 0.4 seconds. Thus, the work presented in this paper uses QUALITYintra = 100, QUALITYinter = 50.
One last observation involves the typical minimum value of C. Looking at Table 2 , we can see that the biggest C we needed, to find the absolute critical path, was 0.1. This means that, for typical circuits, the probabilistic critical path is within 10% of a typical path-delay standard-deviation from the deterministic critical path. This is a vital advantage for path-based statistical analysis, because it means that typically the number of near-critical paths to consider is acceptable.
Conclusion
We presented a framework for performing statistical timing analysis. Five random variables have been accounted for in spacial correlations. The work presented has a polynomial run time. Our findings confirmed that the worst-case analysis overestimates path delays by more than 50%. Moreover, depending on the circuit's topology and placement information, a path's probabilistic rank with respect to delay could be very different from its deterministic rank.
